TL;DR: min-tfs-client is a minimal Python client for TensorFlow Serving that allows you to use serverless services (e.g. AWS Lambda) that have deployment size limits (250 MB uncompressed, at the time of writing). It works by removing TensorFlow as a dependency, creating the tensor protobufs for the prediction request itself. Feel free to check out the repository here if you'd like to contribute.
TensorFlow Serving (TFS) is a serving system for machine learning (ML) models, primarily used for models built in TensorFlow. In this blog post we introduce a lightweight Python client for TFS that allows Python apps to make gRPC requests to a TFS instance without having to install TensorFlow.
Serving ML models in production is often the last, and also the trickiest, part of completing the development of an ML product. In this phase of development, Data Scientists, ML Engineers, and Software Engineers must all collaborate to integrate the ML stack with the broader product stack. TFS is a gRPC/HTTP server written in C++ and distributed by Google to accelerate the deployment of TensorFlow models to production environments. In our experience, it provides a robust, highly scalable, and reasonably configurable platform for serving models. We've used it successfully on the Answer Bot product, running our semantic models on both CPU- and GPU-based infrastructure.
Although TFS now supports REST, we elected to use gRPC internally because 1) REST wasn't available when we were productionizing TensorFlow models, and 2) the use of protobufs in gRPC requests makes API contract management slightly more robust. There are other benefits associated with using gRPC, but we won't re-litigate them in this blog post. That said, using gRPC also introduces complexities that are directly tied to its benefits; specifically, the API contract-defining protobufs must be compiled and circulated to every client that needs to communicate with the server.
The protobufs required to communicate with TFS from Python are contained in the tensorflow-serving-api package, distributed by the TensorFlow team at Google. These protobufs have been compiled into Python classes that can be instantiated and loaded with attributes corresponding to protobuf fields. Collectively, they define objects including the payload of a TFS request (prediction_service_pb2_grpc.py), a tensor in TensorFlow (tensor.proto), and even the shape of a tensor (tensor_shape.proto). An important property of protobufs is their support for interdependency: loading a TFS request protobuf depends on TensorFlow's tensor protobuf, which in turn depends on the tensor shape protobuf to handle the shape of the tensor being loaded.
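To make that dependency chain concrete, here is a minimal sketch of the nesting. These are plain dataclasses standing in for the real generated protobuf classes (the names are borrowed from tensor_shape.proto, tensor.proto, and the TFS request message, but the classes themselves are illustrative, not the actual generated code):

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative stand-ins for the interdependent messages described above.
# NOT the real generated protobuf classes -- just dataclasses mirroring
# the nesting: request -> tensor -> tensor shape.

@dataclass
class TensorShapeProto:  # role of tensor_shape.proto
    dim: List[int] = field(default_factory=list)

@dataclass
class TensorProto:  # role of tensor.proto; depends on TensorShapeProto
    dtype: str = "DT_FLOAT"
    tensor_shape: TensorShapeProto = field(default_factory=TensorShapeProto)
    tensor_content: bytes = b""

@dataclass
class PredictRequest:  # role of the TFS request payload; depends on TensorProto
    model_name: str = ""
    inputs: Dict[str, TensorProto] = field(default_factory=dict)

# Building a request necessarily pulls in the whole chain:
request = PredictRequest(model_name="my_model")
request.inputs["input"] = TensorProto(
    tensor_shape=TensorShapeProto(dim=[2, 2]),
    tensor_content=b"\x00" * 16,  # room for four float32 values
)
```

The point of the sketch is that you cannot construct the outermost request without also constructing the tensor and tensor-shape messages it contains, which is exactly why the tensor protobuf definitions must come from somewhere.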
These dependencies create a situation where the tensorflow Python package is a dependency of tensorflow-serving-api, since it provides the protobuf definitions for a tensor, which are needed both to make a request and to deserialize the response from TFS. TensorFlow exposes this functionality through the tf.make_tensor_proto and tf.make_ndarray functions. However, installing the tensorflow Python package just to gain access to the protobuf definitions for tensors is far from ideal; at the time of writing, TensorFlow 2.1.0 requires 544 MB for tensorflow_core, and another 10 MB for TensorBoard. The tensorflow package contains all the code required to train, monitor, evaluate, and export models, none of which is required to communicate with TFS in a production environment.
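As a sketch of what tf.make_tensor_proto actually has to produce for a float32 tensor, the core fields can be built with nothing but the standard library. The field names below follow tensor.proto, and DT_FLOAT = 1 matches the value in TensorFlow's DataType enum (types.proto); a plain dict stands in for the real generated message:

```python
import math
import struct

DT_FLOAT = 1  # float32's value in TensorFlow's DataType enum (types.proto)

def make_tensor_fields(values, shape):
    """Hand-build the three core fields a float32 TensorProto carries:
    dtype, tensor_shape, and the raw little-endian tensor_content bytes.
    A dict stands in for the generated protobuf class."""
    assert len(values) == math.prod(shape), "shape must match value count"
    return {
        "dtype": DT_FLOAT,
        "tensor_shape": {"dim": [{"size": s} for s in shape]},
        # Little-endian float32 bytes, the same layout numpy's tobytes() emits
        "tensor_content": struct.pack(f"<{len(values)}f", *values),
    }

fields = make_tensor_fields([1.0, 2.0, 3.0, 4.0], (2, 2))
```

This is the essential insight behind min-tfs-client: the serialization work is small and self-contained, so pulling in the full tensorflow package just to perform it is unnecessary.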
The size of the TensorFlow package complicates building apps that need to communicate with TFS. In particular, it greatly increases the size of a Docker container, and with serverless function services like AWS Lambda it produces a virtual environment that exceeds the maximum deployment size. This makes it impossible to deploy a Lambda function that communicates with TFS by installing tensorflow-serving-api.