How To Build a Robust ML Workflow With Pachyderm and Seldon
by Enrico Rotundo, Associate Data Scientist
This article outlines the technical design behind the Pachyderm-Seldon Deploy integration available on GitHub and highlights the salient features of the demo. For an in-depth overview, watch the accompanying video on YouTube.
Introduction
Pachyderm and Seldon run on top of Kubernetes, a scalable orchestration system. In this article I explain their installation process, then walk through an example use case to illustrate how to operate a release, rollback, fix, and re-release cycle in a live ML deployment. Throughout the demo, I showcase how Pachyderm’s data lineage and automation come in handy in a critical scenario.
Let’s start with a bird’s eye view of the main components.
What is Pachyderm?
Pachyderm is a data layer for your data science applications with built-in version control and lineage. In other words, it brings data version control and pipelines together so that you can orchestrate and track sophisticated ML workflows. By combining pipelines, you can build a Directed Acyclic Graph (DAG) that’s automatically versioned, so you can trace back any run to its origin. Push new data in, pull a freshly trained model out. The data versioning is automatic and there’s a dashboard too.
You can learn more on the Pachyderm website.
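To make this concrete, here’s a minimal sketch of pushing a dataset into a versioned Pachyderm repository. It assumes the python_pachyderm client (method names can vary slightly between client versions); the repo name and file contents are illustrative.

```python
import python_pachyderm

client = python_pachyderm.Client()  # defaults to localhost:30650

client.create_repo("income_data")

# Every write happens inside a commit; Pachyderm hashes the contents
# and moves the branch HEAD, much like a git commit.
with client.commit("income_data", "master") as commit:
    client.put_file_bytes(
        commit, "/census.csv", b"age,education,income\n39,Bachelors,<=50K\n"
    )

# The dataset's full version history stays queryable at any time.
for info in client.list_commit("income_data"):
    print(info.commit.id)
```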
What is Seldon Deploy?
Seldon Deploy is the enterprise offering built on the open-source Seldon Core machine learning engine; it’s designed to deploy and monitor ML graphs at scale via a dashboard or API. A Python SDK is also available, and I show it in action in the demo notebook.
What do I mean by an ML graph? Well, in Seldon you can combine a multitude of models in a single dedicated resource. Think of a canary or A/B testing rollout: in Seldon, it’s a trivial setup. On top of that, you can deploy almost any machine learning model because Seldon supports multiple ML frameworks, and thanks to the Seldon Alibi module you can continuously audit how your deployment is behaving. See the Seldon documentation for more information.
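Canary rollouts are a good example of how little setup this takes. Here’s a sketch of a SeldonDeployment with a 90/10 traffic split, built as a plain Python dict and applied with the standard Kubernetes client; the names and model URIs are illustrative.

```python
from kubernetes import client, config

config.load_kube_config()

# Two predictors sharing traffic 90/10. Replacing "traffic" with
# "shadow": True on the second predictor would mirror traffic to it
# instead (a shadow rollout).
canary = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "income-classifier", "namespace": "seldon"},
    "spec": {
        "predictors": [
            {
                "name": "main",
                "traffic": 90,
                "graph": {
                    "name": "model",
                    "implementation": "SKLEARN_SERVER",
                    "modelUri": "s3://models/income/v1",  # illustrative URI
                },
            },
            {
                "name": "canary",
                "traffic": 10,
                "graph": {
                    "name": "model",
                    "implementation": "SKLEARN_SERVER",
                    "modelUri": "s3://models/income/v2",
                },
            },
        ]
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="machinelearning.seldon.io",
    version="v1",
    namespace="seldon",
    plural="seldondeployments",
    body=canary,
)
```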
Cluster setup
To get started you need to set up a Kubernetes cluster with some prerequisites. I provide step-by-step instructions for Google Kubernetes Engine (GKE) and Minikube in the repository; the latter requires a high-spec laptop to run smoothly, with a 6-core, 32GB RAM MacBook Pro being the bare minimum. I recommend the cloud installation, so in the remainder of this article I’ll refer to that option. If you’re testing this on a private GKE instance, you’ll also have to add a custom firewall rule, which is covered in the installation instructions.
Seldon Deploy
Seldon Deploy relies on a set of components that you’ll have to install. Here’s what they are and why they’re needed in this setup.
- Istio routes ingress traffic to model deployments and allows for flexible rollout strategies such as canary, shadow, A/B testing, etc.
- Knative Serving is a requirement for KFServing, which allows for deploying a multitude of ML frameworks (TensorFlow, Scikit-learn, XGBoost, MLflow, etc.).
- Knative Eventing logs prediction requests as well as post-prediction detections such as outliers, drift, metrics, etc.
- Elasticsearch, Fluentd, and Kibana aggregate logs from model deployments and make them searchable. This dependency is required for the UI to display inbound requests.
- Seldon Core Analytics collects monitoring metrics; it’s based on Grafana and Prometheus.
- Seldon Core is a machine-learning engine that provides a fully fledged Kubernetes resource named SeldonDeployment, capable of defining flexible ML graphs.
Pachyderm
For GKE I use Pachyderm’s built-in deployment on Google Cloud. It uses cloud-native storage so all you need to provide is a dedicated storage bucket that Pachyderm will use to persist your pipelines’ data. Alternatively, Pachyderm also supports Minio (or S3!) for object storage, although the setup is slightly more complex.
For Minikube, I’d suggest a simple local deployment to save resources. This option uses local storage on disk and is not meant for multi-node clusters.
Note that in this setup I create two cluster role bindings. Since GKE uses RBAC by default, I grant cluster-admin privileges to my user account so that the Pachyderm deployment can launch successfully; check with your cluster admin before doing that in a production environment. Furthermore, I grant Pachyderm workers permission to edit objects so that they can manage the Secrets that Seldon uses to fetch trained ML models via a sidecar S3 gateway.
Use Case
This section explores the case of CreditCo, a hypothetical credit company building an ML-powered service to predict people’s income, but of course it applies to other contexts as well.
In real-life deployments you will train and release your model multiple times. Perhaps circumstances force you to temporarily roll back while your team is baking a new model. That’s perfectly fine, as long as you have the right setup in place.
What I’ll show in the remainder of this article is the degree of automation and data lineage I achieve with Pachyderm & Seldon. For a complete walkthrough you may want to check out the related notebook and video demo.
In this example, the company’s task is to predict whether an individual’s income is above or below 50K a year based on demographic data such as education level, employment type, age, and so on.
At first, the company released an income prediction model trained on the well-known Census Income dataset. As CreditCo’s business grows, the prediction model becomes obsolete and soon a newer version is needed.
To quickly roll out a new model, the company acquires a dataset from a vendor, but this dataset turns out to be influenced by controversial features. To avoid a scandal, CreditCo quickly rolls back to the very first model.
When new data is collected and a third model is trained, the company wants to be cautious about new rollouts, so instead of going live right away they opt for a shadow deployment strategy.
In the next section I’ll show how Pachyderm integrates with Seldon Deploy to create an end-to-end ML deployment for this scenario.
Solution
This use case has a data entrypoint to which the company pushes its datasets. From there, a training pipeline iteratively runs a set of containers to train ML models on the given dataset. Trained models are collected in an accessible location, versioned, and passed along to a deployment stage. This results in the following pipeline phases:
- Dataset repository
- Training pipeline
- Model repository
- Deployment pipeline
- ML deployment and monitoring platform
You can find the details of this implementation in the demo repository. To keep this short, I’ll just illustrate the resulting Pachyderm DAG, shown in the figure below (from the Pachyderm dashboard).
Data Repository
The topmost repository (in light blue) is the location for my CSV datasets. Pachyderm automatically version-controls my datasets, similarly to what git does for code repos: any time I push a new CSV file in, it generates a commit hash and points the branch HEAD to it.
I built this DAG such that Pachyderm triggers the training pipeline downstream whenever a new dataset lands on the master branch of this repository. Note that instead of pushing data to master directly, I use side branches to defer processing (see Deferred Processing of Data in the Pachyderm documentation) until I want it to be triggered. This is good practice to avoid accidental training runs that may be computationally expensive.
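In practice, deferring and then triggering a run looks something like the sketch below, again assuming python_pachyderm (exact branch-API signatures differ between client versions; pachctl offers the same operations).

```python
import python_pachyderm

client = python_pachyderm.Client()

# Push new data to a side branch; the training pipeline subscribes to
# master, so nothing runs yet.
with client.commit("income_data", "staging") as commit:
    with open("census_v2.csv", "rb") as f:
        client.put_file_bytes(commit, "/census_v2.csv", f.read())

# When I'm ready to trigger training, I point master's HEAD at the
# staging head (the pachctl equivalent is `create branch --head`).
head = client.inspect_branch("income_data", "staging").head
client.create_branch("income_data", "master", head)
```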
Training Pipelines
The training pipelines downstream (green chevrons) listen for changes on “income_data” and run as soon as new data is pushed. The result is a set of artifacts such as a Scikit-learn income model, a Seldon Alibi explainer, and other monitoring models. These monitoring models are powerful algorithms for inspecting and detecting issues with live ML deployments; learn more about Seldon Alibi here. The pipelines run custom user code shipped in Docker containers, where I use Python to fit each model’s parameters to the given dataset.
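As a rough idea of what runs inside such a container, here’s a minimal training sketch. It assumes a CSV dataset with an “income” label column and numerically encoded features; the real demo trains several artifacts, not just a classifier.

```python
import glob

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Pachyderm mounts the input repo at /pfs/income_data and collects
# anything written to /pfs/out into the pipeline's output repo.
frames = [pd.read_csv(path) for path in glob.glob("/pfs/income_data/*.csv")]
data = pd.concat(frames)

X = data.drop(columns=["income"])  # features, assumed already encoded
y = data["income"]                 # binary label: above/below 50K

model = RandomForestClassifier(n_estimators=100).fit(X, y)
joblib.dump(model, "/pfs/out/model.joblib")
```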
Deployment Pipelines
While the training phase is fully automated, the deployment/release step should retain some degree of human supervision over which model version gets deployed; after all, you want a human in the loop for customer-facing machine learning models.
The copy_models pipeline is a tiny layer whose purpose is to collect all models and make them available in a single location. This is done for convenience: I can now view all available artifacts by accessing a single repository rather than querying every single source. To minimize the overhead of this step, the pipeline uses the input.pfs.empty_files spec, which exposes input files as empty but still allows symlinks; this way I efficiently copy the model files over to a single location. Read more about empty_files.
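The user code for this step can be as small as the following sketch; the source repo names are illustrative.

```python
import os

# With "empty_files": true in the pfs input spec, input files appear
# with no content, so a symlink is enough to gather each model into the
# output repo without moving any bytes.
SOURCES = ["/pfs/income_model", "/pfs/income_explainer"]

for source in SOURCES:
    for name in os.listdir(source):
        os.symlink(os.path.join(source, name), os.path.join("/pfs/out", name))
```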
For deploying, I create one pipeline each for the production and staging environments. Again, I make use of Deferred Processing of Data to manually control when to deploy to which environment, so these pipelines are set up to run when a commit is created on an ad-hoc branch of copy_models.
I’d like to point out that this step uses a Pachyderm Service, a specific kind of pipeline meant to expose data to the external world rather than transform it. This service passes models on to Seldon via a sidecar S3 gateway and runs a Python script to call the Seldon Deploy REST API. This ensures compatibility with any S3-compatible ML platform, as well as data provenance: any change to the upstream data can be traced back to a given job/input commit. Speaking of provenance, I pass the copy_models commit hash to the deployment script so that it injects it into the model container. This way, I can inspect at any time which model version is deployed and cross-check it against the model repo history.
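Here’s a sketch of what that deployment script might look like. It assumes the seldon_deploy_sdk Python client (class and method names follow the SDK’s published examples and may differ by release); the environment variable name and model URI are hypothetical.

```python
import os

from seldon_deploy_sdk import ApiClient, Configuration, SeldonDeploymentsApi

# The copy_models commit hash, passed in by the pipeline (hypothetical
# variable name). Baking it into the deployment metadata lets me
# cross-check the live model against the model repo history.
commit = os.environ["MODELS_COMMIT"]

config = Configuration()
config.host = "https://seldon-deploy.example.com/seldon-deploy/api/v1alpha1"
api = SeldonDeploymentsApi(ApiClient(config))

deployment = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {
        "name": "income-classifier",
        "namespace": "production",
        "labels": {"pachyderm-commit": commit},  # provenance marker
    },
    "spec": {
        "predictors": [{
            "name": "default",
            "graph": {
                "name": "model",
                "implementation": "SKLEARN_SERVER",
                # Served by the sidecar S3 gateway; URI format is illustrative.
                "modelUri": f"s3://master.copy_models/{commit}",
            },
        }]
    },
}

api.create_seldon_deployment("production", deployment)
```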
Seldon Deployment
Seldon then creates a SeldonDeployment, a Kubernetes resource capable of representing a complex ML graph composed of multiple predictors and their related monitoring models. From now on, you can query the ML model via the Seldon Deploy endpoint and oversee the live deployment through the UI. The screenshot below shows an outlier detector in action, employed to flag out-of-the-ordinary incoming requests.
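For reference, querying the live model goes through Seldon Core’s standard REST protocol. Here’s a minimal sketch; the ingress host and feature values are illustrative.

```python
import requests

payload = {"data": {"ndarray": [[39, 7, 1, 1, 1, 1, 4, 1, 2174, 0, 40, 9]]}}
response = requests.post(
    "http://<ingress-host>/seldon/production/income-classifier/api/v1.0/predictions",
    json=payload,
)
print(response.json())  # class probabilities, e.g. {"data": {"ndarray": [[0.89, 0.11]]}}
```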
Summary
In this article, I described how Pachyderm and Seldon Deploy can be integrated to implement an end-to-end ML pipeline. After detailing the cluster setup and its dependencies, I gave an example use case of a credit company using ML in production, and explained how Pachyderm and Seldon come in handy to manage complex live situations.
To dig deeper into this demo check out the demo notebook available on GitHub or watch the accompanying video on YouTube.
Of course, Winder.AI is ready to help you improve your ML workflows through a combination of MLOps consulting and ML expertise.
Credits
This project was funded by Pachyderm.