In software engineering, the famous quote by Phil Karlton, extended by Martin Fowler, goes something like: “There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.” In data science, there’s one hard thing that towers over all other hard things: deployment.
Why is deploying data science models so hard?
First, the development of a model is so decoupled from its implementation that you might as well write it in another language. And I mean that literally: I have seen many companies develop in one language, Matlab say, and deploy in another, like Java. This adds so many hurdles that I could write an entire blog post about it.
The second issue is that if you were to create a pipeline to build, train and deploy a model from scratch, it’s actually quite complicated. The domain of the data and the choice of technology act like a drop of food colouring in a washing machine, leading to fragmentation, increased maintenance burden and pink shirts.
To round this off into a 3-tuple of complaints: opinions. I don’t like opinions, unless they are mine. I’m joking, of course. But each language, every framework, all orchestrators have an opinion of how you should deploy. They all have the right goal, but somewhere along the way they add one constraint too many: it has to be connected to some master service 24/7, or I have to use a special library or container. No! I have all the abstractions I need, thank you very much.
I’ve seen and used a range of proprietary and open-source “solutions” all of which told me how I should deploy. How rude! Without further ado, let me tell you how you should deploy. 🙃
Before I dig in, I want to set out my aims. I also have opinions, which you are welcome to ignore, but I think they make sense. The goal is to produce a simple build-train-deploy workflow that works in any CI/CD tool. In addition, I want the result to:
- Use a docker container, to build/serve anywhere containers can run
- Expose a REST API so that others can consume it
- Decouple the training and serving, to minimise the complexity and attack surface of the operational container
The result is a Docker-based workflow that will look familiar to any software engineer.
Training Inside a Docker Container
I do the vast majority of my data science work in Python. So the first job is to take whatever I have done in a Jupyter Notebook and create a `train.py` file. Yes, I know that there are many options to train a model directly from a Jupyter Notebook these days. But I think they add unnecessary complexity. Once I’ve finished development, I typically don’t return to the Notebook for a long time. The Python file becomes the source of truth. This way, I can treat it more like a “normal” software engineering project.
I’m not going to waste your time with massive code listings. But if you do want to see some actual code, check out one of the COVID-19 models I’ve developed for the Athena project. In summary, it looks something like this:
- Load and parse data
- Clean and prepare data
- Train model
- Save model parameters or results to a file
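As a minimal sketch of those four steps (the data, the stand-in “model”, and the `predictions.pkl` filename are all placeholders; a real `train.py` would load your actual dataset and fit a real model):

```python
# train.py — minimal sketch of the load/clean/train/save flow above.
import pickle
import statistics

def load_data():
    # Stand-in for "load and parse data"; normally you'd read a CSV, etc.
    return [1.0, 2.0, 4.0, 4.5, 5.0]

def clean(values):
    # Stand-in for "clean and prepare data": drop obviously bad rows.
    return [v for v in values if v is not None]

def train(values):
    # Stand-in for "train model": here just summary statistics.
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}

def save(model, path="predictions.pkl"):
    # Save parameters/results so the serving container can COPY them later.
    with open(path, "wb") as f:
        pickle.dump(model, f)

if __name__ == "__main__":
    save(train(clean(load_data())))
```

The only contract that matters is the last step: whatever the serving container expects to `COPY` out of the training stage must be written to a known path.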
Next you need a `Dockerfile` to run the training script. As you can imagine, that’s pretty simple, something like:
```dockerfile
FROM base AS train
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY ./app/train.py /app/train.py
RUN python3 /app/train.py
```
Here, `base` is whatever base image is appropriate for your project.
To train your model, all you need to do is run `docker build -t myimage .` and wait for it to finish; then you have a container with trained parameters/results ready to serve.
Creating a Serving Container
The next task is to build a container to serve your model. First, create a `main.py` file that is responsible for (example here):
- Load saved model parameters/results
- Instantiate REST API
- Define routes and serve model
Step 1 is the inverse of whatever you did to save your model. Steps 2 and 3 depend on what you want to use to serve your API. Recently I discovered FastAPI, which is a really clean, batteries-included REST framework. It includes OpenAPI and Swagger docs out of the box with no extra work, provides robust but simple data-specification tools, its syntax is really tight, and I don’t have to think about web serving because it comes with gunicorn. Best of all, there’s a Docker container ready to go!
You could swap this out for Flask or whatever. I won’t mind.
Next I define another container in our `Dockerfile` that I use for serving (same example again):
```dockerfile
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.7 AS base

FROM base AS train
# ... rest of the training Dockerfile

# Now for the serving container
FROM base
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY --from=train predictions.pkl .
COPY main.py .
```
This is a multi-stage Dockerfile which incorporates both our training and serving containers. The serving container copies the training artefacts from the training build and places them in the location expected by the serving application.
Building and Deploying Your Container
Building is as simple as doing a `docker build -t myimage .`. That should kick off the training, copy the results over to your serving container, and result in a built serving container without all the fluff necessary for training. You can add that to your CI/CD pipeline, GitLab’s AutoDevOps pipeline or, dare I say it, build it locally and push it manually to a container registry.
Deploying your container is entirely dependent on your tech stack. It could be as simple as a `docker run -d -p 8080:80 myimage`, and it will be right there, with your API documentation served on `http://localhost:8080/docs`. Or you could use that container in a Kubernetes Knative manifest for full-on serverless machine learning deployments.
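Once the container is up, a quick smoke test only takes a few lines. This assumes a hypothetical `/predict` route accepting JSON; swap in whatever routes your `main.py` actually defines:

```python
# Smoke-test a locally running serving container.
# The /predict route and its payload are hypothetical; substitute your own.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/predict",
    data=json.dumps({"value": 3.3}).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.load(resp))
except OSError as err:
    # No container running on this machine — nothing to query.
    print("container not reachable:", err)
```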
Either way, I hope you agree that this massively simplifies deployment. You have no excuse not to publish your models!
Extensions and/or Future Ideas
My goal was simplicity, so I neglected to mention a few things you need to watch out for:
- If you have high computational needs for training, for example if you’re training a deep learning model, then you’ll need to think about where you are doing the training: a free CI/CD service is either a) not going to be powerful enough, or b) going to burn through your free run-time in no time at all.
- Testing is particularly important when automating machine learning builds. I skipped any mention of it to simplify the build process, but bear in mind that it’s necessary. If you’re building a model that’s going to be used by a single person once a month, then it’s probably easier to have a simple on-call model.
- Creating a resilient deployment is not much harder than what I have done here, because the container is immutable. Feel free to scale the container across regions/availability zones to match your SLAs.
- Adding monitoring/tracing/alerting isn’t hard, but will require work depending on your stack. I’d recommend building a new base container that contains some automatic middleware to add all of these things, so everyone else gets it for free. Or maybe you can build a library that adds it automatically!
- Take care with your REST APIs. Read up on building great APIs so that you don’t get swamped under a nest of your own URLs.