Industrial machine learning consulting projects come in a variety of forms.
Sometimes clients ask for exploratory data analysis, to evaluate whether their data can be used to help solve a problem using artificial intelligence. Other times we use machine learning (ML) algorithms to automate decisions and improve efficiencies within a business or product. More recently we’ve refocused on reinforcement learning and customers ask us to help control some complex multi-step process.
In each case we start in a data science notebook.
Notebooks Are Not For Production
Notebooks are great for documenting research, and for communicating and preserving results. But notebooks contain exactly what the name suggests: notes. They are entirely inappropriate for production use.
The reason for this lies with the lack of engineering rigour. The code does not scale. It is untested. It has a multitude of edge cases. Results are unpredictable and unreproducible.
Notebooks are designed to produce results once only, to generate an image at the end to demonstrate to the client that results have been achieved.
To scale, harden, catalogue, and to make pipelines reproducible, you need an automated machine learning pipeline.
The Value of a Machine Learning Pipeline
The result of such a pipeline is a build artifact that is consumed by other people or systems. I stress the word build because, like any building, it lasts a long time and is a valuable asset.
Imagine that you want to update your building by adding an extension.
If the building was in a state of disrepair, if you could not enter it, if it was dangerous, then no sane building inspector would sign off on completion of the project, irrespective of how good the extension is.
It’s the same when you use a training pipeline for model training. If you can’t demonstrate that the artifact is stable and the pipeline repeatable, reliable, and scalable before you continue development, then how can you sign off your model after completion? If you don’t have confidence in your baseline, then you can’t measure the improvement.
The value of a pipeline in your ML workflow is that it represents a verifiable, quantifiable, repeatable foundation for your application and ultimately, your business.
ML Pipelines Are DAGs
It turns out that most ML pipeline code looks very much like a directed, acyclic graph (DAG). In other words, the pipeline is constructed from a series of steps which feed into subsequent steps for further computation, and so on. Software engineers have a lot of experience executing DAGs, and they tend to call them pipelines.
ML pipelines allow you to structure the steps required to train your model as a DAG.
This is useful because many steps within the DAG are repeated over and over again. Complex pipelines can therefore leverage caching, to avoid re-running steps that would produce the same output.
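As a minimal sketch of how step caching can work (the function and cache names here are invented for illustration, not taken from any particular framework), a pipeline engine can key each step's result on a hash of the step's inputs and skip re-execution on a match:

```python
import hashlib
import json

# In-memory cache keyed by (step name, hash of inputs). Real pipeline
# engines persist this to object storage, but the principle is the same.
_cache = {}

def cached_step(name, fn, **inputs):
    """Run `fn` only if this (name, inputs) pair has not been seen before."""
    digest = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    key = (name, digest)
    if key not in _cache:
        _cache[key] = fn(**inputs)
    return _cache[key]

calls = []

def normalise(values, scale):
    calls.append("normalise")  # track how many times the step really runs
    return [v / scale for v in values]

a = cached_step("normalise", normalise, values=[2, 4], scale=2.0)
b = cached_step("normalise", normalise, values=[2, 4], scale=2.0)  # cache hit
```

The second call returns the cached result without re-running the step, which is exactly the saving a complex pipeline gets when an upstream step's inputs haven't changed.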
It also makes it much simpler to “fan-out” computationally expensive parts of the training pipeline, to reduce the amount of time it takes to train a machine learning model by performing tasks in parallel. If tasks don’t depend on each other, then they can run in parallel. This happens a lot in ML, like when performing a hyperparameter search, which is the process of testing different settings that affect the training of an ML model.
With an ML pipeline (or DAG) defined, then all you need to do is tell the orchestration layer what it is you need each task to do and what data is shared with each task.
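To make the fan-out idea concrete, here is a toy sketch using only the Python standard library (the task functions and hyperparameter values are made up): one shared data-loading step feeds several independent training trials, which run in parallel before a fan-in step picks a winner.

```python
from concurrent.futures import ThreadPoolExecutor

# Three hypothetical pipeline tasks. In a real pipeline each would be a
# container or component; here they are plain functions.
def load():
    return list(range(10))

def train(data, lr):
    # Stand-in for a training run; returns a (tag, lr, score) tuple.
    return ("model", lr, sum(data))

def evaluate(models):
    # Fan-in: pick the trial with the smallest learning rate as "best".
    return min(models, key=lambda m: m[1])

data = load()  # single upstream step, shared by all trials

# Fan-out: the trials don't depend on each other, so run them in parallel.
with ThreadPoolExecutor() as pool:
    models = list(pool.map(lambda lr: train(data, lr), [0.1, 0.01, 0.001]))

best = evaluate(models)
```

The orchestration layer's job is precisely this bookkeeping at scale: know that `train` depends only on `load`, schedule the independent trials concurrently, and hand their outputs to `evaluate`.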
Kubeflow Pipelines Suck - But Not In The Way You Think
I first wrote about Kubeflow two and a half years ago, when Google announced it. Since then it has become immensely popular and is a de facto standard in the MLOps community.
Kubeflow Pipelines (KFP) is a library that leverages the capabilities of Kubernetes for orchestration, custom blocks of Python for the tasks, and an opinionated library for the DAG and data-sharing framework.
KFP is very popular. We’ve used it for production training workloads at a variety of clients and from what I can gather from the community chatter I’m involved with, this is a microcosm of the rest of the world.
The reason for its popularity is that there aren’t many projects that are entirely open source, tightly integrated with Kubernetes, and specifically targeted at ML.
Sure, there are hundreds of continuous integration/continuous delivery focussed vendors and tools, and many more generic task-focussed DAG libraries. But none of them quite cater for the unique challenges that ML presents like KFP does. It has a tight focus on ML requirements, for example: data sharing between pipeline steps, caching, and the ability to scale common tasks like data preprocessing, feature engineering, training deep learning models, and hyperparameter tuning.
But there’s a huge issue with the way that KFP is implemented; a problem common to many other projects. The specification and control of the DAG is done via a Python library.
The often-cited reason for doing this is that a Python API makes it easier for people, data scientists in this case, to interact with the project. But this immediately leads to some obvious issues:
- you need to rewrite your pipelines and/or notebooks mostly from scratch
- you need to learn and understand the KFP API
- you need to understand the concepts that are being exposed by the library
“That’s the same for every library”, I hear you say. But wait, there’s more.
A DAG Is Meta - Treat It That Way
The main problem is that a DAG is a meta-level concept.
It describes a structure of one possible combination of steps required to solve a problem. But KFP directly couples this to the code that performs the task – yes, even if you use custom containers. You need to be not only a domain expert in the task at hand, i.e. someone with a PhD in data science and experience in the domain, but also an expert in Kubeflow.
And this cascades.
If you want to be an expert in Kubeflow, then you need to know Kubernetes, because the library is so un-abstracted. Containers, sidecars, labels, tolerations, volumes, affinity, … Nearly all fundamental Kubernetes primitives have been added to the Kubeflow pipeline API, because you need that control to orchestrate your containers effectively.
This results in a bunch of new Python code that is masquerading as a Kubernetes manifest.
The unhelpful truth is that to use Kubeflow you not only need a PhD in data science (or whatever), but you also need to be an expert in Kubernetes, and therefore cloud-native, and therefore be a modern full-stack software engineer.
Our experience suggests that there are very few PhD data scientists that are also experts in cloud-native. (Contact us at Winder.AI if you want to talk to some of them! :-p)
Not Fit for Teams
Then there’s Kubeflow as a project. I love it, honestly I do!
It’s a really great example of a (primarily) user-driven, (primarily) open project – obviously there are commercial interests, but these are largely community driven. And of course, it is useful.
But the fact that Kubeflow is not under the control of a single team, because so many companies and people contribute, is also its downfall.
For quite a while now, it has tried to do too much. For example, it tried to implement a multi-tenant capability, but did it in a really wonky way that only suits a small subset of users. It expects to own and control namespaces and assumes that people are working as individuals, not teams. And it needs Istio to do anything useful, which again is awesome but is tricky to use and operate.
In the end, for some of our clients, we’ve ended up stripping out all the namespace-related nonsense and provisioning whole, independent Kubeflow installations in each team’s namespace.
Not Quite Finished - The Dreaded Kubeflow Install
And then there’s the dreaded installation process. My jaw tenses and I show some teeth whenever someone says “just install Kubeflow”.
The project has gone through what seems like every installation process in the book. Raw manifests, this binary called kfctl, Kustomize, now via Argo, Helm… And it never quite works, mainly due to the user’s authentication setup and Istio. I did a quick GitHub issue query. A whopping 26% of all issues are about installation.
If I assume that these were individual people (they probably aren’t; lots of people likely hit the same issues), then roughly 1 in 4 users were unable to actually install anything.
This was one of the key reasons behind our Combinator.ML project. We wanted to make it easier for people to spin up and try out tools like this.
My Machine Learning Pipeline Requirements
There are more issues that I could dig into, but I think I’ve done enough Kubeflow-bashing for one day!
I mean, we still use it on many projects, so it’s not all that bad.
But I do feel like there is significant room for improvement, here are my high-level suggestions:
- Concentrate on one thing and do it well. Best-of-breed is the solution here. KFP is useful; it should be separated from Kubeflow, like KServe (formerly KFServing, which was closely affiliated with Kubeflow) now is.
- Related: Ditch the UI. Remove all the namespace nonsense. Remove auth. Get rid of Istio. Etc. Let other projects or companies bring all of these tools together into a pretty UX. They’re far more inclined to make it good. It’s a drag on development because everyone expects the UI to work with every new feature.
- Data scientists shouldn’t have to know a thing about Kubernetes to create and manage machine learning pipelines; the specification of the DAG should be kept separate.
- Controversial: the DAG shouldn’t (primarily) be encoded in Python. It’s unrelated to the ML code and I think the data scientists will thank you for hiding all DAG related code. I’d accept a Python library for power-user or programmatic access. But the key is to let the data scientists concentrate on the data science.
- Improve the development/debugging UX. It’s really hard to develop and debug KFP pipelines when they go wrong.
- How do you share code? How do you define dependencies in a GitOps friendly way?
- How do you convert the typical (not recommended!) `train.py --lots --of --options` into a pipeline that data scientists want to use?
- As a data scientist, I really don’t want to have to worry about CUDA/library incompatibilities, or what random set of letters I need to request in my quota to get a GPU (looking at you, Azure!)
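On the `train.py --lots --of --options` point, one hedged sketch of a friendlier shape (the script name, flags, and `train_step` entry point are all invented for illustration): keep the argparse parser as the single source of truth, but expose a function that an orchestrator can call directly with a list of arguments instead of shelling out.

```python
import argparse

def build_parser():
    # The same parser backs both the CLI and the pipeline step, so the
    # flags stay the single source of truth. All options here are invented.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.01)
    return parser

def train(epochs, lr):
    # Stand-in for the real training loop.
    return {"epochs": epochs, "lr": lr, "loss": 1.0 / epochs}

def train_step(argv=None):
    """Pipeline-friendly entry point: takes argv as a list rather than
    reading sys.argv, so an orchestrator can invoke it in-process."""
    args = build_parser().parse_args(argv)
    return train(args.epochs, args.lr)

# Called exactly as a pipeline task would call it, no subprocess required.
result = train_step(["--epochs", "20", "--lr", "0.001"])
```

Running `train_step` with no arguments still behaves like the original CLI, so nothing is lost for interactive use; the gain is that a pipeline can pass parameters and consume the return value as structured data.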
We’ve been working on proof-of-concepts with Microsoft to try and solve some of these issues.
But we’re waiting for someone like you to develop the answer. When you do, let me know!
The Future of Machine Learning Pipelines
So what does the future look like?
I’d hope that when I want to develop a production-quality machine learning pipeline that the technology is open and cloud-native. I hope that I don’t have to be a Kubernetes expert to run experiments and build production artifacts. I want to concentrate on the task at hand, not on the orchestration. I want a simple way to access, process, and snapshot my data.
I want to do more data science.