GitOps for Machine Learning Projects

by Dr. Phil Winder, CEO

Not so long ago, developers used clunky consoles to provision infrastructure and applications. It wasn’t long before someone realized it was better to automate such a process via scripts and APIs. But it took HashiCorp to show that APIs were not enough. Their insight was to declare a canonical representation of the infrastructure. You can then reconcile this declaration against the live view of the infrastructure.

In 2015-16 we helped WeaveWorks develop their cloud monitoring platform. We targeted Kubernetes but lamented the hackiness of our script-based deployment pipeline. The common pattern at that time was to run a script that ran kubectl apply over and over again.

Someone realized that this pattern, applying manifests using a centralized source of truth, was the same as Terraform’s. This wasn’t infrastructure. It wasn’t even specific to the application. But it tended to live in version control and was deployed into production. Hence GitOps.

We then started developing Flux, which grew out of a need for reconciliation of Kubernetes manifests that resided in Git. But nowadays there are other equally capable continuous deployment tools like ArgoCD.

What Is the Definition of GitOps?

GitOps is many things, but I reduce it down to two key tenets.

1. Declare the Canonical State of Your Entire Business

By declare, I mean write down and expose at a suitable level of visibility.

Yes, that could be Kubernetes manifests to define your web application, which you would expose to your company’s developers. But it could also be markdown files representing your hiring policies, which you would expose to the public.

Canonical state means the most recent agreed-upon goal state for whatever it is we’re talking about.

Again, that could be the most recently built binary or the current company strategy.

The execution of your deployment pipelines, or indeed your business strategy, should drive the system towards the declared canonical state.
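To make "drive the system towards the declared canonical state" concrete, here is a minimal sketch of a reconciliation step in Python. It is not how Flux or any other tool is implemented. The state model (a mapping from resource name to version) and the function names plan and reconcile are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical state model: resource name -> version string.
State = Dict[str, str]


def plan(declared: State, live: State) -> List[Tuple[str, str]]:
    """Return the (action, resource) steps that move live towards declared."""
    steps = []
    for name, version in sorted(declared.items()):
        if name not in live:
            steps.append(("create", name))
        elif live[name] != version:
            steps.append(("update", name))
    for name in sorted(live):
        if name not in declared:
            steps.append(("delete", name))
    return steps


def reconcile(declared: State, live: State) -> State:
    """Apply the plan to a copy of the live state and return the result."""
    result = dict(live)
    for action, name in plan(declared, live):
        if action == "delete":
            del result[name]
        else:
            result[name] = declared[name]
    return result


declared = {"web": "v2", "worker": "v1"}  # what is written down
live = {"web": "v1", "cron": "v1"}        # what is actually running
steps = plan(declared, live)
converged = reconcile(declared, live)
```

Running plan again on the converged state returns an empty list. The declaration is the goal, and the automation simply repeats until there is nothing left to do.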

2. Version Control

Version control is a bonus, but it is very powerful. Imagine time-travelling to find and recreate a previous business strategy or application.

But this is more a feature of lineage and provenance and I want to concentrate on declaration.

What Does GitOps Look Like in a Modern Cloud-Native Application?

Any application that has a deployment phase depends on a complex chain of events. To see why GitOps is beneficial, it helps if you make these dependencies explicit.

The image below presents a simplified dependency graph of a modern application. Boxes represent a stage in the deployment of an application. Boxes with a gear represent a part of the process that you should automate, like a build process. Boxes with a floppy disk represent a phase that generates and stores artifacts. Boxes in pink represent parts of the process that need declaring.

A dependency graph depicting GitOps in a modern web application

You can see that there are three main types of declaration in this process. The application code declares business logic. The environment declares the context and runtime in which your application will operate. And the deployment manifests declare how and when your application should operate.

This example represents a modern web application. But you can perform a dependency analysis of any other part of your business.

What Are the Benefits of GitOps?

GitOps has a range of benefits for both the business and the engineer. GitOps increases productivity, improves the developer experience, enhances stability, and promotes standardization.

Increased Productivity

If your application (or business) is both declared and automated, it is easy for anyone in the business to make small changes. Tests, executed via automated processes, ensure that these changes do not impair the application. Automated deployments deploy changes into production (or wherever).

This makes it easy and fast to deploy improvements, fixes, or deprecations with minimal risk. This alone improves productivity because it removes impediments.

Better Developer Experience

You pay developers to develop. And they don’t like to context switch.

A modern cloud-native stack is so tall that switching from the top to the bottom is way more than a context switch. It’s like free-falling in a lift and trying to step out at the right microsecond to escape through the 13th-floor opening. Machine learning generally stacks on top of this, which makes the problem worse.

GitOps helps to assure developers in one part of the stack that other parts are robust, because they can see the intended state of the application. Automations test and deploy changes, which minimizes their impact on other parts of the stack.

Developers are free to develop on their slice of the stack without the burden of context switching.

Improved Stability and Reliability

Builders work to an architect’s plan. Architects adhere to the rules provided by civil engineers. Civil engineers rely on the tolerances of steel manufacturers. Like in construction, GitOps allows developers to concentrate on challenges that they are experts in.

Application engineers don’t need to focus on deployments. DevOps engineers don’t need to focus on security. And so on. This leads to improvements across the board and results in secure, stable, reliable applications.

The observability that code brings also aids reliability. Code reviews and automated tests can catch bugs before deployment.

Enhances Consistency and Standardization

Making content visible promotes openness, feedback, and best practices. There’s nothing more visible than code in a repository. Open-source software is a testament to this ideal.

One underrated feature of open-source software is a developer’s ability to self-debug by looking at the code. I have had to do this myself on countless occasions. You can apply this principle to your organization by providing unfettered access to your codebase. Of course, you should consider whether you want to restrict access. But you should start from public access and work back, not the other way around.

Developers would much rather find and fix their own bug than file {insert ticketing system here} tickets and wait 2-5 days.

Scheduling is the ultimate blocker; make it unnecessary by opening your code bases.

Another common side effect of GitOps is that it promotes patterns and sharing. If developers can see others’ code, they can copy and paste it when they need to do something similar. The copy-and-pastes get noticed and turn into new shared projects that benefit the whole company.

Sometimes they turn into products. I’ve recently adopted Dendron. This began life at AWS when developers recognized the need for a shared knowledge library.

How to Use GitOps for Machine Learning Applications

But how does GitOps differ when it comes to using it in artificial intelligence projects?

First, I want to step away from the term artificial intelligence, even though I (reluctantly) use it. By artificial intelligence, I mean any project that attempts to automate decisions or strategies. I do not mean machine learning specifically, although I will use machine learning as an example. I mean anything, like reinforcement learning, symbolic approaches, data analysis, analytics, data science, anything.

No matter what application you are developing, GitOps can help. Only the declarations change.

Bear in mind that your declarations may change depending on the type of project. Check out my tips at the end of this article to learn how to figure out what you need to declare.

GitOps for Machine Learning

For a machine learning project, I follow the same workflow as in the software engineering GitOps example to decompose the declarative elements via a dependency analysis.

The figure below shows a dependency graph for an example machine learning application. As before, boxes with gears are automations, floppy disks store artifacts, and the boxes in pink are the things you need to declare.

A dependency graph depicting GitOps in a machine learning application

As you can see, the code, manifests, and environment are the same. But in a machine learning project, you also need to declare the data and the hyperparameters. Another difference is that the manifests now depend on both the code and model artifact.

Model Artifact Deployment Patterns

In fact, there are two possible patterns here. In this example I have separated the code artifacts from the model artifacts, a pattern popularized by the folks at Seldon and KServe.

I’m not a huge fan of this, because it does not define a single well-known artifact. It is hard to know which combination of code and model versions was running at any one time, because the only link is in the manifests. Yes, it is possible to look back at the code. But that doesn’t guarantee that the code-model combination was actually running. Someone could have manually applied a new version of the model. Or the manifests might not have been successfully applied.

Instead, I prefer to bake the model into the code artifact, which alters the dependency graph. In this configuration, you can guarantee provenance. The code artifact holds the same code-model combination no matter where or when you pull it. You can see this structure in the image below.

A dependency graph depicting GitOps in a machine learning application with the model artifacts baked in

As you can see, this complicates the low-level code artifact, because two independent processes depend on it. This means it is possible to accidentally build code that isn’t exactly the same as the code used in the training run.

To prevent this from happening, I recommend adding another code artifact build step. First build an artifact to use for experimentation, like a base container. Then build another artifact that adds a layer on top containing the trained model. This way you can guarantee that the experimentation code and the code in the final container are the same.
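One way to picture the two-step build is through the identifiers you give each artifact. The sketch below is an assumption about how you might name them, not a prescribed scheme. The base artifact is tagged from the code commit alone, and the final artifact extends that tag with a digest of the trained model. The function names and tag format are hypothetical.

```python
import hashlib


def short_digest(data: bytes) -> str:
    """First 12 hex characters of the SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()[:12]


def base_artifact_tag(code_commit: str) -> str:
    """Tag for the experimentation artifact, derived from the code alone."""
    return f"code-{code_commit[:8]}"


def served_artifact_tag(code_commit: str, model_bytes: bytes) -> str:
    """Tag for the final artifact: the base tag plus the model digest."""
    return f"{base_artifact_tag(code_commit)}-model-{short_digest(model_bytes)}"


commit = "0123456789abcdef0123456789abcdef01234567"  # hypothetical Git commit
experiment_tag = base_artifact_tag(commit)
release_tag = served_artifact_tag(commit, b"trained-model-weights")
```

Because the final tag starts with the base tag, anyone reading it can tell which experimentation artifact the served container was layered on, and a different model always yields a different tag.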

Common Machine Learning Code

There is one final point of interest in the dependency graph for the machine learning example. Common code often exists in both the training and serving code. For example, you need to ensure that you pre-process your data in exactly the same way.
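As a minimal sketch of what "common code" means here, the toy example below routes both training and serving through a single preprocess function. The model itself is deliberately trivial and entirely hypothetical. The point is only that the scaling logic exists once and that serving reuses the statistics learned in training.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def preprocess(values: Sequence[float], center: float, scale: float) -> List[float]:
    """Shared by training and serving so both see identically scaled data."""
    return [(v - center) / scale for v in values]


def train(values: Sequence[float]) -> Dict[str, float]:
    """Toy 'training': learn scaling statistics and a decision threshold."""
    center, scale = mean(values), pstdev(values) or 1.0
    features = preprocess(values, center, scale)
    return {"center": center, "scale": scale, "threshold": max(features)}


def predict(model: Dict[str, float], values: Sequence[float]) -> List[bool]:
    """Serving path: reuse preprocess() with the statistics stored in the model."""
    features = preprocess(values, model["center"], model["scale"])
    return [f > model["threshold"] for f in features]


model = train([1.0, 2.0, 3.0])
flags = predict(model, [2.0, 10.0])  # only the second value exceeds the threshold
```

If the serving code had its own copy of the scaling logic, the two could drift apart without anyone noticing, which is exactly the failure that shared code prevents.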

In traditional software engineering it is common to ship code without any ancillary code. For example, you would not package test or CI code. But in machine learning, where lineage and reproducibility are paramount, all code is important, especially the code used in the training process. The benefits of bundling the whole repo outweigh the risks and disadvantages of not doing so.

Declarative Data

Declaring data is beyond the scope of this article. But common approaches include:

  • declaring data in version control systems like Pachyderm and Databricks Delta Lake
  • storing uniquely identifiable, compressed, immutable datasets in a data lake, like S3
  • running an immutable query in a data warehouse, like BigQuery.

Declarative Hyperparameters

Declaring hyperparameters is not difficult. They tend to be in plaintext and are easy to store in Git, alongside the code.
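For instance, a plaintext hyperparameter file can be loaded into an immutable object, so the training code can only use what has been declared. This is a sketch under assumptions: the JSON format, the field names, and the load_hyperparameters helper are all hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Hyperparameters:
    """Hypothetical set of hyperparameters; frozen so a run cannot mutate them."""
    learning_rate: float
    epochs: int
    batch_size: int


def load_hyperparameters(text: str) -> Hyperparameters:
    """Parse declared hyperparameters and reject anything undeclared."""
    raw = json.loads(text)
    allowed = {f.name for f in fields(Hyperparameters)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"undeclared hyperparameters: {sorted(unknown)}")
    return Hyperparameters(**raw)


# In practice this text would live in a file committed to Git alongside the code.
declared_text = '{"learning_rate": 0.01, "epochs": 20, "batch_size": 64}'
params = load_hyperparameters(declared_text)
```

Changing a hyperparameter then becomes a commit like any other, which gives you review and history for free.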

How Does GitOps Benefit AI Development?

GitOps already helps AI development because it helps software development. AI is 80% engineering, so the previous benefits continue to apply.

But there are extra benefits that arise out of the use of GitOps that are more specific to AI.

Governance

Governance is the use of controls, processes, and audits to prove that applications follow the business’ policies. Of course, there is a lot more to it than GitOps. But making your entire AI development declarative is a step towards making your application observable and obvious. If tools can see a declaration of your application, it is easy to audit that it conforms.

And when auditors perform an investigation, a definitive source of truth that presents the state of an application at a point in time is extremely valuable.

Preventing Developer Lock-In

The AI technology stack, from the code, up through the DevOps tooling, into the AI frameworks, and out into the deployment, is immensely tall. This makes it quite unlikely that everyone in your company is an expert in everything. And when someone leaves the company, that can be a real problem because of the serious loss of knowledge.

The worst case is that the developer has not declared anything. In this case, the application is as good as dead. You cannot reproduce anything, let alone continue development. One of my favorite examples of this is How To Write Unmaintainable Code by Andrew Yurisich. Check it out, it’s hilarious!

But if the entire state of the application is declared, then it is possible, although not easy, to take control of the application, see how it works, and move on.

Reproducible Experiments

One key challenge in industry, and especially in academia, is experimental reproducibility. It’s hard to make machine learning experiments return repeatable results. It’s even harder if projects are not declarative. Imagine having to attempt to patch together an environment if all the author provided was Python 3.7, or equivalent.

GitOps helps with this because, by definition, the state of an application, and indeed an experiment, is declared. Reproduction should be as easy as clicking a button.

Ultimate Experimentation Platform

Reproducible experiments simplify making small changes to the hyperparameters or the code. After pushing the changes the automated processes kick in and run a new experiment.

Repeat this process a few times to test a few different things then put your feet up. Come back after lunch and compare the experiments in your experimentation tracking solution. Ultimate machine learning productivity.

Re-Testing Previous Models

Back-testing is a common strategy to prove model performance on historical data. But it’s also useful to test old models against new data, especially when the data exhibits seasonality.

For example, imagine that you need to re-develop a Black Friday model for this year’s madness. You can:

a) refer to the declared state from last year and compare it to the current model, and b) pull up the old artifacts and test them against data from that period.

In essence, you can both back-test and forward-test last year’s model to see how well it really performed.

Moving Between Infrastructure

Moving from a local development environment into production or a large scale training platform is difficult. So much so that some frameworks actively promote developing remotely, like Spark. But the developer experience and feedback loop are so much better when working locally.

If you work declaratively, it is easy to inject an environment-specific configuration option at run-time. For example, you could use a local directory with a small subset of data locally and use an S3 path to the full dataset when remote.
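A minimal sketch of that idea follows. The environment variable name TRAINING_ENV, the local directory, and the bucket path are all made up for illustration, and your own project would declare its own.

```python
import os
from typing import Mapping

# Hypothetical declared locations for each environment.
DATA_LOCATIONS = {
    "local": "./data/sample",                       # small subset for fast feedback
    "remote": "s3://example-bucket/datasets/full",  # full dataset for real training
}


def resolve_data_path(env: Mapping[str, str] = os.environ) -> str:
    """Pick the data location from the environment, defaulting to local."""
    target = env.get("TRAINING_ENV", "local")
    try:
        return DATA_LOCATIONS[target]
    except KeyError:
        raise ValueError(f"unknown TRAINING_ENV: {target!r}") from None


local_path = resolve_data_path({})
remote_path = resolve_data_path({"TRAINING_ENV": "remote"})
```

The training code only ever calls resolve_data_path, so the same artifact runs unchanged on a laptop and on the training platform.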

Where To Start?

Hopefully, I’ve convinced you that GitOps for AI is something worth fighting for. So now you’re asking “what’s next?”.

First, it’s likely that you have already started. It’s very likely that your codebase is in Git. Somebody is probably already using GitOps somewhere in your company’s process. Check to find out if someone has beaten you to it!

After that, follow these steps:

  1. Map out and follow the dependency graph for your application. Make it declarative, all the way down.
  2. Automate away the execution phases.
  3. Develop procedures and tools to speed up development and experimentation.
  4. Develop observability tools to support governance.

What Is the Future of GitOps?

GitOps is here to stay.

The benefits of being declarative outweigh the very few counter-arguments. It’s hardly any extra work and promotes cleanliness throughout a business. Some might have different opinions on lineage. Some may even question declaration if no-code tools are available.

But the future will retain GitOps.

And that’s because state is stationary if you look through a small enough window. And teams can’t cooperate if there are no agreed sources of truth.

The future will bring changes, however. No-code and point-and-click interfaces are useful. So I see the convergence of no-code solutions and GitOps as somewhat inevitable. At least in developer-focused applications.

This gives rise to the idea of SlackOps or ClickOps. These are user interfaces to a GitOps backend.

I even think that MLOps will eventually converge towards being a user interface for GitOps, albeit focussed on AI applications.

The point is that there will be many user-facing interfaces, but all fads lead to GitOps.
