Cloud Native Data Science: Best Practices
by Dr. Phil Winder, CEO
Following the Cloud Native best practices of immutability, automation and provenance will serve you well in a CNDS project. But working with data brings its own subtle challenges around these themes.
Provenance
Affirming the provenance of a model is most important when things go wrong. For example, unexpected new data or attacks [1] can cause catastrophic errors in the predictions of live systems. In these situations, we need to be able to fully reconstruct the state of the model at the time the error occurred. This can be achieved by snapshotting:
- the model, along with its parameters and hyperparameters,
- the training data,
- the results at the end of training,
- the code that trained and ran the model, and
- the random seeds that were fixed for the run (a minimal snapshot manifest is sketched below).
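As a minimal sketch of what such a snapshot could look like (the file layout, hashing scheme and field names here are our assumptions, not a prescribed format), a small manifest can tie the model artefact, training data, results, code revision and seed together:

```python
import hashlib
import json
import random
from datetime import datetime, timezone
from pathlib import Path

import numpy as np


def sha256(path: Path) -> str:
    """Content hash so the exact artefact can be verified later."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(model_path: Path, data_path: Path, metrics: dict,
             hyperparameters: dict, git_commit: str, seed: int) -> Path:
    """Write a JSON manifest capturing what is needed to reconstruct
    the model's state: artefacts (by hash), code revision, settings and seed."""
    manifest = {
        "created": datetime.now(timezone.utc).isoformat(),
        "model_sha256": sha256(model_path),
        "training_data_sha256": sha256(data_path),
        "training_metrics": metrics,          # results at the end of training
        "hyperparameters": hyperparameters,
        "git_commit": git_commit,             # the code that trained and ran the model
        "random_seed": seed,
    }
    out = model_path.with_suffix(".manifest.json")
    out.write_text(json.dumps(manifest, indent=2))
    return out


# Fixing seeds up front is what makes the run reproducible from the manifest.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
```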
When failures do occur, it’s really important to cover the basics first. Failures should only ever be surprising, stemming from some genuine misunderstanding of the data, never from skipped fundamentals. For example, there have been reports of Tesla’s parking mode ramming customers’ garage doors. If it can’t avoid hitting a door, how can people have confidence in it at 70 mph?
Once a failure has been diagnosed, make sure it doesn’t happen again. Always inspect the largest failures and ensure the automated tests are up to scratch. Retrain the model if necessary. If online and offline results differ, you have a bug. If it makes sense, add the data to a regression test set so the failure can never silently return. And finally, keep raising the baseline; people won’t accept a decrease in performance unless there is a very good reason.
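One way to make “never happens again” concrete is a test that scores the current model on previously failing examples before release. This sketch assumes a scikit-learn-style `predict` interface and illustrative file names:

```python
# test_regression.py -- run with pytest. Paths and interface are illustrative.
import json
import pickle


def load_model(path="model.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)


def test_known_failures_stay_fixed():
    """Every example that once caused a production failure must now be
    predicted correctly; a new failure here blocks the release."""
    model = load_model()
    with open("regression_cases.json") as f:
        cases = json.load(f)  # [{"features": [...], "expected": ...}, ...]
    wrong = [c for c in cases
             if model.predict([c["features"]])[0] != c["expected"]]
    assert not wrong, f"{len(wrong)} previously-fixed failures have regressed"
```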
Automation
Automation is a central tenet of being Cloud Native. Models are updated frequently during early development, so automated delivery pipelines that enforce quality are vital. Making it easy to push new developments into production reduces friction and shortens the feedback cycle.
Jupyter Notebooks [2], a common way of communicating research in the Data Science community, should not be used in production code, tests or pipelines. Production code needs the rigours of dedicated Software Engineering. Dedicated graph-based pipeline tools like Luigi [3] are helpful here.
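For example, a training step in Luigi might look like the following. The task names and file targets are our assumptions, but the `requires`/`output`/`run` structure is Luigi’s standard API:

```python
import luigi


class PrepareData(luigi.Task):
    """Pull raw data and write the cleaned training set."""

    def output(self):
        return luigi.LocalTarget("data/training_set.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("feature,label\n")  # placeholder for real extraction logic


class TrainModel(luigi.Task):
    """Runs only once PrepareData has completed; Luigi resolves the graph."""

    def requires(self):
        return PrepareData()

    def output(self):
        return luigi.LocalTarget("models/model.txt")

    def run(self):
        with self.input().open() as data, self.output().open("w") as out:
            out.write("trained-model-placeholder")  # stand-in for real training


if __name__ == "__main__":
    luigi.build([TrainModel()], local_scheduler=True)
```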
People shouldn’t be spending time repeatedly writing boilerplate to read from different data sources. Invest effort to create abstractions of your data sources and give engineers easy access to data over APIs. Furthermore, your feature extractors shouldn’t care where data is coming from; this will make it easier to add new data sources and reuse code.
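As a sketch of such an abstraction (the class and method names are ours, not a specific library’s), feature extraction can be written against an interface rather than a concrete source:

```python
from abc import ABC, abstractmethod

import pandas as pd


class DataSource(ABC):
    """Single interface the rest of the codebase reads data through."""

    @abstractmethod
    def load(self) -> pd.DataFrame: ...


class CsvSource(DataSource):
    def __init__(self, path: str):
        self.path = path

    def load(self) -> pd.DataFrame:
        return pd.read_csv(self.path)


class ApiSource(DataSource):
    """Hypothetical JSON-over-HTTP source; the endpoint is an assumption."""

    def __init__(self, url: str):
        self.url = url

    def load(self) -> pd.DataFrame:
        return pd.read_json(self.url)


def extract_features(source: DataSource) -> pd.DataFrame:
    """The extractor never knows, or cares, where the data came from."""
    df = source.load()
    return df.select_dtypes("number").fillna(0)
```

Adding a new data source then means adding one class, not touching every feature extractor.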
Rapid, clear feedback is vital for gaining confidence in a model and a product. Avoid switching between environments or data (e.g. throwing models over a “wall” to software engineers is a sure way to introduce subtle bugs that fail silently). Ideally, the exact same code should run online and offline, so you can be confident that offline results will match online ones. And again, monitoring is often the first and last line of defence.
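One lightweight way to keep online and offline code identical (the module layout here is an assumption) is to put the feature computation in a single module that both the training job and the serving code import:

```python
# features.py -- the single source of truth for preprocessing.
def transform(record: dict) -> list[float]:
    """Identical feature computation for training (offline) and serving (online)."""
    return [
        float(record["duration_ms"]) / 1000.0,
        1.0 if record.get("is_mobile") else 0.0,
    ]


# train.py builds its matrix with:    X = [transform(r) for r in records]
# serve.py scores a request with:     prediction = model.predict([transform(payload)])
```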
Finally, deployment of models can be particularly tricky, since you can never be 100% certain that a model is working as expected. Once adequate monitoring is in place, you can feed this information back into the deployment pipeline. For example, a simple but effective strategy is to shadow the new model against the model in production. Once you have enough data, over enough time, to prove the new model is working, you can manually switch it over. You can also run multiple models in this way to compare their performance; this is known as the champion-challenger approach, where the model with the best production performance is promoted. Other standard deployment strategies like A/B testing, blue-green, etc. also apply here, but make sure your monitoring is up to scratch first.
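A shadow (champion-challenger) call path can be sketched as follows; the model interfaces and the logging sink are assumptions. Only the champion’s answer is returned, while the challenger’s prediction is recorded for later comparison:

```python
import logging

logger = logging.getLogger("shadow")


def predict(features, champion, challenger):
    """Serve the champion; run the challenger in shadow and log both outputs."""
    served = champion.predict(features)
    try:
        shadowed = challenger.predict(features)
        logger.info("shadow comparison champion=%s challenger=%s", served, shadowed)
    except Exception:
        # A challenger failure must never affect what the user receives.
        logger.exception("challenger failed in shadow mode")
    return served
```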
Series Conclusion
In this series we talked about strategy, technology and best practices. Developing a Cloud Native strategy, focused towards data science, can have profound effects on a business or a product. There can be speed bumps, both with the product direction and team organisation, but the result is reliable, flexible products that will help you compete.
The important early technology choices are, unsurprisingly, focused on the data. You must consider the type of data you will be gathering and the use case that you are trying to fulfil. Non-functional requirements are also very helpful. An incorrect choice can often mean throwing away significant amounts of work.
During our time developing CNDS projects we have found that the best practices of immutability, provenance and automation have notable differences when compared to process automation projects. Keeping complete copies of the state of models is very useful for post-mortem analysis. And many day-to-day tasks are highly repetitive; it is worth spending effort automating them to reduce mistakes and improve efficiency.
Doing Data Science in a Cloud Native world can have its difficulties. The development cycle of a Data Science project can be very different to a Software project; at least in the early stages of development. But being Cloud Native yields robust, performant products. Being confident that your models are operating as expected will help you sleep at night.
Terminology
Data Science encapsulates the process of engineering value from data. Other people use different terms, like Machine Learning, Artificial Intelligence or Big Data, and usually they all mean the same thing: given data, make a decision that adds value. Find out more about data science.
Cloud Native Data Science (CNDS) is an emerging trend that combines Data Science with the benefits of being Cloud Native. Find out more about cloud native.
References
[1] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. “Explaining and Harnessing Adversarial Examples.” arXiv preprint arXiv:1412.6572 (2014).
[2] Jupyter: http://jupyter.org/
[3] Luigi: https://github.com/spotify/luigi