Cloud Native Data Science: Technology
by Dr. Phil Winder, CEO
Technology choices in data-driven products are, as you would expect, largely directed by the type and amount of data. The first and most crucial decision to make is whether the data will be processed in a batch or streaming fashion.
Stream or Batch?
The theoretical distinction between streaming and batching matters less than its practical implications. Streaming data implies a constant flow of new records, while batch implies data that is at least semi-static over short timescales. The distinction is fuzzy: you can buffer streaming data and treat it as a batch, or use streaming methods even when update rates are slow. The real distinction is between the tools and technologies that have been developed to handle the two types of data. It is generally accepted that problems are quicker and easier to solve in batch, but you lose the responsiveness of receiving rapid updates [1].
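To make the distinction concrete, the minimal sketch below (with invented transaction values) computes the same average once over a complete batch and incrementally as records arrive on a stream.

```python
# A minimal sketch contrasting batch and streaming computation of a mean.
# The transaction values are invented purely for illustration.
transactions = [12.0, 99.5, 3.2, 45.0]

# Batch: one pass over a complete, semi-static dataset at rest.
batch_mean = sum(transactions) / len(transactions)
print(f"batch mean: {batch_mean:.2f}")

# Streaming: maintain a running aggregate as each record arrives.
count, running_mean = 0, 0.0
for value in transactions:  # imagine these arriving one at a time
    count += 1
    running_mean += (value - running_mean) / count  # incremental update
    print(f"streaming mean after {count} records: {running_mean:.2f}")
```

In production the streaming loop would be driven by a framework such as Apache Flink [1], but the shape of the computation is the same: batch sees the whole dataset at once, streaming only ever sees the latest record and its own running state.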
You can infer the regime from the problem definition. For example, detection automation problems (e.g. fraud detection) usually handle their data as a stream to allow for rapid response times. But application automation problems (e.g. customer churn, loan applications) can be handled in batch because this matches the underlying task.
Data Storage
The second key decision is how data will be handled and stored. Data can be hard to move once it has come to rest, simply because of its size, so it makes sense to think carefully about storage requirements up front. This is usually called Data Warehousing and has interesting implications for a Cloud Native architecture (see the terminology at the bottom).
Many of the benefits of being Cloud Native are focused on the consumer-facing parts of an application. We want it to be resilient because we don’t want consumers to see our broken code. We want it to be scalable so that we can meet demand. The main way in which Cloud Native techniques achieve this is through immutability: scalability and resiliency are the result of replicating a small amount of immutable code many times (e.g. containers). But data, by definition, is mutable. It is constantly changing. This means we have to repeatedly move new data.
The real challenge, therefore, is how best to move data from one place to another, and the key decisions are when, how, and in what form it is moved.
If your data is highly structured and needs to be accessed often, databases are your best bet. Note that even unstructured data can often be coerced into a database. The primary benefit of doing so is that you offload the complex task of storing, managing, and exposing the data efficiently. If you have highly unstructured data, for example data of many differing types, then it is usually better to use an object or file store. Object stores such as S3 have become very popular for storing binary blobs of any size, at the expense of performance. High-performance blob storage systems typically resort to clusters of machines that expose the raw performance of local SSDs.
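As a rough illustration of the object-store trade-off, the sketch below uses boto3 to write and read an arbitrary blob in S3; the bucket name and object key are placeholders, and credentials and error handling are left to the environment.

```python
# A minimal sketch of using S3 as a blob store for unstructured data.
# Bucket name and object key are placeholder assumptions.
import boto3

s3 = boto3.client("s3")
bucket, key = "example-data-science-bucket", "raw/sensor-dump.bin"

# Write an arbitrary binary blob; S3 does not care about its structure.
s3.put_object(Bucket=bucket, Key=key, Body=b"\x00\x01 arbitrary bytes")

# Read it back. Every access is a network round trip, which is the
# performance cost paid for the simplicity and scale of object storage.
blob = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
print(f"retrieved {len(blob)} bytes")
```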
Monitoring
Creating software is easy. You decide what you want it to do, make it do that, then assert that it does it. This is possible because the code is deterministic: for any given input you can check that it generates the expected output. But algorithms used to make decisions about data are often developed as black boxes. We create models based on the data that we see at the time; this is called induction. The problem with inductive reasoning is that when we observe data we haven’t seen before, we don’t know precisely how the black box will behave.
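As a hypothetical illustration of that difference, the deterministic function below can be asserted exactly, while the “model”, a trivial threshold learned from previously seen scores, can only be checked against the data we happen to have observed.

```python
# A hypothetical sketch: deterministic code versus an inductive model.

def add_vat(price: float, rate: float = 0.25) -> float:
    """Deterministic: for any input we know exactly what the output must be."""
    return price * (1 + rate)

assert add_vat(100.0) == 125.0  # an exact, repeatable assertion

# An inductive "model": a threshold learned from previously observed scores.
training_scores = [0.10, 0.20, 0.15, 0.90, 0.95]
threshold = sum(training_scores) / len(training_scores)

def looks_fraudulent(score: float) -> bool:
    return score > threshold

# On familiar data we can verify its behaviour...
assert sum(looks_fraudulent(s) for s in training_scores) == 2
# ...but on unseen inputs we can only observe what it decides.
print(looks_fraudulent(0.47))
```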
There are techniques we can use to improve the level of determinism, but a good form of validation is to constantly reassure ourselves that the application is working. Like Cloud Native software, we need to instrument our Data Science code in order to monitor its operation. For Data Science applications we can (a minimal instrumentation sketch follows the list):
- generate statistics about the types of decisions being made and verify that they are “normal”,
- visualise distributions of results or inputs and assert their validity,
- assert that the input data is as expected, isn’t drifting over time, and doesn’t contain invalid values, and
- instrument feature extraction to ensure performance.
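As one way to implement the first two points, here is a minimal sketch using the prometheus_client library; the metric names, the port, and the decision threshold are illustrative assumptions, and predict() stands in for a real model.

```python
# A minimal sketch of instrumenting a model's decisions with Prometheus metrics.
# Metric names, port and decision threshold are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Decisions made, by label", ["label"])
SCORES = Histogram("prediction_score", "Distribution of raw model scores")

def predict(features: dict) -> float:
    """Stand-in for a real model: returns a score in [0, 1]."""
    return random.random()

if __name__ == "__main__":
    start_http_server(8000)  # metrics scrapeable at http://localhost:8000/metrics
    while True:
        score = predict(features={})
        SCORES.observe(score)                     # visualise the score distribution
        label = "fraud" if score > 0.9 else "ok"  # the decision being made
        PREDICTIONS.labels(label=label).inc()     # count decisions by type
        time.sleep(1.0)
```

Dashboards and alerts built on metrics like these are what let you assert that the decisions being made remain “normal”.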
This is the feedback that the Data Scientists and Engineers need to iteratively improve their product. Through monitoring, we can gain trust in models and applications and prove to others that they work. Ironically, implementing monitoring and alerting in your CNDS application is the best way to avoid being woken up at 3 AM.
Terminology
Data Science encapsulates the process of engineering value from data. Other people use different terms like Machine Learning, Artificial Intelligence, or Big Data, and usually they all mean the same thing: given data, make a decision that adds value. Find out more about data science.
Cloud Native Data Science (CNDS) is an emerging trend that combines Data Science with the benefits of being Cloud Native. Find out more about cloud native.
References
[1] Friedman, Ellen, and Kostas Tzoumas. Introduction to Apache Flink: Stream Processing for Real Time and Beyond. O’Reilly Media, Inc., 2016.