Cloud Native Data Science: Strategy
by Dr. Phil Winder, CEO
Data Science has become an important part of any business because it provides a competitive advantage. Very early on, Amazon’s data on book purchases allowed them to deliver personalised recommendations whilst customers were browsing their site. Their main competitor in the US at the time was Borders, who mainly operated in physical stores. This physicality prevented them from seamlessly providing customers with personalised recommendations [1]. This example highlights how strategic business decisions and data science are inextricably linked.
The context and direction of a business also have an impact on CNDS (see bottom for terminology). Smaller businesses, where offerings are small but scopes fluctuate dramatically, can benefit from the flexibility that Cloud Native delivers. Enterprise businesses, with larger projects and longer lead times, will benefit from reduced coupling and shorter feedback cycles. Irrespective of size, all businesses should aim to empower their teams by integrating data into the business.
DataDevOps
DevOps is an integration philosophy that popularised the idea that the people who build a system should run the system. Teams are responsible for the ongoing success of a component of the business. This goes against traditional ideas of functional roles, where entire teams would be dedicated to one technical niche. Functional silos are a particular problem in enterprises. A “wall” between the people developing products and the people operating them reduces productivity and makes the business a worse place to work [2].
When data-driven products are added to the mix, the problems compound; another “wall” is created. Teams of Mathematicians and Scientists “throw” models and algorithms to Developers to implement. The software is then given to Operators to run in production. The result is predictable. Operators blame the Developers for providing them with buggy code. Developers blame the Scientists, saying that they’ve provided an inefficient model, or that the model doesn’t work. And the Scientists blame both for not understanding their work.
The only solution, which can be approached in several different ways, is to remove these walls. Everyone is responsible for delivering a product. This should be stated explicitly in the way that teams are organised.
Team Organisation
A practical solution depends again on the scale and strategy of a business. Typical solutions include creating cross-functional teams or dedicated “Data Science Consultants” within the business. Cross-functional teams, dedicated to one or more products or customers, benefit from the cross-pollination of experience and learning. Developers learn Data Science. Operators learn Software Engineering. Scientists learn how to operate a product. Dedicated teams of data science consultants within a business, by contrast, can be more efficient.
The most interesting part of organising a team which includes Data Scientists is that the level of expertise required is flipped upside down. In conventional Software Engineering projects (i.e. software that is easy to plan), it pays dividends to have experienced Engineers up front. Good software designs and architectures make it easier to create good products.
But Data Science projects are often harder at the end. It is very easy to come up with a quick model given some data. It is very hard to run a robust model in production. It is often said that Data Science only makes up about 10% of the effort of a data-driven product. The other 90% consists of typical Cloud Native components: monitoring and alerting, continuous integration and delivery, stateful storage, user interfaces, business logic and various other supporting tools and technologies. For this reason, it pays dividends to have experienced Data Scientists at the end of a project, where they can help grow the long-term viability of a product rather than the short-term proof of concept.
Strategic Uncertainty
Data science projects are high risk. There are many ways in which a project can fail. But we can de-risk a project through careful communication of the risks and by diversifying development.
The development of data science projects is iterative. Projects flow through phases which include problem definition, data understanding, model development and evaluation. The project can fail at each of these steps. Quite often a project fails after a full development cycle because the original problem was misunderstood.
By evaluating and communicating the risks clearly, we can begin to reduce the chance of failure by establishing research paths that are more likely to succeed. It can be beneficial to develop a low-risk and a high-risk path in tandem, using the low-risk path to hedge against the failure of the high-risk one.
Low-risk paths are often simpler and less groundbreaking than their high-risk equivalents. But they protect against the chance of complete failure.
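The hedging argument can be made concrete with a little arithmetic. The sketch below assumes that the paths succeed or fail independently of each other, and the success probabilities are made-up estimates rather than figures from a real project. The function name is also just an illustration.

```python
from math import prod


def chance_of_any_success(success_probabilities):
    """Probability that at least one path succeeds, assuming independent paths."""
    # The project fails outright only if every single path fails.
    chance_all_fail = prod(1.0 - p for p in success_probabilities)
    return 1.0 - chance_all_fail


# Hypothetical estimates: a groundbreaking path and a simpler fallback.
high_risk_only = chance_of_any_success([0.3])
hedged = chance_of_any_success([0.3, 0.8])

print(f"High-risk path alone: {high_risk_only:.2f}")
print(f"High-risk path hedged with a low-risk path: {hedged:.2f}")
```

Under these assumed numbers the chance of delivering something rises from 0.30 to 0.86. The low-risk path contributes most of that protection, which is the point of a hedge. In practice the estimates are rough and the paths are rarely fully independent, so treat the result as a guide rather than a forecast.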
If the success of the product is dependent on a high-risk path then it should be de-risked as soon as possible. Raise the priority of that work so that it is attempted sooner. If it fails, less time will have been wasted on secondary tasks that depend on its success.
One final method to mitigate uncertainty at a strategic level is to diversify across multiple high-risk, high-reward projects. Try to pick projects that are not dependent on the same risks. For example, don’t pick projects that depend on the same data and don’t pick problems to solve that are too similar. Otherwise you might end up with multiple failing projects due to a problem with something fundamental.
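A rough calculation shows why shared risks undermine diversification. The sketch below uses a deliberately simple model that is an assumption of this example, not a method from the references: every project needs some dependency (for instance a data set) to be sound, and then has to succeed on its own merits. All names and probabilities are hypothetical.

```python
def all_fail_shared(n_projects, p_dependency_ok, p_project_ok):
    """Chance that every project fails when all rely on ONE shared dependency."""
    # If the shared dependency is unsound, all projects fail together.
    # Otherwise each project still has to fail on its own merits.
    return (1 - p_dependency_ok) + p_dependency_ok * (1 - p_project_ok) ** n_projects


def all_fail_independent(n_projects, p_dependency_ok, p_project_ok):
    """Chance that every project fails when each has its OWN dependency."""
    # A project succeeds only if its dependency is sound and the work pays off.
    p_single_success = p_dependency_ok * p_project_ok
    return (1 - p_single_success) ** n_projects


# Hypothetical portfolio: three projects, each with the same rough odds.
shared = all_fail_shared(3, p_dependency_ok=0.7, p_project_ok=0.5)
independent = all_fail_independent(3, p_dependency_ok=0.7, p_project_ok=0.5)

print(f"All three fail (same data): {shared:.2f}")
print(f"All three fail (separate data): {independent:.2f}")
```

With these assumed odds, the chance of the whole portfolio failing is noticeably higher when the projects share a dependency, because a single fundamental problem takes all of them down at once.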
Terminology
Data Science encapsulates the process of engineering value from data. Other people use different terms like Machine Learning, Artificial Intelligence or Big Data, and they usually all mean the same thing: given data, make a decision that adds value. Find out more about data science.
Cloud Native Data Science (CNDS) is an emerging trend that combines Data Science with the benefits of being Cloud Native. Find out more about cloud native.
References
1 - Provost, Foster, and Tom Fawcett. Data Science for Business: What you need to know about data mining and data-analytic thinking. O’Reilly Media, Inc., 2013.
2 - Humble, Jez, and David Farley. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Pearson Education, 2010.