DevOps and Data Science: DataDevOps?

I’ve seen a few posts recently about the emergence of a new field that is kind of like DevOps, but not quite, because it involves too much data. Verbally, about two years ago and in blog form about a year ago, I used the word DataDevOps, because that’s what I did. I develop and operate Data Science platforms, products and services.

But more recently I have read of the emergence of DataOps. Apparently enterprises have realised that it takes more than a PhD in Data Science to create products and value (not that I begrudge the value of a PhD, I have one, after all!). It also takes engineering. Specifically, software engineering, to perform a series of tasks that support the wafer-thin slice of the product cake that represents the Data Science model.

I think these two names imply the same problem. They represent the full-stack engineering and data skills, and I stress the connection, that are required to provide value.

Operations

Operations, or Ops for short, has come to mean something different from simply operating systems. Operations has leaked into Engineering and for good reason. The premise is that Engineers will always do the simplest thing to solve a problem (i.e. Occam’s Razor). This is a good thing and is desired. However, in the Software of old (i.e. prior to 2000-ish), this was interpreted as being able to produce some code that solved a specific task.

The issue was that the code tended, to use a Data term, to over-fit. The code was functional in the purest sense: for a very specific problem it might work, but for a worryingly large number of edge cases it would not. As software became more complicated, it became more probable that such code would bring down the entire solution.

This popularised ideas such as test driven development (TDD), but there was always one problem. Engineers didn’t care enough. Tests were always the last thing to write and never really fully tested the software, due to the rule of doing the simplest thing (remember focussing on just getting 100% coverage?). This meant that tests didn’t cover the edge cases and the same problems occurred (although to a lesser extent).
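To make the coverage point concrete, here is a minimal sketch in Python. The function and test names are hypothetical, invented purely for illustration. A single happy-path test executes every line of the function, so a line-coverage tool would report 100%, yet an obvious edge case is still unhandled.

```python
def mean(values):
    """Naive arithmetic mean: the 'simplest thing' that solves the task."""
    return sum(values) / len(values)


def test_happy_path():
    # This one test executes every line of mean(), so line coverage is 100%.
    assert mean([1, 2, 3]) == 2.0


def probe_empty_input():
    # The edge case the happy-path test never exercised: an empty list
    # makes len(values) zero, so the division raises ZeroDivisionError.
    try:
        mean([])
    except ZeroDivisionError:
        return "unhandled edge case"
    return "handled"


test_happy_path()
print(probe_empty_input())
```

Coverage measures which lines ran, not which inputs were considered, which is why chasing the number alone leaves the 3AM failures in place.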

A breakthrough was to add a carrot. Or rather, a stick (to brandish threateningly). The idea was that we make Engineers, the people that are writing the code, responsible for the continued operation of the product or service. This might sound obvious to many non-Engineers, but this was revolutionary. Up to this point people were hired to write Java (a medieval programming language) code, for example, and code only. Not tests. Not supporting software. Just code. There were other people for that.

Because Engineers didn’t want to be woken at 3AM like the Operations staff were previously, they mysteriously began to put processes in place to prevent breakages on production systems. This largely consisted of better testing (not necessarily more), automated testing, deployment pipelines, smaller, simpler components and a better ability to visualise the status of products or services (i.e. monitoring). This acceptance of responsibility is a critical part of Cloud-Native development.

Cut to the Data Chase

All of this still applies to Data Science. We’re still Engineers (even though we’re not a factory using blueprints). We’re still building useable products or services to provide value. In other words, we still need the stick to ensure we build reliable software. Until someone can figure out what works better than a stick.

All data-driven products require the engineering finesse that has been developed by the incredible process-focused engineers (think Clean Code, Agile, TDD, SOLID, Scrum, Kanban, Kata, Lean, etc). Although these methodologies scream marketing, there is value. Honest gov’ner. I can’t remember any studies off-hand that prove we are more proficient and efficient at writing software now (2019) than we were in 1999, but I’m confident that is the case (like, 95%).

Data products (products that have empirical data at their heart) require the same standards as any other software. Given a problem, the desired service level agreement (SLA) and a business impact assessment of failure, we must optimise the solution to maximise simplicity, minimise effort and destroy any possibility of being woken up at 3AM.
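As a rough illustration of how an SLA turns into an engineering constraint, the sketch below converts an availability target into the downtime it permits. The targets and the 30-day window are assumptions chosen for the example, not figures from any real agreement, and the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_percent, window_days=30):
    """Minutes of downtime permitted in the window before the SLA is breached."""
    window_minutes = window_days * 24 * 60
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    # Round to two decimals to hide floating-point noise in the result.
    return round(window_minutes * unavailable_fraction, 2)


# Hypothetical availability targets, for illustration only.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {minutes} minutes of downtime")
```

The tighter the target, the less room there is for manual recovery, which is what pushes the design towards simpler components, automated deployment and monitoring.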

Development

This brings me to development. We’ve established that building a data product is essentially the same as building any software product. Therefore, engineers need to know how to develop. This requires a knowledge of the processes (outlined above), experience with the technologies (right tool for the job) and an acknowledgement that they are responsible for the product (empowerment).

Technology

Technology is one of the common reasons why people become Engineers. They like to play with new toys. This is true for my Mechanical Engineer friends that play with novel heating, ventilation and air-conditioning (HVAC) systems using exotic refrigerants. And for my Electronics friends that enjoy playing with the newest Raspberry Pi to control their lighting/chicken coop door/key finder/weather balloon/garden watering system (I kid you not. Someone should do a presentation on the best, craziest projects!).

Sometimes the proliferation of technology can be overwhelming, often to the detriment of the wider community (think of all the wasted man-hours spent on unused JavaScript frameworks - aside: we shouldn’t be using the term man-hours. People-hours or person-hours don’t have the same ring to them. Dev-hours maybe?).

The goal of Software-related technology is to automate, simplify or codify common problems. This allows engineers to be more efficient and proficient (i.e. fewer bugs and in less time). To optimise for efficiency and proficiency, engineers need to know what tool to use and when. The data used in this optimisation is experience and knowledge.

Empowerment

Once engineers are happy they are solving the right problem (solving the wrong one is more common than you may think!) then they should be empowered to utilise the right tool to solve the problem efficiently.

A common economic theory states that improved infrastructure increases economic productivity. This is because the flow of goods and people is less restrained, so more can be sold in less time. Similarly, data represents the raw material that drives an application. We need the supporting infrastructure, the building blocks, to be able to deliver our cargo. The infrastructure in this case is Software.

Therefore, the ability to develop software is paramount. Just as much as the data. Without it we cannot exploit the power of data.

Data

Which brings me to the data itself. I’m an Engineer at heart, but I’ve always worked with data. My PhD was all about the data produced by rain drops. I worked for years listening to and classifying sounds. More recently I’ve detected hackers, optimised lasers and empowered Data Scientists. So you might imagine I’m quite data-precious?

In fact I’m not. The reason why I became an Engineer, fundamentally, was because I enjoy solving problems. And I do that by understanding what the problem is, figuring out what information is available and exploiting representations of the information to solve said problem. This again is the holy trinity of data, development and operations, i.e. Data+Dev+Ops=DataDevOps.

The Catch: It’s All Too Much

The problem with DataDevOps is that it can be too much. All three fields are huge and are only getting bigger. You can do full degrees in all of these subjects and you would only touch the surface (arguably the wrong surface!).

The traditional solution to this problem is to hire specialists in each discipline and place them in disparate teams. Except that doesn’t work. When surrounded by the same people and the same problems, people optimise for efficiency, rather than proficiency, because there is no proficiency feedback loop. The only pressure is time. So people tend to shirk tasks that traditionally belong to other teams. And this leads to walls, which in turn leads to blame, inefficiencies and ultimately product-incompetencies.

This is why DevOps is a thing now. It is the end-to-end lifecycle, from problem to operation, that drives empowerment and proficiency.

The Fortunate Unfortunate Conclusion

This leads me to the conclusion that, for better or worse, Engineers need to know a lot about a lot. And obviously there are limitations. The best ways I’ve found or seen in others to help with this problem are:

Cross-functional teams. Gain experience through osmosis.
Experiencing a range of people, problems and situations.
Communicate. Teach. Present. Write.
An appetite for continuous learning (the right attitude).
A knack of inadvertently creating opportunities, without expending too much effort. E.g. networking.
An open mind and humility.
Read. Then read some more.
Knowing when to stop, but persisting when it gets hard or boring.

So I don’t disagree: DataOps is just as much of a thing as DevOps, but each contains only two of the three necessary components of a modern product or service. I maintain that training Software Developers to become better at Data Science and Operations (or vice versa, depending on what you learn first) is the only logical conclusion.

This is great if you love learning. You can look forward to a fulfilling life and career. But this also poses a problem because you can never know everything. This is the paradox of modern engineering. Everyone wants to and needs to become a fabled “Full-Stack Engineer” but will never become one, simply because the goalposts are moving constantly in an already immense body of knowledge. All I can suggest is try your best and enjoy the ride.