How We Built an MLOps Platform Into Grafana
by Dr. Phil Winder, CEO
Winder.AI collaborated with Grafana Labs to help them build a Machine Learning (ML) capability into Grafana Cloud. In summary, the work included:
- Product consultancy and positioning - delivering the best product and experience
- Design and architecture of MLOps backend - highly scalable - capable of running training jobs for thousands of customers
- Tight integration with Grafana - low integration costs - easy product enablement
Grafana’s Need - Machine Learning Consultancy and Development
Grafana Cloud is a successful cloud-native monitoring solution developed by Grafana Labs.
Monitoring modern microservice or serverless applications is becoming increasingly complex, so users are looking for ways to reduce the information overload. Furthermore, analytics users are looking for advanced functionality like automated forecasting.
The purpose of this project was to:
- Collaborate with the Grafana team
- Design a service to train and serve machine learning-based algorithms
- Develop Kubernetes-native applications to host the MLOps API and schedule training jobs
- Build robust cloud-native software and continuous delivery pipelines
- Deliver example time-series forecasting models
- Deliver enhancements that allow users to deploy their own forecasts and anomaly detection algorithms
Winder.AI’s Solution - Highly Scalable Kubernetes-native ML Backend
This project combined design and implementation. Working in full collaboration with the Grafana Labs engineering team, Winder.AI’s engineers delivered a fully working product over several months. This required significant experience in both cloud-native and ML technologies.
Together, we delivered a tiered design comprising:
- Prometheus proxy - an API that allows ML models to expose a Prometheus API for easy integration
- ML management - a service that manages and schedules ML training runs
- Model training - a service that controls and scales ML training runs
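To make the Prometheus proxy idea concrete, here is a minimal sketch of how forecast samples could be wrapped in the JSON shape that the Prometheus HTTP API returns for range queries, so that any Prometheus-compatible client can read them. The function name `to_prometheus_matrix` and the metric name `cpu_usage:predicted` are hypothetical and chosen only for illustration; this is not the code used in the project.

```python
import json
from typing import Dict, List, Tuple


def to_prometheus_matrix(
    labels: Dict[str, str],
    samples: List[Tuple[float, float]],
) -> dict:
    """Wrap (unix_timestamp, value) samples in a Prometheus range-query response body."""
    return {
        "status": "success",
        "data": {
            "resultType": "matrix",
            "result": [
                {
                    "metric": labels,
                    # Prometheus encodes sample values as strings.
                    "values": [[ts, str(value)] for ts, value in samples],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical forecast output: two points, one minute apart.
    forecast = [(1700000000.0, 0.42), (1700000060.0, 0.45)]
    body = to_prometheus_matrix({"__name__": "cpu_usage:predicted"}, forecast)
    print(json.dumps(body, indent=2))
```

Because the response follows the standard matrix format, a dashboard that can already query Prometheus would need no special handling to plot the forecast.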
The model training service was particularly interesting. We designed the model training abstraction to be generic enough to work with any type of model (i.e. any ML library) while still working in the way that data scientists expect (i.e. pandas). This was only possible because the input data is constrained to the Grafana/Prometheus/InfluxDB interface, but it allowed us to rapidly develop and test new models and even spin up quite complex hyperparameter tuning jobs. All of this was orchestrated by pure Kubernetes manifests, which made it easy to port between local and production clusters.
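As an illustration of what a library-agnostic training abstraction might look like, the sketch below defines a small fit/predict interface and a registry of models. Everything here is an assumption made for the example: the names `Forecaster`, `MeanForecaster`, `REGISTRY` and `train_and_forecast` are hypothetical, and a plain list of `(timestamp, value)` pairs stands in for the pandas data structures so that the sketch has no dependencies.

```python
from abc import ABC, abstractmethod
from statistics import mean
from typing import Dict, List, Tuple, Type

# (unix_timestamp, value) pairs; a dependency-free stand-in for a DataFrame.
Series = List[Tuple[int, float]]


class Forecaster(ABC):
    """Hypothetical interface: any ML library can sit behind fit/predict."""

    @abstractmethod
    def fit(self, history: Series) -> None:
        ...

    @abstractmethod
    def predict(self, horizon: int) -> Series:
        ...


class MeanForecaster(Forecaster):
    """Trivial baseline that always predicts the historical mean."""

    def fit(self, history: Series) -> None:
        if len(history) < 2:
            raise ValueError("need at least two samples to infer the step")
        self._mean = mean(value for _, value in history)
        self._last_ts = history[-1][0]
        # Assume evenly spaced samples and reuse the last interval.
        self._step = history[-1][0] - history[-2][0]

    def predict(self, horizon: int) -> Series:
        return [
            (self._last_ts + self._step * i, self._mean)
            for i in range(1, horizon + 1)
        ]


# New model types are added by registering another Forecaster subclass.
REGISTRY: Dict[str, Type[Forecaster]] = {"mean": MeanForecaster}


def train_and_forecast(model_name: str, history: Series, horizon: int) -> Series:
    """Generic entry point: the caller never touches a specific ML library."""
    model = REGISTRY[model_name]()
    model.fit(history)
    return model.predict(horizon)


if __name__ == "__main__":
    history = [(0, 1.0), (60, 3.0)]
    print(train_and_forecast("mean", history, horizon=2))
```

The point of the design is that the training service only ever calls the generic entry point, so swapping in a different model means registering a new class and leaving the orchestration untouched.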
All services took advantage of Kubernetes for orchestration, Celery for queueing and scheduling, and Redis for state. Both Go and Python were used, along with libraries such as Facebook’s Prophet for time-series forecasting and anomaly detection.
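The division of labour between a queue and a state store can be sketched with the standard library alone. In the example below, `queue.Queue` stands in for the Celery queue and a dictionary stands in for Redis; neither is the real API of those tools, and the `submit` and `run_worker` helpers are hypothetical names used only to show the job lifecycle.

```python
import queue
from typing import Callable, Dict

# Stand-ins only: queue.Queue plays the role of the task queue,
# and a dict plays the role of the state store.
jobs: "queue.Queue[str]" = queue.Queue()
state: Dict[str, str] = {}


def submit(job_id: str) -> None:
    """Record the job as queued, then hand it to the queue."""
    state[job_id] = "queued"
    jobs.put(job_id)


def run_worker(train: Callable[[str], None]) -> None:
    """Drain the queue, recording each job's outcome in the state store."""
    while True:
        try:
            job_id = jobs.get_nowait()
        except queue.Empty:
            return
        state[job_id] = "running"
        try:
            train(job_id)
            state[job_id] = "succeeded"
        except Exception:
            # A failed training run must not take the worker down.
            state[job_id] = "failed"


if __name__ == "__main__":
    submit("forecast-1")
    run_worker(lambda job_id: None)  # a training function that does nothing
    print(state)
```

Keeping job status in a separate store means an API can report progress without talking to the workers directly.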
Result - Machine Learning for Grafana
Thanks to our collaborative efforts, Grafana are now able to offer an ML capability in their enterprise cloud offering which adds significant value to their product. This has been deployed into production and is now serving customers. Contact Grafana to try it out!
You can learn more about MLOps here.