These are all highly relevant questions for operationalizing machine learning, and not just because MLOps is hitting peak (Gartner) hype. A few people are trying to address them, in the Gartners of the world and beyond. Ori Cohen and Lior Gavish are among them. Their opinion counts because they are both machine learning practitioners with years of experience whose day-to-day work touches upon various aspects of MLOps. Cohen holds a PhD in Computer Science, with work spanning machine learning, AI, and real-time brain-computer interfaces (BCI). He is currently the Lead Researcher at New Relic, and he recently went public with his pet project, The State of MLOps. Lior Gavish's background is also in computer science, with extensive experience in machine learning as well as in business and startups. He co-founded Monte Carlo, a vendor active in data observability and data reliability, where he currently leads engineering. We caught up with Cohen and Gavish to discuss The State of MLOps.
What is MLOps, and whom is it for?
To begin with, what exactly is MLOps? As we noted recently, there's no shortage of terms flying around in the adjacent data and machine learning / AI domains. As far as MLOps goes, Cohen referred to a popular diagram floating around in many variations, in which machine learning sits at the center and everything else is placed around it. "MLOps can practically mean everything related to that space around that small box of machine learning. You could start with data and engineering. Data science analysis, DevOps infrastructure systems, experiment management… Two or three years ago, a lot of companies were doing experiment management. Now it also means monitoring and observability, for data and data pipelines", said Cohen.

Gavish concurred, distilling his own definition of MLOps as "the practices and the tools that help you deliver machine learning with certain constraints that you're interested in". For him, this touches upon velocity in building and deploying, reliability and SLAs, security, and compliance. DataOps and MLOps are very early in their lifecycle, Gavish went on to add. This is why a certain degree of confusion is expected, and Cohen's work is a valuable tool for navigating this space.

Cohen has defined a number of facets that characterize MLOps solutions, and he collected and verified data points for each solution included. Some of those, such as the number of customers or total funding, are interesting but also trivial in a sense: they could apply to any company in any domain. Our attention was piqued by the facets that we felt may apply uniquely to MLOps solutions.

Product focus is one of those. It refers to whether a solution is more focused on data, data pipelines, or both. Some solutions monitor and observe things such as inputs and outputs for models, as well as drift, loss, and precision and recall accuracy for data. Others do "similar but different things", as Cohen put it, around data pipelines. Only a few solutions do both. Some of them run one next to the other, and some try to correlate between events, Cohen went on to add. If you have a problem with your data, it could mean that some servers went down, or the CPU is at 100%. If different inputs can be correlated, the issue can be identified faster, and DevOps and other teams can be notified, which is something New Relic is working on as well, Cohen noted.

Then there's the personas facet, which identifies what kind of role each solution caters to. Data-centric solutions cater to the needs of data scientists and data science leads, and perhaps also machine learning engineers and data analysts. Data pipeline-centric solutions are more oriented towards DataOps engineers, as per Cohen. Executives may also benefit from MLOps solutions, for example through dashboards that monitor the cost of training machine learning models on GPUs, or how broken models impact business KPIs.
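To make the model-side monitoring mentioned above a bit more concrete, here is a minimal sketch of the kind of input-drift check a data-centric solution might run, using a two-sample Kolmogorov-Smirnov test over a single numeric feature. The threshold, data, and function names are illustrative assumptions, not taken from any particular vendor's product.

```python
# Illustrative input-drift check: compare the production distribution of one
# numeric feature against the training-time reference. Names and thresholds
# are hypothetical.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, production: np.ndarray,
                 p_threshold: float = 0.01) -> bool:
    """Flag drift when the production values differ significantly
    from the reference distribution."""
    statistic, p_value = ks_2samp(reference, production)
    return p_value < p_threshold

# Example: a feature whose production values have shifted upward.
rng = np.random.default_rng(seed=42)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)   # training data
production = rng.normal(loc=0.4, scale=1.0, size=10_000)  # live traffic

if detect_drift(reference, production):
    print("Input drift detected: alert the data science / DataOps team")
```

In practice, such a check would run per feature and per time window, with the alert routed to whichever persona, data scientist or DataOps engineer, owns the affected model or pipeline.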
Features right, left and center
Focusing on observability may help identify contextual differences, Gavish noted. Observability into machine learning models running in production is very different from observability on the data pipelines feeding those models. There is a good amount of overlap, but there are also differences in the stack people work with. As a pipeline observability company, Monte Carlo focuses on data lakes, data warehouses, and analytics dashboards, Gavish went on to explain. An AI observability solution might focus more on the stack people use to train and then deploy machine learning models, and on the frameworks and libraries used in that context.

For Gavish and Monte Carlo, the main objective going forward is to reduce time to detection. Over the last two years, they have brought that down from what would have been weeks or months to hours. Going forward, the goal is to get closer to the minute mark; a minimal sketch of what such checks might look like follows at the end of this piece. Data issues are complex in the sense that they can be caused by an operational issue in the infrastructure, by drift in the data, or by some code change with unintended consequences. Eventually, Gavish said, they also want to help prevent incidents from happening in the first place. He claimed this is actually possible by capitalizing on what they learn from data health issues and how they are detected and resolved.

Another aspect of MLOps solutions that needs to be considered is the type of data they can apply to. Cohen noted that most solutions work with tabular data, because it's the easiest use case and a mostly solved one. Some solutions are now moving into images and audio as a way to address additional use cases and differentiate.

The one facet in Cohen's analysis that is the most involved and diverse is features. There are features right, left and center, and they also tend to cluster around each solution's focus. Data-centric solutions offer features mostly around drift, which can be data drift, or concept drift for labels. There's also data quality and data integrity, which, again, "could be the same, but kind of different", as per Cohen. And then there's monitoring bias and fairness, which is getting more attention in view of the EU AI regulations released a few months ago, plus anomaly detection, segmentation, tracking, and explainability in general. Cohen finds that those are the basics people need to get started.

Oftentimes things were not entirely straightforward even for Cohen, and he had to embark on involved research and ask vendors directly what they were doing behind the scenes. The State of MLOps is a passion project. Its roots go back to Cohen's motivation to call data scientists to action to monitor everything related to their models, not just the models themselves. As part of his writing, Cohen researched many monitoring solutions. When he revisited this space two years later, he realized there were more than 30 new companies.

Cohen's research puts the amount of money invested in the MLOps space at a staggering $3.8 billion, and he foresees consolidation in the field. Until that happens, however, The State of MLOps project is expanding to include more tools, and Cohen is increasingly, yet happily, busy trying to accommodate more requests. This work was just too good not to be shared, and it's a useful tool for anyone wanting to navigate the complex MLOps landscape.
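To ground the pipeline-side checks Gavish described, here is a minimal, hypothetical sketch of the kind of freshness and volume monitoring a data observability tool runs to shorten time to detection. The table name, SLA window, and thresholds are assumptions made for illustration; this is not Monte Carlo's actual product or API.

```python
# Illustrative data pipeline observability checks: table freshness against an
# agreed SLA, and load volume against recent history. All values are made up.
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev

def freshness_ok(last_loaded_at: datetime, max_staleness: timedelta) -> bool:
    """A table is 'fresh' if it was updated within its agreed SLA window."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_staleness

def volume_ok(daily_row_counts: list, todays_count: int,
              sigmas: float = 3.0) -> bool:
    """Flag today's load if it falls outside the historical mean +/- k sigma."""
    mu, sd = mean(daily_row_counts), stdev(daily_row_counts)
    return abs(todays_count - mu) <= sigmas * sd

# Hypothetical run: the 'orders' table last loaded 7 hours ago and delivered
# far fewer rows than usual.
last_loaded = datetime.now(timezone.utc) - timedelta(hours=7)
history = [98_000, 101_500, 99_800, 102_300, 100_400, 99_100, 101_900]

if not freshness_ok(last_loaded, max_staleness=timedelta(hours=6)):
    print("Freshness breach on 'orders': notify the DataOps on-call")
if not volume_ok(history, todays_count=12_000):
    print("Volume anomaly on 'orders': possible broken upstream job")
```

Checks like these catch the operational failures Gavish mentioned, a broken upstream job or an infrastructure outage, before a downstream model silently trains or scores on stale or incomplete data.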