The industrialization wall: the difficulties of putting a Machine Learning model into production

14/04/2022

MLOps

For a long time, the main objective of a Data Scientist has been to find the best algorithmic recipe to answer a given business problem. To facilitate this prototyping phase, many tools have emerged such as open-source libraries and Data Science platforms; the latter even offer a no-code experience.

However, an essential aspect of an artificial intelligence project has been ignored for too long: industrialization. We must not lose sight of the fact that only AI in production, i.e. a system whose results (forecasts, recommendations, etc.) are made available to its end users, can enable significant productivity gains for companies.

According to Gartner, nearly 85% of AI projects fail to go into production. And for those that do, the conclusion is clear: the costs are high and the constraints numerous. So what explains this "industrialization wall"?

Deploy and redeploy the Machine Learning model

Once the algorithmic recipe has been validated during prototyping, it becomes necessary to confront the model with dynamic data, arriving in real time. The production environment must have a stable and efficient data connector (databases, web APIs, cloud storage spaces, etc.).
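As an illustration, below is a minimal sketch of such a connector for a hypothetical REST endpoint, written in Python with the requests library. The URL, timeout and retry policy are placeholders to adapt to the actual data source.

```python
import time

import requests


def fetch_latest_records(url: str, retries: int = 3, backoff_seconds: float = 2.0) -> list:
    """Fetch the most recent records from a (hypothetical) REST endpoint.

    Retries transient failures so that a short network glitch
    does not take the prediction service down with it.
    """
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == retries:
                raise
            time.sleep(backoff_seconds * attempt)  # simple linear backoff


# Hypothetical endpoint serving the live data consumed by the model
records = fetch_latest_records("https://example.com/api/latest-readings")
```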

The prototyping results are then challenged by this live data, which all too often differs from the data used during prototyping. The approach therefore needs to be revised.

After this first test, the code must gain in maintainability and performance in order to go into production, withstand the load and absorb each evolution. This refactoring is a necessary step; optimizing the code too far upstream would be counterproductive, reducing the agility of prototyping. The production environment must lend itself to this conversion of methods between prototyping and production.
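To make the idea concrete, here is a hedged sketch of what this refactoring can look like: an ad-hoc notebook-style snippet turned into an explicit, testable function. The column names are hypothetical placeholders.

```python
import pandas as pd

# Prototype-style notebook cell: hard-coded values, hidden assumptions,
# hard to test or reuse as is:
#   df = pd.read_csv("data.csv"); df = df[df.amount > 0]; df["ratio"] = df.amount / df.total


def prepare_features(df: pd.DataFrame, amount_col: str = "amount", total_col: str = "total") -> pd.DataFrame:
    """Production-style equivalent: explicit inputs and outputs, no hidden state, easy to unit-test.

    Column names are hypothetical; adapt them to the real schema.
    """
    clean = df[df[amount_col] > 0].copy()
    clean["ratio"] = clean[amount_col] / clean[total_col]
    return clean
```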

"Scaling up and integrating with production information systems requires skills specific skills which multiplies the initial cost of the project."

Scaling up and integrating with production information systems requires specific and highly sought-after skills, and therefore high costs: ML Engineers, DevOps, Developers. This multiplies the initial cost of the project. A manual, non-capitalized approach to these steps makes going into production possible in theory, but expensive, long and perilous from one launch to the next.

The difficulties become even greater when it comes to modifying and updating the algorithmic recipe, which rarely remains fixed once it has gone into production. Each redeployment is very time-consuming, as it requires the production stages to be repeated. Evolutions thus become very heavy to redeploy, which can lead either to keeping unsuitable solutions in production or to spending a great deal of time on redeployments.

Friction during the production process often leads to the abandonment of solutions: too heavy, too risky, too time-consuming.

Monitoring production

Once the model is deployed, permanent supervision is necessary so that Data teams can be alerted to service malfunctions, if possible before they affect users, and can diagnose and correct them.

Two different types of malfunctions can occur. 

  • First possible malfunction: users no longer receive results. This can be due to an error in the code, an unanticipated case, unavailability of the source data...
  • Second possible malfunction, more specific to Machine Learning services: users receive predictions, but they are of poor quality. This is the case, for example, when a predictive maintenance service raises too many false alarms or a recommendation engine sends inadequate or uninteresting proposals. In this situation the system is operational from a software point of view, but not from a business point of view, which can lead to lower adoption of AI through a loss of confidence in its results.

For the first type of malfunction, having service logs in order to trace the error and debug it is key. The observability of operations is crucial here: it drastically reduces the time needed to identify and correct the problem.
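As a rough sketch, assuming a scikit-learn-style model exposing a predict method, the prediction call can be wrapped with Python's standard logging module so that every request leaves a trace that can be followed when something breaks:

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("prediction_service")


def predict_with_logging(model, features: dict) -> float:
    """Wrap a prediction call with enough context to trace any failure afterwards."""
    request_id = str(uuid.uuid4())  # correlates the log lines belonging to one request
    # Log the feature names (not the values) received with this request
    logger.info("request %s received features=%s", request_id, sorted(features))
    try:
        prediction = model.predict([list(features.values())])[0]
    except Exception:
        # The full stack trace ends up in the logs, which is what makes debugging fast
        logger.exception("request %s failed", request_id)
        raise
    logger.info("request %s prediction=%s", request_id, prediction)
    return prediction
```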

In addition, to prevent service interruptions related to input data, Data Scientists or ML Engineers need to implement data quality checks and validations (e.g. checks on data types, data loss, missing values, outliers, renamed data fields, etc.).
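A minimal sketch of such checks, assuming the incoming batch arrives as a pandas DataFrame; the column names, dtypes and thresholds are illustrative placeholders:

```python
import pandas as pd


def validate_batch(df: pd.DataFrame, expected_schema: dict, max_missing_ratio: float = 0.05) -> list:
    """Return the data-quality issues found in an incoming batch of data.

    `expected_schema` maps column names to pandas dtypes; names, dtypes and
    thresholds are illustrative placeholders.
    """
    issues = []
    for column, dtype in expected_schema.items():
        if column not in df.columns:
            issues.append(f"missing column: {column}")
            continue
        if str(df[column].dtype) != dtype:
            issues.append(f"unexpected dtype for {column}: {df[column].dtype} instead of {dtype}")
        missing_ratio = df[column].isna().mean()
        if missing_ratio > max_missing_ratio:
            issues.append(f"too many missing values in {column}: {missing_ratio:.1%}")
    return issues


incoming_batch = pd.DataFrame({"temperature": [21.5, None, 23.1], "pressure": [1.01, 1.02, 1.03]})
print(validate_batch(incoming_batch, {"temperature": "float64", "pressure": "float64"}))
# -> ['too many missing values in temperature: 33.3%']
```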

"The observability of operations is crucial here. It drastically reduces service identification and correction times."

The second type of malfunction is much harder to detect. Identifying a loss of mathematical performance in a Machine Learning model requires more complex strategies to deal with it. One of the main issues is to set up continuous performance checks and to raise alerts when performance degrades. We are talking here about drift management, re-training strategies and, in some cases, model modification.
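As an illustration only, a simple form of continuous performance check keeps a sliding window of recent predictions for which the ground truth has come back, and raises an alert when accuracy falls below the prototyping baseline. The window size and tolerated drop below are arbitrary placeholders:

```python
from collections import deque


class PerformanceMonitor:
    """Track accuracy on the most recent labelled predictions and flag a drift.

    Window size and alert threshold are illustrative; they must be tuned per use case.
    """

    def __init__(self, baseline_accuracy: float, window_size: int = 500, tolerated_drop: float = 0.05):
        self.baseline = baseline_accuracy
        self.tolerated_drop = tolerated_drop
        self.window = deque(maxlen=window_size)

    def record(self, prediction, ground_truth) -> None:
        # Only predictions whose ground truth has arrived can be recorded
        self.window.append(prediction == ground_truth)

    def check(self) -> bool:
        """Return True if current accuracy has drifted below the tolerated level."""
        if not self.window:
            return False
        current_accuracy = sum(self.window) / len(self.window)
        return current_accuracy < self.baseline - self.tolerated_drop


monitor = PerformanceMonitor(baseline_accuracy=0.92)
monitor.record(prediction=1, ground_truth=0)
if monitor.check():
    print("alert: model performance has drifted, consider re-training")
```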

To support deployment and maintenance, Data teams will need, among other things, versioning and tracking tools: to be able to go back to any previous state (with the right version of data, code, model, etc.) in a simple and secure way, and to compare models with each other.
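Dedicated tools exist for this (MLflow, DVC, etc.), but the underlying idea can be sketched without them: every deployed model is stored together with the metadata needed to find it again, reproduce it and compare it with others. The registry layout and the data_version identifier below are assumptions of this sketch:

```python
import hashlib
import json
import pickle
from datetime import datetime, timezone
from pathlib import Path


def save_model_version(model, metrics: dict, data_version: str, registry_dir: str = "model_registry") -> str:
    """Persist a model together with the metadata needed to reproduce and compare it.

    `data_version` is whatever identifier the team uses for the training dataset
    (a hash, a snapshot date, a DVC tag, ...); it is an assumption of this sketch.
    """
    payload = pickle.dumps(model)
    version = hashlib.sha256(payload).hexdigest()[:12]  # content-based version id
    version_dir = Path(registry_dir) / version
    version_dir.mkdir(parents=True, exist_ok=True)
    (version_dir / "model.pkl").write_bytes(payload)
    (version_dir / "metadata.json").write_text(json.dumps({
        "version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data_version": data_version,
        "metrics": metrics,
    }, indent=2))
    return version


version_id = save_model_version(
    model={"weights": [0.1, 0.2]},          # stand-in for a real trained model object
    metrics={"auc": 0.91},
    data_version="2022-04-01-snapshot",
)
```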

The road to production is long and winding for Machine Learning services to deliver long-term value.

Optimization of the algorithmic recipe, management of the production environment, code refactoring, model updating, drift management, re-training, etc. These are all key steps and points of friction that can jeopardize the industrialization of an AI project.

All these tasks lie at the crossroads of the skill sets of a Data Scientist and a DevOps engineer.

This is why the emergence and adoption of MLOps (Machine Learning Operations) is so important: it finally provides Data teams with a methodology and tools to serenely cross the industrialization wall.

By allowing a reduction in production costs and an increase in the reliability of results, MLOps has everything it takes to make artificial intelligence really take off.

A platform compatible with the entire ecosystem

AWS, Azure, Google Cloud, OVH Cloud, scikit-learn, PyTorch, TensorFlow, XGBoost, Jupyter, PC, Python, R, Rust, MongoDB
