For a long time, the main objective of a Data Scientist has been to find the best algorithmic recipe to answer a given business problem. To facilitate this prototyping phase, many tools have emerged such as open-source libraries and Data Science platforms; the latter even offer a no-code experience.
However, an essential aspect of an artificial intelligence project has been ignored for too long: industrialization. We must not lose sight of the fact that only AI in production, i.e. a system whose results (forecasts, recommendations, etc.) are made available to its end users, can enable significant productivity gains for companies.
According to Gartner, nearly 85% of AI projects fail to go into production. And for those that do, the conclusion is clear: the costs are high and the constraints numerous. So what explains this "industrialization wall"?
Deploy and redeploy the Machine Learning model
Once the algorithmic recipe has been validated during prototyping, it becomes necessary to confront the model with dynamic data, arriving in real time. The production environment must have a stable and efficient data connector (databases, web APIs, cloud storage spaces, etc.).
The prototyping results are then challenged by the real-time data, which too often differ from the prototyping data; the method therefore needs to be revised.
After this first test, the code must gain in maintainability and performance in order to go into production while holding the load and supporting each evolution. This refactoring is a necessary step; optimizing the code too far upstream would be counterproductive, reducing the agility of prototyping. The production environment must lend itself to this conversion of methods between prototyping and production.
"Scaling up and integrating with production information systems requires specific skills, which multiplies the initial cost of the project."
Scaling up and integrating with production information systems requires specific and highly sought-after (and therefore expensive) profiles, such as ML Engineers, DevOps engineers and Developers, which multiplies the initial cost of the project. A manual, non-capitalized approach to these steps makes going into production possible in theory, but expensive, long and perilous from one deployment to the next.
The difficulties become even greater when it comes to modifying and updating the algorithmic recipe, which is rarely fixed once it has gone into production. Each redeployment is very time-consuming, as it requires the production steps to be repeated. Evolutions therefore become heavy to ship, which can lead teams to keep unsuitable solutions in production or to spend a lot of time on redeployments.
Friction during the production process often leads to the abandonment of solutions: too heavy, too risky, too time-consuming.
Once the model is deployed, permanent supervision is necessary so that Data teams can be alerted to service malfunctions, as early as possible, and diagnose and correct them.
Two different types of malfunctions can occur.
- First possible malfunction: users no longer receive results. This can be due to an error in the code, an unanticipated edge case, unavailability of the source data...
- Second possible malfunction, more specific to Machine Learning services: users receive predictions, but they are of poor quality. This is the case, for example, when a predictive maintenance service raises too many false alarms or a recommendation engine sends inadequate or uninteresting proposals. In this situation, the system is operational from a software point of view, but not from a business point of view, which can lead to lower adoption of AI due to a loss of confidence in its results.
For the first case of malfunction, having service logs to trace the error and debug it will be key. The observability of operations is crucial here: it drastically reduces the time needed to identify and correct the failure.
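As a minimal sketch of this kind of observability, each prediction request can emit a structured log entry with a request identifier, outcome and latency, so that a failure can be traced back and debugged quickly. The `model.predict` callable and the field names below are illustrative assumptions, not a prescribed API.

```python
# Hypothetical sketch: structured logging around a prediction call.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("prediction-service")

def predict_with_logging(model, features):
    """Score `features` and emit one structured log line per request."""
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    try:
        result = model.predict(features)
        logger.info(json.dumps({
            "request_id": request_id,
            "status": "ok",
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        }))
        return result
    except Exception as exc:
        # Log the failure with enough context to trace and debug it later.
        logger.error(json.dumps({
            "request_id": request_id,
            "status": "error",
            "error": repr(exc),
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        }))
        raise
```

With logs in this shape, both successes and failures can be aggregated and searched by `request_id`, which is exactly what shortens diagnosis time.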
In addition, to prevent service interruptions related to input data, Data Scientists or ML Engineers need to implement data quality checks and validations (e.g. checks for data types, data loss, missing values, outliers, renamed data fields, ...).
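A minimal sketch of such input checks, run before scoring a batch: the expected fields and numeric bounds below are illustrative assumptions, and a real service would typically rely on a dedicated validation library.

```python
# Hypothetical sketch: basic data quality checks on incoming records.
def validate_batch(records, expected_fields, numeric_bounds):
    """Return a list of human-readable issues found in `records`.

    records: list of dicts (one per row)
    expected_fields: set of field names every row must contain
    numeric_bounds: {field: (low, high)} plausible value ranges
    """
    issues = []
    for i, row in enumerate(records):
        # Detect missing or renamed fields.
        missing = expected_fields - row.keys()
        if missing:
            issues.append(f"row {i}: missing fields {sorted(missing)}")
        for field, (low, high) in numeric_bounds.items():
            value = row.get(field)
            if value is None:
                issues.append(f"row {i}: '{field}' is missing or null")
            elif not isinstance(value, (int, float)):
                # Detect type changes in the source data.
                issues.append(f"row {i}: '{field}' has type {type(value).__name__}")
            elif not low <= value <= high:
                # Detect outliers relative to plausible bounds.
                issues.append(f"row {i}: '{field}'={value} outside [{low}, {high}]")
    return issues
```

Rejecting or quarantining a batch when `validate_batch` returns issues prevents bad input data from silently breaking the service downstream.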
"The observability of operations is crucial here. It drastically reduces the time needed to identify and correct failures."
The second type of malfunction is much harder to detect. Identifying a loss of mathematical performance in a Machine Learning model requires more complex strategies. One of the main challenges is to set up continuous performance checks and raise alerts when performance drops. We are talking here about drift management, re-training strategies and, in some cases, model modification.
To support deployment and maintenance, Data teams will need, among other things, versioning and tracking tools: to be able to go back to any previous state (with the right version of the data, code, model, etc.) in a simple and secure way, and to compare models with each other.
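The idea of versioning and tracking can be sketched as a run registry that records, for each training run, the code version, data version, parameters and metrics, so any previous state can be restored and runs can be compared. The in-memory dict below is purely illustrative; real teams would use a dedicated experiment-tracking tool or model registry.

```python
# Hypothetical sketch: a tiny registry of training runs.
import datetime
import hashlib
import json

class RunRegistry:
    def __init__(self):
        self.runs = {}

    def register(self, code_version, data_version, model_params, metrics):
        """Store one run; the id is derived from its code/data/params."""
        payload = json.dumps(
            {"code": code_version, "data": data_version, "params": model_params},
            sort_keys=True,
        )
        run_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.runs[run_id] = {
            "code": code_version,
            "data": data_version,
            "params": model_params,
            "metrics": metrics,
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        return run_id

    def best(self, metric):
        # Compare runs on a given metric, e.g. to pick a rollback target.
        return max(self.runs.items(), key=lambda kv: kv[1]["metrics"][metric])
```

Because each run keeps its exact code and data versions, rolling back to "the model that worked last month" becomes a lookup rather than an archaeology exercise.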
The road to production is long and winding for Machine Learning services to deliver long-term value.
Optimizing the algorithmic recipe, managing the production environment, refactoring code, updating the model, managing drift, re-training, etc.: these are all key steps and points of friction that can jeopardize the industrialization of an AI project.
All these functions sit at the crossroads of the fields of competence of a Data Scientist and a DevOps engineer.
This is why the emergence and adoption of MLOps (Machine Learning Operations) is so important: it finally provides Data teams with a methodology and tools to serenely cross the industrialization wall.
By reducing production costs and increasing the reliability of results, MLOps has everything it takes to make artificial intelligence truly take off.