MLOps
MLOps Principles
As machine learning and AI propagate in software products and services, we
need to establish best practices and tools to test, deploy, manage, and monitor
ML models in real-world production. In short, with MLOps we strive to
avoid “technical debt” in machine learning applications.
SIG MLOps defines “an optimal MLOps experience [as] one where Machine
Learning assets are treated consistently with all other software assets within a
CI/CD environment. Machine Learning models can be deployed alongside the
services that wrap them and the services that consume them as part of a
unified release process.” By codifying these practices, we hope to accelerate
the adoption of ML/AI in software systems and fast delivery of intelligent
software. In the following, we describe a set of important concepts in MLOps
such as Iterative-Incremental Development, Automation, Continuous
Deployment, Versioning, Testing, Reproducibility, and Monitoring.
The complete MLOps process includes three broad phases of “Designing the
ML-powered application”, “ML Experimentation and Development”, and “ML
Operations”.
The first phase is devoted to business understanding, data understanding, and designing the ML-powered software. In this stage, we identify our potential users, design the machine learning solution that solves their problem, and assess the further development of the project. Mostly, we act within two categories of problems: either increasing the productivity of the user or increasing the interactivity of our application.
Initially, we define ML use-cases and prioritize them. The best practice for ML
projects is to work on one ML use case at a time. Furthermore,
the design phase aims to inspect the available data that will be needed to train
our model and to specify the functional and non-functional requirements of our
ML model. We should use these requirements to design the architecture of the ML application, establish the serving strategy, and create a test suite for the future ML model.
The follow-up phase, “ML Experimentation and Development”, is devoted to verifying the applicability of ML for our problem by implementing a proof of concept for the ML model. Here, we iterate over different steps, such as identifying or refining the suitable ML algorithm for our problem, data engineering, and model engineering. The primary goal in this phase is to deliver a stable, high-quality ML model that we will run in production.
The main focus of the “ML Operations” phase is to deliver the previously developed ML model into production by using established DevOps practices such as testing, versioning, continuous delivery, and monitoring.
All three phases are interconnected and influence each other. For example, a design decision made during the design stage will propagate into the experimentation phase and finally influence the deployment options during the final operations phase.
Automation
The level of automation of the Data, ML Model, and Code pipelines determines the maturity of the ML process. With increased maturity, the velocity for training new models also increases. The objective of an MLOps team is to automate the deployment of ML models into the core software system or as a service component. This means automating the complete ML workflow without any manual intervention. Triggers for automated model training and deployment can be calendar events, messaging, monitoring events, as well as changes to the data, the model training code, or the application code.
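As an illustration of a data-change trigger, the following minimal sketch polls a data directory and starts the ML workflow when its contents change; the `DATA_DIR` location and the `train_and_deploy()` entry point are hypothetical, and a real setup would typically rely on a workflow orchestrator or a message queue instead of polling.

```python
import hashlib
import time
from pathlib import Path

DATA_DIR = Path("data/raw")  # hypothetical location of the training data


def data_fingerprint(data_dir: Path) -> str:
    """Hash file names, sizes, and modification times to detect data changes."""
    if not data_dir.is_dir():
        return ""
    digest = hashlib.sha256()
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            stat = path.stat()
            digest.update(f"{path.name}:{stat.st_size}:{stat.st_mtime_ns}".encode())
    return digest.hexdigest()


def train_and_deploy() -> None:
    """Placeholder for the automated ML workflow (train, validate, deploy)."""
    print("New data detected: triggering the ML pipeline ...")


if __name__ == "__main__":
    last_seen = data_fingerprint(DATA_DIR)
    while True:
        current = data_fingerprint(DATA_DIR)
        if current != last_seen:  # a data change acts as the trigger
            train_and_deploy()
            last_seen = current
        time.sleep(3600)  # poll once per hour; a scheduler or message queue is more typical
```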
Automated testing helps to discover problems quickly and at an early stage. This enables fast fixing of errors and learning from mistakes.
To adopt MLOps, we see three levels of automation, starting from the initial level with manual model training and deployment, up to running both ML and CI/CD pipelines automatically:
1. Manual process. At the initial level, model training and deployment are performed entirely manually.
2. ML pipeline automation. The next level includes the automatic execution of model training. Here we introduce continuous training of the model: whenever new data is available, the process of model retraining is triggered. This level of automation also includes data and model validation steps.
3. CI/CD pipeline automation. At the final level, both the ML pipeline and the CI/CD pipeline run automatically, so that pipeline components and the resulting models are built, tested, and deployed without manual intervention.
The following picture shows the automated ML pipeline with CI/CD routines:
| MLOps Stage | Output |
| --- | --- |
| Development & Experimentation (ML algorithms, new ML models) | Source code for pipelines: data extraction, validation, preparation, model training, model evaluation, model testing |
| Model Continuous Delivery (model serving for prediction) | Deployed model prediction service (e.g., model exposed as REST API) |
| Monitoring (collecting data about the model performance on live data) | Trigger to execute the pipeline or to start a new experiment cycle |
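To make the “model exposed as REST API” output concrete, here is a minimal serving sketch, assuming FastAPI and a scikit-learn model serialized to `model.joblib`; both the path and the feature schema are illustrative assumptions.

```python
# Minimal model-serving sketch: expose a trained model as a REST API.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI(title="ML prediction service")
model = joblib.load("model.joblib")  # hypothetical path to the trained model artifact


class PredictionRequest(BaseModel):
    features: list[float]  # flat feature vector; adapt to the real schema


@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}
```

Assuming the file is saved as `serving.py`, the service could then be started with `uvicorn serving:app`.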
After analyzing the MLOps Stages, we might notice that the MLOps setup
requires several components to be installed or prepared. The following table
lists those components:
| MLOps Setup Component | Description |
| --- | --- |
| ML Pipeline Orchestrator | Automating the steps of the ML experiments |
Continuous X
To understand model deployment, we first specify the “ML assets”: the ML model, its parameters and hyperparameters, training scripts, and training and testing data. We are interested in the identity, components, versioning, and dependencies of these ML artifacts. The target destination for an ML artifact may be a (micro-)service or some infrastructure component. A deployment service provides orchestration, logging, monitoring, and notification to ensure that the ML models, code, and data artifacts are stable.
MLOps is an ML engineering culture that includes the following practices:
- Continuous Integration (CI) extends testing and validating code and components by adding testing and validating data and models.
- Continuous Delivery (CD) concerns the delivery of an ML training pipeline that automatically deploys another service (the model prediction service).
- Continuous Training (CT) is unique to ML systems and is concerned with automatically retraining and serving the models.
- Continuous Monitoring (CM) concerns monitoring production data and model performance metrics, which are bound to business metrics.
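As a sketch of how CI can be extended with data and model validation, the following pytest checks could run on every commit; the file paths, column names, and accuracy threshold are illustrative assumptions.

```python
# test_ci_checks.py -- illustrative CI checks for data and model (run with `pytest`).
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

EXPECTED_COLUMNS = {"age", "income", "label"}  # assumed schema
MIN_ACCURACY = 0.80                            # assumed release threshold


def test_data_matches_schema():
    df = pd.read_csv("data/validation.csv")    # hypothetical held-out data
    assert EXPECTED_COLUMNS.issubset(df.columns)
    assert df["label"].isin([0, 1]).all()
    assert not df[["age", "income"]].isnull().any().any()


def test_model_meets_quality_bar():
    df = pd.read_csv("data/validation.csv")
    model = joblib.load("model.joblib")        # hypothetical model artifact
    predictions = model.predict(df[["age", "income"]])
    assert accuracy_score(df["label"], predictions) >= MIN_ACCURACY
```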
Versioning
The goal of versioning is to treat ML training scripts, ML models, and data sets for model training as first-class citizens in DevOps processes by tracking ML models and data sets with version control systems. The common reasons why an ML model and data change (according to SIG MLOps) are the following:
- Models may be retrained based upon new training data or new training approaches.
- Models may degrade over time.
- Models may be subject to attack and require revision.
- Corporate or government compliance may require audit or investigation of both the ML model and the data.
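As a minimal illustration of treating data sets as versioned, first-class citizens, the sketch below records a content hash of the training data next to each trained model; a dedicated tool such as DVC would normally take care of this, and all paths are hypothetical.

```python
# Sketch: record which dataset version a model was trained on by hashing the data file.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Compute a content hash that identifies this exact version of the file."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_training_run(dataset: Path, model_path: Path, registry: Path) -> None:
    """Append a line linking the model artifact to the dataset version it was trained on."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": str(dataset),
        "dataset_sha256": file_sha256(dataset),
        "model": str(model_path),
    }
    with registry.open("a") as log:
        log.write(json.dumps(entry) + "\n")


# Example (hypothetical paths):
# record_training_run(Path("data/train.csv"), Path("model.joblib"), Path("training_runs.jsonl"))
```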
Experiments Tracking
Machine learning development is a highly iterative and research-centric process. In contrast to the traditional software development process, in ML development multiple experiments on model training can be executed in parallel before deciding which model will be promoted to production.
One way to track multiple experiments is to use different (Git) branches, each dedicated to a separate experiment. The output of each branch is a trained model. Depending on the selected metric, the trained ML models are compared with each other and the appropriate model is selected. Such low-friction
branching is fully supported by the tool DVC, which is an extension of Git and
an open-source version control system for machine learning projects. Another
popular tool for ML experiments tracking is the Weights and Biases
(wandb) library, which automatically tracks the hyperparameters and metrics of
the experiments.
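A minimal experiment-tracking sketch with wandb might look as follows; the project name, hyperparameters, and the `train()` helper are illustrative assumptions.

```python
# Sketch of experiment tracking with Weights & Biases (wandb).
import wandb


def train(learning_rate: float, epochs: int) -> float:
    """Placeholder training routine that returns a validation accuracy."""
    return 0.9  # stand-in metric


run = wandb.init(project="mlops-demo", config={"learning_rate": 0.01, "epochs": 10})
accuracy = train(run.config.learning_rate, run.config.epochs)
wandb.log({"val_accuracy": accuracy})  # logged metrics can be compared across runs
wandb.finish()
```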
Testing
Figure source: “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” by E. Breck et al., 2017.
Action: Use subsets of features (leaving one of the k features out at a time) and train a set of different models.
Features and data pipelines should be policy-compliant (e.g. GDPR). These
requirements should be programmatically checked in both development
and production environments.
Feature creation code should be tested by unit tests (to capture bugs in
features).
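A unit test for feature creation code could look like the following sketch, where `bucketize_age` is a hypothetical feature function used only for illustration.

```python
# test_features.py -- unit tests for feature creation code (run with `pytest`).
import pytest


def bucketize_age(age: int) -> str:
    """Example feature: map a raw age to a categorical bucket."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


def test_bucketize_age_boundaries():
    assert bucketize_age(0) == "minor"
    assert bucketize_age(18) == "adult"
    assert bucketize_age(65) == "senior"


def test_bucketize_age_rejects_invalid_input():
    with pytest.raises(ValueError):
        bucketize_age(-1)
```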
Model staleness test. The model is defined as stale if the trained model
does not include up-to-date data and/or does not satisfy the business
impact requirements. Stale models can affect the quality of prediction in
intelligent software.
Action: Run an A/B experiment with older models. Include a range of model ages to produce an Age vs. Prediction Quality curve that helps determine how often the ML model should be retrained.
Action: Use an additional test set, which is disjoint from the training and
validation sets. Use this test set only for a final evaluation.
Action: Collect more data that includes potentially under-represented
categories.
ML infrastructure test
Training the ML models should be reproducible, which means that training
the ML model on the same data should produce identical ML models.
Action: Write unit tests that randomly generate input data and train the model for a single optimization step (e.g., one gradient descent update); see the sketch after this list.
Action: Crash tests for model training. The ML model should restore
from a checkpoint after a mid-training crash.
Action: Create a fully automated test that regularly triggers the entire
ML pipeline. The test should validate that the data and code
successfully finish each stage of training and the resulting ML model
performs as expected.
All integration tests should be run before the ML model reaches the
production environment.
Testing that the model in the training environment gives the same score as
the model in the serving environment.
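A reproducibility unit test in the spirit of the first action above might look like this sketch; the tiny scikit-learn model stands in for the real training code.

```python
# test_reproducibility.py -- train for a single optimization step twice with the same
# seed and randomly generated input data, then check that the resulting weights match.
import numpy as np
from sklearn.linear_model import SGDClassifier


def train_single_step(seed: int) -> np.ndarray:
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(32, 4))             # randomly generated input data
    y = rng.integers(0, 2, size=32)
    model = SGDClassifier(random_state=seed, tol=None)
    model.partial_fit(X, y, classes=[0, 1])  # one pass of gradient-descent updates
    return model.coef_.copy()


def test_single_training_step_is_reproducible():
    np.testing.assert_array_equal(train_single_step(seed=42), train_single_step(seed=42))
```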
Monitoring
Once the ML model has been deployed, it needs to be monitored to assure that it performs as expected. The following checklist of model monitoring activities in production is adapted from “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” by E. Breck et al., 2017:
Monitor dependency changes throughout the complete pipeline and notify about them: data version changes, changes in the source system, and dependency upgrades.
Monitor data invariants in training and serving inputs: Alert if data does not
match the schema, which has been specified in the training step.
Monitor whether training and serving features compute the same value.
Since the generation of training and serving features might take place in physically separate locations, we must carefully test that these different code paths are logically identical.
Action: (1) Log a sample of the serving traffic. (2) Compute distribution
statistics (min, max, avg, values, % of missing values, etc.) on the
training features and the sampled serving features and ensure that they
match.
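A sketch of such a distribution check might look as follows; the statistics chosen, the drift tolerance, and the file names are illustrative assumptions.

```python
# Sketch: compare distribution statistics of training features against a logged
# sample of serving features and report features that drifted.
import pandas as pd


def feature_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per-feature summary statistics (numeric features assumed)."""
    return pd.DataFrame({
        "min": df.min(),
        "max": df.max(),
        "mean": df.mean(),
        "missing_ratio": df.isnull().mean(),
    })


def check_training_serving_skew(train: pd.DataFrame, serving: pd.DataFrame,
                                mean_tolerance: float = 0.1) -> list[str]:
    """Return the names of features whose mean drifted beyond the relative tolerance."""
    train_stats, serving_stats = feature_stats(train), feature_stats(serving)
    drift = (train_stats["mean"] - serving_stats["mean"]).abs()
    scale = train_stats["mean"].abs().clip(lower=1e-9)
    return list(drift[drift / scale > mean_tolerance].index)


# Example (hypothetical files):
# skewed = check_training_serving_skew(pd.read_csv("train_features.csv"),
#                                      pd.read_csv("serving_sample.csv"))
# if skewed:
#     print("Alert: training/serving skew detected for", skewed)
```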
Monitor degradation of the predictive quality of the ML model on served data. Both dramatic and slow-leak regressions in prediction quality should trigger a notification.
The picture below shows that model monitoring can be implemented by tracking the precision, recall, and F1-score of the model's predictions over time. A decrease in precision, recall, and F1-score triggers model retraining, which leads to model recovery.
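A monitoring sketch along these lines could compute precision, recall, and F1-score on labelled serving data and flag when quality drops; the threshold and the shape of the daily batches are illustrative assumptions.

```python
# Sketch: track prediction quality over time and trigger retraining on degradation.
from sklearn.metrics import precision_score, recall_score, f1_score

F1_THRESHOLD = 0.75  # assumed minimum acceptable quality


def evaluate_window(y_true, y_pred) -> dict:
    """Quality metrics for one window (e.g., one day) of labelled serving data."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


def monitor(daily_batches) -> None:
    """daily_batches yields (date, y_true, y_pred) tuples of labelled serving data."""
    for date, y_true, y_pred in daily_batches:
        metrics = evaluate_window(y_true, y_pred)
        print(date, metrics)
        if metrics["f1"] < F1_THRESHOLD:
            print(f"{date}: F1 dropped below {F1_THRESHOLD}, triggering model retraining")
            # trigger_retraining()  # hypothetical hook into the training pipeline
```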
The “ML Test Score” measures the overall readiness of the ML system for
production. The final ML Test Score is computed as follows:
For each test, half a point is awarded for executing the test manually, with
the results documented and distributed.
A full point is awarded if there is a system in place to run that test automatically on a repeated basis.
Sum the score of each of the four sections individually: Data Tests, Model
Tests, ML Infrastructure Tests, and Monitoring.
The final ML Test Score is computed by taking the minimum of the scores
aggregated for each of the sections: Data Tests, Model Tests, ML
Infrastructure Tests, and Monitoring.
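The scoring rule above can be expressed in a few lines; the example test names and their manual/automated status are placeholders.

```python
# Sketch of the ML Test Score aggregation: 0.5 points for a manually executed and
# documented test, 1.0 for an automated one; the final score is the minimum of the
# four section sums.
SECTION_TESTS = {
    "data_tests": {"feature_distributions": 1.0, "privacy_compliance": 0.5},
    "model_tests": {"staleness": 0.5, "hyperparameter_tuning": 1.0},
    "ml_infrastructure_tests": {"training_reproducibility": 1.0, "pipeline_integration": 0.5},
    "monitoring": {"training_serving_skew": 0.5, "prediction_quality": 0.5},
}


def ml_test_score(sections: dict) -> float:
    section_scores = {name: sum(tests.values()) for name, tests in sections.items()}
    return min(section_scores.values())


print(ml_test_score(SECTION_TESTS))  # -> 1.0 (monitoring is the weakest section)
```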
After computing the ML Test Score, we can reason about the readiness of the ML system for production. The following table provides the interpretation ranges:

| Points | Description |
| --- | --- |
| 0 | More of a research project than a productionized system. |
| (0, 1] | Not totally untested, but it is worth considering the possibility of serious holes in reliability. |
| (1, 2] | There has been a first pass at basic productionization, but additional investment may be needed. |
| (2, 3] | Reasonably tested, but it is possible that more of those tests and procedures may be automated. |
| (3, 5] | Strong levels of automated testing and monitoring, appropriate for mission-critical systems. |
| > 5 | Exceptional levels of automated testing and monitoring. |
Reproducibility
Reproducibility in a machine learning workflow means that every phase of data processing, ML model training, and ML model deployment should produce identical results given the same input.
| Phase | Challenge | How to ensure reproducibility |
| --- | --- | --- |
| Collecting Data | Generation of the training data can't be reproduced (e.g., due to constant database changes or random data loading) | 1) Always back up your data. 2) Save a snapshot of the data set (e.g., on cloud storage). 3) Design data sources with timestamps so that a view of the data at any point can be retrieved. 4) Data versioning. |
Additionally, Gene Kim et al., recommend to “use a loosely coupled
architecture. This affects the extent to which a team can test and deploy their
applications on demand, without requiring orchestration with other services.
Having a loosely coupled architecture allows your teams to work
independently, without relying on other teams for support and services, which
in turn enables them to work quickly and deliver value to the organization.”
PyScaffold
PyScaffold is a project generator for bootstrapping a standardized Python project structure; it can be used to set up ML projects that follow the project structure conventions summarized below.
… the deployment process, which might range between *manual deployment* and a *fully automated CI/CD pipeline*.

| Metric | Question | MLOps interpretation |
| --- | --- | --- |
| Change Failure Rate | What percentage of changes to production or released to users result in degraded service (e.g., lead to service impairment or service outage) and subsequently require remediation (e.g., require a hotfix, rollback, fix forward, patch)? | The ML Model Change Failure Rate can be expressed as the difference between the performance metrics of the currently deployed ML model and those of the previous model, such as precision, recall, F1-score, accuracy, AUC, ROC, false positives, etc. The ML Model Change Failure Rate is also related to A/B testing. |
In ML-based systems, a build can be triggered by a combination of code change, data change, or model change. The following table summarizes the MLOps principles for building ML-based software:
| MLOps Principles | Data | ML Model | Code |
| --- | --- | --- | --- |
| Versioning | 1) Data preparation pipelines 2) Feature store 3) Datasets 4) Metadata | 1) ML model training pipeline 2) ML model (object) 3) Hyperparameters 4) Experiment tracking | 1) Application code 2) Configurations |
| Reproducibility | 1) Backup data 2) Data versioning 3) Extract metadata 4) Versioning of feature engineering | 1) Hyperparameter tuning is identical between dev and prod 2) The order of features is the same 3) Ensemble learning: the combination of ML models is the same 4) The model pseudo-code is documented | 1) Versions of all dependencies in dev and prod are identical 2) Same technical stack for dev and production environments 3) Reproducing results by providing container images or virtual machines |
| Deployment | 1) Feature store is used in dev and prod environments | 1) Containerization of the ML stack 2) REST API 3) On-premise, cloud, or edge | 1) On-premise, cloud, or edge |
| Monitoring | 1) Data distribution changes (training vs. serving data) 2) Training vs. serving features | 1) ML model decay 2) Numerical stability 3) Computational performance of the ML model | 1) Predictive quality of the application on serving data |
Along with the MLOps principles, following this set of best practices should help reduce the “technical debt” of the ML project:
| MLOps Best Practices | Data | ML Model | Code |
| --- | --- | --- | --- |
| Documentation | 1) Data sources 2) Decisions on how/where to get data 3) Labelling methods | 1) Model selection criteria 2) Design of experiments 3) Model pseudo-code | 1) Deployment process 2) How to run locally |
| Project Structure | 1) Data folder for raw and processed data 2) A folder for the data engineering pipeline 3) Test folder for data engineering methods | 1) A folder that contains the trained model 2) A folder for notebooks 3) A folder for feature engineering 4) A folder for ML model engineering | 1) A folder for bash/shell scripts 2) A folder for tests 3) A folder for deployment files (e.g., Docker files) |