Towards MLOps: A Framework and Maturity Model
Abstract—The adoption of continuous software engineering practices such as DevOps (Development and Operations) in business operations has contributed to significantly shorter software development and deployment cycles. Recently, the term MLOps (Machine Learning Operations) has gained increasing interest as a practice that brings together data scientists and operations teams. However, the adoption of MLOps in practice is still in its infancy and there are few common guidelines on how to effectively integrate it into existing software development practices. In this paper, we conduct a systematic literature review and a grey literature review to derive a framework that identifies the activities involved in the adoption of MLOps and the stages in which companies evolve as they become more mature and advanced. We validate this framework in three case companies and show how they have managed to adopt and integrate MLOps into their large-scale software development practices. The contribution of this paper is threefold. First, we review contemporary literature to provide an overview of the state-of-the-art in MLOps. Based on this review, we derive an MLOps framework that details the activities involved in the continuous development of machine learning models. Second, we present a maturity model in which we outline the different stages that companies go through in evolving their MLOps practices. Third, we validate our framework in three embedded systems case companies and map the companies to the stages in the maturity model.

Index Terms—MLOps, Framework, Maturity Model, SLR, GLR, Validation Study

I. INTRODUCTION

Machine Learning (ML) has a significant impact on the decision-making process in companies. As a result, companies can save significant costs in the long run while ensuring value for their customers [1], and ML also enables fundamentally new ways of doing business. To improve value creation and automate the end-to-end life cycle of ML, data scientists and operations teams in companies are trying to apply DevOps concepts to their ML systems [2]. DevOps is a “set of practices and tools focused on software and systems engineering” [3], built on close collaboration between developers and operations teams to improve quality of service [4]. ML models embedded in a larger software system [5] are only a small part of the overall system, so the interaction between the model and the rest of the software and its context is essential [6]. From the literature, it is apparent that ML processes are often not well integrated with continuous development and production in practice [2].

Despite the popularity of ML, there is little research on MLOps because it is a recent phenomenon. To advance understanding of how companies practice MLOps, including the collaboration between data science and operations teams, we use a Systematic Literature Review (SLR), a Grey Literature Review (GLR), and a validation study in three case companies. The paper makes three contributions:
• We conduct an SLR and a GLR to present the state-of-the-art regarding the adoption of MLOps in practice and derive a framework from the reviews
• We present a maturity model with the different stages in which companies evolve during MLOps adoption
• We validate the framework and map the three case companies to the stages of the maturity model

The remainder of the paper is organized as follows: Section II describes the background of the study, Section III describes the research methods used and Section IV addresses the threats to validity. Section V summarizes the findings from the literature review. Section VI describes the MLOps framework and maturity model. Section VII describes the validation study conducted in three case companies and Section VIII discusses the results. Section IX concludes our study.

II. BACKGROUND

This section discusses DevOps, the application of DevOps to ML systems (referred to as MLOps), and the challenges associated with it.

A. DevOps

DevOps [3] aims to “reduce the time between committing a change to a system and the change being placed into normal production while ensuring high quality” [7]. The goal is to merge development, quality assurance, and operations into a single continuous process. The key principles of DevOps are automation, continuous delivery, and rapid feedback. DevOps requires a “delivery cycle that involves planning, development, testing, deployment, release and monitoring as well as active cooperation between different team members” [3].
Continuous software engineering (SE) refers to iterative software development and related aspects such as continuous integration, continuous delivery, continuous testing and continuous deployment. Continuous SE enables development, deployment and feedback at a rapid pace [8] [9] and is divided into three phases: a) Business strategy and planning, b) Development and c) Operations. Software development activities such as continuous integration (CI) and continuous delivery (CD) support the operations phase. With CI [8], team members of software-intensive companies frequently integrate and merge development code, which enables a faster and more efficient delivery cycle and increases team productivity [9]. This facilitates the automation of software development and testing [10]. CD ensures that an application is not moved to the production phase until automated testing and quality checks have been successfully completed [11] [12]. It lowers deployment risk and cost, and provides rapid feedback to users [13] [14].
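As a concrete illustration of such a quality gate, the following minimal Python sketch promotes a build to production only if every automated check succeeds. The stage commands, test layout and deploy step are our own illustrative assumptions, not taken from the cited studies.

```python
# Minimal sketch of a CD quality gate: a build is promoted to production
# only if every automated check succeeds. Stage commands, test layout and
# the deploy step are illustrative assumptions, not from the cited studies.
import subprocess
import sys

STAGES = [
    ("unit tests", ["pytest", "tests/unit"]),
    ("integration tests", ["pytest", "tests/integration"]),
    ("static quality checks", ["flake8", "src"]),
]

def run_stage(name, command):
    """Run one pipeline stage and report whether it succeeded."""
    print(f"[pipeline] running stage: {name}")
    return subprocess.run(command).returncode == 0

if __name__ == "__main__":
    for name, command in STAGES:
        if not run_stage(name, command):
            # A failing check blocks promotion, which is the essence of CD:
            # the application never reaches production in a broken state.
            sys.exit(f"[pipeline] stage '{name}' failed; deployment blocked")
    print("[pipeline] all checks passed; promoting build to production")
```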
B. MLOps

With the successful adoption of DevOps, companies are looking for continuous practices in the development of ML systems. To unify the development and operation of ML systems, MLOps [5] extends DevOps principles [15]. In addition to traditional unit and integration testing, CI introduces additional testing procedures such as data and model validation. From the perspective of CD, processed datasets and trained models are automatically and continuously delivered by data scientists to ML systems engineers. From the perspective of continuous training (CT), the introduction of new data and model performance degradation require a trigger to retrain the model or improve model performance through online methods. In addition, appropriate monitoring facilities ensure proper execution of operations.
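As a rough sketch of such a CT trigger, the Python snippet below retrains when the monitored accuracy degrades or enough new data has accumulated. The thresholds, the metric source and the retraining entry point are hypothetical assumptions of ours, not a prescription from the literature.

```python
# Sketch of a continuous-training (CT) trigger: retrain when the monitored
# model performance degrades or enough new data has arrived. The metric
# source, thresholds and retraining entry point are hypothetical.
from dataclasses import dataclass

ACCURACY_THRESHOLD = 0.90    # assumed minimum acceptable live accuracy
NEW_DATA_THRESHOLD = 10_000  # assumed sample count that triggers retraining

@dataclass
class MonitoringSnapshot:
    live_accuracy: float  # accuracy measured on recent production traffic
    new_samples: int      # data points collected since the last training run

def should_retrain(snapshot):
    """Trigger retraining on performance degradation or new-data volume."""
    degraded = snapshot.live_accuracy < ACCURACY_THRESHOLD
    enough_new_data = snapshot.new_samples >= NEW_DATA_THRESHOLD
    return degraded or enough_new_data

if __name__ == "__main__":
    snapshot = MonitoringSnapshot(live_accuracy=0.87, new_samples=2_500)
    if should_retrain(snapshot):
        # In a real pipeline this would start data validation, training and
        # model validation rather than just printing.
        print("triggering retraining pipeline")
```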
C. Challenges associated with MLOps

In our own previous research [16] [17], we have identified a number of challenges related to the business case, data, modeling and deployment of ML or Deep Learning (DL) models. These include high AI costs and expectations, a scarcity of data scientists, the need for large datasets, privacy concerns and noisy data, lack of domain experts, labeling issues, increasing feature complexity, improper feature selection, introduction of bias when experimenting with models, highly complex DL models, the need for deep DL knowledge, difficulty in determining the final model, the model execution environment, a growing number of hyperparameter settings, and verification and validation. They also include limited DL deployment, integration issues, internal deployment, the need for an understandable model, training-serving skew, end-user communication, model drift, and maintaining robustness. Some of the challenges in MLOps practice [5] include tracking and comparing experiments, lack of version control, difficulty in deploying models, insufficient purchasing budgets and a challenging regulatory environment.

III. RESEARCH METHOD

The main objective of the study is to identify the activities associated with the adoption of MLOps and the stages in which companies evolve as they gain maturity and become more advanced. To achieve this objective, we developed the following research questions:
• RQ1: What is the state-of-the-art regarding the adoption of MLOps in practice and the different stages that companies go through in evolving their MLOps practices?
• RQ2: How do case companies evolve and advance their MLOps practices?
We performed an SLR [18] [19], a GLR [20] [21] and a validation case study [22] to address the two RQs.

A. SLR and GLR

The goal of the SLR is to find, examine and interpret relevant studies on the topic of interest [18] [19]. To answer the RQs, we defined search strings according to [18] and searched five popular scientific libraries. Figure 1 shows an overview of the SLR and GLR process used in this study. We integrated and exported the relevant studies into an Excel spreadsheet for deeper analysis. In the SLR, we included conference and journal studies that reported on MLOps. We excluded studies that were duplicate versions, published in a language other than English, not peer-reviewed, or not available electronically on the Internet.

Fig. 1. Overall SLR and GLR process used in the study

We conducted the GLR [20] to provide a detailed description of the state-of-practice and practitioner experiences in adopting MLOps. Compared to the SLR, the GLR provides the voice of practitioners on the topic under study. In the GLR, we included studies found through Google Search that address MLOps and are published in English in PDF format, as well as documents from companies, obtained by filtering sites under the domain name “.com”. To improve the reliability of the results retrieved in the GLR, we excluded peer-reviewed scientific articles and other sources of knowledge such as blogs, posts, etc.

For the SLR and the GLR, we used the search query “MLOps” OR “Machine Learning Operations” and restricted the search to the period between January 1, 2015 and March 31, 2021. This time interval was chosen because the term MLOps became prevalent after the concept of “Hidden Technical Debt in Machine Learning Systems” [6] was introduced in 2015. Based on the SLR and the GLR, we shortlisted 6 SLR studies ([23]-[28]) and 15 GLR studies ([29]-[43]). Based on these studies, we developed an MLOps framework and the various stages that companies go through in evolving their MLOps practices.

B. Validation Case Study

Following [44], we conducted a validation study to map companies to the stages in the maturity model derived from the literature reviews. Case study methodology is an empirical research approach based on an in-depth study of a contemporary phenomenon in its real-world environment that is difficult to study in isolation [45]. In SE, case studies are used to better understand how and why SE was done and thus improve the SE process and the resulting software products [46]. Throughout the validation study, we worked closely with practitioners in each case company. Table I provides a brief description of each case company, the practitioners (P*,
W*, M*, and S* represent interview, workshop, meeting, and stand-up meeting participants respectively) and their roles.

TABLE I
DESCRIPTION OF PRACTITIONERS IN THE VALIDATION STUDY

Case Company        Practitioner ID   Role
Telecommunications  P1, W1, S1        Senior Data Engineer
                    P2, W2, S2        Data Scientist
                    P3, W3, S3        Data Scientist
                    P4, W4, S4        Data Scientist
                    W5, S5            Senior Data Scientist
                    W6, S6            Data Scientist
                    W7, S7            Software Developer
                    W8, S8            Software Developer
                    S9                Operational Product Owner
                    S10               Sales Director
Automotive          W9                Expert Engineer
                    W10               Expert Engineer
Packaging           M1                Solution Architect
                    M2                Data Scientist

Case Companies: We present the three case companies and the use cases that were investigated in each company as part of our validation study.
1. Hardware Screening: The telecommunications company predicts faults in hardware to minimize the amount of hardware returned by customers for repair. In this use case, they focus on a) returning defect-free hardware back to the customer and b) sending defective hardware to the repair center.
2. Self-driving Vehicles: The automotive company strives to provide autonomous transportation solutions. The main use case is self-driving vehicles to increase productivity. The company also needs to ensure that the failure rate is low in this safety-critical use case.
3. Defect Detection: The packaging company provides packaging solutions as well as machines to customers. One of the main use cases is the detection of defects in finished/semi-finished packages.

Data Collection and Analysis: For data collection, we used interview studies, workshops, meetings and stand-up meetings in the companies. They were held in English via video conferencing. All interviews lasted 45 minutes, workshops and meetings lasted 30 minutes to one hour, and daily stand-up meetings lasted 15 minutes. We validated the MLOps framework in the case companies and present the different stages that companies go through when implementing MLOps. Transcripts from interviews and notes from workshops, meetings and stand-ups were used to capture the empirical data. Later, they were shared with the other authors by the primary author for detailed analysis. We applied elements of open coding to analyse and categorize the collected empirical data [47]. In order to obtain different perspectives on the topic under study, triangulation was used [48].

IV. THREATS TO VALIDITY

Potential validity threats were considered and minimized in this study [49]. Construct validity was improved by considering information from the SLR, the GLR and the validation case study. The authors and practitioners involved in this study are well versed in MLOps. Multiple techniques (semi-structured interviews, workshops, meetings, and stand-up meetings) and multiple sources (senior data engineer, data scientist, software developer, expert engineer, etc.) were used to collect and validate empirical data. Internal validity threats caused by faulty conclusions due to primary author bias in data selection or interpretation were mitigated by consulting with the other two authors. By extending our research to additional case companies, generalization of the results can be justified and thus threats to external validity can be mitigated.

V. LITERATURE REVIEW FINDINGS

Based on the SLR and the GLR, we extract insights from the literature to give an overview of the state-of-the-art of MLOps in practice. The findings are divided into three parts: a) Data for ML Development, b) ML Model Development and c) Release of ML Models. Below, we discuss each part in detail.

A. Data for ML Development

Aggregating heterogeneous data from different data sources [32] [31] [41], preprocessing it [27] and extracting relevant features are necessary to provide data for ML development. Later, the features are registered in a feature store [42], which can be used for the development of any ML model [42] and used
for inference when deploying the model. Also, the data points are stored in the data repository [39] after versioning. The data collected from various sources has to be properly stored and managed. Data anonymization and encryption [27] should be performed to comply with data regulations (e.g., GDPR [25]).
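To illustrate how a feature store and data versioning might interact, the following toy Python sketch registers a content-hashed feature set that can then be fetched identically for model development and for inference. The FeatureStore class and its methods are stand-ins of ours, not the API of any particular product.

```python
# Toy sketch of feature registration: feature sets are versioned by content
# hash and fetched identically for training and inference. FeatureStore is
# a stand-in class, not the API of any particular product.
import hashlib
import json

class FeatureStore:
    """In-memory feature store keyed by (feature set name, version)."""
    def __init__(self):
        self._store = {}

    def register(self, name, rows):
        # Version by content hash so every registered feature set is
        # reproducible and traceable, as data versioning requires.
        payload = json.dumps(rows, sort_keys=True).encode()
        version = hashlib.sha256(payload).hexdigest()[:12]
        self._store[(name, version)] = rows
        return version

    def get(self, name, version):
        return self._store[(name, version)]

store = FeatureStore()
rows = [{"device_id": 1, "temp_mean": 41.2, "restarts": 3}]
version = store.register("hardware_health", rows)       # used for training...
print(version, store.get("hardware_health", version))   # ...and for inference
```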
B. ML Model Development

In ML model development, provisions should be made to run experiments in parallel, optimize the chosen model with hyperparameters, and finally evaluate the model to ensure that it fits the business case. After versioning, the code is stored in the code repository [42] [23]. The model repository [39] keeps track of the models that will be used in production, and the metadata repository contains all the information about the models (e.g., hyperparameter information). Data scientists can collaborate on the same code base, which also allows them to run the code in different environments and against a variety of datasets. This facilitates scaling, the ability to track the execution of multiple experiments, and reproducibility [29].
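A minimal sketch of the experiment tracking implied here, assuming a simple in-memory record rather than any specific metadata repository, could look as follows; the hyperparameter grid and the scoring stub are placeholders.

```python
# Sketch of tracking parallel experiments: each run's hyperparameters and
# score are recorded so runs stay comparable and reproducible. The grid,
# the scoring stub and the record layout are placeholder assumptions.
import itertools
import json
import random

def train_and_evaluate(learning_rate, depth):
    """Placeholder training routine returning a fake validation score."""
    random.seed(hash((learning_rate, depth)) % 2**32)  # deterministic stub
    return round(random.uniform(0.70, 0.95), 3)

experiments = []
for run_id, (lr, depth) in enumerate(itertools.product([0.01, 0.1], [3, 5])):
    score = train_and_evaluate(lr, depth)
    # This record is the kind of information a metadata repository keeps
    # about each candidate model (hyperparameters plus evaluation result).
    experiments.append({"run": run_id, "lr": lr, "depth": depth, "score": score})

best = max(experiments, key=lambda e: e["score"])
print(json.dumps(experiments, indent=2))
print(f"candidate for the model repository: run {best['run']}")
```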
C. Release of ML Models

To release ML models, package [41], validate [41] and deploy the models [40] to production [41]. When deploying a model to production, it has to be integrated with other models as well as with existing applications [30] [41]. Once in production, the model serves requests. Although training is often a batch process, inference can be served through a REST endpoint/custom code, a streaming engine, micro-batches, etc. [35]. When performance drops, monitor the model [41] and enable the data feedback loop [41] to retrain the models. In a fully mature MLOps context, continuous integration and delivery are performed through a CI/CD pipeline, and continuous retraining through a CT pipeline [41] [31].
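As one possible shape for such serving, the sketch below exposes a REST-style inference endpoint using only Python's standard library and logs every request/prediction pair so that monitoring and a data feedback loop can pick them up later; the stub model, port and endpoint layout are our assumptions.

```python
# Sketch of a REST inference endpoint that feeds a monitoring loop: every
# request/prediction pair is logged so performance can be tracked and a
# data feedback loop can supply retraining data. The model is a stub and
# the endpoint shape is an illustrative assumption.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

FEEDBACK_LOG = []  # stand-in for a data repository capturing live traffic

def predict(features):
    """Stub model: flags hardware as faulty if restarts exceed a bound."""
    return int(features.get("restarts", 0) > 5)

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)
        label = predict(features)
        # Persist inputs and outputs for monitoring and later retraining.
        FEEDBACK_LOG.append({"features": features, "prediction": label})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"prediction": label}).encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), InferenceHandler).serve_forever()
```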
From the literature review, we see that successful AI/ML operationalization ensures a safe, traceable, testable, and repeatable path for developing, training, deploying, and updating ML models in different environments [30]. The use of MLOps enables automation, versioning, reproducibility, etc., through successful collaboration across the required skills, such as data engineer, data scientist, and ML engineer/developer [40] [29]. For example, data scientists must acquire SE skills such as modularization, testing, and versioning [36]. Supporting processes formalized in policies serve as the basis for governance [31] and can be automated to ensure solution reliability and compliance [31]. MLOps also supports explainability (GDPR regulation [25]) and audit trails [40].
VI. MLOPS FRAMEWORK AND MATURITY MODEL

Based on the SLR and the GLR, we derive an MLOps framework that identifies the activities involved in MLOps adoption. Figure 2 depicts the MLOps framework. The entire framework is divided into three pipelines: a) Data Pipeline, b) Modeling Pipeline and c) Release Pipeline. After collecting the data relevant to the ML models from data sources, preprocessing of the data and feature extraction are performed. Once a suitable model has been experimented with and optimized with hyperparameters, the model is evaluated and packaged for production deployment. If performance degrades, retraining of the model is triggered by initiating a data feedback loop. Versioned data and code are stored in the data repository and the code repository. To track the deployable model version, it is stored in the model registry. Deployment cycles of ML models can be shortened using CI/CD/CT.
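To illustrate how the three pipelines compose, the following sketch chains placeholder data, modeling and release stages; each function merely stands in for the activities described above and is not a concrete implementation of the framework.

```python
# Sketch of the three-pipeline structure: data pipeline -> modeling
# pipeline -> release pipeline. Each stage is a placeholder for the
# activities described in the framework, not a real implementation.
def data_pipeline(raw_data):
    preprocessed = [x for x in raw_data if x is not None]  # preprocessing
    return [{"value": x} for x in preprocessed]            # feature extraction

def modeling_pipeline(features):
    # Stand-in for experimenting, optimizing and evaluating a model.
    return {"weights": len(features), "validated": True}

def release_pipeline(model):
    if model["validated"]:  # validation gate before packaging/deployment
        print("packaging and deploying model:", model)
    return model

if __name__ == "__main__":
    release_pipeline(modeling_pipeline(data_pipeline([1, None, 3])))
```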
MLOps Maturity Model: Based on the SLR and the GLR, we present a maturity model in which we outline four stages in which companies evolve when adopting MLOps practices. The four stages are a) Automated Data Collection, b) Automated Model Deployment, c) Semi-automated Model Monitoring and d) Fully-automated Model Monitoring. These stages capture key transition points in the adoption of MLOps in practice. Below, we detail each MLOps stage and the preconditions for a company to reach it.
A. Automated Data Collection: At this stage, companies start from manual processing of data, models, deployment and monitoring. With the adoption of MLOps, the company experiences a transition from this manual process to automated data collection for the (re)training process.
Preconditions: For the transition from a manual process to automated data collection, there is a need for a mechanism to aggregate data from different data sources so that the data can be stored and accessed whenever required [32]. In addition, it demands the capability to integrate and process new data sources, regardless of their variety, volume or velocity [31]. It also requires infrastructure resources for automated data collection [34], data preparation and collaboration [38]. Also, standardized and automated pipelines help to drive the ingestion, transformation and storage of analytic data into a database or data lake [31]. The same feature manipulation applied during training has to be replicated at inference time [35]. AI teams can promote trust by addressing data management challenges such as accountability, transparency, regulation and compliance, and ethics [37].
B. Automated Model Deployment: Companies at this stage have manual model deployment and monitoring. With the adoption of MLOps, they undergo a transition from manual model deployment and monitoring to automated deployment of the retrained model.
Preconditions: The transition can be achieved by implementing provisions for automated model deployment to environments [43] [39] [38], especially across development, Q/A and production environments [43] [34]. It encourages deployment freedom on-premise, in the cloud and at the edge [34] [38]. Automated deployment of the retrained model can be achieved by providing a dedicated infrastructure-centric CI/CD pipeline [31] and integration with DevOps for automation, scale and collaboration [35]. Sufficient infrastructure choices for deployment include model hosting, evaluation, and maintenance [32], as well as the means to register, package (containerization [38] [24]) and deploy models [43] [40] [39], and the integration of reusable software environments for training and deploying models [43]. Tracking experiments [43] [39] [40] and models [31],
Fig. 2. MLOps Framework
VIII. DISCUSSION