requirement in detail.

Infrastructure Layer
This layer provides computing resources to host and execute platform services, pipelines, governance applications, and CI/CD automation services. The infrastructure needs to be flexible and portable to prevent vendor lock-in and enable the rapid (re)deployment of pipelines (R1). A reproducible and versionable infrastructure supports auditing and debugging infrastructure changes and allows switching between different versions of platform services, pipelines, and ML models. The other desired features of the infrastructure layer include multi-cloud support, auto-scalability, and hardware accelerators (R2). Multi-cloud support allows different teams in an organization to use the best possible cloud and tools for building and deploying their models. Auto-scaling enables scaling training pipelines and models (serving components) up and down automatically to cope with fluctuating data and serving requests. Hardware accelerators may be necessary to speed up pipeline execution during experimentation and (re)training and to improve real-time serving latency.

An MLOps environment can include separate development, staging, and production environments (R3). These environments' configurations (e.g., tools and hardware resources) may vary. Infrastructure as code (IaC) can be used to automate the provisioning and management of such environments while preserving their reproducibility and auditability (R4).
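To make R4 concrete, the following is a minimal sketch of the IaC idea in Python. The environment names, resource sizes, and the provision helper are hypothetical stand-ins for what a real IaC tool (e.g., Terraform or Pulumi) would manage.

```python
from dataclasses import dataclass

# Hypothetical, version-controlled environment descriptions; any IaC tool
# could realize the same declarative idea.
@dataclass(frozen=True)
class EnvironmentSpec:
    name: str            # e.g., "staging" or "production"
    node_count: int      # size of the training/serving cluster
    gpu_enabled: bool    # whether hardware accelerators are attached
    region: str

STAGING = EnvironmentSpec(name="staging", node_count=2, gpu_enabled=False, region="eu-west-1")
PRODUCTION = EnvironmentSpec(name="production", node_count=8, gpu_enabled=True, region="eu-west-1")

def provision(spec: EnvironmentSpec) -> None:
    """Hypothetical helper that would call the cloud provider's API.

    Because the spec is declarative and versioned, re-running the same call
    reproduces or audits an environment at any point in time (R1, R4).
    """
    print(f"provisioning {spec.name}: {spec.node_count} nodes, gpu={spec.gpu_enabled}")

if __name__ == "__main__":
    for env in (STAGING, PRODUCTION):
        provision(env)
```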
Platform Layer
This layer facilitates applying platform thinking [11] to build a self-serve MLOps platform that empowers the different actors involved in creating, deploying, and maintaining ML models. It supports managing the entire lifecycle of models and related artifacts such as data and ML code. In this section, we present the platform capabilities that we synthesized from the gray literature.

Pipeline Development and Execution
The pipelines implement ML processes such as data pre-processing, feature engineering, (re)training, and prediction. A pipeline consists of steps that must be executed in a specific order. The platform should empower developers to publish, share, and reuse pipelines to enable fast development and automated (re)deployment of pipelines (R5). Moreover, the pipeline steps need to be implemented as modular containerized components that can be easily reused and composed (R6). The pipeline execution can adopt a choreography model or an orchestration model (R7). The former needs an event bus to facilitate the event/message-driven coordination of pipeline steps. The latter defines the pipeline as a workflow model that a workflow engine can execute centrally.
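As an illustration, the sketch below contrasts the two execution models in plain Python. The step functions, event names, and the EventBus class are hypothetical simplifications of what a workflow engine or message broker would provide.

```python
from typing import Callable, Dict, List

# --- Orchestration: a workflow engine executes the steps centrally, in order.
def run_orchestrated(steps: List[Callable[[dict], dict]], context: dict) -> dict:
    for step in steps:
        context = step(context)   # each step is a reusable, containerizable unit (R6)
    return context

# --- Choreography: steps react to events published on a bus; no central coordinator.
class EventBus:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self._subscribers.setdefault(event, []).append(handler)

    def publish(self, event: str, payload: dict) -> None:
        for handler in self._subscribers.get(event, []):
            handler(payload)

# Example wiring: the training step reacts to "features_ready" instead of being called directly.
bus = EventBus()
bus.subscribe("features_ready", lambda payload: print("train on", payload["feature_set"]))
bus.publish("features_ready", {"feature_set": "customers_v3"})
```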
Experimentation, Training, and Testing
The experimentation of a pipeline requires its deployment, execution, and debugging, followed by the analysis and interpretation of the produced artifacts (e.g., models and features). Notebooks are commonly used for experiments. However, they are not recommended in production due to the difficulties of versioning, instrumentation, and automated execution. Hence, the platform should provide services to export notebooks into deployable pipelines (R8). Moreover, the platform needs to offer a service to record and query the metadata of each experiment to enable reproducing and troubleshooting experiments (R9). The typical metadata includes, but is not limited to, code versions, data versions, configuration files, output artifacts, and performance metrics.
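One possible realization of R9 is an experiment-tracking library such as MLflow; the sketch below is illustrative, and the experiment name, parameter values, and metric values are made up.

```python
import mlflow

# Record the metadata of one experiment run so it can be queried,
# reproduced, and compared later (R9).
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("code_version", "git:3f9c2e1")        # illustrative values
    mlflow.log_param("data_version", "customers-2024-01")
    mlflow.log_param("learning_rate", 0.01)

    # ... train the model here ...

    mlflow.log_metric("accuracy", 0.93)
    mlflow.log_dict({"epochs": 10, "batch_size": 32}, "train_config.json")  # configuration as an artifact
```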
Platform services for training may improve the performance and reliability of training jobs through strategies such as model check-pointing, distributed model training, exploitation of specific heterogeneous hardware (e.g., GPU and TPU accelerators), prioritizing training activities, training on a slice of the data set, and AutoML (Automated Machine Learning) (R10 and R11). Check-pointing enables incrementally training a model using more iterations and recovering from failures during training. Distributed training needs specific middleware capable of running a training job elastically over multiple compute nodes. A scheduler service can queue and prioritize the training jobs, enabling the policing of training activities, for example, capping the amount of data used by training or preventing long-running jobs from blocking the deployment of critical tasks such as security or bug fixes. Finally, AutoML can simplify and accelerate building ML applications and lower domain experts' barriers to developing their ML models.
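The following is a minimal, self-contained sketch of the check-pointing idea (R10); the checkpoint path and the JSON-based state are hypothetical, and a real training job would persist model weights and optimizer state instead.

```python
import json
import os

CHECKPOINT = "checkpoints/train_state.json"   # hypothetical location

def train_with_checkpointing(total_epochs: int = 10) -> None:
    """Sketch of R10: resume an interrupted training job from its last checkpoint."""
    state = {"epoch": 0, "weights_version": 0}
    if os.path.exists(CHECKPOINT):                      # recover from a previous failure
        with open(CHECKPOINT) as f:
            state = json.load(f)

    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of (possibly distributed) training would run here ...
        state = {"epoch": epoch + 1, "weights_version": state["weights_version"] + 1}
        os.makedirs(os.path.dirname(CHECKPOINT), exist_ok=True)
        with open(CHECKPOINT, "w") as f:                # persist progress incrementally
            json.dump(state, f)

if __name__ == "__main__":
    train_with_checkpointing()
```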
Table 1: Requirements for an MLOps Environment

Infrastructure
R1. Portability, reproducibility, and versionability. Sources: S2, S4, S12, S15, S18, S35
R2. Auto-scaling and use of GPU and hardware accelerators. Sources: S3, S5, S19, S28, S35, S37, S38, S44, S52, S53
R3. Cater for different environments (e.g., test and production). Sources: S2, S3, S6, S17, S27
R4. Manage the infrastructure using IaC (Infrastructure as Code). Sources: S7, S12, S22, S54

Pipeline Development and Execution
R5. Create, publish, discover, use, and customize pipelines. Sources: S3, S5, S17, S26, S28, S29, S32, S33, S34, S35, S42, S44, S47, S49, S53, S56, S58
R6. Modular and reusable pipelines and components. Sources: S3, S14, S17, S19, S32, S41, S49, S58
R7. Execution of pipelines via orchestration or choreography. Sources: S2, S7, S10, S26, S52, S53, S58

Experimentation, Training, and Testing
R8. Export experimental code from notebooks into pipelines. Sources: S4, S17, S22, S49
R9. Record and query experiments and training runs. Sources: S7, S9, S16, S22, S23, S25, S30, S33, S38, S53, S58
R10. Apply training scaling strategies such as model check-pointing, distributed training, use of hardware accelerators, and training with data set slices. Sources: S2, S3, S4, S8, S9, S10, S23, S24, S36, S38, S53
R11. Use automated machine learning tools. Sources: S7, S10, S11, S16, S29, S58
R12. Validation tests for all ML artifacts (e.g., model, code, and data). Sources: S3, S4, S5, S9, S24, S25, S33, S39, S45, S47, S58
R13. Prioritization and scheduling of tests and training jobs. Sources: S3, S4, S33

Model Deployment and Serving
R14. Ensure the compatibility of the model with the target infrastructure. Sources: S17, S22, S44, S58
R15. Use open model formats for portable and flexible deployments. Sources: S7, S9, S10, S16, S24, S35, S38, S40, S57
R16. Package models for ease of deployment, integration, and testing. Sources: S3, S17, S28, S48, S49
R17. Support different patterns of deploying, releasing, and serving models. Sources: S1, S2, S7, S9, S17, S23, S24, S26, S28, S31, S33, S34, S35, S51, S58

Monitoring and Feedback Loops
R18. Realtime monitoring of and alerting for various issues in models, data, pipelines, and infrastructure. Sources: S1, S2, S3, S4, S7, S10, S11, S13, S15, S16, S17, S19, S20, S23, S25, S26, S28, S29, S31, S32, S33, S34, S35, S43, S44, S48, S52, S54, S57, S58
R19. Automated triggering of corrective actions for alerts. Sources: S1, S2, S16, S17, S25, S35
R20. Generation of custom actionable dashboards for models. Sources: S2, S3, S10, S34, S40, S54
R21. Support different pipeline triggering models. Sources: S2, S3, S4, S5, S7, S9, S17, S19, S20

Model Life Cycle Management and Governance
R22. ML asset storage and marketplace. Sources: S1, S2, S3, S9, S16, S17, S18, S19, S21, S23, S26, S35, S50, S51, S53, S56, S58
R23. Version control and lineage tracking for ML artifacts. Sources: S2, S4, S5, S7, S9, S20, S22, S23, S25, S26, S31, S33, S34, S35, S41, S45, S57, S58
R24. ML asset metadata management. Sources: S9, S14, S16, S17, S24, S35, S49, S58
R25. Access control and privacy compliance for ML artifacts. Sources: S2, S5, S7, S10, S11, S12, S24, S28, S54, S56, S58
R26. Ensure adversarial robustness and interpretability of models. Sources: S2, S7, S10, S19, S22, S24, S34, S41, S43, S54, S58

General Platform Services
R27. Support for general data platform services such as data catalog, data storage, data discovery, data exploration, data augmentation/labeling, and data fusion. Sources: S7, S10, S11, S19, S23, S24, S31, S33, S36, S38, S50, S53, S58
R28. Support for general ML platform services such as feature engineering, model exploration, model selection, hyper-parameter tuning, and model validation. Sources: S2, S4, S6, S9, S26, S33, S58, and most other articles

CI/CD Automation
R29. Treat ML artifacts as first-class citizens in DevOps processes. Sources: S1, S2, S3, S4, S5, S7, S8, S9, S22, S25, S27, S29, S31, S32, S35, S45, S55, S57
R30. Integrate unit, integration, and smoke tests of ML artifacts into CI/CD pipelines. Sources: S8, S9, S24, S25, S26, S33, S53, S58
All ML artifacts must be tested appropriately; the key testable artifacts are data, code, and model (R12). Data tests can verify that input data and features do not exhibit data quality issues such as malformed data, anomalies, and mismatches with the expected schemas and distributions. Such tests can prevent training-serving skew, namely differences in model performance between training and serving, which can be caused by differences between training data and serving data. Similar to the testing of conventional software applications, source code tests can assess the quality of the ML code, e.g., violation of best practices, existence of known defects, and resource efficiency of training tasks. Finally, model validation tests can verify model fairness and consistency.

As model training and data processing can be expensive and time-consuming, the execution of tests needs to be prioritized and scheduled appropriately, for example, running a subset of tests or long-running tests during off-hours and training on small datasets (R13).
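A data test of the kind described for R12 might look like the following sketch; the expected schema and value ranges are illustrative, and dedicated data-validation tools could replace the hand-written checks.

```python
import pandas as pd

EXPECTED_COLUMNS = {"age": "int64", "income": "float64"}   # illustrative schema

def validate_features(df: pd.DataFrame) -> list[str]:
    """Sketch of a data test (R12): detect schema mismatches and malformed
    values before they cause training-serving skew."""
    issues = []
    for column, dtype in EXPECTED_COLUMNS.items():
        if column not in df.columns:
            issues.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            issues.append(f"unexpected type for {column}: {df[column].dtype}")
    if "age" in df.columns and ((df["age"] < 0) | (df["age"] > 120)).any():
        issues.append("age outside expected range")
    return issues

if __name__ == "__main__":
    sample = pd.DataFrame({"age": [34, -1], "income": [52000.0, 61000.0]})
    print(validate_features(sample))   # -> ['age outside expected range']
```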
Deployment and Serving
When deploying a model to the target serving environment, it is necessary to ensure its compatibility with the infrastructure regarding compute resources, software dependencies, and model formats (R14). As ML libraries may use specific formats for their models, a model translation service may be necessary to convert models into an open model format for portable and flexible deployments (R15). A platform can also offer an image-building service to package models, scoring code, and dependencies into container images (R16). In this way, the operational team can quickly deploy a model into staging and production environments, test it, and integrate it with the applications that need to consume the predictions from the model.
Figure 1: Layered Reference Architecture for an MLOps Environment
The platform must support different strategies for deploying/releasing and serving models (R17). Standard model deployment approaches include shadow, canary, and blue/green deployment. A shadow deployment does not immediately release the new model to users but instead uses the production traffic to test the model. A canary deployment releases the new model to users incrementally. In contrast, a blue/green deployment immediately deploys and releases the new model to users.
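The sketch below illustrates the routing logic behind a canary release (R17). In practice the traffic split is usually handled by the serving infrastructure (e.g., a gateway or service mesh); the model scorers and traffic share shown here are placeholders.

```python
import random

# Hypothetical registry of two deployed model versions.
MODELS = {
    "stable": lambda features: 0.12,     # placeholder scorers standing in for real models
    "canary": lambda features: 0.15,
}
CANARY_TRAFFIC_SHARE = 0.05              # route roughly 5% of requests to the new model

def predict(features: dict) -> float:
    """Sketch of a canary release (R17): expose the new model to a small,
    gradually increasing fraction of live traffic."""
    variant = "canary" if random.random() < CANARY_TRAFFIC_SHARE else "stable"
    return MODELS[variant](features)

print(predict({"age": 34}))
```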
Common patterns of serving predictions from a model include model-as-service, precompute, and model-as-dependency. Model-as-service exposes the model as a web service or a messaging endpoint. Precompute computes predictions for a batch of input data and stores them to serve the clients later. In the model-as-dependency pattern, the application embeds the model as a binary and loads it at runtime to make predictions. The application can also download the model at runtime from a model registry. Serving methods can also be classified into offline and online serving. Each method may need specific platform services, for example, a middleware to run batch and streaming data processing pipelines.
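As a hedged example of the model-as-service pattern, the following sketch exposes a placeholder model behind an HTTP endpoint using FastAPI; the feature schema, scoring logic, and module name are illustrative.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    age: int
    income: float

def score(features: Features) -> float:
    # Placeholder for a model loaded from the model registry at startup.
    return 0.5 if features.income > 50000 else 0.2

@app.post("/predict")
def predict(features: Features) -> dict:
    """Model-as-service: the model is exposed behind a web endpoint."""
    return {"churn_probability": score(features)}

# Run with: uvicorn serving_app:app --port 8080   (module name is illustrative)
```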
Monitoring and Feedback Loops
The platform layer should continuously monitor various quality issues in models and data at runtime, generating alerts and triggering corrective actions (R18). Two typical quality issues are data drift and concept drift, which refer to changes in the statistical properties of the model input and the target variable, respectively. Both can contribute to the degradation of model performance over time, i.e., model drift. The metrics related to the stability of the model need to be monitored continuously to detect drifts, and the model needs to be retrained regularly in response to drift alerts. Pipelines should also be monitored continuously, as their failures can prevent updating models. The logs of a pipeline can be collected and analyzed to gauge its health. Common performance metrics such as resource usage, execution time, and throughput should also be collected for pipelines and inference services. Such metrics can be used to trigger auto-scaling services to scale pipelines and models up and down.

Platform services can automate monitoring, alerting, and enacting corrective actions (R19). For example, the developer should be able to develop and run a monitoring pipeline that processes logs and metrics and generates alerts, and to define the rules that select and trigger actions on alerts. In addition, a dashboard service can support creating interactive dashboards that allow monitoring models, pipelines, and infrastructure and troubleshooting suspicious or poorly performing models and failed pipelines (R20).
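A monitoring pipeline step for data drift (R18) with an attached corrective action (R19) could resemble the sketch below; the drift test (a two-sample Kolmogorov-Smirnov test), the threshold, and the sample values are illustrative choices, not prescriptions from the reviewed sources.

```python
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01   # illustrative sensitivity for the drift alert

def check_feature_drift(training_values, serving_values) -> bool:
    """Sketch of R18: compare the live distribution of a feature against the
    training distribution and flag data drift when they diverge."""
    _, p_value = ks_2samp(training_values, serving_values)
    return p_value < P_VALUE_THRESHOLD

def on_drift_alert(feature_name: str) -> None:
    # A corrective action could be to trigger the retraining pipeline (R19, R21).
    print(f"ALERT: drift detected on '{feature_name}', scheduling retraining")

training_income = [48_000, 52_000, 50_500, 49_000, 51_200, 47_800]
serving_income = [63_000, 66_500, 64_200, 65_800, 67_100, 62_900]
if check_feature_drift(training_income, serving_income):
    on_drift_alert("income")
```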
The platform also needs to support common approaches for triggering continuous training pipelines (R21): metrics-driven, schedule-driven, event-driven, and ad-hoc manual. In the metrics-driven approach, data and model performance metrics are measured and used to determine pipeline (re)execution. The schedule-driven strategy triggers the pipelines at a specified time or regularity. In the event-driven model, events, such as changes to the model's source code or the availability of a new training data set, trigger pipelines. Finally, a human operator can manually execute a pipeline.
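The four triggering approaches can be combined in a single dispatch rule, as in the following sketch; the schedule hour, accuracy floor, and function signature are hypothetical.

```python
import datetime

# Hypothetical triggering rules for a continuous-training pipeline (R21).
RETRAIN_SCHEDULE_HOUR = 2                 # schedule-driven: nightly at 02:00
ACCURACY_FLOOR = 0.90                     # metrics-driven: retrain when quality drops

def should_trigger(now: datetime.datetime,
                   live_accuracy: float,
                   new_data_arrived: bool,
                   manual_request: bool) -> bool:
    schedule_driven = now.hour == RETRAIN_SCHEDULE_HOUR
    metrics_driven = live_accuracy < ACCURACY_FLOOR
    event_driven = new_data_arrived
    return schedule_driven or metrics_driven or event_driven or manual_request

print(should_trigger(datetime.datetime(2024, 1, 15, 2, 0),
                     live_accuracy=0.94,
                     new_data_arrived=False,
                     manual_request=False))   # True: the nightly schedule fires
```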
Lifecycle Management and Governance Services
ML artifacts and metadata must be versioned, stored, and managed to support their reproduction, discovery, auditing, and reuse (R22-R24). Different artifacts may need specific storage components, e.g., a model registry for models, a feature store for features, a pipeline store for data/ML pipelines, and a source code repository for ML and IaC scripts. The artifact metadata (e.g., schemas, hyper-parameters, and model metrics) can also be placed in the data store or in a separate metadata store.
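As one possible realization of a model registry (R22-R24), the sketch below registers a model version with MLflow; the model name, the toy training data, and the SQLite-backed store are illustrative choices.

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

# One possible realization of a model registry, sketched with MLflow;
# the registry needs a database-backed store, hence the sqlite URI.
mlflow.set_tracking_uri("sqlite:///mlflow.db")

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])   # toy model for illustration

with mlflow.start_run():
    mlflow.log_param("data_version", "customers-2024-01")
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")
    # Each call creates a new, versioned registry entry whose lineage (run,
    # parameters, metrics, artifacts) can be queried later for auditing and reuse.
```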
The platform should provide services for enforcing the policies governing ML artifacts' life cycle (R25), for instance, identity and access management services for implementing access control policies, and privacy-preserving mechanisms (e.g., data anonymization and federated learning) for enforcing data privacy compliance. Moreover, practitioners may need tools for authoring, testing, and observing policies.

The models should be resilient to attacks such as membership inference, adversarial, and model inversion attacks. Moreover, model decisions should be explainable and interpretable (R26). Hence, practitioners need platform services to test models for their vulnerability to attacks and to generate and visualize explanations for model behavior.

Data Platform Services
Developers use data pipelines to turn raw historical and online data from multiple sources into features for their ML models. A data pipeline typically involves tasks such as data ingestion, storage, discovery, standardization, labeling, cleaning, transformation, and feature extraction (R27). The platform should offer services to simplify the implementation of these tasks, e.g., a messaging service and a data catalog service. Moreover, the platform should also provide tools for building, testing, orchestrating, monitoring, and managing data pipelines.

Pipelines Layer
This layer consists of the pipelines that can be built, tested, and executed using platform services and CI/CD automation services. From the gray literature, we identified eight major pipeline types in an MLOps environment: build, release (and deployment), data, feature engineering, experimentation, training, scoring/serving, and monitoring. Build and release pipelines are CI/CD pipelines. The first pipeline builds the code, executes tests to verify code quality, and publishes the produced artifacts. The second pipeline operationalizes and promotes the artifacts across different environments (e.g., staging and production) to enable their consumption and testing. When building ML and data pipelines, the source codes are pipeline models (e.g., workflow configurations) and pipeline components (e.g., Python programs). The pipeline models are generally published as endpoints on the platform services of pipeline coordinators/engines to enable their execution via API calls. In addition, the pipeline components are containerized and stored in an image registry. The same processes apply to platform services and scoring/prediction services.

Figure 2 shows the pipelines and (a subset of) their interconnections. Developers create and test pipelines and platform services, potentially reusing the relevant existing implementations. The deployment of pipelines and platform services may need provisioning and configuring compute stacks, which can be automated via IaC. All source code should be version-controlled. Changes to the code can trigger build pipelines, which can result in publishing ML/data pipelines and container images. Let us consider the first-time execution of the training process. A data pipeline extracts the raw data sets from multiple sources, cleans, standardizes, and stores them in a data store. A feature engineering pipeline creates, selects, and stores features in a feature store. Finally, the training pipeline builds and tunes a model and publishes it to the model registry, which triggers the model release pipeline, which in turn builds and deploys serving pipelines/services.
Figure 2: MLOps Pipelines and their Relationships
A deployed model may serve client requests using batch, event-driven, or real-time serving methods. The logs from the execution of each step of the training and serving processes are continuously collected. The monitoring pipelines can analyze such logs and generate alerts indicating quality issues in models, data, pipelines, and infrastructure. Alerts can be used to trigger the re-execution of the training process and, consequently, the deployment of a new model.
Governance and Automation Cross-Cutting Modules
Governance needs the ability to consistently specify, configure, observe, and enforce policies for all ML artifacts and MLOps processes (e.g., model approval and data collection/sharing). Policies can consider various metrics, for example, model accuracy, model fairness, and data privacy. Platform services can empower pipeline developers to use the policy-as-code approach for automating policy enforcement. For example, access control policies for controlling data and model access can be embedded into the stages of data and ML pipelines using a policy authoring service. Those policies can then be tested when developing pipelines using a policy testing service, and observed and enforced during their execution using a policy engine service and model/data monitoring services.
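A minimal policy-as-code sketch is shown below; the report fields, thresholds, and approval rule are hypothetical examples of the kinds of policies mentioned above, and a dedicated policy engine could evaluate equivalent rules.

```python
from dataclasses import dataclass

# Hypothetical policy-as-code check embedded into a pipeline stage, e.g., model approval.
@dataclass
class ModelReport:
    accuracy: float
    fairness_gap: float          # e.g., difference in error rate across groups
    uses_personal_data: bool
    privacy_review_approved: bool

APPROVAL_POLICY = {
    "min_accuracy": 0.90,
    "max_fairness_gap": 0.05,
}

def approve_for_release(report: ModelReport) -> tuple[bool, list[str]]:
    violations = []
    if report.accuracy < APPROVAL_POLICY["min_accuracy"]:
        violations.append("accuracy below policy threshold")
    if report.fairness_gap > APPROVAL_POLICY["max_fairness_gap"]:
        violations.append("fairness gap exceeds policy threshold")
    if report.uses_personal_data and not report.privacy_review_approved:
        violations.append("personal data used without privacy approval")
    return (len(violations) == 0, violations)

print(approve_for_release(ModelReport(0.93, 0.08, True, True)))
# -> (False, ['fairness gap exceeds policy threshold'])
```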
Automation applies CI/CD techniques and processes to automate the provisioning and configuration of infrastructure resources in target environments and the building, testing, deploying, and configuring of components such as platform services, pipelines, serving apps, and governance apps (R29-R30). The relevant artifacts are stored in version-control repositories, and their CI/CD pipelines are triggered in various ways, e.g., manual, event-driven, or scheduled.
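As a sketch of R30, the following pytest-style smoke test could run in the release pipeline right after a model is deployed to staging; the endpoint URL, payload, and response field are hypothetical.

```python
# Sketch of a smoke test executed by the CI/CD pipeline against a freshly
# deployed scoring endpoint. The URL and payload are illustrative.
import requests

SCORING_URL = "http://staging.example.internal/predict"   # hypothetical endpoint

def test_deployed_model_answers_with_valid_probability():
    response = requests.post(SCORING_URL, json={"age": 34, "income": 52000.0}, timeout=5)
    assert response.status_code == 200
    probability = response.json()["churn_probability"]
    assert 0.0 <= probability <= 1.0
```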
Roles and Responsibilities
From the gray literature, we identified three key roles for MLOps team members: data engineer, data scientist, and ML engineer. Data engineers build, test, deploy, execute, and manage data pipelines that pull raw data from various sources, validate, clean, and standardize the raw data, and make the curated data accessible to the data scientists in a secure and timely manner. Data scientists analyze a business problem that needs a data science solution and then build ML models that address the business problem. They typically work in an experimentation environment. ML engineers are responsible for operationalizing the models developed by data scientists.

An MLOps team must interact with other individuals and groups, such as business analysts, DevOps engineers, application developers, and infrastructure engineers. Our reference architecture introduces two new teams: platform and governance. The former builds (or sources), tests, deploys, and manages platform services. The latter defines and enforces governance policies using the services offered by the platform.

Related Work
Kolltveit and Li [4] reviewed 25 academic articles on MLOps, focusing on tooling and infrastructure aspects. John et al. [12] synthesized a maturity model for MLOps adoption, using the findings from the gray and academic literature. They also proposed an MLOps framework consisting of three main pipelines (i.e., data, modeling, and release) and a governance layer. Warnett and Zdun [7] systematically reviewed 35 gray literature articles to identify design decisions for model deployment, where MLOps is a specific design option. Symeonidis et al. [13] surveyed the tools that support various tasks in MLOps, such as model deployment, experiment tracking, and feature engineering. They also identified several MLOps challenges, including pipeline development, retraining, and monitoring. Idowu et al. [6] compared 17 tools for managing ML assets such as data, models, pipelines, and experiments. Philipp et al. [14] introduced 22 requirements for MLOps and mapped 26 existing tools to them. They did not conduct a systematic literature review, and their requirements mainly consider general platform features such as data ingestion, versioning, model selection, model registry, and model performance monitoring.

Different from the work above, we aimed to determine the requirements for, and components of, a reference architecture for a complete MLOps stack (from infrastructure to applications) by systematically reviewing the relevant gray literature.

Conclusions
Organizations are adopting MLOps to accelerate the deployment and delivery of high-quality ML models and pipelines into production. Not surprisingly, there has been a rapid proliferation of gray literature on MLOps. This paper investigated the requirements and architectures for MLOps proposed by the gray literature. By systematically analyzing 58 sources, we distilled a catalog of 30 requirements and a reference architecture for MLOps environments. The requirements and reference architecture can provide a research framework for MLOps and guide the identification of key research challenges in bringing ML models into widespread use. They can also help practitioners build or assemble an MLOps environment, select and adapt an existing MLOps environment, and select or develop tools for supporting various tasks in an MLOps environment.

REFERENCES
1. A. Cam, M. Chui, and B. Hall, "Global AI survey: AI proves its worth, but few scale impact," 2019.
2. A. Paleyes, R.-G. Urma, and N. D. Lawrence, "Challenges in deploying machine learning: A survey of case studies," ACM Comput. Surv., Apr. 2022.
3. Algorithmia, "2020 state of enterprise machine learning," Algorithmia, Tech. Rep., 2020.
4. A. B. Kolltveit and J. Li, "Operationalizing machine learning models - a systematic literature review," in 2022 IEEE/ACM 1st International Workshop on Software Engineering for Responsible Artificial Intelligence (SE4RAI), 2022, pp. 1–8.
5. Thoughtworks, "Guide to evaluating MLOps platforms," Tech. Rep., Nov. 2021.
6. S. Idowu, D. Strüber, and T. Berger, "Asset management in machine learning: A survey," in 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2021, pp. 51–60.
7. S. J. Warnett and U. Zdun, "Architectural design decisions for machine learning deployment," in 2022 IEEE 19th International Conference on Software Architecture (ICSA), 2022, pp. 90–100.
8. S. Keele et al., "Guidelines for performing systematic literature reviews in software engineering," EBSE, Tech. Rep., Ver. 2.3, 2007.
9. V. Garousi, M. Felderer, and M. V. Mäntylä, "Guidelines for including grey literature and conducting multivocal literature reviews in software engineering," Information and Software Technology, vol. 106, pp. 101–121, 2019.
10. C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering. Springer Science & Business Media, 2012.
11. Z. Dehghani, Data Mesh: Delivering Data-Driven Value at Scale, 2022.
12. M. M. John, H. H. Olsson, and J. Bosch, "Towards MLOps: A framework and maturity model," in 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2021, pp. 1–8.
13. G. Symeonidis, E. Nerantzis, A. Kazakis, and G. A. Papakostas, "MLOps - definitions, tools and challenges," in 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), 2022, pp. 0453–0460.
14. P. Ruf, M. Madan, C. Reich, and D. Ould-Abdeslam, "Demystifying MLOps and presenting a recipe for the selection of open-source tools," Applied Sciences, vol. 11, no. 19, 2021.

Indika Kumara is an Assistant Professor at the Jheronimus Academy of Data Science (JADS) and Tilburg University, the Netherlands.

Rowan Arts was a master student at JADS and Tilburg University, the Netherlands.

Dario Di Nucci is an Assistant Professor at the University of Salerno, Italy.

Rick Kazman is a Professor at the University of Hawaii and a Visiting Researcher at the Software Engineering Institute of Carnegie Mellon University.

Willem-Jan Van Den Heuvel is a Full Professor at JADS and Tilburg University, the Netherlands.

Damian Andrew Tamburri is an Associate Professor at JADS and Eindhoven University of Technology, the Netherlands.