
Machine Learning in Production: From Models to Systems

Christian Kästner
This chapter covers part of the “From Models to AI-Enabled Systems (Systems Thinking)” lecture of our Machine Learning in Production course. For other chapters see the table of contents.

In production systems, machine learning is almost always used as a component in a larger system — often a very important component, but usually still just one among many components. Yet, most education and research regarding machine learning focuses narrowly on the learning techniques and models, maybe on the pipeline to create, deploy, and operate models, but rarely on the entire system including ML and non-ML components.

Yet, this focus on the entire system is important in many ways. Many relevant decisions concern the entire system and how it interacts with users and the environment more broadly. For example, to what degree should the system automatically act based on predictions from machine-learned models, and should it be designed to keep humans in the loop? Such decisions matter substantially for how a system copes with mistakes and have implications for usability, safety, and fairness. Before the rest of the book looks at various facets of building software systems with ML components, let us dive a little deeper into how machine learning relates to the rest of the system and why a system-level view is so important.

ML and Non-ML Components in a System

In production systems, machine learning is used to train models to make predictions that are used in the system. In some systems, those predictions are the very core of the system, whereas in others they provide only an auxiliary feature.

In the transcription service startup from the previous chapter, machine learning provides the very core functionality of the system that converts uploaded audio files into text. Yet, to turn the model into a product, many other (usually non-ML) parts are needed, such as (a) a user interface to create user accounts, upload audio files, and show results, (b) a data storage and processing infrastructure to queue and store transcriptions and process them at scale, (c) a payment service, and (d) monitoring infrastructure to ensure the system is operating within expected parameters.

Architecture sketch of a transcription system, illustrating the central ML component for speech
recognition and many non-ML components.

At the same time, many traditional software systems use machine learning for some extra “add-on” functionality. For example, traditional end-user tax software may add a module to predict the audit risk for a specific customer; a word processor may add a better grammar checker; a graphics program may add smart filters; or a photo album application may add automated tagging of friends. In all these cases, machine learning is added as an often relatively small component to provide some added value in an existing system.
Architecture sketch of the tax system, illustrating the ML component for audit risk as an addition
to many non-ML components in the system.

Traditional Model Focus (Data Science)

Much of the attention in machine learning education and research has been on learning accurate models from given data. Machine learning education typically focuses on how specific machine learning algorithms work (e.g., the internals of SVMs or deep neural networks) or how to apply them to train accurate models from provided data. Similarly, machine learning research focuses primarily on the learning steps, trying to improve prediction accuracy of models trained on common datasets (e.g., exploring new deep neural network architectures, new embeddings).
Typical steps of a machine-learning process. Mainstream machine-learning education and research focus on the modeling step itself, with provided datasets.

Comparatively little attention is paid, at one end, to how data is collected and labeled and, at the other end, to how the learned models might actually be used for a real task. Rarely is there any discussion of the larger system that might produce the data or use the model’s predictions. Many researchers and practitioners have expressed frustration with this somewhat narrow focus on model training and the research-culture incentives behind it, for example in Wagstaff’s 2012 essay “Machine Learning that Matters” and Sambasivan et al.’s 2021 study “Everyone wants to do the model work, not the data work.” Outside of BigTech organizations with lots of experience, this also leaves machine-learning practitioners who want to turn models into products with little guidance, as can often be observed in teams and startups that struggle to productionize initially promising models (and in Gartner’s claim that over half of the projects with machine-learning prototypes in 2020 did not make it to production).

Automating Pipelines and MLOps (ML Engineering)

With the increasing use of machine learning in production systems, engineers have noticed various practical problems of deploying and maintaining machine-learned models. Traditionally, models might be learned in a notebook or with some script, then serialized (“pickled”), and then embedded in a web server that provides an API for making predictions (which we will discuss in more detail in chapter Deploying a Model). However, when models are used in production systems, scaling the system with changing demand, often in cloud infrastructures, and monitoring service quality in real time become increasingly important. Similarly, with larger datasets and deep-learning jobs, model training itself can become challenging to scale. Also, when models need to be updated regularly, either due to continuous experimentation and improvement (as we will discuss in the Quality Assurance in Production chapter) or due to routine updates to handle various forms of distribution shifts (as we will discuss in the Data Quality chapter), manual steps in learning and deploying models become tedious and error prone. Experimental data-science code is often derided as being of low quality by software-engineering standards, often monolithic, with minimal error handling, and barely tested, which does not foster confidence in regular or automated deployments. All this has put increasing attention on distributed training, deployment, quality assurance, and monitoring, supported by automation of machine-learning pipelines, often under the label MLOps.
Widening the focus from modeling to the entire ML pipeline, including deployment and
monitoring, with a heavy focus on automation.
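
To make this traditional workflow concrete, here is a minimal sketch of the “pickle and serve” pattern, assuming a scikit-learn-style model and Flask; the model path, feature format, and port are illustrative assumptions rather than details from this chapter.

```python
# Minimal sketch of the traditional "pickle and serve" pattern.
# Model path, request format, and port are illustrative assumptions.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained and serialized ("pickled") model at startup.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [1.2, 3.4, ...]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    # Convert numpy scalars to plain Python values for JSON serialization.
    return jsonify({"prediction": prediction.item() if hasattr(prediction, "item") else prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Everything beyond this single endpoint (scaling, monitoring, retraining, and safe updates) is what motivates the pipeline automation discussed next.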

This focus on entire machine-learning pipelines that include model deployment and monitoring has received significant attention in recent years. It addresses many common challenges of wrapping models into scalable web services, regularly updating them, and monitoring their execution. Increasing attention is also paid to scaling model training and model serving through massive parallelization in cloud environments. While many teams originally implemented this infrastructure for each project and maintained substantial amounts of infrastructure code (described prominently in the 2015 technical debt article from a Google team), these days many competing open-source and commercial solutions exist for many of these steps as MLOps tools (more details in later chapters).

Figure from Google’s 2015 technical debt paper, indicating that the amount of code for actual model training is small compared to the substantial infrastructure code needed to automate model training, serving, and monitoring. These days, much of this infrastructure is readily available through competing MLOps tools (e.g., serving infrastructure, feature stores, cloud resource management, monitoring).
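
As a rough illustration of what such pipeline automation can look like, the following sketch expresses each step as an explicit, testable function with a simple quality gate before deployment; the data source, model choice, accuracy threshold, and stubbed deployment step are assumptions for illustration, not a prescription of any particular MLOps tool.

```python
# Illustrative sketch of an automated ML pipeline: each step is an explicit,
# testable function instead of cells in a notebook. Data source, model,
# threshold, and the stubbed deployment step are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def ingest_data(path: str) -> pd.DataFrame:
    """Load (and, in a real pipeline, validate) the training data."""
    return pd.read_csv(path)


def train(df: pd.DataFrame):
    X, y = df.drop(columns=["label"]), df["label"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingClassifier().fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


def deploy(model) -> None:
    """Hand the model to serving infrastructure (stubbed out here)."""
    print("deploying", model)


def run_pipeline(path: str, min_accuracy: float = 0.9) -> None:
    model, accuracy = train(ingest_data(path))
    # Quality gate: only deploy automatically if the model is good enough.
    if accuracy >= min_accuracy:
        deploy(model)
    else:
        raise RuntimeError(f"accuracy {accuracy:.2f} below threshold, not deploying")


if __name__ == "__main__":
    run_pipeline("training_data.csv")
```

In practice, an orchestrator or MLOps platform would schedule, version, and monitor these steps rather than a single script.
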
Researchers and consultants report that shifting a team’s mindset from models to machine-learning pipelines is challenging. Data scientists are often used to working with private datasets and local workspaces (e.g., in computational notebooks) to create models. Migrating code toward an automated machine-learning pipeline, where each step is automated and tested, requires a substantial shift in mindset and a strong engineering focus. This is not necessarily valued by all team members; for example, data scientists frequently report resenting having to do too much engineering work that prevents them from focusing on their models, though many eventually appreciate the additional benefits of being able to experiment more rapidly in production and deploy improved models with confidence.

ML-Enabled Systems (ML in Production)

Notwithstanding the increased focus on automation and engineering, this broader view of automated machine-learning pipelines and MLOps is still entirely model-centric. It starts with model requirements and ends with deploying the model as a reliable and scalable service, but it usually does not consider other parts of the system and how the model interacts with them. Zooming out, the entire purpose of the machine-learning pipeline is to create a model that will be used as one component of a larger system (along with, potentially, additional components for training and monitoring the model).

The ML pipeline corresponds to all activities for producing, deploying, and updating the ML
component that is part of a larger system.

As we will discuss throughout this book, key challenges of building production systems with machine-learning components (ML-enabled systems) arise at the interface between these ML components and non-ML components of the system and how they, together, achieve system goals. There is constant tension between the goals and requirements of the overall system and the requirements and design of individual ML and non-ML components:

- Requirements for the entire system influence model requirements as well as requirements for model monitoring and pipeline automation. For example, in the transcription scenario, user-interface designers may request confidence scores for individual words and alternative plausible transcriptions to provide a better user experience; operators may set expectations for latency and memory demand during inference that constrain data scientists in what models they can choose; and advocates and legal experts may suggest what fairness constraints to enforce during training.

- Conversely, capabilities of the model influence the design of non-ML parts of the system and the assurances we can make about the entire system. For example, in the transcription scenario, the accuracy of predictions may influence user-interface design and to what degree humans are kept in the loop to check and fix the automated transcriptions; it may also limit what promises about system quality we can make to customers.

Systems Thinking

Given that machine learning is part of a larger system, it is important to pay attention to the entire system, not just the machine-learning components. We need a holistic approach with an interdisciplinary team that involves all stakeholders.

Systems thinking is the name for a discipline that focuses on how systems interact with their environment and how components within a system interact. For example, Donella Meadows defines a system as “a set of inter-related components that work together in a particular environment to perform whatever functions are required to achieve the system’s objective.” Systems thinking postulates that everything is interconnected; that combining parts often leads to new emergent behaviors that are not apparent from the parts themselves; and that it is essential to understand the dynamics of a system, where actions have effects and may form feedback loops (as in the YouTube conspiracy-theory and gaming examples in the Introduction chapter).

A system consists of components working together toward the system goal. The system is situated
in and interacts with the environment.

As we will explore throughout this book, many common challenges in building production systems with machine-learning components are really system challenges that require understanding the interaction of ML and non-ML components and the interaction of the system with the environment.
Beyond the Model

A model-centric view of machine learning allows data scientists to focus on the hard problems involved in training more accurate models and allows MLOps engineers to build infrastructure that enables rapid experimentation and improvement in production. However, this common model-centric view misses many facets of building high-quality production systems.

System Quality versus Model Quality

Outside of machine-learning education and research, model accuracy is almost never a goal in itself but a means to support the goal of a system. A system typically has the goal of satisfying some user needs and making money (we will discuss this in more detail in chapter System and Model Goals). The accuracy of a machine-learned model can directly or indirectly support such a system goal. For example, a better audio-transcription model is likely to attract more users and sell more transcriptions; predicting the audit risk in tax software provides value to the users of that software and may hence encourage more sales and sales of additional services. In both cases, the system goal is distinct from the model’s accuracy but supported more or less directly by it.

Interestingly, improvements in model accuracy do not even have to translate into improvements toward system goals. For example, experience at Booking.com has shown that improving the accuracy of models that predict different aspects of a customer’s travel preferences, which influence hotel suggestions, does not necessarily improve hotel sales; in some cases, improved accuracy even negatively impacted sales. One possible explanation offered by the team was that, beyond a point, the model becomes too good and comes across as creepy: it seems to know too much about a customer’s travel plans, which the customer has not actively shared. In the end, more accurate models were not adopted if they did not support the system goals.

Observations from online experiments at Booking.com, showing that model accuracy improvement (“Relative Difference in Performance”) does not necessarily translate to improvements of system goals (“Conversion Rate”). From Bernardi et al. “150 successful machine learning models: 6 lessons learned at Booking.com.” In Proc. KDD, 2019.
Accurate predictions are important for many uses of machine learning in production systems, but accuracy is not always the most important goal. In many cases, accurate predictions are not critical for the system goal, and “good enough” predictions may actually be good enough. For example, for the audit-prediction feature in the tax system, roughly approximating the audit risk is likely sufficient for many users (and for the software vendor, who might try to upsell users on additional services or insurance). In other cases, marginally better predictions may come at excessive costs, for example, in terms of acquiring or labeling much more data, longer training times, and privacy concerns — a simpler, cheaper, though less accurate model might often be preferable when considering the entire system. Finally, other parts of the system, such as a better user-interface design explaining the predictions, keeping humans in the loop, or system-level non-ML safety features (as we will discuss in the Requirements and Safety chapters), can mitigate many problems from inaccurate predictions.

A narrow focus only on model accuracy that ignores how the model interacts with the rest of the system will miss opportunities to design the model to better support the overall system goals and balance various desired qualities.

User Interaction Design


System designers have powerful tools to shape the system
through user interaction design. For example, Geoff Hulten
distinguishes between different levels of forcefulness with
which predictions can be integrated into user interactions
with the system:

- Automate: The system can take an action on the user’s behalf. For example, a smart home automation system may automate actions to turn off lights when the occupants leave.

- Prompt: The system may ask a user whether an action should be taken. For example, the tax software may show a recommendation based on the predicted audit risk, but leave it to the user to decide whether to buy some “audit protection insurance” or change some reported tax data.

- Organize: The system may organize and order items based on predictions. For example, a hotel reservation service may order and group hotels based on predicted customer preferences.

- Annotate: The system may add information based on predictions to the display. For example, the transcription service may underline uncertain words in the transcript, or the tax software may highlight entries that are correlated with high audit risk, suggesting but not requiring actions from the user.

These design choices decrease in forcefulness, from taking direct action, to interrupting the user with a prompt, to simply annotating information. While full automation can appear magical when it works well, many designs keep humans in the loop to better cope with possible mistakes. How forceful an interaction should be depends on the specifics of the actual system, including the confidence of its predictions, the frequency of the interaction, the benefit of automating a correct prediction, and the cost of mistakes. We will discuss this further in the Human-AI Interaction chapter.
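
As a small, hypothetical sketch of how such a decision might be encoded, a system could select an interaction mode based on the model’s reported confidence and the cost of a mistake; the thresholds and mode names below are made up for illustration and are not prescribed by Hulten.

```python
# Hypothetical sketch: choose how forcefully to act on a prediction based on
# its confidence and the cost of mistakes. Thresholds are illustrative only.
from enum import Enum


class Mode(Enum):
    AUTOMATE = "automate"   # act on the user's behalf
    PROMPT = "prompt"       # ask the user before acting
    ORGANIZE = "organize"   # reorder or group items
    ANNOTATE = "annotate"   # only add information to the display


def choose_mode(confidence: float, mistake_cost: str) -> Mode:
    # Only fully automate when the model is very confident and mistakes are cheap.
    if confidence > 0.95 and mistake_cost == "low":
        return Mode.AUTOMATE
    if confidence > 0.8:
        return Mode.PROMPT
    if confidence > 0.5:
        return Mode.ORGANIZE
    return Mode.ANNOTATE


# Example: an uncertain word in a transcript is only annotated (underlined),
# not automatically replaced.
print(choose_mode(confidence=0.45, mistake_cost="high"))  # Mode.ANNOTATE
```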

A smart safe browsing feature uses machine learning to warn of malicious web sites. In this
case, the design is fairly forceful, prompting the user to make a choice, but stops short of fully
automating the action.
Beyond forcefulness, another common user-interface design question is to what degree to explain predictions to users. For example, should the tax software simply report an audit risk score, explain how the prediction was made, or explain which inputs are most responsible for the predicted audit risk? As we will discuss in chapter Interpretability and Explainability, the need for explaining decisions depends heavily on the confidence of the model and the potential impact of mistakes on users: in high-risk situations, such as medical diagnosis, it is much more important that a human expert can understand and check a prediction based on an explanation than in a routine, low-risk situation such as ranking hotel offers.
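
To make the idea of explaining which inputs drive a prediction concrete, here is a minimal sketch for a hypothetical linear audit-risk model, where per-feature contributions can simply be ranked; the feature names, weights, and values are made up for illustration.

```python
# Minimal sketch: for a hypothetical linear audit-risk model, report which
# inputs contribute most to a prediction. Names, weights, and values are made up.
weights = {"income": 0.8, "num_deductions": 1.5, "charitable_donations": 2.1}
bias = -3.0


def risk_score_with_explanation(features: dict):
    # Contribution of each input = weight * value (only meaningful for linear models).
    contributions = {name: weights[name] * value for name, value in features.items()}
    score = bias + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return score, ranked


score, ranked = risk_score_with_explanation(
    {"income": 1.2, "num_deductions": 3.0, "charitable_donations": 0.5}
)
print(f"audit risk score: {score:.2f}")
print("largest contributors:", ranked[:2])
```

More realistic models require dedicated explanation techniques, which we discuss in the Interpretability and Explainability chapter.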

A model-centric approach that does not consider the rest of the system misses out on many important design decisions and opportunities regarding interactions with users. Model qualities, including accuracy and the ability to report confidence or explanations, shape possible and necessary user-interface design decisions, and user-interface considerations in turn may influence model requirements.

Data Acquisition and Anticipating Change

A model-centric view often assumes that data is given and representative, even though system designers often have substantial flexibility in deciding what data to collect and how, and even though production data often drifts.

Compared to feature engineering and modeling, and even to deployment and monitoring, data collection and data-quality work is often undervalued (explored in depth in Sambasivan et al.’s 2021 study “Everyone wants to do the model work, not the data work”). System designers should not only focus on building accurate models and integrating them into a system, but also on how to collect data, which may include educating the people collecting and entering data and providing labels, planning data collection upfront, documenting data-quality standards, considering cost, setting incentives, and establishing accountability.

User interaction design can influence what data is generated by the system and can potentially be used as future training data. For example, providing an attractive user interface to show and edit transcription results would allow us to observe how users change the transcript, thereby providing insights about probable transcription mistakes. More invasively, we could directly ask users which of multiple possible transcriptions for specific audio snippets is correct. Both designs could potentially be used as a proxy to measure model quality and also to collect additional labeled training data from observing user interactions.
Screenshot of the transcription service Temi, which provides an excellent editor to show and edit
transcripts. By encouraging users or in-house experts to edit transcripts in such an observable
environment, such a system could collect fine-grained telemetry about mistranscribed words.
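
The following sketch illustrates how such an interface could log edits as telemetry and turn them into a rough quality proxy and new labeled examples; the event format, file-based storage, and function names are assumptions for illustration.

```python
# Illustrative sketch: record user edits to a transcript as telemetry and derive
# (a) a rough proxy for model quality and (b) candidate labeled examples.
# Event format and file-based storage are assumptions.
import json
from dataclasses import asdict, dataclass


@dataclass
class EditEvent:
    audio_snippet_id: str   # reference to the audio segment
    predicted_text: str     # what the model transcribed
    corrected_text: str     # what the user changed it to


def log_edit(event: EditEvent, log_path: str = "edit_telemetry.jsonl") -> None:
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")


def edit_rate(log_path: str = "edit_telemetry.jsonl") -> float:
    """Fraction of logged snippets that users corrected (a proxy for mistakes)."""
    with open(log_path) as f:
        events = [json.loads(line) for line in f]
    if not events:
        return 0.0
    changed = sum(e["predicted_text"] != e["corrected_text"] for e in events)
    return changed / len(events)


# Corrected snippets can later be paired with their audio as new training labels.
log_edit(EditEvent("snippet-42", "the new ipone", "the new iPhone"))
print(f"observed edit rate: {edit_rate():.2%}")
```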

In general, it can be very difficult to acquire representative and generalizable training data. Different forms of data drift are common in production, when the distribution of data changes over time and may no longer align well with the training data (we will discuss different forms of drift in chapter Data Quality). For example, in the transcription service, the model needs to be continuously updated to support new names that recently started to occur in recorded conversations (e.g., of new products or newly popular politicians). Also, further model development and new machine-learning techniques may improve models over time. Anticipating the need to continuously collect training data and evaluate model and system quality in production allows developers to prepare a system proactively.

Again, a focus on the entire system rather than a model-centric focus encourages a more holistic view of data collection and encourages design for change, preparing the entire system for constant updates and experimentation.

Predictions Have Consequences

As already discussed in the introduction, most software systems, including those with machine-learning components, interact with the environment. They aim to influence how people behave or directly control physical devices, such as self-driving cars. As such, predictions have consequences in the real world, positive as well as negative. Reasoning about interactions between the software and the environment outside of the software (including humans) is a system-level concern and cannot be done by reasoning about the software or the machine-learned component alone.

From a software-engineering perspective, it is prudent to consider every machine-learned model as an unreliable function within a system that sometimes will return unexpected results. Given the way we learn models, by fitting a function to match observations rather than providing specifications, such mistakes seem unavoidable (it is not even clear that we can always determine what constitutes a mistake). Since we have to accept eventual mistakes, it is up to other parts of the system, including user interaction design or safety mechanisms, to compensate for them.

Consider the safety concerns of wrong predictions in a smart toaster and how to design a safe
system regardless.

Consider Geoff Hulten’s example of a smart toaster that uses sensors and machine learning to decide how long to toast some bread, achieving consistent outcomes at the desired level of toastedness. As with all machine-learned models, we should anticipate eventual mistakes and the consequences of those mistakes for the system and the environment. Even a highly accurate machine-learned model may eventually suggest toasting times that would burn the toast or even start a kitchen fire. While the software is not unsafe in itself, the way it actuates the heating element of the toaster can be a safety hazard. More training data and better machine-learning techniques may make the model more accurate and robust and reduce the rate of mistakes, but they will not eliminate mistakes entirely. Hence, the system designer should look at means to make the system safe despite the unreliable component. For the toaster, this can include (1) that non-ML code requesting the model’s prediction caps toasting at a maximum duration, (2) that an additional non-ML component uses a temperature sensor to stop toasting at some set point, or (3) that system designers simply install a thermal fuse (a cheap hardware component that shuts off power when it overheats) as a non-ML, non-software safety mechanism to ensure that the toaster does not overheat. With these safety mechanisms, the toaster may occasionally burn some toast when the machine-learned model makes mistakes, but it will not burn down the kitchen.
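
A sketch of the first two (software) safety mechanisms might look as follows, with non-ML code capping the model’s suggestion and cutting off on a temperature reading; the limits and the sensor interface are made up for illustration, and the thermal fuse remains a hardware measure outside the software entirely.

```python
# Sketch of system-level safety mechanisms around an unreliable model
# (toaster example). Limits and the sensor interface are illustrative.
MAX_TOAST_SECONDS = 240        # hard cap enforced by non-ML code
MAX_SAFE_TEMPERATURE_C = 180   # cutoff based on a temperature sensor


def safe_toast_duration(model_suggestion_seconds: float) -> float:
    # Never trust the model beyond a fixed maximum duration.
    return min(model_suggestion_seconds, MAX_TOAST_SECONDS)


def toast(model_suggestion_seconds: float, read_temperature, heat_for_one_second):
    """Run the heating element, but stop early if it gets too hot."""
    duration = safe_toast_duration(model_suggestion_seconds)
    for _ in range(int(duration)):
        if read_temperature() > MAX_SAFE_TEMPERATURE_C:
            break   # independent, non-ML shutoff
        heat_for_one_second()


# Example: the model suggests an absurd 20 minutes; the cap limits it to 4.
print(safe_toast_duration(1200))  # 240
```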

Another important place where consequences of predictions show up in systems is feedback loops. As people react to a system at scale, we may observe that the model’s predictions reinforce or amplify behaviors the model initially learned, thus producing more data that supports the model’s predictions. Feedback loops can be positive, such as in public health campaigns (e.g., against smoking) when people adjust their behavior in intended ways, providing role models for others and more data to support the intervention. However, many feedback loops are negative, reinforcing bad outcomes. Beyond the YouTube recommendation system suggesting more and more conspiracy-theory videos mentioned in the introduction, feedback loops have frequently been argued to reinforce historic bias. For example, a system predicting more crime in an area that is overpoliced due to historical bias may lead to more policing and more arrests in that area, which provides additional data reinforcing the discriminatory prediction even if the area does not have more crime than others. Understanding feedback loops requires reasoning about the entire system and how it interacts with the environment.

Just as safety is a system-level property that requires understanding how the software interacts with the environment, so are many other qualities of interest, including security, privacy, fairness, accountability, energy consumption, and user satisfaction. In the machine-learning community, many of these qualities are now discussed under the umbrella of Responsible AI. A model-centric view that focuses only on the analysis of a machine-learned model without considering how it is used in a system cannot make any assurances about system-level qualities such as safety (we will discuss this in more detail in the Safety and Security chapters) and will have a hard time anticipating feedback loops. Responsible engineering requires a system-level approach.

Interdisciplinary Teams

In the introduction, we already argued that building production systems requires a wide range of skills, typically brought together by team members with different specialties. Taking a holistic system view of ML-enabled systems reinforces this notion further: machine-learning expertise alone is not sufficient, and even the engineering skills to build machine-learning pipelines and deploy models cover only small parts of a system. Software engineers, in turn, need to understand the basics of machine learning to know how to integrate machine-learned components and plan for mistakes. When considering how the model interacts with the rest of the system and how the system interacts with the environment, we need to bring together diverse skills. For collaboration and communication in these teams, AI literacy is important, but so is understanding system-level concerns like user needs, safety, or fairness.

On Terminology
Unfortunately, there is no standard term for referring to
building production systems with machine-learning
components. In this quickly evolving field, there are many
terms and they are largely not used consistently. In this
book, we adopt the term “ML-enabled system” or simply the
descriptive “production system with machine-learning
components” to emphasize the broad focus on the entire
system, in contrast to a more narrow model-centric focus of
data science education or even MLOps pipelines. The terms
“ML-infused system” or “ML-based system” have been used
with similar intentions.

In this book, we talk about machine learning and largely focus on supervised learning. Technically, machine learning is a subfield of artificial intelligence, where machine learning refers to systems that learn functions from data (“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” — Tom Mitchell, 1997). There are many other artificial intelligence approaches that do not involve machine learning, such as constraint satisfaction solvers, expert systems, and probabilistic programming, but many of them do not share the same challenges that arise from the missing specifications of machine-learned models. In colloquial conversation and media, machine learning and artificial intelligence are used largely interchangeably, and AI is the favored term among public speakers, media, and politicians. For most terms discussed here, there is also a version that uses AI instead of ML, e.g., “AI-enabled system” rather than “ML-enabled system.”

The software-engineering community sometimes distinguishes between “Software Engineering for Machine Learning” (short SE4ML, SE4AI, SEAI) and “Machine Learning for Software Engineering” (short ML4SE, AI4SE). The former refers to applying and tailoring software-engineering approaches to problems related to machine learning, which includes the challenges of building ML-enabled systems but also more model-centric challenges like testing machine-learned models as isolated components. The latter refers to using machine learning to improve software-engineering tools, such as using machine learning to detect bugs in code or to automatically generate textual summaries of code fragments. While software-engineering tools with machine-learned components are also ML-enabled systems, they are not necessarily representative of the typical end-user-focused ML-enabled systems discussed in this book, such as transcription services or tax software.

The term “AI Engineering” and the job title “ML engineer” are gaining popularity to highlight a stronger focus on engineering in data-science projects. They most commonly refer to building automated pipelines, deploying models, and MLOps, and hence tend to skew model-focused rather than system-focused, though some people use the terms with a broader meaning. The terms ML System Engineering and MLSys (and sometimes also AI Engineering) refer to engineering the infrastructure for machine learning and for serving machine-learned models, such as building efficient distributed learning algorithms.

To further complicate terminology, AIOps and DataOps have also been suggested and are distinct from MLOps. AIOps tends to refer to the use of artificial-intelligence (mostly machine-learning) techniques in the operation of software systems, for example, using models to decide when and how to scale deployed systems. DataOps tends to refer to agile methods and automation in business data analytics.

Summary

Turning machine-learned models into production systems requires a shift of perspective that focuses not only on the model but on the entire system, including many non-ML parts, and on how the system interacts with the environment. This requires zooming out from a focus on model training: a broader and more engineering-heavy focus on machine-learning pipelines emphasizes important aspects of deploying and updating machine-learned components but is still model-centric; considering the entire system (including non-ML components) and how it interacts with the environment is essential for building production systems responsibly. The nature of machine-learned components, which can roughly be characterized as unreliable functions, places heavy emphasis on understanding and designing the rest of the system to achieve the system goals despite occasional wrong predictions, without serious harms or unintended consequences when those wrong predictions occur.

Further Readings

- Book discussing the design and implementation of ML-enabled systems, including coverage of considerations for user interaction design and planning for mistakes beyond a purely model-centric view: Hulten, Geoff. Building Intelligent Systems: A Guide to Machine Learning Engineering. Apress, 2018.

- Essay arguing that the machine-learning community focuses on ML algorithms and improvements on benchmarks, but should do more to focus on impact and deployments (as part of systems): Wagstaff, Kiri. “Machine Learning that Matters.” In Proceedings of the 29th International Conference on Machine Learning, 2012.

- Interview study revealing how the common model-centric focus undervalues data collection and data quality, and how this has downstream consequences for the success of ML-enabled systems: Sambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. ““Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI.” In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15. 2021.

- Interview study revealing conflicts at the boundary between ML and non-ML teams in production ML-enabled systems, including differences in how different organizations prioritize models or products: Nahar, Nadia, Shurui Zhou, Grace Lewis, and Christian Kästner. “Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process.” In Proceedings of the 44th International Conference on Software Engineering (ICSE), May 2022.

- On ML pipelines: Short paper reporting how machine-learning practitioners struggle with switching from a model-centric view to considering and automating the entire ML pipeline: O’Leary, Katie, and Makoto Uchida. “Common Problems with Creating Machine Learning Pipelines from Existing Code.” In Proceedings of the Third Conference on Machine Learning and Systems (MLSys), 2020.

- On ML pipelines: A well-known paper arguing for the need to pay attention to the engineering of machine-learning pipelines: Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. “Hidden Technical Debt in Machine Learning Systems.” In Advances in Neural Information Processing Systems, pp. 2503–2511. 2015.

- Experience report from teams at Booking.com, with a strong discussion of the difference between model-accuracy improvements and improving system outcomes: Bernardi, Lucas, Themistoklis Mavridis, and Pablo Estevez. “150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com.” In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1743–1751. 2019.
As with all chapters, this text is released under a Creative Commons 4.0 BY-SA license.
