Operationalizing Machine Learning: An Interview Study

Shreya Shankar∗, Rolando Garcia∗, Joseph M. Hellerstein, Aditya G. Parameswaran


University of California, Berkeley
{shreyashankar,rogarcia,hellerstein,adityagp}@berkeley.edu
∗ Co-first authors

ABSTRACT
Organizations rely on machine learning engineers (MLEs) to operationalize ML, i.e., deploy and maintain ML pipelines in production. The process of operationalizing ML, or MLOps, consists of a continual loop of (i) data collection and labeling, (ii) experimentation to improve ML performance, (iii) evaluation throughout a multi-staged deployment process, and (iv) monitoring of performance drops in production. When considered together, these responsibilities seem staggering—how does anyone do MLOps, what are the unaddressed challenges, and what are the implications for tool builders?

We conducted semi-structured ethnographic interviews with 18 MLEs working across many applications, including chatbots, autonomous vehicles, and finance. Our interviews expose three variables that govern success for a production ML deployment: Velocity, Validation, and Versioning. We summarize common practices for successful ML experimentation, deployment, and sustaining production performance. Finally, we discuss interviewees' pain points and anti-patterns, with implications for tool design.

[Figure 1: Routine tasks in the ML engineering workflow: data collection, experimentation, evaluation and deployment, and monitoring and response.]

1 INTRODUCTION

As Machine Learning (ML) models are increasingly incorporated into software, a nascent sub-field called MLOps (short for ML Operations) has emerged to organize the "set of practices that aim to deploy and maintain ML models in production reliably and efficiently" [4, 77]. It is widely agreed that MLOps is hard. Anecdotal reports claim that 90% of ML models don't make it to production [76]; others claim that 85% of ML projects fail to deliver value [69].

At the same time, it is unclear why MLOps is hard. Our present-day understanding of MLOps is limited to a fragmented landscape of white papers, anecdotes, and thought pieces [14, 18, 20, 21, 34, 45], as well as a cottage industry of startups aiming to address MLOps issues [27]. Early work by Sculley et al. attributes MLOps challenges to "technical debt", due to which there are "massive ongoing maintenance costs in real-world ML systems" [64]. Most successful ML deployments seem to involve a "team of engineers who spend a significant portion of their time on the less glamorous aspects of ML like maintaining and monitoring ML pipelines" [54]. Prior work has studied general practices of data analysis and science [30, 49, 62, 82], without considering the MLOps challenges of productionizing models.

There is thus a pressing need to bring clarity to MLOps, specifically in identifying what MLOps typically involves—across organizations and ML applications. A richer understanding of best practices and challenges in MLOps can surface gaps in present-day processes and better inform the development of next-generation tools. Therefore, we conducted a semi-structured interview study of ML engineers (MLEs), each of whom has worked on ML models in production. We sourced 18 participants from different organizations and applications (Table 1) and asked them open-ended questions to understand their workflow and day-to-day challenges.

We find that MLEs perform four routine tasks, shown in Figure 1: (i) data collection, (ii) experimentation, (iii) evaluation and deployment, and (iv) monitoring and response. Across tasks, we observe three variables that dictate success for a production ML deployment: Velocity, Validation, and Versioning.1 We describe common MLOps practices, grouped under overarching findings:

ML engineering is very experimental in nature (Section 4.3). As mentioned earlier, various articles claim that it is a problem for 90% of models to never make it to production [76], but we find that this statistic is misguided. The nature of constant experimentation is bound to create many versions, a small fraction of which (i.e., "the best of the best") will make it to production. Thus it is beneficial to prototype ideas quickly, by making minimal changes to existing workflows, and demonstrate practical benefits early—so that bad models never make it far.

Operationalizing model evaluation requires an active organizational effort (Section 4.4). Popular model evaluation "best practices" do not do justice to the rigor with which organizations think about deployments: they generally focus on using one typically-static held-out dataset to evaluate the model on [38] and a single ML metric choice (e.g., precision, recall) [1, 2]. We find that MLEs invest significant resources in maintaining multiple up-to-date evaluation datasets and metrics over time—especially ensuring that data sub-populations of interest are adequately covered.

Non-ML rules and human-in-the-loop practices keep models reliable in production (Section 4.5). We find that MLEs prefer simple ideas, even if it means handling multiple versions: for example, rather than leverage advanced techniques to minimize distribution shift errors [15, 83], MLEs would simply create new models, retrained on fresh data. MLEs ensured that deployments were reliable via strategies such as on-call rotations, model rollbacks, or elaborate rule-based guardrails to avoid incorrect outputs.

In Section 5, we discuss recurring MLOps challenges across all tasks. We express these pain points as tensions and synergies between our three "V" variables—for example, undocumented "tribal knowledge" about pipelines (Section 5.2.4) demonstrates a tension between velocity (i.e., quickly changing the pipeline in response to a bug) and well-executed versioning (i.e., documenting every change). We conclude the description of each pain point with a discussion of opportunities for future tools.

1 Our Three Vs of MLOps aren't meant to be confused with the Three Vs of Big Data (Volume, Variety, Velocity) [61]. The first authors learned of the Big Data Vs after drafting the MLOps Vs and were surprised to find similarities around volume/versioning and velocity.

Participant   Role          Company Size   Application
p1            MLE Manager   Large          Autonomous vehicles
p2            MLE           Medium         Autonomous vehicles
p3            MLE           Small          Computer hardware
p4            MLE           Medium         Retail
p5            MLE Manager   Large          Ads
p6            MLE           Large          Cloud computing
p7            MLE           Small          Finance
p8            MLE           Small          NLP
p10           MLE           Small          OCR + NLP
p11           MLE Manager   Medium         Banking
p12           MLE           Large          Cloud computing
p13           MLE           Small          Bioinformatics
p14           MLE           Medium         Cybersecurity
p15           MLE           Medium         Fintech
p16           MLE           Small          Marketing and analytics
p17           MLE           Medium         Website builder
p18           MLE           Large          Recommender systems
p19           MLE Manager   Large          Ads

Table 1: Anonymized description of interviewees. Small companies have fewer than 100 employees; medium-sized companies have 100-1000 employees, and large companies have 1000 or more employees.

2 RELATED WORK

Several books and papers in the traditional software engineering literature describe the need for DevOps, a combination of software developers and operations teams, to streamline the process of delivering software in organizations [13, 37, 39, 40]. Similarly, MLOps, or DevOps principles applied to machine learning, has emerged from the rise of machine learning (ML) application development in software organizations. MLOps is a nascent field, where most existing papers give definitions and overviews of MLOps, as well as its relation to ML, software engineering, DevOps, and data engineering [28, 34, 44, 73]. MLOps poses unique challenges because of its focus on developing, deploying, and sustaining models, or artifacts that need to reflect data as data changes over time [59, 65, 67]. We discuss work related to MLOps workflows, challenges, and interview studies for ML.

MLOps Workflow. The MLOps workflow involves supporting data collection and processing, experimentation, evaluation and deployment, and monitoring and response, as shown in Figure 1. Several research papers and companies have proposed tools to accomplish various tasks in the workflow, such as data pre-processing [22, 58, 60] and experiment tracking [6, 74, 81]. Crankshaw et al. studied the problem of model deployment and low-latency prediction serving [12]. With regards to validating changes in production systems, some researchers have studied CI (Continuous Integration) for ML and proposed preliminary solutions—for example, ease.ml/ci streamlines data management and proposes unit tests for overfitting [3], Garg et al. survey different MLOps tools [19], and some papers introduce tools to perform validation and monitoring in production ML pipelines [9, 31, 63].

MLOps Challenges. Sculley et al. were early proponents that production ML systems raise special challenges and can be hard to maintain over time, based on their experience at Google [64]. Since then, several research projects have emerged to explore and tackle individual challenges in the MLOps workflow. For example, some discuss the need to manage data provenance and training context for model debugging purposes [8, 17, 24, 50]. Others describe the challenges of handling state and ensuring reproducibility (i.e., "managing messes") while using computational notebooks [23, 42, 66]. Additionally, data distribution shifts have been technically but not operationally studied—i.e., how humans debug such shifts in practice [46, 51, 57, 71, 78]. Rather than focus on a single pain point, Lee et al. analyze challenges across ML workflows on an open-source ML platform [36]. Similarly, Xin et al. [79] analyze ML pipelines at Google to understand typical model configurations and retraining patterns. Polyzotis et al. [54, 55] survey challenges centric to data management for machine learning deployments. Paleyes et al. review published reports of individual ML deployments and survey common challenges [52]. Our study instead focuses on issues across the production workflow (i.e., MLOps practices and challenges) as opposed to individual pain points, identified by interviewing those who are most affected by them—the ML engineers.

Data Science and ML-Related Interview Studies. Kandel et al. [30] interview data analysts at enterprises, focusing on broader organizational contexts like we do; however, MLOps workflows and challenges extend beyond data analysis. Other studies build on Kandel et al.'s work, exploring aspects such as collaboration, code practices, and tools [32, 33, 49, 53, 82], all centered on general data analysis and data science, as opposed to transitioning workflows in ML to production. Many ML-related interview studies focus on a single tool, task, or challenge in the workflow—for example, AutoML [75, 80], data iteration [25], model training [72], minimizing bias in ML models [26, 35, 43], and building infrastructure for ML pipelines [47]. Sambasivan et al. [62] study data quality issues during machine learning, as opposed to challenges in MLOps. Other ML-related interview studies focus on specific applications of ML, such as medicine [56], customer service [16], and interview processing [7]. Some interview studies report on software engineering practices for ML development; however, they focus only on a few applications and primarily on engineering, not operational, challenges [5, 41]. Our interview study aims to be both broad and focused: we consider many applications and companies, but center on the engineers who perform MLOps tasks, with an eye towards highlighting both engineering and operational practices and challenges. Additionally, our focus is on learning how models are deployed and sustained in production—we discover this by interviewing ML practitioners directly.

3 METHODS

Following review by our institution's review board, we conducted an interview study of 18 ML Engineers (MLEs) working across a wide variety of sectors to learn more about their first-hand experiences serving and maintaining models in production.

3.1 Participant Recruitment

We recruited persons who were responsible for the development, regular retraining, monitoring and deployment of any ML model in
production. A description of the 18 MLEs (22% female-identifying2) is shown in Table 1. The MLEs we interviewed varied in their educational backgrounds, years of experience, roles, team size, and work sector. Recruitment was conducted in rounds over the course of an academic year (2021-2022). In each round, between three to five candidates were reached by email and invited to participate. We relied on our professional networks and open calls posted on MLOps channels in Discord3, Slack4, and Twitter to compile a roster of candidates. The roster was incrementally updated roughly after every round of interviews, integrating information gained from the concurrent coding and analysis of transcripts (Section 3.3). Recruitment rounds were repeated until we reached saturation on our findings [48].

3.2 Interview Protocol

With each participant, we conducted semi-structured interviews over video call lasting 45 to 75 minutes each. Over the course of the interview, we asked descriptive, structural, and contrast questions abiding by ethnographic interview guidelines [68]. The questions are listed in Appendix A. Specifically, our questions spanned six categories: (1) the type of ML task(s) they work on; (2) the approach(es) they use for developing and experimenting on models; (3) how and when they transition from development/experimentation to deployment; (4) how they evaluate their models prior to deployment; (5) how they monitor their deployed models; and (6) how they respond to issues or bugs that may emerge during deployment or otherwise. We covered these categories to get an overarching understanding of an ML lifecycle deployment, from conception to production sustenance.

Participants received a written consent form before the interview, and agreed to participate free of compensation. As per our agreement, we automatically transcribed the interviews using Zoom software. In the interest of privacy and confidentiality, we did not record audio or video of the interviews. Transcripts were redacted of personally identifiable information before being uploaded to a secured drive in the cloud. More information about the transcripts can be found in Appendix B.

3.3 Transcript Coding & Analysis

Following a grounded theory approach [10, 70], we employed open and axial coding to analyze our transcripts. We used MaxQDA, a common qualitative analysis software package for coding and comparative analysis. During a coding pass, two study personnel independently read interview transcripts closely to group passages into codes or categories. Coding passes were either top-down or bottom-up, meaning that codes were derived from theory or induced from interview passages, respectively. Between coding passes, study personnel met to discuss surprises and other findings, and following consensus, the code system was revised to reflect changes to the emerging theory. Coding passes were repeated until reaching convergence. More information about the codes is shown in Appendix C, including a list of the most frequently occurring codes (Table 3) and co-occurring codes (Figure 4).

4 MLOPS PRACTICES: OUR FINDINGS

In this section, we present information about common practices in production ML deployments that we learned from the interviews. First, we describe common tasks in the production ML workflow in Section 4.1. Next, we introduce the Three Vs of MLOps, grounding both the discussion of findings and the challenges that we will explain in Section 5. Then in Section 4.3, we describe the strategies ML engineers leverage to produce successful experiment ideas. In Section 4.4, we discuss organizational efforts to effectively evaluate models. Finally, in Section 4.5, we investigate the hacks ML engineers use to sustain high performance in production ML pipelines.

4.1 Tasks in the Production ML Lifecycle

We characterized ML engineers' workflows into four high-level tasks, each of which employs a wide variety of tools. We briefly describe each task in turn, and elaborate on them as they arise in our findings below.

Data Collection and Labeling. Data collection spans sourcing new data, wrangling data from sources into a centralized repository, and cleaning data. Data labeling can be outsourced (e.g., Mechanical Turk) or performed in-house with teams of annotators. Since descriptions and interview studies of data collection, analysis, wrangling and labeling activities can be found in related papers [11, 29, 30, 62], we focus our summary of findings on the other three tasks.

Feature Engineering and Model Experimentation. ML engineers typically focus on improving ML performance, measured via metrics such as accuracy or mean-squared-error. Experiments can be data-driven or model-driven; for example, an engineer can create a new feature or change the model architecture from tree-based to neural network-based.

Model Evaluation and Deployment. A model is typically evaluated by computing a metric (e.g., accuracy) over a collection of labeled data points hidden at training time, or a validation dataset, to see if its performance is better than what the currently-running production model achieved during its evaluation phase. Deployment involves reviewing the proposed change, possibly staging the change to increasing percentages of the population, or A/B testing on live users, and keeping records of any change to production in case of a necessary rollback.

ML Pipeline Monitoring and Response. Monitoring ML pipelines and responding to bugs involve tracking live metrics (via queries or dashboards), slicing and dicing sub-populations to investigate prediction quality, patching the model with non-ML heuristics for known failure modes, and finding in-the-wild failures and adding them to the evaluation set.

4.2 Three Vs of MLOps: Velocity, Validation, Versioning

When developing and pushing ML models to production, three properties of the workflow and infrastructure dictate how successful deployments will be: Velocity, Validation, and Versioning, discussed in turn.

2 The skewed gender distribution is one possible indicator of sampling bias. We openly and actively recruited female-identifying MLEs to mitigate some sampling bias.
3 mlops.discord.com
4 mlops-community.slack.com

Velocity. Since ML is so experimental in nature, it's important to be able to prototype and iterate on ideas quickly (e.g., go from a new idea to a trained model in a day). ML engineers attributed their productivity to development environments that prioritized high experimentation velocity and debugging environments that allowed them to test hypotheses quickly (P1, P3, P6, P10, P11, P14, P18).

Validation. Since errors become more expensive to handle when users see them, it's good to test changes, prune bad ideas, and proactively monitor pipelines for bugs as early as possible (P1, P2, P5, P6, P7, P10, P14, P15, P18). P1 said: "The general theme, as we moved up in maturity, is: how do you do more of the validation earlier, so the iteration cycle is faster?"

Versioning. Since it's impossible to anticipate all bugs before they occur, it's helpful to store and manage multiple versions of production models and datasets for querying, debugging, and minimizing production pipeline downtime. ML engineers responded to buggy models in production by switching the model to a simpler, historical, or retrained version (P6, P8, P10, P14, P15, P18).

4.3 Machine Learning Engineering is Very Experimental, Even in Production

ML engineering, as a discipline, is highly experimental and iterative in nature, especially compared to typical software engineering. Contrary to popular negative sentiment around the large numbers of experiments and models that don't make it to production, we found that it's actually okay for experiments and models not to make it to production. What matters is making sure ideas can be prototyped and validated quickly—so that bad ones can be pruned away immediately. While there is no substitute for on-the-job experience to learn how to choose successful projects (P5), we document some self-reported strategies from our interviewees.

4.3.1 Good project ideas start with collaborators. Project ideas, such as new features, came from or were validated early by domain experts, data scientists, and analysts who had already performed a lot of exploratory data analysis. P14 and P17 independently recounted successful project ideas that came from asynchronous conversations on Slack: P17 said, "I look for features from data scientists, [who have ideas of] things that are correlated with what I'm trying to predict." Solely relying on other collaborators wasn't enough, though—P5 mentioned that they "still need to be pretty proactive about what to search for."

Some organizations explicitly prioritized cross-team collaboration as part of their culture. P11 said:

We really think it's important to bridge that gap between what's often, you know, a [subject matter expert] in one room annotating and then handing things over the wire to a data scientist—a scene where you have no communication. So we make sure there's both data science and subject matter expertise representation [on our teams].

To foster a more collaborative culture, P16 discussed the concept of "building goodwill" with other teams through tedious tasks that weren't always explicitly a part of company plans:

Sometimes we'll fix something [here and] there to like build some good goodwill, so that we can call on them in the future...I do this stuff as I have to do it, not because I'm really passionate about doing it.

4.3.2 Iterate on the data, not necessarily the model. Several participants recommended focusing on experiments that provide additional context to the model, typically via new features (P5, P6, P11, P12, P14, P16, P17, P18, P19). P17 mentioned that most ML projects at their organization centered around adding new features. P14 mentioned that one of their current projects was to move feature engineering pipelines from Scala to SparkSQL (a language more familiar to ML engineers and data scientists), so experiment ideas could be coded and validated faster. P11 noted that iterating on the data, not the model, was preferable because it resulted in faster velocity:

I'm gonna start with a [fixed] model because it means faster [iterations]. And often, like most of the time empirically, it's gonna be something in our data that we can use to kind of push the boundary...obviously it's not like a dogmatic We Will Never Touch The Model, but it shouldn't be our first move.

Prior work has also identified the importance of data work [62].

4.3.3 Account for diminishing returns. At many organizations (especially larger companies), deployment can occur in stages—i.e., first validated offline, then deployed to 1% of production traffic, then validated again before a deployment to larger percentages of traffic. Some interviewees (P5, P6, P18) explained that experiment ideas typically have diminishing performance gains in later stages of deployment. As a result, P18 mentioned that they would initially try multiple ideas but focus only on ideas with the largest performance gains in the earliest stage of deployment; they emphasized the importance of "validat[ing] ideas offline...[to make] productivity higher." P19 corroborated this by saying end-to-end staged deployments could take several months, making it a high priority to kill ideas with minimal gain in early stages to avoid wasting future time. Additionally, to help with validating early, many engineers discussed the importance of a sandbox for stress-testing their ideas (P1, P5, P6, P11, P12, P13, P14, P15, P17, P18, P19). For some engineers, this was a single Jupyter notebook; others' organizations had separate sandbox environments to load production models and run ad-hoc queries.

4.3.4 Small changes are preferable to larger changes. In line with software best practices, interviewees discussed keeping their code changes as small as possible for multiple reasons, including faster code review, easier validation, and fewer merge conflicts (P2, P5, P6, P10, P11, P18, P19). Additionally, changes in large organizations were primarily made in config files instead of main application code (P1, P2, P5, P10, P12, P19). For example, instead of editing parameters directly in a Python model training script, it was preferable to edit a config file (e.g., JSON or YAML) of parameters and link the config file to the model training script.

P19 described how, as their team matured, they edited the model code less: "Eventually it was the [DAG] pipeline code which changed more...there was no reason to touch the [model] code...everything is config-based." P5 mentioned that several of their experiments
involved "[taking] an existing model, modify[ing] [the config] with some changes, and deploying it within an existing training cluster." Supporting config-driven development was important, P1 said, otherwise bugs might arise when promoting the experiment idea to production:

People might forget to, when they spawn multiple processes, to do data loading in parallel, they might forget to set different random seeds, especially [things] you have to do explicitly many times...you're talking a lot about these small, small things you're not going to be able to catch [at deployment time] and then of course you won't have the expected performance in production.

Because ML experimentation requires many considerations to yield correct results—e.g., setting random seeds, accessing the same versions of code libraries and data—constraining engineers to config-only changes can reduce the number of bugs.
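To make the config-only pattern concrete, the sketch below shows one minimal way such a setup could look. It is an illustration under assumptions rather than any interviewee's actual stack: the parameter names, file layout, and synthetic data are hypothetical.

    # Minimal sketch of config-driven experimentation (hypothetical names and data):
    # every tunable value lives in a small JSON file, so an experiment is a config
    # diff that can be reviewed and versioned, rather than an edit to training code.
    import json
    import random
    import tempfile

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    def train_from_config(config_path):
        with open(config_path) as f:
            cfg = json.load(f)
        # Seed every library explicitly, the kind of small step P1 notes is easy to forget.
        random.seed(cfg["seed"])
        np.random.seed(cfg["seed"])
        X = np.random.rand(500, 4)              # stand-in for a real feature table
        y = (X[:, 0] + X[:, 1] > 1).astype(int)
        model = GradientBoostingClassifier(
            n_estimators=cfg["n_estimators"],
            learning_rate=cfg["learning_rate"],
            random_state=cfg["seed"],
        )
        model.fit(X, y)
        return model

    if __name__ == "__main__":
        cfg = {"seed": 0, "n_estimators": 100, "learning_rate": 0.1}
        with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
            json.dump(cfg, f)
        train_from_config(f.name)

In this style, promoting an experiment amounts to pointing the production training job at a different config file, which is easier to review and to roll back than a code change.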
4.4 Operationalizing Model Evaluation is an Active Effort

We found that MLEs described intensive model evaluation efforts at their companies to keep up with data changes, product and business requirement changes, user changes, and organizational changes. The goal of model evaluation is to prevent repeated failures and bad models from making it to production while maintaining velocity—i.e., the ability for pipelines to quickly adapt to change.

4.4.1 Validation datasets should be dynamic. Many engineers reported processes to analyze live failure modes and update the validation datasets to prevent similar failures from happening again (P1, P2, P5, P6, P8, P11, P15, P16, P17, P18). P1 described this process as a departure from what they had learned in academia: "You have this classic issue where most researchers are evaluat[ing] against fixed data sets...[but] most industry methods change their datasets." We found that these dynamic validation sets served two purposes: (1) the obvious goal of making sure the validation set reflects live data as much as possible, given new learnings about the problem and shifts in the aggregate data distribution, and (2) the more subtle goal of addressing localized shifts that subpopulations may experience (e.g., low accuracy for a specific label).

The challenge with (2) is that many subpopulations are typically unforeseen; many times they are discovered post-deployment. To enumerate them, P11 discussed how they systematically bucketed their data points based on the model's error and created validation sets for each underperforming bucket:

Some [of the metrics in my tool] are standard, like a confusion matrix, but it's not really effective because it doesn't drill things down [into specific subpopulations that users care about]. Slices are user-defined, but sometimes it's a little bit more automated. [During offline evaluation, we] find the error bucket that [we] want to drill down, and then [we] either improve the model in very systematic ways or improve [our] data in very systematic ways.
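As a rough illustration of this kind of slice-level evaluation (the column names, slice definition, and accuracy target below are hypothetical, not P11's tool), one could compute the metric per user-defined bucket and flag the underperforming ones:

    # Hedged sketch of slice-level evaluation: compute the metric per
    # subpopulation and flag buckets that fall below a target, rather than
    # reporting a single aggregate number.
    import pandas as pd

    def evaluate_slices(df, slice_col, min_accuracy=0.9):
        """df needs 'label' and 'prediction' columns plus a column to slice on."""
        df = df.assign(correct=(df["label"] == df["prediction"]).astype(float))
        per_slice = (
            df.groupby(slice_col)["correct"]
            .agg(accuracy="mean", n="size")
            .reset_index()
        )
        per_slice["below_target"] = per_slice["accuracy"] < min_accuracy
        return per_slice.sort_values("accuracy")

    frame = pd.DataFrame({
        "label":      [1, 0, 1, 1, 0, 1, 0, 0],
        "prediction": [1, 0, 0, 1, 0, 0, 0, 1],
        "region":     ["us", "us", "eu", "eu", "eu", "apac", "apac", "apac"],
    })
    print(evaluate_slices(frame, "region"))

Buckets that fall below the target then become candidates for dedicated validation sets, per the practice described above.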
Rather than follow an anticipatory approach of constructing different failure modes in the offline validation phase—e.g., performance drops in subpopulations users might care deeply about—like P11 did, P8 offered a reactive strategy of spawning a new dataset for each observed live failure: "Every [failed prediction] gets into the same queue, and 3 of us sit down once a week and go through the queue...then our [analysts] collect more [similar] data." This new dataset was then used in the offline validation phase in future iterations of the production ML lifecycle.

While processes to dynamically update the validation datasets ranged from human-in-the-loop to frequent synthetic data construction (P6), we found that higher-stakes applications of ML (e.g., autonomous vehicles) created separate teams to manage the dynamic evaluation process. P1 said:

We had to move away from only aggregate metrics like MAP towards the ability to curate scenarios of interest, and then validate model performance on them specifically. So, as an example, you can't hit pedestrians, right. You can't hit cyclists. You need to work in roundabouts. You have a base layer of ML performance and the performance is not perfect...what you need to be able to do in a mature MLOps pipeline is go very quickly from user recorded bug, to not only are you going to fix it, but you also have to be able to drive improvements to the stack by changing your data based on those bugs.

Although the dynamic evaluation process might require many humans in the loop—a seemingly intense organizational effort—engineers thought it was crucial to have. When asked why they invested a lot of energy into their dynamic process, P11 said: "I guess it was always a design principle—the data is [always] changing."

4.4.2 Validation systems should be standardized. The dynamic nature of validation processes makes it hard to effectively maintain versions of such processes, motivating efforts to standardize them. Several participants recalled instances of bugs stemming from inconsistent definitions of successful validation—i.e., where different engineers on their team evaluated models differently, causing unexpected changes to live performance metrics (P1, P3, P4, P5, P6, P7, P17). For instance, P4 lamented that every engineer working on a particular model had a cloned version of the main evaluation notebook, with a few changes. The inconsistent requirements for promoting a model to production caused headaches while monitoring and debugging, so their company instated a new effort to standardize evaluation scripts.

Although other MLOps practices highlighted the synergy between velocity and validating early (Section 4.3.3), we found that standardizing the validation system exposed a tension between velocity (i.e., being able to promote models quickly) and validating early, or eliminating the possibility of some bugs at deployment time. Since many validation systems needed to frequently change, as previously discussed, turnaround times for code review and merges to the main branch often could not keep up with the new tests and collections of data points added to the validation system. So, it was easier for engineers to fork and modify the evaluation system. However, P2 discussed that the decrease in velocity was worth it for their organization when they standardized evaluation:

We have guidelines on how to run eval[uation] comprehensively when any particular change is being made. Now there is a merge queue, and we have to make sure
that we process the merge queue in order, and that improvements are actually also reflected in subsequent models, so it requires some coordination. We'd much rather gate deploying a model than deploy a model that's bad. So we tend to be pretty conservative [now].

A standardized evaluation system also reduced friction in deploying ML in large companies and high-stakes settings. P5 discussed that for some models, deployments needed approvals across the organization, and it was much harder to justify a deployment with a custom or ad-hoc evaluation process: "At the end of the day, it's all a business-driven decision...for something that has so much revenue riding on it, [you can't have] a subjective opinion on whether [your] model is better."

4.4.3 Spread a deployment across multiple stages, and evaluate at each stage. Several organizations, particularly those with many customers, had a multi-stage deployment process for new models or model changes, progressively evaluating at each stage (P3, P5, P6, P7, P8, P12, P15, P17, P18, P19). P6 described the staged deployment process as:

In [the large companies I've worked at], when we deploy code it goes through what's called a staged deployment process, where we have designated test clusters, [stage 1] clusters, [stage 2] clusters, then the global deployment [to all users]. The idea here is you deploy increasingly along these clusters, so that you catch problems before they've met customers.

Each organization had different names for its stages (e.g., test, dev, canary, staging, shadow, A/B) and different numbers of stages in the deployment process (usually between one and four). The stages helped invalidate models that might perform poorly in full production, especially for brand-new or business-critical pipelines. P15 recounted an initial chatbot product launch using their staged deployment process, claiming it successfully "made it" because they were able to catch failures and update the model in early stages:

We spent a long time very slowly, ramping up the model to very small percentages of traffic and watching what happened. [When there was a failure mode,] a product person would ping us and say: hey, this was kind of weird, should we create a rule around this [suggested text] to filter this out?

Of particular note was one type of stage, the shadow stage—where predictions were generated live but not surfaced to users—that came before a deployment to a small fraction of live users. P14 described how they used the shadow stage to assess how impactful new features could be:

So if we're testing out a new idea and want to see, what would the impact be for this new set of features without actually deploying that into production, we can deploy that in a type of shadow mode where it's running alongside the production model and making predictions. We track all the metrics for [both] models in [our data lake]...so we can compare them easily.

Shadow mode had other use cases—for instance, P15 discussed how shadow mode was used to convince other stakeholders (e.g., product managers, business analysts) that a new model or change to an existing model was worth putting in production. P19 mentioned that they use shadow mode to invalidate experiment ideas that would eventually fail. But shadow mode alone wasn't a substitute for all stages of deployment—P6 said, "[in the early stage], we don't have a good sample of how the model is going to behave in production"—requiring the multiple stages. Additionally, in products that have a feedback loop (e.g., recommender systems), it is impossible to evaluate the model in shadow mode because users do not interact with shadow predictions.
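A simplified sketch of the shadow-stage comparison described above follows. It assumes both models' predictions are logged per request alongside eventual ground-truth labels; the schema and metrics are illustrative rather than any specific company's.

    # Hedged sketch: the candidate ("shadow") model scores live traffic next to
    # the production model, predictions are logged, and metrics are compared
    # offline once labels arrive.
    import pandas as pd

    def shadow_report(log):
        """log has columns: label, prod_pred, shadow_pred (one row per request)."""
        return pd.Series({
            "prod_accuracy":   (log["label"] == log["prod_pred"]).mean(),
            "shadow_accuracy": (log["label"] == log["shadow_pred"]).mean(),
            "disagreement":    (log["prod_pred"] != log["shadow_pred"]).mean(),
        })

    log = pd.DataFrame({
        "label":       [1, 0, 1, 1, 0, 1],
        "prod_pred":   [1, 0, 0, 1, 0, 1],
        "shadow_pred": [1, 0, 1, 1, 1, 1],
    })
    print(shadow_report(log))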
4.4.4 ML evaluation metrics should be tied to product metrics. Multiple participants stressed the importance of evaluating metrics critical to the product, such as click-through rate or user churn rate, rather than ML-specific metrics alone like MAP (P5, P7, P15, P16, P11, P17, P18, P19). The need to evaluate product-critical metrics stemmed from close collaboration with other stakeholders, such as product managers and business operators. P11 felt that a key reason many ML projects fail is that they don't measure metrics that will yield the organization value:

Tying [model performance] to the business's KPIs (key performance indicators) is really important. But it's a process—you need to figure out what [the KPIs] are, and frankly I think that's how people should be doing AI. It [shouldn't be] like: hey, let's do these experiments and get cool numbers and show off these nice precision-recall curves to our bosses and call it a day. It should be like: hey, let's actually show the same business metrics that everyone else is held accountable to to our bosses at the end of the day.

Since product-specific metrics are, by definition, different for different ML models, it was important for engineers to treat choosing the metrics as an explicit step in their workflow and align with other stakeholders to make sure the right metrics were chosen. For example, P16 said that for every new ML project they work on, their "first task is to figure out, what are customers actually interested in, or what's the metric that they care about." P17 said that every model change in production is validated by the product team: "if we can get a statistically significant greater percentage [of] people to subscribe to [the product], then [we can fully deploy]."
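One plausible form of the product-metric gate P17 alludes to is a standard two-proportion z-test on a live experiment; the counts and significance level below are invented for illustration and are not drawn from any interviewee.

    # Hedged illustration: deploy fully only if the treatment arm shows a
    # statistically significant lift in subscription rate.
    from math import sqrt
    from statistics import NormalDist

    def two_proportion_z(success_a, n_a, success_b, n_b):
        p_a, p_b = success_a / n_a, success_b / n_b
        p_pool = (success_a + success_b) / (n_a + n_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 1 - NormalDist().cdf(z)   # one-sided: is the new model better?
        return z, p_value

    # control arm: current model; treatment arm: candidate model (made-up counts)
    z, p = two_proportion_z(success_a=480, n_a=10_000, success_b=540, n_b=10_000)
    print(f"z={z:.2f}, one-sided p={p:.4f}, ship={p < 0.05}")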
For some organizations, a consequence of tightly coupling evaluation to product metrics was an additional emphasis on important customers during evaluation (P6, P10). P6 described how, at their company, experimental changes that increased aggregate metrics could sometimes be prevented from going to production:

There's an [ML] system to allocate resources for [our product]. We have hard-coded rules for mission critical customers. Like at the beginning of Covid, there were hospital [customers] that we had to save [resources] for.

Participants who came from research or academia noted that tying evaluation to the product metrics was a different experience. P6 commented:

I think about where the business will benefit from what we're building. We're not just shipping off fake wins, like we're really in the value business. You've got to see value from AI in your organization in order to feel like [our product] was worth it to you, and I guess that's
a mindset that we really ought to have [as a broader community].

4.5 Sustaining Models Requires Deliberate Software Engineering and Organizational Practices

Here, we present a list of strategies ML engineers employed during monitoring and debugging phases to sustain model performance post-deployment.

4.5.1 Create new versions: frequently retrain on and label live data. Production ML bugs can be detected by tracking pipeline performance, measured by metrics like prediction accuracy, and triggering an alert if there is a drop in performance that exceeds some predefined threshold. On-call ML engineers noted that reacting to an ML-related bug in production often took a long time, motivating them to find alternative strategies to quickly restore performance (P1, P7, P8, P10, P14, P15, P17, P19). P14 mentioned automatically retraining the model every day so model performance would not suffer for longer than a day:

Why did we start training daily? As far as I'm aware, we wanted to start simple—we could just have a single batch job that processes new data and we wouldn't need to worry about separate retraining schedules. You don't really need to worry about if your model has gone stale if you're retraining it every day.

Retraining cadences ranged from hourly (P18) to every few months (P17) and were different for different models within the same organization (P1). None of the participants interviewed reported any scientific procedure for determining the cadence; the retraining cadences were set in a way that streamlined operations for the organization in the easiest way. For example, P18 mentioned that "retraining takes about 3 to 4 hours, so [they] matched the cadence with it such that as soon as [they] finished any one model, they kicked off the next training [job]."

Some engineers reported an inability to retrain unless they had freshly labeled data, motivating their organizations to set up a team to frequently label live data (P1, P8, P10, P11, P16). P10 reported that a group within their company periodically collected new documents for their language models to fine-tune on. P11 mentioned an in-house team of junior analysts to annotate the data; however, a problem was that these annotations frequently conflicted and the organization did not know how to reconcile the noise.

4.5.2 Maintain old versions as fallback models. Another way to minimize downtime when a model is known to be broken is to have a fallback model to revert to—either an old version or a simple version. P19 said: "if the production model drops and the calibration model is still performing within a [specified] range, we'll fall back to the calibration model until someone will fix the production model." P18 mentioned that it was important to keep some model up and running, even if they "switched to a less economic model and had to just cut the losses."

4.5.3 Maintain layers of heuristics. P14 and P15 each discussed how their models are augmented with a final, rule-based layer to keep live predictions more stable. For example, P14 mentioned working on an anomaly detection model and adding a heuristics layer on top to filter the set of anomalies that surface based on domain knowledge. P15 discussed one of their language models for a customer support chatbot:

The model might not have enough confidence in the suggested reply, so we don't return [the recommendation]. Also, language models can say all sorts of things you don't necessarily want it to—another reason that we don't show suggestions. For example, if somebody asks when the business is open, the model might try to quote a time when it thinks the business is open. [It might say] "9 am", but the model doesn't know that. So if we detect time, then we filter that [reply]. We have a lot of filters.

Constructing such filters was an iterative process—P15 mentioned constantly stress-testing the model in a sandbox, as well as observing suggested replies in early stages of deployment, to come up with filter ideas. Creating filters was a more effective strategy than trying to retrain the model to say the right thing; the goal was to keep some version of a model working in production with little downtime. This combination of modern model-driven ML and old-fashioned rule-based AI indicates a need for managing filters (and versions of filters) in addition to managing learned models. The engineers we interviewed managed these artifacts themselves.
4.5.4 Validate data going in and out of pipelines. While participants reported that model parameters were typically "statically" validated before deploying to full production, features and predictions were continuously monitored for production models (P1, P2, P6, P8, P14, P16, P17, P18, P19). Several metrics were monitored—P2 discussed hard constraints for feature columns (e.g., bounds on values), P6 talked about monitoring completeness (i.e., fraction of non-null values) for features, P16 mentioned embedding their pipelines with "common sense checks," implemented as hard constraints on columns, and P8 described schema checks—making sure each data item adheres to an expected set of columns and their types.

While rudimentary data checks were embedded in most systems, P6 discussed that it was hard to figure out what higher-order data checks to compute:

Monitoring is both metrics and then a predicate over those metrics that triggers alerts. That second piece doesn't exist—not because the infrastructure is hard, but because no one knows how to set those predicate values...for a lot of this stuff now, there's engineering headcount to support a team doing this stuff. This is people's jobs now; this constant, periodic evaluation of models.

Some participants discussed using black-box data monitoring services but lamented that their alerts did not prevent failures (P7, P14). P7 said:

We don't find those metrics are useful. I guess, what's the point in tracking these? Sometimes it's really to cover my ass. If someone [hypothetically] asked, how come the performance dropped from X to Y, I could go back in the data and say, there's a slight shift in the
user behavior that causes this. So I can do an analysis of trying to convince people what happened, but can I prevent [the problem] from happening? Probably not. Is that useful? Probably not.

While basic data validation was definitely useful for the participants, many of the participants expressed pain points with existing techniques and solutions, which we discuss further in Section 5.1.2.

4.5.5 Keep it Simple. Many participants expressed an aversion to complexity, preferring to rely on simple models and algorithms whenever possible (P1, P2, P6, P7, P11, P12, P14, P15, P16, P17, P19). P7 described the importance of relying on a simple training and hyperparameter search algorithm:

In finance, we always split data by time. The thing I [learned in finance] is, don't exactly try to tune the hyperparameters too much, because that just overfits to historic data.

P7 discussed choosing tree-based models over deep learning models for their ease of use, which simplified post-deployment maintenance: "I can probably do the same thing with neural nets. But it's not worth it. [After] deployment it just doesn't make any sense at all." However, other participants chose to use deep learning as a means of simplifying their pipelines (P1, P16). For instance, P16 described training a small number of higher-capacity models rather than a separate model for each target: "There were hundreds of products that [customers] were interested in, so we found it easier to instead train 3 separate classifiers that all shared the same underlying embedding...from a neural network."

While there was no universally agreed-upon answer to a question as broad as, "should I use deep learning?" we found a common theme in how participants leveraged deep learning models. Specifically, for ease of post-deployment maintenance (e.g., an ability to retroactively debug pipelines), outputs of deep learning models were typically human-interpretable (e.g., image segmentation, object recognition, probabilities or likelihoods as embeddings). P1 described a push at their company to rely more on neural networks:

A general trend is to try to move more into the neural network, and to combine models wherever possible so there are fewer bigger models. Then you don't have these intermediate dependencies that cause drift and performance regressions...you eliminate entire classes of bugs and issues by consolidating all these different piecemeal stacks.

4.5.6 Organizationally Supporting ML Engineers Requires Deliberate Practices. Our interviewees reported various organizational processes for sustaining models as part of their ML infrastructure. P6, P12, P14, P16, P18, and P19 described on-call processes for supervising production ML models. For each model, at any point in time, some ML engineer would be on call, or primarily responsible for it. Any bug or incident observed (e.g., user complaint, pipeline failure) would receive a ticket, created by the on-call engineer, and be placed in a queue. On-call rotations lasted a week or two weeks. At the end of a shift, an engineer would create an incident report—possibly one for each bug—detailing major issues that occurred and how they were fixed.

Additionally, P6, P7, P8, P10, P12, P14, P15, P16, P18, and P19 mentioned having a central queue of production ML bugs that every engineer added tickets to and processed tickets from. Often this queue was larger than what engineers could process in a timely manner, so they assigned priorities to tickets. Finally, P6, P7, P10, and P15 discussed having Service Level Objectives (SLOs), or commitments to minimum standards of performance, for pipelines in their organizations. For example, a pipeline to classify images could have an SLO of 95% accuracy. A benefit of using the SLO framework for ML pipelines is a clear indication of whether a pipeline is performing well or not—if the SLO is not met, the pipeline is broken, by definition.

5 MLOPS CHALLENGES AND OPPORTUNITIES

In this section, we enumerate common pain points and anti-patterns observed across interviews. We discuss each pain point as a tension or synergy between the Three Vs (Section 4.2). At the end of each pain point, we describe our takeaways of ideas for future tools and research. Finally, in Section 5.3, we characterize layers of the MLOps tool stack for those interested in building MLOps tools.

5.1 Pain Points in Production ML

We focus on four themes that we didn't know before the interviews: the mismatch between development and production environments, handling a spectrum of data errors, the ad-hoc nature of ML bugs, and long validation processes.

5.1.1 Mismatch Between Development and Production Environments. While it is important to create a separate development environment to validate ideas before promoting them to production, it is also necessary to minimize the discrepancies between the two environments. Otherwise, unanticipated bugs might arise in production (P1, P2, P6, P8, P10, P13, P14, P15, P18). Creating similar development and production environments exposes a tension between velocity and validating early: development cycles are more experimental and move faster than production cycles; however, if the development environment is significantly different from the production environment, it's hard to validate (ideas) early. We discuss three examples of pain points caused by the environment mismatch—data leakage, Jupyter notebook philosophies, and code quality.

Data Leakage. A common issue was data leakage—i.e., assuming during training that there is access to data that does not exist at serving time—an error typically discovered after the model was deployed and several incorrect live predictions were made. Anticipating any possible form of data leakage is tedious and hinders velocity; thus, sometimes leakage was retroactively checked during code review (P1). The nature of data leakage ranged across reported bugs—for example, P18 recounted an instance where embedding models were trained on the same split of data as a downstream model, P2 described a class imbalance bug where they did not have enough labeled data for a subpopulation at training time (compared to its representation at serving time), and P15 described a bug in which feedback delay (time between making a live prediction and
getting its ground-truth label) was ignored while training. Different types of data leakage resulted in different magnitudes of ML performance drops: for example, in a pipeline with daily retraining, feedback delays could prevent retraining from succeeding because of a lack of new labels. However, in P18's embedding leakage example, the resulting model was slightly more overfitted than expected, yielding lower-than-expected performance in production but not completely breaking.
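As one small, hedged example of a leakage guard (the column names and cutoff are hypothetical), splitting strictly by time keeps rows and labels that would not have been available at serving time out of training, in the spirit of P7's "always split data by time" remark in Section 4.5.5:

    # Sketch of a time-based train/validation split to avoid training on data
    # that would not exist at serving time.
    import pandas as pd

    def time_split(df, timestamp_col, cutoff):
        df = df.sort_values(timestamp_col)
        train = df[df[timestamp_col] < cutoff]
        valid = df[df[timestamp_col] >= cutoff]
        return train, valid

    events = pd.DataFrame({
        "ts":    pd.to_datetime(["2022-01-03", "2022-02-10", "2022-03-01", "2022-03-15"]),
        "x":     [0.2, 0.5, 0.1, 0.9],
        "label": [0, 1, 0, 1],
    })
    train, valid = time_split(events, "ts", cutoff="2022-03-01")
    print(len(train), "train rows;", len(valid), "validation rows")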
Strong Opinions on Jupyter Notebooks. Participants described strongly opinionated and different philosophies with respect to how to use Jupyter notebooks in their workflows. Jupyter notebooks were heavily used in development to support high velocity, which we did not find surprising. However, we were surprised that although participants generally acknowledged worse code quality in notebooks, some participants preferred to use them in production to minimize the differences between their development and production environments. P6 mentioned that they could debug quickly when locally downloading, executing, and manipulating data from a production notebook run. P18 remarked on the modularization benefits of a migration from a single codebase of scripts to notebooks:

We put each component of the pipeline in a notebook, which has made my life so much easier. Now [when debugging], I can run only one specific component if I want, not the entire pipeline... I don't need to focus on all those other components, and this has also helped with iteration.

On the other hand, some participants strongly disliked the idea of notebooks in production (P10, P15). P15 even went as far as to philosophically discourage the use of notebooks in the development environment: "Nobody uses notebooks. Instead, we all work in a shared code base, which is both the training and serving code base, and people kick off jobs in the cloud to train models." Similarly, P10 recounted a shift at their company to move any work they wanted to reproduce or deploy out of notebooks:

There were all sorts of manual issues. Someone would, you know, run something with the wrong sort of inputs from the notebook, and I'm [debugging] for like a day and a half. Then [I'd] figure out this was all garbage. Eight months ago, we [realized] this was not working. We need[ed] to put in the engineering effort to create [non-notebook] pipelines.

The anecdotes on notebooks identified conflicts between competing priorities: (1) notebooks support high velocity and therefore need to be in development environments, (2) similar development and production environments prevent new bugs from being introduced, and (3) it's easy to make mistakes with notebooks in production, e.g., running with the wrong inputs or copy-pasting instead of reusing code. Each organization had different rankings of these priorities, ultimately indicating whether or not they used notebooks in production.

Non-standardized Code Quality. We found code quality and review practices to be non-standardized and inconsistent across development and production environments. Some participants described organization-wide production coding standards for specific languages (P2, P5), but even the most mature organizations did not have ML-specific coding guidelines for experiments. Generally, experimental code (in development) was not reviewed, but changes to production went through a code review process (P1, P5). Participants felt that code review wasn't too useful, but they did it to adhere to software best practices (P1, P3, P5, P10). P5 mentioned that "it's just really not worth the effort; people might catch some minor errors". P10 hypothesized that the lack of utility came from the difficulty of code review:

It's tricky. You use a little bit of judgment as to where things might go wrong, and you maybe spend more time sort of reviewing that. But bugs will go to production, and [as long as they're not] that catastrophic, [it's okay.]

Code review and other good software engineering practices might make deployments less error-prone. However, because ML is so experimental in nature, they can be significant barriers to velocity; thus, many model developers ignore these practices (P1, P6, P11). P6 said:

I used to see a lot of people complaining that model developers don't follow software engineering [practices]. At this point, I'm feeling more convinced that they don't follow software engineering [practices]—[not] because they're lazy, [but because software engineering practices are] contradictory to the agility of analysis and exploration.

Takeaway. We believe there's an opportunity to create virtualized infrastructure specific to ML needs with similar development and production environments. Each environment should build on the same foundation but support different modes of iteration (i.e., high velocity in development). Such tooling should also track the discrepancies between environments and minimize the likelihood that discrepancy-related bugs arise.

5.1.2 Handling A Spectrum of Data Errors. As alluded to in Section 4.5.4, we found that ML engineers struggled to handle the spectrum of data errors: hard → soft → drift (P5, P6, P8, P11, P14, P16, P17, P18, P19). Hard errors are obvious and result in clearly "bad predictions", such as when mixing or swapping columns or when violating constraints (e.g., a negative age). Soft errors, such as a few null-valued features in a data point, are less pernicious and can still yield reasonable predictions, making them hard to catch and quantify. Drift errors occur when the live data is from a seemingly different distribution than the training set; these happen relatively slowly over time. One pain point mentioned by the interviewees was that different types of data errors require different responses, and it was not easy to determine the appropriate response. Another issue was that requiring practitioners to manually define constraints on data quality (e.g., lower and upper bounds on values) was not sustainable over time, as employees with this knowledge left the organization.

The most commonly discussed pain point was false-positive alerts, or alerts triggered even when the ML performance is adequate. Engineers often monitored and placed alerts on each feature or input column and prediction or output column (P5, P6, P8, P11, P14, P16, P17, P18, P19). Engineers automated schema checks and bounds to catch hard errors, and they tracked distance metrics (e.g., KL divergence) between historical and live features to catch soft
and drift errors. The number of tracked metrics adds up quickly: even with only a handful of columns, the probability that at least one column violates its constraints at any given time is high!
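To make that arithmetic concrete, below is a minimal sketch (not drawn from any interviewee's stack) of the per-column monitoring pattern described above (schema and bounds checks for hard errors, plus a KL-divergence comparison against a reference histogram for soft and drift errors), followed by the false-alarm math. The thresholds, bin counts, and the 1% per-column false-alarm rate are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms given as raw bin counts."""
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def column_alerts(reference, live, bounds, kl_threshold=0.1, n_bins=20):
    """Flag columns with hard (out-of-bounds) or soft/drift (high-KL) issues."""
    alerts = {}
    for col in reference:
        lo, hi = bounds[col]
        live_vals = np.asarray(live[col])
        hard = bool(np.any((live_vals < lo) | (live_vals > hi)))      # hard error
        edges = np.histogram_bin_edges(reference[col], bins=n_bins)
        ref_hist, _ = np.histogram(reference[col], bins=edges)
        live_hist, _ = np.histogram(live_vals, bins=edges)
        drift = kl_divergence(live_hist, ref_hist) > kl_threshold     # soft/drift error
        if hard or drift:
            alerts[col] = {"hard": hard, "drift": drift}
    return alerts

# Why per-column alerting floods on-call engineers: even a small per-column
# false-alarm rate compounds across columns.
per_column_false_alarm = 0.01   # assume a healthy column still trips 1% of checks
num_columns = 50
p_any = 1 - (1 - per_column_false_alarm) ** num_columns
print(f"P(at least one false alert across {num_columns} columns) ~= {p_any:.2f}")  # ~0.39
```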
Taking a step back, the purpose of assessing data quality before serving predictions is to validate early. Correctly monitoring data quality demonstrates the conflict between validating early and versioning—if data validation methods flag a broken data point, which in turn rejects the corresponding prediction made by the main ML model, some backup plan or fallback model version (Section 4.5) is necessary. Consequently, an excessive number of false-positive alerts leads to two pain points: (1) unnecessarily maintaining many model versions or simple heuristics, which can be hard to keep track of, and (2) a lower overall accuracy or ML metric, as baseline models might not serve high-quality predictions (P14, P19).

Dealing with Alert Fatigue. A surplus of false-positive alerts led to fatigue and silencing of alerts, which could miss actual performance drops. P8 said "people [were] getting bombed with these alerts." P14 mentioned a current initiative at their company to reduce the alert fatigue:

Recently we've noticed that some of these alerts have been rather noisy and not necessarily reflective of events that we care about triaging and fixing. So we've recently taken a close look at those alerts and are trying to figure out, how can we more precisely specify that query such that it's only highlighting the problematic events?

P18 shared a similar sentiment, that there was "nothing critical in most of the alerts." The only time there was something critical was "way back when [they] had to actually wake up in the middle of the night to solve it...the only time [in years]." When we asked what they did about the noncritical alerts and how they acted on the alerts, P18 said:

You typically ignore most alerts...I guess on record I'd say 90% of them aren't immediate. You just have to acknowledge them [internally], like just be aware that there is something happening.

The alert fatigue typically materialized when engineers were on-call, or responsible for ML pipelines during a 7- or 14-day shift. P19 recounted how on-call rotations were dreaded amongst their team, particularly for new team members, due to the high rate of false-positive alerts:

The pain point is dealing with that alert fatigue and the domain expertise necessary to know what to act on during on-call. New members freak out in the first [on-call], so [for every rotation,] we have two members. One member is a shadow, and they ask a lot of questions.

P19 also discussed an initiative at their company to reduce the alert fatigue, ironically with another model:

The [internal tool] looks at different metrics for what alerts were [acted on] during the on-call...[the internal tool] tries to reduce the noise level, alert. It says, hey, this alert has been populated this like 1,000 times and ignored 45% of time. [The on-call member] will acknowledge whether we need to [fix] the issue.

Creating Meaningful Data Alerts is Challenging. If schema checks and rudimentary column bounds didn't flag all the errors, and distance metrics between historical and live feature values flagged too many false-positive errors, how could engineers find a "Goldilocks" alert setting (a nod to Goldilocks and the Three Bears, the popular Western fairy tale whose main character looks for things that are not too big or too small, but "just right")? We organized the data-related issues faced by engineers into a hierarchy, from most frequently occurring to least frequently occurring:

• Feedback delays: Many participants said that ground-truth labels for live predictions often arrived after a delay, which could vary unpredictably (e.g., human-in-the-loop or networking delays) and thus caused problems for knowing real-time performance or retraining regularly (P2, P7, P8, P15, P17, P18). (A minimal sketch of reporting metrics under such label lag appears after this list.) P7 felt strongly about the negative impact of feedback delays on their ML pipelines:

I have no idea how well [models] actually perform on live data. We do log the [feature and output] data, but feedback is always delayed by at least 2 weeks. Sometimes we might not have feedback...so when we realize maybe something went wrong, it could [have been] 2 weeks ago, and yeah, it's just straight up—we don't even care...nobody is solving the label lag problem. It doesn't make sense to me that a monitoring tool is not addressing this, because [it's] the number one problem.

P8 discussed how they spent 2-3 years developing a human-in-the-loop pipeline to manually label live data as frequently as possible: "you want to come up with the rate at which data is changing, and then assign people to manage this rate roughly". On the other hand, P17 and P19 both talked about how, when they worked on recommender systems, they did not have to experience feedback delay issues. P17 said: "With recommendations, it's pretty clear whether or not we got it right because we get pretty immediate feedback. We suggest something, and someone's like go away or they click it."

• Unnatural data drift: Often, in production pipelines, data was missing, incomplete, or corrupted, causing model performance to sharply degrade (P3, P6, P7, P10, P16, P17). Several participants cited Covid as an example, but there are other (better) everyday instances of unnatural data drift. P6 described a bug where users had inconsistent definitions of the same word, complicating the deployment of a service to a new user. P7 mentioned a bug where data from users in a certain geographic region arrived more sporadically than usual. P10 discussed a bug where the format of raw data was occasionally corrupted: "Tables didn't always have headers in the same place, even though they were the same tables."

• Natural data drift: Surprisingly, participants didn't seem too worried about slower, expected natural data drift over time—they noted that frequent model retrains solved this problem (P6, P7, P8, P12, P15, P16, P17). As an anecdote, we asked P17 to give an example of a natural data drift problem their company faced, and they could not think of a good example. P14 also said they don't have natural data drift problems:
The model gets retrained every day, so we don't have the scenario of like: Oh, our models got stale and we need to retrain it because it's starting to make mistakes because data has drifted...fortunately we've never had to deal with [such a] scenario. Sometimes there are bad [training] jobs, but we can always effectively roll back to a different [model].
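As referenced in the feedback-delays item above, one minimal way to cope with label lag is to join predictions with whatever labels have arrived and report live accuracy together with label coverage, so that a metric computed over a thin labeled slice is never mistaken for the full picture. The sketch below is illustrative only: the class, method names, and join keys are our own invention, not any participant's system.

```python
from dataclasses import dataclass, field

@dataclass
class DelayedFeedbackEvaluator:
    predictions: dict = field(default_factory=dict)   # example id -> predicted label
    labels: dict = field(default_factory=dict)        # example id -> ground-truth label

    def log_prediction(self, example_id, predicted):
        self.predictions[example_id] = predicted

    def log_label(self, example_id, actual):
        # Labels may arrive days or weeks after the prediction was served.
        self.labels[example_id] = actual

    def report(self):
        matched = [i for i in self.predictions if i in self.labels]
        coverage = len(matched) / max(len(self.predictions), 1)
        if not matched:
            return {"coverage": 0.0, "accuracy": None}   # too early to say anything
        correct = sum(self.predictions[i] == self.labels[i] for i in matched)
        return {"coverage": coverage, "accuracy": correct / len(matched)}

# Usage: a dashboard that shows accuracy alongside coverage makes it obvious when
# a "healthy" metric is really just computed over a tiny labeled slice.
ev = DelayedFeedbackEvaluator()
ev.log_prediction("req-1", "fraud"); ev.log_prediction("req-2", "ok")
ev.log_label("req-1", "fraud")        # req-2's label hasn't arrived yet
print(ev.report())                    # {'coverage': 0.5, 'accuracy': 1.0}
```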
However, a few engineers mentioned that natural data shift could cause some hand-curated features and data quality checks to corrupt (P3, P6, P8). P6 discussed a histogram used to make a feature (i.e., converting a real-valued feature to a categorical feature) for an ML model—as data changed over time, the bucket boundaries became useless, resulting in buggy predictions. P8 described how, in their NLP models, the vocabulary of frequently-occurring words changed over time, forcing them to update their preprocessor functions regularly. Our takeaway is that any function that summarizes data—be it cleaning tools, preprocessors, features, or models—needs to be refit regularly.
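As a concrete, hypothetical illustration of that takeaway, the sketch below refits the bucket boundaries of a discretized feature (the kind of summary P6 described) on a recent window of data instead of freezing them at training time. The window size, bucket count, and data source are assumptions for illustration.

```python
import numpy as np

def fit_bucket_edges(values, n_buckets=10):
    """Quantile-based bucket boundaries computed from a recent window of data."""
    qs = np.linspace(0, 1, n_buckets + 1)[1:-1]
    return np.quantile(values, qs)

def bucketize(value, edges):
    """Map a real value to a categorical bucket id using the current edges."""
    return int(np.searchsorted(edges, value))

# Refit on a sliding window (e.g., nightly, alongside retraining) so the buckets
# keep tracking the live distribution instead of drifting into uselessness.
recent_window = np.random.lognormal(mean=3.0, sigma=1.0, size=10_000)  # stand-in for live data
edges = fit_bucket_edges(recent_window)
feature_value = bucketize(42.0, edges)
```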
Takeaway. Unfortunately, none of the participants reported having solved the Goldilocks ML alert problem at their companies. What metrics can be reliably monitored in real-time, and what criteria should trigger alerts to maximize precision and recall when identifying model performance drops? How can these metrics and alerting criteria—functions of naturally-drifting data—automatically tune themselves over time? We envision this to be an opportunity for new data management tools.

5.1.3 Taming the Long Tail of ML Pipeline Bugs. In the interviews, we gathered the sentiment that ML debugging is different from debugging during standard software engineering, where one can write test cases to cover the space of potential bugs. But for ML, if one can't categorize bugs effectively because every bug feels unique, how will they prevent future similar failures? Moreover, it's important to fix pipeline bugs as soon as possible to minimize downtime, and a long tail of possible ML pipeline bugs forces practitioners to have high debugging velocity. "I just sort of poked around until, at some point, I figured [it] out," P6 said, describing their ad-hoc approach to debugging. Other participants similarly mentioned that they debug without a systematic framework, which could take them a long time (P5, P8, P10, P18).

While some types of bugs were discussed by multiple participants, such as accidentally flipping labels in classification models (P1, P3, P6, P11) and forgetting to set random seeds (P1, P12, P13), the vast majority of bugs described to us in the interviews were seemingly bespoke and not shared among participants. For example, P8 forgot to drop special characters (e.g., apostrophes) for their language models. P6 found that the imputation value for missing features was once corrupted. P18 mentioned that a feature of unstructured data type (e.g., JSON) had half of the keys' values missing for a "long time."

Unpredictable Bugs; Predictable Symptoms. Interestingly, these one-off bugs from the long tail showed similar symptoms of failure. For instance, a symptom of unnatural data drift issues (defined in Section 5.1.2) was a large discrepancy between offline validation accuracy and production accuracy immediately after deployment (P1, P6, P14, P18). The similarity in symptoms highlighted the similarity in methods for isolating bugs; they were almost always some variant of slicing and dicing data for different groups of customers or data points (P2, P6, P11, P14, P17, P19). P14 discussed tracking bugs for different slices of data and only drilling down into their queue of bugs when they observed "systematic mistakes for a large number of customers." P2 did something similar, although they hesitated to call it debugging:

You can sort of like, look for instances of a particular [underperforming slice] and [debug]—although I'd argue that [it isn't] debugging as much as it is sampling the world for more data...maybe it's not a bug, and [it's] just [that] the model has not seen enough examples of some slice.
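The slice-and-dice pattern that P14 and P2 describe can be made concrete with a small helper that ranks slices by how far their metric falls below the overall metric. The sketch below is a generic illustration; the column names and example data are invented, not a participant's schema.

```python
import pandas as pd

def underperforming_slices(df, slice_col, label_col="label", pred_col="prediction"):
    """Return overall accuracy plus per-slice accuracy, worst slices first."""
    overall = (df[label_col] == df[pred_col]).mean()
    per_slice = (
        df.assign(correct=(df[label_col] == df[pred_col]))
          .groupby(slice_col)["correct"]
          .agg(accuracy="mean", n="size")
          .assign(gap_vs_overall=lambda s: s["accuracy"] - overall)
          .sort_values("gap_vs_overall")
    )
    return overall, per_slice

# Usage: drill into slices with a large negative gap and a large n; tiny slices may
# just be noise (or, as P2 put it, places where the model "has not seen enough examples").
df = pd.DataFrame({
    "customer_segment": ["smb", "smb", "enterprise", "enterprise", "enterprise"],
    "label":      [1, 0, 1, 1, 0],
    "prediction": [1, 0, 0, 0, 0],
})
overall_acc, slices = underperforming_slices(df, "customer_segment")
print(overall_acc)
print(slices)
```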
Paranoia Caused by ML Debugging Trauma. After several iterations of chasing bespoke ML-related bugs in production, the ML engineers that we interviewed developed a sense of paranoia while evaluating models offline, possibly as a coping mechanism (P1, P2, P6, P15, P17, P19). P1 recounted a bug that was "impossible to discover" after a deployment to production:

ML [bugs] don't get caught by tests or production systems and just silently cause errors [that manifest as] slight reductions in performance. This is why [you] need to be paranoid when you're writing ML code, and then be paranoid when you're coding. I remember one example of a beefy PR with a lot of new additions to data augmentation...but the ground truth data was flipped. If it hadn't been caught in code review, it [would've been] almost impossible to discover. I can think of no mechanism by which we would have found this besides someone just curiously reading the code. [In production], it would have only [slightly] hurt accuracy.

It's possible that many of the bespoke bugs could be ignored if they didn't actually affect model performance. Tying this concept to the data quality issue, maybe all engineers needed to know was when model performance was suffering. But they needed to know precisely when models were underperforming, an unsolved question as discussed in Section 5.1.2. When we asked P1, "how do you know when the model is not working as well as you expect?"—they gave the following answer:

Um, yeah, it's really hard. Basically there's no surefire strategy. The closest that I've seen is for people to integrate a very high degree of observability into every part of their pipeline. It starts with having really good raw data, observability, and visualization tools. The ability to query. I've noticed, you know, so much of this [ad-hoc bug exploration] is just—if you make the friction [to debug] lower, people will do it more. So as an organization, you need to make the friction very low for investigating what the data actually looks like, [such as] looking at specific examples.

Takeaway. Our takeaway is that there is a chicken-and-egg problem in making it easier to tackle the long tail of ML bugs. To group the tail into higher-order categories—i.e., to know what bugs to focus on and what to throw out—we need to know when models are precisely underperforming; then we can map performance drops
to bugs. However, to know when models are precisely underperforming, given feedback delays and other data quality assessment challenges as described in Section 5.1.2, we need to be able to identify all the bugs in a pipeline and reason about how much each one could plausibly impact performance. Breaking this cycle could be a valuable contribution to the production ML community and help alleviate challenges that stem from the long tail of ML bugs.

5.1.4 Multi-Staged Deployments Seemingly Take Forever. Multiple participants complained that end-to-end experimentation—from the conception of an idea to improve ML performance to validating the idea—took too long (P7, P14, P16, P17, P18, P19). This reveals the synergies between velocity and validating early: if ideas can be invalidated in earlier stages of deployment, then overall velocity is increased. But sometimes a stage of deployment would take a long time to observe meaningful results—for example, P19 mentioned that at their company, the timeline for trying a new feature idea took over three months:

I don't have the exact numbers; around 40 or 50% will make it to initial launch. And then, either because it doesn't pass the legal or privacy or some other complexity, we drop about 50% of [the launched experiments]. We have to drop a lot of ideas.

The uncertainty of whether projects will be successful stemmed from the unpredictable, real-world nature of the experiments (P18, P19). P19 said that some features don't make sense after a few months, given the nature of how user behaviors change, which would cause an initially good idea to never fully and finally deploy to production:

You have to look at so many different metrics, and even for very experienced folks doing this process like dozen times, sometimes it's hard to figure out especially when the user's behavior changes very steadily. There's no sudden change [in evaluation metrics] because of one launch; it [is] just usage patterns that change.

P18 offered an anecdote where their company's key product metrics changed in the middle of one of their experiments, causing them to kill an experiment that appeared to be promising (the original metric was improving):

It was causing a huge gain on the product metrics; it was definitely a green signal. But as for the product, metrics keep on rotating based on the company's priorities, you know. Is it the revenue at this point? Is it the total number of, let's say, installs? Or clicks at this particular point of time? They keep on changing with company's roadmap...

Takeaway. While most participants were unable to share exact information about the length of the staged deployment process and specific anecdotes about experiments they needed to cancel for privacy reasons, we found it interesting how different organizations had different deployment evaluation practices yet similar pain around failed project ideas due to the highly iterative, experimental nature of ML. We believe there is an opportunity for tooling to streamline ML deployments in this multi-stage pattern, to minimize wasted work and help practitioners predict the end-to-end gains for their ideas.
best idea (P5, P10, P18). P18 described their ideological shift from random search to guided search:

Previously, I tried to do a lot of parallelization. I used to take, like, 5 ideas and try to run experimentation in parallel, and that definitely not only took my time, but I also focused less. If I focus on one idea, a week at a time, then it boosts my productivity a lot more.

By following a guided search, engineers are, essentially, significantly pruning a tree of experiment ideas without executing them. Contrary to what they were taught in academia, P1 observed that some hyperparameter searches could be pruned early because hyperparameters had so little impact on the end-to-end pipeline:

I remember one example where the ML team spent all this time making better models, and it was not helping [overall performance]. Then everyone was so frustrated when one person on the controls team just tweaked one parameter [for the non-ML part of the pipeline], and [the end-to-end pipeline] worked so much better. Like we've invested all this infrastructure for hyperparameter tuning [an] experiment, and I'm like what is this. Why did this happen?

Our takeaway is that while it may seem like there are unlimited computational resources, developer time and energy is the limiting reagent for ML experiments. At the end of the day, experiments are human-validated and deployed. Mature ML engineers know their personal tradeoff between parallelizing disjoint experiment ideas and pipelining ideas that build on top of each other, ultimately yielding successful deployments.

5.2.3 Retrofitting an Explanation. Right from the first interview, participants discussed uncovering good results from experiments, productionizing changes, and then trying to reason why these changes worked so well (P1, P2, P7, P12). P1 said:

A lot of ML is like: people will claim to have like principled stances on why they did something and why it works. I think you can have intuitions that are useful and reasonable for why things should be good, but the most defining characteristic of [my most productive colleague] is that he has the highest pace of experimentation out of anyone. He's always running experiments, always trying everything. I think this is relatively common—people just try everything and then backfit some nice-sounding explanation for why it works.

We wondered, why was it even necessary to have an explanation for why something worked? Why not simply accept that, unlike in software, we may not have elegant, principled reasons for successful ML experiments? P2 hypothesized that such retrofitted explanations could guide future experiment ideas over a longer horizon. Alternatively, P7 mentioned that their customers sometimes demanded explanations for certain predictions:

Do I know why? No idea. I have to convince people that, okay, we try our best. We try to [compute] correlations. We try to [compute] similarities. Why is it different? I have to make conjectures.

We realized that although they could be false, retrofitted explanations can help with collaboration and business goals. If they satisfy customers and help organize teams around a roadmap of experiment ideas, maybe they are not so bad.

5.2.4 Undocumented Tribal Knowledge. P6, P10, P13, P14, P16, P17, and P19 each discussed pain points related to undocumented knowledge about ML experiments and pipelines amongst collaborators with more experience related to specific pipelines. Across interviews, it seemed like high velocity created many versions, which made it hard to maintain up-to-date documentation.

P10 mentioned that there were parts of a pipeline that no one touched because it was already running in production, and the principal developer who knew most about it had left the company. P16 said that "most of the, like, actual models were trained before [their] time." P14 described a "pipeline jungle" that was difficult to maintain:

You end up with this pipeline jungle where everything's super entangled, and it's really hard to make changes, because just to make one single change, you have to hold so much context in your brain. You're trying to think about like, okay this one change is gonna affect this system which affects this [other] system, [which creates]...the pipeline got to the point where it was very difficult to make even simple changes.

While writing down institutional knowledge can be straightforward to do once, P6 discussed that in the ML setting, they learn faster than they can document; moreover, people don't want to read so many different versions of documentation:

There are people in the team, myself included, that have been on it for several years now, and so there's some institutional knowledge embodied on the team that sometimes gets written down. But you know, even when it does get written down, maybe you will read them, but then, they kind of disappear to the ether.

Finally, P17 realized that poorly documented pipelines forced them to treat pipelines as black boxes: "Some of our models are pretty old and not well documented, so I don't have great expectations for what they should be doing." Without intuition for how pipelines should perform, practitioner productivity can be stunted.

Takeaway. The MLOps anti-patterns described in this section reveal that ML engineering, as a field, is changing faster than educational resources can keep up. We see this as an opportunity for new resources, such as classroom material (e.g., textbooks and courses) that prescribes the right engineering practices and rigor for the highly experimental discipline that is production ML, and automated documentation assistance for ML pipelines in organizations.

5.3 Characterizing the "MLOps Stack" for Tool Builders

MLOps tool builders may be interested in an organization of the dozens of tools, libraries, and services MLEs use to run ML and data processing pipelines. Although multiple MLEs reported having to "glue" open-source solutions together and having to build "homegrown" infrastructure as part of their work (P1, P2, P5, P6, P10, P12), an analysis of the various deployments reveals that tools
can be grouped into a stack of four layers, depicted in Figure 2 and discussed further in Appendix D.

Figure 2: Layers of tools in the MLOps stack (from top to bottom: Run Layer, Pipeline Layer, Component Layer, Infrastructure Layer).

We discuss the four layers in turn:

(1) Run Layer: A run is a record of an execution of an ML or data pipeline (and its components). Run-level data is often managed by data catalogs, model registries, and training dashboards. Example Tools: Weights & Biases, MLFlow, Hive metastores, AWS Glue

(2) Pipeline Layer: Finer-grained than a run, a pipeline further specifies the dependencies between artifacts and details of the corresponding computations. Pipelines can run ad-hoc or on a schedule. Pipelines change less frequently than runs, but more frequently than components. Example Tools: Papermill, DBT, Airflow, TensorFlow Extended, Sagemaker

(3) Component Layer: A component is an individual node of computation in a pipeline, often a script inside a managed environment. Some MLEs reported having an organization-wide "library of common components" for pipelines to use, such as feature generation and model training (P2, P6). Example Tools: Python, Spark, PyTorch, TensorFlow

(4) Infrastructure Layer: MLEs described a wide range of solutions, but most used cloud storage (e.g., S3) and GPU-backed cloud computing (AWS and GCP). Infrastructure changed far less frequently than other layers in the stack, but each change was more laborious and prone to wide-ranging consequences. Example Tools: Docker, AWS, GCP

We found that MLEs used layers of abstraction (e.g., "config-based development") as a way to manage complexity: most changes (especially high-velocity ones) were minor and limited to the Run Layer, such as selecting hyperparameters. As the stack gets deeper, changes become less frequent: MLEs ran training jobs daily but modified Dockerfiles occasionally. In the past, as MLOps tool builders, we (the authors) have incorrectly assumed uniform user access patterns across all layers of the MLOps stack. Tool builders may want to pay attention to the layer(s) they are addressing and make sure they are not designing tools for the wrong layer(s).
Additionally, we noticed a high-level pattern in how interviewees discussed the tools they used: engineers seemed to prefer tools that significantly improved their experience with respect to the Three Vs (Section 4.2). For example, experiment tracking tools increased engineers' speed of iterating on feature or modeling ideas (P14, P15)—a velocity virtue. In another example, feature stores (i.e., tables of derived columns for ML models) helped engineers debug models because they could access the relevant historical versions of features used in training such models (P3, P6, P14, P17)—a versioning virtue. MLOps tool builders may want to prioritize "10x" better experiences across velocity, validating early, or versioning for their products.

6 CONCLUSION

In this paper, we presented results from a semi-structured interview study of ML engineers spanning different organizations and applications to understand their workflow, best practices, and challenges. We found that successful MLOps practices center around having high velocity, validating as early as possible, and maintaining multiple versions of models for minimal production downtime. We reported on the experimental nature of production ML, aspects of effective model evaluation, and tips to sustain model performance over time. Finally, we discussed MLOps pain points and anti-patterns discovered in our interviews to inspire new MLOps tooling and research ideas.

ACKNOWLEDGEMENTS

We thank the interviewees for their valuable time and thoughtful responses. We are also grateful to Sarah Catanzaro for connecting us to some of the interviewees, and Alex Tamkin and Preetum Nakkiran for helpful suggestions. We acknowledge support from grants IIS-2129008, IIS-1940759, IIS-1940757, and CNS-1730628 awarded by the National Science Foundation, DOE Grant No. DE-SC0016934, an NDSEG Graduate Fellowship, an NSF Graduate Research Fellowship, funds from the Alfred P. Sloan Foundation, as well as EPIC lab sponsors: Adobe, Microsoft, Google, and Sigma Computing. Work was done while Hellerstein was on leave at Sutter Hill Ventures.
A SEMI-STRUCTURED INTERVIEW QUESTIONS

In the beginning of each interview, we explained the purpose of the interview—to better understand processes within the organization for validating changes made to production ML models, ideally through stories of ML deployments. We then kickstarted the information-gathering process with a question to build rapport with the interviewee, such as "tell us about a memorable previous model deployment." This question helped us isolate an ML pipeline or product to discuss. We then asked a series of open-ended questions:

(1) Nature of ML task
• What is the ML task you are trying to solve?
• Is it a classification or regression task?
• Are the class representations balanced?
(2) Modeling and experimentation ideas
• How do you come up with experiment ideas?
• What models do you use?
• How do you know if an experiment idea is good?
• What fraction of your experiment ideas are good?
(3) Transition from development to production
• What processes do you follow for promoting a model from the development phase to production?
• How many pull requests do you make or review?
• What do you look for in code reviews?
• What automated tests run at this time?
(4) Validation datasets
• How did you come up with the dataset to evaluate the model on?
• Do the validation datasets ever change?
• Does every engineer working on this ML task use the same validation datasets?
(5) Monitoring
• Do you track the performance of your model?
• If so, when and how do you refresh the metrics?
• What information do you log?
• Do you record provenance?
• How do you learn of an ML-related bug?
(6) Response
• What historical records (e.g., training code, training set) do you inspect in the debugging process?
• What organizational processes do you have for responding to ML-related bugs?
• Do you make tickets (e.g., Jira) for these bugs?
• How do you react to these bugs?
• When do you decide to retrain the model?

B INTERVIEW TRANSCRIPTS

Histograms of the number of codes and sentences in the interview transcripts are shown in Figures 3a and 3b, respectively.

C CODES

Across the interview transcripts, we had a total of 1766 coded segments, with exactly 600 unique codes. We organized codes into hierarchies. Table 3 shows the most frequently occurring codes, ordered by the number of distinct interviews the codes appeared in (not the raw number of occurrences across all documents). Figure 4 displays the top five correlated codes for each top-level or parent code. Two codes are correlated if they occur within twenty sentences of each other.

D MLOPS TOOL STACK

Table 2 shows common tools used by MLEs across layers of the ML stack and tasks in the production ML lifecycle.
Run Layer
• Data Collection: know what data is available and where it lives. Example tools: data catalogs, Amundsen, AWS Glue, Hive metastores.
• Experimentation: prototype ideas and track results. Example tools: Weights & Biases, MLFlow, train/test set and parameter configs.
• Evaluation and Deployment: catch errors in training (e.g., overfitting). Example tools: A/B test tracking tools.
• Monitoring and Response: track ML metrics over time. Example tools: dashboards, SQL, metric functions and window sizes.

Pipeline Layer
• Data Collection: regularly scheduled, possibly outsourced. Example tools: in-house or outsourced annotators.
• Experimentation: ad-hoc or user-triggered, hyperparameter search. Example tools: AutoML.
• Evaluation and Deployment: scheduled refresh of holdout validation sets. Example tools: Github Actions, Travis CI, prediction serving tools, Kafka, Flink.
• Monitoring and Response: scheduled computation of metrics and triggered alerts. Example tools: Prometheus, AWS CloudWatch.
• Across tasks: Airflow, Kubeflow, Argo, Tensorflow Extended (TFX), Vertex AI, DBT.

Component Layer
• Data Collection: sourcing, labeling, cleaning. Example tools: data cleaning tools.
• Experimentation: feature generation and selection, model training. Example tools: Tensorflow, MLlib, PyTorch, Scikit-learn, XGBoost.
• Evaluation and Deployment: running the model on a holdout validation set, model compression or rewrite, model serialization. Example tools: C++, ONNX, OctoML, TVM, joblib, pickle.
• Monitoring and Response: data validation, ML metric computation, tracing predictions. Example tools: Scikit-learn metric functions, Great Expectations, Deequ.
• Across tasks: Python, Pandas, Spark, SQL.

Primary virtue per task: Data Collection: Velocity; Experimentation: Velocity; Evaluation and Deployment: Validate early; Monitoring and Response: Versioning.

Infrastructure Layer
• Data Collection: annotation schema, cleaning criteria.
• Experimentation: Jupyter notebook setups, configs.
• Evaluation and Deployment: edge devices, CPUs, GPUs.
• Monitoring and Response: logging and observability services (e.g., DataDog).
• Across tasks: cloud (e.g., AWS, GCP), compute clusters, storage (e.g., AWS S3, Snowflake), Docker, Kubernetes.

Table 2: Primary goals and tools for each layer in the MLOps stack and routine task in the ML engineering workflow.

(a) Histogram of the number of coded segments in each interview (x-axis: Number of Coded Segments, 40–160; y-axis: Number of Interviews). (b) Histogram of the number of sentences in each interview (x-axis: Number of Sentences, 400–650; y-axis: Number of Interviews).

Figure 3: Interview transcript statistics. Each histogram has 10 equally-spaced buckets.


Parent Code Code # Coded Segments # Transcripts


1 known challenges data drift/shift/skew 15 10
2 monitoring and response (+) live monitoring 21 9
3 Python (+) Jupyter 16 8
4 evaluation and deployment build the infrastructure 16 8
5 data pipeline data iteration, fresh data 15 8
6 fast & simple high iteration speed, agile, rapid cycles 24 7
7 production bugs debugging and bugs 18 7
8 tests AB Testing 14 7
9 software development pull request 13 7
10 data pipeline pipeline on a schedule 13 7
11 operations model training & retraining 12 7
12 data ingest automated featurization 9 7
13 evaluation and deployment CI/CD 8 7
14 trends per-customer model and many customers 15 6
15 known challenges feedback delay 13 6
16 evaluation and deployment metrics and validation 12 6
17 metrics and validation (+) accuracy 11 6
18 models deep learning 9 6
19 sandboxing offline demonstration of value 7 6
20 apps & use-cases ranking 7 6
Table 3: Top 20 codes, ordered by the number of distinct transcripts the codes were mentioned in, descending.
Figure 4: Correlated codes for each top-level code. Each edge is weighted by the occurrence count for its pair of codes.
