
Singapore Management University
Institutional Knowledge at Singapore Management University
Research Collection School of Computing and Information Systems

8-2019

How does machine learning change software development practices?

Zhiyuan WAN, Zhejiang University
Xin XIA, Monash University
David LO, Singapore Management University, [email protected]
Gail C. MURPHY, University of British Columbia

Part of the Artificial Intelligence and Robotics Commons, and the Software Engineering Commons

Citation
WAN, Zhiyuan; XIA, Xin; LO, David; and MURPHY, Gail C. How does machine learning change software development practices? (2019). IEEE Transactions on Software Engineering, 1-14. Research Collection School of Computing and Information Systems.
Available at: https://ink.library.smu.edu.sg/sis_research/4498

How does Machine Learning Change Software Development Practices?

Zhiyuan Wan, Xin Xia, David Lo and Gail C. Murphy

• Zhiyuan Wan is with the College of Computer Science and Technology, and Ningbo Research Institute, Zhejiang University, China, and the Department of Computer Science, University of British Columbia, Canada. E-mail: [email protected]
• Xin Xia is with the Faculty of Information Technology, Monash University, Australia. E-mail: [email protected]
• David Lo is with the School of Information Systems, Singapore Management University, Singapore. E-mail: [email protected]
• Gail C. Murphy is with the Department of Computer Science, University of British Columbia, Canada. E-mail: [email protected]
• Xin Xia is the corresponding author.
Manuscript received ; revised

Abstract—Adding an ability for a system to learn inherently adds uncertainty into the system. Given the rising popularity of incorporating machine learning into systems, we wondered how the addition alters software development practices. We performed a mixture of qualitative and quantitative studies with 14 interviewees and 342 survey respondents from 26 countries across four continents to elicit significant differences between the development of machine learning systems and the development of non-machine-learning systems. Our study uncovers significant differences in various aspects of software engineering (e.g., requirements, design, testing, and process) and work characteristics (e.g., skill variety, problem solving and task identity). Based on our findings, we highlight future research directions and provide recommendations for practitioners.

Index Terms—Software engineering, machine learning, practitioner, empirical study

1 INTRODUCTION

Machine learning (ML) has progressed dramatically over the past three decades, from a laboratory curiosity to a practical technology in widespread commercial use [19]. Within artificial intelligence, machine learning has emerged as the method of choice for developing useful software systems for computer vision, speech recognition, natural language processing, robot control, and other applications. Machine learning capabilities may be added to a system in several ways, including software systems with ML components and ML frameworks, tools and libraries that provide ML functionalities. A widespread trend has emerged: developing and deploying ML systems¹ is relatively fast and cheap, but maintaining them over time is difficult and expensive due to technical debt [19]. ML systems have all of the problems of non-ML software systems plus an additional set of ML-specific issues. For instance, probabilistic modeling provides a framework for a machine to learn from observed data and infer models that can make predictions. Uncertainty plays a fundamental role in probabilistic modeling [14]: observed data can be consistent with various models, and thus which model is appropriate given the data is uncertain. Predictions about future data and the future consequences of actions are uncertain as well. To tackle the ML-specific issues, recent studies have put effort into building tools for testing [26], [30], [36], [39] and debugging [16], [27], [28] of machine learning code, and creating frameworks and environments to support development of ML systems [3], [6].

Despite these efforts, software practitioners still struggle to operationalize and standardize the software development practices of systems using ML². Operationalization and standardization of software development practices are essential for cost-effective development of high-quality and reliable ML systems. How does machine learning change software development practices? To systematically explore the impact, we performed a mixture of qualitative and quantitative studies to investigate the differences in software development that arise from machine learning. We start with open-ended interviews with 14 software practitioners with experience in both ML and non-ML development, who have an average of 7.4 years of professional software experience. Through the interviews, we qualitatively investigated the differences that were perceived by our interviewees and derived 80 candidate statements that describe the differences. We further improved the candidate statements via three focus group discussions and performed a survey with 342 software practitioners from 26 countries across four continents to quantitatively validate the differences that are uncovered in our interviews. The survey respondents work in various job roles, i.e., development (69%), testing (24%) and project management (7%). We investigated the following research questions:

1. In this paper, unless otherwise mentioned, we use ML systems to refer to either software frameworks, tools, and libraries that provide ML functionalities, or software systems that include ML components.
2. https://twitter.com/AndrewYNg/status/1080886439380869122
RQ1. How does the incorporation of ML into a system impact software development practices?

Is developing ML systems different from developing non-ML systems? How does it differ? If developing ML systems is indeed different from non-ML software development, past software engineering research may need to be expanded to better address the unique challenges of developing ML systems; previous tools and practices may become inapplicable to the development of ML systems; software engineering educators may need to teach different skills for the development of ML systems.

Our study found several statistically significant differences in software engineering practices between ML and non-ML development:

• Requirements: Collecting requirements in the development of ML systems involves more preliminary experiments, and creates a need to account for predictable degradation in performance.
• Design: Detailed design of ML systems is more time-consuming and tends to be conducted in an intensively iterative way.
• Testing and Quality: Collecting a testing dataset requires more effort for ML development; good performance³ during testing cannot guarantee the performance of ML systems in production.
• Process and Management: The availability of data limits the capability of ML systems; data processing is more important to the success of the whole process.

RQ2. How do the work characteristics from applied psychology, like skill variety, job complexity and problem solving, change when incorporating ML into a system?

How does the context of software development (e.g., skill variety, job complexity and problem solving) change when practitioners involve ML in their software development practices? Our study identified several statistically significant differences in work characteristics between ML and non-ML development:

• Skill Variety: ML development intensively requires knowledge in math, information theory, and statistics.
• Job Complexity and Problem Solving: ML practitioners have a less clear roadmap for building systems.
• Task Identity: It is much harder to make an accurate plan for the tasks in ML development.
• Interaction: ML practitioners tend to communicate less frequently with clients.

Based on the findings, we present the causes behind the identified differences as discussed by our interviewees: the uncertainty in requirements and algorithms, and the vital role of data. We also provide practical lessons about the roles of preliminary experiments, reproducibility and performance reporting, and highlight several research avenues such as continuous performance measurement and debugging.

This paper makes the following contributions:

• We performed a mixture of qualitative and quantitative studies to investigate the differences in software practices and practitioners' work due to the impact of machine learning;
• We provided practical implications for researchers and outlined future avenues of research.

3. In this paper, unless otherwise mentioned, we use performance to refer to model performance.

The remainder of the paper is structured as follows. Section 2 briefly describes the processes and concepts regarding ML development. In Section 3, we describe the methodology of our study in detail. In Section 4, we present the results of our study. In Section 5, we discuss the implications of our results as well as any threats to the validity of our findings. In Section 6, we briefly review related work. Section 7 draws conclusions and outlines avenues for future work.

2 BACKGROUND

The development of machine learning systems is a multi-faceted and complex task. Various forms of processes of ML development have been proposed [2], [11], [12], [35]. These processes share several common essential steps: context understanding, data curation, data modeling, and production and monitoring.

In the context understanding step, ML practitioners identify areas of business that could benefit from machine learning and the available data. ML practitioners would communicate with stakeholders about what machine learning is capable and not capable of to manage expectations. Most importantly, ML practitioners frame and scope the development tasks by conducting preliminary experiments in a particular application context.

The data curation step includes data collection from different sources, data preprocessing, and training, validation and test dataset creation. Since data often come from different sources, ML practitioners should stitch together data, and deal with missing or corrupted data through data preprocessing. To create an appropriate dataset for supervised learning techniques, data labeling is required to assign ground truth labels to each record.

The data modeling step includes feature engineering, model training, and model evaluation. Feature engineering refers to the activities that transform the given data into a form which is easier to interpret, including feature extraction and selection for machine learning models. During model training, ML practitioners choose, train, and tune machine learning models using the chosen features. Model tuning includes adjusting parameters and identifying potential issues in the current model or the previous steps. In model evaluation, practitioners evaluate the output model on the test dataset using pre-defined evaluation measures.

During the production and monitoring step, ML practitioners export the model into a pre-defined format and usually create an API or Web application with the model as an endpoint. ML practitioners also plan for retraining the model with updated data. The model performance is continuously monitored for errors or unexpected consequences, and input data are monitored to identify if they change with time in a way that would invalidate the model.

We use the process above of ML development and related terminology as the vocabulary for discussions in this work.
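As a concrete illustration of the data curation and data modeling steps described above, the following minimal sketch (an editorial illustration, not code from the study; it assumes scikit-learn and uses synthetic placeholder data) creates training, validation and test splits, trains a model, and evaluates it with pre-defined measures.

# Illustrative sketch of data curation and data modeling (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                          # stand-in for curated feature data
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)    # stand-in for ground-truth labels

# Data curation: carve out training, validation and test datasets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Data modeling: train a model and check it against the validation set while tuning.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Model evaluation: report pre-defined measures on the held-out test set.
print("test F1:", f1_score(y_test, model.predict(X_test)))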
3 METHODOLOGY

Our research methodology followed a mixed qualitative and quantitative approach as depicted in Fig. 1. We collected data from different sources⁴: (1) We interviewed 14 software practitioners with experience in both ML development and non-ML development; (2) We derived a list of 80 candidate statements from the results of the interviews, and conducted three focus group discussions to reduce our list to 31 final statements for our survey; (3) We surveyed 342 respondents, which we describe below. To preserve the anonymity of participants, we anonymized all items that constitute Personally Identifiable Information (PII) before analyzing the data, and further considered aliases as PII throughout our study (e.g., we refer to the interviewees as P1 - P14).

Fig. 1: Research methodology. (The figure shows the flow from open-ended interviews with 14 participants, guided by an interview guide, through interview transcripts and open card sorting on 128 unique codes, to 80 candidate statements; focus group discussions on the 80 candidate statements and a pilot survey with 9 participants lead to the online survey with 342 respondents.)

3.1 Interviews

3.1.1 Protocol
The first author conducted a series of face-to-face interviews with 14 software practitioners with experience in both ML development and non-ML development. Each interview took 30-45 minutes. According to Guest et al. [15], conducting 12 to 15 interviews of a homogeneous group is adequate to reach saturation. We observed a saturation when our interviews were drawing to a close.

The interviews were semi-structured and made use of an interview guide⁵. The guide contains general groupings of topics and questions, rather than a pre-determined specific set and order of questions.

The interview comprised four parts. In the first part, we asked some demographic questions about the experience of the interviewees in both ML development and non-ML development. We covered various aspects including programming, design, project management, and testing. In the second part, we asked an open-ended question about what differences the interviewee noticed between ML development versus non-ML development. The purpose of this part was to allow the interviewees to speak freely about differences without the interviewer biasing their responses. In the third and fourth parts, we presented interviewees with two lists of topics and asked them to discuss the topics that they had not explicitly mentioned. One of the lists comes from the Guide to the Software Engineering Body of Knowledge (SWEBOK) [7], which consists of 10 knowledge areas, e.g., software design and software testing. The other list comes from general work characteristics [18] in applied psychology, which consists of 21 work characteristics, e.g., skill variety and problem solving. We chose SWEBOK to ensure that software engineering topics were comprehensively discussed, and general work characteristics to ensure that we covered a breadth of potential differences. In the third part, interviewees were asked to choose three topics from the two lists to discuss. In the fourth part, the interviewer selected three topics from the two lists that had been discussed the least in previous interviews, to ensure coverage of the topics.

At the end of each interview, we thanked the interviewee and briefly informed him/her of our next plans.

During the interviews, each interviewee talked about a median of 6 topics where he/she shared his/her perceived difference between ML development and non-ML software development (min: 1, max: 12, mean: 6.6, sd: 3.2). The topics mentioned by the interviewees include: SWEBOK: Requirements (9 interviewees), SWEBOK: Design (6 interviewees), SWEBOK: Construction (10 interviewees), SWEBOK: Tools (7 interviewees), SWEBOK: Testing (9 interviewees), SWEBOK: Quality (5 interviewees), SWEBOK: Maintenance (4 interviewees), SWEBOK: Process (8 interviewees), SWEBOK: Configuration Management (3 interviewees), Work: Skill Variety (10 interviewees), Work: Job Complexity (5 interviewees), Work: Problem Solving (4 interviewees), Work: Task Identity (7 interviewees), Work: Autonomy (1 interviewee), Work: Interdependence (4 interviewees), and Work: Interaction Outside the Organization (1 interviewee).

3.1.2 Participant Selection
We recruited full-time employees with experience in both ML systems and non-ML systems from three IT companies based in Hangzhou, China, namely Alibaba, Bangsun⁶, and Hengtian⁷. Bangsun is a technology provider which has more than 400 employees and develops real-time risk control systems for the financial sector and anti-fraud products. Hengtian is an outsourcing company which has more than 2,000 employees and focuses on outsourcing projects from US and European corporations (e.g., State Street Bank, Cisco, and Reuters).

4. The interviews, focus group and survey were approved by the relevant institutional review board (IRB). Participants were instructed that we wanted their opinions; privacy and sensitive resources were not explicitly mentioned.
5. Interview Guide Online: https://drive.google.com/file/d/1ZOXwbSKY6zPnuOEzGlzFMJ3DIERYD8YG
6. https://www.bsfit.com.cn
7. http://www.hengtiansoft.com/?lang=en
Interviewees were recruited by emailing our contact in each company, who was then responsible for disseminating news of our study to their colleagues. Volunteers would inform us if they were willing to participate in the study with no compensation. With this approach, 14 volunteers contacted us with varied experience in years. In the remainder of the paper, we denote these 14 interviewees as P1 to P14. These 14 interviewees have an average of 7.6 years of professional experience (min: 3, max: 16, median: 6, sd: 4), including 2.4 years in ML system development (min: 1, max: 5, median: 2, sd: 1.6) and 5.2 years in non-ML software development (min: 2, max: 11, median: 4.5, sd: 2.6). Table 1 summarizes the number of interviewees who perceived themselves with "extensive" experience (in comparison to "none" and "some" experience) in a particular role.

TABLE 1: Number of interviewees with "extensive" experience in a particular role.

Role          Machine Learning    non-Machine-Learning
Programming          5                      6
Design               5                      3
Management           2                      2
Testing              2                      3

3.1.3 Data Analysis
We conducted a thematic analysis [8] to process the recorded interviews by following the steps below:

Transcribing and Coding. After the last interview was completed, we transcribed the recordings of the interviews, and developed a thorough understanding through reviewing the transcripts. The first author read the transcripts and coded the interviews using NVivo qualitative analysis software [1]. To ensure the quality of codes, the second author verified initial codes created by the first author and provided suggestions for improvement. After incorporating these suggestions, we generated a total of 295 cards that contain the codes, 15 to 27 cards for each coded interview. After merging the codes with the same words or meanings, we had a total of 128 unique codes. We noticed that when our interviews were drawing to a close, the collected codes from interview transcripts reached a saturation. New codes did not appear anymore; the list of codes was considered stable.

Open Card Sorting. Two of the authors then separately analyzed the codes and sorted the generated cards into potential themes for thematic similarity (as illustrated in LaToza et al.'s study [24]). The themes that emerged during the sorting were not chosen beforehand. We then used the Cohen's Kappa measure [10] to examine the agreement between the two labelers. The overall Kappa value between the two labelers is 0.78, which indicates substantial agreement between the labelers. After completing the labeling process, the two labelers discussed their disagreements to reach a common decision. To reduce bias from two of the authors sorting the cards to form initial themes, they both reviewed and agreed on the final set of themes. Finally, we derived 80 candidate statements that describe the differences.
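To make the agreement computation concrete, a minimal sketch is shown below (an editorial illustration, not the authors' analysis script; it assumes scikit-learn and that each labeler's theme assignments are encoded as parallel lists of hypothetical labels).

# Illustrative computation of inter-rater agreement with Cohen's Kappa (assumes scikit-learn).
from sklearn.metrics import cohen_kappa_score

# Hypothetical theme labels assigned to the same cards by two labelers.
labeler_1 = ["testing", "requirements", "design", "testing", "process"]
labeler_2 = ["testing", "requirements", "design", "tools",   "process"]

kappa = cohen_kappa_score(labeler_1, labeler_2)
print(f"Cohen's Kappa: {kappa:.2f}")  # values in the 0.6-0.8 range are commonly read as substantial agreement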
3.2 Focus Groups

To focus the survey and keep it to a manageable size, we wanted to hone in on the statements that are most likely to differ when ML is incorporated into a software system. To determine which of the 80 candidate statements had this characteristic, the first author conducted three focus group sessions. Each focus group session lasted for 1.5 to 2 hours and involved 3 participants. The 9 participants are professionals with experience in both ML and non-ML development from various IT companies in China (e.g., Baidu, Alibaba and Huawei). They were informed about the purpose of our study and gave their consent to use the focus group results for research purposes.

During the focus group sessions, the first author went through the 80 candidate statements, and asked the following question: "is the statement more true for ML development, in comparison with non-ML development?" Based on the feedback, we removed 7 statements in which the participants did not understand the difference or did not think there was a difference for ML vs. non-ML development. In addition, we removed 42 statements in which over half of our focus group participants perceived no obvious difference between ML and non-ML development. In the end, we identified a list of 31 statements.

3.3 Survey

3.3.1 Protocol
The survey aims to quantify the differences between ML and non-ML software development expressed by interviewees over a wide range of software practitioners. We followed Kitchenham and Pfleeger's guidelines for personal opinion surveys [23] and used an anonymous survey to increase response rates [37]. A respondent has the option to specify that he/she prefers not to answer or does not understand the description of a particular question. We include this option to reduce the possibility of respondents providing arbitrary answers.

Recruitment of Respondents. The participants of the survey were informed about the purpose of our study and gave their consent to use the survey results for research purposes. To recruit respondents from both ML and non-ML populations, we spread the survey broadly to a wide range of companies from various locations around the world. To get a sufficient number of respondents from diverse backgrounds, we followed a multi-pronged strategy to recruit respondents:

• We contacted professionals from various countries and IT companies and asked their help to disseminate our survey within their organizations. We sent emails to our contacts in Amazon, Alibaba, Baidu, Google, Hengtian, IBM, Intel, IGS, Kodak, Lenovo, Microsoft, Morgan Stanley, and other companies from various locations around the world, encouraging them to complete the survey and disseminate it to some of their colleagues. By following this strategy, we aimed to recruit respondents working in the industry from diverse organizations.
• We sent an email with a link to the survey to 1,831 practitioners that contributed to the 18 highest-rated machine learning repositories hosted on GitHub (e.g., TensorFlow and PyTorch) and solicited their participation. By sending to GitHub contributors to machine learning repositories, we aimed to recruit respondents who are open source practitioners in addition to professionals working in the industry. We chose this set of potential respondents to collect responses from ML practitioners for contrast; if ML respondents provide significantly different responses than non-ML respondents, this provides quantitative evidence to establish a difference between ML and non-ML development. Moreover, the reason for choosing the 18 high-rated machine learning repositories was that the contributors would potentially be of two types: practitioners of ML framework/tool/library (ML FTL) and practitioners of ML application⁸ (ML App). We were unsure whether high variances in software differences would overwhelm ML versus non-ML differences. Out of these emails, eight emails received automatic replies notifying us of the absence of the receiver.

No identifying information was required or gathered from our respondents.

3.3.2 Survey Design
We captured the following pieces of information in our survey (the complete questionnaire is available online as supplemental material⁹):

Demographics. We collected demographic information about the respondents to allow us to (1) filter respondents who may not understand our survey (i.e., respondents with less relevant job roles), (2) break down the results by groups (e.g., developers and testers; ML practitioners and non-ML practitioners). Specifically, we asked the question "What best describes your primary product area that you currently work on?", and provided options including (1) ML framework/system/library, (2) ML application, (3) Non-ML framework/system/library, (4) Non-ML application, and (5) Other. The respondents selected one item from the provided options as their primary product area. Based on their selections, we divided the survey respondents into 5 groups.

We received a total of 357 responses. We excluded ten responses made by respondents whose job roles are neither development, testing nor project management. Those respondents described their job roles as researcher (5), student (3), network infrastructure specialist (1), and university professor (1). We also excluded five responses made by respondents who selected Other as their major product area and specified it as: ecology (1), physics (1), or a combination of multiple product areas that do not seem oriented at the production of a commercially relevant software product (3). In the end, we had a set of 342 valid responses.

The 342 respondents reside in 26 countries across four continents as shown in Fig. 2. The top two countries in which the respondents reside are China and the United States. The number of years of professional experience of the respondents varied from 0.1 to 25 years, with an average of 4.3 years. Our survey respondents are distributed across different demographic groups (job roles and product areas) as shown in Fig. 3.

Fig. 2: Countries in which survey respondents reside. The darker the color is, the more respondents reside in that country. The legend presents the top 5 countries with the most respondents: China 254 (74%), United States 29 (8%), India 10 (3%), Germany 7 (2%), Japan 5 (1%).

Fig. 3: Survey respondent demographics. The number indicates the count of each demographic group: by product area, ML FTL (39), ML App (59), Non-ML FTL (151), Non-ML App (93); by job role, Project Management (23), Testing (82), Development (237).

Practitioners' Perceptions. We provided the list of 31 final statements, and asked practitioners to respond to each statement on a 5-point Likert scale (strongly disagree, disagree, neutral, agree, strongly agree). To focus the respondents' attention on a particular area in the survey, they were explicitly asked to rate each statement with respect to their experience with the major product area they specified.

We piloted the preliminary survey with a small set of practitioners who were different from our interviewees, focus-group participants and survey takers. We obtained feedback on (1) whether the length of the survey was appropriate, and (2) the clarity and understandability of the terms. We made minor modifications to the preliminary survey based on the received feedback and produced a final version. Note that the collected responses from the pilot survey are excluded from the presented results in this paper.

To support respondents from China, we translated our survey to Chinese before publishing the survey. We chose to make our survey available both in Chinese and English because Chinese is the most spoken language and English is an international lingua franca. We expect that a large number of our survey recipients are fluent in one of these two languages. We carefully translated our survey to make sure there exists no ambiguity between English and Chinese terms in our survey. Also, we polished the translation by improving clarity and understandability according to the feedback from our pilot survey.

8. A software system with ML components.
9. Questionnaire Online: https://drive.google.com/file/d/124ttMmqSXglilEUuevAMP85jvP4uyVoc
3.3.3 Data Analysis
We examined distributions of Likert responses for our participants and compared the distributions of different groups of participants using the Wilcoxon rank-sum test, i.e., ML vs. non-ML, and ML framework/tool/library vs. ML application. We report the full results in Section 4.3; along the way in Sections 4.1 and 4.2, we link interviewees' comments with survey responses by referring to survey statements like: [S1]. We number statements in the order in which they appeared in the survey, S1 through S31. We annotate each with whether they are statistically significant or not as follows:

• [✔S1] Significant difference between ML development and non-ML development that confirms interviewees' responses, and no significant difference between the development of ML framework/tool/library and development of ML software application;
• [✔✔S1] Significant difference between ML development and non-ML development, and significant difference between the development of ML framework/tool/library and development of ML software application;
• [S1] No significant differences;
• [✘S1] Significant difference between ML development and non-ML development, but opposite of interviewees' responses;
• [✘✔S1] Significant difference between ML development and non-ML development, but opposite of interviewees' responses, and significant difference between development of ML framework/tool/library and development of ML software application;
• [ ✔S1] No significant difference between ML development and non-ML development, but significant difference between development of ML framework/tool/library and development of ML software application.

Other outcomes are theoretically possible but did not occur in our survey results.
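For readers who want to reproduce this style of analysis on their own data, a minimal sketch follows (an editorial illustration only, assuming SciPy and statsmodels; the Likert values and statement labels are hypothetical): it compares the responses of two groups per statement with the Wilcoxon rank-sum (Mann-Whitney U) test and applies the Benjamini-Hochberg correction across statements.

# Illustrative group comparison of Likert responses (assumes SciPy and statsmodels).
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Hypothetical 1-5 Likert responses per statement for ML and non-ML respondents.
responses = {
    "S1": ([4, 5, 4, 3, 5, 4], [3, 2, 3, 3, 2, 4]),
    "S2": ([5, 4, 4, 5, 3, 4], [4, 4, 3, 5, 4, 3]),
}

p_values = []
for statement, (ml, non_ml) in responses.items():
    _, p = mannwhitneyu(ml, non_ml, alternative="two-sided")
    p_values.append(p)

# Benjamini-Hochberg correction across all statements.
reject, corrected, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for (statement, _), p, significant in zip(responses.items(), corrected, reject):
    print(statement, round(p, 3), "significant" if significant else "not significant")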
4 RESULTS

In this section, we report the results grouped based on the interview topics. We combined several topics into one when interviewees had little to say about a particular topic. In some cases, we have anonymized parts of quotes to maintain interviewees' privacy.

4.1 RQ1. Differences in Software Engineering Practices

4.1.1 Software Requirements
Nearly every interviewee made a strong statement about differences between the requirements of ML systems versus the requirements of non-ML systems. In essence, requirements of ML systems are generally data-driven, closely coupled with existing large-scale data of a particular application context.

Interviewees noted that requirements are more uncertain for ML systems than non-ML systems [S1]. As P9 noted, given that "machine learning systems usually aim to improve or accelerate the decision-making process (of executives in an organization or a company)", rather than detailed functional descriptions, the requirements usually include a conceptual description of the goal after applying the machine learning systems. Since the requirements of machine learning systems are data-driven, different data would lead to different requirements. Even for the same data, as P1 and P6 suggested, a different understanding of the data and different application contexts would lead to different requirements. Nevertheless, prior knowledge about the data and application contexts brings determinism to a certain extent. P6 gave a specific example, suggesting how prior knowledge helps to understand the data and application contexts:

There exists a kind of prior knowledge named "scenario prior knowledge". For instance, we know that the data imbalance problem occurs in the application of fraud detection. This is because good guys always account for a larger amount of people than bad guys. As we also know, in the field of online advertising, the conversion rate¹⁰ usually seems low and there exists a limit.

10. The probability that the user who sees the ad on his or her browser will take an action, i.e., the user will convert [25].

Instead of functional requirements as in non-ML software systems, quantitative measures comprise the majority of requirements for ML systems. As P4 pointed out, distinct types of quantitative measures would be leveraged to define requirements, e.g., accuracy, precision, recall, F measure and normalized discounted cumulative gain (nDCG). These quantitative measurements could either come from the "captain's call" by business stakeholders (P6) or be collected by project managers through user studies (P5). As P4 put it, the target scores for quantitative measures could vary from one application to another. In some safety-critical domains, the accuracy of ML systems is of great importance. As a consequence, higher scores of quantitative measures are expected for safety considerations. P5 echoed this, saying:

For online shopping recommendation systems, the quantitative measures are relatively not so restricted, and lower measures are tolerable.
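For reference, three of the measures P4 lists have standard textbook definitions (recalled here for the reader's convenience; they are not notation taken from the study), where TP, FP and FN denote true positives, false positives and false negatives:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]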
In contrast to non-ML systems, requirements for ML systems usually involve a large number of preliminary experiments [✔S2]. As P6 noted, business stakeholders might suggest leveraging a number of emerging machine learning algorithms to solve their business problems. One of the consequences is that it requires the requirements specialists to have a strong technical background in machine learning. The other consequence is that the requirement validation process involves a larger number of preliminary experiments. Those preliminary experiments are conducted by software engineers and are intended to validate and select machine learning algorithms among various candidates. As P8 explained:
Say A, B and C algorithms might be all suitable for a particular application context, but the performance closely depends on the actual data in practice. Requirements cannot be validated until preliminary experiments have been conducted.

The requirements should consider the predictable degradation in the performance of ML systems [✔S3]. As P6 noted, most ML systems might experience performance degradation after a period in production. P6 gave an example: in a fraud detection application, adversaries are always trying to find ways to evade detection strategies. As a result, two inherent requirements are expected for ML systems. First, ML systems are expected to be degradation-sensitive, i.e., be capable of perceiving performance degradation. Second, once a performance degradation occurs, the ML system needs to have considerable capability to adapt to the degradation, either by feeding new data to the learning algorithm or training a brand new model by using new data.

As we will discuss in subsequent sections, the data-driven and large-scale characteristics of ML systems have several consequences for the way they are developed, compared to non-ML systems.

4.1.2 Software Design
Interviewees repeatedly mentioned that the design of ML systems and non-ML software systems places emphasis differently in a few ways.

First, the high-level architectural design for ML systems is relatively fixed [✘S4]. As P3 summarized, the architecture of ML systems typically consists of data collection, data cleaning, feature engineering, data modeling, execution, and deployment. In contrast, the architectural design for non-ML software systems is a more creative process, which implements various structural partitionings of software components and generates behavioral descriptions (e.g., activity diagrams and data flow diagrams) (P12). Due to the high volume of data, the distributed architectural style is widely preferred for ML systems. The distributed architectural style usually leads to complexity in architectural and detailed design.

Second, ML systems place less emphasis on low coupling between components than non-ML software systems [S5]. Although different components in ML systems have separate functionalities, they are highly coupled. For instance, the performance of data modeling is dependent on data processing. As P14 noted, "'garbage in and garbage out' - I would spend 40% of my time on data processing since I found that poor data processing could fail any potential effective [machine learning] models ... I divide the data processing into multiple steps and may use existing libraries for each step".

Third, detailed design is more flexible for ML systems than non-ML software systems [S6]. P1 noted that data modeling could contain tens to hundreds of candidate machine learning algorithms, which indicates an ample search space. P6 echoed this, saying:

Even for the same machine learning algorithm, various application contexts may introduce differences in the dimensions of data, and further lead to changes in machine learning models.

As a consequence, the detailed design of ML systems would be time-consuming and conducted in an iterative way [✔S7]. To design an effective model, software engineers tend to conduct a large number of experiments.

4.1.3 Software Construction and Tools
Interviewees reported several differences between ML systems and non-ML software systems in terms of coding practice. First, the coding workload of ML systems is low compared to non-ML software systems [S8]. Instead of coding for implementing particular functionalities as in non-ML software systems, coding in ML systems generally includes data processing (e.g., transformation, cleansing, and encoding), feature analysis (e.g., visualization and statistical testing), and data modeling (e.g., hyperparameter selection and model training). P14 pointed at the availability of useful frameworks and libraries for data processing and data modeling. These frameworks and libraries help developers accelerate the coding process. To achieve better performance, developers can extend these frameworks or libraries to adapt them for their own use. Second, there is little code reuse between and within ML systems, compared to non-ML software systems [S9]. One reason is that ML systems frequently have a significant emphasis on performance. However, the performance of ML systems highly depends on the data, and data vary across different application contexts. Thus, project-specific performance tuning is necessary.

Debugging in non-ML software systems aims to locate and fix bugs in the code [S10]. Unlike non-ML software systems, debugging in ML systems aims to improve performance. The performance of ML systems generally cannot be known or evaluated "until the last minute when the data model is finalized" (P13, P14). Efficiently finalizing a data model plays an important role in the construction of ML systems. However, data modeling involves multiple iterative training rounds. Considering the high volume of data, each round of training may take a long time, days or weeks, if the complete data are taken. It is infeasible to use the complete data to train models for each round. Thus, several interviewees suggested a practical data modeling process (P4, P6, P13):

You need to build several training datasets of different sizes, from small-scale to large-scale. You start with the small-scale dataset to train models. Till you achieve acceptable results, you move to a larger scale. Finally, all the way up, you would find a satisfactory model.

Although this process improves the training efficiency, incomplete training data might risk introducing inaccuracy in intermediate results that may lead to bias in models.
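A minimal sketch of the progressive data modeling process the interviewees describe is shown below (an editorial illustration only, assuming scikit-learn; the fractions and the acceptance threshold are hypothetical): train on increasingly large subsets, and only move to a larger scale once the result on the smaller one is acceptable.

# Illustrative progressive training on growing subsets of the training data (assumes scikit-learn).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def progressive_fit(X_train, y_train, X_val, y_val, fractions=(0.01, 0.1, 0.5, 1.0), threshold=0.7):
    model = None
    for fraction in fractions:
        n = max(1, int(len(X_train) * fraction))
        model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
        score = f1_score(y_val, model.predict(X_val))
        print(f"trained on {n} examples, validation F1 = {score:.3f}")
        if score < threshold:
            break  # revisit features or the algorithm before spending time on a larger scale
    return model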
Interviewees mentioned that ML systems and non-ML software systems differ in debugging practice. Debugging practice for non-ML software systems typically uses step-by-step program execution through breakpoints. For ML systems, especially deep learning software systems, "debugging aims to make it more straightforward to translate ideas from the developer's head into code". P6 gave a specific example about the "dynamic computational graph":

Previously, developers prefer to use PyTorch mainly because of its support for the dynamic computational graph. 'Dynamic computation' means that the model is executed in the order you wrote it. Well, like I am building a neural network, I would add up layers one by one. Then each layer has some tradeoffs; for instance, I would like to add an operator layer implementing normalization.
[If a debugging tool does not support the dynamic computational graph,] I cannot evaluate if this addition is good or not until the neural network is compiled into a model and real data go in. The dynamic computational graph allows debugging on sample data immediately and helps me verify my idea quickly.

Nevertheless, debugging on the dynamic computational graph has drawbacks. Once the data volume is extremely high, computation for each layer takes a long time to finish. This delays the construction of further layers. In addition, interviewees also mentioned that creativity appears to be important in debugging of ML systems [S11]. Part of the reason appears to be that ML systems have an extensive search space for model tuning.

Interviewees pointed at several differences in bugs between ML systems and non-ML software systems. First, ML systems do not have as many bugs that reside in the code as non-ML software (P11) [S12]. Instead, bugs are sometimes hidden in the data (P10). P1 gave a recent example in which misusing training data and testing data with intensive overlap results in incredibly good performance, which actually indicates a bug. As P4 suggested, "generalization of data models is also required to be taken care of". Second, ML systems have specific types of bugs when taking data into account. As P11 stated, the mismatch of data dimension order between two frameworks may cause bugs when integrating these two frameworks. Third, in contrast to non-ML software systems, a single failed case hardly helps diagnose a bug in ML systems. As P13 explained, sometimes, developers of ML systems find bugs by just "staring at every line of their code and trying to think why it would cause a bug".

4.1.4 Software Testing and Quality
Although software quality is important in both ML and non-ML systems, the practice of testing appears to differ significantly. One significant difference exists in the reproducibility of test results. In contrast to non-ML software systems, the testing results of ML systems are hard to reproduce because of a number of sources of randomness [S13]. As P8 explained:

The randomness [in ML systems] complicates testing. You have random data, random observation order, random initialization for your weights, random batches fed to your network, random optimizations in different versions of frameworks and libraries ... While you can seed the initialization, fixing the batches might come with a performance hit, as you would have to turn off the parallel batch generation which many frameworks do. There is also the approach to freeze or write out the weights after just one iteration, which solves the weight-initialization randomness.

Another difference exists in the testing methods and resulting outputs. Testing practice in ML systems usually involves running an algorithm multiple times and gathering a population of performance measurements [S14]. As P12 explained:

Testing practice for machine learning software mainly aims to verify the quantitative measures that indicate performance. However, machine learning algorithms are stochastic. Thus, we usually use k-fold cross-validation to do the testing.

As P9 echoed, the testing outputs are expected to be a range rather than a single value.
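The two practices mentioned here, seeding the sources of randomness and reporting a population of measurements rather than a single score, can be combined in a small sketch like the following (an editorial illustration only, assuming scikit-learn; the model and data are placeholders).

# Illustrative k-fold evaluation that reports a distribution of scores under fixed seeds
# (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)       # placeholder data
model = GradientBoostingClassifier(random_state=42)              # seed the learner itself
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # seed the fold assignment

scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

# Report a range, not a single value.
print(f"F1 over {len(scores)} folds: {scores.mean():.3f} +/- {scores.std():.3f} "
      f"(min {scores.min():.3f}, max {scores.max():.3f})")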
The interviewees stated that test case generation for ML systems is more challenging, compared to non-ML systems. Automated testing tools are not used as frequently as for non-ML systems [✘✔S18]. The test dataset is essential to the quality of test cases. Collecting testing datasets is labor intensive [✔✔S15]. If the application is for general use where correct answers are known to human users, labeling tasks could be outsourced to non-technical people outside the organization, as P5 noted. More details are needed for these automated methods or tools. However, biases may be introduced to the test dataset through the methods or tools, and consequently affect both performance and generalizability. As P8 put it:

Sometimes, we (developers) generate expected results for the test cases using the algorithms or models constructed by ourselves. Paradoxically, this may introduce bias because it is like we define acceptance criteria for our code.

Moreover, generating a reliable test oracle is sometimes infeasible for some ML systems. P6 gave a specific example in the anomaly detection application context:

Clients gave us a test dataset and told us the dataset contains labeled anomaly cases. However, we have no way to know how many anomaly cases exactly there are in the dataset, because some anomalies may not have been recognized and labeled in the dataset.

Good testing results cannot guarantee the performance of ML systems in production [✔S17]. The performance in production to a large extent depends on how similar the training dataset and the incoming data are (P6).

In ML systems, "too low" and "too high" scores for performance measures as testing results both indicate bugs [S16]. P1 gave a recent example of a junior developer who obtained an F1 score of 99% in his data model. In fact, after carefully going through the dataset, an extensive overlap was discovered between the training and testing datasets. Some interviewees reported several specific tactics in testing ML systems (P13):

We can use a simple algorithm as the baseline, for example, a random algorithm. If the baseline performs quite well on our dataset, there might exist bugs in our dataset. If our algorithm performs worse than the baseline, there might be some bugs in our code.
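P13's baseline tactic can be sketched as follows (an editorial illustration only, assuming scikit-learn; the threshold value is hypothetical): compare the trained model against a trivial random baseline, and treat a suspiciously strong baseline, or a model that loses to it, as a signal to inspect the dataset or the code.

# Illustrative baseline sanity check (assumes scikit-learn).
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

def sanity_check(model, X_train, y_train, X_test, y_test):
    baseline = DummyClassifier(strategy="stratified", random_state=0).fit(X_train, y_train)
    baseline_f1 = f1_score(y_test, baseline.predict(X_test))
    model_f1 = f1_score(y_test, model.fit(X_train, y_train).predict(X_test))
    if baseline_f1 > 0.9:        # hypothetical threshold: a "random" baseline should not excel
        print("baseline suspiciously high: check the dataset (e.g., train/test overlap, leakage)")
    if model_f1 < baseline_f1:   # the real model should beat the trivial baseline
        print("model loses to the baseline: check the modeling code")
    return baseline_f1, model_f1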
4.1.5 Software Maintenance and Configuration Management
Interviewees suggested that less effort may be required in maintenance for ML systems than for traditional software systems [S19]. One reason is that, different from non-ML software systems, ML systems run into predictable degradation in performance as time goes by (P4 and P7). To provide constantly robust results, ML systems should support "automatic" maintenance. Once performance degradation occurs, an ML system is designed to perceive the degradation and train new data models in an online/offline way using the latest emerged data. As P6 suggested:

We sometimes define so-called "health factors", or quantitative indicators, of the status of a machine learning system. They are associated with a specific application context. The indicators help a machine learning system perceive its performance in the specific application context.

Interviewees reported that configuration management for ML systems involves a larger amount of content compared to non-ML software [S20]. One reason is that machine learning models include not only code but also data, hyperparameters, and parameters. Developing ML systems involves rapid experimentation and iteration.
The performance of models would vary accordingly. To find the optimal combination of these parts that achieves the best performance, configuration management is required to keep track of the varying models and the associated tradeoffs, algorithm choice, architecture, data, and hyperparameters. As P4 explained:

It usually happened that my currently trained model performs badly. I might roll back to the previous model, and investigate the reasons ... Data in the cloud change over time, including those we use to train models. Models might degrade due to the evolving data. We may take a look at current data and compare them with previous data to see why degradation happens.

As a result, configuration management for ML systems becomes more complex compared to non-ML software. Besides code and dependencies, the data, model files, model dependencies, and hyperparameters require configuration management. The model checkpoints and data would take a large amount of space. As P8 noted, machine learning frameworks trade off exact numeric determinism for performance, and "the dependent frameworks can change over time, sometimes radically". To reproduce the results, a snapshot of the whole system may be required.
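One lightweight way to keep track of the artifacts listed above is to record them together for every training run. The sketch below is an editorial illustration only (the field names and file layout are hypothetical, not a scheme from the study); it snapshots the data version, hyperparameters and resulting metrics alongside the saved model.

# Illustrative experiment record for configuration management of ML artifacts.
import hashlib, json, time
from pathlib import Path

def record_run(run_dir, data_path, algorithm, hyperparameters, metrics, framework_versions):
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "data_file": str(data_path),
        # Hash the training data so a later run can detect that the data changed.
        "data_sha256": hashlib.sha256(Path(data_path).read_bytes()).hexdigest(),
        "algorithm": algorithm,
        "hyperparameters": hyperparameters,
        "metrics": metrics,
        "framework_versions": framework_versions,  # e.g., {"scikit-learn": "1.4.2"}
    }
    (run_dir / "run.json").write_text(json.dumps(record, indent=2))
    return record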
4.1.6 Software Engineering Process and Management
ML and non-ML systems differ in the processes that are followed in their development. As P2 suggested, during the step of context understanding, it is important to communicate with other stakeholders about what machine learning is and is not capable of. P6 mentioned that some stakeholders might misunderstand what machine learning is capable of:

They usually overestimate the effect of machine learning technologies. Machine learning is not a silver bullet; the effect highly depends on the data they have [✔✔S21].

P14 mentioned that data processing is important to the success of the whole process [✔S22]: "garbage in garbage out, it is worth spending the time". P12 noted that data visualization plays a crucial role in understanding feature distributions and correlations as well as identifying interesting trends and patterns in the data. As P6 noted, domain knowledge is pivotal in understanding data features:

However, sometimes domain experts are reluctant to share their knowledge. They may be afraid of being replaced by automated software or do not have any accurate reasoning but intuition.

It is hardly possible to develop a good model in a single pass (P6). The step aims to find the right balance through trial and error. Even the best machine learning practitioners need to tinker to get the models right. Sometimes, practitioners may go back to the data step to analyze errors.

As P3, P4, and P5 mentioned, practitioners create an API or Web application with the machine learning model as an endpoint during production. Practitioners also need to plan for how frequently the model needs to be retrained with updated data. During the monitoring step, the performance of models is tracked over time. Once data change in a way that invalidates the models, the software should be prepared to respond to the mistakes and unexpected consequences.

Interviewees suggested that a significant difference between the management of ML versus non-ML development is that the management of ML development lacks specific and practical guidance [S23]. In contrast to the development of non-ML software, development of ML systems is an iterative optimization task by nature; "there is more than one right answer" as long as the quantitative measures meet the expectation (P1). Sometimes, the available data are not sufficient to support the application context. It is impossible to achieve the expected quantitative measures no matter how good the trained model is. P6 explained:

No one knows if the quantitative metrics are achievable until we finish training our model.

Interviewees also mentioned that the development plan of ML systems is more flexible than for non-ML software systems. The progress of model training is usually not linear.

4.2 RQ2. Differences in Work Characteristics

In this section, we discuss differences between ML and non-ML development in terms of work characteristics. No common themes emerged from several work feature topics in our interviews or focus groups (work scheduling autonomy, task variety, significance, feedback from the job, information processing, specialization, feedback from others, social support, work conditions, ergonomics, experienced meaningfulness, experienced responsibility, knowledge of results). Thus we do not discuss them in this section.

4.2.1 Skill Variety
Interviewees identified two main differences between ML and non-ML development in terms of skill variety.

First, interviewees noted that developing machine learning frameworks and applications presented distinct technical challenges. For example, P10 suggested that, in addition to programming skills, ML development tends to require "specialized knowledge in math, information theory and statistics" [✔S24]. P13 explained the differences as:

[For the development of non-ML software systems] developers can write code for a particular business requirement once they have learned a programming language... [For the development of ML systems] math and statistics specialized knowledge is a crucial prerequisite of designing effective models for a particular business requirement. Nevertheless, proficient programming skill is still important and could help with model implementation.

Second, interviewees suggested that a wider variety of skills is required for ML development [S25], which can make ML development more challenging if a developer lacks those skills. As P5 suggested, "in addition to programming skills, data analysis skill is required for ML development". P10 summarized that the data analysis skill consists of the abilities to acquire, clean, explore, and model data. As P6 noted, the huge volume of data in ML development brings new challenges to data analysis:

In the context of big data, performing statistical data analysis is not enough. Developers should be able to handle data analysis for a huge volume of data. For example, developers need the skills of writing distributed programs for data analysis and using distributed computing frameworks.

4.2.2 Job Complexity and Problem Solving
Interviews indicated that ML development and non-ML development present complexity in different aspects.
10

4.2.2 Job Complexity and Problem Solving

Interviews indicated that ML development and non-ML development present complexity in different aspects. As P12 suggested, the job complexity of non-ML development resides in architecture design and module partitioning [✓ S26]; in contrast, the job complexity of ML development resides in data modeling [S27]. P14 explained that "the architectures of distinct machine learning applications are relatively fixed, they usually consist of several modules, i.e., data collection, data pre-processing, data modeling, and testing".

The difference in job complexity leads to a difference in problem solving. Interviewees mentioned that, for non-ML development, a clear roadmap usually exists to produce a good architecture design and module partitioning [✓ S28]. Developers could then follow a step-by-step approach to implement each module (P5, P6, P8). However, for ML development, no clear roadmap exists to build effective data models (P2). As P6 suggested, the problem solving process in ML development has more uncertainties compared to non-ML development:

We do not know what results the data can tell, to what extent a data model can be improved. We would try whatever that we think may work. Thus, the search space becomes quite large, and the workload might explode accordingly. Sometimes, more workload does result in better performance.

4.2.3 Task Identity

Interviewees reported few differences in terms of task identity. One difference is suggested by P4 and P5, who reported that it is harder to make an accurate plan for tasks in ML development [✓ S29]. P4 summarized the reasons as:

In non-ML software development, the project can be divided into distinct tasks according to function points. Developers could easily tell how long it will take to finish the implementation of a particular function point. However, in machine learning development, data modeling is an indivisible task. To achieve acceptable performance, the search space is usually quite large. Making an accurate plan for such a task is hard.

Besides, interviewees noted that ML developers have less control over their task progress towards target performance (P9, P10) [S30]. Once data modeling starts in ML development, hard work may not always lead to satisfying results (P4).

4.2.4 Interaction Outside the Organization

Interviewees reported that ML developers face more challenges when communicating with customers [✓ S31]. As P2 explained:

It is harder to communicate the project progress for machine learning development due to the non-linear process of data modeling... The results of machine learning development are not straightforward to interpret. For example, it is difficult to explain why a neural network works for image processing.

4.3 Survey Results

We summarize the survey results in Table 2. The Statement column shows the statements presented to respondents. The following column indicates the labels we used to identify statements throughout the paper. The four Likert Distribution subcolumns present the distribution of agreement for each group of respondents (ML practitioners - ML, Non-ML practitioners - Non-ML, practitioners of ML Framework/Tool/Library - ML FTL, and practitioners of ML Application - ML App). For the Likert distributions, the leftmost bar indicates strong disagreement, the middle bar indicates neutrality, and the rightmost bar indicates the strongest agreement. For example, most machine learning practitioners strongly agree with S24.

The P-value column indicates whether the differences in agreement for each statement are statistically significant between ML and non-ML in the first sub-column, and between ML FTL and ML App in the second sub-column. The table is sorted by the p-values with Benjamini-Hochberg correction in the first sub-column. Statistically significant differences at a 95% confidence level (Benjamini-Hochberg corrected p-value < 0.05) are highlighted in green.

The Effect Size column indicates the difference between ML and non-ML in the first sub-column, and the difference between ML FTL and ML App in the second sub-column. We use Cliff's delta to measure the magnitude of the differences since Cliff's delta is reported to be more robust and reliable than Cohen's d [32]. Cliff's delta represents the degree of overlap between two sample distributions, ranging from −1 to +1. The extreme value ±1 occurs when the intersection between both groups is an empty set. When the compared groups tend to overlap, Cliff's delta approaches zero. Effect sizes are additionally colored on a gradient from blue to orange based on the magnitude of the difference, with reference to the interpretation of Cliff's delta in Table 3: blue means the former group is more likely to agree with the statement, and orange means the latter group is more likely to agree with the statement.
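For illustration, the sketch below computes Cliff's delta for two groups of Likert responses, labels its magnitude using the thresholds given in Table 3, and applies a Benjamini-Hochberg adjustment to a set of p-values. The responses and p-values are invented for the example; they are not data from our survey, and the snippet is not the exact script used in this study.

```python
# Illustrative sketch of the effect-size and p-value handling described above.
import numpy as np

def cliffs_delta(xs, ys):
    """Cliff's delta: P(X > Y) - P(X < Y) over all cross-group pairs."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    greater = sum(int((x > ys).sum()) for x in xs)
    less = sum(int((x < ys).sum()) for x in xs)
    return (greater - less) / (len(xs) * len(ys))

def interpret(delta):
    """Magnitude labels following Table 3."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.330:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Hypothetical 5-point Likert responses for one statement.
ml = [5, 5, 4, 4, 5, 3, 4, 5]
non_ml = [3, 2, 4, 3, 3, 2, 4, 3]
delta = cliffs_delta(ml, non_ml)
print(delta, interpret(delta))

# Hypothetical raw p-values for several statements.
print(benjamini_hochberg([0.001, 0.004, 0.03, 0.20, 0.45]))
```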
Overall, the results of the survey confirm some differences in interviewees' claims. Based on the observed statistically significant differences between ML and non-ML development, we can say with some certainty that:

• Requirements: Collecting requirements in the development of ML systems involves more preliminary experiments, and requirements need to consider predictable degradation in the performance of the software. [✓ S2, ✓ S3]
• Design: Detailed design of ML systems is more time-consuming and tends to be conducted in an intensively iterative way. [✓ S7]
• Testing and Quality: Collecting a testing dataset requires more effort for ML development; good performance during testing cannot guarantee the performance of ML systems in production. [✓✓ S15, ✓ S17]
• Process and Management: The availability of data usually limits the capability of ML systems; data processing tends to be more important to the success of the whole process. [✓✓ S21, ✓ S22]
• Skill Variety: ML development intensively requires knowledge in math, information theory, and statistics. [✓ S24]
• Job Complexity and Problem Solving: ML practitioners have a less clear roadmap for building systems. [✓ S28]
• Task Identity: It is much harder to make an accurate plan for the tasks of ML development. [✓ S29]
• Interaction: ML practitioners tend to communicate less frequently with clients. [✓ S31]
TABLE 2: Survey Results. Orange cells indicate where the former group (ML practitioners/ML FTL practitioners) disagrees more strongly with the statement than the latter group (non-ML practitioners/ML App practitioners); blue cells indicate where the former group agrees more strongly. Green cells represent statistically significant differences. The number in "()" indicates the size of each group. (The Likert-distribution bar charts and the cell coloring of the original table cannot be reproduced in this text rendering; a ✓ or ✗ precedes each statistically significant p-value.)

Statement | Label | Cliff's Delta, ML (98) vs. Non-ML (244) | Cliff's Delta, ML FTL (39) vs. ML App (59) | P-value, ML vs. Non-ML | P-value, ML FTL vs. ML App
Developing my software requires knowledge in math, information theory and statistics. | S24 | 0.45 | -0.19 | ✓ .000 | .320
Detailed design is time-consuming and conducted in an iterative way. | S7 | 0.32 | 0.18 | ✓ .000 | .271
Requirements should consider predictable degradation in the performance of software. | S3 | 0.29 | 0.11 | ✓ .000 | .433
It is easy to make an accurate plan for the development tasks of my software. | S29 | -0.32 | 0.03 | ✓ .000 | .779
Data processing is important to the success of the whole development process. | S22 | 0.26 | -0.20 | ✓ .000 | .271
Collecting a testing dataset is labor intensive. | S15 | 0.27 | -0.26 | ✓ .000 | .188
Developing my software requires frequent communications with the clients. | S31 | -0.29 | -0.14 | ✓ .000 | .577
My software is tested by using automated testing tools. | S18 | 0.26 | 0.48 | ✗ .000 | ✓ .001
Good testing results can guarantee the performance of my software in production. | S17 | -0.23 | 0.09 | ✓ .001 | .482
Available data limit the capability of my software. | S21 | 0.22 | -0.48 | ✓ .001 | ✓ .001
Collecting requirements involves a large number of preliminary experiments. | S2 | 0.20 | 0.09 | ✓ .002 | .577
A clear roadmap exists to build my software. | S28 | -0.24 | 0.07 | ✓ .002 | .661
High level architectural design is relatively fixed. | S4 | -0.20 | 0.07 | ✗ .017 | .719
Creativity is important during debugging. | S11 | 0.12 | 0.07 | .064 | .719
My team puts a lot of effort into maintenance of my software. | S19 | -0.15 | 0.21 | .065 | .188
The higher the performance measures are, the better my software is. | S16 | 0.08 | 0.19 | .068 | .271
Architecture design is complicated for my software. | S26 | 0.10 | 0.32 | .069 | ✓ .047
Data modeling is complicated for my software. | S27 | 0.07 | -0.15 | .077 | .471
Detailed design is flexible. | S6 | 0.08 | 0.11 | .107 | .459
Requirements of my software are uncertain. | S1 | 0.10 | 0.11 | .151 | .482
Testing involves multiple runs of my software to gather a population of quantitative measures. | S14 | 0.07 | 0.06 | .152 | .719
My coding workload is heavy. | S8 | 0.07 | 0.08 | .223 | .665
I have control over the progress towards the target performance. | S30 | 0.06 | 0.02 | .265 | .756
Code reuse happens frequently across different projects. | S9 | 0.06 | -0.12 | .265 | .482
Debugging aims to locate and fix bugs in my software. | S10 | 0.06 | -0.01 | .358 | .943
Creating my software requires a team of people, each with different skills. | S25 | 0.04 | 0.09 | .540 | .482
Testing results of my software are hard to reproduce. | S13 | 0.04 | 0.04 | .546 | .787
Low coupling in the components of my software is important. | S5 | -0.01 | 0.08 | .861 | .459
Configuration management is mainly for the code. | S20 | -0.07 | -0.14 | .894 | .943
Software engineering management lacks practical guidance for my software. | S23 | -0.01 | -0.20 | .894 | .482
Bugs in my software usually reside in the code. | S12 | -0.01 | -0.05 | .979 | .787
TABLE 3: Interpretation of Cliff's delta value.

Cliff's Delta Value | Interpretation
|δ| < 0.147 | Negligible
0.147 ≤ |δ| < 0.330 | Small
0.330 ≤ |δ| < 0.474 | Medium
|δ| ≥ 0.474 | Large

The survey results cannot confirm several of interviewees' claims about differences between ML and non-ML. For example, requirements are deterministic across ML system development and non-ML software development [S1]. One explanation is that although there is uncertainty in aspects of the models that ML developers ship, the requirements themselves are deterministic.

5 DISCUSSION

5.1 Implications

Embracing Uncertainty. As mentioned by our interviewees, uncertainty lies in various aspects of the development of ML systems.
First, uncertainty comes from the data as part of the requirement. Although a development team of an ML system has a target to attain, e.g., building speech recognition software with absolute precision, a set of preliminary experiments is required to make sure the goal is achievable and the available data suffice for the target. Understanding application contexts, quick data visualization, and hands-on experimental data modeling on a small-scale dataset could be helpful to accelerate the progress of preliminary experiments during the requirement gathering and analysis phase. Instead of trying a number of tools, it might be wiser for machine learning practitioners to focus on a few tools to learn and use [22]. The exploratory process of preliminary experiments in ML development is similar to scientific programming [9] and may benefit from the lessons learned from scientific programming.
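As an illustration of such a hands-on preliminary experiment, the sketch below trains a trivial baseline and a simple candidate model on a small sample to gauge whether the available data appear to support the target at all. The dataset, sample size, and models are assumptions made for the example, not recommendations from our participants.

```python
# Illustrative feasibility check on a small-scale sample: if a simple model
# barely beats a majority-class baseline, the target may need renegotiation
# or more/better data. Assumes a CSV with numeric, non-missing features and
# a "label" column -- all hypothetical.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

sample = pd.read_csv("events.csv").sample(n=1000, random_state=0)
X = sample.drop(columns=["label"])
y = sample["label"]

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
candidate = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("baseline accuracy:", baseline.mean())
print("candidate accuracy:", candidate.mean())
```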
Second, uncertainty originates in the inherent randomness of machine learning algorithms. Machine learning practitioners should shift their mindset and embrace uncertainty. For instance, a machine learning algorithm may be initialized to a random state; random noise helps to effectively find an optimized solution during gradient descent (stochastic gradient descent). To reduce uncertainty, machine learning practitioners could achieve reproducibility to some extent by using the same code, data, and initial random state. Thus, version control toolchains for code, data, and parameters are essential to achieve reproducibility [5]. However, storing all states may introduce significant overhead and slow down the development process, so the effectiveness of such toolchains is subject to future investigation. In addition, to evaluate the performance of a machine learning algorithm, practitioners usually randomly split the data into a training and test set or use k-fold cross-validation. The performance of a machine learning algorithm should be reported as a distribution of measures, rather than a single value, as emphasized by our participants.
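A minimal sketch of these two practices, fixing the sources of randomness and reporting cross-validated performance as a distribution, is shown below. The model, the synthetic data, and the number of folds are placeholders chosen for illustration, not the configuration used by any of our participants.

```python
# Illustrative sketch: fix the random state for reproducibility and report
# cross-validation performance as a distribution rather than a single number.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 42  # same seed + same code + same data => repeatable runs
X, y = make_classification(n_samples=500, random_state=SEED)  # placeholder data

model = RandomForestClassifier(n_estimators=100, random_state=SEED)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)
scores = cross_val_score(model, X, y, cv=cv)

# Report the distribution of measures, not just one value. In practice the
# seed, hyper-parameters, and a hash of the data would be versioned with the code.
print("accuracy per fold:", np.round(scores, 3))
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```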
Handling Data. As discussed by our interviewees, data play a vital role in the development of ML systems. A large quantity of training data can have a tremendous impact on the performance of ML systems. As confirmed by our participants, data collection becomes one of the critical challenges for the development of ML systems. The data collection literature focuses on three lines of research [31]: data acquisition techniques to discover, augment, or generate datasets; data labeling techniques to label individual data points; and transfer learning techniques to improve existing datasets. Future studies could integrate existing data collection techniques into the process of ML development. In addition, existing data collection techniques tend to be application or data specific. More effort is needed to generalize those proposed techniques to various applications.

ML practitioners also used distributed platforms to process a large quantity of data in parallel. Debugging these parallel computations for data modeling is time-consuming and error-prone. As illustrated in [16], future studies could put more effort into facilitating interactive and real-time debugging for ML systems.

Along with data quantity, data quality is also critical to building a powerful and robust ML system. "Garbage in, garbage out": what practitioners obtain from machine learning software is a representation of what they feed into it. Real-world data comprise missing values, imbalanced data, outliers, etc. It becomes imperative that machine learning practitioners process the data before building models. Future research could develop data visualization tools that give an overview of the data, help in locating irregularities, and enable practitioners to focus on where the data actually need cleansing. However, high-quality datasets during development cannot ensure the high performance of machine learning systems eternally. Within a rapidly evolving environment, a machine learning system degrades in accuracy as soon as the software is put into production [34]. Practitioners need to recognize that there is never a final version of a machine learning system; it needs to be updated and improved continuously over time (e.g., by feeding in new data and retraining models). Online feedback and performance measurement of ML systems are fertile areas for future research.
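Returning to the data-quality point above, the sketch below shows the kind of quick overview (missing values, class imbalance, rough outlier counts) that practitioners can compute before building models. The file and column names are hypothetical, and the snippet is only an illustration of the overview such tooling should make routine.

```python
# Illustrative data-quality overview before modeling: missing values,
# class imbalance, and simple outlier counts. Path and column names are
# assumptions for the example.
import pandas as pd

df = pd.read_csv("training_data.csv")

# Missing values per column.
print(df.isna().sum().sort_values(ascending=False))

# Class balance of the target column.
print(df["label"].value_counts(normalize=True))

# Rough outlier count per numeric column (values beyond 3 standard deviations).
numeric = df.select_dtypes(include="number")
z = (numeric - numeric.mean()) / numeric.std()
print((z.abs() > 3).sum())
```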
5.2 Threats to Validity

Internal Validity. It is possible that some of our survey respondents had a poor understanding of the statements for rating. Their responses may introduce noise into the data that we collected. To reduce the impact of this issue, we included an "I don't know" option in the survey and ignored responses marked as such. We also dropped responses that were submitted by people whose job roles are none of these: software development, testing, and project management. Two of the authors translated our survey to Chinese to ensure that respondents from China could understand our survey well. To reduce the bias of presenting the survey bilingually, we carefully translated our survey to make sure there is no ambiguity between English and Chinese terms. We also polished the translation by improving clarity and understandability according to the feedback from our pilot survey.

The effect sizes of statistically significant differences between ML and non-ML development reported in this work range from negligible to medium. A negligible effect size indicates that a particular difference between machine learning and non-machine learning development is trivial, even if it is statistically significant. To mitigate this threat, we did not emphasize those differences in our results.

As we selected survey respondents, we sent invitations to a variety of potential respondents that might be involved in different parts of ML ecosystems (ML frameworks, tools, and libraries; software applications with ML components). We mixed the responses from those ML respondents when we studied the differences between ML and non-ML development. It is possible that differences exist in the perceptions of these two groups and overwhelm ML versus non-ML differences. To prevent this threat, we compared the differences in distributions of Likert responses between these two groups.

To recruit respondents from both ML and non-ML populations, we spread the survey broadly to a wide range of companies from various locations around the world. At the beginning of the survey, we articulated that the purpose of our study is to understand whether and how machine learning changes software engineering practices. This description may attract more attention from the part of the non-ML population who know about ML but for whom ML is not part of their daily work. In addition, the description may generate a tacit presumption that machine learning changes software engineering practices. The presumption may mislead respondents to exaggerate the differences they perceived.

External Validity. To improve the generalizability of our findings, we interviewed 14 practitioners from three companies, and surveyed 342 respondents from 26 countries across four continents who are working for various companies (e.g., Amazon, Alibaba, Baidu, Google, Hengtian, IBM, Intel, IGS, Kodak, Lenovo, Microsoft, and Morgan Stanley) or contributing to open source machine learning projects that are hosted on GitHub, in various roles.

We wish, though, to highlight that while we selected employees from three Chinese IT companies for our interviews, we improved the responses from interviews through focus group discussions that involved more IT companies, and the surveyed population is considerably wide. The improved responses from the interviews were used to bootstrap the statements to rate in our survey. The survey permitted respondents to add additional comments whenever appropriate via free-form fields; looking at the responses in such fields, we do not observe any signs of missing statements. In addition, some reported claims from our interviewees were not validated through the survey and might be premature.

6 RELATED WORK

Some prior work provides prescriptive practices for the development of machine learning systems (e.g., [29]). Some discuss the realistic challenges and best practices in industry, e.g., machine learning model management at Amazon [33] and best practices of machine learning engineering at Google [40]. Others investigated machine-learning-related questions on Stack Overflow [38]. These works are based on the experience of the authors and largely do not contextualize machine learning development as a special type of software engineering. In contrast, our findings are based on empirical observations that explicitly focus on the differences between ML and non-ML development.

Like our work, several researchers have conducted empirical studies of software engineering for data science. Some focus on how data scientists work inside a company, via interviews to identify pain points from a general tooling perspective [13], and explore challenges and barriers for adopting visual analytic tools [20]. Others focus on characterizing professional roles and practices regarding data science: Harris et al. [17] surveyed more than 250 data science practitioners to categorize data science practitioners and identify their skill sets. Kim et al. interviewed sixteen
data scientists at Microsoft to identify five working styles [21], and supplement Harris et al.'s survey with tool usage, challenges, best practices, and time spent on different activities [22]. In contrast to this prior work, our paper studies broad differences between ML and non-ML development.

Most similar to our study is an empirical investigation of integrating AI capabilities into software and services and best practices from Microsoft teams [4]. From their proposed ML workflow, they identified 11 challenges, including 1) data availability, collection, cleaning, and management, 2) education and training, 3) hardware resources, 4) end-to-end pipeline support, 5) collaboration and working culture, 6) specification, 7) integrating AI into larger systems, 8) guidance and mentoring, 9) AI tools, 10) scale, and 11) model evolution, evaluation, and deployment. The identified challenges emerge across different software development practices. Our findings differ from theirs in a number of areas:

• Design. Both studies agree that maintaining modularization in ML development is difficult. Their study [4] summarized the reasons as the low extensibility of an ML model and the non-obvious interaction between ML models. In contrast, we found that ML development places emphasis on low coupling in components comparable to non-ML development [S5]. Both studies agree that rapid iterations exist in the detailed design of ML systems [✓ S7].
• Construction and Tools. Both studies agree that code reuse in ML development is challenging due to varying application context and input data. Despite the challenge of code reuse in ML development, we found that code reuse happens in ML development as frequently as in non-ML development [S9].
• Process and Management. Both studies agree that management of ML development is challenging due to the involvement of data, and that the availability of data usually limits the capability of ML systems [✓✓ S21].
• Configuration Management. Both studies agree that data versioning is required in ML development. Despite the necessity of data versioning, we found that current configuration management activities in ML development still focus on code versioning [S20]. In addition to data versioning, the earlier study [4] suggested keeping track of how data is gathered and processed.

As discussed above, our study confirms some of the findings reported in Amershi et al.'s work. Unlike Amershi et al.'s work, our study followed the SWEBOK and considered work characteristics from the applied psychology domain in our interviews; as a result, we recognized differences between ML and non-ML development in other aspects, e.g., requirements gathering, job complexity, the problem solving process, and task identity. Moreover, we collected perceptions from broader population groups, e.g., involving open source practitioners and professionals from various software industries.

7 CONCLUSION

In this work, we identified significant differences between ML and non-ML development. The differences lie in a variety of aspects, including software engineering practices (e.g., exploratory requirements elicitation and iterative processes in ML development) and the context of software development (e.g., high complexity and demand for unique solutions and ideas in ML development). The differences originate from inherent features of machine learning: uncertainty and the data for use.

To tackle uncertainty, ML practitioners should shift their mindset and embrace the uncertainty in preliminary experiments and the randomness of ML algorithms. They could learn lessons from scientific programming, which also involves exploratory processes in development. In addition, version control toolchains for code, data, and parameters could play a vital role for ML practitioners to achieve reproducibility. ML practitioners should also devote sufficient effort to handling the data for use. Future studies could put more effort into providing interactive and real-time debugging tools to facilitate efficient development of ML systems. To deal with the rapid evolution of data, online feedback and performance measurement for ML systems are fertile areas for future research.

In a larger sense, this work represents a step towards understanding software development not as a homogeneous bulk, but as a rich tapestry of varying practices that involve people of diverse backgrounds across various domains. Precise differences may reside in different kinds of ML architectures. We leave these questions to future studies.

ACKNOWLEDGMENT

The authors would like to thank all interviewees for their participation and survey participants for responding to our survey. This research was partially supported by the National Key Research and Development Program of China (2018YFB1003904).

REFERENCES

[1] NVivo qualitative data analysis software, 2018.
[2] The Amazon machine learning process. https://docs.aws.amazon.com/machine-learning/latest/dg/the-machine-learning-process.html, 2019. Mar. 2019.
[3] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.
[4] S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann. Software engineering for machine learning: A case study. In Proceedings of the 39th International Conference on Software Engineering - SEIP track. IEEE Computer Society, May 2019.
[5] A. Anjos, M. Günther, T. de Freitas Pereira, P. Korshunov, A. Mohammadi, and S. Marcel. Continuously reproducing toolchains in pattern recognition and machine learning experiments. In Proceedings of the 34th International Conference on Machine Learning, Aug. 2017. https://openreview.net/group?id=ICML.cc/2017/RML.
[6] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, et al. TFX: A TensorFlow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1387–1395. ACM, 2017.
[7] P. Bourque, R. E. Fairley, et al. Guide to the Software Engineering Body of Knowledge (SWEBOK (R)): Version 3.0. IEEE Computer Society Press, 2014.
[8] V. Braun and V. Clarke. Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2):77–101, 2006.
[9] J. C. Carver, R. P. Kendall, S. E. Squires, and D. E. Post. Software development environments for scientific and engineering software: A series of case studies. In Proceedings of the 29th International Conference on Software Engineering, pages 550–559. IEEE Computer Society, 2007.
[10] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
[11] G. Ericson. What is the team data science process? https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview, 2017. Mar. 2019.
[12] A. Ferlitsch. Making the machine: the machine learning lifecycle. https://cloud.google.com/blog/products/ai-machine-learning/making-the-machine-the-machine-learning-lifecycle, 2019. Mar. 2019.
[13] D. Fisher, R. DeLine, M. Czerwinski, and S. Drucker. Interactions with big data analytics. interactions, 19(3):50–59, May 2012.
[14] Z. Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, 2015.
[15] G. Guest, A. Bunce, and L. Johnson. How many interviews are enough? An experiment with data saturation and variability. Field Methods, 18(1):59–82, 2006.
[16] M. A. Gulzar, M. Interlandi, S. Yoo, S. D. Tetali, T. Condie, T. Millstein, and M. Kim. BigDebug: Debugging primitives for interactive big data processing in Spark. In Proceedings of the IEEE/ACM 38th International Conference on Software Engineering, pages 784–795. IEEE, 2016.
[17] H. Harris, S. Murphy, and M. Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O'Reilly Media, Inc., 2013.
[18] S. E. Humphrey, J. D. Nahrgang, and F. P. Morgeson. Integrating motivational, social, and contextual work design features: a meta-analytic summary and theoretical extension of the work design literature. Journal of Applied Psychology, 92(5):1332, 2007.
[19] M. I. Jordan and T. M. Mitchell. Machine learning: Trends, perspectives, and prospects. Science, 349(6245):255–260, 2015.
[20] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917–2926, 2012.
[21] M. Kim, T. Zimmermann, R. DeLine, and A. Begel. The emerging role of data scientists on software development teams. In Proceedings of the 38th International Conference on Software Engineering, pages 96–107. ACM, 2016.
[22] M. Kim, T. Zimmermann, R. DeLine, and A. Begel. Data scientists in software teams: State of the art and challenges. IEEE Transactions on Software Engineering, 44(11):1024–1038, 2018.
[23] B. A. Kitchenham and S. L. Pfleeger. Personal opinion surveys. In Guide to Advanced Empirical Software Engineering, pages 63–92. Springer, 2008.
[24] T. D. LaToza, G. Venolia, and R. DeLine. Maintaining mental models: A study of developer work habits. In Proceedings of the 28th International Conference on Software Engineering, ICSE '06, pages 492–501, New York, NY, USA, 2006. ACM.
[25] K.-c. Lee, B. Orten, A. Dasdan, and W. Li. Estimating conversion rate in display advertising from past performance data. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 768–776. ACM, 2012.
[26] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, et al. DeepGauge: Multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 120–131. ACM, 2018.
[27] S. Ma, Y. Aafer, Z. Xu, W.-C. Lee, J. Zhai, Y. Liu, and X. Zhang. LAMP: Data provenance for graph based machine learning algorithms through derivative computation. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pages 786–797. ACM, 2017.
[28] S. Ma, Y. Liu, W.-C. Lee, X. Zhang, and A. Grama. MODE: Automated neural network model debugging via state differential analysis and input selection. In Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 175–186. ACM, 2018.
[29] A. Ng. Machine Learning Yearning. URL: http://www.mlyearning.org, 2017.
[30] K. Pei, Y. Cao, J. Yang, and S. Jana. DeepXplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 1–18. ACM, 2017.
[31] Y. Roh, G. Heo, and S. E. Whang. A survey on data collection for machine learning: A big data-AI integration perspective. arXiv preprint arXiv:1811.03402, 2018.
[32] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek. Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys. In Proceedings of the Annual Meeting of the Florida Association of Institutional Research, pages 1–33, 2006.
[33] S. Schelter, F. Biessmann, T. Januschowski, D. Salinas, S. Seufert, G. Szarvas, et al. On challenges in machine learning model management. http://sites.computer.org/debull/A18dec/p5.pdf, 2018. Mar. 2019.
[34] D. Talby. Lessons learned turning machine learning models into real products and services. https://www.oreilly.com/ideas/lessons-learned-turning-machine-learning-models-into-real-products-and-services, 2018. Mar. 2019.
[35] R. Thomas. What do machine learning practitioners actually do? https://www.fast.ai/2018/07/12/auto-ml-1/, 2018. Mar. 2019.
[36] Y. Tian, K. Pei, S. Jana, and B. Ray. DeepTest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pages 303–314. ACM, 2018.
[37] P. K. Tyagi. The effects of appeals, anonymity, and feedback on mail survey response patterns from salespeople. Journal of the Academy of Marketing Science, 17(3):235–241, Jun 1989.
[38] Z. Wan, J. Tao, J. Liang, Z. Cai, C. Chang, L. Qiao, and Q. Zhou. Large-scale empirical study on machine learning related questions on Stack Overflow. Journal of ZheJiang University (Engineering Science), 53(5):819–828, 2019.
[39] X. Xie, J. W. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen. Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software, 84(4):544–558, 2011.
[40] M. Zinkevich. Rules of machine learning: Best practices for ML engineering. https://developers.google.com/machine-learning/guides/rules-of-ml/, 2018. Mar. 2019.