
What Improves Developer Productivity at Google? Code Quality

Lan Cheng, Emerson Murphy-Hill, Mark Canning, Ciera Jaspan, Collin Green, Andrea Knight, Nan Zhang, and Elizabeth Kammer
Google, United States

ABSTRACT

Understanding what affects software developer productivity can help organizations choose wise investments in their technical and social environment. But the research literature either focuses on what correlates with developer productivity in ecologically valid settings or focuses on what causes developer productivity in highly constrained settings. In this paper, we bridge the gap by studying software developers at Google through two analyses. In the first analysis, we use panel data with 39 productivity factors, finding that code quality, technical debt, infrastructure tools and support, team communication, goals and priorities, and organizational change and process are all causally linked to self-reported developer productivity. In the second analysis, we use a lagged panel analysis to strengthen our causal claims. We find that increases in perceived code quality tend to be followed by increased perceived developer productivity, but not vice versa, providing the strongest evidence to date that code quality affects individual developer productivity.

CCS CONCEPTS

· Software and its engineering → Software creation and management; Software development process management; · General and reference → Empirical studies.

KEYWORDS

Developer productivity, code quality, causation, panel data

ACM Reference Format:
Lan Cheng, Emerson Murphy-Hill, Mark Canning, Ciera Jaspan, Collin Green, Andrea Knight, Nan Zhang, and Elizabeth Kammer. 2022. What Improves Developer Productivity at Google? Code Quality. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '22), November 14-18, 2022, Singapore, Singapore. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3540250.3558940

This work is licensed under a Creative Commons Attribution 4.0 International License.
ESEC/FSE '22, November 14-18, 2022, Singapore, Singapore
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9413-0/22/11.
https://doi.org/10.1145/3540250.3558940

1 INTRODUCTION

Organizations want to maximize software engineering productivity so that they can make the best software in the shortest amount of time. While software engineering productivity "is essential for numerous enterprises and organizations in most domains" [50] and can be examined through multiple lenses [29], understanding the productivity of individual software developers can be especially fruitful because it has the potential to be improved through many actions (e.g. from tooling to process changes) and by many stakeholders (from individual developers to executives). However, it is difficult to know which actions will truly improve productivity in an ecologically valid setting, that is, in a way that accurately characterizes productivity in a realistic software development context. This motivates our research question:

RQ: What causes improvements to developer productivity in practice?

A wide spectrum of prior research provides some answers to this question, but with significant caveats. For example, at one end of the research spectrum, Ko and Myers' controlled experiment showed that a novel debugging tool called Whyline helped Java developers fix bugs twice as fast as those using traditional debugging techniques [30]. While this evidence is compelling, organizational leaders are faced with many open questions about applying these findings in practice, such as whether the debugging tasks performed in that study are representative of the debugging tasks performed by developers in their organizations. At the other end of the spectrum, Murphy-Hill and colleagues surveyed developers across three companies, finding that job enthusiasm consistently correlated with high self-rated productivity [36]. But yet again, an organizational leader would have open questions about how to apply these results,
such as whether some unmeasured third variable causes both high productivity and high job enthusiasm.

More broadly, these examples illustrate the fundamental limitations of prior approaches to understanding developer productivity. On one hand, software engineering research that uses controlled experiments can help show with a high degree of certainty that some practices and tools increase productivity, yet such experiments are by definition highly controlled, leaving organizations to wonder whether the results obtained in the controlled environment will also apply in their more realistic, messy environment. On the other hand, research that uses field studies - often with cross-sectional data from surveys or telemetry - can produce contextually valid observations about productivity, but drawing causal inferences from field studies that rival those drawn from experiments is challenging.

Our study builds on the existing literature about developer productivity, contributing the first study to draw strong causal conclusions in an ecologically valid context about what affects individual developer productivity.

2 MOTIVATION

The paper's main technical contribution - the ability to draw stronger causal inferences about productivity drivers than in prior work - is enabled by the use of the panel data analysis technique [25]. In this section, we motivate the technique with a running example.

Much of the prior work on developer productivity (Section 3) relies on cross-sectional data. To illustrate the limitations of cross-sectional data, let us introduce an example. Consider a survey that asks about respondents' productivity and the quality of their codebase. The survey is distributed at a large company, and two developers respond, Aruj and Rusla. Let's assume their responses are representative of the developer population. Their survey responses are shown in Table 1.

Table 1: Hypothetical cross-sectional survey response data.

        Productivity Rating     Code Quality Rating
Aruj    Somewhat productive     Medium quality
Rusla   Extremely productive    Extremely high quality

From this survey, we see that productivity correlates with code quality. But we cannot confidently say that high code quality causes high developer productivity, due in part to the following confounding explanations [1]:

• Time-invariant effects. These are effects that have the same influence over time. For example, if Rusla went to college and Aruj did not, from cross-sectional data, we cannot distinguish between the effect of college and the effect of code quality on productivity.
• Respondent-independent time effects. These are effects that influence all respondents uniformly, such as seasonal effects or company-wide initiatives. For example, prior to the survey, perhaps all engineers were given their annual bonus, artificially raising everyone's productivity.
• Non-differentiated response effects. These are effects where respondents will give the same or similar responses to every survey question, sometimes known as straightlining. For example, perhaps Aruj tends to choose the middle option to every question and Rusla tends to answer the highest option for every question.

We use panel analysis to address these confounds, enabling stronger causal inference than what can be obtained from cross-sectional data [25]. The power of panel data is that it uses data collected at multiple points in time from the same individuals, examining how measurements change over time.

To illustrate how panel data enables stronger causal inference, let us return to the running example. Suppose we run the survey again, three months later, and obtain the data shown in Table 2. One interesting observation is that if we analyze Table 2 in isolation, we notice that there's not a correlation between productivity and code quality - both respondents report the same code quality, regardless of their productivity. But more importantly, looking at the changes in responses from Table 1 and Table 2, we see productivity changes are now correlated: Aruj's increasing productivity correlates with increasing code quality, and Rusla's decreasing productivity correlates with decreasing code quality.

Table 2: More hypothetical survey responses, collected 3 months after the data in Table 1. Pluses (+) and minuses (-) indicate the direction of the change since the prior survey.

        Productivity Rating       Code Quality Rating
Aruj    Highly productive (+)     High Quality (+)
Rusla   Somewhat productive (-)   High Quality (-)

Panel analysis rules out the three confounding explanations present in the cross-sectional analysis:

• In the cross-sectional analysis, we could not determine if Rusla's high productivity was driven by her college education. But in this panel analysis, we can rule out that explanation, because college is a time invariant exposure - it theoretically would have the same effect on her productivity in the first survey as in the second survey. This ability to rule out other potential causes that are time invariant exists whether or not the researcher can observe those potential causes. While with cross-sectional analysis, researchers may be able to control for these potential causes using control variables, researchers have to anticipate and measure those control variables during analysis. This is unnecessary with panel analysis because time invariant factors are eliminated by design.
• In the cross-sectional analysis, we could not determine if productivity was driven by a recent annual bonus. This explanation is ruled out in the panel analysis. If the annual bonus had an effect, the change in productivity scores across both participants would be uniform.
• In the cross-sectional analysis, we could not determine if respondents were just choosing similar answers to every question. This explanation is also ruled out with panel analysis. If respondents were choosing similar answers, there would be no change in productivity scores, and thus we would not see a correlation among the changes.


The ability of panel analyses to draw relatively strong causal inferences makes it a quasi-experimental method, combining some of the advantages of experiments with those of field studies [24].

3 RELATED WORK

To answer research questions similar to ours, several researchers previously investigated what factors correlate with developer productivity. Petersen's systematic mapping literature review describes seven studies that quantify factors that predict software developer productivity [39], factors largely drawn from the COCOMO II software cost driver model [10]. For instance, in a study of 99 projects from 37 companies, Maxwell and colleagues found that certain tools and programming practices correlated with project productivity, as measured by the number of lines of written code per month [33]. More broadly, a recent study explored what factors correlate with individual developers' self-reported productivity at three companies [36]. In contrast to our study, these prior studies report correlations with relatively weak causal claims.

Other researchers have been able to make stronger causal claims about programmer productivity by running controlled experiments. For instance, when Tosun and colleagues asked 24 professionals to complete a simple programming task, either using test-driven development (treatment group) or iterative test-last development (control group), they found that treatment group participants were significantly more productive than control group participants, where productivity was measured as the number of tests passed in a fixed amount of time [49]. Such randomized controlled experiments are considered a "gold standard" because they can make very strong causal claims [8]. The challenge with such studies is that they are expensive to run with high ecological validity. Consequently, such studies typically use students as participants rather than professionals (e.g. [13, 44, 46]), use small problems and programs rather than more realistic ones (e.g. [5, 35, 46]), and can only vary one or two productivity factors per experiment (e.g. [14, 31, 42]). While the study presented here cannot make as strong causal claims as experiments, the present field study has higher ecological validity than experimental studies.

To address these challenges, software engineering researchers have been using causal inference techniques in field studies, where stronger causal claims can be made than in studies with simple correlations. The core of such studies is analyses that leverage time series data, rather than cross-sectional data. For instance, Wang and colleagues use Granger's causality test [20] to infer that women's pull requests cause increases in those women's follower counts [51]. As another example, using the Bayesian CausalImpact framework [6], Martin and colleagues show that 33% of app releases caused statistically significant changes to app user ratings [32]. These papers used fine-grained time series data, which is not possible for the type of data described in this paper, and to our knowledge, has not been applied to studies of developer productivity.

Panel analysis, another causal inference technique, has been used by prior software engineering researchers. Qiu and colleagues used GitHub panel data to show that "social capital impacts the prolonged engagement of contributors to open source" [40]. Islam and colleagues used panel data to show that distributed version control systems "reduce the private costs for participants in an OSS project and thus increases the number of participants, but decreases the average level of contribution by individual participants" [26]. Like these papers, we use panel data to make causal inferences, but in our case, the inferences are about developer productivity.

4 PANEL ANALYSIS: METHODS

Towards answering our research question, we next describe our data sources, dependent variables, independent variables, panel data, and modeling design.

4.1 Data Sources

The data of this study comes from two sources: Google engineers' logs data and a company-wide survey at Google. Neither source was built specifically for the research we describe here, and so we consider this opportunistic research.

4.1.1 Logs Data. We collected a rich data set from engineers' logs from internal tools, such as a distributed file system that records developers' edits, a build system, and a code review tool. This data contains fine-grained histories of developers' work, enabling us to make accurate measurements of actual working behavior, such as the time developers spend actively writing code. The data helps us characterize developers' work practices, such as what kinds of development tasks they are doing, how long those tasks take, and how long they are waiting for builds and tests to complete. Details on these tools, how data is aggregated into metrics, measurement validation, and ethical considerations of data collection can be found elsewhere [27]. We describe the exact metrics we use in Section 4.3.

4.1.2 EngSat. The Engineering Satisfaction (EngSat) Survey is a longitudinal program to: understand the needs of Google engineers; evaluate the effectiveness of tools, process, and organization improvements; and provide feedback to teams that serve Google engineers. Every three months, the survey is sent out to one-third of eligible engineers - those in one of five core engineering job roles, who have been at Google's parent company for at least 6 months, and who are below the "director" level. The same engineers are re-surveyed every three quarters, and a random sample of one-third of new engineers is added each quarter so that all engineers are invited. The survey questions cover a range of topics, from productivity to tool satisfaction to team communication. Respondents are asked to describe their experience in the 3 month period prior to taking the survey. Before beginning the survey, respondents are informed how the data is used and that participation is voluntary.

While EngSat response rates are typically between 30% and 40%, response bias does not appear to be a significant threat. We know this because we analyzed two questions for response bias, one on productivity and one on overall engineering experience satisfaction. We found that EngSat tends to have lower response rates for technical leads and managers, those who have been at Google for a longer period, and for engineers from the San Francisco Bay Area, where Google is headquartered. To estimate the impact of non-response bias on a metric derived from EngSat responses, we compare a bias-corrected version of the metric to its uncorrected version and check for the difference. The bias-corrected metric is calculated by reweighting EngSat responses with the propensity score (a similarity measure [21]) of responding to EngSat, which is
estimated based on factors such as developer tenure, work location, and coding language and tools. We find that after correcting for this non-response bias using propensity score matching, the percent of engineers who responded favorably did not change significantly for either question. For instance, adjusting for non-response bias, productivity decreases relatively by 0.7%, which is too small to reach statistical significance at the 95% level. These results were consistent across the three rounds of EngSat that we analyzed.
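A minimal sketch of this kind of non-response correction in R follows; the paper does not publish its implementation, so the data frame, column names, logistic propensity model, and inverse-propensity weighting below are illustrative assumptions rather than the authors' exact procedure.

# Hypothetical sketch: estimate each invitee's propensity to respond from
# observable factors, then compare respondents' raw favorability with a
# version reweighted by the inverse of their propensity to respond.
# `invitees` has one row per surveyed engineer; `responded` is 0/1 and
# `favorable` is 1 if a respondent answered favorably (NA for non-respondents).
propensity_model <- glm(responded ~ tenure_years + location + primary_language,
                        data = invitees, family = binomial())
invitees$p_respond <- predict(propensity_model, type = "response")

respondents <- subset(invitees, responded == 1)
uncorrected <- mean(respondents$favorable)
corrected   <- weighted.mean(respondents$favorable, w = 1 / respondents$p_respond)

# A small gap between the two numbers suggests non-response bias has little
# effect on the metric, as reported above.
c(uncorrected = uncorrected, corrected = corrected)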
4.2 Dependent Variable: Productivity

We use self-rated productivity from our survey as our dependent variable. While subjective and objective measures of productivity each have advantages and disadvantages, we chose a subjective measure of productivity here both because it is straightforward to measure in survey form and because it is used broadly in prior research [15, 34, 36].

The EngSat survey asks the question: Overall, in the past three months how productive have you felt at work at Google/Alphabet? Respondents can choose "Not at all productive", "Slightly productive", "Moderately productive", "Very productive", or "Extremely productive". We coded this variable from 1 (Not at all productive) to 5 (Extremely productive).

Prior software engineering research has shown that subjective productivity correlates with objective measures of productivity [36, 37] as a way to establish convergent validity of question-based productivity metrics (that is, how they relate to other measures of the same construct [7]). We sought to do the same by correlating our subjective productivity measure with several objective measures of productivity. Rather than using a linear correlation as used in prior work, we were open to the possibility that relationships were non-linear, and thus we selected a random forest as a classifier.

First, we created a simple random forest to predict a binary version of self-rated productivity, where we coded "Extremely productive" and "Very productive" as productive, and the other values as not productive. We then predicted this binary measure of self-rated productivity using six quantitative productivity metrics measured over a three month period. Two of the measures capture the amount of output produced over the fixed period:

• Total Number of Changelists. This represents the number of changelists (CLs) that an engineer merged, after code review, into Google's main code repository.
• Total Lines of Code. Across all CLs an engineer merged, the total number of lines of code changed.

Two measures capture the amount of time it takes an engineer to produce one unit of output (a changelist):

• Median Active Coding Time. Across every CL merged, the median time an engineer spent actively writing code per CL [27].
• Median Wall-Clock Coding Time. The median wall-clock time an engineer spent writing code per CL, that is, the time elapsed between when the engineer starts writing code and when they request the code be reviewed.

The remaining two measures captured non-productive activities, that is, how much time an engineer spends waiting per unit of output (a changelist):

• Median Wall-Clock Review Time. The median wall-clock time an engineer spent waiting for code review per CL.
• Median Wall-Clock Merge Time. The median wall-clock time an engineer waited between approval for merging and actually merging per CL.

We gathered the above data over 6 consecutive quarters from 2018Q1 to 2019Q2. For each quarter, we linked an engineer's subjective measure of productivity to the above six quantitative measures. Since engineers are invited to take our survey once every 3 quarters, a single engineer may be represented at most twice in this data set. In total, we had 1958 engineer data points for our model.

Figure 1: Quantitative features that predicted self-rated productivity.

After randomly selecting 10% of the data for validation, the model had 83% precision and 99% recall, suggesting a substantial relationship between quantitative and qualitative productivity measures. Looking at the importance of each quantitative metric in classifying developers in the model (Figure 1), we see that Median Active Coding Time was the most predictive quantitative feature. This aligns with Meyer and colleagues' finding that Microsoft engineers view coding as their most productive activity [34].
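A minimal sketch of this validation step in R, using the randomForest package that the authors list in Section 4.5; the data frame, column names, and split details are hypothetical stand-ins rather than the authors' exact procedure.

# Hypothetical sketch: predict binary self-rated productivity from the six
# logs-based metrics and check precision/recall on a held-out 10% sample.
library(randomForest)

# `eng` has one row per engineer-quarter with the six quantitative metrics and
# `productive` coded as a factor ("yes" = Very/Extremely productive, else "no").
set.seed(42)
holdout <- sample(nrow(eng), size = round(0.1 * nrow(eng)))
train   <- eng[-holdout, ]
test    <- eng[holdout, ]

rf <- randomForest(productive ~ num_changelists + lines_changed +
                     active_coding_time + wallclock_coding_time +
                     review_wait_time + merge_wait_time,
                   data = train, importance = TRUE)

pred <- predict(rf, newdata = test)
precision <- sum(pred == "yes" & test$productive == "yes") / sum(pred == "yes")
recall    <- sum(pred == "yes" & test$productive == "yes") / sum(test$productive == "yes")
c(precision = precision, recall = recall)

# varImpPlot(rf) shows per-metric importance, analogous to Figure 1.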
4.3 Independent Variables

To predict the dependent variable, we started with 42 independent variables - reduced to 39 after a multicollinearity check (Section 4.6) - available from the survey and logs data. Since survey respondents are asked to report on their experiences from the three months prior to the survey, we collected log data for the corresponding three month period. While many metrics could be analyzed, we selected metrics that were relatively straightforward to collect and that appeared plausibly related to individual productivity, based on consultation with internal subject matter experts within Google who were experienced with building and deploying developer metrics.

Below, we group independent variables into six categories, describe each variable, and link them to prior work. We give each variable a short name (in parentheses) to make referencing them easier in the remainder of the paper. Full survey questions and response scales are available in the Appendix.

4.3.1 Code Quality & Technical Debt. The first category of potential drivers of productivity are those relating to code quality and technical debt. Based on experience, DeMarco and Lister claim that software quality, generally speaking, "is a means to higher productivity" [11]. In an experiment, Schankin and colleagues found that participants found errors 14% faster when descriptive identifier names were used [45]. Studying small industrial programs
written in the 1980s, Gill and Kemerer found that code complexity correlates with software maintenance productivity [17]. Based on interviews and surveys with professional software developers, Besker and colleagues found that technical debt correlates negatively with developer productivity [3, 4].

We measured code quality and technical debt with 5 subjective factors from our survey:

• Code Quality Satisfaction (sat. with project code quality, sat. with dependency code quality)
• Code Technical Debt (project tech debt)
• Dependency Technical Debt (dependency tech debt)
• Technical Debt Hindrance (tech debt hindrance)

4.3.2 Infrastructure Tools & Support. The next category of potential drivers of productivity are issues relating to tools and infrastructure. Prior work showed that using "the best tools and practices" was the strongest correlate of individual productivity at Google, though not a significant correlate at two other companies [36]. Storey and colleagues also found that Microsoft developers' processes and tools correlated with individual productivity [47].

This category had 6 objective and 12 subjective measures:

• Tools, infrastructure and service satisfaction (sat. with infra & tools)
• Tools and infrastructure choice (choices of infra & tools)
• Tools and infrastructure innovativeness (innovation of infra & tools)
• Tools and infrastructure ease (ease of infra & tools)
• Tools and infrastructure frustration (frustration of infra & tools)
• Developer stack change (change of tool stack)
• Internal documentation support (doc. support)
• Internal documentation hindrance (doc. hindrance)
• Build & test cycle hindrance (build & test cycle hindrance)
• Build latency satisfaction (sat. with build latency)
• 50th and 90th percentile of build duration (p50 build time, p90 build time)
• % of long builds per week (% of long builds)
• 50th and 90th percentile of test duration (p50 test time, p90 test time)
• % of long tests per week (% of long tests)
• Learning hindrance (learning hindrance)
• Migration hindrance (migration hindrance)

4.3.3 Team Communication. The next category of drivers of productivity are issues relating to team communication. In a survey of knowledge workers, Hernaus and Mikulić found that social job characteristics (e.g. group cooperation) correlated with contextual job performance [23]. More specifically, in software engineering, Chatzoglou and Macaulay interviewed software developers, finding that most believed that communication among team members was very important to project success [9]. Studying communication networks quantitatively, Kidane and Gloor found that in the Eclipse project, a higher frequency of communication between developers correlated positively with performance and creativity [28].

To measure team communication in our study, we examined 9 objective measures and 1 subjective measure:

• 50th and 90th percentile of rounds of code review (p50 code review rounds, p90 code review rounds)
• 50th and 90th percentile of total wait time of code review (p50 code review wait time, p90 code review wait time)
• 50th and 90th percentile of code reviewers' organizational distances from author (p50 review org distance, p90 review org distance)
• 50th and 90th percentile of code reviewers' physical distances from author (p50 review physical distance, p90 review physical distance)
• Physical distance from direct manager (distance from manager)
• Code review hindrance (slow code review)

4.3.4 Goals and Priorities. Prior research suggests that changing goals and priorities correlate with software engineering outcomes. Surveying 365 software developers, The Standish Group found that changing requirements was a common stated reason for project failure [48]. Meyer and colleagues found that one of the top 5 most commonly mentioned reasons for a productive workday was having clear goals and requirements [34].

We measure this category with 1 subjective measure:

• Priority shift (priority shift)

4.3.5 Interruptions. Meyer and colleagues found that two of the top five most commonly mentioned reasons for a productive workday by 379 software developers were having no meetings and few interruptions [34]. Similarly, a prior survey of Google engineers showed that lack of interruptions and efficient meetings correlated with personal productivity, as did use of personal judgment [36].

We measure this category with 3 objective measures:

• 50th and 90th percentile of total time spent on incoming meetings per week (p50 meeting time, p90 meeting time)
• Total time spent on any meetings per week (total meeting time)

4.3.6 Organizational and Process Factors. Finally, outside of software engineering, organizational and process factors correlate with a variety of work outcomes. For example, according to healthcare industry managers, reorganizations can result in workers' sense of powerlessness, inadequacy, and burnout [19]. Although not well-studied in software engineering, based on personal experience, DeMarco and Lister [11] and Armour [2] point to bureaucracy and reorganizations as leading to poor software engineering outcomes.

This category had 2 subjective and 3 objective measures:

• Process hindrance (complicated processes)
• Organizational hindrance (team & org change)
• Number of times when engineers' direct manager changes but colleagues do not change (reorg direct manager change)
• Number of times when both an engineer's direct manager and colleagues change simultaneously (non-reorg direct manager change)
• Number of different primary teams the engineer has (primary team change)


4.4 From Variables to Panel Data

Since the survey was sent out to the same cohort of engineers every three quarters, we have accumulated a panel data set with two observations at different points in time for each engineer. After joining each engineer's survey data with their logs data, we have complete panel data for 2139 engineers.

4.5 Modeling

Using the panel data set, we applied a quasi-experiment method of panel data analysis to analyze the relationship between engineers' perceived overall productivity and the independent variables. In this paper, we use a fixed-effect model to analyze panel data at the developer level. The model is

y_{it} = \alpha_i + \lambda_t + \beta D_{it} + \epsilon_{it}    (1)

where

• y_{it} is the dependent variable, self-rated productivity for developer i at time t.
• \alpha_i is unobserved engineer time-invariant effects, such as education and skills.
• \lambda_t is the engineer-independent time effect, such as company-wide policy changes and seasonalities at time t.
• D_{it} = [D_{it}^1, D_{it}^2, \ldots, D_{it}^n] are observed productivity factors for developer i at time t.
• \beta = [\beta^1, \beta^2, \ldots, \beta^n] are the causal effects of productivity factors D_{it} at time t.
• \epsilon_{it} is the error term at time t.

To estimate the fixed-effect model, we differenced equation (1) between the two periods and have

\Delta y_{it} = \Delta\lambda_t + \beta \Delta D_{it} + \Delta\epsilon_{it}    (2)

where \Delta\lambda_t = \gamma_0 + \gamma_1 T. The \Delta prefix denotes the change from one time period to the next. T is a categorical variable representing panels in different time periods, if we have more than one panel. Note that after differencing, \alpha_i is cancelled out and \Delta\lambda_t can be explicitly controlled by transforming it to a series of dummy variables. Therefore, factors in \alpha_i and \lambda_t do not confound the results.

We then estimated equation (2) using Feasible Generalized Least Squares (FGLS); we chose FGLS to overcome heteroskedasticity and serial correlation between residuals, and for efficiency compared to Ordinary Least Squares estimators. The parameters of interest are the \beta terms. The hypothesis we are testing is that \beta = 0 for all D_{it}. Except for binary variables and percentage variables, we transform D_{it} into log(D_{it}). The benefit of taking a natural log is that it allows us to interpret estimates of the regression coefficients (the \beta terms) as elasticities: the percent change in the dependent variable associated with a percent change in an independent variable. This allows for both a uniform and intuitive interpretation of the effects across both logs-based and survey-based independent variables.

To liberally capture causal relationships between productivity factors and productivity, we use a p-value cutoff of 0.1 to define "statistically significant" results. If the reader prefers a more stringent cutoff or a false discovery correction, we facilitate this by reporting p-values.

Analysis code was written in R by the first author using the packages glmnet, randomForest, binom, car, and plm. All code was peer-reviewed using Google's standard code review process [43].
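A minimal sketch of such a first-difference panel model in R using the plm package listed above; the data frame, column names, the two illustrative factors standing in for the full set of 39, and the robust-standard-error shortcut are assumptions rather than the authors' exact FGLS specification.

# Hypothetical sketch: a two-wave panel indexed by engineer and quarter,
# estimated in first differences so engineer-level fixed effects drop out.
library(plm)
library(lmtest)

panel <- pdata.frame(survey_logs, index = c("engineer_id", "quarter"))

# Survey and logs-based factors are log-transformed as described above.
fd_model <- plm(productivity ~ log(code_quality_sat) + log(build_time_p90),
                data = panel, model = "fd")   # "fd" = first-difference estimator

# Heteroskedasticity- and serial-correlation-robust inference, a simpler
# stand-in for the FGLS estimation described in the text.
coeftest(fd_model, vcov = vcovHC(fd_model, method = "arellano"))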
4.6 Multicollinearity

To check for multicollinearity among the independent variables, we calculated Variance Inflation Factor (VIF) scores on these metrics. We found some build latency metrics were highly correlated and thus may cause a multicollinearity problem. After consulting with experts in our build system, we removed three build latency metrics that had a VIF score above 3 (p50 build time, p50 test time, and % of long tests), a threshold recommended by Hair and colleagues [22]. The final list of 39 metrics all have VIF scores below 3.
4.7 Threats to Validity

Like all empirical studies, ours is imperfect. In this section, we describe threats to the validity of our study, broken down into content, construct, internal, and external validity threats.

4.7.1 Content. Although our study examines a variety of facets of productivity, it does not examine every single aspect of productivity or of factors that may influence productivity.

With respect to productivity itself, we measure it with a single survey question. On one hand, the question itself is worded broadly and our validation (Section 4.2) shows that it correlates with other objective measures of productivity. On the other hand, as evidenced by the fact that the correlation was imperfect, it is likely that our question did not capture some aspects of developer productivity. As one example, our question was only focused on productivity of an individual developer, yet productivity is often conceptualized from a team, group, or company perspective [16].

Likewise, our set of productivity factors - like code quality and build speed - is incomplete, largely because we used conveniently available and subjectively-selected metrics and because we reused an existing long-running survey. In comparison, prior work, which used a custom-built cross-sectional survey, found that two of the strongest correlates with individual productivity were job enthusiasm and teammates' support for new ideas [36]. Neither of these two productivity factors was explored in the present survey, demonstrating that our productivity factors are incomplete.

4.7.2 Construct. Our EngSat survey measures a variety of theoretical concepts, and the questions it contains vary in construct validity. For instance, while we have demonstrated some amount of convergent validity of our productivity question, respondents to the question may have interpreted the word "productivity" differently - some may have interpreted it to refer only to the quick completion of work items, while others might take a more expansive view to include aspects such as quality. While we have tried to limit the impact of different interpretations of EngSat questions by piloting variations, gathering interpretive feedback, and refining wording iteratively, such issues are unavoidable threats.

Another specific threat to construct validity is inconsistent and ambiguous question wording. For instance, while respondents are advised at the beginning of the survey that they should report on experiences over the last 3 months, some questions (but not all) reinforce this scoping by beginning with "In the last three months...". As another example of inconsistency, while most questions ask only about experiences (which our models use to predict productivity), three questions ask about the relationship between experience and perceived productivity, such as "how much has technical debt ...
hindered your productivity?". As an example of ambiguity, several questions ask about engineers' experiences with the project they work on, but respondents interpret for themselves what a "project" is and, if they work on multiple projects, which one to report on.

4.7.3 Internal. As we argue in this paper, our use of panel analysis helps draw stronger causal inferences than those that can be drawn from cross-sectional data. However, the most significant caveat to our ability to draw causal inferences is time variant effects. In contrast to time invariant effects (e.g., prior education and demographics), time variant effects may vary over the study period. For instance, in our running example, if Aruj lost a mentor and Rusla gained a mentor between the two surveys, our analysis could not rule out mentorship as a cause of increased productivity or code quality. Thus, our analysis assumes that effects on individual engineers are time invariant. Violations of this assumption threaten the internal validity of our study.

Another internal threat to the validity of our study is participants who chose not to answer some or all questions in the survey. While our analysis of non-response bias (Section 4.1.2) showed that two survey questions were robust to non-response among several dimensions like level and tenure, non-response is still a threat. For one, respondents and non-respondents might differ systematically on some unmeasured dimension, such as how frequently they get feedback from peers. Likewise, respondents who choose not to answer a question will be wholly excluded from our analysis, yet such participants might differ systematically from those who answered every question.

Another threat to internal validity is that we analyzed data for only two panels per engineer. More panels per engineer would increase the robustness of our results.

4.7.4 External. As the title of this paper suggests, our study was conducted only at Google and generalizability of our results beyond that context is limited. Google is a large, US-headquartered, multinational, and software-centric company where engineers work on largely server and mobile code, with uniform development tooling, and in a monolithic repository. Likewise, during the study period Google developers mostly worked from open offices, before the global COVID-19 pandemic when many developers shifted to remote or hybrid work. While results would vary if this study were replicated in other organizations, contexts that resemble ours are most likely to yield similar results.

5 PANEL ANALYSIS: RESULTS

5.1 Factors Causally Linked to Productivity

Panel data analysis suggested that 16 out of the 39 metrics have a statistically significant causal relationship with perceived overall productivity, as listed in Table 3. The overall adjusted R-squared value for the model was 0.1019. In Table 3, the effect size should be read as the percent change in the dependent variable associated with that percent change in the independent variable. For instance, for code quality, a 100% change in project code quality (from "Very dissatisfied" to "Very satisfied") is associated with a 10.5% increase in self-reported productivity.

Table 3: Metrics' relationship with self-rated productivity.

Metric                               Effect size   p-value
Code Quality & Technical Debt
  sat. with project code quality         0.105      <0.001
  sat. with dependency code quality     -0.013       0.505
  project tech debt                      0.078      <0.001
  dependency tech debt                   0.042       0.012
  tech debt hindrance                   -0.009       0.459
Infrastructure Tools & Support
  sat. with infra & tools                0.113      <0.001
  choices of infra & tools               0.020       0.083
  innovation of infra & tools            0.106      <0.001
  ease of infra & tools                 -0.018       0.352
  frustration of infra & tools           0.002       0.952
  change of tool stack                   0.019       0.098
  doc. support                          -0.009       0.664
  doc. hindrance                        -0.005       0.715
  build and test cycle hindrance         0.029       0.064
  sat. with build latency                0.018       0.295
  p90 build time                        -0.024       0.019
  p90 test time                         -0.001       0.836
  % of long builds                       0.028       0.599
  learning hindrance                     0.038       0.006
  migration hindrance                   -0.001       0.929
Team Communication
  p50 code review rounds                 0.007       0.081
  p90 code review rounds                -0.014       0.058
  p50 code review wait time             -0.0006      0.875
  p90 code review wait time              0.0019      0.625
  p50 review org distance               -0.0008      0.424
  p90 review org distance               -0.0002      0.880
  p50 review physical distance           0.0012      0.261
  p90 review physical distance           0.0013      0.518
  distance from manager                  0.001       0.209
  slow code review                       0.051       0.004
Goals & Priorities
  priority shift                         0.077      <0.001
Interruptions
  p50 meeting time                       0.014       0.502
  p90 meeting time                       0.008       0.701
  total meeting time                    -0.009       0.692
Organizational Change and Process
  complicated processes                  0.027       0.067
  team and org change                    0.032       0.023
  reorg direct manager changes          -0.002       0.086
  non-reorg direct manager change       -0.002       0.525
  primary team change                    0.014       0.086

To summarize Table 3:

• For code quality, we found that perceived productivity is causally related to satisfaction with project code quality but not causally related to satisfaction with code quality in dependencies. For technical debt, we found perceived productivity is causally related to perceived technical debt both within projects and in their dependencies.
• For infrastructure, several factors closely related to internal infrastructure and tools showed a significant causal relationship with perceived productivity:
  - Engineers who reported their tools and infrastructure were not innovative were more likely to report lower productivity.
  - Engineers who reported the number of choices was either too few or too many were likely to report lower productivity. We further tested whether one of the two ("too few" or "too many") matters but not the other, by replacing this variable with two binary variables, one representing the case of "too few" choices and the other representing the case of "too many" choices. The results suggest that both cases are causally related to perceived productivity.
  - Engineers who reported that the pace of changes in the developer tool stack was too fast or too slow were likely to report lower productivity. Similarly, we tested the two cases, "too fast" or "too slow", separately by replacing this variable with two binary variables, one representing the case of "too fast" and the other representing the case of "too slow". Results suggested both cases matter for perceived productivity.
  - Engineers who were hindered by learning a new platform, framework, technology, or infrastructure were likely to report lower productivity.
  - Engineers who had longer build times or reported being hindered by their build & test cycle were more likely to report lower productivity.
• For team communication, a metric related to code review was significantly causally related with perceived productivity. Engineers who had more rounds of reviews per code review or reported being hindered by slow code review processes were likely to report lower productivity.
• For goals and priorities, engineers hindered by shifting project priorities were likely to report lower productivity.
• Organizational factors were linked to perceived productivity:
  - Engineers who had more changes of direct managers were more likely to report lower productivity.
  - Engineers who reported being hindered for team and organizational reasons, or by complicated processes, were more likely to report lower productivity.

5.2 Quadrant Chart

To visualize these factors in terms of their relative effect size and statistical significance, we plot them in a quadrant chart (Figure 2). The chart excludes factors whose p-value is greater than 0.1. The factors have various scales, from satisfaction score to time duration, so to make their effect sizes comparable, we standardized metrics by subtracting from each data point its mean and dividing it by its standard deviation. The x axis is the absolute value of the standardized effect size. The y axis is p-values.

The top five factors in terms of relative effect size are satisfaction with project code quality, hindrance of shifting priorities, technical debt in projects, innovation of infrastructure and tools, and overall satisfaction with infrastructure and tools.

Figure 2: Quadrant of productivity factors in effect size and statistical significance.
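A minimal sketch of this standardization and plot in R; the data frames, column names, and plotting choices are illustrative assumptions rather than the chart's actual implementation.

# Hypothetical sketch: z-score each metric so fitted effects are comparable,
# then plot |standardized effect| against p-value for factors with p < 0.1.
standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
metrics_z <- as.data.frame(lapply(metrics, standardize))  # `metrics`: numeric factors

# `effects` holds one row per factor (name, std_effect, p_value) from the
# panel model refit on the standardized metrics.
sig <- subset(effects, p_value < 0.1)
plot(abs(sig$std_effect), sig$p_value,
     xlab = "Absolute standardized effect size", ylab = "p-value")
text(abs(sig$std_effect), sig$p_value, labels = sig$name, pos = 3, cex = 0.7)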
6 LAGGED PANEL ANALYSIS: METHODS

The panel data analysis we conducted so far suggests satisfaction with code quality within projects is the strongest productivity factor among the 39 we studied, based on standardized effect size and p-value.

However, because the observed changes in factors coincided during the same time period, such conventional panel data analysis can tell which factors are causally related to overall productivity, but it does not tell us the direction of the causality.

So, does better code quality cause increasing productivity, or does increasing productivity cause improved code quality? Both linkages are theoretically plausible: on one hand, code quality might increase productivity because higher code quality may make it easier and faster to add new features; on the other hand, high productivity might increase code quality because engineers have free time to spend on quality improvement.

To verify the direction of the causal relationship between project code quality and productivity, we conducted another panel data analysis using lagged panel data. In this analysis, we focus only on the causal relationship between code quality and productivity. Although such an analysis is possible for other factors, it is nonetheless laborious, as we shall see shortly. Thus, we focus our lagged analysis on only these two variables, which had the strongest causal relationship in our prior analysis.

In short, we verified the direction of the linkage between project code quality and productivity by checking if the change in one factor is associated with the change in the other factor in the following period. The idea is that if project code quality affects productivity, we expect to see that changes in project code quality during time T-1 are associated with changes in productivity during time T. Since self-reported productivity is not available for two consecutive quarters (since each respondent is sampled only once every three quarters), we switch to logs-based metrics to measure productivity. Complementing our prior analysis based on self-ratings with a logs-based one has the additional benefit of increasing the robustness of our results.

More formally, we tested two competing hypotheses, Hypothesis QaP (Quality affects Productivity) and Hypothesis PaQ (Productivity affects Quality). Hypothesis QaP is that the changes in project code quality during time T-1 are associated with changes in productivity during time T. This implies improvements in project code quality lead to better productivity. Hypothesis PaQ is that changes in productivity in time T-1 are associated with changes in project code quality in time T. This implies better productivity leads to an improvement in project code quality.

Hypothesis QaP: Changes in code quality during time T-1 are correlated with changes in productivity during time T. The statistical model is

\Delta P_{it} = \alpha + \beta \Delta Q_{it-1} + \Delta\epsilon_{it}    (4)

where \Delta Q_{it-1} is the change in code quality at time t-1 and \Delta P_{it} is the following change in logs-based productivity metrics at time t. Given the available data, we use the difference between Q3 2018 and Q2 2019 to measure \Delta Q_{it-1} and the difference between Q3 2018 and Q3 2019 to measure \Delta P_{it}.


Hypothesis PaQ: Changes in productivity in time T-1 are correlated with changes in code quality in time T. The statistical model is

\Delta Q_{it} = \alpha + \beta \Delta P_{it-1} + \Delta\epsilon_{it}    (5)

where \Delta P_{it-1} is the change in logs-based productivity at time t-1 and \Delta Q_{it} is the following change in code quality at time t. Given the availability of data, we use the difference between Q3 2018 and Q2 2019 to measure \Delta P_{it-1} and the difference between Q3 2018 and Q1 2019 to measure \Delta Q_{it}.

For this analysis, we had full lagged panel data for 3389 engineers.
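A minimal sketch of the QaP regression in equation (4) in R; the vectors of quarterly measurements and the single logs-based outcome shown are illustrative assumptions, and in practice the regression would be repeated for each logs-based productivity metric.

# Hypothetical sketch: regress a later change in a logs-based productivity
# metric on an earlier change in code-quality satisfaction (Hypothesis QaP).
lagged <- data.frame(
  d_quality_lag = quality_2019q2     - quality_2018q3,      # change in Q at time t-1
  d_coding_time = coding_time_2019q3 - coding_time_2018q3)  # change in P at time t

qap_fit <- lm(d_coding_time ~ d_quality_lag, data = lagged)
summary(qap_fit)  # a significant negative slope is consistent with Hypothesis QaP

# The reverse test (Hypothesis PaQ) swaps the roles, e.g.:
# lm(d_quality ~ d_coding_time_lag, data = lagged_paq)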
7 LAGGED PANEL ANALYSIS: RESULTS

Our results support Hypothesis QaP but not Hypothesis PaQ. We found that a 100% increase of the satisfaction rating with project code quality (i.e. going from a rating of 'Very dissatisfied' to 'Very satisfied') at time T-1 was associated with a 10% decrease of median active coding time per CL, a 12% decrease of median wall-clock time from creating to mailing a CL, and a 22% decrease of median wall-clock time from submitting to deploying a CL at time T. On the other hand, we did not find any evidence to support Hypothesis PaQ; changes in satisfaction with project code quality in time T were not associated with any of the productivity metrics in time T-1. See the Appendix for a table containing this data and descriptions of each variable. Therefore, returning to our research question, we conclude that changes in satisfaction with project code quality cause changes in perceived overall productivity.

8 DISCUSSION

Our findings provide practical guidance for organizations trying to improve individual developer productivity by providing a list of amenable factors that are causally linked to productivity. Specifically, our panel analysis shows that these factors are: code quality, technical debt, infrastructure tools and support, team communication, goals and priorities, and organizational change and process.

Our quadrant chart shown in Figure 2, which we originally created for an executive stakeholder audience within Google, allows practitioners to choose highly impactful productivity factors to act on. Factors at the top of the chart are those with high statistical significance (and low standard error), so practitioners can read those as the most consistent productivity factors. Factors on the right are the ones with the largest standardized effect size, so these supply the "biggest bang for the buck". Taken together, the factors in the upper right quadrant are the ones most promising to improve productivity at Google. For instance, giving teams time to improve code quality, reduce technical debt, and stabilize priorities would be good candidate initiatives for improving individual developer productivity.


We found that several factors did not have a statistically signifi- • EngSat results helped motivate two code quality conferences
cant relationship with perceived productivity, notably: for Google engineers with 4,000 internal attendees and more
• For documentation, perceived productivity was not causally than 15,000 views of live and on-demand talks.
linked to reported poor or missing documentation (doc. hin- • The research motivated the creation of two initiatives ś a
drance) or the frequency of documentation meeting needs Technical Debt Maturity Model (akin to the Capability Ma-
(doc. support). This is surprising, given that GitHub’s 2017 turity Model [38]) and Technical Debt Management Frame-
survey of 5,500 developers found that łincomplete or confus- work ś to help teams improve technical debt assessment and
ing documentationž was the most commonly encountered management.
problem in open source [18]. GitHub’s findings are consis- • Several teams and organizations set Objectives and Key Re-
tent with findings at Microsoft [47] and at Googleś EngSat sults (OKRs) [12] to improve technical debt in their work-
respondents often report łpoor or missing documentationž groups.
as one of the top three hindrances to their own productiv- • Google introduced łThe Healthysž, an award where teams
ity. However, the results in this paper suggest that there submit a two page explanation of a code quality improvement
is no causal relationship between developer productivity initiative they’ve performed. Using an academic reviewing
and documentation, despite developers’ reports that it is model, outside engineers evaluated the impact of nearly 350
important to them. One way to explain this finding is that submissions across the company. Accomplishments include
documentation may not impact productivity, but it may yet more than a million lines of code deleted. In a survey sent to
have other positive benefits, such as to łcreate inclusive award recipients, of 173 respondents, most respondents re-
communitiesž [18]. ported that they mentioned the award in the self-evaluation
• For meetings, we found that perceived productivity was portion of their performance evaluation (82%) and that there
not causally linked to time spent on either incoming meet- was at least a slight improvement in how code health work
ings(p50 meeting time, p90 meeting time) or all types of meet- is viewed by their team (68%) and management (60%).
ings (total meeting time). This is also surprising, given that Although difficult to ascribe specifically to this research and the
prior research found in a survey of Microsoft engineers above initiatives that it has influenced, EngSat has revealed several
that meetings were the most unproductive activity for engi- encouraging trends between when the report was released inter-
neers [34]. The contradictory results could be explained by nally in the second quarter of 2019 and the first quarter of 2021: The
differences between the studies: our panel analysis enables proportion of engineers feeling łnot at all hinderedž by technical
causal reasoning (vs correlational), more engineers were rep- debt has increased by 27%. The proportion of engineers feeling sat-
resented in our dataset (2139 vs 379), and we used objective isfied with code quality has increased by about 22%. The proportion
meeting data from engineers’ calendars (vs. self-reports). of engineers feeling highly productive at work has increased by
• For physical and organizational distances, perceived produc- about 18%.
tivity was not causally linked to physical distance from direct
manager (distance from manager), or physical (p50 review 9 CONCLUSION
physical distance, p90 review physical distance) or organiza- Prior research has made significant progress in improving our un-
tional distances from code reviewers(p50 review org distance, derstanding of what correlates with developer productivity. In this
p90 review org distance). This is in contrast to Ramasubbu and paper, we’ve advanced that research by leveraging time series data
colleagues’ cross-sectional study, which found that łas firms to run panel analyses, enabling stronger causal inference than was
distribute their software development across longer distance possible in prior studies. Our panel analysis suggests that code
(and time zones) they benefit from improved project level quality, technical debt, infrastructure tools and support, team com-
productivityž [41]. As with the prior differences, explana- munication, goals and priorities, and organizational change and
tory factors may include differences in organization and a process are causally linked to developer productivity at Google.
methodology: individual productivity versus organizational Furthermore, our lagged panel analysis provides evidence that im-
productivity, single company versus multiple companies, provements in code quality cause improvements in individual pro-
and panel versus cross-sectional analysis. ductivity. While our analysis is imperfect ś in particular, it is only
As we mentioned, a threat to these results is the threat of reverse one company and uses limited measurements ś it nonetheless can
causality ś the statistics do not tell us whether each factor causes help engineering organizations make informed decisions about
productivity changes or vice versa. We mitigated this threat for improving individual developer productivity.
code quality using lagged panel analysis, providing compelling
evidence that high code quality increases individual developers’ ACKNOWLEDGMENT
productivity. Thanks to Google employees for contributing their EngSat and logs
Within Google, our results have driven organizational change around code quality and technical debt as a way to improve developer productivity:
• Since its creation in May 2019, a version of this report has been viewed by more than 1000 unique Google employees and has attracted more than 500 comments.
• Recipients of a code health award reported that they mentioned the award in the self-evaluation portion of their performance evaluation (82%) and that there was at least a slight improvement in how code health work is viewed by their team (68%) and management (60%).
Although it is difficult to ascribe these changes specifically to this research and the initiatives it has influenced, EngSat has revealed several encouraging trends between the report's internal release in the second quarter of 2019 and the first quarter of 2021: the proportion of engineers feeling "not at all hindered" by technical debt has increased by 27%, the proportion of engineers satisfied with code quality has increased by about 22%, and the proportion of engineers feeling highly productive at work has increased by about 18%.

9 CONCLUSION
Prior research has made significant progress in improving our understanding of what correlates with developer productivity. In this paper, we've advanced that research by leveraging time series data to run panel analyses, enabling stronger causal inference than was possible in prior studies. Our panel analysis suggests that code quality, technical debt, infrastructure tools and support, team communication, goals and priorities, and organizational change and process are causally linked to developer productivity at Google. Furthermore, our lagged panel analysis provides evidence that improvements in code quality cause improvements in individual productivity. While our analysis is imperfect (in particular, it covers only one company and relies on limited measurements), it can nonetheless help engineering organizations make informed decisions about improving individual developer productivity.

ACKNOWLEDGMENT
Thanks to Google employees for contributing their EngSat and logs data to this study, as well as to the teams responsible for building the infrastructure we leverage in this paper. Thanks in particular to Adam Brown, Michael Brundage, Yuangfang Cai, Alison Chang, Sarah D'Angelo, Daniel Dressler, Ben Holtz, Matt Jorde, Kurt Kluever, Justin Purl, Gina Roldan, Alvaro Sanchez Canudas, Jason Schwarz, Simone Styr, Fred Wiesinger, and anonymous reviewers.


REFERENCES
[1] Joshua D. Angrist and Jörn-Steffen Pischke. 2008. Mostly harmless econometrics: An empiricist's companion. Princeton University Press.
[2] Phillip Armour. 2003. The Reorg Cycle. Commun. ACM 46, 2 (2003), 19.
[3] Terese Besker, Hadi Ghanbari, Antonio Martini, and Jan Bosch. 2020. The influence of Technical Debt on software developer morale. Journal of Systems and Software 167 (2020), 110586. https://doi.org/10.1016/j.jss.2020.110586
[4] Terese Besker, Antonio Martini, and Jan Bosch. 2019. Software developer productivity loss due to technical debt – a replication and extension study examining developers' development work. Journal of Systems and Software 156 (2019), 41–61. https://doi.org/10.1016/j.jss.2019.06.004
[5] Larissa Braz, Enrico Fregnan, Gül Çalikli, and Alberto Bacchelli. 2021. Why Don't Developers Detect Improper Input Validation? '; DROP TABLE Papers; --. In International Conference on Software Engineering. IEEE, 499–511. https://doi.org/10.1109/ICSE43902.2021.00054
[6] K.H. Brodersen, F. Gallusser, J. Koehler, N. Remy, and S.L. Scott. 2015. Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics 9, 1 (2015), 247–274. https://doi.org/10.1214/14-AOAS788
[7] Kevin D Carlson and Andrew O Herdman. 2012. Understanding the impact of convergent validity on research results. Organizational Research Methods 15, 1 (2012), 17–32. https://doi.org/10.1177/1094428110392383
[8] Nancy Cartwright. 2007. Are RCTs the gold standard? BioSocieties 2, 1 (2007), 11–20. https://doi.org/10.1017/S1745855207005029
[9] Prodromos D. Chatzoglou and Linda A. Macaulay. 1997. The importance of human factors in planning the requirements capture stage of a project. International Journal of Project Management 15, 1 (1997), 39–53. https://doi.org/10.1016/S0263-7863(96)00038-5
[10] Bradford Clark, Sunita Devnani-Chulani, and Barry Boehm. 1998. Calibrating the COCOMO II post-architecture model. In Proceedings of the International Conference on Software Engineering. IEEE, 477–48. https://doi.org/10.1109/ICSE.1998.671610
[11] Tom DeMarco and Tim Lister. 2013. Peopleware: productive projects and teams. Addison-Wesley.
[12] John Doerr. 2018. Measure what matters: How Google, Bono, and the Gates Foundation rock the world with OKRs. Penguin.
[13] Davide Falessi, Natalia Juristo, Claes Wohlin, Burak Turhan, Jürgen Münch, Andreas Jedlitschka, and Markku Oivo. 2018. Empirical software engineering experts on the use of students and professionals in experiments. Empirical Software Engineering 23, 1 (2018), 452–489. https://doi.org/10.1007/s10664-017-9523-3
[14] Petra Filkuková and Magne Jørgensen. 2020. How to pose for a professional photo: The effect of three facial expressions on perception of competence of a software developer. Australian Journal of Psychology 72, 3 (2020), 257–266. https://doi.org/10.1111/ajpy.12285
[15] Denae Ford, Margaret-Anne Storey, Thomas Zimmermann, Christian Bird, Sonia Jaffe, Chandra Maddila, Jenna L. Butler, Brian Houck, and Nachiappan Nagappan. 2021. A Tale of Two Cities: Software Developers Working from Home during the COVID-19 Pandemic. ACM Trans. Softw. Eng. Methodol. 31, 2, Article 27 (Dec 2021), 37 pages. https://doi.org/10.1145/3487567
[16] Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. 2021. The SPACE of Developer Productivity: There's more to it than you think. Queue 19, 1 (2021), 20–48. https://doi.org/10.1145/3454122.3454124
[17] Geoffrey K. Gill and Chris F. Kemerer. 1991. Cyclomatic complexity density and software maintenance productivity. IEEE Transactions on Software Engineering 17, 12 (1991), 1284. https://doi.org/10.1109/32.106988
[18] GitHub. 2017. Open Source Survey. https://opensourcesurvey.org/2017/
[19] Ann-Louise Glasberg, Astrid Norberg, and Anna Söderberg. 2007. Sources of burnout among healthcare employees as perceived by managers. Journal of Advanced Nursing 60, 1 (2007), 10–19. https://doi.org/10.1111/j.1365-2648.2007.04370.x
[20] C. W. Granger. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society (1969), 424–438. https://doi.org/10.2307/1912791
[21] Shenyang Guo and Mark W. Fraser. 2014. Propensity score analysis: Statistical methods and applications. Vol. 11. SAGE Publications.
[22] Joseph F Hair, Jeffrey J Risher, Marko Sarstedt, and Christian M Ringle. 2019. When to use and how to report the results of PLS-SEM. European Business Review (2019). https://doi.org/10.1108/EBR-11-2018-0203
[23] Tomislav Hernaus and Josip Mikulić. 2014. Work characteristics and work performance of knowledge workers. EuroMed Journal of Business (2014). https://doi.org/10.1108/EMJB-11-2013-0054
[24] Cheng Hsiao. 2007. Panel data analysis – advantages and challenges. TEST 16, 1 (2007), 1–22. https://doi.org/10.1007/s11749-007-0046-x
[25] Cheng Hsiao. 2022. Analysis of panel data. Cambridge University Press.
[26] Mazhar Islam, Jacob Miller, and Haemin Dennis Park. 2017. But what will it cost me? How do private costs of participation affect open source software projects? Research Policy 46, 6 (2017), 1062–1070. https://doi.org/10.1016/j.respol.2017.05.005
[27] Ciera Jaspan, Matt Jorde, Carolyn Egelman, Collin Green, Ben Holtz, Edward Smith, Maggie Hodges, Andrea Knight, Liz Kammer, Jill Dicker, et al. 2020. Enabling the Study of Software Development Behavior With Cross-Tool Logs. IEEE Software 37, 6 (2020), 44–51. https://doi.org/10.1109/MS.2020.3014573
[28] Yared H. Kidane and Peter A. Gloor. 2007. Correlating temporal communication patterns of the Eclipse open source community with performance and creativity. Computational and Mathematical Organization Theory 13, 1 (2007), 17–27. https://doi.org/10.1007/s10588-006-9006-3
[29] Amy J. Ko. 2019. Individual, Team, Organization, and Market: Four Lenses of Productivity. In Rethinking Productivity in Software Engineering. Springer, 49–55. https://doi.org/10.1007/978-1-4842-4221-6_6
[30] Amy J. Ko and Brad A. Myers. 2008. Debugging Reinvented: Asking and Answering Why and Why Not Questions about Program Behavior. In Proceedings of the 30th International Conference on Software Engineering (ICSE '08). Association for Computing Machinery, New York, NY, USA, 301–310. https://doi.org/10.1145/1368088.1368130
[31] Max Lillack, Stefan Stanciulescu, Wilhelm Hedman, Thorsten Berger, and Andrzej Wąsowski. 2019. Intention-Based Integration of Software Variants. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 831–842. https://doi.org/10.1109/ICSE.2019.00090
[32] William Martin, Federica Sarro, and Mark Harman. 2016. Causal impact analysis for app releases in Google Play. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 435–446. https://doi.org/10.1145/2950290.2950320
[33] Katrina D. Maxwell, Luk Van Wassenhove, and Soumitra Dutta. 1996. Software development productivity of European space, military, and industrial applications. IEEE Transactions on Software Engineering 22, 10 (1996), 706–718. https://doi.org/10.1109/32.544349
[34] André N Meyer, Thomas Fritz, Gail C Murphy, and Thomas Zimmermann. 2014. Software developers' perceptions of productivity. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 19–29. https://doi.org/10.1145/2635868.2635892
[35] Emerson Murphy-Hill and Andrew P. Black. 2008. Breaking the Barriers to Successful Refactoring: Observations and Tools for Extract Method. In Proceedings of the 30th International Conference on Software Engineering (ICSE '08). Association for Computing Machinery, New York, NY, USA, 421–430. https://doi.org/10.1145/1368088.1368146
[36] Emerson Murphy-Hill, Ciera Jaspan, Caitlin Sadowski, David Shepherd, Michael Phillips, Collin Winter, Andrea Knight, Edward Smith, and Matthew Jorde. 2021. What Predicts Software Developers' Productivity? IEEE Transactions on Software Engineering 47, 3 (2021), 582–594. https://doi.org/10.1109/TSE.2019.2900308
[37] Edson Oliveira, Eduardo Fernandes, Igor Steinmacher, Marco Cristo, Tayana Conte, and Alessandro Garcia. 2020. Code and commit metrics of developer productivity: a study on team leaders perceptions. Empirical Software Engineering 25, 4 (2020), 2519–2549. https://doi.org/10.1007/s10664-020-09820-z
[38] Mark C. Paulk, Bill Curtis, Mary Beth Chrissis, and Charles V. Weber. 1993. Capability maturity model, version 1.1. IEEE Software 10, 4 (1993), 18–27. https://doi.org/10.1109/52.219617
[39] Kai Petersen. 2011. Measuring and predicting software productivity: A systematic map and review. Information and Software Technology 53, 4 (2011), 317–343. https://doi.org/10.1016/j.infsof.2010.12.001 Special section: Software Engineering track of the 24th Annual Symposium on Applied Computing.
[40] Huilian Sophie Qiu, Alexander Nolte, Anita Brown, Alexander Serebrenik, and Bogdan Vasilescu. 2019. Going Farther Together: The Impact of Social Capital on Sustained Participation in Open Source. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 688–699. https://doi.org/10.1109/ICSE.2019.00078
[41] Narayan Ramasubbu, Marcelo Cataldo, Rajesh Krishna Balan, and James D. Herbsleb. 2011. Configuring global software teams: a multi-company analysis of project productivity, quality, and profits. In 2011 33rd International Conference on Software Engineering (ICSE). 261–270. https://doi.org/10.1145/1985793.1985830
[42] Simone Romano, Davide Fucci, Maria Teresa Baldassarre, Danilo Caivano, and Giuseppe Scanniello. 2019. An empirical assessment on affective reactions of novice developers when applying test-driven development. In International Conference on Product-Focused Software Process Improvement. Springer, 3–19. https://doi.org/10.1007/978-3-030-35333-9_1
[43] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. Modern code review: a case study at Google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice. 181–190. https://doi.org/10.1145/3183519.3183525
[44] Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. 2015. Are students representatives of professionals in software engineering experiments?. In International Conference on Software Engineering, Vol. 1. IEEE, 666–676. https://doi.org/10.1109/ICSE.2015.82
[45] Andrea Schankin, Annika Berger, Daniel V. Holt, Johannes C. Hofmeister, Till Riedel, and Michael Beigl. 2018. Descriptive Compound Identifier Names Improve Source Code Comprehension. In Proceedings of the 26th Conference on Program Comprehension (ICPC '18). Association for Computing Machinery, New York, NY, USA, 31–40. https://doi.org/10.1145/3196321.3196332
[46] Dag I.K. Sjoberg, Bente Anda, Erik Arisholm, Tore Dyba, Magne Jorgensen, Amela Karahasanovic, Espen Frimann Koren, and Marek Vokác. 2002. Conducting realistic experiments in software engineering. In Proceedings International Symposium on Empirical Software Engineering. 17–26. https://doi.org/10.1109/ISESE.2002.1166921
[47] Margaret-Anne Storey, Thomas Zimmermann, Christian Bird, Jacek Czerwonka, Brendan Murphy, and Eirini Kalliamvakou. 2021. Towards a Theory of Software Developer Job Satisfaction and Perceived Productivity. IEEE Transactions on Software Engineering 47, 10 (2021), 2125–2142. https://doi.org/10.1109/TSE.2019.2944354
[48] The Standish Group. 1995. The CHAOS report.
[49] Ayse Tosun, Oscar Dieste, Davide Fucci, Sira Vegas, Burak Turhan, Hakan Erdogmus, Adrian Santos, Markku Oivo, Kimmo Toro, Janne Jarvinen, and Natalia Juristo. 2017. An industry experiment on the effects of test-driven development on external quality and productivity. Empirical Software Engineering 22, 6 (2017), 2763–2805. https://doi.org/10.1007/s10664-016-9490-0
[50] Stefan Wagner and Florian Deissenboeck. 2019. Defining Productivity in Software Engineering. In Rethinking Productivity in Software Engineering, Caitlin Sadowski and Thomas Zimmermann (Eds.). Apress, Berkeley, CA, 29–38. https://doi.org/10.1007/978-1-4842-4221-6_4
[51] Zhendong Wang, Yi Wang, and David Redmiles. 2018. Competence-confidence gap: A threat to female developers' contribution on Github. In 2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS). IEEE, 81–90. https://doi.org/10.1145/3183428.3183437
