What Improves Developer Productivity at Google? Code Quality

Lan Cheng, Emerson Murphy-Hill, Mark Canning
Google, United States
[email protected], [email protected], [email protected]

ABSTRACT

Understanding what affects software developer productivity can help organizations choose wise investments in their technical and social environment. But the research literature either focuses on what correlates with developer productivity in ecologically valid settings or focuses on what causes developer productivity in highly constrained settings. In this paper, we bridge the gap by studying software developers at Google through two analyses. In the first analysis, we use panel data with 39 productivity factors, finding that code quality, technical debt, infrastructure tools and support, team communication, goals and priorities, and organizational change and process are all causally linked to self-reported developer productivity. In the second analysis, we use a lagged panel analysis to strengthen our causal claims. We find that increases in perceived code quality tend to be followed by increased perceived developer productivity, but not vice versa, providing the strongest evidence to date that code quality affects individual developer productivity.

CCS CONCEPTS

• Software and its engineering → Software creation and management; Software development process management; • General and reference → Empirical studies.

ACM Reference Format:
In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '22), November 14–18, 2022, Singapore, Singapore. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3540250.3558940

1 INTRODUCTION

Organizations want to maximize software engineering productivity so that they can make the best software in the shortest amount of time. While software engineering productivity "is essential for numerous enterprises and organizations in most domains" [50] and can be examined through multiple lenses [29], understanding the productivity of individual software developers can be especially fruitful because it has the potential to be improved through many actions (e.g. from tooling to process changes) and by many stakeholders (from individual developers to executives). However, it is difficult to know which actions will truly improve productivity in an ecologically valid setting, that is, in a way that accurately characterizes productivity in a realistic software development context. This motivates our research question:

RQ: What causes improvements to developer productivity in practice?
such as whether some unmeasured third variable causes both high productivity and high job enthusiasm.

More broadly, these examples illustrate the fundamental limitations of prior approaches to understanding developer productivity. On one hand, software engineering research that uses controlled experiments can help show with a high degree of certainty that some practices and tools increase productivity, yet such experiments are by definition highly controlled, leaving organizations to wonder whether the results obtained in the controlled environment will also apply in their more realistic, messy environment. On the other hand, research that uses field studies – often with cross-sectional data from surveys or telemetry – can produce contextually valid observations about productivity, but drawing causal inferences from field studies that rival those drawn from experiments is challenging.

Our study builds on the existing literature about developer productivity, contributing the first study to draw strong causal conclusions in an ecologically valid context about what affects individual developer productivity.

2 MOTIVATION

The paper's main technical contribution – the ability to draw stronger causal inferences about productivity drivers than in prior work – is enabled by the use of the panel data analysis technique [25]. In this section, we motivate the technique with a running example.

Much of the prior work on developer productivity (Section 3) relies on cross-sectional data. To illustrate the limitations of cross-sectional data, let us introduce an example. Consider a survey that asks about respondents' productivity and the quality of their codebase. The survey is distributed at a large company, and two developers respond, Aruj and Rusla. Let's assume their responses are representative of the developer population. Their survey responses are shown in Table 1.

Table 1: Hypothetical cross-sectional survey response data.

         Productivity Rating     Code Quality Rating
Aruj     Somewhat productive     Medium quality
Rusla    Extremely productive    Extremely high quality

From this survey, we see that productivity correlates with code quality. But we cannot confidently say that high code quality causes high developer productivity, due in part to the following confounding explanations [1]:

• Time-invariant effects. These are effects that have the same influence over time. For example, if Rusla went to college and Aruj did not, from cross-sectional data, we cannot distinguish between the effect of college and the effect of code quality on productivity.
• Respondent-independent time effects. These are effects that influence all respondents uniformly, such as seasonal effects or company-wide initiatives. For example, prior to the survey, perhaps all engineers were given their annual bonus, artificially raising everyone's productivity.
• Non-differentiated response effects. These are effects where respondents will give the same or similar responses to every survey question, sometimes known as straightlining. For example, perhaps Aruj tends to choose the middle option to every question and Rusla tends to answer the highest option for every question.

We use panel analysis to address these confounds, enabling stronger causal inference than what can be obtained from cross-sectional data [25]. The power of panel data is that it uses data collected at multiple points in time from the same individuals, examining how measurements change over time.

To illustrate how panel data enables stronger causal inference, let us return to the running example. Suppose we run the survey again, three months later, and obtain the data shown in Table 2. One interesting observation is that if we analyze Table 2 in isolation, we notice that there's not a correlation between productivity and code quality – both respondents report the same code quality, regardless of their productivity. But more importantly, looking at the changes in responses from Table 1 and Table 2, we see productivity changes are now correlated: Aruj's increasing productivity correlates with increasing code quality, and Rusla's decreasing productivity correlates with decreasing code quality.

Table 2: More hypothetical survey responses, collected 3 months after the data in Table 1. Plusses (+) and minuses (–) indicate the direction of the change since the prior survey.

         Productivity Rating        Code Quality Rating
Aruj     Highly productive (+)      High Quality (+)
Rusla    Somewhat productive (–)    High Quality (–)

Panel analysis rules out the three confounding explanations present in the cross-sectional analysis:

• In the cross-sectional analysis, we could not determine if Rusla's high productivity was driven by her college education. But in this panel analysis, we can rule out that explanation, because college is a time-invariant exposure – it theoretically would have the same effect on her productivity in the first survey as in the second survey. This ability to rule out other potential causes that are time invariant exists whether or not the researcher can observe those potential causes. While with cross-sectional analysis researchers may be able to control for these potential causes using control variables, researchers have to anticipate and measure those control variables during analysis. This is unnecessary with panel analysis because time-invariant factors are eliminated by design.
• In the cross-sectional analysis, we could not determine if productivity was driven by a recent annual bonus. This explanation is ruled out in the panel analysis. If the annual bonus had an effect, the change in productivity scores across both participants would be uniform.
• In the cross-sectional analysis, we could not determine if respondents were just choosing similar answers to every question. This explanation is also ruled out with panel analysis. If respondents were choosing similar answers, there would be no change in productivity scores, and thus we would not see a correlation among the changes.
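To make the differencing logic concrete, here is a minimal sketch in Python – our own illustration on simulated data, not the paper's analysis code. It simulates a two-wave panel in which a time-invariant trait confounds a cross-sectional regression, and shows that regressing changes on changes recovers the true effect of code quality:

```python
# Illustrative sketch of why first-differencing panel data removes
# time-invariant confounds; simulated data, not the paper's dataset.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
trait = rng.normal(size=n)           # time-invariant confounder, e.g., education
q1 = trait + rng.normal(size=n)      # wave-1 code quality rating
q2 = trait + rng.normal(size=n)      # wave-2 code quality rating
p1 = 0.5 * q1 + 2.0 * trait + rng.normal(size=n)  # true quality effect = 0.5
p2 = 0.5 * q2 + 2.0 * trait + rng.normal(size=n)

# Cross-sectional regression of wave-1 productivity on wave-1 quality is
# biased upward, because the trait raises both variables.
cross_slope = np.polyfit(q1, p1, 1)[0]          # comes out near 1.5, not 0.5

# First-differencing cancels the trait: it appears in neither (p2 - p1)
# nor (q2 - q1), so regressing changes on changes recovers the effect.
fd_slope = np.polyfit(q2 - q1, p2 - p1, 1)[0]   # comes out near 0.5

print(f"cross-sectional slope:  {cross_slope:.2f}")
print(f"first-difference slope: {fd_slope:.2f}")
```

The other two confounds behave as the bullets above describe: a uniform bonus shifts every respondent's difference equally and is absorbed by the regression intercept, and a straightlining respondent contributes zero change on both sides.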
The ability of panel analyses to draw relatively strong causal inferences makes it a quasi-experimental method, combining some of the advantages of experiments with those of field studies [24].

project and thus increases the number of participants, but decreases the average level of contribution by individual participants" [26]. Like these papers, we use panel data to make causal inferences, but in our case, the inferences are about developer productivity.
written in the 1980s, Gill and Kemerer found that code complexity correlates with software maintenance productivity [17]. Based on interviews and surveys with professional software developers, Besker and colleagues found that technical debt correlates negatively with developer productivity [3, 4].

We measured code quality and technical debt with 5 subjective factors from our survey:

• Code Quality Satisfaction (sat. with project code quality, sat. with dependency code quality)
• Code Technical Debt (project tech debt)
• Dependency Technical Debt (dependency tech debt)
• Technical Debt Hindrance (tech debt hindrance)

4.3.2 Infrastructure Tools & Support. The next category of potential drivers of productivity covers issues relating to tools and infrastructure. Prior work showed that using "the best tools and practices" was the strongest correlate of individual productivity at Google, though not a significant correlate at two other companies [36]. Storey and colleagues also found that Microsoft developers' processes and tools correlated with individual productivity [47].

This category had 6 objective and 12 subjective measures:

• Tools, infrastructure, and service satisfaction (sat. with infra & tools)
• Tools and infrastructure choice (choices of infra & tools)
• Tools and infrastructure innovativeness (innovation of infra & tools)
• Tools and infrastructure ease (ease of infra & tools)
• Tools and infrastructure frustration (frustration of infra & tools)
• Developer stack change (change of tool stack)
• Internal documentation support (doc. support)
• Internal documentation hindrance (doc. hindrance)
• Build & test cycle hindrance (build & test cycle hindrance)
• Build latency satisfaction (sat. with build latency)
• 50th and 90th percentile of build duration (p50 build time, p90 build time)
• % of long builds per week (% of long builds)
• 50th and 90th percentile of test duration (p50 test time, p90 test time)
• % of long tests per week (% of long tests)
• Learning hindrance (learning hindrance)
• Migration hindrance (migration hindrance)

4.3.3 Team Communication. The next category of drivers of productivity covers issues relating to team communication. In a survey of knowledge workers, Hernaus and Mikulić found that social job characteristics (e.g. group cooperation) correlated with contextual job performance [23]. More specifically, in software engineering, Chatzoglou and Macaulay interviewed software developers, finding that most believed that communication among team members was very important to project success [9]. Studying communication networks quantitatively, Kidane and Gloor found that in the Eclipse project, a higher frequency of communication between developers correlated positively with performance and creativity [28].

To measure team communication in our study, we examined 9 objective measures and 1 subjective measure:

• 50th and 90th percentile of rounds of code review (p50 code review rounds, p90 code review rounds)
• 50th and 90th percentile of total wait time of code review (p50 code review wait time, p90 code review wait time)
• 50th and 90th percentile of code reviewers' organizational distances from the author (p50 review org distance, p90 review org distance)
• 50th and 90th percentile of code reviewers' physical distances from the author (p50 review physical distance, p90 review physical distance)
• Physical distance from direct manager (distance from manager)
• Code review hindrance (slow code review)

4.3.4 Goals and Priorities. Prior research suggests that changing goals and priorities correlate with software engineering outcomes. Surveying 365 software developers, The Standish Group found that changing requirements was a common stated reason for project failure [48]. Meyer and colleagues found that one of the top 5 most commonly mentioned reasons for a productive workday was having clear goals and requirements [34].

We measure this category with 1 subjective measure:

• Priority shift (priority shift)

4.3.5 Interruptions. Meyer and colleagues found that two of the top five most commonly mentioned reasons for a productive workday by 379 software developers were having no meetings and few interruptions [34]. Similarly, a prior survey of Google engineers showed that lack of interruptions and efficient meetings correlated with personal productivity, as did use of personal judgment [36].

We measure this category with 3 objective measures:

• 50th and 90th percentile of total time spent on incoming meetings per week (p50 meeting time, p90 meeting time)
• Total time spent on any meetings per week (total meeting time)

4.3.6 Organizational and Process Factors. Finally, outside of software engineering, organizational and process factors correlate with a variety of work outcomes. For example, according to healthcare industry managers, reorganizations can result in workers' sense of powerlessness, inadequacy, and burnout [19]. Although not well studied in software engineering, based on personal experience, DeMarco and Lister [11] and Armour [2] point to bureaucracy and reorganizations as leading to poor software engineering outcomes.

This category had 2 subjective and 3 objective measures:

• Process hindrance (complicated processes)
• Organizational hindrance (team & org change)
• Number of times when an engineer's direct manager changes but colleagues do not change (reorg direct manager change)
• Number of times when both an engineer's direct manager and colleagues change simultaneously (non-reorg direct manager change)
• Number of different primary teams the engineer has (primary team change)
hindered your productivity?". As an example of ambiguity, several questions ask about engineers' experiences with the project they work on, but respondents interpret for themselves what a "project" is and, if they work on multiple projects, which one to report on.

4.7.3 Internal. As we argue in this paper, our use of panel analysis helps draw stronger causal inferences than those that can be drawn from cross-sectional data. However, the most significant caveat to our ability to draw causal inferences is time variant effects. In contrast to time invariant effects (e.g., prior education and demographics), time variant effects may vary over the study period. For instance, in our running example, if Aruj lost a mentor and Rusla gained a mentor between the two surveys, our analysis could not rule out mentorship as a cause of increased productivity or code quality. Thus, our analysis assumes that effects on individual engineers are time invariant. Violations of this assumption threaten the internal validity of our study.

Another internal threat to the validity of our study is participants who chose not to answer some or all questions in the survey. While our analysis of non-response bias (Section 4.1.2) showed that two survey questions were robust to non-response among several dimensions like level and tenure, non-response is still a threat. For one, respondents and non-respondents might differ systematically on some unmeasured dimension, such as how frequently they get feedback from peers. Likewise, respondents who choose not to answer a question will be wholly excluded from our analysis, yet such participants might differ systematically from those who answered every question.

Another threat to internal validity is that we analyzed data for only two panels per engineer. More panels per engineer would increase the robustness of our results.

4.7.4 External. As the title of this paper suggests, our study was conducted only at Google, and the generalizability of our results beyond that context is limited. Google is a large, US-headquartered, multinational, and software-centric company where engineers work on largely server and mobile code, with uniform development tooling, and in a monolithic repository. Likewise, during the study period Google developers mostly worked from open offices, before the global COVID-19 pandemic when many developers shifted to remote or hybrid work. While results would vary if this study were replicated in other organizations, contexts that resemble ours are most likely to yield similar results.

5 PANEL ANALYSIS: RESULTS

5.1 Factors Causally Linked to Productivity

Panel data analysis suggested that 16 out of the 39 metrics have a statistically significant causal relationship with perceived overall productivity, as listed in Table 3. The overall adjusted R-squared value for the model was 0.1019. In Table 3, the effect size should be read as the percent change in the dependent variable associated with a given percent change in the independent variable. For instance, for code quality, a 100% change in project code quality (from "Very dissatisfied" to "Very satisfied" with quality) is associated with a 10.5% increase in self-reported productivity.

Table 3: Metrics' relationship with self-rated productivity.

Metric                               Effect size    p-value
Code Quality & Technical Debt
sat. with project code quality           0.105      <0.001
sat. with dependency code quality       -0.013       0.505
project tech debt                        0.078      <0.001
dependency tech debt                     0.042       0.012
tech debt hindrance                     -0.009       0.459
Infrastructure Tools & Support
sat. with infra & tools                  0.113      <0.001
choices of infra & tools                 0.020       0.083
innovation of infra & tools              0.106      <0.001
ease of infra & tools                   -0.018       0.352
frustration of infra & tools             0.002       0.952
change of tool stack                     0.019       0.098
doc. support                            -0.009       0.664
doc. hindrance                          -0.005       0.715
build and test cycle hindrance           0.029       0.064
sat. with build latency                  0.018       0.295
p90 build time                          -0.024       0.019
p90 test time                           -0.001       0.836
% of long builds                         0.028       0.599
learning hindrance                       0.038       0.006
migration hindrance                     -0.001       0.929
Team Communication
p50 code review rounds                   0.007       0.081
p90 code review rounds                  -0.014       0.058
p50 code review wait time               -0.0006      0.875
p90 code review wait time                0.0019      0.625
p50 review org distance                 -0.0008      0.424
p90 review org distance                 -0.0002      0.880
p50 review physical distance             0.0012      0.261
p90 review physical distance             0.0013      0.518
distance from manager                    0.001       0.209
slow code review                         0.051       0.004
Goals & Priorities
priority shift                           0.077      <0.001
Interruptions
p50 meeting time                         0.014       0.502
p90 meeting time                         0.008       0.701
total meeting time                      -0.009       0.692
Organizational Change and Process
complicated processes                    0.027       0.067
team and org change                      0.032       0.023
reorg direct manager changes            -0.002       0.086
non-reorg direct manager change         -0.002       0.525
primary team change                      0.014       0.086

To summarize Table 3:

• For code quality, we found that perceived productivity is causally related to satisfaction with project code quality but not causally related to satisfaction with code quality in dependencies. For technical debt, we found perceived productivity is causally related to perceived technical debt both within projects and in their dependencies.
• For infrastructure, several factors closely related to internal infrastructure and tools showed a significant causal relationship with perceived productivity:
  – Engineers who reported their tools and infrastructure were not innovative were more likely to report lower productivity.
  – Engineers who reported that the number of choices was either too few or too many were likely to report lower productivity. We further tested whether one of the two ("too few" or "too many") matters but not the other, by replacing this variable with two binary variables, one representing the case of "too few" choices and the other representing the case of "too many" choices (see the sketch after this list). The results suggest that both cases are causally related to perceived productivity.
  – Engineers who reported that the pace of changes in the developer tool stack was too fast or too slow were likely to report lower productivity. Similarly, we tested the two cases, "too fast" or "too slow", separately by replacing this variable with two binary variables, one representing the case of "too fast" and the other representing the case of "too slow". Results suggested both cases matter for perceived productivity.
  – Engineers who were hindered by learning a new platform, framework, technology, or infrastructure were likely to report lower productivity.
  – Engineers who had longer build times or reported being hindered by their build & test cycle were more likely to report lower productivity.
• For team communication, a metric related to code review was significantly causally related to perceived productivity. Engineers who had more rounds of reviews per code review or reported being hindered by slow code review processes were likely to report lower productivity.
• For goals and priorities, engineers hindered by shifting project priorities were likely to report lower productivity.
• Organizational factors were linked to perceived productivity:
  – Engineers who had more changes of direct managers were more likely to report lower productivity.
  – Engineers who reported being hindered for team and organizational reasons, or by complicated processes, were more likely to report lower productivity.
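The binary-variable test referenced in the list above can be sketched as follows; the synthetic responses and column names here are our own, not the survey's:

```python
# Sketch of splitting a two-sided response ("too few" / "about right" /
# "too many") into two dummies so each direction can be tested separately.
# Synthetic data for illustration only.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "productivity": [3, 4, 2, 4, 2, 5, 3, 4, 2, 3],
    "choices": ["too_few", "right", "too_many", "right", "too_few",
                "right", "too_many", "right", "too_few", "right"],
})
df["too_few"] = (df["choices"] == "too_few").astype(int)
df["too_many"] = (df["choices"] == "too_many").astype(int)

# "about right" is the omitted baseline; each coefficient estimates the
# productivity difference associated with one direction of mismatch.
fit = smf.ols("productivity ~ too_few + too_many", data=df).fit()
print(fit.params)
```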
5.2 Quadrant Chart

To visualize these factors in terms of their relative effect size and statistical significance, we plot them in a quadrant chart (Figure 2). The chart excludes factors whose p-value is greater than 0.1. The factors have various scales, from satisfaction scores to time durations, so to make their effect sizes comparable, we standardized metrics by subtracting from each data point the metric's mean and dividing by its standard deviation. The x axis is the absolute value of the standardized effect size. The y axis is p-values.

The top five factors in terms of relative effect size are satisfaction with project code quality, hindrance of shifting priorities, technical debt in projects, innovation of infrastructure and tools, and overall satisfaction with infrastructure and tools.
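Since Figure 2 is not reproduced in this excerpt, the following sketch shows how such a quadrant chart can be drawn; the standardize() step mirrors the z-scoring described above, while the plotted values are placeholders rather than the paper's data:

```python
# Sketch of the quadrant chart: |standardized effect size| on x, p-value
# on y. Values below are illustrative placeholders, not Figure 2's data.
import matplotlib.pyplot as plt
import numpy as np

def standardize(x):
    # Z-score a metric so different scales (ratings, seconds) are comparable.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

print(standardize([40, 95, 310]))  # e.g., build latencies become unitless

factors = {  # name: (|standardized effect|, p-value), placeholders
    "sat. with project code quality": (0.22, 0.0005),
    "priority shift": (0.18, 0.0008),
    "project tech debt": (0.17, 0.0009),
    "team and org change": (0.07, 0.023),
}

fig, ax = plt.subplots()
for name, (effect, p) in factors.items():
    ax.scatter(effect, p)
    ax.annotate(name, (effect, p), fontsize=8)
ax.axhline(0.05, linestyle="--")  # conventional significance threshold
ax.set_xlabel("|standardized effect size|")
ax.set_ylabel("p-value")
plt.show()
```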
6 LAGGED PANEL ANALYSIS: METHODS

The panel data analysis we conducted so far suggests satisfaction with code quality within projects is the strongest productivity factor among the 39 we studied, based on standardized effect size and p-value.

However, because the observed changes in factors coincided during the same time period, such conventional panel data analysis can tell us which factors are causally related to overall productivity, but it does not tell us the direction of the causality.

So, does better code quality cause increasing productivity, or does increasing productivity cause improved code quality? Both linkages are theoretically plausible: on one hand, code quality might increase productivity because higher code quality may make it easier and faster to add new features; on the other hand, high productivity might improve code quality because engineers have free time to spend on quality improvement.

To verify the direction of the causal relationship between project code quality and productivity, we conducted another panel data analysis using lagged panel data. In this analysis, we focus only on the causal relationship between code quality and productivity. Although such an analysis is possible for other factors, it is nonetheless laborious, as we shall see shortly. Thus, we focus our lagged analysis on only these two variables, which had the strongest causal relationship in our prior analysis.

In short, we verified the direction of the linkage between project code quality and productivity by checking if the change in one factor is associated with the change in the other factor in the following period. The idea is that if project code quality affects productivity, we expect to see that changes in project code quality during time T-1 are associated with changes in productivity during time T. Since self-reported productivity is not available for two consecutive quarters (each respondent is sampled only once every three quarters), we switch to logs-based metrics to measure productivity. Complementing our prior analysis based on self-ratings with a logs-based one has the additional benefit of increasing the robustness of our results.

More formally, we tested two competing hypotheses, Hypothesis QaP (Quality affects Productivity) and Hypothesis PaQ (Productivity affects Quality). Hypothesis QaP is that changes in project code quality during time T-1 are associated with changes in productivity during time T. This implies improvements in project code quality lead to better productivity. Hypothesis PaQ is that changes in productivity in time T-1 are associated with changes in project code quality in time T. This implies better productivity leads to an improvement in project code quality.

Hypothesis QaP: Changes in code quality during time T-1 are correlated with changes in productivity during time T. The statistical model is

$\Delta P_{it} = \alpha + \beta \Delta Q_{i,t-1} + \Delta\epsilon_{it}$    (4)

where $\Delta Q_{i,t-1}$ is the change in code quality at time t-1 and $\Delta P_{it}$ is the following change in logs-based productivity metrics at time t. Given the available data, we use the difference between Q3 2018 and Q2 2019 to measure $\Delta Q_{i,t-1}$ and the difference between Q3 2018 and Q3 2019 to measure $\Delta P_{it}$.
Hypothesis PaQ: Changes in productivity in time T-1 are correlated with changes in code quality in time T. The statistical model is

$\Delta Q_{it} = \alpha + \beta \Delta P_{i,t-1} + \Delta\epsilon_{it}$    (5)

we conclude that changes in satisfaction with project code quality cause changes in perceived overall productivity.
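A minimal sketch of the two competing lagged regressions, Eq. (4) and its mirror Eq. (5), on simulated differences; in this simulation quality changes genuinely drive later productivity changes, so only the QaP slope should come out sizable, and all variable names are our own:

```python
# Sketch of testing QaP vs. PaQ with lagged first differences. Simulated
# data, not the paper's dataset or pipeline.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
dq_lag = rng.normal(size=n)                 # ΔQ_{i,t-1}: lagged quality change
dp = 0.4 * dq_lag + rng.normal(size=n)      # ΔP_{i,t}: QaP holds by construction
dp_lag = rng.normal(size=n)                 # ΔP_{i,t-1}: lagged productivity change
dq = rng.normal(size=n)                     # ΔQ_{i,t}: independent of dp_lag

qap = sm.OLS(dp, sm.add_constant(dq_lag)).fit()  # Eq. (4)
paq = sm.OLS(dq, sm.add_constant(dp_lag)).fit()  # Eq. (5)

print(f"QaP slope {qap.params[1]:.2f}, p={qap.pvalues[1]:.1e}")  # near 0.4, significant
print(f"PaQ slope {paq.params[1]:.2f}, p={paq.pvalues[1]:.2f}")  # near 0, insignificant
```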
We found that several factors did not have a statistically significant relationship with perceived productivity, notably:

• For documentation, perceived productivity was not causally linked to reported poor or missing documentation (doc. hindrance) or the frequency of documentation meeting needs (doc. support). This is surprising, given that GitHub's 2017 survey of 5,500 developers found that "incomplete or confusing documentation" was the most commonly encountered problem in open source [18]. GitHub's findings are consistent with findings at Microsoft [47] and at Google – EngSat respondents often report "poor or missing documentation" as one of the top three hindrances to their own productivity. However, the results in this paper suggest that there is no causal relationship between developer productivity and documentation, despite developers' reports that it is important to them. One way to explain this finding is that documentation may not impact productivity, but it may yet have other positive benefits, such as to "create inclusive communities" [18].
• For meetings, we found that perceived productivity was not causally linked to time spent on either incoming meetings (p50 meeting time, p90 meeting time) or all types of meetings (total meeting time). This is also surprising, given that prior research found in a survey of Microsoft engineers that meetings were the most unproductive activity for engineers [34]. The contradictory results could be explained by differences between the studies: our panel analysis enables causal reasoning (vs. correlational), more engineers were represented in our dataset (2139 vs. 379), and we used objective meeting data from engineers' calendars (vs. self-reports).
• For physical and organizational distances, perceived productivity was not causally linked to physical distance from direct manager (distance from manager), or physical (p50 review physical distance, p90 review physical distance) or organizational distances from code reviewers (p50 review org distance, p90 review org distance). This is in contrast to Ramasubbu and colleagues' cross-sectional study, which found that "as firms distribute their software development across longer distance (and time zones) they benefit from improved project level productivity" [41]. As with the prior differences, explanatory factors may include differences in organization and methodology: individual productivity versus organizational productivity, single company versus multiple companies, and panel versus cross-sectional analysis.

As we mentioned, a threat to these results is the threat of reverse causality – the statistics do not tell us whether each factor causes productivity changes or vice versa. We mitigated this threat for code quality using lagged panel analysis, providing compelling evidence that high code quality increases individual developers' productivity.

Within Google, our results have driven organizational change around code quality and technical debt as a way to improve developer productivity:

• Since its creation in May 2019, a version of this report has been viewed by more than 1000 unique Google employees with more than 500 comments.
• EngSat results helped motivate two code quality conferences for Google engineers with 4,000 internal attendees and more than 15,000 views of live and on-demand talks.
• The research motivated the creation of two initiatives – a Technical Debt Maturity Model (akin to the Capability Maturity Model [38]) and a Technical Debt Management Framework – to help teams improve technical debt assessment and management.
• Several teams and organizations set Objectives and Key Results (OKRs) [12] to improve technical debt in their workgroups.
• Google introduced "The Healthys", an award where teams submit a two-page explanation of a code quality improvement initiative they've performed. Using an academic reviewing model, outside engineers evaluated the impact of nearly 350 submissions across the company. Accomplishments include more than a million lines of code deleted. In a survey sent to award recipients, of 173 respondents, most reported that they mentioned the award in the self-evaluation portion of their performance evaluation (82%) and that there was at least a slight improvement in how code health work is viewed by their team (68%) and management (60%).

Although difficult to ascribe specifically to this research and the above initiatives that it has influenced, EngSat has revealed several encouraging trends between when the report was released internally in the second quarter of 2019 and the first quarter of 2021: The proportion of engineers feeling "not at all hindered" by technical debt has increased by 27%. The proportion of engineers feeling satisfied with code quality has increased by about 22%. The proportion of engineers feeling highly productive at work has increased by about 18%.

9 CONCLUSION

Prior research has made significant progress in improving our understanding of what correlates with developer productivity. In this paper, we've advanced that research by leveraging time series data to run panel analyses, enabling stronger causal inference than was possible in prior studies. Our panel analysis suggests that code quality, technical debt, infrastructure tools and support, team communication, goals and priorities, and organizational change and process are causally linked to developer productivity at Google. Furthermore, our lagged panel analysis provides evidence that improvements in code quality cause improvements in individual productivity. While our analysis is imperfect – in particular, it covers only one company and uses limited measurements – it nonetheless can help engineering organizations make informed decisions about improving individual developer productivity.

ACKNOWLEDGMENT

Thanks to Google employees for contributing their EngSat and logs data to this study, as well as the teams responsible for building the infrastructure we leverage in this paper. Thanks in particular to Adam Brown, Michael Brundage, Yuangfang Cai, Alison Chang, Sarah D'Angelo, Daniel Dressler, Ben Holtz, Matt Jorde, Kurt Kluever, Justin Purl, Gina Roldan, Alvaro Sanchez Canudas, Jason Schwarz, Simone Styr, Fred Wiesinger, and anonymous reviewers.
REFERENCES

[1] Joshua D. Angrist and Jörn-Steffen Pischke. 2008. Mostly harmless econometrics: An empiricist's companion. Princeton University Press.
[2] Phillip Armour. 2003. The Reorg Cycle. Commun. ACM 46, 2 (2003), 19.
[3] Terese Besker, Hadi Ghanbari, Antonio Martini, and Jan Bosch. 2020. The influence of Technical Debt on software developer morale. Journal of Systems and Software 167 (2020), 110586. https://doi.org/10.1016/j.jss.2020.110586
[4] Terese Besker, Antonio Martini, and Jan Bosch. 2019. Software developer productivity loss due to technical debt—a replication and extension study examining developers' development work. Journal of Systems and Software 156 (2019), 41–61. https://doi.org/10.1016/j.jss.2019.06.004
[5] Larissa Braz, Enrico Fregnan, Gül Çalikli, and Alberto Bacchelli. 2021. Why Don't Developers Detect Improper Input Validation? '; DROP TABLE Papers; --. In International Conference on Software Engineering. IEEE, 499–511. https://doi.org/10.1109/ICSE43902.2021.00054
[6] K.H. Brodersen, F. Gallusser, J. Koehler, N. Remy, and S.L. Scott. 2015. Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics 9, 1 (2015), 247–274. https://doi.org/10.1214/14-AOAS788
[7] Kevin D Carlson and Andrew O Herdman. 2012. Understanding the impact of convergent validity on research results. Organizational Research Methods 15, 1 (2012), 17–32. https://doi.org/10.1177/1094428110392383
[8] Nancy Cartwright. 2007. Are RCTs the gold standard? BioSocieties 2, 1 (2007), 11–20. https://doi.org/10.1017/S1745855207005029
[9] Prodromos D. Chatzoglou and Linda A. Macaulay. 1997. The importance of human factors in planning the requirements capture stage of a project. International Journal of Project Management 15, 1 (1997), 39–53. https://doi.org/10.1016/S0263-7863(96)00038-5
[10] Bradford Clark, Sunita Devnani-Chulani, and Barry Boehm. 1998. Calibrating the COCOMO II post-architecture model. In Proceedings of the International Conference on Software Engineering. IEEE, 477–480. https://doi.org/10.1109/ICSE.1998.671610
[11] Tom DeMarco and Tim Lister. 2013. Peopleware: productive projects and teams. Addison-Wesley.
[12] John Doerr. 2018. Measure what matters: How Google, Bono, and the Gates Foundation rock the world with OKRs. Penguin.
[13] Davide Falessi, Natalia Juristo, Claes Wohlin, Burak Turhan, Jürgen Münch, Andreas Jedlitschka, and Markku Oivo. 2018. Empirical software engineering experts on the use of students and professionals in experiments. Empirical Software Engineering 23, 1 (2018), 452–489. https://doi.org/10.1007/s10664-017-9523-3
[14] Petra Filkuková and Magne Jørgensen. 2020. How to pose for a professional photo: The effect of three facial expressions on perception of competence of a software developer. Australian Journal of Psychology 72, 3 (2020), 257–266. https://doi.org/10.1111/ajpy.12285
[15] Denae Ford, Margaret-Anne Storey, Thomas Zimmermann, Christian Bird, Sonia Jaffe, Chandra Maddila, Jenna L. Butler, Brian Houck, and Nachiappan Nagappan. 2021. A Tale of Two Cities: Software Developers Working from Home during the COVID-19 Pandemic. ACM Trans. Softw. Eng. Methodol. 31, 2, Article 27 (Dec 2021), 37 pages. https://doi.org/10.1145/3487567
[16] Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. 2021. The SPACE of Developer Productivity: There's more to it than you think. Queue 19, 1 (2021), 20–48. https://doi.org/10.1145/3454122.3454124
[17] Geoffrey K. Gill and Chris F. Kemerer. 1991. Cyclomatic complexity density and software maintenance productivity. IEEE Transactions on Software Engineering 17, 12 (1991), 1284. https://doi.org/10.1109/32.106988
[18] GitHub. 2017. Open Source Survey. https://opensourcesurvey.org/2017/
[19] Ann-Louise Glasberg, Astrid Norberg, and Anna Söderberg. 2007. Sources of burnout among healthcare employees as perceived by managers. Journal of Advanced Nursing 60, 1 (2007), 10–19. https://doi.org/10.1111/j.1365-2648.2007.04370.x
[20] C. W. Granger. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society (1969), 424–438. https://doi.org/10.2307/1912791
[21] Shenyang Guo and Mark W. Fraser. 2014. Propensity score analysis: Statistical methods and applications. Vol. 11. SAGE Publications.
[22] Joseph F Hair, Jeffrey J Risher, Marko Sarstedt, and Christian M Ringle. 2019. When to use and how to report the results of PLS-SEM. European Business Review (2019). https://doi.org/10.1108/EBR-11-2018-0203
[23] Tomislav Hernaus and Josip Mikulić. 2014. Work characteristics and work performance of knowledge workers. EuroMed Journal of Business (2014). https://doi.org/10.1108/EMJB-11-2013-0054
[24] Cheng Hsiao. 2007. Panel data analysis—advantages and challenges. TEST 16, 1 (2007), 1–22. https://doi.org/10.1007/s11749-007-0046-x
[25] Cheng Hsiao. 2022. Analysis of panel data. Cambridge University Press.
[26] Mazhar Islam, Jacob Miller, and Haemin Dennis Park. 2017. But what will it cost me? How do private costs of participation affect open source software projects? Research Policy 46, 6 (2017), 1062–1070. https://doi.org/10.1016/j.respol.2017.05.005
[27] Ciera Jaspan, Matt Jorde, Carolyn Egelman, Collin Green, Ben Holtz, Edward Smith, Maggie Hodges, Andrea Knight, Liz Kammer, Jill Dicker, et al. 2020. Enabling the Study of Software Development Behavior With Cross-Tool Logs. IEEE Software 37, 6 (2020), 44–51. https://doi.org/10.1109/MS.2020.3014573
[28] Yared H. Kidane and Peter A. Gloor. 2007. Correlating temporal communication patterns of the Eclipse open source community with performance and creativity. Computational and Mathematical Organization Theory 13, 1 (2007), 17–27. https://doi.org/10.1007/s10588-006-9006-3
[29] Amy J. Ko. 2019. Individual, Team, Organization, and Market: Four Lenses of Productivity. In Rethinking Productivity in Software Engineering. Springer, 49–55. https://doi.org/10.1007/978-1-4842-4221-6_6
[30] Amy J. Ko and Brad A. Myers. 2008. Debugging Reinvented: Asking and Answering Why and Why Not Questions about Program Behavior. In Proceedings of the 30th International Conference on Software Engineering (ICSE '08). Association for Computing Machinery, New York, NY, USA, 301–310. https://doi.org/10.1145/1368088.1368130
[31] Max Lillack, Stefan Stanciulescu, Wilhelm Hedman, Thorsten Berger, and Andrzej Wąsowski. 2019. Intention-Based Integration of Software Variants. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 831–842. https://doi.org/10.1109/ICSE.2019.00090
[32] William Martin, Federica Sarro, and Mark Harman. 2016. Causal impact analysis for app releases in Google Play. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 435–446. https://doi.org/10.1145/2950290.2950320
[33] Katrina D. Maxwell, Luk Van Wassenhove, and Soumitra Dutta. 1996. Software development productivity of European space, military, and industrial applications. IEEE Transactions on Software Engineering 22, 10 (1996), 706–718. https://doi.org/10.1109/32.544349
[34] André N Meyer, Thomas Fritz, Gail C Murphy, and Thomas Zimmermann. 2014. Software developers' perceptions of productivity. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 19–29. https://doi.org/10.1145/2635868.2635892
[35] Emerson Murphy-Hill and Andrew P. Black. 2008. Breaking the Barriers to Successful Refactoring: Observations and Tools for Extract Method. In Proceedings of the 30th International Conference on Software Engineering (ICSE '08). Association for Computing Machinery, New York, NY, USA, 421–430. https://doi.org/10.1145/1368088.1368146
[36] Emerson Murphy-Hill, Ciera Jaspan, Caitlin Sadowski, David Shepherd, Michael Phillips, Collin Winter, Andrea Knight, Edward Smith, and Matthew Jorde. 2021. What Predicts Software Developers' Productivity? IEEE Transactions on Software Engineering 47, 3 (2021), 582–594. https://doi.org/10.1109/TSE.2019.2900308
[37] Edson Oliveira, Eduardo Fernandes, Igor Steinmacher, Marco Cristo, Tayana Conte, and Alessandro Garcia. 2020. Code and commit metrics of developer productivity: a study on team leaders perceptions. Empirical Software Engineering 25, 4 (2020), 2519–2549. https://doi.org/10.1007/s10664-020-09820-z
[38] Mark C. Paulk, Bill Curtis, Mary Beth Chrissis, and Charles V. Weber. 1993. Capability maturity model, version 1.1. IEEE Software 10, 4 (1993), 18–27. https://doi.org/10.1109/52.219617
[39] Kai Petersen. 2011. Measuring and predicting software productivity: A systematic map and review. Information and Software Technology 53, 4 (2011), 317–343. https://doi.org/10.1016/j.infsof.2010.12.001 Special section: Software Engineering track of the 24th Annual Symposium on Applied Computing.
[40] Huilian Sophie Qiu, Alexander Nolte, Anita Brown, Alexander Serebrenik, and Bogdan Vasilescu. 2019. Going Farther Together: The Impact of Social Capital on Sustained Participation in Open Source. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 688–699. https://doi.org/10.1109/ICSE.2019.00078
[41] Narayan Ramasubbu, Marcelo Cataldo, Rajesh Krishna Balan, and James D. Herbsleb. 2011. Configuring global software teams: a multi-company analysis of project productivity, quality, and profits. In 2011 33rd International Conference on Software Engineering (ICSE). 261–270. https://doi.org/10.1145/1985793.1985830
[42] Simone Romano, Davide Fucci, Maria Teresa Baldassarre, Danilo Caivano, and Giuseppe Scanniello. 2019. An empirical assessment on affective reactions of novice developers when applying test-driven development. In International Conference on Product-Focused Software Process Improvement. Springer, 3–19. https://doi.org/10.1007/978-3-030-35333-9_1
[43] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. Modern code review: a case study at Google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice. 181–190. https://doi.org/10.1145/3183519.3183525
[44] Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. 2015. Are students representatives of professionals in software engineering experiments? In International Conference on Software Engineering, Vol. 1. IEEE, 666–676. https://doi.org/10.1109/ICSE.2015.82
[45] Andrea Schankin, Annika Berger, Daniel V. Holt, Johannes C. Hofmeister, Till Riedel, and Michael Beigl. 2018. Descriptive Compound Identifier Names Improve Source Code Comprehension. In Proceedings of the 26th Conference on Program Comprehension (ICPC '18). Association for Computing Machinery, New York, NY, USA, 31–40. https://doi.org/10.1145/3196321.3196332
[46] Dag I.K. Sjoberg, Bente Anda, Erik Arisholm, Tore Dyba, Magne Jorgensen, Amela Karahasanovic, Espen Frimann Koren, and Marek Vokác. 2002. Conducting realistic experiments in software engineering. In Proceedings International Symposium on Empirical Software Engineering. 17–26. https://doi.org/10.1109/ISESE.2002.1166921
[47] Margaret-Anne Storey, Thomas Zimmermann, Christian Bird, Jacek Czerwonka, Brendan Murphy, and Eirini Kalliamvakou. 2021. Towards a Theory of Software Developer Job Satisfaction and Perceived Productivity. IEEE Transactions on Software Engineering 47, 10 (2021), 2125–2142. https://doi.org/10.1109/TSE.2019.2944354
[48] The Standish Group. 1995. The CHAOS report.
[49] Ayse Tosun, Oscar Dieste, Davide Fucci, Sira Vegas, Burak Turhan, Hakan Erdogmus, Adrian Santos, Markku Oivo, Kimmo Toro, Janne Jarvinen, and Natalia Juristo. 2017. An industry experiment on the effects of test-driven development on external quality and productivity. Empirical Software Engineering 22, 6 (2017), 2763–2805. https://doi.org/10.1007/s10664-016-9490-0
[50] Stefan Wagner and Florian Deissenboeck. 2019. Defining Productivity in Software Engineering. In Rethinking Productivity in Software Engineering, Caitlin Sadowski and Thomas Zimmermann (Eds.). Apress, Berkeley, CA, 29–38. https://doi.org/10.1007/978-1-4842-4221-6_4
[51] Zhendong Wang, Yi Wang, and David Redmiles. 2018. Competence-confidence gap: A threat to female developers' contribution on GitHub. In 2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS). IEEE, 81–90. https://doi.org/10.1145/3183428.3183437