Investigating The Effectiveness of Peer Code Review in Distributed Software Development Based On Objective and Subjective Data
*Correspondence: [email protected]
Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil

Abstract
Code review is a potential means of improving software quality. To be effective, it
depends on different factors, and many have been investigated in the literature to
identify the scenarios in which it adds quality to the final code. However, factors
associated with distributed software development, which is becoming increasingly
common, have been little explored. Geographic distance can impose additional
challenges to the reviewing process. We thus in this paper present the results of a
mixed-method study of the effectiveness of code review in distributed software
development. We investigate factors that can potentially influence the outcomes of
peer code review. The study involved an analysis of objective data collected from a
software project involving 201 members and a survey with 50 practitioners with
experience in code review. Our analysis of objective data led to the conclusion that a
high number of changed lines of code tends to increase the review duration with a
reduced number of messages, while the number of involved teams, locations, and
participant reviewers generally improve reviewer contributions, but with a severe
penalty to the duration. These results are consistent with those obtained in the survey
regarding the influence of factors over duration and participation. However,
participants’ opinion about the impact on contributions diverges from results obtained
from historical data, mainly with respect to distribution.
Keywords: Code review, Distributed software development, Empirical study, Survey
1 Background
Code review is a common practice adopted in software development to improve software
quality based on static code analysis by peers. There are studies that provide evidence
that it reduces the number of defects detected after release, mainly when it has ade-
quate code coverage as well as engagement and participation of reviewers (McIntosh et al.
2014). Moreover, code review is a recognized way to foster knowledge sharing that bene-
fits authors and reviewers (Hundhausen et al. 2013). It also improves team collaboration
because it creates collective ownership of the source code, which results from collabo-
rative work rather than individual work (Bacchelli and Bird 2013; Thongtanunam et al.
2016b). Nowadays, code reviews are less formal than in earlier decades of software devel-
opment. In the past, it was typically in the form of code inspections (Fagan 1986), which
required formal meetings and checklists (Kollanus and Koskinen 2009). Today, such a
practice is more informal, being referred to as Modern Code Review (MCR) (Bacchelli and
Bird 2013). It is often assisted and enforced by tools, such as Gerrit (Google 2017a).
The effectiveness of code review depends on different factors and, when it cannot pro-
vide expected benefits, it becomes a costly and time-consuming task (Czerwonka et al.
2015; Thongtanunam et al. 2016a). For example, if there is a time gap between the comple-
tion of a change and its review by a peer, the author may have their work partially blocked,
possibly affecting the whole software release (Thongtanunam et al. 2015b). This lack of
dynamism in the code review activity increases the work in progress of teams, as new
tasks are started while waiting for the pending reviews. Furthermore, the context switch-
ing between coding tasks and reviews may also have a negative impact on developers’
work.
To understand the factors that positively and negatively affect the effectiveness of
code review, previous studies were performed, e.g. (Thongtanunam et al. 2015a; Baysal
et al. 2016; Yang 2014; Bosu et al. 2015). Examples of investigated factors are the patch
size, the nature of the change, and the author’s company—that is, both technical and non-
technical factors have been investigated. Moreover, to evaluate effectiveness, different
criteria have been adopted, such as the review duration and the number of defects found
after code review. As a result, relevant conclusions regarding code review have been
reached. For instance, developers from other teams provide less, but more useful, feed-
back than those from the same team (Bosu et al. 2015). Despite all the significant results
obtained so far, code review has been investigated only to a limited extent in the con-
text of geographically distributed software development (Sengupta et al. 2006), which is
becoming increasingly common over the last decades. In the late 90s, researchers focused
on enabling formal code inspections, which involve meetings, in distributed scenarios
(Perpich et al. 1997; Stein et al. 1997). In modern code review, in contrast, tool sup-
port and asynchronous communication help deal with geographic distribution. However,
the effects of geographic distribution on the outcomes of code review (such as dura-
tion or reviewer engagement) have not been explored. Recent studies of code review in
distributed software development are limited to experience reports on code inspection
(Meyer 2008).
We thus in this paper focus on exploring how both technical and non-technical fac-
tors influence a set of metrics that are indicators of the effectiveness of code review in
the context of Distributed Software Development (DSD). We present the results of a
mixed-method study in which we investigated the relationship between four influence
factors—namely number of changed lines of code, involved teams, involved locations and
active reviewers—and the effectiveness of code review. As there is no single objective met-
ric that captures whether a review is effective, we measured and analyzed different review
outcomes that can be seen as an indication of the review effectiveness, such as reviewer
participation and number of comments. The study involved (1) an analysis of objective
data collected from a software project; and (2) a survey with 50 practitioners with experi-
ence in code review. This study is an extension of our previously presented work (Witter
dos Santos and Nunes 2017), which was complemented by the survey that allows us to
compare the results obtained with both research methods.
The first part of our study, referred to as repository mining, is based on a large amount
of data (8329 commits and 39,237 comments) extracted from the code review database
of a project with 201 members during 72 weeks. The analysis of our results allowed us to
conclude that a high number of changed lines of code tends to increase the duration of the
review process with a reduced number of messages, while the number of involved teams,
locations and participant reviewers generally improve the contributions from reviewers,
but with a severe penalty to the duration. These results are consistent with those obtained
in the survey regarding the influence of factors over duration and participation. However,
participants’ opinion about the impact on contributions diverges from results obtained
from historical data, mainly with respect to distribution.
The remainder of this paper is organized as follows. We first discuss related work in
Section 2. We then provide details of our target project in Section 3, describing the code
review process of our target project. Next, we describe our study settings in Section 4.
The results of the first and second parts of our study are presented and analyzed in
Sections 5 and 6, respectively. A discussion regarding the obtained results is presented in
Section 7, followed by our conclusions, which are presented in Section 8.
2 Related work
Since the pioneering work of Fagan (1976) on formal code inspections, many researchers
proposed approaches to improve this well-structured and phased form of code review
(Parnas and Weiss 1985; Bisant and Lyle 1989; Martin and Tsai 1990). With the popular-
ity of DSD, other researchers investigated how to make code inspections feasible when
the involved people cannot physically meet in a particular location (Perpich et al. 1997;
Stein et al. 1997). Despite its popularity among researchers and practitioners, formal code
inspection and its variations have received less attention since the early 2000s (Kollanus
and Koskinen 2009).
More recently, much work focusing on modern code review has been done, ranging
from studies that investigate what leads to successful code review to approaches that
recommend suitable reviewers. For example, in Balachandran (2013)’s approach, recom-
mended reviewers are those that made the most recent changes in the portion of code
to be reviewed. His approach was improved by Thongtanunam et al. (2014), for projects
with specific characteristics, using the File Path Similarity (FPS), which takes into account
previous changes with similar paths or file names. These approaches were extended by
also considering similarity among past commit messages (Xia et al. 2015) and recent
activity of the possible reviewers (Zanjani et al. 2016). Viviani and Murphy (2016) took
another direction by prioritizing pending reviews for each reviewer instead of finding the
best candidate reviewers for a given change. This is motivated by the fact that several
projects have a high concentration of review requests in a small group of contributors
(Yang 2014).
Despite all these significant contributions to the field of code review, it is cru-
cial to understand the factors that influence the effectiveness of code review to,
for example, provide foundations to improvements while making reviewer recom-
mendations. Therefore, many studies focus on providing a deeper understanding
of code review, and its influence factors (e.g. number of changed lines of code
and experience of individuals) and outcomes (e.g. duration and discussion among
reviewers). Although such studies are similar to ours, they do not focus on DSD.
We next discuss technical and non-technical influence factors investigated in existing
studies.
Summary Given that many factors that influence code review have been investigated, we
summarize what each previous study analyzed in Table 1. Rows in this table consist of the
examined influence factors, while columns represent the analyzed outcomes associated
with code review. In cells, we list the studies that focused on the relationship between a
given influence factor and outcome.
Table 1 Influence factors and review outcomes investigated by previous studies
Columns (review outcomes): absence of discussion, comment usefulness, feedback delay, participation
in review, post-release defects, review duration, review iterations, and review quality.
Rows (influence factors): amount of previous defects, affected modules, author experience, author’s
company, bug fix or new feature, commit message size, external reviewers, length of prior discussions,
number of authors, number of reviewers, patch size (files), patch size (LOC), prior feedback delay,
priority, review coverage, review speed, reviewer experience, and source code type.
Each cell lists the studies that investigated the corresponding combination of influence factor and
outcome, among them Thongtanunam et al. (2016a).
Some of these studies analyzed MCR targeting FLOSS (Free, Libre and Open Source
Software) projects, such as OpenStack, Qt, and LibreOffice, which present DSD charac-
teristics. However, we emphasize that most of these studies did not investigate the impact
of distribution: factors associated with distribution were random variables rather than
independent variables. For instance, Baysal et al. (2016) reported that some analyzed com-
panies had co-located groups, while others used DSD, without treating this issue as an
independent variable. Similarly, Bosu et al. (2015) found that comments from other teams
are slightly more useful, but without considering co-location or the number of involved
teams.
As can be seen, different combinations of influence factor and outcome have been
analyzed. Differently from previous work, our study focuses on DSD and, therefore, we
focus on other influence factors, such as the number of involved cities and teams. Some
of our investigated factors, e.g. patch size (LOC), have already been studied, but not in a
DSD scenario. Moreover, we analyze four different outcomes of code review, which are
described in the next section together with other details of our study settings.
3 Study subject
Our study is based on the analysis of data collected from a (commercial) software project
and developers from a single software development company. Due to the project size, we
were able to collect a large amount of information regarding its code review. We next
describe the code review process of the project, provide details about the collected data,
and characterize the participants of our survey. No further information can be given due
to a confidentiality agreement.
runtime verifications. This automated verification usually takes less than 15 min to exe-
cute and rejects the change if any critical test fails, so that the author can fix the reported
issues. Human reviewers and authors can discuss, ask and provide suggestions for each
line of code. Moreover, each reviewer can vote to summarize their feedback using one of the
following values.
Veto The reviewer considers that the change cannot be integrated without fixing the
reported issues or answering the questions made. This prevents the commit from being
merged.
Rejection The reviewer recommends fixes before the change is merged.
Neutral The reviewer typically asks easy questions to be answered.
Acceptance The reviewer considers that, though the change is adequate, it needs more
reviews from other developers.
Approval Only maintainers of the module associated with the commit have this kind
of vote, as they are responsible for the module quality. Maintainers can perform
technical reviews, but must also verify that relevant developers are not missing in
the list of invited reviewers and that the overall state of the code review is adequate.
It is important to note that all invited reviewers, except the maintainer, are not
obliged to provide feedback. Before approving the change, the maintainer of each
module should consider if the most important reviewers already reviewed the code.
In the end, the piece of reviewed code is submittable if all the following condi-
tions are satisfied: (i) there is no rejection from automated reviewers; (ii) there
is no veto; and (iii) the maintainer has approved the change. If all these con-
ditions hold, the maintainer is able to merge the change into the destination
branch.
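To make this rule concrete, the sketch below encodes the three submittability conditions. It is our own illustration: the Vote structure and the constant names are not part of the project’s actual tooling.

```python
# Minimal sketch of the submittability rule described above. The data model is
# illustrative only; it does not reproduce the project's Gerrit configuration.
from dataclasses import dataclass
from typing import List

VETO, REJECTION, NEUTRAL, ACCEPTANCE, APPROVAL = -2, -1, 0, 1, 2

@dataclass
class Vote:
    reviewer: str
    value: int               # one of the five values above
    automated: bool = False  # True for automated reviewers
    is_maintainer: bool = False

def is_submittable(votes: List[Vote]) -> bool:
    """A change is submittable when (i) no automated reviewer rejected it,
    (ii) no human reviewer vetoed it, and (iii) a maintainer approved it."""
    no_automated_rejection = not any(v.automated and v.value < 0 for v in votes)
    no_veto = not any((not v.automated) and v.value == VETO for v in votes)
    approved_by_maintainer = any(v.is_maintainer and v.value == APPROVAL for v in votes)
    return no_automated_rejection and no_veto and approved_by_maintainer
```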
4 Study settings
After discussing our target project, we now proceed to detailing our study. We first state
our goal and research questions, then describe collected metrics and finally the procedure
of the two parts of our study, namely repository mining and survey.
RQ-1: Does the number of lines of code to be reviewed influence the effectiveness of
distributed code review?
RQ-2: Does the number of involved teams influence the effectiveness of distributed code
review?
RQ-3: Does the number of involved development locations influence the effectiveness of
distributed code review?
RQ-4: Does the number of active reviewers influence the effectiveness of distributed code
review?
We investigate the number of teams and locations separately because the former
captures distribution among teams, allowing us to analyze the impact of involving review-
ers that have different project priorities and goals (possibly conflicting) and limited
interaction, while the latter additionally captures the impact of geographic distribution.
review effectiveness. Before detailing these outcomes, we next further specify our influ-
ence factors—which are listed following the order of our research questions—detailing
how they are measured.
Patch Size (LOC) The patch size (LOC) refers to the number of lines of code
added or modified in a commit, which thus need to be reviewed. The lines of code
considered are those present in the final version of the code, after going through the
reviewing process.
Teams Teams refer to the number of distinct teams associated with the author and
invited reviewers. If the author and all reviewers belong to the same team, the value
associated with this influence factor is 1.
Locations Locations refer to the number of distinct geographically distributed devel-
opment sites associated with the author and invited reviewers. If the author and
all reviewers work in the same development site, the value associated with this
influence factor is 1.
Active Reviewers Active reviewers are those, among the invited, that actually participate in the
reviewing process—with comments or votes. Although this can be seen
as an outcome of the review, given that there is no control of how many of the
invited reviewers will actually participate, we aim to explore whether the number of active
reviewers influences other outcomes, such as duration. Therefore, active reviewers
are investigated as an influence factor, consisting of the number of reviewers that
contributed to the review.
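As an illustration of how these four factors can be derived from the metadata of a single review, consider the following sketch. The Participant and Review structures are ours for illustration and do not correspond to Gerrit’s internal data model.

```python
# Illustrative derivation of the four influence factors from review metadata.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Participant:
    name: str
    team: str
    location: str

@dataclass
class Review:
    author: Participant
    invited_reviewers: List[Participant]
    active_reviewers: List[Participant]  # invitees that commented or voted
    changed_loc: int                     # lines added or modified in the final patch

def influence_factors(r: Review) -> Dict[str, int]:
    people = [r.author] + r.invited_reviewers
    return {
        "patch_size_loc": r.changed_loc,
        "teams": len({p.team for p in people}),
        "locations": len({p.location for p in people}),
        "active_reviewers": len(r.active_reviewers),
    }
```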
Now we focus on describing the analyzed code review outcomes that indicate the review
effectiveness. Code review is effective when it achieves its goals, which can be to identify, in a
timely manner, defects in the code and issues related to code maintainability and legibility, or even
to disseminate knowledge. However, these goals might include constraints regarding the
impact on the development process and the invested effort.
It is not trivial to evaluate whether these goals are achieved. For example, Bosu et al.
(2015) created a model to evaluate whether the comments of a code review are useful
based on the text of the given comments. This measurement, however, may not be precise.
In our work, we focus on measurements that are more objective.
We thus selected four objective outcomes, described as follows. The first is related to
project time constraints, while the remaining three are related to the input from other
developers (reviewers) leading to possibly less failures, code quality improvement and
knowledge dissemination.
Duration (DUR) Duration counts how many days the code review process lasted, from
the day that the source code is available to be reviewed to the day that it received
the last approval of a reviewer.
Comment Density (CDG ) Instead of simply counting the number of review comments,
we take into account the amount of code to be reviewed. Therefore, comment
density refers to the number of review comments divided by the number of groups
of 100 LOC under review, thus giving the average number of review comments for
each 100 LOC. Review comments can be any form of interaction, e.g. approval,
rejection, question, idea or other types of comments made by any reviewer—votes
count as comments because they are a form of input and have a particular meaning.
A multiplying factor of 100 is used to avoid small fractional numbers, which are
harder to compare and less intuitive. Comments from automated reviewers are
ignored, as this type of feedback is a constant, regardless of human interactions.
Comment Density by Reviewer (CDR ) It is expected that the higher the number of
reviewers or teams, the higher the number of comments. Therefore, CDG alone can
lead to the wrong conclusion that discussions were productive when many reviewers
are involved. We thus also analyze comment density by reviewer, given by the divi-
sion of the comment density by the number of active reviews (without taking into
account automated reviewers).
Too short or too long code review. There are studies (Kemerer and Paulk 2009;
Ferreira et al. 2010) that suggest time constraints for code review activities, limita-
tion on the number of lines reviewed per hour and also the total amount of hours
spent doing code review in a single day. Such limitations are imposed because the
code review may become error-prone or even consume more time and resources to be
finished due to tiredness. Moreover, if the review takes too long (i.e. high duration) to be
completed, developers may be prevented from continuing their work, and work does not
get done. Therefore, shorter code reviews are preferred. However, if such a review is too
short, it may also mean that reviewers have not properly analyzed the change.
Low reviewer participation. When reviewers are invited to participate in the review, it
is expected that they contribute. However, not all participate. Therefore, the higher the
participation of reviewers, the better. Nevertheless, we do not expect that participation
is 100%, given that there are developers that are invited automatically and may not be
relevant reviewers anymore.
Few contributions from reviewers. Reviewers may contribute in different ways, rang-
ing from a simple vote to long discussions. We assume that the higher the number of
comments made by reviewers, the more fruitful the discussion and consequently the more
effective the review. However, as explained, we do not consider the absolute number of
comments, but its density considering the amount of code to be reviewed. Moreover, we
consider the amount of contribution generally (CDG ) and by reviewer (CDR ). For both,
the higher, the better.
Although in some situations a low number of comments (either generally or by
reviewer) is enough—for example, when a low number of comments helped to improve
the code, or the change to be reviewed is minor—note that these are not the usual
cases. Because we analyze a high number of code reviews, these exceptional cases do not
significantly impact the results. Moreover, votes count as comments; consequently, even if
there is no need for long discussions, it is important to have at least the acknowledgement
of the reviewer in the form of a vote, i.e. a comment. Finally, we also analyze duration and
participation, which complement the analysis of the effectiveness of code review.
Getting raw data from the code review database First, we fixed a time frame in the past
so that we could obtain data on completed code reviews. Gerrit provides a query mechanism (Google
2017b) that can be used to get structured information about code reviews in JSON format.
One query for each week had to be made due to the limitation of obtaining at most 500
results per query.
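The extraction step can be sketched as follows. This is a rough illustration rather than the scripts actually used in the study: the host name is a placeholder, and the time-window query operators are assumptions that may differ across Gerrit versions (see the cited Gerrit query documentation).

```python
# Rough sketch of the extraction step: one query per one-week window to stay
# under the 500-result limit, reading JSON lines from Gerrit's query interface.
import json
import subprocess
from datetime import date, timedelta

# Placeholder host; adjust port and server to the actual Gerrit installation.
GERRIT_SSH = ["ssh", "-p", "29418", "review.example.com", "gerrit", "query",
              "--format=JSON", "--comments", "--all-approvals"]

def weekly_reviews(start: date, weeks: int):
    for w in range(weeks):
        lo = start + timedelta(weeks=w)
        hi = lo + timedelta(weeks=1)
        # Hypothetical time-window operators; their availability depends on the Gerrit version.
        query = f'status:merged after:"{lo:%Y-%m-%d}" before:"{hi:%Y-%m-%d}"'
        out = subprocess.run(GERRIT_SSH + [query], capture_output=True, text=True).stdout
        for line in out.splitlines():
            record = json.loads(line)
            if record.get("type") != "stats":   # the last line is a query summary
                yield record
```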
Parsing and filtering code review information The retrieved JSON files provided part
of our required data. The remaining data had to be computed from the raw data obtained
from the internal Gerrit database model. The resulting data was filtered, discarding
reviews of certain types of modules, as described next.
Representing and analyzing data Given that we have four research questions with four
associated influence factors as well as four outcomes, there is a large amount of data
to be analyzed. Our data consists essentially of continuous or discrete positive num-
bers, with different scales and ranges. For example, there are only four involved locations
while the patch size can be up to approximately 4 KLOC. To deal with these discrepan-
cies, we adopted an approach similar to that of Baysal et al. (2016). We clustered data in
groups, representing the variance of outcomes in each group using box plots. Additionally,
we performed statistical tests to identify groups that are significantly different from
each other.
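The group comparison can be illustrated with the sketch below. The Kruskal-Wallis test corresponds to the H statistics reported in Section 5; the pairwise Mann-Whitney comparisons with Bonferroni correction are shown only as one possible instantiation of the post hoc step, not necessarily the exact procedure applied in the study.

```python
# Illustrative group comparison: Kruskal-Wallis over the groups of an influence
# factor, followed by pairwise post hoc comparisons with Bonferroni correction.
from itertools import combinations
from scipy import stats

def compare_groups(groups: dict, alpha: float = 0.05):
    """groups maps a group label (e.g. a patch-size range) to a list of outcome values."""
    h, p = stats.kruskal(*groups.values())
    print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")
    if p < alpha:
        pairs = list(combinations(groups, 2))
        for a, b in pairs:
            _, p_ab = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
            flag = "significant" if p_ab < alpha / len(pairs) else "not significant"
            print(f"  {a} vs {b}: p = {p_ab:.4f} ({flag})")
```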
5 Results
Having described our study settings, we proceed to the presentation of obtained results.
They are presented according to our research questions, and in each of them, we discuss
results associated with each of our investigated outcomes.
Fig. 2 Influence of the Patch Size (LOC) over Duration (days), Participation (%), Comment Density (comments
per 100 LOC) and Comment Density by Reviewer (comments per 100 LOC per active reviewer). The y-axes
representing Comment Density and Comment Density by Reviewer are in a logarithmic scale
(H = 709.58, p < 0.05), mainly due to differences between groups with 600 LOC or less
and larger groups. A possible explanation to this is that larger patches likely require more
effort from reviewers, discouraging engagement in the process.
When reviewers participate in the code review, the amount of contribution is mea-
sured by the overall comment density and comment density per reviewer. Our data shows
that in both cases the larger the patch, the lower the comment density. Regarding over-
all comment density, there are statistically significant differences (H = 709.58, p < 0.05).
According to the post hoc tests, this is due only to smaller groups. There is a significant
difference only among a few groups with more than 601 LOC, but among groups with fewer
LOC, there are significant differences in most cases. Similarly, the comment density by
reviewer decreases as patches are larger (H = 3579.57, p < 0.05), showing similar results
in post hoc tests. This indicates that the amount of contribution is highly affected as the
patch size increases up to a certain point. Then, the amount of contribution is limited but
does not decrease after the patch reaches a certain size (> 601 LOC in our study).
One possible explanation for the results regarding patch size is that the patch size has an
intimidating effect on invited reviewers, because the time required to provide significant
contributions increases. This invested time, in our target project, is not explicitly recorded
and is not associated with deliverables considered more relevant, such as produced code.
Conclusions of RQ-1: The patch size negatively affects all outcomes of code review
that we consider as an indication of effectiveness. Reviewers are less engaged and provide
less feedback. Moreover, the duration is not linearly proportional to the patch size, which
may affect the quality of code review.
As discussed in the related work section, other studies investigated the impact of the
patch size in code review. Bosu et al. (2015) showed that for some projects the propor-
tion of relevant comments decreased by 10% when they compared changes in 40 files
with changes in a single file, while Baysal et al. (2016) showed that changes with more
LOC need more iterations to be concluded, but without considering the time interval.
Each iteration is typically the result of an accepted feedback or comment. This indi-
cates that results with respect to patch size in non-distributed scenarios also hold for our
investigated scenario.
Fig. 3 Influence of the number of Teams over Duration (days), Participation (%), Comment Density
(comments per 100 LOC) and Comment Density by Reviewer (comments per 100 LOC per active reviewer)
Similarly, there are also statistically significant differences with respect to participation
(H = 226.72, p < 0.05), and the post hoc analysis showed that the difference is only
significant among reviews with four or fewer teams. However, the results indicate only a
small negative influence on this review outcome.
Considering the effect on contributions, differences are also significant (H =
184.71, p < 0.05). Post hoc tests showed that this is due to the difference between reviews
involving one team and the others. Although Fig. 3 indicates that the overall comment
density increases together with the number of involved teams (except in the case of 5
involved teams), we can see in Table 5 that the standard deviation is high, indicating that
the results vary considerably, which explains the non-significant differences. This can be explained by the
specific teams involved, whether they are in the same location or not (an issue that is
investigated in RQ-3). In our target project, there is an internal team rotation over the
years, as new teams are created, merged or split, with knowledge sharing when teams
change, reducing the diversity of skills between author and reviewers and affecting the
number of questions, doubts or different opinions. Surprisingly, the comment density by
reviewer is higher when two teams are involved, followed by reviews involving one or
three teams. The differences among teams are indeed significant (H = 91.94, p < 0.05),
with post hoc tests showing that if more than three teams are involved, it actually makes
no difference.
Conclusions of RQ-2: We found evidence that code reviews with more involved teams
have lower effectiveness considering duration and participation, but higher effectiveness
with respect to the overall comment density. Comment density by reviewer is slightly
higher when two teams are involved when compared to reviews involving one or three
teams.
As can be seen, the duration of code review is considerably higher if more locations
are involved. With a further analysis of our data, we observed that with two involved
locations, reviews that started in the second half of the sprint sometimes were not fin-
ished on time, causing a performance penalty to the author’s team—as said, code review
is mandatory. This can be explained by the natural isolation of people working in differ-
ent places, which requires daily effort to synchronize priorities and state the importance
of every patch under review. Within the same team and location, this communication
happens on a daily basis in the Scrum daily meetings or other activities that promote inter-
action. There is a statistically significant difference among the groups (H = 158.0, p <
0.05), in fact, among all groups, as shown in the post hoc analysis.
There is also a positive influence on comment density (H = 134.05, p < 0.05) and com-
ment density by reviewer (H = 56.12, p < 0.05). However, there is a negative impact on
participation (H = 86.69, p < 0.05). Post hoc tests show that for these outcomes the dif-
ferences actually exist only between code reviews with one and two locations, probably
because there are few occurrences involving three locations. One possible interpreta-
tion of these results, in addition to the geographical distance barrier, is that code reviews
with more involved locations have more diversity of technical skills, which is plausible
because teams are organized based on groups of related features and technologies. More-
over, there are few rotations of team members among different locations, creating some
form of local technical specialization on each location. This diversity promotes feedback,
questions, and comments, but requires more time to complete the review process. Con-
sequently, reviewers from other locations should be invited if there is a good technical
reason to do so. Otherwise, the higher duration is not compensated by a higher level of
contributions.
Fig. 4 Influence of the number of Locations over Duration (days), Participation (%), Comment Density
(comments per 100 LOC) and Comment Density by Reviewer (comments per 100 LOC per active reviewers)
We also observed that the results with respect to comment density by reviewer have
large differences when compared to those discussed in the previous sections. Results show
that: (i) the average review duration in the same location is 32% greater than in the same
team; (ii) the average duration with two locations is 38% greater than with two teams; and
(iii) the average density of review comments with two locations is 24% higher than with
two teams.
Conclusions of RQ-3: We found evidence that code reviews with more involved loca-
tions have lower effectiveness with respect to duration and participation, but higher
effectiveness considering contributions. The overall comment density and comment den-
sity by reviewer are considerably higher with more involved locations. The participation
is slightly lower with multiple involved locations.
Fig. 5 Influence of the number of Active Reviewers over Duration (days), Participation (%), Comment Density
(comments per 100 LOC) and Comment Density by Reviewer (comments per 100 LOC per active reviewers)
platforms and their infrastructure modules, on which typically one or two developers work
for several months.
Reviewer participation is almost the same with more active reviewers. Although there
are statistically significant differences among groups (H = 268.49, p < 0.05), the post
hoc tests show that this is due to only a few groups, with the remaining groups showing nearly no
significant differences, indicating that the more invitees, the more active reviewers.
Considering the overall comment density, there is a statistically significant difference
(H = 660.89, p < 0.05) when reviewers contribute. However, the post hoc tests show
that the presence of more than two active reviewers does not significantly improve the
comment density. Moreover, the comment density by reviewer is actually lower with three
or more active reviewers (H = 275.55, p < 0.05). This suggests that two active reviewers
is the optimal number considering the trade-off between duration and contributions
from reviewers.
Conclusions of RQ-4: We found evidence that code review with more active review-
ers has lower effectiveness considering the duration. The participation is slightly lower
with more active reviewers. Moreover, having more than two active reviewers does not
improve the overall comment density and negatively affects the comment density by
reviewer.
Fig. 6 Evaluation by participants of the relationship between influence factors and outcomes
description of the data. Moreover, median and average values are similar in our
data, supporting the use of the average to describe our data.
Table 8 Answers of participants evaluating the influence of patch size (LOC) on review outcomes
Outcome        Much worse (1.0)  Worse (2.0)  No Infl. (3.0)  Better (4.0)  Much better (5.0)  Unable to Resp.  Min  Max  Med  M    SD
Duration       20                23           2               0             0                  0                1.0  3.0  2.0  1.6  0.6
Participation  9                 26           6               3             0                  1                1.0  4.0  2.0  2.1  0.8
Comments       3                 11           9               21            0                  1                1.0  4.0  3.0  3.1  1.0
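For reference, the summary columns (Min, Max, Med, M and SD) of the survey tables can be reproduced from the per-level response counts as sketched below. The helper is ours: answers marked as unable to respond are excluded and a sample standard deviation is assumed.

```python
# Illustrative computation of the summary statistics of a Likert item.
import statistics

def likert_summary(counts):
    """counts[i] is the number of answers with value i+1 (1.0 = much worse ... 5.0 = much better)."""
    values = [level for level, n in enumerate(counts, start=1) for _ in range(n)]
    return {
        "Min": min(values), "Max": max(values),
        "Med": statistics.median(values),
        "M": round(statistics.mean(values), 1),
        "SD": round(statistics.stdev(values), 1),
    }

# Example: the Duration row of Table 8.
print(likert_summary([20, 23, 2, 0, 0]))  # {'Min': 1, 'Max': 3, 'Med': 2, 'M': 1.6, 'SD': 0.6}
```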
outcomes. This suggests that the duration is, once again, the most affected outcome in
the opinion of the participants.
7 Discussion
In this section, we present insights and lessons learned from both studies and compare
them. Finally, we discuss the threats to validity.
Table 11 Answers of participants evaluating the influence of active reviewers on review outcomes
Outcome        Much worse (1.0)  Worse (2.0)  No Infl. (3.0)  Better (4.0)  Much better (5.0)  Unable to Resp.  Min  Max  Med  M    SD
Duration       8                 26           9               0             2                  0                1.0  4.0  2.0  2.1  0.7
Participation  1                 21           16              4             0                  3                1.0  4.0  2.0  2.5  0.7
Comments       3                 3            8               25            5                  1                1.0  5.0  4.0  3.6  1.0
Table 12 Comparison of results based on objective (Obj.) and subjective (Subj.) data
Influence factor   Duration (Obj. / Subj.)   Participation (Obj. / Subj.)   Comments (Obj. CDG / Subj. / Obj. CDR)
Patch Size (LOC)   ↑ ↑                       ↓ ↓                            ↓ ↓
Teams              ↑ ↑                       ↓ ↓                            ↓
Locations          ↑ ↑                       ↓ ↓                            ↑ ↑
Active reviewers   ↑ ↑                       ↓ ↓                            ↑
Differences are highlighted. CDR has no corresponding subjective measurement
Fig. 7 Distribution of data considering Duration (days) and Patch Size (LOC)
use other forms of code review, like pair programming, to reduce the severe penalty over
the duration.
7.2 Contributions
Regarding the contributions of this paper, considering the related work on modern code
review that we presented in Table 1, this work analyzed the influence of different influ-
ence factors, such as the number of teams, locations and active reviewers. The same
holds for the code review outcomes, as these studies did not analyze the participation
and comment density. Moreover, our mixed-method study evidenced the dissonance
between participants’ perception and metrics extracted from code review databases in
some situations. By collecting the opinion of participants using a form that is sym-
metric with the research questions, our study presents a significant degree of novelty
when compared to related work, as only Bosu et al. (2015) used insights obtained
from interviews with developers to create a definition for the usefulness of code review
comments.
Code review practitioners can benefit from the discussion section, which was based on
the analysis of results from both parts of this study and should foster critical thinking on
simple, daily decisions inside the teams. Although some of the suggestions are relatively
simple to adopt, the existence of the problem itself is not always evident. For instance,
having more invited reviewers does not mean that the code will be reviewed faster or
that more comments will be provided, so authors should carefully select reviewers. When
reviewers from other teams or locations are involved, the likelihood of having extra delays
is something that the teams should be aware of, and eventually adopt countermeasures
instead of just assuming that review is in progress. Similarly, delivering smaller patches
improves the code review process, so smaller tasks are preferable.
Internal validity We identified five internal threats. First, given that we analyzed an
extensive period of our target project, its developers changed over time. However, as
the number of developers and analyzed reviews is large, individual developers’ behav-
ior and expertise have a low impact on the obtained results. Moreover, this change in
development teams is expected in any software project.
Second, in most of the cases, authors and reviewers communicate using Gerrit to pro-
vide feedback, even when they are on the same team or location. However, there is no
explicit obligation in the target project to record in Gerrit feedback given by means of
other forms of communication, such as telephone or informal meetings. Nevertheless,
this is very unusual for this project—developers tend to use the available tools to ensure
that relevant questions will not be forgotten by the authors. Isolated occurrences thus do
not significantly affect the results.
Third, the participation outcome may have been affected due to the automatic addition
of reviewers by Gerrit’s reviewers-by-blame plugin. The plugin may add, as reviewers,
developers that no longer work in the same team or even in the company. Consequently,
their participation was not expected. As what matters is the relative comparison of partic-
ipation for groups of each outcome, this likely has not affected the results. The probability
of having reviewers that fall into this category is the same for the different reviews.
Fourth, our survey was conducted after the publication of the shorter version of this
work (Witter dos Santos and Nunes 2017). If participants had access to this work before
taking part in the survey, they may have been influenced by our previous results. Although
we cannot completely confirm that no participant became aware of the work, no results
were intentionally disclosed within the company in which the participants work. More-
over, the time gap between the earlier version of this work and the collection of survey
data is only three months.
Finally, our survey was conducted with participants from the same company on which
we collected data for the objective study, but from potentially different projects. More-
over, participants of the survey did not necessarily participate in the project that had its
data analyzed. However, all participants use the same code review process and the same
tools, including the automated reviewers.
External validity Generalizing the results of empirical studies is always an issue, because
the collected and analyzed data may not generally represent software projects. Although
we focused on a single project, our results are based on a large amount of data of a large
project. Therefore, we were able to identify trends and statistically significant results.
However, we emphasize that our results are potentially generalizable only for distributed
software development.
8 Conclusion
Code review is an important static verification technique for improving software quality
and for promoting knowledge sharing within a software project. To identify the scenarios
in which code review in fact succeeds, many studies investigated the relationship between
different factors and the review outcomes. However, there is limited investigation of the
situations in which modern code review is effective in the context of distributed soft-
ware development, when developers and reviewers are spread across geographically distant
development locations.
In this paper, we presented the results of a mixed-method study, composed of two parts.
In the first part, repository mining, we extracted a large amount of code review informa-
tion from a software project whose aim is to develop an operating system for embedded
systems. This project involves 201 developers, spread across 21 teams located in 4 differ-
ent cities. We investigated how the patch size (in terms of lines of code), the number of
teams, the number of locations and the number of active reviewers influence the duration,
reviewer participation and comment density (general and by reviewer) of the review. We
found evidence that the duration of the code review is highly affected by all investigated
factors—the higher they are, the longer the review process. Similarly, the participation of
reviewers is negatively affected in all cases, but mainly by the number of lines of code to
be reviewed. The density of review comments is higher when a relatively small patch size
is reviewed by reviewers from teams or locations other than that of the author. The
density of review comments per reviewer is positively affected by the number of involved
locations and negatively affected by the other factors.
In the second part of the study, we conducted a survey to collect data about the
perceived effects of the four investigated influence factors over code review outcomes
(duration, participation and total number of comments). We obtained 50 responses from
software developers with relevant professional experience in DSD projects with modern
code review practices. We found evidence that higher values of the influence factors have
similar effects on the analyzed code review outcomes. Duration and participation are neg-
atively affected; the total number of comments is negatively affected by patch size, teams
and locations, but is positively affected by the number of active reviewers.
Due to the large amount of data investigated in our study, we could not identify par-
ticular occurrences of code review that could help us to make other analyses and further
explain our data. Even if this was possible, given that we used data from the past to
have complete reviews, developers would potentially not remember specific cases. Our
study, however, gave us insights for future investigations. First, we aim to perform an
observational study involving developers and managers that will allow us to verify if our
conclusions based on the present study hold. Second, further analyses can be made using
code review data. For example, the proportion of votes (vetoes, rejections, approvals and
neutral feedback), the influence of the number of contributions as author or reviewer
(overall and in the same module or file) and other reviewers’ characteristics are interesting
issues to be investigated.
As the patch size proved to be a prominent influence factor, we also plan to ana-
lyze other forms of complexity and effort during code review, assuming that reviewing
ten lines added to a complex module requires more effort than reviewing ten lines added
to a simple module. It is possible to analyze the influence of other indications of com-
plexity, such as the total number of classes, files and LOC as well as the total cyclomatic
complexity.
For modular systems, some influence factors arise from the relations among the mod-
ules and from the role of each module. For instance, the number of dependent modules
could influence the participation in or the duration of the code review, and critical
infrastructure modules might have different code review dynamics when compared to
modules that implement user interfaces. Therefore, we plan to analyze the influence of
architectural aspects of the modules in the code review.
Finally, we considered many metrics to indicate the effectiveness of the review and
aim to investigate whether it is possible to derive a single metric that captures review
effectiveness by combining different review outcomes.
Endnotes
1 https://www.gerritcodereview.com/
2 Available at http://www.inf.ufrgs.br/prosoft/resources/2017/sbes-mcr-dsd.
Abbreviations
DSD: Distributed software development; FLOSS: Free, libre and open source software; FPS: File path similarity; GQM:
Goal-question-metric; LOC: Lines of code; M: Mean; Max: Maximum; MCR: Modern code review; Med: Median; Min:
Minimum; RQ: Research question; SD: Standard deviation
Funding
Ingrid Nunes thanks CNPq for research grant ref. 303232/2015-3.
Authors’ contributions
EW and IN worked jointly on the conceptualization of the work as well as analysis and interpretation of the collected
data. The paper was written in an interactive way, receiving contributions from both authors. EW was the leader, writing first
drafts and making initial analyses, and also being responsible for the implementation needed to collect the data, data
collection, and execution of statistical tests. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Bacchelli A, Bird C (2013) Expectations, Outcomes, and Challenges of Modern Code Review. In: Proceedings of the 2013
International Conference on Software Engineering. ICSE ’13. IEEE Press, Piscataway. pp 712–721. http://dl.acm.org/
citation.cfm?id=2486788.2486882
Balachandran V (2013) Reducing Human Effort and Improving Quality in Peer Code Reviews Using Automatic Static
Analysis and Reviewer Recommendation. In: Proceedings of the 2013 International Conference on Software
Engineering. ICSE ’13. IEEE Press, Piscataway. pp 931–940. http://dl.acm.org/citation.cfm?id=2486788.2486915
Basili VR, Selby RW, Hutchens DH (1986) Experimentation in software engineering. IEEE Trans Softw Eng 12(7):733–743
Baysal O, Kononenko O, Holmes R, Godfrey MW (2016) Investigating technical and non-technical factors influencing
modern code review. Empir Softw Eng 21(3):932–959
Beller M, Bacchelli A, Zaidman A, Juergens E (2014) Modern Code Reviews in Open-source Projects: Which Problems Do
They Fix?. In: Proceedings of the 11th Working Conference on Mining Software Repositories. MSR 2014. ACM, New
York. pp 202–211. http://doi.acm.org/10.1145/2597073.2597082
Bisant DB, Lyle JR (1989) A two-person inspection method to improve programming productivity. IEEE Trans Softw Eng
15(10):1294
Bjørn P, Esbensen M, Jensen RE, Matthiesen S (2014) Does distance still matter? revisiting the cscw fundamentals on
distributed collaboration. ACM Trans Comput-Hum Interact (TOCHI) 21(5):27
Bosu A, Greiler M, Bird C (2015) Characteristics of Useful Code Reviews: An Empirical Study at Microsoft. In: Proceedings of
the 12th Working Conference on Mining Software Repositories. MSR ’15. IEEE Press, Piscataway. pp 146–156. http://dl.
acm.org/citation.cfm?id=2820518.2820538
Czerwonka J, Greiler M, Tilford J (2015) Code Reviews Do Not Find Bugs: How the Current Code Review Best Practice
Slows Us Down. In: Proceedings of the 37th International Conference on Software Engineering - Volume 2. ICSE ’15.
IEEE Press, Piscataway. pp 27–28. http://dl.acm.org/citation.cfm?id=2819009.2819015
Fagan ME (1976) Design and code inspections to reduce errors in program development. IBM Syst J 15(3):182–211.
http://dx.doi.org/10.1147/sj.153.0182
Fagan ME (1986) Advances in software inspections. IEEE Trans Softw Eng 12(1):744–751
Ferreira AL, Machado RJ, Silva JG, Batista RF, Costa L, Paulk MC (2010) An Approach to Improving Software Inspections
Performance. In: Proceedings of the 2010 IEEE International Conference on Software Maintenance. ICSM ’10. IEEE
Computer Society, Washington. pp 1–8. http://dx.doi.org/10.1109/ICSM.2010.5609700
Google (2017a) Gerrit Code Review. https://www.gerritcodereview.com. Accessed 16 Oct 2018
Google (2017b) Gerrit Code Review v2.11.2, Queries. https://gerrit-documentation.storage.googleapis.com/
Documentation/2.12.2/cmd-query.html. Accessed 16 Oct 2018
Google (2017c) Gerrit Plugin: Reviewers by Blame. https://gerrit-review.googlesource.com/Documentation/config-
plugins.html#reviewers-by-blame. Accessed 16 Oct 2018
Hundhausen CD, Agrawal A, Agarwal P (2013) Talking about code: Integrating pedagogical code reviews into early
computing courses. Trans Comput Educ 13(3):14:1–14:28
Internet Engineering Task Force (IETF) (2017) RFC 6020: YANG - A Data Modeling Language for the Network
Configuration Protocol (NETCONF). http://tools.ietf.org/html/rfc6020
Jamieson S, et al (2004) Likert scales: how to (ab) use them. Med Educ 38(12):1217–1218
Kemerer CF, Paulk MC (2009) The impact of design and code reviews on software quality: An empirical study based on
psp data. IEEE Trans Softw Eng 35(4):534–550
Kollanus S, Koskinen J (2009) Survey of software inspection research. Open Softw Eng J 3(1):15–34
Martin J, Tsai WT (1990) N-fold inspection: A requirements analysis technique. Commun ACM 33(2):225–232
McIntosh S, Kamei Y, Adams B, Hassan AE (2014) The Impact of Code Review Coverage and Code Review Participation on
Software Quality: A Case Study of the Qt, VTK, and ITK Projects. In: Proceedings of the 11th Working Conference on
Mining Software Repositories. MSR 2014. ACM, New York. pp 192–201. http://doi.acm.org/10.1145/2597073.2597076
Meyer B (2008) Design and code reviews in the age of the internet. Commun ACM 51(9):66–71
Norman G (2010) Likert scales, levels of measurement and the “laws” of statistics. Adv Health Sci Educ 15(5):625–632
Olson GM, Olson JS (2000) Distance matters. Human-computer Interact 15(2):139–178
Olson JS, Olson GM (2013) Working together apart: Collaboration over the internet. Synth Lect Human-Centered Inform
6(5):1–151
Olson JS, Hofer E, Bos N, Zimmerman A, Olson GM, Cooney D, Faniel I (2008) A theory of remote scientific collaboration.
In: Olson GM, Zimmerman A, Bos N (eds). Scientific collaboration on the Internet. MIT Press, Cambridge. pp 73–99
Parnas DL, Weiss DM (1985) Active Design Reviews: Principles and Practices. In: Proceedings of the 8th International
Conference on Software Engineering. ICSE ’85. IEEE Computer Society Press, Los Alamitos. pp 132–136. http://dl.acm.
org/citation.cfm?id=319568.319599
Perpich JM, Perry DE, Porter AA, Votta LG, Wade MW (1997) Anywhere, anytime code inspections: Using the web to
remove inspection bottlenecks in large-scale software development. In: Proceedings of the 19th International
Conference on Software Engineering. ICSE ’97. ACM, New York. pp 14–21. http://doi.acm.org/10.1145/253228.253234
Rahman MM, Roy CK, Redl J, Collins JA (2016) CORRECT: Code Reviewer Recommendation at GitHub for Vendasta
Technologies. In: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering.
ASE 2016. ACM, New York. pp 792–797. http://doi.acm.org/10.1145/2970276.2970283
Sengupta B, Chandra S, Sinha V (2006) A Research Agenda for Distributed Software Development. In: Proceedings of the
28th International Conference on Software Engineering. ICSE ’06. ACM, New York. pp 731–740. http://doi.acm.org/10.
1145/1134285.1134402
Shimagaki J, Kamei Y, McIntosh S, Hassan AE, Ubayashi N (2016) A Study of the Quality-impacting Practices of Modern
Code Review at Sony Mobile. In: Proceedings of the 38th International Conference on Software Engineering
Companion. ICSE ’16. ACM, New York. pp 212–221. http://doi.acm.org/10.1145/2889160.2889243
Stein M, Riedl J, Harner SJ, Mashayekhi V (1997) A case study of distributed, asynchronous software inspection. In:
Proceedings of the 19th International Conference on Software Engineering. ICSE ’97. ACM, New York. pp 107–117.
http://doi.acm.org/10.1145/253228.253250
Thongtanunam P, Kula RG, Cruz AEC, Yoshida N, Iida H (2014) Improving Code Review Effectiveness Through Reviewer
Recommendations. In: Proceedings of the 7th International Workshop on Cooperative and Human Aspects of
Software Engineering. CHASE 2014. ACM, New York. pp 119–122. http://doi.acm.org/10.1145/2593702.2593705
Thongtanunam P, McIntosh S, Hassan AE, Iida H (2015a) Investigating Code Review Practices in Defective Files: An
Empirical Study of the Qt System. In: Proceedings of the 12th Working Conference on Mining Software Repositories.
MSR ’15. IEEE Press, Piscataway. pp 168–179. http://dl.acm.org/citation.cfm?id=2820518.2820540
Thongtanunam P, Tantithamthavorn C, Kula RG, Yoshida N, Iida H, Matsumoto Ki (2015b) Who should review my code? a
file location-based code-reviewer recommendation approach for modern code review. In: SANER 2015. IEEE.
pp 141–150
Thongtanunam P, McIntosh S, Hassan AE, Iida H (2016a) Review participation in modern code review. Empir Softw Eng
22(2):768–817
Thongtanunam P, McIntosh S, Hassan AE, Iida H (2016b) Revisiting Code Ownership and Its Relationship with Software
Quality in the Scope of Modern Code Review. In: Proceedings of the 38th International Conference on Software
Engineering. ICSE ’16. ACM, New York. pp 1039–1050. http://doi.acm.org/10.1145/2884781.2884852
Viviani G, Murphy GC (2016) Removing Stagnation from Modern Code Review. In: Companion Proceedings of the 2016
ACM SIGPLAN International Conference on Systems, Programming, Languages and Applications: Software for
Humanity. SPLASH Companion 2016. ACM, New York. pp 43–44. http://doi.acm.org/10.1145/2984043.2989224
Witter dos Santos E, Nunes I (2017) Investigating the effectiveness of peer code review in distributed software
development. In: Proceedings of the 31st Brazilian Symposium on Software Engineering. SBES’17. ACM, New York.
pp 84–93. http://doi.acm.org/10.1145/3131151.3131161
Xia X, Lo D, Wang X, Yang X (2015) Who Should Review This Change?: Putting Text and File Location Analyses Together
for More Accurate Recommendations. In: Proceedings of the 2015 IEEE International Conference on Software
Maintenance and Evolution (ICSME). ICSME ’15. IEEE Computer Society, Washington. pp 261–270. http://dx.doi.org/
10.1109/ICSM.2015.7332472
Yang X (2014) Social Network Analysis in Open Source Software Peer Review. In: Proceedings of the 22Nd ACM SIGSOFT
International Symposium on Foundations of Software Engineering. FSE 2014. ACM, New York. pp 820–822. http://doi.
acm.org/10.1145/2635868.2661682
Zanjani MB, Kagdi H, Bird C (2016) Automatically recommending peer reviewers in modern code review. IEEE Trans Softw
Eng 42(6):530–543