
arXiv:2007.10912v1 [cs.SE] 21 Jul 2020

The Corrective Commit Probability Code Quality Metric

Idan Amit and Dror G. Feitelson
[email protected]  [email protected]
Department of Computer Science
The Hebrew University of Jerusalem, 91904 Jerusalem, Israel

Abstract—We present a code quality metric, Corrective Commit Probability (CCP), measuring the probability that a commit reflects corrective maintenance. We show that this metric agrees with developers' concept of quality, and that it is informative and stable. Corrective commits are identified by applying a linguistic model to the commit messages. We compute the CCP of all large active GitHub projects (7,557 projects with 200+ commits in 2019). This leads to the creation of a quality scale, suggesting that the bottom 10% of quality projects spend at least 6 times more effort on fixing bugs than the top 10%. Analysis of project attributes shows that lower CCP (higher quality) is associated with smaller files, lower coupling, use of languages like JavaScript and C# as opposed to PHP and C++, fewer developers, lower developer churn, better onboarding, and better productivity. Among other things these results support the "Quality is Free" claim, and suggest that achieving higher quality need not require higher expenses.

1 INTRODUCTION

It is widely agreed in the software engineering community that code quality is of the utmost importance. This has motivated the development of several software quality models (e.g. [18], [30]) and standards. Myriad tools promise developers that using them will improve the quality of their code.

But what exactly is high-quality code? There have been many attempts to identify issues that reflect upon code quality. These generally go under the name of "code smells" [33], [103]. Yet, it is debatable whether commonly used metrics indeed reflect real quality problems [1], [11]. Each smell is limited to identifying a restricted, shallow type of problem. Besides, sometimes a detected smell (e.g., a long method) is actually the correct design due to other considerations.

A more general approach is to focus on an indirect assessment of the software based on the ill-effects of low quality — and in particular, on the presence of bugs. There is no debate that bugs are bad, especially bugs reported by customers [40]. By focusing on actual bugs, one is relieved of the need to consider all possible sources of quality problems, also removing any dependence on the implementation of the procedure that finds them. Moreover, approaches based on bugs can apply equally well to different programming languages and projects.

Based on these considerations, we suggest the Corrective Commit Probability (CCP, the probability that a given commit is a bug fix) as a metric of quality.

Corrective maintenance (aka fixing bugs) represents a large fraction of software development, and contributes significantly to software costs [19], [66], [92]. But not all projects are the same: some have many more bugs than others. The propensity for bugs, as reflected by their fixing activity, can therefore be used to represent quality. This can be applied at various resolutions, e.g., a project, a file, or a method. Such application can help spot entities that are bug prone, improving future bug identification [61], [84], [106].

While counting the number of bugs in code is common, disregarding a project's history can be misleading. In the CCP metric we normalize the number of bug fixes by the total number of commits, thereby deriving the probability that a commit is a bug fix. We focus on commits because in contemporary software development a commit is the atomic unit of work.

We identify corrective commits using a linguistic model applied to commit messages, an idea that is commonly used for defect prediction [27], [41], [86]. The linguistic-model prediction marks commit messages as corrective or not, in the spirit of Ratner et al.'s labelling functions [85]. Though our accuracy is significantly higher than previous work, such predictions are not completely accurate and therefore the model hits do not always coincide with the true corrective commits.

Given an implementation of the CCP metric, we perform a large-scale assessment of GitHub projects. We analyze all 7,557 large active projects (defined to be those with 200+ commits in 2019, excluding redundant projects which might bias our results [15]). We use this, inter alia, to build the distribution of CCP, and find the quality ranking of each project relative to all others. The significant difference in the CCP among projects is informative. Software developers can easily know their own project's CCP. They can thus find where their project is ranked with respect to the community.

Note that CCP provides a retrospective assessment of quality. Being a process metric, it only applies after bugs are found, unlike code metrics which can be applied as the code is written. The CCP metric can be used as a research tool for the study of different software engineering issues. A simple approach is to observe the CCP given a certain phenomenon (e.g., programming language, coupling). For example, we show below that while investment in quality is often considered to reduce development speed, in practice the development speed is actually higher in high quality projects.

Our main contributions in this research are as follows:
• We define the Corrective Commit Probability (CCP) metric to assess the quality of code. The metric is shown to be correlated with developers' perceptions of quality, is easy to compute, and is applicable at all granularities and regardless of programming language.
• We develop a linguistic model to identify corrective commits, that performs significantly better than prior work and close to human level.
• We show how to perform a maximum likelihood computation to improve the accuracy of the CCP estimation, also removing the dependency on the implementation of the linguistic model.
• We establish a scale of CCP across projects, indicating that the metric provides information about the relative quality of different projects. The scale shows that projects in the bottom decile of quality spend at least six times the effort on bug correction as projects in the top decile.
• We show that CCP correlates with various other effects, e.g. successful onboarding of new developers and productivity.
• We present twin experiments and co-change analysis in order to investigate relations beyond mere correlation.
• On the way we also provide empirical support for Linus's law and the "Quality is Free" hypothesis.

2 RELATED WORK

Despite decades of work on this issue, there is no agreed definition of "software quality". For some, this term refers to the quality of the software product as perceived by its users [93]. Others use the term in reference to the code itself, as perceived by developers. These two approaches have a certain overlap: bugs comprise bad code that has effects seen by the end user. When considering the code, some define quality based on mainly non-functional properties, e.g. reliability, modifiability, etc. [18]. Others include correctness as the foremost property [30]. Our approach is also that correctness is the most important element of quality. The number of bugs in a program could have been a great quality metric. However, Rice's theorem [89] tells us that bug identification, like any non-trivial semantic property of programs, is undecidable. Nevertheless, bugs are being found, providing the basis for the CCP metric. And since bugs are time consuming, disrupt schedules, and hurt the general credibility, lowering the bug rate has value regardless of other implications—thereby lending value to having a low CCP. Moreover, it is generally accepted that fixing bugs costs more the later they are found, and that maintenance is costlier than initial development [16], [17], [19], [29]. Therefore, the cost of low quality is even higher than implied by the bug ratio difference.

Capers Jones defined software quality as the combination of low defect rate and high user satisfaction [50], [52]. He went on to provide extensive state-of-the-industry surveys based on defect rates and their correlation with various development practices, using a database of many thousands of industry projects. Our work applies these concepts to the world of GitHub and open source, using the availability of the code to investigate its quality and possible causes and implications.

Software metrics can be divided into three groups: product metrics, code metrics, and process metrics. Product metrics consider the software as a black box. A typical example is the ISO/IEC 25010:2011 standard [48]. It includes metrics like fitness for purpose, satisfaction, freedom from risk, etc. These metrics might be subjective, hard to measure, and not applicable to white box actionable insights, which makes them less suitable for our research goals. Indeed, studies of the ISO/IEC 9126 standard [47] (the precursor of ISO/IEC 25010) found it to be ineffective in identifying design problems [1].

Code metrics measure properties of the source code directly. Typical metrics are lines of code (LOC) [67], the Chidamber and Kemerer object oriented metrics (aka CK metrics) [23], McCabe's cyclomatic complexity [71], Halstead complexity measures [42], etc. [10], [39], [83]. They tend to be specific, low level, and highly correlated with LOC [37], [73], [90], [95]. Some specific bugs can be detected by matching patterns in the code [46]. But this is not a general solution, since depending on it would bias our data towards these patterns.

Process metrics focus on the code's evolution. The main data source is the source control system. Typical metrics are the number of commits, the commit size, the number of contributors, etc. [38], [75], [83]. Process metrics have been claimed to be better predictors of defects than code metrics for reasons like showing where effort is being invested and having less stagnation [75], [83].

Working with commits as the entities of interest is also popular in just in time (JIT) defect prediction [55]. Unlike JIT, we are interested in the probability and not in a specific commit being corrective. We also focus on long periods, rather than comparing the versions before and after a bug fix, which probably reflects an improvement. We examine work at a resolution of years, and show that CCP is stable, so projects that are prone to errors stay so, despite prior efforts to fix bugs.

Focusing on commits, we need a way to know if they are corrective. If one has access to both a source control system and a ticket management system, one can link the commits to the tickets [13] and reduce the CCP computation to mere counting. Yet, the links between commits and tickets might be biased [13]. The ticket classification itself might have 30% errors [44], and may not necessarily fit the researcher's desired taxonomy. And integrating tickets with the code management system might require a lot of effort, making it infeasible when analysing thousands of projects. Moreover, in a research setting the ticket management system might be unavailable, so one is forced to rely on only the source control system.

When labels are not available, one can use linguistic analysis of the commit messages as a replacement. This is often done in defect prediction, where supervised learning can be used to derive models based on a labeled training set [27], [41], [86].

In principle, commit analysis models can be used to estimate the CCP, by creating a model and counting hits. That could have worked if the model accuracy was perfect. We take the model predictions and use the hit rate and the model confusion matrix to derive a maximum likelihood estimate of the CCP. Without such an adaptation, the analysis might be invalid, and the hits of different models would have been incomparable.

Our work is also close to Software Reliability Growth Models (SRGM) [38], [109], [111]. In SRGM one tries to predict the number of future failures, based on bugs discovered so far, and assuming the code base is fixed. The difference between us is that we are not aiming to predict future quality. We identify current software quality in order to investigate the causes and implications of quality.

The number of bugs was used as a feature and indicator of quality before, as an absolute number [58], [88], per period [105], and per commit [3], [96]. We prefer the per commit version since it is agnostic to size and useful as a probability.

3 DEFINITION AND COMPUTATION OF THE CORRECTIVE COMMIT PROBABILITY METRIC

We now describe how we built the Corrective Commit Probability metric, in three steps:
1) Constructing a gold standard data set of labeled commit samples, identifying those that are corrective (bug fixes). These are later used to learn about corrective commits and to evaluate the model.
2) Building and evaluating a supervised learning linguistic model to classify commits as either corrective or not. Applying the model to a project yields a hit rate for that project.
3) Using maximum likelihood estimation in order to find the most likely CCP given a certain hit rate.

The need for the third step arises because the hit rate may be biased, which might falsify further analysis like using regression and hypothesis testing. By working with the CCP maximum likelihood estimation we become independent of the model details and its hit rate. We can then compare the results across projects, or even with results based on other researchers' models. We can also identify outliers deviating from the common linguistic behavior (e.g., non-English projects), and remove them to prevent erroneous analysis.

Note that we are interested in the overall probability that a commit is corrective. This is different from defect prediction, where the goal is to determine whether a specific commit is corrective. Finding the probability is easier than making detailed predictions. In analogy to coin tosses, we are interested only in establishing to what degree a coin is biased, rather than trying to predict a sequence of tosses. Thus, if for example false positives and false negatives are balanced, the estimated probability will be accurate even if there are many wrong predictions.

3.1 Building a Gold Standard Data Set

The most straightforward way to compute the CCP is to use a change log system for the commits and a ticket system for the commit classification [13], and compute the corrective ratio. However, for many projects the ticket system is not available. Therefore, we base the commit classification on linguistic analysis, which is built and evaluated using a gold standard.

A gold standard is a set of entities with labels that capture a given concept. In our case, the entities are commits, the concept is corrective maintenance [100], namely bug fixes, and the labels identify which commits are corrective. Gold standards are used in machine learning for building models, which are functions that map entities to concepts. By comparing the true label to the model's prediction, one can estimate the model performance. In addition, we also used the gold standard in order to understand the data behavior and to identify upper bounds on performance.

We constructed the gold standard as follows. Google's BigQuery has a schema for GitHub where all projects' commits are stored in a single table. We sampled uniformly 840 (40 duplicate) commits as a train set. The first author then manually labeled these commits as being corrective or not based on the commit content using a defined protocol.

To assess the subjectiveness in the labeling process, two additional annotators labeled 400 of the commits. When there was no consensus, we checked if the reason was a deviation from the protocol or an error in the labeling (e.g., missing an important phrase). In these cases, the annotator fixed the label. Otherwise, we considered the case as a disagreement. The final label was a majority vote of the annotators. The Cohen's kappa scores [24] among the different annotators were at least 0.9, indicating excellent agreement. Similarly consistent commit labeling was reported by Levin and Yehudai [65].
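For readers who wish to reproduce this kind of agreement analysis, pairwise Cohen's kappa can be computed with a standard library implementation. The following minimal sketch assumes the per-annotator labels are available as Python lists; the vectors shown are illustrative placeholders, not the study's data.

    # Illustrative sketch: pairwise Cohen's kappa between annotators.
    # The label vectors are made-up placeholders, not the study's data.
    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    annotations = {
        "annotator_1": [1, 0, 0, 1, 0, 1, 0, 0],
        "annotator_2": [1, 0, 0, 1, 0, 1, 1, 0],
        "annotator_3": [1, 0, 0, 1, 0, 0, 0, 0],
    }
    for (name_a, a), (name_b, b) in combinations(annotations.items(), 2):
        print(name_a, "vs", name_b, ":", round(cohen_kappa_score(a, b), 2))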
Of the 400 triple-annotated commits, there was consensus regarding the labels in 383 (95%) of them: 105 (27%) were corrective, 278 were not. There were only 17 cases of disagreement. An example of disagreement is "correct the name of the Pascal Stangs library." It is subjective whether a wrong name is a bug.

In addition, we also noted the degree of certainty in the labeling. The message "mysql_upgrade should look for .sql script also in share/ directory" is clear, yet it is unclear whether the commit is a new feature or a bug fix. In only 7 cases the annotators were uncertain and couldn't determine with high confidence the label from the commit message and content. Of these, in 4 they all nevertheless selected the same label.

Two of the samples (0.5%) were not in English. This prevents English linguistic models from producing a meaningful classification. Luckily, this is uncommon.

Finally, in 4 cases (1%) the commit message did not contain any syntactic evidence for being corrective. The most amusing example was "When I copy-adapted handle_level_irq I skipped note_interrupt because I considered it unimportant. If I had understood its importance I would have saved myself some ours of debugging" (the typo is in the origin). Such cases set an upper bound on the performance of any syntactic model.
In our data set, all the above special cases (uncertainty, disagreement, and lack of syntactic evidence) are rather rare (just 22 samples, 5.5%; many behaviors overlap), and the majority of samples are well behaved. The number of samples in each misbehavior category is very small, so ratios are very sensitive to noise. However, we can say with confidence that these behaviors are not common and therefore are not an issue of concern in the analysis.

3.2 Syntactic Identification of a Corrective Commit

Our linguistic model is a supervised learning model, based on indicative terms that help identify corrective commit messages. Such models are built empirically by analyzing corrective commit messages in distinction from other commit messages.

The most common approach today to do this is to employ machine learning. We chose not to use machine learning classification algorithms to build the model. The main reason was that we are using a relatively small labeled data set, and linguistic analysis tends to lead to many features (e.g., in a bag of words, word embedding, or n-grams representation). In such a scenario, models might overfit and be less robust. One might try to cope with overfitting by using models of low capacity. However, the concept that we would like to represent (e.g., include "fix" and "error" but not "error code" and "not a bug") is of relatively high capacity. The need to cover many independent textual indications and count them requires a large capacity, larger than what can be supported by our small labeled data set. Note that though we didn't use classification algorithms, the goal, the structure, and the usage of the model are of supervised learning.

Many prior language models suggest term lists like ('bug', 'bugfix', 'error', 'fail', 'fix'), which reach 88% accuracy on our data set. We tried many machine learning classification algorithms and only the plain decision tree algorithm reached such accuracy. More importantly, as presented later, we aren't optimizing for accuracy. We therefore elected to construct the model manually based on several sources of candidate terms and the application of semantic understanding.

We began with a private project in which the commits could be associated to a ticket-handling system that enabled determining whether they were corrective. We used them in order to differentiate the word distribution of corrective commit messages and other messages and find an initial set of indicative terms. In addition, we used the gold-standard data set presented above. This data set is particularly important because our target is to analyze GitHub projects, so it is desirable that our train data will represent the data on which the model will run. This train data set helped tuning the indicators by identifying new indications and nuances and alerting to bugs in the model implementation.

To further improve the model we used some terms suggested by Ray et al. [86], though we didn't adopt all of them (e.g., we don't consider a typo to be a bug). This model was used in Amit and Feitelson [3], reaching an accuracy of 89%. We then added additional terms from Shrikanth et al. [97]. We also used labeled commits from Levin and Yehudai [65] to further improve the model based on samples it failed to classify.

The last boost to performance came from the use of active learning [94] and specifically the use of classifier discrepancies [4]. Once the model's performance is high, the probability of finding a false negative, positive_rate · (1 − recall), is quite low, requiring a large number of manually labeled random samples per false negative. Amit and Feitelson [3] provided models for a commit being corrective, perfective, or adaptive. A commit not labeled by any of the models is assured to be a false negative (of one of them). Sampling from this distribution was an effective method to find false negatives, and improving the model to handle them increased the model recall from 69% to 84%. Similarly, while a commit might be both corrective and adaptive, commits marked by more than one classifier are more likely to be false positives.

The resulting model uses regular expressions to identify the presence of different indicator terms in commit messages. We base the model on straightforward regular expressions because this is the tool supported by Google's BigQuery relational database of GitHub data, which is our target platform.

The final model is based on three distinct regular expressions. The first identifies about 50 terms that serve as indications of a bug fix. Typical examples are: "bug", "failure", and "correct this". The second identifies terms that indicate other fixes, which are not bug fixes. Typical examples are: "fixed indentation" and "error message". The third is terms indicating negation. This is used in conjunction with the first regular expression to specifically handle cases in which the fix indication appears in a negative context, as in "This is not an error". It is important to note that fix hits are also hits of the other fixes and the negation. Therefore, the complete model counts the indications for a bug fix (matches to the first regular expression) and subtracts the indications for not really being a bug fix (matches to the other two regular expressions). If the result is positive, the commit message was considered to be a bug fix. The results of the model evaluation using a 1,100-sample test set built in Amit and Feitelson [3] are presented in the confusion matrix of Table 1.

Table 1
Confusion matrix of model on test data set.

                            Classification
Concept     True (Corrective)        False
True        228 (20.73%) TP          43 (3.91%) FN
False       34 (3.09%) FP            795 (72.27%) TN

These results can be characterized by the following metrics:
• Accuracy (model is correct): 93.0%
• Precision (ratio of hits that are indeed positives): 87.0%
• Precision lift (precision/positive rate − 1): 253.2%
• Hit rate (ratio of commits identified by model as corrective): 23.8%
• Positive rate (real corrective commit rate): 24.6%
• Recall (positives that were also hits): 84.1%
• Fpr (False Positive Rate, negatives that are hits by mistake): 4.2%
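To make the structure of the classifier concrete, the following sketch shows a three-regular-expression model of the kind described above. The term lists are short illustrative placeholders; the actual model uses about 50 bug-fix indicators plus longer "other fix" and negation lists that are not reproduced here.

    import re

    # Illustrative term lists only; the real model's lists are much longer.
    FIX = re.compile(r"\b(bug|bugs|failure|fault|defect|fix|fixed|fixes)\b", re.I)
    OTHER_FIX = re.compile(r"\b(fixed indentation|error message|typo)\b", re.I)
    NEGATION = re.compile(r"\b(not a bug|not an error|no bug)\b", re.I)

    def is_corrective(message: str) -> bool:
        # Count bug-fix indications and subtract indications that a hit is
        # not really a bug fix (other fixes, or fix terms in a negated context).
        score = len(FIX.findall(message))
        score -= len(OTHER_FIX.findall(message)) + len(NEGATION.findall(message))
        return score > 0

    print(is_corrective("Fix null pointer failure in the parser"))     # True
    print(is_corrective("This is not a bug; fixed indentation only"))  # False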
Though prior work was based on different protocols and data sets and is therefore hard to compare, our accuracy is significantly better than previously reported results of 68% [45], 70% [5], 76% [65] and 82% [6], and also better than our own previous result of 89% [3]. The achieved accuracy is close to the well-behaving commits ratio in the gold standard.

3.3 Maximum Likelihood Estimation of the Corrective Commit Probability

We now present the CCP maximum likelihood estimation. Let hr be the hit rate (probability that the model will identify a commit as corrective) and pr be the positive rate, the true corrective rate in the commits (this is what CCP estimates). In prior work it was common to use the hit rate directly as the estimate for the positive rate. However, they differ since model prediction is not perfect. Thus, by considering the model performance we can better estimate the positive rate given the hit rate. From a performance modeling point of view, the Dawid-Skene [28] modeling is an ancestor of our work. However, the Dawid-Skene framework represents a model by its precision and recall, while we use Fpr and recall.

There are two distinct cases that can lead to a hit. The first is a true positive (TP): there is indeed a bug fix and our model identifies it correctly. The probability of this case is Pr(TP) = pr · recall. The second case is a false positive (FP): there was no bug fix, yet our model mistakenly identifies the commit as corrective. The probability of this case is Pr(FP) = (1 − pr) · Fpr. Adding them gives

    hr = Pr(TP) + Pr(FP) = (recall − Fpr) · pr + Fpr    (1)

Extracting pr leads to

    pr = (hr − Fpr) / (recall − Fpr)    (2)

We want to estimate Pr(pr|hr). Let n be the number of commits in our data set, and k the number of hits. As the number of samples increases, k/n converges to the model hit rate hr. Therefore, we estimate Pr(pr|n, k). We will use maximum likelihood for the estimation. The idea behind maximum likelihood estimation is to find the value of pr that maximizes the probability of getting a hit rate of hr. Note that if we were given p, a single trial success probability, we could calculate the probability of getting k hits out of n trials using the binomial distribution formula

    Pr(k; n, p) = (n choose k) · p^k · (1 − p)^(n−k)    (3)

Finding the optimum requires the computation of the derivative and finding where it equals zero. The maximum of the binomial distribution is at k/n. Equation (2) is linear and therefore monotone. Therefore, the maximum likelihood estimation of the formula is

    pr = (k/n − Fpr) / (recall − Fpr)    (4)

For our model, Fpr = 0.042 and recall = 0.84 are fixed constants (rounded values taken from the confusion matrix of Table 1). Therefore, we can obtain the most likely pr given hr by

    pr = (hr − 0.042) / (0.84 − 0.042) = 1.253 · hr − 0.053    (5)
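As a minimal sketch, equation (4) can be applied directly to a project's commit counts. The constants are the rounded recall and Fpr from Table 1; the function name and the valid-domain check are illustrative, not part of a released tool.

    RECALL = 0.84  # rounded, from Table 1
    FPR = 0.042    # rounded, from Table 1

    def ccp_mle(hits: int, commits: int) -> float:
        """Most likely corrective commit probability given k hits in n commits (eq. 4)."""
        hit_rate = hits / commits
        if not (FPR <= hit_rate <= RECALL):
            # Outside [Fpr, recall] the estimate leaves [0, 1]; the fixed-performance
            # assumption is violated (see Section 4.2.1), so the project is flagged.
            raise ValueError("hit rate outside the valid domain [0.042, 0.84]")
        return (hit_rate - FPR) / (RECALL - FPR)

    print(round(ccp_mle(240, 1000), 3))  # 240 hits in 1,000 commits -> about 0.25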

4 VALIDATION OF THE CCP METRIC

4.1 Validation of the CCP Maximum Likelihood Estimation

George Box said: "All models are wrong but some are useful" [20]. We would like to see how close the maximum likelihood CCP estimations are to the actual results. Note that the model performance results we presented above in Table 1, using the gold standard test set, do not refer to the maximum likelihood CCP estimation. We need a new independent validation set to verify the maximum likelihood estimation. We therefore manually labeled another set of 400 commits, and applying the model resulted in the confusion matrix shown in Table 2.

Table 2
Confusion matrix of model on validation data set.

                            Classification
Concept     True (Corrective)        False
True        91 (22.75%) TP           18 (4.5%) FN
False       34 (8.5%) FP             257 (64.25%) TN

In this data set the positive rate is 27.2%, the hit rate is 31.2%, the recall is 83.5%, and the Fpr is 11.7%. Note that the positive rate in the validation set is 2.6 percentage points different from our test set. The positive rate has nothing to do with MLE and shows that statistics tend to differ on different samples. In this section we would like to show that the MLE method is robust to such changes.

In order to evaluate how sensitive the maximum likelihood estimation is to changes in the data, we used the bootstrap method [31]. We sampled with replacement 400 items from the validation set, repeating the process 10,000 times. Each time we computed the true corrective commit rate, the estimated CCP, and their difference. Figure 1 shows the difference distribution.

Figure 1. Difference distribution in validation bootstrap.
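A sketch of this bootstrap, assuming the validation set is available as parallel lists of true labels and model hits (an assumed data layout, not the authors' code), follows. Trimming the 2.5% tails of the returned differences yields the 95% interval discussed next.

    import random

    RECALL, FPR = 0.84, 0.042  # rounded model constants, as in eq. (5)

    def bootstrap_differences(labels, hits, rounds=10_000, seed=1):
        # labels[i] is 1 if commit i is truly corrective; hits[i] is 1 if the model fired.
        rng = random.Random(seed)
        n, diffs = len(labels), []
        for _ in range(rounds):
            idx = [rng.randrange(n) for _ in range(n)]        # resample with replacement
            true_rate = sum(labels[i] for i in idx) / n       # true corrective rate
            est = (sum(hits[i] for i in idx) / n - FPR) / (RECALL - FPR)  # estimated CCP
            diffs.append(est - true_rate)
        return diffs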
In order to cover 95% of the distribution we can trim the 2.5% tails from both sides. This will leave us with differences ranging between -0.044 and 0.046. One can use the boundaries related to 95%, 90%, etc. in order to be extra cautious in the definition of the valid domain.

Another possible source of noise is in the model performance estimation. If the model is very sensitive to the test data, a few anomalous samples can lead to a bad estimation. Again, we used bootstrap in order to estimate the sensitivity of the model performance estimation. For 10,000 iterations we sampled two data sets of size 400. On each of the data sets we computed the recall and Fpr and built an MLE estimator. We then compared the difference in the model estimation at a few points of interest: [0,1] – the boundaries of probabilities, [0.042, 0.84] – the boundaries of the valid domain, and [0.06, 0.39] – the p10 and p90 percentiles of the CCP distribution. Since our models are linear, so are their differences. Hence their maximum points are at the ends of the examined segments. When considering the boundaries of probabilities [0,1], the maximal absolute difference is 0.34 and 95% of the differences are lower than 0.19. When considering the boundaries of the valid domain [0.042, 0.84], the maximal absolute difference is 0.28 and 95% of the differences are lower than 0.15. When considering the p10 and p90 percentiles of the CCP distribution [0.06, 0.39], the maximal absolute difference is 0.13 and 95% of the differences are lower than 0.07.

Using the validation set estimator on the test set, the CCP is 0.168, 7.7 percentage points off the actual positive rate. In the other direction, using the CCP estimator built from the test data performance on the validation set, the CCP is 0.39, 11.8 points off. Since our classifier has high accuracy, the difference between the hit rate and the CCP estimates in the distribution deciles, presented below in Table 4, is at most 4 percentage points. Hence the main practical contribution of the MLE in this specific case is the identification of the valid domain rather than an improvement in the estimate.

4.2 Sensitivity to Violated Assumptions

4.2.1 Fixed Linguistic Model Assumption

The maximum likelihood estimation of the CCP assumes that the linguistic model performance, measured by its recall and Fpr, is fixed. Hence, a change in the hit rate in a given domain is due to a change in the CCP in the domain, and not due to a change in the linguistic model performance. This assumption is crucial for the mapping from identified corrective commits to a quality metric.

Yet, this assumption does not always hold. Both hr and pr are probabilities and must be in the range [0, 1]. Equation (2) equals 0 at Fpr and 1 at recall. For our model, this indicates that the range of values of hr for which pr will be a probability is [0.042, 0.84]. Beyond this range, we are assured that the linguistic model performance is not as measured on the gold standard. An illustrative example of the necessity of the range is a model with recall = 0.5 and Fpr = 0. Given hr = 0.9 the most likely pr is 1.8. This is an impossible value for a probability, so we deduce that our assumption is wrong.

As described in Section 5.2, we estimated the CCP of all 8,588 large active projects in 2019. In 10 of these projects the estimated CCP was above 1. Checking these projects, we found that they have many false positives, e.g. due to a convention of using the term "bug" for general tasks or starting the subject with "fixes #123" where ticket #123 was not a bug fix but some other task id.

Another 11.8% of the projects had an estimated CCP below 0. This could indicate having extremely few bugs, or else a relatively high fraction of false negatives (bug fixes we did not identify). One possible reason for low identification is if the project commit messages are not in English. To check this, we built a simple linguistic model in order to identify if a commit message is in English. The model was the 100 most frequent words in English longer than two letters (see details and performance in supplementary materials). The projects with negative CCP had a median English hit rate of 0.16. For comparison, the median English hit rate of the projects with positive CCP was 0.54, and 96% of them had a hit rate above 0.16.

Interestingly, another reason for many false negatives was the habit of using very terse messages. We sampled 5,000 commits from the negative CCP projects and compared them to the triple-annotated data set used above. In the negative CCP commits, the median message length was only 27 characters, and the 90th percentile was 81 characters. In the annotated data set the median was 8 times longer, and the 90th percentile was 9 times longer.

It is also known that not all projects in GitHub (called there repositories) are software projects [54], [76]. Since bugs are a software concept, other projects are unlikely to have such commits and their CCP will be negative. Hence, the filtering also helps us to focus on software projects. Git is unable to identify the language of 6% of the projects with negative CCP, more than 14 times the ratio in the valid domain. The languages 'HTML', 'TeX', 'TSQL', 'Makefile', 'Vim script', 'Rich Text Format' and 'CSS' are identified for 22% of the projects with negative CCP, more than 4 times as in the valid range. Many projects involve several languages, and when we examined a sample of projects we found that the language identification is not perfect. However, at least 28% of the projects that we filtered due to negative CCP are not identified by GitHub as regular software projects.

To summarize, in the projects with invalid CCP estimates, below 0 or above 1, the behavior of the linguistic model changes and invalidates the fixed performance assumption. We believe that the analysis of projects in the CCP valid domain is suitable for software engineering goals. The CCP distribution in Table 4 below is presented for both the entire data set and only for projects with valid CCP estimates. The rest of the analysis is done only on the valid projects.

4.2.2 Fixed Bug Detection Efficiency Assumption

The major assumption underlying our work is that CCP reflects quality — that the number of bug fixes reflects the number of bugs. Likewise, the comparison of CCP across projects assumes that the bug detection efficiency is similar, so a difference in the CCP is due to a difference in the existence of bugs and not due to a difference in the ability to find them. We found two situations in which this assumption appears to be violated. In these situations, the ability to find bugs appears to be systematically different — higher or lower — than in other projects.

The first such situation is in very popular projects. Linus's law, "given enough eyeballs, all bugs are shallow" [87], suggests that a large community might lead to more effective bug identification, and as a consequence also to higher CCP. In order to investigate this, we used projects of companies or communities known for their high standards: Google, Facebook, Apache, Angular, Kubernetes, and Tensorflow. For each such source, we compared the average CCP of projects in the top 5% as measured by stars (7,481 stars or more), with the average CCP of projects with fewer stars.

Table 3
Linus's Law: CCP in projects with many or fewer stars.

              top 5% (>7,481 stars)       bottom 95% (<7,481 stars)
Source        N    avg. CCP (lift)        N    avg. CCP
Google        8    0.32 (27%)             66   0.25
Facebook      9    0.30 (12%)             9    0.27
Apache        10   0.37 (44%)             35   0.26
Angular       3    0.49 (34%)             32   0.37
Kubernetes    3    0.21 (35%)             3    0.16
Tensorflow    5    0.26 (32%)             26   0.20

The results were that the most popular projects of high-reputation sources indeed have CCP higher than less popular projects of the same organization (Table 3). The popular projects tend to be important projects: Google's Tensorflow and Facebook's React received more than 100,000 stars each. It is not likely that such projects have lower quality than the organization's standard. Apparently, these projects attract large communities which provide the eyeballs to identify the bugs efficiently, as predicted by Linus's law.

Note that these communities' projects, with many stars or not, have an average CCP of 0.26, 21% more than all projects' average. Their average number of authors is 219, 144% more than the others. Their average age is 4 years compared to 5 years, 20% younger yet not young in both cases. However, the average number of stars is 5,208 compared to 1,428, a lift of 364%. It is possible that while the analysis we presented is for extreme numbers of stars, Linus's law kicks in already at much lower numbers and contributed to the difference.

There are only a few such projects (we looked at the top 5% from a small select set of sources). The effect on the CCP is modest (raising the level of bug corrections by around 30%, both a top and a median project will decrease in one decile). Thus, we expect that they will not have a significant impact on the results presented below.

The mirror image of projects that have enough users to benefit from Linus's law is projects that lose their core developers. The "Truck Factor" originated in the Agile community. Its informal definition is "The number of people on your team who have to be hit with a truck before the project is in serious trouble" [108]. In order to analyze it, we used the metric suggested by Avelino et al. [9]. Truck Factor Developers Detachment (TFDD) is the event in which the core developers abandon a project as if a virtual truck had hit them [8]. We used instances of TFDD identified by Avelino et al. and matched them with the GitHub behavior [8]. As expected, TFDD is a traumatic event for projects, and 59% of them do not survive it.

When comparing 1-month windows around a TFDD, the average number of commits is reduced by 1 percentage point. There is also an average reduction of 3 percentage points in refactoring, implying a small decrease in quality improvement effort. At the same time, the CCP improves (decreases) by 5 percentage points. Assuming that quality is not improved as a result of a TFDD, a more reasonable explanation is that bug detection efficiency was reduced. But even the damage from the traumatic loss of the core developers is only 5 percentage points.

The above cases happen in identifiable conditions, and therefore could be filtered out. But since they happen in extreme, rare cases, we choose to leave them, and gain an analysis that, though slightly biased, represents the general behaviour of projects.

4.3 CCP as a Quality Metric

The most important property of a metric is obviously its validity, that it reflects the measured concept. There is a challenge in showing that CCP measures quality since there is no agreed definition of quality.

To circumvent this, we checked references to low quality in commit messages and correlated them with CCP (Figs. 2 and 3). The specific terms checked were direct references to "low quality", and related terms like "code smell" [34], [59], [101], [104], [112], and "technical debt" [26], [62], [102]. In addition, swearing is also a common way to express dissatisfaction, with millions of occurrences compared to only hundreds or thousands for the technical terms. For files, we considered files with 10+ commits and compared those with at least 10% occurrences of the term to the rest. Projects may have a lot of commits, and the vast majority do not contain the terms. Instead of a ratio, we therefore consider a project to contain a term if it has at least 10 occurrences. As the figures show, when the terms appear, the CCP is higher (sometimes many times higher). Thus, our quality metric agrees with the opinions of the projects' developers regarding quality.

Figure 2. CCP of files with or without different quality terms.

Figure 3. CCP of projects with or without different quality terms.

To verify this result, we attempted to use negative controls. A negative control is an item that should be indifferent to the analysis. In our case, it should be a term not related to quality. We chose "algorithm" and "function" as such terms. The verification worked for "algorithm" at the file level: files with and without this term had practically the same CCP. But files with "function" had a much higher CCP than files without it, and projects with both terms had a higher CCP than without them. Possible reasons are some relation to quality (e.g., algorithmic oriented projects are harder) or biases (e.g., object-oriented languages tend to use the term "method" rather than "function"). Anyway, it is clear that the difference in "low quality" is much larger, and there is a large difference in the other terms too. Note that this investigation is not completely independent. While the quality terms used here are different from those used for the classification of corrective commits, we still use the same data source.

An additional important attribute of metrics is that they be stable. We estimate stability by comparing the CCP of the same project in adjacent years, from 2014 to 2019. Overall, the quality of the projects is stable over time. The Pearson correlation between the CCP of the same project in two successive years, with 200 or more commits in each, is 0.86. The average CCP, using all commits from all projects, was 22.7% in 2018 and 22.3% in 2019. Looking at projects, the CCP grew on average by 0.6 percentage points from year to year, which might reflect a slow decrease in quality. This average hides both increases and decreases; the average absolute difference in CCP was 5.5 percentage points. Compared to the CCP distribution presented in Table 4 below, the per project change is very small.

5 ASSOCIATION OF CCP WITH PROJECT ATTRIBUTES

To further support the claim that CCP is related to quality, we studied the correlations of CCP with various notions of quality reflected in project attributes. To strengthen the results beyond mere correlations we control for variables which might influence the results, such as project age and the number of developers. We also use co-change analysis and "twin" analysis, which show that the correlations are consistent and unlikely to be random.

5.1 Methodology

Our results are in the form of correlations between CCP and other metrics. For example, we show that projects with shorter files tend to have a lower CCP. These correlations are informative and actionable, e.g., enabling a developer to focus on longer files during testing and refactoring. But correlation is not causation, so we cannot say conclusively that longer files cause a higher propensity for bugs that need to be fixed. Showing causality requires experiments in which we perform the change, which we leave for future work. The correlations that we find indicate that a search for causality might be fruitful and could motivate changes in development practices that may lead to improved software quality.

In order to make the results stronger than mere correlation, we use several methods in the analysis. We control the results to see that the relation between A and B is not due to C. In particular we control for the developer, by observing the behaviour of the same developer in different projects. This allows us to separate the influence of the developer and the project. We use co-change over time analysis in order to see to what extent a change in one metric is related to a change in the other metric.

The distributions we examined tended to have some outliers that are much higher than the mean and the majority of the samples. Including outliers in the analysis might distort the results. In order to reduce the destabilizing effect of outliers, we applied Winsorizing [43]. We used one-sided Winsorizing, where all values above a certain threshold are set to this threshold. We do this for the top 1% of the results throughout, to avoid the need to identify outliers and define a rule for adjusting the threshold for each specific case. In the rest of the paper we use the term capping (a common synonym) for this action. In addition, we check whether the metrics are stable across years. A reliable metric applied to clean data is expected to provide similar results in successive years.
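The capping step is straightforward; a minimal NumPy sketch of the one-sided Winsorizing used here (threshold at the 99th percentile, values hypothetical) is:

    import numpy as np

    def cap_top_percent(values, upper_quantile=0.99):
        # One-sided Winsorizing: values above the threshold are set to the threshold.
        values = np.asarray(values, dtype=float)
        threshold = np.quantile(values, upper_quantile)
        return np.minimum(values, threshold)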
Results are computed on 2019 active projects, and specifically on projects whose CCP is in the valid domain. We didn't work with version releases since we work with thousands of projects whose releases are not clearly marked. Note that in projects doing continuous development, the concept of a release is no longer applicable.

5.1.1 Controlled Variables: Project Age, Number of Developers, and Programming Language

Our goal is to find out how to improve software development. We would like to provide actionable recommendations for better software development. However, there are factors that influence software quality that are hard to change. It is not that helpful for an ongoing project to find out that a different programming language is indicative of a lower bug rate. Yet, we examine the effect of some variables that influence the quality yet are hard to change. We control them in the rest of the analysis to validate that the results hold. We do the control by conditioning on the relevant variable and checking if the relations found in general hold while controlling too. We don't control by more than one variable at a time since our data set is rather small and controlling leads to smaller data sets, making the results less robust to noise.

Lehman's laws of software evolution imply that quality may have a negative correlation with the age of a project [63], [64]. We checked this on our dataset. We first filtered out projects that started before 2008 (the beginning of GitHub). For the remaining projects, we checked their CCP at each year. Figure 4 shows that CCP indeed tends to increase slightly with age. In the first year, the average CCP is 0.18. There is then a generally upward trend, getting to an average of 0.23 in 10 years. Note that there is a survival bias in the data presented since many projects do not reach high age.

Figure 4. CCP distribution (during 2019) in projects of different ages. In this and following figures, each boxplot shows the 5, 25, 50, 75, and 95 percentiles. The dashed line represents the mean.

In order to see that our results are not due to the influence of age, we divided the projects into age groups. Those started earlier than 2008 were excluded, those started in 2018–2019 (23%) are considered to be young, the next, from 2016–2017 (40%), are medium, and those from 2008–2015 (37%) are old. When we obtained a result (e.g., correlation between coupling and CCP), we checked if the result holds for each of the groups separately.

The number of developers, via some influence mechanisms (e.g., ownership), was investigated as a quality factor and it seems that there is some relation to quality [14], [79], [107]. The number of developers and CCP have a Pearson correlation of 0.12. The number of developers can reach very high values and therefore be very influential.

Fig. 5 shows that percentiles of the CCP distribution increase monotonically with the number of developers. Many explanations have been given for the quality reduction as the number of developers increases. It might be simply a proxy to the project size (i.e. to the LOC). It might be due to the increased communication complexity and the difficulty to coordinate multiple developers, as suggested by Brooks in "The Mythical Man Month" [21]. Part of it might also be a reflection of Linus's law, as discussed in Section 4.2.2.

Figure 5. CCP distribution for projects with different numbers of developers.

We control for the number of developers by grouping the 25% of projects with the fewest developers as few (at most 10), the next 50% as intermediate (at most 80), and the rest as numerous, and verifying that results hold for each such group.

Results regarding the influence of programming language are presented below in Section 5.2.4. We show that the projects written in different programming languages exhibit somewhat different distributions of CCP. We therefore control for the programming language in order to see that our results remain valid for each language individually.

5.1.2 Co-change Over Time

While experiments can help to determine causality, they are based on few cases and expensive. On the other hand, we have access to plenty of observations, in which we can identify correlations. While causal relations tend to lead to correlation, non-causal relations might also lead to correlations due to various reasons. We would like to use an analysis that will help to filter out non-causal relations. By that we will be left with a smaller set of more likely relations to be further investigated for causality.

When two metrics change simultaneously, it is less likely to be accidental. Hence, we track the metrics over time in order to see how their changes match. We create pairs of the same project in two consecutive years. For each pair we mark if the first and second metrics improved. We observe the following co-changes. The ratio of improvement match (the equivalent of accuracy in supervised learning) is an indication of related changes. Denote the event that metric i improved from one year to the next by mi↑. The probability P(mj↑ | mi↑) (the equivalent of precision in supervised learning) indicates how likely we are to observe an improvement in metric j knowing of an improvement in metric i. It might be that we will observe high precision but it will be simply since P(mj↑) is high. In order to exclude this possibility, we also observe the precision lift, P(mj↑ | mi↑)/P(mj↑) − 1. Note that lift cannot be used to identify the causality direction since it is symmetric:

    P(mj↑ | mi↑) / P(mj↑) = P(mi↑ ∧ mj↑) / (P(mi↑) · P(mj↑)) = P(mi↑ | mj↑) / P(mi↑)    (6)
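As an illustrative sketch, assuming the project-year pairs are available as boolean improvement flags for the two metrics (an assumed input format, not the authors' code), precision and lift can be computed as follows.

    def precision_and_lift(pairs):
        """pairs: iterable of (m_i_improved, m_j_improved) booleans, one per project pair."""
        pairs = list(pairs)
        p_j = sum(1 for _, j in pairs if j) / len(pairs)       # P(m_j improved)
        given_i = [j for i, j in pairs if i]                   # condition on m_i improving
        precision = sum(given_i) / len(given_i)                # P(m_j up | m_i up)
        lift = precision / p_j - 1                             # the lift of eq. (6)
        return precision, lift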
If an improvement in metric i indeed causes the improvement in metric j, we expect high precision and lift. Since small changes might be accidental, we also investigate improvements above a certain threshold. There is a trade-off here since given a high threshold the improvement is clear yet the number of cases we consider is smaller. Another trade-off comes from how far in the past we track the co-changes. The earlier we go, the more data we have. On the other hand, this will increase the weight of old projects, and might subject the analysis to changes in software development practices over time and to data quality problems. We chose a scope of 5 years, avoiding looking before 2014.

5.1.3 Controlling the Developer

Measured metric results (e.g., development speed, low coupling) might be due to the developers working on the project (e.g., skill, motivation) or due to the project environment (e.g., processes, technical debt). To separate the influence of developers and environment, we checked the performance of developers active in more than one project in our data set. By fixing a single developer and comparing the developer's activity in different projects, we can investigate the influence of the project. Note that a developer active in n projects will generate O(n²) project pairs ("twins") to compare.

We considered only involved developers, committing at least 12 times per year, otherwise the results might be misleading. While this omits 62% of the developers, they are responsible for only 6% of the commits.

Consider development speed as an example. If high speed is due to the project environment, in high speed projects every developer is expected to be faster than himself in other projects. This control resembles twin experiments, popular in psychology, where a behavior of interest is observed on twins. Since twins have a very close genetic background, a difference in their behavior is more likely to be due to another factor (e.g., being raised in different families).

Assume that performance on project A is in general better than on project B. We consider developers that contributed to both projects, and check how often they are better in project A than themselves in project B (formally, the probability that a developer is better in project A than in project B given that project A is better than project B). This is equivalent to precision in supervised learning, where the project improvement is the classifier and the developer improvement is the concept. In some cases, a small difference might be accidental. Therefore we require a large difference between the projects and between the developer performance (e.g., at least 10 commits per year difference, or more formally, the probability that a developer committed at least 10 times more in project A than in project B given that the average number of commits per developer in project A is at least 10 commits higher than in project B).
5.1.4 Selection of Projects

In 2018 GitHub published that they had 100 million projects. The BigQuery GitHub schema contains about 2.5 million public projects prior to 2020. But the vast majority are not appropriate for studies of software engineering, being small, non-recent, or not even code.

In order to omit inactive or small projects where estimation might be noisy, we defined our scope to be all open source projects included in GitHub's BigQuery data with 200+ commits in 2019. We selected a threshold of 200 to have enough data per project, yet have enough projects above the threshold. There are 14,749 such projects (Fig. 6).

Figure 6. Process for selecting projects for analysis: GitHub projects in BigQuery (2,500,000) → exclude small or non-recent projects → 200+ commits in 2019 (14,749) → exclude forks → large active non-fork (9,481) → exclude related projects → large active unique (8,588) → exclude invalid CCP result → final study set (7,557).

However, this set is redundant in the sense that some projects are closely related [54]. The first step to reduce redundancy is to exclude projects marked in the GitHub API as being forks of other projects. This reduced the number to 9,481 projects. Sometimes extensive amounts of code are cloned without actual forking. Such code cloning is prevalent and might impact analysis [2], [35], [68]. Using commits to identify relationships [72], we excluded dominated projects, defined to have more than 50 common commits with another, larger project, in 2019. Last, we identified projects sharing the same name (e.g., 'spark') and preferred those that belonged to the user with more projects (e.g., 'apache'). After the redundant projects removal, we were left with 8,588 projects. But calculating the CCP on some of these led to invalid values as described above. For analysis purposes we therefore consider only projects where CCP is in the valid range, whose number is 7,557.

5.2 Results

The following examples aim to show the applicability of CCP. We compute the CCP of many projects and produce the CCP distribution. We then demonstrate associations between high quality, as represented by CCP, and short source code file length, coupling, and programming language. We also investigate possible implications like developer engagement and development speed.

5.2.1 The Distribution of CCP per Project

Given the ability to identify corrective commits, we can classify the commits of each project and estimate the distribution of CCP over the projects' population.

Table 4
CCP distribution in active GitHub projects.

             Full data set (8,588 projects)    CCP ∈ [0, 1] (7,557 projects)
Percentile   Hit rate    CCP est.              Hit rate    CCP est.
10           0.34        0.38                  0.35        0.39
20           0.28        0.30                  0.29        0.32
30           0.24        0.25                  0.26        0.27
40           0.21        0.21                  0.22        0.23
50           0.18        0.18                  0.20        0.20
60           0.15        0.14                  0.17        0.17
70           0.12        0.10                  0.15        0.13
80           0.09        0.06                  0.12        0.10
90           0.03        -0.02                 0.09        0.06
95           0.00        -0.05                 0.07        0.04

Table 4 shows the distribution of hit rates and CCP estimates on the GitHub projects with 200+ commits in 2019, with redundant repositories (representing the same project) excluded. The hit rate represents the fraction of commits identified as corrective by the linguistic model, and the CCP is the maximum likelihood estimation. The top 10% of projects have a CCP of up to 0.06. The median project has a CCP of 0.2, more than three times the top projects' CCP. Interestingly, Lientz et al. reported a median of 0.17 in 1978, based on a survey of 69 projects [66]. The bottom 10% have a CCP of 0.39 or more, more than 6 times higher than the top 10%.

Given the distribution of CCP, any developer can find the placement of his own project relative to the whole community. The classification of commits can be obtained by linking them to tickets in the ticket-handling system (such as Jira or Bugzilla). For projects in which there is a single commit per ticket, or close to that, one can compute the CCP in the ticket-handling system directly, by calculating the ratio of bug tickets. Hence, having full access to a project, one can compute the exact CCP, rather than its maximum likelihood estimation.

Comparing the project's CCP to the distribution in the last column of Table 4 provides an indication of the project's code quality and division of effort calibrated with respect to other projects.
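For a single project with full repository access, a rough local CCP computation is possible from the commit log alone. The sketch below shells out to git and reuses the classifier and MLE helpers sketched in Sections 3.2 and 3.3 (is_corrective and ccp_mle are those hypothetical helpers, not a published tool).

    import subprocess

    def project_ccp(repo_path, since="2019-01-01", until="2020-01-01"):
        # Collect commit subjects and bodies for the period, NUL-separated.
        log = subprocess.run(
            ["git", "-C", repo_path, "log", f"--since={since}", f"--until={until}",
             "--pretty=format:%s %b%x00"],
            capture_output=True, text=True, check=True).stdout
        messages = [m.strip() for m in log.split("\x00") if m.strip()]
        hits = sum(is_corrective(m) for m in messages)   # classifier sketch, Section 3.2
        return ccp_mle(hits, len(messages))              # MLE correction, Section 3.3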
Table 4 shows the distribution of hit rates and CCP estimates for the GitHub projects with 200+ commits in 2019, with redundant repositories (representing the same project) excluded. The hit rate is the fraction of commits identified as corrective by the linguistic model, and the CCP is the maximum likelihood estimate derived from it. The top 10% of projects have a CCP of up to 0.06. The median project has a CCP of 0.2, more than three times the top projects' CCP. Interestingly, Lientz et al. reported a median of 0.17 in 1978, based on a survey of 69 projects [66]. The bottom 10% have a CCP of 0.39 or more, more than 6 times higher than the top 10%.

Given the distribution of CCP, any developer can find the placement of his own project relative to the whole community. The classification of commits can also be obtained by linking them to tickets in the ticket-handling system (such as Jira or Bugzilla). For projects in which there is a single commit per ticket, or close to that, one can compute the CCP in the ticket-handling system directly, by calculating the ratio of bug tickets. Hence, having full access to a project, one can compute the exact CCP rather than its maximum likelihood estimate.

Comparing the project's CCP to the distribution in the last column of Table 4 provides an indication of the project's code quality and division of effort, calibrated with respect to other projects.
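The relation between the observed hit rate and the CCP estimate in Table 4 can be illustrated with a small sketch. Assuming the two-parameter model discussed in the threats section (a linguistic classifier characterized by its recall and false positive rate Fpr), the hit rate is a mixture of true and false positives, and inverting that mixture gives a maximum likelihood estimate; hit rates below Fpr then yield the negative, invalid estimates seen in the table. The recall and Fpr values below are placeholders, not the fitted values from the paper.

```python
def ccp_mle(hit_rate, recall=0.80, fpr=0.05):
    """Invert hit_rate = ccp * recall + (1 - ccp) * fpr to estimate CCP.

    recall and fpr are illustrative placeholders, not the fitted values;
    estimates outside [0, 1] flag projects where the model assumptions break.
    """
    return (hit_rate - fpr) / (recall - fpr)

for hit in (0.34, 0.18, 0.03):
    print(hit, round(ccp_mle(hit), 2))  # roughly follows the pattern of Table 4
```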
5.2.2 File Length and CCP
The correlation between long files and an increased number of bugs has been widely investigated, and file length is considered a fundamental influencing metric [37], [67]. The following analysis first averages across the files in each project, and then considers the distribution across projects, so as to avoid giving extra weight to large projects. In order to avoid sensitivity to extreme values, we capped file lengths at 181 KB, the 99th percentile.

In our projects data set, the mean file length was 8.1 KB with a standard deviation of 14.3 KB, a ratio of 1.75 (capped values). Figure 7 shows that the CCP increases with the length. Projects whose average capped file size is in the lower 25% (below 3.2 KB) have an average CCP of 0.19. The last five deciles all have a CCP around 0.23, as if at a certain point a file is "just too long".

Figure 7. CCP distribution for files with different lengths (in KB, capped).

We did not perform a co-change analysis of file length and CCP, since the GitHub BigQuery database stores only the content of the files in the HEAD (last version), and not previous ones. Controlling for project age and number of developers supports the results. When controlling for language, in most languages the top-ranked projects indeed have shorter files. On the other hand, in PHP they are 10% longer, and in JavaScript the file lengths in the top 10% quality projects are 31% higher than in the rest.
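As a concrete illustration of the averaging and capping scheme described at the start of this subsection, the sketch below computes a per-project average file length with each file's size capped (winsorized) at a fixed threshold before averaging. The 181 KB cap follows the text; the data layout and names are hypothetical.

```python
CAP_KB = 181  # 99th percentile of file length, per the text above

def project_avg_file_length(file_lengths_kb, cap=CAP_KB):
    """Average file length of one project, with each file capped at `cap` KB."""
    capped = [min(length, cap) for length in file_lengths_kb]
    return sum(capped) / len(capped)

# Hypothetical projects mapped to their files' lengths in KB.
projects = {"small-files": [1.2, 2.5, 3.0], "huge-files": [250.0, 40.0, 9.0]}
averages = {name: project_avg_file_length(sizes) for name, sizes in projects.items()}
print(averages)  # the distribution of these per-project averages is then analyzed
```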
5.2.3 Coupling and CCP
A commit is a unit of work, ideally reflecting the completion of a task, and it should contain only the files relevant to that task. When many files are needed for a task, this indicates coupling. Therefore, the average number of files in a commit can be used as a metric for coupling [3], [113].

To validate that this metric captures the way developers think about coupling, we compared it to the appearance of the terms "coupled" or "coupling" in the messages of commits containing the file. Out of the files with at least 10 commits, those with a hit rate of at least 0.1 for these terms had an average commit size 45% larger than the others.

When looking at the size of commits, it turns out that corrective commits involve significantly fewer files than other commit types: the average corrective commit size is 3.8 files, while the average non-corrective commit size is 5.5. Therefore, comparing files with different ratios of corrective commits would distort the apparent coupling. To avoid this, we compute the coupling using only non-corrective commits. We define the coupling of a project to be the average coupling of its files.
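As a sketch of how this coupling measure can be computed from version-control data, the snippet below averages the number of files per non-corrective commit for each file and then averages over a project's files, following the definition above. The commit representation is a hypothetical simplification, not the paper's pipeline.

```python
from collections import defaultdict

def project_coupling(commits):
    """commits: list of (files_in_commit, is_corrective) tuples for one project."""
    sizes_per_file = defaultdict(list)
    for files, is_corrective in commits:
        if is_corrective:
            continue  # corrective commits are smaller and would bias the measure
        for f in files:
            sizes_per_file[f].append(len(files))
    file_coupling = {f: sum(s) / len(s) for f, s in sizes_per_file.items()}
    # Project coupling = average coupling of its files.
    return sum(file_coupling.values()) / len(file_coupling)

example = [({"a.py", "b.py"}, False), ({"a.py"}, True), ({"a.py", "b.py", "c.py"}, False)]
print(round(project_coupling(example), 2))
```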
Figure 8. CCP distribution for projects with different average commit sizes (number of files, capped, in non-corrective commits).

Figure 8 presents the results. There is a large difference in commit sizes: the 25% quantile is 3.1 files and the 75% quantile is 7.1. Similarly to the relation of CCP to file sizes, the distribution of CCP in commits above the median size appears to be largely the same, with an average of 0.24. But in smaller commits there is a pronounced correlation between CCP and commit size, and the average CCP in the low-coupling 25% is 0.18. Projects that are in the lower 25% in both file length and coupling have an average CCP of 0.15 and a 29.3% chance to be in the top 10% of the CCP-based quality scale, 3 times more than expected.

When we analyze the co-change of CCP and coupling, the match for any improvement is 52%. A 10-percentage-point reduction in CCP and a one-file reduction in coupling are matched 72% of the time. Given a reduction of coupling by one file, the probability of a CCP reduction of 10 percentage points is 9%, a lift of 32%. Results hold when controlling for language, number of developers, and age, though in some settings the groups are empty or very small.
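The co-change statistics quoted throughout this section (match rates and lifts) can be computed along the lines of the following sketch, which compares year-over-year changes in two metrics for the same projects. The thresholds, the lift definition, and the data layout are illustrative assumptions, not the paper's exact procedure.

```python
def cochange_stats(changes, ccp_threshold=0.10, coupling_threshold=1.0):
    """changes: list of (delta_ccp, delta_coupling) pairs, one per project between two years.

    Returns P(CCP dropped by at least ccp_threshold | coupling dropped by at least
    coupling_threshold) and the lift of that probability over the unconditional one.
    """
    ccp_better = [-d_ccp >= ccp_threshold for d_ccp, _ in changes]
    coupling_better = [-d_cpl >= coupling_threshold for _, d_cpl in changes]
    base = sum(ccp_better) / len(changes)
    conditioned = [c for c, k in zip(ccp_better, coupling_better) if k]
    cond = sum(conditioned) / len(conditioned) if conditioned else float("nan")
    lift = cond / base - 1 if base else float("nan")  # 0.32 would read as "a lift of 32%"
    return cond, lift

toy = [(-0.12, -1.5), (0.03, 2.0), (-0.20, -0.5), (0.00, -1.0), (-0.15, -2.0)]
print(cochange_stats(toy))
```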
In twin experiments, the probability that the developer's coupling is better (lower) in the better project was 49%, a lift of 15%. When the coupling in the better project was better by at least one file, the developer's coupling was better by one file in 33% of the cases, a lift of 72%.
5.2.4 Programming Languages and CCP
Our investigation of programming languages aims to control for their influence on CCP, not to investigate programming languages as a subject. Other than the direct language influence, languages are often used in different domains, and indirectly imply programming cultures and communities. We extracted the 100 most common file name extensions in GitHub, which cover 94% of the files. Of these, 28 extensions are of Turing-complete programming languages (i.e., excluding languages like SQL). We consider a language to be the dominant language of a project if above 80% of the files are in this language. There were 5,407 projects with a dominant language out of the 7,557 being studied. Figure 9 shows the CDFs of the CCP of projects in major languages.

Figure 9. Cumulative distribution of CCP by language. Distributions shifted to the right tend to have higher CCP.

The figure focuses on the high to medium quality region (excluding the highest CCPs). For averages see Table 5. All languages cover a wide and overlapping range of CCP, and in all languages one can write high quality code. The fewest bugs occurred in Shell. This is an indication of the need to analyze quality carefully, as Shell is used to write scripts and should not be compared directly with languages used to write, for example, real-time applications. Projects in JavaScript, and to a somewhat lesser degree in C#, tend to have lower CCPs. Higher CCPs occur in C++ and, towards the tail of the distribution, in PHP. The rest of the languages are usually in between, with changing regions of better performance.

In order to verify that the differences are not accidental, we split the projects by language and examined their average CCP. An ANOVA test [32] led to an F-statistic of 8.3, indicating that language indeed has a substantial effect, with a p-value around 10^-9. Hence, as Table 5 shows, there are statistically significant differences among the programming languages, yet compared to the range of the CCP distribution they are small.

Table 5
CCP and development speed (commits per year of involved developers) per language. Values are averages ± standard errors.

Language    Projects   CCP            Speed      Speed in top 10%   Speed in others
Shell       146        0.18 ± 0.010   171 ± 10   185 ± 29           169 ± 11
JavaScript  1342       0.20 ± 0.004   156 ± 3    166 ± 8            154 ± 3
C#          315        0.21 ± 0.008   181 ± 6    207 ± 27           178 ± 7
Python      1069       0.22 ± 0.004   139 ± 3    177 ± 19           137 ± 3
Java        764        0.22 ± 0.005   148 ± 4    205 ± 17           143 ± 4
C++         341        0.24 ± 0.007   201 ± 7    324 ± 33           196 ± 7
PHP         326        0.25 ± 0.009   168 ± 6    180 ± 22           167 ± 6

Of course, the above is not a full comparison of programming languages (see [12], [78], [82], [86] for comparisons and the difficulties involved in making them). Many factors (e.g., being typed, memory allocation handling, compiled vs. dynamic) might cause the differences in the languages' CCP. Our results agree with the results of [12], [86], indicating that the difference between languages is usually small and that C++ has a relatively high CCP.
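An ANOVA of the kind mentioned above can be reproduced in outline with standard tooling; the sketch below groups per-project CCP values by dominant language and runs a one-way ANOVA. The data are made-up placeholders, and this is not the paper's analysis script.

```python
from scipy import stats

# Hypothetical per-project CCP values, grouped by dominant language.
ccp_by_language = {
    "JavaScript": [0.12, 0.18, 0.22, 0.25],
    "Python": [0.15, 0.21, 0.24, 0.28],
    "C++": [0.20, 0.24, 0.27, 0.31],
}

f_stat, p_value = stats.f_oneway(*ccp_by_language.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# A small p-value indicates the mean CCP differs across languages,
# even if the effect is small compared to the overall CCP range.
```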
5.2.5 Developer Engagement and CCP
The relation between churn (developers abandoning the project) and quality steps out of the technical field and involves human psychology. Motivation influences performance [22], [110]. Argyle investigated the relation between developers' happiness and their job satisfaction and work performance, showing "modestly positive correlations with productivity, absenteeism and labour turnover" [7]. In the other direction, Ghayyur et al. conducted a survey in which 72% claimed that poor code quality is demotivating [36]. Hence, quality might be both the outcome and the cause of motivation.
Figure 10. Developer retention per CCP decile. Note the change in the median.

We checked the retention of involved developers, where retention is quantified as the percentage of developers that continue to work on the project in the next year, averaged over all years (Figure 10). Note that the median is 100% retention in all four top deciles, decreases over the next three, and stabilizes again at about 85% in the last three CCP deciles.
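A minimal sketch of the retention computation just described: for each pair of consecutive years it takes the fraction of that year's developers who also appear in the next year, and averages these fractions. The data layout (a mapping from year to the set of active developers) is an assumption for illustration.

```python
def retention(devs_by_year):
    """devs_by_year: dict mapping year -> set of developers active in that year."""
    years = sorted(devs_by_year)
    rates = []
    for prev, nxt in zip(years, years[1:]):
        current = devs_by_year[prev]
        if not current:
            continue
        stayed = current & devs_by_year[nxt]
        rates.append(len(stayed) / len(current))
    return sum(rates) / len(rates) if rates else float("nan")

example = {2017: {"a", "b", "c"}, 2018: {"a", "b", "d"}, 2019: {"b", "d", "e"}}
print(retention(example))  # churn is simply 1 - retention
```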
When looking at the co-change of CCP and churn (1 − retention), the match is only 51% for any change, but 79% for a change of at least 10 percentage points in each metric. An improvement of 10 percentage points in CCP leads to a significant improvement in churn in 21% of the cases, a lift of 17%. When controlling for language, age group, or developer number group, we still get matching co-change.

Figure 11. On-boarding per CCP decile.

Acquiring new developers complements the retention of existing ones. We define the on-boarding ratio as the average percentage of new developers becoming involved. Figure 11 shows that the higher the CCP, the lower the on-boarding; the average on-boarding is doubled in the first decile compared to the last. In order to be more robust to noise, we consider projects that have at least 10 new developers. When looking at the co-change of on-boarding and CCP, the match is only 53% for any change, but 85% for a change of at least 10 percentage points in both metrics. An improvement of 10 percentage points in CCP leads to a significant improvement in on-boarding in 10% of the cases, a lift of 18%. When controlling for language, the results fit the relation except in PHP and Shell (which had a small number of cases). Results hold for all age groups. For size, they hold for intermediate and numerous numbers of developers; by definition, with few developers there are no projects with at least 10 new developers.
5.2.6 Development Speed and CCP
Like quality, the definition of productivity is subjective and not clear-cut. Measures including LOC [70], modules [74], and function points [49], [69] per time unit have been suggested and criticized [56], [57]. We chose to measure development speed by the number of commits per developer per year. This is an output-per-time measure, and the inverse of the time to complete a task, investigated in the classical work of Sackman et al. [91]. The number of commits is correlated with self-rated productivity [77] and with team leads' perception of productivity [81]. We chose commits as the output unit since a commit is a unit of work, its computation is easy and objective, and it is not biased toward implementation details.

The number of commits per project per year is stable, with a Pearson correlation of 0.71. The number of developers per year is also stable, with a Pearson correlation of 0.81. To study development speed we omit developers with fewer than 12 commits per year, since they are not involved developers. We also capped the number of commits per developer at 500, about the 99th percentile of the developers' contributions. While commits by users below the 99th percentile are only 73% of the total, excluding the long tail (which reaches 300,000 commits) is justified because it most probably does not represent usual manual human effort. Using both restrictions, the correlation of commits per developer in adjacent years is 0.62 (compared to 0.59 without them), which is reasonably stable.
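The speed measure described above can be sketched as follows: developers with fewer than 12 commits in a year are dropped as not involved, individual counts are capped at 500, and the per-developer average is the project's speed for that year. The thresholds follow the text; the input format is an assumption.

```python
MIN_COMMITS = 12   # below this a developer is considered not involved
CAP = 500          # roughly the 99th percentile of yearly commits per developer

def yearly_speed(commits_per_dev):
    """commits_per_dev: dict developer -> number of commits in one project-year."""
    involved = [min(c, CAP) for c in commits_per_dev.values() if c >= MIN_COMMITS]
    return sum(involved) / len(involved) if involved else float("nan")

print(yearly_speed({"a": 3, "b": 40, "c": 250, "d": 1200}))  # -> (40 + 250 + 500) / 3
```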
As Figure 12 shows, there is a steady decrease of speed with CCP. The average speed in the first decile is 56% higher than in the last one. As with file length, speed differs in projects written in different languages. Yet in all of them higher quality goes with higher speed (see Table 5).

Figure 12. Distribution of commits of involved developers (capped) per CCP decile.

We conducted twin experiments to control for the developer. When a developer works in a faster project, he is faster than himself in other projects in 51% of the cases, an 8% lift. When the project speed is higher by 10 commits, the developer has a 42% chance to also be 10 commits faster than himself, a lift of 11%.

We also investigated the co-change of CCP and speed. In 52% of the cases, an improvement in CCP goes with an improvement in speed. Given a CCP improvement, there is a speed improvement in 53% of the cases, a lift of 4%. Given an improvement of 10 percentage points in CCP, the probability of 10 more commits per year per developer is 53%, and the lift is 2%. In the other direction, given an improvement in speed, the probability of a significant improvement in CCP drops to 7%. Hence, knowing of a significant improvement in CCP, a speed improvement is likely; but knowing of a speed improvement, a significant CCP improvement is very unlikely.
When controlling for age or language, the results hold. Results also hold for the intermediate and numerous developer groups, with a positive lift when the change is significant, but a -3% lift in the few-developers group for any change.

There are two differing theories regarding the relation between quality and productivity. The classical Iron Triangle [80] sees them as a trade-off: investment in quality comes at the expense of productivity. On the other hand, "Quality is Free" claims that investment in quality is beneficial and leads to increased productivity [25]. Our results in Table 5 enable a quantitative investigation of this issue, where speed is operationalized by commits per year and quality by CCP.

As "Quality is Free" predicts, we find that in high quality projects the development speed is much higher. The twin experiments help to reduce noise, demonstrating that development speed is a characteristic of the project. If this correlation is indeed due to causality, then when you improve quality you also gain speed, enjoying both worlds. This relation between quality and development speed is also supported by Jones's research on time wasted due to low quality [51], [53] and by developers performing the same tasks during "Personal Software Process" training [98].

6 THREATS TO VALIDITY
There is no agreed quantitative definition of quality, hence we cannot ensure that a certain metric measures quality. In order to cope with this, we showed that our metric agrees with developers' comments on quality and is associated with variables that are believed to reflect or influence quality.

A specific threat to validity in our work is related to construct validity. We set out to measure the Corrective Commit Probability and do so based on a linguistic analysis. We investigated whether it is indeed accurate and precise in Section 4.1.

The number of test labeled commits is small, about 1,000, hence there is a question of how well they represent the underlying distribution. We evaluated the sensitivity to changes in the data. Since the model was built mainly using domain knowledge and a different data set, we could use a small training set. Therefore, we preferred to use most of the labels as a test set for the estimation of the variables and to improve the estimation of the recall and Fpr.

The labeling was done manually by humans, who are prone to error and subjectivity. In order to make the labeling stricter, we used a labeling protocol. Out of the samples, 400 were labeled by three annotators independently. The labels were compared in order to evaluate the amount of uncertainty.

Other than uncertainty due to different opinions, there was uncertainty due to the lack of information in the commit message. For example, the message "Changed result default value to False" describes a change well but leaves us uncertain regarding its nature. We used the gold standard labels to verify that this is rare.

Our main assumption is the conditional independence between the corrective commits (code) and the commit messages describing them (process), given our concept (the commit being corrective, namely a bug fix). This means that the model performance is the same over all the projects, and a different hit rate is due to a different CCP. This assumption is invalid in some cases. For example, projects documented in a language other than English will appear to have no bugs. Non-English commit messages are relatively easy to identify; more problematic are differences in English fluency. Native English speakers are less likely to have spelling mistakes and typos. A spelling mistake might prevent our model from identifying the textual pattern, thus lowering the recall. This would lead to an illusory benefit of spelling mistakes, misleading us to think that people who tend to have more spelling mistakes tend to have fewer bugs.

Another threat to validity is due to the family of models that we chose to use. We chose to represent the model using two parameters, recall and Fpr, following the guidance of Occam's razor and resorting to a more complex solution only when a need arises. However, many other families of models are possible. We could consider different sub-models for various message lengths, a model that predicts the commit category instead of the Boolean "is corrective" concept, etc. Each family will have different parameters and behavior. More complex models have more representational power but are harder to learn and require more samples.

A common assumption in statistical analysis is the IID assumption (Independent and Identically Distributed random variables). This assumption clearly does not hold for GitHub projects. We found that forks, projects based on others and sharing a common history, were 35% of the active projects. We therefore removed forks, but projects might still share code and commits. Also, older projects, with more commits and users, have higher weight in twin studies and co-change analysis.

Our metric focuses on the fraction of commits that correct bugs. One can claim that the fraction of commits that induce bugs is a better quality metric. In principle, this can be done using the SZZ algorithm (the common algorithm for identifying bug-inducing commits [99]). But note that SZZ is applied after the bug was identified and fixed. Thus, the inducing and fixing commits are actually expected to give similar results.
Another major threat concerns internal validity. Our basic assumption is that corrective commits reflect bugs, and therefore a low CCP is indicative of few bugs and high quality. But a low CCP can also result from a disregard for fixing bugs or an inability to do so. On the other hand, in extremely popular projects Linus's law, "given enough eyeballs, all bugs are shallow" [87], might lead to more effective bug identification and a high CCP. Another unwanted effect of using corrective commits is that improvements in bug detection (e.g., by doubling the QA department) will look like a reduction in quality. The correlations found between CCP and various other indicators of software quality add confidence that CCP is indeed valid. We identify such cases and discuss them in Section 4.2.2.

Focusing on corrective commits also leads to several biases. Most obviously, existing bugs that have not been found yet are unknown. Finding and fixing bugs might take months [60]. When projects differ in the time needed to identify a bug, our results will be biased.

Software development is usually done subject to a lack of time and resources. Due to that, known bugs of low severity are often not fixed. While this leads to a bias, it can be considered a desirable one, as it focuses on the more important bugs.

A threat to external validity might arise due to the use of open source projects, which might not represent projects done in software companies. We feel that the open source projects are of significant interest on their own. Other than that, the projects we analyzed include projects of Google, Microsoft, Apple, etc., so at least part of the area is covered.

Time, cost, and development speed are problematic to measure. We use commits as a proxy for work since they typically represent tasks. Yet, tasks differ in size and difficulty, and their translation to commits might differ due to project or developer habits. Commits may also include a mix of different tasks. In order to reduce the influence of project culture, we aggregated many commits. In order to eliminate the effect of personal habits, we used twin experiments. Other than that, the number of commits per time unit is correlated with developers' self-rated productivity [77] and with team leads' perception of productivity [81], hence it provides a good computable estimator.

7 CONCLUSIONS
We presented a novel way to measure projects' code quality, using the Corrective Commit Probability (CCP). We use the consensus that bugs are bad and indicative of low quality to base a metric on them. We started off with a linguistic model to identify corrective commits, significantly improving prior work [3], [5], [45], [65], and developed a mathematical method to find the most likely CCP given the model's hit rate. The CCP metric has the following properties:
• It matches developers' references to quality.
• It is stable: it reflects the character of a project and does not change much from year to year.
• It is informative, in that it has a wide range of values and distinguishes between projects.

We estimated the CCP of all 7,557 large active projects in BigQuery's GitHub data. This created a quality scale, enabling observations on the state of the practice. Using this scale, developers can compare their project's quality (as reflected by CCP) to the community. A low percentile suggests the need for quality improvement efforts.

We checked the sensitivity of our assumptions and noticed that projects in the theoretically invalid CCP range indeed tend to be not in English or not software. A difference in bug detection efficiency was demonstrated in highly popular projects, supporting Linus's law [87].

Our results also helped demonstrate that "Quality is Free". Instead of a trade-off between quality and development speed, we find that they are positively correlated, and this was further supported by co-change analysis and twin experiments. Thus, investing in quality may actually reduce schedules rather than extend them.

We also show a correlation between short files, low coupling, and quality, supporting well-known recommendations for quality improvement. Hence, if the discussed relations are indeed causal, we have a simple way to reach high quality, which will also benefit a project in higher productivity, better on-boarding, and lower churn.

Supplementary Materials
The language models are available at https://github.com/evidencebp/commit-classification. Utilities used for the analysis (e.g., co-change) are at https://github.com/evidencebp/analysis_utils. All other supplementary materials can be found at https://github.com/evidencebp/corrective-commit-probability.

Acknowledgements
This research was supported by the ISRAEL SCIENCE FOUNDATION (grant No. 832/18). We thank Amiram Yehudai and Stanislav Levin for providing us their data set of labeled commits [65]. We thank Guilherme Avelino for drawing our attention to the importance of Truck Factor Developers Detachment (TFDD) and for providing a data set [8].

REFERENCES
[1] H. Al-Kilidar, K. Cox, and B. Kitchenham. The use and usefulness of the ISO/IEC 9126 quality standard. In Intl. Symp. Empirical Softw. Eng., pages 126–132, Nov 2005.
[2] M. Allamanis. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! 2019, pages 143–153, New York, NY, USA, 2019. Association for Computing Machinery.
[3] I. Amit and D. G. Feitelson. Which refactoring reduces bug rate? In Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE'19, pages 12–15, New York, NY, USA, 2019. ACM.
[4] I. Amit, E. Firstenberg, and Y. Meshi. Framework for semi-supervised learning when no labeled data is given. U.S. patent application #US20190164086A1, 2017.
[5] J. J. Amor, G. Robles, J. M. Gonzalez-Barahona, and A. Navarro. Discriminating development activities in versioning systems: A case study, Jan 2006.
[6] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y.-G. Guéhéneuc. Is it a bug or an enhancement? A text-based approach to classify change requests. In Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON '08, New York, NY, USA, 2008. Association for Computing Machinery.
[7] M. Argyle. Do happy workers work harder? The effect of job satisfaction on job performance. In R. Veenhoven, editor, How harmful is happiness? Consequences of enjoying life or not. Universitaire Pers, Rotterdam, The Netherlands, 1989.
[8] G. Avelino, E. Constantinou, M. T. Valente, and A. Serebrenik. On the abandonment and survival of open source projects: An empirical investigation. CoRR, abs/1906.08058, 2019.
[9] G. Avelino, L. T. Passos, A. C. Hora, and M. T. Valente. A novel approach for estimating truck factors. CoRR, abs/1604.06766, 2016.
[10] V. R. Basili, L. C. Briand, and W. L. Melo. A validation of object-oriented design metrics as quality indicators. IEEE Transactions on Software Engineering, 22(10):751–761, Oct 1996.
[11] G. Bavota, A. De Lucia, M. Di Penta, R. Oliveto, and F. Palomba. An experimental investigation on the innate relationship between quality and refactoring. J. Syst. & Softw., 107:1–14, Sep 2015.
[12] E. D. Berger, C. Hollenbeck, P. Maj, O. Vitek, and J. Vitek. On the impact of programming languages on code quality: A reproduction study. ACM Trans. Program. Lang. Syst., 41(4), Oct 2019.
[13] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu. Fair and balanced?: Bias in bug-fix datasets. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC/FSE '09, pages 121–130, New York, NY, USA, 2009. ACM.
[14] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code! Examining the effects of ownership on software quality. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 4–14, 2011.
[15] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. German, and P. Devanbu. The promises and perils of mining git. In 2009 6th IEEE International Working Conference on Mining Software Repositories, pages 1–10, May 2009.
[16] B. Boehm and V. R. Basili. Software defect reduction top 10 list. Computer, 34(1):135–137, Jan 2001.
[17] B. W. Boehm. Software Engineering Economics. Prentice-Hall, 1981.
[18] B. W. Boehm, J. R. Brown, and M. Lipow. Quantitative evaluation of software quality. In Intl. Conf. Softw. Eng., number 2, pages 592–605, Oct 1976.
[19] B. W. Boehm and P. N. Papaccio. Understanding and controlling software costs. IEEE Transactions on Software Engineering, 14(10):1462–1477, Oct 1988.
[20] G. Box. Robustness in the strategy of scientific model building. In R. L. Launer and G. N. Wilkinson, editors, Robustness in Statistics, pages 201–236. Academic Press, 1979.
[21] F. P. Brooks, Jr. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, 1975.
[22] J. P. Campbell, R. A. McCloy, S. H. Oppler, and C. E. Sager. A theory of performance. In N. Schmitt, W. C. Borman, and Associates, editors, Personnel Selection in Organizations, pages 35–70. Jossey-Bass Pub., 1993.
[23] S. R. Chidamber and C. F. Kemerer. A metrics suite for object oriented design. IEEE Trans. Softw. Eng., 20(6):476–493, Jun 1994.
[24] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
[25] P. Crosby. Quality Is Free: The Art of Making Quality Certain. McGraw-Hill, 1979.
[26] W. Cunningham. The WyCash portfolio management system. In Addendum to the Proceedings on Object-Oriented Programming Systems, Languages, and Applications (Addendum), OOPSLA '92, pages 29–30, New York, NY, USA, 1992. Association for Computing Machinery.
[27] M. D'Ambros, M. Lanza, and R. Robbes. An extensive comparison of bug prediction approaches. In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pages 31–41, May 2010.
[28] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979.
[29] M. Dawson, D. Burrell, E. Rahim, and S. Brewster. Integrating software assurance into the software development life cycle (SDLC). Journal of Information Systems Technology and Planning, 3:49–53, 2010.
[30] G. Dromey. A model for software product quality. IEEE Trans. Softw. Eng., 21(2):146–162, Feb 1995.
[31] B. Efron. Bootstrap Methods: Another Look at the Jackknife, pages 569–593. Springer New York, New York, NY, 1992.
[32] R. Fisher. The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52(2):399–433, 1919.
[33] M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[34] M. Fowler, K. Beck, and W. R. Opdyke. Refactoring: Improving the design of existing code. In 11th European Conference, Jyväskylä, Finland, 1997.
[35] M. Gharehyazie, B. Ray, M. Keshani, M. S. Zavosht, A. Heydarnoori, and V. Filkov. Cross-project code clones in GitHub. Empirical Software Engineering, 24(3):1538–1573, Jun 2019.
[36] S. A. K. Ghayyur, S. Ahmed, S. Ullah, and W. Ahmed. The impact of motivator and demotivator factors on agile software development. International Journal of Advanced Computer Science and Applications, 9(7), 2018.
[37] Y. Gil and G. Lalouche. On the correlation between size and metric validity. Empirical Softw. Eng., 22(5):2585–2611, Oct 2017.
[38] T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. Predicting fault incidence using software change history. IEEE Transactions on Software Engineering, 26(7):653–661, July 2000.
[39] T. Gyimothy, R. Ferenc, and I. Siket. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910, Oct 2005.
[40] R. Hackbarth, A. Mockus, J. Palframan, and R. Sethi. Improving software quality as customers perceive it. IEEE Softw., 33(4):40–45, Jul/Aug 2016.
[41] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng., 38(6):1276–1304, Nov 2012.
[42] M. H. Halstead. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., New York, NY, USA, 1977.
[43] C. Hastings, F. Mosteller, J. W. Tukey, and C. P. Winsor. Low moments for small samples: A comparative study of order statistics. Ann. Math. Statist., 18(3):413–426, 09 1947.
[44] K. Herzig, S. Just, and A. Zeller. It's not a bug, it's a feature: How misclassification impacts bug prediction. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13, pages 392–401, Piscataway, NJ, USA, 2013. IEEE Press.
[45] A. Hindle, D. M. German, M. W. Godfrey, and R. C. Holt. Automatic classification of large changes into maintenance categories. In 2009 IEEE 17th International Conference on Program Comprehension, pages 30–39, May 2009.
[46] D. Hovemeyer and W. Pugh. Finding bugs is easy. SIGPLAN Not., 39(12):92–106, Dec 2004.
[47] ISO/IEC. 9126-1 (2001). Software engineering product quality – Part 1: Quality model. International Organization for Standardization, page 16, 2001.
[48] International Organization for Standardization. Systems and software engineering – systems and software quality requirements and evaluation (SQuaRE) – system and software quality models, 2011.
[49] Z. Jiang, P. Naudé, and C. Comstock. An investigation on the variation of software development productivity. IEEE Transactions on Software Engineering, 1(2):72–81, 2007.
[50] C. Jones. Applied Software Measurement: Assuring Productivity and Quality. McGraw-Hill, Inc., New York, NY, USA, 1991.
[51] C. Jones. Social and technical reasons for software project failures. CrossTalk, The J. Def. Software Eng., 19(6):4–9, 2006.
[52] C. Jones. Software quality in 2012: A survey of the state of the art, 2012. [Online; accessed 24-September-2018].
[53] C. Jones. Wastage: The impact of poor quality on software economics. Retrieved from http://asq.org/pub/sqp/. Software Quality Professional, 18(1):23–32, 2015.
[54] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. German, and D. Damian. The promises and perils of mining GitHub (extended version). Empirical Software Engineering, 01 2015.
[55] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering, 39(6):757–773, June 2013.
[56] C. F. Kemerer. Reliability of function points measurement: A field experiment. Commun. ACM, 36(2):85–97, Feb 1993.
[57] C. F. Kemerer and B. S. Porter. Improving the reliability of function point measurement: An empirical study. IEEE Trans. Softw. Eng., 18(11):1011–1024, Nov 1992.
[58] F. Khomh, T. Dhaliwal, Y. Zou, and B. Adams. Do faster releases improve software quality? An empirical case study of Mozilla Firefox. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR '12, pages 179–188, Piscataway, NJ, USA, 2012. IEEE Press.
[59] F. Khomh, M. Di Penta, and Y.-G. Gueheneuc. An exploratory study of the impact of code smells on software change-proneness. In 2009 16th Working Conference on Reverse Engineering, pages 75–84. IEEE, 2009.
[60] S. Kim and E. J. Whitehead, Jr. How long did it take to fix bugs? In Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR '06, pages 173–174, New York, NY, USA, 2006. ACM.
[61] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller. Predicting faults from cached history. In Proceedings of the 29th International Conference on Software Engineering, ICSE '07, pages 489–498, Washington, DC, USA, 2007. IEEE Computer Society.
[62] P. Kruchten, R. L. Nord, and I. Ozkaya. Technical debt: From metaphor to theory and practice. IEEE Software, 29(6):18–21, 2012.
[63] M. M. Lehman. Programs, life cycles, and laws of software evolution. Proceedings of the IEEE, 68(9):1060–1076, 1980.
[64] M. M. Lehman, J. F. Ramil, P. D. Wernick, D. E. Perry, and W. M. Turski. Metrics and laws of software evolution – the nineties view. In Intl. Software Metrics Symp., number 4, pages 20–32, Nov 1997.
[65] S. Levin and A. Yehudai. Boosting automatic commit classification into maintenance activities by utilizing source code changes. In Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE, pages 97–106, New York, NY, USA, 2017. ACM.
[66] B. P. Lientz, E. B. Swanson, and G. E. Tompkins. Characteristics of application software maintenance. Comm. ACM, 21(6):466–471, Jun 1978.
[67] M. Lipow. Number of faults per line of code. IEEE Transactions on Software Engineering, (4):437–439, 1982.
[68] C. V. Lopes, P. Maj, P. Martins, V. Saini, D. Yang, J. Zitny, H. Sajnani, and J. Vitek. DéjàVu: A map of code duplicates on GitHub. Proc. ACM Program. Lang., 1(OOPSLA), Oct 2017.
[69] K. D. Maxwell and P. Forselius. Benchmarking software development productivity. IEEE Software, 17(1):80–88, Jan 2000.
[70] K. D. Maxwell, L. Van Wassenhove, and S. Dutta. Software development productivity of European space, military, and industrial applications. IEEE Transactions on Software Engineering, 22(10):706–718, Oct 1996.
[71] T. J. McCabe. A complexity measure. IEEE Trans. Softw. Eng., 2(4):308–320, Jul 1976.
[72] A. Mockus, D. Spinellis, Z. Kotti, and G. J. Dusing. A complete set of related git repositories identified via community detection approaches based on shared commits, 2020.
[73] A.-J. Molnar, A. Neamţu, and S. Motogna. Evaluation of software product quality metrics. In E. Damiani, G. Spanoudakis, and L. A. Maciaszek, editors, Evaluation of Novel Approaches to Software Engineering, pages 163–187, Cham, 2020. Springer International Publishing.
[74] S. Morasca and G. Russo. An empirical study of software productivity. In 25th Annual International Computer Software and Applications Conference, COMPSAC 2001, pages 317–322, Oct 2001.
[75] R. Moser, W. Pedrycz, and G. Succi. Analysis of the reliability of a subset of change metrics for defect prediction. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '08, pages 309–311, New York, NY, USA, 2008. ACM.
[76] N. Munaiah, S. Kroh, C. Cabrey, and M. Nagappan. Curating GitHub for engineered software projects. Empirical Software Engineering, 22, 04 2017.
[77] E. Murphy-Hill, C. Jaspan, C. Sadowski, D. C. Shepherd, M. Phillips, C. Winter, A. K. Dolan, E. K. Smith, and M. A. Jorde. What predicts software developers' productivity? Transactions on Software Engineering, 2019.
[78] S. Nanz and C. A. Furia. A comparative study of programming languages in Rosetta Code. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pages 778–788, May 2015.
[79] B. Norick, J. Krohn, E. Howard, B. Welna, and C. Izurieta. Effects of the number of developers on code quality in open source software: A case study. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pages 1–1, 2010.
[80] R. Oisen. Can project management be defined? Project Management Quarterly, 2(1):12–14, 1971.
[81] E. Oliveira, E. Fernandes, I. Steinmacher, M. Cristo, T. Conte, and A. Garcia. Code and commit metrics of developer productivity: A study on team leaders perceptions. Empirical Software Engineering, 04 2020.
[82] L. Prechelt. An empirical comparison of seven programming languages. Computer, 33(10):23–29, Oct 2000.
[83] F. Rahman and P. Devanbu. How, and why, process metrics are better. In 2013 35th International Conference on Software Engineering (ICSE), pages 432–441, May 2013.
[84] F. Rahman, D. Posnett, A. Hindle, E. T. Barr, and P. T. Devanbu. BugCache for inspections: Hit or miss? In SIGSOFT FSE, 2011.
[85] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3567–3575. Curran Associates, Inc., 2016.
[86] B. Ray, D. Posnett, V. Filkov, and P. Devanbu. A large scale study of programming languages and code quality in GitHub. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 155–165, New York, NY, USA, 2014. ACM.
[87] E. Raymond. The cathedral and the bazaar. First Monday, 3(3), 1998.
[88] S. Reddivari and J. Raman. Software quality prediction: An investigation based on machine learning. In 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), pages 115–122, 2019.
[89] H. G. Rice. Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society, 74(2):358–366, 1953.
[90] J. Rosenberg. Some misconceptions about lines of code. In Proceedings Fourth International Software Metrics Symposium, pages 137–142. IEEE, 1997.
[91] H. Sackman, W. J. Erikson, and E. E. Grant. Exploratory experimental studies comparing online and offline programming performance. Commun. ACM, 11(1):3–11, Jan 1968.
[92] S. R. Schach, B. Jin, L. Yu, G. Z. Heller, and J. Offutt. Determining the distribution of maintenance categories: Survey versus measurement. Empirical Softw. Eng., 8(4):351–365, Dec 2003.
[93] N. F. Schneidewind. Body of knowledge for software quality measurement. Computer, 35(2):77–83, Feb 2002.
[94] B. Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2010.
[95] M. Shepperd. A critique of cyclomatic complexity as a software metric. Software Engineering J., 3(2):30–36, Mar 1988.
[96] E. Shihab, A. E. Hassan, B. Adams, and Z. M. Jiang. An industrial study on the risk of software changes. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE '12, pages 62:1–62:11, New York, NY, USA, 2012. ACM.
[97] N. C. Shrikanth and T. Menzies. Assessing practitioner beliefs about software defect prediction. In Intl. Conf. Softw. Eng., number 42, May 2020.
[98] N. C. Shrikanth, W. Nichols, F. M. Fahid, and T. Menzies. Assessing practitioner beliefs about software engineering. arXiv:2006.05060, June 2020.
[99] J. Śliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? SIGSOFT Softw. Eng. Notes, 30(4):1–5, May 2005.
[100] E. B. Swanson. The dimensions of maintenance. In Proceedings of the 2nd International Conference on Software Engineering, ICSE '76, pages 492–497, Los Alamitos, CA, USA, 1976. IEEE Computer Society Press.
[101] S. E. S. Taba, F. Khomh, Y. Zou, A. E. Hassan, and M. Nagappan.
Predicting bugs using antipatterns. In 2013 IEEE International
Conference on Software Maintenance, pages 270–279, 2013.
[102] E. Tom, A. Aurum, and R. Vidgen. An exploration of technical
debt. Journal of Systems and Software, 86(6):1498–1516, 2013.
[103] E. van Emden and L. Moonen. Java quality assurance by
detecting code smells. In Ninth Working Conference on Reverse
Engineering, 2002. Proceedings., pages 97–106, Nov 2002.
[104] E. Van Emden and L. Moonen. Java quality assurance by
detecting code smells. In Ninth Working Conference on Reverse
Engineering, 2002. Proceedings., pages 97–106. IEEE, 2002.
[105] B. Vasilescu, Y. Yu, H. Wang, P. Devanbu, and V. Filkov. Quality
and productivity outcomes relating to continuous integration in
github. In Proceedings of the 2015 10th Joint Meeting on Foundations
of Software Engineering, ESEC/FSE 2015, pages 805–816, New
York, NY, USA, 2015. ACM.
[106] N. Walkinshaw and L. Minku. Are 20% of files responsible for
80% of defects? In Proceedings of the 12th ACM/IEEE International
Symposium on Empirical Software Engineering and Measurement,
ESEM ’18, pages 2:1–2:10, New York, NY, USA, 2018. ACM.
[107] E. Weyuker, T. Ostrand, and R. Bell. Do too many cooks spoil
the broth? using the number of developers to enhance defect
prediction models. Empirical Software Engineering, 13:539–559, 10
2008.
[108] L. Williams and R. Kessler. Pair Programming Illuminated.
Addison-Wesley Longman Publishing Co., Inc., USA, 2002.
[109] A. Wood. Predicting software reliability. Computer, 29(11):69–77,
Nov 1996.
[110] T. A. Wright and R. Cropanzano. Psychological well-being and
job satisfaction as predictors of job performance. Journal of
Occupational Health Psychology, 5:84–94, 2000.
[111] S. Yamada and S. Osaki. Software reliability growth modeling:
Models and applications. IEEE Transactions on Software Engineer-
ing, SE-11(12):1431–1437, Dec 1985.
[112] A. Yamashita and L. Moonen. Do code smells reflect important
maintainability aspects? In 2012 28th IEEE international conference
on software maintenance (ICSM), pages 306–315. IEEE, 2012.
[113] T. Zimmermann, S. Diehl, and A. Zeller. How history justifies
system architecture (or not). In Sixth International Workshop on
Principles of Software Evolution, 2003. Proceedings., pages 73–83,
Sept 2003.