The Corrective Commit Probability Code Quality Metric
arXiv:2007.10912v1 [cs.SE] 21 Jul 2020
Abstract—We present a code quality metric, Corrective Commit Probability (CCP), measuring the probability that a commit reflects corrective maintenance. We show that this metric agrees with developers' concept of quality, is informative, and is stable. Corrective commits are identified by applying a linguistic model to the commit messages. We compute the CCP of all large active GitHub projects (7,557 projects with 200+ commits in 2019). This leads to the creation of a quality scale, suggesting that the bottom 10% of quality projects spend at least 6 times more effort on fixing bugs than the top 10%. Analysis of project attributes shows that lower CCP (higher quality) is associated with smaller files, lower coupling, use of languages like JavaScript and C# as opposed to PHP and C++, fewer developers, lower developer churn, better onboarding, and better productivity. Among other things these results support the "Quality is Free" claim, and suggest that achieving higher quality need not require higher expenses.

1 INTRODUCTION

It is widely agreed in the software engineering community that code quality is of the utmost importance. This has motivated the development of several software quality models (e.g. [18], [30]) and standards. Myriad tools promise developers that using them will improve the quality of their code.

But what exactly is high-quality code? There have been many attempts to identify issues that reflect upon code quality. These generally go under the name of "code smells" [33], [103]. Yet, it is debatable whether commonly used metrics indeed reflect real quality problems [1], [11]. Each smell is limited to identifying a restricted, shallow type of problem. Besides, sometimes a detected smell (e.g., a long method) is actually the correct design due to other considerations.

A more general approach is to focus on an indirect assessment of the software based on the ill-effects of low quality, and in particular on the presence of bugs. There is no debate that bugs are bad, especially bugs reported by customers [40]. By focusing on actual bugs, one is relieved of the need to consider all possible sources of quality problems, also removing any dependence on the implementation of the procedure that finds them. Moreover, approaches based on bugs can apply equally well to different programming languages and projects.

Based on these considerations, we suggest the Corrective Commit Probability (CCP, the probability that a given commit is a bug fix) as a metric of quality.

Corrective maintenance (aka fixing bugs) represents a large fraction of software development, and contributes significantly to software costs [19], [66], [92]. But not all projects are the same: some have many more bugs than others. The propensity for bugs, as reflected by their fixing activity, can therefore be used to represent quality. This can be applied at various resolutions, e.g., a project, a file, or a method. Such application can help spot entities that are bug prone, improving future bug identification [61], [84], [106].

While counting the number of bugs in code is common, disregarding a project's history can be misleading. In the CCP metric we normalize the number of bug fixes by the total number of commits, thereby deriving the probability that a commit is a bug fix. We focus on commits because in contemporary software development a commit is the atomic unit of work.

We identify corrective commits using a linguistic model applied to commit messages, an idea that is commonly used for defect prediction [27], [41], [86]. The linguistic-model prediction marks commit messages as corrective or not, in the spirit of Ratner et al.'s labelling functions [85]. Though our accuracy is significantly higher than in previous work, such predictions are not completely accurate and therefore the model hits do not always coincide with the true corrective commits.

Given an implementation of the CCP metric, we perform a large-scale assessment of GitHub projects. We analyze all 7,557 large active projects (defined to be those with 200+ commits in 2019, excluding redundant projects which might bias our results [15]). We use this, inter alia, to build the distribution of CCP, and find the quality ranking of each project relative to all others. The significant difference in CCP among projects is informative. Software developers can easily compute their own project's CCP. They can thus find where their project is ranked with respect to the community.

Note that CCP provides a retrospective assessment of quality. Being a process metric, it only applies after bugs are found, unlike code metrics which can be applied as the code is written. The CCP metric can be used as a research
tool for the study of different software engineering issues. A simple approach is to observe the CCP given a certain phenomenon (e.g., programming language, coupling). For example, we show below that while investment in quality is often considered to reduce development speed, in practice the development speed is actually higher in high quality projects.

Our main contributions in this research are as follows:
• We define the Corrective Commit Probability (CCP) metric to assess the quality of code. The metric is shown to be correlated with developers' perceptions of quality, is easy to compute, and is applicable at all granularities and regardless of programming language.
• We develop a linguistic model to identify corrective commits, which performs significantly better than prior work and close to human level.
• We show how to perform a maximum likelihood computation to improve the accuracy of the CCP estimation, also removing the dependency on the implementation of the linguistic model.
• We establish a scale of CCP across projects, indicating that the metric provides information about the relative quality of different projects. The scale shows that projects in the bottom decile of quality spend at least six times the effort on bug correction as projects in the top decile.
• We show that CCP correlates with various other effects, e.g. successful onboarding of new developers and productivity.
• We present twin experiments and co-change analysis in order to investigate relations beyond mere correlation.
• On the way we also provide empirical support for Linus's law and the "Quality is Free" hypothesis.

2 RELATED WORK

Despite decades of work on this issue, there is no agreed definition of "software quality". For some, this term refers to the quality of the software product as perceived by its users [93]. Others use the term in reference to the code itself, as perceived by developers. These two approaches have a certain overlap: bugs comprise bad code that has effects seen by the end user. When considering the code, some define quality based mainly on non-functional properties, e.g. reliability, modifiability, etc. [18]. Others include correctness as the foremost property [30]. Our approach is also that correctness is the most important element of quality. The number of bugs in a program could have been a great quality metric. However, Rice's theorem [89] tells us that bug identification, like any non-trivial semantic property of programs, is undecidable. Nevertheless, bugs are being found, providing the basis for the CCP metric. And since bugs are time consuming, disrupt schedules, and hurt the general credibility, lowering the bug rate has value regardless of other implications, thereby lending value to having a low CCP. Moreover, it is generally accepted that fixing bugs costs more the later they are found, and that maintenance is costlier than initial development [16], [17], [19], [29]. Therefore, the cost of low quality is even higher than implied by the bug ratio difference.

Capers Jones defined software quality as the combination of low defect rate and high user satisfaction [50], [52]. He went on to provide extensive state-of-the-industry surveys based on defect rates and their correlation with various development practices, using a database of many thousands of industry projects. Our work applies these concepts to the world of GitHub and open source, using the availability of the code to investigate its quality and possible causes and implications.

Software metrics can be divided into three groups: product metrics, code metrics, and process metrics. Product metrics consider the software as a black box. A typical example is the ISO/IEC 25010:2011 standard [48]. It includes metrics like fitness for purpose, satisfaction, freedom from risk, etc. These metrics might be subjective, hard to measure, and not applicable to white box actionable insights, which makes them less suitable for our research goals. Indeed, studies of the ISO/IEC 9126 standard [47] (the precursor of ISO/IEC 25010) found it to be ineffective in identifying design problems [1].

Code metrics measure properties of the source code directly. Typical metrics are lines of code (LOC) [67], the Chidamber and Kemerer object oriented metrics (aka CK metrics) [23], McCabe's cyclomatic complexity [71], Halstead complexity measures [42], etc. [10], [39], [83]. They tend to be specific, low level, and highly correlated with LOC [37], [73], [90], [95]. Some specific bugs can be detected by matching patterns in the code [46]. But this is not a general solution, since depending on it would bias our data towards these patterns.

Process metrics focus on the code's evolution. The main data source is the source control system. Typical metrics are the number of commits, the commit size, the number of contributors, etc. [38], [75], [83]. Process metrics have been claimed to be better predictors of defects than code metrics, for reasons like showing where effort is being invested and having less stagnation [75], [83].

Working with commits as the entities of interest is also popular in just in time (JIT) defect prediction [55]. Unlike JIT, we are interested in the overall probability, and not in whether a specific commit is corrective. We also focus on long periods, rather than comparing the versions before and after a bug fix, which probably reflects an improvement. We examine work at a resolution of years, and show that CCP is stable, so projects that are prone to errors stay so, despite prior efforts to fix bugs.

Focusing on commits, we need a way to know if they are corrective. If one has access to both a source control system and a ticket management system, one can link the commits to the tickets [13] and reduce the CCP computation to mere counting. Yet, the links between commits and tickets might be biased [13]. The ticket classification itself might have 30% errors [44], and may not necessarily fit the researcher's desired taxonomy. And integrating tickets with the code management system might require a lot of effort, making it infeasible when analysing thousands of projects. Moreover, in a research setting the ticket management system might be unavailable, so one is forced to rely on only the source control system.

When labels are not available, one can use linguistic analysis of the commit messages as a replacement. This is
often done in defect prediction, where supervised learning can be used to derive models based on a labeled training set [27], [41], [86].

In principle, commit analysis models can be used to estimate the CCP, by creating a model and counting hits. That could have worked if the model accuracy was perfect. We take the model predictions and use the hit rate and the model confusion matrix to derive a maximum likelihood estimate of the CCP. Without such an adaptation, the analysis might be invalid, and the hits of different models would have been incomparable.

Our work is also close to Software Reliability Growth Models (SRGM) [38], [109], [111]. In SRGM one tries to predict the number of future failures, based on bugs discovered so far, and assuming the code base is fixed. The difference between us is that we are not aiming to predict future quality. We identify current software quality in order to investigate the causes and implications of quality.

The number of bugs was used as a feature and indicator of quality before, as an absolute number [58], [88], per period [105], and per commit [3], [96]. We prefer the per commit version since it is agnostic to size and useful as a probability.

3 DEFINITION AND COMPUTATION OF THE CORRECTIVE COMMIT PROBABILITY METRIC

We now describe how we built the Corrective Commit Probability metric, in three steps:
1) Constructing a gold standard data set of labeled commit samples, identifying those that are corrective (bug fixes). These are later used to learn about corrective commits and to evaluate the model.
2) Building and evaluating a supervised learning linguistic model to classify commits as either corrective or not. Applying the model to a project yields a hit rate for that project.
3) Using maximum likelihood estimation in order to find the most likely CCP given a certain hit rate.

The need for the third step arises because the hit rate may be biased, which might falsify further analysis like using regression and hypothesis testing. By working with the CCP maximum likelihood estimation we become independent of the model details and its hit rate. We can then compare the results across projects, or even with results based on other researchers' models. We can also identify outliers deviating from the common linguistic behavior (e.g., non-English projects), and remove them to prevent erroneous analysis.

Note that we are interested in the overall probability that a commit is corrective. This is different from defect prediction, where the goal is to determine whether a specific commit is corrective. Finding the probability is easier than making detailed predictions. In analogy to coin tosses, we are interested only in establishing to what degree a coin is biased, rather than trying to predict a sequence of tosses. Thus, if for example false positives and false negatives are balanced, the estimated probability will be accurate even if there are many wrong predictions.

3.1 Building a Gold Standard Data Set

The most straightforward way to compute the CCP is to use a change log system for the commits and a ticket system for the commit classification [13], and compute the corrective ratio. However, for many projects the ticket system is not available. Therefore, we base the commit classification on linguistic analysis, which is built and evaluated using a gold standard.

A gold standard is a set of entities with labels that capture a given concept. In our case, the entities are commits, the concept is corrective maintenance [100], namely bug fixes, and the labels identify which commits are corrective. Gold standards are used in machine learning for building models, which are functions that map entities to concepts. By comparing the true label to the model's prediction, one can estimate the model performance. In addition, we also used the gold standard in order to understand the data behavior and to identify upper bounds on performance.

We constructed the gold standard as follows. Google's BigQuery has a schema for GitHub where all projects' commits are stored in a single table. We sampled uniformly 840 (40 duplicate) commits as a train set. The first author then manually labeled these commits as being corrective or not, based on the commit content, using a defined protocol.
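As an illustration, a uniform sample of this kind could be drawn as sketched below. The table and column names assume the public bigquery-public-data.github_repos.commits schema and the google-cloud-bigquery client; this is a sketch under those assumptions, not necessarily the exact query used by the authors.

```python
# Sketch: uniformly sample GitHub commits for manual labeling from
# BigQuery's public GitHub data set (table/column names are assumptions).
from google.cloud import bigquery

def sample_commits_for_labeling(n_samples=840):
    client = bigquery.Client()  # requires Google Cloud credentials
    # ORDER BY RAND() scans the whole table and is expensive; a
    # TABLESAMPLE clause can be used first to reduce cost.
    query = f"""
        SELECT commit, message
        FROM `bigquery-public-data.github_repos.commits`
        ORDER BY RAND()
        LIMIT {n_samples}
    """
    return [dict(row) for row in client.query(query).result()]

# commits = sample_commits_for_labeling(840)  # then label manually per protocol
```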
To assess the subjectiveness in the labeling process, two additional annotators labeled 400 of the commits. When there was no consensus, we checked if the reason was a deviation from the protocol or an error in the labeling (e.g., missing an important phrase). In these cases, the annotator fixed the label. Otherwise, we considered the case as a disagreement. The final label was a majority vote of the annotators. The Cohen's kappa scores [24] among the different annotators were at least 0.9, indicating excellent agreement. Similarly consistent commit labeling was reported by Levin and Yehudai [65].

Of the 400 triple-annotated commits, there was consensus regarding the labels in 383 (95%) of them: 105 (27%) were corrective, 278 were not. There were only 17 cases of disagreement. An example of disagreement is "correct the name of the Pascal Stangs library." It is subjective whether a wrong name is a bug.

In addition, we also noted the degree of certainty in the labeling. The message "mysql_upgrade should look for .sql script also in share/ directory" is clear, yet it is unclear whether the commit is a new feature or a bug fix. In only 7 cases the annotators were uncertain and couldn't determine with high confidence the label from the commit message and content. Of these, in 4 they all nevertheless selected the same label.

Two of the samples (0.5%) were not in English. This prevents English linguistic models from producing a meaningful classification. Luckily, this is uncommon.

Finally, in 4 cases (1%) the commit message did not contain any syntactic evidence for being corrective. The most amusing example was "When I copy-adapted handle_level_irq I skipped note_interrupt because I considered it unimportant. If I had understood its importance I would have saved myself some ours of debugging" (the typo is in the original). Such cases set an upper bound on the performance of any syntactic model. In our data set, all the above special cases (uncertainty, disagreement, and lack of
syntactic evidence) are rather rare (just 22 samples, 5.5%, many behaviors overlap), and the majority of samples are well behaved. The number of samples in each misbehavior category is very small, so the ratios are very sensitive to noise. However, we can say with confidence that these behaviors are not common and therefore are not an issue of concern in the analysis.

3.2 Syntactic Identification of a Corrective Commit

Our linguistic model is a supervised learning model, based on indicative terms that help identify corrective commit messages. Such models are built empirically by analyzing corrective commit messages in distinction from other commit messages.

The most common approach today to do this is to employ machine learning. We chose not to use machine learning classification algorithms to build the model. The main reason was that we are using a relatively small labeled data set, and linguistic analysis tends to lead to many features (e.g., in a bag of words, word embedding, or n-grams representation). In such a scenario, models might overfit and be less robust. One might try to cope with overfitting by using models of low capacity. However, the concept that we would like to represent (e.g., include "fix" and "error" but not "error code" and "not a bug") is of relatively high capacity. The need to cover many independent textual indications and count them requires a large capacity, larger than what can be supported by our small labeled data set. Note that though we didn't use classification algorithms, the goal, the structure, and the usage of the model are those of supervised learning.

Many prior language models suggest term lists like ('bug', 'bugfix', 'error', 'fail', 'fix'), which reach 88% accuracy on our data set. We tried many machine learning classification algorithms and only the plain decision tree algorithm reached such accuracy. More importantly, as presented later, we aren't optimizing for accuracy. We therefore elected to construct the model manually based on several sources of candidate terms and the application of semantic understanding.

We began with a private project in which the commits could be associated to a ticket-handling system that enabled determining whether they were corrective. We used them in order to differentiate the word distribution of corrective commit messages from other messages and find an initial set of indicative terms. In addition, we used the gold-standard data set presented above. This data set is particularly important because our target is to analyze GitHub projects, so it is desirable that our train data represent the data on which the model will run. This train data set helped tune the indicators by identifying new indications and nuances and alerting to bugs in the model implementation.

To further improve the model we used some terms suggested by Ray et al. [86], though we didn't adopt all of them (e.g., we don't consider a typo to be a bug). This model was used in Amit and Feitelson [3], reaching an accuracy of 89%. We then added additional terms from Shrikanth et al. [97]. We also used labeled commits from Levin and Yehudai [65] to further improve the model based on samples it failed to classify.

The last boost to performance came from the use of active learning [94] and specifically the use of classifier discrepancies [4]. Once the model's performance is high, the probability of finding a false negative, positive_rate · (1 − recall), is quite low, requiring a large number of manually labeled random samples per false negative. Amit and Feitelson [3] provided models for a commit being corrective, perfective, or adaptive. A commit not labeled by any of the models is assured to be a false negative (of one of them). Sampling from this distribution was an effective method to find false negatives, and improving the model to handle them increased the model recall from 69% to 84%. Similarly, while a commit might be both corrective and adaptive, commits marked by more than one classifier are more likely to be false positives.

The resulting model uses regular expressions to identify the presence of different indicator terms in commit messages. We base the model on straightforward regular expressions because this is the tool supported by Google's BigQuery relational database of GitHub data, which is our target platform.

The final model is based on three distinct regular expressions. The first identifies about 50 terms that serve as indications of a bug fix. Typical examples are: "bug", "failure", and "correct this". The second identifies terms that indicate other fixes, which are not bug fixes. Typical examples are: "fixed indentation" and "error message". The third is terms indicating negation. This is used in conjunction with the first regular expression to specifically handle cases in which the fix indication appears in a negative context, as in "This is not an error". It is important to note that fix hits are also hits of the other fixes and the negation. Therefore, the complete model counts the indications for a bug fix (matches to the first regular expression) and subtracts the indications for not really being a bug fix (matches to the other two regular expressions). If the result is positive, the commit message is considered to be a bug fix.
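As an illustration of this counting logic, the following minimal Python sketch uses toy regular expressions with only a handful of terms; the real model's expressions are far richer (about 50 bug-fix indicators, see the Supplementary Materials) and run as BigQuery regular expressions rather than Python.

```python
import re

# Illustrative term fragments only, not the published model.
BUG_FIX = re.compile(r"\b(bug|bugfix|failure|fix(es|ed)?|correct this)\b", re.I)
OTHER_FIX = re.compile(r"\b(fix(es|ed)? (indentation|typo)|error message)\b", re.I)
NEGATION = re.compile(r"\b(not|isn't|is not) (a |an )?(bug|error|issue)\b", re.I)

def is_corrective(message: str) -> bool:
    """Count bug-fix indications and subtract indications that a match is
    not really a bug fix (other kinds of fixes, negated contexts)."""
    hits = len(BUG_FIX.findall(message))
    misses = len(OTHER_FIX.findall(message)) + len(NEGATION.findall(message))
    return hits - misses > 0

print(is_corrective("Fix null pointer failure in parser"))  # True
print(is_corrective("fixed indentation in README"))         # False
```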
The results of the model evaluation, using a 1,100-sample test set built in Amit and Feitelson [3], are presented in the confusion matrix of Table 1.

Table 1
Confusion matrix of model on test data set.

                         Classification
  Concept      True (Corrective)       False
  True         228 (20.73%) TP         43 (3.91%) FN
  False        34 (3.09%) FP           795 (72.27%) TN

These results can be characterized by the following metrics:
• Accuracy (model is correct): 93.0%
• Precision (ratio of hits that are indeed positives): 87.0%
• Precision lift (precision / positive rate − 1): 253.2%
• Hit rate (ratio of commits identified by model as corrective): 23.8%
• Positive rate (real corrective commit rate): 24.6%
• Recall (positives that were also hits): 84.1%
• Fpr (False Positive Rate, negatives that are hits by mistake): 4.2%
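These values follow directly from the confusion matrix counts; the short sketch below recomputes them from Table 1 as a sanity check (variable names are ours).

```python
# Derive the reported metrics from the Table 1 confusion matrix counts.
TP, FN, FP, TN = 228, 43, 34, 795
n = TP + FN + FP + TN                            # 1,100 test commits

accuracy = (TP + TN) / n                         # ~0.930
precision = TP / (TP + FP)                       # ~0.870
positive_rate = (TP + FN) / n                    # ~0.246
precision_lift = precision / positive_rate - 1   # ~2.53
hit_rate = (TP + FP) / n                         # ~0.238
recall = TP / (TP + FN)                          # ~0.841
fpr = FP / (FP + TN)                             # ~0.041 (reported as 4.2%)
```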
Though prior work was based on different protocols and data sets and is therefore hard to compare, our accuracy is
significantly better than previously reported results of 68% [45], 70% [5], 76% [65] and 82% [6], and also better than our own previous result of 89% [3]. The achieved accuracy is close to the well-behaving commits ratio in the gold standard.

3.3 Maximum Likelihood Estimation of the Corrective Commit Probability

We now present the CCP maximum likelihood estimation. Let hr be the hit rate (the probability that the model will identify a commit as corrective) and pr be the positive rate, the true corrective rate in the commits (this is what CCP estimates). In prior work it was common to use the hit rate directly as the estimate for the positive rate. However, they differ since model prediction is not perfect. Thus, by considering the model performance we can better estimate the positive rate given the hit rate. From a performance modeling point of view, the Dawid-Skene [28] modeling is an ancestor of our work. However, the Dawid-Skene framework represents a model by its precision and recall, whereas we use Fpr and recall.

There are two distinct cases that can lead to a hit. The first is a true positive (TP): there is indeed a bug fix and our model identifies it correctly. The probability of this case is Pr(TP) = pr · recall. The second case is a false positive (FP): there was no bug fix, yet our model mistakenly identifies the commit as corrective. The probability of this case is Pr(FP) = (1 − pr) · Fpr. Adding them gives

    hr = Pr(TP) + Pr(FP) = (recall − Fpr) · pr + Fpr    (1)

Extracting pr leads to

    pr = (hr − Fpr) / (recall − Fpr)    (2)

We want to estimate Pr(pr|hr). Let n be the number of commits in our data set, and k the number of hits. As the number of samples increases, k/n converges to the model hit rate hr. Therefore, we estimate Pr(pr|n, k). We will use maximum likelihood for the estimation. The idea behind maximum likelihood estimation is to find the value of pr that maximizes the probability of getting a hit rate of hr.

Note that if we were given p, a single trial success probability, we could calculate the probability of getting k hits out of n trials using the binomial distribution formula

    Pr(k; n, p) = C(n, k) · p^k · (1 − p)^(n−k)    (3)
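The estimator in Equation (2) is simple to apply in code. The following is a minimal sketch, using the recall and Fpr values measured on the test set of Table 1; treating out-of-range results as invalid (rather than clipping them) mirrors the valid-domain filtering used later in the paper, and the function name is ours.

```python
# Sketch: estimate CCP from a project's hit rate using Equation (2),
# with recall and false positive rate measured on the labeled test set.
RECALL, FPR = 0.841, 0.042  # Table 1 values

def estimate_ccp(hit_rate, recall=RECALL, fpr=FPR):
    """Invert hr = (recall - fpr) * pr + fpr (Equations 1-2).
    Estimates outside [0, 1] fall in the invalid CCP range and are
    treated as outliers (e.g., non-English projects), not clipped."""
    pr = (hit_rate - fpr) / (recall - fpr)
    return pr if 0.0 <= pr <= 1.0 else None

print(estimate_ccp(0.238))  # ~0.245, matching the test set's 24.6% positive rate
```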
4 VALIDATION OF THE CCP METRIC

4.1 Validation of the CCP Maximum Likelihood Estimation

George Box said: "All models are wrong but some are useful" [20]. We would like to see how close the maximum likelihood CCP estimations are to the actual results. Note that the model performance results we presented above in Table 1, using the gold standard test set, do not refer to the maximum likelihood CCP estimation. We need a new independent validation set to verify the maximum likelihood estimation. We therefore manually labeled another set of 400 commits, and applying the model resulted in the confusion matrix shown in Table 2.

Table 2
Confusion matrix of model on validation data set.

                         Classification
  Concept      True (Corrective)       False
  True         91 (22.75%) TP          18 (4.5%) FN
  False        34 (8.5%) FP            257 (64.25%) TN

In this data set the positive rate is 27.2%, the hit rate is 31.2%, the recall is 83.5%, and the Fpr is 11.7%. Note that the positive rate in the validation set is 2.6 percentage points different from our test set. The positive rate has nothing to do with MLE and shows that statistics tend to differ on different samples. In this section we would like to show that the MLE method is robust to such changes.

In order to evaluate how sensitive the maximum likelihood estimation is to changes in the data, we used the bootstrap method [31]. We sampled with replacement 400 items from the validation set, repeating the process 10,000 times. Each time we computed the true corrective commit rate, the estimated CCP, and their difference. Figure 1 shows the difference distribution.
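A minimal sketch of this bootstrap follows, assuming the 400 labeled validation commits are available as (is_corrective, model_hit) pairs; which recall and Fpr values to plug in is a detail of the original analysis, so they are left as parameters here.

```python
import random

def bootstrap_ccp_errors(labeled, recall, fpr, rounds=10_000):
    """labeled: list of (is_corrective, model_hit) booleans for the
    validation commits. Resample with replacement and compare the true
    corrective rate with the CCP estimated from the resampled hit rate."""
    diffs = []
    for _ in range(rounds):
        sample = random.choices(labeled, k=len(labeled))
        true_rate = sum(c for c, _ in sample) / len(sample)
        hit_rate = sum(h for _, h in sample) / len(sample)
        estimate = (hit_rate - fpr) / (recall - fpr)  # Equation (2)
        diffs.append(estimate - true_rate)
    return diffs

# diffs = bootstrap_ccp_errors(validation_pairs, recall=0.841, fpr=0.042)
# The spread of `diffs` corresponds to the distribution shown in Figure 1.
```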
ability to find bugs appears to be systematically different —
higher or lower — than in other projects.
The first such situation is in very popular projects. Linus’s
law, “given enough eyeballs, all bugs are shallow” [87], sug-
gests that a large community might lead to more effective
bug identification, and as a consequence also to higher CCP.
In order to investigate this, we used projects of companies
or communities known for their high standards: Google,
Facebook, Apache, Angular, Kubernetes, and Tensorflow.
For each such source, we compared the average CCP of
projects in the top 5% as measured by stars (7,481 stars or
more), with the average CCP of projects with fewer stars.
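A sketch of this comparison is given below, assuming a pandas DataFrame with per-project source, stars, and ccp columns (a layout we introduce for illustration); the 7,481-star cutoff is the top-5% threshold quoted above.

```python
import pandas as pd

def ccp_by_popularity(projects: pd.DataFrame, star_threshold: int = 7481):
    """Average CCP of highly starred projects vs. the rest, per source
    (e.g., Google, Apache). Column names are illustrative."""
    projects = projects.assign(popular=projects["stars"] >= star_threshold)
    return projects.groupby(["source", "popular"])["ccp"].mean().unstack()

# Example with a tiny hypothetical data frame:
# ccp_by_popularity(pd.DataFrame({
#     "source": ["Google", "Google", "Apache"],
#     "stars":  [12000, 300, 9000],
#     "ccp":    [0.30, 0.21, 0.28],
# }))
```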
Table 3
Linus’s Law: CCP in projects with many or fewer stars.
Figure 3. CCP of projects with or without different quality terms.

to the analysis. In our case, it should be a term not related to quality. We chose "algorithm" and "function" as such terms. The verification worked for "algorithm" at the file level: files with and without this term had practically the same CCP. But files with "function" had a much higher CCP than files without it, and projects with both terms had a higher CCP than without them. Possible reasons are some relation to quality (e.g., algorithmic oriented projects are harder) or biases (e.g., object-oriented languages tend to use the term "method" rather than "function"). Anyway, it is clear that the difference for "low quality" is much larger, and there is a large difference for the other terms too. Note that this investigation is not completely independent. While the quality terms used here are different from those used for the classification of corrective commits, we still use the same data source.

An additional important attribute of metrics is that they be stable. We estimate stability by comparing the CCP of the same project in adjacent years, from 2014 to 2019. Overall, the quality of the projects is stable over time. The Pearson correlation between the CCP of the same project in two successive years, with 200 or more commits in each, is 0.86. The average CCP, using all commits from all projects, was 22.7% in 2018 and 22.3% in 2019. Looking at projects, the CCP grew on average by 0.6 percentage points from year to year, which might reflect a slow decrease in quality. This average hides both increases and decreases; the average absolute difference in CCP was 5.5 percentage points. Compared to the CCP distribution presented in Table 4 below, the per-project change is very small.

5 ASSOCIATION OF CCP WITH PROJECT ATTRIBUTES

To further support the claim that CCP is related to quality, we studied the correlations of CCP with various notions of quality reflected in project attributes. To strengthen the results beyond mere correlations we control for variables which might influence the results, such as project age and the number of developers. We also use co-change analysis and "twin" analysis, which show that the correlations are consistent and unlikely to be random.

5.1 Methodology

Our results are in the form of correlations between CCP and other metrics. For example, we show that projects with shorter files tend to have a lower CCP. These correlations are informative and actionable, e.g., enabling a developer to focus on longer files during testing and refactoring. But correlation is not causation, so we cannot say conclusively that longer files cause a higher propensity for bugs that need to be fixed. Showing causality requires experiments in which we perform the change, which we leave for future work. The correlations that we find indicate that a search for causality might be fruitful and could motivate changes in development practices that may lead to improved software quality.

In order to make the results stronger than mere correlation, we use several methods in the analysis. We control the results to see that the relation between A and B is not due to C. In particular we control for the developer, by observing the behaviour of the same developer in different projects. This allows us to separate the influence of the developer and the project. We use co-change over time analysis in order to see to what extent a change in one metric is related to a change in the other metric.

The distributions we examined tended to have some outliers that are much higher than the mean and the majority of the samples. Including outliers in the analysis might distort the results. In order to reduce the destabilizing effect of outliers, we applied Winsorizing [43]. We used one-sided Winsorizing, where all values above a certain threshold are set to this threshold. We do this for the top 1% of the results throughout, to avoid the need to identify outliers and define a rule for adjusting the threshold for each specific case. In the rest of the paper we use the term capping (a common synonym) for this action. In addition, we check whether the metrics are stable across years. A reliable metric applied to clean data is expected to provide similar results in successive years.
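A minimal sketch of this one-sided capping, assuming plain numeric arrays; capping the top 1% of the results corresponds to percent=1.0.

```python
import numpy as np

def cap_top_percent(values, percent=1.0):
    """One-sided Winsorizing: set everything above the (100 - percent)
    percentile to that percentile's value."""
    threshold = np.percentile(values, 100.0 - percent)
    return np.minimum(values, threshold)

print(cap_top_percent(np.array([1, 2, 3, 4, 1000]), percent=25))  # [1. 2. 3. 4. 4.]
```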
Results are computed on 2019 active projects, and specifically on projects whose CCP is in the valid domain. We didn't work with version releases since we work with thousands of projects whose releases are not clearly marked. Note that in projects doing continuous development, the concept of a release is no longer applicable.

5.1.1 Controlled Variables: Project Age, Number of Developers, and Programming Language

Our goal is to find out how to improve software development. We would like to provide actionable recommendations for better software development. However, there are factors that influence software quality that are hard to change. It is not that helpful for an ongoing project to find out that a different programming language is indicative of a lower bug rate. Yet, we examine the effect of some variables that influence quality but are hard to change. We control for them in the rest of the analysis to validate that the results hold. We do the control by conditioning on the relevant variable and checking if the relations found in general also hold while controlling. We don't control for more than one variable at a time since our data set is rather small and controlling leads to smaller data sets, making the results less robust to noise.
Figure 4. CCP distribution (during 2019) in projects of different ages. In this and following figures, each boxplot shows the 5, 25, 50, 75, and 95 percentiles. The dashed line represents the mean.

Lehman's laws of software evolution imply that quality may have a negative correlation with the age of a project [63], [64]. We checked this on our dataset. We first filtered out projects that started before 2008 (GitHub's beginning). For the remaining projects, we checked their CCP at each year. Figure 4 shows that CCP indeed tends to increase slightly with age. In the first year, the average CCP is 0.18. There is then a generally upward trend, getting to an average of 0.23 in 10 years. Note that there is a survival bias in the data presented, since many projects do not reach high age.

In order to see that our results are not due to the influence of age, we divided the projects into age groups. Those started earlier than 2008 were excluded, those started in 2018–2019 (23%) are considered to be young, the next, from 2016–2017 (40%), are medium, and those from 2008–2015 (37%) are old. When we obtained a result (e.g., a correlation between coupling and CCP), we checked if the result holds for each of the groups separately.

The number of developers, via some influence mechanisms (e.g., ownership), was investigated as a quality factor before, and it seems that there is some relation to quality [14], [79], [107]. The number of developers and CCP have a Pearson correlation of 0.12. The number of developers can reach very high values and therefore be very influential. Fig. 5 shows that the percentiles of the CCP distribution increase monotonically with the number of developers. Many explanations have been given for the quality reduction as the number of developers increases. It might simply be a proxy for the project size (i.e. the LOC). It might be due to the increased communication complexity and the difficulty of coordinating multiple developers, as suggested by Brooks in "The Mythical Man Month" [21]. Part of it might also be a reflection of Linus's law, as discussed in Section 4.2.2.

We control for the number of developers by grouping the 25% of projects with the least developers as few (at most 10), the next 50% as intermediate (at most 80), and the rest as numerous, and verifying that results hold for each such group.

Results regarding the influence of programming language are presented below in Section 5.2.4. We show that projects written in different programming languages exhibit somewhat different distributions of CCP. We therefore control for the programming language in order to see that our results remain valid for each language individually.

5.1.2 Co-change Over Time

While experiments can help to determine causality, they are based on few cases and are expensive. On the other hand, we have access to plenty of observations, in which we can identify correlations. While causal relations tend to lead to correlation, non-causal relations might also lead to correlations for various reasons. We would like to use an analysis that will help to filter out non-causal relations. By that we will be left with a smaller set of more likely relations to be further investigated for causality.

When two metrics change simultaneously, it is less likely to be accidental. Hence, we track the metrics over time in order to see how their changes match. We create pairs of the same project in two consecutive years. For each pair we mark if the first and second metrics improved, and observe the following co-change statistics. The ratio of improvement match (the equivalent of accuracy in supervised learning) is an indication of related changes. Denote the event that metric i improved from one year to the next by mi↑. The probability P(mj↑ | mi↑) (the equivalent of precision in supervised learning) indicates how likely we are to observe an improvement in metric j knowing of an improvement in metric i. It might be that we observe high precision simply because P(mj↑) is high. In order to exclude this possibility, we also observe the precision lift, P(mj↑ | mi↑) / P(mj↑) − 1. Note that lift cannot be used to identify the causality direction since it is symmetric:

    P(mj↑ | mi↑) / P(mj↑) = P(mi↑ ∧ mj↑) / (P(mi↑) · P(mj↑)) = P(mi↑ | mj↑) / P(mi↑)    (6)
If an improvement in metric i indeed causes the improvement in metric j, we expect high precision and lift. Since small changes might be accidental, we also investigate improvements above a certain threshold. There is a trade-off here, since given a high threshold the improvement is clear, yet the number of cases we consider is smaller. Another trade-off comes from how far in the past we track the co-changes. The earlier we go, the more data we have. On the other hand, this will increase the weight of old projects, and might subject the analysis to changes in software development practices over time and to data quality problems. We chose a scope of 5 years, avoiding looking before 2014.
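A sketch of these co-change statistics follows, assuming one row per project and pair of consecutive years, with boolean columns mi_up and mj_up recording whether each metric improved (thresholded improvements can be encoded by requiring a minimum change when building the booleans). The DataFrame layout is ours.

```python
import pandas as pd

def co_change_stats(pairs: pd.DataFrame):
    """pairs: one row per (project, year, year+1) with boolean columns
    mi_up and mj_up marking an improvement in metrics i and j."""
    match = (pairs["mi_up"] == pairs["mj_up"]).mean()      # accuracy analog
    precision = pairs.loc[pairs["mi_up"], "mj_up"].mean()  # P(mj↑ | mi↑)
    base_rate = pairs["mj_up"].mean()                      # P(mj↑)
    lift = precision / base_rate - 1
    return {"match": match, "precision": precision, "lift": lift}
```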
5.1.3 Controlling the Developer

Measured metric results (e.g., development speed, low coupling) might be due to the developers working on the project (e.g., their skill and motivation) or due to the project environment (e.g., processes, technical debt). To separate the influence of developers and environment, we checked the performance of developers active in more than one project in our data set. By fixing a single developer and comparing the developer's activity in different projects, we can investigate the influence of the project. Note that a developer active in n projects will generate O(n^2) project pairs ("twins") to compare.

We considered only involved developers, committing at least 12 times per year, since otherwise the results might be misleading. While this omits 62% of the developers, they are responsible for only 6% of the commits.

Consider development speed as an example. If high speed is due to the project environment, in high-speed projects every developer is expected to be faster than himself in other projects. This control resembles twin experiments, popular in psychology, where a behavior of interest is observed on twins. Since twins have a very close genetic background, a difference in their behavior is more likely to be due to another factor (e.g., being raised in different families).

Assume that performance on project A is in general better than on project B. We consider developers that contributed to both projects, and check how often they are better in project A than themselves in project B (formally, the probability that a developer is better in project A than in project B given that project A is better than project B). This is equivalent to precision in supervised learning, where the project improvement is the classifier and the developer improvement is the concept. In some cases, a small difference might be accidental. Therefore we require a large difference both between the projects and between the developer's performance (e.g., at least a 10 commits per year difference; more formally, the probability that a developer committed at least 10 times more in project A than in project B given that the average number of commits per developer in project A is at least 10 commits higher than in project B).
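A sketch of this "twin" comparison is given below, assuming a table of yearly commit counts per involved developer and project; column names and the exact aggregation are illustrative rather than the authors' implementation. The 10-commit margin mirrors the threshold described above.

```python
import pandas as pd
from itertools import combinations

def twin_speed_agreement(activity: pd.DataFrame, margin: int = 10):
    """activity: columns developer, project, commits (yearly counts for
    involved developers). For project pairs whose mean commits-per-developer
    differ by at least `margin`, check how often a shared developer is also
    faster by at least `margin` in the 'faster' project."""
    per_project = activity.groupby("project")["commits"].mean()
    agree, total = 0, 0
    for a, b in combinations(per_project.index, 2):
        if per_project[a] - per_project[b] < margin:
            a, b = b, a                       # make `a` the faster project
        if per_project[a] - per_project[b] < margin:
            continue                          # project difference too small
        both = activity.pivot_table(index="developer", columns="project",
                                    values="commits").dropna(subset=[a, b])
        agree += (both[a] - both[b] >= margin).sum()
        total += len(both)
    return agree / total if total else None
```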
ment and development speed.
5.1.4 Selection of Projects 5.2.1 The Distribution of CCP per Project
In 2018 Github published that they had 100 million projects. Given the ability to identify corrective commits, we can
The BigQuery GitHub schema contains about 2.5 million classify the commits of each project and estimate the dis-
public projects prior to 2020. But the vast majority are not tribution of CCP over the projects’ population.
10
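A sketch of this step, reusing the Equation (2) estimator with the test-set recall and Fpr, assuming per-project counts of model hits and total commits for 2019; projects outside the valid CCP domain are dropped as described in Section 5.1.4.

```python
import numpy as np

def ccp_distribution(hits_per_project, commits_per_project,
                     recall=0.841, fpr=0.042):
    """Map each project's hit rate to a CCP estimate (Equation 2) and
    report the percentiles that form the quality scale."""
    ccps = []
    for hits, commits in zip(hits_per_project, commits_per_project):
        pr = (hits / commits - fpr) / (recall - fpr)
        if 0.0 <= pr <= 1.0:          # keep only the valid CCP domain
            ccps.append(pr)
    return np.percentile(ccps, [5, 25, 50, 75, 95])
```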
Table 4
CCP distribution in active GitHub projects.

Figure 8. CCP distribution for projects with different average commit sizes (number of files, capped, in non-corrective commits).
other direction, Ghayyur et al. conducted a survey in which 72% claimed that poor code quality is demotivating [36]. Hence, quality might be both the outcome and the cause of motivation.

and on-boarding average is doubled in the first decile compared to the last. In order to be more robust to noise, we consider projects that have at least 10 new developers. When looking at the co-change of on-boarding and CCP, the match is only 53% for any change but 85% for a change of at least 10 percentage points in both metrics. An improvement of 10 percentage points in CCP leads to a significant improvement in on-boarding in 10% of the cases, a lift of 18%. When controlling on language, results fit the relation other than in PHP and Shell (which had a small number of cases). Results hold for all age groups. For size, they hold for intermediate and numerous numbers of developers; by definition, with few developers there are no projects with at least 10 new developers.
Figure 12. Distribution of commits of involved developers (capped) per CCP decile.

speed the probability of a significant improvement in CCP drops to 7%. Hence, knowing of a significant improvement in CCP, a speed improvement is likely, but knowing of a speed improvement, a significant CCP improvement is very unlikely.

When controlling for age or language, results hold. Results also hold for the intermediate and numerous developer groups, with a positive lift when the change is significant, but a -3% lift in the few developers group for any change.

There are two differing theories regarding the relations between quality and productivity. The classical Iron Triangle [80] sees them as a trade-off: investment in quality comes at the expense of productivity. On the other hand, "Quality is Free" claims that investment in quality is beneficial and leads to increased productivity [25]. Our results in Table 5 enable a quantitative investigation of this issue, where speed is operationalized by commits per year and quality by CCP.

As "Quality is Free" predicts, we find that in high quality projects the development speed is much higher. The twin experiments help to reduce noise, demonstrating that development speed is a characteristic of the project. If this correlation is indeed due to causality, then when you improve the quality you also gain speed, enjoying both worlds. This relation between quality and development speed is also supported by Jones's research on time wasted due to low quality [51], [53] and developers performing the same tasks during "Personal Software Process" training [98].

6 THREATS TO VALIDITY

There is no agreed quantitative definition of quality, hence we cannot ensure that a certain metric measures quality. In order to cope with this, we showed that our metric agrees with developers' comments on quality and is associated with variables that are believed to reflect or influence quality.

A specific threat to validity in our work is related to construct validity. We set out to measure the Corrective Commit Probability and do so based on a linguistic analysis. We investigated whether it is indeed accurate and precise in Section 4.1.

The number of test labeled commits is small, about 1,000, hence there is a question of how well they represent the underlying distribution. We evaluated the sensitivity to changes in the data. Since the model was built mainly using domain knowledge and a different data set, we could use a small training set. Therefore, we preferred to use most of the labels as a test set for the variables estimation and to improve the estimation of the recall and Fpr.

The labeling was done manually by humans, who are prone to error and subjectivity. In order to make the labeling stricter, we used a labeling protocol. Out of the samples, 400 were labeled by three annotators independently. The labels were compared in order to evaluate the amount of uncertainty.

Other than uncertainty due to different opinions, there was uncertainty due to the lack of information in the commit message. For example, the message "Changed result default value to False" describes a change well but leaves us uncertain regarding its nature. We used the gold standard labels to verify that this is rare.

Our main assumption is the conditional independence between the corrective commits (code) and the commit messages describing them (process) given our concept (the commit being corrective, namely a bug fix). This means that the model performance is the same over all the projects, and a different hit rate is due to a different CCP. This assumption is invalid in some cases. For example, projects documented in a language other than English will appear to have no bugs. Non-English commit messages are relatively easy to identify; more problematic are differences in English fluency. Native English speakers are less likely to have spelling mistakes and typos. A spelling mistake might prevent our model from identifying the textual pattern, thus lowering the recall. This will lead to an illusory benefit of spelling mistakes, misleading us to think that people who tend to have more spelling mistakes tend to have fewer bugs.

Another threat to validity is due to the family of models that we chose to use. We chose to represent the model using two parameters, recall and Fpr, following the guidance of Occam's razor and resorting to a more complex solution only when a need arises. However, many other families of models are possible. We could consider different sub-models for various message lengths, a model that predicts the commit category instead of the Boolean "Is Corrective" concept, etc. Each family will have different parameters and behavior. More complex models will have more representative power but will be harder to learn and require more samples.

A common assumption in statistical analysis is the IID assumption (Independent and Identically Distributed random variables). This assumption clearly doesn't hold for GitHub projects. We found that forks, projects based on others and sharing a common history, were 35% of the active projects. We therefore removed forks, but projects might still share code and commits. Also, older projects, with more commits and users, have higher weight in twin studies and co-change analysis.

Our metric focuses on the fraction of commits that correct bugs. One can claim that the fraction of commits that induce bugs is a better quality metric. In principle, this can be done using the SZZ algorithm (the common algorithm for
identifying bug inducing commits [99]). But note that SZZ is applied after the bug was identified and fixed. Thus, the inducing and fixing commits are actually expected to give similar results.

Another major threat concerns internal validity. Our basic assumption is that corrective commits reflect bugs, and therefore a low CCP is indicative of few bugs and high quality. But a low CCP can also result from a disregard for fixing bugs or an inability to do so. On the other hand, in extremely popular projects, Linus's law "given enough eyeballs, all bugs are shallow" [87] might lead to more effective bug identification and a high CCP. Another unwanted effect of using corrective commits is that improvements in bug detection (e.g., by doubling the QA department) will look like a reduction in quality. The correlations found between CCP and various other indicators of software quality add confidence that CCP is indeed valid. We identify such cases and discuss them in Section 4.2.2.

Focusing on corrective commits also leads to several biases. Most obviously, existing bugs that have not been found yet are unknown. Finding and fixing bugs might take months [60]. When projects differ in the time needed to identify a bug, our results will be biased.

Software development is usually done subject to a lack of time and resources. Due to that, many times known bugs of low severity are not fixed. While this leads to a bias, it can be considered to be a desirable one, by focusing on the more important bugs.

A threat to external validity might arise due to the use of open source projects that might not represent projects done in software companies. We feel that the open source projects are of significant interest on their own. Other than that, the projects we analyzed include projects of Google, Microsoft, Apple, etc., so at least part of the area is covered.

Time, cost, and development speed are problematic to measure. We use commits as a proxy for work since they typically represent tasks. Yet, tasks differ in size and difficulty, and their translation to commits might differ due to project or developer habits. Commits may also include a mix of different tasks. In order to reduce the influence of project culture we aggregated many of them. In order to eliminate the effect of personal habits, we used twin experiments. Other than that, the number of commits per time is correlated with developers' self-rated productivity [77] and team leads' perception of productivity [81], hence it provides a good computable estimator.

7 CONCLUSIONS

We presented a novel way to measure projects' code quality, using the Corrective Commit Probability (CCP). We use the consensus that bugs are bad and indicative of low quality to base a metric on them. We started off with a linguistic model to identify corrective commits, significantly improving on prior work [3], [5], [45], [65], and developed a mathematical method to find the most likely CCP given the model's hit rate. The CCP metric has the following properties:
• It matches developers' references to quality.
• It is stable: it reflects the character of a project and does not change much from year to year.
• It is informative in that it has a wide range of values and distinguishes between projects.

We estimated the CCP of all 7,557 large active projects in BigQuery's GitHub data. This created a quality scale, enabling observations on the state of the practice. Using this scale developers can compare their project's quality (as reflected by CCP) to the community. A low percentile suggests the need for quality improvement efforts.

We checked the sensitivity of our assumptions and noticed that projects in the theoretically invalid CCP range indeed tend to be not in English or not software. A difference in bug detection efficiency was demonstrated in highly popular projects, supporting Linus's law [87].

Our results also helped demonstrate that "Quality is Free". Instead of a trade-off between quality and development speed, we find that they are positively correlated, and this was further supported with co-change analysis and twin experiments. Thus, investing in quality may actually reduce schedules rather than extending them.

We show a correlation between short files, low coupling, and quality, supporting well-known recommendations for quality improvement. Hence, if the discussed relations are indeed causal, we have a simple way to reach high quality, which will also benefit a project in higher productivity, better on-boarding, and lower churn.

Supplementary Materials

The language models are available at https://github.com/evidencebp/commit-classification. Utilities used for the analysis (e.g., co-change) are at https://github.com/evidencebp/analysis_utils. All other supplementary materials can be found at https://github.com/evidencebp/corrective-commit-probability.

Acknowledgements

This research was supported by the ISRAEL SCIENCE FOUNDATION (grant No. 832/18). We thank Amiram Yehudai and Stanislav Levin for providing us their data set of labeled commits [65]. We thank Guilherme Avelino for drawing our attention to the importance of Truck Factor Developers Detachment (TFDD) and providing a data set [8].

REFERENCES

[1] H. Al-Kilidar, K. Cox, and B. Kitchenham. The use and usefulness of the ISO/IEC 9126 quality standard. In Intl. Symp. Empirical Softw. Eng., pages 126–132, Nov 2005.
[2] M. Allamanis. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! 2019, pages 143–153, New York, NY, USA, 2019. Association for Computing Machinery.
[3] I. Amit and D. G. Feitelson. Which refactoring reduces bug rate? In Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE'19, pages 12–15, New York, NY, USA, 2019. ACM.
[4] I. Amit, E. Firstenberg, and Y. Meshi. Framework for semi-supervised learning when no labeled data is given. U.S. patent application #US20190164086A1, 2017.
[5] J. J. Amor, G. Robles, J. M. Gonzalez-Barahona, and A. Navarro. Discriminating development activities in versioning systems: A case study, Jan 2006.
[6] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y.-G. Guéhéneuc. Is it a bug or an enhancement? A text-based approach to classify change requests. In Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON '08, New York, NY, USA, 2008. Association for Computing Machinery.
[7] M. Argyle. Do happy workers work harder? The effect of job satisfaction on job performance. In R. Veenhoven, editor, How harmful is happiness? Consequences of enjoying life or not. Universitaire Pers, Rotterdam, The Netherlands, 1989.
[8] G. Avelino, E. Constantinou, M. T. Valente, and A. Serebrenik. On the abandonment and survival of open source projects: An empirical investigation. CoRR, abs/1906.08058, 2019.
[9] G. Avelino, L. T. Passos, A. C. Hora, and M. T. Valente. A novel approach for estimating truck factors. CoRR, abs/1604.06766, 2016.
[10] V. R. Basili, L. C. Briand, and W. L. Melo. A validation of object-oriented design metrics as quality indicators. IEEE Transactions on Software Engineering, 22(10):751–761, Oct 1996.
[11] G. Bavota, A. De Lucia, M. Di Penta, R. Oliveto, and F. Palomba. An experimental investigation on the innate relationship between quality and refactoring. J. Syst. & Softw., 107:1–14, Sep 2015.
[12] E. D. Berger, C. Hollenbeck, P. Maj, O. Vitek, and J. Vitek. On the impact of programming languages on code quality: A reproduction study. ACM Trans. Program. Lang. Syst., 41(4), Oct 2019.
[13] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu. Fair and balanced?: Bias in bug-fix datasets. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC/FSE '09, pages 121–130, New York, NY, USA, 2009. ACM.
[14] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code! Examining the effects of ownership on software quality. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 4–14, 2011.
[15] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. German, and P. Devanbu. The promises and perils of mining git. In 2009 6th IEEE International Working Conference on Mining Software Repositories, pages 1–10, May 2009.
[16] B. Boehm and V. R. Basili. Software defect reduction top 10 list. Computer, 34(1):135–137, Jan 2001.
[17] B. W. Boehm. Software Engineering Economics. Prentice-Hall, 1981.
[18] B. W. Boehm, J. R. Brown, and M. Lipow. Quantitative evaluation of software quality. In Intl. Conf. Softw. Eng., number 2, pages 592–605, Oct 1976.
[19] B. W. Boehm and P. N. Papaccio. Understanding and controlling software costs. IEEE Transactions on Software Engineering, 14(10):1462–1477, Oct 1988.
[20] G. Box. Robustness in the strategy of scientific model building. In R. L. Launer and G. N. Wilkinson, editors, Robustness in Statistics, pages 201–236. Academic Press, 1979.
[21] F. P. Brooks, Jr. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, 1975.
[22] J. P. Campbell, R. A. McCloy, S. H. Oppler, and C. E. Sager. A theory of performance. In N. Schmitt, W. C. Borman, and Associates, editors, Personnel Selection in Organizations, pages 35–70. Jossey-Bass Pub., 1993.
[23] S. R. Chidamber and C. F. Kemerer. A metrics suite for object oriented design. IEEE Trans. Softw. Eng., 20(6):476–493, Jun 1994.
[24] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
[25] P. Crosby. Quality Is Free: The Art of Making Quality Certain. McGraw-Hill, 1979.
[26] W. Cunningham. The WyCash portfolio management system. In Addendum to the Proceedings on Object-Oriented Programming Systems, Languages, and Applications (Addendum), OOPSLA '92, pages 29–30, New York, NY, USA, 1992. Association for Computing Machinery.
[27] M. D'Ambros, M. Lanza, and R. Robbes. An extensive comparison of bug prediction approaches. In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pages 31–41, May 2010.
[28] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):20–28, 1979.
[29] M. Dawson, D. Burrell, E. Rahim, and S. Brewster. Integrating software assurance into the software development life cycle (SDLC). Journal of Information Systems Technology and Planning, 3:49–53, 2010.
[30] G. Dromey. A model for software product quality. IEEE Trans. Softw. Eng., 21(2):146–162, Feb 1995.
[31] B. Efron. Bootstrap Methods: Another Look at the Jackknife, pages 569–593. Springer New York, New York, NY, 1992.
[32] R. Fisher. The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52(2):399–433, 1919.
[33] M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[34] M. Fowler, K. Beck, and W. R. Opdyke. Refactoring: Improving the design of existing code. In 11th European Conference, Jyväskylä, Finland, 1997.
[35] M. Gharehyazie, B. Ray, M. Keshani, M. S. Zavosht, A. Heydarnoori, and V. Filkov. Cross-project code clones in GitHub. Empirical Software Engineering, 24(3):1538–1573, Jun 2019.
[36] S. A. K. Ghayyur, S. Ahmed, S. Ullah, and W. Ahmed. The impact of motivator and demotivator factors on agile software development. International Journal of Advanced Computer Science and Applications, 9(7), 2018.
[37] Y. Gil and G. Lalouche. On the correlation between size and metric validity. Empirical Softw. Eng., 22(5):2585–2611, Oct 2017.
[38] T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. Predicting fault incidence using software change history. IEEE Transactions on Software Engineering, 26(7):653–661, Jul 2000.
[39] T. Gyimothy, R. Ferenc, and I. Siket. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910, Oct 2005.
[40] R. Hackbarth, A. Mockus, J. Palframan, and R. Sethi. Improving software quality as customers perceive it. IEEE Softw., 33(4):40–45, Jul/Aug 2016.
[41] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng., 38(6):1276–1304, Nov 2012.
[42] M. H. Halstead. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., New York, NY, USA, 1977.
[43] C. Hastings, F. Mosteller, J. W. Tukey, and C. P. Winsor. Low moments for small samples: A comparative study of order statistics. Ann. Math. Statist., 18(3):413–426, Sep 1947.
[44] K. Herzig, S. Just, and A. Zeller. It's not a bug, it's a feature: How misclassification impacts bug prediction. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13, pages 392–401, Piscataway, NJ, USA, 2013. IEEE Press.
[45] A. Hindle, D. M. German, M. W. Godfrey, and R. C. Holt. Automatic classification of large changes into maintenance categories. In 2009 IEEE 17th International Conference on Program Comprehension, pages 30–39, May 2009.
[46] D. Hovemeyer and W. Pugh. Finding bugs is easy. SIGPLAN Not., 39(12):92–106, Dec 2004.
[47] ISO/IEC 9126-1 (2001). Software engineering product quality - Part 1: Quality model. International Organization for Standardization, page 16, 2001.
[48] International Organization for Standardization. Systems and software engineering – systems and software quality requirements and evaluation (SQuaRE) – system and software quality models, 2011.
[49] Z. Jiang, P. Naudé, and C. Comstock. An investigation on the variation of software development productivity. IEEE Transactions on Software Engineering, 1(2):72–81, 2007.
[50] C. Jones. Applied Software Measurement: Assuring Productivity and Quality. McGraw-Hill, Inc., New York, NY, USA, 1991.
[51] C. Jones. Social and technical reasons for software project failures. CrossTalk, The J. Def. Software Eng., 19(6):4–9, 2006.
[52] C. Jones. Software quality in 2012: A survey of the state of the art, 2012. [Online; accessed 24-September-2018].
[53] C. Jones. Wastage: The impact of poor quality on software economics. Retrieved from http://asq.org/pub/sqp/. Software Quality Professional, 18(1):23–32, 2015.
[54] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. German, and D. Damian. The promises and perils of mining GitHub (extended version). Empirical Software Engineering, Jan 2015.
[55] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering, 39(6):757–773, Jun 2013.
[56] C. F. Kemerer. Reliability of function points measurement: A field experiment. Commun. ACM, 36(2):85–97, Feb 1993.
[57] C. F. Kemerer and B. S. Porter. Improving the reliability of function point measurement: An empirical study. IEEE Trans. Softw. Eng., 18(11):1011–1024, Nov 1992.
[58] F. Khomh, T. Dhaliwal, Y. Zou, and B. Adams. Do faster releases improve software quality? An empirical case study of Mozilla Firefox. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR '12, pages 179–188, Piscataway, NJ, USA, 2012. IEEE Press.
[59] F. Khomh, M. Di Penta, and Y.-G. Gueheneuc. An exploratory study of the impact of code smells on software change-proneness. In 2009 16th Working Conference on Reverse Engineering, pages 75–84. IEEE, 2009.
[60] S. Kim and E. J. Whitehead, Jr. How long did it take to fix bugs? In Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR '06, pages 173–174, New York, NY, USA, 2006. ACM.
[61] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller. Predicting faults from cached history. In Proceedings of the 29th International Conference on Software Engineering, ICSE '07, pages 489–498, Washington, DC, USA, 2007. IEEE Computer Society.
[62] P. Kruchten, R. L. Nord, and I. Ozkaya. Technical debt: From metaphor to theory and practice. IEEE Software, 29(6):18–21, 2012.
[63] M. M. Lehman. Programs, life cycles, and laws of software evolution. Proceedings of the IEEE, 68(9):1060–1076, 1980.
[64] M. M. Lehman, J. F. Ramil, P. D. Wernick, D. E. Perry, and W. M. Turski. Metrics and laws of software evolution – the nineties view. In Intl. Software Metrics Symp., number 4, pages 20–32, Nov 1997.
[65] S. Levin and A. Yehudai. Boosting automatic commit classification into maintenance activities by utilizing source code changes. In Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE, pages 97–106, New York, NY, USA, 2017. ACM.
[66] B. P. Lientz, E. B. Swanson, and G. E. Tompkins. Characteristics of application software maintenance. Comm. ACM, 21(6):466–471, Jun 1978.
[67] M. Lipow. Number of faults per line of code. IEEE Transactions on Software Engineering, (4):437–439, 1982.
[68] C. V. Lopes, P. Maj, P. Martins, V. Saini, D. Yang, J. Zitny, H. Sajnani, and J. Vitek. Déjàvu: A map of code duplicates on GitHub. Proc. ACM Program. Lang., 1(OOPSLA), Oct 2017.
[69] K. D. Maxwell and P. Forselius. Benchmarking software development productivity. IEEE Software, 17(1):80–88, Jan 2000.
[70] K. D. Maxwell, L. Van Wassenhove, and S. Dutta. Software development productivity of European space, military, and industrial applications. IEEE Transactions on Software Engineering, 22(10):706–718, Oct 1996.
[71] T. J. McCabe. A complexity measure. IEEE Trans. Softw. Eng., 2(4):308–320, Jul 1976.
[72] A. Mockus, D. Spinellis, Z. Kotti, and G. J. Dusing. A complete set of related git repositories identified via community detection approaches based on shared commits, 2020.
[73] A.-J. Molnar, A. Neamţu, and S. Motogna. Evaluation of software product quality metrics. In E. Damiani, G. Spanoudakis, and L. A. Maciaszek, editors, Evaluation of Novel Approaches to Software Engineering, pages 163–187, Cham, 2020. Springer International Publishing.
[74] S. Morasca and G. Russo. An empirical study of software productivity. In 25th Annual International Computer Software and Applications Conference, COMPSAC 2001, pages 317–322, Oct 2001.
[75] R. Moser, W. Pedrycz, and G. Succi. Analysis of the reliability of a subset of change metrics for defect prediction. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '08, pages 309–311, New York, NY, USA, 2008. ACM.
[76] N. Munaiah, S. Kroh, C. Cabrey, and M. Nagappan. Curating GitHub for engineered software projects. Empirical Software Engineering, 22, Apr 2017.
[77] E. Murphy-Hill, C. Jaspan, C. Sadowski, D. C. Shepherd, M. Phillips, C. Winter, A. K. Dolan, E. K. Smith, and M. A. Jorde. What predicts software developers' productivity? IEEE Transactions on Software Engineering, 2019.
[78] S. Nanz and C. A. Furia. A comparative study of programming languages in Rosetta Code. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pages 778–788, May 2015.
[79] B. Norick, J. Krohn, E. Howard, B. Welna, and C. Izurieta. Effects of the number of developers on code quality in open source software: A case study. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pages 1–1, 2010.
[80] R. Oisen. Can project management be defined? Project Management Quarterly, 2(1):12–14, 1971.
[81] E. Oliveira, E. Fernandes, I. Steinmacher, M. Cristo, T. Conte, and A. Garcia. Code and commit metrics of developer productivity: A study on team leaders perceptions. Empirical Software Engineering, Apr 2020.
[82] L. Prechelt. An empirical comparison of seven programming languages. Computer, 33(10):23–29, Oct 2000.
[83] F. Rahman and P. Devanbu. How, and why, process metrics are better. In 2013 35th International Conference on Software Engineering (ICSE), pages 432–441, May 2013.
[84] F. Rahman, D. Posnett, A. Hindle, E. T. Barr, and P. T. Devanbu. BugCache for inspections: Hit or miss? In SIGSOFT FSE, 2011.
[85] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3567–3575. Curran Associates, Inc., 2016.
[86] B. Ray, D. Posnett, V. Filkov, and P. Devanbu. A large scale study of programming languages and code quality in GitHub. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 155–165, New York, NY, USA, 2014. ACM.
[87] E. Raymond. The cathedral and the bazaar. First Monday, 3(3), 1998.
[88] S. Reddivari and J. Raman. Software quality prediction: An investigation based on machine learning. In 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), pages 115–122, 2019.
[89] H. G. Rice. Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society, 74(2):358–366, 1953.
[90] J. Rosenberg. Some misconceptions about lines of code. In Proceedings Fourth International Software Metrics Symposium, pages 137–142. IEEE, 1997.
[91] H. Sackman, W. J. Erikson, and E. E. Grant. Exploratory experimental studies comparing online and offline programming performance. Commun. ACM, 11(1):3–11, Jan 1968.
[92] S. R. Schach, B. Jin, L. Yu, G. Z. Heller, and J. Offutt. Determining the distribution of maintenance categories: Survey versus measurement. Empirical Softw. Eng., 8(4):351–365, Dec 2003.
[93] N. F. Schneidewind. Body of knowledge for software quality measurement. Computer, 35(2):77–83, Feb 2002.
[94] B. Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2010.
[95] M. Shepperd. A critique of cyclomatic complexity as a software metric. Software Engineering J., 3(2):30–36, Mar 1988.
[96] E. Shihab, A. E. Hassan, B. Adams, and Z. M. Jiang. An industrial study on the risk of software changes. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE '12, pages 62:1–62:11, New York, NY, USA, 2012. ACM.
[97] N. C. Shrikanth and T. Menzies. Assessing practitioner beliefs about software defect prediction. In Intl. Conf. Softw. Eng., number 42, May 2020.
[98] N. C. Shrikanth, W. Nichols, F. M. Fahid, and T. Menzies. Assessing practitioner beliefs about software engineering. arXiv:2006.05060, Jun 2020.
[99] J. Śliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? SIGSOFT Softw. Eng. Notes, 30(4):1–5, May 2005.
[100] E. B. Swanson. The dimensions of maintenance. In Proceedings of the 2nd International Conference on Software Engineering, ICSE '76, pages 492–497, Los Alamitos, CA, USA, 1976. IEEE Computer Society Press.
[101] S. E. S. Taba, F. Khomh, Y. Zou, A. E. Hassan, and M. Nagappan. Predicting bugs using antipatterns. In 2013 IEEE International Conference on Software Maintenance, pages 270–279, 2013.
[102] E. Tom, A. Aurum, and R. Vidgen. An exploration of technical debt. Journal of Systems and Software, 86(6):1498–1516, 2013.
[103] E. van Emden and L. Moonen. Java quality assurance by detecting code smells. In Ninth Working Conference on Reverse Engineering, 2002. Proceedings., pages 97–106, Nov 2002.
[104] E. Van Emden and L. Moonen. Java quality assurance by detecting code smells. In Ninth Working Conference on Reverse Engineering, 2002. Proceedings., pages 97–106. IEEE, 2002.
[105] B. Vasilescu, Y. Yu, H. Wang, P. Devanbu, and V. Filkov. Quality and productivity outcomes relating to continuous integration in GitHub. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 805–816, New York, NY, USA, 2015. ACM.
[106] N. Walkinshaw and L. Minku. Are 20% of files responsible for 80% of defects? In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '18, pages 2:1–2:10, New York, NY, USA, 2018. ACM.
[107] E. Weyuker, T. Ostrand, and R. Bell. Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models. Empirical Software Engineering, 13:539–559, Oct 2008.
[108] L. Williams and R. Kessler. Pair Programming Illuminated. Addison-Wesley Longman Publishing Co., Inc., USA, 2002.
[109] A. Wood. Predicting software reliability. Computer, 29(11):69–77, Nov 1996.
[110] T. A. Wright and R. Cropanzano. Psychological well-being and job satisfaction as predictors of job performance. Journal of Occupational Health Psychology, 5:84–94, 2000.
[111] S. Yamada and S. Osaki. Software reliability growth modeling: Models and applications. IEEE Transactions on Software Engineering, SE-11(12):1431–1437, Dec 1985.
[112] A. Yamashita and L. Moonen. Do code smells reflect important maintainability aspects? In 2012 28th IEEE International Conference on Software Maintenance (ICSM), pages 306–315. IEEE, 2012.
[113] T. Zimmermann, S. Diehl, and A. Zeller. How history justifies system architecture (or not). In Sixth International Workshop on Principles of Software Evolution, 2003. Proceedings., pages 73–83, Sept 2003.