A Large Scale Study of Programming Languages and Code Quality in Github

ABSTRACT
What is the effect of programming languages on software quality? This question has been a topic of much debate for a very long time. In this study, we gather a very large data set from GitHub (729 projects, 80 million SLOC, 29,000 authors, 1.5 million commits, in 17 languages) in an attempt to shed some empirical light on this question. This reasonably large sample size allows us to use a mixed-methods approach, combining multiple regression modeling with visualization and text analytics, to study the effect of language features such as static vs. dynamic typing and strong vs. weak typing on software quality. By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest, effect on software quality. Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages. It is worth noting that these modest effects arising from language design are overwhelmingly dominated by process factors such as project size, team size, and commit size. However, we hasten to caution the reader that even these modest effects might quite possibly be due to other, intangible process factors, e.g., the preference of certain personality types for functional, static, and strongly typed languages.

Categories and Subject Descriptors
D.3.3 [PROGRAMMING LANGUAGES]: Language Constructs and Features

General Terms
Measurement, Experimentation, Languages

Keywords
programming language, type system, bug fix, code quality, empirical research, regression analysis, software domain

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
FSE'14, November 16-21, 2014, Hong Kong, China
Copyright 2014 ACM 978-1-4503-3056-5/14/11...$15.00
http://dx.doi.org/10.1145/2635868.2635922

1. INTRODUCTION
A variety of debates ensue during discussions of whether a given programming language is "the right tool for the job". While some of these debates may appear to be tinged with an almost religious fervor, most people would agree that a programming language can impact not only the coding process, but also the properties of the resulting artifact.

Advocates of strong, static typing argue that type inference will catch software bugs early. Advocates of dynamic typing may argue that rather than spend a lot of time correcting annoying static type errors arising from sound, conservative static type checking algorithms in compilers, it is better to rely on strong dynamic typing to catch errors as and when they arise. These debates, however, have largely been of the armchair variety; usually the evidence offered in support of one position or the other tends to be anecdotal.

Empirical evidence for the existence of associations between code quality, programming language choice, language properties, and usage domains could help developers make more informed choices.

Given the number of other factors that influence software engineering outcomes, obtaining such evidence is, however, a challenging task. Considering software quality, for example, there are a number of well-known influential factors, including source code size [11], the number of developers [36, 6], and age/maturity [16]. These factors are known to have a strong influence on software quality, and indeed, such process factors can effectively predict defect localities [32].

One approach to teasing out just the effect of language properties, even in the face of such daunting confounds, is to do a controlled experiment. Some recent works have conducted experiments in controlled settings, with tasks of limited scope and with students, using languages with static or dynamic typing (based on experimental treatment setting) [14, 22, 19]. While this type of controlled study is "El Camino Real" to solid empirical evidence, another opportunity has recently arisen, thanks to the large number of open source projects collected in software forges such as GitHub.

GitHub contains many projects in multiple languages. These projects vary a great deal across size, age, and number of developers. Each project repository provides a historical record from which we extract project data including the contribution history, project size, authorship, and defect repair. We use this data to determine the effects of language features on defect occurrence using a variety of tools. Our approach is best described as a mixed-methods, or triangulation [10], approach. A quantitative (multiple regression) study is further examined using mixed methods: text analysis, clustering, and visualization. The observations from the mixed methods largely confirm the findings of the quantitative study.
In summary, the main features of our work are as follows.

- We leverage a categorization of some important features of programming languages that prior knowledge suggests are important for software quality (strong vs. weak typing, dynamic vs. static typing, memory managed vs. unmanaged, and scripting vs. compiled) to study their impact on defect proneness.

- We use multiple regression to control for a range of different factors (size, project history, number of contributors, etc.) and study the impact of the above features on defect occurrence. The findings are listed under RQ1 and RQ2 in Section 3.

- We use text analysis and clustering methods to group projects into domains of application, and also the defects into categories of defects; we then use heat maps to study relationships of project types and defect types to programming languages. The findings from this study (RQ3 and RQ4 in Section 3) are consistent with the statistical results.

While the use of regression analysis to deal with confounding variables is not without controversy, we submit that a couple of factors increase the credibility of our results: a fairly large sample size, and the use of mixed methods to qualitatively explore and largely confirm the findings from the regression model.

2. METHODOLOGY
Here, we describe the languages and GitHub projects that we collected, and the analysis methods we used to answer our research questions.

2.1 Study Subjects
To understand whether the choice of programming language has any impact on software quality, we choose the top 19 programming languages from GitHub. We disregard CSS, Shell script, and Vim script as they are not considered to be general purpose languages. We further include TypeScript, a typed superset of JavaScript. Then, for each of the studied languages, we retrieve the top 50 projects that are primarily written in that language. Table 1 shows the top three projects in each language, based on their popularity. In total, we analyze 850 projects spanning 17 different languages.

Table 1: Top three projects in each language
Language      | Projects
C             | linux, git, php-src
C++           | node-webkit, phantomjs, mongo
C#            | SignalR, SparkleShare, ServiceStack
Objective-C   | AFNetworking, GPUImage, RestKit
Go            | docker, lime, websocketd
Java          | storm, elasticsearch, ActionBarSherlock
CoffeeScript  | coffee-script, hubot, brunch
JavaScript    | bootstrap, jquery, node
TypeScript    | bitcoin, litecoin, qBittorrent
Ruby          | rails, gitlabhq, homebrew
Php           | laravel, CodeIgniter, symfony
Python        | flask, django, reddit
Perl          | gitolite, showdown, rails-dev-box
Clojure       | LightTable, leiningen, clojurescript
Erlang        | ChicagoBoss, cowboy, couchdb
Haskell       | pandoc, yesod, git-annex
Scala         | Play20, spark, scala

2.2 Data Collection
To retrieve the top programming languages and their corresponding projects from GitHub, we used GitHub Archive [1], a database that records all public GitHub activities. The archive logs eighteen different GitHub events, including new commits, fork events, pull requests, developers' information, and issue tracking of all the open source GitHub projects, on an hourly basis. The archive data is uploaded to the Google BigQuery [3] service to provide an interface for interactive data analysis.

Identifying top languages. The top languages in GitHub are measured by first finding the number of open source GitHub projects developed in each language, and then choosing the top languages with the maximum number of projects. However, since multiple languages are often used to develop a project, assigning a single language to a project is difficult. GitHub Linguist [12] can measure such a language distribution for a GitHub project repository. Since languages can be identified by the extensions of a project's source files, GitHub Linguist counts the number of source files with different extensions. The language with the maximum number of source files is assigned as the primary language of the project. GitHub Archive stores this information. We aggregate projects based on their primary language. Then we select the top languages having the maximum number of projects for further analysis, as shown in Table 1.

Retrieving popular projects. For each selected language, we retrieve the project repositories that are primarily written in that language. We then count the number of stars associated with each repository; the number of stars relates to how many people are interested in that project [2]. Thus, we assume that stars indicate the popularity of a project. We select the top 50 projects in each language. To ensure that these projects have a sufficient development history, we filter out projects having fewer than 28 commits, where 28 is the first-quartile commit count of all the projects. This leaves us with 729 projects. Table 1 shows the top three projects in each language. These include projects such as Linux, mysql, android-sdk, facebook-sdk, mongodb, python, and the ruby source code.

Retrieving project evolution history. For each of these 729 projects, we downloaded the non-merged commits, along with the commit logs, author date, and author name, using the command git log --no-merges --numstat. The numstat flag shows the number of added and deleted lines per file associated with each commit. This helps us to compute code churn and the number of files modified per commit. We also retrieve the languages associated with each commit from the extensions of the modified files. Note that one commit can have multiple language tags. For each commit, we calculate its commit age by subtracting its commit date from the first commit of the corresponding project. We also calculate some other project-related statistics, including the maximum commit age of a project and the total number of developers; we use them as control variables in our regression model, as discussed in Section 3. We further identify the bug fix commits made to individual projects by searching for error-related keywords: 'error', 'bug', 'fix', 'issue', 'mistake', 'incorrect', 'fault', 'defect' and 'flaw' in the commit log, using a heuristic similar to that in Mockus and Votta [25].
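This harvesting step is straightforward to reproduce. The sketch below is our own illustration, not the authors' released tooling: it parses the output of git log --no-merges --numstat and applies the keyword heuristic just described. The control-character field separators are our assumption, and we match only the commit subject, whereas the study searches the full log message:

import re
import subprocess

# Keywords from Section 2.2, used to flag bug fix commits (cf. Mockus and Votta [25]).
BUG_KEYWORDS = re.compile(
    r"\b(error|bug|fix|issue|mistake|incorrect|fault|defect|flaw)\b", re.I)

def harvest(repo_path):
    """Yield (sha, date, author, churn, files, is_bugfix) for each non-merge commit."""
    fmt = "\x1e%H\x1f%ai\x1f%an\x1f%s"     # record/unit separators: our assumption
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--no-merges", "--numstat",
         "--format=" + fmt],
        capture_output=True, text=True, check=True).stdout
    for record in log.split("\x1e")[1:]:
        header, _, stats = record.partition("\n")
        sha, date, author, subject = header.split("\x1f", 3)
        churn = files = 0
        for line in stats.splitlines():
            m = re.match(r"(\d+|-)\t(\d+|-)\t", line)   # added/deleted per file
            if m:
                files += 1
                churn += sum(int(g) for g in m.groups() if g != "-")
        yield sha, date, author, churn, files, bool(BUG_KEYWORDS.search(subject))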
Table 2 summarizes our data set. Since a project may use multiple languages, the second column of the table shows the total number of projects that use a certain language in some capacity. We further exclude from a project any language with fewer than 20 commits in that project, where 20 is the first-quartile value of the total number of commits per project per language. For example, we find 220 projects with more than 20 commits in C. This ensures that the studied languages have significant activity within the projects. In summary, we study 729 projects developed in 17 languages with 18 years of parallel evolution history. This includes 29 thousand different developers, 1.58 million commits, and 566,000 bug fix commits.

Table 2: Study Subjects

2.3 Categorizing Languages
We define language classes based on several properties of the language that have been thought to influence language quality [14, 15, 19], as shown in Table 3. The Programming Paradigm indicates whether the project is written in a procedural, functional, or scripting language. The Compilation Class indicates whether the project is statically or dynamically typed.

Table 3: Different Types of Language Classes
Language Classes     | Categories | Languages
Programming Paradigm | Procedural | C, C++, C#, Objective-C, Java, Go
                     | Scripting  | CoffeeScript, JavaScript, Python, Perl, Php, Ruby
                     | Functional | Clojure, Erlang, Haskell, Scala
Compilation Class    | Static     | C, C++, C#, Objective-C, Java, Go, Haskell, Scala
                     | Dynamic    | CoffeeScript, JavaScript, Python, Perl, Php, Ruby, Clojure, Erlang
Type Class           | Strong     | C#, Java, Go, Python, Ruby, Clojure, Erlang, Haskell, Scala
                     | Weak       | C, C++, Objective-C, CoffeeScript, JavaScript, Perl, Php
Memory Class         | Managed    | Others
                     | Unmanaged  | C, C++, Objective-C

The Type Class classifies languages as strongly or weakly typed, based on whether the language admits type-confusion. We consider that a program introduces type-confusion when it attempts to interpret a memory region populated by a datum of a specific type T1 as an instance of a different type T2, where T1 and T2 are not related by inheritance. We classify a language as strongly typed if it explicitly detects type confusion and reports it as such. Strong typing could happen via static type inference within a compiler (e.g., with Java), using a type-inference algorithm such as Hindley-Milner [17, 24], or at run time using a dynamic type checker. In contrast, a language is weakly typed if type-confusion can occur silently (undetected), and eventually cause errors that are difficult to localize. For example, in a weakly typed language like JavaScript adding a string to a number is permissible (e.g., '5' + 2 yields '52'), while such an operation is not permitted in strongly typed Python. Also, C and C++ are considered weakly typed since, due to type-casting, one can interpret a field of a structure that was an integer as a pointer.

Finally, the Memory Class indicates whether the language requires developers to manage memory. We treat Objective-C as unmanaged, although Objective-C follows a hybrid model, because we observe many memory errors in the Objective-C codebase, as discussed in RQ4 in Section 3.
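To make the distinction concrete, here is a tiny illustration of our own (in Python; the JavaScript behavior is quoted in a comment): the strongly typed language detects the type confusion and reports it, rather than coercing silently:

# In JavaScript (weak typing): '5' + 2 silently evaluates to '52'.
# In Python (strong typing), the same confusion is detected and reported:
try:
    "5" + 2
except TypeError as err:
    print(err)        # e.g. can only concatenate str (not "int") to str

print("5" + str(2))   # the coercion must be requested explicitly: '52'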
2.4 Identifying Project Domain
We classify the studied projects into different domains based on their features and functionalities, using a mix of automated and manual techniques. The projects in GitHub come with project descriptions and README files that describe their features.

First, we used Latent Dirichlet Allocation (LDA) [7], a well-known topic analysis algorithm, on the text describing project features. Given a set of documents, LDA identifies a set of topics, where each topic is represented as a probability of generating different words. For each document, LDA also estimates the probability of assigning that document to each topic.
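A minimal sketch of this step, assuming the project descriptions and READMEs have already been gathered into a list of strings; the paper does not name its LDA implementation, so scikit-learn, the vectorizer settings, and the loader below are our assumptions (the study settles on 30 topics, as described next):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = load_project_descriptions()            # hypothetical loader: one string per project

vec = CountVectorizer(stop_words="english")   # bag-of-words over the descriptions
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=30, random_state=0)   # 30 topics
doc_topic = lda.fit_transform(X)              # rows: P(topic | project description)

# Print each topic in the "0.042*facebook + 0.010*swank/slime + ..." style.
words = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]         # five most probable keywords
    probs = weights[top] / weights.sum()
    print(k, " + ".join(f"{p:.3f}*{words[i]}" for i, p in zip(top, probs)))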
We detect 30 distinct domains (i.e., topics) and estimate the probability of each project belonging to these domains. For example, LDA assigned the facebook-android-sdk project to the following topic with high probability: (0.042 * facebook + 0.010 * swank/slime + 0.007 * framework + 0.007 * environments + 0.007 * transforming). Here, the text values are topic keywords and the numbers are their probabilities within that topic; for clarity, we only show the top 5 keywords.

Since such auto-detected domains include several project-specific keywords, such as facebook and swank/slime in the previous example, it is hard to identify the underlying common functionalities. Hence, we manually inspect each of the thirty domains to identify project-name-independent, domain-identifying keywords. Manual inspection helps us in assigning a meaningful name to each domain. For example, for the domain described earlier, we identify the keywords framework, environments, and transforming, and call it a development framework. We manually rename all thirty auto-detected domains in a similar manner and find that the majority of the projects fall under six domains: Application, Database, CodeAnalyzer, Middleware, Library, and Framework. We also find that some projects, like "online books and tutorials", "scripts to setup environment", and "hardware programs", do not fall under any of the above domains, so we assign them to a catchall domain labeled Other. This classification of projects into domains was subsequently checked and confirmed by another member of our research group. Table 4 summarizes the identified domains resulting from this process. In our study set, the Framework domain has the greatest number of projects (206), while the Database domain has the fewest (43).

Table 4: Characteristics of Domains
Domain Name       | Domain Characteristics              | Example Projects   | Total Proj
Application (APP) | end user programs                   | bitcoin, macvim    | 120
Database (DB)     | sql and nosql databases             | mysql, mongodb     | 43
CodeAnalyzer (CA) | compiler, parser, interpreter etc.  | ruby, php-src      | 88
Middleware (MW)   | Operating Systems, Virtual Machines | linux, memcached   | 48
Library (LIB)     | APIs, libraries etc.                | androidApis, opencv| 175
Framework (FW)    | SDKs, plugins                       | ios sdk, coffeekup | 206
Other (OTH)       | -                                   | Arduino, autoenv   | 49
2.5 Categorizing Bugs
While fixing software bugs, developers often leave important information in the commit logs about the nature of the bugs, e.g., why the bugs arise and how to fix them. We exploit such information to categorize the bugs, similar to Tan et al. [20, 33]. First, we categorize the bugs based on their Cause and Impact. Root Causes are further classified into disjoint sub-categories of errors: Algorithmic, Concurrency, Memory, generic Programming, and Unknown. The bug Impact is also classified into four disjoint sub-categories: Security, Performance, Failure, and other unknown categories. Thus, each bug fix commit has both a Cause and an Impact type. For example, a Linux bug corresponding to the bug fix message "return if prcm_base is NULL.... This solves the following crash" (https://lkml.org/lkml/2012/12/18/102) was caused by a missing check (a programming error), and its impact was a crash (failure). Table 5 shows the description of each bug category. This classification is performed in two phases:
(1) Keyword search. We randomly choose 10% of the bug fix messages and use a keyword-based search technique to automatically categorize the messages with potential bug types. We use this annotation, separately, for both Cause and Impact types. We chose a restrictive set of keywords and phrases, as shown in Table 5. For example, if a bug fix log contains any of the keywords deadlock, race condition, or synchronization error, we infer that it is related to the Concurrency error category. Such a restrictive set of keywords and phrases helps to reduce false positives.
(2) Supervised classification. We use the annotated bug fix logs from the previous step as training data for supervised learning techniques to classify the remainder of the bug fix messages, treating them as test data. We first convert each bug fix message to a bag-of-words. We then remove words that appear only once among all of the bug fix messages; this reduces project-specific keywords. We also stem the bag-of-words using standard natural language processing (NLP) techniques. Finally, we use a well-known supervised classifier, the Support Vector Machine (SVM) [34], to classify the test data.
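A sketch of this two-phase setup, under stated assumptions: scikit-learn and NLTK stand in for the unnamed toolchain, the keyword-annotated messages from phase (1) arrive via hypothetical loaders, and one such model would be trained per label dimension (Cause and Impact):

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

stem = PorterStemmer().stem

def tokens(message):
    # Lowercased, stemmed bag-of-words terms for one bug fix message.
    return [stem(tok) for tok in message.lower().split()]

train_msgs, train_causes = load_keyword_annotated_logs()   # hypothetical: phase (1) output
test_msgs = load_remaining_logs()                           # hypothetical: unlabeled messages

vec = CountVectorizer(analyzer=tokens, min_df=2)   # min_df=2 drops words seen only once
X_train = vec.fit_transform(train_msgs)

clf = LinearSVC().fit(X_train, train_causes)       # phase (2): supervised classification
predicted_causes = clf.predict(vec.transform(test_msgs))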
To evaluate the accuracy of the bug classifier, we manually annotated 180 randomly chosen bug fixes, equally distributed across all of the categories. We then compared the result of the automatic classifier with the manually annotated data set. The following table summarizes the result for each bug category.

            | precision | recall
Performance |  70.00%   | 87.50%
Security    |  75.00%   | 83.33%
Failure     |  80.00%   | 84.21%
Memory      |  86.00%   | 85.71%
Programming |  90.00%   | 69.23%
Concurrency | 100.00%   | 90.91%
Algorithm   |  85.00%   | 89.47%
Average     |  83.71%   | 84.34%

The result of our bug classification is shown in Table 5. In the Cause category, we find that most of the bugs are related to generic programming errors (88.53%). Such a high proportion is not surprising, because this category involves a wide variety of programming errors, including incorrect error handling, type errors, typos, compilation errors, incorrect control-flow, and data initialization errors. Of the rest, 5.44% are incorrect memory handling, 1.99% are concurrency bugs, and 0.11% are algorithmic errors. Analyzing the impact of the bugs, we find that 2.01% are related to security vulnerabilities, 1.55% are performance errors, and 3.77% cause complete failure of the system. Our technique could not classify 1.04% of the bug fix messages into any Cause or Impact category; we classify these with the Unknown type.
2.6 Statistical Methods
We use regression modeling to describe the relationship of a set of predictors to a response. In this paper, we model the number of defective commits against other factors related to software projects. All regression models use negative binomial regression (NBR) to model the non-negative counts of project attributes such as the number of commits. NBR is a type of generalized linear model used to model non-negative integer responses. It is appropriate here as NBR is able to handle over-dispersion, e.g., cases where the response variance is greater than the mean [8].
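As a concrete sketch, the same model family is available in statsmodels, assuming the per-(language, project) rows have been assembled into a data frame; the file and column names below are ours, and the paper's own analysis toolchain is not specified:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per (language, project) pair.
df = pd.read_csv("language_project_rows.csv")
for col in ["commits", "age", "size", "devs"]:
    df["log_" + col] = np.log(df[col])     # log-transform the count controls

# NBR of defective commits on the controls plus a language factor.
# (Patsy's C() defaults to treatment coding; the paper uses weighted
# effects coding instead; see the weighted effects coding sketch below.)
nbr = smf.negativebinomial(
    "bug_commits ~ log_commits + log_age + log_size + log_devs + C(language)",
    data=df).fit()
print(nbr.summary())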
Table 5: Categories of bugs and their distribution in the whole dataset
Bug Type           | Bug Description                               | Search keywords/phrases                                         | count  | %count
Cause
 Algorithm (Algo)  | algorithmic or logical errors                 | algorithm                                                       | 606    | 0.11
 Concurrency (Conc)| multi-threading or multi-processing issues    | deadlock, race condition, synchronization error                 | 11111  | 1.99
 Memory (Mem)      | incorrect memory handling                     | memory leak, null pointer, buffer overflow, heap overflow, dangling pointer, double free, segmentation fault | 30437  | 5.44
 Programming (Prog)| generic programming errors                    | exception handling, error handling, type error, typo, compilation error, copy-paste error, refactoring, missing switch case, faulty initialization, default value | 495013 | 88.53
Impact
 Security (Sec)    | correctly runs but can be exploited by attackers | buffer overflow, security, password, oauth, ssl              | 11235  | 2.01
 Performance (Perf)| correctly runs with delayed response          | optimization problem, performance                               | 8651   | 1.55
 Failure (Fail)    | crash or hang                                 | reboot, crash, hang, restart                                    | 21079  | 3.77
 Unknown (Unkn)    | not part of the above seven categories        | -                                                               | 5792   | 1.04

In our models we control for several language- and project-dependent factors that are likely to influence the outcome. Consequently, each (language, project) pair is a row in our regression and is viewed as a sample from the population of open source projects. We log-transform dependent count variables, as this stabilizes the variance and usually improves the model fit [8]. We verify this by comparing transformed with non-transformed data using the AIC and Vuong's test for non-nested models [35].

To check that excessive multi-collinearity is not an issue, we compute the variance inflation factor (VIF) of each dependent variable in all of the models. Although there is no particular value of VIF that is always considered excessive, we use the commonly used conservative value of 5 [8]. We check for and remove high-leverage points through visual examination of the residuals-vs-leverage plot for each model, looking for both separation and large values of Cook's distance.
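The collinearity check can be reproduced as below: a sketch continuing the data frame from the previous listing, with statsmodels' VIF helper as our choice of tooling, not necessarily the authors':

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["log_commits", "log_age", "log_size", "log_devs"]])
for i, name in enumerate(X.columns):
    if name != "const":
        vif = variance_inflation_factor(X.values, i)
        print(name, round(vif, 2))   # flag any control with VIF > 5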
ever, we can easily extract the missing coefficient from the other Df Dev
model coefficients, or more practically, we simply re-compute the NULL 1113 38526.51
coefficients using a different level of the factor as a base [8]. log commits 1 36986.03 1112 1540.48 0.0000
log age 1 42.70 1111 1497.78 0.0000
To test for the relationship between two factor variables we use log size 1 12.25 1110 1485.53 0.0005
a Chi-Square test of independence [21]. After confirming a depen- log devs 1 48.22 1109 1437.30 0.0000
dence we use Cramer’s V, an r × c equivalent of the phi coefficient language 16 242.89 1093 1194.41 0.0000
for nominal data, to establish an effect size [9]. One should take care not to overestimate the impact of language
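The construction is easy to state in code. A minimal sketch, under the definition used here: each non-omitted level gets an indicator column, and rows of the omitted level carry -n_level/n_omitted in that column, so each coefficient contrasts its language with the weighted grand mean (the function name and pandas representation are ours):

import pandas as pd

def weighted_effects_codes(factor, omitted):
    """Columns for every level except `omitted`, with unbalanced-design weights."""
    counts = factor.value_counts()
    levels = [lvl for lvl in counts.index if lvl != omitted]
    codes = pd.DataFrame(0.0, index=factor.index, columns=levels)
    for lvl in levels:
        codes.loc[factor == lvl, lvl] = 1.0
        # The omitted level carries -n_level / n_omitted rather than -1.
        codes.loc[factor == omitted, lvl] = -counts[lvl] / counts[omitted]
    return codes

# Re-fitting with a different omitted level recovers the missing coefficient,
# exactly as described above.
codes = weighted_effects_codes(df["language"], omitted="Scala")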
To test for the relationship between two factor variables we use a Chi-Square test of independence [21]. After confirming a dependence, we use Cramer's V, an r x c equivalent of the phi coefficient for nominal data, to establish an effect size [9].
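A sketch of both tests, assuming the factor pair is tabulated from the same data frame (scipy assumed; Cramer's V is computed directly from the chi-square statistic):

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

table = pd.crosstab(df["language"], df["domain"])   # r x c contingency counts
chi2, p, dof, expected = chi2_contingency(table)

n = table.to_numpy().sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))                 # effect size for r x c tables
print(f"chi2={chi2:.1f}, p={p:.3g}, V={cramers_v:.3f}")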
3. RESULTS
Prior to analyzing language properties in more detail, we begin with a straightforward question that directly addresses the core of what some fervently believe must be true, namely:

RQ1. Are some languages more defect prone than others?

We evaluate this question using an NBR model, with languages encoded with weighted effects codes as predictors for the number of defect fixing commits. The model details are shown in Table 6. We include some variables as controls for factors that will clearly influence the number of defect fixes. Project age is included, as older projects will generally have a greater number of defect fixes. Trivially, the number of commits to a project will also impact the response. Additionally, the number of developers who touch a project and the raw size of the project are both expected to grow with project activity.

The sign and magnitude of the estimates in the above model relate the predictors to the outcome. The first four variables are control variables, and we are not interested in their impact on the outcome other than to say that, in this case, they are all positive, as expected, and significant. The language variables are indicator, or factor, variables for each project. The coefficient compares each language to the grand weighted mean of all languages in all projects. The language coefficients can be broadly grouped into three general categories. The first category contains those for which the coefficient is statistically insignificant, and the modeling procedure could not distinguish the coefficient from zero. These languages may behave similarly to the average, or they may have wide variance. The remaining coefficients are significant and either positive or negative. For those with positive coefficients we can expect that the language is associated with, ceteris paribus, a greater number of defect fixes. These languages include C, C++, JavaScript, Objective-C, Php, and Python. The languages Clojure, Haskell, Ruby, Scala, and TypeScript all have negative coefficients, implying that these languages are less likely than the average to result in defect fixing commits.

            | Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi)
NULL        |    |          | 1113      | 38526.51   |
log commits | 1  | 36986.03 | 1112      | 1540.48    | 0.0000
log age     | 1  | 42.70    | 1111      | 1497.78    | 0.0000
log size    | 1  | 12.25    | 1110      | 1485.53    | 0.0005
log devs    | 1  | 48.22    | 1109      | 1437.30    | 0.0000
language    | 16 | 242.89   | 1093      | 1194.41    | 0.0000

One should take care not to overestimate the impact of language on defects. While these relationships are statistically significant, the effects are quite small. In the analysis of deviance table above we see that activity in a project accounts for the majority of explained deviance. Note that all variables are significant; that is, all of the factors above account for some of the variance in the number of defective commits. The next closest predictor, which accounts for less than one percent of the total deviance, is language. All other controls taken together do not account for as much deviance as language.

Table 6: Some languages induce fewer defects than other languages. Response is the number of defective commits. Languages are coded with weighted effects coding, so each language is compared to the grand mean. AIC=10673, BIC=10783, Log Likelihood=-5315, Deviance=1194, Num. obs.=1114

Defective Commits Model | Coef. (Std. Err.)
(Intercept)   | -1.93 (0.10)***
log commits   |  2.26 (0.03)***
log age       |  0.11 (0.03)**
log size      |  0.05 (0.02)*
log devs      |  0.16 (0.03)***
C             |  0.15 (0.04)***
C++           |  0.23 (0.04)***
C#            |  0.03 (0.05)
Objective-C   |  0.18 (0.05)***
Go            | -0.08 (0.06)
Java          | -0.01 (0.04)
CoffeeScript  |  0.07 (0.05)
JavaScript    |  0.06 (0.02)**
TypeScript    | -0.43 (0.06)***
Ruby          | -0.15 (0.04)*
Php           |  0.15 (0.05)***
Python        |  0.10 (0.03)**
Perl          | -0.15 (0.08)
Clojure       | -0.29 (0.05)***
Erlang        | -0.00 (0.05)
Haskell       | -0.23 (0.06)***
Scala         | -0.28 (0.05)***
*** p < 0.001, ** p < 0.01, * p < 0.05
Although we expressed the impact of language in this model as a percentage of deviance, we hasten to remind the reader that, although the interpretation is similar, in a rough sense, to a percentage of the total variance explained in an ordinary least squares regression, it is not accurate to say that the measures are synonymous [8]. About the best we can do is to observe that it is a small effect. We can read the coefficients as the expected change in the log of the response for a one-unit change in the predictor, with all other predictors held constant; i.e., for a coefficient beta_i, a one-unit change in the corresponding predictor yields an expected change in the response of e^(beta_i). For the factor variables, this expected change is normally interpreted as a comparison to a base factor. In our case, however, the use of weighted effects coding allows us to compare this expected change to the grand mean, i.e., the average across all languages. Thus, if, for some number of commits, a particular project developed in an average language had four defective commits, then the choice to use C++ would mean that we should expect one additional buggy commit, since e^0.23 x 4 = 5.03. For the same project, choosing Haskell would mean that we should expect about one fewer defective commit, as e^-0.23 x 4 = 3.18. The accuracy of this prediction depends on all other factors remaining the same, a challenging proposition for all but the most trivial of projects. All observational studies face similar limitations, and we address this concern in more detail in Section 5.

Result 1: Some languages have a greater association with defects than other languages, although the effect is small.

In the remainder of this paper we expand on this basic result by considering how different categories of application, defect, and language lead to further insight into the relationship between languages and defect proneness.

Software bugs usually fall under two broad categories: (1) Domain-specific bugs, which are specific to project functionality and do not depend on the underlying programming language. For example, we find a bug fix in Linux with the log message "Fix headset mic support for Asus X101CH". The bug was due to missing functionality [30] in the Asus headset support (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1169138) and had little to do with language features. Previous research terms these errors Software Component bugs [20, 33]. (2) Generic bugs, which are more generic in nature and have less to do with project functionality, e.g., type errors and concurrency errors.

Consequently, it is reasonable to think that the interaction of application domain and language might impact the number of defects within a project. Moreover, some languages are believed to excel at some tasks more than others, e.g., C for low-level work, or Java for user applications. A less-than-ideal language for an application might lead to a greater number of defects. To compare the error proneness of different languages, we ideally should ignore the domain-specific bugs and focus only on the generic ones, since they are more likely to depend on programming language features. However, since a domain-specific bug may also arise due to a generic programming error, it is difficult to separate the two. A possible workaround is to study the defect proneness of languages while controlling for the domain. To this end, we would like to confirm specific interactions between language and domain with respect to defects within projects. Statistically, this goal poses challenges for regression. Interaction terms in a regression model might yield some insight; however, with 17 languages across 7 domains, this would yield an excessive number of terms and a challenging, and most likely spurious, interpretation.

Given this, we first consider testing for the dependence between domain and language usage within a project, using a Chi-Square test of independence. Unfortunately, out of 119 cells in our data set, 46, i.e., 39%, are below the value of 5. This exceeds the recommendation that no more than 20% of the counts should be below 5 [21]. We include the value here for completeness (a Chi-Squared value of 243.6 with 96 d.f. and p = 8.394e-15); however, the low strength of association of 0.191, as measured by Cramer's V, suggests that although there is likely some relationship between domain and language in our data set, including domain in regression models is unlikely to lead to meaningful models.

One option to address this concern would be to remove languages or combine domains; however, our data here presents no clear choices. Alternatively, we could combine languages; this choice leads to a related but slightly different question.

RQ2. Which language properties relate to defects?

Rather than considering languages individually, we aggregate them by language class, as described in Section 2.3, and analyze the relationship between defects and language class. Broadly, each of these properties divides languages along some line that is often discussed in the context of errors, drives user debate, or has been the subject of prior work. To arrive at the six factors in the model, we combined all of these factors across all of the languages in our study.

Ideally, we would want to include each of the separate properties in the regression model, so that we could assert with some assurance that a particular property is responsible for particular defects. Unfortunately, however, the properties are highly correlated, and models with all properties are not stable. To avoid this issue we model the impact of the six different language classes, which result from combining the language properties, on the number of defects, while controlling for the same basic covariates that we used in the model in RQ1.

Table 7: Functional languages have a smaller relationship to defects than other language classes, whereas procedural languages are either greater than or similar to the average. Language classes are coded with weighted effects coding, so each class is compared to the grand mean. AIC=10419, Deviance=1132, Num. obs.=1067

Defective Commits                 | Coef. (Std. Err.)
(Intercept)                       | -2.13 (0.10)***
log commits                       |  0.96 (0.01)***
log age                           |  0.07 (0.01)***
log size                          |  0.05 (0.01)***
log devs                          |  0.07 (0.01)***
Functional-Static-Strong-Managed  | -0.25 (0.04)***
Functional-Dynamic-Strong-Managed | -0.17 (0.04)***
Proc-Static-Strong-Managed        | -0.06 (0.03)*
Script-Dynamic-Strong-Managed     |  0.001 (0.03)
Script-Dynamic-Weak-Managed       |  0.04 (0.02)*
Proc-Static-Weak-Unmanaged        |  0.14 (0.02)***
*** p < 0.001, ** p < 0.01, * p < 0.05

As with language, we are comparing language classes with the average behavior across all languages. The model is presented in Table 7. As with the previous model, the first four variables are control variables, and we are not interested in their impact on the outcome other than to say, in this case, that, with the exception of size, they are all positive and significant.
It is clear that the Script-Dynamic-Strong-Managed class has the smallest magnitude coefficient. The coefficient is insignificant, implying that the z-test for the coefficient cannot distinguish it from zero. Given the magnitude of the standard error, however, we can reasonably assume that the lack of significance is not related to wide variance or insufficient data; rather, it is because the behavior of languages in this class is very close to the average behavior across all languages. We confirm this by recoding the model using Proc-Static-Weak-Unmanaged as the base level and employing treatment, or dummy, coding, which compares each language class with the base level. In this case, Script-Dynamic-Strong-Managed is significantly different, with p = 0.00044. We note here that while choosing different coding methods affects the coefficients and z-scores, the models are identical in all other respects. When we change the coding, we are rescaling the coefficients to reflect the comparison that we wish to make [8]. Comparing the other language classes to the grand mean, Proc-Static-Weak-Unmanaged languages are more likely to induce defects. This implies that either weak typing or memory management issues contribute to greater defect proneness as compared with other procedural languages.

Among scripting languages we observe a similar relationship between weak and strong typing. This gives some evidence that weak versus strong typing is more likely responsible for this difference, as opposed to memory management, but we cannot state this conclusively given the correlation between factors. It is possible that among scripting languages strong versus weak typing is driving the relationship, while among procedural languages memory management is driving the difference. However, as a group, strongly typed languages are less error prone than average, while the weakly typed languages are more error prone than average. The contrast between static and dynamic typing is also visible in functional languages.

The functional languages as a group show a strong difference from the average. Compared to all other language types, both Functional-Dynamic-Strong-Managed and Functional-Static-Strong-Managed languages show a smaller relationship with defects. The statically typed languages have a substantially smaller coefficient, yet both functional language classes have the same standard error. This is strong evidence that functional static languages are less error prone than functional dynamic languages; however, the z-tests only test whether the coefficients are different from zero. In order to strengthen this assertion, we recode the model as above using treatment coding and observe that the Functional-Static-Strong-Managed language class is significantly less defect prone than the Functional-Dynamic-Strong-Managed language class, with p = 0.034.

            | Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi)
NULL        |    |          | 1066      | 32995.23   |
log commits | 1  | 31634.32 | 1065      | 1360.91    | 0.0000
log age     | 1  | 51.04    | 1064      | 1309.87    | 0.0000
log size    | 1  | 50.82    | 1063      | 1259.05    | 0.0000
log devs    | 1  | 31.11    | 1062      | 1227.94    | 0.0000
Lang. Class | 5  | 95.54    | 1057      | 1132.40    | 0.0000

As with the relationship between language and defects, the relationship between language class and defects is based on a small effect. The deviance explained is shown in the analysis of deviance table above and is similar to, although smaller than, the deviance related to language; consequently, it has a similar interpretation.

Result 2: There is a small but significant relationship between language class and defects. Functional languages have a smaller relationship to defects than either procedural or scripting languages.

Having discussed the relationship between language class and defects, we revisit the question of application domain. As before, we ask whether domain has an interaction with language class. Does the choice of, e.g., a functional language have an advantage for a particular domain? For this pair of factors, the contingency table conforms to the test's assumptions. As above, a Chi-Square test for the relationship between these factors and the project domain yields a value of 99.05 with df = 30 and p = 2.622e-09, allowing us to reject the null hypothesis that the factors are independent. Cramer's V yields a value of 0.133, a weak level of association. Consequently, although there is some relation between domain and language, there is only a weak relationship between domain and language class.

It is somewhat unsatisfying that we do not observe a strong association between language, or language class, and domain within a project. An alternative way to view this same data is to aggregate defects over all languages and domains, disregarding the relationship to projects. Since we cannot view this data as independent samples, we do not attempt to analyze it statistically; rather, we take a descriptive, visualization-based approach.

We define defect proneness as the ratio of bug fix commits over total commits, per language per domain. Figure 1 illustrates the interaction between domain and language using a heat map, where defect proneness increases from the lighter to the darker zone. We investigate which language factors influence defect fixing commits across a collection of projects written in a variety of languages. This leads to the following research question:

RQ3. Does language defect proneness depend on domain?

A first glance at Figure 1(a) reveals that the defect proneness of the languages does indeed appear to depend on the domain. For example, in the Middleware domain, JavaScript is most defect prone (31.06% defect proneness). This was a little surprising to us, since JavaScript is typically not used in the Middleware domain. On a closer look, we find that JavaScript has only one project, v8 (Google's JavaScript virtual machine), in the Middleware domain, and it is responsible for all the errors.

Also, Scala is most defect prone in the Application domain, with a defect density (number of bug fix commits over total commits) of 52.67%. However, a single project, zipkin, with a defect density of 80%, contributes most of the defects. Similarly, Perl and C are the two most buggy languages in the Database domain, with defect proneness of 58.63% and 61.47% respectively. However, we find that the mysql project is solely responsible for this high defect count; the overall defect density of mysql is 70.25%.

Thus, it turns out that the variation of defect density across domains and languages may be an attribute of individual projects. To verify this, we re-evaluate the domain-language interaction after ignoring the outliers. We filter out the projects that have defect density below the 10th percentile and above the 90th percentile. Figure 1(b) shows the result. Note that the outliers' effects are also controlled in all our regression models, as we filter them out as high-leverage points, as discussed in Section 2.6.

The variation of defect proneness across languages per domain is much subdued in the new heat map. The remaining variation comes from the inherent defect proneness of the languages, as we have seen in RQ1. To confirm this, we measure the pairwise rank correlation (Spearman correlation [37]) between the language defect proneness for each domain and the overall ordering. For all the domains, the correlation is positive, and the p-values are significant (< 0.01) except for the Database domain. This shows that, with respect to defect proneness, the language ordering in each domain (except Database) is correlated with the overall language ordering. Hence, domain has little significance when it comes to the defect proneness of languages.

               | APP  | CA   | DB   | FW   | LIB  | MW
Spearman Corr. | 0.71 | 0.56 | 0.30 | 0.76 | 0.90 | 0.46
p-value        | 0.00 | 0.02 | 0.28 | 0.00 | 0.00 | 0.09
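The rank-correlation check is a one-liner per domain. A sketch, assuming a proneness table of defect proneness with languages as rows and domains as columns, including an Overall column, as in Figure 1 (scipy assumed):

from scipy.stats import spearmanr

# `proneness`: hypothetical DataFrame of defect proneness, languages x domains,
# with an "Overall" column standing in for Figure 1's rightmost column.
for domain in [c for c in proneness.columns if c != "Overall"]:
    rho, p = spearmanr(proneness[domain], proneness["Overall"], nan_policy="omit")
    print(f"{domain}: rho={rho:.2f}, p={p:.2f}")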
Figure 1: Heat maps of defect proneness (bug_pcent: percentage of bug fix commits) across languages (C through Scala) and domains (Application, CodeAnalyzer, Database, Framework, Library, Middleware, Overall). (a) Variation of defect proneness across languages for a given domain. (b) Variation of defect proneness across languages for a given domain after removing the outliers.
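Figure 1 itself is straightforward to reconstruct. A sketch with pandas/matplotlib (our tooling choice, not necessarily the authors'), given hypothetical per-commit rows tagged with language, domain, and a bug-fix flag:

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical input: one row per commit with language, domain, is_bugfix.
commits = pd.read_csv("tagged_commits.csv")
proneness = 100 * commits.pivot_table(
    index="language", columns="domain", values="is_bugfix", aggfunc="mean")

fig, ax = plt.subplots()
im = ax.imshow(proneness.to_numpy(), cmap="Reds")   # darker = more defect prone
ax.set_xticks(range(len(proneness.columns)), proneness.columns, rotation=90)
ax.set_yticks(range(len(proneness.index)), proneness.index)
fig.colorbar(im, label="bug_pcent")
plt.show()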
Result 3: There is no general relationship between application domain and language defect proneness.

RQ4. What is the relation between language and bug category?

As described in Section 2.6, we again use NBR models, here relating the languages to each category of defects while controlling for the same covariates; Table 8 summarizes these models. The deviance explained by defect type is similar in magnitude for most of the categories. We interpret this relationship to mean that language has a greater impact on specific categories of bugs than it does on bugs overall. In the following section, we elaborate on the results for the bug categories with significant bug counts, as reported in Table 5. However, our conclusion generalizes to all categories.
Table 8: While the impact of language on defects varies across defect categories, language has a greater impact on specific categories than it does on defects in general. For all models above, the deviance explained by language type has p < 3.031e-07.

Memory Errors. The regression results in Table 8 show that languages in the unmanaged class introduce more memory errors than the average, with statistical significance. Among the managed languages, Java has significantly more memory errors than the average, though its regression coefficient is smaller than those of the unmanaged languages. Though Java has its own garbage collector, memory leaks are not surprising, since unused object references often prevent the garbage collector from reclaiming memory [18]. In fact, we notice that 28.89% of all the memory errors in Java come from memory leaks. In terms of effect size, language has a larger impact on memory defects than on all other cause categories.

Concurrency Errors. 1.99% of the total bug fix commits are related to concurrency errors. The heat map shows that the Proc-Static-Weak-Unmanaged class dominates concurrency errors. In this class, C and C++ introduce 19.15% and 7.89% of the concurrency errors, and they are distributed across the projects. The regression results of Table 8 also confirm this, with significant p-values. The language classes Proc-Static-Strong-Managed and Functional-Static-Strong-Managed are also in the darker zone of the heat map. The major languages contributing concurrency errors from these classes are Go, C#, and Scala, and their regression coefficients are also statistically significant. These results confirm that, in general, static languages produce more concurrency errors than others. Among the dynamic languages, only Erlang is more prone to concurrency errors. The regression analysis also shows that projects written in dynamic languages like CoffeeScript, TypeScript, Ruby, and Php have fewer concurrency errors (note the statistically significant negative coefficients in Table 8).

         | C     | C++   | C#    | Java  | Scala | Go    | Erlang
race     | 63.11 | 41.46 | 77.70 | 65.35 | 74.07 | 92.08 | 78.26
deadlock | 26.55 | 43.36 | 14.39 | 17.08 | 18.52 | 10.89 | 15.94
SHM      | 28.78 | 18.24 |  9.36 |  9.16 |  8.02 |  0    |  0
MPI      |  0    |  2.21 |  2.16 |  3.71 |  4.94 |  1.98 | 10.14

A textual analysis based on the word frequency of the bug fix messages suggests that most of the concurrency errors occur due to race conditions, deadlocks, or incorrect synchronization, as shown in the table above. In all the languages, race conditions are most frequent, ranging from 41% in C++ to 92% in Go. The enrichment of race condition errors in Go is likely because Go is distributed with a race-detection tool that may give Go developers an advantage in detecting races. Deadlocks are also noteworthy, ranging from 43.36% in C++ to 10.89% in Go. The synchronization errors are mostly related to message passing (MPI) or shared memory operations (SHM). Erlang and Go use message passing [13, 4] (which does not require locking of shared resources) for inter-thread communication, which explains why these two languages do not have any SHM-related errors, such as locking or mutex errors. In contrast, projects in the other languages use SHM primitives for communication and can thus have locking-related errors.

Security and Other Impact Errors. Around 7.33% of all the bug fix commits are related to Impact errors. Among them, Erlang, C, C++, and Go produce more security errors than average, as confirmed by the regression model (Table 8). The regression also suggests that projects written in TypeScript and Clojure are less likely to introduce security errors than average (Figure 2). From the heat map we also see that static languages are in general more prone to failure and performance errors, followed by Functional-Dynamic-Strong-Managed languages. In the latter category, Erlang is more prone to induce failures (positive regression coefficient with statistical significance). The analysis of deviance results confirms that language is strongly associated with failure impacts. While security errors are the weakest among the categories with respect to the residual deviance of the model, the deviance explained by language is still quite strong.

Result 4: Defect types are strongly associated with languages; some defect types, such as memory and concurrency errors, also depend on language primitives. Language matters more for specific defect categories than it does for defects overall.
4. RELATED WORK
Prior work on programming language comparison falls into three categories:

(1) Controlled experiments. For a given task, developers are monitored while programming in different languages. Researchers then compare outcomes such as development effort and program quality. Hanenberg et al. [14] compared static vs. dynamic typing by monitoring 48 programmers for 27 hours while they developed a parser program. They found no significant difference in code quality between the two, although the dynamically typed language had shorter development time. Their study was conducted with undergraduate students in a lab setting, with a custom-designed language and IDE. Our study, by contrast, is a field study of popular software applications. While we can only indirectly (and post facto) control for confounding factors using regression, we benefit from much larger sample sizes and more realistic, widely used software. We find that statically typed languages in general are less defect prone than the dynamically typed ones, and that strong typing is better than weak typing in the same regard. The effect sizes are modest; it could reasonably be argued that they are visible here precisely because of the large sample sizes.

Harrison et al. [15] compared C++, a procedural language, with SML, a functional language, finding no significant difference in the total number of errors, although SML had higher defect density than C++. SML is not represented in our data, which, however, suggests that functional languages are generally less defect prone than procedural languages. Another line of work primarily focuses on comparing development effort across different languages [19, 27]; however, it does not analyze language defect proneness.

(2) Surveys. Meyerovich et al. surveyed developers' views of programming languages to study why some languages are more popular than others [23]. They report a strong influence from non-linguistic factors: prior language skills, availability of open source tools, and existing legacy systems. Such factors also arise in our findings: we confirm that the availability of external tools impacts software quality; for example, Go has a lot more concurrency bugs related to race conditions, plausibly due to its race condition detection tool (see RQ4 in Section 3).

(3) Repository mining. Bhattacharya et al. [5] studied four projects developed in both C and C++ and found that the software components developed in C++ are in general more reliable than those in C. We find that both C and C++ are more defect prone than the average across all the studied languages, although C++ has a higher regression coefficient (0.23) than C (0.15) (see Table 6). However, for certain bug types, like concurrency errors, C is more defect prone than C++ (see RQ4 in Section 3).

5. THREATS TO VALIDITY
We recognize a few threats to our reported results. First, to identify bug fix commits we did not check a bug database; instead, we rely on the keywords that developers often use to indicate a bug fix commit. Our choice was deliberate: we wanted to capture the issues that developers continuously face in an ongoing development process, not just the reported bugs. However, this choice poses a threat of overestimation. Our categorization of the domains is subject to interpreter bias, although another member of our group verified the categories. Also, our effort to categorize a large number of bug fix commits could potentially raise some questions; in particular, the categorization can be tainted by the initial choice of keywords. Also, the descriptiveness of commit logs varies across projects. To mitigate these threats, we evaluated our classification against manual annotation, as discussed in Section 2.5.

To interpret the language classes in Section 2.3, we make certain assumptions based on how a language property is most commonly used, as reflected in our data set. For instance, we classify Objective-C as an unmanaged-memory language, although it may follow a hybrid memory model. Similarly, we annotate Scala as a functional and C# as a procedural language, although they support both procedural and functional designs [26, 28]. We do not distinguish object-oriented languages (OOP) in this work, as there is no clear distinction between pure OOP languages and procedural languages; the difference largely depends on programming style. We categorize C++ as weakly typed because a memory region of a certain type can be treated differently using pointer manipulation [29], although, depending on the compiler, some C++ type errors can be detected at compile time. We further exclude TypeScript from our language classification model (see Table 3 and Table 7): TypeScript is intended to be used as a static, strongly typed language, but in practice we notice that developers often (for 50% of the variables, and in all the TypeScript-using projects in our dataset) use the any type, a catch-all union type, which makes TypeScript dynamic and weak.

Finally, we associate defect fixing commits with language properties, although they could reflect reporting style or other developer properties. The availability of external tools or libraries may also impact the extent of bugs associated with a language.

6. CONCLUSION
We have presented a large-scale study of language type and use, as it relates to software quality. The GitHub data we used is characterized by its complexity and variance along multiple dimensions of language, language type, usage domain, amount of code, sizes of commits, and the various characteristics of the many issue types.

The large GitHub dataset's sample size allows a mixed-methods study of the effects of language, while controlling for a number of confounds. Through a combination of regression modeling, text analytics, and visualization, we have examined the interactions of language, domain, and defect type. The data indicates that functional languages are better than procedural languages; it suggests that strong typing is better than weak typing; that static typing is better than dynamic typing; and that managed memory usage is better than unmanaged. Further, the defect proneness of languages in general is not associated with software domains. Also, languages are more related to individual bug categories than to bugs overall.

On the other hand, even large datasets become small and insufficient when they are sliced and diced many ways simultaneously, i.e., when the underlying connectivity between variables is rich. The implication is that the more dependent variables there are, the more difficult it becomes (vis-a-vis the amount of data available) to answer questions about a specific variable's effect on any outcome where interactions with other variables exist. Hence, we are unable to quantify the specific effects of language type on usage. Additional methods, such as surveys, could be helpful here. Addressing these challenges remains future work.

7. ACKNOWLEDGEMENTS
We thank Sameer Khatri for cross-checking the domain categorization. We acknowledge support from the National Science Foundation under Grants No. CCF-1247280 and CCF-1446683, and from AFOSR award FA955-11-1-0246.

8. REFERENCES
[1] GitHub Archive, https://githubarchive.org/.
[2] GitHub documentation, https://help.github.com/articles/stars.
[3] Google BigQuery, https://developers.google.com/bigquery/.
[4] J. Armstrong, R. Virding, C. Wikström, and M. Williams. Concurrent programming in Erlang. 1993.
[5] P. Bhattacharya and I. Neamtiu. Assessing programming language impact on development and maintenance: A study on C and C++. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 171-180, New York, NY, USA, 2011. ACM.
[6] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code!: examining the effects of ownership on software quality. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 4-14. ACM, 2011.
[7] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77-84, 2012.
[8] J. Cohen. Applied multiple regression/correlation analysis for the behavioral sciences. Lawrence Erlbaum, 2003.
[9] H. Cramér et al. Mathematical methods of statistics. Princeton University Press, 1946.
[10] S. Easterbrook, J. Singer, M.-A. Storey, and D. Damian. Selecting empirical methods for software engineering research. In Guide to Advanced Empirical Software Engineering, pages 285-311. Springer, 2008.
[11] K. El Emam, S. Benlarbi, N. Goel, and S. N. Rai. The confounding effect of class size on the validity of object-oriented metrics. IEEE Transactions on Software Engineering, 27(7):630-650, 2001.
[12] GitHub. Linguist: https://github.com/github/linguist.
[13] Google. http://golang.org/doc/effective_go.html#concurrency.
[14] S. Hanenberg. An experiment about static and dynamic type systems: Doubts about the positive impact of static type systems on development time. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '10, pages 22-35, New York, NY, USA, 2010. ACM.
[15] R. Harrison, L. Smaraweera, M. Dobie, and P. Lewis. Comparing programming paradigms: an evaluation of functional and object-oriented programs. Software Engineering Journal, 11(4):247-254, 1996.
[16] D. E. Harter, M. S. Krishnan, and S. A. Slaughter. Effects of process maturity on quality, cycle time, and effort in software product development. Management Science, 46(4):451-466, 2000.
[17] R. Hindley. The principal type-scheme of an object in combinatory logic. Transactions of the American Mathematical Society, pages 29-60, 1969.
[18] M. Jump and K. S. McKinley. Cork: dynamic memory leak detection for garbage-collected languages. In ACM SIGPLAN Notices, volume 42, pages 31-38. ACM, 2007.
[19] S. Kleinschmager, S. Hanenberg, R. Robbes, É. Tanter, and A. Stefik. Do static type systems improve the maintainability of software systems? An empirical study. In Program Comprehension (ICPC), 2012 IEEE 20th International Conference on, pages 153-162. IEEE, 2012.
[20] Z. Li, L. Tan, X. Wang, S. Lu, Y. Zhou, and C. Zhai. Have things changed now? An empirical study of bug characteristics in modern open source software. In ASID '06: Proceedings of the 1st Workshop on Architectural and System Support for Improving Software Dependability, October 2006.
[21] J. P. Marques De Sá. Applied statistics using SPSS, Statistica and Matlab. 2003.
[22] C. Mayer, S. Hanenberg, R. Robbes, É. Tanter, and A. Stefik. An empirical study of the influence of static type systems on the usability of undocumented software. In ACM SIGPLAN Notices, volume 47, pages 683-702. ACM, 2012.
[23] L. A. Meyerovich and A. S. Rabkin. Empirical analysis of programming language adoption. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, pages 1-18. ACM, 2013.
[24] R. Milner. A theory of type polymorphism in programming. Journal of Computer and System Sciences, 17(3):348-375, 1978.
[25] A. Mockus and L. G. Votta. Identifying reasons for software changes using historic databases. In ICSM '00: Proceedings of the International Conference on Software Maintenance, page 120. IEEE Computer Society, 2000.
[26] M. Odersky, L. Spoon, and B. Venners. Programming in Scala. Artima Inc., 2008.
[27] V. Pankratius, F. Schmidt, and G. Garretón. Combining functional and imperative programming for multicore software: an empirical study evaluating Scala and Java. In Proceedings of the 2012 International Conference on Software Engineering, pages 123-133. IEEE Press, 2012.
[28] T. Petricek and J. Skeet. Real World Functional Programming: With Examples in F# and C#. Manning Publications Co., 2009.
[29] B. C. Pierce. Types and programming languages. MIT Press, 2002.
[30] A. A. Porter and L. G. Votta. An experiment to assess different defect detection methods for software requirements inspections. In Proceedings of the 16th International Conference on Software Engineering, ICSE '94, pages 103-112, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press.
[31] D. Posnett, C. Bird, and P. Dévanbu. An empirical study on the influence of pattern roles on change-proneness. Empirical Software Engineering, 16(3):396-423, 2011.
[32] F. Rahman and P. Devanbu. How, and why, process metrics are better. In Proceedings of the 2013 International Conference on Software Engineering, pages 432-441. IEEE Press, 2013.
[33] L. Tan, C. Liu, Z. Li, X. Wang, Y. Zhou, and C. Zhai. Bug characteristics in open source software. Empirical Software Engineering, 2013.
[34] V. Vapnik. The nature of statistical learning theory. Springer, 2000.
[35] Q. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society, pages 307-333, 1989.
[36] E. J. Weyuker, T. J. Ostrand, and R. M. Bell. Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models. Empirical Software Engineering, 13(5):539-559, 2008.
[37] J. H. Zar. Significance testing of the Spearman rank correlation coefficient. Journal of the American Statistical Association, 67(339):578-580, 1972.