
2018 ACM/IEEE 40th International Conference on Software Engineering

Is “Better Data” Better Than “Better Data Miners”?


On the Benefits of Tuning SMOTE for Defect Prediction

Amritanshu Agrawal
Department of Computer Science, North Carolina State University, Raleigh, NC, USA
[email protected]

Tim Menzies
Department of Computer Science, North Carolina State University, Raleigh, NC, USA
[email protected]

ABSTRACT

We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria and did not (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-performance criteria while (b) fixing the weaker regions of the training data (using SMOTUNED, an auto-tuning version of SMOTE). This approach leads to dramatically large increases in software defect prediction performance: in a 5*5 cross-validation study over 3,681 JAVA classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict defects. The same pattern of improvement was observed when SMOTE and SMOTUNED were compared against the most recent class imbalance technique.

In conclusion, for software analytics tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing.

KEYWORDS

Search based SE, defect prediction, classification, data analytics for software engineering, SMOTE, imbalanced data, preprocessing

ACM Reference Format:
Amritanshu Agrawal and Tim Menzies. 2018. Is "Better Data" Better Than "Better Data Miners"?: On the Benefits of Tuning SMOTE for Defect Prediction. In ICSE '18: 40th International Conference on Software Engineering, May 27-June 3, 2018, Gothenburg, Sweden. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3180155.3180197

1 INTRODUCTION

Software quality methods cost money and better quality costs exponentially more money [16, 66]. Given finite budgets, quality assurance resources are usually skewed towards areas known to be most safety critical or mission critical [34]. This leaves "blind spots": regions of the system that may contain defects which may be missed. Therefore, in addition to rigorously assessing critical areas, a parallel activity should be to sample the blind spots [37].

To sample those blind spots, many researchers use static code defect predictors. Source code is divided into sections and researchers annotate the code with the number of issues known for each section. Classification algorithms are then applied to learn what static code attributes distinguish between sections with few/many issues. Such static code measures can be automatically extracted from the code base with little effort, even for very large software systems [44].

One perennial problem is what classifier should be used to build the predictors. Many papers report ranking studies where a quality measure is collected from classifiers when they are applied to data sets [13, 15-18, 21, 25-27, 29, 32, 33, 35, 40, 53, 62, 67]. These ranking studies report which classifiers generate the best predictors.

The research of this paper began with the question: would the use of a data pre-processor change the rankings of classifiers? SE data sets are often imbalanced, i.e., the data in the target class is overwhelmed by an over-abundance of information about everything else except the target [36]. As shown in the literature review of this paper, the overwhelming majority of papers (85%) in SE research use SMOTE to fix data imbalance [7], but SMOTE is controlled by numerous parameters which are usually tuned using engineering expertise or left at their default values. This paper proposes SMOTUNED, an automatic method for setting those parameters. When assessed on defect data from 3,681 classes (over a million lines of code) taken from open source JAVA systems, SMOTUNED outperformed both the original SMOTE [7] as well as a state-of-the-art method [4].

To assess, we ask four questions:

• RQ1: Are the default "off-the-shelf" parameters for SMOTE appropriate for all data sets?

Result 1: SMOTUNED learned different parameters for each data set, all of which were very different from default SMOTE.

• RQ2: Is there any benefit in tuning the default parameters of SMOTE for each new data set?

Result 2: Performance improvements using SMOTUNED are dramatically large, e.g., improvements in AUC up to 60% against SMOTE.

In those results, we see that while no learner was best across all data sets and performance criteria, SMOTUNED was most often seen in the best results. That is, creating better training data might be more important than the subsequent choice of classifiers.


• RQ3: In terms of runtimes, is the cost of running SMOTUNED worth the performance improvement?

Result 3: SMOTUNED terminates in under two minutes, i.e., fast enough to recommend its widespread use.

• RQ4: How does SMOTUNED perform against the most recent class imbalance technique?

Result 4: SMOTUNED performs better than a very recent imbalance handling technique proposed by Bennin et al. [4].

In summary, the contributions of this paper are:
• The discovery of an important systematic error in many prior ranking studies, i.e., all of [13, 15-18, 21, 25-27, 29, 32, 33, 35, 40, 53, 62, 67].
• A novel application of search-based SE (SMOTUNED) to handle class imbalance that out-performs the prior state-of-the-art.
• Dramatically large improvements in defect predictors.
• Potentially, for any other software analytics task that uses classifiers, a way to improve those learners as well.
• A methodology for assessing the value of pre-processing data sets in software analytics.
• A reproduction package to reproduce our results and then (perhaps) to improve or refute them (available for download from http://tiny.cc/smotuned).

The rest of this paper is structured as follows. Section 2.1 gives an overview of software defect prediction. Section 2.2 describes the performance criteria used in this paper. Section 2.3 explains the problem of class imbalance in defect prediction. Assessment of the previous ranking studies is done in Section 2.4. Section 2.5 introduces SMOTE and discusses how SMOTE has been used in the literature. Section 2.6 provides the definition of SMOTUNED. Section 3 describes the experimental setup of this paper and the above research questions are answered in Section 4. Lastly, we discuss the validity of our results and present our conclusions.

Note that the experiments of this paper only make conclusions about software analytics for defect prediction. That said, many other software analytics tasks use the same classifiers explored here: for non-parametric sensitivity analysis [41], as a pre-processor to build the tree used to infer quality improvement plans [31], to predict Github issue close time [55], and many more. That is, potentially, SMOTUNED is a sub-routine that could improve many software analytics tasks. This could be a highly fruitful direction for future research.

2 BACKGROUND AND MOTIVATION

2.1 Defect Prediction

Software programmers are intelligent, but busy people. Such busy people often introduce defects into the code they write [20]. Testing software for defects is expensive and most software assessment budgets are finite. Meanwhile, assessment effectiveness increases exponentially with assessment effort [16]. Such exponential costs exhaust finite resources, so software developers must carefully decide what parts of their code need the most testing.

A variety of approaches have been proposed to recognize defect-prone software components using code metrics (lines of code, complexity) [10, 38, 40, 45, 58] or process metrics (number of changes, recent activity) [22]. Other work, such as that of Bird et al. [5], indicated that it is possible to predict which components (e.g., modules) are likely locations of defect occurrence using a component's development history and dependency structure. Prediction models based on the topological properties of the components within them have also proven to be accurate [71].

The lesson of all the above is that the probable location of future defects can be guessed using logs of past defects [6, 21]. These logs might summarize software components using static code metrics such as McCabe's cyclomatic complexity, Briand's coupling metrics, dependencies between binaries, or the CK metrics [8] (described in Table 1). One advantage of the CK metrics is that they are simple to compute and hence they are widely used. Radjenović et al. [53] reported that in static code defect prediction, the CK metrics are used twice as much (49%) as more traditional source code metrics such as McCabe's (27%) or process metrics (24%). The static code measures that can be extracted from software are shown in Table 1. Note that such attributes can be collected automatically, even for very large systems [44]. Other methods, like manual code reviews, are far slower and far more labor intensive.

Static code defect predictors are remarkably fast and effective. Given the current generation of data mining tools, it can be a matter of just a few seconds to learn a defect predictor (see the runtimes in Table 9 of reference [16]). Further, a recent study by Rahman et al. [54] found no significant differences in the cost-effectiveness of (a) the static code analysis tools FindBugs and Jlint, and (b) static code defect predictors. This is an interesting result since it is much slower to adapt static code analyzers to new languages than defect predictors (since the latter just requires hacking together some new static code metric extractors).

2.2 Performance Criteria

Formally, defect prediction is a binary classification problem. The performance of a defect predictor can be assessed via a confusion matrix like Table 2, where a "positive" output is the defective class under study and a "negative" output is the non-defective one. Further, "false" means the learner got it wrong and "true" means the learner correctly identified a fault or non-fault module. Hence, Table 2 has four quadrants containing, e.g., FP which denotes "false positive".

Table 2: Results Matrix
Prediction    Actual: false   Actual: true
defect-free   TN              FN
defective     FP              TP

From this matrix, we can define performance measures like:
• Recall = pd = TP/(TP + FN)
• Precision = prec = TP/(TP + FP)
• False Alarm = pf = FP/(FP + TN)
• Area Under Curve (AUC), which is the area covered by an ROC curve [11, 60] in which the X-axis represents the false positive rate and the Y-axis represents the true positive rate.
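To make these definitions concrete, the sketch below computes all four measures from a set of predictions. This is our own minimal illustration (using scikit-learn, the same toolkit used in the experiments below); the function name and variables are ours, not the paper's.

    # Minimal sketch: the four measures of Section 2.2, from predictions.
    # Assumes binary labels where 1 = defective ("positive") and 0 = defect-free.
    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    def defect_measures(actual, predicted, scores=None):
        tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
        recall      = tp / (tp + fn)      # pd: fraction of defective modules found
        precision   = tp / (tp + fp)      # prec: fraction of alarms that are real defects
        false_alarm = fp / (fp + tn)      # pf: fraction of clean modules wrongly flagged
        # AUC needs a ranking score (e.g., predicted probability of the defective class);
        # fall back to the hard predictions if no score is supplied.
        auc = roc_auc_score(actual, scores if scores is not None else predicted)
        return dict(recall=recall, precision=precision, false_alarm=false_alarm, auc=auc)

    # Example: actual=[0,1,1,0], predicted=[0,1,0,0] gives recall 0.5, precision 1.0, pf 0.0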
As shown in Figure 1, a typical predictor must "trade-off" between false alarm and recall. This is because the more sensitive the detector, the more often it triggers and the higher its recall. If a detector triggers more often, it also raises more false alarms.


Table 1: OO CK code metrics used for all studies in this paper. The last line denotes the dependent variable.

amc (average method complexity): e.g., number of JAVA byte codes
avg_cc (average McCabe): average McCabe's cyclomatic complexity seen in a class
ca (afferent couplings): how many other classes use the specific class
cam (cohesion amongst classes): summation of the number of different types of method parameters in every method, divided by a multiplication of the number of different method parameter types in the whole class and the number of methods
cbm (coupling between methods): total number of new/redefined methods to which all the inherited methods are coupled
cbo (coupling between objects): increased when the methods of one class access services of another
ce (efferent couplings): how many other classes are used by the specific class
dam (data access): ratio of the number of private (protected) attributes to the total number of attributes
dit (depth of inheritance tree)
ic (inheritance coupling): number of parent classes to which a given class is coupled
lcom (lack of cohesion in methods): number of pairs of methods that do not share a reference to an instance variable
lcom3 (another lack of cohesion measure): if m and a are the number of methods and attributes in a class, and mu(a_j) is the number of methods accessing attribute a_j, then lcom3 = ((1/a) * sum_j mu(a_j) - m) / (1 - m)
loc (lines of code)
max_cc (maximum McCabe): maximum McCabe's cyclomatic complexity seen in a class
mfa (functional abstraction): number of methods inherited by a class plus the number of methods accessible by member methods of the class
moa (aggregation): count of the number of data declarations (class fields) whose types are user-defined classes
noc (number of children)
npm (number of public methods)
rfc (response for a class): number of methods invoked in response to a message to the object
wmc (weighted methods per class)
nDefects (raw defect counts): numeric; number of defects found in post-release bug-tracking systems
defects present? (boolean): true if nDefects > 0, else false
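As a small illustration of the last line of Table 1, the sketch below derives the boolean dependent variable from the raw defect counts. The file name and column names are assumptions for illustration only, not the authors' exact data layout.

    # Sketch: build the binary target from raw post-release defect counts.
    # "ck_metrics.csv" and the column name "nDefects" are illustrative assumptions.
    import pandas as pd

    data = pd.read_csv("ck_metrics.csv")
    data["defects_present"] = data["nDefects"] > 0                 # true if one or more defects
    X = data.drop(columns=["nDefects", "defects_present"])         # the CK metrics of Table 1
    y = data["defects_present"].astype(int)                        # 1 = defective, 0 = defect-free
    print("defective ratio:", y.mean())                            # e.g., about 0.02 for jEdit 4.3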

Hence, when increasing recall, we should expect the false alarm rate to increase (ideally, not by very much).

There are many more ways to evaluate defect predictors besides the four listed above. Previously, Menzies et al. catalogued dozens of them (see Table 23.2 of [39]) and even several novel ones were proposed (balance, G-measure [38]).

Figure 1: Trade-offs of false alarm vs recall (probability of detection).

But no evaluation criterion is "best", since different criteria are appropriate in different business contexts. For example, as shown in Figure 1, when dealing with safety-critical applications, management may be "risk averse" and hence may elect to maximize recall, regardless of the time wasted exploring false alarms. Similarly, when rushing some non-safety-critical application to market, management may be "cost averse" and elect to minimize false alarms since this avoids distractions to the developers.

In summary, there are numerous evaluation criteria and numerous business contexts where different criteria might be preferred by different local business users. In response to this cornucopia of evaluation criteria, we make the following recommendations: (a) do evaluate learners on more than one criterion; (b) do not evaluate learners on all criteria (there are too many), and instead apply the criteria widely seen in the literature. Applying this advice, this paper evaluates defect predictors using the four criteria mentioned above (since these are widely reported in the literature [16, 17]) but not other criteria that have yet to gain wide acceptance (i.e., balance and G-measure).

2.3 Defect Prediction and Class Imbalance

Class imbalance is concerned with the situation where some classes of the data are highly under-represented compared to other classes [23]. By convention, the under-represented class is called the minority class, and correspondingly the class which is over-represented is called the majority class. In this paper, we say that class imbalance is worse as the ratio of minority class to majority class shrinks; that is, a class imbalance of 5:95 is worse than 20:80. Menzies et al. [36] reported that SE data sets often contain class imbalance. In their examples, they showed static code defect prediction data sets with class imbalances of 1:7, 1:9, 1:10, 1:13, 1:16 and 1:249.

The problem of class imbalance is sometimes discussed in the software analytics community. Hall et al. [21] found that models based on C4.5 under-perform if they have imbalanced data while Naive Bayes and Logistic Regression perform relatively better. Their general recommendation is to not use imbalanced data. Some researchers offer preliminary explorations into methods that might mitigate class imbalance. Wang et al. [67] and Yu et al. [69] validated the Hall et al. results and concluded that the performance of C4.5 is unstable on imbalanced data sets while Random Forest and Naive Bayes are more stable. Yan et al. [68] applied fuzzy logic and rules to overcome the imbalance problem, but they only explored one kind of learner (Support Vector Machines). Pelayo et al. [49] studied the effects of the percentages of oversampling and undersampling performed. They found that different percentages of each help improve the accuracy of decision tree learners for defect prediction using CK metrics. Menzies et al. [42] undersampled the non-defect class to balance training data and reported how little information was required to learn a defect predictor. They found that throwing away data does not degrade the performance of Naive Bayes and C4.5 decision trees. Other papers [49, 50, 57] have shown the usefulness of resampling based on different learners.

We note that many researchers in this area [19, 67, 69] refer to the SMOTE method explored in this paper, but only in the context of future work. One rare exception to this general pattern is the recent paper by Bennin et al. [4], which we explore as part of RQ4.

2.4 Ranking Studies

A constant problem in defect prediction is what classifier should be applied to build the defect predictors. To address this problem, many researchers run ranking studies where performance scores are collected from many classifiers executed on many software defect data sets [13, 16-18, 21, 25-27, 29, 32, 33, 35, 40, 53, 62, 67].


Table 3: Classifiers used in this study. Rankings from Ghotra et al. [17].

Rank 1 ("best"):
  RF = random forest: random forest of entropy-based decision trees.
  LR = Logistic Regression: a generalized linear regression model.
Rank 2:
  KNN = K-means: classify a new instance by finding "k" examples of similar instances. Ghotra et al. suggested K = 8.
  NB = Naive Bayes: classify a new instance by (a) collecting the mean and standard deviations of attributes in old instances of different classes; (b) returning the class whose attributes are statistically most similar to the new instance.
Rank 3:
  DT = decision trees: recursively divide data by selecting attribute splits that reduce the entropy of the class distribution.
Rank 4 ("worst"):
  SVM = support vector machines: map the raw data into a higher-dimensional space where it is easier to distinguish the examples.

This section assesses those ranking studies. We will say a ranking study is "good" if it compares multiple learners using multiple data sets and multiple evaluation criteria while, at the same time, doing something to address the data imbalance problem.

In July 2017, we searched scholar.google.com for the conjunction of "software" and "defect prediction" and "OO" and "CK" published in the last decade. This returned 231 results. We only selected the OO and CK keywords since CK metrics are more popular and better than process metrics for software defect prediction [53]. From that list, we selected "highly-cited" papers, which we defined as having more than 10 citations per year. This reduced our population of papers down to 107. After reading the titles and abstracts of those papers, and skimming the contents of the potentially interesting papers, we found the 22 papers of Table 4 that either performed ranking studies (as defined above) or studied the effects of class imbalance on defect prediction. In the column "evaluated using multiple criteria", papers scored more than "1" if they used multiple performance scores of the kind listed at the end of Section 2.2.

We find that, in those 22 papers from Table 4, numerous studies used AUC as the measure to evaluate their defect predictors. We also found that the majority of papers in the SE community that handled data imbalance (from the last column of Table 4, 6/7 = 85%) used SMOTE to fix it [4, 28, 49, 50, 61, 67]. This also motivated us to propose SMOTUNED. As noted in [17, 32], no single classification technique always dominates. That said, Table IX of a recent study by Ghotra et al. [17] ranks numerous classifiers using data similar to what we use here (i.e., OO JAVA systems described using CK metrics). Using their work, we can select a range of classifiers for this study, ranking from "best" to "worst": see Table 3.

The key observation to be made from this survey is that, as shown in Figure 2, the overwhelming majority of prior papers in our sample do not satisfy our definition of a "good" project (the sole exception is the recent Bennin et al. [4], which we explore in RQ4). Accordingly, the rest of this paper defines and executes a "good" ranking study, with the additional unique feature of an auto-tuning version of SMOTE.

Table 4: 22 highly cited software defect prediction studies, listing publication year, citation count, and the number of evaluation criteria used. The original table also marks, for each study, whether it ranked classifiers and whether it considered data imbalance.

Ref    Year   Citations   Evaluated using multiple criteria?
[38]   2007   855         2
[32]   2008   607         1
[13]   2008   298         2
[40]   2010   178         3
[18]   2008   159         1
[30]   2011   153         2
[53]   2013   150         1
[25]   2008   133         1
[67]   2013   115         1
[35]   2009   92          1
[33]   2012   79          2
[28]   2007   73          2
[49]   2007   66          1
[27]   2009   62          3
[29]   2010   60          1
[17]   2015   53          1
[26]   2008   41          1
[62]   2016   31          1
[61]   2015   27          2
[50]   2012   23          1
[16]   2016   15          1
[4]    2017   0           3

Figure 2: Summary of Table 4.

2.5 Handling Data Imbalance with SMOTE

SMOTE handles class imbalance by changing the frequency of different classes in the training data [7]. The algorithm's name is short for "synthetic minority over-sampling technique". When applied to data, SMOTE sub-samples the majority class (i.e., deletes some examples) while super-sampling the minority class until all classes have the same frequency. In the case of software defect data, the minority class is usually the defective class.

Figure 3 shows how SMOTE works. During super-sampling, a member of the minority class finds its k nearest neighbors. It builds an artificial member of the minority class at some point in-between itself and one of its random nearest neighbors. During that process, some distance function is required, which is the minkowski_distance function.

Figure 3: Pseudocode of SMOTE

    def SMOTE(k=2, m=50%, r=2):    # defaults
        while Majority > m do
            delete any majority item    # random
        while Minority < m do
            add something_like(any minority item)

    def something_like(X0):
        relevant = emptySet
        k1 = 0
        while(k1++ < 20 and size(relevant) < k) {
            all = k1 nearest neighbors
            relevant += items in "all" of X0 class }
        Z = any of relevant
        Y = interpolate(X0, Z)
        return Y

    def minkowski_distance(a, b, r):
        return (sum_i abs(a_i - b_i)^r)^(1/r)

SMOTE's control parameters are (a) k, which selects how many neighbors to use (defaults to k = 5); (b) m, which is how many examples of


each class need to be generated (defaults to m = 50% of the total training samples); and (c) r, which selects the distance function (default is r = 2, i.e., use Euclidean distance).
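To make the roles of k, m and r concrete, here is a minimal SMOTE-style over-sampler written in Python. It follows the spirit of Figure 3 but is our own simplified sketch (it only super-samples the minority class and omits the majority sub-sampling step); it is not the implementation used in the paper's experiments.

    # Minimal SMOTE-style over-sampling sketch (illustrative only).
    # k : neighbors consulted per minority example
    # m : number of synthetic minority examples to add
    # r : power parameter of the Minkowski distance
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_like(X_minority, k=5, m=100, r=2, seed=1):
        rng = np.random.default_rng(seed)
        # Minkowski distance with power r (r=2 is Euclidean, as in default SMOTE);
        # note scikit-learn's tree-based neighbor search requires p >= 1.
        nn = NearestNeighbors(n_neighbors=k + 1, metric="minkowski", p=r).fit(X_minority)
        _, idx = nn.kneighbors(X_minority)          # idx[:, 0] is the point itself
        synthetic = []
        for _ in range(m):
            i = rng.integers(len(X_minority))        # pick a random minority example...
            j = rng.choice(idx[i, 1:])               # ...and one of its k nearest neighbors
            gap = rng.random()                       # interpolate somewhere between the two
            synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
        return np.vstack(synthetic)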
In the software analytics literature, there are contradictory findings on the value of applying SMOTE for software defect prediction. Van et al. [64], Pears et al. [47] and Tan et al. [61] found SMOTE to be advantageous, while others, such as Pelayo et al. [49], did not.

Further, some researchers report that some learners respond better than others to SMOTE. Kamei et al. [28] evaluated the effects of SMOTE applied to four fault-proneness models (linear discriminant analysis, logistic regression, neural network, and decision tree) using two module sets of industrial legacy software. They reported that SMOTE improved the prediction performance of the linear and logistic models, but not of the neural network and decision tree models. Similar results, that the value of SMOTE was dependent on the learner, were also reported by Van et al. [64].

Recently, Bennin et al. [4] proposed a new method based on the chromosomal theory of inheritance. Their MAHAKIL algorithm interprets two distinct sub-classes as parents and generates a new synthetic instance that inherits different traits from each parent and contributes to the diversity within the data distribution. They report that MAHAKIL usually performs as well as SMOTE, but does much better than all other class balancing techniques in terms of recall. Please note that that work did not consider the impact of parameter tuning of a preprocessor, so in our RQ4 we will compare SMOTUNED to MAHAKIL.

2.6 SMOTUNED = auto-tuning SMOTE

One possible explanation for the variability in the SMOTE results is that the default parameters of this algorithm are not suited to all data sets. To test this, we designed SMOTUNED, an auto-tuning version of SMOTE. SMOTUNED uses different control parameters for different data sets.

SMOTUNED uses DE (differential evolution [59]) to explore the parameter space of Table 5. DE is an optimizer useful for functions that may not be smooth or linear. Vesterstrom et al. [65] found DE's optimizations to be competitive with other optimizers like particle swarm optimization or genetic algorithms. DEs have been used before for parameter tuning [2, 9, 14, 16, 46], but this paper is the first attempt to do DE-based class re-balancing for SE data by studying multiple learners for multiple evaluation criteria.

Table 5: SMOTE parameters
Param   Default used by SMOTE   Tuning range (explored by SMOTUNED)   Description
k       5                       [1, 20]                               Number of neighbors
m       50%                     {50, 100, 200, 400}                   Number of synthetic examples to create, expressed as a percent of the final training data
r       2                       [0.1, 5]                              Power parameter for the Minkowski distance metric

Table 6: Important terms of the SMOTUNED algorithm
Keyword                            Description
Differential weight (f = 0.7)      Mutation power
Crossover probability (cf = 0.3)   Survival of the candidate
Population size (n = 10)           Frontier size in a generation
Lives                              Number of generations
Fitness function (better)          Driving factor of DE
rand() function                    Returns a value between 0 and 1
Best (or output)                   Optimal configuration for SMOTE

Figure 4: SMOTUNED uses DE (differential evolution).

    def DE(n=10, cf=0.3, f=0.7):     # default settings
        frontier = sets of guesses (n=10)
        best = frontier.1            # any value at all
        lives = 1
        while(lives-- > 0):
            tmp = empty
            for i = 1 to |frontier|:                  # size of frontier
                old = frontier_i
                x, y, z = any three from frontier, picked at random
                new = copy(old)
                for j = 1 to |new|:                   # for all attributes
                    if rand() < cf                    # at probability cf...
                        new.j = x.j + f * (z.j - y.j) # ...change item j
                new = new if better(new, old) else old
                tmp_i = new
                if better(new, best) then
                    best = new
                    lives++                           # enable one more generation
            frontier = tmp
        return best

In Figure 4, DE evolves a frontier of candidates from an initial population, driven by a goal (like maximizing recall) evaluated using a fitness function (the better function). In the case of SMOTUNED, each candidate is a randomly selected value for SMOTE's k, m and r parameters. To evolve the frontier, within each generation, DE compares each item to a new candidate generated by combining three other frontier items (and better new candidates replace older items). To compare them, the better function calls the SMOTE function (from Figure 3) using the proposed new parameter settings. This pre-processed training data is then fed into a classifier to find a particular measure (like recall). When our DE terminates, it returns the best candidate ever seen in the entire run.

Table 6 lists the important terms of SMOTUNED when exploring SMOTE's parameter ranges, shown in Table 5. To define the parameters, we surveyed the range of settings used for SMOTE and distance functions in the SE and machine learning literature. To avoid introducing noise by overpopulating the minority samples, we use m as a percentage rather than as a raw number of examples to create. Aggarwal et al. [1] argue that with highly dimensional data, r should shrink to some fraction less than one (hence the lower bound of r = 0.1 in Table 5).
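The sketch below shows how a DE loop of this kind can wrap a SMOTE-style pre-processor to realize the SMOTUNED idea. It is our own simplified illustration: the fixed generation count (instead of the "lives" counter of Figure 4), the clipping to the Table 5 bounds, and the fitness callback are assumptions, and this is not the authors' released code.

    # Simplified DE tuner in the spirit of Figure 4 (illustrative sketch only).
    # fitness(k, m, r) should pre-process the training data with those SMOTE settings
    # (rounding k and m to integers), train a classifier, and return the chosen score
    # (e.g., recall or AUC) on a validation bin.
    import random

    BOUNDS = {"k": (1, 20), "m": (50, 400), "r": (0.1, 5.0)}    # Table 5 tuning ranges

    def de_tune(fitness, n=10, cf=0.3, f=0.7, generations=10, seed=1):
        random.seed(seed)
        def rand_candidate():
            return {p: random.uniform(lo, hi) for p, (lo, hi) in BOUNDS.items()}
        frontier = [rand_candidate() for _ in range(n)]
        scores = [fitness(**c) for c in frontier]
        for _ in range(generations):
            for i, old in enumerate(frontier):
                x, y, z = random.sample(frontier, 3)       # three frontier items at random
                new = dict(old)
                for p, (lo, hi) in BOUNDS.items():         # mutate each attribute with prob cf
                    if random.random() < cf:
                        new[p] = min(hi, max(lo, x[p] + f * (z[p] - y[p])))
                s = fitness(**new)
                if s > scores[i]:                          # better candidates replace older ones
                    frontier[i], scores[i] = new, s
        best = max(range(n), key=lambda i: scores[i])
        return frontier[best]                              # best SMOTE settings ever kept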
3 EXPERIMENTAL DESIGN

This experiment reports the effects on defect prediction of using MAHAKIL, SMOTUNED or SMOTE. Using some data Di in D, performance measure Mi in M, and classifier Ci in C, this experiment conducts the 5*5 cross-validation study defined below. Our data sets D are shown in Table 7. These are all open source JAVA OO systems described in terms of the CK metrics. Since we are studying class imbalance, only imbalanced data sets were selected from SEACRAFT (http://tiny.cc/seacraft).

Our performance measures M were introduced in Section 2.2 and include AUC, precision, recall, and false alarm. Our


classifiers C come from a recent study [17] and were listed in Table 3. For implementations of these learners, we used the open source tool Scikit-Learn [48]. Our cross-validation study [56] is defined as follows:

(1) We randomized the order of the data set Di five times. This reduces the probability that some random ordering of examples in the data will conflate our results.
(2) Each time, we divided the data Di into five bins.
(3) For each bin (the test), we trained on four bins (the rest) and then tested on the test bin as follows (a condensed sketch of this rig appears after this list):
    (a) The training set is pre-filtered using either No-SMOTE (i.e., do nothing) or SMOTE or SMOTUNED.
    (b) When using SMOTUNED, we further divide those four bins of training data: 3 bins are used for training the model, and 1 bin is used for validation in DE. DE is run to improve the performance measure Mi seen when the classifier Ci was applied to the training data. Important point: we only used SMOTE on the training data, leaving the testing data unchanged.
    (c) After pre-filtering, a classifier Ci learns a predictor.
    (d) The model is applied to the test data to collect performance measure Mi.
    (e) We print the relative performance delta between this Mi and another Mi generated from applying Ci to the raw data Di (i.e., we compare against the learner without any filtering). We finally report the median over the 25 repeats.
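The following is a condensed sketch of that rig, written under the assumptions that preprocess is one of the three pre-filters above (returning a possibly re-balanced copy of the training data) and measure is one of the four criteria of Section 2.2; it illustrates the steps above and is not the authors' exact harness.

    # Condensed 5*5 cross-validation rig (illustrative sketch of steps 1-3 above).
    # X, y are numpy arrays; learner is an untrained scikit-learn classifier.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.base import clone

    def cross_val_delta(X, y, learner, preprocess, measure, repeats=5, bins=5, seed=1):
        deltas = []
        for rep in range(repeats):                                   # step 1: re-order five times
            kf = KFold(n_splits=bins, shuffle=True, random_state=seed + rep)
            for train, test in kf.split(X):                          # steps 2-3: five bins each time
                Xtr, ytr = preprocess(X[train], y[train])            # 3a/3b: filter training data only
                raw = clone(learner).fit(X[train], y[train])         # baseline: no filtering
                pre = clone(learner).fit(Xtr, ytr)                   # 3c: learn a predictor
                deltas.append(measure(y[test], pre.predict(X[test])) # 3d/3e: performance delta
                              - measure(y[test], raw.predict(X[test])))
        return np.median(deltas)                                     # median over the 25 repeats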
Note that the above rig tunes SMOTE, but not the control parameters of the classifiers. We do this since, in this paper, we aim to document the benefits of tuning SMOTE, which, as shown below, are very large indeed. Also, it would be very useful if we could show that a single algorithm (SMOTUNED) improves the performance of defect prediction. This would allow subsequent work to focus on the task of optimizing SMOTUNED (which would be a far easier task than optimizing the tuning of a wide range of classifiers).

3.1 Within- vs Cross-Measure Assessment

We call the above rig the within-measure assessment rig since it is biased in its evaluation measures. Specifically, in this rig, when SMOTUNED is optimized for (e.g.) AUC, we do not explore the effects on (e.g.) the false alarm rate. This is less than ideal since it is known that our performance measures are inter-connected via Zhang's equation [70]. Hence, increasing (e.g.) recall might potentially have the adverse effect of driving up (e.g.) the false alarm rate. To avoid this problem, we also apply the following cross-measure assessment rig. At the conclusion of the within-measure assessment rig, we will observe that the AUC performance measure shows the largest improvements. Using that best performer, we re-apply steps 1, 2, 3a-e (listed above), but this time:
• In step 3b, we tell SMOTUNED to optimize for AUC;
• In steps 3d and 3e, we collect the performance delta on AUC as well as on precision, recall, and false alarm.
In this approach, steps 3d and 3e collect the information required to check whether succeeding according to one performance criterion results in damage to another. We also want to make sure that our model is not over-fitted to one evaluation measure. And since SMOTUNED is a time-expensive task, we do not want to tune for each measure, which would quadruple the time. The results of the within- vs cross-measure assessment are shown in Section 4.

3.2 Statistical Analysis

When comparing the results of SMOTUNED to other treatments, we use a statistical significance test and an effect size test. Significance tests are useful for detecting if two populations differ merely by random noise. Also, effect sizes are useful for checking that two populations differ by more than just a trivial amount.

For the significance test, we used the Scott-Knott procedure [17, 43]. This technique recursively bi-clusters a sorted set of numbers. If any two clusters are statistically indistinguishable, Scott-Knott reports them both as one group. Scott-Knott first looks for a break in the sequence that maximizes the expected value of the difference in the means before and after the break. More specifically, it splits l values into sub-lists m and n in order to maximize the expected value of the differences in the observed performances before and after the division. For lists l, m and n of size ls, ms and ns, where l = m union n, Scott-Knott divides the sequence at the break that maximizes:

E(Delta) = (ms/ls) * abs(m.mu - l.mu)^2 + (ns/ls) * abs(n.mu - l.mu)^2

Scott-Knott then applies some statistical hypothesis test H to check if m and n are significantly different. If so, Scott-Knott then recurses on each division. For this study, our hypothesis test H was a conjunction of the A12 effect size test (endorsed by [3]) and non-parametric bootstrap sampling [12], i.e., our Scott-Knott divided the data if both bootstrapping and an effect size test agreed that the division was statistically significant (99% confidence) and not a "small" effect (A12 >= 0.6).
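Both the split criterion and the A12 effect size are easy to compute. The sketch below illustrates one Scott-Knott split and the A12 test (the bootstrap significance check is omitted, and the helper names are ours, not the paper's):

    # Sketch: the Scott-Knott split criterion E(Delta) and the A12 effect size.
    import numpy as np

    def best_split(l):
        """Return the split point of the sorted list l that maximizes E(Delta)."""
        l = sorted(l); mu = np.mean(l); ls = len(l); best, cut = -1.0, None
        for i in range(1, ls):
            m, n = l[:i], l[i:]
            e = len(m)/ls * abs(np.mean(m) - mu)**2 + len(n)/ls * abs(np.mean(n) - mu)**2
            if e > best:
                best, cut = e, i
        return cut

    def a12(xs, ys):
        """Vargha-Delaney A12: probability that a value from xs exceeds one from ys."""
        gt = sum(1.0 for x in xs for y in ys if x > y)
        eq = sum(0.5 for x in xs for y in ys if x == y)
        return (gt + eq) / (len(xs) * len(ys))

    # In the paper's rig, a split is kept only if bootstrap sampling finds it significant
    # at 99% confidence and A12 >= 0.6 says the effect is not "small".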
Table 7: Data set statistics. Data sets are sorted from a low percentage of defective class to high. Data comes from the SEACRAFT repository: http://tiny.cc/seacraft

Version   Dataset Name   Defect %   No. of classes   Lines of code
4.3       jEdit          2          492              202,363
1.0       Camel          4          339              33,721
6.0.3     Tomcat         9          858              300,674
2.0       Ivy            11         352              87,769
1.0       Arcilook       11.5       234              31,342
1.0       Redaktor       15         176              59,280
1.7       Apache Ant     22         745              208,653
1.2       Synapse        33.5       256              53,500
1.6.1     Velocity       34         229              57,012
total                               3,681            1,034,314

4 RESULTS

RQ1: Are the default "off-the-shelf" parameters for SMOTE appropriate for all data sets?

As discussed above, the default parameters for SMOTE (k, m and r) are 5, 50% and 2. Figure 5 shows the range of parameters found by SMOTUNED across nine data sets for the 25 repeats of our cross-validation procedure. All the results in this figure are within-measure assessment results, i.e., here we ran SMOTUNED optimizing for a particular performance measure and then only collected performance for that same measure on the test set.

Figure 5: Data sets vs parameter variation when optimized for recall and results reported on recall. Panels: (a) tuned values for k (default: k = 5); (b) tuned values for m (default: m = 50%); (c) tuned values for r (default: r = 2). "Median" denotes the 50th percentile values seen in the 5*5 cross-validations and "IQR" shows the intra-quartile range, i.e., the (75-25)th percentiles.

Figure 6: SMOTUNED improvements over SMOTE. Within-measure assessment (i.e., for each of these charts, optimize for performance measure Mi, then test for performance measure Mi). For most charts, larger values are better, but for false alarm, smaller values are better. Note that the corresponding percentage of the minority class (in this case, the defective class) is written beside each data set.

In Figure 5, the median is the 50th percentile value and IQR is the (75-25)th percentile (variance). As can be seen in Figure 5, most of the learned parameters are far from the default values: (1) the median k is never less than 11; (2) the median m differs for each data set and is quite far from the default; (3) the r used in the distance function was never 2; rather, it was usually 3. Hence, our answer to RQ1 is "no": the use of off-the-shelf SMOTE should be deprecated.

We note that many of the settings in Figure 5 are very similar; e.g., median values of k = 13 and r = 3 seem to be a common result irrespective of the data imbalance percentage among the data sets. Nevertheless, we do not recommend replacing the defaults of SMOTE with the findings of Figure 5. Also, the IQR bars are very large. Clearly, SMOTUNED's decisions vary dramatically depending on what data is being processed. Hence, we strongly recommend that SMOTUNED be applied to each new data set.

RQ2: Is there any benefit in tuning the default parameters of SMOTE for each new data set?

Figure 6 shows the performance deltas of the within-measure assessment rig. Please recall that when this rig applies SMOTUNED, it optimizes for a performance measure Mi in {recall, precision, false alarm, AUC}, after which it uses the same performance measure Mi when evaluating the test data. In Figure 6, each subfigure shows results where DE was optimized for a particular Mi and results are reported against that same Mi. From Figure 6, it is observed that SMOTUNED achieves large AUC (about 60%) and recall (about 20%) improvements without damaging precision and with only minimal changes to false alarm. Another key observation is that the improvements in AUC with SMOTUNED are consistent whether the imbalance is 34% or 2%. Note also that these AUC improvements are the largest we have yet seen for any prior treatment of defect prediction data. For the raw AUC values, please see http://tiny.cc/raw_auc.

Figure 7 offers a statistical analysis of the different results achieved after applying our three data pre-filtering methods: (1) NO = do nothing, (2) S1 = use default SMOTE, and (3) S2 = use SMOTUNED. For any learner, there are three such treatments, and the darker the cell, the better the performance.


Figure 7: Scott-Knott analysis of No-SMOTE, SMOTE and SMOTUNED. The column headers are denoted No for No-SMOTE, S1 for SMOTE and S2 for SMOTUNED. A (*) mark denotes the best learner combined with its technique.
In that figure, cells with the same color are either not statistically significantly different or are different only via a small effect (as judged by the statistical methods described in Section 3.2).

The combination of pre-filter+learner that works best for any data set is marked by a "*". Since we have three pre-filtering methods and six learners, providing in total 18 treatments, the "*" marks the best learner, picked by the highest median value. In the AUC and recall results, the best "*" cell always appears in the S2=SMOTUNED column, i.e., SMOTUNED is always used by the best combination of pre-filter+learner.

As to the precision results, at first glance the results in Figure 7 look bad for SMOTUNED since, less than half the time, the best "*" happens in the S2=SMOTUNED column. But recall from Figure 6 that the absolute size of the precision deltas is very small. Hence, even though SMOTUNED "loses" in this statistical analysis, the pragmatic impact of that result is negligible. Further, given feedback from a domain expert, we could switch between SMOTE and SMOTUNED dynamically based on the measures and data miners.

As to the false alarm results of Figure 7, as discussed above in Section 2.2, the cost of increased recall is also an increased false alarm rate. For example, the greatest increase in recall was 0.58, seen in the jEdit results; this increase comes at the cost of increasing the false alarm rate by 0.20. Apart from this one large outlier, the overall pattern is that the recall improvements range from +0.18 to +0.42 (median to max) and these come at the cost of a much smaller false alarm increase of 0.07 to 0.16 (median to max).

In summary, the answer to RQ2 is that our AUC and recall results strongly endorse the use of SMOTUNED, while the precision and false alarm rates show there is little harm in using SMOTUNED.

Before moving to the next research question, we note that these results offer an interesting insight on prior ranking studies. Based on the Ghotra et al. results of Table 3, our expectation was that Random Forests (RF) would yield the best results across this defect data. Figure 7 reports that, as predicted by Ghotra et al., RF earns more "stars" than any other learner, i.e., it is seen to be "best" more often than anything else. That said, RF was only "best" in 11/36 of those results, i.e., even our "best" learner (RF) fails over half the time.

It is significant to note that SMOTUNED was consistently used by whatever learner was found to be "best" (in recall and AUC). Hence, we conclude that prior ranking study results (that only assessed different learners) have missed a much more general effect, i.e., it can be more useful to reflect on data pre-processors than on algorithm selection. To say that another way, at least for defect prediction, "better data" might be better than "better data miners".

RQ3: In terms of runtimes, is the cost of running SMOTUNED worth the performance improvement?

Figure 8 shows the mean runtimes for running a 5*5 cross-validation study for six learners on each data set. These runtimes were collected from one machine running CENTOS7, with 16 cores. Note that they do not increase monotonically with the size of the data sets, a result we can explain with respect to the internal structure of the data. Our version of SMOTE uses ball trees to optimize the nearest neighbor calculations. Hence, the runtime of that algorithm is dominated by the internal topology of the data sets rather than by the number of classes. Also, as shown in Figure 3, SMOTUNED explores the local space until it finds k neighbors of the same class. This can take a variable amount of time to terminate.
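To illustrate the ball-tree neighbor queries that dominate this runtime, here is a minimal example using scikit-learn's BallTree; the array sizes and the p value are placeholders for illustration, not the paper's settings.

    # Sketch: nearest-neighbor queries with a ball tree, as used inside a SMOTE step.
    import numpy as np
    from sklearn.neighbors import BallTree

    X_minority = np.random.rand(200, 20)                  # placeholder: 200 defective rows, 20 CK metrics
    tree = BallTree(X_minority, metric="minkowski", p=3)  # p plays the role of the tuned r (p >= 1 for tree search)
    dist, idx = tree.query(X_minority, k=6)               # self + 5 nearest neighbors per row
    # Query cost depends on how the data clusters (the tree's internal topology),
    # which is one reason runtimes need not grow monotonically with data set size.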
Figure 8: Data sets vs runtimes. Note that the numbers shown here are the mean times seen across 25 repeats of a 5*5 cross-validation study.

As expected, SMOTUNED is an order of magnitude slower than SMOTE since it has to run SMOTE many times to assess different parameter settings. That said, those runtimes are not excessively slow. SMOTUNED usually terminates in under two minutes and never takes more than half an hour. Hence, in our opinion, we answer RQ3 as "yes", since the performance increment seen in Figure 6 is more than enough to compensate for the extra CPU required by SMOTUNED.


Figure 9: SMOTUNED improvements over MAHAKIL [4]. Within-Measure assessment (i.e., for each of these charts, optimize
for performance measure Mi , then test for performance measure Mi ). Same format as Figure 6.

RQ4: How does SMOTUNED perform against a more recent class imbalance technique?

All the above work is based on tuning the original 2002 SMOTE paper [7]. While that version of SMOTE is widely used in the SE literature, it is prudent to compare SMOTUNED with more recent work. Our reading of the literature is that the MAHAKIL algorithm of Bennin et al. [4] represents the most recent work in SE on handling class imbalance. At the time of writing this paper (early August 2017), there was no reproduction package available for MAHAKIL, so we wrote our own version based on the description in that paper (available at http://tiny.cc/mahakil). We verified our implementation on their datasets and achieved values close to theirs (within 0.1); the difference could be due to different random seeds.

Figure 9 compares results from MAHAKIL with those from SMOTUNED. These results were generated using the same experimental methods as used for Figure 6 (those methods were described in Section 3.1). The following table repeats the statistical analysis of Figure 7 to report how often SMOTE, SMOTUNED, or MAHAKIL achieves the best results across nine data sets. Note that, in this table, larger values are better:

Number of wins
Treatment   AUC   Recall   Precision   False Alarm
MAHAKIL     1/9   0/9      6/9         9/9
SMOTE       0/9   1/9      0/9         0/9
SMOTUNED    8/9   8/9      3/9         0/9

These statistical tests tell us that the differences seen in Figure 9 are large enough to be significant. Looking at Figure 9, there are nine data sets on the x-axis, and the differences in precision are so small in 7 out of those 9 data sets that their pragmatic impact is small. As to AUC and recall, we see that SMOTUNED generated larger and better results than MAHAKIL (especially for recall). SMOTUNED generates slightly larger false alarms but, in 7/9 data sets, the increase in the false alarm rate is very small.

According to its authors [4], MAHAKIL was developed to reduce the false alarm rates of SMOTE, and on that criterion it succeeds (as seen in Figure 9, SMOTUNED does lead to slightly higher false alarm rates). But, as discussed above in Section 2.2, the downside of minimizing false alarms is also minimizing our ability to find defects; on that ability, measured in terms of AUC and recall, SMOTUNED does best. Hence, if this paper were a comparative assessment of SMOTUNED vs MAHAKIL, we would conclude by recommending SMOTUNED.

However, the goal of this paper is to defend the claim that "better data" could be better than "better data miners", i.e., that data pre-processing is more effective than switching to another data miner. In this regard, there is something insightful to conclude if we combine the results of both MAHAKIL and SMOTUNED. In the MAHAKIL experiments, the researchers spent some time on tuning the learners' parameters. That is, Figure 9 is really a comparison of two treatments: tuned data miners plus adjusted data, against just using SMOTUNED to adjust the data. Note that SMOTUNED still achieves better results even though the MAHAKIL treatment adjusted both the data and the data miners. Since SMOTUNED performed so well without tuning the data miners, we can conclude from the conjunction of these experiments that "better data" is better than using "better data miners".

Of course, further studies need to be done in other SE applications to support the above claim. There is also one more treatment not discussed in this paper: tuning both the data pre-processor and the data miners. This is a very, very large search space, so while we have experiments running to explore this task, at this time we have no definitive conclusions to report.

5 THREATS TO VALIDITY

As with any empirical study, biases can affect the final results. Therefore, any conclusions made from this work must be considered with the following issues in mind.

Order bias: For each data set, how the data samples are distributed into the training and test sets is completely random; there could be times when all the good samples are binned into either the training or the test set.


Figure 10: SMOTUNED improvements over SMOTE. Cross-Measure assessment (i.e., for each of these charts, optimize for AUC,
then test for performance measure Mi ). Same format as Figure 6.

To mitigate this order bias, we run the experiment 25 times, randomly changing the order of the data samples each time.

Sampling bias threatens any classification experiment, i.e., what matters there may not be true here. For example, the data sets used here come from the SEACRAFT repository and were supplied by one individual. These data sets have been used in various case studies by various researchers [24, 51, 52, 63], i.e., our results are not more biased than many other studies in this arena. That said, our nine open-source data sets are mostly from Apache. Hence it is an open issue whether our results hold for proprietary projects and for open source projects from other sources.

Evaluation bias: In terms of evaluation bias, our study is far less biased than many other ranking studies. As shown by our sample of 22 ranking studies in Table 4, 19/22 of those prior studies used fewer evaluation criteria than the four reported here (AUC, recall, precision and false alarm).

The analysis done in RQ4 could be affected by some settings which we might not have considered, since the reproduction package was not available for the original paper [4]. That said, another, more subtle, evaluation bias arises in Figure 6. The four plots of that figure are four different runs of our within-measure assessment rig (defined in Section 3.1). Hence, it is reasonable to check what happens when (a) one evaluation criterion is used to control SMOTUNED, and (b) the results are assessed using all four evaluation criteria. Figure 10 shows the results of such a cross-measure assessment rig where AUC was used to control SMOTUNED. We note that the results in this figure are very similar to Figure 6, e.g., the precision deltas are usually tiny, and the false alarm increases are usually smaller than the associated recall improvements. But there are some larger improvements in Figure 6 than in Figure 10. Hence, we recommend cross-measure assessment only if CPU is critically restricted. Otherwise, we think SMOTUNED should be controlled by whatever is the downstream evaluation criterion (as done in the within-measure assessment rig of Figure 6).

6 CONCLUSION

Prior work on ranking studies tried to improve software analytics by selecting better learners. Our results show that there may be more benefit in exploring data pre-processors like SMOTUNED, because we found that no learner was usually "best" across all data sets and all evaluation criteria. On one hand, across the same data sets, SMOTUNED was consistently used by whatever learner was found to be "best" in the AUC/recall results. On the other hand, for the precision and false alarm results, there was little evidence against the use of SMOTUNED. That is, creating better training data (using techniques like SMOTUNED) may be more important than the subsequent choice of a classifier. To say that another way, at least for defect prediction, "better data" is better than "better data miners".

As to specific recommendations, we suggest that any prior ranking study which did not study the effects of data pre-processing needs to be analyzed again. Any future such ranking study should include a SMOTE-like pre-processor. SMOTE should not be used with its default parameters. For each new data set, SMOTE should be used with some automatic parameter tuning tool in order to find the best parameters for that data set; SMOTUNED is one example of such parameter tuning. Ideally, SMOTUNED should be tuned using the evaluation criteria used to assess the final predictors. However, if there is not enough CPU to run SMOTUNED for each new evaluation criterion, SMOTUNED can be tuned using AUC.

REFERENCES

[1] Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. 2001. On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory. Springer, 420-434.
[2] Amritanshu Agrawal, Wei Fu, and Tim Menzies. 2016. What is wrong with topic modeling? (and how to fix it using search-based SE). arXiv preprint arXiv:1608.08176 (2016).

