Keywords: Feature extraction; Nonlinear mapping; Kernel principal component analysis; Weighted extreme learning machine

Abstract

Context: Software defect prediction strives to detect defect-prone software modules by mining historical data. Effective prediction enables reasonable testing resource allocation, which eventually leads to more reliable software.
Objective: The complex structures and the imbalanced class distribution in software defect data make it challenging to obtain suitable data features and learn an effective defect prediction model. In this paper, we propose a method to address these two challenges.
Method: We propose a defect prediction framework called KPWE that combines two techniques, i.e., Kernel Principal Component Analysis (KPCA) and Weighted Extreme Learning Machine (WELM). Our framework consists of two major stages. In the first stage, KPWE aims to extract representative data features. It leverages the KPCA technique to project the original data into a latent feature space by nonlinear mapping. In the second stage, KPWE aims to alleviate the class imbalance. It exploits the WELM technique to learn an effective defect prediction model with a weighting-based scheme.
Results: We have conducted extensive experiments on 34 projects from the PROMISE dataset and 10 projects from the NASA dataset. The experimental results show that KPWE achieves promising performance compared with 41 baseline methods, including seven basic classifiers with KPCA, five variants of KPWE, eight representative feature selection methods with WELM, and 21 imbalanced learning methods.
Conclusion: In this paper, we propose KPWE, a new software defect prediction framework that considers the feature extraction and class imbalance issues. The empirical study on 44 software projects indicates that KPWE is superior to the baseline methods in most cases.
1.1. Motivation
Selecting optimal features that can reveal the intrinsic structures of the defect data is crucial to build effective defect prediction models.
0.754 on NASA dataset, and of 0.480, 0.649, 0.356, and 0.761 across 44 projects of the two datasets. We compare KPWE against 41 baseline methods. The experimental results show that KPWE achieves significantly better performance (especially in terms of F-measure, MCC, and AUC) compared with all baseline methods.
1.2. Organization
defect prediction by conducting a systematic literature review on the studies published from January 1991 to October 2013. They discussed the merits and demerits of the classification models and found that they were superior to traditional statistical models. In addition, they suggested that new methods should be developed to further improve the defect prediction performance. Malhotra [43] used statistical tests to compare the performance differences among 18 classification models for defect prediction. They performed the experiments on seven Android software projects and stated that these models have significant differences, while support vector machine and the voted perceptron model did not perform well. Lessmann et al. [33] conducted an empirical study to investigate the effectiveness of 21 classifiers on NASA dataset. The results showed that the performances of most classifiers have no significant differences. They suggested that some additional factors, such as the computational overhead and simplicity, should be considered when selecting a proper classifier for defect prediction. Ghotra et al. [44] expanded Lessmann's experiment by applying 31 classifiers to two versions of NASA dataset and PROMISE dataset. The results showed that these classifiers achieved similar results on the noisy NASA dataset but different performance on the clean NASA and the PROMISE datasets. Malhotra and Raje [45] investigated the performances of 18 classifiers on six projects with object-oriented features and found that the Naive Bayes classifier achieved the best performance. Although some researchers introduced KPCA into defect prediction [46-48] recently, they aimed at building asymmetrical prediction models with the kernel method by considering the relationship between principal components and the class labels. In this work, we leverage KPCA as a feature selection method to extract representative features for defect prediction. In addition, Mesquita et al. [49] proposed a method based on ELM with reject option (i.e., IrejoELM) for defect prediction. The results were good because they abandoned the modules that have contradictory decisions for two designed classifiers. However, in practice, such modules should be considered.

2.3. Class imbalanced learning for defect prediction

Since the class imbalance issue can hinder defect prediction techniques from achieving satisfactory performance, researchers have proposed different imbalanced learning methods to mitigate such negative effects. Sampling based methods and cost-sensitive based methods are the most studied imbalanced learning methods for defect prediction.

For the sampling based imbalanced learning methods, there are two main sampling strategies to balance the data distribution. One is to decrease the number of non-defective modules (such as the under-sampling technique); the other is to increase the number of defective modules with redundant modules (such as the over-sampling technique) or synthetic modules (such as the Synthetic Minority Over-sampling Technique, SMOTE). Kamei et al. [50] investigated the impact of four sampling methods on the performance of four basic classification models. They conducted experiments on two industry legacy software systems and found that these sampling methods can benefit linear and logistic models but were not helpful to neural network and classification tree models. Bennin et al. [51] assessed the statistical and practical significance of six sampling methods on the performance of five basic defect prediction models. Experiments on 10 projects indicated that these sampling methods had statistical and practical effects in terms of some performance indicators, such as Pd, Pf, and G-mean, but had no effect in terms of AUC. Bennin et al. [52] explored the impact of a configurable parameter (i.e., the percentage of defective modules) in seven sampling methods on the performance of five classification models. The experimental results showed that this parameter can largely impact the performance (except AUC) of the studied prediction models. Due to the contradictory conclusions of previous empirical studies about which imbalanced learning methods performed the best in the context of defect prediction models, Tantithamthavorn et al. [53] conducted a large-scale empirical experiment on 101 project versions to investigate the impact of four popularly-used sampling techniques on the performance and interpretation of seven classification models. The experimental results explained that these sampling methods increased the completeness of the Recall indicator but had no impact on the AUC indicator. In addition, the sampling based imbalanced learning methods were not conducive to the understanding and interpretation of the defect prediction models.

The cost-sensitive based imbalanced learning methods alleviate the differences between the instance numbers of the two classes by assigning different weights to the two types of instances. Khoshgoftaar et al. [54] proposed a cost-boosting method by combining multiple classification models. Experiments on two industrial software systems showed that the boosting method was feasible for defect prediction. Zheng [55] proposed three cost-sensitive boosting methods to boost neural networks for defect prediction. Experimental results showed that threshold-moving-based boosting neural networks can achieve better performance, especially for object-oriented software projects. Liu et al. [56] proposed a novel two-stage cost-sensitive learning method by utilizing cost information in the classification stage and the feature selection stage. Experiments on seven projects of NASA dataset demonstrated its superiority compared with the single-stage cost-sensitive classifiers and cost-blind feature selection methods. Siers and Islam [57] proposed two cost-sensitive classification models by combining decision trees to minimize the classification cost for defect prediction. The experimental results on six projects of NASA dataset showed the superiority of their methods compared with six classification methods. The WELM technique used in our work belongs to this type of imbalanced learning methods.

3. KPWE: The new framework

The new framework consists of two stages: feature extraction and model construction. This section first describes how to project the original data into a latent feature space using the nonlinear feature transformation technique KPCA, and then presents how to build the WELM model with the extracted features by considering the class imbalance issue.

3.1. Feature extraction based on KPCA

In this stage, we extract representative features with KPCA to reveal the potentially complex structures in the defect data. KPCA uses a nonlinear mapping function φ to project each raw data point within a low-dimensional space into a new point within a high-dimensional feature space F.

Given a dataset {x_i, y_i}, i = 1, 2, ..., n, where x_i = [x_{i1}, x_{i2}, ..., x_{im}]^T ∈ R^m denotes the feature set and y_i = [y_{i1}, y_{i2}, ..., y_{ic}]^T ∈ R^c (c = 2 in this work) denotes the label set. Assume that each data point x_i is mapped into a new point φ(x_i) and that the mapped data points are centralized, i.e.,

(1/n) Σ_{i=1}^{n} φ(x_i) = 0.   (1)

The covariance matrix C of the mapped data is

C = (1/n) Σ_{i=1}^{n} φ(x_i) φ(x_i)^T.   (2)

To perform the linear PCA in F, we diagonalize the covariance matrix C, which can be treated as a solution of the following eigenvalue problem

C V = λ V,   (3)

where λ and V denote the eigenvalues and eigenvectors of C, respectively.

Since all solutions V lie in the span of the mapped data points φ(x_1), φ(x_2), ..., φ(x_n), we multiply both sides of Eq. (3) by φ(x_l)^T as

φ(x_l)^T C V = λ φ(x_l)^T V,  for all l = 1, 2, ..., n.   (4)

Meanwhile, there exist coefficients α_1, α_2, ..., α_n that linearly express the eigenvectors V of C with φ(x_1), φ(x_2), ..., φ(x_n), i.e.,

V = Σ_{j=1}^{n} α_j φ(x_j).   (5)
Fig. 2. Feature extraction with KPCA.

Eq. (4) can be rewritten as the following formula by substituting Eqs. (2) and (5) into it:

(1/n) φ(x_l)^T Σ_{i=1}^{n} φ(x_i) φ(x_i)^T Σ_{j=1}^{n} α_j φ(x_j) = λ φ(x_l)^T Σ_{j=1}^{n} α_j φ(x_j).   (6)

In terms of the n × n kernel matrix K, whose entries are the inner products of the mapped data points, Eq. (6) can be written in matrix form as

K^2 α = n λ K α,   (10)

where α = [α_1, α_2, ..., α_n]^T.

The solution of Eq. (10) can be obtained by solving the eigenvalue problem

K α = n λ α   (11)

for nonzero eigenvalues λ and corresponding eigenvectors α. As we can see, all the solutions of Eq. (11) satisfy Eq. (10).

As mentioned above, we first assume that the mapped data points are centralized. If they are not centralized, the Gram matrix K~ should be used to replace the kernel matrix K, where

K~ = K - 1_n K - K 1_n + 1_n K 1_n,   (12)

and 1_n denotes the n × n matrix with all values equal to 1/n. Thus, we just need to solve the following formula:

K~ α = n λ α.   (13)

Fig. 3. The architecture of ELM.

3.2. ELM

Before formulating the WELM, we first introduce the basic ELM. With the mapped dataset {x'_i, y_i} ∈ R^p × R^c (i = 1, 2, ..., n), the output of the generalized SLFNs with q hidden nodes and activation function h(x') is formally expressed as

o_i = Σ_{k=1}^{q} β_k h_k(x'_i) = Σ_{k=1}^{q} β_k h(w_k, b_k, x'_i),   (16)

where i = 1, 2, ..., n, w_k = [w_{k1}, w_{k2}, ..., w_{kp}]^T denotes the input weight vector connecting the input nodes and the kth hidden node, b_k denotes the bias of the kth hidden node, β_k = [β_{k1}, β_{k2}, ..., β_{kc}]^T denotes the output weight vector connecting the output nodes and the kth hidden node, and o_i denotes the expected output of the ith sample. The commonly-used activation functions in ELM include the sigmoid function, Gaussian RBF function, hard limit function, and multiquadric function [61,62]. Fig. 3 depicts the basic architecture of ELM.

Eq. (16) can be equivalently rewritten as
β denotes the weight matrix connecting the hidden layer and the output layer, which is defined as

β = [β_1^T; ... ; β_q^T]_{q×c}.   (19)

O denotes the expected label matrix, and each row represents the output vector of one sample. O is defined as

or

W2 = W_ii = { 0.618/n_P, if x'_i ∈ minority class;  1/n_N, if x'_i ∈ majority class },   (29)

where W1 and W2 denote two weighting schemes, and n_P and n_N indicate the number of samples of the minority and the majority class, respectively. The golden ratio of 0.618:1 between the majority class and the minority class in scheme W2 represents the perfection in nature [67].
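To make the two stages of KPWE concrete, the following Python sketch outlines the KPCA feature extraction of Eqs. (11)-(13) and the weighting step of Eq. (29). It is only an illustrative reimplementation under stated assumptions, not the authors' code (their experiments were run in MATLAB): the Gaussian RBF kernel is an assumption because the kernel function is not specified in this excerpt, the W1 rule and the closed-form WELM solution are taken from the WELM literature [34] since the corresponding equations are not reproduced here, and all function names are hypothetical.

import numpy as np

def kpca_features(X, p, gamma=1.0):
    # KPCA feature extraction (Eqs. (11)-(13)); a Gaussian RBF kernel
    # k(a, b) = exp(-gamma * ||a - b||^2) is assumed here.
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    one_n = np.full((n, n), 1.0 / n)
    K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n      # Eq. (12)
    eigvals, eigvecs = np.linalg.eigh(K_tilde)                   # Eq. (13)
    order = np.argsort(eigvals)[::-1][:p]                        # keep top p components
    alphas = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return K_tilde @ alphas        # projected training data x'_i (n x p)

def welm_weights(y, scheme="W2"):
    # Per-sample weights; label 1 is assumed to mark the defective (minority) class.
    # W2 follows Eq. (29); W1 uses the usual 1/class-size rule from the WELM
    # literature [34], whose formula is not part of this excerpt.
    n_p, n_n = np.sum(y == 1), np.sum(y == 0)
    return np.where(y == 1, (0.618 if scheme == "W2" else 1.0) / n_p, 1.0 / n_n)

def welm_output_weights(H, T, w, C=1.0):
    # Standard closed-form WELM solution beta = (I/C + H'WH)^(-1) H'WT [34],
    # where H is the hidden-layer output matrix of Eq. (16) computed on the
    # KPCA features and T is the one-hot label matrix.
    W = np.diag(w)
    q = H.shape[1]
    return np.linalg.solve(np.eye(q) / C + H.T @ W @ H, H.T @ W @ T)

Here H can be built with a random sigmoid hidden layer exactly as in basic ELM, so the weighting scheme is the only extra step compared with the unweighted model.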
Table 2
Statistics of the NASA dataset.
Table 3
The feature description and abbreviation for PROMISE dataset.
1. Weighted Methods per Class (WMC) 11. Measure of Functional Abstraction (MFA)
2. Depth of Inheritance Tree (DIT) 12. Cohesion Among Methods of Class (CAM)
3. Number of Children (NOC) 13. Inheritance Coupling (IC)
4. Coupling between Object Classes (CBO) 14. Coupling Between Methods (CBM)
5. Response for a Class (RFC) 15. Average Method Complexity (AMC)
6. Lack of Cohesion in Methods (LOCM) 16. Afferent Couplings (Ca)
7. Lack of Cohesion in Methods (LOCM3) 17. Efferent Couplings (Ce)
8. Number of Public Methods (NPM) 18. Greatest Value of CC (Max_CC)
9. Data Access Metric (DAM) 19. Arithmetic mean value of CC (Avg_CC)
10. Measure of Aggregation (MOA) 20. Lines of Code (LOC)
Actual defective: TP (predicted as defective), FN (predicted as defective-free)
Actual defective-free: FP (predicted as defective), TN (predicted as defective-free)
pd (recall) = TP / (TP + FN)
pf = FP / (FP + TN)
precision = TP / (TP + FP)

In the Friedman test statistic τ_χ², N denotes the total number of projects, L denotes the number of methods to be compared, AR_j = (1/N) Σ_{i=1}^{N} R_i^j denotes the average rank of method j over all projects, and R_i^j denotes the rank of the jth method on the ith project. τ_χ² obeys the χ² distribution with L - 1 degrees of freedom [82]. Since the original Friedman test statistic is too conservative, its variant τ_F is usually used to conduct the statistical test.
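For concreteness, the indicators derived from the confusion matrix above can be computed as in the short Python sketch below (an illustration, not the authors' implementation). F-measure and MCC use their standard definitions; G-measure is assumed to be the harmonic mean of pd and (1 - pf), the form commonly used in defect prediction studies, since its formula is not reproduced in this excerpt.

import math

def indicators(tp, fn, fp, tn):
    # pd, pf and precision follow the definitions listed above.
    pd = tp / (tp + fn)                      # recall / probability of detection
    pf = fp / (fp + tn)                      # probability of false alarm
    precision = tp / (tp + fp)
    f_measure = 2 * precision * pd / (precision + pd)
    g_measure = 2 * pd * (1 - pf) / (pd + (1 - pf))   # assumed definition
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"pd": pd, "pf": pf, "precision": precision,
            "F-measure": f_measure, "G-measure": g_measure, "MCC": mcc}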
Table 5
The specific features for each project of NASA dataset.
Features CM1 KC1 KC3 MC1 MC2 MW1 PC1 PC3 PC4 PC5
21. Number_of_lines √ √ √ √ √ √ √ √ √
22. Cyclomatic_Density √ √ √ √ √ √ √ √ √
23. Branch_Count √ √ √ √ √ √ √ √ √ √
24. Essential_Density √ √ √ √ √ √ √ √ √
25. Call_Pairs √ √ √ √ √ √ √ √ √
26. Condition_Count √ √ √ √ √ √ √ √ √
27. Decision_Count √ √ √ √ √ √ √ √ √
28. Decision_Density √ √ √ √ √ √ √
29. Design_Density √ √ √ √ √ √ √ √ √
30. Edge_Count √ √ √ √ √ √ √ √ √
31. Global_Data_Complexity √ √ √ √
32. Global_Data_Density √ √ √ √
33. Maintenance_Severity √ √ √ √ √ √ √ √ √
34. Modified_Condition_Count √ √ √ √ √ √ √ √ √
35. Multiple_Condition_Count √ √ √ √ √ √ √ √ √
36. Node_Count √ √ √ √ √ √ √ √ √
37. Normalized_CC √ √ √ √ √ √ √ √ √
38. Parameter_Count √ √ √ √ √ √ √ √ √
39. Percent_Comments √ √ √ √ √ √ √ √ √
Table 7
The parameter settings of the used machine learning classifiers.
Table 8
Training time of classifiers on PROMISE dataset (in seconds).
Table 9
Training time of classifiers on NASA dataset (in seconds).
Table 10
Average indicator values of KPWE and seven basic classifiers with KPCA on two datasets and across
all projects.
Dataset Indicator KPNB KPNN KPRF KPLR KPCART KPBP KPSVM KPWE
PROMISE F-measure 0.426 0.423 0.361 0.410 0.396 0.419 0.391 0.500
G-measure 0.525 0.523 0.360 0.453 0.484 0.478 0.376 0.660
MCC 0.284 0.257 0.235 0.292 0.222 0.260 0.280 0.374
AUC 0.699 0.624 0.696 0.716 0.630 0.672 0.648 0.764
NASA F-measure 0.354 0.336 0.267 0.325 0.315 0.352 0.310 0.410
G-measure 0.476 0.477 0.264 0.387 0.425 0.429 0.287 0.611
MCC 0.248 0.216 0.201 0.234 0.176 0.242 0.230 0.296
AUC 0.708 0.596 0.693 0.698 0.606 0.684 0.655 0.754
ALL F-measure 0.410 0.403 0.340 0.391 0.377 0.403 0.372 0.480
G-measure 0.513 0.512 0.338 0.438 0.471 0.467 0.355 0.649
MCC 0.276 0.248 0.228 0.279 0.212 0.256 0.269 0.356
AUC 0.701 0.618 0.695 0.712 0.625 0.675 0.650 0.761
τ_F is calculated as the following formula:

τ_F = (N - 1) τ_χ² / (N (L - 1) - τ_χ²).   (34)

The computed τ_F is compared with the critical values for the F distribution to determine whether to accept or reject the null hypothesis (i.e., all methods perform equally on the projects). If the null hypothesis is rejected, it means that the performance differences among the different methods are nonrandom.
Fig. 4. Box-plots of four indicators for KPWE and seven basic classifiers with KPCA across all 44 projects.
Table 11
Average indicator values of KPWE and its five variants with WELM on two datasets and
across all projects.
Then a so-called Nemenyi's post-hoc test is performed to check which specific method differs significantly [33]. For each pair of methods, this test uses the average rank of each method and checks whether the rank difference exceeds a Critical Difference (CD), which is calculated with the following formula:

CD = q_{α,L} × sqrt( L (L + 1) / (6 N) ),   (35)

where q_{α,L} is a critical value that is related to the number of methods L and the significance level α. The critical values are available online (http://www.cin.ufpe.br/~fatc/AM/Nemenyi_critval.pdf). The Friedman test with the Nemenyi's post-hoc test is widely used in previous studies [33,81,83-88].
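The statistical procedure can be summarized in the following Python sketch (an illustrative reimplementation, not the authors' code). The formula of the Friedman statistic τ_χ² is the standard one, since it is not reproduced in this excerpt; Eqs. (34) and (35) are implemented as printed, and q_alpha must be looked up from the critical-value table linked above for the chosen significance level.

import numpy as np
from scipy.stats import rankdata, f as f_dist

def friedman_nemenyi(scores, q_alpha):
    # scores: N x L matrix of indicator values (N projects, L methods), larger is better.
    scores = np.asarray(scores, dtype=float)
    N, L = scores.shape
    ranks = np.vstack([rankdata(-row) for row in scores])   # rank 1 = best, ties averaged
    avg_ranks = ranks.mean(axis=0)                          # AR_j
    # Standard Friedman chi-square statistic (its formula precedes Eq. (34) in the paper).
    tau_chi2 = 12.0 * N / (L * (L + 1)) * (np.sum(avg_ranks ** 2) - L * (L + 1) ** 2 / 4.0)
    tau_f = (N - 1) * tau_chi2 / (N * (L - 1) - tau_chi2)   # Eq. (34)
    p_value = f_dist.sf(tau_f, L - 1, (L - 1) * (N - 1))    # tau_F follows an F distribution
    cd = q_alpha * np.sqrt(L * (L + 1) / (6.0 * N))         # Eq. (35)
    return avg_ranks, tau_f, p_value, cd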
Table 12
Average indicator values of KPWE and eight feature selection methods with WELM on two datasets and across all
projects.
PROMISE F-measure 0.347 0.415 0.349 0.415 0.427 0.435 0.425 0.431 0.500
G-measure 0.482 0.574 0.482 0.574 0.588 0.605 0.582 0.597 0.660
MCC 0.139 0.257 0.142 0.255 0.271 0.283 0.271 0.277 0.374
AUC 0.590 0.680 0.588 0.674 0.688 0.692 0.689 0.690 0.764
NASA F-measure 0.297 0.360 0.301 0.366 0.353 0.378 0.365 0.369 0.410
G-measure 0.510 0.568 0.515 0.578 0.573 0.603 0.581 0.591 0.611
MCC 0.152 0.247 0.157 0.243 0.228 0.265 0.242 0.252 0.296
AUC 0.618 0.685 0.606 0.685 0.681 0.688 0.679 0.679 0.754
ALL F-measure 0.336 0.403 0.338 0.404 0.410 0.422 0.411 0.417 0.480
G-measure 0.488 0.572 0.490 0.575 0.585 0.604 0.582 0.595 0.649
MCC 0.142 0.255 0.145 0.252 0.261 0.279 0.265 0.271 0.356
AUC 0.596 0.681 0.592 0.676 0.686 0.691 0.687 0.688 0.761
However, the main drawback of the post-hoc Nemenyi test is that it may generate overlapping groups for the methods that are compared, not completely distinct groups, which means that a method may belong to multiple significantly different groups [44,88]. In this work, we utilize the strategy in [88] to address this issue. More specifically, under the assumption that the distance (i.e., the difference between two average ranks) between the best average rank and the worst rank is more than 2 times the CD value, we divide the methods into three non-overlapping groups: (1) a method whose distance to the best average rank is less than CD belongs to the top rank group; (2) a method whose distance to the worst average rank is less than CD belongs to the bottom rank group; (3) the other methods belong to the middle rank group. If the distance between the best average rank and the worst rank is larger than 1 time but less than 2 times the CD value, we divide the methods into two non-overlapping groups: a method belongs to the top rank group (or bottom rank group) if its average rank is closer to the best average rank (or the worst average rank). In addition, if the distance between the best average rank and the worst rank is less than the CD value, all methods belong to the same group. Using this strategy, the generated groups are non-overlapping and significantly different.

5. Performance evaluation

5.1. Answer to RQ1: the efficiency of ELM, WELM and some classic classifiers.

Since many previous defect prediction studies applied classic classifiers as prediction models [33,44], in this work, we choose seven representative classifiers, including Naive Bayes (NB), Nearest Neighbor (NN), Random Forest (RF), Logistic Regression (LR), Classification and Regression Tree (CART), Back Propagation neural networks (BP) and Support Vector Machine (SVM), and compare their efficiency with ELM and WELM.

The parameter settings of the classifiers are detailed as follows. For NB, we use the kernel estimator that achieves better F-measure values on most projects through our extensive experiments. For RF, we set the number of generated trees to 10, the number of variables for random feature selection to 2, and do not limit the maximum depth of the trees, as suggested in [11]. BP is implemented using the neural networks toolbox in MATLAB with a three-layered and fully-connected network architecture. The learning rate is initialized to 0.1.
Fig. 6. Box-plots of four indicators for KPWE and its variants on NASA dataset.
Since how to select an optimal number of hidden nodes is still an open question [89], we conduct extensive experiments on the benchmark dataset and find that BP can achieve the best F-measure with less than 80 hidden nodes on the vast majority of the projects. Thus we set the number of hidden nodes from 5 to 80 with an increment of 5. The algorithm terminates when the number of iterations is above 2000 or the tolerant error is below 0.004. Other network parameters are set to the default values. The optimal number of hidden nodes is determined based on the best F-measure. For SVM, we also choose the Gaussian RBF as the kernel function, and set the kernel parameter ω_SVM = 2^-10, 2^-9, ..., 2^4 while the cost parameter C = 2^-2, 2^-1, ..., 2^12, as suggested in [90]. Similarly, the optimal parameter combination is obtained according to the best performance through the grid search. For the other classifiers, we use the default parameter values. Table 7 tabulates the parameter settings of the seven basic classifiers. The experiments are conducted on a workstation with a 3.60 GHz Intel i7-4790 CPU and 8.00 GB RAM.

Since NN is a lazy classifier that does not need to build a model with the training set in advance, it has no training time [91]. Tables 8 and 9 present the training times of ELM, WELM and the baseline classifiers on PROMISE dataset and NASA dataset, respectively. Note that the value 0 means the training time of the classifier is less than 0.0005 s. For a project with multiple versions, we only report the average training time across the versions. From Table 8, we observe that, on PROMISE dataset, the training time of WELM, less than 0.01 s on 14 projects, is lower than that of the baseline classifiers on most projects. More specifically, the training time of NB, RF, LR, and CART, less than 0.3 s, is a little bit longer than that of ELM and WELM, except for the time of RF on the projects lucene and poi, while the training times of ELM and WELM are much shorter than those of BP and SVM. In particular, WELM runs nearly 200 (for poi) to 30,000 (for velocity) times faster than BP and 600 (for arc) to 8500 (for velocity) times faster than SVM. The training time between ELM and WELM has a slight difference. From Table 9, we find that, on NASA dataset, WELM takes less than 0.1 seconds to finish training a model on 9 projects. ELM and WELM run faster than the six classifiers except for NB on the CM1 project. Particularly, WELM runs 50 (for MC2) to 2700 (for MW1) times faster than BP and 100 (for KC3) to 17,000 (for KC1) times faster than SVM.

Discussion: The short training time of ELM and WELM is due to the following reasons. First, the weights of the input layer and the bias of the hidden layer in ELM are randomly assigned without iterative learning. Second, the weights of the output layer are solved by an inverse operation without iteration. These properties empower ELM to train the model quickly. Since WELM only adds one step for assigning different weights to the defective and non-defective modules when building the model, it introduces little additional computation cost. Therefore, the training time of ELM and that of WELM are very similar.
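A minimal Python sketch of basic ELM training (illustrative only; the experiments above were run in MATLAB) makes these two reasons concrete: the hidden layer is generated randomly and never trained, and the output weights come from a single pseudo-inverse.

import numpy as np

def elm_train(X, T, q, rng=np.random.default_rng(0)):
    # X: n x p inputs, T: n x c one-hot targets, q: number of hidden nodes.
    p = X.shape[1]
    W = rng.standard_normal((p, q))          # random input weights (not learned)
    b = rng.standard_normal(q)               # random hidden biases (not learned)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T             # output weights via one pseudo-inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                          # class scores; argmax gives the label

Because no iterative optimization is involved, the cost is dominated by the single matrix inversion, which is consistent with the training times reported in Tables 8 and 9.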
The superiority of the training speed of ELM and WELM will be more significant when they are applied to larger datasets.

Summary: Compared with the basic classifiers, ELM and WELM are more efficient to train the prediction model, especially compared with BP and SVM, whereas the differences in efficiency between ELM, WELM and the other classifiers are small.

5.2. Answer to RQ2: the prediction performance of KPWE and the basic classifiers with KPCA.

Table 10 presents the average indicator values of KPWE and the seven baseline methods on PROMISE dataset, NASA dataset, and across all 44 projects of the two datasets. Fig. 4 depicts the box-plots of four indicators for the eight methods across all 44 projects. The detailed results, including the optimal kernel parameter, the number of hidden nodes, the performance value for each indicator on each project and the corresponding standard deviation for all research questions, are available in our online supplementary materials (https://sites.google.com/site/istkpwe). From Table 10 and Fig. 4, we have the following observations.

First, from Table 10, the results show that our method KPWE achieves the best average performance in terms of all indicators on the two datasets and across all 44 projects. More specifically, across all 44 projects, the average F-measure value (0.480) of KPWE yields improvements between 17.1% (for KPNB) and 41.2% (for KPRF), with an average improvement of 25.1%. The average G-measure value (0.649) of KPWE gains improvements between 26.5% (for KPNB) and 92.0% (for KPRF), with an average improvement of 50.4%. The average MCC value (0.356) of KPWE achieves improvements between 27.6% (for KPLR) and 67.9% (for KPCART), with an average improvement of 42.2%. The average AUC value (0.761) gets improvements between 6.9% (for KPLR) and 23.1% (for KPNN), with an average improvement of 14.2%, compared against the seven classic classifiers with KPCA.

Second, Fig. 4 demonstrates that the median values of all four indicators by KPWE are superior to those by the seven baseline methods across all 44 projects. In particular, the median AUC by KPWE is even higher than or similar to the maximum AUC by KPNN, KPCART, and KPBP.

Third, Fig. 5 visualizes the results of the Friedman test with Nemenyi's post-hoc test for KPWE and the seven baseline methods in terms of the four indicators. Groups of the methods that are significantly different are shown with different colors. The results of the Friedman test show that the p values are all less than 0.05, which means that there exist significant differences among the eight methods in terms of all four indicators. The results of the post-hoc test show that KPWE always belongs to the top rank group in terms of all indicators. In addition, KPLR belongs to the top rank group in terms of AUC. These observations indicate that KPWE performs significantly better than the seven baseline methods except for the KPLR method in terms of AUC.

Discussion: Among all the methods that build prediction models with the features extracted by KPCA, KPWE outperforms the baseline methods because it uses an advanced classifier that considers the class imbalance in the defect data, while traditional classifiers could not cope well with the imbalanced data.

Summary: Our method KPWE performs better than KPCA with the seven basic classifiers. On average, compared with the seven baseline methods, KPWE achieves 24.2%, 47.3%, 44.3%, 14.4% performance improvement in terms of the four indicators respectively over PROMISE dataset, 28.1%, 63.6%, 35.6%, 14.2% performance improvement in terms of the four indicators respectively over NASA dataset, and 25.1%, 50.4%, 42.2%, 14.2% performance improvement in terms of the four indicators respectively across all 44 projects.

5.3. Answer to RQ3: the prediction performance of KPWE and its variants.

Table 11 presents the average indicator values of KPWE and its five variants on PROMISE dataset, NASA dataset, and across all 44 projects of the two datasets. Fig. 6 depicts the box-plots of four indicators for the six methods across all 44 projects. From Table 11 and Fig. 6, we have the following findings.

First, from Table 11, the results show that our method KPWE achieves the best average performance in terms of all indicators on the two datasets and across all 44 projects. More specifically, across all 44 projects, the average F-measure value (0.480) of KPWE yields improvements between 8.1% (for KPCAELM) and 31.9% (for WELM), with an average improvement of 25.4%. The average G-measure value (0.649) of KPWE gains improvements between 14.7% (for PCAWELM) and 38.7%
Fig. 8. Box-plots of four indicators for KPWE and eight feature selection methods with WELM across all 44 projects.
(for ELM), with an average improvement of 25.0%. The average MCC value (0.356) of KPWE achieves improvements between 9.9% (for KPCAELM) and 107.0% (for ELM), with an average improvement of 78.2%. The average AUC value (0.761) gets improvements between 9.2% (for KPCAELM) and 23.5% (for ELM), with an average improvement of 19.2%, compared with the five variants.

Second, Fig. 6 shows that KPWE outperforms the five variants in terms of the median values of all indicators across all 44 projects. In particular, the median G-measure by KPWE is higher than or similar to the maximum G-measure (not considering the noise points) by the baseline methods except for PCAWELM, the median MCC by KPWE is higher than the maximum MCC by ELM, WELM and PCAWELM, and the median AUC by KPWE is higher than the maximum AUC by the baseline methods except for PCAWELM.

Third, Fig. 7 visualizes the results of the Friedman test with Nemenyi's post-hoc test for KPWE and its five variants in terms of the four indicators. The p values of the Friedman test are all less than 0.05, which means that there exist significant differences among the six methods in terms of all four indicators. The results of the post-hoc test show that KPWE also always belongs to the top rank group in terms of all indicators. In addition, KPCAELM belongs to the top rank group in terms of F-measure and MCC. These observations indicate that, in terms of G-measure and AUC, KPWE performs significantly better than the five variants, whereas in terms of F-measure and MCC, KPWE does not perform significantly better than KPCAELM.

Discussion: On the one hand, KPWE and KPCAELM are superior to PCAWELM and PCAELM, respectively, in terms of all four indicators; on the other hand, KPWE and KPCAELM perform better than WELM and ELM, respectively, on both datasets. All these results mean that the features extracted by the nonlinear method KPCA are more beneficial to ELM and WELM for improving defect prediction performance than the raw features or the features extracted by the linear method PCA. Moreover, KPWE, PCAWELM and WELM are superior to KPCAELM, PCAELM and ELM, respectively, which denotes that WELM is more appropriate for the class imbalanced defect data than ELM.

Summary: KPWE precedes its five variants. On average, compared with the five downgraded variants, KPWE achieves 26.1%, 25.4%, 84.2%, 19.2% performance improvement in terms of the four indicators respectively over PROMISE dataset, 22.7%, 23.9%, 58.4%,
19.6% performance improvement in terms of the four indicators respectively over NASA dataset, and 25.4%, 25.0%, 78.2%, 19.2% performance improvement in terms of the four indicators respectively across all 44 projects.

5.4. Answer to RQ4: the prediction performance of KPWE and other feature selection methods with WELM.

Here, we choose eight representative feature selection methods, including four filter-based feature ranking methods and four wrapper-based feature subset selection methods, for comparison. The filter-based methods are Chi-Square (CS), Fisher Score (FS), Information Gain (IG) and ReliefF (ReF). The first two methods are both based on statistics, and the last two are based on entropy and instance, respectively. These methods have been proven to be effective for defect prediction [19,92]. For the wrapper-based methods, we choose four commonly-used classifiers (i.e., NB, NN, LR, and RF) and F-measure to evaluate the performance of the selected feature subset. The four wrapper methods are abbreviated as NBWrap, NNWrap, LRWrap, and RFWrap. Following the previous work [19,38], we set the number of selected features to ⌈log2 m⌉, where m is the number of original features.

Table 12 presents the average indicator values of KPWE and the eight feature selection methods with WELM on PROMISE dataset, NASA dataset, and across all 44 projects of the two datasets. Fig. 8 depicts the box-plots of four indicators for the nine methods across all 44 projects. Some findings are observed from Table 12 and Fig. 8 as follows.

First, from Table 12, the results show that our method KPWE achieves the best average performance in terms of all indicators on the two datasets and across all 44 projects. More specifically, across all 44 projects, the average F-measure value (0.480) of KPWE yields improvements between 13.7% (for NNWrap) and 42.9% (for CS), with an average improvement of 23.2%. The average G-measure value (0.649) of KPWE gains improvements between 7.5% (for NNWrap) and 33.0% (for CS), with an average improvement of 16.4%. The average MCC value (0.356) of KPWE achieves improvements between 27.6% (for NNWrap) and 150.7% (for CS), with an average improvement of 63.4%. The average AUC value (0.761) gets improvements between 10.1% (for NNWrap) and 27.7% (for CS), with an average improvement of 15.4%, compared with the eight feature selection methods with WELM.

Second, Fig. 8 manifests the superiority of KPWE compared with the eight baseline methods in terms of the median values of all four indicators across all 44 projects. In particular, the median AUC by KPWE is higher than the maximum AUC by CS and IG. In addition, we can also observe that the performance of the four wrapper-based feature subset selection methods is generally better than that of the filter-based methods, which is consistent with the observation in a previous study [19].

Third, Fig. 9 visualizes the results of the Friedman test with Nemenyi's post-hoc test for KPWE and the eight feature selection based baseline methods in terms of the four indicators. There exist significant differences among the nine methods in terms of all four indicators since the p values of the Friedman test are all less than 0.05. The results of the post-hoc test illustrate that KPWE always belongs to the top rank group in terms of all indicators. In addition, NNWrap belongs to the top rank group in terms of G-measure. These observations show that KPWE performs significantly better than the eight baseline methods except for the NNWrap method in terms of G-measure.

Discussion: The reason why the features extracted by KPCA are more effective is that the eight feature selection methods only select a subset of the original features, which is not able to excavate the important information hidden behind the raw data, whereas KPCA can eliminate the noise in the data and extract the intrinsic structures of the data that are more helpful to distinguish the class labels of the modules.

Summary: KPWE outperforms the eight feature selection methods with WELM. On average, compared with the eight baseline methods, KPWE achieves 24.3%, 18.6%, 71.0%, 16.0% performance improvement in terms of the four indicators respectively over PROMISE dataset, 18.5%, 8.5%, 38.3%, 13.7% performance improvement in terms of the four indicators respectively over NASA dataset, and 23.2%, 16.4%, 63.4%, 15.4% performance improvement in terms of the four indicators respectively across all 44 projects.
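As an illustration of the filter-based setting in RQ4, the sketch below ranks features with the Fisher score (one of the four filters above; the formula is the standard one, not taken from this paper) and keeps the top ⌈log2 m⌉ of them. It is a hypothetical Python reimplementation, not the authors' MATLAB code, and assumes binary labels with 1 marking defective modules.

import numpy as np

def fisher_scores(X, y):
    # Standard Fisher score of each feature for binary labels y (0/1).
    scores = np.empty(X.shape[1])
    groups = [X[y == 0], X[y == 1]]
    overall_mean = X.mean(axis=0)
    for j in range(X.shape[1]):
        num = sum(len(g) * (g[:, j].mean() - overall_mean[j]) ** 2 for g in groups)
        den = sum(len(g) * g[:, j].var() for g in groups) + 1e-12
        scores[j] = num / den
    return scores

def select_log2m_features(X, y):
    # Keep the top ceil(log2(m)) features, the subset size used in RQ4
    # (e.g., m = 20 PROMISE metrics gives 5 selected features).
    k = int(np.ceil(np.log2(X.shape[1])))
    top = np.argsort(fisher_scores(X, y))[::-1][:k]
    return X[:, top], top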
Fig. 10. Box-plots of four indicators for KPWE and 21 class imbalanced learning methods across all 44 projects.
5.5. Answer to RQ5: the prediction performance of KPWE and other imbalanced learning methods.

Here, we employ 12 classic imbalanced learning methods based on data sampling strategies. These methods first use Random Under-sampling (RU), Random Over-sampling (RO) or SMOTE (SM) techniques to rebalance the modules of the two classes in the training set; then, four popular classifiers, the same as in RQ4 (i.e., NB, NN, LR, and RF), are applied to the rebalanced training set. The method name is the combination of the abbreviation of the sampling strategy and the used classifier. Also, we employ two widely-used ensemble learning methods (i.e., Bagging (Bag) and Adaboost (Ada)) for comparison. Moreover, we use seven other imbalanced learning methods, Coding-based Ensemble Learning (CEL) [93], Systematically developed Forest with cost-sensitive Voting (SysFV) [94], Cost-Sensitive decision Forest with cost-sensitive Voting (CSFV) [95], Balanced CSFV (BCSFV) [57], Asymmetric Partial Least squares classifier (APL) [96], EasyEnsemble (Easy) [97], and BalanceCascade (Bal) [97], as the baseline methods. Note that the last three methods have not yet been applied to defect prediction but have been proved to achieve promising performance for imbalanced data in other domains. Among these methods, SysFV, CSFV and BCSFV are cost-sensitive based imbalanced learning methods, while Easy and Bal combine the sampling strategies and ensemble learning methods.

Table 13 presents the average indicator values of KPWE and the 21 class imbalanced baseline methods on PROMISE dataset, NASA dataset, and across all 44 projects of the two datasets. Fig. 10 depicts the box-plots of four indicators for the 22 methods across all 44 projects. We describe the findings from Table 13 and Fig. 10 as follows.

First, from Table 13, the results show that our method KPWE achieves the best average performance in terms of F-measure and MCC on the two datasets and across all 44 projects. More specifically, across all 44 projects, the average F-measure value (0.480) of KPWE yields improvements between 7.6% (for CEL) and 34.5% (for RULR), with an average improvement of 19.6%, and the average MCC value (0.356) of KPWE gains improvements between 17.9% (for Easy) and 140.5% (for SMNB), with an average improvement of 56.5%. However, Easy, Bal and APL outperform our method KPWE in terms of the average G-measure values, and Easy outperforms KPWE in terms of the average AUC values across all 44 projects. Overall, KPWE achieves average improvements of 23.4% and 11.2% over the 21 baseline methods in terms of average G-measure and AUC, respectively.

Second, Fig. 10 depicts that KPWE is superior to the 21 baseline methods in terms of the median F-measure and MCC across all 44 projects. In particular, the median MCC by KPWE is higher than the maximum MCC by RONB and SMNB. In addition, the median G-measure by KPWE is similar to that by APL and Bal, whereas the median G-measure and AUC by KPWE are only a little lower than those by Easy.
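For reference, the two simplest rebalancing strategies used by these sampling-based baselines, random under-sampling and random over-sampling, can be sketched as follows (an illustrative Python version assuming label 1 marks the defective minority class; SMOTE and the ensemble-based baselines come from the cited literature and are not reproduced here).

import numpy as np

def random_undersample(X, y, rng=np.random.default_rng(0)):
    # Drop majority-class (non-defective) modules until both classes have equal size.
    minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]

def random_oversample(X, y, rng=np.random.default_rng(0)):
    # Duplicate minority-class (defective) modules until both classes have equal size.
    minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]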
Third, Fig. 11 visualizes the results of the Friedman test with Nemenyi's post-hoc test for KPWE and the 21 class imbalanced learning methods in terms of the four indicators. As the p values of the Friedman test are all less than 0.05, there exist significant differences among the 22 methods in terms of all four indicators. The results of the post-hoc test illustrate that KPWE also belongs to the top rank group in terms of all indicators. However, in terms of F-measure, G-measure, MCC and AUC, KPWE does not perform significantly better than seven, seven, four and six baseline methods, respectively, among which the common methods are Easy and Bal. These observations manifest that KPWE, Easy and Bal belong to the top rank group and show no statistically significant differences with each other in terms of all four indicators. Since this is the first work to investigate the performance of the methods Easy and Bal on software defect data, the experimental results indicate that they are also potentially effective methods for defect prediction, as our method KPWE is.

Discussion: The under-sampling methods may neglect the potentially useful information contained in the ignored non-defective modules, and the over-sampling methods may cause the model to over-fit by adding redundant defective modules. In addition, data sampling based imbalanced learning methods usually change the data distribution of the defect data. From this point of view, the cost-sensitive learning methods (such as our KPWE method), which do not change the data distribution, are better choices for imbalanced defect data. Considering the main drawback of under-sampling methods, Easy and Bal sample multiple subsets from the majority class and then use each of these subsets to train an ensemble. Finally, they combine all weak classifiers of these ensembles into a final output [97]. The two methods can wisely explore the otherwise ignored modules, which enables them to perform well on the imbalanced data.

Summary: KPWE performs better than the 21 baseline methods, especially in terms of F-measure and MCC. On average, compared with the baseline methods, KPWE achieves 19.1%, 23.9%, 57.7%, 11.3% performance improvement in terms of the four indicators respectively over PROMISE dataset, 21.0%, 23.2%, 53.2%, 11.4% performance improvement in terms of the four indicators respectively over NASA dataset, and 19.6%, 23.4%, 56.5%, 11.2% performance improvement in terms of the four indicators respectively across all 44 projects. In addition, KPWE shows no statistically significant differences compared with Easy and Bal across all 44 projects in terms of all four indicators.

6. Threats to validity

6.1. External validity

External validity focuses on whether our experimental conclusions will vary on different projects. We conduct experiments on a total of 44 projects from two defect datasets to reduce this kind of threat. In addition, since the features of our benchmark dataset are all static product metrics and the modules are abstracted at the class level (for PROMISE dataset) and component level (for NASA dataset), we cannot claim that our experimental conclusions can be generalized to defect datasets with process metrics or modules extracted at the file level.

6.2. Internal validity

We implement most baseline methods using the machine learning function library and toolboxes in MATLAB to reduce the potential influence of incorrect implementations on our experimental results. In addition, we tune the optimal parameter values, such as the width of
16
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]
Table 13
Average indicator values of KPWE and the 21 class imbalanced learning methods on two datasets and across all projects.

from a relatively wide range of tested options. Nevertheless, a more carefully controlled experiment for the parameter selection should be considered.

6.3. Construct validity

these indicators do not take the inspection effort into consideration. We will use effort-aware indicators to evaluate the effectiveness of our method in future work.

significant. With this statistical test, the assessment towards the superior-

7. Conclusion

learn the representative features by mapping the original data into a la-

The mapped features in the new space can better represent the raw

number of hidden nodes and kernel parameter values for KPWE, as they vary for different projects. In addition, we plan to explore the impact

Acknowledgments

Supplementary material

References

[1] J. Tian, Software Quality Engineering: Testing, Quality Assurance, and Quantifiable Improvement, John Wiley & Sons, 2005.
[2] G.J. Myers, C. Sandler, T. Badgett, The Art of Software Testing, John Wiley & Sons, 2011.
[3] M. Shepperd, D. Bowes, T. Hall, Researcher bias: the use of machine learning in software defect prediction, IEEE Trans. Softw. Eng. (TSE) 40 (6) (2014) 603–616.
[4] Q. Song, Z. Jia, M. Shepperd, S. Ying, J. Liu, A general software defect-proneness [37] S. Shivaji, J.E.J. Whitehead, R. Akella, S. Kim, Reducing features to improve bug
prediction framework, IEEE Trans. Softw. Eng. (TSE) 37 (3) (2011) 356–370. prediction, in: Proceedings of the 24th International Conference on Automated Soft-
[5] X. Yang, K. Tang, X. Yao, A learning-to-rank approach to software defect prediction, ware Engineering (ASE), IEEE Computer Society, 2009, pp. 600–604.
IEEE Trans. Reliab. 64 (1) (2015) 234–246. [38] K. Gao, T.M. Khoshgoftaar, H. Wang, N. Seliya, Choosing software metrics for defect
[6] P. Knab, M. Pinzger, A. Bernstein, Predicting defect densities in source code files prediction: an investigation on feature selection techniques, Softw. Pract. Exp. (SPE)
with decision tree learners, in: Proceedings of the 3rd International Workshop on 41 (5) (2011) 579–606.
Mining Software Repositories (MSR), ACM, 2006, pp. 119–125. [39] X. Chen, Y. Shen, Z. Cui, X. Ju, Applying feature selection to software defect predic-
[7] T. Menzies, J. Greenwald, A. Frank, Data mining static code attributes to learn defect tion using multi-objective optimization, in: Proceedings of the 41st Annual Computer
predictors, IEEE Trans. Softw. Eng. (TSE) 33 (1) (2007) 2–13. Software and Applications Conference (COMPSAC), 2, IEEE, 2017, pp. 54–59.
[8] L. Guo, Y. Ma, B. Cukic, H. Singh, Robust prediction of fault-proneness by random [40] C. Catal, B. Diri, Investigating the effect of dataset size, metrics sets, and feature
forests, in: Proceedings of the 15th International Symposium on Software Reliability selection techniques on software fault prediction problem, Inf. Sci. (Ny) 179 (8)
Engineering (ISSRE), IEEE, 2004, pp. 417–428. (2009) 1040–1058.
[9] C. Macho, S. McIntosh, M. Pinzger, Predicting build co-changes with source code [41] B. Ghotra, S. McIntosh, A.E. Hassan, A large-scale study of the impact of fea-
change and commit categories, in: Proceedings of the 23rd International Confer- ture selection techniques on defect classification models, in: Proceedings of the
ence on Software Analysis, Evolution, and Reengineering (SANER), 1, IEEE, 2016, 14th International Conference on Mining Software Repositories (MSR), IEEE, 2017,
pp. 541–551. pp. 146–157.
[10] X. Jing, F. Wu, X. Dong, F. Qi, B. Xu, Heterogeneous cross-company defect prediction [42] R. Malhotra, A systematic review of machine learning techniques for software fault
by unified metric representation and CCA-based transfer learning, in: Proceedings of prediction, Appl. Softw. Comput. 27 (2015) 504–518.
the 10th Joint Meeting on Foundations of Software Engineering (FSE), ACM, 2015, [43] R. Malhotra, An empirical framework for defect prediction using machine learning
pp. 496–507. techniques with android software, Appl. Softw. Comput. 49 (2016) 1034–1050.
[11] K.O. Elish, M.O. Elish, Predicting defect-prone software modules using support vec- [44] B. Ghotra, S. McIntosh, A.E. Hassan, Revisiting the impact of classification tech-
tor machines, J. Syst. Softw. (JSS) 81 (5) (2008) 649–660. niques on the performance of defect prediction models, in: Proceedings of the
[12] Z. Yan, X. Chen, P. Guo, Software defect prediction using fuzzy support vector re- 37th International Conference on Software Engineering (ICSE), IEEE Press, 2015,
gression, Adv. Neural Netw. (2010) 17–24. pp. 789–800.
[13] M.M.T. Thwin, T.-S. Quah, Application of neural networks for software quality [45] R. Malhotra, R. Raje, An empirical comparison of machine learning techniques
prediction using object-oriented metrics, J. Syst. Softw. (JSS) 76 (2) (2005) 147–156. for software defect prediction, in: Proceedings of the 8th International Conference
[14] T.M. Khoshgoftaar, E.B. Allen, J.P. Hudepohl, S.J. Aud, Application of neural net- on Bioinspired Information and Communications Technologies, ICST (Institute for
works to software quality modeling of a very large telecommunications system, IEEE Computer Sciences, Social-Informatics and Telecommunications Engineering), 2014,
Trans. Neural Netw. (TNN) 8 (4) (1997) 902–909. pp. 320–327.
[15] D.E. Neumann, An enhanced neural network technique for software risk analysis, [46] J. Ren, K. Qin, Y. Ma, G. Luo, On software defect prediction using machine learning,
IEEE Trans. Soft. Eng. (TSE) 28 (9) (2002) 904–912. J. Appl. Math. 2014 (2014).
[16] A. Panichella, R. Oliveto, A. De Lucia, Cross-project defect prediction models: [47] G. Luo, H. Chen, Kernel based asymmetric learning for software defect prediction,
L’union fait la force, in: Proceedings of the 21st Software Evolution Week-IEEE Con- IEICE Trans. Inf. Syst. 95 (1) (2012) 267–270.
ference on Software Maintenance, Reengineering and Reverse Engineering (CSM- [48] G. Luo, Y. Ma, K. Qin, Asymmetric learning based on kernel partial least squares for
R-WCRE), IEEE, 2014, pp. 164–173. software defect prediction, IEICE Trans. Inf. Syst. 95 (7) (2012) 2006–2008.
[17] A. Shanthini, Effect of ensemble methods for software fault prediction at various [49] D.P. Mesquita, L.S. Rocha, J.P.P. Gomes, A.R.R. Neto, Classification with
metrics level, Int. J. Appl. Inf. Syst. (2014). reject option for software defect prediction, Appl. Softw. Comput. 49 (2016)