0% found this document useful (0 votes)
23 views19 pages

Xu 2019

This document presents a new software defect prediction framework called KPWE that combines kernel principal component analysis (KPCA) and weighted extreme learning machine (WELM). KPWE aims to address challenges in software defect prediction by extracting representative features using KPCA's nonlinear mapping and alleviating class imbalance using WELM's weighting scheme. The authors evaluate KPWE on 44 software projects and find it achieves better performance than 41 baseline methods, especially in terms of precision, recall, F-measure, Matthews correlation coefficient, and area under the ROC curve.

Uploaded by

Catur Supriyanto
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views19 pages

Xu 2019

This document presents a new software defect prediction framework called KPWE that combines kernel principal component analysis (KPCA) and weighted extreme learning machine (WELM). KPWE aims to address challenges in software defect prediction by extracting representative features using KPCA's nonlinear mapping and alleviating class imbalance using WELM's weighting scheme. The authors evaluate KPWE on 44 software projects and find it achieves better performance than 41 baseline methods, especially in terms of precision, recall, F-measure, Matthews correlation coefficient, and area under the ROC curve.

Uploaded by

Catur Supriyanto
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

JID: INFSOF

ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Information and Software Technology xxx (xxxx) xxx

Contents lists available at ScienceDirect

Information and Software Technology


journal homepage: www.elsevier.com/locate/infsof

Software defect prediction based on kernel PCA and weighted extreme


learning machine☆
Zhou Xu a,b, Jin Liu a,f,∗, Xiapu Luo b, Zijiang Yang c, Yifeng Zhang a, Peipei Yuan d, Yutian Tang b,
Tao Zhang e
a
School of Computer Science, Wuhan University, Wuhan, China
b
Department of Computing, The Hong Kong Polytechnic University, Hong Kong
c
Department of Computer Science, Western Michigan University, Michigan, USA
d
School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China
e
College of Computer Science and Technology, Harbin Engineering University, China
f
Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China.

a r t i c l e i n f o a b s t r a c t

Keywords: Context: Software defect prediction strives to detect defect-prone software modules by mining the historical data.
Feature extraction Effective prediction enables reasonable testing resource allocation, which eventually leads to a more reliable
Nonlinear mapping software.
Kernel principal component analysis
Objective: The complex structures and the imbalanced class distribution in software defect data make it challeng-
Weighted extreme learning machine
ing to obtain suitable data features and learn an effective defect prediction model. In this paper, we propose a
method to address these two challenges.
Method: We propose a defect prediction framework called KPWE that combines two techniques, i.e., Kernel
Principal Component Analysis (KPCA) and Weighted Extreme Learning Machine (WELM). Our framework consists
of two major stages. In the first stage, KPWE aims to extract representative data features. It leverages the KPCA
technique to project the original data into a latent feature space by nonlinear mapping. In the second stage, KPWE
aims to alleviate the class imbalance. It exploits the WELM technique to learn an effective defect prediction model
with a weighting-based scheme.
Results: We have conducted extensive experiments on 34 projects from the PROMISE dataset and 10 projects from
the NASA dataset. The experimental results show that KPWE achieves promising performance compared with 41
baseline methods, including seven basic classifiers with KPCA, five variants of KPWE, eight representative feature
selection methods with WELM, 21 imbalanced learning methods.
Conclusion: In this paper, we propose KPWE, a new software defect prediction framework that considers the
feature extraction and class imbalance issues. The empirical study on 44 software projects indicate that KPWE is
superior to the baseline methods in most cases.

1. Introduction classification techniques have been used as defect prediction models,


such as decision tree [6], Naive Bayes [7], random forest [8,9], near-
Software testing is an important part of software development life est neighbor [10], support vector machine [11,12], neural network
cycle for software quality assurance [1,2]. Defect prediction can assist [13–15], logistic regression [16], and ensemble methods [17,18]. Since
the quality assurance teams to reasonably allocate the limited testing re- irrelevant and redundant features in the defect data may degrade the
sources by detecting the potentially defective software modules (such as performance of the classification models, different feature selection
classes, files, components) before releasing the software product. Thus, methods have been applied to select an optimal feature subset for defect
effective defect prediction can save testing cost and improve software prediction[19]. These methods can be roughly divided into three cate-
quality [3–5]. gories: the filter-based feature ranking methods, wrapper-based feature
The majority of existing researches leverages various machine learn- subset evaluation methods, and extraction-based feature transformation
ing techniques to build defect prediction methods. In particular, many methods, such as Principal Component Analysis (PCA) [20].

1.1. Motivation

Fully documented templates are available in the elsarticle package on CTAN.

Corresponding author. Selecting optimal features that can reveal the intrinsic structures
E-mail address: [email protected] (J. Liu). of the defect data is crucial to build effective defect prediction mod-

https://doi.org/10.1016/j.infsof.2018.10.004
Received 21 September 2017; Received in revised form 26 August 2018; Accepted 5 October 2018
Available online xxx
0950-5849/© 2018 Elsevier B.V. All rights reserved.

Please cite this article as: Z. Xu et al., Information and Software Technology (2018), https://doi.org/10.1016/j.infsof.2018.10.004
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

0.754 on NASA dataset, and of 0.480, 0.649, 0.356, and 0.761 across
44 projects of the two datasets. We compare KPWE against 41 base-
line methods. The experimental results show that KPWE achieves sig-
nificantly better performance (especially in terms of F-measure, MCC,
feature mapping
and AUC) compared with all baseline methods.

1.2. Organization

The remainder of this paper is organized as follows.


Section 2 presents the related work. In Section 3, we describe the
Low-dimensional space High-dimensional space proposed method in detail. Section 4 elaborates the experimental
setup. In Section 5, we report the experimental results of performance
Fig. 1. An example of the merit of feature mapping. verification. Section 6 discusses the threats to validity. In Section 7, we
draw the conclusion.
els. The filter-based and wrapper-based feature selection methods only
select a subset of the original features without any transformation [21]. 2. Related work
However, such raw features may not properly represent the essential
structures of raw defect data [22]. Being a linear feature extraction 2.1. Feature selection for defect prediction
method, PCA has been widely used to transform the raw features to
a low-dimensional space where the features are the linear combinations Some recent studies have investigated the impact of feature selection
of the raw ones [23–26]. PCA performs well when the data are linearly methods on the performance of defect prediction. Song et al. [4] sug-
separable and follow a Gaussian distribution, whereas the real defect gested that feature selection is an indispensable part of a general de-
data may have complex structures that can not be simplified in a linear fect prediction framework. Menzies et al. [7] found that Naive Bayes
subspace [27,28]. Therefore, the features extracted by PCA are usually classifier with Information Gain based feature selection can get good
not representative, and cannot gain anticipated performance for defect performances over 10 projects from the NASA dataset. Shivaji et al.
prediction [19,29]. To address this issue, we exploit KPCA [30], a non- [36,37] studied the performance of filter-based and wrapper-based fea-
linear extension of PCA, to project the original data into a latent high- ture selection methods for bug prediction. Their experiments showed
dimensional feature space in which the mapped features can properly that feature selection can improve the defect prediction performance
characterize the complex data structures and increase the probability of even remaining 10% of the original features. Wold et al. [20] investi-
linear separability of the data. When the original data follow an arbitrary gated four filter-based feature selection methods on a large telecom-
distribution, the mapped data by KPCA obey an approximate Gaussian munication system and found that the Kolmogorov–Smirnov method
distribution. Fig. 1 shows the merit of the feature mapping, where the achieved the best performance. Gao et al. [38] explored the performance
data are linearly inseparable within the low-dimensional space but lin- of their hybrid feature selection framework based on seven filter-based
early separable within the high-dimensional space. Existing studies have and three feature subset search methods. They found that the reduced
shown that KPCA outperforms PCA [31,32]. features would not adversely affect the prediction performance in most
Although many classifiers have been used for defect prediction, Less- cases. Chen et al. [39] modelled the feature selection as a multi-objective
mann et al. [33] suggested that the selection of classifiers for defect optimization problem: minimizing the number of selected features and
prediction needs to consider additional criteria, such as computational maximizing the defect prediction performance. They conducted experi-
efficiency and simplicity, because they found that there are no sig- ments on 10 projects from PROMISE dataset and found that their method
nificant performance differences among most defect prediction classi- outperformed three wrapper-based feature selection methods. However,
fiers. Moreover, class imbalance is prevalent in defect data in which the their method was less efficient than two wrapper-based methods. Catal
non-defective modules usually outnumber the defective ones. It makes and Diri [40] conducted an empirical study to investigate the impact of
most classifiers tend to classify the minority samples (i.e., the defec- the dataset size, the types of feature sets and the feature selection meth-
tive modules) as the majority samples (i.e., the non-defective modules). ods on defect prediction. To study the impact of feature selection meth-
However, existing defect prediction methods did not address this prob- ods, they first utilized a Correlation-based Feature Selection (CFS) method
lem well, thus leading to unsatisfactory performance. In this work, we to obtain the relevant features before training the classification mod-
exploit Single-hidden Layer Feedforward Neural networks (SLFNs) called els. The experiments on five projects from NASA dataset showed that
Weighted Extreme Learning Machine (WELM) [34] to overcome this chal- the random forest classifier with CFS performed well on large project
lenge. WELM assigns higher weights to defective modules to emphasize datasets and the Naive Bayes classifier with CFS worked well on small
their importance. In addition, WELM is efficient and convenient since it projects datasets. Xu et al. [19] conducted an extensive empirical com-
only needs to adaptively set the number of hidden nodes while other pa- parison to investigate the impact of 32 feature selection methods on
rameters are randomly generated instead being tuned through iterations defect prediction performance over three public defect datasets. The ex-
like traditional neural networks [35]. perimental results showed that the performances of these methods had
In this paper, we propose a new defect prediction framework called significant differences on all datasets and that PCA performed the worst.
KPWE that leverages the two aforementioned techniques: KPCA and Ghotra et al. [41] extended Xu et al.’s work and conducted a large-scale
WELM. This framework consists of two major stages. First, KPWE ex- empirical study to investigate the defect prediction performance of 30
ploits KPCA to map original defect data into a latent feature space. The feature selection methods with 21 classification models. The experimen-
mapped features in the space can well represent the original ones. Sec- tal results on 18 projects from NASA and PROMISE datasets suggested
ond, with the mapped features, KPWE applies WELM to build an efficient that correlation-based filter-subset feature selection method with best-
and effective defect prediction model that can handle imbalanced defect first search strategy achieved the best performance among all other fea-
data. ture selection methods on majority projects.
We conduct extensive experiments on 44 software projects from two
datasets (PROMISE dataset and NASA dataset) with four indicators, i.e., 2.2. Various classifiers for defect prediction
F-measure, G-measure, MCC, and AUC. On average, KPWE achieves av-
erage F-measure, G-measure, MCC, and AUC values of 0.500, 0.660, Various classification models have been applied to defect prediction.
0.374, and 0.764 on PROMISE dataset, of 0.410, 0.611, 0.296 and Malhotra [42] evaluated the feasibility of seven classification models for

2
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

defect prediction by conducting a systematic literature review on the sampling techniques on the performance and interpretation of seven
studies that published from January 1991 to October 2013. They dis- classification models. The experimental results explained that these sam-
cussed the merits and demerits of the classification models and found pling methods increased the completeness of Recall indicator but had
that they were superior to traditional statistical models. In addition, no impact on the AUC indicator. In addition, the sampling based im-
they suggested that new methods should be developed to further im- balanced learning methods were not conducive to the understanding
prove the defect prediction performance. Malhotra [43] used the statis- towards the interpretation of the defect prediction models.
tical tests to compare the performance differences among 18 classifica- The cost-sensitive based imbalanced learning methods alleviate the
tion models for defect prediction. They performed the experiments on differences between the instance number of two classes by assigning dif-
seven Android software projects and stated that these models have sig- ferent weights to the two types of instances. Khoshgottar et al. [54] pro-
nificant differences while support vector machine and voted perceptron posed a cost-boosting method by combining multiple classification mod-
model did not perform well. Lessmann et al. [33] conducted an em- els. Experiments on two industrial software systems showed that the
pirical study to investigate the effectiveness of 21 classifiers on NASA boosting method was feasible for defect prediction. Zheng [55] pro-
dataset. The results showed that the performances of most classifiers posed three cost-sensitive boosting methods to boost neural networks for
have no significant differences. They suggested that some additional defect prediction. Experimental results showed that threshold-moving-
factors, such as the computational overhead and simplicity, should be based boosting neural networks can achieve better performance, espe-
considered when selecting a proper classifier for defect prediction. Gho- cially for object-oriented software projects. Liu et al. [56] proposed
tra et al. [44] expanded Lessmann’s experiment by applying 31 classi- a novel two-stage cost-sensitive learning method by utilizing cost in-
fiers to two versions of NASA dataset and PROMISE dataset. The results formation in the classification stage and the feature selection stage.
showed that these classifiers achieved similar results on the noisy NASA Experiments on seven projects of NASA dataset demonstrated its superi-
dataset but different performance on the clean NASA and the PROMISE ority compared with the single-stage cost-sensitive classifiers and cost-
datasets. Malhotra and Raje [45] investigated the performances of 18 blind feature selection methods. Siers and Islam [57] proposed two cost-
classifiers on six projects with object-oriented features and found that sensitive classification models by combining decision trees to minimize
Naive Bayes classifier achieved the best performance. Although some the classification cost for defect prediction. The experimental results on
researchers introduced KPCA into defect prediction [46–48] recently, six projects of NASA dataset showed the superiority of their methods
they aimed at building asymmetrical prediction models with the kernel compared with six classification methods. The WELM technique used in
method by considering the relationship between principal components our work belongs to this type of imbalanced learning methods.
and the class labels. In this work, we leverage KPCA as a feature selec-
tion method to extract representative features for defect prediction. In 3. KPWE: The new framework
addition, Mesquita et al. [49] proposed a method based on ELM with re-
ject option (i.e., IrejoELM) for defect prediction. The results were good The new framework consists of two stages: feature extraction and
because they abandoned the modules that have contradictory decisions model construction. This section first describes how to project the orig-
for two designed classifiers. However, in practice, such modules should inal data into a latent feature space using the nonlinear feature trans-
be considered. formation technique KPCA, and then presents how to build the WELM
model with the extracted features by considering the class imbalance
2.3. Class imbalanced learning for defect prediction issue.

Since class imbalance issue can hinder defect prediction techniques 3.1. Feature extraction based on KPCA
to achieve satisfactory performance, researchers have proposed differ-
ent imbalanced learning methods to mitigate such negative effects. Sam- In this stage, we extract representative features with KPCA to re-
pling based methods and cost-sensitive based methods are the most stud- veal the potentially complex structures in the defect data. KPCA uses
ied imbalanced learning methods for defect prediction. a nonlinear mapping function 𝜑 to project each raw data point within
For the sampling based imbalanced learning methods, there are two a low-dimensional space into a new point within a high-dimensional
main sampling strategies to balance the data distribution. One is to de- feature space F.
crease the number of non-defective modules (such as under-sampling Given a dataset {𝑥𝑖 , 𝑦𝑖 }, 𝑖 = 1, 2, … , 𝑛, where 𝑥𝑖 = [𝑥𝑖1 , 𝑥𝑖2 , … , 𝑥𝑖𝑚 ]T ∈
technique), the other is to increase the number of the defective modules ℜ𝑚 denotes the feature set and 𝑦𝑖 = [𝑦𝑖1 , 𝑦𝑖2 , … , 𝑦𝑖𝑐 ]T ∈ ℜ𝑐 (𝑐 = 2 in this
with redundant modules (such as over-sampling technique) or synthetic work) denotes the label set. Assuming that each data point xi is mapped
modules (such as Synthetic Minority Over-sampling Technique, SMOTE). into a new point 𝜑(xi ) and the mapped data points are centralized, i.e.,
Kamei et al. [50] investigated the impact of four sampling methods on 1 ∑𝑛
the performance of four basic classification models. They conducted ex- 𝑛 𝑖=1 𝜑(𝑥𝑖 ) = 0 (1)
periments on two industry legacy software systems and found that these The covariance matrix C of the mapped data is:
sampling methods can benefit linear and logistic models but were not ∑
𝐂 = 1𝑛 𝑛𝑖=1 𝜑(𝑥𝑖 )𝜑(𝑥𝑖 )T (2)
helpful to neural network and classification tree models. Bennin et al.
[51] assessed the statistical and practical significance of six sampling To perform the linear PCA in F, we diagonalize the covariance ma-
methods on the performance of five basic defect prediction models. Ex- trix C, which can be treated as a solution of the following eigenvalue
periments on 10 projects indicated that these sampling methods had problem
statistical and practical effects in terms of some performance indica- 𝐂𝐕 = 𝜆𝐕, (3)
tors, such as Pd, Pf, G-mean, but had no effect in terms of AUC. Bennin
et al. [52] explored the impact of a configurable parameter (i.e, the per- where 𝜆 and V denote the eigenvalues and eigenvectors of C, respec-
centage of defective modules) in seven sampling methods on the per- tively.
formance of five classification models. The experimental results showed Since all solutions V lie in the span of the mapped data points
that this parameter can largely impact the performance (except AUC) 𝜑(𝑥1 ), 𝜑(𝑥2 ), … , 𝜑(𝑥𝑛 ), we multiply both sides of Eq. (3) by 𝜑(xl )T as
of studied prediction models. Due to the contradictory conclusions of 𝜑(𝑥𝑙 )T 𝐂𝐕 = 𝜆𝜑(𝑥𝑙 )T 𝐕, ∀𝑙 = 1, 2, … , 𝑛 (4)
previous empirical studies about which imbalanced learning methods
Meanwhile, there exist coefficients 𝛼1 , 𝛼2 , … , 𝛼𝑛 that linearly express
performed the best in the context of defect prediction models, Tan-
the eigenvectors V of C with 𝜑(𝑥1 ), 𝜑(𝑥2 ), … , 𝜑(𝑥𝑛 ), i.e.,
tithamthavorn et al. [53] conducted a large-scale empirical experiment

on 101 project versions to investigate the impact of four popularly-used 𝐕 = 𝑛𝑗=1 𝛼𝑗 𝜑(𝑥𝑗 ) (5)

3
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

h(w1, b1, x'i)


1

β11

...
1 βj1 1
yi1
x'i1
βq1
yi

...

...
x' h(wj, bj, x'i) j
βic
x'ip yic
βjc

...
p c
βqc
h(wq, bq, x'i) q
Fig. 2. Feature extraction with KPCA.

Eq. (4) can be rewritten as following formula by substituting Input Layer Hidden Layer Output Layer
Eqs. (2) and (5) into it
∑ ∑ ∑ Fig. 3. The architecture of ELM.
1
𝑛
𝜑(𝑥𝑙 )T 𝑛𝑖=1 𝜑(𝑥𝑖 )𝜑(𝑥𝑖 )T 𝑛𝑗=1 𝛼𝑗 𝜑(𝑥𝑗 ) = 𝜆𝜑(𝑥𝑙 )T 𝑛𝑗=1 𝛼𝑗 𝜑(𝑥𝑗 ) (6)

Let the kernel function 𝜅(xi , xj ) be


where ‖ · ‖ denotes the l2 norm and 2𝜎 2 = 𝜔 denotes the width of the
𝜅(𝑥𝑖 , 𝑥𝑗 ) = 𝜑(𝑥𝑖 )T 𝜑(𝑥𝑗 ) (7) Gaussian RBF function.
To eliminate the underlying noise in the data, when performing the
Then Eq. (6) is rewritten as
PCA in the latent feature space F, we maintain the most important prin-
1 ∑𝑛 ∑𝑛 ∑𝑛
𝑙=1,𝑖=1 𝜅(𝑥𝑙 , 𝑥𝑖 ) 𝑖=1,𝑗=1 𝛼𝑗 𝜅(𝑥𝑖 , 𝑥𝑗 ) = 𝜆 𝑙=1,𝑗=1 𝛼𝑗 𝜅(𝑥𝑙 , 𝑥𝑗 ) (8) cipal components that capture at least 95% of total variances of the data
𝑛
according to their cumulative contribution rates [60]. Finally, the data
Let the kernel matrix K with size n × n be
are mapped into a p-dimensional space.
𝐊𝑖,𝑗 = 𝜅(𝑥𝑖 , 𝑥𝑗 ) (9) After completing feature extraction, the original training data are
transformed to a new dataset {𝑥′𝑖 , 𝑦𝑖 } ∈ ℜ𝑝 × ℜ𝑐 (𝑖 = 1, 2, … , 𝑛).
Then Eq. (8) is rewritten as

𝐊 𝛼 = 𝑛𝜆𝐊𝛼,
2
(10) 3.2. ELM

where 𝛼 = [𝛼1 , 𝛼2 , … , 𝛼𝑛 ]T .
Before formulizing the WELM, we first introduce the basic ELM. With
The solution of Eq. (10) can be obtained by solving the eigenvalue
the mapped dataset {𝑥′𝑖 , 𝑦𝑖 } ∈ ℜ𝑝 × ℜ𝑐 (𝑖 = 1, 2, … , 𝑛), the output of the
problem generalized SLFNs with q hidden nodes and activation function h(x′) is
𝐊𝛼 = 𝑛𝜆𝛼 (11) formally expressed as
∑ ∑
for nonzero eigenvalues 𝜆 and corresponding eigenvectors 𝜶. As we can 𝑜𝑖 = 𝑞𝑘=1 𝛽𝑘 ℎ𝑘 (𝑥′𝑖 ) = 𝑞𝑘=1 𝛽𝑘 ℎ(𝑤𝑘 , 𝑏𝑘 , 𝑥′𝑖 ), (16)
see, all the solutions of Eq. (11) satisfy Eq. (10).
where 𝑖 = 1, 2, … , 𝑛, 𝑤𝑘 = [𝑤𝑘1 , 𝑤𝑘2 , … , 𝑤𝑘𝑝 ]T denotes the input weight
As mentioned above, we first assume that the mapped data points
vector connecting the input nodes and the kth hidden node, bk denotes
are centralized. If they are not centralized, the Gram matrix 𝐾̃ be used
the bias of the k-th hidden node, 𝛽𝑘 = [𝛽𝑘1 , 𝛽𝑘2 , … , 𝛽𝑘𝑐 ]T denotes the out-
to replace the kernel matrix K as
put weight vector connecting the output nodes and the kth hidden node,
̃ = 𝐊 − 1𝑛 𝐊 − 𝐊1𝑛 + 1𝑛 𝐊1𝑛 ,
𝐊 (12) and oi denotes the expected output of the ith sample. The commonly-
used activation functions in ELM include sigmoid function, Gaussian
where 1n denotes the n × n matrix with all values equal to 1/n. RBF function, hard limit function, and multiquadric function [61,62].
Thus, we just need to solve the following formula Fig. 3 depicts the basic architecture of ELM.
̃ 𝛼 = 𝑛𝜆𝛼
𝐊 (13) Eq. (16) can be equivalently rewritten as

To extract the nonlinear principal components of a new test data 𝐇𝛽 = 𝐎, (17)


point 𝜑(xnew ), we can compute the projection of the kth kernel compo- where H is called the hidden layer output matrix of the SLFNs and is
nent by defined as
∑ ∑
𝐕𝑘 ⋅ 𝜑(𝑥𝑛𝑒𝑤 ) = 𝑛𝑖=1 𝛼𝑖𝑘 𝜑(𝑥𝑖 )T 𝜑(𝑥𝑛𝑒𝑤 ) = 𝑛𝑖=1 𝛼𝑖𝑘 𝜅(𝑥𝑖 , 𝑥𝑛𝑒𝑤 ) (14) ⎡𝐡(𝑥′1 )⎤
𝐇 = 𝐇(𝑤1 , … , 𝑤𝑞 , 𝑏1 , … , 𝑏𝑞 , 𝑥′1 , … , 𝑥′𝑛 ) = ⎢ ⋮ ⎥
Fig. 2 depicts the process of KPCA for feature extraction. KPCA sim- ⎢ ′ ⎥
⎣𝐡(𝑥𝑛 )⎦
plifies the feature mapping by calculating the inner product of two data (18)
points with kernel function instead of calculating the 𝜑(xi ) explicitly. ⎡ℎ(𝑤1 , 𝑏1 , 𝑥′1 ) ⋯ ℎ(𝑤𝑞 , 𝑏𝑞 , 𝑥′1 )⎤
Various kernel functions, such as Gaussian Radial Basic Function (RBF) =⎢ ⋮ ⋱ ⋮ ⎥ ,
⎢ ⎥
kernel and polynomial kernel, can induce different nonlinear mapping. ⎣ℎ(𝑤1 , 𝑏1 , 𝑥′𝑛 ) ⋯ ℎ(𝑤𝑞 , 𝑏𝑞 , 𝑥′𝑛 )⎦𝑛 × 𝑞
The RBF kernel is commonly used in image retrieval and pattern recog-
nition domains [58,59] that is defined as where the ith row of H denotes the output vector of the hidden layer
( ) with respect to input sample 𝑥′𝑖 , and the kth column of H denotes the
‖𝑥𝑖 − 𝑥𝑗 ‖2 output vector of the kth hidden node with respect to the input samples
𝜅(𝑥𝑖 , 𝑥𝑗 ) = exp − , (15)
2𝜎 2 𝑥′1 , 𝑥′2 , … , 𝑥′𝑛 .

4
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

𝜷 denotes the weight matrix connecting the hidden layer and the or
output layer, which is defined as {
0.618∕𝑛𝑃 if 𝑥′𝑖 ∈ minority class
𝐖𝟐 = 𝐖𝐢𝐢 = , (29)
⎡𝛽1 ⎤
T 1∕𝑛𝑁 if 𝑥′𝑖 ∈ majority class
𝛽=⎢⋮⎥ (19)
⎢ T⎥ where W1 and W2 denote two weighting schemes, nP and nN indicate
⎣𝛽𝑞 ⎦𝑞 × 𝑐 the number of samples of the minority and majority class, respectively.
The golden ratio of 0.618:1 between the majority class and the minority
O denotes the expected label matrix, and each row represents the
class in scheme W2 represents the perfection in nature [67].
output vector of one sample. O is defined as

⎡ 𝑜T1 ⎤ ⎡𝑜11 ⋯ 𝑜1𝑐 ⎤


4. Experimental setup
𝐎=⎢ ⋮ ⎥=⎢ ⋮ ⋱ ⋮⎥ (20)
⎢ T⎥ ⎢ ⎥
⎣𝑜𝑛 ⎦ ⎣𝑜𝑛1 ⋯ 𝑜𝑛𝑐 ⎦𝑛 × 𝑐
In this section, we elaborate the experimental setup, including the
Since the target of training SLFNs is to minimize the output error, Research Questions (RQs), benchmark datasets, the performance indica-
i.e., approximating the input samples with zero error as follows tors, and the experimental design.
∑𝑛
𝑖=1 ‖𝑜𝑖 − 𝑦𝑖 ‖ = ‖𝐎 − 𝐘‖ = 0 (21)
4.1. Research questions
⎡ 𝑦T1
⎤ ⎡𝑦11 ⋯ 𝑦1𝑐 ⎤
where 𝐘 = ⎢ ⋮ ⎥ = ⎢ ⋮ ⋱ ⋮⎥ denotes the target output ma- We design the following five research questions to evaluate our
⎢ T⎥ ⎢ ⎥ KPWE method.
⎣𝑦 𝑛 ⎦ ⎣𝑦 𝑛 1 ⋯ 𝑦𝑛𝑐 ⎦𝑛 × 𝑐
trix. RQ1: How efficient are ELM and WELM?
Then, we need to solve the following formula As the computational cost is an important criterion to select the ap-
propriate classifier for defect prediction in practical application [33,63],
𝐇𝛽 = 𝐘 (22) this question is designed to evaluate the efficiency of ELM and its variant
WELM compared with some typical classifiers.
Huang et al. [35,63] proved that, for ELM, the weights wk of the
RQ2: How effective is KPWE compared with basic classifiers with KPCA?
input connection and the bias bk of the hidden layer node can be ran-
Since our method KPWE combines feature transformation and an
domly and independently designated. Once these parameters are as-
advanced classifier, this question is designed to explore the effectiveness
signed, Eq. (22) is converted into a linear system and the output weight
of this new classifier compared against the typical classifiers with the
matrix 𝜷 can be analytically determined by finding the least-square so-
same process of feature extraction. We use the classic classifiers in RQ1
lution of the linear system, i.e.,
with KPCA as the baseline methods.
min ‖𝐇𝛽 − 𝐘‖ (23) RQ3: Is KPWE superior to its variants?
𝛽 Since the two techniques KPCA and WELM used in our method are
The optimal solution of Eq. (23) is variants of the linear feature extraction method PCA and the original
ELM respectively, this question is designed to investigate whether our
𝛽̂ = 𝐇† 𝐘 = (𝐇𝑇 𝐇) (24) method is more effective than other combinations of these four tech-
niques. To answer this question, we first compare KPWE against the
where H† denotes the Moore–Penrose generalized inverse of the hidden baseline methods that combine WELM with PCA (short for PCAWELM)
layer output matrix H [64,65]. The obtained 𝛽̂ can ensure minimum and none feature extraction (short for WELM). It can be used to inves-
training error, get optimal generalization ability and avoid plunging into tigate the different performance among the methods using non-linear,
local optimum since 𝛽̂ is unique [35]. This solution can also be obtained linear and none feature extraction for WELM. Then, we compare KPWE
with Karush–Kuhn–Tucker (KKT) theorem [66]. against the baseline methods that combine ELM with KPCA, PCA, and
Finally, we get the classification function of ELM as none feature extraction (short for KPCAELM, PCAELM, and ELM, re-
𝑓 (𝑥′ ) = 𝐡(𝑥′ )𝛽̂ = 𝐡(𝑥′ )𝐇† 𝐘 (25) spectively). It can be used to compare the performance of our method
against its downgraded version methods that do not consider the class
imbalance issue. All these baseline methods are treated as the variants
3.3. Model construction based on WELM of KPWE.
RQ4: Are the selected features by KPCA more effective for performance
For imbalanced data, to consider the different importance of the ma- improvement than that by other feature selection methods?
jority class samples (i.e., defective modules) and the minority class sam- To obtain the representative features of the defect data, previous re-
ples (i.e., non-defective modules) when building the ELM classifier, we searches [19,41] used various feature selection methods to select an op-
define a n × n diagonal matrix W, whose diagonal element Wii denotes timal feature subset to replace the original set. This question is designed
the weight of training sample 𝑥′𝑖 . More precisely, if 𝑥′𝑖 belongs to the to investigate whether the features extracted by KPCA are more effec-
majority class, the weight Wii is relatively lower than the sample that tive in improving the defect prediction performance than the features se-
belongs to the minority class. According to the KKT theorem, Eq. (24) is lected by other feature selection methods. To answer this question, we
rewritten as select some classic filter-based feature ranking methods and wrapper-
based feature subset selection methods with the same classifier WELM
𝛽̂ = 𝐇† 𝐘 = (𝐇T 𝐖𝐇)−1 𝐇T 𝐖𝐓 (26)
for comparison.
Then, Eq. (25) becomes RQ5: Is the prediction performance of KPWE comparable to that of other
imbalanced learning methods?
𝑓 (𝑥′ ) = 𝐡(𝑥′ )𝛽̂ = 𝐡(𝑥′ )(𝐇T 𝐖𝐇)−1 𝐇T 𝐖𝐓 (27) Since our method KPWE is customized to address the class imbal-
ance issue for software defect data, this question is designed to study
There are mainly two schemes for assigning the weights to the sam-
whether our method can achieve better or at least comparable perfor-
ples of the two classes as follows [34]:
mance than existing imbalanced learning methods. To answer this ques-
{
1∕𝑛𝑃 if 𝑥′𝑖 ∈ minority class tion, we employ several sampling-based, ensemble learning-based, and
𝐖𝟏 = 𝐖𝐢𝐢 = , (28) cost-sensitive-based imbalanced learning methods for comparison.
1∕𝑛𝑁 if 𝑥′𝑖 ∈ majority class

5
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

Table 1 Possibility of detection (pd) or recall is defined as the ratio of the


Statistics of the PROMISE dataset. number of defective modules that are correctly predicted to the total
Projects #M #D (%)D Projects #M #D (%)D number of defective modules.
Possibility of false alarm (pf) is defined as the ratio of the number of
ant−1.3 125 20 16.00 lo4j−1.0 135 34 25.19
defective modules that are incorrectly predicted to the total number of
ant−1.4 178 40 22.47 log4j−1.1 109 37 33.94
ant−1.5 293 32 10.92 lucene−2.0 195 91 46.67 non-defective modules.
ant−1.6 351 92 26.21 poi−2.0 314 37 11.78 Precision is defined as the ratio of the number of defective modules
ant−1.7 745 166 22.28 prop−6 660 66 10.00 that are correctly predicted to the total number of defective modules
arc 234 27 11.54 redaktor 176 27 15.34
that are correctly and incorrectly predicted.
camel−1.0 339 13 3.83 synapse−1.0 157 16 10.19
camel−1.2 608 216 35.53 synapse−1.1 222 60 27.03
F-measure, a trade-off between recall and precision, is defined as
camel−1.4 872 145 16.63 synapse−1.2 256 86 33.59 2 ∗ recall ∗ precision
camel−1.6 965 188 19.48 tomcat 858 77 8.97 F-measure = . (30)
ivy−1.4 241 16 6.64 velocity−1.6 229 78 34.06 recall + precision
ivy−2.0 352 40 11.36 xalan−2.4 723 110 15.21
G-measure, a trade-off between pd and pf, is defined as
jedit−3.2 272 90 33.09 xalan−2.5 803 387 48.19
jedit−4.0 306 75 24.51 xalan−2.6 885 411 46.44 2 ∗ pd ∗ (1 − pf)
jedit−4.1 312 79 25.32 xerces-init 162 77 47.53 G-measure = . (31)
pd + (1 − pf)
jedit−4.2 367 48 13.08 xerces−1.2 440 71 16.14
jedit−4.3 492 11 2.24 xerces−1.3 453 69 15.23 MCC, a comprehensive indicator by considering TP, TN, FP, and FN,
is defined as
TP ∗ TN − FP ∗ FN
MCC = √ . (32)
4.2. Benchmark dataset (TP + FP) ∗ (TP + FN) ∗ (TN + FP) ∗ (TN + FN)
AUC calculates the area under a ROC curve which depicts the relative
We conduct extensive experiments on 34 projects taken from an
trade-off between pd (the y-axis) and pf (the x-axis) of a binary classifi-
open-source PROMISE data repository,1 which have been widely used
cation. Different from the above three indicators which are based on the
in many defect prediction studies [44,68,69]. These projects include
premise that the threshold of determining a sample as positive class is
open-source projects (such as ‘ant’ project), proprietary projects (such as
0.5 by default, the value of AUC is independent of the decision thresh-
‘prop’ project) and academic projects (such as ‘redaktor’ project). Each
old. More specifically, given a threshold, we can get a point pair (pd,pf)
module in the projects includes 20 object-oriented features and a depen-
and draw the corresponding position in the two-dimension plane. For all
dent variable that denotes the number of defects in the module. These
possible thresholds, we can get a set of such point pairs. The ROC curve
features are collected by Jureczko and Madeyski, Spinellis with Ckjm
is made up by connecting all these points. The area under this curve is
tool [70,71]. We label the module as 1 if it contains one or more de-
used to evaluate the classification performance.
fects. Otherwise, we label it as 0. In this work, we just select a subset
The greater values of the four indicators indicate better prediction
from PROMISE data repository as our benchmark dataset. The selection
performance.
criteria are that: First, to ensure a certain amount of training set and test
set, we filter out the projects that have less than 100 modules. Second,
since our method KPWE is designed to address the imbalanced defect 4.4. Experimental design
data where the non-defective modules outnumber the defective ones, we
only consider the projects whose defective ratios are lower than 50%. As We perform substantial experiments to evaluate the effectiveness of
a result, 34 versions of 15 projects are selected and used in this study. KPWE. In the feature extraction phase, we choose the Gaussian RBF
To investigate the generalization of our method to other datasets, we as the kernel function for KPCA since it usually exhibits better perfor-
further conduct experiments on ten projects from NASA dataset which mances in many applications [58,59,74]. In terms of the parameter 𝜔,
is cleaned by Shepperd et al. [27]. Since there are two cleaned versions i.e., the width of the Gaussian kernel (as defined in Section 3.1), we
(D′ and D′′) of NASA dataset, in this work, we use the D′′ version as our empirically set a relatively wide range as 𝜔 = 102 , 202 , … , 1002 . In the
benchmark dataset as in previous work [44]. model construction phase, we also choose the Gaussian RBF as the ac-
Tables 1 and 2 summarize the basic information of the two datasets, tivation function for WELM because it is the preferred choice in many
including the number of features (# F), the number of modules (# M), applications [59,75]. Since the number of hidden nodes q is far less than
the number of defective modules (# D) and the defect ratios (% D). Note the number of training sample n [35], we set the number of hidden
that we do not report the number of features for the projects in PROMISE nodes from 5 to n with an increment of 5. So, for each project, there are
dataset since all of them contain 20 features. In addition, for PROMISE 2𝑛(10 × 5𝑛 ) combinations of 𝜔 and q in total. For the weighting scheme
dataset, the feature descriptions and corresponding abbreviations are of W, we adopt the second scheme W2 as described in Section 3.3. For
presented in Table 3 (CC is the abbreviations of Cyclomatic Complexity). each project, we use the 50:50 split with stratified sampling to constitute
For NASA dataset, Table 4 depicts the common features among the 10 the training and test set. More specifically, we utilize stratified sampling
projects and Table 5 tabulates the other specific features for each project to randomly select 50% instances as the training set and the remaining

with symbol . 50% instances as the test set. The stratified sampling strategy guarantees
the same defect ratios of the training set and test set which conforms to
the actual application scenario. In addition, such sampling setting helps
4.3. Performance indicators reduce sampling biases [76]. The 50:50 split and stratified sampling
are commonly used in previous defect prediction studies [22,77–79].
We use F-measure, G-measure, Matthews Correlation Coefficient To mitigate the impact of the random division treatment on the experi-
(MCC) and Area Under the ROC Curve (AUC) to measure the perfor- mental results and produce a general conclusion, we repeat this process
mance of KPWE, because they are widely used in defect prediction 30 times on each project by reshuffling the module order. Therefore,
[44,69,72,73]. The first three indicators can be deduced by some sim- for each parameter combination, we run KPWE 30 times and record the
pler binary classification metrics as listed in Table 6. average indicator values. Finally, the optimal combination of parame-
ters 𝜔 and q is determined by the best average F-measure value. In this
work, we report the average values of the four indicators on 30-rounds
1
http://openscience.us/repo/defect/ck/. experiments.

6
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

Table 2
Statistics of the NASA dataset.

Projects #F #M #D (%)D Projects #F #M #D (%)D

CM1 37 327 42 12.84 MW1 37 251 25 9.96


KC1 21 1162 294 25.30 PC1 37 696 55 7.90
KC3 39 194 36 18.56 PC3 37 1073 132 12.30
MC1 38 1847 36 1.95 PC4 37 1276 176 13.79
MC2 39 125 44 35.20 PC5 38 1679 459 27.34

Table 3
The feature description and abbreviation for PROMISE dataset.

1. Weighted Methods per Class (WMC) 11. Measure of Functional Abstraction (MFA)
2. Depth of Inheritance Tree (DIT) 12. Cohesion Among Methods of Class (CAM)
3. Number of Children (NOC) 13. Inheritance Coupling (IC)
4. Coupling between Object Classes (CBO) 14. Coupling Between Methods (CBM)
5. Response for a Class (RFC) 15. Average Method Complexity (AMC)
6. Lack of Cohesion in Methods (LOCM) 16. Afferent Couplings (Ca)
7. Lack of Cohesion in Methods (LOCM3) 17. Efferent Couplings (Ce)
8. Number of Public Methods (NPM) 18. Greatest Value of CC (Max_CC)
9. Data Access Metric (DAM) 19. Arithmetic mean value of CC (Avg_CC)
10. Measure of Aggregation (MOA) 20. Lines of Code (LOC)

Table 4 4.5. Statistical test method


The description of the common feature for NASA dataset.
To statistically analyze the performance between our method KPWE
1. Line count of code 11. Halstead_Volume
and other baseline methods, we perform the non-parametric Frideman
2. Count of blank lines 12. Halstead_Level
3. Count of code and comments 13. Halstead_Difficulty
test with the Nemenyi’s post-hoc test [80] at significant level 0.05 over
4. Count of comments 14. Halstead_Content all projects. The Friedman test evaluates whether there exist statisti-
5. Line count of executable code 15. Halstead_Effort cally significant differences among the average ranks of different meth-
6. Number of operators 16. Halstead_Error_Estimate ods. Since Friedman test is based on performance ranks of the methods,
7. Number of operands 17. Halstead_Programming_Time
rather than actual performance values, therefore it makes no assump-
8. Number of unique operators 18. Cyclomatic_Complexity
9. Number of unique operands 19. Design_Complexity tions on the distribution of performance values and is less susceptible to
10. Halstead_Length 20. Essential_Complexity outliers [33,81]. The test statistic of the Friedman test can be calculated
as follows:
(𝐿 )
Table 6
12𝑁 ∑ 𝐿(𝐿 + 1)2
Basic indicators for defect prediction. 𝜏𝜒 2 = 𝐴𝑅𝑗 −
2
, (33)
𝐿(𝐿 + 1) 𝑗=1 4
Predicted as defective Predicted as defective-free

Actual defective TP FN where N denotes the total number of the projects, L denotes the num-
∑ 𝑗
Actual defective-free FP TN ber of methods needed to be compared, 𝐴𝑅𝑗 = 𝑁1 𝑁 𝑖=1 𝑅𝑖 denotes the
𝑇𝑃
average rank of method j on all projects and 𝑅𝑗𝑖 denotes the rank of jth
pd (recall) 𝑇 𝑃 +𝐹 𝑁
𝐹𝑃
pf 𝐹 𝑃 +𝑇 𝑁 method on the ith project. 𝜏𝜒 2 obeys the 𝜒 2 distribution with 𝐿 − 1 de-
𝑇𝑃
precision 𝑇 𝑃 +𝐹 𝑃
gree of freedom [82]. Since the original Friedman test statistic is too
conservative, its variant 𝜏 F is usually used to conduct the statistic test.

Table 5
The specific features for each project of NASA dataset.

Features CM1 KC1 KC3 MC1 MC2 MW1 PC1 PC3 PC4 PC5
√ √ √ √ √ √ √ √ √
21. Number_of_lines
√ √ √ √ √ √ √ √ √
22. Cyclomatic_Density
√ √ √ √ √ √ √ √ √ √
23. Branch_Count
√ √ √ √ √ √ √ √ √
24. Essential_Density
√ √ √ √ √ √ √ √ √
25. Call_Pairs
√ √ √ √ √ √ √ √ √
26. Condition_Count
√ √ √ √ √ √ √ √ √
27. Decision_Count
√ √ √ √ √ √ √
28. Decision_Density
√ √ √ √ √ √ √ √ √
29. Design_Density
√ √ √ √ √ √ √ √ √
30. Edge_Count
√ √ √ √
31. Global_Data_Complexity
√ √ √ √
32. Global_Data_Density
√ √ √ √ √ √ √ √ √
33. Maintenance_Severity
√ √ √ √ √ √ √ √ √
34. Modified_Condition_Count
√ √ √ √ √ √ √ √ √
35. Multiple_Condition_Count
√ √ √ √ √ √ √ √ √
36. Node_Count
√ √ √ √ √ √ √ √ √
37. Normalized_CC
√ √ √ √ √ √ √ √ √
38. Parameter_Count
√ √ √ √ √ √ √ √ √
39. Percent_Comments

7
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

Table 7
The parameter settings of the used machine learning classifiers.

Classifier Parameter settings

NB Estimator: kernel estimator


RF Number of generated tree: 10, Number of variables for random feature selection: 2
BP Layer: 3, Learning rate: 0.1, Maximal number of iterations: 2000, Tolerant error: 0.004
SVM Kernel function: Gaussian RBF, Kernel parameter: 2−10 , 2−9 , , 24 , Cost parameter: 2−2 , 2−1 , 212
NN Number of neighbors used: 1
LR The distribution used: normal
CART The minimal number of observations per tree leaf: 1

Table 8
Training Time of classifiers on promise dataset (in Seconds).

Projects NB RF LR CART BP SVM ELM WELM

ant 0.085 0.181 0.019 0.030 2.933 8.089 0.008 0.003


arc 0.084 0.174 0.040 0.016 8.444 3.651 0.003 0.006
camel 0.084 0.171 0.050 0.050 9.050 21.985 0.061 0.004
ivy 0.086 0.168 0.014 0.020 6.222 5.233 0.006 0.002
jedit 0.100 0.168 0.032 0.034 4.414 7.869 0.008 0.007
log4j 0.066 0.150 0.007 0.014 0.465 2.181 0.000 0.000
lucene 0.088 0.000 0.073 0.004 0.666 7.887 0.006 0.003
poi 0.085 0.000 0.043 0.005 0.663 10.196 0.004 0.003
prop-6 0.086 0.170 0.144 0.056 11.793 14.179 0.042 0.003
redaktor 0.082 0.171 0.044 0.023 0.645 2.793 0.000 0.000
synapse 0.081 0.170 0.021 0.020 5.761 3.912 0.003 0.000
tomcat 0.082 0.206 0.023 0.058 6.267 21.958 0.055 0.005
velocity 0.087 0.170 0.012 0.017 14.742 4.154 0.003 0.000
xalan 0.080 0.223 0.024 0.077 6.836 26.410 0.028 0.011
xerces 0.084 0.192 0.026 0.039 3.898 8.112 0.006 0.008

Table 9
Training time of classifiers on nasa dataset (in Seconds).

Projects NB RF LR CART BP SVM ELM WELM

CM1 0.004 0.175 1.902 0.094 8.551 6.960 0.030 0.061


KC1 0.014 0.294 0.027 0.112 5.316 88.619 0.176 0.005
KC3 0.004 0.167 1.755 0.07 18.519 3.996 0.003 0.040
MC1 0.662 0.263 2.939 0.204 131.473 95.002 0.309 0.108
MC2 0.669 0.15 1.065 0.049 1.791 2.696 0.003 0.036
MW1 0.629 0.152 1.848 0.053 86.102 4.585 0.006 0.031
PC1 0.643 0.198 2.151 0.115 3.285 20.158 0.041 0.054
PC3 0.681 0.257 0.424 0.218 127.702 50.87 0.147 0.061
PC4 0.630 0.261 2.658 0.216 53.239 65.151 0.09 0.073
PC5 0.666 0.351 0.246 0.438 113.32 179.318 0.283 0.087

Table 10
Average indicator values of KPWE and seven basic classifiers with KPCA on two datasets and across
all projects.

Dataset Indicator KPNB KPNN KPRF KPLR KPCART KPBP KPSVM KPWE

PROMISE F-measure 0.426 0.423 0.361 0.410 0.396 0.419 0.391 0.500
G-measure 0.525 0.523 0.360 0.453 0.484 0.478 0.376 0.660
MCC 0.284 0.257 0.235 0.292 0.222 0.260 0.280 0.374
AUC 0.699 0.624 0.696 0.716 0.630 0.672 0.648 0.764
NASA F-measure 0.354 0.336 0.267 0.325 0.315 0.352 0.310 0.410
G-measure 0.476 0.477 0.264 0.387 0.425 0.429 0.287 0.611
MCC 0.248 0.216 0.201 0.234 0.176 0.242 0.230 0.296
AUC 0.708 0.596 0.693 0.698 0.606 0.684 0.655 0.754
ALL F-measure 0.410 0.403 0.340 0.391 0.377 0.403 0.372 0.480
G-measure 0.513 0.512 0.338 0.438 0.471 0.467 0.355 0.649
MCC 0.276 0.248 0.228 0.279 0.212 0.256 0.269 0.356
AUC 0.701 0.618 0.695 0.712 0.625 0.675 0.650 0.761

𝜏 F is calculated as the following formula: values2 for the F distribution and then determine whether to accept
or reject the null hypothesis (i.e., all methods perform equally on the
(𝑁 − 1)𝜏𝜒 2 projects).
𝜏𝐹 = . (34) If the null hypothesis is rejected, it means that the performance dif-
𝑁(𝐿 − 1) − 𝜏𝜒 2
ferences among different methods are nonrandom, then a so-called Ne-

𝜏 F obeys the F-distribution with 𝐿 − 1 and (𝐿 − 1)(𝑁 − 1) degrees of


freedom. Once 𝜏 F value is calculated, we can compare 𝜏 F against critical 2
http://www.socr.ucla.edu/applets.dir/f_table.html.

8
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

F-measure G-measure
0.8 0.8

0.7

0.6
0.6

0.5
0.4
0.4

0.3
0.2

0.2

0.1 KPNB KPNN KPRF KPLR KPCART KPBP KPSVM KPWE 0.0 KPNB KPNN KPRF KPLR KPCART KPBP KPSVM KPWE

MCC AUC
0.7 0.9

0.6
0.8
0.5

0.7
0.4

0.3
0.6

0.2
0.5
0.1

0.0 KPNB KPNN KPRF KPLR KPCART KPBP KPSVM KPWE 0.4 KPNB KPNN KPRF KPLR KPCART KPBP KPSVM KPWE

Fig. 4. Box-plots of four indicators for KPWE and seven basic classifiers with KPCA across all 44 projects.

Table 11
Average indicator values of KPWE and its five variants with WELM on two datasets and
across all projects.

Dataset Indicator ELM PCAELM KPCAELM WELM PCAWELM KPWE

PROMISE F-measure 0.382 0.388 0.467 0.374 0.385 0.500


G-measure 0.470 0.486 0.567 0.556 0.571 0.660
MCC 0.174 0.183 0.342 0.182 0.200 0.374
AUC 0.617 0.624 0.702 0.629 0.639 0.745
NASA F-measure 0.322 0.324 0.365 0.330 0.333 0.410
G-measure 0.458 0.451 0.475 0.550 0.550 0.611
MCC 0.164 0.164 0.263 0.184 0.188 0.296
AUC 0.612 0.611 0.679 0.626 0.629 0.754
ALL F-measure 0.369 0.374 0.444 0.364 0.373 0.480
G-measure 0.468 0.478 0.546 0.555 0.566 0.649
MCC 0.172 0.179 0.324 0.183 0.197 0.356
AUC 0.616 0.621 0.697 0.628 0.637 0.747

menyi’s post-hoc test is performed to check which specific method dif- where q𝛼, L is a critical value that related to the number of methods L
fers significantly [33]. For each pair of methods, this test uses the aver- and the significance level 𝛼. The critical values are available online.3
age rank of each method and checks whether the rank difference exceeds The Frideman test with the Nemenyi’s post-hoc test is widely used in
a Critical Difference (CD) which is calculated with the following formula: previous studies [33,81,83–88].


𝐿(𝐿 + 1)
𝐶𝐷 = 𝑞𝛼,𝐿 , (35) 3
http://www.cin.ufpe.br/~fatc/AM/Nemenyi_critval.pdf.
6𝑁
9
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

Table 12
Average indicator values of KPWE and eight feature selection methods with WELM on two datasets and across all
projects.

Dataset Indicator CS FS IG ReF NBWrap NNWrap LRWrap RFWrap KPWE

PROMISE F-measure 0.347 0.415 0.349 0.415 0.427 0.435 0.425 0.431 0.500
G-measure 0.482 0.574 0.482 0.574 0.588 0.605 0.582 0.597 0.660
MCC 0.139 0.257 0.142 0.255 0.271 0.283 0.271 0.277 0.374
AUC 0.590 0.680 0.588 0.674 0.688 0.692 0.689 0.690 0.764
NASA F-measure 0.297 0.360 0.301 0.366 0.353 0.378 0.365 0.369 0.410
G-measure 0.510 0.568 0.515 0.578 0.573 0.603 0.581 0.591 0.611
MCC 0.152 0.247 0.157 0.243 0.228 0.265 0.242 0.252 0.296
AUC 0.618 0.685 0.606 0.685 0.681 0.688 0.679 0.679 0.754
ALL F-measure 0.336 0.403 0.338 0.404 0.410 0.422 0.411 0.417 0.480
G-measure 0.488 0.572 0.490 0.575 0.585 0.604 0.582 0.595 0.649
MCC 0.142 0.255 0.145 0.252 0.261 0.279 0.265 0.271 0.356
AUC 0.596 0.681 0.592 0.676 0.686 0.691 0.687 0.688 0.761

(a) F-measure (b) G-measure

(c) MCC (d) AUC


Fig. 5. Comparison of KPWE against seven basic classifiers with KPCA using Friedman test and Nemenyi’s post-hoc test in terms of four indicators.

However, the main drawback of post-hoc Nemenyi test is that it 5. Performance evaluation
may generate overlapping groups for the methods that are compared,
not completely distinct groups, which means that a method may be- 5.1. Answer to RQ1: the efficiency of ELM, WELM and some classic
long to multiple significantly different groups [44,88]. In this work, we classifiers.
utilize the strategy in [88] to address this issue. More specifically, un-
der the assumption that the distance (i.e., the difference between two Since many previous defect prediction studies applied classic classi-
average ranks) between the best average rank and the worst rank is fiers as prediction models [33,44], in this work, we choose seven repre-
2 times larger than CD value, we divide the methods into three non- sentative classifiers, including Naive Bayes (NB), Nearest Neighbor (NN),
overlapping groups: (1) The method whose distance to the best average Random Forest (RF), Logistic Regression (LR), Classification and Regression
rank is less than CD belongs to the top rank group; (2) The method Tree (CART), Back Propagation neural networks (BP) and Support Vector
whose distance to the worst average rank is less than CD belongs to Machine (SVM), and compare their efficiency with ELM and WELM.
the bottom rank group; (3) The other methods belong to the middle The parameter settings of the classifiers are detailed as follows. For
rank group. And if the distance between the best average rank and the NB, we use the kernel estimator that achieves better F-measure values
worst rank is larger than 1 time but less than 2 times CD value, we di- on most projects through our extensive experiments. For RF, we set the
vide the methods into 2 non-overlapping groups: The method belongs number of generated trees to 10, the number of variables for random
to the top rank group (or bottom rank group) if its average rank is feature selection to 2, and do not limit the maximum depth of the trees,
closer to the best average rank (or the worst average rank). In addi- as suggested in [11]. BP is implemented using the neural networks tool-
tion, if the distance between the best average rank and the worst rank box in MATLAB with a three-layered and fully-connected network ar-
is less than CD value, all methods belong to the same group. Using chitecture. The learning rate is initialized to 0.1. Since how to select an
this strategy, the generating groups are non-overlapping significantly optimal number of hidden nodes is still an open question [89], we con-
different. duct extensive experiments on the benchmark dataset and find that BP

10
JID: INFSOF
ARTICLE IN PRESS [m5GeSdc;October 24, 2018;2:51]

Z. Xu et al. Information and Software Technology 000 (2018) 1–19

F-measure G-measure
0.8 0.8

0.7

0.6
0.6

0.5
0.4
0.4

0.3
0.2

0.2

0.1 ELM PCAELM KPCAELM WELM PCAWELM KPWE 0.0 ELM PCAELM KPCAELM WELM PCAWELM KPWE

MCC AUC
0.7 0.9

0.6

0.8
0.5

0.4
0.7
0.3

0.2
0.6

0.1

0.0 ELM PCAELM KPCAELM WELM PCAWELM KPWE ELM PCAELM KPCAELM WELM PCAWELM KPWE

Fig. 6. Box-plots of four indicators for KPWE and its variants on NASA dataset.

Thus we set the number of hidden nodes from 5 to 80 with an increment of 5. The algorithm terminates when the number of iterations is above 2000 or the tolerant error is below 0.004. Other network parameters are set to their default values. The optimal number of hidden nodes is determined based on the best F-measure. For SVM, we also choose the Gaussian RBF as the kernel function, and set the kernel parameter ω_SVM = 2^−10, 2^−9, …, 2^4 and the cost parameter C = 2^−2, 2^−1, …, 2^12 as suggested in [90]. Similarly, the optimal parameter combination is obtained according to the best performance through the grid search. For the other classifiers, we use the default parameter values. Table 7 tabulates the parameter settings of the seven basic classifiers. The experiments are conducted on a workstation with a 3.60 GHz Intel i7-4790 CPU and 8.00 GB RAM.

Since NN is a lazy classifier that does not need to build a model with the training set in advance, it has no training time [91]. Tables 8 and 9 present the training times of ELM, WELM and the baseline classifiers on the PROMISE dataset and the NASA dataset, respectively. Note that the value 0 means the training time of the classifier is less than 0.0005 s. For projects with multiple versions, we only report the average training time across the versions. From Table 8, we observe that, on the PROMISE dataset, the training time of WELM, less than 0.01 s on 14 projects, is lower than that of the baseline classifiers on most projects. More specifically, the training time of NB, RF, LR, and CART, less than 0.3 s, is a little longer than that of ELM and WELM, except for the time of RF on the projects lucene and poi, while the training times of ELM and WELM are much shorter than those of BP and SVM. In particular, WELM runs nearly 200 (for poi) to 30,000 (for velocity) times faster than BP and 600 (for arc) to 8500 (for velocity) times faster than SVM. The training time between ELM and WELM shows only a slight difference. From Table 9, we find that, on the NASA dataset, WELM takes less than 0.1 s to finish training a model on 9 projects. ELM and WELM run faster than the six classifiers except for NB on the CM1 project. Particularly, WELM runs 50 (for MC2) to 2700 (for MW1) times faster than BP and 100 (for KC3) to 17,000 (for KC1) times faster than SVM.

Discussion: The short training time of ELM and WELM is due to the following reasons. First, the weights of the input layer and the biases of the hidden layer in ELM are randomly assigned without iterative learning. Second, the weights of the output layer are solved by an inverse operation without iteration. These properties enable ELM to train the model quickly. Since WELM only adds one step that assigns different weights to the defective and non-defective modules when building the model, it introduces little additional computation cost. Therefore, the training times of ELM and WELM are very similar. The superiority of the training speed of ELM and WELM will be more significant when they are applied to larger datasets.
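To make this reasoning concrete, the following minimal Python sketch (not the authors' MATLAB implementation) shows that WELM training reduces to one random projection plus a single regularized, weighted least-squares solve. The sigmoid activation, the inverse-class-frequency weighting scheme and the default values of n_hidden and C are assumptions chosen for illustration only.

```python
import numpy as np

def train_welm(X, y, n_hidden=40, C=1.0, seed=0):
    """Sketch of weighted ELM training; y holds {0, 1} labels (1 = defective)."""
    y = np.asarray(y, dtype=int)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: input weights and hidden biases are assigned randomly, never learned.
    W = rng.uniform(-1.0, 1.0, size=(d, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))       # hidden-layer output matrix
    # Step 2: weight each module inversely to its class size, so the minority
    # (defective) class contributes more to the training error.
    counts = np.bincount(y, minlength=2)
    w = 1.0 / counts[y]
    # Step 3: output weights in closed form (regularized weighted least squares);
    # this single linear solve is the only "training" that takes place.
    T = np.where(y == 1, 1.0, -1.0)
    HtW = H.T * w                                 # equivalent to H^T diag(w)
    beta = np.linalg.solve(np.eye(n_hidden) / C + HtW @ H, HtW @ T)
    return W, b, beta

def predict_welm(model, X):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta >= 0).astype(int)
```

Because no iterative optimization is involved, the cost is dominated by one linear solve whose size equals the number of hidden nodes, which is consistent with the training-time pattern reported in Tables 8 and 9.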

Fig. 7. Comparison of KPWE against its five variants with Friedman test and Nemenyi’s post-hoc test in terms of four indicators.

Summary: Compared with the basic classifiers, ELM and WELM are more efficient for training the prediction model, especially compared with BP and SVM, whereas the differences in efficiency between ELM, WELM and the other classifiers are small.

5.2. Answer to RQ2: the prediction performance of KPWE and the basic classifiers with KPCA.

Table 10 presents the average indicator values of KPWE and the seven baseline methods on the PROMISE dataset, the NASA dataset, and across all 44 projects of the two datasets. Fig. 4 depicts the box-plots of the four indicators for the eight methods across all 44 projects. The detailed results, including the optimal kernel parameter, the number of hidden nodes, the performance value for each indicator on each project and the corresponding standard deviation for all research questions, are available in our online supplementary materials (https://sites.google.com/site/istkpwe). From Table 10 and Fig. 4, we have the following observations.

First, from Table 10, the results show that our method KPWE achieves the best average performance in terms of all indicators on the two datasets and across all 44 projects. More specifically, across all 44 projects, the average F-measure value (0.480) by KPWE yields improvements between 17.1% (for KPNB) and 41.2% (for KPRF) with an average improvement of 25.1%, the average G-measure value (0.649) by KPWE gains improvements between 26.5% (for KPNB) and 92.0% (for KPRF) with an average improvement of 50.4%, the average MCC value (0.356) by KPWE achieves improvements between 27.6% (for KPLR) and 67.9% (for KPCART) with an average improvement of 42.2%, and the average AUC value (0.761) gets improvements between 6.9% (for KPLR) and 23.1% (for KPNN) with an average improvement of 14.2% compared against the seven classic classifiers with KPCA.

Second, Fig. 4 demonstrates that the median values of all four indicators by KPWE are superior to those by the seven baseline methods across all 44 projects. In particular, the median AUC by KPWE is even higher than or similar to the maximum AUC by KPNN, KPCART, and KPBP.

Third, Fig. 5 visualizes the results of the Friedman test with Nemenyi's post-hoc test for KPWE and the seven baseline methods in terms of the four indicators. Groups of methods that are significantly different are marked with different colors. The results of the Friedman test show that the p values are all less than 0.05, which means that there exist significant differences among the eight methods in terms of all four indicators. The results of the post-hoc test show that KPWE always belongs to the top rank group in terms of all indicators. In addition, KPLR belongs to the top rank group in terms of AUC. These observations indicate that KPWE performs significantly better than the seven baseline methods except for the KPLR method in terms of AUC.

Discussion: Among all the methods that build prediction models with the features extracted by KPCA, KPWE outperforms the baseline methods because it uses an advanced classifier that considers the class imbalance in the defect data, while the traditional classifiers cannot cope well with the imbalanced data.

Summary: Our method KPWE performs better than KPCA with the seven basic classifiers. On average, compared with the seven baseline methods, KPWE achieves 24.2%, 47.3%, 44.3%, 14.4% performance improvement in terms of the four indicators respectively over the PROMISE dataset, 28.1%, 63.6%, 35.6%, 14.2% performance improvement in terms of the four indicators respectively over the NASA dataset, and 25.1%, 50.4%, 42.2%, 14.2% performance improvement in terms of the four indicators respectively across all 44 projects.

5.3. Answer to RQ3: the prediction performance of KPWE and its variants.

Table 11 presents the average indicator values of KPWE and its five variants on the PROMISE dataset, the NASA dataset, and across all 44 projects of the two datasets. Fig. 6 depicts the box-plots of the four indicators for the six methods across all 44 projects. From Table 11 and Fig. 6, we have the following findings.

First, from Table 11, the results show that our method KPWE achieves the best average performance in terms of all indicators on the two datasets and across all 44 projects. More specifically, across all 44 projects, the average F-measure value (0.480) by KPWE yields improvements between 8.1% (for KPCAELM) and 31.9% (for WELM) with an average improvement of 25.4%, the average G-measure value (0.649) by KPWE gains improvements between 14.7% (for PCAWELM) and 38.7% (for ELM) with an average improvement of 25.0%, the average MCC value (0.356) by KPWE achieves improvements between 9.9% (for KPCAELM) and 107.0% (for ELM) with an average improvement of 78.2%, and the average AUC value (0.761) gets improvements between 9.2% (for KPCAELM) and 23.5% (for ELM) with an average improvement of 19.2% compared with the five variants.

Fig. 8. Box-plots of four indicators for KPWE and eight feature selection methods with WELM across all 44 projects.

Second, Fig. 6 shows that KPWE outperforms the five variants in terms of the median values of all indicators across all 44 projects. In particular, the median G-measure by KPWE is higher than or similar to the maximum G-measure (without considering the noise points) by the baseline methods except for PCAWELM, the median MCC by KPWE is higher than the maximum MCC by ELM, WELM and PCAWELM, and the median AUC by KPWE is higher than the maximum AUC by the baseline methods except for PCAWELM.

Third, Fig. 7 visualizes the results of the Friedman test with Nemenyi's post-hoc test for KPWE and its five variants in terms of the four indicators. The p values of the Friedman test are all less than 0.05, which means that there exist significant differences among the six methods in terms of all four indicators. The results of the post-hoc test show that KPWE also always belongs to the top rank 1 group in terms of all indicators. In addition, KPCAELM belongs to the top rank 1 group in terms of F-measure and MCC. These observations indicate that, in terms of G-measure and AUC, KPWE performs significantly better than the five variants, whereas in terms of F-measure and MCC, KPWE does not perform significantly better than KPCAELM.

Discussion: On the one hand, KPWE and KPCAELM are superior to PCAWELM and PCAELM, respectively, in terms of all four indicators; on the other hand, KPWE and KPCAELM perform better than WELM and ELM, respectively, on both datasets. All of these results indicate that the features extracted by the nonlinear method KPCA are more beneficial to ELM and WELM for improving defect prediction performance than the raw features or the features extracted by the linear method PCA. Moreover, KPWE, PCAWELM and WELM are superior to KPCAELM, PCAELM and ELM, respectively, which denotes that WELM is more appropriate for the class imbalanced defect data than ELM.

Summary: KPWE is superior to its five variants. On average, compared with the five downgraded variants, KPWE achieves 26.1%, 25.4%, 84.2%, 19.2% performance improvement in terms of the four indicators respectively over the PROMISE dataset, 22.7%, 23.9%, 58.4%, 19.6% performance improvement in terms of the four indicators respectively over the NASA dataset, and 25.4%, 25.0%, 78.2%, 19.2% performance improvement in terms of the four indicators respectively across all 44 projects.
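The Friedman test with Nemenyi's post-hoc comparison that underlies Figs. 5, 7, 9 and 11 can be reproduced with standard statistical tooling. The sketch below is illustrative only: the scikit-posthocs package and the small placeholder score matrix are assumptions, not the paper's data or exact tooling.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp  # assumed to be installed; provides Nemenyi's post-hoc test

# Rows = projects (blocks), columns = competing methods; values are, e.g., F-measure.
# These numbers are placeholders, not results from the paper.
scores = np.array([
    [0.48, 0.41, 0.39, 0.35],
    [0.52, 0.44, 0.40, 0.38],
    [0.45, 0.40, 0.37, 0.33],
    [0.50, 0.42, 0.41, 0.36],
    [0.47, 0.39, 0.36, 0.31],
    [0.51, 0.43, 0.38, 0.34],
])

# Friedman test: is there any significant difference among the methods?
stat, p = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

# If p < 0.05, Nemenyi's post-hoc test identifies which pairs of methods differ,
# and therefore which methods share the top-ranked group.
if p < 0.05:
    pairwise_p = sp.posthoc_nemenyi_friedman(scores)  # k x k matrix of pairwise p-values
    print(pairwise_p)
```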

Fig. 9. Comparison of KPWE against the eight feature selection based baseline methods with Friedman test and Nemenyi’s post-hoc test in terms of four indicators.

5.4. Answer to RQ4: the prediction performance of KPWE and other feature selection methods with WELM.

Here, we choose eight representative feature selection methods, including four filter-based feature ranking methods and four wrapper-based feature subset selection methods, for comparison. The filter-based methods are Chi-Square (CS), Fisher Score (FS), Information Gain (IG) and ReliefF (ReF). The first two methods are both based on statistics, while the last two are based on entropy and instances, respectively. These methods have been proven to be effective for defect prediction [19,92]. For the wrapper-based methods, we choose four commonly-used classifiers (i.e., NB, NN, LR, and RF) and F-measure to evaluate the performance of the selected feature subset. The four wrapper methods are abbreviated as NBWrap, NNWrap, LRWrap, and RFWrap. Following the previous work [19,38], we set the number of selected features to ⌈log2 m⌉, where m is the number of original features.

Table 12 presents the average indicator values of KPWE and the eight feature selection methods with WELM on the PROMISE dataset, the NASA dataset, and across all 44 projects of the two datasets. Fig. 8 depicts the box-plots of the four indicators for the nine methods across all 44 projects. The findings observed from Table 12 and Fig. 8 are as follows.

First, from Table 12, the results show that our method KPWE achieves the best average performance in terms of all indicators on the two datasets and across all 44 projects. More specifically, across all 44 projects, the average F-measure value (0.480) by KPWE yields improvements between 13.7% (for NNWrap) and 42.9% (for CS) with an average improvement of 23.2%, the average G-measure value (0.649) by KPWE gains improvements between 7.5% (for NNWrap) and 33.0% (for CS) with an average improvement of 16.4%, the average MCC value (0.356) by KPWE achieves improvements between 27.6% (for NNWrap) and 150.7% (for CS) with an average improvement of 63.4%, and the average AUC value (0.761) gets improvements between 10.1% (for NNWrap) and 27.7% (for CS) with an average improvement of 15.4% compared with the eight feature selection methods with WELM.

Second, Fig. 8 manifests the superiority of KPWE compared with the eight baseline methods in terms of the median values of all four indicators across all 44 projects. In particular, the median AUC by KPWE is higher than the maximum AUC by CS and IG. In addition, we can also observe that the performance of the four wrapper-based feature subset selection methods is generally better than that of the filter-based feature selection methods, which is consistent with the observation in a previous study [19].

Third, Fig. 9 visualizes the results of the Friedman test with Nemenyi's post-hoc test for KPWE and the eight feature selection based baseline methods in terms of the four indicators. There exist significant differences among the nine methods in terms of all four indicators since the p values of the Friedman test are all less than 0.05. The results of the post-hoc test illustrate that KPWE always belongs to the top rank group in terms of all indicators. In addition, NNWrap belongs to the top rank group in terms of G-measure. These observations show that KPWE performs significantly better than the eight baseline methods except for the NNWrap method in terms of G-measure.

Discussion: The reason why the features extracted by KPCA are more effective is that the eight feature selection methods only select a subset of the original features and thus cannot excavate the important information hidden behind the raw data, whereas KPCA can eliminate the noise in the data and extract the intrinsic structures of the data that are more helpful to distinguish the class labels of the modules.

Summary: KPWE outperforms the eight feature selection methods with WELM. On average, compared with the eight baseline methods, KPWE achieves 24.3%, 18.6%, 71.0%, 16.0% performance improvement in terms of the four indicators respectively over the PROMISE dataset, 18.5%, 8.5%, 38.3%, 13.7% performance improvement in terms of the four indicators respectively over the NASA dataset, and 23.2%, 16.4%, 63.4%, 15.4% performance improvement in terms of the four indicators respectively across all 44 projects.
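As a concrete illustration of the filter-based setup described at the beginning of this subsection, the sketch below ranks the original features and keeps only the top ⌈log2 m⌉ of them. It is a sketch under stated assumptions: mutual information is used as a stand-in for the Information Gain scorer, and the scikit-learn call is not the paper's actual implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def filter_select(X, y, random_state=0):
    """Rank the m original features with an information-gain style filter and
    return the top ceil(log2(m)) of them, mirroring the setting of [19,38]."""
    m = X.shape[1]
    k = int(np.ceil(np.log2(m)))                  # number of features to retain
    scores = mutual_info_classif(X, y, random_state=random_state)
    top = np.argsort(scores)[::-1][:k]            # indices of the k highest-ranked features
    return X[:, top], top
```

A wrapper-based counterpart would instead search over feature subsets and score each candidate by the F-measure of a classifier (NB, NN, LR or RF) trained on it, which is why the wrapper methods are considerably more expensive than the filters.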

Fig. 10. Box-plots of four indicators for KPWE and 21 class imbalanced learning methods across all 44 projects.

5.5. Answer to RQ5: the prediction performance of KPWE and other imbalanced learning methods.

Here, we employ 12 classic imbalanced learning methods based on data sampling strategies. These methods first use the Random Under-sampling (RU), Random Over-sampling (RO) or SMOTE (SM) techniques to rebalance the modules of the two classes in the training set; then, the same four popular classifiers as in RQ4 (i.e., NB, NN, LR, and RF) are applied to the rebalanced training set. The method name is the combination of the abbreviation of the sampling strategy and the used classifier. Also, we employ two widely-used ensemble learning methods (i.e., Bagging (Bag) and Adaboost (Ada)) for comparison. Moreover, we use another seven imbalanced learning methods, Coding-based Ensemble Learning (CEL) [93], Systematically developed Forest with cost-sensitive Voting (SysFV) [94], Cost-Sensitive decision Forest with cost-sensitive Voting (CSFV) [95], Balanced CSFV (BCSFV) [57], Asymmetric Partial Least squares classifier (APL) [96], EasyEnsemble (Easy) [97], and BalanceCascade (Bal) [97] as the baseline methods. Note that the last three methods have not yet been applied to defect prediction but have been proven to achieve promising performance for imbalanced data in other domains. Among these methods, SysFV, CSFV and BCSFV are cost-sensitive based imbalanced learning methods, while Easy and Bal combine the sampling strategies and ensemble learning methods.

Table 13 presents the average indicator values of KPWE and the 21 class imbalanced baseline methods on the PROMISE dataset, the NASA dataset, and across all 44 projects of the two datasets. Fig. 10 depicts the box-plots of the four indicators for the 22 methods across all 44 projects. We describe the findings from Table 13 and Fig. 10 as follows.

First, from Table 13, the results show that our method KPWE achieves the best average performance in terms of F-measure and MCC on the two datasets and across all 44 projects. More specifically, across all 44 projects, the average F-measure value (0.480) by KPWE yields improvements between 7.6% (for CEL) and 34.5% (for RULR) with an average improvement of 19.6%, and the average MCC value (0.356) by KPWE gains improvements between 17.9% (for Easy) and 140.5% (for SMNB) with an average improvement of 56.5%. However, Easy, Bal and APL outperform our method KPWE in terms of the average G-measure values, and Easy outperforms KPWE in terms of the average AUC values across all 44 projects. Overall, KPWE achieves average improvements of 23.4% and 11.2% over the 21 baseline methods in terms of average G-measure and AUC, respectively.

Second, Fig. 10 depicts that KPWE is superior to the 21 baseline methods in terms of the median F-measure and MCC across all 44 projects. In particular, the median MCC by KPWE is higher than the maximum MCC by RONB and SMNB. In addition, the median G-measure by KPWE is similar to that by APL and Bal, whereas the median G-measure and AUC by KPWE are only a little lower than those by Easy.

Third, Fig. 11 visualizes the results of the Friedman test with Nemenyi's post-hoc test for KPWE and the 21 class imbalanced learning methods in terms of the four indicators. As the p values of the Friedman test are all less than 0.05, there exist significant differences among the 22 methods in terms of all four indicators. The results of the post-hoc test illustrate that KPWE also belongs to the top rank group in terms of all indicators. However, in terms of F-measure, G-measure, MCC and AUC, KPWE does not perform significantly better than seven, seven, four and six baseline methods respectively, among which the common methods are Easy and Bal. These observations manifest that KPWE, Easy and Bal belong to the top rank group and show no statistically significant differences from each other in terms of all four indicators. Since this is the first work to investigate the performance of the methods Easy and Bal on software defect data, the experimental results indicate that they are also potentially effective methods for defect prediction, as our method KPWE is.
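To illustrate how the RU*/RO* baselines above are constructed, the sketch below rebalances a binary training set by random under- or over-sampling before a classifier is fitted. It assumes the defective class is the minority class and omits SMOTE for brevity; it is not the exact procedure used in the experiments.

```python
import numpy as np

def random_resample(X, y, strategy="under", seed=0):
    """Return a rebalanced copy of (X, y); y uses 1 for defective (minority) modules."""
    rng = np.random.default_rng(seed)
    idx_def = np.flatnonzero(y == 1)
    idx_clean = np.flatnonzero(y == 0)
    if strategy == "under":
        # Randomly drop non-defective modules until both classes have equal size.
        keep = rng.choice(idx_clean, size=idx_def.size, replace=False)
        idx = np.concatenate([idx_def, keep])
    else:
        # Randomly duplicate defective modules until both classes have equal size.
        extra = rng.choice(idx_def, size=idx_clean.size - idx_def.size, replace=True)
        idx = np.concatenate([idx_clean, idx_def, extra])
    idx = rng.permutation(idx)
    return X[idx], y[idx]
```

A baseline such as RUNB then corresponds to fitting Naive Bayes on the output of random_resample(X, y, "under"), and similarly for the other sampling-classifier combinations.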

Fig. 11. Comparison of KPWE against the 21 class imbalanced learning methods with Friedman test and Nemenyi’s post-hoc test in terms of four indicators.

Discussion: The under-sampling methods may neglect the potentially useful information contained in the ignored non-defective modules, and the over-sampling methods may cause the model to over-fit by adding some redundant defective modules. In addition, data sampling based imbalanced learning methods usually change the data distribution of the defect data. From this point of view, cost-sensitive learning methods (such as our KPWE method), which do not change the data distribution, are better choices for imbalanced defect data. Considering the main drawback of under-sampling methods, Easy and Bal sample multiple subsets from the majority class and then use each of these subsets to train an ensemble. Finally, they combine all weak classifiers of these ensembles into a final output [97]. The two methods can wisely explore the otherwise ignored modules, which enables them to perform well on imbalanced data.

Summary: KPWE performs better than the 21 baseline methods, especially in terms of F-measure and MCC. On average, compared with the baseline methods, KPWE achieves 19.1%, 23.9%, 57.7%, 11.3% performance improvement in terms of the four indicators respectively over the PROMISE dataset, 21.0%, 23.2%, 53.2%, 11.4% performance improvement in terms of the four indicators respectively over the NASA dataset, and 19.6%, 23.4%, 56.5%, 11.2% performance improvement in terms of the four indicators respectively across all 44 projects. In addition, KPWE shows no statistically significant differences compared with Easy and Bal across all 44 projects in terms of all four indicators.

6. Threats to validity

6.1. External validity

External validity focuses on whether our experimental conclusions will vary on different projects. We conduct experiments on a total of 44 projects from two defect datasets to reduce this kind of threat. In addition, since the features of our benchmark dataset are all static product metrics and the modules are abstracted at the class level (for the PROMISE dataset) and the component level (for the NASA dataset), we cannot claim that our experimental conclusions can be generalized to defect datasets with process metrics or with modules extracted at the file level.

6.2. Internal validity

We implement most baseline methods using the machine learning function library and toolboxes in MATLAB to reduce the potential influence of incorrect implementations on our experimental results. In addition, we tune the optimal parameter values, such as the width of the kernel parameter in KPCA and the number of hidden nodes in WELM, from a relatively wide range of tested options. Nevertheless, a more carefully controlled experiment for the parameter selection should be considered.
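The parameter tuning mentioned here (and in the experimental setup for RQ1) is a plain grid search. The generic sketch below is an assumption-laden illustration: fit and predict are placeholders for any training and prediction routines (for instance, the WELM sketch shown earlier), the hold-out evaluation is not necessarily the paper's exact protocol, and the search ranges simply echo those reported for the RBF kernel and the hidden nodes.

```python
import numpy as np
from sklearn.metrics import f1_score

def grid_search(X_train, y_train, X_val, y_val, fit, predict):
    """Exhaustively try each (kernel width, hidden nodes) pair and keep the one
    with the best F-measure on the validation data."""
    widths = [2.0 ** p for p in range(-10, 5)]    # 2^-10 ... 2^4
    nodes = range(5, 85, 5)                       # 5, 10, ..., 80 hidden nodes
    best = {"width": None, "nodes": None, "f1": -1.0}
    for width in widths:
        for n_hidden in nodes:
            model = fit(X_train, y_train, width, n_hidden)
            f1 = f1_score(y_val, predict(model, X_val))
            if f1 > best["f1"]:
                best = {"width": width, "nodes": n_hidden, "f1": f1}
    return best
```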


Table 13
Average indicator values of KPWE and 21 class imbalanced learning methods with WELM on two datasets and across all projects.

6.3. Construct validity

Although we employ four extensively-used indicators to evaluate the performance of KPWE and the baseline methods for defect prediction, these indicators do not take the inspection effort (cost) into consideration. We will use effort-aware indicators to evaluate the effectiveness of our method in future work.

6.4. Conclusion validity

We use a state-of-the-art double Scott-Knott ESD method to check whether the differences between KPWE and the baseline methods are significant. With this statistical test, the assessment of the superiority of KPWE is more rigorous.

7. Conclusion

In this work, we propose a new defect prediction framework KPWE that comprises a feature extraction stage and a model construction stage. In the first stage, to handle the complex structures in defect data, we learn representative features by mapping the original data into a latent feature space with the nonlinear feature extraction method KPCA. The mapped features in the new space can better represent the raw data. In the second stage, we construct a classifier that accounts for class imbalance on the extracted features by introducing the state-of-the-art learning algorithm WELM. Besides the advantages of good generalization ability and being less prone to local optima, WELM strengthens the impact of defective modules by assigning them higher weights. We have carefully evaluated KPWE on 34 projects from the PROMISE dataset and 10 projects from the NASA dataset with four indicators. The experimental results show that KPWE exhibits superiority over the 41 baseline methods, especially in terms of F-measure, MCC and AUC.

In future work, we will provide guidelines on deciding the optimal number of hidden nodes and kernel parameter values for KPWE, as they vary for different projects. In addition, we plan to explore the impact of different kernel functions in KPCA and different activation functions in WELM on the performance of KPWE.

Acknowledgments

The authors would like to acknowledge the support provided by the grants of the National Natural Science Foundation of China (61572374, U1636220, 61472423, 61602258), the Open Fund of the Key Laboratory of Network Assessment Technology from CAS, the Academic Team Building Plan for Young Scholars from Wuhan University (WHU2016012), the National Science Foundation (DGE-1522883), Hong Kong RGC Project (CityU C1008-16G), Hong Kong General Research Fund (PolyU 152279/16E, 152223/17E), and the China Postdoctoral Science Foundation (2017M621247).

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.infsof.2018.10.004.

References

[1] J. Tian, Software Quality Engineering: Testing, Quality Assurance, and Quantifiable Improvement, John Wiley & Sons, 2005.
[2] G.J. Myers, C. Sandler, T. Badgett, The Art of Software Testing, John Wiley & Sons, 2011.
[3] M. Shepperd, D. Bowes, T. Hall, Researcher bias: the use of machine learning in software defect prediction, IEEE Trans. Softw. Eng. (TSE) 40 (6) (2014) 603–616.


[4] Q. Song, Z. Jia, M. Shepperd, S. Ying, J. Liu, A general software defect-proneness [37] S. Shivaji, J.E.J. Whitehead, R. Akella, S. Kim, Reducing features to improve bug
prediction framework, IEEE Trans. Softw. Eng. (TSE) 37 (3) (2011) 356–370. prediction, in: Proceedings of the 24th International Conference on Automated Soft-
[5] X. Yang, K. Tang, X. Yao, A learning-to-rank approach to software defect prediction, ware Engineering (ASE), IEEE Computer Society, 2009, pp. 600–604.
IEEE Trans. Reliab. 64 (1) (2015) 234–246. [38] K. Gao, T.M. Khoshgoftaar, H. Wang, N. Seliya, Choosing software metrics for defect
[6] P. Knab, M. Pinzger, A. Bernstein, Predicting defect densities in source code files prediction: an investigation on feature selection techniques, Softw. Pract. Exp. (SPE)
with decision tree learners, in: Proceedings of the 3rd International Workshop on 41 (5) (2011) 579–606.
Mining Software Repositories (MSR), ACM, 2006, pp. 119–125. [39] X. Chen, Y. Shen, Z. Cui, X. Ju, Applying feature selection to software defect predic-
[7] T. Menzies, J. Greenwald, A. Frank, Data mining static code attributes to learn defect tion using multi-objective optimization, in: Proceedings of the 41st Annual Computer
predictors, IEEE Trans. Softw. Eng. (TSE) 33 (1) (2007) 2–13. Software and Applications Conference (COMPSAC), 2, IEEE, 2017, pp. 54–59.
[8] L. Guo, Y. Ma, B. Cukic, H. Singh, Robust prediction of fault-proneness by random [40] C. Catal, B. Diri, Investigating the effect of dataset size, metrics sets, and feature
forests, in: Proceedings of the 15th International Symposium on Software Reliability selection techniques on software fault prediction problem, Inf. Sci. (Ny) 179 (8)
Engineering (ISSRE), IEEE, 2004, pp. 417–428. (2009) 1040–1058.
[9] C. Macho, S. McIntosh, M. Pinzger, Predicting build co-changes with source code [41] B. Ghotra, S. McIntosh, A.E. Hassan, A large-scale study of the impact of fea-
change and commit categories, in: Proceedings of the 23rd International Confer- ture selection techniques on defect classification models, in: Proceedings of the
ence on Software Analysis, Evolution, and Reengineering (SANER), 1, IEEE, 2016, 14th International Conference on Mining Software Repositories (MSR), IEEE, 2017,
pp. 541–551. pp. 146–157.
[10] X. Jing, F. Wu, X. Dong, F. Qi, B. Xu, Heterogeneous cross-company defect prediction [42] R. Malhotra, A systematic review of machine learning techniques for software fault
by unified metric representation and CCA-based transfer learning, in: Proceedings of prediction, Appl. Softw. Comput. 27 (2015) 504–518.
the 10th Joint Meeting on Foundations of Software Engineering (FSE), ACM, 2015, [43] R. Malhotra, An empirical framework for defect prediction using machine learning
pp. 496–507. techniques with android software, Appl. Softw. Comput. 49 (2016) 1034–1050.
[11] K.O. Elish, M.O. Elish, Predicting defect-prone software modules using support vec- [44] B. Ghotra, S. McIntosh, A.E. Hassan, Revisiting the impact of classification tech-
tor machines, J. Syst. Softw. (JSS) 81 (5) (2008) 649–660. niques on the performance of defect prediction models, in: Proceedings of the
[12] Z. Yan, X. Chen, P. Guo, Software defect prediction using fuzzy support vector re- 37th International Conference on Software Engineering (ICSE), IEEE Press, 2015,
gression, Adv. Neural Netw. (2010) 17–24. pp. 789–800.
[13] M.M.T. Thwin, T.-S. Quah, Application of neural networks for software quality [45] R. Malhotra, R. Raje, An empirical comparison of machine learning techniques
prediction using object-oriented metrics, J. Syst. Softw. (JSS) 76 (2) (2005) 147–156. for software defect prediction, in: Proceedings of the 8th International Conference
[14] T.M. Khoshgoftaar, E.B. Allen, J.P. Hudepohl, S.J. Aud, Application of neural net- on Bioinspired Information and Communications Technologies, ICST (Institute for
works to software quality modeling of a very large telecommunications system, IEEE Computer Sciences, Social-Informatics and Telecommunications Engineering), 2014,
Trans. Neural Netw. (TNN) 8 (4) (1997) 902–909. pp. 320–327.
[15] D.E. Neumann, An enhanced neural network technique for software risk analysis, [46] J. Ren, K. Qin, Y. Ma, G. Luo, On software defect prediction using machine learning,
IEEE Trans. Soft. Eng. (TSE) 28 (9) (2002) 904–912. J. Appl. Math. 2014 (2014).
[16] A. Panichella, R. Oliveto, A. De Lucia, Cross-project defect prediction models: [47] G. Luo, H. Chen, Kernel based asymmetric learning for software defect prediction,
L’union fait la force, in: Proceedings of the 21st Software Evolution Week-IEEE Con- IEICE Trans. Inf. Syst. 95 (1) (2012) 267–270.
ference on Software Maintenance, Reengineering and Reverse Engineering (CSM- [48] G. Luo, Y. Ma, K. Qin, Asymmetric learning based on kernel partial least squares for
R-WCRE), IEEE, 2014, pp. 164–173. software defect prediction, IEICE Trans. Inf. Syst. 95 (7) (2012) 2006–2008.
[17] A. Shanthini, Effect of ensemble methods for software fault prediction at various [49] D.P. Mesquita, L.S. Rocha, J.P.P. Gomes, A.R.R. Neto, Classification with
metrics level, Int. J. Appl. Inf. Syst. (2014). reject option for software defect prediction, Appl. Softw. Comput. 49 (2016)
[18] X. Xia, D. Lo, S. McIntosh, E. Shihab, A.E. Hassan, Cross-project build co-change pre- 1085–1093.
diction, in: Proceedings of the 22nd International Conference on Software Analysis, [50] Y. Kamei, A. Monden, S. Matsumoto, T. Kakimoto, K.-i. Matsumoto, The effects of
Evolution and Reengineering (SANER), IEEE, 2015, pp. 311–320. over and under sampling on fault-prone module detection, in: Proceedings of the
[19] Z. Xu, J. Liu, Z. Yang, G. An, X. Jia, The impact of feature selection on defect predic- 1st International Symposium on Empirical Software Engineering and Measurement
tion performance: an empirical comparison, in: Proceedings of the 27th International (ESEM), IEEE, 2007, pp. 196–204.
Symposium on Software Reliability Engineering (ISSRE), IEEE, 2016, pp. 309–320. [51] K.E. Bennin, J. Keung, A. Monden, P. Phannachitta, S. Mensah, The significant effects
[20] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemom. Intell. Lab. of data sampling approaches on software defect prioritization and classification, in:
Syst. 2 (1–3) (1987) 37–52. Proceedings of the 11th International Symposium on Empirical Software Engineering
[21] Q. Song, J. Ni, G. Wang, A fast clustering-based feature subset selection algorithm and Measurement (ESEM), IEEE Press, 2017, pp. 364–373.
for high-dimensional data, IEEE Trans. Knowl. Data Eng. (TKDE) 25 (1) (2013) 1–14. [52] K.E. Bennin, J. Keung, A. Monden, Impact of the distribution parameter of data
[22] T. Wang, Z. Zhang, X. Jing, L. Zhang, Multiple kernel ensemble learning for software sampling approaches on software defect prediction models, in: Proceedings of the
defect prediction, Autom. Softw. Eng. (ASE) 23 (4) (2016) 569–590. 24th Asia-Pacific Software Engineering Conference (APSEC), IEEE, 2017, pp. 630–
[23] F. Liu, X. Gao, B. Zhou, J. Deng, Software defect prediction model based on 635.
PCA-isvm, Comput. Simulat. (2014). [53] C. Tantithamthavorn, A.E. Hassan, K. Matsumoto, The impact of class rebalanc-
[24] H. Cao, Z. Qin, T. Feng, A novel PCA-bp fuzzy neural network model for software ing techniques on the performance and interpretation of defect prediction models,
defect prediction, Adv. Sci. Lett. 9 (1) (2012) 423–428. arXiv:1801.10269 (2018).
[25] C. Zhong, Software quality prediction method with hybrid applying principal com- [54] T.M. Khoshgoftaar, E. Geleyn, L. Nguyen, L. Bullard, Cost-sensitive boosting in soft-
ponents analysis and wavelet neural network and genetic algorithm, Int. J. Digital ware quality modeling, in: Proceedings of the 7th International Symposium on High
Content Technol. Appl. 5 (3) (2011). Assurance Systems Engineering, IEEE, 2002, pp. 51–60.
[26] T.M. Khoshgoftaar, R. Shan, E.B. Allen, Improving tree-based models of software [55] J. Zheng, Cost-sensitive boosting neural networks for software defect prediction,
quality with principal components analysis, in: Proceedings of the 11th International Expert Syst. Appl. 37 (6) (2010) 4537–4543.
Symposium on Software Reliability Engineering (ISSRE), IEEE, 2000, pp. 198–209. [56] M. Liu, L. Miao, D. Zhang, Two-stage cost-sensitive learning for software defect pre-
[27] M. Shepperd, Q. Song, Z. Sun, C. Mair, Data quality: some comments on the diction, IEEE Trans. Reliab. 63 (2) (2014) 676–686.
nasa software defect datasets, IEEE Trans. Softw. Eng. (TSE) 39 (9) (2013) [57] M.J. Siers, M.Z. Islam, Software defect prediction using a cost sensitive decision
1208–1215. forest and voting, and a potential solution to the class imbalance problem, Inf. Syst.
[28] D. Gray, D. Bowes, N. Davey, Y. Sun, B. Christianson, The misuse of the nasa metrics 51 (2015) 62–71.
data program data sets for automated software defect prediction, in: Proceedings of [58] J. Peng, D.R. Heisterkamp, Kernel indexing for relevance feedback image retrieval,
the 15th Annual Conference on Evaluation & Assessment in Software Engineering in: Proceedings of the 10th International Conference on Image Processing (ICIP), 1,
(EASE), IET, 2011, pp. 96–103. IEEE, 2003, pp. I–733.
[29] T. Menzies, K. Ammar, A. Nikora, J. DiStefano, How simple is software defect de- [59] J. Li, S. Chu, J.-S. Pan, Kernel principal component analysis (Kpca)-based face
tection, Submitt. Emprical Softw. Eng. J. (2003). recognition, in: Kernel Learning Algorithms for Face Recognition, Springer, 2014,
[30] B. Schölkopf, A. Smola, K.-R. Müller, Kernel principal component analysis, in: Pro- pp. 71–99.
ceedings of the 7th International Conference on Artificial Neural Networks (ICANN), [60] H. Abdi, L.J. Williams, Principal component analysis, Wiley Interdiscip. Rev. Com-
Springer, 1997, pp. 583–588. put. Stat. 2 (4) (2010) 433–459.
[31] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel [61] G. Huang, G. Huang, S. Song, K. You, Trends in extreme learning machines: a review,
eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319. Neural Netw. 61 (2015) 32–48.
[32] K.I. Kim, M.O. Franz, B. Scholkopf, Iterative kernel principal component analysis [62] S. Ding, H. Zhao, Y. Zhang, X. Xu, R. Nie, Extreme learning machine: algorithm,
for image modeling, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 27 (9) (2005) theory and applications, Artif. Intell. Rev. 44 (1) (2015) 103–115.
1351–1366. [63] G. Huang, L. Chen, C.K. Siew, Universal approximation using incremental construc-
[33] S. Lessmann, B. Baesens, C. Mues, S. Pietsch, Benchmarking classification models for tive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw.
software defect prediction: a proposed framework and novel findings, IEEE Trans. (TNN) 17 (4) (2006) 879–892.
Softw. Eng. (TSE) 34 (4) (2008) 485–496. [64] C.R. Rao, S.K. Mitra, Generalized inverse of matrices and its applications, John Wiley
[34] W. Zong, G. Huang, Y. Chen, Weighted extreme learning machine for imbalance & Sons, New York, 1971.
learning, Neurocomputing 101 (2013) 229–242. [65] C.R. Johnson, Matrix Theory and Applications, 40, American Mathematical Soc.,
[35] G. Huang, Q. Zhu, C. Siew, Extreme learning machine: theory and applications, Neu- 1990.
rocomputing 70 (1) (2006) 489–501. [66] R. Fletcher, Practical Methods of Optimization, John Wiley & Sons, 2013.
[36] S. Shivaji, E.J. Whitehead, R. Akella, S. Kim, Reducing features to improve code [67] A. Asuncion, D. Newman, Uci machine learning repository, 2007.
change-based bug prediction, IEEE Trans. Softw. Eng. (TSE) 39 (4) (2013) 552–569. https://www.archive.ics.uci.edu/ml/index.php.


[68] Y. Zhang, D. Lo, X. Xia, J. Sun, An empirical study of classifier combination for [85] J. Nam, S. Kim, Clami: defect prediction on unlabeled datasets, in: Proceedings of
cross-project defect prediction, in: Proceedings of the 39th Computer Software and the 30th IEEE/ACM International Conference on Automated Software Engineering
Applications Conference (COMPSAC), 2, IEEE, 2015, pp. 264–269. (ASE), IEEE, 2015, pp. 452–463.
[69] L. Chen, B. Fang, Z. Shang, Y. Tang, Negative samples reduction in cross-company [86] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, S. Ying, On the multiple sources and
software defects prediction, Inf. Softw. Technol. (IST) 62 (2015) 67–77. privacy preservation issues for heterogeneous defect prediction, IEEE Trans. Softw.
[70] M. Jureczko, L. Madeyski, Towards identifying software project clusters with regard Eng. (TSE) (2017).
to defect prediction, in: Proceedings of the 6th International Conference on Predic- [87] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, S. Ying, Cost-sensitive transfer kernel canonical
tive Models in Software Engineering, ACM, 2010, p. 9. correlation analysis for heterogeneous defect prediction, Autom. Softw. Eng. (ASE)
[71] M. Jureczko, D. Spinellis, Using object-oriented design metrics to predict software 25 (2) (2018) 201–245.
defects, Models Methods Syst. Dependability. Oficyna Wydawnicza Politechniki [88] S. Herbold, A. Trautsch, J. Grabowski, A comparative study to benchmark cross-pro-
Wrocławskiej (2010) 69–81. ject defect prediction approaches, in: Proceedings of the 40th International Confer-
[72] Y. Ma, G. Luo, X. Zeng, A. Chen, Transfer learning for cross-company software defect ence on Software Engineering (ICSE), ACM, 2018, p. 1063.
prediction, Inf. Softw. Technol. (IST) 54 (3) (2012) 248–256. [89] B. Pizzileo, K. Li, G.W. Irwin, W. Zhao, Improved structure optimization for
[73] Z. Zhang, X. Jing, T. Wang, Label propagation based semi-supervised learning for fuzzy-neural networks, IEEE Trans. Fuzzy Syst. (TFS) 20 (6) (2012) 1076–1089.
software defect prediction, Autom. Softw. Eng. (ASE) 24 (1) (2017) 47–69. [90] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector ma-
[74] A.J. Smola, B. Schölkopf, Learning with Kernels, GMD-Forschungszentrum Informa- chines, IEEE Trans. Neural Netw. (TNN) 13 (2) (2002) 415–425.
tionstechnik, 1998. [91] S.D. Thepade, M.M. Kalbhor, Novel data mining based image classification with
[75] W. Yu, F. Zhuang, Q. He, Z. Shi, Learning deep representations via extreme learning Bayes, tree, rule, lazy and function classifiers using fractional row mean of cosine,
machines, Neurocomputing 149 (2015) 308–315. sine and walsh column transformed images, in: Proceedings of the International
[76] K. Herzig, S. Just, A. Rau, A. Zeller, Predicting defects using change genealogies, in: Conference on Communication, Information and Computing Technology (ICCICT),
Proceedings of the 24th International Symposium on Software Reliability Engineer- IEEE, 2015, pp. 1–6.
ing (ISSRE), IEEE, 2013, pp. 118–127. [92] S. Shivaji, Efficient bug prediction and fix suggestions, University of California, Santa
[77] X.-Y. Jing, S. Ying, Z.-W. Zhang, S.-S. Wu, J. Liu, Dictionary learning based software Cruz, 2013 Ph.D. thesis.
defect prediction, in: Proceedings of the 36th International Conference on Software [93] Z. Sun, Q. Song, X. Zhu, Using coding-based ensemble learning to improve software
Engineering (ICSE), ACM, 2014, pp. 414–423. defect prediction, IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42 (6) (2012)
[78] J. Hryszko, L. Madeyski, M. Dabrowska, P. Konopka, Defect prediction with bad 1806–1817.
smells in code, arXiv:1703.06300 (2017). [94] Z. Islam, H. Giggins, Knowledge discovery through SysFor: a systematically devel-
[79] D. Ryu, O. Choi, J. Baik, Value-cognitive boosting with a support vector machine for oped forest of multiple decision trees, in: Proceedings of the Ninth Australasian
cross-project defect prediction, Empir. Softw. Eng. 21 (1) (2016) 43–71. Data Mining Conference-Volume 121, Australian Computer Society, Inc., 2011,
[80] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. pp. 195–204.
Learn. Res. 7 (Jan) (2006) 1–30. [95] M.J. Siers, M.Z. Islam, Cost sensitive decision forest and voting for software defect
[81] T. Mende, R. Koschke, Effort-aware defect prediction models, in: Proceedings of the prediction, in: Proceedings of the Pacific Rim International Conference on Artificial
14th European Conference on Software Maintenance and Reengineering (CSMR), Intelligence, Springer, 2014, pp. 929–936.
IEEE, 2010, pp. 107–116. [96] H. Qu, G. Li, W. Xu, An asymmetric classifier based on partial least squares, Pattern
[82] J.H. Zar, et al., Biostatistical Analysis, Pearson Education India, 1999. Recognit. 43 (10) (2010) 3448–3457.
[83] Y. Jiang, B. Cukic, Y. Ma, Techniques for evaluating fault prediction models, Empir. [97] X. Liu, J. Wu, Z. Zhou, Exploratory undersampling for class-imbalance learning, IEEE
Softw. Eng. (ESE) 13 (5) (2008) 561–595. Trans. Syst. Man Cybern. Part B (Cybern.) 39 (2) (2009) 539–550.
[84] M. DAmbros, M. Lanza, R. Robbes, Evaluating defect prediction approaches: a
benchmark and an extensive comparison, Empir. Softw. Eng. (ESE) 17 (4–5) (2012)
531–577.
