Byoung-Jun Park, Sung-Kwun Oh, Witold Pedrycz
Information Sciences 229 (2013) 40–57
journal homepage: www.elsevier.com/locate/ins
Article history: Received 7 March 2008; Received in revised form 29 October 2010; Accepted 17 January 2011; Available online 23 January 2011.

Keywords: Software defect; Neural networks; Pattern classification; Fuzzy clustering; Two-class discrimination; Imbalanced data.

Abstract. In this study, we introduce a design methodology of polynomial function-based Neural Network (pf-NN) classifiers (predictors). The essential design components include Fuzzy C-Means (FCM), regarded as a generic clustering algorithm, and polynomials providing all required nonlinear capabilities of the model. The learning method uses a weighted cost function (objective function), while to analyze the performance of the system we engage a standard receiver operating characteristic (ROC) analysis. The proposed networks are used to detect software defects. From the conceptual standpoint, the classifier of this form can be expressed as a collection of ''if-then'' rules. Fuzzy clustering (Fuzzy C-Means, FCM) is aimed at the development of the premise layer of the rules, while the corresponding consequences of the rules are formed by some local polynomials. A detailed learning algorithm for the pf-NNs is presented, with particular provisions made for dealing with imbalanced classes encountered quite commonly in software quality problems. There, the use of simple measures such as accuracy of classification becomes questionable. In the assessment of the quality of classifiers, we confine ourselves to the use of the area under curve (AUC) in the receiver operating characteristic (ROC) analysis. AUC comes as a sound classifier metric capturing a tradeoff between the high true positive rate (TP) and the low false positive rate (FP). The performance of the proposed classifier is contrasted with the results produced by some ''standard'' Radial Basis Function (RBF) neural networks.

© 2011 Elsevier Inc. All rights reserved.
1. Introduction
The construction of an effective predictive model for software defect estimation is one of the key challenges of quantitative software engineering. A wide range of prediction models have been proposed on the basis of product metrics by applying different modeling techniques, cf. [2,5,7,13,22,29,32]. Most defect predictors have used size and complexity metrics when
attempting to predict defects of software. The earliest study along this line is the one reported by Akiyama [2], which dealt
with the system developed at Fujitsu, Japan. Software metrics have been used as quantitative means of assessing software
development process as well as the quality of software products. Many researchers have studied the correlation between
software design metrics and the likelihood of occurrence of software faults [6,11,15,18,30,33].
As the size and complexity of software systems increase, the software industry is highly challenged to deliver high-quality, reliable software on time and within budget. Although there is a diversity of definitions of software quality, it is widely accepted that a project with many defects lacks quality. Defects are commonly defined as deviations from specifications or expectations that might lead to future failures in operation [5,13]. Knowing the causes of possible defects, as well as identifying general software management decisions and practices that may need attention from the beginning of a project, could save money, time and work. The estimation of the potential fault-proneness of software is an indicator of quality and can help in planning, controlling and executing software development and maintenance activities [5].
A variety of statistical techniques are used in software quality modeling. Models often exploit statistical relationships between measures of quality and software metrics. However, relationships between static software metrics and quality factors are often complex and nonlinear, limiting the accuracy of conventional modeling approaches (such as linear regression models). An efficient method for software quality (defect) analysis is to learn from past mistakes and acquired experience, namely, to use machine learning approaches. Neural networks are adept at modeling nonlinear functional relationships that are difficult to capture with other techniques, and thus form an attractive alternative for software quality modeling. However, traditional machine learning approaches may face a difficulty in building a classification model (classifier) with binary class-imbalanced data in software quality engineering [14]. A class-imbalanced dataset has a class that greatly outnumbers the other classes. One of the commonly encountered techniques to alleviate the problems associated with class imbalance is data sampling, which can be accomplished through undersampling [20], oversampling [8], or a combination of both [21].
The goal of this study is to develop a design environment for polynomial-based neural network predictors (classifiers) for the prediction of defects in software engineering artifacts. Neural networks (NNs) have been widely used to deal with pattern classification problems. It has been shown that NNs can be trained to approximate complex discriminant functions [24]. NN classifiers can deal with numerous multivariable nonlinear problems for which an accurate analytical solution is difficult to derive or does not exist [3]. It is found, however, that the quality and effectiveness of NN classifiers depend on several essential parameters whose values are crucial to the accurate prediction of the properties being sought. The appropriate NN architecture, the number of hidden layers, and the number of neurons in each hidden layer are important design issues that immediately affect the accuracy of the prediction. Unfortunately, there is no direct method to identify these factors, as they need to be determined on an experimental basis [3]. In addition, it is difficult to understand and interpret the resulting NNs. These difficulties increase once the number of variables and the size of the networks start to grow [34,38].
To alleviate the problems highlighted above, we propose to consider polynomial function-based Neural Network (pf-NN) classifiers exploiting a direct usage of Fuzzy C-Means clustering and involving polynomials in the description of the relationships of the models. We also describe the use of a suitable learning methodology based on the weighted cost function, with the intention to cope with imbalanced classes. The proposed classifier is expressed as a collection of ''if-then'' rules. In its design, the premise parts of the rules are constructed with the use of the FCM. In the consequence (conclusion) part of the network we consider the use of polynomials which serve as the corresponding local models. The aggregation involves mechanisms of fuzzy inference. According to the type of polynomial, pf-NNs can be categorized into linear function-based NNs (lf-NNs) and quadratic function-based NNs (qf-NNs). For this category of neural architectures, we construct a detailed learning algorithm. Generally speaking, most learning algorithms lead to high classification accuracy under the assumption that the distribution of classes is almost even. However, defect problems in software engineering exhibit large differences between class frequencies, known as the class imbalance problem. Given this, an evaluation of the performance of the classifier using only classification accuracy becomes questionable and has to be replaced by a more suitable criterion. To cope with the problem, we use a weighted cost function for learning and the area under curve (AUC) in the receiver operating characteristic (ROC) analysis. AUC is a classifier metric which captures an important tradeoff between the high true positive rate (TP) and the low false positive rate (FP) [16].
In order to demonstrate the usefulness of the proposed pf-NNs in software engineering, we use two software engineering datasets (CM1 and DATATRIEVE) coming from the PROMISE repository of empirical software engineering data (http://promisedata.org/repository/). The performance of the proposed classifier is contrasted with the results produced by radial basis function (RBF) NNs, given that these networks are one of the most widely applied categories of neural classifiers [34,35].
This paper is organized as follows. We outline the underlying general topology of polynomial function-based neural networks in Section 2. The learning method of pf-NNs is given in Section 3, where we elaborate on all necessary development issues. Section 4 covers the assessment of the performance of the proposed predictors. Extensive experimental studies are presented in Section 5, while Section 6 offers some concluding comments.
2. Design of polynomial function-based neural network predictors for software defect detection
In what follows, we discuss a general topology of polynomial function-based neural networks (pf-NNs).
Neural Networks (NNs) have been widely used to deal with pattern classification problems. The generic topology of NNs consists of three layers, as shown in Fig. 1. Among the most widely applied neural classifiers are RBF NNs [34,35]. RBF NNs exhibit
some advantages including global optimal approximation and classification capabilities, and rapid convergence of the learn-
ing procedures, cf. [12,19].
From the structural perspective, RBF NNs come with a single hidden layer. Each node in the hidden layer associates with a
receptive field H(x) and generates a certain activation level depending on some incoming input x. The output y(x) is a linear
combination of the activation levels of the receptive fields:
$$y(\mathbf{x}) = \sum_{i=1}^{c} w_i H_i(\mathbf{x}). \quad (1)$$

where x is the n-dimensional input vector [x1, . . . , xn]T, vi = [vi1, . . . , vin]T is the center of the ith basis function Hi(x), and c is the number of nodes of the hidden layer. Typically the distance shown in (2) is the Euclidean one [36].
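As a concrete illustration of the output (1), the following minimal sketch assumes Gaussian receptive fields, which is only one common choice (the names `rbf_output`, `centers` and `widths` are ours, not from the paper):

```python
import numpy as np

def rbf_output(x, centers, widths, w):
    """Output of an RBF network per Eq. (1): y(x) = sum_i w_i * H_i(x).

    A Gaussian receptive field is assumed here as one common choice;
    the formulation only requires some radial basis function H_i.
    """
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared Euclidean distances to the centers v_i
    H = np.exp(-d2 / (2.0 * widths ** 2))     # activation levels of the c hidden nodes
    return float(w @ H)                       # linear combination at the output neuron
```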
The proposed pf-NNs exhibit a topology similar to the one encountered in RBF NNs. However, the functionality and the associated design process exhibit some evident differences. In particular, the receptive fields do not assume any explicit functional form (say, Gaussian, ellipsoidal, etc.), but are directly reflective of the nature of the data and come as the result of fuzzy clustering. Considering the prototypes v1, v2, . . . , vc formed by the FCM method, the receptive fields are expressed in the following way:
$$H_i(\mathbf{x}) = A_i(\mathbf{x}) = \frac{1}{\sum_{j=1}^{c} \left( \|\mathbf{x}-\mathbf{v}_i\|^2 \big/ \|\mathbf{x}-\mathbf{v}_j\|^2 \right)}. \quad (3)$$
In addition, the weights between the output layer and the hidden layer are not constants but come in the form of polynomials of the input variables, namely

$$w_i = f_i(\mathbf{x}). \quad (4)$$
The neuron located at the output layer completes a linear combination of the activation levels of the corresponding receptive
fields hence (1) can be rewritten as follows:
$$y(\mathbf{x}) = \sum_{i=1}^{c} f_i(\mathbf{x}) A_i(\mathbf{x}). \quad (5)$$
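The receptive fields of (3) and the output of (5) can be sketched as follows; linear local polynomials f_i(x) = a_i0 + a_i^T x are assumed here for concreteness, and the function names are illustrative rather than the paper's:

```python
import numpy as np

def memberships(x, V):
    """Receptive fields A_i(x) of Eq. (3), induced by the FCM prototypes V (c x n).

    Eq. (3) can be rewritten as A_i = (1/d_i^2) / sum_j (1/d_j^2), which makes it
    evident that the memberships sum to one at every point of the input space.
    """
    d2 = np.maximum(np.sum((V - x) ** 2, axis=1), 1e-12)  # guard: x sitting on a prototype
    inv = 1.0 / d2
    return inv / inv.sum()

def pf_nn_output(x, V, A_coef):
    """Eq. (5): y(x) = sum_i f_i(x) * A_i(x), with one row of polynomial
    coefficients [a_i0, a_i1, ..., a_in] per rule (the linear, lf-NN variant)."""
    u = memberships(x, V)
    f = A_coef[:, 0] + A_coef[:, 1:] @ x   # local linear models f_i(x)
    return float(u @ f)
```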
The above structure of the classifier can be represented through a collection of fuzzy rules
$$R^i: \ \text{If } \mathbf{x} \text{ is } A_i \ \text{then } y_i = f_i(\mathbf{x}), \quad i = 1, 2, \ldots, c. \quad (6)$$

[Fig. 2. Topology of the pf-NN: input layer (x1, x2, x3), hidden layer of receptive fields A1, . . . , A4 formed by fuzzy clustering, polynomial weights f1(x1, x2, . . .), . . . , fc(x1, x2, . . .), and the output summation Σ.]
The premise layer of pf-NNs is formed by means of the Fuzzy C-Means (FCM) clustering. In this section, we briefly review objective function-based fuzzy clustering with the intent of highlighting its key features useful here. The FCM algorithm comes as a standard mechanism aimed at the formation of 'c' fuzzy sets (relations) in Rn. The objective function Q guiding the clustering process is expressed as a sum of the distances of individual data from the prototypes v1, v2, . . . , and vc,
$$Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \|\mathbf{x}_k - \mathbf{v}_i\|^2. \quad (7)$$
Here, ‖·‖ denotes a certain distance function; 'm' stands for a fuzzification factor (coefficient), m > 1.0. The commonly used value of 'm' is equal to 2. N is the number of patterns (data). The resulting partition matrix is denoted by U = [uik], i = 1, 2, . . . , c; k = 1, 2, . . . , N. While there is a substantial diversity as far as distance functions are concerned, here we adhere to a weighted Euclidean distance taking on the following form
$$\|\mathbf{x}_k - \mathbf{v}_i\|^2 = \sum_{j=1}^{n} \frac{(x_{kj} - v_{ij})^2}{\sigma_j^2} \quad (8)$$
with σj being the standard deviation of the jth variable. While not being computationally demanding, this type of distance is still quite flexible and commonly used.
The minimization of Q is realized in successive iterations by adjusting both the prototypes and entries of the partition
matrix, min Q(U, v1, v2, . . . , vc). The corresponding formulas leading to the iterative algorithm read as follows:
$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \|\mathbf{x}_k - \mathbf{v}_i\| \big/ \|\mathbf{x}_k - \mathbf{v}_j\| \right)^{2/(m-1)}}, \quad 1 \le k \le N, \ 1 \le i \le c \quad (9)$$
and
$$\mathbf{v}_i = \frac{\sum_{k=1}^{N} u_{ik}^{m} \, \mathbf{x}_k}{\sum_{k=1}^{N} u_{ik}^{m}}, \quad 1 \le i \le c. \quad (10)$$
The properties of the optimization algorithm are well documented in the literature, cf. [1,4]. In the context of our investiga-
tions, we note that the resulting partition matrix is a clear realization of ‘c’ fuzzy relations with the membership functions
u1, u2, . . . , uc forming the corresponding rows of the partition matrix U, that is, $U = [\mathbf{u}_1^{T} \ \mathbf{u}_2^{T} \ldots \mathbf{u}_c^{T}]^{T}$.
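The alternating updates (9) and (10) can be sketched as the following minimal FCM loop; the random initialization of the partition matrix, the fixed iteration count, and the small numerical guard are our own choices, not prescriptions of the paper:

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Fuzzy C-Means: alternate the prototype update (10) and the
    partition-matrix update (9) starting from a random partition."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0)                                   # columns of U sum to one
    for _ in range(iters):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)     # prototypes, Eq. (10)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)  # squared distances (c x N)
        d2 = np.maximum(d2, 1e-12)                       # guard against a zero distance
        inv = d2 ** (-1.0 / (m - 1.0))                   # d^{-2/(m-1)} on squared distances
        U = inv / inv.sum(axis=0)                        # partition matrix, Eq. (9)
    return U, V
```

Note that (9) is implemented in its normalized form u_ik = d_ik^{-2/(m-1)} / Σ_j d_jk^{-2/(m-1)}, which is algebraically identical to the ratio form above.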
Polynomial functions are dealt with in the consequence layer. In Fig. 2 and (6), fi(x) is represented as a polynomial of one of the following forms.
$$\text{Linear function-based NN:} \quad f_i(\mathbf{x}) = a_{i0} + \sum_{j=1}^{n} a_{ij} x_j, \quad (11)$$
$$\text{Quadratic function-based NN:} \quad f_i(\mathbf{x}) = a_{i0} + \sum_{j=1}^{n} a_{ij} x_j + \sum_{j=1}^{n} \sum_{k=j}^{n} a_{ijk} x_j x_k. \quad (12)$$
These functions are activated by the partition matrix and lead to local regression models in the consequence layer of each linguistic rule. As shown in (11) and (12), the proposed pf-NNs are separated into linear function-based NNs (lf-NNs) and quadratic function-based NNs (qf-NNs). In the case of the quadratic function, the dimensionality of the problem goes up for a larger number of inputs, namely, through the large number of combinations of pairs of the input variables. In anticipation of the computational difficulties, we confine ourselves to a reduced quadratic function
$$\text{Reduced qf-NN:} \quad f_i(\mathbf{x}) = a_{i0} + \sum_{j=1}^{n} a_{ij} x_j + \sum_{k=1}^{n} a_{ikk} x_k^2. \quad (13)$$
Let us consider the pf-NN structure by viewing the fuzzy partition (formed by the FCM) as shown in Fig. 2. The family of fuzzy sets Ai forms a partition (so that the membership grades sum up to one at each point of the input space). The 'Σ' neuron realizes the sum governed by (5). The output of pf-NNs can be obtained through a general scheme of fuzzy inference based on a collection of ''if-then'' fuzzy rules [17,28]. More specifically, we obtain
$$y = g(\mathbf{x}) = \sum_{i=1}^{c} \frac{u_i f_i(\mathbf{x})}{\sum_{j=1}^{c} u_j} = \sum_{i=1}^{c} u_i f_i(\mathbf{x}) \quad (14)$$
where ui = Ai(x) [4]. g(x) is a concise description of the pf-NN regarded as a discriminant function. Note that, based on the local representations (polynomials), the global characteristics of the pf-NNs result through the composition of their local relationships at the aggregation layer of the pf-NNs.
In this section, we consider the use of pf-NNs as software defect predictors and show how their functionality gives rise to a highly nonlinear classifier. A defect predictor is a two-class classifier with software modules being fault free (class x+) and faulty (class x−). The discriminant function gives rise to the following classification rule:

$$\text{Decide } x^{+} \text{ if } g(\mathbf{x}) > 0; \ \text{otherwise decide } x^{-}. \quad (15)$$
The final output of the networks, (14), is used as a discriminant function g(x) and can be rewritten as a linear combination

$$g(\mathbf{x}) = \mathbf{a}^{T} \mathbf{f}_{\mathbf{x}}, \quad (16)$$

where:

(i) lf-NNs: a = [a10, . . . , ac0, a11, . . . , ac1, . . . , acn]T, fx = [u1, . . . , uc, u1x1, . . . , ucx1, . . . , ucxn]T;
(ii) qf-NNs: a = [a10, . . . , ac0, a11, . . . , ac1, . . . , acn, . . . , acnn]T, fx = [u1, . . . , uc, u1x1, . . . , ucx1, . . . , ucxn, . . . , ucxnxn]T;
(iii) reduced qf-NNs: a = [a10, . . . , ac0, a11, . . . , ac1, . . . , acn, . . . , acnn]T, fx = [u1, . . . , uc, u1x1, . . . , ucx1, . . . , ucxn, . . . , ucxn2]T.
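Constructing the vector f_x from the membership values u and input x can be sketched as follows for the lf-NN and reduced qf-NN variants (the function names are illustrative; the entries are grouped per input variable, as in the vectors above):

```python
import numpy as np

def fx_lf(u, x):
    """f_x for lf-NNs: [u_1..u_c, u_1*x_1..u_c*x_1, ..., u_1*x_n..u_c*x_n]."""
    return np.concatenate([u] + [u * xj for xj in x])

def fx_reduced_qf(u, x):
    """f_x for reduced qf-NNs: the lf-NN part followed by the squared-variable
    terms u_i * x_k^2 (no cross products x_j * x_k)."""
    return np.concatenate([u] + [u * xj for xj in x] + [u * xj ** 2 for xj in x])
```

For c rules and n inputs, fx_lf has c(n + 1) entries and fx_reduced_qf has c(2n + 1) entries, which makes the savings of the reduced quadratic form over the full quadratic one explicit.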
For the discriminant function coming in the form of (16), a two-class classifier implements the decision rule expressed by (15). Namely, x is assigned to x+ if the inner product aTfx is greater than zero, and to x− otherwise. The equation g(x) = 0 defines the (nonlinear) decision surface that separates the two classes of software modules.
3. Learning method of pf-NNs

In order to identify the parameters of the consequence layer of the pf-NN classifier, we develop a learning algorithm based on a weighted cost function.
Consider that a set of N patterns fx1, . . . , fxN is given, each labeled as x+ or x−. We want to use these patterns to determine the coefficients a in the discriminant function (16). A pattern xk is classified correctly if aTfxk > 0 and fxk is labeled x+, or if aTfxk < 0 and fxk is labeled x−. To simplify the treatment of the two-class case, all patterns labeled x− are replaced by their negatives. Thus we look for a coefficient vector a such that aTfxk > 0 for all patterns. Seeking a vector making all of the inner products aTfxk positive, one can consider aTfxk = bk, cf. [9]. Here bk is some arbitrarily specified positive constant; generally one could assume bk = 1. Then the problem is to determine a coefficient vector a satisfying the system of linear equations
$$\mathbf{X}\mathbf{a} = \mathbf{b}, \quad (17)$$
where X = [fx1, fx2, . . . , fxN]T and b = [b1, . . . , bN]T is called a margin vector. The coefficient vector a standing in the consequence layer of the pf-NN classifier can be determined by the standard least squares method. Namely, the coefficients can be estimated by solving the optimization problem
$$\min_{\mathbf{a}} J(\mathbf{a}, N), \quad (18)$$
where the minimized performance index J comes in the form
$$J(\mathbf{a}, N) = \|\mathbf{X}\mathbf{a} - \mathbf{b}\|^2 = \sum_{i=1}^{N} (\mathbf{a}^{T} \mathbf{f}\mathbf{x}_i - b_i)^2. \quad (19)$$
The advantage of using the standard least squares estimator is that the classifier can be easily determined. However, the disadvantage is that the preferred classifier is not derived by optimizing the results of classification or the cost of classification in (19) for imbalanced data sets. Note that parameter estimators that directly optimize classification performance are generally difficult to find due to factors such as the unknown probability function of the data distribution, or possibly non-differentiable objective functions [16].
A common scenario in learning realized in the presence of imbalanced data sets is that the trained classifier tends to classify all examples into the majority class [23]. In practice, it is often very important to have accurate classification for the minority class, e.g., in the application of abnormality detection. This issue is apparent in software defect analysis. A major problem in predicting software quality using the number of component defects as a direct metric is the highly skewed distribution of faults, because the majority of components have no defects or very few defects [22]. In this case, standard least squares learning methods with equal weighting for all data points will produce unfavorable classification results for the minority class (defect), which has a much smaller influence than the majority class on the resulting parameter estimates.
In order to learn parameters which reflect the minority class, we use the weighted least squares learning method based on the following cost function, which is sensitive to each data sample's importance for classification [16,23]:
$$J = c^{+} \sum_{\mathbf{x}_i \in x^{+}} e_i^2 + c^{-} \sum_{\mathbf{x}_i \in x^{-}} e_i^2, \quad (20)$$
where ei is the classification error, and c+ and c− are the weights for x+ and x−, respectively. If c+ = c− = 1, then the cost function for parameter learning becomes the same as the one used in (19). In this paper, the weights for x+ and x− are defined according to the number of patterns in each class, viz.
$$c^{+} = \frac{N}{N^{+}}, \qquad c^{-} = \frac{N}{N^{-}}, \quad (21)$$
where N+ is the number of patterns of the minority class x+, and N− is the number of patterns belonging to x−. We have N = N+ + N−. Following (21), the performance index (20) assigns more weight to data points positioned in the minority class, and this helps to alleviate the potential problem of classifying all examples into the majority class [16].
We introduce the following matrix form:

$$\mathbf{C} = \operatorname{diag}(c_1, c_2, \ldots, c_N), \quad \text{in which} \quad c_i = \begin{cases} c^{+} & \text{if } \mathbf{x}_i \in x^{+}, \\ c^{-} & \text{if } \mathbf{x}_i \in x^{-}. \end{cases} \quad (22)$$
Then (20) can be expressed as follows:

$$J(\mathbf{a}, N, \mathbf{C}) = \mathbf{e}^{T}\mathbf{C}\mathbf{e} = [\mathbf{X}\mathbf{a} - \mathbf{b}]^{T}\mathbf{C}[\mathbf{X}\mathbf{a} - \mathbf{b}] = [\mathbf{C}^{1/2}\mathbf{X}\mathbf{a} - \mathbf{C}^{1/2}\mathbf{b}]^{T}[\mathbf{C}^{1/2}\mathbf{X}\mathbf{a} - \mathbf{C}^{1/2}\mathbf{b}] = \|\tilde{\mathbf{X}}\mathbf{a} - \tilde{\mathbf{b}}\|^2 = \sum_{i=1}^{N} (\mathbf{a}^{T}\tilde{\mathbf{f}\mathbf{x}}_i - \tilde{b}_i)^2. \quad (23)$$
Given this form of the performance index, its global minimum is produced by the least squares method. This leads to the optimal solution in the well-known format

$$\mathbf{a} = (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^{T}\tilde{\mathbf{b}}, \quad (24)$$

where $\tilde{\mathbf{X}} = [\tilde{\mathbf{f}\mathbf{x}}_1, \tilde{\mathbf{f}\mathbf{x}}_2, \ldots, \tilde{\mathbf{f}\mathbf{x}}_N]^{T} = [\sqrt{c_1}\,\mathbf{f}\mathbf{x}_1, \sqrt{c_2}\,\mathbf{f}\mathbf{x}_2, \ldots, \sqrt{c_N}\,\mathbf{f}\mathbf{x}_N]^{T}$ and $\tilde{\mathbf{b}} = [\tilde{b}_1, \tilde{b}_2, \ldots, \tilde{b}_N]^{T} = [\sqrt{c_1}\,b_1, \sqrt{c_2}\,b_2, \ldots, \sqrt{c_N}\,b_N]^{T}$.
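The weighted solution (24) with the class weights (21) can be sketched as follows. The sketch assumes that the rows of f_x belonging to patterns labeled x− have already been negated (so that all margins b_k are positive), and the function name is ours:

```python
import numpy as np

def weighted_ls(Fx, y, b=None):
    """Weighted least-squares estimate of Eq. (24).

    Fx : N x dim matrix whose rows are the vectors f_x (rows of patterns
         labeled x- are assumed to be already negated).
    y  : +1 for the minority class x+, -1 for x-; used only to pick the
         class weights of Eq. (21).
    """
    y = np.asarray(y, dtype=float)
    N = len(y)
    if b is None:
        b = np.ones(N)                                       # the usual choice b_k = 1
    c_pos, c_neg = N / np.sum(y > 0), N / np.sum(y < 0)      # Eq. (21)
    s = np.sqrt(np.where(y > 0, c_pos, c_neg))               # diagonal of C^{1/2}
    a, *_ = np.linalg.lstsq(Fx * s[:, None], b * s, rcond=None)  # solves Eq. (24)
    return a
```

Using `lstsq` on the weighted matrix is numerically preferable to forming the normal equations of (24) explicitly, while producing the same minimizer.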
As mentioned in the previous section, the use of the reduced quadratic function can be helpful in reducing the number of coefficients, which otherwise imposes a substantial computational load when dealing with a large number of combinations of pairs of variables.
4. Assessment of performance of the proposed predictors

The performance of the proposed predictors of software defects is assessed by making use of the receiver operating characteristic (ROC) analysis. Let us recall that ROC analysis is a classical approach coming from signal detection theory [10,16,37]. Formally, a defect predictor hunts for a signal that a software module is defect prone. Signal detection theory offers ROC curves as a method for assessing the quality of different predictors [27]. ROC analysis has been widely used in medical diagnosis [37] and is receiving considerable attention in the machine learning research community [31]. It has been shown that the ROC approach is a powerful tool both for making practical choices and for drawing general conclusions. The cost of misclassification is described in [31], in which the ROC convex hull (ROCCH) method has been introduced to manage classifiers (via combining or voting). One of the performance metrics used in ROC analysis is the area under curve (AUC) of the ROC, to be maximized; it is equivalent to the probability that, for a pair of data samples drawn randomly from the positive and negative groups respectively, the classifier ranks the positive sample higher than the negative one in terms of ''being positive'' [10,16,27,31,37].
Consider a two-class data set consisting of N n-dimensional patterns belonging to x = {x+, x−}. There are N+ positive data samples which belong to x+ and N− negative data patterns; N = N+ + N−. In this paper, the minority class of defect patterns is referred to as the positive class x+. Let a two-class classifier be formed using the data set. The performance of the classifier may be evaluated using the counts of patterns {A, B, C, D} collected in the confusion matrix shown in Table 1. Clearly N+ = A + B and N− = C + D. The true positive rate (TP) is the proportion of positive patterns that were correctly identified [16]. TP is also called the probability of detection (PD), or recall, in software engineering [27].
$$TP = \frac{A}{A + B} = \frac{A}{N^{+}}. \quad (25)$$
The false positive rate (FP) is the proportion of negative patterns that were incorrectly classified as positive. FP is also called the probability of a false alarm (PF).

$$FP = \frac{C}{C + D} = \frac{C}{N^{-}}, \quad (26)$$

$$CA = \frac{A + D}{A + B + C + D} = \frac{A + D}{N}. \quad (27)$$
Classification accuracy (CA) is defined as the percentage of true negatives and true positives. Maximizing CA is a commonly
used performance metric, which is equivalent to minimizing the misclassification rate of the classifier.
Alternatively, a classifier can be mapped to a point in the two-dimensional ROC plane with coordinates (FP, TP), as shown in Fig. 3. The y-axis shows the true positive rate (TP) and the x-axis the false positive rate (FP) of a classifier. By definition, the ROC curve must pass through the points FP = TP = 0 and FP = TP = 1 (a predictor that never triggers never raises false alarms; a predictor that always triggers always generates false alarms).
Three interesting trajectories connect these points:
1. A straight line from (0, 0) to (1, 1) is of little interest since it offers no information; i.e., the probability of a predictor firing
(Defect detection) is the same as it being silent (No defect detection).
2. Another trajectory is the negative curve that bends away from the ideal point. Elsewhere [26], we have found that if pre-
dictors negate their tests, the negative curve will transpose into a preferred curve.
3. The point (FP = 0, TP = 1) is the ideal position (called sweet spot) on ROC plane. This is where we recognize all errors and
never make mistakes. Preferred curves bend up toward this ideal point.
In the ideal case, a predictor has a high probability of detecting a genuine fault (TP) and a very low probability of false
alarm (FP). This ideal case is very rare.
ROC analysis is commonly applied to the visualization of model performance, decision analysis, and model combinations [31], with an extensive scope and numerous applications [10,37]. For instance, an AUC of 0.50 means that the diagnostic accuracy is equivalent to a pure random guess, while an AUC of 1.00 states that the classifier distinguishes class examples perfectly [16]. As the performance index of a classifier, the AUC of a ROC can be calculated as follows [16,31]:

$$AUC = \frac{1 + TP - FP}{2}. \quad (28)$$
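The metrics (25)-(28) follow directly from the counts {A, B, C, D} of the confusion matrix of Table 1; a small sketch:

```python
def roc_metrics(A, B, C, D):
    """Compute TP, FP, CA and AUC from the confusion-matrix counts:
    A true positives, B false negatives, C false positives, D true negatives."""
    TP = A / (A + B)                   # Eq. (25): probability of detection / recall
    FP = C / (C + D)                   # Eq. (26): probability of a false alarm
    CA = (A + D) / (A + B + C + D)     # Eq. (27): classification accuracy
    AUC = (1 + TP - FP) / 2            # Eq. (28)
    return TP, FP, CA, AUC
```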
Table 1
Confusion matrix of a classifier (counts of patterns).
Eq. (28) can be easily adjusted to cope with cost-sensitive classification if the misclassification costs are skewed [31], or when other performance metrics derived from TP and FP are used. Clearly, the AUC of (28) is a classifier metric expressing a tradeoff between high TP and low FP. Note that if the data set is completely balanced, with N+ = N−, then AUC = CA. However, for imbalanced data sets this equivalence no longer holds. By separating the performance of a classifier into two terms, one per class, in contrast to the classification accuracy of (27) which has only a single term, AUC makes it possible to manage the classification performance for imbalanced data.
For imbalanced datasets, there are several performance metrics widely used to balance the classification performance over the two classes. All of them are based on values reported in the confusion matrix, see Table 1.

$$BAL = 1 - \sqrt{\frac{(0 - FP)^2 + (1 - TP)^2}{2}}, \quad (29)$$

$$GM = \sqrt{TP \cdot (1 - FP)}. \quad (30)$$

The index BAL [27] is based on the Euclidean distance from the ''sweet spot'' (FP = 0, TP = 1) to the pair (FP, TP), while GM [16] is the geometric mean expressing the tradeoff between high TP and low FP.
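Both indices can be computed directly from TP and FP; a small sketch of (29) and (30):

```python
import math

def bal(TP, FP):
    """Eq. (29): one minus the distance to the sweet spot (FP=0, TP=1),
    normalized so that BAL lies in [0, 1]."""
    return 1.0 - math.sqrt(((0.0 - FP) ** 2 + (1.0 - TP) ** 2) / 2.0)

def gm(TP, FP):
    """Eq. (30): geometric mean of TP and the true negative rate (1 - FP)."""
    return math.sqrt(TP * (1.0 - FP))
```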
5. Experimental studies
The experiments reported here involve synthetic two-dimensional data as well as the CM1 and DATATRIEVE defect datasets coming from the PROMISE repository of empirical software engineering data (http://promisedata.org/repository/). Our objective is to quantify the performance of the proposed pf-NN classifiers (predictors) and compare it with the performance of some other classifiers reported in the literature. In the assessment of the performance of the classifiers, we use the AUC of the classifier.

For the software datasets, the experiments completed in this study are reported for 10-fold cross-validation using a 70%-30% split of the data into training and testing subsets; namely, 70% of the entire set of patterns is selected randomly for training and the remaining patterns are used for testing purposes.
We start with a series of two-dimensional synthetic examples. Our primary objective is to illustrate the classification behavior of the proposed pf-NN classifiers learned with the weighted cost function (20) and to interpret the resulting partition of the input space.

The collection of data involving two classes is shown in Fig. 4. The data set with two features x1 and x2 was generated following Gaussian distributions, with the majority class (x−) having mean vector [0 0]T and the identity matrix as the covariance matrix. The minority class (x+) has mean vector [2 2]T and the same identity covariance matrix. The synthetic dataset consists of 300 samples from x− and 30 patterns from x+.
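A dataset of this kind, together with the class weights of (21), can be generated as follows (the random seed is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
# majority class x-: 300 samples from N([0, 0], I); minority class x+: 30 samples from N([2, 2], I)
X_neg = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=300)
X_pos = rng.multivariate_normal([2.0, 2.0], np.eye(2), size=30)
X = np.vstack([X_neg, X_pos])
y = np.hstack([-np.ones(300), np.ones(30)])
# class weights of the cost function (20), following Eq. (21)
c_neg = len(y) / 300   # 330/300 = 1.1
c_pos = len(y) / 30    # 330/30  = 11
```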
The classification results of the proposed pf-NN classifiers are shown in Tables 2 and 3. Table 2 shows the results of the pf-NNs learned with the use of (19), while the results of the pf-NNs trained with the weighted cost function (20) are shown in Table 3. Table 2 shows results with a high accuracy (96.1%) but a low probability of detection (66.7% or 63.3%). On the opposite end, Table 3 shows results with lower accuracy, 80.3% and 78.8%, but higher probability of detection, 93.3% and 96.7%, for lf-NN and qf-NN, respectively. Classification accuracy is a good measure of a learner's performance when the possible
[Fig. 4. Synthetic two-class data in the (x1, x2) plane: majority class x− centered at [0 0]T, minority class x+ centered at [2 2]T.]
Table 2
Results of pf-NNs learned with the use of the standard cost function.
Table 3
Results of pf-NNs learned with the use of the weighted cost function.
outcomes occur with similar frequencies. However, the datasets used in this study have very uneven class distributions, as shown in Fig. 4. Therefore, we should not be using classification accuracy but instead consider AUC to be a sound vehicle to assess the performance of the classifiers. As shown in Tables 2 and 3, the learning method with the weighted cost function is better than the one making use of the standard cost function in the case of imbalanced data. In (20), c− = 330/300 = 1.1 and c+ = 330/30 = 11 are used as the weight values in the minimized performance index.
For the pf-NN classifiers learned with the use of the standard and weighted objective functions, the boundaries of clas-
sification are shown in Figs. 5 and 6, respectively. When comparing Fig. 5 with Fig. 6, we note that the classification boundary
Fig. 5. Classification boundary of pf-NN classifiers learned with the use of the standard cost function: (a) lf-NN classifier and (b) qf-NN classifier.
Fig. 6. Classification boundary of pf-NN classifiers learned with the use of the weighted cost function: (a) lf-NN classifier and (b) qf-NN classifier.
of the proposed pf-NN classifiers trained with the weighted objective function is reflective of the nature of the imbalanced dataset. For the results shown in Figs. 5 and 6, the pf-NNs can be represented in the form of a series of rules. The partition Ai of each rule in the premise layer is illustrated by means of the contours shown in Fig. 7.
Fig. 7. Linguistic variables: (a) A1, (b) A2, (c) A1 & A2, and (d) contour of A1 & A2.
Table 4
Attribute information of CM1 dataset.
Fig. 8. Results of AUC of pf-NN predictors applied to subsets of the attributes: (a) AUC of lf-NN and (b) qf-NN predictors.
The CM1 dataset comes from the PROMISE repository (http://promisedata.org/repository/). It concerns a NASA spacecraft instrument written in C. The preprocessed dataset has 21 attributes and one target attribute (defect or not), shown in Table 4, and includes Halstead, McCabe, lines of code (LOC), and other attributes. As to the class distribution, there are 449 false patterns (90.16%) and 49 true (defect) patterns (9.84%), for a total of 498 patterns. Halstead's theory, which predicts the number of defects based on the language volume, and McCabe's cyclomatic complexity, which measures and controls the number of paths through a program, are well-known metrics in software defect prediction focused on establishing relationships between software complexity, usually measured in lines of code, and defects [5,27]. Halstead estimates reading complexity by counting the number of operators and operands in a module; see the basic Halstead attributes. Also, attributes related to the Halstead attributes were used to compute the eight derived Halstead attributes shown in Table 4 [27]. Unlike Halstead, McCabe argued that the complexity of pathways between module symbols is more insightful than just a count of the symbols [27].
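For readers unfamiliar with these metrics, the basic quantities can be sketched as follows; the formulas are the standard Halstead and McCabe definitions, and the function and key names are illustrative rather than the exact CM1 attribute names:

```python
import math

def halstead(n1, n2, N1, N2):
    """Basic Halstead measures from operator/operand counts.

    n1, n2: numbers of distinct operators and operands;
    N1, N2: total occurrences of operators and operands.
    """
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)   # V = N * log2(n)
    difficulty = (n1 / 2.0) * (N2 / n2)       # D = (n1/2) * (N2/n2)
    effort = difficulty * volume              # E = D * V
    return {"vocabulary": vocabulary, "length": length,
            "volume": volume, "difficulty": difficulty, "effort": effort}

def cyclomatic_complexity(edges, nodes, components=1):
    """McCabe's cyclomatic number V(G) = E - N + 2P for a module's
    control-flow graph with E edges, N nodes, and P components."""
    return edges - nodes + 2 * components

m = halstead(n1=10, n2=20, N1=50, N2=60)
```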
To generate defect predictors, we apply the pf-NNs to the CM1 dataset. The pf-NNs are also applied to subsets of the attributes containing just the lines of code (LOC), just McCabe (Mc), just basic Halstead (bH), just derived Halstead (dH), or all available attributes
Fig. 9. TP and FP of lf-NN predictors versus the number of rules, for training and testing: (a) TP and (b) FP.
(All). The experiments completed in this study are reported for 10-fold cross-validation using a 70%–30% split of the data into training and testing subsets. In this sense, the results pertain to the average and standard deviation reported over the 10 experiments. Fig. 8 shows the AUC of the pf-NN predictors versus the number of rules. When the results of training and testing are considered, the preferred number of rules is 10 to 15 in the case of lf-NN and 5 to 10 in the case of qf-NN. In addition, all available attributes or the derived Halstead attributes are preferred as the input set of the pf-NNs. The results of TP and FP
used to calculate AUC are shown in Figs. 9 and 10 for lf-NN and qf-NN, respectively.

Fig. 10. TP and FP of qf-NN predictors versus the number of rules, for training and testing: (a) TP and (b) FP.

Table 5
Results of comparative performance analysis.

As shown in these results, TP increases and FP decreases with the number of rules for training. Considering the testing results, 10 to 15 rules and 5 to 10 rules are preferred for the lf-NN and qf-NN, respectively. The values of AUC, TP and FP shown in the figures are the means over the 10 runs; the standard deviations are reported in Table 5.
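The AUC values reported here can be computed from the (FP, TP) pairs of the ROC curve by the trapezoidal rule. The following is a minimal sketch (not the authors' implementation), assuming both classes are present and ignoring score ties:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by the trapezoidal rule.

    scores: classifier outputs (higher = more likely defective);
    labels: 1 for defective modules, 0 otherwise.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_fp_rate = prev_tp_rate = 0.0
    auc = 0.0
    for _, y in pairs:  # sweep the decision threshold downward
        if y == 1:
            tp += 1
        else:
            fp += 1
        tp_rate, fp_rate = tp / n_pos, fp / n_neg
        # trapezoid between consecutive ROC points
        auc += (fp_rate - prev_fp_rate) * (tp_rate + prev_tp_rate) / 2.0
        prev_fp_rate, prev_tp_rate = fp_rate, tp_rate
    return auc
```

A perfect ranking of defects above non-defects gives an AUC of 1.0, while a reversed ranking gives 0.0, matching the interpretation of AUC used throughout this section.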
In order to compare the training results obtained with the standard cost function (19) and the weighted cost function (20), Fig. 11 shows the pairs of TP and FP in the ROC plane for the lf-NN with 10 to 15 rules and the qf-NN with 5 to 10 rules on the derived Halstead attributes. Here, we used c− = 498/449 = 1.1091 and c+ = 498/49 = 10.163 as the weight values in (20). As shown in the results, the points obtained with the standard cost function lie closer to the "no information" line than those obtained with the weighted cost function.
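The closeness to the "no information" line mentioned above can be quantified as the perpendicular distance of an ROC point (FP, TP) from the diagonal TP = FP; the helper below is ours, added purely for illustration:

```python
import math

def dist_to_no_information(fp_rate, tp_rate):
    """Perpendicular distance from an ROC point (FP, TP) to the
    'no information' diagonal TP = FP; larger means a better point."""
    return (tp_rate - fp_rate) / math.sqrt(2.0)
```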
Fig. 11. Points of pairs of TP and FP in the ROC plane: results of (a) lf-NN and (b) qf-NN on the derived Halstead attributes.

Table 6
Attribute information of DATATRIEVE dataset.

Fig. 12. AUC of pf-NN and RBF-NN: (a) training and (b) testing.

Table 5 shows the results of the proposed pf-NNs and those of other methods trained with the weighted cost function. The RBF NN (radial basis function neural network) is one of the most widely applied neural classifiers [34,35], and LSR is the linear standard regression
shown in [25]. The MLP (multi-layer perceptron) has been widely used to deal with pattern classification problems. For the comparison of the models, we consider their approximation and generalization abilities. In Table 5, lf-NN achieves 0.999 ± 0.001 and 0.644 ± 0.067, qf-NN achieves 0.995 ± 0.008 and 0.64 ± 0.081, and RBF NN achieves 0.692 ± 0.017 and 0.704 ± 0.041 for training (approximation) and testing (generalization), respectively. Neither lf-NN nor qf-NN outperforms RBF NN in terms of generalization. However, when both the approximation and generalization aspects are considered, the proposed models are preferable to the previous ones, since the gain in approximation capability exceeds the deterioration in generalization performance.
Fig. 13. TP and FP of pf-NN predictors versus the number of rules, for training and testing: (a) lf-NN and (b) qf-NN.
Table 7
Comparison of performance with other methods trained on the weighted cost function.
Table 8
Comparison of performance with other methods trained on the standard cost function.
Fig. 14. Points of pairs of TP and FP in the ROC plane for training and testing with the standard and weighted cost functions; the diagonal denotes the "no information" line.
In this section, a pf-NN classifier is applied to the analysis of measurement data of a real-life maintenance project, the DATATRIEVE™ project carried out at Digital Engineering Italy. The DATATRIEVE™ product was originally developed in the BLISS language. BLISS is an expression language; it is block-structured, with exception handling facilities, co-routines, and a macro system, and was one of the first non-assembly languages used for operating system implementation. Some parts were later added or rewritten in the C language; therefore, the overall structure of DATATRIEVE™ is composed of C functions and BLISS subroutines [28]. The DATATRIEVE dataset consists of 130 patterns with 8 condition attributes and 1 decision attribute, shown in Table 6. The class distribution is 119 patterns (91.54%) for class 0 and 11 patterns (8.46%) for class 1: the value is 0 for all patterns in which no faults were found and 1 for all other patterns. Accordingly, c− = 130/119 = 1.092 and c+ = 130/11 = 11.818 are used as the weight values in (20).
Fig. 12 illustrates the AUC results obtained by the lf-NN and qf-NN. The AUCs of the RBF NN trained with the weighted cost function are added for comparison with the pf-NNs. As shown in this figure, the lf-NN and qf-NN have larger values of AUC than the RBF NN for training, whereas a similar trend of AUC is observed for testing. The TP and FP of the pf-NNs are shown in Fig. 13. The bar denotes the standard deviation, and the center of a bar represents the average over the set of 10 experiments. Tables 7 and 8 compare the results learned with the weighted and standard cost functions, respectively. In the case of the standard cost function, the classification accuracy is higher but TP (probability of detection) is very low. The proposed lf-NN predictors show better results than the other methods. Fig. 14 illustrates the relations between TP and FP (probability of false alarm) in Tables 7 and 8. As shown in Fig. 14, the points learned with the weighted cost function lie far from the "no information" line. As with the results in Section 5.2, the pf-NN models are preferred as predictors for the detection of software defects, given the approximation and generalization aspects of the model.
6. Conclusions

In this paper, we proposed and designed classifiers in the form of polynomial function-based neural networks (pf-NNs) for the prediction of defects in software engineering. This type of classifier is expressed in the form of "if-then" rules. The premise layer of the rules is developed with the use of fuzzy clustering, while the consequence layer comes in the form of polynomials. The proposed pf-NNs come in two categories, namely linear function-based NNs (lf-NNs) and quadratic function-based NNs (qf-NNs), where the taxonomy of the networks depends on the type of polynomial in the conclusion part of the rule. The learning algorithm used in the development of the consequence layer of the rules takes advantage of linear discriminant analysis. In addition, the weighted cost function is used to handle imbalanced classes. The area under curve (AUC) in the receiver operating characteristics (ROC) analysis is used to evaluate the performance in the presence of imbalanced datasets.
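The inference scheme summarized above (FCM-based premises with polynomial consequences) can be sketched as follows. This is a minimal illustration of a linear (lf-NN) rule base under the standard FCM membership formula, with illustrative names rather than the authors' code:

```python
import numpy as np

def fcm_memberships(x, centers, m=2.0, eps=1e-12):
    """FCM-style membership of x in each cluster (premise layer).

    u_i(x) = 1 / sum_j (||x - v_i|| / ||x - v_j||)^(2/(m-1)),
    so the memberships sum to 1 over the rules.
    """
    d = np.linalg.norm(centers - x, axis=1) + eps  # eps guards x == v_i
    ratio = (d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

def lf_nn_output(x, centers, coeffs):
    """lf-NN-style inference: membership-weighted local linear models.

    coeffs[i] = (a0_i, a_i) defines rule i's consequence a0_i + a_i . x;
    the network output aggregates the local outputs by the memberships.
    """
    u = fcm_memberships(x, centers)
    local_outputs = np.array([a0 + a @ x for a0, a in coeffs])
    return float(u @ local_outputs)

centers = np.array([[0.0, 0.0], [2.0, 2.0]])
coeffs = [(0.0, np.array([1.0, 0.0])), (1.0, np.array([0.0, 1.0]))]
y = lf_nn_output(np.array([1.0, 1.0]), centers, coeffs)
```

A qf-NN variant would only change the local models from linear to quadratic polynomials; the premise (membership) computation stays the same.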
The pf-NNs were experimented with on synthetic data and selected software engineering datasets. In this suite of experiments, we investigated and quantified the impact of the number of rules on the values of the AUC obtained with the weighted cost function. In general, we achieved substantially higher values of TP (probability of detection) and lower values of FP (probability of false alarm) than those for the other methods and for the standard cost function. Classifying software modules into faulty and non-faulty ones, as dealt with in this paper, is one of the first pursuits of software quality assurance. Further activities dealing with the localization and elimination of software defects could engage other hybrid techniques of Computational Intelligence and machine learning.
Acknowledgements
This research was supported by the Converging Research Center Program funded by the Ministry of Education, Science
and Technology (No. 2011K000655) and supported by the GRRC program of Gyeonggi province [GRRC SUWON2012-B2,
Center for U-city Security & Surveillance Technology] and also National Research Foundation of Korea Grant funded by
the Korean Government (NRF-2012-003568).
References
[1] A. Aiyer, K. Pyun, Y.Z. Huang, D.B. O’Brien, R.M. Gray, Lloyd clustering of Gauss mixture models for image compression and classification, Signal
Processing: Image Communication 20 (2005) 459–485.
[2] F. Akiyama, An example of software system debugging, Information Processing 71 (1971) 353–379.
[3] Y. Al-Assaf, H. El Kadi, Fatigue life prediction of composite materials using polynomial classifiers and recurrent neural networks, Composite Structures
77 (2007) 561–569.
[4] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[5] S. Bibi, G. Tsoumakas, I. Stamelos, I. Vlahavas, Regression via classification applied on software defect estimation, Expert Systems with Applications 34
(2008) 2091–2101.
[6] L.C. Briand, W.L. Melo, J. Wust, Assessing the applicability of fault-proneness models across object-oriented software projects, IEEE Transactions on
Software Engineering 28 (2002) 706–720.
[7] C. Catal, B. Diri, Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem, Information
Sciences 179 (8) (2009) 1040–1058.
[8] N.V. Chawla, L.O. Hall, K.W. Bowyer, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
[9] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley-Interscience, 2000.
[10] J.P. Egan, Signal Detection Theory and ROC Analysis, Academic, New York, 1975.
[11] K.E. Emam, W. Melo, C.M. Javam, The prediction of faulty classes using object-oriented design metrics, Journal of Systems and Software 56 (1) (2001)
63–75.
[12] M.J. Er, S.Q. Wu, J.W. Lu, H.L. Toh, Face recognition with radial basis function (RBF) neural networks, IEEE Transactions on Neural Networks 13 (5) (2002) 697–710.
[13] N.E. Fenton, M. Neil, A critique of software defect prediction models, IEEE Transactions on Software Engineering 25 (5) (1999) 675–689.
[14] A. Fernández, M. José del Jesus, F. Herrera, On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced
data-sets, Information Sciences 180 (8) (2010) 1268–1291.
[15] T.L. Graves, A.F. Karr, J.S. Marron, H. Siy, Predicting fault incidence using software change history, IEEE Transactions on Software Engineering 26 (7)
(2000) 653–661.
[16] X. Hong, S. Chen, C.J. Harris, A kernel-based two-class classifier for imbalanced data sets, IEEE Transactions on Neural Networks 18 (1) (2007) 28–41.
[17] R.A. Hooshmand, M. Ataei, Real-coded genetic algorithm based design and analysis of an auto-tuning fuzzy logic PSS, Journal of Electrical Engineering
and Technology 2 (2) (2007) 178–187.
[18] A. James, M. Scotto, W. Pedrycz, B. Russo, M. Stefanovic, G. Succi, Identification of defect-prone classes in telecommunication software systems using
design metrics, Information Sciences 176 (2006) 3711–3734.
[19] X.-Y. Jing, Y.-F. Yao, D. Zhang, J.-Y. Yang, M. Li, Face and palmprint pixel level fusion and Kernel DCV-RBF classifier for small sample biometric
recognition, Pattern Recognition 40 (2007) 3209–3224.
[20] T.M. Khoshgoftaar, K. Gao, Feature selection with imbalanced data for software defect prediction, in: Proceeding of the 2009 International Conference
on Machine Learning and Applications, 2009, pp. 235–240.
[21] T.M. Khoshgoftaar, J.V. Hulse, A. Napolitano, Supervised neural network modeling: an empirical investigation into learning from imbalanced data with
labeling errors, IEEE Transactions on Neural Networks 21 (5) (2010) 813–830.
[22] F. Lanubile, G. Visaggio, Evaluating predictive quality models derived from software measures: lessons learned, Journal of Systems and Software 38 (3)
(1997) 225–234.
[23] J. Leskovec, J. Shawe-Taylor, Linear programming boost for uneven datasets, in: Proceedings of International Conference on Machine Learning (ICML),
2003, pp. 456–463.
[24] R.P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Magazine 4 (2) (1987) 4–22.
[25] T. Menzies, J.S. Di Stefano, How good is your blind spot sampling policy?, in: Proceedings of IEEE International Symposium on High Assurance Systems Engineering, 2004, pp. 129–138.
[26] T. Menzies, J.S. Di Stefano, M. Chapman, K. Mcgill, Metrics that Matter, in: Proceedings of 27th NASA SEL Workshop Software Engineering, 2002, pp.
51–57.
[27] T. Menzies, J. Greenwald, A. Frank, Data mining static code attributes to learn defect predictors, IEEE Transactions on Software Engineering 33 (1)
(2007) 2–13.
[28] S. Morasca, G. Ruhe, A hybrid approach to analyze empirical software engineering data and its application to predict module fault-proneness in
maintenance, The Journal of Systems and Software 53 (2000) 225–237.
B.-J. Park et al. / Information Sciences 229 (2013) 40–57 57
[29] J.E. Munoz-Exposito, S. Garcia-Galan, N. Ruiz-Reyes, P. Vera-Candeas, Adaptive network-based fuzzy inference system vs. other classification
algorithms for warped LPC-based speech/music discrimination, Engineering Applications of Artificial Intelligence 20 (2007) 783–793.
[30] S.-K. Oh, W. Pedrycz, B.-J. Park, Self-organizing neurofuzzy networks in modeling software data, Fuzzy Sets and Systems 145 (2004) 165–181.
[31] F. Provost, T. Fawcett, Robust classification for imprecise environments, Machine Learning 42 (2001) 203–231.
[32] T.-S. Quah, Estimating software readiness using predictive models, Information Sciences 179 (4) (2009) 430–445.
[33] T.-S. Quah, M.M.T. Thwin, Prediction of software development faults in PL/SQL files using neural network models, Information and Software
Technology 46 (2004) 519–523.
[34] F. Ros, M. Pintore, J.R. Chretien, Automatic design of growing radial basis function neural networks based on neighborhood concepts, Chemometrics
and Intelligent Laboratory Systems 87 (2007) 231–240.
[35] H. Sarimveis, P. Doganis, A. Alexandridis, A classification technique based on radial basis function neural networks, Advances in Engineering Software
37 (2006) 218–221.
[36] A. Staiano, R. Tagliaferri, W. Pedrycz, Improving RBF networks performance in regression tasks by means of a supervised fuzzy clustering,
Neurocomputing 69 (2006) 1570–1581.
[37] J.A. Swets, Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers, Lawrence Erlbaum Associates, Mahwah, NJ,
1996.
[38] C. Zhang, J. Jiang, M. Kamel, Intrusion detection using hierarchical neural networks, Pattern Recognition Letters 26 (2005) 779–791.