
2019 IEEE International Conference on Information Reuse and Integration for Data Science (IRI)

Software Quality Prediction: An Investigation based on Machine Learning


Sandeep Reddivari and Jayalakshmi Raman
School of Computing, University of North Florida, FL, USA
{sandeep.reddivari, n01382362}@unf.edu

Abstract

Irrespective of the type of software system being developed, producing and delivering high-quality software within the specified time and budget is crucial for many software businesses. The software process model has a major impact on the quality of the overall system: the longer a defect remains undetected in the system, the harder it becomes to fix. Predicting the quality of the software in the early phases would therefore immensely assist developers in software maintenance and quality assurance activities, and help them allocate effort and resources more efficiently. This paper presents an evaluation of eight machine learning techniques in the context of reliability and maintainability. Reliability is investigated as the number of defects in a system, and maintainability is analyzed as the number of changes made to the system. Software metrics are direct reflections of various characteristics of software and are used in our study as the major attributes for training the models for both defect and maintainability prediction. Among the eight different techniques we experimented with, Random Forest provided the best results, with an AUC of over 0.8 for both defect and maintenance prediction.

1. Introduction
Software systems have grown tremendously over the past few decades and have become an indispensable part of our daily lives. Virtually every business depends on software for its development, marketing, production and support [1]. Software systems are not mere conveniences; they implement the core features and functionality of several safety-critical systems. In the face of such widespread use, assurance of the quality of the software product is critical. Nowadays, software systems are large, complex systems, and repairing a defect-prone or faulty system is highly expensive. One of the major goals of software engineering is to produce high-quality software within a defined cost and schedule. The U.S. National Institute of Standards and Technology (NIST) estimated the average cost of such software failures and faults to be around $59 billion per year [1]. This shows how vital and cost effective it is to identify and rectify poor-quality software during the earlier stages. Explicit attention to software quality can save significant costs in the software development life cycle.
The quality of a software system can be reflected by several non-functional attributes such as its reliability, maintainability, usability, performance and efficiency. Even though there are several such factors, in practice the estimation of software quality mainly concerns its reliability and maintainability [4]. A good quality software should be reliable, with a low degree of errors or faults. Software reliability can be defined as the measure of probability or confidence in the software's ability to be operational in its specified environment [2]. The reliability of a software system is typically measured by the number of defects within the software. Software maintainability, on the other hand, is the ease with which a software system can be maintained. Software maintenance is defined in IEEE Standard 1219 [IEEE93] as "The modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a modified environment" [4,6]. The cost of maintenance is stated to be 40% more than the cost of development of a system [4]. Unlike reliability, which can be measured by investigating whether a system is defective or not, the maintainability of a system cannot be measured by a binary property. Maintainability is rather measured as the amount of effort required to perform a maintenance task, in terms of the number of changes made to the code or the amount of development effort contributed [3]. In this research, we consider both maintainability and reliability for the assessment of software quality. The quality and performance of any software system can be determined by its metrics [5]. Software metrics provide sufficient data on software characteristics, thereby supporting quantitative managerial decision making during the software life cycle [9]. The information and clarity provided by these metrics are invaluable for detecting defects earlier and for managing and estimating the costs of the system.

2. Background and Related Work
Software metrics have been studied for a long time, and several metrics for the procedural paradigm have been proposed. Software metrics are measures of software characteristics and help in estimating the cost and impact of several software life cycle activities. Traditional metrics such as McCabe's cyclomatic complexity, the information flow metric, the statement interconnection metric, size and comment percentage are used to measure software complexity and the size of structured systems [3,14]. In the early 1990s, metrics were studied and investigated for object-oriented (OO) design. Since then, several attempts have been made with both procedural and OO metrics to rapidly advance the research in software metrics. Li and Henry [3] showed how various OO metrics can be used to measure the amount of change in the code. The amount of change made in the code is a direct reflection of the maintenance effort, and it can be predicted with software metrics. Other researchers [6, 19] in this area used further complexity measures to demonstrate the prediction of maintainability. This forms the basis of our work to predict the maintainability of a system from its metrics. Another complementary study by Basili et al. (1996), conducted at the University of Maryland, investigated the capability of OO metrics as predictors of fault-prone classes. The main finding from this study is that the Chidamber and Kemerer (CK) metrics predicted defects efficiently in the early stages of the development life cycle when compared to other traditional code metrics. Another case study by Yu and Systä (2002), on the client side of a large industrial network service management system, also showed the usefulness of OO metrics in predicting the fault proneness of the system. There is further empirical evidence [15, 16, 17] showing that software metrics are effective predictors of fault-prone systems. Machine learning has received a great deal of attention in recent years, with its applications extending to various fields that have a considerable amount of data available for analysis. Software engineering is one of these actively researched fields, where many software engineering (SE) activities can be formulated as learning problems and approached with machine learners [25]. Several activities of software development, ranging from requirements analysis to software testing, have been actively researched with machine learners. Software maintenance is another active area of research in SE, where researchers have proposed machine learning techniques for detecting code smells and predicting errors in programs. According to the literature, there exists a direct relationship between the metrics and the quality of a system, and several predictive models have been proposed for the classification of systems into two categories: fault prone and non-fault prone.

Zimmermann et al. (2007) described one of the initial works on the usage of classifiers for predicting defects [20]. The Eclipse bug dataset was used for predicting faulty and non-faulty classes in [20]. Additionally, they annotated the data with common complexity metrics, and the prediction results were concluded to be far from perfect [20]. Elish et al. (2008) evaluated the capability of support vector machines (SVM) in predicting defect-prone software modules and compared the performance of SVM against eight machine learning techniques in the context of NASA datasets [21]. Another study, by Okutan and Yldz [22], investigated the capability of Bayesian networks, one of the simplest effective classifiers, to predict software quality. This study identified the most effective metrics for quality prediction and performed an inspection of them on several open source systems [22]. In addition to SVM and Bayesian networks, researchers have used ensemble learning and other alternative machine learning classifiers for identifying defects [23, 24]. However, it is noteworthy that the majority of the literature equates defect identification with quality prediction. Although defects reflect the reliability of the system, maintainability is another important factor to be considered while assessing the quality of a software system. To the best of our knowledge there are very few studies that considered both reliability and maintainability aspects to predict software quality. For example, Quah and Thwin presented the application of neural networks in estimating the quality of a software system using OO metrics [26]. That paper described software quality as a measure of its reliability and maintainability, where reliability is measured in terms of defects and maintainability in terms of the changes made to the code [26].


3. Software Metrics
A software metric is a measure of various characteristics of a software system. Metrics can be used to determine the quality and performance of the software. In this paper, our research objective is to analyse how different machine learning techniques perform in predicting software quality [12,13,14,18]. Several papers have been published on how software metrics can be used to estimate the maintainability and the reliability of a system [3,6,7,8]. An empirical evaluation of several major OO metrics was performed by Gyimothy et al. (2005) to investigate their effectiveness in fault detection. This study showed the coupling between objects (CBO) measure to be the best metric for predicting fault proneness, followed by lines of code (LOC) and lack of cohesion of methods (LCOM) [16]. Another empirical investigation, conducted by Yu et al. (2002), presented the appropriate metrics with an industrial case study. This study also described the significance of CBO, LOC and LCOM in fault detection. Further, the study identified number of children (NOC) as a significant factor in determining the defects [15]. In addition to the OO metrics, Quah et al. introduced additional metrics such as inheritance coupling (IC), weighted methods per class (WMC) and coupling between methods (CBM) [26]. Li and Henry identified metrics such as depth of inheritance tree (DIT), weighted method complexity (WMC), number of methods (NOM), response for class (RFC), message passing coupling (MPC), data abstraction coupling (DAC) and a few others to effectively estimate the maintenance effort [3]. The details of the various metrics discussed thus far are listed below:

Weighted Methods Per Class (WMC) – WMC is an OO metric that measures the complexity of a class. It is the sum of the complexities of all local methods of the class. The complexity of a method is proportional to the number of control flows it has; the greater the value of WMC, the harder it is to maintain the class [11, 26].

Lines of Code (LOC) – LOC measures the size of the program as the number of non-commented lines of source code.

Coupling Between Objects (CBO) – CBO is the count of classes to which a given class is coupled; a class is coupled to another if methods of one class use methods or instance variables defined by the other class [11]. Excessive coupling reduces the chances of class reuse and makes the system more complicated. The theory behind this metric is that the higher the number of couplings in a system, the more sensitive it is to changes, thus making the task of maintenance difficult [11].

Depth of Inheritance Tree (DIT) – DIT measures the depth of inheritance of a class. Considering the classes to form a directed acyclic graph (DAG), DIT is the longest path from the node to the root [11]. The complexity of a class is determined by the number of ancestors it has inherited from, as more ancestors means more inherited methods, making behaviour prediction more complex [11].

Number of Children (NOC) – NOC is the number of direct subclasses of each class. A larger number of children makes it challenging to modify the parent class and increases the likelihood of improper abstraction of the parent class [11]. The number of reuse instances of a class has a direct impact on the magnitude of ripple effects and might require more testing [26].

Lack of Cohesion in Methods (LCOM) – LCOM is the difference between the number of pairs of methods without any shared instance variables and the number of pairs of methods with a shared instance variable. In case none of the methods share an instance variable and the difference is negative, LCOM is set to zero [11, 26]. A low rate of cohesion indicates complexity, as it illustrates encapsulation of unrelated methods, thereby increasing the likelihood of errors [10].

Response for Class (RFC) – RFC is the set of methods of a class that can be executed in response to a message from an object of the class [11]. The larger the number of methods invoked in response to a message from an object of the class, the larger the complexity.

Inheritance Coupling (IC) – Inheritance coupling is the number of parent class dependencies a given class has. That is, a given class is coupled to its parent class if one of its methods is functionally dependent on the parent class's methods because of inheritance [26].

Coupling Between Methods (CBM) – CBM is the total number of methods to which the inherited methods are coupled. It gives the functional dependencies between the inherited methods and the new/redefined methods [26].

Message Passing Coupling (MPC) – MPC measures the number of messages an object of a class sends to other objects [3]. MPC is an indication of how dependent a local method is on methods of other classes [26].

Data Abstraction Coupling (DAC) – This metric measures the complexity caused by abstract data types (ADTs) [3]. This coupling may cause a violation of encapsulation if direct access to private attributes of the ADT is granted [3]. DAC is the number of ADTs defined in the class.

Number of Local Methods (NOM) – NOM is an interface metric; it measures the number of local methods within a class. The complexity of a class's interface depends on the number of methods the class contains [3].

SIZE1 – the number of executable statements in a class (calculated as the number of semicolons).

SIZE2 – the number of properties in a class, calculated as the sum of the number of attributes and the number of local methods in the class.
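To make two of the most frequently used metrics above concrete, here is a minimal Python sketch (not the authors' tooling) that computes WMC and LCOM from a hypothetical summary of a parsed class; the input format and values are assumed purely for illustration.

from itertools import combinations

def wmc(method_complexities):
    # WMC: sum of the complexities of all local methods of a class.
    return sum(method_complexities.values())

def lcom(method_fields):
    # LCOM: (# method pairs sharing no instance variable) minus
    # (# pairs sharing at least one); a negative result is set to zero.
    pairs = list(combinations(method_fields, 2))
    disjoint = sum(1 for a, b in pairs if not (method_fields[a] & method_fields[b]))
    shared = len(pairs) - disjoint
    return max(disjoint - shared, 0)

# Hypothetical class summary: cyclomatic complexity per method and the
# instance variables each method touches.
complexities = {"open": 2, "read": 3, "close": 1}
fields = {"open": {"path", "handle"}, "read": {"handle", "buffer"}, "close": {"handle"}}
print(wmc(complexities))   # 6
print(lcom(fields))        # 0 (every pair shares "handle", so 0 - 3 is clamped to 0)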


4. Machine Learning for Predicting Software Quality

4.1 Decision Trees
A decision tree is a tree constructed in a top-down recursive order, in which each node represents a possible decision and the edges indicate the possible paths from one node to another. Classifying an instance effectively amounts to following the path from the root of the tree to one of its leaves. The attributes used for decision making are selectively picked such that the information gain from the attribute is high. In our study, we use C4.5, a classification algorithm that produces a decision tree based on information theory; it is implemented as J48 in Weka [37]. One of the other methods used in our experiment is Random Forest (RF). RF is an ensemble learning technique in which a forest of multiple decision trees is constructed [27]. RF is known to be effective for predictions, and its performance depends on the strength of the individual predictors [27].
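The sketch below is a loose, hedged analogue of this setup using scikit-learn rather than Weka: an entropy-based decision tree stands in for J48/C4.5 and a random forest for Weka's Random Forest; the metric values and labels are fabricated for illustration only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Columns (assumed order): WMC, DIT, NOC, CBO, RFC, LCOM
X = np.array([[12, 2, 0, 5, 20, 3],
              [45, 4, 3, 18, 60, 40],
              [ 7, 1, 0, 2,  9,  0],
              [30, 3, 1, 11, 35, 22]])
y = np.array([0, 1, 0, 1])          # 1 = defective, 0 = not defective

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)   # information-gain splits, like C4.5
forest = RandomForestClassifier(n_estimators=150, random_state=0).fit(X, y)

print(tree.predict([[25, 3, 2, 9, 28, 15]]))        # predicted class for a new class's metrics
print(forest.predict_proba([[25, 3, 2, 9, 28, 15]]))  # class probabilities from the ensemble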
4.2 Bayesian Classification
A Bayesian classifier is a probabilistic classifier that labels an instance with a class based on its probability. A simple Bayesian classifier is the naïve Bayes classifier, based on Bayes' theorem [33,34]. According to this theorem, the classifier makes a hypothesis that the instance belongs to a class, and the training set increases or decreases the probability that the hypothesis is correct. In our study, we used Weka's implementation of the naïve Bayes classifier [32, 37]. A Bayesian belief network forms a directed acyclic graph indicating the conditional dependences between the various attributes; the absence of a connection between two nodes implies conditional independence between them. Each node takes multiple values as inputs, depending on its parent variables, and determines the probability of the variable's occurrence.
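As a rough illustration of the idea (again with scikit-learn standing in for Weka's naïve Bayes), the sketch below applies Bayes' theorem, P(class | x) proportional to P(x | class) * P(class), under the naive independence assumption; the metric values are made up.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Fabricated metric values (assumed columns: WMC, DIT, LCOM) and defect labels.
X = np.array([[12, 2, 3], [45, 4, 40], [7, 1, 0], [30, 3, 22]])
y = np.array([0, 1, 0, 1])

nb = GaussianNB().fit(X, y)           # estimates P(x | class) per attribute, plus class priors
print(nb.predict_proba([[20, 2, 10]]))  # posterior probability for each class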

4.3. Rule-based Classification
Rule-based classifiers use a set of IF-THEN rules to classify instances. They iteratively generate a set of rules from the training set until all the data in the training set is covered by some rule and there is no more data left to cover. The IF part forms the rule condition and the THEN part the consequence. Many rule-based classifiers extract their rules from decision trees. In this study we have used the PART rule-based classifier available in Weka [37]. This algorithm produces an ordered set of rules called a decision list. During classification, a data instance is assigned to the class of the first matching rule from the decision list [29]. PART extracts its rules from the C4.5 decision tree mentioned in Section 4.1.

4.4 Nearest Neighbours
Nearest neighbours is a class of deterministic learners that learn by analogy. That is, the learner classifies a given test instance by comparing it with the set of training instances most similar to it, and assigns the class of its nearest neighbours. The nearest neighbours can be identified by computing various distance measures such as Euclidean distance, Manhattan distance and cosine similarity (in the case of nominal attributes) [38]. It is also called a lazy learner, as most of the work is done during the testing phase: the classifier simply stores the training data and does not perform any computation during the training phase. In the case of multiple close neighbours, the algorithm employs majority voting or weighted majority voting of the neighbours. We use the k-nearest neighbour model in our study. The 'k' value specifies the number of neighbours the classifier takes into consideration during class labelling [38].
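A hedged scikit-learn stand-in for Weka's IBk nearest-neighbour classifier is sketched below: Euclidean distance with majority voting among the k closest training instances; the data and the k value are illustrative only.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[12, 2, 3], [45, 4, 40], [7, 1, 0], [30, 3, 22], [9, 1, 2]])
y = np.array([0, 1, 0, 1, 0])

# Lazy learner: fit() only stores the training data; the distance computations
# happen at prediction time, when the k nearest neighbours vote on the class.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)
print(knn.predict([[28, 3, 18]]))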
4.5 Support Vector Machine (SVM)
A support vector machine employs a non-linear mapping of the input vectors, or training set, into a high-dimensional space in order to classify instances. Each instance in the training set is marked as belonging to one of two classes [35]. SVM attempts to find an optimal decision boundary, or hyperplane, in the mapped dimension. It finds this hyperplane essentially from the support vectors, which are training instances, and the margin defined by these support vectors. That is, SVM attempts to place a linear boundary between the two classes (SVM is a binary classifier) in such a way that the margin between the classes is maximized. In this study we use the Sequential Minimal Optimization (SMO) algorithm available in Weka for experimentation [37].
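The sketch below loosely mirrors that setup with scikit-learn rather than Weka's SMO: the inputs are normalised and a degree-2 polynomial kernel is used, but this is an assumed analogue, not the authors' exact configuration.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[12, 2, 3], [45, 4, 40], [7, 1, 0], [30, 3, 22]])  # fabricated metrics
y = np.array([0, 1, 0, 1])

# Normalise, then fit a maximum-margin classifier with a degree-2 polynomial kernel.
svm = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=2))
svm.fit(X, y)
print(svm.predict([[28, 3, 18]]))
print(len(svm.named_steps["svc"].support_))  # number of support vectors found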


4.6. Artificial Neural Networks (ANN)
An artificial neural network is a learning algorithm inspired by the neural networks of the human brain. An ANN comprises a set of connected nodes, or artificial neurons, that form the input, output and various hidden layers [28]. The nodes and connections carry weights that are used in the computation of the output. During the learning phase the ANN may not produce the desired output, but it learns from the output via backpropagation and gradually adjusts these weights with the goal of minimizing its error rate and making correct predictions about the class [28]. Depending on the complexity of the problem at hand, one or more hidden layers can be added to increase the accuracy of the ANN. This is also referred to as connectionist learning; it requires a lot of training time but is known to be very successful in classification.
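A minimal scikit-learn multilayer perceptron, as a loose analogue of Weka's MultilayerPerceptron, is sketched below; the single hidden layer size and the data are assumptions made only for illustration.

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[12, 2, 3], [45, 4, 40], [7, 1, 0], [30, 3, 22]], dtype=float)
y = np.array([0, 1, 0, 1])

# One hidden layer of 8 neurons; weights are adjusted by backpropagation
# to reduce the training error over successive iterations.
ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
ann.fit(X, y)
print(ann.predict([[28.0, 3.0, 18.0]]))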
5. Experiments and Results

5.1. Defect Prediction
For defect prediction, we used the six machine learning techniques described in Section 4 and the OO metrics presented in Section 3.

5.1.1. Data Collection
The datasets used for defect prediction were obtained from the PROMISE data repository [30]. Our datasets include open source projects such as Ant, Tomcat, Velocity, Jedit, Ivy, Poi, Forrest and Workflow. Although several versions of these projects were available, we selected the latest version available. More details about the datasets are provided in Table 1. The datasets obtained from PROMISE had several additional OO metrics, which were removed from the dataset. Only the specified nine metrics (cf. Section 3) were retained and used to train the classifiers, as these nine metrics have been found to be the most effective for defect prediction after a careful review of past literature. A total of 3189 instances were available from all the projects, from which 1236 defects were identified.

TABLE 1: DATASETS USED FOR DEFECT ANALYSIS

Project    Version   No. of instances   % of defects (approx.)
Ant        1.7       745                22
Tomcat     6.0       858                 9
Velocity   1.6       229                34
Ivy        2.0       352                11
Jedit      4.3       492                 2
Poi        3.0       442                63
Forrest    0.8        32                18
Workflow   1          39                51

We manually examined the dataset to identify any missing or unknown values. Since the goal is to evaluate a classifier's ability to identify whether a given instance is defective or not, we labelled all instances as defective or not defective and used that label as the classifying class. Not all of the datasets used were balanced in terms of the number of defect instances. For example, Poi had 63% defective instances while Jedit had only 2%. This was chosen specifically to train the classifiers with variedly balanced data. Owing to such imbalance in the number of defects, the entire dataset was considered during the training phase and evaluated with the tenfold cross validation technique. We also removed the name of the project from the dataset, as it could otherwise become a factor for the classifiers in determining the class label, since some of the projects had very few defects. The fact that the proportions of defective and non-defective classes are not equal was taken into account during the evaluation of the classifiers.
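The labelling step described above can be illustrated with the short sketch below: a per-class defect count is turned into a binary defective label and the per-project imbalance is inspected. The in-memory records merely stand in for the PROMISE files, and the field names are assumed.

# Hedged sketch of the labelling and imbalance check; not the authors' scripts.
records = [
    {"project": "Ant",   "wmc": 12, "defects": 0},
    {"project": "Ant",   "wmc": 45, "defects": 3},
    {"project": "Jedit", "wmc": 7,  "defects": 0},
    {"project": "Poi",   "wmc": 30, "defects": 1},
]

for r in records:
    r["defective"] = 1 if r["defects"] > 0 else 0   # class attribute used for training

# Per-project defect ratio (the paper reports e.g. ~63% for Poi, ~2% for Jedit).
for p in sorted({r["project"] for r in records}):
    rows = [r for r in records if r["project"] == p]
    ratio = sum(r["defective"] for r in rows) / len(rows)
    print(p, f"{ratio:.0%} defective")

# The project name itself is dropped before training, as described above.
features = [[r["wmc"]] for r in records]
labels = [r["defective"] for r in records]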
5.1.2. Experimentation
For experimenting with decision trees, we chose J48, which is Weka's implementation of C4.5, and the Random Forest (RF) ensemble technique, which is also available in Weka [37]. We did not use any additional filters for these algorithms, as we had manually identified the defective instances based on the number of defects recorded for each instance. The J48 classifier therefore has nominal classes to consider and does not require any additional filters. The pruned tree constructed by J48 showed that lack of cohesion in methods (LCOM) was the root, implying that it was one of the major deciding attributes, followed by RFC, DIT and CBM. In the case of the RF technique, we specified 150 iterations for the model construction. The construction of the trees in RF also showed LCOM and WMC to be the primary contributors in decision making. For Bayesian classification, we tested both naïve Bayes and the Bayesian belief network. Naïve Bayes is one of the simplest models; it makes the naïve assumption that all attributes of the training set are independent of each other, which might not be true in many cases, yet it performs well. For the Bayesian network we used the Simple Estimator (in Weka) [37] to construct the network. Fig. 1 shows the Bayesian network we obtained for the defect dataset; it can be seen that the metrics WMC, DIT and NOC are the primary estimators, followed by LCOM, RFC, IC and CBO.

Fig. 1: Bayesian network formed from software metrics for defect prediction

For testing the rule-based classifiers we used the PART classifier from Weka [37]. We used IBk, which is an implementation of nearest neighbours in Weka, for testing the nearest neighbour class of classifiers [37]. If the 'k' value is too small, the decision is susceptible to noise; if the 'k' value is chosen to be too large, then a larger area of the instance space must be covered, which leads to wrong classifications. For the parameter 'k', the general rule of thumb is that the ideal value of 'k' is equal to the square root of the number of instances. Having a total of about 3189 instances, we started experimenting with 58, 57, 59 and 60 as values of 'k'. The SMO algorithm was used for training the SVM [37]. The training data was normalized to speed up the time taken to build the model. Since we were not sure whether the dataset was linearly separable, we set the kernel exponent to two in the kernel option. This was to ensure that Weka uses support vectors for classification. In addition, we used a logistic regression model as the calibrator. The SMO model built shows that about 1523 support vectors were used to train the SVM. The ANN is implemented as a multilayer perceptron in Weka [31, 37], as shown in Figure 2.

Fig. 2: The graphical user interface (GUI) of the neural network formed from software metrics for defect prediction
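A hedged sketch of the overall evaluation loop is given below: tenfold cross validation with AUC scoring over scikit-learn stand-ins for some of the classifiers of Section 4. The randomly generated data only demonstrates the plumbing; it is not the PROMISE dataset, and the numbers it prints do not reproduce the tables in Section 6.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))            # 9 metric features, as in the defect study
y = (rng.random(200) < 0.3).astype(int)  # ~30% "defective" instances

models = {
    "J48-like tree": DecisionTreeClassifier(criterion="entropy"),
    "Random Forest": RandomForestClassifier(n_estimators=150, random_state=0),
    "Naive Bayes": GaussianNB(),
    "KNN (k=57)": KNeighborsClassifier(n_neighbors=57),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc")  # tenfold CV, AUC per fold
    print(f"{name}: mean AUC = {auc.mean():.3f}")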
5.2. Prediction of Maintainability
As with defect prediction, for predicting the required maintenance effort we experimented with all the machine learning techniques described in Section 4. The metrics used here are DIT, NOC, MPC, RFC, LCOM, DAC, WMC, NOM, SIZE1 and SIZE2 (cf. Section 3). The details of the data collection process and the experimental set-up for maintenance prediction are described below.

5.2.1. Data Collection
The dataset for maintenance prediction was obtained from two commercial systems – the User Interface Management System (UIMS) and the QUality Evaluation System (QUES) – which are presented in [3]. The maintenance effort here is represented as the number of line changes made per class.


A change can be an addition or a deletion; a change of content is counted as a deletion followed by an addition [3]. Both systems were developed in an object-oriented language, Classic-Ada. Table 2 provides more details of the datasets.

TABLE 2: DATASETS USED FOR MAINTENANCE ANALYSIS

Project   No. of instances   Total no. of changes   Change percent (w.r.t. total executables)
UIMS      39                 1826                   43%
QUES      71                 4560                   23%

The dataset obtained was manually evaluated to identify any missing values. Each instance of the dataset represents a class of the system. There was a total of 110 instances from the two systems. Table 2 shows the percentage of changes made with respect to the number of executables of the system. Note that a change of content is counted as both an addition and a deletion. Based on the percentage of change made to each class, we further defined the class attribute maintenance difficulty with three possible values: heavy, moderate and light. Classes with a change percentage of 20% or less were marked as light, those between 20% and 50% as moderate, and those above 50% as heavy. Both datasets were grouped together and input as a single dataset during the training phase of all the experiments. Our goal is to identify whether the classification techniques can accurately identify the severity of the maintenance changes.
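The bucketing rule just described can be written down directly; the sketch below follows the thresholds in the text (the handling of values exactly at 20% and 50% is our assumption), with made-up sample values.

def maintenance_difficulty(change_percent):
    # Map a class's change percentage to the maintenance-difficulty label.
    if change_percent <= 20:
        return "light"
    elif change_percent <= 50:
        return "moderate"
    return "heavy"

for pct in (5, 20, 35, 50, 72):
    print(pct, "->", maintenance_difficulty(pct))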
5.2.2. Experimentation
We used the Weka implementations to test the classifiers for maintainability label prediction. We started with the decision trees – J48 and Random Forest (RF). The pruned tree formed by J48 shows NOC to be the major factor for maintenance predictability, followed by LCOM and DAC. The tree formed by RF also has LCOM as its major contributor, followed by WMC and DAC. Next, we experimented with naïve Bayes and the Bayesian network. As mentioned in Section 5.1.2, we used the Simple Estimator (in Weka) [37] for the construction of the Bayesian network. The Bayesian network thus formed has DIT, WMC, MPC, SIZE1, DAC and NOC as its major deciding factors. Most of the classifiers have DAC as their major contributor, showing how the usage of abstract types increases the complexity of the system by increasing the coupling between classes. This is followed by RFC and LCOM. PART formed about 13 rules for classification. The rule classifying the largest share of the instances involved DIT and LCOM; about 25 instances were classified with this rule. The next major rule includes RFC, MPC, DAC and WMC. We noticed that the majority of the classifiers used LCOM as their major deciding factor, followed by DAC, MPC, WMC and RFC. This shows how coupling complexities directly correlate with the overall complexity of the system. For experimentation with nearest neighbours we used the IBk algorithm as mentioned in Section 5.1.2 and iteratively experimented with different values of 'k' in order to identify the ideal 'k' value. Since we used the binary SMO and the dataset contains three labels, the SVM was trained with the typical one-versus-all strategy [35]. The ANN is implemented as a multilayer perceptron in Weka [37], as shown in Figure 3.

Fig. 3: The graphical user interface (GUI) of the neural network formed from software metrics for maintainability prediction


6. Results and Evaluation
We used tenfold cross validation for both the defect and the maintenance prediction experiments. Cross validation is an evaluation technique in which n-1 of the folds are used for training the classifier and the remaining fold is retained to test it. For assessing the quality of a classifier, we recommend using the precision, recall and AUC values [36]. When classification is applied to the datasets, the outcome is a classification table such as the following:

                                      Observed
                                      True                    False
  Classifier       Positive           True Positive (TP)      False Positive (FP)
  predictions      Negative           False Negative (FN)     True Negative (TN)

True Positives (TP) – the positive instances that were correctly classified.
True Negatives (TN) – the negative instances that were correctly classified.
False Positives (FP) – the negative instances that were incorrectly classified as positive.
False Negatives (FN) – the positive instances that were incorrectly classified as negative.
Precision – the proportion of correctly classified instances among all instances classified as positive:
    Precision = TP / (TP + FP)
Recall – a measure of completeness, i.e. the proportion of the actual positives in the data set that were captured by the classifier:
    Recall = TP / (TP + FN)
Accuracy – the proportion of correct classifications:
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
Area under the curve (AUC) – the area under the ROC curve, a two-dimensional curve with the true positive rate on the y-axis and the false positive rate on the x-axis [36].
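For completeness, the sketch below computes these measures from a set of predictions using scikit-learn's metric helpers; the labels and scores are fabricated and do not reproduce the paper's tables.

from sklearn.metrics import precision_score, recall_score, accuracy_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                   # hard class labels
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # scores used for the ROC curve

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("accuracy: ", accuracy_score(y_true, y_pred))   # (TP + TN) / all instances
print("AUC:      ", roc_auc_score(y_true, y_score))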
Table 3 provides the evaluation results of the defect prediction. We noticed that the precision and recall for non-defective instances were fairly accurate, with most classifiers predicting them with 90% accuracy. However, most of the classifiers were below par in the classification of defective instances. This justifies our use of AUC rather than accuracy as the evaluation measure. Particularly in the case of SVM, the TP count, precision and recall for defective instances were zero. This was very surprising, as SVM is known to effectively distinguish the border between two classes. SVM constructed about 1500 support vectors but still performed very poorly on the defective classes.

TABLE 3: EVALUATION RESULTS FOR DEFECT PREDICTION

Technique          Precision   Recall   AUC
J48                0.755       0.626    0.720
Random Forest      0.740       0.654    0.801
Naïve Bayes        0.616       0.558    0.711
Bayesian Network   0.777       0.696    0.849
PART               0.776       0.619    0.760
KNN                0.807       0.621    0.790
SVM                0.394       0.499    0.499
ANN                0.729       0.632    0.740

The AUC shows SVM to be the least effective model for defect prediction. Naïve Bayes performed moderately better than SVM. ANN performed moderately well, with an AUC value of 0.74. Random Forest also had a high AUC value of 0.8. The Bayesian network had the overall best results of all the experimented classifiers, with a very high AUC value of 0.84. It can be seen from the Bayesian network formed in Fig. 1 that DIT, WMC and NOC are primary factors contributing to the defects of a software system. In addition to the Bayesian network, Random Forest and KNN are also well suited for defect prediction in software systems.

In the case of maintenance prediction, almost all the classifiers had comparable performance in terms of accuracy. The accuracy of all classifiers ranged from 60 to 68%. The techniques J48, ANN and PART had the highest accuracy of 68%. The accuracy, mean absolute error and root mean squared error of all the classifiers are provided in Table 4. Most of the classifiers had a low precision rate in predicting high-maintenance instances. Both naïve Bayes and the Bayesian network had the lowest accuracy.

TABLE 4: ACCURACY AND ERROR RATE

Technique          Accuracy   Mean absolute error   Root mean squared error
J48                68%        0.248                 0.427
Random Forest      67%        0.262                 0.374
Naïve Bayes        61%        0.342                 0.255
Bayesian Network   62%        0.342                 0.277
PART               68%        0.239                 0.426
KNN                66%        0.258                 0.400
SVM                66%        0.317                 0.411
ANN                68%        0.281                 0.409

TABLE 5: EVALUATION RESULTS FOR MAINTENANCE PREDICTION

Technique          Precision   Recall   AUC
J48                0.649       0.590    0.702
Random Forest      0.537       0.527    0.814
Naïve Bayes        0.604       0.522    0.633
Bayesian Network   0.693       0.492    0.626
PART               0.586       0.573    0.742
KNN                0.774       0.520    0.755
SVM                0.719       0.557    0.681
ANN                0.464       0.515    0.645

Both naïve Bayes and the Bayesian network have relatively low AUC values and low precision and recall. Following these two, ANN has an AUC of 0.64. These form the least effective classifiers for maintenance prediction in our experiment. The performance of PART is relatively moderate. KNN has the highest precision, with an AUC of 0.75. Random Forest (RF) has the highest AUC, a value of 0.8. RF manages to identify just one instance as correctly belonging to the high-maintenance class. However, considering the uneven distribution of classes, AUC is the best means of determining the accuracy of the classifiers. A high AUC with low precision and recall indicates that the classifier can be tuned to improve its performance. Considering the imbalance of classes, we have treated AUC as the major factor of evaluation. Random Forest performs very well compared to the other classifiers during both defect classification and maintainability classification.


7. Conclusion
We have experimented with eight different popular classifiers using the metrics obtained from open source projects from the PROMISE data repository as well as from the UIMS and QUES systems. We observed that specific metrics such as WMC, CBM, DIT, MPC and CBO are major contributors in determining the quality. We have primarily considered AUC as the evaluation parameter as it accounts for the variability in the class distribution. Our evaluation results show that Random Forest is a reliable classifier for both defect and maintainability prediction, with a good AUC value. However, there were other classifiers with overall good performance, such as PART, J48 and KNN. Overall, decision tree-based prediction techniques (i.e., Random Forest) seemed to perform well in quality prediction. The Bayesian network had very good results in defect prediction, but had very low accuracy and TP rate for the high-maintenance class prediction. Our work sheds light on predicting software quality using machine learning techniques.
References

[1] The economic impacts of inadequate infrastructure for software testing. http://www.nist.gov/director/planning/upload/report02-3.pdf.
[2] J. D. Musa, "A theory of software reliability and its application," IEEE Trans. Software Eng., vol. SE-1, pp. 312-327, 1971.
[3] W. Li and S. Henry, "Object-oriented metrics that predict maintainability," Journal of Systems and Software, vol. 23, no. 2, pp. 111-122, 1993.
[4] F. P. Brooks, The Mythical Man-Month: Essays on Software Engineering, Addison-Wesley, Reading, Mass., 1998.
[5] R. B. Grady, "Successfully Applying Software Metrics," Computer, vol. 27, no. 9, pp. 18-25, September 1994.
[6] D. M. Coleman, D. Ash, B. Lowther, and P. W. Oman, "Using metrics to evaluate software system maintainability," IEEE Computer, vol. 27, no. 8, pp. 44-49, August 1994.
[7] T. M. Khoshgoftaar and J. C. Munson, "Predicting Software Development Errors Using Complexity Metrics," IEEE J. Selected Areas in Comm., vol. 8, no. 2, pp. 253-261, 1990.
[8] L. Rosenberg, T. Hammer, and J. Shaw, "Software Metrics and Reliability," in 9th International Symposium on Software Reliability Engineering, 1998.
[9] G. Stark, R. C. Durst, and C. W. Vowell, "Using Metrics in Management Decision Making," IEEE Computer, vol. 27, no. 9, pp. 42-48, September 1994.
[10] V. Basili, L. Briand, and W. L. Melo, "A Validation of Object-Oriented Design Metrics as Quality Indicators," IEEE Trans. Software Eng., 1996.
[11] S. R. Chidamber and C. F. Kemerer, "A Metrics Suite for Object Oriented Design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, 1994.
[12] M. Lorenz and J. Kidd, Object-Oriented Software Metrics, Prentice-Hall Object-Oriented Series, Englewood Cliffs, NJ, 1994.
[13] F. Brito e Abreu and W. Melo, "Evaluating the Impact of Object-Oriented Design on Software Quality," in Proceedings of the Third International Software Metrics Symposium, pp. 90-99, 1996.
[14] S. Jamali, Object Oriented Metrics: A Survey Approach, Tehran, Iran: Sharif University of Technology, January 2006.
[15] P. Yu, T. Systä, and H. Müller, "Predicting Fault-Proneness Using OO Metrics: An Industrial Case Study," in Proc. Sixth European Conf. Software Maintenance and Reengineering (CSMR 2002), pp. 99-107, Mar. 2002.
[16] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, Oct. 2005.
[17] Y. Zhou and H. Leung, "Empirical analysis of object-oriented design metrics for predicting high and low severity faults," IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 771-789, Oct. 2006.
[18] R. Shatnawi and W. Li, "The effectiveness of software metrics in identifying error-prone classes in post-release software evolution process," Journal of Systems and Software, vol. 81, no. 11, pp. 1868-1882, 2008.
[19] R. K. Bandi, V. K. Vaishnavi, and D. E. Turk, "Predicting Maintenance Performance Using Object-Oriented Design Complexity Metrics," IEEE Trans. Software Eng., vol. 29, no. 1, pp. 77-87, Jan. 2003.
[20] T. Zimmermann, R. Premraj, and A. Zeller, "Predicting defects for Eclipse," in Predictor Models in Software Engineering (PROMISE'07: ICSE Workshops 2007), May 2007, p. 9.
[21] K. O. Elish and M. O. Elish, "Predicting defect-prone software modules using support vector machines," J. Syst. Softw., vol. 81, no. 5, pp. 649-660, 2008.
[22] A. Okutan and O. Yldz, "Software defect prediction using Bayesian networks," Empirical Software Engineering, pp. 1-28, 2012.
[23] I. H. Laradji, M. Alshayeb, and L. Ghouti, "Software defect prediction using ensemble learning on selected features," Information and Software Technology, vol. 58, pp. 388-402, 2015.
[24] I. Gondra, "Applying machine learning to software fault-proneness prediction," Journal of Systems and Software, vol. 81, no. 2, pp. 186-195, 2008.
[25] D. Zhang and J. J. P. Tsai, "Machine Learning and Software Engineering," Software Quality Journal, vol. 11, no. 2, pp. 87-119, 2003.
[26] M. M. T. Thwin and T.-S. Quah, "Application of Neural Networks for Software Quality Prediction Using Object-Oriented Metrics," in Proc. IEEE Int'l Conf. Software Maintenance (ICSM), 2003.
[27] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[28] A. K. Jain, J. Mao, and K. M. Mohiuddin, "Artificial neural networks: A tutorial," IEEE Computer, pp. 31-44, Mar. 1996.
[29] B. R. Gaines and P. Compton, "Induction of ripple-down rules applied to modeling large databases."
[30] G. Boetticher, T. Menzies, and T. J. Ostrand, Promise repository of empirical software engineering data, 2007. [Online]. Available: http://promisedata.org/repository.
[31] G. Holmes, A. Donkin, and I. Witten, "Weka: A machine learning workbench," in Proc. 2nd Aust. New Zealand Conf. Intell. Inf. Syst., 1994, pp. 1269-1277.
[32] P. Domingos and M. Pazzani, "On the optimality of the simple Bayesian classifier under zero-one loss," Machine Learning, vol. 29, pp. 103-130, 1997.
[33] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, pp. 131-163, 1997.
[34] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in AAAI-98 Workshop on Learning for Text Categorization, 1998.
[35] T. Pornpon, L. Preechaveerakul, and W. Wettayaprasit, "A novel voting algorithm of multi-class SVM for web page classification," in 2nd IEEE International Conference on Computer Science and Information Technology (ICCSIT 2009), IEEE, 2009.
[36] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, no. 7, pp. 1145-1159, 1997.
[37] M. Hall et al., "The WEKA data mining software: An update," SIGKDD Explorations, vol. 11, no. 1, pp. 10-18, 2009.
[38] T. M. Cover and P. E. Hart, "Nearest neighbour pattern classification," IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21-27, Jan. 1967.


