Software Defect Prediction Using Random Forest
Abstract—A software defect can cause unnecessary effects on the software, such as increased cost and reduced quality. Predicting software defects can therefore be useful for the development of good-quality software. For the prediction, the PROMISE public datasets are used and the random forest (RF) algorithm is applied with the RAPIDMINER machine learning tool. This paper compares the performance evaluation across different numbers of trees in the RF. The results show that accuracy increases slightly as the number of trees grows. The maximum accuracy is up to 99.59% and the minimum accuracy is 85.96%. Another comparison is based on the AUC, which represents the most informative indicator of predictive accuracy within the field of software defect prediction. All of the results show that the RF algorithm is effective in this prediction and that it is most suitable to use around one hundred trees in the RF.
I. INTRODUCTION
A defect can be an error or bug introduced during the development of software. In the software development process, mistakes or errors can be made by the software programmer; these become the flaws, or defects, of the software. When a software product does not meet the end user's expectations or the software requirements, this is called a software defect. A defect can also be an error in coding, such as a logic error, which produces incorrect or unexpected results.

Fig. 1. Defects cause software crashes.

The goal of software quality control is the management of software quality, which requires removing quality problems from the software [2]. These quality problems can be named bugs, faults, errors, or defects. Before end users encounter the errors, software engineers need to find as many of the defects as they can. These problems must be avoided because the impact of a defect may affect customer opinion. Therefore, it is important to detect errors and defects. Fig. 1 shows that software can be damaged due to its defects.

Detection of defects is one of the types of software testing. Most software defect detection research uses data mining algorithms and tools. There are many detection approaches, such as static analysis and dynamic analysis. Static analysis examines the code without running the software [3]; it can detect potential software defects such as runtime errors or logical inconsistencies. Dynamic analysis, in contrast, requires executing the programs.

In this paper, the PRedictOr Models In Software Engineering (PROMISE) public datasets are used for the prediction of software defects. For the prediction, the RAPIDMINER tool is used to apply the random forest algorithm and to show the performance evaluation of this algorithm on the selected datasets. This paper also compares performance while varying the number of trees in the random forest algorithm. The random forest algorithm overcomes the overfitting problem and can achieve high accuracy for the prediction.

This paper is organized into six sections. Section 2 introduces the cost and impact of software defects. Section 3 discusses related work on software defect prediction. Section 4 presents the methodology of the prediction, including the datasets used, the random forest algorithm, and the performance measurement mechanisms. The experimental results are discussed in Section 5. Finally, Section 6 concludes the paper.
II. COST AND IMPACT OF SOFTWARE DEFECTS

In the software development process, a quality problem that is propagated from one process framework activity, such as modeling, to another, such as construction, can also be called a defect [2]. The cost of defects can be measured by both the impact of the defects and when they are found. The earlier a defect is found, the lower the required software cost. If a defect is not detected until acceptance testing, or even once the system has been implemented, the system will be much more expensive to fix. If the cost of software refinement is too expensive, the defects may never be corrected. Fig. 2 shows that the cost of a software defect depends on processing time: the longer it takes to find a defect, the more expensive the software may become.

Fig. 2. The cost of a software defect depends on processing time.

A number of industry studies show that design activities introduce between fifty and sixty-five percent of all defects during the software process. However, review techniques have been shown to be up to seventy-five percent effective in uncovering design flaws [2]. These review processes significantly reduce the cost of subsequent activities in the software process by detecting and removing a large percentage of these flaws.

Defect prevention is a critical step or activity in any software development process. The defect prevention responsibilities of testers are requirement specification review, design review, and code review. Some defect prevention methods are as follows [8]. The first is review and inspection, which includes review by an individual team member for self-checking, peer reviews, and inspection of all work products. The second is the walkthrough, which is more or less like a review. Another is defect logging and documentation, which provides key information, arguments, or parameters used to support the analysis of defects. Finally, root cause analysis can take the form of simple techniques for problem resolution with maximum impact, and cause and effect analysis is based on team-wide brainstorming.

III. RELATED WORKS

Most software defect detection approaches are based on datasets analyzed with machine learning techniques, and most of these classification processes used the WEKA tool to generate their results.

Selvaraj et al. [4] used a Support Vector Machine (SVM) for software defect prediction. The KC1 dataset is used for this prediction; it comes from a C++ implemented storage management system for ground data. Their research aims to detect defects; the detector uses a classifier and a module that predicts whether there are no defects/errors or some defects/errors, but it cannot solve complex problems. The accuracy of this work is 86.05% for the KC1 dataset, using 1,391 samples as the training set and 716 samples for testing. The results are produced with the WEKA tool.

Ma et al. [5] proposed software defect prediction based on association rule classification. Their paper assesses the use of classification methods based on CBA2, an association rule algorithm, and compares the results with other rule-based classification methods. It then investigates whether rule sets generated on data from one software project can be used to predict defective software modules. Their finding is that the resulting rule sets are accurate and comprehensible. Comparisons are based on the AUC, which represents the accuracy for software defect prediction. Twelve datasets are used for detection; one third of each dataset is used for testing and the remaining part of each dataset for training, implemented using WEKA. The accuracy of each process is around 69 to 99%, but the CBA2 classifier can give a lower performance than a rule set induced on the same dataset.

Song et al. [6] proposed software defect prediction using association rule mining. Their proposed method is based on association rule mining for predicting defect associations and defect correction effort. They used the NASA SEL defect dataset, which consists of more than 200 projects, for defect isolation and defect correction effort, and compared their defect correction effort prediction method with other methods such as PART, C4.5, and Naive Bayes. Only one defect is not correctly predicted in 37.36 percent of the cases in the defect dataset, and the results are based on a total of sixty sets of rules, which consist of more than one thousand rules. They found that the results are not effective at higher support and confidence levels. They obtained a maximum accuracy of 96.06 percent and a minimum accuracy of 95.38 percent with minimum support and minimum confidence levels of twenty and thirty percent, respectively.

IV. METHODOLOGY

This work uses public datasets that are downloaded and applied to the random forest algorithm. Then, the accuracy of the outcomes is calculated using performance measuring mechanisms.

A. Dataset

The public software defect prediction datasets are obtained from the tera-PROMISE Repository and are categorized into three groups: the AR, KC, and PC datasets. The AR datasets contain ar1, ar3, ar4, ar5, and ar6. The KC datasets include kc1, kc2, and kc3. The PC datasets involve pc1, pc2, pc3, and pc4. These datasets were created by the NASA metrics data program [1]. The name of this package is the PROMISE dataset, which makes predictive models of software engineering publicly available.

2018 12th South East Asian Technical University Consortium Symposium (SEATUC), Yogyakarta, Indonesia
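As an illustration of how such datasets are typically consumed, the sketch below parses a PROMISE-style table (numeric metric attributes followed by a final defect label column) from CSV text. The column names, label values, and function name here are assumptions for illustration, not the exact tera-PROMISE file format.

```python
import csv
from io import StringIO

def load_promise_csv(csv_text):
    """Parse a PROMISE-style dataset: numeric metric columns
    followed by a final defect label column ('true'/'false')."""
    reader = csv.reader(StringIO(csv_text))
    header = next(reader)
    features, labels = [], []
    for row in reader:
        features.append([float(v) for v in row[:-1]])
        labels.append(row[-1].strip().lower() == "true")
    return header, features, labels

# Toy rows standing in for kc1-style records (loc, v(g), defects)
sample = "loc,v(g),defects\n12,1,false\n340,25,true\n"
header, X, y = load_promise_csv(sample)
```

In practice each group of files (ar*, kc*, pc*) would be loaded this way before being passed to the learning tool.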
The AR datasets are embedded software implemented in C from a Turkish white-goods manufacturer. These datasets include function-level static code attributes represented with numeric values, and they contain defect information indicating whether there is a bug or not. There are 29 attributes in these datasets; 14 percent of the records are defect records and the remaining are non-defect records, separated into 5 data files: ar1, ar3, ar4, ar5, and ar6.

The KC datasets came from a C++ implemented storage management system for receiving and processing ground data. Their attributes come from the McCabe and Halstead source code feature extractors. These features were defined in an attempt to objectively characterize code features associated with software quality, although the nature of the association is under argument. The McCabe and Halstead measures are based on the module, the smallest unit of functionality; a module would be called a function or method in C and Smalltalk. The KC1 dataset contains 21 feature attributes and one class label attribute that distinguishes defects from non-defects. There are 2,109 records, containing 326 defects and 1,783 non-defects.

TABLE I. PROMISE DATASETS

Name  Attributes  Records  Non-Defects  Defects
ar1   29          121      112          9
ar3   29          63       55           8
ar4   29          107      87           20
ar5   29          36       28           8
ar6   29          101      86           15
kc1   21          2,109    1,783        326
kc2   21          522      415          107
kc3   21          458      415          43
pc1   21          1,109    1,032        77
pc2   36          5,589    5,566        23
pc3   37          1,563    1,403        160
pc4   37          1,458    1,280        178

The PC datasets are data from C functions in flight software for an Earth-orbiting satellite. These data also come from the McCabe and Halstead source code feature extractors. There are 21 attributes in pc1, 36 attributes in pc2, and 37 attributes in pc3 and pc4, respectively. These datasets contain only about 5 percent defect records; the remaining records are non-defects. Each of the datasets is summarized in Table I.

B. Random Forest Algorithm

The random forest (RF) algorithm is a supervised classification algorithm that creates a forest with a number of trees. The random forest algorithm can be divided into two processes [7]:

• Creation of the random forest
• Prediction using the created trees

1) Creation

The following is the pseudocode of the random forest creation process.

1. Randomly select "k" features from the total "m" features (where k << m).
2. Among the "k" features, calculate the node "d" using the best split point.
3. Split the node into child nodes using the best split.
4. Repeat the above three steps until "l" number of nodes has been reached.
5. Build the forest by repeating the above four steps "n" times to create "n" trees.

At the beginning of the RF algorithm, it randomly selects k features out of the total features. In the second step, it finds the root node by using the best split approach. In the next step, the daughter nodes are extracted using the same best split approach. The first three steps are performed until a tree with a root node is formed and the target is reached as the leaf node. Finally, the first four steps are repeated to create n random trees.

2) Prediction

The trained random forest is used to perform prediction with the following processes.

1. Take the test features and use the rules of each randomly created decision tree.
2. Calculate the votes for each predicted target.
3. Consider the highest voted predicted target as the final prediction.

In the first step, the test features are passed through the rules of each randomly created tree. Different target outcomes for the same test features may be predicted by each tree in the random forest. The votes for each predicted target are then counted, and the random forest returns the highest voted target as the final prediction.

C. Performance Evaluation

The performance of software defect prediction is evaluated with the following criteria: accuracy, classification error, and the area under the curve (AUC).

• Accuracy (Acc): percentage of correctly identified predictions.

Acc = (TP + TN) / (TP + TN + FP + FN) × 100%    (1)

• Classification Error (Err): percentage of incorrectly identified predictions.

Err = (FP + FN) / (TP + TN + FP + FN) × 100%    (2)
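The creation and prediction pseudocode above can be sketched in plain Python. This is a minimal illustration, not the paper's RAPIDMINER setup: trees are simplified to depth-one stumps trained on bootstrap samples with a random feature subset, the function names are my own, and the data used later is a toy example rather than a PROMISE dataset.

```python
import random
from collections import Counter

def gini(labels):
    # Gini impurity of a collection of class labels
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    # Step 2: among the sampled features, pick the split point that
    # minimises the weighted Gini impurity of the two child nodes.
    best = None
    for f in features:
        for row in rows:
            t = row[f]
            left = [labels[i] for i, r in enumerate(rows) if r[f] <= t]
            right = [labels[i] for i, r in enumerate(rows) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best[1], best[2]

def build_stump(rows, labels, k):
    # Steps 1-3, stopping at depth one: sample k of the m features,
    # split once, and store each child's majority label as the leaf.
    feats = random.sample(range(len(rows[0])), k)
    f, t = best_split(rows, labels, feats)
    default = Counter(labels).most_common(1)[0][0]
    left = [labels[i] for i, r in enumerate(rows) if r[f] <= t]
    right = [labels[i] for i, r in enumerate(rows) if r[f] > t]
    leaf = lambda side: Counter(side).most_common(1)[0][0] if side else default
    return f, t, leaf(left), leaf(right)

def build_forest(rows, labels, n_trees, k):
    # Step 5: repeat the creation steps n times on bootstrap samples.
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(rows)) for _ in range(len(rows))]
        forest.append(build_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], k))
    return forest

def predict(forest, row):
    # Prediction steps 1-3: run the features through every tree and
    # return the majority vote over the predicted targets.
    votes = [(ll if row[f] <= t else rl) for f, t, ll, rl in forest]
    return Counter(votes).most_common(1)[0][0]
```

On the PROMISE data, `rows` would hold the numeric metric attributes and `labels` the defect/non-defect class; a real random forest grows full trees rather than stumps, which is what RAPIDMINER's operator does.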
TP is the number of true positives, which indicates the defect records correctly classified. FP is the number of false positives, which represents the non-defect records incorrectly classified as defects. TN is the number of true negatives, which corresponds to the non-defect records correctly classified. FN is the number of false negatives, which stands for the defect records incorrectly classified as non-defects.

The AUC is a performance metric for binary classifiers. It captures the extent to which the ROC curve is up in the northwest corner; the higher the AUC, the better the result. The normal threshold for two classes is 0.5, and the algorithm classifies upon true and false. In this paper, AUC values are calculated based on the positive class (true).

V. EXPERIMENTAL RESULTS

The experimental results are compared across three categories for the number of trees in the random forest algorithm: 10, 100, and 1000. These results are based on a minimal confidence of 0.25 for pruning and a minimal gain value of 0.1 for pre-pruning, using the RAPIDMINER machine learning tool. Tables II and III show the prediction accuracy and error.

TABLE II. ACCURACY OF DEFECT PREDICTION (ACC)

Dataset  10 Trees  100 Trees  1000 Trees

In Table II, the accuracy is the same for ar1, pc2, and pc3 regardless of the number of trees. The other results show that prediction accuracy mostly increases slightly as the number of trees grows. Using 10 trees in the random forest, some results are stable, but most are slightly lower in accuracy. If the number of trees is increased to 100, most accuracy results increase slightly. If the number of trees is increased to 1000, there is not much difference in the accuracy results, and most are the same as with 100 trees. The results show that using around 100 trees is most suitable for predicting software defects with the PROMISE datasets. In the datasets, about 20 percent of the records are defect records and the others are non-defect records. These results show that the random forest algorithm is effective in predicting software defects for these datasets, especially for pc2, where the prediction accuracy is the highest among the datasets at nearly 100%.

TABLE III. CLASSIFICATION ERROR OF DEFECT PREDICTION (ERR)

Dataset  10 Trees  100 Trees  1000 Trees
ar1      7.44      7.44       7.44
ar3      6.35      4.76       4.76
ar4      11.21     9.35       8.41
ar5      8.33      2.78       2.78
ar6      11.88     10.89      10.89
kc1      14.04     13.47      13.18
kc2      12.26     11.49      9.96
kc3      7.64      7.21       7.21
pc1      6.94      6.76       6.76
pc2      0.41      0.41       0.41
pc3      10.24     10.24      10.24
pc4      11.93     11.87      11.87

Another comparison is based on the AUC. For software defect prediction, the AUC is an indication of prediction accuracy. In Table IV, AUC values are calculated based on the positive class (true). The results show that the AUC increases slightly as more trees are created in the random forest, but more processing time is needed to create more trees. Summarizing these three tables, defect prediction is most suitable with around 100 trees in the forest.

TABLE IV. AUC PERFORMANCE OF DEFECT PREDICTION

Dataset  10 Trees  100 Trees  1000 Trees
It is more suitable to use around a hundred trees for this prediction.
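The accuracy and classification error criteria of Section IV-C can be reproduced directly from confusion-matrix counts, as in the short sketch below. The counts here are hypothetical (chosen so the error rounds to 7.44%, like the ar1 entry in Table III), not the paper's actual confusion matrix.

```python
def accuracy(tp, tn, fp, fn):
    # Eq. (1): percentage of correctly classified records
    return (tp + tn) / (tp + tn + fp + fn) * 100

def classification_error(tp, tn, fp, fn):
    # Eq. (2): percentage of incorrectly classified records
    return (fp + fn) / (tp + tn + fp + fn) * 100

# Hypothetical counts for an ar1-sized dataset (121 records)
tp, tn, fp, fn = 4, 108, 4, 5
acc = accuracy(tp, tn, fp, fn)              # ≈ 92.56
err = classification_error(tp, tn, fp, fn)  # ≈ 7.44
```

By construction the two measures always sum to 100%, which is a quick sanity check when reading paired accuracy/error tables.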
VI. CONCLUSIONS
Most software defect prediction work is based on datasets analyzed with machine learning techniques. In this paper, the prediction results are based on the PROMISE public datasets using the random forest algorithm. The maximum prediction accuracy is up to 99.59% and the minimum accuracy is 85.96% using a hundred trees in the random forest. The results show that the number of trees in the forest should be around one hundred, because this setting is more stable for defect prediction and yields high accuracy. In the future, we will analyze software defect prediction with other datasets and machine learning algorithms to find the most stable mechanism.
ACKNOWLEDGMENT
The authors are grateful for the financial support provided by the AUN/SEED-Net Project (JICA).
REFERENCES
[1] G. Boetticher, T. Menzies, and T. Ostrand, “PROMISE repository of empirical software engineering data”, West Virginia University, Department of Computer Science, 2007.
[2] R. S. Pressman, “Software Engineering – A Practitioner’s Approach”,
7th Edition, ISBN 978–0–07–337597–7, 2010.
[3] N. Ayewah, D. Hovemeyer, D. J. Morgenthaler, J. Penix and W. Pugh,
“Using static analysis to find bugs”, IEEE Software, ISSN: 0740-7459,
19 August 2008.
[4] P. A. Selvaraj, and P. Thangaraj, “Support Vector Machine for software
defect prediction”, International Journal of Engineering & Technology
Research, ISSN Online: 2347-4904, 2013.
[5] B. Ma, K. Dejaeger, J. Vanthienen, and B. Baesens, “Software defect
prediction based on association rule classification”, Department of
Decision Sciences and Information Management, K. U. Leuven, B-3000,
Leuven, Belgium, 2011.
[6] Q. Song, M. Shepperd, M. Cartwright and C. Mair, “Software defect
association mining and defect correction effort prediction”, IEEE
Transactions on Software Engineering, DOI:
10.1109/TSE.2006.1599417, 2006.
[7] S. Polamuri (2017), “How the random forest algorithm works in
machine learning”, [Online]. Available:
http://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-
learing/ [Accessed: Dec 1, 2017].
[8] “Defect prevention methods and techniques”, [Online]. Available: http://www.softwaretestinghelp.com/defect-prevention-methods/ [Accessed: Dec 21, 2017].