Software Defect Prediction Using Supervised Machine Learning and Ensemble Techniques
Abdullah Alsaeedi, M. Z. Khan
Department of Computer Science, College of Computer Science and Engineering, Taibah University, Madinah, KSA
1. Introduction
A software defect is a bug, fault, or error in a program that causes improper
outcomes. Software defects are programming errors that may occur because of
errors in the source code, requirements, or design. Defects negatively affect
software quality and software reliability [1]. Hence, they increase maintenance
costs and the effort required to resolve them. Software development teams can detect bugs
by analyzing software testing results, but testing entire software modules is costly and
time-consuming. As such, identifying defective modules in early
stages is necessary to aid software testers in detecting modules that require
intensive testing [2] [3].
In the field of software engineering, software defect prediction (SDP) in early
stages is vital for software reliability and quality [1] [4]. The intention of SDP is
to predict defects before software products are released, as detecting bugs after
release is an exhausting and time-consuming process. In addition, SDP ap-
proaches have been demonstrated to improve software quality, as they help de-
velopers predict the most likely defective modules [5] [6]. SDP is considered a
significant challenge, so various machine learning algorithms have been used to
predict and determine defective modules [7]. To increase the effectiveness of software
testing, SDP is used to identify defective modules in current and subsequent versions of
a software product. SDP approaches are therefore very helpful in allocating more effort
and resources to testing and examining likely-defective modules [8].
Commonly used SDP strategies are regression and classification approaches.
The objective of regression techniques is to predict the number of software de-
fects [5]. In the literature, there are a number of regression models used for SDP
[9] [10] [11] [12]. In contrast, classification approaches aim to decide whether a
software module is faulty or not. Classification models can be trained from the
defect data of the previous version of the same software. The trained models can
then be used to predict further potential software defects. Mining software repositories
has become a vital research topic for defect prediction [13] [14].
Supervised machine learning classifiers, such as support vector machines [15] [16] [17],
k-nearest neighbors (KNN) [18] [19], and Naive Bayes [19] [20] [21], are commonly
employed to predict software defects. In addition,
Wang et al. [22] suggested the use of classifier ensembles to effectively predict defects.
A number of works have been accomplished in the field of SDP utilizing ensem-
ble methods such as bagging [23] [24] [25], voting [22] [26], boosting [23] [24]
[25], random tree [22], random forests (RF) [27] [28], and stacking [22]. Neural networks (NN)
can be used to predict defect-prone software modules [29] [30] [31] [32].
Clustering algorithms such as k-means, x-means, and expectation maximiza-
tion (EM) have also been applied to predict defects [33] [34] [35]. In addition,
the experimental results in [34] [35] showed that x-means clustering performed
better than fuzzy clustering, EM, and k-means clustering at identifying software
defects. Aside from those, transfer learning is a machine learning approach that
aims to transfer knowledge learned from one dataset and apply it to problems in a
different dataset [36]. Transfer learning has
also been introduced to the field of SDP [37] [38].
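As a concrete illustration of how the ensemble methods cited above can be instantiated, the following is a minimal sketch using scikit-learn (version 1.2 or later is assumed for the `estimator` parameter name); the base learners and parameters are illustrative choices, not the configuration used later in the experiments.

```python
# Illustrative construction of the ensemble styles cited above with scikit-learn.
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

stump = DecisionTreeClassifier(max_depth=1)                      # weak base learner (decision stump)
bagging = BaggingClassifier(estimator=stump, n_estimators=50)    # bagging
boosting = AdaBoostClassifier(estimator=stump, n_estimators=50)  # boosting (AdaBoost)
forest = RandomForestClassifier(n_estimators=100)                # random forest
voting = VotingClassifier(                                       # majority voting over members
    estimators=[("rf", forest), ("lr", LogisticRegression(max_iter=1000))],
    voting="hard")
stacking = StackingClassifier(                                   # stacking with a meta-learner
    estimators=[("rf", forest), ("boost", boosting)],
    final_estimator=LogisticRegression(max_iter=1000))
```

Each of these objects exposes the same fit/predict interface, so they can be swapped into a common training and evaluation loop.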
Software engineering data, such as defect prediction datasets, are highly imbalanced:
the number of samples of one class is vastly higher than that of another.
To deal with such data, imbalanced learning approaches have been
proposed in SDP to mitigate the data imbalance problem [7]. Imbalanced learn-
ing approaches include re-sampling, cost-sensitive learning, ensemble learning,
and imbalanced ensemble learning (hybrid approaches) [7] [39]. Re-sampling
2. Software Metrics
A software metric is a measure of quantifiable or countable characteristics
that can be used to assess and predict the quality of software. A metric is an
indicator describing a specific feature of a software product [6]. Identifying and
measuring software metrics is vital for various reasons, including estimating
programming performance, measuring the effectiveness of software processes,
estimating the effort required for processes, detecting defects during software
development, and monitoring and controlling software project execution [5].
Various software metrics have been commonly used for defect prediction. The
first group of software metrics, the lines of code (LOC) metrics, is considered the
most basic; LOC metrics are typical measures of software development. Many studies
in SDP have shown a clear correlation between LOC metrics and defects [43] [44].
Among the most widely used software metrics for SDP are the cyclomatic complexity
metrics, which were proposed by McCabe [45] and represent the complexity of software
products. McCabe's metrics (cyclomatic metrics) are computed from the control flow
graph of the source code by counting the number of nodes, arcs, and connected components.
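For reference, the standard McCabe definition of the cyclomatic complexity of a control flow graph $G$ can be stated as

$$V(G) = E - N + 2P,$$

where $E$ is the number of arcs (edges), $N$ the number of nodes, and $P$ the number of connected components of the graph.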
$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (3)$$

$$\text{TPR} = \frac{TP}{TP + FN} \quad (5)$$

$$\text{FPR} = \frac{FP}{TN + FP} \quad (6)$$
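The following minimal Python sketch implements Equations (2), (3), (5), and (6) directly from raw confusion-matrix counts; the counts passed in the example call are illustrative only.

```python
# Direct implementation of Equations (2), (3), (5) and (6) from confusion-matrix counts.
def defect_prediction_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # Eq. (2)
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # Eq. (3)
    tpr = recall                                      # Eq. (5): TPR equals recall
    fpr = fp / (tn + fp) if (tn + fp) else 0.0        # Eq. (6)
    return {"precision": precision, "recall": recall, "TPR": tpr, "FPR": fpr}

# Illustrative counts, e.g. 40 true positives, 10 false positives, 120 true negatives, 25 false negatives.
print(defect_prediction_metrics(tp=40, fp=10, tn=120, fn=25))
```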
4. Experimental Methodology
For the experiments, 10 well-known software defect datasets [62] were selected.
The majority of related works used these datasets to evaluate the performance of
their SDP techniques, which is the reason for selecting them here for further
comparison. Table 1 reports the datasets used in the experiments along with their
statistics. RF, DS, linear SVC (SVM), and LR were chosen to
be the base classifiers. Boosting and bagging classifiers for all the base classifiers
were also considered. The experiments were conducted in a Python environ-
ment. The classifiers’ performances in this study were measured using classifica-
tion accuracy, precision, recall, F-score, and ROC-AUC score. It is important to
highlight that these metrics were computed using the weighted average. The
intuition behind selecting the weighted average was to calculate each metric per
class label and take label imbalance into account.
Dataset   # Instances   # Defective   # Attributes
KC3       194           36            39
MC1       1988          46            38
MC2       125           44            39
MW1       253           27            37
PC1       705           61            37
PC2       745           16            36
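To make the evaluation protocol described above concrete, the following is a minimal sketch in Python using scikit-learn on synthetic imbalanced data standing in for a defect dataset; the classifier, split, and parameters are illustrative, not the exact experimental configuration.

```python
# Weighted-average evaluation of a classifier on imbalanced binary data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% "defective" modules.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Weighted averaging computes each metric per class and weights it by class support,
# which accounts for label imbalance.
precision, recall, f_score, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted")
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(accuracy, precision, recall, f_score, auc)
```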
PC1 0.91 0.87 0.79 0.81 0.90 0.86 0.79 0.78 0.89 0.89 0.79 0.81
PC3 0.84 0.80 0.74 0.76 0.84 0.81 0.74 0.76 0.82 0.84 0.74 0.75
PC4 0.90 0.85 0.81 0.82 0.89 0.84 0.81 0.82 0.89 0.89 0.81 0.83
PC5 0.76 0.71 0.68 0.68 0.75 0.70 0.68 0.71 0.76 0.77 0.68 0.68
JM1 0.77 0.71 0.69 0.70 0.77 0.78 0.69 0.72 0.77 0.77 0.69 0.69
KC2 0.82 0.78 0.79 0.78 0.81 0.77 0.79 0.80 0.80 0.79 0.79 0.79
KC3 0.81 0.77 0.77 0.77 0.79 0.80 0.77 0.71 0.79 0.82 0.77 0.76
MC1 0.97 0.94 0.81 0.81 0.97 0.94 0.81 0.77 0.97 0.96 0.81 0.81
MC2 0.69 0.67 0.65 0.63 0.71 0.68 0.65 0.65 0.71 0.75 0.65 0.65
CM1 0.83 0.78 0.75 0.73 0.82 0.77 0.75 0.75 0.81 0.83 0.75 0.74
obtained by LR for the same dataset. Among the base learners, RF was the best
performing classifier for all datasets, while SVM was the worst classifier for all
datasets, except KC2, MC2, and CM1. In addition, bagging with DS achieved higher
accuracy scores for PC3, PC5, KC3, MC2, and CM1 compared to the other bagging
and boosting methods.
Table 3 reports the F-scores attained using different classifiers. In general, it is
apparent that the RF classifier was the best performing for six different datasets,
as illustrated in Table 2 and Table 3. For PC1, PC3, PC4, KC2, MC1, and CM1,
the RF classifier attained the highest F-scores compared to the other classifiers,
indicating better predictions obtained by RF. In addition, the reported F-scores
showed that the AdaBoost classifier with RF as the base learner attained scores
similar to RF for the PC3, PC4, KC2, and MC1 datasets. Furthermore, bagging
with DS achieved higher F-scores compared to other classifiers for PC5, KC, and
MC2.
Figure 2 illustrates bar plots of the F-scores attained using classifiers for all
datasets. For the PC3, PC4, PC5, and JM1 datasets, it is obvious that the SVM,
6. Threats to Validity
In this section, we list some potential threats to the validity of our study and
our responses to them.
1) The selection of datasets may not be representative. One potential threat to
validity is that the selected datasets might not be representative. In our study,
this threat is mitigated by evaluating the performance of the classifiers on
ten well-known datasets that are commonly used in the literature.
2) The generalization of our results. We have attempted to mitigate this threat
by measuring the performance of the base learners, boosting, and bagging clas-
sifiers on diverse datasets of different sizes.
3) The trained classifiers may over-fit and bias the results. Instead of split-
ting the datasets randomly using a simple train-test split (70% - 80% for train-
ing and 20% - 30% for testing), we split each dataset into training and testing sets
using 10-fold cross-validation to avoid the over-fitting that random splitting
might cause.
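A minimal sketch of this validation scheme is given below, assuming scikit-learn; the use of stratified folds and the weighted F-score as the scoring function are illustrative choices layered on the 10-fold protocol described above.

```python
# Sketch of 10-fold cross-validation instead of a single random train/test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for a defect dataset.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="f1_weighted")
print(scores.mean(), scores.std())  # average performance over the 10 folds
```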
7. Related Works
Kalai Magal et al. [28] combined feature selection with RF to improve the accu-
racy of software defect prediction. Feature selection was based on correlation
computation and aimed to choose the ideal subset of features. The selected fea-
tures using correlation-based feature selection were then used with RF to predict
software defects. Various experiments were conducted on open NASA datasets
from the PROMISE repository. The outcome showed clear improvements ob-
tained using the improved RF compared to the traditional RF.
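For illustration only (not the authors' exact procedure), the following sketch ranks features by their absolute correlation with the defect label, keeps a top-ranked subset, and compares RF trained on all features with RF trained on the selected subset; the data, the number of retained features, and the ranking rule are all illustrative assumptions.

```python
# Illustrative correlation-based feature selection followed by random forests.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           weights=[0.85, 0.15], random_state=1)

# Absolute Pearson correlation of every feature with the class label.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top_k = np.argsort(corr)[::-1][:10]  # keep the 10 most correlated features

baseline = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=10).mean()
selected = cross_val_score(RandomForestClassifier(random_state=1), X[:, top_k], y,
                           cv=10).mean()
print(baseline, selected)
```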
Challagulla et al. [9] explored various machine learning algorithms for real-time
system defect identification. They investigated the impact of attribute reduction
on the performance of SDP models and attempted to combine PCA with differ-
ent classification models which did not show any improvements. However, the
outcomes of the experimental results demonstrated that combining the correla-
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this pa-
per.
References
[1] Rawat, M.S. and Dubey, S.K. (2012) Software Defect Prediction Models for Quality
Improvement: A Literature Study. International Journal of Computer Science Is-
sues, 9, 288-296.
[2] Li, J., He, P., Zhu, J. and Lyu, M.R. (2017) Software Defect Prediction via Convolu-
tional Neural Network. 2017 IEEE International Conference on Software Quality,
Reliability and Security, 25-29 July 2017, Prague, 318-328.
https://doi.org/10.1109/QRS.2017.42
[3] Hassan, F., Farhan, S., Fahiem, M.A. and Tauseef, H. (2018) A Review on Machine
Learning Techniques for Software Defect Prediction. Technical Journal, 23, 63-71.
[4] Punitha, K. and Chitra, S. (2013) Software Defect Prediction Using Software Me-
trics: A Survey. 2013 International Conference on Information Communication and
Embedded Systems, 21-22 February 2013, Chennai, 555-558.
https://doi.org/10.1109/ICICES.2013.6508369
[5] Kalaivani, N. and Beena, R. (2018) Overview of Software Defect Prediction Using
Machine Learning Algorithms. International Journal of Pure and Applied Mathe-
matics, 118, 3863-3873.
[6] Ge, J., Liu, J. and Liu, W. (2018) Comparative Study on Defect Prediction Algo-
rithms of Supervised Learning Software Based on Imbalanced Classification Data
Sets. 2018 19th IEEE/ACIS International Conference on Software Engineering, Ar-
tificial Intelligence, Networking and Parallel/Distributed Computing, 27-29 June
2018, Busan, 399-406. https://doi.org/10.1109/SNPD.2018.8441143
[7] Song, Q., Guo, Y. and Shepperd, M. (2018) A Comprehensive Investigation of the
Role of Imbalanced Learning for Software Defect Prediction. IEEE Transactions on
Software Engineering, 1. https://doi.org/10.1109/TSE.2018.2836442
[8] Chang, R.H., Mu, X.D. and Zhang, L. (2011) Software Defect Prediction Using
Non-Negative Matrix Factorization. Journal of Software, 6, 2114-2120.
https://doi.org/10.4304/jsw.6.11.2114-2120
[9] Challagulla, V.U.B., Bastani, F.B., Yen, I.L. and Paul, R.A. (2005) Empirical Assess-
ment of Machine Learning Based Software Defect Prediction Techniques. Proceed-
ings of the 10th IEEE International Workshop on Object-Oriented Real-Time De-
pendable Systems, 2-4 February 2005, Sedona, 263-270.
https://doi.org/10.1109/WORDS.2005.32
[10] Yan, Z., Chen, X. and Guo, P. (2010) Software Defect Prediction Using Fuzzy Sup-
port Vector Regression. In: Zhang, L., Lu, B. and Kwok, J., Eds., Advances in Neural
Networks, Springer, Berlin, 17-24.
https://doi.org/10.1007/978-3-642-13318-3_3
[11] Rathore, S.S. and Kumar, S. (2016) A Decision Tree Regression Based Approach for
the Number of Software Faults Prediction. ACM SIGSOFT Software Engineering
Notes, 41, 1-6. https://doi.org/10.1145/2853073.2853083
[12] Rathore, S.S. and Kumar, S. (2017) An Empirical Study of Some Software Fault Pre-
diction Techniques for the Number of Faults Prediction. Soft Computing, 21,
7417-7434. https://doi.org/10.1007/s00500-016-2284-x
[13] Wang, H. (2014) Software Defects Classification Prediction Based on Mining Soft-
ware Repository. Master’s Thesis, Uppsala University, Department of Information
Technology.
[14] Vandecruys, O., Martens, D., Baesens, B., Mues, C., Backer, M.D. and Haesen, R.
(2008) Mining Software Repositories for Comprehensible Software Fault Prediction
Models. Journal of Systems and Software, 81, 823-839.
https://doi.org/10.1016/j.jss.2007.07.034
http://www.sciencedirect.com/science/article/pii/S0164121207001902
[15] Vapnik, V. (2013) The Nature of Statistical Learning Theory. Springer, Berlin.
[16] Elish, K.O. and Elish, M.O. (2008) Predicting Defect-Prone Software Modules Using
Support Vector Machines. Journal of Systems and Software, 81, 649-660.
https://doi.org/10.1016/j.jss.2007.07.040
http://www.sciencedirect.com/science/article/pii/S016412120700235X
[17] Gray, D., Bowes, D., Davey, N., Sun, Y. and Christianson, B. (2009) Using the Sup-
port Vector Machine as a Classification Method for Software Defect Prediction with
Static Code Metrics. In: Palmer-Brown, D., Draganova, C., Pimenidis, E. and Mou-
ratidis, H., Eds., Engineering Applications of Neural Networks, Springer, Berlin,
223-234. https://doi.org/10.1007/978-3-642-03969-0_21
[18] Wang, H., Khoshgoftaar, T.M. and Seliya, N. (2011) How Many Software Metrics
Should Be Selected for Defect Prediction? 24th International FLAIRS Conference,
18-20 May 2011, Palm Beach, 69-74.
[19] Perreault, L., Berardinelli, S., Izurieta, C. and Sheppard, J. (2017) Using Classifiers
for Software Defect Detection. 26th International Conference on Software Engi-
neering and Data Engineering, 2-4 October 2017, Sydney, 2-4.
[20] Wang, T. and Li, W. (2010) Naive Bayes Software Defect Prediction Model. 2010
International Conference on Computational Intelligence and Software Engineering,
10-12 December 2010, Wuhan, 1-4. https://doi.org/10.1109/CISE.2010.5677057
[21] Jiang, Y., Cukic, B. and Menzies, T. (2007) Fault Prediction Using Early Lifecycle
Data. 18th IEEE International Symposium on Software Reliability, 5-9 November
2007, Trollhättan, 237-246. https://doi.org/10.1109/ISSRE.2007.24
[22] Wang, T., Li, W., Shi, H. and Liu, Z. (2011) Software Defect Prediction Based on