Hyperparameter Optimization for Software Bug Prediction Using Ensemble Learning
ABSTRACT Software Bug Prediction (SBP) is a process integral to a software project's success that involves predicting software bugs before they occur. Detecting software bugs early in the development process enhances software quality and performance and reduces software costs. The integration of Machine Learning (ML) algorithms has significantly improved software bug prediction accuracy while reducing costs and resource utilization. Numerous studies have explored the impact of Hyperparameter Optimization on single classifiers, enhancing these models' overall performance in SBP analysis. Ensemble Learning (EL) approaches have also demonstrated increased model accuracy and performance on SBP datasets. This study proposes a novel learning model for predicting software bugs that combines EL with hyperparameter tuning. The results are compared with single hypothesis learning models using the WEKA software. The dataset, collected by the National Aeronautics and Space Administration (NASA), U.S.A., comprises 10,885 instances with 20 attributes, including a defect label, for one of their coding projects. The findings indicate that EL models outperform single hypothesis learning models and that the accuracy of the proposed model improves further after hyperparameter optimization. These results underscore the efficacy of ensemble learning, coupled with hyperparameter optimization, as a viable approach for enhancing the predictive capabilities of software bug prediction models.
INDEX TERMS Software bug prediction, machine learning, hyperparameter optimization, ensemble
learning.
to multiple SDLC models, such as the Waterfall Model, Iterative Model, Agile Model, V-Shaped Model, Spiral Model, and Big Bang Model. ML algorithms prove effective in determining the presence of bugs in a given piece of code. Enhanced programming methods can significantly elevate the overall quality of the program. ML represents one of the most rapidly expanding fields of computer science, with far-reaching applications. It involves the automatic identification of meaningful patterns in records. The primary goal of ML tools is to provide algorithms with the opportunity to learn and adapt [3], [4].

With the rise of ML algorithms in the past few years, many companies have utilized ML to find and predict software bugs before they occur. ML algorithms help organizations make predictions in many fields by combining mathematical and statistical techniques. There are two types of ML algorithms: supervised learning and unsupervised learning. In supervised learning, the algorithm's inputs and outputs are based on historical data related to the study; this data is used to train the algorithms and find patterns, and these patterns are then used to predict future outputs. In unsupervised learning, the algorithms process data to find patterns in unlabeled data. Early bug discovery and prediction are critical exercises in the software development life cycle [5]. They contribute to a higher satisfaction rate among users and enhance the overall performance and quality of the software project.

Defect prediction models can guide testing efforts towards code that is susceptible to bugs. Before software delivery to the customer, latent bugs in the code may surface; once identified, these flaws can be rectified prior to delivery at a fraction of the cost of post-delivery repairs [6]. Code defects incur significant expenses for companies each year, with billions of dollars spent on identification and correction. Models that accurately pinpoint the locations of bugs in code therefore have the potential to save businesses substantial amounts of money. Given the high costs involved, even minor improvements in detecting and repairing flaws can have a significant impact on total costs.

Many studies have been conducted on bug detection using ML algorithms. Some of these studies have utilized EL, which enables researchers and practitioners to employ more than one machine learning algorithm. By combining ML algorithms with diverse methodologies, EL aggregates the votes of multiple algorithms, enhancing the ability to make more accurate predictions. Previous research has also examined different types of algorithms for predicting software bugs. Although some of these proposed algorithms showed higher accuracy rates, most of the cited bug prediction studies for software projects left their settings at default values [7].

Hyperparameter optimization can enhance accuracy and improve the performance of ML models. The optimization of hyperparameters in Bayesian, SVM, and KNN models has demonstrated an increase in the accuracy of bug prediction models, as observed by Osman et al. [8] and Levesque et al. [9]. Hyperparameter optimizers are utilized to fine-tune the control parameters of data mining algorithms. It is widely recognized that such tuning enhances classification tasks, including software bug prediction and text classification [10].

To enhance the effectiveness of software bug prediction, this research addresses two key questions. Firstly, it explores whether Ensemble Learning (EL) models outperform single learning models in predicting software bugs. Secondly, it investigates the impact of adding hyperparameter optimization approaches on EL model performance, examining whether this enhancement leads to improved accuracy. These questions serve as the foundation of our exploration, guiding us to understand the nuances of predictive modeling in software bug detection.

The remainder of the paper is organized as follows: Section II presents related work; Section III covers the background of the study, including hyperparameter optimization and ensemble learning; Section IV outlines the methodology, encompassing the dataset, tools used, research questions, and evaluation measures; Section V presents the results and discussion; Section VI provides the conclusion and outlines future work.

II. RELATED WORK
This section navigates through recent contributions in the dynamic field of software bug prediction by exploring various methodologies and approaches used by previous researchers. These studies enrich our understanding of effective bug prediction models and offer diverse insights and methodologies to address the challenges posed by software defects.

In the study of Hammouri et al. [5], the researchers conducted a thorough examination of three supervised ML models (Naive Bayes, Decision Tree, and Artificial Neural Networks) specifically tailored for predicting software bugs. Their research involved a meticulous comparison with two previously proposed models, revealing that the three models exhibited a significantly higher accuracy rate than the two earlier models.

Di Nucci et al. [11] undertook an innovative approach, quantifying the dispersal of modifications made by developers working on a component and utilizing this data to construct a bug prediction model. This study, building on prior research that examined the human factor in bug generation, demonstrates that non-focused developers are more prone to introducing defects than their focused counterparts. Notably, the model exhibited superiority when compared to four competitive techniques.

Khan et al. [12] introduced a model for predicting software bugs, integrating ML classifiers with Artificial Immune Networks (AIN) to optimize hyperparameters for enhanced bug prediction accuracy. Employing seven ML algorithms on a bug prediction dataset, the study revealed that hyperparameter tuning of ML classifiers using AIN surpassed the performance of classifiers with default hyperparameters in the context of software bug prediction.
An ensemble consists of a collection of learners referred to as base learners. The generalization ability of an ensemble typically surpasses that of individual base learners. Ensemble Learning (EL) is particularly appealing because it can elevate weak learners, which are only marginally better than random guesses, to strong learners capable of making highly accurate predictions [22]. Consequently, the term "weak learners" is often synonymous with "base learners." It is noteworthy, however, that while theoretical studies primarily focus on weak learners, the base learners employed in practical applications are not exclusively weak; using stronger base learners can also contribute to superior outcomes.

Among the commonly utilized EL methods are Bagging, AdaBoost, Stacking, and Voting. Bagging, short for bootstrap aggregation, stands out as one of the oldest, most fundamental, and perhaps simplest ensemble-based algorithms, demonstrating remarkably high performance [23]. In bagging, diversity of classifiers is achieved by creating bootstrapped replicas of the training data [24]. This involves randomly selecting separate subsets of training data from the entire training dataset, with replacement. Each subset of training data is then employed to train a specific classifier. The classifiers are subsequently aggregated using a majority voting mechanism, where the ensemble decision is determined by the class selected by the majority of classifiers for a given case.
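To make the procedure concrete, the sketch below assembles such a bagged ensemble with Weka's Java API, the suite used in this study. The file name JM1.arff, the J48 base learner, and the parameter values are illustrative assumptions, not the study's exact configuration.

```java
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingSketch {
    public static void main(String[] args) throws Exception {
        // Load the dataset; "JM1.arff" is an assumed file name for the NASA data.
        Instances data = DataSource.read("JM1.arff");
        data.setClassIndex(data.numAttributes() - 1); // defect label as last attribute

        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48());  // base learner: a C4.5 decision tree
        bagger.setNumIterations(10);      // 10 bootstrapped replicas, one classifier each
        bagger.setBagSizePercent(100);    // each replica matches the training-set size
        bagger.buildClassifier(data);     // train; member outputs are then aggregated
        System.out.println(bagger);
    }
}
```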
AdaBoost operates by training models in successive
rounds, with a new model trained in each iteration. At the
end of each round, misclassified examples are identified, and
their significance is amplified in a new training sequence.
This updated training data is then fed back into the begin-
ning of the subsequent round, where a new model is pre-
pared [25]. The underlying theory is that subsequent iter-
ations should compensate for mistakes made by previous
models, leading to an overall improvement in the ensemble’s
performance.
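A minimal sketch of this boosting loop, using Weka's AdaBoostM1 meta-classifier, might look as follows; the decision-stump base learner and the iteration count are assumed for illustration and are not the paper's reported settings.

```java
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AdaBoostSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("JM1.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new DecisionStump()); // weak learner re-trained each round
        booster.setNumIterations(50);               // number of boosting rounds
        booster.setUseResampling(false);            // reweight examples rather than resample
        booster.setWeightThreshold(100);            // percentage of weight mass to base training on
        booster.buildClassifier(data);              // each round up-weights misclassified cases
        System.out.println(booster);
    }
}
```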
Bagging is particularly effective for unstable models that exhibit varying generalization behavior with slight changes in the training data [25]. These models, often termed high variance models, include Decision Trees and Neural Networks. However, bagging may not be suitable for very simple, stable models: because such models barely change from one bootstrapped replica to another, the resulting ensemble consists of nearly identical classifiers whose predictions are approximately the same (low diversity), so little is gained by combining them.
Voting serves as a technique for consolidating the outputs of multiple classifiers. Three types of voting methods exist: unanimous voting, majority voting, and plurality voting [26]. Unanimous voting takes place when all classifiers agree on the final judgment. Majority voting requires more than half of the votes to agree on the prediction, while plurality voting selects the class that receives the most votes, even when that falls short of an absolute majority. Each voting method provides a distinct approach to aggregating the decisions of individual classifiers in an ensemble.
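As an illustration, the following sketch combines three heterogeneous classifiers under Weka's Vote meta-classifier with a majority-voting combination rule; the choice of base learners here is an assumption for demonstration purposes.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class VoteSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("JM1.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        Vote vote = new Vote();
        vote.setClassifiers(new Classifier[] {
            new NaiveBayes(), new J48(), new Logistic() // heterogeneous base learners
        });
        // The class predicted by the most base learners decides the final label.
        vote.setCombinationRule(new SelectedTag(Vote.MAJORITY_VOTING_RULE, Vote.TAGS_RULES));
        vote.buildClassifier(data);
        System.out.println(vote);
    }
}
```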
Random Forests employ a similar bagging technique to generate random samples of training sets for each random tree, known as bootstrap samples. Each current training batch is derived from the original dataset with modifications [27]. Consequently, each tree is constructed using the current set and a randomly generated collection of attributes, and each node is split using the optimal split on the randomly chosen attributes. Importantly, the trees in a Random Forest are not pruned, preserving their original structure.
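A Random Forest with these characteristics (bootstrap samples, random attribute subsets per split, unpruned trees) can be configured in Weka 3.8 roughly as follows; the tree count and seed are illustrative values.

```java
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("JM1.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest forest = new RandomForest();
        forest.setNumIterations(100); // number of unpruned random trees
        forest.setNumFeatures(0);     // 0 = let Weka pick the attribute-subset size per split
        forest.setSeed(1);            // random number seed, for reproducibility
        forest.buildClassifier(data);
        System.out.println(forest);
    }
}
```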
The upcoming methodology section illustrates the practical implementation of hyperparameter optimization and ensemble learning in the context of software bug prediction.

IV. METHODOLOGY
The classification problem is a common formulation of the supervised learning task, in which the learner is expected to learn (approximate the action of) a function that maps a vector into one of several classes by inspecting several input-output examples of the function [4]. Supervised ML is the process of constructing a classification algorithm that can successfully learn from fresh data.
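Stated formally (standard notation assumed here, not taken from the paper), the learner approximates a mapping from feature vectors to class labels:

```latex
% Supervised classification as function approximation.
% x_i: software-metric vector of module i; y_i: its class label (assumed notation).
\[
  f : \mathbb{R}^{d} \rightarrow \{c_{1}, \ldots, c_{k}\},
  \qquad \text{learned from examples } \{(x_{i}, y_{i})\}_{i=1}^{n}.
\]
```

In software bug prediction the task is binary, so k = 2 (defective or non-defective).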
Figure 1 demonstrates the steps followed in the application of supervised machine learning to address our optimization problem.

FIGURE 1. Research methodological steps.

The following sub-sections provide a description of the dataset utilized in this study, detail the software and tools employed, and present the evaluation metrics used to assess the performance of the implemented approach.
B. HYPERPARAMETER OPTIMIZATION
In this study, the researchers opted for Cross-Validation Parameter Selection (CVParameterSelection), employing a predefined set of hyperparameter values for multiple learning algorithms. CVParameterSelection systematically tries out different hyperparameter values for each learning algorithm through cross-validation, exploring the model's performance under different configurations and ultimately selecting the hyperparameter values that yield the best model performance. Each algorithm has distinct hyperparameters, chosen to maximize accuracy and performance on the software bug prediction dataset. The set of hyperparameters applied across the various algorithms and their tuned values are shown in Table 2. The hyperparameters included:
• Percentage of weight mass to base training on.
• Use resampling for boosting.
• Random number seed.
• The number of iterations.

C. TOOL
In this study, we employed Weka (Waikato Environment for Knowledge Analysis), a renowned Java-based machine learning software suite developed at the University of Waikato in New Zealand. Weka is open-source software distributed for download under the GNU General Public License (GPL), making it freely accessible to users. The Weka workbench encompasses a diverse array of algorithms for data processing, predictive modeling, and visualization, complemented by user-friendly graphical interfaces for efficient access to these functionalities. Weka empowers practitioners to train machine learning models and make predictions without the need for coding, offering a user-friendly graphical user interface with a rich selection of state-of-the-art algorithms. Moreover, it provides practitioners with the flexibility to utilize various hyperparameter optimization approaches, including GridSearch and CVParameterSelection.

Weka offers a diverse array of ML models for data exploration, including pre-installed models such as Support Vector Machine, Naive Bayes, Neural Networks, and Decision Trees. Additionally, the tool provides pre-installed EL models such as Voting, Stacking, and AdaBoost. One notable feature is the built-in tool for testing models while optimizing hyperparameters.
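As a sketch of how that built-in optimizer is driven programmatically, the following code wraps AdaBoostM1 in CVParameterSelection so that one hyperparameter, the number of iterations, is swept over a range and the best value is chosen by internal cross-validation. The parameter range and fold count are illustrative assumptions, not the values from Table 2.

```java
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.CVParameterSelection;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CvParameterSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("JM1.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new AdaBoostM1());
        // Sweep AdaBoostM1's -I option (iterations) from 10 to 100 in 10 steps.
        tuner.addCVParameter("I 10 100 10");
        tuner.setNumFolds(10);       // internal cross-validation folds
        tuner.buildClassifier(data); // evaluates each candidate setting, keeps the best

        System.out.println("Chosen options: "
                + String.join(" ", tuner.getBestClassifierOptions()));
    }
}
```

Because the wrapped model is re-evaluated for every step and every combination of swept parameters, testing time grows quickly with the number of tuned hyperparameters, as noted in the results section.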
TABLE 4. Results of ensemble learning models before and after hyperparameter optimization.
TABLE 5. A comparison of results achieved in our work with those in previous work.
accuracy rates and superior performance in software bug prediction. The top-performing models among the tested ones are EL models, namely Vote, Bagging, Random Forest, and AdaBoostM1, outperforming the single hypothesis learning models. The single hypothesis models, including Logistic Regression, SVM, Neural Networks, Naive Bayes, and Decision Trees, scored lower in terms of accuracy and ROC measures. Notably, the Vote EL model achieved the highest accuracy (0.7189). The ROC measure achieved for the same model was 0.763, which indicated good discrimination between positive and negative instances. Furthermore, it is noteworthy that the ROC curve was consistently higher for the majority of EL models, approaching closer to one compared to single hypothesis learning models like Neural Networks, Naive Bayes, and Decision Trees.

The Kappa statistics also indicated higher values for the majority of EL models, with the exception of the Vote classifier model, which scored the highest in accuracy. In terms of sensitivity, EL models demonstrated an advantage, while the specificity was lower for EL models, with the Vote classifier registering the lowest value in this regard. Additionally, the F-measure scores for EL models surpassed those of single hypothesis learning models, signifying that EL models exhibited higher precision and recall compared to their single hypothesis learning counterparts.
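For reference, the measures discussed in this section (accuracy, ROC area, Kappa, and F-measure) can be computed for any of the tested classifiers with Weka's Evaluation class; the sketch below assumes the JM1 ARFF file and 10-fold cross-validation.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetricsSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("JM1.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of a bagged ensemble (default base learner).
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Bagging(), data, 10, new Random(1));

        System.out.printf("Accuracy:  %.4f%n", eval.pctCorrect() / 100.0);
        System.out.printf("ROC area:  %.3f%n", eval.weightedAreaUnderROC());
        System.out.printf("Kappa:     %.3f%n", eval.kappa());
        System.out.printf("F-measure: %.3f%n", eval.weightedFMeasure());
    }
}
```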
B. RQ2: DOES ADDING HYPERPARAMETER OPTIMIZATION APPROACHES IMPROVE THE PERFORMANCE AND ACCURACY OF THE EL?
To address the second research question, hyperparameter optimization was conducted within the WEKA software, encompassing manual optimization for some hyperparameters and automated optimization using the CVParameterSelection classifier, incorporating steps for each hyperparameter to facilitate automated testing. As a result, testing time increased significantly because the model is tested on each step and combination of the hyperparameters, and the best-performing combination is chosen to calculate the accuracy of the model. Table 3 shows the results of the impact of optimizing hyperparameters for EL models for software bug prediction.

The results presented in Table 4 clearly indicate that tuning hyperparameters for Ensemble Learning (EL) models significantly influences the accuracy and performance of the models in software bug prediction. Across all cases, there was an observable increase in all model measures, demonstrating the positive impact of hyperparameter optimization. Many hyperparameters were adjusted away from their default settings to enhance accuracy and performance. The evaluation of the proposed model highlights that employing EL models with hyperparameter optimization for software bug prediction yields superior performance compared to single hypothesis learning models. Furthermore, an enhancement in the ROC curve is evident in three out of the four EL algorithms following hyperparameter optimization.
C. COMPARISON WITH PREVIOUS WORK
In this section, we conducted a comparative analysis of our findings with previous studies in the field of software bug prediction and hyperparameter optimization. We specifically focused on several EL classifiers that employed the same dataset as ours (JM1). The objective is to check whether our study represents an enhancement over existing research in this domain. The details of the results obtained are listed in Table 5. For AdaBoostM1, our work outperforms previous studies in terms of accuracy (0.8176), ROC (0.797), and F-measure (0.787), demonstrating improvements over the reported values in [17], [31], [32], [33], and [34]. Similar trends are observed for the Bagging algorithm, where our study exhibits higher accuracy (0.8183), ROC (0.769), and F-measure (0.783) compared to the referenced works. Moreover, the Vote algorithm in our work achieves a noteworthy accuracy of 0.8195, ROC of 0.763, and F-measure of 0.774, highlighting significant advancements over the provided references. The Random Forest algorithm also displays improved performance in our study, with an accuracy of 0.8182, ROC of 0.757, and F-measure of 0.796.

Overall, the results suggest that our proposed approach, which incorporates hyperparameter optimization, enhances the predictive capabilities of various classifiers across different metrics, surpassing prior studies in the same domain and dataset.

D. THREATS TO VALIDITY
Several potential threats to the validity of this study have to be acknowledged. Sampling bias may restrict the generalizability of results to datasets with different characteristics. The imbalanced distribution of defective instances (19.32%) could introduce bias, impacting relevance to diverse class distributions. Duplicated instances in the JM1 dataset (24.16%) may affect the model's learning process, potentially overestimating predictive capabilities. The effectiveness of hyperparameter optimization may be algorithm-dependent, limiting generalizability. Additionally, exploring different evaluation metrics to assess and compare classifiers could provide a more comprehensive understanding of model performance. Acknowledging and addressing these challenges is crucial for enhancing the transparency of our results.

VI. THEORETICAL AND PRACTICAL IMPLICATIONS
This study contributes theoretically by advancing our understanding of predictive modeling in software bug detection. By comparing ensemble learning (EL) models against single learning models, the research sheds light on the superiority of EL models in predicting software bugs. This analysis underscores the potential of EL models as a more effective approach in such tasks. Furthermore, the investigation into the impact of hyperparameter optimization on EL model performance offers insights into the importance of fine-tuning model parameters, enriching our theoretical understanding of machine learning model optimization.

On a practical level, the findings hold significance for software developers and quality assurance professionals. Demonstrating the superiority of EL models, particularly when optimized with hyperparameters, in predicting software bugs offers tangible benefits for software development processes. Implementing EL models with optimized hyperparameters can lead to more accurate bug prediction, ultimately improving software quality and reducing development costs. Additionally, the study provides practical guidance for selecting appropriate machine learning models for similar predictive tasks, aiding practitioners in making informed decisions. Moreover, the introduction of various model evaluation measures such as accuracy, ROC curve, Kappa statistics, sensitivity, specificity, and F-measure gives practitioners a more complete picture of model performance.
[28] A. E. Mohamed, "Comparative study of four supervised ML techniques for classification," Int. J. Appl. Sci. Technol., vol. 7, no. 2, pp. 1–15, 2017.
[29] M. G. Al-Obeidallah, D. G. Al-Fraihat, A. M. Khasawneh, A. M. Saleh, and H. Addous, "Empirical investigation of the impact of the adapter design pattern on software maintainability," in Proc. Int. Conf. Inf. Technol. (ICIT), Jul. 2021, pp. 206–211.
[30] A.-R. Al-Ghuwairi, D. Al-Fraihat, Y. Sharrab, H. Alrashidi, N. Almujally, A. Kittaneh, and A. Ali, "Visualizing software refactoring using radar charts," Sci. Rep., vol. 13, no. 1, p. 19530, Nov. 2023, doi: 10.1038/s41598-023-44281-6.
[31] H. Alsawalqah, H. Faris, I. Aljarah, L. Alnemer, and N. Alhindawi, "Hybrid SMOTE-ensemble approach for software defect prediction," in Software Engineering Trends and Techniques in Intelligent Systems. Cham, Switzerland: Springer, 2017, pp. 355–366.
[32] S. A. El-Shorbagy, W. M. El-Gammal, and W. M. Abdelmoez, "Using SMOTE and heterogeneous stacking in ensemble learning for software defect prediction," in Proc. 7th Int. Conf. Softw. Inf. Eng., May 2018, pp. 44–47.
[33] D. Al-Fraihat, Y. Sharrab, F. Alzyoud, A. Qahmash, M. Tarawneh, and A. Maaita, "Speech recognition utilizing deep learning: A systematic review of the latest developments," Hum.-Centric Comput. Inf. Sci., vol. 14, pp. 1–34, Mar. 2024.
[34] R. Li, L. Zhou, S. Zhang, H. Liu, X. Huang, and Z. Sun, "Software defect prediction based on ensemble learning," in Proc. 2nd Int. Conf. Data Sci. Inf. Technol., Jul. 2019, pp. 1–6.

ABDEL-RAHMAN AL-GHUWAIRI received the Ph.D. degree in software engineering from New Mexico State University, Las Cruces, NM, USA, in 2013. He is currently an Associate Professor with the Department of Software Engineering, Hashemite University, Jordan. His research interests include software engineering, cloud computing, requirements engineering, information retrieval, big data, and database systems.

HAMZEH ALSHISHANI received the M.Sc. degree in software engineering from Hashemite University, Jordan. He is currently a Lecturer and a Researcher with the Department of Software Engineering, Hashemite University. His research interests include software engineering, cloud computing, and requirements engineering.

DIMAH AL-FRAIHAT received the Ph.D. degree in software engineering from the University of Warwick, U.K. She is currently an Assistant Professor with the Faculty of Information Technology, Isra University, Jordan. Her research interests include software engineering, refactoring, design patterns, software testing, requirements engineering, documentation, computer-based applications, technology enhanced learning, and deep learning.