
Received 31 January 2024, accepted 17 March 2024, date of publication 21 March 2024, date of current version 16 April 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3380024

Hyperparameter Optimization for Software Bug Prediction Using Ensemble Learning
DIMAH AL-FRAIHAT 1, YOUSEF SHARRAB 2, ABDEL-RAHMAN AL-GHUWAIRI 3, HAMZEH ALSHISHANI 3, AND ABDULMOHSEN ALGARNI 4
1 Department of Software Engineering, Faculty of Information Technology, Isra University, Amman 11622, Jordan
2 Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Isra University, Amman 11622, Jordan
3 Department of Software Engineering, Faculty of Prince Al-Hussien Bin Abdullah for IT, Hashemite University, Zarqa 13133, Jordan
4 Department of Computer Science, King Khalid University, Abha 61421, Saudi Arabia

Corresponding author: Dimah Al-Fraihat ([email protected])


This work was supported by the Deanship of Scientific Research, King Khalid University, under Grant R.G.P.2/93/45.

ABSTRACT Software Bug Prediction (SBP) is an integral process to the software's success that involves predicting software bugs before their occurrence. Detecting software bugs early in the development process enhances software quality and performance and reduces software costs. The integration of Machine Learning (ML) algorithms has significantly improved software bug prediction accuracy and concurrently reduced costs and resource utilization. Numerous studies have explored the impact of Hyperparameter Optimization on single classifiers, enhancing these models' overall performance in SBP analysis. Ensemble Learning (EL) approaches have also demonstrated increased model accuracy and performance on SBP datasets. This study proposes a novel learning model for predicting software bugs through the utilization of EL and tuned hyperparameters. The results are compared with single hypothesis learning models using the WEKA software. The dataset, collected by the National Aeronautics and Space Administration (NASA), U.S.A., comprises 10,885 instances with 20 attributes, including a classifier for defects in one of their coding projects. The findings indicate that EL models outperform single hypothesis learning models and that the proposed model's accuracy increases after hyperparameter optimization. These results underscore the efficacy of ensemble learning, coupled with hyperparameter optimization, as a viable approach for enhancing the predictive capabilities of software bug prediction models.

INDEX TERMS Software bug prediction, machine learning, hyperparameter optimization, ensemble learning.

I. INTRODUCTION
Software bugs increase the cost of the Software Development Life Cycle (SDLC), as the expense of software development rises proportionately with the time of discovery [1]. This impact is particularly significant if bugs are identified after the software release, beginning to affect the end-user experience. Therefore, it is always best to find bugs as soon as they occur. Bugs also influence software quality and reliability, prompting many software companies to research methods for identifying bugs as soon as they occur. This research consumes resources and contributes to delays in release dates. All of these considerations have led to the development of Software Bug Prediction (SBP) systems, which assist software development organizations in automating the identification of bugs before they occur, thereby preventing the wastage of resources.

Predicting the emergence of bugs and understanding the Software Development Life Cycle (SDLC) process flow are critical factors that contribute to performance criteria [2]. The application of ML algorithms for defect prediction is versatile, as it can be employed across various stages of the SDLC, including problem detection, planning, design, building, testing, deployment, and maintenance. This application extends to multiple SDLC models, such as the Waterfall Model, Iterative Model, Agile Model, V-Shaped Model, Spiral Model, and Big Bang Model.

ML algorithms prove effective in determining the presence of bugs in a given piece of code. Enhanced programming methods can significantly elevate the overall quality of the program. ML represents one of the most rapidly expanding fields of computer science, with far-reaching applications. It involves the automatic identification of meaningful patterns in records. The primary goal of ML tools is to provide algorithms with the opportunity to learn and adapt [3], [4].

With the rise of ML algorithms in the past couple of years, many companies have utilized ML to find and predict software bugs before they occur. ML algorithms help organizations make future predictions in many fields by combining statistical and mathematical models with historical data. There are two types of ML algorithms: supervised learning and unsupervised learning. In supervised learning, the algorithm's input and output are based on historical data related to the study, which is used to train the algorithms and find patterns; these patterns are then used to predict future outputs. In unsupervised learning, the learning algorithms process data to find patterns in unlabeled data. Early bug discovery and prediction are critical exercises in the software development life cycle [5]. They contribute to a higher satisfaction rate among users and enhance the overall performance and quality of the software project.

Defect prediction models can guide testing efforts towards code that is susceptible to bugs. Before software delivery to the customer, latent bugs in the code may surface, and once identified, these flaws can be rectified prior to delivery at a fraction of the cost compared to post-delivery repairs [6]. Code defects incur significant expenses for companies each year, with billions of dollars spent on identification and correction. Models that accurately pinpoint the locations of bugs in code have the potential to save businesses substantial amounts of money. Given the high costs involved, even minor improvements in detecting and repairing flaws can have a significant impact on total costs.

Many studies have been conducted on bug detection using ML algorithms. Some of these studies have utilized EL, which enables researchers and practitioners to employ more than one machine learning algorithm. By employing different ML algorithms with diverse methodologies, EL allows for the aggregation of votes from multiple ML algorithms used together, enhancing the ability to make more accurate predictions. Further, previous research studies have examined different types of algorithms to predict software bugs. Although some of these proposed algorithms showed higher accuracy rates, most of the cited bug prediction studies for software projects left their settings at default values [7].

Hyperparameter optimization can enhance accuracy and improve the performance of ML models. The optimization of hyperparameters in Bayesian, SVM, and KNN models has demonstrated an increase in the accuracy of bug prediction models, as observed by Osman et al. [8] and Levesque et al. [9]. Hyperparameter optimizers are utilized to fine-tune the control parameters of data mining algorithms. It is widely recognized that such tuning enhances classification tasks, including software bug prediction and text classification [10].

To enhance the effectiveness of software bug prediction, this research addresses two key questions. Firstly, it explores whether Ensemble Learning (EL) models outperform single learning models in predicting software bugs. Secondly, the study investigates the impact of adding Hyperparameter Optimization approaches on EL model performance, examining whether this enhancement leads to improved accuracy. These questions serve as the foundation of our exploration, guiding us to understand the nuances of predictive modeling in software bug detection.

The remainder of the paper is organized as follows: Section II presents related work; Section III covers the background of the study, which includes hyperparameter optimization and ensemble learning; Section IV outlines the methodology, encompassing the dataset, tools used, research questions, and evaluation measures; Section V presents the results and discussion; Section VI discusses the theoretical and practical implications; and Section VII provides the conclusion and outlines future work.

II. RELATED WORK
This section navigates through recent contributions in the dynamic field of software bug prediction by exploring various methodologies and approaches used by previous researchers. These studies enrich our understanding of effective bug prediction models and offer diverse insights and methodologies to address the challenges posed by software defects.

In the study of Hammouri et al. [5], the researchers conducted a thorough examination of three supervised ML models (Naive Bayes, Decision Tree, and Artificial Neural Networks) specifically tailored for predicting software bugs. Their research involved a meticulous comparison with two previously proposed models, revealing that the three models exhibited a significantly higher accuracy rate than the two earlier models.

Di Nucci et al. [11] undertook an innovative approach, quantifying the dispersal of modifications made by developers working on a component, and utilized this data to construct a bug prediction model. This study, based on prior research that examined the human factor in bug generation, demonstrates that non-focused developers are more prone to introducing defects compared to their focused counterparts. Notably, the model exhibited superiority when compared to four competitive techniques.

Khan et al. [12] introduced a model for predicting software bugs, integrating ML classifiers with Artificial Immune Networks (AIN) to optimize hyperparameters for enhanced bug prediction accuracy. Employing seven ML algorithms on a bug prediction dataset, the study revealed that hyperparameter tuning of ML classifiers using AIN surpassed the performance of classifiers with default hyperparameters in the context of software bug prediction.


Yang et al. [13] proposed a Two-Layer Ensemble Learning (TLEL) technique, incorporating both decision trees and ensemble learning to enhance just-in-time software defect prediction. The proposed model demonstrated a substantial improvement over existing state-of-the-art ML models. Remarkably, by reviewing only 20% of the code lines, TLEL successfully identified over 70% of the bugs.

Additionally, Ensemble Learning (EL) not only enhances the accuracy of ML models but also improves their overall performance. Balogun et al. [14] scrutinized the performance of individual ML classifiers compared to EL, incorporating the Analytic Network Process (ANP) in their evaluation. The results demonstrated that EL models achieved a higher priority level than all single classifier methods across 11 software defect datasets.

In a study conducted by Qiu et al. [15], 15 imbalanced EL methods were examined across 31 open-source projects. The findings indicated that imbalanced EL models that combined both under-sampling and bagging approaches demonstrated a higher effectiveness rate.

Abdou and Darwish [16] evaluated EL models and introduced the resample technique with Boosting, Bagging, and Rotation Forest. The study concluded that accuracy improved more with EL approaches than with single classifiers. The researchers also noticed that accuracy and performance improved when Multilayer Perceptron and partial decision trees were used with Bagging. The study does not recommend using Support Vector Machine or Logistic Regression as a single classifier with Boosting, Bagging, and Rotation Forest.

Furthermore, Pandey et al. [17] suggested a rudimentary classification-based method for SBP using EL techniques and deep representation. The approach encompasses both EL and deep representation in combination. In comparison to EL and other state-of-the-art techniques, the proposed model exhibited superior performance across the majority of the datasets.

In summary, the reviewed literature has provided a diverse range of methodologies for software bug prediction, encompassing supervised machine learning, ensemble learning, and deep representation. The comparative analyses underscore the effectiveness of EL models and the importance of hyperparameter optimization in bug prediction. As we transition to the background section, these insights form a solid foundation for exploring the key components of Hyperparameter Optimization and Ensemble Learning in software bug prediction.

III. BACKGROUND
In this section, two crucial components integral to this paper's methodology are examined: Hyperparameter Optimization and Ensemble Learning (EL). This exploration aims to provide a detailed understanding of the roles and significance of EL and Hyperparameter Optimization in the context of the research presented in this study.

A. HYPERPARAMETER OPTIMIZATION
In the field of machine learning, the term "optimization" is employed to describe the process of fine-tuning hyperparameters. This involves adjusting parameters such as regularization, kernels, and learning intensity [18]. Hyperparameters play a critical role in enhancing the accuracy of ML classifiers, influencing them throughout the phases of learning, construction, and evaluation. The terms "model discovery" and "hyperparameter tuning" encapsulate the process of identifying the optimal hyperparameters for ML classifiers [12]. As each algorithm comes with a set of parameters, many of which are optimizable, adjusting these parameters becomes pivotal. This optimization aims to achieve a higher model score and enhance the overall performance of the classifier by selecting the optimal configuration for its parameters.

Tuning machine learning models exemplifies an optimization challenge, wherein a diverse set of hyperparameters needs to be fine-tuned to identify the combination of values that minimizes the model's loss or maximizes its accuracy. One method employed for hyperparameter optimization is manual search. In this approach, practitioners rely on their expertise to select hyperparameters, followed by model training, accuracy assessment, and iterative refinement. This cyclic process continues until the model's performance and accuracy align with the specified requirements.

Grid search, also referred to as parameter sweep, constitutes an exhaustive approach to hyperparameter optimization, involving a systematic exploration of a predefined subset of a learning algorithm's hyperparameter space. This algorithmic process is guided by an output criterion, often determined through cross-validation on the training set or assessment on a held-out validation set. Cross-validation stands out as one of the most prevalent methods for selecting tuning parameters, identifying the parameter value associated with the lowest cross-validated ranking [19]. Cross-validation parameter selection accommodates the optimization of a large number of parameters, allowing researchers to choose a combination of hyperparameter values that the model tests and refines iteratively. Upon completion of the model build, the hyperparameters yielding the highest accuracy score are selected.
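To make the grid-style, cross-validated selection described above concrete, the following minimal sketch scores a small grid of candidate values for one hyperparameter with 10-fold cross-validation in the Weka Java API and keeps the best-performing value. The dataset path, the base classifier (a J48 decision tree), and the searched pruning-confidence values are illustrative assumptions for the example, not settings taken from this study.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GridSearchSketch {
    public static void main(String[] args) throws Exception {
        // Load the data set (path is a placeholder) and mark the last attribute as the class.
        Instances data = new DataSource("JM1.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        double[] confidenceGrid = {0.1, 0.25, 0.5};   // candidate values for one hyperparameter
        double bestAccuracy = -1.0;
        double bestValue = Double.NaN;

        for (double c : confidenceGrid) {
            J48 tree = new J48();
            tree.setConfidenceFactor((float) c);       // hyperparameter under test

            // Score this configuration with 10-fold cross-validation.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));

            if (eval.pctCorrect() > bestAccuracy) {    // keep the configuration with the best CV accuracy
                bestAccuracy = eval.pctCorrect();
                bestValue = c;
            }
        }
        System.out.printf("Best confidence factor: %.2f (accuracy %.2f%%)%n", bestValue, bestAccuracy);
    }
}
```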


B. ENSEMBLE LEARNING
EL algorithms differ from single hypothesis learning algorithms in their unique approach. Rather than relying on a singular, superior explanation for interpreting results, EL algorithms generate a series of hypotheses and use them collectively to determine the labeling of new data points in a diverse manner [20]. Experimental data indicates that ensemble approaches are notably more reliable than individual theories. In a study by Freund et al. [21], efficiency improved in 22 benchmark questions, remained equal in one, and exhibited a decrease in performance in four instances. This highlights the robustness and effectiveness of EL in diverse scenarios.

An ensemble consists of a collection of learners referred to as base learners. The generalization ability of an ensemble typically surpasses that of individual base learners. Ensemble Learning (EL) is particularly appealing because it has the capability to elevate slow learners, which are only marginally better than random guesses, to proficient learners capable of making highly accurate predictions [22]. Consequently, the term "weak learners" is often synonymous with "base learners." It is noteworthy, however, that while theoretical studies primarily focus on weak learners, the base learners employed in practical applications are not exclusively weak. Utilizing base learners that are less weak also contributes to superior outcomes.

Among the EL methods, commonly utilized ones include Bagging, AdaBoost, Stacking, and Voting. Bagging, short for bootstrap aggregation, stands out as one of the oldest, most fundamental, and perhaps simplest ensemble-based algorithms, demonstrating remarkably high performance [23]. In bagging, diversity of classifiers is achieved by creating bootstrapped replicas of the training data [24]. This involves randomly selecting separate subsets of training data from the entire training dataset, with replacement. Each subset of training data is then employed to train a specific classifier. The classifiers are subsequently aggregated using a majority voting mechanism, where the ensemble decision is determined by the class selected by the majority of classifiers for a given case.

AdaBoost operates by training models in successive rounds, with a new model trained in each iteration. At the end of each round, misclassified examples are identified, and their significance is amplified in a new training sequence. This updated training data is then fed back into the beginning of the subsequent round, where a new model is prepared [25]. The underlying theory is that subsequent iterations should compensate for mistakes made by previous models, leading to an overall improvement in the ensemble's performance.

Bagging is particularly effective for unpredictable models that exhibit varying generalization behavior with slight changes in the training data [25]. These models, often termed high variance models, encompass examples like Decision Trees and Neural Networks. However, bagging may not be suitable for extremely simple models. In essence, bagging randomly selects from the set of possible models to create an ensemble of relatively simple models, resulting in predictions that are approximately identical (low diversity).

Voting serves as a technique for consolidating the outputs of multiple classifiers. Three types of voting methods exist: unanimous voting, majority voting, and plurality voting [26]. Unanimous voting takes place when all classifiers unanimously agree on the final judgment. Majority voting occurs when more than half of the votes contribute to the evaluation, while plurality voting happens when the class receiving the most votes, even without a majority, determines the final prediction. Each voting method provides a distinct approach to aggregating the decisions of individual classifiers in an ensemble.

Random Forests employs a similar bagging technique to generate random samples of training sets for each Random Tree, known as bootstrap samples. Each current training batch is derived from the original dataset with modifications [27]. Consequently, the tree is constructed using the current set and a randomly generated collection of attributes. The node is divided using the optimal split on the randomly chosen properties. Importantly, established trees in Random Forests are not trimmed, preserving their original structure.

The upcoming methodology section aims to illustrate the practical implementation of Hyperparameter Optimization and Ensemble Learning in the context of software bug prediction.

IV. METHODOLOGY
The classification problem is a common formulation of the supervised learning task in which the learner is expected to learn (approximate the action of) a function that maps a vector into one of several classes by inspecting several input-output examples of the function [4]. Supervised ML is the process of constructing a classification algorithm that can successfully learn from fresh data. Figure 1 demonstrates the steps followed in the application of supervised machine learning to address our optimization problem.

FIGURE 1. Research methodological steps.

The following sub-sections provide a description of the dataset utilized in this study, detail the software and tools employed, and present the evaluation metrics used to assess the performance of the implemented approach.
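As an illustration of the workflow summarized in Figure 1 and of the ensemble methods described above, the sketch below assembles Bagging, AdaBoostM1, and Vote ensembles in the Weka Java API and estimates their accuracy with 10-fold cross-validation. The file name and the choice of base learners are assumptions made for the example; they are not the exact configurations evaluated later in this paper.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EnsembleWorkflowSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: load the defect data set (file name is a placeholder) and set the class attribute.
        Instances data = new DataSource("JM1.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Step 2: assemble the ensemble learners described above.
        Bagging bagging = new Bagging();               // bootstrap-aggregated decision trees
        bagging.setClassifier(new J48());

        AdaBoostM1 boosting = new AdaBoostM1();        // boosting with re-weighted training rounds
        boosting.setClassifier(new J48());

        Vote vote = new Vote();                        // combines the votes of heterogeneous base learners
        vote.setClassifiers(new Classifier[] {new J48(), new NaiveBayes(), bagging});

        // Step 3: estimate performance with 10-fold cross-validation.
        for (Classifier model : new Classifier[] {bagging, boosting, vote}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%s accuracy: %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```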


A. DATASET
In this study, we examined a dataset obtained from the National Aeronautics and Space Administration (NASA), widely recognized as a prevalent resource in the field of software bug prediction. The JM1 dataset was selected from the NASA repository, recognized as one of the largest repositories, with 19% of its data indicating defects. The dataset encompasses 20 attributes and comprises 10,885 instances related to software bugs. Table 1 presents the list of attributes within the dataset, accompanied by concise descriptions of each field and the respective data types stored in each.

TABLE 1. List of dataset attributes.

B. HYPERPARAMETER OPTIMIZATION
In this study, the researchers opted for Cross-Validation Parameter Selection (CVParameterSelection), employing a predefined set of hyperparameter values for multiple learning algorithms. CVParameterSelection involves systematically trying out different hyperparameter values for multiple learning algorithms through cross-validation to explore the model's performance under different configurations, ultimately selecting the hyperparameter values that result in the best model performance. Each algorithm has distinct hyperparameters chosen to maximize accuracy and performance on the software bug prediction dataset. The set of hyperparameters applied across various algorithms and their tuned values are shown in Table 2. The hyperparameters included:
• Percentage of weight mass to base training on.
• Use resampling for boosting.
• Random number seed.
• The number of iterations.
• Size of each bag, as a percentage of the training set size.
• Cross-validation folds.

TABLE 2. Hyperparameter optimization results for classifiers.

C. TOOL
In this study, we employed Weka (Waikato Environment for Knowledge Analysis), a renowned Java-based machine learning software suite developed at the University of Waikato in New Zealand. Weka is open-source software distributed for download under the GNU General Public License (GPL), making it freely accessible for users. The Weka workbench encompasses a diverse array of algorithms for data processing, predictive modeling, and visualization, complemented by user-friendly graphical interfaces for efficient access to these functionalities. Weka empowers practitioners to train and apply machine learning models without the need for coding, offering a user-friendly graphical user interface with a rich selection of state-of-the-art algorithms. Moreover, it provides practitioners with the flexibility to utilize various hyperparameter optimization approaches, including GridSearch and CVParameterSelection.

Weka offers a diverse array of ML models for data exploration, including pre-installed models like Support Vector Machine, Naive Bayes, Neural Networks, and Decision Trees. Additionally, the tool provides pre-installed EL models such as Voting, Stacking, and AdaBoost. One notable feature is the built-in tool for testing models while optimizing hyperparameters. This tool allows users to set minimum and maximum values for each parameter, along with the step of increment, streamlining the process of fine-tuning models for optimal performance.
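The tuning facility described above corresponds to Weka's CVParameterSelection meta-classifier, which can also be driven programmatically. The sketch below is a minimal illustration of wrapping AdaBoostM1 and searching two of the listed hyperparameters over a minimum, a maximum, and a number of steps; the specific ranges and step counts are assumptions for the example, not the values reported in Table 2.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.CVParameterSelection;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CVParameterSelectionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("JM1.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Wrap AdaBoostM1 in CVParameterSelection so its hyperparameters are tuned by cross-validation.
        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new AdaBoostM1());
        tuner.setNumFolds(10);                         // cross-validation folds used during the search
        // Each string gives: option letter, minimum, maximum, number of steps (illustrative ranges).
        tuner.addCVParameter("P 50 100 6");            // percentage of weight mass to base training on
        tuner.addCVParameter("I 10 100 10");           // number of boosting iterations

        tuner.buildClassifier(data);
        System.out.println("Selected options: " + String.join(" ", tuner.getBestClassifierOptions()));

        // Evaluate the tuned ensemble with a further 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tuner, data, 10, new Random(1));
        System.out.printf("Accuracy after tuning: %.2f%%%n", eval.pctCorrect());
    }
}
```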


The tool also incorporates numerous evaluation measures for assessing machine learning models, encompassing but not limited to Accuracy, Kappa statistics, Mean Absolute Error, Precision, F-Measure, and ROC Area. The inclusion of a Confusion Matrix proves beneficial for machine learning practitioners, as it readily provides Specificity and Sensitivity values. Notably, all these functionalities are accessible through the graphical user interface (GUI) of the tool, eliminating the need for coding. This user-friendly approach proves particularly advantageous for practitioners without a background in Software Engineering, enabling them to effortlessly train and evaluate machine learning models without writing a single line of code. However, a significant drawback of employing Weka is the occasional extended time required for model evaluation. The duration is contingent on various factors, including the specific model used, the optimized hyperparameters, the number of features, and the volume of instances in the data undergoing exploration.

D. EVALUATION MEASURES
The efficiency and accuracy of the ML algorithms were assessed using a variety of metrics. Recognizing the complexity of assessing algorithm success with a single metric, we employed five: accuracy, Receiver Operating Characteristic (ROC), Kappa, Specificity, and Sensitivity.

1) ACCURACY
This metric refers to how accurately the algorithm classifies the production data. Accuracy is calculated by dividing the number of correctly categorized records by the total number of records. As a result, the better the classifier is, the higher the accuracy.

2) ROC
Using accuracy alone is not the best way to test an algorithm's performance; it is advised to use more than one measurement metric to get the whole story behind a dataset. Therefore, we can use the ROC curve to get a bigger picture. The ROC chart can be used to visually compare and contrast the False Positive (FP) rate, which reflects specificity, and the True Positive (TP) rate, which reflects sensitivity. The specificity is represented on the x axis, while the sensitivity is represented on the y axis. The curve should come as close to the upper-left corner as possible. Figure 2 shows the ROC curve and the relationship between sensitivity and specificity.

FIGURE 2. ROC curve.

3) KAPPA
Kappa measures an algorithm's success by comparing it to a classifier that makes predictions based on guessing at random. The higher and closer the Kappa is to one, the better the algorithm.

4) SENSITIVITY
Sensitivity is described as the proportion of positive examples that are correctly predicted. It is measured by dividing the number of TP instances by the sum of the TP and FN values [28]:

TPR = TP / (TP + FN)    (1)

5) SPECIFICITY
Specificity is the ratio of the negative examples correctly predicted. It is calculated by dividing TN by the sum of FP and TN [29]:

TNR = TN / (TN + FP)    (2)

6) F-MEASURE
F-measure is a commonly used metric for evaluating the success of classification techniques. To quantify this value, two measures called "Precision" and "Recall" must be computed, since the F-measure is the harmonic mean of these two measures [29], [30], as shown below:

F-measure = (2 × Precision × Recall) / (Precision + Recall)    (3)

V. RESULTS AND DISCUSSION
To address the research questions, we conducted a study to compare the results between single hypothesis ML models and EL models. The outcomes of these models were then thoroughly analyzed and compared. The evaluation involved assessing four EL models against five single hypothesis learning models. Table 3 shows the results of all nine tests based on the highest accuracy and performance of the ML models. The WEKA program was instrumental in running these models, employing a 10-fold cross-validation method to iteratively divide instances for training and testing ten times.
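As a sketch of how such a 10-fold cross-validation run and the measures defined above can be obtained programmatically from Weka's Evaluation class (the GUI reports the same values), consider the example below. The dataset path, the choice of Bagging as the model, and the assumption that the defective class is the label at index 1 are illustrative only.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluationMetricsSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("JM1.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of one ensemble model (Bagging used here for illustration).
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Bagging(), data, 10, new Random(1));

        int defective = 1;  // assumed index of the "defective" class label
        System.out.printf("Accuracy    : %.2f%%%n", eval.pctCorrect());
        System.out.printf("Kappa       : %.3f%n", eval.kappa());
        System.out.printf("ROC area    : %.3f%n", eval.weightedAreaUnderROC());
        System.out.printf("Sensitivity : %.3f%n", eval.truePositiveRate(defective));   // TP / (TP + FN)
        System.out.printf("Specificity : %.3f%n", eval.trueNegativeRate(defective));   // TN / (TN + FP)
        System.out.printf("F-measure   : %.3f%n", eval.fMeasure(defective));
        System.out.println(eval.toMatrixString("Confusion matrix"));
    }
}
```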


TABLE 3. Performance metrics of machine learning algorithms.

TABLE 4. Results of ensemble learning models before and after hyperparameter optimization.

TABLE 5. A comparison of results achieved in our work with those in previous work.

A. RQ1: DO EL MODELS SHOW BETTER PERFORMANCE COMPARED TO SINGLE LEARNING MODELS IN SOFTWARE BUG PREDICTION PROBLEMS?
According to the conducted study and the results presented in Table 3, it is evident that EL models exhibit higher accuracy rates and superior performance in software bug prediction. The top-performing models among the tested ones are EL models, namely Vote, Bagging, Random Forest, and AdaBoostM1, outperforming the single hypothesis learning models. The single hypothesis models, including Logistic Regression, SVM, Neural Networks, Naive Bayes, and Decision Trees, scored lower in terms of accuracy and ROC measures. Notably, the Vote EL model achieved the highest accuracy (0.7189). The ROC measure achieved for the same model was 0.763, which indicated good discrimination between positive and negative instances.

Furthermore, it is noteworthy that the ROC curve was consistently higher for the majority of EL models, approaching closer to one compared to single hypothesis learning models like Neural Networks, Naive Bayes, and Decision Trees.


The Kappa statistics also indicated higher values for the majority of EL models, with the exception of the Vote classifier model, which scored the highest in accuracy. In terms of sensitivity, EL models demonstrated an advantage, while the specificity was lower for EL models, with the Vote classifier registering the lowest value in this regard. Additionally, the F-measure scores for EL models surpassed those of single hypothesis learning models, signifying that EL models exhibited higher precision and recall compared to their single hypothesis learning counterparts.

B. RQ2: DOES ADDING HYPERPARAMETER OPTIMIZATION APPROACHES IMPROVE THE PERFORMANCE AND ACCURACY OF THE EL?
To address the second research question, hyperparameter optimization was conducted within the WEKA software, encompassing manual optimization for some hyperparameters and automated optimization using the CVParameterSelection classifier, incorporating steps for each hyperparameter to facilitate automated testing. As a result, testing time increased significantly because the model is tested on each step and combination of the hyperparameters, and the highest-scoring combination is chosen to calculate the accuracy of the model. Table 4 shows the results of the impact of optimizing hyperparameters for EL models for software bug prediction.

The results presented in Table 4 clearly indicate that tuning hyperparameters for Ensemble Learning (EL) models significantly influences the accuracy and performance of the models in software bug prediction. Across all cases, there was an observable increase in all model measures, demonstrating the positive impact of hyperparameter optimization. Many hyperparameters were adjusted to enhance accuracy and performance, deviating from default settings. The evaluation of the proposed model highlights that employing EL models with hyperparameter optimization for software bug prediction yields superior performance compared to single hypothesis learning models. Furthermore, an enhancement in the ROC Curve is evident in three out of the four EL algorithms following hyperparameter optimization.

C. COMPARISON WITH PREVIOUS WORK
In this section, we conducted a comparative analysis of our findings with previous studies in the field of software bug prediction and hyperparameter optimization. We specifically focused on several EL classifiers that employed the same dataset as ours (JM1). The objective is to check whether our study represents an enhancement over existing research in this domain. The details of the results obtained are listed in Table 5. For AdaBoostM1, our work outperforms previous studies in terms of accuracy (0.8176), ROC (0.797), and F-measure (0.787), demonstrating improvements over the reported values in [17], [31], [32], [33], and [34]. Similar trends are observed for the Bagging algorithm, where our study exhibits higher accuracy (0.8183), ROC (0.769), and F-measure (0.783) compared to the referenced works. Moreover, the Vote algorithm in our work achieves a noteworthy accuracy of 0.8195, ROC of 0.763, and F-measure of 0.774, highlighting significant advancements over the provided references. The Random Forest algorithm also displays improved performance in our study, with an accuracy of 0.8182, ROC of 0.757, and F-measure of 0.796.

Overall, the results suggest that our proposed approach, which incorporates hyperparameter optimization, enhances the predictive capabilities of various classifiers across different metrics, surpassing prior studies in the same domain and dataset.

D. THREATS TO VALIDITY
Several potential threats to the validity of this study have to be acknowledged. Sampling bias may restrict the generalizability of results to datasets with different characteristics. The imbalanced distribution of defective instances (19.32%) could introduce bias, impacting relevance to diverse class distributions. Duplicated instances in the JM1 dataset (24.16%) may affect the model's learning process, potentially overestimating predictive capabilities. The effectiveness of hyperparameter optimization may be algorithm-dependent, limiting generalizability. Additionally, exploring different evaluation metrics to assess and compare classifiers could provide a more comprehensive understanding of model performance. Acknowledging and addressing these challenges is crucial for enhancing the transparency of our results.

VI. THEORETICAL AND PRACTICAL IMPLICATIONS
This study contributes theoretically by advancing our understanding of predictive modeling in software bug detection. By comparing ensemble learning (EL) models against single learning models, the research sheds light on the superiority of EL models in predicting software bugs. This analysis underscores the potential of EL models as a more effective approach in such tasks. Furthermore, the investigation into the impact of hyperparameter optimization on EL model performance offers insights into the importance of fine-tuning model parameters, enriching our theoretical understanding of machine learning model optimization.

On a practical level, the findings hold significance for software developers and quality assurance professionals. Demonstrating the superiority of EL models, particularly when optimized with hyperparameters, in predicting software bugs offers tangible benefits for software development processes. Implementing EL models with optimized hyperparameters can lead to more accurate bug prediction, ultimately improving software quality and reducing development costs. Additionally, the study provides practical guidance for selecting appropriate machine learning models for similar predictive tasks, aiding practitioners in making informed decisions.

Moreover, the introduction of various model evaluation measures such as accuracy, ROC Curve, Kappa statistics, Sensitivity, Specificity, and F-measure serves to enhance performance evaluation practices.


These measures offer valuable tools for assessing the performance of machine learning models in software bug prediction, contributing to the development of standardized evaluation practices in the field. Such standardization enables more reliable performance comparisons across different studies and experiments, fostering advancements in software bug prediction methodologies.

VII. CONCLUSION AND FUTURE WORK
This study was conducted to enhance the effectiveness of software bug prediction by addressing two key questions. Firstly, we explored whether Ensemble Learning (EL) models outperform single learning models in predicting software bugs. Secondly, we investigated the impact of adding Hyperparameter Optimization approaches on EL model performance, examining whether this enhancement leads to improved accuracy. These questions served as the foundation of our exploration, guiding us to understand the nuances of predictive modeling in software bug detection. Our findings unequivocally demonstrate the superiority of EL models, especially those optimized with hyperparameters, over single hypothesis learning models in software bug prediction. Across various model measures, including accuracy, ROC Curve, Kappa statistics, Sensitivity, Specificity, and F-measure, EL models consistently outperformed their counterparts. The positive impact of hyperparameter tuning further reinforced the efficacy of EL models, contributing to enhanced accuracy and overall performance.

Nevertheless, recognizing a limitation in our study is crucial due to the reliance on a specific dataset from the NASA repository. To further validate the effectiveness of our proposed method, future work should consider incorporating more diverse datasets. This expansion will enhance the generalizability of our findings and ensure the robustness of our approach across various software development contexts. Additionally, we plan to extend our investigation by applying EL models with hyperparameter optimization to cross-project observations for software bug prediction. This aims to assess the generalizability of the proposed model and replicate its outcomes in diverse software bug prediction projects. Such cross-project validation will provide valuable insights into the robustness and applicability of our approach across different software development contexts. Finally, we plan to incorporate statistical tests to enhance the methodological approach, including tests such as the t-test and Wilcoxon, to compare the performance of classifiers. This addition will provide a more comprehensive assessment of the robustness and applicability of our approach in diverse software development contexts.

CONFLICT OF INTERESTS
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The data presented in this study are available in the article.

REFERENCES
[1] R. Ferenc, D. Bán, T. Grósz, and T. Gyimóthy, "Deep learning in static, metric-based bug prediction," Array, vol. 6, Jul. 2020, Art. no. 100021.
[2] S. D. Immaculate, M. F. Begam, and M. Floramary, "Software bug prediction using supervised machine learning algorithms," in Proc. Int. Conf. Data Sci. Commun. (IconDSC), Mar. 2019, pp. 1–7.
[3] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2014.
[4] F. Osisanwo, J. Akinsola, O. Awodele, J. Hinmikaiye, O. Olakanmi, and J. Akinjobi, "Supervised ML algorithms: Classification and comparison," Int. J. Comput. Trends Technol., vol. 48, no. 3, pp. 128–138, 2017.
[5] A. Hammouri, M. Hammad, M. Alnabhan, and F. Alsarayrah, "Software bug prediction using ML approach," Int. J. Adv. Comput. Sci. Appl., vol. 9, no. 2, pp. 78–83, 2018.
[6] D. Bowes, T. Hall, and J. Petrić, "Software defect prediction: Do different classifiers find the same defects?" Softw. Quality J., vol. 26, no. 2, pp. 525–552, Jun. 2018.
[7] W. Fu, T. Menzies, and X. Shen, "Tuning for software analytics: Is it really necessary?" Inf. Softw. Technol., vol. 76, pp. 135–146, Aug. 2016.
[8] H. Osman, M. Ghafari, and O. Nierstrasz, "Hyperparameter optimization to improve bug prediction accuracy," in Proc. IEEE Workshop Mach. Learn. Techn. Softw. Quality Eval. (MaLTeSQuE), Feb. 2017, pp. 33–38.
[9] J.-C. Lévesque, C. Gagné, and R. Sabourin, "Bayesian hyperparameter optimization for ensemble learning," 2016, arXiv:1605.06394.
[10] T. Xia, R. Krishna, J. Chen, G. Mathew, X. Shen, and T. Menzies, "Hyperparameter optimization for effort estimation," 2018, arXiv:1805.00336.
[11] D. Di Nucci, F. Palomba, G. De Rosa, G. Bavota, R. Oliveto, and A. De Lucia, "A developer centered bug prediction model," IEEE Trans. Softw. Eng., vol. 44, no. 1, pp. 5–24, Jan. 2018.
[12] F. Khan, S. Kanwal, S. Alamri, and B. Mumtaz, "Hyper-parameter optimization of classifiers, using an artificial immune network and its application to software bug prediction," IEEE Access, vol. 8, pp. 20954–20964, 2020.
[13] X. Yang, D. Lo, X. Xia, and J. Sun, "TLEL: A two-layer ensemble learning approach for just-in-time defect prediction," Inf. Softw. Technol., vol. 87, pp. 206–220, Jul. 2017.
[14] A. O. Balogun, A. O. Bajeh, V. A. Orie, and A. W. Yusuf-Asaju, "Software defect prediction using ensemble learning: An ANP based evaluation method," FUOYE J. Eng. Technol., vol. 3, no. 2, pp. 1–14, Sep. 2018.
[15] S. Qiu, L. Lu, S. Jiang, and Y. Guo, "An investigation of imbalanced ensemble learning methods for cross-project defect prediction," Int. J. Pattern Recognit. Artif. Intell., vol. 33, no. 12, Nov. 2019, Art. no. 1959037.
[16] A. Sayed and N. Ramadan, "Early prediction of software defect using ensemble learning: A comparative study," Int. J. Comput. Appl., vol. 179, no. 46, pp. 29–40, Jun. 2018.
[17] S. K. Pandey, R. B. Mishra, and A. K. Tripathi, "BPDET: An effective software bug prediction model using deep representation and ensemble learning techniques," Expert Syst. Appl., vol. 144, Apr. 2020, Art. no. 113085.
[18] M. Claesen, J. Simm, D. Popovic, and B. Moor, "Hyperparameter tuning in Python using Optunity," in Proc. Int. Workshop Tech. Comput. ML Math. Eng., vol. 1, 2014, p. 3.
[19] Y. Jung, "Efficient tuning parameter selection by cross-validated score in high dimensional models," Int. J. Math., Comput., Phys., Elect. Comput. Eng., World Acad. Sci., Eng. Technol., vol. 10, no. 1, pp. 19–25, 2016.
[20] T. G. Dietterich, "Ensemble learning," in The Handbook of Brain Theory and Neural Networks, vol. 2. Cambridge, U.K.: The MIT Press, 2002, pp. 110–125.
[21] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proc. 13th Int. Conf. Mach. Learn., 1996, pp. 148–156.
[22] Z.-H. Zhou, "Ensemble learning," in Machine Learning. Singapore: Springer, 2021, pp. 181–210.
[23] L. Breiman, "Bagging predictors," Mach. Learn., vol. 24, no. 2, pp. 123–140, Aug. 1996.
[24] R. Polikar, "Ensemble learning," in Ensemble Machine Learning. Cham, Switzerland: Springer, 2012, pp. 1–34.
[25] G. Brown, "Ensemble learning," in Encyclopedia of Machine Learning. Berlin, Germany: Springer, 2010, pp. 393–402.
[26] D. H. Wolpert, "Stacked generalization," Neural Netw., vol. 5, no. 2, pp. 241–259, Jan. 1992.
[27] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, "How many trees in a random forest?" in Proc. 8th Int. Workshop ML Data Mining Pattern Recognit. (MLDM). Berlin, Germany: Springer, Jul. 2012, pp. 154–168.


[28] A. E. Mohamed, "Comparative study of four supervised ML techniques for classification," Inf. J. Appl. Sci. Technol., vol. 7, no. 2, pp. 1–15, 2017.
[29] M. G. Al-Obeidallah, D. G. Al-Fraihat, A. M. Khasawneh, A. M. Saleh, and H. Addous, "Empirical investigation of the impact of the adapter design pattern on software maintainability," in Proc. Int. Conf. Inf. Technol. (ICIT), Jul. 2021, pp. 206–211.
[30] A.-R. Al-Ghuwairi, D. Al-Fraihat, Y. Sharrab, H. Alrashidi, N. Almujally, A. Kittaneh, and A. Ali, "Visualizing software refactoring using radar charts," Sci. Rep., vol. 13, no. 1, p. 19530, Nov. 2023, doi: 10.1038/s41598-023-44281-6.
[31] H. Alsawalqah, H. Faris, I. Aljarah, L. Alnemer, and N. Alhindawi, "Hybrid SMOTE-ensemble approach for software defect prediction," in Software Engineering Trends and Techniques in Intelligent Systems. Cham, Switzerland: Springer, 2017, pp. 355–366.
[32] S. A. El-Shorbagy, W. M. El-Gammal, and W. M. Abdelmoez, "Using SMOTE and heterogeneous stacking in ensemble learning for software defect prediction," in Proc. 7th Int. Conf. Softw. Inf. Eng., May 2018, pp. 44–47.
[33] D. Al-Fraihat, Y. Sharrab, F. Alzyoud, A. Qahmash, M. Tarawneh, and A. Maaita, "Speech recognition utilizing deep learning: A systematic review of the latest developments," Hum.-Centric Comput. Inf. Sci., vol. 14, pp. 1–34, Mar. 2024.
[34] R. Li, L. Zhou, S. Zhang, H. Liu, X. Huang, and Z. Sun, "Software defect prediction based on ensemble learning," in Proc. 2nd Int. Conf. Data Sci. Inf. Technol., Jul. 2019, pp. 1–6.

DIMAH AL-FRAIHAT received the Ph.D. degree in software engineering from the University of Warwick, U.K. She is currently an Assistant Professor with the Faculty of Information Technology, Isra University, Jordan. Her research interests include software engineering, refactoring, design patterns, software testing, requirements engineering, documentation, computer-based applications, technology enhanced learning, and deep learning.

YOUSEF SHARRAB received the Ph.D. degree in computer engineering from Wayne State University, Detroit, MI, USA, in 2017. He is currently an Assistant Professor with the Department of Computer Science, Isra University. His main research interests include artificial intelligence, machine learning, deep learning, computer vision, software engineering, and the IoT.

ABDEL-RAHMAN AL-GHUWAIRI received the Ph.D. degree in software engineering from New Mexico State University, Las Cruces, NM, USA, in 2013. He is currently an Associate Professor with the Department of Software Engineering, Hashemite University, Jordan. His research interests include software engineering, cloud computing, requirements engineering, information retrieval, big data, and database systems.

HAMZEH ALSHISHANI received the M.Sc. degree in software engineering from Hashemite University, Jordan. He is currently a Lecturer and a Researcher with the Department of Software Engineering, Hashemite University. His research interests include software engineering, cloud computing, and requirements engineering.

ABDULMOHSEN ALGARNI received the Ph.D. degree from the Queensland University of Technology, Australia, in 2012. He was a Research Associate with the School of Electrical Engineering and Computer Science, Queensland University of Technology, in 2012. He is currently an Associate Professor with the College of Computer Science, King Khalid University. His research interests include artificial intelligence, data mining, text mining, machine learning, information retrieval, and information filtering.

