Software Defect Prediction Based on Multi-filter Wrapper Feature (2)
https://doi.org/10.1007/s00521-024-10902-y
Abstract
Software defect prediction (SDP) models rely on various software metrics and defect data to identify potential defects in
new software modules. However, the performance of these predictive models can be degraded by irrelevant or
redundant metrics and by the imbalanced nature of defect datasets. In addition, previous studies mainly use conventional
machine learning (ML) techniques, whose predictive performance leaves room for improvement. Addressing these issues is
crucial to improving the accuracy and effectiveness of SDP models. This study presents a novel approach to SDP using a
multi-filter wrapper feature selection (MFWFS) technique. To identify a subset of relevant and informative features, we
combine three filter techniques, Information Gain (IG), Chi-square (CS), and Relief-F (RF), with a
wrapper technique, the Opposition-Based Whale Optimization Algorithm (OBWOA). A one-dimensional convolutional
neural network (1D-CNN) with an attention mechanism is employed to enhance the classification performance of the
predictive model by efficiently integrating the selected features into abstract deep semantic features. We conduct
experiments on seventeen open-source software datasets using four performance measures (AUC, G-mean, F-measure, and
MCC) and compare the results with existing state-of-the-art ML and hybrid algorithms. The experimental findings
demonstrate the greater efficiency of our approach, highlighting the usefulness of the multi-filter wrapper feature selection
technique and the 1D-CNN with attention for SDP.
Keywords Software defects · Filter techniques · Wrapper techniques · Opposition-based whale optimization algorithm · CNN · Attention mechanism
1 Introduction
Neural Computing and Applications
features. However, not all the attributes are desirable for creating defect prediction models. The time needed to train a model and the quality of its predictions may be negatively impacted by redundant and irrelevant attributes, a problem commonly known as the curse of dimensionality. Previous works [2–5] have shown that feature selection techniques help mitigate the high-dimensionality problem in software defect datasets.

Feature selection identifies the most representative feature subset from the initial set by assessing each feature's significance while preserving the model's high classification performance. The three categories of feature selection techniques that help select informative features are filter, wrapper, and hybrid approaches. A filter approach relies on the data's characteristics, rather than on any classification algorithm, to assess a feature's relevance and evaluate the feature subset [6, 7]. Wrapper approaches outperform filter methods by employing a greedy approach to evaluate all potential optimal feature subsets with a classification algorithm [8, 9]. Although filter methods are computationally less expensive and faster, wrapper methods produce a better subset of features. Hybrid methods merge the complementary strengths of both filter and wrapper techniques [10].

Metaheuristic algorithms have been frequently applied to a variety of optimization problems in recent years, and feature selection is another vital area where these methods have been studied. Unfortunately, because of the stochastic nature of metaheuristic algorithms, there is no assurance that feature selection will provide the best combination of attributes. Several metaheuristic algorithms have been applied in SDP, such as genetic algorithms [11], particle swarm optimization [12], artificial bee colony [13, 14], ant colony [15], moth flame optimization [2], and salp swarm optimization [16]. The whale optimization algorithm (WOA), a recently developed metaheuristic algorithm, mimics the intelligent search strategy used by humpback whales to find food, known as bubble-net feeding. Although WOA has superior search ability compared to other existing algorithms, it has limitations [17], such as getting stuck in local optima. WOA starts with random positions for each agent in the group and, throughout the search, updates each agent's location with respect to either a randomly chosen search agent or the current best solution. WOA is combined with opposition-based learning to form OBWOA, which improves on the base version [18]. By determining whether the opposite position of each agent is better, the opposition-based idea can help WOA find accurate solutions.

Instead of employing the filter and wrapper techniques independently, we combine the features produced by both techniques to take maximum advantage of each. This helps to avoid missing any relevant information and enhances the prediction model's overall search performance. This work uses multiple filter methods, namely Information Gain, Chi-square, and Relief-F, in line with previous research findings [19, 20]. The features obtained are then integrated with the feature subset obtained from the wrapper method, opposition-based whale optimization, to maximize the classification performance (AUC) of the predictive model. Filter methods extract information from the features, and the wrapper uses the learning algorithm for judgment. The computational time and complexity are reduced compared to the pure wrapper method [10]. We believe that this multi-filter wrapper technique has not previously been applied in the SDP domain.

A predictive model's performance mainly depends on two important factors: feature representation and the classification algorithm. Researchers have been applying ML and deep learning (DL) algorithms to efficiently construct models for predicting software defects [3, 21–27]. ML algorithms need manually extracted software metrics for building classifiers to predict defect-prone or non-defect-prone areas, while DL algorithms automatically extract high-level features and learn from high-dimensional and more complex data [28]. As a result, many researchers are now concentrating on creating SDP models using DL algorithms. CNNs have recently gained popularity in many fields, including speech recognition [29], semantic search [30], and object detection [31], because of their capability to extract semantic features with powerful discriminating capacity compared to ML algorithms. They are useful for developing SDP models, but they seem complicated and achieve insufficient accuracy. This may be because CNNs were originally built using 2D structures for 2D data such as images and videos. Recently, a modified variant of the 2D-CNN, called the 1D-CNN, has been developed [32–34]. It is preferable to the 2D-CNN because the complexity of the 1D variant, O(NK), is lower than that of the 2D variant, O(N^2 K^2) [32]. A one-dimensional implementation is feasible on a standard computer and comparatively faster because fewer hidden layers and fewer parameters are involved. Its exceptional performance in real-time electrocardiogram (ECG) monitoring [34], structural damage detection [35], high-power engine fault monitoring [36], and automatic speech recognition [37] has contributed to its increased popularity. Inspired by the successful application of the 1D-CNN in these areas, a 1D-CNN with an attention mechanism has been employed as the classification algorithm for creating a defect prediction model.

The following are the main objectives of the paper:
• A novel hybrid model based on the multiple filter wrapper feature selection (MFWFS) techniques is
that cross-validation with feature selection was better than other approaches using AUC values.

Li et al. [21] presented a framework named Defect Prediction via CNN (DP-CNN) that automatically produces syntactic and semantic features from source code using abstract syntax trees and integrates them with traditional features for SDP. According to the results, DP-CNN advances the existing techniques, achieving an average 12% increase in F-measure.

Hassouneh et al. [40] presented an efficient feature selection approach for SDP using WOA augmented with five natural selection operators. Four different classifiers were used: KNN, DT, LDA, and SVM. The tournament-based WOA with the DT classifier outperformed other variants of WOA in terms of AUC over 17 public software datasets.

Balogun et al. [41] introduced a new hybrid multi-filter wrapper feature selection technique that uses a rank aggregation method to identify irrelevant and redundant features in software defect datasets. Twenty-five datasets were selected from various software repositories to check the feasibility of the technique, with Naïve Bayes and Decision Tree classifiers used to evaluate its performance. The evaluation metrics used were F-measure, AUC, and accuracy. The results indicate that this feature selection method effectively addresses issues such as filter rank selection and local optima stagnation.

Zain et al. [42] proposed a novel 1D-CNN structure for SDP that is efficient in comparison with other 1D-CNN models varying in filter size, kernel size, dropout layers, and the number of convolutional and max-pool layers. The models were tuned with different hyperparameters and evaluated based on F-measure, accuracy, and training and testing time. The comparison with other ML algorithms showcased the exceptional performance of the proposed model.

Fan et al. [43] presented a DL-based approach named defect prediction via an attention-based RNN (DP-ARNN). This model makes use of abstract syntax trees that help to learn syntactic and semantic features from the source code

However, the hybrid approaches applied in the existing literature make very limited use of feature selection methods combined with deep neural network architectures for SDP. The main focus of this study is to propose a unique hybrid feature selection technique that integrates the existing filter methods (IG, CS, and RF) with a modified wrapper method (OBWOA). OBWOA is chosen because of its significant competitiveness over other metaheuristic algorithms in solving complex neural network problems. The features obtained from this unification, along with the superior feature extraction capabilities of a 1D-CNN combined with the coordinate attention mechanism, result in a powerful unified defect predictor that has superior classification performance compared to other existing ML algorithms. Further, the Wilcoxon test is employed for statistical analysis to validate the results and assess the robustness of the model's performance compared to other algorithms present in the literature.

3 Background

3.1 Filter methods

Filter methods rely on univariate statistics to capture the inherent characteristics of features, independent of any learning algorithm. These approaches are quicker and less computationally intensive than wrapper methods [7]. In this work, we have used the information gain, chi-square, and Relief-F methods described below.

• Information gain
It is a technique that determines the relevance of a given attribute for class label prediction. It is an entropy-based concept that measures the uncertainty in the class label after the attribute value is observed [45]. It can be found using Eq. (1).

$$ IG(y) = -\sum_{i}^{m} p(x_i) \log p(x_i) + \sum^{m} \cdots $$
independence of two variables [46]. It determines the most relevant features by considering the highest values for the chi-squared statistic of the class label, computed using Eq. (2):

$$ \chi^2 = \sum_i \frac{(Obs_i - Exp_i)^2}{Exp_i} \qquad (2) $$

where $Obs_i$ and $Exp_i$ are the observed and expected values. Higher values of the chi-squared statistic indicate a stronger dependence of the feature on the output variable.

• Relief-F method
This is an instance-based method that relies on the concept of nearest neighbors to estimate the quality of the attributes [47]. For each instance in a training set, two nearest neighbors are determined: one belonging to the same class (HIT) and the other to a different class (MISS). The importance of the attribute is then calculated using Eq. (3):

$$ I(Y) = P(y \mid \text{nearest instance, different class}) - P(y \mid \text{nearest instance, same class}) \qquad (3) $$

where P is the probability, and y denotes the value of an attribute. An important attribute should be able to effectively differentiate between instances of different classes while maintaining similar values for instances within the same class [48].

3.2 Wrapper method

Wrapper methods assess the quality of various feature subsets using a particular machine learning algorithm. In contrast with filter methods, which rely on intrinsic properties, wrappers measure the usefulness of the features by training and evaluating a classifier based on the chosen algorithm [49]. In this research, we apply the swarm-based whale optimization algorithm as a wrapper method for finding the optimal feature subset.

• Whale Optimization Algorithm
It is a stochastic population-based metaheuristic algorithm introduced in 2016 [17]. This algorithm mimics the intelligent search strategy adopted by humpback whales, known as bubble-net feeding, for finding food. They create bubbles in a spiral motion around their food, i.e., small fishes, and then slowly swim toward the surface.
There are two stages in WOA: the first involves encircling the prey and using a spiral bubble-net attacking mechanism (the exploitation stage), and the second involves randomly searching for prey (the exploration stage) [50].

• Exploitation phase
Shrinking encircling mechanism: The humpback whales swim in a spiral-like pattern as they approach their prey in a shrinking encircling mechanism. They update their positions and move toward the best solution according to the current optimal solution, using Eqs. (4) and (5):

$$ \vec{D} = \left| \vec{C} \cdot \vec{Z}_{best}^{\,t} - \vec{Z}^{\,t} \right| \qquad (4) $$

$$ \vec{Z}^{\,t+1} = \vec{Z}_{best}^{\,t} - \vec{A} \cdot \vec{D} \qquad (5) $$

where $\vec{Z}_{best}^{\,t}$ denotes the position of the best solution thus far. The coefficients $\vec{A}$ and $\vec{C}$ can be computed using Eqs. (6) and (7):

$$ \vec{A} = 2\vec{a} \cdot \vec{r} - \vec{a} \qquad (6) $$

$$ \vec{C} = 2\vec{r} \qquad (7) $$

where $\vec{a} = 2 - t\,(2/M)$ is linearly decreased from 2 to 0, M refers to the maximum number of iterations, and $\vec{r}$ refers to any random vector in [0, 1].

• Spiral path updating position:
The bubble-net feeding mechanism calculates the distance between the current and the best solution thus far using the spiral Eq. (8), as illustrated below.

$$ \vec{Z}^{\,t+1} = \vec{D'} \cdot e^{bl} \cdot \cos(2\pi l) + \vec{Z}_{best}^{\,t} \qquad (8) $$

where b defines the constant that dictates the shape of the logarithmic spiral function, and l represents a random value in [-1, 1].
According to Eq. (9), humpback whales have a 50% chance of randomly selecting either a spiral-shaped path model or a shrinking encircling mechanism to approach their prey.

$$ \vec{Z}^{\,t+1} = \begin{cases} \vec{Z}_{best}^{\,t} - \vec{A} \cdot \vec{D}, & p < 0.5 \\ \vec{D'} \cdot e^{bl} \cdot \cos(2\pi l) + \vec{Z}_{best}^{\,t}, & p \geq 0.5 \end{cases} \qquad p \in [0, 1] \qquad (9) $$

where p is a random number in [0, 1].

• Exploration phase
This phase is controlled by the variable $\vec{A}$, which lies in [-1, 1]. If $|\vec{A}| \geq 1$, the position of each whale is modified according to the position of a randomly chosen whale, and if $|\vec{A}| < 1$, the position of each whale is modified according to the current optimal solution. This global search capability of WOA is shown in Eqs. (10) and (11):

$$ \vec{D} = \left| \vec{C} \cdot \vec{Z}_{rand}^{\,t} - \vec{Z}^{\,t} \right| \qquad (10) $$
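The update rules in Eqs. (4) to (10) can be sketched as a single per-whale step. This is a minimal illustration rather than the authors' implementation. Several details are assumptions: the function name `woa_step` and its parameters are hypothetical, D' is taken to be |Z_best - Z|, the random coefficients are drawn once per whale as scalars instead of as vectors, and the exploration move `z_rand - A * d` follows the standard WOA formulation because Eq. (11) is not reproduced here. Positions are kept real-valued; how they map to a feature subset is not shown.

```python
import math
import random


def woa_step(position, best, rand_agent, t, max_iter, b=1.0, rng=random):
    """Return the updated position of one whale (a list of floats)."""
    a = 2.0 * (1.0 - t / max_iter)      # a = 2 - t(2/M), decreasing from 2 to 0
    A = 2.0 * a * rng.random() - a      # Eq. (6), scalar simplification
    C = 2.0 * rng.random()              # Eq. (7), scalar simplification
    p = rng.random()                    # chooses between encircling and spiral
    l = rng.uniform(-1.0, 1.0)          # spiral parameter in [-1, 1]

    updated = []
    for z, z_best, z_rand in zip(position, best, rand_agent):
        if p >= 0.5:
            # Spiral path update, Eq. (8); D' = |Z_best - Z| is an assumption.
            d_prime = abs(z_best - z)
            spiral = d_prime * math.exp(b * l) * math.cos(2 * math.pi * l)
            updated.append(spiral + z_best)
        elif abs(A) < 1:
            # Exploitation: shrinking encircling, Eqs. (4) and (5).
            d = abs(C * z_best - z)
            updated.append(z_best - A * d)
        else:
            # Exploration: move relative to a random whale, Eq. (10).
            # The final update is the standard WOA form (assumed).
            d = abs(C * z_rand - z)
            updated.append(z_rand - A * d)
    return updated


if __name__ == "__main__":
    rng = random.Random(0)
    whale = [0.1, 0.9, 0.4]
    leader = [0.2, 0.8, 0.5]
    other = [0.7, 0.3, 0.6]
    print(woa_step(whale, leader, other, t=10, max_iter=100, rng=rng))
```

At the last iteration a reaches 0, so A is 0 and a whale that already sits on the best position stays there, which is a quick sanity check of the exploitation and spiral branches.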
One-dimensional variant implementation is feasible on a standard computer and comparatively faster due to fewer hidden layers and fewer parameters involved.
Figure 2 illustrates the structure of the proposed 1D-CNN model with attention, comprising the following important layers:

a) Input layer: The 1D input vector K[n], where n = 0, 1, 2, …, N-1, is given to the first layer of the CNN architecture. The index n ranges over the software metrics present in the dataset. For example, the JM1 dataset contains 21 software metrics, so the input layer will accept 21 metrics.
b) Convolutional layer: It is the foundational layer of the 1D-CNN. A series of convolutions is performed using filters or kernels to acquire local features from the input data. The filters help detect deep semantic features from the low-level feature representation in the defect instances. We have used two convolutional layers with 64 and 32 filters and a kernel size of 1 for each filter.
c) ReLU layer: The output from the preceding layer (feature map) is subjected to a nonlinear activation function by the ReLU layer. It gives the model nonlinearity and enhances its capacity for expression.
d) Flatten layer: The feature maps produced by the second convolutional layer are transformed into a single long continuous linear vector through the process of flattening.
e) Fully connected (FC) layer: The output from the flattening layer is linearly transformed into a new representation through a set of weights and biases. To classify inputs accurately, we employed two FC layers. The first FC layer used ReLU activation, while the second FC layer used softmax activation, generating a probability ranging from 0 to 1.
f) Dropout layer: A regularization approach that randomly removes a particular proportion of the neurons in the FC layer during training. Hence, the model's generalization performance is enhanced, and overfitting is prevented.
g) Attention mechanism: It is a technique commonly used in DL models to focus on the most important
Also, the imbalance ratio in all the datasets is quite high. Therefore, there is a need to address the imbalance issue in these software projects. The independent variables in these datasets denote the features/attributes, and the dependent variable is the class label ('1' for defective, '0' for non-defective). In some datasets, the dependent variable shows the count of defects within a particular module; in that case, we represent the class label as defective if there is more than one defect present; otherwise, we label it non-defective.

5.2 Data pre-processing

• Missing values and removal of outliers: This stage involved searching the datasets for any outliers and missing values that might have been present. The data points in each dataset are analyzed, and those that deviate extremely from the remaining data points are considered outliers. These outliers are critical and need to be handled. In our study, we have used median imputation to treat the outliers, since deleting them would degrade the dataset and reduce the available data volume.
• Class imbalance processing: The datasets used to solve SDP problems are usually imbalanced. There is a relatively small percentage of defective modules (minority class) compared to non-defective modules (majority class) [70]. If an imbalanced dataset is used directly to train a model, the results may be poor and biased in favor of the majority class. Oversampling and undersampling are common approaches to resolve the class imbalance problem [71]. Oversampling generates duplicates of minority class instances, while undersampling eliminates instances from the majority class. Part of the training dataset may be lost in undersampling, resulting in underfitting. We use the random oversampling method, which randomly replicates instances from the minority class, to create a class-balanced training set [72].
• Data standardization: It is crucial to normalize the input dataset before initiating the training process because each dimension in an input space may have a distinct range of values. The original dimension values must be standardized in order to produce reliable results, and the processed dimension values must correspond to the original distribution. In a training process, it is common practice to normalize the data, ensuring that input values fall within the range [0, 1] [73]. This is typically done to boost the neural network training process and solve model learning issues. Equation (18) is used to normalize the dataset.

$$ X_{normalized} = \frac{X - \min(X)}{\max(X) - \min(X)} \qquad (18) $$

In the above equation, the minimum and maximum values of software metric X are denoted by min(X) and max(X), respectively.

5.3 Performance measures

It is possible to assess the effectiveness of predictive models for datasets with imbalanced class distributions by utilizing appropriate performance measures. In our study, we have considered AUC, F-measure, MCC, and G-mean, as these metrics are widely used in the literature [3, 39, 74–76]. The most traditional and commonly used performance metric for evaluating a classifier is accuracy. However, it gives biased results on imbalanced datasets [77]. Studies have advocated the use of AUC (area under the receiver operating characteristic curve), a robust measure that is particularly effective for imbalanced and noisy datasets [68, 78]. Thus, in our research, we have employed AUC to examine the relationship between the true-positive rate (recall or sensitivity) and the false-positive rate [5, 79]. The true-positive rate is defined by Eq. (19).

$$ TPR = \frac{True_{pos}}{True_{pos} + False_{neg}} \qquad (19) $$

Precision measures the proportion of correctly classified defective instances among all instances classified as defective. It can be calculated using Eq. (20):

$$ Precision = \frac{True_{pos}}{True_{pos} + False_{pos}} \qquad (20) $$

F-measure combines recall and precision into a single measure and is the harmonic mean of the two, as shown in Eq. (21).

$$ F\text{-}measure = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \qquad (21) $$

Matthews Correlation Coefficient (MCC) measures the correlation between predicted and actual instances. Equation (22) is used to calculate its value, which falls in the range [-1, 1].
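The threshold-based measures of this section can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name and the example counts are hypothetical, the MCC expression is the standard textbook definition (assumed here, since Eq. (22) is not reproduced above), and G-mean is taken as the geometric mean of sensitivity and specificity. AUC is omitted because it needs ranked scores rather than counts.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute recall, precision, F-measure, G-mean and MCC from counts."""
    recall = tp / (tp + fn)                      # Eq. (19), true-positive rate
    precision = tp / (tp + fp)                   # Eq. (20)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (21)
    specificity = tn / (tn + fp)                 # true-negative rate
    g_mean = math.sqrt(recall * specificity)     # geometric mean of the two rates
    # Standard MCC definition (assumed); returns 0.0 when it is undefined.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "g_mean": g_mean,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts: 40 defective modules found, 20 missed,
    # 10 false alarms, 90 clean modules correctly passed.
    for name, value in classification_metrics(tp=40, fp=10, tn=90, fn=20).items():
        print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty classes (for example tp + fn = 0), which would need explicit handling on very small test splits.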
G-mean represents a balanced measure that is computed using the geometric mean of sensitivity and specificity, as shown in Eq. (23).

$$ G\text{-}mean = \sqrt{Sensitivity \times Specificity} \qquad (23) $$

5.4 Parameter settings

For performing feature selection, only the training dataset is used in this paper. The model is evaluated using the same features that were selected from the training dataset. The filter methods (IG, CS, and RF) are implemented using the Weka tool (default settings), and the top log2(n) features are selected. The wrapper-based oppositional whale optimization algorithm and the 1D-CNN model are implemented in Python 3.9 on Spyder 5.2.2. We have divided the data into 70% training and 30% testing before executing the CNN algorithm. Tables 2 and 3 show the parameter settings for OBWOA and 1D-CNN.

5.5 Statistical analysis

The study employed statistical tests, mainly the Wilcoxon test, to support and reinforce the findings and conclusions. Demšar [80] has suggested using non-parametric tests because assumptions regarding data distribution, such as normality and homogeneity, are relaxed.
We applied the Wilcoxon signed-rank test [81] to detect significant differences between each pair of techniques. With the help of this test, we can determine whether the null hypothesis, which states that there is no difference in the performance of the techniques, can be rejected. We compute the p-value for every pair of techniques and decide whether to accept or reject the null hypothesis at a chosen significance level of 0.05.

6 Results and discussion

RQ1. How efficient is the SDP model developed using 1D-CNN with and without multi-filter wrapper feature selection?

The study utilized 17 software defect datasets to implement the 1D-CNN model with and without MFWFS. The model was designed with the parameter settings specified in Table 3. The datasets are initially balanced using the random oversampling technique before training the model. The top log2(n) features are selected from the IG, CS, and Relief-F methods applied to the training set. The OBWOA uses KNN (K = 5) as the internal classifier with the parameter settings specified in Table 2. To handle the stochastic nature of OBWOA, we performed 20 runs of this technique and averaged the results. Table 4 shows the original number of features in all
datasets, the average number of selected features after performing OBWOA, the average selection ratio, the best fitness value, and the number of features in the MFWFS technique. For instance, the JM1 dataset originally has 21 features. We select the top log2(n) features, i.e., four features each from the IG, CS, and RF methods; nine features are selected by OBWOA; and after their union, we get 12 features in MFWFS. We can observe from Table 4 that the attribute reduction rate varies from 20 to 77% across all seventeen datasets, resulting in an overall average reduction of 50% of the attributes using the MFWFS technique. This implies that the new MFWFS technique generates feature subsets with substantially fewer features that produce higher-quality solutions.

The classification performance of the SDP model developed using the 1D-CNN defect predictor without and with MFWFS is evaluated across 17 software projects using F-measure, AUC, G-mean, and MCC. The results are displayed in Tables 5, 6, 7, and 8. The results show that the CNN with feature selection achieves the best F-measure, AUC, G-mean, and MCC values in nine, fifteen, ten, and ten out of seventeen datasets, respectively. There is an improvement in the values of F-measure (1–8%), AUC (0.1–30%), G-mean (0.2–6%), and MCC (1–23.5%) for all the datasets except PDE and camel-1.6. The performance values for the PDE and camel-1.6 datasets did not improve after feature selection; hence, feature selection was not helpful for these particular datasets.

To assess how the number of features affects classifier performance, the average F-measure, AUC, G-mean, and MCC values were compared using the Wilcoxon signed-rank test with a significance threshold of 0.05. The test results are presented in Table 9, showing the p-values for each comparison. A p-value below 0.05 suggests a statistically significant difference between the two techniques. The performance difference between the two techniques is denoted as either significantly different (S+) or not significantly different (NS). The analysis revealed a significant difference between the two methods in terms of AUC (p-value = 0.015) but not for F-measure, G-mean, or MCC.

This clearly implies that the impact of MFWFS on the 1D-CNN classifier is significant with respect to AUC values. Although the 1D-CNN extracts high-level deep semantic features using various layers, feature selection using a hybrid filter wrapper approach is applied prior to training the model, removing the defect features that are irrelevant or redundant, add noise to the dataset, and cause the model to overfit. As a result, this integration of static defect feature subsets with abstract deep semantic features in the CNN improves the model's efficiency and builds a stronger ability to distinguish between defective and non-defective classes.

RQ2. How efficient is the SDP model developed using MFWFS-1D-CNN with and without attention?

The study evaluated the performance of SDP models created with MFWFS-1D-CNN, both with and without an attention layer, by measuring their F-measure, AUC, G-mean, and MCC values. The results are presented in Tables 10, 11, 12, and 13. Across all datasets, the F-measure, AUC, G-mean, and MCC values range from 0.613 to 0.901, 0.788 to 0.974, 0.761 to 0.958, and 0.381 to 0.897, respectively. Adding the attention layer improves the predictive model's performance in all the datasets except one, xalan. There is an improvement of about 0.1–11% in F-measure, 0.5–11% in AUC, 0.3–12% in G-mean, and 0.12–35% in MCC after applying the attention layer in the 1D-CNN model. The improvement is due to the attention mechanism in the 1D-CNN model, which can detect the direction and position of the data, thus enabling the precise identification of defects in the software.

Table 2 Parameter settings for OBWOA
Parameters                       Values
No. of iterations                100
No. of search agents (whales)    50
Spiral factor b                  1
Convergence constant 'a'         1
No. of runs                      20

Table 3 Parameter settings for 1D-CNN
Parameters                       Values
No. of convolutional layers      2
No. of dense layers              2
Activation function              ReLU + Softmax
Number of attention layers       2
Filter                           64, 32
Kernel                           1, 1
Optimizer                        Adam
Learning rate                    0.01
Loss function                    Binary cross entropy
No. of epochs                    250
Dropout                          0.2
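The selection arithmetic described for JM1 (the top log2(n) features from each filter, followed by a union with the OBWOA subset) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names, the toy feature names, and the toy scores are hypothetical, and rounding log2(n) down is an assumption that matches the four features per filter reported for the 21-feature JM1 dataset.

```python
import math


def top_k_by_score(scores, k):
    """Return the names of the k highest-scoring features."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def mfwfs_union(filter_scores, wrapper_subset, n_features):
    """Union of the top floor(log2 n) features per filter and the wrapper subset."""
    k = int(math.log2(n_features))      # e.g. n = 21 gives k = 4 (assumed floor)
    selected = set(wrapper_subset)
    for scores in filter_scores.values():
        selected |= top_k_by_score(scores, k)
    return selected


if __name__ == "__main__":
    names = [f"f{i}" for i in range(21)]            # 21 metrics, as in JM1
    toy_scores = {
        "IG": {name: 21 - i for i, name in enumerate(names)},
        "CS": {name: i for i, name in enumerate(names)},
        "RF": {name: 21 - i for i, name in enumerate(names)},
    }
    wrapper = {"f0", "f5", "f9"}                    # hypothetical OBWOA output
    subset = mfwfs_union(toy_scores, wrapper, n_features=len(names))
    print(sorted(subset), len(subset))
```

Because the filters often agree on some features, the union is usually smaller than the sum of the individual selections, which is consistent with JM1 ending at 12 features rather than 21.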
To statistically validate our results, we performed the Wilcoxon test on all four performance measures. Table 14 presents the p-values obtained when the MFWFS-1D-CNN-Attention model is compared with MFWFS-1D-CNN. The p-values obtained for F-measure (0.03), AUC (0.041), and MCC (0.012) show a significant difference between the two approaches, whereas the difference is not significant with respect to G-mean. These results suggest that the MFWFS-1D-CNN-Attention model outperforms the MFWFS-1D-CNN model without attention, signifying the importance of the attention layer in the neural network.

RQ3. How effective are the SDP models developed using the proposed hybrid model MFWFS-1D-CNN-Attention in comparison with the models created with individual filters and wrapper-based CNN model?

Answer: To assess whether the proposed MFWFS-1D-CNN-Attention model is able to effectively classify the software defects, we compared this model with the sole wrapper method (OBWOA) and with combinations of a filter (IG, CS, or RF) and the wrapper method. Table 15 reports the results using the four performance measures: F-measure, AUC, G-mean, and MCC.

In terms of all the evaluation measures, MFWFS (IG + CS + RF + OBWOA) on the CNN attention model produced the highest values in the JM1, MC1, KC1, PC1, PC3, PC4, PC5, EQ, JDT, Lucene, Mylyn, ant-1.7, camel-1.6, and poi-3.0 datasets. However, for the jedit-4.3 dataset, the performance of the IG + CS + RF + OBWOA method was better in terms of AUC and G-mean, while the OBWOA wrapper method alone was better in terms of F-measure and MCC scores. For Xalan-2.7, IG + OBWOA provided the best F-measure and MCC values, while OBWOA gave the best performance in terms of AUC and G-mean. For the Xerces-1.4 dataset, IG + OBWOA performed the best, with the highest values for all evaluation measures.

We also present box plots to visually compare the difference in prediction performance between OBWOA, IG + OBWOA, CS + OBWOA, RF + OBWOA, and IG + CS + RF + OBWOA over seventeen datasets, with respect to the four performance measures, in Fig. 3. The x-axis denotes the different methodologies being compared, and the y-axis denotes the four evaluation measures. We observe that for each of the four evaluation metrics, the median value achieved by IG + CS + RF + OBWOA is higher than that obtained by other approaches. The average F-measure achieved by IG + CS + RF + OBWOA is 0.794 (over all the datasets), an improvement of between 7% (IG + OBWOA) and 10.41% (OBWOA); the average AUC (0.896) of IG + CS + RF + OBWOA is an improvement of between 4.4% (RF + OBWOA) and 7% (OBWOA); the average G-mean (0.875) of IG + CS + RF + OBWOA is an improvement of between 5.4% (IG +
Table 5 Comparative results of 1D-CNN with and without MFWFS based on F-measure
JM1 MC1 KC1 PC3 PC4 PC5 EQ JDT PDE Lucene Mylyn Ant-1.7 Camel-1.6 Jedit-4.3 Poi-3.0 Xalan-2.7 Xerces-1.4
Without FS 0.614 0.66 0.726 0.764 0.782 0.731 0.854 0.859 0.846 0.825 0.732 0.79 0.746 0.69 0.802 0.831 0.909
With FS 0.651 0.653 0.721 0.776 0.841 0.739 0.898 0.901 0.839 0.776 0.753 0.789 0.708 0.706 0.796 0.677 0.9
Table 6 Comparative results of 1D-CNN with and without MFWFS based on AUC
JM1 MC1 KC1 PC3 PC4 PC5 EQ JDT PDE Lucene Mylyn Ant-1.7 Camel-1.6 Jedit-4.3 Poi-3.0 Xalan-2.7 Xerces-1.4
Without FS 0.712 0.878 0.847 0.915 0.916 0.806 0.871 0.902 0.912 0.847 0.833 0.805 0.826 0.935 0.825 0.754 0.895
With FS 0.752 0.888 0.848 0.919 0.94 0.813 0.932 0.926 0.89 0.861 0.875 0.833 0.826 0.88 0.833 0.98 0.942
Table 7 Comparative results of 1D-CNN with and without MFWFS based on G-mean
JM1 MC1 KC1 PC3 PC4 PC5 EQ JDT PDE Lucene Mylyn Ant-1.7 Camel-1.6 Jedit-4.3 Poi-3.0 Xalan-2.7 Xerces-1.4
Without FS 0.662 0.865 0.821 0.91 0.89 0.771 0.864 0.891 0.904 0.848 0.827 0.813 0.802 0.888 0.828 0.393 0.899
With FS 0.701 0.858 0.804 0.906 0.914 0.775 0.914 0.914 0.883 0.863 0.853 0.815 0.789 0.855 0.824 0.969 0.917
Table 8 Comparative results of 1D-CNN with and without MFWFS based on MCC
JM1 MC1 KC1 PC3 PC4 PC5 EQ JDT PDE Lucene Mylyn Ant-1.7 Camel-1.6 Jedit-4.3 Poi-3.0 Xalan-2.7 Xerces-1.4
Without FS 0.281 0.406 0.511 0.61 0.623 0.496 0.717 0.732 0.713 0.668 0.515 0.59 0.53 0.466 0.637 0.227 0.822
With FS 0.347 0.475 0.491 0.624 0.709 0.506 0.815 0.806 0.671 0.586 0.56 0.588 0.479 0.471 0.627 0.467 0.807
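The with/without feature selection comparison in Tables 5 to 8 can be tallied per dataset. The following is a minimal sketch, not the authors' procedure: it copies the two F-measure rows of Table 5 and counts wins, ties, and losses for the MFWFS variant. The function names `win_tie_loss` and `mean_difference` are hypothetical helpers introduced only for this illustration.

```python
# F-measure rows copied from Table 5 (same dataset order as the table header).
DATASETS = ["JM1", "MC1", "KC1", "PC3", "PC4", "PC5", "EQ", "JDT", "PDE",
            "Lucene", "Mylyn", "Ant-1.7", "Camel-1.6", "Jedit-4.3",
            "Poi-3.0", "Xalan-2.7", "Xerces-1.4"]
WITHOUT_FS = [0.614, 0.66, 0.726, 0.764, 0.782, 0.731, 0.854, 0.859, 0.846,
              0.825, 0.732, 0.79, 0.746, 0.69, 0.802, 0.831, 0.909]
WITH_FS = [0.651, 0.653, 0.721, 0.776, 0.841, 0.739, 0.898, 0.901, 0.839,
           0.776, 0.753, 0.789, 0.708, 0.706, 0.796, 0.677, 0.9]


def win_tie_loss(baseline, candidate):
    """Count datasets where the candidate is better, equal, or worse."""
    wins = sum(c > b for b, c in zip(baseline, candidate))
    ties = sum(c == b for b, c in zip(baseline, candidate))
    losses = sum(c < b for b, c in zip(baseline, candidate))
    return wins, ties, losses


def mean_difference(baseline, candidate):
    """Average of (candidate - baseline) over all datasets."""
    return sum(c - b for b, c in zip(baseline, candidate)) / len(baseline)


wins, ties, losses = win_tie_loss(WITHOUT_FS, WITH_FS)
print(f"wins={wins} ties={ties} losses={losses}")
print(f"mean difference={mean_difference(WITHOUT_FS, WITH_FS):+.4f}")
```

A tally like this only describes the direction of the differences; whether they are statistically meaningful is what the Wilcoxon test in Table 9 addresses.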
OBWOA) and 8.65% (OBWOA), and the average MCC (0.632) by IG + CS + RF + OBWOA obtains improvement between 18% (IG + OBWOA) and 28.78% (OBWOA).
We also compared the performance of the proposed feature selection method IG + CS + RF + OBWOA with transformer-based model BERT [82] used in a previous literature study to validate the performance and robustness of the proposed method against the original 1D-CNN-Attention-based model. The F-measure, AUC, G-mean, and MCC values are reported in Table 16. The results indicate that the IG + CS + RF + OBWOA with 1D-CNN model consistently outperforms the IG + CS + RF + OBWOA with BERT approach across all datasets in terms of F-measure, AUC, G-mean, and MCC values. Higher F-measure values for 1D-CNN indicate better balance between precision and recall, which is essential for accurate defect prediction. AUC values are also consistently higher for 1D-CNN, indicating stronger discrimi-
Table 9 Wilcoxon test result based on average F-measure, AUC, G-mean, and MCC
Comparison approach p-value (F-measure) p-value (AUC) p-value (G-mean) p-value (MCC)
CNN without feature selection / CNN with MFWFS NS (0.463) S+ (0.015) NS (0.193) NS (0.201)
Table 10 Comparative results of MFWFS-1D-CNN with and without attention based on F-measure
JDT PDE Lucene Mylyn Ant-1.7 Camel-1.6 Jedit-4.3 Poi-3.0 Xalan-2.7 Xerces-1.4
Without attention 0.901 0.839 0.776 0.753 0.789 0.708 0.706 0.796 0.677 0.9
With attention 0.862 0.879 0.831 0.811 0.809 0.761 0.706 0.818 0.613 0.901
Table 11 Comparative results of MFWFS-1D-CNN with and without attention based on AUC
JM1 MC1 KC1 PC3 PC4 PC5 EQ JDT PDE Lucene Mylyn Ant-1.7 Camel-1.6 Jedit-4.3 Poi-3.0 Xalan-2.7 Xerces-1.4
Without attention 0.752 0.888 0.848 0.919 0.94 0.813 0.932 0.926 0.89 0.861 0.875 0.833 0.826 0.88 0.833 0.98 0.942
With attention 0.788 0.933 0.868 0.924 0.938 0.835 0.93 0.926 0.937 0.9 0.901 0.837 0.85 0.974 0.838 0.948 0.891
Table 12 Comparative results of MFWFS-1D-CNN with and without attention based on G-mean
JM1 MC1 KC1 PC3 PC4 PC5 EQ JDT PDE Lucene Mylyn Ant-1.7 Camel-1.6 Jedit-4.3 Poi-3.0 Xalan-2.7 Xerces-1.4
Without attention 0.701 0.858 0.804 0.906 0.914 0.775 0.914 0.914 0.883 0.863 0.853 0.815 0.789 0.855 0.824 0.969 0.917
With attention 0.761 0.921 0.834 0.911 0.917 0.79 0.897 0.896 0.927 0.881 0.876 0.822 0.816 0.958 0.835 0.935 0.89
study.
other three datasets (JM1, PC4, and ant-1.7). For the JM1
models.
In a nutshell, we can say that the MFWFS-1D-CNN-
Table 14 Wilcoxon test result based on F-measure, AUC, G-mean, and MCC
Comparison approach p-value (F-measure) p-value (AUC) p-value (G-mean) p-value (MCC)
Table 15 (continued)
Datasets Methodology F-measure AUC G-mean MCC
Fig. 3 Box plot for F-measure, AUC, G-mean, and MCC values for different feature selection methodologies
Table 16 Comparison of performance of proposed feature selection with 1D-CNN and BERT transformer
Datasets Methodology F-measure AUC G-mean MCC
JM1 IG + CS + RF + OBWOA with CNN 0.722 0.788 0.761 0.469
JM1 IG + CS + RF + OBWOA with BERT 0.516 0.571 0.503 0.287
MC1 IG + CS + RF + OBWOA with CNN 0.712 0.933 0.921 0.509
MC1 IG + CS + RF + OBWOA with BERT 0.579 0.5 0.235 0.117
KC1 IG + CS + RF + OBWOA with CNN 0.753 0.868 0.834 0.551
KC1 IG + CS + RF + OBWOA with BERT 0.611 0.818 0.521 0.475
PC3 IG + CS + RF + OBWOA with CNN 0.803 0.924 0.911 0.667
PC3 IG + CS + RF + OBWOA with BERT 0.47 0.58 0.569 0.18
PC4 IG + CS + RF + OBWOA with CNN 0.854 0.938 0.917 0.73
PC4 IG + CS + RF + OBWOA with BERT 0.381 0.589 0.53 0.153
PC5 IG + CS + RF + OBWOA with CNN 0.75 0.835 0.79 0.534
PC5 IG + CS + RF + OBWOA with BERT 0.463 0.562 0.534 0.173
EQ IG + CS + RF + OBWOA with CNN 0.889 0.93 0.897 0.897
EQ IG + CS + RF + OBWOA with BERT 0.677 0.792 0.673 0.509
JDT IG + CS + RF + OBWOA with CNN 0.862 0.926 0.896 0.735
JDT IG + CS + RF + OBWOA with BERT 0.861 0.914 0.805 0.729
PDE IG + CS + RF + OBWOA with CNN 0.879 0.937 0.927 0.773
PDE IG + CS + RF + OBWOA with BERT 0.388 0.557 0.463 0.119
Lucene IG + CS + RF + OBWOA with CNN 0.831 0.9 0.881 0.678
Lucene IG + CS + RF + OBWOA with BERT 0.456 0.763 0.617 0.272
Mylyn IG + CS + RF + OBWOA with CNN 0.811 0.901 0.876 0.646
Mylyn IG + CS + RF + OBWOA with BERT 0.393 0.615 0.554 0.155
Ant-1.7 IG + CS + RF + OBWOA with CNN 0.809 0.837 0.822 0.623
Ant-1.7 IG + CS + RF + OBWOA with BERT 0.452 0.752 0.621 0.304
Camel-1.6 IG + CS + RF + OBWOA with CNN 0.761 0.85 0.816 0.555
Camel-1.6 IG + CS + RF + OBWOA with BERT 0.314 0.684 0.563 0.218
Jedit-4.3 IG + CS + RF + OBWOA with CNN 0.722 0.982 0.964 0.544
Jedit-4.3 IG + CS + RF + OBWOA with BERT 0.65 0.98 0.703 0.393
Poi-3.0 IG + CS + RF + OBWOA with CNN 0.818 0.838 0.835 0.649
Poi-3.0 IG + CS + RF + OBWOA with BERT 0.851 0.912 0.874 0.732
Xalan-2.7 IG + CS + RF + OBWOA with CNN 0.613 0.948 0.935 0.381
Xalan-2.7 IG + CS + RF + OBWOA with BERT 0.643 0.874 0.723 0.548
Xerces-1.4 IG + CS + RF + OBWOA with CNN 0.901 0.891 0.89 0.808
Xerces-1.4 IG + CS + RF + OBWOA with BERT 0.854 0.832 0.819 0.747
Bold values represent the highest values among all the techniques for each dataset
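Three of the four measures reported in Table 16 (F-measure, G-mean, and MCC) can be derived from a binary confusion matrix. The sketch below assumes the standard textbook definitions, since the paper's own definitions fall outside this excerpt; the helper names and the example counts are hypothetical and are not taken from any of the seventeen datasets. AUC is omitted because it needs ranked prediction scores rather than counts.

```python
import math


def _safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the defective class."""
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    return _safe_div(2 * precision * recall, precision + recall)


def g_mean(tp, fp, tn, fn):
    """Geometric mean of recall (sensitivity) and specificity."""
    recall = _safe_div(tp, tp + fn)
    specificity = _safe_div(tn, tn + fp)
    return math.sqrt(recall * specificity)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return _safe_div(numerator, denominator)


# Hypothetical counts for an imbalanced test set (50 defective, 150 clean).
tp, fp, tn, fn = 40, 10, 140, 10
print(f"F-measure={f_measure(tp, fp, fn):.3f}")
print(f"G-mean={g_mean(tp, fp, tn, fn):.3f}")
print(f"MCC={mcc(tp, fp, tn, fn):.3f}")
```

G-mean and MCC both use the true-negative count, which is why they are commonly preferred over accuracy on imbalanced defect data.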
methods significantly outperformed using them separately. Further, applying a deep learning-based CNN model with an attention mechanism results in the identification of crucial features that improve the performance of SDP.
7 Threats to validity
This section presents a comprehensive list of potential threats that could impact the results of the study. Examining these threats is essential to enhance the reliability of the obtained results.
Internal validity threats may arise from uncontrolled factors impacting experimental outcomes, such as experimental errors, parameter settings, and implementation errors. Although the experiment process was carried out very carefully, some errors may have gone unnoticed [84].
External validity threat is associated with the generalization and replication of our results. This threat is reduced by applying the proposed model to 17 open-source datasets from PROMISE, NASA, and AEEEM repositories that are publicly available and used in earlier SDP studies. The projects differ in programming language, size, and application domain. The parameter settings and fitness
Table 17 Abbreviations of the algorithms used for comparison with the proposed model
Abbr. | Algorithm | References | Implementation/parameter values
k-NN | k-nearest neighbor | [68, 69] | The number of neighbors was varied between 1, 3, 5, up to 15
SVM | Support vector machine | [68, 69] | Kernel: RBF. A multilevel grid search was employed to tune kernel width and regularization parameter C, ranging from log(C) = [-6, -5, …, 20]
RF | Random forest | [68] | Number of trees tuned between 10, 50, 100, 250, 500, and 1000, and the number of attributes selected per tree as [0.5, 1, 2] × √M, where M is the number of attributes in the dataset
L-SVM | Lagrangian support vector machine | [68, 69] | No kernel function used. A range from log(C) = [6; 5; ... 20] has been evaluated
LS-SVM | Least squares support vector machine | [68, 69] | Kernel: RBF. A multilevel grid search was employed to tune kernel width and regularization parameter C, ranging from log(C) = [-6, -5, …, 20]
NB | Naïve Bayes | [69] | Default (Gaussian NB)
LDA | Linear discriminant analysis | [68, 69] | Solver: SVD, shrinkage: None
C4.5 | Decision tree | [68] | Pruning strategies varied confidence levels from 0.05 to 0.7, with and without Laplacian smoothing and subtree raising
ANN | Artificial neural network | [38] | Three-layer network, hidden neurons (10), no. of epochs (300), activation function (sigmoid), learning rate (0.1)
CNN-WSHCKE | CNN-whale optimization-simulated annealing-based kernel extreme machine learning | [3] | Number of neurons in convolutional layers: 32, 64, 16; kernel sizes: 3 × 3, 3 × 3, and 4 × 4; activation function: ReLU; KELM hyperparameters: the Gaussian kernel's bandwidth σ
SSA-BPNN | Salp swarm algorithm-backpropagation neural network | [16] | SSA: population size: 30, max iterations: 300. BPNN: three-layer network, no. of hidden layers: 1, number of hidden neurons: 2n + 1
PCA-ANN | Principal component analysis-ANN | [69] | PCA uses maximum likelihood estimation. ANN: no. of hidden neurons: 2n + 1, sigmoid activation function, minimizes MSE during training
LRNN | Layered recurrent neural network | [39] | No. of iterations: 1000, no. of input layer neurons: no. of features, no. of hidden layer neurons: number of features/2, number of neurons in output layer: 1, threshold value: 0.5
BGA-BPSO-BACO-LRNN | Binary genetic algorithm-binary particle swarm optimization-binary ant colony optimization-layered recurrent neural network | [39] | BGA: number of iterations: 300, population size: 40, crossover rate: 0.7, mutation rate: 0.1, selection type: roulette wheel selection, crossover type: single, double, or uniform (randomly selected each iteration). BPSO: number of iterations: 300, swarm size: 40, degree of influence (c1, c2): both set to 1.5, inertia weight (w): 0.8, maximum and minimum velocity (v_max, v_min): 1 and 0, respectively. BACO: number of iterations: 300, number of ants (agents): 20, initial pheromone level: 1, pheromone exponential weight (α): 0.8, heuristic exponential weight (β): 0.8, evaporation rate: 0.6
TBWOA-DT | Tournament selection method with binary whale optimization algorithm-decision tree | [40] | Population size: 10, number of iterations: 100, tournament size (T): 3
EBMFOV3 | Enhanced binary moth flame optimization | [2] | Population size: 20–50, maximum iterations: 100, transfer functions used: S-shaped and V-shaped
DP-ARNN | Defect prediction via attention-based recurrent neural network | [43] | Embedding dimension: 30, AST vector length: 2000, Bi-LSTM units: 40 per layer, first hidden layer nodes: 16, second hidden layer nodes: 24, batch size: 32, epochs: 20, activation functions: tanh in the first layer, linear in the second layer, sigmoid in the output layer, loss function: binary cross-entropy, optimizer: RMSprop
CNN | CNN | [43] | Number of filters: 10, filter length: 5, fully connected layer nodes: 100
RNN | Recurrent neural network | [43] | Same as for DP-ARNN [43]
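Several of the baseline settings in Table 17 are given as rules rather than fixed values. The sketch below shows one way such candidate grids might be enumerated. It is an illustration under stated assumptions, not the configuration code of the cited studies: the function names are hypothetical, rounding √M-based values to the nearest integer is an assumption, the k-NN list assumes odd values only, and the base of the logarithm in the log(C) grid is not stated in the table, so only the exponents are listed.

```python
import math

# Random forest: tree counts as listed in Table 17.
RF_TREE_COUNTS = [10, 50, 100, 250, 500, 1000]

# k-NN: "1, 3, 5, up to 15" read here as the odd values 1..15 (assumption).
KNN_NEIGHBORS = list(range(1, 16, 2))

# SVM / LS-SVM: exponents of the log(C) grid, -6 to 20 inclusive.
LOG_C_EXPONENTS = list(range(-6, 21))


def rf_attribute_candidates(num_attributes):
    """Attributes per tree as [0.5, 1, 2] x sqrt(M), rounded and kept in 1..M."""
    root = math.sqrt(num_attributes)
    return [min(num_attributes, max(1, round(factor * root)))
            for factor in (0.5, 1, 2)]


def hidden_neurons(num_inputs):
    """Hidden-layer size 2n + 1 used for the BPNN and PCA-ANN baselines."""
    return 2 * num_inputs + 1


# Example with a hypothetical dataset of 16 attributes.
print(rf_attribute_candidates(16))
print(hidden_neurons(16))
```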
Table 18 Results of the proposed approach against eighteen other algorithms in respect of AUC
Techniques JM1 KC1 PC3 PC4 ant-1.7 camel-1.6 jedit-4.3 poi-3.0 xalan-2.7 xerces-1.4
functions are clearly specified to make it easy for researchers to replicate the results [85, 86].
Conclusion validity concerns whether the drawn conclusions are statistically valid. To address this concern, we use a non-parametric Wilcoxon test to ensure we obtain statistically valid results. Using stable performance measures such as F-measure, AUC, MCC, and G-mean to evaluate results reduces the threat to conclusion validity [87].
Construct validity threats involve the appropriate selection of evaluation metrics. AUC, F-measure, MCC, and G-mean are well-established metrics used in the previous SDP studies. Therefore, this threat should also be acceptable [88, 89].
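The Wilcoxon signed-rank comparison relied on for conclusion validity can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the two-sided test with zero differences discarded, average ranks for ties, and a normal approximation without tie or continuity correction, none of which the paper specifies. The function name `wilcoxon_signed_rank` is hypothetical. The input data are the two AUC rows of Table 6 (CNN without and with MFWFS).

```python
import math

# AUC rows copied from Table 6, in the table's dataset order.
AUC_WITHOUT_FS = [0.712, 0.878, 0.847, 0.915, 0.916, 0.806, 0.871, 0.902, 0.912,
                  0.847, 0.833, 0.805, 0.826, 0.935, 0.825, 0.754, 0.895]
AUC_WITH_FS = [0.752, 0.888, 0.848, 0.919, 0.94, 0.813, 0.932, 0.926, 0.89,
               0.861, 0.875, 0.833, 0.826, 0.88, 0.833, 0.98, 0.942]


def wilcoxon_signed_rank(x, y, ndigits=3):
    """Return (W+, W-, two-sided p) for paired samples x and y."""
    # Round differences to the reported precision so equal gaps tie exactly.
    diffs = [round(b - a, ndigits) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]  # discard zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, w_minus, p_value


w_plus, w_minus, p_value = wilcoxon_signed_rank(AUC_WITHOUT_FS, AUC_WITH_FS)
print(f"W+={w_plus} W-={w_minus} p={p_value:.4f}")
```

Under these assumptions the AUC comparison yields a p-value of roughly 0.015, in line with the S+ entry reported for AUC in Table 9; other test variants (exact distribution, continuity correction) would give slightly different values.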
In conclusion, this study offers a novel hybrid model for SDP utilizing a multiple filter wrapper feature selection (MFWFS) technique which combines the features obtained from three filter methods (Information Gain, Chi-square, and Relief-F) and one swarm-based oppositional whale optimization algorithm. It effectively selects fewer optimal features by taking advantage of both the filter and wrapper
the study.
SDP problems.
Data availability The datasets that support this study are publicly
available. They are taken from http://promise.site.uottawa.ca/SER
32. Kiranyaz S, Avci O, Abdeljaber O, Ince T, Gabbouj M, Inman DJ (2021) 1D convolutional neural networks and applications: a survey. Mech Syst Signal Process. https://doi.org/10.1016/J.YMSSP.2020.107398
33. Avci O, Abdeljaber O, Kiranyaz S, Boashash B, Sodano H, Inman D (2018) Efficiency validation of one dimensional convolutional neural networks for structural damage detection using A SHM Benchmark Data. In: 25th International congress on sound and vibration, Hiroshima, pp 4600–4607
34. Kiranyaz S, Ince T, Hamila R, Gabbouj M (2015) Convolutional Neural Networks for patient-specific ECG classification. In: Proceedings of the annual international conference of the IEEE Engineering in Medicine and Biology Society, EMBS, Institute of Electrical and Electronics Engineers Inc., pp 2608–2611. https://doi.org/10.1109/EMBC.2015.7318926
35. Avci O, Abdeljaber O, Kiranyaz S, Hussein M, Gabbouj M, Inman DJ (2021) A review of vibration-based damage detection in civil structures: From traditional methods to Machine Learning and Deep Learning applications. Mech Syst Signal Process 147:107077. https://doi.org/10.1016/J.YMSSP.2020.107077
36. Ince T, Kiranyaz S, Eren L, Askar M, Gabbouj M (2016) Real-time motor fault detection by 1-D convolutional neural networks. IEEE Trans Industr Electron 63(11):7067–7075. https://doi.org/10.1109/TIE.2016.2582729
37. Jiang H (2010) Discriminative training of HMMs for automatic speech recognition: a survey. Comput Speech Lang 24(4):589–608. https://doi.org/10.1016/J.CSL.2009.08.002
38. Jin C, Jin SW (2015) Prediction approach of software fault-proneness based on hybrid artificial neural network and quantum particle swarm optimization. Appl Soft Comput J 35:717–725. https://doi.org/10.1016/j.asoc.2015.07.006
39. Turabieh H, Mafarja M, Li X (2019) Iterated feature selection algorithms with layered recurrent neural network for software fault prediction. Expert Syst Appl 122:27–42. https://doi.org/10.1016/J.ESWA.2018.12.033
40. Hassouneh Y, Turabieh H, Thaher T, Tumar I, Chantar H, Too J (2021) Boosted whale optimization algorithm with natural selection operators for software fault prediction. IEEE Access 9:14239–14258. https://doi.org/10.1109/ACCESS.2021.3052149
41. Balogun AO et al (2021) A novel rank aggregation-based hybrid multifilter wrapper feature selection method in software defect prediction. Comput Intell Neurosci. https://doi.org/10.1155/2021/5069016
42. Zain ZM, Sakri S, Ismail NHA, Parizi RM (2022) Software defect prediction harnessing on multi 1-dimensional convolutional neural network structure. Comput, Mater Continua 71(1):1521–1546. https://doi.org/10.32604/cmc.2022.022085
43. Fan G, Diao X, Yu H, Yang K, Chen L (2019) Software defect prediction via attention-based recurrent neural network. Sci Program. https://doi.org/10.1155/2019/6230953
44. Akay R, Akay B (2020) Artificial bee colony algorithm and an application to software defect prediction. In: Nature-inspired methods for metaheuristics optimization, vol. 16, Springer, pp 73–92. https://doi.org/10.1007/978-3-030-26458-1_5
45. Xu Z, Liu J, Yang Z, An G, Jia X (2016) The impact of feature selection on defect prediction performance: an empirical comparison. In: Proceedings - International symposium on software reliability engineering, ISSRE. IEEE Computer Society, pp 309–320. https://doi.org/10.1109/ISSRE.2016.13
46. Liu H, Setiono R (1995) Chi2: feature selection and discretization of numeric attributes. In: Proceedings of 7th IEEE International conference on tools with artificial intelligence. IEEE, Singapore, pp 388–391. https://doi.org/10.1109/TAI.1995.479783
47. Kononenko I (1994) Estimating attributes: analysis and extensions of RELIEF. In: Estimating attributes: analysis and extensions of RELIEF. Springer Verlag, Berlin, pp 171–182. https://doi.org/10.1007/3-540-57868-4_57/COVER
48. Ahmed S, Zhang M, Peng L (2014) Improving feature ranking for biomarker discovery in proteomics mass spectrometry data using genetic programming. Connect Sci 26(3):215–243. https://doi.org/10.1080/09540091.2014.906388
49. Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40(1):16–28. https://doi.org/10.1016/j.compeleceng.2013.11.024
50. Mafarja M, Mirjalili S (2018) Whale optimization approaches for wrapper feature selection. Appl Soft Comput 62:441–453. https://doi.org/10.1016/J.ASOC.2017.11.006
51. Rahnamayan S, Tizhoosh HR, Salama MMA (2008) Opposition versus randomness in soft computing techniques. Appl Soft Comput 8(2):906–918. https://doi.org/10.1016/J.ASOC.2007.07.010
52. Tizhoosh HR (2005) Opposition-based learning: a new scheme for machine intelligence. In: International conference on computational intelligence for modelling, control and automation and international conference on intelligent agents, web technologies and internet commerce (CIMCA-IAWTIC’06), pp 695–701. https://doi.org/10.1109/CIMCA.2005.1631345
53. Wang H, Wu Z, Rahnamayan S, Liu Y, Ventresca M (2011) Enhancing particle swarm optimization using generalized opposition-based learning. Inf Sci (N Y) 181(20):4699–4714. https://doi.org/10.1016/j.ins.2011.03.016
54. Rahnamayan RS, Tizhoosh HR, Salama MMA (2008) Opposition-based differential evolution. IEEE Trans Evol Comput 12(1):64–79. https://doi.org/10.1109/TEVC.2007.894200
55. Ibrahim RA, Elaziz MA, Lu S (2018) Chaotic opposition-based grey-wolf optimization algorithm based on differential evolution and disruption operator for global optimization. Expert Syst Appl 108:1–27. https://doi.org/10.1016/J.ESWA.2018.04.028
56. Malisia AR, Tizhoosh HR (2007) Applying opposition-based ideas to the ant colony system. In: Proceedings of the 2007 IEEE swarm intelligence symposium, SIS 2007, Honolulu, pp 182–189. https://doi.org/10.1109/SIS.2007.368044
57. Wang WL, Li WK, Wang Z, Li L (2019) Opposition-based multi-objective whale optimization algorithm with global grid ranking. Neurocomputing 341:41–59. https://doi.org/10.1016/J.NEUCOM.2019.02.054
58. Ghotra B, McIntosh S, Hassan AE (2017) A large-scale study of the impact of feature selection techniques on defect classification models. In: IEEE International working conference on mining software repositories. IEEE Computer Society, pp 146–157. https://doi.org/10.1109/MSR.2017.18
59. Emary E, Zawbaa HM, Hassanien AE (2016) Binary grey wolf optimization approaches for feature selection. Neurocomputing 172:371–381. https://doi.org/10.1016/j.neucom.2015.06.083
60. Mafarja M, Jarrar R, Ahmad S, Abusnaina AA (2018) Feature selection using Binary Particle Swarm optimization with time varying inertia weight strategies. In: ACM international conference proceeding series, association for computing machinery. https://doi.org/10.1145/3231053.3231071
61. Sumbul G, Cinbis RG, Aksoy S (2019) Multisource region attention network for fine-grained object recognition in remote sensing imagery. IEEE Trans Geosci Remote Sens 57(7):4929–4937. https://doi.org/10.1109/TGRS.2019.2894425
62. Qi X, Li K, Liu P, Zhou X, Sun M (2020) Deep attention and multi-scale networks for accurate remote sensing image segmentation. IEEE Access 8:146627–146639. https://doi.org/10.1109/ACCESS.2020.3015587
63. Li W, Liu K, Zhang L, Cheng F (2020) Object detection based on an adaptive attention mechanism. Sci Rep 10(1):1–13. https://doi.org/10.1038/s41598-020-67529-x
64. Hou Q, Zhou D, Feng J (2021) Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, IEEE Computer Society. pp 13708–13717. https://doi.org/10.1109/CVPR46437.2021.01350
65. Shepperd M, Song Q, Sun Z, Mair C (2013) Data quality: some comments on the NASA software defect datasets. IEEE Trans Software Eng 39(9):1208–1215. https://doi.org/10.1109/TSE.2013.11
66. Jureczko M, Madeyski L (2010) Towards identifying software project clusters with regard to defect prediction. In: ACM international conference proceeding series, pp 1–10. https://doi.org/10.1145/1868328.1868342
67. D’Ambros M, Lanza M, Robbes R (2010) An extensive comparison of bug prediction approaches. In: Proceedings - international conference on software engineering, pp 31–41. https://doi.org/10.1109/MSR.2010.5463279
68. Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. In: IEEE transactions on software engineering, pp 485–496. https://doi.org/10.1109/TSE.2008.35
69. Jayanthi R, Florence L (2019) Software defect prediction techniques using metrics based on neural network classifier. Cluster Comput 22(1):77–88. https://doi.org/10.1007/S10586-018-1730-1/METRICS
70. Li J et al. (2017) Rare event prediction using similarity majority under-sampling technique. In: Soft computing in data science. SCDS 2017 communications in computer and information science. Springer, Singapore, pp 23–39. https://doi.org/10.1007/978-981-10-7242-0_3/COVER
71. Tan M, Tan L, Dara S, Mayeux C (2015) Online defect prediction for imbalanced data. In: 2015 IEEE/ACM 37th IEEE International conference on software engineering. IEEE Computer Society, pp 99–108. https://doi.org/10.1109/ICSE.2015.139
72. Farid AB, Fathy EM, Eldin AS, Abd-Elmegid LA (2021) Software defect prediction using hybrid model (CBIL) of convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM). PeerJ Comput Sci 7:1–22. https://doi.org/10.7717/PEERJ-CS.739
73. Srinivasan K, Fisher D (1995) Machine learning approaches to estimating software development effort. IEEE Trans Software Eng 21(2):126–137. https://doi.org/10.1109/32.345828
74. Wei H, Hu C, Chen S, Xue Y, Zhang Q (2019) Establishing a software defect prediction model via effective dimension reduction. Inf Sci (N Y) 477:399–409. https://doi.org/10.1016/J.INS.2018.10.056
75. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2017) An empirical comparison of model validation techniques for defect prediction models. IEEE Trans Software Eng 43(1):1–18. https://doi.org/10.1109/TSE.2016.2584050
76. Zhu K, Zhang N, Ying S, Wang X (2020) Within-project and cross-project software defect prediction based on improved transfer naive bayes algorithm. Comput, Mater Continua 63(2):891–910. https://doi.org/10.32604/CMC.2020.08096
77. Song Q, Guo Y, Shepperd M (2019) A comprehensive investigation of the role of imbalanced learning for software defect prediction. IEEE Trans Software Eng 45(12):1253–1269. https://doi.org/10.1109/TSE.2018.2836442
78. Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: 2015 IEEE/ACM 37th IEEE International conference on software engineering. IEEE, Florence, pp 789–800. https://doi.org/10.1109/ICSE.2015.91
79. De Carvalho AB, Pozo A, Vergilio S, Lenz A (2008) Predicting fault proneness of classes trough a multiobjective particle swarm optimization algorithm. In: Proceedings - international conference on tools with artificial intelligence. ICTAI, pp 387–394. https://doi.org/10.1109/ICTAI.2008.76
80. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets
81. Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bullet 1(6):80–83
82. Uddin MN, Li B, Ali Z, Kefalas P, Khan I, Zada I (2022) Software defect prediction employing BiLSTM and BERT-based semantic feature. Soft comput 26(16):7877–7891. https://doi.org/10.1007/s00500-022-06830-5
83. Pan C, Lu M, Xu B (2021) An empirical study on software defect prediction using codebert model. Appl Sci (Switzerland) 11:11. https://doi.org/10.3390/app11114793
84. Malhotra R, Khanna M (2018) Threats to validity in search-based predictive modelling for software engineering. IET Software 12(4):293–305. https://doi.org/10.1049/IET-SEN.2018.5143
85. Kondo M, Bezemer CP, Kamei Y, Hassan AE, Mizuno O (2019) The impact of feature reduction techniques on defect prediction models. Empir Softw Eng 24(4):1925–1963. https://doi.org/10.1007/S10664-018-9679-5/METRICS
86. Lu H, Kocaguneli E, Cukic B (2014) Defect prediction between software versions with active learning and dimensionality reduction. In: 2014 IEEE 25th International symposium on software reliability engineering. IEEE Computer Society, pp 312–322. https://doi.org/10.1109/ISSRE.2014.35
87. Malhotra R, Lata K (2021) An empirical study to investigate the impact of data resampling techniques on the performance of class maintainability prediction models. Neurocomputing 459:432–453. https://doi.org/10.1016/J.NEUCOM.2020.01.120
88. Li M, Zhang H, Wu R, Zhou ZH (2012) Sample-based software defect prediction with active and semi-supervised learning. Autom Softw Eng 19(2):201–230. https://doi.org/10.1007/S10515-011-0092-1/METRICS
89. Zhou T, Sun X, Xia X, Li B, Chen X (2019) Improving defect prediction with deep forest. Inf Softw Technol 114:204–216. https://doi.org/10.1016/J.INFSOF.2019.07.003
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.