
A Novel Resampling Technique for Imbalanced Classification in Software Defect Prediction by a Re-Sampling Method with Filtering


Kamal Bashir1, a) and Mohamed Mosadag1
Department of Information Technology, College of Computer Science and Information Technology,
Karary University, Omdurman 12304, Sudan
ABSTRACT One of the greatest difficulties faced by most classifier-learning algorithms is imbalanced data. Some contemporary studies claim, however, that a class imbalance problem is not inherently detrimental, and that the decline in performance is not solely attributable to this issue but also to other aspects of the data distribution, such as the presence of noise and of borderline samples near the class boundaries. To rectify the issue of data imbalance, this study proposes a new hybrid preprocessing technique that handles class imbalance, the presence of noise, and borderline samples in software defect data. The technique combines Synthetic Minority Oversampling Technique (SMOTE) oversampling, instance selection based on Rough Set Theory (RST), and the Iterative-Partitioning Filter (IPF). The strategy begins by applying SMOTE, which synthesizes artificial minority-class examples through linear interpolation between a defect-prone instance and one of its defect-prone k-nearest neighbors. Then, majority-class examples and newly generated minority-class examples that do not belong to the lower approximation of their class are removed using RST. Finally, IPF is applied to clean the data. We analyze the efficacy of our algorithm in experiments in which the learning algorithm is the C4.5 classifier, and statistical tests show the superiority of the proposed method over state-of-the-art sampling methods.

Received: 12 October 2024    Accepted: 09 November 2024
DOI: https://doi.org/10.71107/1v443128

a) Electronic mail: [email protected]

I. INTRODUCTION

By quickly identifying Defect Prone (DP) modules through metric-based classification, Software Defect Prediction (SDP) aims to increase software quality1. SDP attempts to forecast the number of defects as well as the defect-proneness of system components prior to their deployment, in order to reduce maintenance costs and deliver a high-quality product. An efficient method for predicting DP software modules is therefore required for suitable testing: it lowers costs, supports software developers in managing their resources efficiently, and may greatly enhance the software's quality and encourage user trust and dependability throughout the supply chain. Binary prediction models are mostly used to predict a module as either defective or not in order to detect the DP software modules1. The capacity of these learners to classify a module appropriately is partially related to the quality of the training data, where class imbalance is a significant problem. Real-world data such as software defect data are frequently disproportionately skewed, with a small proportion of unusual but interesting DP examples and a large proportion of conventional non-defect-prone (NDP) ones2. Standard machine learning techniques, meanwhile, assume that the training data is fairly evenly partitioned among the classes; a training set with roughly balanced classes therefore improves the performance of classifiers3.

When imbalance is present, standard machine learning classifiers tend to favor the samples from the majority class and simply neglect those from the minority class. Class-imbalanced data causes problems in numerous applications, such as risk analysis, fraud detection, violation prevention, and medical studies. In SDP, for instance, a software development team needs a classification model that predicts the likelihood of defective modules in a subsequent release of the program. DP modules make up only about 2% of the historical data; that is, only a small portion of the data is of this type. A defect prediction model can then report up to 98% accuracy simply by predicting every module to be NDP. Such an outcome arises because the target DP modules are too poorly represented in the training data for the classification model to identify them. Conversely, if a classifier can predict the minority class effectively, industry stakeholders and firms will be able to implement cost-cutting strategies4.

Over the last 10 years, the issue of developing classifiers from imbalanced data and the challenges that arise have attracted much interest from researchers, and many solutions have been proposed3–15; a very good discussion may be found in16. To deal with the data imbalance issue, the Synthetic Minority Over-Sampling Technique (SMOTE) is often employed. Recent studies show, however, that class imbalance is not the only issue and that other properties of the data distribution can also cause low performance. In particular, the presence of noisy and borderline examples is of great importance. Borderline examples are found in the regions near the class separation line, while noisy examples are located inside the regions of other classes, away from their own. Some features intrinsic to SMOTE can aggravate this problem rather than relieve it, and the existing extensions of SMOTE are not well adapted to mitigate it.

FIG. 1: The safe (s), borderline (b) and noisy (n) examples. The line represents the decision boundary between the two classes.

As shown in Fig. 1, examples must be distinguished as safe, borderline, or noisy to avoid confusion of terms. Safe examples reside in regions where the class labels are mostly homogeneous. Noisy examples are examples of one class lying inside the safe region of the other class; as noted in17, it may be best to consider them as affected by class label noise. Finally, borderline instances lie in the region close to the classification boundary between the minority and majority classes, or where the shape of the boundary is irregular; even slight attribute noise can shift such cases to the wrong side of the decision boundary, making them more difficult to learn17. To the best of our knowledge, the class imbalance is not a problem in and of itself; other factors pertaining to the data's class distribution also contribute to the decline in performance, among them the presence of borderline and noisy examples. Although a great deal of research has investigated, treated, and mitigated each of these issues separately, little has been done to examine datasets in which they occur together. Therefore, our goal is to develop a new preprocessing technique that addresses these issues jointly by utilizing SMOTE, Rough Set Theory (RST) instance selection, and the Iterative-Partitioning Filter (IPF).

In this study, we introduce a new oversampling methodology. First, SMOTE generates synthetic examples. RST then marks any instance that does not fall into the lower approximation of its class as noisy, or as an outlier in the boundary region, and thus not useful for classification, and removes it. Finally, we utilize IPF to wash the remaining noise out of the whole set. This process, referred to as SMOTE-RSTNF, takes care of two major challenges: (1) the imbalanced distribution of classes, addressed by SMOTE oversampling, and (2) the data quality, where the post-processing removes noisy, borderline, and irrelevant examples. The rest of the work is organized as follows. Section II reviews the literature. Section III describes the proposed preprocessing technique. Section IV details the experimental design, and Section V presents and discusses the results. Section VI provides conclusions and suggestions for future research.

Conclusions in Engineering 9
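The accuracy pitfall described above can be made concrete with a minimal sketch (the function name and numbers are illustrative, not from the paper): a baseline that labels every module as non-defect-prone reaches 98% accuracy on data with a 2% defect rate while finding no defects at all.

```python
def majority_baseline(n_modules=1000, defect_rate=0.02):
    """Toy illustration: a model that labels every module as
    non-defect-prone (NDP) on data with ~2% defect-prone (DP) modules."""
    n_dp = int(n_modules * defect_rate)            # defect-prone minority
    y_true = [1] * n_dp + [0] * (n_modules - n_dp)
    y_pred = [0] * n_modules                       # always predict NDP
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_modules
    recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / max(n_dp, 1)
    return accuracy, recall

print(majority_baseline())  # → (0.98, 0.0): high accuracy, zero DP recall
```

This is exactly why the AUC metric, rather than accuracy, is used for evaluation later in the paper.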

II. LITERATURE REVIEW

Scholars have noted that finding a practical solution to misclassification is almost always a challenge, because misclassification is most likely to occur in the overlap area near the decision boundaries, as confirmed in18. For instance, Napierala et al.18 showed in a number of experiments that the number of borderline samples has a direct influence on the classifier's degradation in an imbalanced scenario. The literature has taken two distinct approaches in an attempt to address this issue.

A. Modifications of SMOTE

The first approach changes the directions in which SMOTE generates examples. These directed SMOTE variants generate positive examples toward specified areas of the data space, taking particular data features into account. This category includes the following methods: ADOMS19, ADASYN20, Borderline-SMOTE21, and Safe-Level-SMOTE22. These techniques are intended to generate positive examples only within the region of the positive class or around regions where the density of positive examples is high.

B. Extensions of SMOTE

The second approach combines SMOTE with further data pretreatment, notably filtering. Noise filters are usually used in ordinary classification problems to screen out possibly noisy samples and make the classification boundaries clearer and more definite when defining the training sets17. Empirical analyses of the behavior of filters combined with data balancing methods23 have established the usefulness of integrating filters with oversampling approaches. Extensions of SMOTE in this family include SMOTE-RSB24, SMOTE-FRST25, SMOTE-Tomek Links (TL)26, SMOTE-ENN [Edited Nearest Neighbor Rule (ENN)]27, and SMOTE-IPF17, in which a form of filtering takes place after the SMOTE operation.

Tumar et al.28 proposed an SDP model in which BMFO is combined with ADASYN to overcome data imbalance problems; when used on the PROMISE dataset, the suggested strategy improves the results of numerous classifiers. Rathore et al.29 created three generative oversampling techniques based on conditional generative adversarial networks (CTGAN), vanilla GANs, and the Wasserstein GAN with Gradient Penalty (WGANGP); the experiments are carried out on the PROMISE, JIRA, and Eclipse datasets. When these sampling strategies are used with baseline models on fault assessment datasets, the baseline model results are significantly improved. For intra-release and cross-release SDP, non-linear and linear Bayesian regression were carried out by Singh and Rathore30, integrated with SMOTE data sampling and compared with Random Forest (RF), Support Vector Machine (SVM), Linear Regression (LR), Linear Bayesian Regression (LBR), and Non-linear Bayesian Regression (NLBR). The study demonstrated that, on an independent software product dataset of 46 products, the Bayesian non-linear model outperforms the linear regression algorithms. Elahi et al.31 analyzed a number of ensemble methods used in SDP, applying Logistic Regression (LR), Naive Bayes (NB), multinomial NB, Decision Tree (DT), and K-nearest neighbor (KNN) classifiers on four datasets from the PROMISE repository, with F-measure as the performance measure; the findings were that model averaging outperforms the voting and stacking ensemble approaches. He et al. proposed SHSE32, a combination of hybrid sampling, feature subspaces, and ensemble learning in which the problems of data imbalance are solved with a subspace hybrid sampling approach. In experiments on 27 datasets, SHSE performed better than the other algorithms used in software failure number prediction, and DT performed best when implemented together with SHSE.

To address the data imbalance issue in SDP, Goyal33 proposed an innovative sampling technique known as neighborhood-based undersampling (N-US). ANN, DT, KNN, SVM, and NB classifier models are used in the modeling process on the PROMISE dataset, with accuracy, AUC, and ROC used to measure the models' performance. As evidenced by the study, the classifiers' accuracy increases when the N-US approach is applied. Similarly, Pandey et al.34 used the NASA and PROMISE repositories for SDP; to address the problem of data imbalance, SMOTE combined with Kernel Principal Component Analysis (K-PCA) is further applied to the datasets in order to exclude features that are irrelevant. Compared with conventional fault prediction methods using NB, LR, Multi-Layer Perceptron Neural Network (MLP), and SVM, the K-PCA and SMOTE methods incorporated with Extreme Learning Machine and PCA-ELM

have technically higher ROC indices in this study; the recommended technique provides more objective results than other reliable methods. To overcome the data imbalance problem, Yedida and Menzies35 put forward a new oversampling technique called fuzzy sampling, and an SDP model is developed by means of a deep belief network. The experiment is implemented on the PROMISE dataset and uses AUC, recall, and false alarm rates to assess the performance of the methodologies; the authors find that oversampling is necessary before applying deep learning for SDP. Pandey et al.36 conducted experiments on raw NASA datasets to detect software faults; the datasets are highly imbalanced, so the SMOTE technique is applied to them, and the SqueezeNet and Bottleneck deep learning models are used on the balanced data. Tantithamthavorn et al.37 applied four class rebalancing procedures (oversampling, undersampling, SMOTE, and Random Oversampling Examples (ROSE)) together with the NB, AVNNet, xGBTree, C5.0, RF, LR, and GBM classification algorithms. The study revealed that optimizing the parameters of SMOTE can improve AUC. For defect prediction, Nitin et al.38 used four ensemble techniques (random forest, bagging, random subspace, and boosting) together with SMOTE for handling imbalanced data; the ensembles employ DT, LR, and KNN as base learners, and fifteen datasets from the Eclipse and PROMISE repositories are used in the experiment. Balaram and Vasundra39 proposed a model in which BOA is combined with E-RF linked to ADASYN, evaluated on the PROMISE dataset; BOA is used to address the overfitting issue, and ADASYN the class imbalance issue. Evaluations in specificity, AUC, and sensitivity showed that the proposed E-RF-ADASYN is slightly better than the KNN and DT classifiers. Table I summarizes existing SDP models that used balancing techniques.

III. THE PROPOSED METHOD

In this section, we introduce the SMOTE-RSTNF technique for improving the SMOTE algorithm, in which each step removes noisy and borderline instances that can deteriorate learning performance. In the suggested algorithm, new synthetic minority class instances are first added to the training set using SMOTE. Then, using RST and the lower approximation of a subset, majority class instances and synthetic instances that do not belong to the lower approximation of their class are removed. Finally, the IPF filter eliminates remaining noisy examples, both from the original dataset and from those produced by SMOTE.

A. Synthetic Minority Over-Sampling Technique (SMOTE)

The vast majority of under-sampling and minority over-sampling methods have been discussed elaborately in the literature on data sampling. This work employs SMOTE2, an algorithm that generates new synthetic instances of the minority class, not in the data space but in the feature space. A SMOTE instance is given by S = X + u × (X′ − X) with 0 ≤ u ≤ 1, where X and X′ are two similar samples belonging to the minority class; X′ is randomly selected from the k nearest neighbors of X in the minority class. The construction of the new examples expands the coverage and generality of the minority class while reducing the rarity of its occurrence.

B. Rough Set Analysis

In rough set analysis40, data is represented as an information system S = (X, A), where X = {x1, . . . , xn} and A = {a1, . . . , am} are the non-empty and finite sets of objects and attributes, respectively. There is a mapping a : X → Va for every a ∈ A, where Va is the value set of attribute a. The B-indiscernibility relation RB is defined with regard to any subset B ⊆ A:

RB = {(x, y) ∈ X² : a(x) = a(y), ∀a ∈ B}   (1)

RB is an equivalence relation, which creates a partition of the universe X, denoted X/RB. The equivalence class of x is [x]RB = {y ∈ X : (x, y) ∈ RB}, and X/RB = {[x]RB : x ∈ X}. Given U ⊆ X, the lower and upper approximations w.r.t. RB are determined by

RB↓U = {x ∈ X | [x]RB ⊆ U}   (2)

RB↑U = {x ∈ X | [x]RB ∩ U ≠ ∅}   (3)

In the context of classification, a decision system (X, A ∪ {d}) is a special type of information system, where the designated attribute d (d ∉ A) is called the decision attribute. The decision classes with respect to d are given by X/Rd = {[x]Rd : x ∈ X}. Given B ⊆ A, the objects from X for which the values of B allow unambiguous prediction of the decision class are included in the B-positive region POSB:

POSB = ⋃x∈X RB↓[x]Rd   (4)

In fact, if x ∈ POSB, then any object that shares the same values as x for every attribute in B is likewise a member of the same decision class as x.
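The first two stages of the pipeline (Secs. III A and III B) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance for the neighbor search and a user-supplied `discretize` function that induces the equivalence classes of Eq. (1) on continuous attributes.

```python
import random
from collections import defaultdict

def smote(minority, k=5, n_new=100, rng=random.Random(0)):
    """SMOTE step (Sec. III A): s = x + u*(x' - x), 0 <= u <= 1,
    with x' drawn from the k nearest minority neighbours of x."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m != x),
                            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
        x2 = rng.choice(neighbours)
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, x2)))
    return synthetic

def lower_approximation_filter(samples, labels, discretize):
    """RST step (Sec. III B): keep a sample only if its whole equivalence
    class (same discretized attribute values, Eq. (1)) carries one label,
    i.e. the sample lies in the lower approximation of its class (Eq. (2))."""
    labels_seen = defaultdict(set)
    for s, y in zip(samples, labels):
        labels_seen[discretize(s)].add(y)
    return [(s, y) for s, y in zip(samples, labels)
            if len(labels_seen[discretize(s)]) == 1]
```

For example, with minority points (0.0,) and (1.0,), every synthetic sample lies on the segment between them; and a sample whose equivalence class contains both labels is discarded as a boundary-region instance.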

The following number, the degree of dependence of d on B, represents the predictive ability w.r.t. d of the attributes in B:

γB = |POSB| / |X|   (5)

(X, A ∪ {d}) is called consistent if γA = 1. A subset B of A is referred to as a decision reduct if it meets two requirements: (1) POSB = POSA, meaning that B maintains A's ability to discern the decision classes; (2) it cannot be reduced further, that is, POSB′ = POSA does not hold for any proper subset B′ of B. We refer to B as a decision superreduct if the latter requirement is lifted, that is, if B is not necessarily minimal.

C. Iterative Partitioning Filtering Based Noise Filtering

IPF originates from the work of41. The IPF method eliminates noisy instances over a number of iterations until a stopping criterion is reached: the process stops when, for k consecutive iterations, the number of identified noisy examples in each iteration is less than a percentage p of the size of the original training set. The approach starts with an empty set of noisy instances A = ∅. The basic steps of each iteration are as follows:

• Divide the current training dataset E into n equal subsets.
• Build a classifier with the C4.5 method on each of these n subsets and use it to evaluate the whole current training set E.
• Identify the noisy examples in E through a voting scheme (consensus or majority) and add them to A.
• Remove the noisy examples: E = E \ A.

Two voting techniques can be employed to identify noisy examples: consensus and majority. The former removes an example if it is misclassified by all the classifiers, while the latter eliminates an example if it is misclassified by more than half of the classifiers. The parameter setting used to implement IPF in this study has been found to suit the degree of balance and the level of noisy and borderline samples in imbalanced datasets once preprocessed with SMOTE. Specifically, the majority scheme is applied to recognize the noisy examples, the number of partitions is n = 9, the stop criterion uses k = 3 iterations, and p = 1% of deleted examples. This parameter configuration is based on scholarly investigations into the effect of these parameters on the results.

IV. EXPERIMENTAL DESIGN

In this work, each training dataset is divided randomly into ten mutually exclusive subsets (the folds) of nearly equal size. In the 10-fold procedure, nine folds are chosen to train the models, which are then tested on the remaining fold; this continues until every fold has been used alternately for testing and training. Empirical data is generated and analyzed using WEKA version 3.8.1, MATLAB R2016a, the KEEL software tool, and the R statistical program. The C4.5 learning algorithm is applied through the WEKA tools with the default settings. The AUC statistic measures how well the constructed models perform in terms of classification. Datasets: all the datasets in the experiment, except for JDT and PDE which are mined from42, were derived from public software project data repositories43. Table II presents the characteristics of the data.

FIG. 2: Average AUC over all datasets for the C4.5 classifier.

V. RESULTS AND DISCUSSION

To evaluate the results obtained in this paper, several statistical tests are run to compare our method with the case without preprocessing and with eleven other preprocessing techniques selected from the literature. The outcomes of the experimental investigation for the test partitions are summarized in Table III, where the best approach is highlighted for each data set and the overall averages appear in the last rows.
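The filtering loop described in Sec. III C can be sketched as follows. This is a minimal stand-in, not the authors' implementation: `train` is a placeholder that may build any classifier (the paper uses C4.5), and the majority voting scheme is hard-coded.

```python
import random

def ipf(dataset, train, n=9, k=3, p=0.01, rng=random.Random(0)):
    """Iterative-Partitioning Filter sketch (Sec. III C). `dataset` is a
    list of (x, y) pairs; `train(pairs)` must return a predict(x) -> y
    callable. Stops after k consecutive iterations that each flag fewer
    than p * |E0| noisy examples (majority voting)."""
    data = list(dataset)
    size0 = len(data)
    calm = 0  # consecutive iterations with few removals
    while calm < k and data:
        shuffled = data[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::n] for i in range(n)]      # n near-equal subsets
        models = [train(f) for f in folds if f]         # one model per subset
        noisy = [ex for ex in data
                 if sum(m(ex[0]) != ex[1] for m in models) > len(models) / 2]
        data = [ex for ex in data if ex not in noisy]   # E = E \ A
        calm = calm + 1 if len(noisy) < p * size0 else 0
    return data
```

With a degenerate learner that always predicts the majority label, the loop removes the examples every model misclassifies and then terminates once k quiet iterations have passed.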

TABLE II: Characteristics of datasets.

Datasets   #Modules   #Attributes   FP     NFP    Defect Ratio
ant-1.7    745        21            166    579    3.48
arc        234        21            27     207    7.67
EQ         404        62            136    268    1.97
jm1        7782       22            1672   6110   3.65
LC         691        62            64     627    9.79
PDE        1497       62            209    1288   6.16
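The 10-fold protocol and the AUC metric described in Sec. IV can be sketched without any ML library (the helper names below are ours, not from the paper):

```python
import random

def stratified_folds(y, n_folds=10, rng=random.Random(0)):
    """Split indices into n_folds mutually exclusive folds of nearly equal
    size, preserving each class's proportion (the protocol of Sec. IV)."""
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % n_folds].append(i)
    return folds

def auc(y_true, scores):
    """AUC via the rank (Mann-Whitney) formulation: the probability that a
    random defect-prone module is scored above a random clean one."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Each fold in turn serves as the test partition while the other nine are used for training, and the reported figure is the AUC averaged over the ten test partitions.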

Because the proposed method outperforms the other methods in AUC in four of the six software defect datasets, we can see how efficient it is. Moreover, the improvements achieved by BL-SMOTE and MWMOTE over plain SMOTE show how essential the cleaning phase of the oversampling process is to enhance behavior at classification time. As shown in Figs. 2 and 3, all the preprocessing techniques ultimately surpass the results obtained with the original data sets, as expected. The rankings of the algorithms on each of the data sets selected for this analysis are given in Table IV; as the reader can see, our proposal wins first place four times, third place once, and fifth place once.

FIG. 3: The AUC classification results over all datasets for the C4.5 classifier.

The results are further examined with a multiple comparison of our proposal and the other preprocessing procedures. Table V reports the average rank of each method in the Friedman test: our method obtains the best ranking, while the last positions correspond to ADASYN, SL-SMOTE, and the unpreprocessed (Normal) data. To declare a statistical difference between the methods, a p-value close to zero must be found, and a post hoc multiple comparison test is then used to identify the most effective preprocessing method.

TABLE V: Average ranks obtained by each method in the Friedman test.

Algorithm         Ranking
Proposed method   2
BL-SMOTE          3.1667
MWMOTE            3.5833
ADOMS             4.0833
SMOTE-ENN         6.1667
SMOTE-FRS         6.3333
SMOTE-TL          6.3333
SMOTE-IPF         7.5
SMOTE             8.3333
SMOTE-RSB         8.5
ADASYN            10.5
SL-SMOTE          11.6667
Normal            12.8333

A. Statistical Test

The results of the Holm procedure comparing our proposal with the other methods are presented in Table VI. The z-value obtained for each comparison determines the order of the algorithms; the corresponding p-value, computed from the normal distribution, is compared with the adjusted threshold in the same row of the table, and the hypothesis of equal behavior is rejected in favor of the best-ranking algorithm whenever the p-value falls below that threshold. These statistical results consistently demonstrate that our strategy is better than every other strategy we analyzed.
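The rank statistics behind Tables V and VI can be recomputed with a short routine. For simplicity, this sketch assumes no tied AUC values within a dataset; the input matrix and helper names are illustrative.

```python
def friedman(results):
    """results[i][j]: AUC of algorithm j on dataset i (higher is better;
    ties within a row are assumed absent). Returns the average rank of
    each algorithm (as in Table V) and the Friedman chi-square statistic."""
    n, k = len(results), len(results[0])
    avg = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: -row[j])  # best gets rank 1
        for rank, j in enumerate(order, start=1):
            avg[j] += rank / n
    chi2 = 12 * n / (k * (k + 1)) * sum(r * r for r in avg) - 3 * n * (k + 1)
    return avg, chi2

def holm_thresholds(m, alpha=0.05):
    """Step-down Holm thresholds alpha/m, alpha/(m-1), ..., alpha, matching
    the last column of Table VI (m = comparisons against the control)."""
    return [alpha / (m - i) for i in range(m)]
```

With m = 12 comparisons and alpha = 0.05 the thresholds run from 0.05/12 ≈ 0.004167 up to 0.05, which is exactly the Holm column of Table VI.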
TABLE I: Summary of SDP Models Handling Class Imbalance Issues.

Author                           Dataset                  ML Algorithm                              Sampling Techniques
Tumar et al., 201928             PROMISE                  KNN, DT, LDA, NB, RF, Dynamic Adaboost    Adaptive synthetic sampling
Elahi et al., 202031             PROMISE                  ANN, DT, KNN, SVM, NB                     EasyEnsemble (ensemble random under-sampling)
Goyal, 202133                    NASA                     NB, RF, LR, DT, KNN                       Neighbourhood-based under-sampling (N-US)
Rathore et al., 202229           ECLIPSE, JIRA, PROMISE   NB, RF, LR, DT, KNN                       GAN-based oversampling: WGANGP, CTGAN, vanilla GAN
Singh and Rathore, 202230        PROMISE, AEEEM, JIRA     RF, SVM, LR, LBR, NLBR, DT Regressor      SMOTE
Tong et al., 202232              AEEEM, MetricsRepo       DT                                        Random under-sampling, SMOTER
Pandey et al., 202034            PROMISE, NASA            K-PCA-ELM, SVM, LR, Naive Bayes, MLP      SMOTE
Yedida and Menzies, 202135       PROMISE                  Deep belief network                       Fuzzy sampling
Pandey et al., 202236            NASA                     SqueezeNet, BottleNeck                    SMOTE
Tantithamthavorn et al., 202037  NASA, PROMISE            RF, LR, NB, GBM, C5.0, xGBTree, AVNNet    Over-sampling, under-sampling, SMOTE, ROSE
Nitin et al., 202138             PROMISE, ECLIPSE         Ensemble learning (DT, LR, KNN)           SMOTE
A. Balaram and Vasundra, 202239  PROMISE                  Ensemble RF                               Adaptive synthetic sampling

TABLE III: AUC Classification Results Over All Datasets for the C4.5 Classifier.

Dataset Normal SMOTE ADASYN ADOMS MWMOTE SL-SMOTE BL-SMOTE SMOTE-ENN SMOTE-RSB SMOTE-FRS SMOTE-TL SMOTE-IPF Proposed Method
ant-1.7 0.665 0.781 0.745 0.846 0.809 0.741 0.815 0.796 0.765 0.847 0.799 0.778 0.808
arc 0.502 0.836 0.813 0.864 0.88 0.783 0.931 0.843 0.75 0.856 0.869 0.833 0.883
EQ 0.686 0.75 0.726 0.771 0.779 0.702 0.785 0.842 0.739 0.646 0.836 0.799 0.851
jm1 0.617 0.7 0.672 0.819 0.833 0.695 0.795 0.757 0.728 0.692 0.734 0.756 0.872
LC 0.654 0.844 0.838 0.92 0.927 0.737 0.937 0.797 0.808 0.904 0.835 0.866 0.95
PDE 0.577 0.805 0.794 0.879 0.879 0.705 0.869 0.854 0.831 0.807 0.803 0.85 0.908
Average 0.617 0.786 0.765 0.85 0.851 0.72 0.855 0.815 0.782 0.797 0.815 0.805 0.879
Winner 0 0 0 0 0 0 1 0 0 1 0 0 4

TABLE IV: Performance ranking for the preprocessing algorithms.

Dataset 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th 13th
ant-1.7 SMOTE-FRS ADOMS BL-SMOTE MWMOTE Proposed Method SMOTE-TL SMOTE-ENN SMOTE SMOTE-IPF SMOTE-RSB ADASYN SL-SMOTE Normal
arc BL-SMOTE SMOTE-FRS Proposed Method MWMOTE SMOTE-TL ADOMS SMOTE-ENN SMOTE SMOTE-IPF ADASYN SL-SMOTE SMOTE-RSB Normal
EQ Proposed Method SMOTE-ENN SMOTE-TL SMOTE-IPF BL-SMOTE MWMOTE ADOMS SMOTE SMOTE-RSB ADASYN SL-SMOTE Normal SMOTE-FRS
jm1 Proposed Method MWMOTE ADOMS BL-SMOTE SMOTE-ENN SMOTE-IPF SMOTE-TL SMOTE-RSB SMOTE SMOTE-FRS ADASYN SL-SMOTE Normal
LC Proposed Method BL-SMOTE MWMOTE ADOMS SMOTE-FRS SMOTE-RSB SMOTE-IPF SMOTE-TL SMOTE ADASYN SMOTE-ENN SL-SMOTE Normal
PDE Proposed Method ADOMS MWMOTE BL-SMOTE SMOTE-ENN SMOTE-RSB SMOTE-FRS SMOTE SMOTE-TL SMOTE-IPF ADASYN SL-SMOTE Normal

datasets preprocessed with SMOTE containing some


14
Conclusions in Engineering 15

TABLE VI: Post Hoc comparison for A= 0:05, our proposed technique is the control method.

i algorithm z = (R0 − Ri )/SE p Holm


12 Normal 4.818121 0.000001 0.004167
11 SL-SMOTE 4.299246 0.000017 0.004545
10 ADASYN 3.780372 0.000157 0.005
9 SMOTE-RSB 2.890872 0.003842 0.005556
8 SMOTE 2.816747 0.004851 0.00625
7 SMOTE-IPF 2.446123 0.01444 0.007143
6 SMOTE-FRS 1.927248 0.053949 0.008333
5 SMOTE-TL 1.927248 0.053949 0.01
4 SMOTE-ENN 1.853123 0.063865 0.0125
3 ADOMS 0.926562 0.354154 0.016667
2 MWMOTE 0.704187 0.481316 0.025
1 BL-SMOTE 0.518875 0.603848 0.05
Holm’s procedure rejects those hypotheses which the p-value ≤ 0:007143

noise and borderline examples. This fact indicates more efficient noise filtering, since instances that were discarded in the first stage do not interfere with the detection process at the next stage. Furthermore, the ensemble nature of IPF combined with RST allows it to pool together the predictions made by different classifiers, leading to a better estimation of hard-to-classify noisy samples than relying on a single classifier.

IPF + RST is effective at removing noise and borderline samples because these characteristics are its main strength, rather than the versatility of the lower-ranked noise filters. Most of the noise filters paired with SMOTE, such as ENN or TL, define noise in terms of an instance's nearest neighbors rather than in terms of the two classes. Even though this issue has not received attention in the literature, it can be a source of problems: since SMOTE places a new positive example relative to a nearest neighbor, faulty synthetic examples are likely to escape detection by noise filters that themselves rely on nearest neighbors. While such filters will flag some instances as noise, identification methods such as IPF and RST, which are based on more complex criteria, allow samples with similar characteristics to be grouped together. This solves the problem mentioned earlier and makes outliers considerably easier to spot. Parameter selection can be regarded as one of IPF's major weaknesses, as it has many parameters and their values dictate most of the filter's performance. From our numerous experiments, however, we may draw some conclusions as to the effects of the different parameters on the performance results. Since there are enough noisy as well as borderline examples relative to the number of safe examples, we have confirmed that the majority voting scheme is superior to the consensus scheme. The consensus scheme has been noted to be quite conservative with regard to deleting examples, and it does not delete enough of them to significantly improve performance.

VI. CONCLUSION

When it comes to learning from imprecise data, the presence of noise and outliers remains a challenge and an active area of research. In this work, we proposed SMOTE-RSTNF, a combined preprocessing technique for imbalanced and noisy data. After applying SMOTE to generate synthetic samples, we employed RST to discard the original majority samples and the synthetic samples that were not part of the lower approximation of their class. IPF is then applied to clean up the whole data set. Using real-world software datasets, the suggested methodology was employed to build classifiers that can identify defective modules. The techniques were evaluated with the C4.5 classifier, and performance was measured using the AUC metric. The experimental results show that our proposed approach performs better than the existing methods reported in the current literature. The statistical analysis based on Friedman's scheme and the associated post hoc tests showed that the preference for the suggested strategy is statistically significant across the independent experimental investigations. In our future work, we plan to address other problems associated with data, such as class overlapping, which is not considered in this work and whose treatment may enhance performance further. We also plan to apply boosting to bring greater improvements to implementing the
suggested method for SDP.

DECLARATION OF COMPETING INTEREST

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

REFERENCES

1 S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, "Benchmarking classification models for software defect prediction: A proposed framework and novel findings," IEEE Transactions on Software Engineering 34, 485–496 (2008).
2 N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research 16, 321–357 (2002).
3 K. Bashir, T. Li, C. W. Yohannese, and Y. Mahama, "Enhancing software defect prediction using supervised-learning based framework," in Proceedings of the 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE) (IEEE, 2017) pp. 1–6.
4 K. Bashir, T. Li, C. W. Yohannese, and M. J. Yahaya, "SMOTEFRIS-INFFC: Handling the challenge of borderline and noisy examples in imbalanced learning for software defect prediction," Journal of Intelligent & Fuzzy Systems 38, 917–933 (2020).
5 K. Bashir, T. Li, and C. W. Yohannese, "An empirical study for enhanced software defect prediction using a learning-based framework," International Journal of Computational Intelligence Systems 12, 282–298 (2018).
6 C. W. Yohannese and T. Li, "A combined-learning based framework for improved software fault prediction," International Journal of Computational Intelligence Systems 10, 647–662 (2017).
7 M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 463–484 (2011).
8 T. M. Khoshgoftaar, K. Gao, and N. Seliya, "Attribute selection and imbalanced data: Problems in software defect prediction," in Proceedings of the 2010 22nd IEEE International Conference on Tools with Artificial Intelligence, Vol. 1 (IEEE, 2010) pp. 137–144.
9 D. Van Nguyen, K. Ogawa, K.-i. Matsumoto, and M. Hashimoto, "Editing training sets from imbalanced data using fuzzy-rough sets," in Artificial Intelligence Applications and Innovations: 11th IFIP WG 12.5 International Conference, AIAI 2015, Bayonne, France, September 14–17, 2015, Proceedings (Springer, 2015) pp. 115–129.
10 C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans 40, 185–197 (2009).
11 C. W. Yohannese, T. Li, M. Simfukwe, and F. Khurshid, "Ensembles based combined learning for improved software fault prediction: A comparative study," in Proceedings of the 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE) (IEEE, 2017) pp. 1–6.
12 S. Wang and X. Yao, "Using class imbalance learning for software defect prediction," IEEE Transactions on Reliability 62, 434–443 (2013).
13 C. W. Yohannese, T. Li, and K. Bashir, "A three-stage based ensemble learning for improved software fault prediction: An empirical comparative study," International Journal of Computational Intelligence Systems 11, 1229–1247 (2018).
14 K. Bashir, T. Li, C. W. Yohannese, M. Yahaya, and T. Ali, "A novel preprocessing approach for imbalanced learning in software defect prediction," in Data Science and Knowledge Engineering for Sensing Decision Support: Proceedings of the 13th International FLINS Conference (FLINS 2018) (World Scientific, 2018) pp. 500–508.
15 K. Bashir, S. Pirasteh, H. Abdelrhman, M. Mosadag, and A. Mohammed, "An enhanced feature selection approach for breast cancer prediction using a hybrid framework," Journal of Karary University for Engineering Science (2024), DOI not available.
16 H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering 21, 1263–1284 (2009).
17 J. A. Sáez, J. Luengo, J. Stefanowski, and F. Herrera, "SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering," Information Sciences 291, 184–203 (2015).
18 K. Napierała, J. Stefanowski, and S. Wilk, "Learning from imbalanced data in presence of noisy and borderline examples," in Rough Sets and Current Trends in Computing: 7th International Conference, RSCTC 2010, Warsaw, Poland, June 28–30, 2010, Proceedings (Springer, 2010) pp. 158–167.
19 S. Tang and S.-P. Chen, "The generation mechanism of synthetic minority class examples," in 2008 International Conference on Information Technology and Applications in Biomedicine (IEEE, 2008) pp. 444–447.
20 H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (IEEE, 2008) pp. 1322–1328.
21 H. Han, W. Wang, and B. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," in International Conference on Intelligent Computing (Springer, 2005) pp. 878–887.
22 C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, "Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem," in Advances in Knowledge Discovery and Data Mining (Springer, 2009) pp. 475–482.
23 V. García, J. Sánchez, and R. A. Mollineda, "An empirical study of the behavior of classifiers on imbalanced and overlapped data sets," in Progress in Pattern Recognition, Image Analysis and Applications (2007) pp. 397–406.
24 E. Ramentol, Y. Caballero, R. Bello, and F. Herrera, "SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory," Knowledge and Information Systems 33, 245–265 (2012).
25 E. Ramentol, N. Verbiest, R. Bello, Y. Caballero, C. Cornelis, and F. Herrera, "SMOTE-FRST: A new resampling method using fuzzy rough set theory," in Uncertainty Modeling in Knowledge Engineering and Decision Making (World Scientific, 2012) pp. 800–805.
26 G. E. Batista, A. L. Bazzan, and M. C. Monard, "Balancing training data for automated annotation of keywords: A case study," WIT Transactions on Information and Communication Technologies 29, 10–18 (2003).
27 G. E. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," ACM SIGKDD Explorations Newsletter 6, 20–29 (2004).
28 I. Tumar, Y. Hassouneh, and H. Turabieh, "Enhanced binary moth flame optimization as a feature selection algorithm to predict software fault prediction," IEEE Access 8, 8041–8055 (2020).
29 S. S. Rathore, S. Chouhan, D. Jain, and A. Vachhani, "Generative oversampling methods for handling imbalanced data in software fault prediction," IEEE Transactions on Reliability 71, 747–762 (2022).
30 R. Singh and S. S. Rathore, "Linear and non-linear Bayesian regression methods for software fault prediction," International Journal of System Assurance Engineering and Management 13, 1864–1884 (2022).
31 E. Elahi, S. Kanwal, and A. N. Asif, "A new ensemble approach for software fault prediction," in 2020 17th International Bhurban Conference on Applied Sciences and Technology (IBCAST) (IEEE, 2020) pp. 407–412.
32 H. Tong, W. Lu, W. Xing, B. Liu, and S. Wang, "SHSE: A subspace hybrid sampling ensemble method for software defect number prediction," Information and Software Technology 142, 106747 (2022).
33 S. Goyal, "Handling class-imbalance with KNN (neighbourhood) under-sampling for software defect prediction," Artificial Intelligence Review 55, 2023–2064 (2022).
34 S. K. Pandey, D. Rathee, and A. K. Tripathi, "Software defect prediction using K-PCA and various kernel-based extreme learning machine: An empirical study," IET Software 14, 768–782 (2020).
35 R. Yedida and T. Menzies, "On the value of oversampling for deep learning in software defect prediction," IEEE Transactions on Software Engineering 48, 3103–3116 (2021).
36 S. K. Pandey, A. Haldar, and A. K. Tripathi, "Is deep learning good enough for software defect prediction?" Innovations in Systems and Software Engineering, 1–16 (2023).
37 C. Tantithamthavorn, A. E. Hassan, and K. Matsumoto, "The impact of class rebalancing techniques on the performance and interpretation of defect prediction models," IEEE Transactions on Software Engineering 46, 1200–1219 (2018).
38 Nitin, K. Kumar, and S. S. Rathore, "Analyzing ensemble methods for software fault prediction," in Advances in Communication and Computational Technology: Select Proceedings of ICACCT 2019 (Springer, 2021) pp. 1253–1267.
39 A. Balaram and S. Vasundra, "Prediction of software fault-prone classes using ensemble random forest with adaptive synthetic sampling algorithm," Automated Software Engineering 29, 6 (2022).
40 Z. Pawlak, "Rough sets," International Journal of Computer & Information Sciences 11, 341–356 (1982).
41 T. M. Khoshgoftaar and P. Rebours, "Improving software quality prediction by noise filtering techniques," Journal of Computer Science and Technology 22, 387–396 (2007).
42 T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," (2012), accessed: 2025-02-10.
43 M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: A benchmark and an extensive comparison," Empirical Software Engineering 17, 531–577 (2012).