
Chapter 3

Handling Imbalanced Ratio for Class Imbalance Problem Using SMOTE

Nurulfitrah Noorhalim, Aida Ali and Siti Mariyam Shamsuddin

Abstract There are many issues in dataset classification. One such issue is class imbalance, which often occurs with extreme skewness across many real-world domains and presents one of the fundamental difficulties in forming robust classifiers. In this paper, a sampling method is used to evaluate the classification performance of the k-NN and C4.5 classifiers under ten-fold cross-validation. Experimental results show that sampling greatly benefits classification performance on class imbalance problems by improving the class boundary region, especially for extremely imbalanced datasets (those with an extreme imbalance ratio). This result demonstrates that class imbalance affects many real-world application domains.

Keywords Sampling · Synthetic minority over-sampling · Imbalanced dataset

1 Introduction

The use of today’s technology has contributed significantly to the generation of millions of data records within seconds in scientific research, business, and industry. This has created a huge opportunity, especially for researchers, to analyse and interpret data into forms that affect an organisation’s profitability, particularly in consumer-specific domains, owing to growth in the science and technology field. This phenomenon has grown drastically and gained substantial popularity, largely owing to heavy interest in artificial intelligence research, particularly machine learning, which can be very useful in administering customer
N. Noorhalim (✉) · A. Ali · S. M. Shamsuddin
Faculty of Computing, Universiti Teknologi Malaysia, UTM, 81310 Skudai, Johor, Malaysia
e-mail: nurulfi[email protected]
A. Ali
e-mail: [email protected]
S. M. Shamsuddin
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2019
L.-K. Kor et al. (eds.), Proceedings of the Third International Conference on Computing, Mathematics and Statistics (iCMS2017), https://doi.org/10.1007/978-981-13-7279-7_3

relationship management (CRM). For instance, collected business data can be used
and classified in numerous ways including, in the form of descriptive analytics,
diagnostic analytics, prescriptive analytics, and predictive analytics.
However, several issues surround dataset classification, such as imbalanced class classification, which often involves extreme skewness across numerous real-world domains and constitutes one of the fundamental difficulties in forming a robust classifier [1–4]. Class imbalance refers to a situation where the number of minority class instances (positive class) is far smaller than the number of majority class instances (negative class), so the minority class is not adequately represented.
According to Chawla et al. [4], class imbalance emerged as science developed into applied machine-learning technology. From then on, various classifier learning algorithms were developed that assume datasets possess a comparatively balanced distribution, neglecting the serious complexity of imbalanced class distributions [5, 6]. Such imbalanced, real-world classification problems are found across domains such as diagnostic problems [7, 8], software quality prediction [9], software defect prediction [10], and activity recognition [11]. Given that extracting knowledge from such data benefits many general practitioners, as indicated by Bach et al. [12], focusing on the class imbalance topic is crucial.
There are substantial cases where learning classifiers must perform two-class classification on imbalanced datasets, when the natural distribution of the data is dominated by one class, i.e. when the negative (majority) class far exceeds the positive (minority) class. For instance, a dataset of breast tumour patients poses a two-class classification problem for breast cancer diagnosis, comprising benign and malignant cases. In medical practice, fewer patients are diagnosed with malignant cancer (minority class) than with benign tumours (majority class). This lack of minority-class instances can make precise classification difficult to obtain.
The skewness of data distributions has inspired the formation of more robust classifiers, which is a basic problem in data mining [6, 13]. In addition, standard classifiers can be biased in their evaluation, typically producing poor performance on the minority class and good performance on the majority class, because such classifiers tend to treat the scarce positive samples as noise and ignore them during the learning process [14, 15]. Classification accuracy on imbalanced data also degrades in the presence of related difficulties, including (i) borderline samples, (ii) small disjuncts, (iii) noisy data, (iv) a small number of training data, (v) overlapping instances, and (vi) differences between the test and training distributions (dataset shift).
There are two families of methods for handling the class imbalance problem: data-level and algorithm-level methods. Typically, data-level methods run as pre-processing steps to adjust unbalanced datasets and reduce the skewness of the distributions using some form of sampling [16–18]. According to Lee and Sheen [16], random sampling imposes limitations on most data pre-processing; large differences in pre-processing results can leave the data distribution severely biased. Meanwhile, algorithm-level methods involve

modifying algorithms, such as ensemble approaches and cost-sensitive learning, so that they can manage imbalanced class distributions [19, 20]. Ali et al. [1] claimed that, even though both approaches are able to rebalance the data distribution, theoretical gaps remain in existing approaches with respect to all the aforementioned problems.

1.1 Imbalanced Ratio

In class imbalance classification, several challenges may be faced during the classification process. Difficulties often arise when a dataset has an imbalanced distribution between its classes, the minority (positive) class and the majority (negative) class [21, 22], quantified by the imbalance ratio. The imbalance ratio (IR) is the proportion of the number of majority class (negative) samples to the number of minority class (positive) samples [15, 23]. In binary-class datasets, the problem appears when the majority class samples dominate the distribution over the minority class samples, even though the minority class is the class of interest in classification. Under this condition, the minority class tends to have very low predictive accuracy [24], although, as many research works note, the errors that come from the minority class are the ones that matter most [21]. This scenario shows that most datasets used in classification have an imbalance problem among their classes. It occurs widely in studies involving a lack of training samples in data classification [13], which is another facet of the class imbalance problem. Class imbalance can also arise when a dataset has an insufficient number of samples, which in turn hinders the identification of pattern regularities [1] that would help improve the decision boundary between the classes [25]. The best classification performance requires a balanced distribution and a sufficient number of training samples to supply useful knowledge for the learning process; otherwise, classification performance degrades.
Another issue is that most standard learning algorithms assume that all datasets have the same balanced class distribution [6]. The related class overlapping problem has a high tendency to affect classifier performance even more strongly than the imbalanced distribution itself [1, 26–31].
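As a worked illustration of the definition above, the imbalance ratio can be computed directly from the class counts. This is a minimal sketch for exposition, not the authors' code:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR = (# majority-class samples) / (# minority-class samples)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# A toy binary dataset: 9 negative (majority) and 3 positive (minority) labels.
y = ["negative"] * 9 + ["positive"] * 3
print(imbalance_ratio(y))  # 3.0
```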

2 Data Collection

Twelve datasets were used in this experiment, all available from the KEEL dataset repository [32]. The datasets were chosen because they possess an imbalanced ratio (IR) between their positive and negative classes. They have been used and reported in previous studies such as [22–25], applying SMOTE and other sampling techniques to different types of datasets: datasets with a mild imbalance ratio between 1.5 and 9, and datasets with an extreme imbalance ratio of more than nine. The datasets were divided this way to identify the extent to which SMOTE helps C4.5 and k-NN. Apart from that, the different ranges of IR help identify whether SMOTE only helps in classifying extremely imbalanced datasets, or both mild and extreme ones, since different IR values yield different distribution patterns under the applied sampling method. The datasets used in this experiment have imbalance ratios between 1.5 and 9; 12 of the 22 available datasets were randomly selected, as listed in Table 1. All datasets come pre-partitioned for the test phase using five-fold stratified cross-validation, ensuring a sufficient number of minority samples per partition.
Referring to Table 1, the datasets are arranged from lowly to highly imbalanced, in ascending order of imbalance ratio (IR), with additional details including the number of instances (#Ex), the number of attributes (#Atts) broken into real (R), integer (I), and nominal (N) valued, and the number of classes (counting both minority and majority classes).
Table 1 lists 12 different datasets with different IR values, all less than nine, and different numbers of instances. All datasets have two classes, a minority (positive) class and a majority (negative) class. Most attributes are real-valued; no nominal attributes appear in these datasets.
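The per-dataset IR values in Table 1 can be cross-checked against the instance counts: given the total #Ex and the IR, the implied class sizes follow from n_min = #Ex / (1 + IR). A small sketch (the helper is illustrative, not from the chapter):

```python
def class_counts(n_examples, ir):
    """Majority/minority class sizes implied by IR = n_maj / n_min."""
    n_min = round(n_examples / (1 + ir))
    return n_examples - n_min, n_min

# Glass1 from Table 1: 214 instances at IR = 1.82.
n_maj, n_min = class_counts(214, 1.82)
print(n_maj, n_min)             # 138 76
print(round(n_maj / n_min, 2))  # 1.82
```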

3 Data Pre-processing

To make sure the classification process runs smoothly and efficiently, all data were pre-processed using the Waikato Environment for Knowledge Analysis (WEKA) [33] before the experiment proceeded to the classification process.

Table 1 Summary description datasets with imbalance ratio between 1.5 and 9
Data-set IR #Ex. #Atts. (R/I/N) #Class
Glass1 1.82 214 9 (9/0/0) 2
Wisconsin 1.87 683 9 (0/9/0) 2
Glass0 2.06 214 9 (9/0/0) 2
Yeast1 2.46 1484 8 (8/0/0) 2
Glass-0-1-2-3_Vs_4-5-6 3.2 214 9 (9/0/0) 2
Vehicle0 3.25 846 18 (0/18/0) 2
New-Thyroid1 5.14 215 5 (4/1/0) 2
Ecoli2 5.46 336 7 (7/0/0) 2
Segment0 6.02 2308 19 (19/0/0) 2
Glass6 6.38 214 9 (9/0/0) 2
Ecoli3 8.6 336 7 (7/0/0) 2
Page-blocks0 8.79 5472 10 (4/6/0) 2

As a first step of pre-processing, normalization is used to synchronize the interval values across all datasets. As stated by Kotsiantis et al. [34], this process can be essential for the k-Nearest Neighbour (k-NN) algorithm.
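The normalization step can be sketched as a min–max rescaling of every attribute to [0, 1]. This standalone version is an assumption about the filter's behaviour, not WEKA's exact implementation:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each attribute (column) to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X - lo) / span

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(min_max_normalize(X))  # rows: [0, 0], [0.5, 0.5], [1, 1]
```

Without this step, attributes on large scales (such as the second column above) would dominate k-NN's distance computations.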

4 Classification

Two classification algorithms were used in this experiment: C4.5 and k-NN. Every dataset was run with ten-fold cross-validation, with a single repetition. Table 2 shows the parameter specification for the classification process.
In this experiment, SMOTE sampling used five nearest neighbours with its default over-sampling percentage of 100% for the instances to be created. C4.5 was used with its default confidence factor of 0.25. For k-NN, the authors used k = 10 instead of the default of one. To help readers follow the experiment, Table 3 summarizes the algorithms according to two types of classifiers, sampling and non-sampling, with an abbreviation and short description for every method.
The performance analysis of the classification task was based on the confusion matrix of the results for each instance, as shown in Table 4, used to predict the positive class.
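The performance measures reported later in Tables 4 and 5 (sensitivity, specificity, precision, accuracy, G-mean, and F-measure) all derive from the binary confusion matrix. A minimal sketch with illustrative counts (not figures from the experiment):

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Measures derived from a binary confusion matrix, treating the
    minority (positive) class as the class of interest."""
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Illustrative counts: 50 positives, 150 negatives (IR = 3).
m = imbalance_metrics(tp=40, fn=10, fp=20, tn=130)
print(round(m["g_mean"], 3))     # 0.833
print(round(m["f_measure"], 3))  # 0.727
```

Note how the G-mean penalizes a classifier that sacrifices minority-class recall for majority-class accuracy, which is why it is preferred over plain accuracy in imbalanced settings.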

Table 2 Specification of parameters
Experiment type: cross-validation
Number of folds: 10
Number of repetitions: 1

Table 3 Algorithms used in experimental design
Non-sampling classifiers:
C45 (C4.5): generates a pruned C4.5 decision tree on previously normalized datasets
k-NN (k-Nearest Neighbour): selects an appropriate value of k via cross-validation and performs distance weighting; pre-processed with normalization
Sampling classifiers:
CSMT (C4.5 + SMOTE): applies C4.5 on datasets after pre-processing with normalization and SMOTE
KSMT (k-Nearest Neighbour + SMOTE): selects an appropriate value of k via cross-validation and performs distance weighting after pre-processing with normalization and SMOTE
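Every dataset above is evaluated under cross-validation, and for imbalanced data the folds must be stratified, otherwise a fold may contain no minority samples at all. A minimal sketch of such a split, assuming round-robin dealing of each class's indices (an illustrative helper, not WEKA's or KEEL's partitioner):

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Deal each class's indices round-robin across folds so every fold
    keeps roughly the original class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)
    return folds

# 90 negatives and 10 positives (IR = 9): each fold gets 9 neg + 1 pos.
y = ["neg"] * 90 + ["pos"] * 10
folds = stratified_folds(y)
print([sum(y[i] == "pos" for i in f) for f in folds])  # ten 1s
```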

4.1 Synthetic Minority Over-Sampling

Synthetic minority over-sampling, widely known as SMOTE [35], is a sophisticated technique for handling the shortcomings of the plain over- and under-sampling techniques mentioned previously: it uses the minority samples to create synthetic counterparts, and is typically combined with under-sampling of the majority instances.
In this approach, the over-sampling of the minority instances uses a selection approach with an iterative search until the required amount is reached for every observation [17]. Each synthetic sample is produced using the k selected nearest neighbours (k-NN): a new point is placed at random along the line segment joining a minority sample to any of its neighbours [15], as illustrated in Fig. 1.
Figure 1 shows points Xi1 to Xi4 as the selected nearest neighbours used to synthetically produce data points r1 to r4 at random.
This research uses SMOTE sampling as its pre-processing technique because it produces better performance in practice, working excellently with both basic and hybrid methods [36].
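The interpolation described above can be written compactly: a synthetic point is x_i + λ(x_z − x_i) for a random λ ∈ [0, 1), where x_z is one of the k nearest minority neighbours of x_i. A self-contained sketch follows; it is a simplified stand-in for SMOTE [35] (random sample selection rather than the original per-sample iteration), not the KEEL or WEKA implementation:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points, each placed at a random
    position on the segment between a minority sample and one of its
    k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances within the minority class only.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                  # pick a minority sample
        z = neighbours[i, rng.integers(k)]   # pick one of its k neighbours
        lam = rng.random()                   # position along the segment
        synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))
    return np.array(synthetic)

# 100% over-sampling of four minority points on the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sample(X_min, n_new=4, k=2)
print(X_new.shape)  # (4, 2)
```

Because every synthetic point lies between two existing minority samples, the over-sampled class stays inside its own region of feature space rather than duplicating exact points, which is what improves the decision boundary.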

Fig. 1 Illustration of how the synthetic data points are created in the SMOTE algorithm [15]

5 Results

The objective of the experimental study is to identify which datasets achieve a balanced, robust, good trade-off, and which conversely sacrifice performance, with and without sampling, using C4.5 decision trees [37] and the widely known k-Nearest Neighbour (k-NN) [38] as learning classifiers.
In this experiment, the authors investigated the performance of the methods in terms of robustness and ability to handle imbalanced datasets with different IR. They also considered the improvement in the results for the two algorithms used, to determine whether sampling provides a better trade-off or otherwise for both C4.5 and k-NN.
The authors utilized 12 datasets with closely related imbalance ratios (IR) ranging between 1.5 and 9. The datasets were divided into three categories: mild imbalance with IR less than three, medium imbalance with IR from three to six, and extreme imbalance with IR of more than six. WEKA [33] was used to run the k-NN and C4.5 learning classifiers across all datasets, owing to its ability to set up large-scale experiments.
Based on close observation of the performance measure scores, the authors found that the imbalance ratio alone does not determine the findings for these datasets. The k-NN algorithm yielded an improvement superior to C4.5 on datasets with IR between 1.5 and 9. Similar result patterns were observed for both algorithms under the sampling and non-sampling approaches, as shown in Tables 4 and 5.
From the experiments, SMOTE improved both C4.5 and k-NN, with gains ranging from 0.18 to 0.45. It can be deduced that such improvement is achieved because the increased number of training data points forms a better decision-boundary representation between the majority and minority classes.
Figures 2 and 3 illustrate the performance of both the C4.5 and k-NN classifiers, with and without sampling, for the segment0 dataset, which has an extremely imbalanced IR. With SMOTE sampling, the classifiers correctly classified all data. Hence, it is safe to conclude that the class imbalance problem does benefit from the sampling approach, especially when dealing with extremely skewed datasets, as in this case with SMOTE. The results also indicate a trade-off between recall and precision; the classifiers completely learned the data.

Table 4 Average performance measure on ten-fold CV of C4.5 and K-NN in non-sampling approach for datasets with imbalance ratio between 1.5 and 9
Non-sampling
IR Sensitivity Specificity Precision Accuracy G mean F–measure
C4.5 k-NN C4.5 k-NN C4.5 k-NN C4.5 k-NN C4.5 k-NN C4.5 k-NN
IR < 3
Glass1 1.82 0.613 0.611 0.842 0.864 0.795 0.818 0.727 0.737 0.718 0.726 0.692 0.699
Wisconsin 1.87 0.941 0.967 0.968 0.975 0.968 0.975 0.955 0.971 0.955 0.971 0.954 0.971
Glass0 2.06 0.757 0.914 0.840 0.785 0.826 0.809 0.799 0.850 0.798 0.847 0.790 0.859
Yeast1 2.46 0.464 0.490 0.869 0.857 0.780 0.774 0.667 0.673 0.635 0.648 0.582 0.600
Mean 0.694 0.745 0.880 0.870 0.842 0.844 0.787 0.808 0.776 0.798 0.755 0.782
Standard deviation 0.204 0.232 0.060 0.079 0.086 0.089 0.124 0.131 0.136 0.141 0.158 0.165
3 < IR < 6
Glass-0-1-2-3_vs_4-5-6 3.2 0.823 0.770 0.975 0.932 0.971 0.919 0.899 0.851 0.896 0.847 0.891 0.838
Vehicle0 3.25 0.873 0.899 0.950 0.941 0.946 0.939 0.912 0.920 0.911 0.920 0.908 0.919
New-thyroid1 5.14 0.942 0.800 0.989 1.000 0.988 1.000 0.965 0.900 0.965 0.894 0.964 0.889
Ecoli2 5.46 0.750 0.907 0.972 0.968 0.964 0.966 0.861 0.953 0.854 0.937 0.844 0.936
Mean 0.847 0.844 0.972 0.961 0.967 0.956 0.909 0.906 0.906 0.900 0.902 0.895
Standard deviation 0.081 0.069 0.016 0.030 0.017 0.035 0.043 0.042 0.046 0.039 0.050 0.043
IR > 6
Segment0 6.02 0.957 0.982 0.997 0.996 0.997 0.996 0.977 0.989 0.977 0.989 0.977 0.989
Glass6 6.38 0.800 0.767 0.978 0.968 0.974 0.959 0.889 0.867 0.885 0.861 0.878 0.852
Ecoli3 8.6 0.567 0.717 0.957 0.964 0.930 0.952 0.762 0.840 0.736 0.831 0.704 0.818
Page-blocks0 8.79 0.859 0.696 0.987 0.987 0.985 0.981 0.923 0.841 0.920 0.829 0.917 0.814
Mean 0.796 0.790 0.980 0.979 0.971 0.972 0.888 0.884 0.880 0.877 0.869 0.868
Standard deviation 0.166 0.131 0.017 0.016 0.029 0.020 0.091 0.071 0.103 0.076 0.117 0.082
Mean 0.779 0.793 0.944 0.936 0.927 0.924 0.861 0.866 0.854 0.858 0.841 0.849

Standard deviation 0.158 0.150 0.058 0.067 0.079 0.079 0.101 0.092 0.109 0.098 0.125 0.111
Table 5 Average performance measure on ten-fold CV of C4.5 and K-NN in sampling approach for datasets with imbalance ratio between 1.5 and 9
Sampling
IR Sensitivity Specificity Precision Accuracy G mean F–measure
CSMT KSMT CSMT KSMT CSMT KSMT CSMT KSMT CSMT KSMT CSMT KSMT
IR < 3
Glass1 1.82 0.795 0.914 0.732 0.685 0.748 0.744 0.763 0.800 0.763 0.791 0.771 0.820
Wisconsin 1.87 0.983 0.998 0.950 0.968 0.952 0.969 0.967 0.983 0.967 0.983 0.967 0.983
Glass0 2.06 0.893 0.957 0.814 0.675 0.828 0.747 0.854 0.816 0.853 0.804 0.859 0.839
Yeast1 2.46 0.711 0.853 0.779 0.680 0.763 0.727 0.745 0.766 0.744 0.761 0.736 0.785
Mean 0.846 0.931 0.819 0.752 0.823 0.797 0.832 0.841 0.832 0.835 0.833 0.857
Standard Deviation 0.118 0.062 0.094 0.144 0.093 0.115 0.101 0.097 0.102 0.100 0.103 0.087
3 < IR < 6
Glass-0-1-2-3_vs_4-5-6 3.2 0.914 0.952 0.928 0.927 0.927 0.929 0.921 0.939 0.921 0.939 0.920 0.940
Vehicle0 3.25 0.925 0.985 0.946 0.904 0.945 0.911 0.935 0.945 0.935 0.944 0.935 0.947
New-thyroid1 5.14 0.914 0.957 0.983 0.983 0.982 0.983 0.949 0.970 0.948 0.970 0.947 0.970
Ecoli2 5.46 0.855 0.932 0.969 0.958 0.965 0.957 0.912 0.945 0.910 0.945 0.906 0.944
Mean 0.902 0.956 0.956 0.943 0.954 0.945 0.929 0.950 0.928 0.949 0.927 0.950
Standard Deviation 0.032 0.022 0.025 0.035 0.024 0.031 0.016 0.014 0.017 0.014 0.018 0.013
IR > 6
Segment0 6.02 0.995 0.986 0.998 0.991 0.998 0.991 0.997 0.989 0.997 0.989 0.997 0.989
Glass6 6.38 0.930 0.863 0.957 0.962 0.956 0.958 0.943 0.913 0.943 0.911 0.943 0.908
Ecoli3 8.6 0.800 0.871 0.943 0.934 0.934 0.929 0.872 0.902 0.869 0.902 0.862 0.899
Page-blocks0 8.79 0.905 0.853 0.979 0.974 0.978 0.970 0.942 0.914 0.942 0.912 0.940 0.908
Mean 0.969 0.965 0.966 0.962 0.939 0.929 0.938 0.928 0.935 0.926 0.969 0.965
Standard deviation 0.024 0.024 0.028 0.026 0.051 0.040 0.053 0.040 0.056 0.042 0.024 0.024
Mean 0.885 0.927 0.915 0.887 0.914 0.901 0.900 0.907 0.899 0.904 0.899 0.911
Standard deviation 0.082 0.054 0.088 0.127 0.100 0.078 0.074 0.078 0.077 0.079 0.066

Fig. 2 ROC curve of sampling for the segment0 dataset, false positive rate (x-axis) against true positive rate (y-axis)

Fig. 3 ROC curve without sampling for segment0, false positive rate (x-axis) against true positive rate (y-axis)

6 Conclusion

This study focused on the effect of sampling on the C4.5 and k-NN algorithms when handling imbalanced datasets. It was observed that SMOTE sampling can lead to improvements in classification. The findings of the experiment could help inform general practitioners of the main problem in improving the decision boundary of data regions with extreme IR values, especially concerning the skewness of the data, which can be addressed by identifying the factors that cause the imbalance ratio. The study also observed that extreme IR benefited from sampling, while other IR ranges did not show much improvement in their classification rate.

Acknowledgements The authors would like to express appreciation to the UTM Big Data Centre
of Universiti Teknologi Malaysia and Y.M. Said for their support in this study. The authors greatly
acknowledge the Research Management Centre, UTM and Ministry of Higher Education for the
financial support through Research University Grant (RUG) Vot. No. Q.JI30000.2528.13H30.

References

1. Ali, A., Shamsuddin, S.M., Ralescu, A.L.: Classification with class imbalance problem: a
review. Int. J. Adv. Soft Comput. Appl. 7(3) (2015)
2. Beyan, C., Fisher, R.: Classifying imbalanced data sets using similarity based hierarchical
decomposition. Pattern Recogn. 48(5), 1653–1672 (2015)
3. Cleofas-Sánchez, L., Sánchez, J.S., García, V., Valdovinos, R.: Associative learning on
imbalanced environments: An empirical study. Expert Syst. Appl. 54, 387–397 (2016)
4. Al-Stouhi, S., Reddy, C.K.: Transfer learning for class imbalance problems with inadequate
data. Knowl. Inf. Syst. 48(1), 201–228 (2016)
5. Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced
data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)
6. Sun, Y., Wong, A.K., Kamel, M.S.: Classification of imbalanced data: a review. Int. J. Pattern
Recognit Artif Intell. 23(04), 687–719 (2009)
7. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9),
1263–1284 (2009)
8. Bruha, I., Kočková, S.: A support for decision-making: cost-sensitive learning system. Artif.
Intell. Med. 6(1), 67–82 (1994). https://doi.org/10.1016/0933-3657(94)90058-2
9. Kukar, M., Kononenko, I., Grošelj, C., Kralj, K., Fettich, J.: Analysing and improving the
diagnosis of ischaemic heart disease with machine learning. Artif. Intell. Med. 16(1), 25–50
(1999). https://doi.org/10.1016/S0933-3657(98)00063-3
10. Gao, K., Khoshgoftaar, T.M., Napolitano, A.: An empirical investigation of combining
filter-based feature subset selection and data sampling for software defect prediction. Int.
J. Reliab. Qual. Saf. Eng. 22(6) (2015). https://doi.org/10.1142/s0218539315500278
11. Gao, K., Khoshgoftaar, T.M., Napolitano, A.: Aggregating data sampling with feature subset
selection to address skewed software defect data. Int. J. Soft. Eng. Knowl. Eng. 25(09n10),
1531–1550 (2015)
12. Abidine, M.B., Fergani, B., Ordóñez, F.J.: Effect of over-sampling versus under-sampling for
SVM and LDA classifiers for activity recognition. Int. J. Des. Nat. Ecodynamics 11(3), 306–
316 (2016). https://doi.org/10.2495/DNE-V11-N3-306-316
13. Bach, M., Werner, A., Żywiec, J., Pluskiewicz, W.: The study of under- and over-sampling
methods’ utility in analysis of highly imbalanced data on osteoporosis. Inf. Sci. 384, 174–190
(2017). https://doi.org/10.1016/j.ins.2016.09.038
14. Ando, S.: Classifying imbalanced data in distance-based feature space. Knowl. Inf. Syst. 46
(3), 707–730 (2016)
15. Lee, W., Jun, C.-H., Lee, J.-S.: Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification. Inf. Sci. 381, 92–103 (2017). https://doi.org/10.1016/j.ins.2016.11.014
16. López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification
with imbalanced data: Empirical results and current trends on using data intrinsic
characteristics. Inf. Sci. 250, 113–141 (2013)
17. Lee, C.S., Sheen, D.: Nonconforming generalized multiscale finite element methods.
J. Comput. Appl. Math. 311, 215–229 (2017)
18. Rivera, W.A., Xanthopoulos, P.: A priori synthetic over-sampling methods for increasing
classification sensitivity in imbalanced data sets. Expert Syst. Appl. 66, 124–135 (2016)
19. Vluymans, S., Triguero, I., Cornelis, C., Saeys, Y.: EPRENNID: An evolutionary prototype
reduction based ensemble for nearest neighbor classification of imbalanced data.
Neurocomputing 216, 596–610 (2016). https://doi.org/10.1016/j.neucom.2016.08.026
20. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles
for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE
Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012). https://doi.org/10.1109/
TSMCC.2011.2161285
21. Downs, R.: Beware the aliased signal! Electron. Des. 59(4) (2011)

22. Visa, S.: Fuzzy classifiers for imbalanced data sets. University of Cincinnati (2006)
23. Kotsiantis, S., Kanellopoulos, D., Pintelas, P.: Handling imbalanced datasets: a review.
GESTS Int. Trans. Comput. Sci. Eng. 30(1), 25–36 (2006)
24. García, V., Sánchez, J.S., Mollineda, R.A.: On the effectiveness of preprocessing methods
when dealing with different levels of class imbalance. Knowl.-Based Syst. 25(1), 13–21
(2012)
25. Fernández, A., del Jesus, M.J., Herrera, F.: Hierarchical fuzzy rule based classification
systems with genetic rule selection for imbalanced data-sets. Int. J. Approx. Reason. 50(3),
561–577 (2009)
26. Phung, S.L., Bouzerdoum, A., Nguyen, G.H.: Learning pattern classification tasks with
imbalanced data sets (2009)
27. Xiong, H., Wu, J., Liu, L.: Classification with class overlapping: a systematic study. In: The
2010 International Conference on E-Business Intelligence, pp. 491–497 (2010)
28. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles
for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE
Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(4), 463–484 (2012)
29. Longadge, R., Dongre, S.: Class imbalance problem in data mining review (2013). arXiv
preprint arXiv:1305.1707
30. Japkowicz, N.: Learning from imbalanced data sets: a comparison of various strategies. In:
AAAI Workshop on Learning from Imbalanced Data Sets, pp. 10–15, Menlo Park, CA (2000)
31. Batista, G.E., Prati, R.C., Monard, M.C.: Balancing strategies and class overlapping. In:
International Symposium on Intelligent Data Analysis, pp. 24–35. Springer, Heidelberg
(2005)
32. Prati, R.C., Batista, G.E., Monard, M.C.: Learning with class skews and small disjuncts. In:
Brazilian Symposium on Artificial Intelligence, pp. 296–306. Springer, Heidelberg (2004)
33. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.:
KEEL data-mining software tool: data set repository, integration of algorithms and
experimental analysis framework. J. Multiple-Valued Logic Soft Comput. 17 (2011)
34. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data
mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
35. Kotsiantis, S., Kanellopoulos, D., Pintelas, P.: Data preprocessing for supervised learning. Int.
J. Comput. Sci. 1(2), 111–117 (2006)
36. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority
over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
37. Fernández, A., García, S., del Jesus, M.J., Herrera, F.: A study of the behaviour of linguistic
fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets
Syst. 159(18), 2378–2398 (2008)
38. Salzberg, S.L.: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann
Publishers, Inc., 1993. Mach. Learn. 16(3), 235–240 (1994). https://doi.org/10.1007/
bf00993309
