
Chapter 3

Handling Imbalanced Ratio for Class Imbalance Problem Using SMOTE

Nurulfitrah Noorhalim, Aida Ali and Siti Mariyam Shamsuddin

Abstract There are many issues in dataset classification. One such issue is class imbalance, which often occurs with extreme skewness across many real-world domains and presents one of the fundamental difficulties in forming robust classifiers. In this paper, a sampling method is used to evaluate the classification performance of the k-NN and C4.5 classifiers under ten-fold cross-validation. Experimental results show that sampling greatly benefits classification performance on class imbalance problems by improving the class boundary region, especially for extremely imbalanced datasets (those with an extreme imbalance ratio). This result demonstrates that class imbalance affects many real-world application domains.

Keywords Sampling · Synthetic minority over-sampling · Imbalanced dataset

1 Introduction

The use of today’s technology has contributed significantly to the generation of millions of data records within seconds in scientific research, business, and industry. This has created a huge opportunity, especially for researchers, to analyse and interpret data into forms that affect an organisation’s profitability, particularly in consumer-specific domains, owing to growth in the science and technology field. This phenomenon has grown drastically and gained substantial popularity, largely owing to heavy interest in artificial intelligence research, particularly machine learning, which can be very useful in administering customer
N. Noorhalim (✉) · A. Ali · S. M. Shamsuddin
Faculty of Computing, Universiti Teknologi Malaysia, UTM, 81310 Skudai, Johor, Malaysia
e-mail: nurulfi[email protected]
A. Ali
e-mail: [email protected]
S. M. Shamsuddin
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2019
L.-K. Kor et al. (eds.), Proceedings of the Third International Conference on Computing, Mathematics and Statistics (iCMS2017), https://doi.org/10.1007/978-981-13-7279-7_3

relationship management (CRM). For instance, collected business data can be used
and classified in numerous ways including, in the form of descriptive analytics,
diagnostic analytics, prescriptive analytics, and predictive analytics.
However, several issues surround dataset classification, such as imbalanced class classification, which often involves extreme skewness across numerous real-world domains and constitutes one of the fundamental difficulties in forming a robust classifier [1–4]. Class imbalance refers to a situation where the number of minority class instances (positive class) is far smaller than the number of majority class instances (negative class), so the minority class is not adequately represented.
According to Chawla et al. [4], class imbalance emerged as science developed into applied machine-learning technology. From then on, various classifier learning algorithms were developed that assume datasets possess a comparatively balanced distribution, neglecting the serious complexity of imbalanced class distributions [5, 6]. Such imbalanced, real-world classification problems are found across domains such as diagnostic problems [7, 8], software quality prediction [9], software defect prediction [10], and activity recognition [11]. Given that extracting knowledge from such data benefits many general practitioners, as indicated by Bach et al. [12], focusing on the class imbalance topic is crucial.
There are substantial cases where learning classifiers must perform two-class classification on imbalanced datasets, when the natural distribution of the data is dominated by one class, i.e. when the negative (majority) class far exceeds the positive (minority) class. For instance, a dataset of breast tumour patients poses a two-class classification problem for breast cancer diagnosis, comprising benign and malignant cases. In medical practice, fewer patients are diagnosed with malignant cancer (minority class) than with benign tumours (majority class). This lack of minority-class instances can make precise classification difficult to obtain.
The skewness of data distributions has inspired the formation of more robust classifiers, which is a basic problem in data mining [6, 13]. In addition, standard classifiers can be biased in their evaluation, typically producing poor performance on the minority class and good performance on the majority class, because such classifiers tend to treat the scarce positive samples as noise and ignore them during the learning process [14, 15]. Classification accuracy on imbalanced data also degrades in the presence of related difficulties, including (i) borderline samples, (ii) small disjuncts, (iii) noisy data, (iv) a small number of training data, (v) overlapping instances, and (vi) differences between the test and training distributions (dataset shift).
There are two families of methods for handling the class imbalance problem: data-level and algorithm-level methods. Typically, data-level methods run as pre-processing steps to adjust unbalanced datasets and reduce the skewness of the distributions using some form of sampling [16–18]. According to Lee and Sheen [16], random sampling imposes limitations on most data pre-processing; large differences in pre-processing results can leave the data distribution severely biased. Meanwhile, algorithm-level methods involve

modifying algorithms, such as ensemble approaches and cost-sensitive learning, so that they can manage imbalanced class distributions [19, 20]. Ali et al. [1] claimed that, even though both approaches are able to rebalance the data distribution, theoretical gaps remain in existing approaches with respect to all the aforementioned problems.

1.1 Imbalanced Ratio

In class imbalance classification, several challenges may be faced during the classification process. Difficulties often arise when a dataset has an imbalanced distribution between its classes, the minority (positive) class and the majority (negative) class [21, 22], quantified by the imbalance ratio. The imbalance ratio (IR) is the proportion of the number of majority class (negative) samples to the number of minority class (positive) samples [15, 23]. In binary-class datasets, the problem appears when the majority class samples dominate the distribution over the minority class samples, even though the minority class is the class of interest in classification. Under this condition, the minority class tends to have very low predictive accuracy [24], although, as many research works note, the errors that come from the minority class are the ones that matter most [21]. This scenario shows that most datasets used in classification have an imbalance problem among their classes. It occurs widely in studies involving a lack of training samples in data classification [13], which is another facet of the class imbalance problem. Class imbalance can also arise when a dataset has an insufficient number of samples, which in turn hinders the identification of pattern regularities [1] that would help improve the decision boundary between the classes [25]. The best classification performance requires a balanced distribution and a sufficient number of training samples to supply useful knowledge for the learning process; otherwise, classification performance degrades.
Another issue is that most standard learning algorithms assume that all datasets have the same balanced class distribution [6]. The related class overlapping problem has a high tendency to affect classifier performance even more strongly than the imbalanced distribution itself [1, 26–31].
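As a worked illustration of the definition above, the imbalance ratio can be computed directly from the class counts. This is a minimal sketch for exposition, not the authors' code:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR = (# majority-class samples) / (# minority-class samples)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# A toy binary dataset: 9 negative (majority) and 3 positive (minority) labels.
y = ["negative"] * 9 + ["positive"] * 3
print(imbalance_ratio(y))  # 3.0
```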

2 Data Collection

Twelve datasets were used in this experiment, all available from the KEEL dataset repository [32]. The datasets were chosen because they possess an imbalanced ratio (IR) between their positive and negative classes. They have been used and reported in previous studies such as [22–25], applying SMOTE and other sampling techniques to different types of datasets: datasets with a mild imbalance ratio between 1.5 and 9, and datasets with an extreme imbalance ratio of more than nine. The datasets were divided this way to identify the extent to which SMOTE helps C4.5 and k-NN. Apart from that, the different ranges of IR help identify whether SMOTE only helps in classifying extremely imbalanced datasets, or both mild and extreme ones, since different IR values yield different distribution patterns under the applied sampling method. The datasets used in this experiment have imbalance ratios between 1.5 and 9; 12 of the 22 available datasets were randomly selected, as listed in Table 1. All datasets come pre-partitioned for the test phase using five-fold stratified cross-validation, ensuring a sufficient number of minority samples per partition.
Referring to Table 1, the datasets are arranged from lowly to highly imbalanced, in ascending order of imbalance ratio (IR), with additional details including the number of instances (#Ex), the number of attributes (#Atts) broken into real (R), integer (I), and nominal (N) valued, and the number of classes (counting both minority and majority classes).
Table 1 lists 12 different datasets with different IR values, all less than nine, and different numbers of instances. All datasets have two classes, a minority (positive) class and a majority (negative) class. Most attributes are real-valued; no nominal attributes appear in these datasets.
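The per-dataset IR values in Table 1 can be cross-checked against the instance counts: given the total #Ex and the IR, the implied class sizes follow from n_min = #Ex / (1 + IR). A small sketch (the helper is illustrative, not from the chapter):

```python
def class_counts(n_examples, ir):
    """Majority/minority class sizes implied by IR = n_maj / n_min."""
    n_min = round(n_examples / (1 + ir))
    return n_examples - n_min, n_min

# Glass1 from Table 1: 214 instances at IR = 1.82.
n_maj, n_min = class_counts(214, 1.82)
print(n_maj, n_min)             # 138 76
print(round(n_maj / n_min, 2))  # 1.82
```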

3 Data Pre-processing

To make sure the classification process runs smoothly and efficiently, all data were pre-processed using the Waikato Environment for Knowledge Analysis (WEKA) [33] before the experiment proceeded to the classification process.

Table 1 Summary description datasets with imbalance ratio between 1.5 and 9
Data-set IR #Ex. #Atts. (R/I/N) #Class
Glass1 1.82 214 9 (9/0/0) 2
Wisconsin 1.87 683 9 (0/9/0) 2
Glass0 2.06 214 9 (9/0/0) 2
Yeast1 2.46 1484 8 (8/0/0) 2
Glass-0-1-2-3_Vs_4-5-6 3.2 214 9 (9/0/0) 2
Vehicle0 3.25 846 18 (0/18/0) 2
New-Thyroid1 5.14 215 5 (4/1/0) 2
Ecoli2 5.46 336 7 (7/0/0) 2
Segment0 6.02 2308 19 (19/0/0) 2
Glass6 6.38 214 9 (9/0/0) 2
Ecoli3 8.6 336 7 (7/0/0) 2
Page-blocks0 8.79 5472 10 (4/6/0) 2

As a first step of pre-processing, normalization is used to synchronize the interval values across all datasets. As stated by Kotsiantis et al. [34], this process can be essential for the k-Nearest Neighbour (k-NN) algorithm.
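The normalization step can be sketched as a min–max rescaling of every attribute to [0, 1]. This standalone version is an assumption about the filter's behaviour, not WEKA's exact implementation:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each attribute (column) to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X - lo) / span

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(min_max_normalize(X))  # rows: [0, 0], [0.5, 0.5], [1, 1]
```

Without this step, attributes on large scales (such as the second column above) would dominate k-NN's distance computations.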

4 Classification

Two classification algorithms were used in this experiment: C4.5 and k-NN. Every dataset was run with ten-fold cross-validation, with a single repetition. Table 2 shows the parameter specification for the classification process.
In this experiment, SMOTE sampling used five nearest neighbours with its default over-sampling percentage of 100% for the instances to be created. C4.5 was used with its default confidence factor of 0.25. For k-NN, the authors used k = 10 instead of the default of one. To help readers follow the experiment, Table 3 summarizes the algorithms according to two types of classifiers, sampling and non-sampling, with an abbreviation and short description for every method.
The performance analysis of the classification task was based on the confusion matrix of the results for each instance, as shown in Table 4, used to predict the positive class.
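The performance measures reported later in Tables 4 and 5 (sensitivity, specificity, precision, accuracy, G-mean, and F-measure) all derive from the binary confusion matrix. A minimal sketch with illustrative counts (not figures from the experiment):

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Measures derived from a binary confusion matrix, treating the
    minority (positive) class as the class of interest."""
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Illustrative counts: 50 positives, 150 negatives (IR = 3).
m = imbalance_metrics(tp=40, fn=10, fp=20, tn=130)
print(round(m["g_mean"], 3))     # 0.833
print(round(m["f_measure"], 3))  # 0.727
```

Note how the G-mean penalizes a classifier that sacrifices minority-class recall for majority-class accuracy, which is why it is preferred over plain accuracy in imbalanced settings.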

Table 2 Specification of parameters
Experiment type: cross-validation
Number of folds: 10
Number of repetitions: 1

Table 3 Algorithms used in experimental design
Non-sampling classifiers:
C45 (C4.5): generates a pruned C4.5 decision tree on previously normalized datasets
k-NN (k-Nearest Neighbour): selects an appropriate value of k via cross-validation and performs distance weighting; pre-processed with normalization
Sampling classifiers:
CSMT (C4.5 + SMOTE): applies C4.5 on datasets after pre-processing with normalization and SMOTE
KSMT (k-Nearest Neighbour + SMOTE): selects an appropriate value of k via cross-validation and performs distance weighting after pre-processing with normalization and SMOTE
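Every dataset above is evaluated under cross-validation, and for imbalanced data the folds must be stratified, otherwise a fold may contain no minority samples at all. A minimal sketch of such a split, assuming round-robin dealing of each class's indices (an illustrative helper, not WEKA's or KEEL's partitioner):

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Deal each class's indices round-robin across folds so every fold
    keeps roughly the original class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)
    return folds

# 90 negatives and 10 positives (IR = 9): each fold gets 9 neg + 1 pos.
y = ["neg"] * 90 + ["pos"] * 10
folds = stratified_folds(y)
print([sum(y[i] == "pos" for i in f) for f in folds])  # ten 1s
```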

4.1 Synthetic Minority Over-Sampling

Synthetic minority over-sampling, widely known as SMOTE [35], is a sophisticated technique for handling the shortcomings of the plain over- and under-sampling techniques mentioned previously: it uses the minority samples to create synthetic counterparts, and is typically combined with under-sampling of the majority instances.
In this approach, the over-sampling of the minority instances uses a selection approach with an iterative search until the required amount is reached for every observation [17]. Each synthetic sample is produced using the k selected nearest neighbours (k-NN): a new point is placed at random along the line segment joining a minority sample to any of its neighbours [15], as illustrated in Fig. 1.
Figure 1 shows points Xi1 to Xi4 as the selected nearest neighbours used to synthetically produce data points r1 to r4 at random.
This research uses SMOTE sampling as its pre-processing technique because it produces better performance in practice, working excellently with both basic and hybrid methods [36].
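The interpolation described above can be written compactly: a synthetic point is x_i + λ(x_z − x_i) for a random λ ∈ [0, 1), where x_z is one of the k nearest minority neighbours of x_i. A self-contained sketch follows; it is a simplified stand-in for SMOTE [35] (random sample selection rather than the original per-sample iteration), not the KEEL or WEKA implementation:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points, each placed at a random
    position on the segment between a minority sample and one of its
    k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances within the minority class only.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                  # pick a minority sample
        z = neighbours[i, rng.integers(k)]   # pick one of its k neighbours
        lam = rng.random()                   # position along the segment
        synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))
    return np.array(synthetic)

# 100% over-sampling of four minority points on the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sample(X_min, n_new=4, k=2)
print(X_new.shape)  # (4, 2)
```

Because every synthetic point lies between two existing minority samples, the over-sampled class stays inside its own region of feature space rather than duplicating exact points, which is what improves the decision boundary.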

Fig. 1 Illustration of how the synthetic data points are created in the SMOTE algorithm [15]

5 Results

The objective of the experimental study is to identify which datasets achieve a balanced, robust, good trade-off, and which conversely sacrifice performance, with and without sampling, using C4.5 decision trees [37] and the widely known k-Nearest Neighbour (k-NN) [38] as learning classifiers.
In this experiment, the authors investigated the performance of the methods in terms of robustness and ability to handle imbalanced datasets with different IR. They also considered the improvement in the results for the two algorithms used, to determine whether sampling provides a better trade-off or otherwise for both C4.5 and k-NN.
The authors utilized 12 datasets with closely related imbalance ratios (IR) ranging between 1.5 and 9. The datasets were divided into three categories: mild imbalance with IR less than three, medium imbalance with IR from three to six, and extreme imbalance with IR of more than six. WEKA [33] was used to run the k-NN and C4.5 learning classifiers across all datasets, owing to its ability to set up large-scale experiments.
Based on close observation of the performance measure scores, the authors found that the imbalance ratio alone does not determine the findings for these datasets. The k-NN algorithm yielded an improvement superior to C4.5 on datasets with IR between 1.5 and 9. Similar result patterns were observed for both algorithms under the sampling and non-sampling approaches, as shown in Tables 4 and 5.
From the experiments, SMOTE improved both C4.5 and k-NN, with gains ranging from 0.18 to 0.45. It can be deduced that such improvement is achieved because the increased number of training data points forms a better decision-boundary representation between the majority and minority classes.
Figures 2 and 3 illustrate the performance of both the C4.5 and k-NN classifiers, with and without sampling, for the segment0 dataset, which has an extremely imbalanced IR. With SMOTE sampling, the classifiers correctly classified all data. Hence, it is safe to conclude that the class imbalance problem does benefit from the sampling approach, especially when dealing with extremely skewed datasets, as in this case with SMOTE. The results also indicate a trade-off between recall and precision; the classifiers completely learned the data.

Table 4 Average performance measure on ten-fold CV of C4.5 and K-NN in non-sampling approach for datasets with imbalance ratio between 1.5 and 9
Non-sampling
IR Sensitivity Specificity Precision Accuracy G mean F–measure
C4.5 k-NN C4.5 k-NN C4.5 k-NN C4.5 k-NN C4.5 k-NN C4.5 k-NN
IR < 3
Glass1 1.82 0.613 0.611 0.842 0.864 0.795 0.818 0.727 0.737 0.718 0.726 0.692 0.699
Wisconsin 1.87 0.941 0.967 0.968 0.975 0.968 0.975 0.955 0.971 0.955 0.971 0.954 0.971
Glass0 2.06 0.757 0.914 0.840 0.785 0.826 0.809 0.799 0.850 0.798 0.847 0.790 0.859
Yeast1 2.46 0.464 0.490 0.869 0.857 0.780 0.774 0.667 0.673 0.635 0.648 0.582 0.600
Mean 0.694 0.745 0.880 0.870 0.842 0.844 0.787 0.808 0.776 0.798 0.755 0.782
Standard deviation 0.204 0.232 0.060 0.079 0.086 0.089 0.124 0.131 0.136 0.141 0.158 0.165
3 < IR < 6
Glass-0-1-2-3_vs_4-5-6 3.2 0.823 0.770 0.975 0.932 0.971 0.919 0.899 0.851 0.896 0.847 0.891 0.838
Vehicle0 3.25 0.873 0.899 0.950 0.941 0.946 0.939 0.912 0.920 0.911 0.920 0.908 0.919
New-thyroid1 5.14 0.942 0.800 0.989 1.000 0.988 1.000 0.965 0.900 0.965 0.894 0.964 0.889
Ecoli2 5.46 0.750 0.907 0.972 0.968 0.964 0.966 0.861 0.953 0.854 0.937 0.844 0.936
Mean 0.847 0.844 0.972 0.961 0.967 0.956 0.909 0.906 0.906 0.900 0.902 0.895
Standard deviation 0.081 0.069 0.016 0.030 0.017 0.035 0.043 0.042 0.046 0.039 0.050 0.043
IR > 6
Segment0 6.02 0.957 0.982 0.997 0.996 0.997 0.996 0.977 0.989 0.977 0.989 0.977 0.989
Glass6 6.38 0.800 0.767 0.978 0.968 0.974 0.959 0.889 0.867 0.885 0.861 0.878 0.852
Ecoli3 8.6 0.567 0.717 0.957 0.964 0.930 0.952 0.762 0.840 0.736 0.831 0.704 0.818
Page-blocks0 8.79 0.859 0.696 0.987 0.987 0.985 0.981 0.923 0.841 0.920 0.829 0.917 0.814
Mean 0.796 0.790 0.980 0.979 0.971 0.972 0.888 0.884 0.880 0.877 0.869 0.868
Standard deviation 0.166 0.131 0.017 0.016 0.029 0.020 0.091 0.071 0.103 0.076 0.117 0.082
Mean 0.779 0.793 0.944 0.936 0.927 0.924 0.861 0.866 0.854 0.858 0.841 0.849

Standard deviation 0.158 0.150 0.058 0.067 0.079 0.079 0.101 0.092 0.109 0.098 0.125 0.111
Table 5 Average performance measure on ten-fold CV of C4.5 and K-NN in sampling approach for datasets with imbalance ratio between 1.5 and 9
Sampling
IR Sensitivity Specificity Precision Accuracy G mean F–measure
CSMT KSMT CSMT KSMT CSMT KSMT CSMT KSMT CSMT KSMT CSMT KSMT
IR < 3
Glass1 1.82 0.795 0.914 0.732 0.685 0.748 0.744 0.763 0.800 0.763 0.791 0.771 0.820
Wisconsin 1.87 0.983 0.998 0.950 0.968 0.952 0.969 0.967 0.983 0.967 0.983 0.967 0.983
Glass0 2.06 0.893 0.957 0.814 0.675 0.828 0.747 0.854 0.816 0.853 0.804 0.859 0.839
Yeast1 2.46 0.711 0.853 0.779 0.680 0.763 0.727 0.745 0.766 0.744 0.761 0.736 0.785
Mean 0.846 0.931 0.819 0.752 0.823 0.797 0.832 0.841 0.832 0.835 0.833 0.857
Standard Deviation 0.118 0.062 0.094 0.144 0.093 0.115 0.101 0.097 0.102 0.100 0.103 0.087
3 < IR < 6
Glass-0-1-2-3_vs_4-5-6 3.2 0.914 0.952 0.928 0.927 0.927 0.929 0.921 0.939 0.921 0.939 0.920 0.940
Vehicle0 3.25 0.925 0.985 0.946 0.904 0.945 0.911 0.935 0.945 0.935 0.944 0.935 0.947
New-thyroid1 5.14 0.914 0.957 0.983 0.983 0.982 0.983 0.949 0.970 0.948 0.970 0.947 0.970
Ecoli2 5.46 0.855 0.932 0.969 0.958 0.965 0.957 0.912 0.945 0.910 0.945 0.906 0.944
Mean 0.902 0.956 0.956 0.943 0.954 0.945 0.929 0.950 0.928 0.949 0.927 0.950
Standard Deviation 0.032 0.022 0.025 0.035 0.024 0.031 0.016 0.014 0.017 0.014 0.018 0.013
IR > 6
Segment0 6.02 0.995 0.986 0.998 0.991 0.998 0.991 0.997 0.989 0.997 0.989 0.997 0.989
Glass6 6.38 0.930 0.863 0.957 0.962 0.956 0.958 0.943 0.913 0.943 0.911 0.943 0.908
Ecoli3 8.6 0.800 0.871 0.943 0.934 0.934 0.929 0.872 0.902 0.869 0.902 0.862 0.899
Page-blocks0 8.79 0.905 0.853 0.979 0.974 0.978 0.970 0.942 0.914 0.942 0.912 0.940 0.908
Mean 0.969 0.965 0.966 0.962 0.939 0.929 0.938 0.928 0.935 0.926 0.969 0.965
Standard deviation 0.024 0.024 0.028 0.026 0.051 0.040 0.053 0.040 0.056 0.042 0.024 0.024
Mean 0.885 0.927 0.915 0.887 0.914 0.901 0.900 0.907 0.899 0.904 0.899 0.911
Standard deviation 0.082 0.054 0.088 0.127 0.100 0.078 0.074 0.078 0.077 0.079 0.066

Fig. 2 ROC curve of sampling for the segment0 dataset, false positive rate (x-axis) against true positive rate (y-axis)

Fig. 3 ROC curve without sampling for segment0, false positive rate (x-axis) against true positive rate (y-axis)

6 Conclusion

This study focused on the effect of sampling on the C4.5 and k-NN algorithms when handling imbalanced datasets. It was observed that SMOTE sampling can lead to improvements in classification. The findings of the experiment could help inform general practitioners of the main problem in improving the decision boundary of data regions with extreme IR values, especially concerning the skewness of the data, which can be addressed by identifying the factors that cause the imbalance ratio. The study also observed that extreme IR benefited from sampling, while other IR ranges did not show much improvement in their classification rate.

Acknowledgements The authors would like to express appreciation to the UTM Big Data Centre
of Universiti Teknologi Malaysia and Y.M. Said for their support in this study. The authors greatly
acknowledge the Research Management Centre, UTM and Ministry of Higher Education for the
financial support through Research University Grant (RUG) Vot. No. Q.JI30000.2528.13H30.

References

1. Ali, A., Shamsuddin, S.M., Ralescu, A.L.: Classification with class imbalance problem: a
review. Int. J. Adv. Soft Comput. Appl. 7(3) (2015)
2. Beyan, C., Fisher, R.: Classifying imbalanced data sets using similarity based hierarchical
decomposition. Pattern Recogn. 48(5), 1653–1672 (2015)
3. Cleofas-Sánchez, L., Sánchez, J.S., García, V., Valdovinos, R.: Associative learning on
imbalanced environments: An empirical study. Expert Syst. Appl. 54, 387–397 (2016)
4. Al-Stouhi, S., Reddy, C.K.: Transfer learning for class imbalance problems with inadequate
data. Knowl. Inf. Syst. 48(1), 201–228 (2016)
5. Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced
data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)
6. Sun, Y., Wong, A.K., Kamel, M.S.: Classification of imbalanced data: a review. Int. J. Pattern
Recognit Artif Intell. 23(04), 687–719 (2009)
7. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9),
1263–1284 (2009)
8. Bruha, I., Kočková, S.: A support for decision-making: cost-sensitive learning system. Artif.
Intell. Med. 6(1), 67–82 (1994). https://doi.org/10.1016/0933-3657(94)90058-2
9. Kukar, M., Kononenko, I., Grošelj, C., Kralj, K., Fettich, J.: Analysing and improving the
diagnosis of ischaemic heart disease with machine learning. Artif. Intell. Med. 16(1), 25–50
(1999). https://doi.org/10.1016/S0933-3657(98)00063-3
10. Gao, K., Khoshgoftaar, T.M., Napolitano, A.: An empirical investigation of combining
filter-based feature subset selection and data sampling for software defect prediction. Int.
J. Reliab. Qual. Saf. Eng. 22(6) (2015). https://doi.org/10.1142/s0218539315500278
11. Gao, K., Khoshgoftaar, T.M., Napolitano, A.: Aggregating data sampling with feature subset
selection to address skewed software defect data. Int. J. Soft. Eng. Knowl. Eng. 25(09n10),
1531–1550 (2015)
12. Abidine, M.B., Fergani, B., Ordóñez, F.J.: Effect of over-sampling versus under-sampling for
SVM and LDA classifiers for activity recognition. Int. J. Des. Nat. Ecodynamics 11(3), 306–
316 (2016). https://doi.org/10.2495/DNE-V11-N3-306-316
13. Bach, M., Werner, A., Żywiec, J., Pluskiewicz, W.: The study of under- and over-sampling
methods’ utility in analysis of highly imbalanced data on osteoporosis. Inf. Sci. 384, 174–190
(2017). https://doi.org/10.1016/j.ins.2016.09.038
14. Ando, S.: Classifying imbalanced data in distance-based feature space. Knowl. Inf. Syst. 46
(3), 707–730 (2016)
15. Lee, W., Jun, C.-H., Lee, J.-S.: Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification. Inf. Sci. 381, 92–103 (2017). https://doi.org/10.1016/j.ins.2016.11.014
16. López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification
with imbalanced data: Empirical results and current trends on using data intrinsic
characteristics. Inf. Sci. 250, 113–141 (2013)
17. Lee, C.S., Sheen, D.: Nonconforming generalized multiscale finite element methods.
J. Comput. Appl. Math. 311, 215–229 (2017)
18. Rivera, W.A., Xanthopoulos, P.: A priori synthetic over-sampling methods for increasing
classification sensitivity in imbalanced data sets. Expert Syst. Appl. 66, 124–135 (2016)
19. Vluymans, S., Triguero, I., Cornelis, C., Saeys, Y.: EPRENNID: An evolutionary prototype
reduction based ensemble for nearest neighbor classification of imbalanced data.
Neurocomputing 216, 596–610 (2016). https://doi.org/10.1016/j.neucom.2016.08.026
20. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles
for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE
Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012). https://doi.org/10.1109/
TSMCC.2011.2161285
21. Downs, R.: Beware the aliased signal! Electron. Des. 59(4) (2011)

22. Visa, S.: Fuzzy classifiers for imbalanced data sets. University of Cincinnati (2006)
23. Kotsiantis, S., Kanellopoulos, D., Pintelas, P.: Handling imbalanced datasets: a review.
GESTS Int. Trans. Comput. Sci. Eng. 30(1), 25–36 (2006)
24. García, V., Sánchez, J.S., Mollineda, R.A.: On the effectiveness of preprocessing methods
when dealing with different levels of class imbalance. Knowl.-Based Syst. 25(1), 13–21
(2012)
25. Fernández, A., del Jesus, M.J., Herrera, F.: Hierarchical fuzzy rule based classification
systems with genetic rule selection for imbalanced data-sets. Int. J. Approx. Reason. 50(3),
561–577 (2009)
26. Phung, S.L., Bouzerdoum, A., Nguyen, G.H.: Learning pattern classification tasks with
imbalanced data sets (2009)
27. Xiong, H., Wu, J., Liu, L.: Classification with class overlapping: a systematic study. In: The
2010 International Conference on E-Business Intelligence, pp. 491–497 (2010)
28. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles
for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE
Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(4), 463–484 (2012)
29. Longadge, R., Dongre, S.: Class imbalance problem in data mining review (2013). arXiv
preprint arXiv:1305.1707
30. Japkowicz, N.: Learning from imbalanced data sets: a comparison of various strategies. In:
AAAI Workshop on Learning from Imbalanced Data Sets, pp. 10–15, Menlo Park, CA (2000)
31. Batista, G.E., Prati, R.C., Monard, M.C.: Balancing strategies and class overlapping. In:
International Symposium on Intelligent Data Analysis, pp. 24–35. Springer, Heidelberg
(2005)
32. Prati, R.C., Batista, G.E., Monard, M.C.: Learning with class skews and small disjuncts. In:
Brazilian Symposium on Artificial Intelligence, pp. 296–306. Springer, Heidelberg (2004)
33. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.:
KEEL data-mining software tool: data set repository, integration of algorithms and
experimental analysis framework. J. Multiple-Valued Logic Soft Comput. 17 (2011)
34. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data
mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
35. Kotsiantis, S., Kanellopoulos, D., Pintelas, P.: Data preprocessing for supervised learning. Int.
J. Comput. Sci. 1(2), 111–117 (2006)
36. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority
over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
37. Fernández, A., García, S., del Jesus, M.J., Herrera, F.: A study of the behaviour of linguistic
fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets
Syst. 159(18), 2378–2398 (2008)
38. Salzberg, S.L.: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann
Publishers, Inc., 1993. Mach. Learn. 16(3), 235–240 (1994). https://doi.org/10.1007/
bf00993309
