
Received: 14 November 2019 Revised: 31 July 2020 Accepted: 7 September 2020

DOI: 10.1002/eng2.12298

REVIEW

The balancing trick: Optimized sampling of imbalanced datasets—A brief survey of the recent State of the Art

Seba Susan, Amitesh Kumar

Department of Information Technology, Delhi Technological University, Delhi, India

Correspondence
Seba Susan, Department of Information Technology, Delhi Technological University, Rohini, Delhi 110042, India.
Email: seba_406@yahoo.in

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
© 2020 The Authors. Engineering Reports published by John Wiley & Sons Ltd.

Abstract
This survey paper focuses on one of the current primary issues challenging data mining researchers experimenting on real-world datasets: imbalanced class distribution, which generates a bias toward the majority class due to insufficient training samples from the minority class. Current machine learning and deep learning algorithms are trained on datasets in which certain categories are insufficiently represented, while other classes have surplus samples due to the ready availability of data from those categories. Conventional solutions suggest undersampling of the majority class and/or oversampling of the minority class for balancing the class distribution prior to the learning phase. Though this problem of uneven class distribution is, by and large, ignored by researchers focusing on the learning technology, a need has now arisen for incorporating balance correction and data pruning procedures within the learning process itself. This paper surveys a plethora of conventional and recent techniques that address this issue through intelligent representations of samples from the majority and minority classes that are given as input to the learning module. The application of nature-inspired evolutionary algorithms to intelligent sampling is examined, and so are hybrid sampling strategies that select and retain the difficult-to-learn samples and discard the easy-to-learn samples. The findings of various researchers are summarized to a logical end, and various possibilities and challenges for future directions in research are outlined.

KEYWORDS
class-imbalance problem, hybrid sampling, imbalanced data, oversampling, sampling,
undersampling

1 INTRODUCTION

Learning from imbalanced datasets results in a bias toward the majority class whose labeled samples are available in
plenty as compared to the insufficiently represented minority class.1 In data mining, factors that bring down the classifier
performance are the intrinsic characteristics of the data and an uneven class distribution.2 Lack of adequate data in
the minority class results in a fuzzy and ever-varying decision boundary, leading to erroneous results. The problem is
aggravated by the presence of noisy samples that are difficult to distinguish in a depleted minority population. The issue
was first brought into the limelight by Japkowicz3 as a significant research problem in its own right. All real-world datasets have some degree of imbalance, and a perfectly balanced dataset is difficult to find. Where the problem is severe, the overall accuracy is generally high since almost all samples get classified as the majority class, and the poor performance on the depleted minority class does not affect the scores much. However, the matter assumes importance in circumstances such as when the minority class pertains to the detection of rare cancer cells and a wrong diagnosis may prove fatal.
Other examples where the minority population needs attention are software defect detection, fabric defect detection, and
recognition of faces of non-celebrities vs celebrities.
The class-imbalance issue can be deduced from the classical probability statistics model presented by Bayes' theorem, given by

P(Ck | x) = P(x | Ck) P(Ck) / P(x)    (1)

The above equation, which predicts the class of a sample x, can be interpreted as

posterior = (likelihood × prior) / normalization factor    (2)

The likelihood P(x|Ck ) and the prior P(Ck ) decide the value of the posterior probability that in turn decides the class of
x. The largest posterior probability P(Ck |x) that satisfies the condition P(Ck |x) > P(Ci |x), ∀ i ≠ k indicates that the pattern
x belongs to class Ck . In case of skewed class distributions, uneven class priors pre-exist even before the data are applied,
and P(CMajority ) ≫ P(CMinority ). The numerator will be very high in case of the majority class, leading to a biased decision
since the posterior class probability will be automatically high irrespective of the outcome of the likelihood. The posterior
class probability will be very low in case of the minority class even if the likelihood function is high. To avoid faulty
decisions, statisticians have traditionally preferred to work with data having similar class priors.4
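To make the effect of skewed priors concrete, here is a toy computation in Python (all numbers are invented for illustration): a pattern whose likelihood favors the minority class can still be assigned to the majority class once the priors of Equation (1) are applied.

```python
# Toy illustration of the bias induced by skewed priors (all numbers invented).
p_prior = {"majority": 0.95, "minority": 0.05}       # skewed class priors P(C_k)
p_likelihood = {"majority": 0.4, "minority": 0.7}    # P(x | C_k): x actually fits the minority class better

evidence = sum(p_likelihood[c] * p_prior[c] for c in p_prior)            # P(x), the normalization factor
posterior = {c: p_likelihood[c] * p_prior[c] / evidence for c in p_prior}

print(posterior)  # {'majority': ~0.916, 'minority': ~0.084}: x is still assigned to the majority class
```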
One reason for the surging interest in the topic of imbalanced datasets, in current times, is the fact that real-world
datasets today are mostly crawled from the web that has variable sources of information.5 This causes an imbalance in the
number of samples accumulated in different categories, unlike in the past when datasets were aggregated systematically
by a human annotator. A comprehensive list of application domains where class-imbalance persists, such as medical
diagnosis and network intrusion detection, is given in Reference 6. The number of samples per category is also subject to
the commonality of occurrence of the category as observed for the cancer dataset where the number of “Benign” cases is
almost double that of “Malignant” cases.7 Another reason that the issue is of current concern to the research community
is that the number of categories has substantially spiked as compared to the last decade, primarily due to the advent of
deep neural networks that can be trained on thousands of image classes using high performance computing machines.8 In
cases where the Imbalance Ratio (IR) (ratio of the number of majority samples to the number of minority samples) is very
high, all the samples of the minority may get classified as majority samples. Transforming the input data space by distance
metric learning constrains class populations to restricted subspaces that improves classification performance, as observed
by the authors in a recent work.9 Majority voting within an elite group of k-nearest neighbors of the transformed test
sample decided its class label. However, distance metric learning becomes computationally complex when the number of
classes is high as observed for the imbalanced Labeled Faces in the Wild (LFW) dataset in Reference 10. Here, the distance
metric is learnt exclusively from sampled subspaces of the majority classes prior to the input space transformation. Also,
transformations of the input space tend to improve results only for distance-based classifiers and their variants; for the
general group of classifiers, the improvement would not be significant since there is no change in the class distribution
which is still uneven.
Balancing the class-distribution is thus essential for efficient learning and unbiased decision-making. As per the
review in Reference 11, class imbalance can be resolved by one of several approaches namely (a) data-level techniques
that involve sampling the dataset prior to learning, (b) algorithmic techniques that involve modifying the machine learn-
ing algorithm, and (c) cost-sensitive learning. The review concluded that while sampling achieved the best results in
most of the cases, a combination of sampling and algorithmic techniques such as boosting is expected to enhance the
classification performance. Our survey paper reviews state-of-the-art resampling strategies like the Synthetic Minority Oversampling Technique (SMOTE), as well as hybrid algorithmic methods that incorporate resampling, like SMOTEBoost, which will be
discussed in detail. Cost-sensitive learning assigns a higher weight to the misclassification cost pertaining to the minor-
ity class as compared to that of the majority class.114 In Reference 12, a weight of 1.0 was assigned to the minority class

and 0.1 to the majority class in order to correct a 1:9 class-imbalance. Susan et al13 proposed a class-weight equal to
the ratio of total population to class population, for detecting lung nodule images. For the majority class (non-nodule
images), this ratio was observed to be 1:1, while for the minority class (nodule images), the ratio was 6:1, thereby attribut-
ing a higher weight to the minority loss function. Zhou and Liu14 implemented oversampling, undersampling, and
threshold moving with hard and soft voting schemes for cost-sensitive learning using neural networks. Their exper-
iments concluded that threshold moving with soft ensemble was the best combination for tackling class-imbalance.
Provided the misclassification costs are known beforehand, cost-sensitive learning is considered a better option as com-
pared to sampling due to the obvious disadvantages related to loss of information while undersampling and the likely
overfitting of data due to oversampling. These disadvantages can be minimized by employing intelligent sampling tech-
niques as we discuss in this paper. However, it is difficult to know the misclassification costs beforehand, and not all
machine learning algorithms have cost-sensitive implementations available. These issues together with the current avail-
ability of real-time data in huge volume, velocity and variety render sampling a practical choice over cost-sensitive
learning.
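As a minimal sketch of the class-weighting idea described above (weight equal to the ratio of total population to class population): the label counts below are invented, and the resulting dictionary can be handed to any learner that accepts per-class costs, similar in spirit to the class_weight argument exposed by many scikit-learn estimators.

```python
import numpy as np

# Hypothetical imbalanced label vector: 600 majority (label 0) and 100 minority (label 1) samples.
y = np.array([0] * 600 + [1] * 100)

classes, counts = np.unique(y, return_counts=True)
# Weight each class by total population / class population, as described above.
class_weight = {int(c): len(y) / n for c, n in zip(classes, counts)}
print(class_weight)  # {0: ~1.17, 1: 7.0}: the minority loss terms are weighted more heavily
```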
Sampling of a dataset can be achieved in two ways: either by increasing the number of samples in the minority class
to match the majority population (oversampling) or by cutting down the majority population to match the minority
population (undersampling). Pruning the dataset by resampling is a natural choice for achieving the balance. However, at
the same time, inter-class and intra-class diversity in the training data has to be maintained. This throws up a challenge in regard to the need for careful and intelligent pruning of imbalanced datasets such that the representation from all classes is balanced while this diversity is preserved. A careful selection of a
subset of samples that carry class-discriminatory information would create a balanced distribution that achieves a better
learning performance. Figure 1A illustrates the class-imbalance problem for a synthetic dataset. Some simplistic baseline
resampling strategies for correcting the class-imbalance are also introduced; they will be perused in greater detail in subsequent sections. The methods demonstrated include oversampling the minority class by SMOTE (Figure 1B), random undersampling (RUS) of the majority class (Figure 1C), shortlisting the borderline instances of both classes (Figure 1D), and borderline instances with noisy samples removed (Figure 1E). The borderline instances are those that lie near the decision
boundary and hence influence the decision-making. We will be exploring methods in our study where, once the borderline
instances are obtained, they are oversampled to increase the final population. Most of the methods primarily differ on
the basis of the "intelligence" of their sampling algorithm. The term "intelligent sampling" refers to the ability of the sampling algorithm to retain samples or create new samples that enhance class-discriminatory information in order to induce a better classification performance. Interestingly, the number of samples in the final balanced population varies as per sampling strategy, with a larger population in cases of oversampling (Figure 1B) and a depleted population in cases of undersampling (Figure 1C-E). However, as we shall see, the size of the final population has no bearing on the results. The performance depends on the class-discriminatory information contained in the pruned population.

FIGURE 1 Demonstration of class-imbalance and some simplistic sampling strategies [Red Discs: Majority class, Blue Triads: Minority class]: (A) Original class-distribution (B) oversampling the minority class by SMOTE (C) random undersampling of the majority class (D) shortlisting the borderline instances of both classes (E) borderline instances with noisy samples removed
This paper discusses several approaches of pruning imbalanced datasets by resampling the majority and/or minor-
ity classes. The resampling strategies are roughly divided into undersampling and oversampling techniques and a set of
hybrid techniques that employ both undersampling and oversampling. The term “intelligent sampling” is paid a lot of
attention to, with an effort to understand the logic behind each method. The organization of this paper is as follows.
Section 2 reviews various undersampling techniques. Section 3 discusses the vast diversity of oversampling techniques
available in literature. Section 4 describes the hybrid sampling strategies. Section 5 discusses the implementation details
and experimental setup for sampling algorithms that include a discussion on software tools, popular datasets and perfor-
mance metrics. Section 6 summarizes the challenges and outlines the future directions of research and applications for
imbalanced learning. Section 7 draws the final conclusions from our study.

2 UNDERSAMPLING THE MAJORITY CLASS

Undersampling the majority class is a natural choice to begin with for solving the imbalanced class problem. The criteria
for deciding which samples of the majority class should be deleted and which should be retained are what define the undersampling strategy. The balancing achieved by simply cutting down the majority population is expected to eliminate any class-induced bias while learning, however at the cost of loss of information.113 How far this loss affects the performance is a matter of study. Here, we review some methods that followed this line of research. Cleaning the dataset by identifying and deleting noisy samples is an old concept. One of the significant works in this
regard is that of Tomek,15 who defined Tomek links as a pair of samples belonging to two different classes whose distance
is minimum. This could be interpreted in two ways: either both are borderline examples or one of the samples is noise.
Hence the solution would be either to delete both the samples or delete the sample that belongs to the majority class.
Several modern works in literature make use of Tomek links to balance their population as observed from Table 1 that
outlines some distinctive sampling strategies.
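A rough sketch of Tomek-link detection follows; it is not the implementation used in the surveyed works. Two samples form a link if they are mutual nearest neighbors carrying different class labels. The function name is ours, and the code assumes no duplicate points, so each sample's nearest point other than itself sits in the second neighbor column.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_links(X, y):
    """Return index pairs (i, j) of mutual nearest neighbors with opposite labels."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # Column 0 is the query point itself (distance 0), column 1 its nearest other point.
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    links = []
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i and i < j:
            links.append((i, int(j)))
    return links

# Cleaning then deletes both members of each link, or only the majority-class member.
```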
In RUS, demonstrated in Figure 1C, a small population of majority samples is selected randomly to match the minority population. Since no intelligent technique or reasoning is applied for the selection/rejection of majority samples, the loss of information in the form of useful majority samples that were not included is likely to be higher.3 Clustering is another tool for cutting down the majority population. In Reference 32, the number of clusters of the majority class is set equal to the minority population so that a balance ensues. The cluster centers form the new majority population. Alternatively, substituting the cluster centers with their nearest neighbors is also considered in the same work, which proved to be a more viable approach. Mani and Zhang33 conducted an investigation on the effect of class-imbalance on the k-NN clas-
sifier. They proposed NearMiss-1, NearMiss-2, and NearMiss-3 criteria for selecting majority class samples based on the
average distance to the minority class samples; out of all three methods, NearMiss-2 performed best. NearMiss-2 selected
only those majority samples that were close to the three farthest minority samples.
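A minimal sketch of the NearMiss-2 criterion as described above; the function name and signature are ours, and the published method includes further details not shown here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def nearmiss2_select(X_maj, X_min, n_keep, k=3):
    """Keep the majority samples whose average distance to their k farthest
    minority samples is smallest (the NearMiss-2 criterion described above)."""
    d = cdist(X_maj, X_min)                        # pairwise majority-to-minority distances
    k_farthest = np.sort(d, axis=1)[:, -k:]        # each majority sample's k largest distances
    avg_dist = k_farthest.mean(axis=1)
    return np.argsort(avg_dist)[:n_keep]           # indices of the retained majority samples
```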
Often intelligent learning techniques are used in conjunction with, or as alternatives to, random sampling, for higher
performance. Sometimes, undersampling is integrated into the learning process itself, as observed in the case of Refer-
ence 34 where two different approaches are presented. These are EasyEnsemble (learns a separate classifier for different
subsets of samples from the majority class and combines all outputs using ensemble technique) and BalanceCascade
(separate classifiers are trained sequentially on different subsets of samples from the majority class after removal of the
correctly classified examples at each step). In Reference 35, the minority class is trained on a greater number of classi-
fiers as an alternative to increasing the number of minority instances. The proposed Biased Random Forest introduces
oversampling at the algorithmic-level instead of at the data-level. Some research works focus on undersampling while
simultaneously transforming the data space. One such work is that of Dumpala et al,36 in which two samples are com-
bined prior to training by simple concatenation and likewise for their respective class-labels that assume four different
forms of (majority-majority), (majority-minority), (minority-majority) and (minority-minority). To cut down on the con-
tribution of majority samples, the combination (majority-majority) is reduced to (majority-majorityN ), where majorityN
constitutes a subset of the samples selected randomly from the majority class, the exact number of which is set equal to the
number of minority samples (N). This data representation was termed the majority-constrained s2s data representation.
The testing phase consisted of combining the test sample with the reference samples, followed by voting to predict the class label.

TABLE 1 Summary of some distinctive sampling strategies in literature

SMOTE-TOMEK16
• Sampling strategy: Oversampling of the minority class followed by data cleaning
• Distinctiveness factor: Detection of Tomek links (majority-minority noisy pairs of samples)
• Procedure of pruning the training set: SMOTE followed by removal of Tomek links to achieve data cleaning in both classes
• Experimental details: Domain: Data mining. Datasets: 15 UCI datasets. Classifier: Euclidean distance (quantitative attributes) and Value Difference Metric (qualitative attributes), Decision trees. Baseline approaches compared: SMOTE, SMOTE-ENN, RUS, CNN, Tomek, OSS, NCL

SMOTE-ENN16
• Sampling strategy: Oversampling of the minority class followed by in-depth data cleaning
• Distinctiveness factor: Edited Nearest Neighbor rule (ENN): checking the class labels of the three nearest neighbors of each majority sample
• Procedure of pruning the training set: SMOTE followed by removal of the majority sample whose label is different from that of its three nearest neighbors
• Experimental details: Domain: Data mining. Datasets: 15 UCI datasets. Classifier: Euclidean distance (quantitative attributes) and Value Difference Metric (qualitative attributes), Decision trees. Baseline approaches compared: SMOTE, SMOTE-ENN, RUS, CNN, Tomek, OSS, NCL

One sided selection (OSS)17
• Sampling strategy: Undersampling the majority class by data cleaning followed by removal of redundant negative samples (one-sided selection)
• Distinctiveness factor: Retaining all the minority samples but selectively removing the noisy majority samples
• Procedure of pruning the training set: Construct Tomek links and remove the associated noisy majority examples; remove redundant majority samples by removing the instances that are similar to the selected majority samples
• Experimental details: Domain: Data mining. Datasets: two data files related to an oil spill detection project and five UCI datasets. Classifier: 1-Nearest Neighbor, Decision trees. Baseline approaches compared: using all training samples vs data reduction by removing redundant negative samples

Cluster-SMOTE18
• Sampling strategy: Oversampling the minority class after clustering
• Distinctiveness factor: Independent cluster-wise oversampling
• Procedure of pruning the training set: Form sub-clusters of the minority class by the k-means clustering algorithm and oversample each sub-cluster using SMOTE
• Experimental details: Domain: Intrusion detection. Datasets: Intrusion datasets collected from network traffic. Classifier: Rule-based classification. Baseline approaches compared: SMOTE, RIPPER rule learning for intrusive traffic + Undersampling

Undersampling based on clustering (SBC)19
• Sampling strategy: Undersampling the majority class after clustering
• Distinctiveness factor: Independent cluster-wise undersampling
• Procedure of pruning the training set: Form sub-clusters of the training data by the k-means clustering algorithm and undersample the majority samples in each sub-cluster based on the imbalance ratio of that sub-cluster
• Experimental details: Domain: Data mining. Datasets: Artificial and real-world datasets (two UCI datasets). Classifier: Neural network. Baseline approaches compared: RUS, NearMiss-2

Cluster-based oversampling by inflating small disjuncts (CBOS)12
• Sampling strategy: Oversampling clusters of both the minority and majority classes
• Distinctiveness factor: Clustering of the majority and minority classes separately, followed by oversampling each cluster to address the within-class imbalances along with the pre-existing inter-class imbalance
• Procedure of pruning the training set: Oversample the clusters of the majority class (to match the largest class) and then increase the minority class population to match the oversampled majority cluster population
• Experimental details: Domain: Data mining. Datasets: Artificial and real-world datasets (two UCI datasets). Classifier: Decision trees and neural network. Baseline approaches compared: Oversampling, Undersampling, Cost modification

Safe-level-SMOTE20
• Sampling strategy: Selective oversampling of the minority class
• Distinctiveness factor: Compute a threshold called the safe-level for every minority sample based on the number of minority instances in the set of its k-nearest neighbors
• Procedure of pruning the training set: Generate synthetic minority samples "in the safe zone", nearer to a minority instance with a larger safe-level
• Experimental details: Domain: Data mining. Datasets: two UCI datasets. Classifier: Decision trees, SVM, Naïve Bayes. Baseline approaches compared: SMOTE, Borderline SMOTE

Similarity-based undersampling and normal distribution based oversampling (SUNDO)21
• Sampling strategy: Independent oversampling of the minority class and undersampling of the majority class
• Distinctiveness factor: Compute the distance of candidate synthetic minority samples from the majority class (for oversampling); compute the similarity between pairs of majority samples to detect redundancy (for undersampling)
• Procedure of pruning the training set: Keep the synthetic minority samples that have the largest distance from the majority class (oversampling); eliminate one of each pair of similar majority samples (undersampling)
• Experimental details: Domain: Data mining. Datasets: Artificial dataset and two industrial datasets. Classifier: SVM, Bayesian classifier, Decision trees, Self-organizing map. Baseline approaches compared: SMOTE, informed SMOTE, informed clustering based oversampling

Distinct-Borderline (DBI)22
• Sampling strategy: Oversampling of both classes followed by undersampling of the majority class
• Distinctiveness factor: Create a separate boundary subset of majority and minority samples; generate synthetic samples for both the minority class and the majority class in this boundary subset
• Procedure of pruning the training set: Remove certain majority samples from the boundary subset that are similar to the minority samples in the subset
• Experimental details: Domain: Data mining. Datasets: six UCI datasets. Classifier: Decision trees. Baseline approaches compared: Random oversampling, SMOTE, Borderline SMOTE

Probability density function (PDF) estimation based oversampling (PDFOS)23
• Sampling strategy: Oversampling of the minority class based on the estimated statistical distribution
• Distinctiveness factor: PDF of the minority class estimated by the Parzen-kernel window function
• Procedure of pruning the training set: For every minority sample, generate synthetic samples using the estimated PDF, assuming the new samples share the same statistical properties
• Experimental details: Domain: Data mining. Datasets: six UCI datasets. Classifier: Radial Basis Function. Baseline approaches compared: Orthogonal Forward Selection (OFS), k-means

STA-SMOTE24
• Sampling strategy: Oversampling of the minority class
• Distinctiveness factor: Synthetic samples are generated by SMOTE whose adjustable parameters (N, k) are optimized by the State Transition Algorithm (STA)
• Procedure of pruning the training set: The number of synthetic samples is restricted between the limits of the minority population and the majority population
• Experimental details: Domain: Data mining with industrial application. Datasets: four UCI medical datasets and an aluminum electrolysis dataset collected from an aluminum industry in China. Classifier: Decision trees and neural network. Baseline approaches compared: PSO-SMOTE

Evolutionary Undersampling (EUS)25
• Sampling strategy: Undersampling of the majority class by prototype selection
• Distinctiveness factor: Removal of instances from the majority class based on the survival of the fittest chromosome
• Procedure of pruning the training set: The fitness function is set equal to the accuracy / Area Under the Curve (AUC) / Geometric Mean (GM) of the True Positive Rate and True Negative Rate (AUC and GM gave equally good performance)
• Experimental details: Domain: Data mining. Datasets: 28 UCI datasets (results averaged across all datasets). Classifier: 1-Nearest Neighbor classifier. Baseline approaches compared: CNN, NCL, OSS, RUS, Tomek Links

NRSBoundary-SMOTE26
• Sampling strategy: Oversampling of the minority class based on the Neighborhood Rough Set (NRS) model, with the aim that majority class performance is unaffected
• Distinctiveness factor: Construct a BORDERLINE set comprising those minority instances that are closer to the decision boundary and lie in a rational neighborhood that does not affect majority samples
• Procedure of pruning the training set: Synthetic samples are generated based on the minority instances in the BORDERLINE set
• Experimental details: Domain: Data mining. Datasets: 15 UCI datasets. Classifier: C4.5 Decision trees, CART, k-NN, SVM. Baseline approaches compared: SMOTE, Borderline SMOTE, SMOTE-RSB

Under Sampling Strategy (USS)27
• Sampling strategy: Undersampling of the minority and majority classes
• Distinctiveness factor: Removal of noisy samples, defined as misclassified instances whose nearest neighbors belong to the opposite class
• Procedure of pruning the training set: Iterative removal of the weak instances and use of a correlation-based subset filter for obtaining the optimal sample subset
• Experimental details: Domain: Software engineering. Datasets: four software defect analysis datasets. Classifier: Decision trees. Baseline approaches compared: C4.5, Classification and Regression Trees (CART), reduced error pruning tree

SMOTE-Cosine28
• Sampling strategy: Oversampling of the minority class
• Distinctiveness factor: Compute the cosine similarity between each minority sample and all other minority samples to determine the nearest neighbors
• Procedure of pruning the training set: Sort the nearest neighbors of minority samples as per the sum of ranks assigned by cosine similarity and Euclidean distance, and generate synthetic samples for the samples at the top
• Experimental details: Domain: Data mining. Datasets: 18 UCI datasets. Classifier: SVM with linear kernel. Baseline approaches compared: SMOTE

Adaptive semi-unsupervised weighted oversampling (A-SUWO)29
• Sampling strategy: Oversampling of the minority class
• Distinctiveness factor: Cluster minority samples by hierarchical clustering and oversample each sub-cluster considering its Euclidean distance from the majority class
• Procedure of pruning the training set: Generate synthetic samples for the borderline cases in each sub-cluster
• Experimental details: Domain: Data mining. Datasets: 16 UCI datasets. Classifier: SVM. Baseline approaches compared: Random oversampling, SMOTE, Borderline SMOTE, Safe-level SMOTE, Cluster-SMOTE, MWMOTE, CBOS, SBC

Hybrid sampling with data space transformation30
• Sampling strategy: Undersampling of the majority class, followed by oversampling of the minority class, followed by undersampling of the oversampled minority class
• Distinctiveness factor: Sampling is followed by data space transformation of the pruned and balanced training dataset
• Procedure of pruning the training set: The classifier is trained on the pruned and transformed dataset; the transformation matrix learned from the pruned training dataset is applied to the test sample for effective distance-based classification
• Experimental details: Domain: Data mining. Datasets: 12 UCI datasets. Classifier: 1-Nearest Neighbor classifier. Baseline approaches compared: ADASYN, MWMOTE, Borderline SMOTE, SSOMaj-SMOTE-SSOMin

CDSMOTE31
• Sampling strategy: Undersampling the majority class by class decomposition (CD) and oversampling the minority class by SMOTE
• Distinctiveness factor: Reducing the dominance of the majority class without information loss
• Procedure of pruning the training set: Majority samples are clustered based on similarities by class decomposition, and the minority class is oversampled by SMOTE to balance the distribution
• Experimental details: Domain: Data mining. Datasets: 60 datasets from the KEEL repository. Classifier: SVM with linear kernel. Baseline approaches compared: ADASYN, SMOTE, Class decomposition
The disadvantage of RUS of the majority class is the possibility of loss of useful samples that could have helped dis-
criminate between the majority and minority classes. Neighborhood Cleaning Rule (NCL) was proposed by Laurikkala37
for data reduction of difficult datasets using Wilson’s Edited Nearest Neighbor (ENN) distance rule.38 The ENN rule
states that those majority class samples whose two out of three nearest neighbors have different class labels are deleted.
Apart from NCL, all undersampling methods tagged with (-ENN) use the above stated Wilson’s ENN rule. NCL modi-
fies Wilson’s rule by applying it on the minority class as well. If the minority class samples have two out of three nearest
neighbors differing in class labels, and if these nearest neighbors belong to the majority class, then these majority sam-
ples are also deleted. The research in NCL promotes ideas leading to modification of pre-existing distance and similarity
rules for matching of templates by redefining these algorithms to behave differently for the minority and the majority
classes. Recently, a number of sampling algorithms have emerged that are based on evolutionary algorithms for refining
their population. Some of the undersampling techniques that follow this approach are discussed next. To tap the “use-
fulness” of samples toward class-discrimination, a cross-validation procedure was used in Sample Subset Optimization
(SSO) proposed by Yang et al.39 Optimized subsets of samples that minimize the k-fold cross-validation error, are short-
listed by an evolutionary algorithm (both Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) are tested
for the purpose). PSO40 is a meta-heuristic algorithm that simulates bird flocking behavior to achieve an objective such
as finding food. GA proposed by Goldberg41 is based on the principle of natural selection. Here, a population of chro-
mosomes representing the unknown variables are refined, with the fittest individuals in each generation being selected
for reproduction, cross-over and mutation, to generate the next generation of offspring. In the undersampling scenario,
the objective is to find the optimal subset of training samples in the majority class that offers the maximum discrimi-
natory information. The majority samples are then ranked based on the number of times they have been included in
such optimized subsets and the most frequent of them all are selected as the representative samples from the major-
ity class that balances the minority population. Another approach that used GA for instance selection was observed in
Reference 42 where a cluster-based undersampling overcame the class-imbalance problem in bankruptcy prediction.
Ant Colony Optimization was applied in Reference 43 for shortlisting significant majority samples of the DNA microarray data that suffers from a severe imbalance problem. The training set was randomly divided into sample subsets. A statistical analysis of the results highlighted the majority samples that got selected with higher frequency. These high-frequency samples were then combined with minority samples for learning. PSO was investigated in Reference 44 for example-based learning under the axiom that incorporating multiple good examples while learning leads to fast and good solutions. Emerging works such as Reference 45 explore more obscure swarm intelligence techniques such as Stochastic Diffusion Search and Dispersive Flies Optimization for intelligent sampling of the majority population. It was observed
here that the agents' activity is proportional to data redundancy in the dataset. Figure 2 shows the results of intelligent undersampling of the majority class by SSO-PSO on the Ionosphere dataset. Majority and minority samples after pruning are confined to local regions, with a few odd exceptions. The distribution of the pruned and balanced population in the case of SSO-PSO (Figure 2B) is intuitively observed to be more helpful for learning class boundaries than is possible with RUS, where random samples, irrespective of their distance from the minority samples, are deleted in order to balance the population.

FIGURE 2 Demonstration of an intelligent undersampling scheme [Red Discs: Majority class, Blue Triads: Minority class]: (A) Original distribution of Ionosphere (B) Balanced distribution after intelligent undersampling of the majority class by SSO (with Particle Swarm Optimization as the intelligent agent in the search space)
RUS of the majority class involves simple programming that executes in a fraction of a second even on a Pentium 266 MHz, 96 MB RAM desktop PC or higher. The undersampling algorithms that involve evolu-
tionary algorithms like GA, PSO, and so on, for selecting the optimal subset, take up more time for execution depending
on the size and dimension of the search space. In fact, the primary advantage of using RUS is that it is simple and fast.
RUSBoost46 combines RUS with AdaBoost to provide a simpler and faster version of its predecessor SMOTEBoost that
combines the oversampling technique SMOTE with the AdaBoost classifier. While SMOTE involves some computational
complexity in the form of interpolation to compute new synthetic samples, RUS involves simply deleting samples at ran-
dom, which is a much faster and simpler approach. SMOTE and SMOTEBoost are discussed in more detail in the next section, which is devoted to oversampling strategies.
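For readers who prefer a ready-made implementation, the Python imbalanced-learn toolbox discussed later in Section 5.1 ships a boosting variant along these lines; the short usage sketch below runs on synthetic data and the parameters are purely illustrative.

```python
# Usage sketch only; assumes the imbalanced-learn package (imblearn) is installed.
from sklearn.datasets import make_classification
from imblearn.ensemble import RUSBoostClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
clf = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```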

3 OVERSAMPLING THE MINORITY CLASS

Oversampling the minority class is another natural choice for solving the imbalanced class problem, by creating more instances of the underrepresented minority class. Random oversampling selects the minority samples to be replicated in a randomized manner. Several researchers prefer increasing the number of samples as opposed to removing samples via undersampling since the latter may lead to loss of some amount of information. The instances can be generated either by duplicating the minority samples or by introducing rotation and skew to perturb the data.47 If oversampling is not handled intelligently, this approach suffers from the side-effects of creating random noise and unnecessary duplications, and may lead to overfitting. One of the most popular works that emerged over time as
a standard baseline is that of Chawla et al48 who proposed the SMOTE Technique. It worked on the simple principle of
creating new minority instances by interpolating neighboring minority instances. The aim was to introduce variation in
the dataset. The interpolation process is demonstrated below for two minority instances p and q that are interpolated to
create a new synthetic instance r as

r = p + (q − p) × 𝜆 (3)

where 𝜆 is a random number in the range [0,1]. There are two adjustable parameters associated with SMOTE: the over-
sampling rate and the number of nearest neighbors. The advantages of SMOTE include simplicity of execution and
robustness when applied to a variety of data imbalance problems. Since then several advancements over SMOTE and its
extensions have been proposed in the past 15 years which have been comprehensively summarized in the recent survey by
Fernández et al.49 SMOTE expands the decision space for the minority class, improving the performance, while prevent-
ing overfitting. However, at the same time, the decision space for the majority class is contracted and the probability of
misclassification for majority samples may rise. Hence researchers who propose extensions or variations of SMOTE should take note of this trade-off to ensure effective classification of both the minority and majority classes. Combining SMOTE with the AdaBoost classifier (SMOTEBoost50) has led to improved learning performance. Boosting is an iterative learning
algorithm that in each iteration assigns higher weights to misclassified examples. Since the misclassified examples, in case
of imbalanced data, are more likely to belong to the majority class, a skewed distribution exists among the misclassified
examples as well. Hence SMOTE is incorporated into the Boosting procedure by which, in each iteration, minority exam-
ples that are misclassified are oversampled as per the SMOTE algorithm, a classifier is learnt, and then these examples are
discarded; this process repeats for every iteration. DataBoost-IM51 and RAMO-Boost52 were developed along similar lines.
While DataBoost-IM oversamples both majority and minority classes in each boosting iteration, RAMO-Boost advocates
ranked minority class oversampling in which the minority samples are ranked in each boosting iteration according to an
underlying probability distribution so that the decision boundary is adaptively shifted toward the more difficult-to-learn
instances. In Enhanced Structure Preserving Oversampling53 and Mahalanobis distance-based oversampling,54 the struc-
ture of the original input dataset is preserved while generating synthetic samples. Structure could pertain to either the
covariance distribution or the underlying probability distribution.
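A bare-bones sketch of the SMOTE-style interpolation in Equation (3) is given below; the function is ours, omits the bookkeeping of the full algorithm (balancing targets, handling of nominal attributes, and so on), and assumes the minority set contains more than k samples.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by the interpolation of
    Equation (3): r = p + (q - p) * lambda, with q drawn from the k nearest
    minority neighbors of p and lambda uniform in [0, 1]."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]   # drop the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))              # pick a minority sample p
        j = rng.choice(neigh[i])                  # pick one of its minority neighbors q
        lam = rng.random()
        synthetic.append(X_min[i] + (X_min[j] - X_min[i]) * lam)
    return np.asarray(synthetic)
```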

Some of the significant and well-known extensions of SMOTE are described next. Each aims to apply some form of intelligence interwoven into the sampling process. The term "intelligence" here pertains to being aware of the locations of other minority samples and/or majority samples while oversampling, in order to create a new population that will define the
class boundaries in a better manner. Borderline SMOTE is one of the first attempts toward constructing intelligent over-
sampling algorithms.55 The motive is selective resampling that involves replication of those minority instances that are
difficult to classify and lie near the decision boundary, as observed from the toy example in Figure 1D. First of all, the
borderline minority examples that lie along the decision boundary are identified. Then the k nearest neighbors of each
borderline minority instance are determined and new minority instances are synthetically generated by interpolating
along the line connecting each borderline minority sample and each of its nearest neighbors, thereby strengthening the
borderline minority population. In Adaptive Synthetic sampling (ADASYN),56 the number of synthetic instances is determined individually for each minority sample using a weighted distribution proportional to the sample's closeness to the majority class. More synthetic instances are generated for those minority examples that prove difficult to learn due to their greater proximity to the majority samples. The number of synthetic minority samples generated can thus be adaptively controlled. Both Borderline SMOTE and ADASYN focus on resampling those minority samples that are close to the decision boundary. Other types of SMOTE variations boast newer interpolating techniques, such as those that involve interpolating with majority instances as well,55 creating synthetic samples closer to the minority
instance than any of its k nearest neighbors,57 clustering-based techniques that consider the cluster centroids to be the new synthetic minority instances,58 and estimation of probability distribution functions of the data.23 The type of research covered in SMOTE and its extensions is summarized as: selecting the right minority samples, adaptively controlling the number of created instances for each minority sample, and defining the computation involved in generating the new synthetic samples.

FIGURE 3 Balanced distribution achieved after applying some of the resampling methods (based on SMOTE oversampling and its variants) on the imbalanced Page-blocks dataset [Red Discs: Majority class, Blue Triads: Minority class]: in the order from left to right, (A) Original distribution (B) Distribution after SMOTE (C) Distribution after Borderline SMOTE (D) Distribution after ADASYN (E) Distribution after MWMOTE (F) Distribution after MAHAKIL (G) Distribution after SMOTE-RSB (H) Distribution after the hybrid SSOMaj-SMOTE-SSOMin sampling strategy
Inspired by SMOTE and its variations discussed above, a new generation of works has emerged that deserve mention due to their new perspectives on oversampling; we review some of these works next. The first
of these works is Majority Weighted Minority Oversampling Technique (MWMOTE)58 that identifies “difficult” minor-
ity samples that are hard to learn, weights them as per respective distances to the majority class, and then creates the
new synthetic instances by hierarchical clustering of the weighted informative minority samples. The new instances are
generated from a cluster by interpolating two random members of the cluster. Biologically inspired solutions to oversam-
pling are also found in literature. MAHAKIL59 is based on the chromosomal theory of inheritance and was proposed for
software defect detection. The minority instances are divided into bins, and samples with the same sequence number are
averaged to form a representative synthetic sample. Another method, inspired by SMOTE, is certainty guided minority
oversampling proposed by Zhang et al60 in which minority samples are carefully oversampled along the decision bound-
ary by checking for each instance if it increases the certainty of the complete dataset. Certainty of a dataset is interpreted
to be the cumulative posterior class probability associated with the instances in the dataset. Only the synthetic instances
that create a positive change in the certainty are allowed to be included in the resampled dataset. The approach is distinct
due to its treatment of the majority class samples that are considered while generating synthetic samples of the minor-
ity class. Minority Oversampling in Kernel Adaptive Subspaces was proposed in Reference 61 as an intelligent sampling
strategy. The synthetic samples here, are generated from various subspaces that are trained by competitive learning, and
hence bear characteristics of the original input data and at the same time show some variation. Figure 3 demonstrates sev-
eral resampling strategies for a real-world dataset including popular oversampling schemes such as SMOTE, Borderline
SMOTE, ADASYN, MWMOTE, and MAHAKIL. The complex algorithms of ADASYN and MWMOTE are observed to
create a large population of minority instances, with well-defined class boundaries. The minority instances in the case of
ADASYN (Figure 3D) are noted to be more scattered while the population is found concentrated in the case of MWMOTE
(Figure 3E). The real test for an intelligent sampling strategy is when the trained classifier is presented with new data.
The computational complexity of oversampling techniques depends on the complexity of the intelligent sampling incor-
porated. SMOTE and simple variants that involve interpolation of existing samples have fast execution time while those
that involve careful selection of the minority samples to be replicated are more computationally intensive.
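To make the adaptive allocation principle used by ADASYN-like methods (discussed earlier in this section) concrete, the sketch below, which is ours and not the published algorithm, assigns more synthetic samples to minority points whose k nearest neighbors contain more majority samples.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_allocation(X, y, minority_label, n_total_new, k=5):
    """Allocate synthetic samples per minority point in proportion to the
    fraction of majority samples among its k nearest neighbors."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]     # neighbors in the full dataset
    hardness = (y[idx] != minority_label).mean(axis=1)           # fraction of majority neighbors
    if hardness.sum() == 0:                                      # no minority point near the boundary
        return np.zeros(len(X_min), dtype=int)
    weights = hardness / hardness.sum()
    return np.round(weights * n_total_new).astype(int)           # synthetic samples per minority point
```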

4 HYBRID SAMPLING STRATEGIES

The minority samples are retained as they are in most of the undersampling methods discussed in the previous section. These
minority samples, however, could be noisy or borderline or redundant data that are seldom removed out of concern for
decrementing an already diminished minority population. This section reviews some of the methods that advocate hybrid
sampling strategies by combining oversampling and undersampling in their bid to jointly consider both the underrepre-
sented minority class and the overrepresented majority class. Oversampling may sometimes lead to overfitting. Hence it
is advisable to clean the data by undersampling prior to the oversampling. Most of these hybrid strategies use SMOTE as
the oversampling technique. Hu et al62 implement the undersampling part by PSO. Their hybrid sampling strategy called
SMOTE-PSO was applied successfully for identifying malicious web domains. A related work is SMOTE-SSO that sequen-
tially performs oversampling of the minority class and undersampling of the majority class.63 SMOTE-RSB combines
SMOTE with an editing technique based on the Rough Set Theory.64 The synthetic samples generated by SMOTE were
evaluated for their similarity with the majority class samples. Those synthetic samples whose similarity index was high,
were deleted from the training set since they do not contribute to class-discriminatory information. Hence this method
qualifies as a hybrid strategy since it oversamples and then cleans the synthetically generated minority data by under-
sampling. SMOTE was integrated with data gravitation-based classification (DGC) in Reference 65, a physical-inspired
classification model that fails under conditions of class imbalance; the hybrid was named SMOTE-DGC. The hybrid
method SSOMaj-SMOTE-SSOMin66 involves careful selection of the population of both classes, and only representative samples that carry discriminatory information are retained on both sides. An evolutionary algorithm (PSO) is used for the selection. The three-step data pruning process in SSOMaj-SMOTE-SSOMin consists of: (1) undersampling the majority, (2) oversampling the minority, and (3) undersampling the oversampled minority. The results marked an improvement
over the baseline undersampling and oversampling techniques. All the above methods that involve hybridization of

SMOTE with intelligent undersampling, facilitate careful selection of both the majority and minority population that is
a better proposition than careful selection of the majority samples alone. Figure 3 shows the distribution of the resam-
pled Page-blocks dataset after application of several sampling techniques. The red polygon shaped markers indicate the
majority class while the blue triads indicate the minority class. The original distribution shown in Figure 3A highlights
the class-imbalance problem. SMOTE and its variants are observed to augment the minority population though not at a
huge scale as in ADASYN and MWMOTE that agrees with our discussion in Section 3. The results of the hybrid sampling
strategies of SMOTE-RSB and SSOMaj-SMOTE-SSOMin are also shown. SMOTE-RSB shows a variation in the minority population while the majority population is left intact, since both oversampling and undersampling happen in the minority class only. All these methods depict some changes from the original distribution, even though the changes may be
incremental. One observation is that not all methods implement the perfect balancing act. ADASYN, MWMOTE, and
SSOMaj -SMOTE-SSOMin yield a balanced distribution with equal members in both classes. The final balanced population
of SSOMaj -SMOTE-SSOMin is observed to be sparse as compared to all other methods, with the size of the final popula-
tion dependent on the size of the minority population in the original distribution. The balanced datasets are used to train
suitable classifiers, some popular choices being Support Vector Machine (SVM), logistic regression, naïve Bayes, C4.5
decision trees, and k-nearest neighbor classifiers.
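As an illustration of how such hybrid resamplers are typically wired to a classifier in practice, the hedged sketch below combines SMOTE-Tomek with a linear SVM via the imbalanced-learn pipeline on synthetic data; it is a usage example under these assumptions, not a reproduction of any surveyed experiment.

```python
# Usage sketch, not a reproduction of any surveyed experiment.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, weights=[0.92, 0.08], random_state=0)
model = Pipeline([("resample", SMOTETomek(random_state=0)), ("clf", SVC(kernel="linear"))])
# Resampling is applied only to the training folds; the test folds stay untouched.
print(cross_val_score(model, X, y, cv=5, scoring="f1").mean())
```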
In Reference 67, both undersampling and oversampling were performed, with the oversampling being achieved by
fuzzy logic; and elitist selection, recombination and mutation operators facilitating the undersampling part. A hybrid
undersampling-based ensemble approach HUSBoost68 finds special mention here since it achieves balancing by simple
RUS and combines several classifiers through a soft voting approach to decide on the class of the test sample. It is interesting
to note that the first step in the HUSBoost algorithm is data cleaning by removal of noisy data using Tomek Links. A recent
work by Li et al69 uses entropy as a measure for retaining the difficult-to-learn samples and discarding the easy-to-learn
samples. The resulting entropy-based hybrid sampling encourages a new research pattern that involves statistical machine
learning for finding discriminatory samples. A recent work that addresses multi-class imbalance is Evolutionary inversion
of class-distribution in overlapping class regions proposed by Fernandes and de Carvalho.70 This method focuses only
on the minority instances lying in the overlapping regions between classes since these have more class-discriminatory
information. A multi-objective evolutionary algorithm is used to determine the population for each individual classifier
in an ensemble of classifiers with a greater selection of minority instances being from the overlapping region. This method
ensures diversity learning since the classifiers have different views of the dataset. However, the computational complexity
is high due to the iterative evolutionary process executed for each classifier in the ensemble. Considering time complexity, hybrid strategies have more computational overhead due to the sequential process flow that integrates multiple sampling schemes. Parallelizing the sampling strategies in hybrid approaches is a future research direction. Individual attempts at parallelizing generic sampling can be observed in References 71 and 72.
Some distinctive approaches to sampling in literature, other than those described above and in the previous sections,
are outlined in Table 1 along with the concept used in each approach that makes it distinctive. The implementation details
are also summarized for each method in the last column of Table 1. This includes the domain, datasets, classifier, and
baseline methods compared with, for the reference of the readers.

5 EXPERIMENTAL SETUP FOR IMBALANCED LEARNING

5.1 Software tools

Python contains several useful libraries such as numpy, scipy, and scikit-learn that are used by programmers implementing
data mining algorithms. These libraries contain inbuilt functions that are helpful for classification and decision-making.
Sampling algorithms were implemented in the past in C/C++/JAVA languages that involve coding from scratch. This
offers the programmer more flexibility in executing a sampling technique rather than relying on what the pre-built
functions have to offer. Some recent works have adopted the “code from scratch” methodology such as SSO-PSO and
SSOMaj -SMOTE-SSOMin that are coded in JAVA. The R language is also popular among data mining enthusiasts and
contains functionalities such as training set selection, discretization, feature selection, and so on, that aid in classifying
imbalanced datasets.73
Modern toolboxes specialized for tackling imbalanced datasets have been introduced that offer a limited range of
flexibility of choice of parameters to the programmer. Imbalanced-learn is an example of an open-source Python tool-
box dedicated for learning from imbalanced datasets.74 State-of-the-art methods for oversampling, undersampling, and

hybrid sampling such as SMOTE and Tomek Links are included in this toolbox. An open-source JAVA-based software
tool is Knowledge Extraction based on Evolutionary Learning (KEEL) that incorporates evolutionary algorithms for clas-
sifying imbalanced datasets.75 KEEL incorporates 45 algorithms for imbalanced classification that include cost-sensitive
learning, ensemble, and resampling. Multi-imbalance60 is an open-source software for multi-class imbalanced datasets.
This toolbox includes 18 algorithms for multi-class learning in an imbalanced scenario. Ensemble techniques such as
AdaBoost, DECOC, and DOVO are adapted for the multi-class scenario by incorporating sample weighting and intelli-
gent combination of base classifiers. A performance analysis by the authors proved that DECOC and DOVO are the best
software tools in Multi-imbalance for classification of multi-class imbalanced datasets. DECOC or Diversified Error Cor-
recting Output Codes was proposed by Bi and Zhang76 and DOVO is the Diversified One-vs-One strategy proposed by
Kang et al77 for multi-class classification. DECOC is a multiple classifier or ensemble system. Codewords are constructed
for multiple classes based on the largest distance between the classes and a dichotomy classifier translates the codes into
class labels. DOVO, on the other hand, selects the strongest classifier for a subset of data as the one with the lowest error
rate. The subset is formed by selecting samples from each pair of classes. The outputs of the strongest classifiers for all
subsets are combined to make the final prediction.
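A minimal usage sketch of the imbalanced-learn toolbox mentioned above follows; all of its samplers share the same fit_resample interface, and the synthetic data and parameters are for illustration only.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)           # oversampling
X_clean, y_clean = TomekLinks().fit_resample(X, y)                  # undersampling / data cleaning
X_hybrid, y_hybrid = SMOTETomek(random_state=0).fit_resample(X, y)  # hybrid sampling
```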

5.2 Evaluation metrics

The performance evaluation metrics considered significant in case of skewed class distributions are: Area Under the
Receiver Operating Characteristics Curve (AUC), G-mean, and the F1-score derived from the precision-recall values
computed from predicted class labels, since these are more informative on the scores of the minority class.78
F1-score is computed from precision and recall values as

F1-score = 2 × (precision × recall) / (precision + recall)    (4)

where precision and recall are computed from the number of True Positives (TP), False Positives (FP), and False Negatives (FN) as

precision = TP / (TP + FP)    (5)

recall = TP / (TP + FN)    (6)

The G-mean index is also used by some researchers.21 This index is defined as the geometric mean of the True Positive Rate (TPR) and the True Negative Rate (TNR), which are defined as follows:

G-mean = √(TPR × TNR)    (7)

TPR = TP / (TP + FN)    (8)

TNR = TN / (TN + FP)    (9)
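The helper below (ours) computes the metrics of Equations (4)-(9) from raw confusion-matrix counts; the counts in the example are invented to show how a majority-biased classifier can look accurate while its F1-score and G-mean collapse.

```python
import numpy as np

def imbalance_metrics(tp, fp, fn, tn):
    """Compute the metrics of Equations (4)-(9) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # equals the True Positive Rate (TPR)
    f1 = 2 * precision * recall / (precision + recall)
    tnr = tn / (tn + fp)
    g_mean = np.sqrt(recall * tnr)
    return {"precision": precision, "recall": recall, "F1": f1, "G-mean": g_mean}

# Invented counts: the classifier labels almost everything as majority, so raw accuracy
# is 94.5% while the minority-focused scores collapse.
print(imbalance_metrics(tp=5, fp=10, fn=45, tn=940))
```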

A g-performance measure was recommended for evaluating classification performance of imbalanced datasets in
Reference 17, which is defined as the geometric mean of the percentages of majority and minority samples that have
been correctly classified. The values of percentage accuracy of correct classification are often misleading for imbalanced
datasets, since the minority samples are few in number, and even when all the minority samples are misclassified as
majority, the percentages may still be uncomfortably high due to the entire majority population being correctly classified.
Following this cue, research in Reference 79 focuses on devising new metrics that measure the state of imbalance in a
more interpretable manner. The new metrics are Individual Bayes Imbalance Impact Index (IBI3 ) at instance level, and
Bayes Imbalance Impact Index (BI3 ) at a data level. These measures reflect the extent of imbalance from the perspective
of the minority population with respect to the whole dataset. While IBI3 was found to increase consistently with increases in accuracy, BI3 was found to be consistent with variations in the F1-score.

TABLE 2 Some popular datasets, their Imbalance Ratios (IR), and reported results

Dataset: Abalone58
Minority class: Class "18"; Majority class: Class "9"; Minority samples: 42; Majority samples: 688; Imbalance ratio (IR): 16.38
Reported results (method, score, classifier): MWMOTE (F1-score: 0.394, Neural Network); ADASYN (F1-score: 0.298, Neural Network); SMOTE (F1-score: 0.441, Neural Network); RAMOBoost (F1-score: 0.439, Neural Network); MWMOTE (F1-score: 0.608, k-Nearest Neighbor); ADASYN (F1-score: 0.440, k-Nearest Neighbor); SMOTE (F1-score: 0.501, k-Nearest Neighbor); RAMOBoost (F1-score: 0.662, k-Nearest Neighbor); MWMOTE (F1-score: 0.357, Decision tree); ADASYN (F1-score: 0.360, Decision tree); SMOTE (F1-score: 0.306, Decision tree); RAMOBoost (F1-score: 0.281, Decision tree)

Dataset: Yeast26
Minority class: MIT; Majority class: All other classes; Minority samples: 244; Majority samples: 1240; Imbalance ratio (IR): 5.08
Reported results: SMOTE (F1-score: 0.573, Decision tree); Borderline SMOTE (F1-score: 0.527, Decision tree); SMOTE-RSB (F1-score: 0.5577, Decision tree); NRSBoundary SMOTE (F1-score: 0.574, Decision tree)

Dataset: Haberman16
Minority class: Class "Die"; Majority class: Class "Survive"; Minority samples: 81; Majority samples: 221; Imbalance ratio (IR): 2.72
Reported results: RUS (AUC: 0.684, Decision tree); CNN (AUC: 0.583, Decision tree); TOMEK (AUC: 0.696, Decision tree); OSS (AUC: 0.620, Decision tree); SMOTE (AUC: 0.754, Decision tree); SMOTE+TOMEK (AUC: 0.784, Decision tree); SMOTE+ENN (AUC: 0.770, Decision tree)

Dataset: Pima Diabetes58
Minority class: Class "1"; Majority class: Class "0"; Minority samples: 268; Majority samples: 500; Imbalance ratio (IR): 1.865
Reported results: MWMOTE (F1-score: 0.679, Neural Network); ADASYN (F1-score: 0.683, Neural Network); SMOTE (F1-score: 0.668, Neural Network); RAMOBoost (F1-score: 0.660, Neural Network); MWMOTE (F1-score: 0.621, k-Nearest Neighbor); ADASYN (F1-score: 0.616, k-Nearest Neighbor); SMOTE (F1-score: 0.636, k-Nearest Neighbor); RAMOBoost (F1-score: 0.628, k-Nearest Neighbor); MWMOTE (F1-score: 0.600, Decision tree); ADASYN (F1-score: 0.561, Decision tree); SMOTE (F1-score: 0.582, Decision tree); RAMOBoost (F1-score: 0.578, Decision tree)

Dataset: Statlog (vehicle silhouette)58
Minority class: Class "Van"; Majority class: All other classes; Minority samples: 226; Majority samples: 720; Imbalance ratio (IR): 3.18
Reported results: MWMOTE (F1-score: 0.920, Neural Network); ADASYN (F1-score: 0.857, Neural Network); SMOTE (F1-score: 0.917, Neural Network); RAMOBoost (F1-score: 0.916, Neural Network); MWMOTE (F1-score: 0.867, k-Nearest Neighbor); ADASYN (F1-score: 0.856, k-Nearest Neighbor); SMOTE (F1-score: 0.864, k-Nearest Neighbor); RAMOBoost (F1-score: 0.854, k-Nearest Neighbor); MWMOTE (F1-score: 0.898, Decision tree); ADASYN (F1-score: 0.896, Decision tree); SMOTE (F1-score: 0.890, Decision tree); RAMOBoost (F1-score: 0.883, Decision tree)

Dataset: Statlog (vehicle silhouette)29
Minority class: Class "Van"; Majority class: All other classes; Minority samples: 226; Majority samples: 720; Imbalance ratio (IR): 3.18
Reported results: Random oversampling (F1-score: 0.955, SVM); SMOTE (F1-score: 0.953, SVM); Borderline SMOTE (F1-score: 0.959, SVM); Safe-level SMOTE (F1-score: 0.954, SVM); Cluster-SMOTE (F1-score: 0.858, SVM); MWMOTE (F1-score: 0.948, SVM); CBOS (F1-score: 0.955, SVM); SBC (F1-score: 0.917, SVM); A-SUWO (F1-score: 0.961, SVM)

5.3 Some popular datasets for imbalanced learning and reported results

Two popular repositories for smaller sized real-world datasets, or toy datasets as they are sometimes called, are the UCI repository80 and the online platform Kaggle. Examples of smaller sized real-world datasets popularly used by researchers for implementing imbalanced learning algorithms include Yeast (Gao et al,23 Hu and Li,26 García and Herrera25), Ionosphere (Hu and Li26), Abalone (Barua et al,58 García and Herrera25), Pima Diabetes (Jo and Japkowicz,12 Barua et al,58 Gao et al,23 García and Herrera,25 Hu and Li26), Wisconsin Breast Cancer (Jo and Japkowicz,12 Pengfei et al22), and Haberman (Batista et al,16 Bunkhumpornpat et al,20 Gao et al,23 García and Herrera,25 Hu and Li26), to name a few. Pima Diabetes and Haberman are observed to be the most popular choices among researchers. Kaggle also contains Big Data examples such as "The Nature Conservancy Fisheries Monitoring", "European Soccer database", and "Intel & MobileODT Cervical Cancer Screening", which are imbalanced datasets with uneven class distributions. The KEEL data repository75 contains imbalanced datasets with imbalance ratios ranging from 1.5 to 9, with each dataset partitioned using 5-fold stratified cross-validation.
Datasets with binary classes are preferred by researchers for experiments on imbalanced learning. The reason is that most learning algorithms are written from the perspective of the inter-relationship of majority and minority class samples and their distances from the decision boundary. A single decision boundary is convenient to consider, whereas multiple decision boundaries, as in the case of real-world datasets, only complicate the algorithm and exponentially increase the computational complexity. Multi-class datasets are thus converted by most researchers to the binary class form, either by considering any two differently populated classes, or by grouping classes into two main categories, namely majority and minority, by relatively comparing the class populations. The imbalance ratio (IR) is the ratio of the majority class population to the minority class population, and its value indicates the extent of the class-imbalance problem for different datasets.30 The IRs of some popular datasets used for imbalanced learning in data mining are shown in Table 2 along with reported results; these IR values range from 1.86 to 57.63. Two sets of results extracted from two independent works are shown for the Statlog (vehicle silhouette) dataset to highlight the variety in the choices of classifiers and baseline approaches. The F1-score and AUC performance measures are the most popular choices, as evident from the readings in Table 2. Among all methods, RUS, SMOTE, MWMOTE, and Borderline SMOTE appear to be the most popular baseline approaches. The scores of SMOTE are observed to be consistently high. The decision tree classifier is observed to be the most popular choice of classifier, followed by SVM. The k-Nearest Neighbor and 1-Nearest Neighbor classifiers perform well in Euclidean space and are sometimes used by researchers to highlight the worst-case scenario.
However, a high IR may not necessarily mean poor performance on the minority class, since much depends on the decision boundary between the majority and minority classes; a well-defined decision boundary yields high performance for both classes despite the minority samples being present in small numbers.
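To make the binarization convention and the IR computation concrete, a minimal Python sketch is given below. The class counts mimic the Statlog (vehicle silhouette) entry of Table 2, but the label list is constructed by hand purely for illustration; in practice the labels would be loaded from a repository such as UCI or KEEL.

```python
# Sketch: binarizing a multi-class label vector and computing the imbalance ratio (IR).
# The class counts imitate the Statlog (vehicle silhouette) entry of Table 2.
from collections import Counter

labels = ["van"] * 226 + ["bus"] * 240 + ["saab"] * 240 + ["opel"] * 240

# Treat "van" as the minority class and pool the remaining classes into the majority.
binary = ["minority" if label == "van" else "majority" for label in labels]

counts = Counter(binary)
imbalance_ratio = counts["majority"] / counts["minority"]
print(counts)                          # Counter({'majority': 720, 'minority': 226})
print(f"IR = {imbalance_ratio:.3f}")   # 720 / 226 = 3.186 (reported as 3.18 in Table 2)
```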

6 FUTURE DIRECTIONS, APPLICATIONS, AND CHALLENGES

This section surveys the current challenges faced by data mining researchers working with imbalanced datasets and iden-
tifies future directions of research in imbalanced learning. We especially draw attention to open-ended research problems
and potential application areas that are yet to be explored from the perspective of imbalanced learning.

6.1 Exploration of classifier models

The response of individual classifiers needs to be studied under the impact of skewed class-distributions. An investigative
study is conducted in Reference 81 to determine the effect of class-imbalance on various classifier models. It was found that, among all classifiers, SVM is the least affected by the imbalanced class distribution, and that oversampling tends to worsen the performance for most classifiers. On the other hand, the C4.5 decision tree classifier is the most affected by the imbalance problem and responds well to imbalance treatment strategies such as sampling. The survey by Branco et al82 investigates the side-effects of imbalance conditions in classification and regression tasks in predictive analytics. The study suggests adapting the standard evaluation metrics to focus on the rare-case examples during training, since these are more class-informative. Investigators could follow the cue of Zhang et al,83 in which a thorough investigation on 30 real-world datasets using six classifiers was conducted to demonstrate the superiority of their SMOTE extension over SMOTE. In
a recent work, Raghuwanshi and Shukla84 applied SMOTE for oversampling the minority class with the class-specific
Extreme Learning Machine (ELM) as the classifier. The overall product is presented as SMOTE-CSELM, a new classifier
that handles class-imbalance using SMOTE and has class-specific regularization parameters. This inspires ideas for mod-
ification of existing machine learning algorithms to integrate class-specific imbalance treatment and to present the result as a new generation of classifier models.
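The kind of comparative study described above can be prototyped in a few lines with the imbalanced-learn toolbox;74 the sketch below contrasts a decision tree trained on the raw imbalanced data with the same model trained after SMOTE. The synthetic dataset and all parameter values are assumptions chosen only for illustration, not a reproduction of any of the cited experiments.

```python
# Sketch: measuring how one classifier responds to SMOTE-based balance correction.
# Dataset and parameters are illustrative; any imbalanced binary dataset can be substituted.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Baseline: train directly on the skewed distribution.
baseline = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Treatment: oversample the minority class of the training split only.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
treated = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)

print("F1 without SMOTE:", round(f1_score(y_te, baseline.predict(X_te)), 3))
print("F1 with SMOTE:   ", round(f1_score(y_te, treated.predict(X_te)), 3))
```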

6.2 Handling multi-class imbalance

Another important open research problem is that of multi-class imbalance.60 Most of the solutions offered so far are
for binary class datasets where the majority vs minority class boundary is neatly defined.85 Experiments in Reference
14 proved that, in most cases, these solutions were ineffective for handling imbalance in multi-class datasets. In the multi-class case, a divide-and-conquer scheme is still the norm, solved by dividing the multi-class problem into several two-class problems and aggregating the results of the individual classifiers.6 The detection of rare events is also an unex-
plored area from the perspective of class-imbalance.86 The performance of the sampling algorithm is heavily dependent
on the definitions of class regions for multi-class datasets. Future works need to do away with the concept of binarized
classes and need to work out a standard protocol for applying resampling strategies in a multi-class scenario without
redefining the class boundaries.
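A minimal sketch of the divide-and-conquer convention described above is given below: each class is separated from the rest, every binary subproblem is rebalanced independently, and the per-class decision scores are aggregated. The dataset, classifier, and resampler used here are illustrative assumptions rather than a prescription.

```python
# Sketch of the divide-and-conquer convention for multi-class imbalance:
# one-vs-rest decomposition, per-subproblem rebalancing, score-based aggregation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=1)

models = {}
for cls in np.unique(y):
    y_bin = (y == cls).astype(int)                      # one-vs-rest binarization
    X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, y_bin)
    models[cls] = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Aggregate: predict the class whose one-vs-rest model is most confident.
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])
y_pred = np.array(list(models.keys()))[scores.argmax(axis=1)]
print("training-set agreement:", (y_pred == y).mean())
```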

6.3 Open research problems and future application areas

An insightful survey solely dedicated to addressing open research issues and challenges for imbalanced datasets is presented in Reference 87. The survey identifies Big Data analytics, social media, and computer vision as vital application areas. The open research issues yet to be addressed from the perspective of imbalanced learning, as per the authors, are regression, learning from non-stationary data streams, and multi-class problems. The survey in Reference 88 draws attention to a number of factors that aggravate the class-imbalance problem, such as the existence of small disjuncts within a class, small sample sizes, and class overlap. The application of imbalanced learning algorithms covers a wide spectrum of research. Text classification is a research area that has very few works devoted to correcting the class-imbalance.89 Class-imbalance occurs when the number of text samples from a particular category is significantly smaller than that of the other categories. The problem is often left unaddressed, and most researchers ignore the class-imbalance issue in text-based analysis. The problem was noted in a recent work on Twitter-based sentiment analysis, where varying distributions of positive, negative, and neutral tweets in Twitter datasets cast an ambiguity on the applicability of different sentiment lexicons such as SentiWordNet, Afinn, and Vader for the fuzzy classification process.90 A bidirectional sampling method based on the k-means clustering algorithm was proposed in Reference 91 to handle the class-imbalance in a text-based corpus; SMOTE was applied for oversampling the minority cluster obtained by the k-means algorithm (a loose sketch is given at the end of this subsection). Bioinformatics projects that involve RNA and DNA identification, protein localization, prediction, and profiling often tackle imbalanced datasets when comparing human data with that of other species.39 The applicability of resampling strategies for the bioinformatics field still needs to be explored in its entirety.
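The sketch below follows the general idea of such a bidirectional strategy for text data, that is, k-means-guided undersampling of the majority class followed by SMOTE on the minority class, in the spirit of Reference 91; the tiny corpus, cluster count, and neighbor settings are invented purely for illustration and do not reproduce that work.

```python
# Loose sketch of bidirectional sampling for an imbalanced text corpus:
# TF-IDF vectorization, cluster-centre-based undersampling of the majority class,
# then SMOTE oversampling of the minority class. Corpus and parameters are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

majority_docs = [f"regular news story number {i} about sports and weather" for i in range(40)]
minority_docs = [f"rare fraud alert message number {i} about suspicious payment" for i in range(8)]
y = np.array([0] * 40 + [1] * 8)

X = TfidfVectorizer().fit_transform(majority_docs + minority_docs).toarray()

# Undersample the majority class: keep the document closest to each cluster centre.
X_maj = X[y == 0]
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_maj)
keep = [np.argmin(np.linalg.norm(X_maj - c, axis=1)) for c in km.cluster_centers_]
X_reduced = np.vstack([X_maj[keep], X[y == 1]])
y_reduced = np.array([0] * len(keep) + [1] * int((y == 1).sum()))

# Oversample the minority class with SMOTE to even out the remaining gap.
X_bal, y_bal = SMOTE(k_neighbors=3, random_state=0).fit_resample(X_reduced, y_reduced)
print("before:", np.bincount(y), "after:", np.bincount(y_bal))
```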

6.4 Challenges in computer vision

In computer vision, deep learning is the current trend for learning from huge image databases.92 The AlexNet pre-
trained network, for instance, is trained on the 1000-class ImageNet database.93 Addressing class-imbalance in large-scale computer vision problems needs immediate attention, considering the information boom in the surveillance sector that is now an inherent part of our society. Imbalanced learning algorithms specific to the field of computer vision need to be devised, since resampling strategies are not practical in situations where the computational complexity is high, the number of classes is large, and the class-distribution profile is extremely uneven, as is the case with most
real-world vision tasks. Data augmentation of the minority class was explored in Reference 94 as a simple but effec-
tive solution to correct class-imbalance in Benign vs Malignant cancer classification for the BREAKHIS cancer dataset.
In computer vision, data augmentation is equivalent to oversampling. The term pertains to the synthetic generation
of new images by affine transformation (scaling, rotation, horizontal/vertical flips and shifts, shear, and so on) of the
original images in the dataset. In Reference 95, deep learning, data augmentation, and Bag-of-Visual Words repre-
sentation are analyzed with respect to their ability to handle class-imbalance. It was noted that as compared to the
other approaches, deep convolutional neural networks have inherent properties that naturally tackle class-imbalance
in an effective manner. A similar observation was made in Reference 96 in which a new cost function was proposed to
improve the response of deep learning networks to class-imbalance. The deterioration of classification performance due
to the heavy class-imbalance in the LFW facial image dataset containing 1680 classes is highlighted in Reference 97. While some celebrities have only one or two images, others have hundreds of photographs due to the easy availability of images of these celebrities, leading to a biased decision curve. A very interesting study on the effect of class-imbalance
on Convolutional Neural Networks in Reference 98 concludes that oversampling is the best solution for balancing
the dataset and that application of undersampling depends on the degree of imbalance. The behavior of deep con-
volutional networks needs to be investigated in more depth for imbalanced learning, both for networks trained from
scratch and the pre-trained networks that are trained on millions of images and fine-tuned on smaller (and imbalanced)
datasets.
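A minimal sketch of minority-class augmentation is given below, assuming the Keras ImageDataGenerator API; the transformation ranges, image sizes, and target class size are illustrative assumptions and not the settings used in References 94-98.

```python
# Sketch: oversampling a minority image class by affine data augmentation.
# Transformation ranges, array shapes, and the target size are illustrative assumptions.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in for the minority-class images (e.g., malignant samples), 32x32 RGB.
minority_images = np.random.rand(50, 32, 32, 3).astype("float32")

augmenter = ImageDataGenerator(rotation_range=20,       # rotation
                               width_shift_range=0.1,   # horizontal shift
                               height_shift_range=0.1,  # vertical shift
                               shear_range=0.1,         # shear
                               zoom_range=0.1,          # scaling
                               horizontal_flip=True)    # flips

# Generate synthetic minority samples until the class reaches the desired size.
target, synthetic = 200, []
flow = augmenter.flow(minority_images, batch_size=50, shuffle=True)
while len(minority_images) + sum(len(b) for b in synthetic) < target:
    synthetic.append(next(flow))

balanced_minority = np.concatenate([minority_images] + synthetic, axis=0)[:target]
print("minority class size after augmentation:", len(balanced_minority))
```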

6.5 Challenges in big data applications

Another field where class-imbalance problems spike to mammoth dimensions is that of Big Data, comprising data of huge Volume, Velocity, Variety, and Variability, especially the confluence of multimodal sensor IoT data in the backdrop of the development of smart cities.99 Most of the research involving new sampling algorithms and balancing techniques focuses on datasets prevalent in data mining, such as those available from the UCI repository,80 some of which are listed in
Table 2 and covered in Section 5.3. An immediate challenge to be addressed by researchers tackling imbalanced learn-
ing is the applicability of the existing algorithms on real-world data such as large image, text and video databases, and
sensor-based Big Data applications for Internet of Things (IoT). The continuous stream of multimedia data from elec-
tronic surveillance systems needs to be processed in real-time. The volume, velocity, and variety offered by such huge
datasets are a challenge to the conventional machine learning algorithms that are mostly trained on toy datasets limited to a few thousand samples.100 However, to handle large-scale data containing millions of samples, a framework
like MapReduce is required to distribute the computing among smaller computing elements. A recent study in Refer-
ence 101 advocated the distributed procedure followed in the MapReduce framework to aid in tackling class-imbalance
in Big Data as compared to the conventional solutions that seem to fail for Big Data applications. Distributed evolution-
ary undersampling is explored in Reference 102 for Big Data. In this scheme, evolutionary undersampling is performed
on smaller subsets of the training data using MapReduce, and each undersampled subset is trained on a decision tree
classifier. An ensemble approach aggregates the outputs of the individual classifiers. Experiments on Big Data were suc-
cessfully conducted in Reference 103 for achieving resampling with MapReduce and classification using the Random
Forest classifier. Oversampling, undersampling, and cost-sensitive learning were investigated, and the studies concluded
that when the number of mappers is increased, the processing time decreases at the cost of a slight drop in accu-
racy. A GPU-based implementation of SMOTE achieved good results in Reference 104 that paves the way for large-scale
computing machines for sampling Big Data. Information filtering is essential under such circumstances with a clear
understanding of which samples are significant and are to be retained. Han et al105 recently confirmed this theory by
introducing a distribution-sensitive undersampling for IoT datasets that retained the majority samples near the decision
boundary and deleted noisy and stable majority samples. The minority samples near the boundary were replicated for
achieving oversampling.
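As a simplified, single-machine stand-in for this distributed procedure, the sketch below partitions the training data, undersamples and fits a local model on each partition (the "map" step), and aggregates the local predictions by voting (the "reduce" step). It only mimics the idea on toy data and is not the MapReduce implementation of References 101-103.

```python
# Sketch of MapReduce-style imbalance handling on a single machine:
# map = undersample and train on each data partition, reduce = majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=20000, weights=[0.97, 0.03], random_state=7)

def map_step(X_part, y_part):
    """Undersample one partition and fit a local decision tree."""
    X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_part, y_part)
    return DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)

# Split the data into partitions, as a MapReduce framework would distribute it.
partitions = np.array_split(np.arange(len(y)), 8)
models = [map_step(X[idx], y[idx]) for idx in partitions]

def reduce_step(models, X_new):
    """Aggregate the local models by unweighted majority voting."""
    votes = np.stack([m.predict(X_new) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

print("ensemble positive rate:", reduce_step(models, X[:1000]).mean())
```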

6.6 Imbalance in streaming data

One topic that warrants further research is the co-occurrence of class-imbalance with other problems encountered
in the learning phase. A pioneering work in this regard is that of Wang et al106 who investigated the co-occurrence of
class-imbalance with concept drift in online learning. Concept drift is the problem of dynamically shifting probability
distributions associated with non-stationary data streams. Application of the conventional remedies for concept drift may
aggravate the prevailing class-imbalance. It would also be difficult to attribute the large classification error to either cause.
A thorough experimental study in Reference 106 concluded that imbalance remedies such as oversampling enhance
performance but interfere with drift detection. Research on real-time learning from streaming data is still evolving. The issue is complicated since class definitions change dynamically over time, as demonstrated for the XML data streams
in Reference 107. The circumstance (concept drift) is tackled by Brzezinski and Piernik through incrementally mined
subtrees and a suitable similarity measure. One of the earlier works that tackles population drift and changing class
dynamics is that of Kelly et al.108 The paper identifies three types of population drift: those associated with the class priors, the class-conditional distributions, and the posterior class probabilities, out of which a change in the posterior class probabilities is the most indicative.
The classification rule is updated by including the effect of new data points and down-weighting the effect of the older
points. An advisable approach to tackle real-time analytics is the online updating of the learning model as fresh data is
streamed online, rather than training a new model from scratch every time a fresh batch of data is released.109 In this
spirit, a low-dimensional linear subspace was estimated in Reference 110 as a low-rank matrix completion problem that
reconstructed the input signal as a byproduct. This method incorporates missing and noisy entries in the data stream.
Skewed class distribution in data streams is handled by importance sampling in Reference 111. The majority class, in
this case, is undersampled with replacement while minority instances are generated as per importance weights attached
to specific feature groups. The group weighting scheme is dynamic in nature and evolves with time over shifting time
windows across the data stream. Class-imbalance for online learning is explored in Reference 112 for the multi-class
scenario. A weighted online sequential ELM with kernel was proposed for the purpose, which learns either in a one-by-one or a chunk-by-chunk manner. Future research in devising new sampling strategies will thus involve high-performance computing machines with cluster computing for high-volume multimodal data, and real-time management of streaming data with shifting class priors.
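A toy sketch of window-wise rebalancing on a drifting stream is given below; the stream generator, the drift schedule, and the choice of undersampler and incremental classifier are all hypothetical and intended only to illustrate how resampling can be confined to the current window before each incremental model update.

```python
# Sketch: rebalancing a non-stationary stream over arriving windows.
# Within each window the majority class is randomly undersampled and the
# model is updated incrementally. Stream, drift schedule, and window size
# are hypothetical.
import numpy as np
from sklearn.linear_model import SGDClassifier
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)          # incremental linear classifier
window, classes = 500, np.array([0, 1])

for t in range(10):                            # ten arriving windows
    minority_prior = 0.05 + 0.02 * t           # class prior drifts upward over time
    y_win = (rng.random(window) < minority_prior).astype(int)
    X_win = rng.normal(loc=y_win[:, None] * 2.0, scale=1.0, size=(window, 5))

    # Rebalance only the current window before the incremental update.
    X_bal, y_bal = RandomUnderSampler(random_state=t).fit_resample(X_win, y_win)
    model.partial_fit(X_bal, y_bal, classes=classes)

    print(f"window {t}: minority prior {y_win.mean():.3f}, "
          f"trained on {len(y_bal)} balanced samples")
```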

7 CONCLUSIONS

Imbalance in class distribution in modern-day real-world datasets is a common and urgent problem affecting researchers
working in all realms of pattern recognition and machine learning. The issue is overlooked by most researchers, even though the learning performance is known to improve considerably if some form of balancing trick is applied to correct the class-distribution prior to the learning phase. This paper reviews the state-of-the-art balancing strategies,
specifically sampling that transforms the data space to create a balanced training distribution prior to the learning phase.
The survey covers basic concepts, implementation details, and future challenges related to the state-of-the-art in resam-
pling. Undersampling the majority class is a good balancing strategy from the perspective of the low computational cost
involved while learning from large datasets. However, it may lead to the loss of useful majority training instances that might
have turned out to be crucial to decision-making. Oversampling the minority class is also a popular and simple balanc-
ing trick that however adds noise and involves unnecessary duplication of samples and sometimes leads to overfitting.
It is possible to minimize the side-effects of sampling by incorporating intelligent procedures for pruning the major-
ity and minority classes. This takes the form of intelligent data reduction by minimizing the loss of majority samples
with class discriminatory information, and intelligent generation of new minority samples that add variety, in limits,
to the existing minority population. In addition to a discussion on the strategy, the survey also enumerates the intelli-
gent sampling procedures based on their performance, popularity, and ease of implementation. SMOTE and Borderline
SMOTE are considered to be the baseline approaches by most researchers, achieving consistent classification scores in
most experiments. Most of the intelligent oversampling strategies like Borderline SMOTE focus on the region near the
decision boundary. However, they were found to lack a clearly defined objective function and a systematic way of gen-
erating the synthetic samples. A theoretical justification is lacking in most methods regarding the definitive gain in
performance achieved by resampling. The performance depends a lot on the definitions of majority and minority classes.
Multi-class imbalance correction procedures are severely lacking since most works include only binary class datasets, and
if multi-class datasets were incorporated then they were first converted to binary class distributions. Also, most meth-
ods do not address the issue of drop in performance of the majority class due to addition of synthetic samples from the
minority class. It was also noted that most researchers tend to experiment with the same set of UCI datasets such as Pima Diabetes and Haberman, and future research needs to involve a greater variety of datasets extracted from real-world applications. Most of these datasets are limited to a few thousand samples. The need of the hour is to adapt the existing algorithms to Big Data applications. Especially due to the recent emergence of the Internet-of-Things (IoT), the availability of large real-time datasets collected from various sensors has become a reality. The
current new crop of hybrid methods that involve mutually aware conditioned treatment of the majority and minority
classes, form the foundation for future research in imbalanced learning. The applications of this essential methodology
need to be successfully explored and implemented across all realms of real-world pattern recognition problems, especially
IoT multimodal sensor data, Big Data, and Computer Vision problems that involve thousands of image classes. New
application areas need to be explored such as imbalance in text mining, real-time streaming data, web mining, bioin-
formatics, multimodal sensor fusion in IoT, and mining network traffic. Existing machine learning algorithms can be
remodeled to incorporate class-specific balancing strategies, as in the most recent work on ELMs that incorporated SMOTE. Very little information is available on how the current pre-trained deep networks, trained on millions of images, will treat imbalanced data while performing transfer learning. Further ambiguity surrounds deep CNNs that are trained from scratch directly on imbalanced datasets. Additional experiments need to be per-
formed to investigate the development of advanced deep learning modules having integrated class-specific resampling
strategies.

PEER REVIEW INFORMATION


Engineering Reports thanks Petros Xanthopoulos and other anonymous reviewers for their contribution to the peer review
of this work.

CONFLICT OF INTEREST
The authors declare no potential conflict of interest.

AUTHOR CONTRIBUTIONS
Seba Susan: Conceptualization; formal analysis; investigation; methodology; project administration; supervision; visu-
alization; writing-original draft; writing-review and editing. Amitesh Kumar: Conceptualization; data curation; formal
analysis; investigation; methodology; resources; software; validation; visualization.

DATA AVAILABILITY STATEMENT


Data sharing is not applicable to this article as no new data were created or analyzed in this study.

ORCID
Seba Susan https://orcid.org/0000-0002-6709-6591

REFERENCES
1. Wu G, Chang EY. KBA: kernel boundary alignment considering imbalanced data distribution. IEEE Trans Knowl Data Eng.
2005;6:786-795.
2. Fernández A, García S, Galar M, Prati RC, Krawczyk B, Herrera F. Learning from Imbalanced Data Sets. Berlin, Germany: Springer; 2018.
3. Japkowicz N. The class imbalance problem: significance and strategies. Paper presented at: Proc. of the Int’l Conf. on Artificial
Intelligence; 2000.
4. Cheeseman P, Self M, Kelly J, Taylor W, Freeman D, Stutz J. Bayesian classification. Proceedings of the Seventh AAAI National Conference
on Artificial Intelligence. Vol 88. AAAI press; 1988:607-611. https://dl.acm.org/doi/abs/10.5555/2887965.2888073.
5. Liao C, Wang W, Sakurada K, Kawaguchi N. Image-matching based identification of store signage using web-crawled information. IEEE
Access. 2018;6:45590-45605.
6. Sun Y, Wong AKC, Kamel MS. Classification of imbalanced data: a review. Int J Pattern Recogn Artif Intell. 2009;23(04):687-719.
7. Borges LR. Analysis of the Wisconsin breast Cancer dataset and machine learning for breast Cancer detection. Group. 1989;1(369):15–19.
8. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: a large-scale hierarchical image database; 2009:248-255.
9. Susan S, Kumar A. DST-ML-EkNN: data space transformation with metric learning and elite k-nearest neighbor cluster formation for
classification of imbalanced datasets. Advances in Artificial Intelligence and Data Engineering. Singapore: Springer; 2020:319-328.
10. Susan S, Kaushik A. Weakly supervised metric learning with majority classes for large imbalanced image dataset. In Proceedings of the
2020 the 4th International Conference on Big Data and Internet of Things; 2020:16-19.
11. Longadge R, Dongre S. Class imbalance problem in data mining review; 2013. arXiv preprint arXiv:1305.1707.
12. Jo J, Japkowicz N. Class imbalances versus small disjuncts. ACM Sigkdd Explor Newsletter. 2004;6(1):40-49.
13. Susan S, Sethi D, Arora K. CW-CAE: pulmonary nodule detection from imbalanced dataset using class-weighted convolutional autoen-
coder. Paper presented at: International Conference on innovative computing and communications; 2020; Singapore: Springer:825-833.
14. Zhou Z-H, Liu X-Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl
Data Eng. 2005;18(1):63-77.
15. Tomek I. Two modifications of CNN. IEEE Trans Syst Man Cybern. 1976;SMC-6(11):769-772.
16. Batista GE, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD
Explor Newsletter. 2004;6(1):20-29.
17. Kubat M, Matwin S. Addressing the curse of imbalanced training sets: one-sided selection. Proc. Fourteenth Intl. Conf. on Machine
Learning (ICML). Vol 97. United States: ICML; 1997:179-186.
18. Cieslak DA, Chawla NV, Striegel A. Combating imbalance in network intrusion datasets. Paper presented at: GrC; 2006:732-737.
19. Yen S-J, Lee Y-S. Cluster-based under-sampling approaches for imbalanced data distributions. Exp Syst Appl. 2009;36(3):5718-5727.
20. Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C. Safe-level-smote: safe-level-synthetic minority over-sampling technique for han-
dling the class imbalanced problem. Paper presented at: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2009;
Berlin/Heidelberg, Germany: Springer:475-482.
21. Cateni S, Colla V, Vannucci M. A method for resampling imbalanced datasets in binary classification tasks for real-world problems.
Neurocomputing. 2014;135:32-41.
22. Pengfei J, Chunkai Z, He Z. A new sampling approach for classification of imbalanced data sets with high density. Paper presented at:
2014 International Conference on Big Data and Smart Computing (BIGCOMP); 2014: IEEE:217–222.
23. Gao M, Hong X, Chen S, Harris CJ, Khalaf E. PDFOS: PDF estimation based over-sampling for imbalanced two-class problems.
Neurocomputing. 2014;138:248-259.
24. Huang Z, Yang C, Chen X, Huang K, Xie Y. Adaptive over-sampling method for classification with application to imbalanced datasets in
aluminum electrolysis. Neural Comput Appl. 2020;32:7183-7199. https://doi.org/10.1007/s00521-019-04208-7.
25. García S, Herrera F. Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput.
2009;17(3):275-306.
26. Hu F, Li H. A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE. Math Probl Eng.
2013;2013:1-10.
27. Rao KN, Reddy CS. A novel under sampling strategy for efficient software defect analysis of skewed distributed data. Evol Syst.
2020;11(1):119-131.
28. Koto F. SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: an enhancement strategy to handle imbalance in data level. Paper presented
at: 2014 International Conference on Advanced Computer Science and Information System; 2014: IEEE:280-284.
29. Nekooeimehr I, Lai-Yuen SK. Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Exp Syst Appl.
2016;46:405-416.
30. Susan S, Kumar A. Learning data space transformation matrix from pruned imbalanced datasets for nearest neighbor classification.
Paper presented at: 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th Inter-
national Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS); 2019:
IEEE:2831-2838.
31. Elyan E, Moreno-Garcia CF, Jayne C. CDSMOTE: class decomposition and synthetic minority class oversampling technique for
imbalanced-data classification. Neural Comput Appl. 2020;1-13. https://doi.org/10.1007/s00521-020-05130-z.
32. Lin W-C, Tsai C-F, Hu Y-H, Jhang J-S. Clustering-based undersampling in class-imbalanced data. Inform Sci. 2017;409:17-26.
33. Mani I, Zhang I. kNN approach to unbalanced data distributions: a case study involving information extraction. Proceedings of Workshop
on Learning from Imbalanced Datasets. Vol 126. United States: ICML; 2003.
34. Liu X-Y, Wu J, Zhou Z-H. Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern B Cybern.
2009;39(2):539-550.
35. Bader-El-Den M, Teitei E, Perry T. Biased random forest for dealing with the class imbalance problem. IEEE Trans Neural Netw Learn
Syst. 2019;30(7):2163-2172.
36. Dumpala SH, Chakraborty R, Kopparapu SK, Reseach TC. A novel data representation for effective learning in class imbalanced scenarios.
Paper presented at: IJCAI; 2018:2100-2106.
37. Laurikkala J. Improving identification of difficult small classes by balancing class distribution. Paper presented at: Conference on
Artificial Intelligence in Medicine in Europe; 2001; Berlin/Heidelberg, Germany: Springer:63-66.
38. Wilson DL. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern. 1972;3:408-421.
39. Yang P, Yoo PD, Fernando J, Zhou BB, Zhang Z, Zomaya AY. Sample subset optimization techniques for imbalanced and ensemble
learning problems in bioinformatics applications. IEEE Trans Cybernet. 2014;44(3):445-455.
40. Kennedy J, Eberhart R. Particle swarm optimization (PSO). Paper presented at: Proceedings of the IEEE International Conference on
Neural Networks; 1995; Perth, Australia; vol. 4:1942-1948.
41. Goldberg DE. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley; 1989.
42. Kim H-J, Jo N-O, Shin K-S. Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate
bankruptcy prediction. Exp Syst Appl. 2016;59:226-234.
43. Yu H, Ni J, Zhao J. ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray
data. Neurocomputing. 2013;101:309-318.
44. Huang H, Hu Q, Hao Z, Lim A. Example-based learning particle swarm optimization for continuous optimization. Inform Sci.
2012;182(1):125-138.
45. Alhakbani H, Al-Rifaie M. Handling class imbalance using swarm intelligence techniques, hybrid data and algorithmic level solutions [PhD thesis]. London, UK: Goldsmiths, University of London; 2018.
46. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst
Man Cybernet A Syst Humans. 2010;40(1):185-197.
47. Ha TM, Bunke H. Off-line, handwritten numeral recognition by perturbation method. IEEE Trans Pattern Anal Mach Intell.
1997;19(5):535-539.
48. Chawla NV, Bowyer KW, Hall LO, Philip Kegelmeyer W. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res.
2002;16:321-357.
49. Fernández A, Garcia S, Herrera F, Chawla NV. Smote for learning from imbalanced data: progress and challenges, marking the 15-year
anniversary. J Artif Intell Res. 2018;61:863-905.
50. Chawla NV, Lazarevic A, Hall LO, Bowyer KW. SMOTEBoost: improving prediction of the minority class in boosting. Paper presented
at: European conference on principles of data mining and knowledge discovery; 2003; Berlin/Heidelberg, Germany: Springer:107-119.
51. Guo H, Viktor HL. Learning from imbalanced data sets with boosting and data generation: the databoost-im approach. ACM Sigkdd
Explor Newsletter. 2004;6(1):30-39.
52. Chen S, He H, Garcia EA. RAMOBoost: ranked minority oversampling in boosting. IEEE Trans Neural Netw. 2010;21(10):1624-1642.
53. Cao H, Li X-L, Woon DY-K, Ng S-K. Integrated oversampling for imbalanced time series classification. IEEE Trans Knowl Data Eng.
2013;25(12):2809-2822.
54. Abdi L, Hashemi S. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans Knowl Data Eng.
2015;28(1):238-251.
55. Han H, Wang WY, Mao BH. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Paper presented at:
International conference on intelligent computing; 2005; Berlin/Heidelberg, Germany: Springer:878–887.
56. He H, Bai Y, Garcia EA, Li S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. Paper presented at: 2008 IEEE
International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence); 2008: IEEE:1322-1328.
57. Maciejewski T, Stefanowski J. Local neighbourhood extension of SMOTE for mining imbalanced data. Paper presented at: 2011 IEEE
Symposium on Computational Intelligence and Data Mining (CIDM); 2011: IEEE:104-111.
58. Barua S, Islam MM, Yao X, Murase K. MWMOTE—majority weighted minority oversampling technique for imbalanced data set learning.
IEEE Trans Knowl Data Eng. 2014;26(2):405-425.
59. Bennin KE, Keung J, Phannachitta P, Monden A, Mensah S. MAHAKIL: diversity based oversampling approach to alleviate the class
imbalance issue in software defect prediction. IEEE Trans Softw Eng. 2017;44(6):534-550.
60. Zhang C, Bi J, Xu S, et al. Multi-imbalance: an open-source software for multi-class imbalance learning. Knowl Based Syst.
2019;174:137-143.
61. Lin C-T, Hsieh T-Y, Liu Y-T, et al. Minority oversampling in kernel adaptive subspaces for class imbalanced datasets. IEEE Trans Knowl
Data Eng. 2017;30(5):950-962.
62. Hu Z, Chiong R, Pranata I, Bao Y, Lin Y. Malicious web domain identification using online credibility and performance data by
considering the class imbalance issue. Ind Manag Data Syst. 2019;119(3):676-696.
63. Susan S, Kumar A. Hybrid of intelligent minority oversampling and PSO-based intelligent majority undersampling for learning from
imbalanced datasets. Paper presented at: International Conference on Intelligent Systems Design and Applications; 2018:760-769. Cham:
Springer.
64. Ramentol E, Caballero Y, Bello R, Herrera F. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling
for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inform Syst. 2012;33(2):245-265.
65. Peng L, Zhang H, Yang B, Chen Y, Zhou X. SMOTE-DGC: an imbalanced learning approach of data gravitation based classification.
International Conference on Intelligent Computing. Cham, Switzerland: Springer; 2016:133-144.
66. Susan S, Kumar A. SSOMaj-SMOTE-SSOMin: three-step intelligent pruning of majority and minority samples for learning from
imbalanced datasets. Appl Soft Comput. 2019;78:141-149.
67. Wong GY, Leung FHF, Ling SH. A hybrid evolutionary preprocessing method for imbalanced datasets. Inform Sci. 2018;454:161-177.
68. Popel MH, Hasib KM, Habib SA, Shah FM. A hybrid under-sampling method (HUSBoost) to classify imbalanced data. Paper presented
at: 2018 21st International Conference of Computer and Information Technology (ICCIT); 2018: IEEE:1–7.
69. Li L, He H, Li J. Entropy-based sampling approaches for multi-class imbalanced problems. IEEE Trans Knowl Data Eng. 2019:1–12.
70. Fernandes ERQ, de Carvalho AC. Evolutionary inversion of class distribution in overlapping areas for multi-class imbalanced learning.
Inform Sci. 2019;494:141-154.
71. D’Addabbo A, Maglietta R. Parallel selective sampling method for imbalanced and large data classification. Pattern Recogn Lett.
2015;62:61-67.
72. Hu F, Li H, Lou H, Dai J. A parallel oversampling algorithm based on NRSBoundary-SMOTE. J Inform Comput Sci.
2014;11(13):4655-4665.
73. Torgo L. Data Mining with R: Learning with Case Studies. United States: CRC Press; 2016.
74. Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning.
J Mach Learn Res. 2017;18(1):559-563.
75. Alcalá-Fdez J, Sanchez L, Garcia S, et al. KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput.
2009;13(3):307-318.
76. Bi J, Zhang C. An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble
learning scheme. Knowl Based Syst. 2018;158:81-93.
77. Kang S, Cho S, Kang P. Constructing a multi-class classifier using one-against-one approach with different binary classifiers. Neurocom-
puting. 2015;149:677-682.
78. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263-1284.
79. Lu Y, Cheung Y-M, Tang YY. Bayes imbalance impact index: a measure of class imbalanced dataset for classification problem; 2019.
arXiv preprint arXiv:1901.10173.
80. Frank A, Asuncion A. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science; 2010.
81. Prati RC, Batista GE, Silva DF. Class imbalance revisited: a new experimental setup to assess the performance of treatment methods.
Knowl Inform Syst. 2015;45(1):247-270.
82. Branco P, Torgo L, Ribeiro R. A survey of predictive modelling under imbalanced distributions. ACM Comput Surv. 2016;49(2):1-31.
83. Zhang X, Ma D, Gan L, Jiang S, Agam G. CGMOS: Certainty guided minority oversampling. Paper presented at: Proceedings of the 25th
ACM International on Conference on Information and Knowledge Management; 2016:1623-1631.
84. Raghuwanshi BS, Shukla S. SMOTE based class-specific extreme learning machine for imbalanced learning. Knowl Based Syst.
2020;187:104814.
85. Elrahman SMA, Abraham A. A review of class imbalance problem. J Netw Innov Comput. 2013;1:332-340.
86. Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from class-imbalanced data: review of methods and applications.
Exp Syst Appl. 2017;73:220-239.
87. Krawczyk B. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell. 2016;5(4):221-232.
88. Ali A, Shamsuddin SM, Ralescu AL. Classification with class imbalance problem: a review. Int J Adv Soft Comput Appl. 2015;7(3):176-204.
89. Li Y, Sun G, Zhu Y. Data imbalance problem in text classification. Paper presented at: 2010 Third International Symposium on
Information Processing; 2010: IEEE:301–305.
90. Vashishtha S, Susan S. Fuzzy rule based unsupervised sentiment analysis from social media posts. Exp Syst Appl. 2019;138:112834.
91. Song J, Huang X, Qin S, Song Q. A bi-directional sampling based on K-means method for imbalance text classification. Paper presented
at: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS); 2016: IEEE:1-5.
92. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-444.
93. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Advances in Neural Information
Processing Systems. United States: MIT press; 2012:1097-1105.
94. Saini M, Susan S. Data augmentation of minority class with transfer learning for classification of imbalanced breast cancer dataset
using inception-V3. Paper presented at: Iberian Conference on Pattern Recognition and Image Analysis; 2019; Cham, Swizterland:
Springer:409-420.
95. Saini M, Susan S. Comparison of deep learning, data augmentation and bag-of-visual-words for classification of imbalanced image
datasets. Paper presented at: International conference on recent trends in image processing and pattern recognition; 2018;561-571.
Singapore: Springer.
96. Dalyac A, Shanahan M, Kelly J. Tackling class imbalance with deep convolutional neural networks. Imperial College. 2014;2:30-35.
97. Susan S, Jain A, Sharma A, Verma S, Jain S. Fuzzy match index for scale-invariant feature transform (SIFT) features with application to
face recognition with weak supervision. IET Image Process. 2015;9(11):951-958.
98. Buda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw.
2018;106:249-259.
99. Lin C-C, Deng D-J, Kuo C-H, Chen L. Concept drift detection and adaption in big imbalance industrial IoT data using an ensemble
learning method of offline classifiers. IEEE Access. 2019;7:56198-56207.
100. Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med.
2016;375(13):1216-1219.
101. Fernández A, del Río S, Chawla NV, Herrera F. An insight into imbalanced big data classification: outcomes and challenges. Complex
Intell Syst. 2017;3(2):105-120.
102. Triguero I, Galar M, Vluymans S, Cornelis C, Bustince H, Herrera F, Saeys Y. Evolutionary undersampling for imbalanced big data
classification. Paper presented at: 2015 IEEE Congress on Evolutionary Computation (CEC); 2015: IEEE:715-722.
103. Del Río S, López V, Benítez JM, Herrera F. On the use of MapReduce for imbalanced big data using random Forest. Inform Sci.
2014;285:112-137.
104. Gutiérrez PD, Lastra M, Benítez JM, Herrera F. SMOTE-GPU: big data preprocessing on commodity hardware for imbalanced classifi-
cation. Prog Artif Intell. 2017;6(4):347-354.
105. Han W, Tian Z, Huang Z, Li S, Jia Y. Bidirectional self-adaptive resampling in internet of things big data learning. Multimed Tools Appl.
2019;78(21):30111-30126.
106. Wang S, Minku LL, Yao X. A systematic study of online class imbalance learning with concept drift. IEEE Trans Neural Netw Learn Syst.
2018;29(10):4802-4821.
107. Brzezinski D, Piernik M. Structural XML classification in concept drifting data streams. New Gener Comput. 2015;33(4):345-366.
108. Kelly MG, Hand DJ, Adams NM. The impact of changing populations on classifier performance. Paper presented at: Proceedings of the
fifth ACM SIGKDD International conference on knowledge discovery and data mining; 1999: ACM:367-371.
109. Mairal J, Bach F, Ponce J, Sapiro G. Online learning for matrix factorization and sparse coding. J Mach Learn Res. 2010;11:19-60.
110. Mardani M, Mateos G, Giannakis GB. Subspace learning and imputation for streaming big data matrices and tensors. IEEE Trans Signal
Process. 2015;63(10):2663-2677.
111. Wu K, Edwards A, Fan W, Gao J, Zhang K. Classifying imbalanced data streams via dynamic feature group weighting with importance
sampling. Paper presented at: Proceedings of the 2014 SIAM International Conference on Data Mining: Society for Industrial and Applied
Mathematics; 2014:722-730.
112. Ding S, Mirza B, Lin Z, et al. Kernel based online learning for imbalance multiclass classification. Neurocomputing. 2018;277:139-148.
113. Dal Pozzolo A, Caelen O, Bontempi G. When is undersampling effective in unbalanced classification tasks? In: Appice A, Rodrigues PP, Costa VS, Soares C, Gama J, Jorge A, eds. ECML/PKDD, Lecture Notes in Computer Science, vol. 9284. Cham: Springer; 2015:200-215.
114. Tayal A, Coleman TF, Li Y. Rankrc: large-scale nonlinear rare class ranking. IEEE Trans Knowl Data Eng. 2015;27(12):3347-3359.

AU THOR BIOGRAPHIES

Seba Susan is an Associate Professor in the Department of Information Technology at Delhi
Technological University (DTU). She completed her Ph.D from the Electrical Engineering
Department of Indian Institute of Technology (IIT) Delhi in 2014. Her research areas are Machine
Learning, Computer Vision, Speech and Natural Language Processing.

Amitesh Kumar is a postgraduate student in the Information Systems programme in the
Department of Information Technology at Delhi Technological University (DTU). His areas of
interest are Machine learning and Artificial Intelligence. His current research involves learning
from imbalanced datasets.

How to cite this article: Susan S, Kumar A. The balancing trick: Optimized sampling of imbalanced datasets—A
brief survey of the recent State of the Art. Engineering Reports. 2020;e12298. https://doi.org/10.1002/eng2.12298
