

International Journal on Advances in ICT for Emerging Regions 2023 16 (1)

A Review on Oversampling Techniques for Solving the Data Imbalance Problem in Classification

Tharinda Dilshan Piyadasa, Kasun Gunawardana

Abstract— The data imbalance problem is a widely explored area in the Machine Learning domain. With the rapid advancement of computing infrastructure and the incessant increase in the amount and variety of data generated, the data imbalance problem has prevailed and reshaped with the requirement for novel approaches to address it. Among the different approaches that exist to address the data imbalance problem, such as data-level and algorithmic-level, data-level approaches are more popular among the scientific community due to their classifier-independent nature. When investigating current trends in data-level approaches, it is evident that oversampling is a technique frequently explored due to its adaptability to scenarios where extreme data imbalance is present. This paper presents a review of different oversampling techniques with a comprehensive analysis of the strategies that have been used, along with possible areas that look promising to explore further to develop more advanced oversampling techniques.

Keywords— Data Imbalance Problem, Classification Analysis, Oversampling.

Correspondence: Tharinda Dilshan Piyadasa (E-mail: tharindad7@gmail.com). Received: 18-07-2022, Revised: 23-01-2023, Accepted: 26-02-2023. Tharinda Dilshan Piyadasa and Kasun Gunawardana are from University of Colombo School of Computing, Sri Lanka ([email protected], [email protected]). DOI: http://doi.org/10.4038/icter.v16i1.7260. © 2022 International Journal on Advances in ICT for Emerging Regions. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

I. INTRODUCTION

Data mining and knowledge discovery have become indispensable in the contemporary age of big data for making accurate decisions and predictions. Classification analysis is one of the most commonly employed data mining tasks for various market and engineering problems, such as bankruptcy prediction, network intrusion detection, fraud detection, and software fault detection, where classifiers are trained to discriminate between the different classes representing the problem [1]. When using traditional classifiers to carry out the said tasks, it can be observed that these classifiers perform well over evenly distributed data. This is due to the fact that traditional classifiers are designed to increase accuracy with no notion of the distribution of data [2]. However, in the real world, data collected for classification analysis are usually class (or data) imbalanced.

In the context of classification analysis, class imbalance refers to classification problems where the dataset contains at least one class with significantly fewer samples than other classes in the dataset. In a two-class classification problem, the class with the fewest samples is called the minority class, and the other class is called the majority class [3]. The class disproportion among these samples is identified using the Imbalance Ratio (IR), which can vary from dataset to dataset. This metric simply represents the ratio between the majority and minority class samples.

In many practical applications of classification analysis, the minority class represents the positive examples or the target class, where the adverse effect of false-negative predictions is much higher than that of false-positive predictions [1]. For example, when considering credit card fraud detection, there can be thousands of regular transactions for a single fraudulent transaction, making the target class the minority class in the dataset. Suppose a regular transaction is flagged as a fraudulent transaction (false positive) by a trained model. In that case, it can later be resolved using further examinations. On the other hand, if a fraudulent transaction is incorrectly classified as a regular transaction (false negative), which is the usual behavior of traditional classifiers on imbalanced datasets, the primary intention of the classifier is defeated. The justification behind this behavior is that, in extreme imbalance scenarios where positive examples are under-represented, they are often mistaken for noise or outliers, or allocated to the majority class, ignoring the importance of their characteristics and leading traditional learning models to favor the majority class heavily [4].

Another significant observation of class imbalance is that, regardless of the poor performance of standard classifiers on the minority class, the classifier would still make predictions with high accuracy depending on the imbalance ratio of the classes. For example, suppose the imbalance ratio of a binary class dataset is 9:1 (for nine samples in the majority class, there is only one minority sample). The classifier can acquire an accuracy of 90% by classifying all the samples into the majority class, which is a decent accuracy when considering a standard classifier [5]. In practical applications, the imbalance ratio can be much higher than the ratio depicted in the above example. It is also evident that accuracy is not a suitable metric to evaluate a standard classifier when datasets are imbalanced, as the importance of the minority class is ignored.

A. Addressing the Data Imbalance Problem

The approaches used to overcome the data imbalance problem can be categorized into three groups, as represented in Fig. 1: External approaches (data level), Internal approaches (algorithmic level), and Hybrid approaches.

The external approaches focus on balancing the dataset either by removing majority class samples through undersampling or adding minority class samples through oversampling. It is also possible to combine oversampling and undersampling to form hybrid sampling methods. The objective of external approaches is to reduce the imbalance ratio to achieve a favorable distribution among the classes.
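To make the Imbalance Ratio and the accuracy pitfall discussed above concrete, the following minimal sketch (not part of the original study; it assumes scikit-learn and the imbalanced-learn package are installed, and all values are illustrative) builds a roughly 9:1 binary dataset, shows that a majority-vote classifier reaches about 90% accuracy while recalling no minority samples, and then applies a simple external (data-level) resampler to reduce the imbalance ratio.

# Illustrative sketch only (assumes scikit-learn and imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from imblearn.over_sampling import RandomOverSampler

# Synthetic binary dataset with an imbalance ratio of roughly 9:1 (class 1 is the minority).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
counts = Counter(y)
print("class counts:", counts, "IR ~", counts[0] / counts[1])

# A classifier that always predicts the majority class scores about 90% accuracy
# while recalling none of the minority (positive) samples.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)
print("accuracy:", accuracy_score(y, pred), "minority recall:", recall_score(y, pred))

# External (data-level) remedy: resample to reduce the imbalance ratio before training.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print("class counts after oversampling:", Counter(y_res))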


Fig. 1 Approaches to address the data imbalance problem.



Even though most of the proposed external approaches resample the dataset until the number of samples in each class is equal, studies such as [6] demonstrate that it is not always required to maintain a 50:50 class distribution when resampling. However, there is no hard and fast rule to decide on a favorable imbalance ratio, as it can vary depending on the domain and the type of classifier used.

Internal approaches involve developing and improving the underlying classification algorithm without altering the dataset involved [7]. There are mainly two ways that internal approaches address the data imbalance problem. The first method is cost-sensitive learning, where the classifier is modified such that the misclassification of minority class samples is heavily penalized compared to the misclassification of majority class samples. The second and most popular internal approach is to incorporate ensemble-based classifiers, where multiple weak classifiers are combined to improve the performance of the overall classification algorithm. Apart from these methods, there have also been algorithmic classifier modifications proposed in past years to improve classifier performance on classifiers like Support Vector Machines (SVM), Extreme Learning Machines (ELM), and Neural Networks (NN). Moreover, internal approaches can also be combined with external approaches to derive hybrid approaches that incorporate both the advantages and disadvantages of internal and external approaches [1][8].

When comparing the approaches to address the data imbalance problem, it is apparent that researchers prefer external approaches over internal approaches mainly due to their classifier independence [5]. In external approaches, since only the dataset is modified, it gives the freedom to select any suitable classifier for the classification task. However, in the case of internal approaches, as the internal structure/algorithm of the classifier is modified to address the imprecise classification of minority class samples, the dataset is heavily dependent on the modified classifier. Nevertheless, it is impractical to use the same classification algorithm with every dataset in different contexts. Therefore, on the basis of generalizability, it is reasonable to presume that external approaches provide an added advantage over internal approaches.

B. Overview of External Approaches

As aforementioned, external approaches to solving the data imbalance problem are heavily favored in the field of research due to classifier independence. When considering oversampling and undersampling, both methods have their own advantages and disadvantages. The main drawback of oversampling is that it risks generating synthetic data that can lead to overfitting. This can be caused by generating synthetic samples that closely resemble original samples or by incorrectly positioning synthetic samples in the data space (overlapping other classes). In the case of undersampling, it risks excluding important information from the dataset, such as samples that are crucial when deciding the decision boundaries or samples that carry a higher weight in representing a particular class or feature. Apart from the exclusion of important information, undersampling can also suffer from data scarcity after resampling if the minority class contains extremely few samples.

Throughout the past years, many studies have been carried out to investigate methods and mechanisms to mitigate these drawbacks of external approaches. For example, the most intuitive technique to add or remove data to/from a given class is by performing random selections. These primitive techniques have evolved and improved over time to address their foundational drawbacks by combining more complex techniques and statistical and probabilistic methods.

When comparing oversampling and undersampling, even though oversampling can lead to overfitting, it is possible to detect it during the earlier stages of training using straightforward approaches such as using a good train-test split and observing the change in testing error compared to the training error. However, in the case of important information exclusion caused by undersampling, although the model might work well with the resampled dataset, a classifier trained with excluded samples can lead to many misclassifications with the introduction of new data samples. Mohammed et al. [9] validate this assertion, where several state-of-the-art classifiers are used to evaluate oversampled and undersampled datasets. The authors have concluded that, compared to undersampling, oversampling of datasets leads to a more accurate classification.
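The train-test-split check mentioned above can be sketched as follows (an illustrative example, not taken from [9]; it assumes scikit-learn and imbalanced-learn, and the classifier choice is arbitrary). Note that the resampler is applied only to the training split, so the held-out error remains an honest estimate.

# Illustrative sketch: detect overfitting after oversampling by comparing the
# training error with the error on an untouched test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

# Oversample the training portion only; the test split keeps the original distribution.
X_res, y_res = RandomOverSampler(random_state=1).fit_resample(X_train, y_train)

clf = DecisionTreeClassifier(random_state=1).fit(X_res, y_res)
train_acc = clf.score(X_res, y_res)
test_acc = clf.score(X_test, y_test)
# A large gap between training and testing accuracy signals overfitting.
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")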


There are numerous studies, including [10] and [11], that review oversampling techniques by conducting experiments and comparing the results to provide a comprehensive evaluation of the performance of different oversampling techniques in practical settings. However, the objective of these studies seems to be finding the better technique out of a set of available techniques based on a systematic analysis, and they provide limited information on the approaches and methodologies used in such techniques. As a result, a researcher may find it challenging to comprehend the underlying strategies of an oversampling technique, which, in turn, would lead to failure in addressing its limitations.

This paper provides a comprehensive review of some popular oversampling techniques used to address the data imbalance problem, highlighting their strategies and potential areas for further improvement. The aim of this study is to provide insights and guidance for researchers and practitioners in the field of machine learning who are interested in developing more robust oversampling techniques to address the data imbalance problem.

The rest of this paper is organized as follows. Section II provides a comprehensive review of existing oversampling techniques, highlighting their strategies when performing oversampling. Section III presents the key findings of the review, emphasizing the factors that need to be considered when formulating new oversampling techniques, followed by the conclusion in Section IV.

II. ANALYSIS OF OVERSAMPLING TECHNIQUES

When selecting studies for the review, a deliberate decision was made to include oversampling techniques that are widely recognized and used in the machine learning community. The rationale behind this choice was that these methods have been proven to be successful and efficient in previous studies, such as [10] and [11], and that they are readily accessible and available in popular machine learning libraries like scikit-learn. By incorporating these well-established oversampling techniques, this study aims to ensure that the review reflects the best practices and standards in the field.

Oversampling approaches generate synthetic minority class samples and combine them with the existing dataset, resulting in a new dataset that is more appropriate for training. The most intuitive form of oversampling is random oversampling, where minority class samples are randomly selected and duplicated without any specific selection standard.

Random oversampling can be effective for machine learning algorithms influenced by skewed distributions in instances where the overall size of the dataset is small and the imbalance is not that significant [9]. However, in cases where the dataset is heavily imbalanced, or the number of minority class samples is insufficient to train a decent classifier, random oversampling risks classifier overfitting during training due to repeated duplication of the minority samples. Despite its implementation simplicity and fast execution, which is ideal for large and complex datasets, the lack of generalizability and high likelihood of overfitting in random oversampling has led researchers to look for more robust (robustness is denoted as the ability to oversample without introducing any bias in this context) alternative oversampling techniques.

When exploring the literature on oversampling, the most widely used techniques in the scientific community are SMOTE [12] and its variants. SMOTE stands for Synthetic Minority Oversampling Technique, where the algorithm generates a synthetic sample along the line segment that joins a randomly selected minority class sample and one of its K nearest neighbors. In SMOTE, the value of K is a parameter that should be specified prior to its application, and minority class samples are randomly chosen from the set of K-nearest neighbors based on the amount of oversampling required. The operation of the SMOTE algorithm is further elaborated in Fig. 2, where (a) the majority class and minority class samples are represented in blue and green colors, respectively; (b) a minority class sample is randomly selected (black), and its K-nearest neighbors (3 in the image) are selected; and (c) a new synthetic sample (red) is generated on the line that joins the randomly selected minority class sample and its nearest neighbor.

Fig. 2 Graphical representation of the SMOTE algorithm [13]

As the synthetic samples generated by SMOTE are not duplicates of already existing samples, they are more generalizable than samples generated through random oversampling, reducing the risk of overfitting. However, due to the random selection of minority class samples with a uniform probability for oversampling, densely populated minority class areas become more condensed while sparsely populated minority class areas remain sparse. This behavior of SMOTE manages to address the between-class imbalance (imbalance between multiple classes), while the within-class imbalance (multiple dense or sparse regions of the same class) is ignored. Another drawback of the SMOTE algorithm is the generation of noisy samples. If a new synthetic sample is generated between an existing noisy sample and its nearest neighbor, there is a high probability that the newly generated sample will also be noisy. This is because the SMOTE algorithm has no notion of overlapping class regions when generating synthetic samples [1][4]. Throughout the years, the SMOTE algorithm has been modified to address its drawbacks and limitations.
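As a concrete illustration of the interpolation step described above, the following minimal sketch applies the SMOTE implementation from the imbalanced-learn library (an assumption of this example; the reviewed paper does not prescribe a particular implementation), with K exposed as the k_neighbors parameter. The dataset and parameter values are arbitrary.

# Minimal SMOTE sketch using imbalanced-learn (illustrative values only).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

# Each synthetic point lies on the segment between a randomly chosen minority
# sample and one of its k_neighbors nearest minority neighbors.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("after: ", Counter(y_res))

The variants reviewed below (for example, Borderline-SMOTE and the SMOTE+Tomek and SMOTE+ENN hybrids) expose the same fit_resample interface in that library, which makes it straightforward to swap strategies during experimentation.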


In [14], Batista et al. apply SMOTE to oversample the minority class, followed by applying Tomek Links to increase the class separation near the decision boundary. The authors state that the class clusters are sometimes ill-defined during oversampling, as the minority class samples may enter the majority class area, and that interpolating minority class instances can enlarge the minority cluster, introducing noisy minority samples deep in the majority area, which is harmful and can lead to overfitting. Tomek Links act as a data cleaning mechanism in this technique, where overlapping majority and minority class samples are removed to form well-defined class clusters. This can be considered a hybrid technique, as it applies both oversampling and undersampling to the dataset.

The research work presented in [15] is another hybrid technique where extensive data cleaning based on misclassifications is applied through ENN (Edited Nearest Neighbor) on an oversampled dataset. ENN is similar to Tomek Links, but it is more aggressive as it removes any sample (majority or minority) from the training set that its three nearest neighbors misclassify, creating more distinguishable class spaces with clear separation along the decision boundary. The study also states that oversampling strategies lead to more accurate classifiers than strategies derived through undersampling.

Geometric SMOTE [16] is another extension of SMOTE that generates synthetic samples near selected minority class samples in a geometric region instead of using linear interpolation. While this selected region is a hyper-sphere in its default configuration, G-SMOTE deforms it to a hyper-spheroid and eventually to a line segment, simulating the SMOTE process in the last instance. Geometric SMOTE addresses two main issues in SMOTE: the generation of noisy samples and the generation of samples that belong to the same sub-cluster. These issues are addressed by identifying safe areas to synthesize new samples and varying the number of minority samples generated. The authors claim that the ability of G-SMOTE to produce a variety of synthetic minority data in safe regions of the input space while aggressively boosting their diversity is the rationale for its performance gain.

Safe-Level-SMOTE [17] follows a similar approach to SMOTE but considers the nearby majority class samples when generating synthetic minority class samples. Safe levels are computed using nearest-neighbor minority samples, and synthetic samples are generated such that they lie closer to minority class samples (the safe area). The study tries to address the overgeneralization problem encountered by SMOTE due to arbitrary generalization of the minority class territory neglecting the majority class, which can lead to an increased likelihood of class mixing in the case of highly skewed class distributions.

Borderline-SMOTE [18] is another variation of SMOTE that generates synthetic minority class samples only within the decision boundary that separates the classes. In contrast to SMOTE, Borderline-SMOTE identifies minority class samples that lie within the vicinity of the majority class samples and prevents the generation of noisy synthetic samples based on those. The authors declare that most classification algorithms strive to learn the boundaries of each class as precisely as possible during the training process to obtain a better prediction, making the samples far from the borderline less significant compared to the samples that lie within the vicinity of the class borders. Furthermore, the study presents two versions of Borderline-SMOTE: Borderline-SMOTE1, which generates new synthetic samples between borderline minority samples and their K-nearest minority neighbors, and Borderline-SMOTE2, which generates new synthetic samples between borderline minority samples and their K-nearest minority as well as K-nearest majority neighbors.

ADASYN [19] is a density-based oversampling technique where the density of minority samples in a neighbourhood is considered when generating new synthetic minority class samples. The main intuition of ADASYN is to utilize a density distribution as a criterion to determine the number of new synthetic samples that should be generated for each minority sample. The density distribution considers the learning difficulty of each of the minority class samples and generates more synthetic samples around samples that are more difficult to learn than around those that are simpler to learn. Even though ADASYN is capable of enhancing hard-to-learn minority sample areas, it is sensitive to outliers because of the possibility of misinterpreting noisy samples, which usually occur in low densities, as harder-to-learn samples, associating them with higher weights. A summary of SMOTE and the variants elaborated above is presented in Table I.

When examining the process in which the aforementioned methods have approached the problem, it is evident that they are focused on balancing the number of samples in the dataset classes. The imbalance between the dataset classes that splits them into majority and minority classes is called the between-class imbalance. By default, all the resampling techniques are designed to address the between-class imbalance through oversampling, undersampling, or hybrid sampling. However, when comparing with vanilla SMOTE, it can be observed that most of the above techniques attempt to refine the output of the SMOTE algorithm by regulating the areas of sample generation and eliminating noisy synthetic samples to preserve the decision boundary that separates the classes.

The samples near the decision boundary undoubtedly represent the most crucial samples for any classification task. Despite the importance of the decision boundary, as depicted in Fig. 3 (B), the samples generated near the boundary through oversampling often tend to distort the class separation, generating noisy samples that overlap with the majority class samples. The generation of noisy samples near the decision boundary is caused by the use of the same sample generation strategy throughout the data space, which is not designed to preserve the decision boundary. The oversampling techniques mentioned above address this issue and emphasize preserving the decision boundary when generating new synthetic minority class samples.

Fig. 3 (A) Occurrence of multiple disjuncts of minority class samples with varying densities. (B) Noisy minority class samples distort the decision boundary by overlapping with majority class samples.

Moreover, when considering real-life datasets, there can also be instances where multiple dense or sparse clusters of minority class samples are present within the data distribution, as illustrated in Fig. 3 (A).

The existence of multiple disjuncts of minority class samples is referred to as the within-class imbalance, and it can lead to an extreme lack of representation of crucial minority class features. Oversampling techniques that randomly select minority samples to generate new synthetic samples, such as SMOTE, fail to resolve the within-class imbalance, resulting in a skewed minority class distribution [20]. Therefore, it is important to address both between-class and within-class imbalances when addressing the data imbalance. The simultaneous removal of both these imbalances minimizes the classifier bias toward bigger sub-clusters by decreasing the influence of the bigger sub-cluster error on the total error [21].
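Several of the variants summarized in Table I have widely used open-source implementations. The sketch below is an illustration added for this review rather than material from the cited papers; it assumes the imbalanced-learn package and shows how the SMOTE+Tomek Links [14], SMOTE+ENN [15], Borderline-SMOTE [18], and ADASYN [19] strategies can be compared under a common interface.

# Illustrative comparison of SMOTE variants available in imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTETomek, SMOTEENN

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=7)

samplers = {
    "SMOTE + Tomek Links [14]": SMOTETomek(random_state=7),
    "SMOTE + ENN [15]": SMOTEENN(random_state=7),
    "Borderline-SMOTE1 [18]": BorderlineSMOTE(kind="borderline-1", random_state=7),
    "ADASYN [19]": ADASYN(random_state=7),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    # Hybrid cleaning (Tomek/ENN) and ADASYN's per-sample counts mean the
    # resulting classes are not necessarily exactly equal in size.
    print(f"{name}: {Counter(y_res)}")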


TABLE I
SUMMARY OF SMOTE AND ITS VARIANTS

Reference / Approach / Summary

[12] Oversampling: combines random sampling with the K nearest neighbor algorithm.
- A minority class sample is selected at random.
- A synthetic sample is generated on the line that joins the random minority sample and its nearest neighbor.

[14] Hybrid Sampling: refines the output of SMOTE by applying Tomek Links.
- Applies SMOTE followed by Tomek Links.
- Tomek Links remove overlapping majority and minority samples.
- Increases the class separation near the decision boundary.

[15] Hybrid Sampling: refines the output of SMOTE by applying Edited Nearest Neighbor (ENN).
- Extensive data cleaning based on misclassification.
- Removes any sample that its 3 nearest neighbors misclassify.
- ENN is more aggressive than Tomek Links.

[16] Oversampling: modifies the sample generation strategy of SMOTE by identifying safe regions.
- Generates samples in a geometric region instead of using linear interpolation.
- Prevents the generation of noisy samples by identifying safe areas and varying the number of samples generated.

[17] Oversampling: modifies the sample generation strategy of SMOTE by computing safe levels.
- Considers nearby majority class samples when generating new synthetic minority class samples.
- Safe levels are computed using nearest neighbor minority samples.

[18] Oversampling: modifies SMOTE to generate new samples only within the decision boundary.
- Generates minority class samples only within the decision boundary.
- Ignores minority class samples that lie within the majority class samples during synthesis.

[19] Oversampling: uses the density around minority class samples to determine the number of synthetic samples to be generated.
- The density of minority class samples in a neighborhood is considered when generating new samples.
- Heavily sensitive to outliers.

Further looking into oversampling techniques reveals another set of studies that use a different strategy to deal with the data imbalance problem.

MWMOTE [22] is a popular SMOTE-based oversampling technique. It first locates hard-to-learn minority class samples (samples near the decision boundary) using the majority class samples near the decision boundary and uses the Euclidean distance from these nearest majority class samples to assign them weights. This weighting mechanism ensures that higher weights are assigned to samples closer to the decision boundary than to others. The authors highlight the fact that the presence of within-class imbalance and small disjuncts of the minority class can lead to performance degradation in classifiers; therefore, similar to the weighting of hard-to-learn minority samples near the decision boundary, the samples of smaller clusters are given greater weights to reduce the within-class imbalance. Finally, a modified hierarchical clustering approach is used to create synthetic samples from the weighted minority class samples, making sure the generated samples reside within the minority class region to avoid noisy sample generation.

Cluster SMOTE [23] uses K-means to cluster the minority class and applies SMOTE within the identified clusters. This approach makes sure that the generated synthetic samples always lie inside naturally occurring clusters of the minority class samples. The study claims that the existence of a small number of minority class samples makes it challenging to form decent class borders, and that addressing this limitation through accurate class region and border definition would enable trivial classification. Since these class regions are unknown and impossible to infer from the given data, K-means is used to approximate the minority region, followed by applying SMOTE to each identified cluster. This study is explicitly designed to address the imbalance in network intrusion datasets and only uses two intrusion datasets for evaluation.

[24] presents a clustering-based oversampling technique designed to address the within- and between-class imbalances while avoiding the generation of noisy synthetic samples. Initially, the algorithm clusters the input space using K-means clustering and filters the clusters with a higher number of minority samples for oversampling. The number of synthetic samples to be generated is then dispersed, with more samples being assigned to clusters with a low density of minority samples. Finally, SMOTE is used to obtain the required ratio of minority and majority samples in each of the filtered clusters. The authors rationalize cluster-based oversampling as one of the strategies that aim to minimize the within-class imbalance while also reducing the between-class imbalance, facilitating the oversampling technique to identify the most effective areas of the input space to generate synthetic samples.
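The clustering-then-oversampling idea of [24] is available, for example, as KMeansSMOTE in the imbalanced-learn package. The following sketch is only an illustration of its typical use; the parameter values are not taken from the paper and the cluster-selection threshold usually needs tuning for a given dataset.

# Illustrative sketch of K-means-based SMOTE (the strategy of [24]) via imbalanced-learn.
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], n_informative=4, random_state=3)

# Clusters with a sufficient share of minority samples are selected, and more
# synthetic points are generated in the sparser of those clusters.
sampler = KMeansSMOTE(
    kmeans_estimator=KMeans(n_clusters=20, random_state=3),
    cluster_balance_threshold=0.05,  # minimum minority share for a cluster to be used; dataset-dependent
    random_state=3,
)
X_res, y_res = sampler.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))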


DBSMOTE [25] is another density-based oversampling technique, which uses the DBSCAN algorithm to partition the minority class samples. SMOTE is used to generate synthetic samples along the shortest path that joins minority class samples with a pseudo-centroid of a minority cluster, avoiding the generation of outliers or noisy samples. As a result, synthetic samples are generated in such a way that they are dense around the centroid and sparse further away from the centroid. The authors claim that a real-world dataset with proximate data clusters can be described by a normal distribution, dense at the centroid and sparse towards the boundary, and that a classifier can correctly identify samples near the centroid as it identifies the area around the centroid as a class. Based on the above observations, DBSMOTE is designed to oversample the minority class area around the centroid, which would otherwise be too sparse to be recognized by a classifier.

CURE-SMOTE [26] works by clustering the minority class samples using the CURE hierarchical clustering algorithm, followed by noise and outlier removal. It then randomly generates synthetic minority class samples along the line segments that join representative points and the center point. In CURE hierarchical clustering, each sample is initially assumed to represent a cluster, and local clustering is used to combine these samples to form the clusters present in the input space. The study justifies CURE hierarchical clustering by stating that it is more efficient for large datasets with varying shapes of data distributions than K-means clustering, which is only suitable for spherically distributed datasets. Further, it is stated that the combination of clustering and merging operations tends to eliminate noise with reduced complexity, as it eliminates the need to remove the furthest created synthetic samples (noisy samples) after applying SMOTE.

A-SUWO (Adaptive Semi-Unsupervised Weighted Oversampling) [27] and its improved version, IA-SUWO [28], cluster the minority class samples using a semi-unsupervised hierarchical clustering approach and use the classification complexity and cross-validation of each sub-cluster to decide the optimal size to oversample. Both A-SUWO and IA-SUWO aim to generate synthetic samples near minority class instances that lie close to the decision boundary with lower densities.

[29] presents a probability-based cluster expansion oversampling technique that uses a model-based clustering mechanism (MCLUST) to identify sub-clusters present in the dataset. The method also uses K-Nearest Neighbor based noise removal prior to clustering to reduce the oversampling of noisy samples, and equal posterior probability after clustering to identify the boundary of the identified sub-clusters. Finally, synthetic minority class samples are generated in the enclosed region of the class-separating boundary. As suggested by the authors, the main goal of this technique is to assign equal weight to all sub-clusters of the minority class that would otherwise be overlooked due to the skewness of the distribution. The cluster/density-based oversampling techniques elaborated above are summarized in Table II.

In order to address the within-class imbalance, it is necessary to identify different regions within the data space where oversampling is effective. The above studies show that clustering and density-based techniques are popular approaches that researchers use to identify such areas. After the identification of significant areas to oversample, it is possible to use traditional oversampling techniques to generate synthetic samples. The clustering-based oversampling techniques introduced above emphasize the importance of addressing the within-class imbalance when formulating oversampling techniques.

A. Oversampling High-Dimensional Data

Further inspecting the aforementioned oversampling techniques that address the data imbalance, it is evident that most of the techniques are based on clustering algorithms such as K-means, DBSCAN, and hierarchical clustering, combined with heuristics based on Euclidean distance. Therefore, the majority of these approaches rely on heuristic methods that apply in two-dimensional (Euclidean) space when generating synthetic data, whereas practical scenarios often consist of high-dimensional data [30]. Additionally, when the number of features in the dataset (the dimensionality of the data) increases, the data points become sparser or farther apart (Fig. 4), making the nearest neighbor problem ill-defined [31]. This behavior is called the "curse of dimensionality" [32]. As a result, in higher dimensional space, the use of heuristics based on Euclidean distance becomes ineffective, and the assumption of well-defined clusters fails, generating noisy synthetic samples.

A common strategy that can be adopted when formulating oversampling techniques that use clustering mechanisms and heuristics based on Euclidean distance is to reduce the dimensionality of the original input space. Principal Component Analysis (PCA), Multidimensional Scaling (MDS), and Self-Organizing Maps are some common dimensionality reduction techniques practitioners use. In recent years, Self-Organizing Map [33] based resampling techniques have been extensively explored in the community.

Self-Organizing Map based Oversampling (SOMO) [30] generates a clustered two-dimensional representation of the input space by applying the SOM algorithm. Clusters are filtered to perform oversampling by calculating the density of minority class samples in each cluster. SMOTE is applied to generate synthetic minority class samples within the filtered clusters and between neighboring clusters, addressing both within- and between-class imbalances. The authors have identified and addressed a few inefficiencies of existing oversampling techniques, namely the generation of noisy instances that infiltrate the majority region, the generation of duplicate samples, and the use of heuristics based on the assumption that the input space has a simple manifold structure. SOMO is capable of generating more effective synthetic samples by investigating the manifold structure of the input space, exploiting the topology-preserving property of Self-Organizing Maps.
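One simple way to realize the dimensionality-reduction strategy mentioned above is to project the data before oversampling. The sketch below is an illustration added for this review (it assumes scikit-learn and imbalanced-learn; PCA stands in for any of the reduction techniques named above), using imbalanced-learn's sampler-aware pipeline so the resampling is applied only to training folds during cross-validation.

# Illustrative sketch: reduce dimensionality before oversampling so that the
# Euclidean-distance heuristics inside SMOTE operate in a lower-dimensional space.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # sampler-aware pipeline

X, y = make_classification(n_samples=2000, n_features=200, n_informative=10,
                           weights=[0.9, 0.1], random_state=5)

pipe = Pipeline(steps=[
    ("reduce", PCA(n_components=10)),       # project to a compact representation
    ("oversample", SMOTE(random_state=5)),  # resampling applied only to training folds
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())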


[1] proposes an imbalanced dataset resampling technique by combining Self-Organizing Maps and Genetic Algorithms. The technique uses two Self-Organizing Maps to perform oversampling on the minority class and undersampling on the majority class. The clusters derived from the Self-Organizing Maps identify the regions where majority and minority class samples are dense. The filtered clusters are then utilized to derive a set of rankings among majority and synthetic minority samples, which evaluate the positive impact of their removal or inclusion in the training data, respectively. The ideal rates of exclusion and inclusion for each accepted criterion are obtained using a Genetic Algorithm that considers, via its fitness function, the performance of a random classifier for a given training dataset in the context of imbalanced classification. The authors claim that the capability of Self-Organizing Maps to preserve the distribution and topology of the input data leads to the conservation of the natural spatial relationship among samples at the cluster level, and that the optimization capabilities of Genetic Algorithms result in the maximization of classifier performance, improving the overall resampling operation.

TABLE II
SUMMARY OF CLUSTER/DENSITY BASED OVERSAMPLING TECHNIQUES

Reference / Approach / Summary

[22] Oversampling: combines hierarchical clustering with SMOTE to address both within- and between-class imbalances.
- Identifies hard-to-learn minority class samples and assigns them weights based on the nearest majority class samples.
- Makes sure the generated samples fall into some minority class cluster.

[23] Oversampling: uses K-means to approximate the minority region, followed by applying SMOTE to each identified cluster.
- Uses K-means to cluster the minority class and applies SMOTE within the identified clusters.
- Makes sure generated samples lie inside naturally occurring clusters.

[24] Oversampling: uses K-means to identify and filter clusters with high minority class density, followed by applying SMOTE.
- Uses K-means to identify clusters and assigns weights based on the minority class density in each cluster.
- Generates more samples in clusters with low minority class densities.

[25] Oversampling: combines the DBSCAN algorithm with SMOTE to generate synthetic samples such that the dataset is dense at the centroid and sparse towards the boundary.
- Uses the DBSCAN algorithm to partition the minority class samples.
- Generates synthetic samples along the lines that join minority class samples with a pseudo-centroid of a minority cluster.

[26] Oversampling: combines CURE hierarchical clustering with noise and outlier removal so that the samples generated after using SMOTE are more precise.
- Uses the CURE hierarchical clustering algorithm followed by noise and outlier removal.
- Addresses datasets that have clusters of varying shapes and sizes.

[27][28] Oversampling: uses a semi-unsupervised hierarchical clustering algorithm to generate synthetic samples around minority class instances that lie close to the decision boundary with lower densities.
- Clusters the minority class using a hierarchical clustering approach.
- The oversample size is decided from the classification complexity and cross-validation of each sub-cluster.

[29] Oversampling: combines a model-based clustering mechanism with KNN-based noise removal.
- Uses MCLUST to identify sub-clusters present in the dataset.
- The goal is to assign equal weights to all minority class sub-clusters that would otherwise be overlooked due to the skewness of the distribution.

Fig. 4 Data points become sparser as dimensionality increases [34].
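To give a flavour of the SOM-based strategy used by SOMO [30] and the works discussed here, the following rough sketch clusters the minority class on a small SOM grid and interpolates new samples only between members of the same map node. It assumes the third-party minisom package, uses placeholder data, and is a simplification rather than a reimplementation of any of the cited methods.

# Rough sketch of SOM-based oversampling: cluster minority samples with a small
# Self-Organizing Map, then interpolate only within each map node (assumes minisom).
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
X_min = rng.normal(size=(60, 8))  # placeholder minority-class samples (8 features)

som = MiniSom(3, 3, X_min.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X_min, 500)

# Group minority samples by their best-matching SOM node.
nodes = {}
for x in X_min:
    nodes.setdefault(som.winner(x), []).append(x)

synthetic = []
for members in nodes.values():
    members = np.array(members)
    if len(members) < 2:
        continue  # a lone sample gives nothing to interpolate with
    for _ in range(len(members)):  # per-node oversampling amount (arbitrary here)
        a, b = members[rng.choice(len(members), size=2, replace=False)]
        synthetic.append(a + rng.random() * (b - a))  # SMOTE-style interpolation

X_new = np.vstack([X_min, np.array(synthetic)])
print(X_new.shape)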


[35] uses a customized SOM-SMOTE algorithm to address the imbalance in clutter data when addressing clutter suppression in search radars. The authors have identified two limitations in the SMOTE algorithm, namely random sample selection and ignoring the data distribution during interpolation, resulting in samples that are not representative enough. The study addresses these limitations using a combined Self-Organizing Map and SMOTE algorithm that clusters the minority class samples into several subsets using a Self-Organizing Map and interpolates synthetic samples around the cluster centers using SMOTE. The authors also highlight the ability of Self-Organizing Maps to preserve the topology of higher dimensional clutter data, resulting in synthetic samples with distribution characteristics similar to the original data.

From the above studies, it can be assumed that the ability of Self-Organizing Maps to address the within-class imbalance as a clustering algorithm, along with their ability to reduce the dimensionality of data while preserving the topology of the input space, are the main reasons for their widespread popularity as an excellent candidate to address the data imbalance problem.

III. DISCUSSION

When analyzing oversampling techniques that address the class imbalance problem, it is possible to identify key factors that contribute to the success of an oversampling technique. Throughout the literature, it can be observed that every oversampling technique attempts to adopt one or more of these factors during its strategy formulation.

Considering the variations of the SMOTE algorithm that have been introduced as improved versions of vanilla SMOTE [12], it is evident that most of the proposed techniques, such as [14], [15], Safe-Level-SMOTE [17], and Borderline-SMOTE [18], try to preserve the boundary region that separates the minority and majority classes. Compared to vanilla SMOTE, the higher classification accuracies of these techniques demonstrate the importance of preserving the boundary region when formulating an oversampling technique.

Further exploring contemporary oversampling techniques, it can also be observed that clustering-based approaches are more popular among researchers. This is because, apart from preserving the boundary region, it is also essential to address the within-class imbalance in the dataset (all the resampling techniques address the between-class imbalance by default). Clustering algorithms do not necessarily address the within-class imbalance unless they are explicitly designed to address it.

Based on the above observations, it is possible to identify three constraints that need to be simultaneously satisfied when formulating an oversampling technique to generate an optimal resampled dataset.

1) Addressing the between-class imbalance: The between-class imbalance represents the typical imbalance scenario where there is a significant difference between the number of samples in the dataset classes. All the resampling techniques attempt to address the between-class imbalance by oversampling, undersampling, or hybrid sampling. There is no optimal imbalance ratio that needs to be reached when resampling an imbalanced dataset. However, [6] states that a 35:65 class distribution can achieve a higher classification performance compared to a 50:50 class distribution when the classes are heavily imbalanced. This is an area that is still being investigated.

2) Addressing the within-class imbalance: Within-class imbalance denotes the imbalance within the minority class due to the existence of multiple disjuncts of minority class samples with varying densities (Fig. 3A), which can lead to an extreme representation deficiency of essential characteristics of the minority class. The majority of the oversampling techniques that randomly select minority class samples to generate new synthetic samples, such as SMOTE, fail to address the within-class imbalance, leaving the minority class distribution skewed. When analyzing oversampling techniques capable of addressing the within-class imbalance, it can be observed that they are based on clustering approaches. The use of clustering approaches is an obvious design choice, as they provide the capability to analyze the spatial location of the minority class to determine the suitable areas to generate new synthetic samples.

3) Preserving the boundary region when generating synthetic samples: The boundary region represents the area that separates two or more classes. As mentioned previously, the most crucial samples in any classification task are the samples that reside near the boundary region. When considering oversampling techniques, the synthetic samples generated near the decision boundary often distort the class separation, generating noisy samples that overlap with the majority class samples. This behavior is caused by the use of the same sample generation strategy throughout the data space, which is not designed to preserve the decision boundary. However, studies such as [14], [15], [16], [17], and [18] emphasize the importance of preserving the boundary region when generating new synthetic samples. Decision boundary preservation can be achieved either by using a separate sample generation strategy near the boundary region or by refining the synthetically generated samples to remove noisy samples generated near the decision boundary.

Even though there are oversampling techniques that address different combinations of the above three constraints, almost none of the proposed oversampling techniques address all three constraints together.

Aside from addressing the constraints mentioned above when formulating an oversampling approach, it is also preferable to pay special attention to the curse of dimensionality. As elaborated in the previous section, many of the currently available oversampling approaches are unable to handle the curse of dimensionality, resulting in poor performance on high-dimensional datasets. We believe addressing the above constraints along with a proper clustering algorithm or a dimensionality reduction technique is a promising research avenue to investigate further.
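Relating to constraint 1), most resampler implementations allow the target distribution to be something other than 50:50. For instance, in the imbalanced-learn API (used here purely as an illustration, not as the method of [6]), the sampling_strategy parameter sets the desired minority-to-majority ratio, so a 35:65 distribution like the one reported in [6] can be requested directly.

# Illustrative sketch: resampling to a 35:65 rather than a 50:50 class distribution.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=11)

# sampling_strategy (binary case) = desired ratio of minority to majority samples.
smote_35_65 = SMOTE(sampling_strategy=35 / 65, random_state=11)
X_res, y_res = smote_35_65.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))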


IV. CONCLUSION

The data imbalance problem is one of the most well-defined problems in the Machine Learning domain and has been addressed throughout the past decades. With the emergence of Big Data, traditional techniques to address the data imbalance problem have been challenged, and the necessity for new and improved techniques to address the imbalance has created promising research avenues in many practical domains.

This paper reviews numerous research works that have attempted to address the data imbalance problem by oversampling the minority class samples. We identify several subsets of oversampling techniques and highlight the different approaches adopted by them to discover suitable samples/areas to oversample and the strategies used to generate new synthetic samples. Based on these studies, it is evident that some oversampling techniques focus on preserving the decision boundary by refining the oversampled output or by restricting sample generation in certain areas. It is also possible to identify studies that use clustering and density-based techniques to prevent the generation of noisy samples and alleviate the occurrence of disjuncts of minority class samples with varying densities. Furthermore, the review also presents the challenges faced by traditional oversampling techniques on high-dimensional data and suggests different techniques that can be utilized to address them.

By analyzing the various strategies adopted in the scientific community for oversampling, we have identified three key constraints that need to be satisfied when developing state-of-the-art oversampling techniques:

1) Addressing the between-class imbalance: represents the typical imbalance scenario where there is a significant difference between the number of samples in the dataset classes.

2) Addressing the within-class imbalance: represents the imbalance within the minority class due to the existence of multiple disjuncts of minority class samples with varying densities.

3) Preserving the boundary region when generating synthetic samples: the boundary region represents the area that separates two or more classes. It is required to make sure that the synthetic samples do not distort the decision boundary and overlap with samples in other classes.

Along with the above constraints, being attentive to the curse of dimensionality and addressing it would lead to more optimal resampling. Based on these findings, researchers can develop more robust oversampling techniques in the future.

REFERENCES

[1] M. Vannucci and V. Colla. 2019. Imbalanced datasets resampling through self organizing maps and genetic algorithms. Springer International Publishing, vol. 1000. http://dx.doi.org/10.1007/978-3-030-20257-6_34
[2] S. Maheshwari, J. Agrawal, and S. Sharma. 2011. A New Approach for Classification of Highly Imbalanced Datasets using Evolutionary Algorithms. International Journal of Scientific & Engineering Research, vol. 2, no. 7. http://www.ijser.org
[3] T. Jo and N. Japkowicz. 2004. Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter 6, 1 (2004), 40–49.
[4] V. García, J. S. Sánchez, A. I. Marqués, R. Florencia, and G. Rivera. 2020. Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Systems with Applications 158 (December 2020). https://doi.org/10.1016/j.eswa.2019.113026
[5] V. López, A. Fernández, S. García, V. Palade, and F. Herrera. 2013. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences 250 (November 2013), 113–141. https://doi.org/10.1016/j.ins.2013.07.007
[6] T. M. Khoshgoftaar, M. Golawala, and J. V. Hulse. 2007. An empirical study of learning from imbalanced data using random forest. Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI 2 (2007), 310–317. https://doi.org/10.1109/ICTAI.2007.46
[7] U. Bhowan, M. Johnston, and M. Zhang. 2012. Developing New Fitness Functions in Genetic Programming for Classification With Unbalanced Data. 42, 2 (2012), 406–421.
[8] J. Gao, K. Liu, B. Wang, D. Wang, and Q. Hong. 2021. An improved deep forest for alleviating the data imbalance problem. Soft Computing 25, 3 (2021), 2085–2101. https://doi.org/10.1007/s00500-020-05279-8
[9] R. Mohammed, J. Rawashdeh, and M. Abdullah. 2020. Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results. 2020 11th International Conference on Information and Communication Systems, ICICS 2020 (May 2020), 243–248. https://doi.org/10.1109/ICICS49469.2020.239556
[10] D. Elreedy and A. F. Atiya. 2019. A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance. Information Sciences 505 (December 2019), 32–64. https://doi.org/10.1016/j.ins.2019.07.070
[11] A. Gosain and S. Sardana. 2017. Handling class imbalance problem using oversampling techniques: A review. IEEE Xplore, Sep. 01, 2017. https://ieeexplore.ieee.org/abstract/document/8125820 (accessed May 17, 2022).
[12] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (June 2002), 321–357. https://doi.org/10.1613/jair.953
[13] M. Schubach, M. Re, P. N. Robinson, and G. Valentini. 2017. Imbalance-aware machine learning for predicting rare and common disease-associated non-coding variants. Scientific Reports 7, 1 (2017), 1–12.
[14] G. E. A. P. A. Batista, A. L. C. Bazzan, and M. C. Monard. 2003. Balancing Training Data for Automated Annotation of Keywords: a Case Study. In Proceedings of the Second Brazilian Workshop on Bioinformatics (January 2003), 35–43. http://www.cs.waikato.ac.nz/
[15] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. 2004. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter 6, 1 (2004), 20–29. https://doi.org/10.1145/1007730.1007735
[16] G. Douzas and F. Bacao. 2017. Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE. (2017), 1–22. http://arxiv.org/abs/1709.07377
[17] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap. 2009. Safe-Level-SMOTE: Safe-level-synthetic minority oversampling technique for handling the class imbalanced problem. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 5476 LNAI (2009), 475–482. https://doi.org/10.1007/978-3-642-01307-2_43
[18] H. Han, W. Y. Wang, and B. H. Mao. 2005. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Systems and Computing, Vol. 683, 878–887. https://doi.org/10.1007/11538059_91
[19] H. He, Y. Bai, E. A. Garcia, and S. Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Proceedings of the International Joint Conference on Neural Networks (2008), 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969
[20] C. T. Lin, T. Y. Hsieh, Y. T. Liu, Y. Y. Lin, C. N. Fang, Y. K. Wang, G. Yen, N. R. Pal, and C. H. Chuang. 2018. Minority Oversampling in Kernel Adaptive Subspaces for Class Imbalanced Datasets. IEEE Transactions on Knowledge and Data Engineering 30, 5 (2018), 950–962. https://doi.org/10.1109/TKDE.2017.2779849
[21] S. A. Shahee and U. Ananthakumar. 2018. An adaptive oversampling technique for imbalanced datasets. Vol. 10933 LNAI. Springer International Publishing, 1–16. https://doi.org/10.1007/978-3-319-95786-9_1
[22] S. Barua, M. M. Islam, X. Yao, and K. Murase. 2014. MWMOTE - Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26, 2 (2014), 405–425. https://doi.org/10.1109/TKDE.2012.232
[23] D. A. Cieslak, N. V. Chawla, and A. Striegel. 2006. Combating imbalance in network intrusion datasets. 2006 IEEE International Conference on Granular Computing (2006), 732–737. https://doi.org/10.1109/grc.2006.1635905
[24] F. Last, G. Douzas, and F. Bacao. 2017. Oversampling for Imbalanced Learning Based on K-Means and SMOTE. (2017), 1–19. https://doi.org/10.1016/j.ins.2018.06.056


[25] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap. 2012. DBSMOTE: Density-based synthetic minority over-sampling technique. Applied Intelligence 36, 3 (2012), 664–684. https://doi.org/10.1007/s10489-011-0287-y
[26] L. Ma and S. Fan. 2017. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinformatics 18, 1 (2017), 1–18. https://doi.org/10.1186/s12859-017-1578-z
[27] I. Nekooeimehr and S. K. L. Yuen. 2016. Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Systems with Applications 46 (2016), 405–416. https://doi.org/10.1016/j.eswa.2015.10.031
[28] J. Wei, H. Huang, L. Yao, Y. Hu, Q. Fan, and D. Huang. 2020. IA-SUWO: An Improving Adaptive semi-unsupervised weighted oversampling for imbalanced classification problems. Knowledge-Based Systems 203 (June 2020), 106116. https://doi.org/10.1016/j.knosys.2020.106116
[29] S. A. Shahee and U. Ananthakumar. 2018. Probability Based Cluster Expansion Oversampling Technique for Imbalanced Data. (2018), 77–90. https://doi.org/10.5121/csit.2018.80607
[30] G. Douzas and F. Bacao. 2017. Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning. Expert Systems with Applications 82 (2017), 40–52. https://doi.org/10.1016/j.eswa.2017.03.073
[31] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, C. Beeri, and P. Buneman. 1999. When Is "Nearest Neighbor" Meaningful?, 217–235. http://www.springerlink.com/content/04p94cqnbge862kh/
[32] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. 2001. On the surprising behavior of distance metrics in high dimensional space. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 1973 (2001), 420–434. https://doi.org/10.1007/3-540-44503-x_27
[33] T. Kohonen. 1982. Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 1 (1982), 59–69. https://doi.org/10.1007/BF00337288
[34] N. B. Subramanian. "Why High Dimensional Data are a Curse?" https://aiaspirant.com/curse-of-dimensionality (accessed June 4, 2022).
[35] X. Zhang, W. Wang, X. Zheng, Y. Ma, Y. Wei, M. Li, and Y. Zhang. 2019. A Clutter Suppression Method Based on SOM-SMOTE Random Forest. In 2019 IEEE Radar Conference (RadarConf). IEEE, 1–4. https://doi.org/10.1109/RADAR.2019.8835836
