

International Journal on Advances in ICT for Emerging Regions 2023 16 (1)

A Review on Oversampling Techniques for Solving the Data Imbalance Problem in Classification

Tharinda Dilshan Piyadasa, Kasun Gunawardana

Abstract— The data imbalance problem is a widely explored area in the Machine Learning domain. With the rapid advancement of computing infrastructure and the incessant increase in the amount and variety of data generated, the data imbalance problem has prevailed and reshaped with the requirement for novel approaches to address it. Among the different approaches that exist to address the data imbalance problem, such as data-level and algorithmic-level, data-level approaches are more popular among the scientific community due to their classifier-independent nature. When investigating current trends in data-level approaches, it is evident that oversampling is a technique frequently explored due to its adaptability to scenarios where extreme data imbalance is present. This paper presents a review of different oversampling techniques with a comprehensive analysis of the strategies that have been used, along with possible areas that look promising to explore further to develop more advanced oversampling techniques.

Keywords— Data Imbalance Problem, Classification Analysis, Oversampling.

Correspondence: Tharinda Dilshan Piyadasa (E-mail: tharindad7@gmail.com). Received: 18-07-2022, Revised: 23-01-2023, Accepted: 26-02-2023. Tharinda Dilshan Piyadasa and Kasun Gunawardana are from University of Colombo School of Computing, Sri Lanka ([email protected], [email protected]). DOI: http://doi.org/10.4038/icter.v16i1.7260. © 2022 International Journal on Advances in ICT for Emerging Regions. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

I. INTRODUCTION

Data mining and knowledge discovery have become indispensable in the contemporary age of big data for making accurate decisions and predictions. Classification analysis is one of the most commonly employed data mining tasks for various market and engineering problems, such as bankruptcy prediction, network intrusion detection, fraud detection, and software fault detection, where classifiers are trained to discriminate between the different classes representing the problem [1]. When using traditional classifiers to carry out the said tasks, it can be observed that these classifiers perform well over evenly distributed data. This is due to the fact that traditional classifiers are designed to increase accuracy with no notion of the distribution of data [2]. However, in the real world, data collected for classification analysis are usually class (or data) imbalanced.

In the context of classification analysis, class imbalance refers to classification problems where the dataset contains at least one class with significantly fewer samples than other classes in the dataset. In a two-class classification problem, the class with the fewest samples is called the minority class, and the other class is called the majority class [3]. The class disproportion among these samples is identified using the Imbalance Ratio (IR), which can vary from dataset to dataset. This metric simply represents the ratio between the majority and minority class samples.

In many practical applications of classification analysis, the minority class represents the positive examples or the target class, where the adverse effect of false-negative predictions is much higher than that of false-positive predictions [1]. For example, when considering credit card fraud detection, there can be thousands of regular transactions for a single fraudulent transaction, making the target class the minority class in the dataset. Suppose a regular transaction is flagged as a fraudulent transaction (false positive) by a trained model. In that case, it can later be resolved using further examinations. On the other hand, if a fraudulent transaction is incorrectly classified as a regular transaction (false negative), which is the usual behavior of traditional classifiers on imbalanced datasets, the primary intention of the classifier is defeated. The justification behind this behavior is that, in extreme imbalance scenarios where positive examples are under-represented, they are often mistaken for noise or outliers, or allocated to the majority class, ignoring the importance of their characteristics and leading traditional learning models to favor the majority class heavily [4].

Another significant observation of class imbalance is that, regardless of the poor performance of standard classifiers on the minority class, the classifier would still make predictions with high accuracy depending on the imbalance ratio of the classes. For example, suppose the imbalance ratio of a binary class dataset is 9:1 (for nine samples in the majority class, there is only one minority sample). The classifier can acquire an accuracy of 90% by classifying all the samples into the majority class, which is a decent accuracy when considering a standard classifier [5]. In practical applications, the imbalance ratio can be much higher than the ratio depicted in the above example. It is also evident that accuracy is not a suitable metric to evaluate a standard classifier when datasets are imbalanced, as the importance of the minority class is ignored.

A. Addressing the Data Imbalance Problem

The approaches used to overcome the data imbalance problem can be categorized into three groups, as represented in Fig. 1: External approaches (data level), Internal approaches (algorithmic level), and Hybrid approaches.

The external approaches focus on balancing the dataset either by removing majority class samples through undersampling or adding minority class samples through oversampling. It is also possible to combine oversampling and undersampling to form hybrid sampling methods. The objective of external approaches is to reduce the imbalance ratio to achieve a favorable distribution among the classes.
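To make the Imbalance Ratio and the accuracy pitfall discussed above concrete, the following minimal sketch (not part of the original study; it assumes scikit-learn and the imbalanced-learn package are installed, and all values are illustrative) builds a roughly 9:1 binary dataset, shows that a majority-vote classifier reaches about 90% accuracy while recalling no minority samples, and then applies a simple external (data-level) resampler to reduce the imbalance ratio.

# Illustrative sketch only (assumes scikit-learn and imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from imblearn.over_sampling import RandomOverSampler

# Synthetic binary dataset with an imbalance ratio of roughly 9:1 (class 1 is the minority).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
counts = Counter(y)
print("class counts:", counts, "IR ~", counts[0] / counts[1])

# A classifier that always predicts the majority class scores about 90% accuracy
# while recalling none of the minority (positive) samples.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)
print("accuracy:", accuracy_score(y, pred), "minority recall:", recall_score(y, pred))

# External (data-level) remedy: resample to reduce the imbalance ratio before training.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print("class counts after oversampling:", Counter(y_res))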


Fig. 1 Approaches to address the data imbalance problem.



Even though most of the proposed external approaches resample the dataset until the number of samples in each class is equal, studies such as [6] demonstrate that it is not always required to maintain a 50:50 class distribution when resampling. However, there is no hard and fast rule to decide on a favorable imbalance ratio, as it can vary depending on the domain and the type of classifier used.

Internal approaches involve developing and improving the underlying classification algorithm without altering the dataset involved [7]. There are mainly two ways that internal approaches address the data imbalance problem. The first method is cost-sensitive learning, where the classifier is modified such that the misclassification of minority class samples is heavily penalized compared to the misclassification of majority class samples. The second and most popular internal approach is to incorporate ensemble-based classifiers, where multiple weak classifiers are combined to improve the performance of the overall classification algorithm. Apart from these methods, there have also been algorithmic classifier modifications proposed in past years to improve classifier performance on classifiers like Support Vector Machines (SVM), Extreme Learning Machines (ELM), and Neural Networks (NN). Moreover, internal approaches can also be combined with external approaches to derive hybrid approaches that incorporate both the advantages and disadvantages of internal and external approaches [1][8].

When comparing the approaches to address the data imbalance problem, it is apparent that researchers prefer external approaches over internal approaches mainly due to their classifier independence [5]. In external approaches, since only the dataset is modified, it gives the freedom to select any suitable classifier for the classification task. However, in the case of internal approaches, as the internal structure/algorithm of the classifier is modified to address the imprecise classification of minority class samples, the dataset is heavily dependent on the modified classifier. Nevertheless, it is impractical to use the same classification algorithm with every dataset in different contexts. Therefore, on the basis of generalizability, it is reasonable to presume that external approaches provide an added advantage over internal approaches.

B. Overview of External Approaches

As aforementioned, external approaches to solving the data imbalance problem are heavily favored in the field of research due to classifier independence. When considering oversampling and undersampling, both methods have their own advantages and disadvantages. The main drawback of oversampling is that it risks generating synthetic data that can lead to overfitting. This can be caused by generating synthetic samples that closely resemble original samples or by incorrectly positioning synthetic samples in the data space (overlapping other classes). In the case of undersampling, it risks excluding important information from the dataset, such as samples that are crucial when deciding the decision boundaries or samples that carry a higher weight in representing a particular class or feature. Apart from the exclusion of important information, undersampling can also suffer from data scarcity after resampling if the minority class contains extremely few samples.

Throughout the past years, many studies have been carried out to investigate methods and mechanisms to mitigate these drawbacks of external approaches. For example, the most intuitive technique to add or remove data to/from a given class is by performing random selections. These primitive techniques have evolved and improved over time to address their foundational drawbacks by combining more complex techniques and statistical and probabilistic methods.

When comparing oversampling and undersampling, even though oversampling can lead to overfitting, it is possible to detect it during the earlier stages of training using straightforward approaches such as using a good train-test split and observing the change in testing error compared to the training error. However, in the case of important information exclusion caused by undersampling, although the model might work well with the resampled dataset, a classifier trained with excluded samples can lead to many misclassifications with the introduction of new data samples. Mohammed et al. [9] validate this assertion, where several state-of-the-art classifiers are used to evaluate oversampled and undersampled datasets. The authors have concluded that, compared to undersampling, oversampling of datasets leads to a more accurate classification.
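The train-test-split check mentioned above can be sketched as follows (an illustrative example, not taken from [9]; it assumes scikit-learn and imbalanced-learn, and the classifier choice is arbitrary). Note that the resampler is applied only to the training split, so the held-out error remains an honest estimate.

# Illustrative sketch: detect overfitting after oversampling by comparing the
# training error with the error on an untouched test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

# Oversample the training portion only; the test split keeps the original distribution.
X_res, y_res = RandomOverSampler(random_state=1).fit_resample(X_train, y_train)

clf = DecisionTreeClassifier(random_state=1).fit(X_res, y_res)
train_acc = clf.score(X_res, y_res)
test_acc = clf.score(X_test, y_test)
# A large gap between training and testing accuracy signals overfitting.
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")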


There are numerous studies, including [10] and [11], that review oversampling techniques by conducting experiments and comparing the results to provide a comprehensive evaluation of the performance of different oversampling techniques in practical settings. However, the objective of these studies seems to be finding the better technique out of a set of available techniques based on a systematic analysis, and they provide limited information on the approaches and methodologies used in such techniques. As a result, a researcher may find it challenging to comprehend the underlying strategies of an oversampling technique, which, in turn, would lead to failure in addressing its limitations.

This paper provides a comprehensive review of some popular oversampling techniques used to address the data imbalance problem, highlighting their strategies and potential areas for further improvement. The aim of this study is to provide insights and guidance for researchers and practitioners in the field of machine learning who are interested in developing more robust oversampling techniques to address the data imbalance problem.

The rest of this paper is organized as follows. Section II provides a comprehensive review of existing oversampling techniques, highlighting their strategies when performing oversampling. Section III presents the key findings of the review, emphasizing the factors that need to be considered when formulating new oversampling techniques, followed by the conclusion in Section IV.

II. ANALYSIS OF OVERSAMPLING TECHNIQUES

When selecting studies for the review, a deliberate decision was made to include oversampling techniques that are widely recognized and used in the machine learning community. The rationale behind this choice was that these methods have been proven to be successful and efficient in previous studies, such as [10] and [11], and that they are readily accessible and available in popular machine learning libraries like scikit-learn. By incorporating these well-established oversampling techniques, this study aims to ensure that the review reflects the best practices and standards in the field.

Oversampling approaches generate synthetic minority class samples and combine them with the existing dataset, resulting in a new dataset that is more appropriate for training. The most intuitive form of oversampling is random oversampling, where minority class samples are randomly selected and duplicated without any specific selection standard.

Random oversampling can be effective for machine learning algorithms influenced by skewed distributions in instances where the overall size of the dataset is small and the imbalance is not that significant [9]. However, in cases where the dataset is heavily imbalanced, or the number of minority class samples is insufficient to train a decent classifier, random oversampling risks classifier overfitting during training due to repeated duplication of the minority samples. Despite its implementation simplicity and fast execution, which is ideal for large and complex datasets, the lack of generalizability and high likelihood of overfitting in random oversampling has led researchers to look for more robust (robustness is denoted as the ability to oversample without introducing any bias in this context) alternative oversampling techniques.

When exploring the literature on oversampling, the most widely used techniques in the scientific community are SMOTE [12] and its variants. SMOTE stands for Synthetic Minority Oversampling Technique, where the algorithm generates a synthetic sample along the line segment that joins a randomly selected minority class sample and one of its K nearest neighbors. In SMOTE, the value of K is a parameter that should be specified prior to its application, and minority class samples are randomly chosen from the set of K-nearest neighbors based on the amount of oversampling required. The operation of the SMOTE algorithm is further elaborated in Fig. 2, where (a) the majority class and minority class samples are represented in blue and green colors, respectively; (b) a minority class sample is randomly selected (black), and its K-nearest neighbors (3 in the image) are selected; and (c) a new synthetic sample (red) is generated on the line that joins the randomly selected minority class sample and its nearest neighbor.

Fig. 2 Graphical representation of the SMOTE algorithm [13]

As the synthetic samples generated by SMOTE are not duplicates of already existing samples, they are more generalizable than samples generated through random oversampling, reducing the risk of overfitting. However, due to the random selection of minority class samples with a uniform probability for oversampling, densely populated minority class areas become more condensed while sparsely populated minority class areas remain sparse. This behavior of SMOTE manages to address the between-class imbalance (imbalance between multiple classes), while the within-class imbalance (multiple dense or sparse regions of the same class) is ignored. Another drawback of the SMOTE algorithm is the generation of noisy samples. If a new synthetic sample is generated between an existing noisy sample and its nearest neighbor, there is a high probability that the newly generated sample will also be noisy. This is because the SMOTE algorithm has no notion of overlapping class regions when generating synthetic samples [1][4]. Throughout the years, the SMOTE algorithm has been modified to address its drawbacks and limitations.
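As a concrete illustration of the interpolation step described above, the following minimal sketch applies the SMOTE implementation from the imbalanced-learn library (an assumption of this example; the reviewed paper does not prescribe a particular implementation), with K exposed as the k_neighbors parameter. The dataset and parameter values are arbitrary.

# Minimal SMOTE sketch using imbalanced-learn (illustrative values only).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

# Each synthetic point lies on the segment between a randomly chosen minority
# sample and one of its k_neighbors nearest minority neighbors.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("after: ", Counter(y_res))

The variants reviewed below (for example, Borderline-SMOTE and the SMOTE+Tomek and SMOTE+ENN hybrids) expose the same fit_resample interface in that library, which makes it straightforward to swap strategies during experimentation.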


In [14], Batista et al. apply SMOTE to oversample the minority class, followed by applying Tomek Links to increase the class separation near the decision boundary. The authors state that the class clusters are sometimes ill-defined during oversampling, as the minority class samples may enter the majority class area, and that interpolating minority class instances can enlarge the minority cluster, introducing noisy minority samples deep in the majority area, which is harmful and can lead to overfitting. Tomek Links act as a data cleaning mechanism in this technique, where overlapping majority and minority class samples are removed to form well-defined class clusters. This can be considered a hybrid technique, as it applies both oversampling and undersampling to the dataset.

The research work presented in [15] is another hybrid technique where extensive data cleaning based on misclassifications is applied through ENN (Edited Nearest Neighbor) on an oversampled dataset. ENN is similar to Tomek Links, but it is more aggressive as it removes any sample (majority or minority) from the training set that its three nearest neighbors misclassify, creating more distinguishable class spaces with clear separation along the decision boundary. The study also states that oversampling strategies lead to more accurate classifiers than strategies derived through undersampling.

Geometric SMOTE [16] is another extension of SMOTE that generates synthetic samples near selected minority class samples in a geometric region instead of using linear interpolation. While this selected region is a hyper-sphere in its default configuration, G-SMOTE deforms it to a hyper-spheroid and eventually to a line segment, simulating the SMOTE process in the last instance. Geometric SMOTE addresses two main issues in SMOTE: the generation of noisy samples and the generation of samples that belong to the same sub-cluster. These issues are addressed by identifying safe areas to synthesize new samples and varying the number of minority samples generated. The authors claim that the ability of G-SMOTE to produce a variety of synthetic minority data in safe regions of the input space while aggressively boosting their diversity is the rationale for its performance gain.

Safe-Level-SMOTE [17] follows a similar approach to SMOTE but considers the nearby majority class samples when generating synthetic minority class samples. Safe levels are computed using nearest-neighbor minority samples, and synthetic samples are generated such that they lie closer to minority class samples (the safe area). The study tries to address the overgeneralization problem encountered by SMOTE due to arbitrary generalization of the minority class territory neglecting the majority class, which can lead to an increased likelihood of class mixing in the case of highly skewed class distributions.

Borderline-SMOTE [18] is another variation of SMOTE that generates synthetic minority class samples only within the decision boundary that separates the classes. In contrast to SMOTE, Borderline-SMOTE identifies minority class samples that lie within the vicinity of the majority class samples and prevents the generation of noisy synthetic samples based on those. The authors declare that most classification algorithms strive to learn the boundaries of each class as precisely as possible during the training process to obtain a better prediction, making the samples far from the borderline less significant compared to the samples that lie within the vicinity of the class borders. Furthermore, the study presents two versions of Borderline-SMOTE: Borderline-SMOTE1, which generates new synthetic samples between borderline minority samples and their K-nearest minority neighbors, and Borderline-SMOTE2, which generates new synthetic samples between borderline minority samples and their K-nearest minority as well as K-nearest majority neighbors.

ADASYN [19] is a density-based oversampling technique where the density of minority samples in a neighbourhood is considered when generating new synthetic minority class samples. The main intuition of ADASYN is to utilize a density distribution as a criterion to determine the number of new synthetic samples that should be generated for each minority sample. The density distribution considers the learning difficulty of each of the minority class samples and generates more synthetic samples around samples that are more difficult to learn than around those that are simpler to learn. Even though ADASYN is capable of enhancing hard-to-learn minority sample areas, it is sensitive to outliers because of the possibility of misinterpreting noisy samples, which usually occur in low densities, as harder-to-learn samples, associating them with higher weights. A summary of SMOTE and the variants elaborated above is presented in Table I.

When examining the process in which the aforementioned methods have approached the problem, it is evident that they are focused on balancing the number of samples in the dataset classes. The imbalance between the dataset classes that splits them into majority and minority classes is called the between-class imbalance. By default, all the resampling techniques are designed to address the between-class imbalance through oversampling, undersampling, or hybrid sampling. However, when comparing with vanilla SMOTE, it can be observed that most of the above techniques attempt to refine the output of the SMOTE algorithm by regulating the areas of sample generation and eliminating noisy synthetic samples to preserve the decision boundary that separates the classes.

The samples near the decision boundary undoubtedly represent the most crucial samples for any classification task. Despite the importance of the decision boundary, as depicted in Fig. 3 (B), the samples generated near the boundary through oversampling often tend to distort the class separation, generating noisy samples that overlap with the majority class samples. The generation of noisy samples near the decision boundary is caused by the use of the same sample generation strategy throughout the data space, which is not designed to preserve the decision boundary. The oversampling techniques mentioned above address this issue and emphasize preserving the decision boundary when generating new synthetic minority class samples.

Fig. 3 (A) Occurrence of multiple disjuncts of minority class samples with varying densities. (B) Noisy minority class samples distort the decision boundary by overlapping with majority class samples.

Moreover, when considering real-life datasets, there can also be instances where multiple dense or sparse clusters of minority class samples are present within the data distribution, as illustrated in Fig. 3 (A).

The existence of multiple disjuncts of minority class samples is referred to as the within-class imbalance, and it can lead to an extreme lack of representation of crucial minority class features. Oversampling techniques that randomly select minority samples to generate new synthetic samples, such as SMOTE, fail to resolve the within-class imbalance, resulting in a skewed minority class distribution [20]. Therefore, it is important to address both between-class and within-class imbalances when addressing the data imbalance. The simultaneous removal of both these imbalances minimizes the classifier bias toward bigger sub-clusters by decreasing the influence of the bigger sub-cluster error on the total error [21].
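Several of the variants summarized in Table I have widely used open-source implementations. The sketch below is an illustration added for this review rather than material from the cited papers; it assumes the imbalanced-learn package and shows how the SMOTE+Tomek Links [14], SMOTE+ENN [15], Borderline-SMOTE [18], and ADASYN [19] strategies can be compared under a common interface.

# Illustrative comparison of SMOTE variants available in imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTETomek, SMOTEENN

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=7)

samplers = {
    "SMOTE + Tomek Links [14]": SMOTETomek(random_state=7),
    "SMOTE + ENN [15]": SMOTEENN(random_state=7),
    "Borderline-SMOTE1 [18]": BorderlineSMOTE(kind="borderline-1", random_state=7),
    "ADASYN [19]": ADASYN(random_state=7),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    # Hybrid cleaning (Tomek/ENN) and ADASYN's per-sample counts mean the
    # resulting classes are not necessarily exactly equal in size.
    print(f"{name}: {Counter(y_res)}")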


TABLE I
SUMMARY OF SMOTE AND ITS VARIANTS

Reference / Approach / Summary

[12] Oversampling: combines random sampling with the K nearest neighbor algorithm.
- A minority class sample is selected at random.
- A synthetic sample is generated on the line that joins the random minority sample and its nearest neighbor.

[14] Hybrid Sampling: refines the output of SMOTE by applying Tomek Links.
- Applies SMOTE followed by Tomek Links.
- Tomek Links remove overlapping majority and minority samples.
- Increases the class separation near the decision boundary.

[15] Hybrid Sampling: refines the output of SMOTE by applying Edited Nearest Neighbor (ENN).
- Extensive data cleaning based on misclassification.
- Removes any sample that its 3 nearest neighbors misclassify.
- ENN is more aggressive than Tomek Links.

[16] Oversampling: modifies the sample generation strategy of SMOTE by identifying safe regions.
- Generates samples in a geometric region instead of using linear interpolation.
- Prevents the generation of noisy samples by identifying safe areas and varying the number of samples generated.

[17] Oversampling: modifies the sample generation strategy of SMOTE by computing safe levels.
- Considers nearby majority class samples when generating new synthetic minority class samples.
- Safe levels are computed using nearest neighbor minority samples.

[18] Oversampling: modifies SMOTE to generate new samples only within the decision boundary.
- Generates minority class samples only within the decision boundary.
- Ignores minority class samples that lie within the majority class samples during synthesis.

[19] Oversampling: uses the density around minority class samples to determine the number of synthetic samples to be generated.
- The density of minority class samples in a neighborhood is considered when generating new samples.
- Heavily sensitive to outliers.

Further looking into oversampling techniques reveals another set of studies that use a different strategy to deal with the data imbalance problem.

MWMOTE [22] is a popular SMOTE-based oversampling technique. It first locates hard-to-learn minority class samples (samples near the decision boundary) using the majority class samples near the decision boundary and uses the Euclidean distance from these nearest majority class samples to assign them weights. This weighting mechanism ensures that higher weights are assigned to samples closer to the decision boundary than to others. The authors highlight the fact that the presence of within-class imbalance and small disjuncts of the minority class can lead to performance degradation in classifiers; therefore, similar to the weighting of hard-to-learn minority samples near the decision boundary, the samples of smaller clusters are given greater weights to reduce the within-class imbalance. Finally, a modified hierarchical clustering approach is used to create synthetic samples from the weighted minority class samples, making sure the generated samples reside within the minority class region to avoid noisy sample generation.

Cluster SMOTE [23] uses K-means to cluster the minority class and applies SMOTE within the identified clusters. This approach makes sure that the generated synthetic samples always lie inside naturally occurring clusters of the minority class samples. The study claims that the existence of a small number of minority class samples makes it challenging to form decent class borders, and that addressing this limitation through accurate class region and border definition would enable trivial classification. Since these class regions are unknown and impossible to infer from the given data, K-means is used to approximate the minority region, followed by applying SMOTE to each identified cluster. This study is explicitly designed to address the imbalance in network intrusion datasets and only uses two intrusion datasets for evaluation.

[24] presents a clustering-based oversampling technique designed to address the within- and between-class imbalances while avoiding the generation of noisy synthetic samples. Initially, the algorithm clusters the input space using K-means clustering and filters the clusters with a higher number of minority samples for oversampling. The number of synthetic samples to be generated is then dispersed, with more samples being assigned to clusters with a low density of minority samples. Finally, SMOTE is used to obtain the required ratio of minority and majority samples in each of the filtered clusters. The authors rationalize cluster-based oversampling as one of the strategies that aim to minimize the within-class imbalance while also reducing the between-class imbalance, facilitating the oversampling technique to identify the most effective areas of the input space to generate synthetic samples.
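The clustering-then-oversampling idea of [24] is available, for example, as KMeansSMOTE in the imbalanced-learn package. The following sketch is only an illustration of its typical use; the parameter values are not taken from the paper and the cluster-selection threshold usually needs tuning for a given dataset.

# Illustrative sketch of K-means-based SMOTE (the strategy of [24]) via imbalanced-learn.
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], n_informative=4, random_state=3)

# Clusters with a sufficient share of minority samples are selected, and more
# synthetic points are generated in the sparser of those clusters.
sampler = KMeansSMOTE(
    kmeans_estimator=KMeans(n_clusters=20, random_state=3),
    cluster_balance_threshold=0.05,  # minimum minority share for a cluster to be used; dataset-dependent
    random_state=3,
)
X_res, y_res = sampler.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))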


DBSMOTE [25] is another density-based oversampling technique, which uses the DBSCAN algorithm to partition the minority class samples. SMOTE is used to generate synthetic samples along the shortest path that joins minority class samples with a pseudo-centroid of a minority cluster, avoiding the generation of outliers or noisy samples. As a result, synthetic samples are generated in such a way that they are dense around the centroid and sparse further away from the centroid. The authors claim that a real-world dataset with proximate data clusters can be described by a normal distribution, dense at the centroid and sparse towards the boundary, and that a classifier can correctly identify samples near the centroid as it identifies the area around the centroid as a class. Based on the above observations, DBSMOTE is designed to oversample the minority class area around the centroid, which would otherwise be too sparse to be recognized by a classifier.

CURE-SMOTE [26] works by clustering the minority class samples using the CURE hierarchical clustering algorithm, followed by noise and outlier removal. It then randomly generates synthetic minority class samples along the line segments that join representative points and the center point. In CURE hierarchical clustering, each sample is initially assumed to represent a cluster, and local clustering is used to combine these samples to form the clusters present in the input space. The study justifies CURE hierarchical clustering by stating that it is more efficient for large datasets with varying shapes of data distributions than K-means clustering, which is only suitable for spherically distributed datasets. Further, it is stated that the combination of clustering and merging operations tends to eliminate noise with reduced complexity, as it eliminates the need to remove the furthest created synthetic samples (noisy samples) after applying SMOTE.

A-SUWO (Adaptive Semi-Unsupervised Weighted Oversampling) [27] and its improved version, IA-SUWO [28], cluster the minority class samples using a semi-unsupervised hierarchical clustering approach and use the classification complexity and cross-validation of each sub-cluster to decide the optimal size to oversample. Both A-SUWO and IA-SUWO aim to generate synthetic samples near minority class instances that lie close to the decision boundary with lower densities.

[29] presents a probability-based cluster expansion oversampling technique that uses a model-based clustering mechanism (MCLUST) to identify sub-clusters present in the dataset. The method also uses K-Nearest Neighbor based noise removal prior to clustering to reduce the oversampling of noisy samples, and equal posterior probability after clustering to identify the boundary of the identified sub-clusters. Finally, synthetic minority class samples are generated in the enclosed region of the class-separating boundary. As suggested by the authors, the main goal of this technique is to assign equal weight to all sub-clusters of the minority class that would otherwise be overlooked due to the skewness of the distribution. The cluster/density-based oversampling techniques elaborated above are summarized in Table II.

In order to address the within-class imbalance, it is necessary to identify different regions within the data space where oversampling is effective. The above studies show that clustering and density-based techniques are popular approaches that researchers use to identify such areas. After the identification of significant areas to oversample, it is possible to use traditional oversampling techniques to generate synthetic samples. The clustering-based oversampling techniques introduced above emphasize the importance of addressing the within-class imbalance when formulating oversampling techniques.

A. Oversampling High-Dimensional Data

Further inspecting the aforementioned oversampling techniques that address the data imbalance, it is evident that most of the techniques are based on clustering algorithms such as K-means, DBSCAN, and hierarchical clustering, combined with heuristics based on Euclidean distance. Therefore, the majority of these approaches rely on heuristic methods that apply in two-dimensional (Euclidean) space when generating synthetic data, whereas practical scenarios often consist of high-dimensional data [30]. Additionally, when the number of features in the dataset (the dimensionality of the data) increases, the data points become sparser or farther apart (Fig. 4), making the nearest neighbor problem ill-defined [31]. This behavior is called the "curse of dimensionality" [32]. As a result, in higher dimensional space, the use of heuristics based on Euclidean distance becomes ineffective, and the assumption of well-defined clusters fails, generating noisy synthetic samples.

A common strategy that can be adopted when formulating oversampling techniques that use clustering mechanisms and heuristics based on Euclidean distance is to reduce the dimensionality of the original input space. Principal Component Analysis (PCA), Multidimensional Scaling (MDS), and Self-Organizing Maps are some common dimensionality reduction techniques practitioners use. In recent years, Self-Organizing Map [33] based resampling techniques have been extensively explored in the community.

Self-Organizing Map based Oversampling (SOMO) [30] generates a clustered two-dimensional representation of the input space by applying the SOM algorithm. Clusters are filtered to perform oversampling by calculating the density of minority class samples in each cluster. SMOTE is applied to generate synthetic minority class samples within the filtered clusters and between neighboring clusters, addressing both within- and between-class imbalances. The authors have identified and addressed a few inefficiencies of existing oversampling techniques, namely the generation of noisy instances that infiltrate the majority region, the generation of duplicate samples, and the use of heuristics based on the assumption that the input space has a simple manifold structure. SOMO is capable of generating more effective synthetic samples by investigating the manifold structure of the input space, exploiting the topology-preserving property of Self-Organizing Maps.
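One simple way to realize the dimensionality-reduction strategy mentioned above is to project the data before oversampling. The sketch below is an illustration added for this review (it assumes scikit-learn and imbalanced-learn; PCA stands in for any of the reduction techniques named above), using imbalanced-learn's sampler-aware pipeline so the resampling is applied only to training folds during cross-validation.

# Illustrative sketch: reduce dimensionality before oversampling so that the
# Euclidean-distance heuristics inside SMOTE operate in a lower-dimensional space.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # sampler-aware pipeline

X, y = make_classification(n_samples=2000, n_features=200, n_informative=10,
                           weights=[0.9, 0.1], random_state=5)

pipe = Pipeline(steps=[
    ("reduce", PCA(n_components=10)),       # project to a compact representation
    ("oversample", SMOTE(random_state=5)),  # resampling applied only to training folds
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())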


[1] proposes an imbalanced dataset resampling technique by combining Self-Organizing Maps and Genetic Algorithms. The technique uses two Self-Organizing Maps to perform oversampling on the minority class and undersampling on the majority class. The clusters derived from the Self-Organizing Maps identify the regions where majority and minority class samples are dense. The filtered clusters are then utilized to derive a set of rankings among majority and synthetic minority samples, which evaluate the positive impact of their removal or inclusion in the training data, respectively. The ideal rates of exclusion and inclusion for each accepted criterion are obtained using a Genetic Algorithm that considers, via its fitness function, the performance of a random classifier for a given training dataset in the context of imbalanced classification. The authors claim that the capability of Self-Organizing Maps to preserve the distribution and topology of the input data leads to the conservation of the natural spatial relationship among samples at the cluster level, and that the optimization capabilities of Genetic Algorithms result in the maximization of classifier performance, improving the overall resampling operation.

TABLE II
SUMMARY OF CLUSTER/DENSITY BASED OVERSAMPLING TECHNIQUES

Reference / Approach / Summary

[22] Oversampling: combines hierarchical clustering with SMOTE to address both within- and between-class imbalances.
- Identifies hard-to-learn minority class samples and assigns them weights based on the nearest majority class samples.
- Makes sure the generated samples fall into some minority class cluster.

[23] Oversampling: uses K-means to approximate the minority region, followed by applying SMOTE to each identified cluster.
- Uses K-means to cluster the minority class and applies SMOTE within the identified clusters.
- Makes sure generated samples lie inside naturally occurring clusters.

[24] Oversampling: uses K-means to identify and filter clusters with high minority class density, followed by applying SMOTE.
- Uses K-means to identify clusters and assigns weights based on the minority class density in each cluster.
- Generates more samples in clusters with low minority class densities.

[25] Oversampling: combines the DBSCAN algorithm with SMOTE to generate synthetic samples such that the dataset is dense at the centroid and sparse towards the boundary.
- Uses the DBSCAN algorithm to partition the minority class samples.
- Generates synthetic samples along the lines that join minority class samples with a pseudo-centroid of a minority cluster.

[26] Oversampling: combines CURE hierarchical clustering with noise and outlier removal so that the samples generated after using SMOTE are more precise.
- Uses the CURE hierarchical clustering algorithm followed by noise and outlier removal.
- Addresses datasets that have clusters of varying shapes and sizes.

[27][28] Oversampling: uses a semi-unsupervised hierarchical clustering algorithm to generate synthetic samples around minority class instances that lie close to the decision boundary with lower densities.
- Clusters the minority class using a hierarchical clustering approach.
- The oversample size is decided from the classification complexity and cross-validation of each sub-cluster.

[29] Oversampling: combines a model-based clustering mechanism with KNN-based noise removal.
- Uses MCLUST to identify sub-clusters present in the dataset.
- The goal is to assign equal weights to all minority class sub-clusters that would otherwise be overlooked due to the skewness of the distribution.

Fig. 4 Data points become sparser as dimensionality increases [34].
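To give a flavour of the SOM-based strategy used by SOMO [30] and the works discussed here, the following rough sketch clusters the minority class on a small SOM grid and interpolates new samples only between members of the same map node. It assumes the third-party minisom package, uses placeholder data, and is a simplification rather than a reimplementation of any of the cited methods.

# Rough sketch of SOM-based oversampling: cluster minority samples with a small
# Self-Organizing Map, then interpolate only within each map node (assumes minisom).
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
X_min = rng.normal(size=(60, 8))  # placeholder minority-class samples (8 features)

som = MiniSom(3, 3, X_min.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X_min, 500)

# Group minority samples by their best-matching SOM node.
nodes = {}
for x in X_min:
    nodes.setdefault(som.winner(x), []).append(x)

synthetic = []
for members in nodes.values():
    members = np.array(members)
    if len(members) < 2:
        continue  # a lone sample gives nothing to interpolate with
    for _ in range(len(members)):  # per-node oversampling amount (arbitrary here)
        a, b = members[rng.choice(len(members), size=2, replace=False)]
        synthetic.append(a + rng.random() * (b - a))  # SMOTE-style interpolation

X_new = np.vstack([X_min, np.array(synthetic)])
print(X_new.shape)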


[35] uses a customized SOM-SMOTE algorithm to address the imbalance in clutter data when addressing clutter suppression in search radars. The authors have identified two limitations in the SMOTE algorithm, namely random sample selection and ignoring the data distribution during interpolation, resulting in samples that are not representative enough. The study addresses these limitations using a combined Self-Organizing Map and SMOTE algorithm that clusters the minority class samples into several subsets using a Self-Organizing Map and interpolates synthetic samples around the cluster centers using SMOTE. The authors also highlight the ability of Self-Organizing Maps to preserve the topology of higher dimensional clutter data, resulting in synthetic samples with distribution characteristics similar to the original data.

From the above studies, it can be assumed that the ability of Self-Organizing Maps to address the within-class imbalance as a clustering algorithm, along with their ability to reduce the dimensionality of data while preserving the topology of the input space, are the main reasons for their widespread popularity as an excellent candidate to address the data imbalance problem.

III. DISCUSSION

When analyzing oversampling techniques that address the class imbalance problem, it is possible to identify key factors that contribute to the success of an oversampling technique. Throughout the literature, it can be observed that every oversampling technique attempts to adopt one or more of these factors during its strategy formulation.

Considering the variations of the SMOTE algorithm that have been introduced as improved versions of vanilla SMOTE [12], it is evident that most of the proposed techniques, such as [14], [15], Safe-Level-SMOTE [17], and Borderline-SMOTE [18], try to preserve the boundary region that separates the minority and majority classes. Compared to vanilla SMOTE, the higher classification accuracies of these techniques demonstrate the importance of preserving the boundary region when formulating an oversampling technique.

Further exploring contemporary oversampling techniques, it can also be observed that clustering-based approaches are more popular among researchers. This is because, apart from preserving the boundary region, it is also essential to address the within-class imbalance in the dataset (all the resampling techniques address the between-class imbalance by default). Clustering algorithms do not necessarily address the within-class imbalance unless they are explicitly designed to address it.

Based on the above observations, it is possible to identify three constraints that need to be simultaneously satisfied when formulating an oversampling technique to generate an optimal resampled dataset.

1) Addressing the between-class imbalance: The between-class imbalance represents the typical imbalance scenario where there is a significant difference between the number of samples in the dataset classes. All the resampling techniques attempt to address the between-class imbalance by oversampling, undersampling, or hybrid sampling. There is no optimal imbalance ratio that needs to be reached when resampling an imbalanced dataset. However, [6] states that a 35:65 class distribution can achieve a higher classification performance compared to a 50:50 class distribution when the classes are heavily imbalanced. This is an area that is still being investigated.

2) Addressing the within-class imbalance: Within-class imbalance denotes the imbalance within the minority class due to the existence of multiple disjuncts of minority class samples with varying densities (Fig. 3A), which can lead to an extreme representation deficiency of essential characteristics of the minority class. The majority of the oversampling techniques that randomly select minority class samples to generate new synthetic samples, such as SMOTE, fail to address the within-class imbalance, leaving the minority class distribution skewed. When analyzing oversampling techniques capable of addressing the within-class imbalance, it can be observed that they are based on clustering approaches. The use of clustering approaches is an obvious design choice, as they provide the capability to analyze the spatial location of the minority class to determine the suitable areas to generate new synthetic samples.

3) Preserving the boundary region when generating synthetic samples: The boundary region represents the area that separates two or more classes. As mentioned previously, the most crucial samples in any classification task are the samples that reside near the boundary region. When considering oversampling techniques, the synthetic samples generated near the decision boundary often distort the class separation, generating noisy samples that overlap with the majority class samples. This behavior is caused by the use of the same sample generation strategy throughout the data space, which is not designed to preserve the decision boundary. However, studies such as [14], [15], [16], [17], and [18] emphasize the importance of preserving the boundary region when generating new synthetic samples. Decision boundary preservation can be achieved either by using a separate sample generation strategy near the boundary region or by refining the synthetically generated samples to remove noisy samples generated near the decision boundary.

Even though there are oversampling techniques that address different combinations of the above three constraints, almost none of the proposed oversampling techniques address all three constraints together.

Aside from addressing the constraints mentioned above when formulating an oversampling approach, it is also preferable to pay special attention to the curse of dimensionality. As elaborated in the previous section, many of the currently available oversampling approaches are unable to handle the curse of dimensionality, resulting in poor performance on high-dimensional datasets. We believe addressing the above constraints along with a proper clustering algorithm or a dimensionality reduction technique is a promising research avenue to investigate further.
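Relating to constraint 1), most resampler implementations allow the target distribution to be something other than 50:50. For instance, in the imbalanced-learn API (used here purely as an illustration, not as the method of [6]), the sampling_strategy parameter sets the desired minority-to-majority ratio, so a 35:65 distribution like the one reported in [6] can be requested directly.

# Illustrative sketch: resampling to a 35:65 rather than a 50:50 class distribution.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=11)

# sampling_strategy (binary case) = desired ratio of minority to majority samples.
smote_35_65 = SMOTE(sampling_strategy=35 / 65, random_state=11)
X_res, y_res = smote_35_65.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))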


IV. CONCLUSION

The data imbalance problem is one of the most well-defined problems in the Machine Learning domain and has been addressed throughout the past decades. With the emergence of Big Data, traditional techniques to address the data imbalance problem have been challenged, and the necessity for new and improved techniques to address the imbalance has created promising research avenues in many practical domains.

This paper reviews numerous research works that have attempted to address the data imbalance problem by oversampling the minority class samples. We identify several subsets of oversampling techniques and highlight the different approaches adopted by them to discover suitable samples/areas to oversample and the strategies used to generate new synthetic samples. Based on these studies, it is evident that some oversampling techniques focus on preserving the decision boundary by refining the oversampled output or by restricting sample generation in certain areas. It is also possible to identify studies that use clustering and density-based techniques to prevent the generation of noisy samples and alleviate the occurrence of disjuncts of minority class samples with varying densities. Furthermore, the review also presents the challenges faced by traditional oversampling techniques on high-dimensional data and suggests different techniques that can be utilized to address them.

By analyzing the various strategies adopted in the scientific community for oversampling, we have identified three key constraints that need to be satisfied when developing state-of-the-art oversampling techniques:

1) Addressing the between-class imbalance: represents the typical imbalance scenario where there is a significant difference between the number of samples in the dataset classes.

2) Addressing the within-class imbalance: represents the imbalance within the minority class due to the existence of multiple disjuncts of minority class samples with varying densities.

3) Preserving the boundary region when generating synthetic samples: the boundary region represents the area that separates two or more classes. It is required to make sure that the synthetic samples do not distort the decision boundary and overlap with samples in other classes.

Along with the above constraints, being attentive to the curse of dimensionality and addressing it would lead to more optimal resampling. Based on these findings, researchers can develop more robust oversampling techniques in the future.

REFERENCES

[1] M. Vannucci and V. Colla. 2019. Imbalanced datasets resampling through self organizing maps and genetic algorithms. Springer International Publishing, vol. 1000. http://dx.doi.org/10.1007/978-3-030-20257-6_34
[2] S. Maheshwari, J. Agrawal, and S. Sharma. 2011. A New Approach for Classification of Highly Imbalanced Datasets using Evolutionary Algorithms. International Journal of Scientific & Engineering Research, vol. 2, no. 7. http://www.ijser.org
[3] T. Jo and N. Japkowicz. 2004. Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter 6, 1 (2004), 40–49.
[4] V. García, J. S. Sánchez, A. I. Marqués, R. Florencia, and G. Rivera. 2020. Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Systems with Applications 158 (December 2020). https://doi.org/10.1016/j.eswa.2019.113026
[5] V. López, A. Fernández, S. García, V. Palade, and F. Herrera. 2013. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences 250 (November 2013), 113–141. https://doi.org/10.1016/j.ins.2013.07.007
[6] T. M. Khoshgoftaar, M. Golawala, and J. V. Hulse. 2007. An empirical study of learning from imbalanced data using random forest. Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI 2 (2007), 310–317. https://doi.org/10.1109/ICTAI.2007.46
[7] U. Bhowan, M. Johnston, and M. Zhang. 2012. Developing New Fitness Functions in Genetic Programming for Classification With Unbalanced Data. 42, 2 (2012), 406–421.
[8] J. Gao, K. Liu, B. Wang, D. Wang, and Q. Hong. 2021. An improved deep forest for alleviating the data imbalance problem. Soft Computing 25, 3 (2021), 2085–2101. https://doi.org/10.1007/s00500-020-05279-8
[9] R. Mohammed, J. Rawashdeh, and M. Abdullah. 2020. Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results. 2020 11th International Conference on Information and Communication Systems, ICICS 2020 (May 2020), 243–248. https://doi.org/10.1109/ICICS49469.2020.239556
[10] D. Elreedy and A. F. Atiya. 2019. A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance. Information Sciences 505 (December 2019), 32–64. https://doi.org/10.1016/j.ins.2019.07.070
[11] A. Gosain and S. Sardana. 2017. Handling class imbalance problem using oversampling techniques: A review. IEEE Xplore, Sep. 01, 2017. https://ieeexplore.ieee.org/abstract/document/8125820 (accessed May 17, 2022).
[12] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (June 2002), 321–357. https://doi.org/10.1613/jair.953
[13] M. Schubach, M. Re, P. N. Robinson, and G. Valentini. 2017. Imbalance-aware machine learning for predicting rare and common disease-associated non-coding variants. Scientific Reports 7, 1 (2017), 1–12.
[14] G. E. A. P. A. Batista, A. L. C. Bazzan, and M. C. Monard. 2003. Balancing Training Data for Automated Annotation of Keywords: a Case Study. In Proceedings of the Second Brazilian Workshop on Bioinformatics (January 2003), 35–43. http://www.cs.waikato.ac.nz/
[15] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. 2004. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter 6, 1 (2004), 20–29. https://doi.org/10.1145/1007730.1007735
[16] G. Douzas and F. Bacao. 2017. Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE. (2017), 1–22. http://arxiv.org/abs/1709.07377
[17] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap. 2009. Safe-Level-SMOTE: Safe-level-synthetic minority oversampling technique for handling the class imbalanced problem. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 5476 LNAI (2009), 475–482. https://doi.org/10.1007/978-3-642-01307-2_43
[18] H. Han, W. Y. Wang, and B. H. Mao. 2005. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Systems and Computing, Vol. 683, 878–887. https://doi.org/10.1007/11538059_91
[19] H. He, Y. Bai, E. A. Garcia, and S. Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Proceedings of the International Joint Conference on Neural Networks (2008), 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969
[20] C. T. Lin, T. Y. Hsieh, Y. T. Liu, Y. Y. Lin, C. N. Fang, Y. K. Wang, G. Yen, N. R. Pal, and C. H. Chuang. 2018. Minority Oversampling in Kernel Adaptive Subspaces for Class Imbalanced Datasets. IEEE Transactions on Knowledge and Data Engineering 30, 5 (2018), 950–962. https://doi.org/10.1109/TKDE.2017.2779849
[21] S. A. Shahee and U. Ananthakumar. 2018. An adaptive oversampling technique for imbalanced datasets. Vol. 10933 LNAI. Springer International Publishing, 1–16. https://doi.org/10.1007/978-3-319-95786-9_1
[22] S. Barua, M. M. Islam, X. Yao, and K. Murase. 2014. MWMOTE - Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26, 2 (2014), 405–425. https://doi.org/10.1109/TKDE.2012.232
[23] D. A. Cieslak, N. V. Chawla, and A. Striegel. 2006. Combating imbalance in network intrusion datasets. 2006 IEEE International Conference on Granular Computing (2006), 732–737. https://doi.org/10.1109/grc.2006.1635905
[24] F. Last, G. Douzas, and F. Bacao. 2017. Oversampling for Imbalanced Learning Based on K-Means and SMOTE. (2017), 1–19. https://doi.org/10.1016/j.ins.2018.06.056


[25] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap. 2012. DBSMOTE: Density-based synthetic minority over-sampling technique. Applied Intelligence 36, 3 (2012), 664–684. https://doi.org/10.1007/s10489-011-0287-y
[26] L. Ma and S. Fan. 2017. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinformatics 18, 1 (2017), 1–18. https://doi.org/10.1186/s12859-017-1578-z
[27] I. Nekooeimehr and S. K. L. Yuen. 2016. Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Systems with Applications 46 (2016), 405–416. https://doi.org/10.1016/j.eswa.2015.10.031
[28] J. Wei, H. Huang, L. Yao, Y. Hu, Q. Fan, and D. Huang. 2020. IA-SUWO: An Improving Adaptive semi-unsupervised weighted oversampling for imbalanced classification problems. Knowledge-Based Systems 203 (June 2020), 106116. https://doi.org/10.1016/j.knosys.2020.106116
[29] S. A. Shahee and U. Ananthakumar. 2018. Probability Based Cluster Expansion Oversampling Technique for Imbalanced Data. (2018), 77–90. https://doi.org/10.5121/csit.2018.80607
[30] G. Douzas and F. Bacao. 2017. Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning. Expert Systems with Applications 82 (2017), 40–52. https://doi.org/10.1016/j.eswa.2017.03.073
[31] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, C. Beeri, and P. Buneman. 1999. When Is "Nearest Neighbor" Meaningful?, 217–235. http://www.springerlink.com/content/04p94cqnbge862kh/
[32] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. 2001. On the surprising behavior of distance metrics in high dimensional space. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 1973 (2001), 420–434. https://doi.org/10.1007/3-540-44503-x_27
[33] T. Kohonen. 1982. Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 1 (1982), 59–69. https://doi.org/10.1007/BF00337288
[34] N. B. Subramanian. "Why High Dimensional Data are a Curse?" https://aiaspirant.com/curse-of-dimensionality (accessed June 4, 2022).
[35] X. Zhang, W. Wang, X. Zheng, Y. Ma, Y. Wei, M. Li, and Y. Zhang. 2019. A Clutter Suppression Method Based on SOM-SMOTE Random Forest. In 2019 IEEE Radar Conference (RadarConf). IEEE, 1–4. https://doi.org/10.1109/RADAR.2019.8835836
