A Review on Oversampling Techniques for Solving the Data Imbalance Problem in Classification
Article in International Journal on Advances in ICT for Emerging Regions (ICTer) · June 2023
DOI: 10.4038/icter.v16i1.7260
Even though most of the proposed external approaches resample the dataset until the number of samples in each class is equal, studies such as [6] demonstrate that it is not always required to maintain a 50:50 class distribution when resampling. However, there is no hard and fast rule for deciding on a favorable imbalance ratio, as it can vary depending on the domain and the type of classifier used.

Internal approaches involve developing and improving the underlying classification algorithm without altering the dataset involved [7]. There are mainly two ways in which internal approaches address the data imbalance problem. The first is cost-sensitive learning, where the classifier is modified such that the misclassification of minority class samples is penalized more heavily than the misclassification of majority class samples. The second and most popular internal approach is to incorporate ensemble-based classifiers, where multiple weak classifiers are combined to improve the performance of the overall classification algorithm. Apart from these methods, algorithmic classifier modifications have also been proposed in past years to improve the performance of classifiers such as Support Vector Machines (SVM), Extreme Learning Machines (ELM), and Neural Networks (NN). Moreover, internal approaches can also be combined with external approaches to derive hybrid approaches that inherit both the advantages and disadvantages of internal and external approaches [1][8].

When comparing the approaches to addressing the data imbalance problem, it is apparent that researchers prefer external approaches over internal approaches, mainly due to their classifier independence [5]. In external approaches, since only the dataset is modified, there is freedom to select any suitable classifier for the classification task. However, in the case of internal approaches, as the internal structure or algorithm of the classifier is modified to address the imprecise classification of minority class samples, the solution is heavily dependent on the modified classifier. Nevertheless, it is impractical to use the same classification algorithm with every dataset in different contexts. Therefore, on the basis of generalizability, it is reasonable to presume that external approaches provide an added advantage over internal approaches.

B. Overview of External Approaches

As aforementioned, external approaches to solving the data imbalance problem are heavily favored in the field of research due to classifier independence. When considering oversampling and undersampling, both methods have their own advantages and disadvantages. The main drawback of oversampling is that it risks generating synthetic data that can lead to overfitting. This can be caused by generating synthetic samples that closely resemble original samples or by incorrectly positioning synthetic samples in the data space so that they overlap other classes. Undersampling, in contrast, risks excluding important information from the dataset, such as samples that are crucial for deciding the decision boundaries or samples that carry a higher weight in representing a particular class or feature. Apart from the exclusion of important information, undersampling can also suffer from data scarcity after resampling if the minority class contains extremely few samples.

Throughout the past years, many studies have been carried out to investigate methods and mechanisms to mitigate these drawbacks of external approaches. For example, the most intuitive technique to add or remove data to or from a given class is to perform random selections. These primitive techniques have evolved and improved over time to address their foundational drawbacks by combining more complex techniques with statistical and probabilistic methods.

When comparing oversampling and undersampling, even though oversampling can lead to overfitting, it is possible to detect this during the earlier stages of training using straightforward approaches such as a good train-test split and observing the change in testing error compared to the training error. However, in the case of the important information exclusion caused by undersampling, although the classifier might work well on the resampled dataset, a classifier trained without the excluded samples can produce many misclassifications when new data samples are introduced.
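The train-test-split check mentioned above can be made concrete with a short sketch. The example below is our own illustration (not taken from the reviewed studies); it assumes the scikit-learn and imbalanced-learn packages, uses an arbitrary synthetic dataset, and simply compares training and testing accuracy after oversampling only the training split.

    # Illustrative only: detect oversampling-induced overfitting by resampling
    # the training split alone and comparing training and testing accuracy.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from imblearn.over_sampling import SMOTE  # assumes the imbalanced-learn package

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Resample only the training data; the test set must stay untouched.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
    clf = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)

    print("training accuracy:", clf.score(X_res, y_res))
    print("testing accuracy: ", clf.score(X_test, y_test))
    # A training accuracy far above the testing accuracy is an early warning that
    # the classifier is overfitting to the synthetic minority samples.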
Mohammed et al. [9] validate this assertion, using several state-of-the-art classifiers to evaluate oversampled and undersampled datasets. The authors conclude that, compared to undersampling, oversampling of datasets leads to a more accurate classification.

There are numerous studies, including [10] and [11], that review oversampling techniques by conducting experiments and comparing the results to provide a comprehensive evaluation of the performance of different oversampling techniques in practical settings. However, the objective of these studies seems to be finding the best technique out of a set of available techniques based on a systematic analysis, and they provide limited information on the approaches and methodologies used in such techniques. As a result, a researcher may find it challenging to comprehend the underlying strategies of an oversampling technique, which, in turn, would lead to failure in addressing its limitations.

This paper provides a comprehensive review of some popular oversampling techniques used to address the data imbalance problem, highlighting their strategies and potential areas for further improvement. The aim of this study is to provide insights and guidance for researchers and practitioners in the field of machine learning who are interested in developing more robust oversampling techniques to address the data imbalance problem.

The rest of this paper is organized as follows. Section II provides a comprehensive review of existing oversampling techniques, highlighting their strategies when performing oversampling. Section III presents the key findings of the review, emphasizing the factors that need to be considered when formulating new oversampling techniques, followed by the conclusion in Section IV.

II. OVERSAMPLING TECHNIQUES

This section reviews (without introducing any bias in this context) alternative oversampling techniques.

When exploring the literature on oversampling, the most widely used techniques in the scientific community are SMOTE [12] and its variants. SMOTE stands for Synthetic Minority Oversampling Technique, where the algorithm generates a synthetic sample along the line segment that joins a randomly selected minority class sample and one of its K nearest neighbors. In SMOTE, the value of K is a parameter that should be specified prior to its application, and neighbors are randomly chosen from the set of K nearest neighbors depending on the amount of oversampling required. The operation of the SMOTE algorithm is further elaborated in Fig. 2, where (a) the majority class and minority class samples are represented in blue and green, respectively; (b) a minority class sample is randomly selected (black), and its K nearest neighbors (3 in the figure) are identified; and (c) a new synthetic sample (red) is generated on the line that joins the randomly selected minority class sample and one of its nearest neighbors.
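The interpolation step described above can be summarized in a few lines of Python. The sketch below is a simplified illustration of the idea in [12], not the reference implementation; the function name and parameters are our own, and it assumes only NumPy.

    import numpy as np

    def smote_like(X_min, k=5, n_new=100, seed=0):
        # X_min: (n, d) array of minority-class samples, with n > k.
        rng = np.random.default_rng(seed)
        # Pairwise Euclidean distances between minority samples.
        dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
        # Each sample's k nearest minority neighbors (column 0 is the sample itself).
        neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]
        synthetic = np.empty((n_new, X_min.shape[1]))
        for s in range(n_new):
            i = rng.integers(len(X_min))          # randomly selected minority sample
            j = neighbors[i, rng.integers(k)]     # one of its k nearest neighbors
            gap = rng.random()                    # random position on the joining segment
            synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
        return synthetic

In practice, a library implementation such as imblearn.over_sampling.SMOTE is normally used; the sketch only mirrors the interpolation idea.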
The technique proposed in [14] combines SMOTE-based oversampling with the removal of Tomek links, cleaning the overlapping samples that remain after oversampling and leaving well-defined class clusters. This can be considered a hybrid technique, as it applies both oversampling and undersampling to the dataset.

The research work presented in [15] is another hybrid technique, where extensive data cleaning based on misclassifications is applied through ENN (Edited Nearest Neighbor) to an oversampled dataset. ENN is similar to Tomek links but more aggressive, as it removes any sample (majority or minority) from the training set that its three nearest neighbors misclassify, creating more distinguishable class spaces with clear separation along the decision boundary. The study also states that oversampling strategies lead to more accurate classifiers than strategies derived through undersampling.

Geometric SMOTE [16] is another extension of SMOTE that generates synthetic samples near selected minority class samples within a geometric region instead of through linear interpolation. While this selected region is a hyper-sphere in its default configuration, G-SMOTE can deform it to a hyper-spheroid and eventually to a line segment, simulating the SMOTE process in the last instance. Geometric SMOTE addresses two main issues in SMOTE: the generation of noisy samples and the generation of samples that belong to the same sub-cluster. These issues are addressed by identifying safe areas in which to synthesize new samples and by varying the number of minority samples generated. The authors claim that the ability of G-SMOTE to produce a variety of synthetic minority data in safe regions of the input space, while aggressively boosting their diversity, is the rationale for its performance gain.

Safe-Level-SMOTE [17] follows a similar approach to SMOTE but considers the nearby majority class samples when generating synthetic minority class samples. Safe levels are computed using the nearest-neighbor minority samples, and synthetic samples are generated such that they lie closer to minority class samples (the safe area). The study tries to address the overgeneralization problem encountered by SMOTE, which arbitrarily generalizes the minority class territory while neglecting the majority class and can thereby increase the likelihood of class mixing in the case of highly skewed class distributions.

Borderline-SMOTE [18] is another variation of SMOTE that generates synthetic minority class samples only within the decision boundary that separates the classes. In contrast to SMOTE, Borderline-SMOTE identifies the minority class samples that lie within the vicinity of the majority class samples and generates synthetic samples based only on those, preventing the generation of noisy synthetic samples. The authors state that most classification algorithms strive to learn the boundaries of each class as precisely as possible during the training process in order to obtain better predictions, making the samples far from the borderline less significant than the samples that lie within the vicinity of the class borders. Furthermore, the study presents two versions of Borderline-SMOTE: Borderline-SMOTE1, which generates new synthetic samples between borderline minority samples and their K nearest minority neighbors, and Borderline-SMOTE2, which generates new synthetic samples between borderline minority samples and their K nearest minority as well as K nearest majority neighbors.
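The selection of such borderline ("danger") samples can be illustrated with a simple neighborhood count, as in the sketch below. This is our own simplification of the selection rule described in [18] (a minority sample is considered borderline when at least half, but not all, of its m nearest neighbors belong to the majority class); it assumes NumPy and scikit-learn and is not the authors' implementation.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def danger_samples(X, y, minority_label=1, m=5):
        # Return minority samples whose neighborhoods are dominated, but not
        # completely taken over, by the majority class.
        X_min = X[y == minority_label]
        nn = NearestNeighbors(n_neighbors=m + 1).fit(X)   # +1: the sample itself is returned
        _, idx = nn.kneighbors(X_min)
        danger = []
        for sample, neigh in zip(X_min, idx):
            n_majority = np.sum(y[neigh[1:]] != minority_label)
            if m / 2 <= n_majority < m:                   # at least half, but not all, majority
                danger.append(sample)
        return np.asarray(danger)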
ADASYN [19] is a density-based oversampling technique in which the density of minority samples in a neighbourhood is considered when generating new synthetic minority class samples. The main intuition of ADASYN is to utilize a density distribution as a criterion to determine the number of new synthetic samples that should be generated for each minority sample. The density distribution reflects the learning difficulty of each minority class sample, so that more synthetic samples are generated around samples that are more difficult to learn than around those that are simpler to learn. Even though ADASYN is capable of enhancing hard-to-learn minority sample areas, it is sensitive to outliers because it may misinterpret noisy samples, which usually occur in low densities, as harder-to-learn samples and associate them with higher weights.
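The density distribution can be sketched as a per-sample allocation: the larger the fraction of majority samples among a minority sample's K nearest neighbors, the more synthetic samples it receives. The snippet below is our own illustrative reading of the weighting scheme in [19] (not the authors' code) and assumes NumPy and scikit-learn.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def adasyn_allocation(X, y, minority_label=1, k=5, total_new=200):
        # Number of synthetic samples to generate around each minority sample,
        # proportional to the fraction of majority neighbors (learning difficulty).
        X_min = X[y == minority_label]
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X_min)
        r = np.array([np.sum(y[row[1:]] != minority_label) / k for row in idx])
        if r.sum() == 0:                       # no minority sample has majority neighbors
            return np.zeros(len(X_min), dtype=int)
        density = r / r.sum()                  # normalized density distribution
        return np.rint(density * total_new).astype(int)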
A summary of SMOTE and the variants elaborated above is presented in Table I.

When examining the process by which the aforementioned methods approach the problem, it is evident that they are focused on balancing the number of samples in the dataset classes. The imbalance between the dataset classes that splits them into majority and minority classes is called the between-class imbalance. By default, all resampling techniques are designed to address the between-class imbalance through oversampling, undersampling, or hybrid sampling. However, when compared with vanilla SMOTE, it can be observed that most of the above techniques attempt to refine the output of the SMOTE algorithm by regulating the areas of sample generation and eliminating noisy synthetic samples in order to preserve the decision boundary that separates the classes.

The samples near the decision boundary undoubtedly represent the most crucial samples for any classification task. Despite the importance of the decision boundary, as depicted in Fig. 3 (B), the samples generated near the boundary through oversampling often tend to distort the class separation, generating noisy samples that overlap with the majority class samples. The generation of noisy samples at the decision boundary is caused by the use of the same sample generation strategy throughout the data space, which is not designed to preserve the decision boundary. The oversampling techniques mentioned above address this issue and emphasize preserving the decision boundary when generating new synthetic minority class samples.

Fig. 3 (A) Occurrence of multiple disjuncts of minority class samples with varying densities. (B) Noisy minority class samples distort the decision boundary by overlapping with majority class samples.

Moreover, when considering real-life datasets, there can also be instances where multiple dense or sparse clusters of minority class samples are present within the data distribution, as illustrated in Fig. 3 (A).

The existence of multiple disjuncts of minority class samples is referred to as the within-class imbalance, and it can lead to an extreme lack of representation of crucial minority class features. Oversampling techniques that randomly select minority samples to generate new synthetic samples, such as SMOTE, fail to resolve the within-class imbalance, resulting in a skewed minority class distribution [20]. Therefore, it is important to address both between-class and within-class imbalances when addressing the data imbalance. The simultaneous removal of both these imbalances minimizes the classifier bias toward bigger sub-clusters by decreasing the influence of the bigger sub-cluster error on the total error [21].
TABLE I
SUMMARY OF SMOTE AND ITS VARIANTS
Looking further into oversampling techniques reveals another set of studies that use a different strategy to deal with the data imbalance problem.

MWMOTE [22] is a popular SMOTE-based oversampling technique. It first locates hard-to-learn minority class samples (samples near the decision boundary) using the majority class samples near the decision boundary, and uses the Euclidean distance from these nearest majority class samples to assign them weights. This weighting mechanism ensures that higher weights are assigned to samples closer to the decision boundary than to others. The authors highlight that the presence of within-class imbalance and small disjuncts of the minority class can lead to performance degradation in classifiers; therefore, similar to the weighting of hard-to-learn minority samples near the decision boundary, the samples of smaller clusters are given greater weights to reduce the within-class imbalance. Finally, a modified hierarchical clustering approach is used to create synthetic samples from the weighted minority class samples, making sure the generated samples reside within the minority class region to avoid noisy sample generation.

Cluster SMOTE [23] uses K-means to cluster the minority class and applies SMOTE within the identified clusters. This approach makes sure that the generated synthetic samples always lie inside naturally occurring clusters of the minority class samples. The study claims that the existence of only a small number of minority class samples makes it challenging to form decent class borders, and that addressing this limitation through accurate class region and border definition would make classification trivial. Since these class regions are unknown and impossible to infer from the given data, K-means is used to approximate the minority region, followed by applying SMOTE to each identified cluster. This study is explicitly designed to address the imbalance in network intrusion datasets and uses only two intrusion datasets for evaluation.

[24] presents a clustering-based oversampling technique designed to address the within-class and between-class imbalances while
avoiding the generation of noisy synthetic samples. Initially, the algorithm clusters the input space using K-means clustering and filters the clusters, retaining those with a higher number of minority samples for oversampling. The number of synthetic samples to be generated is then distributed, with more samples being assigned to clusters with a low density of minority samples. Finally, SMOTE is used to obtain the required ratio of minority and majority samples in each of the filtered clusters. The authors rationalize cluster-based oversampling as one of the strategies that aim to minimize the within-class imbalance while also reducing the between-class imbalance, enabling the oversampling technique to identify the most effective areas of the input space in which to generate synthetic samples.
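The cluster-then-oversample strategy shared by [23] and [24] can be sketched as follows. This is our own simplified illustration, not the authors' algorithms: clusters dominated by the minority class are kept, sparser clusters receive a larger share of the new samples, and interpolation is performed only between members of the same cluster. It assumes NumPy and scikit-learn; the names, thresholds, and sparsity heuristic are placeholders.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import NearestNeighbors

    def kmeans_guided_oversample(X, y, minority_label=1, n_clusters=8,
                                 min_share=0.5, n_new=200, seed=0):
        rng = np.random.default_rng(seed)
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
        # Keep clusters where the minority class holds at least `min_share` of the samples.
        kept = []
        for c in range(n_clusters):
            members = labels == c
            Xc = X[members & (y == minority_label)]
            if np.mean(y[members] == minority_label) >= min_share and len(Xc) >= 2:
                kept.append(Xc)
        if not kept:
            return np.empty((0, X.shape[1]))
        # Sparser clusters (larger mean nearest-neighbor distance) get more synthetic samples.
        sparsity = np.array([NearestNeighbors(n_neighbors=2).fit(Xc)
                             .kneighbors(Xc)[0][:, 1].mean() for Xc in kept])
        weights = sparsity / sparsity.sum()
        synthetic = []
        for Xc, w in zip(kept, weights):
            for _ in range(int(round(w * n_new))):
                i, j = rng.choice(len(Xc), size=2, replace=False)  # two distinct members
                gap = rng.random()
                synthetic.append(Xc[i] + gap * (Xc[j] - Xc[i]))    # interpolate inside the cluster
        return np.asarray(synthetic)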
DBSMOTE [25] is another density-based oversampling technique; it uses the DBSCAN algorithm to partition the minority class samples. SMOTE is used to generate synthetic samples along the shortest path that joins a minority class sample with a pseudo-centroid of its minority cluster, avoiding the generation of outliers or noisy samples. As a result, synthetic samples are generated in such a way that they are dense around the centroid and sparser further away from it. The authors claim that a real-world dataset with proximate data clusters can be described by a normal distribution, dense at the centroid and sparse towards the boundary, and that a classifier can correctly identify samples near the centroid because it recognizes the area around the centroid as a class. Based on these observations, DBSMOTE is designed to oversample the minority class area around the centroid, which would otherwise be too sparse to be recognized by a classifier.

CURE-SMOTE [26] works by clustering the minority class samples using the CURE hierarchical clustering algorithm, followed by noise and outlier removal. It then randomly generates synthetic minority class samples along the line segments that join representative points and the center point. In CURE hierarchical clustering, each sample is initially assumed to represent a cluster, and local clustering is used to combine these samples to form the clusters present in the input space. The study justifies CURE hierarchical clustering by stating that it is more efficient for large datasets with varying shapes of data distributions than K-means clustering, which is only suitable for spherically distributed datasets. Further, it is stated that the combination of clustering and merging operations tends to eliminate noise with reduced complexity, as it removes the need to discard the furthest generated synthetic samples (noisy samples) after applying SMOTE.

A-SUWO (Adaptive Semi-Unsupervised Weighted Oversampling) [27] and its improved version, IA-SUWO [28], cluster the minority class samples using a semi-unsupervised hierarchical clustering approach and use the classification complexity and cross-validation of each sub-cluster to decide the optimal size to oversample. Both A-SUWO and IA-SUWO aim to generate synthetic samples near minority class instances that lie close to the decision boundary and have lower densities.

[29] presents a probability-based cluster expansion oversampling technique that uses a model-based clustering mechanism (MCLUST) to identify the sub-clusters present in the dataset. The method also uses K-Nearest Neighbor based noise removal prior to clustering to reduce the oversampling of noisy samples, and equal posterior probability after clustering to identify the boundary of the identified sub-clusters. Finally, synthetic minority class samples are generated in the enclosed region of the class-separating boundary. As suggested by the authors, the main goal of this technique is to assign equal weight to all sub-clusters of the minority class that would otherwise be overlooked due to the skewness of the distribution. The cluster/density-based oversampling techniques elaborated above are summarized in Table II.

In order to address the within-class imbalance, it is necessary to identify the different regions within the data space where oversampling is effective. The above studies show that clustering and density-based techniques are popular approaches that researchers use to identify such areas. After the identification of significant areas to oversample, traditional oversampling techniques can be used to generate the synthetic samples. The clustering-based oversampling techniques introduced above emphasize the importance of addressing the within-class imbalance when formulating oversampling techniques.

A. Oversampling High-Dimensional Data

Further inspection of the aforementioned oversampling techniques reveals that most of them are based on clustering algorithms such as K-means, DBSCAN, and hierarchical clustering, combined with heuristics based on Euclidean distance. Therefore, the majority of these approaches rely on heuristics that are suited to low-dimensional Euclidean space when generating synthetic data, whereas practical scenarios often involve high-dimensional data [30]. Additionally, when the number of features in the dataset (the dimensionality of the data) increases, the data points become sparser, or farther apart (Fig. 4), making the nearest neighbor problem ill-defined [31]. This behavior is called the “curse of dimensionality” [32]. As a result, in higher-dimensional space, the use of heuristics based on Euclidean distance becomes ineffective and the assumption of well-defined clusters fails, generating noisy synthetic samples.

A common strategy that can be adopted when formulating oversampling techniques that use clustering mechanisms and Euclidean-distance heuristics is to reduce the dimensionality of the original input space. Principal Component Analysis (PCA), Multidimensional Scaling (MDS), and Self-Organizing Maps are some common dimensionality reduction techniques practitioners use. In recent years, Self-Organizing Map [33] based resampling techniques have been extensively explored in the community.
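As a simple illustration of this strategy (our own sketch, not a method from the reviewed studies), the minority class can be projected into a lower-dimensional space with PCA, oversampled there with a distance-based heuristic, and mapped back to the original feature space. The sketch assumes NumPy and scikit-learn; interpolating between random pairs is a deliberate simplification of a nearest-neighbor rule.

    import numpy as np
    from sklearn.decomposition import PCA

    def oversample_in_low_dim(X_min, n_components=2, n_new=100, seed=0):
        # Reduce dimensionality so that Euclidean-distance heuristics stay meaningful,
        # interpolate synthetic samples there, then map them back to the input space.
        rng = np.random.default_rng(seed)
        pca = PCA(n_components=n_components).fit(X_min)
        Z = pca.transform(X_min)                              # low-dimensional representation
        synthetic = []
        for _ in range(n_new):
            i, j = rng.choice(len(Z), size=2, replace=False)
            gap = rng.random()
            synthetic.append(Z[i] + gap * (Z[j] - Z[i]))      # interpolation in the reduced space
        return pca.inverse_transform(np.asarray(synthetic))   # back to the original dimensionality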
Self-Organizing Map based Oversampling (SOMO) [30] generates a clustered two-dimensional representation of the input space by applying the SOM algorithm. Clusters are filtered for oversampling by calculating the density of minority class samples in each cluster. SMOTE is then applied to generate synthetic minority class samples within the filtered clusters and between neighboring clusters, addressing both within-class and between-class imbalances. The authors identify and address a few inefficiencies of existing oversampling techniques, namely, the generation of noisy instances that infiltrate the majority region, the generation of duplicate samples, and the use of heuristics based on the assumption that the input space has a simple manifold structure. SOMO is capable of generating more effective synthetic samples by investigating the manifold
structure of the input space, exploiting the topology-preserving property of Self-Organizing Maps.

[1] proposes an imbalanced dataset resampling technique that combines Self-Organizing Maps and Genetic Algorithms. The technique uses two Self-Organizing Maps to perform oversampling on the minority class and undersampling on the majority class. The clusters derived from the Self-Organizing Maps identify the regions where majority and minority class samples are dense. The filtered clusters are then utilized to derive a set of rankings among majority and synthetic minority samples, which evaluate the positive impact of their removal or inclusion in the training data, respectively. The ideal rates of exclusion and inclusion for each accepted criterion are obtained using a Genetic Algorithm whose fitness function considers the performance of a random classifier for a given training dataset in the context of imbalanced classification. The authors claim that the capability of Self-Organizing Maps to preserve the distribution and topology of the input data leads to the conservation of the natural spatial relationship among samples at the cluster level, and that the optimization capabilities of Genetic Algorithms result in the maximization of classifier performance, improving the overall resampling operation.
TABLE II
SUMMARY OF CLUSTER/DENSITY BASED OVERSAMPLING TECHNIQUES

[26] Oversampling: combines CURE hierarchical clustering with noise and outlier removal so that the samples generated after using SMOTE are more precise.
- Uses the CURE hierarchical clustering algorithm followed by noise and outlier removal.
- Addresses datasets that have clusters of varying shapes and sizes.

[27][28] Oversampling: uses a semi-unsupervised hierarchical clustering algorithm to generate synthetic samples around minority class instances that lie close to the decision boundary with lower densities.
- Clusters the minority class using a hierarchical clustering approach.
- Oversample size is decided from the classification complexity and cross-validation of each sub-cluster.
[35] uses a customized SOM-SMOTE algorithm to address the imbalance in clutter data for clutter suppression in search radars. The authors identify two limitations of the SMOTE algorithm, random sample selection and ignoring the data distribution during interpolation, which result in samples that are not representative enough. The study addresses these limitations using a combined Self-Organizing Map and SMOTE algorithm that clusters the minority class samples into several subsets using a Self-Organizing Map and interpolates synthetic samples around the cluster centers using SMOTE. The authors also highlight the ability of Self-Organizing Maps to preserve the topology of higher-dimensional clutter data, resulting in synthetic samples with distribution characteristics similar to those of the original data.

From the above studies, it can be assumed that the ability of Self-Organizing Maps to address the within-class imbalance as a clustering algorithm, along with their ability to reduce the dimensionality of data while preserving the topology of the input space, are the main reasons for their widespread popularity as an excellent candidate for addressing the data imbalance problem.
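For completeness, the sketch below shows one way a Self-Organizing Map can expose minority-dense regions of a high-dimensional dataset before any oversampling is applied. It is only a schematic illustration of the SOM-based strategy discussed above, not an implementation of SOMO [30], [1], or [35]; it assumes the third-party minisom package, and the grid size, iteration count, and 0.5 threshold are arbitrary.

    import numpy as np
    from minisom import MiniSom   # assumed third-party package (pip install minisom)

    def minority_dense_units(X, y, minority_label=1, grid=(5, 5), iterations=1000, seed=0):
        # Train a small SOM on the full dataset; each unit of the 2-D grid then acts as a
        # topology-preserving cluster of the original high-dimensional samples.
        som = MiniSom(grid[0], grid[1], X.shape[1], sigma=1.0, learning_rate=0.5,
                      random_seed=seed)
        som.train_random(X, iterations)
        minority_hits = np.zeros(grid)
        total_hits = np.zeros(grid)
        for xi, yi in zip(X, y):
            unit = som.winner(xi)                    # best-matching unit for this sample
            total_hits[unit] += 1
            minority_hits[unit] += (yi == minority_label)
        share = np.divide(minority_hits, total_hits,
                          out=np.zeros(grid), where=total_hits > 0)
        # Units dominated by minority samples are candidate regions for oversampling.
        return np.argwhere(share >= 0.5)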
III. DISCUSSION

When analyzing oversampling techniques that address the class imbalance problem, it is possible to identify key factors that contribute to the success of an oversampling technique. Throughout the literature, it can be observed that every oversampling technique attempts to adopt one or more of these factors during its strategy formulation.

Considering the variations of the SMOTE algorithm that have been introduced as improved versions of vanilla SMOTE [12], it is evident that most of the proposed techniques, such as [14], [15], Safe-Level-SMOTE [17], and Borderline-SMOTE [18], try to preserve the boundary region that separates the minority and majority classes. Compared to vanilla SMOTE, the higher classification accuracies of these techniques demonstrate the importance of preserving the boundary region when formulating an oversampling technique.

Further exploring contemporary oversampling techniques, it can also be observed that clustering-based approaches are more popular among researchers. This is because, apart from preserving the boundary region, it is also essential to address the within-class imbalance in the dataset (all resampling techniques address the between-class imbalance by default). Clustering algorithms do not necessarily address the within-class imbalance unless they are explicitly designed to do so.

Based on the above observations, it is possible to identify three constraints that need to be simultaneously satisfied when formulating an oversampling technique in order to generate an optimally resampled dataset.

1) Addressing the between-class imbalance: The between-class imbalance represents the typical imbalance scenario, where there is a significant difference between the numbers of samples in the dataset classes. All resampling techniques attempt to address the between-class imbalance by oversampling, undersampling, or hybrid sampling. There is no single optimal imbalance ratio that needs to be reached when resampling an imbalanced dataset. However, [6] states that a 35:65 class distribution can achieve higher classification performance than a 50:50 class distribution when the classes are heavily imbalanced. This is an area that is still being investigated.

2) Addressing the within-class imbalance: Within-class imbalance denotes the imbalance within the minority class due to the existence of multiple disjuncts of minority class samples with varying densities (Fig. 3A), which can lead to an extreme representation deficiency of essential characteristics of the minority class. The majority of the oversampling techniques that randomly select minority class samples to generate new synthetic samples, such as SMOTE, fail to address the within-class imbalance, leaving the minority class distribution skewed. When analyzing oversampling techniques capable of addressing the within-class imbalance, it can be observed that they are based on clustering approaches. The use of clustering approaches is an obvious design choice, as they provide the capability to analyze the spatial location of the minority class and determine suitable areas in which to generate new synthetic samples.

3) Preserving the boundary region when generating synthetic samples: The boundary region represents the area that separates two or more classes. As mentioned previously, the most crucial samples in any classification task are those that reside near the boundary region. When considering oversampling techniques, the synthetic samples generated near the decision boundary often distort the class separation, generating noisy samples that overlap with the majority class samples. This behavior is caused by the use of the same sample generation strategy throughout the data space, which is not designed to preserve the decision boundary. However, studies such as [14], [15], [16], [17], and [18] emphasize the importance of preserving the boundary region when generating new synthetic samples. Decision boundary preservation can be achieved either by using a separate sample generation strategy near the boundary region or by refining the synthetically generated samples to remove the noisy samples generated near the decision boundary.

Even though there are oversampling techniques that address different combinations of the above three constraints, almost none of the proposed oversampling techniques address all three constraints together.

Aside from addressing the constraints mentioned above when formulating an oversampling approach, it is also preferable to pay special attention to the curse of dimensionality. As elaborated in the previous section, many of the currently available oversampling approaches are unable to handle the curse of dimensionality, resulting in poor performance on high-dimensional datasets. We believe that addressing the above constraints, together with a proper clustering algorithm or a dimensionality reduction technique, is a promising research avenue to investigate further.

IV. CONCLUSION

The data imbalance problem is one of the most well-defined problems in the Machine Learning domain and has been addressed throughout the past decades.
With the emergence of Big Data, traditional techniques for addressing the data imbalance problem have been challenged, and the necessity of new and improved techniques has created promising research avenues in many practical domains.

This paper reviews numerous research works that have attempted to address the data imbalance problem by oversampling the minority class samples. We identify several subsets of oversampling techniques and highlight the different approaches adopted by them to discover suitable samples and areas to oversample, as well as the strategies used to generate new synthetic samples. Based on these studies, it is evident that some oversampling techniques focus on preserving the decision boundary by refining the oversampled output or by restricting sample generation to certain areas. It is also possible to identify studies that use clustering and density-based techniques to prevent the generation of noisy samples and to alleviate the occurrence of disjuncts of minority class samples with varying densities. Furthermore, the review also presents the challenges faced by traditional oversampling techniques on high-dimensional data and suggests different techniques that can be utilized to address them.

By analyzing the various strategies adopted in the scientific community for oversampling, we have identified three key constraints that need to be satisfied when developing state-of-the-art oversampling techniques:
1) Addressing the between-class imbalance: the typical imbalance scenario, where there is a significant difference between the numbers of samples in the dataset classes.
2) Addressing the within-class imbalance: the imbalance within the minority class due to the existence of multiple disjuncts of minority class samples with varying densities.
3) Preserving the boundary region when generating synthetic samples: the boundary region represents the area that separates two or more classes, and it is required to make sure that the synthetic samples do not distort the decision boundary or overlap with samples in other classes.

Along with the above constraints, being attentive to the curse of dimensionality and addressing it would lead to more optimal resampling. Based on these findings, researchers can develop more robust oversampling techniques in the future.
REFERENCES
[1] M. Vannucci and V. Colla, Imbalanced datasets resampling through self organizing maps and genetic algorithms. Springer International Publishing, 2019, vol. 1000. [Online]. Available: http://dx.doi.org/10.1007/978-3-030-20257-6_34
[2] S. Maheshwari, J. Agrawal, and S. Sharma, “A New approach for Classification of Highly Imbalanced Datasets using Evolutionary Algorithms,” International Journal of Scientific & Engineering Research, vol. 2, no. 7, 2011. [Online]. Available: http://www.ijser.org
[3] T. Jo and N. Japkowicz, “Class imbalances versus small disjuncts,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 40–49, 2004.
[4] V. García, J. S. Sánchez, A. I. Marqués, R. Florencia, and G. Rivera, “Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data,” Expert Systems with Applications, vol. 158, 2020. https://doi.org/10.1016/j.eswa.2019.113026
[5] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, “An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics,” Information Sciences, vol. 250, pp. 113–141, Nov. 2013. https://doi.org/10.1016/j.ins.2013.07.007
[6] T. M. Khoshgoftaar, M. Golawala, and J. V. Hulse, “An empirical study of learning from imbalanced data using random forest,” Proceedings - International Conference on Tools with Artificial Intelligence (ICTAI), vol. 2, pp. 310–317, 2007. https://doi.org/10.1109/ICTAI.2007.46
[7] U. Bhowan, M. Johnston, and M. Zhang, “Developing New Fitness Functions in Genetic Programming for Classification With Unbalanced Data,” vol. 42, no. 2, pp. 406–421, 2012.
[8] J. Gao, K. Liu, B. Wang, D. Wang, and Q. Hong, “An improved deep forest for alleviating the data imbalance problem,” Soft Computing, vol. 25, no. 3, pp. 2085–2101, 2021. https://doi.org/10.1007/s00500-020-05279-8
[9] R. Mohammed, J. Rawashdeh, and M. Abdullah, “Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results,” 2020 11th International Conference on Information and Communication Systems (ICICS), May 2020, pp. 243–248. https://doi.org/10.1109/ICICS49469.2020.239556
[10] D. Elreedy and A. F. Atiya, “A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance,” Information Sciences, vol. 505, pp. 32–64, Dec. 2019. https://doi.org/10.1016/j.ins.2019.07.070
[11] A. Gosain and S. Sardana, “Handling class imbalance problem using oversampling techniques: A review,” IEEE Xplore, Sep. 01, 2017. https://ieeexplore.ieee.org/abstract/document/8125820 (accessed May 17, 2022).
[12] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, Jun. 2002. https://doi.org/10.1613/jair.953
[13] M. Schubach, M. Re, P. N. Robinson, and G. Valentini, “Imbalance-aware machine learning for predicting rare and common disease-associated non-coding variants,” Scientific Reports, vol. 7, no. 1, pp. 1–12, 2017.
[14] G. E. A. P. A. Batista, A. L. C. Bazzan, and M. C. Monard, “Balancing Training Data for Automated Annotation of Keywords: a Case Study,” Proceedings of the Second Brazilian Workshop on Bioinformatics, Jan. 2003, pp. 35–43. http://www.cs.waikato.ac.nz/
[15] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20–29, 2004. https://doi.org/10.1145/1007730.1007735
[16] G. Douzas and F. Bacao, “Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE,” 2017, pp. 1–22. http://arxiv.org/abs/1709.07377
[17] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, “Safe-level-SMOTE: Safe-level-synthetic minority oversampling technique for handling the class imbalanced problem,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5476 LNAI, pp. 475–482, 2009. https://doi.org/10.1007/978-3-642-01307-2_43
[18] H. Han, W. Y. Wang, and B. H. Mao, “Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning,” Advances in Intelligent Systems and Computing, vol. 683, pp. 878–887, 2005. https://doi.org/10.1007/11538059_91
[19] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” Proceedings of the International Joint Conference on Neural Networks, vol. 3, pp. 1322–1328, 2008. https://doi.org/10.1109/IJCNN.2008.4633969
[20] C. T. Lin, T. Y. Hsieh, Y. T. Liu, Y. Y. Lin, C. N. Fang, Y. K. Wang, G. Yen, N. R. Pal, and C. H. Chuang, “Minority Oversampling in Kernel Adaptive Subspaces for Class Imbalanced Datasets,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 5, pp. 950–962, 2018. https://doi.org/10.1109/TKDE.2017.2779849
[21] S. A. Shahee and U. Ananthakumar, “An adaptive oversampling technique for imbalanced datasets,” vol. 10933 LNAI. Springer International Publishing, 2018, pp. 1–16. https://doi.org/10.1007/978-3-319-95786-9_1
[22] S. Barua, M. M. Islam, X. Yao, and K. Murase, “MWMOTE - Majority weighted minority oversampling technique for imbalanced data set learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 2, pp. 405–425, 2014. https://doi.org/10.1109/TKDE.2012.232
[23] D. A. Cieslak, N. V. Chawla, and A. Striegel, “Combating imbalance in network intrusion datasets,” 2006 IEEE International Conference on Granular Computing, 2006, pp. 732–737. https://doi.org/10.1109/grc.2006.1635905
[24] F. Last, G. Douzas, and F. Bacao, “Oversampling for Imbalanced Learning Based on K-Means and SMOTE,” 2017, pp. 1–19. https://doi.org/10.1016/j.ins.2018.06.056