ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning
Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li
Haibo He, Yang Bai, and Edwardo A. Garcia are with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, New Jersey 07030, USA (email: {hhe, ybai1, egarcia}@stevens.edu). Shutao Li is with the College of Electrical and Information Engineering, Hunan University, Changsha 410082, China (email: shutao [email protected]). This work was supported in part by the Center for Intelligent Networked Systems (iNetS) at Stevens Institute of Technology and the Excellent Youth Foundation of Hunan Province (Grant No. 06JJ1010).

Abstract—This paper presents a novel adaptive synthetic (ADASYN) sampling approach for learning from imbalanced data sets. The essential idea of ADASYN is to use a weighted distribution for different minority class examples according to their level of difficulty in learning: more synthetic data is generated for minority class examples that are harder to learn than for those that are easier to learn. As a result, the ADASYN approach improves learning with respect to the data distributions in two ways: (1) reducing the bias introduced by the class imbalance, and (2) adaptively shifting the classification decision boundary toward the difficult examples. Simulation analyses on several machine learning data sets show the effectiveness of this method across five evaluation metrics.

I. INTRODUCTION

Learning from imbalanced data sets is a relatively new challenge for many of today's data mining applications. From applications in Web mining to text categorization to biomedical data analysis [1], this challenge manifests itself in two common forms: minority interests and rare instances. Minority interests arise in domains where rare objects (minority class samples) are of great interest, and it is the objective of the machine learning algorithm to identify these minority class examples as accurately as possible. For instance, in financial engineering it is important to detect fraudulent credit card activities in a pool of large transactions [2] [3]. Rare instances, on the other hand, concern situations in which the data representing a particular event is limited compared to other distributions [4] [5], such as the detection of oil spills from satellite images [6]. One should note that many imbalanced learning problems are caused by a combination of these two factors. For instance, in biomedical data analysis, the data samples for different kinds of cancers are normally very limited (rare instances) compared to normal non-cancerous cases; therefore, the ratio of the minority class to the majority class can be significant (a ratio of 1 to 1,000 or even more [4][7][8]). On the other hand, it is essential to predict the presence of cancers, or to further classify different types of cancers, as accurately as possible for early and proper treatment (minority interests).

Generally speaking, imbalanced learning occurs whenever some types of data distribution significantly dominate the instance space compared to other data distributions. In this paper, we focus on the two-class classification problem for imbalanced data sets, a topic of major focus in recent research activities. Theoretical analyses and practical applications of this problem have attracted growing attention from both academia and industry. This is reflected by the establishment of several major workshops and special issues, including the American Association for Artificial Intelligence workshop on Learning from Imbalanced Data Sets (AAAI'00) [9], the International Conference on Machine Learning workshop on Learning from Imbalanced Data Sets (ICML'03) [10], and the Association for Computing Machinery (ACM) Special Interest Group on Knowledge Discovery and Data Mining explorations (ACM SIGKDD Explorations'04) [11].

The state-of-the-art research methodologies for handling imbalanced learning problems can be categorized into the following five major directions:

(1) Sampling strategies. These methods develop various oversampling and/or undersampling techniques to compensate for imbalanced distributions in the original data sets. For instance, in [12] the cost curves technique was used to study the interaction of both oversampling and undersampling with decision tree based learning algorithms. Sampling techniques integrating probabilistic estimates, pruning, and data preprocessing were studied for decision tree learning in [13]. Additionally, in [14], "JOUS-Boost" was proposed to handle imbalanced data learning by integrating adaptive boosting with jittering sampling techniques.

(2) Synthetic data generation. This approach aims to overcome imbalance in the original data sets by artificially generating data samples. The SMOTE algorithm [15] generates an arbitrary number of synthetic minority examples to shift the classifier learning bias toward the minority class. SMOTEBoost, an extension of this idea, was proposed in [16]; it integrates the synthetic procedure with adaptive boosting, changing the weight-update method to better compensate for skewed distributions. To ensure optimal classification accuracy for both the minority and majority classes, the DataBoost-IM algorithm was proposed in [17], where synthetic data examples are generated for both classes through the use of "seed" samples. (A minimal code sketch of the interpolation mechanism shared by this family of methods is given at the end of this section.)
(3) Cost-sensitive learning. Instead of creating balanced data distributions through sampling strategies or synthetic data generation, cost-sensitive learning takes a different approach to this issue: it uses a cost matrix for the different types of errors or instances to facilitate learning from imbalanced data sets. That is to say, cost-sensitive learning does not modify the imbalanced data distribution directly; instead, it targets the problem by using cost matrices that describe the cost of misclassifying any particular data sample. A theoretical analysis of optimal cost-sensitive learning for binary classification problems was presented in [18]. In [19], instead of using misclassification costs, an instance-weighting method was used to induce cost-sensitive trees and demonstrated better performance. In [20], MetaCost, a general cost-sensitive learning framework, was proposed; by wrapping a cost-minimizing procedure, MetaCost can make any arbitrary classifier cost-sensitive according to different requirements. In [21], cost-sensitive neural network models were investigated for imbalanced classification problems. A threshold-moving technique was used in this method to adjust the output threshold toward inexpensive classes, such that high-cost (expensive) samples are unlikely to be misclassified.

(4) Active learning. Active learning techniques are conventionally used to solve problems related to unlabeled training data. Recently, various approaches for active learning from imbalanced data sets have been proposed in the literature [1] [22] [23] [24]. In particular, an active learning method based on support vector machines (SVM) was proposed in [23] [24]. Instead of searching the entire training data space, this method effectively selects informative instances from a random set of training populations, significantly reducing the computational cost when dealing with large imbalanced data sets. In [25], active learning was used to study the class imbalance problems of word sense disambiguation (WSD) applications. Various strategies, including max-confidence and min-error, were investigated as stopping criteria for the proposed active learning methods.

(5) Kernel-based methods. Kernel-based methods have also been used to study the imbalanced learning problem. By integrating the regularized orthogonal weighted least squares (ROWLS) estimator, a kernel classifier construction algorithm based on orthogonal forward selection (OFS) was proposed in [26] to optimize model generalization for learning from two-class imbalanced data sets. In [27], a kernel-boundary-alignment (KBA) algorithm, based on the idea of modifying the kernel matrix according to the imbalanced data distribution, was proposed to solve this problem. Theoretical analyses in addition to empirical studies were used to demonstrate the effectiveness of this method.

In this paper, we propose an adaptive synthetic (ADASYN) sampling approach to address this problem. ADASYN is based on the idea of adaptively generating minority data samples according to their distributions: more synthetic data is generated for minority class samples that are harder to learn than for those that are easier to learn. The ADASYN method can not only reduce the learning bias introduced by the original imbalanced data distribution, but can also adaptively shift the decision boundary to focus on the difficult-to-learn samples.

The remainder of this paper is organized as follows. Section II presents the ADASYN algorithm in detail and discusses its major advantages compared to conventional synthetic approaches for imbalanced learning problems. In Section III, we test the performance of ADASYN on various machine learning test benches, using several evaluation metrics to assess this method against existing methods. Finally, a conclusion is presented in Section IV.
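Before presenting the algorithm, the following is a minimal Python sketch of the interpolation mechanism shared by the synthetic-generation family surveyed in direction (2) above, in the spirit of SMOTE [15]. The function name smote_like_oversample and the scikit-learn-based neighbor search are illustrative choices of this sketch, not the original implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_synthetic, k=5, rng=None):
    """Create n_synthetic points by interpolating minority samples toward
    randomly chosen minority-class nearest neighbors (SMOTE-style sketch)."""
    rng = np.random.default_rng(rng)
    # k + 1 neighbors because each point is returned as its own nearest neighbor
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))            # pick a minority example x_i
        z = idx[i, rng.integers(1, k + 1)]      # one of its k minority neighbors
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))
    return np.asarray(synthetic)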
II. ADASYN ALGORITHM

Motivated by the success of recent synthetic approaches, including SMOTE [15], SMOTEBoost [16], and DataBoost-IM [17], we propose an adaptive method to facilitate learning from imbalanced data sets. The objective here is two-fold: to reduce the bias introduced by the class imbalance and to adaptively shift the classification decision boundary toward the difficult examples. The proposed algorithm for the two-class classification problem is described in [Algorithm ADASYN]:

[Algorithm - ADASYN]

Input
(1) Training data set D_tr with m samples {x_i, y_i}, i = 1, ..., m, where x_i is an instance in the n-dimensional feature space X and y_i ∈ Y = {1, −1} is the class identity label associated with x_i. Define m_s and m_l as the number of minority class examples and the number of majority class examples, respectively. Therefore, m_s ≤ m_l and m_s + m_l = m.

Procedure
(1) Calculate the degree of class imbalance:

    d = m_s / m_l    (1)

where d ∈ (0, 1].
(2) If d < d_th, where d_th is a preset threshold for the maximum tolerated degree of class imbalance:
(a) Calculate the number of synthetic data examples that need to be generated for the minority class:

    G = (m_l − m_s) × β    (2)

where β ∈ [0, 1] is a parameter used to specify the desired balance level after generation of the synthetic data; β = 1 means a fully balanced data set is created by the generation process.
(b) For each minority class example x_i, find its K nearest neighbors based on the Euclidean distance in the n-dimensional space, and calculate the ratio r_i defined as:

    r_i = Δ_i / K,  i = 1, ..., m_s    (3)

where Δ_i is the number of examples among the K nearest neighbors of x_i that belong to the majority class; therefore, r_i ∈ [0, 1].
(c) Normalize r_i according to:

    r̂_i = r_i / Σ_{i=1}^{m_s} r_i    (4)

so that r̂_i is a density distribution (Σ_i r̂_i = 1).
(d) Calculate the number of synthetic data examples to be generated for each minority example x_i: g_i = r̂_i × G.
(e) For each minority class data example x_i, generate g_i synthetic data examples by repeating the following two steps g_i times:
(i) Randomly choose one minority data example, x_zi, from the K nearest neighbors for data x_i.
(ii) Generate the synthetic data example:

    s_i = x_i + (x_zi − x_i) × λ    (5)

where (x_zi − x_i) is the difference vector in the n-dimensional space, and λ is a random number, λ ∈ [0, 1].

The key idea of the ADASYN algorithm is to use the density distribution r̂_i as a criterion to automatically decide the number of synthetic samples that need to be generated for each minority data example.

[Figure: error performance for different β coefficients; β = 1 corresponds to a fully balanced data set after the ADASYN algorithm.]
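Collecting equations (1)-(5), the following Python sketch outlines the full procedure for the binary case. The function name adasyn, the scikit-learn neighbor search, the rounding of g_i to integers, and the edge-case handling are illustrative choices of this sketch rather than part of the published algorithm; it assumes at least two minority examples.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn(X, y, minority_label=1, beta=1.0, K=5, d_th=0.75, rng=None):
    """Return synthetic minority samples following the ADASYN steps above."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority_label]
    m_s = len(X_min)
    m_l = len(X) - m_s

    d = m_s / m_l                                # Eq. (1): degree of imbalance
    if d >= d_th:
        return np.empty((0, X.shape[1]))         # balanced enough; generate nothing

    G = (m_l - m_s) * beta                       # Eq. (2): total synthetic count

    # Eq. (3): fraction of majority examples among the K nearest neighbors of
    # each minority example; +1 neighbor and idx[:, 1:] skip the query point.
    _, idx = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X_min)
    r = np.array([(y[idx[i, 1:]] != minority_label).mean() for i in range(m_s)])
    if r.sum() == 0:                             # no majority neighbors anywhere
        return np.empty((0, X.shape[1]))
    r_hat = r / r.sum()                          # Eq. (4): density distribution
    g = np.rint(r_hat * G).astype(int)           # per-example counts g_i

    # Steps (i)-(ii): interpolate toward minority-class neighbors, Eq. (5).
    _, idx_min = NearestNeighbors(n_neighbors=min(K + 1, m_s)).fit(X_min).kneighbors(X_min)
    synthetic = []
    for i in range(m_s):
        for _ in range(g[i]):
            z = idx_min[i, rng.integers(1, idx_min.shape[1])]
            lam = rng.random()                   # lambda in [0, 1)
            synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))
    return np.asarray(synthetic) if synthetic else np.empty((0, X.shape[1]))

Appending the returned samples to the minority class with β = 1 reproduces the fully balanced setting described above.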
Winning times of each method across the five evaluation metrics:

Methods          OA    Precision    Recall    F-measure    G-mean
Decision tree     2        5           0           1           0
SMOTE             0        0           1           1           0
ADASYN            3        0           4           3           5
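For reference, the five metrics in the table follow their standard binary-classification definitions; a minimal sketch, assuming the paper's label convention y ∈ {1, −1} with the minority class as the positive class (the paper's exact formulations appear in its full evaluation section, which this excerpt omits):

import numpy as np

def evaluation_metrics(y_true, y_pred):
    """Compute OA, precision, recall, F-measure, and G-mean for labels {1, -1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    oa = (tp + tn) / len(y_true)                    # overall accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                         # also the true-positive rate
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = np.sqrt(recall * (tn / (tn + fp)))     # sqrt(TP rate * TN rate)
    return oa, precision, recall, f_measure, g_mean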
…can methods such as adjusting the class balance help the learning capabilities? This is a fundamental and critical question in this domain. In fact, the importance of this question was previously addressed by F. Provost in the invited paper for the AAAI'2000 Workshop on Imbalanced Data Sets [1]:

“Isn't the best research strategy to concentrate on how machine learning algorithms can deal most effectively with whatever data they are given?”

Based on our simulation results, we believe that this fundamental question should be investigated in more depth, both theoretically and empirically, in the research community to correctly understand the essence of imbalanced learning problems.

D. Discussions

As a new learning method, ADASYN can be further extended to handle imbalanced learning in different scenarios, and can therefore potentially benefit a wide range of real-world applications that involve learning from imbalanced data sets. We give a brief discussion of possible future research directions in this section.

First of all, in our current study we compared the ADASYN algorithm to a single decision tree and to the SMOTE algorithm [15] for performance assessment. This is mainly because all of these methods are single-model based learning algorithms. Statistically speaking, ensemble based learning algorithms can improve the accuracy and robustness of learning performance; thus, as a future research direction, the ADASYN algorithm can be extended for integration with ensemble based learning algorithms. To do this, one would need to use a bootstrap sampling technique to sample the original training data set, and then embed ADASYN in each sampled set to train a hypothesis. Finally, a weighted combination voting rule similar to that of AdaBoost.M1 [35] [36] can be used to combine the decisions from the different hypotheses into the final predicted output. In such a situation, it would be interesting to compare the performance of such a boosted ADASYN algorithm with those of SMOTEBoost [16], DataBoost-IM [17], and other ensemble-based approaches.
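A rough sketch of this ensemble direction, under simplifying assumptions of our own: bootstrap the training set, rebalance each replicate with the adasyn sketch defined earlier, train one decision tree per replicate, and combine by unweighted majority vote. The AdaBoost.M1-style weighted vote mentioned above would additionally weight each hypothesis by its training error; this simplified bagging variant is only an illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ensemble_adasyn(X, y, n_models=11, minority_label=1, rng=None):
    """Train one decision tree per ADASYN-rebalanced bootstrap replicate.
    Uses an odd n_models so that majority votes over {1, -1} cannot tie."""
    rng = np.random.default_rng(rng)
    models = []
    for _ in range(n_models):
        boot = rng.integers(len(X), size=len(X))             # bootstrap indices
        Xb, yb = X[boot], y[boot]
        syn = adasyn(Xb, yb, minority_label=minority_label)  # sketch defined earlier
        if len(syn):
            Xb = np.vstack([Xb, syn])
            yb = np.concatenate([yb, np.full(len(syn), minority_label)])
        models.append(DecisionTreeClassifier().fit(Xb, yb))
    return models

def predict_vote(models, X):
    votes = np.stack([m.predict(X) for m in models])         # (n_models, n_samples)
    return np.sign(votes.sum(axis=0))                        # majority vote over {1, -1}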