ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning
Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li
Haibo He, Yang Bai, and Edwardo A. Garcia are with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, New Jersey 07030, USA (email: {hhe, ybai1, egarcia}@stevens.edu). Shutao Li is with the College of Electrical and Information Engineering, Hunan University, Changsha 410082, China (email: shutao [email protected]). This work was supported in part by the Center for Intelligent Networked Systems (iNetS) at Stevens Institute of Technology and the Excellent Youth Foundation of Hunan Province (Grant No. 06JJ1010).

Abstract—This paper presents a novel adaptive synthetic (ADASYN) sampling approach for learning from imbalanced data sets. The essential idea of ADASYN is to use a weighted distribution for different minority class examples according to their level of difficulty in learning: more synthetic data is generated for minority class examples that are harder to learn than for those that are easier to learn. As a result, the ADASYN approach improves learning with respect to the data distributions in two ways: (1) reducing the bias introduced by the class imbalance, and (2) adaptively shifting the classification decision boundary toward the difficult examples. Simulation analyses on several machine learning data sets show the effectiveness of this method across five evaluation metrics.

I. INTRODUCTION

Learning from imbalanced data sets is a relatively new challenge for many of today's data mining applications. From applications in Web mining to text categorization to biomedical data analysis [1], this challenge manifests itself in two common forms: minority interests and rare instances. Minority interests arise in domains where rare objects (minority class samples) are of great interest, and it is the objective of the machine learning algorithm to identify these minority class examples as accurately as possible. For instance, in financial engineering it is important to detect fraudulent credit card activities in a pool of large transactions [2] [3]. Rare instances, on the other hand, concern situations in which the data representing a particular event is limited compared to other distributions [4] [5], such as the detection of oil spills from satellite images [6]. One should note that many imbalanced learning problems are caused by a combination of these two factors. For instance, in biomedical data analysis, the data samples for different kinds of cancers are normally very limited (rare instances) compared to normal non-cancerous cases; therefore, the ratio of the minority class to the majority class can be significant (a ratio of 1 to 1,000 or even more [4][7][8]). On the other hand, it is essential to predict the presence of cancers, or to further classify different types of cancers, as accurately as possible for early and proper treatment (minority interests).

Generally speaking, imbalanced learning occurs whenever some types of data distribution significantly dominate the instance space compared to other data distributions. In this paper, we focus on the two-class classification problem for imbalanced data sets, a topic of major focus in recent research activities. Theoretical analyses and practical applications of this problem have attracted growing attention from both academia and industry. This is reflected by the establishment of several major workshops and special issues, including the American Association for Artificial Intelligence workshop on Learning from Imbalanced Data Sets (AAAI'00) [9], the International Conference on Machine Learning workshop on Learning from Imbalanced Data Sets (ICML'03) [10], and the Association for Computing Machinery (ACM) Special Interest Group on Knowledge Discovery and Data Mining explorations (ACM SIGKDD Explorations'04) [11].

The state-of-the-art research methodologies for handling imbalanced learning problems can be categorized into the following five major directions:

(1) Sampling strategies. These methods develop various oversampling and/or undersampling techniques to compensate for imbalanced distributions in the original data sets. For instance, in [12] the cost curves technique was used to study the interaction of both oversampling and undersampling with decision tree based learning algorithms. Sampling techniques integrating probabilistic estimates, pruning, and data preprocessing were studied for decision tree learning in [13]. Additionally, in [14], "JOUS-Boost" was proposed to handle imbalanced data learning by integrating adaptive boosting with jittering sampling techniques.

(2) Synthetic data generation. This approach aims to overcome imbalance in the original data sets by artificially generating data samples. The SMOTE algorithm [15] generates an arbitrary number of synthetic minority examples to shift the classifier learning bias toward the minority class. SMOTEBoost, an extension of this idea, was proposed in [16]; it integrates the synthetic procedure with adaptive boosting, changing the weight-update method to better compensate for skewed distributions. To ensure optimal classification accuracy for both the minority and majority classes, the DataBoost-IM algorithm was proposed in [17], where synthetic data examples are generated for both classes through the use of "seed" samples. (A minimal code sketch of the interpolation mechanism shared by this family of methods is given at the end of this section.)
(3) Cost-sensitive learning. Instead of creating balanced data distributions through sampling strategies or synthetic data generation, cost-sensitive learning takes a different approach to this issue: it uses a cost matrix for the different types of errors or instances to facilitate learning from imbalanced data sets. That is to say, cost-sensitive learning does not modify the imbalanced data distribution directly; instead, it targets the problem by using cost matrices that describe the cost of misclassifying any particular data sample. A theoretical analysis of optimal cost-sensitive learning for binary classification problems was presented in [18]. In [19], instead of using misclassification costs, an instance-weighting method was used to induce cost-sensitive trees and demonstrated better performance. In [20], MetaCost, a general cost-sensitive learning framework, was proposed; by wrapping a cost-minimizing procedure, MetaCost can make any arbitrary classifier cost-sensitive according to different requirements. In [21], cost-sensitive neural network models were investigated for imbalanced classification problems. A threshold-moving technique was used in this method to adjust the output threshold toward inexpensive classes, such that high-cost (expensive) samples are unlikely to be misclassified.

(4) Active learning. Active learning techniques are conventionally used to solve problems related to unlabeled training data. Recently, various approaches for active learning from imbalanced data sets have been proposed in the literature [1] [22] [23] [24]. In particular, an active learning method based on support vector machines (SVM) was proposed in [23] [24]. Instead of searching the entire training data space, this method effectively selects informative instances from a random set of training populations, significantly reducing the computational cost when dealing with large imbalanced data sets. In [25], active learning was used to study the class imbalance problems of word sense disambiguation (WSD) applications. Various strategies, including max-confidence and min-error, were investigated as stopping criteria for the proposed active learning methods.

(5) Kernel-based methods. Kernel-based methods have also been used to study the imbalanced learning problem. By integrating the regularized orthogonal weighted least squares (ROWLS) estimator, a kernel classifier construction algorithm based on orthogonal forward selection (OFS) was proposed in [26] to optimize model generalization for learning from two-class imbalanced data sets. In [27], a kernel-boundary-alignment (KBA) algorithm, based on the idea of modifying the kernel matrix according to the imbalanced data distribution, was proposed to solve this problem. Theoretical analyses in addition to empirical studies were used to demonstrate the effectiveness of this method.

In this paper, we propose an adaptive synthetic (ADASYN) sampling approach to address this problem. ADASYN is based on the idea of adaptively generating minority data samples according to their distributions: more synthetic data is generated for minority class samples that are harder to learn than for those that are easier to learn. The ADASYN method can not only reduce the learning bias introduced by the original imbalanced data distribution, but can also adaptively shift the decision boundary to focus on the difficult-to-learn samples.

The remainder of this paper is organized as follows. Section II presents the ADASYN algorithm in detail and discusses its major advantages compared to conventional synthetic approaches for imbalanced learning problems. In Section III, we test the performance of ADASYN on various machine learning test benches, using several evaluation metrics to assess this method against existing methods. Finally, a conclusion is presented in Section IV.
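Before presenting the algorithm, the following is a minimal Python sketch of the interpolation mechanism shared by the synthetic-generation family surveyed in direction (2) above, in the spirit of SMOTE [15]. The function name smote_like_oversample and the scikit-learn-based neighbor search are illustrative choices of this sketch, not the original implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_synthetic, k=5, rng=None):
    """Create n_synthetic points by interpolating minority samples toward
    randomly chosen minority-class nearest neighbors (SMOTE-style sketch)."""
    rng = np.random.default_rng(rng)
    # k + 1 neighbors because each point is returned as its own nearest neighbor
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))            # pick a minority example x_i
        z = idx[i, rng.integers(1, k + 1)]      # one of its k minority neighbors
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))
    return np.asarray(synthetic)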
II. ADASYN ALGORITHM

Motivated by the success of recent synthetic approaches, including SMOTE [15], SMOTEBoost [16], and DataBoost-IM [17], we propose an adaptive method to facilitate learning from imbalanced data sets. The objective here is two-fold: to reduce the bias introduced by the class imbalance and to adaptively shift the classification decision boundary toward the difficult examples. The proposed algorithm for the two-class classification problem is described in [Algorithm ADASYN]:

[Algorithm - ADASYN]

Input
(1) Training data set D_tr with m samples {x_i, y_i}, i = 1, ..., m, where x_i is an instance in the n-dimensional feature space X and y_i ∈ Y = {1, −1} is the class identity label associated with x_i. Define m_s and m_l as the number of minority class examples and the number of majority class examples, respectively. Therefore, m_s ≤ m_l and m_s + m_l = m.

Procedure
(1) Calculate the degree of class imbalance:

    d = m_s / m_l    (1)

where d ∈ (0, 1].
(2) If d < d_th, where d_th is a preset threshold for the maximum tolerated degree of class imbalance:
(a) Calculate the number of synthetic data examples that need to be generated for the minority class:

    G = (m_l − m_s) × β    (2)

where β ∈ [0, 1] is a parameter used to specify the desired balance level after generation of the synthetic data; β = 1 means a fully balanced data set is created by the generation process.
(b) For each minority class example x_i, find its K nearest neighbors based on the Euclidean distance in the n-dimensional space, and calculate the ratio r_i defined as:

    r_i = Δ_i / K,  i = 1, ..., m_s    (3)

where Δ_i is the number of examples among the K nearest neighbors of x_i that belong to the majority class; therefore, r_i ∈ [0, 1].
(c) Normalize r_i according to:

    r̂_i = r_i / Σ_{i=1}^{m_s} r_i    (4)

so that r̂_i is a density distribution (Σ_i r̂_i = 1).
(d) Calculate the number of synthetic data examples to be generated for each minority example x_i: g_i = r̂_i × G.
(e) For each minority class data example x_i, generate g_i synthetic data examples by repeating the following two steps g_i times:
(i) Randomly choose one minority data example, x_zi, from the K nearest neighbors for data x_i.
(ii) Generate the synthetic data example:

    s_i = x_i + (x_zi − x_i) × λ    (5)

where (x_zi − x_i) is the difference vector in the n-dimensional space, and λ is a random number, λ ∈ [0, 1].

The key idea of the ADASYN algorithm is to use the density distribution r̂_i as a criterion to automatically decide the number of synthetic samples that need to be generated for each minority data example.

[Figure: error performance for different β coefficients; β = 1 corresponds to a fully balanced data set after the ADASYN algorithm.]
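Collecting equations (1)-(5), the following Python sketch outlines the full procedure for the binary case. The function name adasyn, the scikit-learn neighbor search, the rounding of g_i to integers, and the edge-case handling are illustrative choices of this sketch rather than part of the published algorithm; it assumes at least two minority examples.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn(X, y, minority_label=1, beta=1.0, K=5, d_th=0.75, rng=None):
    """Return synthetic minority samples following the ADASYN steps above."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority_label]
    m_s = len(X_min)
    m_l = len(X) - m_s

    d = m_s / m_l                                # Eq. (1): degree of imbalance
    if d >= d_th:
        return np.empty((0, X.shape[1]))         # balanced enough; generate nothing

    G = (m_l - m_s) * beta                       # Eq. (2): total synthetic count

    # Eq. (3): fraction of majority examples among the K nearest neighbors of
    # each minority example; +1 neighbor and idx[:, 1:] skip the query point.
    _, idx = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X_min)
    r = np.array([(y[idx[i, 1:]] != minority_label).mean() for i in range(m_s)])
    if r.sum() == 0:                             # no majority neighbors anywhere
        return np.empty((0, X.shape[1]))
    r_hat = r / r.sum()                          # Eq. (4): density distribution
    g = np.rint(r_hat * G).astype(int)           # per-example counts g_i

    # Steps (i)-(ii): interpolate toward minority-class neighbors, Eq. (5).
    _, idx_min = NearestNeighbors(n_neighbors=min(K + 1, m_s)).fit(X_min).kneighbors(X_min)
    synthetic = []
    for i in range(m_s):
        for _ in range(g[i]):
            z = idx_min[i, rng.integers(1, idx_min.shape[1])]
            lam = rng.random()                   # lambda in [0, 1)
            synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))
    return np.asarray(synthetic) if synthetic else np.empty((0, X.shape[1]))

Appending the returned samples to the minority class with β = 1 reproduces the fully balanced setting described above.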
Winning times of each method across the five evaluation metrics:

Methods          OA    Precision    Recall    F-measure    G-mean
Decision tree     2        5           0           1           0
SMOTE             0        0           1           1           0
ADASYN            3        0           4           3           5
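For reference, the five metrics in the table follow their standard binary-classification definitions; a minimal sketch, assuming the paper's label convention y ∈ {1, −1} with the minority class as the positive class (the paper's exact formulations appear in its full evaluation section, which this excerpt omits):

import numpy as np

def evaluation_metrics(y_true, y_pred):
    """Compute OA, precision, recall, F-measure, and G-mean for labels {1, -1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    oa = (tp + tn) / len(y_true)                    # overall accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                         # also the true-positive rate
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = np.sqrt(recall * (tn / (tn + fp)))     # sqrt(TP rate * TN rate)
    return oa, precision, recall, f_measure, g_mean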
…can methods such as adjusting the class balance help the learning capabilities? This is a fundamental and critical question in this domain. In fact, the importance of this question was previously addressed by F. Provost in the invited paper for the AAAI'2000 Workshop on Imbalanced Data Sets [1]:

“Isn't the best research strategy to concentrate on how machine learning algorithms can deal most effectively with whatever data they are given?”

Based on our simulation results, we believe that this fundamental question should be investigated in more depth, both theoretically and empirically, in the research community to correctly understand the essence of imbalanced learning problems.

D. Discussions

As a new learning method, ADASYN can be further extended to handle imbalanced learning in different scenarios, and can therefore potentially benefit a wide range of real-world applications that involve learning from imbalanced data sets. We give a brief discussion of possible future research directions in this section.

First of all, in our current study we compared the ADASYN algorithm to a single decision tree and to the SMOTE algorithm [15] for performance assessment. This is mainly because all of these methods are single-model based learning algorithms. Statistically speaking, ensemble based learning algorithms can improve the accuracy and robustness of learning performance; thus, as a future research direction, the ADASYN algorithm can be extended for integration with ensemble based learning algorithms. To do this, one would need to use a bootstrap sampling technique to sample the original training data set, and then embed ADASYN in each sampled set to train a hypothesis. Finally, a weighted combination voting rule similar to that of AdaBoost.M1 [35] [36] can be used to combine the decisions from the different hypotheses into the final predicted output. In such a situation, it would be interesting to compare the performance of such a boosted ADASYN algorithm with those of SMOTEBoost [16], DataBoost-IM [17], and other ensemble-based approaches.
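A rough sketch of this ensemble direction, under simplifying assumptions of our own: bootstrap the training set, rebalance each replicate with the adasyn sketch defined earlier, train one decision tree per replicate, and combine by unweighted majority vote. The AdaBoost.M1-style weighted vote mentioned above would additionally weight each hypothesis by its training error; this simplified bagging variant is only an illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ensemble_adasyn(X, y, n_models=11, minority_label=1, rng=None):
    """Train one decision tree per ADASYN-rebalanced bootstrap replicate.
    Uses an odd n_models so that majority votes over {1, -1} cannot tie."""
    rng = np.random.default_rng(rng)
    models = []
    for _ in range(n_models):
        boot = rng.integers(len(X), size=len(X))             # bootstrap indices
        Xb, yb = X[boot], y[boot]
        syn = adasyn(Xb, yb, minority_label=minority_label)  # sketch defined earlier
        if len(syn):
            Xb = np.vstack([Xb, syn])
            yb = np.concatenate([yb, np.full(len(syn), minority_label)])
        models.append(DecisionTreeClassifier().fit(Xb, yb))
    return models

def predict_vote(models, X):
    votes = np.stack([m.predict(X) for m in models])         # (n_models, n_samples)
    return np.sign(votes.sum(axis=0))                        # majority vote over {1, -1}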