Zhuoyuan Zheng
Guangxi Key Laboratory of Trusted Software
Guilin University of Electronic Technology, Guilin 541004, China
e-mail: [email protected]
Yunpeng Cai, Ye Li
Shenzhen Institutes of Advanced Technology
Key Laboratory for Biomedical Informatics and Health Engineering
Chinese Academy of Sciences
1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen 518055
e-mail: {yp.cai, ye.li}@siat.ac.cn
1 INTRODUCTION
Classification is one of the most important topics in machine learning. Various classification algorithms, such as decision trees, BP neural networks, Bayesian networks, k-nearest neighbors and support vector machines, have been developed and are widely used in many fields. Almost all of these algorithms suffer from the problem of imbalanced datasets, in which some classes contain far more instances than others. Imbalanced data usually biases classification and leads to poor generalization performance. In many real-world applications, including fraud detection [1, 2], bioinformatics [3, 4, 5, 6], text classification [7, 8] and medicine [9, 10], the class of interest has a frequency of less than 0.01 among all cases. In these applications the minority class is the most interesting one, and its identification is of utmost importance. This requires a fairly high detection rate for the minority class, while a small error rate on the majority class is usually acceptable, since the cost of misclassifying a majority instance is relatively low. The class imbalance problem is of crucial importance: it can create a significant bottleneck in the performance attainable by standard learning methods, which assume a balanced class distribution. A large body of work [11, 12, 13, 14, 15, 16, 17, 18] has shown the importance of this problem for classification, and many studies [19, 20, 21, 22, 23, 24, 25, 26, 27] have demonstrated that sample balancing provides a significant quality improvement in real-world applications, including financial risk control, image recognition, medicine, biology, text mining and time series. Imbalanced data was identified as one of the ten challenging problems in data mining research long ago [28]. In recent years, classification of imbalanced datasets has become an increasingly active research topic.
Sun et al. [29] investigated the nature of the imbalanced classification problem and argued that imbalanced class distribution, small sample size, class separability and within-class concepts are the main factors causing it. Corresponding to these different roots of the problem, different solution schemes for imbalanced classification have been put forward. For example, to tackle the imbalanced class distribution itself, researchers devised a variety of sampling methods, such as random over- or under-sampling, SMOTE [30] and its derivatives, and CBO [31]. Many pertinent algorithms (e.g. cost-sensitive learning and one-class models) have also been developed to alleviate the class imbalance problem in classification. Details of solutions to imbalanced classification are given in Section 2. Among these solutions, SMOTE oversampling has received the most attention in recent research, and researchers have since proposed a number of methods based on it, including Borderline-SMOTE [32], ADASYN [33], SMOTEBoost [34], CBSO [35], RAMOBoost [36] and KSMOTE [37]. Their experiments show that all of these methods alleviate the imbalance problem to some extent. Despite modifications in various respects, the key idea of these methods remains essentially analogous to SMOTE. Our investigation in this paper suggests that a severe flaw exists in this type of oversampling. Specifically, new samples created by SMOTE invariably lie on the line segment between seed samples, so the synthetic samples cannot really simulate the distribution of the original samples. Besides, the nearest-neighbor search used in SMOTE does not take the distribution of subclasses into account, which can lead to overlapping between classes. We propose a new oversampling method, SNOCC, to overcome these defects. The basic idea is to employ an improved sample-generation method that keeps new synthetic samples from being confined to the line segment between seed samples, and to adopt a different nearest-neighbor search to tackle the problem of class overlapping. SNOCC integrates distance-based clustering with oversampling, which distinguishes it from SMOTE. Our experiments show that SNOCC oversampling generates better synthetic samples than SMOTE and the methods based on it.
Our paper is organized as follows. Section 2 briefly introduces advances in the domain of imbalanced data. In Section 3 we analyze the SMOTE method and explain why it may cause generalization error. Section 4 describes our SNOCC oversampling method. Section 5 presents our experimental results and compares them with those of other methods. Finally, we present our conclusions in Section 6.
2 RELATED WORKS
For a minority class sample x_i, one of its k nearest neighbors is randomly selected as x_0 (we call the instances x_i and x_0 seed samples). Then a random number δ in [0, 1] is generated. The new artificial sample x_new is created as:
x_{\mathrm{new}} = x_i + (x_0 - x_i) \times \delta \quad (1)
By Equation (1), each new sample lies on the line segment between its two seed samples. When these new samples serve as training data, it is inevitable that they introduce some additional error.
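To make Equation (1) concrete, the following is a minimal, illustrative sketch of this kind of interpolation-based oversampling (not the original SMOTE implementation); the function name, the pure-NumPy neighbor search and the parameter defaults are our own assumptions.

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=5, seed=None):
    """Create n_new synthetic samples via Equation (1):
    x_new = x_i + (x_0 - x_i) * delta, where x_0 is one of the
    k nearest neighbors of the seed sample x_i and delta ~ U[0, 1]."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)
    # Pairwise Euclidean distances between the minority samples.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a sample is not its own neighbor
    knn = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbors
    new_samples = np.empty((n_new, minority.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(minority))     # pick a seed sample x_i
        x0 = minority[rng.choice(knn[i])]   # pick one of its neighbors as x_0
        delta = rng.random()                # random number in [0, 1]
        new_samples[j] = minority[i] + (x0 - minority[i]) * delta
    return new_samples
```

Every sample produced this way lies on a segment joining two existing minority samples, which is exactly the limitation illustrated in Figures 2 and 3 below.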
Figure 2. The distribution of new samples for oversampling 3 samples 100 times using SMOTE
Figure 3. The distribution of new samples for oversampling 10 samples 500 times using SMOTE
Figure 4. The distribution of all positive and negative samples in the example for classification
Figure 5. The distribution of training samples and misclassified samples (marked in red) in the example for classification
New samples created by SMOTE are thus confined to the line segment between seed samples, and they cannot reproduce the distribution of the original samples. This is an important cause of poor classifier performance: since the classifier cannot obtain real information about the original samples, there is a higher probability that samples will be misclassified. Besides, SMOTE does not consider neighboring samples, which may cause the decision boundary of the minority class to spread further into the majority class space. This can result in the problem of overlapping between classes [35, 47].
x = \alpha_1 x_1 + \alpha_2 x_2 + \ldots + \alpha_n x_n \quad (2)
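Equation (2) expresses a new sample as a linear combination of n seed samples. The paper later notes that SNOCC places new samples inside the convex hull of the seed samples, which implies α_i ≥ 0 and Σα_i = 1. As a rough sketch only (the actual SNOCC coefficient scheme is the authors'; the uniform Dirichlet draw below is our assumption), such a combination can be generated as follows.

```python
import numpy as np

def convex_combination_samples(seeds, n_new, seed=None):
    """Create n_new samples as convex combinations of the seed samples
    (Equation (2) with alpha_i >= 0 and sum(alpha_i) == 1), so that every
    new sample lies inside the convex hull of the seeds."""
    rng = np.random.default_rng(seed)
    seeds = np.asarray(seeds, dtype=float)                     # (n, n_features)
    # Dirichlet(1, ..., 1) draws the coefficient vector uniformly from the
    # probability simplex; this particular choice is only for illustration.
    alphas = rng.dirichlet(np.ones(len(seeds)), size=n_new)    # (n_new, n)
    return alphas @ seeds                                      # (n_new, n_features)
```

Unlike Equation (1), which interpolates between only two seeds, a combination of n seeds can fall anywhere inside their convex hull, which is the behavior visible in Figures 6-8.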
Figure 6. The distribution of new samples for oversampling 3 samples 100 times using SNOCC
Figure 7. The distribution of new samples for oversampling 10 samples 500 times using SNOCC
… original samples. Together with Figures 7 and 3, Figures 8 and 9 show that the new samples of SNOCC oversampling better reproduce the distribution of the original samples, even when the distribution is irregular (Figure 8).
It is natural for SNOCC to handle continuous features. For an ordinal feature, we first map its values onto an integer sequence: if the feature has n distinct values, they are mapped to (1, 2, . . . , n). During oversampling, the calculation uses these integers rather than the original ordinal values, and the results are rounded to the nearest integer. Finally, each integer is mapped back to the original ordinal value by the inverse of the mapping above.
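As a small sketch of this encoding-and-decoding step (the helper names and the level ordering are hypothetical), the mapping could look like this:

```python
import numpy as np

def encode_ordinal(values, ordered_levels):
    """Map ordinal values to their ranks 1..n before oversampling."""
    rank = {level: i + 1 for i, level in enumerate(ordered_levels)}
    return np.array([rank[v] for v in values], dtype=float)

def decode_ordinal(codes, ordered_levels):
    """Round synthetic codes to the nearest valid rank and map back."""
    codes = np.clip(np.rint(codes), 1, len(ordered_levels)).astype(int)
    return [ordered_levels[c - 1] for c in codes]
```

For instance, with the levels ("low", "medium", "high") a synthetic code of 2.4 is rounded to 2 and decoded back to "medium".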
Figure 8. The distribution of new samples for oversampling 20 samples 500 times using SNOCC
5 EXPERIMENTS
Figure 9. The distribution of new samples for oversampling 20 samples 500 times using SMOTE
[Figure: flowchart of the experimental procedure (Steps 1-6), from the imbalanced dataset through SMOTE, CBSO and SNOCC oversampling of the positive samples, the corresponding training datasets, and the classifiers learned from each of them.]
Dataset                 Number of Features   Sample Size   Positive Samples   Negative Samples   Imbalance Ratio
ecoli-0-1-3-7 vs 2-6    7                    281           7                  274                39.14
ecoli4                  7                    336           20                 316                15.8
glass-0-1-6 vs 5        9                    184           9                  175                19.44
glass5                  9                    214           9                  205                22.78
yeast-0-5-6-7-9 vs 4    8                    528           51                 477                9.35
yeast-1-2-8-9 vs 7      8                    947           30                 917                30.57
yeast-1-4-5-8 vs 7      8                    693           30                 663                22.1
yeast-1 vs 7            7                    459           30                 429                14.3
yeast-2 vs 4            8                    514           51                 463                9.08
yeast4                  8                    1484          51                 1433               28.1
yeast5                  8                    1484          44                 1440               32.73
yeast6                  8                    1484          35                 1449               41.4
Step 2: All positive samples in the training dataset are picked out, and the numbers of positive and negative samples are counted. The number of new positive samples to be generated is the difference between the number of negative samples and the number of positive samples.
Step 3: The oversampling methods SMOTE, CBSO and SNOCC are called to create new synthetic SMOTE, CBSO and SNOCC samples, respectively.
Step 4: The SMOTE, CBSO and SNOCC samples are added to the original training dataset to form new SMOTE, CBSO and SNOCC training datasets.
Step 5: A classifier is learned from the original, SMOTE, CBSO and SNOCC training datasets, respectively.
Step 6: The samples in the test dataset are classified using the classifiers learned in Step 5, and the corresponding F-measure (F-value) for the positive class is calculated.
To make a fair comparison with SMOTE, the number of nearest neighbors k in SNOCC was set to 5, the same as the default value in SMOTE. After oversampling, we obtained a training dataset with positive and negative samples in equal proportions, that is, a balanced dataset. To eliminate random effects, for each dataset we ran each oversampling algorithm 100 times, training a naive Bayes classifier and recording the corresponding F-value each time, which yielded 100 F-values per oversampling algorithm. Finally, a t-test was used to verify the significance of the F-value differences between methods.
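A rough sketch of one repetition of this protocol is given below. The naive Bayes variant, the 70/30 train/test split and the helper names are our assumptions, since the paper does not specify its implementation; labels are assumed to be 1 for the positive (minority) class and 0 for the negative class, and `snocc_oversample` is a hypothetical stand-in for any of the oversampling routines (step numbers refer to the procedure above).

```python
import numpy as np
from scipy import stats
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def run_once(X, y, oversample, split_seed=None):
    """One repetition: split, oversample the positives until the classes
    are balanced, train naive Bayes, return the positive-class F-value."""
    X, y = np.asarray(X, dtype=float), np.asarray(y).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=split_seed)
    pos = X_tr[y_tr == 1]                               # Step 2
    n_new = int(np.sum(y_tr == 0)) - len(pos)
    X_new = oversample(pos, n_new)                      # Step 3
    X_bal = np.vstack([X_tr, X_new])                    # Step 4
    y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])
    clf = GaussianNB().fit(X_bal, y_bal)                # Step 5
    y_pred = clf.predict(X_te)                          # Step 6
    return f1_score(y_te, y_pred, pos_label=1)

# 100 repetitions per oversampling method, then a t-test on the F-values.
# f_snocc = [run_once(X, y, snocc_oversample, split_seed=i) for i in range(100)]
# f_smote = [run_once(X, y, smote_like_oversample, split_seed=i) for i in range(100)]
# t_stat, p_value = stats.ttest_ind(f_snocc, f_smote)
```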
We computed the mean and standard deviation of the F-values of the 100 classifications without oversampling, with SMOTE oversampling, with CBSO oversampling and with SNOCC oversampling, respectively. The results are shown in Table 2. For each dataset, the largest F-value among the oversampling methods is bold-faced. Of the twelve largest F-values, nine belong to SNOCC and the remaining three to CBSO, which shows that SNOCC performs much better at generating new synthetic samples. A t-test was performed at the 0.05 significance level to compare SNOCC with SMOTE and SNOCC with CBSO, respectively. The test results are presented in Table 3; combined with Table 2, they determine the winner in each case. The value of h is 1+ if SNOCC is the winner and 1− otherwise. From Table 3 we find that SNOCC outperforms SMOTE on ten of the twelve datasets, while SMOTE outperforms SNOCC on two. SNOCC outperforms CBSO on eight of the twelve datasets, and CBSO outperforms SNOCC on three. On the whole, the experimental results show that SNOCC performs significantly better than SMOTE and CBSO.
6 CONCLUSIONS
After decades of development, classification techniques have matured considerably. However, most existing classifiers tend to identify majority class samples and usually fail to classify minority class samples with satisfactory accuracy.
Table 2. F-measure values of classification with different oversampling methods

Dataset                 RAW             SMOTE           CBSO            SNOCC
ecoli-0-1-3-7 vs 2-6    0.2992±0.1906   0.5100±0.1404   0.5028±0.1336   0.5749±0.1413
ecoli4                  0.3626±0.0746   0.6723±0.0539   0.6850±0.0586   0.6552±0.0484
glass-0-1-6 vs 5        0.5597±0.1651   0.5620±0.0445   0.6186±0.0685   0.7219±0.0855
glass5                  0.5247±0.1394   0.5407±0.0402   0.5801±0.0743   0.6651±0.0871
yeast-0-5-6-7-9 vs 4    0.0356±0.0258   0.3272±0.0212   0.3459±0.0262   0.3363±0.0246
yeast-1-2-8-9 vs 7      0.0265±0.0348   0.0830±0.0048   0.0826±0.0044   0.0947±0.0116
yeast-1-4-5-8 vs 7      0.0091±0.0216   0.1100±0.0051   0.1079±0.0056   0.1286±0.0138
yeast-1 vs 7            0.1311±0.0658   0.2463±0.0235   0.2365±0.0183   0.2613±0.0380
yeast-2 vs 4            0.6057±0.0419   0.6739±0.0268   0.6748±0.0301   0.6630±0.0320
yeast4                  0.0457±0.0289   0.1620±0.0094   0.1813±0.0108   0.1818±0.0114
yeast5                  0.4758±0.0540   0.5279±0.0125   0.5368±0.0121   0.5747±0.0201
yeast6                  0.1882±0.0703   0.2041±0.0141   0.2288±0.0169   0.3903±0.0309
This table shows the F-measure values of classification obtained with the different oversampling methods, which serve as column titles. The second column, titled RAW, gives the F-value without oversampling. Each cell contains the average F-value ± standard deviation. In each row, the largest F-value is bold-faced.
There is plenty of imbalanced data in application domains, basically because either the data are very difficult to collect or positive samples are rare in the collected data [50, 51]. This poses a challenge to the academic community. SMOTE oversampling, proposed by Chawla et al., gave us a good start on tackling the problem of imbalanced distributions, and researchers have since done a lot of fruitful work based on it.
Even so, we find some weaknesses in SMOTE. In classification, the distribution of the training data greatly influences the generalization ability of a classifier. New samples generated by SMOTE oversampling are confined to the line segment between two seed samples (Figures 2, 3 and 9), which means that they cannot fully cover the distribution space of the original samples. This is a major cause of generalization error in the classifier. Besides, SMOTE introduces the problem of overlapping between classes [35, 47].
The SNOCC method proposed in this paper remedies these defects of SMOTE. In SNOCC, the sample-generation method allows each new sample to be located anywhere in the convex hull formed by the seed samples, as can be seen in Figures 6, 7 and 8. To minimize the adverse effect of class overlapping on classification performance, we search for neighbors of samples not with k-nearest neighbors but with a distance-based nearest-neighbor search. This change can also improve the efficiency of oversampling, and our experimental results support this conclusion. SNOCC generates samples that naturally model the distribution of the original samples, and our experiments show that it outperforms SMOTE and the methods derived from it. Our method lost on two and three datasets against SMOTE and CBSO, respectively. However, the performance difference between our method and the others on these datasets is marginal, as can be seen from Table 3: the smallest p-value on these datasets was 0.000131774, whereas it was below 10^-8 on most of the datasets that our method won.
Table 3. Results of the t-tests comparing SNOCC with SMOTE (t-test 1) and with CBSO (t-test 2)

                         t-test 1                  t-test 2
Dataset                  h     p                   h     p
ecoli-0-1-3-7 vs 2-6     1+    0.001391304         1+    0.00029264
ecoli4                   1−    0.019986531         1−    0.000131774
glass-0-1-6 vs 5         1+    4.63 × 10^-39       1+    1.53 × 10^-17
glass5                   1+    4.51 × 10^-28       1+    3.97 × 10^-12
yeast-0-5-6-7-9 vs 4     1+    0.006099216         1−    0.008741365
yeast-1-2-8-9 vs 7       1+    2.25 × 10^-17       1+    1.89 × 10^-18
yeast-1-4-5-8 vs 7       1+    2.99 × 10^-27       1+    6.01 × 10^-31
yeast-1 vs 7             1+    0.000997247         1+    1.89 × 10^-8
yeast-2 vs 4             1−    0.010792733         1−    0.008674209
yeast4                   1+    3.57 × 10^-29       0     0.777325853
yeast5                   1+    2.51 × 10^-48       1+    9.15 × 10^-38
yeast6                   1+    4.03 × 10^-121      1+    7.13 × 10^-107
Column t-test 1 shows the result of the t-test for a significant difference between the F-values of SMOTE and SNOCC at the 5 % significance level; sub-column h gives the test result and p the corresponding p-value. A 1+ indicates that SNOCC performs significantly better than SMOTE, and a 1− indicates the reverse.
Column t-test 2 shows the corresponding t-test between the F-values of CBSO and SNOCC at the 5 % significance level; again, h gives the test result and p the corresponding p-value. A 1+ indicates that SNOCC performs significantly better than CBSO, and a 1− indicates the reverse.
A value of h equal to 0 indicates that there is no significant difference between the two methods at the 5 % significance level.
We think that a possible reason for this difference is that the problem of overlapping between classes in these datasets is either not severe or absent altogether, so it has little or no negative effect on SMOTE or CBSO. Besides, compared with these benchmark methods, our experimental results show that our method achieves much better performance on datasets with higher imbalance ratios. More experiments will be carried out to improve the method in future work. At present, SNOCC can only handle continuous and ordinal features; future work will focus on how to deal with categorical and Boolean attributes.
Acknowledgements
This work was supported in part by Shenzhen Innovation Funding for Advance
Talents (No. KQCX20130628112914291), the Promotion and Development Project
for Key Laboratory of Shenzhen (No. CXB201104220026A), the National Natural
Science Foundation of Youth Science Foundation (No. 31000447), the National Nat-
ural Science Foundation of China (No. 61363005, 61462017, 61462018), Guangxi
Natural Science Foundation (No. 2014GXNSFDA118036, 2013GXNSFAA019324),
Education Department of Guangxi (No. ZD2014049, 2013YB084), Guangxi Key
Laboratory of Trusted Software (No. kx201512). The authors would also like to
thank the anonymous reviewers for their helpful comments and suggestions for im-
proving the manuscript.
REFERENCES