Li 2011
Abstract: Many classifiers that perform well on balanced data sets perform poorly on imbalanced data sets. This article applies the over-sampling method Random-SMOTE (R-S), which is based on the SMOTE method, to imbalanced data mining on several data sets. We use the R-S method to increase the number of minority samples randomly in the minority sample space, so that the number of minority samples becomes almost equal to the number of majority samples in data mining tasks. Five imbalanced UCI data sets were balanced with this integrated data mining process, and the Logit algorithm was used for classification on these data sets. The results show that the integrated use of R-S and Logit in data mining improves the performance of the classifier significantly.

Keywords-Imbalanced Data Set; Data Mining; Integrated Use of Random-SMOTE and Logit

I. INTRODUCTION

Traditional classification approaches are mainly based on the following hypotheses: (1) classification accuracy is taken as the evaluation criterion; (2) the number of samples in each class is almost equal; (3) the cost of misclassification is the same for each class [1]. There are many classical classification algorithms based on these hypotheses, such as Bayesian algorithms and neural networks. But in the real world, the numbers of samples coming from different classes are not always equal; sometimes they are even extremely different. Usually, more attention should be paid to the minority samples, because the cost of misclassifying a minority sample is higher. The recognition of cheating in credit card transactions, the prediction of telecommunication equipment failures, and the forecasting of business failures are some examples [2]. Another example is the use of an imbalanced data set of satellite images for detecting oil well eruptions: only 41 images out of 937 show suspended oil [3]. Obviously, people hope to establish an excellent classification model which can detect all the images with suspended oil precisely, in order to avoid pollution.

The information in the minority class is insufficient compared with that in the majority class. It is easily drowned out by the information of the majority, which leads to misclassification. As a result, the performance of a classifier on balanced data sets is far better than on imbalanced ones. Therefore, the traditional classification approaches and their evaluation criteria are not suitable for imbalanced classification, and explorations for new classifiers have become a hot research topic in machine learning.

Substantial achievements in classifiers for imbalanced data sets have been presented, following different approaches. Sampling methods focus on the perspective of the data: they reconstruct the data set artificially to reduce the degree of imbalance. Over-sampling increases the number of minority samples, but it may lead to over-fitting because of the duplication of data; under-sampling [4], on the other hand, cuts down the number of majority samples, but it may lose information about the majority and decrease classification performance. The other family of methods focuses on the algorithm, introducing mechanisms that compensate for the imbalance and make the classifier suitable for imbalanced data sets. Examples include cost-sensitive learning, revisions of the support vector machine algorithm, and ensemble methods. There are many mechanisms for revising algorithms for imbalanced data mining, for instance adjusting the cost function, using different class weights, changing the probability density, and trimming the classification boundary. Cost-sensitive learning algorithms use the cost of each class to make classification decisions; their target is to cut down the overall cost instead of reducing the error rate as much as possible [5]. Support vector machines have been modified to process imbalanced data sets: one simple modification skews the majority boundary moderately, so that fewer minority samples are misclassified [6]; another modification assigns different costs to the majority and the minority [7]. Ensemble methods combine many classifiers to form a new classifier and improve its performance. Through boosting, many weak classifiers can construct a strong one. AdaBoost is a representative boosting algorithm, which assigns iteration weights to the training data set [8]. Based on this algorithm, AdaCost changes the weight-updating rule to a cost-sensitive one and assigns bigger weights to misclassified minority samples, in order to obtain a lower misclassification cost than AdaBoost [9]. Some scholars have proposed the SMOTEBoost algorithm, which combines boosting with the SMOTE sampling method. Since a boosting algorithm tends to assign the minority bigger weights, its effect is almost the same as duplicating minority samples, so it also suffers from over-fitting; SMOTEBoost therefore uses the SMOTE algorithm to generate new minority samples [10].
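As a toy illustration of the two data-level strategies just described (not the method used in this article), random over-sampling balances the classes by duplicating minority samples, while random under-sampling discards majority samples. The helper names below are hypothetical:

```python
import random

def random_oversample(minority, majority, seed=0):
    """Duplicate randomly chosen minority samples until the classes are balanced."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def random_undersample(minority, majority, seed=0):
    """Discard randomly chosen majority samples until the classes are balanced."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))

minority = [[0.1], [0.2], [0.3]]
majority = [[1.0], [1.1], [1.2], [1.3], [1.4], [1.5]]
over_min, over_maj = random_oversample(minority, majority)
under_min, under_maj = random_undersample(minority, majority)
print(len(over_min), len(over_maj))    # 6 6: balanced by duplication
print(len(under_min), len(under_maj))  # 3 3: balanced by deletion
```

The sketch makes the trade-off described above concrete: over-sampling only repeats existing minority points (hence the over-fitting risk), and under-sampling throws away majority points (hence the information loss).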
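The Random-SMOTE (R-S) method applied in this article increases the minority randomly within the minority sample space. Following the two-step interpolation described for Random-SMOTE in [12], a minimal sketch (our illustration, not the authors' implementation) is: for each synthetic sample, draw a minority point x and two helper minority points y1 and y2, place a temporary point at random on the segment y1-y2, then interpolate at random between x and that temporary point.

```python
import random

def random_smote(minority, n_new, seed=0):
    """Generate n_new synthetic minority samples inside the minority sample space."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        y1, y2 = rng.choice(minority), rng.choice(minority)
        r1, r2 = rng.random(), rng.random()
        # Temporary point on the segment y1-y2, then a point between x and it.
        t = [a + r1 * (b - a) for a, b in zip(y1, y2)]
        synthetic.append([a + r2 * (b - a) for a, b in zip(x, t)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
balanced_minority = minority + random_smote(minority, n_new=6)  # 9 minority samples
```

Each synthetic point is a random convex combination of existing minority points, so it lies inside the minority sample space rather than merely duplicating existing samples.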
With R-S, the new minority samples are N times as numerous as the original minority sample size. The new sampling method R-S balances the minority samples of the five data sets to the majority.

C. Partition of Data Sets

Each empirical data set is divided into a training sample set and a testing sample set to calculate forecasting accuracy. Two-thirds of the samples are selected randomly as training samples to establish the Logit model; the remaining samples are used as testing samples to test the performance of the model. More specifically, two-thirds of the negative samples (after re-sampling) and two-thirds of the positive samples are chosen; these two parts together form the training set, and the testing set is composed of the remaining samples. This process is carried out 100 times. The forecasting accuracy of negative samples is obtained from the Logit model each time, and the mean accuracy is calculated at the end.

TABLE III. ACCURACY RATE OF MINORITY SAMPLES

                            Glass     Haberman   Pima      Vowel     Wine
  Without Sampling Method   16.2%     13.26%     56.46%    72.2%     98.13%
  With R-S                  88.47%    64.46%     99.69%    99.95%    94.23%

We should also inspect the forecast accuracy of the positive samples, which equals TP/(TP + FN), to see whether the accuracy on the positive samples is influenced by R-S, and whether it is improved. It is necessary to compare the forecasting accuracy of positive samples with R-S against the results without the re-sampling method. The comparisons are shown in Table 4.
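The partition-and-evaluation procedure of Section C can be sketched end to end. The sketch below is a minimal illustration on a hypothetical toy data set, assuming a plain gradient-descent logistic regression as the Logit model, since the article does not list its implementation details:

```python
import math
import random

def fit_logit(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by batch gradient descent; returns weights (bias last)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi + [1.0]))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi + [1.0]):
                grad[j] += (p - yi) * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict(w, xi):
    """Class 1 iff the linear score is non-negative (probability >= 0.5)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, xi + [1.0])) >= 0.0 else 0

def minority_accuracy(X_neg, X_pos, runs=100, seed=0):
    """Repeat a stratified 2/3 train, 1/3 test split `runs` times and return
    the mean accuracy on negative (minority) test samples, i.e. TN/(TN + FP).
    Note: shuffles the input lists in place."""
    rng = random.Random(seed)
    accs = []
    for _ in range(runs):
        rng.shuffle(X_neg); rng.shuffle(X_pos)
        cn, cp = 2 * len(X_neg) // 3, 2 * len(X_pos) // 3
        w = fit_logit(X_neg[:cn] + X_pos[:cp], [0] * cn + [1] * cp)
        test_neg = X_neg[cn:]
        tn = sum(1 for xi in test_neg if predict(w, xi) == 0)
        accs.append(tn / len(test_neg))
    return sum(accs) / len(accs)

# Toy, well-separated data: negatives near (0, 0), positives near (2, 2).
rng = random.Random(1)
neg = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(12)]
pos = [[rng.gauss(2, 0.3), rng.gauss(2, 0.3)] for _ in range(24)]
print(round(minority_accuracy(neg, pos, runs=20), 2))
```

Here the negatives play the role of the minority class, and the reported figure corresponds to the TN/(TN + FP) rate averaged over the repeated splits, as in the tables of this section.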
D. Evaluation Standard

The traditional classifier usually takes the error rate or the accuracy as the evaluation criterion, and the goal pursued is a high accuracy rate. This kind of evaluation criterion is obviously inappropriate for the imbalanced problem. For example, consider a binary classification data set where positive samples account for 99% of the data. An ineffective classifier that simply labels every sample as positive already achieves 99% accuracy, and it is very difficult for a genuinely better classifier to obtain an accuracy higher than 99% on the actual problem. Therefore, it is suitable to use the hybrid (confusion) matrix to evaluate the performance of a classifier applied to an imbalanced data set. The hybrid matrix is shown in Table 2.

TABLE II. HYBRID MATRIX

                           Forecast Negative sample   Forecast Positive sample
  Actual Negative sample             TN                         FP
  Actual Positive sample             FN                         TP

TN is the number of negative samples forecast correctly; FN is the number of positive samples wrongly forecast as negative; FP is the number of negative samples wrongly forecast as positive; TP is the number of positive samples forecast correctly.

VI. EXPERIMENTAL RESULT AND ANALYSIS

This article mainly aims at measuring the forecast accuracy of the minority sample. To study the sampling effect of R-S, we inspect the forecast accuracy of negative samples, which equals TN/(TN + FP). It is necessary to compare the forecasting accuracy of negative samples with R-S against the results without the re-sampling method. The comparisons are shown in Table 3.

TABLE IV. ACCURACY RATE OF MAJORITY SAMPLES

                            Glass     Haberman   Pima      Vowel     Wine
  Without Sampling Method   92.8%     95.75%     88.17%    98.08%    95.653%
  With R-S                  80.94%    47.64%     100%      100%      100%

From Table 3 we can see that, without the sampling method, the predictive model performs very badly in recognizing negative samples; in particular, its recognition capability on the negative samples of Glass and Haberman is extremely weak. With R-S the quantity of negative samples is increased, so the forecasting accuracy of the predictive model on negative samples is enhanced greatly, and the difference in performance is significant. Although the forecasting accuracy on the Haberman data set is still not very high, R-S is very effective in improving predictive performance on the majority of the data sets.

The negative samples of Glass and Vowel are absolutely sparse, and R-S greatly enlarges the recognition capability on these negative samples. From Table 3 we can see that R-S can be an effective processing approach for scarce samples and obviously enhances the forecasting accuracy of the predictive model on the negative samples.

From Table 4 we can see that the R-S approach does not have a tremendously bad influence on recognizing positive samples; it can even enhance the recognition of positive samples. The recognition capability of the model on positive samples is improved on the Pima, Vowel and Wine data sets, where the forecasting accuracy reaches 100%. Obviously, R-S is efficient for imbalanced data mining.

Compared to the accuracy on positive samples, the accuracy on negative samples is very low without R-S, which shows that Logit performs badly on imbalanced data sets. When the data sets become balanced with the R-S approach, the performance of Logit has been
greatly improved, which indicates that the integrated use of the R-S sampling method in data mining is a good way to handle imbalanced data sets. This integrated use provides an effective solution to the conditions of both relatively scarce and absolutely scarce data.

VII. CONCLUSION

We implemented an experiment with five data sets from UCI and showed that the integrated use of R-S and Logit in data mining tasks can improve the predictive performance of the mining model. This approach increases the number of minority samples randomly in the minority sample space and greatly improves the identification of the minority, with no bad effect on the recognition of the majority. Five imbalanced UCI data sets were balanced with this method. The results with Logit as the predictive model show that the integrated use of the R-S method in data mining tasks can improve the performance of the classifier significantly.

ACKNOWLEDGMENT

This research is partially supported by the National Natural Science Foundation of China (No. 70801055) and the Zhejiang Provincial Natural Science Foundation of China (No. Y7100008). The authors gratefully thank the anonymous referees for their useful comments and the editors for their work.

REFERENCES

[1] Y. Sun, M. Kamel, A. Wong, et al., Cost-Sensitive Boosting for Classification of Imbalanced Data. Pattern Recognition, 2007, 40(12): 3358-3378.
[2] A.F. Atiya, Bankruptcy Prediction for Credit Risk Using Neural Networks: A Survey and New Results. IEEE Transactions on Neural Networks, 2001, 12(4): 929-935.
[3] M. Kubat, R. Holte, S. Matwin, Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning, 1998, 30(2/3): 195-215.
[4] M. Kubat, S. Matwin, Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. ICML, 1997: 179-186.
[5] C. Elkan, The Foundations of Cost-Sensitive Learning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01), 2001: 973-978.
[6] B. Raskutti, A. Kowalczyk, Extreme Re-Balancing for SVMs: A Case Study. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 60-69.
[7] R. Akbani, S. Kwek, N. Japkowicz, Applying Support Vector Machines to Imbalanced Datasets. Lecture Notes in Computer Science, 2004: 39-50.
[8] Y. Freund, R.E. Schapire, A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 1997, 55(1): 119-139.
[9] W. Fan, S.J. Stolfo, J. Zhang, AdaCost: Misclassification Cost-Sensitive Boosting. The 16th International Conference on Machine Learning (ICML'99), 1999: 97-105.
[10] N.V. Chawla, A. Lazarevic, L.O. Hall, SMOTEBoost: Improving Prediction of the Minority Class in Boosting. Lecture Notes in Computer Science, 2003: 107-119.
[11] N.V. Chawla, K.W. Bowyer, L.O. Hall, W. Kegelmeyer, SMOTE: Synthetic Minority Over-Sampling Technique. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
[12] Y. Dong, Random-SMOTE for Learning from Imbalanced Data Sets. Dalian University of Technology, 2009.