Two Novel SMOTE Methods For Solving Imbalanced Classification Problems
ABSTRACT The imbalanced classification problem has always been one of the important challenges in neural networks and machine learning. The synthetic minority oversampling technique (SMOTE) is an effective method for dealing with imbalanced classification problems, but it has a disadvantage: some noise samples may participate in the process of synthesizing new samples, so the new synthetic samples lack rationality, which reduces the classification performance of the network. To remedy this shortcoming, two novel improved SMOTE methods are proposed in this paper: the center point SMOTE (CP-SMOTE) method and the inner and outer SMOTE (IO-SMOTE) method. The CP-SMOTE method generates new samples by finding several center points and then linearly combining the minority samples with their corresponding center points. The IO-SMOTE method divides the minority samples into inner and outer samples, and then uses inner samples as much as possible in the subsequent process of generating new samples. Numerical experiments show that, compared with the no-sampling and conventional SMOTE methods, the CP-SMOTE and IO-SMOTE methods achieve better classification performance.
INDEX TERMS Imbalanced classification problems, IO-SMOTE method, CP-SMOTE method, machine learning.
the original dataset, some information will be lost. That is, deleting the majority samples might cause the classifier to lose important information about the majority class.

In order to overcome the shortcomings of the undersampling method, researchers have proposed oversampling methods [12], [13]. The basic idea of the oversampling method is to add some minority samples so that the numbers of positive and negative samples become balanced. The simplest random oversampling (ROS) method [14] randomly selects some samples from the minority set S_minor, generates a sample set E by copying the selected samples, and then adds E to S_minor to obtain a new minority class set S_minor + E. However, for the ROS method, the complexity of training the networks increases due to the duplication of the minority samples. On the other hand, it easily causes over-fitting problems, because the ROS method simply copies the initial samples, which is not conducive to the generalization performance of the network.

In order to solve the over-fitting problem [15] caused by the ROS method, and simultaneously ensure that the dataset is balanced, Chawla [16] proposed the synthetic minority oversampling technique (SMOTE). The basic idea of the SMOTE method is as follows: for each minority sample x_i, randomly choose a sample x_i' from its neighbors (x_i' is also a minority sample); then randomly select a point on the line between x_i and x_i' as the new synthetic minority sample.
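As an illustration of this interpolation step (the paper itself contains no code), a minimal Python sketch of the conventional SMOTE idea might look as follows; the function name smote_oversample and the parameters k and n_new are our own, and scikit-learn's NearestNeighbors is used to find the minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_minor, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: interpolate between a minority sample and one
    of its k nearest minority neighbors (illustrative, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each sample is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minor)
    _, idx = nn.kneighbors(X_minor)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minor))       # pick a minority sample x_i
        j = rng.choice(idx[i][1:])           # pick one of its k minority neighbors x_i'
        eta = rng.random()                   # random point on the segment between them
        synthetic.append(X_minor[i] + eta * (X_minor[j] - X_minor[i]))
    return np.vstack(synthetic)
```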
Based on the SMOTE method, many researchers have made improvements and achieved better classification results. The Borderline-SMOTE [17] oversampling process divides the minority samples into three categories: safe, danger and noise; then only the danger samples are employed to generate the new samples. Radius-SMOTE first selects a minority sample x_i and calculates a radius according to the k-nearest neighbors; it then takes x_i as the center and randomly finds several points whose distance to x_i is less than the radius. The R-SMOTE method [18] eliminates the limitation on the distribution of the generated minority class instances and improves the classification accuracy of the minority class. ADASYN [19] was proposed to generate new minority class samples near the original samples that were misclassified by the k-nearest neighbor classifier.

For the original SMOTE method, some noise samples might participate in the process of synthesizing new samples. Thus, the new synthetic samples lack rationality, which reduces the classification performance of the classifier. The purpose of this paper is to propose two novel improved SMOTE methods: the center point SMOTE (CP-SMOTE) method and the inner and outer SMOTE (IO-SMOTE) method. The novel CP-SMOTE method generates new samples by finding several center points and making a linear combination of the minority samples and their corresponding center points. As another alternative way to avoid noise samples, the IO-SMOTE method divides the minority samples into inner and outer parts, and then uses inner samples as much as possible in the subsequent process of generating new samples. Numerical experiments are carried out to compare the CP-SMOTE and IO-SMOTE methods with the no-sampling and conventional SMOTE methods. In terms of the classification accuracy rate, prediction rate, recall rate, F1-measure and some other indicators, the CP-SMOTE and IO-SMOTE methods have their own advantages, and on the whole, these two methods are much better than the SMOTE method.

The remainder of this paper is organized as follows. The descriptions of the CP-SMOTE and IO-SMOTE methods are given in Section II. In Section III, the experiment settings are presented and numerical experiments on four datasets, together with the corresponding analysis, are carried out. At last, the conclusion is presented in Section IV.

II. CP-SMOTE AND IO-SMOTE METHODS
A. CP-SMOTE METHOD (CENTER POINT SMOTE METHOD)
For solving the imbalanced classification problem, the conventional SMOTE method synthesizes several minority points to balance the numbers of the various samples. However, this method blurs the boundary between the majority and minority samples. As shown in Fig. 1, suppose that A is chosen as an oversampling point; then point B is randomly selected among the k-nearest neighbor points of A, and point C is randomly generated on the line segment between A and B. However, it is not difficult to see that the neighbor points of C are majority points, and even point C itself might be a majority sample. Therefore, the new sample synthesized by the SMOTE method is an extremely unreasonable sample point, which will cause a particularly large error in the subsequent network training and affect the performance of the classifier.

To overcome the above-mentioned shortcoming of the SMOTE method, we propose a new center point SMOTE (CP-SMOTE) method. First, the k-clustering method [20], [21] is used to find several regions of the minority sample distribution. For each region, the Euclidean center point of all the minority points located in that region is calculated. For each minority point, if its distance to the center point is less than the distance of any majority sample point to the center point, then a new point is randomly selected between the minority sample point and the center point; otherwise, the minority sample point is abandoned. As shown in Fig. 2, we find the two regions where the minority samples are located. For the right region, we calculate the distances of all points in this region to the center point O, as well as the closest distance d of all the majority sample points to O. For each minority sample D, if the distance between D and O is less than d, then we randomly synthesize a point between D and O; otherwise, D does not participate in synthesizing new sample points. For the left region, the same process is applied.
FIGURE 1. Special case of SMOTE method. The stars, circles and square denote the minority samples,
majority samples and new synthetic sample, respectively.
FIGURE 2. CP-SMOTE method. The stars, circles and triangle denote the minority samples, majority
samples and center point, respectively.
The process is given in Algorithm 1:

Step 1: Divide the imbalanced dataset into majority class samples and minority class samples.
Step 2: The k-clustering method is employed to find n regions and the corresponding center points {O_1, O_2, ..., O_n} of the minority sample distribution, where O_i = (1/m) \sum_{j=1}^{m} D_{ij} and D_{ij} is the j-th point in the i-th region.
Step 3: For i = 1 to i = n, calculate the closest distance d_i of all the majority sample points to the point O_i.
Step 4: For each minority class sample P, calculate the distance dis of this sample to its corresponding center point.
Step 5: Compare dis with its corresponding d_i. If dis < d_i, then synthesize a point by the following criterion:

P_new = ηP + (1 − η)O_i,    (1)

where 0 < η < 1. Otherwise, the point P does not participate in synthesizing new sample points.
Step 6: Put the dataset obtained in Steps 2-5 and the original sample set together, and then train the networks.
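A minimal Python sketch of Algorithm 1 is given below, under the assumption that the k-clustering step is ordinary k-means; cp_smote, n_clusters and seed are illustrative names of ours, not the authors'.

```python
import numpy as np
from sklearn.cluster import KMeans

def cp_smote(X_minor, X_major, n_clusters=2, seed=0):
    """Sketch of CP-SMOTE (Algorithm 1): cluster the minority class, then move
    each eligible minority sample towards its cluster center, cf. Eq. (1)."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_minor)
    centers, labels = km.cluster_centers_, km.labels_
    # Step 3: d_i is the distance from center O_i to the closest majority sample
    d = np.array([np.linalg.norm(X_major - c, axis=1).min() for c in centers])
    synthetic = []
    for P, i in zip(X_minor, labels):                  # Steps 4-5
        dis = np.linalg.norm(P - centers[i])
        if dis < d[i]:                                 # only "safe" minority samples are used
            eta = rng.random()
            synthetic.append(eta * P + (1 - eta) * centers[i])
    return np.array(synthetic)
```

Step 6 then amounts to stacking the returned points onto the original training set before the network is trained.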
outer point. As shown in Fig. 3, for the minority sample x1 ,
where 0 < η < 1. Otherwise, the point P does not only one of the six-nearest neighbors is the minority sample,
participate in synthesizing new sample points. so x1 is an outer point; On the contrary, for x2 , five of the
Step 6: Put the dataset obtained in steps 2-5 and six-nearest neighbors are minority samples, so x2 is an inner
the original sample set together, and then train the point.
networks. The process is given in Algorithm 2:
Step 1: Divide the imbalanced dataset into majority
B. IO-SMOTE METHOD (INNER AND OUTER SMOTE set N and minority set M .
METHOD) Step 2: Divide the minority set M into two parts:
Given an imbalanced dataset including the minority (positive) Inner set inner and outer set outer.
set M and the majority (negative) set N , |M | < |N |. Here |M | Step 3: In the case of inner ̸= ∅ and outer ̸ = ∅,
and |N | denote the number of M and N , respectively. for each point x ∈ inner then find point y ∈ outer
FIGURE 3. IO-SMOTE method. The stars and circles denote the minority and majority samples,
respectively. x1 is an outer point, x2 is an inner point.
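A small sketch of this split is given below. We read the condition as "more than half of the c nearest neighbors are minority points", which matches the six-neighbor example of Fig. 3; this reading, the helper name split_inner_outer and the default values of c_1 and c_2 are our own assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def split_inner_outer(X_minor, X_major, c1=3, c2=7):
    """Illustrative split of the minority set M into inner and outer points."""
    X_all = np.vstack([X_minor, X_major])
    is_minor = np.arange(len(X_all)) < len(X_minor)
    nn = NearestNeighbors(n_neighbors=c2 + 1).fit(X_all)
    _, idx = nn.kneighbors(X_minor)          # column 0 is (normally) the point itself
    inner, outer = [], []
    for row, x in zip(idx, X_minor):
        # x is "inner" if, for some c in [c1, c2], more than half of its
        # c nearest neighbors belong to the minority class
        if any(is_minor[row[1:c + 1]].sum() > c / 2 for c in range(c1, c2 + 1)):
            inner.append(x)
        else:
            outer.append(x)
    return np.array(inner), np.array(outer)
```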
The process is given in Algorithm 2:

Step 1: Divide the imbalanced dataset into the majority set N and the minority set M.
Step 2: Divide the minority set M into two parts: the inner set inner and the outer set outer.
Step 3: In the case of inner ≠ ∅ and outer ≠ ∅, for each point x ∈ inner, find the point y ∈ outer closest to the point x. The IO-SMOTE method synthesizes a new point z by the following criterion:

z = ηx + (1 − η)y,    (2)

where 0 < η < 1. In this way, the number of points that the IO-SMOTE method synthesizes is equal to the number of inner points.
Step 4: For the case inner = ∅ or outer = ∅, randomly choose three points x_1, x_2 and x_3 from the minority set M. The IO-SMOTE method synthesizes a new point z by the following criterion:

z = η_2 x_1 + (1 − η_2)y,    (3)

where

y = η_1 x_2 + (1 − η_1)x_3,    (4)

and 0 < η_1, η_2 < 1.
Step 5: Put the dataset obtained in Steps 3-4 and the original sample set together, and train the networks.
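Continuing the previous sketch, the synthesis step of Algorithm 2 could be written as below; io_smote is our own name, and the number of points generated in the fallback branch of Step 4 (here taken equal to |M|) is an assumption, since the paper does not state it.

```python
import numpy as np

def io_smote(inner, outer, X_minor, seed=0):
    """Sketch of the IO-SMOTE synthesis step (Algorithm 2, Steps 3-4)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    if len(inner) > 0 and len(outer) > 0:
        # Step 3: pair every inner point with its closest outer point, Eq. (2)
        for x in inner:
            y = outer[np.linalg.norm(outer - x, axis=1).argmin()]
            eta = rng.random()
            synthetic.append(eta * x + (1 - eta) * y)
    else:
        # Step 4: fall back to three random minority points, Eqs. (3)-(4);
        # generating |M| points here is our assumption, not the paper's
        for _ in range(len(X_minor)):
            x1, x2, x3 = X_minor[rng.choice(len(X_minor), size=3, replace=False)]
            eta1, eta2 = rng.random(), rng.random()
            y = eta1 * x2 + (1 - eta1) * x3
            synthetic.append(eta2 * x1 + (1 - eta2) * y)
    return np.array(synthetic)
```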
III. NUMERICAL EXPERIMENTS
To verify the validity of the CP-SMOTE and IO-SMOTE methods, we compare them with the no-sampling and SMOTE methods on four real classification problems: ecoli1, yeast1, yeast3 and newthyroid1.

A. EXPERIMENT SETTINGS
In our experiments, the five-fold cross-validation technique is used [22], [23], [24], [25]. In detail, the dataset is equally divided into five parts; each part takes turns as the test set, while the rest serve as the training set, and this five-fold process is repeated twenty times. Adding them all together, one hundred classification results are obtained for each method-data pair. The contents of Tabs. 1-3 are obtained by averaging the corresponding 100 results.

We evaluate the class of a sample according to the actual output: if the actual output is less than 0.50, then we regard it as approximately equal to 0 and classify this sample into the negative class; otherwise, if the actual output is more than 0.50, then we regard it as approximately equal to 1 and classify this sample into the positive class. Here, the sigmoidal function is employed as the activation function:

g(x) = 1/(1 + e^{−x}).    (5)

The experiment process is given in Algorithm 3:

Step 1: Input the imbalanced dataset, the minority (positive) set M = {m_j | m_j ∈ R^n, j = 1, ..., |M|} and the majority (negative) set N = {n_j | n_j ∈ R^n, j = 1, ..., |N|}.
Step 2: The above four methods are applied to generate positive samples Q to balance the numbers of positive and negative samples, respectively.
Step 3: Five-fold cross-validation: Φ = M ∪ N ∪ Q = {(x_j, o_j) | x_j ∈ R^n, o_j = 0 or 1, j = 1, ..., T} is equally divided into five parts Φ_1, ..., Φ_5.
Step 4: For i = 1 to i = 5, let Φ_i be the test samples, while Φ \ Φ_i is the training samples, and do Steps 5 and 6.
Step 5: Train an FNN with the datasets generated by each of the above-mentioned four methods, and test the performances of these four networks.
Step 6: Train an ELM with the datasets generated by each of the above-mentioned four methods, and test the performances of these four networks.
Step 7: Repeat the above procedure (Steps 3-6) twenty times.
Step 8: Compare the one hundred experimental results of these four methods.
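A compact sketch of this protocol is shown below, using scikit-learn's StratifiedKFold; the callables oversample and train_and_score are placeholders for the oversampling methods and for the FNN/ELM training and testing of Steps 5-6, and none of these names come from the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv_score(X_minor, X_major, oversample, train_and_score,
                      n_repeats=20, seed=0):
    """Sketch of Algorithm 3: oversample the minority class once, then run
    five-fold cross validation twenty times, giving 100 results per method."""
    Q = oversample(X_minor, X_major)                   # Step 2
    X = np.vstack([X_minor, Q, X_major])               # Phi = M ∪ Q ∪ N
    y = np.concatenate([np.ones(len(X_minor) + len(Q)), np.zeros(len(X_major))])
    scores = []
    for r in range(n_repeats):                         # Step 7: twenty repetitions
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + r)
        for tr, te in skf.split(X, y):                 # Steps 3-4: five folds
            scores.append(train_and_score(X[tr], y[tr], X[te], y[te]))  # Steps 5-6
    return float(np.mean(scores))                      # Step 8: average the 100 results
```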
B. EXPERIMENTAL RESULTS
For these four datasets, the SMOTE, IO-SMOTE and CP-SMOTE methods are respectively applied to oversample the minority class samples. The newly generated samples are high-dimensional. To visualize these points, the PCA technique [26], [27] is employed to reduce the dimensionality of the sample points from the n-dimensional space to the two-dimensional space. The distributions of these points are shown in Fig. 4, where blue represents the majority sample points, red represents the minority sample points, and green represents the synthetic sample points.

FIGURE 4. Discrete point models based on four oversampling methods in two dimensions. Red plus signs represent minority sample points, blue dots denote majority samples, and green snowflakes are newly generated samples.
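A minimal version of such a plot can be produced as follows; plot_oversampling is an illustrative helper of ours, and the colors simply follow the caption of Fig. 4.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_oversampling(X_major, X_minor, X_new):
    """Project the three groups of points to 2-D with a shared PCA and
    scatter-plot them in the style of Fig. 4 (illustrative sketch)."""
    pca = PCA(n_components=2).fit(np.vstack([X_major, X_minor]))
    for X, color, marker, label in [(X_major, "blue", ".", "majority"),
                                    (X_minor, "red", "+", "minority"),
                                    (X_new, "green", "*", "synthetic")]:
        Z = pca.transform(X)
        plt.scatter(Z[:, 0], Z[:, 1], c=color, marker=marker, s=15, label=label)
    plt.legend()
    plt.show()
```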
Obviously, compared with the SMOTE method, the new synthetic sample points of the IO-SMOTE and CP-SMOTE methods are more compact, especially those of the CP-SMOTE method. For the CP-SMOTE method, the new synthetic sample points rarely appear near the class boundary, which makes the error smaller in the learning process.

Furthermore, the feedforward neural network (FNN) [28] and the extreme learning machine (ELM) [29], [30] are employed to train on the original dataset and on the new datasets obtained by the above three oversampling methods (cf. Tabs. 1-2). According to these two tables, the IO-SMOTE and CP-SMOTE methods are both better than the no-sampling and SMOTE methods in terms of training and test accuracies. Moreover, the classification accuracies of the CP-SMOTE method are slightly higher than those of the IO-SMOTE method.

At the same time, we compare the error functions of the neural network models (cf. Fig. 5). It can be seen that the dataset without oversampling has the largest error, while the dataset processed with the SMOTE method shows a significant improvement. In addition, the errors of the two novel SMOTE methods are both obviously smaller than that of the SMOTE method.
FIGURE 5. Error functions based on four oversampling methods for four datasets.
TABLE 1. Classification accuracies for four oversampling methods in ELM.
TABLE 2. Classification accuracies for four oversampling methods in FNN.
TABLE 3. Five classification criteria for four datasets.

In addition to the classification accuracy, the following criteria are used to evaluate the methods:

\sigma := \sqrt{\frac{1}{S-1}\sum_{i=1}^{S-1}\left(y_i-\bar{y}\right)^{2}},

\mathrm{RMSE} := \sqrt{\frac{1}{S}\sum_{i=1}^{S}\left(y_i-t_i\right)^{2}},

\mathrm{F1\text{-}measure} := \frac{2\times PR\times RR}{PR+RR}.
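For concreteness, these criteria can be computed from the network outputs as sketched below; the scikit-learn calls implement PR, RR and the F1-measure, while the sigma line is only one plausible reading of the formula above (the sample standard deviation of the outputs).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def classification_criteria(y_true, y_score, threshold=0.5):
    """Illustrative computation of the criteria reported in Tab. 3 from
    network outputs y_score in [0, 1] and true labels y_true in {0, 1}."""
    y_pred = (y_score > threshold).astype(int)   # the 0.50 decision rule of Sec. III-A
    return {
        "PR": precision_score(y_true, y_pred),   # prediction (precision) rate
        "RR": recall_score(y_true, y_pred),      # recall rate
        "F1": f1_score(y_true, y_pred),          # 2*PR*RR / (PR + RR)
        "RMSE": float(np.sqrt(np.mean((y_score - y_true) ** 2))),
        "sigma": float(np.std(y_score, ddof=1)), # one reading of the sigma formula above
    }
```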
Tab. 3 shows the prediction rate, recall rate, F1-measure, σ and RMSE of these four methods. On these four datasets, the IO-SMOTE and CP-SMOTE methods both have better performances than the no-sampling and SMOTE methods on all five criteria. Furthermore, on the Ecoli1, Yeast1 and Yeast3 datasets, the CP-SMOTE method performs better than the IO-SMOTE method; for the remaining dataset, Newthyroid1, the two methods have their own advantages under different evaluation criteria. Combined with the classification accuracies, the ranking of these four oversampling methods is: CP-SMOTE > IO-SMOTE > SMOTE > no-sampling. Since SMOTE and the two proposed methods are oversampling methods and do not involve the network structure, the ELM and FNN networks are sufficient for the experiments; in fact, similar results would be obtained under other network models.

IV. CONCLUSION
This paper proposes two novel improved SMOTE methods to generate new samples: the center point SMOTE (CP-SMOTE) method and the inner and outer SMOTE (IO-SMOTE) method. The CP-SMOTE method generates new samples by finding several center points and then making a linear combination of the minority samples and their corresponding center points; the IO-SMOTE method divides the minority samples into inner and outer samples, and then uses inner samples as much as possible in the subsequent process of generating new samples. Most of the samples generated by these two methods are far away from the classification boundary, which makes the error smaller in the process of training the network. Experiments are conducted on four classification problems. The experimental results reveal that the IO-SMOTE and CP-SMOTE methods both have better performances than the traditional SMOTE method.
REFERENCES
[1] M. Saini and S. Susan, "Deep transfer with minority data augmentation for imbalanced breast cancer dataset," Appl. Soft Comput., vol. 97, Dec. 2020, Art. no. 106759.
[2] Q. Li, G. Yu, J. Wang, and Y. Liu, "A deep multimodal generative and fusion framework for class-imbalanced multimodal data," Multimedia Tools Appl., vol. 79, nos. 33–34, pp. 25023–25050, Sep. 2020.
[3] E. Judith and J. M. Deleo, "Artificial neural networks," Cancer, vol. 91, no. 8, pp. 1615–1635, 2001.
[4] J.-J. Zhang and P. Zhong, "Learning biased SVM with weighted within-class scatter for imbalanced classification," Neural Process. Lett., vol. 51, no. 1, pp. 797–817, Feb. 2020.
[5] H. Zhu, H. Liu, and A. Fu, "Class-weighted neural network for monotonic imbalanced classification," Int. J. Mach. Learn. Cybern., vol. 12, no. 4, pp. 1191–1201, Apr. 2021.
[6] B. Selvalakshmi and M. Subramaniam, "Intelligent ontology based semantic information retrieval using feature selection and classification," Cluster Comput., vol. 22, no. 5, pp. 12871–12881, Sep. 2019.
[7] H. Hu, Q. Wang, M. Cheng, and Z. Gao, "Cost-sensitive semi-supervised deep learning to assess driving risk by application of naturalistic vehicle trajectories," Exp. Syst. Appl., vol. 178, Sep. 2021, Art. no. 115041.
[8] N. M. Faber, "Comment on a recently proposed resampling method," J. Chemometrics, vol. 15, no. 3, pp. 169–188, Mar. 2001.
[9] H. Yu, J. Ni, and J. Zhao, "ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data," Neurocomputing, vol. 101, pp. 309–318, Feb. 2013.
[10] S. Park and H. Park, "Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic," Computing, vol. 103, no. 1, pp. 1–24, 2021.
[11] M. A. Tahir, J. Kittler, F. Yan, and K. Mikolajczyk, "Concept learning for image and video retrieval: The inverse random under sampling approach," in Proc. 17th Eur. Signal Process. Conf., 2015, pp. 574–578.
[12] S. Kumar, M. S. Chaudhari, R. Gupta, and S. Majhi, "Multiple CFOs estimation and implementation of SC-FDMA uplink system using oversampling and iterative method," IEEE Trans. Veh. Technol., vol. 69, no. 6, pp. 6254–6263, Jun. 2020.
[13] Y. Yang, S. Fu, and E. T. Chung, "Online mixed multiscale finite element method with oversampling and its applications," J. Sci. Comput., vol. 82, no. 2, pp. 1–20, Feb. 2020.
[14] Y. Pang, Z. Chen, L. Peng, K. Ma, C. Zhao, and K. Ji, "A signature-based assistant random oversampling method for malware detection," in Proc. 18th IEEE Int. Conf. Trust, Secur. Privacy Comput. Commun./13th IEEE Int. Conf. Big Data Sci. Eng. (TrustCom/BigDataSE), Aug. 2019, pp. 256–263.
[15] J. Kolluri, V. K. Kotte, M. S. B. Phridviraj, and S. Razia, "Reducing overfitting problem in machine learning using novel L1/4 regularization method," in Proc. 4th Int. Conf. Trends Electron. Informat. (ICOEI), Jun. 2020, pp. 934–938.
[16] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: Improving prediction of the minority class in boosting," in Proc. Eur. Conf. Knowl. Discovery Databases, 2003, pp. 107–119.
[17] H. Hui, W. Y. Wang, and B. H. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," in Proc. Int. Conf. Adv. Intell. Comput., 2005, pp. 878–887.
[18] M. Naseriparsa, A. Al-Shammari, M. Sheng, Y. Zhang, and R. Zhou, "RSMOTE: Improving classification performance over imbalanced medical datasets," Health Inf. Sci. Syst., vol. 8, no. 1, pp. 1–13, Dec. 2020.
[19] H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," in Proc. IEEE Int. Joint Conf. Neural Netw., Jun. 2008, pp. 1322–1328.
[20] R. Aishwarya and V. Nagaraju, "Automatic region of interest based medical image segmentation using spatial fuzzy K clustering method," Int. J. Electron. Commun. Technol., vol. 3, no. 1, pp. 226–229, Mar. 2012.
[21] S. Mahak, "Image segmentation with modified K-means clustering method," Int. J. Recent Technol. Eng., vol. 1, no. 2, pp. 176–179, 2012.
[22] T.-T. Wong and N.-Y. Yang, "Dependency analysis of accuracy estimates in k-fold cross validation," IEEE Trans. Knowl. Data Eng., vol. 29, no. 11, pp. 2417–2427, Nov. 2017.
[23] P. Jiang and J. Chen, "Displacement prediction of landslide based on generalized regression neural networks with K-fold cross-validation," Neurocomputing, vol. 198, pp. 40–47, Jul. 2016.
[24] J. He and X. Fan, "Evaluating the performance of the K-fold cross-validation approach for model selection in growth mixture modeling," Struct. Equation Model., Multidisciplinary J., vol. 26, no. 1, pp. 66–79, Jan. 2019.
[25] T. Fushiki, "Estimation of prediction error by using K-fold cross-validation," Statist. Comput., vol. 21, no. 2, pp. 137–146, Apr. 2011.
[26] B. C. Moore, "Principal component analysis in linear systems: Controllability, observability, and model reduction," IEEE Trans. Autom. Control, vol. AC-26, no. 1, pp. 17–32, Feb. 1981.
[27] L. E. Pirogov and P. M. Zemlyanukha, "Principal component analysis for estimating parameters of the L1287 dense core by fitting model spectral maps into observed ones," Astron. Rep., vol. 65, no. 2, pp. 82–94, Feb. 2021.
[28] M. Frean, "The upstart algorithm: A method for constructing and training feedforward neural networks," Neural Comput., vol. 2, no. 2, pp. 198–209, Jun. 1990.
[29] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, nos. 1–3, pp. 489–501, 2006.
[30] G. B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 513–529, Feb. 2012.
[31] J. M. DuBois, L. S. Boylan, M. Shiyko, W. B. Barr, and O. Devinsky, "Seizure prediction and recall," Epilepsy Behav., vol. 18, nos. 1–2, pp. 106–109, May 2010.
[32] R. Wang and J. Li, "Bayes test of precision, recall, and F1 measure for comparison of two natural language processing models," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 4135–4145.
[33] H. Azami, A. Fernández, and J. Escudero, "Refined multiscale fuzzy entropy based on standard deviation for biomedical signal analysis," Med. Biol. Eng., Comput., vol. 55, no. 11, pp. 2037–2052, 2017.
[34] C. J. Willmott and K. Matsuura, "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance," Climate Res., vol. 30, no. 1, pp. 79–82, Dec. 2005.

YUAN BAO received the B.S. degree in mathematics and applied mathematics from Henan University, Kaifeng, China, in 2013, and the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 2020. She is currently a Postdoctoral Fellow with the School of Information Science and Technology, Dalian Maritime University. Her research interests include finite element methods and computer networks.

SIBO YANG received the B.S. and Ph.D. degrees in computational mathematics from the Dalian University of Technology, Dalian, China, in 2013 and 2020, respectively. He is currently a Lecturer with the School of Science, Dalian Maritime University, Dalian. His research interests include extreme learning machine and improvement of learning algorithms in neural networks.