Two Novel SMOTE Methods For Solving Imbalanced Classification Problems
ABSTRACT The imbalanced classification problem has always been one of the important challenges in neural networks and machine learning. The synthetic minority oversampling technique (SMOTE) is an effective method for dealing with imbalanced classification problems, but it has a disadvantage: some noise samples may participate in the process of synthesizing new samples, so the new synthetic samples lack rationality, which reduces the classification performance of the network. To remedy this shortcoming, two novel improved SMOTE methods are proposed in this paper: the center point SMOTE (CP-SMOTE) method and the inner and outer SMOTE (IO-SMOTE) method. The CP-SMOTE method generates new samples by finding several center points and then linearly combining the minority samples with their corresponding center points. The IO-SMOTE method divides the minority samples into inner and outer samples, and then uses inner samples as much as possible in the subsequent process of generating new samples. Numerical experiments show that, compared with the no-sampling and conventional SMOTE methods, the CP-SMOTE and IO-SMOTE methods achieve better classification performance.
INDEX TERMS Imbalanced classification problems, IO-SMOTE method, CP-SMOTE method, machine learning.
the original dataset, some information will be lost. That is, deleting the majority samples might cause the classifier to lose important information about the majority class.

In order to overcome the shortcomings of the undersampling method, researchers have proposed oversampling methods [12], [13]. The basic idea of the oversampling method is to add some minority samples so that the numbers of positive and negative samples become balanced. The simplest random oversampling (ROS) method [14] randomly selects some samples from the minority set S_minor, generates a sample set E by copying the selected samples, and then adds E to S_minor to obtain a new minority class set S_minor + E. However, for the ROS method, the complexity of training the networks increases due to the duplication of the minority samples. On the other hand, it easily causes over-fitting problems, because the ROS method simply copies the initial samples, which is not conducive to the generalization performance of the network.

In order to solve the over-fitting problem [15] caused by the ROS method, and simultaneously ensure that the dataset is balanced, Chawla [16] proposed the synthetic minority oversampling technique (SMOTE). The basic idea of the SMOTE method is as follows: for each minority sample x_i, randomly choose a sample x_i' from its neighbors (x_i' is also a minority sample); then randomly select a point on the line between x_i and x_i' as the new synthetic minority sample.
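As an illustration of this interpolation step (the paper itself contains no code), a minimal Python sketch of the conventional SMOTE idea might look as follows; the function name smote_oversample and the parameters k and n_new are our own, and scikit-learn's NearestNeighbors is used to find the minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_minor, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: interpolate between a minority sample and one
    of its k nearest minority neighbors (illustrative, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each sample is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minor)
    _, idx = nn.kneighbors(X_minor)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minor))       # pick a minority sample x_i
        j = rng.choice(idx[i][1:])           # pick one of its k minority neighbors x_i'
        eta = rng.random()                   # random point on the segment between them
        synthetic.append(X_minor[i] + eta * (X_minor[j] - X_minor[i]))
    return np.vstack(synthetic)
```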
Based on the SMOTE method, many researchers have made improvements and achieved better classification results. The Borderline-SMOTE [17] oversampling process divides the minority samples into three categories: safe, danger and noise; then only the danger samples are employed to generate the new samples. Radius-SMOTE first selects a minority sample x_i and calculates a radius according to the k-nearest neighbors; it then takes x_i as the center and randomly finds several points whose distance to x_i is less than the radius. The R-SMOTE method [18] eliminates the limitation on the distribution of the generated minority class instances and improves the classification accuracy of the minority class. ADASYN [19] was proposed to generate new minority class samples near the original samples that were misclassified by the k-nearest neighbor classifier.

For the original SMOTE method, some noise samples might participate in the process of synthesizing new samples. Thus, the new synthetic samples lack rationality, which reduces the classification performance of the classifier. The purpose of this paper is to propose two novel improved SMOTE methods: the center point SMOTE (CP-SMOTE) method and the inner and outer SMOTE (IO-SMOTE) method. The novel CP-SMOTE method generates new samples by finding several center points and making a linear combination of the minority samples and their corresponding center points. As another alternative way to avoid noise samples, the IO-SMOTE method divides the minority samples into inner and outer parts, and then uses inner samples as much as possible in the subsequent process of generating new samples. Numerical experiments are carried out to compare the CP-SMOTE and IO-SMOTE methods with the no-sampling and conventional SMOTE methods. In terms of the classification accuracy rate, prediction rate, recall rate, F1-measure and some other indicators, the CP-SMOTE and IO-SMOTE methods have their own advantages, and on the whole, these two methods are much better than the SMOTE method.

The remainder of this paper is organized as follows. The descriptions of the CP-SMOTE and IO-SMOTE methods are given in Section II. In Section III, the experiment settings are presented and numerical experiments on four datasets, together with the corresponding analysis, are carried out. At last, the conclusion is presented in Section IV.

II. CP-SMOTE AND IO-SMOTE METHODS
A. CP-SMOTE METHOD (CENTER POINT SMOTE METHOD)
For solving the imbalanced classification problem, the conventional SMOTE method synthesizes several minority points to balance the numbers of the various samples. However, this method blurs the boundary between the majority and minority samples. As shown in Fig. 1, suppose that A is chosen as an oversampling point; then point B is randomly selected among the k-nearest neighbor points of A, and point C is randomly generated on the line segment between A and B. However, it is not difficult to see that the neighbor points of C are majority points, and even point C itself might be a majority sample. Therefore, the new sample synthesized by the SMOTE method is an extremely unreasonable sample point, which will cause a particularly large error in the subsequent network training and affect the performance of the classifier.

To overcome the above-mentioned shortcoming of the SMOTE method, we propose a new center point SMOTE (CP-SMOTE) method. First, the k-clustering method [20], [21] is used to find several regions of the minority sample distribution. For each region, the Euclidean center point of all the minority points located in that region is calculated. For each minority point, if its distance to the center point is less than the distance of any majority sample point to the center point, then a new point is randomly selected between the minority sample point and the center point; otherwise, the minority sample point is abandoned. As shown in Fig. 2, we find the two regions where the minority samples are located. For the right region, we calculate the distances of all points in this region to the center point O, as well as the closest distance d of all the majority sample points to O. For each minority sample D, if the distance between D and O is less than d, then we randomly synthesize a point between D and O; otherwise, D does not participate in synthesizing new sample points. For the left region, the same process is applied.
FIGURE 1. Special case of SMOTE method. The stars, circles and square denote the minority samples,
majority samples and new synthetic sample, respectively.
FIGURE 2. CP-SMOTE method. The stars, circles and triangle denote the minority samples, majority
samples and center point, respectively.
The process is given in Algorithm 1:

Step 1: Divide the imbalanced dataset into majority class samples and minority class samples.
Step 2: The k-clustering method is employed to find n regions and the corresponding center points {O_1, O_2, ..., O_n} of the minority sample distribution, where O_i = (1/m) \sum_{j=1}^{m} D_{ij} and D_{ij} is the j-th point in the i-th region.
Step 3: For i = 1 to i = n, calculate the closest distance d_i of all the majority sample points to the point O_i.
Step 4: For each minority class sample P, calculate the distance dis of this sample to its corresponding center point.
Step 5: Compare dis with its corresponding d_i. If dis < d_i, then synthesize a point by the following criterion:

P_new = ηP + (1 − η)O_i,    (1)

where 0 < η < 1. Otherwise, the point P does not participate in synthesizing new sample points.
Step 6: Put the dataset obtained in Steps 2-5 and the original sample set together, and then train the networks.
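A minimal Python sketch of Algorithm 1 is given below, under the assumption that the k-clustering step is ordinary k-means; cp_smote, n_clusters and seed are illustrative names of ours, not the authors'.

```python
import numpy as np
from sklearn.cluster import KMeans

def cp_smote(X_minor, X_major, n_clusters=2, seed=0):
    """Sketch of CP-SMOTE (Algorithm 1): cluster the minority class, then move
    each eligible minority sample towards its cluster center, cf. Eq. (1)."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_minor)
    centers, labels = km.cluster_centers_, km.labels_
    # Step 3: d_i is the distance from center O_i to the closest majority sample
    d = np.array([np.linalg.norm(X_major - c, axis=1).min() for c in centers])
    synthetic = []
    for P, i in zip(X_minor, labels):                  # Steps 4-5
        dis = np.linalg.norm(P - centers[i])
        if dis < d[i]:                                 # only "safe" minority samples are used
            eta = rng.random()
            synthetic.append(eta * P + (1 - eta) * centers[i])
    return np.array(synthetic)
```

Step 6 then amounts to stacking the returned points onto the original training set before the network is trained.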
outer point. As shown in Fig. 3, for the minority sample x1 ,
where 0 < η < 1. Otherwise, the point P does not only one of the six-nearest neighbors is the minority sample,
participate in synthesizing new sample points. so x1 is an outer point; On the contrary, for x2 , five of the
Step 6: Put the dataset obtained in steps 2-5 and six-nearest neighbors are minority samples, so x2 is an inner
the original sample set together, and then train the point.
networks. The process is given in Algorithm 2:
Step 1: Divide the imbalanced dataset into majority
B. IO-SMOTE METHOD (INNER AND OUTER SMOTE set N and minority set M .
METHOD) Step 2: Divide the minority set M into two parts:
Given an imbalanced dataset including the minority (positive) Inner set inner and outer set outer.
set M and the majority (negative) set N , |M | < |N |. Here |M | Step 3: In the case of inner ̸= ∅ and outer ̸ = ∅,
and |N | denote the number of M and N , respectively. for each point x ∈ inner then find point y ∈ outer
FIGURE 3. IO-SMOTE method. The stars and circles denote the minority and majority samples,
respectively. x1 is an outer point, x2 is an inner point.
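A small sketch of this split is given below. We read the condition as "more than half of the c nearest neighbors are minority points", which matches the six-neighbor example of Fig. 3; this reading, the helper name split_inner_outer and the default values of c_1 and c_2 are our own assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def split_inner_outer(X_minor, X_major, c1=3, c2=7):
    """Illustrative split of the minority set M into inner and outer points."""
    X_all = np.vstack([X_minor, X_major])
    is_minor = np.arange(len(X_all)) < len(X_minor)
    nn = NearestNeighbors(n_neighbors=c2 + 1).fit(X_all)
    _, idx = nn.kneighbors(X_minor)          # column 0 is (normally) the point itself
    inner, outer = [], []
    for row, x in zip(idx, X_minor):
        # x is "inner" if, for some c in [c1, c2], more than half of its
        # c nearest neighbors belong to the minority class
        if any(is_minor[row[1:c + 1]].sum() > c / 2 for c in range(c1, c2 + 1)):
            inner.append(x)
        else:
            outer.append(x)
    return np.array(inner), np.array(outer)
```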
The process is given in Algorithm 2:

Step 1: Divide the imbalanced dataset into the majority set N and the minority set M.
Step 2: Divide the minority set M into two parts: the inner set inner and the outer set outer.
Step 3: In the case of inner ≠ ∅ and outer ≠ ∅, for each point x ∈ inner, find the point y ∈ outer closest to the point x. The IO-SMOTE method synthesizes a new point z by the following criterion:

z = ηx + (1 − η)y,    (2)

where 0 < η < 1. In this way, the number of points that the IO-SMOTE method synthesizes is equal to the number of inner points.
Step 4: For the case inner = ∅ or outer = ∅, randomly choose three points x_1, x_2 and x_3 from the minority set M. The IO-SMOTE method synthesizes a new point z by the following criterion:

z = η_2 x_1 + (1 − η_2)y,    (3)

where

y = η_1 x_2 + (1 − η_1)x_3,    (4)

and 0 < η_1, η_2 < 1.
Step 5: Put the dataset obtained in Steps 3-4 and the original sample set together, and train the networks.
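Continuing the previous sketch, the synthesis step of Algorithm 2 could be written as below; io_smote is our own name, and the number of points generated in the fallback branch of Step 4 (here taken equal to |M|) is an assumption, since the paper does not state it.

```python
import numpy as np

def io_smote(inner, outer, X_minor, seed=0):
    """Sketch of the IO-SMOTE synthesis step (Algorithm 2, Steps 3-4)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    if len(inner) > 0 and len(outer) > 0:
        # Step 3: pair every inner point with its closest outer point, Eq. (2)
        for x in inner:
            y = outer[np.linalg.norm(outer - x, axis=1).argmin()]
            eta = rng.random()
            synthetic.append(eta * x + (1 - eta) * y)
    else:
        # Step 4: fall back to three random minority points, Eqs. (3)-(4);
        # generating |M| points here is our assumption, not the paper's
        for _ in range(len(X_minor)):
            x1, x2, x3 = X_minor[rng.choice(len(X_minor), size=3, replace=False)]
            eta1, eta2 = rng.random(), rng.random()
            y = eta1 * x2 + (1 - eta1) * x3
            synthetic.append(eta2 * x1 + (1 - eta2) * y)
    return np.array(synthetic)
```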
III. NUMERICAL EXPERIMENTS
To verify the validity of the CP-SMOTE and IO-SMOTE methods, we compare them with the no-sampling and SMOTE methods on four real classification problems: ecoli1, yeast1, yeast3 and newthyroid1.

A. EXPERIMENT SETTINGS
In our experiments, the five-fold cross-validation technique is used [22], [23], [24], [25]. In detail, the dataset is equally divided into five parts; each part takes turns as the test set, while the rest serve as the training set, and this five-fold process is repeated twenty times. Adding them all together, one hundred classification results are obtained for each method-data pair. The contents of Tabs. 1-3 are obtained by averaging the corresponding 100 results.

We evaluate the class of a sample according to the actual output: if the actual output is less than 0.50, then we regard it as approximately equal to 0 and classify this sample into the negative class; otherwise, if the actual output is more than 0.50, then we regard it as approximately equal to 1 and classify this sample into the positive class. Here, the sigmoidal function is employed as the activation function:

g(x) = 1/(1 + e^{−x}).    (5)

The experiment process is given in Algorithm 3:

Step 1: Input the imbalanced dataset, the minority (positive) set M = {m_j | m_j ∈ R^n, j = 1, ..., |M|} and the majority (negative) set N = {n_j | n_j ∈ R^n, j = 1, ..., |N|}.
Step 2: The above four methods are applied to generate positive samples Q to balance the numbers of positive and negative samples, respectively.
Step 3: Five-fold cross-validation: Φ = M ∪ N ∪ Q = {(x_j, o_j) | x_j ∈ R^n, o_j = 0 or 1, j = 1, ..., T} is equally divided into five parts Φ_1, ..., Φ_5.
Step 4: For i = 1 to i = 5, let Φ_i be the test samples, while Φ \ Φ_i is the training samples, and do Steps 5 and 6.
Step 5: Train an FNN with the datasets generated by each of the above-mentioned four methods, and test the performances of these four networks.
Step 6: Train an ELM with the datasets generated by each of the above-mentioned four methods, and test the performances of these four networks.
Step 7: Repeat the above procedure (Steps 3-6) twenty times.
Step 8: Compare the one hundred experimental results of these four methods.
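A compact sketch of this protocol is shown below, using scikit-learn's StratifiedKFold; the callables oversample and train_and_score are placeholders for the oversampling methods and for the FNN/ELM training and testing of Steps 5-6, and none of these names come from the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv_score(X_minor, X_major, oversample, train_and_score,
                      n_repeats=20, seed=0):
    """Sketch of Algorithm 3: oversample the minority class once, then run
    five-fold cross validation twenty times, giving 100 results per method."""
    Q = oversample(X_minor, X_major)                   # Step 2
    X = np.vstack([X_minor, Q, X_major])               # Phi = M ∪ Q ∪ N
    y = np.concatenate([np.ones(len(X_minor) + len(Q)), np.zeros(len(X_major))])
    scores = []
    for r in range(n_repeats):                         # Step 7: twenty repetitions
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + r)
        for tr, te in skf.split(X, y):                 # Steps 3-4: five folds
            scores.append(train_and_score(X[tr], y[tr], X[te], y[te]))  # Steps 5-6
    return float(np.mean(scores))                      # Step 8: average the 100 results
```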
B. EXPERIMENTAL RESULTS
For these four datasets, the SMOTE, IO-SMOTE and CP-SMOTE methods are respectively applied to oversample the minority class samples. The newly generated samples are high-dimensional. To visualize these points, the PCA technique [26], [27] is employed to reduce the dimensionality of the sample points from the n-dimensional space to the two-dimensional space. The distributions of these points are shown in Fig. 4, where blue represents the majority sample points, red represents the minority sample points, and green represents the synthetic sample points.

FIGURE 4. Discrete point models based on four oversampling methods in two dimensions. Red plus signs represent minority sample points, blue dots denote majority samples, and green snowflakes are newly generated samples.
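A minimal version of such a plot can be produced as follows; plot_oversampling is an illustrative helper of ours, and the colors simply follow the caption of Fig. 4.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_oversampling(X_major, X_minor, X_new):
    """Project the three groups of points to 2-D with a shared PCA and
    scatter-plot them in the style of Fig. 4 (illustrative sketch)."""
    pca = PCA(n_components=2).fit(np.vstack([X_major, X_minor]))
    for X, color, marker, label in [(X_major, "blue", ".", "majority"),
                                    (X_minor, "red", "+", "minority"),
                                    (X_new, "green", "*", "synthetic")]:
        Z = pca.transform(X)
        plt.scatter(Z[:, 0], Z[:, 1], c=color, marker=marker, s=15, label=label)
    plt.legend()
    plt.show()
```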
Obviously, compared with the SMOTE method, the new synthetic sample points of the IO-SMOTE and CP-SMOTE methods are more compact, especially those of the CP-SMOTE method. For the CP-SMOTE method, the new synthetic sample points rarely appear near the class boundary, which makes the error smaller in the learning process.

Furthermore, the feedforward neural network (FNN) [28] and the extreme learning machine (ELM) [29], [30] are employed to train on the original dataset and on the new datasets obtained by the above three oversampling methods (cf. Tabs. 1-2). According to these two tables, the IO-SMOTE and CP-SMOTE methods are both better than the no-sampling and SMOTE methods in terms of training and test accuracies. Moreover, the classification accuracies of the CP-SMOTE method are slightly higher than those of the IO-SMOTE method.

At the same time, we compare the error functions of the neural network models (cf. Fig. 5). It can be seen that the dataset without oversampling has the largest error, while the dataset processed with the SMOTE method shows a significant improvement. In addition, the errors of the two novel SMOTE methods are both obviously smaller than that of the SMOTE method.
FIGURE 5. Error functions based on four oversampling methods for four datasets.
TABLE 1. Classification accuracies for four oversampling methods in ELM.
TABLE 2. Classification accuracies for four oversampling methods in FNN.
TABLE 3. Five classification criteria for four datasets.

In addition to the classification accuracy, the following criteria are used to evaluate the methods:

\sigma := \sqrt{\frac{1}{S-1}\sum_{i=1}^{S-1}\left(y_i-\bar{y}\right)^{2}},

\mathrm{RMSE} := \sqrt{\frac{1}{S}\sum_{i=1}^{S}\left(y_i-t_i\right)^{2}},

\mathrm{F1\text{-}measure} := \frac{2\times PR\times RR}{PR+RR}.
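For concreteness, these criteria can be computed from the network outputs as sketched below; the scikit-learn calls implement PR, RR and the F1-measure, while the sigma line is only one plausible reading of the formula above (the sample standard deviation of the outputs).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def classification_criteria(y_true, y_score, threshold=0.5):
    """Illustrative computation of the criteria reported in Tab. 3 from
    network outputs y_score in [0, 1] and true labels y_true in {0, 1}."""
    y_pred = (y_score > threshold).astype(int)   # the 0.50 decision rule of Sec. III-A
    return {
        "PR": precision_score(y_true, y_pred),   # prediction (precision) rate
        "RR": recall_score(y_true, y_pred),      # recall rate
        "F1": f1_score(y_true, y_pred),          # 2*PR*RR / (PR + RR)
        "RMSE": float(np.sqrt(np.mean((y_score - y_true) ** 2))),
        "sigma": float(np.std(y_score, ddof=1)), # one reading of the sigma formula above
    }
```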
Tab. 3 shows the prediction rate, recall rate, F1-measure, σ and RMSE of these four methods. On these four datasets, the IO-SMOTE and CP-SMOTE methods both have better performances than the no-sampling and SMOTE methods on all five criteria. Furthermore, on the Ecoli1, Yeast1 and Yeast3 datasets, the CP-SMOTE method performs better than the IO-SMOTE method; for the remaining dataset, Newthyroid1, the two methods have their own advantages under different evaluation criteria. Combined with the classification accuracies, the ranking of these four oversampling methods is: CP-SMOTE > IO-SMOTE > SMOTE > no-sampling. Since SMOTE and the two proposed methods are oversampling methods and do not involve the network structure, the ELM and FNN networks are sufficient for the experiments; in fact, similar results would be obtained under other network models.

IV. CONCLUSION
This paper proposes two novel improved SMOTE methods to generate new samples: the center point SMOTE (CP-SMOTE) method and the inner and outer SMOTE (IO-SMOTE) method. The CP-SMOTE method generates new samples by finding several center points and then making a linear combination of the minority samples and their corresponding center points; the IO-SMOTE method divides the minority samples into inner and outer samples, and then uses inner samples as much as possible in the subsequent process of generating new samples. Most of the samples generated by these two methods are far away from the classification boundary, which makes the error smaller in the process of training the network. Experiments are conducted on four classification problems. The experimental results reveal that the IO-SMOTE and CP-SMOTE methods both have better performances than the traditional SMOTE method.
REFERENCES
[1] M. Saini and S. Susan, "Deep transfer with minority data augmentation for imbalanced breast cancer dataset," Appl. Soft Comput., vol. 97, Dec. 2020, Art. no. 106759.
[2] Q. Li, G. Yu, J. Wang, and Y. Liu, "A deep multimodal generative and fusion framework for class-imbalanced multimodal data," Multimedia Tools Appl., vol. 79, nos. 33–34, pp. 25023–25050, Sep. 2020.
[3] E. Judith and J. M. Deleo, "Artificial neural networks," Cancer, vol. 91, no. 8, pp. 1615–1635, 2001.
[4] J.-J. Zhang and P. Zhong, "Learning biased SVM with weighted within-class scatter for imbalanced classification," Neural Process. Lett., vol. 51, no. 1, pp. 797–817, Feb. 2020.
[5] H. Zhu, H. Liu, and A. Fu, "Class-weighted neural network for monotonic imbalanced classification," Int. J. Mach. Learn. Cybern., vol. 12, no. 4, pp. 1191–1201, Apr. 2021.
[6] B. Selvalakshmi and M. Subramaniam, "Intelligent ontology based semantic information retrieval using feature selection and classification," Cluster Comput., vol. 22, no. 5, pp. 12871–12881, Sep. 2019.
[7] H. Hu, Q. Wang, M. Cheng, and Z. Gao, "Cost-sensitive semi-supervised deep learning to assess driving risk by application of naturalistic vehicle trajectories," Exp. Syst. Appl., vol. 178, Sep. 2021, Art. no. 115041.
[8] N. M. Faber, "Comment on a recently proposed resampling method," J. Chemometrics, vol. 15, no. 3, pp. 169–188, Mar. 2001.
[9] H. Yu, J. Ni, and J. Zhao, "ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data," Neurocomputing, vol. 101, pp. 309–318, Feb. 2013.
[10] S. Park and H. Park, "Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic," Computing, vol. 103, no. 1, pp. 1–24, 2021.
[11] M. A. Tahir, J. Kittler, F. Yan, and K. Mikolajczyk, "Concept learning for image and video retrieval: The inverse random under sampling approach," in Proc. 17th Eur. Signal Process. Conf., 2015, pp. 574–578.
[12] S. Kumar, M. S. Chaudhari, R. Gupta, and S. Majhi, "Multiple CFOs estimation and implementation of SC-FDMA uplink system using oversampling and iterative method," IEEE Trans. Veh. Technol., vol. 69, no. 6, pp. 6254–6263, Jun. 2020.
[13] Y. Yang, S. Fu, and E. T. Chung, "Online mixed multiscale finite element method with oversampling and its applications," J. Sci. Comput., vol. 82, no. 2, pp. 1–20, Feb. 2020.
[14] Y. Pang, Z. Chen, L. Peng, K. Ma, C. Zhao, and K. Ji, "A signature-based assistant random oversampling method for malware detection," in Proc. 18th IEEE Int. Conf. Trust, Secur. Privacy Comput. Commun./13th IEEE Int. Conf. Big Data Sci. Eng. (TrustCom/BigDataSE), Aug. 2019, pp. 256–263.
[15] J. Kolluri, V. K. Kotte, M. S. B. Phridviraj, and S. Razia, "Reducing overfitting problem in machine learning using novel L1/4 regularization method," in Proc. 4th Int. Conf. Trends Electron. Informat. (ICOEI), Jun. 2020, pp. 934–938.
[16] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: Improving prediction of the minority class in boosting," in Proc. Eur. Conf. Knowl. Discovery Databases, 2003, pp. 107–119.
[17] H. Hui, W. Y. Wang, and B. H. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," in Proc. Int. Conf. Adv. Intell. Comput., 2005, pp. 878–887.
[18] M. Naseriparsa, A. Al-Shammari, M. Sheng, Y. Zhang, and R. Zhou, "RSMOTE: Improving classification performance over imbalanced medical datasets," Health Inf. Sci. Syst., vol. 8, no. 1, pp. 1–13, Dec. 2020.
[19] H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," in Proc. IEEE Int. Joint Conf. Neural Netw., Jun. 2008, pp. 1322–1328.
[20] R. Aishwarya and V. Nagaraju, "Automatic region of interest based medical image segmentation using spatial fuzzy K clustering method," Int. J. Electron. Commun. Technol., vol. 3, no. 1, pp. 226–229, Mar. 2012.
[21] S. Mahak, "Image segmentation with modified K-means clustering method," Int. J. Recent Technol. Eng., vol. 1, no. 2, pp. 176–179, 2012.
[22] T.-T. Wong and N.-Y. Yang, "Dependency analysis of accuracy estimates in k-fold cross validation," IEEE Trans. Knowl. Data Eng., vol. 29, no. 11, pp. 2417–2427, Nov. 2017.
[23] P. Jiang and J. Chen, "Displacement prediction of landslide based on generalized regression neural networks with K-fold cross-validation," Neurocomputing, vol. 198, pp. 40–47, Jul. 2016.
[24] J. He and X. Fan, "Evaluating the performance of the K-fold cross-validation approach for model selection in growth mixture modeling," Struct. Equation Model., Multidisciplinary J., vol. 26, no. 1, pp. 66–79, Jan. 2019.
[25] T. Fushiki, "Estimation of prediction error by using K-fold cross-validation," Statist. Comput., vol. 21, no. 2, pp. 137–146, Apr. 2011.
[26] B. C. Moore, "Principal component analysis in linear systems: Controllability, observability, and model reduction," IEEE Trans. Autom. Control, vol. AC-26, no. 1, pp. 17–32, Feb. 1981.
[27] L. E. Pirogov and P. M. Zemlyanukha, "Principal component analysis for estimating parameters of the L1287 dense core by fitting model spectral maps into observed ones," Astron. Rep., vol. 65, no. 2, pp. 82–94, Feb. 2021.
[28] M. Frean, "The upstart algorithm: A method for constructing and training feedforward neural networks," Neural Comput., vol. 2, no. 2, pp. 198–209, Jun. 1990.
[29] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, nos. 1–3, pp. 489–501, 2006.
[30] G. B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 513–529, Feb. 2012.
[31] J. M. DuBois, L. S. Boylan, M. Shiyko, W. B. Barr, and O. Devinsky, "Seizure prediction and recall," Epilepsy Behav., vol. 18, nos. 1–2, pp. 106–109, May 2010.
[32] R. Wang and J. Li, "Bayes test of precision, recall, and F1 measure for comparison of two natural language processing models," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 4135–4145.
[33] H. Azami, A. Fernández, and J. Escudero, "Refined multiscale fuzzy entropy based on standard deviation for biomedical signal analysis," Med. Biol. Eng., Comput., vol. 55, no. 11, pp. 2037–2052, 2017.
[34] C. J. Willmott and K. Matsuura, "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance," Climate Res., vol. 30, no. 1, pp. 79–82, Dec. 2005.

YUAN BAO received the B.S. degree in mathematics and applied mathematics from Henan University, Kaifeng, China, in 2013, and the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 2020. She is currently a Postdoctoral Fellow with the School of Information Science and Technology, Dalian Maritime University. Her research interests include finite element methods and computer networks.

SIBO YANG received the B.S. and Ph.D. degrees in computational mathematics from the Dalian University of Technology, Dalian, China, in 2013 and 2020, respectively. He is currently a Lecturer with the School of Science, Dalian Maritime University, Dalian. His research interests include extreme learning machine and improvement of learning algorithms in neural networks.