SMOTE using Python. Achieving class balance with few lines… | by Dr.
Saptarsi Goswami | Towards Data Science 28/5/21, 11:23 PM
SMOTE using Python
Achieving class balance with few lines of Python codes
Dr. Saptarsi Goswami
[Image; source: Unsplash]
Class imbalance is a frequently occurring problem that
shows up in fraud detection, intrusion detection, suspicious
activity detection, to name a few. In the context of binary
classification, the less frequently occurring class is called the
minority class, and the more frequently occurring class is called
the majority class. You can check out our video on the same
topic here.
What’s the issue with this?
Most machine learning models get overwhelmed by
the majority class, as they expect the classes to be somewhat
balanced. It’s like asking a student to learn both algebra and
trigonometry equally well but giving him only 5 solved
trigonometry problems to learn from, compared to 1,000
solved algebra problems. The patterns of the minority class
get buried. This becomes the problem of finding a
needle in a haystack.
[Image; source: Unsplash]
Evaluation also goes for a toss: overall accuracy can look fine
while the minority class is missed, so we are more concerned
with minority class recall than with anything else.
Confusion matrix with colors according to the desirability
A false positive is kind of ‘ok’, but a false negative is unacceptable.
The fraud class is taken as the positive class.
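To make the point about recall concrete, here is a small sketch in plain Python. The counts are hypothetical (50 fraud cases and 950 legitimate ones) and the helper names `recall` and `accuracy` are ours, not part of any library. It shows that a model which always predicts the majority class looks accurate while catching no fraud at all.

```python
def recall(tp, fn):
    """Fraction of actual positives (fraud cases) that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical test set: 50 fraud cases (positive), 950 legitimate ones.
# A model that always predicts "legitimate" has no true positives,
# so every fraud case becomes a false negative.
tp, fp, fn, tn = 0, 0, 50, 950

always_majority = (accuracy(tp, fp, fn, tn), recall(tp, fn))
print(always_majority)  # accuracy is 0.95, minority recall is 0.0
```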
The objective of this article is the implementation. For the
theoretical understanding, you can refer to the detailed working
of SMOTE here.
Class imbalance strategy (Source: Author)
Of course, the best thing is to have more data, but that’s too
ideal. Among the sampling-based strategies, SMOTE comes
under the generate-synthetic-samples strategy.
Step 1: Creating a sample dataset
from sklearn.datasets import make_classification

# Two-class toy dataset: about 5% of samples in class 0, 95% in class 1
X, y = make_classification(n_classes=2, class_sep=0.5,
                           weights=[0.05, 0.95], n_informative=2,
                           n_redundant=0, flip_y=0, n_features=2,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)
make_classification is a pretty handy function for creating
experimental data. The important parameter here is weights,
which ensures that 95% of the samples are from one class and
5% from the other class.
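A quick way to confirm the split is to count the labels. The snippet below is a minimal sketch that recreates the dataset with the same parameters and counts each class. The variable name `class_counts` is ours.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Same parameters as in Step 1
X, y = make_classification(n_classes=2, class_sep=0.5,
                           weights=[0.05, 0.95], n_informative=2,
                           n_redundant=0, flip_y=0, n_features=2,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

# Count how many samples fall in each class; class 0 should be the rare one
class_counts = Counter(y.tolist())
print(class_counts)
```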
Visualizing the data (Image source: Author)
In the plot, the red class is the majority class and the
blue class is the minority class.
Step 2: Create train and test datasets, fit and evaluate the model
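The code for this step is not shown in the text, so the following is only a minimal sketch of what it could look like. The 70/30 stratified split, the LogisticRegression classifier and the random_state values are assumptions rather than the author's setup, so the numbers it prints may differ from those reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

# Dataset from Step 1
X, y = make_classification(n_classes=2, class_sep=0.5,
                           weights=[0.05, 0.95], n_informative=2,
                           n_redundant=0, flip_y=0, n_features=2,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

# Assumed split: 70% train, 30% test, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=10)

# Assumed model: plain logistic regression trained on the imbalanced data
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Class 0 is the minority class here, so we look at its recall
print(classification_report(y_test, y_pred, zero_division=0))
minority_recall = recall_score(y_test, y_pred, pos_label=0)
print("Minority class recall:", minority_recall)
```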
Evaluation on the test set, model trained on the original imbalanced data (Image source: Author)
The main issue here is that we get a very poor recall for the
minority class when the original imbalanced data is used for
training the model.
Step 3: Create a dataset with Synthetic samples
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
We can create a balanced dataset with just the three lines of
code above.
Step 4: Fit and evaluate the model on the modified dataset
Evaluation on the test set, model trained on the modified balanced data (Image source: Author)
We can see directly that the recall has improved from 0.21 to 0.84.
Such is the power and beauty of those three lines of code.
SMOTE works by selecting a pair of minority class observations and
then creating a synthetic point that lies on the line connecting
these two. It is pretty liberal about selecting the minority points
and may end up picking minority points that are outliers.
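The interpolation idea can be sketched in a few lines of plain Python. This is an illustration of the core step only, not the imbalanced-learn implementation. The function name `synthetic_point` and the two example points are hypothetical. In imbalanced-learn, the second point of the pair is drawn from the nearest minority neighbours of the first, controlled by the k_neighbors parameter.

```python
import random


def synthetic_point(a, b, rng):
    """Return a random point on the line segment between minority points a and b."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [ai + lam * (bi - ai) for ai, bi in zip(a, b)]


rng = random.Random(0)

# Two hypothetical minority class observations in a 2-feature space
a = [1.0, 2.0]
b = [3.0, 6.0]

new_point = synthetic_point(a, b, rng)
print(new_point)  # lies somewhere between a and b
```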
ADASYN, Borderline SMOTE, KMeansSMOTE and SVMSMOTE
are some of the strategies for selecting better minority points.
EndNote:
Class imbalance is quite a common problem and, if not handled,
can have a telling impact on model performance. Model
performance is especially critical for the minority class.
In this article, we have outlined how a few lines of code can
work like a miracle.