
SMOTE Using Python1

1. Class imbalance, where one class is underrepresented, can negatively impact machine learning models by overwhelming them with the majority class.
2. SMOTE is a technique that generates synthetic samples of the minority class to balance the classes and improve model performance, especially for the underrepresented class.
3. The article demonstrates applying SMOTE in Python with just a few lines of code, significantly improving recall of the minority class from 0.21 to 0.84.


SMOTE using Python. Achieving class balance with few lines… | by Dr.

Saptarsi Goswami | Towards Data Science 28/5/21, 11:23 PM

SMOTE using Python
Achieving class balance with few lines of python codes

Dr. Saptarsi Goswami

https://towardsdatascience.com/applying-smote-for-class-imbalance-with-just-a-few-lines-of-code-python-cdf603e58688 Page 1 of 9

Class imbalance is a frequently occurring problem, manifested in fraud detection, intrusion detection, and suspicious activity detection, to name a few. In the context of binary classification, the less frequently occurring class is called the minority class, and the more frequently occurring class is called the majority class. You can check out our video on the same topic here.

What’s the issue with this?

Most machine learning models get overwhelmed by the majority class, as they expect the classes to be somewhat balanced. It's like asking a student to learn both algebra and trigonometry equally well but giving him only 5 solved problems in trigonometry to learn from, compared to 1,000 solved problems in algebra. The patterns of the minority class get buried. This literally becomes the problem of finding a needle in a haystack.


Evaluation also goes awry: we are more concerned with minority-class recall than with any other metric.

[Figure: Confusion matrix with colors according to the desirability]

A false positive is kind of "ok", but a false negative is unacceptable. The fraud class is taken as the positive class.
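To make the point about recall concrete, here is a minimal sketch in plain Python. The labels and predictions are made up for illustration and are not taken from the article's dataset; 1 marks the fraud class, which is treated as positive.

```python
# Hypothetical labels for illustration: 1 = fraud (positive, minority), 0 = legitimate
y_true = [1] * 5 + [0] * 95
# A hypothetical model that catches 1 of the 5 fraud cases and raises 2 false alarms
y_pred = [1, 0, 0, 0, 0] + [1, 1] + [0] * 93

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate, correctly passed

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")        # accuracy = 0.94
print(f"minority recall = {recall:.2f}")   # minority recall = 0.20
```

In this toy case the accuracy looks fine while four of the five fraud cases slip through, which is why recall on the minority class is the number to watch.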

The objective of this article is the implementation. For the theoretical understanding, you can refer to the detailed working of SMOTE here.

[Figure: Class imbalance strategy (Source: Author)]


Of course, the best thing is to have more data, but that is rarely possible. Among the sampling-based strategies, SMOTE falls under the "generate synthetic samples" strategy.

Step 1: Creating a sample dataset

from sklearn.datasets import make_classification

# Two-feature binary dataset; weights gives a 5% / 95% class split
X, y = make_classification(
    n_classes=2, class_sep=0.5, weights=[0.05, 0.95],
    n_informative=2, n_redundant=0, flip_y=0,
    n_features=2, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)

make_classification is a handy function for creating experimental data. The important parameter here is weights, which ensures that 95% of the samples come from one class and 5% from the other.
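A quick way to confirm the split is to count the labels. The sketch below repeats the Step 1 call so that it runs on its own; the `Counter` check is an addition for illustration, not part of the original article.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Same call as in Step 1
X, y = make_classification(
    n_classes=2, class_sep=0.5, weights=[0.05, 0.95],
    n_informative=2, n_redundant=0, flip_y=0,
    n_features=2, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)

# Count how many samples carry each label
counts = Counter(y.tolist())
for label in sorted(counts):
    share = counts[label] / len(y)
    print(f"class {label}: {counts[label]} samples ({share:.1%})")
```

Because the first entry of `weights` is 0.05, label 0 is the minority class in this dataset.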

[Figure: Visualizing the data (Image Source: Author)]


The plot shows that the red class is the majority class and the blue class is the minority class.

Step 2: Create train and test datasets, fit and evaluate the model

[Figure: Evaluation on the test set of the model trained on the original imbalanced data (Image Source: Author)]

The main issue here is that we get a very poor recall rate for the minority class when the original imbalanced data is used for training the model.
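The code for this step survives only as an image, so the following is a sketch under stated assumptions rather than the article's own code. The stratified 70/30 split and the use of `LogisticRegression` are assumptions; the article does not say which split or model it used, so the numbers printed here will not necessarily match the figures it reports.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

# Same dataset as in Step 1
X, y = make_classification(
    n_classes=2, class_sep=0.5, weights=[0.05, 0.95],
    n_informative=2, n_redundant=0, flip_y=0,
    n_features=2, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)

# Assumption: stratified 70/30 split; the article does not state its split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Assumption: logistic regression stands in for the article's unspecified model
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred, zero_division=0))

# Label 0 is the minority class here because weights=[0.05, 0.95]
minority_recall = recall_score(y_test, y_pred, pos_label=0)
print(f"minority recall = {minority_recall:.2f}")
```

Stratifying the split keeps the 5% minority share in both the training and the test set, so the recall estimate is not based on a handful of accidental samples.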

Step 3: Create a dataset with Synthetic samples

from imblearn.over_sampling import SMOTE

# X_train, y_train come from the train/test split in Step 2;
# only the training data is resampled
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

We can create a balanced dataset with just the three lines of code above.


Step 4: Fit and evaluate the model on the modified dataset

[Figure: Evaluation on the test set of the model trained on the modified balanced data (Image Source: Author)]

We can see directly that the recall has improved from 0.21 to 0.84. Such is the power and beauty of those three lines of code.

SMOTE works by selecting a pair of minority class observations and then creating a synthetic point that lies on the line segment connecting the two. It is fairly liberal about which minority points it selects and may end up picking minority points that are outliers.
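The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not imbalanced-learn's implementation: the function `smote_like_samples`, its parameters, and the sample points are all hypothetical, and pairing each point with one of its `k` nearest minority neighbours is an assumption about how the pair is chosen.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise distances within the minority class only
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick random minority points, then one of each point's k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the segment between the two
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[partner] - X_min[base])


# Hypothetical minority points, for illustration only; (4, 4) is an outlier
X_min = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [4.0, 4.0]]
)
synthetic = smote_like_samples(X_min, n_new=10, k=3)
print(synthetic.shape)  # (10, 2)
```

Nothing in this sketch stops the outlier at (4, 4) from being picked as a starting point, which is the weakness described above.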

ADASYN, Borderline-SMOTE, KMeansSMOTE, and SVMSMOTE are some of the strategies to select better minority points.

EndNote:

Class imbalance is a common problem and, if not handled, can have a telling impact on model performance. Model performance is especially critical for the minority class.

In this article, we have outlined how SMOTE, with a few lines of code, can work like a miracle.

