SMOTE for Imbalanced Classification with Python - GeeksforGeeks
https://www.geeksforgeeks.org/smote-for-imbalanced-classification-with-python/ 2024-10-31, 6 57 PM
Techniques like Oversampling, Undersampling, Threshold moving,
and SMOTE help address this issue. Handling imbalanced datasets
is crucial to prevent biased model outputs, especially in
multi-class classification problems.
By default, SMOTE typically aims to balance the
class distribution by generating synthetic samples until the
minority class reaches the same size as the majority class.
5. Repeat for All Minority Class Instances: Steps 2-4 are
repeated for all minority class instances in the dataset,
generating synthetic samples to augment the minority class.
6. Create Balanced Dataset: After generating synthetic samples
for the minority class, the resulting dataset becomes more
balanced, with a more equitable distribution of instances
across classes.
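The steps above can be sketched in plain NumPy. This is a minimal illustration only, not the imbalanced-learn implementation; the toy minority points, the value of k, and the number of synthetic samples are all assumptions for demonstration:

```python
import numpy as np

def smote_sample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating each chosen
    minority point toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))           # pick a minority instance
        x = minority[i]
        # find its k nearest neighbors within the minority class (excluding itself)
        d = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        z = minority[rng.choice(neighbors)]       # pick one neighbor at random
        gap = rng.random()                        # interpolate at a random gap in [0, 1)
        synthetic.append(x + gap * (z - x))
    return np.array(synthetic)

# Toy minority class: 5 points in 2-D (assumed data for illustration)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_points = smote_sample(minority, n_new=10)
print(new_points.shape)  # (10, 2)
```

Because each synthetic point lies on the line segment between two real minority points, the new samples always fall inside the region already occupied by the minority class.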
Python
import pandas as pd

# Load the dataset (file name assumed) and inspect the class counts
# in the 'Outcome' column before resampling
data = pd.read_csv('diabetes.csv')
x = data.drop('Outcome', axis=1)
y = data['Outcome']
y.value_counts()
Output:
Now, let's use SMOTE to handle this problem. We will apply SMOTE
to address the data imbalance by generating synthetic samples for
the minority class, as indicated by 'sampling_strategy='minority''.
After resampling, 'y.value_counts()' confirms that the class
distribution is balanced.
Python
from imblearn.over_sampling import SMOTE

# Oversample the minority class until both classes are the same size
smote = SMOTE(sampling_strategy='minority')
x, y = smote.fit_resample(x, y)
y.value_counts()
Output:
Outcome
1 500
0 500
Name: count, dtype: int64
1. ADASYN
2. Borderline SMOTE
3. SMOTE-ENN (Edited Nearest Neighbors)
4. SMOTE+TOMEK
5. SMOTE-NC (Nominal Continuous)
ADASYN
ADASYN (Adaptive Synthetic Sampling) is another technique for
handling imbalanced datasets. ADASYN focuses on the local density
of the minority class. It finds the regions where the imbalance is
very severe and generates synthetic samples there: more samples are
created for minority instances in sparse, hard-to-learn regions,
and fewer where the minority class is already dense. This approach
is highly useful in scenarios where class distribution varies
across the feature space.
As a result, the number of samples in the minority class increases.
This makes the dataset balanced and helps the model learn more accurately.
Python
from imblearn.over_sampling import ADASYN

# Applying ADASYN
adasyn = ADASYN(sampling_strategy='minority')
x_resampled, y_resampled = adasyn.fit_resample(x, y)
# Count outcome values after applying ADASYN
y_resampled.value_counts()
Output:
Outcome
1 500
0 500
Name: count, dtype: int64
Borderline SMOTE
Borderline SMOTE is designed to better address the issue of
misclassification of minority class samples that are near the
borderline between classes. These samples are often the hardest to
classify and are more likely to be mislabeled by classifiers.
Borderline SMOTE focuses on generating synthetic samples near
the decision boundary between the minority and majority classes. It
targets instances that are more challenging to classify, aiming to
improve the generalization performance of classifiers.
Working Procedure of Borderline SMOTE
Python
from imblearn.over_sampling import BorderlineSMOTE

blsmote = BorderlineSMOTE(sampling_strategy='minority', kind='borderline-1')
X_resampled, y_resampled = blsmote.fit_resample(x, y)
y_resampled.value_counts()
Output:
Outcome
1 500
0 500
Name: count, dtype: int64
SMOTE-ENN (Edited Nearest Neighbors)
SMOTE-ENN combines the SMOTE method with the Edited Nearest
Neighbors (ENN) rule. ENN is used to clean the data by removing
any samples that are misclassified by their nearest neighbors. This
combination helps in cleaning up the synthetic samples, improving
the overall quality of the dataset. The objective of ENN is to remove
noisy or ambiguous samples, which may include both minority and
majority class instances.
Python
from imblearn.combine import SMOTEENN

# Apply SMOTE, then clean the result with Edited Nearest Neighbours
smote_enn = SMOTEENN(sampling_strategy='minority')
x_resampled, y_resampled = smote_enn.fit_resample(x, y)
y_resampled.value_counts()
Output:
Outcome
1 297
0 215
Name: count, dtype: int64
Working Procedure of SMOTE-Tomek Links
Python
from imblearn.combine import SMOTETomek

# Apply SMOTE, then remove Tomek links to reduce class overlap
smote_tomek = SMOTETomek(sampling_strategy='minority')
x_resampled, y_resampled = smote_tomek.fit_resample(x, y)
y_resampled.value_counts()
Output:
Outcome
1 471
0 471
Name: count, dtype: int64
Handling Nominal Features: Traditional SMOTE operates in
the feature space by interpolating between minority class
instances. However, when categorical features are present, it's
not meaningful to interpolate between categories directly.
SMOTE-NC addresses this by considering the categorical
features separately and ensuring that synthetic samples
preserve the categorical properties of the original data.
Combining SMOTE with Handling Nominal Features:
SMOTE-NC extends the SMOTE algorithm to handle both
nominal and continuous features appropriately. It generates
synthetic samples by oversampling the minority class
instances in the continuous feature space while preserving the
distribution of categorical features.
Integration with Categorical Encoding: Before applying
SMOTE-NC, categorical features need to be encoded into a
numerical representation. This encoding could be done using
techniques like one-hot encoding or ordinal encoding,
depending on the nature of the categorical variables.
Preservation of Feature Characteristics: During the
synthetic sample generation process, SMOTE-NC ensures that
the categorical features of the synthetic samples align with the
original dataset. This helps in maintaining the integrity of the
dataset and ensuring that the synthetic samples accurately
represent the minority class.
Implementation of SMOTE-NC (Nominal Continuous)
Python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTENC
# Create an imbalanced dataset (parameters assumed for illustration)
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.9, 0.1], random_state=42)
# Treat the first feature as categorical by thresholding it (assumed choice)
X[:, 0] = (X[:, 0] > 0).astype(int)

# Tell SMOTE-NC which column indices are categorical
smote_nc = SMOTENC(categorical_features=[0], random_state=42)
# Perform the resampling
X_resampled, y_resampled = smote_nc.fit_resample(X, y)
Output:
Comparison of SMOTE Variants

ADASYN (Adaptive Synthetic Sampling)
Best for: Datasets where imbalance varies significantly across the feature space.
How it works: Generates samples next to the original samples that are harder to learn, adapting to varying degrees of class imbalance.
When to use: When certain areas of the feature space are more imbalanced than others, requiring adaptive density estimation.

Borderline SMOTE
Best for: Datasets where minority class examples are close to the decision boundary.
How it works: Enhances classification near the borderline, where misclassification risk is high.
When to use: When data points from different classes overlap and are prone to misclassification, particularly in binary classification problems.

SMOTE-NC (Nominal Continuous)
Best for: Datasets that include a combination of nominal (categorical) and continuous features.
How it works: Handles mixed data types without distorting the categorical feature space.
When to use: When your dataset includes both categorical and continuous inputs, ensuring that synthetic samples respect the nature of both data types.

SMOTE-ENN (Edited Nearest Neighbors)
Best for: Datasets with potential noise and mislabeled examples.
How it works: Combines over-sampling with cleaning to remove noisy and misclassified instances.
When to use: When the dataset is noisy or contains outliers, and you want to refine the class boundary further after over-sampling.

SMOTE+TOMEK
Best for: Reducing overlap between classes after applying SMOTE.
How it works: Cleans the data by removing Tomek links, which can help in enhancing the classifier's performance.
When to use: When you need a cleaner dataset with less overlap between classes, suitable for situations where class separation is a priority.
Conclusion
To sum up, SMOTE is an effective technique for handling imbalanced
datasets. It identifies the minority class in the dataset and
generates synthetic samples for it, balancing the data and helping
machine learning models learn better. It is widely used in
classification problems. However, it is essential to analyze the
problem carefully before applying the method, as it can sometimes
introduce trade-offs. Overall, SMOTE plays a vital role in handling
imbalanced datasets.
SMOTE is particularly valuable in domains where class imbalance is
prevalent. But it may not be suitable for all types of problems, so
it's essential to consider its limitations.