10 Techniques To Deal With Class Imbalance in Machine Learning

The document discusses 10 techniques to deal with imbalanced classes in machine learning: 1. It introduces the problem of class imbalance and how it can negatively impact model accuracy. 2. It discusses resampling techniques like random under-sampling, random over-sampling, and near-miss to balance class distributions. 3. It demonstrates various resampling techniques like random under-sampling, random over-sampling, Tomek links, and SMOTE on a credit card fraud detection dataset using libraries like imbalanced-learn.

Uploaded by

CHLIAH HANANE

10 Techniques to deal with Imbalanced Classes in Machine Learning

CLASSIFICATION INTERMEDIATE MACHINE LEARNING PYTHON STRUCTURED DATA TECHNIQUE UNCATEGORIZED

Overview

Get familiar with class imbalance

Understand various techniques to treat imbalanced classes, such as:
Random under-sampling
Random over-sampling
NearMiss

You can check the implementation of the code in my GitHub repository here

Introduction

Class imbalance exists when one class has far more observations than the other classes.
Example: detecting fraudulent credit card transactions. As you can see in the graph below, there are
around 400 fraudulent transactions compared with around 90000 non-fraudulent transactions.

Class Imbalance is a common problem in machine learning, especially in classification problems.

Imbalanced data can severely hamper our model's accuracy.

Class imbalance appears in many domains, including:

Fraud detection
Spam filtering
Disease screening
SaaS subscription churn
Advertising click-throughs

The Problem with Class Imbalance

Most machine learning algorithms work best when the number of samples in each class is about equal.
This is because most algorithms are designed to maximize accuracy and reduce errors.

However, if the data set is imbalanced, you can get a pretty high accuracy just by predicting
the majority class, but you fail to capture the minority class, which is most often the point of creating the
model in the first place.

Credit card fraud detection example

Let’s say we have a dataset from a credit card company, and we have to find out whether each credit card
transaction is fraudulent or not.

But here’s the catch… fraudulent transactions are relatively rare: only 6% of the transactions are fraudulent.

Now, before you even start, do you see how the problem might break? Imagine if you didn’t bother training
a model at all. Instead, what if you just wrote a single line of code that always predicts ‘no fraudulent
transaction’?

def transaction(transaction_data):
    return 'No fraudulent transaction'

Well, guess what? Your “solution” would have 94% accuracy!

Unfortunately, that accuracy is misleading.

For the non-fraudulent transactions, you’d have 100% accuracy.

For the fraudulent transactions, you’d have 0% accuracy.
Your overall accuracy would be high simply because most transactions are not fraudulent (not
because your model is any good).
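
The arithmetic can be checked with a small sketch. The labels here are synthetic (94 non-fraudulent and 6 fraudulent rows, chosen only to mirror the 6% figure above), and `always_not_fraud` is a hypothetical stand-in for the one-line "model", not code from the original experiment.

```python
# Synthetic labels (assumed for illustration): 0 = not fraudulent, 1 = fraudulent
y_true = [0] * 94 + [1] * 6


def always_not_fraud(n_rows):
    """Hypothetical baseline: predict the majority class for every row."""
    return [0] * n_rows


def accuracy(truth, pred):
    """Fraction of rows where the prediction matches the label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def accuracy_on_class(truth, pred, label):
    """Accuracy restricted to the rows whose true label is `label`."""
    pairs = [(t, p) for t, p in zip(truth, pred) if t == label]
    return sum(t == p for t, p in pairs) / len(pairs)


y_pred = always_not_fraud(len(y_true))

overall = accuracy(y_true, y_pred)
acc_non_fraud = accuracy_on_class(y_true, y_pred, 0)
acc_fraud = accuracy_on_class(y_true, y_pred, 1)

print("overall:", overall)              # 0.94
print("non-fraudulent:", acc_non_fraud)  # 1.0
print("fraudulent:", acc_fraud)          # 0.0
```

The per-class breakdown is what exposes the problem: the overall number hides the fact that not a single fraudulent row is caught.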

This is clearly a problem because many machine learning algorithms are designed to maximize overall
accuracy. In this article, we will see different techniques to handle the imbalanced data.

Data

We will use a credit card fraud detection dataset for this article; you can find the dataset here.

After loading the data, display the first five rows of the data set.

import matplotlib.pyplot as plt
import seaborn as sns

# check the target variable: fraudulent vs. non-fraudulent transactions
data['Class'].value_counts()
# 0 -> non fraudulent
# 1 -> fraudulent

# visualize the target variable
g = sns.countplot(x=data['Class'])
g.set_xticklabels(['Not Fraud', 'Fraud'])
plt.show()

You can clearly see that there is a huge difference between the two classes: 9000 non-fraudulent transactions
and 492 fraudulent ones.

The Metric Trap

One of the major traps that new developers fall into when dealing with unbalanced datasets relates
to the metrics used to evaluate their model. Simple metrics like accuracy score can be misleading.
In a dataset with highly unbalanced classes, a classifier can “predict” the most common class every time,
without performing any analysis of the features, and still have a high accuracy rate, which is obviously
not a meaningful one.

Let’s do this experiment, using a simple XGBClassifier and no feature engineering:

# import libraries
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

xgb_model = XGBClassifier().fit(x_train, y_train)

# predict
xgb_y_predict = xgb_model.predict(x_test)

# accuracy score
xgb_score = accuracy_score(y_test, xgb_y_predict)
print('Accuracy score is:', xgb_score)

OUTPUT
Accuracy score is: 0.992

We see 99% accuracy. The accuracy is this high because the model is predicting mostly the majority
class, which is 0 (non-fraudulent).

Resampling Technique

A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of
removing samples from the majority class (under-sampling) and/or adding more examples from the
minority class (over-sampling).

Despite the advantage of balancing classes, these techniques also have their weaknesses (there is no free
lunch).

The simplest implementation of over-sampling is to duplicate random records from the minority class,
which can cause overfitting.

In under-sampling, the simplest technique involves removing random records from the majority class,
which can cause loss of information.

Let’s implement this with the credit card fraud detection example.

We will start by separating class 0 from class 1.

# class count
class_count_0, class_count_1 = data['Class'].value_counts()

# separate the classes
class_0 = data[data['Class'] == 0]
class_1 = data[data['Class'] == 1]

# print the shape of each class
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)

1. Random Under-Sampling

Under-sampling can be defined as removing some observations of the majority class. This is done until the
majority and minority classes are balanced.

Under-sampling can be a good choice when you have a ton of data (think millions of rows). But a drawback
of under-sampling is that we are removing information that may be valuable.

import pandas as pd

# randomly keep as many majority rows as there are minority rows
class_0_under = class_0.sample(class_count_1)
test_under = pd.concat([class_0_under, class_1], axis=0)

print("total class of 1 and 0:", test_under['Class'].value_counts())

# plot the count after under-sampling
test_under['Class'].value_counts().plot(kind='bar', title='count (target)')

2. Random Over-Sampling

Over-sampling can be defined as adding more copies to the minority class. Over-sampling can be a good
choice when you don’t have a ton of data to work with.

A con to consider when over-sampling is that it can cause overfitting and poor generalization to your test
set.

# sample minority rows with replacement until they match the majority count
class_1_over = class_1.sample(class_count_0, replace=True)
test_over = pd.concat([class_1_over, class_0], axis=0)

print("total class of 1 and 0:", test_over['Class'].value_counts())

# plot the count after over-sampling
test_over['Class'].value_counts().plot(kind='bar', title='count (target)')

Balance data with the imbalanced-learn python module

A number of more sophisticated resampling techniques have been proposed in the scientific literature.

For example, we can cluster the records of the majority class, and do the under-sampling by removing
records from each cluster, thus seeking to preserve information. In over-sampling, instead of creating
exact copies of the minority class records, we can introduce small variations into those copies, creating
more diverse synthetic samples.
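
As a rough illustration of the second idea only (small variations instead of exact copies), here is a minimal sketch in plain Python. The function name `oversample_with_jitter`, the Gaussian noise, and the `scale` value are all assumptions made for this example; they are not taken from a specific paper or library.

```python
import random


def oversample_with_jitter(minority, target_size, scale=0.01, seed=0):
    """Grow `minority` to `target_size` rows by copying random rows
    and adding small Gaussian noise to every feature (assumed scheme)."""
    rng = random.Random(seed)
    out = [list(row) for row in minority]  # keep the original rows unchanged
    while len(out) < target_size:
        base = rng.choice(minority)
        out.append([value + rng.gauss(0.0, scale) for value in base])
    return out


# Toy minority class with two features per row (made-up values)
minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1]]
augmented = oversample_with_jitter(minority, target_size=10)

print(len(augmented))  # 10
```

The noise scale matters: too small and the new rows are effectively duplicates, too large and they may no longer look like the minority class.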

Let’s apply some of these resampling techniques, using the Python library imbalanced-learn. It is
compatible with scikit-learn and is part of scikit-learn-contrib projects.

import imblearn

3. Random under-sampling with imblearn

RandomUnderSampler is a fast and easy way to balance the data by randomly selecting a subset of data
for the targeted classes. It under-samples the majority class(es) by randomly picking samples with or without
replacement.

# import libraries
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42, replacement=True)

# fit predictor and target variable
x_rus, y_rus = rus.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resample dataset shape:', Counter(y_rus))

4. Random over-sampling with imblearn

One way to fight imbalanced data is to generate new samples in the minority classes. The most naive
strategy is to generate new samples by randomly sampling, with replacement, from the currently available
samples. The RandomOverSampler offers such a scheme.

# import libraries
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

# fit predictor and target variable
x_ros, y_ros = ros.fit_resample(x, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_ros))

5. Under-sampling: Tomek links

Tomek links are pairs of very close instances of opposite classes. Removing the majority-class
instance of each pair increases the space between the two classes, facilitating the classification
process.

A Tomek link exists if two samples of opposite classes are the nearest neighbors of each other.

In the code below, we’ll use sampling_strategy='majority' to resample the majority class.

# import libraries
from collections import Counter
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='majority')

# fit predictor and target variable
x_tl, y_tl = tl.fit_resample(x, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_tl))
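
To make the definition concrete, here is a minimal pure-Python sketch of finding Tomek links on a toy one-dimensional dataset. It is a brute-force illustration with Euclidean distance (an assumption), not how imbalanced-learn implements it; `tomek_links` and `remove_majority_in_links` are hypothetical helper names.

```python
import math


def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j) that are mutual nearest neighbors with opposite labels."""
    links = []
    for i in range(len(X)):
        j = nearest_neighbor(i, X)
        if i < j and y[i] != y[j] and nearest_neighbor(j, X) == i:
            links.append((i, j))
    return links


def remove_majority_in_links(X, y, majority_label):
    """Drop only the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair
            if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Toy data (made up): class 0 is the majority, class 1 the minority
X = [[0.0], [0.1], [0.5], [1.0], [1.1], [2.0]]
y = [0, 0, 0, 0, 1, 1]

links = tomek_links(X, y)
X_clean, y_clean = remove_majority_in_links(X, y, majority_label=0)

print(links)    # [(3, 4)]
print(y_clean)  # [0, 0, 0, 1, 1]
```

Note that the classes are not balanced afterwards; only the majority points sitting right on the class boundary are removed.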

6. Synthetic Minority Oversampling Technique (SMOTE)

This technique generates synthetic data for the minority class.

SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority
class and computing the k-nearest neighbors for this point. The synthetic points are added between the
chosen point and its neighbors.

The SMOTE algorithm works in 4 simple steps:

1. Choose a minority class sample as the input vector
2. Find its k nearest neighbors (k_neighbors is specified as an argument in the SMOTE() function)
3. Choose one of these neighbors and place a synthetic point anywhere on the line joining the point
under consideration and its chosen neighbor
4. Repeat the steps until the data is balanced

# import libraries
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(x, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_smote))
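
The four steps can also be sketched by hand, which may help show what "a synthetic point on the line" means. This is a simplified illustration in plain Python, assuming Euclidean distance and a brute-force neighbor search; `smote_sketch` is a hypothetical name and this is not the imbalanced-learn implementation.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic minority points by interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):  # step 4: repeat until enough points exist
        # Step 1: choose a minority sample
        base = rng.choice(minority)
        # Step 2: find its k nearest minority neighbors (excluding itself)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        # Step 3: pick one neighbor and place a point on the joining segment
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class (made-up values): corners of a unit square
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
synthetic = smote_sketch(minority, n_new=5)

print(len(synthetic))  # 5
```

Because every new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies.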

7. NearMiss

NearMiss is an under-sampling technique. Instead of resampling the minority class, it uses a distance
measure to decide which majority class samples to keep, making the majority class equal in size to the
minority class.

from collections import Counter
from imblearn.under_sampling import NearMiss

nm = NearMiss()
x_nm, y_nm = nm.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resample dataset shape:', Counter(y_nm))
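
As an illustration of the distance-based selection, the sketch below follows the NearMiss-1 rule: keep the majority samples whose average distance to their k nearest minority samples is smallest. It is a brute-force toy version with assumed Euclidean distance and a hypothetical function name `near_miss_1`, not the library's code.

```python
import math


def near_miss_1(majority, minority, k=2):
    """Keep as many majority points as there are minority points,
    preferring those closest (on average) to their k nearest minority points."""
    k = min(k, len(minority))

    def avg_dist_to_nearest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)
        return sum(dists[:k]) / k

    ranked = sorted(majority, key=avg_dist_to_nearest_minority)
    return ranked[:len(minority)]


# Toy one-dimensional data (made-up values)
majority = [[0.0], [1.0], [2.0], [3.0], [10.0]]
minority = [[4.0], [5.0]]

kept = near_miss_1(majority, minority)
print(kept)  # [[3.0], [2.0]]
```

The retained majority points are the ones nearest the minority class, so the classifier is trained on the hardest part of the boundary.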

8. Change the performance metric

Accuracy is not the best metric to use when evaluating imbalanced datasets as it can be misleading.

Metrics that can provide better insight are:

Confusion Matrix: a table showing correct predictions and types of incorrect predictions.
Precision: the number of true positives divided by all positive predictions. Precision is also called
Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high
number of false positives.
Recall: the number of true positives divided by the number of positive values in the test data.
Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness.
Low recall indicates a high number of false negatives.
F1 Score: the harmonic mean of precision and recall.
Area Under ROC Curve (AUROC): AUROC represents the likelihood of your model distinguishing
observations from two classes.
In other words, if you randomly select one observation from each class, what’s the probability that your
model will be able to “rank” them correctly?
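
These definitions translate directly into a few lines of plain Python. The labels and scores below are made up for illustration, the 0.5 threshold is an assumption, and the AUROC is computed from its ranking interpretation (all positive/negative pairs, ties counted as half) rather than by tracing the curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.35, 0.5, 0.8, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = auroc(y_true, scores)

print(precision, recall)  # 0.6 0.75
print(round(f1, 4))       # 0.6667
print(round(auc, 4))      # 0.7917
```

Precision, recall and F1 depend on the chosen threshold, while AUROC is computed from the scores themselves and does not.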

9. Penalize Algorithms (Cost-Sensitive Training)

The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on
the minority class.

A popular algorithm for this technique is Penalized-SVM.

During training, we can use the argument class_weight='balanced' to penalize mistakes on the
minority class by an amount proportional to how under-represented it is.
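
To see what "proportional to how under-represented it is" means in numbers, the sketch below reproduces the heuristic that scikit-learn documents for the 'balanced' mode, n_samples / (n_classes * count of each class), in plain Python. The labels are synthetic (a 94/6 split, assumed to match the earlier example) and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Synthetic labels (assumed): 94 non-fraudulent, 6 fraudulent
y = [0] * 94 + [1] * 6
weights = balanced_class_weights(y)

print({label: round(w, 3) for label, w in weights.items()})
# {0: 0.532, 1: 8.333}
```

With these weights, a mistake on a fraudulent row costs roughly 16 times as much as a mistake on a non-fraudulent one.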

We also want to include the argument probability=True if we want to enable probability estimates for
SVM algorithms.

Let’s train a model using Penalized-SVM on the original imbalanced dataset:

# load libraries
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.svm import SVC

# class_weight='balanced' penalizes mistakes on the minority class
svc_model = SVC(class_weight='balanced', probability=True)
svc_model.fit(x_train, y_train)
svc_predict = svc_model.predict(x_test)

# check performance
print('ROCAUC score:', roc_auc_score(y_test, svc_predict))
print('Accuracy score:', accuracy_score(y_test, svc_predict))
print('F1 score:', f1_score(y_test, svc_predict))

10. Change the algorithm

While it’s a good rule of thumb in every machine learning problem to try a variety of algorithms, it can be
especially beneficial with imbalanced datasets.

Decision trees frequently perform well on imbalanced data. In modern machine learning, tree ensembles
(Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees, so we’ll
jump right into those.

Tree-based algorithms work by learning a hierarchy of if/else questions. This can force both classes to be
addressed.

# load libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rfc = RandomForestClassifier()

# fit the predictor and target
rfc.fit(x_train, y_train)

# predict
rfc_predict = rfc.predict(x_test)

# check performance
print('ROCAUC score:', roc_auc_score(y_test, rfc_predict))
print('Accuracy score:', accuracy_score(y_test, rfc_predict))
print('F1 score:', f1_score(y_test, rfc_predict))

Advantages and disadvantages of under-sampling

Advantages

It can help improve run time and storage problems by reducing the number of training data samples
when the training data set is huge.

Disadvantages

It can discard potentially useful information which could be important for building rule classifiers.
The sample chosen by random under-sampling may be biased and may not be an accurate
representation of the population, resulting in inaccurate results on the actual test data set.

Advantages and disadvantages of over-sampling

Advantages

Unlike under-sampling, this method leads to no information loss.

It often outperforms under-sampling.

Disadvantages

It increases the likelihood of overfitting since it replicates the minority class events.

You can check the implementation of the code in my GitHub repository here.

References

1. https://elitedatascience.com/imbalanced-classes
2. https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18

Conclusion

To summarize, in this article we have seen various techniques to handle class imbalance in a dataset.
There are many methods to try when dealing with imbalanced data. I hope this article was useful; if
so, please share and like it.

Thanks for reading…!

About the Author

Benai Kumar – Aspiring Data Scientist by heart | Keen to learn and share knowledge

Article Url - https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/

Guest Blog

