Lesson 3

Uploaded by

Yuliana Sanchez

Python for

Business Insights
Lesson 3 – Week 1

Xavier Figueroa
Class imbalance
Class imbalance occurs in a dataset when the distribution of the target variable is uneven, meaning that one
class significantly outnumbers the others.
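A quick way to check for imbalance is to count the target labels. The sketch below uses only the standard library; the label list `y` (1 = fraud, 0 = legitimate) and the helper names are made up for illustration.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: share of rows} for a list of target labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical target column: 1 = fraud, 0 = legitimate
y = [0] * 990 + [1] * 10
shares = class_distribution(y)   # share of each class
ratio = imbalance_ratio(y)       # how many majority rows per minority row
print(shares, ratio)
```

Here the minority class makes up 1% of the rows, which matches the fraud example that follows.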

Examples of class imbalance problems


Fraud Detection in Financial Transactions
Detecting fraudulent transactions (credit card fraud, insurance fraud, etc.) in large volumes of financial
data. The number of fraudulent transactions is typically very small compared to the legitimate ones,
often less than 1%.

Customer Churn Prediction


Predicting which customers are likely to stop using a service or cancel a subscription (e.g., telecom
services, SaaS products). In many cases, most customers continue using the service, leading to a small
minority of churn cases.

Defect Detection in Manufacturing


Identifying defective products on a production line (e.g., in automotive, electronics, or pharmaceuticals).
Most products are usually manufactured correctly, with defects occurring in a small fraction of the total
output.
Handling class imbalance
Handling imbalanced classes matters in machine learning because a model trained on imbalanced data tends to
favor the majority class and perform poorly on the minority class, which is often the class of interest.

Resampling Techniques
Oversampling: Increase the number of instances in the minority class. Techniques include
duplicating existing instances or generating synthetic examples using methods like SMOTE.

Undersampling: Reduce the number of instances in the majority class to balance the dataset. This
can involve randomly removing samples from the majority class.
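Both ideas can be sketched in a few lines of plain Python. This is a minimal illustration for a two-class problem, not a production implementation; the function names and the toy data are assumptions. In practice a library such as imbalanced-learn would usually be used instead.

```python
import random

def random_oversample(rows, labels, minority, seed=0):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    n_majority = len(rows) - len(minority_rows)
    n_extra = n_majority - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    return rows + extra, labels + [minority] * len(extra)

def random_undersample(rows, labels, majority, seed=0):
    """Keep a random subset of majority rows, as many as there are other rows."""
    rng = random.Random(seed)
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    other_idx = [i for i, y in enumerate(labels) if y != majority]
    keep = sorted(rng.sample(maj_idx, len(other_idx)) + other_idx)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: 8 majority rows (class 0) and 2 minority rows (class 1)
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
X_over, y_over = random_oversample(rows, labels, minority=1)
X_under, y_under = random_undersample(rows, labels, majority=0)
```

Oversampling grows the dataset to 16 rows (8 per class), while undersampling shrinks it to 4 rows (2 per class).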

Algorithmic Adjustments
Class Weights: Use algorithms that accept class weights (e.g., Decision Trees with class weights),
so that errors on the minority class count more during training.

Cost-Sensitive Learning: Modify the algorithm's loss function so that misclassifying a
minority-class instance is penalized more heavily than misclassifying a majority-class instance.
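As a rough sketch of both ideas: the weight formula below, n_samples / (n_classes * class_count), is the common "balanced" heuristic (the one scikit-learn uses for `class_weight="balanced"`). The helper names and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Cost-sensitive error: each mistake costs the weight of its true class."""
    cost = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return cost / sum(weights[t] for t in y_true)

# Toy labels: 90 majority rows (class 0), 10 minority rows (class 1)
y_true = [0] * 90 + [1] * 10
weights = balanced_class_weights(y_true)

# A "lazy" model that always predicts the majority class
y_pred = [0] * 100
plain_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
cost_error = weighted_error(y_true, y_pred, weights)
```

The lazy model looks good on plain error (10%) but its weighted error is about 50%, because every missed minority row costs nine times as much as a majority row.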
SMOTE (Synthetic Minority Over-sampling Technique)

https://varshasaini.in/glossary/smote/
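The core idea of SMOTE is to create each synthetic point on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea in plain Python, not the reference implementation; the function name, the value of `k`, and the toy points are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class samples with two features
minority_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]]
new_points = smote_sketch(minority_points, n_new=5)
```

Every synthetic point falls between two existing minority points, so the new samples stay inside the region the minority class already occupies.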
Tomek’s links

https://imbalanced-learn.org/stable/under_sampling.html#cleaning-under-sampling-techniques
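A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. Removing the majority-class member of each link cleans up the class boundary. The sketch below illustrates this in plain Python with a brute-force neighbour search; the function names and toy points are assumptions.

```python
import math

def nearest_index(points, i):
    """Index of the point closest to points[i], excluding itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )

def tomek_links(points, labels):
    """Index pairs (i, j), i < j, that are mutual nearest neighbours
    and belong to different classes."""
    nn = [nearest_index(points, i) for i in range(len(points))]
    return [
        (i, j) for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]

# Toy data: class 0 is the majority, class 1 the minority
X = [[0.0, 0.0], [0.2, 0.0], [2.0, 2.0], [2.1, 2.0], [5.0, 5.0], [5.3, 5.0]]
y = [0, 0, 0, 1, 0, 0]

links = tomek_links(X, y)
# Drop the majority-class member (class 0) of each link
drop = {i if y[i] == 0 else j for i, j in links}
X_clean = [p for idx, p in enumerate(X) if idx not in drop]
y_clean = [c for idx, c in enumerate(y) if idx not in drop]
```

In this toy data the only link is between points 2 and 3, so the majority point at index 2 is removed and the minority point is kept.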
Resampling Disadvantages

Data Distribution Distortion


Issue: Both oversampling and undersampling can alter the original distribution of the data,
potentially making it less representative of real-world scenarios.
Impact: Models trained on distorted data may not perform well on real, imbalanced
datasets.

Bias Introduction
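Issue: Resampling methods may introduce bias if applied indiscriminately. Duplicating minority
samples can encourage overfitting to those samples, while randomly removing majority samples
discards potentially useful information.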
Impact: This can lead to skewed model performance and inaccurate predictions.
Data Scaling
MinMax Scaler
It rescales each feature to a defined range, [0, 1] by default.

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
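The transformation behind min-max scaling is x' = lo + (x - min) / (max - min) * (hi - lo). A minimal plain-Python sketch for a single feature follows; the function name, the handling of a constant column, and the income values are assumptions made for illustration.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale one feature linearly so its min maps to lo and its max to hi."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant column: assumed here to map every value to the lower bound
        return [lo for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]

# Hypothetical feature: annual income
incomes = [20_000, 35_000, 50_000, 80_000]
scaled = min_max_scale(incomes)
print(scaled)  # → [0.0, 0.25, 0.5, 1.0]
```

Because the minimum and maximum define the range, a single extreme value compresses all other values into a narrow band.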
Robust Scaler
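The centering and scaling statistics of RobustScaler are based on percentiles: by default it subtracts the
median and divides by the interquartile range (IQR), so a few large outliers have little effect on the result.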

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
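A minimal plain-Python sketch of percentile-based scaling for one feature, assuming centering on the median and scaling by the interquartile range (25th to 75th percentile). The function name, the zero-IQR fallback, and the sample amounts are assumptions made for illustration.

```python
import statistics

def robust_scale(values):
    """Center on the median and scale by the interquartile range (IQR)."""
    # method="inclusive" interpolates linearly between the observed data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Assumed fallback: only center when the IQR is zero
        return [v - median for v in values]
    return [(v - median) / iqr for v in values]

# Hypothetical transaction amounts with one extreme outlier
amounts = [10, 12, 14, 16, 18, 1_000]
scaled = robust_scale(amounts)
```

The five typical amounts end up spread between -1 and 0.6, while the outlier stays far away at 197. Under min-max scaling the same five values would all be squeezed close to 0.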
