Lesson 3
Business Insights
Lesson 3 – Week 1
Xavier Figueroa
Class imbalance
Class imbalance occurs in a dataset when the distribution of the target variable is uneven, meaning that one
class significantly outnumbers the others.
Resampling Techniques
Oversampling: Increase the number of instances in the minority class. Techniques include
duplicating existing instances or generating synthetic examples using methods like SMOTE.
Undersampling: Reduce the number of instances in the majority class to balance the dataset. This
can involve randomly removing samples from the majority class.
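The two resampling strategies above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the helper names `random_oversample` and `random_undersample` and the toy data are hypothetical, and in practice a library such as imbalanced-learn (`RandomOverSampler`, `RandomUnderSampler`) would normally be used instead.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # duplicate an existing instance
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):  # sample without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 majority rows (class 0) and 2 minority rows (class 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # class counts after oversampling
print(Counter(y_under))  # class counts after undersampling
```

Oversampling keeps every original row and grows the dataset; undersampling shrinks it and discards majority rows.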
Algorithmic Adjustments
Class Weights: Use algorithms that accept per-class weights (e.g., Decision Trees with
class weights), so that errors on the minority class count more during training.
Cost-Sensitive Learning: Modify existing algorithms with techniques like cost-sensitive learning,
where misclassification of the minority class is penalized more.
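As a rough illustration of class weighting and cost-sensitive evaluation, the sketch below computes weights inversely proportional to class frequency, n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight="balanced"`. The helper names and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Misclassification rate where each mistake costs the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical labels: 8 majority (0) and 2 minority (1) samples.
y_true = [0] * 8 + [1] * 2
weights = balanced_class_weights(y_true)

# A model that always predicts the majority class.
always_majority = [0] * 10
plain_error = sum(t != p for t, p in zip(y_true, always_majority)) / len(y_true)
cost_sensitive_error = weighted_error(y_true, always_majority, weights)
print(weights, plain_error, cost_sensitive_error)
```

With these weights, ignoring the minority class entirely looks much worse under the weighted error than under the plain error rate, which is the effect cost-sensitive learning relies on.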
Handling class imbalance
SMOTE (Synthetic Minority Over-sampling Technique)
https://varshasaini.in/glossary/smote/
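The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, the choice `k=2`, and the toy points are assumptions, and imbalanced-learn's `SMOTE` class is the usual production choice.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points in two dimensions.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
synthetic = smote_sketch(minority, n_new=6)
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.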
Tomek’s links
https://imbalanced-learn.org/stable/under_sampling.html#cleaning-under-sampling-techniques
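A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. A minimal sketch of finding such pairs and removing the majority-class member is shown below; the helper names and the one-dimensional toy data are hypothetical, and imbalanced-learn's `TomekLinks` provides a maintained implementation.

```python
import math
from collections import Counter


def nearest_index(X, i):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j) of opposite-class samples that are mutual nearest neighbours."""
    links = []
    for i in range(len(X)):
        j = nearest_index(X, i)
        if i < j and y[i] != y[j] and nearest_index(X, j) == i:
            links.append((i, j))
    return links


# Hypothetical data: the points at 5.0 (class 0) and 5.4 (class 1) sit on the boundary.
X = [[0.0], [1.0], [2.0], [5.0], [5.4], [9.0]]
y = [0, 0, 0, 0, 1, 1]

links = tomek_links(X, y)
majority = Counter(y).most_common(1)[0][0]

# Cleaning step: drop the majority-class member of every link.
to_drop = {idx for pair in links for idx in pair if y[idx] == majority}
X_clean = [x for i, x in enumerate(X) if i not in to_drop]
y_clean = [lab for i, lab in enumerate(y) if i not in to_drop]
print(links, y_clean)
```

Removing only the majority member of each link cleans the class boundary without touching the scarce minority samples.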
Resampling Disadvantages
Bias Introduction
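Issue: Resampling changes the class distribution the model is trained on, so it may introduce
bias if applied indiscriminately: duplicated minority samples can encourage overfitting, and
randomly removed majority samples discard information.
Impact: This can lead to skewed model performance and inaccurate predictions.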
Data Scaling
MinMax Scaler
It rescales each feature to a defined range ([0, 1] by default in scikit-learn).
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
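The min-max transformation can be written out by hand as x' = lo + (x - min) / (max - min) * (hi - lo). The sketch below applies it to a single feature; the function name `min_max_scale`, the handling of a constant feature, and the sample values are assumptions, and scikit-learn's `MinMaxScaler` is the usual tool for real data.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly map values so the minimum lands on lo and the maximum on hi."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Assumption for this sketch: a constant feature maps to the lower bound.
        return [lo for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]


# Hypothetical feature with one large value.
incomes = [20.0, 30.0, 40.0, 120.0]
scaled = min_max_scale(incomes)
print(scaled)
```

Because the minimum and maximum define the mapping, a single extreme value compresses all the other values into a narrow part of the range.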
Robust Scaler
The centering and scaling statistics of RobustScaler are based on percentiles.
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
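A percentile-based scaling of this kind can be sketched as x' = (x - median) / (Q3 - Q1), using the 25th and 75th percentiles, which are the default quantile range of scikit-learn's `RobustScaler`. The function name `robust_scale`, the fallback for a zero interquartile range, and the sample values are assumptions made for illustration.

```python
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    # "inclusive" uses linear interpolation between the sorted data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # assumption for this sketch: avoid division by zero
    return [(v - median) / iqr for v in values]


# Hypothetical feature with one outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(values)
print(scaled)
```

The outlier remains far from the other points after scaling, but it does not influence the median or the IQR, so the bulk of the data keeps a sensible spread.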