Slides Imbalanced Learning Intro

The document discusses imbalanced data sets where the ratio of classes is significantly different. This can cause undesirable predictive behavior for the smaller class. Some examples of domains with imbalanced data are medicine, information retrieval, and fraud detection. The document outlines issues with evaluating classifiers on imbalanced data and possible solutions like resampling data, cost-sensitive learning, and ensemble-based approaches.

Advanced Machine Learning

Imbalanced Learning: Introduction

Learning goals

Know what an imbalanced data set is

Understand the disadvantage of accuracy on imbalanced data

Know techniques for handling imbalanced data sets

[Figure: scatter plot of positives and negatives in the x1/x2 plane, sampled from two Gaussian distributions]
IMBALANCED DATA SETS
Class imbalance: Ratio of classes is significantly different.
Consequence: Undesirable predictive behavior for smaller class.
Example: Sampling from two Gaussian distributions
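The example above can be sketched in a few lines of numpy. This is not the slides' code; the class means, spreads, and the 50-vs-1000 imbalance ratio are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minority class ("positives"): few samples from one Gaussian
n_pos, n_neg = 50, 1000
pos = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(n_pos, 2))
# Majority class ("negatives"): many samples from a second Gaussian
neg = rng.normal(loc=[-1.0, -1.0], scale=0.5, size=(n_neg, 2))

X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])

# Fraction of positives shows the imbalance (50 out of 1050 samples)
print(X.shape, y.mean())
```

Plotting `X` colored by `y` reproduces a scatter like the one on this slide, with the positive cluster dwarfed by the negative one.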

© Advanced Machine Learning – 1 / 6


IMBALANCED DATA SETS: EXAMPLES

Domain                  Task                       Majority class     Minority class
Medicine                Predict tumor pathology    Benign             Malignant
Information retrieval   Find relevant items        Irrelevant items   Relevant items
Tracking criminals      Detect fraud emails        Non-fraud emails   Fraud emails
Weather prediction      Predict extreme weather    Normal weather     Tornado, hurricane

Often, the minority class is the more important class.


Imbalanced data can be a source of bias related to the concept of fairness.

© Advanced Machine Learning – 2 / 6


ISSUES WITH EVALUATING CLASSIFIERS
Ideal case: correctly classify as many instances as possible
⇒ High accuracy, preferably 100%.
In practice, on imbalanced data sets we often obtain good
performance on the majority class(es) but poor performance on
the minority class(es).
Reason: the classifier is biased towards the majority class(es), since
predicting the majority class pays off in terms of accuracy.
Focusing only on accuracy can therefore lead to bad performance on
the minority class.
Example:
Assume that only 0.5% of the patients have a disease.
Always predicting “no disease” yields an accuracy of 99.5%.
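The 99.5% figure is easy to verify with a toy calculation (illustrative code, not from the slides):

```python
# 0.5% prevalence: 50 sick patients out of 10,000
n_patients = 10_000
n_sick = 50
y_true = [1] * n_sick + [0] * (n_patients - n_sick)

# Trivial classifier: always predict "no disease" (label 0)
y_pred = [0] * n_patients

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_patients
print(accuracy)  # 0.995
```

The classifier never detects a single sick patient, yet its accuracy of 0.995 looks excellent.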

© Advanced Machine Learning – 3 / 6


ISSUES WITH EVALUATING CLASSIFIERS
[Figure: four panels showing Accuracy, TPR, PPV, and F1 score of a classification tree, logistic regression, and an SVM as the positive/negative ratio decreases from 10000/10000 to 50/10000]

In each scenario, we have 10,000 observations in the negative class; the number of
observations in the positive class varies between 10,000, 1,000, 100, and 50. Classifiers are
trained with 10-fold stratified CV and evaluated via aggregated predictions on the test set.

© Advanced Machine Learning – 4 / 6


POSSIBLE SOLUTIONS
Ideal performance metric: one under which learning is properly biased
towards the minority class(es).
Imbalance-aware performance metrics:
G-score
Balanced accuracy
Matthews Correlation Coefficient
Weighted macro F1 score

© Advanced Machine Learning – 5 / 6


POSSIBLE SOLUTIONS

Approach                  Main idea                                  Remark
Algorithm-level           Bias classifiers towards minority          Special knowledge about classifiers is needed
Data-level                Rebalance classes by resampling            No modification of classifiers is needed
Cost-sensitive learning   Introduce different costs for              Between algorithm- and data-level approaches
                          misclassification when learning
Ensemble-based            Ensemble learning plus one of the          –
                          three techniques above
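The data-level approach is the simplest to sketch. Below is a minimal random oversampling example (illustrative, not the slides' implementation): minority samples are duplicated with replacement until every class matches the majority-class size.

```python
import numpy as np

def random_oversample(X, y, rng=None):
    """Data-level rebalancing: resample each class up to the majority size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        # Sample with replacement so small classes reach n_max
        idx.append(rng.choice(c_idx, size=n_max, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# 1 positive vs 9 negatives
X = np.arange(20).reshape(10, 2)
y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
X_bal, y_bal = random_oversample(X, y, rng=0)
print(np.bincount(y_bal))  # [9 9]
```

Because no classifier is modified, any learner can be trained on `X_bal, y_bal` unchanged; the price is duplicated minority points, which can encourage overfitting.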

© Advanced Machine Learning – 6 / 6
