
IMBALANCED DATA

Setup
1. For 1 hour, Google collects 1M e-mails randomly.
2. They pay people to label them as “phishing” or “not-phishing”.
3. They give the data to you to learn to classify e-mails as phishing or not.
4. You, having taken ML, try out a few of your favorite classifiers.
5. You achieve an accuracy of 99.997%.

Should you be happy?


Imbalanced data

[Diagram: labeled data — 99.997% not-phishing, 0.003% phishing]

The phishing problem is what is called an imbalanced data problem.

This occurs where there is a large discrepancy between the number of examples with each class label.

e.g. in our 1M-example dataset, only about 30 e-mails would actually represent phishing.

What is probably going on with our classifier?
Imbalanced data
Many classifiers are designed to optimize error/accuracy.

This tends to bias performance towards the majority class.

Anytime there is an imbalance in the data, this can happen.
Imbalanced problem domains
Medical diagnosis

Predicting faults/failures (e.g. hard-drive failures, mechanical failures, etc.)

Predicting rare events (e.g. earthquakes)

Detecting fraud (credit card transactions, internet traffic)
Imbalanced data: current classifiers

[Diagram: labeled data — 99.997% not-phishing, 0.003% phishing]

How will our current classifiers do on this problem?


Imbalanced data: current classifiers
All will do fine if the data can be easily separated/distinguished.

Decision trees:
- explicitly minimize training error
- when pruning, pick the “majority” label at leaves
- tend to do very poorly on imbalanced problems

k-NN:
- even for small k, the majority class will tend to overwhelm the vote

perceptron:
- can be reasonable, since it only updates when a mistake is made
- can take a long time to learn
“identification” tasks
View the task as trying to find/identify the “positive” examples (i.e. the rare events).

Precision: the proportion of test examples predicted as positive that are actually positive

    precision = # correctly predicted as positive / # examples predicted as positive

Recall: the proportion of test examples labeled as positive that are correctly predicted

    recall = # correctly predicted as positive / # positive examples in test set
“identification” tasks

[Diagram: the set of examples predicted positive vs. the set of all positive examples; precision is measured against the predicted-positive set, recall against the all-positive set]
precision and recall

    label   predicted
      0         0
      0         1
      1         0
      1         1
      0         1
      1         1
      0         0

precision = # correctly predicted as positive / # examples predicted as positive = 2/4 = 0.5

recall = # correctly predicted as positive / # positive examples in test set = 2/3 ≈ 0.67
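To make the computation concrete, here is a minimal Python sketch (an illustration, not part of the original slides) that reproduces the worked example above from the label/predicted columns:

    # label/predicted pairs from the table above
    labels      = [0, 0, 1, 1, 0, 1, 0]
    predictions = [0, 1, 0, 1, 1, 1, 0]

    true_positives      = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    predicted_positives = sum(1 for p in predictions if p == 1)
    actual_positives    = sum(1 for y in labels if y == 1)

    precision = true_positives / predicted_positives   # 2 / 4 = 0.5
    recall    = true_positives / actual_positives      # 2 / 3 ≈ 0.67
    print(precision, recall)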
precision and recall

Why do we have both measures?
How can we maximize precision?
How can we maximize recall?
Maximizing precision

    label   predicted
      0         0
      0         0
      1         0
      1         0
      0         0
      1         0
      0         0

Don’t predict anything as positive!
Maximizing recall

    label   predicted
      0         1
      0         1
      1         1
      1         1
      0         1
      1         1
      0         1

Predict everything as positive!
precision vs. recall
Often there is a tradeoff between precision and recall:

increasing one tends to decrease the other.

For our algorithms, how might we increase/decrease precision/recall?
precision/recall tradeoff

    label   predicted   confidence
      0         0          0.75
      0         1          0.60
      1         0          0.20
      1         1          0.80
      0         1          0.50
      1         1          0.55
      0         0          0.90

- For many classifiers we can get some notion of the prediction confidence.
- Only predict positive if the confidence is above a given threshold.
- By varying this threshold, we can vary precision and recall.
precision/recall tradeoff

    label   predicted   confidence
      1         1          0.80
      0         1          0.60
      1         1          0.55
      0         1          0.50
      1         0          0.20
      0         0          0.75
      0         0          0.90

Put the most confident positive predictions at the top.

Put the most confident negative predictions at the bottom.

Calculate precision/recall at each break point/threshold.

Classify everything above the threshold as positive and everything else as negative.
precision/recall tradeoff

For example, at three of the break points (each precision/recall pair is computed treating that row and everything above it as positive):

    label   predicted   confidence   precision     recall
      1         1          0.80
      0         1          0.60      1/2 = 0.5     1/3 = 0.33
      1         1          0.55
      0         1          0.50
      1         0          0.20      3/5 = 0.6     3/3 = 1.0
      0         0          0.75
      0         0          0.90      3/7 = 0.43    3/3 = 1.0


precision/recall tradeoff

    label   predicted   confidence   precision   recall
      1         1          0.80         1.0        0.33
      0         1          0.60         0.5        0.33
      1         1          0.55         0.66       0.66
      0         1          0.50         0.5        0.66
      1         0          0.20         0.6        1.0
      0         0          0.75         0.5        1.0
      0         0          0.90         0.43       1.0
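The table above can be reproduced with a short Python sketch (an illustration, not part of the slides): given the examples already ranked from most to least confidently positive, treat the top k as positive at each break point and compute precision and recall.

    # true labels in ranked order (most confidently positive first)
    ranked_labels   = [1, 0, 1, 0, 1, 0, 0]
    total_positives = sum(ranked_labels)          # 3 positives in the test set

    true_positives = 0
    for k, label in enumerate(ranked_labels, start=1):
        true_positives += label                   # did we gain a correct positive?
        precision = true_positives / k            # k examples predicted positive
        recall    = true_positives / total_positives
        print(f"top {k}: precision = {precision:.2f}, recall = {recall:.2f}")

The printed values match the precision/recall columns in the table above (up to rounding).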


Area under the curve
Area under the curve (AUC) is one metric that encapsulates both precision and recall:

calculate the precision/recall values for all thresholdings of the test set (as we did before)

then calculate the area under the resulting precision/recall curve

it can also be calculated as the average precision over all of the recall points
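As a sketch of the last point (an illustration, not from the slides), the average-precision version can be computed from the threshold sweep above by averaging the precision values at the points where recall increases:

    precisions = [1.0, 0.5, 0.66, 0.5, 0.6, 0.5, 0.43]   # from the sweep above
    recalls    = [0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0]

    # keep precision only where recall goes up (i.e. where a true positive is added)
    ap_points = [p for p, r, prev in zip(precisions, recalls, [0.0] + recalls)
                 if r > prev]
    average_precision = sum(ap_points) / len(ap_points)
    print(average_precision)                              # (1.0 + 0.66 + 0.6) / 3 ≈ 0.75

Library routines such as scikit-learn’s average_precision_score compute essentially the same quantity directly from the labels and confidence scores.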
Area under the curve?

[Figure: two precision vs. recall curves with different shapes]

Any concerns/problems?
Area under the curve?

[Figure: two precision vs. recall curves compared]

For real use, we are often only interested in performance in a particular range.

Eventually, we need to deploy. How do we decide what threshold to use?
F1-measure

Most common is α = 0.5: equal balance/weighting between precision and recall (α and β are related by α = 1/(1 + β²), so α = 0.5 corresponds to β = 1):

    F = 1 / (α·(1/P) + (1 − α)·(1/R)) = ((β² + 1)·P·R) / (β²·P + R)

    F1 = 1 / (0.5·(1/P) + 0.5·(1/R)) = 2·P·R / (P + R)

where P is precision and R is recall.
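As a quick check (an illustration, not from the slides), plugging in the precision and recall from the earlier worked example gives:

    precision = 2 / 4
    recall    = 2 / 3

    f1 = 2 * precision * recall / (precision + recall)
    print(f1)    # 2 * 0.5 * 0.667 / (0.5 + 0.667) ≈ 0.57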
Evaluation summarized
Accuracy is often NOT an appropriate evaluation metric for imbalanced data problems.

precision/recall capture different characteristics of our classifier

AUC and F1 can be used as a single metric to compare algorithm variations (and to tune hyperparameters)
Black box approach
Abstraction: we have a generic binary classifier; how can we use it to solve our new problem?

[Diagram: input → binary classifier → +1 or −1 (optionally: also output a confidence/score)]

Can we do some pre-processing/post-processing of our data to allow us to still use our binary classifiers?
Idea 1: subsampling

Create a new training data set by:
- including all k “positive” examples
- randomly picking k “negative” examples

[Diagram: labeled data (99.997% not-phishing, 0.003% phishing) → subsampled data (50% not-phishing, 50% phishing)]

pros/cons?
Subsampling
Pros:
- Easy to implement
- Training becomes much more efficient (smaller training set)
- For some domains, can work very well

Cons:
- Throws away a lot of data/information
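A minimal sketch of subsampling in Python, assuming the training data is a list of (features, label) pairs with label 1 for the rare “positive” class (these names are illustrative, not from the slides):

    import random

    def subsample(examples, seed=0):
        """Keep all positives, randomly pick an equal number of negatives."""
        positives = [ex for ex in examples if ex[1] == 1]
        negatives = [ex for ex in examples if ex[1] == 0]
        rng = random.Random(seed)
        balanced = positives + rng.sample(negatives, k=len(positives))
        rng.shuffle(balanced)
        return balanced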
Idea 2: oversampling

Create a new training data set by:
- including all m “negative” examples
- including m “positive” examples:
  - repeat each example a fixed number of times, or
  - sample with replacement

[Diagram: labeled data (99.997% not-phishing, 0.003% phishing) → oversampled data (50% not-phishing, 50% phishing)]

pros/cons?
oversampling
Pros:
- Easy to implement
- Utilizes all of the training data
- Tends to perform well in a broader set of circumstances than subsampling

Cons:
- Computationally more expensive to train the classifier (much larger training set)
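A minimal sketch of oversampling by sampling with replacement, under the same assumptions as the subsampling sketch (a list of (features, label) pairs, label 1 for the rare class):

    import random

    def oversample(examples, seed=0):
        """Keep all negatives, draw an equal number of positives with replacement."""
        positives = [ex for ex in examples if ex[1] == 1]
        negatives = [ex for ex in examples if ex[1] == 0]
        rng = random.Random(seed)
        balanced = negatives + [rng.choice(positives) for _ in range(len(negatives))]
        rng.shuffle(balanced)
        return balanced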
Idea 2b: weighted examples

Add costs/weights to the training set:

- “negative” examples get weight 1
- “positive” examples get a much larger weight, e.g. 99.997/0.003 ≈ 33332
- change the learning algorithm to optimize weighted training error

[Diagram: labeled data (99.997% not-phishing with weight 1, 0.003% phishing with weight ≈ 33332)]

pros/cons?
weighted examples
Pros:
- Achieves the effect of oversampling without the computational cost
- Utilizes all of the training data
- Tends to perform well in a broader set of circumstances

Cons:
- Requires a classifier that can deal with weights

Of our three classifiers, can all be modified to handle weights?
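Many off-the-shelf classifiers already accept class or example weights. A minimal sketch using scikit-learn’s class_weight option (the library, classifier, and synthetic dataset are assumptions for illustration; the slides do not prescribe a particular implementation):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # a small synthetic imbalanced dataset, purely for illustration
    X, y = make_classification(n_samples=1000, weights=[0.99, 0.01], random_state=0)

    # "balanced" weights each class inversely to its frequency, which has the
    # same effect as giving the rare positives a large weight, as described above
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X, y)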


Idea 3: Sampling (SMOTE)

SMOTE: Synthetic Minority Over-sampling Technique

The minority class is over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining it to any/all of its k minority-class nearest neighbors.

- A statistical technique for increasing the number of cases in your dataset in a balanced way.
- It works by generating new instances from the existing minority cases that you supply as input.
- It does not change the number of majority cases.
- The new instances are not just copies of existing minority cases; nearest neighbors are used to generate the new examples, which makes the samples more general.
- SMOTE takes the entire dataset as input, but it increases the percentage of only the minority cases.
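A minimal sketch using the imbalanced-learn package’s SMOTE implementation (the package, synthetic dataset, and parameter values are assumptions for illustration; the slides describe the technique, not a specific library):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # a small synthetic imbalanced dataset, purely for illustration
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

    # k_neighbors controls how many minority-class nearest neighbors are used
    # when interpolating new synthetic minority examples
    smote = SMOTE(k_neighbors=5, random_state=0)
    X_resampled, y_resampled = smote.fit_resample(X, y)

    print(Counter(y))            # imbalanced, roughly {0: ~950, 1: ~50}
    print(Counter(y_resampled))  # minority class grown to match the majority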
                                         Class 0       Class 1       Total
    Original dataset                     570 (76%)     178 (24%)       748
    (equivalent to SMOTE percentage = 0)
    SMOTE percentage = 100               570 (62%)     356 (38%)       926
    SMOTE percentage = 200               570 (52%)     534 (48%)      1104
    SMOTE percentage = 300               570 (44%)     712 (56%)      1282
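For example, with 178 original minority cases, a SMOTE percentage of 200 generates 2 × 178 = 356 new synthetic cases, so the minority class grows to 178 + 356 = 534 while the 570 majority cases are untouched, exactly as in the table above.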
Warning

- Increasing the number of cases using SMOTE is not guaranteed to produce more accurate models.
- You should experiment with different percentages, different feature sets, and different numbers of nearest neighbors to see how adding cases influences your model.

Use the “Number of nearest neighbors” option to determine the size of the feature space that the SMOTE algorithm uses when building new cases.

- A nearest neighbor is a row of data (a case) that is very similar to some target case. The distance between any two cases is measured by combining the weighted vectors of all features.
- By increasing the number of nearest neighbors, you get features from more cases.
- By keeping the number of nearest neighbors low, you use features that are more like those in the original sample.
