
IMBALANCED DATA

Setup
1. For 1 hour, Google collects 1M e-mails randomly.
2. They pay people to label them as “phishing” or “not-phishing”.
3. They give the data to you to learn to classify e-mails as phishing or not.
4. You, having taken ML, try out a few of your favorite classifiers.
5. You achieve an accuracy of 99.997%.

Should you be happy?


Imbalanced data

[Diagram: labeled data — 99.997% not-phishing, 0.003% phishing]

The phishing problem is what is called an imbalanced data problem.

This occurs where there is a large discrepancy between the number of examples with each class label.

e.g. in our 1M-example dataset, only about 30 e-mails would actually represent phishing.

What is probably going on with our classifier?
Imbalanced data
Many classifiers are designed to optimize error/accuracy.

This tends to bias performance towards the majority class.

Anytime there is an imbalance in the data, this can happen.
Imbalanced problem domains
Medical diagnosis

Predicting faults/failures (e.g. hard-drive failures, mechanical failures, etc.)

Predicting rare events (e.g. earthquakes)

Detecting fraud (credit card transactions, internet traffic)
Imbalanced data: current classifiers

[Diagram: labeled data — 99.997% not-phishing, 0.003% phishing]

How will our current classifiers do on this problem?


Imbalanced data: current classifiers
All will do fine if the data can be easily separated/distinguished.

Decision trees:
- explicitly minimize training error
- when pruning, pick the “majority” label at leaves
- tend to do very poorly on imbalanced problems

k-NN:
- even for small k, the majority class will tend to overwhelm the vote

perceptron:
- can be reasonable, since it only updates when a mistake is made
- can take a long time to learn
“identification” tasks
View the task as trying to find/identify the “positive” examples (i.e. the rare events).

Precision: the proportion of test examples predicted as positive that are actually positive

    precision = # correctly predicted as positive / # examples predicted as positive

Recall: the proportion of test examples labeled as positive that are correctly predicted

    recall = # correctly predicted as positive / # positive examples in test set
“identification” tasks

[Diagram: the set of examples predicted positive vs. the set of all positive examples; precision is measured against the predicted-positive set, recall against the all-positive set]
precision and recall

    label   predicted
      0         0
      0         1
      1         0
      1         1
      0         1
      1         1
      0         0

precision = # correctly predicted as positive / # examples predicted as positive = 2/4 = 0.5

recall = # correctly predicted as positive / # positive examples in test set = 2/3 ≈ 0.67
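To make the computation concrete, here is a minimal Python sketch (an illustration, not part of the original slides) that reproduces the worked example above from the label/predicted columns:

    # label/predicted pairs from the table above
    labels      = [0, 0, 1, 1, 0, 1, 0]
    predictions = [0, 1, 0, 1, 1, 1, 0]

    true_positives      = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    predicted_positives = sum(1 for p in predictions if p == 1)
    actual_positives    = sum(1 for y in labels if y == 1)

    precision = true_positives / predicted_positives   # 2 / 4 = 0.5
    recall    = true_positives / actual_positives      # 2 / 3 ≈ 0.67
    print(precision, recall)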
precision and recall

Why do we have both measures?
How can we maximize precision?
How can we maximize recall?
Maximizing precision

    label   predicted
      0         0
      0         0
      1         0
      1         0
      0         0
      1         0
      0         0

Don’t predict anything as positive!
Maximizing recall

    label   predicted
      0         1
      0         1
      1         1
      1         1
      0         1
      1         1
      0         1

Predict everything as positive!
precision vs. recall
Often there is a tradeoff between precision and recall:

increasing one tends to decrease the other.

For our algorithms, how might we increase/decrease precision/recall?
precision/recall tradeoff

    label   predicted   confidence
      0         0          0.75
      0         1          0.60
      1         0          0.20
      1         1          0.80
      0         1          0.50
      1         1          0.55
      0         0          0.90

- For many classifiers we can get some notion of the prediction confidence.
- Only predict positive if the confidence is above a given threshold.
- By varying this threshold, we can vary precision and recall.
precision/recall tradeoff

    label   predicted   confidence
      1         1          0.80
      0         1          0.60
      1         1          0.55
      0         1          0.50
      1         0          0.20
      0         0          0.75
      0         0          0.90

Put the most confident positive predictions at the top.

Put the most confident negative predictions at the bottom.

Calculate precision/recall at each break point/threshold.

Classify everything above the threshold as positive and everything else as negative.
precision/recall tradeoff

For example, at three of the break points (each precision/recall pair is computed treating that row and everything above it as positive):

    label   predicted   confidence   precision     recall
      1         1          0.80
      0         1          0.60      1/2 = 0.5     1/3 = 0.33
      1         1          0.55
      0         1          0.50
      1         0          0.20      3/5 = 0.6     3/3 = 1.0
      0         0          0.75
      0         0          0.90      3/7 = 0.43    3/3 = 1.0


precision/recall tradeoff

    label   predicted   confidence   precision   recall
      1         1          0.80         1.0        0.33
      0         1          0.60         0.5        0.33
      1         1          0.55         0.66       0.66
      0         1          0.50         0.5        0.66
      1         0          0.20         0.6        1.0
      0         0          0.75         0.5        1.0
      0         0          0.90         0.43       1.0
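The table above can be reproduced with a short Python sketch (an illustration, not part of the slides): given the examples already ranked from most to least confidently positive, treat the top k as positive at each break point and compute precision and recall.

    # true labels in ranked order (most confidently positive first)
    ranked_labels   = [1, 0, 1, 0, 1, 0, 0]
    total_positives = sum(ranked_labels)          # 3 positives in the test set

    true_positives = 0
    for k, label in enumerate(ranked_labels, start=1):
        true_positives += label                   # did we gain a correct positive?
        precision = true_positives / k            # k examples predicted positive
        recall    = true_positives / total_positives
        print(f"top {k}: precision = {precision:.2f}, recall = {recall:.2f}")

The printed values match the precision/recall columns in the table above (up to rounding).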


Area under the curve
Area under the curve (AUC) is one metric that encapsulates both precision and recall:

calculate the precision/recall values for all thresholdings of the test set (as we did before)

then calculate the area under the resulting precision/recall curve

it can also be calculated as the average precision over all of the recall points
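As a sketch of the last point (an illustration, not from the slides), the average-precision version can be computed from the threshold sweep above by averaging the precision values at the points where recall increases:

    precisions = [1.0, 0.5, 0.66, 0.5, 0.6, 0.5, 0.43]   # from the sweep above
    recalls    = [0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0]

    # keep precision only where recall goes up (i.e. where a true positive is added)
    ap_points = [p for p, r, prev in zip(precisions, recalls, [0.0] + recalls)
                 if r > prev]
    average_precision = sum(ap_points) / len(ap_points)
    print(average_precision)                              # (1.0 + 0.66 + 0.6) / 3 ≈ 0.75

Library routines such as scikit-learn’s average_precision_score compute essentially the same quantity directly from the labels and confidence scores.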
Area under the curve?

[Figure: two precision vs. recall curves with different shapes]

Any concerns/problems?
Area under the curve?

[Figure: two precision vs. recall curves compared]

For real use, we are often only interested in performance in a particular range.

Eventually, we need to deploy. How do we decide what threshold to use?
F1-measure

Most common is α = 0.5: equal balance/weighting between precision and recall (α and β are related by α = 1/(1 + β²), so α = 0.5 corresponds to β = 1):

    F = 1 / (α·(1/P) + (1 − α)·(1/R)) = ((β² + 1)·P·R) / (β²·P + R)

    F1 = 1 / (0.5·(1/P) + 0.5·(1/R)) = 2·P·R / (P + R)

where P is precision and R is recall.
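As a quick check (an illustration, not from the slides), plugging in the precision and recall from the earlier worked example gives:

    precision = 2 / 4
    recall    = 2 / 3

    f1 = 2 * precision * recall / (precision + recall)
    print(f1)    # 2 * 0.5 * 0.667 / (0.5 + 0.667) ≈ 0.57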
Evaluation summarized
Accuracy is often NOT an appropriate evaluation metric for imbalanced data problems.

precision/recall capture different characteristics of our classifier

AUC and F1 can be used as a single metric to compare algorithm variations (and to tune hyperparameters)
Black box approach
Abstraction: we have a generic binary classifier; how can we use it to solve our new problem?

[Diagram: input → binary classifier → +1 or −1 (optionally: also output a confidence/score)]

Can we do some pre-processing/post-processing of our data to allow us to still use our binary classifiers?
Idea 1: subsampling

Create a new training data set by:
- including all k “positive” examples
- randomly picking k “negative” examples

[Diagram: labeled data (99.997% not-phishing, 0.003% phishing) → subsampled data (50% not-phishing, 50% phishing)]

pros/cons?
Subsampling
Pros:
- Easy to implement
- Training becomes much more efficient (smaller training set)
- For some domains, can work very well

Cons:
- Throws away a lot of data/information
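A minimal sketch of subsampling in Python, assuming the training data is a list of (features, label) pairs with label 1 for the rare “positive” class (these names are illustrative, not from the slides):

    import random

    def subsample(examples, seed=0):
        """Keep all positives, randomly pick an equal number of negatives."""
        positives = [ex for ex in examples if ex[1] == 1]
        negatives = [ex for ex in examples if ex[1] == 0]
        rng = random.Random(seed)
        balanced = positives + rng.sample(negatives, k=len(positives))
        rng.shuffle(balanced)
        return balanced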
Idea 2: oversampling

Create a new training data set by:
- including all m “negative” examples
- including m “positive” examples:
  - repeat each example a fixed number of times, or
  - sample with replacement

[Diagram: labeled data (99.997% not-phishing, 0.003% phishing) → oversampled data (50% not-phishing, 50% phishing)]

pros/cons?
oversampling
Pros:
- Easy to implement
- Utilizes all of the training data
- Tends to perform well in a broader set of circumstances than subsampling

Cons:
- Computationally more expensive to train the classifier (much larger training set)
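A minimal sketch of oversampling by sampling with replacement, under the same assumptions as the subsampling sketch (a list of (features, label) pairs, label 1 for the rare class):

    import random

    def oversample(examples, seed=0):
        """Keep all negatives, draw an equal number of positives with replacement."""
        positives = [ex for ex in examples if ex[1] == 1]
        negatives = [ex for ex in examples if ex[1] == 0]
        rng = random.Random(seed)
        balanced = negatives + [rng.choice(positives) for _ in range(len(negatives))]
        rng.shuffle(balanced)
        return balanced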
Idea 2b: weighted examples

Add costs/weights to the training set:

- “negative” examples get weight 1
- “positive” examples get a much larger weight, e.g. 99.997/0.003 ≈ 33332
- change the learning algorithm to optimize weighted training error

[Diagram: labeled data (99.997% not-phishing with weight 1, 0.003% phishing with weight ≈ 33332)]

pros/cons?
weighted examples
Pros:
- Achieves the effect of oversampling without the computational cost
- Utilizes all of the training data
- Tends to perform well in a broader set of circumstances

Cons:
- Requires a classifier that can deal with weights

Of our three classifiers, can all be modified to handle weights?
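Many off-the-shelf classifiers already accept class or example weights. A minimal sketch using scikit-learn’s class_weight option (the library, classifier, and synthetic dataset are assumptions for illustration; the slides do not prescribe a particular implementation):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # a small synthetic imbalanced dataset, purely for illustration
    X, y = make_classification(n_samples=1000, weights=[0.99, 0.01], random_state=0)

    # "balanced" weights each class inversely to its frequency, which has the
    # same effect as giving the rare positives a large weight, as described above
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X, y)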


Idea 3: Sampling (SMOTE)

SMOTE: Synthetic Minority Over-sampling Technique

The minority class is over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining it to any/all of its k minority-class nearest neighbors.

- A statistical technique for increasing the number of cases in your dataset in a balanced way.
- It works by generating new instances from the existing minority cases that you supply as input.
- It does not change the number of majority cases.
- The new instances are not just copies of existing minority cases; nearest neighbors are used to generate the new examples, which makes the samples more general.
- SMOTE takes the entire dataset as input, but it increases the percentage of only the minority cases.
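A minimal sketch using the imbalanced-learn package’s SMOTE implementation (the package, synthetic dataset, and parameter values are assumptions for illustration; the slides describe the technique, not a specific library):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # a small synthetic imbalanced dataset, purely for illustration
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

    # k_neighbors controls how many minority-class nearest neighbors are used
    # when interpolating new synthetic minority examples
    smote = SMOTE(k_neighbors=5, random_state=0)
    X_resampled, y_resampled = smote.fit_resample(X, y)

    print(Counter(y))            # imbalanced, roughly {0: ~950, 1: ~50}
    print(Counter(y_resampled))  # minority class grown to match the majority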
                                         Class 0       Class 1       Total
    Original dataset                     570 (76%)     178 (24%)       748
    (equivalent to SMOTE percentage = 0)
    SMOTE percentage = 100               570 (62%)     356 (38%)       926
    SMOTE percentage = 200               570 (52%)     534 (48%)      1104
    SMOTE percentage = 300               570 (44%)     712 (56%)      1282
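For example, with 178 original minority cases, a SMOTE percentage of 200 generates 2 × 178 = 356 new synthetic cases, so the minority class grows to 178 + 356 = 534 while the 570 majority cases are untouched, exactly as in the table above.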
Warning

- Increasing the number of cases using SMOTE is not guaranteed to produce more accurate models.
- You should experiment with different percentages, different feature sets, and different numbers of nearest neighbors to see how adding cases influences your model.

Use the “Number of nearest neighbors” option to determine the size of the feature space that the SMOTE algorithm uses when building new cases.

- A nearest neighbor is a row of data (a case) that is very similar to some target case. The distance between any two cases is measured by combining the weighted vectors of all features.
- By increasing the number of nearest neighbors, you get features from more cases.
- By keeping the number of nearest neighbors low, you use features that are more like those in the original sample.
