Mod 7: SMOTE (ML)
Setup
1. For 1 hour, Google collects 1M e-mails at random
2. They pay people to label them as "phishing" or "not-phishing"
3. They give the data to you to learn to classify e-mails as phishing or not
4. You, having taken ML, try out a few of your favorite classifiers
5. You achieve an accuracy of 99.997%
[Figure: the labeled data is 99.997% not-phishing and only 0.003% phishing]
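Why that accuracy is less impressive than it looks: with this class balance, a classifier that always answers "not-phishing" already scores 99.997%. A quick sketch (the counts are made up to match the percentages above):

```python
# Sketch: on 1M e-mails where only ~0.003% are phishing, a "classifier"
# that always predicts not-phishing is already ~99.997% accurate.
n_total = 1_000_000
n_phishing = round(n_total * 0.00003)        # 0.003% of the data, ~30 e-mails
n_not_phishing = n_total - n_phishing

# Predict "not-phishing" for every e-mail: every not-phishing e-mail is correct.
accuracy = n_not_phishing / n_total
print(f"always-negative baseline accuracy: {accuracy:.5f}")   # ~0.99997
```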
Decision trees:
  explicitly minimize training error
  when pruning, pick the "majority" label at the leaves
  tend to do very poorly on imbalanced problems
k-NN:
  even for small k, the majority class will tend to overwhelm the vote
Perceptron:
  can be reasonable, since it only updates when a mistake is made
  can take a long time to learn
"identification" tasks
View the task as trying to find/identify the "positive" examples (i.e., the rare events).
precision and recall

label  predicted
  0        0
  0        1
  1        0
  1        1
  0        1
  1        1
  0        0

precision = # correctly predicted as positive / # examples predicted as positive
recall = # correctly predicted as positive / # positive examples in test set

For this data: precision = 2/4, recall = 2/3
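As a check on the arithmetic, here is a small sketch in plain Python (the labels and predictions are taken from the table above, with 1 = positive):

```python
# label/predicted pairs from the table above
labels      = [0, 0, 1, 1, 0, 1, 0]
predictions = [0, 1, 0, 1, 1, 1, 0]

true_positives     = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
predicted_positive = sum(predictions)          # examples predicted as positive
actual_positive    = sum(labels)               # positive examples in the test set

precision = true_positives / predicted_positive
recall    = true_positives / actual_positive
print(f"precision = {true_positives}/{predicted_positive} = {precision:.2f}")  # 2/4 = 0.50
print(f"recall    = {true_positives}/{actual_positive} = {recall:.2f}")        # 2/3 = 0.67
```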
precision and recall

Why do we have both measures?
How can we maximize precision?
How can we maximize recall?
Maximizing precision

label  predicted
  0        0
  0        0
  1        0
  1        0
  0        0
  1        0
  0        0

Don't predict anything as positive!
(No false positives are possible, but recall drops to 0.)
Maximizing recall

label  predicted
  0        1
  0        1
  1        1
  1        1
  0        1
  1        1
  0        1

Predict everything as positive!
(Every positive example is found, so recall = 1, but precision falls to the fraction of positives in the data.)
precision vs. recall
Often there is a tradeoff between precision and recall
precision/recall tradeoff

label  predicted  confidence
  0        0        0.90
  1        1        0.80
  0        1        0.60
  1        1        0.55
  0        1        0.50
  1        0        0.20
  0        0        0.75

Sort the examples: put the most confident positive predictions at the top and the most confident negative predictions at the bottom.

label  predicted  confidence
  1        1        0.80
  0        1        0.60
  1        1        0.55
  0        1        0.50
  1        0        0.20
  0        0        0.75
  0        0        0.90

For each cutoff point in the sorted list, treat everything above the cutoff as predicted positive and compute the resulting precision and recall.
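A sketch of that sweep in plain Python (the true labels are taken from the sorted table above, top to bottom):

```python
# True labels in the sorted order from the slide: most confident positive
# predictions at the top, most confident negative predictions at the bottom.
sorted_labels = [1, 0, 1, 0, 1, 0, 0]
total_positives = sum(sorted_labels)

# At cutoff k, everything above the cutoff is treated as predicted positive.
for k in range(1, len(sorted_labels) + 1):
    true_positives = sum(sorted_labels[:k])
    precision = true_positives / k
    recall = true_positives / total_positives
    print(f"cutoff={k}: precision={precision:.2f}, recall={recall:.2f}")
```

Sweeping the cutoff this way traces out the precision/recall curve: recall only goes up as the cutoff moves down, while precision bounces around and generally falls.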
[Figure: the resulting precision/recall curve, precision on the y-axis and recall on the x-axis]

Any concerns/problems?
Area under the curve?

[Figure: comparing two precision/recall curves]
F1 (the harmonic mean of precision and recall):

F1 = 1 / (0.5 · (1/P) + 0.5 · (1/R)) = 2PR / (P + R)
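For the earlier worked example (precision = 2/4, recall = 2/3), a one-line check:

```python
# F1 for the running example: precision = 2/4, recall = 2/3
precision, recall = 2 / 4, 2 / 3
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")   # 0.571
```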
Evaluation summarized

Accuracy is often NOT an appropriate evaluation metric for imbalanced data problems.

[Diagram: a binary classifier outputs +1 or -1 and, optionally, also a confidence/score]
Idea 1: subsampling

Create a new training data set by keeping all of the "positive" (phishing) examples and only a random subsample of the "negative" examples.

[Figure: the new training set is 50% not-phishing / 50% phishing, instead of 0.003% phishing]

pros/cons?
Subsampling

Pros:
  Easy to implement
  Training becomes much more efficient (smaller training set)
  For some domains, can work very well
Cons:
  Throwing away a lot of data/information
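A minimal sketch of the idea (numpy; the names X, y and the convention that positives are labeled 1 are mine): keep every positive example and an equal-sized random subsample of the negatives.

```python
import numpy as np

def subsample(X, y, seed=0):
    """Balance the training set by keeping all positive examples and an
    equal-sized random subsample of the negative examples (rest discarded)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    rng.shuffle(keep)
    return X[keep], y[keep]
```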
Idea 2: oversampling

Create a new training data set by:
- including all m "negative" examples
- including m "positive" examples:
  - repeat each positive example a fixed number of times, or
  - sample the positive examples with replacement

[Figure: the labeled data is 99.997% not-phishing / 0.003% phishing; the oversampled training set is 50% not-phishing / 50% phishing]

pros/cons?
oversampling

Pros:
  Easy to implement
  Utilizes all of the training data
Cons:
  Computationally expensive to train classifier
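A matching sketch of oversampling (same assumed names as the subsampling sketch): keep all negatives and sample the positives with replacement until the classes match.

```python
import numpy as np

def oversample(X, y, seed=0):
    """Balance the training set by keeping all negative examples and
    sampling the positive examples with replacement until there are as
    many positives as negatives."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    resampled_pos = rng.choice(pos_idx, size=len(neg_idx), replace=True)
    keep = np.concatenate([neg_idx, resampled_pos])
    rng.shuffle(keep)
    return X[keep], y[keep]
```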
Idea 2b: weighted examples

Add costs/weights to the training set:
  "negative" examples get weight 1
  "positive" examples get weight 99.997/0.003 ≈ 33332

pros/cons?
weighted examples

Pros:
  Achieves the effect of oversampling without the computational cost
  Utilizes all of the training data
Cons:
  Requires a classifier that can deal with weights
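One way this looks in practice, as a sketch: scikit-learn's LogisticRegression is my stand-in for "a classifier that can deal with weights" (many scikit-learn estimators accept per-example weights via the sample_weight argument of fit).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted(X, y):
    """Give every negative example weight 1 and every positive example a
    weight equal to the negative/positive ratio (e.g. 99.997/0.003 ~ 33332),
    so the rare class counts as much as the common one during training."""
    n_neg = np.sum(y == 0)
    n_pos = np.sum(y == 1)
    weights = np.where(y == 1, n_neg / n_pos, 1.0)
    clf = LogisticRegression()
    return clf.fit(X, y, sample_weight=weights)
```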
SMOTE: resulting class distribution

SMOTE percentage = 200: 570 negative, 534 positive (original + synthetic), 1104 total → 52% / 48%
SMOTE percentage = 300: 570 negative, 712 positive (original + synthetic), 1282 total → 44% / 56%
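Unlike plain oversampling, SMOTE creates new synthetic positives rather than repeating existing ones: for each positive example it picks one of its k nearest positive-class neighbors and interpolates a random point on the segment between them. A from-scratch sketch of that idea (numpy; smote_percentage follows the table's convention, so 200 means two synthetic examples per original positive):

```python
import numpy as np

def smote(X_pos, smote_percentage=200, k=5, seed=0):
    """Generate synthetic positive examples by interpolating between each
    positive example and a randomly chosen one of its k nearest positive
    neighbors. Returns only the synthetic examples."""
    rng = np.random.default_rng(seed)
    n_new_per_example = smote_percentage // 100
    synthetic = []
    for i, x in enumerate(X_pos):
        dists = np.linalg.norm(X_pos - x, axis=1)   # distance to every positive
        dists[i] = np.inf                           # exclude the example itself
        neighbors = np.argsort(dists)[:k]           # indices of k nearest positives
        for _ in range(n_new_per_example):
            nn = X_pos[rng.choice(neighbors)]
            gap = rng.random()                      # random point on the segment x -> nn
            synthetic.append(x + gap * (nn - x))
    return np.array(synthetic)
```

For example, starting from 178 original positives (a count consistent with the table above), smote_percentage = 200 adds 356 synthetic positives for 534 in total. The imbalanced-learn library provides a ready-made implementation (imblearn.over_sampling.SMOTE) if you'd rather not roll your own.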
Warning