CH-5 ML
Model Evaluation
Basic Concepts
Binary classification
Accuracy
The confusion matrix splits predictions into true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN)

accuracy = (TP + TN) / (P + N)

where P = TP + FN are the actual positives and N = TN + FP the actual negatives
• Imagine using the classifier to identify positive cases (e.g., for information retrieval)

precision = TP / (TP + FP)        recall = TP / (TP + FN)

Precision is the probability that a randomly selected result is relevant; recall is the probability that a randomly selected relevant document is retrieved
Binary classification
Problems
• For strongly unbalanced datasets (typically many more negatives than positives) accuracy is not informative:
  • Predictions are dominated by the larger class
  • Predicting everything as negative often maximizes accuracy
• One possibility is to rebalance costs (e.g. a single positive counts as N/P, where N = TN + FP and P = TP + FN)
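A minimal sketch of this rebalancing on a toy confusion matrix (the counts below are made up for illustration):

```python
# Sketch: plain vs. cost-rebalanced accuracy on a strongly unbalanced dataset.
# Toy counts: 990 negatives, 10 positives, and a classifier that predicts
# everything as negative.
TP, FN, FP, TN = 0, 10, 0, 990

P = TP + FN   # actual positives
N = TN + FP   # actual negatives

accuracy = (TP + TN) / (P + N)                        # 0.99: looks great, says nothing
# Rebalanced: each positive counts N/P times, so both classes weigh the same
rebalanced = (TP * (N / P) + TN) / (P * (N / P) + N)  # 0.5: reveals the useless classifier

print(accuracy, rebalanced)
```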
Precision (recap)
• It is the fraction of actual positives among the examples predicted as positive
• It measures how precise the learner is when it predicts the positive class

F1 measure
• The harmonic mean of precision and recall:

F1 = 2 (Pre ∗ Rec) / (Pre + Rec)
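As a sketch, all four measures can be computed directly from the confusion-matrix counts (the function name and the example counts below are illustrative):

```python
def binary_metrics(TP, FP, FN, TN):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Example: 80 TP, 20 FP, 10 FN, 890 TN
print(binary_metrics(TP=80, FP=20, FN=10, TN=890))
```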
Precision-recall curve
Classifiers often provide a confidence in the prediction (e.g. the margin of an SVM)
A hard decision is made by setting a threshold on the classifier output (zero for an SVM)
Acc, Pre, Rec, F1 all measure the performance of a classifier at a specific threshold
It is possible to change the threshold if interested in maximizing a specific performance measure (e.g. recall)
Precision-recall curve
By varying the threshold from its minimum to its maximum possible value, we obtain a curve of performance measures
This curve can be shown by plotting one measure (recall) against the complementary one (precision)
It is possible to investigate the performance of the classifier over the whole range of thresholds
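A minimal sketch of tracing such a curve with scikit-learn (the synthetic data and the linear SVM are placeholder choices):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LinearSVC().fit(X_tr, y_tr)
scores = clf.decision_function(X_te)   # SVM margins used as confidence values

# One (precision, recall) point for each threshold on the margin
precision, recall, thresholds = precision_recall_curve(y_te, scores)

plt.plot(recall, precision)
plt.xlabel("recall")
plt.ylabel("precision")
plt.show()
```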
Multiclass classification
Acc, Pre, Rec, F1 carry over to per-class measures, treating examples from the other classes as negatives.
E.g.:

Pre_i = n_ii / (n_ii + FP_i)        Rec_i = n_ii / (n_ii + FN_i)

Multiclass accuracy is the overall fraction of correctly classified examples:

accuracy = Σ_i n_ii / Σ_ij n_ij
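A sketch of the per-class measures computed from a multiclass confusion matrix n, assuming n[i, j] counts examples of true class i predicted as class j (the toy counts are illustrative):

```python
import numpy as np

def per_class_metrics(n):
    """Per-class precision/recall and overall accuracy from confusion matrix n."""
    n = np.asarray(n, dtype=float)
    diag = np.diag(n)                 # n_ii: correctly classified per class
    precision = diag / n.sum(axis=0)  # n_ii / (n_ii + FP_i), column-wise
    recall = diag / n.sum(axis=1)     # n_ii / (n_ii + FN_i), row-wise
    accuracy = diag.sum() / n.sum()   # overall fraction of correct predictions
    return precision, recall, accuracy

# Toy 3-class confusion matrix
conf = [[50, 2, 3],
        [4, 40, 6],
        [1, 5, 44]]
print(per_class_metrics(conf))
```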
Training Data and Test Data
Simple Decision Boundary
[Figure: a simple decision boundary in the Feature 1 / Feature 2 plane]

More Complex Decision Boundary
[Figure: a more complex decision boundary in the Feature 1 / Feature 2 plane]
Example: The Overfitting Phenomenon
A Complex Model
Y = high-order polynomial in X
The True (simpler) Model
Y = aX + b + noise
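A hedged sketch of the phenomenon on synthetic data (degree, noise level and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.shape)  # true model: Y = aX + b + noise

simple = np.polyfit(x, y, deg=1)     # matches the form of the true model
complex_ = np.polyfit(x, y, deg=15)  # high-order polynomial: also fits the noise

x_new = np.linspace(0, 1, 200)
pred_simple = np.polyval(simple, x_new)
pred_complex = np.polyval(complex_, x_new)  # oscillates between training points
```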
How Overfitting Affects Prediction
[Figure: error on training data as a function of model complexity, with the ideal range for model complexity marked]
Comparing Classifiers
Training and Test Data
Performance estimation
Hold-out procedure
Computing the performance measure on the training set would be optimistically biased
Need to retain an independent set on which to compute performance:
  validation set: used to estimate the performance of different algorithmic settings (i.e. hyperparameters)
  test set: used to estimate the final performance of the selected model
E.g.: split the dataset into 40%/30%/30% for training, validation and testing
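A sketch of such a 40%/30%/30% split with scikit-learn (the dataset here is a synthetic placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# 40% training, then split the remaining 60% evenly into validation and test (30%/30%)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```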
Problem
The hold-out procedure depends on the specific test (and validation) set chosen (especially for small datasets)
K-Fold Cross-Validation
S_i = S_{D_i}[L(T_i)]   (the score on held-out fold D_i of the model learned on the remaining training data T_i)

[Figure: the data split into k folds, each fold used once as test data and the remaining folds as training data]

Summary statistics are computed over the k test performances
Optimizing Model Parameters
[Figure: the data split into training data and a validation set across folds]
• Cross-validation generates an approximate estimate of how well the classifier will do on “unseen” data
  – As k → n, the model becomes more accurate (more training data)
  – ...but CV becomes more computationally expensive
  – Choosing k < n is a compromise
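A minimal k-fold cross-validation sketch (k = 5, the model and the synthetic data are arbitrary illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Train on k-1 folds, test on the held-out fold, repeat for each of the k folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # summary statistics over the k test performances
```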
Hypothesis testing
• We want to compare the generalization performance of two learning algorithms
• We want to know whether an observed difference in performance is statistically significant (and not due to a noisy evaluation)
• Hypothesis testing allows us to test the statistical significance of a hypothesis (e.g. that the two predictors have different performance)
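As a hedged sketch, a paired t-test on per-fold scores of two classifiers might look as follows (the two models, the synthetic data and the plain 10-fold setup are assumptions for illustration, not the slides' own example):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# The same folds are used for both learners, so the per-fold scores are paired
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

t_stat, p_value = ttest_rel(scores_a, scores_b)
print(t_stat, p_value)  # a small p-value suggests the difference is not just noise
```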
Hypothesis testing (t-test)
Example
Learning Curve
Building Learning Curves
[Figure: k-fold CV split into training data and test data, used to estimate performance at each training-set size]
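A sketch of how a learning curve might be built with scikit-learn's learning_curve utility (training sizes, model and data are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, random_state=0)

# For each training-set size, run k-fold CV and record train/test scores
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(sizes)                     # absolute training-set sizes used
print(test_scores.mean(axis=1))  # test performance grows with more training data
```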