M1 - Evaluating Predictive Performance
Xiaocheng Li
[email protected]
Content
6 Oversampling
7 Cross-Validation
Which Fit is “Right”?
[Figure: a scatter plot of data points (X1, Y1), (X2, Y2), (X3, Y3), … in the X-Y plane.]
A linear fit? A high-order polynomial fit? A quadratic fit?
[Figures: the candidate fits drawn through the same data points.]
Validation Sets
[Figures: each candidate fit shown side by side on the training data and on held-out validation data.]
The “Training Set-Validation Set” Approach
The “Training Set-Validation Set” Approach:
1 Split the available data into a training set and a validation set: depending on the amount of available data and the number of models to be compared, 50:50, 2/3:1/3, 75:25, …
2 Fit each model separately on the training set.
3 Evaluate each model separately on the validation set.
4 Choose the model that performs best on the validation set.
At the end, train the selected model again using all data!
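A minimal sketch of this approach in Python; the synthetic data, the 2/3:1/3 split, and the three polynomial candidate models are illustrative assumptions, not part of the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                 # made-up data
y = 2.0 + 0.5 * X[:, 0] + rng.normal(0, 1, 200)

# 1) Split into training and validation sets (here 2/3 : 1/3).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

# 2)-3) Fit each candidate model on the training set, evaluate it on the validation set.
candidates = {
    "linear":    make_pipeline(PolynomialFeatures(degree=1), LinearRegression()),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "degree 10": make_pipeline(PolynomialFeatures(degree=10), LinearRegression()),
}
val_mse = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_mse[name] = mean_squared_error(y_val, model.predict(X_val))

# 4) Choose the model with the lowest validation MSE ...
best = min(val_mse, key=val_mse.get)
print(val_mse, "->", best)

# ... and, at the end, re-train the selected model on all available data.
candidates[best].fit(X, y)
```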
The “Training Set-Validation Set” Approach
Source: James, Witten, Hastie, Tibshirani (2013), Figure 2.9.
[Figure: Left panel: data simulated from the true function f (shown in black), together with three estimates of f: a linear fit, a third-order fit, and a high-order fit. Right panel: training error and validation error versus model flexibility, with the underfitting region, the optimum, and the overfitting region marked.]
Can We Use the Validation Set to Predict Future Performance, too?
Source: James, Witten, Hastie, Tibshirani (2013), Figure 2.9.
[Figure: same figure as on the previous slide: training error and validation error versus model flexibility.]
Can We Use the Validation Set to Predict Future Performance, too?
Consider the following “thought experiment”:
1 Take 10 “one penny” coins.
2 Throw each coin 10 times; the percentage of heads is an estimator for the probability of “head shown”.
3 Take the minimum percentage across all 10 coins.
4 Throw the coin with the minimum percentage again 10 times.
Is the new percentage an unbiased estimator for the probability of the event “that coin will show head”?
[Figures: histograms of the number of heads (0-10) over many simulated repetitions; the final histogram is labelled “performance on test set”.]
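The answer can be checked by simulation. A minimal sketch, assuming all coins are fair (p = 0.5) and an arbitrary number of repetitions: it compares the head percentage of the selected “minimum” coin with the percentage obtained after re-throwing that same coin.

```python
import numpy as np

rng = np.random.default_rng(1)
n_reps, n_coins, n_throws = 100_000, 10, 10

# Throw 10 fair coins 10 times each, in every repetition.
heads = rng.binomial(n_throws, 0.5, size=(n_reps, n_coins)) / n_throws

# Percentage of the coin selected by "take the minimum": a biased estimate of 0.5.
min_pct = heads.min(axis=1)

# Re-throw the selected coin another 10 times: the coin is still fair, so this is unbiased.
new_pct = rng.binomial(n_throws, 0.5, size=n_reps) / n_throws

print("selected coin, selection percentage:", min_pct.mean())   # well below 0.5
print("selected coin, re-thrown percentage:", new_pct.mean())   # close to 0.5
```

The analogy: the validation score of the model that was selected *because* it looked best is optimistically biased, while a fresh, untouched data set gives an unbiased estimate.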
The “Training Set-Validation Set-Test Set” Approach
The “Training Set-Validation Set-Test Set” Approach:
Useful for selecting one of several models and obtaining an estimate of the resulting performance (model assessment):
1 Split the available data into a training set, a validation set and a test set: depending on the amount of available data and the number of models to be compared, 50:25:25 or 60:20:20.
2 Fit each model separately on the training set.
3 Evaluate each model separately on the validation set.
4 Choose the model that performs best on the validation set.
5 Estimate the performance of that model on the test set.
At the end, train the selected model again using all data!
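A minimal sketch of the three-way protocol, assuming a 60:20:20 split and two generic scikit-learn regressors as placeholder candidate models:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                          # made-up data
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 1, 500)

# 1) 60:20:20 split into training, validation and test sets.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 2)-4) Fit on the training set, compare on the validation set, pick the best model.
models = {"ols": LinearRegression(), "ridge": Ridge(alpha=10.0)}
for m in models.values():
    m.fit(X_tr, y_tr)
best = min(models, key=lambda k: mean_squared_error(y_val, models[k].predict(X_val)))

# 5) The test set is touched exactly once, to estimate the selected model's performance.
print(best, mean_squared_error(y_te, models[best].predict(X_te)))

# At the end, re-train the selected model on all of the data.
models[best].fit(X, y)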
The “Training Set-Validation Set-Test Set” Approach
You only get an unbiased estimate of a model’s performance on new data if you apply it to previously untouched data!
The Mathematics Behind the Bias-Variance Tradeoff
We want to understand the unknown relationship
Y = f(X_1, …, X_p) + ε.
The Mathematics Behind the Bias-Variance Tradeoff
Imagine we construct an estimate f̂ of f via a ML method. The expected squared error of f̂ for a new sample (Y^new, X^new) = (Y^new, X_1^new, …, X_p^new) is
E[ (Y^new - f̂(X^new))² ] = E[ (f(X^new) - f̂(X^new))² ] + Var[ε].
[Figures: six X-Y panels with example fits.]
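The decomposition can be illustrated numerically. A minimal sketch, assuming a specific true function f, Gaussian noise, and polynomial fits of varying degree (all illustrative choices); it uses the standard refinement of the identity above, in which E[(f - f̂)²] splits into squared bias plus variance:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)     # assumed "true" relationship
sigma = 0.3                     # noise standard deviation, so Var[eps] = sigma**2
x_new = 1.3                     # query point X^new

def fit_predict(degree):
    """Draw a fresh training set, fit a degree-d polynomial, predict at x_new."""
    x = rng.uniform(0, 3, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    coefs = np.polyfit(x, y, degree)
    return np.polyval(coefs, x_new)

for degree in (1, 3, 7):
    preds = np.array([fit_predict(degree) for _ in range(2000)])
    bias2 = (preds.mean() - f(x_new)) ** 2     # (E[f_hat(x_new)] - f(x_new))^2
    var = preds.var()                          # Var[f_hat(x_new)]
    # Expected squared error at x_new is approximately bias^2 + variance + Var[eps].
    print(f"degree {degree}: bias^2={bias2:.4f}  var={var:.4f}  "
          f"noise={sigma**2:.4f}  total={bias2 + var + sigma**2:.4f}")
```

Low-degree fits show high bias and low variance; high-degree fits show the reverse.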
The Mathematics Behind the Bias-Variance Tradeoff
Example 1: Unknown true relationship is of “medium complexity”.
Source: James, Witten, Hastie, Tibshirani (2013).
[Figure: Left panel: true function vs. linear fit vs. third-order fit vs. high-order fit. Right panel: training error and validation error versus flexibility, with the underfitting region, the optimum, and the overfitting region marked.]
The Mathematics Behind the Bias-Variance Tradeoff
Example 2: Unknown true relationship is of “low complexity”.
Source: James, Witten, Hastie, Tibshirani (2013).
[Figure: Left panel: true function vs. linear fit vs. second-order fit vs. high-order fit. Right panel: training error and validation error versus flexibility, with the underfitting region, the optimum, and the overfitting region marked.]
The Mathematics Behind the Bias-Variance Tradeoff
Example 3: Unknown true relationship is of “high complexity”.
Source: James, Witten, Hastie, Tibshirani (2013).
[Figure: Left panel: true function vs. linear fit vs. fifth-order fit vs. high-order fit. Right panel: training error and validation error (mean squared error) versus flexibility, with the underfitting and overfitting regions marked.]
Performance Measures for Regression Problems
Let the i-th validation error be e_i = Y_i - f̂(X_i1, …, X_ip), i = 1, …, n:
1 mean absolute error: (1/n) Σ_i |e_i|
2 average error: (1/n) Σ_i e_i
3 mean absolute percentage error: 100% · (1/n) Σ_i |e_i / Y_i|
4 root-mean-squared error: √( (1/n) Σ_i e_i² )
5 total sum of squared errors: Σ_i e_i²
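A minimal sketch of these five measures in Python with NumPy; the actual values and predictions below are made up for illustration:

```python
import numpy as np

y_val = np.array([10.0, 12.5, 8.0, 15.0, 11.0])    # actual values Y_i (made up)
y_hat = np.array([ 9.0, 13.0, 8.5, 12.0, 11.5])    # predictions f_hat(X_i)
e = y_val - y_hat                                   # validation errors e_i

mae  = np.mean(np.abs(e))                           # 1 mean absolute error
avg  = np.mean(e)                                   # 2 average error
mape = 100 * np.mean(np.abs(e / y_val))             # 3 mean absolute percentage error
rmse = np.sqrt(np.mean(e ** 2))                     # 4 root-mean-squared error
sse  = np.sum(e ** 2)                               # 5 total sum of squared errors

print(mae, avg, mape, rmse, sse)
```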
Performance Measures for Classification Problems
Consider the following confusion matrix:

                         Predicted Class
                         “yes”    “no”
    Actual Class “yes”    n11      n12
                 “no”     n21      n22

1 estimated misclassification rate (= total error rate): (n12 + n21) / (n11 + n12 + n21 + n22)
2 accuracy: 1 - estimated misclassification rate
3 sensitivity: n11 / (n11 + n12)
4 specificity: n22 / (n21 + n22)
(where “yes” is the important class)
Benchmark: the “majority predictor” (predict the majority class in the training data).
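A minimal sketch that computes these measures from the four confusion-matrix counts; the counts are made up, and the majority-class benchmark is approximated from the same counts rather than from the training data:

```python
# Confusion matrix counts (made-up example); "yes" is the important class.
n11, n12 = 30, 20    # actual "yes": predicted "yes" / predicted "no"
n21, n22 = 10, 940   # actual "no":  predicted "yes" / predicted "no"

total = n11 + n12 + n21 + n22
error_rate  = (n12 + n21) / total          # estimated misclassification rate
accuracy    = 1 - error_rate
sensitivity = n11 / (n11 + n12)            # fraction of actual "yes" records found
specificity = n22 / (n21 + n22)            # fraction of actual "no" records correctly rejected

# Benchmark: a majority predictor would always predict the more frequent class.
majority_accuracy = max(n11 + n12, n21 + n22) / total
print(error_rate, accuracy, sensitivity, specificity, majority_accuracy)
```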
Lift Charts for Ranking Problems
Example:
[Figure: lift chart with the cumulative # of important records on the y-axis and the # of all records considered on the x-axis. The random classifier is a straight line with slope (# of important records) / (# of all records); a typical classifier and the perfect classifier lie above it.]
Lift Charts for Ranking Problems
Example:
[Figure: lift chart for continuous outcomes with the cumulative sum of outcome values on the y-axis and the # of all records on the x-axis. The average classifier is a straight line with slope equal to the average outcome; a typical classifier and the perfect classifier lie above it.]
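A minimal sketch of how the first lift chart can be computed from ranked predictions; the labels and scores below are simulated placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, 1000)                  # 1 = "important" record (made up)
scores = y * 0.3 + rng.normal(0, 0.5, 1000)     # made-up scores, higher = more likely important

order = np.argsort(-scores)                     # rank records by decreasing score
cum_important = np.cumsum(y[order])             # lift curve of the (typical) classifier
n_records = np.arange(1, len(y) + 1)

random_line  = n_records * y.mean()             # random classifier: slope = #important / #all
perfect_line = np.minimum(n_records, y.sum())   # perfect classifier: all important records first

# e.g. how many important records are among the top 100 ranked records:
print(cum_important[99], random_line[99], perfect_line[99])
```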
Oversampling
[Figure: a data set with many “less important” records (e.g. law-abiding citizens) and few “important” records (e.g. tax fraudsters).]
Oversampling
With such an imbalanced class distribution, a “normal” classifier may not be suitable: it can score well overall while misclassifying most of the rare, important records.
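A minimal sketch of random oversampling of the rare class before training, using scikit-learn's resample utility; the data and class proportions are made up. Note that only the training set should be oversampled, never the validation or test set.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))
y_train = (rng.random(1000) < 0.02).astype(int)   # ~2% "important" records (made up)

X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
X_maj, y_maj = X_train[y_train == 0], y_train[y_train == 0]

# Duplicate minority-class records (sampling with replacement) until the classes are balanced.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))    # both classes are now equally frequent
```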
K-Fold Cross-Validation
1 Split the available data into K folds of (roughly) equal size.
2 For k = 1, …, K: fit each model on the other K - 1 folds and evaluate it on fold k.
3 Average the K validation errors for each model and choose the model with the best average.
At the end, train the selected model again using all data!
K-Fold Cross-Validation
This is “essentially” as good as training a model on 100% · (K - 1)/K of the data and evaluating it on 100% of the data!
K-Fold Cross-Validation
Experiment 1: Simple linear regression over 50 samples.
For different training set sizes K ∈ { 5, 10, …, 45 }, do:
1 Split the data into a training set of size K and a validation set of size N - K.
2 Training: the estimator is the regression fitted on the training set. Validation: the error is the MSE on the validation samples.
(The split is repeated many times for each K, so that both the average and the variation of the MSE can be assessed.)
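A minimal sketch that reproduces the spirit of Experiment 1; the data-generating process and the number of repeated splits are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
N = 50
X = rng.uniform(0, 10, size=(N, 1))                  # assumed data-generating process
y = 1.0 + 2.0 * X[:, 0] + rng.normal(0, 1, N)

for K in range(5, 50, 5):                            # training set sizes 5, 10, ..., 45
    mses = []
    for _ in range(500):                             # many random splits
        idx = rng.permutation(N)
        tr, val = idx[:K], idx[K:]                   # training set of size K, validation set of size N - K
        model = LinearRegression().fit(X[tr], y[tr])
        mses.append(mean_squared_error(y[val], model.predict(X[val])))
    print(f"K={K:2d}: mean MSE={np.mean(mses):.3f}, std={np.std(mses):.3f}")
```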
K-Fold Cross-Validation
Observations:
Training MSE: the average is initially low since the training data is “too simple”; the average converges and the MSE variation decreases with K.
Validation MSE: the average decreases with K; the MSE variation is initially high (too few training data) and ultimately high (too few validation data).
K-Fold Cross-Validation
Experiment 2: Simple linear regression over 50 samples.
1 Split the data into 10 folds (folds 1, 2, …, 10 of 5 samples each).
2 For each fold, fit the regression on the remaining 9 folds (45 samples) and compute the MSE on the held-out fold.
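A minimal sketch of Experiment 2 with scikit-learn's KFold, under the same assumed data-generating process as in the previous sketch:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
N = 50
X = rng.uniform(0, 10, size=(N, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(0, 1, N)

fold_mses = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # trained on 45 samples
    fold_mses.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

# Average over the 10 held-out folds; every sample is used for validation exactly once.
print(np.mean(fold_mses), np.std(fold_mses))
```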
K-Fold Cross-Validation
Observations:
The average MSE over the validation folds is comparable to the K = 45 case of Experiment 1.
The MSE variation is much smaller, however!