07 - Evaluating Performance
References
Lantz: Machine Learning with R

Content
5 Oversampling
6 Cross-Validation
Reading: James et al., §2.2
Wikipedia (Bias-Variance Tradeoff)
Which Fit is “Right”?
[Figure: sample data points (X1, Y1), (X2, Y2), (X3, Y3) plotted as Y against X, together with three candidate fits: a linear fit, a high-order polynomial fit, and a quadratic fit]
A linear fit seems too crude: underfitting (bias).
The high-order fit responds to noise: overfitting (variance).
The quadratic fit feels “about right”.
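The same point can be checked numerically. The sketch below is not from the slides; the quadratic data-generating function, the noise level, and the chosen degrees are illustrative assumptions. It fits polynomials of degree 1, 2, and 10 and shows that the training error alone cannot reveal which fit is “right”.

```python
# Illustrative sketch: training error alone cannot detect overfitting.
# The quadratic "truth" and the noise level below are made-up assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 2 + 3 * x - 4 * x**2 + rng.normal(scale=0.3, size=x.size)  # noisy quadratic data

for degree in (1, 2, 10):                  # linear, quadratic, high-order polynomial
    coeffs = np.polyfit(x, y, deg=degree)  # least-squares polynomial fit
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree:2d}: training MSE = {mse:.3f}")
# The degree-10 fit has the smallest training MSE even though it mostly
# models noise -- judging fits requires data the model has not seen.
```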
Validation Sets
[Figure: a sequence of Y-versus-X panels illustrating the validation-set idea]
The “Training Set-Validation Set” Approach
1 Split the available data into a training set and a validation set: depending on the amount of available data and the number of models to be compared, 50:50, 2/3:1/3, 75:25, …
At the end, train the selected model again using all data!
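As a concrete illustration, here is a minimal sketch of the approach with scikit-learn; the synthetic data, the 2/3:1/3 split, and the linear model are assumptions made for the example, not part of the slides.

```python
# Training-set / validation-set approach, sketched with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 1))               # stand-in feature matrix
y = 1 + 2 * X[:, 0] + rng.normal(scale=0.2, size=60)  # stand-in target

# Step 1: split the available data, here 2/3 : 1/3.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

# Fit on the training set only, then judge the fit on the untouched validation set.
model = LinearRegression().fit(X_train, y_train)
print("validation MSE:", mean_squared_error(y_val, model.predict(X_val)))

# At the end: retrain the selected model on all available data.
final_model = LinearRegression().fit(X, y)
```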
The “Training Set-Validation Set” Approach
[Figure 2.9 from James, Witten, Hastie, Tibshirani (2013), §2.2 “Assessing Model Accuracy”: left panel shows data simulated from the true function f together with a linear, a third-order, and a high-order fit; right panel shows training and validation error against model flexibility, with underfitting on the left, overfitting on the right, and the optimum in between]
Can We Use the Validation Set to Predict Future Performance, too?
[Figure 2.9 from James, Witten, Hastie, Tibshirani (2013) again: training and validation error against model flexibility]
The “Training Set-Validation Set-Test Set” Approach
Useful for selecting one of several models and obtaining an estimate of the resulting performance (model assessment):
1 Split the available data into a training set, a validation set and a test set: depending on the amount of available data and the number of models to be compared, 50:25:25 or 60:20:20.
2 Fit each model separately on the training set.
3 Evaluate each model separately on the validation set.
4 Choose the model that performs best on the validation set.
5 Estimate the performance of that model on the test set.
At the end, train the selected model again using all data!
The “Training Set-Validation Set-Test Set” Approach
You only get an unbiased estimate of a model’s performance on new data if you apply it to previously untouched data!
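A hedged sketch of the full workflow follows; the candidate models (polynomial regressions of different degree), the synthetic data, and the 60:20:20 split are illustrative assumptions.

```python
# Training / validation / test workflow, sketched with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = 1 + 2 * X[:, 0] - 3 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

# 1: split 60:20:20 -- peel off the test set first, then split the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# 2-4: fit every candidate on the training set, evaluate it on the validation set,
# and keep the candidate with the smallest validation error.
candidates = {d: make_pipeline(PolynomialFeatures(d), LinearRegression()) for d in (1, 2, 10)}
val_mse = {d: mean_squared_error(y_val, m.fit(X_train, y_train).predict(X_val))
           for d, m in candidates.items()}
best = min(val_mse, key=val_mse.get)

# 5: the test set was never touched during selection, so it yields an
# (approximately) unbiased estimate of the chosen model's future performance.
print("chosen degree:", best, "test MSE:", mean_squared_error(y_test, candidates[best].predict(X_test)))

# At the end, train the selected model again using all data.
final_model = make_pipeline(PolynomialFeatures(best), LinearRegression()).fit(X, y)
```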
Content
5 Oversampling
6 Cross-Validation
Reading: James et al., §2.2
Wikipedia (Bias-Variance Tradeoff)
fi
The Bias-Variance Tradeoff (Advanced & Definitely Not Tested)
Example 1: Unknown true relationship is of “medium complexity”.
[Figure 2.9 from James, Witten, Hastie, Tibshirani (2013), §2.2: true function vs. linear fit vs. third-order fit vs. high-order fit (left); training and validation error against flexibility, with underfitting, the optimum, and overfitting marked (right)]
The Bias-Variance Tradeoff (Advanced & Definitely Not Tested)
Example 2: Unknown true relationship is of “low complexity”.
[Figure from James, Witten, Hastie, Tibshirani (2013), §2.2: true function vs. linear fit vs. second-order fit vs. high-order fit (left); training and validation error against flexibility, with under- and overfitting marked (right)]
The Bias-Variance Tradeoff (Advanced & Definitely Not Tested)
Example 3: Unknown true relationship is of “high complexity”.
[Figure from James, Witten, Hastie, Tibshirani (2013), ch. 2: true function vs. linear fit vs. fifth-order fit vs. high-order fit (left); training and validation mean squared error against flexibility, with under- and overfitting marked (right)]
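The U-shape in these figures is easy to reproduce in a few lines. The sketch below uses synthetic data, a sine curve as the unknown truth, and polynomial degree as the flexibility axis; all of these are assumptions for illustration, not the book’s exact setup.

```python
# Qualitative reproduction of the bias-variance picture: training error falls
# monotonically with flexibility, validation error is U-shaped.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=100)  # "medium complexity" truth

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
for degree in range(1, 11):                 # flexibility = polynomial degree
    m = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree {degree:2d}: "
          f"training MSE {mean_squared_error(y_tr, m.predict(X_tr)):.3f}, "
          f"validation MSE {mean_squared_error(y_val, m.predict(X_val)):.3f}")
```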
Content
5 Oversampling
6 Cross-Validation
Reading: Shmueli et al., §5.2
Performance Measures for Regression Problems
Let the $i$-th validation error be $e_i = Y_i - \hat{f}(X_{i1}, \ldots, X_{ip})$, $i = 1, \ldots, n$:
1 mean absolute error: $\frac{1}{n} \sum_i |e_i|$
2 average error: $\frac{1}{n} \sum_i e_i$
3 mean absolute percentage error: $100\% \cdot \frac{1}{n} \sum_i \left| \frac{e_i}{y_i} \right|$
4 root-mean-squared error: $\sqrt{\frac{1}{n} \sum_i e_i^2}$
5 total sum of squared errors: $\sum_i e_i^2$
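For concreteness, a small numpy sketch computing all five measures from a made-up vector of actual and predicted values:

```python
# The five regression performance measures, computed from made-up numbers.
import numpy as np

y_actual = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
y_pred   = np.array([11.0, 11.5, 8.0, 14.0, 12.5])
e = y_actual - y_pred                       # validation errors e_i

mae   = np.mean(np.abs(e))                  # 1 mean absolute error
avg_e = np.mean(e)                          # 2 average error (signed; reveals systematic bias)
mape  = 100 * np.mean(np.abs(e / y_actual)) # 3 mean absolute percentage error
rmse  = np.sqrt(np.mean(e ** 2))            # 4 root-mean-squared error
sse   = np.sum(e ** 2)                      # 5 total sum of squared errors

print(mae, avg_e, mape, rmse, sse)
```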
Content
5 Oversampling
6 Cross-Validation
Reading: Shmueli et al., §5.3
Performance Measures for Classification Problems
Consider the following confusion matrix:

                         Predicted Class
                         “yes”     “no”
Actual Class   “yes”      n11       n12
               “no”       n21       n22

1 estimated misclassification rate (= total error rate): $\frac{n_{12} + n_{21}}{n_{11} + n_{12} + n_{21} + n_{22}}$
2 accuracy: 1 − estimated misclassification rate
3 sensitivity: $\frac{n_{11}}{n_{11} + n_{12}}$
4 specificity: $\frac{n_{22}}{n_{21} + n_{22}}$
(3 and 4 as given assume “yes” is the important class.)
Benchmark: the “majority predictor” (majority class in training data)
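A short sketch with made-up counts, using the matrix layout above:

```python
# Classification measures from a 2x2 confusion matrix (counts are made up).
n11, n12 = 80, 20     # actual "yes": predicted "yes" / predicted "no"
n21, n22 = 30, 870    # actual "no":  predicted "yes" / predicted "no"
total = n11 + n12 + n21 + n22

error_rate  = (n12 + n21) / total       # 1 estimated misclassification rate
accuracy    = 1 - error_rate            # 2 accuracy
sensitivity = n11 / (n11 + n12)         # 3 share of actual "yes" cases caught
specificity = n22 / (n21 + n22)         # 4 share of actual "no" cases left alone

# Benchmark: the majority predictor answers "no" for everyone here
# (900 of 1000 cases are "no"), giving 0.90 accuracy without any model.
print(error_rate, accuracy, sensitivity, specificity)
```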
Content
5 Oversampling
6 Cross-Validation
Reading: Shmueli et al., §5.5
Oversampling
[Figure: a data set with many “less important” cases (e.g. law-abiding citizens) and only a few “important” cases (e.g. tax fraudsters)]
For such imbalanced data, a “normal” classifier may not be suitable: it can look accurate overall while misclassifying most of the rare, important cases.
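One common remedy, sketched below, is to oversample the rare class in the training data by drawing its cases with replacement until the classes are balanced; the arrays, labels, and balancing ratio here are illustrative assumptions, and the slides do not prescribe this specific scheme.

```python
# Oversampling the rare ("important") class before training a classifier.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                    # stand-in features
y = (rng.random(1000) < 0.05).astype(int)         # ~5% rare "important" cases (label 1)

rare_idx   = np.flatnonzero(y == 1)
common_idx = np.flatnonzero(y == 0)
boost_idx  = rng.choice(rare_idx, size=common_idx.size, replace=True)  # duplicate rare cases

X_balanced = np.vstack([X[common_idx], X[boost_idx]])
y_balanced = np.concatenate([y[common_idx], y[boost_idx]])
print(np.bincount(y_balanced))                    # roughly balanced class counts

# Oversample only the training data; validation and test data stay untouched,
# otherwise the performance estimates become overly optimistic.
```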
Content
5 Oversampling
6 Cross-Validation
Reading: James et al., §5.1
K-Fold Cross-Validation
At the end, train the selected model again using all data!
This is “essentially” as good as training a model on $100\% \cdot \frac{K-1}{K}$ of the data and evaluating it on 100% of the data!
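A minimal sketch of the procedure with K = 5, using scikit-learn’s KFold; the data and the linear model are illustrative stand-ins. Each fold serves once as the validation set while the remaining K − 1 folds form the training set, and the K fold errors are averaged.

```python
# K-fold cross-validation (K = 5), sketched with scikit-learn.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 1))
y = 1 + 2 * X[:, 0] + rng.normal(scale=0.2, size=50)

fold_mse = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])                   # train on K-1 folds
    fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))   # error on held-out fold

print(f"CV estimate of MSE: {np.mean(fold_mse):.3f} (spread across folds: {np.std(fold_mse):.3f})")

# At the end, train the selected model again using all data.
final_model = LinearRegression().fit(X, y)
```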
K-Fold Cross-Validation
Experiment 1: Simple linear regression over 50 samples.
For different training set sizes M ∈ {5, 10, …, 45}, do:
1 Split the data into a training set of size M and a validation set of size N − M.
2 Fit the regression estimator on the training set; record the error as the MSE over the validation samples.
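A sketch of this experiment follows; the data-generating model, the noise level, and the number of random repetitions per M are assumptions made for illustration.

```python
# Experiment 1: how the validation MSE behaves as the training-set size M varies.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
N = 50
X = rng.uniform(0.0, 1.0, size=(N, 1))
y = 1 + 2 * X[:, 0] + rng.normal(scale=0.3, size=N)

for M in range(5, 50, 5):                       # training-set sizes 5, 10, ..., 45
    mses = []
    for _ in range(200):                        # repeat the random split to see the spread
        idx = rng.permutation(N)
        tr, val = idx[:M], idx[M:]              # training set of size M, validation set of size N - M
        model = LinearRegression().fit(X[tr], y[tr])
        mses.append(mean_squared_error(y[val], model.predict(X[val])))
    print(f"M = {M:2d}: mean validation MSE {np.mean(mses):.3f}, std {np.std(mses):.3f}")
```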
K-Fold Cross-Validation
Observations:
average MSE initially low since training data is “too simple”
average MSE converges and MSE variation decreases with M
average MSE decreases with M
MSE variation initially high (too few training data) and ultimately high (too few validation data)
K-Fold Cross-Validation
Observations:
average MSE over the validation set comparable to M = 45
MSE variation is much smaller, however!