
Module 1:

Evaluating Predictive
Performance
Xiaocheng Li
[email protected]

1 MSc Business Analytics 2020/21


Content

1 Training Sets, Validation Sets and Test Sets

2 The Mathematics of the Bias-Variance Tradeoff

3 Evaluating Performance of Regression Problems

4 Evaluating Performance of Classification Problems

5 Evaluating Performance of Ranking Problems

6 Oversampling

7 Cross-Validation

2
Content

1 Training Sets, Validation Sets and Test Sets

2 The Mathematics of the Bias-Variance Tradeoff

3 Evaluating Performance of Regression Problems

4 Evaluating Performance of Classification Problems

5 Evaluating Performance of Ranking Problems

6 Oversampling

7 Cross-Validation

3
Which Fit is “Right”?

Imagine you want to understand the unknown relationship


Y = f(X1, …, Xp) + ε

from training data (Yi, Xi1, …, Xip), i = 1, …, n:

[Scatter plot of the training data points (X1, Y1), (X2, Y2), (X3, Y3), … in the (X, Y) plane]
4
Which Fit is “Right”?

Imagine you want to understand the unknown relationship


Y = f(X1, …, Xp) + ε

from training data (Yi, Xi1, …, Xip), i = 1, …, n:

[Plot: the training data points with a linear fit?]
5
Which Fit is “Right”?

Imagine you want to understand the unknown relationship


Y = f(X1, …, Xp) + ε

from training data (Yi, Xi1, …, Xip), i = 1, …, n:

[Plot: the training data points with a high-order polynomial fit?]
6
Which Fit is “Right”?

Imagine you want to understand the unknown relationship


Y = f(X1, …, Xp) + ε

from training data (Yi, Xi1, …, Xip), i = 1, …, n:

[Plot: the training data points with a quadratic fit?]
7
Which Fit is “Right”?

Imagine you want to understand the unknown relationship


Y = f(X1, …, Xp) + ε

from training data (Yi, Xi1, …, Xip), i = 1, …, n:

The black curve represents the relationship f of interest.

[Plot: the training data points together with the true function f]
8
Which Fit is “Right”?

Imagine you want to understand the unknown relationship


Y = f(X1, …, Xp) + ε

from training data (Yi, Xi1, …, Xip), i = 1, …, n:

The black curve represents the relationship f of interest.

A linear fit seems too crude: underfitting (bias).

[Plot: the training data points, the true function f and a linear fit]
9
Which Fit is “Right”?

Imagine you want to understand the unknown relationship


Y = f(X1, …, Xp) + ε

from training data (Yi, Xi1, …, Xip), i = 1, …, n:

The black curve represents the relationship f of interest.

The high-order fit responds to noise: overfitting (variance).

[Plot: the training data points, the true function f and a high-order polynomial fit]
10
Which Fit is “Right”?

Imagine you want to understand the unknown relationship


Y = f(X1, …, Xp) + ε

from training data (Yi, Xi1, …, Xip), i = 1, …, n:

The black curve represents the relationship f of interest.

The quadratic fit feels “about right”.

[Plot: the training data points, the true function f and a quadratic fit]
11
Which Fit is “Right”?

Imagine you want to understand the unknown relationship


Y = f(X1, …, Xp) + ε

from training data (Yi, Xi1, …, Xip), i = 1, …, n:

The black curve represents the relationship f of interest.
The quadratic fit feels “about right”.

How can we rigorously judge whether a fit is “right”, given that we do not know f?

[Plot: the training data points, the true function f and a quadratic fit]
12
Validation Sets

Split your data — 50% for training, 50% for “validation”:

[Scatter plot of all available data points in the (X, Y) plane]
13
Validation Sets

Split your data — 50% for training, 50% for “validation”:

[Scatter plot with the points marked either as training data points or as validation data points]
14
Validation Sets

Split your data — 50% for training, 50% for “validation”:

[Two plots side by side: the training data (left) and the validation data (right)]
15
Validation Sets

Split your data — 50% for training, 50% for “validation”:

[Two plots side by side: a candidate model fitted on the training data (left) and compared against the validation data (right)]
16
Validation Sets

Split your data — 50% for training, 50% for “validation”:

[Two plots side by side: another candidate model fitted on the training data (left) and compared against the validation data (right)]
17
Validation Sets

Split your data — 50% for training, 50% for “validation”:

[Two plots side by side: another candidate model fitted on the training data (left) and compared against the validation data (right)]
18
The “Training Set-Validation Set”
Approach
The “Training Set-Validation Set” Approach:

Useful for selecting one of several models (model selection):

1 Split the available data into a training set and a validation set:
depending on the amount of available data and the
number of models to be compared, 50:50, 2/3:1/3, 75:25, …

2 Fit each model separately on the training set.

3 Evaluate each model separately on the validation set.

4 Choose the model that performs best on the validation set.

At the end, train the selected model again using all data!
19
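To make the recipe concrete, here is a minimal Python sketch of the training-set/validation-set approach. The candidate models (polynomial fits of degree 1, 2 and 10) and the data-generating process are invented for illustration and are not part of the slides.

```python
# Minimal sketch of the training-set / validation-set approach
# (illustrative data; candidate models are polynomial fits of degree 1, 2 and 10).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=200)
Y = 1.0 + 2.0 * X - 0.5 * X**2 + rng.normal(scale=0.5, size=X.size)  # hypothetical f + noise

# 1. Split the available data 50:50 into a training set and a validation set.
X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.5, random_state=1)

val_mse = {}
for degree in [1, 2, 10]:
    # 2. Fit each candidate model on the training set only.
    coefs = np.polyfit(X_tr, Y_tr, deg=degree)
    # 3. Evaluate each model on the validation set.
    val_mse[degree] = np.mean((Y_val - np.polyval(coefs, X_val)) ** 2)

# 4. Choose the model that performs best on the validation set ...
best_degree = min(val_mse, key=val_mse.get)
# ... and, at the end, re-train the selected model on all available data.
final_model = np.polyfit(X, Y, deg=best_degree)
print(val_mse, best_degree)
```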
The “Training Set-Validation Set”
Approach
Source: James, Witten, Hastie, Tibshirani (2013), Figure 2.9 (Section 2.2, Assessing Model Accuracy)

[Left panel: data simulated from f (shown in black) together with three estimates: a linear fit, a third-order fit and a high-order fit. Right panel: mean squared error against flexibility, showing the training error and the validation error, with underfitting on the left, the optimum in the middle and overfitting on the right.]
20
Can We Use the Validation Set to
Predict Future Performance, too?
Source: James, Witten, Hastie, Tibshirani (2013), Figure 2.9 (Section 2.2, Assessing Model Accuracy)

[Same figure as before: the simulated data and fits (left) and the training and validation error against flexibility (right).]

After selecting the quadratic model, can we expect to see this performance on new (unseen) data?
21
Can We Use the Validation Set to
Predict Future Performance, too?
Consider the following “thought experiment”:

1 Take 10 “one penny” coins.

2 Throw each coin 10 times:
  For each coin, record the % of “heads” shown (e.g. 60%).
  By construction, each percentage is an unbiased estimator for the probability of the event “head shown”.
  [Histogram of the recorded percentages over many repetitions]

3 Take the minimum percentage across all 10 coins.
  Is this percentage an unbiased estimator for the probability of the event “that coin will show head”?
  [Histogram of the minimum percentage over many repetitions]

4 Throw the coin with the minimum percentage again 10 times.
  Is the new percentage an unbiased estimator for the probability of the event “that coin will show head”?
22
Can We Use the Validation Set to
Predict Future Performance, too?
Consider the following “thought experiment”:

1 Take 10 “one penny” coins.

2 Throw each coin 10 times:
  For each coin, record the % of “heads” shown (e.g. 60%).   ← performance on the validation set
  By construction, each percentage is an unbiased estimator for the probability of the event “head shown”.
  [Histogram of the recorded percentages over many repetitions]

3 Take the minimum percentage across all 10 coins.
  Is this percentage an unbiased estimator for the probability of the event “that coin will show head”?
  [Histogram of the minimum percentage over many repetitions]

4 Throw the coin with the minimum percentage again 10 times.   ← performance on the test set
  Is the new percentage an unbiased estimator for the probability of the event “that coin will show head”?
23
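A small simulation sketch of the thought experiment (not from the slides; the number of repetitions is arbitrary) illustrates why the minimum percentage is biased while a fresh set of throws is not.

```python
# Simulation sketch of the coin thought experiment:
# the minimum of 10 unbiased estimates is itself biased downwards,
# but re-throwing the selected coin gives an unbiased estimate again.
import numpy as np

rng = np.random.default_rng(0)
n_reps, n_coins, n_throws, p_head = 10_000, 10, 10, 0.5

min_pct, rethrow_pct = [], []
for _ in range(n_reps):
    # throw each of the 10 fair coins 10 times and record the % of heads
    pct = rng.binomial(n_throws, p_head, size=n_coins) / n_throws
    min_pct.append(pct.min())                                       # minimum across the coins
    rethrow_pct.append(rng.binomial(n_throws, p_head) / n_throws)   # fresh throws of the chosen coin

print("mean of the minimum %:    ", np.mean(min_pct))       # clearly below 0.5 -> biased
print("mean of the re-thrown %:  ", np.mean(rethrow_pct))   # close to 0.5 -> unbiased
```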
The “Training Set-Validation Set-
Test Set” Approach
The “Training Set-Validation Set-Test Set” Approach:
Useful for selecting one of several models and obtaining an
estimate of the resulting performance (model assessment):
1 Split the available data into a training set, a validation set
and a test set:
depending on the amount of available data and the
number of models to be compared, 50:25:25 or 60:20:20.
2 Fit each model separately on the training set.
3 Evaluate each model separately on the validation set.
4 Choose the model that performs best on the validation set.
5 Estimate the performance of that model on the test set.

At the end, train the selected model again using all data!
24
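A minimal Python sketch of the three-way split, again with hypothetical data and polynomial candidate models; the 50:25:25 proportions follow the slide, everything else is illustrative.

```python
# Minimal sketch of the training / validation / test approach (50:25:25 split).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=400)
Y = 1.0 + 2.0 * X - 0.5 * X**2 + rng.normal(scale=0.5, size=X.size)  # hypothetical f + noise

# 1. Split into training (50%), validation (25%) and test (25%) sets.
X_tr, X_rest, Y_tr, Y_rest = train_test_split(X, Y, test_size=0.5, random_state=1)
X_val, X_te, Y_val, Y_te = train_test_split(X_rest, Y_rest, test_size=0.5, random_state=2)

# 2.-4. Fit on the training set, evaluate on the validation set, select the best model.
val_mse = {d: np.mean((Y_val - np.polyval(np.polyfit(X_tr, Y_tr, d), X_val)) ** 2)
           for d in [1, 2, 10]}
best_degree = min(val_mse, key=val_mse.get)

# 5. Estimate the selected model's performance on the previously untouched test set.
test_mse = np.mean((Y_te - np.polyval(np.polyfit(X_tr, Y_tr, best_degree), X_te)) ** 2)

# At the end, re-train the selected model on all available data.
final_model = np.polyfit(X, Y, deg=best_degree)
print(best_degree, test_mse)
```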
The “Training Set-Validation Set-
Test Set” Approach
The “Training Set-Validation Set-Test Set” Approach:
Useful for selecting one of several models and obtaining an
estimate of the resulting performance (model assessment):
1 Split the available data into a training set, a validation set
and a test set:
depending on the amount of available data and the
number of models to be compared, 50:25:25 or 60:20:20.
2 Fit each model separately on the training set.
3 Evaluate each model separately on the validation set.
4 Choose the model that performs best on the validation set.
5 Estimate the performance of that model on the test set.

You only get an unbiased estimate of a model’s performance on
new data if you apply it to previously untouched data!

At the end, train the selected model again using all data!
25
Content

1 Training Sets, Validation Sets and Test Sets

2 The Mathematics of the Bias-Variance Tradeoff

3 Evaluating Performance of Regression Problems

4 Evaluating Performance of Classification Problems

5 Evaluating Performance of Ranking Problems

6 Oversampling

7 Cross-Validation

26
The Mathematics Behind the
Bias-Variance Tradeoff
We want to understand the unknown relationship

Y = f(X1, …, Xp) + ε

where Y is the output variable, X1, …, Xp are the input variables,
f is the target function (unknown!) and ε is the noise,
from limited data points (Yi, Xi1, …, Xip), i = 1, …, n.

Assumptions (“white noise”):
  ε is a random variable.
  ε has mean zero.
  ε is independent of X1, …, Xp.

Examples of noise: unmeasured inputs that are useful for predicting Y,
unmeasurable variation (e.g. manufacturing variation of a drug,
a patient’s well-being on the day, …).
27
The Mathematics Behind the
Bias-Variance Tradeoff
Imagine we construct an estimate f̂ of f via a ML method.
The expected squared error of f̂ for a new sample
(Y^new, X^new) = (Y^new, X1^new, …, Xp^new) is

E[(Y^new − f̂(X^new))²] = E[(f(X^new) − f̂(X^new))²] + Var[ε],

where:
  The predictors X^new of the new sample are deterministic
  and Y^new = f(X^new) + ε is random.
  The expectation is taken w.r.t. random draws of the training data
  (Yi, Xi1, …, Xip), i = 1, …, n, each of which leads to a different
  estimate f̂, as well as w.r.t. the randomness of the new outcome Y^new.
  Var[ε] = E[ε²] is the variance of the noise term.
28
The Mathematics Behind the
Bias-Variance Tradeoff
Imagine we construct an estimate f̂ of f via a ML method.
The expected squared error of f̂ for a new sample
(Y^new, X^new) = (Y^new, X1^new, …, Xp^new) is

E[(Y^new − f̂(X^new))²] = E[(f(X^new) − f̂(X^new))²] + Var[ε],

where the left-hand side is the expected squared error of the new sample,
the first term on the right is the reducible error, which can (possibly) be
reduced by more data or by selecting a more appropriate ML method, and the
second term is the irreducible error, which cannot be reduced unless we have
more information about the noise.
29
The Mathematics Behind the
Bias-Variance Tradeoff
Imagine we construct an estimate f̂ of f via a ML method.
The expected squared error of f̂ for a new sample
(Y^new, X^new) = (Y^new, X1^new, …, Xp^new) is

E[(Y^new − f̂(X^new))²] = E[(f(X^new) − f̂(X^new))²] + Var[ε]
                       (expected squared error = reducible error + irreducible error),

and the reducible error decomposes further into

E[(f(X^new) − f̂(X^new))²] = (E[f(X^new) − f̂(X^new)])² + Var[f̂(X^new)]
                          (reducible error = squared bias of f̂ + variance of f̂).
30
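The decomposition can be checked empirically. Below is a small simulation sketch (not from the slides): the true function f, the noise level and the two estimators are invented, and the squared bias and variance of f̂(x0) at a fixed new point x0 are estimated over repeated training draws.

```python
# Simulation sketch of the bias-variance decomposition at a fixed new point x0.
# Repeatedly draw training sets from Y = f(X) + eps, fit two estimators,
# and estimate squared bias and variance of f_hat(x0) across the draws.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)          # hypothetical true function
sigma, n, x0, n_reps = 0.3, 50, 1.0, 2000

preds = {1: [], 7: []}               # linear fit vs high-order polynomial fit
for _ in range(n_reps):
    X = rng.uniform(0, 3, size=n)
    Y = f(X) + rng.normal(scale=sigma, size=n)
    for degree in preds:
        preds[degree].append(np.polyval(np.polyfit(X, Y, degree), x0))

for degree, p in preds.items():
    p = np.array(p)
    bias_sq = (f(x0) - p.mean()) ** 2                  # squared bias of f_hat(x0)
    variance = p.var()                                 # variance of f_hat(x0)
    expected_sq_err = bias_sq + variance + sigma**2    # + irreducible error Var[eps]
    print(degree, round(bias_sq, 4), round(variance, 4), round(expected_sq_err, 4))
```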


The Mathematics Behind the
Bias-Variance Tradeoff
Illustrative example: Fits over 3 different training sets

Linear estimator: high bias but low variance
[Three plots: the linear fit obtained on each of the three training sets]

Nonlinear estimator: lower bias but higher variance
[Three plots: the nonlinear fit obtained on each of the three training sets]
31
The Mathematics Behind the
Bias-Variance Tradeoff

[2×2 grid of illustrations: combinations of low/high bias with low/high variance]
32
The Mathematics Behind the
Bias-Variance Tradeoff
Example 1: Unknown true relationship is of “medium complexity”
Source: James, Witten, Hastie, Tibshirani (2013), Section 2.2 (Assessing Model Accuracy)

[Left panel: true function vs linear fit vs third-order fit vs high-order fit.
Right panel: mean squared error against flexibility, showing the training error and the validation error, with underfitting on the left, the optimum in the middle and overfitting on the right.]
33
The Mathematics Behind the
Bias-Variance Tradeoff
Example 2: Unknown true relationship is of “low complexity”
Source: James, Witten, Hastie, Tibshirani (2013), Section 2.2 (Assessing Model Accuracy)

[Left panel: true function vs linear fit vs second-order fit vs high-order fit.
Right panel: mean squared error against flexibility, showing the training error and the validation error, with underfitting on the left, the optimum at low flexibility and overfitting on the right.]
34
The Mathematics Behind the
Bias-Variance Tradeoff
Example 3: Unknown true relationship is of “high complexity”
Source: James, Witten, Hastie, Tibshirani (2013), Chapter 2 (Statistical Learning)

[Left panel: true function vs linear fit vs fifth-order fit vs high-order fit.
Right panel: mean squared error against flexibility, showing the training error and the validation error, with underfitting on the left and overfitting on the right.]
35
Content

1 Training Sets, Validation Sets and Test Sets

2 The Mathematics of the Bias-Variance Tradeoff

3 Evaluating Performance of Regression Problems

4 Evaluating Performance of Classification Problems

5 Evaluating Performance of Ranking Problems

6 Oversampling

7 Cross-Validation

36
Performance Measures for
Regression Problems
Let the i-th validation error be ei = Yi − f̂(Xi1, …, Xip), i = 1, …, n:

1 mean absolute error:              (1/n) Σi |ei|

2 average error:                    (1/n) Σi ei

3 mean absolute percentage error:   100% · (1/n) Σi |ei / Yi|

4 root-mean-squared error:          sqrt( (1/n) Σi ei² )

5 total sum of squared errors:      Σi ei²

Benchmark: The “average predictor” f̂(Xi1, …, Xip) = ȳ,
where ȳ is the average output over the training set.
37
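A minimal numpy sketch of the five measures, assuming hypothetical validation outcomes Y_val and predictions Y_pred:

```python
# The five regression performance measures, computed from validation errors e_i.
import numpy as np

Y_val = np.array([10.0, 12.5, 9.0, 14.0, 11.0])    # hypothetical actual outcomes
Y_pred = np.array([11.0, 12.0, 8.5, 15.5, 10.0])   # hypothetical predictions f_hat(X_i)
e = Y_val - Y_pred                                  # validation errors e_i

mae  = np.mean(np.abs(e))                 # mean absolute error
avg  = np.mean(e)                         # average error
mape = 100 * np.mean(np.abs(e / Y_val))   # mean absolute percentage error
rmse = np.sqrt(np.mean(e ** 2))           # root-mean-squared error
sse  = np.sum(e ** 2)                     # total sum of squared errors

# Benchmark: the "average predictor" always predicts the training-set mean output.
y_bar_train = 11.0                        # assumed training-set average output
benchmark_rmse = np.sqrt(np.mean((Y_val - y_bar_train) ** 2))
print(mae, avg, mape, rmse, sse, benchmark_rmse)
```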
Content

1 Training Sets, Validation Sets and Test Sets

2 The Mathematics of the Bias-Variance Tradeoff

3 Evaluating Performance of Regression Problems

4 Evaluating Performance of Classification Problems

5 Evaluating Performance of Ranking Problems

6 Oversampling

7 Cross-Validation

38
Performance Measures for
Classification Problems
Consider the following confusion matrix, where “yes” is the important class:

                          Predicted Class
                          “yes”      “no”
Actual Class   “yes”       n11        n12
               “no”        n21        n22

1 estimation misclassification rate (= total error rate):
      (n12 + n21) / (n11 + n12 + n21 + n22)

2 accuracy: 1 − estimation misclassification rate

3 sensitivity: n11 / (n11 + n12)

4 specificity: n22 / (n21 + n22)

Benchmark: The “majority predictor” (always predict the majority class in the training data)
39
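A minimal sketch of these measures for hypothetical counts n11, n12, n21, n22:

```python
# Classification measures from a confusion matrix with "yes" as the important class.
# n11 = actual yes & predicted yes, n12 = actual yes & predicted no,
# n21 = actual no  & predicted yes, n22 = actual no  & predicted no.
n11, n12, n21, n22 = 80, 20, 30, 870     # hypothetical counts

total = n11 + n12 + n21 + n22
misclassification_rate = (n12 + n21) / total   # estimation misclassification rate
accuracy = 1 - misclassification_rate
sensitivity = n11 / (n11 + n12)                # share of actual "yes" correctly identified
specificity = n22 / (n21 + n22)                # share of actual "no" correctly identified

# Benchmark: the "majority predictor" always predicts the majority training class
# ("no" in this example), so its accuracy equals the share of "no" records.
benchmark_accuracy = (n21 + n22) / total
print(misclassification_rate, accuracy, sensitivity, specificity, benchmark_accuracy)
```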
Content

1 Training Sets, Validation Sets and Test Sets

2 The Mathematics of the Bias-Variance Tradeoff

3 Evaluating Performance of Regression Problems

4 Evaluating Performance of Classification Problems

5 Evaluating Performance of Ranking Problems

6 Oversampling

7 Cross-Validation

40
Lift Charts for Ranking Problems

Lift Chart Algorithm for Classification Problems:

1 Sort validation records by decreasing propensity (estimated


probability of belonging to the important “yes” class).

2 For each sorted validation record i = 1, …, n:


Plot the point (i, # of actual “yes” class members in
the validation records {1…i}).

3 Compare it against the “random classifier” with points


(i, i * percentage of “yes” class members in validation set)
as well as the “perfect classifier” with points
(i, min {i, number of “yes” class members in validation set}).
41
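A minimal numpy sketch of the construction, using hypothetical propensities and actual labels for a small validation set:

```python
# Sketch of the lift-chart construction for a classification (ranking) problem.
import numpy as np

propensity = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])  # estimated P("yes")
actual_yes = np.array([1,   1,   0,   1,   0,   1,   0,   0])    # actual class (1 = "yes")

# 1. Sort validation records by decreasing propensity.
order = np.argsort(-propensity)
sorted_yes = actual_yes[order]

# 2. Cumulative number of actual "yes" members among the first i records.
lift_curve = np.cumsum(sorted_yes)

# 3. Reference curves.
n = len(actual_yes)
n_yes = actual_yes.sum()
i = np.arange(1, n + 1)
random_classifier = i * n_yes / n              # (i, i * share of "yes" records)
perfect_classifier = np.minimum(i, n_yes)      # (i, min{i, # of "yes" records})

print(lift_curve, random_classifier, perfect_classifier)
```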
Lift Charts for Ranking Problems

Lift Chart Algorithm for Classification Problems:

Example:

[Lift chart: cumulative # of important records (y-axis) against # of all records considered (x-axis). The perfect classifier rises with slope 1 until all important records are found, the random classifier is a straight line with slope (# of important records) / (# of all records), and a typical classifier lies between the two.]
42
Lift Charts for Ranking Problems

Lift Chart Algorithm for Regression Problems:

1 Sort validation records by decreasing predicted


outcome value.

2 For each sorted validation record i = 1, …, n:


Plot the point (i, cumulative sum of actual outcome values in
the validation records {1…i}).

3 Compare it against the “average classifier” with points


(i, i * average outcome over all validation records) as well as
the “perfect classifier” with points (i, sum of i largest outcome
values in all of the validation records).

43
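A minimal numpy sketch of the regression variant, using hypothetical predicted and actual outcome values:

```python
# Sketch of the lift-chart construction for a regression problem.
import numpy as np

predicted = np.array([95.0, 80.0, 60.0, 40.0, 30.0, 10.0])   # predicted outcomes
actual    = np.array([90.0, 70.0, 65.0, 30.0, 35.0,  5.0])   # actual outcomes

# 1. Sort validation records by decreasing predicted outcome value.
order = np.argsort(-predicted)
sorted_actual = actual[order]

# 2. Cumulative sum of the actual outcomes of the first i records.
lift_curve = np.cumsum(sorted_actual)

# 3. Reference curves.
i = np.arange(1, len(actual) + 1)
average_classifier = i * actual.mean()                       # slope = average outcome
perfect_classifier = np.cumsum(np.sort(actual)[::-1])        # sum of the i largest outcomes

print(lift_curve, average_classifier, perfect_classifier)
```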
Lift Charts for Ranking Problems

Lift Chart Algorithm for Regression Problems:

Example:

[Lift chart: cumulative sum of outcome values (y-axis) against # of all records considered (x-axis). The average classifier is a straight line with slope equal to the average outcome, while a typical classifier and the perfect classifier lie above it.]
44
Content

1 Training Sets, Validation Sets and Test Sets

2 The Mathematics of the Bias-Variance Tradeoff

3 Evaluating Performance of Regression Problems

4 Evaluating Performance of Classification Problems

5 Evaluating Performance of Ranking Problems

6 Oversampling

7 Cross-Validation

45
Oversampling

Consider a binary classification problem where


the “class of interest” is rare:

[Scatter plot: many records of the “less important” class (e.g. law-abiding citizens) and only a few records of the “important” class (e.g. tax fraudsters)]
46
Oversampling

Consider a binary classification problem where


the “class of interest” is rare:

a “normal” classifier may not be suitable:

  since it minimises the overall misclassification rate,
  it may perform poorly in identifying the “interesting” cases

[Scatter plot: the best overall classifier misclassifies 1 record]

47
Oversampling

Consider a binary classification problem where


the “class of interest” is rare:

idea: oversample the “interesting” class

  this leads to a poorer overall misclassification rate,
  but typically to a better misclassification rate on the “interesting” cases

[Scatter plot: the best classifier now misclassifies 2 records]

48
Oversampling

Stratified Sampling Algorithm

1 Divide the available data into two sets (strata):


all samples of the class of interest (set A);
all other samples (set B).

2 Construct the training set:


randomly select 50% of the samples in set A
add equally many samples from set B.

3 Construct the validation set:


select the remaining 50% of samples from set A
add enough samples from set B so as to restore
the original ratio from the overall data set.
49
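A minimal numpy sketch of the stratified sampling algorithm; the class labels and inputs are hypothetical, and class 1 plays the role of the rare class of interest.

```python
# Sketch of the stratified (over)sampling split for a rare class of interest.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1] * 20 + [0] * 480)            # 20 "interesting" records, 480 others
X = rng.normal(size=(y.size, 3))              # hypothetical inputs

A = np.flatnonzero(y == 1)                    # stratum A: class of interest
B = np.flatnonzero(y == 0)                    # stratum B: all other records
rng.shuffle(A)
rng.shuffle(B)

# Training set: 50% of the samples in A plus equally many samples from B.
half = len(A) // 2
train_idx = np.concatenate([A[:half], B[:half]])

# Validation set: the remaining 50% of A plus enough samples from B
# to restore the original class ratio of the overall data set.
rest_A = A[half:]
n_B_val = int(round(len(rest_A) * len(B) / len(A)))   # keeps the ratio |B| : |A|
valid_idx = np.concatenate([rest_A, B[half:half + n_B_val]])

X_train, y_train = X[train_idx], y[train_idx]
X_valid, y_valid = X[valid_idx], y[valid_idx]
print(y_train.mean(), y_valid.mean())   # ~0.5 in training, ~original rate in validation
```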
Content

1 Training Sets, Validation Sets and Test Sets

2 The Mathematics of the Bias-Variance Tradeoff

3 Evaluating Performance of Regression Problems

4 Evaluating Performance of Classification Problems

5 Evaluating Performance of Ranking Problems

6 Oversampling

7 Cross-Validation
K-Fold Cross-Validation

The “training set-validation set” approach has 2 shortcomings:

1 Unless we have a large amount of data, we end up with either
too few training data or too few validation data:

  too few training data: the models in the training phase will be of poor quality
  [diagram: small training portion, large validation portion]

  too few validation data: the estimates in the validation phase will be of poor quality
  [diagram: large training portion, small validation portion]

2 If the data is randomly split into training and validation data,


the approach gives different results for different splits.
51
K-Fold Cross-Validation

K-fold cross-validation can alleviate both shortcomings:

K-Fold Cross-Validation Algorithm:

1 Split data into K folds (e.g. K = 5 or K = 10).


2 For each fold i = 1, …, K:
2 A Train each model on all folds j ≠ i.
2 B Evaluate each model on fold i.

3 For each model, average validation performance over all K runs.

4 Choose the model with best average validation performance.

At the end, train the selected model again using all data!
52
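A minimal sketch of the algorithm using scikit-learn's KFold, with the same hypothetical data and polynomial candidate models as before:

```python
# Minimal sketch of K-fold cross-validation for model selection.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=100)
Y = 1.0 + 2.0 * X - 0.5 * X**2 + rng.normal(scale=0.5, size=X.size)  # hypothetical data

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=1)                 # 1. split into K folds
cv_mse = {1: [], 2: [], 10: []}

# 2. For each fold i: train every candidate model on the other folds, evaluate it on fold i.
for train_idx, val_idx in kf.split(X):
    for degree in cv_mse:
        coefs = np.polyfit(X[train_idx], Y[train_idx], deg=degree)
        cv_mse[degree].append(np.mean((Y[val_idx] - np.polyval(coefs, X[val_idx])) ** 2))

# 3.-4. Average the validation performance over the K runs and pick the best model.
avg_mse = {d: np.mean(v) for d, v in cv_mse.items()}
best_degree = min(avg_mse, key=avg_mse.get)

# At the end, train the selected model again using all data.
final_model = np.polyfit(X, Y, deg=best_degree)
print(avg_mse, best_degree)
```

For estimators that follow scikit-learn's fit/predict interface, cross_val_score wraps the loop over folds in a single call; the explicit loop above is kept to mirror the algorithm on the slide.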
K-Fold Cross-Validation

K-fold cross-validation can alleviate both shortcomings:

K-Fold Cross-Validation Algorithm:

[Diagram: all data split into folds 1, 2, …, K = 7. In run 1, fold 1 is held out; in run 2, fold 2 is held out; …; in run K, fold K is held out.]

  A Train each model on all folds j ≠ i.
  B Evaluate each model on fold i.

This is “essentially” as good as training a model on 100% · (K − 1)/K
of the data and evaluating it on 100% of the data!
53
K-Fold Cross-Validation
Experiment 1: Simple linear regression over 50 samples.
For different training set sizes K ∈ { 5, 10, …, 45 }, do:
1 Split data into a training set of size K
and a validation set of size N - K.
  [diagram: training set (size K) | validation set (size N − K)]

2 Run the regression over the training set and
calculate the MSE over the validation set.

  [diagram: estimator = regression over the training set; error = MSE over the validation samples]

Which K would you choose?
54
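A small simulation sketch of Experiment 1; the data-generating process and the number of repetitions here are illustrative, not the ones used for the slides.

```python
# Sketch of Experiment 1: vary the training-set size K for a simple linear
# regression over N = 50 samples and record the validation-set MSE.
import numpy as np

rng = np.random.default_rng(0)
N = 50

def one_repetition():
    X = rng.uniform(0, 10, size=N)
    Y = 3.0 + 1.5 * X + rng.normal(scale=2.0, size=N)   # hypothetical linear truth + noise
    results = {}
    for K in range(5, 50, 5):                           # training-set sizes 5, 10, ..., 45
        idx = rng.permutation(N)
        tr, val = idx[:K], idx[K:]                      # training set of size K, rest validation
        coefs = np.polyfit(X[tr], Y[tr], deg=1)         # simple linear regression
        results[K] = np.mean((Y[val] - np.polyval(coefs, X[val])) ** 2)
    return results

# Repeating this many times (10,000 in the slides) shows how the average
# validation MSE and its variation change with K.
reps = [one_repetition() for _ in range(200)]
avg = {K: np.mean([r[K] for r in reps]) for K in reps[0]}
print(avg)
```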


K-Fold Cross-Validation

Example: Training (top) and validation (bottom) for K = 5.

55
K-Fold Cross-Validation

Result over 10,000 repetitions:

[Two plots against the size of the training set: the MSE over the training set (left) and the MSE over the validation set (right).]

Observations:
  MSE over the training set: the average MSE is initially low, since the training data is “too simple”; the average MSE converges and the MSE variation decreases with K.
  MSE over the validation set: the average MSE decreases with K; the MSE variation is initially high (too few training data) and ultimately high (too few validation data).
56
K-Fold Cross-Validation
Experiment 2: Simple linear regression over 50 samples.
1 Split data into 10 folds.
  [diagram: the 50 samples split into folds 1–10]

2 For each fold i = 1, …, 10:


2 A Run the regression over all folds j ≠ i.
2 B Calculate the MSE on the validation fold i.
  [diagram: fold i is the validation fold; the estimator is the regression over the remaining (training) folds, and the error is the MSE over the validation fold]

3 Record the average MSE over all 10 runs.

57
K-Fold Cross-Validation

Result over 10,000 repetitions:

[Plots against the size of the training set, as in Experiment 1: the MSE over the training set and the MSE over the validation set, now with the cross-validated MSE over the validation set added for comparison (“CV”).]

Observations:
  The average MSE over the validation set is comparable to K = 45.
  The MSE variation is much smaller, however!
58
