Lecture 16: Cross-Validation

Overfitting and Cross-Validation

George Lan

A. Russell Chandler III Chair Professor


H. Milton Stewart School of Industrial & Systems Engineering
Apartment hunting
Suppose you are about to move to Atlanta,
and you want to find the most reasonably
priced apartment satisfying your needs:
square footage, number of bedrooms, distance to campus, ...

Living area (ft^2)   # bedrooms   Rent ($)
230                  1            600
506                  2            1000
433                  2            1100
109                  1            500
150                  1            ?
270                  1.5          ?
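A peek ahead at what this lecture formalizes: fitting a least-squares model to the table above and predicting the two unknown rents. This is a numpy sketch, not part of the slides; the predictions are illustrative only.

```python
import numpy as np

# Apartment data from the table above: [living area, # bedrooms] -> rent ($).
X = np.array([[230, 1], [506, 2], [433, 2], [109, 1]], dtype=float)
y = np.array([600, 1000, 1100, 500], dtype=float)

# Add an intercept column and solve the least-squares problem.
A = np.hstack([np.ones((4, 1)), X])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict rents for the two query apartments (150 ft^2 / 1 bed, 270 ft^2 / 1.5 bed).
queries = np.array([[1.0, 150, 1], [1.0, 270, 1.5]])
pred = queries @ theta
```

With only four training rows this model is crude; the rest of the lecture is about how to judge and control the complexity of such fits.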
Nonlinear regression

Want to fit a polynomial regression model

$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_n x^n + \epsilon$

Let $\tilde{x} = (1, x, x^2, \ldots, x^n)^T$ and $\theta = (\theta_0, \theta_1, \theta_2, \ldots, \theta_n)^T$; then

$y = \theta^T \tilde{x} + \epsilon$
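The feature map $x \mapsto \tilde{x}$ can be sketched with numpy's Vandermonde helper; a minimal version (not from the slides):

```python
import numpy as np

def poly_features(x, n):
    """Map scalar(s) x to the feature vector (1, x, x^2, ..., x^n)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # increasing=True puts the constant term first, matching (1, x, ..., x^n).
    return np.vander(x, n + 1, increasing=True)  # shape (m, n+1)
```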
Least mean square method
Given $m$ data points, find $\theta$ that minimizes the mean square error

$\hat{\theta} = \arg\min_\theta L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T \tilde{x}^{(i)} \right)^2$

Our usual trick: set the gradient to 0 and solve for the parameters

$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \sum_i \left( y^{(i)} - \theta^T \tilde{x}^{(i)} \right) \tilde{x}^{(i)} = 0$

$\Leftrightarrow\; -\frac{2}{m} \sum_i y^{(i)} \tilde{x}^{(i)} + \frac{2}{m} \sum_i \tilde{x}^{(i)} \tilde{x}^{(i)T} \theta = 0$
Matrix version of the gradient
Define $\tilde{X} = [\tilde{x}^{(1)}, \tilde{x}^{(2)}, \ldots, \tilde{x}^{(m)}]$ and $y = (y^{(1)}, y^{(2)}, \ldots, y^{(m)})^T$; the gradient becomes

$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \tilde{X} y + \frac{2}{m} \tilde{X} \tilde{X}^T \theta = 0$

$\Rightarrow\; \hat{\theta} = (\tilde{X} \tilde{X}^T)^{-1} \tilde{X} y$

Note that $\tilde{x} = (1, x, x^2, \ldots, x^n)^T$.

If we choose a different maximal degree $n$ for the polynomial, the solution will be different.
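The closed form $\hat{\theta} = (\tilde{X}\tilde{X}^T)^{-1}\tilde{X}y$ can be implemented directly. A sketch following the slides' convention that the feature vectors are the columns of $\tilde{X}$; the test data is generated from a known quadratic, so the fit should recover its coefficients:

```python
import numpy as np

def fit_poly(x, y, n):
    """Solve the normal equations (X X^T) theta = X y for a degree-n polynomial."""
    # Columns of X are the feature vectors (1, x, ..., x^n), as in the slides.
    X = np.vander(np.asarray(x, dtype=float), n + 1, increasing=True).T  # (n+1, m)
    return np.linalg.solve(X @ X.T, X @ np.asarray(y, dtype=float))

# Noise-free data from y = 1 + 2x - 0.5 x^2: a degree-2 fit recovers the coefficients.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x - 0.5 * x**2
theta = fit_poly(x, y, 2)
```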
Increasing the maximal degree

[Three figure slides: the fitted curve on the training data as the maximal degree increases; figures omitted.]
Which one is better?

Can we increase the maximal polynomial degree so much that the curve passes through all training points?

The optimization does not prevent us from doing that.
When maximal degree is very large
Define $\tilde{X} = [\tilde{x}^{(1)}, \tilde{x}^{(2)}, \ldots, \tilde{x}^{(m)}]$ and $y = (y^{(1)}, y^{(2)}, \ldots, y^{(m)})^T$, and set the gradient to zero:

$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \tilde{X} y + \frac{2}{m} \tilde{X} \tilde{X}^T \theta = 0 \;\Rightarrow\; \tilde{X} \tilde{X}^T \theta = \tilde{X} y$

Each $\tilde{x} = (1, x, x^2, \ldots, x^n)^T$ is a vector of $n+1$ polynomial features, so the size of $\tilde{X}$ is $(n+1) \times m$, and $\tilde{X} \tilde{X}^T$ is $(n+1) \times (n+1)$.

When $n + 1 > m$, $\tilde{X} \tilde{X}^T$ is not invertible; there are multiple solutions $\theta$ which give zero objective

$L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T \tilde{x}^{(i)} \right)^2$
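The rank deficiency is easy to verify numerically. A sketch (not from the slides) with $m = 3$ points and degree $n = 5$, using the pseudoinverse to pick the minimum-norm solution among the many that drive the training error to zero:

```python
import numpy as np

# m = 3 points, degree n = 5 -> 6 polynomial features per point.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 5.0])
X = np.vander(x, 6, increasing=True).T   # shape (6, 3); columns are x_tilde^{(i)}
G = X @ X.T                              # shape (6, 6), but rank <= 3: singular
rank = np.linalg.matrix_rank(G)

# The pseudoinverse selects the minimum-norm theta with zero training error.
theta = np.linalg.pinv(X.T) @ y
train_err = np.max(np.abs(X.T @ theta - y))
```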
Overfitting/Underfitting
Blue points: training data; red points: test data.

The fit in the middle panel achieves a balance: small error on both the training and the test points.

[Figure: three fits. Left: underfitting; middle: balanced; right: overfitting.]
What is the problem?
Given $m$ data points $D = \{(\tilde{x}^{(i)}, y^{(i)})\}$, find $\theta$ that minimizes the mean square error

$\hat{\theta} = \arg\min_\theta \hat{L}(\theta) := \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T \tilde{x}^{(i)} \right)^2$

But we really want to minimize the error for unseen data points, i.e., with respect to the entire distribution of the data:

$\theta^* = \arg\min_\theta L(\theta) := \mathbb{E}_{(x,y) \sim p(x,y)} \left[ \left( y - \theta^T \tilde{x} \right)^2 \right]$

It is the finite number of training points that creates the problem.
Decomposition of expected loss
Estimate your function from a finite data set $D$:

$\hat{f} = \arg\min_f \hat{L}(f) := \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - f(x^{(i)}) \right)^2$

$\hat{f}$ is a random function, generally different for different data sets.

Expected loss of $\hat{f}$:

$L(\hat{f}) := \mathbb{E}_D \mathbb{E}_{(x,y)} \left[ \left( y - \hat{f}(x) \right)^2 \right]$

Bias-variance decomposition:

Expected loss = (bias)^2 + variance + noise
What is the best we can do?
The expected squared loss is

$L(\hat{f}) := \mathbb{E}_D \mathbb{E}_{(x,y)} \left[ (y - \hat{f}(x))^2 \right] = \mathbb{E}_D \underbrace{\iint (y - \hat{f}(x))^2 \, p(x,y) \, dx \, dy}_{A}$

Our goal is to choose $\hat{f}(x)$ to minimize $L(\hat{f})$. By calculus of variations (see, e.g., Appendix D of Pattern Recognition and Machine Learning),

$\frac{\partial A}{\partial f(x)} = -2 \int (y - f(x)) \, p(x,y) \, dy = 0$

$\Leftrightarrow\; \int f(x) \, p(x,y) \, dy = f(x) \, p(x) = \int y \, p(x,y) \, dy$

$\Leftrightarrow\; h(x) := f(x) = \int y \, \frac{p(x,y)}{p(x)} \, dy = \int y \, p(y|x) \, dy = \mathbb{E}_{y|x}[y] = \mathbb{E}[y|x]$
The best predictor is the expected value
The best you can do is $h(x) = \mathbb{E}_{y|x}[y]$: the expected value of $y$ given a particular $x$.

[Figure: the regression function $h(x)$; at a particular input $x_0$, the conditional density $p(y|x_0)$ is centered at $h(x_0)$.]
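The optimality of the conditional mean can be checked by simulation: on a toy distribution, predicting $\mathbb{E}[y|x]$ yields a lower mean squared error than a perturbed predictor. The distribution below is an assumption for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x in {0, 1, 2}, y = h(x) + noise, with h(x) = x^2 and unit-variance noise.
x = rng.integers(0, 3, size=200_000).astype(float)
y = x**2 + rng.normal(0.0, 1.0, size=x.size)

def mse(predictor):
    return np.mean((y - predictor(x)) ** 2)

err_h = mse(lambda x: x**2)            # the conditional mean h(x) = E[y|x]
err_other = mse(lambda x: x**2 + 0.5)  # a predictor shifted away from h
```

The error of $h$ is (up to sampling noise) exactly the noise variance, the lower bound discussed on the next slide.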
Noise term in the decomposition
$h(x) = \mathbb{E}(y|x)$ is the optimal predictor and $\hat{f}(x)$ our actual predictor; decompose the error a bit:

$\mathbb{E}_D \mathbb{E}_{(x,y)} \left[ (y - \hat{f}(x))^2 \right] = \mathbb{E}_D \left[ \iint \left( y - h(x) + h(x) - \hat{f}(x) \right)^2 p(x,y) \, dx \, dy \right]$

$= \mathbb{E}_D \left[ \iint \left[ (\hat{f}(x) - h(x))^2 + 2 (\hat{f}(x) - h(x)) (h(x) - y) + (h(x) - y)^2 \right] p(x,y) \, dx \, dy \right]$

$= \mathbb{E}_D \left[ \int (\hat{f}(x) - h(x))^2 \, p(x) \, dx \right] + \iint (h(x) - y)^2 \, p(x,y) \, dx \, dy$

(The cross term vanishes because $\int (h(x) - y) \, p(x,y) \, dy = 0$.)

The first term will be decomposed further. The second is the noise term: we cannot do better than this; it is a lower bound on the expected loss.
Bias-variance decomposition
$\hat{f}(x)$ is a random function, generally different for different datasets $D$.

$\mathbb{E}_D[\hat{f}(x)]$: expected value of $\hat{f}(x)$ with respect to the random dataset.

$\mathbb{E}_D \left[ \int (\hat{f}(x) - h(x))^2 \, p(x) \, dx \right] = \mathbb{E}_x \mathbb{E}_D \left[ (\hat{f}(x) - h(x))^2 \right]$

$= \mathbb{E}_x \mathbb{E}_D \left[ \left( \hat{f}(x) - \mathbb{E}_D[\hat{f}(x)] + \mathbb{E}_D[\hat{f}(x)] - h(x) \right)^2 \right]$

$= \mathbb{E}_x \mathbb{E}_D \left[ (\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2 \right] + \mathbb{E}_x \left[ (\mathbb{E}_D[\hat{f}(x)] - h(x))^2 \right]$
$\quad - 2 \, \mathbb{E}_x \mathbb{E}_D \left[ (\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)]) (\mathbb{E}_D[\hat{f}(x)] - h(x)) \right]$

The cross term vanishes since $\mathbb{E}_D[\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)]] = 0$, so

$= \underbrace{\mathbb{E}_x \mathbb{E}_D \left[ (\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2 \right]}_{\text{variance}} + \underbrace{\mathbb{E}_x \left[ (\mathbb{E}_D[\hat{f}(x)] - h(x))^2 \right]}_{\text{bias}^2}$
Overall decomposition of expected loss
Putting things together:

Expected loss = (bias)^2 + variance + noise

In formula:

$\mathbb{E}_D \mathbb{E}_{(x,y)} \left[ (y - \hat{f}(x))^2 \right]$
$= \mathbb{E}_x \left[ (\mathbb{E}_D[\hat{f}(x)] - h(x))^2 \right] \quad (\text{bias}^2)$
$\;+\; \mathbb{E}_x \mathbb{E}_D \left[ (\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2 \right] \quad (\text{variance})$
$\;+\; \mathbb{E}_{(x,y)} \left[ (h(x) - y)^2 \right] \quad (\text{noise})$

Key quantities:
$\hat{f}(x)$: actual predictor
$\mathbb{E}_D[\hat{f}(x)]$: expected predictor
$h(x) = \mathbb{E}(y|x)$: optimal predictor
Model space
Which model space should we choose?

The more complex the model, the larger the model space.

E.g., polynomial functions of degree 1, 2, ... correspond to spaces H1, H2, ...
Intuition of model selection
Find the right model family s.t. the expected loss becomes minimum.

[Figure: expected loss vs. model complexity (e.g., polynomial degree). Bias^2 decreases and variance increases with complexity; the total loss is minimized at the best model family H*.]
Other things that control model complexity
E.g., in the case of linear models $y = w^T x + b$, one wants to make $w$ a controlled parameter:

$\|w\| < C$

$H_C$: the linear model function family satisfying the constraint.

The larger the $C$, the larger the model family.

E.g., the larger the regularization parameter $\lambda$, the smaller the model family:

$J(w) = \sum_i \left( w^T x_i - y_i \right)^2 + \lambda \|w\|_2^2 \quad \text{or} \quad J(w) = \sum_i \left( w^T x_i - y_i \right)^2 + \lambda \|w\|_1$
Ridge regression
Given $m$ data points, find $\theta$ that minimizes the regularized mean square error

$\hat{\theta} = \arg\min_\theta L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T \tilde{x}^{(i)} \right)^2 + \lambda \|\theta\|^2$

The gradient becomes

$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \tilde{X} y + \frac{2}{m} \tilde{X} \tilde{X}^T \theta + 2 \lambda \theta = 0$

$\Rightarrow\; \hat{\theta} = (\tilde{X} \tilde{X}^T + \lambda m I)^{-1} \tilde{X} y$

If we choose a different $\lambda$, the solution will be different.
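The closed form $\hat{\theta} = (\tilde{X}\tilde{X}^T + \lambda m I)^{-1}\tilde{X}y$ in code, on hypothetical data (not from the slides); note that increasing $\lambda$ shrinks $\|\theta\|$, all the way toward 0 as $\lambda \to \infty$:

```python
import numpy as np

def ridge_fit(x, y, n, lam):
    """Ridge solution theta = (X X^T + lambda*m*I)^{-1} X y (columns are x_tilde)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.vander(x, n + 1, increasing=True).T   # (n+1, m)
    A = X @ X.T + lam * x.size * np.eye(n + 1)
    return np.linalg.solve(A, X @ y)

# Hypothetical data, roughly linear, fit with a degree-4 polynomial.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.1, 0.8, 1.1, 1.9, 2.1])
theta_small = ridge_fit(x, y, 4, 1e-8)   # almost unregularized
theta_big = ridge_fit(x, y, 4, 10.0)     # heavily regularized
```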
Experiment with bias-variance tradeoff
$\lambda$ is a "regularization" term in linear regression; the smaller the $\lambda$, the more complex the model.

Simple (highly regularized) models have low variance but high bias.

Complex models have low bias but high variance.

The actual $\mathbb{E}_D$ cannot be computed; you are inspecting an empirical average over 100 training sets.
How to do model selection in practice?
Suppose we are trying to select among several different models for a learning problem.
Examples:
1. Polynomial regression
   $h(x; \theta) = g(\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_k x^k)$
   Model selection: we wish to automatically and objectively decide whether $k$ should be, say, 0, 1, ..., or 10.
2. Mixture models and hidden Markov models
   Model selection: we want to decide the number of hidden states.
The problem:
Given a model family $F = \{M_1, M_2, \ldots, M_K\}$, find $\hat{M} \in F$ s.t.
$\hat{M} = \arg\max_{M \in F} J(D, M)$
Cross-Validation
K-fold cross-validation (CV)

For each fold $i$:

  Set aside $\alpha \cdot m$ samples of $D$ (where $m = |D|$) as the held-out data. They will be used to evaluate the error.
  Fit a model $f_i(x)$ to the remaining $(1 - \alpha) \cdot m$ samples in $D$.
  Calculate the error of the model $f_i(x)$ on the held-out data.

Repeat the above $K$ times, choosing a different held-out data set each time, and average the errors over the folds.

For the polynomial degree with the lowest score, we use all of $D$ to find the parameter values of $f(x)$.
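The procedure above can be sketched in a few lines of numpy; the data-generating model and the candidate degrees are assumptions for illustration:

```python
import numpy as np

def kfold_cv_error(x, y, degree, K=5, seed=0):
    """Average held-out MSE of a degree-`degree` polynomial over K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)
    errors = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        coef = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(coef, x[test]) - y[test]) ** 2))
    return np.mean(errors)

# Data simulated from a quadratic plus noise: degree 2 should beat degree 0.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 60)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.1, 60)
cv0 = kfold_cv_error(x, y, 0)
cv2 = kfold_cv_error(x, y, 2)
```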
Cross-validation
E.g., want to select the maximal degree of the polynomial.
5-fold cross-validation (blank: training; red: test)

Data: $1, \ldots, m$
Fold 1: test fold 1 $\Rightarrow f_1(x) \Rightarrow$ error 1
Fold 2: test fold 2 $\Rightarrow f_2(x) \Rightarrow$ error 2
Fold 3: test fold 3 $\Rightarrow f_3(x) \Rightarrow$ error 3
Fold 4: test fold 4 $\Rightarrow f_4(x) \Rightarrow$ error 4
Fold 5: test fold 5 $\Rightarrow f_5(x) \Rightarrow$ error 5
Then average the five errors.

Important: test fold $i$ is not used to fit model $f_i(x)$.
Example:
When $\alpha = 1/m$ (i.e., $K = m$), the algorithm is known as Leave-One-Out Cross-Validation (LOOCV).

MSE_LOOCV(M1) = 2.12, MSE_LOOCV(M2) = 0.962
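A LOOCV sketch for comparing polynomial degrees; the simulated quadratic data and the degrees compared are assumptions, not the models M1 and M2 from the slide:

```python
import numpy as np

def loocv_mse(x, y, degree):
    """Leave-one-out CV (alpha = 1/m): each point is held out exactly once."""
    m = len(x)
    errs = []
    for i in range(m):
        mask = np.arange(m) != i
        coef = np.polyfit(x[mask], y[mask], degree)
        errs.append((np.polyval(coef, x[i]) - y[i]) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 25)
y = x**2 + rng.normal(0, 0.1, 25)             # true model is quadratic
scores = {d: loocv_mse(x, y, d) for d in (1, 2, 6)}
```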
Practical issues for K-fold CV
How to decide the values of $K$ (or $\alpha$)?
Commonly used: $K = 10$ (or $\alpha = 0.1$).
Large $K$ makes CV time-consuming.
Bias-variance trade-off:
Large $K$ usually leads to low bias. In principle, LOOCV provides an almost
unbiased estimate of the generalization ability of a classifier, but it can also
have high variance.
Small $K$ can reduce variance, but leads to under-use of the data, causing high bias.

One important point is that the test data $D_{\text{test}}$ is never used in
CV, because doing so would result in overly (indeed dishonestly)
optimistic accuracy rates during the testing phase.
Ridge regression (alternative scaling)
Given $m$ data points, find $\theta$ that minimizes the regularized mean square error, here with the penalty scaled by $1/m$:

$\hat{\theta} = \arg\min_\theta L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T \tilde{x}^{(i)} \right)^2 + \frac{\lambda}{m} \|\theta\|^2$

The gradient becomes

$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \tilde{X} y + \frac{2}{m} \tilde{X} \tilde{X}^T \theta + \frac{2\lambda}{m} \theta = 0$

$\Rightarrow\; \hat{\theta} = (\tilde{X} \tilde{X}^T + \lambda I)^{-1} \tilde{X} y$

If we choose a different $\lambda$, the solution will be different.
Regularization in maximum likelihood
Regularize the likelihood objective (also known as penalized
likelihood, shrinkage, smoothing, etc.)

$\hat{\theta}_{\text{penalized}} = \arg\max_\theta \left[ l(\theta; D) - \lambda \|\theta\| \right]$

where $\lambda > 0$ and $\|\theta\|$ might be the $L_1$ or $L_2$ norm.

The choice of norm has an effect:
using the $L_2$ norm pulls directly towards the origin,
while using the $L_1$ norm pulls towards the coordinate axes, i.e., it
tries to set some of the coordinates to 0.
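The difference between the two norms shows up already in one dimension, where both penalized problems have closed forms: the $L_2$ penalty rescales toward the origin, while the $L_1$ penalty soft-thresholds and produces exact zeros. A sketch (not from the slides):

```python
import numpy as np

def ridge_1d(z, lam):
    """argmin_w (w - z)^2 + lam * w^2  ->  shrink toward 0, never exactly 0."""
    return z / (1 + lam)

def lasso_1d(z, lam):
    """argmin_w (w - z)^2 + lam * |w|  ->  soft-threshold: exact zeros."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2, 0.0)

z = np.array([3.0, 0.2, -0.1])   # unpenalized solutions
w_l2 = ridge_1d(z, 1.0)          # every coordinate shrunk, none zero
w_l1 = lasso_1d(z, 1.0)          # small coordinates set exactly to 0
```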
Bayesian interpretation of regularization
Assume iid data and Gaussian noise; LMS is equivalent to MLE of $\theta$:

$l(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_i \left( y_i - \theta^T x_i \right)^2$

Assume $y$ is linear in $x$ plus noise $\epsilon$:

$y = \theta^T x + \epsilon$

Assume $\epsilon$ follows a Gaussian $N(0, \sigma^2)$:

$p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)$

Now assume that the vector $\theta$ follows a normal prior with 0 mean and a
diagonal covariance matrix:

$p(\theta) = N(0, \tau^2 I)$

What is the posterior distribution of $\theta$?
Bayesian interpretation of regularization
The posterior distribution of $\theta$:

$p(\theta \mid D) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_i \left( y_i - \theta^T x_i \right)^2 \right) \times \exp\left( -\frac{\theta^T \theta}{2\tau^2} \right)$

This leads to a new objective:

$l_{\text{MAP}}(\theta; D) = -\frac{1}{2\sigma^2} \sum_i \left( y_i - \theta^T x_i \right)^2 - \frac{1}{2\tau^2} \sum_j \theta_j^2 = l(\theta; D) - \lambda \|\theta\|^2 + \text{const}, \quad \lambda = \frac{1}{2\tau^2}$

This is $L_2$-regularized LR! --- a MAP estimation of $\theta$.

How to choose $\lambda$?
Cross-validation!
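Multiplying $l_{\text{MAP}}$ by $-2\sigma^2$ (and dropping constants) gives the ridge objective $\sum_i (y_i - \theta^T x_i)^2 + \frac{\sigma^2}{\tau^2}\|\theta\|^2$, so the MAP estimate is the ridge solution with $\lambda = \sigma^2/\tau^2$. A numerical check on simulated data; all parameter values are assumptions, and here the data points are the rows of $X$ (unlike the slides' column convention):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, tau = 0.5, 2.0
X = rng.normal(size=(30, 3))                  # rows are the inputs x_i
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(0, sigma, 30)

# Ridge solution with lambda = sigma^2 / tau^2 ...
lam = sigma**2 / tau**2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# ... is a stationary point of the log posterior:
# grad = (1/sigma^2) X^T (y - X theta) - theta / tau^2 = 0 at the MAP estimate.
grad_log_post = X.T @ (y - X @ theta_map) / sigma**2 - theta_map / tau**2
```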
