Lecture 16: Cross-validation
George Lan
Example data (two input features and one target value; "?" marks the targets we want to predict):

  230    1      600
  506    2     1000
  433    2     1100
  109    1      500
  …
  150    1        ?
  270    1.5      ?
Nonlinear regression
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_n x^n + \epsilon$$
Equivalently, with the polynomial feature vector $\tilde{x} = \left(1, x, x^2, \ldots, x^n\right)^{\top}$,
$$y = \theta^{\top} \tilde{x} + \epsilon$$
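A minimal NumPy sketch of this polynomial feature map (illustrative only; the helper name is mine):

```python
import numpy as np

def poly_features(x, n):
    """Polynomial feature map: scalar x -> (1, x, x**2, ..., x**n)."""
    return np.array([x ** k for k in range(n + 1)])

# Example: degree-3 features of x = 2.0 are [1., 2., 4., 8.]
print(poly_features(2.0, 3))
```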
Least mean square method
Given $m$ data points, find $\theta$ that minimizes the mean square error:
$$\hat{\theta} = \arg\min_{\theta} L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^{\top} \tilde{x}^{(i)} \right)^{2}$$
Setting the gradient to zero:
$$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \sum_{i} \left( y^{(i)} - \theta^{\top} \tilde{x}^{(i)} \right) \tilde{x}^{(i)} = 0$$
$$\Leftrightarrow\; -\frac{2}{m} \sum_{i} y^{(i)} \tilde{x}^{(i)} + \frac{2}{m} \sum_{i} \tilde{x}^{(i)} \tilde{x}^{(i)\top} \theta = 0$$
Matrix version of the gradient
Define $\widetilde{X} = \left[ \tilde{x}^{(1)}, \tilde{x}^{(2)}, \ldots, \tilde{x}^{(m)} \right]$ and $y = \left( y^{(1)}, y^{(2)}, \ldots, y^{(m)} \right)^{\top}$; the gradient becomes
$$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \widetilde{X} y + \frac{2}{m} \widetilde{X} \widetilde{X}^{\top} \theta = 0$$
$$\Rightarrow\; \hat{\theta} = \left( \widetilde{X} \widetilde{X}^{\top} \right)^{-1} \widetilde{X} y$$
Note that $\tilde{x} = \left( 1, x, x^{2}, \ldots, x^{n} \right)^{\top}$.
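A minimal NumPy sketch of this closed-form solution (synthetic data; the function name is mine, not from the lecture):

```python
import numpy as np

def fit_poly_least_squares(x, y, n):
    """Solve the normal equations (X X^T) theta = X y for polynomial features."""
    X = np.vander(x, N=n + 1, increasing=True).T   # columns are x_tilde^(i); shape (n+1, m)
    return np.linalg.solve(X @ X.T, X @ y)

# Toy usage: recover a quadratic from noisy samples.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.1 * rng.standard_normal(50)
print(fit_poly_least_squares(x, y, n=2))           # approximately [1, -2, 3]
```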
Increasing the maximal degree
[Figure slides: the same training data fit with polynomials of increasing maximal degree]
Which one is better?
When maximal degree is very large
Define $\widetilde{X} = \left[ \tilde{x}^{(1)}, \tilde{x}^{(2)}, \ldots, \tilde{x}^{(m)} \right]$, $y = \left( y^{(1)}, y^{(2)}, \ldots, y^{(m)} \right)^{\top}$, and set the gradient to zero:
$$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \widetilde{X} y + \frac{2}{m} \widetilde{X} \widetilde{X}^{\top} \theta = 0 \;\Rightarrow\; \widetilde{X} \widetilde{X}^{\top} \theta = \widetilde{X} y$$
When $n > m$, $\widetilde{X} \widetilde{X}^{\top}$ is not invertible; there are multiple solutions $\theta$ which all give zero objective value
$$L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^{\top} \tilde{x}^{(i)} \right)^{2}$$
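A small sketch of this degenerate case (synthetic data; taking the minimum-norm solution via the pseudo-inverse is just one way to pick among the many zero-loss solutions):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 8                                    # fewer points than degree: n > m
x = rng.uniform(-1, 1, size=m)
y = np.sin(3 * x) + 0.05 * rng.standard_normal(m)

X = np.vander(x, N=n + 1, increasing=True).T   # shape (n+1, m)
A = X @ X.T                                    # (n+1) x (n+1), rank at most m < n+1
print(np.linalg.matrix_rank(A))                # m, so A is singular and solve() would fail

# One of the many zero-loss solutions: the minimum-norm least-squares solution.
theta = np.linalg.pinv(X.T) @ y
print(np.allclose(X.T @ theta, y))             # True: the fit interpolates the training data
```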
Overfitting/Underfitting
[Figure: two fits on the same data, labeled Underfitting and Overfitting; blue points are training data, red points are test data]
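A small numerical sketch of the same point (synthetic sine data, my own choice): a degree that is too low gives high training and test error, while a degree high enough to interpolate the training set gives near-zero training error but typically a much larger test error:

```python
import numpy as np

def fit(x, y, n):
    """Least-squares polynomial fit of degree n."""
    X = np.vander(x, N=n + 1, increasing=True).T
    return np.linalg.pinv(X.T) @ y

def mse(theta, x, y):
    return np.mean((np.vander(x, N=len(theta), increasing=True) @ theta - y) ** 2)

rng = np.random.default_rng(2)
def sample(m):
    x = rng.uniform(0, 1, size=m)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(m)

x_tr, y_tr = sample(10)     # small training set
x_te, y_te = sample(200)    # large test set
for n in (1, 3, 9):
    theta = fit(x_tr, y_tr, n)
    print(n, mse(theta, x_tr, y_tr), mse(theta, x_te, y_te))
```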
What is the problem?
Given $m$ data points $D = \left\{ \left( \tilde{x}^{(i)}, y^{(i)} \right) \right\}$, find $\theta$ that minimizes the mean square error:
$$\hat{\theta} = \arg\min_{\theta} \hat{L}(\theta) := \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^{\top} \tilde{x}^{(i)} \right)^{2}$$
But what we actually care about is the expected loss under the data distribution:
$$\theta^{*} = \arg\min_{\theta} L(\theta) := \mathbb{E}_{(x, y) \sim P(x, y)} \left[ \left( y - \theta^{\top} \tilde{x} \right)^{2} \right]$$
Decomposition of expected loss
Estimate your function from a finite data set $D$:
$$\hat{f} = \arg\min_{f} \hat{L}(f) := \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - f\!\left( x^{(i)} \right) \right)^{2}$$
$\hat{f}$ is a random function, generally different for different data sets.
Expected loss of $\hat{f}$:
$$L(\hat{f}) := \mathbb{E}_{D} \, \mathbb{E}_{(x, y)} \left[ \left( y - \hat{f}(x) \right)^{2} \right]$$
Bias-variance decomposition
Our goal is to choose $\hat{f}(x)$ to minimize $L(\hat{f})$. For a fixed predictor $f$, write
$$A := \mathbb{E}_{(x, y)} \left[ \left( y - f(x) \right)^{2} \right] = \int\!\!\int \left( y - f(x) \right)^{2} p(x, y) \, dx \, dy.$$
Using the calculus of variations (see, e.g., Appendix D of Pattern Recognition and Machine Learning),
$$\frac{\partial A}{\partial f(x)} = 2 \int \left( f(x) - y \right) p(x, y) \, dy = 0$$
$$\Leftrightarrow\; \int f(x) \, p(x, y) \, dy = f(x) \, p(x) = \int y \, p(x, y) \, dy$$
$$\Leftrightarrow\; h(x) := f(x) = \int y \, \frac{p(x, y)}{p(x)} \, dy = \int y \, p(y \mid x) \, dy = \mathbb{E}_{y \mid x}[y] = \mathbb{E}[y \mid x]$$
The best predictor is the expected value
The best you can do is $h(x) = \mathbb{E}_{y \mid x}[y]$: the expected value of $y$ given a particular $x$.
[Figure: the regression function $h(x)$; at a point $x_0$, $h(x_0)$ is the mean of the conditional density $p(y \mid x_0)$]
Noise term in the decomposition
$h(x) = \mathbb{E}[y \mid x]$ is the optimal predictor and $\hat{f}(x)$ our actual predictor; decompose the error by adding and subtracting $h(x)$:
$$\mathbb{E}_{D} \, \mathbb{E}_{(x, y)} \left[ \left( y - \hat{f}(x) \right)^{2} \right] = \mathbb{E}_{D} \left[ \int\!\!\int \left( y - h(x) + h(x) - \hat{f}(x) \right)^{2} p(x, y) \, dx \, dy \right]$$
$$= \mathbb{E}_{D} \left[ \int\!\!\int \left( \left( \hat{f}(x) - h(x) \right)^{2} + 2 \left( \hat{f}(x) - h(x) \right)\left( h(x) - y \right) + \left( h(x) - y \right)^{2} \right) p(x, y) \, dx \, dy \right]$$
$$= \mathbb{E}_{D} \left[ \int \left( \hat{f}(x) - h(x) \right)^{2} p(x) \, dx \right] + \int\!\!\int \left( h(x) - y \right)^{2} p(x, y) \, dx \, dy,$$
where the cross term vanishes because $\int \left( h(x) - y \right) p(y \mid x) \, dy = 0$ for every $x$. The first term is then split again by adding and subtracting $\mathbb{E}_{D}[\hat{f}(x)]$:
$$\mathbb{E}_{x} \, \mathbb{E}_{D} \left[ \left( \hat{f}(x) - \mathbb{E}_{D}[\hat{f}(x)] + \mathbb{E}_{D}[\hat{f}(x)] - h(x) \right)^{2} \right]$$
$$= \underbrace{\mathbb{E}_{x} \, \mathbb{E}_{D} \left[ \left( \hat{f}(x) - \mathbb{E}_{D}[\hat{f}(x)] \right)^{2} \right]}_{\text{Variance}} \;+\; \underbrace{\mathbb{E}_{x} \left[ \left( \mathbb{E}_{D}[\hat{f}(x)] - h(x) \right)^{2} \right]}_{\text{Bias}^2},$$
since the cross term again vanishes: $\mathbb{E}_{D}\!\left[ \hat{f}(x) - \mathbb{E}_{D}[\hat{f}(x)] \right] = 0$.
Overall decomposition of expected loss
Putting things together
Expected loss = (bias)² + variance + noise. In formulas:
$$\mathbb{E}_{D} \, \mathbb{E}_{(x, y)} \left[ \left( y - \hat{f}(x) \right)^{2} \right]$$
$$= \mathbb{E}_{x} \left[ \left( \mathbb{E}_{D}[\hat{f}(x)] - h(x) \right)^{2} \right] \qquad (\text{bias}^{2})$$
$$\;+\; \mathbb{E}_{x} \, \mathbb{E}_{D} \left[ \left( \hat{f}(x) - \mathbb{E}_{D}[\hat{f}(x)] \right)^{2} \right] \qquad (\text{variance})$$
$$\;+\; \mathbb{E}_{(x, y)} \left[ \left( h(x) - y \right)^{2} \right] \qquad (\text{noise})$$
Key quantities:
$\hat{f}(x)$: actual predictor
$\mathbb{E}_{D}[\hat{f}(x)]$: expected predictor
$h(x) = \mathbb{E}[y \mid x]$: optimal predictor
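A Monte Carlo sketch of the decomposition (my own synthetic setup: $h(x) = \sin(2\pi x)$ plus Gaussian noise), estimating the three terms by resampling many training sets $D$:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.2                                     # noise std, so the noise term is sigma**2

def h(x):                                       # optimal predictor h(x) = E[y|x]
    return np.sin(2 * np.pi * x)

def sample_dataset(m=25):
    x = rng.uniform(0, 1, size=m)
    return x, h(x) + sigma * rng.standard_normal(m)

def fit(x, y, n):                               # least-squares polynomial fit
    X = np.vander(x, N=n + 1, increasing=True).T
    return np.linalg.pinv(X.T) @ y

x_grid = np.linspace(0, 1, 200)                 # grid approximating E_x[...]
for degree in (1, 3, 9):
    preds = []                                  # f_hat(x_grid), one row per sampled D
    for _ in range(500):
        x, y = sample_dataset()
        preds.append(np.vander(x_grid, N=degree + 1, increasing=True) @ fit(x, y, degree))
    preds = np.array(preds)
    f_bar = preds.mean(axis=0)                  # E_D[f_hat(x)]
    bias2 = np.mean((f_bar - h(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(degree, bias2, variance, sigma**2)    # bias^2 shrinks, variance grows with degree
```

Summing the three printed columns approximates the expected test loss of each degree.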
Model space
Which model space should we choose? The more complex the model, the larger the model space.
[Figure: expected loss, $\text{Bias}^2$, and $\text{Variance}$ as a function of model complexity (e.g., polynomial degree); the optimal model space $H^{*}$ balances the two]
Other things that control model complexity
E.g., in the case of linear models $y = w^{\top} x + b$, one can keep $w$ a controlled parameter:
$$\| w \| \le C$$
E.g., the larger the regularization parameter $\lambda$, the smaller the model family:
$$J(w) = \sum_{i} \left( w^{\top} x_i - y_i \right)^{2} + \lambda \| w \|_2^{2}$$
$$J(w) = \sum_{i} \left( w^{\top} x_i - y_i \right)^{2} + \lambda \| w \|_1$$
Ridge regression
Given $m$ data points, find $\theta$ that minimizes the regularized mean square error:
$$\hat{\theta} = \arg\min_{\theta} L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^{\top} \tilde{x}^{(i)} \right)^{2} + \lambda \| \theta \|^{2}$$
The gradient becomes
$$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \widetilde{X} y + \frac{2}{m} \widetilde{X} \widetilde{X}^{\top} \theta + 2 \lambda \theta = 0$$
$$\Rightarrow\; \hat{\theta} = \left( \widetilde{X} \widetilde{X}^{\top} + \lambda m I \right)^{-1} \widetilde{X} y$$
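A minimal NumPy sketch of this ridge solution (synthetic data; the function name is mine):

```python
import numpy as np

def fit_ridge_poly(x, y, n, lam):
    """Ridge estimate theta = (X X^T + lambda*m*I)^{-1} X y with polynomial features."""
    m = len(x)
    X = np.vander(x, N=n + 1, increasing=True).T
    return np.linalg.solve(X @ X.T + lam * m * np.eye(n + 1), X @ y)

# Toy usage: a high degree, tamed by the penalty.
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(20)
print(np.round(fit_ridge_poly(x, y, n=9, lam=1e-3), 2))
```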
For the polynomial degree with the lowest cross-validation score, we then use all of $D$ to find the parameter values of $f(x)$.
Cross-validation
E.g., we want to select the maximal degree of the polynomial.
5-fold cross-validation: split the data $1, \ldots, m$ into 5 folds; in round $k$, train on the other folds and test on fold $k$ (e.g., hold out fold 1 $\Rightarrow f_1(x) \Rightarrow$ error 1), then average the test errors.
[Figure: the five train/test splits; blank cells are training data, red cells are the held-out test fold]
Example with leave-one-out CV: $\mathrm{MSE}_{\mathrm{LOOCV}}(M_1) = 2.12$, $\mathrm{MSE}_{\mathrm{LOOCV}}(M_2) = 0.962$, so model $M_2$ is preferred.
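A sketch of K-fold CV for choosing the degree (synthetic data and helper names of my own); after selecting the degree with the lowest average validation error, we refit on all of $D$:

```python
import numpy as np

def fit(x, y, n):
    X = np.vander(x, N=n + 1, increasing=True).T
    return np.linalg.pinv(X.T) @ y

def mse(theta, x, y):
    return np.mean((np.vander(x, N=len(theta), increasing=True) @ theta - y) ** 2)

def kfold_cv_error(x, y, n, K=5):
    """Average validation MSE of a degree-n fit over K folds."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(x)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        errs.append(mse(fit(x[tr], y[tr], n), x[val], y[val]))
    return np.mean(errs)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(40)

degrees = range(1, 10)
cv_errs = [kfold_cv_error(x, y, n) for n in degrees]
best_n = degrees[int(np.argmin(cv_errs))]
theta = fit(x, y, best_n)      # refit on all of D with the selected degree
print(best_n, np.round(cv_errs, 3))
```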
Practical issues for K-fold CV
How to decide the value of $K$ (or the held-out fraction $\alpha$)?
Commonly used: $K = 10$ (i.e., $\alpha = 0.1$).
Large $K$ makes CV time-consuming.
Bias-variance trade-off:
Large $K$ usually leads to low bias. In principle, LOOCV provides an almost unbiased estimate of the generalization ability of a classifier, but it can also have high variance.
Small $K$ can reduce variance, but leads to under-use of the data, causing high bias.
One important point is that the test data $D_{\text{test}}$ is never used in CV, because doing so would result in overly (indeed dishonestly) optimistic accuracy estimates during the testing phase (a small sketch follows below).
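The data-hygiene point above, as a tiny sketch (hypothetical split sizes): $D_{\text{test}}$ is carved off first and never enters the CV loop:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(50)

perm = rng.permutation(len(x))
train, test = perm[:40], perm[40:]          # D_train for CV and fitting, D_test held out
x_tr, y_tr = x[train], y[train]
x_te, y_te = x[test], y[test]

# Run K-fold CV on (x_tr, y_tr) only to pick the degree / lambda,
# refit on all of (x_tr, y_tr), and touch (x_te, y_te) exactly once at the end.
```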
Ridge regression
Given $m$ data points, find $\theta$ that minimizes the regularized mean square error:
$$\hat{\theta} = \arg\min_{\theta} L(\theta) = \frac{1}{m} \left[ \sum_{i=1}^{m} \left( y^{(i)} - \theta^{\top} \tilde{x}^{(i)} \right)^{2} + \lambda \| \theta \|^{2} \right]$$
The gradient becomes
$$\frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \widetilde{X} y + \frac{2}{m} \widetilde{X} \widetilde{X}^{\top} \theta + \frac{2 \lambda}{m} \theta = 0$$
$$\Rightarrow\; \hat{\theta} = \left( \widetilde{X} \widetilde{X}^{\top} + \lambda I \right)^{-1} \widetilde{X} y$$
(Here the penalty sits inside the $1/m$ factor, which is why the solution has $\lambda I$ rather than $\lambda m I$.)
Now assume that the vector $\theta$ follows a normal prior with zero mean and a diagonal covariance matrix:
$$p(\theta) = \mathcal{N}(0, \tau^{2} I)$$
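Assuming in addition Gaussian observation noise $y \mid x, \theta \sim \mathcal{N}(\theta^{\top} \tilde{x}, \sigma^{2})$ (an assumption not stated on the slide), the MAP estimate works out to exactly the ridge objective:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \; p(\theta) \prod_{i=1}^{m} p\!\left( y^{(i)} \mid x^{(i)}, \theta \right)
= \arg\min_{\theta} \; \sum_{i=1}^{m} \frac{\left( y^{(i)} - \theta^{\top} \tilde{x}^{(i)} \right)^{2}}{2 \sigma^{2}} + \frac{\| \theta \|^{2}}{2 \tau^{2}}
= \arg\min_{\theta} \; \sum_{i=1}^{m} \left( y^{(i)} - \theta^{\top} \tilde{x}^{(i)} \right)^{2} + \frac{\sigma^{2}}{\tau^{2}} \| \theta \|^{2},$$
i.e., ridge regression with $\lambda$ playing the role of $\sigma^{2} / \tau^{2}$.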
How to choose $\lambda$? Cross-validation!
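A closing sketch (synthetic data, helper names of my own): K-fold CV over a grid of $\lambda$ values, using the closed-form ridge solution from the previous slide:

```python
import numpy as np

def fit_ridge(x, y, n, lam):
    X = np.vander(x, N=n + 1, increasing=True).T
    return np.linalg.solve(X @ X.T + lam * np.eye(n + 1), X @ y)

def mse(theta, x, y):
    return np.mean((np.vander(x, N=len(theta), increasing=True) @ theta - y) ** 2)

def cv_error(x, y, n, lam, K=5):
    folds = np.array_split(np.random.default_rng(0).permutation(len(x)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        errs.append(mse(fit_ridge(x[tr], y[tr], n, lam), x[val], y[val]))
    return np.mean(errs)

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(40)

lambdas = [1e-6, 1e-4, 1e-2, 1.0]
errs = [cv_error(x, y, n=9, lam=lam) for lam in lambdas]
print(lambdas[int(np.argmin(errs))], np.round(errs, 3))
```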