This document discusses validation and model selection in machine learning. It explains that validation is used to directly estimate out-of-sample error, while regularization is used to estimate the overfitting penalty. It also discusses splitting data into training and validation sets, analyzing the validation error estimate, and using cross-validation to address the tradeoff between validation set size and model selection bias. Cross-validation involves dividing data into folds and iteratively training and validating on different folds to select models and hyperparameters.


ECS171: Machine Learning

Lecture 13: Validation, Model Selection

Cho-Jui Hsieh
UC Davis

Feb 28, 2018


Validation

Validation versus regularization

We know:

$$E_{\text{out}}(h) = E_{\text{in}}(h) + \underbrace{\text{overfit penalty}}_{\text{regularization estimates this}}$$

Regularization: estimates the overfit penalty.

Validation: directly estimates $E_{\text{out}}(h)$.


Analyzing the estimate

On an out-of-sample point $(x, y)$, the error is $e(h(x), y)$, e.g., $e(h(x), y) = (h(x) - y)^2$.

Out-of-sample (test) error:

$$\mathbb{E}[e(h(x), y)] = E_{\text{out}}(h)$$

Variance:

$$\text{var}[e(h(x), y)] = \sigma^2$$
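As a concrete illustration, here is a minimal sketch of the pointwise squared error; the linear hypothesis, its weights, and the data point are all made up for the example:

```python
import numpy as np

# Hypothetical linear hypothesis h(x) = w^T x; weights and point are made up
w = np.array([0.5, -1.0])
x, y = np.array([2.0, 1.0]), 0.3        # one out-of-sample point (x, y)

e = (w @ x - y) ** 2                    # e(h(x), y) = (h(x) - y)^2
print(e)                                # ≈ 0.09
```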
Validation error

Given a set $(x_1, y_1), \cdots, (x_K, y_K)$ of validation data,

$$E_{\text{val}}(h) = \frac{1}{K} \sum_{k=1}^{K} e(h(x_k), y_k)$$

$$\mathbb{E}[E_{\text{val}}(h)] = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}[e(h(x_k), y_k)] = E_{\text{out}}(h)$$

$$\text{var}[E_{\text{val}}(h)] = \frac{1}{K^2} \sum_{k=1}^{K} \text{var}[e(h(x_k), y_k)] = \frac{\sigma^2}{K}$$

So we (roughly) have

$$E_{\text{val}}(h) = E_{\text{out}}(h) \pm \underbrace{O\!\left(\frac{1}{\sqrt{K}}\right)}_{\text{standard deviation}}$$
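A minimal numerical sketch of these identities (the hypothesis h, the data distribution, and all values here are made up for illustration): averaging pointwise errors over K validation points gives E_val(h), and its spread around E_out(h) shrinks roughly like 1/√K.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # A fixed hypothesis (here simply the true slope; purely illustrative)
    return 2.0 * x

def e_val(K):
    # K validation points from a synthetic distribution y = 2x + noise, noise ~ N(0, 0.5^2)
    x = rng.uniform(-1, 1, size=K)
    y = 2.0 * x + rng.normal(0, 0.5, size=K)
    return np.mean((h(x) - y) ** 2)     # E_val(h) = (1/K) * sum_k e(h(x_k), y_k)

# E_out(h) = 0.25 here (the noise variance); the spread of E_val shrinks ~ 1/sqrt(K)
for K in (10, 100, 1000):
    estimates = [e_val(K) for _ in range(200)]
    print(K, round(np.mean(estimates), 3), round(np.std(estimates), 3))
```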
Validation is taken out of training set

Given the data set $D = (x_1, y_1), \cdots, (x_N, y_N)$, split into:

$D_{\text{val}}$: $K$ points for validation
$D_{\text{train}}$: $N - K$ points for training

Tradeoff in choosing $K$:
Small $K$ ⟹ bad estimate
Large $K$ ⟹ small set for training
Validation

$$D \longrightarrow \underbrace{D_{\text{train}}}_{N-K} \cup \underbrace{D_{\text{val}}}_{K}$$

$D_{\text{train}} \Longrightarrow g^-$

Validation error: $E_{\text{val}} = E_{\text{val}}(g^-)$

Final model: $D \Longrightarrow g$

Rule of Thumb: $K = N/5$
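A minimal sketch of such a split (generic NumPy arrays; the helper name and the 20% fraction following the K = N/5 rule of thumb are illustrative choices):

```python
import numpy as np

def train_val_split(X, y, val_frac=0.2, seed=0):
    """Split D into D_train (N - K points) and D_val (K points), K ≈ N/5."""
    rng = np.random.default_rng(seed)
    N = len(y)
    K = int(round(val_frac * N))
    perm = rng.permutation(N)
    val_idx, train_idx = perm[:K], perm[K:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Usage on synthetic data
X = np.random.randn(100, 3)
y = np.random.randn(100)
X_tr, y_tr, X_val, y_val = train_val_split(X, y)
print(X_tr.shape, X_val.shape)   # (80, 3) (20, 3)
```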


Why “validation”?

$D_{\text{val}}$ is used to make learning choices (e.g., parameter tuning).

Examples: regularization parameter $\lambda$, number of iterations, ...

Validation error ≠ test error, because the validation set affects learning (optimistic bias).
Model selection by validation

$D = D_{\text{train}} \cup D_{\text{val}}$

$M$ models $H_1, \cdots, H_M$

Use $D_{\text{train}}$ to learn $g_m^-$ for each model

Evaluate each $g_m^-$ using $D_{\text{val}}$: $E_m = E_{\text{val}}(g_m^-)$, $m = 1, \cdots, M$

Pick the model $m = m^*$ with the smallest $E_m$
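A minimal sketch of this selection loop. Here the M candidate models are polynomial fits of different degrees; this is an illustrative choice, not the lecture's specific example:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=200)
y = np.sin(3 * X) + rng.normal(0, 0.2, size=200)

# D = D_train ∪ D_val
X_tr, y_tr, X_val, y_val = X[:160], y[:160], X[160:], y[160:]

def fit(degree, X, y):
    # Learn g_m^- on D_train: least-squares polynomial of the given degree
    return np.polyfit(X, y, degree)

def e_val(coef, X, y):
    # E_m = E_val(g_m^-): mean squared error on D_val
    return np.mean((np.polyval(coef, X) - y) ** 2)

models = [1, 2, 3, 5, 8]                      # H_1, ..., H_M (polynomial degrees)
errors = [e_val(fit(d, X_tr, y_tr), X_val, y_val) for d in models]
m_star = models[int(np.argmin(errors))]       # pick the model with smallest E_m
print(m_star, errors)
```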
The bias in validation

We select the model $H_{m^*}$ using $D_{\text{val}}$

$E_{\text{val}}(g_{m^*}^-)$ is a biased estimate of $E_{\text{out}}(g_{m^*}^-)$
How much bias?

For $M$ models $H_1, \cdots, H_M$, assume $D_{\text{val}}$ is used for “training” on the finalist models.

Selecting the best model from

$$H_{\text{val}} = \{g_1^-, g_2^-, \cdots, g_M^-\}$$

Back to Hoeffding and VC:

$$E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\log M}{K}}\right)$$

For continuous-valued choices (e.g., the regularization parameter), $M$ can be replaced by the growth function $m_H(K)$.
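For intuition about the size of this bias term, a tiny sketch that evaluates √(log M / K) for a few values (the constant hidden in the O(·) is ignored, so these are only relative magnitudes):

```python
import numpy as np

# sqrt(log M / K): grows slowly with the number of models M, shrinks with K
for M in (2, 10, 100):
    for K in (20, 100, 500):
        print(f"M={M:3d}  K={K:3d}  bound_term={np.sqrt(np.log(M) / K):.3f}")
```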
Cross Validation

The dilemma about K

The chain of reasoning:

$$E_{\text{out}}(g) \underset{\text{(small } K\text{)}}{\approx} E_{\text{out}}(g^-) \underset{\text{(large } K\text{)}}{\approx} E_{\text{val}}(g^-)$$

Can we have $K$ both small and large?


Leave one out

$N - 1$ points for training, and 1 point for validation!

$$D_n = (x_1, y_1), \cdots, (x_{n-1}, y_{n-1}), \underbrace{(x_n, y_n)}_{\text{held out}}, (x_{n+1}, y_{n+1}), \cdots, (x_N, y_N)$$

Training: $D_n \longrightarrow g_n^-$

$$e_n = E_{\text{val}}(g_n^-) = e(g_n^-(x_n), y_n)$$

Cross-validation error: $E_{\text{CV}} = \frac{1}{N} \sum_{n=1}^{N} e_n$
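A minimal leave-one-out sketch (the base learner, a one-variable least-squares line, and the synthetic data are illustrative; any fit/predict pair could be plugged in):

```python
import numpy as np

def loo_cv_error(X, y, fit, predict):
    """E_CV = (1/N) * sum_n e(g_n^-(x_n), y_n), leaving one point out per round."""
    N = len(y)
    errors = np.empty(N)
    for n in range(N):
        mask = np.arange(N) != n                       # D_n: all points except (x_n, y_n)
        g_n = fit(X[mask], y[mask])                    # train g_n^- on N - 1 points
        errors[n] = (predict(g_n, X[n]) - y[n]) ** 2   # e_n
    return errors.mean()

# Usage with a least-squares line as the learner (illustrative)
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=30)
y = 1.5 * X + rng.normal(0, 0.1, size=30)
fit = lambda X, y: np.polyfit(X, y, 1)
predict = lambda coef, x: np.polyval(coef, x)
print(loo_cv_error(X, y, fit, predict))
```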
Illustration of cross validation
Model selection using CV
Leave more than one out

Divide the dataset into C folds, each with K examples


C training sessions on N − K points each

Usually C = 5 (5-fold CV) or 10 (10-fold CV)
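A minimal sketch of the fold bookkeeping, assuming the data is indexed 0..N−1 (the function name and defaults are illustrative); the full λ-selection procedure appears in the “Classification/regression with cross validation” slide below:

```python
import numpy as np

def make_folds(N, C=5, seed=0):
    """Return a list of C index arrays, each holding roughly N/C validation indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(N)
    return np.array_split(perm, C)

folds = make_folds(N=23, C=5)
print([len(f) for f in folds])   # fold sizes, e.g. [5, 5, 5, 4, 4]
```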


Classification/regression with CV

Given data $D = \{(x_1, y_1), \cdots, (x_N, y_N)\}$

Choose a suitable model:

Ridge regression:
$$\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} (w^T x_i - y_i)^2 + \lambda w^T w$$

Logistic regression:
$$\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} \log(1 + e^{-y_i w^T x_i}) + \lambda w^T w$$

Linear SVM:
$$\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} \max(1 - y_i w^T x_i, 0) + \lambda w^T w$$
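As a sketch, the three objectives can be written directly in NumPy (w, X, y, lam are placeholders; minimizing them would need an optimizer, or for ridge the closed form):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    # (1/N) * sum_i (w^T x_i - y_i)^2 + lambda * w^T w
    return np.mean((X @ w - y) ** 2) + lam * w @ w

def logistic_objective(w, X, y, lam):
    # (1/N) * sum_i log(1 + exp(-y_i * w^T x_i)) + lambda * w^T w, with y_i in {-1, +1}
    return np.mean(np.log1p(np.exp(-y * (X @ w)))) + lam * w @ w

def linear_svm_objective(w, X, y, lam):
    # (1/N) * sum_i max(1 - y_i * w^T x_i, 0) + lambda * w^T w, with y_i in {-1, +1}
    return np.mean(np.maximum(1 - y * (X @ w), 0)) + lam * w @ w
```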
Classification/regression with cross validation

Given data $D = \{(x_1, y_1), \cdots, (x_N, y_N)\}$:

Split the data into $D = D_1 \cup D_2 \cup \cdots \cup D_5$

For each choice of $\lambda$:
  For $c = 1, \cdots, 5$:
    Obtain $g_c^-$ using $D \setminus D_c$
    Compute $e_c = E_{\text{val}}(g_c^-)$
  Set $E_{\text{CV}}(\lambda) = (e_1 + \cdots + e_C)/C$

Choose $\lambda^*$ with the best validation error

Train the model using the full data $D$ and $\lambda^*$
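A minimal end-to-end sketch of this procedure for ridge regression, using its closed-form solution (the λ grid and the synthetic data are illustrative, not from the lecture):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form minimizer of (1/N)||Xw - y||^2 + lam * w^T w
    N, d = X.shape
    return np.linalg.solve(X.T @ X / N + lam * np.eye(d), X.T @ y / N)

def cv_error(X, y, lam, C=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), C)
    errs = []
    for c in range(C):
        val = folds[c]
        tr = np.concatenate([folds[j] for j in range(C) if j != c])
        w = ridge_fit(X[tr], y[tr], lam)                    # g_c^- from D \ D_c
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))    # e_c = E_val(g_c^-)
    return np.mean(errs)                                    # E_CV(lambda)

# Synthetic data and a lambda grid (illustrative)
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.3, size=100)

lambdas = [1e-3, 1e-2, 1e-1, 1.0]
lam_star = min(lambdas, key=lambda lam: cv_error(X, y, lam))
w_final = ridge_fit(X, y, lam_star)        # retrain on the full data D with lambda*
print(lam_star, w_final.round(2))
```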
Conclusions

Next class: Support vector machines, Kernel methods

Questions?
