
Cross-Validation

Outline
• Context
• Different Approaches of Cross-Validation
  – Validation Set Approach
  – Leave-One-Out Cross-Validation
  – k-Fold Cross-Validation
• An application
Context
Advertising Data Set
The Advertising data set consists of the sales (in thousands of units) of a particular product in 200 different markets. It also contains the advertising budgets (in thousands of dollars) for the product in each of the markets for three different media: TV, Radio, and Newspaper.
Regression Problem

[Diagram: a linear regression model uses the quantitative predictors TV, Radio, and Newspaper to produce predicted Sales (a quantitative response).]
Possible Models
Model   Predictors
1       TV
2       Radio
3       Newspaper
4       TV and Radio
5       TV and Newspaper
6       Radio and Newspaper
7       TV, Radio and Newspaper
Model Selection
Model   Predictors                 R²     Adjusted R²   RSE
1       TV                         0.61   0.61          3.26
2       Radio                      0.33   0.33          4.28
3       Newspaper                  0.05   0.05          5.09
4       TV & Radio                 0.90   0.90          1.68
5       TV & Newspaper             0.65   0.64          3.12
6       Radio & Newspaper          0.33   0.33          4.28
7       TV, Radio & Newspaper      0.90   0.90          1.69
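The R², adjusted R², and RSE values above come from ordinary least-squares fits. A minimal R sketch for one row of the table, assuming the Advertising.csv file from the ISLR book website has been read into a data frame named Advertising with columns Sales, TV, Radio, and Newspaper (the file name and column capitalization are assumptions):

```r
# Assumed file name and column names; Advertising.csv is distributed on the ISLR website
Advertising <- read.csv("Advertising.csv")

# Model 4: Sales regressed on the TV and Radio budgets
fit4 <- lm(Sales ~ TV + Radio, data = Advertising)
s <- summary(fit4)

s$r.squared       # R-squared (about 0.90 in the table above)
s$adj.r.squared   # adjusted R-squared
s$sigma           # residual standard error (RSE, about 1.68)
```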


Cross-validation
• Cross-validation is an alternative method for assessing the
performance of a model.
• It can also be used for model selection.
Case Data Set
Auto Data Set
• A data frame with 392 observations on 9 variables.
• Our discussion will be focused on the following two variables.
• mpg (miles per gallon): Dependent Variable (𝑌)
• horsepower (Engine horsepower): Independent Variable (𝑋)
• We have to fit a model that predicts mpg using horsepower.
Auto Data Set
Model (1)

                  Coefficient   Std. error   t-statistic   p-value
Intercept         39.936        0.717        55.66         <0.0001
horsepower        −0.158        0.006        −24.49        <0.0001

Multiple R-squared: 0.6059
Adjusted R-squared: 0.6049
Residual standard error: 4.906

Model (2)

                  Coefficient   Std. error   t-statistic   p-value
Intercept         56.900        1.800        31.60         <0.0001
horsepower        −0.466        0.031        −14.98        <0.0001
I(horsepower^2)   0.001         0.000        10.08         <0.0001

Multiple R-squared: 0.688
Adjusted R-squared: 0.686
Residual standard error: 4.374
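The two summaries above are ordinary least-squares fits of mpg on horsepower. A minimal R sketch of how such fits could be produced, assuming the Auto data frame from the ISLR package:

```r
library(ISLR)   # provides the Auto data frame (392 obs.)

# Model (1): mpg as a linear function of horsepower
fit1 <- lm(mpg ~ horsepower, data = Auto)
summary(fit1)   # coefficients, R-squared, residual standard error

# Model (2): add a quadratic term; I() protects the arithmetic inside the formula
fit2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(fit2)
```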
Possible Models
• Consider 10 possible models:
  Model (1): Predictor: horsepower
  Model (2): Predictors: horsepower and horsepower²
  ⋮
  Model (10): Predictors: horsepower, horsepower², …, horsepower¹⁰
Training Error & Test Error
Training Error
• In order to assess the performance of a method, we must compare the predictions with the observed data.
• A common measure of accuracy is the Mean Squared Error (MSE), i.e.,

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{Actual}_i - \mathrm{Predicted}_i\right)^2,$$

where n is the number of observations.
• The MSE above is computed on the training data that was used to fit the model.
• This is referred to as the training MSE.
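As a concrete illustration, the training MSE can be computed directly from a fitted model's fitted values. A minimal R sketch for Model (1) on the Auto data, assuming the ISLR package:

```r
library(ISLR)

fit <- lm(mpg ~ horsepower, data = Auto)        # fit on the full (training) data
train_mse <- mean((Auto$mpg - fitted(fit))^2)   # average squared training error
train_mse
```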
Test Error
• Our method has generally been designed to make the MSE small on the training data; e.g., in linear regression, we obtain the regression line such that the training MSE is minimized.
• What really matters is how well the method works on new data.
• We call this new data “Test Data”.
• There is no guarantee that the method with the smallest training 𝑀𝑆𝐸 will have
the smallest test (i.e. new data) 𝑀𝑆𝐸.
Estimation of Test Error
• We here consider a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the method to those held-out observations.
• This approach is more formally known as cross-validation.
• We now discuss three popular approaches to cross-validation:
  – Validation Set Approach
  – Leave-One-Out Cross-Validation
  – k-Fold Cross-Validation
Validation Set Approach
Validation Set Approach

[Diagram: the observations are randomly split into a Training Data set and a Testing (Validation) Data set.]
Validation Set Approach
• If we have a large data set, we randomly split the data into training and validation (testing) parts.
• We use the training data to fit each possible model, and then choose the model that gives the lowest error rate when applied to the validation data.
• For a quantitative response, the validation (test) error rate is typically assessed using the MSE.
Models for Auto Data
• Consider 10 possible models:
  Model (1): Predictor: horsepower
  Model (2): Predictors: horsepower and horsepower²
  ⋮
  Model (10): Predictors: horsepower, horsepower², …, horsepower¹⁰
Validation Set Approach for Auto Data
1. Randomly split the Auto data set into a training data set (196 obs.) and a validation (testing) data set (196 obs.).
2. Fit all the candidate models (Models (1) to (10)) using the training data set.
3. Use the fitted models to predict mpg for the validation data set.
4. The model with the lowest validation (testing) MSE is the winner! (A sketch of these steps is given below.)
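A minimal R sketch of these four steps, assuming the Auto data frame from the ISLR package and using the polynomial degree as the model index (the seed is an arbitrary choice for reproducibility):

```r
library(ISLR)
set.seed(1)                                  # make the random split reproducible

train <- sample(nrow(Auto), 196)             # indices of the 196 training observations

val_mse <- sapply(1:10, function(d) {
  fit  <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
  pred <- predict(fit, Auto[-train, ])       # predictions for the held-out half
  mean((Auto$mpg[-train] - pred)^2)          # validation MSE for Model (d)
})

which.min(val_mse)                           # model with the lowest validation MSE
```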
Auto Data
• Left: validation error rate for a single split into training and validation data sets.
• Right: the validation method repeated 10 times, with a different random split each time.
Validation Set Approach: Advantages
• Conceptually simple
• Easy to implement
Validation Set Approach: Disadvantages
• The validation MSE can be highly variable (see the right-hand panel of the figure in the previous slide).
• Only a subset of the observations (the training data) is used to fit the model. Statistical methods tend to perform worse when trained on fewer observations.
Leave-One-Out Cross-Validation
Leave-One-Out Cross-Validation (LOOCV)
LOOCV
1. Split the data set of size n into a training data set of size n − 1 and a validation data set of size 1.
2. Fit the model using the training data.
3. Validate the model using the validation data and compute the corresponding MSE.
4. Repeat this process n times to obtain $MSE_1, \ldots, MSE_n$.
5. The MSE for the model is computed as $CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} MSE_i$. (See the sketch below.)
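A minimal R sketch of LOOCV for the ten Auto models, using glm() together with cv.glm() from the boot package (this mirrors the lab in the reading material; leaving K at its default of n gives LOOCV):

```r
library(ISLR)   # Auto data frame
library(boot)   # cv.glm()

loocv_mse <- sapply(1:10, function(d) {
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)   # default family gives an lm-style fit
  cv.glm(Auto, fit)$delta[1]                           # K defaults to n, i.e. LOOCV
})

loocv_mse   # CV_(n) for Models (1) to (10)
```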
LOOCV: Advantages
LOOCV has less bias
• We repeatedly fit the statistical learning method using training data
that contains 𝑛 − 1 obs., i.e. almost all the data set is used.
LOOCV produces a less variable MSE estimate
• The validation set approach produces different MSE values when applied repeatedly due to randomness in the splitting process, while performing LOOCV multiple times will always yield the same results, because there is no randomness in the splits: every observation is left out exactly once.
LOOCV: Disadvantages
• LOOCV is computationally intensive: we fit each model n times!
𝐾-fold Cross Validation
𝐾-fold Cross Validation
𝐾-fold Cross Validation
1. We divide the data set into K different parts (e.g., K = 5 or K = 10).
2. We remove the first part, fit the model on the remaining K − 1 parts, and compute the MSE on the held-out first part.
3. We repeat this K times, taking out a different part each time.
4. The K-fold CV error is given by $CV_{(K)} = \frac{1}{K}\sum_{i=1}^{K} MSE_i$. (See the sketch below.)
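A minimal R sketch of 10-fold CV for the same ten models, again using cv.glm() from the boot package (K = 10 and the seed are illustrative choices):

```r
library(ISLR)
library(boot)

set.seed(17)   # the folds are chosen at random, so fix the seed for reproducibility

kfold_mse <- sapply(1:10, function(d) {
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.glm(Auto, fit, K = 10)$delta[1]   # 10-fold CV estimate of the test MSE
})

which.min(kfold_mse)   # model with the lowest 10-fold CV error
```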
Auto Data: LOOCV & K-fold CV
• LOOCV is a special case of K-fold CV, with K = n.
• Both are stable, but LOOCV is more computationally intensive!
Homework
• Obtain the cross-validation error rates of all 7 models for the Advertising data
set.
Reading Material
• James, G., Witten, D., Hastie, T. & Tibshirani, R. (2021). An Introduction to
Statistical Learning: with Applications in R. New York: Springer-Verlag.
Chapter 2: Sub-section 2.2.1
Chapter 5: Section 5.1, Sub-sections 5.1.1, 5.1.2, 5.1.3, 5.3.1, 5.3.2, 5.3.3.
