
Machine Learning for Microeconometrics

Part 1: Selection and cross validation

A. Colin Cameron
University of California - Davis

April 2024

Course Outline
1. Variable selection and cross validation
2. Shrinkage methods
- ridge, lasso, elastic net
3. ML for causal inference using lasso
- OLS with many controls, IV with many instruments
4. Other methods for prediction
- nonparametric regression, principal components, splines
- neural networks
- regression trees, random forests, bagging, boosting
5. More ML for causal inference
- ATE with heterogeneous effects and many controls
6. Classification and unsupervised learning
- classification (categorical y) and unsupervised learning (no y).

1. Introduction

These slides consider how to determine how well a model predicts, with a penalty for model complexity
- subsequent slides present various ML methods for prediction.
MUS2: Chapter 28.2 "Machine Learning for Prediction and Inference" in A. Colin Cameron and Pravin K. Trivedi (2022), Microeconometrics using Stata, Second edition.
For more detail on machine learning I use the two books given in the references
- ISL2: Introduction to Statistical Learning, R and Python editions.
- ESL: Elements of Statistical Learning.
While most ML code is in R and Python, these slides use Stata.
- Stata 16 plus some add-ons covers the material in these slides.


Overview

1. Introduction
2. Model Selection using Predictive Ability
   2.1 Generated data example
   2.2 Mean squared error
   2.3 Information criteria and related penalty measures
   2.4 Cross validation overview
   2.5 Single-split cross validation
   2.6 K-fold cross validation
   2.7 Leave-one-out cross validation
   2.8 Stepwise selection and best subsets selection
   2.9 Selection using statistical significance


Terminology

We consider two types of data sets.
1. training data set (or estimation sample)
- used to fit a model.
2. test data set (or hold-out sample or validation set)
- additional data used to determine how good the model fit is
- a test observation $(x_0, y_0)$ is a previously unseen observation.


2.1 Generated Data Example

We consider just OLS
- but the model selection methods carry over to any other machine learner.
These slides use Stata
- most machine learning code is initially done in Python or R.
Generated data: $n = 40$ with three correlated regressors
$$\begin{pmatrix} x_{1i} \\ x_{2i} \\ x_{3i} \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix} \right]$$
But only $x_1$ determines $y$
- $y_i = 2 + x_{1i} + u_i$ where $u_i \sim N(0, 3^2)$.


* Generate three correlated variables (rho = 0.5) and y linear only in x1


clear
quietly set obs 40
set seed 12345
matrix MU = (0,0,0)
scalar rho = 0.5
matrix SIGMA = (1,rho,rho \ rho,1,rho \ rho,rho,1)
drawnorm x1 x2 x3, means(MU) cov(SIGMA)
generate y = 2 + 1*x1 + rnormal(0,3)


. * Summarize data
. summarize

Variable Obs Mean Std. Dev. Min Max

x1 40 .3337951 .8986718 -1.099225 2.754746
x2 40 .1257017 .9422221 -2.081086 2.770161
x3 40 .0712341 1.034616 -1.676141 2.931045
y 40 3.107987 3.400129 -3.542646 10.60979

. correlate
(obs=40)

x1 x2 x3 y

x1 1.0000
x2 0.5077 1.0000
x3 0.4281 0.2786 1.0000
y 0.4740 0.3370 0.2046 1.0000


Fitted OLS regression

As expected, only x1 is statistically significant at 5%
- though due to randomness this is not guaranteed.

. * OLS regression of y on x1-x3


. regress y x1 x2 x3, vce(robust)

Linear regression Number of obs = 40


F(3, 36) = 4.91
Prob > F = 0.0058
R-squared = 0.2373
Root MSE = 3.0907

Robust
y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x1 1.555582 .5006152 3.11 0.004 .5402873 2.570877
x2 .4707111 .5251826 0.90 0.376 -.5944086 1.535831
x3 -.0256025 .6009393 -0.04 0.966 -1.244364 1.193159
_cons 2.531396 .5377607 4.71 0.000 1.440766 3.622025


2.2 Mean squared error


Assume $(y, \mathbf{x})$ have common c.d.f. $F(\cdot)$ in both training and test data.
We wish to predict $y$ given $\mathbf{x} = (x_1, ..., x_p)$
- define $f(\mathbf{x}) = E[y|\mathbf{x}]$ for a single observation.
A training data set $d$ yields prediction rule $\hat{f}(\mathbf{x})$
- e.g. for OLS $\hat{y} = \hat{f}(\mathbf{x}) = \mathbf{x}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ (where $\mathbf{X}$ is $N \times p$).
Using test data we predict $y$ at point $\mathbf{x}_0$ using $\hat{y}_0 = \hat{f}(\mathbf{x}_0)$
- e.g. for OLS $\hat{y}_0 = \mathbf{x}_0(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.
For regression consider squared error loss $(y - \hat{y})^2$.
We wish to estimate the true prediction error
- $E_F[(y_0 - \hat{y}_0)^2]$ for test data set point $(\mathbf{x}_0, y_0) \sim F$.
Some methods adapt to other loss functions
- e.g. absolute error loss and log-likelihood loss
- e.g. the loss function for classification is $1(y \neq \hat{y})$.

Models overfit in sample


We want to estimate the true prediction error
- $E_F[(y_0 - \hat{y}_0)^2]$ for test data set point $(\mathbf{x}_0, y_0) \sim F$.
The obvious criterion is in-sample mean squared error
- MSE $= \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$, where MSE = mean squared error.
Problem: in-sample MSE under-estimates the true prediction error
- intuitively, models "overfit" within sample.
Example: suppose $y = \mathbf{X}\beta + \mathbf{u}$
- then $\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\beta}_{OLS} = (\mathbf{I} - \mathbf{M})\mathbf{u}$ where $\mathbf{M} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$
  - so on average $|\hat{u}_i| < |u_i|$ (OLS residual < true unknown error)
- this is why we use $\hat{\sigma}^2 = s^2 = \frac{1}{n-k}\sum_{i=1}^n (y_i - \hat{y}_i)^2$ and not $\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$.
Two solutions:
- penalize for overfitting, e.g. $\bar{R}^2$, AIC, BIC, $C_p$
- use out-of-estimation-sample prediction (cross-validation).

2.3 Information criteria and related penalty measures

Two standard measures for a general parametric model are
- AIC: Akaike's information criterion
  - AIC $= -2 \ln L + 2k$
- BIC: Bayesian information criterion
  - BIC $= -2 \ln L + (\ln n) k$
Models with smaller AIC and BIC are preferred.
AIC has a small penalty for larger model size
- for nested models it selects the larger model if $2\Delta \ln L > 2\Delta k$
  - whereas a LR test of size $\alpha$ requires $2\Delta \ln L > \chi^2_{\alpha}(\Delta k)$.
BIC has a larger penalty.

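Both criteria are available in Stata after estimation via the postestimation command estat ic; a minimal sketch on the generated data of section 2.1, which should match the manual -2*e(ll) + 2*e(rank) and -2*e(ll) + e(rank)*ln(e(N)) calculations used in the loop below:

. * AIC and BIC after OLS using Stata's estat ic postestimation command
. quietly regress y x1 x2 x3
. estat ic                      // reports ln L, df (= k), AIC and BIC as defined above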

Details: AIC and BIC for OLS


For classical regression assuming i.i.d. normal errors
$$\ln L = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i'\beta)^2$$
Different programs can calculate a different AIC and BIC, using the following variations.
- Constants such as $\frac{n}{2}\ln 2\pi$ are often dropped.
- The sign can be reversed (so AIC and BIC are positive).
Econometricians use $\hat{\beta}$ and $\hat{\sigma}^2 = \text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i'\hat{\beta})^2$
- then AIC $= -2\ln L + 2k = n \ln 2\pi + n \ln \hat{\sigma}^2 + n + 2k$.
Machine learners use $\hat{\beta}$ and $\tilde{\sigma}_p^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \mathbf{x}_i'\tilde{\beta}_p)^2$
- where $\tilde{\beta}_p$ is obtained from OLS in the largest model under consideration, which has $p$ regressors including the intercept.
Also a finite sample correction is
- AICC $= \text{AIC} + 2(K+1)(K+2)/(N-K-2)$.

Details: More measures for OLS


For OLS a standard measure is $\bar{R}^2$ (adjusted $R^2$)
- $\bar{R}^2 = 1 - \dfrac{\frac{1}{n-k}\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2}$ (whereas $R^2 = 1 - \dfrac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$)
- $\bar{R}^2$ has a small penalty for model complexity
  - $\bar{R}^2$ favors the larger nested model if the subset test $F > 1$ (see the algebra sketch below).

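A sketch of the algebra behind the $F > 1$ claim (the RSS notation here is ours, not the slides': $RSS_1$ for the smaller nested model with $k$ regressors, $RSS_2 \leq RSS_1$ for the larger model with $k + q$ regressors):
$$\bar{R}^2_{k+q} > \bar{R}^2_{k} \iff \frac{RSS_2}{n-k-q} < \frac{RSS_1}{n-k} \iff F = \frac{(RSS_1 - RSS_2)/q}{RSS_2/(n-k-q)} > 1.$$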
Some machine learners also use Mallows $C_p$ measure
- $C_p = n\,\text{MSE}/\tilde{\sigma}^2 - n + 2k$
  - where $\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i'\hat{\beta})^2$ and $\tilde{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \tilde{y}_i)^2$
- and some replace $p$ with the "effective degrees of freedom" $p = \frac{1}{\sigma^2}\sum_{i=1}^n \widehat{\text{Cov}}(\hat{\mu}_i, y_i)$.
Note that for linear regression AIC, BIC, AICC and $C_p$ are designed for models with homoskedastic errors
- more generally AIC and BIC are for a correctly specified likelihood function.

Penalty measures for OLS

Measure                                                                                          Penalty for model size
MSE $= \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$                                              None
$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$              None
$\bar{R}^2 = 1 - \frac{\frac{1}{n-k}\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2}$   Too small
AIC $= \text{const} + n \ln \text{MSE} + 2k$                                                     Too small
BIC $= \text{const} + n \ln \text{MSE} + (\ln n) k$                                              Possibly too large


Example of penalty measures


We will consider all 8 possible models based on x1, x2 and x3.

. * Regressor lists for all possible models


. global xlist1

. global xlist2 x1

. global xlist3 x2

. global xlist4 x3

. global xlist5 x1 x2

. global xlist6 x2 x3

. global xlist7 x1 x3

. global xlist8 x1 x2 x3

.
. * Full sample estimates with AIC, BIC, Cp, R2adj penalties
. quietly regress y $xlist8

. scalar s2full = e(rmse)^2 // Needed for Mallows Cp


Example of penalty measures (continued)


Manually compute the various measures. All but MSE favor the model with just x1
- note that MSE decreases as x2 and x3 are added to the model with x1.

. forvalues k = 1/8 {
2. quietly regress y ${xlist`k'}
3. scalar mse`k' = e(rss)/e(N)
4. scalar r2adj`k' = e(r2_a)
5. scalar aic`k' = -2*e(ll) + 2*e(rank)
6. scalar bic`k' = -2*e(ll) + e(rank)*ln(e(N))
7. scalar cp`k' = e(rss)/s2full - e(N) + 2*e(rank)
8. display "Model " "${xlist`k'}" _col(15) " MSE=" %8.5f mse`k' ///
> " R2adj=" %6.3f r2adj`k' " AIC=" %7.2f aic`k' ///
> " BIC=" %7.2f bic`k' " Cp=" %6.3f cp`k'
9. }
Model MSE=11.27186 R2adj= 0.000 AIC= 212.41 BIC= 214.10 Cp= 9.199
Model x1 MSE= 8.73891 R2adj= 0.204 AIC= 204.23 BIC= 207.60 Cp= 0.593
Model x2 MSE= 9.99158 R2adj= 0.090 AIC= 209.58 BIC= 212.96 Cp= 5.838
Model x3 MSE=10.80016 R2adj= 0.017 AIC= 212.70 BIC= 216.08 Cp= 9.224
Model x1 x2 MSE= 8.59796 R2adj= 0.196 AIC= 205.58 BIC= 210.64 Cp= 2.002
Model x2 x3 MSE= 9.84189 R2adj= 0.080 AIC= 210.98 BIC= 216.05 Cp= 7.211
Model x1 x3 MSE= 8.73887 R2adj= 0.183 AIC= 206.23 BIC= 211.29 Cp= 2.592
Model x1 x2 x3 MSE= 8.59740 R2adj= 0.174 AIC= 207.57 BIC= 214.33 Cp= 4.000


2.4 Cross-validation overview

Basic idea: assess models on the basis of out-of-sample fit.
Begin with single-split validation for pedagogical reasons.
Then present K-fold cross-validation
- used extensively in machine learning
- generalizes to loss functions other than MSE, such as $\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$
- though requires more computation than e.g. AIC, BIC.
Then present leave-one-out cross validation
- widely used for local fit in nonparametric regression.
Given a selected model, the final estimation is on the full dataset
- the usual inference ignores this data-mining.


2.5 Single split cross validation

Randomly divide the available data into two parts
1. the model is fit on the training set
2. MSE is computed for predictions in the validation set.
Example: estimate all 8 possible models with x1, x2 and x3
- for each model, estimate on the training set to get the $\hat{\beta}$'s, then predict on the validation set and compute MSE in the validation set
- choose the model with the lowest validation set MSE.
Problems with this single-split validation:
1. Lose precision due to the smaller training set
   - so may actually overestimate the test error rate (MSE) of the model.
2. Results depend a lot on the particular single split.


Single split cross validation example


Use the splitsample command
- training sample (n = 32) and test sample (n = 8)

. * splitsample command for single split validation
. * training data (80% of sample) and test data (20%)
. splitsample, split(1 4) values(0 1) generate(dtrain) rseed(10101)

. tabulate dtrain

dtrain Freq. Percent Cum.

0 8 20.00 20.00
1 32 80.00 100.00

Total 40 100.00

In the following code, note that the Stata predict command gives predictions for all observations
- not just the subsample used in estimation.

Single split validation example (continued)


In-sample (training sample) MSE minimized with x1 , x2 , x3 .
Out-of-sample (test sample) MSE minimized with only x1 .

. * Single split validation - training and test MSE for the 8 possible models
. forvalues k = 1/8 {
2. qui reg y ${xlist`k'} if dtrain==1
3. qui predict y`k'hat
4. qui gen y`k'errorsq = (y`k'hat - y)^2
5. qui sum y`k'errorsq if dtrain == 1
6. scalar mse`k'train = r(mean)
7. qui sum y`k'errorsq if dtrain == 0
8. qui scalar mse`k'test = r(mean)
9. display "Model " "${xlist`k'}" _col(16) ///
> " Training MSE = " %7.3f mse`k'train " Test MSE = " %7.3f mse`k'test
10. }
Model Training MSE = 10.124 Test MSE = 16.280
Model x1 Training MSE = 7.478 Test MSE = 13.871
Model x2 Training MSE = 8.840 Test MSE = 14.803
Model x3 Training MSE = 9.658 Test MSE = 15.565
Model x1 x2 Training MSE = 7.288 Test MSE = 13.973
Model x2 x3 Training MSE = 8.668 Test MSE = 14.674
Model x1 x3 Training MSE = 7.474 Test MSE = 13.892
Model x1 x2 x3 Training MSE = 7.288 Test MSE = 13.980


2.6 K-fold cross validation


K-fold cross-validation
- splits the data into K mutually exclusive folds of roughly equal size
- for j = 1, ..., K: fit using all folds but fold j, and predict on fold j
- standard choices are K = 5 and K = 10.
The following shows the case K = 5:

       Fit on folds   Test on fold
j = 1  2,3,4,5        1
j = 2  1,3,4,5        2
j = 3  1,2,4,5        3
j = 4  1,2,3,5        4
j = 5  1,2,3,4        5

The K-fold CV estimate is
$$CV_K = \frac{1}{K}\sum_{j=1}^K \text{MSE}_{(j)}, \quad \text{where } \text{MSE}_{(j)} \text{ is the MSE for fold } j.$$


splitsample command

Create mutually exclusive subsamples of pre-specified size
- use the splitsample command.
Example: five equal-sized subsamples
- always set the seed
- by default the subsamples are numbered 1, 2, 3, ...

. * Split sample into five equal size parts using splitsample commands
. splitsample, nsplit(5) generate(snum) rseed(10101)

. tabulate snum

snum Freq. Percent Cum.

1 8 20.00 20.00
2 8 20.00 40.00
3 8 20.00 60.00
4 8 20.00 80.00
5 8 20.00 100.00

Total 40 100.00


K-fold cross validation example


MSE in each test fold (of 8 observations) for the model with all three regressors.
. * Five-fold cross validation example for model with all regressors
. splitsample, nsplit(5) generate(foldnum) rseed(10101)

. matrix allmses = J(5,1,.)

. capture drop y*hat y*errorsq

. forvalues i = 1/5 {
2. qui reg y x1 x2 x3 if foldnum != `i'
3. qui predict y`i'hat
4. qui gen y`i'errorsq = (y`i'hat - y)^2
5. qui sum y`i'errorsq if foldnum ==`i'
6. matrix allmses[`i',1] = r(mean)
7. }

. matrix list allmses

allmses[5,1]
c1
r1 13.980321
r2 6.4997357
r3 9.3623792
r4 6.413401
r5 12.23958


Average MSE and standard deviation across five folds

Get the mean and standard deviation of the preceding five MSEs.

. * Compute the average MSE over the five folds and standard deviation
. svmat allmses, names(vallmses)

. qui sum vallmses

. display "CV5 = " %5.3f r(mean) " with st. dev. = " %5.3f r(sd)
CV5 = 9.699 with st. dev. = 3.389


K-fold CV using user-written crossfold command


User-written crossfold command (Daniels 2012)
- gives the square root of MSE (RMSE) in each fold.
Here 5-fold CV for just one model, the full model:
- $CV_5 = (3.739^2 + 2.549^2 + 3.060^2 + 2.532^2 + 3.499^2)/5 = 9.699$.

. * Five-fold cross validation measure for one model with all regressors


. set seed 10101

. crossfold regress y x1 x2 x3, k(5)

RMSE

est1 3.739027
est2 2.549458
est3 3.059801
est4 2.532469
est5 3.498511

. drop _est* // Drop variables created

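As a quick check of the RMSE-to-CV conversion, squaring and averaging the five RMSEs reported by crossfold reproduces the CV5 = 9.699 obtained earlier (a sketch):

. * Convert the per-fold RMSEs from crossfold into the CV5 estimate
. display (3.739027^2 + 2.549458^2 + 3.059801^2 + 2.532469^2 + 3.498511^2)/5   // = 9.699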

K-fold Cross Validation example (continued)


Now do 5-fold CV for all eight models
- the model with only x1 has the lowest CV5.
. * Five-fold cross validation measure for all possible models
. forvalues k = 1/8 {
2. set seed 10101
3. qui crossfold regress y ${xlist`k'}, k(5)
4. matrix RMSEs`k' = r(est)
5. svmat RMSEs`k', names(rmse`k')
6. qui generate mse`k' = rmse`k'^2
7. qui sum mse`k'
8. scalar cv`k' = r(mean)
9. scalar sdcv`k' = r(sd)
10. display "Model " "${xlist`k'}" _col(16) " CV5 = " %7.3f cv`k' ///
> " with st. dev. = " %7.3f sdcv`k'
11. }
Model CV5 = 11.960 with st. dev. = 3.561
Model x1 CV5 = 9.138 with st. dev. = 3.069
Model x2 CV5 = 10.407 with st. dev. = 4.139
Model x3 CV5 = 11.776 with st. dev. = 3.272
Model x1 x2 CV5 = 9.173 with st. dev. = 3.367
Model x2 x3 CV5 = 10.872 with st. dev. = 4.221
Model x1 x3 CV5 = 9.639 with st. dev. = 2.985
Model x1 x2 x3 CV5 = 9.699 with st. dev. = 3.389


One standard error “rule” for K-fold cross-validation

The K folds give K estimates $\text{MSE}_{(1)}, ..., \text{MSE}_{(K)}$
- so we can obtain a standard error of $CV_K$
$$se(CV_K) = \sqrt{\frac{1}{K-1}\sum_{j=1}^K \left(\text{MSE}_{(j)} - CV_K\right)^2}.$$
A further guard against overfitting that is sometimes used (ISL2):
- don't simply choose the model with minimum $CV_K$
- instead choose the smallest model for which CV is within one $se(CV)$ of the minimum CV
- clearly one could instead use e.g. a 0.5 standard error rule.
Example: determining the degree p of a high order polynomial in x
- if $CV_K$ is minimized at p = 7 but is only slightly higher for p = 3, we would favor p = 3.

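A minimal sketch applying this rule to the eight models above, reusing the scalars cv1-cv8 and sdcv1-sdcv8 from the earlier five-fold loop and treating the across-fold r(sd) as se(CV):

* One standard error rule (a sketch): flag models within one se(CV) of the minimum CV5
scalar cvmin = cv2                  // the model with just x1 had the smallest CV5
scalar cutoff = cvmin + sdcv2       // minimum CV5 plus one se(CV)
forvalues k = 1/8 {
    display "Model ${xlist`k'}" _col(16) " CV5 = " %6.3f cv`k' ///
        "  within one se of min = " (cv`k' <= cutoff)
}

The rule then picks the most parsimonious flagged model; with CV estimates as noisy as here (n = 40, K = 5), many models can fall within one se of the minimum.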

2.7 Leave-one-out Cross Validation (LOOCV)


Use a single observation for validation and the remaining $(n-1)$ for training
- $\hat{y}_{(-i)}$ is the prediction of $y_i$ after OLS on observations $1, ..., i-1, i+1, ..., n$
- cycle through all n observations doing this.
Then the LOOCV measure is
$$CV_n = \frac{1}{n}\sum_{i=1}^n \text{MSE}_{(-i)} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_{(-i)})^2$$
Requires n regressions in general
- except for OLS one can show $CV_n = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2$
  - where $\hat{y}_i$ is the fitted value from OLS on the full training sample
  - and $h_{ii}$ is the $i$th diagonal entry of the hat matrix $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.
Used for bandwidth choice in local nonparametric regression
- such as k-nearest neighbors, kernel and local linear regression
- but not used for (global) machine learning (see below).


Leave-one-out cross validation example

User-written command loocv (Barron 2014)
- slow, as it is written for any command, not just OLS.

. * Leave-one-out cross validation


. loocv regress y x1

Leave-One-Out Cross-Validation Results

Method Value

Root Mean Squared Errors 3.0989007
Mean Absolute Errors 2.5242994
Pseudo-R2 .15585569

. display "LOOCV MSE = " r(rmse)^2


LOOCV MSE = 9.6031853

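The algebraic shortcut from the previous slide avoids the n regressions; a minimal sketch using official predict options (the variable names yhat1, h1, looerrsq1 are arbitrary) that should reproduce the LOOCV MSE just reported:

* LOOCV for OLS via leverage: CVn = (1/n) sum of [(y - yhat)/(1 - h)]^2
quietly regress y x1
predict double yhat1, xb            // fitted values from the single full-sample fit
predict double h1, leverage         // h_ii = diagonal entries of the hat matrix
generate double looerrsq1 = ((y - yhat1)/(1 - h1))^2
quietly summarize looerrsq1
display "LOOCV MSE via hat values = " r(mean)   // should equal 9.6031853 above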

Cross validation and AIC

LOOCV is the special case of K-fold CV with K = n
- it has little bias, as all but one observation is used to fit
- but large variance, as the n predictions $\hat{y}_{(-i)}$ are based on very similar samples.
For K-fold CV, K = 5 or K = 10 is found to be a good compromise
- then neither high bias nor high variance.
K-fold usually selects fewer variables than AIC.
K-fold requires few model assumptions and has wide applicability.
- Remember: for replicability set the seed, as this determines the folds.


Cross validation and AIC and BIC


Jun Shao (1997), Statistica Sinica, 221-264
- considers linear regression with i.i.d. $(0, \sigma^2)$ errors
- $n$ observations and $p_n$ potential regressors
- $\alpha \in A_n$ denotes a subset with $p_n(\alpha)$ of the $p_n$ regressors
- $\mu_n = x_n(\alpha)'\beta(\alpha)$ is the conditional mean using $\alpha \in A_n$
- GIC($\lambda_n$) methods minimize $\text{MSE}(\alpha) + \lambda_n \hat{\sigma}_n^2 p_n(\alpha)/n$
  - where $\lambda_n$ is the penalty
  - and $\hat{\sigma}_n^2$ is a consistent estimate of $\sigma^2$.
Case 1: $\lambda_n = 2$; asymptotically this is AIC, LOOCV (delete-1 CV), $C_p$
- consistent model selection if there is no fixed-dimension true model.
Case 2: $\lambda_n \to \infty$; asymptotically this is BIC and delete-d CV with $d/n \to 1$
- consistent model selection when there is a fixed-dimension true model.
Case 3: $\lambda_n > 2$ and fixed; asymptotically this is delete-d CV with $d/n \to \tau \in (0, 1)$ and K-fold CV with K fixed
- a compromise between cases 1 and 2.

2.8 Model selection: Forwards, backwards and best subsets


Forwards selection (or specific to general)
- start with the simplest model (intercept-only) and in turn include the variable that is most statistically significant or most improves fit
- requires up to $p + (p-1) + \cdots + 1 = p(p+1)/2$ regressions, where p is the number of regressors.
Backwards selection (or general to specific)
- start with the most general model and in turn drop the variable that is least statistically significant or least improves fit
- requires up to $p(p+1)/2$ regressions.
Best subsets
- for $k = 1, ..., p$ find the best fitting model with k regressors
- in theory requires $\binom{p}{0} + \binom{p}{1} + \cdots + \binom{p}{p} = 2^p$ regressions (see the check below)
- but the leaps and bounds procedure makes this much quicker
- $p < 40$ is manageable, though recent work suggests p in the thousands.
Hybrid
- forward selection, but after a new model is found drop variables that do not improve fit.
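As a check of the $2^p$ count, with the p = 3 regressors used throughout these slides Stata's built-in comb() function recovers the 8 candidate models considered earlier (a sketch):

* Number of candidate models for best subsets when p = 3
display comb(3,0) + comb(3,1) + comb(3,2) + comb(3,3)   // = 2^3 = 8 models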

Best subsets selection and stepwise selection


User-written vselect command (Lindsey and Sheather 2010)
- this does not use CV but uses AIC, BIC, ...
- best subsets gives the best fitting model (lowest MSE) with one, two and three regressors
- and for each of these best fitting models gives various penalty measures
- all measures favor the model with just x1.

. * Best subset selection with user-written add-on vselect


. vselect y x1 x2 x3, best

Response : y
Selected predictors: x1 x2 x3

Optimal models:

# Preds R2ADJ C AIC AICC BIC


1 .2043123 .5925225 204.2265 204.8932 207.6042
2 .1959877 2.002325 205.5761 206.7189 210.6427
3 .1737073 4 207.5735 209.3382 214.329

predictors for each model:

1 : x1
2 : x1 x2
3 : x1 x2 x3


Example of penalty measures (continued)

vselect also does forward selection and backward selection
- then need to specify whether to use R2adj, AIC, BIC or AICC
- e.g. vselect y x1 x2 x3, forward aic
- e.g. vselect y x1 x2 x3, backward bic
And one can specify that some regressors always be included
- e.g. vselect y x2 x3, fix(x1) best
User-written gvselect command (Lindsey and Sheather 2015) implements best subsets selection for any Stata command that reports ln L
- then the best model of each size has the highest ln L
- and the best model size has the lowest AIC or BIC.


2.9 Selection using Statistical Significance

In the past this was not recommended
- pre-testing changes the distribution of $\hat{\beta}$
- and with fixed size $\alpha$ (such as $\alpha = 0.05$) all potential regressors are likely to be found significant as $n \to \infty$.
But recent papers say it is okay if the critical value increases with the number of potential regressors at an appropriate rate.


Stepwise forward based on p < 0.05


- the Stata stepwise command with option pe(.05)
  - chooses the model with only the intercept and x1.

. * Stepwise forward using statistical significance at five percent


. stepwise, pe(.05): regress y x1 x2 x3
begin with empty model
p = 0.0020 < 0.0500 adding x1

Source SS df MS Number of obs = 40


F(1, 38) = 11.01
Model 101.318018 1 101.318018 Prob > F = 0.0020
Residual 349.556297 38 9.19884993 R-squared = 0.2247
Adj R-squared = 0.2043
Total 450.874315 39 11.5608799 Root MSE = 3.033

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x1 1.793535 .5404224 3.32 0.002 .6995073 2.887563
_cons 2.509313 .5123592 4.90 0.000 1.472097 3.54653


Stepwise backward based on p < 0.05


- the Stata stepwise command with option pr(.05)
  - chooses the model with only the intercept and x1.

. * Stepwise backward using statistical significance at five percent


. stepwise, pr(.05): regress y x1 x2 x3
begin with full model
p = 0.9618 >= 0.0500 removing x3
p = 0.4410 >= 0.0500 removing x2

Source SS df MS Number of obs = 40


F(1, 38) = 11.01
Model 101.318018 1 101.318018 Prob > F = 0.0020
Residual 349.556297 38 9.19884993 R-squared = 0.2247
Adj R-squared = 0.2043
Total 450.874315 39 11.5608799 Root MSE = 3.033

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x1 1.793535 .5404224 3.32 0.002 .6995073 2.887563
_cons 2.509313 .5123592 4.90 0.000 1.472097 3.54653

The hierarchical option allows selection in the order of the specified regressors.

Testing-based Forward Model Selection

Kozbur (2020), "Analysis of testing-based forward model selection", Econometrica, vol. 88(5), 2147-2173.
Assumptions are similar to Belloni et al. (Ecta 2012) for LASSO
- including sparsity (only a fraction $s_0 = o(n)$ of the p potential regressors belong in the model).
Normalize the regressors so that $\frac{1}{n}\sum_{i=1}^n x_{ij}^2 = 1$ for all regressors $j = 1, ..., p$.
Then in the case of independent heteroskedastic errors
- use stepwise forward OLS regression of $y_i$ on the $x_{ik}$'s
- sequentially select regressors, where at any given round
  - choose regressor $j$ if $W_j \geq W_k$ for all regressors $j, k$ not already selected and (definition 1) $W_j \geq$ a complicated critical value.


Testing-based Forward Model Selection (continued)

Simpler procedures are
- Definition 2: $W_j = W_j^{het} = \hat{\theta}/se_{hetrob}(\hat{\theta}) \geq \Phi^{-1}(1 - \alpha/p)$
- Definition 3: $W_j = \Delta_j \text{MSE} \geq \Phi^{-1}(1 - \alpha/p)$, where $\Delta_j \text{MSE}$ is the increase in MSE when $x_j$ is dropped.
$\Phi^{-1}(1 - \alpha/p)$ is like a Bonferroni correction for multiple testing
- e.g. for $\alpha = 0.05$ and $p = 100$, $\Phi^{-1}(1 - 0.05/100) = \Phi^{-1}(0.9995) = 3.29$.

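These critical values are simple to compute; a one-line sketch using Stata's built-in invnormal() function reproduces the value above:

* Bonferroni-style critical value for alpha = 0.05 and p = 100 potential regressors
display invnormal(1 - 0.05/100)   // = 3.29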

One Covariate at a Time Testing Model Selection

Chudik, Kapetanios and Pesaran (2018), "A One Covariate at a Time, Multiple Testing Approach to Variable Selection in High-Dimensional Linear Regression Models", Econometrica, vol. 86(4), 1479-1512.
Proposes individual testing for each potential regressor
- first stage: accept all those regressors j with $|W_j| >$ a threshold
- further stages: see if some variables were missed the first time around by adding additional regressors when $|W_j| >$ a higher threshold
- repeat the previous step if necessary (few such additional steps are needed).
Assumptions include
- the errors are a martingale difference sequence
- $W_j$ is the Wald test statistic based on homoskedastic errors.


One Covariate at a Time Testing Model Selection (continued)

The key threshold for p potential regressors and size α tests is
- $|W_j| > \Phi^{-1}(1 - \alpha/(2p^{\delta}))$
- $\delta = 1$ corresponds to the usual Bonferroni correction for multiple testing.
Simulations use $\alpha = 0.01$, with $\delta = 1$ at the first stage and $\delta = 2$ subsequently
- p = 100: $\Phi^{-1}(1 - 0.01/(2 \cdot 100^{1})) = \Phi^{-1}(0.9998) = 3.54$
- p = 100: $\Phi^{-1}(1 - 0.01/(2 \cdot 100^{2})) = \Phi^{-1}(0.999998) = 4.61$.


References

ISLR2: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2021), An Introduction to Statistical Learning: with Applications in R, 2nd Ed., Springer.
- PDF and $40 softcover book at https://link.springer.com/book/10.1007/978-1-0716-1418-1

ISLP: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani and Jonathan Taylor (2023), An Introduction to Statistical Learning: with Applications in Python, Springer.
- Masters level book. Free PDF from https://www.statlearning.com/ and $40 softcover book via Springer MyCopy.

MUS2: Chapter 28 "Machine Learning for Prediction and Inference" in A. Colin Cameron and Pravin K. Trivedi (2022), Microeconometrics using Stata, Second edition, Stata Press.
