
Machine Learning for Microeconometrics

Part 1: Selection and cross validation

A. Colin Cameron
University of California - Davis

April 2024

Course Outline
1. Variable selection and cross validation
2. Shrinkage methods
- ridge, lasso, elastic net
3. ML for causal inference using lasso
- OLS with many controls, IV with many instruments
4. Other methods for prediction
- nonparametric regression, principal components, splines
- neural networks
- regression trees, random forests, bagging, boosting
5. More ML for causal inference
- ATE with heterogeneous effects and many controls
6. Classification and unsupervised learning
- classification (categorical y) and unsupervised learning (no y).

1. Introduction

These slides consider how to determine how well a model predicts, with a penalty for model complexity
- subsequent slides present various ML methods for prediction.
MUS2: Chapter 28.2 "Machine Learning for Prediction and Inference" in A. Colin Cameron and Pravin K. Trivedi (2022), Microeconometrics using Stata, Second edition.
For more detail on machine learning I use the two books given in the references
- ISL2: Introduction to Statistical Learning, R and Python editions.
- ESL: Elements of Statistical Learning.
While most ML code is in R and Python, these slides use Stata.
- Stata 16 plus some add-ons covers the material in these slides.


Overview

1. Introduction
2. Model Selection using Predictive Ability
   2.1 Generated data example
   2.2 Mean squared error
   2.3 Information criteria and related penalty measures
   2.4 Cross validation overview
   2.5 Single-split cross validation
   2.6 K-fold cross validation
   2.7 Leave-one-out cross validation
   2.8 Stepwise selection and best subsets selection
   2.9 Selection using statistical significance


Terminology

We consider two types of data sets.
1. training data set (or estimation sample)
- used to fit a model.
2. test data set (or hold-out sample or validation set)
- additional data used to determine how good the model fit is
- a test observation $(x_0, y_0)$ is a previously unseen observation.


2.1 Generated Data Example

We consider just OLS
- but the model selection methods carry over to any other machine learner.
These slides use Stata
- most machine learning code is initially done in Python or R.
Generated data: $n = 40$ with three correlated regressors
$$\begin{pmatrix} x_{1i} \\ x_{2i} \\ x_{3i} \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix} \right]$$
But only $x_1$ determines $y$
- $y_i = 2 + x_{1i} + u_i$ where $u_i \sim N(0, 3^2)$.


* Generate three correlated variables (rho = 0.5) and y linear only in x1


clear
quietly set obs 40
set seed 12345
matrix MU = (0,0,0)
scalar rho = 0.5
matrix SIGMA = (1,rho,rho \ rho,1,rho \ rho,rho,1)
drawnorm x1 x2 x3, means(MU) cov(SIGMA)
generate y = 2 + 1*x1 + rnormal(0,3)


. * Summarize data
. summarize

Variable Obs Mean Std. Dev. Min Max

x1 40 .3337951 .8986718 -1.099225 2.754746
x2 40 .1257017 .9422221 -2.081086 2.770161
x3 40 .0712341 1.034616 -1.676141 2.931045
y 40 3.107987 3.400129 -3.542646 10.60979

. correlate
(obs=40)

x1 x2 x3 y

x1 1.0000
x2 0.5077 1.0000
x3 0.4281 0.2786 1.0000
y 0.4740 0.3370 0.2046 1.0000


Fitted OLS regression

As expected, only x1 is statistically significant at 5%
- though due to randomness this is not guaranteed.

. * OLS regression of y on x1-x3


. regress y x1 x2 x3, vce(robust)

Linear regression Number of obs = 40


F(3, 36) = 4.91
Prob > F = 0.0058
R-squared = 0.2373
Root MSE = 3.0907

Robust
y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x1 1.555582 .5006152 3.11 0.004 .5402873 2.570877
x2 .4707111 .5251826 0.90 0.376 -.5944086 1.535831
x3 -.0256025 .6009393 -0.04 0.966 -1.244364 1.193159
_cons 2.531396 .5377607 4.71 0.000 1.440766 3.622025


2.2 Mean squared error


Assume $(y, \mathbf{x})$ have common c.d.f. $F(\cdot)$ in both training and test data.
We wish to predict $y$ given $\mathbf{x} = (x_1, ..., x_p)$
- define $f(\mathbf{x}) = E[y|\mathbf{x}]$ for a single observation.
A training data set $d$ yields prediction rule $\hat{f}(\mathbf{x})$
- e.g. for OLS $\hat{y} = \hat{f}(\mathbf{x}) = \mathbf{x}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ (where $\mathbf{X}$ is $N \times p$).
Using test data we predict $y$ at point $\mathbf{x}_0$ using $\hat{y}_0 = \hat{f}(\mathbf{x}_0)$
- e.g. for OLS $\hat{y}_0 = \mathbf{x}_0(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.
For regression consider squared error loss $(y - \hat{y})^2$.
We wish to estimate the true prediction error
- $E_F[(y_0 - \hat{y}_0)^2]$ for test data set point $(\mathbf{x}_0, y_0) \sim F$.
Some methods adapt to other loss functions
- e.g. absolute error loss and log-likelihood loss
- e.g. the loss function for classification is $1(y \neq \hat{y})$.

Models overfit in sample


We want to estimate the true prediction error
- $E_F[(y_0 - \hat{y}_0)^2]$ for test data set point $(\mathbf{x}_0, y_0) \sim F$.
The obvious criterion is in-sample mean squared error
- MSE $= \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$, where MSE = mean squared error.
Problem: in-sample MSE under-estimates the true prediction error
- intuitively, models "overfit" within sample.
Example: suppose $y = \mathbf{X}\beta + \mathbf{u}$
- then $\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\beta}_{OLS} = (\mathbf{I} - \mathbf{M})\mathbf{u}$ where $\mathbf{M} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$
  - so on average $|\hat{u}_i| < |u_i|$ (OLS residual < true unknown error)
- this is why we use $\hat{\sigma}^2 = s^2 = \frac{1}{n-k}\sum_{i=1}^n (y_i - \hat{y}_i)^2$ and not $\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$.
Two solutions:
- penalize for overfitting, e.g. $\bar{R}^2$, AIC, BIC, $C_p$
- use out-of-estimation-sample prediction (cross-validation).

2.3 Information criteria and related penalty measures

Two standard measures for a general parametric model are
- AIC: Akaike's information criterion
  - AIC $= -2 \ln L + 2k$
- BIC: Bayesian information criterion
  - BIC $= -2 \ln L + (\ln n) k$
Models with smaller AIC and BIC are preferred.
AIC has a small penalty for larger model size
- for nested models it selects the larger model if $2\Delta \ln L > 2\Delta k$
  - whereas a LR test of size $\alpha$ requires $2\Delta \ln L > \chi^2_{\alpha}(\Delta k)$.
BIC has a larger penalty.

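Both criteria are available in Stata after estimation via the postestimation command estat ic; a minimal sketch on the generated data of section 2.1, which should match the manual -2*e(ll) + 2*e(rank) and -2*e(ll) + e(rank)*ln(e(N)) calculations used in the loop below:

. * AIC and BIC after OLS using Stata's estat ic postestimation command
. quietly regress y x1 x2 x3
. estat ic                      // reports ln L, df (= k), AIC and BIC as defined above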

Details: AIC and BIC for OLS


For classical regression assuming i.i.d. normal errors
$$\ln L = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i'\beta)^2$$
Different programs can calculate a different AIC and BIC, using the following variations.
- Constants such as $\frac{n}{2}\ln 2\pi$ are often dropped.
- The sign can be reversed (so AIC and BIC are positive).
Econometricians use $\hat{\beta}$ and $\hat{\sigma}^2 = \text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i'\hat{\beta})^2$
- then AIC $= -2\ln L + 2k = n \ln 2\pi + n \ln \hat{\sigma}^2 + n + 2k$.
Machine learners use $\hat{\beta}$ and $\tilde{\sigma}_p^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \mathbf{x}_i'\tilde{\beta}_p)^2$
- where $\tilde{\beta}_p$ is obtained from OLS in the largest model under consideration, which has $p$ regressors including the intercept.
Also a finite sample correction is
- AICC $= \text{AIC} + 2(K+1)(K+2)/(N-K-2)$.

Details: More measures for OLS


For OLS a standard measure is $\bar{R}^2$ (adjusted $R^2$)
- $\bar{R}^2 = 1 - \dfrac{\frac{1}{n-k}\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2}$ (whereas $R^2 = 1 - \dfrac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$)
- $\bar{R}^2$ has a small penalty for model complexity
  - $\bar{R}^2$ favors the larger nested model if the subset test $F > 1$ (see the algebra sketch below).

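A sketch of the algebra behind the $F > 1$ claim (the RSS notation here is ours, not the slides': $RSS_1$ for the smaller nested model with $k$ regressors, $RSS_2 \leq RSS_1$ for the larger model with $k + q$ regressors):
$$\bar{R}^2_{k+q} > \bar{R}^2_{k} \iff \frac{RSS_2}{n-k-q} < \frac{RSS_1}{n-k} \iff F = \frac{(RSS_1 - RSS_2)/q}{RSS_2/(n-k-q)} > 1.$$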
Some machine learners also use Mallows $C_p$ measure
- $C_p = n\,\text{MSE}/\tilde{\sigma}^2 - n + 2k$
  - where $\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i'\hat{\beta})^2$ and $\tilde{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \tilde{y}_i)^2$
- and some replace $p$ with the "effective degrees of freedom" $p = \frac{1}{\sigma^2}\sum_{i=1}^n \widehat{\text{Cov}}(\hat{\mu}_i, y_i)$.
Note that for linear regression AIC, BIC, AICC and $C_p$ are designed for models with homoskedastic errors
- more generally AIC and BIC are for a correctly specified likelihood function.

Penalty measures for OLS

Measure                                                                                          Penalty for model size
MSE $= \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$                                              None
$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$              None
$\bar{R}^2 = 1 - \frac{\frac{1}{n-k}\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2}$   Too small
AIC $= \text{const} + n \ln \text{MSE} + 2k$                                                     Too small
BIC $= \text{const} + n \ln \text{MSE} + (\ln n) k$                                              Possibly too large


Example of penalty measures


We will consider all 8 possible models based on x1, x2 and x3.

. * Regressor lists for all possible models


. global xlist1

. global xlist2 x1

. global xlist3 x2

. global xlist4 x3

. global xlist5 x1 x2

. global xlist6 x2 x3

. global xlist7 x1 x3

. global xlist8 x1 x2 x3

.
. * Full sample estimates with AIC, BIC, Cp, R2adj penalties
. quietly regress y $xlist8

. scalar s2full = e(rmse)^2 // Needed for Mallows Cp


Example of penalty measures (continued)


Manually compute the various measures. All but MSE favor the model with just x1
- note that MSE decreases as x2 and x3 are added to the model with x1.

. forvalues k = 1/8 {
2. quietly regress y ${xlist`k'}
3. scalar mse`k' = e(rss)/e(N)
4. scalar r2adj`k' = e(r2_a)
5. scalar aic`k' = -2*e(ll) + 2*e(rank)
6. scalar bic`k' = -2*e(ll) + e(rank)*ln(e(N))
7. scalar cp`k' = e(rss)/s2full - e(N) + 2*e(rank)
8. display "Model " "${xlist`k'}" _col(15) " MSE=" %8.5f mse`k' ///
> " R2adj=" %6.3f r2adj`k' " AIC=" %7.2f aic`k' ///
> " BIC=" %7.2f bic`k' " Cp=" %6.3f cp`k'
9. }
Model MSE=11.27186 R2adj= 0.000 AIC= 212.41 BIC= 214.10 Cp= 9.199
Model x1 MSE= 8.73891 R2adj= 0.204 AIC= 204.23 BIC= 207.60 Cp= 0.593
Model x2 MSE= 9.99158 R2adj= 0.090 AIC= 209.58 BIC= 212.96 Cp= 5.838
Model x3 MSE=10.80016 R2adj= 0.017 AIC= 212.70 BIC= 216.08 Cp= 9.224
Model x1 x2 MSE= 8.59796 R2adj= 0.196 AIC= 205.58 BIC= 210.64 Cp= 2.002
Model x2 x3 MSE= 9.84189 R2adj= 0.080 AIC= 210.98 BIC= 216.05 Cp= 7.211
Model x1 x3 MSE= 8.73887 R2adj= 0.183 AIC= 206.23 BIC= 211.29 Cp= 2.592
Model x1 x2 x3 MSE= 8.59740 R2adj= 0.174 AIC= 207.57 BIC= 214.33 Cp= 4.000


2.4 Cross-validation overview

Basic idea: assess models on the basis of out-of-sample fit.
Begin with single-split validation for pedagogical reasons.
Then present K-fold cross-validation
- used extensively in machine learning
- generalizes to loss functions other than MSE, such as $\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$
- though requires more computation than e.g. AIC, BIC.
Then present leave-one-out cross validation
- widely used for local fit in nonparametric regression.
Given a selected model, the final estimation is on the full dataset
- the usual inference ignores this data-mining.


2.5 Single split cross validation

Randomly divide the available data into two parts
1. the model is fit on the training set
2. MSE is computed for predictions in the validation set.
Example: estimate all 8 possible models with x1, x2 and x3
- for each model, estimate on the training set to get the $\hat{\beta}$'s, then predict on the validation set and compute MSE in the validation set
- choose the model with the lowest validation set MSE.
Problems with this single-split validation:
1. Lose precision due to the smaller training set
   - so may actually overestimate the test error rate (MSE) of the model.
2. Results depend a lot on the particular single split.


Single split cross validation example


Use the splitsample command
- training sample (n = 32) and test sample (n = 8)

. * splitsample command for single split validation
. * training data (80% of sample) and test data (20%)
. splitsample, split(1 4) values(0 1) generate(dtrain) rseed(10101)

. tabulate dtrain

dtrain Freq. Percent Cum.

0 8 20.00 20.00
1 32 80.00 100.00

Total 40 100.00

In the following code, note that the Stata predict command gives predictions for all observations
- not just the subsample used in estimation.

Single split validation example (continued)


In-sample (training sample) MSE minimized with x1 , x2 , x3 .
Out-of-sample (test sample) MSE minimized with only x1 .

. * Single split validation - training and test MSE for the 8 possible models
. forvalues k = 1/8 {
2. qui reg y ${xlist`k'} if dtrain==1
3. qui predict y`k'hat
4. qui gen y`k'errorsq = (y`k'hat - y)^2
5. qui sum y`k'errorsq if dtrain == 1
6. scalar mse`k'train = r(mean)
7. qui sum y`k'errorsq if dtrain == 0
8. qui scalar mse`k'test = r(mean)
9. display "Model " "${xlist`k'}" _col(16) ///
> " Training MSE = " %7.3f mse`k'train " Test MSE = " %7.3f mse`k'test
10. }
Model Training MSE = 10.124 Test MSE = 16.280
Model x1 Training MSE = 7.478 Test MSE = 13.871
Model x2 Training MSE = 8.840 Test MSE = 14.803
Model x3 Training MSE = 9.658 Test MSE = 15.565
Model x1 x2 Training MSE = 7.288 Test MSE = 13.973
Model x2 x3 Training MSE = 8.668 Test MSE = 14.674
Model x1 x3 Training MSE = 7.474 Test MSE = 13.892
Model x1 x2 x3 Training MSE = 7.288 Test MSE = 13.980


2.6 K-fold cross validation


K-fold cross-validation
- splits the data into K mutually exclusive folds of roughly equal size
- for j = 1, ..., K: fit using all folds but fold j, and predict on fold j
- standard choices are K = 5 and K = 10.
The following shows the case K = 5:

       Fit on folds   Test on fold
j = 1  2,3,4,5        1
j = 2  1,3,4,5        2
j = 3  1,2,4,5        3
j = 4  1,2,3,5        4
j = 5  1,2,3,4        5

The K-fold CV estimate is
$$CV_K = \frac{1}{K}\sum_{j=1}^K \text{MSE}_{(j)}, \quad \text{where } \text{MSE}_{(j)} \text{ is the MSE for fold } j.$$


splitsample command

Create mutually exclusive subsamples of pre-specified size
- use the splitsample command.
Example: five equal-sized subsamples
- always set the seed
- by default the subsamples are numbered 1, 2, 3, ...

. * Split sample into five equal size parts using splitsample commands
. splitsample, nsplit(5) generate(snum) rseed(10101)

. tabulate snum

snum Freq. Percent Cum.

1 8 20.00 20.00
2 8 20.00 40.00
3 8 20.00 60.00
4 8 20.00 80.00
5 8 20.00 100.00

Total 40 100.00


K-fold cross validation example


MSE in each test fold (of 8 observations) for the model with all three regressors.
. * Five-fold cross validation example for model with all regressors
. splitsample, nsplit(5) generate(foldnum) rseed(10101)

. matrix allmses = J(5,1,.)

. capture drop y*hat y*errorsq

. forvalues i = 1/5 {
2. qui reg y x1 x2 x3 if foldnum != `i'
3. qui predict y`i'hat
4. qui gen y`i'errorsq = (y`i'hat - y)^2
5. qui sum y`i'errorsq if foldnum ==`i'
6. matrix allmses[`i',1] = r(mean)
7. }

. matrix list allmses

allmses[5,1]
c1
r1 13.980321
r2 6.4997357
r3 9.3623792
r4 6.413401
r5 12.23958


Average MSE and standard deviation across five folds

Get the mean and standard deviation of the preceding five MSEs.

. * Compute the average MSE over the five folds and standard deviation
. svmat allmses, names(vallmses)

. qui sum vallmses

. display "CV5 = " %5.3f r(mean) " with st. dev. = " %5.3f r(sd)
CV5 = 9.699 with st. dev. = 3.389


K-fold CV using user-written crossfold command


User-written crossfold command (Daniels 2012)
- gives the square root of MSE (RMSE) in each fold.
Here 5-fold CV for just one model, the full model:
- $CV_5 = (3.739^2 + 2.549^2 + 3.060^2 + 2.532^2 + 3.499^2)/5 = 9.699$.

. * Five-fold cross validation measure for one model with all regressors


. set seed 10101

. crossfold regress y x1 x2 x3, k(5)

RMSE

est1 3.739027
est2 2.549458
est3 3.059801
est4 2.532469
est5 3.498511

. drop _est* // Drop variables created

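As a quick check of the RMSE-to-CV conversion, squaring and averaging the five RMSEs reported by crossfold reproduces the CV5 = 9.699 obtained earlier (a sketch):

. * Convert the per-fold RMSEs from crossfold into the CV5 estimate
. display (3.739027^2 + 2.549458^2 + 3.059801^2 + 2.532469^2 + 3.498511^2)/5   // = 9.699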

K-fold Cross Validation example (continued)


Now do 5-fold CV for all eight models
- the model with only x1 has the lowest CV5.
. * Five-fold cross validation measure for all possible models
. forvalues k = 1/8 {
2. set seed 10101
3. qui crossfold regress y ${xlist`k'}, k(5)
4. matrix RMSEs`k' = r(est)
5. svmat RMSEs`k', names(rmse`k')
6. qui generate mse`k' = rmse`k'^2
7. qui sum mse`k'
8. scalar cv`k' = r(mean)
9. scalar sdcv`k' = r(sd)
10. display "Model " "${xlist`k'}" _col(16) " CV5 = " %7.3f cv`k' ///
> " with st. dev. = " %7.3f sdcv`k'
11. }
Model CV5 = 11.960 with st. dev. = 3.561
Model x1 CV5 = 9.138 with st. dev. = 3.069
Model x2 CV5 = 10.407 with st. dev. = 4.139
Model x3 CV5 = 11.776 with st. dev. = 3.272
Model x1 x2 CV5 = 9.173 with st. dev. = 3.367
Model x2 x3 CV5 = 10.872 with st. dev. = 4.221
Model x1 x3 CV5 = 9.639 with st. dev. = 2.985
Model x1 x2 x3 CV5 = 9.699 with st. dev. = 3.389


One standard error “rule” for K-fold cross-validation

The K folds give K estimates $\text{MSE}_{(1)}, ..., \text{MSE}_{(K)}$
- so we can obtain a standard error of $CV_K$
$$se(CV_K) = \sqrt{\frac{1}{K-1}\sum_{j=1}^K \left(\text{MSE}_{(j)} - CV_K\right)^2}.$$
A further guard against overfitting that is sometimes used (ISL2):
- don't simply choose the model with minimum $CV_K$
- instead choose the smallest model for which CV is within one $se(CV)$ of the minimum CV
- clearly one could instead use e.g. a 0.5 standard error rule.
Example: determining the degree p of a high order polynomial in x
- if $CV_K$ is minimized at p = 7 but is only slightly higher for p = 3, we would favor p = 3.

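A minimal sketch applying this rule to the eight models above, reusing the scalars cv1-cv8 and sdcv1-sdcv8 from the earlier five-fold loop and treating the across-fold r(sd) as se(CV):

* One standard error rule (a sketch): flag models within one se(CV) of the minimum CV5
scalar cvmin = cv2                  // the model with just x1 had the smallest CV5
scalar cutoff = cvmin + sdcv2       // minimum CV5 plus one se(CV)
forvalues k = 1/8 {
    display "Model ${xlist`k'}" _col(16) " CV5 = " %6.3f cv`k' ///
        "  within one se of min = " (cv`k' <= cutoff)
}

The rule then picks the most parsimonious flagged model; with CV estimates as noisy as here (n = 40, K = 5), many models can fall within one se of the minimum.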

2.7 Leave-one-out Cross Validation (LOOCV)


Use a single observation for validation and the remaining $(n-1)$ for training
- $\hat{y}_{(-i)}$ is the prediction of $y_i$ after OLS on observations $1, ..., i-1, i+1, ..., n$
- cycle through all n observations doing this.
Then the LOOCV measure is
$$CV_n = \frac{1}{n}\sum_{i=1}^n \text{MSE}_{(-i)} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_{(-i)})^2$$
Requires n regressions in general
- except for OLS one can show $CV_n = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2$
  - where $\hat{y}_i$ is the fitted value from OLS on the full training sample
  - and $h_{ii}$ is the $i$th diagonal entry of the hat matrix $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.
Used for bandwidth choice in local nonparametric regression
- such as k-nearest neighbors, kernel and local linear regression
- but not used for (global) machine learning (see below).


Leave-one-out cross validation example

User-written command loocv (Barron 2014)
- slow, as it is written for any command, not just OLS.

. * Leave-one-out cross validation


. loocv regress y x1

Leave-One-Out Cross-Validation Results

Method Value

Root Mean Squared Errors 3.0989007
Mean Absolute Errors 2.5242994
Pseudo-R2 .15585569

. display "LOOCV MSE = " r(rmse)^2


LOOCV MSE = 9.6031853

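The algebraic shortcut from the previous slide avoids the n regressions; a minimal sketch using official predict options (the variable names yhat1, h1, looerrsq1 are arbitrary) that should reproduce the LOOCV MSE just reported:

* LOOCV for OLS via leverage: CVn = (1/n) sum of [(y - yhat)/(1 - h)]^2
quietly regress y x1
predict double yhat1, xb            // fitted values from the single full-sample fit
predict double h1, leverage         // h_ii = diagonal entries of the hat matrix
generate double looerrsq1 = ((y - yhat1)/(1 - h1))^2
quietly summarize looerrsq1
display "LOOCV MSE via hat values = " r(mean)   // should equal 9.6031853 above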

Cross validation and AIC

LOOCV is the special case of K-fold CV with K = n
- it has little bias, as all but one observation is used to fit
- but large variance, as the n predictions $\hat{y}_{(-i)}$ are based on very similar samples.
For K-fold CV, K = 5 or K = 10 is found to be a good compromise
- then neither high bias nor high variance.
K-fold usually selects fewer variables than AIC.
K-fold requires few model assumptions and has wide applicability.
- Remember: for replicability set the seed, as this determines the folds.


Cross validation and AIC and BIC


Jun Shao (1997), Statistica Sinica, 221-264
- considers linear regression with i.i.d. $(0, \sigma^2)$ errors
- $n$ observations and $p_n$ potential regressors
- $\alpha \in A_n$ denotes a subset with $p_n(\alpha)$ of the $p_n$ regressors
- $\mu_n = x_n(\alpha)'\beta(\alpha)$ is the conditional mean using $\alpha \in A_n$
- GIC($\lambda_n$) methods minimize $\text{MSE}(\alpha) + \lambda_n \hat{\sigma}_n^2 p_n(\alpha)/n$
  - where $\lambda_n$ is the penalty
  - and $\hat{\sigma}_n^2$ is a consistent estimate of $\sigma^2$.
Case 1: $\lambda_n = 2$; asymptotically this is AIC, LOOCV (delete-1 CV), $C_p$
- consistent model selection if there is no fixed-dimension true model.
Case 2: $\lambda_n \to \infty$; asymptotically this is BIC and delete-d CV with $d/n \to 1$
- consistent model selection when there is a fixed-dimension true model.
Case 3: $\lambda_n > 2$ and fixed; asymptotically this is delete-d CV with $d/n \to \tau \in (0, 1)$ and K-fold CV with K fixed
- a compromise between cases 1 and 2.

2.8 Model selection: Forwards, backwards and best subsets


Forwards selection (or specific to general)
- start with the simplest model (intercept-only) and in turn include the variable that is most statistically significant or most improves fit
- requires up to $p + (p-1) + \cdots + 1 = p(p+1)/2$ regressions, where p is the number of regressors.
Backwards selection (or general to specific)
- start with the most general model and in turn drop the variable that is least statistically significant or least improves fit
- requires up to $p(p+1)/2$ regressions.
Best subsets
- for $k = 1, ..., p$ find the best fitting model with k regressors
- in theory requires $\binom{p}{0} + \binom{p}{1} + \cdots + \binom{p}{p} = 2^p$ regressions (see the check below)
- but the leaps and bounds procedure makes this much quicker
- $p < 40$ is manageable, though recent work suggests p in the thousands.
Hybrid
- forward selection, but after a new model is found drop variables that do not improve fit.
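As a check of the $2^p$ count, with the p = 3 regressors used throughout these slides Stata's built-in comb() function recovers the 8 candidate models considered earlier (a sketch):

* Number of candidate models for best subsets when p = 3
display comb(3,0) + comb(3,1) + comb(3,2) + comb(3,3)   // = 2^3 = 8 models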

Best subsets selection and stepwise selection


User-written vselect command (Lindsey and Sheather 2010)
- this does not use CV but uses AIC, BIC, ...
- best subsets gives the best fitting model (lowest MSE) with one, two and three regressors
- and for each of these best fitting models gives various penalty measures
- all measures favor the model with just x1.

. * Best subset selection with user-written add-on vselect


. vselect y x1 x2 x3, best

Response : y
Selected predictors: x1 x2 x3

Optimal models:

# Preds R2ADJ C AIC AICC BIC


1 .2043123 .5925225 204.2265 204.8932 207.6042
2 .1959877 2.002325 205.5761 206.7189 210.6427
3 .1737073 4 207.5735 209.3382 214.329

predictors for each model:

1 : x1
2 : x1 x2
3 : x1 x2 x3


Example of penalty measures (continued)

vselect also does forward selection and backward selection
- then need to specify whether to use R2adj, AIC, BIC or AICC
- e.g. vselect y x1 x2 x3, forward aic
- e.g. vselect y x1 x2 x3, backward bic
And one can specify that some regressors always be included
- e.g. vselect y x2 x3, fix(x1) best
User-written gvselect command (Lindsey and Sheather 2015) implements best subsets selection for any Stata command that reports ln L
- then the best model of each size has the highest ln L
- and the best model size has the lowest AIC or BIC.


2.9 Selection using Statistical Significance

In the past this was not recommended
- pre-testing changes the distribution of $\hat{\beta}$
- and with fixed size $\alpha$ (such as $\alpha = 0.05$) all potential regressors are likely to be found significant as $n \to \infty$.
But recent papers say it is okay if the critical value increases with the number of potential regressors at an appropriate rate.


Stepwise forward based on p < 0.05


- the Stata stepwise command with option pe(.05)
  - chooses the model with only the intercept and x1.

. * Stepwise forward using statistical significance at five percent


. stepwise, pe(.05): regress y x1 x2 x3
begin with empty model
p = 0.0020 < 0.0500 adding x1

Source SS df MS Number of obs = 40


F(1, 38) = 11.01
Model 101.318018 1 101.318018 Prob > F = 0.0020
Residual 349.556297 38 9.19884993 R-squared = 0.2247
Adj R-squared = 0.2043
Total 450.874315 39 11.5608799 Root MSE = 3.033

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x1 1.793535 .5404224 3.32 0.002 .6995073 2.887563
_cons 2.509313 .5123592 4.90 0.000 1.472097 3.54653


Stepwise backward based on p < 0.05


- the Stata stepwise command with option pr(.05)
  - chooses the model with only the intercept and x1.

. * Stepwise backward using statistical significance at five percent


. stepwise, pr(.05): regress y x1 x2 x3
begin with full model
p = 0.9618 >= 0.0500 removing x3
p = 0.4410 >= 0.0500 removing x2

Source SS df MS Number of obs = 40


F(1, 38) = 11.01
Model 101.318018 1 101.318018 Prob > F = 0.0020
Residual 349.556297 38 9.19884993 R-squared = 0.2247
Adj R-squared = 0.2043
Total 450.874315 39 11.5608799 Root MSE = 3.033

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x1 1.793535 .5404224 3.32 0.002 .6995073 2.887563
_cons 2.509313 .5123592 4.90 0.000 1.472097 3.54653

The hierarchical option allows selection in the order of the specified regressors.

Testing-based Forward Model Selection

Kozbur (2020), "Analysis of testing-based forward model selection", Econometrica, vol. 88(5), 2147-2173.
Assumptions are similar to Belloni et al. (Ecta 2012) for LASSO
- including sparsity (only a fraction $s_0 = o(n)$ of the p potential regressors belong in the model).
Normalize the regressors so that $\frac{1}{n}\sum_{i=1}^n x_{ij}^2 = 1$ for all regressors $j = 1, ..., p$.
Then in the case of independent heteroskedastic errors
- use stepwise forward OLS regression of $y_i$ on the $x_{ik}$'s
- sequentially select regressors, where at any given round
  - choose regressor $j$ if $W_j \geq W_k$ for all regressors $j, k$ not already selected and (definition 1) $W_j \geq$ a complicated critical value.


Testing-based Forward Model Selection (continued)

Simpler procedures are
- Definition 2: $W_j = W_j^{het} = \hat{\theta}/se_{hetrob}(\hat{\theta}) \geq \Phi^{-1}(1 - \alpha/p)$
- Definition 3: $W_j = \Delta_j \text{MSE} \geq \Phi^{-1}(1 - \alpha/p)$, where $\Delta_j \text{MSE}$ is the increase in MSE when $x_j$ is dropped.
$\Phi^{-1}(1 - \alpha/p)$ is like a Bonferroni correction for multiple testing
- e.g. for $\alpha = 0.05$ and $p = 100$, $\Phi^{-1}(1 - 0.05/100) = \Phi^{-1}(0.9995) = 3.29$.

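These critical values are simple to compute; a one-line sketch using Stata's built-in invnormal() function reproduces the value above:

* Bonferroni-style critical value for alpha = 0.05 and p = 100 potential regressors
display invnormal(1 - 0.05/100)   // = 3.29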

One Covariate at a Time Testing Model Selection

Chudik, Kapetanios and Pesaran (2018), "A One Covariate at a Time, Multiple Testing Approach to Variable Selection in High-Dimensional Linear Regression Models", Econometrica, vol. 86(4), 1479-1512.
Proposes individual testing for each potential regressor
- first stage: accept all those regressors j with $|W_j| >$ a threshold
- further stages: see if some variables were missed the first time around by adding additional regressors when $|W_j| >$ a higher threshold
- repeat the previous step if necessary (few such additional steps are needed).
Assumptions include
- the errors are a martingale difference sequence
- $W_j$ is the Wald test statistic based on homoskedastic errors.


One Covariate at a Time Testing Model Selection (continued)

The key threshold for p potential regressors and size α tests is
- $|W_j| > \Phi^{-1}(1 - \alpha/(2p^{\delta}))$
- $\delta = 1$ corresponds to the usual Bonferroni correction for multiple testing.
Simulations use $\alpha = 0.01$, with $\delta = 1$ at the first stage and $\delta = 2$ subsequently
- p = 100: $\Phi^{-1}(1 - 0.01/(2 \cdot 100^{1})) = \Phi^{-1}(0.9998) = 3.54$
- p = 100: $\Phi^{-1}(1 - 0.01/(2 \cdot 100^{2})) = \Phi^{-1}(0.999998) = 4.61$.


References

ISLR2: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2021), An Introduction to Statistical Learning: with Applications in R, 2nd Ed., Springer.
- PDF and $40 softcover book at https://link.springer.com/book/10.1007/978-1-0716-1418-1

ISLP: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani and Jonathan Taylor (2023), An Introduction to Statistical Learning: with Applications in Python, Springer.
- Masters level book. Free PDF from https://www.statlearning.com/ and $40 softcover book via Springer MyCopy.

MUS2: Chapter 28 "Machine Learning for Prediction and Inference" in A. Colin Cameron and Pravin K. Trivedi (2022), Microeconometrics using Stata, Second edition, Stata Press.
