
arXiv:2401.00649v1 [stat.ME] 1 Jan 2024

Peng Ding

Linear Model and Extensions


To students and readers
who are interested in linear models
Contents

Acronyms xiii

Symbols xv

Useful R packages xvii

Preface xix

I Introduction 1
1 Motivations for Statistical Models 3
1.1 Data and statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Why linear models? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Ordinary Least Squares (OLS) with a Univariate Covariate 7


2.1 Univariate ordinary least squares . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Final comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

II OLS and Statistical Inference 11


3 OLS with Multiple Covariates 13
3.1 The OLS formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 The geometry of OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 The projection matrix from OLS . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 The Gauss–Markov Model and Theorem 21


4.1 Gauss–Markov model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Properties of the OLS estimator . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Variance estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Gauss–Markov Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5 Normal Linear Model: Inference and Prediction 29


5.1 Joint distribution of the OLS coefficient and variance estimator . . . . . . 29
5.2 Pivotal quantities and statistical inference . . . . . . . . . . . . . . . . . . 30
5.2.1 Scalar parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2.2 Vector parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 Prediction based on pivotal quantities . . . . . . . . . . . . . . . . . . . . . 34
5.4 Examples and R implementation . . . . . . . . . . . . . . . . . . . . . . . . 35
5.4.1 Univariate regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.4.2 Multivariate regression . . . . . . . . . . . . . . . . . . . . . . . . . . 35


5.5 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6 Asymptotic Inference in OLS: the Eicker–Huber–White (EHW) robust


standard error 41
6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1.1 Numerical examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1.2 Goal of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2 Consistency of OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 Asymptotic Normality of the OLS estimator . . . . . . . . . . . . . . . . . 44
6.4 Eicker–Huber–White standard error . . . . . . . . . . . . . . . . . . . . . . 45
6.4.1 Sandwich variance estimator . . . . . . . . . . . . . . . . . . . . . . 45
6.4.2 Other “HC” standard errors . . . . . . . . . . . . . . . . . . . . . . . 46
6.4.3 Special case with homoskedasticity . . . . . . . . . . . . . . . . . . . 47
6.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.5.1 LaLonde experimental data . . . . . . . . . . . . . . . . . . . . . . . 48
6.5.2 Data from King and Roberts (2015) . . . . . . . . . . . . . . . . . . 49
6.5.3 Boston housing data . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.6 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.7 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

III Interpretation of OLS Based on Partial Regressions 57


7 The Frisch–Waugh–Lovell Theorem 59
7.1 Long and short regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.2 FWL theorem for the regression coefficients . . . . . . . . . . . . . . . . . . 59
7.3 FWL theorem for the standard errors . . . . . . . . . . . . . . . . . . . . . 62
7.4 Gram–Schmidt orthogonalization, QR decomposition, and the computation
of OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.5 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

8 Applications of the Frisch–Waugh–Lovell Theorem 69


8.1 Centering regressors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.2 Partial correlation coefficient and Simpson’s paradox . . . . . . . . . . . . 71
8.3 Hypothesis testing and analysis of variance . . . . . . . . . . . . . . . . . . 74
8.4 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

9 Cochran’s Formula and Omitted-Variable Bias 81


9.1 Cochran’s formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.2 Omitted-variable bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.3 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

IV Model Fitting, Checking, and Misspecification 87


10 Multiple Correlation Coefficient 89
10.1 Equivalent definitions of R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10.2 R2 and the F statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
10.3 Numerical examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
10.4 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

11 Leverage Scores and Leave-One-Out Formulas 95


11.1 Leverage scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
11.2 Leave-one-out formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
11.3 Applications of the leave-one-out formulas . . . . . . . . . . . . . . . . . . 99
11.3.1 Gauss updating formula . . . . . . . . . . . . . . . . . . . . . . . . . 99
11.3.2 Outlier detection based on residuals . . . . . . . . . . . . . . . . . . 100
11.3.3 Jackknife . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
11.4 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

12 Population Ordinary Least Squares and Inference with a Misspecified


Linear Model 107
12.1 Population OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
12.2 Population FWL theorem and Cochran’s formula . . . . . . . . . . . . . . 109
12.3 Population R2 and partial correlation coefficient . . . . . . . . . . . . . . . 110
12.4 Inference for the population OLS . . . . . . . . . . . . . . . . . . . . . . . . 112
12.4.1 Inference with the EHW standard errors . . . . . . . . . . . . . . . . 112
12.5 To model or not to model? . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
12.5.1 Population OLS and the restricted mean model . . . . . . . . . . . . 113
12.5.2 Anscombe’s Quartet: the importance of graphical diagnostics . . . . 114
12.5.3 More on residual plots . . . . . . . . . . . . . . . . . . . . . . . . . . 117
12.6 Conformal prediction based on exchangeability . . . . . . . . . . . . . . . . 118
12.7 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

V Overfitting, Regularization, and Model Selection 127


13 Perils of Overfitting 129
13.1 David Freedman’s simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 129
13.2 Variance inflation factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
13.3 Bias-variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
13.4 Model selection criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
13.4.1 RSS, R2 and adjusted R2 . . . . . . . . . . . . . . . . . . . . . . . . 133
13.4.2 Information criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
13.4.3 Cross-validation (CV) . . . . . . . . . . . . . . . . . . . . . . . . . . 134
13.5 Best subset and forward/backward selection . . . . . . . . . . . . . . . . . 135
13.6 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

14 Ridge Regression 141


14.1 Introduction to ridge regression . . . . . . . . . . . . . . . . . . . . . . . . 141
14.2 Ridge regression via the SVD of X . . . . . . . . . . . . . . . . . . . . . . . 143
14.3 Statistical properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
14.4 Selection of the tuning parameter . . . . . . . . . . . . . . . . . . . . . . . 145
14.4.1 Based on parameter estimation . . . . . . . . . . . . . . . . . . . . . 145
14.4.2 Based on prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
14.5 Computation of ridge regression . . . . . . . . . . . . . . . . . . . . . . . . 147
14.6 Numerical examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
14.6.1 Uncorrelated covariates . . . . . . . . . . . . . . . . . . . . . . . . . 147
14.6.2 Correlated covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
14.7 Further comments on OLS, ridge, and PCA . . . . . . . . . . . . . . 149
14.8 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

15 Lasso 155
15.1 Introduction to the lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
15.2 Comparing the lasso and ridge: a geometric perspective . . . . . . . . . . . 155
15.3 Computing the lasso via coordinate descent . . . . . . . . . . . . . . . . . . 158
15.3.1 The soft-thresholding lemma . . . . . . . . . . . . . . . . . . . . . . 158
15.3.2 Coordinate descent for the lasso . . . . . . . . . . . . . . . . . . . . 158
15.4 Example: comparing OLS, ridge and lasso . . . . . . . . . . . . . . . . . . . 159
15.5 Other shrinkage estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
15.6 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

VI Transformation and Weighting 167


16 Transformations in OLS 169
16.1 Transformation of the outcome . . . . . . . . . . . . . . . . . . . . . . . . . 169
16.1.1 Log transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
16.1.2 Box–Cox transformation . . . . . . . . . . . . . . . . . . . . . . . . . 170
16.2 Transformation of the covariates . . . . . . . . . . . . . . . . . . . . . . . . 172
16.2.1 Polynomial, basis expansion, and generalized additive model . . . . 172
16.2.2 Regression discontinuity and regression kink . . . . . . . . . . . . . . 174
16.3 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

17 Interaction 179
17.1 Two binary covariates interact . . . . . . . . . . . . . . . . . . . . . . . . . 179
17.2 A binary covariate interacts with a general covariate . . . . . . . . . . . . . 180
17.2.1 Treatment effect heterogeneity . . . . . . . . . . . . . . . . . . . . . 180
17.2.2 Johnson–Neyman technique . . . . . . . . . . . . . . . . . . . . . . . 180
17.2.3 Blinder–Oaxaca decomposition . . . . . . . . . . . . . . . . . . . . . 180
17.2.4 Chow test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
17.3 Difficulties of interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
17.3.1 Removable interaction . . . . . . . . . . . . . . . . . . . . . . . . . . 182
17.3.2 Main effect in the presence of interaction . . . . . . . . . . . . . . . 182
17.3.3 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
17.4 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

18 Restricted OLS 187


18.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
18.2 Algebraic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
18.3 Statistical inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
18.4 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
18.5 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

19 Weighted Least Squares 193


19.1 Generalized least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
19.2 Weighted least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
19.3 WLS motivated by heteroskedasticity . . . . . . . . . . . . . . . . . . . . . 196
19.3.1 Feasible generalized least squares . . . . . . . . . . . . . . . . . . . . 196
19.3.2 Aggregated data and ecological regression . . . . . . . . . . . . . . . 197
19.4 WLS with other motivations . . . . . . . . . . . . . . . . . . . . . . . . . . 199
19.4.1 Local linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . 199
19.4.2 Regression with survey data . . . . . . . . . . . . . . . . . . . . . . . 200
19.5 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

VII Generalized Linear Models 207


20 Logistic Regression for Binary Outcomes 209
20.1 Regression with binary outcomes . . . . . . . . . . . . . . . . . . . . . . . . 209
20.1.1 Linear probability model . . . . . . . . . . . . . . . . . . . . . . . . . 209
20.1.2 General link functions . . . . . . . . . . . . . . . . . . . . . . . . . . 209
20.2 Maximum likelihood estimator of the logistic model . . . . . . . . . . . . . 211
20.3 Statistics with the logit model . . . . . . . . . . . . . . . . . . . . . . . . . 214
20.3.1 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
20.3.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
20.4 More on interpretations of the coefficients . . . . . . . . . . . . . . . . . . . 216
20.5 Does the link function matter? . . . . . . . . . . . . . . . . . . . . . . . . . 218
20.6 Extensions of the logistic regression . . . . . . . . . . . . . . . . . . . . . . 219
20.6.1 Penalized logistic regression . . . . . . . . . . . . . . . . . . . . . . . 219
20.6.2 Case-control study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
20.7 Other model formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
20.7.1 Latent linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
20.7.2 Inverse model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
20.8 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

21 Logistic Regressions for Categorical Outcomes 227


21.1 Multinomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
21.2 Multinomial logistic model for nominal outcomes . . . . . . . . . . . . . . . 228
21.2.1 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
21.2.2 MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
21.3 A latent variable representation for the multinomial logistic regression . . . 230
21.4 Proportional odds model for ordinal outcomes . . . . . . . . . . . . . . . . 232
21.5 A case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
21.5.1 Binary logistic for the treatment . . . . . . . . . . . . . . . . . . . . 234
21.5.2 Binary logistic for the outcome . . . . . . . . . . . . . . . . . . . . . 235
21.5.3 Multinomial logistic for the outcome . . . . . . . . . . . . . . . . . . 236
21.5.4 Proportional odds logistic for the outcome . . . . . . . . . . . . . . . 237
21.6 Discrete choice models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
21.6.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
21.6.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
21.6.3 More comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
21.7 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

22 Regression Models for Count Outcomes 245


22.1 Some random variables for counts . . . . . . . . . . . . . . . . . . . . . . . 245
22.1.1 Poisson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
22.1.2 Negative-Binomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
22.1.3 Zero-inflated count distributions . . . . . . . . . . . . . . . . . . . . 247
22.2 Regression models for counts . . . . . . . . . . . . . . . . . . . . . . . . . . 248
22.2.1 Poisson regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
22.2.2 Negative-Binomial regression . . . . . . . . . . . . . . . . . . . . . . 250
22.2.3 Zero-inflated regressions . . . . . . . . . . . . . . . . . . . . . . . . . 251
22.3 A case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
22.3.1 Linear, Poisson, and Negative-Binomial regressions . . . . . . . . . . 252
22.3.2 Zero-inflated regressions . . . . . . . . . . . . . . . . . . . . . . . . . 253
22.4 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

23 Generalized Linear Models: A Unification 259


23.1 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
23.1.1 Exponential family . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
23.1.2 Generalized linear model . . . . . . . . . . . . . . . . . . . . . . . . . 262
23.2 MLE for GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
23.3 Other GLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
23.4 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266

24 From Generalized Linear Models to Restricted Mean Models: the Sand-


wich Covariance Matrix 267
24.1 Restricted mean model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
24.2 Sandwich covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 268
24.3 Applications of the sandwich standard errors . . . . . . . . . . . . . . . . . 270
24.3.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
24.3.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
24.3.2.1 An application . . . . . . . . . . . . . . . . . . . . . . . . . 271
24.3.2.2 A misspecified logistic regression . . . . . . . . . . . . . . . 271
24.3.3 Poisson regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
24.3.3.1 A correctly specified Poisson regression . . . . . . . . . . . 272
24.3.3.2 A Negative-Binomial regression model . . . . . . . . . . . . 272
24.3.3.3 Misspecification of the conditional mean . . . . . . . . . . . 273
24.3.4 Poisson regression for binary outcomes . . . . . . . . . . . . . . . . . 273
24.3.5 How robust are the robust standard errors? . . . . . . . . . . . . . . 274
24.4 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

25 Generalized Estimating Equation for Correlated Multivariate Data 275


25.1 Examples of correlated data . . . . . . . . . . . . . . . . . . . . . . . . . . 275
25.1.1 Longitudinal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
25.1.2 Clustered data: a neuroscience experiment . . . . . . . . . . . . . . . 275
25.1.3 Clustered data: a public health intervention . . . . . . . . . . . . . . 276
25.2 Marginal model and the generalized estimating equation . . . . . . . . . . 277
25.3 Statistical inference with GEE . . . . . . . . . . . . . . . . . . . . . . . . . 279
25.3.1 Computation using the Gauss–Newton method . . . . . . . . . . . . 279
25.3.2 Asymptotic inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
25.3.3 Implementation: choice of the working covariance matrix . . . . . . . 280
25.4 A special case: cluster-robust standard error . . . . . . . . . . . . . . . . . 281
25.4.1 OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
25.4.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
25.5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
25.5.1 Clustered data: a neuroscience experiment . . . . . . . . . . . . . . . 283
25.5.2 Clustered data: a public health intervention . . . . . . . . . . . . . . 284
25.5.3 Longitudinal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
25.6 Critiques on the key assumptions . . . . . . . . . . . . . . . . . . . . . . . 286
25.6.1 Assumption (25.4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
25.6.2 Assumption (25.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
25.6.3 Explanation and prediction . . . . . . . . . . . . . . . . . . . . . . . 288
25.7 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289

VIII Beyond Modeling the Conditional Mean 291



26 Quantile Regression 293


26.1 From the mean to the quantile . . . . . . . . . . . . . . . . . . . . . . . . . 293
26.2 From the conditional mean to conditional quantile . . . . . . . . . . . . . . 296
26.3 Sample regression quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
26.3.1 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
26.3.2 Asymptotic inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
26.4 Numerical examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
26.4.1 Sample quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
26.4.2 OLS versus LAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
26.5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
26.5.1 Parents’ and children’s heights . . . . . . . . . . . . . . . . . . . . . 301
26.5.2 U.S. wage structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
26.6 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
26.7 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305

27 Modeling Time-to-Event Outcomes 307


27.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
27.1.1 Survival analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
27.1.2 Duration analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
27.2 Time-to-event data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
27.3 Kaplan–Meier survival curve . . . . . . . . . . . . . . . . . . . . . . . . . . 312
27.4 Cox model for time-to-event outcome . . . . . . . . . . . . . . . . . . . . . 315
27.4.1 Cox model and its interpretation . . . . . . . . . . . . . . . . . . . . 315
27.4.2 Partial likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
27.4.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
27.4.4 Log-rank test as a score test from Cox model . . . . . . . . . . . . . 320
27.5 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
27.5.1 Stratified Cox model . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
27.5.2 Clustered Cox model . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
27.5.3 Penalized Cox model . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
27.6 Critiques on survival analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 325
27.7 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326

IX Appendices 327
A Linear Algebra 329
A.1 Basics of vectors and matrices . . . . . . . . . . . . . . . . . . . . . . . . . 329
A.2 Vector calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
A.3 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

B Random Variables 341


B.1 Some important univariate random variables . . . . . . . . . . . . . . . . . 341
B.1.1 Normal, χ2 , t and F . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
B.1.2 Beta–Gamma duality . . . . . . . . . . . . . . . . . . . . . . . . . . 342
B.1.3 Exponential, Laplace, and Gumbel distributions . . . . . . . . . . . 343
B.2 Multivariate distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
B.3 Multivariate Normal and its properties . . . . . . . . . . . . . . . . . . . . 347
B.4 Quadratic forms of random vectors . . . . . . . . . . . . . . . . . . . . . . 348
B.5 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350

C Limiting Theorems and Basic Asymptotics 353


C.1 Convergence in probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
C.2 Convergence in distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
C.3 Tools for proving convergence in probability and distribution . . . . . . . . 356

D M-Estimation and MLE 359


D.1 M-estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
D.2 Maximum likelihood estimator . . . . . . . . . . . . . . . . . . . . . . . . . 362
D.3 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364

Bibliography 367
Acronyms

I try hard to avoid using acronyms to reduce the unnecessary burden of reading. The
following are standard and will be used repeatedly.
ANOVA (Fisher’s) analysis of variance
CLT central limit theorem
CV cross-validation
EHW Eicker–Huber–White (robust covariance matrix or standard error)
FWL Frisch–Waugh–Lovell (theorem)
GEE generalized estimating equation
GLM generalized linear model
HC heteroskedasticity-consistent (covariance matrix or standard error)
IID independent and identically distributed
LAD least absolute deviations
lasso least absolute shrinkage and selection operator
MLE maximum likelihood estimate
OLS ordinary least squares
RSS residual sum of squares
WLS weighted least squares

Symbols

All vectors are column vectors as in R unless stated otherwise. Let the superscript “t ” denote
the transpose of a vector or matrix.
∼ᵃ    approximation in distribution
R     the set of all real numbers
β     regression coefficient
ε     error term
H     hat matrix H = X(X t X)−1 X t
hii   leverage score: the (i, i)th element of the hat matrix H
In    identity matrix of dimension n × n
xi    covariate vector for unit i
X     covariate matrix
Y     outcome vector
yi    outcome for unit i
⊥⊥    independence and conditional independence

Useful R packages

This book uses the following R packages and functions.


package     function or data    use
car         hccm                Eicker–Huber–White robust standard error
            linearHypothesis    testing linear hypotheses in linear models
foreign     read.dta            read Stata data
gee         gee                 generalized estimating equation
HistData    GaltonFamilies      Galton's data on parents' and children's heights
MASS        lm.ridge            ridge regression
            glm.nb              Negative-Binomial regression
            polr                proportional odds logistic regression
glmnet      cv.glmnet           lasso with cross-validation
mlbench     BostonHousing       Boston housing data
Matching    lalonde             LaLonde data
nnet        multinom            multinomial logistic regression
quantreg    rq                  quantile regression
survival    coxph               Cox proportional hazards regression
            survdiff            log-rank test
            survfit             Kaplan–Meier curve

Preface

The importance of studying the linear model


A central task in statistics is to use data to build models to make inferences about the
underlying data-generating processes or make predictions of future observations. Although
real problems are very complex, the linear model can often serve as a good approximation
to the true data-generating process. Sometimes, although the true data-generating process
is nonlinear, the linear model can be a useful approximation if we properly transform the
data based on prior knowledge. Even in highly nonlinear problems, the linear model can
still be a useful first attempt in the data analysis process.
Moreover, the linear model has many elegant algebraic and geometric properties. Under
the linear model, we can derive many explicit formulas to gain insights about various aspects
of statistical modeling. In more complicated models, deriving explicit formulas may be
impossible. Nevertheless, we can use the linear model to build intuition and make conjectures
about more complicated models.
Pedagogically, the linear model serves as a building block in the whole statistical train-
ing. This book builds on my lecture notes for a master’s level “Linear Model” course at
UC Berkeley, taught over the past eight years. Most students are master’s students in
statistics. Some are undergraduate students with strong technical preparations. Some are
Ph.D. students in statistics. Some are master’s or Ph.D. students in other departments.
This book requires the readers to have basic training in linear algebra, probability theory,
and statistical inference.

Recommendations for instructors


This book has twenty-seven chapters in the main text and four chapters as the appendices.
As I mentioned before, this book grows out of my teaching of “Linear Model” at UC Berke-
ley. In different years, I taught the course in different ways, and this book is a union of my
lecture notes over the past eight years. Below I make some recommendations for instruc-
tors based on my own teaching experience. Since UC Berkeley is on the semester system,
instructors on the quarter system should make some adjustments to my recommendations
below.

Version 1: a basic linear model course assuming minimal technical preparations


If you want to teach a basic linear model course without assuming strong technical prepara-
tions from the students, you can start with the appendices by reviewing basic linear algebra,
probability theory, and statistical inference. Then you can cover Chapters 2–17. If time per-
mits, you can consider covering Chapter 20 due to the importance of the logistic model for
binary data.

Version 2: an advanced linear model course assuming strong technical preparations


If you want to teach an advanced linear model course assuming strong technical preparations
from the students, you can start with the main text directly. When I did this, I asked


my teaching assistants to review the appendices in the first two lab sessions and assigned
homework problems from the appendices to remind the students to review the background
materials. Then you can cover Chapters 2–24. You can omit Chapter 18 and some sections
in other chapters due to their technical complications. If time permits, you can consider
covering Chapter 25 due to the importance of the generalized estimating equation as well
as its byproduct called the “cluster-robust standard error”, which is important for many
social science applications. Furthermore, you can consider covering Chapter 27 due to the
importance of the Cox proportional hazards model.

Version 3: an advanced generalized linear models course


If you want to teach a course on generalized linear models, you can use Chapters 20–27.

Additional recommendations for readers and students


Readers and students can first read my recommendations for instructors above. In addition,
I have three other recommendations.

More simulation studies


This book contains some basic simulation studies. I encourage the readers to conduct more
intensive simulation studies to deepen their understanding of the theory and methods.

Practical data analysis


Box wrote wisely that “all models are wrong but some are useful.” The usefulness of models
strongly depends on the applications. When teaching “Linear Model”, I sometimes replaced
the final exam with the final project to encourage students to practice data analysis and
make connections between the theory and applications.

Homework problems
This book contains many homework problems. It is important to try some homework prob-
lems. Moreover, some homework problems contain useful theoretical results. Even if you do
not have time to figure out the details for those problems, it is helpful to at least read the
statements of the problems.

Omitted topics
Although “Linear Model” is a standard course offered by most statistics departments, it
is not entirely clear what we should teach as the field of statistics is evolving. Although
I made some suggestions to the instructors above, you may still feel that this book has
omitted some important topics related to the linear model.

Advanced econometric models


After the linear model, many econometric textbooks cover the instrumental variable models
and panel data models. For these more specialized topics, Wooldridge (2010) is a canonical
textbook.

Advanced biostatistics models


This book covers the generalized estimating equation in Chapter 25. For analyzing longitu-
dinal data, linear and generalized linear mixed effects models are powerful tools. Fitzmaurice
et al. (2012) is a canonical textbook on applied longitudinal data analysis. This book also
covers the Cox proportional hazards model in Chapter 27. For more advanced methods for
survival analysis, Kalbfleisch and Prentice (2011) is a canonical textbook.

Causal inference
I do not cover causal inference in this book intentionally. To minimize the overlap of the
materials, I wrote another textbook on causal inference (Ding, 2023). However, I did teach a
version of “Linear Model” with a causal inference unit after introducing the basics of linear
model and logistic model. Students seemed to like it because of the connections between
statistical models and causal inference.

Features of the book


The linear model is an old topic in statistics. There are already many excellent textbooks
on the linear model. This book has the following features.

• This book provides an intermediate-level introduction to the linear model. It balances


rigorous proofs and heuristic arguments.
• This book provides not only theory but also simulation studies and case studies.
• This book provides the R code to replicate all simulation studies and case studies.

• This book covers the theory of the linear model related to not only social sciences but
also biomedical studies.
• This book provides homework problems with different technical difficulties. The solu-
tions to the problems are available to instructors upon request.

Other textbooks may also have one or two of the above features. This book has the above
features simultaneously. I hope that instructors and readers find these features attractive.

Acknowledgments
Many students at UC Berkeley made critical and constructive comments on early versions of
my lecture notes. As teaching assistants for my “Linear Model” course, Sizhu Lu, Chaoran
Yu, and Jason Wu read early versions of my book carefully and helped me to improve the
book a lot.
Professors Hongyuan Cao and Zhichao Jiang taught related courses based on an early
version of the book. They made very valuable suggestions.
I am also very grateful for the suggestions from Nianqiao Ju.
When I was a student, I took a linear model course based on Weisberg (2005). In my
early years of teaching, I used Christensen (2002) and Agresti (2015) as reference books.
I also sat in Professor Jim Powell’s econometrics courses and got access to his wonderful
lecture notes. They all heavily impacted my understanding and formulation of the linear
model.
If you identify any errors, please feel free to email me.
Part I

Introduction
1
Motivations for Statistical Models

1.1 Data and statistical models


A wide range of problems in statistics and machine learning have the data structure as
below:
Unit   outcome/response   covariates/features/predictors
i      Y                  X1    X2    ···    Xp
1      y1                 x11   x12   ···    x1p
2      y2                 x21   x22   ···    x2p
⋮      ⋮                  ⋮     ⋮            ⋮
n      yn                 xn1   xn2   ···    xnp
For each unit i, we observe the outcome/response of interest, yi , as well as p covari-
ates/features/predictors. We often use
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$
to denote the n-dimensional outcome/response vector, and
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

to denote the n × p covariate/feature/predictor matrix, also called the design matrix. In


most cases, the first column of X contains constants 1s.
Based on the data (X, Y ) , we can ask the following questions:
(Q1) Describe the relationship between X and Y , i.e., their association or correlation. For
example, how is the parents’ average height related to the children’s average height?
How is one’s height related to one’s weight? How are one’s education and working
experience related to one’s income?
(Q2) Predict Y ∗ based on new data X ∗ . In particular, we want to use the current data
(X, Y ) to train a predictor, and then use it to predict future Y ∗ based on future X ∗ .
This is called supervised learning in the field of machine learning. For example, how
do we predict whether an email is spam or not based on the frequencies of the most
commonly occurring words and punctuation marks in the email? How do we predict
cancer patients’ survival time based on their clinical measures?


(Q3) Estimate the causal effect of some components in X on Y . What if we change some
components of X? How do we measure the impact of the hypothetical intervention of
some components of X on Y ? This is a much harder question because most statistical
tools are designed to infer association, not causation. For example, the U.S. Food and
Drug Administration (FDA) approves drugs based on randomized controlled trials
(RCT) because RCTs are most credible to infer causal effects of drugs on health
outcomes. Economists are interested in evaluating the effect of a job training program
on employment and wages. However, this is a notoriously difficult problem with only
observational data.
The above descriptions are about generic X and Y , which can be many different types.
We often use different statistical models to capture the features of different types of data.
I give a brief overview of models that will appear in later parts of this book.
(T1) X and Y are univariate and continuous. In Francis Galton’s1 classic example, X is the
parents’ average height and Y is the children’s average height (Galton, 1886). Galton
derived the following formula:
$$y = \bar{y} + \hat\rho\, \frac{\hat\sigma_y}{\hat\sigma_x}(x - \bar{x}),$$
which is equivalent to
$$\frac{y - \bar{y}}{\hat\sigma_y} = \hat\rho\, \frac{x - \bar{x}}{\hat\sigma_x}, \qquad (1.1)$$
where
$$\bar{x} = n^{-1}\sum_{i=1}^n x_i, \qquad \bar{y} = n^{-1}\sum_{i=1}^n y_i$$
are the sample means,
$$\hat\sigma_x^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \bar{x})^2, \qquad \hat\sigma_y^2 = (n-1)^{-1}\sum_{i=1}^n (y_i - \bar{y})^2$$
are the sample variances, and $\hat\rho = \hat\sigma_{xy}/(\hat\sigma_x \hat\sigma_y)$ is the sample Pearson correlation
coefficient with the sample covariance
$$\hat\sigma_{xy} = (n-1)^{-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$$

This is the famous formula of “regression towards mediocrity” or “regression towards
the mean”. Galton first introduced the terminology “regression.” He called it regression
because the relative deviation of the children’s average height is smaller than that
of the parents’ average height if |ρ̂| < 1. We will derive the above Galtonian formula
in Chapter 2. The name “regression” is widely used in statistics now. For instance,
we sometimes use “linear regression” interchangeably with “linear model”; we also
extend the name to “logistic regression” or “Cox regression” which will be discussed
in later chapters of this book.
(T2) Y univariate and continuous, and X multivariate of mixed types. In the R package
ElemStatLearn, the dataset prostate has as the outcome of interest the log of the prostate-
specific antigen lpsa and some potential predictors including the log cancer volume
lcavol, the log prostate weight lweight, age age, etc.

1 Who was Francis Galton? He was Charles Darwin’s nephew and was famous for his pioneer work in

statistics and for devising a method for classifying fingerprints that proved useful in forensic science. He
also invented the term eugenics, a field that causes a lot of controversies nowadays.

(T3) Y binary or indicator of two classes, and X multivariate of mixed types. For example,
in the R package wooldridge, the dataset mroz contains an outcome of interest being the
binary indicator for whether a woman was in the labor force in 1975, and some useful
covariates are

covariate name covariate meaning


kidslt6 number of kids younger than six years old
kidsge6 number of kids between six and eighteen years old
age age
educ years of education
husage husband’s age
huseduc husband’s years of education

(T4) Y categorical without ordering. For example, the choice of housing type, single-family
house, townhouse, or condominium, is a categorical variable.
(T5) Y categorical and ordered. For example, the final course evaluation at UC Berkeley
can take value in {1, 2, 3, 4, 5, 6, 7}. These numbers have clear ordering but they are
not the usual real numbers.
(T6) Y counts. For example, the number of times one went to the gym last week is a
non-negative integer representing counts.

(T7) Y time-to-event outcome. For example, in medical trials, a major outcome of interest
is the survival time; in labor economics, a major outcome of interest is the time to
find the next job. The former is called survival analysis in biostatistics and the latter
is called duration analysis in econometrics.
(T8) Y multivariate and correlated. In medical trials, the data are often longitudinal, mean-
ing that the patient’s outcomes are measured repeatedly over time. So each patient
has a multivariate outcome. In field experiments of public health and development
economics, the randomized interventions are often at the village level but the out-
come data are collected at the household level. So within villages, the outcomes are
correlated.

1.2 Why linear models?


Why do we study linear models if most real problems may have nonlinear structures? There
are important reasons.

(R1) Linear models are simple but non-trivial starting points for learning.
(R2) Linear models can provide insights because we can derive explicit formulas based on
elegant algebra and geometry.
(R3) Linear models can handle nonlinearity by incorporating nonlinear terms, for example,
X can contain the polynomials or nonlinear transformations of the original covariates.
In statistics, “linear” often means linear in parameters, not necessarily in covariates.

(R4) Linear models can be good approximations to nonlinear data-generating processes.



(R5) Linear models are simpler than nonlinear models, but they do not necessarily perform
worse than more complicated nonlinear models. We have finite data so we cannot fit
arbitrarily complicated models.
If you are interested in nonlinear models, you can take another machine learning course.
2
Ordinary Least Squares (OLS) with a Univariate
Covariate

2.1 Univariate ordinary least squares


Figure 2.1 shows the scatterplot of Galton’s dataset which can be found in the R package
HistData as GaltonFamilies. In this dataset, father denotes the height of the father and mother
denotes the height of the mother. The x-axis denotes the mid-parent height, calculated as
(father + 1.08*mother)/2, and the y-axis denotes the height of a child.

[Scatterplot titled “Galton’s regression”: childHeight (y-axis) plotted against midparentHeight (x-axis), with the fitted line y = 22.64 + 0.64x.]

FIGURE 2.1: Galton’s dataset

With n data points $(x_i, y_i)_{i=1}^n$, our goal is to find the best linear fit of the data
$$(x_i,\; \hat{y}_i = \hat\alpha + \hat\beta x_i)_{i=1}^n.$$

What do we mean by the “best” fit? Gauss proposed to use the following criterion, called


the ordinary least squares (OLS) 1 :


$$(\hat\alpha, \hat\beta) = \arg\min_{a,b}\; n^{-1}\sum_{i=1}^n (y_i - a - b x_i)^2.$$

The OLS criterion is based on the squared “misfits” yi − a − bxi . Another intuitive
criterion is based on the absolute values of those misfits, which is called the least absolute
deviation (LAD). However, OLS is simpler because the objective function is smooth in (a, b).
We will discuss LAD in Chapter 26.
How to solve the OLS minimization problem? The objective function is quadratic, and
as a and b diverge, it diverges to infinity. So it must have a unique minimizer (α̂, β̂) which
satisfies the first-order condition:
$$\begin{cases} -\dfrac{2}{n}\displaystyle\sum_{i=1}^n (y_i - \hat\alpha - \hat\beta x_i) = 0, \\[2ex] -\dfrac{2}{n}\displaystyle\sum_{i=1}^n x_i (y_i - \hat\alpha - \hat\beta x_i) = 0. \end{cases}$$

These two equations are called the Normal Equations of OLS. The first equation implies

ȳ = α̂ + β̂ x̄, (2.1)

that is, the OLS line must go through the sample mean of the data (x̄, ȳ). The second
equation implies
$$\overline{xy} = \hat\alpha \bar{x} + \hat\beta\, \overline{x^2}, \qquad (2.2)$$
where $\overline{xy}$ is the sample mean of the $x_i y_i$'s, and $\overline{x^2}$ is the sample mean of the $x_i^2$'s. Subtracting
(2.1) $\times\, \bar{x}$ from (2.2), we have
$$\overline{xy} - \bar{x}\bar{y} = \hat\beta(\overline{x^2} - \bar{x}^2)
\;\Longrightarrow\; \hat\sigma_{xy} = \hat\beta \hat\sigma_x^2
\;\Longrightarrow\; \hat\beta = \frac{\hat\sigma_{xy}}{\hat\sigma_x^2}.$$

So the OLS coefficient of x equals the sample covariance between x and y divided by the
sample variance of x. From (2.1), we obtain that

α̂ = ȳ − β̂ x̄.

Finally, the fitted line is

$$y = \hat\alpha + \hat\beta x = \bar{y} - \hat\beta\bar{x} + \hat\beta x
\;\Longrightarrow\; y - \bar{y} = \hat\beta(x - \bar{x})
\;\Longrightarrow\; y - \bar{y} = \frac{\hat\sigma_{xy}}{\hat\sigma_x^2}(x - \bar{x}) = \frac{\hat\rho_{xy}\hat\sigma_x\hat\sigma_y}{\hat\sigma_x^2}(x - \bar{x})
\;\Longrightarrow\; \frac{y - \bar{y}}{\hat\sigma_y} = \hat\rho_{xy}\, \frac{x - \bar{x}}{\hat\sigma_x},$$

which is the Galtonian formula mentioned in Chapter 1.


We can obtain the fitted line based on Galton’s data using the R code below.
1 The idea of OLS is often attributed to Gauss and Legendre. Gauss used it in the process of discovering

Ceres, and his work was published in 1809. Legendre’s work appeared in 1805 but Gauss claimed that he
had been using it since 1794 or 1795. Stigler (1981) reviews the history of OLS.

> library("HistData")
> xx = GaltonFamilies$midparentHeight
> yy = GaltonFamilies$childHeight
>
> center_x = mean(xx)
> center_y = mean(yy)
> sd_x = sd(xx)
> sd_y = sd(yy)
> rho_xy = cor(xx, yy)
>
> beta_fit = rho_xy * sd_y / sd_x
> alpha_fit = center_y - beta_fit * center_x
> alpha_fit
[1] 22.63624
> beta_fit
[1] 0.6373609

This generates Figure 2.1.
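As a quick cross-check (a sketch added here, not part of the original code; it assumes the xx and yy vectors defined above), R's built-in lm() function reproduces the same intercept and slope:

# illustrative cross-check: lm() gives approximately 22.636 and 0.637 as above
galton_fit = lm(yy ~ xx)
coef(galton_fit)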

2.2 Final comments


We can write the sample mean as the solution to the OLS with only the intercept:
$$\bar{y} = \arg\min_{\mu}\; n^{-1}\sum_{i=1}^n (y_i - \mu)^2.$$
It is rare to fit OLS of $y_i$ on $x_i$ without the intercept:
$$\hat\beta = \arg\min_{b}\; n^{-1}\sum_{i=1}^n (y_i - b x_i)^2,$$
which equals
$$\hat\beta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{\langle x, y\rangle}{\langle x, x\rangle},$$
where x and y are the n-dimensional vectors containing all observations, and $\langle x, y\rangle = \sum_{i=1}^n x_i y_i$ denotes the inner product. Although not directly useful, this formula will be the building block for many discussions later.
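A minimal numerical sketch of the no-intercept formula (my addition; it reuses the xx and yy vectors from the Galton example in Section 2.1):

# no-intercept OLS slope: inner-product formula versus lm() with "- 1"
beta_noint = sum(xx * yy) / sum(xx * xx)
beta_noint
coef(lm(yy ~ xx - 1))  # agrees with beta_noint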

2.3 Homework problems


2.1 Pairwise slopes
Given $(x_i, y_i)_{i=1}^n$ with univariate $x_i$ and $y_i$, show that Galton's slope equals
$$\hat\beta = \sum_{(i,j)} w_{ij} b_{ij},$$
where the summation is over all pairs of observations $(i, j)$,
$$b_{ij} = (y_i - y_j)/(x_i - x_j)$$
is the slope determined by two points $(x_i, y_i)$ and $(x_j, y_j)$, and
$$w_{ij} = (x_i - x_j)^2 \Big/ \sum_{(i',j')} (x_{i'} - x_{j'})^2$$
is the weight proportional to the squared distance between $x_i$ and $x_j$. In the above formulas,
we define $b_{ij} = 0$ if $x_i = x_j$.
Remark: Wu (1986) and Gelman and Park (2009) used this formula. Problem 3.9 gives
a more general result.
Part II

OLS and Statistical Inference


3
OLS with Multiple Covariates

3.1 The OLS formula


Recall that we have the outcome vector
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$
and covariate matrix
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}
= \begin{pmatrix} x_1^t \\ x_2^t \\ \vdots \\ x_n^t \end{pmatrix}
= (X_1, \ldots, X_p)$$

where xti = (xi1 , . . . , xip ) is the row vector consisting of the covariates of unit i, and Xj =
(x1j , . . . , xnj )t is the column vector of the j-th covariate for all units.
We want to find the best linear fit of the data (xi , ŷi )ni=1 with

$$\hat{y}_i = x_i^t \hat\beta = \hat\beta_1 x_{i1} + \cdots + \hat\beta_p x_{ip}$$
in the sense that
$$\hat\beta = \arg\min_{b}\; n^{-1}\sum_{i=1}^n (y_i - x_i^t b)^2 = \arg\min_{b}\; n^{-1}\|Y - Xb\|^2,$$
where $\hat\beta$ is called the OLS coefficient, the $\hat{y}_i$'s are called the fitted values, and the $y_i - \hat{y}_i$'s
are called the residuals.
The objective function is quadratic in b which diverges to infinity when b diverges to
infinity. So it must have a unique minimizer $\hat\beta$ satisfying the first-order condition
$$-\frac{2}{n}\sum_{i=1}^n x_i (y_i - x_i^t \hat\beta) = 0,$$
which simplifies to
$$\sum_{i=1}^n x_i (y_i - x_i^t \hat\beta) = 0 \;\Longleftrightarrow\; X^t (Y - X\hat\beta) = 0. \qquad (3.1)$$

The above equation (3.1) is called the Normal equation of the OLS, which implies the main
theorem:


Theorem 3.1 The OLS coefficient equals
$$\hat\beta = \left(\sum_{i=1}^n x_i x_i^t\right)^{-1} \left(\sum_{i=1}^n x_i y_i\right) = (X^t X)^{-1} X^t Y$$
if $X^t X = \sum_{i=1}^n x_i x_i^t$ is non-degenerate.
The equivalence of the two forms of the OLS coefficient follows from
$$X^t X = (x_1, \ldots, x_n) \begin{pmatrix} x_1^t \\ x_2^t \\ \vdots \\ x_n^t \end{pmatrix} = \sum_{i=1}^n x_i x_i^t,$$
and
$$X^t Y = (x_1, \ldots, x_n) \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \sum_{i=1}^n x_i y_i.$$
For different purposes, both forms can be useful.
The non-degeneracy of $X^t X$ in Theorem 3.1 requires that for any non-zero vector $\alpha \in \mathbb{R}^p$, we must have
$$\alpha^t X^t X \alpha = \|X\alpha\|^2 \neq 0$$
which is equivalent to
Xα ̸= 0,
i.e., the columns of X are linearly independent 1 . This effectively rules out redundant
columns in the design matrix X. If X1 can be represented by other columns X1 =
c2 X2 + · · · + cp Xp for some (c2 , . . . , cp ), then X t X is degenerate.
Throughout the book, we invoke the following condition unless stated otherwise.
Condition 3.1 The column vectors of X are linearly independent.
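The following short R sketch (my addition, with simulated data and hypothetical variable names) illustrates Theorem 3.1 by computing the OLS coefficient both from the matrix formula and from lm():

# illustrative simulation, not from the book
set.seed(1)
n = 100
p = 3
X = cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # first column: the intercept
beta = c(1, 2, -1)
Y = X %*% beta + rnorm(n)
# Theorem 3.1: solve the Normal equation (X^t X) beta_hat = X^t Y
beta_hat = solve(t(X) %*% X, t(X) %*% Y)
# "- 1" removes lm()'s automatic intercept since X already contains one
cbind(beta_hat, coef(lm(Y ~ X - 1)))  # the two columns agree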

3.2 The geometry of OLS


The OLS has very clear geometric interpretations. Figure 3.1 illustrates its geometry with
n = 3 and p = 2. For any b = (b1 , . . . , bp )t ∈ Rp and X = (X1 , . . . , Xp ) ∈ Rn×p ,

Xb = b1 X1 + · · · + bp Xp

represents a linear combination of the column vectors of the design matrix X. So the OLS
problem is to find the best linear combination of the column vectors of X to approximate the
response vector Y . Recall that all linear combinations of the column vectors of X constitute
1 This book uses different notions of “independence” which can be confusing sometimes. In linear algebra,

a set of vectors is linearly independent if any nonzero linear combination of them is not zero; see Chapter A.
In probability theory, two random variables are independent if their joint density factorizes into the product
of the marginal distributions; see Chapter B.

FIGURE 3.1: The geometry of OLS

the column space of X, denoted by C(X) 2 . So the OLS problem is to find the vector in C(X)
that is the closest to Y . Geometrically, the vector must be the projection of Y onto C(X).
By projection, the residual vector ε̂ = Y −X β̂ must be orthogonal to C(X), or, equivalently,
the residual vector is orthogonal to X1 , . . . , Xp . This geometric intuition implies that

$$X_1^t \hat\varepsilon = 0, \; \ldots, \; X_p^t \hat\varepsilon = 0
\;\Longleftrightarrow\; X^t \hat\varepsilon = \begin{pmatrix} X_1^t \hat\varepsilon \\ \vdots \\ X_p^t \hat\varepsilon \end{pmatrix} = 0
\;\Longleftrightarrow\; X^t (Y - X\hat\beta) = 0,$$

which is essentially the Normal equation (3.1). The above argument gives a geometric deriva-
tion of the OLS formula in Theorem 3.1.
In Figure 3.1, since the triangle ABC is rectangular, the fitted vector Ŷ = X β̂ is orthog-
onal to the residual vector ε̂, and moreover, the Pythagorean Theorem implies that

∥Y ∥2 = ∥X β̂∥2 + ∥ε̂∥2 .

In most applications, X contains a column of intercepts $1_n = (1, \ldots, 1)^t$. In those cases, we have
$$1_n^t \hat\varepsilon = 0 \;\Longrightarrow\; n^{-1}\sum_{i=1}^n \hat\varepsilon_i = 0,$$

2 Please review Chapter A for some basic linear algebra background.



so the residuals are automatically centered.


The following theorem states an algebraic fact that gives an alternative proof of the
OLS formula. It is essentially the Pythagorean Theorem for the right triangle BCD
in Figure 3.1.

Theorem 3.2 For any b ∈ Rp , we have the following decomposition

∥Y − Xb∥2 = ∥Y − X β̂∥2 + ∥X(β̂ − b)∥2 ,

which implies that ∥Y − Xb∥2 ≥ ∥Y − X β̂∥2 with equality holding if and only if b = β̂.

Proof of Theorem 3.2: We have the following decomposition:

∥Y − Xb∥2 = (Y − Xb)t (Y − Xb)


= (Y − X β̂ + X β̂ − Xb)t (Y − X β̂ + X β̂ − Xb)
= (Y − X β̂)t (Y − X β̂) + (X β̂ − Xb)t (X β̂ − Xb)
+(Y − X β̂)t (X β̂ − Xb) + (X β̂ − Xb)t (Y − X β̂).

The first term equals ∥Y − X β̂∥2 and the second term equals ∥X(β̂ − b)∥2 . We need to show
the last two terms are zero. By symmetry of these two terms, we only need to show that
the last term is zero. This is true by the Normal equation (3.1) of the OLS:

(X β̂ − Xb)t (Y − X β̂) = (β̂ − b)t X t (Y − X β̂) = 0.
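A numerical check of Theorem 3.2 (my sketch; it reuses X, Y, and beta_hat from the simulation after Condition 3.1): for an arbitrary b, the two sides of the decomposition agree up to floating-point error.

b = c(0, 0, 0)  # any candidate coefficient vector
lhs = sum((Y - X %*% b)^2)
rhs = sum((Y - X %*% beta_hat)^2) + sum((X %*% (beta_hat - b))^2)
all.equal(lhs, rhs)  # TRUE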

3.3 The projection matrix from OLS


The geometry in Section 3.2 also shows that Ŷ = X β̂ is the solution to the following problem

$$\hat{Y} = \arg\min_{v \in C(X)} \|Y - v\|^2.$$

Using Theorem 3.1, we have Ŷ = X β̂ = HY , where

H = X(X t X)−1 X t

is an n × n matrix. It is called the hat matrix because it puts a hat on Y when multiplying
Y . Algebraically, we can show that H is a projection matrix because

$$H^2 = X(X^t X)^{-1} X^t X(X^t X)^{-1} X^t = X(X^t X)^{-1} X^t = H,$$
and
$$H^t = \left\{X(X^t X)^{-1} X^t\right\}^t = X(X^t X)^{-1} X^t = H.$$

Its rank equals its trace, which equals
$$\mathrm{rank}(H) = \mathrm{trace}(H) = \mathrm{trace}\left\{X(X^t X)^{-1} X^t\right\} = \mathrm{trace}\left\{(X^t X)^{-1} X^t X\right\} = \mathrm{trace}(I_p) = p.$$
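These algebraic properties are easy to verify numerically; a small sketch (my addition) with the simulated X from the earlier example:

H = X %*% solve(t(X) %*% X) %*% t(X)
all.equal(H %*% H, H)  # idempotent: H^2 = H
all.equal(t(H), H)     # symmetric: H^t = H
sum(diag(H))           # trace equals p = 3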

The projection matrix H has the following geometric interpretations.

Proposition 3.1 The projection matrix H = X(X t X)−1 X t satisfies


(G1) Hv = v ⇐⇒ v ∈ C(X);
(G2) Hw = 0 ⇐⇒ w ⊥ C(X).

Recall that C(X) is the column space of X. (G1) states that projecting any vector in
C(X) onto C(X) does not change the vector, and (G2) states that projecting any vector
orthogonal to C(X) onto C(X) results in a zero vector.
Proof of Proposition 3.1: I first prove (G1). If v ∈ C(X), then v = Xb for some b,
which implies that Hv = X(X t X)−1 X t Xb = Xb = v. Conversely, if v = Hv, then v =
X(X t X)−1 X t v = Xu with u = (X t X)−1 X t v, which ensures that v ∈ C(X).
I then prove (G2). If w ⊥ C(X), then w is orthogonal to all column vectors of X. So

$$X_j^t w = 0 \;(j = 1, \ldots, p)
\;\Longrightarrow\; X^t w = 0
\;\Longrightarrow\; Hw = X(X^t X)^{-1} X^t w = 0.$$

Conversely, if Hw = X(X t X)−1 X t w = 0, then wt X(X t X)−1 X t w = 0. Because (X t X)−1


is positive definite, we have X t w = 0 ensuring that w ⊥ C(X). □
Writing $H = (h_{ij})_{1\le i,j\le n}$ and $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_n)^t$, we have another basic identity
$$\hat{y}_i = \sum_{j=1}^n h_{ij} y_j = h_{ii} y_i + \sum_{j \ne i} h_{ij} y_j.$$

It shows that the predicted value ŷi is a linear combination of all the outcomes. Moreover,
if X contains a column of intercepts $1_n = (1, \ldots, 1)^t$, then
$$H 1_n = 1_n \;\Longrightarrow\; \sum_{j=1}^n h_{ij} = 1 \;(i = 1, \ldots, n),$$

which implies that ŷi is a weighted average of all the outcomes. Although the sum of the
weights is one, some of them can be negative.
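Continuing the numerical sketch above (the simulated X contains an intercept column), the row-sum property can be checked directly:

range(rowSums(H))  # all row sums equal 1 because X contains the intercept column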
In general, the hat matrix has complex forms, but when the covariates are dummy
variables, it has more explicit forms. I give two examples below.

Example 3.1 In a treatment-control experiment with m treated and n control units, the
matrix X contains 1 and a dummy variable for the treatment:
$$X = \begin{pmatrix} 1_m & 1_m \\ 1_n & 0_n \end{pmatrix}.$$
We can show that
$$H = \mathrm{diag}\{m^{-1} 1_m 1_m^t,\; n^{-1} 1_n 1_n^t\}.$$

Example 3.2 In an experiment with $n_j$ units receiving treatment level j (j = 1, . . . , J),
the covariate matrix X contains J dummy variables for the treatment levels:
$$X = \mathrm{diag}\{1_{n_1}, \ldots, 1_{n_J}\}.$$
We can show that
$$H = \mathrm{diag}\{n_1^{-1} 1_{n_1} 1_{n_1}^t, \ldots, n_J^{-1} 1_{n_J} 1_{n_J}^t\}.$$
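A small R sketch (my addition; m and the control-group size are chosen arbitrarily, and the control size is called n0 to avoid clashing with the earlier simulation) confirming the block-diagonal form in Example 3.1:

m = 3
n0 = 4   # plays the role of "n" in Example 3.1
X_tc = cbind(rep(1, m + n0), c(rep(1, m), rep(0, n0)))  # intercept and treatment dummy
H_tc = X_tc %*% solve(t(X_tc) %*% X_tc) %*% t(X_tc)
round(H_tc, 3)  # upper-left m x m block: 1/m; lower-right n0 x n0 block: 1/n0; off blocks: 0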

3.4 Homework problems


3.1 Univariate and multivariate OLS
Derive the univariate OLS based on the multivariate OLS formula with
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},$$
where the $x_i$'s are scalars.

3.2 OLS via vector and matrix calculus


Using vector and matrix calculus, show that the OLS estimator minimizes (Y − Xb)t (Y −
Xb).

3.3 OLS based on pseudo inverse


Show that β̂ = X⁺Y .
Remark: Recall the definition of the pseudo inverse in Chapter A.

3.4 Invariance of OLS


Assume that X t X is non-degenerate and Γ is a p×p non-degenerate matrix. Define X̃ = XΓ.
From the OLS fit of Y on X, we obtain the coefficient β̂, the fitted value Ŷ , and the residual
ε̂; from the OLS fit of Y on X̃, we obtain the coefficient β̃, the fitted value Ỹ , and the residual
ε̃.
Prove that
β̂ = Γβ̃, Ŷ = Ỹ , ε̂ = ε̃.
Remark: From a linear algebra perspective, X and XΓ have the same column space if
Γ is a non-degenerate matrix:
{Xb : b ∈ Rp } = {XΓc : c ∈ Rp }.
Consequently, there must be a unique projection of Y onto the common column space.

3.5 Invariance of the hat matrix


Show that H does not change if we change X to XΓ where Γ ∈ Rp×p is a non-degenerate
matrix.

3.6 Special hat matrices


Verify the formulas of the hat matrices in Examples 3.1 and 3.2.
OLS with Multiple Covariates 19

3.7 OLS with multiple responses


For each unit i = 1, . . . , n, we have multiple responses yi = (yi1 , . . . , yiq )t ∈ Rq and multiple covariates xi = (xi1 , . . . , xip )t ∈ Rp . Define

Y = (yij ) = (y1 , . . . , yn )t = (Y1 , . . . , Yq ) ∈ Rn×q

and

X = (xij ) = (x1 , . . . , xn )t = (X1 , . . . , Xp ) ∈ Rn×p
as the response and covariate matrices, respectively. Define the multiple OLS coefficient
matrix as
B̂ = arg min_{B ∈ Rp×q} Σ_{i=1}^n ∥yi − B t xi ∥2 .

Show that B̂ = (B̂1 , . . . , B̂q ) has column vectors

B̂1 = (X t X)−1 X t Y1 , . . . , B̂q = (X t X)−1 X t Yq .

Remark: This result tells us that the OLS fit with a vector outcome reduces to multiple
separate OLS fits, or, the OLS fit of a matrix Y on a matrix X reduces to the column-wise
OLS fits of Y on X.
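As a quick numerical check of this remark, the following sketch compares the joint fit with column-wise fits on simulated data (the dimensions and the fits without an intercept are arbitrary illustrative choices):

# multi-response OLS reduces to separate column-wise OLS fits
set.seed(1)
n = 50; p = 3; q = 2
X = matrix(rnorm(n * p), n, p)
Y = matrix(rnorm(n * q), n, q)
B_joint = solve(t(X) %*% X, t(X) %*% Y)          # p x q coefficient matrix
B_colwise = cbind(lm(Y[, 1] ~ X - 1)$coef,
                  lm(Y[, 2] ~ X - 1)$coef)       # column-by-column fits
max(abs(B_joint - B_colwise))                    # numerically zero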

3.8 Full sample and subsample OLS coefficients


Partition the full sample into K subsamples:

X = (X(1)t , . . . , X(K)t )t , Y = (Y(1)t , . . . , Y(K)t )t ,

where the kth sample consists of (X(k) , Y(k) ), with X(k) ∈ Rnk×p and Y(k) ∈ Rnk being the covariate matrix and outcome vector. Note that n = Σ_{k=1}^K nk . Let β̂ be the OLS coefficient based on the full sample, and β̂(k) be the OLS coefficient based on the kth sample. Show that

β̂ = Σ_{k=1}^K W(k) β̂(k) ,

where the weight matrix equals

W(k) = (X t X)−1 X(k)t X(k) .

3.9 Jacobi’s theorem


The set {1, . . . , n} has (n choose p) size-p subsets. Each subset S defines a linear equation for b ∈ Rp :

YS = XS b

where YS ∈ Rp is the subvector of Y and XS ∈ Rp×p is the submatrix of X, corresponding


to the units in S. Define the subset coefficient

β̂S = XS−1 YS

if XS is invertible and β̂S = 0 otherwise. Show that the OLS coefficient equals a weighted average of these subset coefficients:

β̂ = Σ_S wS β̂S ,

where the summation is over all subsets and

wS = | det(XS )|2 / Σ_{S′} | det(XS′ )|2 .

Remark: To prove this result, we can use Cramer’s rule to express the OLS coefficient
and use the Cauchy–Binet formula to expand the determinant of X t X. This result extends
Problem 2.1. Berman (1988) attributed it to Jacobi. Wu (1986) used it in analyzing the
statistical properties of OLS.
4
The Gauss–Markov Model and Theorem

4.1 Gauss–Markov model


Without any stochastic assumptions, the OLS in Chapter 3 is purely algebraic. From now
on, we want to discuss the statistical properties of β̂ and associated quantities, so we need
to invoke some statistical modeling assumptions. A simple starting point is the following
Gauss–Markov model with a fixed design matrix X and unknown parameters (β, σ 2 ).
Assumption 4.1 (Gauss–Markov model) We have
Y = Xβ + ε
where the design matrix X is fixed with linearly independent column vectors, and the random
error term ε has the first two moments
E(ε) = 0,
cov(ε) = σ 2 In .
The unknown parameters are (β, σ 2 ).
The Gauss–Markov model assumes that Y has mean Xβ and covariance matrix σ 2 In .
At the individual level, we can also write it as
yi = xti β + εi , (i = 1, . . . , n)
where the error terms are uncorrelated with mean 0 and variance σ 2 .
The assumption that X is fixed is not essential, because we can condition on X even
if we think X is random. The mean of each yi is linear in xi with the same β coefficient,
which is a rather strong assumption. So is the homoskedasticity1 assumption that the error
terms have the same variance σ 2 . Critiques of these assumptions aside, I will derive the
properties of β̂ under the Gauss–Markov model.

4.2 Properties of the OLS estimator


I first derive the mean and covariance of β̂ = (X t X)−1 X t Y .
Theorem 4.1 Under Assumption 4.1, we have
E(β̂) = β,
cov(β̂) = σ 2 (X t X)−1 .
1 In this book, I do not spell it as homoscedasticity since “k” better indicates the meaning of variance.

McCulloch (1985) gave a convincing argument. See also Paloyo (2014).


Proof of Theorem 4.1: Because E(Y ) = Xβ, we have

E(β̂) = E{(X t X)−1 X t Y }
= (X t X)−1 X t E(Y )
= (X t X)−1 X t Xβ
= β.

Because cov(Y ) = σ 2 In , we have

cov(β̂) = cov{(X t X)−1 X t Y }
= (X t X)−1 X t cov(Y )X(X t X)−1
= σ 2 (X t X)−1 X t X(X t X)−1
= σ 2 (X t X)−1 . □


We can decompose the response vector as

Y = Ŷ + ε̂,

where the fitted vector is Ŷ = X β̂ = HY and the residual vector is ε̂ = Y − Ŷ = (In − H)Y.
The two matrices H and In − H are the keys, which have the following properties.

Lemma 4.1 Both H and In − H are projection matrices. In particular,

HX = X, (In − H)X = 0,

and they are orthogonal:


H(In − H) = (In − H)H = 0.

These follow from simple linear algebra, and I leave the proof as Problem 4.1. It states
that H and In − H are projection matrices onto the column space of X and its complement.
Algebraically, Ŷ and ε̂ are orthogonal by the OLS projection because Lemma 4.1 implies

Ŷ t ε̂ = Y t H t (In − H)Y
= Y t H(In − H)Y
= 0.

This is also coherent with the geometry in Figure 3.1.
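A small numerical illustration of this orthogonality (a sketch; the simulated data below are an arbitrary choice, and any OLS fit would do):

# fitted values and residuals from OLS are orthogonal as vectors
set.seed(1)
n = 100
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
fit = lm(y ~ x)
sum(fitted(fit) * resid(fit))   # zero up to rounding error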


Moreover, we can derive the mean and covariance matrix of Ŷ and ε̂.
Theorem 4.2 Under Assumption 4.1, we have

E(Ŷ ) = Xβ, E(ε̂) = 0,

and

cov( (Ŷ t , ε̂t )t ) = σ 2 diag{H, In − H}.
So Ŷ and ε̂ are uncorrelated.


The Gauss–Markov Model and Theorem 23

Please do not be confused with the two statements above. First, Ŷ and ε̂ are orthogonal.
Second, Ŷ and ε̂ are uncorrelated. They have different meanings. The first statement is an
algebraic fact of the OLS procedure. It is about a relationship between two vectors Ŷ
and ε̂ which holds without assuming the Gauss–Markov model. The second statement is
stochastic. It is about a relationship between two random vectors Ŷ and ε̂ which requires
the Gauss–Markov model assumption.
Proof of Theorem 4.2: The conclusion follows from the simple fact that

(Ŷ t , ε̂t )t = ((HY )t , ((In − H)Y )t )t

is a linear transformation of Y . It has mean

E(Ŷ ) = HE(Y ) = HXβ = Xβ, E(ε̂) = (In − H)E(Y ) = (In − H)Xβ = 0,

and covariance matrix

cov(Ŷ ) = Hcov(Y )H t = σ 2 H 2 = σ 2 H,
cov(ε̂) = (In − H)cov(Y )(In − H)t = σ 2 (In − H)2 = σ 2 (In − H),
cov(Ŷ , ε̂) = Hcov(Y )(In − H)t = σ 2 H(In − H) = 0,

where the final equalities follow from Lemma 4.1. □
Assume the Gauss–Markov model. Although the original responses and error terms are
uncorrelated between units with cov(εi , εj ) = 0 for i ̸= j, the fitted values and the residuals
are correlated with
cov(ŷi , ŷj ) = σ 2 hij , cov(ε̂i , ε̂j ) = −σ 2 hij
for i ̸= j based on Theorem 4.2.

4.3 Variance estimation


Theorem 4.1 quantifies the uncertainty of β̂ by its covariance matrix. However, it is not
directly useful because σ 2 is still unknown. Our next task is to estimate σ 2 based on the
observed data. It is the variance of each εi , but the εi ’s are not observable either. Their
empirical analogues are the residuals ε̂i = yi − xti β̂. It seems intuitive to estimate σ 2 by
σ̃ 2 = rss/n

where

rss = Σ_{i=1}^n ε̂i2

is the residual sum of squares. However, Theorem 4.2 shows that ε̂i has mean zero and variance σ 2 (1 − hii ), which is not the same as the variance of the original εi . Consequently, rss has mean

E(rss) = Σ_{i=1}^n σ 2 (1 − hii ) = σ 2 {n − trace(H)} = σ 2 (n − p),

which implies the following theorem.

Theorem 4.3 Define

σ̂ 2 = rss/(n − p) = Σ_{i=1}^n ε̂i2 /(n − p).

Then E(σ̂ 2 ) = σ 2 under Assumption 4.1.

Theorem 4.3 implies that σ̃ 2 is a biased estimator for σ 2 because E(σ̃ 2 ) = σ 2 (n − p)/n.
It underestimates σ 2 but with a large sample size n, the bias is small.
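A simulation sketch of Theorem 4.3 and of the bias of σ̃ 2 (the design, coefficients, and error distribution below are arbitrary illustrative choices):

# compare rss/(n - p) and rss/n as estimators of sigma^2
set.seed(1)
n = 30; p = 3; sigma2 = 4
X = cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta = c(1, 2, -1)
sim = replicate(10000, {
  y = drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  rss = sum(lm(y ~ X - 1)$residuals^2)
  c(rss / (n - p), rss / n)
})
rowMeans(sim)   # first entry is close to 4; second is biased downward by the factor (n - p)/n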

4.4 Gauss–Markov Theorem


So far, we have focused on the OLS estimator. It is intuitive, but we have not answered
the fundamental question yet. Why should we focus on it? Are there any other better
estimators? Under the Gauss–Markov model, the answer is definite: we focus on the OLS
estimator because it is optimal in the sense of having the smallest covariance matrix among
all linear unbiased estimators. The following famous Gauss–Markov theorem quantifies this
claim, which was named after Carl Friedrich Gauss and Andrey Markov2 . It is for this reason
that I call the corresponding model the Gauss–Markov model. The textbook by Monahan
(2008) also uses this name.

Theorem 4.4 Under Assumption 4.1, the OLS estimator β̂ for β is the best linear unbiased
estimator (BLUE) in the sense that3

cov(β̃) ⪰ cov(β̂)

for any estimator β̃ satisfying


(C1) β̃ = AY for some A ∈ Rp×n not depending on Y ;

(C2) E(β̃) = β for any β.


2 David and Neyman (1938) used the name Markoff theorem. Lehmann (1951) appeared to first use the

name Gauss–Markov theorem.


3 We write M1 ⪰ M2 if M1 − M2 is positive semi-definite. See Chapter A for a review.

Before proving Theorem 4.4, we need to understand its meaning and immediate impli-
cations. We do not compare the OLS estimator with any arbitrary estimators. In fact, we
restrict to the estimators that are linear and unbiased. Condition (C1) requires that β̃ is
a linear estimator. More precisely, it is a linear transformation of the response vector Y ,
where A can be any complex and possibly nonlinear function of X. Condition (C2) requires
that β̃ is an unbiased estimator for β, no matter what true value β takes.
Why do we restrict the estimator to be linear? The class of linear estimator is actually
quite large because A can be any nonlinear function of X, and the only requirement is that
the estimator is linear in Y . The unbiasedness is a natural requirement for many problems.
However, in many modern applications with many covariates, some biased estimators can
perform better than unbiased estimators if they have smaller variances. We will discuss
these estimators in Part V of this book.
We compare the estimators based on their covariances, which are natural extensions of
variances for scalar random variables. The conclusion cov(β̃) ⪰ cov(β̂) implies that for any
vector c ∈ Rp , we have
ct cov(β̃)c ≥ ct cov(β̂)c,

which is equivalent to

var(ct β̃) ≥ var(ct β̂).

So any linear transformation of the OLS estimator has a variance smaller than or equal to that of the same linear transformation of any other linear unbiased estimator. In particular, if c = (0, . . . , 1, . . . , 0)t with only the jth coordinate being 1, then the above inequality implies that

var(β̃j ) ≥ var(β̂j ), (j = 1, . . . , p).

So, coordinate by coordinate, the OLS estimator has a variance no larger than that of any other linear unbiased estimator.
Now we prove the theorem.
Proof of Theorem 4.4: We must verify that the OLS estimator itself satisfies (C1) and
(C2). We have β̂ = ÂY with  = (X t X)−1 X t , and it is unbiased by Theorem 4.1.
First, the unbiasedness requirement implies that
E(β̃) = β =⇒ E(AY ) = AE(Y ) = AXβ = β
=⇒ AXβ = β
for any value of β. So
AX = Ip (4.1)

must hold. In particular, the OLS estimator satisfies ÂX = (X t X)−1 X t X = Ip .
Second, we can decompose the covariance of β̃ as
cov(β̃) = cov(β̂ + β̃ − β̂)
= cov(β̂) + cov(β̃ − β̂) + cov(β̂, β̃ − β̂) + cov(β̃ − β̂, β̂).
The last two terms are in fact zero. By symmetry, we only need to show that the third term
is zero:
cov(β̂, β̃ − β̂) = cov{ÂY, (A − Â)Y }
= Âcov(Y )(A − Â)t
= σ 2 Â(A − Â)t
= σ 2 (ÂAt − ÂÂt )
= σ 2 {(X t X)−1 X t At − (X t X)−1 X t X(X t X)−1 }
= σ 2 {(X t X)−1 − (X t X)−1 } (by (4.1), since X t At = (AX)t = Ip )
= 0.

The above covariance decomposition simplifies to


cov(β̃) = cov(β̂) + cov(β̃ − β̂),
which implies
cov(β̃) − cov(β̂) = cov(β̃ − β̂) ⪰ 0.

In the process of the proof, we have shown two stronger results
cov(β̃ − β̂, β̂) = 0
and
cov(β̃ − β̂) = cov(β̃) − cov(β̂).
They hold only when β̂ is BLUE. They do not hold when comparing two general estimators.
Theorem 4.4 is elegant but abstract. It says that in some sense, we can just focus on
the OLS estimator because it is the best one in terms of the covariance among all linear
unbiased estimators. Then we do not need to consider other estimators. However, we have
not mentioned any other estimators for β yet, which makes Theorem 4.4 not concrete
enough. From the proof above, a linear unbiased estimator β̃ = AY only needs to satisfy
AX = Ip , which imposes p2 constraints on the p × n matrix A. Therefore, we have p(n − p)
free parameters to choose from and have infinitely many linear unbiased estimators in
general. A class of linear unbiased estimators discussed more thoroughly in Chapter 19, are
the weighted least squares estimators
β̃ = (X t Σ−1 X)−1 X t Σ−1 Y,
where Σ is a positive definite matrix not depending on Y such that Σ and X t Σ−1 X are
invertible. It is linear, and we can show that it is unbiased for β:
E(β̃) = E{(X t Σ−1 X)−1 X t Σ−1 Y }
= (X t Σ−1 X)−1 X t Σ−1 Xβ
= β.
Different choices of Σ give different β̃, but Theorem 4.4 states that the OLS estimator with
Σ = In has the smallest covariance matrix under the Gauss–Markov model.
I will give an extension and some applications of the Gauss–Markov Theorem as home-
work problems.
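The following simulation sketch compares the OLS estimator with one particular weighted least squares estimator under homoskedastic errors (the weight matrix Σ−1 = diag{1/(1 + xi2 )} is an arbitrary illustrative choice):

# under the Gauss-Markov model, OLS has no larger variance than this WLS estimator
set.seed(1)
n = 100
x = rnorm(n)
X = cbind(1, x)
Sigma_inv = diag(1 / (1 + x^2))   # a valid but suboptimal weight matrix under homoskedasticity
sim = replicate(5000, {
  y = drop(X %*% c(1, 2)) + rnorm(n)
  ols = solve(t(X) %*% X, t(X) %*% y)
  wls = solve(t(X) %*% Sigma_inv %*% X, t(X) %*% Sigma_inv %*% y)
  c(ols = ols[2], wls = wls[2])
})
apply(sim, 1, var)   # the OLS slope has the smaller variance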

4.5 Homework problems


4.1 Projection matrices
Prove Lemma 4.1.

4.2 Univariate OLS and the optimal design


Assume the Gauss–Markov model yi = α + βxi + εi (i = 1, . . . , n) with a scalar xi . Show
that the variance of the OLS coefficient for xi equals
var(β̂) = σ 2 / Σ_{i=1}^n (xi − x̄)2 .

Assume xi must be in the interval [0, 1]. We want to choose their values to minimize
var(β̂). Assume that n is an even number. Find the minimizers xi ’s.
Hint: You may find the following probability result useful. For a random variable ξ in
the interval [0, 1], we have the following inequality
var(ξ) = E(ξ 2 ) − {E(ξ)}2
≤ E(ξ) − {E(ξ)}2
= E(ξ){1 − E(ξ)}
≤ 1/4.
The first inequality becomes an equality if and only if ξ = 0 or 1; the second inequality
becomes an equality if and only if E(ξ) = 1/2.

4.3 BLUE estimator for the mean


Assume that yi has mean µ and variance σ 2 , and the yi (i = 1, . . . , n) are uncorrelated. A linear estimator of the mean µ has the form µ̂ = Σ_{i=1}^n ai yi , which is unbiased as long as Σ_{i=1}^n ai = 1. So there are infinitely many linear unbiased estimators for µ.
Find the BLUE for µ and prove why it is BLUE.

4.4 Consequence of useless regressors


Partition the covariate matrix and parameter into
 
X = (X1 , X2 ), β = (β1t , β2t )t ,

where X1 ∈ Rn×k , X2 ∈ Rn×l , β1 ∈ Rk and β2 ∈ Rl with k + l = p. Assume the Gauss–


Markov model with β2 = 0. Let β̂1 be the first k coordinates of β̂ = (X t X)−1 X t Y and
β̃1 = (X1t X1 )−1 X1t Y be the coefficient based on the partial OLS fit of Y on X1 only. Show
that
cov(β̂1 ) ⪰ cov(β̃1 ).

4.5 Simple average of subsample OLS coefficients


Inherit the setting of Problem 3.8. Define the simple average of the subsample OLS coefficients as β̄ = K −1 Σ_{k=1}^K β̂(k) . Assume the Gauss–Markov model. Show that

cov(β̄) ⪰ cov(β̂).

4.6 Gauss–Markov theorem for prediction


Under Assumption 4.1, the OLS predictor Ŷ = X β̂ for the mean Xβ is the best linear
unbiased predictor in the sense that cov(Ỹ ) ⪰ cov(Ŷ ) for any predictor Ỹ satisfying
(C1) Ỹ = H̃Y for some H̃ ∈ Rn×n not depending on Y ;
(C2) E(Ỹ ) = Xβ for any β.
Prove this theorem.

4.7 Nonlinear unbiased estimator under the Gauss–Markov model


Under Assumption 4.1, prove that if
X t Qj X = 0, trace(Qj ) = 0, (j = 1, . . . , p)

then

β̃ = β̂ + (Y t Q1 Y, . . . , Y t Qp Y )t
is unbiased for β.
Remark: The above estimator β̃ is a quadratic function of Y . It is a nonlinear unbiased
estimator for β. It is not difficult to show the unbiasedness. More remarkably, Koopmann
(1982, Theorem 4.3) showed that under Assumption 4.1, any unbiased estimator for β must
have the form of β̃.
5
Normal Linear Model: Inference and Prediction

Under the Gauss–Markov model, we have calculated the first two moments of the OLS
estimator β̂ = (X t X)−1 X t Y :

E(β̂) = β,
cov(β̂) = σ 2 (X t X)−1 ,

and have shown that σ̂ 2 = ε̂t ε̂/(n − p) is unbiased for σ 2 , where ε̂ = Y − X β̂ is the
residual vector. The Gauss–Markov theorem further ensures that the OLS estimator is
BLUE. Although these results characterize the nice properties of the OLS estimator, they
do not fully determine its distribution and are thus inadequate for statistical inference.
This chapter will derive the joint distribution of (β̂, σ̂ 2 ) under the Normal linear model
with stronger distribution assumptions.

Assumption 5.1 (Normal linear model) We have

Y ∼ N(Xβ, σ 2 In ),

or, equivalently,
yi ∼ N(xti β, σ 2 ) independently for i = 1, . . . , n,
where the design matrix X is fixed with linearly independent column vectors. The unknown
parameters are (β, σ 2 ).

We can also write the Normal linear model as a linear function of covariates with error
terms:
Y = Xβ + ε
or, equivalently,
yi = xti β + εi , (i = 1, . . . , n),
where
ε ∼ N(0, σ 2 In ), or, equivalently, the εi are IID N(0, σ 2 ) (i = 1, . . . , n).
Assumption 5.1 implies Assumption 4.1. Beyond the Gauss–Markov model, it further
requires IID Normal error terms. Assumption 5.1 is extremely strong, but it is canonical in
statistics. It allows us to derive elegant formulas and also justifies the outputs of the linear
regression functions in many statistical packages. I will relax it in Chapter 6.

5.1 Joint distribution of the OLS coefficient and variance estimator


We first state the main theorem on the joint distribution of (β̂, σ̂ 2 ) via the joint distribution
of (β̂, ε̂).

29
30 Linear Model and Extensions

Theorem 5.1 Under Assumption 5.1, we have


(β̂ t , ε̂t )t ∼ N( (β t , 0t )t , σ 2 diag{(X t X)−1 , In − H} ),

and

σ̂ 2 /σ 2 ∼ χ2n−p /(n − p).

So β̂ ⊥⊥ ε̂ and β̂ ⊥⊥ σ̂ 2 .
Proof of Theorem 5.1: First,
β̂ = (X t X)−1 X t Y and ε̂ = (In − H)Y are linear transformations of the same vector Y , so they are jointly Normal. We have verified their means and variances in Chapter 4, so we only need to show that their covariance is zero:
cov(β̂, ε̂) = (X t X)−1 X t cov(Y )(In − H)t
= σ 2 (X t X)−1 X t (In − H t )
= 0
which holds because (In − H)X = 0 by Lemma 4.1.
Second, because σ̂ 2 = rss/(n − p) = ε̂t ε̂/(n − p) is a quadratic function of ε̂, it is
independent of β̂. We only need to show that it is a scaled chi-squared distribution. This
follows from Theorem B.10 in Chapter B due to the Normality of ε̂/σ with the projection
matrix In − H as its covariance matrix. □
The second theorem is on the joint distribution of (Ŷ , ε̂). We have shown their means
and covariance matrix in the last chapter. Because they are linear transformations of Y ,
they are jointly Normal and independent.
Theorem 5.2 Under Assumption 5.1, we have
     
(Ŷ t , ε̂t )t ∼ N( ((Xβ)t , 0t )t , σ 2 diag{H, In − H} ),

so Ŷ ⊥⊥ ε̂.
Recall that we have shown that Y = Ŷ + ε̂ with Ŷ ⊥ ε̂ by the OLS properties, which
is a pure linear algebra fact without assumptions. Theorem 4.2 ensures that Ŷ and ε̂ are
uncorrelated under Assumption 4.1. Now Theorem 5.2 further ensures that Ŷ ⊥⊥ ε̂ under
Assumption 5.1. The first result states that Ŷ and ε̂ are orthogonal. The second result
states that Ŷ and ε̂ are uncorrelated. The third result states Ŷ and ε̂ are independent.
They hold under different assumptions.
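A simulation sketch of the chi-squared part of Theorem 5.1 (the design and parameters below are arbitrary illustrative choices):

# (n - p) * sigma2hat / sigma2 = rss / sigma2 should follow chi-squared with n - p df
set.seed(1)
n = 20; p = 2
X = cbind(1, rnorm(n))
sim = replicate(10000, {
  y = drop(X %*% c(1, 2)) + rnorm(n)    # sigma2 = 1
  sum(lm(y ~ X - 1)$residuals^2)        # rss
})
ks.test(sim, "pchisq", df = n - p)$p.value   # not systematically small: consistent with chi-squared(n - p)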

5.2 Pivotal quantities and statistical inference


5.2.1 Scalar parameters
We first consider statistical inference for ct β, a one-dimensional linear function of β where
c ∈ Rp . For example, if c = ej ≡ (0, . . . , 1, . . . , 0)t with only the jth element being one,

then ct β = βj is the jth element of β which measures the impact of xij on yi on average.
Standard software packages report statistical inference for each element of β. Sometimes we
may also be interested in βj − βj ′ , the difference between the coefficients of two covariates,
which corresponds to c = (0, . . . , 0, 1, 0, . . . , 0, −1, 0, . . . , 0)t = ej − ej ′ .
Theorem 5.1 implies that

ct β̂ ∼ N ct β, σ 2 ct (X t X)−1 c .


However, this is not directly useful because σ 2 is unknown. With σ 2 replaced by σ̂ 2 , the
standardized distribution
Tc ≡ (ct β̂ − ct β) / √{σ̂ 2 ct (X t X)−1 c}
does not follow N(0, 1) anymore. In fact, it is a t distribution as shown in Theorem 5.3
below.

Theorem 5.3 Under Assumption 5.1, for a fixed vector c ∈ Rp , we have

Tc ∼ tn−p .

Proof of Theorem 5.3: From Theorem 5.1, the standardized distribution with the true
σ 2 follows
(ct β̂ − ct β) / √{σ 2 ct (X t X)−1 c} ∼ N(0, 1),

σ̂ 2 /σ 2 ∼ χ2n−p /(n − p), and they are independent. These facts imply that

Tc = (ct β̂ − ct β) / √{σ̂ 2 ct (X t X)−1 c}
= [ (ct β̂ − ct β) / √{σ 2 ct (X t X)−1 c} ] / √{σ̂ 2 /σ 2 }
∼ N(0, 1) / √{χ2n−p /(n − p)},

where N(0, 1) and χ2n−p denote independent standard Normal and χ2n−p random variables,
respectively, with a little abuse of notation. Therefore, Tc ∼ tn−p by the definition of the t
distribution. □
In Theorem 5.3, the left-hand side depends on the observed data and the unknown true
parameters, but the right-hand side is a random variable depending on only the dimension
(n, p) of X, but neither the data nor the true parameters. We call the quantity on the
left-hand side a pivotal quantity. Based on the quantiles of the tn−p random variable, we
can tie the data and the true parameter via the following probability statement
pr{ |ct β̂ − ct β| / √{σ̂ 2 ct (X t X)−1 c} ≤ t1−α/2,n−p } = 1 − α

for any 0 < α < 1, where t1−α/2,n−p is the 1 − α/2 quantile of tn−p . When n − p is large
(e.g. larger than 30), the 1 − α/2 quantile of tn−p is close to that of N(0, 1). In particular,
t97.5%,n−p ≈ 1.96, the 97.5% quantile of N(0, 1), which is the critical value for the 95%
confidence interval.
Define

√{σ̂ 2 ct (X t X)−1 c} ≡ sê_c ,

which is often called the (estimated) standard error of ct β̂. Using this definition, we can
equivalently write the above probability statement as
pr{ ct β̂ − t1−α/2,n−p sê_c ≤ ct β ≤ ct β̂ + t1−α/2,n−p sê_c } = 1 − α.

We use
ct β̂ ± t1−α/2,n−p sê_c
as a 1 − α level confidence interval for ct β. By duality of confidence interval and hypothesis
testing, we can also construct a level α test for ct β. More precisely, we reject the null
hypothesis ct β = d if the above confidence interval does not cover d, for a fixed number d.
As an important case, c = ej so ct β = βj . Standard software packages, for example, R, report the point estimator β̂j , the standard error sê_j = √{σ̂ 2 [(X t X)−1 ]jj }, the t statistic Tj = β̂j /sê_j , and the two-sided p-value pr(|tn−p | ≥ |Tj |) for testing whether βj equals zero or not. Section 5.4 below gives some examples.
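The following sketch reproduces these quantities by hand for the slope in a univariate fit (simulated data; any dataset would work):

# standard error, t statistic, and confidence interval for one coefficient by hand
set.seed(1)
n = 100
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
fit = lm(y ~ x)
X = model.matrix(fit)
sigma2hat = sum(resid(fit)^2) / (n - 2)
se = sqrt(sigma2hat * solve(t(X) %*% X)[2, 2])    # matches summary(fit)$coef[2, 2]
tstat = coef(fit)[2] / se                         # matches summary(fit)$coef[2, 3]
coef(fit)[2] + c(-1, 1) * qt(0.975, n - 2) * se   # matches confint(fit)[2, ]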

5.2.2 Vector parameters


We then consider statistical inference for Cβ, a multi-dimensional linear function of β where
C ∈ Rl×p . If l = 1, then it reduces to the one-dimensional case. If l > 1, then

C = (c1 , . . . , cl )t =⇒ Cβ = (ct1 β, . . . , ctl β)t

corresponds to the joint value of the l parameters ct1 β, . . . , ctl β.

Example 5.1 If C = (0, Ip−1 ) ∈ R(p−1)×p , that is, the first column is zero and the remaining columns form the identity matrix, then

Cβ = (β2 , . . . , βp )t
contains all the coefficients except for the first one (the intercept in most cases). Most
software packages report the test of the joint significance of (β2 , . . . , βp ). Section 5.4 below
gives some examples.

Example 5.2 Another leading application is to test whether β2 = 0 in the following regres-
sion partitioned by X = (X1 , X2 ) where X1 and X2 are n × k and n × l matrices:

Y = X1 β1 + X2 β2 + ε,

with

C = (0l×k , Il ) =⇒ Cβ = β2 , where β = (β1t , β2t )t .
We will discuss this partitioned regression in more detail in Chapters 7 and 8.

Now we will focus on the generic problem of inferring Cβ. To avoid degeneracy, we
assume that C does not have redundant rows, quantified below.

Assumption 5.2 C has linearly independent rows.

Theorem 5.1 implies that

C β̂ − Cβ ∼ N( 0, σ 2 C(X t X)−1 C t ),

and therefore the standardized quadratic form has a chi-squared distribution:

(C β̂ − Cβ)t {σ 2 C(X t X)−1 C t }−1 (C β̂ − Cβ) ∼ χ2l .

The above chi-squared distribution follows from the property of the quadratic form of a
Normal in Theorem B.10, where σ 2 C(X t X)−1 C t is a positive definite matrix1 . Again this
is not directly useful with unknown σ 2 . Replacing σ 2 with the unbiased estimator σ̂ 2 and
using a scaling factor l, we can obtain a pivotal quantity that has an F distribution as
summarized in Theorem 5.4 below.

Theorem 5.4 Under Assumptions 5.1 and 5.2, we have

FC ≡ (C β̂ − Cβ)t {C(X t X)−1 C t }−1 (C β̂ − Cβ) / (lσ̂ 2 ) ∼ Fl,n−p .
Proof of Theorem 5.4: Similar to the proof of Theorem 5.3, we apply Theorem 5.1 to
derive that
FC = [ (C β̂ − Cβ)t {σ 2 C(X t X)−1 C t }−1 (C β̂ − Cβ)/l ] / (σ̂ 2 /σ 2 )
∼ (χ2l /l) / {χ2n−p /(n − p)},
where χ2l and χ2n−p denote independent χ2l and χ2n−p random variables, respectively, with
a little abuse of notation. Therefore, FC ∼ Fl,n−p by the definition of the F distribution. □
Theorem 5.4 motivates the following confidence region for Cβ:
{ r : (C β̂ − r)t {C(X t X)−1 C t }−1 (C β̂ − r) ≤ lσ̂ 2 f1−α,l,n−p },


where f1−α,l,n−p is the upper α quantile of the Fl,n−p distribution. By duality of the con-
fidence region and hypothesis testing, we can also construct a level α test for Cβ. Most
statistical packages automatically report the p-value based on the F statistic in Example
5.1.
As a final remark, the statistics in Theorems 5.3 and 5.4 are called the Wald-type
statistics.
1 Because X has linearly independent columns, X t X is a non-degenerate and thus positive definite matrix.

Since ut C(X t X)−1 C t u ≥ 0, to show that C(X t X)−1 C t is non-degenerate, we only need to show that
ut C(X t X)−1 C t u = 0 =⇒ u = 0.
From ut C(X t X)−1 C t u = 0, we know C t u = u1 c1 + · · · + ul cl = 0. Since the rows of C are linearly indepen-
dent, we must have u = 0.
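To make Theorem 5.4 concrete, the following sketch computes FC by hand for the null hypothesis of Example 5.1 and compares it with the overall F statistic reported by lm (simulated data; the design is an arbitrary illustrative choice):

# Wald/F statistic for C beta = 0, where C drops the intercept
set.seed(1)
n = 100
x1 = rnorm(n); x2 = rnorm(n)
y = 1 + 0.5 * x1 - 0.5 * x2 + rnorm(n)
fit = lm(y ~ x1 + x2)
X = model.matrix(fit)
p = ncol(X)
C = cbind(0, diag(p - 1))                        # tests beta_2 = beta_3 = 0 jointly
Cb = C %*% coef(fit)
sigma2hat = sum(resid(fit)^2) / (n - p)
l = nrow(C)
FC = drop(t(Cb) %*% solve(C %*% solve(t(X) %*% X) %*% t(C), Cb)) / (l * sigma2hat)
c(FC, summary(fit)$fstatistic[1])                # the two F statistics agree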

5.3 Prediction based on pivotal quantities


Practitioners use OLS not only to infer β but also to predict future outcomes. For the pair
of future data (xn+1 , yn+1 ), we observe only xn+1 and want to predict yn+1 based on (X, Y )
and xn+1 . Assume a stable relationship between yn+1 and xn+1 , that is,

yn+1 ∼ N(xtn+1 β, σ 2 )

with the same (β, σ 2 ).


First, we can predict the mean of yn+1 which is xtn+1 β. It is just a one-dimensional
linear function of β, so the theory in Theorem 5.3 is directly applicable. A natural unbiased
predictor is xtn+1 β̂ with 1 − α level prediction interval

xtn+1 β̂ ± t1−α/2,n−p sê_{xn+1} .

Second, we can predict yn+1 itself, which is a random variable. We can still use xtn+1 β̂ as
a natural unbiased predictor but need to modify the prediction interval. Because yn+1 ⊥⊥ β̂,
we have
yn+1 − xtn+1 β̂ ∼ N( 0, σ 2 + σ 2 xtn+1 (X t X)−1 xn+1 ),
and therefore
(yn+1 − xtn+1 β̂) / √{σ̂ 2 + σ̂ 2 xtn+1 (X t X)−1 xn+1 }
= [ (yn+1 − xtn+1 β̂) / √{σ 2 + σ 2 xtn+1 (X t X)−1 xn+1 } ] / √{σ̂ 2 /σ 2 }
∼ N(0, 1) / √{χ2n−p /(n − p)},

where N(0, 1) and χ2n−p denote independent standard Normal and χ2n−p random variables, respectively, with a little abuse of notation. Therefore,

(yn+1 − xtn+1 β̂) / √{σ̂ 2 + σ̂ 2 xtn+1 (X t X)−1 xn+1 } ∼ tn−p

is a pivotal quantity. Define the squared prediction error as

pê2_{xn+1} = σ̂ 2 + σ̂ 2 xtn+1 (X t X)−1 xn+1
= σ̂ 2 { 1 + n−1 xtn+1 ( n−1 Σ_{i=1}^n xi xti )−1 xn+1 },

which has two components. The first one has magnitude close to σ 2 , which is of constant order. The second one has a magnitude decreasing in n if n−1 Σ_{i=1}^n xi xti converges to a finite limit with large n. Therefore, the first component dominates the second one with large n, which results in the main difference between predicting the mean of yn+1 and predicting yn+1 itself. Using the notation pê_{xn+1} , we can construct the following 1 − α level prediction interval:

xtn+1 β̂ ± t1−α/2,n−p pê_{xn+1} .
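The following sketch computes this prediction interval by hand and checks it against the predict function (simulated data; the new covariate value x = 1 is an arbitrary choice):

# 95% prediction interval for a new outcome at x = 1
set.seed(1)
n = 100
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
fit = lm(y ~ x)
X = model.matrix(fit)
x0 = c(1, 1)                                  # intercept and new covariate value
sigma2hat = sum(resid(fit)^2) / (n - 2)
pe = sqrt(sigma2hat * (1 + drop(t(x0) %*% solve(t(X) %*% X) %*% x0)))
sum(coef(fit) * x0) + c(-1, 1) * qt(0.975, n - 2) * pe
predict(fit, data.frame(x = 1), interval = "prediction")   # matches the interval above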

5.4 Examples and R implementation


Below I illustrate the theory in this chapter with two classic datasets.

5.4.1 Univariate regression


Revisiting Galton’s data, we have the following result:
> library ( " HistData " )
> galton _ fit = lm ( childHeight ~ midparentHeight ,
+ data = Galt onFamil ies )
> round ( summary ( galton _ fit )$ coef , 3 )
Estimate Std . Error t value Pr ( >| t |)
( Intercept ) 22.636 4.265 5.307 0
m id pa r en tH ei g ht 0.637 0.062 10.345 0

With the fitted line, we can predict childHeight at different values of midparentHeight. In
the predict function, if we specify interval = "confidence", it gives the confidence intervals for
the means of the new outcomes; if we specify interval = "prediction", it gives the prediction
intervals for the new outcomes themselves.
> new_mph = seq(60, 80, by = 0.5)
> new_data = data.frame(midparentHeight = new_mph)
> new_ci = predict(galton_fit, new_data,
+                  interval = "confidence")
> new_pi = predict(galton_fit, new_data,
+                  interval = "prediction")
> round(head(cbind(new_ci, new_pi)), 3)
     fit    lwr    upr    fit    lwr    upr
1 60.878 59.744 62.012 60.878 54.126 67.630
2 61.197 60.122 62.272 61.197 54.454 67.939
3 61.515 60.499 62.531 61.515 54.782 68.249
4 61.834 60.877 62.791 61.834 55.109 68.559
5 62.153 61.254 63.051 62.153 55.436 68.869
6 62.471 61.632 63.311 62.471 55.762 69.180

Figure 5.1 plots the fitted line as well as the confidence intervals and prediction intervals
at level 95%. The file code5.4.1.R contains the R code.

5.4.2 Multivariate regression


The R package Matching contains an experimental dataset lalonde from LaLonde (1986).
Units were randomly assigned to the job training program, with treat being the treatment
indicator. The outcome re78 is the real earnings in 1978, and other variables are pretreatment
covariates. From the simple OLS, the treatment has a significant positive effect, whereas
none of the covariates are predictive of the outcome.
> library ( " Matching " )
> data ( lalonde )
> lalonde _ fit = lm ( re 7 8 ~ . , data = lalonde )
> summary ( lalonde _ fit )

Call :
lm ( formula = re 7 8 ~ . , data = lalonde )

Residuals :
Min 1 Q Median 3Q Max
-9 6 1 2 -4 3 5 5 -1 5 7 2 3054 53119

FIGURE 5.1: Prediction in Galton’s regression

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.567e+02  3.522e+03   0.073  0.94193
age          5.357e+01  4.581e+01   1.170  0.24284
educ         4.008e+02  2.288e+02   1.751  0.08058 .
black       -2.037e+03  1.174e+03  -1.736  0.08331 .
hisp         4.258e+02  1.565e+03   0.272  0.78562
married     -1.463e+02  8.823e+02  -0.166  0.86835
nodegr      -1.518e+01  1.006e+03  -0.015  0.98797
re74         1.234e-01  8.784e-02   1.405  0.16079
re75         1.974e-02  1.503e-01   0.131  0.89554
u74          1.380e+03  1.188e+03   1.162  0.24590
u75         -1.071e+03  1.025e+03  -1.045  0.29651
treat        1.671e+03  6.411e+02   2.606  0.00948 **

Residual standard error: 6517 on 433 degrees of freedom
Multiple R-squared:  0.05822,   Adjusted R-squared:  0.0343
F-statistic: 2.433 on 11 and 433 DF,  p-value: 0.005974

The above result shows that none of the pretreatment covariates is significant. It is also
of interest to test whether they are jointly significant. The result below shows that they are
only marginally significant at the level 0.05 based on a joint test.
> library ( " car " )
> l i n e a r H y p o t he s i s ( lalonde _ fit ,
+ c ( " age = 0 " , " educ = 0 " , " black = 0 " ,
+ " hisp = 0 " , " married = 0 " , " nodegr = 0 " ,
+ " re 7 4 = 0 " , " re 7 5 = 0 " , " u 7 4 = 0 " ,
+ " u 7 5 = 0 " ))
Linear hypothesis test
Normal Linear Model: Inference and Prediction 37

Hypothesis :
age = 0
educ = 0
black = 0
hisp = 0
married = 0
nodegr = 0
re 7 4 = 0
re 7 5 = 0
u74 = 0
u75 = 0

Model 1 : restricted model


Model 2 : re 7 8 ~ age + educ + black + hisp + married + nodegr + re 7 4 +
re 7 5 + u 7 4 + u 7 5 + treat

Res . Df RSS Df Sum of Sq F Pr ( > F )


1 443 1.9178e+10
2 433 1.8389e+10 10 788799023 1.8574 0.04929 *

Below I create two pseudo datasets: one with all units assigned to the treatment, and
the other with all units assigned to the control, fixing all the pretreatment covariates. The
predicted outcomes are the counterfactual outcomes under the treatment and control. I
further calculate their means and verify that their difference equals the OLS coefficient of
treat.
> new_treat = lalonde
> new_treat$treat = 1
> predict_lalonde1 = predict(lalonde_fit, new_treat,
+                            interval = "none")
> new_control = lalonde
> new_control$treat = 0
> predict_lalonde0 = predict(lalonde_fit, new_control,
+                            interval = "none")
> mean(predict_lalonde1)
[1] 6276.91
> mean(predict_lalonde0)
[1] 4606.201
>
> mean(predict_lalonde1) - mean(predict_lalonde0)
[1] 1670.709

5.5 Homework problems


5.1 MLE
Under the Normal linear model, show that the maximum likelihood estimator (MLE) for β
is the OLS estimator, but the MLE for σ 2 is σ̃ 2 = rss/n. Compare the mean squared errors
of σ̂ 2 and σ̃ 2 for estimating σ 2 .

5.2 MLE with Laplace errors


Assume that yi = xti β + σεi , where the εi ’s are i.i.d. with the Laplace density f (ε) = 2−1 e−|ε| (i = 1, . . . , n). Find the MLEs of (β, σ 2 ).
Remark: We will revisit this problem in Chapter 26.

5.3 Joint prediction


With multiple future data points (Xn+1 , Yn+1 ) where Xn+1 ∈ Rl×p and Yn+1 ∈ Rl , con-
struct the joint predictors and prediction region for Yn+1 based on (X, Y ) and Xn+1 .
As a starting point, you can assume that l ≤ p and the rows of Xn+1 are linearly
independent. You can then consider the case in which the rows of Xn+1 are not linearly
independent.
Hint: Use Theorem B.10.

5.4 Two-sample problem


1. Assume that z1 , . . . , zm are IID N(µ1 , σ 2 ), w1 , . . . , wn are IID N(µ2 , σ 2 ), and test H0 : µ1 = µ2 . Show that under H0 , the t statistic with the pooled variance estimator has the following distribution:

tequal = (z̄ − w̄) / √{σ̂ 2 (m−1 + n−1 )} ∼ tm+n−2 ,

where

σ̂ 2 = {(m − 1)Sz2 + (n − 1)Sw2 }/(m + n − 2)
with the sample means

z̄ = m−1 Σ_{i=1}^m zi , w̄ = n−1 Σ_{i=1}^n wi ,

and the sample variances

Sz2 = (m − 1)−1 Σ_{i=1}^m (zi − z̄)2 , Sw2 = (n − 1)−1 Σ_{i=1}^n (wi − w̄)2 .

Remark: The name “equal” is motivated by the “var.equal” parameter of the R function
t.test.

2. We can write the above problem as testing hypothesis H0 : β1 = 0 in the linear regression
Y = Xβ + ε, where

Y = (z1 , . . . , zm , w1 , . . . , wn )t , ε = (ε1 , . . . , εm , εm+1 , . . . , εm+n )t , β = (β0 , β1 )t ,

and X ∈ R(m+n)×2 has first column 1m+n and second column (1tm , 0tn )t ; that is, the rows of X are (1, 1) for the first m units and (1, 0) for the last n units.

Based on the Normal linear model, we can compute the t statistic. Show that it is
identical to tequal .

5.5 Analysis of Variance (ANOVA) with a multi-level treatment


Let xi be the indicator vector for J treatment levels in a completely randomized experiment,
for example, xi = ej = (0, . . . , 1, . . . , 0)t with the jth element being one if unit i receives
treatment level j (j = 1, . . . , J). Let yi be the outcome of unit i (i = 1, . . . , n). Let Tj be the indices of units receiving treatment j, and let nj = |Tj | be the sample size and ȳj = nj−1 Σ_{i∈Tj} yi be the sample mean of the outcomes under treatment j. Define ȳ = n−1 Σ_{i=1}^n yi as the grand mean. We can test whether the treatment has any effect on the outcome by testing the null hypothesis

H0 : β1 = · · · = βJ

in the Normal linear model Y = Xβ + ε assuming ε ∼ N(0, σ 2 In ). This is a special case of testing Cβ = 0. Find C and show that the corresponding F statistic is identical to

F = { Σ_{j=1}^J nj (ȳj − ȳ)2 /(J − 1) } / { Σ_{j=1}^J Σ_{i∈Tj} (yi − ȳj )2 /(n − J) } ∼ FJ−1,n−J .

Remarks: (1) This is Fisher’s F statistic. (2) In this linear model formulation, X does not
contain a column of 1’s. (3) The choice of C is not unique, but the final formula for F is.
(4) You may use the Sherman–Morrison formula in Problem 1.3.

5.6 Confidence interval for σ 2


Based on Theorem 5.1, construct a 1 − α level confidence interval for σ 2 .

5.7 Relationship between t and F


Show that when C contains only one row ct , then Tc2 = FC , where Tc is defined in Theorem 5.3 and FC is defined in Theorem 5.4.

5.8 rss and t-statistic in univariate OLS


Focus on univariate OLS discussed in Chapter 2: yi = α̂ + β̂xi + ε̂i (i = 1, . . . , n). Show
that rss equals

Σ_{i=1}^n ε̂i2 = Σ_{i=1}^n (yi − ȳ)2 (1 − ρ̂2xy ),

and under the homoskedasticity assumption, the t-statistic associated with β̂ equals

ρ̂xy / √{(1 − ρ̂2xy )/(n − 2)} .

5.9 Equivalence of the t-statistics


With the data (xi , yi ), i = 1, . . . , n, where both xi and yi are scalars, run the OLS fit of yi on (1, xi ) to
obtain ty|x , the t-statistic of the coefficient of xi , under the homoskedasticity assumption.
Run OLS fit of xi on (1, yi ) to obtain tx|y , the t-statistic of the coefficient of yi , under the
homoskedasticity assumption.
Show ty|x = tx|y .
Remark: This is a numerical result that holds without any stochastic assumptions. I give
an example below.
> library(MASS)
> # simulate bivariate normal distribution
> xy = mvrnorm(n = 100, mu = c(0, 0),
+              Sigma = matrix(c(1, 0.5, 0.5, 1), ncol = 2))
> xy = as.data.frame(xy)
> colnames(xy) = c("x", "y")
> ## OLS
> reg.y.x = lm(y ~ x, data = xy)
> reg.x.y = lm(x ~ y, data = xy)
> ## compare t statistics based on homoskedastic errors
> summary(reg.y.x)$coef[2, 3]
[1] 4.470331
> summary(reg.x.y)$coef[2, 3]
[1] 4.470331

The equivalence of the t-statistics from the OLS fit of y on x and that of x on y demonstrates
that based on OLS, the data do not contain any information about the direction of the
relationship between x and y.

5.10 An application
The R package sampleSelection (Toomet and Henningsen, 2008) describes the dataset RandHIE
as follows: “The RAND Health Insurance Experiment was a comprehensive study of health
care cost, utilization and outcome in the United States. It is the only randomized study
of health insurance, and the only study which can give definitive evidence as to the causal
effects of different health insurance plans.” You can find more detailed information about
other variables in this package. The main outcome of interest lnmeddol means the log of
medical expenses. Use linear regression to investigate the relationship between the outcome
and various important covariates.
Note that the solution to this problem is not unique, but you need to justify your choice
of covariates and model, and need to interpret the results.
6
Asymptotic Inference in OLS: the
Eicker–Huber–White (EHW) robust standard error

6.1 Motivation
Standard software packages, for example, R, report the point estimator, standard error, and
p-value for each coordinate of β based on the Normal linear model:

Y = Xβ + ε ∼ N(Xβ, σ 2 In ).

Statistical inference based on this model is finite-sample exact. However, the assumptions of
this model are extremely strong: the functional form is linear, the error terms are additive
with distributions not dependent on X, and the error terms are IID Normal with the same
variance. If we do not believe some of these assumptions, can we still trust the associated
statistical inference? Let us start with some simple numerical examples, with the R code in
code6.1.R.

6.1.1 Numerical examples


The first one is the ideal Normal linear model:
> library(car)
> n = 200
> x = runif(n, -2, 2)
> beta = 1
> xbeta = x * beta
> Simu1 = replicate(5000,
+   { y = xbeta + rnorm(n)
+     ols.fit = lm(y ~ x)
+     c(summary(ols.fit)$coef[2, 1:2],
+       sqrt(hccm(ols.fit)[2, 2]))
+   })

In the above, we generate outcomes from a simple linear model yi = xi + εi with IID εi ∼ N(0, σ 2 = 1). Over 5000 replications of the data, we computed the OLS coefficient β̂
of xi and reported two standard errors. One is the standard error discussed in Chapter 5
under the Normal linear model, which is also the default choice of the lm function of R. The
other one, computed by the hccm function in the R package car, will be the main topic of
this chapter. The (1, 1) the panel of Figure 6.1 shows the histogram of the estimator and
reports the standard error (se0), as well as two estimated standard errors (se1 and se2).
The distribution of β̂ is symmetric and bell-shaped around the true parameter 1, and the
estimated standard errors are close to the true one.
To investigate the impact of Normality, we change the error terms to be IID exponential
with mean 1 and variance 1.
> Simu2 = replicate(5000,


+   { y = xbeta + rexp(n)
+     ols.fit = lm(y ~ x)
+     c(summary(ols.fit)$coef[2, 1:2],
+       sqrt(hccm(ols.fit)[2, 2]))
+   })

The (1, 2) panel of Figure 6.1 corresponds to this setting. With non-Normal errors, β̂ is
still symmetric and bell-shaped around the true parameter 1, and the estimated standard
errors are close to the true one. So the Normality of the error terms does not seem to be
a crucial assumption for the validity of the inference procedure under the Normal linear
model.
We then generate errors from Normal with variance depending on x:
> Simu3 = replicate(5000,
+   { y = xbeta + rnorm(n, 0, abs(x))
+     ols.fit = lm(y ~ x)
+     c(summary(ols.fit)$coef[2, 1:2],
+       sqrt(hccm(ols.fit)[2, 2]))
+   })

The (2, 1) panel of Figure 6.1 corresponds to this setting. With heteroskedastic Normal
errors, β̂ is symmetric and bell-shaped around the true parameter 1, se2 is close to se0, but
se1 underestimates se0. So the heteroskedasticity of the error terms does not change the
Normality of the OLS estimator dramatically, although the statistical inference discussed
in Chapter 5 can be invalid.
Finally, we generate heteroskedastic non-Normal errors:
> Simu4 = replicate(5000,
+   { y = xbeta + runif(n, -x^2, x^2)
+     ols.fit = lm(y ~ x)
+     c(summary(ols.fit)$coef[2, 1:2],
+       sqrt(hccm(ols.fit)[2, 2]))
+   })

The (2, 2) panel of Figure 6.1 corresponds to this setting, which has a similar pattern as the
(2, 1) panel. So the Normality of the error terms is not crucial, but the homoskedasticity is.

6.1.2 Goal of this chapter


In this chapter, we will still impose the linearity assumption, but relax the distributional
assumption on the error terms. We assume the following heteroskedastic linear model.

Assumption 6.1 (Heteroskedastic linear model) We have

yi = xti β + εi ,

where the εi ’s are independent with mean zero and variance σi2 (i = 1, . . . , n). The design
matrix X is fixed with linearly independent column vectors, and (β, σ12 , . . . , σn2 ) are unknown
parameters.

Because the error terms can have different variances, they are not IID in general under
the heteroskedastic linear model. Their variances can be functions of the xi ’s, and the
variances σi2 are n free unknown numbers. Again, treating the xi ’s as fixed is not essential,
because we can condition on them if they are random. Without imposing Normality on the
error terms, we cannot determine the finite sample exact distribution of the OLS estimator.
This chapter will use the asymptotic analysis, assuming that the sample size n is large so
that certain limiting theorems hold.

The asymptotic analysis later will show that if the error terms are homoskedastic, i.e.,
σi2 = σ 2 for all i = 1, . . . , n, we can still trust the statistical inference discussed in Chapter
5 based on the Normal linear model as long as the central limit theorem (CLT) for the OLS
estimator holds as n → ∞. If the error terms are heteroskedastic, i.e., their variances
are different, we must adjust the standard error with the so-called Eicker–Huber–White
heteroskedasticity robust standard error. I will give the technical details below. If you are
unfamiliar with the asymptotic analysis, please first review the basics in Chapter C.

6.2 Consistency of OLS


Under the heteroskedastic linear model, the OLS estimator β̂ is still unbiased for β because
the error terms have mean zero. Moreover, we can show that it is consistent for β with
large n and some regularity conditions. We start with a useful lemma.
Lemma 6.1 Under Assumption 6.1, the OLS estimator has the representation β̂ − β =
Bn−1 ξn , where
Bn = n−1 Σ_{i=1}^n xi xti , ξn = n−1 Σ_{i=1}^n xi εi .

Proof of Lemma 6.1: Since yi = xti β + εi , we have

β̂ = Bn−1 n−1 Σ_{i=1}^n xi yi
= Bn−1 n−1 Σ_{i=1}^n xi (xti β + εi )
= Bn−1 Bn β + Bn−1 n−1 Σ_{i=1}^n xi εi
= β + Bn−1 n−1 Σ_{i=1}^n xi εi . □


In the representation of Lemma 6.1, Bn is fixed and ξn is random. Since E(ξn ) = 0, we
know that E(β̂) = β, so the OLS estimator is unbiased. Moreover,
cov(ξn ) = cov( n−1 Σ_{i=1}^n xi εi ) = n−2 Σ_{i=1}^n σi2 xi xti = Mn /n,

where

Mn = n−1 Σ_{i=1}^n σi2 xi xti .

So the covariance of the OLS estimator is

cov(β̂) = n−1 Bn−1 Mn Bn−1 .

It has a sandwich form, justifying the choice of notation Bn for the “bread” and Mn for the
“meat.”
Intuitively, if Bn and Mn have finite limits, then the covariance of β̂ shrinks to zero with
large n, implying that β̂ will concentrate near its mean β. This is the idea of consistency,
formally stated below.

Assumption 6.2 Bn → B and Mn → M where B and M are finite with B invertible.

Theorem 6.1 Under Assumptions 6.1 and 6.2, we have β̂ → β in probability.

Proof of Theorem 6.1: We only need to show that ξn → 0 in probability. It has mean
zero and covariance matrix Mn /n, so it converges to zero in probability using Proposition
C.4 in Chapter C. □

6.3 Asymptotic Normality of the OLS estimator


Intuitively, ξn is the sample average of some independent terms, and therefore, the classic
Lindeberg–Feller theorem guarantees that it enjoys a CLT under some regularity conditions.
Consequently, β̂ also enjoys a CLT with mean β and covariance matrix n−1 Bn−1 Mn Bn−1 . The
asymptotic results in this chapter require rather tedious regularity conditions. I give them
for generality, and they hold automatically if we are willing to assume that the covariates
and error terms are all bounded by a constant not depending on n. These general conditions
are basically moment conditions required by the law of large numbers and CLT. You do not
have to pay too much attention to the conditions when you first read this chapter.
The CLT relies on an additional condition on a higher-order moment
d_{2+δ,n} = n−1 Σ_{i=1}^n ∥xi ∥^{2+δ} E(εi^{2+δ}).

Theorem 6.2 Under Assumptions 6.1 and 6.2, if there exist δ > 0 and C > 0 not depending on n such that d_{2+δ,n} ≤ C, then

√n (β̂ − β) → N(0, B −1 M B −1 )

in distribution.

Proof of Theorem 6.2: The key is to show the CLT for ξn ; the CLT for β̂ then holds due to Slutsky’s Theorem; see Chapter C for a review. Define

zn,i = n−1/2 xi εi , (i = 1, . . . , n)

with mean zero and finite covariance, and we need to verify the two conditions required by the Lindeberg–Feller CLT stated as Proposition C.8 in Chapter C. First, the Lyapunov condition holds because

Σ_{i=1}^n E∥zn,i ∥^{2+δ} = Σ_{i=1}^n E( n^{−(2+δ)/2} ∥xi ∥^{2+δ} εi^{2+δ} )
= n^{−δ/2} × n−1 Σ_{i=1}^n ∥xi ∥^{2+δ} E(εi^{2+δ})
= n^{−δ/2} × d_{2+δ,n} → 0.

Second,

Σ_{i=1}^n cov(zn,i ) = n−1 Σ_{i=1}^n σi2 xi xti = Mn → M.

So the Lindeberg–Feller CLT implies that n−1/2 Σ_{i=1}^n xi εi = Σ_{i=1}^n zn,i → N(0, M ) in distribution. □

6.4 Eicker–Huber–White standard error


6.4.1 Sandwich variance estimator
The CLT in Theorem 6.2 shows that
β̂ ∼a N(β, n−1 B −1 M B −1 ),

where ∼a denotes “approximation in distribution.” However, the asymptotic covariance is
unknown, and we need to use the data to construct a reasonable estimator for statistical
inference. It is relatively easy to replace B with its sample analog Bn , but
M̃n = n−1 Σ_{i=1}^n ε2i xi xti

as the sample analog for M is not directly useful because the error terms are unknown either. It is natural to use ε̂2i to replace ε2i to obtain the following estimator for M :

M̂n = n−1 Σ_{i=1}^n ε̂2i xi xti .

Although each ε̂2i is a poor estimator for σi2 , the sample average M̂n turns out to be
well-behaved with large n and the regularity conditions below.

Theorem 6.3 Under Assumptions 6.1 and 6.2, we have M̂n → M in probability if
n−1 Σ_{i=1}^n var(ε2i ) xij1^2 xij2^2 , n−1 Σ_{i=1}^n xij1 xij2 xij3 xij4 , n−1 Σ_{i=1}^n σi2 xij1^2 xij2^2 xij3^2 (6.1)

are bounded from above by a constant C not depending on n for any j1 , j2 , j3 , j4 = 1, . . . , p.



Proof of Theorem 6.3: Assumption 6.2 ensures that β̂ → β in probability by Theorem 6.1.
Markov’s inequality and the boundedness of the first term in (6.1) ensure that M̃n −Mn → 0
in probability. So we only need to show that M̂n − M̃n → 0 in probability. The (j1 , j2 )th
element of their difference is
(M̂n − M̃n )j1 ,j2 = n−1 Σ_{i=1}^n ε̂2i xi,j1 xi,j2 − n−1 Σ_{i=1}^n ε2i xi,j1 xi,j2
= n−1 Σ_{i=1}^n { (εi + xti β − xti β̂)2 − ε2i } xi,j1 xi,j2
= n−1 Σ_{i=1}^n { (xti β − xti β̂)2 + 2εi (xti β − xti β̂) } xi,j1 xi,j2
= (β − β̂)t { n−1 Σ_{i=1}^n xi xti xi,j1 xi,j2 } (β − β̂) + 2(β − β̂)t n−1 Σ_{i=1}^n xi xi,j1 xi,j2 εi .

It converges to zero in probability because the first term converges to zero due to the
boundedness of the second term in (6.1), and the second term converges to zero in probability
due to Markov’s inequality and the boundedness of the third term in (6.1). □
The final variance estimator for β̂ is
V̂ehw = n−1 ( n−1 Σ_{i=1}^n xi xti )−1 ( n−1 Σ_{i=1}^n ε̂2i xi xti ) ( n−1 Σ_{i=1}^n xi xti )−1 ,

which is called the Eicker–Huber–White (EHW) heteroskedasticity robust covariance ma-


trix. In matrix form, it equals
V̂ehw = (X t X)−1 (X t Ω̂X)(X t X)−1 ,

where Ω̂ = diag{ε̂21 , . . . , ε̂2n }. Eicker (1967) first proposed to use V̂ehw . White (1980a) pop-
ularized it in economics which has been influential in empirical research. Related estimators
appeared in many other contexts of statistics. Cox (1961) and Huber (1967) discussed
the sandwich variance in the context of misspecified parametric models; see Section D.2.
Fuller (1975) proposed a more general form of V̂ehw in the context of survey sampling.
The square roots of the diagonal terms of V̂ehw , denoted by sê_{ehw,j} (j = 1, . . . , p), are called
White standard errors, Huber–White standard errors, or Eicker–Huber–White standard er-
rors, among many other names.
We can conduct statistical inference based on Normal approximations. For example, we can test linear hypotheses based on

β̂ ∼a N(β, V̂ehw ),

and in particular, we can infer each element of the coefficient based on

β̂j ∼a N(βj , sê2_{ehw,j} ).
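The following sketch computes the HC0 version of V̂ehw directly from its sandwich form and compares it with the hccm function from the car package (simulated heteroskedastic data, an arbitrary illustrative choice):

# EHW (HC0) covariance matrix from the sandwich formula
library(car)
set.seed(1)
n = 200
x = runif(n, -2, 2)
y = x + rnorm(n, sd = abs(x))        # heteroskedastic errors
fit = lm(y ~ x)
X = model.matrix(fit)
res = resid(fit)
bread = solve(t(X) %*% X)
meat = t(X) %*% diag(res^2) %*% X
V_hc0 = bread %*% meat %*% bread
max(abs(V_hc0 - hccm(fit, type = "hc0")))   # numerically zero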

6.4.2 Other “HC” standard errors


Statistical inference based on the EHW standard error relaxes the parametric assumptions of
the Normal linear model. However, its validity relies strongly on the asymptotic argument.

In finite samples, it can behave poorly. Since White (1980a) published his paper, several modifications of V̂ehw appeared, aiming for better finite-sample properties. I summarize them below. They all rely on the hii ’s, which are the diagonal elements of the projection matrix H and are called the leverage scores. Define
V̂ehw,k = n−1 ( n−1 Σ_{i=1}^n xi xti )−1 ( n−1 Σ_{i=1}^n ε̂2_{i,k} xi xti ) ( n−1 Σ_{i=1}^n xi xti )−1 ,

where

ε̂i,k = ε̂i , (k = 0, HC0);
ε̂i,k = ε̂i √{n/(n − p)}, (k = 1, HC1);
ε̂i,k = ε̂i /√(1 − hii ), (k = 2, HC2);
ε̂i,k = ε̂i /(1 − hii ), (k = 3, HC3);
ε̂i,k = ε̂i /(1 − hii )^{min{2, nhii /(2p)}}, (k = 4, HC4).
The HC1 correction is similar to the degrees of freedom correction in the OLS covariance
estimator. The HC2 correction was motivated by the unbiasedness of covariance when the
error terms have the same variance; see Problem 6.8 for more details. The HC3 correction
was motivated by a method called the jackknife, which will be discussed in Chapter 11. This version appeared even earlier than White (1980a); see Miller (1974), Hinkley (1977), and
Reeds (1978). I do not have a good intuition for the HC4 correction. See MacKinnon and
White (1985), Long and Ervin (2000) and Cribari-Neto (2004) for reviews. Using simulation
studies, Long and Ervin (2000) recommended HC3.
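Continuing the sketch from Section 6.4.1 (reusing fit, X, res, and bread defined there), the HC2 and HC3 corrections can be computed from the leverage scores as follows; this is only an illustration, not the internal implementation of hccm:

# HC2 and HC3 corrections based on the leverage scores h_ii
h = hatvalues(fit)
V_hc2 = bread %*% (t(X) %*% diag(res^2 / (1 - h)) %*% X) %*% bread
V_hc3 = bread %*% (t(X) %*% diag(res^2 / (1 - h)^2) %*% X) %*% bread
max(abs(V_hc2 - hccm(fit, type = "hc2")))   # numerically zero
max(abs(V_hc3 - hccm(fit, type = "hc3")))   # numerically zero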

6.4.3 Special case with homoskedasticity


As an important special case with σi2 = σ 2 for all i = 1, . . . , n, we have
Mn = σ 2 n−1 Σ_{i=1}^n xi xti = σ 2 Bn ,

which simplifies the covariance of β̂ to cov(β̂) = σ 2 Bn−1 /n, and the asymptotic Normality to √n (β̂ − β) → N(0, σ 2 B −1 ) in distribution. We have shown that under the Gauss–Markov model, σ̂ 2 = (n − p)−1 Σ_{i=1}^n ε̂2i is unbiased for σ 2 . Moreover, σ̂ 2 is consistent for σ 2 under the same condition as Theorem 6.1, justifying the use of
V̂ = σ̂ 2 ( Σ_{i=1}^n xi xti )−1 = σ̂ 2 (X t X)−1

as the covariance estimator. So under homoskedasticity, we can conduct statistical inference


based on the following approximate Normality:
 
a −1
β̂ ∼ N β, σ̂ 2 (X t X) .

It is slightly different from the inference based on t and F distributions. But with large n,
the difference is very small.
I will end this section with a formal result on the consistency of σ̂ 2 .

Theorem 6.4 Under Assumptions 6.1 and 6.2, we have σ̂ 2 → σ 2 in probability if σi2 = σ 2 < ∞ for all i = 1, . . . , n and n−1 Σ_{i=1}^n var(ε2i ) is bounded above by a constant not depending on n.
Proof of Theorem 6.4: Using Markov’s inequality, we can show that n−1 Σ_{i=1}^n ε2i → σ 2 in probability. In addition, n−1 Σ_{i=1}^n ε̂2i has the same probability limit as σ̂ 2 . So we only need to show that n−1 Σ_{i=1}^n ε̂2i − n−1 Σ_{i=1}^n ε2i → 0 in probability. Their difference is

n−1 Σ_{i=1}^n ε̂2i − n−1 Σ_{i=1}^n ε2i
= n−1 Σ_{i=1}^n { (εi + xti β − xti β̂)2 − ε2i }
= n−1 Σ_{i=1}^n { (xti β − xti β̂)2 + 2(xti β − xti β̂)εi }
= (β − β̂)t ( n−1 Σ_{i=1}^n xi xti ) (β − β̂) + 2(β − β̂)t n−1 Σ_{i=1}^n xi εi
= −(β − β̂)t ( n−1 Σ_{i=1}^n xi xti ) (β − β̂),

where the last step follows from Lemma 6.1. So the difference converges to zero in probability because β̂ − β → 0 in probability by Theorem 6.1 and Bn → B by Assumption 6.2. □

6.5 Examples
I use three examples to compare various standard errors for the regression coefficients, with
the R code in code6.5.R. The car package contains the hccm function that implements the
EHW standard errors.
> library ( " car " )

6.5.1 LaLonde experimental data


First, I revisit the lalonde data. In the following analysis, different standard errors give
similar t-values. Only treat is significant, but all other pretreatment covariates are not.
> library("Matching")
> data(lalonde)
> ols.fit = lm(re78 ~ ., data = lalonde)
> ols.fit.hc0 = sqrt(diag(hccm(ols.fit, type = "hc0")))
> ols.fit.hc1 = sqrt(diag(hccm(ols.fit, type = "hc1")))
> ols.fit.hc2 = sqrt(diag(hccm(ols.fit, type = "hc2")))
> ols.fit.hc3 = sqrt(diag(hccm(ols.fit, type = "hc3")))
> ols.fit.hc4 = sqrt(diag(hccm(ols.fit, type = "hc4")))
> ols.fit.coef = summary(ols.fit)$coef
> tvalues = ols.fit.coef[, 1]/
+   cbind(ols.fit.coef[, 2], ols.fit.hc0, ols.fit.hc1,
+         ols.fit.hc2, ols.fit.hc3, ols.fit.hc4)
> colnames(tvalues) = c("ols", "hc0", "hc1", "hc2", "hc3", "hc4")
> round(tvalues, 2)
              ols   hc0   hc1   hc2   hc3   hc4
(Intercept)  0.07  0.07  0.07  0.07  0.07  0.07
age          1.17  1.29  1.28  1.27  1.25  1.25
educ         1.75  2.03  2.00  1.99  1.94  1.92
black       -1.74 -2.00 -1.97 -1.95 -1.91 -1.91
hisp         0.27  0.30  0.30  0.30  0.29  0.29
married     -0.17 -0.17 -0.17 -0.17 -0.16 -0.16
nodegr      -0.02 -0.01 -0.01 -0.01 -0.01 -0.01
re74         1.40  0.98  0.96  0.92  0.87  0.77
re75         0.13  0.14  0.14  0.13  0.13  0.12
u74          1.16  0.89  0.88  0.87  0.85  0.83
u75         -1.05 -0.76 -0.75 -0.75 -0.74 -0.74
treat        2.61  2.49  2.46  2.45  2.41  2.40

6.5.2 Data from King and Roberts (2015)


The following example comes from King and Roberts (2015). The outcome variable is multilateral aid flows, and the covariates include log population, log population squared, gross domestic product, former colony status, distance from the Western world, political freedom, military expenditures, arms imports, and indicators for the years. For some coefficients, the robust standard errors are quite different from the OLS standard errors.
> library(foreign)
> dat = read.dta("isq.dta")
> dat = na.omit(dat[, c("multish", "lnpop", "lnpopsq",
+                       "lngdp", "lncolony", "lndist",
+                       "freedom", "militexp", "arms",
+                       "year83", "year86", "year89", "year92")])
> ols.fit = lm(multish ~ lnpop + lnpopsq + lngdp + lncolony
+              + lndist + freedom + militexp + arms
+              + year83 + year86 + year89 + year92, data = dat)
> ols.fit.hc0 = sqrt(diag(hccm(ols.fit, type = "hc0")))
> ols.fit.hc1 = sqrt(diag(hccm(ols.fit, type = "hc1")))
> ols.fit.hc2 = sqrt(diag(hccm(ols.fit, type = "hc2")))
> ols.fit.hc3 = sqrt(diag(hccm(ols.fit, type = "hc3")))
> ols.fit.hc4 = sqrt(diag(hccm(ols.fit, type = "hc4")))
> ols.fit.coef = summary(ols.fit)$coef
> tvalues = ols.fit.coef[, 1]/
+   cbind(ols.fit.coef[, 2], ols.fit.hc0, ols.fit.hc1,
+         ols.fit.hc2, ols.fit.hc3, ols.fit.hc4)
> colnames(tvalues) = c("ols", "hc0", "hc1", "hc2", "hc3", "hc4")
> round(tvalues, 2)
              ols   hc0   hc1   hc2   hc3   hc4
(Intercept)  7.40  4.60  4.54  4.43  4.27  4.14
lnpop       -8.25 -4.46 -4.40 -4.30 -4.14 -4.01
lnpopsq      9.56  4.79  4.72  4.61  4.44  4.31
lngdp       -6.39 -6.14 -6.06 -6.01 -5.88 -5.86
lncolony     4.70  4.75  4.69  4.64  4.53  4.47
lndist      -0.14 -0.16 -0.16 -0.16 -0.15 -0.16
freedom      2.25  1.80  1.78  1.75  1.69  1.65
militexp     0.51  0.59  0.59  0.57  0.55  0.52
arms         1.34  1.17  1.15  1.10  1.03  0.91
year83       1.05  0.85  0.84  0.83  0.80  0.79
year86       0.35  0.40  0.39  0.39  0.38  0.38
year89       0.70  0.81  0.80  0.80  0.78  0.79
year92       0.31  0.40  0.40  0.40  0.39  0.40

However, if we apply the log transformation to the outcome, then all standard errors
give similar t-values.
> ols.fit = lm(log(multish + 1) ~ lnpop + lnpopsq + lngdp + lncolony
+              + lndist + freedom + militexp + arms
+              + year83 + year86 + year89 + year92, data = dat)
> ols.fit.hc0 = sqrt(diag(hccm(ols.fit, type = "hc0")))
> ols.fit.hc1 = sqrt(diag(hccm(ols.fit, type = "hc1")))
> ols.fit.hc2 = sqrt(diag(hccm(ols.fit, type = "hc2")))
> ols.fit.hc3 = sqrt(diag(hccm(ols.fit, type = "hc3")))
> ols.fit.hc4 = sqrt(diag(hccm(ols.fit, type = "hc4")))
> ols.fit.coef = summary(ols.fit)$coef
> tvalues = ols.fit.coef[, 1]/
+   cbind(ols.fit.coef[, 2], ols.fit.hc0, ols.fit.hc1,
+         ols.fit.hc2, ols.fit.hc3, ols.fit.hc4)
> colnames(tvalues) = c("ols", "hc0", "hc1", "hc2", "hc3", "hc4")
> round(tvalues, 2)
              ols   hc0   hc1   hc2   hc3   hc4
(Intercept)  2.96  2.81  2.77  2.72  2.63  2.53
lnpop       -2.87 -2.63 -2.60 -2.54 -2.45 -2.35
lnpopsq      4.21  3.72  3.67  3.59  3.46  3.32
lngdp       -8.02 -7.49 -7.38 -7.38 -7.27 -7.33
lncolony     6.31  6.19  6.11  6.08  5.97  5.95
lndist      -0.16 -0.14 -0.14 -0.14 -0.14 -0.14
freedom      1.47  1.53  1.51  1.50  1.47  1.46
militexp    -0.32 -0.32 -0.31 -0.31 -0.30 -0.29
arms         1.27  1.12  1.10  1.05  0.98  0.86
year83       0.10  0.10  0.10  0.10  0.10  0.10
year86      -0.14 -0.14 -0.14 -0.14 -0.14 -0.14
year89       0.46  0.45  0.44  0.44  0.44  0.44
year92       0.03  0.03  0.03  0.03  0.03  0.03

In general, the difference between the OLS and EHW standard errors may be due to heteroskedasticity or to a poor linear approximation. The above two analyses, based on the original and log-transformed outcomes, suggest that the linear approximation works better for the log-transformed outcome. We will discuss the issues of transformation and model misspecification later.

6.5.3 Boston housing data


I also re-analyze the classic Boston housing data (Harrison Jr and Rubinfeld, 1978). The outcome variable is the median value of owner-occupied homes in thousands of US dollars, and the covariates include the per capita crime rate by town, the proportion of residential land zoned for lots over 25,000 square feet, the proportion of non-retail business acres per town, etc. You can find more details in the R package. In this example, different standard errors yield very different t-values.
> library("mlbench")
> data(BostonHousing)
> ols.fit = lm(medv ~ ., data = BostonHousing)
> summary(ols.fit)

Call:
lm(formula = medv ~ ., data = BostonHousing)

Residuals:
     Min       1Q   Median       3Q      Max
 -15.595   -2.730   -0.518    1.777   26.199

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288
chas1        2.687e+00  8.616e-01   3.118 0.001925 **
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
b            9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16

>
> ols.fit.hc0 = sqrt(diag(hccm(ols.fit, type = "hc0")))
> ols.fit.hc1 = sqrt(diag(hccm(ols.fit, type = "hc1")))
> ols.fit.hc2 = sqrt(diag(hccm(ols.fit, type = "hc2")))
> ols.fit.hc3 = sqrt(diag(hccm(ols.fit, type = "hc3")))
> ols.fit.hc4 = sqrt(diag(hccm(ols.fit, type = "hc4")))
> ols.fit.coef = summary(ols.fit)$coef
> tvalues = ols.fit.coef[, 1]/
+   cbind(ols.fit.coef[, 2], ols.fit.hc0, ols.fit.hc1,
+         ols.fit.hc2, ols.fit.hc3, ols.fit.hc4)
> colnames(tvalues) = c("ols", "hc0", "hc1", "hc2", "hc3", "hc4")
> round(tvalues, 2)
               ols   hc0   hc1   hc2   hc3   hc4
(Intercept)   7.14  4.62  4.56  4.48  4.33  4.25
crim         -3.29 -3.78 -3.73 -3.48 -3.17 -2.58
zn            3.38  3.42  3.37  3.35  3.27  3.28
indus         0.33  0.41  0.41  0.41  0.40  0.40
chas1         3.12  2.11  2.08  2.05  2.00  2.00
nox          -4.65 -4.76 -4.69 -4.64 -4.53 -4.52
rm            9.12  4.57  4.51  4.43  4.28  4.18
age           0.05  0.04  0.04  0.04  0.04  0.04
dis          -7.40 -6.97 -6.87 -6.81 -6.66 -6.66
rad           4.61  5.05  4.98  4.91  4.76  4.65
tax          -3.28 -4.65 -4.58 -4.54 -4.43 -4.42
ptratio      -7.28 -8.23 -8.11 -8.06 -7.89 -7.93
b             3.47  3.53  3.48  3.44  3.34  3.30
lstat       -10.35 -5.34 -5.27 -5.18 -5.01 -4.93
The log transformation of the outcome does not remove the discrepancy among the standard errors. In this example, heteroskedasticity seems to be an important problem.
> ols.fit = lm(log(medv) ~ ., data = BostonHousing)
> summary(ols.fit)

Call:
lm(formula = log(medv) ~ ., data = BostonHousing)

Residuals:
      Min        1Q    Median        3Q       Max
 -0.73361  -0.09747  -0.01657   0.09629   0.86435

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.1020423  0.2042726  20.081  < 2e-16 ***
crim        -0.0102715  0.0013155  -7.808 3.52e-14 ***
zn           0.0011725  0.0005495   2.134 0.033349 *
indus        0.0024668  0.0024614   1.002 0.316755
chas1        0.1008876  0.0344859   2.925 0.003598 **
nox         -0.7783993  0.1528902  -5.091 5.07e-07 ***
rm           0.0908331  0.0167280   5.430 8.87e-08 ***
age          0.0002106  0.0005287   0.398 0.690567
dis         -0.0490873  0.0079834  -6.149 1.62e-09 ***
rad          0.0142673  0.0026556   5.373 1.20e-07 ***
tax         -0.0006258  0.0001505  -4.157 3.80e-05 ***
ptratio     -0.0382715  0.0052365  -7.309 1.10e-12 ***
b            0.0004136  0.0001075   3.847 0.000135 ***
lstat       -0.0290355  0.0020299 -14.304  < 2e-16 ***

Residual standard error: 0.1899 on 492 degrees of freedom
Multiple R-squared: 0.7896, Adjusted R-squared: 0.7841
F-statistic: 142.1 on 13 and 492 DF, p-value: < 2.2e-16

>
> ols.fit.hc0 = sqrt(diag(hccm(ols.fit, type = "hc0")))
> ols.fit.hc1 = sqrt(diag(hccm(ols.fit, type = "hc1")))
> ols.fit.hc2 = sqrt(diag(hccm(ols.fit, type = "hc2")))
> ols.fit.hc3 = sqrt(diag(hccm(ols.fit, type = "hc3")))
> ols.fit.hc4 = sqrt(diag(hccm(ols.fit, type = "hc4")))
> ols.fit.coef = summary(ols.fit)$coef
> tvalues = ols.fit.coef[, 1]/
+   cbind(ols.fit.coef[, 2], ols.fit.hc0, ols.fit.hc1,
+         ols.fit.hc2, ols.fit.hc3, ols.fit.hc4)
> colnames(tvalues) = c("ols", "hc0", "hc1", "hc2", "hc3", "hc4")
> round(tvalues, 2)
               ols   hc0   hc1   hc2   hc3   hc4
(Intercept)  20.08 14.29 14.09 13.86 13.43 13.13
crim         -7.81 -5.31 -5.24 -4.85 -4.39 -3.56
zn            2.13  2.68  2.64  2.62  2.56  2.56
indus         1.00  1.46  1.44  1.43  1.40  1.41
chas1         2.93  2.69  2.66  2.62  2.56  2.56
nox          -5.09 -4.79 -4.72 -4.67 -4.56 -4.54
rm            5.43  3.31  3.26  3.20  3.10  3.02
age           0.40  0.33  0.32  0.32  0.31  0.31
dis          -6.15 -6.12 -6.03 -5.98 -5.84 -5.82
rad           5.37  5.23  5.16  5.05  4.87  4.67
tax          -4.16 -5.05 -4.98 -4.90 -4.76 -4.69
ptratio      -7.31 -8.84 -8.72 -8.67 -8.51 -8.55
b             3.85  2.80  2.76  2.72  2.65  2.59
lstat       -14.30 -7.86 -7.75 -7.63 -7.40 -7.28

6.6 Final remarks


The beauty of the asymptotic analysis and the EHW standard error is that they hold
under weak parametric assumptions on the error term. We do not need to modify the OLS
estimator but only need to modify the covariance estimator.
However, this framework has limitations. First, the proofs are based on limiting theorems that require the sample size to go to infinity. We are often unsure whether the sample size is large enough for a particular application at hand. Second, the EHW standard errors can be severely biased and have large variability in finite samples. Third, under the heteroskedastic linear model, the Gauss–Markov theorem does not hold, so the OLS estimator can be inefficient. We will discuss possible improvements in Chapter 19. Finally, unlike Section 5.3, we cannot create any reasonable prediction intervals for a future observation y_{n+1} based on (X, Y, x_{n+1}) since its variance σ²_{n+1} is fundamentally unknown without further assumptions.

6.7 Homework problems


6.1 Testing linear hypotheses under heteroskedasticity
Under the heteroskedastic linear model, how do we test the hypotheses
$$H_0: c^{t}\beta = 0$$
for c ∈ ℝᵖ, and
$$H_0: C\beta = 0$$
for C ∈ ℝ^{l×p} with linearly independent rows?

6.2 Two-sample problem continued


Continue Problem 5.4.
1. Assume that z₁, . . . , z_m are IID with mean μ₁ and variance σ₁², and w₁, . . . , w_n are IID with mean μ₂ and variance σ₂², and test H₀: μ₁ = μ₂. Show that under H₀, the following t statistic has an asymptotically Normal distribution:
$$
t_{\text{unequal}} = \frac{\bar z - \bar w}{\sqrt{S_z^2/m + S_w^2/n}} \to \mathrm{N}(0, 1)
$$
in distribution.
Remark: The name “unequal” is motivated by the “var.equal” parameter of the R function
t.test.

2. We can write the above problem as testing hypothesis H0 : β1 = 0 in the heteroskedastic


linear regression. Based on the EHW standard error, we can compute the t statistic.
Show that it is identical to tunequal with the HC2 correction.
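
Before attempting the algebra, the claimed equivalence can be checked numerically. The following sketch (hypothetical data; it uses hccm from the car package, as in Section 6.5) compares the two t statistics:

# Numerical check: the HC2-based t statistic from OLS equals the
# unequal-variance two-sample t statistic (hypothetical data).
library(car)
set.seed(5)
m = 30; n = 50
z = rnorm(m, mean = 1, sd = 1)
w = rnorm(n, mean = 1, sd = 3)
y = c(z, w)
d = rep(c(1, 0), c(m, n))          # group indicator
fit = lm(y ~ d)
t_hc2 = coef(fit)["d"]/sqrt(diag(hccm(fit, type = "hc2"))["d"])
t_unequal = t.test(z, w, var.equal = FALSE)$statistic
c(t_hc2, t_unequal)                # the two values coincide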

6.3 ANOVA with heteroskedasticity


This is an extension of Problem 5.5 in Chapter 5. Assume yᵢ | i ∈ T_j has mean β_j and variance σ_j², which can be rewritten as a linear model without the Normality and homoskedasticity. In the process of solving Problem 5.5, you have derived the estimator of the covariance matrix of the OLS estimator under homoskedasticity. Find the HC0 and HC2 versions of the EHW covariance matrix. Which covariance matrices do you recommend, and why?

6.4 Invariance of the EHW covariance estimator


If we transform X to X̃ = XΓ where Γ is a p×p non-degenerate matrix, the OLS fit changes from
$$Y = X\hat\beta + \hat\varepsilon$$
to
$$Y = \tilde X\tilde\beta + \tilde\varepsilon,$$
and the associated EHW covariance estimator changes from V̂_ehw to Ṽ_ehw. Show that
$$\hat V_{\mathrm{ehw}} = \Gamma \tilde V_{\mathrm{ehw}}\Gamma^{t},$$
and that the above result holds for HCj (j = 0, 1, 2, 3, 4). Show that the relationship also holds for the covariance estimator assuming homoskedasticity.
Hint: You can use the results in Problems 3.4 and 3.5.

6.5 Breakdown of the equivalence of the t-statistics based on the EHW standard error
This problem parallels Problem 5.9.
Consider the data (xᵢ, yᵢ)ᵢ₌₁ⁿ where both xᵢ and yᵢ are scalars. Run the OLS fit of yᵢ on (1, xᵢ)
to obtain ty|x , the t-statistic of the coefficient of xi , based on the EHW standard error. Run
OLS fit of xi on (1, yi ) to obtain tx|y , the t-statistic of the coefficient of yi , based on the
EHW standard error.
Give a counterexample with ty|x ̸= tx|y .

6.6 Empirical comparison of the standard errors


Long and Ervin (2000) reviewed and compared several commonly-used standard errors in
OLS. Redo their simulation and replicate their Figures 1–4. They specified more details of
their covariate generating process in a technical report (Long and Ervin, 1998).

6.7 Robust standard error in practice


King and Roberts (2015) gave three examples where the EHW standard errors differ
from the OLS standard error. I have replicated one example in Section 6.5.2. Replicate
another one using linear regression although the original analysis used Poisson regression. You can find the datasets used by King and Roberts (2015) at Harvard Dataverse (https://dataverse.harvard.edu/).

6.8 Unbiased sandwich variance estimator under the Gauss–Markov model


Under the Gauss–Markov model with σᵢ² = σ² for i = 1, . . . , n, show that the HC0 version of V̂_ehw is biased but the HC2 version of V̂_ehw is unbiased for cov(β̂).

[Figure 6.1 here: density plots of β̂ over the simulation replications, with Normal and non-Normal errors (columns) and homoskedastic and heteroskedastic errors (rows). In the homoskedastic panels, se0 = se1 = se2 = 0.064; in the heteroskedastic panels, (se0, se1, se2) = (0.093, 0.071, 0.094) with Normal errors and (0.088, 0.06, 0.089) with non-Normal errors.]

FIGURE 6.1: Simulation with 5000 replications: "se0" denotes the true standard error of β̂, "se1" denotes the estimated standard error based on the homoskedasticity assumption, and "se2" denotes the Eicker–Huber–White standard error allowing for heteroskedasticity. The density curves are Normal with mean 1 and standard deviation se0.
Part III

Interpretation of OLS Based on Partial Regressions
7
The Frisch–Waugh–Lovell Theorem

7.1 Long and short regressions


If we partition X and β into
$$
X = \begin{pmatrix} X_1 & X_2 \end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix},
$$
where X₁ ∈ ℝ^{n×k}, X₂ ∈ ℝ^{n×l}, β₁ ∈ ℝᵏ and β₂ ∈ ℝˡ, then we can consider the long regression
$$
Y = X\hat\beta + \hat\varepsilon
= \begin{pmatrix} X_1 & X_2\end{pmatrix}\begin{pmatrix}\hat\beta_1\\ \hat\beta_2\end{pmatrix} + \hat\varepsilon
= X_1\hat\beta_1 + X_2\hat\beta_2 + \hat\varepsilon,
$$
and the short regression
$$
Y = X_2\tilde\beta_2 + \tilde\varepsilon,
$$
where $\hat\beta = \begin{pmatrix}\hat\beta_1\\ \hat\beta_2\end{pmatrix}$ and β̃₂ are the OLS coefficients, and ε̂ and ε̃ are the residual vectors from the long and short regressions, respectively. These two regressions are of great interest in practice. For example, we can ask the following questions:

(Q1) if the true β1 is zero, then what is the consequence of including X1 in the long
regression?
(Q2) if the true β1 is not zero, then what is the consequence of omitting X1 in the short
regression?
(Q3) what is the difference between β̂2 and β̃2 ? Both of them are measures of the “impact”
of X2 on Y . Then why are they different? Does their difference give us any information
about β1 ?
Many problems in statistics are related to the long and short regressions. We will discuss
some applications in Chapter 8 and give a related result in Chapter 9.

7.2 FWL theorem for the regression coefficients


The following theorem helps to answer these questions.


Theorem 7.1 The OLS estimator for β₂ in the short regression is β̃₂ = (X₂ᵗX₂)⁻¹X₂ᵗY, and the OLS estimator for β₂ in the long regression has the following equivalent forms:
$$
\begin{aligned}
\hat\beta_2 &= \left[(X^{t}X)^{-1}X^{t}Y\right]_{\text{last } l \text{ elements}} && (7.1)\\
&= \{X_2^{t}(I_n - H_1)X_2\}^{-1}X_2^{t}(I_n - H_1)Y, \quad\text{where } H_1 = X_1(X_1^{t}X_1)^{-1}X_1^{t} && (7.2)\\
&= (\tilde X_2^{t}\tilde X_2)^{-1}\tilde X_2^{t}Y, \quad\text{where } \tilde X_2 = (I_n - H_1)X_2 && (7.3)\\
&= (\tilde X_2^{t}\tilde X_2)^{-1}\tilde X_2^{t}\tilde Y, \quad\text{where } \tilde Y = (I_n - H_1)Y. && (7.4)
\end{aligned}
$$

This result is often called the Frisch–Waugh–Lovell (FWL) Theorem in econometrics


(Frisch and Waugh, 1933; Lovell, 1963), although its equivalent forms were also known in
classic statistics1 .
Before proving Theorem 7.1, I will first discuss its meanings and interpretations. Equa-
tion (7.1) follows from the definition of the OLS coefficient. The matrix In − H1 in equation
(7.2) is the projection matrix onto the space orthogonal to the column space of X1 . Equa-
tion (7.3) states that β̂2 equals the OLS coefficient of Y on X̃2 = (In − H1 )X2 , which is the
residual matrix from the column-wise OLS fit of X2 on X1 2 . So β̂2 measures the “impact”
of X2 on Y after “adjusting” for the impact of X1 , that is, it measures the partial or pure
“impact” of X2 on Y . Equation (7.4) is a slight modification of Equation (7.3), stating that
β̂2 equals the OLS coefficient of Ỹ on X̃2 , where Ỹ = (In − H1 )Y is the residual vector
from the OLS fit of Y on X1 . From (7.3) and (7.4), it is not crucial to residualize Y , but it
is crucial to residualize X2 .
The forms (7.3) and (7.4) suggest the interpretation of β̂2 as the “impact” of X2 on Y
holding X1 constant, or in an econometric term, the “impact” of X2 on Y ceteris paribus.
Marshall (1890) used the Latin phrase ceteris paribus. Its English meaning is “with other
conditions remaining the same.” However, the algebraic meaning of the FWL theorem is
that the OLS coefficient of a variable equals the partial regression coefficient based on
the residuals. Therefore, taking the Latin phrase too seriously may be problematic because
Theorem 7.1 is a pure algebraic result without any distributional assumptions. We cannot
hold X1 constant using pure linear algebra. Sometimes, we can manipulate the value of X1
in an experimental setting, but this relies on the assumption of the data-collecting process.
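
The FWL theorem is easy to verify numerically. Here is a small sketch (hypothetical data, not from the book's code files) checking that forms (7.3) and (7.4) reproduce the long-regression coefficient:

# Numerical check of the FWL theorem with hypothetical data.
set.seed(42)
n  = 100
X1 = cbind(1, rnorm(n))                          # includes the intercept
X2 = cbind(rnorm(n), rnorm(n))
Y  = rnorm(n)
beta_long = coef(lm(Y ~ 0 + X1 + X2))[3:4]       # beta2 from the long regression
X2tilde   = resid(lm(X2 ~ 0 + X1))               # column-wise residuals of X2 on X1
Ytilde    = resid(lm(Y ~ 0 + X1))
cbind(beta_long,
      form_7.3 = coef(lm(Y ~ 0 + X2tilde)),
      form_7.4 = coef(lm(Ytilde ~ 0 + X2tilde))) # all three columns agree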
There are many ways to prove Theorem 7.1. Below I first take a detour to give an
unnecessarily complicated proof because some intermediate steps will be useful for later
parts of the book. I will then give a simpler proof which requires a deep understanding of
OLS as a linear projection.
The first proof relies on the following lemma.

Lemma 7.1 The inverse of XᵗX is
$$
(X^{t}X)^{-1} = \begin{pmatrix} S_{11} & S_{12}\\ S_{21} & S_{22}\end{pmatrix},
$$
where
$$
\begin{aligned}
S_{11} &= (X_1^{t}X_1)^{-1} + (X_1^{t}X_1)^{-1}X_1^{t}X_2(\tilde X_2^{t}\tilde X_2)^{-1}X_2^{t}X_1(X_1^{t}X_1)^{-1},\\
S_{12} &= -(X_1^{t}X_1)^{-1}X_1^{t}X_2(\tilde X_2^{t}\tilde X_2)^{-1},\\
S_{21} &= S_{12}^{t},\\
S_{22} &= (\tilde X_2^{t}\tilde X_2)^{-1}.
\end{aligned}
$$
1 Professor Alan Agresti gave me the reference of Yule (1907).
2 See Problem 3.7 for more details.

I leave the proof of Lemma 7.1 as Problem 7.1. With Lemma 7.1, we can easily prove
Theorem 7.1.
Proof of Theorem 7.1: (Version 1) The OLS coefficient is
$$
\begin{pmatrix}\hat\beta_1\\ \hat\beta_2\end{pmatrix} = (X^{t}X)^{-1}X^{t}Y = \begin{pmatrix} S_{11} & S_{12}\\ S_{21} & S_{22}\end{pmatrix}\begin{pmatrix}X_1^{t}Y\\ X_2^{t}Y\end{pmatrix}.
$$
Then using Lemma 7.1, we can simplify β̂₂ as
$$
\begin{aligned}
\hat\beta_2 &= S_{21}X_1^{t}Y + S_{22}X_2^{t}Y\\
&= -(\tilde X_2^{t}\tilde X_2)^{-1}X_2^{t}X_1(X_1^{t}X_1)^{-1}X_1^{t}Y + (\tilde X_2^{t}\tilde X_2)^{-1}X_2^{t}Y\\
&= -(\tilde X_2^{t}\tilde X_2)^{-1}X_2^{t}H_1Y + (\tilde X_2^{t}\tilde X_2)^{-1}X_2^{t}Y\\
&= (\tilde X_2^{t}\tilde X_2)^{-1}X_2^{t}(I_n - H_1)Y && (7.5)\\
&= (\tilde X_2^{t}\tilde X_2)^{-1}\tilde X_2^{t}Y. && (7.6)
\end{aligned}
$$
Equation (7.5) is the form (7.2), and Equation (7.6) is the form (7.3). Because we also have X₂ᵗ(Iₙ − H₁)Y = X₂ᵗ(Iₙ − H₁)²Y = X̃₂ᵗỸ, we can write β̂₂ as β̂₂ = (X̃₂ᵗX̃₂)⁻¹X̃₂ᵗỸ, giving the form (7.4). □
The second proof does not invert the block matrix of X t X directly.
Proof of Theorem 7.1: (Version 2) The OLS decomposition Y = X₁β̂₁ + X₂β̂₂ + ε̂ satisfies
$$
X^{t}\hat\varepsilon = 0 \implies \begin{pmatrix} X_1 & X_2\end{pmatrix}^{t}\hat\varepsilon = 0 \implies X_1^{t}\hat\varepsilon = 0,\ X_2^{t}\hat\varepsilon = 0.
$$
Multiplying Iₙ − H₁ on both sides of the OLS decomposition, we have
$$
(I_n - H_1)Y = (I_n - H_1)X_1\hat\beta_1 + (I_n - H_1)X_2\hat\beta_2 + (I_n - H_1)\hat\varepsilon,
$$
which reduces to
$$
(I_n - H_1)Y = (I_n - H_1)X_2\hat\beta_2 + \hat\varepsilon
$$
because (Iₙ − H₁)X₁ = 0 and (Iₙ − H₁)ε̂ = ε̂ − H₁ε̂ = ε̂ − X₁(X₁ᵗX₁)⁻¹X₁ᵗε̂ = ε̂. Further multiplying X₂ᵗ on both sides of the identity, we have
$$
X_2^{t}(I_n - H_1)Y = X_2^{t}(I_n - H_1)X_2\hat\beta_2
$$
because X₂ᵗε̂ = 0. The FWL theorem follows immediately.

A subtle issue in this proof is to verify that X₂ᵗ(Iₙ − H₁)X₂ is invertible. It is easy to show that the matrix X₂ᵗ(Iₙ − H₁)X₂ is positive semi-definite. To show it has rank l, we only need to show that
$$
u_2^{t}X_2^{t}(I_n - H_1)X_2u_2 = 0 \implies u_2 = 0.
$$
If u₂ᵗX₂ᵗ(Iₙ − H₁)X₂u₂ = ∥(Iₙ − H₁)X₂u₂∥² = 0, then (Iₙ − H₁)X₂u₂ = 0, which further implies X₂u₂ ∈ C(X₁) by Proposition 3.1. That is, X₂u₂ = X₁u₁ for some u₁. So X₁u₁ − X₂u₂ = 0. Since the columns of X are linearly independent, we must have u₁ = 0 and u₂ = 0. □
I will end this section with two byproducts of the FWL theorem. First, X̃2 is the residual
matrix from the OLS fit of X2 on X1 . It is an n×l matrix with linearly independent columns
as shown in the proof of Theorem 7.1 (Version 2) and induces a projection matrix

H̃2 = X̃2 (X̃2t X̃2 )−1 X̃2t .

This projection matrix is closely related to the projection matrices induced by X and X1
as shown in the following lemma.

Lemma 7.2 We have

H1 H̃2 = H̃2 H1 = 0, H = H1 + H̃2 .

Lemma 7.2 is purely algebraic. I leave the proof as Problem 7.3. The first two identities
imply that the column space of X̃2 is orthogonal to the column space of X1 . The last
identity H = H1 + H̃2 has a clear geometric interpretation. For any vector v ∈ Rn , we have
Hv = H1 v + H̃2 v, so the projection of v onto the column space of X equals the summation
of the projection of v onto the column space of X1 and the projection of v onto the column
space of X̃2 . Importantly, H ̸= H1 + H2 in general.
Second, we can obtain β̂2 from (7.3) or (7.4), which corresponds to the partial regression
of Y on X̃2 or the partial regression of Ỹ on X̃2 . We can verify that the residual vector
from the second partial regression equals the residual vector from the full regression.

Corollary 7.1 We have ε̂ = ê, where ε̂ is the residual vector from the OLS fit of Y on X
and ê is the residual vector from the OLS fit of Ỹ on X̃2 , respectively.

It is important to note that this conclusion is only true if both Y and X2 are residualized.
The conclusion does not hold if we only residualize X2 . See Problem 7.2.
Proof of Corollary 7.1: We have ε̂ = (I − H)Y and

ê = (I − H̃2 )Ỹ = (I − H̃2 )(I − H1 )Y.

It suffices to show that I−H = (I−H̃2 )(I−H1 ), or, equivalently, I−H = I−H1 −H̃2 +H̃2 H1 .
This holds due to Lemma 7.2. □

7.3 FWL theorem for the standard errors


Based on the OLS fit of Y on X, we have two estimated covariances for the second compo-
nent β̂2 : V̂ assuming homoskedasticity and V̂ehw allowing for heteroskedasticity.
The FWL theorem demonstrates that we can also obtain β̂2 from the OLS fit of Ỹ on
X̃2 . Then based on this partial regression, we have two estimated covariances for β̂2 : Ṽ
assuming homoskedasticity and Ṽehw allowing for heteroskedasticity.
The following theorem establishes their equivalence.

Theorem 7.2 (n − k − l)V̂ = (n − l)Ṽ and V̂ehw = Ṽehw .

Theorem 7.1 has been well known for a long time, but Theorem 7.2 is less well known. Lovell (1963) hinted at the first identity in Theorem 7.2, and Ding (2021a) proved Theorem 7.2.
Proof of Theorem 7.2: By Corollary 7.1, the full regression and the partial regression have the same residual vector, denoted by ε̂. Therefore, Ω̂_ehw = Ω̃_ehw = diag{ε̂₁², . . . , ε̂ₙ²} in the EHW covariance estimators.
Based on the full regression, define σ̂² = ∥ε̂∥₂²/(n − k − l). Then V̂ equals the (2, 2)th block of σ̂²(XᵗX)⁻¹, and V̂_ehw equals the (2, 2)th block of (XᵗX)⁻¹XᵗΩ̂_ehwX(XᵗX)⁻¹. Based on the partial regression, define σ̃² = ∥ε̂∥₂²/(n − l). Then Ṽ = σ̃²(X̃₂ᵗX̃₂)⁻¹ and Ṽ_ehw = (X̃₂ᵗX̃₂)⁻¹X̃₂ᵗΩ̃_ehwX̃₂(X̃₂ᵗX̃₂)⁻¹.
The two variance estimators σ̂² and σ̃² are identical up to the degrees of freedom correction. Under homoskedasticity, the covariance estimator for β̂₂ is the (2, 2)th block of σ̂²(XᵗX)⁻¹, that is, σ̂²S₂₂ = σ̂²(X̃₂ᵗX̃₂)⁻¹ by Lemma 7.1, which is identical to the covariance estimator for β̃₂ up to the degrees of freedom correction. This proves the first identity.
The EHW covariance estimator from the full regression is the (2, 2) block of ÂΩ̂_ehwÂᵗ, where
$$
\hat A = (X^{t}X)^{-1}X^{t}
= \begin{pmatrix} *\\ -(\tilde X_2^{t}\tilde X_2)^{-1}X_2^{t}H_1 + (\tilde X_2^{t}\tilde X_2)^{-1}X_2^{t}\end{pmatrix}
= \begin{pmatrix} *\\ (\tilde X_2^{t}\tilde X_2)^{-1}\tilde X_2^{t}\end{pmatrix}
$$
by Lemma 7.1. I omit the ∗ block because it does not affect the final calculation. Define Ã₂ = (X̃₂ᵗX̃₂)⁻¹X̃₂ᵗ, and then
$$
\hat V_{\mathrm{ehw}} = \tilde A_2\hat\Omega_{\mathrm{ehw}}\tilde A_2^{t} = \tilde A_2\tilde\Omega_{\mathrm{ehw}}\tilde A_2^{t},
$$
which equals the EHW covariance estimator Ṽ_ehw from the partial regression. □
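
Theorem 7.2 can also be checked numerically. A minimal sketch (hypothetical data; hccm from the car package, as in Chapter 6) compares the estimated covariances of β̂₂ from the full and partial regressions:

# Numerical check of Theorem 7.2 with hypothetical data.
library(car)
set.seed(7)
n  = 100; k = 2; l = 2
X1 = cbind(1, rnorm(n))                          # k columns
X2 = cbind(rnorm(n), rnorm(n))                   # l columns
Y  = rnorm(n)
full    = lm(Y ~ 0 + X1 + X2)
X2tilde = resid(lm(X2 ~ 0 + X1))
Ytilde  = resid(lm(Y ~ 0 + X1))
partial = lm(Ytilde ~ 0 + X2tilde)
# homoskedastic covariances: equal up to the degrees of freedom correction
Vhat   = vcov(full)[3:4, 3:4]
Vtilde = vcov(partial)
max(abs((n - k - l)*Vhat - (n - l)*Vtilde))      # numerically zero
# EHW covariances: exactly equal (here with the HC0 version)
max(abs(hccm(full, type = "hc0")[3:4, 3:4] - hccm(partial, type = "hc0")))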

7.4 Gram–Schmidt orthogonalization, QR decomposition, and the computation of OLS
When the regressors are orthogonal, the coefficients from the long and short regressions are
identical, which simplifies the calculation and theoretical discussion.

Corollary 7.2 If X1t X2 = 0, i.e., the columns of X1 and X2 are orthogonal, then X̃2 = X2
and β̂2 = β̃2 .

Proof of Corollary 7.2: We can directly prove Corollary 7.2 by verifying that X t X is
block diagonal.
Alternatively, Corollary 7.2 follows directly from

X̃2 = (In − H1 )X2 = X2 − X1 (X1t X1 )−1 X1t X2 = X2 ,

and Theorem 7.1. □


This simple fact motivates us to orthogonalize the columns of the covariate matrix,
which in turn gives the famous QR decomposition in linear algebra. Interestingly, the lm
function in R uses the QR decomposition to compute the OLS estimator. To facilitate the
discussion, I will use the notation
$$\hat\beta_{V_2|V_1}V_1$$
as the linear projection of the vector V₂ ∈ ℝⁿ onto the vector V₁ ∈ ℝⁿ, where β̂_{V₂|V₁} = V₂ᵗV₁/V₁ᵗV₁. This is from the univariate OLS of V₂ on V₁.
With a slight abuse of notation, partition the covariate matrix into column vectors
X = (X1 , . . . , Xp ). The goal is to find orthogonal vectors (U1 , . . . , Up ) that generate the
same column space as X. Start with

X1 = U1 .

Regress X2 on U1 to obtain the fitted and residual vector

X2 = β̂X2 |U1 U1 + U2 ;

by OLS, U1 and U2 must be orthogonal. Regress X3 on (U1 , U2 ) to obtain the fitted and
residual vector
X3 = β̂X3 |U1 U1 + β̂X3 |U2 U2 + U3 ;

by Corollary 7.2, the OLS reduces to two univariate OLS and ensures that U3 is orthogonal
to both U1 and U2 . This justifies the notation β̂X3 |U1 and β̂X3 |U2 . Continue this procedure
to the last column vector:
$$X_p = \sum_{j=1}^{p-1}\hat\beta_{X_p|U_j}U_j + U_p;$$

by OLS, Up is orthogonal to all Uj (j = 1, . . . , p − 1). We further normalize the U vectors


to have unit length:
Qj = Uj /∥Uj ∥, (j = 1, . . . , p).
The whole process is called the Gram–Schmidt orthogonalization, which is essentially the
sequential OLS fits. This process generates an n×p matrix with orthonormal column vectors

Q = (Q1 , . . . , Qp ).

More interestingly, the column vectors of X and Q can linearly represent each other because
$$
X = (X_1, \ldots, X_p)
= (U_1, \ldots, U_p)\begin{pmatrix}
1 & \hat\beta_{X_2|U_1} & \hat\beta_{X_3|U_1} & \cdots & \hat\beta_{X_p|U_1}\\
0 & 1 & \hat\beta_{X_3|U_2} & \cdots & \hat\beta_{X_p|U_2}\\
\vdots & \vdots & \vdots & \cdots & \vdots\\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}
= Q\,\mathrm{diag}\{\|U_j\|\}_{j=1}^{p}\begin{pmatrix}
1 & \hat\beta_{X_2|U_1} & \hat\beta_{X_3|U_1} & \cdots & \hat\beta_{X_p|U_1}\\
0 & 1 & \hat\beta_{X_3|U_2} & \cdots & \hat\beta_{X_p|U_2}\\
\vdots & \vdots & \vdots & \cdots & \vdots\\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}.
$$
We can verify that the product of the second and the third matrix is an upper triangular matrix, denoted by R. By definition, the jth diagonal element of R equals ∥U_j∥, and the (j, j′)th element of R equals ∥U_j∥β̂_{X_{j′}|U_j} for j′ > j. Therefore, we can decompose X as
$$X = QR$$
where Q is an n × p matrix with orthonormal columns and R is a p × p upper triangular matrix. This is called the QR decomposition of X.
Most software packages, for example, R, do not calculate the inverse of XᵗX directly. Instead, they first find the QR decomposition of X = QR. Since the Normal equation simplifies to
$$
\begin{aligned}
X^{t}X\hat\beta &= X^{t}Y,\\
R^{t}Q^{t}QR\hat\beta &= R^{t}Q^{t}Y,\\
R\hat\beta &= Q^{t}Y,
\end{aligned}
$$
they then backsolve the linear equation since R is upper triangular.


In R, the qr function returns the QR decomposition of a matrix.
> X = matrix(rnorm(7*3), 7, 3)
> X
            [,1]       [,2]       [,3]
[1,] -0.57231223  0.1196325  0.8087505
[2,] -1.76090225  1.0627631  1.8170361
[3,] -0.04144281 -0.2904749 -1.8372247
[4,] -0.37627821  0.4476932 -0.9629320
[5,] -1.40848027  0.2735408 -0.8047917
[6,]  1.84878518  0.7290005  1.2688929
[7,]  0.06432856  0.2256284  0.3972229
> qrX = qr(X)
> qr.Q(qrX)
            [,1]        [,2]        [,3]
[1,] -0.19100878 -0.03460617  0.30340481
[2,] -0.58769981 -0.60442928  0.23753900
[3,] -0.01383151  0.21191991 -0.55839928
[4,] -0.12558257 -0.28728403 -0.62864750
[5,] -0.47007924 -0.07020076 -0.36640938
[6,]  0.61703067 -0.68778411 -0.09999859
[7,]  0.02146961 -0.16748246  0.01605493
> qr.R(qrX)
         [,1]       [,2]       [,3]
[1,] 2.996261 -0.3735673  0.0937788
[2,] 0.000000 -1.3950642 -2.1217223
[3,] 0.000000  0.0000000  2.4826186

If we specify qr = TRUE in the lm function, it will also return the QR decomposition of the covariate matrix.
> Y = rnorm(7)
> lmfit = lm(Y ~ 0 + X, qr = TRUE)
> qr.Q(lmfit$qr)
            [,1]        [,2]        [,3]
[1,] -0.43535054 -0.25679823 -0.65480400
[2,] -0.47091275 -0.13639459  0.14746444
[3,]  0.66494532  0.07725435 -0.39436265
[4,]  0.21136347 -0.78814737  0.34611820
[5,] -0.04493356 -0.40413829 -0.01156273
[6,] -0.28046504  0.10655755 -0.16193251
[7,]  0.14561808 -0.33708219 -0.49780561
> qr.R(lmfit$qr)
        X1         X2         X3
1 3.190035 -0.6964269  1.8693260
2 0.000000  2.0719787  1.9210212
3 0.000000  0.0000000 -0.9261921
The replicating R code is in code7.4.R3 .
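
As a small check of the backsolve step described above (a sketch assuming the objects Y and lmfit created in the code above), we can recover the OLS coefficients directly from the stored QR factors:

# A sketch of the backsolve step R beta = Q^t Y using the QR
# decomposition stored in the lm fit above.
qrfit   = lmfit$qr
beta_qr = backsolve(qr.R(qrfit), t(qr.Q(qrfit)) %*% Y)
cbind(beta_qr, coef(lmfit))   # identical up to numerical error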

7.5 Homework problems


7.1 Inverse of a block matrix
Prove Lemma 7.1 and the following alternative form:
$$
(X^{t}X)^{-1} = \begin{pmatrix} Q_{11} & Q_{12}\\ Q_{21} & Q_{22}\end{pmatrix},
$$
where H₂ = X₂(X₂ᵗX₂)⁻¹X₂ᵗ, X̃₁ = (Iₙ − H₂)X₁, and
$$
\begin{aligned}
Q_{11} &= (\tilde X_1^{t}\tilde X_1)^{-1},\\
Q_{12} &= -(\tilde X_1^{t}\tilde X_1)^{-1}X_1^{t}X_2(X_2^{t}X_2)^{-1},\\
Q_{21} &= Q_{12}^{t},\\
Q_{22} &= (X_2^{t}X_2)^{-1} + (X_2^{t}X_2)^{-1}X_2^{t}X_1(\tilde X_1^{t}\tilde X_1)^{-1}X_1^{t}X_2(X_2^{t}X_2)^{-1}.
\end{aligned}
$$
3 The “0 +” in the above code forces the OLS to exclude the constant term.

Hint: Use the formula in Problem 1.3.

7.2 Residuals in the FWL theorem


Give an example in which the residual vector from the partial regression of Y on X̃2 does
not equal to the residual vector from the full regression.

7.3 Projection matrices


Prove Lemma 7.2.
Hint: Use Lemma 7.1.

7.4 FWL theorem and leverage scores


Consider the partitioned regression Y = X1 β̂1 + X2 β̂2 + ε̂. To obtain the coefficient β̂2 , we
can run two OLS fits:
(R1) regress X2 on X1 to obtain the residual X̃2 ;

(R2) regress Y on X̃2 to obtain the coefficient, which equals β̂2 by the FWL Theorem.
Although partial regression (R2) can recover the OLS coefficient, the leverage scores
from (R2) are not the same as those from the long regression. Show that the summation of
the corresponding leverage scores from (R1) and (R2) equals the leverage scores from the
long regression.
Remark: The leverage scores are the diagonal elements of the hat matrix from OLS fits.
Chapter 6 before mentioned them and Chapter 11 later will discuss them in more detail.

7.5 Another invariance property of the OLS coefficient


Partition the covariate matrix as X = (X1 , X2 ) where X1 ∈ Rn×k and X2 ∈ Rn×l . Given
any A ∈ Rk×l , define X̃2 = X2 − X1 A. Fit two OLS:

Y = β̂1 X1 + β̂2 X2 + ε̂

and
Y = β̃1 X1 + β̃2 X̃2 + ε̃.
Show that
β̂2 = β̃2 , ε̂ = ε̃.
Hint: Use the result in Problem 3.4.
Remark: Choose A = (X1t X1 )−1 X1t X2 to be the coefficient matrix of the OLS fit of
X2 on X1 . The above result ensures that β̂2 equals β̃2 from the OLS fit of Y on X1 and
(In − H1 )X2 , which is coherent with the FWL theorem since X1t (In − H1 )X2 = 0.

7.6 Alternative formula for the EHW standard error


Consider the partitioned regression Y = X₁β̂₁ + X₂β̂₂ + ε̂, where X₁ is an n × (p − 1) matrix and X₂ is an n-dimensional vector. So β̂₂ is a scalar, and the (p, p)th element of V̂_ehw equals ŝe²_{ehw,2}, the squared EHW standard error for β̂₂.
Define
$$
\tilde X_2 = (I_n - H_1)X_2 = \begin{pmatrix}\tilde x_{12}\\ \vdots\\ \tilde x_{n2}\end{pmatrix}.
$$

Prove that under Assumption 6.1, we have
$$
\mathrm{var}(\hat\beta_2) = \sum_{i=1}^n w_i\sigma_i^2, \qquad \widehat{\mathrm{se}}_{\mathrm{ehw},2}^{\,2} = \sum_{i=1}^n w_i\hat\varepsilon_i^2,
$$
where
$$
w_i = \frac{\tilde x_{i2}^2}{\left(\sum_{i=1}^n \tilde x_{i2}^2\right)^2}.
$$
Remark: You can use Theorems 7.1 and 7.2 to prove the result. The original formula of
the EHW covariance matrix has a complex form. However, using the FWL theorems, we
can simplify each of the squared EHW standard errors as a weighted average of the squared
residuals, or, equivalently, a simple quadratic form of the residual vector.

7.7 A counterexample to the Gauss–Markov theorem


The Gauss–Markov theorem does not hold under the heteroskedastic linear model. This
problem gives a counterexample in a simple linear model.
Assume yᵢ = βxᵢ + εᵢ without the intercept and with potentially different var(εᵢ) = σᵢ² across i = 1, . . . , n. Consider two OLS estimators: the first OLS estimator does not contain the intercept, β̂ = ∑ᵢ₌₁ⁿ xᵢyᵢ / ∑ᵢ₌₁ⁿ xᵢ²; the second OLS estimator contains the intercept, β̃ = ∑ᵢ₌₁ⁿ (xᵢ − x̄)yᵢ / ∑ᵢ₌₁ⁿ (xᵢ − x̄)², even though the true linear model does not contain the intercept.
intercept.
The Gauss-Markov theorem ensures that if σi2 = σ 2 for all i’s, then the variance of β̂ is
smaller than or equal to the variance of β̃. However, it does not hold when σi2 ’s vary.
Give a counterexample in which the variance of β̂ is larger than the variance of β̃.

7.8 QR decomposition of X and the computation of OLS


Verify that the R matrix equals
$$
R = \begin{pmatrix}
Q_1^{t}X_1 & Q_1^{t}X_2 & \cdots & Q_1^{t}X_p\\
0 & Q_2^{t}X_2 & \cdots & Q_2^{t}X_p\\
\vdots & \vdots & \cdots & \vdots\\
0 & 0 & \cdots & Q_p^{t}X_p
\end{pmatrix}.
$$

Based on the QR decomposition of X, show that

H = QQt ,

and hii equals the squared length of the i-th row of Q.

7.9 Uniqueness of the QR decomposition


Show that if X has linearly independent column vectors, the QR decomposition must be
unique. That is, if X = QR = Q1 R1 where Q and Q1 have orthonormal columns and R
and R1 are upper triangular, then we must have

Q = Q1 , R = R1 .
8
Applications of the Frisch–Waugh–Lovell Theorem

The FWL theorem has many applications, and I will highlight some of them in this chapter.

8.1 Centering regressors


As a special case, partition the covariate matrix into X = (X1 , X2 ) with X1 = 1n . This is
the usual case including the constant as the first regressor. The projection matrix
 −1
· · · n−1

n
H1 = 1n (1tn 1n )−1 1tn = n−1 1n 1tn =  ... ..  ≡ A

.  n
n−1 ··· n−1

contains n−1 ’s as its elements, and

Cn = In − n−1 1n 1tn

is the projection matrix orthogonal to 1n . Multiplying any vector by An is equivalent to


obtaining the average of its components, and multiplying any vector by Cn is equivalent to
centering that vector, for example,
 

An Y =  ...  = ȳ1n ,
 

and  
y1 − ȳ
Cn Y =  ..
.
 
.
yn − ȳ
More generally, multiplying any matrix by An is equivalent to averaging each column, and
multiplying any matrix by Cn is equivalent to centering each column of that matrix, for
example,  t 
x̄2
 .. 
An X2 =  .  = 1n x̄t2 ,
x̄t2
and  
xt12 − x̄t2
Cn X2 =  ..
,
 
.
xn2 − x̄2
t t

69
70 Linear Model and Extensions
Pn
where X2 contains row vectors xt12 , . . . , xtn2 with average x̄2 = n−1 i=1 xi2 . The FWL
theorem implies that the coefficient of X2 in the OLS fit of Y on (1n , X2 ) equals the
coefficient of Cn X2 in the OLS fit of Cn Y on Cn X2 , that is, the OLS fit of the centered
response vector on the column-wise centered X2 . An immediate consequence is that if each
column is centered in the design matrix, then to obtain the OLS coefficients, it does not
matter whether to include the column 1n or not.
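
A one-line check of this consequence (hypothetical data): with a centered outcome and column-wise centered covariates, including or excluding the intercept does not change the slope coefficients.

# Centering and the intercept (hypothetical data).
set.seed(12)
n  = 100
X2 = matrix(rnorm(n*2), n, 2)
Y  = rnorm(n)
Yc  = Y - mean(Y)                               # C_n Y
X2c = scale(X2, center = TRUE, scale = FALSE)   # C_n X2: column-wise centering
cbind(with_intercept    = coef(lm(Y ~ X2))[-1],
      centered_no_const = coef(lm(Yc ~ 0 + X2c)))   # identical slopes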
The centering matrix Cₙ has another property: its quadratic form equals the sample variance multiplied by n − 1, for example,
$$
Y^{t}C_nY = Y^{t}C_n^{t}C_nY = (y_1 - \bar y, \ldots, y_n - \bar y)\begin{pmatrix} y_1 - \bar y\\ \vdots\\ y_n - \bar y\end{pmatrix} = \sum_{i=1}^n (y_i - \bar y)^2 = (n-1)\hat\sigma_y^2,
$$
where σ̂_y² is the sample variance of the outcomes. For an n × p matrix X,
$$
X^{t}C_nX = \begin{pmatrix}X_1^{t}\\ \vdots\\ X_p^{t}\end{pmatrix}C_n\begin{pmatrix}X_1 & \cdots & X_p\end{pmatrix}
= \begin{pmatrix} X_1^{t}C_nX_1 & \cdots & X_1^{t}C_nX_p\\ \vdots & & \vdots\\ X_p^{t}C_nX_1 & \cdots & X_p^{t}C_nX_p\end{pmatrix}
= (n-1)\begin{pmatrix}\hat\sigma_{11} & \cdots & \hat\sigma_{1p}\\ \vdots & & \vdots\\ \hat\sigma_{p1} & \cdots & \hat\sigma_{pp}\end{pmatrix},
$$
where
$$
\hat\sigma_{j_1j_2} = (n-1)^{-1}\sum_{i=1}^n (x_{ij_1} - \bar x_{\cdot j_1})(x_{ij_2} - \bar x_{\cdot j_2})
$$
is the sample covariance between X_{j₁} and X_{j₂}. So (n − 1)⁻¹XᵗCₙX equals the sample covariance matrix of X. For these reasons, I choose the notation Cₙ with "C" for both "centering" and "covariance."
In another important special case, X₁ contains the dummies for a discrete variable, for example, the indicators for different treatment levels or groups. See Example 3.2 for the background. With k groups, X₁ can take the following two forms:
$$
X_1 = \begin{pmatrix}
1 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
1 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 1\\
\vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 1\\
1 & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 0
\end{pmatrix}_{n\times k}
\quad\text{or}\quad
X_1 = \begin{pmatrix}
1 & \cdots & 0\\
\vdots & & \vdots\\
1 & \cdots & 0\\
\vdots & & \vdots\\
0 & \cdots & 1\\
\vdots & & \vdots\\
0 & \cdots & 1
\end{pmatrix}_{n\times k},
\tag{8.1}
$$
where the first form of X₁ contains 1ₙ and k − 1 dummy variables, and the second form of X₁ contains k dummy variables. In both forms of X₁, the observations are sorted according to the group indicators. If we regress Y on X₁, the residual vector is
$$
Y - \begin{pmatrix}\bar y_{[1]}\\ \vdots\\ \bar y_{[1]}\\ \vdots\\ \bar y_{[k]}\\ \vdots\\ \bar y_{[k]}\end{pmatrix},
\tag{8.2}
$$
where ȳ₍₁₎, . . . , ȳ₍ₖ₎ are the averages of the outcomes within groups 1, . . . , k. Effectively, we center Y by group-specific means. Similarly, if we regress X₂ on X₁, we center each column of X₂ by the group-specific means. Let Yᶜ and X₂ᶜ be the centered response vector and design matrix. The FWL theorem implies that the OLS coefficient of X₂ in the long regression is the OLS coefficient of X₂ᶜ in the partial regression of Yᶜ on X₂ᶜ. When k is large, running the OLS with centered variables can reduce the computational cost.
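
A short sketch (hypothetical data) illustrates this group-centering trick: the coefficient of a covariate in the long regression with group dummies equals the coefficient from regressing the group-centered outcome on the group-centered covariate.

# Group-mean centering via the FWL theorem (hypothetical data).
set.seed(3)
n  = 300
g  = factor(sample(1:5, n, replace = TRUE))     # k = 5 groups
x2 = rnorm(n)
y  = as.numeric(g) + x2 + rnorm(n)
long  = coef(lm(y ~ g + x2))["x2"]
yc    = y  - ave(y,  g)                         # center by group-specific means
x2c   = x2 - ave(x2, g)
short = coef(lm(yc ~ 0 + x2c))
c(long, short)                                  # the two coefficients agree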

8.2 Partial correlation coefficient and Simpson’s paradox


The sample Pearson correlation coefficient between n observations of two scalars (xᵢ, yᵢ)ᵢ₌₁ⁿ,
$$
\hat\rho_{yx} = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}\sqrt{\sum_{i=1}^n (y_i - \bar y)^2}},
$$

measures the linear relationship between x and y. How do we measure the linear relationship
between x and y after controlling for some other variables w ∈ Rk−1 ? Intuitively, we can
measure it with the sample Pearson correlation coefficient based on the residuals from the
following two OLS fits:
(R1) run OLS of Y on (1, W ) and obtain residual vector ε̂y and residual sum of squares
rssy ;

(R2) run OLS of X on (1, W ) and obtain residual vector ε̂x and residual sum of squares
rssx .
With ε̂_y and ε̂_x, we can define the sample partial correlation coefficient between x and y given w as
$$
\hat\rho_{yx|w} = \frac{\sum_{i=1}^n \hat\varepsilon_{x,i}\hat\varepsilon_{y,i}}{\sqrt{\sum_{i=1}^n \hat\varepsilon_{x,i}^2}\sqrt{\sum_{i=1}^n \hat\varepsilon_{y,i}^2}}.
$$
In the above definition, we do not center the residuals because they have zero sample means due to the inclusion of the intercept in the OLS fits (R1) and (R2). The sample partial correlation coefficient determines the coefficient of ε̂_x in the OLS fit of ε̂_y on ε̂_x:
$$
\hat\beta_{yx|w} = \frac{\sum_{i=1}^n \hat\varepsilon_{x,i}\hat\varepsilon_{y,i}}{\sum_{i=1}^n \hat\varepsilon_{x,i}^2}
= \hat\rho_{yx|w}\sqrt{\frac{\sum_{i=1}^n \hat\varepsilon_{y,i}^2}{\sum_{i=1}^n \hat\varepsilon_{x,i}^2}}
= \hat\rho_{yx|w}\frac{\hat\sigma_{y|w}}{\hat\sigma_{x|w}},
\tag{8.3}
$$
where σ̂²_{y|w} = rss_y/(n − k) and σ̂²_{x|w} = rss_x/(n − k) are the variance estimators based on regressions (R1) and (R2) motivated by the Gauss–Markov model. Based on the FWL theorem, β̂_{yx|w} equals the OLS coefficient of X in the long regression of Y on (1, X, W). Therefore, (8.3) is the Galtonian formula for multiple regression, which is analogous to that for univariate regression (1.1).
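
A small sketch (hypothetical data) verifies (8.3): the OLS coefficient of x in the long regression equals ρ̂_{yx|w}·σ̂_{y|w}/σ̂_{x|w} computed from the two residual fits (R1) and (R2).

# Numerical check of the partial correlation formula (8.3); hypothetical data.
set.seed(8)
n = 500
w = rnorm(n)
x = 0.5*w + rnorm(n)
y = 1 + x + w + rnorm(n)
ey = resid(lm(y ~ w))                  # (R1)
ex = resid(lm(x ~ w))                  # (R2)
rho_partial = cor(ex, ey)
k = 2                                  # number of regressors in (R1) and (R2)
sd_y_w = sqrt(sum(ey^2)/(n - k))
sd_x_w = sqrt(sum(ex^2)/(n - k))
c(formula_8.3 = rho_partial * sd_y_w / sd_x_w,
  long_ols    = coef(lm(y ~ x + w))["x"])   # the two values agree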
To investigate the relationship between y and x, different researchers may run different
regressions. One may run OLS of Y on (1, X, W ), and the other may run OLS of Y on
(1, X, W ′ ), where W ′ is a subset of W . Let β̂yx|w be the coefficient of X in the first regression,
and let β̂yx|w′ be the coefficient of X in the second regression. Mathematically, it is possible
that these two coefficients have different signs, which is called Simpson’s paradox1 . It is a
paradox because we expect both coefficients to measure the “impact” of X on Y . Because
these two coefficients have the same signs as the partial correlation coefficients ρ̂yx|w and
ρ̂yx|w′ , Simpson’s paradox is equivalent to

ρ̂yx|w ρ̂yx|w′ < 0.

To simplify the presentation, we discuss the special case with w′ being an empty set. Simpson's paradox is then equivalent to

ρ̂yx|w ρ̂yx < 0.

The following theorem gives an expression linking ρ̂yx|w and ρ̂yx .

Theorem 8.1 For Y, X, W ∈ ℝⁿ, we have
$$
\hat\rho_{yx|w} = \frac{\hat\rho_{yx} - \hat\rho_{yw}\hat\rho_{xw}}{\sqrt{1 - \hat\rho_{yw}^2}\sqrt{1 - \hat\rho_{xw}^2}}.
$$

Its proof is purely algebraic, so I leave it as Problem 8.6. Theorem 8.1 states that we
can obtain the sample partial correlation coefficient based on the three pairwise correlation
coefficients. Figure 8.1 illustrates the interplay among three variables. In particular, the
correlation between x and y is due to two “pathways”: the one acting through w and the
one acting independently of w. The first pathway is related to the product term ρ̂_{yw}ρ̂_{xw}, and
the second pathway is related to ρ̂yx|w . This gives some intuition for Theorem 8.1.
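
Theorem 8.1 is also easy to check numerically (a sketch with hypothetical data):

# Check of Theorem 8.1 with hypothetical data.
set.seed(81)
n = 1000
w = rnorm(n); x = w + rnorm(n); y = x + w + rnorm(n)
r_yx = cor(y, x); r_yw = cor(y, w); r_xw = cor(x, w)
lhs = cor(resid(lm(y ~ w)), resid(lm(x ~ w)))              # definition
rhs = (r_yx - r_yw*r_xw)/sqrt((1 - r_yw^2)*(1 - r_xw^2))   # Theorem 8.1
c(lhs, rhs)   # the two values coincide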
1 The usual form of Simpson’s paradox is in terms of a 2 × 2 × 2 table with all binary variables. Here we

focus on the continuous version.



FIGURE 8.1: Correlations among three variables

Based on data (yᵢ, xᵢ, wᵢ)ᵢ₌₁ⁿ, we can compute the sample correlation matrix
$$
\hat R = \begin{pmatrix} 1 & \hat\rho_{yx} & \hat\rho_{yw}\\ \hat\rho_{xy} & 1 & \hat\rho_{xw}\\ \hat\rho_{wy} & \hat\rho_{wx} & 1\end{pmatrix},
$$
which is symmetric and positive semi-definite. Simpson's paradox happens if and only if
$$
\hat\rho_{yx}(\hat\rho_{yx} - \hat\rho_{yw}\hat\rho_{xw}) < 0 \iff \hat\rho_{yx}^2 < \hat\rho_{yx}\hat\rho_{yw}\hat\rho_{xw}.
$$

We can observe Simpson’s Paradox in the following simulation with the R code in
code8.2.R.
> n = 1000
> w = rbinom(n, 1, 0.5)
> x1 = rnorm(n, -1, 1)
> x0 = rnorm(n, 2, 1)
> x = ifelse(w, x1, x0)
> y = x + 6*w + rnorm(n)
> fit.xw = lm(y ~ x + w)$coef
> fit.x = lm(y ~ x)$coef
> fit.xw
(Intercept)           x           w
 0.05655442  0.97969907  5.92517072
> fit.x
(Intercept)           x
  3.6422978  -0.3743368

Because w is binary, we can plot (x, y) in each group of w = 1 and w = 0 in Figure 8.2.
In both groups, y and x are positively associated with positive regression coefficients; but
in the pooled data, y and x are negatively associated with a negative regression coefficient.

[Figure 8.2 here: scatter plot of y against x, colored by factor(w), titled "reversion of the sign", with group-specific and pooled regression lines.]
FIGURE 8.2: An Example of Simpson's Paradox. The two solid regression lines are fitted separately using the data from the two groups, and the dashed regression line is fitted using the pooled data.

8.3 Hypothesis testing and analysis of variance


Partition X and β into
$$
X = \begin{pmatrix} X_1 & X_2\end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix},
$$
where X₁ ∈ ℝ^{n×k}, X₂ ∈ ℝ^{n×l}, β₁ ∈ ℝᵏ and β₂ ∈ ℝˡ. We are often interested in testing
$$H_0: \beta_2 = 0$$
in the long regression
$$
Y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon,
\tag{8.4}
$$
where ε ∼ N(0, σ²Iₙ). If H₀ holds, then X₂ is redundant and a short regression suffices:
$$
Y = X_1\beta_1 + \varepsilon.
\tag{8.5}
$$
This is a special case of testing Cβ = 0 with
$$
C = \begin{pmatrix} 0_{l\times k} & I_{l\times l}\end{pmatrix}.
$$

As discussed before, we can use
$$
\hat\beta_2 \sim \mathrm{N}(0, \sigma^2 S_{22})
$$
with S₂₂ = (X̃₂ᵗX̃₂)⁻¹ being the (2, 2)th block of (XᵗX)⁻¹ by Lemma 7.1, to construct the Wald-type statistic for hypothesis testing:
$$
F_{\textsc{wald}} = \frac{\hat\beta_2^{t}(S_{22})^{-1}\hat\beta_2}{l\hat\sigma^2}
= \frac{\hat\beta_2^{t}\tilde X_2^{t}\tilde X_2\hat\beta_2}{l\hat\sigma^2}
\sim F_{l,n-p}.
$$

Now I will discuss testing H0 from an alternative perspective based on comparing the
residual sum of squares in the long regression (8.4) and the short regression (8.5). This
technique is called the analysis of variance (ANOVA), pioneered by R. A. Fisher in the
design and analysis of experiments. Intuitively, if β2 = 0, then the residual vectors from the
long regression (8.4) and the short regression (8.5) should not be “too different.” However,
with the error term ε, these residuals are random, then the key is to quantify the magnitude
of the difference. Define
rsslong = Y t (In − H)Y
and
rssshort = Y t (In − H1 )Y
as the residual sum of squares from the long and short regressions, respectively. By the
definition of OLS, it must be true that

rsslong ≤ rssshort

and
rssshort − rsslong = Y t (H − H1 )Y ≥ 0. (8.6)
To understand the magnitude of the change in the residual sum of squares, we can standardize the above difference and define
$$
F_{\textsc{anova}} = \frac{(\text{rss}_{\text{short}} - \text{rss}_{\text{long}})/l}{\text{rss}_{\text{long}}/(n - p)}.
$$
In the definition of the above statistic, l and n − p are the degrees of freedom to make the mathematics more elegant, but they do not change the discussion fundamentally. The denominator of F_anova is σ̂², so we can also write it as
$$
F_{\textsc{anova}} = \frac{\text{rss}_{\text{short}} - \text{rss}_{\text{long}}}{l\hat\sigma^2}.
\tag{8.7}
$$
The following theorem states that these two perspectives yield an identical test statistic.

Theorem 8.2 Under Assumption 5.1, if β2 = 0, then Fanova ∼ Fl,n−p . In fact, Fanova =
FWald which is a numerical result without Assumption 5.1.

I divide the proof into two parts. The first part derives the exact distribution of Fanova
under the Normal linear model. It relies on the following lemma on the basic properties of
the projection matrices. I relegate its proof to Problem 8.9.

Lemma 8.1 We have


HX1 = X1 , HX2 = X2 , HH1 = H1 , H1 H = H1
Moreover, H − H1 is a projection matrix of rank p − k = l, In − H is a projection matrix
of rank n − p, and they are orthogonal:
(H − H1 )(In − H) = 0. (8.8)
Proof of Theorem 8.2 (Part I):
The residual vector from the long regression is ε̂ = (In − H)Y = (In − H)(Xβ + ε) =
(In − H)ε, so the residual sum of squares is
rsslong = ε̂t ε̂ = εt (In − H)ε;
since β2 = 0, the residual vector from the short regression is ε̃ = (In − H1 )Y = (In −
H1 )(X1 β1 + ε) = (In − H1 )ε, so the residual sum of squares is
rssshort = ε̃t ε̃ = εt (In − H1 )ε.
Let ε₀ = ε/σ ∼ N(0, Iₙ) be a standard Normal random vector; then we can write F_anova as
$$
F_{\textsc{anova}} = \frac{\varepsilon^{t}(H - H_1)\varepsilon/l}{\varepsilon^{t}(I_n - H)\varepsilon/(n - p)}
= \frac{\varepsilon_0^{t}(H - H_1)\varepsilon_0/l}{\varepsilon_0^{t}(I_n - H)\varepsilon_0/(n - p)}
= \frac{\|(H - H_1)\varepsilon_0\|^2/l}{\|(I_n - H)\varepsilon_0\|^2/(n - p)}.
\tag{8.9}
$$
Therefore, we have the following joint Normality using the basic fact (8.8):
$$
\begin{pmatrix}(H - H_1)\varepsilon_0\\ (I_n - H)\varepsilon_0\end{pmatrix}
= \begin{pmatrix}H - H_1\\ I_n - H\end{pmatrix}\varepsilon_0
\sim \mathrm{N}\left(\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}H - H_1 & 0\\ 0 & I_n - H\end{pmatrix}\right).
$$
So (H − H₁)ε₀ and (Iₙ − H)ε₀ are Normal with mean zero and two projection matrices H − H₁ and Iₙ − H as covariances, respectively, and moreover, they are independent. These imply that their squared lengths are chi-squared:
$$
\|(H - H_1)\varepsilon_0\|^2 \sim \chi^2_l, \qquad \|(I_n - H)\varepsilon_0\|^2 \sim \chi^2_{n-p},
$$
and they are independent. These facts, coupled with (8.9), imply that F_anova ∼ F_{l,n−p}. □
The second part demonstrates that Fanova = FWald without assuming the Normal linear
model, which gives an indirect proof for the exact distribution of Fanova under the Normal
linear model.
Proof of Theorem 8.2 (Part II): Using the FWL theorem that β̂₂ = (X̃₂ᵗX̃₂)⁻¹X̃₂ᵗY, we can rewrite F_wald as
$$
F_{\textsc{wald}} = \frac{Y^{t}\tilde X_2(\tilde X_2^{t}\tilde X_2)^{-1}\tilde X_2^{t}\tilde X_2(\tilde X_2^{t}\tilde X_2)^{-1}\tilde X_2^{t}Y}{l\hat\sigma^2}
= \frac{Y^{t}\tilde X_2(\tilde X_2^{t}\tilde X_2)^{-1}\tilde X_2^{t}Y}{l\hat\sigma^2}
= \frac{Y^{t}\tilde H_2Y}{l\hat\sigma^2},
\tag{8.10}
$$

recalling that H̃2 = X̃2 (X̃2t X̃2 )−1 X̃2t is the projection matrix onto the column space of X̃2 .
Therefore, Fanova = FWald follows from the basic identity H − H1 = H̃2 ensured by Lemma
7.2. □
We can use the anova function in R to compute the F statistic and the p-value. Below I revisit the lalonde data with the R code in code8.3.R. The result is identical to that in Section 5.4.2.
> library("Matching")
> data(lalonde)
> lalonde_full = lm(re78 ~ ., data = lalonde)
> lalonde_treat = lm(re78 ~ treat, data = lalonde)
> anova(lalonde_treat, lalonde_full)
Analysis of Variance Table

Model 1: re78 ~ treat
Model 2: re78 ~ age + educ + black + hisp + married + nodegr + re74 +
    re75 + u74 + u75 + treat
  Res.Df        RSS Df Sum of Sq      F  Pr(>F)
1    443 1.9178e+10
2    433 1.8389e+10 10 788799023 1.8574 0.04929 *

In fact, we can conduct an analysis of variance in a sequence of models. For example, we


can supplement the above analysis with a model containing only the intercept. The function
anova works for a sequence of nested models with increasing complexities.
> lalonde1 = lm(re78 ~ 1, data = lalonde)
> anova(lalonde1, lalonde_treat, lalonde_full)
Analysis of Variance Table

Model 1: re78 ~ 1
Model 2: re78 ~ treat
Model 3: re78 ~ age + educ + black + hisp + married + nodegr + re74 +
    re75 + u74 + u75 + treat
  Res.Df        RSS Df Sum of Sq      F   Pr(>F)
1    444 1.9526e+10
2    443 1.9178e+10  1 348013456 8.1946 0.004405 **
3    433 1.8389e+10 10 788799023 1.8574 0.049286 *

Overall, the treatment variable is significantly related to the outcome, but none of the pretreatment covariates is.
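
As a numerical illustration of Theorem 8.2, the following sketch (assuming the object lalonde_full created above) recomputes the F statistic in its Wald form, using the coefficient and covariance blocks for the ten pretreatment covariates; it should reproduce the F value 1.8574 reported by anova above.

# Wald form of the F statistic for H0: the 10 covariate coefficients are zero.
covnames = setdiff(names(coef(lalonde_full)), c("(Intercept)", "treat"))
beta2 = coef(lalonde_full)[covnames]
V22   = vcov(lalonde_full)[covnames, covnames]
F_wald = drop(t(beta2) %*% solve(V22) %*% beta2)/length(beta2)
F_wald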

8.4 Homework problems


8.1 FWL with intercept
The following result is an immediate extension of Theorem 7.1.
Partition X and β into
$$
X = \begin{pmatrix}1_n & X_1 & X_2\end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_0\\ \beta_1\\ \beta_2\end{pmatrix},
$$
where X₁ ∈ ℝ^{n×k}, X₂ ∈ ℝ^{n×l}, β₀ ∈ ℝ, β₁ ∈ ℝᵏ and β₂ ∈ ℝˡ. Then we can consider the long regression
$$
Y = X\hat\beta + \hat\varepsilon
= \begin{pmatrix}1_n & X_1 & X_2\end{pmatrix}\begin{pmatrix}\hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\end{pmatrix} + \hat\varepsilon
= 1_n\hat\beta_0 + X_1\hat\beta_1 + X_2\hat\beta_2 + \hat\varepsilon,
$$
and the short regression
$$
Y = 1_n\tilde\beta_0 + X_2\tilde\beta_2 + \tilde\varepsilon,
$$
where $\hat\beta = \begin{pmatrix}\hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\end{pmatrix}$ and $\begin{pmatrix}\tilde\beta_0\\ \tilde\beta_2\end{pmatrix}$ are the OLS coefficients, and ε̂ and ε̃ are the residual vectors from the long and short regressions, respectively.
Prove the following theorem.
Theorem 8.3 The OLS estimator for β2 in the long regression equals the coefficient of X̃2
in the OLS fit of Y on (1n , X̃2 ), where X̃2 is the residual matrix of the column-wise OLS
fit of X2 on (1n , X1 ), and also equals the coefficient of X̃2 in the OLS fit of Ỹ on (1n , X̃2 ),
where Ỹ is the residual vector of the OLS fit of Y on (1n , X1 ).

8.2 General centering


Verify (8.2).

8.3 Two-way centering of a matrix


Given X ∈ Rn×p , show that all rows and columns of Cn XCp have mean 0, where Cn =
In − n−1 1n 1tn and Cp = Ip − p−1 1p 1tp .

8.4 t-statistic in multivariate OLS


This problem extends Problem 5.8.
Focus on multivariate OLS discussed in Chapter 5: yi = α̂ + β̂1 xi1 + β̂2t xi2 + ε̂i (i =
1, . . . , n), where xi1 is a scalar and xi2 can be a vector. Show that under homoskedasticity,
the t-statistic associated with β̂1 equals
$$
\frac{\hat\rho_{yx_1|x_2}}{\sqrt{(1 - \hat\rho_{yx_1|x_2}^2)/(n - p)}},
$$

where p is the total number of regressors and ρ̂yx1 |x2 is the sample partial correlation
coefficient between y and x1 given x2 .
Remark: Frank (2000) applied this formula to causal inference.

8.5 Equivalence of the t-statistics in multivariate OLS


This problem extends Problems 5.9 and 6.5.
Consider the data (xᵢ₁, xᵢ₂, yᵢ)ᵢ₌₁ⁿ where both xᵢ₁ and yᵢ are scalars and xᵢ₂ can be a vector.
Run OLS fit of yi on (1, xi1 , xi2 ) to obtain ty|x1 ,x2 , the t-statistic of the coefficient of xi1 ,
under the homoskedasticity assumption. Run OLS fit of xi1 on (1, yi , xi2 ) to obtain tx1 |y,x2 ,
the t-statistic of the coefficient of yi , under the homoskedasticity assumption.
Show ty|x1 ,x2 = tx1 |y,x2 . Give a counterexample in which the numerical equivalence of
the t-statistics breaks down based on the EHW standard error.

8.6 Formula of the partial correlation coefficient


Prove Theorem 8.1.

8.7 Examples of Simpson’s Paradox


Give three numerical examples of (Y, X, W ) which causes Simpson’s Paradox. Report the
mean and covariance matrix for each example.

8.8 Simpson’s Paradox in reality


Find a real-life dataset with Simpson’s Paradox.

8.9 Basic properties of projection matrices


Prove Lemma 8.1.

8.10 Correlation of the regression coefficients


1. Regress Y on (1n , X1 , X2 ) where X1 and X2 are two n-vectors with positive sample
Pearson correlation ρ̂x1 x2 > 0. Show that the corresponding OLS coefficients β̂1 and β̂2
are negatively correlated under the Gauss–Markov model.

2. Regress Y on (1n , X1 , X2 , X3 ) where X1 and X2 are two n-vectors and X3 is an n × L


dimensional matrix. If the partial correlation coefficient between X1 and X2 given X3
is positive, then show that the corresponding OLS coefficients β̂1 and β̂2 are negatively
correlated under the Gauss–Markov model.

8.11 Inverse of sample covariance matrix and partial correlation coefficient


This is the sample version of Problem 2.7.
Based on X ∈ Rn×p , we can compute the sample covariance matrix Σ̂. Denote its inverse
by Σ̂−1 = (σ̂ jk )1≤j,k≤p . Show that for any pair j ̸= k, we have

σ̂ jk = 0 ⇐⇒ ρ̂xj xk |x\(j,k) = 0

where ρ̂xj xk |x\(j,k) is the partial correlation coefficient of Xj and Xk given all other variables.
9
Cochran’s Formula and Omitted-Variable Bias

9.1 Cochran’s formula


Consider an n × 1 vector Y , an n × k matrix X1 , and an n × l matrix X2 . Similar to the
FWL theorem, we do not impose any statistical models. We can fit the following OLS:
Y = X1 β̂1 + X2 β̂2 + ε̂, (9.1)
Y = X2 β̃2 + ε̃, (9.2)
X1 = X2 δ̂ + Û , (9.3)
where ε̂, ε̃ are the residual vectors, and Û is the residual matrix. The last OLS fit means
the OLS fit of each column of X1 on X2 , and therefore the corresponding residual Û is an
n × k matrix.
Theorem 9.1 Under the OLS fits (9.1)–(9.3), we have

β̃2 = β̂2 + δ̂ β̂1 .


This is a pure linear algebra fact similar to the FWL theorem. It is called Cochran’s
formula in statistics. Sir David Cox (Cox, 2007) attributed the formula to Cochran (1938)
although Cochran himself attributed the formula to Fisher (1925a).
Cochran’s formula may seem familiar to readers knowing the chain rule in calculus. In
a deterministic world with scalar y, x1 , x2 , if
y(x1 , x2 ) = x1 β1 + x2 β2
and
x1 (x2 ) = x2 δ,
then
\[
\frac{\mathrm{d}y}{\mathrm{d}x_2}
= \frac{\partial y}{\partial x_1}\frac{\partial x_1}{\partial x_2} + \frac{\partial y}{\partial x_2}
= \delta\beta_1 + \beta_2.
\]
But the OLS decompositions above do not establish any deterministic relationships among
Y and (X1 , X2 ).
In some sense, the formula is obvious. From the first and the third OLS fits, we have
\begin{align*}
Y &= X_1\hat\beta_1 + X_2\hat\beta_2 + \hat\varepsilon \\
&= (X_2\hat\delta + \hat U)\hat\beta_1 + X_2\hat\beta_2 + \hat\varepsilon \\
&= X_2\hat\delta\hat\beta_1 + \hat U\hat\beta_1 + X_2\hat\beta_2 + \hat\varepsilon \\
&= X_2(\hat\delta\hat\beta_1 + \hat\beta_2) + (\hat U\hat\beta_1 + \hat\varepsilon). \tag{9.4}
\end{align*}

FIGURE 9.1: A diagram for Cochran’s formula

This suggests that β̃2 = β̂2 + δ̂ β̂1 . The above derivation follows from simple algebraic
manipulations and does not use any properties of the OLS. To prove Theorem 9.1, we need
to verify that the last line is indeed the OLS fit of Y on X2 . The proof is in fact very simple.
Proof of Theorem 9.1: Based on the above discussion, we only need to show that (9.4)
is the OLS fit of Y on X2 , which is equivalent to show that Û β̂1 + ε̂ is orthogonal to all
columns of X2 . This follows from
\[
X_2^t(\hat U\hat\beta_1 + \hat\varepsilon) = X_2^t\hat U\hat\beta_1 + X_2^t\hat\varepsilon = 0,
\]

because X2t Û = 0 based on the OLS fit in (9.3) and X2t ε̂ = 0 based on the OLS fit in (9.1).

Figure 9.1 illustrates Theorem 9.1. Intuitively, β̃2 measures the total impact of X2 on
Y , which has two channels: β̂2 measures the impact acting directly and δ̂ β̂1 measures the
impact acting indirectly through X1 .
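Readers who prefer a numerical sanity check can verify Theorem 9.1 directly in R. The sketch below is my own illustration (not code from the book's companion files); the dimensions and variable names are arbitrary.
## numerical check of Cochran's formula (Theorem 9.1); all names are illustrative
set.seed(2024)
n = 100; k = 2; l = 3
X1 = matrix(rnorm(n*k), n, k)
X2 = matrix(rnorm(n*l), n, l)
Y  = rnorm(n)

beta.long   = coef(lm(Y ~ 0 + X1 + X2))          # long regression (9.1)
beta.hat1   = beta.long[1:k]
beta.hat2   = beta.long[(k + 1):(k + l)]
beta.tilde2 = coef(lm(Y ~ 0 + X2))               # short regression (9.2)
delta.hat   = coef(lm(X1 ~ 0 + X2))              # l x k coefficient matrix from (9.3)

max(abs(beta.tilde2 - (beta.hat2 + delta.hat %*% beta.hat1)))   # numerically zero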
Figure 9.1 shows the interplay among three variables. Theoretically, we can discuss a
system of more than three variables which is called the path model. This more advanced topic
is beyond the scope of this book. Wright (1921, 1934)’s initial discussion of this approach
was motivated by genetic studies. See Freedman (2009) for a textbook introduction.

9.2 Omitted-variable bias


The proof of Theorem 9.1 is very simple. However, it is one of the most insightful formu-
las in statistics. Econometricians often call it the omitted-variable bias formula because it
quantifies the bias of the OLS coefficient of X2 in the short regression omitting possibly
important variables in X1 . If the OLS coefficient from the long regression is unbiased, then
the OLS coefficient from the short regression has a bias term δ̂ β̂1 , which equals the
product of the coefficient of X2 in the OLS fit of X1 on X2 and the coefficient of X1 in the
long regression.
Below I will discuss a canonical example of using OLS to estimate the treatment effect
in observational studies. For unit i (i = 1, . . . , n), let yi be the outcome, zi be the binary
treatment indicator (1 for the treatment group and 0 for the control group) and xi be the
observed covariate vector. Practitioners often fit the following OLS:
yi = β̃0 + β̃1 zi + β̃2t xi + ε̃i

and interpret β̃1 as the treatment effect estimate. However, observational studies may suffer
from unmeasured confounding, that is, the treatment and control units differ in unobserved
but important ways. In the simplest case, the above OLS may have omitted a variable ui
for each unit i, which is called a confounder. The oracle OLS is

yi = β̂0 + β̂1 zi + β̂2t xi + β̂3 ui + ε̂i

and the coefficient β̂1 is an unbiased estimator if the model with ui is correct. With X1
containing the values of the ui ’s and X2 containing the values of the (1, zi , xti )’s, Cochran’s
formula implies that
\[
\begin{pmatrix} \tilde\beta_0 \\ \tilde\beta_1 \\ \tilde\beta_2 \end{pmatrix}
= \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
+ \hat\beta_3 \begin{pmatrix} \hat\delta_0 \\ \hat\delta_1 \\ \hat\delta_2 \end{pmatrix}
\]
where (δ̂0 , δ̂1 , δ̂2t )t is the coefficient vector in the OLS fit of ui on (1, zi , xi ). Therefore, we
can quantify the difference between the observed estimate β̃1 and oracle estimate β̂1 :

β̃1 − β̂1 = β̂3 δ̂1 ,

which is sometimes called the confounding bias.


Using the basic properties of OLS, we can show that δ̂1 equals the difference in means
of ei = ui − δ̂2t xi across the treatment and control groups:

δ̂1 = ē1 − ē0 ,

where the bar and subscript jointly denote the sample mean of a particular variable within
a treatment group. So

β̃1 − β̂1 = β̂3 (ē1 − ē0 ). (9.5)

Moreover, we can obtain a more explicit formula for δ̂1 :

δ̂1 = ū1 − ū0 − δ̂2t (x̄1 − x̄0 ).

So

β̃1 − β̂1 = β̂3 (ū1 − ū0 ) − β̂3 δ̂2t (x̄1 − x̄0 ). (9.6)

Both (9.5) and (9.6) give some insights into the bias due to omitting an important
covariate u. It is clear that the bias depends on β̂3 , which quantifies the relationship between
u and y. The formula (9.5) shows that the bias also depends on the imbalance in means
of u across the treatment and control groups, after adjusting for the observed covariates
x, that is, the imbalance in means of the residual confounding. The formula (9.6) shows
a more explicit formula of the bias. The above discussion is often called bias analysis in
epidemiology or sensitivity analysis in statistics and econometrics.
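As a complement to the algebra, the following simulation sketch (my own; the data-generating values are arbitrary) confirms that the gap between the short-regression and oracle coefficients equals β̂3 δ̂1 and matches the mean-imbalance form (9.5).
## numerical check of the confounding bias formulas; names are illustrative
set.seed(2024)
n = 500
x = rnorm(n)
u = 0.5*x + rnorm(n)                  # confounder, treated as unobserved
z = rbinom(n, 1, plogis(x + u))       # treatment depends on x and u
y = z + x + 2*u + rnorm(n)

beta.tilde1 = coef(lm(y ~ z + x))["z"]        # short regression omitting u
oracle      = lm(y ~ z + x + u)               # oracle regression
beta.hat1   = coef(oracle)["z"]
beta.hat3   = coef(oracle)["u"]
fit.u       = lm(u ~ z + x)
delta.hat1  = coef(fit.u)["z"]
delta.hat2  = coef(fit.u)["x"]

e = u - delta.hat2*x                          # residual confounding
c(bias        = beta.tilde1 - beta.hat1,
  formula.9.5 = beta.hat3*(mean(e[z == 1]) - mean(e[z == 0])),
  product     = beta.hat3*delta.hat1)         # all three agree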

9.3 Homework problems


9.1 Baron–Kenny method for mediation analysis
The Baron–Kenny method, popularized by Baron and Kenny (1986), is one of the most cited
methods in social science. It concerns the interplay among three variables z, m, y controlling

FIGURE 9.2: The graph for the Baron–Kenny method

for some other variables x. Let Z, M, Y be n × 1 vectors representing the observed values of
z, m, y, and let X be the n × p matrix representing the observations of x. The question of
interest is to assess the “direct” and “indirect” effects of z on y, acting independently and
through m, respectively. We do not need to define these notions precisely since we are only
interested in the algebraic property below.
The Baron–Kenny method runs the OLS

Y = β̂0 1n + β̂1 Z + β̂2 M + X β̂3 + ε̂Y

and interprets β̂1 as the estimator of the “direct effect” of z on y. The “indirect effect” of
z on y through m has two estimators. First, based on the OLS

Y = β̃0 1n + β̃1 Z + X β̃3 + ε̃Y ,

define the difference estimator as β̃1 − β̂1 . Second, based on the OLS

M = γ̂0 1n + γ̂1 Z + X γ̂2 + ε̂M ,

define the product estimator as γ̂1 β̂2 . Figure 9.2 illustrates the OLS fits used in defining the
estimators.
Prove that
β̃1 − β̂1 = γ̂1 β̂2
that is, the difference estimator and product estimator are numerically identical.

9.2 A special case of path analysis


Figure 9.3 represents the order of the variables X1 , X2 , X3 , Y ∈ Rn . Run the following OLS:

Y = β̂0 1n + β̂1 X1 + β̂2 X2 + β̂3 X3 + ε̂Y ,


X3 = δ̂0 1n + δ̂1 X1 + δ̂2 X2 + ε̂3 ,
X2 = θ̂0 1n + θ̂1 X1 + ε̂2 ,

and
Y = β̃0 1n + β̃1 X1 + ε̃Y .

FIGURE 9.3: A path model

Show that
β̃1 = β̂1 + β̂2 θ̂1 + β̂3 δ̂1 + β̂3 δ̂2 θ̂1 .
Remark: The OLS coefficient of X1 in the short regression of Y on (1n , X1 ) equals the
summation of all the path coefficients from X1 to Y as illustrated by Figure 9.3. This
problem is a special case of the path model, but the conclusion holds in general.

9.3 EHW in long and short regressions


Theorem 9.1 gives Cochran’s formula related to the coefficients from three OLS fits. This
problem concerns the covariance estimation. There are at least two ways to estimate the
covariance of β̃2 in the short regression (9.2). First, from the second OLS fit, the EHW
covariance estimator is

Ṽ2 = (X2t X2 )−1 X2t diag(ε̃2 )X2 (X2t X2 )−1 .

Second, Cochran's formula implies that
\[
\tilde\beta_2 = (\hat\delta, I_l)\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
\]
is a linear transformation of the coefficient from the long regression, which further justifies
the EHW covariance estimator
\[
\tilde V_2' = (\hat\delta, I_l)(X^tX)^{-1}X^t\mathrm{diag}(\hat\varepsilon^2)X(X^tX)^{-1}\begin{pmatrix} \hat\delta^t \\ I_l \end{pmatrix}.
\]

Show that
Ṽ2′ = (X2t X2 )−1 X2t diag(ε̂2 )X2 (X2t X2 )−1 .
Hint: Use the result in Problem 7.1.
Remark: Based on Theorem 7.2, the EHW covariance estimator for β̂2 is

V̂2 = (X̃2t X̃2 )−1 X̃2t diag(ε̂2 )X̃2 (X̃2t X̃2 )−1 ,

where X̃2 = (In − H1 )X2 .



9.4 Statistical properties of under-fitted OLS


Assume that Y = Xβ + ε = X1 β1 + X2 β2 + ε follows the Gauss–Markov model, where
X1 ∈ Rn×k , X2 ∈ Rn×l , and cov(ε) = σ 2 In . However, we only fit the OLS of Y on X2 with
coefficient β̃2 and estimated variance σ̃ 2 . Show that

E(β̃2 ) = (X2t X2 )−1 X2t X1 β1 + β2 , var(β̃2 ) = σ 2 (X2t X2 )−1

and
E(σ̃ 2 ) = σ 2 + β1t X1t (In − H2 )X1 β1 /(n − l) ≥ σ 2 .
Part IV

Model Fitting, Checking, and Misspecification
10
Multiple Correlation Coefficient

This chapter will introduce the R2 , the multiple correlation coefficient, also called the coef-
ficient of determination (Wright, 1921). It can achieve two goals: first, it extends the sample
Pearson correlation coefficient between two scalars to a measure of correlation between a
scalar outcome and a vector covariate; second, it measures how well multiple covariates can
linearly represent an outcome.

10.1 Equivalent definitions of R2


I start with the standard definition of R2 between Y and X. Slightly different from other
chapters, X excludes the column of 1’s, so now X is an n × (p − 1) matrix. Based on the
OLS of Y on (1n , X), we define
\[
R^2 = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2}.
\]

We have discussed before that including 1n in the OLS ensures
\begin{align*}
& n^{-1}\sum_{i=1}^n \hat\varepsilon_i = 0 \\
\Longrightarrow\ & n^{-1}\sum_{i=1}^n y_i = n^{-1}\sum_{i=1}^n \hat y_i \\
\Longrightarrow\ & \bar y = \bar{\hat y},
\end{align*}
i.e., the average of the fitted values equals the average of the original observed outcomes. So
I use ȳ for both the means of outcomes and the fitted values. With scaling factor (n − 1)−1 ,
the denominator of R2 is the sample variance of the outcomes, and the numerator of R2 is
the sample variance of the fitted values. We can verify the following decomposition:
Lemma 10.1 We have the following variance decomposition:
\[
\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2.
\]

I leave the proof of Lemma 10.1 as Problem 10.1. Lemma 10.1 states that the total sum
of squares $\sum_{i=1}^n (y_i - \bar y)^2$ equals the regression sum of squares $\sum_{i=1}^n (\hat y_i - \bar y)^2$ plus the residual
sum of squares $\sum_{i=1}^n (y_i - \hat y_i)^2$. From Lemma 10.1, R2 must lie within the interval [0, 1]
and measures the proportion of the regression sum of squares in the total sum of squares.
An immediate consequence of Lemma 10.1 is that
\[
\mathrm{rss} = (1 - R^2)\sum_{i=1}^n (y_i - \bar y)^2.
\]


We can also verify that R2 is the squared sample Pearson correlation coefficient between
Y and Ŷ .
Theorem 10.1 We have R2 = ρ̂2yŷ where
\[
\hat\rho_{y\hat y} = \frac{\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y)}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2}\sqrt{\sum_{i=1}^n (\hat y_i - \bar y)^2}}. \tag{10.1}
\]

I leave the proof of Theorem 10.1 as Problem 10.2. It states that the multiple correlation
coefficient equals the squared Pearson correlation coefficient between yi and ŷi . Although
the sample Pearson correlation coefficient can be positive or negative, R2 is always non-
negative. Geometrically, R2 equals the squared cosine of the angle between the centered
vectors Y − ȳ1n and Ŷ − ȳ1n ; see Chapter A.1.
In terms of long and short regressions, we can partition the design matrix into 1n and
X, then the OLS fit of the long regression is

Y = 1n β̂0 + X β̂ + ε̂, (10.2)

and the OLS fit of the short regression is

Y = 1n β̃0 + ε̃, (10.3)

with β̃0 = ȳ. The total sum of squares is the residual sum of squares from the short regression
so by Lemma 10.1, R2 also equals
\[
R^2 = \frac{\mathrm{rss}_{\mathrm{short}} - \mathrm{rss}_{\mathrm{long}}}{\mathrm{rss}_{\mathrm{short}}}. \tag{10.4}
\]

10.2 R2 and the F statistic


Under the Normal linear model

Y = 1n β0 + Xβ + ε, ε ∼ N(0, σ 2 In ), (10.5)

we can use the F statistic to test whether β = 0. This F statistic is a monotone function
of R2 . Most standard software packages report both F and R2 . I first give a numeric result
without assuming that model (10.5) is correct.

Theorem 10.2 We have
\[
F = \frac{n-p}{p-1}\times\frac{R^2}{1-R^2}.
\]

Proof of Theorem 10.2: Based on the long regression (10.2) and the short regression
(10.3), we have (10.4) and
\[
F = \frac{(\mathrm{rss}_{\mathrm{short}} - \mathrm{rss}_{\mathrm{long}})/(p-1)}{\mathrm{rss}_{\mathrm{long}}/(n-p)}.
\]

So the conclusion follows. □


I then give the exact distribution of R2 under the Normal linear model.

Corollary 10.1 Under the Normal linear model (10.5), if β = 0, then
\[
R^2 \sim \mathrm{Beta}\left(\frac{p-1}{2}, \frac{n-p}{2}\right).
\]

Proof of Corollary 10.1: By definition, the F statistic can be represented as
\[
F = \frac{\chi^2_{p-1}/(p-1)}{\chi^2_{n-p}/(n-p)}
\]
where χ2p−1 and χ2n−p denote independent χ2p−1 and χ2n−p random variables, respectively,
with a little abuse of notation. Using Theorem 10.2, we have
\[
\frac{R^2}{1-R^2} = F \times \frac{p-1}{n-p} = \frac{\chi^2_{p-1}}{\chi^2_{n-p}}
\]
which implies
\[
R^2 = \frac{\chi^2_{p-1}}{\chi^2_{p-1} + \chi^2_{n-p}}.
\]
Because $\chi^2_{p-1} \sim \mathrm{Gamma}\left(\frac{p-1}{2}, \frac{1}{2}\right)$ and $\chi^2_{n-p} \sim \mathrm{Gamma}\left(\frac{n-p}{2}, \frac{1}{2}\right)$ by Proposition B.1, we have
\[
R^2 = \frac{\mathrm{Gamma}\left(\frac{p-1}{2}, \frac{1}{2}\right)}{\mathrm{Gamma}\left(\frac{p-1}{2}, \frac{1}{2}\right) + \mathrm{Gamma}\left(\frac{n-p}{2}, \frac{1}{2}\right)}
\]
where $\mathrm{Gamma}\left(\frac{p-1}{2}, \frac{1}{2}\right)$ and $\mathrm{Gamma}\left(\frac{n-p}{2}, \frac{1}{2}\right)$ denote independent Gamma random variables, with a little abuse of notation. The R2 follows the Beta distribution by the Beta–Gamma duality in Theorem B.1. □
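Corollary 10.1 is also easy to confirm by simulation. The sketch below is my own illustration; the sample size, dimension, and number of replications are arbitrary choices.
## Monte Carlo check of Corollary 10.1: under beta = 0, R^2 ~ Beta((p-1)/2, (n-p)/2)
set.seed(2024)
n = 50; p = 4                         # p counts the intercept as well
r2.sim = replicate(5000, {
  X = matrix(rnorm(n*(p - 1)), n, p - 1)
  y = rnorm(n)                        # outcome unrelated to X, i.e., beta = 0
  summary(lm(y ~ X))$r.squared
})
probs = c(0.1, 0.25, 0.5, 0.75, 0.9)
rbind(empirical   = quantile(r2.sim, probs),
      theoretical = qbeta(probs, (p - 1)/2, (n - p)/2))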

10.3 Numerical examples


Below I first use the LaLonde data to verify Theorems 10.1 and 10.2 numerically.
> library("Matching")
> data(lalonde)
> ols.fit = lm(re78 ~ ., y = TRUE, data = lalonde)
> ols.summary = summary(ols.fit)
> r2 = ols.summary$r.squared
> all.equal(r2, (cor(ols.fit$y, ols.fit$fitted.values))^2,
+           check.names = FALSE)
[1] TRUE
>
> fstat = ols.summary$fstatistic
> all.equal(fstat[1], fstat[3]/fstat[2]*r2/(1 - r2),
+           check.names = FALSE)
[1] TRUE

I then use the data from King and Roberts (2015) to verify Theorems 10.1 and 10.2
numerically.
> library(foreign)
> dat = read.dta("isq.dta")
> dat = na.omit(dat[, c("multish", "lnpop", "lnpopsq",
+                       "lngdp", "lncolony", "lndist",
+                       "freedom", "militexp", "arms",
+                       "year83", "year86", "year89", "year92")])
>
> ols.fit = lm(log(multish + 1) ~ lnpop + lnpopsq + lngdp + lncolony
+              + lndist + freedom + militexp + arms
+              + year83 + year86 + year89 + year92,
+              y = TRUE, data = dat)
> ols.summary = summary(ols.fit)
> r2 = ols.summary$r.squared
> all.equal(r2, (cor(ols.fit$y, ols.fit$fitted.values))^2,
+           check.names = FALSE)
[1] TRUE
>
> fstat = ols.summary$fstatistic
> all.equal(fstat[1], fstat[3]/fstat[2]*r2/(1 - r2),
+           check.names = FALSE)
[1] TRUE

The R code is in code10.3.R.

10.4 Homework problems


10.1 Variance decomposition
Prove Lemma 10.1.

10.2 R2 and the sample Pearson correlation coefficient


Prove Theorem 10.1.

10.3 Exact distribution of ρ̂


Assume the Normal linear model yi = α + βxi + εi with a univariate xi with β = 0 and εi ’s
IID N(0, σ 2 ). Find the exact distribution of ρ̂xy .

10.4 Partial R2
The form (10.4) of R2 is well defined in more general long and short regressions:
\[
Y = 1_n\hat\beta_0 + X\hat\beta + W\hat\gamma + \hat\varepsilon_Y
\]
and
\[
Y = 1_n\tilde\beta_0 + W\tilde\gamma + \tilde\varepsilon_Y,
\]
where X is an n × k matrix and W is an n × l matrix. Define the partial R2 between Y and
X given W as
\[
R^2_{Y.X|W} = \frac{\mathrm{rss}(Y\sim 1_n + W) - \mathrm{rss}(Y\sim 1_n + X + W)}{\mathrm{rss}(Y\sim 1_n + W)},
\]
which spells out the formulas of the long and short regressions. This is an intuitive measure of
the multiple correlation between Y and X after controlling for W . The following properties
make this intuition more explicit.

1. The partial R2 equals
\[
R^2_{Y.X|W} = \frac{R^2_{Y.XW} - R^2_{Y.W}}{1 - R^2_{Y.W}}
\]
where $R^2_{Y.XW}$ is the multiple correlation between Y and (X, W ), and $R^2_{Y.W}$ is the multiple correlation between Y and W .

2. The partial R2 equals the R2 between ε̃Y and ε̃X :
\[
R^2_{Y.X|W} = R^2_{\tilde\varepsilon_Y.\tilde\varepsilon_X}
\]
where ε̃X is the residual matrix from the OLS fit of X on (1n , W ).

Prove the above two results.
Do the following two results hold?
\[
R^2_{Y.XW} = R^2_{Y.W} + R^2_{Y.X|W},
\]
\[
R^2_{Y.XW} = R^2_{Y.W|X} + R^2_{Y.X|W}.
\]

For each result, give a proof if it is correct, or give a counterexample if it is incorrect in


general.

10.5 Omitted-variable bias in terms of the partial R2


Revisit Section 9.2 on the following three OLS fits. The first one involves only the observed
variables:
yi = β̃0 + β̃1 zi + β̃2t xi + ε̃i ,
and the second and third ones involve the unobserved u:

yi = β̂0 + β̂1 zi + β̂2t xi + β̂3 ui + ε̂i ,


ui = δ̂0 + δ̂1 zi + δ̂2t xi + v̂i .

The omitted-variable bias formula states that β̃1 − β̂1 = β̂3 δ̂1 . This formula is simple but
may be difficult to interpret since u is unobserved and its scale is unclear to researchers.
Prove that the formula has an alternative form:
\[
|\tilde\beta_1 - \hat\beta_1|^2
= R^2_{Y.U|ZX} \times \frac{R^2_{Z.U|X}}{1 - R^2_{Z.U|X}} \times \frac{\mathrm{rss}(Y\sim 1_n + Z + X)}{\mathrm{rss}(Z\sim 1_n + X)}.
\]

Remark: Cinelli and Hazlett (2020) suggested the partial R2 parametrization for the
omitted-variable bias formula. The formula has three factors: the first two factors depend
on the unknown sensitivity parameters $R^2_{Y.U|ZX}$ and $R^2_{Z.U|X}$, and the third factor equals
the ratio of the two residual sums of squares based on the observed data.
11
Leverage Scores and Leave-One-Out Formulas

11.1 Leverage scores


We have seen the use of the hat matrix H = X(X t X)−1 X t in previous chapters. Because
\[
H = \begin{pmatrix} x_1^t \\ \vdots \\ x_n^t \end{pmatrix}(X^tX)^{-1}\begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix},
\]
its (i, j)th element equals hij = xti (X t X)−1 xj . In this chapter, we will pay special attention
to its diagonal elements
\[
h_{ii} = x_i^t(X^tX)^{-1}x_i \quad (i = 1, \ldots, n),
\]

often called the leverage scores, which play important roles in many discussions later.
First, because H is a projection matrix of rank p, we have
\[
\sum_{i=1}^n h_{ii} = \mathrm{trace}(H) = \mathrm{rank}(H) = p
\]
which implies that
\[
n^{-1}\sum_{i=1}^n h_{ii} = p/n,
\]
i.e., the average of the leverage scores equals p/n and the maximum of the leverage scores
must be larger than or equal to p/n, which is close to zero when p is small relative to n.
Second, because H = H 2 and H = H t , we have
\[
h_{ii} = \sum_{j=1}^n h_{ij}h_{ji} = \sum_{j=1}^n h_{ij}^2 = h_{ii}^2 + \sum_{j\neq i} h_{ij}^2 \geq h_{ii}^2
\]
which implies
\[
h_{ii} \in [0, 1],
\]
i.e., each leverage score is bounded between zero and one.1
Third, because Ŷ = HY , we have
\[
\hat y_i = \sum_{j=1}^n h_{ij}y_j = h_{ii}y_i + \sum_{j\neq i} h_{ij}y_j
\]
1 This also follows from Theorem A.4 since the eigenvalues of H are 0 and 1.


which implies that
\[
\frac{\partial \hat y_i}{\partial y_i} = h_{ii}.
\]
So hii measures the contribution of yi in the predicted value ŷi . In general, we do not want
the contribution of yi in predicting itself to be too large, because this means we do not
borrow enough information from other observations, making the prediction very noisy. This
is also clear from the variance of the predicted value ŷi = xti β̂ under the Gauss–Markov
model:2
\begin{align*}
\mathrm{var}(\hat y_i) &= x_i^t\mathrm{cov}(\hat\beta)x_i \\
&= \sigma^2 x_i^t(X^tX)^{-1}x_i \\
&= \sigma^2 h_{ii}.
\end{align*}
So the variance of ŷi increases with hii .


The final property of hii is less obvious: it measures whether observation i is an outlier
based on its covariate value, that is, whether xi is far from the center of the data. Partition
the design matrix as $X = (1_n, X_2)$ with $H_1 = n^{-1}1_n1_n^t$. The covariates X2 has sample
mean $\bar x_2 = n^{-1}\sum_{i=1}^n x_{i2}$ and sample covariance
\begin{align*}
S &= (n-1)^{-1}\sum_{i=1}^n (x_{i2} - \bar x_2)(x_{i2} - \bar x_2)^t \\
&= (n-1)^{-1}X_2^t(I_n - H_1)X_2.
\end{align*}
The sample Mahalanobis distance between xi2 and the center x̄2 is
\[
D_i^2 = (x_{i2} - \bar x_2)^t S^{-1}(x_{i2} - \bar x_2).
\]

The following theorem shows that hii is a monotone function of Di2 :

Theorem 11.1 We have
\[
h_{ii} = \frac{1}{n} + \frac{D_i^2}{n-1}, \tag{11.1}
\]
so hii ≥ 1/n.

Proof of Theorem 11.1: The definition of Di2 implies that it is the (i, i)th element of the
following matrix:
\begin{align*}
&\begin{pmatrix} (x_{12} - \bar x_2)^t \\ \vdots \\ (x_{n2} - \bar x_2)^t \end{pmatrix}
S^{-1}
\begin{pmatrix} x_{12} - \bar x_2 & \cdots & x_{n2} - \bar x_2 \end{pmatrix} \\
=\ & (I_n - H_1)X_2\left\{(n-1)^{-1}X_2^t(I_n - H_1)X_2\right\}^{-1}X_2^t(I_n - H_1) \\
=\ & (n-1)\tilde X_2(\tilde X_2^t\tilde X_2)^{-1}\tilde X_2^t \\
=\ & (n-1)\tilde H_2 \\
=\ & (n-1)(H - H_1),
\end{align*}
recalling that X̃2 = (In − H1 )X2 , H̃2 = X̃2 (X̃2t X̃2 )−1 X̃2t , and H = H1 + H̃2 by Lemma 7.2.
Therefore,
\[
D_i^2 = (n-1)(h_{ii} - 1/n)
\]
2 We have already proved a more general result on the covariance matrix of Ŷ in Theorem 4.2.

which implies (11.1). □


These are the basic properties of the leverage scores. Chatterjee and Hadi (1988) pro-
vided an in-depth discussion of the properties of the leverage scores. We will see their roles
frequently in later parts of the chapter.
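The identity (11.1) is straightforward to verify numerically; the sketch below is my own illustration on simulated covariates, using the base R function mahalanobis.
## numerical check of Theorem 11.1: h_ii = 1/n + D_i^2/(n-1)
set.seed(2024)
n = 100
X2 = matrix(rnorm(n*3), n, 3)
y  = rnorm(n)
h  = hatvalues(lm(y ~ X2))
D2 = mahalanobis(X2, colMeans(X2), cov(X2))
max(abs(h - (1/n + D2/(n - 1))))      # numerically zero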
Another advanced result on the leverage scores is due to Huber (1973). He proved that
in the linear model with non-Normal IID εi ∼ [0, σ 2 ] and σ 2 < ∞, all linear combinations
of the OLS coefficient are asymptotically Normal if and only if the maximum leverage score
converges to 0. This is a very elegant asymptotic result on the OLS coefficient. I give more
details in Chapter C as an application of the Lindeberg–Feller CLT.

11.2 Leave-one-out formulas


To measure the impact of the ith observation on the final OLS estimator, a natural approach
is to delete the ith row from the full data
\[
X = \begin{pmatrix} x_1^t \\ \vdots \\ x_n^t \end{pmatrix}, \quad
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},
\]
and check how much the OLS estimator changes. Let
\[
X_{[-i]} = \begin{pmatrix} x_1^t \\ \vdots \\ x_{i-1}^t \\ x_{i+1}^t \\ \vdots \\ x_n^t \end{pmatrix}, \quad
Y_{[-i]} = \begin{pmatrix} y_1 \\ \vdots \\ y_{i-1} \\ y_{i+1} \\ \vdots \\ y_n \end{pmatrix}
\]
denote the leave-i-out data, and define
\[
\hat\beta_{[-i]} = (X_{[-i]}^t X_{[-i]})^{-1}X_{[-i]}^t Y_{[-i]} \tag{11.2}
\]

as the corresponding OLS estimator. We can fit n OLS by deleting the ith row (i = 1, . . . , n).
However, this is computationally intensive especially when n is large. The following theorem
shows that we need only to fit OLS once.

Theorem 11.2 Recalling that β̂ is the full data OLS, ε̂i is the residual and hii is the
leverage score for the ith observation, we have

β̂[−i] = β̂ − (1 − hii )−1 (X t X)−1 xi ε̂i

if hii ̸= 1.

Proof of Theorem 11.2: From (11.2), we need to invert
\[
X_{[-i]}^t X_{[-i]} = \sum_{i'\neq i} x_{i'}x_{i'}^t = X^tX - x_ix_i^t
\]
and calculate
\[
X_{[-i]}^t Y_{[-i]} = \sum_{i'\neq i} x_{i'}y_{i'} = X^tY - x_iy_i,
\]

which are the original X t X and X t Y without the contribution of the ith observation. Using
the following Sherman–Morrison formula in Problem 1.3:
\[
(A + uv^t)^{-1} = A^{-1} - (1 + v^tA^{-1}u)^{-1}A^{-1}uv^tA^{-1}
\]
with A = X t X, u = xi , and v = −xi , we can invert $X_{[-i]}^t X_{[-i]}$ as
\begin{align*}
(X_{[-i]}^t X_{[-i]})^{-1} &= (X^tX)^{-1} + \left\{1 - x_i^t(X^tX)^{-1}x_i\right\}^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1} \\
&= (X^tX)^{-1} + (1 - h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}.
\end{align*}
Therefore,
\begin{align*}
\hat\beta_{[-i]} &= (X_{[-i]}^t X_{[-i]})^{-1}X_{[-i]}^t Y_{[-i]} \\
&= \left\{(X^tX)^{-1} + (1 - h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}\right\}(X^tY - x_iy_i) \\
&= (X^tX)^{-1}X^tY \\
&\quad - (X^tX)^{-1}x_iy_i \\
&\quad + (1 - h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}X^tY \\
&\quad - (1 - h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}x_iy_i \\
&= \hat\beta - (X^tX)^{-1}x_iy_i + (1 - h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t\hat\beta - h_{ii}(1 - h_{ii})^{-1}(X^tX)^{-1}x_iy_i \\
&= \hat\beta - (1 - h_{ii})^{-1}(X^tX)^{-1}x_iy_i + (1 - h_{ii})^{-1}(X^tX)^{-1}x_i\hat y_i \\
&= \hat\beta - (1 - h_{ii})^{-1}(X^tX)^{-1}x_i\hat\varepsilon_i.
\end{align*}


With the leave-i-out OLS estimator β̂[−i] , we can define the predicted residual

ε̂[−i] = yi − xti β̂[−i] ,

which is different from the original residual ε̂i . The predicted residual based on leave-one-out
can better measure the performance of the prediction because it mimics the real problem
of predicting a future observation. In contrast, the original residual based on the full data
ε̂i = yi − xti β̂ gives an overly optimistic measure of the performance of the prediction. This
is related to the overfitting issue discussed later. Under the Gauss–Markov model, Theorem
4.2 implies that the original residual has mean zero and variance

var(ε̂i ) = σ 2 (1 − hii ), (11.3)

and we can show that the predicted residual has mean zero and variance
\[
\mathrm{var}(\hat\varepsilon_{[-i]}) = \mathrm{var}(y_i - x_i^t\hat\beta_{[-i]}) = \sigma^2 + \sigma^2 x_i^t(X_{[-i]}^t X_{[-i]})^{-1}x_i. \tag{11.4}
\]

The following theorem further simplifies the predicted residual and its variance.

Theorem 11.3 We have


ε̂[−i] = ε̂i /(1 − hii ),
and under Assumption 4.1, we have

var(ε̂[−i] ) = σ 2 /(1 − hii ). (11.5)



Proof of Theorem 11.3: By definition and Theorem 11.2, we have
\begin{align*}
\hat\varepsilon_{[-i]} &= y_i - x_i^t\hat\beta_{[-i]} \\
&= y_i - x_i^t\left\{\hat\beta - (1 - h_{ii})^{-1}(X^tX)^{-1}x_i\hat\varepsilon_i\right\} \\
&= y_i - x_i^t\hat\beta + (1 - h_{ii})^{-1}x_i^t(X^tX)^{-1}x_i\hat\varepsilon_i \\
&= \hat\varepsilon_i + h_{ii}(1 - h_{ii})^{-1}\hat\varepsilon_i \\
&= \hat\varepsilon_i/(1 - h_{ii}). \tag{11.6}
\end{align*}

Combining (11.3) and (11.6), we can derive its variance formula. □


Comparing formulas (11.4) and (11.5), we obtain that
\[
1 + x_i^t(X_{[-i]}^t X_{[-i]})^{-1}x_i = (1 - h_{ii})^{-1} = \left\{1 - x_i^t(X^tX)^{-1}x_i\right\}^{-1}.
\]
This is not an obvious linear algebra identity, but it follows immediately from the two ways
of calculating the variance of the predicted residual.
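Before moving on, here is a small sketch (my own, on simulated data) that checks Theorems 11.2 and 11.3 against brute-force refitting for a single observation; the choice of i is arbitrary.
## check the leave-one-out formulas against refitting without observation i
set.seed(2024)
n = 100; p = 3
X = cbind(1, matrix(rnorm(n*(p - 1)), n, p - 1))
y = c(X %*% c(1, 2, -1) + rnorm(n))
fit  = lm(y ~ 0 + X)
beta = coef(fit); res = resid(fit); h = hatvalues(fit)

i = 7
beta.i.formula = beta - solve(t(X) %*% X) %*% X[i, ]*res[i]/(1 - h[i])
beta.i.refit   = coef(lm(y[-i] ~ 0 + X[-i, ]))
max(abs(beta.i.formula - beta.i.refit))                    # Theorem 11.2: zero
(y[i] - sum(X[i, ]*beta.i.refit)) - res[i]/(1 - h[i])      # Theorem 11.3: zero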

11.3 Applications of the leave-one-out formulas


11.3.1 Gauss updating formula
Consider an online setting in which the data points arrive sequentially, one at a time.

In this setting, we can update the OLS estimator step by step: based on the first n data
points (xi , yi )ni=1 , we calculate the OLS estimator β̂(n) , and with an additional data point
(xn+1 , yn+1 ), we update the OLS estimator as β̂(n+1) . These two OLS estimators are closely
related as shown in the following theorem.

Theorem 11.4 Let X(n) be the design matrix and Y(n) be the outcome vector for the first
n observations. We have
β̂(n+1) = β̂(n) + γ(n+1) ε̂[n+1] ,
where $\gamma_{(n+1)} = (X_{(n+1)}^t X_{(n+1)})^{-1}x_{n+1}$ and $\hat\varepsilon_{[n+1]} = y_{n+1} - x_{n+1}^t\hat\beta_{(n)}$ is the predicted residual
of the (n + 1)th outcome based on the OLS of the first n observations.

Proof of Theorem 11.4: This is the reverse form of the leave-one-out formula. We can
view the first n + 1 data points as the full data, and β̂(n) as the OLS estimator leaving the
(n + 1)th observation out. Applying Theorem 11.2, we have
\begin{align*}
\hat\beta_{(n)} &= \hat\beta_{(n+1)} - (X_{(n+1)}^t X_{(n+1)})^{-1}x_{n+1}\frac{\hat\varepsilon_{n+1}}{1 - h_{n+1,n+1}} \\
&= \hat\beta_{(n+1)} - \gamma_{(n+1)}\hat\varepsilon_{[n+1]},
\end{align*}
where ε̂n+1 is the (n + 1)th residual based on the full data OLS, and the (n + 1)th predicted
residual equals ε̂[n+1] = ε̂n+1 /(1 − hn+1,n+1 ) based on Theorem 11.3. □
Theorem 11.4 shows that to obtain β̂(n+1) from β̂(n) , the adjustment depends on the
predicted residual ε̂[n+1] . If we have a perfect prediction of the (n + 1)th observation based
on β̂(n) , then we do not need to make any adjustment to obtain β̂(n+1) ; if the predicted
residual is large, then we need to make a large adjustment.
Theorem 11.4 suggests an algorithm for sequentially computing the OLS estimators. But
it gives a formula that involves inverting $X_{(n+1)}^t X_{(n+1)}$ at each step. Using the Sherman–
Morrison formula in Problem 1.3 for updating the inverse of $X_{(n+1)}^t X_{(n+1)}$ based on the
inverse of $X_{(n)}^t X_{(n)}$, we have an even simpler algorithm below:

(G1) Start with $V_{(n)} = (X_{(n)}^t X_{(n)})^{-1}$ and β̂(n) .

(G2) Update
\[
V_{(n+1)} = V_{(n)} - \left(1 + x_{n+1}^t V_{(n)} x_{n+1}\right)^{-1} V_{(n)} x_{n+1} x_{n+1}^t V_{(n)}.
\]
(G3) Calculate γ(n+1) = V(n+1) xn+1 and ε̂[n+1] = yn+1 − xtn+1 β̂(n) .

(G4) Update β̂(n+1) = β̂(n) + γ(n+1) ε̂[n+1] .
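A direct R translation of (G1)–(G4) might look like the sketch below; this is my own illustration rather than the book's companion code, and the sample sizes are arbitrary. It confirms that the sequential updates reproduce the full-data OLS coefficient.
## a minimal sketch of the updating steps (G1)-(G4); names and sizes are illustrative
set.seed(2024)
n0 = 30; n1 = 100; p = 3
X = cbind(1, matrix(rnorm(n1*(p - 1)), n1, p - 1))
y = c(X %*% c(1, 2, -1) + rnorm(n1))

V    = solve(t(X[1:n0, ]) %*% X[1:n0, ])          # (G1) initialize with n0 points
beta = V %*% t(X[1:n0, ]) %*% y[1:n0]
for (m in (n0 + 1):n1) {
  xnew = X[m, ]; ynew = y[m]
  V = V - (V %*% xnew %*% t(xnew) %*% V)/c(1 + t(xnew) %*% V %*% xnew)   # (G2)
  gamma = V %*% xnew                              # (G3) gain vector
  ehat  = ynew - sum(xnew*beta)                   # (G3) predicted residual
  beta  = beta + gamma*ehat                       # (G4) coefficient update
}
max(abs(beta - coef(lm(y ~ 0 + X))))              # agrees with the full-data OLS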

11.3.2 Outlier detection based on residuals


Under the Normal linear model Y = Xβ + ε with ε ∼ N(0, σ 2 In ), we know some basic
probabilistic properties of the residual vector:

E(ε̂) = 0, var(ε̂) = σ 2 (In − H).

At the same time, the residual vector is computable based on the data. So it is sensible to
check whether these properties of the residual vector are plausible based on data, which in
turn serves as model checking for the Normal linear model.
The first quantity is the standardized residual
\[
\mathrm{standr}_i = \frac{\hat\varepsilon_i}{\sqrt{\hat\sigma^2(1 - h_{ii})}}.
\]
We may hope that it has mean 0 and variance 1. However, because of the dependence
between ε̂i and σ̂ 2 , it is not easy to quantify the exact distribution of standri .
The second quantity is the studentized residual based on the predicted residual:
\[
\mathrm{studr}_i = \frac{\hat\varepsilon_{[-i]}}{\sqrt{\hat\sigma^2_{[-i]}/(1 - h_{ii})}}
= \frac{y_i - x_i^t\hat\beta_{[-i]}}{\sqrt{\hat\sigma^2_{[-i]}/(1 - h_{ii})}},
\]
where $\hat\beta_{[-i]}$ and $\hat\sigma^2_{[-i]}$ are the estimates of the coefficient and variance based on the leave-i-
out OLS. Because $(y_i, \hat\beta_{[-i]}, \hat\sigma^2_{[-i]})$ are mutually independent under the Normal linear model,
we can show that
\[
\mathrm{studr}_i \sim t_{n-p-1}. \tag{11.7}
\]
studri ∼ tn−p−1 . (11.7)

Because we know the distribution of studri , we can compare it to the quantiles of the t
distribution.
The third quantity is Cook’s distance (Cook, 1977):

cooki = (β̂[−i] − β̂)t X t X(β̂[−i] − β̂)/(pσ̂ 2 )


= (X β̂[−i] − X β̂)t (X β̂[−i] − X β̂)/(pσ̂ 2 ),

where the first form measures the change of the OLS estimator and the second form mea-
sures the change in the predicted values based on leaving-i-out. It has a slightly different
motivation, but eventually, it is related to the previous two residuals due to the leave-one-out
formulas.

Theorem 11.5 Cook's distance is related to the standardized residual via:
\[
\mathrm{cook}_i = \mathrm{standr}_i^2 \times \frac{h_{ii}}{p(1 - h_{ii})}.
\]

I leave the proof of Theorem 11.5 as Problem 11.4.


I will end this subsubsection with two examples. The R code is in code11.3.2.R. The first
one is simulated. I generate data from a univariate Normal linear model without outliers. I
then apply hatvalues, rstandard, rstudent and cooks.distance to an lm object to calculate the
leverage scores, standardized residuals, studentized residuals, and Cook's distances. Their
plots are in the first column of Figure 11.1.
n = 100
x = seq(0, 1, length = n)
y = 1 + 3*x + rnorm(n)
lmmod = lm(y ~ x)
hatvalues(lmmod)
rstandard(lmmod)
rstudent(lmmod)
cooks.distance(lmmod)

If I add 8 to the outcome of the last observation, the plots change to the second column
of Figure 11.1. If I add 8 to the 50th observation, the plots change to the last column of
Figure 11.1. Both visually show the outliers. In this example, the three residual plots give
qualitatively the same pattern, so the choice among them does not matter much. In general
cases, we may prefer studri because it has a known distribution under the Normal linear
model.
The second one is a further analysis of the Lalonde data. Based on the plots in Figure
11.2, there are indeed some outliers in the data. It is worth investigating them more carefully.
Although the outliers detection methods above are very classic, they are rarely imple-
mented in modern data analyses. They are simple and useful diagnostics. I recommend
using them at least as a part of the exploratory data analysis.

11.3.3 Jackknife
Jackknife is a general strategy for bias reduction and variance estimation proposed by
Quenouille (1949, 1956) and popularized by Tukey (1958). Based on independent data
(Z1 , . . . , Zn ), how to estimate the variance of a general estimator θ̂(Z1 , . . . , Zn ) of the pa-
rameter θ? Define θ̂[−i] as the estimator without observation i, and the pseudo-value as

θ̃i = nθ̂ − (n − 1)θ̂[−i] .



FIGURE 11.1: Outlier detections

The jackknife point estimator is $\hat\theta_{j} = n^{-1}\sum_{i=1}^n \tilde\theta_i$, and the jackknife variance estimator is
\[
\hat V_{j} = \frac{1}{n(n-1)}\sum_{i=1}^n (\tilde\theta_i - \hat\theta_{j})(\tilde\theta_i - \hat\theta_{j})^t.
\]

We have already shown that the OLS coefficient is unbiased and derived several variance
estimators for it. Here we focus on the jackknife in OLS using the leave-one-out formula for
the coefficient. The pseudo-value is
\begin{align*}
\tilde\beta_i &= n\hat\beta - (n-1)\hat\beta_{[-i]} \\
&= n\hat\beta - (n-1)\left\{\hat\beta - (1 - h_{ii})^{-1}(X^tX)^{-1}x_i\hat\varepsilon_i\right\} \\
&= \hat\beta + (n-1)(1 - h_{ii})^{-1}(X^tX)^{-1}x_i\hat\varepsilon_i.
\end{align*}
The jackknife point estimator is
\[
\hat\beta_{j} = \hat\beta + \frac{n-1}{n}\left(n^{-1}\sum_{i=1}^n x_ix_i^t\right)^{-1}\left(n^{-1}\sum_{i=1}^n x_i\frac{\hat\varepsilon_i}{1 - h_{ii}}\right).
\]


FIGURE 11.2: Outlier detections in the LaLonde data

It is a little unfortunate that the jackknife point estimator is not identical to the OLS
estimator, which is BLUE under the Gauss–Markov model. We can show that E(β̂j ) = β
and it is a linear estimator. So the Gauss–Markov theorem ensures that cov(β̂j ) ⪰ cov(β̂).
Nevertheless, the difference between β̂j and β̂ is quite small. I omit their difference in the
following derivation. Assuming that β̂j ≈ β̂, we can continue to calculate the approximate
jackknife variance estimator:
\begin{align*}
\hat V_{j} &\approx \frac{1}{n(n-1)}\sum_{i=1}^n (\tilde\beta_i - \hat\beta)(\tilde\beta_i - \hat\beta)^t \\
&= \frac{n-1}{n}(X^tX)^{-1}\sum_{i=1}^n \left(\frac{\hat\varepsilon_i}{1 - h_{ii}}\right)^2 x_ix_i^t(X^tX)^{-1},
\end{align*}

which is almost identical to the HC3 form of the EHW covariance matrix introduced in
Chapter 6.4.2. Miller (1974) first analyzed the jackknife in OLS but dismissed it immediately.
Hinkley (1977) modified the original jackknife and proposed a version that is identical to
HC1, and Wu (1986) proposed some further modifications and proposed a version that is
identical to HC2. Weber (1986) made connections between EHW and jackknife standard
errors. However, Long and Ervin (2000)'s finite-sample simulation seems to favor the original
jackknife or HC3.
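To see the near-equivalence numerically, the following sketch (my own; it relies on the sandwich package for vcovHC) compares the approximate jackknife covariance with HC3 on heteroskedastic simulated data.
## the approximate jackknife covariance equals (n-1)/n times HC3
library(sandwich)
set.seed(2024)
n = 200
x = rexp(n)
y = 1 + 2*x + rnorm(n, 0, 1 + x)       # heteroskedastic errors
fit = lm(y ~ x)
X   = model.matrix(fit)
res = resid(fit); h = hatvalues(fit)

XtX.inv = solve(t(X) %*% X)
V.jack  = (n - 1)/n*
  XtX.inv %*% (t(X) %*% diag((res/(1 - h))^2) %*% X) %*% XtX.inv
V.hc3   = vcovHC(fit, type = "HC3")
V.jack/V.hc3                            # every entry equals (n-1)/n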

11.4 Homework problems


11.1 Implementing the Gauss updating formula
Implement the algorithm in (G1)–(G4), and try it on simulated data.

11.2 The distribution of the studentized residual


Prove (11.7).

11.3 Leave-one-out coefficient


Prove
\[
\hat\beta = \sum_{i=1}^n w_i\hat\beta_{[-i]},
\]
and find the weights wi 's. Show they are positive and sum to one. Does $\hat\beta = n^{-1}\sum_{i=1}^n \hat\beta_{[-i]}$
hold in general?

11.4 Cook’s distance and the standardized residual


Prove Theorem 11.5.

11.5 The relationship between the standardized and studentized residual


Prove the following results.

1. $(n-p-1)\hat\sigma^2_{[-i]} = (n-p)\hat\sigma^2 - \hat\varepsilon_i^2/(1 - h_{ii})$.

2. There is a monotone relationship between the standardized and studentized residual:
\[
\mathrm{studr}_i = \mathrm{standr}_i\sqrt{\frac{n-p-1}{n-p-\mathrm{standr}_i^2}}.
\]

Remark: The right-hand side of the first result must be nonnegative so $\sum_{k=1}^n \hat\varepsilon_k^2 \geq \hat\varepsilon_i^2/(1 - h_{ii})$, which implies the following inequality:
\[
h_{ii} + \frac{\hat\varepsilon_i^2}{\sum_{k=1}^n \hat\varepsilon_k^2} \leq 1.
\]
From this inequality, if hii = 1 then ε̂i = 0 which further implies that hij = 0 for all j ̸= i.

11.6 Subset and full-data OLS coefficients


Leaving one observation out, we have the OLS coefficient formula in Theorem 11.2. Leaving
a subset of observations out, we have the OLS coefficient formula below. Partition the
covariate matrix and outcome vector based on a subset S of {1, . . . , n}:
\[
X = \begin{pmatrix} X_S \\ X_{\backslash S} \end{pmatrix}, \quad
Y = \begin{pmatrix} Y_S \\ Y_{\backslash S} \end{pmatrix}.
\]
Without using the observations in set S, we have the OLS coefficient
\[
\hat\beta_{\backslash S} = (X_{\backslash S}^t X_{\backslash S})^{-1}X_{\backslash S}^t Y_{\backslash S}.
\]
The following formula facilitates the computation of many β̂\S 's simultaneously, which relies
crucially on the subvector of the full-data residuals
\[
\hat\varepsilon_S = Y_S - X_S\hat\beta
\]
and the submatrix of H
\[
H_{SS} = X_S(X^tX)^{-1}X_S^t
\]
corresponding to the set S.

Theorem 11.6 Recalling that β̂ is the full data OLS, we have
\[
\hat\beta_{\backslash S} = \hat\beta - (X^tX)^{-1}X_S^t(I - H_{SS})^{-1}\hat\varepsilon_S
\]
where I is the identity matrix with the same dimension as HSS .

Prove Theorem 11.6.

11.7 Bounding the leverage score


Theorem 11.1 shows n−1 ≤ hii ≤ 1 for all i = 1, . . . , n. Prove the following stronger result:
\[
n^{-1} \leq h_{ii} \leq s_i^{-1}
\]
where si is the number of rows that are identical to xi or −xi .

11.8 More on the leverage score


Show that
\[
\det(X_{[-i]}^t X_{[-i]}) = (1 - h_{ii})\det(X^tX).
\]
Remark: If hii = 1, then $X_{[-i]}^t X_{[-i]}$ is degenerate with determinant 0. Therefore, if we
delete an observation i with leverage score 1, the columns of the covariate matrix become
linearly dependent.
12
Population Ordinary Least Squares and Inference
with a Misspecified Linear Model

Previous chapters assume fixed X and random Y . We can also view each observation (xi , yi )
as IID draws from a population and discuss population OLS. We will view the OLS from
the level of random variables instead of data points. Besides its theoretical interest, the
population OLS facilitates the discussion of the properties of misspecified linear models and
motivates a prediction procedure without imposing any distributional assumptions.

12.1 Population OLS


Assume that (xi , yi )ni=1 are IID with the same distribution as (x, y), where x ∈ Rp and
y ∈ R. Below I will use (x, y) to denote a general observation, dropping the subscript i
for simplicity. Let the joint distribution be F (x, y), and E(·), var(·), and cov(·) be the
expectation, variance, and covariance under this joint distribution. We want to use x to
predict y. The following theorem states that the conditional expectation E(y | x) is the
best predictor in terms of the mean squared error.
Theorem 12.1 For any function m(x), we have the decomposition
\[
E\left[\{y - m(x)\}^2\right] = E\left[\{E(y\mid x) - m(x)\}^2\right] + E\{\mathrm{var}(y\mid x)\}, \tag{12.1}
\]
provided the existence of the moments in (12.1). This implies
\[
E(y\mid x) = \arg\min_{m(\cdot)} E\left[\{y - m(x)\}^2\right]
\]
with the minimum value equaling E{var(y | x)}, the expectation of the conditional variance
of y given x.
Theorem 12.1 is well known in probability theory. I relegate its proof as Problem 12.1.
We have finite data points, but the function E(y | x) lies in an infinite dimensional space.
Nonparametric estimation of E(y | x) is generally a hard problem, especially with a mul-
tidimensional x. As a starting point, we often use a linear function of x to approximate
E(y | x) and define the population OLS coefficient as
\[
\beta = \arg\min_{b\in\mathbb{R}^p} L(b),
\]
where
\begin{align*}
L(b) &= E\left\{(y - x^tb)^2\right\} \\
&= E\left(y^2 + b^txx^tb - 2yx^tb\right) \\
&= E(y^2) + b^tE(xx^t)b - 2E(yx^t)b
\end{align*}
is a quadratic function of b. From the first-order condition, we have
\[
\frac{\partial L(b)}{\partial b}\Big|_{b=\beta} = 2E(xx^t)\beta - 2E(xy) = 0
\]
based on Proposition A.7, so
\[
\beta = \{E(xx^t)\}^{-1}E(xy); \tag{12.2}
\]
from the second-order condition, we have
\[
\frac{\partial^2 L(b)}{\partial b\partial b^t} = 2E(xx^t) \geq 0.
\]
So β is the unique minimizer of L(b).
The above derivation shows that xt β is the best linear predictor, and the following
theorem states precisely that xt β is the best linear approximation to the possibly nonlinear
conditional mean E(y | x).
Theorem 12.2 We have
\begin{align*}
\beta &= \arg\min_{b\in\mathbb{R}^p} E\left[\{E(y\mid x) - x^tb\}^2\right] \\
&= \{E(xx^t)\}^{-1}E\{xE(y\mid x)\}.
\end{align*}
As a special case, when the covariate "x" in the above formulas contains 1 and a scalar
x, the OLS coefficient has the following form.
Corollary 12.1 For scalar x and y, we have
\[
(\alpha, \beta) = \arg\min_{(a,b)} E(y - a - bx)^2,
\]
where α = E(y) − E(x)β and
\[
\beta = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = \rho_{xy}\sqrt{\frac{\mathrm{var}(y)}{\mathrm{var}(x)}}.
\]
Recall that
\[
\rho_{xy} = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\mathrm{var}(y)}}
\]
is the population Pearson correlation coefficient. So Corollary 12.1 gives the population
version of the Galtonian formula. I leave the proofs of Theorem 12.2 and Corollary 12.1 as
Problems 12.2 and 12.3.
With β, we can define
ε = y − xt β (12.3)
as the population residual. Because we usually do not use the upper-case letter E for ε, this
notation may cause confusions with previous discussion on OLS where ε denotes the vector
of the error terms. Here ε is a scalar in (12.3). By the definition of β, we can verify
E(xε) = E {x(y − xt β)} = E(xy) − E(xxt )β = 0. (12.4)
If we include 1 as a component of x, then E(ε) = 0, which, coupled with (12.4), implies
cov(x, ε) = 0. So with an intercept in β, the mean of the population residual must be zero,
and it is uncorrelated with other covariates by construction.
We can also rewrite (12.3) as
y = xt β + ε, (12.5)
which holds by the definition of the population OLS coefficient and residual without any
modeling assumption. We call (12.5) the population OLS decomposition.

12.2 Population FWL theorem and Cochran’s formula


To aid interpretation of the population OLS coefficient, we have the population FWL the-
orem.

Theorem 12.3 Consider the population OLS decomposition

y = β0 + β1 x1 + · · · + βp−1 xp−1 + ε (12.6)

and the partial population OLS decompositions:

xk = γ0 + γ1 x1 + · · · (no xk ) · · · + γp−1 xp−1 + x̃k , (12.7)
y = δ0 + δ1 x1 + · · · (no xk ) · · · + δp−1 xp−1 + ỹ, (12.8)
ỹ = β̃k x̃k + ε̃, (12.9)

where x̃k is the residual from (12.7) and ỹ is the residual from (12.8). The coefficient βk
from (12.6) equals cov(x̃k , ỹ)/var(x̃k ), the coefficient of x̃k from the population OLS of ỹ on
x̃k in (12.9), which also equals cov(x̃k , y)/var(x̃k ), the coefficient of x̃k from the population
OLS of y on x̃k . Moreover, ε from (12.6) equals ε̃ from (12.9).

Similar to the proof of Theorem 7.1, we can invert the matrix E(xxt ) in (12.2) to prove
Theorem 12.3 directly. Below I adopt an alternative proof which is a modification of the
one given by Angrist and Pischke (2008).
Proof of Theorem 12.3: Some basic identities of population OLS help to simplify the
proof below. First, the OLS decomposition (12.7) ensures

cov(x̃k , xk ) = cov(x̃k , γ0 + γ1 x1 + · · · + no xk + γp−1 xp−1 + x̃k )


= cov(x̃k , x̃k )

that is,
cov(x̃k , xk ) = var(x̃k ). (12.10)
It also ensures

cov(x̃k , xl ) = 0, (l ̸= k). (12.11)

Second, the OLS decompositions (12.6) and (12.8) ensure

cov(x̃k , ε) = 0, (12.12)

because x̃k is a linear combination of x by (12.7). It also ensures

cov(x̃k , δ0 + δ1 x1 + · · · (no xk ) · · · + δp−1 xp−1 ) = 0,

which further implies


cov(x̃k , y) = cov(x̃k , ỹ). (12.13)
Now I prove the equivalent forms of the coefficient βk . From (12.6), we have
\begin{align*}
\mathrm{cov}(\tilde x_k, y) &= \mathrm{cov}(\tilde x_k, \beta_0 + \beta_1x_1 + \cdots + \beta_{p-1}x_{p-1} + \varepsilon) \\
&= \beta_k\mathrm{cov}(\tilde x_k, x_k) + \sum_{l\neq k}\beta_l\mathrm{cov}(\tilde x_k, x_l) + \mathrm{cov}(\tilde x_k, \varepsilon) \\
&= \beta_k\mathrm{var}(\tilde x_k),
\end{align*}

by (12.10)–(12.12). Therefore,
\[
\beta_k = \frac{\mathrm{cov}(\tilde x_k, y)}{\mathrm{var}(\tilde x_k)},
\]
which also equals β̃k by (12.13).
Finally, I prove ε = ε̃. It suffices to show that ε̃ = ỹ − β̃k x̃k satisfies

E(ε̃) = 0, cov(xk , ε̃) = 0, cov(xl , ε̃) = 0 (l ̸= k).

The first identity holds by (12.9). The second identity holds because

cov(xk , ε̃) = cov(xk , ỹ) − β̃k cov(xk , x̃k ) = cov(x̃k , ỹ) − β̃k cov(x̃k , x̃k ) = 0

by (12.13), (12.10), and the population OLS of (12.9). The third identity holds because

cov(xl , ε̃) = cov(xl , ỹ) − β̃k cov(xl , x̃k ) = 0

by (12.8) and (12.11). □


We also have a population version of Cochran’s formula. Assume (yi , xi1 , xi2 )ni=1 are IID,
where yi is a scalar, xi1 has dimension k, and xi2 has dimension l. We have the following
OLS decompositions of random variables

yi = β1t xi1 + β2t xi2 + εi , (12.14)


yi = β̃2t xi2 + ε̃i , (12.15)
xi1 = δ t xi2 + ui . (12.16)

Equation (12.14) defines the population long regression, and Equation (12.15) defines the
population short regression. In Equation (12.16), δ is an l × k matrix because it is the OLS
decomposition of a vector on a vector. We can view (12.16) as OLS decomposition of each
component of xi1 on xi2 . The following theorem states the population version of Cochran’s
formula.

Theorem 12.4 Based on (12.14)–(12.16), we have β̃2 = β2 + δβ1 .

I relegate the proof of Theorem 12.4 as Problem 12.5.

12.3 Population R2 and partial correlation coefficient


Exclude 1 from x and assume x ∈ Rp−1 in this subsection. Assume that covariates and
outcome are centered with mean zero and covariance matrix
\[
\mathrm{cov}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \sigma_y^2 \end{pmatrix}.
\]
There are multiple equivalent definitions of R2 . I start with the following definition
\[
R^2 = \frac{\Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}}{\sigma_y^2},
\]

and will give several equivalent definitions below. Let β be the population OLS coefficient
of y on x, and ŷ = xt β be the best linear predictor.

Theorem 12.5 R2 equals the ratio of the variance of the best linear predictor of y and the
variance of y itself:
\[
R^2 = \frac{\mathrm{var}(\hat y)}{\mathrm{var}(y)}.
\]
Proof of Theorem 12.5: Because of the centering of x, we can verify that
\begin{align*}
\mathrm{var}(\hat y) &= \beta^t\mathrm{cov}(x)\beta \\
&= \mathrm{cov}(y, x)\mathrm{cov}(x)^{-1}\mathrm{cov}(x)\mathrm{cov}(x)^{-1}\mathrm{cov}(x, y) \\
&= \mathrm{cov}(y, x)\mathrm{cov}(x)^{-1}\mathrm{cov}(x, y).
\end{align*}

Theorem 12.6 R2 equals the maximum value of the squared Pearson correlation coefficient
between y and a linear combination of x:
\[
R^2 = \max_{b\in\mathbb{R}^{p-1}} \rho^2(y, x^tb) = \rho^2(y, \hat y).
\]
Proof of Theorem 12.6: We have
\[
\rho^2(y, x^tb) = \frac{\mathrm{cov}^2(y, x^tb)}{\mathrm{var}(y)\mathrm{var}(x^tb)} = \frac{b^t\Sigma_{xy}\Sigma_{yx}b}{\sigma_y^2 \times b^t\Sigma_{xx}b}.
\]
Define $\gamma = \Sigma_{xx}^{1/2}b$ and $b = \Sigma_{xx}^{-1/2}\gamma$ such that b and γ have a one-to-one mapping. So the
maximum value of
\[
\sigma_y^2 \times \rho^2(y, x^tb) = \frac{\gamma^t\Sigma_{xx}^{-1/2}\Sigma_{xy}\Sigma_{yx}\Sigma_{xx}^{-1/2}\gamma}{\gamma^t\gamma}
\]
equals the maximum eigenvalue of $\Sigma_{xx}^{-1/2}\Sigma_{xy}\Sigma_{yx}\Sigma_{xx}^{-1/2}$, attained when γ equals the corre-
sponding eigenvector; see Theorem A.4. This matrix is positive semi-definite and has rank
one due to Proposition A.1, so it has exactly one non-zero eigenvalue which must equal its
trace. So
\begin{align*}
\max_{b\in\mathbb{R}^{p-1}} \rho^2(y, x^tb) &= \sigma_y^{-2}\mathrm{trace}(\Sigma_{xx}^{-1/2}\Sigma_{xy}\Sigma_{yx}\Sigma_{xx}^{-1/2}) \\
&= \sigma_y^{-2}\mathrm{trace}(\Sigma_{xy}\Sigma_{yx}\Sigma_{xx}^{-1/2}\Sigma_{xx}^{-1/2}) \\
&= \sigma_y^{-2}\mathrm{trace}(\Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}) \\
&= \sigma_y^{-2}\Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy} \\
&= R^2.
\end{align*}
We can easily verify that R2 = ρ2 (y, ŷ). □


We can also define population partial correlation and R2 . For scalar y and x with another
scalar or vector w, we can define the population OLS decomposition based on (1, w) as

y = ŷ + ỹ, x = x̂ + x̃, (12.17)

where

ỹ = {y − E(y)} − {w − E(w)}t βy , x̃ = {x − E(x)} − {w − E(w)}t βx ,

with βy and βx being the coefficients of w in these population OLS. We then define the
population partial correlation coefficient as

ρyx|w = ρỹx̃ .

If the marginal correlation and partial correlation have different signs, then we have Simp-
son’s paradox at the population level.
With a scalar w, we have a more explicit formula below.
Theorem 12.7 For scalar (y, x, w), we have
\[
\rho_{yx|w} = \frac{\rho_{yx} - \rho_{xw}\rho_{yw}}{\sqrt{1 - \rho_{xw}^2}\sqrt{1 - \rho_{yw}^2}}.
\]
I relegate the proof of Theorem 12.7 as Problem 12.7.


We can also extend the definition of ρyx|w to the partial R2 with a scalar y and possibly
vector x and w. The population OLS decompositions (12.17) still hold in the general case.
Then we can define the partial R2 between y and x given w as the R2 between ỹ and x̃:
\[
R^2_{y.x|w} = R^2_{\tilde y.\tilde x}.
\]

12.4 Inference for the population OLS


12.4.1 Inference with the EHW standard errors
Based on the IID data (xi , yi )ni=1 , we can easily obtain the moment estimator for the pop-
ulation OLS coefficient
\[
\hat\beta = \left(n^{-1}\sum_{i=1}^n x_ix_i^t\right)^{-1}\left(n^{-1}\sum_{i=1}^n x_iy_i\right),
\]
and the residuals ε̂i = yi − xti β̂. Assume finite fourth moments of (x, y). We can use the law
of large numbers to show that
\[
n^{-1}\sum_{i=1}^n x_ix_i^t \to E(xx^t), \qquad n^{-1}\sum_{i=1}^n x_iy_i \to E(xy),
\]
so β̂ → β in probability. We can use the CLT to show that $n^{-1/2}\sum_{i=1}^n x_i\varepsilon_i \to \mathrm{N}(0, M)$ in
distribution, where M = E(ε2 xxt ), so
\[
\sqrt{n}(\hat\beta - \beta) \to \mathrm{N}(0, V) \tag{12.18}
\]
in distribution, where V = B −1 M B −1 with B = E(xxt ). The moment estimator for the
asymptotic variance of β̂ is again the Eicker–Huber–White robust covariance estimator:
\[
\hat V_{\mathrm{ehw}} = n^{-1}\left(n^{-1}\sum_{i=1}^n x_ix_i^t\right)^{-1}\left(n^{-1}\sum_{i=1}^n \hat\varepsilon_i^2x_ix_i^t\right)\left(n^{-1}\sum_{i=1}^n x_ix_i^t\right)^{-1}. \tag{12.19}
\]

Following almost the same proof of Theorem 6.3, we can show that V̂ehw is consistent for
the asymptotic covariance V . I summarize the formal results below, with the proof relegated
as Problem 12.4.
Theorem 12.8 Assume that $(x_i, y_i)_{i=1}^n \overset{\mathrm{iid}}{\sim} (x, y)$ with E(∥x∥4 ) < ∞ and E(y 4 ) < ∞. We
have (12.18) and nV̂ehw → V in probability.

So the EHW standard error is not only robust to the heteroskedasticity of the errors but
also robust to the misspecification of the linear model (Huber, 1967; White, 1980b; Angrist
and Pischke, 2008; Buja et al., 2019a). Of course, when the linear model is wrong, we need
to modify the interpretation of β: it is the coefficient of x in the best linear prediction of y
or the best linear approximation of the conditional mean function E(y | x).
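The estimator (12.19) is simple to compute by hand; the sketch below (my own illustration, with a deliberately misspecified linear fit) matches it against the HC0 option of vcovHC in the sandwich package.
## the EHW covariance (12.19) by hand versus sandwich::vcovHC(type = "HC0")
library(sandwich)
set.seed(2024)
n = 1000
x = runif(n)
y = x^2 + rnorm(n, 0, 0.2)             # the truth is nonlinear; we fit a line anyway
fit = lm(y ~ x)
X   = model.matrix(fit)
res = resid(fit)

B.hat = t(X) %*% X/n
M.hat = t(X) %*% diag(res^2) %*% X/n
V.ehw = solve(B.hat) %*% M.hat %*% solve(B.hat)/n
max(abs(V.ehw - vcovHC(fit, type = "HC0")))   # numerically zero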

12.5 To model or not to model?


12.5.1 Population OLS and the restricted mean model

FIGURE 12.1: Best linear approximations correspond to three different distributions of x.

I start with a simple example. In the following calculation, I will use the fact that the
kth moment of a Uniform(0, 1) random variable equals 1/(k + 1).
Example 12.1 Assume that x ∼ F (x), ε ∼ N(0, 1), x ⫫ ε, and y = x2 + ε.
1. If x ∼ F1 (x) is Uniform(−1, 1), then the best linear approximation is 1/3 + 0 · x. We
can see this result from
\[
\beta(F_1) = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = \frac{\mathrm{cov}(x, x^2)}{\mathrm{var}(x)} = \frac{E(x^3)}{\mathrm{var}(x)} = 0,
\]
and α(F1 ) = E(y) = E(x2 ) = 1/3.
2. If x ∼ F2 (x) is Uniform(0, 1), then the best linear approximation is −1/6 + x. We can
see this result from
\[
\beta(F_2) = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = \frac{\mathrm{cov}(x, x^2)}{\mathrm{var}(x)}
= \frac{E(x^3) - E(x)E(x^2)}{E(x^2) - \{E(x)\}^2}
= \frac{1/4 - 1/2\times 1/3}{1/3 - (1/2)^2} = 1,
\]
and α(F2 ) = E(y) − βE(x) = E(x2 ) − E(x) = 1/3 − 1/2 = −1/6.
3. If x ∼ F3 (x) is Uniform(−1, 0), then the best linear approximation is −1/6 − x. This
result holds by symmetry.
Figure 12.1 shows the true conditional mean function x2 and the best linear approxima-
tions. As highlighted in the notation above, the best linear approximation depends on the
distribution of x.
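A quick simulation (my own sketch; the sample size is arbitrary) reproduces the three best linear approximations in Example 12.1.
## simulate Example 12.1: the population OLS depends on the distribution of x
set.seed(2024)
n   = 1e6
eps = rnorm(n)
sim = function(x) { y = x^2 + eps; round(coef(lm(y ~ x)), 3) }
rbind("F1: Unif(-1,1)" = sim(runif(n, -1, 1)),
      "F2: Unif(0,1)"  = sim(runif(n,  0, 1)),
      "F3: Unif(-1,0)" = sim(runif(n, -1, 0)))
## the rows approach (1/3, 0), (-1/6, 1), and (-1/6, -1)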

From the above, we can see that the best linear approximation depends on the distribu-
tion of X. This complicates the interpretation of β from the population OLS decomposition
(Buja et al., 2019a). More importantly, this can cause problems if we care about the external
validity of statistical inference (Sims, 2010, page 66).
However, if we believe the following restricted mean model

E(y | x) = xt β (12.20)

or, equivalently,
y = xt β + ε, E(ε | x) = 0,
then the population OLS coefficient is the true parameter of interest:
\begin{align*}
\{E(xx^t)\}^{-1}E(xy) &= \{E(xx^t)\}^{-1}E\{xE(y\mid x)\} \\
&= \{E(xx^t)\}^{-1}E(xx^t\beta) \\
&= \beta.
\end{align*}

Moreover, the population OLS coefficient does not depend on the distribution of x. The
above asymptotic inference applies to this model too.
Freedman (1981) distinguished two types of OLS: the regression model and the corre-
lation model, as shown in Figure 12.2. The left-hand side represents the regression model,
or the restricted mean model (12.20). In the regression model, we first generate x and ε
under some restrictions, for example, E(ε | x) = 0, and then generate the outcome based
on y = xt β + ε, a linear function of x with error ε. In the correlation model, we start with a
pair (x, y), then decompose y into the best linear predictor xt β and the leftover residual ε.
The latter ensures E(εx) = 0, but the former requires E(ε | x) = 0. So the former imposes
a stronger assumption since E(ε | x) = 0 implies

E(εx) = E{E(εx | x)} = E{E(ε | x)x} = 0.

12.5.2 Anscombe’s Quartet: the importance of graphical diagnostics


Anscombe (1973) used four simple datasets to illustrate the importance of graphical diag-
nostics in linear regression. His datasets are in anscombe in the R package datasets: x1
and y1 constitute the first dataset, and so on.
> library(datasets)
> ## Anscombe's Quartet
> anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04

FIGURE 12.2: Freedman's classification of OLS (left panel: regression model; right panel: correlation model)

7 6 6 6 8 7.24 6.13 6.08 5.25


8 4 4 4 19 4.26 3.10 5.39 12.50
9 12 12 12 8 10.84 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89

The four datasets have similar sample moments.


> ## mean of x
> c(mean(anscombe$x1),
+   mean(anscombe$x2),
+   mean(anscombe$x3),
+   mean(anscombe$x4))
[1] 9 9 9 9
> ## variance of x
> c(var(anscombe$x1),
+   var(anscombe$x2),
+   var(anscombe$x3),
+   var(anscombe$x4))
[1] 11 11 11 11
> ## mean of y
> c(mean(anscombe$y1),
+   mean(anscombe$y2),
+   mean(anscombe$y3),
+   mean(anscombe$y4))
[1] 7.500909 7.500909 7.500000 7.500909
> ## variance of y
> c(var(anscombe$y1),
+   var(anscombe$y2),
+   var(anscombe$y3),
+   var(anscombe$y4))
[1] 4.127269 4.127629 4.122620 4.123249

The results based on linear regression are almost identical.


> ols1 = lm(y1 ~ x1, data = anscombe)
> summary(ols1)

Call:
lm(formula = y1 ~ x1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665,  Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

> ols2 = lm(y2 ~ x2, data = anscombe)
> summary(ols2)

Call:
lm(formula = y2 ~ x2, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9009 -0.7609  0.1291  0.9491  1.2691

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.001      1.125   2.667  0.02576 *
x2             0.500      0.118   4.239  0.00218 **

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662,  Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

> ols3 = lm(y3 ~ x3, data = anscombe)
> summary(ols3)

Call:
lm(formula = y3 ~ x3, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1586 -0.6146 -0.2303  0.1540  3.2411

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x3            0.4997     0.1179   4.239  0.00218 **

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6663,  Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

> ols4 = lm(y4 ~ x4, data = anscombe)
> summary(ols4)

Call:
lm(formula = y4 ~ x4, data = anscombe)

Residuals:
   Min     1Q Median     3Q    Max
-1.751 -0.831  0.000  0.809  1.839

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0017     1.1239   2.671  0.02559 *
x4            0.4999     0.1178   4.243  0.00216 **

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6667,  Adjusted R-squared: 0.6297
F-statistic: 18 on 1 and 9 DF,  p-value: 0.002165

However, the scatter plots of the datasets in Figure 12.3 reveal fundamental differences
between the datasets. The first dataset seems ideal for linear regression. The second dataset
shows a quadratic form of y versus x, and therefore, the linear model is misspecified. The
third dataset shows a linear trend of y versus x, but an outlier has severely distorted the
slope of the fitted line. The fourth dataset is supported on only two values of x and thus
may suffer from severe extrapolation.

12.5.3 More on residual plots


Most standard statistical theory for inference assumes a correctly specified linear model
(e.g., Gauss–Markov model, Normal linear model, or restricted mean model). However, the
corresponding inferential procedures are often criticized since it is challenging to ensure that
the model is correctly specified. Alternatively, we can argue that without assuming the linear
model, the OLS estimator is consistent for the coefficient in the best linear approximation
of the conditional mean function E(y | x), which is often a meaningful quantity even when the
linear model is misspecified. This can be misleading. Example 12.1 shows that the best
linear approximation can be a bad approximation to a nonlinear conditional mean function,
and it depends on the distribution of the covariates.
A classic statistical approach is to check whether the residual ε̂i has any nonlinear trend
with respect to the covariates. With only a few covariates, we can plot the residual against
each covariate; with many covariates, we can plot the residual against the fitted value ŷi .
Figure 12.4 gives four examples with the R code in code12.5.2.R. In these examples, the
covariates are the same:
n = 200
x1 = rexp(n)
x2 = runif(n)

The outcome models differ:


1. Example a: y = x1 + x2 + rnorm(n)

2. Example b: y = x1 + x2 + rnorm(n, 0, x1+x2)


3. Example c: y = x1^2 + x2^2 + rnorm(n)
4. Example d: y = x1^2 + x2^2 + rnorm(n, 0, x1+x2)
In the last two examples, the residuals indeed show some nonlinear relationship with
the covariates and the fitted value. This suggests that the linear function can be a poor
approximation to the true conditional mean function.


FIGURE 12.3: Anscombe’s Quartet: scatter plots

12.6 Conformal prediction based on exchangeability


Chapter 5.3 discusses the prediction of a future outcome yn+1 based on xn+1 and (X, Y ).
It requires the Normal linear model assumption. Chapter 6 relaxes the Normality assump-
tion on the error term in statistical inference but does not discuss prediction. Under the
heteroskedastic linear model assumption, we can predict the mean E(yn+1 ) = xtn+1 β by
xtn+1 β̂ with asymptotic standard error (xtn+1 V̂ehw xn+1 )1/2 , where V̂ehw is the EHW covari-
ance matrix for the OLS coefficient. However, it is fundamentally challenging to predict
yn+1 itself since the heteroskedastic linear model allows it to have a completely unknown variance $\sigma^2_{n+1}$.
Under the population OLS formulation, it seems even more challenging to predict the
future outcome since we do not even assume that the linear model is correctly specified. In
particular, xtn+1 β̂ does not have the same mean as yn+1 in general. Perhaps surprisingly, we

[Figure 12.4 about here: residuals plotted against x1, x2, and the fitted value yhat, for the four examples lin.homosk, lin.heterosk, quad.homosk, and quad.heterosk.]

FIGURE 12.4: Residual plots

can construct a prediction interval for yn+1 based on xn+1 and (X, Y ) using an idea called
conformal prediction (Vovk et al., 2005; Lei et al., 2018). It leverages the exchangeability1
of the data points
(x1 , y1 ), . . . , (xn , yn ), (xn+1 , yn+1 ).
Pretending that we know the value yn+1 = y ∗ , we can fit OLS using n + 1 data points and
obtain residuals
ε̂i (y ∗ ) = yi − xti β̂(y ∗ ), (i = 1, . . . , n + 1)
where we emphasize the dependence of the OLS coefficient and residuals on the unknown
y ∗ . The absolute values of the residuals |ε̂i (y ∗ )|’s are also exchangeable, so the rank of
1 Exchangeability is a technical term in probability and statistics. Random elements $z_1, \ldots, z_n$ are exchangeable if $(z_{\pi(1)}, \ldots, z_{\pi(n)})$ have the same distribution as $(z_1, \ldots, z_n)$, where $\pi(1), \ldots, \pi(n)$ is a permutation of the integers $1, \ldots, n$. In other words, a set of random elements are exchangeable if their joint distribution does not change under re-ordering. IID random elements are exchangeable.

$|\hat\varepsilon_{n+1}(y^*)|$, denoted by
$$\hat{R}_{n+1}(y^*) = 1 + \sum_{i=1}^{n} 1\{|\hat\varepsilon_i(y^*)| \leq |\hat\varepsilon_{n+1}(y^*)|\},$$
must have a uniform distribution over $\{1, 2, \ldots, n, n+1\}$, a known distribution not depending on anything else. It is a pivotal quantity satisfying
$$\textup{pr}\left\{\hat{R}_{n+1}(y^*) \leq \lceil(1-\alpha)(n+1)\rceil\right\} \geq 1 - \alpha. \qquad (12.21)$$

Equivalently, this is a statement linking the unknown quantity y ∗ and observed data, so it
gives a confidence set for y ∗ at level 1 − α. In practice, we can use a grid search to solve for
the inequality (12.21) involving y ∗ .
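As an illustration of this grid search, the following sketch constructs a conformal prediction interval for a single new covariate vector by refitting OLS for every candidate value of y*. The function name conformal.interval and its arguments are for illustration only, and X is assumed to be a numeric design matrix that already contains the intercept column; the Boston housing code below avoids the repeated refitting via leave-one-out tricks.

## sketch of conformal prediction for one new point x.new
conformal.interval = function(X, Y, x.new, alpha = 0.05,
                              grid.y = seq(min(Y) - 30, max(Y) + 30, 0.1))
{
  n   = length(Y)
  cvr = ceiling((1 - alpha)*(n + 1))
  keep = sapply(grid.y, FUN = function(y){
    XX  = rbind(X, x.new)            # augmented design matrix
    YY  = c(Y, y)                    # pretend y_{n+1} = y
    fit = lm(YY ~ XX - 1)            # X is assumed to contain the intercept column
    res = abs(fit$residuals)
    rank(res)[n + 1] <= cvr          # rank of the new point's absolute residual
  })
  range(grid.y[keep])
}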
Below we evaluate the leave-one-out prediction with the Boston housing data.
library("mlbench")
data(BostonHousing)
attach(BostonHousing)
n = dim(BostonHousing)[1]
p = dim(BostonHousing)[2] - 1
ymin = min(medv)
ymax = max(medv)
grid.y = seq(ymin - 30, ymax + 30, 0.1)
BostonHousing = BostonHousing[order(medv), ]
detach(BostonHousing)

ols.fit.full = lm(medv ~ ., data = BostonHousing,
                  x = TRUE, y = TRUE, qr = TRUE)
beta     = ols.fit.full$coef
e.sigma  = summary(ols.fit.full)$sigma
X        = ols.fit.full$x
Y        = ols.fit.full$y
X.QR     = ols.fit.full$qr
X.Q      = qr.Q(X.QR)
X.R      = qr.R(X.QR)
Gram.inv = solve(t(X.R) %*% X.R)
hatmat   = X.Q %*% t(X.Q)
resmat   = diag(n) - hatmat
leverage = diag(hatmat)
Resvec   = ols.fit.full$residuals

cvt = qt(0.975, df = n - p - 1)
cvr = ceiling(0.95*(n + 1))

loo.pred = matrix(0, n, 5)
loo.cov  = matrix(0, n, 2)
for(i in 1:n)
{
  beta.i    = beta - Gram.inv %*% X[i, ]*Resvec[i]/(1 - leverage[i])
  e.sigma.i = sqrt(e.sigma^2*(n - p) -
                   (Resvec[i])^2/(1 - leverage[i]))/
              sqrt(n - p - 1)
  pred.i  = sum(X[i, ]*beta.i)
  lower.i = pred.i - cvt*e.sigma.i/sqrt(1 - leverage[i])
  upper.i = pred.i + cvt*e.sigma.i/sqrt(1 - leverage[i])
  loo.pred[i, 1:3] = c(pred.i, lower.i, upper.i)
  loo.cov[i, 1]    = findInterval(Y[i], c(lower.i, upper.i))

  grid.r = sapply(grid.y,
                  FUN = function(y){
                    Res = Resvec + resmat[, i]*(y - Y[i])

[Figure 12.5 about here: observed outcomes and the Normal and conformal prediction intervals for the first 20, middle 20, and last 20 observations, ordered by the outcome.]

FIGURE 12.5: Leave-one-out prediction intervals based on the Boston housing data

                    rank(abs(Res))[i]
                  })
  Cinterval = range(grid.y[grid.r <= cvr])
  loo.pred[i, 4:5] = Cinterval
  loo.cov[i, 2]    = findInterval(Y[i], Cinterval)
}

In the above code, I use the QR decomposition to compute $X^tX$ and $H$. Moreover, the calculations of lower.i, upper.i, and Res use some tricks to avoid fitting OLS n times. I relegate their justification to Problem 12.10.
The variable loo.pred has five columns corresponding to the point predictions and the lower and upper limits of the prediction intervals based on the Normal linear model and on conformal prediction.
> colnames(loo.pred) = c("point", "G.l", "G.u", "c.l", "c.u")
> head(loo.pred)
         point        G.l       G.u   c.l  c.u
[1,]  6.633514  -2.941532 16.208559  -3.5 16.7
[2,]  8.806641  -1.349367 18.962649  -2.6 20.1
[3,] 12.044154   2.608290 21.480018   2.2 21.8
[4,] 11.025253   1.565152 20.485355   1.2 21.0
[5,] -5.181154 -14.819041  4.456733 -15.0  4.9
[6,]  8.324114  -1.382910 18.031138  -2.0 18.8

Figure 12.5 plots the observed outcomes and the prediction intervals for the 20 obser-
vations with the outcomes at the bottom, middle, and top. The Normal and conformal
intervals are almost indistinguishable. For the observations with the highest outcome, the
predictions are quite poor. Surprisingly, the overall coverage rates across observations are
close to 95% for both methods.
> apply(loo.cov == 1, 2, mean)
[1] 0.9486166 0.9525692

Figure 12.6 compares the lengths of the two prediction intervals. Although the conformal
prediction intervals are slightly wider than the Normal prediction intervals, the differences
are rather small, with the ratio of the lengths above 0.96.
The R code is in code12.6.R.

[Figure 12.6 about here: ratio of the lengths of the leave-one-out prediction intervals (Normal over conformal) against the observation index i.]

FIGURE 12.6: Boston housing data

12.7 Homework problems


12.1 Conditional mean
Prove Theorem 12.1.

12.2 Best linear approximation


Prove Theorem 12.2.
Hint: It is similar to Problem 8.6.

12.3 Univariate population OLS


Prove Corollary 12.1.

12.4 Asymptotics for the population OLS


Prove Theorem 12.8.

12.5 Population Cochran’s formula


Prove Theorem 12.4.

12.6 Canonical correlation analysis (CCA)


Assume that (x, y), where x ∈ Rp and y ∈ Rk , has the joint non-degenerate covariance
matrix:
$$\begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}.$$

1. Find the best linear combinations (α, β) that give the maximum Pearson correlation coefficient:
$$(\alpha, \beta) = \arg\max_{a \in \mathbb{R}^k, b \in \mathbb{R}^p} \rho(y^t a, x^t b).$$

Note that you need to detail the steps in calculating (α, β) based on the covariance
matrix above.
2. Define the maximum value as cc(x, y). Show that cc(x, y) ≥ 0 and cc(x, y) = 0 if x ⊥⊥ y.
Remark: The maximum value cc(x, y) is called the canonical correlation between x and
y. We can also define partial canonical correlation between x and y given w.

12.7 Population partial correlation coefficient


Prove Theorem 12.7.

12.8 Independence and correlation


With scalar random variables x and y, show that if x ⊥⊥ y, then ρyx = 0. With another
random variable w, if x ⊥⊥ y | w, does ρyx|w = 0 hold? If so, give a proof; otherwise, give a
counterexample.

12.9 Best linear approximation of a cubic curve


Assume that x ∼ N(0, 1), ε ∼ N(0, σ 2 ), x ⊥⊥ ε, and y = x^3 + ε. Find the best linear approxi-
mation of E(y | x) based on x.

12.10 Leave-one-out formula in conformal prediction


Justify the calculations of lower.i, upper.i, and Res in Section 12.6.

12.11 Conformal prediction for multiple outcomes


Assume exchangeability of

(x1 , y1 ), . . . , (xn , yn ), (xn+1 , yn+1 ), . . . , (xn+k , yn+k ).

Propose a method to construct joint conformal prediction regions for (yn+1 , . . . , yn+k ) based
on (X, Y ) and (xn+1 , . . . , xn+k ).

12.12 Cox’s theorem


Cox (1960) considered the data-generating process

x1 −→ x2 −→ y

under the following linear models: for i = 1, . . . , n, we have

xi2 = α0 + α1 xi1 + ηi

and
yi = β0 + β1 xi2 + εi
where ηi has mean 0 and variance ση2 , εi has mean 0 and variance σε2 , and the ηi s and εi s
are independent. The linear model implies

yi = (β0 + β1 α0 ) + (β1 α1 )xi1 + (εi + β1 ηi )

where εi + β1 ηi s are independent with mean 0 and variance σε2 + β12 ση2 .
Therefore, we have two ways to estimate β1 α1 :
1. the first estimator is γ̂1 , the OLS estimator of the yi ’s on the xi1 ’s with the intercept;

2. the second estimator is α̂1 β̂1 , the product of the OLS estimator of the xi2 ’s on the xi1 ’s
with the intercept and that of the yi ’s on the xi2 ’s with the intercept.
Cox (1960) proved the following theorem.

Theorem 12.9 Let X1 = (x11 , . . . , xn1 ). We have var(α̂1 β̂1 | X1 ) ≤ var(γ̂1 | X1 ), and
more precisely,
$$\mathrm{var}(\hat\gamma_1 \mid X_1) = \frac{\sigma_\varepsilon^2 + \beta_1^2 \sigma_\eta^2}{\sum_{i=1}^n (x_{i1} - \bar{x}_1)^2}$$
and
$$\mathrm{var}(\hat\alpha_1 \hat\beta_1 \mid X_1) = \frac{\sigma_\varepsilon^2 E(\hat\rho_{12}^2 \mid X_1) + \beta_1^2 \sigma_\eta^2}{\sum_{i=1}^n (x_{i1} - \bar{x}_1)^2}$$
where ρ̂12 ∈ [−1, 1] is the sample Pearson correlation coefficient between the xi1 ’s and the
xi2 ’s.

Prove Theorem 12.9.


Remark: If we further assume that the error terms are Normal, then α̂1 β̂1 is the max-
imum likelihood estimator for α1 β1 . Therefore, the asymptotic optimality theory for the
maximum likelihood estimator justifies the superiority of α̂1 β̂1 over γ̂1 . Cox’s theorem pro-
vides a stronger finite-sample result without assuming the Normality of the error terms.

12.13 Measurement error and Frisch’s bounds


1. Given scalar random variables x and y, we can obtain the population OLS coefficient
(α, β) of y on (1, x). However, x and y may be measured with errors, that is, we observe
x∗ = x + u and y ∗ = y + v, where u and v are mean zero error terms satisfying u ⊥⊥ v
and (u, v) ⊥⊥ (x, y). We can obtain the population OLS coefficient (α∗ , β ∗ ) of y ∗ on (1, x∗ )
and the population OLS coefficient (a∗ , b∗ ) of x∗ on (1, y ∗ ).
Prove that if β = 0 then β ∗ = b∗ = 0; if β ̸= 0 then

|β ∗ | ≤ |β| ≤ 1/|b∗ |.

2. Given scalar random variables x, y and a random vector w, we can obtain the pop-
ulation OLS coefficient (α, β, γ) of y on (1, x, w). When x and y are measured with
error as above with mean zero errors satisfying u ⊥⊥ v and (u, v) ⊥⊥ (x, y, w), we can obtain
the population OLS coefficient (α∗ , β ∗ , γ ∗ ) of y on (1, x∗ , w), and the population OLS
coefficient (a∗ , b∗ , c∗ ) of x∗ on (1, y ∗ , w).
Prove that the same result holds as in the first part of the problem.
Remark: Tamer (2010) reviewed Frisch (1934)’s upper and lower bounds for the uni-
variate OLS coefficient based on the two OLS coefficients of the observables. The second
part of the problem extends the result to the multivariate OLS with a covariate subject to
measurement error. The lower bound is well documented in most books on measurement
errors, but the upper bound is much less well known.

12.14 A three-way decomposition


The main text focuses on the two-way decomposition of the outcome: y = xt β + ε, where
β is the population OLS coefficient and ε is the population OLS residual. However, xt β is
only the best linear approximation to the true conditional mean function µ(x) ≡ E(y | x).
This suggests the following three-way decomposition of the outcome:

y = xt β + {µ(x) − xt β} + {y − µ(x)},

which must hold without any assumptions. Introduce the notation for the linear term

ŷ = xt β,

the notation for the approximation error:

δ = µ(x) − xt β,

and the notation for the “ideal residual”:

e = y − µ(x).

Then we can decompose the outcome as

y = ŷ + δ + e

and the population OLS residual as

ε = {µ(x) − xt β} + {y − µ(x)} = δ + e.

1. Prove that
E(ŷe | x) = 0, E(δe | x) = 0
and
E(ŷe) = 0, E(δe) = 0, E(ŷδ) = 0.
Further, prove that
E(ε2 ) = E(δ 2 ) + E(e2 ).

2. Introduce an intermediate quantity between the population OLS coefficient β and the
OLS coefficient β̂:
$$\tilde\beta = \Big(n^{-1}\sum_{i=1}^n x_i x_i^t\Big)^{-1}\Big(n^{-1}\sum_{i=1}^n x_i \mu(x_i)\Big).$$

Equation (12.18) states that $\sqrt{n}(\hat\beta - \beta) \to \mathrm{N}(0, B^{-1}MB^{-1})$ in distribution, where $B = E(xx^t)$ and $M = E(\varepsilon^2 xx^t)$.
Prove that
cov(β̂ − β) = cov(β̂ − β̃) + cov(β̃ − β),
and moreover,
$$\sqrt{n}(\hat\beta - \tilde\beta) \to \mathrm{N}(0, B^{-1}M_1B^{-1}), \qquad \sqrt{n}(\tilde\beta - \beta) \to \mathrm{N}(0, B^{-1}M_2B^{-1})$$
in distribution, where
$$M_1 = E(e^2 xx^t), \qquad M_2 = E(\delta^2 xx^t).$$

Verify that M = M1 + M2 .
Remark: To prove the result, you may find the law of total covariance formula in (B.4)
helpful. We can also write M1 as M1 = E{var(y | x)xxt }. So the meat matrix M has two
sources of uncertainty, one is from the conditional variance of y given x, and the other is
from the approximation error.
Part V

Overfitting, Regularization, and Model Selection
13
Perils of Overfitting

Previous chapters assume that the covariate matrix X is given and the linear model, cor-
rectly specified or not, is also given. Although including useless covariates in the linear
model results in less precise estimators, this problem is not severe when the total number
of covariates is small compared to the sample size. In many modern applications, however,
the number of covariates can be large compared to the sample size. Sometimes, it can be
a nonignorable fraction of the sample size; sometimes, it can even be larger than the sam-
ple size. For instance, modern DNA sequencing technology often generates covariates of
millions of dimensions, which is much larger than the usual sample size under study. In
these applications, the theory in previous chapters is inadequate. This chapter introduces
an important notion in statistics: overfitting.

13.1 David Freedman’s simulation


Freedman (1983) used a simple simulation to illustrate the problem with a large number
of covariates. He simulated data from the following Normal linear model Y = Xβ + ε with
ε ∼ N(0, σ 2 In ) and β = (µ, 0, . . . , 0)t . He then computed the sample R2 . Since the covariates
do not explain any variability of the outcome at all in the true model, we would expect R2
to be extremely small over repeated sampling. However, he showed, via both simulation and
theory, that R2 is surprisingly large when p is large compared to n.
Figure 13.1 shows the results from Freedman’s simulation setting with n = 100 and
p = 50, over 1000 replications. The R code is in code13.1.R. The (1, 1)th subfigure shows the
histogram of the R2 , which centers around 0.5. This can be easily explained by the exact
distribution of R2 proved in Corollary 10.1:
 
$$R^2 \sim \textup{Beta}\left(\frac{p-1}{2}, \frac{n-p}{2}\right),$$

with the density shown in the (1, 1)th and (1, 2)th subfigure of Figure 13.1. The beta
distribution above has mean
$$E(R^2) = \frac{\frac{p-1}{2}}{\frac{p-1}{2} + \frac{n-p}{2}} = \frac{p-1}{n-1}$$
and variance
$$\mathrm{var}(R^2) = \frac{\frac{p-1}{2} \times \frac{n-p}{2}}{\left(\frac{p-1}{2} + \frac{n-p}{2}\right)^2 \left(\frac{p-1}{2} + \frac{n-p}{2} + 1\right)} = \frac{2(p-1)(n-p)}{(n-1)^2(n+1)}.$$


When p/n → 0, we have


E(R2 ) → 0, var(R2 ) → 0,
so Markov’s inequality implies that R2 → 0 in probability. However, when p/n → γ ∈ (0, 1),
we have
E(R2 ) → γ, var(R2 ) → 0,
so Markov’s inequality implies that R2 → γ in probability. This means that when p has the
same order as n, the sample R2 is close to the ratio p/n even though there is no association
between the covariates and the outcome in the true data-generating process. In Freedman’s
simulation, γ = 0.5 so R2 is close to 0.5.
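A minimal sketch of one replication of Freedman's simulation, with µ = 0 for simplicity (the full code is in code13.1.R):

## one replication of Freedman's simulation: pure-noise covariates
n = 100
p = 50
X = matrix(rnorm(n*p), n, p)
Y = rnorm(n)                  # the outcome is pure noise, unrelated to X
fit = lm(Y ~ X)
summary(fit)$r.squared        # typically close to p/n = 0.5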
The (1, 2)th subfigure shows the histogram of the R2 based on a model selection first step
by dropping all covariates with p-values larger than 0.25. The R2 in the (1, 2)th subfigure
are slightly smaller but still centered around 0.37. The joint F test based on the selected
model does not generate uniform p-values in the (2, 2)th subfigure, in contrast to the uniform
p-values in the (2, 1)th subfigure. With a model selection step, statistical inference becomes
much more complicated. This is a topic called selective inference which is beyond the scope
of this book.
The above simulation and calculation give an important warning: we cannot over-
interpret the sample R2 since it can be too optimistic about model fitting. In much empirical
research, R2 is at most 0.1 with a large number of covariates, making us wonder whether
those researchers are just chasing the noise rather than the signal. So we do not trust R2
as a model-fitting measure with a large number of covariates. In general, R2 cannot avoid
overfitting, and we must modify it in model selection.

13.2 Variance inflation factor


The following theorem quantifies the potential problem of including too many covariates in
OLS.

Theorem 13.1 Consider a fixed design matrix X. Let β̂j be the coefficient of Xj of the
OLS fit of Y on (1n , X1 , . . . , Xq ) with q ≤ p. Under the model yi = f (xi ) + εi with an
unknown f (·) and the εi 's uncorrelated with mean zero and variance σ 2 , the variance of β̂j equals
$$\mathrm{var}(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2} \times \frac{1}{1 - R_j^2},$$
where Rj2 is the sample R2 from the OLS fit of Xj on 1n and all other covariates.

Theorem 13.1 does not even assume that the true mean function is linear. It states that
the variance of β̂j has two multiplicative components. If we run a short regression of Y on
1n and Xj = (x1j , . . . , xnj )t , the coefficient equals
$$\tilde\beta_j = \frac{\sum_{i=1}^n (x_{ij} - \bar{x}_j) y_i}{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}$$
where $\bar{x}_j = n^{-1}\sum_{i=1}^n x_{ij}$. It has variance
$$\mathrm{var}(\tilde\beta_j) = \mathrm{var}\left(\frac{\sum_{i=1}^n (x_{ij} - \bar{x}_j) y_i}{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}\right) = \frac{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2 \sigma^2}{\left\{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2\right\}^2} = \frac{\sigma^2}{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}.$$

So the first component is the variance of the OLS coefficient in the short regression. The
second component 1/(1 − Rj2 ) is called the variance inflation factor (VIF). The VIF indeed
inflates the variance of β̃j , and the more covariates are added into the long regression, the
larger the variance inflation factor is. In R, the car package provides the function vif to
compute the VIF for each covariate.
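As a small illustration, the following sketch computes a VIF by hand and, assuming the car package is installed, via car::vif; the mtcars data are used only as an example.

## variance inflation factors, via the car package and by hand
library(car)
fit = lm(mpg ~ wt + hp + disp, data = mtcars)
vif(fit)
## by hand: the VIF of wt is 1/(1 - R_j^2), with R_j^2 from regressing wt on the others
r2.wt = summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
1/(1 - r2.wt)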
The proof of Theorem 13.1 below is based on the FWL theorem.
Proof of Theorem 13.1: Let $\tilde{X}_j = (\tilde{x}_{1j}, \ldots, \tilde{x}_{nj})^t$ be the residual vector from the OLS fit of $X_j$ on $1_n$ and other covariates, which have a sample mean of zero. The FWL theorem implies that
$$\hat\beta_j = \frac{\sum_{i=1}^n \tilde{x}_{ij} y_i}{\sum_{i=1}^n \tilde{x}_{ij}^2}$$
which has variance
$$\mathrm{var}(\hat\beta_j) = \frac{\sum_{i=1}^n \tilde{x}_{ij}^2 \sigma^2}{\left(\sum_{i=1}^n \tilde{x}_{ij}^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^n \tilde{x}_{ij}^2}. \qquad (13.1)$$
Because $\sum_{i=1}^n \tilde{x}_{ij}^2$ is the residual sum of squares from the OLS of $X_j$ on $1_n$ and other covariates, it is related to $R_j^2$ via
$$R_j^2 = 1 - \frac{\sum_{i=1}^n \tilde{x}_{ij}^2}{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}$$
or, equivalently,
$$\sum_{i=1}^n \tilde{x}_{ij}^2 = (1 - R_j^2) \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2. \qquad (13.2)$$

Combining (13.1) and (13.2) gives Theorem 13.1. □

13.3 Bias-variance trade-off


Theorem 13.1 characterizes the variance of the OLS coefficient, but it does not characterize
its bias. In general, a more complex model is closer to the true mean function f (xi ), and can
then reduce the bias of approximating the mean function. However, Theorem 13.1 implies
that a more complex model results in larger variances of the OLS coefficients. So we face a
bias-variance trade-off.
Consider a simple case where the true data-generating process is linear:

yi = β1 + β2 xi1 + · · · + βs−1 xis + εi . (13.3)

Ideally, we want to use the model (13.3) with exactly s covariates. In practice, we may not

know which covariates to include in the OLS. If we underfit the data using a short regression
with q < s:
yi = β̃1 + β̃2 xi1 + · · · + β̃q−1 xiq + ε̃i , (i = 1, . . . , n) (13.4)
then the OLS coefficients are biased. If we increase the complexity of the model to overfit
the data using a long regression with p > s:

yi = β̂1 + β̂2 xi1 + · · · + β̂p−1 xip + ε̂i , (i = 1, . . . , n) (13.5)

then the OLS coefficients are unbiased. Theorem 13.1, however, shows that the OLS co-
efficients from the under-fitted model (13.4) have smaller variances than those from the
overfitted model (13.5).

Example 13.1 In general, we have a sequence of models with increasing complexity. For
simplicity, we consider nested models containing 1n and covariates

{X1 } ⊂ {X1 , X2 } ⊂ · · · ⊂ {X1 , . . . , Xp }

in the following simulation setting. The true linear model is yi = xti β + N(0, 1) with p = 40
but only the first 10 covariates have non-zero coefficients 1 and all other covariates have
coefficients 0. We generate two datasets: both have sample size n = 200, all covariates have
IID N(0, 1) entries, and the error terms are IID. We use the first dataset to fit the OLS
and thus call it the training dataset. We use the second dataset to assess the performance
of the fitted OLS from the training dataset, and thus call it the testing dataset1 . Figure
13.2 plots the residual sum of squares against the number of covariates in the training
and testing datasets. By definition of OLS, the residual sum of squares decreases with the
number of covariates in the training dataset, but it first decreases and then increases in the
testing dataset with minimum value attained at 10, the number of covariates in the true
data generating process.
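A minimal sketch of the training/testing comparison in Example 13.1 is below; the full code for Figures 13.2 and 13.3 is in code13.3.R.

## training vs testing RSS as the number of covariates grows
n = 200; p = 40; s = 10
beta   = c(rep(1, s), rep(0, p - s))
Xtrain = matrix(rnorm(n*p), n, p)
ytrain = as.vector(Xtrain %*% beta + rnorm(n))
Xtest  = matrix(rnorm(n*p), n, p)
ytest  = as.vector(Xtest %*% beta + rnorm(n))
rss = sapply(1:p, function(k){
  fit  = lm(ytrain ~ Xtrain[, 1:k, drop = FALSE])
  pred = cbind(1, Xtest[, 1:k, drop = FALSE]) %*% coef(fit)
  c(train = mean(fit$residuals^2), test = mean((ytest - pred)^2))
})
matplot(1:p, t(rss), type = "l", xlab = "# covariates", ylab = "RSS/n")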

The following example has a nonlinear true mean function but still uses OLS with
polynomials of covariates to approximate the truth2 .

Example 13.2 The true nonlinear model is yi = sin(2πxi ) + N(0, 1) with the xi ’s equally
spaced in [0, 1] and the error terms are IID. The training and testing datasets both have
sample sizes n = 200. Figure 13.3 plots the residual sum of squares against the order of the
polynomial in the OLS fit
$$y_i = \sum_{j=0}^{p-1} \beta_j x_i^j + \varepsilon_i.$$

By the definition of OLS, the residual sum of squares decreases with the order of polynomials
in the training dataset, but it achieves the minimum near p = 5 in the testing dataset. We
can show that the residual sum of squares decreases to zero with p = n in the training dataset;
see Problem 13.7. However, it is larger than that under p = 5 in the testing dataset.

The R code for Figures 13.2 and 13.3 is in code13.3.R.


1 Splitting a dataset into the training and testing datasets is a standard tool to assess the out-of-sample

performance of proposed methods. It is important in statistics and machine learning.


2 A celebrated theorem due to Weierstrass states that on a bounded interval, any continuous function

can be approximated arbitrarily well by a polynomial function. Here is the mathematical statement of
Weierstrass’s theorem. Suppose f is a continuous function defined on the interval [a, b]. For every ε > 0,
there exists a polynomial p such that for all x ∈ [a, b], we have |f (x) − p(x)| < ε.

13.4 Model selection criteria


With a large number of covariates X1 , . . . , Xp , we want to select a model that has the
best performance for prediction. In total, we have 2p possible models. Which one is the
best? What is the criterion for the best model? Practitioners often use the linear model for
multiple purposes. A dominant criterion is the prediction performance of the linear model
in a new dataset (Yu and Kumbier, 2020). However, we do not have the new dataset yet
in the statistical modeling stage. So we need to find criteria that are good proxies for the
prediction performance.

13.4.1 RSS, R2 and adjusted R2


The first obvious criterion is the rss, which, however, is not a good criterion because it
favors the largest model. The sample R2 has the same problem of favoring the largest
model. Most model selection criteria are in some sense modifications of rss or R2 .
The adjusted R2 takes into account the complexity of the model:
$$\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2/(n-p)}{\sum_{i=1}^n (y_i - \bar{y})^2/(n-1)} = 1 - \frac{\hat\sigma^2}{\hat\sigma_y^2}.$$
So based on R̄2 , the best model has the smallest σ̂ 2 , the estimator for the variance of the
error term in the Gauss–Markov model. The following theorem shows that R̄2 is closely
related to the F statistic in testing two nested Normal linear models.
Theorem 13.2 Consider the setting of Chapter 8.3. Test two nested Normal linear models:
Y = X1 β1 + ε
versus
Y = X1 β1 + X2 β2 + ε,
or, equivalently, test β2 = 0. We can use the standard F statistic defined in Chapter 8.3,
and we can also compare the adjusted R2 ’s from these two models: R̄12 and R̄22 . They are
related via
F > 1 ⇐⇒ R̄12 < R̄22 .
I leave the proof of the theorem as Problem 13.3. From Theorem 13.2, R̄2 does not
necessarily favor the largest model. However, R̄2 still favors unnecessarily large models
compared with the usual hypothesis testing based on the Normal linear model because the
mean of F is approximately 1, but the upper quantile of F is much larger than 1 (for
example, the 95% quantile of F1,n−p is larger than 3.8, and the 95% quantile of F2,n−p is
larger than 2.9).

13.4.2 Information criteria


Taking into account the model complexity, we can find the model with the smallest aic or bic, defined as
$$\text{aic} = n \log\frac{\text{rss}}{n} + 2p$$
and
$$\text{bic} = n \log\frac{\text{rss}}{n} + p \log n,$$
with full names “Akaike’s information criterion ” and “Bayes information criterion.”
aic and bic are both monotone functions of the rss penalized by the number of param-
eters p in the model. The penalty in bic is larger so it favors smaller models than aic. Shao
(1997)’s results suggested that bic can consistently select the true model if the linear model
is correctly specified, but aic can select the model that minimizes the prediction error if
the linear model is misspecified. In most statistical practice, the linear model assumption
cannot be justified, so we recommend using aic.
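As an illustration, the following sketch computes aic and bic as defined above for an OLS fit; note that R's built-in AIC and BIC functions are based on the full Gaussian log-likelihood, so they differ from these definitions by constants that do not affect model comparison for a fixed dataset.

## aic and bic as defined above, for an OLS fit (mtcars used only as an example)
fit = lm(mpg ~ wt + hp, data = mtcars)
n   = nrow(mtcars)
p   = length(coef(fit))          # number of mean parameters, including the intercept
rss = sum(fit$residuals^2)
aic = n*log(rss/n) + 2*p
bic = n*log(rss/n) + p*log(n)
c(aic = aic, bic = bic)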

13.4.3 Cross-validation (CV)


The first choice is the leave-one-out cross-validation based on the predicted residual:
$$\text{press} = \sum_{i=1}^n \hat\varepsilon_{[-i]}^2 = \sum_{i=1}^n \frac{\hat\varepsilon_i^2}{(1 - h_{ii})^2} \qquad (13.6)$$
which is called the predicted residual error sum of squares (PRESS) statistic.
Because the average value of $h_{ii}$ is $n^{-1}\sum_{i=1}^n h_{ii} = p/n$, we can approximate PRESS by
the generalized cross-validation (GCV) criterion:
$$\text{gcv} = \sum_{i=1}^n \frac{\hat\varepsilon_i^2}{(1 - p/n)^2} = \text{rss} \times \left(1 - \frac{p}{n}\right)^{-2}.$$

When p/n ≈ 0, we have³
$$\log \text{gcv} = \log \text{rss} - 2\log\left(1 - \frac{p}{n}\right) \approx \log \text{rss} + \frac{2p}{n} = \text{aic}/n + \log n,$$
so GCV is approximately equivalent to AIC with small p/n. With large p/n, they may have
large differences.
GCV is not crucial for OLS because it is easy to compute PRESS. However, it is much
more useful in other models where we need to fit the data n times to compute PRESS.
For a general model without simple leave-one-out formulas, it is computationally intensive
to obtain PRESS. The K-fold cross-validation (K-CV) is computationally more attractive.
The best model has the smallest K-CV, computed as follows:
1. randomly shuffle the observations;
2. split the data into K folds;
3. for each fold k, use all other folds as the training data, and compute the predicted errors
on fold k (k = 1, . . . , K);
4. aggregate the prediction errors across K folds, denoted by K-CV.
When K = 3, we split the data into 3 folds. Run OLS to obtain a fitted function with
folds 2, 3 and use it to predict on fold 1, yielding prediction error r1 ; run OLS with folds
1, 3 and predict on fold 2, yielding prediction error r2 ; run OLS with folds 1, 2 and predict
on fold 3, yielding prediction error r3 . The total prediction error is r = r1 + r2 + r3 . We
want to select a model that minimizes r. Usually, practitioners choose K = 5 or 10, but
this can depend on the computational resource.
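A minimal sketch of K-fold cross-validation for OLS is below, assuming X is a numeric covariate matrix with linearly independent columns; the function name kfold.cv is for illustration only.

## a sketch of K-fold cross-validation for an OLS model
kfold.cv = function(X, Y, K = 5)
{
  n     = length(Y)
  folds = sample(rep(1:K, length.out = n))      # randomly shuffle into K folds
  errs  = rep(0, K)
  for(k in 1:K)
  {
    train = (folds != k)
    fit   = lm(Y[train] ~ X[train, , drop = FALSE])
    pred  = cbind(1, X[!train, , drop = FALSE]) %*% coef(fit)
    errs[k] = sum((Y[!train] - pred)^2)          # prediction error on fold k
  }
  sum(errs)                                      # aggregate over the K folds
}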
3 The approximation is due to the Taylor expansion log(1 + x) = x − x2 /2 + x3 /3 − · · · ≈ x.

13.5 Best subset and forward/backward selection


Given a model selection criterion, we can select the best model.
For a small p, we can enumerate all 2p models. The function regsubsets in the R package
leaps implements this4 . Figure 13.4 shows the results of the best subset selection in two
applications, with the code in code13.4.4.R.
For large p, we can use forward or backward regressions. Forward regression starts with a
model with only the intercept. In step one, it finds the best covariate among the p candidates
based on the prespecified criterion. In step two, it keeps this covariate in the model and
finds the next best covariate among the remaining p − 1 candidates. It proceeds by adding
the next best covariate one by one.
The backward regression does the opposite. It starts with the full model with all p
covariates. In step one, it drops the worst covariate among the p candidates based on the
prespecified criterion. In step two, it drops the next worst covariate among the remaining
p − 1 candidates. It proceeds by dropping the next worst covariate one by one.
Both methods generate a sequence of models, and select the best one based on the
prespecified criterion. Forward regression works in the case with p ≥ n but it stops at step
n − 1; backward regression works only in the case with p < n. The functions step or stepAIC
in the MASS package implement these.
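The following sketch illustrates these functions, using the MASS copy of the Boston housing data purely as an example (the main text uses the mlbench copy).

## sketches of best subset and stepwise selection
library(leaps)
library(MASS)
best = regsubsets(medv ~ ., data = Boston, nvmax = 13)
summary(best)$bic                    # bic of the best model of each size
## forward and backward selection based on aic
full = lm(medv ~ ., data = Boston)
null = lm(medv ~ 1, data = Boston)
fwd  = step(null, scope = formula(full), direction = "forward", trace = 0)
bwd  = step(full, direction = "backward", trace = 0)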

13.6 Homework problems


13.1 Inflation and deflation of the estimated variance
This problem extends Theorem 13.1 to the estimated variance.
The covariate matrix X has columns 1n , X1 , . . . , Xp . Compare the coefficient of X1 in
the following long and short regressions:

Y = β̂0 1n + β̂1 X1 + · · · + β̂p Xp + ε̂,

and
Y = β̃0 1n + β̃1 X1 + · · · + β̃q Xq + ε̃,
where q < p. Under the condition in Theorem 13.1,
$$\frac{\mathrm{var}(\hat\beta_1)}{\mathrm{var}(\tilde\beta_1)} = \frac{1 - R^2_{X_1.X_2\cdots X_q}}{1 - R^2_{X_1.X_2\cdots X_p}} \geq 1,$$
recalling that $R^2_{U.V}$ denotes the $R^2$ of $U$ on $V$. Now we compare the corresponding estimated variances $\widehat{\mathrm{var}}(\hat\beta_1)$ and $\widetilde{\mathrm{var}}(\tilde\beta_1)$ based on homoskedasticity.

1. Show that
$$\frac{\widehat{\mathrm{var}}(\hat\beta_1)}{\widetilde{\mathrm{var}}(\tilde\beta_1)} = \frac{1 - R^2_{Y.X_1\cdots X_p}}{1 - R^2_{Y.X_1\cdots X_q}} \times \frac{1 - R^2_{X_1.X_2\cdots X_q}}{1 - R^2_{X_1.X_2\cdots X_p}} \times \frac{n - q - 1}{n - p - 1}.$$
4 Note that this function uses a definition of bic that differs from the above definition by a constant, but

this does not change the model selection result.



2. Using the definition of the partial R2 in Problem 10.4, show that
$$\frac{\widehat{\mathrm{var}}(\hat\beta_1)}{\widetilde{\mathrm{var}}(\tilde\beta_1)} = \frac{1 - R^2_{Y.X_{q+1}\cdots X_p \mid X_1\cdots X_q}}{1 - R^2_{X_1.X_{q+1}\cdots X_p \mid X_2\cdots X_q}} \times \frac{n - q - 1}{n - p - 1}.$$

Remark: The first result shows that the ratio of the estimated variances has three factors:
the first one corresponds to the R2 ’s of the outcome on the covariates, the second one is
identical to the one for the ratio of the true variances, and the third one corresponds to
the degrees of freedom correction. The first factor deflates the estimated variance since the
R2 increases with more covariates included in the regression, and the second and the third
factors inflate the estimated variance. Overall, whether adding more covariates inflates or
deflates the estimated variance depends on the interplay of the three factors. The answer is
not as definite as that of Theorem 13.1.
The variance inflation result in Theorem 13.1 sometimes causes confusion. It only con-
cerns the variance. When we view some covariates as random, then the bias term can also
contribute to the variance of the OLS estimator. In this case, we should interpret Theorem
13.1 with caution. See Ding (2021b) for a related discussion.

13.2 Inflation and deflation of the variance under heteroskedasticity


Relax the condition in Theorem 13.1 as var(εi ) = σi2 with possibly different variances. Give
a counterexample in which the variance decreases with more covariates included in the
regression.

13.3 Equivalence of F and R̄2


Prove Theorem 13.2.

13.4 Using press to construct an unbiased estimator for σ 2


Prove that
$$\hat\sigma^2_{\text{press}} = \frac{\text{press}}{\sum_{i=1}^n (1 - h_{ii})^{-1}}$$
is unbiased for σ 2 under the Gauss–Markov model in Assumption 4.1, recalling press in (13.6) and the leverage score hii of unit i.
Remark: Theorem 4.3 shows that σ̂ 2 = rss/(n − p) is unbiased for σ 2 under the Gauss–Markov model. rss is the “in-sample” residual sum of squares, whereas press is the “leave-one-out” residual sum of squares. The estimator σ̂ 2 is standard, whereas $\hat\sigma^2_{\text{press}}$ appeared in Shen et al. (2023).

13.5 Simulation with misspecified linear models


Replicate the simulation in Example 13.1 with correlated covariates and an outcome model
with quadratic terms of covariates.

13.6 Best subset selection in lalonde data


Produce the figure similar to the ones in Figure 13.4 based on the lalonde data in the Matching
package. Report the selected model based on aic, bic, press, and gcv.

13.7 Perfect polynomial


Prove that given distinct xi (i = 1, . . . , n) within [0, 1] and any yi (i = 1, . . . , n), we can
always find an nth order polynomial
$$p_n(x) = \sum_{j=0}^{n-1} b_j x^j$$
such that
$$p_n(x_i) = y_i, \quad (i = 1, \ldots, n).$$
Hint: Use the formula in (A.4).

[Figure 13.1 about here.]

FIGURE 13.1: Freedman’s simulation. The first row shows the histograms of the R2 s, and the second row shows the histograms of the p-values in testing that all coefficients are 0. The first column corresponds to the full model without testing, and the second column corresponds to the selected model with testing.

[Figure 13.2 about here: RSS/n against the number of covariates in the training and testing data.]

FIGURE 13.2: Training and testing errors: linear mean function

[Figure 13.3 about here: RSS/n against the number of covariates in the training and testing data.]

FIGURE 13.3: Training and testing errors: nonlinear mean function



[Figure 13.4 about here: RSS, AIC, and BIC against the number of predictors, for the Penn bonus experimental data and the Boston housing data.]

FIGURE 13.4: Best subset selection


14
Ridge Regression

14.1 Introduction to ridge regression


The OLS estimator has many nice properties. For example, Chapter 4 shows that it is BLUE
under the Gauss–Markov model, and Chapter 5 shows that it follows a Normal distribution
that allows for finite-sample exact inference under the Normal linear model. However, it
can have the following problems that motivate ridge regression in this chapter.
The first motivation is quite straightforward. From the formula

β̂ = (X t X)−1 X t Y,

if the columns of X are highly correlated, then X t X will be nearly singular; more extremely,
if the number of covariates is larger than the sample size, then X t X has a rank smaller
than or equal to n and thus is not invertible. So numerically, the OLS estimator can be
unstable due to inverting X t X. Because X t X must be positive semi-definite, its smallest
eigenvalue determines whether it is invertible or not. Hoerl and Kennard (1970) proposed
the following ridge estimator as a modification of OLS:

β̂ ridge (λ) = (X t X + λIp )−1 X t Y, (λ > 0) (14.1)

which involves a positive tuning parameter λ. Because the smallest eigenvalue of X t X +λIp
is larger than or equal to λ > 0, the ridge estimator is always well defined.
Now I turn to the second equivalent motivation. The OLS estimator minimizes the
residual sum of squares
$$\text{rss}(b_0, b_1, \ldots, b_p) = \sum_{i=1}^n (y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip})^2.$$

From Theorem 13.1 on the variance inflation factor, the variances of the OLS estimators
increase with additional covariates included in the regression, leading to unnecessarily large
estimators by chance. To avoid large OLS coefficients, we can penalize the residual sum of
squares criterion with the squared length of the coefficients¹ and use
$$\hat\beta^{\text{ridge}}(\lambda) = \arg\min_{b_0, b_1, \ldots, b_p} \left\{\text{rss}(b_0, b_1, \ldots, b_p) + \lambda \sum_{j=1}^p b_j^2\right\}. \qquad (14.2)$$

Again in (14.2), λ is a tuning parameter that ranges from zero to infinity. We first discuss
the ridge estimator with a fixed λ and then discuss how to choose it. When λ = 0, it
reduces to OLS; when λ = ∞, all coefficients must be zero except that β̂0ridge (∞) = ȳ.
1 This is also called the Tikhonov regularization (Tikhonov, 1943). See Bickel and Li (2006) for a review
of the idea of regularization in statistics.


With λ ∈ (0, ∞), the ridge coefficients are generally smaller than the OLS coefficients,
and the penalty shrinks the OLS coefficients toward zero. So the parameter λ controls the
magnitudes of the coefficients or the “complexity” of the model. In (14.2), we only penalize
the slope parameters not the intercept.
As a dual problem in optimization, we can also define the ridge estimator as
$$\hat\beta^{\text{ridge}}(t) = \arg\min_{b_0, b_1, \ldots, b_p} \text{rss}(b_0, b_1, \ldots, b_p) \quad \text{s.t.} \ \sum_{j=1}^p b_j^2 \leq t. \qquad (14.3)$$

Definitions (14.2) and (14.3) are equivalent because for a given λ, we can always find a t
such that the solutions from (14.2) and (14.3) are identical. In fact, the corresponding t and
λ satisfy t = ∥β̂ ridge (λ)∥2 .
However, the ridge estimator has an obvious problem: it is not invariant to linear transformations of X. In particular, it is not equivalent under different scalings of the covariates. Intuitively, the $b_j$’s depend on the scale of the $X_j$’s, but the penalty term $\sum_{j=1}^p b_j^2$ puts equal weight on each coefficient. A convention in practice is to standardize the covariates before applying the ridge estimator².
Condition 14.1 (standardization) The covariates satisfy
$$n^{-1}\sum_{i=1}^n x_{ij} = 0, \quad n^{-1}\sum_{i=1}^n x_{ij}^2 = 1, \quad (j = 1, \ldots, p)$$
and the outcome satisfies $\bar{y} = 0$.


With all covariates centered at zero, the ridge estimator for the intercept, given any
values of the slopes and the tuning parameter λ, equals β̂0ridge = ȳ. So if we center the
outcomes at mean zero, then we can drop the intercept in the ridge estimators defined in
(14.2) and (14.3).
For descriptive simplicity, we will assume Condition 14.1 and call it standardization from
now on. This allows us to drop the intercept. Using the matrix form, the ridge estimator
minimizes
(Y − Xb)t (Y − Xb) + λbt b,
which is a quadratic function of b. From the first-order condition, we have
$$-2X^t\left\{Y - X\hat\beta^{\text{ridge}}(\lambda)\right\} + 2\lambda\hat\beta^{\text{ridge}}(\lambda) = 0 \implies \hat\beta^{\text{ridge}}(\lambda) = (X^tX + \lambda I_p)^{-1}X^tY,$$


which coincides with the definition in (14.1). We also have the second-order condition
2X t X + 2λIp ≻ 0, (λ > 0)
which verifies that the ridge estimator is indeed the minimizer. The predicted vector is
Ŷ ridge (λ) = X β̂ ridge (λ) = X(X t X + λIp )−1 X t Y = H(λ)Y,
where H(λ) = X(X t X + λIp )−1 X t is the hat matrix for ridge regression. When λ = 0, it
reduces to the hat matrix for the OLS; when λ > 0, it is not a projection matrix because
{H(λ)}2 ̸= H(λ).
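As a small illustration, the following sketch computes the ridge coefficient directly from the closed form (14.1), assuming the covariates and outcome are already centered (Condition 14.1) so that the intercept can be dropped; the function name ridge.coef is for illustration only.

## ridge coefficient from the closed form (14.1)
ridge.coef = function(X, Y, lambda)
{
  p = ncol(X)
  solve(t(X) %*% X + lambda*diag(p), t(X) %*% Y)   # (X'X + lambda I)^{-1} X'Y
}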
2 I choose this standardization because it is the default choice in the function lm.ridge in the R package

MASS. In practical data analysis, the covariates may have concrete meanings. In those cases, you may not
want to scale the covariates in the way as Condition 14.1. However, the discussion below does not rely on
the choice of scaling although it requires centering the covariates and outcome.

14.2 Ridge regression via the SVD of X


I will focus on the case with n ≥ p and relegate the discussion of the case with n ≤ p
to Problem 14.11. To facilitate the presentation, I will use the SVD decomposition of the
covariate matrix:
X = U DV t
where U ∈ Rn×p has orthonormal columns such that U t U = Ip , V ∈ Rp×p is an orthogonal
matrix with V V t = V t V = Ip , and D ∈ Rp×p is a diagonal matrix consisting of the singular
values. Figure 14.1 illustrates this important decomposition.

[Figure 14.1 about here: the decomposition X = U D V t , with U having orthonormal columns, D diagonal, and V orthogonal.]

FIGURE 14.1: SVD of X

The SVD of X implies the eigen-decomposition of X t X:


X t X = V D2 V t
with eigen-vectors Vj being the column vectors of V and eigen-values d2j being the squared
singular values. The following lemma is crucial for simplifying the theory and computation.
Lemma 14.1 The ridge coefficient equals
$$\hat\beta^{\text{ridge}}(\lambda) = V \operatorname{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) U^t Y,$$
where the diagonal matrix is p × p.
Proof of Lemma 14.1: The ridge coefficient equals
$$\hat\beta^{\text{ridge}}(\lambda) = (X^tX + \lambda I_p)^{-1} X^t Y = (VDU^tUDV^t + \lambda I_p)^{-1} VDU^tY = V(D^2 + \lambda I_p)^{-1} V^t VDU^tY = V(D^2 + \lambda I_p)^{-1} DU^tY = V \operatorname{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) U^t Y.$$


14.3 Statistical properties


The Gauss–Markov theorem shows that the OLS estimator is BLUE under the Gauss–
Markov model: Y = Xβ + ε, where ε has mean zero and covariance σ 2 In . Then in what
sense, can ridge regression improve OLS? I will discuss the statistical properties of the ridge
estimator under the Gauss–Markov model.
Based on Lemma 14.1, we can calculate the mean of the ridge estimator:
$$E\{\hat\beta^{\text{ridge}}(\lambda)\} = V \operatorname{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) U^t X\beta = V \operatorname{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) U^t UDV^t\beta = V \operatorname{diag}\left(\frac{d_j^2}{d_j^2 + \lambda}\right) V^t\beta,$$
which does not equal β in general. So the ridge estimator is biased. We can also calculate the covariance matrix of the ridge estimator:
$$\mathrm{cov}\{\hat\beta^{\text{ridge}}(\lambda)\} = \sigma^2 V \operatorname{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) U^t U \operatorname{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) V^t = \sigma^2 V \operatorname{diag}\left(\frac{d_j^2}{(d_j^2 + \lambda)^2}\right) V^t.$$

The mean squared error (MSE) is a measure capturing the bias-variance trade-off:
$$\text{mse}(\lambda) = E\left\{\hat\beta^{\text{ridge}}(\lambda) - \beta\right\}^t\left\{\hat\beta^{\text{ridge}}(\lambda) - \beta\right\}.$$
Using Theorem B.8 on the expectation of a quadratic form, we have
$$\text{mse}(\lambda) = \underbrace{[E\{\hat\beta^{\text{ridge}}(\lambda)\} - \beta]^t[E\{\hat\beta^{\text{ridge}}(\lambda)\} - \beta]}_{C_1} + \underbrace{\mathrm{trace}[\mathrm{cov}\{\hat\beta^{\text{ridge}}(\lambda)\}]}_{C_2}.$$

The following theorem simplifies C1 and C2 .

Theorem 14.1 Under Assumption 4.1, the ridge estimator satisfies
$$C_1 = \lambda^2 \sum_{j=1}^p \frac{\gamma_j^2}{(d_j^2 + \lambda)^2}, \qquad (14.4)$$
where $\gamma = V^t\beta = (\gamma_1, \ldots, \gamma_p)^t$ has the jth coordinate $\gamma_j = V_j^t\beta$, and
$$C_2 = \sigma^2 \sum_{j=1}^p \frac{d_j^2}{(d_j^2 + \lambda)^2}. \qquad (14.5)$$

Proof of Theorem 14.1: First, we have
$$C_1 = \beta^t\left\{V \operatorname{diag}\left(\frac{d_j^2}{d_j^2 + \lambda}\right) V^t - I_p\right\}^2\beta = \beta^t V \operatorname{diag}\left(\frac{\lambda^2}{(d_j^2 + \lambda)^2}\right) V^t\beta = \gamma^t \operatorname{diag}\left(\frac{\lambda^2}{(d_j^2 + \lambda)^2}\right)\gamma = \lambda^2 \sum_{j=1}^p \frac{\gamma_j^2}{(d_j^2 + \lambda)^2}.$$
Second, we have
$$C_2 = \sigma^2\,\mathrm{trace}\left(V \operatorname{diag}\left(\frac{d_j^2}{(d_j^2 + \lambda)^2}\right) V^t\right) = \sigma^2\,\mathrm{trace}\left(\operatorname{diag}\left(\frac{d_j^2}{(d_j^2 + \lambda)^2}\right)\right) = \sigma^2 \sum_{j=1}^p \frac{d_j^2}{(d_j^2 + \lambda)^2}.$$


Theorem 14.1 shows the bias-variance trade-off for the ridge estimator. The MSE is
$$\text{mse}(\lambda) = C_1 + C_2 = \lambda^2 \sum_{j=1}^p \frac{\gamma_j^2}{(d_j^2 + \lambda)^2} + \sigma^2 \sum_{j=1}^p \frac{d_j^2}{(d_j^2 + \lambda)^2}.$$
When λ = 0, the ridge estimator reduces to the OLS estimator: the bias is zero and the variance $\sigma^2\sum_{j=1}^p d_j^{-2}$ dominates. When λ = ∞, the ridge estimator reduces to zero: the bias $\sum_{j=1}^p \gamma_j^2$ dominates and the variance is zero. As we increase λ from zero, the bias increases and the variance decreases. So we face a bias-variance trade-off.

14.4 Selection of the tuning parameter


14.4.1 Based on parameter estimation
For parameter estimation, we want to choose the λ that minimizes the MSE. So the optimal
λ must satisfy the following first-order condition:
$$\frac{\partial\,\text{mse}(\lambda)}{\partial\lambda} = 2\sum_{j=1}^p \gamma_j^2 \frac{\lambda}{d_j^2 + \lambda}\cdot\frac{d_j^2 + \lambda - \lambda}{(d_j^2 + \lambda)^2} - 2\sigma^2 \sum_{j=1}^p \frac{d_j^2}{(d_j^2 + \lambda)^3} = 0,$$
which is equivalent to
$$\lambda \sum_{j=1}^p \frac{\gamma_j^2 d_j^2}{(d_j^2 + \lambda)^3} = \sigma^2 \sum_{j=1}^p \frac{d_j^2}{(d_j^2 + \lambda)^3}. \qquad (14.6)$$

However, (14.6) is not directly useful because we do not know γ and σ 2 . Three methods
below try to solve (14.6) approximately.
Dempster et al. (1977) used OLS to construct an unbiased estimator $\hat\sigma^2$ and $\hat\gamma = V^t\hat\beta$, and then solve λ from
$$\lambda \sum_{j=1}^p \frac{\hat\gamma_j^2 d_j^2}{(d_j^2 + \lambda)^3} = \hat\sigma^2 \sum_{j=1}^p \frac{d_j^2}{(d_j^2 + \lambda)^3},$$
which is a nonlinear equation of λ.
Hoerl et al. (1975) assumed that $X^tX = I_p$. Then $d_j^2 = 1$ (j = 1, . . . , p) and γ = β, and solve λ from
$$\lambda \sum_{j=1}^p \frac{\hat\beta_j^2}{(1 + \lambda)^3} = \hat\sigma^2 \sum_{j=1}^p \frac{1}{(1 + \lambda)^3},$$
resulting in
$$\lambda_{\text{hkb}} = p\hat\sigma^2/\|\hat\beta\|^2.$$
Lawless (1976) used
$$\lambda_{\text{lw}} = p\hat\sigma^2/\hat\beta^t D^2 \hat\beta$$
to weight the $\beta_j$’s based on the eigenvalues of $X^tX$.
But all these methods require estimating (β, σ 2 ). If the initial OLS estimator is not
reliable, then these estimates of λ are unlikely to be reliable. None of these methods work
for the case with p > n.

14.4.2 Based on prediction


For prediction, we need slightly different criteria. Without estimating (β, σ 2 ), we can use
the leave-one-out cross-validation. The leave-one-out formula for the ridge below is similar
to that for OLS.
Theorem 14.2 Define β̂(λ) = (X t X + λIp )−1 X t Y as the ridge coefficient (dropping the
superscript “ridge” for simplicity), ε̂(λ) = Y − X β̂(λ) as the residual vector using the
full data, and hii (λ) = xti (X t X + λIp )−1 xi as the (i, i)th diagonal element of H(λ) =
X(X t X + λIp )−1 X t . Define β̂[−i] (λ) as the ridge coefficient without observation i, and
ε̂[−i] (λ) = yi − xti β̂[−i] (λ) as the predicted residual. The leave-one-out formulas for ridge
regression are
β̂[−i] (λ) = β̂(λ) − {1 − hii (λ)}−1 (X t X + λIp )−1 xi ε̂i (λ)
and
ε̂[−i] (λ) = ε̂i (λ)/{1 − hii (λ)}.
I leave the proof of Theorem 14.2 as Problem 14.5.
By Theorem 14.2, the PRESS statistic for ridge is
$$\text{press}(\lambda) = \sum_{i=1}^n \left\{\hat\varepsilon_{[-i]}(\lambda)\right\}^2 = \sum_{i=1}^n \frac{\{\hat\varepsilon_i(\lambda)\}^2}{\{1 - h_{ii}(\lambda)\}^2}.$$

Golub et al. (1979) proposed the GCV criterion to simplify the calculation of the PRESS
statistic by replacing $h_{ii}(\lambda)$ with their average value $n^{-1}\mathrm{trace}\{H(\lambda)\}$:
$$\text{gcv}(\lambda) = \frac{\sum_{i=1}^n \{\hat\varepsilon_i(\lambda)\}^2}{\left[1 - n^{-1}\mathrm{trace}\{H(\lambda)\}\right]^2}.$$
In the R package MASS, the function lm.ridge implements the ridge regression, kHKB and
kLW report two estimators for λ, and GCV contains the GCV values for a sequence of λ.

14.5 Computation of ridge regression


Lemma 14.1 gives the ridge coefficients. So the predicted vector equals
$$\hat{Y}(\lambda) = X\hat\beta^{\text{ridge}}(\lambda) = UDV^tV \operatorname{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) U^tY = UD \operatorname{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) U^tY = U \operatorname{diag}\left(\frac{d_j^2}{d_j^2 + \lambda}\right) U^tY,$$
and the hat matrix equals
$$H(\lambda) = U \operatorname{diag}\left(\frac{d_j^2}{d_j^2 + \lambda}\right) U^t.$$

These formulas allow us to compute the ridge coefficient and predictor vector for many
values of λ without inverting each X t X + λIp . We have similar formulas for the case with
n < p; see Problem 14.11.
A subtle point is due to the standardization of the covariates and the outcome. In R, the lm.ridge function first computes the ridge coefficients based on the standardized covariates and outcome, and then transforms them back to the original scale. Let $\bar{x}_1, \ldots, \bar{x}_p, \bar{y}$ be the means of the covariates and outcome, and let $\text{sd}_j = \{n^{-1}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2\}^{1/2}$ be the standard deviations of the covariates, which are reported as scales in the output of lm.ridge. From the ridge coefficients $\{\hat\beta_1^{\text{ridge}}(\lambda), \ldots, \hat\beta_p^{\text{ridge}}(\lambda)\}$ based on the standardized variables, we can obtain the predicted values based on the original variables as
$$\hat{y}_i(\lambda) - \bar{y} = \hat\beta_1^{\text{ridge}}(\lambda)(x_{i1} - \bar{x}_1)/\text{sd}_1 + \cdots + \hat\beta_p^{\text{ridge}}(\lambda)(x_{ip} - \bar{x}_p)/\text{sd}_p$$
or, equivalently,
$$\hat{y}_i(\lambda) = \hat\alpha^{\text{ridge}}(\lambda) + \hat\beta_1^{\text{ridge}}(\lambda)/\text{sd}_1 \times x_{i1} + \cdots + \hat\beta_p^{\text{ridge}}(\lambda)/\text{sd}_p \times x_{ip}$$
where
$$\hat\alpha^{\text{ridge}}(\lambda) = \bar{y} - \hat\beta_1^{\text{ridge}}(\lambda)\bar{x}_1/\text{sd}_1 - \cdots - \hat\beta_p^{\text{ridge}}(\lambda)\bar{x}_p/\text{sd}_p.$$

14.6 Numerical examples


We can use the following numerical example to illustrate the bias-variance trade-off in
selecting λ in the ridge. The code is in code14.5.R.

14.6.1 Uncorrelated covariates


I first simulate data from a Normal linear model with uncorrelated covariates.

library(MASS)
n = 200
p = 100
beta = rep(1/sqrt(p), p)
sig = 1/2
X = matrix(rnorm(n*p), n, p)
X = scale(X)
X = X*sqrt(n/(n - 1))
Y = as.vector(X %*% beta + rnorm(n, 0, sig))

The following code calculates the theoretical bias, variance, and mean squared error,
reported in the (1, 1)th panel of Figure 14.2.
eigenxx = eigen(t(X) %*% X)
xis     = eigenxx$values
gammas  = t(eigenxx$vectors) %*% beta

lambda.seq = seq(0, 70, 0.01)
bias2.seq  = lambda.seq
var.seq    = lambda.seq
mse.seq    = lambda.seq
for(i in 1:length(lambda.seq))
{
  ll           = lambda.seq[i]
  bias2.seq[i] = ll^2*sum(gammas^2/(xis + ll)^2)
  var.seq[i]   = sig^2*sum(xis/(xis + ll)^2)
  mse.seq[i]   = bias2.seq[i] + var.seq[i]
}

y.min = min(bias2.seq, var.seq, mse.seq)
y.max = max(bias2.seq, var.seq, mse.seq)
par(mfrow = c(2, 2))
plot(bias2.seq ~ lambda.seq, type = "l",
     ylim = c(y.min, y.max),
     xlab = expression(lambda), main = "",
     ylab = "bias-variance tradeoff",
     lty = 2, bty = "n")
lines(var.seq ~ lambda.seq, lty = 3)
lines(mse.seq ~ lambda.seq, lwd = 3, lty = 1)
abline(v = lambda.seq[which.min(mse.seq)],
       lty = 1, col = "grey")
legend("topright", c("bias", "variance", "mse"),
       lty = c(2, 3, 1), lwd = c(1, 1, 4), bty = "n")

The (1, 1)th panel also reported the λ’s based on different approaches.
ridge.fit = lm.ridge(Y ~ X, lambda = lambda.seq)
abline(v = lambda.seq[which.min(ridge.fit$GCV)],
       lty = 2, col = "grey")
abline(v = ridge.fit$kHKB, lty = 3, col = "grey")
abline(v = ridge.fit$kLW, lty = 4, col = "grey")
legend("bottomright",
       c("MSE", "GCV", "HKB", "LW"),
       lty = 1:4, col = "grey", bty = "n")

I also calculate the prediction error of the ridge estimator in the testing dataset, which
follows the same data-generating process as the training dataset. The (1, 2)th panel of
Figure 14.2 shows its relationship with λ. Overall, GCV, HKB, and LW are similar, but the
λ selected by the MSE criterion is the worst for prediction.
X.new = matrix(rnorm(n*p), n, p)
X.new = scale(X.new)
X.new = X.new*matrix(sqrt(n/(n - 1)), n, p)
Y.new = as.vector(X.new %*% beta + rnorm(n, 0, sig))
predict.error = Y.new - X.new %*% ridge.fit$coef
predict.mse   = apply(predict.error^2, 2, mean)

plot(predict.mse ~ lambda.seq, type = "l",
     xlab = expression(lambda),
     ylab = "predicted MSE", bty = "n")
abline(v = lambda.seq[which.min(mse.seq)],
       lty = 1, col = "grey")
abline(v = lambda.seq[which.min(ridge.fit$GCV)],
       lty = 2, col = "grey")
abline(v = ridge.fit$kHKB, lty = 3, col = "grey")
abline(v = ridge.fit$kLW, lty = 4, col = "grey")
legend("bottomright",
       c("MSE", "GCV", "HKB", "LW"),
       lty = 1:4, col = "grey", bty = "n")

mtext("independent covariates", side = 1,
      line = -58, outer = TRUE, font.main = 1, cex = 1.5)

14.6.2 Correlated covariates


I then simulate data from a Normal linear model with correlated covariates.
n = 200
p = 100
beta = rep(1/sqrt(p), p)
sig = 1/2
## correlated Normals
X = matrix(rnorm(n*p), n, p) + rnorm(n, 0, 0.5)
## standardize the covariates
X = scale(X)
X = X*matrix(sqrt(n/(n - 1)), n, p)
Y = as.vector(X %*% beta + rnorm(n, 0, sig))

The second row of Figure 14.2 shows the bias-variance trade-off. Overall, GCV works
the best for selecting λ for prediction.

14.7 Further comments on OLS, ridge, and PCA


The SVD of X is closely related to the principal component analysis (PCA), and so are OLS
and ridge regression. Assume that the columns of X are centered, so X t X = V D2 V t is
proportional to the sample covariance matrix of X. Assume d1 ≥ d2 ≥ · · · . PCA tries to
find linear combinations of the covariate xi that contain maximal information. For a vector
v ∈ Rp , the linear combination v t xi has sample variance proportional to
Q(v) = v t X t Xv.
If we multiply v by a constant c, the above sample variance will change by the factor c2 . So
a meaningful criterion is to maximize Q(v) such that ∥v∥ = 1. This is exactly the setting
of Theorem A.3. The maximum value equals d21 which is achieved by V1 , the first column
of V . We call
$$XV_1 = \begin{pmatrix} x_1^t V_1 \\ \vdots \\ x_n^t V_1 \end{pmatrix}$$
the first principal component of X. Similar to Theorem A.3, we can further maximize Q(v)
such that ∥v∥ = 1 and v ⊥ V1 , yielding the maximum value d22 which is achieved by V2 .

[Figure 14.2 about here: bias, variance, and MSE of the ridge estimator against λ, and the predicted MSE on a testing dataset, with the λ’s selected by the MSE, GCV, HKB, and LW criteria; the top row uses independent covariates and the bottom row uses correlated covariates.]

FIGURE 14.2: Bias-variance trade-off in ridge regression



We call XV2 the second principal component of X. By induction, we can define all the p
principal components, stacked in the following n × p matrix:

(XV1 , . . . , XVp ) = XV = U DV t V = U D.

So U D in the SVD decomposition contains the principal components of X. Since D is
a diagonal matrix that only changes the scales of the columns of U , we also call U =
(U1 , . . . , Up ) the principal components of X. They are orthogonal since U t U = Ip .
Section 14.5 shows that the ridge estimator yields the predicted value
$$\hat{Y}(\lambda) = U \operatorname{diag}\left(\frac{d_j^2}{d_j^2 + \lambda}\right) U^tY = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}\langle U_j, Y\rangle U_j$$
where $\langle U_j, Y\rangle = U_j^tY$ denotes the inner product of vectors $U_j$ and $Y$. As a special case with λ = 0, the OLS estimator yields the predicted value
$$\hat{Y} = UU^tY = \sum_{j=1}^p \langle U_j, Y\rangle U_j,$$

which is identical to the predicted value based on OLS of Y on the principal components
U . Moreover, the principal components in U are orthogonal and have unit length, so the
OLS fit of Y on U is equivalent to the component-wise OLS of Y on Uj with coefficient
⟨Uj , Y ⟩ (j = 1, . . . , p). So the predicted value based on OLS equals a linear combination of
the principal components with coefficients ⟨Uj , Y ⟩; the predicted value based on ridge also
equals a linear combination of the principal components but the coefficients are shrunk by
the factors d2j /(d2j + λ).
When the columns of X are not linearly independent, for example, p > n, we cannot
run OLS of Y on X or OLS of Y on U , but we can still run ridge. Motivated by the
formulas above, another approach is to run OLS of Y on the first p∗ principal components
Ũ = (U1 , . . . , Up∗ ) with p∗ < p. This is called the principal component regression (PCR).
The predicted value is

$$\hat{Y}(p^*) = \tilde{U}(\tilde{U}^t\tilde{U})^{-1}\tilde{U}^tY = \sum_{j=1}^{p^*} \langle U_j, Y\rangle U_j,$$

which truncates the summation in the formula of Ŷ based on OLS. Compared to the pre-
dicted values of OLS and ridge, Ŷ (p∗ ) effectively imposes zero weights on the principal
components corresponding to small singular values. It depends on a tuning parameter p∗
similar to λ in the ridge. Since p∗ must be a positive integer and λ can be any positive real
value, PCR is a discrete procedure while ridge is a continuous procedure.
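As a small illustration, the following sketch computes the PCR predicted value from the SVD of X, assuming X and Y are centered; the function name pcr.predict and the argument p.star are for illustration only.

## principal component regression via the SVD of X
pcr.predict = function(X, Y, p.star)
{
  s = svd(X)
  U = s$u[, 1:p.star, drop = FALSE]
  U %*% (t(U) %*% Y)          # OLS of Y on the first p.star principal components
}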

14.8 Homework problems


14.1 Ridge coefficient as a posterior mode under a Normal prior
Assume fixed X, σ 2 and τ 2 . Show that if

Y | β ∼ N(Xβ, σ 2 In ), β ∼ N(0, τ 2 Ip ),

then the mode of the posterior distribution of β | Y equals β̂ ridge (σ 2 /τ 2 ):

$$\hat\beta^{\text{ridge}}(\sigma^2/\tau^2) = \arg\max_{\beta} f(\beta \mid Y)$$

where f (β | Y ) is the posterior density of β given Y .

14.2 Derivative of the MSE


Show that
$$\left.\frac{\partial\,\text{mse}(\lambda)}{\partial\lambda}\right|_{\lambda=0} < 0.$$
Remark: This result ensures that the ridge estimator must have a smaller MSE than
OLS in a neighborhood of λ = 0, which is coherent with the pattern in Figure 14.2.

14.3 Ridge and OLS


Show that if X has linearly independent columns, then

$$\hat\beta^{\text{ridge}}(\lambda) = (X^tX + \lambda I_p)^{-1}X^tX\hat\beta = V \operatorname{diag}\left(\frac{d_j^2}{d_j^2 + \lambda}\right)V^t\hat\beta$$

where β̂ is the OLS coefficient.

14.4 Ridge as OLS with augmented data


Show that β̂ ridge (λ) equals the OLS coefficient of Ỹ on X̃ with augmented data
$$\tilde{Y} = \begin{pmatrix} Y \\ 0_p \end{pmatrix}, \qquad \tilde{X} = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_p \end{pmatrix},$$

where Ỹ is an n + p dimensional vector and X̃ is an (n + p) × p matrix.


Remark: The columns of X̃ must be linearly independent, so the inverse of X̃ t X̃ al-
ways exists. This is a theoretical result of the ridge regression. It should not be used for
computation especially when p is large.

14.5 Leave-one-out formulas for ridge


Prove Theorem 14.2.
Hint: You can use the result in Problem 14.4 and apply the leave-one-out formulas for
OLS in Theorems 11.2 and 11.3.

14.6 Generalized ridge regression


Covariates have different importance, so it is reasonable to use different weights in the
penalty term. Find the explicit formula for the ridge regression with general quadratic
penalty:

arg min_{b∈R^p} { (Y − Xb)^t(Y − Xb) + λ b^t Q b },

where Q is a p × p positive definite matrix.

14.7 Degrees of freedom of ridge regression


For a predictor Ŷ of Y, define the degrees of freedom of the predictor as Σ_{i=1}^n cov(y_i, ŷ_i)/σ².
Calculate the degrees of freedom of ridge regression in terms of the eigenvalues of X t X.

14.8 Extending the simulation in Figure 14.2


Re-run the simulation that generates Figure 14.2, and report the λ selected by Dempster
et al. (1977)’s method, PRESS, and K-fold CV. Extend the simulation to the case with
p > n.

14.9 Unification of OLS, ridge, and PCR


We can unify the predicted values of the OLS, ridge, and PCR as
Ŷ = Σ_{j=1}^p s_j ⟨U_j, Y⟩ U_j,

where

s_j = 1 for OLS,
s_j = d_j²/(d_j² + λ) for ridge,
s_j = 1(j ≤ p*) for PCR.

Based on the unified formula, show that under Assumption 4.1, we have
E(Ŷ) = Σ_{j=1}^p s_j d_j γ_j U_j

with the γ_j's defined in Theorem 14.1, and

cov(Ŷ) = σ² Σ_{j=1}^p s_j² U_j U_j^t.

14.10 An equivalent form of ridge coefficient


Show that the ridge coefficient has two equivalent forms: for λ > 0,

(X t X + λIp )−1 X t Y = X t (XX t + λIn )−1 Y.

Remark: This formula has several interesting implications. First, the left-hand side in-
volves inverting a p × p matrix, and it is more useful when p < n; the right-hand side
involves inverting an n × n matrix, so it is more useful when p > n. Second, from the form
on the right-hand side, we can see that the ridge estimator lies in C(X t ), the row space of
X. That is, the ridge estimator can be written as X^t δ, where δ = (XX^t + λI_n)^{−1} Y ∈ R^n.

This always holds but is particularly interesting in the case with p > n when the row space
of X is not the entire Rp . Third, if p > n and XX t is invertible, then we can let λ go to
zero on the right-hand side, yielding

β̂ ridge (0) = X t (XX t )−1 Y

which is the minimum norm estimator; see Problem 18.7. Using the definition of the pseu-
doinverse in Chapter A, we can further show that

β̂ ridge (0) = X + Y.

14.11 Computation of ridge with n < p


When n < p, X has singular value decomposition X = U DV t , where D ∈ Rn×n is a
diagonal matrix containing the singular values, U ∈ Rn×n is an orthogonal matrix with
U U t = U t U = In , and V ∈ Rp×n has orthonormal columns with V t V = In . Show that the
ridge coefficient, the predicted value, and the hat matrix have the same form as the case
with n > p. The only subtle difference is that the diagonal matrices have dimension n × n.
Remark: The above result also ensures that Theorem 14.1 holds when p > n if we modify
the summation as from j = 1 to n.

14.12 Recommended reading


To celebrate the 50th anniversary of Hoerl and Kennard (1970)’s paper in Technometrics,
the editor invited Roger W. Hoerl, the son of Art Hoerl, to review the historical aspects
of the original paper, and Trevor Hastie to review the essential role of the idea of ridge
regression in data science. See Hoerl (2020) and Hastie (2020).
15
Lasso

15.1 Introduction to the lasso


Ridge regression works well for prediction, but it may be difficult to interpret many small
but non-zero coefficients. Tibshirani (1996) proposed to use the lasso, the acronym for the
Least Absolute Shrinkage and Selection Operator, to achieve the ambitious goal of simul-
taneously estimating parameters and selecting important variables in the linear regression.
By changing the penalty term in the ridge regression, the lasso automatically estimates
some parameters as zero, dropping them out of the model and thus selecting the remaining
variables as important predictors.
Tibshirani (1996) defined the lasso as

β̂ lasso(t) = arg min_{b_0,b_1,...,b_p} rss(b_0, b_1, . . . , b_p)   s.t.   Σ_{j=1}^p |b_j| ≤ t.   (15.1)

Osborne et al. (2000) studied its dual form

β̂ lasso(λ) = arg min_{b_0,b_1,...,b_p} { rss(b_0, b_1, . . . , b_p) + λ Σ_{j=1}^p |b_j| }.   (15.2)

The two forms of the lasso are equivalent in the sense that for a given λ in (15.2), there exists
a t such that the solution for (15.1) is identical to the solution for (15.2). In particular,
t = Σ_{j=1}^p |β̂_j^lasso(λ)|. Technically, the minimizer of the lasso problem may not be unique,
especially when p > n, so the right-hand sides of the optimization problems should be a
set. Fortunately, even though the minimizer may not be unique, the resulting predictor is
always unique. Tibshirani (2013) clarifies this issue.
Both forms of the lasso are useful. We will use the form (15.2) for computation and
use the form (15.1) for geometric intuition. Similar to the ridge estimator, the lasso is not
invariant to the linear transformation of X. We proceed after standardizing the covariates
and outcome as Condition 14.1. For the same reason as the ridge, we can drop the intercept
after standardization.

15.2 Comparing the lasso and ridge: a geometric perspective


The ridge and lasso are very similar: both minimize a penalized version of the residual sum
of squares. They differ in the penalty term: ridge uses an L2 penalty, i.e., the L2 norm of the


FIGURE 15.1: Lasso with a sparse solution (contour plot of the residual sum of squares with the L1 constraint region)

coefficient ∥b∥² = Σ_{j=1}^p b_j², and lasso uses an L1 penalty, i.e., the L1 norm of the coefficient
∥b∥_1 = Σ_{j=1}^p |b_j|. Compared to the ridge, the lasso can give sparse solutions due to the
non-smooth penalty term. That is, estimators of some coefficients are exactly zero.
Focus on the form (15.1). We can gain insights from the contour plot of the residual sum
of squares as a function of b. With a well-defined OLS estimator β̂, Theorem 3.2 ensures

(Y − Xb)t (Y − Xb) = (Y − X β̂)t (Y − X β̂) + (b − β̂)t X t X(b − β̂),

which equals a constant term plus a quadratic function centered at the OLS coefficient.
Without any penalty, the minimizer is of course the OLS coefficient. With the L1 penalty,
the OLS coefficient may not be in the region defined by Σ_{j=1}^p |b_j| ≤ t. If this happens, the
intersection of the contour plot of (Y − Xb)^t(Y − Xb) and the border of the restriction region
Σ_{j=1}^p |b_j| ≤ t can be at some axis. For example, Figure 15.1 shows a case with p = 2, and
the lasso estimator hits the x-axis, resulting in a zero coefficient for the second coordinate.
However, this does not mean that lasso always generates sparse solutions because sometimes
the intersection of the contour plot of (Y − Xb)t (Y − Xb) and the border of the restriction
region is at an edge of the region. For example, Figure 15.2 shows a case with a non-sparse
lasso solution.
In contrast, the restriction region of the ridge is a circle, so the ridge solution does not
hit any axis unless the original OLS coefficient is zero. Figure 15.3 shows the general ridge
estimator.

FIGURE 15.2: Lasso with a non-sparse solution (contour plot)

FIGURE 15.3: Ridge (contour plot)



15.3 Computing the lasso via coordinate descent


Many efficient algorithms can solve the lasso problem. The glmnet package in R uses the
coordinate descent algorithm based on the form (15.2) (Friedman et al., 2007, 2010). I will
first review a lemma which is the stepping stone for the algorithm.

15.3.1 The soft-thresholding lemma


Let sign(x) denote the sign of a real number x, which equals 1, 0, −1 if x > 0, x = 0, x < 0,
respectively. Let (x)+ = max(x, 0) denote the positive part of a real number x.

Lemma 15.1 Given b_0 and λ ≥ 0, we have

arg min_{b∈R} { (1/2)(b − b_0)² + λ|b| } = sign(b_0)(|b_0| − λ)_+,

which equals b_0 − λ if b_0 ≥ λ, equals 0 if −λ ≤ b_0 ≤ λ, and equals b_0 + λ if b_0 ≤ −λ.

The solution in Lemma 15.1 is a function of b0 and λ, and we will use the notation

S(b0 , λ) = sign(b0 ) (|b0 | − λ)+

from now on, where S denotes the soft-thresholding operator. For a given λ > 0, it is a
function of b0 illustrated by Figure 15.4. The proof of Lemma 15.1 is to solve the optimization
problem. It is tricky since we cannot naively solve the first-order condition due to the non-
smoothness of |b| at 0. Nevertheless, it is only a one-dimensional optimization problem, and
I relegate the proof as Problem 15.2.
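As a quick illustration (my own code, not part of the book's scripts), the operator is a one-line R function, and a numerical minimization confirms Lemma 15.1 at a few values of b_0.

## soft-thresholding operator S(b0, lambda) = sign(b0)(|b0| - lambda)_+
soft = function(b0, lambda) sign(b0)*pmax(abs(b0) - lambda, 0)
## numerical check of Lemma 15.1
lambda = 2
for (b0 in c(-4, -1, 0.5, 3)) {
  obj = function(b) 0.5*(b - b0)^2 + lambda*abs(b)
  bmin = optimize(obj, interval = c(-10, 10))$minimum
  cat("b0 =", b0, " closed form:", soft(b0, lambda),
      " numerical:", round(bmin, 4), "\n")
}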

15.3.2 Coordinate descent for the lasso


For a given λ > 0, we can use the following algorithm:
1. Standardize the data as Condition 14.1. So we need to solve a lasso problem without
the intercept. For simplicity of derivation, we change the scale of the residual sum of
squares without essentially changing the problem:
min_{b_1,...,b_p} (1/(2n)) Σ_{i=1}^n (y_i − b_1 x_i1 − · · · − b_p x_ip)² + λ Σ_{j=1}^p |b_j|.

Initialize β̂.
2. Update β̂_j given all other coefficients. Define the partial residual as r_ij = y_i − Σ_{k≠j} β̂_k x_ik. Updating β̂_j is equivalent to minimizing

(1/(2n)) Σ_{i=1}^n (r_ij − b_j x_ij)² + λ|b_j|.

Define

β̂_{j,0} = Σ_{i=1}^n x_ij r_ij / Σ_{i=1}^n x_ij² = n^{−1} Σ_{i=1}^n x_ij r_ij

as the OLS coefficient of the r_ij's on the x_ij's, so

(1/(2n)) Σ_{i=1}^n (r_ij − b_j x_ij)² = (1/(2n)) Σ_{i=1}^n (r_ij − β̂_{j,0} x_ij)² + (1/(2n)) Σ_{i=1}^n x_ij² (b_j − β̂_{j,0})²
                                     = constant + (1/2)(b_j − β̂_{j,0})².

Then updating β̂_j is equivalent to minimizing (1/2)(b_j − β̂_{j,0})² + λ|b_j|. Lemma 15.1 implies

β̂_j = S(β̂_{j,0}, λ).

FIGURE 15.4: Soft-thresholding: S(b_0, λ) as a function of b_0 (the plot uses λ = 2)

3. Iterate until convergence.


Does the algorithm always converge? The theory of Tseng (2001) ensures it converges,
but this is beyond the scope of this book. We can start with a large λ and all zero coefficients.
We then gradually decrease λ, and for each λ, we apply the above algorithm. We finally
select λ via K-fold cross-validation. Since we gradually decrease λ, the initial values from
the last step are very close to the minimizer and the algorithm converges fairly fast.
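The following is a minimal R sketch of this coordinate descent loop for a fixed λ, written only to mirror the steps above; it assumes y is centered and each column of x satisfies Σ_i x_ij² = n, uses a fixed number of sweeps instead of a convergence check, and is not a substitute for glmnet.

## minimal coordinate descent for the lasso at a fixed lambda
lasso.cd = function(x, y, lambda, n.sweep = 200) {
  soft = function(b0, lam) sign(b0)*pmax(abs(b0) - lam, 0)
  n = nrow(x)
  p = ncol(x)
  beta = rep(0, p)
  for (sweep in 1:n.sweep) {
    for (j in 1:p) {
      r.j = y - x[, -j, drop = FALSE] %*% beta[-j]   ## partial residual
      beta.j0 = sum(x[, j]*r.j)/n                    ## OLS of the r_ij's on the x_ij's
      beta[j] = soft(beta.j0, lambda)                ## soft-threshold update
    }
  }
  beta
}

In practice one would also decrease λ along a grid and warm-start each fit from the previous solution, as described above.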

15.4 Example: comparing OLS, ridge and lasso


In the Boston housing data, the OLS, ridge, and lasso have similar performance in out-of-
sample prediction. Lasso and ridge have similar coefficients. See Figure 15.5(a).
> library ( " mlbench " )
160 Linear Model and Extensions

> library ( " glmnet " )


> library ( " MASS " )
> data ( BostonHousing )
>
> # # training and testing data
> set . seed ( 2 3 0 )
> nsample = dim ( BostonHousing )[ 1 ]
> trainindex = sample ( 1 : nsample , floor ( nsample * 0 . 9 ))
>
> xmatrix = model . matrix ( medv ~ . , data = BostonHousing )[ , -1 ]
> yvector = BostonHousing $ medv
> dat = data . frame ( yvector , xmatrix )
>
> # # linear regression
> bostonlm = lm ( yvector ~ . , data = dat [ trainindex , ])
> predicterror = dat $ yvector [ - trainindex ] -
+ predict ( bostonlm , dat [ - trainindex , ])
> mse . ols = sum ( predicterror ^ 2 )/ length ( predicterror )
>
> # # ridge regression
> lambdas = seq ( 0 , 5 , 0 . 0 1 )
> lm 0 = lm . ridge ( yvector ~ . , data = dat [ trainindex , ] ,
+ lambda = lambdas )
> coefridge = coef ( lm 0 )[ which . min ( lm 0 $ GCV ) , ]
> p r e d i c t e r r o r r i d g e = dat $ yvector [ - trainindex ] -
+ cbind ( 1 , xmatrix [ - trainindex , ])%*% coefridge
> mse . ridge = sum ( p r e d i c t e r r o r r i d g e ^ 2 )/ length ( p r e d i c t e r r o r r i d g e )
>
> # # lasso
> cvboston = cv . glmnet ( x = xmatrix [ trainindex , ] , y = yvector [ trainindex ])
> coeflasso = coef ( cvboston , s = " lambda . min " )
> p r e d i c t e r r o r l a s s o = dat $ yvector [ - trainindex ] -
+ cbind ( 1 , xmatrix [ - trainindex , ])%*% coeflasso
> mse . lasso = sum ( p r e d i c t e r r o r l a s s o ^ 2 )/ length ( p r e d i c t e r r o r l a s s o )
>
> c ( mse . ols , mse . ridge , mse . lasso )
[1] 29.37365 29.07174 28.88161

But if we artificially add 200 columns of covariates of pure noise N(0, 1), then the ridge
and lasso perform much better. Lasso can automatically shrink many coefficients to zero.
See Figure 15.5(b).
> ## adding more noisy covariates
> n.noise = 200
> xnoise = matrix(rnorm(nsample*n.noise), nsample, n.noise)
> xmatrix = cbind(xmatrix, xnoise)
> dat = data.frame(yvector, xmatrix)
>
> ## linear regression
> bostonlm = lm(yvector ~ ., data = dat[trainindex, ])
> predicterror = dat$yvector[-trainindex] -
+   predict(bostonlm, dat[-trainindex, ])
> mse.ols = sum(predicterror^2)/length(predicterror)
>
> ## ridge regression
> lambdas = seq(100, 150, 0.01)
> lm0 = lm.ridge(yvector ~ ., data = dat[trainindex, ],
+                lambda = lambdas)
> coefridge = coef(lm0)[which.min(lm0$GCV), ]
> predicterrorridge = dat$yvector[-trainindex] -
+   cbind(1, xmatrix[-trainindex, ]) %*% coefridge
> mse.ridge = sum(predicterrorridge^2)/length(predicterrorridge)
>
FIGURE 15.5: Comparing ridge and lasso: (a) original data; (b) original data + noisy covariates. The panels plot the fitted coefficients against the indices of the covariates.

> ## lasso
> cvboston = cv.glmnet(x = xmatrix[trainindex, ], y = yvector[trainindex])
> coeflasso = coef(cvboston, s = "lambda.min")
>
> predicterrorlasso = dat$yvector[-trainindex] -
+   cbind(1, xmatrix[-trainindex, ]) %*% coeflasso
> mse.lasso = sum(predicterrorlasso^2)/length(predicterrorlasso)
>
> c(mse.ols, mse.ridge, mse.lasso)
[1] 41.80376 33.33372 32.64287
FIGURE 15.6: Shrinkage estimators: the constraint regions for (a) 0 < q < 1, (b) q = 1, and (c) q = 2

15.5 Other shrinkage estimators


A general class of shrinkage estimators is the bridge estimator (Frank and Friedman, 1993):

β̂(λ) = arg min_{b_0,b_1,...,b_p} { rss(b_0, b_1, . . . , b_p) + λ Σ_{j=1}^p |b_j|^q }

or, by duality,

β̂(t) = arg min_{b_0,b_1,...,b_p} rss(b_0, b_1, . . . , b_p)   s.t.   Σ_{j=1}^p |b_j|^q ≤ t.

Figure 15.6 shows the constraints corresponding to different values of q.


Zou and Hastie (2005) proposed the elastic net, which combines the penalties of the lasso and ridge:

β̂ enet(λ, α) = arg min_{b_0,b_1,...,b_p} { rss(b_0, b_1, . . . , b_p) + λ Σ_{j=1}^p ( α b_j² + (1 − α)|b_j| ) }.

Figure 15.7 compares the constraints corresponding to the ridge, lasso, and elastic net.
Because the constraint of the elastic net is not smooth, it encourages sparse solutions in the
same way as the lasso. Due to the ridge penalty, the elastic net can deal with the collinearity
of the covariates better than the lasso.
Friedman et al. (2007) proposed to use the coordinate descent algorithm to solve for
the elastic net estimator, and Friedman et al. (2009) implemented it in an R package called
glmnet.
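In practice the elastic net can be fitted with the same glmnet functions used in Section 15.4; the sketch below reuses the xmatrix, yvector, and trainindex objects from that example. Note that glmnet parametrizes the penalty differently from the display above: its penalty is λ{(1 − α)/2 ∥b∥² + α∥b∥_1}, so its alpha multiplies the L1 part and should not be confused with the α above.

## elastic net via glmnet (alpha = 0.5 gives an equal mix of the L1 and L2
## penalties in glmnet's own parametrization)
library(glmnet)
cvenet = cv.glmnet(x = xmatrix[trainindex, ], y = yvector[trainindex],
                   alpha = 0.5)
coefenet = coef(cvenet, s = "lambda.min")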
FIGURE 15.7: Comparing the constraint regions of the ridge, lasso, and elastic net

15.6 Homework problems


15.1 Uniqueness of the lasso prediction
Consider the lasso problem:

min_{b∈R^p} ∥Y − Xb∥² + λ∥b∥_1.

Show that if β̂ (1) and β̂ (2) are two solutions, then αβ̂ (1) + (1 − α)β̂ (2) is also a solution for
any 0 ≤ α ≤ 1. Show that X β̂ (1) = X β̂ (2) must hold.
Hint: The function ∥ · ∥² is strongly convex. That is, for any v_1, v_2 and 0 < α < 1, we have

∥αv_1 + (1 − α)v_2∥² ≤ α∥v_1∥² + (1 − α)∥v_2∥²,

and the inequality is strict when v_1 ≠ v_2. The function ∥ · ∥_1 is convex. That is, for any v_1, v_2 and 0 < α < 1, we have

∥αv_1 + (1 − α)v_2∥_1 ≤ α∥v_1∥_1 + (1 − α)∥v_2∥_1.

15.2 The soft-thresholding lemma


Prove Lemma 15.1.

15.3 Penalized OLS with an orthogonal design matrix


Consider the special case with standardized and orthogonal design matrix:

X t 1n = 0, X t X = Ip .

For a fixed λ ≥ 0, find the explicit formulas of the jth coordinates of the following estimators
in terms of the corresponding jth coordinate of the OLS estimator β̂j and λ (j = 1, . . . , p):

β̂ ridge(λ) = arg min_{b∈R^p} { ∥Y − Xb∥² + λ∥b∥² },

β̂ lasso(λ) = arg min_{b∈R^p} { ∥Y − Xb∥² + λ∥b∥_1 },

β̂ enet(λ) = arg min_{b∈R^p} { ∥Y − Xb∥² + λ(α∥b∥² + (1 − α)∥b∥_1) },

β̂ subset(λ) = arg min_{b∈R^p} { ∥Y − Xb∥² + λ∥b∥_0 },

where

∥b∥² = Σ_{j=1}^p b_j²,   ∥b∥_1 = Σ_{j=1}^p |b_j|,   ∥b∥_0 = Σ_{j=1}^p 1(b_j ≠ 0).

15.4 Standardization in the elastic net


For fixed λ and α, show that the intercept in β̂ enet (λ, α) equals zero under the standardiza-
tion in Condition 14.1.

15.5 Coordinate descent for the elastic net


Give the detailed coordinate descent algorithm for the elastic net.

15.6 Reducing elastic net to lasso


Consider the following form of the elastic net:

arg min_{b∈R^p} ∥Y − Xb∥² + λ{α∥b∥² + (1 − α)∥b∥_1}.

Show that it reduces to the following lasso:

arg min_{b∈R^p} ∥Ỹ − X̃b∥² + λ̃∥b∥_1,

where

Ỹ = (Y^t, 0_p^t)^t,   X̃ = (X^t, √(λα) I_p)^t,   λ̃ = λ(1 − α).
Hint: Use the result in Problem 14.4.

15.7 Reducing lasso to iterative ridge


Based on the simple result

min_{ac=b} (a² + c²)/2 = |b|

for scalars a, b, c, Hoff (2017) rewrote the lasso problem

min_{b∈R^p} { ∥Y − Xb∥² + λ∥b∥_1 }

as

min_{u,v∈R^p} { ∥Y − X(u ◦ v)∥² + λ(∥u∥² + ∥v∥²)/2 },

where ◦ denotes the component-wise product of vectors. Hoff (2017, Lemma 1) showed that
a local minimum of the new problem must be a local minimum of the lasso problem.
Show that the new problem can be solved based on the following iterative ridge regres-
sions:

1. given u, we update v based on the ridge regression of Y on Xu with tuning parameter


λ/2, where Xu = Xdiag(u1 , . . . , up );
2. given v, we update u based on the ridge regression of Y on Xv with tuning parameter
λ/2, where Xv = Xdiag(v1 , . . . , vp ).

15.8 More noise in the Boston housing data


The Boston housing data have n = 506 observations. Add p = n columns of covariates of
random noise, and compare OLS, ridge, and lasso, as in Section 15.4. Add p = 2n columns
of covariates of random noise, and compare OLS, ridge, and lasso.

15.9 Recommended reading


Tibshirani (2011) gives a review of the lasso, as well as its history and recent developments.
Two discussants, Professors Peter Bühlmann and Chris Holmes, make some excellent com-
ments.
Part VI

Transformation and Weighting


16
Transformations in OLS

Transforming the outcome and covariates is fundamental in linear models. Whenever we


specify a linear model yi = xti β + εi , we implicitly have transformed the original y and x,
or at least we have chosen the scales of them. Carroll and Ruppert (1988) is a textbook on
this topic. This chapter discusses some important special cases.

16.1 Transformation of the outcome


Although we can view
yi = xti β + εi , (i = 1, . . . , n)
as a linear projection that works for any type of outcome yi ∈ R, the linear model works the
best for continuous outcomes and especially for Normally distributed outcomes. Sometimes,
the linear model can be a poor approximation of the original outcome but may perform
well for certain transformations of the outcome.

16.1.1 Log transformation


With positive, especially heavy-tailed outcomes, a standard transformation is the log trans-
formation. So we fit a linear model

log yi = xti β + εi , (i = 1, . . . , n).

The interpretation of the coefficients changes a little bit. Because


∂ log ŷ_i / ∂x_ij = (∂ŷ_i/∂x_ij) / ŷ_i = β̂_j,

we can interpret β̂j in the following way: ceteris paribus, if xij increases by one unit, then
the proportional increase in the average outcome is β̂j . In economics, β̂j is the semi-elasticity
of y on xj in the model with log transformation on the outcome.
Sometimes, we may apply the log transformation on both the outcome and a certain
covariate:
log yi = β1 xi1 + · · · + βj log xij + · · · + εi , (i = 1, . . . , n).
The jth fitted coefficient becomes

∂ log ŷ_i / ∂ log x_ij = (∂ŷ_i/ŷ_i) / (∂x_ij/x_ij) = β̂_j,

so ceteris paribus, if xij increases by 1%, then the average outcome will increase by β̂j %.


In economics, β̂j is the xj -elasticity of y in the model with log transformation on both the
outcome and xj .
The log transformation only works for positive variables. For a nonnegative outcome,
we can modify the log transformation to log(yi + 1).

16.1.2 Box–Cox transformation


Power transformation is another important class. The Box–Cox transformation unifies the
log transformation and the power transformation:
g_λ(y) = (y^λ − 1)/λ if λ ≠ 0, and g_λ(y) = log y if λ = 0.

L'Hôpital's rule implies that

lim_{λ→0} (y^λ − 1)/λ = lim_{λ→0} (d y^λ/dλ)/1 = lim_{λ→0} y^λ log y = log y,

so as a function of λ, g_λ(y) is continuous at λ = 0. The log transformation is a limiting
version of the power transformation. Can we choose λ based on data? Box and Cox (1964)
proposed a strategy based on the maximum likelihood under the Normal linear model:

Y_λ = (y_{λ1}, . . . , y_{λn})^t = (g_λ(y_1), . . . , g_λ(y_n))^t ∼ N(Xβ, σ²I_n).
The density function of Y_λ is

f(Y_λ) = (2πσ²)^{−n/2} exp{ −(Y_λ − Xβ)^t(Y_λ − Xβ) / (2σ²) }.

The Jacobian of the transformation from Y to Y_λ is

det(∂Y_λ/∂Y) = det diag(y_1^{λ−1}, . . . , y_n^{λ−1}) = Π_{i=1}^n y_i^{λ−1},

so the density function of Y is

f(Y) = (2πσ²)^{−n/2} exp{ −(Y_λ − Xβ)^t(Y_λ − Xβ) / (2σ²) } Π_{i=1}^n y_i^{λ−1}.

If we treat the density function of Y as a function of (β, σ 2 , λ), then it is the likelihood func-
tion, defined as L(β, σ 2 , λ). Given (σ 2 , λ), maximizing the likelihood function is equivalent
to minimizing (Yλ − Xβ)t (Yλ − Xβ), i.e., we can run OLS of Yλ on X to obtain

β̂(λ) = (X t X)−1 X t Yλ .

Given λ, maximizing the likelihood function is equivalent to first obtaining β̂(λ) and then
obtaining σ̂²(λ) = n^{−1} Y_λ^t (I_n − H) Y_λ. The final step is to maximize the profile likelihood as
a function of λ:

L(β̂(λ), σ̂²(λ), λ) = {2πσ̂²(λ)}^{−n/2} exp{ −nσ̂²(λ) / (2σ̂²(λ)) } Π_{i=1}^n y_i^{λ−1}.

Dropping some constants, the log profile likelihood function of λ is

l_p(λ) = −(n/2) log σ̂²(λ) + (λ − 1) Σ_{i=1}^n log y_i.

The boxcox function in the R package MASS plots l_p(λ), finds its maximizer λ̂, and constructs
a 95% confidence interval [λ̂_L, λ̂_U] based on the following asymptotic pivotal quantity:

2{l_p(λ̂) − l_p(λ)} ∼ χ²_1 (asymptotically),

which holds by Wilks' Theorem. In practice, we often use the λ values within [λ̂_L, λ̂_U] that
have more scientific meanings.
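Before calling the boxcox function below, it may help to see that l_p(λ) can be computed directly from OLS quantities. Here is a minimal sketch on simulated data (my own illustration, not part of code16.1.2.R).

## computing the Box-Cox profile log-likelihood by hand on simulated data
set.seed(1)
n = 200
x = rnorm(n)
y = (1 + 0.5*x + 0.2*rnorm(n))^2      ## positive outcome; true lambda is 1/2
X = cbind(1, x)
H = X %*% solve(t(X) %*% X) %*% t(X)  ## hat matrix
lp = function(lambda) {
  ylam = if (lambda == 0) log(y) else (y^lambda - 1)/lambda
  sigma2 = sum(ylam * ((diag(n) - H) %*% ylam))/n
  -n/2*log(sigma2) + (lambda - 1)*sum(log(y))
}
lambdas = seq(0.1, 1, 0.05)
plot(lambdas, sapply(lambdas, lp), type = "l",
     xlab = "lambda", ylab = "profile log-likelihood")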
I use two datasets to illustrate the Box–Cox transformation, with the R code in
code16.1.2.R. For the jobs data, λ = 2 seems a plausible value.
library(MASS)
library(mediation)
par(mfrow = c(1, 3))
jobslm = lm(job_seek ~ treat + econ_hard + depress1 + sex + age + occp + marital +
              nonwhite + educ + income, data = jobs)
boxcox(jobslm, lambda = seq(1.5, 3, 0.1), plotit = TRUE)
jobslm2 = lm(I(job_seek^2) ~ treat + econ_hard + depress1 + sex + age + occp + marital +
               nonwhite + educ + income, data = jobs)
hist(jobslm$residuals, xlab = "residual", ylab = "",
     main = "job_seek", font.main = 1)
hist(jobslm2$residuals, xlab = "residual", ylab = "",
     main = "job_seek^2", font.main = 1)

FIGURE 16.1: Box–Cox transformation in the jobs data: the profile log-likelihood of λ with a 95% interval, and histograms of the residuals from the fits with outcomes job_seek and job_seek^2

In the Penn bonus experiment data, λ = 0.3 seems a plausible value. However, the resid-
ual plot does not seem Normal, making the Box–Cox transformation not very meaningful.
penndata = read.table("pennbonus.txt")
par(mfrow = c(1, 3))
pennlm = lm(duration ~ ., data = penndata)
boxcox(pennlm, lambda = seq(0.2, 0.4, 0.05), plotit = TRUE)

pennlm.3 = lm(I(duration^(0.3)) ~ ., data = penndata)
hist(pennlm$residuals, xlab = "residual", ylab = "",
     main = "duration", font.main = 1)
hist(pennlm.3$residuals, xlab = "residual", ylab = "",
     main = "duration^0.3", font.main = 1)

FIGURE 16.2: Box–Cox transformation in the Penn bonus experiment data: the profile log-likelihood of λ with a 95% interval, and histograms of the residuals from the fits with outcomes duration and duration^0.3

16.2 Transformation of the covariates


16.2.1 Polynomial, basis expansion, and generalized additive model
Linear approximations may not be adequate, so we can consider a polynomial specification.
With one-dimensional x, we can use

y_i = β_1 + β_2 x_i + β_3 x_i² + · · · + β_p x_i^{p−1} + ε_i.

In economics, it is almost the default choice to include the quadratic term of working
experience in the log wage equation. I give an example below using the data from Angrist
et al. (2006). The quadratic term of exper is significant.
> library(foreign)
> census00 = read.dta("census00.dta")
> head(census00)
  age educ    logwk     perwt exper exper2 black
1  48   12 6.670576 1.0850021    30    900     0
2  42   13 6.783905 0.9666383    23    529     0
3  49   13 6.762383 1.2132297    30    900     0
4  44   13 6.302851 0.4833191    25    625     0
5  45   16 6.043386 0.9666383    23    529     0
6  43   13 5.061138 1.0850021    24    576     0
>
> census00ols1 = lm(logwk ~ educ + exper + black,
+                   data = census00)
> census00ols2 = lm(logwk ~ educ + exper + I(exper^2) + black,
+                   data = census00)
> round(summary(census00ols1)$coef, 4)
            Estimate Std. Error  t value Pr(>|t|)
(Intercept)   4.8918     0.0315 155.0540   0.0000
educ          0.1152     0.0012  99.1472   0.0000
exper         0.0002     0.0008   0.2294   0.8185
black        -0.2466     0.0085 -29.1674   0.0000
> round(summary(census00ols2)$coef, 4)
            Estimate Std. Error  t value Pr(>|t|)
(Intercept)   5.0777     0.0887  57.2254   0.0000
educ          0.1148     0.0012  97.6506   0.0000
exper        -0.0148     0.0067  -2.2013   0.0277
I(exper^2)    0.0003     0.0001   2.2425   0.0249
black        -0.2467     0.0085 -29.1732   0.0000

We can also include polynomial terms of more than one covariate, for example,

(1, x_i1, . . . , x_i1^d, x_i2, . . . , x_i2^l)

or

(1, x_i1, . . . , x_i1^d, x_i2, . . . , x_i2^l, x_i1 x_i2, . . . , x_i1^d x_i2^l).

We can also approximate the conditional mean function of the outcome by a linear
combination of some basis functions:

y_i = f(x_i) + ε_i = Σ_{j=1}^J β_j S_j(x_i) + ε_i,

where the Sj (xi )’s are basis functions. The gam function in the mgcv package uses this strat-
egy including the automatic procedure of choosing the number of basis functions J. The
following example has a sine function as the truth, and the basis expansion approximation
yields reasonable performance with sample size n = 1000. Figure 16.3 plots both the true
and estimated curves.
library(mgcv)
n = 1000
dat = data.frame(x <- seq(0, 1, length.out = n),
                 true <- sin(x*10),
                 y <- true + rnorm(n))
np.fit = gam(y ~ s(x), data = dat)
plot(y ~ x, data = dat, bty = "n",
     pch = 19, cex = 0.1, col = "grey")
lines(true ~ x, col = "grey")
lines(np.fit$fitted.values ~ x, lty = 2)
legend("bottomright", c("true", "estimated"),
       lty = 1:2, col = c("grey", "black"),
       bty = "n")

The generalized additive model is an extension to the multivariate case:

y_i = f_1(x_i1) + · · · + f_p(x_ip) + ε_i = Σ_{j=1}^{J_1} β_1j S_j(x_i1) + · · · + Σ_{j=1}^{J_p} β_pj S_j(x_ip) + ε_i.

The gam function in the mgcv package implements this strategy. Again I use the dataset from
Angrist et al. (2006) to illustrate the procedure with nonlinearity in educ and exper shown
in Figure 16.4.
FIGURE 16.3: Nonparametric regression using the basis expansion: scatterplot of the simulated data with the true sine curve and the estimated curve

census00gam = gam(logwk ~ s(educ) + s(exper) + black,
                  data = census00)
summary(census00gam)
par(mfrow = c(1, 2))
plot(census00gam, bty = "n")

The R code in this section is in code16.2.1.R. See Wood (2017) for more details about the
generalized additive model.

16.2.2 Regression discontinuity and regression kink


The left panel of Figure 16.5 shows an example of regression discontinuity, where the linear
functions before and after a cutoff point can differ with a possible jump. A simple way to
capture the two regimes of linear regression is to fit the following model:

yi = β1 + β2 xi + β3 1 (xi > c) + β4 xi 1 (xi > c) + εi .

So

y_i = β_1 + β_2 x_i + ε_i                  if x_i ≤ c,
y_i = (β_1 + β_3) + (β_2 + β_4) x_i + ε_i  if x_i > c.
Testing the discontinuity at c is equivalent to testing

(β1 + β3 ) + (β2 + β4 ) c = β1 + β2 c ⇐⇒ β3 + β4 c = 0.

If we center the covariates at c, then

yi = β1 + β2 (xi − c) + β3 1 (xi > c) + β4 (xi − c)1 (xi > c) + εi


FIGURE 16.4: Generalized additive model: the estimated smooth functions s(educ) and s(exper)


FIGURE 16.5: Regression discontinuity (left) and regression kink (right)

and

y_i = β_1 + β_2 (x_i − c) + ε_i                    if x_i ≤ c,
y_i = (β_1 + β_3) + (β_2 + β_4)(x_i − c) + ε_i     if x_i > c.
So testing the discontinuity at c is equivalent to testing β3 = 0.
The right panel of Figure 16.5 shows an example of regression kink, where the linear
functions before and after a cutoff point can differ but the whole regression line is continuous.
A simple way to capture the two regimes of linear regression is to fit the following model:

yi = β1 + β2 Rc (xi ) + β3 (xi − c) + εi

using

R_c(x) = max(0, x − c), which equals 0 if x ≤ c and x − c if x > c.

So

y_i = β_1 + β_3 (x_i − c) + ε_i              if x_i ≤ c,
y_i = β_1 + (β_2 + β_3)(x_i − c) + ε_i       if x_i > c.
This ensures that the mean function is continuous at c with both left and right limits
equaling β1 . Testing the kink is equivalent to testing β2 = 0.
These regressions have many applications in economics, but I omit the economic back-
ground. Readers can find more discussions in Angrist and Pischke (2008) and Card et al.
(2015).
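A small simulated sketch of both fits is below (my own illustration, with made-up parameter values): the discontinuity test looks at the coefficient of the jump indicator after centering x at c, and the kink test looks at the coefficient of R_c(x).

## simulated regression discontinuity and regression kink fits
set.seed(1)
n = 500
x = runif(n, 0, 100)
c0 = 50
## discontinuity: jump of size 1 and a slope change at c0
y.rd = 0.5 + 0.01*x + 1*(x > c0) + 0.02*(x - c0)*(x > c0) + rnorm(n, sd = 0.3)
rd.fit = lm(y.rd ~ I(x - c0) + I(x > c0) + I((x - c0)*(x > c0)))
summary(rd.fit)$coef[3, ]    ## coefficient of the jump indicator: tests beta3 = 0
## kink: continuous mean function with a slope change at c0
y.kink = 0.5 + 0.01*(x - c0) + 0.02*pmax(0, x - c0) + rnorm(n, sd = 0.3)
kink.fit = lm(y.kink ~ pmax(0, x - c0) + I(x - c0))
summary(kink.fit)$coef[2, ]  ## coefficient of R_c(x): tests beta2 = 0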

16.3 Homework problems


16.1 Piecewise linear regression
Generate data in the same way as the example in Figure 16.3, and fit a continuous piecewise
linear function with cutoff points 0, 0.2, 0.4, 0.6, 0.8, 1.
17
Interaction

Interaction is an important notion in applied statistics. It measures the interplay of two


or more variables acting simultaneously on an outcome. Epidemiologists find that cigarette
smoking and alcohol consumption both increase the risks of many cancers. Then they want
to measure how cigarette smoking and alcohol consumption jointly increase the risks. That
is, does cigarette smoking increase the risks of cancers more in the presence of alcohol
consumption than in the absence of it? Political scientists are interested in measuring the
interplay of different get-out-to-vote interventions on voting behavior. This chapter will
review many aspects of interaction in the context of linear regression. Cox (1984) and
Berrington de González and Cox (2007) reviewed interaction from a statistical perspective.
VanderWeele (2015) offers a textbook discussion on interaction with a focus on applications
in epidemiology.

17.1 Two binary covariates interact


Let’s start with the simplest yet nontrivial example with two binary covariates xi1 , xi2 ∈
{0, 1}. We can fit the OLS:

yi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂12 xi1 xi2 + ε̂i . (17.1)

We can express the coefficients in terms of the means of the outcomes within four combi-
nations of the covariates. The following proposition is an algebraic result.

Proposition 17.1 From (17.1), we have

β̂0 = ȳ00 ,
β̂1 = ȳ10 − ȳ00 ,
β̂2 = ȳ01 − ȳ00 ,
β̂12 = (ȳ11 − ȳ10 ) − (ȳ01 − ȳ00 ),

where ȳf1 f2 is the average value of the yi ’s with xi1 = f1 and xi2 = f2 .

The proof of Proposition 17.1 is purely algebraic and is relegated to Problem 17.1. The
proposition generalizes to OLS with more than two binary covariates. See Zhao and Ding
(2022) for more details.
Practitioners also interpret the coefficient of the product term of two continuous variables
as an interaction. The coefficient β̂12 equals the difference between ȳ11 − ȳ10 , the effect of
xi2 on yi holding xi1 at level 1, and ȳ01 − ȳ00 , the effect of xi2 on yi holding xi1 at level 0.
It also equals
β̂12 = (ȳ11 − ȳ01 ) − (ȳ10 − ȳ00 ),


that is, the difference between ȳ11 − ȳ01 , the effect of xi1 on yi holding xi2 at level 1, and
ȳ10 − ȳ00 , the effect of xi1 on yi holding xi2 at level 0. The formula shows the symmetry of
xi1 and xi2 in defining interaction.
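A quick simulated check of Proposition 17.1 (my own code): the coefficient of the product term agrees with the difference-in-differences of the four group means.

## numerical check of Proposition 17.1 with two binary covariates
set.seed(1)
n = 400
x1 = rbinom(n, 1, 0.5)
x2 = rbinom(n, 1, 0.5)
y = 1 + x1 + 2*x2 + 3*x1*x2 + rnorm(n)
fit = lm(y ~ x1*x2)
ybar = tapply(y, list(x1, x2), mean)   ## group means, indexed by (f1, f2)
c(coef(fit)["x1:x2"],
  (ybar["1", "1"] - ybar["1", "0"]) - (ybar["0", "1"] - ybar["0", "0"]))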

17.2 A binary covariate interacts with a general covariate


17.2.1 Treatment effect heterogeneity
In many studies, we are interested in the effect of a binary treatment zi on an outcome yi ,
adjusting for some background covariates xi . The covariates can play many roles in this
problem. They may affect the treatment, enter the outcome model, and modify the effect of
the treatment on the outcome. We can formulate the problem in terms of linear regression:

yi = β0 + β1 zi + β2t xi + β3t xi zi + εi , (17.2)

where E(εi | zi , xi ) = 0. So

E(yi | zi = 1, xi ) = β0 + β1 + (β2 + β3 )t xi

and
E(yi | zi = 0, xi ) = β0 + β2t xi ,
which implies that

E(yi | zi = 1, xi ) − E(yi | zi = 0, xi ) = β1 + β3t xi .

The conditional average treatment effect is thus a linear function of the covariates. As long
as β3 ̸= 0, we have treatment effect heterogeneity, which is also called effect modification.
A statistical test for β3 = 0 is straightforward based on OLS and EHW standard error.
Note that (17.2) includes the interaction of the treatment and all covariates. With prior
knowledge, we may believe that the treatment effect varies with respect to a subset of
covariates, or, equivalently, we may set some components of β3 to be zero.
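A minimal sketch of such a test on simulated data is below (my own code); it assumes the car package is available and uses its linearHypothesis function with an EHW-type adjustment to test the joint null β3 = 0.

## testing for effect modification (beta3 = 0) with an EHW covariance
library(car)
set.seed(1)
z = rbinom(500, 1, 0.5)
x1 = rnorm(500)
x2 = rnorm(500)
y = 1 + z + x1 + 0.5*x2 + 0.5*z*x1 + rnorm(500)*(1 + 0.5*z)
fit = lm(y ~ z*(x1 + x2))
linearHypothesis(fit, c("z:x1 = 0", "z:x2 = 0"), white.adjust = "hc2")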

17.2.2 Johnson–Neyman technique


Johnson and Neyman (1936) proposed a technique to identify the region of covariates in
which the conditional average treatment effect β_1 + β_3^t x is zero. For a given x, we can test the null
hypothesis that β1 + β3t x = 0, which is a linear combination of the regression coefficients
of (17.2). If we fail to reject the null hypothesis, then this x belongs to the region of zero
conditional average effect. See Rogosa (1981) for more discussions.

17.2.3 Blinder–Oaxaca decomposition


The linear regression (17.2) also applies to descriptive statistics when zi is a binary indicator
for subgroups. For example, zi can be a binary indicator for age, racial, or gender groups,
yi can be the log wage, and xi can be a vector of explanatory variables such as education,
experience, industry, and occupation. Sometimes, it is more insightful to write (17.2) in
terms of two possibly non-parallel linear regressions:

yi = γ0 + θ0t xi + εi , E(εi | zi = 0, xi ) = 0 (17.3)



for the group with zi = 0, and


yi = γ1 + θ1t xi + εi , E(εi | zi = 1, xi ) = 0 (17.4)
for the group with zi = 1. Regressions (17.3) and (17.4) are just a reparametrization of
(17.2) with
γ0 = β0 , θ0 = β2 , γ1 = β0 + β1 , θ1 = β2 + β3 .
Based on (17.3) and (17.4), we can decompose the difference in the outcome means as
E(yi | zi = 1) − E(yi | zi = 0)
= {γ1 + θ1t E(xi | zi = 1)} − {γ0 + θ0t E(xi | zi = 0)}
= θ0t {E(xi | zi = 1) − E(xi | zi = 0)}
+(θ1 − θ0 )t E(xi | zi = 0) + γ1 − γ0
+(θ1 − θ0 )t {E(xi | zi = 1) − E(xi | zi = 0)}.
The decomposition has three components: the first component
E = θ0t {E(xi | zi = 1) − E(xi | zi = 0)}
= β2t {E(xi | zi = 1) − E(xi | zi = 0)}
measures the endowment effect since it is due to the difference in the covariates; the second
component
C = (θ1 − θ0 )t E(xi | zi = 0) + γ1 − γ0
= β3t E(xi | zi = 0) + β1
measures the difference in coefficients; the third component
I = (θ1 − θ0 )t {E(xi | zi = 1) − E(xi | zi = 0)}
= β3t {E(xi | zi = 1) − E(xi | zi = 0)}
measures the interaction between the endowment and coefficients. This is called the Blinder–
Oaxaca decomposition. Jann (2008) reviews other forms of the decomposition, extending
the original forms in Blinder (1973) and Oaxaca (1973).
Estimation and testing for E, C, and I are straightforward. Based on the OLS of (17.2)
and the sample means x̄1 and x̄0 of the covariates, we have point estimators
Ê = β̂2t (x̄1 − x̄0 ),
Ĉ = β̂_3^t x̄_0 + β̂_1,
Î = β̂3t (x̄1 − x̄0 ).
Given the covariates, they are just linear transformations of the OLS coefficients. Statistical
inference is thus straightforward.
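A minimal sketch of these point estimators on simulated data (my own code with illustrative variable names):

## Blinder-Oaxaca decomposition: point estimates of E, C, and I
set.seed(1)
n = 1000
z = rbinom(n, 1, 0.5)
x = cbind(rnorm(n, mean = 0.3*z), rnorm(n, mean = 0.1*z))
y = drop(1 + 0.5*z + x %*% c(1, 2) + z*(x %*% c(0.3, 0)) + rnorm(n))
fit = lm(y ~ z*x)
beta2 = coef(fit)[c("x1", "x2")]
beta3 = coef(fit)[c("z:x1", "z:x2")]
xbar1 = colMeans(x[z == 1, ])
xbar0 = colMeans(x[z == 0, ])
E.hat = sum(beta2*(xbar1 - xbar0))
C.hat = sum(beta3*xbar0) + unname(coef(fit)["z"])
I.hat = sum(beta3*(xbar1 - xbar0))
## E + C + I equals the difference in the outcome means
c(E = E.hat, C = C.hat, I = I.hat,
  total = mean(y[z == 1]) - mean(y[z == 0]))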

17.2.4 Chow test


Chow (1960) proposed to test whether the two regressions (17.3) and (17.4) are identical.
Under the null hypothesis that γ0 = γ1 and θ0 = θ1 , he proposed an F test assuming
homoskedasticity, which is called the Chow test in econometrics. In fact, this is just a
special case of the standard F test for the null hypothesis that β1 = 0 and β3 = 0 in (17.2).
Moreover, based on the OLS in (17.2), we can also derive the robust test based on the EHW
covariance estimator. Chow (1960) discussed a subtle case in which one group has a small
size, rendering the OLS fit underdetermined. I relegate the details to Problem 17.3. Note that
under this null hypothesis, C = I = 0, so the difference in the outcome means is purely due
to the difference in the covariate means.

17.3 Difficulties of interaction


17.3.1 Removable interaction
The significance of the interaction term differs with y and log(y).
> n = 1000
> x1 = rnorm(n)
> x2 = rnorm(n)
> y = exp(x1 + x2 + rnorm(n))
> ols.fit = lm(log(y) ~ x1*x2)
> summary(ols.fit)

Call:
lm(formula = log(y) ~ x1 * x2)

Residuals:
    Min      1Q  Median      3Q     Max
-3.7373 -0.6822 -0.0111  0.7084  3.1039

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.003214   0.031286   0.103    0.918
x1           1.056801   0.030649  34.480   <2e-16 ***
x2           1.009404   0.030778  32.797   <2e-16 ***
x1:x2       -0.017528   0.030526  -0.574    0.566

> ols.fit = lm(y ~ x1*x2)
> summary(ols.fit)

Call:
lm(formula = y ~ x1 * x2)

Residuals:
   Min     1Q Median     3Q    Max
-35.95  -5.17  -0.97   2.34 513.35

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2842     0.6686   7.903 7.17e-15 ***
x1            6.7565     0.6550  10.315  < 2e-16 ***
x2            4.9548     0.6577   7.533 1.11e-13 ***
x1:x2         7.3810     0.6524  11.314  < 2e-16 ***

17.3.2 Main effect in the presence of interaction


In the OLS fit below, we observe significant main effects.
> ## data from "https://stats.idre.ucla.edu/stat/data/hsbdemo.dta"
> hsbdemo = read.table("hsbdemo.txt")
> ols.fit = lm(read ~ math + socst, data = hsbdemo)
> summary(ols.fit)

Call:
lm(formula = read ~ math + socst, data = hsbdemo)

Residuals:
     Min       1Q   Median       3Q      Max
-18.8729  -4.8987  -0.6286   5.2380  23.6993

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.14654    3.04066   2.350   0.0197 *
math         0.50384    0.06337   7.951 1.41e-13 ***
socst        0.35414    0.05530   6.404 1.08e-09 ***

Then we add the interaction term into the OLS, and suddenly we have a significant interaction but not significant main effects.
> ols.fit = lm(read ~ math*socst, data = hsbdemo)
> summary(ols.fit)

Call:
lm(formula = read ~ math * socst, data = hsbdemo)

Residuals:
     Min       1Q   Median       3Q      Max
-18.6071  -4.9228  -0.7195   4.5912  21.8592

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.842715  14.545210   2.602  0.00998 **
math        -0.110512   0.291634  -0.379  0.70514
socst       -0.220044   0.271754  -0.810  0.41908
math:socst   0.011281   0.005229   2.157  0.03221 *

However, if we center the covariates, the main effects are significant again.
> hsbdemo$math.c = hsbdemo$math - mean(hsbdemo$math)
> hsbdemo$socst.c = hsbdemo$socst - mean(hsbdemo$socst)
> ols.fit = lm(read ~ math.c*socst.c, data = hsbdemo)
> summary(ols.fit)

Call:
lm(formula = read ~ math.c * socst.c, data = hsbdemo)

Residuals:
     Min       1Q   Median       3Q      Max
-18.6071  -4.9228  -0.7195   4.5912  21.8592

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    51.615327   0.568685  90.763  < 2e-16 ***
math.c          0.480654   0.063701   7.545 1.65e-12 ***
socst.c         0.373829   0.055546   6.730 1.82e-10 ***
math.c:socst.c  0.011281   0.005229   2.157   0.0322 *

Based on the linear model with interaction


E(yi | xi1 , xi2 ) = β0 + β1 xi1 + β2 xi2 + β12 xi1 xi2 ,
better definitions of the main effects are

n^{−1} Σ_{i=1}^n ∂E(y_i | x_i1, x_i2)/∂x_i1 = n^{−1} Σ_{i=1}^n (β_1 + β_12 x_i2) = β_1 + β_12 x̄_2

and

n^{−1} Σ_{i=1}^n ∂E(y_i | x_i1, x_i2)/∂x_i2 = n^{−1} Σ_{i=1}^n (β_2 + β_12 x_i1) = β_2 + β_12 x̄_1,

which are called the average partial or marginal effects. So when the covariates are centered, we can interpret β_1 and β_2 as the main effects. In contrast, the interpretation of the interaction term does not depend on the centering of the covariates because

∂²E(y_i | x_i1, x_i2) / (∂x_i1 ∂x_i2) = β_12.

The R code in this section is in code17.3.R.

17.3.3 Power
Usually, statistical tests for interaction do not have enough power. Proposition 17.1 provides
a simple explanation. The variance of the interaction equals

var(β̂_12) = σ²_11/n_11 + σ²_10/n_10 + σ²_01/n_01 + σ²_00/n_00,

where σ²_{f1 f2} = var(y_i | x_i1 = f_1, x_i2 = f_2). Therefore, its variance is driven by the smallest
value of n11 , n10 , n01 , n00 . Even when the total sample size is large, one of the subgroup
sample sizes can be small, resulting in a large variance of the estimator of the interaction.

17.4 Homework problems


17.1 Interaction and difference-in-difference
Prove Proposition 17.1. Moreover, simplify the HC0 and HC2 versions of the EHW standard
errors of the coefficients in terms of n_{f1 f2} and σ̂²_{f1 f2}, where n_{f1 f2} is the sample size and σ̂²_{f1 f2}
is the sample variance of the outcomes for units with x_i1 = f_1 and x_i2 = f_2.
Hint: You can prove the proposition by inverting the 4 × 4 matrix X t X. However, this
method is a little too tedious. Moreover, this proof does not generalize to OLS with K > 2
binary covariates. So it is better to find alternative proofs. For the EHW standard errors,
you can use the result in Problems 6.3 and 6.4.

17.2 Two OLS


Given data (xi , zi , yi )ni=1 where xi denotes the covariates, zi denotes the binary group indi-
cator, and yi denotes the outcome. We can fit two separate OLS:

ŷi = γ̂1 + xti β̂1

and
ŷi = γ̂0 + xti β̂0
with data in group 1 and group 0, respectively. We can also fit a joint OLS using the pooled
data:
ŷi = α̂0 + α̂z zi + xti α̂x + zi xti α̂zx .

1. Find (α̂0 , α̂z , α̂x , α̂zx ) in terms of (γ̂1 , β̂1 , γ̂0 , β̂0 ).
2. Show that the fitted values ŷi ’s are the same from the separate and the pooled OLS for
all units i = 1, . . . , n.
3. Show that the leverage scores hii ’s are the same from the separate and the pooled OLS.

17.3 Chow test when one group size is too small


Assume (17.3) and (17.4) with homoskedastic Normal error terms. Let n1 and n0 denote
the sample sizes of groups with zi = 1 and zi = 0. Consider the case with n0 larger than the
number of covariates but n1 smaller than the number of covariates. So we can fit OLS and

estimate the variance based on (17.3), but we cannot do so based on (17.4). The statistical
test discussed in the main paper does not apply. Chow (1960) proposed the following test
based on prediction.
Let γ̂0 and θ̂0 be the coefficients, and σ02 be the variance estimate based on OLS with
units zi = 0. Under the null hypothesis that γ0 = γ1 and θ0 = θ1 , predict the outcomes of
the units zi = 1:
ŷi = γ̂0 + θ̂0t xi
with the prediction error
di = yi − ŷi
following a multivariate Normal distribution. Propose an F test based on di with zi = 1.
Hint: It is more convenient to use the matrix form of OLS.

17.4 Invariance of the interaction


In Section 17.3.2, the point estimate and standard error of the coefficient of the interaction
term remain the same no matter whether we center the covariates or not. This result holds
in general. This problem quantifies this phenomenon.
With scalars xi1 , xi2 , yi (i = 1, . . . , n), we can fit the OLS

yi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂12 xi1 xi2 + ε̂i .

Under any location transformations of the covariates x′i1 = xi1 − c1 , x′i2 = xi2 − c2 , we can
fit the OLS
yi = β̃0 + β̃1 x′i1 + β̃2 x′i2 + β̃12 x′i1 x′i2 + ε̃i .

1. Express β̂0 , β̂1 , β̂2 , β̂12 in terms of β̃0 , β̃1 , β̃2 , β̃12 . Verify that β̂12 = β̃12 .

2. Show that the EHW standard errors for β̂12 and β̃12 are identical.
Hint: Use the results in Problems 3.4 and 6.4.
18
Restricted OLS

Assume that in the standard linear model Y = Xβ + ε, the parameter has restriction

Cβ = r (18.1)

where C is an l × p matrix and r is an l-dimensional vector. Assume that C has linearly independent row vectors; otherwise, some restrictions are redundant. We can use the restricted OLS:

β̂_r = arg min_{b∈R^p} ∥Y − Xb∥²

under the restriction

Cb = r.
I first give some examples of linear models with restricted parameters, then derive the al-
gebraic properties of the restricted OLS estimator β̂r , and finally discuss statistical inference
with restricted OLS.

18.1 Examples
Example 18.1 (Short regression) Partition X into X1 and X2 with k and l columns,
respectively, with p = k + l. The short regression of Y on X1 yields OLS coefficient β̂1 . So
β̂_r = (β̂_1^t, 0_l^t)^t with

C = (0_{l×k}, I_{l×l}), r = 0_l.

Example 18.2 (Testing linear hypothesis) Consider testing the linear hypothesis
Cβ = r in the linear model. We have discussed in Chapter 5 the Wald test based on the
OLS estimator and its estimated covariance matrix under the Normal linear model. An al-
ternative strategy is to test the hypothesis based on comparing the residual sum of squares
under the OLS and restricted OLS. Therefore, we need to compute both β̂ and β̂r .

Example 18.3 (One-way analysis of variance) If x_i contains the intercept and the Q_1
dummy variables of a discrete regressor of Q_1 levels, (f_i1, . . . , f_iQ_1)^t, then we must impose a restriction on the parameters in the linear model

y_i = α + Σ_{j=1}^{Q_1} β_j f_ij + ε_i.

A canonical choice is β_{Q_1} = 0, which is equivalent to dropping the last dummy variable
due to its redundancy. Another canonical choice is Σ_{j=1}^{Q_1} β_j = 0. This restriction keeps the
symmetry of the regressors in the linear model and changes the interpretation of βj as the
deviation from the “effect” of level j with respect to the average “effect.” Both are special
cases of restricted OLS.


Example 18.4 (Two-way analysis of variance) With two factors of levels Q_1 and
Q_2, respectively, the regressor x_i contains the Q_1 dummy variables of the first factor,
(f_i1, . . . , f_iQ_1)^t, the Q_2 dummies of the second factor, (g_i1, . . . , g_iQ_2)^t, and the Q_1 Q_2 dummy
variables of the interaction terms, (f_i1 g_i1, . . . , f_iQ_1 g_iQ_2)^t. We must impose restrictions on
the parameters in the linear model

y_i = α + Σ_{j=1}^{Q_1} β_j f_ij + Σ_{k=1}^{Q_2} γ_k g_ik + Σ_{j=1}^{Q_1} Σ_{k=1}^{Q_2} δ_jk f_ij g_ik + ε_i.

Similar to the discussion in Example 18.3, two canonical choices of restrictions are

β_{Q_1} = 0,  γ_{Q_2} = 0,  δ_{Q_1,k} = δ_{j,Q_2} = 0  (j = 1, . . . , Q_1; k = 1, . . . , Q_2)

and

Σ_{j=1}^{Q_1} β_j = 0,  Σ_{k=1}^{Q_2} γ_k = 0,  Σ_{j=1}^{Q_1} δ_jk = Σ_{k=1}^{Q_2} δ_jk = 0  (j = 1, . . . , Q_1; k = 1, . . . , Q_2).

18.2 Algebraic properties


I first give an explicit formula of the restricted OLS (Theil, 1971; Rao, 1973). For simplicity,
the following theorem assumes that X t X is invertible. This condition may not hold in
general; see Examples 18.3 and 18.4. Greene and Seaks (1991) discussed the results without
this assumption; see Problem 18.8 for more details.

Theorem 18.1 If X t X is invertible, then

β̂r = β̂ − (X t X)−1 C t {C(X t X)−1 C t }−1 (C β̂ − r),

where β̂ is the unrestricted OLS coefficient.

Proof of Theorem 18.1: The Lagrangian for the restricted optimization problem is

(Y − Xb)t (Y − Xb) − 2λt (Cb − r).

So the first order condition is

2X t (Y − Xb) − 2C t λ = 0

which implies
X t Xb = X t Y − C t λ. (18.2)
Solve the linear system in (18.2) to obtain

b = (X t X)−1 (X t Y − C t λ).

Using the linear restriction Cb = r, we have

C(X t X)−1 (X t Y − C t λ) = r

which implies that


λ = {C(X t X)−1 C t }−1 (C β̂ − r).

So the restricted OLS coefficient is


β̂r = (X t X)−1 (X t Y − C t λ)
= β̂ − (X t X)−1 C t λ
= β̂ − (X t X)−1 C t {C(X t X)−1 C t }−1 (C β̂ − r).
Since the objective function is convex and the restrictions are linear, the solution from the
first-order condition is indeed the minimizer. □
In the special case with r = 0, Theorem 18.1 has a simpler form.
Corollary 18.1 Under the restriction (18.1) with r = 0, we have

β̂r = Mr β̂,
where
Mr = Ip − (X t X)−1 C t {C(X t X)−1 C t }−1 C.
Moreover, Mr satisfies the following properties
Mr (X t X)−1 C t = 0, CMr = 0, {Ip − C t (CC t )−1 C}Mr = Mr .
The Mr matrix plays central roles below.
The following result is also an immediate corollary of Theorem 18.1.
Corollary 18.2 Under the restriction (18.1), we have

β̂r − β = Mr (β̂ − β).


I leave the proofs of Corollaries 18.1 and 18.2 as Problem 18.1.
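Theorem 18.1 translates directly into a few lines of R. The sketch below (my own code) implements the formula and checks it against the special case of Example 18.1, where restricting the last coefficient to zero should reproduce the short regression.

## restricted OLS via Theorem 18.1
restricted.ols = function(X, Y, C, r) {
  XtX.inv = solve(t(X) %*% X)
  beta.hat = XtX.inv %*% t(X) %*% Y
  A = XtX.inv %*% t(C) %*% solve(C %*% XtX.inv %*% t(C))
  beta.hat - A %*% (C %*% beta.hat - r)
}
set.seed(1)
n = 100
X = matrix(rnorm(n*4), n, 4)
Y = rnorm(n)
C = matrix(c(0, 0, 0, 1), nrow = 1)   ## restriction: the 4th coefficient is zero
br = restricted.ols(X, Y, C, r = 0)
## should match the OLS of Y on the first three columns, padded with a zero
cbind(br, c(coef(lm(Y ~ X[, 1:3] - 1)), 0))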

18.3 Statistical inference


I first focus on the Gauss–Markov model with the restriction (18.1). As direct consequences
of Corollary 18.2, we can show that the restricted OLS estimator is unbiased for β, and
obtain its covariance matrix below.
Corollary 18.3 Assume the Gauss–Markov model with the restriction (18.1). We have

E(β̂r ) = β,
cov(β̂r ) = σ 2 Mr (X t X)−1 Mrt .
Moreover, under the Normal linear model with the restriction (18.1), we can derive the
exact distribution of the restricted OLS estimator and propose an unbiased estimator for
σ2 .
Theorem 18.2 Assume the Normal linear model with the restriction (18.1). We have

β̂r ∼ N(β, σ 2 Mr (X t X)−1 Mrt ).


An unbiased estimator for σ 2 is
σ̂r2 = ε̂tr ε̂r /(n − p + l),

where ε̂_r = Y − X β̂_r. Moreover, β̂_r and σ̂_r² are independent.

Based on the results in Theorem 18.2, we can derive the t and F statistics for finite-
sample inference of β based on the estimator β̂r and the estimated covariance matrix

σ̂r2 Mr (X t X)−1 Mrt .

Corollary 18.3 and Theorem 18.2 extend the results for the OLS estimator. I leave their
proofs as Problem 18.3.
I then discuss statistical inference under the heteroskedastic linear model with the re-
striction (18.1). Corollary 18.2 implies that

cov(β̂r ) = Mr (X t X)−1 X t diag{σ12 , . . . , σn2 }X(X t X)−1 Mrt .

Therefore, the EHW-type estimated covariance matrix is

V̂_ehw,r = M_r (X^tX)^{−1} X^t diag{ε̂²_{1,r}, . . . , ε̂²_{n,r}} X (X^tX)^{−1} M_r^t,

where the ε̂_{i,r}'s are the residuals from the restricted OLS.

18.4 Final remarks


This chapter follows Theil (1971) and Rao (1973). Tarpey (2000) contains additional alge-
braic results on restricted OLS.

18.5 Homework problems


18.1 Algebraic details of restricted OLS
Prove Corollaries 18.1 and 18.2.

18.2 Invariance of restricted OLS


Consider an N × 1 vector Y and two N × p matrices, X and X̃, that satisfy X̃ = XΓ for
some nonsingular p × p matrix Γ. The restricted OLS fits of

Y = X β̂r + ϵ̂r subject to C β̂r = r,


Y = X̃ β̃r + ϵ̃r subject to C̃ β̃r = r,

with X̃ = XΓ and C̃ = CΓ yield (β̂r , ϵ̂r , V̂ehw,r ) and (β̃r , ϵ̃r , Ṽehw,r ) as the coefficient vectors,
residuals, and robust covariances.
Prove that they must satisfy

β̂r = Γβ̃r , ϵ̂r = ϵ̃r , V̂ehw,r = ΓṼehw,r Γt .

18.3 Moments and distribution of restricted OLS


Prove Corollary 18.3 and Theorem 18.2.

18.4 Gauss–Markov theorem for restricted OLS


The Gauss–Markov theorem for β̂r holds, as an extension of Theorem 4.4 for β̂.
Theorem 18.3 Under the Gauss–Markov model with the restrictions (18.1), β̂r is the best
linear unbiased estimator in the sense that cov(β̃r ) − cov(β̂r ) ⪰ 0 for any linear estimator
β̃r = c̃+ Ãr Y , with c̃ ∈ Rp and Ãr ∈ Rp×n , that satisfies E(β̃r ) = β for all β under constraint
(18.1).
Prove Theorem 18.3.
Remark: As a corollary of Theorem 18.3, we have
(X t X)−1 ⪰ Mr (X t X)−1 Mrt
because the restricted OLS estimator is BLUE whereas the unrestricted OLS is not, under
the Gauss–Markov theorem with the restriction (18.1).

18.5 Short regression as restricted OLS


The short regression is a special case of the formula of β̂_r. Show that

β̂_r = ( (X_1^t X_1)^{−1} X_1^t Y ; 0_l ),

stacking the short-regression coefficient on top of the zero vector 0_l, with

C = (0_{l×k}, I_{l×l}), r = 0_l.
In this special case, p = k + l.
From the short regression, we can obtain the EHW estimated covariance matrix V̂ehw,1 .
We can also obtain the EHW estimated covariance matrix V̂ehw,r from the restricted OLS.
Show that

V̂_ehw,r = ( V̂_ehw,1, 0; 0, 0 ),

a block diagonal matrix with upper-left block V̂_ehw,1 and zeros elsewhere.

18.6 Reducing restricted OLS to OLS


Consider the restricted OLS fit
Y = X β̂r + ε̂r subject to C β̂r = 0, (18.3)
where X ∈ Rn×p and C ∈ Rl×p .
Let C_⊥ ∈ R^{(p−l)×p} be an orthogonal complement of C in the sense that (C_⊥^t, C^t) is
nonsingular with C_⊥ C^t = 0. Define

X_⊥ = X C_⊥^t (C_⊥ C_⊥^t)^{−1}.

Consider the corresponding unrestricted OLS fit

Y = X_⊥ β̂_⊥ + ε̂_⊥.   (18.4)

First, prove that the coefficient and residual vectors must satisfy

β̂_⊥ = C_⊥ β̂_r,   ε̂_⊥ = ε̂_r.

Second, prove

V̂_ehw,⊥ = C_⊥ V̂_ehw,r C_⊥^t,
where V̂ehw,⊥ is the EHW robust covariance matrix from (18.4) and V̂ehw,r is the EHW
robust covariance matrix from (18.3).

18.7 Minimum norm estimator as restricted OLS


An application of the formula of β̂r is the minimum norm estimator for under-determined
linear equations. When X has more columns than rows, Y = Xβ can have infinitely many
solutions, but we may only be interested in the solution with the minimum norm. Assume
p ≥ n and the rows of X are linearly independent. Show that the solution to

min_b ∥b∥² such that Y = Xb
b

is
β̂m = X t (XX t )−1 Y.

18.8 Restricted OLS with degenerate design matrix


Greene and Seaks (1991) pointed out that restricted OLS does not require that X t X be
invertible, although the proof of Theorem 18.1 does. Modify the proof to show that the
restricted OLS and the Lagrange multiplier satisfy
   t 
β̂r −1 X Y
=W
λ r

as long as  t 
X X Ct
W =
C 0
is invertible.
Derive the statistical results in parallel with Section 18.3.
Remark: If X has full column rank p, then W must be invertible. Even if X does not
have full column rank, W can still be invertible. See Problem 18.9 below for more details.

18.9 Restricted OLS with degenerate design matrix: more algebra


This problem provides more algebraic details for Problem 18.8. Prove Lemma 18.1 below.

Lemma 18.1 Consider

    W = [ X t X   C t ]
        [   C      0  ]

where X t X may not be invertible and C has full row rank. The matrix W is invertible if
and only if the stacked matrix (X t , C t )t has full column rank p.

Remark: When X has full column rank p, then (X t , C t )t must have full column rank p,
which ensures that W is invertible by Lemma 18.1. I made the comment in Problem 18.8.
The invertibility of W plays an important role in other applications. See Benzi et al.
(2005) and Bai and Bai (2013) for more general results.
19
Weighted Least Squares

19.1 Generalized least squares


We can extend the Gauss–Markov model to allow for a general covariance structure of the
error term. The following model is due to Aitken (1936).
Assumption 19.1 (Generalized Gauss–Markov model) We have

    Y = Xβ + ε,   E(ε) = 0,   cov(ε) = σ 2 Σ,                   (19.1)

where X is a fixed matrix with linearly independent columns. The unknown parameters are
β and σ 2 . The matrix Σ is known and positive definite.
Two leading cases of generalized least squares are

    Σ = diag{w1−1 , . . . , wn−1 },                              (19.2)

which corresponds to a diagonal covariance matrix, and

    Σ = diag{Σ1 , . . . , ΣK },                                  (19.3)

which corresponds to a block diagonal covariance matrix, where Σk is nk × nk and
n1 + · · · + nK = n.
Under model (19.1), we can still use the OLS estimator β̂ = (X t X)−1 X t Y . It is unbiased
because

    E(β̂) = (X t X)−1 X t E(Y ) = (X t X)−1 X t Xβ = β.

It has covariance matrix

    cov(β̂) = cov{(X t X)−1 X t Y }
            = (X t X)−1 X t cov(Y )X(X t X)−1
            = σ 2 (X t X)−1 X t ΣX(X t X)−1 .                    (19.4)
The OLS estimator is BLUE under the Gauss–Markov model, but it is not under the
generalized Gauss–Markov model. Then what is the BLUE? We can transform (19.1) into
the Gauss–Markov model by standardizing the error term:
Σ−1/2 Y = Σ−1/2 Xβ + Σ−1/2 ε.
Define Y∗ = Σ−1/2 Y, X∗ = Σ−1/2 X and ε∗ = Σ−1/2 ε. The model (19.1) reduces to
Y∗ = X∗ β + ε∗ ,   E(ε∗ ) = 0,   cov(ε∗ ) = σ 2 In ,
which is the Gauss–Markov model for the transformed variables Y∗ and X∗ . The Gauss–
Markov theorem ensures that the BLUE is
β̂Σ = (X∗t X∗ )−1 X∗t Y∗ = (X t Σ−1 X)−1 X t Σ−1 Y.


It is unbiased because

    E(β̂Σ ) = (X t Σ−1 X)−1 X t Σ−1 E(Y ) = (X t Σ−1 X)−1 X t Σ−1 Xβ = β.

It has covariance matrix

    cov(β̂Σ ) = cov{(X t Σ−1 X)−1 X t Σ−1 Y }
             = (X t Σ−1 X)−1 X t Σ−1 cov(Y )Σ−1 X(X t Σ−1 X)−1
             = σ 2 (X t Σ−1 X)−1 X t Σ−1 ΣΣ−1 X(X t Σ−1 X)−1
             = σ 2 (X t Σ−1 X)−1 .                               (19.5)

In particular, cov(β̂Σ ) is smaller than or equal to cov(β̂) in the matrix sense1 . So based on
(19.4) and (19.5), we have the following pure linear algebra inequality:
Corollary 19.1 If X has linearly independent columns and Σ is invertible, then

(X t Σ−1 X)−1 ⪯ (X t X)−1 X t ΣX(X t X)−1 .

Problem 19.1 gives a more general result.
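As a sanity check, Corollary 19.1 can also be verified numerically; the following sketch (my own illustration with an arbitrary design matrix and an arbitrary diagonal Σ, not part of the book's code) checks that the difference of the two covariance matrices is positive semi-definite.

# numerical check of Corollary 19.1
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))      # arbitrary design with linearly independent columns
Sigma <- diag(runif(n, 0.5, 2))        # arbitrary positive definite (diagonal) Sigma
cov.gls <- solve(t(X) %*% solve(Sigma) %*% X)
cov.ols <- solve(t(X) %*% X) %*% t(X) %*% Sigma %*% X %*% solve(t(X) %*% X)
round(eigen(cov.ols - cov.gls)$values, 6)  # all eigenvalues should be nonnegative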

19.2 Weighted least squares


This chapter focuses on the first covariance structure in (19.2) and Chapter 25 will discuss
the second in (19.3). The Σ in (19.2) results in the weighted least squares (WLS) estimator

    β̂w = β̂Σ = (X t Σ−1 X)−1 X t Σ−1 Y
       = ( ∑_{i=1}^n wi xi xti )−1 ∑_{i=1}^n wi xi yi .

From the derivation above, we can also write the WLS estimator as

    β̂w = arg min_{b∈Rp} (Y − Xb)t Σ−1 (Y − Xb)
       = arg min_{b∈Rp} ∑_{i=1}^n wi (yi − xti b)2
       = arg min_{b∈Rp} (Y∗ − X∗ b)t (Y∗ − X∗ b)
       = arg min_{b∈Rp} ∑_{i=1}^n (y∗i − xt∗i b)2 ,

1 The matrix X t Σ−1 X is positive definite and thus invertible, because (1) for any α ∈ Rp , Σ−1 ≻ 0
implies αt X t Σ−1 Xα ≥ 0, and (2) αt X t Σ−1 Xα = 0 ⇐⇒ Xα = 0 ⇐⇒ α = 0 since X has linearly
independent columns.
where y∗i = wi^{1/2} yi and x∗i = wi^{1/2} xi . So WLS is equivalent to the OLS with transformed
variables, with the weights inversely proportional to the variances of the errors. By this
equivalence, WLS inherits many properties of OLS. See the problems in Section 19.5 for
more details.
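As a quick numerical illustration of this equivalence (my own simulated example, not the book's code), lm with a weights argument reproduces the OLS fit on the rescaled variables:

# WLS equals OLS with transformed variables (simulated data)
set.seed(2)
n <- 200
x <- rnorm(n)
w <- runif(n, 0.2, 2)                          # known weights
y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(w))    # error variance proportional to 1/w
wls.fit  <- lm(y ~ x, weights = w)
ols.star <- lm(I(sqrt(w) * y) ~ 0 + I(sqrt(w)) + I(sqrt(w) * x))
cbind(coef(wls.fit), coef(ols.star))           # the two columns coincide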
Analogous to OLS, we can derive finite-sample exact inference based on the generalized
Normal linear model:
yi = xti β + εi , εi ∼ N(0, σ 2 /wi ),
or, equivalently,
y∗i = xt∗i β + ε∗i , ε∗i ∼ N(0, σ 2 ).
The lm function with weights reports the standard error, t-statistic, and p-value based on
this model. This assumes that the weights fully capture the heteroskedasticity, which is
unrealistic in many problems.
In addition, we can derive asymptotic inference based on the following heteroskedastic
model
yi = xti β + εi
where the εi ’s are independent with mean zero and variances σi2 (i = 1, . . . , n). It is possible
that wi ̸= 1/σi2 , i.e., the variances used to construct the WLS estimator can be misspec-
ified. Even though there is no guarantee that β̂w is BLUE, it is still unbiased. From the
decomposition
    β̂w = ( ∑_{i=1}^n wi xi xti )−1 ∑_{i=1}^n wi xi yi
       = ( ∑_{i=1}^n wi xi xti )−1 ∑_{i=1}^n wi xi (xti β + εi )
       = β + ( n−1 ∑_{i=1}^n wi xi xti )−1 ( n−1 ∑_{i=1}^n wi xi εi ),

we can apply the law of large numbers to show that β̂w is consistent for β and apply the
CLT to show that
    β̂w ∼a N(β, Vw ),

where

    Vw = n−1 ( n−1 ∑_{i=1}^n wi xi xti )−1 ( n−1 ∑_{i=1}^n wi2 σi2 xi xti ) ( n−1 ∑_{i=1}^n wi xi xti )−1 .

The EHW robust covariance generalizes to


    V̂ehw,w = n−1 ( n−1 ∑_{i=1}^n wi xi xti )−1 ( n−1 ∑_{i=1}^n wi2 ε̂2w,i xi xti ) ( n−1 ∑_{i=1}^n wi xi xti )−1 ,

where ε̂w,i = yi − xti β̂w is the residual from the WLS. Note that in the sandwich covariance,
wi appears in the “bread” but wi2 appears in the “meat.” This formula appeared in Magee
(1998) and Romano and Wolf (2017). The function hccm in the R package car can compute
various EHW covariance estimators based on WLS. To save space in the examples below,
I report only the standard errors based on the generalized Normal linear model and leave
the calculations of the EHW covariances as a homework problem.

19.3 WLS motivated by heteroskedasticity


19.3.1 Feasible generalized least squares

Assume that ε has mean zero and covariance diag{σ12 , . . . , σn2 }. If the σi2 ’s are known, we
can simply apply the WLS above; if they are unknown, we need to estimate them first. This
gives the following feasible generalized least squares estimator (FGLS):
1. Run OLS of yi on xi to obtain the residuals ε̂i . Then obtain the squared residuals ε̂2i .
2. Run OLS of log(ε̂2i ) on xi to obtain the fitted values and exponentiate them to obtain
(σ̂i2 )ni=1 ;
3. Run WLS of yi on xi with weights σ̂i−2 to obtain

    β̂fgls = ( ∑_{i=1}^n σ̂i−2 xi xti )−1 ∑_{i=1}^n σ̂i−2 xi yi .

In Step 2, we can change the model based on our understanding of heteroskedasticity. Here
I use the Boston housing data to compare the OLS and FGLS, with R code in code18.3.1.R.
> library(mlbench)
> data(BostonHousing)
> ols.fit = lm(medv ~ ., data = BostonHousing)
> dat.res = BostonHousing
> dat.res$medv = log((ols.fit$residuals)^2)
> t.res.ols = lm(medv ~ ., data = dat.res)
> w.fgls = exp(-t.res.ols$fitted.values)
> fgls.fit = lm(medv ~ ., weights = w.fgls, data = BostonHousing)
> ols.fgls = cbind(summary(ols.fit)$coef[, 1:3],
+                  summary(fgls.fit)$coef[, 1:3])
> round(ols.fgls, 3)
            Estimate Std. Error t value Estimate Std. Error t value
(Intercept)   36.459      5.103   7.144    9.499      4.064    2.34
crim          -0.108      0.033  -3.287   -0.081      0.044   -1.82
zn             0.046      0.014   3.382    0.030      0.011    2.67
indus          0.021      0.061   0.334   -0.035      0.038   -0.92
chas1          2.687      0.862   3.118    1.462      1.119    1.31
nox          -17.767      3.820  -4.651   -7.161      2.784   -2.57
rm             3.810      0.418   9.116    5.675      0.364   15.59
age            0.001      0.013   0.052   -0.044      0.008   -5.50
dis           -1.476      0.199  -7.398   -0.927      0.139   -6.68
rad            0.306      0.066   4.613    0.170      0.051    3.31
tax           -0.012      0.004  -3.280   -0.010      0.002   -4.14
ptratio       -0.953      0.131  -7.283   -0.700      0.094   -7.45
b              0.009      0.003   3.467    0.014      0.002    6.54
lstat         -0.525      0.051 -10.347   -0.158      0.036   -4.38

Unfortunately, the coefficients, including the point estimates and standard errors, from
OLS and FGLS are quite different for several covariates. This suggests that the linear model
may be misspecified. Otherwise, both estimators are consistent for the same true coefficient,
and they should not be so different even in the presence of randomness.
FIGURE 19.1: Fulton data (scatterplot of the voting fraction t against the fraction of black registered voters x, with OLS and WLS fitted lines; point size indicates the number of people n)

The above FGLS estimator is close to Wooldridge (2012, Chapter 8). Romano and Wolf
(2017) propose to regress log(max(δ 2 , ε̂2i )) on log |xi1 |, . . . , log |xip | to estimate the individual
variances. Their modification has two features: first, they truncate the small residuals by a
pre-specified positive number δ 2 ; second, their regressors are the logs of the absolute values
of the original covariates. Romano and Wolf (2017) highlighted the efficiency gain from the
FGLS compared to OLS in the presence of heteroskedasticity. DiCiccio et al. (2019) proposed
some improved versions of the FGLS estimator even if the variance function is misspecified.
However, it is unusual for practitioners to use FGLS even though it can be more efficient
than OLS. There are several reasons. First, the EHW standard errors are convenient for
correcting the standard error of OLS under heteroskedasticity. Second, the efficiency gain
is usually small, and it is even possible that the FGLS is less efficient than OLS when the
variance function is misspecified. Third, the linear model is very likely to be misspecified,
and if so, OLS and FGLS estimate different parameters. The OLS has the interpretations as
the best linear predictor and the best linear approximation of the conditional mean, but the
FGLS has more complicated interpretations when the linear model is wrong. Based on these
reasons, we need to carefully justify the choice of FGLS over OLS in real data analyses.

19.3.2 Aggregated data and ecological regression


In some cases, (yi , xi ) come from aggregated data; for example, yi can be the average test
score and xi can be the average parents’ income of students within classroom i. If we
believe that the student-level test score and parents’ income follow a homoskedastic linear
model, then the model based on the classroom average must be heteroskedastic, with the
variance inversely proportional to the classroom size. In this case, a natural choice of weight
is wi = ni , the classroom size.
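The following simulated sketch (my own illustration, not the book's example) shows how averaging a homoskedastic student-level model produces classroom-level heteroskedasticity with variance proportional to 1/ni, which motivates the weight wi = ni:

# aggregated data: classroom averages of a homoskedastic student-level model
set.seed(3)
K  <- 300                                  # number of classrooms
ni <- sample(10:60, K, replace = TRUE)     # classroom sizes
xbar <- ybar <- numeric(K)
for (k in 1:K) {
  x <- rnorm(ni[k], mean = 1)              # student-level covariate
  y <- 2 + 0.5 * x + rnorm(ni[k])          # homoskedastic student-level model
  xbar[k] <- mean(x)
  ybar[k] <- mean(y)                       # error of the average has variance 1/ni
}
ols.fit <- lm(ybar ~ xbar)
wls.fit <- lm(ybar ~ xbar, weights = ni)   # weights equal to the classroom sizes
rbind(ols = coef(ols.fit), wls = coef(wls.fit))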
Below I use the lavoteall dataset from the R package ei. It contains the fraction of
black registered voters x, the fraction of voter turnout t, and the total number of people n
in each Louisiana precinct. Figure 19.1 is the scatterplot. In this example, OLS and WLS
give similar results although n varies a lot across precincts.
> library ( " ei " )
> data ( lavoteall )
> ols . fit = lm ( t ~ x , data = lavoteall )
> wls . fit = lm ( t ~ x , weights = n , data = lavoteall )
> compare = cbind ( summary ( ols . fit )$ coef [ , 1 : 3 ] ,
198 Linear Model and Extensions

+ summary ( wls . fit )$ coef [ , 1 : 3 ])


> round ( compare , 3 )
Estimate Std . Error t value Estimate Std . Error t value
( Intercept ) 0.711 0.002 408.211 0.706 0.002 421.662
x -0 . 0 8 3 0 . 0 0 4 -1 9 . 9 5 3 -0 . 0 8 0 0 . 0 0 4 -1 9 . 9 3 8

In the above, we can interpret the coefficient of x as the precinct-level relationship


between the fraction of black registered voters and the fraction voting. Political scientists
are interested in using aggregated data to infer individual voting behavior. Hypothetically,
the precinct i has individual data {xij , yij : j = 1, . . . , ni } where xij and yij are the binary
racial and voting status of individual (i, j) (i = 1, . . . , n; j = 1, . . . , ni ). However, we only
observe the aggregated data {x̄i· , ȳi· , ni : i = 1, . . . , n}, where
    x̄i· = ni−1 ∑_{j=1}^{ni} xij ,     ȳi· = ni−1 ∑_{j=1}^{ni} yij

are the fraction of black registered voters and the fraction voting, respectively. Can we infer
the individual voting behavior based on the aggregated data? In general, this is almost im-
possible. Under some assumptions, we can make progress. Goodman’s ecological regression
below is one possibility.
Assume that for precinct i = 1, . . . , n, we have
    yij | xij = 1 ∼ Bernoulli(pi1 ),   yij | xij = 0 ∼ Bernoulli(pi0 ),   independently for j = 1, . . . , ni .

This is the individual-level model, where the pi1 ’s and pi0 ’s measure the association between
race and voting. We further assume that they are random and independent of the xij ’s, with
means
E(pi1 ) = p1 , E(pi0 ) = p0 . (19.6)
Then we can decompose the aggregated outcome variable as
    ȳi· = ni−1 ∑_{j=1}^{ni} yij
        = ni−1 ∑_{j=1}^{ni} {xij yij + (1 − xij )yij }
        = ni−1 ∑_{j=1}^{ni} {xij p1 + (1 − xij )p0 } + εi
        = p1 x̄i· + p0 (1 − x̄i· ) + εi ,

where
    εi = ni−1 ∑_{j=1}^{ni} {xij (yij − p1 ) + (1 − xij )(yij − p0 )}.

So we have a linear relationship between the aggregated outcome and covariate

ȳi· = p1 x̄i· + p0 (1 − x̄i· ) + εi ,

where
E(εi | x̄i· ) = 0.
Goodman (1953) suggested to use the OLS of ȳi· on {x̄i· , (1 − x̄i· )} to estimate (p1 , p0 ),
and Goodman (1959) suggested to use the corresponding WLS with weight ni since the
variance of εi has the magnitude 1/ni . Moreover, the variance of εi has a rather complicated
form of heteroskedasticity, so we should use the EHW standard error for inference. This is
called Goodman’s regression or ecological regression. The R code in code18.3.2.R implements
ecological regression based on the lavoteall data.
> ols.fit = lm(t ~ 0 + x + I(1 - x), data = lavoteall)
> wls.fit = lm(t ~ 0 + x + I(1 - x), weights = n, data = lavoteall)
> compare = cbind(summary(ols.fit)$coef[, 1:3],
+                 summary(wls.fit)$coef[, 1:3])
> round(compare, 3)
         Estimate Std. Error t value Estimate Std. Error t value
x           0.628      0.003 188.292    0.626      0.003 194.493
I(1 - x)    0.711      0.002 408.211    0.706      0.002 421.662

The assumption in (19.6) is crucial, and it can be too strong when the precinct-level pi1 ’s
and pi0 ’s vary in systematic but unobserved ways. When the assumption is violated, it is
possible that the ecological regression yields the opposite result compared to the individual
regression. This is called the ecological fallacy.
Another obvious problem of ecological regression is that the estimated coefficients may
lie outside of the interval [0, 1]. Problem 19.18 gives an example.
Gelman et al. (2001) gave an alternative set of assumptions justifying the ecological
regression. King (1997) proposed some extensions. Robinson (1950) warned that the ecolog-
ical correlation might not inform individual correlation. Freedman et al. (1991) warned that
the assumptions underlying the ecological regression might not be plausible in practice.

19.4 WLS with other motivations


WLS can be used in other settings unrelated to heteroskedasticity. I review two examples
below.

19.4.1 Local linear regression


Calculus tells us that locally we can approximate any smooth function f (x) by a linear
function even though the original function can be highly nonlinear:

f (x) ≈ f (x0 ) + f ′ (x0 )(x − x0 )

when x is near x0 . The left panel of Figure 19.2 shows that in the neighborhood of x0 = 0.4,
even a sine function can be well approximated by a line. Based on data (xi , yi )ni=1 , if we
want to predict the mean value of y given x = x0 , then we can predict based on a line with
the local data points close to x0 . It is also reasonable to down weight the points that are
far from x0 , which motivates the following WLS:
    (α̂, β̂) = arg min_{a,b} ∑_{i=1}^n wi {yi − a − b(xi − x0 )}2

with wi = K {(xi − x0 )/h} where K(·) is called the kernel function and h is called the
bandwidth parameter. With the fitted line ŷ(x) = α̂ + β̂(x − x0 ), the predicted value at
x = x0 is the intercept α̂.
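Before turning to the KernSmooth implementation later in this section, here is a bare-bones sketch (my own illustration, with a Normal kernel and a hand-picked bandwidth) of the WLS formulation at a single point x0:

# local linear fit at a single point x0 via lm with kernel weights
set.seed(4)
n  <- 500
x  <- runif(n)
y  <- sin(8 * x) + rnorm(n, 0, 0.5)
x0 <- 0.4
h  <- 0.05                                 # bandwidth chosen by hand for illustration
w  <- dnorm((x - x0) / h)                  # Normal kernel weights
fit <- lm(y ~ I(x - x0), weights = w)
coef(fit)[1]                               # the intercept estimates E(y | x = x0)
sin(8 * x0)                                # the true value, for comparison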
FIGURE 19.2: Local linear regression (left panel: local linear approximation of a sine curve near x0 = 0.4; right panel: local linear fit to simulated data)

Technically, K(·) can be any density function, and two canonical choices are the standard
Normal density and the Epanechnikov kernel K(t) = 0.75(1 − t2 )1(|t| ≤ 1). The choice of
the kernel does not matter that much. The choice of the bandwidth matters much more.
With a large bandwidth, we have a poor linear approximation, leading to bias; with a small
bandwidth, we have few data points, leading to large variance. We thus face a bias-variance
trade-off, and in practice we can use cross-validation or other criteria to select h.
In general, we can approximate a smooth function by a polynomial:
    f (x) ≈ ∑_{k=0}^K f (k) (x0 ) (x − x0 )k / k!

when x is near x0 . So we can even fit a polynomial function locally, which is called local
polynomial regression (Fan and Gijbels, 1996). In the R package KernSmooth, the function
locpoly fits local polynomial regression, and the function dpill selects h based on Ruppert
et al. (1995). The default specification of locpoly is the local linear regression.
> library ( " KernSmooth " )
> n = 500
> x = seq ( 0 , 1 , length . out = n )
> fx = sin ( 8 * x )
> y = fx + rnorm (n , 0 , 0 . 5 )
> plot ( y ~ x , pch = 1 9 , cex = 0 . 2 , col = " grey " , bty = " n " ,
+ main = " local ␣ linear ␣ fit " , font . main = 1 )
> lines ( fx ~ x , lwd = 2 , col = " grey " )
> h = dpill (x , y )
> locp . fit = locpoly (x , y , bandwidth = h )
> lines ( locp . fit , lty = 2 )

19.4.2 Regression with survey data


Most discussions in this book are based on i.i.d. samples, or, at least, the sample represents
the population of interest. Sometimes, researchers oversample some units and undersample
others from the population of interest.
FIGURE 19.3: Survey sampling (sampling with probability πi )

If we have the large population with size N , then the ideal OLS estimator of the yi ’s on
the xi ’s is
    β̂ideal = ( ∑_{i=1}^N xi xti )−1 ∑_{i=1}^N xi yi .

However, we do not have all the data points in the large population, but sample each data
point independently with probability

πi = pr(Ii = 1 | xi , yi ),

where Ii is a binary indicator for being included in the sample. Conditioning on XN =
(xi )_{i=1}^N and YN = (yi )_{i=1}^N , β̂ideal is a fixed number, and an estimator is the following WLS
estimator

    β̂1/π = ( ∑_{i=1}^N (Ii /πi ) xi xti )−1 ∑_{i=1}^N (Ii /πi ) xi yi
         = ( ∑_{i=1}^n πi−1 xi xti )−1 ∑_{i=1}^n πi−1 xi yi ,

with weights inversely proportional to the sampling probability. This inverse probability
weighting estimator is reasonable because
    E( ∑_{i=1}^N (Ii /πi ) xi xti | XN , YN ) = ∑_{i=1}^N xi xti ,
    E( ∑_{i=1}^N (Ii /πi ) xi yi | XN , YN ) = ∑_{i=1}^N xi yi .

The inverse probability weighting estimators are called the Horvitz–Thompson estimators
(Horvitz and Thompson, 1952), which are the cornerstones of survey sampling.
Below I use the dataset census00.dta to illustrate the use of sampling weight, which is
the perwt variable. The R code is in code18.4.2.R.
> library(foreign)
> census00 = read.dta("census00.dta")
> ols.fit = lm(logwk ~ age + educ + exper + exper2 + black,
+              data = census00)
> wls.fit = lm(logwk ~ age + educ + exper + exper2 + black,
+              weights = perwt, data = census00)
> compare = cbind(summary(ols.fit)$coef[, 1:3],
+                 summary(wls.fit)$coef[, 1:3])
> round(compare, 4)
            Estimate Std. Error t value Estimate Std. Error t value
(Intercept)   5.1667     0.1282    40.3   5.0740     0.1268    40.0
age          -0.0148     0.0067    -2.2  -0.0084     0.0067    -1.3
educ          0.1296     0.0066    19.7   0.1228     0.0065    18.8
exper2        0.0003     0.0001     2.2   0.0002     0.0001     1.3
black        -0.2467     0.0085   -29.2  -0.2574     0.0080   -32.0

19.5 Homework problems


19.1 A linear algebra fact related to WLS
This problem extends Corollary 19.1.
Show that
(X t Σ−1 X)−1 ⪯ (X t ΩX)−1 X t ΩΣΩX(X t ΩX)−1 .
When Ω = Σ−1 , the equality holds.

19.2 Generalized least squares with a block diagonal covariance


Partition X and Y into X = (X1t , . . . , XKt )t and Y = (Y1t , . . . , YKt )t corresponding to Σ
in (19.3), such that Xk ∈ Rnk ×p and Yk ∈ Rnk . Show that the generalized least squares
estimator is

    β̂Σ = ( ∑_{k=1}^K Xkt Σk−1 Xk )−1 ( ∑_{k=1}^K Xkt Σk−1 Yk ).

19.3 Univariate WLS


Prove the following Galtonian formula for the univariate WLS:
    min_{a,b} ∑_{i=1}^n wi (yi − a − bxi )2

has the minimizer

    β̂w = ∑_{i=1}^n wi (xi − x̄w )(yi − ȳw ) / ∑_{i=1}^n wi (xi − x̄w )2 ,     α̂w = ȳw − β̂w x̄w ,

where x̄w = ∑_{i=1}^n wi xi / ∑_{i=1}^n wi and ȳw = ∑_{i=1}^n wi yi / ∑_{i=1}^n wi are the weighted means of
the covariate and the outcome.

19.4 Difference-in-means with weights


With a binary covariate xi , show that the coefficient of xi in the WLS of yi on (1, xi ) with
weights wi (i = 1, . . . , n) equals ȳw,1 − ȳw,0 , where
    ȳw,1 = ∑_{i=1}^n wi xi yi / ∑_{i=1}^n wi xi ,     ȳw,0 = ∑_{i=1}^n wi (1 − xi )yi / ∑_{i=1}^n wi (1 − xi )

are the weighted averages of the outcome under treatment and control, respectively.
Hint: You can use the result in Problem 19.3.

19.5 Asymptotic Normality of WLS and robust covariance estimator


Under the heteroskedastic linear model, show that β̂w is consistent and asymptotically
Normal, and show that nV̂ehw,w is consistent for the asymptotic covariance of √n(β̂w − β).
Specify the regularity conditions.

19.6 WLS in ANOVA


This problem extends Problems 5.5 and 6.3.
For units i = 1, . . . , n, assume yi denotes the outcome, xi denotes the p-vector with
entries as the dummy variables for a discrete covariate with p levels, wi > 0 denotes a
weight, and πi > 0 denotes another weight that is a function of xi only (for example,
πi = nj /n if xi = ej ). Run the following regressions:
• WLS of yi on xi with weight wi for i = 1, . . . , n to obtain the coefficient vector β̂ and
EHW covariance matrix V̂ ;
• WLS of yi on xi with weight wi πi for i = 1, . . . , n to obtain the coefficient vector β̂ ′ and
EHW covariance matrix V̂ ′ .
Show that β̂ = β̂ ′ with the jth entry

    β̂j = β̂j′ = ∑_{i: xi =ej} wi yi / ∑_{i: xi =ej} wi ,

and moreover, V̂ = V̂ ′ are diagonal with the (j, j)th entry

    V̂jj = V̂jj′ = ∑_{i: xi =ej} wi2 (yi − β̂j )2 / ( ∑_{i: xi =ej} wi )2 .

19.7 An infeasible generalized least squares estimator


Can we skip Step 2 in Section 19.3.1 and directly apply the following WLS estimator:
    β̂igls = ( ∑_{i=1}^n ε̂i−2 xi xti )−1 ∑_{i=1}^n ε̂i−2 xi yi

with ε̂i = yi − xti β̂. If so, give a theoretical justification; if not, give a counterexample.
Evaluate the finite-sample properties of β̂igls using simulated data.

19.8 FWL theorem in WLS


This problem is an extension of Theorem 7.1.
Consider the WLS with an n × 1 vector Y , an n × k matrix X1 , an n × l matrix X2 , and
weights wi ’s. Show that β̂w,2 in the long WLS fit

Y = X1 β̂w,1 + X2 β̂w,2 + ε̂w

equals the coefficient of X̃w,2 in the WLS fit of Ỹw on X̃w,2 , where X̃w,2 are the residual
vectors from the column-wise WLS of X2 on X1 , and Ỹw is the residual vector from the
WLS of Y on X1 .

19.9 The sample version of Cochran’s formula in WLS


This problem is an extension of Theorem 9.1.
Consider the WLS with an n × 1 vector Y , an n × k matrix X1 , an n × l matrix X2 , and
weights wi ’s. We can fit the following WLS:
Y = X1 β̂w,1 + X2 β̂w,2 + ε̂w ,
Y = X2 β̃w,2 + ε̃w ,
X1 = X2 δ̂w + Ûw ,

where ε̂w , ε̃w , Ûw are the residuals. The last WLS fit means the WLS fit of each column of
X1 on X2 . Similar to Theorem 9.1, we have

β̃w,2 = β̂w,2 + δ̂w β̂w,1 .

Prove this result.

19.10 EHW robust covariance estimator in WLS


We have shown in Section 19.1 that the coefficients from WLS are identical to those from
OLS with transformed variables. Further show that the corresponding HC0 version of EHW
covariance estimators are also identical.

19.11 Invariance of covariance estimators in WLS


Problem 6.4 states the invariance of covariance estimators in OLS. Show that the same
result holds for covariance estimators in WLS.

19.12 Ridge with weights


Define the ridge regression with weights wi ’s, and derive the formula for the ridge
coefficient.

19.13 Coordinate descent algorithm in lasso with weights


Define the lasso with weights wi ’s, and give the coordinate descent algorithm for solving
the weighted lasso problem.

19.14 General leave-one-out formula via WLS


With data (X, Y ), we can define β̂[−i] (w) as the WLS estimator of Y on X with weights
wi′ = 1(i′ ̸= i) + w1(i′ = i) for i′ = 1, . . . , n, where 0 ≤ w ≤ 1. It reduces to the OLS
estimator β̂ when w = 1 and the leave-one-out OLS estimator β̂[−i] when w = 0.
Show the general formula
    β̂[−i] (w) = β̂ − [ (1 − w) / {1 − (1 − w)hii } ] (X t X)−1 xi ε̂i ,

recalling that hii is the leverage score and ε̂i is the residual of observation i.
Remark: Based on the above formula, we can compute the derivative of β̂[−i] (w) with
respect to w:
    ∂ β̂[−i] (w)/∂w = {1 − (1 − w)hii }−2 (X t X)−1 xi ε̂i ,

which reduces to

    ∂ β̂[−i] (0)/∂w = (1 − hii )−2 (X t X)−1 xi ε̂i

at w = 0 and

    ∂ β̂[−i] (1)/∂w = (X t X)−1 xi ε̂i
at w = 1. Pregibon (1981) reviewed related formulas for OLS. Broderick et al. (2020)
discussed related formulas for general statistical models.

19.15 Hat matrix and leverage score in WLS


Based on the WLS estimator β̂w = (X t W X)−1 X t W Y with W = diag(w1 , . . . , wn ), we
have the predicted vector

Ŷw = X β̂w = X(X t W X)−1 X t W Y.

This motivates the definition of the hat matrix

Hw = X(X t W X)−1 X t W

such that Ŷw = Hw Y .


First, show the following basic properties of the hat matrix:

W Hw = Hwt W, X t W (In − Hw ) = 0.

Second, prove an extended version of Theorem 11.1: with xi = (1, xti2 )t , the (i, i)th
diagonal element of Hw satisfies
    hw,ii = [ wi / ∑_{i′=1}^n wi′ ] (1 + D2w,i ),

where

    D2w,i = (xi2 − x̄w,2 )t Sw−1 (xi2 − x̄w,2 ),

with x̄w,2 = ∑_{i=1}^n wi xi2 / ∑_{i=1}^n wi being the weighted average of the xi2 ’s and
Sw = ∑_{i=1}^n wi (xi2 − x̄w,2 )(xi2 − x̄w,2 )t / ∑_{i=1}^n wi being the corresponding sample covariance matrix.
Remark: Li and Valliant (2009) presented the basic properties of Hw for WLS in the
context of survey data.

19.16 Leave-one-out formula for WLS


Use the notation in Problem 19.15. Let β̂w be the WLS estimator of Y on X with weights
wi ’s. Let β̂w[−i] be the WLS estimator without using the ith observation. Show that
    β̂w[−i] = β̂w − [ wi / (1 − hw,ii ) ] (X t W X)−1 xi ε̂w,i .

19.17 EHW standard errors in WLS


Report the EHW standard errors in the examples in Sections 19.3.1, 19.3.2, and 19.4.2.

19.18 Another example of ecological inference


The fultongen dataset in the ri package contains aggregated data from 289 precincts in
Fulton County, Georgia. The variable t represents the fraction voting in 1994 and x the
fraction in 1992. The variable n represents the total number of people. Run ecological re-
gression similar to Section 19.3.2.
Part VII

Generalized Linear Models


20
Logistic Regression for Binary Outcomes

Many applications have binary outcomes yi ∈ {0, 1}. This chapter discusses statistical
models of binary outcomes, focusing on the logistic regression.

20.1 Regression with binary outcomes


20.1.1 Linear probability model
For simplicity, we can still use the linear model for a binary outcome. It is also called the
linear probability model:
yi = xti β + εi , E(εi | xi ) = 0
because the conditional probability of yi given xi is a linear function of xi :
pr(yi = 1 | xi ) = E(yi | xi ) = xti β.
An advantage of this linear model is that the interpretation of the coefficient remains the
same as linear models for general outcomes:
    ∂pr(yi = 1 | xi )/∂xij = βj ,
that is, βj measures the partial impact of xij on the probability of yi .
A minor technical issue is that the linear probability model implies heteroskedasticity
because
var(yi | xi ) = xti β(1 − xti β).
Therefore, we must use the EHW covariance based on OLS. We can also use FGLS to
improve efficiency over OLS.
A more severe problem with the linear probability model is its plausibility in general.
We may not believe that a linear model is the correct model for a binary outcome because
the probability pr(yi = 1 | xi ) on the left-hand side is bounded between zero and one, but
the linear combination xti β on the right-hand side can be unbounded for general covariates
and coefficient. Nevertheless, the OLS decomposition yi = xti β + εi works for any yi ∈ R,
so it is applicable for binary yi . Sometimes, practitioners feel that the linear model is not
natural for binary outcomes because the predicted value can be outside the range of [0, 1].
Therefore, it is more reasonable to build a model that automatically accommodates the
binary feature of the outcome.
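As a small illustration (my own simulated sketch, not the book's code), the linear probability model can be fit by lm and paired with the EHW covariance, for example via the hccm function in the car package mentioned in Chapter 19:

# linear probability model with EHW standard errors (simulated data)
library(car)
set.seed(5)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))              # a binary outcome
lpm <- lm(y ~ x)                               # linear probability model
ehw.se <- sqrt(diag(hccm(lpm, type = "hc0")))  # EHW (HC0) standard errors
cbind(estimate = coef(lpm), ehw.se = ehw.se)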

20.1.2 General link functions


A linear combination of general covariates may be outside the range of [0, 1], but we can
find a monotone transformation to force it to lie within the interval [0, 1]. This motivates


us to consider the following model:

pr(yi = 1 | xi ) = g(xti β),

where g(·) : R → [0, 1] is a monotone function, and its inverse is often called the link
function. Mathematically, the distribution function of any continuous random variable is a
monotone function that maps from R to [0, 1]. So we have infinitely many choices for g(·).
Four canonical choices, “logit”, “probit”, “cauchit”, and “cloglog”, are listed below; they are
the standard options in R:

    name        functional form
    logit       g(z) = ez / (1 + ez )
    probit      g(z) = Φ(z)
    cauchit     g(z) = arctan(z)/π + 1/2
    cloglog     g(z) = 1 − exp(−ez )
The above g(z)’s correspond to different distribution functions. The g(z) for the logit
model1 is the distribution function of the standard logistic distribution with density
    g ′ (z) = ez / (1 + ez )2 = g(z){1 − g(z)}.                  (20.1)

The g(z) for the probit model2 is the distribution function of a standard Normal distribu-
tion. The g(z) for the cauchit model is the distribution function of the standard Cauchy
distribution with density

    g ′ (z) = 1 / {π(1 + z 2 )}.

The g(z) for the cloglog model is the distribution function of the standard log-Weibull
distribution with density

    g ′ (z) = exp(z − ez ).
I will give more motivations for the first three link functions in Section 20.7.1 and for the
fourth link function in Problem 22.4.
Figure 20.1 shows the distributions and densities of the corresponding link functions. The
distribution functions are quite similar for all links, but the density for cloglog is asymmetric
while the other three densities are symmetric.
This chapter will focus on the logit model, and extensions to other models are concep-
tually straightforward. We can also write the logit model as
    pr(yi = 1 | xi ) ≡ π(xi , β) = exp(xti β) / {1 + exp(xti β)},              (20.2)

for the conditional probability of yi given xi , or, equivalently,

    logit{pr(yi = 1 | xi )} ≡ log [ pr(yi = 1 | xi ) / {1 − pr(yi = 1 | xi )} ] = xti β,

for the log of the odds of yi given xi , with the logit function

    logit(π) = log{π/(1 − π)}.
1 Berkson (1944) was an early use of the logit model.
2 Bliss (1934) was an early use of the probit model.
FIGURE 20.1: Distributions and densities corresponding to the link functions (probit, logit, cloglog, cauchit)

Because yi is a binary random variable, its probability completely determines its distribu-
tion. So we can also write the logit model as
    yi | xi ∼ Bernoulli( exp(xti β) / {1 + exp(xti β)} ).

Each coefficient βj measures the impact of xij on the log odds of the outcome:

    ∂ logit{pr(yi = 1 | xi )} / ∂xij = βj .

Epidemiologists also call βj the conditional log odds ratio because

    βj = logit{pr(yi = 1 | . . . , xij + 1, . . .)} − logit{pr(yi = 1 | . . . , xij , . . .)}
       = log [ pr(yi = 1 | . . . , xij + 1, . . .) / {1 − pr(yi = 1 | . . . , xij + 1, . . .)} ]
         − log [ pr(yi = 1 | . . . , xij , . . .) / {1 − pr(yi = 1 | . . . , xij , . . .)} ]
       = log [ ( pr(yi = 1 | . . . , xij + 1, . . .) / {1 − pr(yi = 1 | . . . , xij + 1, . . .)} )
               / ( pr(yi = 1 | . . . , xij , . . .) / {1 − pr(yi = 1 | . . . , xij , . . .)} ) ],
that is, the change of the log odds of yi if we increase xij by a unit holding other covariates
unchanged. Qualitatively, if βj > 0, then larger values of xij lead to larger probability of
yi = 1; if βj < 0, then larger values of xij lead to smaller probability of yi = 1.

20.2 Maximum likelihood estimator of the logistic model


Because we have specified a fully parametric model for yi given xi , we can estimate β using
the maximum likelihood. With independent observations, the likelihood function for general
binary outcomes is3

    L(β) = ∏_{i=1}^n f (yi | xi )
         = ∏_{i=1}^n {π(xi , β) if yi = 1, or 1 − π(xi , β) if yi = 0}
         = ∏_{i=1}^n {π(xi , β)}^{yi} {1 − π(xi , β)}^{1−yi} .

Under the logit form (20.2), the likelihood function simplifies to


    L(β) = ∏_{i=1}^n [ π(xi , β) / {1 − π(xi , β)} ]^{yi} {1 − π(xi , β)}
         = ∏_{i=1}^n {exp(xti β)}^{yi} / {1 + exp(xti β)}
         = ∏_{i=1}^n exp(yi xti β) / {1 + exp(xti β)}.
The log-likelihood function is
    log L(β) = ∑_{i=1}^n { yi xti β − log(1 + exp(xti β)) },

the score function is


    ∂ log L(β)/∂β = ∑_{i=1}^n ( xi yi − xi exp(xti β) / {1 + exp(xti β)} )
                  = ∑_{i=1}^n xi ( yi − exp(xti β) / {1 + exp(xti β)} )
                  = ∑_{i=1}^n xi {yi − g(xti β)}
                  = ∑_{i=1}^n xi {yi − π(xi , β)},

and the Hessian matrix


    ∂ 2 log L(β)/∂β∂β t = ( ∂ 2 log L(β)/∂βj ∂βj ′ )_{1≤j,j ′ ≤p}
                        = − ∑_{i=1}^n xi ∂g(xti β)/∂β t
                        = − ∑_{i=1}^n xi xti g(xti β){1 − g(xti β)}          (by (20.1))
                        = − ∑_{i=1}^n π(xi , β){1 − π(xi , β)} xi xti .

3 The notation can be confusing because β denotes both the true parameter and the dummy variable for

the likelihood function.



For any α ∈ Rp , we have

    αt { ∂ 2 log L(β)/∂β∂β t } α = − ∑_{i=1}^n π(xi , β){1 − π(xi , β)} (αt xi )2 ≤ 0,

so the Hessian matrix is negative semi-definite. If it is negative definite, then the likelihood
function has a unique maximizer.
The maximum likelihood estimate (MLE) must satisfy the following score or Normal
equation:

    ∑_{i=1}^n xi { yi − π(xi , β̂) } = ∑_{i=1}^n xi ( yi − exp(xti β̂) / {1 + exp(xti β̂)} ) = 0.

If we view π(xi , β̂) as the fitted probability for yi , then yi − π(xi , β̂) is the residual, and the
score equation is similar to that of OLS. Moreover, if xi contains 1, then

    ∑_{i=1}^n { yi − π(xi , β̂) } = 0 =⇒ n−1 ∑_{i=1}^n yi = n−1 ∑_{i=1}^n π(xi , β̂),

that is the average of the outcomes equals the average of their fitted values.
However, the score equation is nonlinear, and in general, there is no explicit formula for
the MLE. We usually use Newton’s method to solve for the MLE based on the linearization of
the score equation. Starting from the old value β old , we can approximate the score equation
by a linear equation:
    0 = ∂ log L(β)/∂β ≈ ∂ log L(β old )/∂β + { ∂ 2 log L(β old )/∂β∂β t } (β − β old ),

and then update

    β new = β old − { ∂ 2 log L(β old )/∂β∂β t }−1 ∂ log L(β old )/∂β.
Using the matrix form, we can gain more insight from Newton’s method. Recall that
Y = (y1 , . . . , yn )t and X is the n × p matrix with rows xt1 , . . . , xtn , and define

    Πold = ( π(x1 , β old ), . . . , π(xn , β old ) )t ,
    W old = diag[ π(xi , β old ){1 − π(xi , β old )} ]_{i=1}^n .

Then

    ∂ log L(β old )/∂β = X t (Y − Πold ),
    ∂ 2 log L(β old )/∂β∂β t = −X t W old X,

and Newton’s method simplifies to

    β new = β old + (X t W old X)−1 X t (Y − Πold )
          = (X t W old X)−1 { X t W old Xβ old + X t (Y − Πold ) }
          = (X t W old X)−1 X t W old Z old ,



where
Z old = Xβ old + (W old )−1 (Y − Πold ).
So we can obtain β new based on the WLS fit of Z old on X with weights W old , the diagonal
elements of which are the conditional variances of the yi ’s given the xi ’s at β old . The glm
function in R uses the Fisher scoring algorithm, which is identical to Newton’s method for the
logit model4 . Sometimes, it is also called the iteratively reweighted least squares algorithm.
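To make the updating scheme concrete, here is a minimal implementation sketch of Newton's method, or iteratively reweighted least squares, for the logit model (my own code, with an arbitrary starting value and tolerance; it is not the glm source):

# Newton's method (IRLS) for the logit model
logit.newton <- function(X, Y, tol = 1e-8, maxit = 100) {
  beta <- rep(0, ncol(X))                          # starting value
  for (it in 1:maxit) {
    pi.hat <- as.vector(plogis(X %*% beta))        # pi(x_i, beta^old)
    W <- pi.hat * (1 - pi.hat)                     # diagonal of W^old
    Z <- as.vector(X %*% beta) + (Y - pi.hat) / W  # working response Z^old
    beta.new <- coef(lm(Z ~ 0 + X, weights = W))   # WLS of Z^old on X with weights W^old
    if (max(abs(beta.new - beta)) < tol) break
    beta <- beta.new
  }
  beta.new
}

# check against glm on simulated data
set.seed(6)
n <- 500
X <- cbind(1, rnorm(n))
Y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))
cbind(newton = logit.newton(X, Y),
      glm    = coef(glm(Y ~ 0 + X, family = binomial(link = logit))))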

20.3 Statistics with the logit model


20.3.1 Inference
Based on the general theory of MLE, β̂ is consistent for β and is asymptotically Normal.
Approximately, we can conduct statistical inference based on

    β̂ ∼a N( β, {−∂ 2 log L(β̂)/∂β∂β t }−1 ) = N( β, (X t Ŵ X)−1 ),

where

    Ŵ = diag[ π(xi , β̂){1 − π(xi , β̂)} ]_{i=1}^n .
Based on this, the glm function reports the point estimate, standard error, z-value, and p-
value for each coordinate of β. It is almost identical to the output of the lm function, except
that the interpretation of the coefficient becomes the conditional log odds ratio.
I use the data from Hirano et al. (2000) to illustrate logistic regression, where the main
interest is the effect of the encouragement of receiving the flu shot via email on the binary
indicator of flu-related hospitalization. We can fit a logistic regression using the glm function
in R with family = binomial(link = logit).
> flu = read.table("fludata.txt", header = TRUE)
> flu = within(flu, rm(receive))
> assign.logit = glm(outcome ~ .,
+                    family = binomial(link = logit),
+                    data = flu)
> summary(assign.logit)

Call:
glm(formula = outcome ~ ., family = binomial(link = logit), data = flu)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.1957 -0.4566 -0.3821 -0.3048  2.6450

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.199815   0.408684  -5.383 7.34e-08 ***
assign      -0.197528   0.136235  -1.450  0.14709
age         -0.007986   0.005569  -1.434  0.15154
copd         0.337037   0.153939   2.189  0.02857 *
dm           0.454342   0.143593   3.164  0.00156 **
heartd       0.676190   0.153384   4.408 1.04e-05 ***
race        -0.242949   0.143013  -1.699  0.08936 .
renal        1.519505   0.365973   4.152 3.30e-05 ***
sex         -0.212095   0.144477  -1.468  0.14210
liverd       0.098957   1.084644   0.091  0.92731

( Dispersion parameter for binomial family taken to be 1 )

    Null deviance: 1667.9  on 2860  degrees of freedom
Residual deviance: 1598.4  on 2851  degrees of freedom
AIC: 1618.4

Number of Fisher Scoring iterations: 5

4 The Fisher scoring algorithm uses a slightly different approximation:

    0 = ∂ log L(β)/∂β ≈ ∂ log L(β old )/∂β + E{ ∂ 2 log L(β old )/∂β∂β t | X }(β − β old ),

with the expected Fisher information instead of the observed Fisher information. For other link functions,
the Fisher scoring algorithm is different from Newton’s method.

Three subtle issues arise in the above code. First, flu = within(flu, rm(receive)) drops
receive, which is the indicator of whether a patient received the flu shot or not. The reason
is that assign is randomly assigned but receive is subject to selection bias, that is, patients
receiving the flu shot can be quite different from patients not receiving the flu shot.
Second, the Null deviance and Residual deviance are defined as −2 log L(β̃) and
−2 log L(β̂), respectively, where β̃ is the MLE assuming that all coefficients except the
intercept are zero, and β̂ is the MLE without any restrictions. They are not of independent
interest, but their difference is: Wilks’ theorem ensures that

    {−2 log L(β̃)} − {−2 log L(β̂)} = 2 log{ L(β̂)/L(β̃) } ∼a χ2p−1 .

So we can test whether the coefficients of the covariates are all zero, which is analogous to
the joint F test in linear models.
> pchisq(assign.logit$null.deviance - assign.logit$deviance,
+        df = assign.logit$df.null - assign.logit$df.residual,
+        lower.tail = FALSE)
[1] 1.912952e-11

Third, the AIC is defined as −2 log L(β̂) + 2p, where p is the number of parameters in
the logit model. This is also the general formula of AIC for other parametric models.

20.3.2 Prediction
The logit model is often used for prediction or classification since the outcome is binary.
With the MLE β̂, we can predict the probability of being one as π̂n+1 = g(xtn+1 β̂) for a unit
with covariate value xn+1 , and we can easily dichotomize the fitted probability to predict
the outcome itself by ŷn+1 = 1(π̂n+1 ≥ c), for example, with c = 0.5.
We can even quantify the uncertainty in the fitted probability based on a linear approx-
imation (i.e., the delta method). Based on

    π̂n+1 = g(xtn+1 β̂)
         ≈ g(xtn+1 β) + g ′ (xtn+1 β) xtn+1 (β̂ − β)
         = g(xtn+1 β) + g(xtn+1 β){1 − g(xtn+1 β)} xtn+1 (β̂ − β),

we can approximate the asymptotic variance of π̂n+1 by

[g(xtn+1 β){1 − g(xtn+1 β)}]2 xtn+1 (X t Ŵ X)−1 xn+1 .



We can use the predict function in R to calculate the predicted values based on a glm
object in the same way as the linear model. If we specify type="response", then we obtain the
fitted probabilities; if we specify se.fit = TRUE, then we also obtain the standard errors of the
fitted probabilities. In the following, I predict the probabilities of flu-related hospitalization
if a patient receives the email encouragement or not, fixing other covariates at their empirical
means.
> emp.mean = apply(flu, 2, mean)
> data.ave = rbind(emp.mean, emp.mean)
> data.ave[1, 1] = 1
> data.ave[2, 1] = 0
> data.ave = data.frame(data.ave)
> predict(assign.logit, newdata = data.ave,
+         type = "response", se.fit = TRUE)
$fit
   emp.mean emp.mean.1
 0.06981828 0.08378818

$se.fit
   emp.mean  emp.mean.1
0.006689665 0.007526307
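As a sanity check (my own sketch, reusing the assign.logit fit and data.ave from the code above), the reported se.fit values can be reproduced from the delta-method formula:

# manual delta-method standard errors for the two rows of data.ave
p.hat  <- predict(assign.logit, newdata = data.ave, type = "response")
X.new  <- model.matrix(delete.response(terms(assign.logit)), data = data.ave)
V.beta <- vcov(assign.logit)                      # estimate of (X^t W X)^{-1}
se.manual <- p.hat * (1 - p.hat) * sqrt(diag(X.new %*% V.beta %*% t(X.new)))
se.manual                                         # should match the se.fit output above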

20.4 More on interpretations of the coefficients


Many practitioners find the coefficients in the logit model difficult to interpret. Another
measure of the impact of the covariate on the outcome is the average marginal effect or
average partial effect. For a continuous covariate xij , the average marginal effect is defined
as

    amej = n−1 ∑_{i=1}^n ∂pr(yi = 1 | xi )/∂xij = n−1 ∑_{i=1}^n g ′ (xti β)βj ,

which reduces to the following form for the logit model:

    amej = βj × n−1 ∑_{i=1}^n π(xi , β){1 − π(xi , β)}.

For a binary covariate xij , the average marginal effect is defined as

    amej = n−1 ∑_{i=1}^n { pr(yi = 1 | . . . , xij = 1, . . .) − pr(yi = 1 | . . . , xij = 0, . . .) }.

The margins function in the margins package can compute the average marginal effects and
the corresponding standard errors. In particular, the average marginal effect of assign is not
significant as shown below. The R code in this section is in code19.3.R.
> library ( " margins " )
> ape = margins ( assign . logit )
> summary ( ape )
factor AME SE z p lower upper
Logistic Regression for Binary Outcomes 217

age -0 . 0 0 0 6 0.0004 -1 . 4 3 2 2 0.1521 -0 . 0 0 1 4 0.0002


assign -0 . 0 1 5 0 0.0103 -1 . 4 4 8 0 0.1476 -0 . 0 3 5 2 0.0053
copd 0.0255 0.0117 2.1830 0.0290 0.0026 0.0485
dm 0.0344 0.0109 3.1465 0.0017 0.0130 0.0559
heartd 0.0512 0.0118 4.3441 0.0000 0.0281 0.0743
liverd 0.0075 0.0822 0.0912 0.9273 -0 . 1 5 3 6 0.1686
race -0 . 0 1 8 4 0.0109 -1 . 6 9 5 8 0.0899 -0 . 0 3 9 7 0.0029
renal 0.1151 0.0278 4.1461 0.0000 0.0607 0.1696
sex -0 . 0 1 6 1 0.0110 -1 . 4 6 6 0 0.1426 -0 . 0 3 7 6 0.0054
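As a check on the formula above (my own sketch, reusing assign.logit), the average marginal effect of a continuous covariate such as age can also be computed directly:

# direct computation of the average marginal effect of age
pi.hat  <- fitted(assign.logit)                       # fitted probabilities pi(x_i, beta.hat)
ame.age <- coef(assign.logit)["age"] * mean(pi.hat * (1 - pi.hat))
ame.age                                               # should be close to the AME of age above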

The interaction term is much more complicated. Contradictory suggestions are given
across fields. Consider the following model

pr(yi = 1 | xi1 , xi2 ) = g(β0 + β1 xi1 + β2 xi2 + β12 xi1 xi2 ).

If the link is logit, then epidemiologists interpret eβ12 as the interaction between xi1 and
xi2 on the odds ratio scale. Consider a simple case with binary xi1 and xi2 . Given xi2 = 1,
the odds ratio of xi1 on yi equals eβ1 +β12 ; given xi2 = 0, the odds ratio of xi1 on yi equals
eβ1 . Therefore, the ratio of the two odds ratio equals eβ12 . When we measure effects on the
odds ratio scale, the logistic model is a natural choice. The interaction term in the logistic
model indeed measures the interaction of xi1 and xi2 .
Ai and Norton (2003) gave a different suggestion. Define zi = β0 + β1 xi1 + β2 xi2 +
β12 xi1 xi2 . We have two ways to define the interaction effect: first,
    n−1 ∑_{i=1}^n ∂pr(yi = 1 | xi1 , xi2 )/∂(xi1 xi2 ) = n−1 ∑_{i=1}^n g ′ (zi )β12 .

second,
    n−1 ∑_{i=1}^n ∂ 2 pr(yi = 1 | xi1 , xi2 )/(∂xi1 ∂xi2 )
        = n−1 ∑_{i=1}^n ∂/∂xi2 { ∂pr(yi = 1 | xi1 , xi2 )/∂xi1 }
        = n−1 ∑_{i=1}^n ∂/∂xi2 { g ′ (zi )(β1 + β12 xi2 ) }
        = n−1 ∑_{i=1}^n { g ′′ (zi )(β2 + β12 xi1 )(β1 + β12 xi2 ) + g ′ (zi )β12 };

Although the first one is more straightforward based on the definition of the average partial
effect, the second one is more reasonable based on the natural definition of interaction based
on the mixed derivative. Note that even if β12 = 0, the second definition of interaction does
not necessarily equal 0 since
    n−1 ∑_{i=1}^n ∂ 2 pr(yi = 1 | xi1 , xi2 )/(∂xi1 ∂xi2 ) = n−1 ∑_{i=1}^n g ′′ (zi )β1 β2 .

This is due to the nonlinearity of the link function. The second definition quantifies inter-
action based on the probability itself while the parameters in the logistic model measure
the odds ratio. This combination of model and parameter does not seem a natural choice.

20.5 Does the link function matter?


First, I generate data from a simple one-dimensional logistic model.
> n = 100
> x = rnorm (n , 0 , 3 )
> prob = 1 /( 1 + exp ( - 1 + x ))
> y = rbinom (n , 1 , prob )

Then I fit the data with the linear probability model and binary models with four link
functions.
> lpmfit = lm(y ~ x)
> probitfit = glm(y ~ x, family = binomial(link = "probit"))
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> logitfit = glm(y ~ x, family = binomial(link = "logit"))
> cloglogfit = glm(y ~ x, family = binomial(link = "cloglog"))
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> cauchitfit = glm(y ~ x, family = binomial(link = "cauchit"))

The coefficients are quite different because they measure the association between x and y
on different scales. These parameters are not directly comparable.
> betacoef = c(lpmfit$coef[2],
+              probitfit$coef[2],
+              logitfit$coef[2],
+              cloglogfit$coef[2],
+              cauchitfit$coef[2])
> names(betacoef) = c("lpm", "probit", "logit", "cloglog", "cauchit")
> round(betacoef, 2)
    lpm  probit   logit cloglog cauchit
  -0.10   -0.83   -1.47   -1.07   -2.09

However, if we care only about the prediction, then these five models give very similar
results.
> table(y, lpmfit$fitted.values > 0.5)

y   FALSE TRUE
  0    31    9
  1     5   55
> table(y, probitfit$fitted.values > 0.5)

y   FALSE TRUE
  0    31    9
  1     5   55
> table(y, logitfit$fitted.values > 0.5)

y   FALSE TRUE
  0    31    9
  1     5   55
> table(y, cloglogfit$fitted.values > 0.5)

y   FALSE TRUE
  0    34    6
  1     7   53
> table(y, cauchitfit$fitted.values > 0.5)

y   FALSE TRUE
  0    34    6
  1     7   53
FIGURE 20.2: Comparing the fitted probabilities from different link functions (the true link is logit; the panels for lpm, probit, logit, cloglog, and cauchit plot the fitted probability against the true probability)

Figure 20.2 shows the fitted probabilities versus the true probabilities pr(yi = 1 | xi ). The
patterns are quite similar although the linear probability model can give fitted probabilities
outside [0, 1]. When we use the cutoff point 0.5 to predict the binary outcome, the problem
of the linear probability model becomes rather minor.
An interesting fact is that the coefficients from the logit model approximately equal
those from the probit model multiplied by 1.7, a constant that minimizes maxy |glogit (by) −
gprobit (y)|. We can easily compute this constant numerically:
> d.logit.probit = function(b){
+   x = seq(-20, 20, 0.00001)
+   max(abs(plogis(b * x) - pnorm(x)))
+ }
>
> optimize(d.logit.probit, c(-10, 10))
$minimum
[1] 1.701743

$objective
[1] 0.009457425

Based on the above calculation, the maximum difference is approximately 0.009. Therefore,
the logit and probit link functions are extremely close up to the scaling factor 1.7. However,
minb maxy |glogit (by) − g∗ (y)| is much larger for the link functions of cauchit and cloglog.

20.6 Extensions of the logistic regression


20.6.1 Penalized logistic regression
Similar to the high dimensional linear model, we can also extend the logit model to a penal-
ized version. Since the objective function for the original logit model is the log-likelihood,

we can minimize the following penalized log-likelihood function:


    arg min_{β0 ,β1 ,...,βp} − n−1 ∑_{i=1}^n ℓi (β) + λ ∑_{j=1}^p { αβj2 + (1 − α)|βj | },

where

    ℓi (β) = yi (β0 + β1 xi1 + · · · + βp xip ) − log(1 + exp(β0 + β1 xi1 + · · · + βp xip ))
is the log-likelihood function. When α = 1, it gives the ridge analog of the logistic regression;
when α = 0, it gives the lasso analog; when α ∈ (0, 1), it gives the elastic net analog. The R
package glmpath uses the coordinate descent algorithm based on a quadratic approximation of
the log-likelihood function. We can select the tuning parameter λ based on cross-validation.
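For a concrete illustration (my own sketch, not the book's code), the glmnet package fits the same kind of penalized likelihood by coordinate descent; note that glmnet's own parameterization uses alpha = 1 for the lasso penalty and alpha = 0 for ridge.

# lasso-penalized logistic regression with cross-validated lambda (using glmnet)
library(glmnet)
set.seed(9)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))           # sparse true coefficient vector
y <- rbinom(n, 1, plogis(X %*% beta))
cv.fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv.fit, s = "lambda.min")[1:10, ]        # first few penalized coefficients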

20.6.2 Case-control study


A nice property of the logit model is that it works not only for the cohort study with data
from conditional distribution yi | xi but also for the case-control study with data from
the conditional distribution xi | yi . The former is a prospective study while the latter is a
retrospective study. Below, I will explain the basic idea in Prentice and Pyke (1979).
Assume that (xi , yi , si ) are IID with

    pr(yi = 1 | xi ) = exp(β0 + xti β) / {1 + exp(β0 + xti β)}                 (20.3)

and

    pr(si = 1 | xi , yi ) = pr(si = 1 | yi ) = p1 if yi = 1, and p0 if yi = 0.   (20.4)
But we only have data with si = 1, and p1 and p0 are often unknown. Fortunately, conditioning
on si = 1, we have the following result.
Theorem 20.1 Under (20.3) and (20.4), we have

    pr(yi = 1 | xi , si = 1) = exp(δ + β0 + xti β) / {1 + exp(δ + β0 + xti β)},

where δ = log(p1 /p0 ).
Proof of Theorem 20.1: We have

    pr(yi = 1 | xi , si = 1)
        = pr(yi = 1 | xi )pr(si = 1 | xi , yi = 1)
          / { pr(yi = 1 | xi )pr(si = 1 | xi , yi = 1) + pr(yi = 0 | xi )pr(si = 1 | xi , yi = 0) }

by Bayes’ formula. Under the logit model, we have

    pr(yi = 1 | xi , si = 1)
        = [ exp(β0 + xti β) / {1 + exp(β0 + xti β)} ] p1
          / ( [ exp(β0 + xti β) / {1 + exp(β0 + xti β)} ] p1 + [ 1 / {1 + exp(β0 + xti β)} ] p0 )
        = exp(β0 + xti β) p1 / { exp(β0 + xti β) p1 + p0 }
        = exp(β0 + xti β) (p1 /p0 ) / { exp(β0 + xti β) (p1 /p0 ) + 1 }
        = exp(δ + β0 + xti β) / {1 + exp(δ + β0 + xti β)}.                      □


Theorem 20.1 ensures that conditioning on si = 1, the model of yi given xi is still logit
with the intercept changing from β0 to β0 + log(p1 /p0 ). Although we cannot consistently
estimate the intercept without knowing (p1 , p0 ), we can still estimate all the slopes. Kagan
(2001) showed that the logistic link is the only one that enjoys this property.
Samarani et al. (2019) hypothesized that variation in the inherited activating Killer-cell
Immunoglobulin-like Receptor genes in humans is associated with their innate susceptibili-
ty/resistance to developing Crohn’s disease. They used a case-control study from three cities
(Manitoba, Montreal, and Ottawa) in Canada to investigate the potential association.
> dat = read.csv("samarani.csv")
> pool.glm = glm(case_comb ~ ds1 + ds2 + ds3 + ds4_a +
+                  ds4_b + ds5 + ds1_3 + center,
+                family = binomial(link = logit),
+                data = dat)
> summary(pool.glm)

Call:
glm(formula = case_comb ~ ds1 + ds2 + ds3 + ds4_a + ds4_b + ds5 +
    ds1_3 + center, family = binomial(link = logit), data = dat)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.9982 -0.9274 -0.5291  1.0113  2.2289

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    -2.39681    0.21768 -11.011  < 2e-16 ***
ds1             0.55945    0.14437   3.875 0.000107 ***
ds2             0.42531    0.14758   2.882 0.003954 **
ds3             0.81377    0.14503   5.611 2.01e-08 ***
ds4_a           0.30270    0.30679   0.987 0.323802
ds4_b           0.29199    0.17726   1.647 0.099511 .
ds5             0.92049    0.14852   6.198 5.72e-10 ***
ds1_3           0.49982    0.14706   3.399 0.000677 ***
centerMontreal -0.05816    0.15889  -0.366 0.714316
centerOttawa    0.14164    0.20251   0.699 0.484292

( Dispersion parameter for binomial family taken to be 1 )

    Null deviance: 1403.7  on 1020  degrees of freedom
Residual deviance: 1192.0  on 1011  degrees of freedom
AIC: 1212

Number of Fisher Scoring iterations: 3

20.7 Other model formulations


20.7.1 Latent linear model
Let yi = 1(yi∗ ≥ 0) where

yi∗ = xti β + εi

and −εi has distribution function g(·) and is independent of xi . From this latent linear
model, we can verify that

pr(yi = 1 | xi ) = pr(yi∗ ≥ 0 | xi )
= pr(xti β + εi ≥ 0 | xi )
= pr(−εi ≤ xti β | xi )
= g(xti β).

So the g(·) function can be interpreted as the distribution function of the error term in the
latent linear model.
This latent variable formulation provides another way to interpret the coefficients in the
models for binary data. It is a powerful way to generate models for more complex data. We
will see another example in the next chapter.
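A quick simulation (my own sketch, with arbitrary parameter values) illustrates the latent-variable formulation: thresholding a latent linear model with standard logistic errors at zero reproduces a logit model for yi.

# latent linear model with logistic errors reproduces the logit model
set.seed(10)
n <- 10000
x <- rnorm(n)
ystar <- -0.5 + 1.5 * x + rlogis(n)          # latent outcome; -eps is also standard logistic
y <- as.numeric(ystar >= 0)
coef(glm(y ~ x, family = binomial(link = logit)))   # approximately (-0.5, 1.5)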

20.7.2 Inverse model


Assume that

yi ∼ Bernoulli(q), (20.5)

and

xi | yi = 1 ∼ N(µ1 , Σ), xi | yi = 0 ∼ N(µ0 , Σ), (20.6)

where xi does not contain 1. This is called the linear discriminant model. We can verify
that yi | xi follows a logit model as shown in the theorem below.

Theorem 20.2 Under (20.5) and (20.6), we have

    logit{pr(yi = 1 | xi )} = α + xti β,

where

    α = log{q/(1 − q)} − (1/2)( µt1 Σ−1 µ1 − µt0 Σ−1 µ0 ),     β = Σ−1 (µ1 − µ0 ).     (20.7)
Proof of Theorem 20.2: Using Bayes’ formula, we have

    pr(yi = 1 | xi ) = pr(yi = 1, xi ) / pr(xi )
                     = pr(yi = 1)pr(xi | yi = 1) / { pr(yi = 1)pr(xi | yi = 1) + pr(yi = 0)pr(xi | yi = 0) }
                     = e∆ / (1 + e∆ ),

where

    ∆ = log [ pr(yi = 1)pr(xi | yi = 1) / { pr(yi = 0)pr(xi | yi = 0) } ]
      = log [ q {(2π)p det(Σ)}−1/2 exp{−(xi − µ1 )t Σ−1 (xi − µ1 )/2}
              / ( (1 − q) {(2π)p det(Σ)}−1/2 exp{−(xi − µ0 )t Σ−1 (xi − µ0 )/2} ) ]
      = log [ q exp{−(xi − µ1 )t Σ−1 (xi − µ1 )/2} / ( (1 − q) exp{−(xi − µ0 )t Σ−1 (xi − µ0 )/2} ) ]
      = log [ q exp{−(−2xti Σ−1 µ1 + µt1 Σ−1 µ1 )/2} / ( (1 − q) exp{−(−2xti Σ−1 µ0 + µt0 Σ−1 µ0 )/2} ) ]
      = log{q/(1 − q)} − (1/2)( µt1 Σ−1 µ1 − µt0 Σ−1 µ0 ) + xti Σ−1 (µ1 − µ0 ).

So yi | xi follows a logistic model with α and β given in (20.7). □


We can easily obtain the moment estimators for the unknown parameters under (20.5)
and (20.6). Let n1 = ∑_{i=1}^n yi and n0 = n − n1 . The moment estimator for q is q̂ = n1 /n,
the sample mean of the yi ’s. The moment estimators for µ1 and µ0 are
    µ̂1 = n1−1 ∑_{i=1}^n yi xi ,     µ̂0 = n0−1 ∑_{i=1}^n (1 − yi )xi ,

the sample means of the xi ’s for units with yi = 1 and yi = 0, respectively. The moment
estimator for Σ is
" n n
#
X X
t t
Σ̂ = yi (xi − µ̂1 )(xi − µ̂1 ) + (1 − yi )(xi − µ̂0 )(xi − µ̂0 ) /(n − 2),
i=1 i=1

the pooled covariance matrix, after centering the xi ’s by the y-specific means. Based on
Theorem 20.2, we can obtain estimates α̂ and β̂ by replacing the true parameters with their
moment estimators. This gives us another way to fit the logistic model.
Efron (1975) compared the above moment estimator and the MLE under the logistic
model. Since the linear discriminant model imposes stronger assumptions, the estimator
based on Theorem 20.2 is more efficient. In contrast, the MLE of the logistic model is more
robust because it does not impose the Normality assumption on xi .
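To make the comparison concrete, here is a small simulation sketch (my own illustration, not part of the text; the parameter values q, µ1, µ0, Σ are arbitrary) that computes the plug-in estimator of (α, β) from Theorem 20.2 and compares it with the logistic MLE from glm:

## sketch: linear discriminant plug-in estimator versus logistic MLE
set.seed(1)
n   = 2000; q = 0.4
mu1 = c(1, 0.5); mu0 = c(0, 0); Sigma = diag(2)
y   = rbinom(n, 1, q)
x   = t(sapply(y, function(yy) MASS::mvrnorm(1, if (yy) mu1 else mu0, Sigma)))

## moment estimators under the linear discriminant model
n1 = sum(y); n0 = n - n1
qhat   = n1 / n
mu1hat = colMeans(x[y == 1, ]); mu0hat = colMeans(x[y == 0, ])
Sigmahat = (crossprod(sweep(x[y == 1, ], 2, mu1hat)) +
            crossprod(sweep(x[y == 0, ], 2, mu0hat))) / (n - 2)
betahat  = solve(Sigmahat, mu1hat - mu0hat)
alphahat = log(qhat / (1 - qhat)) -
  0.5 * (t(mu1hat) %*% solve(Sigmahat, mu1hat) -
         t(mu0hat) %*% solve(Sigmahat, mu0hat))

## logistic MLE for comparison
coef(glm(y ~ x, family = binomial(link = logit)))
c(alphahat, betahat)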

20.8 Homework problems


20.1 Invariance of logistic regression
This problem extends Problem 3.4.
Assume x̃i = xi Γ with an invertible Γ. Run logistic regression of yi ’s on xi ’s to obtain
the coefficient β̂ and fitted probabilities π̂i ’s. Run another logistic regression of yi ’s on x̃i ’s
to obtain the coefficient β̃ and fitted probabilities π̃i ’s.
Show that β̂ = Γβ̃ and π̂i = π̃i for all i’s.

20.2 Two logistic regressions


This is an extension of Problem 17.2.

Given data (xi , zi , yi )ni=1 where xi denotes the covariates, zi ∈ {1, 0} denotes the bi-
nary group indicator, and yi denotes the binary outcome. We can fit two separate logistic
regressions:
logit{pr(yi = 1 | zi = 1, xi )} = γ1 + xti β1
and
logit{pr(yi = 1 | zi = 0, xi )} = γ0 + xti β0
with the treated and control data, respectively. We can also fit a joint logistic regression
using the pooled data:

logit{pr(yi = 1 | zi , xi )} = α0 + αz zi + xti αx + zi xti αzx .

Let the parameters with hats denote the MLEs, for example, γ̂1 is the MLE for γ1 . Find
(α̂0 , α̂z , α̂x , α̂zx ) in terms of (γ̂1 , β̂1 , γ̂0 , β̂0 ).

20.3 Likelihood for Probit model


Write down the likelihood function for the Probit model, and derive the steps for Newton’s
method and Fisher scoring for computing the MLE. How do we estimate the asymptotic
covariance matrix of the MLE?

20.4 Logit and general exponential family


Efron (1975) pointed out an extension of Theorem 20.2. Show that under (20.5) and

f (xi | yi = y) = g(θy , η)h(xi , η) exp(xti θy ), (y = 0, 1)

with parameters (θ1 , θ0 , η), we have

logit{pr(yi = 1 | xi )} = α + xti β.

Find the formulas of α and β in terms of (θ1 , θ0 , η).


Hint: As a sanity check, you can compare this problem with Theorem 20.2.

20.5 Empirical comparison of logistic regression and linear discriminant analysis


Compare the performance of logistic regression and linear discriminant analysis in terms
of prediction accuracy. You should simulate at least three cases: (1) the model for linear
discriminant analysis is correct; (2) the model for linear discriminant analysis is incorrect but
the model for logistic regression is correct; (3) the model for logistic regression is incorrect.

20.6 Quadratic discriminant analysis


Assume that
yi ∼ Bernoulli(q),
and
xi | yi = 1 ∼ N(µ1 , Σ1 ), xi | yi = 0 ∼ N(µ0 , Σ0 ),
where xi ∈ Rp does not contain 1. Prove that

\[
\mathrm{logit}\{\mathrm{pr}(y_i = 1 \mid x_i)\} = \alpha + x_i^{t}\beta + x_i^{t}\Lambda x_i,
\]
where
\[
\begin{aligned}
\alpha &= \log\frac{q}{1-q} - \frac{1}{2}\log\frac{\det(\Sigma_1)}{\det(\Sigma_0)} - \frac{1}{2}\left(\mu_1^{t}\Sigma_1^{-1}\mu_1 - \mu_0^{t}\Sigma_0^{-1}\mu_0\right), \\
\beta &= \Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0, \\
\Lambda &= -\frac{1}{2}\left(\Sigma_1^{-1} - \Sigma_0^{-1}\right).
\end{aligned}
\]
Remark: This problem extends the linear discriminant model in Section 20.7.2 to the
quadratic discriminant model by allowing for heteroskedasticity in the conditional Normality
of x given y. It implies the logistic model with the linear, quadratic, and interaction terms
of the basic covariates.

20.7 Logit and other links


Compute the minimizer and minimum value of maxy |glogit (by) − g∗ (y)| for ∗ = cauchit and
cloglog.

20.8 Data analysis


Reanalyze the data in Section 20.6.2, stratifying the analysis based on center. Do the results
vary significantly across centers?

20.9 R2 in logistic regression


Recall that R2 in the linear model measures the linear dependence of the outcome on
the covariates. However, the definition of R2 is not obvious in the logistic model. The glm
function in R does not return any R2 for the logistic regression.
Recall the following equivalent definitions of R2 in the linear model:
\[
R^2 = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2}
= 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}
= \hat\rho^2_{y\hat y}
= \frac{\left\{\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y)\right\}^2}{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar y)^2}.
\]

The fitted values are π̂i = π(xi, β̂) in the logistic model, which have mean ȳ with the intercept included in the model. Analogously, we can define R2 in the logistic model as
\[
R^2_{\mathrm{model}} = \frac{\mathrm{ssm}}{\mathrm{sst}}, \qquad
R^2_{\mathrm{residual}} = 1 - \frac{\mathrm{ssr}}{\mathrm{sst}}, \qquad
R^2_{\mathrm{correlation}} = \hat\rho^2_{y\hat\pi} = \frac{C_{y\hat\pi}^2}{\mathrm{sst}\cdot\mathrm{ssm}},
\]
where
\[
\mathrm{sst} = \sum_{i=1}^n (y_i - \bar y)^2, \quad
\mathrm{ssm} = \sum_{i=1}^n (\hat\pi_i - \bar y)^2, \quad
\mathrm{ssr} = \sum_{i=1}^n (y_i - \hat\pi_i)^2, \quad
C_{y\hat\pi} = \sum_{i=1}^n (y_i - \bar y)(\hat\pi_i - \bar y).
\]
These three definitions are not equivalent in general. In particular, R2_model differs from R2_residual since
\[
\mathrm{sst} = \mathrm{ssm} + \mathrm{ssr} + 2C_{\hat\varepsilon\hat\pi},
\qquad\text{where}\quad
C_{\hat\varepsilon\hat\pi} = \sum_{i=1}^n (y_i - \hat\pi_i)(\hat\pi_i - \bar y).
\]
1. Prove that R2_model ≥ 0 and R2_correlation ≥ 0, with equality holding if π̂i = ȳ for all i. Prove that R2_model ≤ 1, R2_residual ≤ 1, and R2_correlation ≤ 1, with equality holding if yi = π̂i for all i. Note that R2_residual may be negative. Give an example.

2. Define
\[
\bar{\hat\pi}_1 = \frac{\sum_{i=1}^n y_i\hat\pi_i}{\sum_{i=1}^n y_i}, \qquad
\bar{\hat\pi}_0 = \frac{\sum_{i=1}^n (1-y_i)\hat\pi_i}{\sum_{i=1}^n (1-y_i)}
\]
as the average of the fitted values for units with yi = 1 and yi = 0, respectively. Define D = \bar{\hat\pi}_1 - \bar{\hat\pi}_0. Prove that
\[
D = (R^2_{\mathrm{model}} + R^2_{\mathrm{residual}})/2 = \sqrt{R^2_{\mathrm{model}} R^2_{\mathrm{correlation}}}.
\]
Further prove that D ≥ 0 with equality holding if π̂i = ȳ for all i, and D ≤ 1 with equality holding if yi = π̂i for all i.
3. McFadden (1974) defined the following R2:
\[
R^2_{\mathrm{mcfadden}} = 1 - \frac{\log L(\hat\beta)}{\log L(\tilde\beta)},
\]
recalling that β̃ is the MLE assuming that all coefficients except the intercept are zero, and β̂ is the MLE without any restrictions. This R2 must lie within [0, 1]. Verify that under the Normal linear model, the above formula does not reduce to the usual R2.

4. Cox and Snell (1989) defined the following R2:
\[
R^2_{\mathrm{CS}} = 1 - \left\{\frac{L(\tilde\beta)}{L(\hat\beta)}\right\}^{2/n}.
\]
Verify that under the Normal linear model, the above formula reduces to the usual R2.
Remark: Tjur (2009) gave an excellent discussion of R2_model, R2_residual, R2_correlation and D. Nagelkerke (1991) pointed out that the upper bound of this R2_CS is 1 − {L(β̃)}^{2/n} < 1 and proposed to modify it as
\[
R^2_{\mathrm{nagelkerke}} = \frac{1 - \{L(\tilde\beta)/L(\hat\beta)\}^{2/n}}{1 - \{L(\tilde\beta)\}^{2/n}}
\]
to ensure that its upper bound is 1. However, this modification seems purely ad hoc. Although D is an appealing definition of R2 for the logistic model, it does not generalize to other models. Overall, I feel that R2_correlation is a better definition that easily generalizes to other models. Zhang (2017) defined an R2 based on the variance function of the outcome for generalized linear models including the binary logistic model. Hu et al. (2006) studied the asymptotic properties of some of the R2s above.
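As a computational companion to the definitions above, here is a sketch (my own addition, using simulated data with arbitrary parameter values) that computes the different R2 variants from a fitted logistic regression:

## sketch: R^2 variants for a fitted logistic regression
set.seed(1)
n = 500
x = rnorm(n)
y = rbinom(n, 1, plogis(-0.5 + x))
fit   = glm(y ~ x, family = binomial(link = logit))
pihat = fitted(fit)
ybar  = mean(y)
sst  = sum((y - ybar)^2)
ssm  = sum((pihat - ybar)^2)
ssr  = sum((y - pihat)^2)
Cypi = sum((y - ybar) * (pihat - ybar))
R2.model       = ssm / sst
R2.residual    = 1 - ssr / sst
R2.correlation = Cypi^2 / (sst * ssm)
D = mean(pihat[y == 1]) - mean(pihat[y == 0])
fit0 = glm(y ~ 1, family = binomial(link = logit))      # intercept-only fit
R2.mcfadden = 1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit0))
R2.CS = 1 - exp((2 / n) * (as.numeric(logLik(fit0)) - as.numeric(logLik(fit))))
round(c(R2.model, R2.residual, R2.correlation, D, R2.mcfadden, R2.CS), 3)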
21
Logistic Regressions for Categorical Outcomes

Categorical outcomes are common in empirical research. The first type of categorical out-
come is nominal. For example, the outcome denotes the preference for fruits (apple, orange,
and pear) or transportation services (Uber, Lyft, or BART). The second type of categorical
outcome is ordinal. For example, the outcome denotes the course evaluation at Berkeley
(1, 2, . . . , 7) or Amazon review (1 to 5 stars). This chapter discusses statistical modeling
strategies for categorical outcomes, including two classes of models corresponding to the
nominal and ordinal outcomes, respectively.

21.1 Multinomial distribution


A categorical random variable y taking values in {1, . . . , K} with probabilities pr(y = k) =
πk (k = 1, . . . , K) is often called a multinomial distribution, denoted by

y ∼ Multinomial {1; (π1 , . . . , πK )} , (21.1)


where $\sum_{k=1}^K \pi_k = 1$. We can calculate the mean and covariance matrix of y:

Proposition 21.1 If y is the Multinomial random variable in (21.1), then (1(y = 1), . . . , 1(y = K − 1)) has mean (π1, . . . , πK−1) and covariance matrix
\[
\begin{pmatrix}
\pi_1(1-\pi_1) & -\pi_1\pi_2 & \cdots & -\pi_1\pi_{K-1} \\
-\pi_1\pi_2 & \pi_2(1-\pi_2) & \cdots & -\pi_2\pi_{K-1} \\
\vdots & \vdots & \ddots & \vdots \\
-\pi_1\pi_{K-1} & -\pi_2\pi_{K-1} & \cdots & \pi_{K-1}(1-\pi_{K-1})
\end{pmatrix}. \tag{21.2}
\]

As a byproduct, we know that the matrix in (21.2) is positive semi-definite.

Proof of Proposition 21.1: Without loss of generality, I will calculate the (1, 1)th and
the (1, 2)th element of the matrix. First, 1(y = 1) is Bernoulli with probability π1 , so the
(1, 1)th element equals var(1(y = 1)) = π1 (1 − π1 ). Similarly, the (2, 2)th element equals
var(1(y = 2)) = π2 (1 − π2 ).
Second, 1(y = 1)+1(y = 2) is Bernoulli with probability π1 +π2 , so var(1(y = 1)+1(y =
2)) = (π1 + π2 )(1 − π1 − π2 ). Therefore, the (1, 2)-th element equals

cov(1(y = 1), 1(y = 2))


= {var(1(y = 1) + 1(y = 2)) − var(1(y = 1)) − var(1(y = 2))} /2
= −π1 π2 .


With independent samples of (xi , yi )ni=1 , we want to model yi based on covariates xi 1 :

yi | xi ∼ Multinomial [1; {π1 (xi ), . . . , πK (xi )}] ,


where $\sum_{k=1}^K \pi_k(x_i) = 1$ for all xi. The probability mass function of yi | xi is πyi(xi), which can be written compactly as
\[
\pi_{y_i}(x_i) = \prod_{k=1}^K \{\pi_k(x_i)\}^{1(y_i = k)}.
\]

Here πk (xi ) is a general function of xi . The remaining parts of this chapter will discuss the
canonical choices of πk (xi ) for nominal and ordinal outcomes.

21.2 Multinomial logistic model for nominal outcomes


21.2.1 Modeling
Viewing category K as the reference level, we can model the ratio of the probabilities of categories k and K as
\[
\log\frac{\pi_k(x_i)}{\pi_K(x_i)} = x_i^{t}\beta_k \quad (k = 1, \ldots, K-1),
\]
which implies that
\[
\pi_k(x_i) = \pi_K(x_i)e^{x_i^{t}\beta_k} \quad (k = 1, \ldots, K-1).
\]
Due to the normalization, we have
\[
\sum_{k=1}^K \pi_k(x_i) = 1
\;\Longrightarrow\; \sum_{k=1}^K \pi_K(x_i)e^{x_i^{t}\beta_k} = 1
\;\Longrightarrow\; \pi_K(x_i) = 1\Big/\sum_{k=1}^K e^{x_i^{t}\beta_k}
\;\Longrightarrow\; \pi_k(x_i) = \frac{e^{x_i^{t}\beta_k}}{\sum_{l=1}^K e^{x_i^{t}\beta_l}} \quad (k = 1, \ldots, K-1).
\]

A more compact form is
\[
\pi_k(x_i) = \pi_k(x_i, \beta) = \frac{e^{x_i^{t}\beta_k}}{\sum_{l=1}^K e^{x_i^{t}\beta_l}}, \quad (k = 1, \ldots, K) \tag{21.3}
\]

where β = (β1 , . . . , βK−1 ) denotes the parameter with βK = 0 for the reference category.
From the ratio form of (21.3), we can only identify βk − βK for all k = 1, . . . , K. So for
convenience, we impose the restriction βK = 0. Model (21.3) is called the multinomial
logistic regression model.
1 An alternative strategy is to model 1(yi = k) | xi for each k. The advantage of this strategy is that it reduces to binary logistic models. The disadvantage of this strategy is that it does not model the whole distribution of yi and can lose efficiency in estimation.

Similar to the binary logistic regression model, we can interpret the coefficients as the
conditional log odds ratio compared to the reference level:

\[
\beta_{k,j} = \log\frac{\pi_k(\ldots, x_{ij}+1, \ldots)}{\pi_K(\ldots, x_{ij}+1, \ldots)} - \log\frac{\pi_k(\ldots, x_{ij}, \ldots)}{\pi_K(\ldots, x_{ij}, \ldots)}
= \log\left\{\frac{\pi_k(\ldots, x_{ij}+1, \ldots)}{\pi_K(\ldots, x_{ij}+1, \ldots)} \Big/ \frac{\pi_k(\ldots, x_{ij}, \ldots)}{\pi_K(\ldots, x_{ij}, \ldots)}\right\}.
\]

21.2.2 MLE
The likelihood function for the multinomial logistic model is
\[
\begin{aligned}
L(\beta) &= \prod_{i=1}^n\prod_{k=1}^K \{\pi_k(x_i)\}^{1(y_i=k)} \\
&= \prod_{i=1}^n\prod_{k=1}^K \left\{\frac{e^{x_i^{t}\beta_k}}{\sum_{l=1}^K e^{x_i^{t}\beta_l}}\right\}^{1(y_i=k)} \\
&= \prod_{i=1}^n \left[\left\{\prod_{k=1}^K e^{1(y_i=k)x_i^{t}\beta_k}\right\}\Big/\sum_{k=1}^K e^{x_i^{t}\beta_k}\right],
\end{aligned}
\]
and the log-likelihood function is
\[
\log L(\beta) = \sum_{i=1}^n\left[\sum_{k=1}^K 1(y_i=k)x_i^{t}\beta_k - \log\left\{\sum_{k=1}^K e^{x_i^{t}\beta_k}\right\}\right].
\]

The score function is
\[
\frac{\partial \log L(\beta)}{\partial\beta} =
\begin{pmatrix}
\partial\log L(\beta)/\partial\beta_1 \\
\vdots \\
\partial\log L(\beta)/\partial\beta_{K-1}
\end{pmatrix} \in \mathbb{R}^{p(K-1)}
\]
with
\[
\begin{aligned}
\frac{\partial\log L(\beta)}{\partial\beta_k}
&= \sum_{i=1}^n\left\{ x_i 1(y_i = k) - \frac{x_i e^{x_i^{t}\beta_k}}{\sum_{l=1}^K e^{x_i^{t}\beta_l}}\right\} \\
&= \sum_{i=1}^n x_i\left\{ 1(y_i = k) - \frac{e^{x_i^{t}\beta_k}}{\sum_{l=1}^K e^{x_i^{t}\beta_l}}\right\} \\
&= \sum_{i=1}^n x_i\{1(y_i = k) - \pi_k(x_i, \beta)\} \in \mathbb{R}^p, \quad (k = 1, \ldots, K-1).
\end{aligned}
\]

The Hessian matrix is
\[
\frac{\partial^2\log L(\beta)}{\partial\beta\partial\beta^{t}} =
\begin{pmatrix}
\frac{\partial^2\log L(\beta)}{\partial\beta_1\partial\beta_1^{t}} & \frac{\partial^2\log L(\beta)}{\partial\beta_1\partial\beta_2^{t}} & \cdots & \frac{\partial^2\log L(\beta)}{\partial\beta_1\partial\beta_{K-1}^{t}} \\
\frac{\partial^2\log L(\beta)}{\partial\beta_2\partial\beta_1^{t}} & \frac{\partial^2\log L(\beta)}{\partial\beta_2\partial\beta_2^{t}} & \cdots & \frac{\partial^2\log L(\beta)}{\partial\beta_2\partial\beta_{K-1}^{t}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2\log L(\beta)}{\partial\beta_{K-1}\partial\beta_1^{t}} & \frac{\partial^2\log L(\beta)}{\partial\beta_{K-1}\partial\beta_2^{t}} & \cdots & \frac{\partial^2\log L(\beta)}{\partial\beta_{K-1}\partial\beta_{K-1}^{t}}
\end{pmatrix} \in \mathbb{R}^{p(K-1)\times p(K-1)} \tag{21.4}
\]
with the (k, k)th block
\[
\begin{aligned}
\frac{\partial^2\log L(\beta)}{\partial\beta_k\partial\beta_k^{t}}
&= -\sum_{i=1}^n x_i \frac{\partial}{\partial\beta_k^{t}}\left(\frac{e^{x_i^{t}\beta_k}}{\sum_{l=1}^K e^{x_i^{t}\beta_l}}\right) \\
&= -\sum_{i=1}^n x_i x_i^{t}\,\frac{e^{x_i^{t}\beta_k}\sum_{l=1}^K e^{x_i^{t}\beta_l} - e^{x_i^{t}\beta_k}e^{x_i^{t}\beta_k}}{\left(\sum_{l=1}^K e^{x_i^{t}\beta_l}\right)^2} \\
&= -\sum_{i=1}^n \pi_k(x_i,\beta)\{1-\pi_k(x_i,\beta)\} x_i x_i^{t} \in \mathbb{R}^{p\times p} \quad (k = 1, \ldots, K-1)
\end{aligned}
\]
and the (k, l)th block
\[
\begin{aligned}
\frac{\partial^2\log L(\beta)}{\partial\beta_k\partial\beta_l^{t}}
&= -\sum_{i=1}^n x_i \frac{\partial}{\partial\beta_l^{t}}\left(\frac{e^{x_i^{t}\beta_k}}{\sum_{l=1}^K e^{x_i^{t}\beta_l}}\right) \\
&= -\sum_{i=1}^n x_i x_i^{t}\,\frac{-e^{x_i^{t}\beta_k}e^{x_i^{t}\beta_l}}{\left(\sum_{l=1}^K e^{x_i^{t}\beta_l}\right)^2} \\
&= \sum_{i=1}^n \pi_k(x_i,\beta)\pi_l(x_i,\beta) x_i x_i^{t} \in \mathbb{R}^{p\times p} \quad (k \neq l:\ k, l = 1, \ldots, K-1).
\end{aligned}
\]

We can verify that the Hessian matrix is negative semi-definite based on Proposition 21.1,
which is left as Problem 21.2.
In R, the function multinom in the nnet package uses Newton’s method to fit the multino-
mial logistic model. We can make inference about the parameters based on the asymptotic
Normality of the MLE. Based on a new observation with covariate xn+1 , we can make
prediction based on the fitted probabilities πk (xn+1 , β̂), and furthermore classify it into K
categories based on
\[
\hat y_{n+1} = \arg\max_{1\le k\le K} \pi_k(x_{n+1}, \hat\beta).
\]
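To connect formula (21.3) with software output, here is a short simulation sketch (my own addition; the parameter values are arbitrary) that reproduces the fitted probabilities of multinom by direct computation. Note that multinom takes the first factor level as the reference category, so the roles of categories 1 and K are swapped relative to (21.3).

## sketch: reproducing multinom fitted probabilities from formula (21.3)
library(nnet)
set.seed(1)
n = 1000; K = 3
x = cbind(1, rnorm(n))                            # intercept + one covariate
beta = cbind(c(0, 0), c(0.5, 1), c(-0.5, 2))      # beta_1 = 0 is the reference
prob = exp(x %*% beta); prob = prob / rowSums(prob)
y = factor(apply(prob, 1, function(p) sample(1:K, 1, prob = p)))
fit = multinom(y ~ x[, 2], trace = FALSE)
betahat = rbind(0, coef(fit))                     # rows correspond to categories 1..K
phat = exp(x %*% t(betahat)); phat = phat / rowSums(phat)
max(abs(phat - predict(fit, type = "probs")))     # should be near 0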

21.3 A latent variable representation for the multinomial logistic


regression
We can view the multinomial logistic regression (21.3) as an extension of the binary lo-
gistic regression model. The binary logistic regression has a latent variable representation
as shown in Section 20.7.1. The multinomial logistic regression also has a latent variable
representation below.
Assume
\[
\begin{cases}
U_{i1} = x_i^{t}\beta_1 + \varepsilon_{i1}, \\
\quad\vdots \\
U_{iK} = x_i^{t}\beta_K + \varepsilon_{iK},
\end{cases}
\]
where εi1, . . . , εiK are IID standard Gumbel random variables2. Using the language of economics, (Ui1, . . . , UiK) are the utilities associated with the choices (1, . . . , K). So unit i chooses k if k has the highest utility:
\[
y_i = k \quad\text{if } U_{ik} > U_{il} \text{ for all } l \neq k.
\]
2 See Section B.1.3 for a review.

We can show that this latent variable model implies (21.3). This follows from the lemma
below, which is due to McFadden (1974)3 . When K = 2, it also gives another latent variable
representation for the binary logistic regression, which is different from the one in Section
20.7.1.

Lemma 21.1 Assume
\[
\begin{cases}
U_1 = V_1 + \varepsilon_1, \\
\quad\vdots \\
U_K = V_K + \varepsilon_K,
\end{cases}
\]
where ε1, . . . , εK are IID standard Gumbel. Define
\[
y = \arg\max_{1\le l\le K} U_l
\]
as the index corresponding to the maximum of the Uk's. We have
\[
\mathrm{pr}(y = k) = \frac{e^{V_k}}{\sum_{l=1}^K e^{V_l}}.
\]
Proof of Lemma 21.1: Recall that the standard Gumbel random variable has CDF F(z) = exp(−e^{−z}) and density f(z) = exp(−e^{−z})e^{−z}. The event "y = k" is equivalent to the event "Uk > Ul for all l ≠ k", so
\[
\begin{aligned}
\mathrm{pr}(y = k) &= \mathrm{pr}(U_k > U_l,\ l = 1, \ldots, k-1, k+1, \ldots, K) \\
&= \mathrm{pr}(V_k + \varepsilon_k > V_l + \varepsilon_l,\ l = 1, \ldots, k-1, k+1, \ldots, K) \\
&= \int_{-\infty}^{\infty} \mathrm{pr}(V_k + z > V_l + \varepsilon_l,\ l = 1, \ldots, k-1, k+1, \ldots, K)\, f(z)\,\mathrm{d}z,
\end{aligned}
\]
where the last line follows from conditioning on εk. By independence of the ε's, we have
\[
\begin{aligned}
\mathrm{pr}(y = k) &= \int_{-\infty}^{\infty} \prod_{l\neq k}\mathrm{pr}(\varepsilon_l < V_k - V_l + z)\, f(z)\,\mathrm{d}z \\
&= \int_{-\infty}^{\infty} \prod_{l\neq k}\exp(-e^{-V_k+V_l-z})\, \exp(-e^{-z})e^{-z}\,\mathrm{d}z \\
&= \int_{-\infty}^{\infty} \exp\Big(-\sum_{l\neq k}e^{-V_k+V_l}e^{-z}\Big)\exp(-e^{-z})e^{-z}\,\mathrm{d}z.
\end{aligned}
\]
Changing variables t = e^{−z} with dz = −dt/t, we obtain
\[
\mathrm{pr}(y = k) = \int_0^{\infty} \exp\Big(-t\sum_{l\neq k}e^{-V_k+V_l}\Big)\exp(-t)\,\mathrm{d}t
= \int_0^{\infty} \exp(-tC_k)\,\mathrm{d}t,
\]
where
\[
C_k = 1 + \sum_{l\neq k}e^{-V_k+V_l}.
\]
3 Daniel McFadden shared the 2000 Nobel Memorial Prize in Economic Sciences with James Heckman.
The integral simplifies to 1/Ck due to the density of the exponential distribution. Therefore,
\[
\mathrm{pr}(y = k) = \frac{1}{1+\sum_{l\neq k}e^{-V_k+V_l}}
= \frac{e^{V_k}}{e^{V_k}+\sum_{l\neq k}e^{V_l}}
= \frac{e^{V_k}}{\sum_{l=1}^K e^{V_l}}. \qquad\square
\]


This lemma is remarkable. It extends to more complex utility functions. I will use it
again in Section 21.6.
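Here is a tiny Monte Carlo sketch (added for illustration, with arbitrary V) confirming the conclusion of Lemma 21.1:

## sketch: argmax of V_k + Gumbel noise has softmax probabilities
set.seed(1)
K = 4; V = c(0, 0.5, -1, 1.5); nsim = 1e5
rgumbel = function(n) -log(-log(runif(n)))       # standard Gumbel draws
y = replicate(nsim, which.max(V + rgumbel(K)))
cbind(simulated = as.vector(table(y)) / nsim,
      softmax   = exp(V) / sum(exp(V)))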

21.4 Proportional odds model for ordinal outcomes


For ordinal outcomes, we can still use the multinomial logistic model, but by doing this, we
discard the ordering information in the outcome. Consider a simple case with a scalar xi , the
multinomial logistic model does not rule out the possibility that βk < 0 and βk+1 > 0, which
implies that xi increases the probability of category k + 1 but decreases the probability of
category k. However, the outcome must first reach level k and then increase to level k + 1.
If this happens, it may be hard to interpret the model.
Motivated by the latent linear representation of the binary logistic model, we imagine
that the ordinal outcome arises from discretizing a continuous latent variable:

yi∗ = xti β + εi , pr(εi ≤ z | xi ) = g(z) (21.5)

and

yi = k, if αk−1 < yi∗ ≤ αk , (k = 1, . . . , K) (21.6)

where
−∞ = α0 < α1 < · · · < αK−1 < αK = ∞.
Figure 21.1 illustrates the data generating process with K = 4.
The unknown parameters are (β, α1 , . . . , αK−1 ). The distribution of the latent error
term g(·) is known, for example, it can be logistic or Normal. The former results in the
proportional odds logistic model, and the latter results in the ordered Probit model. Based
on the latent linear model, we can compute

pr(yi ≤ k | xi ) = pr(yi∗ ≤ αk | xi )
= pr(xti β + εi ≤ αk | xi )
= pr(εi ≤ αk − xti β | xi )
= g(αk − xti β).

I will focus on the proportional odds logistic model in the main text and defer the details
for the ordered Probit model to Problem 21.5. With this model, we have
\[
\mathrm{pr}(y_i \le k \mid x_i) = \frac{e^{\alpha_k - x_i^{t}\beta}}{1 + e^{\alpha_k - x_i^{t}\beta}},
\]

FIGURE 21.1: Latent variable representation of the ordinal outcome (the density of y* = x^tβ + ε is partitioned by the cutoffs α0 = −∞ < α1 < α2 < α3 < α4 = ∞ into the ordinal categories 1, 2, 3, 4)

or
\[
\mathrm{logit}\{\mathrm{pr}(y_i \le k \mid x_i)\} = \log\frac{\mathrm{pr}(y_i \le k \mid x_i)}{\mathrm{pr}(y_i > k \mid x_i)} = \alpha_k - x_i^{t}\beta. \tag{21.7}
\]
The model has the "proportional odds" property because
\[
\frac{\mathrm{pr}(y_i \le k \mid \ldots, x_{ij}+1, \ldots)}{\mathrm{pr}(y_i > k \mid \ldots, x_{ij}+1, \ldots)}
\Big/
\frac{\mathrm{pr}(y_i \le k \mid \ldots, x_{ij}, \ldots)}{\mathrm{pr}(y_i > k \mid \ldots, x_{ij}, \ldots)}
= e^{-\beta_j},
\]
which is a positive constant not depending on k.


The sign of xti β is negative due to the latent variable representation. Some textbooks
and software packages use a positive sign, but the function polr in package MASS of R uses
(21.7).
The proportional odds model implies a quite complicated form of the probability for
each category:
\[
\mathrm{pr}(y_i = k \mid x_i) = \frac{e^{\alpha_k - x_i^{t}\beta}}{1 + e^{\alpha_k - x_i^{t}\beta}} - \frac{e^{\alpha_{k-1} - x_i^{t}\beta}}{1 + e^{\alpha_{k-1} - x_i^{t}\beta}}, \quad (k = 1, \ldots, K).
\]

So the likelihood function is


\[
\begin{aligned}
L(\beta, \alpha_1, \ldots, \alpha_{K-1}) &= \prod_{i=1}^n\prod_{k=1}^K \{\mathrm{pr}(y_i = k \mid x_i)\}^{1(y_i=k)} \\
&= \prod_{i=1}^n\prod_{k=1}^K \left(\frac{e^{\alpha_k - x_i^{t}\beta}}{1 + e^{\alpha_k - x_i^{t}\beta}} - \frac{e^{\alpha_{k-1} - x_i^{t}\beta}}{1 + e^{\alpha_{k-1} - x_i^{t}\beta}}\right)^{1(y_i=k)}.
\end{aligned}
\]

The log-likelihood function is concave (Pratt, 1981; Burridge, 1981), and it is strictly

concave in most cases. The function polr in the R package MASS computes the MLE of the
proportional odds model using the BFGS algorithm. It uses the explicit formulas of the
gradient of the log-likelihood function, and computes the Hessian matrix numerically. I
relegate the formulas of the gradient as a homework problem. For more details of the Hessian
matrix, see Agresti (2010), which is a textbook discussion on modeling ordinal data.
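As a brief illustration (a simulation sketch added here; the true values are arbitrary), we can generate an ordinal outcome from the latent model (21.5)-(21.6) with logistic errors and fit the proportional odds model with polr:

## sketch: simulate proportional odds data and fit with MASS::polr
library(MASS)
set.seed(1)
n = 2000
x = rnorm(n)
beta = 1; alpha = c(-1, 1)                 # K = 3 categories
ystar = x * beta + rlogis(n)
y = cut(ystar, breaks = c(-Inf, alpha, Inf), labels = 1:3, ordered_result = TRUE)
fit = polr(y ~ x, Hess = TRUE)
coef(fit)        # close to beta = 1
fit$zeta         # estimated cutpoints, close to (-1, 1)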

21.5 A case study


I use a small observational dataset from the Karolinska Institute in Stockholm, Sweden to
illustrate the application of logistic regressions. Rubin (2008) used this dataset to investigate
whether it is better for cardia cancer patients to be treated in a large or small-volume
hospital, where volume is defined by the number of patients with cardia cancer treated
in recent years. I use the following variables: highdiag indicating whether a patient was
diagnosed at a high volume hospital, hightreat indicating whether a patient was treated at
a high volume hospital, age representing the age, rural indicating whether the patient was
from a rural area, and survival representing the years of survival after diagnosis with three
categories (“1”, “2-4”, “5+”). The R code is in code20.5.R.
karolinska = read.table("karolinska.txt", header = TRUE)
karolinska = karolinska[, c("highdiag", "hightreat",
                            "age", "rural",
                            "male", "survival")]

21.5.1 Binary logistic for the treatment


We have two choices of the treatment: highdiag and hightreat. The logistic fit of highdiag on
covariates is shown below.
> diagglm = glm(highdiag ~ age + rural + male,
+               data = karolinska,
+               family = binomial(link = "logit"))
> summary(diagglm)

Call:
glm(formula = highdiag ~ age + rural + male, family = binomial(link = "logit"),
    data = karolinska)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.06147  -0.98645  -0.05759   1.01391   1.75696

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.46604    1.14545   3.026 0.002479 **
age         -0.03124    0.01481  -2.110 0.034854 *
rural       -1.26322    0.34530  -3.658 0.000254 ***
male        -0.97524    0.41303  -2.361 0.018216 *

The logistic fit of hightreat is shown below.


> treatglm = glm(hightreat ~ age + rural + male,
+                data = karolinska,
+                family = binomial(link = "logit"))
> summary(treatglm)

Call:
glm(formula = hightreat ~ age + rural + male, family = binomial(link = "logit"),
    data = karolinska)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2912  -0.9978   0.5387   0.8408   1.4810

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.44683    1.49544   4.311 1.63e-05 ***
age         -0.06297    0.01890  -3.332 0.000862 ***
rural       -1.28777    0.39572  -3.254 0.001137 **
male        -0.74856    0.45285  -1.653 0.098329 .

Both treatments are associated with the covariates. hightreat is more strongly associated
with age. Rubin (2008) argued that highdiag is more random than hightreat, and may have
weaker association with other hidden covariates. For each model below, I fit the data twice
corresponding to two choices of treatment. Overall, we should trust the results with highdiag
more based on Rubin (2008)’s argument.

21.5.2 Binary logistic for the outcome


I first fit binary logistic models for the dichotomized outcome indicating whether the patient
survived longer than a year after diagnosis.
> karolinska$loneyear = (karolinska$survival != "1")
> loneyearglm = glm(loneyear ~ highdiag + age + rural + male,
+                   data = karolinska,
+                   family = binomial(link = "logit"))
> summary(loneyearglm)

Call:
glm(formula = loneyear ~ highdiag + age + rural + male, family = binomial(link = "logit"),
    data = karolinska)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.1755  -0.9936  -0.7739   1.3024   1.8557

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.22919    1.15545  -1.064   0.2874
highdiag     0.13684    0.36586   0.374   0.7084
age         -0.00389    0.01411  -0.276   0.7829
rural        0.33360    0.35798   0.932   0.3514
male         0.86706    0.44034   1.969   0.0489 *

highdiag is not significant in the above regression.


> loneyearglm = glm(loneyear ~ hightreat + age + rural + male,
+                   data = karolinska,
+                   family = binomial(link = "logit"))
> summary(loneyearglm)

Call:
glm(formula = loneyear ~ hightreat + age + rural + male, family = binomial(link = "logit"),
    data = karolinska)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3767  -0.9683  -0.6784   1.0813   2.0833

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.353977   1.317942  -2.545  0.01093 *
hightreat    1.417458   0.455603   3.111  0.00186 **
age          0.008725   0.014840   0.588  0.55655
rural        0.633278   0.368525   1.718  0.08572 .
male         1.079973   0.452191   2.388  0.01693 *

hightreat is significant in the above regression.

21.5.3 Multinomial logistic for the outcome


I then fit multinomial logistic models for the outcome with three categories.
> library(nnet)
> yearmultinom = multinom(survival ~ highdiag + age + rural + male,
+                         data = karolinska)
# weights:  18 (10 variable)
initial  value 173.580742
iter  10 value 134.331992
final  value 134.130815
converged
> summary(yearmultinom)
Call:
multinom(formula = survival ~ highdiag + age + rural + male,
    data = karolinska)

Coefficients:
    (Intercept)    highdiag          age     rural      male
2-4   -1.075818 -0.06973187 -0.004624030 0.1744256 0.5028786
5+    -4.180416  0.64036289 -0.001846453 0.7365111 2.1628717

Std. Errors:
    (Intercept)  highdiag        age     rural      male
2-4    1.286987 0.4113006 0.01596377 0.4014718 0.4716831
5+     2.003581 0.5816365 0.02148936 0.5741017 1.0741239

Residual Deviance: 268.2616
AIC: 288.2616
> predict(yearmultinom, type = "probs")[1:5, ]
          1       2-4         5+
1 0.5950631 0.2647047 0.14023222
2 0.5941802 0.2655369 0.14028293
3 0.8081376 0.1718963 0.01996613
4 0.5950631 0.2647047 0.14023222
5 0.6366929 0.2260086 0.13729849

highdiag is not significant above. The predict function gives the fitted probabilities for
all categories of the outcome.
> yearmultinom = multinom(survival ~ hightreat + age + rural + male,
+                         data = karolinska)
# weights:  18 (10 variable)
initial  value 173.580742
iter  10 value 129.548642
final  value 129.283739
converged
> summary(yearmultinom)
Call:
multinom(formula = survival ~ hightreat + age + rural + male,
    data = karolinska)

Coefficients:
    (Intercept) hightreat         age     rural      male
2-4   -3.312433  1.326354 0.008527561 0.5186654 0.7514451
5+    -5.935172  1.627711 0.008978103 0.9063831 2.2780877

Std. Errors:
    (Intercept) hightreat        age     rural      male
2-4    1.463258 0.5141127 0.01660648 0.4085976 0.4806953
5+     2.190305 0.7320788 0.02244867 0.5645595 1.0739669

Residual Deviance: 258.5675
AIC: 278.5675

hightreat is significant above.

21.5.4 Proportional odds logistic for the outcome


The multinomial logistic model does not reflect the ordering information of the outcome. For instance, in the regression with highdiag, the coefficient for level “2-4” is −0.06973187 < 0, but the coefficient for level “5+” is 0.64036289 > 0, which means that highdiag decreases the probability of “2-4” but increases the probability of “5+”. However, this is illogical since a patient must first live longer than two years and then live longer than five years. Nevertheless, it is not a severe problem in this case study because those coefficients are not significant.
> library(MASS)
> yearpo = polr(survival ~ highdiag + age + rural + male,
+               Hess = TRUE,
+               data = karolinska)
> summary(yearpo)
Call:
polr(formula = survival ~ highdiag + age + rural + male, data = karolinska,
    Hess = TRUE)

Coefficients:
             Value Std. Error t value
highdiag  0.216755    0.35892  0.6039
age      -0.002881    0.01378 -0.2091
rural     0.371898    0.35313  1.0532
male      0.943955    0.43588  2.1656

Intercepts:
        Value  Std. Error t value
1|2-4   1.4079 1.1309     1.2450
2-4|5+  2.9284 1.1514     2.5434

Residual Deviance: 271.0778
AIC: 283.0778
> predict(yearpo, type = "probs")[1:5, ]
          1       2-4         5+
1 0.5862465 0.2800892 0.13366427
2 0.5855475 0.2804542 0.13399823
3 0.8087341 0.1421065 0.04915948
4 0.5862465 0.2800892 0.13366427
5 0.6205983 0.2615112 0.11789050

highdiag is not significant above. The predict function gives the fitted probabilities of
three categories.
> yearpo = polr(survival ~ hightreat + age + rural + male,
+               Hess = TRUE,
+               data = karolinska)
> summary(yearpo)
Call:
polr(formula = survival ~ hightreat + age + rural + male, data = karolinska,
    Hess = TRUE)

Coefficients:
             Value Std. Error t value
hightreat 1.399538    0.44518  3.1438
age       0.008032    0.01438  0.5584
rural     0.638862    0.35450  1.8022
male      1.122698    0.44377  2.5299

Intercepts:
        Value  Std. Error t value
1|2-4   3.3273 1.2752     2.6092
2-4|5+  4.9258 1.3106     3.7583

Residual Deviance: 260.2831
AIC: 272.2831

hightreat is significant above.

21.6 Discrete choice models


21.6.1 Model
The covariates in model (21.3) depend only on individuals. McFadden (1974) extends it
to allow for choice-specific covariates zik . His formulation is based on the latent utility
representation:
\[
\begin{cases}
U_{i1} = z_{i1}^{t}\theta + \varepsilon_{i1}, \\
\quad\vdots \\
U_{iK} = z_{iK}^{t}\theta + \varepsilon_{iK},
\end{cases}
\]
where εi1, . . . , εiK are IID standard Gumbel. Unit i chooses k if k has the highest utility. Lemma 21.1 implies that
\[
\pi_k(z_i) = \pi_k(z_i, \theta) = \frac{e^{z_{ik}^{t}\theta}}{\sum_{l=1}^K e^{z_{il}^{t}\theta}}, \quad (k = 1, \ldots, K). \tag{21.8}
\]

Model (21.8) seems rather similar to model (21.3). However, there are many subtle differences. First, a component of zik may vary only with choice k; for example, it can represent the price of choice k. Partition zik into three types of covariates: xi that only vary across individuals, ck that only vary across choices, and wik that vary across both individuals and choices. Model (21.8) reduces to
\[
\pi_k(z_i) = \frac{e^{x_i^{t}\theta_x + c_k^{t}\theta_c + w_{ik}^{t}\theta_w}}{\sum_{l=1}^K e^{x_i^{t}\theta_x + c_l^{t}\theta_c + w_{il}^{t}\theta_w}}
= \frac{e^{c_k^{t}\theta_c + w_{ik}^{t}\theta_w}}{\sum_{l=1}^K e^{c_l^{t}\theta_c + w_{il}^{t}\theta_w}},
\]
that is, the individual-specific covariates drop out. Therefore, zik in model (21.8) does not contain covariates that vary only with individuals. In particular, zik in model (21.8) does not contain the constant, but in contrast, the xi in model (21.3) usually contains the intercept by default.
by default.
Second, if we want to use individual-specific covariates in the model, they must have choice-specific coefficients. So a more general model unifying (21.3) and (21.8) is
\[
\pi_k(x_i, w_{ik}, \theta, \beta) = \frac{e^{w_{ik}^{t}\theta + x_i^{t}\beta_k}}{\sum_{l=1}^K e^{w_{il}^{t}\theta + x_i^{t}\beta_l}}, \quad (k = 1, \ldots, K). \tag{21.9}
\]
Equivalently, we can create pseudo covariates zik as the original wik together with the interaction of xi and the dummy for choice k. For example, if K = 3 and xi contains the intercept and a scalar individual-specific covariate, then (zi1, zi2, zi3) are
\[
\begin{pmatrix} z_{i1} \\ z_{i2} \\ z_{i3} \end{pmatrix} =
\begin{pmatrix}
w_{i1} & 1 & 0 & x_i & 0 \\
w_{i2} & 0 & 1 & 0 & x_i \\
w_{i3} & 0 & 0 & 0 & 0
\end{pmatrix},
\]
where K = 3 is the reference level. So with augmented covariates, the discrete choice model (21.8) is strictly more general than the multinomial logistic model (21.3). In the special case with K = 2, model (21.8) reduces to
\[
\mathrm{pr}(y_i = 1 \mid x_i) = \frac{e^{x_i^{t}\beta}}{1 + e^{x_i^{t}\beta}},
\]
where xi = zi1 − zi2.
Based on the model specification (21.8), the log-likelihood function is
\[
\log L(\theta) = \sum_{i=1}^n\sum_{k=1}^K 1(y_i = k)\left(z_{ik}^{t}\theta - \log\sum_{l=1}^K e^{z_{il}^{t}\theta}\right).
\]

So the score function is
\[
\begin{aligned}
\frac{\partial}{\partial\theta}\log L(\theta)
&= \sum_{i=1}^n\sum_{k=1}^K 1(y_i = k)\left(z_{ik} - \frac{\sum_{l=1}^K e^{z_{il}^{t}\theta}z_{il}}{\sum_{l=1}^K e^{z_{il}^{t}\theta}}\right) \\
&= \sum_{i=1}^n\sum_{k=1}^K 1(y_i = k)\{z_{ik} - E(z_{ik}; \theta)\},
\end{aligned}
\]
where E(·; θ) is the average value of {zi1, . . . , ziK} over the probability mass function
\[
p_k(\theta) = e^{z_{ik}^{t}\theta}\Big/\sum_{l=1}^K e^{z_{il}^{t}\theta}.
\]

The Hessian matrix is
\[
\begin{aligned}
\frac{\partial^2}{\partial\theta\partial\theta^{t}}\log L(\theta)
&= -\sum_{i=1}^n\sum_{k=1}^K 1(y_i=k)\,
\frac{\sum_{l=1}^K e^{z_{il}^{t}\theta}\sum_{l=1}^K e^{z_{il}^{t}\theta}z_{il}z_{il}^{t}
- \sum_{l=1}^K e^{z_{il}^{t}\theta}z_{il}\left(\sum_{l=1}^K e^{z_{il}^{t}\theta}z_{il}\right)^{t}}
{\left(\sum_{l=1}^K e^{z_{il}^{t}\theta}\right)^2} \\
&= -\sum_{i=1}^n\sum_{k=1}^K 1(y_i=k)\,\mathrm{cov}(z_{ik}; \theta),
\end{aligned}
\]
where cov(·; θ) is the covariance matrix of {zi1, . . . , ziK} over the probability mass function defined above. From these formulas, we can easily compute the MLE using Newton's method and obtain its asymptotic distribution based on the inverse of the Fisher information matrix.

21.6.2 Example
The R package mlogit provides a function mlogit to fit the general discrete logistic model
(Croissant, 2020). Here I use an example from this package to illustrate the model fitting
of mlogit. The R code is in code20.6.R.
> library("nnet")
> library("mlogit")
> data("Fishing")
> head(Fishing)
mode price . beach price . pier price . boat price . charter
1 charter 157.930 157.930 157.930 182.930
2 charter 15.114 15.114 10.534 34.534
3 boat 161.874 161.874 24.334 59.334
4 pier 15.134 15.134 55.930 84.930
5 boat 106.930 106.930 41.514 71.014
6 charter 192.474 192.474 28.934 63.934
catch . beach catch . pier catch . boat catch . charter income
1 0.0678 0.0503 0.2601 0.5391 7083.332
2 0.1049 0.0451 0.1574 0.4671 1250.000
3 0.5333 0.4522 0.2413 1.0266 3750.000
4 0.0678 0.0789 0.1643 0.5391 2083.333
5 0.0678 0.0503 0.1082 0.3240 4583.332
6 0.5333 0.4522 0.1665 0.3975 4583.332

The dataset Fishing is in the “wide” format, where mode denotes the choice of four modes
of fishing (beach, pier, boat and charter), price and catch denote the price and catching
rates which are choice-specific, income is individual-specific. We need to first transform the
dataset into “long” format.
> Fish = dfidx(Fishing,
+              varying = 2:9,
+              shape = "wide",
+              choice = "mode")
> head(Fish)
~~~~~~~
 first 10 observations out of 4728
~~~~~~~
     mode   income   price  catch    idx
1   FALSE 7083.332 157.930 0.0678 1:each
2   FALSE 7083.332 157.930 0.2601 1:boat
3    TRUE 7083.332 182.930 0.5391 1:rter
4   FALSE 7083.332 157.930 0.0503 1:pier
5   FALSE 1250.000  15.114 0.1049 2:each
6   FALSE 1250.000  10.534 0.1574 2:boat
7    TRUE 1250.000  34.534 0.4671 2:rter
8   FALSE 1250.000  15.114 0.0451 2:pier
9   FALSE 3750.000 161.874 0.5333 3:each
10   TRUE 3750.000  24.334 0.2413 3:boat

Using only choice-specific covariates, we have the following fitted model:


> summary(mlogit(mode ~ 0 + price + catch, data = Fish))

Call:
mlogit(formula = mode ~ 0 + price + catch, data = Fish, method = "nr")

Frequencies of alternatives: choice
  beach    boat charter    pier
0.11337 0.35364 0.38240 0.15059

nr method
6 iterations, 0h:0m:0s
g'(-H)^-1g = 0.000179
successive function values within tolerance limits

Coefficients :
        Estimate Std. Error  z-value  Pr(>|z|)
price -0.0204765  0.0012231 -16.742 < 2.2e-16 ***
catch  0.9530982  0.0894134  10.659 < 2.2e-16 ***

Log-Likelihood: -1312

If we do not enforce 0 + price, we allow for intercepts that vary across choices:
> summary(mlogit(mode ~ price + catch, data = Fish))

Call:
mlogit(formula = mode ~ price + catch, data = Fish, method = "nr")

Frequencies of alternatives: choice
  beach    boat charter    pier
0.11337 0.35364 0.38240 0.15059

nr method
7 iterations, 0h:0m:0s
g'(-H)^-1g = 6.22E-06
successive function values within tolerance limits

Coefficients :
                      Estimate Std. Error  z-value  Pr(>|z|)
(Intercept):boat     0.8713749  0.1140428   7.6408 2.154e-14 ***
(Intercept):charter  1.4988884  0.1329328  11.2755 < 2.2e-16 ***
(Intercept):pier     0.3070552  0.1145738   2.6800 0.0073627 **
price               -0.0247896  0.0017044 -14.5444 < 2.2e-16 ***
catch                0.3771689  0.1099707   3.4297 0.0006042 ***

Log-Likelihood: -1230.8
McFadden R^2:  0.17823
Likelihood ratio test : chisq = 533.88 (p.value = < 2.22e-16)

Using only individual-specific covariates, we have the following fitted model:


> summary(mlogit(mode ~ 0 | income, data = Fish))

Call:
mlogit(formula = mode ~ 0 | income, data = Fish, method = "nr")

Frequencies of alternatives: choice
  beach    boat charter    pier
0.11337 0.35364 0.38240 0.15059

nr method
4 iterations, 0h:0m:0s
g'(-H)^-1g = 8.32E-07
gradient close to zero

Coefficients :
                       Estimate  Std. Error z-value  Pr(>|z|)
(Intercept):boat     7.3892e-01  1.9673e-01  3.7560 0.0001727 ***
(Intercept):charter  1.3413e+00  1.9452e-01  6.8955 5.367e-12 ***
(Intercept):pier     8.1415e-01  2.2863e-01  3.5610 0.0003695 ***
income:boat          9.1906e-05  4.0664e-05  2.2602 0.0238116 *
income:charter      -3.1640e-05  4.1846e-05 -0.7561 0.4495908
income:pier         -1.4340e-04  5.3288e-05 -2.6911 0.0071223 **

Log-Likelihood: -1477.2
McFadden R^2:  0.013736
Likelihood ratio test : chisq = 41.145 (p.value = 6.0931e-09)

It is equivalent to fitting the multinomial logistic model:


summary(multinom(mode ~ income, data = Fishing))

The most general model includes all covariates.


> summary(mlogit(mode ~ price + catch | income, data = Fish))

Call:
mlogit(formula = mode ~ price + catch | income, data = Fish,
    method = "nr")

Frequencies of alternatives: choice
  beach    boat charter    pier
0.11337 0.35364 0.38240 0.15059

nr method
7 iterations, 0h:0m:0s
g'(-H)^-1g = 1.37E-05
successive function values within tolerance limits

Coefficients :
                       Estimate  Std. Error  z-value  Pr(>|z|)
(Intercept):boat     5.2728e-01  2.2279e-01   2.3667 0.0179485 *
(Intercept):charter  1.6944e+00  2.2405e-01   7.5624 3.952e-14 ***
(Intercept):pier     7.7796e-01  2.2049e-01   3.5283 0.0004183 ***
price               -2.5117e-02  1.7317e-03 -14.5042 < 2.2e-16 ***
catch                3.5778e-01  1.0977e-01   3.2593 0.0011170 **
income:boat          8.9440e-05  5.0067e-05   1.7864 0.0740345 .
income:charter      -3.3292e-05  5.0341e-05  -0.6613 0.5084031
income:pier         -1.2758e-04  5.0640e-05  -2.5193 0.0117582 *

Log-Likelihood: -1215.1
McFadden R^2:  0.18868
Likelihood ratio test : chisq = 565.17 (p.value = < 2.22e-16)

21.6.3 More comments


The assumption of Gumbel error terms is very strong. However, relaxing this assumption
leads to much more complicated forms of the conditional probabilities of the outcome. The
model (21.8) implies that
\[
\frac{\pi_k(z_i)}{\pi_l(z_i)} = \exp\{(z_{ik} - z_{il})^{t}\theta\},
\]
so the choice between k and l does not depend on the existence of other choices. This is
called the independence of irrelevant alternatives (IIA) assumption. This is often a plausible
assumption. However, it may be violated. For example, with the apple and orange, someone
chooses the apple; but with the apple, orange, and banana, s/he may choose the orange.
The model (21.8) is the basic form of the discrete choice model. Train (2009) is a mono-
graph on this topic which provides many extensions.

21.7 Homework problems


21.1 Inverse model for the multinomial logit model
The following result extends Theorem 20.2.
Assume that

yi ∼ Multinomial(1; q1 , . . . , qK ), (21.10)

and

xi | yi = k ∼ N(µk , Σ), (21.11)

where xi does not contain 1. We can verify that yi | xi follows a multinomial logit model as
shown in the theorem below.

Theorem 21.1 Under (21.10) and (21.11), we have
\[
\mathrm{pr}(y_i = k \mid x_i) = \frac{e^{\alpha_k + x_i^{t}\beta_k}}{\sum_{l=1}^K e^{\alpha_l + x_i^{t}\beta_l}},
\]
where
\[
\alpha_k = \log q_k - \frac{1}{2}\mu_k^{t}\Sigma^{-1}\mu_k, \qquad \beta_k = \Sigma^{-1}\mu_k.
\]
Prove Theorem 21.1.

21.2 Hessian matrix in the multinomial logit model


Prove that the Hessian matrix (21.4) of the log-likelihood function of the multinomial logit
model is negative semi-definite.
Hint: Use Proposition 21.1.

21.3 Iteratively reweighted least squares algorithm for the multinomial logit model
Similar to the binary logistic model, Newton’s method for computing the MLE for the
multinomial logit model can be written as iteratively reweighted least squares. Give the
details.

21.4 Score function of the proportional odds model


Derive the explicit formulas of the score function of the proportional odds model.

21.5 Ordered Probit regression


If we choose εi | xi ∼ N(0, 1) in (21.5), then the corresponding model is called the ordered
Probit regression. Write down the likelihood function and derive the score function for this
model.
Remark: You can use the function polr in R to fit this model with the specification
method = "probit".

21.6 Case-control study and multinomial logistic model


This problem extends Theorem 20.1.
Assume that
\[
\mathrm{pr}(y_i = k \mid x_i) = \frac{e^{\alpha_k + x_i^{t}\beta_k}}{\sum_{l=1}^K e^{\alpha_l + x_i^{t}\beta_l}}
\]
and
\[
\mathrm{pr}(s_i = 1 \mid y_i = k, x_i) = \mathrm{pr}(s_i = 1 \mid y_i = k) = p_k
\]
for k = 1, . . . , K. Show that
\[
\mathrm{pr}(y_i = k \mid x_i, s_i = 1) = \frac{e^{\tilde\alpha_k + x_i^{t}\beta_k}}{\sum_{l=1}^K e^{\tilde\alpha_l + x_i^{t}\beta_l}}
\]
with α̃k = αk + log pk for k = 1, . . . , K.


22
Regression Models for Count Outcomes

A random variable for counts can take values in {0, 1, 2, . . .}. This type of variable is common
in applied statistics. For example, it can represent how many times you visit the gym every
week, how many lectures you have missed in the linear model course, how many traffic
accidents happened in certain areas during certain periods, etc. This chapter focuses on
statistical modeling of those outcomes given covariates. Hilbe (2014) is a textbook focusing
on count outcome regressions.

22.1 Some random variables for counts


I first review four canonical choices of random variables for modeling count data.

22.1.1 Poisson
A random variable y is Poisson(λ) if its probability mass function is
\[
\mathrm{pr}(y = k) = e^{-\lambda}\frac{\lambda^k}{k!}, \quad (k = 0, 1, 2, \ldots),
\]
which sums to 1 by the Taylor expansion formula $e^{\lambda} = \sum_{k=0}^{\infty} \lambda^k/k!$. The Poisson(λ) random variable has the following properties:
Proposition 22.1 If y ∼ Poisson(λ), then E(y) = var(y) = λ.
Proposition 22.2 If y1 , . . . , yK are mutually independent with yk ∼ Poisson(λk ) for k =
1, . . . , K, then
y1 + · · · + yK ∼ Poisson(λ),
and
(y1 , . . . , yK ) | y1 + · · · + yK = n ∼ Multinomial (n, (λ1 /λ, . . . , λK /λ)) ,
where λ = λ1 + · · · + λK .
Conversely, if S ∼ Poisson(λ) with λ = λ1 + · · · + λK , and (y1 , . . . , yK ) | S =
n ∼ Multinomial (n, (λ1 /λ, . . . , λK /λ)), then y1 , . . . , yK are mutually independent with
yk ∼ Poisson(λk ) for k = 1, . . . , K.
Where does the Poisson random variable come from? One way to generate Poisson is
through independent Bernoulli random variables. I will review Le Cam (1960)’s theorem
below without giving a proof.
Theorem 22.1 Suppose the Xi's are independent Bernoulli random variables with probabilities pi's (i = 1, . . . , n). Define $\lambda_n = \sum_{i=1}^n p_i$ and $S_n = \sum_{i=1}^n X_i$. Then
\[
\sum_{k=0}^{\infty}\left|\mathrm{pr}(S_n = k) - e^{-\lambda_n}\frac{\lambda_n^k}{k!}\right| \le 2\sum_{i=1}^n p_i^2.
\]


As a special case, if pi = λ/n, then Theorem 22.1 implies
\[
\sum_{k=0}^{\infty}\left|\mathrm{pr}(S_n = k) - e^{-\lambda}\frac{\lambda^k}{k!}\right| \le 2\sum_{i=1}^n (\lambda/n)^2 = 2\lambda^2/n \to 0.
\]

So the sum of IID Bernoulli random variables is approximately Poisson if the probability
has order 1/n. This is called the law of rare events, or Poisson limit theorem, or Le Cam’s
theorem. By Theorem 22.1, we can use Poisson as a model for the sum of many rare events.
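A quick simulation sketch (my own addition; the values of n and λ are arbitrary) illustrates the law of rare events: the sum of n independent Bernoulli(λ/n) variables is approximately Poisson(λ).

## sketch: Poisson approximation to a sum of rare Bernoulli events
set.seed(1)
n = 1000; lambda = 3; nsim = 1e6
S = rbinom(nsim, size = n, prob = lambda / n)    # sum of n IID Bernoulli(lambda/n)
l1 = sum(abs(table(factor(S, levels = 0:30)) / nsim - dpois(0:30, lambda)))
l1   # small; compare with the Le Cam bound 2 * n * (lambda / n)^2 = 0.018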

22.1.2 Negative-Binomial
The Poisson distribution restricts that the mean must be the same as the variance. It
cannot capture the feature of overdispersed data with variance larger than the mean. The
Negative-Binomial is an extension of the Poisson that allows for overdispersion. Here the
definition of the Negative-Binomial below is different from its standard definition, but it is
more natural as an extension of the Poisson1 . Define y as the Negative-Binomial random
variable, denoted by NB(µ, θ) with µ > 0 and θ > 0, if
\[
\begin{cases}
y \mid \lambda \sim \mathrm{Poisson}(\lambda), \\
\lambda \sim \mathrm{Gamma}(\theta, \theta/\mu).
\end{cases} \tag{22.1}
\]

So the Negative-Binomial is the Poisson with a random Gamma intensity, that is, the
Negative-Binomial is a scale mixture of the Poisson. If θ → ∞, then λ is a point mass at µ
and the Negative-Binomial reduces to Poisson(µ). We can verify that it has the following
probability mass function.

Proposition 22.3 The Negative-Binomial random variable defined in (22.1) has the prob-
ability mass function
\[
\mathrm{pr}(y = k) = \frac{\Gamma(k+\theta)}{\Gamma(k+1)\Gamma(\theta)}\left(\frac{\theta}{\mu+\theta}\right)^{\theta}\left(\frac{\mu}{\mu+\theta}\right)^{k}, \quad (k = 0, 1, 2, \ldots).
\]
Proof of Proposition 22.3: We have
\[
\begin{aligned}
\mathrm{pr}(y = k) &= \int_0^{\infty} \mathrm{pr}(y = k \mid \lambda)f(\lambda)\,\mathrm{d}\lambda \\
&= \int_0^{\infty} e^{-\lambda}\frac{\lambda^k}{k!}\,\frac{(\theta/\mu)^{\theta}}{\Gamma(\theta)}\lambda^{\theta-1}e^{-(\theta/\mu)\lambda}\,\mathrm{d}\lambda \\
&= \frac{(\theta/\mu)^{\theta}}{k!\,\Gamma(\theta)}\int_0^{\infty} \lambda^{k+\theta-1}e^{-(1+\theta/\mu)\lambda}\,\mathrm{d}\lambda.
\end{aligned}
\]
The function in the integral is the density function of Gamma(k + θ, 1 + θ/µ) without the normalizing constant
\[
\frac{(1+\theta/\mu)^{k+\theta}}{\Gamma(k+\theta)}.
\]
1 With IID Bernoulli(p) trials, the Negative-Binomial distribution, denoted by y ∼ NB′(r, p), is the number of successes before the rth failure. Its probability mass function is
\[
\mathrm{pr}(y = k) = \binom{k+r-1}{k}(1-p)^r p^k, \quad (k = 0, 1, 2, \ldots).
\]
If p = µ/(µ + θ) and r = θ then these two definitions coincide. This definition is more restrictive because r must be an integer.

So
\[
\mathrm{pr}(y = k) = \frac{(\theta/\mu)^{\theta}}{k!\,\Gamma(\theta)}\Big/\frac{(1+\theta/\mu)^{k+\theta}}{\Gamma(k+\theta)}
= \frac{\Gamma(k+\theta)}{k!\,\Gamma(\theta)}\left(\frac{\theta}{\mu+\theta}\right)^{\theta}\left(\frac{\mu}{\mu+\theta}\right)^{k}. \qquad\square
\]
We can derive the mean and variance of the Negative-Binomial.

Proposition 22.4 The Negative-Binomial random variable defined in (22.1) has moments
\[
E(y) = \mu, \qquad \mathrm{var}(y) = \mu + \frac{\mu^2}{\theta} > E(y).
\]
Proof of Proposition 22.4: Recall from Proposition B.2 that a Gamma(α, β) random variable has mean α/β and variance α/β². We have
\[
E(y) = E\{E(y \mid \lambda)\} = E(\lambda) = \frac{\theta}{\theta/\mu} = \mu,
\]
and
\[
\mathrm{var}(y) = E\{\mathrm{var}(y \mid \lambda)\} + \mathrm{var}\{E(y \mid \lambda)\}
= E(\lambda) + \mathrm{var}(\lambda)
= \frac{\theta}{\theta/\mu} + \frac{\theta}{(\theta/\mu)^2}
= \mu + \frac{\mu^2}{\theta}. \qquad\square
\]
So the dispersion parameter θ controls the variance of the Negative-Binomial. With the
same mean, the Negative-Binomial has a larger variance than Poisson. Figure 22.1 further
compares the log probability mass functions of the Negative Binomial and Poisson. It shows
that the Negative Binomial has a slightly higher probability at zero but much heavier tails
than the Poisson.
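A simulation sketch (added here for illustration; µ and θ are arbitrary) draws from the Gamma-Poisson mixture (22.1) and checks the moment formulas of Proposition 22.4, as well as R's own Negative-Binomial parameterization:

## sketch: Negative-Binomial as a Gamma-Poisson mixture
set.seed(1)
mu = 5; theta = 2; nsim = 1e6
lambda = rgamma(nsim, shape = theta, rate = theta / mu)
y = rpois(nsim, lambda)
c(mean(y), var(y), mu, mu + mu^2 / theta)   # empirical versus theoretical moments
## the draws also match dnbinom with size = theta and mu = mu
max(abs(table(factor(y, levels = 0:20)) / nsim - dnbinom(0:20, size = theta, mu = mu)))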

22.1.3 Zero-inflated count distributions


Many count distributions have larger masses at zero compared to Poisson and Negative-
Binomial. Therefore, it is also important to have more general distributions capturing this
feature of empirical data. We can simply add an additional zero component to the Poisson
or the Negative-Binomial.
A zero-inflated Poisson random variable y is a mixture of two components: a point mass
at zero and a Poisson(λ) random variable, with probabilities p and 1 − p, respectively. So y
has the probability mass function
\[
\mathrm{pr}(y = k) = \begin{cases}
p + (1-p)e^{-\lambda}, & \text{if } k = 0, \\
(1-p)e^{-\lambda}\lambda^k/k!, & \text{if } k = 1, 2, \ldots,
\end{cases}
\]

and moments below:


Proposition 22.5 E(y) = (1 − p)λ and var(y) = (1 − p)λ(1 + pλ).

A zero-inflated Negative-Binomial random variable y is a mixture of two components: a point mass at zero and a NB(µ, θ) random variable, with probabilities p and 1 − p, respectively. So y has probability mass function
\[
\mathrm{pr}(y = k) = \begin{cases}
p + (1-p)\left(\dfrac{\theta}{\mu+\theta}\right)^{\theta}, & \text{if } k = 0, \\[1ex]
(1-p)\dfrac{\Gamma(k+\theta)}{\Gamma(k+1)\Gamma(\theta)}\left(\dfrac{\theta}{\mu+\theta}\right)^{\theta}\left(\dfrac{\mu}{\mu+\theta}\right)^{k}, & \text{if } k = 1, 2, \ldots,
\end{cases}
\]
and moments below:

Proposition 22.6 E(y) = (1 − p)µ and var(y) = (1 − p)µ(1 + µ/θ + pµ).

I leave the proofs of Propositions 22.5 and 22.6 as Problem 22.2.

FIGURE 22.1: Comparing the log probabilities of Poisson and Negative-Binomial with the same mean (panels: µ ∈ {1, 5} and θ ∈ {1, 10})

22.2 Regression models for counts


To model a count outcome yi given xi , we can still use OLS. However, a problem with OLS
is that the predicted value can be negative. This can be easily fixed by running OLS of
log(yi + 1) given xi . However, this still does not reflect the fact that yi is a count outcome.
For example, these two OLS fits cannot easily make a prediction for the probabilities pr(yi ≥
1 | xi ) or pr(yi > 3 | xi ). A more direct approach is to model the conditional distribution
of yi given xi using the distributions reviewed in Section 22.1.

22.2.1 Poisson regression


Poisson regression assumes
\[
\begin{cases}
y_i \mid x_i \sim \mathrm{Poisson}(\lambda_i), \\
\lambda_i = \lambda(x_i, \beta) = e^{x_i^{t}\beta}.
\end{cases}
\]
So the mean and variance of yi | xi are
\[
E(y_i \mid x_i) = \mathrm{var}(y_i \mid x_i) = e^{x_i^{t}\beta}.
\]

Because
log E(yi | xi ) = xti β,
this model is sometimes called the log-linear model, with the coefficient βj interpreted as
the conditional log mean ratio:

\[
\log\frac{E(y_i \mid \ldots, x_{ij}+1, \ldots)}{E(y_i \mid \ldots, x_{ij}, \ldots)} = \beta_j.
\]

The likelihood function for independent Poisson random variables is
\[
L(\beta) = \prod_{i=1}^n e^{-\lambda_i}\frac{\lambda_i^{y_i}}{y_i!} \propto \prod_{i=1}^n e^{-\lambda_i}\lambda_i^{y_i},
\]
and omitting the constants, we can write the log-likelihood function as
\[
\log L(\beta) = \sum_{i=1}^n(-\lambda_i + y_i\log\lambda_i) = \sum_{i=1}^n\left(-e^{x_i^{t}\beta} + y_i x_i^{t}\beta\right).
\]

The score function is
\[
\frac{\partial\log L(\beta)}{\partial\beta}
= \sum_{i=1}^n\left(-x_i e^{x_i^{t}\beta} + x_i y_i\right)
= \sum_{i=1}^n x_i\left(y_i - e^{x_i^{t}\beta}\right)
= \sum_{i=1}^n x_i\{y_i - \lambda(x_i, \beta)\},
\]

and the Hessian matrix is
\[
\frac{\partial^2\log L(\beta)}{\partial\beta\partial\beta^{t}}
= -\sum_{i=1}^n x_i\frac{\partial}{\partial\beta^{t}}\left(e^{x_i^{t}\beta}\right)
= -\sum_{i=1}^n e^{x_i^{t}\beta}x_i x_i^{t},
\]
which is negative semi-definite. When the Hessian is negative definite, the MLE is unique. It must satisfy
\[
\sum_{i=1}^n x_i\left(y_i - e^{x_i^{t}\hat\beta}\right) = \sum_{i=1}^n x_i\left\{y_i - \lambda(x_i, \hat\beta)\right\} = 0.
\]
We can solve this nonlinear equation using Newton's method:
\[
\begin{aligned}
\beta^{\mathrm{new}} &= \beta^{\mathrm{old}} - \left\{\frac{\partial^2\log L(\beta^{\mathrm{old}})}{\partial\beta\partial\beta^{t}}\right\}^{-1}\frac{\partial\log L(\beta^{\mathrm{old}})}{\partial\beta} \\
&= \beta^{\mathrm{old}} + (X^{t}W^{\mathrm{old}}X)^{-1}X^{t}(Y - \Lambda^{\mathrm{old}}),
\end{aligned}
\]
where
\[
X = \begin{pmatrix} x_1^{t} \\ \vdots \\ x_n^{t} \end{pmatrix}, \qquad
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},
\]
and
\[
\Lambda^{\mathrm{old}} = \begin{pmatrix} \exp(x_1^{t}\beta^{\mathrm{old}}) \\ \vdots \\ \exp(x_n^{t}\beta^{\mathrm{old}}) \end{pmatrix}, \qquad
W^{\mathrm{old}} = \mathrm{diag}\left\{\exp(x_i^{t}\beta^{\mathrm{old}})\right\}_{i=1}^n.
\]
Similar to the derivation for the logit model, we can simplify Newton's method to
\[
\beta^{\mathrm{new}} = (X^{t}W^{\mathrm{old}}X)^{-1}X^{t}W^{\mathrm{old}}Z^{\mathrm{old}},
\qquad\text{where}\quad
Z^{\mathrm{old}} = X\beta^{\mathrm{old}} + (W^{\mathrm{old}})^{-1}(Y - \Lambda^{\mathrm{old}}).
\]
So we have an iteratively reweighted least squares algorithm. In R, we can use the glm function with “family = poisson(link = "log")” to fit the Poisson regression, which uses Newton's method.
Statistical inference is based on
\[
\hat\beta \stackrel{a}{\sim} N\left[\beta,\ \left\{-\frac{\partial^2\log L(\hat\beta)}{\partial\beta\partial\beta^{t}}\right\}^{-1}\right]
= N\left\{\beta,\ (X^{t}\hat W X)^{-1}\right\},
\]
where $\hat W = \mathrm{diag}\{\exp(x_i^{t}\hat\beta)\}_{i=1}^n$.
After obtaining the MLE, we can predict the mean E(yi | xi) by λ̂i = e^{x_i^{t}\hat\beta}. Because Poisson regression is a fully parametrized model, we can also predict any other probability quantities involving yi | xi. For example, we can predict pr(yi = 0 | xi) by e^{-\hatλ_i}, and pr(yi ≥ 3 | xi) by 1 − e^{-\hatλ_i}(1 + λ̂i + λ̂i²/2).
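The following is a minimal sketch (my own addition, using simulated data) of the iteratively reweighted least squares updates above, checked against glm:

## sketch: IRLS for Poisson regression versus glm
set.seed(1)
n = 1000
X = cbind(1, rnorm(n), rnorm(n))
beta.true = c(0.5, -0.3, 0.2)
y = rpois(n, exp(X %*% beta.true))

beta = rep(0, ncol(X))                      # starting value
for (iter in 1:20) {
  lambda = as.vector(exp(X %*% beta))       # current fitted means
  W = lambda                                # diagonal of W^old
  z = X %*% beta + (y - lambda) / lambda    # working response Z^old
  beta = solve(t(X) %*% (W * X), t(X) %*% (W * z))
}
cbind(irls = as.vector(beta),
      glm  = coef(glm(y ~ 0 + X, family = poisson(link = "log"))))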

22.2.2 Negative-Binomial regression


Negative-Binomial regression assumes
\[
\begin{cases}
y_i \mid x_i \sim \mathrm{NB}(\mu_i, \theta), \\
\mu_i = e^{x_i^{t}\beta},
\end{cases}
\]
so it has conditional mean and variance
\[
E(y_i \mid x_i) = e^{x_i^{t}\beta}, \qquad
\mathrm{var}(y_i \mid x_i) = e^{x_i^{t}\beta}\left(1 + e^{x_i^{t}\beta}/\theta\right).
\]
It is also a log-linear model.


The log-likelihood function for Negative-Binomial regression is $\log L(\beta, \theta) = \sum_{i=1}^n l_i(\beta, \theta)$ with
\[
l_i(\beta, \theta) = \log\Gamma(y_i+\theta) - \log\Gamma(y_i+1) - \log\Gamma(\theta)
+ \theta\log\left(\frac{\theta}{\mu_i+\theta}\right) + y_i\log\left(\frac{\mu_i}{\mu_i+\theta}\right),
\]
where $\mu_i = e^{x_i^{t}\beta}$ has partial derivative ∂µi/∂β = µi xi. We can use Newton's algorithm or the Fisher scoring algorithm to compute the MLE (β̂, θ̂), which requires deriving the first and second derivatives of log L(β, θ) with respect to (β, θ). I will derive some important components and relegate other details to Problem 22.1. First,
\[
\frac{\partial\log L(\beta, \theta)}{\partial\beta} = \sum_{i=1}^n (1+\mu_i/\theta)^{-1}(y_i-\mu_i)x_i.
\]
The corresponding first-order condition can be viewed as the estimating equation of Poisson regression with weights (1 + µi/θ)^{-1}. Second,
\[
\frac{\partial^2\log L(\beta, \theta)}{\partial\beta\partial\theta} = \sum_{i=1}^n \frac{\mu_i}{(\mu_i+\theta)^2}(y_i-\mu_i)x_i.
\]
We can verify
\[
E\left\{\frac{\partial^2\log L(\beta, \theta)}{\partial\beta\partial\theta}\,\Big|\, X\right\} = 0
\]
since each term inside the summation has conditional expectation zero. This implies that the Fisher information matrix is block diagonal, so β̂ and θ̂ are asymptotically independent.

The function glm.nb in the MASS package iterates between β and θ: given θ, update β based on Fisher scoring; given β, update θ based on Newton's algorithm. It reports standard errors based on the inverse of the Fisher information matrix.2
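As a short sketch (my own addition, with simulated overdispersed counts and arbitrary parameter values), we can fit the Negative-Binomial regression with glm.nb and compare it with a Poisson fit:

## sketch: Negative-Binomial versus Poisson regression on overdispersed data
library(MASS)
set.seed(1)
n = 2000
x = rnorm(n)
mu = exp(0.5 + 0.8 * x); theta = 2
y = rnbinom(n, size = theta, mu = mu)      # NB(mu, theta) as parameterized above
nb.fit   = glm.nb(y ~ x)
pois.fit = glm(y ~ x, family = poisson(link = "log"))
coef(nb.fit); nb.fit$theta                 # close to (0.5, 0.8) and theta = 2
## the point estimates are similar, but the Poisson fit understates the standard errors
cbind(nb = summary(nb.fit)$coef[, 2], pois = summary(pois.fit)$coef[, 2])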

22.2.3 Zero-inflated regressions


The zero-inflated Poisson regression assumes that
\[
y_i \mid x_i \sim \begin{cases}
0, & \text{with probability } p_i, \\
\mathrm{Poisson}(\lambda_i), & \text{with probability } 1-p_i,
\end{cases}
\qquad\text{where}\quad
p_i = \frac{e^{x_i^{t}\gamma}}{1+e^{x_i^{t}\gamma}}, \quad \lambda_i = e^{x_i^{t}\beta}.
\]
The zero-inflated Negative-Binomial regression assumes that
\[
y_i \mid x_i \sim \begin{cases}
0, & \text{with probability } p_i, \\
\mathrm{NB}(\mu_i, \theta), & \text{with probability } 1-p_i,
\end{cases}
\qquad\text{where}\quad
p_i = \frac{e^{x_i^{t}\gamma}}{1+e^{x_i^{t}\gamma}}, \quad \mu_i = e^{x_i^{t}\beta}.
\]
To avoid over-parametrization, we can also restrict some coefficients to be zero. The zeroinfl function in the R package pscl can fit the zero-inflated Poisson and Negative-Binomial regressions.
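A small sketch (my own, with arbitrary parameter values) simulates from the zero-inflated Poisson model above and fits it with zeroinfl; the "|" in the formula separates the count model (β) from the zero-inflation model (γ):

## sketch: simulate and fit a zero-inflated Poisson regression
library(pscl)
set.seed(1)
n = 2000
x = rnorm(n)
p = plogis(-1 + 0.5 * x)                  # zero-inflation probabilities
lambda = exp(1 + 0.3 * x)                 # Poisson means
y = ifelse(rbinom(n, 1, p) == 1, 0, rpois(n, lambda))
fit = zeroinfl(y ~ x | x, dist = "poisson")
summary(fit)   # count coefficients near (1, 0.3); zero coefficients near (-1, 0.5)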
2 The command rnbreg in Stata uses the Berndt–Hall–Hall–Hausman (BHHH) algorithm by default,

which may give slightly different numbers compared with R. The BHHH algorithm is similar to Newton’s
algorithm but avoids calculating the Hessian matrix.

22.3 A case study


I will use the dataset from Royer et al. (2015) to illustrate the regressions for count outcomes.
The R code is in code21.3.R. From the regression formula below, we are interested in the
effect of two treatments incentive_commit and incentive on the number of visits to the gym,
controlling for two pretreatment covariates target and member_gym_pre.
> library("foreign")
> library("MASS")
> gym1 = read.dta("gym_treatment_exp_weekly.dta")
> f.reg = weekly_visit ~ incentive_commit + incentive +
+   target + member_gym_pre

22.3.1 Linear, Poisson, and Negative-Binomial regressions

FIGURE 22.2: Linear, Poisson, and Negative-Binomial regressions (point estimates and confidence intervals for the coefficients of incentive_commit and incentive, by week)

Each worker was observed over time, so we run regressions with the outcome data ob-
served in each week. In the following, we compute the linear regression coefficients, standard
errors, and AICs.
> weekids = sort(unique(gym1$incentive_week))
> lweekids = length(weekids)
> coefincentivecommit = 1:lweekids
> coefincentive = 1:lweekids
> seincentivecommit = 1:lweekids
> seincentive = 1:lweekids
> AIClm = 1:lweekids
> for (i in 1:lweekids)
+ {
+   gymweek = gym1[which(gym1$incentive_week == weekids[i]), ]
+   regweek = lm(f.reg, data = gymweek)
+   regweekcoef = summary(regweek)$coef
+
+   coefincentivecommit[i] = regweekcoef[2, 1]
+   coefincentive[i] = regweekcoef[3, 1]
+   seincentivecommit[i] = regweekcoef[2, 2]
+   seincentive[i] = regweekcoef[3, 2]
+
+   AIClm[i] = AIC(regweek)
+ }

FIGURE 22.3: Overdispersion of the data (left: variance against mean of the weekly outcomes; right: point and interval estimates of log(θ) from the Negative-Binomial regressions, by week)

By changing the line with lm by


regweek = glm(f.reg, family = poisson(link = "log"), data = gymweek)

and
regweek = glm.nb(f.reg, data = gymweek)

we obtain the corresponding results from Poisson and Negative-Binomial regressions. Figure
22.2 compares the regression coefficients with the associated confidence intervals over time.
Three regressions give very similar patterns: incentive_commit has both short-term and long-
term effects, but incentive only has short-term effects.
The left panel of Figure 22.3 shows that variances are larger than the means for out-
comes from all weeks, and the right panel of Figure 22.3 shows the point estimates and
confidence intervals of θ from Negative-Binomial regressions. Overall, overdispersion seems
an important feature of the data.

22.3.2 Zero-inflated regressions


Figure 22.4 plots the histograms of the outcomes from four weeks before and four weeks
after the experiment. Eight histograms all show severe zero inflation because most workers

just did not go to the gym regardless of the treatments. Therefore, it seems crucial to
accommodate the zeros in the models.

FIGURE 22.4: Zero-inflation of the data (histograms of the weekly number of gym visits, 0-5, for four weeks before and four weeks after the experiment)

We now fit zero-inflated Poisson regressions. The model has parameters for the zero
component and parameters for the Poisson components.
> library("pscl")
> coefincentivecommit0 = coefincentivecommit
> coefincentive0 = coefincentive
> seincentivecommit0 = seincentivecommit
> seincentive0 = seincentive
> AIC0poisson = AICnb
> for (i in 1:lweekids)
+ {
+   gymweek = gym1[which(gym1$incentive_week == weekids[i]), ]
+   regweek = zeroinfl(f.reg, dist = "poisson", data = gymweek)
+   regweekcoef = summary(regweek)$coef
+
+   coefincentivecommit[i] = regweekcoef$count[2, 1]
+   coefincentive[i] = regweekcoef$count[3, 1]
+   seincentivecommit[i] = regweekcoef$count[2, 2]
+   seincentive[i] = regweekcoef$count[3, 2]
+
+   coefincentivecommit0[i] = regweekcoef$zero[2, 1]
+   coefincentive0[i] = regweekcoef$zero[3, 1]
+   seincentivecommit0[i] = regweekcoef$zero[2, 2]
+   seincentive0[i] = regweekcoef$zero[3, 2]
+
+   AIC0poisson[i] = AIC(regweek)
+ }

Replacing the line with zeroinfl by

regweek = zeroinfl(f.reg, dist = "negbin", data = gymweek)

we can fit the corresponding zero-inflated Negative-Binomial regressions. Figure 22.5 plots
the point estimates and the confidence intervals of the coefficients of the two treatments,
for both the mean (count) component and the zero component. It shows that the treatments
do not have effects on the Poisson or Negative-Binomial components, but have effects on
the zero components. This suggests that the treatments affect the outcome mainly by
changing the workers' behavior of whether to go to the gym.

FIGURE 22.5: Zero-inflated Poisson and Negative-Binomial regressions: point estimates and confidence intervals of the coefficients of incentive_commit and incentive over weeks, for the mean (count) component and the zero component
Another interesting result is the large θ̂’s from the zero-inflated Negative-Binomial re-
gression:
> quantile(gymtheta, probs = c(0.01, 0.25, 0.5, 0.75, 0.99))
1% 25% 50% 75% 99%
12.3 13.1 13.7 14.4 15.7

Once the zero-inflation has been modeled, it is not crucial to account for overdispersion.
This is reasonable because the maximum outcome is five, ruling out heavy-tailedness. It is
further corroborated by the comparison of the AICs from the five regression models: Figure
22.6 shows that the zero-inflated Poisson regressions have the smallest AICs, beating the
zero-inflated Negative-Binomial regressions, which are more flexible but have more
parameters to estimate.
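A hedged sketch of the AIC comparison behind Figure 22.6 is below. It assumes the five AIC vectors were stored by the loops above; AIClm and AIC0poisson appear in the code, while AICpoisson, AICnb, and AIC0nb are illustrative names for the analogous vectors from the Poisson, Negative-Binomial, and zero-inflated Negative-Binomial fits.

## comparing AICs from the five models over weeks (some names illustrative)
AICmat = cbind(Linear = AIClm, Poisson = AICpoisson, NB = AICnb,
               ZInfPoisson = AIC0poisson, ZInfNB = AIC0nb)
matplot(weekids, AICmat, type = "l", lty = 1, col = 1:5,
        xlab = "weeks", ylab = "AIC")
legend("topright", legend = colnames(AICmat), lty = 1, col = 1:5)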
FIGURE 22.6: Comparing AICs from five regression models (linear, Poisson, Negative-Binomial, zero-inflated Poisson, and zero-inflated Negative-Binomial) over weeks

22.4 Homework problems


22.1 Newton’s method for Negative-Binomial regression
Calculate the score function and Hessian matrix based on the log-likelihood function of the
Negative-Binomial regression. What is the joint asymptotic distribution of the MLE (β̂, θ̂)?

22.2 Moments of Zero-inflated Poisson and Negative-Binomial


Prove Propositions 22.5 and 22.6.

22.3 Overdispersion and zero-inflation


Show that for a zero-inflated Poisson, if p ≤ 1/2 then E(y) < var(y) always holds. What is
the condition for E(y) < var(y) when p > 1/2?

22.4 Poisson latent variable and the binary regression model with the cloglog link

Assume that yi∗ | xi ∼ Poisson(exp(xti β)), and define yi = 1(yi∗ > 0) as the indicator that yi∗ is
not zero. Show that yi | xi follows a cloglog model, that is,

pr(yi = 1 | xi ) = g(xti β),

where g(z) = 1 − exp(−ez ).


Remark: The cloglog model for binary outcome arises naturally from a latent Poisson
model. It was only briefly mentioned in Chapter 20.

22.5 Likelihood for the zero-inflated Poisson regression


Write down the likelihood function for the Zero-inflated Poisson model, and derive the steps
for Newton’s method.

22.6 Likelihood for the Zero-inflated Negative-Binomial regression


Write down the likelihood function for the Zero-inflated Negative-Binomial model, and
derive the steps for Newton’s method.

22.7 Prediction in the Zero-inflated Negative-Binomial regression


After obtaining the MLE (β̂, γ̂) and its asymptotic covariance matrix V̂ , predict the con-
ditional mean E(yi | xi ), the conditional probability pr(yi = 0 | xi ), and the conditional
probability pr(yi ≥ 5 | xi ). What are the associated asymptotic standard errors?

22.8 Data analysis


Zeileis et al. (2008) gives a tutorial on count outcome regressions using the dataset from
Deb and Trivedi (1997). Replicate and extend their analysis based on the discussion in this
chapter.

22.9 Data analysis


Fisman and Miguel (2007) is an application of Negative-Binomial regression. Albergaria
and Fávero (2017) replicated their study and argued that the zero-inflated Negative-Binomial
regression was more appropriate. Replicate and extend their analysis based on the
discussion in this chapter.
23
Generalized Linear Models: A Unification

This chapter unifies Chapters 20–22 under the formulation of the generalized linear model
(GLM) by Nelder and Wedderburn (1972).

23.1 Generalized Linear Models


So far we have discussed the following models for independent observations (yi , xi )ni=1 .

Example 23.1 The Normal linear model for continuous outcomes assumes
\[
y_i \mid x_i \sim \mathrm{N}(\mu_i, \sigma^2), \quad \text{with } \mu_i = x_i^t\beta. \qquad (23.1)
\]

Example 23.2 The logistic model for binary outcomes assumes
\[
y_i \mid x_i \sim \mathrm{Bernoulli}(\mu_i), \quad \text{with } \mu_i = \frac{e^{x_i^t\beta}}{1 + e^{x_i^t\beta}}. \qquad (23.2)
\]

Example 23.3 The Poisson model for count outcomes assumes
\[
y_i \mid x_i \sim \mathrm{Poisson}(\mu_i), \quad \text{with } \mu_i = e^{x_i^t\beta}. \qquad (23.3)
\]

Example 23.4 The Negative-Binomial model for overdispersed count outcomes assumes
\[
y_i \mid x_i \sim \mathrm{NB}(\mu_i, \delta), \quad \text{with } \mu_i = e^{x_i^t\beta}. \qquad (23.4)
\]

We use δ for the dispersion parameter to avoid confusion because θ means something else
below (Chapter 22 uses θ for the dispersion parameter).

In the above models, µi denotes the conditional mean. This chapter will unify Examples
23.1–23.4 as GLMs.

23.1.1 Exponential family


Consider a general conditional probability density or mass function:
\[
f(y_i \mid x_i; \theta_i, \phi) = \exp\left\{ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\}, \qquad (23.5)
\]

where (θi , ϕ) are unknown parameters, and {a(·), b(·), c(·, ·)} are known functions. The above
conditional density (23.5) is called the natural exponential family with dispersion, where θi
is the natural parameter depending on xi , and ϕ is the dispersion parameter. Sometimes,
a(ϕ) = 1 and c(yi , ϕ) = c(yi ), simplifying the conditional density to a natural exponential
family. Examples 23.1–23.4 have a unified structure as (23.5), as detailed below.


Example 23.1 (continued) Model (23.1) has conditional probability density function
\[
f(y_i \mid x_i; \mu_i, \sigma^2)
= (2\pi\sigma^2)^{-1/2}\exp\left\{ -\frac{(y_i-\mu_i)^2}{2\sigma^2} \right\}
= \exp\left\{ \frac{y_i\mu_i - \frac{1}{2}\mu_i^2}{\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) - \frac{y_i^2}{2\sigma^2} \right\},
\]
with
\[
\theta_i = \mu_i, \qquad b(\theta_i) = \frac{1}{2}\theta_i^2,
\]
and
\[
\phi = \sigma^2, \qquad a(\phi) = \sigma^2 = \phi.
\]
Example 23.2 (continued) Model (23.2) has conditional probability mass function
\[
f(y_i \mid x_i; \mu_i) = \mu_i^{y_i}(1-\mu_i)^{1-y_i}
= (1-\mu_i)\left(\frac{\mu_i}{1-\mu_i}\right)^{y_i}
= \exp\left\{ y_i\log\frac{\mu_i}{1-\mu_i} - \log\frac{1}{1-\mu_i} \right\},
\]
with
\[
\theta_i = \log\frac{\mu_i}{1-\mu_i} \iff \mu_i = \frac{e^{\theta_i}}{1+e^{\theta_i}},
\qquad
b(\theta_i) = \log\frac{1}{1-\mu_i} = \log(1+e^{\theta_i}),
\]
and
\[
a(\phi) = 1.
\]
Example 23.3 (continued) Model (23.3) has conditional probability mass function
\[
f(y_i \mid x_i; \mu_i) = e^{-\mu_i}\frac{\mu_i^{y_i}}{y_i!}
= \exp\left\{ y_i\log\mu_i - \mu_i - \log y_i! \right\},
\]
with
\[
\theta_i = \log\mu_i, \qquad b(\theta_i) = \mu_i = e^{\theta_i},
\]
and
\[
a(\phi) = 1.
\]
Example 23.4 (continued) Model (23.4), for a fixed δ, has conditional probability mass function
\[
f(y_i \mid x_i; \mu_i)
= \frac{\Gamma(y_i+\delta)}{\Gamma(\delta)\Gamma(y_i+1)}\left(\frac{\mu_i}{\mu_i+\delta}\right)^{y_i}\left(\frac{\delta}{\mu_i+\delta}\right)^{\delta}
= \exp\left\{ y_i\log\frac{\mu_i}{\mu_i+\delta} - \delta\log\frac{\mu_i+\delta}{\delta} + \log\Gamma(y_i+\delta) - \log\Gamma(\delta) - \log\Gamma(y_i+1) \right\},
\]
with
\[
\theta_i = \log\frac{\mu_i}{\mu_i+\delta} \iff \frac{\delta}{\mu_i+\delta} = 1-e^{\theta_i},
\qquad
b(\theta_i) = \delta\log\frac{\mu_i+\delta}{\delta} = -\delta\log(1-e^{\theta_i}),
\]
and
\[
a(\phi) = 1.
\]

The logistic and Poisson models are simpler without the dispersion parameter. The
Normal linear model has a dispersion parameter for the variance. The Negative-Binomial
model is more complex: without fixing δ it does not belong to the exponential family with
dispersion.
The exponential family (23.5) has nice properties derived from the classic Bartlett’s
identities (Bartlett, 1953). I first review Bartlett’s identities:

Lemma 23.1 Given a probability density or mass function f(y | θ) indexed by a scalar
parameter θ, if we can change the order of expectation and differentiation, then
\[
E\left\{ \frac{\partial \log f(y\mid\theta)}{\partial\theta} \right\} = 0
\]
and
\[
E\left[\left\{ \frac{\partial \log f(y\mid\theta)}{\partial\theta} \right\}^2\right]
= E\left\{ -\frac{\partial^2 \log f(y\mid\theta)}{\partial\theta^2} \right\}.
\]

This lemma is well-known in classic statistical theory for likelihood, and I give a simple
proof below.
Proof of Lemma 23.1: Define ℓ(y | θ) = log f(y | θ) as the log-likelihood function, so e^{ℓ(y|θ)} is the density satisfying
\[
\int e^{\ell(y\mid\theta)}\,\mathrm{d}y = \int f(y\mid\theta)\,\mathrm{d}y = 1
\]
by the definition of a probability density function (we can replace the integral by summation
for a probability mass function). Differentiate the above identity to obtain
\[
\frac{\partial}{\partial\theta}\int e^{\ell(y\mid\theta)}\,\mathrm{d}y = 0
\;\Longrightarrow\;
\int \frac{\partial}{\partial\theta}e^{\ell(y\mid\theta)}\,\mathrm{d}y = 0
\;\Longrightarrow\;
\int e^{\ell(y\mid\theta)}\frac{\partial}{\partial\theta}\ell(y\mid\theta)\,\mathrm{d}y = 0,
\]
which implies Bartlett's first identity. Differentiate it twice to obtain
\[
\int \frac{\partial}{\partial\theta}\left\{ e^{\ell(y\mid\theta)}\frac{\partial}{\partial\theta}\ell(y\mid\theta) \right\}\mathrm{d}y = 0
\;\Longrightarrow\;
\int \left[ e^{\ell(y\mid\theta)}\left\{\frac{\partial}{\partial\theta}\ell(y\mid\theta)\right\}^2 + e^{\ell(y\mid\theta)}\frac{\partial^2}{\partial\theta^2}\ell(y\mid\theta) \right]\mathrm{d}y = 0,
\]
which implies Bartlett's second identity. □


Lemma 23.1 implies the moments of the exponential family (23.5).
Theorem 23.1 The first two moments of (23.5) are

E(yi | xi ; θi , ϕ) ≡ µi = b′ (θi )

and
var(yi | xi ; θi , ϕ) ≡ σi2 = b′′ (θi )a(ϕ).

Proof of Theorem 23.1: The first two derivatives of the log conditional density are
\[
\frac{\partial \log f(y_i\mid x_i;\theta_i,\phi)}{\partial\theta_i} = \frac{y_i - b'(\theta_i)}{a(\phi)},
\qquad
\frac{\partial^2 \log f(y_i\mid x_i;\theta_i,\phi)}{\partial\theta_i^2} = -\frac{b''(\theta_i)}{a(\phi)}.
\]
Lemma 23.1 implies that
\[
E\left\{ \frac{y_i - b'(\theta_i)}{a(\phi)} \right\} = 0,
\qquad
E\left[\left\{ \frac{y_i - b'(\theta_i)}{a(\phi)} \right\}^2\right] = \frac{b''(\theta_i)}{a(\phi)},
\]
which further imply the first two moments of y_i given x_i. □
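As a quick numerical sanity check of Theorem 23.1 (not part of the original text), the R sketch below simulates from the Poisson family, for which b(θ) = e^θ and a(ϕ) = 1, and compares the sample mean and variance with b′(θ) and b′′(θ)a(ϕ); the value of θ is arbitrary.

## numerical check of Theorem 23.1 for the Poisson family
set.seed(2024)
theta = 0.7                      # arbitrary natural parameter
y = rpois(1e6, lambda = exp(theta))
c(mean(y), exp(theta))           # E(y) should be close to b'(theta)
c(var(y), exp(theta))            # var(y) should be close to b''(theta) * a(phi)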

23.1.2 Generalized linear model


Section 23.1.1 is general, allowing the mean parameter µi to depend on xi in an arbitrary
way. This flexibility does not immediately generate a useful statistical procedure. To borrow
information across observations, we assume that the relationship between yi and xi remains
“stable” and can be captured by a fixed parameter β. A simple starting point is to use xti β
to approximate µi, which, however, works naturally only for outcomes taking values in the
whole range (−∞, ∞). For general outcome variables, we can link the mean and the linear
combination of covariates by
µi = µ(xti β),
where µ(·) is a known function and β is an unknown parameter. The inverse of µ(·) is called
the link function. This is called a GLM, which has the following components:
(C1) the conditional distribution (23.5) as an exponential family with dispersion;

(C2) the conditional mean µi = b′ (θi ) and variance σi2 = b′′ (θi )a(ϕ) in Theorem 23.1;
(C3) the function linking the conditional mean and covariates µi = µ(xti β).
Models (23.1)–(23.4) are the classical examples. Figure 23.1 illustrates the relationship
among key quantities in a GLM. In particular,

θi = (b′ )−1 (µi ) = (b′ )−1 {µ(xti β)} (23.6)

depends on xi and β, with (b′ )−1 indicating the inverse function of b′ (·).

FIGURE 23.1: Quantities in a GLM



23.2 MLE for GLM


The contribution of unit i to the log-likelihood function is
\[
\ell_i = \log f(y_i \mid x_i; \beta, \phi) = \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi).
\]
The contribution of unit i to the score function is
\[
\frac{\partial\ell_i}{\partial\beta} = \frac{\partial\ell_i}{\partial\theta_i}\,\frac{\partial\theta_i}{\partial\mu_i}\,\frac{\partial\mu_i}{\partial\beta},
\]
where
\[
\frac{\partial\ell_i}{\partial\theta_i} = \frac{y_i - b'(\theta_i)}{a(\phi)},
\qquad
\frac{\partial\theta_i}{\partial\mu_i} = \frac{1}{b''(\theta_i)} = \frac{a(\phi)}{\sigma_i^2}
\]
follow from Theorem 23.1. So
\[
\frac{\partial\ell_i}{\partial\beta} = \frac{y_i - b'(\theta_i)}{\sigma_i^2}\,\frac{\partial\mu_i}{\partial\beta}
= \frac{y_i - \mu_i}{\sigma_i^2}\,\frac{\partial\mu_i}{\partial\beta},
\]
leading to the following score equation for the MLE:
\[
\sum_{i=1}^n \frac{y_i - \mu_i}{\sigma_i^2}\,\frac{\partial\mu_i}{\partial\beta} = 0, \qquad (23.7)
\]
or, more explicitly,
\[
\sum_{i=1}^n \frac{y_i - \mu(x_i^t\beta)}{\sigma_i^2}\,\mu'(x_i^t\beta)\,x_i = 0.
\]

The general relationship (23.6) between θ_i and β is quite complicated. A natural choice
of µ(·) is to cancel (b')^{-1} in (23.6) so that
\[
\mu(\cdot) = b'(\cdot) \;\Longrightarrow\; \theta_i = x_i^t\beta.
\]
This link function µ(·) is called the canonical link or the natural link, which leads to further
simplifications:
\[
\mu_i = b'(x_i^t\beta) \;\Longrightarrow\; \frac{\partial\mu_i}{\partial\beta} = b''(x_i^t\beta)\,x_i = b''(\theta_i)\,x_i = \frac{\sigma_i^2}{a(\phi)}\,x_i,
\]
and
\[
\frac{\partial\ell_i}{\partial\beta} = \frac{y_i - \mu_i}{\sigma_i^2}\,\frac{\sigma_i^2}{a(\phi)}\,x_i = a(\phi)^{-1}x_i(y_i - \mu_i)
\;\Longrightarrow\;
a(\phi)^{-1}\sum_{i=1}^n x_i(y_i - \mu_i) = 0
\;\Longrightarrow\;
\sum_{i=1}^n x_i(y_i - \mu_i) = 0. \qquad (23.8)
\]

We have shown that the MLEs of models (23.1)–(23.3) all satisfy (23.8). However, the MLE
of (23.4) does not because it does not use the natural link function, resulting in µ(·) ≠ b′(·):
\[
\mu(*) = e^{*}, \qquad b'(*) = \delta\,\frac{e^{*}}{1 - e^{*}}.
\]
Using Bartlett's second identity in Lemma 23.1, we can write the expected Fisher information conditional on covariates as
\[
\sum_{i=1}^n E\left( \frac{\partial\ell_i}{\partial\beta}\frac{\partial\ell_i}{\partial\beta^t} \,\Big|\, x_i \right)
= \sum_{i=1}^n E\left[\left\{ \frac{y_i-\mu_i}{\sigma_i^2} \right\}^2 \frac{\partial\mu_i}{\partial\beta}\frac{\partial\mu_i}{\partial\beta^t} \,\Big|\, x_i \right]
= \sum_{i=1}^n \frac{1}{\sigma_i^2}\frac{\partial\mu_i}{\partial\beta}\frac{\partial\mu_i}{\partial\beta^t}
= \sum_{i=1}^n \frac{1}{\sigma_i^2}\{\mu'(x_i^t\beta)\}^2 x_ix_i^t
= X^tWX,
\]
where
\[
W = \mathrm{diag}\left\{ \frac{1}{\sigma_i^2}\{\mu'(x_i^t\beta)\}^2 \right\}_{i=1}^n.
\]
With the canonical link, it further simplifies to
\[
\sum_{i=1}^n E\left( \frac{\partial\ell_i}{\partial\beta}\frac{\partial\ell_i}{\partial\beta^t} \,\Big|\, x_i \right)
= \sum_{i=1}^n E\left[\left\{ \frac{y_i-\mu_i}{a(\phi)} \right\}^2 x_ix_i^t \,\Big|\, x_i \right]
= \{a(\phi)\}^{-2}\sum_{i=1}^n \sigma_i^2\,x_ix_i^t.
\]

We can obtain the estimated covariance matrix by replacing the unknown parameters
with their estimates. Now we review the estimated covariance matrices of the classical GLMs
with canonical links.
Example 23.1 (continued) In the Normal linear model, $\hat V = \hat\sigma^2(X^tX)^{-1}$ with $\sigma^2$ estimated further by the residual sum of squares.

Example 23.2 (continued) In the binary logistic model, $\hat V = (X^t\hat WX)^{-1}$ with $\hat W = \mathrm{diag}\{\hat\pi_i(1-\hat\pi_i)\}_{i=1}^n$, where $\hat\pi_i = e^{x_i^t\hat\beta}/(1+e^{x_i^t\hat\beta})$.

Example 23.3 (continued) In the Poisson model, $\hat V = (X^t\hat WX)^{-1}$ with $\hat W = \mathrm{diag}\{\hat\lambda_i\}_{i=1}^n$, where $\hat\lambda_i = e^{x_i^t\hat\beta}$.

I relegate the derivation of the formula for the Negative-Binomial regression as Problem
23.2. It is a purely theoretical exercise since δ is usually unknown in practice.
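To make the canonical-link computations concrete, here is a minimal R sketch (not from the original text) of Newton's method for the logistic model, which solves the score equation (23.8) using the Fisher information X^tŴX; the simulated data and starting value are arbitrary, and the result is checked against glm.

## Newton's method for the canonical-link logistic GLM: a sketch
set.seed(1)
n = 500
X = cbind(1, rnorm(n), rbinom(n, 1, 0.5))   # design matrix with an intercept
beta.true = c(-0.5, 1, 0.8)
y = rbinom(n, 1, plogis(X %*% beta.true))
beta = rep(0, ncol(X))                      # starting value
for (iter in 1:25) {
  pi.hat = as.vector(plogis(X %*% beta))    # mu_i = b'(x_i^t beta)
  W      = pi.hat * (1 - pi.hat)            # b''(theta_i)
  score  = t(X) %*% (y - pi.hat)            # left-hand side of (23.8)
  info   = t(X) %*% (X * W)                 # X^t W X
  beta   = beta + solve(info, score)
}
cbind(newton = beta,
      glm = coef(glm(y ~ X - 1, family = binomial(link = logit))))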

23.3 Other GLMs


The glm function in R allows for the specification of the family parameters, with the corre-
sponding canonical link functions shown below:

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")

Examples 23.1–23.3 correspond to the second, the first, and the fifth choices above. Below
I will briefly discuss the third choice for the Gamma regression and omit the discussion of
other choices. See the help file of the glm function and McCullagh and Nelder (1989) for
more details.
The Gamma(α, β) random variable is positive with mean α/β and variance α/β². For
convenience in modeling, we use a reparametrization Gamma′(µ, ν) where
\[
\begin{pmatrix} \mu \\ \nu \end{pmatrix} = \begin{pmatrix} \alpha/\beta \\ \alpha \end{pmatrix}
\iff
\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} \nu \\ \nu/\mu \end{pmatrix}.
\]
So its mean equals µ and its variance equals µ²/ν, which is quadratic in µ. A feature of
the Gamma random variable is that its coefficient of variation equals 1/√ν, which does not
depend on the mean. So Gamma′(µ, ν) is a parametrization based on the mean and the
coefficient of variation (McCullagh and Nelder, 1989).¹ Gamma′(µ, ν) has density
\[
f(y) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,y^{\alpha-1}e^{-\beta y}
= \frac{(\nu/\mu)^{\nu}}{\Gamma(\nu)}\,y^{\nu-1}e^{-(\nu/\mu)y},
\]
and we can verify that it belongs to the exponential family with dispersion. Gamma regression assumes
\[
y_i \mid x_i \sim \mathrm{Gamma}'(\mu_i, \nu), \quad \text{where } \mu_i = e^{x_i^t\beta}.
\]
So it is also a log-linear model. This does not correspond to the canonical
link. Instead, we should specify Gamma(link = "log") to fit the log-linear Gamma regression
model.
The log-likelihood function is
\[
\log L(\beta, \nu) = \sum_{i=1}^n \left\{ -\frac{\nu y_i}{e^{x_i^t\beta}} + (\nu-1)\log y_i + \nu\log\nu - \nu x_i^t\beta - \log\Gamma(\nu) \right\}.
\]
Then
\[
\frac{\partial \log L(\beta,\nu)}{\partial\beta} = \sum_{i=1}^n \left( \nu y_ie^{-x_i^t\beta}x_i - \nu x_i \right)
= \nu\sum_{i=1}^n e^{-x_i^t\beta}\left( y_i - e^{x_i^t\beta} \right)x_i
\]
and
\[
\frac{\partial^2 \log L(\beta,\nu)}{\partial\beta\,\partial\nu} = \sum_{i=1}^n e^{-x_i^t\beta}\left( y_i - e^{x_i^t\beta} \right)x_i.
\]
So the MLE for β solves the following estimating equation:
\[
\sum_{i=1}^n e^{-x_i^t\beta}\left( y_i - e^{x_i^t\beta} \right)x_i = 0.
\]

Moreover, ∂² log L(β, ν)/∂β∂ν has expectation zero, so the Fisher information matrix is
diagonal. In fact, it is identically zero when evaluated at β̂ since it is identical to the
estimating equation.

¹ The coefficient of variation of a random variable A equals √var(A) / E(A).
I end this subsection with a comment on the estimating equation of β. It is similar to
the Poisson score equation except for the additional weight $e^{-x_i^t\beta}$. For positive outcomes, it
is also conventional to fit OLS of log y_i on x_i, resulting in the following estimating equation:
\[
\sum_{i=1}^n (\log y_i - x_i^t\beta)\,x_i = 0.
\]

Firth (1988) compared Gamma and log-Normal regressions based on efficiency. However,
these two models are not entirely comparable: Gamma regression assumes that the log of
the conditional mean of yi given xi is linear in xi , whereas log-Normal regression assumes
that the conditional mean of log yi given xi is linear in xi . By Jensen’s inequality, log E(yi |
xi ) ≥ E(log yi | xi ). See Problem 23.3 for more discussions of Gamma regression.
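As a hedged illustration of this comparison (not from the original text), the sketch below fits the log-linear Gamma regression with glm and the OLS of log yi on xi to simulated data; by the Remark of Problem 23.3, the two slope estimates should be close, while the OLS intercept is shifted by roughly ψ(ν) − log ν.

## Gamma regression with the log link versus OLS of log(y): a sketch
set.seed(2)
n  = 5000
x  = rnorm(n)
nu = 3                              # shape parameter of Gamma'(mu, nu)
mu = exp(0.5 + 1 * x)               # mu_i = exp(x_i^t beta)
y  = rgamma(n, shape = nu, rate = nu / mu)
gamma.fit = glm(y ~ x, family = Gamma(link = "log"))
ols.fit   = lm(log(y) ~ x)
rbind(gamma = coef(gamma.fit), ols.logy = coef(ols.fit))
digamma(nu) - log(nu)               # expected intercept shift of the OLS fit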

23.4 Homework problems


23.1 MLE in GLMs with binary regressors
The MLEs for β in Models (23.1)–(23.3) do not have explicit formulas in general. But in
the special case with the covariate xi containing 1 and a binary covariate zi ∈ {0, 1}, their
MLEs do have simple formulas. Find them in terms of sample means of the outcomes.
Then find the variance estimators of β̂.

23.2 Negative-Binomial covariance matrices


Assume that δ is known. Derive the estimated asymptotic covariance matrices of the MLE
in the Negative-Binomial regression with µi = exp(xti β).

23.3 Gamma regression


Verify that Gamma′ (µ, ν) belongs to the natural exponential family with dispersion. Derive
the first- and second-order derivatives of the log-likelihood function and Newton’s algorithm
for computing the MLE. Derive the estimated asymptotic covariance matrices of the MLE.
Show that if yi | xi ∼ Gamma′(µi, ν) with µi = exp(xti β), then

E(log yi | xi ) = ψ(ν) − log(ν) + xti β

and
var(log yi | xi ) = ψ ′ (ν)
where ψ(ν) = d log Γ(ν)/dν is the digamma function and ψ ′ (ν) is the trigamma function.
Remark: Use Proposition B.3 to calculate the moments. The above conditional mean
function ensures that the OLS estimator of log yi on xi is consistent for all components of
β except for the intercept.
24
From Generalized Linear Models to Restricted Mean
Models: the Sandwich Covariance Matrix

This chapter discusses the consequence of misspecified GLMs, extending the EHW covari-
ance estimator to its analogs under the GLMs. It serves as a stepping stone to the next
chapter on the generalized estimating equation.

24.1 Restricted mean model


The logistic, Poisson, and Negative-Binomial models are extensions of the Normal linear
model. All of them are fully parametric models. However, we have also discussed OLS as a
restricted mean model
E(yi | xi ) = xti β
without imposing any additional assumptions (e.g., the variance) on the conditional distri-
bution. The restricted mean model is a semiparametric model. Then a natural question is:
what are the analogs of the restricted mean model for the binary and count models?
Binary outcomes are too special because the conditional mean determines the distribution.
So if we assume that the conditional mean is µi = exp(xti β)/{1 + exp(xti β)}, then the conditional
distribution must be Bernoulli(µi). Consequently, misspecification of the conditional mean
function implies misspecification of the whole conditional distribution.
For other outcomes, the conditional mean cannot determine the conditional distribution.
If we assume
E(yi | xi ) = µ(xti β),
we can verify that
\[
E\left\{ \sum_{i=1}^n \frac{y_i - \mu(x_i^t\beta)}{\tilde\sigma^2(x_i,\beta)}\,\frac{\partial\mu(x_i^t\beta)}{\partial\beta} \right\}
= E\left[ \sum_{i=1}^n E\left\{ \frac{y_i - \mu(x_i^t\beta)}{\tilde\sigma^2(x_i,\beta)}\,\frac{\partial\mu(x_i^t\beta)}{\partial\beta} \,\Big|\, x_i \right\} \right] = 0
\]

for any σ̃ 2 that can be a function of xi , the true parameter β, and maybe ϕ. So we can
estimate β by solving the estimating equation:
\[
\sum_{i=1}^n \frac{y_i - \mu(x_i^t\beta)}{\tilde\sigma^2(x_i,\beta)}\,\frac{\partial\mu(x_i^t\beta)}{\partial\beta} = 0. \qquad (24.1)
\]

If σ̃ 2 (xi , β) = σ 2 (xi ) = var(yi | xi ), then the above estimating equation is the score equation
derived from the GLM of an exponential family. If not, (24.1) is not a score function but
it is still a valid estimating equation. In the latter case, σ̃ 2 (xi , β) is a “working” variance.
This has important implications for the practical data analysis. First, we can interpret the
MLE from a GLM more broadly: it is also valid under a restricted mean model even if the

conditional distribution is misspecified. Second, we can construct more general estimators
beyond the MLEs from GLMs. However, we must address the issue of variance estimation
since the inference based on the Fisher information matrix no longer works in general.

24.2 Sandwich covariance matrix


To simplify the notation, we assume (xi , yi )ni=1 are IID draws although we usually view
the covariates as fixed. This additional assumption is innocuous as the final inferential
procedures are identical.
Theorem 24.1 Assume $(x_i, y_i)_{i=1}^n$ are IID with $E(y_i \mid x_i) = \mu(x_i^t\beta)$. We have
$\sqrt{n}\,(\hat\beta - \beta) \to \mathrm{N}(0, B^{-1}MB^{-1})$ with
\[
B = E\left\{ \frac{1}{\tilde\sigma^2(x,\beta)}\,\frac{\partial\mu(x^t\beta)}{\partial\beta}\,\frac{\partial\mu(x^t\beta)}{\partial\beta^t} \right\}, \qquad (24.2)
\]
\[
M = E\left[ \frac{\sigma^2(x)}{\{\tilde\sigma^2(x,\beta)\}^2}\,\frac{\partial\mu(x^t\beta)}{\partial\beta}\,\frac{\partial\mu(x^t\beta)}{\partial\beta^t} \right]. \qquad (24.3)
\]
Proof of Theorem 24.1: Applying Theorem D.1 to
\[
w = (x, y), \qquad m(w,\beta) = \frac{y - \mu(x^t\beta)}{\tilde\sigma^2(x,\beta)}\,\frac{\partial\mu(x^t\beta)}{\partial\beta},
\]
we can derive the asymptotic distribution of $\hat\beta$.

The bread matrix equals
\[
B = -E\left\{ \frac{\partial m(w,\beta)}{\partial\beta^t} \right\}
= -E\left[ \frac{\partial}{\partial\beta^t}\left\{ \frac{y-\mu(x^t\beta)}{\tilde\sigma^2(x,\beta)}\,\frac{\partial\mu(x^t\beta)}{\partial\beta} \right\} \right]
= -E\left[ \frac{y-\mu(x^t\beta)}{\tilde\sigma^2(x,\beta)}\,\frac{\partial^2\mu(x^t\beta)}{\partial\beta\,\partial\beta^t}
+ \frac{\partial\mu(x^t\beta)}{\partial\beta}\,\frac{\partial}{\partial\beta^t}\left\{ \frac{y-\mu(x^t\beta)}{\tilde\sigma^2(x,\beta)} \right\} \right] \qquad (24.4)
\]
\[
= -E\left[ \frac{\partial\mu(x^t\beta)}{\partial\beta}\,\frac{\partial}{\partial\beta^t}\left\{ \frac{y-\mu(x^t\beta)}{\tilde\sigma^2(x,\beta)} \right\} \right]
= E\left[ \frac{\partial\mu(x^t\beta)}{\partial\beta}\,\frac{\partial\mu(x^t\beta)}{\partial\beta^t}\Big/\tilde\sigma^2(x,\beta)
+ \frac{\partial\mu(x^t\beta)}{\partial\beta}\,\frac{\partial\tilde\sigma^2(x,\beta)}{\partial\beta^t}\,\frac{y-\mu(x^t\beta)}{\{\tilde\sigma^2(x,\beta)\}^2} \right] \qquad (24.5)
\]
\[
= E\left\{ \frac{\partial\mu(x^t\beta)}{\partial\beta}\,\frac{\partial\mu(x^t\beta)}{\partial\beta^t}\Big/\tilde\sigma^2(x,\beta) \right\},
\]
where the first term of (24.4) and the second term of (24.5) are both zero under the restricted
mean model. The meat matrix equals
\[
M = E\left\{ m(w,\beta)\,m(w,\beta)^t \right\}
= E\left[\left\{ \frac{y-\mu(x^t\beta)}{\tilde\sigma^2(x,\beta)} \right\}^2 \frac{\partial\mu(x^t\beta)}{\partial\beta}\,\frac{\partial\mu(x^t\beta)}{\partial\beta^t}\right]
= E\left[ \frac{\sigma^2(x)}{\{\tilde\sigma^2(x,\beta)\}^2}\,\frac{\partial\mu(x^t\beta)}{\partial\beta}\,\frac{\partial\mu(x^t\beta)}{\partial\beta^t} \right]. \qquad \square
\]


We can estimate the asymptotic variance by replacing B and M by their sample analogs.
With β̂ and the residual ε̂i = yi − µ(xti β̂), we can conduct statistical inference based on the
following Normal approximation:
a
β̂ ∼ N(β, V̂ ),
with V̂ ≡ n−1 B̂ −1 M̂ B̂ −1 , where
n
X 1 ∂µ(xti β̂) ∂µ(xti β̂)
B̂ = n−1 ,
i=1 σ̃ 2 (xi , β̂) ∂β ∂β t
n
X ε̂2i ∂µ(xti β̂) ∂µ(xti β̂)
M̂ = n−1 .
i=1 σ̃ 4 (xi , β̂) ∂β ∂β t

As a special case, when the GLM is correctly specified with σ 2 (x) = σ̃ 2 (x, β), then B =
M and the asymptotic variance reduces to the inverse of the Fisher information matrix
discussed in Section 23.2.
Example 24.1 (continued) In a working Normal linear model, $\tilde\sigma^2(x_i,\beta) = \tilde\sigma^2$ is constant
and $\partial\mu(x_i^t\beta)/\partial\beta = x_i$. So
\[
\hat B = n^{-1}\sum_{i=1}^n \frac{1}{\tilde\sigma^2}\,x_ix_i^t,
\qquad
\hat M = n^{-1}\sum_{i=1}^n \frac{\hat\varepsilon_i^2}{(\tilde\sigma^2)^2}\,x_ix_i^t
\]
with residual $\hat\varepsilon_i = y_i - x_i^t\hat\beta$, recovering the EHW variance estimator
\[
\hat V = \left( \sum_{i=1}^n x_ix_i^t \right)^{-1}\left( \sum_{i=1}^n \hat\varepsilon_i^2\,x_ix_i^t \right)\left( \sum_{i=1}^n x_ix_i^t \right)^{-1}.
\]

Example 24.2 (continued) In a working binary logistic model, $\tilde\sigma^2(x_i,\beta) = \pi(x_i,\beta)\{1-\pi(x_i,\beta)\}$
and $\partial\mu(x_i^t\beta)/\partial\beta = \pi(x_i,\beta)\{1-\pi(x_i,\beta)\}x_i$, where $\pi(x_i,\beta) = \mu(x_i^t\beta) = e^{x_i^t\beta}/(1+e^{x_i^t\beta})$. So
\[
\hat B = n^{-1}\sum_{i=1}^n \hat\pi_i(1-\hat\pi_i)\,x_ix_i^t,
\qquad
\hat M = n^{-1}\sum_{i=1}^n \hat\varepsilon_i^2\,x_ix_i^t
\]
with fitted mean $\hat\pi_i = e^{x_i^t\hat\beta}/(1+e^{x_i^t\hat\beta})$ and residual $\hat\varepsilon_i = y_i - \hat\pi_i$, yielding a new covariance estimator
\[
\hat V = \left\{ \sum_{i=1}^n \hat\pi_i(1-\hat\pi_i)\,x_ix_i^t \right\}^{-1}\left( \sum_{i=1}^n \hat\varepsilon_i^2\,x_ix_i^t \right)\left\{ \sum_{i=1}^n \hat\pi_i(1-\hat\pi_i)\,x_ix_i^t \right\}^{-1}.
\]

Example 24.3 (continued) In a working Poisson model, $\tilde\sigma^2(x_i,\beta) = \lambda(x_i,\beta)$ and
$\partial\mu(x_i^t\beta)/\partial\beta = \lambda(x_i,\beta)x_i$, where $\lambda(x_i,\beta) = \mu(x_i^t\beta) = e^{x_i^t\beta}$. So
\[
\hat B = n^{-1}\sum_{i=1}^n \hat\lambda_i\,x_ix_i^t,
\qquad
\hat M = n^{-1}\sum_{i=1}^n \hat\varepsilon_i^2\,x_ix_i^t
\]
with fitted mean $\hat\lambda_i = e^{x_i^t\hat\beta}$ and residual $\hat\varepsilon_i = y_i - \hat\lambda_i$, yielding a new covariance estimator
\[
\hat V = \left( \sum_{i=1}^n \hat\lambda_i\,x_ix_i^t \right)^{-1}\left( \sum_{i=1}^n \hat\varepsilon_i^2\,x_ix_i^t \right)\left( \sum_{i=1}^n \hat\lambda_i\,x_ix_i^t \right)^{-1}.
\]

Again, I relegate the derivation of the formulas for the Negative-Binomial regression as
a homework problem. The R package sandwich implements the above covariance matrices
(Zeileis, 2006).
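As a hedged numerical check (not from the original text), the sketch below computes B̂, M̂, and V̂ for a working Poisson regression directly from the formulas above and compares the result with the sandwich function; the simulated overdispersed data are arbitrary.

## hand-coded sandwich covariance for a working Poisson regression
library("sandwich")
set.seed(3)
n = 2000
x = rnorm(n)
X = cbind(1, x)
y = rnbinom(n, mu = exp(0.3 + 0.5 * x), size = 1)   # overdispersed counts
fit    = glm(y ~ x, family = poisson(link = log))
lambda = fitted(fit)                                # hat(lambda)_i
eps    = y - lambda                                 # residuals
B.hat  = crossprod(X * lambda, X) / n               # n^{-1} sum lambda_i x_i x_i^t
M.hat  = crossprod(X * eps^2, X) / n                # n^{-1} sum eps_i^2 x_i x_i^t
V.hat  = solve(B.hat) %*% M.hat %*% solve(B.hat) / n
max(abs(V.hat - sandwich(fit)))                     # should be numerically tiny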

24.3 Applications of the sandwich standard errors


24.3.1 Linear regression
In R, several functions can compute the EHW standard error: the hccm function in the car
package, and the vcovHC and sandwich functions in the sandwich package. The first two are
special functions for OLS, and the third one works for general models. Below, we use these
functions to compute various types of standard errors.
> library ( " car " )
> library ( " lmtest " )
> library ( " sandwich " )
> library ( " mlbench " )
>
> # # linear regression
> data ( " BostonHousing " )
> lm . boston = lm ( medv ~ . , data = BostonHousing )
> hccm 0 = hccm ( lm . boston , type = " hc 0 " )
> sandwich 0 = sandwich ( lm . boston , adjust = FALSE )
> vcovHC 0 = vcovHC ( lm . boston , type = " HC 0 " )
>
> hccm 1 = hccm ( lm . boston , type = " hc 1 " )
> sandwich 1 = sandwich ( lm . boston , adjust = TRUE )
> vcovHC 1 = vcovHC ( lm . boston , type = " HC 1 " )
>
> hccm 3 = hccm ( lm . boston , type = " hc 3 " )
> vcovHC 3 = vcovHC ( lm . boston , type = " HC 3 " )
>
> dat . reg = data . frame ( hccm 0 = diag ( hccm 0 )^( 0 . 5 ) ,
+ sandwich 0 = diag ( sandwich 0 )^( 0 . 5 ) ,
+ vcovHC 0 = diag ( vcovHC 0 )^( 0 . 5 ) ,
+
+ hccm 1 = diag ( hccm 1 )^( 0 . 5 ) ,
+ sandwich 1 = diag ( sandwich 1 )^( 0 . 5 ) ,
+ vcovHC 1 = diag ( vcovHC 1 )^( 0 . 5 ) ,
+
+ hccm 3 = diag ( hccm 3 )^( 0 . 5 ) ,
+ vcovHC 3 = diag ( vcovHC 3 )^( 0 . 5 ))
> round ( dat . reg [ - 1 , ] , 2 )
hccm 0 sandwich 0 vcovHC 0 hccm 1 sandwich 1 vcovHC 1 hccm 3 vcovHC 3
crim 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03
zn 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
indus 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
chas 1 1.28 1.28 1.28 1.29 1.29 1.29 1.35 1.35
nox 3.73 3.73 3.73 3.79 3.79 3.79 3.92 3.92
rm 0.83 0.83 0.83 0.84 0.84 0.84 0.89 0.89
age 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
dis 0.21 0.21 0.21 0.21 0.21 0.21 0.22 0.22
rad 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06
tax 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ptratio 0 . 1 2 0.12 0.12 0.12 0.12 0.12 0.12 0.12
b 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
lstat 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10

The sandwich function can compute HC0 and HC1, corresponding to adjusting for the
degrees of freedom or not; hccm and vcovHC can compute other HC standard errors.

24.3.2 Logistic regression


24.3.2.1 An application
In the flu shot example, two types of standard errors are rather similar. The simple logistic
model does not seem to suffer from severe misspecification.
> flu = read . table ( " fludata . txt " , header = TRUE )
> flu = within ( flu , rm ( receive ))
> assign . logit = glm ( outcome ~ . ,
+ family = binomial ( link = logit ) ,
+ data = flu )
> summary ( assign . logit )
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -2 . 1 9 9 8 1 5 0 . 4 0 8 6 8 4 -5 . 3 8 3 7 . 3 4e - 0 8 ***
assign -0 . 1 9 7 5 2 8 0 . 1 3 6 2 3 5 -1 . 4 5 0 0 . 1 4 7 0 9
age -0 . 0 0 7 9 8 6 0 . 0 0 5 5 6 9 -1 . 4 3 4 0 . 1 5 1 5 4
copd 0.337037 0.153939 2.189 0.02857 *
dm 0.454342 0.143593 3 . 1 6 4 0 . 0 0 1 5 6 **
heartd 0.676190 0.153384 4 . 4 0 8 1 . 0 4e - 0 5 ***
race -0 . 2 4 2 9 4 9 0 . 1 4 3 0 1 3 -1 . 6 9 9 0 . 0 8 9 3 6 .
renal 1.519505 0.365973 4 . 1 5 2 3 . 3 0e - 0 5 ***
sex -0 . 2 1 2 0 9 5 0 . 1 4 4 4 7 7 -1 . 4 6 8 0 . 1 4 2 1 0
liverd 0.098957 1.084644 0.091 0.92731

> coeftest ( assign . logit , vcov = sandwich )

z test of coefficients :

Estimate Std . Error z value Pr ( >| z |)


( Intercept ) -2 . 1 9 9 8 1 4 5 0 . 4 0 5 9 3 8 6 -5 . 4 1 9 1 5 . 9 9 1e - 0 8 ***
assign -0 . 1 9 7 5 2 8 3 0 . 1 3 7 1 7 8 5 -1 . 4 3 9 9 0 . 1 4 9 8 8 5
age -0 . 0 0 7 9 8 5 9 0 . 0 0 5 7 0 5 3 -1 . 3 9 9 7 0 . 1 6 1 5 9 0
copd 0.3370371 0.1556781 2.1650 0.030391 *
dm 0.4543416 0.1394709 3.2576 0.001124 **
heartd 0 . 6 7 6 1 8 9 5 0 . 1 5 2 1 1 0 5 4 . 4 4 5 4 8 . 7 7 4e - 0 6 ***
race -0 . 2 4 2 9 4 8 8 0 . 1 4 3 0 9 5 7 -1 . 6 9 7 8 0 . 0 8 9 5 4 4 .
renal 1 . 5 1 9 5 0 4 9 0 . 3 6 5 9 2 3 8 4 . 1 5 2 5 3 . 2 8 8e - 0 5 ***
sex -0 . 2 1 2 0 9 5 4 0 . 1 4 8 9 4 3 5 -1 . 4 2 4 0 0 . 1 5 4 4 4 7
liverd 0.0989572 1.1411133 0.0867 0.930894

24.3.2.2 A misspecified logistic regression


Freedman (2006) discussed the following misspecified logistic regression. The discrepancy
between the two types of standard errors is a warning of the misspecification of the condi-
tional mean function because it determines the whole conditional distribution. In this case,
it is not meaningful to interpret the coefficients.
> n = 100
> x = runif(n, 0, 10)
> prob.x = 1/(1 + exp(3*x - 0.5*x^2))
> y = rbinom(n, 1, prob.x)
> freedman.logit = glm(y ~ x, family = binomial(link = logit))
> summary(freedman.logit)
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -6 . 6 7 6 4 1 . 3 2 5 4 -5 . 0 3 7 4 . 7 2e - 0 7 ***
x 1.1083 0.2209 5 . 0 1 7 5 . 2 5e - 0 7 ***

> coeftest ( freedman . logit , vcov = sandwich )

z test of coefficients :

Estimate Std . Error z value Pr ( >| z |)


( Intercept ) -6 . 6 7 6 4 1 2 . 4 6 0 3 5 -2 . 7 1 3 6 0 . 0 0 6 6 5 6 **
x 1.10832 0 . 3 9 6 7 2 2 . 7 9 3 7 0 . 0 0 5 2 1 1 **

24.3.3 Poisson regression


24.3.3.1 A correctly specified Poisson regression
I first generate data from a correctly specified Poisson regression. The two types of standard
errors are very close.
> n = 1000
> x = rnorm(n)
> lambda.x = exp(x/5)
> y = rpois(n, lambda.x)
> pois.pois = glm(y ~ x, family = poisson(link = log))
> summary(pois.pois)
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -0 . 0 0 4 3 8 6 0 . 0 3 2 1 1 7 -0 . 1 3 7 0.891
x 0.189069 0.031110 6 . 0 7 7 1 . 2 2e - 0 9 ***

> coeftest ( pois . pois , vcov = sandwich )

z test of coefficients :

Estimate Std . Error z value Pr ( >| z |)


( Intercept ) -0 . 0 0 4 3 8 6 2 0 . 0 3 1 1 9 5 7 -0 . 1 4 0 6 0.8882
x 0 . 1 8 9 0 6 9 1 0 . 0 2 9 9 1 2 4 6 . 3 2 0 8 2 . 6 0 3e - 1 0 ***

24.3.3.2 A Negative-Binomial regression model


I then generate data from a Negative-Binomial regression model. The conditional mean
function is still E(yi | xi) = exp(xti β), so we can still use Poisson regression as a working model.
The robust standard error doubles the classical standard error.
> library(MASS)
> theta = 0.2
> y = rnegbin(n, mu = lambda.x, theta = theta)
> nb.pois = glm(y ~ x, family = poisson(link = log))
> summary(nb.pois)
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -0 . 0 7 7 4 7 0 . 0 3 3 1 5 -2 . 3 3 7 0.0194 *
x 0.13847 0.03241 4 . 2 7 2 1 . 9 4e - 0 5 ***

> coeftest ( nb . pois , vcov = sandwich )

z test of coefficients :

Estimate Std . Error z value Pr ( >| z |)


( Intercept ) -0 . 0 7 7 4 7 5 0 . 0 7 9 4 3 1 -0 . 9 7 5 4 0 . 3 2 9 3 7
x 0.138467 0.061460 2.2530 0.02426 *

Because the true model is the Negative-Binomial regression, we can use the correct
model to fit the data. Theoretically, the MLE is the most efficient estimator. However, in
this particular dataset, the robust standard error from Poisson regression is no larger than
the one from Negative-Binomial regression. Moreover, the robust standard errors from the
Poisson and Negative-Binomial regressions are very close.

> nb . nb = glm . nb ( y ~ x )
> summary ( nb . nb )
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -0 . 0 8 0 4 7 0 . 0 7 3 3 6 -1 . 0 9 7 0.2727
x 0.16487 0.07276 2.266 0.0234 *

> coeftest ( nb . nb , vcov = sandwich )

z test of coefficients :

Estimate Std . Error z value Pr ( >| z |)


( Intercept ) -0 . 0 8 0 4 6 7 0 . 0 7 9 5 1 0 -1 . 0 1 2 0 . 3 1 1 5 1 7
x 0.164869 0.063902 2 . 5 8 0 0 . 0 0 9 8 7 9 **

24.3.3.3 Misspecification of the conditional mean


When the conditional mean function is misspecified, the Poisson and Negative-Binomial
regressions give different point estimates, and it is unclear how to compare the standard
errors.
> lambda.x = x^2
> y = rpois(n, lambda.x)
> wr.pois = glm(y ~ x, family = poisson(link = log))
> summary(wr.pois)
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -0 . 0 3 7 6 0 0 . 0 3 2 4 5 -1 . 1 5 9 0 . 2 4 6 4 5 7
x 0.11933 0.03182 3 . 7 5 1 0 . 0 0 0 1 7 6 ***

> coeftest ( wr . pois , vcov = sandwich )

z test of coefficients :

Estimate Std . Error z value Pr ( >| z |)


( Intercept ) -0 . 0 3 7 6 0 4 0 . 0 5 3 0 3 3 -0 . 7 0 9 1 0.4783
x 0.119331 0.101126 1.1800 0.2380

>
> wr . nb = glm . nb ( y ~ x )
There were 2 6 warnings ( use warnings () to see them )
> summary ( wr . nb )
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) 0 . 1 5 9 8 4 0.05802 2 . 7 5 5 0 . 0 0 5 8 7 **
x -0 . 3 4 6 2 2 0 . 0 5 7 8 9 -5 . 9 8 1 2 . 2 2e - 0 9 ***

> coeftest ( wr . nb , vcov = sandwich )

z test of coefficients :

Estimate Std . Error z value Pr ( >| z |)


( Intercept ) 0 . 1 5 9 8 3 7 0 . 0 6 1 5 6 4 2 . 5 9 6 3 0 . 0 0 9 4 2 4 **
x -0 . 3 4 6 2 2 3 0 . 2 3 8 1 2 4 -1 . 4 5 4 0 0 . 1 4 5 9 5 7

Overall, for count outcome regression, it seems that Poisson regression suffices as long
as we use the robust standard error. The Negative-Binomial is unlikely to offer more if only
the conditional mean is of interest.

24.3.4 Poisson regression for binary outcomes


Zou (2004) proposed to use Poisson regression to analyze binary outcomes. This can be
reasonable if the parameter of interest is the risk ratio instead of the odds ratio. Importantly,
since the Poisson model is a wrong model, we must use the sandwich covariance estimator.
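As a hedged illustration of this idea (not from the original text), the sketch below fits a log-link Poisson working model to a simulated binary outcome and reports the sandwich standard errors; exp of the coefficient is then interpreted as a risk ratio.

## Poisson working regression for a binary outcome (risk-ratio scale): a sketch
library("sandwich")
library("lmtest")
set.seed(4)
n = 2000
z = rbinom(n, 1, 0.5)                    # binary exposure
y = rbinom(n, 1, 0.2 * exp(0.4 * z))     # true risk ratio exp(0.4)
rr.pois = glm(y ~ z, family = poisson(link = log))
coeftest(rr.pois, vcov = sandwich)       # sandwich standard errors are essential here
exp(coef(rr.pois)["z"])                  # estimated risk ratio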

24.3.5 How robust are the robust standard errors?


Section 24.1 discusses the restricted mean model as an extension of the GLM, allowing for
misspecification of the GLM while still preserving the conditional mean. We can extend the
discussion to other parametric models. Huber (1967) started the literature on the statis-
tical properties of the MLE in a misspecified model, and White (1982) addressed detailed
inferential problems. Buja et al. (2019b) reviewed this topic recently.
The discussion in Section 24.1 is useful when the conditional mean is correctly specified.
However, if we think the GLM is severely misspecified with a wrong conditional mean, then
the robust sandwich standard errors are unlikely to be helpful because the MLE converges
to a wrong parameter in the first place (Freedman, 2006).

24.4 Homework problems


24.1 MLE in GLMs with binary regressors
Continue with Problem 23.1. Find the variance estimators of β̂ without assuming the models
are correct.

24.2 Negative-Binomial covariance matrices


Continue with Problem 23.2. Derive the estimated asymptotic covariance matrices of the
MLE without assuming the Negative-Binomial model is correct.

24.3 Robust standard errors in the Karolinska data


Report the robust standard errors in the case study of Section 21.5 in Chapter 21. For some
models, the function coeftest(*, vcov = sandwich) does work. Alternatively, you can use the
nonparametric bootstrap to obtain the robust standard errors.

24.4 Robust standard errors in the gym data


Report the robust standard errors in the case study of Section 22.3 in Chapter 22.
25
Generalized Estimating Equation for Correlated
Multivariate Data

Previous chapters dealt with cross-sectional data, that is, we observe n units at a partic-
ular time point, collecting various covariates and outcomes. In addition, we assume that
these units are independent, and sometimes, we even assume they are IID draws. Many
applications have correlated data. Two canonical examples are
(E1) repeated measurements of the same set of units over time, which are often called lon-
gitudinal data in biostatistics (Fitzmaurice et al., 2012) or panel data in econometrics
(Wooldridge, 2010); and
(E2) clustered observations belonging to classrooms, villages, etc, which are common in
cluster-randomized experiments in education (Schochet, 2013) and public health
(Turner et al., 2017a,b).
Many excellent textbooks cover this topic intensively. This chapter focuses on a simple yet
powerful strategy, which is a natural extension of the GLM discussed in the last chapter.
It was initially proposed in Liang and Zeger (1986), the most cited paper published in
Biometrika in the past one hundred years (Titterington, 2013). For simplicity, we will use
the term “longitudinal data” for general correlated data.

25.1 Examples of correlated data


25.1.1 Longitudinal data
We have used the data from Royer et al. (2015) in Chapter 22. Each worker’s number of
gym visits was repeatedly measured over more than 100 weeks. It is a standard longitudinal
dataset. In Chapter 22, we conducted analysis for each week separately, and in this chapter,
we will accommodate the longitudinal structure of the data.

25.1.2 Clustered data: a neuroscience experiment


Moen et al. (2016) examined the effects of Pten knockdown and fatty acid delivery on the soma
size of neurons in the brain of a mouse. The useful variables for our analysis are the mouse
id mouseid, the fatty acid level fa, the Pten knockdown indicator pten, the outcome
somasize, and the numbers of neurons numpten and numctrl under Pten knockdown and control, respectively.
> Pten = read.csv("PtenAnalysisData.csv")[, -(7:9)]
> head(Pten)
mouseid fa pten somasize numctrl numpten
1 0 0 0 83.837 30 44
2 0 0 0 69.984 30 44
3 0 0 0 82.128 30 44


4 0 0 0 86.446 30 44
5 0 0 0 74.032 30 44
6 0 0 0 71.693 30 44

The three-way table below shows the treatment combinations for 14 mice, from which
we can see that the Pten knockdown indicator varies within mice, but the fatty acid level
varies only between mice.
> table(Pten$mouseid, Pten$fa, Pten$pten)
, , = 0

0 1 2
0 30 0 0
1 58 0 0
2 18 0 0
3 2 0 0
4 56 0 0
5 0 39 0
6 0 33 0
7 0 58 0
8 0 60 0
9 0 0 15
10 0 0 27
11 0 0 7
12 0 0 34
13 0 0 22

, , = 1

0 1 2
0 44 0 0
1 68 0 0
2 33 0 0
3 11 0 0
4 76 0 0
5 0 55 0
6 0 55 0
7 0 75 0
8 0 92 0
9 0 0 34
10 0 0 29
11 0 0 20
12 0 0 53
13 0 0 38

25.1.3 Clustered data: a public health intervention


Poor sanitation leads to morbidity and mortality in developing countries. In 2012, Guiteras
et al. (2015) conducted a cluster-randomized experiment in rural Bangladesh to evaluate
the effectiveness of different policies on the use of hygienic latrines. To illustrate our theory,
we use a subset of their original data and exclude the households not eligible for subsidies
or with missing outcomes, resulting in 10125 households in total. The median, mean, and
maximum of village size are 83, 119, and 500, respectively. We choose the outcome yit as
the binary indicator for whether the household (i, t) had access to a hygienic latrine or not,
measured in June 2013, and covariate xit as the baseline access rate to hygienic latrines
in the community that household (i, t) belonged to, measured in January 2012 before the
experiment.

The useful variables below are z, x, y, and vid, which denote the treatment indicator, the
covariate xit, the outcome, and the village id, respectively.
> hygaccess = read.csv("hygaccess.csv")
> hygaccess = hygaccess[, c("r4_hyg_access", "treat_cat_1",
+   "bl_c_hyg_access", "vid", "eligible")]
> hygaccess = hygaccess[which(hygaccess$eligible == "Eligible" &
+   hygaccess$r4_hyg_access != "Missing"), ]
> hygaccess$y = ifelse(hygaccess$r4_hyg_access == "Yes", 1, 0)
> hygaccess$z = hygaccess$treat_cat_1
> hygaccess$x = hygaccess$bl_c_hyg_access

25.2 Marginal model and the generalized estimating equation


We will extend the restricted mean model to deal with longitudinal data, where we observe
outcome yit and covariate xit for each unit i = 1, . . . , n at time t = 1, . . . , ni . The ni ’s
can vary across units. When ni = 1 for all units, we drop the time index and model the
conditional mean as
E(yi | xi ) = µ(xti β),
and use the following estimating equation to estimate the parameter β:
n
X yi − µ(xt β) ∂µ(xt β)
i i
= 0. (25.1)
i=1
σ̃ 2 (xi , β) ∂β

In (25.1), σ̃²(xi, β) is a working variance function usually motivated by a GLM, but it can
be misspecified. With an ni × 1 vector outcome and an ni × p covariate matrix
\[
Y_i = \begin{pmatrix} y_{i1} \\ \vdots \\ y_{in_i} \end{pmatrix},
\qquad
X_i = \begin{pmatrix} x_{i1}^t \\ \vdots \\ x_{in_i}^t \end{pmatrix},
\qquad (i = 1, \ldots, n) \qquad (25.2)
\]

we can extend the restricted mean model to
\[
E(Y_i \mid X_i) \equiv \begin{pmatrix} E(y_{i1}\mid X_i) \\ \vdots \\ E(y_{in_i}\mid X_i) \end{pmatrix} \qquad (25.3)
\]
\[
= \begin{pmatrix} E(y_{i1}\mid x_{i1}) \\ \vdots \\ E(y_{in_i}\mid x_{in_i}) \end{pmatrix} \qquad (25.4)
\]
\[
= \begin{pmatrix} \mu(x_{i1}^t\beta) \\ \vdots \\ \mu(x_{in_i}^t\beta) \end{pmatrix} \qquad (25.5)
\]
\[
\equiv \mu(X_i, \beta), \qquad (25.6)
\]

where (25.3) and (25.6) are definitions, and (25.4) and (25.5) are the two key assumptions.
Assumption (25.4) requires that the conditional mean of yit depends only on xit but not
on any other xis with s ̸= t. Assumption (25.5) requires that the relationship between xit

and yit is stable across units and time points with the function µ(·) and the parameter β
not varying with respect to i or t. The model assumptions in (25.4) and (25.5) are really
strong, and I defer the critiques to the end of this chapter. Nevertheless, the marginal model
attracts practitioners for
(A1) its similarity to GLM and the restricted mean model, and

(A2) its simplicity of requiring only specification of the marginal conditional means, not
the whole joint distribution.
The advantage (A1) facilitates the interpretation of the coefficient, and the advantage (A2)
is crucial because of the lack of familiar multivariate distributions in statistics except for
the multivariate Normal. The generalized estimating equation (GEE) for β is the vector
form of (25.1):
\[
\sum_{i=1}^n \underbrace{\frac{\partial\mu(X_i,\beta)}{\partial\beta}}_{p\times n_i}\,
\underbrace{\tilde V^{-1}(X_i,\beta)}_{n_i\times n_i}\,
\underbrace{\{Y_i - \mu(X_i,\beta)\}}_{n_i\times 1}
= \underbrace{0}_{p\times 1}, \qquad (25.7)
\]

where (25.7) has a similar form as (25.1) with three terms organized to match the dimension
so that matrix multiplications are well-defined:
(GEE1) the last term
\[
Y_i - \mu(X_i,\beta) = \begin{pmatrix} y_{i1} - \mu(x_{i1}^t\beta) \\ \vdots \\ y_{in_i} - \mu(x_{in_i}^t\beta) \end{pmatrix}
\]
represents the residual vector,

(GEE2) the second term is the inverse of Ṽ(Xi, β), a working covariance matrix of the
conditional distribution of Yi given Xi, which may be misspecified:
\[
\tilde V(X_i,\beta) \neq V(X_i) \equiv \mathrm{cov}(Y_i \mid X_i).
\]
It is relatively easy to specify the working variance σ̃²(xit, β) for each marginal
component, for example, based on the marginal GLM. So the key is to specify the
ni × ni dimensional correlation matrix Ri to obtain
\[
\tilde V(X_i,\beta) = \mathrm{diag}\{\tilde\sigma(x_{it},\beta)\}_{t=1}^{n_i}\; R_i\; \mathrm{diag}\{\tilde\sigma(x_{it},\beta)\}_{t=1}^{n_i}.
\]
We assume that the Ri's are given for now, and will discuss how to choose them in a
later section.
(GEE3) the first term is the partial derivative of an ni × 1 vector µ(Xi, β) =
(µ(x_{i1}^tβ), ..., µ(x_{in_i}^tβ))^t with respect to a p × 1 vector β = (β1, ..., βp)^t:
\[
\frac{\partial\mu(X_i,\beta)}{\partial\beta}
= \left( \frac{\partial\mu(x_{i1}^t\beta)}{\partial\beta}, \ldots, \frac{\partial\mu(x_{in_i}^t\beta)}{\partial\beta} \right)
= \begin{pmatrix}
\frac{\partial\mu(x_{i1}^t\beta)}{\partial\beta_1} & \cdots & \frac{\partial\mu(x_{in_i}^t\beta)}{\partial\beta_1} \\
\vdots & & \vdots \\
\frac{\partial\mu(x_{i1}^t\beta)}{\partial\beta_p} & \cdots & \frac{\partial\mu(x_{in_i}^t\beta)}{\partial\beta_p}
\end{pmatrix},
\]
which is a p × ni matrix, denoted by Di(β).



25.3 Statistical inference with GEE


25.3.1 Computation using the Gauss–Newton method
We can use Newton's method to solve the GEE (25.7). However, calculating the derivative
of the left-hand side of (25.7) involves calculating the second-order derivative of µ(Xi, β)
with respect to β. A simpler alternative without calculating the second-order derivative is
the Gauss–Newton method based on the following approximation:
\[
0 = \sum_{i=1}^n \frac{\partial\mu(X_i,\beta)}{\partial\beta}\,\tilde V^{-1}(X_i,\beta)\,\{Y_i - \mu(X_i,\beta)\}
\]
\[
\approx \sum_{i=1}^n D_i(\beta^{\mathrm{old}})\,\tilde V^{-1}(X_i,\beta^{\mathrm{old}})\left\{ Y_i - \mu(X_i,\beta^{\mathrm{old}}) - D_i^t(\beta^{\mathrm{old}})(\beta - \beta^{\mathrm{old}}) \right\}
\]
\[
= \sum_{i=1}^n D_i(\beta^{\mathrm{old}})\,\tilde V^{-1}(X_i,\beta^{\mathrm{old}})\left\{ Y_i - \mu(X_i,\beta^{\mathrm{old}}) \right\}
- \sum_{i=1}^n D_i(\beta^{\mathrm{old}})\,\tilde V^{-1}(X_i,\beta^{\mathrm{old}})\,D_i^t(\beta^{\mathrm{old}})(\beta - \beta^{\mathrm{old}}).
\]
So given β^old, we update it as
\[
\beta^{\mathrm{new}} = \beta^{\mathrm{old}}
+ \left\{ \sum_{i=1}^n D_i(\beta^{\mathrm{old}})\,\tilde V^{-1}(X_i,\beta^{\mathrm{old}})\,D_i^t(\beta^{\mathrm{old}}) \right\}^{-1}
\sum_{i=1}^n D_i(\beta^{\mathrm{old}})\,\tilde V^{-1}(X_i,\beta^{\mathrm{old}})\left\{ Y_i - \mu(X_i,\beta^{\mathrm{old}}) \right\}. \qquad (25.8)
\]
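To make the update (25.8) concrete, here is a minimal R sketch (not from the original text) for the marginal logistic model with the independence working correlation, in which Di(β) = Xi^t Ṽ(Xi, β) as derived in Section 25.4.2; the iteration should reproduce the pooled logistic regression point estimate. The simulated clustered data are arbitrary.

## Gauss-Newton iteration (25.8): marginal logistic model, independence working correlation
set.seed(5)
n  = 200                                  # clusters
ni = 5                                    # observations per cluster
x  = rnorm(n * ni)
u  = rep(rnorm(n, sd = 0.5), each = ni)   # cluster-level heterogeneity
y  = rbinom(n * ni, 1, plogis(-0.5 + x + u))
X  = cbind(1, x)
beta = rep(0, ncol(X))
for (iter in 1:25) {
  pi.hat = as.vector(plogis(X %*% beta))
  W      = pi.hat * (1 - pi.hat)          # marginal working variances
  ## with D_i = X_i^t V_i and Vtilde = V_i, (25.8) becomes an IRLS-type update
  beta   = beta + solve(t(X) %*% (X * W), t(X) %*% (y - pi.hat))
}
cbind(gauss.newton = beta,
      pooled.glm = coef(glm(y ~ x, family = binomial(link = logit))))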

25.3.2 Asymptotic inference

The asymptotic distribution of β̂ follows from Theorem D.2. Similar to the proof of Theorem
24.1, we can verify that $\sqrt{n}\,(\hat\beta - \beta) \to \mathrm{N}(0, B^{-1}MB^{-1})$ in distribution, where
\[
B = E\left\{ n^{-1}\sum_{i=1}^n D_i(\beta)\,\tilde V^{-1}(X_i,\beta)\,D_i^t(\beta) \right\},
\]
\[
M = E\left\{ n^{-1}\sum_{i=1}^n D_i(\beta)\,\tilde V^{-1}(X_i,\beta)\,V(X_i)\,\tilde V^{-1}(X_i,\beta)\,D_i^t(\beta) \right\}.
\]
After obtaining β̂ and the residual vector $\hat\varepsilon_i = Y_i - \mu(X_i,\hat\beta)$ for unit i (i = 1, ..., n), we
can conduct asymptotic inference based on the Normal approximation
\[
\hat\beta \overset{\mathrm{a}}{\sim} \mathrm{N}(\beta,\; n^{-1}\hat B^{-1}\hat M\hat B^{-1}),
\]
where
\[
\hat B = n^{-1}\sum_{i=1}^n D_i(\hat\beta)\,\tilde V^{-1}(X_i,\hat\beta)\,D_i^t(\hat\beta),
\qquad
\hat M = n^{-1}\sum_{i=1}^n D_i(\hat\beta)\,\tilde V^{-1}(X_i,\hat\beta)\,\hat\varepsilon_i\hat\varepsilon_i^t\,\tilde V^{-1}(X_i,\hat\beta)\,D_i^t(\hat\beta).
\]

This covariance estimator, proposed by Liang and Zeger (1986), is robust to the misspecification
of the marginal variances and the correlation structure, as long as the conditional
mean of Yi given Xi is correctly specified.

25.3.3 Implementation: choice of the working covariance matrix


We have not discussed the choice of the working correlation matrix Ri . Different choices
do not affect the consistency but affect the efficiency of β̂. A simple starting point is the
independent working correlation matrix Ri = Ini . Under this correlation matrix, the GEE
reduces to
\[
\sum_{i=1}^n \left( \frac{\partial\mu(x_{i1}^t\beta)}{\partial\beta}, \ldots, \frac{\partial\mu(x_{in_i}^t\beta)}{\partial\beta} \right)
\begin{pmatrix}
\tilde\sigma^{-2}(x_{i1},\beta) & & \\
& \ddots & \\
& & \tilde\sigma^{-2}(x_{in_i},\beta)
\end{pmatrix}
\begin{pmatrix}
y_{i1} - \mu(x_{i1}^t\beta) \\ \vdots \\ y_{in_i} - \mu(x_{in_i}^t\beta)
\end{pmatrix} = 0,
\]
or, more compactly,
\[
\sum_{i=1}^n\sum_{t=1}^{n_i} \frac{y_{it} - \mu(x_{it}^t\beta)}{\tilde\sigma^2(x_{it},\beta)}\,\frac{\partial\mu(x_{it}^t\beta)}{\partial\beta} = 0. \qquad (25.9)
\]

This is simply the estimating equation of a restricted mean model treating all data points
(i, t) as independent observations. This implies that the point estimate assuming the inde-
pendent working correlation matrix is still consistent, although we must change the standard
error as in Section 25.3.2.
With this simple starting point, we have a consistent yet inefficient estimate of β, and
then we can compute the residuals. The correlation among the residuals contains information
about the true covariance matrix. With small and equal ni ’s, we can estimate the conditional
covariance without imposing any structure based on the residuals. Using the estimated
covariance matrix, we can update the GEE estimate to improve efficiency. This leads to a
two-step procedure.
An important intermediate case is motivated by the exchangeability of the data
points within the same unit i, so the working covariance matrix is
$\tilde V(X_i,\beta) = \mathrm{diag}\{\tilde\sigma(x_{it})\}_{t=1}^{n_i}\,R_i(\rho)\,\mathrm{diag}\{\tilde\sigma(x_{it})\}_{t=1}^{n_i}$, where
\[
R_i(\rho) = \begin{pmatrix}
1 & \rho & \cdots & \rho \\
\rho & 1 & \cdots & \rho \\
\vdots & \vdots & & \vdots \\
\rho & \rho & \cdots & 1
\end{pmatrix}.
\]
We can estimate ρ based on the residuals from the first step.


The above three choices of the working covariance matrix are called “independent”,
“unstructured”, and “exchangeable” in the “corstr” parameter of the function gee in the gee
package in R. This function also contains other choices proposed by Liang and Zeger (1986).
A carefully chosen working covariance matrix can lead to efficiency gain compared to
the simple independent covariance matrix. An efficient estimator requires a correctly spec-
ified working covariance matrix. This is often an infeasible goal, and what is more, the
conditional covariance cov(Yi | Xi ) is a nuisance parameter if the conditional mean is the
main parameter of interest. In practice, the independent working covariance suffices in many
applications despite its potential efficiency loss. This is similar to the use of OLS in the pres-
ence of heteroskedasticity in linear models. Section 25.4 focuses on the independent working
covariance, which is common in econometrics. Section 25.6 gives further justifications for
this simple strategy.

25.4 A special case: cluster-robust standard error


Importantly, Liang and Zeger (1986)’s standard error treats each cluster i as an independent
contributor to the uncertainty. In econometrics, this is called the cluster-robust standard
error. Alternatively, we can use the bootstrap by resampling the clusters to approximate the
asymptotic covariance matrix. I will discuss linear and logistic regressions in this section,
and leave the technical details of Poisson regression as a homework problem.
Stack the Yi's and Xi's in (25.2) together to obtain
\[
Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix},
\qquad
X = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix},
\]
which are the N-dimensional outcome vector and the N × p covariate matrix, where
$N = \sum_{i=1}^n n_i$.

25.4.1 OLS
An important special case is the marginal linear model with an independent working covariance
matrix and homoskedasticity, resulting in the following estimating equation:
\[
\sum_{i=1}^n\sum_{t=1}^{n_i} x_{it}(y_{it} - x_{it}^t\beta) = 0.
\]
So the point estimator is just the pooled OLS using all data points:
\[
\hat\beta = \left( \sum_{i=1}^n\sum_{t=1}^{n_i} x_{it}x_{it}^t \right)^{-1}\sum_{i=1}^n\sum_{t=1}^{n_i} x_{it}y_{it}
= \left( \sum_{i=1}^n X_i^tX_i \right)^{-1}\sum_{i=1}^n X_i^tY_i
= (X^tX)^{-1}X^tY.
\]
The three forms of β̂ above are identical: the first one is based on N observations, the
second one is based on n independent units, and the last one is based on the matrix form
with the pooled data. Although the point estimate is identical to the case with independent
data points, we must adjust for the standard error according to Section 25.3.2. From
\[
D_i(\beta) = (x_{i1}, \ldots, x_{in_i}) = X_i^t,
\]
we can verify that
\[
\widehat{\mathrm{cov}}(\hat\beta) = \left( \sum_{i=1}^n X_i^tX_i \right)^{-1}\left( \sum_{i=1}^n X_i^t\hat\varepsilon_i\hat\varepsilon_i^tX_i \right)\left( \sum_{i=1}^n X_i^tX_i \right)^{-1},
\]
where $\hat\varepsilon_i = Y_i - X_i\hat\beta = (\hat\varepsilon_{i1},\ldots,\hat\varepsilon_{in_i})^t$ is the residual vector of unit i. This is called the
(Liang–Zeger) cluster-robust covariance matrix in econometrics. The square roots of the
diagonal terms are called the cluster-robust standard errors. The cluster-robust covariance
matrix is often much larger than the (Eicker–Huber–White) heteroskedasticity-robust covariance
matrix assuming independence of observations (i, t):
\[
\widehat{\mathrm{cov}}_{\textsc{ehw}}(\hat\beta) = \left( \sum_{i=1}^n\sum_{t=1}^{n_i} x_{it}x_{it}^t \right)^{-1}\left( \sum_{i=1}^n\sum_{t=1}^{n_i} \hat\varepsilon_{it}^2\,x_{it}x_{it}^t \right)\left( \sum_{i=1}^n\sum_{t=1}^{n_i} x_{it}x_{it}^t \right)^{-1}.
\]

Note that
\[
X^tX = \sum_{i=1}^n X_i^tX_i = \sum_{i=1}^n\sum_{t=1}^{n_i} x_{it}x_{it}^t,
\]
so the bread matrices in $\widehat{\mathrm{cov}}(\hat\beta)$ and $\widehat{\mathrm{cov}}_{\textsc{ehw}}(\hat\beta)$ are identical. The only difference is due to
the meat matrices:
\[
\sum_{i=1}^n X_i^t\hat\varepsilon_i\hat\varepsilon_i^tX_i
= \sum_{i=1}^n \left( \sum_{t=1}^{n_i}\hat\varepsilon_{it}x_{it} \right)\left( \sum_{t=1}^{n_i}\hat\varepsilon_{it}x_{it} \right)^t
\neq \sum_{i=1}^n\sum_{t=1}^{n_i} \hat\varepsilon_{it}^2\,x_{it}x_{it}^t
\]
in general.
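As a hedged numerical check (not from the original text), the sketch below computes the Liang–Zeger cluster-robust covariance of the pooled OLS estimator directly from the formula above and compares it with vcovCL from the sandwich package, which applies finite-sample corrections by default and should therefore be close to, but not exactly equal to, the plain formula; the simulated clustered data are arbitrary.

## hand-coded Liang-Zeger cluster-robust covariance for pooled OLS
library("sandwich")
set.seed(6)
n  = 300                                  # clusters
ni = 4                                    # observations per cluster
id = rep(1:n, each = ni)
x  = rnorm(n * ni)
y  = 1 + 0.5 * x + rep(rnorm(n), each = ni) + rnorm(n * ni)
fit = lm(y ~ x)
X   = model.matrix(fit)
eps = residuals(fit)
bread = solve(crossprod(X))               # (X^t X)^{-1}
meat  = matrix(0, ncol(X), ncol(X))
for (g in unique(id)) {
  sg   = colSums(X[id == g, , drop = FALSE] * eps[id == g])  # X_i^t eps_i
  meat = meat + tcrossprod(sg)
}
V.cluster = bread %*% meat %*% bread
sqrt(diag(V.cluster))                     # hand-coded cluster-robust SEs
sqrt(diag(vcovCL(fit, cluster = id)))     # sandwich's version, finite-sample corrected
sqrt(diag(vcovHC(fit, type = "HC0")))     # EHW SEs ignoring clustering, typically smaller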

25.4.2 Logistic regression

For binary outcomes, we can use the marginal logistic model with an independent working
covariance matrix, resulting in the following estimating equation:
\[
\sum_{i=1}^n\sum_{t=1}^{n_i} x_{it}\{y_{it} - \pi(x_{it},\beta)\} = 0,
\]
where $\pi(x_{it},\beta) = e^{x_{it}^t\beta}/(1+e^{x_{it}^t\beta})$. So the point estimator is the pooled logistic regression
using all data points, but we must adjust for the standard error according to Section 25.3.2.
From
\[
D_i(\beta) = \big( \pi(x_{i1},\beta)\{1-\pi(x_{i1},\beta)\}x_{i1}, \ldots, \pi(x_{in_i},\beta)\{1-\pi(x_{in_i},\beta)\}x_{in_i} \big)
= X_i^t\,\tilde V(X_i,\beta),
\]
with $\tilde V(X_i,\beta) = \mathrm{diag}\{\pi(x_{it},\beta)\{1-\pi(x_{it},\beta)\}\}_{t=1}^{n_i}$, we can verify that
\[
\hat B = n^{-1}\sum_{i=1}^n X_i^t\hat V_iX_i,
\qquad
\hat M = n^{-1}\sum_{i=1}^n X_i^t\hat\varepsilon_i\hat\varepsilon_i^tX_i,
\]
where $\hat\varepsilon_i = (\hat\varepsilon_{i1},\ldots,\hat\varepsilon_{in_i})^t$ with residual $\hat\varepsilon_{it} = y_{it} - e^{x_{it}^t\hat\beta}/(1+e^{x_{it}^t\hat\beta})$, and
$\hat V_i = \mathrm{diag}\{\pi(x_{it},\hat\beta)\{1-\pi(x_{it},\hat\beta)\}\}_{t=1}^{n_i}$. So the cluster-robust covariance estimator for logistic
regression is
\[
\widehat{\mathrm{cov}}(\hat\beta) = \left( \sum_{i=1}^n X_i^t\hat V_iX_i \right)^{-1}\left( \sum_{i=1}^n X_i^t\hat\varepsilon_i\hat\varepsilon_i^tX_i \right)\left( \sum_{i=1}^n X_i^t\hat V_iX_i \right)^{-1}.
\]

I leave the cluster-robust covariance estimator for Poisson regression to Problem 25.5.
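As a hedged companion to the derivation above (not from the original text), the sketch below fits a pooled logistic regression to simulated clustered binary data and compares the naive, EHW, and cluster-robust standard errors, the last computed with vcovCL from the sandwich package using the cluster id.

## pooled logistic regression with cluster-robust standard errors: a sketch
library("sandwich")
set.seed(7)
n  = 250                                  # clusters
ni = 6                                    # observations per cluster
id = rep(1:n, each = ni)
x  = rnorm(n * ni)
u  = rep(rnorm(n), each = ni)             # shared cluster effect
y  = rbinom(n * ni, 1, plogis(-0.3 + 0.8 * x + u))
fit = glm(y ~ x, family = binomial(link = logit))
cbind(naive   = sqrt(diag(vcov(fit))),
      ehw     = sqrt(diag(sandwich(fit))),
      cluster = sqrt(diag(vcovCL(fit, cluster = id))))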

25.5 Application
We will use the gee package for all the analyses below.

25.5.1 Clustered data: a neuroscience experiment


The original study was interested in the potential interaction between two treatments, so I
always include the interaction term in the regression model.
From the simple specification below, pten has a significant effect, but fa and the inter-
actions are not significant.
Pten . gee = gee ( somasize ~ factor ( fa )* pten ,
+ id = mouseid ,
+ family = gaussian ,
+ corstr = " independence " ,
+ data = Pten )
> summary ( Pten . gee )$ coef
Estimate Naive S . E . Naive z Robust S . E . Robust z
( Intercept ) 93.106 1.594 58.4216 3.059 30.4374
factor ( fa ) 1 3.756 2.175 1.7268 3.174 1.1836
factor ( fa ) 2 6.907 2.551 2.7078 5.407 1.2774
pten 11.039 2.082 5.3016 2.200 5.0166
factor ( fa ) 1 : pten 8.727 2.834 3.0795 5.023 1.7373
factor ( fa ) 2 : pten -2 . 9 0 4 3 . 2 7 0 -0 . 8 8 8 1 3 . 5 5 4 -0 . 8 1 7 3
>
>
> Pten . gee = gee ( somasize ~ factor ( fa )* pten ,
+ id = mouseid ,
+ family = gaussian ,
+ corstr = " exchangeable " ,
+ data = Pten )
> summary ( Pten . gee )$ coef
Estimate Naive S . E . Naive z Robust S . E . Robust z
( Intercept ) 90.900 3.532 25.7376 2.701 33.6535
factor ( fa ) 1 4.921 5.115 0.9621 2.914 1.6889
factor ( fa ) 2 6.408 5.066 1.2649 5.904 1.0853
pten 11.501 1.979 5.8120 2.190 5.2515
factor ( fa ) 1 : pten 8.807 2.688 3.2766 5.050 1.7439
factor ( fa ) 2 : pten -1 . 5 2 5 3 . 1 1 3 -0 . 4 8 9 8 2 . 7 0 3 -0 . 5 6 4 1

Including two covariates, we have the following results. The covariates are predictive of
the outcome, changing the significance level of the main effect of fa. The interaction terms
between pten and fa are not significant either.
> Pten . gee = gee ( somasize ~ factor ( fa )* pten + numctrl + numpten ,
+ id = mouseid ,
+ family = gaussian ,
+ corstr = " independence " ,
+ data = Pten )
> summary ( Pten . gee )$ coef
Estimate Naive S . E . Naive z Robust S . E . Robust z
( Intercept ) 81.9422 2.791 29.3602 4.0917 20.026
factor ( fa ) 1 6.2267 2.237 2.7835 4.2429 1.468
factor ( fa ) 2 14.8956 2.657 5.6053 4.1839 3.560
pten 12.3771 2.020 6.1272 2.2477 5.507
numctrl 0.8721 0.120 7.2672 0.3028 2.880
numpten -0 . 4 8 4 3 0 . 1 0 1 -4 . 7 9 4 8 0.2381 -2 . 0 3 4
factor ( fa ) 1 : pten 7.7498 2.744 2.8240 5.1064 1.518
factor ( fa ) 2 : pten -2 . 9 6 2 9 3 . 1 6 6 -0 . 9 3 5 9 3.3105 -0 . 8 9 5
>

>
> Pten . gee = gee ( somasize ~ factor ( fa )* pten + numctrl + numpten ,
+ id = mouseid ,
+ family = gaussian ,
+ corstr = " exchangeable " ,
+ data = Pten )
> summary ( Pten . gee )$ coef
Estimate Naive S . E . Naive z Robust S . E . Robust z
( Intercept ) 85.3316 5.2872 16.1393 5.4095 15.7745
factor ( fa ) 1 5.4952 4.2761 1.2851 4.0207 1.3667
factor ( fa ) 2 12.2174 4.1669 2.9320 4.2363 2.8840
pten 11.8044 1.9718 5.9865 2.1946 5.3789
numctrl 0.9326 0.2867 3.2527 0.3479 2.6810
numpten -0 . 5 6 7 8 0 . 2 5 0 4 -2 . 2 6 7 4 0 . 2 7 7 2 -2 . 0 4 8 2
factor ( fa ) 1 : pten 8.5137 2.6777 3.1795 5.0612 1.6821
factor ( fa ) 2 : pten -1 . 7 7 5 5 3 . 0 9 9 5 -0 . 5 7 2 8 2 . 7 5 4 7 -0 . 6 4 4 5

From the regressions above, we observe that (1) the two choices of the working covariance
matrix do not lead to fundamentally different results; and (2) without using the
cluster-robust standard errors, the results can be misleading.

25.5.2 Clustered data: a public health intervention


We first fit simple GEE without using the covariate.
> hygaccess . gee = gee ( y ~ z , id = vid ,
+ family = binomial ( link = logit ) ,
+ corstr = " independence " ,
+ data = hygaccess )
> summary ( hygaccess . gee )$ coef
Estimate Naive S . E . Naive z Robust S . E . Robust z
( Intercept ) -0 . 7 5 6 8 0 . 0 4 4 3 9 -1 7 . 0 4 9 0 . 1 7 6 3 -4 . 2 9 2 4
zLPP Only 0.1551 0.06657 2.330 0.2301 0.6741
zLPP + Subsidy 0.7562 0.05503 13.742 0.2027 3.7313
zLPP + Subsidy + Supply 0.7344 0.05444 13.490 0.2010 3.6546
zSupply Only 0.3568 0.07364 4.846 0.3091 1.1544
>
> hygaccess . gee = gee ( y ~ z , id = vid ,
+ family = binomial ( link = logit ) ,
+ corstr = " exchangeable " ,
+ data = hygaccess )
> summary ( hygaccess . gee )$ coef
Estimate Naive S . E . Naive z Robust S . E . Robust z
( Intercept ) -0 . 7 7 9 9 0 . 1 3 1 4 -5 . 9 3 7 1 0 . 1 5 2 2 -5 . 1 2 3 5
zLPP Only 0.1638 0.2042 0.8021 0.2290 0.7152
zLPP + Subsidy 0.7789 0.1500 5.1926 0.1790 4.3524
zLPP + Subsidy + Supply 0.7348 0.1506 4.8798 0.1760 4.1753
zSupply Only 0.2690 0.2207 1.2187 0.3011 0.8931

Without adjusting for the covariates, treatment levels “zLPP+Subsidy” and


“zLPP+Subsidy+Supply” are significant. The exchangeable working covariance matrix does
seem to improve the estimated precision.
We then fit GEE with a covariate.
> hygaccess . gee = gee ( y ~ z + x , id = vid ,
+ family = binomial ( link = logit ) ,
+ corstr = " independence " ,
+ data = hygaccess )
> summary ( hygaccess . gee )$ coef
Estimate Naive S . E . Naive z Robust S . E . Robust z
( Intercept ) -1 . 7 5 2 6 0 . 0 6 1 7 4 -2 8 . 3 8 6 0 . 1 3 9 8 -1 2 . 5 3 8
zLPP Only 0.2277 0.06833 3.332 0.1393 1.635
zLPP + Subsidy 0.6850 0.05645 12.133 0.1191 5.749
zLPP + Subsidy + Supply 0.7389 0.05578 13.246 0.1361 5.430


zSupply Only 0.3614 0.07514 4.810 0.2426 1.490
x 2.0488 0.08209 24.957 0.2158 9.492
>
> hygaccess . gee = gee ( y ~ z + x , id = vid ,
+ family = binomial ( link = logit ) ,
+ corstr = " exchangeable " ,
+ data = hygaccess )
> summary ( hygaccess . gee )$ coef
                        Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)              -1.7976     0.1324 -13.575      0.1541  -11.667
zLPP Only                 0.3038     0.1781   1.705      0.1946    1.561
zLPP + Subsidy            0.7227     0.1316   5.491      0.1271    5.688
zLPP + Subsidy + Supply   0.8547     0.1327   6.441      0.1247    6.855
zSupply Only              0.3236     0.1911   1.693      0.2398    1.350
x                         1.9497     0.1128  17.286      0.1947   10.016

Covariate adjustment improves efficiency and makes the choice of the working covariance
matrix less important.

25.5.3 Longitudinal data


The regression formula f.reg will remain the same although other parameters may vary.
> library ( " gee " )
> library ( " foreign " )
> gym1 = read.dta("gym_treatment_exp_weekly.dta")
> f . reg = weekly _ visit ~ incentive _ commit + incentive + target + member _ gym _ pre

Using all data, we find a significant effect of incentive_commit but an insignificant effect
of incentive.
> normal.gee = gee(f.reg, id = id,
+                  family = gaussian,
+                  corstr = "independence",
+                  data = gym1)
> normal . gee = summary ( normal . gee )$ coef
> normal . gee
                 Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)      -0.69005   0.011136 -61.968     0.08672  -7.9572
incentive_commit  0.15666   0.008358  18.745     0.06376   2.4569
incentive         0.01022   0.008275   1.235     0.05910   0.1729
target            0.62666   0.007465  83.949     0.06773   9.2527
member_gym_pre    1.14919   0.007077 162.375     0.06252  18.3801

However, this pooled analysis can be misleading because we have seen from the analysis
before that the treatments have no effects in the pre-experimental periods and smaller effects
in the long term. A pooled analysis can dilute the short-term effects, missing the treatment
effect heterogeneity across time. This can be fixed by the following subgroup analysis based
on time.
> normal.gee1 = gee(f.reg, id = id,
+                   subset = (incentive_week < 0),
+                   family = gaussian,
+                   corstr = "independence",
+                   data = gym1)
> normal.gee1 = summary(normal.gee1)$coef
> normal.gee1
                  Estimate Naive S.E. Naive z Robust S.E.  Robust z
(Intercept)      -0.879374    0.04230 -20.7868     0.08739 -10.06224
incentive_commit -0.004241    0.03175  -0.1336     0.06243  -0.06794
incentive        -0.073884    0.03144  -2.3502     0.06223  -1.18728
target            0.742675    0.02836  26.1887     0.06701  11.08301
member_gym_pre    1.601569    0.02689  59.5664     0.06600  24.26763


>
>
> normal.gee2 = gee(f.reg, id = id,
+                   subset = (incentive_week > 0 & incentive_week < 15),
+                   family = gaussian,
+                   corstr = "independence",
+                   data = gym1)
> normal.gee2 = summary(normal.gee2)$coef
> normal.gee2
                 Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)       -0.7925    0.03275 -24.194     0.08982   -8.823
incentive_commit   0.3662    0.02458  14.898     0.06895    5.311
incentive          0.1744    0.02434   7.166     0.06457    2.701
target             0.6735    0.02196  30.674     0.07159    9.408
member_gym_pre     1.4138    0.02082  67.914     0.06727   21.018
>
> normal.gee3 = gee(f.reg, id = id,
+                   subset = (incentive_week >= 15),
+                   family = gaussian,
+                   corstr = "independence",
+                   data = gym1)
> normal.gee3 = summary(normal.gee3)$coef
> normal.gee3
                  Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)      -0.661500   0.012222  -54.13     0.09028  -7.3273
incentive_commit  0.134789   0.009173   14.69     0.06676   2.0189
incentive        -0.009716   0.009082   -1.07     0.06142  -0.1582
target            0.611635   0.008193   74.66     0.07042   8.6860
member_gym_pre    1.077874   0.007768  138.77     0.06494  16.5967

Changing the family parameter to poisson(link = log), we can fit a marginal log-linear
model with independent Poisson covariance. Figure 25.1 shows the point estimates and
confidence intervals based on the regressions above. The confidence intervals based on the
cluster-robust standard errors are much wider than those based on the EHW standard
errors. Without dealing with clustering, the confidence intervals are too narrow and give
wrong inference.
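As a concrete illustration of the sentence above, the following minimal sketch (mine, not part of the original code) refits the pooled regression as a marginal log-linear model with an independent Poisson working covariance; it assumes the data frame gym1 and the formula f.reg defined above.

# marginal log-linear model with independence working covariance (sketch)
poisson.gee = gee(f.reg, id = id,
                  family = poisson(link = log),
                  corstr = "independence",
                  data = gym1)
summary(poisson.gee)$coef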

25.6 Critiques on the key assumptions


Consider the simple case with $n_i = 2$ for all $i$ below.

25.6.1 Assumption (25.4)


Assumption (25.4) requires
$$E(y_{it} \mid X_i) = E(y_{it} \mid x_{it}),$$
which holds automatically if $x_{it} = x_i$ is time-invariant. With time-varying covariates, it
effectively rules out the dynamics between $x$ and $y$. Assumption (25.4) holds in the
data-generating process in which $x_{i1} \rightarrow x_{i2}$ and each outcome depends only on its
contemporaneous covariate ($x_{i1} \rightarrow y_{i1}$ and $x_{i2} \rightarrow y_{i2}$).

FIGURE 25.1: GEE analysis of the gym data. (Confidence intervals based on EHW and LZ standard errors for the coefficients of incentive_commit and incentive, from the Normal and Poisson fits, in the pooled, before, short, and long analyses.)

It does not hold if the lagged $x$ affects $y$ or the lagged $y$ affects $x$, that is, when the
graph above is augmented with arrows such as $x_{i1} \rightarrow y_{i2}$ or $y_{i1} \rightarrow x_{i2}$.
With more complex data-generating processes that allow such feedback, for example with the
lagged outcome $y_{i1}$ affecting both $x_{i2}$ and $y_{i2}$, Assumption (25.4) does not hold
in general.

Liang and Zeger (1986) assumed fixed covariates, ruling out the dynamics of x. Pepe and
Anderson (1994) pointed out the importance of Assumption (25.4) in GEE with random
time-varying covariates. Pepe and Anderson (1994) also showed that with an independent
working covariance matrix, we can drop Assumption (25.4) as long as the marginal conditional
mean is correctly specified. That is, if $E(y_{it} \mid x_{it}) = \mu(x_{it}^t\beta)$, then
$$E\left\{\sum_{i=1}^n\sum_{t=1}^{n_i} \frac{y_{it}-\mu(x_{it}^t\beta)}{\tilde\sigma^2(x_{it},\beta)}\frac{\partial\mu(x_{it}^t\beta)}{\partial\beta}\right\}
= \sum_{i=1}^n\sum_{t=1}^{n_i} E\left\{\frac{E\{y_{it}-\mu(x_{it}^t\beta)\mid x_{it}\}}{\tilde\sigma^2(x_{it},\beta)}\frac{\partial\mu(x_{it}^t\beta)}{\partial\beta}\right\}
= 0.$$

This gives another justification for the use of the independent working covariance matrix
even though it can result in efficiency loss when Assumption (25.4) holds.

25.6.2 Assumption (25.5)


Assumption (25.5) requires a "stable" relationship between $x$ and $y$ across clusters and time
points:
$$E(y_{it} \mid x_{it}) = \mu(x_{it}^t\beta)$$
where µ and β do not depend on i or t. For clustered data, we can justify this assumption
by the exchangeability of the units within clusters. However, it is much harder to interpret
or justify it for longitudinal data with complex outcome dynamics.
We consider linear structural equations with a scalar time-invariant covariate. Without
direct dependence of $y_{i2}$ on $y_{i1}$, the data-generating process
$$y_{i1} = \alpha_1 + \beta x_i + \varepsilon_{i1}, \qquad y_{i2} = \alpha_2 + \beta x_i + \varepsilon_{i2},$$
corresponding to the graph with arrows $x_i \rightarrow y_{i1}$ and $x_i \rightarrow y_{i2}$, has
conditional expectations $E(y_{it} \mid x_i) = \alpha_t + \beta x_i$ if
$$E(\varepsilon_{it} \mid x_i) = 0. \tag{25.10}$$

However, with direct dependence of $y_{i2}$ on $y_{i1}$, the data-generating process
$$y_{i1} = \alpha_1 + \beta x_i + \varepsilon_{i1}, \qquad y_{i2} = \alpha_2 + \gamma y_{i1} + \delta x_i + \varepsilon_{i2},$$
corresponding to the graph with arrows $x_i \rightarrow y_{i1}$, $x_i \rightarrow y_{i2}$, and
$y_{i1} \rightarrow y_{i2}$, has conditional expectations $E(y_{i1} \mid x_i) = \alpha_1 + \beta x_i$ but
$$E(y_{i2} \mid x_i) = \alpha_2 + \gamma(\alpha_1 + \beta x_i) + \delta x_i = (\alpha_2 + \gamma\alpha_1) + (\delta + \beta\gamma)x_i$$
if (25.10) holds. The stability assumption requires
$$\alpha_1 = \alpha_2 + \gamma\alpha_1, \qquad \beta = \beta\gamma + \delta,$$
which are strange restrictions on the model parameters.


With time-varying covariates, this issue becomes even more subtle because Assumption
(25.4) is unlikely to hold in the first place.

25.6.3 Explanation and prediction


Liang and Zeger (1986)'s marginal model is more useful if the goal is to explain the relationship
between x and y, in particular when a component of x_{it} represents a time-invariant
treatment and y_{it} represents the time-varying outcomes. If the goal is prediction, then the
marginal model can be problematic. For instance, if we observe the covariate value for a

future observation $x_{is}$, the marginal model gives the predicted outcome $\mu(x_{is}^t\hat\beta)$ with the
associated standard error computed based on the delta method. We can see two obvious problems
with this prediction. First, it does not depend on $s$: predicting the outcome at $s = 10$ is the
same as predicting it at $s = 100$. However, intuition strongly suggests that predicting the
long-run outcome is much harder than predicting the short-run outcome, so we would hope the
standard error to be much larger at $s = 100$. Second, the prediction does not depend on the
lagged outcomes because the marginal model ignores the dynamics of the outcome. With
longitudinal observations, building a model with lagged outcomes may improve predictive ability.

25.7 Homework problems


25.1 Sandwich asymptotic covariance matrix for GEE
Verify the formulas of B and M in Section 25.3.2.

25.2 Cluster-robust standard error in OLS with a cluster-specific binary regressor


Consider a special case with $x_{it} = (1, x_i)^t$ and $x_i \in \{0,1\}$ for $i = 1,\ldots,n$, and view "1" as
treatment and "0" as control. Show that the coefficient of $x_i$ in the pooled OLS fit of $y_{it}$
on $x_{it}$ equals $\hat\tau = \bar y_1 - \bar y_0$, where
$$\bar y_1 = \sum_{i=1}^n\sum_{t=1}^{n_i} x_i y_{it}/N_1, \qquad \bar y_0 = \sum_{i=1}^n\sum_{t=1}^{n_i}(1-x_i)y_{it}/N_0,$$
with $N_1 = \sum_{i=1}^n n_i x_i$ and $N_0 = \sum_{i=1}^n n_i(1-x_i)$ denoting the total numbers of observations
under treatment and control, respectively. Further show that the cluster-robust standard
error of $\hat\tau$ equals the square root of
$$\frac{\sum_{i=1}^n x_i R_i^2}{N_1^2} + \frac{\sum_{i=1}^n (1-x_i)R_i^2}{N_0^2},$$
where
$$R_i = \begin{cases}\sum_{t=1}^{n_i}(y_{it}-\bar y_1), & \text{if } x_i = 1,\\ \sum_{t=1}^{n_i}(y_{it}-\bar y_0), & \text{if } x_i = 0.\end{cases}$$

25.3 Cluster-robust standard error in GLM with a cluster-specific binary regressor


Inherit the setting from Problem 25.2.
With a binary outcome $y_{it}$, show that the coefficient of $x_i$ in the pooled logit regression
of $y_{it}$ on $x_{it}$ equals $\hat\tau = \mathrm{logit}\,\bar y_1 - \mathrm{logit}\,\bar y_0$. Further show that the cluster-robust standard
error of $\hat\tau$ equals the square root of
$$\frac{\sum_{i=1}^n x_i R_i^2}{\{N_1\bar y_1(1-\bar y_1)\}^2} + \frac{\sum_{i=1}^n (1-x_i)R_i^2}{\{N_0\bar y_0(1-\bar y_0)\}^2}.$$
With a count outcome $y_{it}$, show that the coefficient of $x_i$ in the pooled Poisson regression
of $y_{it}$ on $x_{it}$ equals $\hat\tau = \log\bar y_1 - \log\bar y_0$. Further show that the cluster-robust standard error
of $\hat\tau$ equals the square root of
$$\frac{\sum_{i=1}^n x_i R_i^2}{(N_1\bar y_1)^2} + \frac{\sum_{i=1}^n (1-x_i)R_i^2}{(N_0\bar y_0)^2}.$$

25.4 Cluster-robust standard error in ANOVA


This problem extends Problems 5.5, 6.3 and 19.6.
Inherit the setting from Problem 19.6. If the units are clustered by a factor ci ∈
{1, . . . , M } for i = 1, . . . , n, we can obtain the cluster-robust covariances V̂lz and V̂lz′ from
the two WLS fits. Show that V̂lz = V̂lz′ .

25.5 Cluster-robust standard error for Poisson regression


Similar to Sections 25.4.1 and 25.4.2, derive the cluster-robust covariance matrix for Poisson
regression:
$$\widehat{\mathrm{cov}}(\hat\beta) = \left(\sum_{i=1}^n X_i^t\hat V_i X_i\right)^{-1}\left(\sum_{i=1}^n X_i^t\hat\varepsilon_i\hat\varepsilon_i^t X_i\right)\left(\sum_{i=1}^n X_i^t\hat V_i X_i\right)^{-1},$$
where $\hat\varepsilon_i = Y_i - \mu(X_i,\hat\beta)$ and $\hat V_i = \mathrm{diag}\{e^{x_{it}^t\hat\beta}\}_{t=1}^{n_i}$.

25.6 Data analysis


Re-analyze the data from Royer et al. (2015) using the exchangeable working covariance
matrix. Compare the corresponding results with Figure 25.1.
Part VIII

Beyond Modeling the Conditional Mean

26
Quantile Regression

26.1 From the mean to the quantile


For a random variable $y$, we can define its mean as
$$E(y) = \arg\min_{\mu\in\mathbb{R}} E\{(y-\mu)^2\}.$$
With IID data $(y_i)_{i=1}^n$, we can compute the sample mean
$$\bar y = n^{-1}\sum_{i=1}^n y_i = \arg\min_{\mu\in\mathbb{R}} n^{-1}\sum_{i=1}^n (y_i-\mu)^2,$$
which satisfies the CLT:
$$\sqrt{n}\{\bar y - E(y)\} \rightarrow N(0, \mathrm{var}(y))$$
in distribution if the variance $\mathrm{var}(y)$ is finite.

FIGURE 26.1: CDF and quantile ($F^{-1}(\tau)$ is the point where the CDF first reaches level $\tau$).

However, the mean can miss important information about y. How about other features
of the outcome y? Quantiles can characterize the distribution of y. For a random variable
y, we can define its distribution function as F (c) = pr(y ≤ c) and its τ th quantile as

F −1 (τ ) = inf {q : F (q) ≥ τ } .


FIGURE 26.2: Check function $\rho_\tau(u)$ for $\tau = 1/3, 1/2, 2/3$.

This defines a quantile function F −1 : [0, 1] → R. If the distribution function is strictly


monotone, then the quantile function reduces to the inverse of the distribution function,
and the τ -th quantile solves τ = pr(y ≤ q) as an equation of q. See Figure 26.1. For
simplicity, this chapter focuses on the case with a monotone distribution function. The
definition above formulates the mean as the minimizer of an objective function. Similarly,
we can define quantiles in an equivalent way below.
Proposition 26.1 With a monotone distribution function and positive density at the $\tau$th
quantile, we have
$$F^{-1}(\tau) = \arg\min_{q\in\mathbb{R}} E\{\rho_\tau(y-q)\},$$
where
$$\rho_\tau(u) = u\{\tau - 1(u<0)\} = \begin{cases} u\tau, & \text{if } u\ge 0,\\ -u(1-\tau), & \text{if } u<0, \end{cases}$$
is the check function (the name comes from its shape; see Figure 26.2). In particular, the
median of $y$ is
$$\mathrm{median}(y) = F^{-1}(0.5) = \arg\min_{q\in\mathbb{R}} E\{|y-q|\}.$$

Proof of Proposition 26.1: To simplify the proof, we further assume that $y$ has density
function $f(\cdot)$. We will use Leibniz's integral rule:
$$\frac{d}{dx}\left\{\int_{a(x)}^{b(x)} f(x,t)\,dt\right\} = f(x,b(x))b'(x) - f(x,a(x))a'(x) + \int_{a(x)}^{b(x)} \frac{\partial f(x,t)}{\partial x}\,dt.$$
We can write
$$E\{\rho_\tau(y-q)\} = \int_{-\infty}^{q} (\tau-1)(c-q)f(c)\,dc + \int_{q}^{\infty} \tau(c-q)f(c)\,dc.$$
To minimize it over $q$, we can solve the first-order condition
$$\frac{\partial E\{\rho_\tau(y-q)\}}{\partial q} = (1-\tau)\int_{-\infty}^{q} f(c)\,dc - \tau\int_{q}^{\infty} f(c)\,dc = 0.$$
So
$$(1-\tau)\mathrm{pr}(y\le q) - \tau\{1-\mathrm{pr}(y\le q)\} = 0,$$
which implies that
$$\tau = \mathrm{pr}(y\le q),$$
so the $\tau$th quantile satisfies the first-order condition. The second-order condition ensures it
is the minimizer:
$$\left.\frac{\partial^2 E\{\rho_\tau(y-q)\}}{\partial q^2}\right|_{q=F^{-1}(\tau)} = f\{F^{-1}(\tau)\} > 0$$
by Leibniz's integral rule again. □


The empirical distribution function is F̂ (c) = n−1 i=1 1(yi ≤ c), which is a step func-
tion, increasing but not strictly monotone. With Proposition 26.1, we can easily define the
sample quantile as
Xn
−1 −1
F̂ (τ ) = arg min n ρτ (yi − q),
q∈R
i=1

which may not be unique even though the population quantile is. We can view F̂ −1 (τ ) as
a set containing all minimizers, and with large samples the values in the set do not differ
much. Similar to the sample mean, the sample quantile also satisfies a CLT.
Theorem 26.1 Assume $(y_i)_{i=1}^n \overset{\mathrm{iid}}{\sim} y$ with distribution function $F(\cdot)$ that is strictly increasing
and density function $f(\cdot)$ that is positive at the $\tau$th quantile. The sample quantile is
consistent for the true quantile and is asymptotically Normal:
$$\sqrt{n}\left\{\hat F^{-1}(\tau) - F^{-1}(\tau)\right\} \rightarrow N\left(0, \frac{\tau(1-\tau)}{[f\{F^{-1}(\tau)\}]^2}\right)$$
in distribution. In particular, the sample median satisfies
$$\sqrt{n}\left\{\hat F^{-1}(0.5) - \mathrm{median}(y)\right\} \rightarrow N\left(0, \frac{1}{4[f\{\mathrm{median}(y)\}]^2}\right)$$
in distribution.

Proof of Theorem 26.1: Based on the first-order condition in Proposition 26.1, the
population quantile solves
$$E\{m_\tau(y-q)\} = 0,$$
and the sample quantile solves
$$n^{-1}\sum_{i=1}^n m_\tau(y_i - q) = 0,$$
where the check function has a partial derivative with respect to $u$ except at the point 0:
$$m_\tau(u) = (\tau-1)1(u<0) + \tau 1(u>0) = \tau - 1(u<0).$$
By Theorem D.1, we only need to find the bread and meat matrices, which are scalars now:
$$B = \left.\frac{\partial E\{m_\tau(y-q)\}}{\partial q}\right|_{q=F^{-1}(\tau)}
= \left.\frac{\partial E\{\tau - 1(y\le q)\}}{\partial q}\right|_{q=F^{-1}(\tau)}
= -\left.\frac{\partial F(q)}{\partial q}\right|_{q=F^{-1}(\tau)}
= -f\{F^{-1}(\tau)\},$$
and
$$M = E\left[\{m_\tau(y-q)\}^2\right]\Big|_{q=F^{-1}(\tau)}
= E\left[\{\tau - 1(y\le q)\}^2\right]\Big|_{q=F^{-1}(\tau)}
= E\left[\tau^2 + 1(y\le q) - 2\tau\,1(y\le q)\right]\Big|_{q=F^{-1}(\tau)}
= \tau^2 + \tau - 2\tau^2
= \tau(1-\tau).$$
Therefore, $\sqrt{n}\{\hat F^{-1}(\tau) - F^{-1}(\tau)\}$ converges in distribution to a Normal with mean zero and variance
$M/B^2 = \tau(1-\tau)/[f\{F^{-1}(\tau)\}]^2$. □

To conduct statistical inference for the quantile F −1 (τ ), we need to estimate the density
of y at the τ th quantile to obtain the estimated standard error of F̂ −1 (τ ). Alternatively, we
can use the bootstrap to obtain the estimated standard error. We will discuss the inference
of quantiles in R in Section 26.4.
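Before turning to R, the following minimal sketch (mine, not from the text) illustrates the plug-in standard error implied by Theorem 26.1, estimating the density at the quantile with a kernel density estimate; the simulated Normal data and the use of density() are my own illustrative choices.

# plug-in asymptotic standard error of a sample quantile (sketch)
set.seed(3)
n = 500; tau = 0.25
y = rnorm(n)
q.hat = quantile(y, tau)
d = density(y)
f.hat = approx(d$x, d$y, xout = q.hat)$y   # density estimate at the sample quantile
se.q = sqrt(tau * (1 - tau) / n) / f.hat   # sqrt{tau(1-tau)/n} / f.hat
c(quantile = q.hat, se = se.q)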

26.2 From the conditional mean to conditional quantile


With an explanatory variable $x$ for outcome $y$, we can define the conditional mean as
$$E(y\mid x) = \arg\min_{m(\cdot)} E\left[\{y - m(x)\}^2\right].$$
We can use a linear function $x^t\beta$ to approximate the conditional mean with the population
OLS coefficient
$$\beta = \arg\min_{b\in\mathbb{R}^p} E\{(y - x^t b)^2\} = \{E(xx^t)\}^{-1}E(xy),$$
and the sample OLS coefficient
$$\hat\beta = \left(n^{-1}\sum_{i=1}^n x_i x_i^t\right)^{-1}\left(n^{-1}\sum_{i=1}^n x_i y_i\right).$$

We have discussed the statistical properties of $\hat\beta$ in previous chapters. Motivated by
Proposition 26.1, we can define the conditional quantile function as
$$F^{-1}(\tau\mid x) = \arg\min_{q(\cdot)} E\left[\rho_\tau\{y - q(x)\}\right].$$
We can use a linear function $x^t\beta(\tau)$ to approximate the conditional quantile function with
$$\beta(\tau) = \arg\min_{b\in\mathbb{R}^p} E\{\rho_\tau(y - x^t b)\}$$
called the $\tau$th population regression quantile, and
$$\hat\beta(\tau) = \arg\min_{b\in\mathbb{R}^p} n^{-1}\sum_{i=1}^n \rho_\tau(y_i - x_i^t b) \tag{26.1}$$

called the $\tau$th sample regression quantile. As a special case, when $\tau = 0.5$, we have the
regression median:
$$\hat\beta(0.5) = \arg\min_{b\in\mathbb{R}^p} n^{-1}\sum_{i=1}^n |y_i - x_i^t b|,$$
which is also called the least absolute deviations (LAD).


Koenker and Bassett Jr (1978) started the literature under a correctly specified conditional
quantile model:
$$F^{-1}(\tau\mid x) = x^t\beta(\tau).$$
The interpretation of the j-th coordinate of the coefficient, βj (τ ), is the partial influence of
xij on the τ th conditional quantile of yi given xi . Angrist et al. (2006) discussed quantile
regression under misspecification, viewing it as the best linear approximation to the true
conditional quantile function. This chapter will focus on the statistical properties of the
sample regression quantiles following Angrist et al. (2006)’s discussion of statistical inference
allowing for the misspecification of the quantile regression model.
Before that, we first comment on the population regression quantiles based on some
generative models. Below assume that the vi ’s are IID independent of the covariates xi ’s,
with mean zero and distribution g(c) = pr(vi ≤ c).

Example 26.1 Under the linear model $y_i = x_i^t\beta + \sigma v_i$, we can verify that
$$E(y_i\mid x_i) = x_i^t\beta$$
and
$$F^{-1}(\tau\mid x_i) = x_i^t\beta + \sigma g^{-1}(\tau).$$
Therefore, with the first regressor being 1, we have
$$\beta_1(\tau) = \beta_1 + \sigma g^{-1}(\tau), \qquad \beta_j(\tau) = \beta_j \quad (j = 2,\ldots,p).$$
In this case, both the true conditional mean and quantile functions are linear, and the
population regression quantiles are constant across $\tau$ except for the intercept.

Example 26.2 Under a heteroskedastic linear model $y_i = x_i^t\beta + (x_i^t\gamma)v_i$ with $x_i^t\gamma > 0$ for
all $x_i$'s, we can verify that
$$E(y_i\mid x_i) = x_i^t\beta$$
and
$$F^{-1}(\tau\mid x_i) = x_i^t\beta + x_i^t\gamma g^{-1}(\tau).$$
Therefore,
$$\beta(\tau) = \beta + \gamma g^{-1}(\tau).$$
In this case, both the true conditional mean and quantile functions are linear, and all coordinates of
the population regression quantiles vary with $\tau$.

Example 26.3 Under the transformed linear model $\log y_i = x_i^t\beta + \sigma v_i$, we can verify that
$$E(y_i\mid x_i) = \exp(x_i^t\beta)M_v(\sigma),$$
where $M_v(t) = E(e^{tv})$ is the moment generating function of $v$, and
$$F^{-1}(\tau\mid x_i) = \exp\{x_i^t\beta + \sigma g^{-1}(\tau)\}.$$
In this case, both the true conditional mean and quantile functions are log-linear in covariates.

26.3 Sample regression quantiles


26.3.1 Computation
The regression quantiles (26.1) do not have explicit formulas in general, and we need to
solve the optimization problem numerically. Motivated by the piece-wise linear feature of
the check function, we decompose yi − xti β into the difference between its positive part and
negative part:
yi − xti β = ui − vi ,
where
ui = max(yi − xti β, 0), vi = − min(yi − xti β, 0).
So the objective function simplifies to the summation of

ρτ (yi − xti β) = τ ui + (1 − τ )vi ,

which is simply a linear function of the ui ’s and vi ’s. Of course, these ui ’s and vi ’s are not
arbitrary because they must satisfy the constraints imposed by the data. Using the notation
$$Y = \begin{pmatrix} y_1\\ \vdots\\ y_n\end{pmatrix},\quad X = \begin{pmatrix} x_1^t\\ \vdots\\ x_n^t\end{pmatrix},\quad u = \begin{pmatrix} u_1\\ \vdots\\ u_n\end{pmatrix},\quad v = \begin{pmatrix} v_1\\ \vdots\\ v_n\end{pmatrix},$$
finding the $\tau$th regression quantile is equivalent to a linear programming problem with a linear
objective function and linear constraints:
$$\min_{b,u,v}\ \tau 1_n^t u + (1-\tau)1_n^t v, \quad \text{s.t. } Y = Xb + u - v, \quad u_i\ge 0,\ v_i\ge 0\ (i=1,\ldots,n).$$

The function rq in the R package quantreg computes the regression quantiles with various
choices of methods.
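To make the linear programming formulation concrete, here is a small sketch (mine, not from the text) that solves the LP directly with the lpSolve package and compares the solution with rq; the simulated data are arbitrary, and the free coefficient b is split into nonnegative parts so all decision variables are nonnegative. Up to ties and numerical tolerance, the two solutions should agree.

library(quantreg)
library(lpSolve)
set.seed(1)
n = 100; p = 2; tau = 0.25
X = cbind(1, rnorm(n))
Y = as.vector(X %*% c(1, 2) + rnorm(n))
# decision variables: (b_plus, b_minus, u, v), all nonnegative
obj = c(rep(0, 2 * p), rep(tau, n), rep(1 - tau, n))
A = cbind(X, -X, diag(n), -diag(n))          # X(b+ - b-) + u - v = Y
lp.fit = lp("min", obj, A, rep("=", n), Y)
b.lp = lp.fit$solution[1:p] - lp.fit$solution[(p + 1):(2 * p)]
rbind(lp = b.lp, rq = coef(rq(Y ~ X - 1, tau = tau)))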

26.3.2 Asymptotic inference


Similar to the sample quantiles, the regression quantiles are also consistent for the pop-
ulation regression quantiles and asymptotically Normal. So we can conduct asymptotic
inference based on the results in the following theorem (Angrist et al., 2006).
Theorem 26.2 Assume $(y_i, x_i)_{i=1}^n \overset{\mathrm{iid}}{\sim} (y, x)$. Under some regularity conditions, we have
$$\sqrt{n}\left\{\hat\beta(\tau) - \beta(\tau)\right\} \rightarrow N(0, B^{-1}MB^{-1})$$
in distribution, where
$$B = E\left[f_{y\mid x}\{x^t\beta(\tau)\}xx^t\right], \qquad M = E\left[\{\tau - 1(y - x^t\beta(\tau)\le 0)\}^2 xx^t\right],$$
with $f_{y\mid x}(\cdot)$ denoting the conditional density of $y$ given $x$.



Proof of Theorem 26.2: The population regression quantile solves
$$E\{m_\tau(y - x^t b)x\} = 0,$$
and the sample regression quantile solves
$$n^{-1}\sum_{i=1}^n m_\tau(y_i - x_i^t b)x_i = 0.$$
By Theorem D.1, we only need to calculate the explicit forms of $B$ and $M$. Let $F_{y\mid x}(\cdot)$ and
$f_{y\mid x}(\cdot)$ be the conditional distribution and density functions. We have
$$E\{m_\tau(y - x^t b)x\} = E\left[\{\tau - 1(y - x^t b\le 0)\}x\right] = E\left[\{\tau - F_{y\mid x}(x^t b)\}x\right],$$
so
$$\frac{\partial E\{m_\tau(y - x^t b)x\}}{\partial b^t} = -E\left\{f_{y\mid x}(x^t b)xx^t\right\}.$$
This implies the formula of $B$. The formula of $M$ follows from
$$M = E\left\{m_\tau^2(y - x^t\beta(\tau))xx^t\right\} = E\left[\{\tau - 1(y - x^t\beta(\tau)\le 0)\}^2 xx^t\right]. \qquad \square$$
Based on Theorem 26.2, we can estimate the asymptotic covariance matrix of $\hat\beta(\tau)$ by
$n^{-1}\hat B^{-1}\hat M\hat B^{-1}$, where
$$\hat M = n^{-1}\sum_{i=1}^n \left\{\tau - 1\left(y_i - x_i^t\hat\beta(\tau)\le 0\right)\right\}^2 x_i x_i^t$$
and
$$\hat B = (2nh)^{-1}\sum_{i=1}^n 1\left\{|y_i - x_i^t\hat\beta(\tau)|\le h\right\} x_i x_i^t$$
for a carefully chosen $h$. Powell (1991)'s theory suggests using an $h$ satisfying $h = O(n^{-1/3})$,
but the theory is not so helpful since it only suggests the order of $h$. The quantreg
package in R chooses a specific $h$ that satisfies this condition. In finite samples, the bootstrap
often gives a better estimate of the asymptotic covariance matrix.
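As an illustration, the following sketch (mine, not from the text) computes the sandwich estimator above by hand on simulated data and compares the resulting standard errors with summary(rq, se = "ker"); the bandwidth h = n^(-1/3) is an arbitrary choice of the required order, not the specific rule used by quantreg, so the two sets of standard errors should be close but not identical.

library(quantreg)
set.seed(2)
n = 1000; tau = 0.5
x = cbind(1, rnorm(n), rexp(n))
y = as.vector(x %*% c(1, 1, -1) + rnorm(n))
fit = rq(y ~ x - 1, tau = tau)
res = y - as.vector(x %*% coef(fit))
# meat: n^{-1} sum {tau - 1(res <= 0)}^2 x_i x_i^t
M.hat = crossprod(x * (tau - (res <= 0))) / n
# bread: (2nh)^{-1} sum 1(|res| <= h) x_i x_i^t, with an illustrative h = n^{-1/3}
h = n^(-1/3)
B.hat = crossprod(x * sqrt(abs(res) <= h)) / (2 * n * h)
V.hat = solve(B.hat) %*% M.hat %*% solve(B.hat) / n
rbind(manual = sqrt(diag(V.hat)),
      ker    = summary(fit, se = "ker")$coef[, 2])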

26.4 Numerical examples


26.4.1 Sample quantiles
We can use the quantile function to obtain the sample quantiles. However, it does not
report standard errors. Instead, we can use the rq function to compute sample quantiles
by regressing the outcome on constant 1. These two functions may return different sample
quantiles when they are not unique. The difference is often small with large sample sizes.
I use the following simulation to compare various methods for standard error estimation.
The first data-generating process has a standard Normal outcome.

FIGURE 26.3: Standard errors for sample quantiles (Exponential(1) and Normal(0,1) outcomes; quantiles 0.1 to 0.9; methods: boot, iid, ker, and the true asymptotic standard error).

library(quantreg)
mc = 2000
n = 200
taus = (1:9)/10
get.se = function(x){ x$coef[1, 2] }
q.normal = replicate(mc, {
  y = rnorm(n)
  qy = rq(y ~ 1, tau = taus)
  se.iid = summary(qy, se = "iid")
  se.ker = summary(qy, se = "ker")
  se.boot = summary(qy, se = "boot")

  qy = qy$coef
  se.iid = sapply(se.iid, get.se)
  se.ker = sapply(se.ker, get.se)
  se.boot = sapply(se.boot, get.se)

  c(qy, se.iid, se.ker, se.boot)
})

In the above, se = "iid", se = "ker", and se = "boot" correspond to the standard errors
by Koenker and Bassett Jr (1978), Powell (1991), and the bootstrap. I also run the same
simulation but replace the Normal outcome with Exponential: y = rexp(n). Figure 26.3 com-
pares the estimated standard errors with the true asymptotic standard error in Theorem
26.1. Bootstrap works the best, and the one involving kernel estimation of the density seems
biased.

26.4.2 OLS versus LAD


I will use simulation to compare OLS and LAD. In rq, the default value is tau=0.5, fitting
the LAD. The first data-generating process is a Normal linear model:
x = rnorm ( n )
simu . normal = replicate ( mc , {
y = 1 + x + rnorm ( n )
c ( lm ( y ~ x )$ coef [ 2 ] , rq ( y ~ x )$ coef [ 2 ])
})

The second data-generating process replaces the error term with a Laplace distribution1:
simu . laplace = replicate ( mc , {
y = 1 + x + rexp ( n ) - rexp ( n )
c ( lm ( y ~ x )$ coef [ 2 ] , rq ( y ~ x )$ coef [ 2 ])
})

OLS is the MLE under a Normal linear model, and LAD is the MLE under a linear
model with independent Laplace errors.
The third data-generating process replaces the error term with standard Exponential:
simu . exp = replicate ( mc , {
y = 1 + x + rexp ( n )
c ( lm ( y ~ x )$ coef [ 2 ] , rq ( y ~ x )$ coef [ 2 ])
})

The fourth data-generating process has $y_i = 1 + e_i x_i$ with $e_i$ IID Exponential(1), so
$$E(y_i\mid x_i) = 1 + x_i, \qquad \mathrm{var}(y_i\mid x_i) = x_i^2,$$
which is a heteroskedastic linear model, and
$$F^{-1}(0.5\mid x_i) = 1 + \mathrm{median}(e_i)x_i = 1 + (\log 2)x_i,$$
which is a linear quantile model. The coefficients are different in the conditional mean and
quantile functions.
x = abs ( x )
simu . x = replicate ( mc , {
y = 1 + rexp ( n )* x
c ( lm ( y ~ x )$ coef [ 2 ] , rq ( y ~ x )$ coef [ 2 ])
})

Figure 26.4 compares OLS and LAD under the above four data-generating processes.
With Normal errors, OLS is more efficient; with Laplace errors, LAD is more efficient.
This confirms the theory of MLE. With Exponential errors, LAD is also more efficient than
OLS. Under the fourth data-generating process, LAD is more efficient than OLS. In general,
however, OLS and LAD target the conditional mean and conditional median, respectively.
Since the parameters differ in general, the comparison of the standard errors is not very
meaningful. Both OLS and LAD give useful information about the data.

26.5 Application
26.5.1 Parents’ and children’s heights
I revisit Galton’s data introduced in Chapter 2. The following code gives the coefficients for
quantiles 0.1 to 0.9.
> library ( " HistData " )
> taus = ( 1 : 9 )/ 1 0
> qr . galton = rq ( childHeight ~ midparentHeight ,
+ tau = taus ,
+ data = GaltonFa milies )
> coef . galton = qr . galton $ coef

1 Note that the difference between two independent Exponentials has the same distribution as Laplace.

See Proposition B.8.



FIGURE 26.4: Regression quantiles (histograms of $\hat\beta$ from LAD and OLS under the four data-generating processes; simulation standard errors, LAD/OLS: Exponential 0.07/0.07, Exponential X 0.11/0.19, Laplace 0.08/0.1, Normal 0.09/0.07).

Figure 26.5 shows the quantile regression lines, which are almost parallel with different
intercepts. In Galton’s data, x and y are very close to a bivariate Normal distribution.
Theoretically, we can verify that with bivariate Normal (x, y), the conditional quantile
function F −1 (τ | x) is linear in x with the same slope. See Problem 26.2.

26.5.2 U.S. wage structure


Angrist et al. (2006) used quantile regression to study the U.S. wage structure. They used
census data from 1980, 1990, and 2000 to fit quantile regressions of log weekly wage on
education and other variables. The following code gives the coefficients for quantile regressions
with τ equaling 0.1 to 0.9. I repeated the regressions with data from three years. Due to
the large sample size, I use the m-of-n bootstrap with m = 500.
> library(foreign)
> census80 = read.dta("census80.dta")

FIGURE 26.5: Galton's data (childHeight versus midparentHeight with quantile regression lines for $\tau = 0.1, \ldots, 0.9$).

> census90 = read.dta("census90.dta")
> census00 = read.dta("census00.dta")
> f.reg = logwk ~ educ + exper + exper2 + black
>
> m.boot = 500
> rq80 = rq(f.reg, data = census80, tau = taus)
> rqlist80 = summary(rq80, se = "boot",
+                    bsmethod = "xy", mofn = m.boot)
> rq90 = rq(f.reg, data = census90, tau = taus)
> rqlist90 = summary(rq90, se = "boot",
+                    bsmethod = "xy", mofn = m.boot)
> rq00 = rq(f.reg, data = census00, tau = taus)
> rqlist00 = summary(rq00, se = "boot",
+                    bsmethod = "xy", mofn = m.boot)

Figure 26.6 shows the coefficient of educ across years and across quantiles. In 1980, the
coefficients are nearly constant across quantiles, showing no evidence of heterogeneity in the
return of education. Compared with 1980, the return of education in 1990 increases across
all quantiles, but it increases more at the upper quantiles. Compared with 1990, the return
of education in 2000 decreases at the lower quantiles and increases at the upper quantiles,
showing more dramatic heterogeneity across quantiles.
The original data used by Angrist et al. (2006) contain weights due to sampling. Ideally,
we should use the weights in the quantile regression. Like lm, the rq function also allows for
specifying weights.
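As a sketch of how the weights would enter (mine, not run on the census files here; the column name perwt is taken from Problem 26.6 and is otherwise an assumption):

# hypothetical weighted fit, identical to the unweighted call except for weights
rq80w = rq(f.reg, data = census80, tau = taus, weights = perwt)
summary(rq80w, se = "boot", bsmethod = "xy", mofn = m.boot)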
The R code in this section is in code24.5.R.

FIGURE 26.6: Angrist et al. (2006)'s data (coefficient of educ across quantiles $\tau$ for the 1980, 1990, and 2000 censuses).

26.6 Extensions
With clustered data, we must use the cluster-robust standard error which can be approxi-
mated by the clustered bootstrap with the rq function. I use Hagemann (2017)’s example
below where the students are clustered in classrooms. See code24.6.R.
> star = read . csv ( " star . csv " )
> star . rq = rq ( pscore ~ small + regaide + black +
+ girl + poor + tblack + texp +
+ tmasters + factor ( fe ) ,
+ data = star )
> res = summary ( star . rq , se = " boot " )$ coef [ 2 : 9 , ]
> res . clus = summary ( star . rq , se = " boot " ,
+ cluster = star $ classid )$ coef [ 2 : 9 , ]
> round ( res , 3 )
           Value Std. Error t value Pr(>|t|)
small      6.500      1.122   5.795    0.000
regaide    0.294      1.071   0.274    0.784
black    -10.334      1.657  -6.237    0.000
girl       5.073      0.878   5.777    0.000
poor     -14.344      1.024 -14.011    0.000
tblack    -0.197      1.751  -0.113    0.910
texp       0.413      0.098   4.231    0.000
tmasters  -0.530      1.068  -0.497    0.619
> round(res.clus, 3)
           Value Std. Error t value Pr(>|t|)
small      6.500      1.662   3.912    0.000
regaide    0.294      1.627   0.181    0.857
black    -10.334      1.849  -5.588    0.000
girl       5.073      0.819   6.195    0.000
poor     -14.344      1.152 -12.455    0.000
tblack    -0.197      3.113  -0.063    0.949
texp       0.413      0.168   2.465    0.014
tmasters  -0.530      1.444  -0.367    0.713

With high dimensional covariates, we can use regularized quantile regression. For in-
stance, the rq function can implement the lasso version with method = "lasso" and a prespec-
ified lambda. It does not implement the ridge version.
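For instance, a minimal sketch (mine) of the lasso version, assuming a design matrix x, outcome y, and an arbitrary illustrative penalty:

# l1-penalized quantile regression; lambda = 1 is a placeholder value
fit.lasso = rq(y ~ x, tau = 0.5, method = "lasso", lambda = 1)
coef(fit.lasso)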

26.7 Homework problems


26.1 Quantile regression with a binary regressor
For $i = 1,\ldots,n$, the first $1/3$ of the observations have $x_i = 1$ and the last $2/3$ have
$x_i = 0$; $y_i\mid x_i = 1$ follows an Exponential(1), and $y_i\mid x_i = 0$ follows an Exponential(2).
Find
$$(\hat\alpha,\hat\beta) = \arg\min_{(a,b)}\sum_{i=1}^n\rho_{1/2}(y_i - a - bx_i)$$
and the joint asymptotic distribution.

26.2 Conditional quantile function in bivariate Normal


Show that if (x, y) follows a bivariate Normal, the conditional quantile function of y given
x is linear in x with the same slope across all τ .

26.3 Quantile range and variance


A symmetric random variable y satisfies y ∼ −y. Define the 1 − α quantile range of a
symmetric random variable y as the interval of its α/2 and 1 − α/2 quantiles. Given two
symmetric random variables y1 and y2 , show that if the 1 − α quantile range of y1 is wider
than that of y2 for all α, then var(y1 ) ≥ var(y2 ). Does the converse of the statement hold?
If so, give a proof; if not, give a counterexample.

26.4 Interquartile range and estimation


The interquartile range of a random variable y equals the difference between its 75% and
25% quantiles. Based on IID data (yi )ni=1 , write a function to estimate the interquartile
range and the corresponding standard error using the bootstrap. Use simulated data to
evaluate the finite sample properties of the point estimate (e.g., bias and variance) and the
95% confidence interval (e.g. coverage rate and length).
Find the asymptotic distribution of the estimator for the interquartile range.

26.5 Joint asymptotic distribution of the sample median and the mean
Assume that y1 , . . . , yn ∼ y are IID. Find the joint asymptotic distribution of the sample
mean ȳ and median m̂.
Hint: The mean $\mu$ and median $m$ satisfy the estimating equation with
$$w(y,\mu,m) = \begin{pmatrix} y-\mu\\ 0.5 - 1(y-m\le 0)\end{pmatrix}.$$

26.6 Weighted quantile regression and application


Many real data contain weights due to sampling. For example, in Angrist et al. (2006)’s data,
perwt is the sampling weight. Define the weighted quantile regression problem theoretically

and re-analyze Angrist et al. (2006)’s data with weights. Note that similar to lm and glm,
the quantile regression function rq also has a parameter weights.
27
Modeling Time-to-Event Outcomes

27.1 Examples
Time-to-event data are common in biomedical and social sciences. Statistical analysis of
time-to-event data is called survival analysis in biostatistics and duration analysis in econo-
metrics. The former name comes from biomedical applications where the outcome denotes
the survival time or the time to the recurrence of the disease of interest. The latter name
comes from the economic applications where the outcome denotes the weeks unemployed
or days until the next arrest after being released from incarceration. See Kalbfleisch and
Prentice (2011) for biomedical applications and Heckman and Singer (1984) for economic
applications. Freedman (2008) gave a concise and critical introduction to survival analysis.

27.1.1 Survival analysis


The Combined Pharmacotherapies and Behavioral Interventions study evaluated the effi-
cacy of medication, behavioral therapy, and their combination for the treatment of alcohol
dependence (Anton et al., 2006). Between January 2001 and January 2004, n = 1224 re-
cently alcohol-abstinent volunteers were randomized to receive medical management with
16 weeks of naltrexone (100mg daily) or placebo, with or without a combined behavioral
intervention. It was a 2 × 2 factorial experiment. The outcome of interest is the time to the
first day of heavy drinking and other endpoints. I adopt the data from Lin et al. (2016).
> COMBINE = read.table("combine_data.txt", header = TRUE)[, -1]
> head(COMBINE)
  AGE GENDER    T0_PDA NALTREXONE THERAPY   site relapse futime
1  31   male  3.333333          1       0 site_0       0    112
2  41 female 16.666667          1       1 site_0       1      8
3  44   male 73.333333          0       1 site_0       1     20
4  65   male 10.000000          1       0 site_0       0    112
5  39   male  0.000000          0       1 site_0       1      4
6  56   male 13.333333          0       0 site_0       1      1

NALTREXONE and THERAPY are two treatment indicators. futime is the follow-up time, which
is censored if relapse equals 0. For those censored observations, futime equals 112, so it
is administrative censoring. Figure 27.1 shows the histograms of futime in four treatment
groups. A large number of patients have censored outcomes. Other variables are covariates.

27.1.2 Duration analysis


Carpenter (2002) asked the question: Why does the U.S. Food and Drug Administration
approve some drugs more quickly than others? With data about 450 drugs reviewed from
1977 to 2000, he studied the dependence of review times on various covariates, including
political influence, wealth of the richest organization representing the disease, media cover-


FIGURE 27.1: Histograms of the time to event in the data from Lin et al. (2016) (futime by the four NALTREXONE × THERAPY groups).

age, etc. I use the version of data analyzed by Keele (2010). The outcome acttime is censored
indicated by censor. The original paper contains more detailed explanations of the variables.
> fda <- read.dta("fda.dta")
> names(fda)
 [1] "acttime"  "censor"   "hcomm"    "hfloor"   "scomm"
 [6] "sfloor"   "prespart" "demhsmaj" "demsnmaj" "orderent"
[11] "stafcder" "prevgenx" "lethal"   "deathrt1" "hosp01"
[16] "hospdisc" "hhosleng" "acutediz" "orphdum"  "mandiz01"
[21] "femdiz01" "peddiz01" "natreg"   "natregsq" "wpnoavg3"
[26] "vandavg3" "condavg3" "_st"      "_d"       "_t"
[31] "_t0"      "caseid"

An obvious feature of time-to-event data is that the outcome is non-negative. This can
be easily dealt with by the log transformation. However, the outcomes may be censored,
resulting in inadequate tail information. With right censoring, modeling the mean involves
extrapolation in the right tail.

27.2 Time-to-event data


Let $T\ge 0$ denote the outcome of interest. We can characterize a non-negative continuous
$T$ using its density $f(t)$, distribution function $F(t)$, survival function $S(t) = 1 - F(t) = \mathrm{pr}(T>t)$,
and hazard function
$$\lambda(t) = \lim_{\Delta t\downarrow 0} \mathrm{pr}(t\le T < t+\Delta t\mid T\ge t)/\Delta t.$$

Within a small time interval $[t, t+\Delta t)$, we have the approximation
$$\mathrm{pr}(t\le T < t+\Delta t\mid T\ge t) \approx \lambda(t)\Delta t,$$

so the hazard function denotes the death rate within a small interval conditioning on sur-
viving up to time t. Both the survival and hazard functions are commonly used to describe
a positive random variable. First, the survival function has a simple relationship with the
expectation.
Proposition 27.1 For a non-negative random variable $T$,
$$E(T) = \int_0^\infty S(t)\,dt.$$

Proposition 27.1 holds for both continuous and discrete non-negative random variables.
It states that the expectation of a nonnegative random variable equals the area under the
survival function. It does not require the existence of the density function of T .
Proof of Proposition 27.1: Fubini's theorem allows us to swap the expectation and
integral below:
$$E(T) = E\left\{\int_0^T dt\right\} = E\left\{\int_0^\infty 1(T>t)\,dt\right\} = \int_0^\infty E\{1(T>t)\}\,dt = \int_0^\infty \mathrm{pr}(T>t)\,dt = \int_0^\infty S(t)\,dt. \qquad\square$$
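As a quick numerical check of Proposition 27.1 (my own sketch, not from the text): for a Gamma(2, 2) variable the mean is 2/2 = 1, and integrating its survival function gives the same value.

# numerical check of E(T) = integral of S(t)
integrate(function(t) pgamma(t, shape = 2, rate = 2, lower.tail = FALSE), 0, Inf)$value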


Second, the survival and hazard functions can determine each other in the following way.
Proposition 27.2 For a non-negative continuous random variable $T$,
$$\lambda(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t), \qquad S(t) = \exp\left\{-\int_0^t \lambda(s)\,ds\right\}.$$

Proof of Proposition 27.2: By definition,
$$\lambda(t) = \lim_{\Delta t\downarrow 0}\frac{\mathrm{pr}(t\le T<t+\Delta t)}{\Delta t}\frac{1}{\mathrm{pr}(T\ge t)}
= \lim_{\Delta t\downarrow 0}\frac{F(t+\Delta t)-F(t)}{\Delta t}\frac{1}{S(t)}
= \frac{f(t)}{S(t)}.$$
We can further write the above equation as
$$\lambda(t) = \frac{f(t)}{S(t)} = -\frac{dS(t)/dt}{S(t)} = -\frac{d}{dt}\log S(t),$$

FIGURE 27.2: Left: Gamma(α, β = 2) hazard functions; Right: Log-Normal(µ, σ²) hazard functions.

so we can use the Newton–Leibniz formula to obtain
$$d\log S(t) = -\lambda(t)\,dt,$$
which implies
$$\log S(t) - \log S(0) = -\int_0^t \lambda(s)\,ds.$$
Because $\log S(0) = 0$, we have $\log S(t) = -\int_0^t \lambda(s)\,ds$, giving the final result. □

Example 27.1 (Exponential) The Exponential(λ) random variable T has density f (t) =
λe−λt , survival function S(t) = e−λt , and constant hazard function λ(t) = λ. An impor-
tant feature of the Exponential random variable is its memoryless property as shown in
Proposition B.6.

Example 27.2 (Gamma) The Gamma(α, β) random variable T has density f (t) =
β α tα−1 e−βt /Γ(α). When α = 1, it reduces to Exponential(β) with a constant hazard func-
tion. In general, the survival function and hazard function do not have simple forms, but
we can use dgamma and pgamma to compute them numerically. The left panel of Figure 27.2
plots the hazard functions of Gamma(α, β). When α < 1, the hazard function is decreasing;
when α > 1, the hazard function is increasing.
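For instance, the left panel of Figure 27.2 can be reproduced numerically with the following sketch (mine), using the relation λ(t) = f(t)/S(t):

# Gamma(alpha, beta) hazard on a grid, via lambda(t) = f(t)/S(t)
t = seq(0.01, 5, by = 0.01)
gamma.hazard = function(t, alpha, beta) {
  dgamma(t, shape = alpha, rate = beta) /
    pgamma(t, shape = alpha, rate = beta, lower.tail = FALSE)
}
plot(t, gamma.hazard(t, 0.5, 2), type = "l", ylim = c(0, 3),
     xlab = "t", ylab = "hazard functions")
lines(t, gamma.hazard(t, 1, 2), lty = 2)
lines(t, gamma.hazard(t, 2, 2), lty = 3)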

Example 27.3 (Log-Normal) The Log-Normal random variable T ∼Log-Normal(µ, σ 2 )


equals exponential of N(µ, σ 2 ). The right panel of Figure 27.2 plots the hazard functions
with four different parameter combinations.

Example 27.4 (Weibull) The Weibull distribution has many different parametrizations.
Here I follow the R function dweibull, which has a shape parameter a > 0 and scale parameter
b > 0. The Weibull(a, b) random variable T can be generated by

$$T = bZ^{1/a} \tag{27.1}$$
which is equivalent to
$$\log T = \log b + a^{-1}\log Z,$$

FIGURE 27.3: Discrete survival function with masses (0.1, 0.05, 0.15, 0.2, 0.3, 0.2) at (1, 2, 3, 4, 5, 6).

where $Z\sim$ Exponential(1). We can verify that $T$ has the density function
$$f(t) = \frac{a}{b}\left(\frac{t}{b}\right)^{a-1}\exp\left\{-\left(\frac{t}{b}\right)^a\right\},$$
survival function
$$S(t) = \exp\left\{-\left(\frac{t}{b}\right)^a\right\},$$
and hazard function
$$\lambda(t) = \frac{a}{b}\left(\frac{t}{b}\right)^{a-1}.$$
So when $a = 1$, the Weibull reduces to the Exponential with a constant hazard function. When $a > 1$,
the hazard function increases; when $a < 1$, the hazard function decreases.

We can characterize a positive discrete random variable $T\in\{t_1, t_2, \ldots\}$ by its probability
mass function $f(t_k) = \mathrm{pr}(T=t_k)$, distribution function $F(t) = \sum_{k: t_k\le t} f(t_k)$, survival
function $S(t) = \sum_{k: t_k>t} f(t_k)$, and discrete hazard function
$$\lambda_k = \mathrm{pr}(T=t_k\mid T\ge t_k) = \frac{f(t_k)}{S(t_k-)},$$
where $S(t_k-)$ denotes the left limit of the function $S(t)$ at $t_k$. Figure 27.3 shows an example
of a survival function for a discrete random variable: $S(t)$ is a step function, right-continuous
with left limits.
The discrete hazard and survival functions have the following connection which will be
useful for the next section.

FIGURE 27.4: Data structure for the Kaplan–Meier curve (numbers of failures, censorings, and units at risk over time).

Proposition 27.3 For a positive discrete random variable $T$, its survival function is a step
function determined by
$$S(t) = \mathrm{pr}(T>t) = \prod_{k: t_k\le t}(1-\lambda_k).$$
Note that $S(t)$ is a step function decreasing at each $t_k$ because $\lambda_k$ is a probability and
thus bounded between zero and one.
Proof of Proposition 27.3: By definition,
$$1-\lambda_k = 1-\mathrm{pr}(T=t_k\mid T\ge t_k) = \mathrm{pr}(T>t_k\mid T\ge t_k)$$
is the probability of surviving longer than $t_k$ conditional on surviving at least as long as $t_k$.
We can verify Proposition 27.3 within each interval of the $t_k$'s. For example, if $t<t_1$, then
$S(t) = \mathrm{pr}(T>t) = 1$; if $t_1\le t<t_2$, then
$$S(t) = \mathrm{pr}(T>t_1) = \mathrm{pr}(T>t_1, T\ge t_1) = \mathrm{pr}(T>t_1\mid T\ge t_1)\mathrm{pr}(T\ge t_1) = 1-\lambda_1;$$
if $t_2\le t<t_3$, then
$$S(t) = \mathrm{pr}(T>t_2) = \mathrm{pr}(T>t_2, T\ge t_2) = \mathrm{pr}(T>t_2\mid T\ge t_2)\mathrm{pr}(T\ge t_2) = (1-\lambda_2)(1-\lambda_1).$$
We can also verify the other values of $S(t)$ by induction. □
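The identity in Proposition 27.3 can be verified numerically for the discrete distribution in Figure 27.3; the following small sketch (mine) is one way to do so.

# masses at t = 1,...,6 from Figure 27.3
tk = 1:6
fk = c(0.1, 0.05, 0.15, 0.2, 0.3, 0.2)
Sk = 1 - cumsum(fk)              # S(t_k) = pr(T > t_k)
lambdak = fk / c(1, Sk[-6])      # lambda_k = f(t_k) / S(t_k-)
cumprod(1 - lambdak)             # recovers S(t_k), as in Proposition 27.3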

27.3 Kaplan–Meier survival curve


Without censoring, estimating the CDF or the survival function is rather straightforward.
With IID data $(T_1,\ldots,T_n)$, we can estimate the CDF by $\hat F(t) = n^{-1}\sum_{i=1}^n 1(T_i\le t)$ and
the survival function by $\hat S(t) = 1-\hat F(t)$.
Figure 27.4 shows the common data structure with censoring in survival analysis:
(S1) t1 , . . . , tK are the death times, and d1 , . . . , dK are the corresponding number of deaths;
(S2) r1 , . . . , rK are the number of patients at risk, that is, r1 patients are not dead or
censored right before time t1 , and so on;

(S3) $c_1,\ldots,c_K$ are the numbers of censored patients within the intervals $[t_1, t_2), \ldots, [t_K, \infty)$.
Kaplan and Meier (1958) proposed the following simple estimator for the survival func-
tion.
Definition 27.1 (Kaplan–Meier curve) First estimate the discrete hazard function at
the failure times $\{t_1,\ldots,t_K\}$ as $\hat\lambda_k = d_k/r_k$ $(k=1,\ldots,K)$ and then estimate the survival
function as
$$\hat S(t) = \prod_{k: t_k\le t}(1-\hat\lambda_k).$$

The Ŝ(t) in Definition 27.1 is also called the product-limit estimator of the survival
function due to its mathematical form.
At each failure time $t_k$, we view $d_k$ as the result of $r_k$ Bernoulli trials with probability
$\lambda_k$. So $\hat\lambda_k = d_k/r_k$ has variance $\lambda_k(1-\lambda_k)/r_k$, which can be estimated by
$$\widehat{\mathrm{var}}(\hat\lambda_k) = \hat\lambda_k(1-\hat\lambda_k)/r_k.$$
We can estimate the variance of the survival function using the delta method. We can
approximate the variance of
$$\log\hat S(t) = \sum_{k: t_k\le t}\log(1-\hat\lambda_k) \approx \sum_{k: t_k\le t}\log(1-\lambda_k) - \sum_{k: t_k\le t}(1-\lambda_k)^{-1}(\hat\lambda_k-\lambda_k)$$
by
$$\widehat{\mathrm{var}}\left\{\log\hat S(t)\right\} = \sum_{k: t_k\le t}(1-\hat\lambda_k)^{-2}\widehat{\mathrm{var}}(\hat\lambda_k)
= \sum_{k: t_k\le t}(1-\hat\lambda_k)^{-2}\hat\lambda_k(1-\hat\lambda_k)/r_k
= \sum_{k: t_k\le t}\frac{d_k}{r_k(r_k-d_k)},$$

which is called Greenwood’s formula (Greenwood, 1926). A hidden assumption above is the
independence of the λ̂k ’s. This assumption cannot be justified due to the dependence of the
events. However, a deeper theory of counting processes shows that Greenwood’s formula is
valid even without the independence (Fleming and Harrington, 2011).
Based on Greenwood’s formula, we can construct a confidence interval for log S(t):
r n o
log Ŝ(t) ± zα var
ˆ log Ŝ(t) ,

which implies a confidence interval for S(t). However, this interval can be outside of range
[0, 1] because the log transformation log S(t) is in the range of (−∞, 0) but the Normal
approximation is in the range (−∞, ∞). A better transformation is log-log:
n o
v(t) = log {− log S(t)} , v̂(t) = log − log Ŝ(t) .

Using Taylor expansion, we can approximate the variance of


1 n o
v̂(t) ∼
= log {− log S(t)} − log Ŝ(t) − log S(t)
log S(t)

by
$$\frac{\widehat{\mathrm{var}}\{\log\hat S(t)\}}{\{\log\hat S(t)\}^2}.$$
Based on this formula and Greenwood's formula above, we can construct a confidence
interval for $v(t)$:
$$\log\{-\log\hat S(t)\} \pm z_\alpha \sqrt{\widehat{\mathrm{var}}\{\log\hat S(t)\}}\Big/\log\hat S(t),$$
which implies another confidence interval for $S(t)$.

which implies another confidence interval for S(t). In the R package survival, the func-
tion survfit can fit the Kaplan–Meier curve, where the specifications conf.type = "log" and
conf.type = "log-log" return confidence intervals based on the log and log-log transforma-
tions, respectively.
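The following small sketch (my own toy example, not from the text) computes the product-limit estimator and Greenwood's formula by hand and compares the survival estimates with those from survfit.

library(survival)
time   = c(3, 5, 5, 7, 9, 11, 14, 14, 16, 20)
status = c(1, 1, 0, 1, 1,  0,  1,  1,  0,  0)   # 1 = failure, 0 = censored
fit = survfit(Surv(time, status) ~ 1, conf.type = "log")
tk = fit$time[fit$n.event > 0]       # failure times
dk = fit$n.event[fit$n.event > 0]    # numbers of failures
rk = fit$n.risk[fit$n.event > 0]     # numbers at risk
S.hat = cumprod(1 - dk / rk)                  # product-limit estimator
var.logS = cumsum(dk / (rk * (rk - dk)))      # Greenwood's formula
cbind(manual = S.hat, survfit = summary(fit, times = tk)$surv)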
Figure 27.5 plots four curves based on the combination of NALTREXONE and THERAPY using
the data of Lin et al. (2016). I do not show the confidence intervals due to the large overlap.
> km4groups = survfit(Surv(futime, relapse) ~ NALTREXONE + THERAPY,
+                     data = COMBINE)
> plot(km4groups, bty = "n", col = 1:4,
+      xlab = "t", ylab = "survival functions")
> legend("topright",
+        c("NALTREXONE=0, THERAPY=0",
+          "NALTREXONE=0, THERAPY=1",
+          "NALTREXONE=1, THERAPY=0",
+          "NALTREXONE=1, THERAPY=1"),
+        col = 1:4, lty = 1, bty = "n")
FIGURE 27.5: Lin et al. (2016)'s data (Kaplan–Meier survival curves for the four NALTREXONE × THERAPY groups).

The above discussion on the Kaplan–Meier curve is rather heuristic. More fundamen-
tally, what is the underlying censoring mechanism that ensures the possibility that the
distribution of the survival time can be recovered by the observed data? It turns out that
we have implicitly assumed that the survival time and the censoring time are independent.
Homework problem 27.1 gives a theoretical statement.

27.4 Cox model for time-to-event outcome


Another important problem is to model the dependence of the survival time T on covariates
x. The major challenge is that the survival time is often censored. Let Ci be the censoring
time of unit i, and we can only observe the minimum value of the survival time and the
censoring time. So the observed data are (xi , yi , δi )ni=1 , where

yi = min(Ti , Ci ), δi = 1(Ti ≤ Ci )

are the event time and the censoring indicator, respectively. A key assumption is that the
censoring mechanism is noninformative:
Assumption 27.1 (noninformative censoring) Ti Ci | xi .
We can start with parametric models.

Example 27.5 Assume Ti | xi ∼ Log-Normal(xti β, σ 2 ). Equivalently,

log Ti = xti β + εi ,

where the εi ’s are IID N(0, σ 2 ) independent of the xi ’s. This is a Normal linear model on
log Ti .
Example 27.6 Assume that $T_i\mid x_i \sim \text{Weibull}(a, b = e^{x_i^t\beta})$. Based on the definition of the
Weibull distribution in Example 27.4, we have
$$\log T_i = x_i^t\beta + \varepsilon_i,$$
where the $\varepsilon_i$'s are IID copies of $a^{-1}\log Z$ with $Z\sim$ Exponential(1), independent of the $x_i$'s.

The R package survival contains the function survreg to fit parametric survival models
including the choices of dist = "lognormal", dist = "weibull", etc. However, these parametric
models are not commonly used in practice. The parametric forms can be too strong, and
due to right censoring, the inference can be driven by extrapolation to the right tail.
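As a sketch of the syntax (mine, not part of the original analysis), a Weibull fit to the COMBINE data with the same covariates as the Cox fit in Section 27.4.3 would look like:

# parametric Weibull regression for the time to relapse (sketch)
library(survival)
survreg(Surv(futime, relapse) ~ NALTREXONE * THERAPY + AGE + GENDER + T0_PDA + site,
        data = COMBINE, dist = "weibull")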

27.4.1 Cox model and its interpretation


Cox (1972) proposed to model the conditional hazard function
$$\lambda(t\mid x) = \lim_{\Delta t\downarrow 0}\mathrm{pr}(t\le T<t+\Delta t\mid T\ge t, x)/\Delta t = \frac{f(t\mid x)}{S(t\mid x)}.$$

His celebrated proportional hazards model has the following form.

Assumption 27.2 (Cox proportional hazards model) Assume the conditional hazard
function has the form
$$\lambda(t\mid x) = \lambda_0(t)\exp(x^t\beta), \tag{27.2}$$
where $\beta$ is an unknown parameter and $\lambda_0(\cdot)$ is an unknown function.



Assumption 27.2 is equivalent to
$$\log\lambda(t\mid x) = \log\lambda_0(t) + x^t\beta.$$
Unlike other regression models, $x$ does not contain the intercept in (27.2). If the first
component of $x$ is 1, then we can write
$$\lambda(t\mid x) = \lambda_0(t)\exp(x_1\beta_1 + \cdots + x_p\beta_p) = \lambda_0(t)e^{\beta_1}\exp(x_2\beta_2 + \cdots + x_p\beta_p)$$
and redefine $\lambda_0(t)e^{\beta_1}$ as another unknown function. With an intercept, we cannot identify
$\lambda_0(t)$ and $\beta_1$ separately. So we drop the intercept to ensure identifiability.
From the log-linear form of the conditional hazard function, we have
$$\log\frac{\lambda(t\mid x')}{\lambda(t\mid x)} = (x'-x)^t\beta,$$
so each coordinate of $\beta$ measures the log conditional hazard ratio holding the other covariates
constant. Because of this, (27.2) is called the proportional hazards model. A positive $\beta_j$
suggests a "positive" effect on the hazard function and thus a "negative" effect on the
survival time itself. Consider a special case with a binary $x_i$: the proportional hazards
assumption implies that $\lambda(t\mid 1) = \gamma\lambda(t\mid 0)$ with $\gamma = \exp(\beta)$, and therefore the survival
functions satisfy
$$S(t\mid 1) = \exp\left\{-\int_0^t\lambda(u\mid 1)\,du\right\} = \exp\left\{-\gamma\int_0^t\lambda(u\mid 0)\,du\right\} = \{S(t\mid 0)\}^\gamma,$$
which is a power transformation. Qualitatively, we have the following two cases:


(PH1) β < 0: so the hazard ratio γ = exp(β) < 1 and S(t | 1) ≥ S(t | 0) for all t, which
implies a longer survival time under treatment;
(PH2) β > 0: so the hazard ratio γ = exp(β) > 1 and S(t | 1) ≤ S(t | 0) for all t, which
implies a shorter survival time under treatment.
Figure 27.6 shows some survival functions satisfying the proportional hazards assumption,
none of which cross each other within the interval t ∈ (0, ∞). When the two survival
functions cross, the proportional hazards assumption does not hold.
Theoretically, we can allow the covariates to be time-dependent, that is, xi (t) can depend
on t and thus is a stochastic process. However, the interpretation of the coefficient becomes
challenging (Fisher and Lin, 1999). This chapter focuses on the simple case with time-
invariant covariates.
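The power-transformation relation above is easy to visualize; the following sketch (mine) plots an Exponential(1) baseline survival curve together with its powers, mimicking the left panel of Figure 27.6.

# S(t | 1) = S(t | 0)^gamma under proportional hazards, Exponential(1) baseline
t = seq(0, 6, by = 0.01)
S0 = exp(-t)                      # baseline survival function
plot(t, S0, type = "l", ylab = "survival functions")
lines(t, S0^0.5, lty = 2)         # gamma = exp(beta) = 0.5
lines(t, S0^2, lty = 3)           # gamma = exp(beta) = 2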

27.4.2 Partial likelihood


The likelihood function is rather complicated, as it depends on the unknown function $\lambda_0(t)$.
Assuming no ties, Cox (1972) proposed to use the partial likelihood function to estimate $\beta$:
$$L(\beta) = \prod_{k=1}^K \frac{\exp(x_k^t\beta)}{\sum_{l\in R(t_k)}\exp(x_l^t\beta)},$$

FIGURE 27.6: Proportional hazards assumption with different baseline survival functions (Exponential(1), Gamma(2,2), lnorm(0,1)), where the power equals γ = exp(β).

where the product is over K time points with failures, xk is the covariate value of the failure
at time tk , and R(tk ) contains the indices of the units at risk at time tk , i.e., the units not
censored or failed right before the time tk .
Freedman (2008) gives a heuristic explanation of the partial likelihood based on the
following result, which extends Proposition B.7 on the Exponential distribution.

Theorem 27.1 If $T_1,\ldots,T_n$ are independent with hazard functions $\lambda_i(t)$ $(i=1,\ldots,n)$,
then their minimum $T = \min_{1\le i\le n}T_i$ has hazard function $\sum_{i=1}^n\lambda_i(t)$. Moreover, if
$\lambda_i(t) = c_i\lambda(t)$, then
$$\mathrm{pr}(T_i = T) = \frac{c_i}{\sum_{i'=1}^n c_{i'}}.$$

Proof of Theorem 27.1: The survival function of $T$ is
$$\mathrm{pr}(T>t) = \mathrm{pr}(T_1>t,\ldots,T_n>t) = \prod_{i=1}^n S_i(t),$$
so Proposition 27.2 implies that its hazard function is
$$-\frac{d}{dt}\log\mathrm{pr}(T>t) = -\sum_{i=1}^n\frac{d}{dt}\log S_i(t) = \sum_{i=1}^n\lambda_i(t).$$
So the first conclusion follows.
As a byproduct of the above proof, the density of $T$ is $\sum_{i=1}^n\lambda_i(t)\prod_{i=1}^n S_i(t)$ based on
Proposition 27.2. It must integrate to one; with $\lambda_i(t) = c_i\lambda(t)$, this implies
$$\left(\sum_{i=1}^n c_i\right)\int_0^\infty\lambda(t)\prod_{i=1}^n S_i(t)\,dt = 1. \tag{27.3}$$

Therefore, we have
$$\mathrm{pr}(T_i = T) = \mathrm{pr}\{T_i\le T_{i'}\text{ for all } i'\ne i\}
= \int_0^\infty\prod_{i'\ne i}S_{i'}(t)f_i(t)\,dt
= \int_0^\infty\prod_{i'=1}^n S_{i'}(t)\lambda_i(t)\,dt
= \int_0^\infty c_i\lambda(t)\prod_{i'=1}^n S_{i'}(t)\,dt
= c_i\Big/\sum_{i'=1}^n c_{i'},$$
where the last equality holds due to (27.3). □


Theorem 27.1 explains each of the $K$ components in the partial likelihood. At time $t_k$,
the units in $R(t_k)$ are all at risk, and unit $k$ fails, assuming no ties. The probability that
unit $k$ has the smallest failure time among the units in $R(t_k)$ is
$$\frac{\exp(x_k^t\beta)}{\sum_{l\in R(t_k)}\exp(x_l^t\beta)}$$
from Theorem 27.1. The product in the partial likelihood is based on the independence of
the events at the $K$ failure times, which is more difficult to justify. A rigorous justification
relies on the deeper theory of counting processes (Fleming and Harrington, 2011) or
semiparametric statistics (Tsiatis, 2007).
The log-likelihood function is
$$\log L(\beta) = \sum_{k=1}^K\left\{x_k^t\beta - \log\sum_{l\in R(t_k)}\exp(x_l^t\beta)\right\},$$
and the score function is
$$\frac{\partial\log L(\beta)}{\partial\beta} = \sum_{k=1}^K\left\{x_k - \frac{\sum_{l\in R(t_k)}\exp(x_l^t\beta)x_l}{\sum_{l\in R(t_k)}\exp(x_l^t\beta)}\right\}.$$
Define
$$\pi_\beta(l\mid R_k) = \exp(x_l^t\beta)\Big/\sum_{l\in R(t_k)}\exp(x_l^t\beta), \qquad (l\in R(t_k)),$$
which sum to one, so they induce a probability measure leading to the expectation $E_\beta(\cdot\mid R_k)$
and covariance $\mathrm{cov}_\beta(\cdot\mid R_k)$. With this notation, the score function simplifies to
$$\frac{\partial\log L(\beta)}{\partial\beta} = \sum_{k=1}^K\{x_k - E_\beta(x\mid R_k)\},$$
where $E_\beta(x\mid R_k) = \sum_{l\in R(t_k)}\pi_\beta(l\mid R_k)x_l$; the Hessian matrix simplifies to
$$\frac{\partial^2\log L(\beta)}{\partial\beta\partial\beta^t} = -\sum_{k=1}^K\mathrm{cov}_\beta(x\mid R_k) \preceq 0,$$
where
$$\mathrm{cov}_\beta(x\mid R_k) = \sum_{l\in R(t_k)}\pi_\beta(l\mid R_k)x_l x_l^t - \sum_{l\in R(t_k)}\pi_\beta(l\mid R_k)x_l\sum_{l\in R(t_k)}\pi_\beta(l\mid R_k)x_l^t.$$

The coxph function in the R package survival uses Newton’s method to compute the
maximizer β̂ of the partial likelihood function, and uses the inverse of the observed Fisher
information to approximate its asymptotic variance. Lin and Wei (1989) proposed a sand-
wich covariance estimator to allow for the misspecification of the Cox model. The coxph
function with robust = TRUE reports the corresponding robust standard errors.
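For instance, a robust version of the fit shown in the next subsection (a sketch of syntax, not additional analysis from the text) would be:

# Cox fit with Lin-Wei sandwich standard errors
cox.robust <- coxph(Surv(futime, relapse) ~ NALTREXONE * THERAPY +
                      AGE + GENDER + T0_PDA + site,
                    data = COMBINE, robust = TRUE)
summary(cox.robust)$coef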

27.4.3 Examples
Using Lin et al. (2016)’s data, we have the following results.
> cox . fit <- coxph ( Surv ( futime , relapse ) ~ NALTREXONE * THERAPY +
+ AGE + GENDER + T 0 _ PDA + site ,
+ data = COMBINE )
> summary ( cox . fit )
Call :
coxph ( formula = Surv ( futime , relapse ) ~ NALTREXONE * THERAPY +
AGE + GENDER + T 0 _ PDA + site , data = COMBINE )

n = 1226, number of events = 856

                         coef exp(coef) se(coef)      z Pr(>|z|)
NALTREXONE          -0.249719  0.779020 0.097690 -2.556  0.01058 *
THERAPY             -0.167050  0.846158 0.096102 -1.738  0.08217 .
AGE                 -0.015540  0.984580 0.003559 -4.366 1.27e-05 ***
GENDERmale          -0.140621  0.868818 0.075368 -1.866  0.06207 .
T0_PDA               0.002550  1.002553 0.001368  1.863  0.06242 .
sitesite_1          -0.091853  0.912239 0.167261 -0.549  0.58290
sitesite_10         -0.227185  0.796774 0.175427 -1.295  0.19531
sitesite_2           0.121236  1.128892 0.160052  0.757  0.44876
sitesite_3          -0.084483  0.918987 0.161121 -0.524  0.60004
sitesite_4          -0.471612  0.623996 0.175203 -2.692  0.00711 **
sitesite_5          -0.128286  0.879602 0.161782 -0.793  0.42780
sitesite_6          -0.240563  0.786185 0.161958 -1.485  0.13745
sitesite_7           0.372004  1.450639 0.157616  2.360  0.01827 *
sitesite_8           0.067700  1.070045 0.160876  0.421  0.67388
sitesite_9           0.267373  1.306528 0.154911  1.726  0.08435 .
NALTREXONE:THERAPY   0.337539  1.401495 0.137441  2.456  0.01405 *

NALTREXONE has a significant negative log hazard ratio, but THERAPY has a nonsignificant
negative log hazard ratio. More interestingly, their interaction NALTREXONE:THERAPY has a sig-
nificant positive log hazard ratio. This suggests that combining NALTREXONE and THERAPY is
worse than using NALTREXONE alone to delay the first time of heavy drinking and other end-
points. This is also coherent with the survival curves in Figure 27.5, in which the best
Kaplan–Meier curve corresponds to NALTREXONE=1, THERAPY=0.
Using Keele (2010)’s data, we have the following results:
> cox.fit <- coxph(Surv(acttime, censor) ~
+                    hcomm + hfloor + scomm + sfloor +
+                    prespart + demhsmaj + demsnmaj +
+                    prevgenx + lethal +
+                    deathrt1 + acutediz + hosp01 +
+                    hospdisc + hhosleng +
+                    mandiz01 + femdiz01 + peddiz01 + orphdum +
+                    natreg + I(natreg^2) + vandavg3 + wpnoavg3 +
+                    condavg3 + orderent + stafcder,
+                  data = fda)
> summary ( cox . fit )
Call :
coxph ( formula = Surv ( acttime , censor ) ~ hcomm + hfloor + scomm +
sfloor + prespart + demhsmaj + demsnmaj + prevgenx + lethal +
deathrt 1 + acutediz + hosp 0 1 + hospdisc + hhosleng + mandiz 0 1 +
femdiz 0 1 + peddiz 0 1 + orphdum + natreg + I ( natreg ^ 2 ) + vandavg 3 +
wpnoavg 3 + condavg 3 + orderent + stafcder , data = fda )

n = 4 0 8 , number of events = 2 6 2

coef exp ( coef ) se ( coef ) z Pr ( >| z |)


hcomm 3 . 6 4 2e - 0 1 1.439e+00 2.951e+00 0.123 0.901775
hfloor 7.944e+00 2.819e+03 8.173e+00 0.972 0.331071
scomm 4 . 7 1 6e - 0 1 1.603e+00 1.898e+00 0.248 0.803771
sfloor 2.604e+00 1.352e+01 2.370e+00 1.099 0.271877
prespart 8 . 0 3 8e - 0 1 2.234e+00 3 . 0 4 2e - 0 1 2.643 0.008226 **
demhsmaj 1.363e+00 3.909e+00 1.917e+00 0.711 0.476890
demsnmaj 1.217e+00 3.377e+00 5 . 6 0 6e - 0 1 2.171 0.029940 *
prevgenx -9 . 9 1 5e - 0 4 9 . 9 9 0e - 0 1 7 . 7 7 9e - 0 4 -1 . 2 7 5 0.202459
lethal 7 . 8 7 2e - 0 2 1.082e+00 2 . 3 7 8e - 0 1 0.331 0.740605
deathrt 1 6 . 5 3 7e - 0 1 1.923e+00 2 . 4 3 5e - 0 1 2.685 0.007253 **
acutediz 1 . 9 9 4e - 0 1 1.221e+00 2 . 2 6 2e - 0 1 0.882 0.377896
hosp 0 1 4 . 2 8 0e - 0 2 1.044e+00 2 . 4 9 5e - 0 1 0.172 0.863768
hospdisc -1 . 2 3 8e - 0 6 1.000e+00 5 . 2 7 8e - 0 7 -2 . 3 4 5 0.019002 *
hhosleng -1 . 2 7 3e - 0 2 9 . 8 7 4e - 0 1 1 . 9 8 8e - 0 2 -0 . 6 4 0 0.521891
mandiz 0 1 -1 . 1 7 7e - 0 1 8 . 8 8 9e - 0 1 3 . 8 0 0e - 0 1 -0 . 3 1 0 0.756711
femdiz 0 1 9 . 0 3 2e - 0 1 2.468e+00 3 . 4 9 7e - 0 1 2.583 0.009799 **
peddiz 0 1 -3 . 4 0 1e - 0 2 9 . 6 6 6e - 0 1 5 . 1 1 2e - 0 1 -0 . 0 6 7 0.946968
orphdum 5 . 5 4 0e - 0 1 1.740e+00 2 . 1 0 9e - 0 1 2.626 0.008630 **
natreg -2 . 2 2 1e - 0 2 9 . 7 8 0e - 0 1 8 . 2 8 2e - 0 3 -2 . 6 8 2 0.007318 **
I ( natreg ^ 2 ) 1 . 0 2 9e - 0 4 1.000e+00 4 . 5 6 7e - 0 5 2.253 0.024276 *
vandavg 3 -2 . 0 1 4e - 0 2 9 . 8 0 1e - 0 1 1 . 5 3 6e - 0 2 -1 . 3 1 1 0.189802
wpnoavg 3 5 . 2 2 0e - 0 3 1.005e+00 1 . 4 2 6e - 0 3 3.660 0.000252 ***
condavg 3 9 . 6 2 8e - 0 3 1.010e+00 2 . 2 7 1e - 0 2 0.424 0.671637
orderent -1 . 8 1 0e - 0 2 9 . 8 2 1e - 0 1 8 . 1 4 7e - 0 3 -2 . 2 2 2 0.026296 *
stafcder 8 . 0 1 3e - 0 4 1.001e+00 7 . 9 8 6e - 0 4 1.003 0.315719

27.4.4 Log-rank test as a score test from Cox model


A standard problem in clinical trials is to compare the survival times under treatment
and control. Assume no ties in the failure times, and let x denote the binary indicator for
treatment. Under the proportional hazards assumption, the control group has hazard λ0 (t),
and the treatment group has hazard λ1 (t) = λ0 (t)eβ . We are interested in testing the null
hypothesis
β = 0 ⇐⇒ λ1 (t) = λ0 (t) ⇐⇒ S1 (t) = S0 (t).
Under the null hypothesis, the score function reduces to
$$
\left.\frac{\partial \log L(\beta)}{\partial \beta}\right|_{\beta=0}
= \sum_{k=1}^{K}\{x_k - E_{\beta=0}(x \mid R_k)\}
= \sum_{k=1}^{K}\left(x_k - \frac{r_{k1}}{r_k}\right),
$$

because
$$
E_{\beta=0}(x \mid R_k) = \frac{\sum_{l \in R(t_k)} x_l}{\sum_{l \in R(t_k)} 1} = \frac{r_{k1}}{r_k}
$$
equals the ratio of the number of treated units at risk, $r_{k1}$, to the number of units at risk, $r_k$, at time $t_k$. The Fisher information at the null is
$$
-\frac{\partial^2 \log L(0)}{\partial \beta \partial \beta^{t}}
= \sum_{k=1}^{K} \mathrm{cov}_{\beta=0}(x \mid R_k)
= \sum_{k=1}^{K} \frac{r_{k1}}{r_k}\left(1 - \frac{r_{k1}}{r_k}\right).
$$

The score test for classical parametric models relies on
$$
\frac{\partial \log L(0)}{\partial \beta} \stackrel{a}{\sim} \mathrm{N}\left(0, -\frac{\partial^2 \log L(0)}{\partial \beta \partial \beta^{t}}\right),
$$
which follows from Bartlett's identity and the CLT. Applying this fact to Cox's model, we have
$$
\mathrm{LR} = \frac{\sum_{k=1}^{K}\left(x_k - \frac{r_{k1}}{r_k}\right)}{\sqrt{\sum_{k=1}^{K}\frac{r_{k1}}{r_k}\left(1 - \frac{r_{k1}}{r_k}\right)}} \stackrel{a}{\sim} \mathrm{N}(0, 1).
$$

So we reject the null hypothesis at level α if |LR| is larger than the 1 − α/2 quantile of the
standard Normal distribution. This is almost identical to the log-rank test without ties. Allowing
for ties, Mantel (1966) proposed a more general form of the log-rank test.¹
The survdiff function in the survival package implements various tests, including the log-
rank test as a special case. Below, I use the gehan dataset in the MASS package to illustrate
the log-rank test. The data come from a matched-pair experiment on 42 leukaemia patients
(Gehan, 1965). Treated units received the drug 6-mercaptopurine, and the rest were controls.
For illustration purposes, I ignore the pair indicators.
> library(MASS)
> head(gehan)
  pair time cens   treat
1    1    1    1 control
2    1   10    1    6-MP
3    2   22    1 control
4    2    7    1    6-MP
5    3    3    1 control
6    3   32    0    6-MP
> survdiff(Surv(time, cens) ~ treat,
+          data = gehan)
Call:
survdiff(formula = Surv(time, cens) ~ treat, data = gehan)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
treat=6-MP      21        9     19.3      5.46      16.8
treat=control   21       21     10.7      9.77      16.8

 Chisq= 16.8  on 1 degrees of freedom, p= 4e-05

The treatment was quite effective, yielding an extremely small p-value even with moder-
ate sample size. It is also clear from the Kaplan–Meier curves in Figure 27.7 and the results
from fitting the Cox proportional hazards model.
1 Peto and Peto (1972) popularized the name log-rank test.

FIGURE 27.7: Kaplan–Meier curves with 95% confidence intervals based on Gehan (1965)’s
data

> cox.gehan = coxph(Surv(time, cens) ~ treat,
+                   data = gehan)
> summary(cox.gehan)
Call:
coxph(formula = Surv(time, cens) ~ treat, data = gehan)

  n= 42, number of events= 30

               coef exp(coef) se(coef)     z Pr(>|z|)
treatcontrol 1.5721    4.8169   0.4124 3.812 0.000138 ***

             exp(coef) exp(-coef) lower .95 upper .95
treatcontrol     4.817     0.2076     2.147     10.81

Concordance= 0.69  (se = 0.041)
Likelihood ratio test= 16.35  on 1 df,   p=5e-05
Wald test            = 14.53  on 1 df,   p=1e-04
Score (logrank) test = 17.25  on 1 df,   p=3e-05

The log-rank test is a standard tool in survival analysis. However, what it delivers is just
a special case of the Cox proportional hazards model. The p-value from the log-rank test is
close to the p-value from the score test of the Cox proportional hazards model with only a
binary treatment indicator. The latter can also adjust for other pretreatment covariates.

27.5 Extensions
27.5.1 Stratified Cox model
Many randomized trials are stratified. The Combined Pharmacotherapies and Behavioral
Interventions study reviewed at the beginning of this chapter is an example with site in-
dicating the strata. The previous analysis includes the dummy variables of site in the Cox
model. An alternative, more flexible model allows for different baseline hazard functions
across strata. Technically, assume
$$
\lambda_s(t \mid x) = \lambda_s(t)\exp(\beta^{t}x)
$$

for strata s = 1, . . . , S, where β is an unknown parameter and {λ1 (·), . . . , λS (·)} are un-
known functions. Therefore, within each stratum s, the proportional hazards assumption
holds; across strata, the proportional hazard assumption may not hold. Within stratum s,
we can obtain the partial likelihood Ls (β); by independence of the data across strata, we
can obtain the joint partial likelihood
$$
\prod_{s=1}^{S} L_s(\beta).
$$

Following the standard procedure, we can obtain the maximizer of the joint partial likelihood and
conduct inference based on large-sample theory. The coxph function allows for stratification via
a strata() term in the regression formula.
> cox.fit <- coxph(Surv(futime, relapse) ~ NALTREXONE * THERAPY +
+                    AGE + GENDER + T0_PDA + strata(site),
+                  robust = TRUE,
+                  data = COMBINE)
> summary(cox.fit)
Call :
coxph ( formula = Surv ( futime , relapse ) ~ NALTREXONE * THERAPY +
AGE + GENDER + T 0 _ PDA + strata ( site ) , data = COMBINE , robust = TRUE )

n = 1 2 2 6 , number of events = 8 5 6

coef exp ( coef ) se ( coef ) robust se z Pr ( >| z |)


NALTREXONE -0 . 2 5 2 2 3 9 0 . 7 7 7 0 5 9 0 . 0 9 7 7 8 8 0 . 0 9 6 4 3 7 -2 . 6 1 6 0 . 0 0 8 9 1
THERAPY -0 . 1 7 3 4 5 6 0 . 8 4 0 7 5 4 0 . 0 9 6 2 5 8 0 . 0 9 5 9 5 8 -1 . 8 0 8 0 . 0 7 0 6 6
AGE -0 . 0 1 5 1 0 4 0 . 9 8 5 0 1 0 0 . 0 0 3 5 5 4 0 . 0 0 3 5 1 2 -4 . 3 0 1 1 . 7e - 0 5
GENDERmale -0 . 1 3 9 8 3 7 0 . 8 6 9 5 0 0 0 . 0 7 5 3 8 8 0 . 0 7 6 5 8 0 -1 . 8 2 6 0 . 0 6 7 8 5
T 0 _ PDA 0.002747 1.002751 0.001369 0.001350 2.035 0.04182
NALTREXONE : THERAPY 0.335671 1.398879 0.137676 0.136890 2.452 0.01420

NALTREXONE **
THERAPY .
AGE ***
GENDERmale .
T 0 _ PDA *
NALTREXONE : THERAPY *

exp ( coef ) exp ( - coef ) lower . 9 5 upper . 9 5


NALTREXONE 0.7771 1.2869 0.6432 0.9387
THERAPY 0.8408 1.1894 0.6966 1.0147
AGE 0.9850 1.0152 0.9783 0.9918
GENDERmale 0.8695 1.1501 0.7483 1.0103
T 0 _ PDA 1.0028 0.9973 1.0001 1.0054
NALTREXONE : THERAPY 1.3989 0.7149 1.0697 1.8294

Concordance = 0 . 5 6 1 ( se = 0 . 0 1 1 )
Likelihood ratio test = 3 5 . 2 4 on 6 df , p = 4e - 0 6
Wald test = 3 3 . 8 5 on 6 df , p = 7e - 0 6
Score ( logrank ) test = 3 4 . 9 4 on 6 df , p = 4e - 0 6 , Robust = 3 4 . 1 5 p = 6e - 0 6

27.5.2 Clustered Cox model


With clustered data, we must adjust the standard errors. The coxph function reports
cluster-robust standard errors when the cluster argument is specified. A canonical example of
clustered data comes from the matched-pair design if we view the pairs as clusters. The example
below uses the data from Huster et al. (1989), in which the two eyes of each patient were
assigned to treatment or control.
> library ( " timereg " )
> data ( diabetes )
> pair . cox = coxph ( Surv ( time , status ) ~ treat + adult + agedx ,
+ robust = TRUE ,
+ data = diabetes )
> summary ( pair . cox )
Call :
coxph ( formula = Surv ( time , status ) ~ treat + adult + agedx , data = diabetes ,
robust = TRUE )

n = 3 9 4 , number of events = 1 5 5

coef exp ( coef ) se ( coef ) robust se z Pr ( >| z |)


treat -0 . 7 8 1 4 8 3 0 . 4 5 7 7 2 7 0 . 1 6 8 9 7 7 0 . 1 7 0 1 1 2 -4 . 5 9 4 4 . 3 5e - 0 6
adult -0 . 1 3 6 9 6 7 0 . 8 7 1 9 9 9 0 . 2 8 9 3 4 4 0 . 2 7 0 9 0 9 -0 . 5 0 6 0.613
agedx 0 . 0 0 7 8 3 6 1 . 0 0 7 8 6 6 0.009681 0.009360 0.837 0.403

treat ***
adult
agedx

exp ( coef ) exp ( - coef ) lower . 9 5 upper . 9 5


treat 0.4577 2.1847 0.3279 0.6389
adult 0.8720 1.1468 0.5128 1.4829
agedx 1.0079 0.9922 0.9895 1.0265

Concordance = 0 . 5 9 6 ( se = 0 . 0 2 4 )
Likelihood ratio test = 2 3 . 1 3 on 3 df , p = 4e - 0 5
Wald test = 2 1 . 5 4 on 3 df , p = 8e - 0 5
Score ( logrank ) test = 2 3 . 0 1 on 3 df , p = 4e - 0 5 , Robust = 2 2 . 0 9 p = 6e - 0 5

( Note : the likelihood ratio and score tests assume independence of


observations within a cluster , the Wald and robust score tests do not ).
> pair.cox = coxph(Surv(time, status) ~ treat + adult + agedx,
+                  robust = TRUE, cluster = id,
+                  data = diabetes)
> summary(pair.cox)
Call :
coxph ( formula = Surv ( time , status ) ~ treat + adult + agedx , data = diabetes ,
robust = TRUE , cluster = id )

n = 3 9 4 , number of events = 1 5 5

coef exp ( coef ) se ( coef ) robust se z Pr ( >| z |)


treat -0 . 7 8 1 4 8 3 0 . 4 5 7 7 2 7 0 . 1 6 8 9 7 7 0 . 1 4 8 3 3 0 -5 . 2 6 9 1 . 3 7e - 0 7
adult -0 . 1 3 6 9 6 7 0 . 8 7 1 9 9 9 0 . 2 8 9 3 4 4 0 . 2 9 5 2 3 9 -0 . 4 6 4 0.643
agedx 0 . 0 0 7 8 3 6 1 . 0 0 7 8 6 6 0.009681 0.010272 0.763 0.446

treat ***
adult
agedx

exp ( coef ) exp ( - coef ) lower . 9 5 upper . 9 5


treat 0.4577 2.1847 0.3423 0.6122
adult 0.8720 1.1468 0.4889 1.5553
agedx 1.0079 0.9922 0.9878 1.0284

Concordance = 0 . 5 9 6 ( se = 0 . 0 2 3 )
Likelihood ratio test = 2 3 . 1 3 on 3 df , p = 4e - 0 5
Wald test = 2 8 . 5 5 on 3 df , p = 3e - 0 6
Score ( logrank ) test = 2 3 . 0 1 on 3 df , p = 4e - 0 5 , Robust = 2 6 . 5 5 p = 7e - 0 6

( Note : the likelihood ratio and score tests assume independence of


observations within a cluster , the Wald and robust score tests do not ).

27.5.3 Penalized Cox model


The glmnet package implements the penalized version of the Cox model.
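For illustration, below is a minimal sketch of a lasso-penalized Cox fit on simulated data. The data and variable names are hypothetical, and the sketch assumes a recent version of glmnet that accepts a Surv object as the response (older versions require a two-column matrix with columns time and status).

library(glmnet)
library(survival)
set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
t0 <- rexp(n, rate = exp(0.5 * x[, 1] - 0.5 * x[, 2]))   # latent survival times
c0 <- rexp(n, rate = 0.2)                                # independent censoring times
time <- pmin(t0, c0)
status <- as.numeric(t0 <= c0)                           # 1 = event, 0 = censored
y <- Surv(time, status)
fit <- glmnet(x, y, family = "cox", alpha = 1)           # lasso-penalized partial likelihood
cvfit <- cv.glmnet(x, y, family = "cox")                 # cross-validation over the penalty
coef(cvfit, s = "lambda.min")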

27.6 Critiques on survival analysis


The Kaplan–Meier curve and the Cox proportional hazards model are standard tools for
analyzing medical data with censored survival times. They are among the most commonly
used methods in medical journals. Kaplan and Meier (1958) and Cox (1972) are two of the
most cited papers in statistics.
Freedman (2008) criticized these two standard tools. Both rely on the critical assumption
of noninformative censoring that censoring and survival time are independent or condition-
ally independent given covariates. When censoring is due to administrative constraints, this
may be a plausible assumption. The data from Lin et al. (2016) is a convincing example
of noninformative censoring. However, many other studies have more complex censoring
mechanisms: for example, one patient may drop out of the study, and another may die of an
unrelated cause. The Cox model relies on the additional assumption of proportional hazards.
This particular functional form facilitates the interpretation of the coefficients as log conditional
hazard ratios if the model is correctly specified. However, the interpretation becomes obscure
when the model is misspecified. The two survival curves based on Lin et al. (2016)'s data cross
each other, which makes the proportional hazards assumption dubious.
Hernán (2010) offered a more fundamental critique of hazard-based survival analysis.
For example, in a randomized treatment-control experiment, the hazard ratio at time t
is the ratio of the instantaneous probabilities of death conditional on having survived up to
time t:
$$
\frac{\lim_{\Delta t \downarrow 0} \mathrm{pr}(t \le T < t + \Delta t \mid x = 1, T \ge t)/\Delta t}{\lim_{\Delta t \downarrow 0} \mathrm{pr}(t \le T < t + \Delta t \mid x = 0, T \ge t)/\Delta t}.
$$
This ratio is difficult to interpret because patients who have survived up to time t can be
quite different in treatment and control groups, especially when the treatment is effective.
Even though patients are randomly assigned at the baseline, the survivors up to time t are
not. Hernán (2010) suggested focusing on the comparison of the survival functions.

27.7 Homework problems


27.1 Identifiability of the survival time under independent censoring
Assume the survival time T and censoring time C are continuous and independent random
variables. But we can only observe y = min(T, C) and δ = 1(T ≤ C). Show that the hazard
function of T can be identified by the following formula:

pr(y = t, δ = 1)
λT (t) = .
pr(y ≥ t)

27.2 Log-Normal regression model


Does the log-Normal regression model in Example 27.5 satisfy the proportional hazards
assumption? Based on (yi , xi , δi )ni=1 , what is the likelihood function under Assumption
27.1? Compare it with the partial likelihood function.

27.3 Weibull random variable


Use (27.1) to derive the formulas of the density, survival, and hazard functions of the Weibull
random variable. Calculate its mean and variance.
Hint: Use the Gamma function to express the moments.

27.4 Weibull regression model


Find the distribution of εi in the Weibull regression model in Example 27.6. Show E(T | x)
is log-linear in x, and E(log T | x) is linear in x. Does it satisfy the proportional hazards
assumption? Based on (yi , xi , δi )ni=1 , what is the likelihood function under Assumption 27.1?
Compare it with the partial likelihood function.

27.5 Invariance of the proportional hazards model


Assume that T | x follows a proportional hazards model. Show that any non-negative and
strictly increasing transformation g(T ) | x also follows a proportional hazards model.
Part IX

Appendices
A
Linear Algebra

All vectors are column vectors in this book. This is consistent with R.

A.1 Basics of vectors and matrices


Euclidean space
The n-dimensional Euclidean space $\mathbb{R}^n$ is the set of all n-dimensional vectors equipped with the inner product
$$
\langle x, y\rangle = x^{t}y = \sum_{i=1}^{n} x_i y_i,
$$
where $x = (x_1, \ldots, x_n)^{t}$ and $y = (y_1, \ldots, y_n)^{t}$ are two n-dimensional vectors. The length of a vector $x$ is defined as
$$
\|x\| = \sqrt{\langle x, x\rangle} = \sqrt{x^{t}x}.
$$
The Cauchy–Schwarz inequality states that
$$
|\langle x, y\rangle| \le \|x\| \cdot \|y\|,
$$
or, more transparently,
$$
\left(\sum_{i=1}^{n} x_i y_i\right)^2 \le \left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i^2\right).
$$
The equality holds if and only if x and y are linearly dependent, that is, $y_i = b x_i$ for some constant b and all $i = 1, \ldots, n$ (or $x = 0$). We can use the Cauchy–Schwarz inequality to prove the triangle inequality
$$
\|x + y\| \le \|x\| + \|y\|.
$$
We say that x and y are orthogonal, denoted by x ⊥ y, if ⟨x, y⟩ = 0. We call a set of vec-
tors v1 , . . . , vm ∈ Rn orthonormal if they all have unit length and are mutually orthogonal.
Geometrically, we can define the cosine of the angle between two vectors $x, y \in \mathbb{R}^n$ as
$$
\cos\angle(x, y) = \frac{\langle x, y\rangle}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\sqrt{\sum_{i=1}^{n} y_i^2}}.
$$
For unit vectors, it reduces to the inner product. When both $x$ and $y$ are orthogonal to $1_n$, that is, $\bar{x} = n^{-1}\sum_{i=1}^{n} x_i = 0$ and $\bar{y} = n^{-1}\sum_{i=1}^{n} y_i = 0$, the cosine of the angle is identical to the sample Pearson correlation coefficient
$$
\hat{\rho}_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}.
$$
Sometimes, we simply say that the cosine of the angle between two vectors measures their correlation even when they are not orthogonal to $1_n$.
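As a quick numerical illustration of this identity, the sketch below (with simulated data) verifies that the cosine of the angle between two centered vectors coincides with their sample correlation.

set.seed(1)
n <- 100
x <- rnorm(n); y <- 0.5 * x + rnorm(n)
xc <- x - mean(x); yc <- y - mean(y)     # centered vectors, orthogonal to 1_n
cosine <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
c(cosine = cosine, correlation = cor(x, y))   # identical up to rounding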


Column space of a matrix


Given an n × m matrix A, we can view it in terms of all its elements
$$
A = (a_{ij}) = \begin{pmatrix} a_{11} & \cdots & a_{1m}\\ \vdots & & \vdots\\ a_{n1} & \cdots & a_{nm}\end{pmatrix},
$$
or its row vectors
$$
A = \begin{pmatrix} a_1^{t}\\ \vdots\\ a_n^{t}\end{pmatrix},
$$
where $a_i \in \mathbb{R}^m$ $(i = 1, \ldots, n)$, or its column vectors
$$
A = (A_1, \ldots, A_m),
$$
where $A_j \in \mathbb{R}^n$ $(j = 1, \ldots, m)$. In statistics, the rows correspond to the units, so the ith row vector contains the observations for unit i. Moreover, viewing A in terms of its column vectors can give more insights. Define the column space of A as
$$
\mathcal{C}(A) = \{\alpha_1 A_1 + \cdots + \alpha_m A_m : \alpha_1, \ldots, \alpha_m \in \mathbb{R}\},
$$
which is the set of all linear combinations of the column vectors $A_1, \ldots, A_m$. The column space is important because we can write $A\alpha$, with $\alpha = (\alpha_1, \ldots, \alpha_m)^{t}$, as
$$
A\alpha = (A_1, \ldots, A_m)\begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_m\end{pmatrix} = \alpha_1 A_1 + \cdots + \alpha_m A_m \in \mathcal{C}(A).
$$
We define the row space of A as the column space of $A^{t}$.

1.1 Matrix product


Given an n × m matrix $A = (a_{ij})$ and an m × r matrix $B = (b_{ij})$, we can define their product as $C = AB$, where the n × r matrix $C = (c_{ij})$ has the (i, j)th element
$$
c_{ij} = \sum_{k=1}^{m} a_{ik} b_{kj}.
$$
In terms of the row vectors of A and the column vectors of B, we have
$$
c_{ij} = a_i^{t} B_j,
$$
that is, $c_{ij}$ equals the inner product of the ith row vector of A and the jth column vector of B. Moreover, the matrix product satisfies
$$
AB = A(B_1, \ldots, B_r) = (AB_1, \ldots, AB_r), \tag{A.1}
$$
so the column vectors of AB belong to the column space of A; it also satisfies
$$
AB = \begin{pmatrix} a_1^{t}\\ \vdots\\ a_n^{t}\end{pmatrix} B = \begin{pmatrix} a_1^{t}B\\ \vdots\\ a_n^{t}B\end{pmatrix}, \tag{A.2}
$$
so the row vectors of AB belong to the column space of $B^{t}$, or equivalently, the row space of B.

1.2 Linearly independent vectors and rank


We call a set of vectors $A_1, \ldots, A_m \in \mathbb{R}^n$ linearly independent if
$$
x_1 A_1 + \cdots + x_m A_m = 0
$$
implies $x_1 = \cdots = x_m = 0$. We call $A_{j_1}, \ldots, A_{j_k}$ maximally linearly independent if adding any other vector makes them linearly dependent. Define $k$ as the rank of $\{A_1, \ldots, A_m\}$ and also as the rank of the matrix $A = (A_1, \ldots, A_m)$.
A set of vectors may have different subsets of vectors that are maximally linearly independent, but the rank $k$ is unique. We can also define the rank of a matrix in terms of its row vectors. A remarkable theorem in linear algebra is that it does not matter whether we define the rank of a matrix in terms of its column vectors or its row vectors.
From the matrix product formulas (A.1) and (A.2), we have the following result.

Proposition A.1 $\mathrm{rank}(AB) \le \min\{\mathrm{rank}(A), \mathrm{rank}(B)\}$.

The rank decomposition of a matrix decomposes A into the product of two matrices of full rank.

Proposition A.2 If an n × m matrix A has rank k, then A = BC for some n × k matrix B and k × m matrix C.

Proof of Proposition A.2: Let $A_{j_1}, \ldots, A_{j_k}$ be maximally linearly independent column vectors of A. Stack them into an n × k matrix $B = (A_{j_1}, \ldots, A_{j_k})$. They can linearly represent all column vectors of A:
$$
A = (c_{11}A_{j_1} + \cdots + c_{k1}A_{j_k}, \ldots, c_{1m}A_{j_1} + \cdots + c_{km}A_{j_k}) = (BC_1, \ldots, BC_m) = BC,
$$
where $C = (C_1, \ldots, C_m)$ is a k × m matrix with column vectors
$$
C_1 = \begin{pmatrix} c_{11}\\ \vdots\\ c_{k1}\end{pmatrix}, \quad \ldots, \quad C_m = \begin{pmatrix} c_{1m}\\ \vdots\\ c_{km}\end{pmatrix}.
$$
Proposition A.1 ensures that $\mathrm{rank}(B) \ge k$ and $\mathrm{rank}(C) \ge k$, so both must have rank exactly k. □
The decomposition in Proposition A.2 is not unique since the choice of the maximally linearly independent column vectors of A is not unique.

Some special matrices


An n×n matrix A is symmetric if At = A. An n×n diagonal matrix A has zero off-diagonal
elements, denoted by A = diag{a11 , . . . , ann }. Diagonal matrices are symmetric.
An n × n matrix is orthogonal if At A = AAt = In . The column vectors of an orthogonal
matrix are orthonormal; so are its row vectors. If A is orthogonal, then
∥Ax∥ = ∥x∥
for any vector x ∈ Rn . That is, multiplying a vector by an orthogonal matrix does not change
the length of the vector. Geometrically, an orthogonal matrix corresponds to rotation.
An n × n matrix A is upper triangular if $a_{ij} = 0$ for $i > j$, and lower triangular if $a_{ij} = 0$
for $i < j$. Under mild conditions (possibly after permuting its rows), an n × n matrix A can be
factorized as
$$
A = LU,
$$
where L is a lower triangular matrix and U is an upper triangular matrix. This is called
the LU decomposition of the matrix.

Determinant
The original definition of the determinant of a square matrix $A = (a_{ij})$ has a very complex form, which will not be used in this book.
The determinant of a 2 × 2 matrix has a simple form:
$$
\det\begin{pmatrix} a & b\\ c & d\end{pmatrix} = ad - bc. \tag{A.3}
$$
The determinant of the Vandermonde matrix has the following formula:
$$
\det\begin{pmatrix}1 & x_1 & x_1^2 & \cdots & x_1^{n-1}\\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_n & x_n^2 & \cdots & x_n^{n-1}\end{pmatrix} = \prod_{1\le i<j\le n}(x_j - x_i). \tag{A.4}
$$
The properties of the determinant are more useful. I will review two.

Proposition A.3 For two n × n matrices A and B, we have
$$
\det(AB) = \det(A)\det(B) = \det(BA).
$$

Proposition A.4 For two square matrices $A \in \mathbb{R}^{m\times m}$ and $B \in \mathbb{R}^{n\times n}$, we have
$$
\det\begin{pmatrix}A & 0\\ C & B\end{pmatrix} = \det\begin{pmatrix}A & D\\ 0 & B\end{pmatrix} = \det(A)\det(B).
$$

Inverse of a matrix
Let $I_n$ be the n × n identity matrix. An n × n matrix A is invertible (nonsingular) if there exists an n × n matrix B such that $AB = BA = I_n$. We call B the inverse of A, denoted by $A^{-1}$. If A is an orthogonal matrix, then $A^{t} = A^{-1}$.
A square matrix is invertible if and only if $\det(A) \neq 0$.
The inverse of a 2 × 2 matrix is
$$
\begin{pmatrix} a & b\\ c & d\end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b\\ -c & a\end{pmatrix}. \tag{A.5}
$$
The inverse of a 3 × 3 lower triangular matrix is
$$
\begin{pmatrix} a & 0 & 0\\ b & c & 0\\ d & e & f\end{pmatrix}^{-1} = \frac{1}{acf}\begin{pmatrix} cf & 0 & 0\\ -bf & af & 0\\ be - cd & -ae & ac\end{pmatrix}. \tag{A.6}
$$
A useful identity is
$$
(AB)^{-1} = B^{-1}A^{-1}
$$
if both A and B are invertible.

Eigenvalues and eigenvectors


For an n × n matrix A, if there exist a nonzero n-dimensional vector x and a scalar λ such that
$$
Ax = \lambda x,
$$
then we call λ an eigenvalue and x an associated eigenvector of A. From the definition,
eigenvalues and eigenvectors always come in pairs. The following eigendecomposition theorem
is important for real symmetric matrices.

Theorem A.1 If A is an n × n symmetric matrix, then there exists an orthogonal matrix P such that
$$
P^{t}AP = \mathrm{diag}\{\lambda_1, \ldots, \lambda_n\},
$$
where the λ's are the n eigenvalues of A, and the column vectors of $P = (\gamma_1, \cdots, \gamma_n)$ are
the corresponding eigenvectors.

If we write the eigendecomposition as

AP = P diag{λ1 , . . . , λn }

or, equivalently,
A(γ1 , · · · , γn ) = (λ1 γ1 , · · · , λn γn ),
then (λi , γi ) must be a pair of eigenvalue and eigenvector. Moreover, the eigendecomposition
in Theorem A.1 is unique up to the permutation of the columns of P and the corresponding
λi ’s.

Corollary A.1 If P t AP = diag{λ1 , . . . , λn }, then

A = P diag{λ1 , . . . , λn }P t , Ak = A · A · · · A = P diag{λk1 , . . . , λkn }P t ;

if the eigenvalues of A are nonzero, then

A−1 = P diag{1/λ1 , . . . , 1/λn }P t .

The eigen-decomposition is also useful for defining the square root of an n×n symmetric
matrix. In particular, if the eigenvalues of A are nonnegative, then we can define
p p
A1/2 = P diag{ λ1 , . . . , λn }P t

By definition, A1/2 is a symmetric matrix satisfying A1/2 A1/2 = A. There are other defini-
tions of the square root of a symmetric matrix, but we adopt this form in this book.
From Theorem A.1, we can write A as
$$
A = P\,\mathrm{diag}\{\lambda_1, \ldots, \lambda_n\}P^{t}
= (\gamma_1, \cdots, \gamma_n)\,\mathrm{diag}\{\lambda_1, \ldots, \lambda_n\}\begin{pmatrix}\gamma_1^{t}\\ \vdots\\ \gamma_n^{t}\end{pmatrix}
= \sum_{i=1}^{n}\lambda_i\gamma_i\gamma_i^{t}.
$$

For an n × n symmetric matrix A, its rank equals the number of non-zero eigenvalues
and its determinant equals the product of all eigenvalues. The matrix A is of full rank if
all its eigenvalues are non-zero, which implies that its rank equals n and its determinant is
non-zero.
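The following sketch illustrates these facts numerically with a simulated symmetric positive semi-definite matrix: the eigendecomposition, the matrix square root, and the rank as the number of nonzero eigenvalues.

set.seed(1)
B <- matrix(rnorm(16), 4, 4)
A <- crossprod(B)                                  # symmetric positive semi-definite
ed <- eigen(A, symmetric = TRUE)
P <- ed$vectors; lambda <- ed$values
max(abs(A - P %*% diag(lambda) %*% t(P)))          # A = P Lambda P^t
Ahalf <- P %*% diag(sqrt(lambda)) %*% t(P)         # symmetric square root A^{1/2}
max(abs(Ahalf %*% Ahalf - A))                      # A^{1/2} A^{1/2} = A
sum(lambda > 1e-10)                                # rank = number of nonzero eigenvalues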

Quadratic form
For an n × n symmetric matrix $A = (a_{ij})$ and an n-dimensional vector x, we can define the quadratic form as
$$
x^{t}Ax = \langle x, Ax\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} x_i x_j.
$$

We always consider a symmetric matrix in the quadratic form without loss of generality.
Otherwise, we can symmetrize A as à = (A + At )/2 without changing the value of the
quadratic form because
xt Ax = xt Ãx.
We call A positive semi-definite, denoted by A ⪰ 0, if xt Ax ≥ 0 for all x; we call A
positive definite, denoted by A ≻ 0, if xt Ax > 0 for all nonzero x.
We can also define the partial order between matrices. We call A ⪰ B if and only if
A − B ⪰ 0, and we call A ≻ B if and only if A − B ≻ 0. This is important in statistics
because we often compare the efficiency of estimators based on their variances or covariance
matrices. Given two unbiased estimators θ̂1 and θ̂2 for a scalar parameter θ, we say that
θ̂1 is more efficient than θ̂2 if var(θ̂2 ) ≥ var(θ̂1 ). In the vector case, we say that θ̂1 is more
efficient than θ̂2 if cov(θ̂2 ) ⪰ cov(θ̂1 ), which is equivalent to var(ℓt θ̂2 ) ≥ var(ℓt θ̂1 ) for any
linear combination of the estimators.
The eigenvalues of a symmetric matrix determine whether it is positive semi-definite or
positive definite.

Theorem A.2 For a symmetric matrix A, it is positive semi-definite if and only if all its
eigenvalues are nonnegative, and it is positive definite if and only if all its eigenvalues are
positive.

An important result is the relationship between the eigenvalues and the extreme values of the quadratic form. Assume that the eigenvalues are arranged in decreasing order, $\lambda_1 \ge \cdots \ge \lambda_n$. For a unit vector x, we have
$$
x^{t}Ax = x^{t}\left(\sum_{i=1}^{n}\lambda_i\gamma_i\gamma_i^{t}\right)x = \sum_{i=1}^{n}\lambda_i\alpha_i^2,
$$
where
$$
\alpha = \begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_n\end{pmatrix} = \begin{pmatrix}\gamma_1^{t}x\\ \vdots\\ \gamma_n^{t}x\end{pmatrix} = P^{t}x
$$
has length $\|\alpha\|^2 = \|x\|^2 = 1$. Then the maximum value of $x^{t}Ax$ is $\lambda_1$, achieved at $\alpha_1 = 1$ and $\alpha_2 = \cdots = \alpha_n = 0$ (for example, $x = \gamma_1$). For a unit vector x that is orthogonal to $\gamma_1$, we have
$$
x^{t}Ax = \sum_{i=2}^{n}\lambda_i\alpha_i^2,
$$
where $\alpha = P^{t}x$ has unit length with $\alpha_1 = 0$. The maximum value of $x^{t}Ax$ is then $\lambda_2$, achieved at $\alpha_2 = 1$ and $\alpha_1 = \alpha_3 = \cdots = \alpha_n = 0$, for example, $x = \gamma_2$. By induction, we
have the following theorem.

Theorem A.3 Suppose that an n × n symmetric matrix A has eigendecomposition $A = \sum_{i=1}^{n}\lambda_i\gamma_i\gamma_i^{t}$ with $\lambda_1 \ge \cdots \ge \lambda_n$.

1. The optimization problem
$$
\max_x\; x^{t}Ax \quad \text{such that } \|x\| = 1
$$
has maximum $\lambda_1$, which can be achieved by $\gamma_1$.

2. The optimization problem
$$
\max_x\; x^{t}Ax \quad \text{such that } \|x\| = 1,\; x \perp \gamma_1
$$
has maximum $\lambda_2$, which can be achieved by $\gamma_2$.

3. The optimization problem
$$
\max_x\; x^{t}Ax \quad \text{such that } \|x\| = 1,\; x \perp \gamma_1, \ldots, x \perp \gamma_k
$$
has maximum $\lambda_{k+1}$, which can be achieved by $\gamma_{k+1}$ $(k = 1, \ldots, n-1)$.

Theorem A.3 implies the following theorem on the Rayleigh quotient
$$
r(x) = x^{t}Ax / x^{t}x, \quad x \in \mathbb{R}^n,\; x \neq 0.
$$

Theorem A.4 (Rayleigh quotient and eigenvalues) The maximum and minimum eigenvalues of an n × n symmetric matrix A equal
$$
\lambda_{\max}(A) = \max_{x \neq 0} r(x), \qquad \lambda_{\min}(A) = \min_{x \neq 0} r(x),
$$
with the maximizer and minimizer being the eigenvectors corresponding to the maximum and minimum eigenvalues, respectively.

An immediate consequence of Theorem A.4 is that the diagonal elements of A are bounded by the smallest and largest eigenvalues of A. This follows by taking $x = (0, \ldots, 1, \ldots, 0)^{t}$, where only the ith element equals 1.

Trace
The trace of an n × n matrix $A = (a_{ij})$ is the sum of its diagonal elements, denoted by
$$
\mathrm{trace}(A) = \sum_{i=1}^{n} a_{ii}.
$$
The trace operator has two important properties that can sometimes help to simplify calculations.

Proposition A.5 $\mathrm{trace}(AB) = \mathrm{trace}(BA)$ as long as AB and BA are both square matrices.

We can verify Proposition A.5 by definition. It states that AB and BA have the same trace although AB differs from BA in general. It is particularly useful when the dimension of BA is much lower than the dimension of AB. For example, if both $A = (a_1, \ldots, a_n)^{t}$ and $B = (b_1, \ldots, b_n)$ are vectors, then $\mathrm{trace}(AB) = \mathrm{trace}(BA) = \langle B^{t}, A\rangle = \sum_{i=1}^{n} a_i b_i$.

Proposition A.6 The trace of an n × n symmetric matrix A equals the sum of its eigenvalues: $\mathrm{trace}(A) = \sum_{i=1}^{n}\lambda_i$.

Proof of Proposition A.6: It follows from the eigendecomposition and Proposition A.5. Let $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_n\}$; then
$$
\mathrm{trace}(A) = \mathrm{trace}(P\Lambda P^{t}) = \mathrm{trace}(\Lambda P^{t}P) = \mathrm{trace}(\Lambda) = \sum_{i=1}^{n}\lambda_i. \qquad \Box
$$
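A quick numerical illustration of Propositions A.5 and A.6 with simulated matrices:

set.seed(1)
A <- matrix(rnorm(12), 3, 4)
B <- matrix(rnorm(12), 4, 3)
c(sum(diag(A %*% B)), sum(diag(B %*% A)))                # trace(AB) = trace(BA)
S <- crossprod(matrix(rnorm(9), 3, 3))                   # symmetric matrix
c(sum(diag(S)), sum(eigen(S, symmetric = TRUE)$values))  # trace = sum of eigenvalues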



Projection matrix
An n × n matrix H is a projection matrix if it is symmetric and $H^2 = H$. The eigenvalues of H must be either 1 or 0. To see this, assume that $Hx = \lambda x$ for some nonzero vector x, and compute $H^2x$ in two ways:
$$
H^2x = Hx = \lambda x, \qquad H^2x = H(Hx) = H(\lambda x) = \lambda Hx = \lambda^2 x.
$$
So $(\lambda - \lambda^2)x = 0$, which implies $\lambda - \lambda^2 = 0$, i.e., $\lambda = 0$ or $1$. Consequently, the trace of H equals its rank:
$$
\mathrm{trace}(H) = \mathrm{rank}(H).
$$
Why is this a reasonable definition of a "projection matrix"? Or, why must a projection matrix satisfy $H^{t} = H$ and $H^2 = H$? First, it is reasonable to require that $Hx_1 = x_1$ for any $x_1 \in \mathcal{C}(H)$, the column space of H. Since $x_1 = H\alpha$ for some $\alpha$, we indeed have $Hx_1 = H(H\alpha) = H^2\alpha = H\alpha = x_1$ because of the property $H^2 = H$. Second, it is reasonable to require that $x_1 \perp x_2$ for any vector $x_1 = H\alpha \in \mathcal{C}(H)$ and any $x_2$ such that $Hx_2 = 0$. So we need $\alpha^{t}H^{t}x_2 = 0$, which is true if $H = H^{t}$. Therefore, the two conditions are natural for the definition of a projection matrix.
More interestingly, a projection matrix has a more explicit form, as stated below.

Theorem A.5 If an n × p matrix X has p linearly independent columns, then $H = X(X^{t}X)^{-1}X^{t}$ is a projection matrix. Conversely, if an n × n matrix H is a projection matrix with rank p, then $H = X(X^{t}X)^{-1}X^{t}$ for some n × p matrix X with linearly independent columns.

It is relatively easy to verify the first part of Theorem A.5; see Chapter 3. The second part of Theorem A.5 follows from the eigendecomposition of H, with the first p eigenvectors being the column vectors of X.
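The first part of Theorem A.5 is easy to check numerically. The sketch below, with a simulated design matrix, verifies symmetry, idempotence, and trace(H) = rank(H) = p.

set.seed(1)
n <- 20; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
H <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H - t(H)))          # symmetric
max(abs(H %*% H - H))       # idempotent
sum(diag(H))                # trace = p = 3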

Cholesky decomposition
An n × n positive semi-definite matrix A can be decomposed as $A = LL^{t}$, where L is an
n × n lower triangular matrix with non-negative diagonal elements. If A is positive definite,
this decomposition is unique; in general, it is not. Taking an arbitrary orthogonal matrix Q,
we have $A = LQQ^{t}L^{t} = CC^{t}$ where $C = LQ$. So we can decompose a positive semi-definite
matrix A as $A = CC^{t}$, but such a decomposition is not unique.
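In R, chol() returns the upper triangular factor U with A = U^t U, so the lower triangular factor above is L = U^t. A brief sketch with a simulated positive definite matrix:

set.seed(1)
A <- crossprod(matrix(rnorm(9), 3, 3)) + diag(3)   # positive definite
U <- chol(A)                                       # upper triangular factor
L <- t(U)                                          # lower triangular factor
max(abs(L %*% t(L) - A))                           # A = L L^t
Q <- qr.Q(qr(matrix(rnorm(9), 3, 3)))              # an arbitrary orthogonal matrix
C <- L %*% Q                                       # non-triangular factor with A = C C^t
max(abs(C %*% t(C) - A))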

Singular value decomposition (SVD)


Any n × m matrix A can be decomposed as
$$
A = UDV^{t},
$$
where U is an n × n orthogonal matrix, V is an m × m orthogonal matrix, and D is an n × m matrix whose off-diagonal elements are all zero. For a tall matrix with n ≥ m, the matrix D has many zero rows, so we can also write
$$
A = UDV^{t},
$$
where U is an n × m matrix with orthonormal columns ($U^{t}U = I_m$), V is an m × m orthogonal matrix, and D is an m × m diagonal matrix. Similarly, for a wide matrix with n ≤ m, we can write
$$
A = UDV^{t},
$$
where U is an n × n orthogonal matrix, V is an m × n matrix with orthonormal columns ($V^{t}V = I_n$), and D is an n × n diagonal matrix.
If D has only r ≤ min(m, n) nonzero elements, then we can further simplify the decomposition as
$$
A = UDV^{t},
$$
where U is an n × r matrix with orthonormal columns ($U^{t}U = I_r$), V is an m × r matrix with orthonormal columns ($V^{t}V = I_r$), and D is an r × r diagonal matrix. With the more explicit forms
$$
U = (U_1, \ldots, U_r), \quad D = \mathrm{diag}(d_1, \ldots, d_r), \quad V = (V_1, \ldots, V_r),
$$
we can write A as
$$
A = (U_1, \ldots, U_r)\begin{pmatrix} d_1 & & \\ & \ddots & \\ & & d_r\end{pmatrix}\begin{pmatrix}V_1^{t}\\ \vdots\\ V_r^{t}\end{pmatrix} = \sum_{k=1}^{r} d_k U_k V_k^{t}.
$$
The SVD implies that
$$
AA^{t} = UDD^{t}U^{t}, \qquad A^{t}A = VD^{t}DV^{t},
$$
which are the eigendecompositions of $AA^{t}$ and $A^{t}A$. This ensures that $AA^{t}$ and $A^{t}A$ have the same nonzero eigenvalues.
An application of the SVD is to define the pseudoinverse of any matrix. Define $D^{+}$ as the pseudoinverse of D, with the nonzero elements inverted and the zero elements kept at zero. Define
$$
A^{+} = VD^{+}U^{t} = \sum_{k=1}^{r} d_k^{-1}V_k U_k^{t}
$$
as the pseudoinverse of A. The definition holds even if A is not a square matrix. We can verify that
$$
AA^{+}A = A, \qquad A^{+}AA^{+} = A^{+}.
$$
If A is a square nondegenerate matrix, then $A^{+} = A^{-1}$ equals the standard inverse. In the special case of a symmetric positive semi-definite A, its SVD is identical to its eigendecomposition. If such an A is not invertible, its pseudoinverse equals
$$
A^{+} = P\,\mathrm{diag}(\lambda_1^{-1}, \ldots, \lambda_k^{-1}, 0, \ldots, 0)P^{t}
$$
if $\mathrm{rank}(A) = k < n$ and $\lambda_1, \ldots, \lambda_k$ are the nonzero eigenvalues.
Another application of the SVD is the polar decomposition of any square matrix A. Since $A = UDV^{t} = UDU^{t}UV^{t}$ with orthogonal U and V, we have
$$
A = (AA^{t})^{1/2}\Gamma,
$$
where $(AA^{t})^{1/2} = UDU^{t}$ and $\Gamma = UV^{t}$ is an orthogonal matrix.
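The sketch below, with a simulated matrix, computes the SVD, reconstructs A, and forms the pseudoinverse A+ = V D+ U^t; the result agrees with MASS::ginv (assuming the MASS package is available).

set.seed(1)
A <- matrix(rnorm(12), 4, 3)
sv <- svd(A)                                   # A = U D V^t
max(abs(A - sv$u %*% diag(sv$d) %*% t(sv$v)))
Aplus <- sv$v %*% diag(1 / sv$d) %*% t(sv$u)   # pseudoinverse (all d_k nonzero here)
max(abs(A %*% Aplus %*% A - A))                # A A^+ A = A
max(abs(Aplus - MASS::ginv(A)))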

A.2 Vector calculus


If $f(x)$ is a function from $\mathbb{R}^p$ to $\mathbb{R}$, then we use the notation
$$
\frac{\partial f(x)}{\partial x} \equiv \begin{pmatrix}\frac{\partial f(x)}{\partial x_1}\\ \vdots\\ \frac{\partial f(x)}{\partial x_p}\end{pmatrix}
$$
for the component-wise partial derivative, which has the same dimension as x. It is often called the gradient of f. For example, for a linear function $f(x) = x^{t}a = a^{t}x$ with $a, x \in \mathbb{R}^p$, we have
$$
\frac{\partial a^{t}x}{\partial x}
= \begin{pmatrix}\frac{\partial a^{t}x}{\partial x_1}\\ \vdots\\ \frac{\partial a^{t}x}{\partial x_p}\end{pmatrix}
= \begin{pmatrix}\frac{\partial\sum_{j=1}^{p}a_jx_j}{\partial x_1}\\ \vdots\\ \frac{\partial\sum_{j=1}^{p}a_jx_j}{\partial x_p}\end{pmatrix}
= \begin{pmatrix}a_1\\ \vdots\\ a_p\end{pmatrix} = a; \tag{A.7}
$$
for a quadratic function $f(x) = x^{t}Ax$ with a symmetric $A \in \mathbb{R}^{p\times p}$ and $x \in \mathbb{R}^p$, we have
$$
\frac{\partial x^{t}Ax}{\partial x}
= \begin{pmatrix}\frac{\partial\sum_{i=1}^{p}\sum_{j=1}^{p}a_{ij}x_ix_j}{\partial x_1}\\ \vdots\\ \frac{\partial\sum_{i=1}^{p}\sum_{j=1}^{p}a_{ij}x_ix_j}{\partial x_p}\end{pmatrix}
= \begin{pmatrix}2a_{11}x_1 + \cdots + 2a_{1p}x_p\\ \vdots\\ 2a_{p1}x_1 + \cdots + 2a_{pp}x_p\end{pmatrix}
= 2Ax.
$$
These are two important rules of vector calculus used in this book, summarized below.

Proposition A.7 We have
$$
\frac{\partial a^{t}x}{\partial x} = a, \qquad \frac{\partial x^{t}Ax}{\partial x} = 2Ax.
$$
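These two rules can be checked with a finite-difference gradient. The sketch below assumes the numDeriv package is available; the matrices and vectors are simulated.

library(numDeriv)
set.seed(1)
p <- 4
a <- rnorm(p)
A <- crossprod(matrix(rnorm(p^2), p, p))       # symmetric matrix
x <- rnorm(p)
f1 <- function(x) sum(a * x)                   # a^t x
f2 <- function(x) c(t(x) %*% A %*% x)          # x^t A x
max(abs(grad(f1, x) - a))                      # gradient equals a
max(abs(grad(f2, x) - as.vector(2 * A %*% x))) # gradient equals 2 A x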
We can also extend the definition to vector functions. If $f(x) = (f_1(x), \ldots, f_q(x))^{t}$ is a function from $\mathbb{R}^p$ to $\mathbb{R}^q$, then we use the notation
$$
\frac{\partial f(x)}{\partial x} \equiv \left(\frac{\partial f_1(x)}{\partial x}, \cdots, \frac{\partial f_q(x)}{\partial x}\right)
= \begin{pmatrix}\frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_q(x)}{\partial x_1}\\ \vdots & & \vdots\\ \frac{\partial f_1(x)}{\partial x_p} & \cdots & \frac{\partial f_q(x)}{\partial x_p}\end{pmatrix}, \tag{A.8}
$$
which is a p × q matrix with rows corresponding to the elements of x and columns corresponding to the elements of f(x). We can easily extend the first result of Proposition A.7.

Proposition A.8 For $B \in \mathbb{R}^{p\times q}$ and $x \in \mathbb{R}^p$, we have
$$
\frac{\partial B^{t}x}{\partial x} = B.
$$

Proof of Proposition A.8: Partition $B = (B_1, \ldots, B_q)$ in terms of its column vectors. The jth element of $B^{t}x$ is $B_j^{t}x$, so the jth column of $\partial B^{t}x/\partial x$ is $B_j$ by Proposition A.7. This verifies that $\partial B^{t}x/\partial x$ equals B. □
Some authors define $\partial f(x)/\partial x$ as the transpose of (A.8). I adopt this form for its natural connection with (A.7) when q = 1. Sometimes, it is indeed more convenient to work with the transpose of $\partial f(x)/\partial x$. Then I will use the notation
$$
\frac{\partial f(x)}{\partial x^{t}} = \left\{\frac{\partial f(x)}{\partial x}\right\}^{t} = \left(\frac{\partial f(x)}{\partial x_1}, \cdots, \frac{\partial f(x)}{\partial x_p}\right),
$$
which puts the transpose notation on x.



The above formulas become more powerful in conjunction with the chain rule. For example, for any differentiable function h(z) mapping from $\mathbb{R}$ to $\mathbb{R}$ with derivative $h'(z)$, we have
$$
\frac{\partial h(a^{t}x)}{\partial x} = h'(a^{t}x)\,a, \qquad
\frac{\partial h(x^{t}Ax)}{\partial x} = 2h'(x^{t}Ax)\,Ax.
$$
For any differentiable function h(z) mapping from $\mathbb{R}^q$ to $\mathbb{R}$ with gradient $\partial h(z)/\partial z$, we have
$$
\frac{\partial h(B^{t}x)}{\partial x}
= \frac{\partial h(B_1^{t}x, \ldots, B_q^{t}x)}{\partial x}
= \sum_{j=1}^{q} B_j\,\frac{\partial h(B_1^{t}x, \ldots, B_q^{t}x)}{\partial z_j}
= B\,\frac{\partial h(B^{t}x)}{\partial z}.
$$
Moreover, we can also define the Hessian matrix of a function f(x) mapping from $\mathbb{R}^p$ to $\mathbb{R}$:
$$
\frac{\partial^2 f(x)}{\partial x\partial x^{t}}
= \left(\frac{\partial^2 f(x)}{\partial x_i\partial x_j}\right)_{1\le i,j\le p}
= \frac{\partial}{\partial x^{t}}\left\{\frac{\partial f(x)}{\partial x}\right\}.
$$

A.3 Homework problems


1.1 Triangle inequality of the inner product
With three unit vectors $u, v, w \in \mathbb{R}^n$, prove that
$$
\sqrt{1 - \langle u, w\rangle} \le \sqrt{1 - \langle u, v\rangle} + \sqrt{1 - \langle v, w\rangle}.
$$
Remark: The result is a direct consequence of the standard triangle inequality, but it has an interesting implication. If $\langle u, v\rangle \ge 1 - \epsilon$ and $\langle v, w\rangle \ge 1 - \epsilon$, then $\langle u, w\rangle \ge 1 - 4\epsilon$. This implied inequality is mostly interesting when ϵ is small. It states that when u and v are highly correlated and v and w are highly correlated, then u and w must also be highly correlated. Note that we can find counterexamples for the following relationship:
$$
\langle u, v\rangle > 0, \quad \langle v, w\rangle > 0, \quad \text{but} \quad \langle u, w\rangle = 0.
$$

1.2 Van der Corput inequality


Assume that $v, u_1, \ldots, u_m \in \mathbb{R}^n$ have unit length. Prove that
$$
\left(\sum_{i=1}^{m}\langle v, u_i\rangle\right)^2 \le \sum_{i=1}^{m}\sum_{j=1}^{m}\langle u_i, u_j\rangle.
$$
Remark: This result is not too difficult to prove, but it says something fundamentally interesting. If v is correlated with many vectors $u_1, \ldots, u_m$, then at least some vectors among $u_1, \ldots, u_m$ must also be correlated.

1.3 Inverse of a block matrix


Prove that
$$
\begin{pmatrix}A & B\\ C & D\end{pmatrix}^{-1}
= \begin{pmatrix}A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1}\\ -(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1}\end{pmatrix}
= \begin{pmatrix}(A - BD^{-1}C)^{-1} & -(A - BD^{-1}C)^{-1}BD^{-1}\\ -D^{-1}C(A - BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1}\end{pmatrix},
$$
provided that all the inverses of the matrices exist. The two forms of the inverse imply the Woodbury formula:
$$
(A - BD^{-1}C)^{-1} = A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1},
$$
which further implies the Sherman–Morrison formula:
$$
(A + uv^{t})^{-1} = A^{-1} - (1 + v^{t}A^{-1}u)^{-1}A^{-1}uv^{t}A^{-1},
$$
where A is an invertible square matrix, and u and v are two column vectors.

1.4 Matrix determinant lemma


Prove that given the identity matrix In and two n-vectors u and v, we have

det(In + uv t ) = 1 + v t u.

Further show that if In is replaced by an n × n invertible matrix A, we have

det(A + uv t ) = (1 + v t A−1 u) · det(A).

1.5 Decomposition of a positive semi-definite matrix


Show that if A is positive semi-definite, then there exists a matrix C such that A = CC t .

1.6 Trace of the product of two matrices


Prove that if A and B are two n × n positive semi-definite matrices, then $\mathrm{trace}(AB) \ge 0$.
Hint: Use the eigendecomposition $A = \sum_{i=1}^{n}\lambda_i\gamma_i\gamma_i^{t}$.
Remark: In fact, a stronger result holds. If two n × n symmetric matrices A and B have eigenvalues
$$
\lambda_1 \ge \cdots \ge \lambda_n, \qquad \mu_1 \ge \cdots \ge \mu_n,
$$
respectively, then
$$
\sum_{i=1}^{n}\lambda_i\mu_{n+1-i} \le \mathrm{trace}(AB) \le \sum_{i=1}^{n}\lambda_i\mu_i.
$$
The result is due to Von Neumann (1937) and Ruhe (1970). See also Chen and Li (2019, Lemma 4.12).

1.7 Vector calculus


What is the formula for ∂xt Ax/∂x if A is not symmetric in Proposition A.7?
B
Random Variables

Let "IID" denote "independent and identically distributed", let $\stackrel{\text{iid}}{\sim}$ denote a sequence of random variables that are IID with some common distribution, and let $\perp\!\!\!\perp$ denote independence between random variables. Define Euler's Gamma function as
$$
\Gamma(z) = \int_{0}^{\infty} x^{z-1}e^{-x}\,dx, \quad (z > 0),
$$
which is a natural extension of the factorial since $\Gamma(n) = (n-1)!$. Further define the digamma function as $\psi(z) = d\log\Gamma(z)/dz$ and the trigamma function as $\psi'(z)$. In R, we can use
gamma ( z )
lgamma ( z )
digamma ( z )
trigamma ( z )

to compute Γ(z), log Γ(z), ψ(z), and ψ ′ (z).

B.1 Some important univariate random variables


B.1.1 Normal, χ2 , t and F
The standard Normal random variable $Z \sim \mathrm{N}(0, 1)$ has density
$$
f(z) = (2\pi)^{-1/2}\exp(-z^2/2).
$$
A Normal random variable X has mean µ and variance σ², denoted by N(µ, σ²), if $X = \mu + \sigma Z$. We can show that X has density
$$
f(x) = (2\pi\sigma^2)^{-1/2}\exp\{-(x-\mu)^2/(2\sigma^2)\}.
$$
A chi-squared random variable with degrees of freedom n, denoted by $Q_n \sim \chi^2_n$, can be represented as
$$
Q_n = \sum_{i=1}^{n} Z_i^2,
$$
where $Z_i \stackrel{\text{iid}}{\sim} \mathrm{N}(0, 1)$. We can show that its density is
$$
f_n(q) = q^{n/2-1}\exp(-q/2)\big/\left\{2^{n/2}\Gamma(n/2)\right\}, \quad (q > 0). \tag{B.1}
$$
We can verify that the density (B.1) is well-defined even if we change the integer n to an arbitrary positive real number ν, and we call the corresponding random variable $Q_\nu$ a chi-squared random variable with degrees of freedom ν, denoted by $Q_\nu \sim \chi^2_\nu$.


A t random variable with degrees of freedom ν can be represented as
$$
t_\nu = \frac{Z}{\sqrt{Q_\nu/\nu}},
$$
where $Z \sim \mathrm{N}(0, 1)$, $Q_\nu \sim \chi^2_\nu$, and $Z \perp\!\!\!\perp Q_\nu$.
An F random variable with degrees of freedom (r, s) can be represented as
$$
F = \frac{Q_r/r}{Q_s/s},
$$
where $Q_r \sim \chi^2_r$, $Q_s \sim \chi^2_s$, and $Q_r \perp\!\!\!\perp Q_s$.

B.1.2 Beta–Gamma duality


The Gamma(α, β) random variable with parameters α, β > 0 has density
$$
f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}, \quad (x > 0). \tag{B.2}
$$
The Beta(α, β) random variable with parameters α, β > 0 has density
$$
f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}, \quad (0 < x < 1).
$$
These two random variables are closely related, as shown in the following theorem.

Theorem B.1 If X ∼ Gamma(α, θ), Y ∼ Gamma(β, θ) and $X \perp\!\!\!\perp Y$, then
1. X + Y ∼ Gamma(α + β, θ);
2. X/(X + Y) ∼ Beta(α, β);
3. $X + Y \perp\!\!\!\perp X/(X + Y)$.

Another simple but useful fact is that χ² is a special Gamma random variable. Comparing the densities in (B.1) and (B.2), we obtain the following result.

Proposition B.1 $\chi^2_n \sim \mathrm{Gamma}(n/2, 1/2)$.

We can also calculate the moments of the Gamma and Beta distributions.

Proposition B.2 If X ∼ Gamma(α, β), then
$$
E(X) = \frac{\alpha}{\beta}, \qquad \mathrm{var}(X) = \frac{\alpha}{\beta^2}.
$$

Proposition B.3 If X ∼ Gamma(α, β), then
$$
E(\log X) = \psi(\alpha) - \log\beta, \qquad \mathrm{var}(\log X) = \psi'(\alpha).
$$

Proposition B.4 If X ∼ Beta(α, β), then
$$
E(X) = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.
$$

Proposition B.5 If X ∼ Beta(α, β), then
$$
E(\log X) = \psi(\alpha) - \psi(\alpha+\beta), \qquad \mathrm{var}(\log X) = \psi'(\alpha) - \psi'(\alpha+\beta).
$$

I leave the proofs of the above propositions as Problem 2.2.
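A Monte Carlo illustration of Theorem B.1 and Propositions B.2 and B.4, with arbitrary parameter values chosen only for illustration:

set.seed(1)
alpha <- 2; beta <- 3; theta <- 1.5
X <- rgamma(1e5, shape = alpha, rate = theta)
Y <- rgamma(1e5, shape = beta, rate = theta)
S <- X + Y; R <- X / (X + Y)
c(mean(S), (alpha + beta) / theta)                        # Gamma(alpha + beta, theta) mean
c(mean(R), alpha / (alpha + beta))                        # Beta(alpha, beta) mean
c(var(R), alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)))
cor(S, R)                                                 # approximately zero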

B.1.3 Exponential, Laplace, and Gumbel distributions


An Exponential(λ) random variable X ≥ 0 has density f (x) = λe−λx , mean 1/λ, median
log 2/λ and variance 1/λ2 . The standard Exponential random variable X0 has λ = 1, and
X0 /λ generates Exponential(λ).
An important feature of Exponential(λ) is the memoryless property.

Proposition B.6 If X ∼ Exponential(λ), then

pr(X ≥ x + c | X ≥ c) = pr(X ≥ x).

If X represents the survival time, then the probability of surviving another x time is
always the same no matter how long the existing survival time is.
Proof of Proposition B.6: Because pr(X > x) = e−λx , we have

pr(X ≥ x + c)
pr(X ≥ x + c | X ≥ c) =
pr(X ≥ c)
e−λ(x+c)
=
e−λc
−λx
= e
= pr(X ≥ x).


The minimum of independent exponential random variables also follows an exponential
distribution.

Proposition B.7 Assume that Xi ∼ Exponential(λi ) are independent (i = 1, . . . , n). Then

X = min(X1 , . . . , Xn ) ∼ Exponential(λ1 + · · · + λn )

and
λi
pr(Xi = X) = .
λ1 + · · · + λn
Proof of Proposition B.7: First,

pr(X > x) = pr(Xi > x, i = 1, . . . , n)


Yn
= pr(Xi > x)
i=1
Yn
= e−λi x
i=1
− n
P
= e i=1 λi x

Pn
so X is Exponential( i=1 λi ).

Second, we have

pr(Xi = X) = pr(Xi < Xj for all j ̸= i)


Z ∞Y
= pr(Xj > x)λi e−λi x dx
0 j̸=i
Z ∞ Y
= e−λj x λi e−λi x dx
0 j̸=i
Z n
∞Y
= λi e−λj x dx
0 i=1
Z ∞
− n
P
= λi e j=1 λj x dx
0
n
X
= λi / λj .
j=1


The difference between two IID exponential random variables follows the Laplace dis-
tribution

Proposition B.8 If y1 and y2 are two IID Exponential(λ), then y = y1 − y2 has density

λ
exp(−λ|c|), −∞ < c < ∞
2
which is the density of a Laplace distribution with mean 0 and variance 2/λ2 .

Proof of Proposition B.8: Both $y_1$ and $y_2$ have density $f(c) = \lambda e^{-\lambda c}$ and CDF $F(c) = 1 - e^{-\lambda c}$ for $c \ge 0$. The CDF of $y = y_1 - y_2$ at $c \le 0$ is
$$
\mathrm{pr}(y_1 - y_2 \le c)
= \int_0^\infty \mathrm{pr}(y_2 \ge z - c)\,\lambda e^{-\lambda z}\,dz
= \int_0^\infty e^{-\lambda(z-c)}\,\lambda e^{-\lambda z}\,dz
= \lambda e^{\lambda c}\int_0^\infty e^{-2\lambda z}\,dz
= \lambda e^{\lambda c}/(2\lambda)
= e^{\lambda c}/2,
$$
where the second equality uses $z - c \ge 0$ for all $z \ge 0$ when $c \le 0$. By symmetry, $y_1 - y_2 \sim y_2 - y_1$, so the CDF at $c \ge 0$ is
$$
\mathrm{pr}(y_1 - y_2 \le c) = 1 - \mathrm{pr}(y_1 - y_2 \le -c) = 1 - e^{-\lambda c}/2.
$$
Therefore, the density of y at $c \le 0$ is $\frac{\lambda}{2}e^{\lambda c}$, and the density at $c \ge 0$ is $\frac{\lambda}{2}e^{-\lambda c}$, which can be unified as $\frac{\lambda}{2}e^{-\lambda|c|}$. □
If X0 is the standard exponential random variable, then we define the Gumbel(µ, β)
random variable as
Y = µ − β log X0 .

The standard Gumbel distribution has µ = 0 and β = 1, with CDF


F (y) = exp(−e−y ), y∈R
and density
f (y) = exp(−e−y )e−y , y ∈ R.
By definition and Proposition B.7, we can verify that the maximum of IID Gumbels is
also Gumbel.
Proposition B.9 If $Y_1, \ldots, Y_n$ are IID Gumbel(µ, β), then
$$
\max_{1\le i\le n} Y_i \sim \mathrm{Gumbel}(\mu + \beta\log n, \beta).
$$
If $Y_1, \ldots, Y_n$ are independent Gumbel($\mu_i$, 1), then
$$
\max_{1\le i\le n} Y_i \sim \mathrm{Gumbel}\left(\log\sum_{i=1}^{n} e^{\mu_i}, 1\right).
$$

I leave the proof as Problem 2.3.

B.2 Multivariate distributions


A random vector (X1 , . . . , Xn )t is a vector consisting of n random variables. If all compo-
nents are continuous, we can define its joint density fX1 ···Xn (x1 , . . . , xn ).
For a random vector X Y with X and Y possibly being vectors, if it has joint density
fXY (x, y), then we can obtain the marginal distribution of X
Z
fX (x) = fXY (x, y)dy

and define the conditional density


fXY (x, y)
fY |X (y | x) = if fX (x) ̸= 0.
fX (x)
Based on the conditional density, we can define the conditional expectation of any function
of Y as Z
E {g(Y ) | X = x} = g(y)fY |X (y | x)dy

and the conditional variance as


h i
2 2
var {g(Y ) | X = x} = E {g(Y )} | X = x − [E {g(Y ) | X = x}] .

In the above definitions, the conditional mean and variance are both deterministic functions
of x. We can replace x by the random variable X to define E {g(Y ) | X} and var {g(Y ) | X},
which are functions of the random variable X and are thus random variables.
Below are two important laws of conditional expectation and variance.
Theorem B.2 (Law of total expectation) We have
E(Y ) = E {E(Y | X)} .
Theorem B.3 (Law of total variance or analysis of variance) We have
var(Y ) = E {var(Y | X)} + var {E(Y | X)} .

Independence
Random variables (X1 , . . . , Xn ) are mutually independent if
fX1 ···Xn (x1 , . . . , xn ) = fX1 (x1 ) · · · fXn (xn ).
Note that in this definition, each of (X1 , . . . , Xn ) can be vectors. We have the following
rules under independence.
Proposition B.10 If $X \perp\!\!\!\perp Y$, then $h(X) \perp\!\!\!\perp g(Y)$ for any functions $h(\cdot)$ and $g(\cdot)$.

Proposition B.11 If $X \perp\!\!\!\perp Y$, then
$$
f_{XY}(x, y) = f_X(x)f_Y(y), \quad
f_{Y\mid X}(y \mid x) = f_Y(y), \quad
E\{g(Y) \mid X\} = E\{g(Y)\}, \quad
E\{g(Y)h(X)\} = E\{g(Y)\}E\{h(X)\}.
$$

Expectations of random vectors or random matrices


For a random matrix W = (Wij ), we define E(W ) = (E(Wij )). For constant matrices A
and C, we can verify that
E(AW + C) = AE(W ) + C,
E(AW C) = AE(W )C.

Covariance between two random vectors


If $W \in \mathbb{R}^r$ and $Y \in \mathbb{R}^s$, then their covariance
$$
\mathrm{cov}(W, Y) = E\left[\{W - E(W)\}\{Y - E(Y)\}^{t}\right]
$$
is an r × s matrix. As a special case,
$$
\mathrm{cov}(Y) = \mathrm{cov}(Y, Y) = E\left[\{Y - E(Y)\}\{Y - E(Y)\}^{t}\right] = E(YY^{t}) - E(Y)E(Y)^{t}.
$$
For a scalar random variable, cov(Y) = var(Y).

Proposition B.12 For $A \in \mathbb{R}^{r\times n}$, $Y \in \mathbb{R}^n$ and $C \in \mathbb{R}^r$, we have $\mathrm{cov}(AY + C) = A\,\mathrm{cov}(Y)A^{t}$.

Using Proposition B.12, we can verify that $\mathrm{cov}(Y) \succeq 0$ for any n-dimensional random vector Y, because for all $x \in \mathbb{R}^n$, we have
$$
x^{t}\mathrm{cov}(Y)x = \mathrm{cov}(x^{t}Y) = \mathrm{var}(x^{t}Y) \ge 0.
$$
Proposition B.13 For two random vectors W and Y , we have
cov(AW + C, BY + D) = Acov(W, Y )B t
and
cov(AW + BY ) = Acov(W )At + Bcov(Y )B t + Acov(W, Y )B t + Bcov(Y, W )At .
Similar to Theorem B.3, we have the following decomposition of the covariance.
Theorem B.4 (Law of total covariance) We have
cov (Y, W ) = E {cov (Y, W | X)} + cov {E(Y | X), E(W | X)} .

B.3 Multivariate Normal and its properties


I use a generative definition of the multivariate Normal random vector. First, Z is a standard Normal random vector if $Z = (Z_1, \ldots, Z_n)^{t}$ has components $Z_i \stackrel{\text{iid}}{\sim} \mathrm{N}(0, 1)$. Given a mean vector µ and a positive semi-definite covariance matrix Σ, define a Normal random vector $Y \sim \mathrm{N}(\mu, \Sigma)$ with mean µ and covariance Σ if Y can be represented as
$$
Y = \mu + AZ, \tag{B.3}
$$
where A satisfies $\Sigma = AA^{t}$. We can verify that $\mathrm{cov}(Y) = \Sigma$, so Σ is indeed its covariance matrix. If $\Sigma \succ 0$, then we can verify that Y has density
$$
f_Y(y) = (2\pi)^{-n/2}\{\det(\Sigma)\}^{-1/2}\exp\{-(y-\mu)^{t}\Sigma^{-1}(y-\mu)/2\}. \tag{B.4}
$$
We can easily verify the following result by calculating the density.
Proposition B.14 If Z ∼ N(0, In ) and Γ is an orthogonal matrix, then ΓZ ∼ N(0, In ).
I do not define multivariate Normal based on the density (B.4) because it is only well
defined with a positive definite Σ. I do not define multivariate Normal based on the char-
acteristic function because it is more advanced than the level of this book. Definition (B.3)
does not require Σ to be positive definite and is more elementary. However, it has a sub-
tle issue of uniqueness. Although the decomposition Σ = AAt is not unique, the resulting
distribution Y = µ + AZ is. We can verify this using the Polar decomposition. Because
A = Σ1/2 Γ where Γ is an orthogonal matrix, we have Y = µ + Σ1/2 ΓZ = µ + Σ1/2 Z̃ where
Z̃ = ΓZ is a standard Normal random vector by Proposition B.14. Importantly, although
the definition (B.3) can be general, we usually use the following representation
Y = µ + Σ1/2 Z.
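The generative definition suggests a direct way to simulate multivariate Normal vectors. The sketch below uses the symmetric square root Σ^{1/2} from the eigendecomposition; the Cholesky factor would work equally well.

set.seed(1)
mu <- c(1, -1)
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
ed <- eigen(Sigma, symmetric = TRUE)
Sig.half <- ed$vectors %*% diag(sqrt(ed$values)) %*% t(ed$vectors)
n <- 1e5
Z <- matrix(rnorm(2 * n), nrow = 2)
Y <- sweep(Sig.half %*% Z, 1, mu, "+")   # each column is one draw of Y
rowMeans(Y)                              # approximately mu
cov(t(Y))                                # approximately Sigma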
Theorem B.5 Assume that
$$
\begin{pmatrix}Y_1\\ Y_2\end{pmatrix} \sim \mathrm{N}\left(\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}, \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\right).
$$
Then $Y_1 \perp\!\!\!\perp Y_2$ if and only if $\Sigma_{12} = 0$.


I leave the proof of Theorem B.5 as Problem 2.4.
Proposition B.15 If Y ∼ N(µ,Σ), then BY + C ∼ N(Bµ + C, BΣB t ), that is, any linear
transformation of a Normal random vector is also a Normal random vector.
Proof of Proposition B.15: By definition, $Y = \mu + \Sigma^{1/2}Z$ where Z is the standard Normal random vector, so
$$
BY + C = B(\mu + \Sigma^{1/2}Z) + C = B\mu + C + B\Sigma^{1/2}Z
\sim \mathrm{N}(B\mu + C, B\Sigma^{1/2}\Sigma^{1/2t}B^{t}) = \mathrm{N}(B\mu + C, B\Sigma B^{t}). \qquad \Box
$$

An obvious corollary of Proposition B.15 is that if X1 ∼ N(µ1 , σ12 ) and X2 ∼ N(µ2 , σ22 )
are independent, then X1 +X2 ∼ N(µ1 +µ2 , σ12 +σ22 ). So the summation of two independent
Normals is also Normal. Remarkably, the reverse of the result is also true.

Theorem B.6 (Levy–Cramer) If X1 X2 and X1 + X2 is Normal, then both X1 and X2


must be Normal.

The statement of Theorem B.6 is extremely simple. But its proof is non-trivial and
beyond the scope of this book. See Benhamou et al. (2018) for a proof.

Theorem B.7 Assume
$$
\begin{pmatrix}Y_1\\ Y_2\end{pmatrix} \sim \mathrm{N}\left(\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}, \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\right).
$$

1. The marginal distributions are Normal:
$$
Y_1 \sim \mathrm{N}(\mu_1, \Sigma_{11}), \qquad Y_2 \sim \mathrm{N}(\mu_2, \Sigma_{22}).
$$

2. If $\Sigma_{22} \succ 0$, then the conditional distribution is Normal:
$$
Y_1 \mid Y_2 = y_2 \sim \mathrm{N}\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right);
$$
moreover, $Y_2$ is independent of the residual
$$
Y_1 - \Sigma_{12}\Sigma_{22}^{-1}(Y_2 - \mu_2) \sim \mathrm{N}\left(\mu_1,\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).
$$

I review some other results of the multivariate Normal below.

Proposition B.16 Assume $Y \sim \mathrm{N}(\mu, \sigma^2 I_n)$. If $AB^{t} = 0$, then $AY \perp\!\!\!\perp BY$.

Proposition B.17 Assume
$$
\begin{pmatrix}Y_1\\ Y_2\end{pmatrix} \sim \mathrm{N}\left(\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}, \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}\right),
$$
where ρ is the correlation coefficient defined as
$$
\rho = \frac{\mathrm{cov}(Y_1, Y_2)}{\sqrt{\mathrm{var}(Y_1)\mathrm{var}(Y_2)}}.
$$
Then the conditional distribution is
$$
Y_1 \mid Y_2 = y_2 \sim \mathrm{N}\left(\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y_2 - \mu_2),\; \sigma_1^2(1 - \rho^2)\right).
$$

B.4 Quadratic forms of random vectors


Given a random vector Y and a symmetric matrix A, we can define the quadratic form
Y t AY , which is a random variable playing an important role in statistics. The first theorem
is about its mean.

Theorem B.8 If Y has mean µ and covariance Σ, then

E(Y t AY ) = trace(AΣ) + µt Aµ.



Proof of Theorem B.8: The proof relies on the following two basic facts.
• $E(YY^{t}) = \mathrm{cov}(Y) + E(Y)E(Y)^{t} = \Sigma + \mu\mu^{t}$.
• For an n × n symmetric random matrix $W = (w_{ij})$, we have $E\{\mathrm{trace}(W)\} = \mathrm{trace}\{E(W)\}$ because $E(\sum_{i=1}^{n} w_{ii}) = \sum_{i=1}^{n} E(w_{ii})$.
The conclusion follows from
$$
E(Y^{t}AY)
= E\{\mathrm{trace}(Y^{t}AY)\}
= E\{\mathrm{trace}(AYY^{t})\}
= \mathrm{trace}\{E(AYY^{t})\}
= \mathrm{trace}\{AE(YY^{t})\}
= \mathrm{trace}\{A(\Sigma + \mu\mu^{t})\}
= \mathrm{trace}(A\Sigma) + \mathrm{trace}(\mu^{t}A\mu)
= \mathrm{trace}(A\Sigma) + \mu^{t}A\mu. \qquad \Box
$$


The variance of the quadratic form is much more complicated for a general random
vector. For the multivariate Normal random vector, we have the following formula.
Theorem B.9 If Y ∼ N(µ, Σ), then

var(Y t AY ) = 2trace(AΣAΣ) + 4µt AΣAµ.

I relegate the proof to Problem 2.10.
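A Monte Carlo sanity check of Theorems B.8 and B.9, with simulated µ, Σ, and A:

set.seed(1)
p <- 3
mu <- c(1, 2, -1)
Sigma <- crossprod(matrix(rnorm(p^2), p, p)) + diag(p)
A <- crossprod(matrix(rnorm(p^2), p, p))
ed <- eigen(Sigma, symmetric = TRUE)
Sig.half <- ed$vectors %*% diag(sqrt(ed$values)) %*% t(ed$vectors)
n <- 1e5
Y <- sweep(Sig.half %*% matrix(rnorm(p * n), p), 1, mu, "+")
q <- colSums(Y * (A %*% Y))                             # Y^t A Y for each draw
c(mean(q), sum(diag(A %*% Sigma)) + c(t(mu) %*% A %*% mu))
c(var(q), 2 * sum(diag(A %*% Sigma %*% A %*% Sigma)) +
    4 * c(t(mu) %*% A %*% Sigma %*% A %*% mu))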


From its definition, χ2n is the summation of the squares of n IID standard Normal random
variables. It is closely related to quadratic forms of multivariate Normals.

Theorem B.10 1. If Y ∼ N(µ, Σ) is an n-dimensional random vector with Σ ≻ 0, then

(Y − µ)t Σ−1 (Y − µ) ∼ χ2n .

If rank(Σ) = k ≤ n, then
(Y − µ)t Σ+ (Y − µ) ∼ χ2k .

2. If Y ∼ N(0, In ) and H is a projection matrix of rank K, then

Y t HY ∼ χ2K .

3. If Y ∼ N(0, H) where H is a projection matrix of rank K, then

Y t Y ∼ χ2K .

Proof of Theorem B.10:

1. I only prove the general result with $\mathrm{rank}(\Sigma) = k \le n$. By definition, $Y = \mu + \Sigma^{1/2}Z$ where Z is a standard Normal random vector, so
$$
(Y - \mu)^{t}\Sigma^{+}(Y - \mu) = Z^{t}\Sigma^{1/2}\Sigma^{+}\Sigma^{1/2}Z = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k.
$$

2. Using the eigendecomposition of the projection matrix
$$
H = P\,\mathrm{diag}\{1, \ldots, 1, 0, \ldots, 0\}P^{t}
$$
with K ones in the diagonal matrix, we have
$$
Y^{t}HY = Y^{t}P\,\mathrm{diag}\{1, \ldots, 1, 0, \ldots, 0\}P^{t}Y = Z^{t}\,\mathrm{diag}\{1, \ldots, 1, 0, \ldots, 0\}Z,
$$
where $Z = (Z_1, \ldots, Z_n)^{t} = P^{t}Y \sim \mathrm{N}(0, P^{t}P) = \mathrm{N}(0, I_n)$ is a standard Normal random vector. So
$$
Y^{t}HY = \sum_{i=1}^{K} Z_i^2 \sim \chi^2_K.
$$

3. Writing $Y = H^{1/2}Z$ where Z is a standard Normal random vector, we have
$$
Y^{t}Y = Z^{t}H^{1/2}H^{1/2}Z = Z^{t}HZ \sim \chi^2_K
$$
using the second result. □


B.5 Homework problems


2.1 Beta-Gamma duality
Prove Theorem B.1.
Hint: Calculate the joint density of (X + Y, X/(X + Y )).

2.2 Gamma and Beta moments


Prove Propositions B.2–B.5.

2.3 Maximums of Gumbels


Prove Proposition B.9.

2.4 Independence and uncorrelatedness in the multivariate Normal


Prove Theorem B.5.

2.5 Transformation of bivariate Normal


Prove that if $(Y_1, Y_2)^{t}$ follows a bivariate Normal distribution
$$
\begin{pmatrix}Y_1\\ Y_2\end{pmatrix} \sim \mathrm{N}\left(\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\right),
$$
then
$$
Y_1 + Y_2 \perp\!\!\!\perp Y_1 - Y_2.
$$
Remark: This result holds for arbitrary ρ.

2.6 Normal conditional distributions


Suppose that $(X_1, X_2)$ has the joint density
$$
f_{X_1X_2}(x_1, x_2) \propto C_0\exp\left\{-\frac{1}{2}\left(Ax_1^2x_2^2 + x_1^2 + x_2^2 - 2Bx_1x_2 - 2C_1x_1 - 2C_2x_2\right)\right\},
$$
where $C_0$ is the normalizing constant depending on $(A, B, C_1, C_2)$. To ensure that this is a well-defined density, we need $A \ge 0$, and if $A = 0$ then $|B| < 1$. Prove that the conditional distributions are
$$
X_1 \mid X_2 = x_2 \sim \mathrm{N}\left(\frac{Bx_2 + C_1}{Ax_2^2 + 1}, \frac{1}{Ax_2^2 + 1}\right), \qquad
X_2 \mid X_1 = x_1 \sim \mathrm{N}\left(\frac{Bx_1 + C_2}{Ax_1^2 + 1}, \frac{1}{Ax_1^2 + 1}\right).
$$

Remark: For a bivariate Normal distribution, the two conditional distributions are both
Normal. The converse of the statement is not true. That is, even if the two conditional
distributions are both Normal, the joint distribution may not be bivariate Normal. Gelman
and Meng (1991) reported this interesting result.

2.7 Inverse of covariance matrix and conditional independence in multivariate Normal


Assume $X = (X_1, \ldots, X_p)^{t} \sim \mathrm{N}(\mu, \Sigma)$. Denote the inverse of its covariance matrix by $\Sigma^{-1} = (\sigma^{jk})_{1\le j,k\le p}$. Show that for any pair $j \neq k$, we have
$$
\sigma^{jk} = 0 \iff X_j \perp\!\!\!\perp X_k \mid X_{\setminus(j,k)},
$$
where $X_{\setminus(j,k)}$ contains all the variables except $X_j$ and $X_k$.

2.8 Independence of linear and quadratic functions of the multivariate Normal


Assume $Y \sim \mathrm{N}(\mu, \sigma^2 I_n)$. For an n-dimensional vector a and two n × n symmetric matrices A and B, show that
1. if $Aa = 0$, then $a^{t}Y \perp\!\!\!\perp Y^{t}AY$;
2. if $AB = BA = 0$, then $Y^{t}AY \perp\!\!\!\perp Y^{t}BY$.
Hint: To simplify the proof, you can use the pseudoinverse of A, which satisfies $AA^{+}A = A$. In fact, a stronger result holds. Ogasawara and Takahashi (1951) proved the following theorem; see also Styan (1970, Theorem 5).

Theorem B.11 Assume Y ∼ N(µ, Σ). Define quadratic forms Y t AY and Y t BY for two
symmetric matrices A and B. The Y t AY and Y t BY are independent if and only if

ΣAΣBΣ = 0, ΣAΣBµ = ΣBΣAµ = 0, µt AΣBµ = 0.

2.9 Independence of the sample mean and variance of IID Normals


If $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{N}(\mu, \sigma^2)$, then $\bar{X} \perp\!\!\!\perp S^2$, where $\bar{X} = n^{-1}\sum_{i=1}^{n} X_i$ and $S^2 = (n-1)^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$.
Remark: A remarkable result due to Geary (1936) ensures the converse of the above result. That is, if $X_1, \ldots, X_n$ are IID and $\bar{X} \perp\!\!\!\perp S^2$, then $X_1, \ldots, X_n$ must be Normal. See Lukacs (1942) and Benhamou et al. (2018) for proofs.

2.10 Variance of the quadratic form of the multivariate Normal


Prove Theorem B.9. Use it to further prove that if Y ∼ N(µ, Σ), then

cov(Y t A1 Y, Y t A2 Y ) = 2trace(A1 ΣA2 Σ) + 4µt A1 ΣA2 µ.

Hint: Write Y = µ + Σ1/2 Z and reduce the problem to calculating the moments of
standard Normals.
C
Limiting Theorems and Basic Asymptotics

This chapter reviews the basics of limiting theorems and asymptotic analyses that are useful
for this book. See Newey and McFadden (1994) and Van der Vaart (2000) for in-depth
discussions.

C.1 Convergence in probability


Definition C.2 Random vectors Zn ∈ RK converge to Z in probability, denoted by Zn → Z
in probability, if
pr {∥Zn − Z∥ > c} → 0, n → ∞
for all c > 0.

This definition incorporates the classic definition of convergence of non-random vectors:

Proposition C.1 If non-random vectors Zn → Z, the convergence also holds in probability.


Convergence in probability for random vectors is equivalent to element-wise convergence
because of the following result:
Proposition C.2 If Zn → Z and Wn → W in probability, then (Zn , Wn ) → (Z, W ) in
probability.

The above proposition does not require any conditions on the joint distribution of
(Zn , Wn ).
For an IID sequence of random vectors, we have the following weak law of large numbers:
Proposition C.3 (Khintchine’s weak law of large numbers) If Z1, . . . , Zn are IID with mean µ ∈ R^K, then n⁻¹ Σ_{i=1}^n Zi → µ in probability.
A more elementary tool is Markov’s inequality:

pr {∥Zn − Z∥ > c} ≤ E{∥Zn − Z∥}/c    (C.1)

or

pr {∥Zn − Z∥ > c} ≤ E{∥Zn − Z∥²}/c².    (C.2)

Inequality (C.1) is useful if E{∥Zn − Z∥} converges to zero, and inequality (C.2) is useful if E{∥Zn − Z∥²} converges to zero. The latter gives a standard tool for establishing convergence in probability by showing that the covariance matrix converges to zero.

Proposition C.4 If random vectors Zn ∈ RK have mean zero and covariance cov(Zn ) =
an Cn where an → 0 and Cn → C < ∞, then Zn → 0 in probability.


Proof of Proposition C.4: Using (C.2), we have

pr {∥Zn∥ > c} ≤ c⁻² E(∥Zn∥²)
             = c⁻² E(ZnᵗZn)
             = c⁻² trace{E(ZnZnᵗ)}
             = c⁻² trace{cov(Zn)}
             = c⁻² a_n trace(C_n) → 0,

which implies that Zn → 0 in probability. □


For example, we usually use Proposition C.4 to show the weak law of large numbers for the sample mean of independent random variables Z̄n = n⁻¹ Σ_{i=1}^n Zi. If we can show that

cov(Z̄n) = n⁻² Σ_{i=1}^n cov(Zi) → 0,    (C.3)

then we can conclude that Z̄n − n⁻¹ Σ_{i=1}^n E(Zi) → 0 in probability. The condition in (C.3) holds if n⁻¹ Σ_{i=1}^n cov(Zi) converges to a constant matrix.
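The following R simulation is a minimal sketch (added here for illustration, not part of the original text): independent observations with a common mean but different variances satisfy condition (C.3), and the sample mean indeed concentrates around the common mean as n grows.

set.seed(1)
for (n in c(100, 1000, 10000)) {
  sigma <- sqrt(1:n %% 5 + 1)          # heterogeneous standard deviations
  z <- rnorm(n, mean = 2, sd = sigma)  # independent, non-IID, common mean 2
  cat("n =", n, " sample mean =", round(mean(z), 3), "\n")
}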
Note that convergence in probability does not imply convergence of moments in general. The following proposition gives a sufficient condition.

Proposition C.5 (dominated convergence theorem) If Zn → Z in probability and ∥Zn∥ ≤ ∥Y∥ with E∥Y∥ < ∞, then E(Zn) → E(Z).

C.2 Convergence in distribution


Definition C.3 Random vectors Zn ∈ R^K converge to Z in distribution if, for every continuity point z of the function t ↦ pr(Z ≤ t), we have

pr(Zn ≤ z) → pr(Z ≤ z), n → ∞.

When the limit is a constant, we have an equivalence of convergences in probability and


distribution:

Proposition C.6 If c is a non-random vector, then Zn → c in probability is equivalent to


Zn → c in distribution.

For IID sequences of random vectors, we have the Lindeberg–Lévy central limit theorem
(CLT):
Proposition C.7 (Lindeberg–Lévy CLT) If random vectors Z1, . . . , Zn are IID with mean µ and covariance Σ, then n^{1/2}(Z̄n − µ) = n^{−1/2} Σ_{i=1}^n (Zi − µ) → N(0, Σ) in distribution.
The more general Lindeberg–Feller CLT holds for independent sequences of random
vectors:
Proposition C.8 For each n, let Z_{n1}, . . . , Z_{n,k_n} be independent random vectors with finite variances such that

(LF1) Σ_{i=1}^{k_n} E{∥Z_{ni}∥² 1{∥Z_{ni}∥ > c}} → 0 for every c > 0;

(LF2) Σ_{i=1}^{k_n} cov(Z_{ni}) → Σ.

Then Σ_{i=1}^{k_n} {Z_{ni} − E(Z_{ni})} → N(0, Σ) in distribution.

Condition (LF2) often holds by proper standardization, and the key is to verify Condition (LF1). Condition (LF1) is general, but it looks cumbersome. In many cases, we impose a stronger moment condition that is easier to verify:

(LF1') Σ_{i=1}^{k_n} E∥Z_{ni}∥^{2+δ} → 0 for some δ > 0.

We can show that (LF1') implies (LF1):

Σ_{i=1}^{k_n} E{∥Z_{ni}∥² 1{∥Z_{ni}∥ > c}} = Σ_{i=1}^{k_n} E{∥Z_{ni}∥^{2+δ} ∥Z_{ni}∥^{−δ} 1{∥Z_{ni}∥^δ > c^δ}}
                                           ≤ Σ_{i=1}^{k_n} E∥Z_{ni}∥^{2+δ} c^{−δ} → 0.

Condition (LF1’) is called the Lyapunov condition.
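As a numerical illustration (an added sketch, not from the text), the following R code simulates standardized sums of independent but non-identically distributed Exponential variables; their empirical quantiles are close to the standard Normal quantiles, as the Lindeberg–Feller CLT predicts.

set.seed(2)
n <- 500
rates <- 1 / (1:n %% 10 + 1)                       # Exponential rates varying across i
std_sum <- function() {
  x <- rexp(n, rate = rates)                       # independent, non-IID
  sum(x - 1 / rates) / sqrt(sum(1 / rates^2))      # center and standardize the sum
}
sims <- replicate(5000, std_sum())
round(quantile(sims, c(0.025, 0.5, 0.975)), 2)     # compare with -1.96, 0, 1.96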


A beautiful application of the Lindeberg–Feller CLT is the proof of Huber (1973)’s
theorem on OLS mentioned in Chapter 11. I first review the theorem and then give a proof.

Theorem C.1 Assume Y = Xβ + ε, where the covariates are fixed and the error terms ε = (ε1, . . . , εn)ᵗ are IID non-Normal with mean zero and finite variance σ². Recall the OLS estimator β̂ = (XᵗX)⁻¹XᵗY. Any linear combination of β̂ is asymptotically Normal if and only if

max_{1≤i≤n} hii → 0,

where hii is the ith diagonal element of the hat matrix H = X(XᵗX)⁻¹Xᵗ.

In the main text, hii is called the leverage score of unit i. The maximum leverage score

κ = max_{1≤i≤n} hii

plays an important role in analyzing the properties of OLS.
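In R, the leverage scores and their maximum can be computed directly; the following minimal sketch (added here, not from the text) builds the hat matrix from a simulated design and checks it against hatvalues() from a fitted lm object.

set.seed(3)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))       # fixed design with intercept
H <- X %*% solve(t(X) %*% X) %*% t(X)                      # hat matrix
kappa_max <- max(diag(H))                                  # maximum leverage score
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)
fit <- lm(y ~ X - 1)
all.equal(unname(diag(H)), unname(hatvalues(fit)))         # leverage scores agree
kappa_max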


Theorem C.1 assumes that the errors are not Normal because asymptotic Normality under Normal errors is trivial (see Chapter 5). It is slightly different from the asymptotic analysis in Chapter 6. Theorem C.1 concerns only a single linear combination of the OLS estimator at a time, whereas the results in Chapter 6 allow for joint inference on several linear combinations of the OLS estimator. An implicit assumption of Chapter 6 is that the dimension p of the covariate matrix is fixed, but Theorem C.1 allows for a diverging p; the leverage score condition implicitly restricts the dimension and the moments of the covariates. Another interesting feature of Theorem C.1 is that the statement is coordinate-free, that is, it holds up to a non-singular transformation of the covariates (see also Problems 3.4 and 3.5). The proof of sufficiency follows Huber (1973) closely, and the proof of necessity was suggested by Professor Peter Bickel.
Proof of Theorem C.1: I first simplify the notation without essential loss of generality. By the invariance of the OLS estimator and the hat matrix in Problems 3.4 and 3.5, we can also assume XᵗX = I_p. So

β̂ − β = (XᵗX)⁻¹Xᵗε = Xᵗε

and the hat matrix

H = X(XᵗX)⁻¹Xᵗ = XXᵗ

has diagonal elements hii = xᵢᵗxᵢ = ∥xᵢ∥² and non-diagonal elements hij = xᵢᵗxⱼ. We can also assume σ² = 1.
Consider a fixed vector a ∈ R^p and assume ∥a∥² = 1. We have

aᵗβ̂ − aᵗβ = aᵗXᵗε ≡ sᵗε,

where

s = Xa ⇐⇒ sᵢ = xᵢᵗa  (i = 1, . . . , n)

satisfies

∥s∥² = aᵗXᵗXa = ∥a∥² = 1

and

sᵢ² = (xᵢᵗa)² ≤ ∥xᵢ∥²∥a∥² = ∥xᵢ∥² = hii

by the Cauchy–Schwarz inequality.
I first prove the sufficiency. The key term aᵗβ̂ − aᵗβ is a linear combination of the IID errors, and it has mean 0 and variance var(sᵗε) = ∥s∥² = 1. We only need to verify Condition (LF1) to establish the CLT. It holds because for any fixed c > 0, we have

Σ_{i=1}^n E{sᵢ²εᵢ² 1{|sᵢεᵢ| > c}} ≤ Σ_{i=1}^n sᵢ² max_{1≤i≤n} E{εᵢ² 1{|sᵢεᵢ| > c}}    (C.4)
                                  = max_{1≤i≤n} E{εᵢ² 1{|sᵢεᵢ| > c}}    (C.5)
                                  ≤ E[εᵢ² 1{κ^{1/2}|εᵢ| > c}]    (C.6)
                                  → 0,    (C.7)

where (C.4) follows from the property of the maximum, (C.5) follows from the fact that ∥s∥² = 1, (C.6) follows from the fact that |sᵢ| ≤ hii^{1/2} ≤ κ^{1/2} and the εᵢ are IID, and (C.7) follows from κ → 0 and the dominated convergence theorem in Proposition C.5.
I then prove the necessity. Pick one i* from arg max_{1≤i≤n} hii. Consider a special linear combination of the OLS estimator: ŷ_{i*} = x_{i*}ᵗβ̂, which is the fitted value of the i*th observation and has the form

ŷ_{i*} = x_{i*}ᵗβ̂ = x_{i*}ᵗXᵗε = Σ_{j=1}^n x_{i*}ᵗxⱼ εⱼ = h_{i*i*} ε_{i*} + Σ_{j≠i*} h_{i*j} εⱼ.

If ŷ_{i*} is asymptotically Normal, then both h_{i*i*} ε_{i*} and Σ_{j≠i*} h_{i*j} εⱼ must have Normal limiting distributions by Theorem B.6. Therefore, h_{i*i*} must converge to zero because ε_{i*} has a non-Normal distribution. So max_{1≤i≤n} hii must converge to zero. □

C.3 Tools for proving convergence in probability and distribution


The first tool is the continuous mapping theorem:

Proposition C.9 Let f : R^K → R^L be continuous except on a set O with pr(Z ∈ O) = 0. Then Zn → Z in probability implies f(Zn) → f(Z) in probability, and Zn → Z in distribution implies f(Zn) → f(Z) in distribution.
The second tool is Slutsky’s Theorem:
Proposition C.10 Let Zn and Wn be random vectors. If Zn → Z in distribution, and
Wn → c in probability (or in distribution) for a constant c, then

1. Zn + Wn → Z + c in distribution;
2. Wn Zn → cZ in distribution;
3. Wn−1 Zn → c−1 Z in distribution if c ̸= 0.
The third tool is the delta method. I will present a special case below for asymptotically Normal random vectors. Heuristically, it states that if Zn is asymptotically Normal, then any smooth function of Zn is also asymptotically Normal. This is true because a differentiable function is locally linear by the first-order Taylor expansion.

Proposition C.11 Let f(z) be a function from R^p to R^q, and let ∂f(z)/∂z ∈ R^{p×q} be the partial derivative matrix. If √n(Zn − θ) → N(µ, Σ) in distribution, then

√n{f(Zn) − f(θ)} → N( (∂f(θ)/∂zᵗ) µ, (∂f(θ)/∂zᵗ) Σ (∂f(θ)/∂z) )

in distribution.

Proof of Proposition C.11: I will give an informal proof. Using a Taylor expansion, we have

√n{f(Zn) − f(θ)} ≅ (∂f(θ)/∂zᵗ) √n(Zn − θ),

which is a linear transformation of √n(Zn − θ). Because √n(Zn − θ) → N(µ, Σ) in distribution, we have

√n{f(Zn) − f(θ)} → (∂f(θ)/∂zᵗ) N(µ, Σ) = N( (∂f(θ)/∂zᵗ) µ, (∂f(θ)/∂zᵗ) Σ (∂f(θ)/∂z) )

in distribution. □
Proposition C.11 above is more useful when ∂f(θ)/∂z ≠ 0. Otherwise, we need to invoke a higher-order Taylor expansion to obtain a more accurate asymptotic approximation.
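The following R sketch (added for illustration, not part of the text) checks the delta method numerically for f(z) = log z applied to a sample mean: if √n(x̄ − µ) → N(0, σ²), then √n(log x̄ − log µ) → N(0, σ²/µ²).

set.seed(4)
mu <- 5; sigma <- 2; n <- 200
sims <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  sqrt(n) * (log(mean(x)) - log(mu))               # sqrt(n) {f(Z_n) - f(theta)}
})
c(simulated_var = var(sims), delta_var = sigma^2 / mu^2)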
D M-Estimation and MLE

A wide range of statistical estimation problems can be formulated as solving an estimating equation:

m̄(W, b) = n⁻¹ Σ_{i=1}^n m(wᵢ, b) = 0,

where m(·, ·) is a vector function with the same dimension as b, and W = {wᵢ}_{i=1}^n are the observed data. Let β̂ denote the solution, which is an estimator of β. Under mild regularity conditions, β̂ is consistent and asymptotically Normal.¹ This is the classical theory of M-estimation. I will review it below. See Stefanski and Boos (2002) for a reader-friendly introduction that contains many interesting and important examples. The proofs below are not rigorous. See Newey and McFadden (1994) for the rigorous ones.

¹ There are counterexamples in which β̂ is inconsistent; see Freedman and Diaconis (1982). The examples in this book are all regular.

D.1 M-estimation
I start with the simple case with IID data.
Theorem D.1 Assume that W = {wᵢ}_{i=1}^n are IID with the same distribution as w. The true parameter β ∈ R^p is the unique solution of

E{m(w, b)} = 0,

and the estimator β̂ ∈ R^p is the solution of

m̄(W, b) = 0.

Under some regularity conditions,

√n(β̂ − β) → N(0, B⁻¹MB⁻ᵗ)

in distribution, where

B = −∂E{m(w, β)}/∂bᵗ,  M = E{m(w, β)m(w, β)ᵗ}.
Proof of Theorem D.1: I give a “physics” proof. When I use approximations, I mean the error terms are of higher order under some regularity conditions. The consistency follows from swapping the order of “solving the equation” and “taking the limit based on the law of large numbers”:

lim_{n→∞} β̂ = lim_{n→∞} {solve m̄(W, b) = 0}
             = solve{ lim_{n→∞} m̄(W, b) = 0 }
             = solve[ E{m(w, b)} = 0 ]
             = β.

The asymptotic Normality follows from three steps. First, from the Taylor expansion

0 = m̄(W, β̂) ≅ m̄(W, β) + {∂m̄(W, β)/∂bᵗ}(β̂ − β),

we obtain

√n(β̂ − β) ≅ {−∂m̄(W, β)/∂bᵗ}⁻¹ { n^{−1/2} Σ_{i=1}^n m(wᵢ, β) }.

Second, the law of large numbers ensures that

−∂m̄(W, β)/∂bᵗ → −∂E{m(w, β)}/∂bᵗ = B

in probability, and the CLT ensures that

n^{−1/2} Σ_{i=1}^n m(wᵢ, β) → N(0, M)

in distribution. Finally, Slutsky’s theorem implies the result. □


The above result also holds with independent but non-IID data.

Theorem D.2 Assume that {wᵢ}_{i=1}^n are independent observations. The true parameter β ∈ R^p is the unique solution to

E{m̄(W, b)} = 0,

and the estimator β̂ ∈ R^p is the solution to

m̄(W, b) = 0.

Under some regularity conditions,

√n(β̂ − β) → N(0, B⁻¹MB⁻ᵗ)

in distribution, where

B = −lim_{n→∞} n⁻¹ Σ_{i=1}^n ∂E{m(wᵢ, β)}/∂bᵗ,  M = lim_{n→∞} n⁻¹ Σ_{i=1}^n cov{m(wᵢ, β)}.

For both cases above, we can further construct the following sandwich covariance estimator:

( Σ_{i=1}^n ∂m(wᵢ, β̂)/∂bᵗ )⁻¹ ( Σ_{i=1}^n m(wᵢ, β̂)m(wᵢ, β̂)ᵗ ) ( Σ_{i=1}^n ∂m(wᵢ, β̂)/∂b )⁻¹.

For non-IID data, the above covariance estimator can be conservative unless E{m(wᵢ, β)} = 0 for all i = 1, . . . , n.
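As a concrete illustration of this recipe (an added sketch, not taken from the text, with the Huber location estimator chosen purely as an example), the following R code solves a one-dimensional estimating equation and computes the sandwich standard error; the tuning constant c = 1.345 is a conventional but arbitrary choice.

set.seed(5)
n <- 500
w <- rt(n, df = 3) + 1                       # heavy-tailed data centered at 1
cc <- 1.345
psi <- function(u) pmax(-cc, pmin(cc, u))    # m(w, b) = psi(w - b)
m_bar <- function(b) mean(psi(w - b))        # sample estimating function
beta_hat <- uniroot(m_bar, interval = range(w))$root
B_hat <- mean(abs(w - beta_hat) < cc)        # estimate of B = -E{d m(w, b)/db}
M_hat <- mean(psi(w - beta_hat)^2)           # estimate of M = E{m(w, beta)^2}
se <- sqrt(M_hat / (n * B_hat^2))            # sandwich: B^-1 M B^-1 / n
c(beta_hat = beta_hat, se = se)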
Example D.1 Assume that x1, . . . , xn are IID draws from x with mean µ. The sample mean x̄ solves the estimating equation

n⁻¹ Σ_{i=1}^n (xᵢ − µ) = 0.

Apply Theorem D.1 to obtain B = 1 and M = σ², which imply

√n(x̄ − µ) → N(0, σ²)

in distribution, the standard CLT for the sample mean. Moreover, the sandwich covariance estimator is

V̂ = n⁻² Σ_{i=1}^n (xᵢ − x̄)²,

which equals the sample variance of x multiplied by (n − 1)/n² ≈ 1/n. This is a standard result.

If we only assume that x1, . . . , xn are independent with the same mean µ but possibly different variances σᵢ² (i = 1, . . . , n), the sample mean x̄ is still a reasonable estimator for µ, which solves the same estimating equation above. Moreover, the sandwich covariance estimator V̂ is still a consistent estimator for the true variance of x̄. This is less standard.

If we assume that xᵢ ∼ [µᵢ, σᵢ²] are independent, we can still use x̄ to estimate µ = n⁻¹ Σ_{i=1}^n µᵢ. The estimating equation remains the same as above. The sandwich covariance estimator V̂ becomes conservative since E(xᵢ − µ) ≠ 0 in general.
Problem 4.1 gives more details.
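A short simulation (an added sketch, not from the text) illustrates the second claim: with independent observations sharing a common mean but with different variances, the sandwich estimator V̂ still approximates the true variance of x̄.

set.seed(6)
n <- 1000
sigma2 <- runif(n, 0.5, 4)                          # heterogeneous variances
true_var <- sum(sigma2) / n^2                       # var(x_bar)
V_hat <- replicate(2000, {
  x <- rnorm(n, mean = 3, sd = sqrt(sigma2))
  sum((x - mean(x))^2) / n^2
})
c(mean_V_hat = mean(V_hat), true_var_xbar = true_var)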
Example D.2 Assume that x1, . . . , xn are IID draws from x with mean µ and variance σ². The sample mean and variance (x̄, σ̂²) jointly solve the estimating equation with

m(x, µ, σ²) = ( x − µ, (x − µ)² − σ² )ᵗ,

ignoring the difference between n and n − 1 in the definition of the sample variance. Apply Theorem D.1 to obtain

B = diag(1, 1),  M = [ σ²        µ₃
                       µ₃   µ₄ − σ⁴ ],

where µ_k = E{(x − µ)^k}, which imply

√n ( x̄ − µ, σ̂² − σ² )ᵗ → N(0, M)

in distribution.

Example D.3 Assume that (xᵢ, yᵢ)_{i=1}^n are IID draws from (x, y) with mean (µ_x, µ_y). Use x̄/ȳ to estimate γ = µ_x/µ_y. It satisfies the estimating equation with

m(x, y, γ) = x − γy.

Apply Theorem D.1 to obtain B = µ_y and M = var(x − γy), which imply

√n ( x̄/ȳ − µ_x/µ_y ) → N( 0, var(x − γy)/µ_y² )

in distribution if µ_y ≠ 0.
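In practice, the asymptotic variance can be estimated by plugging in sample analogs; the following R lines (an added sketch, not from the text) compute the ratio estimate, a plug-in standard error, and a Wald-type 95% confidence interval.

set.seed(7)
n <- 400
y <- rgamma(n, shape = 4, rate = 1)                 # mu_y = 4, so gamma = mu_x/mu_y = 2
x <- 2 * y + rnorm(n)
gamma_hat <- mean(x) / mean(y)
se <- sqrt(var(x - gamma_hat * y) / (n * mean(y)^2))   # plug-in for var(x - gamma y)/(n mu_y^2)
c(estimate = gamma_hat, se = se,
  lower = gamma_hat - 1.96 * se, upper = gamma_hat + 1.96 * se)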

D.2 Maximum likelihood estimator


As an important application of Theorem D.1, we can derive the asymptotic properties of the maximum likelihood estimator (MLE) θ̂ under IID sampling from a parametric model: y1, . . . , yn are IID from f(y | θ).

The MLE solves the estimating equation with m(y, θ) = ∂ log f(y | θ)/∂θ, and the true parameter satisfies

E{ ∂ log f(y | θ)/∂θ } = 0,    (D.1)

which is Bartlett’s first identity. Under regularity conditions, √n(θ̂ − θ) converges in distribution to a Normal with mean zero and covariance B⁻¹MB⁻¹, where

B = −(∂/∂θᵗ) E{ ∂ log f(y | θ)/∂θ } = E{ −∂² log f(y | θ)/∂θ∂θᵗ }

is called the Fisher information matrix, denoted by I(θ), and

M = E{ (∂ log f(y | θ)/∂θ)(∂ log f(y | θ)/∂θᵗ) }

is sometimes also called the Fisher information matrix, denoted by J(θ).


If the model is correct, Bartlett’s second identity ensures that

I(θ) = J(θ),    (D.2)

and therefore √n(θ̂ − θ) converges in distribution to a Normal with mean zero and covariance I(θ)⁻¹ = J(θ)⁻¹. So a covariance matrix estimator for the MLE is In(θ̂)⁻¹ or Jn(θ̂)⁻¹, where

In(θ̂) = −Σ_{i=1}^n ∂² log f(yᵢ | θ̂)/∂θ∂θᵗ

and

Jn(θ̂) = Σ_{i=1}^n {∂ log f(yᵢ | θ̂)/∂θ}{∂ log f(yᵢ | θ̂)/∂θᵗ}.
Fisher (1925b) pioneered the asymptotic theory of the MLE under correctly specified models.

If the model is incorrect, I(θ) can differ from J(θ), but the sandwich covariance B⁻¹MB⁻¹ still holds. So a covariance matrix estimator for the MLE under misspecification is

In(θ̂)⁻¹ Jn(θ̂) In(θ̂)⁻¹.

Huber (1967) studied the asymptotic theory of the MLE under possibly misspecified models. He focused on the case with IID observations and pioneered the sandwich covariance formula.
Perhaps a more important question is what the parameter is if the model is misspecified. The population analog of the MLE is the minimizer of

−E{log f(y | θ)},

where the expectation is over the true but unknown distribution y ∼ g(y). We can rewrite the population objective function as

−∫ g(y) log f(y | θ) dy = ∫ g(y) log{g(y)/f(y | θ)} dy − ∫ g(y) log g(y) dy.

The first term is called the Kullback–Leibler divergence, or relative entropy, of g(y) and f(y | θ), whereas the second term is called the entropy of g(y). The first term depends on θ whereas the second term does not. Therefore, the targeted parameter of the MLE is the minimizer of the Kullback–Leibler divergence. By Gibbs’ inequality, the Kullback–Leibler divergence is non-negative in general and equals 0 if g(y) = f(y | θ). Therefore, if the model is correct, then the true θ indeed minimizes the Kullback–Leibler divergence with minimum value 0.
Example D.4 Assume that y1, . . . , yn are IID N(µ, 1). The log-likelihood contributed by unit i is

log f(yᵢ | µ) = −(1/2) log(2π) − (1/2)(yᵢ − µ)²,

so

∂ log f(yᵢ | µ)/∂µ = yᵢ − µ,  ∂² log f(yᵢ | µ)/∂µ² = −1.

The MLE is µ̂ = ȳ. If the model is correctly specified, we can use

In(µ̂)⁻¹ = n⁻¹  or  Jn(µ̂)⁻¹ = 1/Σ_{i=1}^n (yᵢ − µ̂)²

to estimate the variance of µ̂. If the model is misspecified, we can use

In(µ̂)⁻¹ Jn(µ̂) In(µ̂)⁻¹ = Σ_{i=1}^n (yᵢ − µ̂)²/n²

to estimate the variance of µ̂.

The sandwich variance estimator seems the best overall. The Normal model can be totally wrong, but it is still meaningful to estimate the mean parameter µ = E(y). The MLE is just the sample moment estimator, which has variance var(y)/n. Since the sample variance s² = Σ_{i=1}^n (yᵢ − µ̂)²/(n − 1) is unbiased for var(y), a natural unbiased estimator for var(µ̂) is s²/n, which is close to the sandwich variance estimator.
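The following R snippet (an added sketch, not from the text) compares the three variance estimates for µ̂ = ȳ when the data are Exponential with variance 4, so the N(µ, 1) model is misspecified; only the sandwich estimator (and s²/n) is close to the true variance var(y)/n.

set.seed(8)
n <- 300
y <- rexp(n, rate = 1/2)                        # true var(y) = 4, so N(mu, 1) is wrong
mu_hat <- mean(y)
I_inv <- 1 / n                                  # model-based estimate, assumes variance 1
J_inv <- 1 / sum((y - mu_hat)^2)
sand  <- sum((y - mu_hat)^2) / n^2              # sandwich estimate, approx var(y)/n
c(I_inv = I_inv, J_inv = J_inv, sandwich = sand, s2_over_n = var(y) / n, true = 4 / n)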
The above discussion extends to the case with independent but non-IID data. The
covariance estimators still apply by replacing each f by fi within the summation. Note
that the sandwich covariance estimator is conservative in general. White (1982) pioneered
the asymptotic analysis of the MLE with misspecified models in econometrics but made a
mistake for the M term. Chow (1984) corrected his error, and Abadie et al. (2014) developed
a more general theory.
A leading application is the MLE under a misspecified Normal linear model. The EHW
robust covariance arises naturally in this case.
Example D.5 The Normal linear model has individual log-likelihood

lᵢ = −(1/2) log(2πσ²) − {1/(2σ²)}(yᵢ − xᵢᵗβ)²,  (i = 1, . . . , n),

with the simplification lᵢ = log f(yᵢ | xᵢ, β, σ²). So the first-order derivatives are

∂lᵢ/∂β = σ⁻² xᵢ(yᵢ − xᵢᵗβ),  ∂lᵢ/∂σ² = −1/(2σ²) + (yᵢ − xᵢᵗβ)²/{2(σ²)²};

the second-order derivatives are

∂²lᵢ/∂β∂βᵗ = −σ⁻² xᵢxᵢᵗ,  ∂²lᵢ/∂(σ²)² = 1/{2(σ²)²} − (yᵢ − xᵢᵗβ)²/(σ²)³,

and

∂²lᵢ/∂β∂σ² = −(σ²)⁻² xᵢ(yᵢ − xᵢᵗβ).

The MLE of β is the OLS estimator β̂, and the MLE of σ² is σ̃² = Σ_{i=1}^n ε̂ᵢ²/n, where ε̂ᵢ = yᵢ − xᵢᵗβ̂ is the residual. We have

In(β̂, σ̃²) = diag( σ̃⁻² Σ_{i=1}^n xᵢxᵢᵗ,  n/{2(σ̃²)²} )

and

Jn(β̂, σ̃²) = [ (σ̃²)⁻² Σ_{i=1}^n ε̂ᵢ² xᵢxᵢᵗ   *
               *                             * ],

where the * terms do not matter for the later calculations. If the Normal linear model is correctly specified, we can use the (1, 1)th block of In(β̂, σ̃²)⁻¹ as the covariance estimator for β̂, which equals

σ̃² ( Σ_{i=1}^n xᵢxᵢᵗ )⁻¹.

If the Normal linear model is misspecified, we can use the (1, 1)th block of In(β̂, σ̃²)⁻¹ Jn(β̂, σ̃²) In(β̂, σ̃²)⁻¹ as the covariance estimator for β̂, which equals

( Σ_{i=1}^n xᵢxᵢᵗ )⁻¹ ( Σ_{i=1}^n ε̂ᵢ² xᵢxᵢᵗ ) ( Σ_{i=1}^n xᵢxᵢᵗ )⁻¹,

the EHW robust covariance estimator introduced in Chapter 6.
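To connect this with software (an added sketch, not part of the text), the following R code forms the EHW covariance matrix from the formula above and, assuming the sandwich package is installed, verifies that it coincides with vcovHC(fit, type = "HC0").

set.seed(9)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = abs(x) + 0.5)    # heteroskedastic errors
fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)
bread <- solve(t(X) %*% X)
ehw <- bread %*% (t(X) %*% diag(e^2) %*% X) %*% bread   # EHW (HC0) covariance
ehw
all.equal(ehw, sandwich::vcovHC(fit, type = "HC0"), check.attributes = FALSE)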

D.3 Homework problems


4.1 Estimating the mean
Example D.1 concerns the asymptotic properties. This problem supplements it with more
finite-sample results.
Slightly modify the sandwich covariance estimator to

Ṽ = {n(n − 1)}⁻¹ Σ_{i=1}^n (xᵢ − x̄)².

Show that E(Ṽ) = var(x̄) when x1, . . . , xn are independent with the same mean µ, and E(Ṽ) ≥ var(x̄) when xᵢ ∼ [µᵢ, σᵢ²] are independent.

4.2 Sample Pearson correlation coefficient


Assume that (xᵢ, yᵢ)_{i=1}^n are IID draws from (x, y) with means (µ_x, µ_y) and finite fourth moments. Derive the asymptotic distribution of the sample Pearson correlation coefficient. Express the asymptotic variance in terms of

µ_{kl} = E{(x − µ_x)^k (y − µ_y)^l},

for example,

var(x) = µ₂₀,  var(y) = µ₀₂,  cov(x, y) = µ₁₁,  ρ = µ₁₁/√(µ₂₀µ₀₂).

Hint: Use the fact that β = (µ_x, µ_y, µ₂₀, µ₀₂, ρ)ᵗ satisfies the estimating equation with

m(x, y, β) = ( x − µ_x,
               y − µ_y,
               (x − µ_x)² − µ₂₀,
               (y − µ_y)² − µ₀₂,
               (x − µ_x)(y − µ_y) − ρ√(µ₂₀µ₀₂) )ᵗ,

and the sample moments and Pearson correlation coefficient are the corresponding estimators. You may also find the formula (A.6) useful.

4.3 A misspecified Exponential model


Assume that y1, . . . , yn are IID from an Exponential distribution with mean µ. Find the MLE of µ and its asymptotic variance estimators under correctly specified and incorrectly specified models.
Bibliography

Abadie, A., Imbens, G. W., and Zheng, F. (2014). Inference for misspecified models with
fixed regressors. Journal of the American Statistical Association, 109:1601–1614.
Agresti, A. (2010). Analysis of Ordinal Categorical Data. New York: John Wiley & Sons.
Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. John Wiley &
Sons.

Ai, C. and Norton, E. C. (2003). Interaction terms in logit and probit models. Economics
Letters, 80:123–129.
Aitken, A. C. (1936). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55:42–48.

Albergaria, M. and Fávero, L. P. (2017). Narrow replication of Fisman and Miguel’s (2007a) ‘Corruption, norms, and legal enforcement: Evidence from diplomatic parking tickets’. Journal of Applied Econometrics, 32(4):919–922.
Angrist, J., Chernozhukov, V., and Fernández-Val, I. (2006). Quantile regression under
misspecification, with an application to the US wage structure. Econometrica, 74:539–
563.

Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s


Companion. Princeton: Princeton University Press.
Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27:17–21.

Anton, R. F., O’Malley, S. S., Ciraulo, D. A., Cisler, R. A., Couper, D., Donovan, D. M.,
Gastfriend, D. R., Hosking, J. D., Johnson, B. A., and LoCastro, J. S. (2006). Com-
bined pharmacotherapies and behavioral interventions for alcohol dependence: the COM-
BINE study: a randomized controlled trial. Journal of the American Medical Association,
295:2003–2017.
Bai, Z.-J. and Bai, Z.-Z. (2013). On nonsingularity of block two-by-two matrices. Linear
Algebra and Its Applications, 439:2388–2404.
Baron, R. M. and Kenny, D. A. (1986). The moderator–mediator variable distinction in so-
cial psychological research: Conceptual, strategic, and statistical considerations. Journal
of Personality and Social Psychology, 51:1173.

Bartlett, M. S. (1953). Approximate confidence intervals. Biometrika, 40:12–19.


Benhamou, E., Guez, B., and Paris, N. (2018). Three remarkable properties of the normal
distribution for sample variance. Theoretical Mathematics and Applications, 8:1792–9709.
Benzi, M., Golub, G. H., and Liesen, J. (2005). Numerical solution of saddle point problems.
Acta Numerica, 14:1–137.


Berkson, J. (1944). Application of the logistic function to bio-assay. Journal of the American
Statistical Association, 39:357–365.
Berman, M. (1988). A theorem of Jacobi and its generalization. Biometrika, 75:779–783.
Berrington de González, A. and Cox, D. R. (2007). Interpretation of interaction: A review.
Annals of Applied Statistics, 1:371–385.

Bickel, P. J. and Li, B. (2006). Regularization in statistics. Test, 15:271–344.


Blinder, A. S. (1973). Wage discrimination: reduced form and structural estimates. Journal
of Human Resources, 8:436–455.
Bliss, C. I. (1934). The method of probits. Science, 79:38–39.

Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal


Statistical Society: Series B (Methodological), 26:211–243.
Broderick, T., Giordano, R., and Meager, R. (2020). An automatic finite-sample robust-
ness metric: When can dropping a little data make a big difference? arXiv preprint
arXiv:2011.14999.

Buja, A., Brown, L., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K., and Zhao, L. (2019a). Models as approximations I: Consequences illustrated with linear regression. Statistical Science, 34:523–544.
Buja, A., Brown, L., Kuchibhotla, A. K., Berk, R., George, E., and Zhao, L. (2019b). Models as approximations II: A model-free theory of parametric regression. Statistical Science, 34:545–565.
Burridge, J. (1981). A note on maximum likelihood estimation for regression models using
grouped data. Journal of the Royal Statistical Society: Series B (Methodological), 43:41–
45.

Card, D., Lee, D. S., Pei, Z., and Weber, A. (2015). Inference on causal effects in a gener-
alized regression kink design. Econometrica, 83:2453–2483.
Carpenter, D. P. (2002). Groups, the media, agency waiting costs, and FDA drug approval.
American Journal of Political Science, 46:490–505.

Carroll, R. J. and Ruppert, D. (1988). Transformation and Weighting in Regression. Lon-


don: Chapman & Hall/CRC.
Chatterjee, S. and Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. John Wiley
& Sons.
Chen, J. and Li, X. (2019). Model-free nonconvex matrix completion: Local minima analysis
and applications in memory-efficient kernel PCA. Journal of Machine Learning Research,
20:1–39.
Chow, G. C. (1960). Tests of equality between sets of coefficients in two linear regressions.
Econometrica, 28:591–605.

Chow, G. C. (1984). Maximum-likelihood estimation of misspecified models. Economic


Modelling, 1:134–138.
Christensen, R. (2002). Plane Answers to Complex Questions. New York: Springer.

Cinelli, C. and Hazlett, C. (2020). Making sense of sensitivity: Extending omitted variable
bias. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82:39–67.
Cochran, W. G. (1938). The omission or addition of an independent variate in multiple
linear regression. Supplement to the Journal of the Royal Statistical Society, 5:171–176.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics,
19:15–18.
Cox, D. R. (1960). Regression analysis when there is prior information about supplementary
variables. Journal of the Royal Statistical Society: Series B (Methodological), 22(1):172–
176.
Cox, D. R. (1961). Tests of separate families of hypotheses. In Neyman, J., editor, Pro-
ceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability,
pages 105–123.
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical
Society: Series B (Methodological), 34:187–202.

Cox, D. R. (1984). Interaction. International Statistical Review, 52:1–24.


Cox, D. R. (2007). On a generalization of a result of W. G. Cochran. Biometrika, 94:755–
759.
Cox, D. R. and Snell, E. J. (1989). Analysis of Binary Data. London: CRC press.

Cribari-Neto, F. (2004). Asymptotic inference under heteroskedasticity of unknown form.


Computational Statistics and Data Analysis, 45:215–233.
Croissant, Y. (2020). Estimation of random utility models in R: The mlogit package. Journal
of Statistical Software, 95:1–41.
David, F. N. and Neyman, J. (1938). Extensions of the Markoff theorem on least squares. Statistical Research Memoirs, 2:19–38.
Deb, P. and Trivedi, P. K. (1997). Demand for medical care by the elderly: a finite mixture
approach. Journal of Applied Econometrics, 12:313–336.
Dempster, A. P., Schatzoff, M., and Wermuth, N. (1977). A simulation study of alternatives
to ordinary least squares. Journal of the American Statistical Association, 72:77–91.
DiCiccio, C. J., Romano, J. P., and Wolf, M. (2019). Improving weighted least squares
inference. Econometrics and Statistics, 10:96–119.
Ding, P. (2021a). The Frisch–Waugh–Lovell theorem for standard errors. Statistics and
Probability Letters, 168:108945.

Ding, P. (2021b). Two seemingly paradoxical results in linear models: the variance inflation
factor and the analysis of covariance. Journal of Causal Inference, 9:1–8.
Ding, P. (2023). A First Course in Causal Inference. Chapman and Hall/CRC.
Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant
analysis. Journal of the American Statistical Association, 70:892–898.

Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. In
Cam, L. L. and Neyman, J., editors, Proceedings of the Fifth Berkeley Symposium on
Mathematical Statistics and Probability, pages 59–82. Berkeley, CA: University of Cali-
fornia Press.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and its Applications. London:
CRC Press.

Firth, D. (1988). Multiplicative errors: log-normal or gamma? Journal of the Royal Statis-
tical Society: Series B (Methodological), 50:266–268.
Fisher, L. D. and Lin, D. Y. (1999). Time-dependent covariates in the Cox proportional-
hazards regression model. Annual Review of Public Health, 20:145–157.

Fisher, R. A. (1925a). Statistical Methods for Research Workers. Edinburgh: Oliver and
Boyd, 1st edition.
Fisher, R. A. (1925b). Theory of statistical estimation. In Mathematical Proceedings of the
Cambridge Philosophical Society, volume 22, pages 700–725. Cambridge University Press.

Fisman, R. and Miguel, E. (2007). Corruption, norms, and legal enforcement: Evidence
from diplomatic parking tickets. Journal of Political Economy, 115:1020–1048.
Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2012). Applied Longitudinal Analysis.
New York: John Wiley & Sons.
Fleming, T. R. and Harrington, D. P. (2011). Counting Processes and Survival Analysis.
New York: John Wiley & Sons.
Frank, K. A. (2000). Impact of a confounding variable on a regression coefficient. Sociological
Methods and Research, 29:147–194.
Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression
tools. Technometrics, 35:109–135.
Freedman, D. A. (1981). Bootstrapping regression models. The Annals of Statistics, 9:1218–
1228.
Freedman, D. A. (1983). A note on screening regression equations. The American Statisti-
cian, 37:152–155.

Freedman, D. A. (2006). On the so-called “Huber sandwich estimator” and “robust standard
errors”. American Statistician, 60:299–302.
Freedman, D. A. (2008). Survival analysis: A primer. American Statistician, 62:110–119.
Freedman, D. A. (2009). Statistical Models: Theory and Practice. Cambridge: Cambridge
University Press.
Freedman, D. A. and Diaconis, P. (1982). On inconsistent M-estimators. Annals of Statistics,
10:454–461.
Freedman, D. A., Klein, S. P., Sacks, J., Smyth, C. A., and Everett, C. G. (1991). Ecological
regression and voting rights. Evaluation Review, 15:673–711.
Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate opti-
mization. Annals of Applied Statistics, 1:302–332.

Friedman, J., Hastie, T., and Tibshirani, R. (2009). glmnet: Lasso and elastic-net regularized
generalized linear models. R package version, 1(4).
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33:1–22.
Frisch, R. (1934). Statistical confluence analysis by means of complete regression systems
(vol. 5). Norway, University Institute of Economics.
Frisch, R. and Waugh, F. V. (1933). Partial time regressions as compared with individual
trends. Econometrica, 1:387–401.
Fuller, W. A. (1975). Regression analysis for sample survey. Sankhya, 37:117–132.

Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the
Anthropological Institute of Great Britain and Ireland, 15:246–263.
Geary, R. C. (1936). The distribution of “Student’s” ratio for non-normal samples. Supple-
ment to the Journal of the Royal Statistical Society, 3:178–184.
Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored
samples. Biometrika, 52:203–224.
Gelman, A. and Meng, X.-L. (1991). A note on bivariate distributions that are conditionally
normal. American Statistician, 45:125–126.
Gelman, A. and Park, D. K. (2009). Splitting a predictor at the upper quarter or third and
the lower quarter or third. American Statistician, 63:1–8.
Gelman, A., Park, D. K., Ansolabehere, S., Price, P. N., and Minnite, L. C. (2001). Models,
assumptions and model checking in ecological regressions. Journal of the Royal Statistical
Society: Series A (Statistics in Society), 164:101–118.
Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method
for choosing a good ridge parameter. Technometrics, 21:215–223.
Goodman, L. A. (1953). Ecological regressions and behavior of individuals. American
Sociological Review, 18:663–664.
Goodman, L. A. (1959). Some alternatives to ecological correlation. American Journal of
Sociology, 64:610–625.
Greene, W. H. and Seaks, T. G. (1991). The restricted least squares estimator: A pedagogical
note. The Review of Economics and Statistics, 73:563–567.
Greenwood, M. (1926). The natural duration of cancer. Reports on Public Health and Medical Subjects, 33.

Guiteras, R., Levinsohn, J., and Mobarak, A. M. (2015). Encouraging sanitation investment
in the developing world: a cluster-randomized trial. Science, 348:903–906.
Hagemann, A. (2017). Cluster-robust bootstrap inference in quantile regression models.
Journal of the American Statistical Association, 112:446–456.

Harrison Jr, D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for
clean air. Journal of Environmental Economics and Management, 5:81–102.

Hastie, T. (2020). Ridge regularization: An essential concept in data science. Technometrics,


62:426–433.
Heckman, J. J. and Singer, B. (1984). Econometric duration analysis. Journal of Econo-
metrics, 24:63–132.
Hernán, M. A. (2010). The hazards of hazard ratios. Epidemiology, 21:13–15.

Hilbe, J. M. (2014). Modeling Count Data. Cambridge: Cambridge University Press.


Hinkley, D. V. (1977). Jackknifing in unbalanced situations. Technometrics, 19:285–292.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, X. H. (2000). Assessing the effect of
an influenza vaccine in an encouragement design. Biostatistics, 1:69–88.

Hoerl, A. E., Kannard, R. W., and Baldwin, K. F. (1975). Ridge regression: some simula-
tions. Communications in Statistics—Theory and Methods, 4:105–123.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthog-
onal problems. Technometrics, 12:55–67.

Hoerl, R. W. (2020). Ridge regression: A historical context. Technometrics, 62:420–425.


Hoff, P. D. (2017). Lasso, fractional norm and structured sparse estimation using
a Hadamard product parametrization. Computational Statistics and Data Analysis,
115:186–198.
Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replace-
ment from a finite universe. Journal of the American Statistical Association, 47:663–685.
Hu, B., Shao, J., and Palta, M. (2006). Pseudo-R2 in logistic regression model. Statistica
Sinica, 16:847–860.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard
conditions. In Cam, L. M. L. and Neyman, J., editors, Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233. Berke-
ley, California: University of California Press.
Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. Annals
of Statistics, 1:799–821.

Huster, W. J., Brookmeyer, R., and Self, S. G. (1989). Modelling paired survival data with
covariates. Biometrics, pages 145–156.
Jann, B. (2008). The Blinder–Oaxaca decomposition for linear regression models. The Stata
Journal, 8:453–479.

Johnson, P. O. and Neyman, J. (1936). Tests of certain linear hypotheses and their appli-
cation to some educational problems. Statistical Research Memoirs, 1:57–93.
Kagan, A. (2001). A note on the logistic link function. Biometrika, 88:599–601.
Kalbfleisch, J. D. and Prentice, R. L. (2011). The Statistical Analysis of Failure Time Data.
New York: John Wiley & Sons.

Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations.


Journal of the American Statistical Association, 53:457–481.

Keele, L. (2010). Proportionally difficult: testing for nonproportional hazards in Cox models.
Political Analysis, 18:189–205.
King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual
Behavior from Aggregate Data. Princeton: Princeton University Press.
King, G. and Roberts, M. E. (2015). How robust standard errors expose methodological
problems they do not fix, and what to do about it. Political Analysis, 23:159–179.
Koenker, R. and Bassett Jr, G. (1978). Regression quantiles. Econometrica, 46:33–50.
Koopmann, R. (1982). Parameterschätzung bei a-priori-Information. Vandenhoeck &
Ruprecht.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with
experimental data. American Economic Review, 76:604–620.
Lawless, J. F. (1976). A simulation study of ridge and other regression estimators. Com-
munications in Statistics—Theory and Methods, 5:307–323.
Le Cam, L. (1960). An approximation theorem for the Poisson Binomial distribution. Pacific
Journal of Mathematics, 10:1181–1197.
Lehmann, E. L. (1951). A general concept of unbiasedness. The Annals of Mathematical
Statistics, 22:587–592.
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018). Distribution-
free predictive inference for regression. Journal of the American Statistical Association,
113:1094–1111.
Li, J. and Valliant, R. (2009). Survey weighted hat matrix and leverages. Survey Method-
ology, 35:15–24.
Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear
models. Biometrika, 73:13–22.
Lin, D.-Y., Gong, J., Gallo, P., Bunn, P. H., and Couper, D. (2016). Simultaneous inference
on treatment effects in survival studies with factorial designs. Biometrics, 72:1078–1085.
Lin, D. Y. and Wei, L.-J. (1989). The robust inference for the cox proportional hazards
model. Journal of the American statistical Association, 84:1074–1078.
Long, J. S. and Ervin, L. H. (1998). Correcting for heteroscedasticity with heteroscedasticity
consistent standard errors in the linear regression model: Small sample considerations.
Indiana University, Bloomington, IN, 47405.
Long, J. S. and Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in
the linear regression model. American Statistician, 54:217–224.
Lovell, M. C. (1963). Seasonal adjustment of economic time series and multiple regression
analysis. Journal of the American Statistical Association, 58:993–1010.
Lukacs, E. (1942). A characterization of the normal distribution. The Annals of Mathe-
matical Statistics, 13:91–93.
MacKinnon, J. G. and White, H. (1985). Some heteroskedasticity-consistent covariance ma-
trix estimators with improved finite sample properties. Journal of Econometrics, 29:305–
325.

Magee, L. (1998). Improving survey-weighted least squares regression. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 60:115–126.
Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising in
its consideration. Cancer Chemother. Rep., 50:163–170.
Marshall, A. (1890). Principles of Economics. New York: Macmillan and Company.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Boca Raton: Chapman
and Hall/CRC, second edition.
McCulloch, J. H. (1985). On heteros*edasticity. Econometrica, 53:483.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In Zarembka,
P., editor, Frontiers in Econometrics. Academic Press.
Miller, R. G. J. (1974). An unbalanced jackknife. The Annals of Statistics, 2:880–891.
Moen, E. L., Fricano-Kugler, C. J., Luikart, B. W., and O’Malley, A. J. (2016). Analyzing
clustered data: why and how to account for multiple observations nested within a study
participant? PLoS One, 11:e0146721.
Monahan, J. F. (2008). A Primer on Linear Models. London: CRC Press.
Nagelkerke, N. (1991). A note on a general definition of the coefficient of determination.
Biometrika, 78:691–692.
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the
Royal Statistical Society: Series A (General), 135:370–384.
Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Engle, R. F. and McFadden, D. L., editors, Handbook of Econometrics, volume IV, pages 2112–2245. Elsevier.
Oaxaca, R. (1973). Male-female wage differentials in urban labor markets. International
Economic Review, 14:693–709.
Ogasawara, T. and Takahashi, M. (1951). Independence of quadratic quantities in a normal
system. Journal of Science of the Hiroshima University, Series A, 15:1–9.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). On the lasso and its dual. Journal
of Computational and Graphical Statistics, 9:319–337.
Paloyo, A. R. (2014). When did we begin to spell ‘heteros*edasticity’ correctly? Philippine
Review of Economics, LI:162–178.
Pepe, M. S. and Anderson, G. L. (1994). A cautionary note on inference for marginal regres-
sion models with longitudinal data and general correlated response data. Communications
in Statistics—Simulation and Computation, 23:939–951.
Peto, R. and Peto, J. (1972). Asymptotically efficient rank invariant test procedures. Journal
of the Royal Statistical Society: Series A (General), 135:185–198.
Powell, J. L. (1991). Estimation of monotonic regression models under quantile restrictions.
Nonparametric and semiparametric methods in Econometrics, pages 357–384.
Pratt, J. W. (1981). Concavity of the log likelihood. Journal of the American Statistical
Association, 76:103–106.

Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9:705–724.


Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case-control
studies. Biometrika, 66:403–411.
Quenouille, M. H. (1949). Approximate tests of correlation in time-series. Journal of the
Royal Statistical Society. Series B (Methodological), 11:68–84.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43:353–360.
Rao, C. R. (1973). Linear Statistical Inference and its Applications. John Wiley & Sons,
2nd edition.
Reeds, J. A. (1978). Jackknifing maximum likelihood estimates. Annals of Statistics, 6:727–
739.
Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15:351–357.
Rogosa, D. (1981). On the relationship between the Johnson-Neyman region of significance
and statistical tests of parallel within-group regressions. Educational and Psychological
Measurement, 41:73–84.
Romano, J. P. and Wolf, M. (2017). Resurrecting weighted least squares. Journal of
Econometrics, 197:1–19.
Royer, H., Stehr, M., and Sydnor, J. (2015). Incentives, commitments, and habit formation
in exercise: evidence from a field experiment with workers at a fortune-500 company.
American Economic Journal: Applied Economics, 7:51–84.
Rubin, D. B. (2008). For objective causal inference, design trumps analysis. The Annals of
Applied Statistics, 2:808–840.
Ruhe, A. (1970). Perturbation bounds for means of eigenvalues and invariant subspaces.
BIT Numerical Mathematics, 10:343–354.
Ruppert, D., Sheather, S. J., and Wand, M. P. (1995). An effective bandwidth selector for
local least squares regression. Journal of the American Statistical Association, 90:1257–
1270.
Samarani, S., Mack, D. R., Bernstein, C. N., Iannello, A., Debbeche, O., Jantchou, P., Faure, C., Deslandres, C., Amre, D. K., and Ahmad, A. (2019). Activating killer-cell immunoglobulin-like receptor genes confer risk for Crohn’s disease in children and adults of the Western European descent: Findings based on case-control studies. PLoS One, 14:e0217767.
Schochet, P. Z. (2013). Estimators for clustered education RCTs using the Neyman model
for causal inference. Journal of Educational and Behavioral Statistics, 38:219–238.
Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7:221–242.
Shen, D., Song, D., Ding, P., and Sekhon, J. S. (2023). Algebraic and statistical properties
of the ordinary least squares interpolator. arXiv preprint arXiv:2309.15769.
Sims, C. A. (2010). But economics is not an experimental science. Journal of Economic
Perspectives, 24:59–68.

Stefanski, L. A. and Boos, D. D. (2002). The calculus of M-estimation. American Statistician, 56:29–38.
Stigler, S. M. (1981). Gauss and the invention of least squares. Annals of Statistics, 9:465–
474.
Styan, G. P. H. (1970). Notes on the distribution of quadratic forms in singular normal
variables. Biometrika, 57:567–572.
Tamer, E. (2010). Partial identification in econometrics. Annual Review of Economics,
2:167–195.
Tarpey, T. (2000). A note on the prediction sum of squares statistic for restricted least
squares. American Statistician, 54:116–118.

Theil, H. (1971). Principles of Econometrics. New York: John Wiley & Sons.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Methodological), 58:267–288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):273–
282.
Tibshirani, R. J. (2013). The lasso problem and uniqueness. Electronic Journal of Statistics,
7:1456–1490.

Tikhonov, A. N. (1943). On the stability of inverse problems. In Dokl. Akad. Nauk SSSR,
volume 39, pages 195–198.
Titterington, D. M. (2013). Biometrika highlights from volume 28 onwards. Biometrika,
100:17–73.
Tjur, T. (2009). Coefficients of determination in logistic regression models—a new proposal:
The coefficient of discrimination. American Statistician, 63:366–372.
Toomet, O. and Henningsen, A. (2008). Sample selection models in R: Package sampleSe-
lection. Journal of Statistical Software, 27:1–23.
Train, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge: Cambridge
University Press.
Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable
minimization. Journal of Optimization Theory and Applications, 109:475–494.
Tsiatis, A. (2007). Semiparametric Theory and Missing Data. New York: Springer.

Tukey, J. (1958). Bias and confidence in not quite large samples. Annals of Mathematical
Statistics, 29:614.
Turner, E. L., Li, F., Gallis, J. A., Prague, M., and Murray, D. M. (2017a). Review of re-
cent methodological developments in group-randomized trials: part 1—design. American
Journal of Public Health, 107:907–915.

Turner, E. L., Prague, M., Gallis, J. A., Li, F., and Murray, D. M. (2017b). Review of recent
methodological developments in group-randomized trials: part 2—analysis. American
Journal of Public Health, 107:1078–1086.

Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge: Cambridge University


Press.
VanderWeele, T. J. (2015). Explanation in Causal Inference: Methods for Mediation and
Interaction. Oxford: Oxford University Press.
Von Neumann, J. (1937). Some matrix-inequalities and metrization of matric-space. Tomsk
Univ. Rev, 1:286–300.
Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic Learning in a Random
World. New York: Springer.
Weber, N. C. (1986). The jackknife and heteroskedasticity: Consistent variance estimation
for regression models. Economics Letters, 20:161–163.
Weisberg, S. (2005). Applied Linear Regression. New York: John Wiley & Sons.
White, H. (1980a). A heteroskedasticity-consistent covariance matrix estimator and a direct
test for heteroskedasticity. Econometrica, 48:817–838.
White, H. (1980b). Using least squares to approximate unknown regression functions. In-
ternational Economic Review, 21:149–170.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica,
pages 1–25.
Wood, S. N. (2017). Generalized Additive Models: an Introduction with R. CRC press.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. Cam-
bridge: MIT press.
Wooldridge, J. M. (2012). Introductory Econometrics: A Modern Approach. Cengage Learning.
Wright, S. (1921). Correlation and causation. Journal of Agricultural Research, 7(7):557–
585.
Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics,
5:161–215.
Wu, C.-F. J. (1986). Jackknife, bootstrap and other resampling methods in regression
analysis. Annals of Statistics, 14:1261–1295.
Yu, B. and Kumbier, K. (2020). Veridical data science. Proceedings of the National Academy
of Sciences of the United States of America, 117:3920–3929.
Yule, G. U. (1907). On the theory of correlation for any number of variables, treated by a
new system of notation. Proceedings of the Royal Society of London. Series A, Containing
Papers of a Mathematical and Physical Character, 79:182–193.
Zeileis, A. (2006). Object-oriented computation of sandwich estimators. Journal of Statistical Software, 16:1–16.
Zeileis, A., Kleiber, C., and Jackman, S. (2008). Regression models for count data in R.
Journal of Statistical Software, 27:1–25.
Zhang, D. (2017). A coefficient of determination for generalized linear models. The American
Statistician, 71:310–316.

Zhao, A. and Ding, P. (2022). Regression-based causal inference with factorial experiments:
estimands, model specifications, and design-based properties. Biometrika, 109:799–815.
Zou, G. (2004). A modified Poisson regression approach to prospective studies with binary
data. American Journal of Epidemiology, 159:702–706.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:301–320.
