Linear Model and Extensions
Peng Ding
Acronyms xiii
Symbols xv
Preface xix
I Introduction 1
1 Motivations for Statistical Models 3
1.1 Data and statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Why linear models? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
15 Lasso 155
15.1 Introduction to the lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
15.2 Comparing the lasso and ridge: a geometric perspective . . . . . . . . . . . 155
15.3 Computing the lasso via coordinate descent . . . . . . . . . . . . . . . . . . 158
15.3.1 The soft-thresholding lemma . . . . . . . . . . . . . . . . . . . . . . 158
15.3.2 Coordinate descent for the lasso . . . . . . . . . . . . . . . . . . . . 158
15.4 Example: comparing OLS, ridge and lasso . . . . . . . . . . . . . . . . . . . 159
15.5 Other shrinkage estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
15.6 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
17 Interaction 179
17.1 Two binary covariates interact . . . . . . . . . . . . . . . . . . . . . . . . . 179
17.2 A binary covariate interacts with a general covariate . . . . . . . . . . . . . 180
17.2.1 Treatment effect heterogeneity . . . . . . . . . . . . . . . . . . . . . 180
17.2.2 Johnson–Neyman technique . . . . . . . . . . . . . . . . . . . . . . . 180
17.2.3 Blinder–Oaxaca decomposition . . . . . . . . . . . . . . . . . . . . . 180
17.2.4 Chow test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
17.3 Difficulties of interaction . . . . . . . . . . . . . . . . . . . . . . . . . . 182
17.3.1 Removable interaction . . . . . . . . . . . . . . . . . . . . . . . . . . 182
17.3.2 Main effect in the presence of interaction . . . . . . . . . . . . . . . 182
17.3.3 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
17.4 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
IX Appendices 327
A Linear Algebra 329
A.1 Basics of vectors and matrices . . . . . . . . . . . . . . . . . . . . . . . . . 329
A.2 Vector calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
A.3 Homework problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
Bibliography 367
Acronyms
I try hard to avoid acronyms to reduce the unnecessary burden on readers. The
following are standard and will be used repeatedly.
ANOVA (Fisher’s) analysis of variance
CLT central limit theorem
CV cross-validation
EHW Eicker–Huber–White (robust covariance matrix or standard error)
FWL Frisch–Waugh–Lovell (theorem)
GEE generalized estimating equation
GLM generalized linear model
HC heteroskedasticity-consistent (covariance matrix or standard error)
IID independent and identically distributed
LAD least absolute deviations
lasso least absolute shrinkage and selection operator
MLE maximum likelihood estimate
OLS ordinary least squares
RSS residual sum of squares
WLS weighted least squares
Symbols
All vectors are column vectors as in R unless stated otherwise. Let the superscript “t ” denote
the transpose of a vector or matrix.
$\stackrel{a}{\sim}$ approximation in distribution
R the set of all real numbers
β regression coefficient
ε error term
H hat matrix H = X(X t X)−1 X t
hii leverage score: the (i, i)th element of the hat matrix H
In identity matrix of dimension n × n
xi covariate vector for unit i
X covariate matrix
Y outcome vector
yi outcome for unit i
⊥⊥ independence and conditional independence
Useful R packages
Preface
my teaching assistants to review the appendices in the first two lab sessions and assigned
homework problems from the appendices to remind the students to review the background
materials. Then you can cover Chapters 2–24. You can omit Chapter 18 and some sections
in other chapters due to their technical complications. If time permits, you can consider
covering Chapter 25 due to the importance of the generalized estimating equation as well
as its byproduct called the “cluster-robust standard error”, which is important for many
social science applications. Furthermore, you can consider covering Chapter 27 due to the
importance of the Cox proportional hazards model.
Homework problems
This book contains many homework problems. It is important to try some homework prob-
lems. Moreover, some homework problems contain useful theoretical results. Even if you do
not have time to figure out the details for those problems, it is helpful to at least read the
statements of the problems.
Omitted topics
Although “Linear Model” is a standard course offered by most statistics departments, it
is not entirely clear what we should teach as the field of statistics is evolving. Although
I made some suggestions to the instructors above, you may still feel that this book has
omitted some important topics related to the linear model.
et al. (2012) is a canonical textbook on applied longitudinal data analysis. This book also
covers the Cox proportional hazards model in Chapter 27. For more advanced methods for
survival analysis, Kalbfleisch and Prentice (2011) is a canonical textbook.
Causal inference
I do not cover causal inference in this book intentionally. To minimize the overlap of the
materials, I wrote another textbook on causal inference (Ding, 2023). However, I did teach a
version of “Linear Model” with a causal inference unit after introducing the basics of linear
model and logistic model. Students seemed to like it because of the connections between
statistical models and causal inference.
• This book covers the theory of the linear model related to not only social sciences but
also biomedical studies.
• This book provides homework problems with different technical difficulties. The solu-
tions to the problems are available to instructors upon request.
Other textbooks may also have one or two of the above features. This book has the above
features simultaneously. I hope that instructors and readers find these features attractive.
Acknowledgments
Many students at UC Berkeley made critical and constructive comments on early versions of
my lecture notes. As teaching assistants for my “Linear Model” course, Sizhu Lu, Chaoran
Yu, and Jason Wu read early versions of my book carefully and helped me to improve the
book a lot.
Professors Hongyuan Cao and Zhichao Jiang taught related courses based on an early
version of the book. They made very valuable suggestions.
I am also very grateful for the suggestions from Nianqiao Ju.
When I was a student, I took a linear model course based on Weisberg (2005). In my
early years of teaching, I used Christensen (2002) and Agresti (2015) as reference books.
I also sat in Professor Jim Powell’s econometrics courses and got access to his wonderful
lecture notes. They all heavily impacted my understanding and formulation of the linear
model.
If you identify any errors, please feel free to email me.
Part I
Introduction
1
Motivations for Statistical Models
(Q3) Estimate the causal effect of some components in X on Y . What if we change some
components of X? How do we measure the impact of the hypothetical intervention of
some components of X on Y ? This is a much harder question because most statistical
tools are designed to infer association, not causation. For example, the U.S. Food and
Drug Administration (FDA) approves drugs based on randomized controlled trials
(RCTs) because RCTs are the most credible way to infer the causal effects of drugs on health
outcomes. Economists are interested in evaluating the effect of a job training program
on employment and wages. However, this is a notoriously difficult problem with only
observational data.
The above descriptions are about generic X and Y , which can be many different types.
We often use different statistical models to capture the features of different types of data.
I give a brief overview of models that will appear in later parts of this book.
(T1) X and Y are univariate and continuous. In Francis Galton’s1 classic example, X is the
parents’ average height and Y is the children’s average height (Galton, 1886). Galton
derived the following formula:
$$ y = \bar{y} + \hat{\rho}\,\frac{\hat{\sigma}_y}{\hat{\sigma}_x}(x - \bar{x}), $$
which is equivalent to
$$ \frac{y - \bar{y}}{\hat{\sigma}_y} = \hat{\rho}\,\frac{x - \bar{x}}{\hat{\sigma}_x}, \qquad (1.1) $$
where
$$ \bar{x} = n^{-1}\sum_{i=1}^{n} x_i, \qquad \bar{y} = n^{-1}\sum_{i=1}^{n} y_i $$
are the sample means,
$$ \hat{\sigma}_x^2 = (n-1)^{-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \hat{\sigma}_y^2 = (n-1)^{-1}\sum_{i=1}^{n} (y_i - \bar{y})^2 $$
are the sample variances, and $\hat{\rho} = \hat{\sigma}_{xy}/(\hat{\sigma}_x \hat{\sigma}_y)$ is the sample Pearson correlation coefficient with the sample covariance
$$ \hat{\sigma}_{xy} = (n-1)^{-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). $$
1 Who was Francis Galton? He was Charles Darwin's half-cousin and was famous for his pioneering work in
statistics and for devising a method for classifying fingerprints that proved useful in forensic science. He
also invented the term eugenics, a field that remains highly controversial.
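To make (1.1) concrete, here is a small R sketch (not part of the original text; it uses simulated data rather than Galton's) verifying that the OLS slope of y on x equals ρ̂σ̂y/σ̂x, so that regressing the standardized outcome on the standardized covariate returns the sample correlation ρ̂ itself:

# simulate a parent/child-height-like dataset
set.seed(1886)
n = 928
x = rnorm(n, mean = 68, sd = 1.8)          # "midparent" height
y = 22 + 0.64 * x + rnorm(n, sd = 2.2)     # "child" height

rho_hat = cor(x, y)
# slope of y on x equals rho_hat * sd(y) / sd(x)
c(coef(lm(y ~ x))[2], rho_hat * sd(y) / sd(x))
# slope of the standardized variables equals rho_hat
c(coef(lm(scale(y) ~ scale(x)))[2], rho_hat)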
(T3) Y binary or an indicator of two classes, and X multivariate of mixed types. For example,
in the R package wooldridge, the dataset mroz contains an outcome of interest, the
binary indicator for whether a woman was in the labor force in 1975, together with
several useful covariates.
(T4) Y categorical without ordering. For example, the choice of housing type, single-family
house, townhouse, or condominium, is a categorical variable.
(T5) Y categorical and ordered. For example, the final course evaluation at UC Berkeley
can take value in {1, 2, 3, 4, 5, 6, 7}. These numbers have clear ordering but they are
not the usual real numbers.
(T6) Y counts. For example, the number of times one went to the gym last week is a
non-negative integer representing counts.
(T7) Y time-to-event outcome. For example, in medical trials, a major outcome of interest
is the survival time; in labor economics, a major outcome of interest is the time to
find the next job. The former is called survival analysis in biostatistics and the latter
is called duration analysis in econometrics.
(T8) Y multivariate and correlated. In medical trials, the data are often longitudinal, mean-
ing that the patient’s outcomes are measured repeatedly over time. So each patient
has a multivariate outcome. In field experiments of public health and development
economics, the randomized interventions are often at the village level but the out-
come data are collected at the household level. So within villages, the outcomes are
correlated.
(R1) Linear models are simple but non-trivial starting points for learning.
(R2) Linear models can provide insights because we can derive explicit formulas based on
elegant algebra and geometry.
(R3) Linear models can handle nonlinearity by incorporating nonlinear terms, for example,
X can contain the polynomials or nonlinear transformations of the original covariates.
In statistics, “linear” often means linear in parameters, not necessarily in covariates.
(R5) Linear models are simpler than nonlinear models, but they do not necessarily perform
worse than more complicated nonlinear models. We have finite data so we cannot fit
arbitrarily complicated models.
If you are interested in nonlinear models, you can take another machine learning course.
2
Ordinary Least Squares (OLS) with a Univariate
Covariate
[Figure: "Galton's regression" — scatterplot of childHeight against midparentHeight, with the fitted line y = 22.64 + 0.64x.]
With n data points $(x_i, y_i)_{i=1}^{n}$, our goal is to find the best linear fit of the data, $y_i \approx \hat{\alpha} + \hat{\beta} x_i$.
What do we mean by the "best" fit? Gauss proposed to use the following criterion, called ordinary least squares (OLS):
$$ (\hat{\alpha}, \hat{\beta}) = \arg\min_{a,\, b}\; n^{-1}\sum_{i=1}^{n} (y_i - a - b x_i)^2. $$
The OLS criterion is based on the squared “misfits” yi − a − bxi . Another intuitive
criterion is based on the absolute values of those misfits, which is called the least absolute
deviation (LAD). However, OLS is simpler because the objective function is smooth in (a, b).
We will discuss LAD in Chapter 26.
How do we solve the OLS minimization problem? The objective function is quadratic in (a, b) and
diverges to infinity as a and b diverge, so it must have a unique minimizer $(\hat{\alpha}, \hat{\beta})$ which
satisfies the first-order condition:
$$ \begin{cases} -\dfrac{2}{n}\displaystyle\sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0, \\[2ex] -\dfrac{2}{n}\displaystyle\sum_{i=1}^{n} x_i (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0. \end{cases} $$
These two equations are called the Normal Equations of OLS. The first equation implies
ȳ = α̂ + β̂ x̄, (2.1)
that is, the OLS line must go through the sample mean of the data (x̄, ȳ). The second
equation implies
$$ \overline{xy} = \hat{\alpha}\bar{x} + \hat{\beta}\,\overline{x^2}, \qquad (2.2) $$
where $\overline{xy}$ is the sample mean of the $x_i y_i$'s and $\overline{x^2}$ is the sample mean of the $x_i^2$'s. Subtracting
(2.1) multiplied by $\bar{x}$ from (2.2), we have
$$ \overline{xy} - \bar{x}\bar{y} = \hat{\beta}\big(\overline{x^2} - \bar{x}^2\big) \;\Longrightarrow\; \hat{\beta} = \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x^2}. $$
So the OLS coefficient of x equals the sample covariance between x and y divided by the
sample variance of x. From (2.1), we obtain that
α̂ = ȳ − β̂ x̄.
Therefore, the fitted line satisfies
$$ y = \hat{\alpha} + \hat{\beta}x = \bar{y} - \hat{\beta}\bar{x} + \hat{\beta}x \;\Longrightarrow\; y - \bar{y} = \hat{\beta}(x - \bar{x}) \;\Longrightarrow\; y - \bar{y} = \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x^2}(x - \bar{x}) = \frac{\hat{\rho}_{xy}\hat{\sigma}_x\hat{\sigma}_y}{\hat{\sigma}_x^2}(x - \bar{x}) \;\Longrightarrow\; \frac{y - \bar{y}}{\hat{\sigma}_y} = \hat{\rho}_{xy}\,\frac{x - \bar{x}}{\hat{\sigma}_x}, $$
which recovers Galton's formula (1.1).
Ceres, and his work was published in 1809. Legendre’s work appeared in 1805 but Gauss claimed that he
had been using it since 1794 or 1795. Stigler (1981) reviews the history of OLS.
which equals
$$ \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \frac{\langle x, y\rangle}{\langle x, x\rangle}, $$
where $x$ and $y$ are the n-dimensional vectors containing all observations, and $\langle x, y\rangle = \sum_{i=1}^{n} x_i y_i$ denotes the inner product. Although not directly useful, this formula will be the
building block for many discussions later.
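As a quick check of the formulas in this chapter, the following R sketch (mine, on simulated data) computes the OLS coefficients from the sample moments and from the inner-product formula, and compares them with lm():

set.seed(2)
n = 100
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)

# OLS with an intercept: slope = sample cov / sample var, intercept = ybar - slope * xbar
beta_hat  = cov(x, y) / var(x)
alpha_hat = mean(y) - beta_hat * mean(x)
rbind(moment = c(alpha_hat, beta_hat),
      lm     = coef(lm(y ~ x)))

# OLS without an intercept: slope = <x, y> / <x, x>
c(sum(x * y) / sum(x * x), coef(lm(y ~ x - 1)))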
is the weight proportional to the squared distance between xi and xj . In the above formulas,
we define bij = 0 if xi = xj .
Remark: Wu (1986) and Gelman and Park (2009) used this formula. Problem 3.9 gives
a more general result.
Part II
where xti = (xi1 , . . . , xip ) is the row vector consisting of the covariates of unit i, and Xj =
(x1j , . . . , xnj )t is the column vector of the j-th covariate for all units.
We want to find the best linear fit of the data $(x_i, y_i)_{i=1}^{n}$:
$$ \hat{y}_i = x_i^{t}\hat{\beta}, \qquad \text{where } \hat{\beta} = \arg\min_{b \in \mathbb{R}^{p}}\; n^{-1}\sum_{i=1}^{n} (y_i - x_i^{t}b)^2, $$
where β̂ is called the OLS coefficient, the ŷi's are called the fitted values, and the yi − ŷi's
are called the residuals.
The objective function is quadratic in b and diverges to infinity when b diverges to
infinity. So it must have a unique minimizer β̂ satisfying the first-order condition
$$ -\frac{2}{n}\sum_{i=1}^{n} x_i (y_i - x_i^{t}\hat{\beta}) = 0, $$
which simplifies to
$$ \sum_{i=1}^{n} x_i (y_i - x_i^{t}\hat{\beta}) = 0 \;\Longleftrightarrow\; X^{t}(Y - X\hat{\beta}) = 0. \qquad (3.1) $$
The above equation (3.1) is called the Normal equation of the OLS, which implies the main
theorem:
Theorem 3.1 The OLS coefficient equals
$$ \hat{\beta} = (X^{t}X)^{-1}X^{t}Y = \Big(\sum_{i=1}^{n} x_i x_i^{t}\Big)^{-1}\sum_{i=1}^{n} x_i y_i $$
if $X^{t}X = \sum_{i=1}^{n} x_i x_i^{t}$ is non-degenerate.
The equivalence of the two forms of the OLS coefficient follows from
$$ X^{t}X = (x_1, \ldots, x_n)\begin{pmatrix} x_1^{t} \\ x_2^{t} \\ \vdots \\ x_n^{t} \end{pmatrix} = \sum_{i=1}^{n} x_i x_i^{t} $$
and
$$ X^{t}Y = (x_1, \ldots, x_n)\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \sum_{i=1}^{n} x_i y_i. $$
For different purposes, both forms can be useful.
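A short R sketch (not from the book; simulated data) confirms that both forms of the OLS coefficient agree with lm():

set.seed(3)
n = 200; p = 4
X = cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # include an intercept column
beta = c(1, 2, -1, 0.5)
Y = as.vector(X %*% beta + rnorm(n))

# matrix form: (X'X)^{-1} X'Y
beta_hat1 = solve(t(X) %*% X, t(X) %*% Y)
# summation form: (sum_i x_i x_i')^{-1} sum_i x_i y_i
beta_hat2 = solve(Reduce(`+`, lapply(1:n, function(i) tcrossprod(X[i, ]))),
                  colSums(X * Y))
cbind(beta_hat1, beta_hat2, coef(lm(Y ~ X - 1)))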
The non-degeneracy of $X^{t}X$ in Theorem 3.1 requires that for any non-zero vector $\alpha \in \mathbb{R}^{p}$, we must have
$$ \alpha^{t}X^{t}X\alpha = \|X\alpha\|^2 \neq 0, $$
which is equivalent to
$$ X\alpha \neq 0, $$
i.e., the columns of X are linearly independent¹. This effectively rules out redundant
columns in the design matrix X. If X1 can be represented by the other columns, X1 =
c2X2 + · · · + cpXp for some (c2, . . . , cp), then XtX is degenerate.
Throughout the book, we invoke the following condition unless stated otherwise.
Condition 3.1 The column vectors of X are linearly independent.
Xb = b1 X1 + · · · + bp Xp
represents a linear combination of the column vectors of the design matrix X. So the OLS
problem is to find the best linear combination of the column vectors of X to approximate the
response vector Y . Recall that all linear combinations of the column vectors of X constitute
1 This book uses different notions of “independence” which can be confusing sometimes. In linear algebra,
a set of vectors is linearly independent if any nonzero linear combination of them is not zero; see Chapter A.
In probability theory, two random variables are independent if their joint density factorizes into the product
of the marginal distributions; see Chapter B.
the column space of X, denoted by C(X) 2 . So the OLS problem is to find the vector in C(X)
that is the closest to Y . Geometrically, the vector must be the projection of Y onto C(X).
By projection, the residual vector ε̂ = Y −X β̂ must be orthogonal to C(X), or, equivalently,
the residual vector is orthogonal to X1 , . . . , Xp . This geometric intuition implies that
$$ X_1^{t}\hat{\varepsilon} = 0, \;\ldots,\; X_p^{t}\hat{\varepsilon} = 0 \;\Longleftrightarrow\; X^{t}\hat{\varepsilon} = \begin{pmatrix} X_1^{t}\hat{\varepsilon} \\ \vdots \\ X_p^{t}\hat{\varepsilon} \end{pmatrix} = 0 \;\Longleftrightarrow\; X^{t}(Y - X\hat{\beta}) = 0, $$
which is essentially the Normal equation (3.1). The above argument gives a geometric deriva-
tion of the OLS formula in Theorem 3.1.
In Figure 3.1, since the triangle ABC is a right triangle, the fitted vector Ŷ = Xβ̂ is orthogonal
to the residual vector ε̂, and moreover, the Pythagorean Theorem implies that
$$ \|Y\|^2 = \|X\hat{\beta}\|^2 + \|\hat{\varepsilon}\|^2. $$
which implies that ∥Y − Xb∥² ≥ ∥Y − Xβ̂∥², with equality holding if and only if b = β̂. To see this, expand
$$ \|Y - Xb\|^2 = \|(Y - X\hat{\beta}) + X(\hat{\beta} - b)\|^2 $$
into four terms. The first term equals ∥Y − Xβ̂∥² and the second term equals ∥X(β̂ − b)∥². We need to show
the last two (cross) terms are zero. By symmetry of these two terms, we only need to show that
the last term is zero. This is true by the Normal equation (3.1) of the OLS:
$$ \{X(\hat{\beta} - b)\}^{t}(Y - X\hat{\beta}) = (\hat{\beta} - b)^{t}X^{t}(Y - X\hat{\beta}) = 0. $$
The matrix
$$ H = X(X^{t}X)^{-1}X^{t} $$
is an n × n matrix. It is called the hat matrix because it puts a hat on Y when multiplying
Y. Algebraically, we can show that H is a projection matrix because
$$ H^2 = X(X^{t}X)^{-1}X^{t}X(X^{t}X)^{-1}X^{t} = X(X^{t}X)^{-1}X^{t} = H $$
and
$$ H^{t} = \big\{X(X^{t}X)^{-1}X^{t}\big\}^{t} = X(X^{t}X)^{-1}X^{t} = H. $$
Recall that C(X) is the column space of X. (G1) states that projecting any vector in
C(X) onto C(X) does not change the vector, and (G2) states that projecting any vector
orthogonal to C(X) onto C(X) results in a zero vector.
Proof of Proposition 3.1: I first prove (G1). If v ∈ C(X), then v = Xb for some b,
which implies that Hv = X(X t X)−1 X t Xb = Xb = v. Conversely, if v = Hv, then v =
X(X t X)−1 X t v = Xu with u = (X t X)−1 X t v, which ensures that v ∈ C(X).
I then prove (G2). If w ⊥ C(X), then w is orthogonal to all column vectors of X. So
$$ X_j^{t}w = 0 \;(j = 1, \ldots, p) \;\Longrightarrow\; X^{t}w = 0 \;\Longrightarrow\; Hw = X(X^{t}X)^{-1}X^{t}w = 0. $$
It shows that the predicted value ŷi is a linear combination of all the outcomes. Moreover,
if X contains a column of intercepts 1n = (1, . . . , 1)ᵗ, then
$$ H 1_n = 1_n \;\Longrightarrow\; \sum_{j=1}^{n} h_{ij} = 1 \quad (i = 1, \ldots, n), $$
which implies that ŷi is a weighted average of all the outcomes. Although the sum of the
weights is one, some of them can be negative.
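The following R sketch (added here for illustration; simulated data) verifies these algebraic properties of the hat matrix numerically:

set.seed(4)
n = 50; p = 3
X = cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # first column is the intercept
H = X %*% solve(t(X) %*% X) %*% t(X)

max(abs(H %*% H - H))        # idempotent: H^2 = H (up to rounding error)
max(abs(t(H) - H))           # symmetric: H' = H
range(rowSums(H))            # each row sums to one because X contains an intercept
y = rnorm(n)
max(abs(H %*% y - fitted(lm(y ~ X - 1))))   # H y reproduces the fitted values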
In general, the hat matrix has a complex form, but when the covariates are dummy
variables, it has a more explicit form. I give two examples below.
Example 3.1 In a treatment-control experiment with m treated and n control units, the
matrix X contains a column of 1's and a dummy variable for the treatment:
$$ X = \begin{pmatrix} 1_m & 1_m \\ 1_n & 0_n \end{pmatrix}. $$
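Example 3.1 goes on to derive the explicit form of H for this design. As a numerical illustration (my own sketch, not from the book), note that with this design matrix, the fitted values HY are simply the treatment-group and control-group means:

set.seed(31)
m = 5; n = 7
X = cbind(1, c(rep(1, m), rep(0, n)))     # intercept and treatment dummy
H = X %*% solve(t(X) %*% X) %*% t(X)
y = c(rnorm(m, mean = 2), rnorm(n, mean = 0))

# the fitted values are the within-group means
cbind(Hy = as.vector(H %*% y),
      group_mean = c(rep(mean(y[1:m]), m), rep(mean(y[-(1:m)]), n)))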
and
$$ X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} x_1^{t} \\ \vdots \\ x_n^{t} \end{pmatrix} = (X_1, \ldots, X_p) \in \mathbb{R}^{n\times p} $$
as the response and covariate matrices, respectively. Define the multiple OLS coefficient
matrix as
$$ \hat{B} = \arg\min_{B \in \mathbb{R}^{p\times q}} \sum_{i=1}^{n} \|y_i - B^{t}x_i\|^2 $$
Its columns equal the separate OLS coefficients:
$$ \hat{B}_1 = (X^{t}X)^{-1}X^{t}Y_1, \;\;\ldots,\;\; \hat{B}_q = (X^{t}X)^{-1}X^{t}Y_q. $$
Remark: This result tells us that the OLS fit with a vector outcome reduces to multiple
separate OLS fits, or, the OLS fit of a matrix Y on a matrix X reduces to the column-wise
OLS fits of Y on X.
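A small R sketch (mine, on simulated data) illustrates this remark: fitting a matrix outcome is the same as fitting its columns one by one:

set.seed(5)
n = 100; p = 3; q = 2
X = cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Y = matrix(rnorm(n * q), n, q)

B_hat = solve(t(X) %*% X, t(X) %*% Y)     # p x q coefficient matrix
# column j of B_hat equals the OLS fit of the j-th outcome on X
max(abs(B_hat - cbind(coef(lm(Y[, 1] ~ X - 1)),
                      coef(lm(Y[, 2] ~ X - 1)))))
# lm also accepts a matrix outcome directly and returns the same coefficients
max(abs(B_hat - coef(lm(Y ~ X - 1))))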
$$ (X, Y) = \begin{pmatrix} X_{(1)} & Y_{(1)} \\ \vdots & \vdots \\ X_{(K)} & Y_{(K)} \end{pmatrix}, $$
where the kth sample consists of (X(k), Y(k)) with $X_{(k)} \in \mathbb{R}^{n_k \times p}$ and $Y_{(k)} \in \mathbb{R}^{n_k}$ being the
covariate matrix and outcome vector. Note that $n = \sum_{k=1}^{K} n_k$. Let β̂ be the OLS coefficient
based on the full sample, and β̂(k) be the OLS coefficient based on the kth sample. Show
that
$$ \hat{\beta} = \sum_{k=1}^{K} W_{(k)}\hat{\beta}_{(k)}, $$
$$ Y_S = X_S b \;\Longrightarrow\; \hat{\beta}_S = X_S^{-1}Y_S $$
if XS is invertible, and β̂S = 0 otherwise. Show that the OLS coefficient equals a weighted
average of these subset coefficients:
$$ \hat{\beta} = \sum_{S} w_S \hat{\beta}_S $$
Remark: To prove this result, we can use Cramer’s rule to express the OLS coefficient
and use the Cauchy–Binet formula to expand the determinant of X t X. This result extends
Problem 2.1. Berman (1988) attributed it to Jacobi. Wu (1986) used it in analyzing the
statistical properties of OLS.
4
The Gauss–Markov Model and Theorem
E(β̂) = E (X t X)−1 X t Y
= (X t X)−1 X t E(Y )
= (X t X)−1 X t Xβ
= β.
□
We can decompose the response vector as
Y = Ŷ + ε̂,
where the fitted vector is Ŷ = X β̂ = HY and the residual vector is ε̂ = Y − Ŷ = (In − H)Y.
The two matrices H and In − H are the keys, which have the following properties.
HX = X, (In − H)X = 0,
These follow from simple linear algebra, and I leave the proof as Problem 4.1. It states
that H and In − H are projection matrices onto the column space of X and its complement.
Algebraically, Ŷ and ε̂ are orthogonal by the OLS projection because Lemma 4.1 implies
Ŷ t ε̂ = Y t H t (In − H)Y
= Y t H(In − H)Y
= 0.
and
$$ \mathrm{cov}\begin{pmatrix} \hat{Y} \\ \hat{\varepsilon} \end{pmatrix} = \sigma^2 \begin{pmatrix} H & 0 \\ 0 & I_n - H \end{pmatrix}. $$
Please do not confuse the two statements above. First, Ŷ and ε̂ are orthogonal.
Second, Ŷ and ε̂ are uncorrelated. They have different meanings. The first statement is an
algebraic fact of the OLS procedure. It is about a relationship between two vectors Ŷ
and ε̂ which holds without assuming the Gauss–Markov model. The second statement is
stochastic. It is about a relationship between two random vectors Ŷ and ε̂ which requires
the Gauss–Markov model assumption.
Proof of Theorem 4.2: The conclusion follows from the simple fact that
$$ \begin{pmatrix} \hat{Y} \\ \hat{\varepsilon} \end{pmatrix} = \begin{pmatrix} HY \\ (I_n - H)Y \end{pmatrix} = \begin{pmatrix} H \\ I_n - H \end{pmatrix} Y $$
is a linear transformation of Y . It has mean
$$ E\begin{pmatrix} \hat{Y} \\ \hat{\varepsilon} \end{pmatrix} = \begin{pmatrix} H \\ I_n - H \end{pmatrix} E(Y) = \begin{pmatrix} H \\ I_n - H \end{pmatrix} X\beta = \begin{pmatrix} HX\beta \\ (I_n - H)X\beta \end{pmatrix} = \begin{pmatrix} X\beta \\ 0 \end{pmatrix}, $$
and covariance matrix
$$ \mathrm{cov}\begin{pmatrix} \hat{Y} \\ \hat{\varepsilon} \end{pmatrix} = \begin{pmatrix} H \\ I_n - H \end{pmatrix}\mathrm{cov}(Y)\begin{pmatrix} H^{t} & (I_n - H)^{t} \end{pmatrix} = \sigma^2\begin{pmatrix} H \\ I_n - H \end{pmatrix}\begin{pmatrix} H & I_n - H \end{pmatrix} = \sigma^2\begin{pmatrix} H^2 & H(I_n - H) \\ (I_n - H)H & (I_n - H)^2 \end{pmatrix} = \sigma^2\begin{pmatrix} H & 0 \\ 0 & I_n - H \end{pmatrix}, $$
where the last step follows from Lemma 4.1. □
Assume the Gauss–Markov model. Although the original responses and error terms are
uncorrelated between units, with cov(εi, εj) = 0 for i ≠ j, the fitted values and the residuals
are correlated, with
$$ \mathrm{cov}(\hat{y}_i, \hat{y}_j) = \sigma^2 h_{ij}, \qquad \mathrm{cov}(\hat{\varepsilon}_i, \hat{\varepsilon}_j) = -\sigma^2 h_{ij} $$
for i ≠ j based on Theorem 4.2.
where
$$ \text{rss} = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 $$
is the residual sum of squares. However, Theorem 4.2 shows that ε̂i has mean zero and
variance σ 2 (1 − hii ), which is not the same as the variance of original εi . Consequently, rss
has mean
$$ E(\text{rss}) = \sigma^2\sum_{i=1}^{n} (1 - h_{ii}) = \sigma^2\{n - \mathrm{trace}(H)\} = \sigma^2(n - p). $$
Theorem 4.3 implies that σ̃ 2 is a biased estimator for σ 2 because E(σ̃ 2 ) = σ 2 (n − p)/n.
It underestimates σ 2 but with a large sample size n, the bias is small.
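A quick simulation (my own sketch, not in the original text) illustrates this: σ̂² = rss/(n − p) is unbiased for σ², while σ̃² = rss/n is biased downward by the factor (n − p)/n:

set.seed(6)
n = 30; p = 4; sigma2 = 2
X = cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta = rep(1, p)

sims = replicate(5000, {
  y   = as.vector(X %*% beta + rnorm(n, sd = sqrt(sigma2)))
  rss = sum(resid(lm(y ~ X - 1))^2)
  c(rss / (n - p), rss / n)               # unbiased vs naive estimator
})
rowMeans(sims)                             # approximately (sigma2, sigma2 * (n - p) / n)
c(sigma2, sigma2 * (n - p) / n)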
Theorem 4.4 Under Assumption 4.1, the OLS estimator β̂ for β is the best linear unbiased
estimator (BLUE) in the sense that3
cov(β̃) ⪰ cov(β̂)
Before proving Theorem 4.4, we need to understand its meaning and immediate impli-
cations. We do not compare the OLS estimator with any arbitrary estimators. In fact, we
restrict to the estimators that are linear and unbiased. Condition (C1) requires that β̃ is
a linear estimator. More precisely, it is a linear transformation of the response vector Y ,
where A can be any complex and possibly nonlinear function of X. Condition (C2) requires
that β̃ is an unbiased estimator for β, no matter what true value β takes.
Why do we restrict the estimator to be linear? The class of linear estimators is actually
quite large because A can be any nonlinear function of X, and the only requirement is that
the estimator is linear in Y. Unbiasedness is a natural requirement for many problems.
However, in many modern applications with many covariates, some biased estimators can
perform better than unbiased estimators if they have smaller variances. We will discuss
these estimators in Part V of this book.
We compare the estimators based on their covariances, which are natural extensions of
variances for scalar random variables. The conclusion cov(β̃) ⪰ cov(β̂) implies that for any
vector c ∈ Rp, we have
$$ c^{t}\,\mathrm{cov}(\tilde{\beta})\,c \geq c^{t}\,\mathrm{cov}(\hat{\beta})\,c, $$
which is equivalent to
$$ \mathrm{var}(c^{t}\tilde{\beta}) \geq \mathrm{var}(c^{t}\hat{\beta}). $$
So any linear transformation of the OLS estimator has a variance smaller than or equal to
that of the same linear transformation of any other linear unbiased estimator. In particular, if c = (0, . . . , 1, . . . , 0)ᵗ
with only the jth coordinate being 1, then the above inequality implies that
$$ \mathrm{var}(\tilde{\beta}_j) \geq \mathrm{var}(\hat{\beta}_j), \qquad (j = 1, \ldots, p). $$
So the OLS estimator has the smallest variance, coordinate by coordinate, among all linear unbiased estimators.
Now we prove the theorem.
Proof of Theorem 4.4: We must verify that the OLS estimator itself satisfies (C1) and
(C2). We have β̂ = ÂY with  = (X t X)−1 X t , and it is unbiased by Theorem 4.1.
First, the unbiasedness requirement implies that
$$ E(\tilde{\beta}) = \beta \;\Longrightarrow\; E(AY) = A\,E(Y) = AX\beta = \beta \;\Longrightarrow\; AX\beta = \beta $$
for any value of β. So
$$ AX = I_p \qquad (4.1) $$
must hold. In particular, the OLS estimator satisfies $\hat{A}X = (X^{t}X)^{-1}X^{t}X = I_p$.
Second, we can decompose the covariance of β̃ as
cov(β̃) = cov(β̂ + β̃ − β̂)
= cov(β̂) + cov(β̃ − β̂) + cov(β̂, β̃ − β̂) + cov(β̃ − β̂, β̂).
The last two terms are in fact zero. By symmetry, we only need to show that the third term
is zero:
$$ \begin{aligned} \mathrm{cov}(\hat{\beta}, \tilde{\beta} - \hat{\beta}) &= \mathrm{cov}\big\{\hat{A}Y, (A - \hat{A})Y\big\} \\ &= \hat{A}\,\mathrm{cov}(Y)(A - \hat{A})^{t} \\ &= \sigma^2 \hat{A}(A - \hat{A})^{t} \\ &= \sigma^2 (\hat{A}A^{t} - \hat{A}\hat{A}^{t}) \\ &= \sigma^2\big\{(X^{t}X)^{-1}X^{t}A^{t} - (X^{t}X)^{-1}X^{t}X(X^{t}X)^{-1}\big\} \\ &= \sigma^2\big\{(X^{t}X)^{-1}(AX)^{t} - (X^{t}X)^{-1}\big\} \\ &= \sigma^2\big\{(X^{t}X)^{-1}I_p - (X^{t}X)^{-1}\big\} \qquad \text{(by (4.1))} \\ &= 0. \end{aligned} $$
Assume xi must be in the interval [0, 1]. We want to choose their values to minimize
var(β̂). Assume that n is an even number. Find the minimizers xi ’s.
Hint: You may find the following probability result useful. For a random variable ξ in
the interval [0, 1], we have the following inequality
var(ξ) = E(ξ 2 ) − {E(ξ)}2
≤ E(ξ) − {E(ξ)}2
= E(ξ){1 − E(ξ)}
≤ 1/4.
The first inequality becomes an equality if and only if ξ = 0 or 1; the second inequality
becomes an equality if and only if E(ξ) = 1/2.
cov(β̄) ⪰ cov(β̂).
then
$$ \tilde{\beta} = \hat{\beta} + \begin{pmatrix} Y^{t}Q_1 Y \\ \vdots \\ Y^{t}Q_p Y \end{pmatrix} $$
is unbiased for β.
Remark: The above estimator β̃ is a quadratic function of Y . It is a nonlinear unbiased
estimator for β. It is not difficult to show the unbiasedness. More remarkably, Koopmann
(1982, Theorem 4.3) showed that under Assumption 4.1, any unbiased estimator for β must
have the form of β̃.
5
Normal Linear Model: Inference and Prediction
Under the Gauss–Markov model, we have calculated the first two moments of the OLS
estimator β̂ = (X t X)−1 X t Y :
E(β̂) = β,
cov(β̂) = σ 2 (X t X)−1 ,
and have shown that σ̂ 2 = ε̂t ε̂/(n − p) is unbiased for σ 2 , where ε̂ = Y − X β̂ is the
residual vector. The Gauss–Markov theorem further ensures that the OLS estimator is
BLUE. Although these results characterize the nice properties of the OLS estimator, they
do not fully determine its distribution and are thus inadequate for statistical inference.
This chapter will derive the joint distribution of (β̂, σ̂ 2 ) under the Normal linear model
with stronger distribution assumptions.
Y ∼ N(Xβ, σ 2 In ),
or, equivalently,
$$ y_i \stackrel{\text{ind}}{\sim} \mathrm{N}(x_i^{t}\beta, \sigma^2), \qquad (i = 1, \ldots, n), $$
where the design matrix X is fixed with linearly independent column vectors. The unknown
parameters are (β, σ 2 ).
We can also write the Normal linear model as a linear function of covariates with error
terms:
Y = Xβ + ε
or, equivalently,
yi = xti β + εi , (i = 1, . . . , n),
where
$$ \varepsilon \sim \mathrm{N}(0, \sigma^2 I_n) \quad \text{or} \quad \varepsilon_i \stackrel{\text{iid}}{\sim} \mathrm{N}(0, \sigma^2), \; (i = 1, \ldots, n). $$
Assumption 5.1 implies Assumption 4.1. Beyond the Gauss–Markov model, it further
requires IID Normal error terms. Assumption 5.1 is extremely strong, but it is canonical in
statistics. It allows us to derive elegant formulas and also justifies the outputs of the linear
regression functions in many statistical packages. I will relax it in Chapter 6.
then ct β = βj is the jth element of β which measures the impact of xij on yi on average.
Standard software packages report statistical inference for each element of β. Sometimes we
may also be interested in βj − βj ′ , the difference between the coefficients of two covariates,
which corresponds to c = (0, . . . , 0, 1, 0, . . . , 0, −1, 0, . . . , 0)t = ej − ej ′ .
Theorem 5.1 implies that
$$ c^{t}\hat{\beta} \sim \mathrm{N}\big(c^{t}\beta, \; \sigma^2 c^{t}(X^{t}X)^{-1}c\big). $$
However, this is not directly useful because σ² is unknown. With σ² replaced by σ̂², the
standardized quantity
$$ T_c \equiv \frac{c^{t}\hat{\beta} - c^{t}\beta}{\sqrt{\hat{\sigma}^2\, c^{t}(X^{t}X)^{-1}c}} $$
does not follow N(0, 1) anymore. In fact, it follows a t distribution, as shown in Theorem 5.3
below.
Theorem 5.3 Under Assumption 5.1, Tc ∼ tn−p.
Proof of Theorem 5.3: From Theorem 5.1, the standardized distribution with the true
σ 2 follows
$$ \frac{c^{t}\hat{\beta} - c^{t}\beta}{\sqrt{\sigma^2 c^{t}(X^{t}X)^{-1}c}} \sim \mathrm{N}(0, 1), $$
σ̂ 2 /σ 2 ∼ χ2n−p /(n − p), and they are independent. These facts imply that
$$ T_c = \frac{c^{t}\hat{\beta} - c^{t}\beta}{\sqrt{\hat{\sigma}^2 c^{t}(X^{t}X)^{-1}c}} = \frac{c^{t}\hat{\beta} - c^{t}\beta}{\sqrt{\sigma^2 c^{t}(X^{t}X)^{-1}c}} \Big/ \sqrt{\frac{\hat{\sigma}^2}{\sigma^2}} \sim \frac{\mathrm{N}(0, 1)}{\sqrt{\chi^2_{n-p}/(n-p)}}, $$
where N(0, 1) and χ2n−p denote independent standard Normal and χ2n−p random variables,
respectively, with a little abuse of notation. Therefore, Tc ∼ tn−p by the definition of the t
distribution. □
In Theorem 5.3, the left-hand side depends on the observed data and the unknown true
parameters, but the right-hand side is a random variable depending on only the dimension
(n, p) of X, but neither the data nor the true parameters. We call the quantity on the
left-hand side a pivotal quantity. Based on the quantiles of the tn−p random variable, we
can tie the data and the true parameter via the following probability statement
$$ \mathrm{pr}\left\{ \left| \frac{c^{t}\hat{\beta} - c^{t}\beta}{\sqrt{\hat{\sigma}^2 c^{t}(X^{t}X)^{-1}c}} \right| \leq t_{1-\alpha/2,\, n-p} \right\} = 1 - \alpha $$
for any 0 < α < 1, where t1−α/2,n−p is the 1 − α/2 quantile of tn−p . When n − p is large
(e.g. larger than 30), the 1 − α/2 quantile of tn−p is close to that of N(0, 1). In particular,
t97.5%,n−p ≈ 1.96, the 97.5% quantile of N(0, 1), which is the critical value for the 95%
confidence interval.
Define
$$ \sqrt{\hat{\sigma}^2 c^{t}(X^{t}X)^{-1}c} \equiv \widehat{\mathrm{se}}_c, $$
which is often called the (estimated) standard error of ct β̂. Using this definition, we can
equivalently write the above probability statement as
$$ \mathrm{pr}\big\{ c^{t}\hat{\beta} - t_{1-\alpha/2,\,n-p}\,\widehat{\mathrm{se}}_c \leq c^{t}\beta \leq c^{t}\hat{\beta} + t_{1-\alpha/2,\,n-p}\,\widehat{\mathrm{se}}_c \big\} = 1 - \alpha. $$
We use
$$ c^{t}\hat{\beta} \pm t_{1-\alpha/2,\,n-p}\,\widehat{\mathrm{se}}_c $$
as a 1 − α level confidence interval for ct β. By duality of confidence interval and hypothesis
testing, we can also construct a level α test for ct β. More precisely, we reject the null
hypothesis ct β = d if the above confidence interval does not cover d, for a fixed number d.
As an important case, c = ej so ctβ = βj. Standard software packages, for example,
R, report the point estimator β̂j, the standard error $\widehat{\mathrm{se}}_j = \sqrt{\hat{\sigma}^2\,[(X^{t}X)^{-1}]_{jj}}$, the t statistic
$T_j = \hat{\beta}_j/\widehat{\mathrm{se}}_j$, and the two-sided p-value pr(|tn−p| ≥ |Tj|) for testing whether βj equals zero
or not. Section 5.4 below gives some examples.
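The following R sketch (added for illustration; simulated data) reproduces these quantities by hand and checks them against summary() and confint():

set.seed(7)
n = 100; p = 3
X = cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y = as.vector(X %*% c(1, 0.5, -0.3) + rnorm(n))
fit = lm(y ~ X - 1)

j      = 2                                        # inference for the 2nd coefficient
sigma2 = sum(resid(fit)^2) / (n - p)
se_j   = sqrt(sigma2 * solve(t(X) %*% X)[j, j])
t_j    = coef(fit)[j] / se_j
c(coef(fit)[j], se_j, t_j,
  2 * pt(abs(t_j), df = n - p, lower.tail = FALSE))   # estimate, se, t, p-value
summary(fit)$coef[j, ]                                # should match the line above
coef(fit)[j] + c(-1, 1) * qt(0.975, df = n - p) * se_j  # 95% confidence interval
confint(fit)[j, ]                                     # should match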
Example 5.1 If
$$ C = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}, $$
then
$$ C\beta = \begin{pmatrix} \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} $$
contains all the coefficients except for the first one (the intercept in most cases). Most
software packages report the test of the joint significance of (β2, . . . , βp). Section 5.4 below
gives some examples.
Example 5.2 Another leading application is to test whether β2 = 0 in the following regres-
sion partitioned by X = (X1 , X2 ) where X1 and X2 are n × k and n × l matrices:
Y = X1 β1 + X2 β2 + ε,
with
$$ C = \begin{pmatrix} 0_{l\times k} & I_l \end{pmatrix} \;\Longrightarrow\; C\beta = \begin{pmatrix} 0_{l\times k} & I_l \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \beta_2. $$
We will discuss this partitioned regression in more detail in Chapters 7 and 8.
Now we will focus on the generic problem of inferring Cβ. To avoid degeneracy, we
assume that C does not have redundant rows, quantified below.
$$ C\hat{\beta} - C\beta \sim \mathrm{N}\big(0, \; \sigma^2 C(X^{t}X)^{-1}C^{t}\big) $$
The above chi-squared distribution follows from the property of the quadratic form of a
Normal in Theorem B.10, where σ 2 C(X t X)−1 C t is a positive definite matrix1 . Again this
is not directly useful with unknown σ 2 . Replacing σ 2 with the unbiased estimator σ̂ 2 and
using a scaling factor l, we can obtain a pivotal quantity that has an F distribution as
summarized in Theorem 5.4 below.
where f1−α,l,n−p is the upper α quantile of the Fl,n−p distribution. By duality of the con-
fidence region and hypothesis testing, we can also construct a level α test for Cβ. Most
statistical packages automatically report the p-value based on the F statistic in Example
5.1.
As a final remark, the statistics in Theorems 5.3 and 5.4 are called the Wald-type
statistics.
1 Because X has linearly independent columns, X t X is a non-degenerate and thus positive definite matrix.
Since ut C(X t X)−1 C t u ≥ 0, to show that C(X t X)−1 C t is non-degenerate, we only need to show that
ut C(X t X)−1 C t u = 0 =⇒ u = 0.
From $u^{t}C(X^{t}X)^{-1}C^{t}u = 0$, we know $C^{t}u = u_1 c_1 + \cdots + u_l c_l = 0$. Since the rows of C are linearly independent, we must have u = 0.
yn+1 ∼ N(xtn+1 β, σ 2 )
$$ x_{n+1}^{t}\hat{\beta} \pm t_{1-\alpha/2,\,n-p}\,\widehat{\mathrm{se}}_{x_{n+1}}. $$
Second, we can predict yn+1 itself, which is a random variable. We can still use xtn+1 β̂ as
a natural unbiased predictor but need to modify the prediction interval. Because yn+1 is independent of β̂,
we have
$$ y_{n+1} - x_{n+1}^{t}\hat{\beta} \sim \mathrm{N}\big(0, \; \sigma^2 + \sigma^2 x_{n+1}^{t}(X^{t}X)^{-1}x_{n+1}\big), $$
and therefore
$$ \frac{y_{n+1} - x_{n+1}^{t}\hat{\beta}}{\sqrt{\hat{\sigma}^2 + \hat{\sigma}^2 x_{n+1}^{t}(X^{t}X)^{-1}x_{n+1}}} = \frac{y_{n+1} - x_{n+1}^{t}\hat{\beta}}{\sqrt{\sigma^2 + \sigma^2 x_{n+1}^{t}(X^{t}X)^{-1}x_{n+1}}} \Big/ \sqrt{\frac{\hat{\sigma}^2}{\sigma^2}} \sim \frac{\mathrm{N}(0, 1)}{\sqrt{\chi^2_{n-p}/(n-p)}}, $$
where N(0, 1) and χ2n−p denote independent standard Normal and χ2n−p random variables,
respectively, with a little abuse of notation. Therefore,
$$ \frac{y_{n+1} - x_{n+1}^{t}\hat{\beta}}{\sqrt{\hat{\sigma}^2 + \hat{\sigma}^2 x_{n+1}^{t}(X^{t}X)^{-1}x_{n+1}}} \sim t_{n-p}. $$
The variance $\sigma^2 + \sigma^2 x_{n+1}^{t}(X^{t}X)^{-1}x_{n+1}$ has two components. The first one has magnitude close to σ², which is of constant
order. The second one has a magnitude decreasing in n if $n^{-1}\sum_{i=1}^{n} x_i x_i^{t}$ converges to a finite
limit with large n. Therefore, the first component dominates the second one with large n,
which results in the main difference between predicting the mean of yn+1 and predicting
yn+1 itself. Using the notation
$$ \widehat{\mathrm{pe}}_{x_{n+1}} = \sqrt{\hat{\sigma}^2 + \hat{\sigma}^2 x_{n+1}^{t}(X^{t}X)^{-1}x_{n+1}}, $$
we can construct the following 1 − α level prediction interval:
$$ x_{n+1}^{t}\hat{\beta} \pm t_{1-\alpha/2,\,n-p}\,\widehat{\mathrm{pe}}_{x_{n+1}}. $$
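The following R sketch (my own, on simulated data) computes the confidence and prediction intervals directly from these formulas and checks them against predict():

set.seed(8)
n = 100
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
fit = lm(y ~ x)

x_new = data.frame(x = 1.5)
X     = model.matrix(fit)                  # design matrix with the intercept
x1    = c(1, 1.5)                          # covariate vector of the new unit
s2    = sum(resid(fit)^2) / (n - 2)
m     = sum(x1 * coef(fit))                # point prediction
se_ci = sqrt(s2 * t(x1) %*% solve(t(X) %*% X) %*% x1)        # se for the mean
se_pi = sqrt(s2 + s2 * t(x1) %*% solve(t(X) %*% X) %*% x1)   # "pe" for a new outcome
tq    = qt(0.975, df = n - 2)

rbind(by_hand_ci = c(m, m - tq * se_ci, m + tq * se_ci),
      predict_ci = predict(fit, x_new, interval = "confidence")[1, ],
      by_hand_pi = c(m, m - tq * se_pi, m + tq * se_pi),
      predict_pi = predict(fit, x_new, interval = "prediction")[1, ])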
With the fitted line, we can predict childHeight at different values of midparentHeight. In
the predict function, if we specify interval = "confidence", it gives the confidence intervals for
the means of the new outcomes; if we specify interval = "prediction", it gives the prediction
intervals for the new outcomes themselves.
> new_mph = seq(60, 80, by = 0.5)
> new_data = data.frame(midparentHeight = new_mph)
> new_ci = predict(galton_fit, new_data,
+                  interval = "confidence")
> new_pi = predict(galton_fit, new_data,
+                  interval = "prediction")
> round(head(cbind(new_ci, new_pi)), 3)
     fit    lwr    upr    fit    lwr    upr
1 60.878 59.744 62.012 60.878 54.126 67.630
2 61.197 60.122 62.272 61.197 54.454 67.939
3 61.515 60.499 62.531 61.515 54.782 68.249
4 61.834 60.877 62.791 61.834 55.109 68.559
5 62.153 61.254 63.051 62.153 55.436 68.869
6 62.471 61.632 63.311 62.471 55.762 69.180
Figure 5.1 plots the fitted line as well as the confidence intervals and prediction intervals
at level 95%. The file code5.4.1.R contains the R code.
Call:
lm(formula = re78 ~ ., data = lalonde)

Residuals:
   Min     1Q Median     3Q    Max
 -9612  -4355  -1572   3054  53119
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.567e+02  3.522e+03   0.073  0.94193
age          5.357e+01  4.581e+01   1.170  0.24284
educ         4.008e+02  2.288e+02   1.751  0.08058 .
black       -2.037e+03  1.174e+03  -1.736  0.08331 .
hisp         4.258e+02  1.565e+03   0.272  0.78562
married     -1.463e+02  8.823e+02  -0.166  0.86835
nodegr      -1.518e+01  1.006e+03  -0.015  0.98797
re74         1.234e-01  8.784e-02   1.405  0.16079
re75         1.974e-02  1.503e-01   0.131  0.89554
u74          1.380e+03  1.188e+03   1.162  0.24590
u75         -1.071e+03  1.025e+03  -1.045  0.29651
treat        1.671e+03  6.411e+02   2.606  0.00948 **
The above result shows that none of the pretreatment covariates is significant. It is also
of interest to test whether they are jointly significant. The result below shows that they are
only marginally significant at the level 0.05 based on a joint test.
> library("car")
> linearHypothesis(lalonde_fit,
+                  c("age = 0", "educ = 0", "black = 0",
+                    "hisp = 0", "married = 0", "nodegr = 0",
+                    "re74 = 0", "re75 = 0", "u74 = 0",
+                    "u75 = 0"))
Linear hypothesis test
Hypothesis:
age = 0
educ = 0
black = 0
hisp = 0
married = 0
nodegr = 0
re74 = 0
re75 = 0
u74 = 0
u75 = 0
Below I create two pseudo datasets: one with all units assigned to the treatment, and
the other with all units assigned to the control, fixing all the pretreatment covariates. The
predicted outcomes are the counterfactual outcomes under the treatment and control. I
further calculate their means and verify that their difference equals the OLS coefficient of
treat.
> new_treat = lalonde
> new_treat$treat = 1
> predict_lalonde1 = predict(lalonde_fit, new_treat,
+                            interval = "none")
> new_control = lalonde
> new_control$treat = 0
> predict_lalonde0 = predict(lalonde_fit, new_control,
+                            interval = "none")
> mean(predict_lalonde1)
[1] 6276.91
> mean(predict_lalonde0)
[1] 4606.201
>
> mean(predict_lalonde1) - mean(predict_lalonde0)
[1] 1670.709
where
$$ \hat{\sigma}^2 = \big\{(m-1)S_z^2 + (n-1)S_w^2\big\}/(m + n - 2) $$
with the sample means
$$ \bar{z} = m^{-1}\sum_{i=1}^{m} z_i, \qquad \bar{w} = n^{-1}\sum_{i=1}^{n} w_i, $$
Remark: The name “equal” is motivated by the “var.equal” parameter of the R function
t.test.
2. We can write the above problem as testing hypothesis H0 : β1 = 0 in the linear regression
Y = Xβ + ε with
$$ Y = \begin{pmatrix} z_1 \\ \vdots \\ z_m \\ w_1 \\ \vdots \\ w_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_m \\ \varepsilon_{m+1} \\ \vdots \\ \varepsilon_{m+n} \end{pmatrix}. $$
Based on the Normal linear model, we can compute the t statistic. Show that it is
identical to tequal .
H0 : β1 = · · · = βJ
Remarks: (1) This is Fisher’s F statistic. (2) In this linear model formulation, X does not
contain a column of 1’s. (3) The choice of C is not unique, but the final formula for F is.
(4) You may use the Sherman–Morrison formula in Problem 1.3.
and under the homoskedasticity assumption, the t-statistic associated with β̂ equals
$$ \frac{\hat{\rho}_{xy}}{\sqrt{(1 - \hat{\rho}_{xy}^2)/(n-2)}}. $$
The equivalence of the t-statistics from the OLS fit of y on x and that of x on y demonstrates
that based on OLS, the data do not contain any information about the direction of the
relationship between x and y.
5.10 An application
The R package sampleSelection (Toomet and Henningsen, 2008) describes the dataset RandHIE
as follows: “The RAND Health Insurance Experiment was a comprehensive study of health
care cost, utilization and outcome in the United States. It is the only randomized study
of health insurance, and the only study which can give definitive evidence as to the causal
effects of different health insurance plans.” You can find more detailed information about
other variables in this package. The main outcome of interest lnmeddol means the log of
medical expenses. Use linear regression to investigate the relationship between the outcome
and various important covariates.
Note that the solution to this problem is not unique, but you need to justify your choice
of covariates and model, and need to interpret the results.
6
Asymptotic Inference in OLS: the
Eicker–Huber–White (EHW) robust standard error
6.1 Motivation
Standard software packages, for example, R, report the point estimator, standard error, and
p-value for each coordinate of β based on the Normal linear model:
Y = Xβ + ε ∼ N(Xβ, σ 2 In ).
Statistical inference based on this model is finite-sample exact. However, the assumptions of
this model are extremely strong: the functional form is linear, the error terms are additive
with distributions not dependent on X, and the error terms are IID Normal with the same
variance. If we do not believe some of these assumptions, can we still trust the associated
statistical inference? Let us start with some simple numerical examples, with the R code in
code6.1.R.
> Simu2 = replicate(5000,
+                   { y = xbeta + rexp(n)
+                     ols.fit = lm(y ~ x)
+                     c(summary(ols.fit)$coef[2, 1:2],
+                       sqrt(hccm(ols.fit)[2, 2]))
+                   })
The (1, 2) panel of Figure 6.1 corresponds to this setting. With non-Normal errors, β̂ is
still symmetric and bell-shaped around the true parameter 1, and the estimated standard
errors are close to the true one. So the Normality of the error terms does not seem to be
a crucial assumption for the validity of the inference procedure under the Normal linear
model.
We then generate errors from Normal with variance depending on x:
> Simu3 = replicate(5000,
+                   { y = xbeta + rnorm(n, 0, abs(x))
+                     ols.fit = lm(y ~ x)
+                     c(summary(ols.fit)$coef[2, 1:2],
+                       sqrt(hccm(ols.fit)[2, 2]))
+                   })
The (2, 1) panel of Figure 6.1 corresponds to this setting. With heteroskedastic Normal
errors, β̂ is symmetric and bell-shaped around the true parameter 1, se2 is close to se0, but
se1 underestimates se0. So the heteroskedasticity of the error terms does not change the
Normality of the OLS estimator dramatically, although the statistical inference discussed
in Chapter 5 can be invalid.
Finally, we generate heteroskedastic non-Normal errors:
> Simu4 = replicate(5000,
+                   { y = xbeta + runif(n, -x^2, x^2)
+                     ols.fit = lm(y ~ x)
+                     c(summary(ols.fit)$coef[2, 1:2],
+                       sqrt(hccm(ols.fit)[2, 2]))
+                   })
The (2, 2) panel of Figure 6.1 corresponds to this setting, which has a similar pattern as the
(2, 1) panel. So the Normality of the error terms is not crucial, but the homoskedasticity is.
yi = xti β + εi ,
where the εi ’s are independent with mean zero and variance σi2 (i = 1, . . . , n). The design
matrix X is fixed with linearly independent column vectors, and (β, σ12 , . . . , σn2 ) are unknown
parameters.
Because the error terms can have different variances, they are not IID in general under
the heteroskedastic linear model. Their variances can be functions of the xi ’s, and the
variances σi2 are n free unknown numbers. Again, treating the xi ’s as fixed is not essential,
because we can condition on them if they are random. Without imposing Normality on the
error terms, we cannot determine the finite sample exact distribution of the OLS estimator.
This chapter will use the asymptotic analysis, assuming that the sample size n is large so
that certain limiting theorems hold.
The asymptotic analysis later will show that if the error terms are homoskedastic, i.e.,
σi2 = σ 2 for all i = 1, . . . , n, we can still trust the statistical inference discussed in Chapter
5 based on the Normal linear model as long the central limit theorem (CLT) for the OLS
estimator holds as n → ∞. If the error terms are heteroskedastic, i.e., their variances
are different, we must adjust the standard error with the so-called Eicker–Huber–White
heteroskedasticity robust standard error. I will give the technical details below. If you are
unfamiliar with the asymptotic analysis, please first review the basics in Chapter C.
$$ \begin{aligned} \hat{\beta} &= B_n^{-1}\, n^{-1}\sum_{i=1}^{n} x_i y_i \\ &= B_n^{-1}\, n^{-1}\sum_{i=1}^{n} x_i (x_i^{t}\beta + \varepsilon_i) \\ &= B_n^{-1} B_n \beta + B_n^{-1}\, n^{-1}\sum_{i=1}^{n} x_i \varepsilon_i \\ &= \beta + B_n^{-1}\, n^{-1}\sum_{i=1}^{n} x_i \varepsilon_i. \end{aligned} $$
□
In the representation of Lemma 6.1, Bn is fixed and ξn is random. Since E(ξn ) = 0, we
know that E(β̂) = β, so the OLS estimator is unbiased. Moreover,
$$ \mathrm{cov}(\xi_n) = \mathrm{cov}\Big(n^{-1}\sum_{i=1}^{n} x_i\varepsilon_i\Big) = n^{-2}\sum_{i=1}^{n} \sigma_i^2 x_i x_i^{t} = M_n/n, $$
where
$$ M_n = n^{-1}\sum_{i=1}^{n} \sigma_i^2 x_i x_i^{t}. $$
Therefore,
$$ \mathrm{cov}(\hat{\beta}) = B_n^{-1}\,\mathrm{cov}(\xi_n)\,B_n^{-1} = B_n^{-1} M_n B_n^{-1}/n. $$
It has a sandwich form, justifying the choice of notation Bn for the “bread” and Mn for the
“meat.”
Intuitively, if Bn and Mn have finite limits, then the covariance of β̂ shrinks to zero with
large n, implying that β̂ will concentrate near its mean β. This is the idea of consistency,
formally stated below.
Proof of Theorem 6.1: We only need to show that ξn → 0 in probability. It has mean
zero and covariance matrix Mn /n, so it converges to zero in probability using Proposition
C.4 in Chapter C. □
Theorem 6.2 Under Assumptions 6.1 and 6.2, if there exist a δ > 0 and C > 0 not
depending on n such that d2+δ,n ≤ C, then
$$ \sqrt{n}(\hat{\beta} - \beta) \to \mathrm{N}(0, B^{-1}MB^{-1}) $$
in distribution.
Proof of Theorem 6.2: The key is to show the CLT for ξn, and the CLT for β̂ then holds due
to Slutsky's Theorem; see Chapter C for a review. Define
zn,i = n−1/2 xi εi , (i = 1, . . . , n)
with mean zero and finite covariance, and we need to verify the two conditions required
by the Lindeberg–Feller CLT stated as Proposition C.8 in Chapter C. First, the Lyapunov
condition holds because
$$ \sum_{i=1}^{n} E\|z_{n,i}\|^{2+\delta} = \sum_{i=1}^{n} E\big\{ n^{-(2+\delta)/2}\|x_i\|^{2+\delta}|\varepsilon_i|^{2+\delta} \big\} = n^{-\delta/2} \times n^{-1}\sum_{i=1}^{n} \|x_i\|^{2+\delta} E|\varepsilon_i|^{2+\delta} = n^{-\delta/2} \times d_{2+\delta,n} \to 0. $$
Second,
$$ \sum_{i=1}^{n} \mathrm{cov}(z_{n,i}) = n^{-1}\sum_{i=1}^{n} \sigma_i^2 x_i x_i^{t} = M_n \to M. $$
So the Lindeberg–Feller CLT implies that $n^{-1/2}\sum_{i=1}^{n} x_i\varepsilon_i = \sum_{i=1}^{n} z_{n,i} \to \mathrm{N}(0, M)$ in distribution. □
as the sample analog for M is not directly useful because the error terms are unknown
either. It is natural to use ε̂2i to replace ε2i to obtain the following estimator for M :
$$ \hat{M}_n = n^{-1}\sum_{i=1}^{n} \hat{\varepsilon}_i^2 x_i x_i^{t}. $$
Although each ε̂2i is a poor estimator for σi2 , the sample average M̂n turns out to be
well-behaved with large n and the regularity conditions below.
Theorem 6.3 Under Assumptions 6.1 and 6.2, we have M̂n → M in probability if
$$ n^{-1}\sum_{i=1}^{n} \mathrm{var}(\varepsilon_i^2)\,x_{ij_1}^2 x_{ij_2}^2, \qquad n^{-1}\sum_{i=1}^{n} x_{ij_1}x_{ij_2}x_{ij_3}x_{ij_4}, \qquad n^{-1}\sum_{i=1}^{n} \sigma_i^2\, x_{ij_1}^2 x_{ij_2}^2 x_{ij_3}^2 \qquad (6.1) $$
are bounded above by constants not depending on n, for all $j_1, j_2, j_3, j_4 = 1, \ldots, p$.
Proof of Theorem 6.3: Assumption 6.2 ensures that β̂ → β in probability by Theorem 6.1.
Markov’s inequality and the boundedness of the first term in (6.1) ensure that M̃n −Mn → 0
in probability. So we only need to show that M̂n − M̃n → 0 in probability. The (j1 , j2 )th
element of their difference is
$$ \begin{aligned} (\hat{M}_n - \tilde{M}_n)_{j_1, j_2} &= n^{-1}\sum_{i=1}^{n} \hat{\varepsilon}_i^2 x_{i,j_1} x_{i,j_2} - n^{-1}\sum_{i=1}^{n} \varepsilon_i^2 x_{i,j_1} x_{i,j_2} \\ &= n^{-1}\sum_{i=1}^{n} \Big\{ \big(\varepsilon_i + x_i^{t}\beta - x_i^{t}\hat{\beta}\big)^2 - \varepsilon_i^2 \Big\} x_{i,j_1} x_{i,j_2} \\ &= n^{-1}\sum_{i=1}^{n} \Big\{ \big(x_i^{t}\beta - x_i^{t}\hat{\beta}\big)^2 + 2\varepsilon_i\big(x_i^{t}\beta - x_i^{t}\hat{\beta}\big) \Big\} x_{i,j_1} x_{i,j_2} \\ &= (\beta - \hat{\beta})^{t}\Big\{ n^{-1}\sum_{i=1}^{n} x_i x_i^{t}\, x_{i,j_1} x_{i,j_2} \Big\}(\beta - \hat{\beta}) + 2(\beta - \hat{\beta})^{t}\, n^{-1}\sum_{i=1}^{n} x_i\, x_{i,j_1} x_{i,j_2}\, \varepsilon_i. \end{aligned} $$
It converges to zero in probability because the first term converges to zero due to the
boundedness of the second term in (6.1), and the second term converges to zero in probability
due to Markov’s inequality and the boundedness of the third term in (6.1). □
The final variance estimator for β̂ is
$$ \hat{V}_{\textsc{ehw}} = n^{-1}\Big(n^{-1}\sum_{i=1}^{n} x_i x_i^{t}\Big)^{-1}\Big(n^{-1}\sum_{i=1}^{n} \hat{\varepsilon}_i^2 x_i x_i^{t}\Big)\Big(n^{-1}\sum_{i=1}^{n} x_i x_i^{t}\Big)^{-1}. $$
In finite samples, it can behave poorly. Since White (1980a) published his paper,
several modifications of V̂ehw have appeared, aiming for better finite-sample properties. I summarize
them below. They all rely on the hii's, the diagonal elements of the projection
matrix H, which are called the leverage scores. Define
$$ \hat{V}_{\textsc{ehw},k} = n^{-1}\Big(n^{-1}\sum_{i=1}^{n} x_i x_i^{t}\Big)^{-1}\Big(n^{-1}\sum_{i=1}^{n} \hat{\varepsilon}_{i,k}^2 x_i x_i^{t}\Big)\Big(n^{-1}\sum_{i=1}^{n} x_i x_i^{t}\Big)^{-1}, $$
where
$$ \hat{\varepsilon}_{i,k} = \begin{cases} \hat{\varepsilon}_i, & (k = 0,\ \text{HC0}); \\ \hat{\varepsilon}_i\sqrt{\dfrac{n}{n-p}}, & (k = 1,\ \text{HC1}); \\ \hat{\varepsilon}_i/\sqrt{1 - h_{ii}}, & (k = 2,\ \text{HC2}); \\ \hat{\varepsilon}_i/(1 - h_{ii}), & (k = 3,\ \text{HC3}); \\ \hat{\varepsilon}_i/(1 - h_{ii})^{\min\{2,\, n h_{ii}/(2p)\}}, & (k = 4,\ \text{HC4}). \end{cases} $$
The HC1 correction is similar to the degrees of freedom correction in the OLS covariance
estimator. The HC2 correction was motivated by the unbiasedness of covariance when the
error terms have the same variance; see Problem 6.8 for more details. The HC3 correction
was motivated by a method called the jackknife, which will be discussed in Chapter 11. This
version appeared even earlier than White (1980a); see Miller (1974), Hinkley (1977), and
Reeds (1978). I do not have a good intuition for the HC4 correction. See MacKinnon and
White (1985), Long and Ervin (2000) and Cribari-Neto (2004) for reviews. Using simulation
studies, Long and Ervin (2000) recommended HC3.
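The following R sketch (added for illustration, assuming the car package is installed; it is not part of the original text) computes HC0 through HC3 directly from the formulas above and checks them against car::hccm():

library(car)                    # assumed installed; provides hccm()
set.seed(9)
n = 200
x = rnorm(n)
y = 1 + x + rnorm(n, sd = abs(x))          # heteroskedastic errors
fit = lm(y ~ x)

X = model.matrix(fit)
e = resid(fit)
h = hatvalues(fit)                          # leverage scores h_ii
bread = solve(t(X) %*% X)
meat  = function(w) t(X) %*% (w * X)        # X' diag(w) X
sand  = function(w) bread %*% meat(w) %*% bread

V_hc0 = sand(e^2)
V_hc1 = sand(e^2 * n / (n - 2))             # p = 2 covariates here
V_hc2 = sand(e^2 / (1 - h))
V_hc3 = sand(e^2 / (1 - h)^2)

max(abs(V_hc0 - hccm(fit, type = "hc0")))
max(abs(V_hc1 - hccm(fit, type = "hc1")))
max(abs(V_hc2 - hccm(fit, type = "hc2")))
max(abs(V_hc3 - hccm(fit, type = "hc3")))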
which simplifies the covariance of β̂ to cov(β̂) = σ²Bn⁻¹/n, and the asymptotic Normality to
$$ \sqrt{n}(\hat{\beta} - \beta) \to \mathrm{N}(0, \sigma^2 B^{-1}) $$
in distribution. We have shown that under the Gauss–Markov model, $\hat{\sigma}^2 = (n-p)^{-1}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$ is unbiased for σ². Moreover, σ̂² is consistent for σ² under
the same condition as Theorem 6.1, justifying the use of
$$ \hat{V} = \hat{\sigma}^2\Big(\sum_{i=1}^{n} x_i x_i^{t}\Big)^{-1} = \hat{\sigma}^2 (X^{t}X)^{-1}. $$
It is slightly different from the inference based on t and F distributions. But with large n,
the difference is very small.
I will end this section with a formal result on the consistency of σ̂ 2 .
Theorem 6.4 Under Assumptions 6.1 and 6.2, we have σ̂² → σ² in probability if σi² =
σ² < ∞ for all i = 1, . . . , n and $n^{-1}\sum_{i=1}^{n}\mathrm{var}(\varepsilon_i^2)$ is bounded above by a constant not
depending on n.
Proof of Theorem 6.4: Using Markov's inequality, we can show that $n^{-1}\sum_{i=1}^{n}\varepsilon_i^2 \to \sigma^2$
in probability. In addition, $n^{-1}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$ has the same probability limit as σ̂². So we only
need to show that $n^{-1}\sum_{i=1}^{n}\hat{\varepsilon}_i^2 - n^{-1}\sum_{i=1}^{n}\varepsilon_i^2 \to 0$ in probability. Their difference is
$$ \begin{aligned} n^{-1}\sum_{i=1}^{n}\hat{\varepsilon}_i^2 - n^{-1}\sum_{i=1}^{n}\varepsilon_i^2 &= n^{-1}\sum_{i=1}^{n}\Big\{ \big(\varepsilon_i + x_i^{t}\beta - x_i^{t}\hat{\beta}\big)^2 - \varepsilon_i^2 \Big\} \\ &= n^{-1}\sum_{i=1}^{n}\Big\{ \big(x_i^{t}\beta - x_i^{t}\hat{\beta}\big)^2 + 2\big(x_i^{t}\beta - x_i^{t}\hat{\beta}\big)\varepsilon_i \Big\} \\ &= (\beta - \hat{\beta})^{t}\Big( n^{-1}\sum_{i=1}^{n} x_i x_i^{t}\Big)(\beta - \hat{\beta}) + 2(\beta - \hat{\beta})^{t}\, n^{-1}\sum_{i=1}^{n} x_i\varepsilon_i \\ &= -(\beta - \hat{\beta})^{t}\Big( n^{-1}\sum_{i=1}^{n} x_i x_i^{t}\Big)(\beta - \hat{\beta}), \end{aligned} $$
where the last step follows from Lemma 6.1. So the difference converges to zero in probability
because β̂ − β → 0 in probability by Theorem 6.1 and Bn → B by Assumption 6.2. □
6.5 Examples
I use three examples to compare various standard errors for the regression coefficients, with
the R code in code6.5.R. The car package contains the hccm function that implements the
EHW standard errors.
> library("car")
However, if we apply the log transformation on the outcome, then all standard errors
give similar t-values.
> ols.fit = lm(log(multish + 1) ~ lnpop + lnpopsq + lngdp + lncolony
+              + lndist + freedom + militexp + arms
+              + year83 + year86 + year89 + year92, data = dat)
> ols.fit.hc0 = sqrt(diag(hccm(ols.fit, type = "hc0")))
> ols.fit.hc1 = sqrt(diag(hccm(ols.fit, type = "hc1")))
> ols.fit.hc2 = sqrt(diag(hccm(ols.fit, type = "hc2")))
> ols.fit.hc3 = sqrt(diag(hccm(ols.fit, type = "hc3")))
> ols.fit.hc4 = sqrt(diag(hccm(ols.fit, type = "hc4")))
> ols.fit.coef = summary(ols.fit)$coef
> tvalues = ols.fit.coef[, 1]/
+   cbind(ols.fit.coef[, 2], ols.fit.hc0, ols.fit.hc1,
+         ols.fit.hc2, ols.fit.hc3, ols.fit.hc4)
> colnames(tvalues) = c("ols", "hc0", "hc1", "hc2", "hc3", "hc4")
> round(tvalues, 2)
              ols   hc0   hc1   hc2   hc3   hc4
(Intercept)  2.96  2.81  2.77  2.72  2.63  2.53
lnpop       -2.87 -2.63 -2.60 -2.54 -2.45 -2.35
lnpopsq      4.21  3.72  3.67  3.59  3.46  3.32
lngdp       -8.02 -7.49 -7.38 -7.38 -7.27 -7.33
lncolony     6.31  6.19  6.11  6.08  5.97  5.95
lndist      -0.16 -0.14 -0.14 -0.14 -0.14 -0.14
freedom      1.47  1.53  1.51  1.50  1.47  1.46
militexp    -0.32 -0.32 -0.31 -0.31 -0.30 -0.29
arms         1.27  1.12  1.10  1.05  0.98  0.86
year83       0.10  0.10  0.10  0.10  0.10  0.10
year86      -0.14 -0.14 -0.14 -0.14 -0.14 -0.14
year89       0.46  0.45  0.44  0.44  0.44  0.44
year92       0.03  0.03  0.03  0.03  0.03  0.03
In general, the difference between the OLS and EHW standard errors may be due to the
heteroskedasticity or the poor approximation of the linear model. The above two analyses
based on the original and transformed outcomes suggest that the linear approximation
works better for the log-transformed outcome. We will discuss the issues of transformation
and model misspecification later.
Call:
lm(formula = medv ~ ., data = BostonHousing)

Residuals:
    Min      1Q  Median      3Q     Max
-15.595  -2.730  -0.518   1.777  26.199

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288
chas1        2.687e+00  8.616e-01   3.118 0.001925 **
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
b            9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
>
> ols.fit.hc0 = sqrt(diag(hccm(ols.fit, type = "hc0")))
> ols.fit.hc1 = sqrt(diag(hccm(ols.fit, type = "hc1")))
> ols.fit.hc2 = sqrt(diag(hccm(ols.fit, type = "hc2")))
> ols.fit.hc3 = sqrt(diag(hccm(ols.fit, type = "hc3")))
> ols.fit.hc4 = sqrt(diag(hccm(ols.fit, type = "hc4")))
> ols.fit.coef = summary(ols.fit)$coef
> tvalues = ols.fit.coef[, 1]/
+   cbind(ols.fit.coef[, 2], ols.fit.hc0, ols.fit.hc1,
+         ols.fit.hc2, ols.fit.hc3, ols.fit.hc4)
> colnames(tvalues) = c("ols", "hc0", "hc1", "hc2", "hc3", "hc4")
> round(tvalues, 2)
               ols   hc0   hc1   hc2   hc3   hc4
(Intercept)   7.14  4.62  4.56  4.48  4.33  4.25
crim         -3.29 -3.78 -3.73 -3.48 -3.17 -2.58
zn            3.38  3.42  3.37  3.35  3.27  3.28
indus         0.33  0.41  0.41  0.41  0.40  0.40
chas1         3.12  2.11  2.08  2.05  2.00  2.00
nox          -4.65 -4.76 -4.69 -4.64 -4.53 -4.52
rm            9.12  4.57  4.51  4.43  4.28  4.18
age           0.05  0.04  0.04  0.04  0.04  0.04
dis          -7.40 -6.97 -6.87 -6.81 -6.66 -6.66
rad           4.61  5.05  4.98  4.91  4.76  4.65
tax          -3.28 -4.65 -4.58 -4.54 -4.43 -4.42
ptratio      -7.28 -8.23 -8.11 -8.06 -7.89 -7.93
b             3.47  3.53  3.48  3.44  3.34  3.30
lstat       -10.35 -5.34 -5.27 -5.18 -5.01 -4.93
The log transformation of the outcome does not remove the discrepancy among the
standard errors. In this example, heteroskedasticity seems an important problem.
> ols.fit = lm(log(medv) ~ ., data = BostonHousing)
> summary(ols.fit)

Call:
lm(formula = log(medv) ~ ., data = BostonHousing)

Residuals:
     Min       1Q   Median       3Q      Max
-0.73361 -0.09747 -0.01657  0.09629  0.86435

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.1020423  0.2042726  20.081  < 2e-16 ***
crim        -0.0102715  0.0013155  -7.808 3.52e-14 ***
zn           0.0011725  0.0005495   2.134 0.033349 *
indus        0.0024668  0.0024614   1.002 0.316755
chas1        0.1008876  0.0344859   2.925 0.003598 **
nox         -0.7783993  0.1528902  -5.091 5.07e-07 ***
rm           0.0908331  0.0167280   5.430 8.87e-08 ***
age          0.0002106  0.0005287   0.398 0.690567
dis         -0.0490873  0.0079834  -6.149 1.62e-09 ***
rad          0.0142673  0.0026556   5.373 1.20e-07 ***
tax         -0.0006258  0.0001505  -4.157 3.80e-05 ***
ptratio     -0.0382715  0.0052365  -7.309 1.10e-12 ***
b            0.0004136  0.0001075   3.847 0.000135 ***
>
> ols.fit.hc0 = sqrt(diag(hccm(ols.fit, type = "hc0")))
> ols.fit.hc1 = sqrt(diag(hccm(ols.fit, type = "hc1")))
> ols.fit.hc2 = sqrt(diag(hccm(ols.fit, type = "hc2")))
> ols.fit.hc3 = sqrt(diag(hccm(ols.fit, type = "hc3")))
> ols.fit.hc4 = sqrt(diag(hccm(ols.fit, type = "hc4")))
> ols.fit.coef = summary(ols.fit)$coef
> tvalues = ols.fit.coef[, 1]/
+   cbind(ols.fit.coef[, 2], ols.fit.hc0, ols.fit.hc1,
+         ols.fit.hc2, ols.fit.hc3, ols.fit.hc4)
> colnames(tvalues) = c("ols", "hc0", "hc1", "hc2", "hc3", "hc4")
> round(tvalues, 2)
               ols   hc0   hc1   hc2   hc3   hc4
(Intercept)  20.08 14.29 14.09 13.86 13.43 13.13
crim         -7.81 -5.31 -5.24 -4.85 -4.39 -3.56
zn            2.13  2.68  2.64  2.62  2.56  2.56
indus         1.00  1.46  1.44  1.43  1.40  1.41
chas1         2.93  2.69  2.66  2.62  2.56  2.56
nox          -5.09 -4.79 -4.72 -4.67 -4.56 -4.54
rm            5.43  3.31  3.26  3.20  3.10  3.02
age           0.40  0.33  0.32  0.32  0.31  0.31
dis          -6.15 -6.12 -6.03 -5.98 -5.84 -5.82
rad           5.37  5.23  5.16  5.05  4.87  4.67
tax          -4.16 -5.05 -4.98 -4.90 -4.76 -4.69
ptratio      -7.31 -8.84 -8.72 -8.67 -8.51 -8.55
b             3.85  2.80  2.76  2.72  2.65  2.59
lstat       -14.30 -7.86 -7.75 -7.63 -7.40 -7.28
in distribution.
Remark: The name “unequal” is motivated by the “var.equal” parameter of the R function
t.test.
6.5 Breakdown of the equivalence of the t-statistics based on the EHW standard error
This problem parallels Problem 5.9.
Given data $(x_i, y_i)_{i=1}^n$ where both $x_i$ and $y_i$ are scalars, run the OLS fit of $y_i$ on $(1, x_i)$ to obtain $t_{y|x}$, the t-statistic of the coefficient of $x_i$ based on the EHW standard error, and run the OLS fit of $x_i$ on $(1, y_i)$ to obtain $t_{x|y}$, the t-statistic of the coefficient of $y_i$ based on the EHW standard error.
Give a counterexample with $t_{y|x} \neq t_{x|y}$.
[Figure 6.1 appears here: density plots of $\hat\beta$ over four panels (Normal vs. non-Normal errors; homoskedastic vs. heteroskedastic errors), each annotated with the values of se0, se1, and se2.]
FIGURE 6.1: Simulation with 5000 replications: “se0” denotes the true standard error of
β̂, “se1” denotes the estimated standard error based on the homoskedasticity assumption,
and “se2” denotes the Eicker–Huber–White standard error allowing for heteroskedasticity.
The density curves are Normal with mean 1 and standard deviation se0.
Part III
where X1 ∈ Rn×k , X2 ∈ Rn×l , β1 ∈ Rk and β2 ∈ Rl , then we can consider the long regression
$$
Y = X\hat\beta + \hat\varepsilon
= (X_1\ X_2)\begin{pmatrix}\hat\beta_1\\ \hat\beta_2\end{pmatrix} + \hat\varepsilon
= X_1\hat\beta_1 + X_2\hat\beta_2 + \hat\varepsilon,
$$
(Q1) if the true β1 is zero, then what is the consequence of including X1 in the long
regression?
(Q2) if the true β1 is not zero, then what is the consequence of omitting X1 in the short
regression?
(Q3) what is the difference between β̂2 and β̃2 ? Both of them are measures of the “impact”
of X2 on Y . Then why are they different? Does their difference give us any information
about β1 ?
Many problems in statistics are related to the long and short regressions. We will discuss
some applications in Chapter 8 and give a related result in Chapter 9.
Theorem 7.1 The OLS estimator for β2 in the short regression is β̃2 = (X2t X2 )−1 X2t Y ,
and the OLS estimator for β2 in the long regression has the following equivalent forms
where
$$
\begin{aligned}
S_{11} &= (X_1^tX_1)^{-1} + (X_1^tX_1)^{-1}X_1^tX_2(\tilde X_2^t\tilde X_2)^{-1}X_2^tX_1(X_1^tX_1)^{-1},\\
S_{12} &= -(X_1^tX_1)^{-1}X_1^tX_2(\tilde X_2^t\tilde X_2)^{-1},\\
S_{21} &= S_{12}^t,\\
S_{22} &= (\tilde X_2^t\tilde X_2)^{-1}.
\end{aligned}
$$
1 Professor Alan Agresti gave me the reference of Yule (1907).
2 See Problem 3.7 for more details.
I leave the proof of Lemma 7.1 as Problem 7.1. With Lemma 7.1, we can easily prove
Theorem 7.1.
Proof of Theorem 7.1: (Version 1) The OLS coefficient is
$$
\begin{pmatrix}\hat\beta_1\\ \hat\beta_2\end{pmatrix}
= (X^tX)^{-1}X^tY
= \begin{pmatrix}S_{11} & S_{12}\\ S_{21} & S_{22}\end{pmatrix}\begin{pmatrix}X_1^tY\\ X_2^tY\end{pmatrix}.
$$
Equation (7.5) is the form (7.2), and Equation (7.6) is the form (7.3). Because we also have
X2t (In − H1 )Y = X2t (In − H1 )2 Y = X̃2t Ỹ , we can write β̂2 as β̂2 = (X̃2t X̃2 )−1 X̃2t Ỹ , giving
the form (7.4). □
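To make the equivalent forms concrete, here is a small numerical check of my own (not from the text) in R; the simulated data and variable names are illustrative only.

set.seed(1)
n  = 100
X1 = cbind(1, rnorm(n))                  # includes the intercept column
X2 = cbind(rnorm(n), rnorm(n))
Y  = drop(X1 %*% c(1, 2) + X2 %*% c(-1, 3) + rnorm(n))

H1      = X1 %*% solve(t(X1) %*% X1) %*% t(X1)   # projection onto the columns of X1
X2tilde = (diag(n) - H1) %*% X2                  # residualized X2
Ytilde  = drop((diag(n) - H1) %*% Y)             # residualized Y

beta2.long     = coef(lm(Y ~ 0 + X1 + X2))[3:4]   # coefficient of X2 in the long regression
beta2.partial1 = coef(lm(Y ~ 0 + X2tilde))        # Y on residualized X2
beta2.partial2 = coef(lm(Ytilde ~ 0 + X2tilde))   # residualized Y on residualized X2
rbind(beta2.long, beta2.partial1, beta2.partial2) # three identical rows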
The second proof does not invert the block matrix of X t X directly.
Proof of Theorem 7.1: (Version 2) The OLS decomposition Y = X1 β̂1 +X2 β̂2 + ε̂ satisfies
$$
X^t\hat\varepsilon = 0
\Longrightarrow (X_1\ X_2)^t\hat\varepsilon = 0
\Longrightarrow X_1^t\hat\varepsilon = 0,\ X_2^t\hat\varepsilon = 0,
$$
which reduces to
(In − H1 )Y = (In − H1 )X2 β̂2 + ε̂
because (In − H1 )X1 = 0 and (In − H1 )ε̂ = ε̂ − H1 ε̂ = ε̂ − X1 (X1t X1 )−1 X1t ε̂ = ε̂. Further
multiplying X2t on both sides of the identity, we have
This projection matrix is closely related to the projection matrices induced by X and X1
as shown in the following lemma.
Lemma 7.2 is purely algebraic. I leave the proof as Problem 7.3. The first two identities
imply that the column space of X̃2 is orthogonal to the column space of X1 . The last
identity H = H1 + H̃2 has a clear geometric interpretation. For any vector v ∈ Rn , we have
Hv = H1 v + H̃2 v, so the projection of v onto the column space of X equals the summation
of the projection of v onto the column space of X1 and the projection of v onto the column
space of X̃2 . Importantly, H ̸= H1 + H2 in general.
Second, we can obtain β̂2 from (7.3) or (7.4), which corresponds to the partial regression
of Y on X̃2 or the partial regression of Ỹ on X̃2 . We can verify that the residual vector
from the second partial regression equals the residual vector from the full regression.
Corollary 7.1 We have ε̂ = ê, where ε̂ is the residual vector from the OLS fit of Y on X
and ê is the residual vector from the OLS fit of Ỹ on X̃2 , respectively.
It is important to note that this conclusion is only true if both Y and X2 are residualized.
The conclusion does not hold if we only residualize X2 . See Problem 7.2.
Proof of Corollary 7.1: We have ε̂ = (I − H)Y and
It suffices to show that I−H = (I−H̃2 )(I−H1 ), or, equivalently, I−H = I−H1 −H̃2 +H̃2 H1 .
This holds due to Lemma 7.2. □
Theorem 7.1 has been well known for a long time, but Theorem 7.2 is less well known. Lovell
(1963) hinted at the first identity in Theorem 7.2, and Ding (2021a) proved Theorem 7.2.
Proof of Theorem 7.2: By Corollary 7.1, the full regression and partial regression have
the same residual vector, denoted by ε̂. Therefore, Ω̂ehw = Ω̃ehw = diag{ε̂2 } in the EHW
covariance estimator.
Based on the full regression, define σ̂ 2 = ∥ε̂∥22 /(n − k − l). Then V̂ equals the (2, 2)th
block of σ̂ 2 (X t X)−1 , and V̂ehw equals the (2, 2)th block of (X t X)−1 X t Ω̂ehw X(X t X)−1 .
Based on the partial regression, define σ̃ 2 = ∥ε̂∥22 /(n − l). Then Ṽ = σ̃ 2 (X̃2t X̃2 )−1 and
Ṽehw = (X̃2t X̃2 )−1 X̃2t Ω̃ehw X̃2 (X̃2t X̃2 )−1 .
Let σ̂ 2 = ∥ε̂∥2 /(n − k − l) and σ̃ 2 = ∥ε̂∥2 /(n − l) be the common variance estimators.
They are identical up to the degrees of freedom correction. Under homoskedasticity, the
covariance estimator for β̂2 is the (2, 2)th block of σ̂ 2 (X t X)−1 , that is, σ̂ 2 S22 = σ̂ 2 (X̃2t X̃2 )−1
by Lemma 7.1, which is identical to the covariance estimator for β̃2 up to the degrees of
freedom correction.
The EHW covariance estimator from the full regression is the (2, 2) block of ÂΩ̂ehw Ât ,
where
$$
\hat A = (X^tX)^{-1}X^t
= \begin{pmatrix} * \\ -(\tilde X_2^t\tilde X_2)^{-1}X_2^tH_1 + (\tilde X_2^t\tilde X_2)^{-1}X_2^t \end{pmatrix}
= \begin{pmatrix} * \\ (\tilde X_2^t\tilde X_2)^{-1}\tilde X_2^t \end{pmatrix}
$$
by Lemma 7.1. I omit the ∗ term because it does not affect the final calculation. Define
Ã2 = (X̃2t X̃2 )−1 X̃2t , and then
which equals the EHW covariance estimator Ṽehw from the partial regression. □
Corollary 7.2 If X1t X2 = 0, i.e., the columns of X1 and X2 are orthogonal, then X̃2 = X2
and β̂2 = β̃2 .
Proof of Corollary 7.2: We can directly prove Corollary 7.2 by verifying that X t X is
block diagonal.
Alternatively, Corollary 7.2 follows directly from
X1 = U1 .
X2 = β̂X2 |U1 U1 + U2 ;
by OLS, U1 and U2 must be orthogonal. Regress X3 on (U1 , U2 ) to obtain the fitted and
residual vector
X3 = β̂X3 |U1 U1 + β̂X3 |U2 U2 + U3 ;
by Corollary 7.2, the OLS reduces to two univariate OLS and ensures that U3 is orthogonal
to both U1 and U2 . This justifies the notation β̂X3 |U1 and β̂X3 |U2 . Continue this procedure
to the last column vector:
$$
X_p = \sum_{j=1}^{p-1}\hat\beta_{X_p|U_j}U_j + U_p;
$$
Q = (Q1 , . . . , Qp ).
More interestingly, the column vectors of X and Q can linearly represent each other because
$$
\begin{aligned}
X &= (X_1,\ldots,X_p)\\
&= (U_1,\ldots,U_p)\begin{pmatrix}
1 & \hat\beta_{X_2|U_1} & \hat\beta_{X_3|U_1} & \cdots & \hat\beta_{X_p|U_1}\\
0 & 1 & \hat\beta_{X_3|U_2} & \cdots & \hat\beta_{X_p|U_2}\\
\vdots & \vdots & \vdots & & \vdots\\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}\\
&= Q\,\mathrm{diag}\{\|U_j\|\}_{j=1}^p
\begin{pmatrix}
1 & \hat\beta_{X_2|U_1} & \hat\beta_{X_3|U_1} & \cdots & \hat\beta_{X_p|U_1}\\
0 & 1 & \hat\beta_{X_3|U_2} & \cdots & \hat\beta_{X_p|U_2}\\
\vdots & \vdots & \vdots & & \vdots\\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}.
\end{aligned}
$$
We can verify that the product of the second and the third matrix is an upper triangular
matrix, denoted by R. By definition, the jth diagonal element of R equals ∥Uj ∥, and the
(j, j ′ )th element of R equals ∥Uj ∥β̂Xj′ |Uj for j ′ > j. Therefore, we can decompose X as
$$
X = QR,
$$
so that the normal equation simplifies:
$$
X^tX\hat\beta = X^tY
\Longleftrightarrow R^tQ^tQR\hat\beta = R^tQ^tY
\Longleftrightarrow R\hat\beta = Q^tY,
$$
[4,] -0.37627821 0.4476932 -0.9629320
[5,] -1.40848027 0.2735408 -0.8047917
[6,]  1.84878518 0.7290005  1.2688929
[7,]  0.06432856 0.2256284  0.3972229
> qrX = qr(X)
> qr.Q(qrX)
            [,1]        [,2]        [,3]
[1,] -0.19100878 -0.03460617  0.30340481
[2,] -0.58769981 -0.60442928  0.23753900
[3,] -0.01383151  0.21191991 -0.55839928
[4,] -0.12558257 -0.28728403 -0.62864750
[5,] -0.47007924 -0.07020076 -0.36640938
[6,]  0.61703067 -0.68778411 -0.09999859
[7,]  0.02146961 -0.16748246  0.01605493
> qr.R(qrX)
         [,1]       [,2]       [,3]
[1,] 2.996261 -0.3735673  0.0937788
[2,] 0.000000 -1.3950642 -2.1217223
[3,] 0.000000  0.0000000  2.4826186
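As a complementary sketch of my own (not the book's code), one can check that solving $R\hat\beta = Q^tY$ by back-substitution reproduces the lm coefficients; the simulated X and y below are illustrative.

set.seed(2)
X = matrix(rnorm(21), nrow = 7)          # a 7 x 3 covariate matrix
y = rnorm(7)                             # an illustrative outcome vector
qrX = qr(X)
beta.qr = backsolve(qr.R(qrX), t(qr.Q(qrX)) %*% y)  # solve R beta = Q^t y
beta.lm = coef(lm(y ~ 0 + X))                       # OLS without an intercept
cbind(beta.qr, beta.lm)                             # the two columns agree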
(R2) regress Y on X̃2 to obtain the coefficient, which equals β̂2 by the FWL Theorem.
Although partial regression (R2) can recover the OLS coefficient, the leverage scores
from (R2) are not the same as those from the long regression. Show that the summation of
the corresponding leverage scores from (R1) and (R2) equals the leverage scores from the
long regression.
Remark: The leverage scores are the diagonal elements of the hat matrix from OLS fits.
Chapter 6 mentioned them briefly, and Chapter 11 will discuss them in more detail.
Y = β̂1 X1 + β̂2 X2 + ε̂
and
Y = β̃1 X1 + β̃2 X̃2 + ε̃.
Show that
β̂2 = β̃2 , ε̂ = ε̃.
Hint: Use the result in Problem 3.4.
Remark: Choose A = (X1t X1 )−1 X1t X2 to be the coefficient matrix of the OLS fit of
X2 on X1 . The above result ensures that β̂2 equals β̃2 from the OLS fit of Y on X1 and
(In − H1 )X2 , which is coherent with the FWL theorem since X1t (In − H1 )X2 = 0.
where
$$
w_i = \frac{\tilde x_{i2}^2}{\left(\sum_{i=1}^n\tilde x_{i2}^2\right)^2}.
$$
Remark: You can use Theorems 7.1 and 7.2 to prove the result. The original formula of
the EHW covariance matrix has a complex form. However, using the FWL theorems, we
can simplify each of the squared EHW standard errors as a weighted average of the squared
residuals, or, equivalently, a simple quadratic form of the residual vector.
H = QQt ,
Q = Q1 , R = R1 .
8
Applications of the Frisch–Waugh–Lovell Theorem
The FWL theorem has many applications, and I will highlight some of them in this chapter.
$$
C_n = I_n - n^{-1}1_n1_n^t,
$$
$$
A_nY = \begin{pmatrix}\bar y\\ \vdots\\ \bar y\end{pmatrix} = 1_n\bar y,
$$
and
$$
C_nY = \begin{pmatrix}y_1-\bar y\\ \vdots\\ y_n-\bar y\end{pmatrix}.
$$
More generally, multiplying any matrix by An is equivalent to averaging each column, and
multiplying any matrix by Cn is equivalent to centering each column of that matrix, for
example,
$$
A_nX_2 = \begin{pmatrix}\bar x_2^t\\ \vdots\\ \bar x_2^t\end{pmatrix} = 1_n\bar x_2^t,
\qquad\text{and}\qquad
C_nX_2 = \begin{pmatrix}x_{12}^t-\bar x_2^t\\ \vdots\\ x_{n2}^t-\bar x_2^t\end{pmatrix},
$$
where $X_2$ contains row vectors $x_{12}^t,\ldots,x_{n2}^t$ with average $\bar x_2 = n^{-1}\sum_{i=1}^n x_{i2}$. The FWL
theorem implies that the coefficient of X2 in the OLS fit of Y on (1n, X2) equals the
coefficient of CnX2 in the OLS fit of CnY on CnX2, that is, the OLS fit of the centered
response vector on the column-wise centered X2. An immediate consequence is that if each
column of the design matrix is centered, then it does not matter whether we include the
column 1n when computing the OLS coefficients.
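Here is a minimal R sketch of my own illustrating this equivalence; the simulated data are illustrative only.

set.seed(3)
n  = 50
X2 = matrix(rnorm(n * 2), nrow = n)
Y  = drop(1 + X2 %*% c(2, -1) + rnorm(n))

fit.intercept = lm(Y ~ X2)                           # OLS with the intercept
Yc  = Y - mean(Y)                                    # centered outcome
X2c = scale(X2, center = TRUE, scale = FALSE)        # column-wise centered covariates
fit.centered = lm(Yc ~ 0 + X2c)                      # OLS without the intercept

cbind(coef(fit.intercept)[-1], coef(fit.centered))   # identical slope coefficients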
The centering matrix Cn has another property: its quadratic form equals the sample
variance multiplied by n − 1, for example,
$$
Y^tC_nY = Y^tC_n^tC_nY
= (y_1-\bar y,\ldots,y_n-\bar y)\begin{pmatrix}y_1-\bar y\\ \vdots\\ y_n-\bar y\end{pmatrix}
= \sum_{i=1}^n(y_i-\bar y)^2
= (n-1)\hat\sigma_y^2,
$$
Similarly,
$$
X^tC_nX
= \begin{pmatrix}X_1^tC_nX_1 & \cdots & X_1^tC_nX_p\\ \vdots & & \vdots\\ X_p^tC_nX_1 & \cdots & X_p^tC_nX_p\end{pmatrix}
= (n-1)\begin{pmatrix}\hat\sigma_{11} & \cdots & \hat\sigma_{1p}\\ \vdots & & \vdots\\ \hat\sigma_{p1} & \cdots & \hat\sigma_{pp}\end{pmatrix},
$$
where
$$
\hat\sigma_{j_1j_2} = (n-1)^{-1}\sum_{i=1}^n(x_{ij_1}-\bar x_{\cdot j_1})(x_{ij_2}-\bar x_{\cdot j_2})
$$
is the sample covariance between Xj1 and Xj2. So (n − 1)−1 X t Cn X equals the sample
covariance matrix of X. For these reasons, I choose the notation Cn with “C” standing for both
“centering” and “covariance.”
In another important special case, X1 contains the dummies for a discrete variable, for
example, the indicators for different treatment levels or groups. See Example 3.2 for the
$$
X_1 = \begin{pmatrix}
1 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
1 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 1\\
\vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 1\\
1 & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 0
\end{pmatrix}_{n\times k}
\quad\text{or}\quad
X_1 = \begin{pmatrix}
1 & \cdots & 0\\
\vdots & & \vdots\\
1 & \cdots & 0\\
\vdots & & \vdots\\
0 & \cdots & 1\\
\vdots & & \vdots\\
0 & \cdots & 1
\end{pmatrix}_{n\times k},
\tag{8.1}
$$
where the first form of X1 contains 1n and k − 1 dummy variables, and the second form of
X1 contains k dummy variables. In both forms of X1 , the observations are sorted according
to the group indicators. If we regress Y on X1 , the residual vector is
$$
Y - \begin{pmatrix}\bar y_{[1]}\\ \vdots\\ \bar y_{[1]}\\ \vdots\\ \bar y_{[k]}\\ \vdots\\ \bar y_{[k]}\end{pmatrix},
\tag{8.2}
$$
where ȳ[1] , . . . , ȳ[k] are the averages of the outcomes within groups 1, . . . , k. Effectively, we
center Y by group-specific means. Similarly, if we regress X2 on X1 , we center each column of
X2 by the group-specific means. Let Y c and X2c be the centered response vector and design
matrix. The FWL theorem implies that the OLS coefficient of X2 in the long regression is
the OLS coefficient of X2c in the partial regression of Y c on X2c . When k is large, running
the OLS with centered variables can reduce the computational cost.
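The following small simulation sketch (mine, not from the text) checks the group-centering claim numerically; the group labels and coefficients are arbitrary.

set.seed(4)
n  = 300
g  = sample(1:3, n, replace = TRUE)           # three groups
x2 = rnorm(n)
y  = 0.5 * x2 + c(0, 2, -1)[g] + rnorm(n)     # group-specific intercepts

fit.dummies = lm(y ~ factor(g) + x2)          # long regression with group dummies
yc  = y  - ave(y,  g)                         # center by group-specific means
x2c = x2 - ave(x2, g)
fit.centered = lm(yc ~ 0 + x2c)               # partial regression

c(coef(fit.dummies)["x2"], coef(fit.centered))  # identical coefficients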
measures the linear relationship between x and y. How do we measure the linear relationship
between x and y after controlling for some other variables w ∈ Rk−1 ? Intuitively, we can
measure it with the sample Pearson correlation coefficient based on the residuals from the
following two OLS fits:
(R1) run OLS of Y on (1, W ) and obtain residual vector ε̂y and residual sum of squares
rssy ;
(R2) run OLS of X on (1, W ) and obtain residual vector ε̂x and residual sum of squares
rssx .
With ε̂y and ε̂x, we can define the sample partial correlation coefficient between x and y
given w as
$$
\hat\rho_{yx|w} = \frac{\sum_{i=1}^n\hat\varepsilon_{x,i}\hat\varepsilon_{y,i}}{\sqrt{\sum_{i=1}^n\hat\varepsilon_{x,i}^2}\sqrt{\sum_{i=1}^n\hat\varepsilon_{y,i}^2}}.
$$
In the above definition, we do not center the residuals because they already have zero sample
means due to the inclusion of the intercept in the OLS fits (R1) and (R2). The sample partial
correlation coefficient determines the coefficient of ε̂x in the OLS fit of ε̂y on ε̂x:
$$
\hat\beta_{yx|w}
= \frac{\sum_{i=1}^n\hat\varepsilon_{x,i}\hat\varepsilon_{y,i}}{\sum_{i=1}^n\hat\varepsilon_{x,i}^2}
= \hat\rho_{yx|w}\sqrt{\frac{\sum_{i=1}^n\hat\varepsilon_{y,i}^2}{\sum_{i=1}^n\hat\varepsilon_{x,i}^2}}
= \hat\rho_{yx|w}\,\frac{\hat\sigma_{y|w}}{\hat\sigma_{x|w}},
\tag{8.3}
$$
where $\hat\sigma_{y|w}^2 = \text{rss}_y/(n-k)$ and $\hat\sigma_{x|w}^2 = \text{rss}_x/(n-k)$ are the variance estimators based
on regressions (R1) and (R2) motivated by the Gauss–Markov model. Based on the FWL
theorem, β̂yx|w equals the OLS coefficient of X in the long regression of Y on (1, X, W ).
Therefore, (8.3) is the Galtonian formula for multiple regression, which is analogous to that
for univariate regression (1.1).
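The next R sketch (mine, not from the text) verifies the Galtonian formula (8.3) numerically; all variable names and the simulated data are illustrative.

set.seed(5)
n = 200
w = rnorm(n)
x = 0.8 * w + rnorm(n)
y = 1 + 0.5 * x - 0.3 * w + rnorm(n)

res.y = resid(lm(y ~ w))                  # (R1): residualize y on (1, w)
res.x = resid(lm(x ~ w))                  # (R2): residualize x on (1, w)
rho.partial = cor(res.y, res.x)           # sample partial correlation
k = 2                                     # number of columns of (1, w)
sigma.y = sqrt(sum(res.y^2) / (n - k))
sigma.x = sqrt(sum(res.x^2) / (n - k))

c(rho.partial * sigma.y / sigma.x,        # Galtonian formula (8.3)
  coef(lm(y ~ x + w))["x"])               # coefficient of x in the long regression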
To investigate the relationship between y and x, different researchers may run different
regressions. One may run OLS of Y on (1, X, W ), and the other may run OLS of Y on
(1, X, W ′ ), where W ′ is a subset of W . Let β̂yx|w be the coefficient of X in the first regression,
and let β̂yx|w′ be the coefficient of X in the second regression. Mathematically, it is possible
that these two coefficients have different signs, which is called Simpson’s paradox1 . It is a
paradox because we expect both coefficients to measure the “impact” of X on Y . Because
these two coefficients have the same signs as the partial correlation coefficients ρ̂yx|w and
ρ̂yx|w′ , Simpson’s paradox is equivalent to
To simplify the presentation, we discuss the special case with w′ being an empty set. Simp-
son’s paradox is then equivalent to
Its proof is purely algebraic, so I leave it as Problem 8.6. Theorem 8.1 states that we
can obtain the sample partial correlation coefficient from the three pairwise correlation
coefficients. Figure 8.1 illustrates the interplay among the three variables. In particular, the
correlation between x and y arises from two “pathways”: one acting through w and one
acting independently of w. The first pathway is related to the product term ρ̂yw ρ̂xw, and
the second pathway is related to ρ̂yx|w. This gives some intuition for Theorem 8.1.
1 The usual form of Simpson’s paradox is in terms of a 2 × 2 × 2 table with all binary variables. Here we
Based on data $(y_i, x_i, w_i)_{i=1}^n$, we can compute the sample correlation matrix
$$
\hat R = \begin{pmatrix}1 & \hat\rho_{yx} & \hat\rho_{yw}\\ \hat\rho_{xy} & 1 & \hat\rho_{xw}\\ \hat\rho_{wy} & \hat\rho_{wx} & 1\end{pmatrix},
$$
which is symmetric and positive semi-definite. Simpson's paradox happens if and only if
$$
\hat\rho_{yx}(\hat\rho_{yx} - \hat\rho_{yw}\hat\rho_{xw}) < 0
\Longleftrightarrow
\hat\rho_{yx}^2 < \hat\rho_{yx}\hat\rho_{yw}\hat\rho_{xw}.
$$
We can observe Simpson's paradox in the following simulation with the R code in
code8.2.R.
> n = 1000
> w = rbinom(n, 1, 0.5)
> x1 = rnorm(n, -1, 1)
> x0 = rnorm(n, 2, 1)
> x = ifelse(w, x1, x0)
> y = x + 6 * w + rnorm(n)
> fit.xw = lm(y ~ x + w)$coef
> fit.x = lm(y ~ x)$coef
> fit.xw
(Intercept)           x           w
 0.05655442  0.97969907  5.92517072
> fit.x
(Intercept)           x
  3.6422978  -0.3743368
Because w is binary, we can plot (x, y) in each group of w = 1 and w = 0 in Figure 8.2.
In both groups, y and x are positively associated with positive regression coefficients; but
in the pooled data, y and x are negatively associated with a negative regression coefficient.
FIGURE 8.2: An example of Simpson's paradox. The two solid regression lines are fitted
separately using the data from the two groups, and the dashed regression line is fitted using
the pooled data.
Now I will discuss testing H0 from an alternative perspective based on comparing the
residual sum of squares in the long regression (8.4) and the short regression (8.5). This
technique is called the analysis of variance (ANOVA), pioneered by R. A. Fisher in the
design and analysis of experiments. Intuitively, if β2 = 0, then the residual vectors from the
long regression (8.4) and the short regression (8.5) should not be “too different.” However,
because of the error term ε, these residuals are random, so the key is to quantify the magnitude
of the difference. Define
rsslong = Y t (In − H)Y
and
rssshort = Y t (In − H1 )Y
as the residual sum of squares from the long and short regressions, respectively. By the
definition of OLS, it must be true that
rsslong ≤ rssshort
and
rssshort − rsslong = Y t (H − H1 )Y ≥ 0. (8.6)
To understand the magnitude of the change in the residual sum of squares, we can stan-
dardize the above difference and define
$$
F_{\text{anova}} = \frac{(\text{rss}_{\text{short}} - \text{rss}_{\text{long}})/l}{\text{rss}_{\text{long}}/(n-p)},
$$
In the definition of the above statistic, l and n − p are the degrees of freedom to make
the mathematics more elegant, but they do not change the discussion fundamentally. The
denominator of Fanova is σ̂², so we can also write it as
$$
F_{\text{anova}} = \frac{\text{rss}_{\text{short}} - \text{rss}_{\text{long}}}{l\hat\sigma^2}.
\tag{8.7}
$$
The following theorem states that these two perspectives yield an identical test statistic.
Theorem 8.2 Under Assumption 5.1, if β2 = 0, then Fanova ∼ Fl,n−p. In fact, Fanova =
FWald, which is a numerical identity that holds without Assumption 5.1.
I divide the proof into two parts. The first part derives the exact distribution of Fanova
under the Normal linear model. It relies on the following lemma on basic properties of
projection matrices, whose proof I relegate to Problem 8.9.
recalling that H̃2 = X̃2 (X̃2t X̃2 )−1 X̃2t is the projection matrix onto the column space of X̃2 .
Therefore, Fanova = FWald follows from the basic identity H − H1 = H̃2 ensured by Lemma
7.2. □
We can use the anova function in R to compute the F statistic and the p-value. Below
I revisit the lalonde data with the R code in code8.3.R. The result is identical to that in
Section 5.4.2.
> library ( " Matching " )
> data ( lalonde )
> lalonde _ full = lm ( re 7 8 ~ . , data = lalonde )
> lalonde _ treat = lm ( re 7 8 ~ treat , data = lalonde )
> anova ( lalonde _ treat , lalonde _ full )
Analysis of Variance Table
Model 1 : re 7 8 ~ treat
Model 2 : re 7 8 ~ age + educ + black + hisp + married + nodegr + re 7 4 +
re 7 5 + u 7 4 + u 7 5 + treat
Res . Df RSS Df Sum of Sq F Pr ( > F )
1 443 1.9178e+10
2 433 1.8389e+10 10 788799023 1.8574 0.04929 *
Model 1 : re 7 8 ~ 1
Model 2 : re 7 8 ~ treat
Model 3 : re 7 8 ~ age + educ + black + hisp + married + nodegr + re 7 4 +
re 7 5 + u 7 4 + u 7 5 + treat
Res . Df RSS Df Sum of Sq F Pr ( > F )
1 444 1.9526e+10
2 443 1 . 9 1 7 8 e + 1 0 1 3 4 8 0 1 3 4 5 6 8 . 1 9 4 6 0 . 0 0 4 4 0 5 **
3 433 1.8389e+10 10 788799023 1.8574 0.049286 *
Overall, the treatment variable is significantly related to the outcome, but none of the
pretreatment covariates is.
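To connect this output with formula (8.7), here is a small sketch of my own that computes the F statistic directly from the two residual sums of squares and compares it with anova(); the simulated data are illustrative.

set.seed(6)
n  = 100
x1 = rnorm(n); x2 = rnorm(n); x3 = rnorm(n)
y  = 1 + x1 + rnorm(n)                  # x2 and x3 have zero true coefficients

fit.long  = lm(y ~ x1 + x2 + x3)
fit.short = lm(y ~ x1)
rss.long  = sum(resid(fit.long)^2)
rss.short = sum(resid(fit.short)^2)
l = 2                                   # number of tested coefficients
p = 4                                   # number of regressors including the intercept
F.anova = ((rss.short - rss.long) / l) / (rss.long / (n - p))

c(F.anova, anova(fit.short, fit.long)$F[2])   # the two values coincide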
regression
$$
Y = X\hat\beta + \hat\varepsilon
= (1_n\ X_1\ X_2)\begin{pmatrix}\hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\end{pmatrix} + \hat\varepsilon
= 1_n\hat\beta_0 + X_1\hat\beta_1 + X_2\hat\beta_2 + \hat\varepsilon,
$$
and the short regression
$$
Y = 1_n\tilde\beta_0 + X_2\tilde\beta_2 + \tilde\varepsilon,
$$
where $\hat\beta = \begin{pmatrix}\hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\end{pmatrix}$ and $\begin{pmatrix}\tilde\beta_0\\ \tilde\beta_2\end{pmatrix}$ are the OLS coefficients, and ε̂ and ε̃ are the residual
vectors from the long and short regressions, respectively.
Prove the following theorem.
Theorem 8.3 The OLS estimator for β2 in the long regression equals the coefficient of X̃2
in the OLS fit of Y on (1n , X̃2 ), where X̃2 is the residual matrix of the column-wise OLS
fit of X2 on (1n , X1 ), and also equals the coefficient of X̃2 in the OLS fit of Ỹ on (1n , X̃2 ),
where Ỹ is the residual vector of the OLS fit of Y on (1n , X1 ).
where p is the total number of regressors and ρ̂yx1 |x2 is the sample partial correlation
coefficient between y and x1 given x2 .
Remark: Frank (2000) applied this formula to causal inference.
$$
\hat\sigma^{jk} = 0 \Longleftrightarrow \hat\rho_{x_jx_k|x_{\backslash(j,k)}} = 0
$$
where ρ̂xj xk |x\(j,k) is the partial correlation coefficient of Xj and Xk given all other variables.
9
Cochran’s Formula and Omitted-Variable Bias
This suggests that β̃2 = β̂2 + δ̂β̂1. The above derivation follows from simple algebraic
manipulations and does not use any properties of the OLS. To prove Theorem 9.1, we need
to verify that the last line is indeed the OLS fit of Y on X2. The proof is in fact very simple.
Proof of Theorem 9.1: Based on the above discussion, we only need to show that (9.4)
is the OLS fit of Y on X2, which is equivalent to showing that Û β̂1 + ε̂ is orthogonal to all
columns of X2. This follows from
$$
X_2^t(\hat U\hat\beta_1 + \hat\varepsilon) = X_2^t\hat U\hat\beta_1 + X_2^t\hat\varepsilon = 0,
$$
because X2t Û = 0 based on the OLS fit in (9.3) and X2t ε̂ = 0 based on the OLS fit in (9.1).
□
Figure 9.1 illustrates Theorem 9.1. Intuitively, β̃2 measures the total impact of X2 on
Y , which has two channels: β̂2 measures the impact acting directly and δ̂ β̂1 measures the
impact acting indirectly through X1 .
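Here is a short R sketch of my own that verifies Cochran's formula numerically in a scalar example; the simulated data and coefficients are illustrative.

set.seed(7)
n  = 500
x2 = rnorm(n)
x1 = 1 + 2 * x2 + rnorm(n)                    # x1 depends on x2
y  = 0.5 * x1 - 1 * x2 + rnorm(n)

beta.long  = coef(lm(y ~ x1 + x2))            # long regression
beta.short = coef(lm(y ~ x2))["x2"]           # short regression: omit x1
delta      = coef(lm(x1 ~ x2))["x2"]          # regression of x1 on x2

c(beta.short, beta.long["x2"] + beta.long["x1"] * delta)   # identical values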
Figure 9.1 shows the interplay among three variables. Theoretically, we can discuss a
system of more than three variables which is called the path model. This more advanced topic
is beyond the scope of this book. Wright (1921, 1934)’s initial discussion of this approach
was motivated by genetic studies. See Freedman (2009) for a textbook introduction.
and interpret β̃1 as the treatment effect estimate. However, observational studies may suffer
from unmeasured confounding, that is, the treatment and control units differ in unobserved
but important ways. In the simplest case, the above OLS may have omitted a variable ui
for each unit i, which is called a confounder. The oracle OLS is
and the coefficient β̂1 is an unbiased estimator if the model with ui is correct. With X1
containing the values of the ui ’s and X2 containing the values of the (1, zi , xti )’s, Cochran’s
formula implies that
$$
\begin{pmatrix}\tilde\beta_0\\ \tilde\beta_1\\ \tilde\beta_2\end{pmatrix}
= \begin{pmatrix}\hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\end{pmatrix}
+ \hat\beta_3\begin{pmatrix}\hat\delta_0\\ \hat\delta_1\\ \hat\delta_2\end{pmatrix},
$$
where (δ̂0 , δ̂1 , δ̂2t )t is the coefficient vector in the OLS fit of ui on (1, zi , xi ). Therefore, we
can quantify the difference between the observed estimate β̃1 and oracle estimate β̂1 :
where the bar and subscript jointly denote the sample mean of a particular variable within
a treatment group. So
So
Both (9.5) and (9.6) give some insights into the bias due to omitting an important
covariate u. It is clear that the bias depends on β̂3 , which quantifies the relationship between
u and y. The formula (9.5) shows that the bias also depends on the imbalance in means
of u across the treatment and control groups, after adjusting for the observed covariates
x, that is, the imbalance in means of the residual confounding. The formula (9.6) shows
a more explicit formula of the bias. The above discussion is often called bias analysis in
epidemiology or sensitivity analysis in statistics and econometrics.
for some other variables x. Let Z, M, Y be n × 1 vectors representing the observed values of
z, m, y, and let X be the n × p matrix representing the observations of x. The question of
interest is to assess the “direct” and “indirect” effects of z on y, acting independently and
through m, respectively. We do not need to define these notions precisely since we are only
interested in the algebraic property below.
The Baron–Kenny method runs the OLS
and interprets β̂1 as the estimator of the “direct effect” of z on y. The “indirect effect” of
z on y through m has two estimators. First, based on the OLS
define the difference estimator as β̃1 − β̂1 . Second, based on the OLS
define the product estimator as γ̂1 β̂2 . Figure 9.2 illustrates the OLS fits used in defining the
estimators.
Prove that
β̃1 − β̂1 = γ̂1 β̂2
that is, the difference estimator and product estimator are numerically identical.
and
Y = β̃0 1n + β̃1 X1 + ε̃Y .
Cochran’s Formula and Omitted-Variable Bias 85
Show that
β̃1 = β̂1 + β̂2 θ̂1 + β̂3 δ̂1 + β̂3 δ̂2 θ̂1 .
Remark: The OLS coefficient of X1 in the short regression of Y on (1n , X1 ) equals the
summation of all the path coefficients from X1 to Y as illustrated by Figure 9.3. This
problem is a special case of the path model, but the conclusion holds in general.
is a linear transformation of the coefficient from the long regression, which further justifies
the EHW covariance estimator
$$
\tilde V_2' = (\hat\delta, I_l)(X^tX)^{-1}X^t\,\mathrm{diag}(\hat\varepsilon^2)\,X(X^tX)^{-1}\begin{pmatrix}\hat\delta^t\\ I_l\end{pmatrix}.
$$
Show that
Ṽ2′ = (X2t X2 )−1 X2t diag(ε̂2 )X2 (X2t X2 )−1 .
Hint: Use the result in Problem 7.1.
Remark: Based on Theorem 7.2, the EHW covariance estimator for β̂2 is
V̂2 = (X̃2t X̃2 )−1 X̃2t diag(ε̂2 )X̃2 (X̃2t X̃2 )−1 ,
and
E(σ̃ 2 ) = σ 2 + β1t X1t (In − H2 )X1 β1 /(n − l) ≥ σ 2 .
Part IV
This chapter will introduce the R2 , the multiple correlation coefficient, also called the coef-
ficient of determination (Wright, 1921). It can achieve two goals: first, it extends the sample
Pearson correlation coefficient between two scalars to a measure of correlation between a
scalar outcome and a vector covariate; second, it measures how well multiple covariates can
linearly represent an outcome.
I leave the proof of Lemma 10.1 as Problem 10.1. Lemma 10.1 states that the total sum
of squares $\sum_{i=1}^n(y_i-\bar y)^2$ equals the regression sum of squares $\sum_{i=1}^n(\hat y_i-\bar y)^2$ plus the residual
sum of squares $\sum_{i=1}^n(y_i-\hat y_i)^2$. From Lemma 10.1, R² must lie within the interval [0, 1]; it
measures the proportion of the total sum of squares explained by the regression sum of squares.
An immediate consequence of Lemma 10.1 is that
$$
\text{rss} = (1-R^2)\sum_{i=1}^n(y_i-\bar y)^2.
$$
We can also verify that R2 is the squared sample Pearson correlation coefficient between
Y and Ŷ .
Theorem 10.1 We have $R^2 = \hat\rho_{y\hat y}^2$ where
$$
\hat\rho_{y\hat y} = \frac{\sum_{i=1}^n(y_i-\bar y)(\hat y_i-\bar y)}{\sqrt{\sum_{i=1}^n(y_i-\bar y)^2}\sqrt{\sum_{i=1}^n(\hat y_i-\bar y)^2}}.
\tag{10.1}
$$
I leave the proof of Theorem 10.1 as Problem 10.2. It states that the multiple correlation
coefficient equals the squared Pearson correlation coefficient between yi and ŷi . Although
the sample Pearson correlation coefficient can be positive or negative, R2 is always non-
negative. Geometrically, R2 equals the squared cosine of the angle between the centered
vectors Y − ȳ1n and Ŷ − ȳ1n ; see Chapter A.1.
In terms of long and short regressions, we can partition the design matrix into 1n and
X, then the OLS fit of the long regression is
with β̃0 = ȳ. The total sum of squares is the residual sum of squares from the short regression
so by Lemma 10.1, R2 also equals
$$
R^2 = \frac{\text{rss}_{\text{short}} - \text{rss}_{\text{long}}}{\text{rss}_{\text{short}}}.
\tag{10.4}
$$
Y = 1n β0 + Xβ + ε, ε ∼ N(0, σ 2 In ), (10.5)
we can use the F statistic to test whether β = 0. This F statistic is a monotone function
of R2 . Most standard software packages report both F and R2 . I first give a numeric result
without assuming that model (10.5) is correct.
Proof of Theorem 10.2: Based on the long regression (10.2) and the short regression
(10.3), we have (10.4) and
$$
F = \frac{\chi^2_{p-1}/(p-1)}{\chi^2_{n-p}/(n-p)},
$$
where $\chi^2_{p-1}$ and $\chi^2_{n-p}$ denote independent $\chi^2_{p-1}$ and $\chi^2_{n-p}$ random variables, respectively,
with a little abuse of notation. Using Theorem 10.2, we have
$$
\frac{R^2}{1-R^2} = F\times\frac{p-1}{n-p} = \frac{\chi^2_{p-1}}{\chi^2_{n-p}},
$$
which implies
$$
R^2 = \frac{\chi^2_{p-1}}{\chi^2_{p-1}+\chi^2_{n-p}}.
$$
Because $\chi^2_{p-1}\sim\text{Gamma}\!\left(\frac{p-1}{2},\frac{1}{2}\right)$ and $\chi^2_{n-p}\sim\text{Gamma}\!\left(\frac{n-p}{2},\frac{1}{2}\right)$ by Proposition B.1, we
have
$$
R^2 = \frac{\text{Gamma}\!\left(\frac{p-1}{2},\frac{1}{2}\right)}{\text{Gamma}\!\left(\frac{p-1}{2},\frac{1}{2}\right)+\text{Gamma}\!\left(\frac{n-p}{2},\frac{1}{2}\right)}
$$
I then use the data from King and Roberts (2015) to verify Theorems 10.1 and 10.2
numerically.
> library(foreign)
> dat = read.dta("isq.dta")
> dat = na.omit(dat[, c("multish", "lnpop", "lnpopsq",
+                       "lngdp", "lncolony", "lndist",
+                       "freedom", "militexp", "arms",
+                       "year83", "year86", "year89", "year92")])
>
> ols.fit = lm(log(multish + 1) ~ lnpop + lnpopsq + lngdp + lncolony
+              + lndist + freedom + militexp + arms
+              + year83 + year86 + year89 + year92,
+              y = TRUE, data = dat)
> ols.summary = summary(ols.fit)
> r2 = ols.summary$r.squared
> all.equal(r2, (cor(ols.fit$y, ols.fit$fitted.values))^2,
+           check.names = FALSE)
[1] TRUE
>
> fstat = ols.summary$fstatistic
> all.equal(fstat[1], fstat[3]/fstat[2]*r2/(1 - r2),
+           check.names = FALSE)
[1] TRUE
10.4 Partial R2
The form (10.4) of R2 is well defined in more general long and short regressions:
$$
Y = 1_n\hat\beta_0 + X\hat\beta + W\hat\gamma + \hat\varepsilon_Y
$$
and
$$
Y = 1_n\tilde\beta_0 + W\tilde\gamma + \tilde\varepsilon_Y,
$$
where X is an n × k matrix and W is an n × l matrix. Define the partial R² between Y and
X given W as
$$
R^2_{Y.X|W} = \frac{\text{rss}(Y\sim 1_n + W) - \text{rss}(Y\sim 1_n + X + W)}{\text{rss}(Y\sim 1_n + W)},
$$
which spells out the formulas of the long and short regressions. This is an intuitive measure of
the multiple correlation between Y and X after controlling for W . The following properties
make this intuition more explicit.
where ε̃X is the residual matrix from the OLS fit of X on (1n , W ).
Prove the above two results.
Do the following two results hold?
2 2 2
RY.XW = RY.W + RY.X|W ,
2 2 2
RY.XW = RY.W |X + RY.X|W .
The omitted-variable bias formula states that β̃1 − β̂1 = β̂3 δ̂1. This formula is simple but
may be difficult to interpret since u is unobserved and its scale is unclear to researchers.
Prove that the formula has the alternative form:
$$
|\tilde\beta_1 - \hat\beta_1|^2
= R^2_{Y.U|ZX}\times\frac{R^2_{Z.U|X}}{1-R^2_{Z.U|X}}\times\frac{\text{rss}(Y\sim 1_n + Z + X)}{\text{rss}(Z\sim 1_n + X)}.
$$
Remark: Cinelli and Hazlett (2020) suggested the partial R² parametrization for the
omitted-variable bias formula. The formula has three factors: the first two factors depend
on the unknown sensitivity parameters $R^2_{Y.U|ZX}$ and $R^2_{Z.U|X}$, and the third factor equals
the ratio of the two residual sums of squares based on the observed data.
11
Leverage Scores and Leave-One-Out Formulas
its (i, j)th element equals $h_{ij} = x_i^t(X^tX)^{-1}x_j$. In this chapter, we will pay special attention
to its diagonal elements
$$
h_{ii} = x_i^t(X^tX)^{-1}x_i, \qquad (i=1,\ldots,n),
$$
often called the leverage scores, which play important roles in many discussions later.
First, because H is a projection matrix of rank p, we have
$$
\sum_{i=1}^n h_{ii} = \mathrm{trace}(H) = \mathrm{rank}(H) = p,
$$
i.e., the average of the leverage scores equals p/n, and the maximum of the leverage scores
must be larger than or equal to p/n, which is close to zero when p is small relative to n.
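As a quick illustration of my own (not from the text), leverage scores are available in R via hatvalues(), and their sum equals the number of regressors; the simulated data are illustrative.

set.seed(8)
n = 100
X = matrix(rnorm(n * 3), nrow = n)
y = rnorm(n)
fit = lm(y ~ X)              # p = 4 coefficients including the intercept
h = hatvalues(fit)           # leverage scores h_ii
c(sum(h), mean(h))           # sum(h) equals p = 4; mean(h) equals p/n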
Second, because H = H² and H = Hᵗ, we have
$$
h_{ii} = \sum_{j=1}^n h_{ij}h_{ji} = \sum_{j=1}^n h_{ij}^2 = h_{ii}^2 + \sum_{j\neq i}h_{ij}^2 \geq h_{ii}^2,
$$
which implies
$$
h_{ii}\in[0,1],
$$
i.e., each leverage score is bounded between zero and one1 .
Third, because Ŷ = HY, we have
$$
\hat y_i = \sum_{j=1}^n h_{ij}y_j = h_{ii}y_i + \sum_{j\neq i}h_{ij}y_j
$$
1 This also follows from Theorem A.4 since the eigenvalues of H are 0 and 1.
$$
S = (n-1)^{-1}\sum_{i=1}^n(x_{i2}-\bar x_2)(x_{i2}-\bar x_2)^t
= (n-1)^{-1}X_2^t(I_n-H_1)X_2.
$$
The sample Mahalanobis distance between xi2 and the center x̄2 is
Proof of Theorem 11.1: The definition of Di² implies that it is the (i, i)th element of the
following matrix:
$$
\begin{aligned}
&\begin{pmatrix}(x_{12}-\bar x_2)^t\\ \vdots\\ (x_{n2}-\bar x_2)^t\end{pmatrix}
S^{-1}
\begin{pmatrix}x_{12}-\bar x_2 & \cdots & x_{n2}-\bar x_2\end{pmatrix}\\
&= (I_n-H_1)X_2\left\{(n-1)^{-1}X_2^t(I_n-H_1)X_2\right\}^{-1}X_2^t(I_n-H_1)\\
&= (n-1)\tilde X_2(\tilde X_2^t\tilde X_2)^{-1}\tilde X_2^t\\
&= (n-1)\tilde H_2\\
&= (n-1)(H-H_1),
\end{aligned}
$$
recalling that X̃2 = (In − H1 )X2 , H̃2 = X̃2 (X̃2t X̃2 )−1 X̃2t , and H = H1 + H̃2 by Lemma 7.2.
Therefore,
Di2 = (n − 1)(hii − 1/n)
2 We have already proved a more general result on the covariance matrix of Ŷ in Theorem 4.2.
and check how much the OLS estimator changes. Let
$$
X_{[-i]} = \begin{pmatrix}x_1^t\\ \vdots\\ x_{i-1}^t\\ x_{i+1}^t\\ \vdots\\ x_n^t\end{pmatrix},
\qquad
Y_{[-i]} = \begin{pmatrix}y_1\\ \vdots\\ y_{i-1}\\ y_{i+1}\\ \vdots\\ y_n\end{pmatrix}
$$
denote the leave-i-out data, and define
$$
\hat\beta_{[-i]} = (X_{[-i]}^tX_{[-i]})^{-1}X_{[-i]}^tY_{[-i]}
\tag{11.2}
$$
as the corresponding OLS estimator. We can fit n OLS by deleting the ith row (i = 1, . . . , n).
However, this is computationally intensive, especially when n is large. The following theorem
shows that we only need to fit OLS once.
Theorem 11.2 Recalling that β̂ is the full-data OLS estimator, ε̂i is the residual, and hii is the
leverage score for the ith observation, we have
$$
\hat\beta_{[-i]} = \hat\beta - (1-h_{ii})^{-1}(X^tX)^{-1}x_i\hat\varepsilon_i
$$
if hii ≠ 1.
and calculate
$$
X_{[-i]}^tY_{[-i]} = \sum_{i'\neq i}x_{i'}y_{i'} = X^tY - x_iy_i,
$$
which are the original X t X and X t Y without the contribution of the ith observation. Using
the following Sherman–Morrison formula in Problem 1.3:
$$
(X_{[-i]}^tX_{[-i]})^{-1}
= (X^tX)^{-1} + \left\{1 - x_i^t(X^tX)^{-1}x_i\right\}^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}
= (X^tX)^{-1} + (1-h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}.
$$
Therefore,
$$
\begin{aligned}
\hat\beta_{[-i]} &= (X_{[-i]}^tX_{[-i]})^{-1}X_{[-i]}^tY_{[-i]}\\
&= \left\{(X^tX)^{-1} + (1-h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}\right\}(X^tY - x_iy_i)\\
&= (X^tX)^{-1}X^tY - (X^tX)^{-1}x_iy_i
+ (1-h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}X^tY
- (1-h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}x_iy_i\\
&= \hat\beta - (X^tX)^{-1}x_iy_i + (1-h_{ii})^{-1}(X^tX)^{-1}x_ix_i^t\hat\beta - h_{ii}(1-h_{ii})^{-1}(X^tX)^{-1}x_iy_i\\
&= \hat\beta - (1-h_{ii})^{-1}(X^tX)^{-1}x_iy_i + (1-h_{ii})^{-1}(X^tX)^{-1}x_i\hat y_i\\
&= \hat\beta - (1-h_{ii})^{-1}(X^tX)^{-1}x_i\hat\varepsilon_i.
\end{aligned}
$$
□
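The following R sketch (mine, not the book's) checks Theorem 11.2 numerically by comparing the closed-form leave-one-out coefficient with the coefficient from refitting without observation i; the data are simulated for illustration.

set.seed(9)
n = 60
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
X = cbind(1, x)
fit = lm(y ~ x)
h   = hatvalues(fit)
eps = resid(fit)
XtX.inv = solve(t(X) %*% X)

i = 7                                             # an arbitrary observation
beta.formula = coef(fit) - drop(XtX.inv %*% X[i, ]) * eps[i] / (1 - h[i])
beta.refit   = coef(lm(y[-i] ~ x[-i]))
rbind(beta.formula, beta.refit)                   # identical rows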
With the leave-i-out OLS estimator β̂[−i], we can define the predicted residual
$$
\hat\varepsilon_{[i]} = y_i - x_i^t\hat\beta_{[-i]},
$$
which is different from the original residual ε̂i. The predicted residual based on leave-one-out
can better measure the performance of the prediction because it mimics the real problem
of predicting a future observation. In contrast, the original residual based on the full data,
ε̂i = yi − xti β̂, gives an overly optimistic measure of the performance of the prediction. This
is related to the overfitting issue discussed later. Under the Gauss–Markov model, Theorem
4.2 implies that the original residual has mean zero and variance σ²(1 − hii), and we can show
that the predicted residual has mean zero and variance σ²/(1 − hii).
The following theorem further simplifies the predicted residual and its variance.
This is not an obvious linear algebra identity, but it follows immediately from the two ways
of calculating the variance of the predicted residual.
In this setting, we can update the OLS estimator step by step: based on the first n data
points (xi , yi )ni=1 , we calculate the OLS estimator β̂(n) , and with an additional data point
(xn+1 , yn+1 ), we update the OLS estimator as β̂(n+1) . These two OLS estimators are closely
related as shown in the following theorem.
Theorem 11.4 Let X(n) be the design matrix and Y(n) be the outcome vector for the first
n observations. We have
β̂(n+1) = β̂(n) + γ(n+1) ε̂[n+1] ,
where $\gamma_{(n+1)} = (X_{(n+1)}^tX_{(n+1)})^{-1}x_{n+1}$ and $\hat\varepsilon_{[n+1]} = y_{n+1} - x_{n+1}^t\hat\beta_{(n)}$ is the predicted residual
of the (n + 1)th outcome based on the OLS fit of the first n observations.
Proof of Theorem 11.4: This is the reverse form of the leave-one-out formula. We can
view the first n + 1 data points as the full data, and β̂(n) as the OLS estimator leaving the
where ε̂n+1 is the (n + 1)th residual based on the full data OLS, and the (n + 1)th predicted
residual equals ε̂[n+1] = ε̂n+1 /(1 − hn+1,n+1 ) based on Theorem 11.3. □
Theorem 11.4 shows that to obtain β̂(n+1) from β̂(n) , the adjustment depends on the
predicted residual ε̂[n+1] . If we have a perfect prediction of the (n + 1)th observation based
on β̂(n) , then we do not need to make any adjustment to obtain β̂(n+1) ; if the predicted
residual is large, then we need to make a large adjustment.
Theorem 11.4 suggests an algorithm for sequentially computing the OLS estimators. But
it gives a formula that involves inverting $X_{(n+1)}^tX_{(n+1)}$ at each step. Using the Sherman–
Morrison formula in Problem 1.3 for updating the inverse of $X_{(n+1)}^tX_{(n+1)}$ based on the
inverse of $X_{(n)}^tX_{(n)}$, we have an even simpler algorithm below:
(G1) Start with $V_{(n)} = (X_{(n)}^tX_{(n)})^{-1}$ and $\hat\beta_{(n)}$.
(G2) Update
$$
V_{(n+1)} = V_{(n)} - \left(1 + x_{n+1}^tV_{(n)}x_{n+1}\right)^{-1}V_{(n)}x_{n+1}x_{n+1}^tV_{(n)}.
$$
(G3) Calculate $\gamma_{(n+1)} = V_{(n+1)}x_{n+1}$ and $\hat\varepsilon_{[n+1]} = y_{n+1} - x_{n+1}^t\hat\beta_{(n)}$.
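The next R sketch (mine, not the book's) implements steps (G1)–(G3), with the final update β̂(n+1) = β̂(n) + γ(n+1) ε̂[n+1] taken from Theorem 11.4, and compares the result with a full OLS refit; the simulated data are illustrative.

set.seed(10)
n = 80
X = cbind(1, rnorm(n), rnorm(n))
y = drop(X %*% c(1, 2, -1) + rnorm(n))

# OLS based on the first n - 1 observations
Xn = X[1:(n - 1), ]; yn = y[1:(n - 1)]
Vn    = solve(t(Xn) %*% Xn)
betan = drop(Vn %*% t(Xn) %*% yn)

# recursive update with the n-th observation
xnew = X[n, ]; ynew = y[n]
Vnew    = Vn - drop(1 + t(xnew) %*% Vn %*% xnew)^(-1) *
              (Vn %*% xnew %*% t(xnew) %*% Vn)          # step (G2)
gamma   = drop(Vnew %*% xnew)                           # step (G3)
epsnew  = ynew - sum(xnew * betan)                      # predicted residual
betanew = betan + gamma * epsnew                        # Theorem 11.4

rbind(betanew, coef(lm(y ~ 0 + X)))                     # identical rows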
At the same time, the residual vector is computable from the data. So it is sensible to
check whether these properties of the residual vector are plausible based on the data, which in
turn serves as model checking for the Normal linear model.
The first quantity is the standardized residual
$$
\text{standr}_i = \frac{\hat\varepsilon_i}{\sqrt{\hat\sigma^2(1-h_{ii})}}.
$$
We may hope that it has mean 0 and variance 1. However, because of the dependence
between ε̂i and σ̂ 2 , it is not easy to quantify the exact distribution of standri .
The second quantity is the studentized residual based on the predicted residual:
where β̂[−i] and $\hat\sigma^2_{[-i]}$ are the estimates of the coefficient and variance based on the leave-i-out
OLS. Because $(y_i, \hat\beta_{[-i]}, \hat\sigma^2_{[-i]})$ are mutually independent under the Normal linear model,
we can show that
studri ∼ tn−p−1 . (11.7)
Because we know the distribution of studri , we can compare it to the quantiles of the t
distribution.
The third quantity is Cook’s distance (Cook, 1977):
where the first form measures the change of the OLS estimator and the second form mea-
sures the change in the predicted values based on leaving-i-out. It has a slightly different
motivation, but eventually, it is related to the previous two residuals due to the leave-one-out
formulas.
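In practice, these diagnostics need not be coded by hand; the following sketch of my own uses the base R helpers hatvalues, rstandard, rstudent, and cooks.distance on simulated data with a planted outlier.

set.seed(11)
x = rnorm(40)
y = 1 + x + rnorm(40)
y[40] = y[40] + 8                      # plant an outlier in the last observation
fit = lm(y ~ x)

diag.measures = cbind(
  leverage     = hatvalues(fit),       # h_ii
  standardized = rstandard(fit),       # standardized residuals
  studentized  = rstudent(fit),        # studentized (leave-one-out) residuals
  cook         = cooks.distance(fit)   # Cook's distance
)
tail(round(diag.measures, 3))          # the planted outlier stands out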
If I add 8 to the outcome of the last observation, the plots change to the second column
of Figure 11.1. If I add 8 to the 50th observation, the plots change to the last column of
Figure 11.1. Both visually show the outliers. In this example, the three residual plots give
qualitatively the same pattern, so the choice among them does not matter much. In general
cases, we may prefer studri because it has a known distribution under the Normal linear
model.
The second one is a further analysis of the Lalonde data. Based on the plots in Figure
11.2, there are indeed some outliers in the data. It is worth investigating them more carefully.
Although the outlier detection methods above are classic, they are rarely implemented
in modern data analyses. They are simple and useful diagnostics, and I recommend using
them at least as part of the exploratory data analysis.
11.3.3 Jackknife
Jackknife is a general strategy for bias reduction and variance estimation proposed by
Quenouille (1949, 1956) and popularized by Tukey (1958). Based on independent data
(Z1 , . . . , Zn ), how to estimate the variance of a general estimator θ̂(Z1 , . . . , Zn ) of the pa-
rameter θ? Define θ̂[−i] as the estimator without observation i, and the pseudo-value as
$$
\tilde\theta_i = n\hat\theta - (n-1)\hat\theta_{[-i]}.
$$
[Figure 11.1 appears here: leverage scores, standardized residuals, studentized residuals, and Cook's distances (outlier checks) plotted against x for the original and perturbed simulated datasets.]
The jackknife point estimator is $\hat\theta_j = n^{-1}\sum_{i=1}^n\tilde\theta_i$, and the jackknife variance estimator is
$$
\hat V_j = \frac{1}{n(n-1)}\sum_{i=1}^n(\tilde\theta_i - \hat\theta_j)(\tilde\theta_i - \hat\theta_j)^t.
$$
We have already shown that the OLS coefficient is unbiased and derived several variance
estimators for it. Here we focus on the jackknife in OLS using the leave-one-out formula for
the coefficient. The pseudo-value is
[Figure 11.2 appears here: leverage scores, standardized residuals, studentized residuals, and Cook's distances (outlier checks) for the Lalonde data, plotted against the observation index.]
It is a little unfortunate that the jackknife point estimator is not identical to the OLS
estimator, which is BLUE under the Gauss–Markov model. We can show that E(β̂j ) = β
and it is a linear estimator. So the Gauss–Markov theorem ensures that cov(β̂j ) ⪰ cov(β̂).
Nevertheless, the difference between β̂j and β̂ is quite small. I omit their difference in the
following derivation. Assuming that β̂j ∼= β̂, we can continue to calculate the approximate
jackknife variance estimator:
n
∼ 1 X
V̂j = (β̃i − β̂)(β̃i − β̂)t
n(n − 1) i=1
n 2
n − 1 t −1 X ε̂i
= (X X) xi xti (X t X)−1 ,
n i=1
1 − hii
which is almost identical to the HC3 form of the EHW covariance matrix introduced in
Chapter 6.4.2. Miller (1974) first analyzed the jackknife in OLS but dismissed it immediately.
Hinkley (1977) modified the original jackknife and proposed a version that is identical to
HC1, and Wu (1986) proposed some further modifications and proposed a version that is
identical to HC2. Weber (1986) made connections between EHW and jackknife standard
errors. However, Long and Ervin (2000)’s finite-sample simulation seems to favor the original
jackknife or HC3.
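To see the near-equivalence numerically, here is a small R sketch of my own comparing the approximate jackknife covariance above with the HC3 estimator from hccm in the car package; the heteroskedastic data are simulated for illustration.

library(car)
set.seed(12)
n = 200
x = rnorm(n)
y = 1 + x + (1 + abs(x)) * rnorm(n)       # heteroskedastic errors
fit = lm(y ~ x)
X   = model.matrix(fit)
eps = resid(fit)
h   = hatvalues(fit)

XtX.inv = solve(t(X) %*% X)
V.jack  = (n - 1) / n * XtX.inv %*%
          (t(X) %*% (X * (eps / (1 - h))^2)) %*% XtX.inv   # approximate jackknife
V.hc3   = hccm(fit, type = "hc3")

round(cbind(jack = sqrt(diag(V.jack)), hc3 = sqrt(diag(V.hc3))), 4)  # nearly equal SEs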
$$
h_{ii} + \frac{\hat\varepsilon_i^2}{\sum_{k=1}^n\hat\varepsilon_k^2}\leq 1.
$$
From this inequality, if hii = 1 then ε̂i = 0 which further implies that hij = 0 for all j ̸= i.
The following formula facilitates the computation of many β̂\S ’s simultaneously, which relies
crucially on the subvector of the residuals
ε̂S = YS − XS β̂\S
Previous chapters assume fixed X and random Y . We can also view each observation (xi , yi )
as IID draws from a population and discuss population OLS. We will view the OLS from
the level of random variables instead of data points. Besides its theoretical interest, the
population OLS facilitates the discussion of the properties of misspecified linear models and
motivates a prediction procedure without imposing any distributional assumptions.
with the minimum value equaling E{var(y | x)}, the expectation of the conditional variance
of y given x.
Theorem 12.1 is well known in probability theory. I relegate its proof as Problem 12.1.
We have finite data points, but the function E(y | x) lies in an infinite dimensional space.
Nonparametric estimation of E(y | x) is generally a hard problem, especially with a mul-
tidimensional x. As a starting point, we often use a linear function of x to approximate
E(y | x) and define the population OLS coefficient as
β = arg minp L(b),
b∈R
where
L(b) = E (y − xt b)2
= E y 2 + bt xxt b − 2yxt b
where x̃k is the residual from (12.7) and ỹ is the residual from (12.8). The coefficient βk
from (12.6) equals cov(x̃k , ỹ)/var(x̃k ), the coefficient of x̃k from the population OLS of ỹ on
x̃k in (12.9), which also equals cov(x̃k , y)/var(x̃k ), the coefficient of x̃k from the population
OLS of y on x̃k . Moreover, ε from (12.6) equals ε̃ from (12.9).
Similar to the proof of Theorem 7.1, we can invert the matrix E(xxt ) in (12.2) to prove
Theorem 12.3 directly. Below I adopt an alternative proof which is a modification of the
one given by Angrist and Pischke (2008).
Proof of Theorem 12.3: Some basic identities of population OLS help to simplify the
proof below. First, the OLS decomposition (12.7) ensures
that is,
cov(x̃k , xk ) = var(x̃k ). (12.10)
It also ensures
cov(x̃k , ε) = 0, (12.12)
= βk var(x̃k ),
by (12.10)–(12.12). Therefore,
$$
\beta_k = \frac{\mathrm{cov}(\tilde x_k, y)}{\mathrm{var}(\tilde x_k)},
$$
which also equals β̃k by (12.13).
Finally, I prove ε = ε̃. It suffices to show that ε̃ = ỹ − β̃k x̃k satisfies
The first identity holds by (12.9). The second identity holds because
cov(xk , ε̃) = cov(xk , ỹ) − β̃k cov(xk , x̃k ) = cov(x̃k , ỹ) − β̃k cov(x̃k , x̃k ) = 0
by (12.13), (12.10), and the population OLS of (12.9). The third identity holds because
Equation (12.14) defines the population long regression, and Equation (12.15) defines the
population short regression. In Equation (12.16), δ is an l × k matrix because it is the OLS
decomposition of a vector on a vector. We can view (12.16) as the OLS decomposition of each
component of xi1 on xi2. The following theorem states the population version of Cochran's
formula.
There are multiple equivalent definitions of R². I start with the following definition
$$
R^2 = \frac{\Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}}{\sigma_y^2},
$$
and will give several equivalent definitions below. Let β be the population OLS coefficient
of y on x, and ŷ = xt β be the best linear predictor.
Theorem 12.5 R2 equals the ratio of the variance of the best linear predictor of y and the
variance of y itself:
$$
R^2 = \frac{\mathrm{var}(\hat y)}{\mathrm{var}(y)}.
$$
Proof of Theorem 12.5: Because of the centering of x, we can verify that
var(ŷ) = β t cov(x)β
= cov(y, x)cov(x)−1 cov(x)cov(x)−1 cov(x, y)
= cov(y, x)cov(x)−1 cov(x, y).
Theorem 12.6 R2 equals the maximum value of the squared Pearson correlation coefficient
between y and a linear combination of x:
$$
R^2 = \max_{b\in\mathbb R^{p-1}}\rho^2(y, x^tb) = \rho^2(y, \hat y).
$$
where
with βy and βx being the coefficients of w in these population OLS. We then define the
population partial correlation coefficient as
ρyx|w = ρỹx̃ .
If the marginal correlation and partial correlation have different signs, then we have Simp-
son’s paradox at the population level.
With a scalar w, we have a more explicit formula below.
Theorem 12.7 For scalar (y, x, w), we have
$$
\rho_{yx|w} = \frac{\rho_{yx} - \rho_{xw}\rho_{yw}}{\sqrt{1-\rho_{xw}^2}\sqrt{1-\rho_{yw}^2}}.
$$
and the residuals ε̂i = yi − xti β̂. Assume finite fourth moments of (x, y). We can use the law
of large numbers to show that
$$
n^{-1}\sum_{i=1}^n x_ix_i^t\rightarrow E(xx^t),
\qquad
n^{-1}\sum_{i=1}^n x_iy_i\rightarrow E(xy),
$$
so β̂ → β in probability. We can use the CLT to show that $n^{-1/2}\sum_{i=1}^n x_i\varepsilon_i\rightarrow N(0, M)$ in
distribution, where M = E(ε²xxᵗ), so
$$
\sqrt n(\hat\beta - \beta)\rightarrow N(0, V)
\tag{12.18}
$$
Following almost the same proof as that of Theorem 6.3, we can show that V̂ehw is consistent for
the asymptotic covariance V. I summarize the formal results below, with the proof relegated
to Problem 12.4.
Theorem 12.8 Assume that $(x_i, y_i)_{i=1}^n\stackrel{\textup{iid}}{\sim}(x, y)$ with $E(\|x\|^4)<\infty$ and $E(y^4)<\infty$. We
have (12.18) and $n\hat V_{\textup{ehw}}\rightarrow V$ in probability.
So the EHW standard error is not only robust to the heteroskedasticity of the errors but
also robust to the misspecification of the linear model (Huber, 1967; White, 1980b; Angrist
and Pischke, 2008; Buja et al., 2019a). Of course, when the linear model is wrong, we need
to modify the interpretation of β: it is the coefficient of x in the best linear prediction of y
or the best linear approximation of the conditional mean function E(y | x).
I start with a simple example. In the following calculation, I will use the fact that the
kth moment of a Uniform(0, 1) random variable equals 1/(k + 1).
Example 12.1 Assume that $x\sim F(x)$, $\varepsilon\sim N(0,1)$, $x\perp\!\!\!\perp\varepsilon$, and $y = x^2+\varepsilon$.
1. If $x\sim F_1(x)$ is Uniform(−1, 1), then the best linear approximation is $1/3 + 0\cdot x$. We
can see this result from
$$
\beta(F_1) = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = \frac{\mathrm{cov}(x, x^2)}{\mathrm{var}(x)} = \frac{E(x^3)}{\mathrm{var}(x)} = 0,
$$
and $\alpha(F_1) = E(y) = E(x^2) = 1/3$.
2. If $x\sim F_2(x)$ is Uniform(0, 1), then the best linear approximation is $-1/6 + x$. We can
see this result from
$$
\beta(F_2) = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = \frac{\mathrm{cov}(x, x^2)}{\mathrm{var}(x)}
= \frac{E(x^3) - E(x)E(x^2)}{E(x^2) - \{E(x)\}^2}
= \frac{1/4 - 1/2\times 1/3}{1/3 - (1/2)^2} = 1,
$$
and $\alpha(F_2) = E(y) - \beta(F_2)E(x) = 1/3 - 1/2 = -1/6$.
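A quick simulation sketch of my own (not from the text) illustrates the point numerically; the sample size and seed are arbitrary.

set.seed(13)
n = 1e6
x1 = runif(n, -1, 1);  y1 = x1^2 + rnorm(n)   # x ~ F1
x2 = runif(n,  0, 1);  y2 = x2^2 + rnorm(n)   # x ~ F2
coef(lm(y1 ~ x1))   # approximately ( 1/3, 0)
coef(lm(y2 ~ x2))   # approximately (-1/6, 1)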
From the above, we can see that the best linear approximation depends on the distribu-
tion of X. This complicates the interpretation of β from the population OLS decomposition
(Buja et al., 2019a). More importantly, this can cause problems if we care about the external
validity of statistical inference (Sims, 2010, page 66).
However, if we believe the following restricted mean model
E(y | x) = xt β (12.20)
or, equivalently,
y = xt β + ε, E(ε | x) = 0,
then the population OLS coefficient is the true parameter of interest:
$$
\{E(xx^t)\}^{-1}E(xy)
= \{E(xx^t)\}^{-1}E\{xE(y\mid x)\}
= \{E(xx^t)\}^{-1}E(xx^t\beta)
= \beta.
$$
Moreover, the population OLS coefficient does not depend on the distribution of x. The
above asymptotic inference applies to this model too.
Freedman (1981) distinguished two types of OLS: the regression model and the corre-
lation model, as shown in Figure 12.2. The left-hand side represents the regression model,
or the restricted mean model (12.20). In the regression model, we first generate x and ε
under some restrictions, for example, E(ε | x) = 0, and then generate the outcome based
on y = xt β + ε, a linear function of x with error ε. In the correlation model, we start with a
pair (x, y), then decompose y into the best linear predictor xt β and the leftover residual ε.
The latter ensures E(εx) = 0, but the former requires E(ε | x) = 0. So the former imposes
a stronger assumption since E(ε | x) = 0 implies
Call:
lm(formula = y1 ~ x1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **

Call:
lm(formula = y2 ~ x2, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9009 -0.7609  0.1291  0.9491  1.2691

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.001      1.125   2.667  0.02576 *
x2             0.500      0.118   4.239  0.00218 **

Call:
lm(formula = y3 ~ x3, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1586 -0.6146 -0.2303  0.1540  3.2411

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x3            0.4997     0.1179   4.239  0.00218 **

Call:
lm(formula = y4 ~ x4, data = anscombe)

Residuals:
   Min     1Q Median     3Q    Max
-1.751 -0.831  0.000  0.809  1.839

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0017     1.1239   2.671  0.02559 *
x4            0.4999     0.1178   4.243  0.00216 **
However, the scatter plots of the datasets in Figure 12.3 reveal fundamental differences
between the datasets. The first dataset seems ideal for linear regression. The second dataset
shows a quadratic relationship between y and x, and therefore, the linear model is misspecified.
The third dataset shows a linear trend of y versus x, but an outlier has severely distorted the
slope of the fitted line. The fourth dataset is supported on only two values of x and thus
may suffer from severe extrapolation.
[Figure 12.3 appears here: scatter plots of y1 versus x1, y2 versus x2, y3 versus x3, and y4 versus x4 for the four Anscombe datasets.]
[Figure appears here: residual plots (residuals against x1, x2, and the fitted values) for linear and quadratic specifications under homoskedastic and heteroskedastic errors.]
can construct a prediction interval for yn+1 based on xn+1 and (X, Y ) using an idea called
conformal prediction (Vovk et al., 2005; Lei et al., 2018). It leverages the exchangeability1
of the data points
(x1 , y1 ), . . . , (xn , yn ), (xn+1 , yn+1 ).
Pretending that we know the value yn+1 = y ∗ , we can fit OLS using n + 1 data points and
obtain residuals
ε̂i (y ∗ ) = yi − xti β̂(y ∗ ), (i = 1, . . . , n + 1)
where we emphasize the dependence of the OLS coefficient and residuals on the unknown
y ∗ . The absolute values of the residuals |ε̂i (y ∗ )|’s are also exchangeable, so the rank of
1 Exchangeability is a technical term in probability and statistics. Random elements $z_1,\ldots,z_n$ are
exchangeable if $(z_{\pi(1)},\ldots,z_{\pi(n)})$ has the same distribution as $(z_1,\ldots,z_n)$, where $\pi(1),\ldots,\pi(n)$ is a
permutation of the integers $1,\ldots,n$. In other words, a set of random elements is exchangeable if their joint
distribution does not change under re-ordering. IID random elements are exchangeable.
must have a uniform distribution over {1, 2, . . . , n, n + 1}, a known distribution not depending
on anything else. It is a pivotal quantity satisfying
$$
\mathrm{pr}\left\{\hat R_{n+1}(y^*)\leq\lceil(1-\alpha)(n+1)\rceil\right\}\geq 1-\alpha.
\tag{12.21}
$$
Equivalently, this is a statement linking the unknown quantity y ∗ and observed data, so it
gives a confidence set for y ∗ at level 1 − α. In practice, we can use a grid search to solve for
the inequality (12.21) involving y ∗ .
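Here is a compact sketch of my own (not the book's code12.6.R) of the grid-search construction for a single new point: for each candidate y*, refit OLS on the augmented data and keep y* when the rank condition (12.21) holds; the data and grid are illustrative.

set.seed(14)
n = 50
x = rnorm(n); y = 1 + 2 * x + rnorm(n)
x.new = 0.5                                 # covariate of the new point
alpha = 0.1
grid.y = seq(-5, 10, by = 0.05)             # candidate values y*

keep = sapply(grid.y, function(ystar) {
  xa = c(x, x.new); ya = c(y, ystar)        # augmented data
  res = abs(resid(lm(ya ~ xa)))             # residuals from the augmented fit
  rank(res, ties.method = "max")[n + 1] <= ceiling((1 - alpha) * (n + 1))
})
range(grid.y[keep])                         # conformal prediction interval for y_{n+1}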
Below we evaluate the leave-one-out prediction with the Boston housing data.
library ( " mlbench " )
data ( BostonHousing )
attach ( BostonHousing )
n = dim ( BostonHousing )[ 1 ]
p = dim ( BostonHousing )[ 2 ] - 1
ymin = min ( medv )
ymax = max ( medv )
grid . y = seq ( ymin - 3 0 , ymax + 3 0 , 0 . 1 )
BostonHousing = BostonHousing [ order ( medv ) , ]
detach ( BostonHousing )
cvt = qt ( 0 . 9 7 5 , df = n -p - 1 )
cvr = ceiling ( 0 . 9 5 *( n + 1 ))
FIGURE 12.5: Leave-one-out prediction intervals based on the Boston housing data
In the above code, I use the QR decomposition to compute X t X and H. Moreover, the
calculations of lower.i, upper.i, and Res use some tricks to avoid fitting n separate OLS. I
relegate their justification to Problem 12.10.
The variable loo.pred has five columns corresponding to the point predictors and the lower
and upper limits of the intervals based on the Normal linear model and on conformal prediction.
> colnames(loo.pred) = c("point", "G.l", "G.u", "c.l", "c.u")
> head(loo.pred)
         point        G.l       G.u   c.l  c.u
[1,]  6.633514  -2.941532 16.208559  -3.5 16.7
[2,]  8.806641  -1.349367 18.962649  -2.6 20.1
[3,] 12.044154   2.608290 21.480018   2.2 21.8
[4,] 11.025253   1.565152 20.485355   1.2 21.0
[5,] -5.181154 -14.819041  4.456733 -15.0  4.9
[6,]  8.324114  -1.382910 18.031138  -2.0 18.8
Figure 12.5 plots the observed outcomes and the prediction intervals for the 20 obser-
vations with the outcomes at the bottom, middle, and top. The Normal and conformal
intervals are almost indistinguishable. For the observations with the highest outcome, the
predictions are quite poor. Surprisingly, the overall coverage rates across observations are
close to 95% for both methods.
> apply(loo.cov == 1, 2, mean)
[1] 0.9486166 0.9525692
Figure 12.6 compares the lengths of the two prediction intervals. Although the conformal prediction intervals are slightly wider than the Normal prediction intervals, the differences are rather small, with the ratio of the lengths above 0.96.
The R code is in code12.6.R.
[Figure 12.6: comparison of the lengths of the Normal and conformal prediction intervals across observations.]
1. Find the best linear combinations (α, β) that give the maximum Pearson correlation
coefficient:
(α, β) = arg max_{a∈R^k, b∈R^p} ρ(y^ta, x^tb).
Note that you need to detail the steps in calculating (α, β) based on the covariance
matrix above.
2. Define the maximum value as cc(x, y). Show that cc(x, y) ≥ 0 and cc(x, y) = 0 if x ⊥⊥ y.
Remark: The maximum value cc(x, y) is called the canonical correlation between x and
y. We can also define partial canonical correlation between x and y given w.
Propose a method to construct joint conformal prediction regions for (yn+1 , . . . , yn+k ) based
on (X, Y ) and (xn+1 , . . . , xn+k ).
x1 −→ x2 −→ y
xi2 = α0 + α1 xi1 + ηi
and
yi = β0 + β1 xi2 + εi
where ηi has mean 0 and variance ση2 , εi has mean 0 and variance σε2 , and the ηi s and εi s
are independent. The linear model implies

yi = (β0 + β1α0) + β1α1 xi1 + (εi + β1ηi),

where the εi + β1ηi's are independent with mean 0 and variance σε² + β1²ση².
Therefore, we have two ways to estimate β1 α1 :
1. the first estimator is γ̂1 , the OLS estimator of the yi ’s on the xi1 ’s with the intercept;
2. the second estimator is α̂1 β̂1 , the product of the OLS estimator of the xi2 ’s on the xi1 ’s
with the intercept and that of the yi ’s on the xi2 ’s with the intercept.
Cox (1960) proved the following theorem.
Theorem 12.9 Let X1 = (x11 , . . . , xn1 ). We have var(α̂1 β̂1 | X1 ) ≤ var(γ̂1 | X1 ), and
more precisely,
var(γ̂1 | X1) = (σε² + β1²ση²) / Σ_{i=1}^n (xi1 − x̄1)²

and

var(α̂1β̂1 | X1) = {σε² E(ρ̂12² | X1) + β1²ση²} / Σ_{i=1}^n (xi1 − x̄1)²,
where ρ̂12 ∈ [−1, 1] is the sample Pearson correlation coefficient between the xi1 ’s and the
xi2 ’s.
|β ∗ | ≤ |β| ≤ 1/|b∗ |.
2. Given scalar random variables x, y and a random vector w, we can obtain the population OLS coefficient (α, β, γ) of y on (1, x, w). When x and y are measured with error as above with mean zero errors satisfying u ⊥⊥ v and (u, v) ⊥⊥ (x, y, w), we can obtain the population OLS coefficient (α∗, β∗, γ∗) of y on (1, x∗, w), and the population OLS coefficient (a∗, b∗, c∗) of x∗ on (1, y∗, w).
Prove that the same result holds as in the first part of the problem.
Remark: Tamer (2010) reviewed Frisch (1934)’s upper and lower bounds for the uni-
variate OLS coefficient based on the two OLS coefficients of the observables. The second
part of the problem extends the result to the multivariate OLS with a covariate subject to
measurement error. The lower bound is well documented in most books on measurement
errors, but the upper bound is much less well known.
y = xt β + {µ(x) − xt β} + {y − µ(x)},
which must hold without any assumptions. Introduce the notation for the linear term
ŷ = xt β,
δ = µ(x) − xt β,
e = y − µ(x).
y = ŷ + δ + e
ε = {µ(x) − xt β} + {y − µ(x)} = δ + e.
1. Prove that
E(ŷe | x) = 0, E(δe | x) = 0
and
E(ŷe) = 0, E(δe) = 0, E(ŷδ) = 0.
Further, prove that
E(ε2 ) = E(δ 2 ) + E(e2 ).
2. Introduce an intermediate quantity between the population OLS coefficient β and the
OLS coefficient β̂:
β̃ = (n⁻¹ Σ_{i=1}^n xi xi^t)⁻¹ (n⁻¹ Σ_{i=1}^n xi µ(xi)).
Equation (12.18) states that √n(β̂ − β) → N(0, B⁻¹MB⁻¹) in distribution, where B = E(xx^t) and M = E(ε²xx^t).
Prove that
cov(β̂ − β) = cov(β̂ − β̃) + cov(β̃ − β),
and moreover,
√n(β̂ − β̃) → N(0, B⁻¹M1B⁻¹), √n(β̃ − β) → N(0, B⁻¹M2B⁻¹)

in distribution, where M1 = E(e²xx^t) and M2 = E(δ²xx^t). Verify that M = M1 + M2.
Remark: To prove the result, you may find the law of total covariance formula in (B.4)
helpful. We can also write M1 as M1 = E{var(y | x)xxt }. So the meat matrix M has two
sources of uncertainty, one is from the conditional variance of y given x, and the other is
from the approximation error.
Part V
Previous chapters assume that the covariate matrix X is given and the linear model, cor-
rectly specified or not, is also given. Although including useless covariates in the linear
model results in less precise estimators, this problem is not severe when the total number
of covariates is small compared to the sample size. In many modern applications, however,
the number of covariates can be large compared to the sample size. Sometimes, it can be
a nonignorable fraction of the sample size; sometimes, it can even be larger than the sam-
ple size. For instance, modern DNA sequencing technology often generates covariates of
millions of dimensions, which is much larger than the usual sample size under study. In
these applications, the theory in previous chapters is inadequate. This chapter introduces an important notion in statistics: overfitting.
with the density shown in the (1, 1)th and (1, 2)th subfigures of Figure 13.1. The Beta distribution above has mean

E(R²) = {(p − 1)/2} / {(p − 1)/2 + (n − p)/2} = (p − 1)/(n − 1)

and variance

var(R²) = [{(p − 1)/2}{(n − p)/2}] / [{(p − 1)/2 + (n − p)/2}²{(p − 1)/2 + (n − p)/2 + 1}] = 2(p − 1)(n − p) / {(n − 1)²(n + 1)}.
Theorem 13.1 Consider a fixed design matrix X. Let β̂j be the coefficient of Xj of the
OLS fit of Y on (1n , X1 , . . . , Xq ) with q ≤ p. Under the model yi = f (xi ) + εi with an
unknown f (·) and the εi ’s uncorrelated with mean zero and variance σ 2 , the variance of β̂j
equals
var(β̂j) = σ² / Σ_{i=1}^n (xij − x̄j)² × 1/(1 − Rj²),
where Rj2 is the sample R2 from the OLS fit of Xj on 1n and all other covariates.
Theorem 13.1 does not even assume that the true mean function is linear. It states that
the variance of β̂j has two multiplicative components. If we run a short regression of Y on
1n and Xj = (x1j , . . . , xnj )t , the coefficient equals
β̃j = Σ_{i=1}^n (xij − x̄j)yi / Σ_{i=1}^n (xij − x̄j)²
where x̄j = n⁻¹ Σ_{i=1}^n xij. It has variance

var(β̃j) = var{Σ_{i=1}^n (xij − x̄j)yi / Σ_{i=1}^n (xij − x̄j)²} = Σ_{i=1}^n (xij − x̄j)² σ² / {Σ_{i=1}^n (xij − x̄j)²}² = σ² / Σ_{i=1}^n (xij − x̄j)².
So the first component is the variance of the OLS coefficient in the short regression. The
second component 1/(1 − Rj2 ) is called the variance inflation factor (VIF). The VIF indeed
inflates the variance of β̃j , and the more covariates are added into the long regression, the
larger the variance inflation factor is. In R, the car package provides the function vif to
compute the VIF for each covariate.
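A small illustration of vif with simulated data (an assumed example, not from the text): the VIFs of two highly correlated covariates are large, while that of an independent covariate is close to one.

library(car)
set.seed(1)
n  = 200
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.3)   # nearly collinear with x1
x3 = rnorm(n)
y  = 1 + x1 + x3 + rnorm(n)
vif(lm(y ~ x1 + x2 + x3))      # large for x1 and x2, about 1 for x3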
The proof of Theorem 13.1 below is based on the FWL theorem.
Proof of Theorem 13.1: Let X̃j = (x̃1j, . . . , x̃nj)^t be the residual vector from the OLS fit of Xj on 1n and the other covariates, which has a sample mean of zero. The FWL theorem implies that

β̂j = Σ_{i=1}^n x̃ij yi / Σ_{i=1}^n x̃ij²,

so var(β̂j) = σ² / Σ_{i=1}^n x̃ij². Moreover, Σ_{i=1}^n x̃ij² is the residual sum of squares from the OLS fit of Xj on 1n and the other covariates, so by the definition of Rj²,

Σ_{i=1}^n x̃ij² = (1 − Rj²) Σ_{i=1}^n (xij − x̄j)². (13.2)

Combining the two displays gives the result. □
Ideally, we want to use the model (13.3) with exactly s covariates. In practice, we may not
know which covariates to include in the OLS. If we underfit the data using a short regression
with q < s:
yi = β̃1 + β̃2 xi1 + · · · + β̃q−1 xiq + ε̃i , (i = 1, . . . , n) (13.4)
then the OLS coefficients are biased. If we increase the complexity of the model to overfit
the data using a long regression with p > s:
then the OLS coefficients are unbiased. Theorem 13.1, however, shows that the OLS co-
efficients from the under-fitted model (13.4) have smaller variances than those from the
overfitted model (13.5).
Example 13.1 In general, we have a sequence of models with increasing complexity. For
simplicity, we consider nested models containing 1n and covariates
in the following simulation setting. The true linear model is yi = xti β + N(0, 1) with p = 40
but only the first 10 covariates have non-zero coefficients 1 and all other covariates have
coefficients 0. We generate two datasets: both have sample size n = 200, all covariates have
IID N(0, 1) entries, and the error terms are IID. We use the first dataset to fit the OLS
and thus call it the training dataset. We use the second dataset to assess the performance
of the fitted OLS from the training dataset, and thus call it the testing dataset1 . Figure
13.2 plots the residual sum of squares against the number of covariates in the training
and testing datasets. By definition of OLS, the residual sum of squares decreases with the
number of covariates in the training dataset, but it first decreases and then increases in the
testing dataset with minimum value attained at 10, the number of covariates in the true
data generating process.
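A minimal sketch of this simulation (an assumed implementation, not the book's code behind Figure 13.2):

set.seed(1)
n = 200; p = 40
beta    = c(rep(1, 10), rep(0, p - 10))
X.train = matrix(rnorm(n*p), n, p)
Y.train = as.vector(X.train %*% beta + rnorm(n))
X.test  = matrix(rnorm(n*p), n, p)
Y.test  = as.vector(X.test %*% beta + rnorm(n))
rss = t(sapply(1:p, function(q){
    fit  = lm(Y.train ~ X.train[, 1:q, drop = FALSE])
    pred = cbind(1, X.test[, 1:q, drop = FALSE]) %*% coef(fit)
    c(train = mean(fit$residuals^2), test = mean((Y.test - pred)^2))
}))
which.min(rss[, "test"])   # typically close to 10, the true number of nonzero coefficients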
The following example has a nonlinear true mean function but still uses OLS with
polynomials of covariates to approximate the truth2 .
Example 13.2 The true nonlinear model is yi = sin(2πxi ) + N(0, 1) with the xi ’s equally
spaced in [0, 1] and the error terms are IID. The training and testing datasets both have
sample sizes n = 200. Figure 13.3 plots the residual sum of squares against the order of the
polynomial in the OLS fit
yi = Σ_{j=0}^{p−1} βj xi^j + εi.
By the definition of OLS, the residual sum of squares decreases with the order of polynomials
in the training dataset, but it achieves the minimum near p = 5 in the testing dataset. We
can show that the residual sum of squares decreases to zero with p = n in the training dataset;
see Problem 13.7. However, it is larger than that under p = 5 in the testing dataset.
2 Any continuous function on a closed interval can be approximated arbitrarily well by a polynomial function. Here is the mathematical statement of Weierstrass's theorem. Suppose f is a continuous function defined on the interval [a, b]. For every ε > 0, there exists a polynomial p such that for all x ∈ [a, b], we have |f(x) − p(x)| < ε.
aic = n log(rss/n) + 2p and bic = n log(rss/n) + p log n,
with full names “Akaike’s information criterion ” and “Bayes information criterion.”
aic and bic are both monotone functions of the rss penalized by the number of param-
eters p in the model. The penalty in bic is larger so it favors smaller models than aic. Shao
(1997)’s results suggested that bic can consistently select the true model if the linear model
is correctly specified, but aic can select the model that minimizes the prediction error if
the linear model is misspecified. In most statistical practice, the linear model assumption
cannot be justified, so we recommend using aic.
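The built-in AIC and BIC functions in R implement equivalent criteria (they differ from the rss-based definitions above by constants that do not affect model comparisons). A quick sketch with an assumed simulated dataset:

set.seed(1)
n = 200
x = matrix(rnorm(n*10), n, 10)
y = as.vector(x[, 1:3] %*% rep(1, 3) + rnorm(n))
fits = lapply(1:10, function(q) lm(y ~ x[, 1:q, drop = FALSE]))
which.min(sapply(fits, AIC))   # aic targets prediction
which.min(sapply(fits, BIC))   # bic penalizes more heavily and favors smaller models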
Consider the OLS fits

Y = β̂0 1n + β̂1 X1 + · · · + β̂p Xp + ε̂

and

Y = β̃0 1n + β̃1 X1 + · · · + β̃q Xq + ε̃,

where q < p. Under the condition in Theorem 13.1,

var(β̂1)/var(β̃1) = (1 − R²_{X1.X2···Xq}) / (1 − R²_{X1.X2···Xp}) ≥ 1,
recalling that R²_{U.V} denotes the R² of U on V. Now we compare the corresponding estimated variances v̂ar(β̂1) and ṽar(β̃1) based on homoskedasticity.
1. Show that

v̂ar(β̂1)/ṽar(β̃1) = (1 − R²_{Y.X1···Xp}) / (1 − R²_{Y.X1···Xq}) × (1 − R²_{X1.X2···Xq}) / (1 − R²_{X1.X2···Xp}) × (n − q − 1)/(n − p − 1).
4 Note that this function uses a definition of bic that differs from the above definition by a constant, but this does not affect comparisons across models.
Remark: The first result shows that the ratio of the estimated variances has three factors:
the first one corresponds to the R2 ’s of the outcome on the covariates, the second one is
identical to the one for the ratio of the true variances, and the third one corresponds to
the degrees of freedom correction. The first factor deflates the estimated variance since the
R2 increases with more covariates included in the regression, and the second and the third
factors inflate the estimated variance. Overall, whether adding more covariates inflates or deflates the estimated variance depends on the interplay of the three factors. The answer is not as definite as that in Theorem 13.1.
The variance inflation result in Theorem 13.1 sometimes causes confusion. It only con-
cerns the variance. When we view some covariates as random, then the bias term can also
contribute to the variance of the OLS estimator. In this case, we should interpret Theorem
13.1 with caution. See Ding (2021b) for a related discussion.
is unbiased for σ 2 under the Gauss–Markov model in Assumption 4.1, recalling press in
(13.6) and the leverage score hii of unit i.
Remark: Theorem 4.3 shows that σ̂ 2 = rss/(n − p) is unbiased for σ 2 under the Gauss–
Markov model. rss is the “in-sample” residual sum of squares, whereas press is the “leave-
one-out” residual sum of squares. The estimator σ̂² is standard, whereas σ̂²_press appeared in Shen et al. (2023).
such that
pn (xi ) = yi , (i = 1, . . . , n).
Hint: Use the formula in (A.4).
FIGURE 13.1: Freedman’s simulation. The first row shows the histograms of the R2 s, and
the second row shows the histograms of the p-values in testing that all coefficients are 0.
The first column corresponds to the full model without testing, and the second column
corresponds to the selected model with testing.
[Figure 13.2: RSS/n against the number of covariates in the training and testing data for Example 13.1.]

[Figure 13.3: RSS/n against the number of covariates (order of the polynomial) in the training and testing data for Example 13.2.]
β̂ = (X t X)−1 X t Y,
if the columns of X are highly correlated, then X t X will be nearly singular; more extremely,
if the number of covariates is larger than the sample size, then X t X has a rank smaller
than or equal to n and thus is not invertible. So numerically, the OLS estimator can be
unstable due to inverting X t X. Because X t X must be positive semi-definite, its smallest
eigenvalue determines whether it is invertible or not. Hoerl and Kennard (1970) proposed
the following ridge estimator as a modification of OLS:

β̂^ridge(λ) = (X^tX + λIp)⁻¹ X^tY, (14.1)

which involves a positive tuning parameter λ. Because the smallest eigenvalue of X^tX + λIp is larger than or equal to λ > 0, the ridge estimator is always well defined.
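A minimal sketch of this estimator in R (an assumed implementation, taking the covariates and outcome as already centered so that the intercept can be dropped):

ridge.coef = function(X, Y, lambda)
    solve(t(X) %*% X + lambda*diag(ncol(X)), t(X) %*% Y)

## toy check with p > n, where t(X) %*% X alone is singular
set.seed(1)
n = 20; p = 50
X = matrix(rnorm(n*p), n, p)
Y = rnorm(n)
head(ridge.coef(X, Y, lambda = 1))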
Now I turn to the second equivalent motivation. The OLS estimator minimizes the
residual sum of squares
rss(b0, b1, . . . , bp) = Σ_{i=1}^n (yi − b0 − b1xi1 − · · · − bpxip)².
From Theorem 13.1 on the variance inflation factor, the variances of the OLS estimators
increase with additional covariates included in the regression, leading to unnecessarily large
estimators by chance. To avoid large OLS coefficients, we can penalize the residual sum of
squares criterion with the squared length of the coefficients1 and use
β̂^ridge(λ) = arg min_{b0,b1,...,bp} {rss(b0, b1, . . . , bp) + λ Σ_{j=1}^p bj²}. (14.2)
Again in (14.2), λ is a tuning parameter that ranges from zero to infinity. We first discuss
the ridge estimator with a fixed λ and then discuss how to choose it. When λ = 0, it
reduces to OLS; when λ = ∞, all coefficients must be zero except that β̂0ridge (∞) = ȳ.
1 This is also called the Tikhonov regularization (Tikhonov, 1943). See Bickel and Li (2006) for a review
of the idea of regularization in statistics.
With λ ∈ (0, ∞), the ridge coefficients are generally smaller than the OLS coefficients,
and the penalty shrinks the OLS coefficients toward zero. So the parameter λ controls the
magnitudes of the coefficients or the “complexity” of the model. In (14.2), we only penalize
the slope parameters not the intercept.
As a dual problem in optimization, we can also define the ridge estimator as
β̂^ridge(t) = arg min_{b0,b1,...,bp} rss(b0, b1, . . . , bp) s.t. Σ_{j=1}^p bj² ≤ t. (14.3)
Definitions (14.2) and (14.3) are equivalent because for a given λ, we can always find a t
such that the solutions from (14.2) and (14.3) are identical. In fact, the corresponding t and
λ satisfy t = ∥β̂ ridge (λ)∥2 .
However, the ridge estimator has an obvious problem: it is not invariant to linear trans-
formations of X. In particular, it is not invariant to different scalings of the covariates. Intuitively, the bj's depend on the scale of the Xj's, but the penalty term Σ_{j=1}^p bj² puts equal weight on each coefficient. A convention in practice is to standardize the covariates before applying the ridge estimator2.
Condition 14.1 (standardization) The covariates satisfy
n⁻¹ Σ_{i=1}^n xij = 0, n⁻¹ Σ_{i=1}^n xij² = 1, (j = 1, . . . , p)
MASS. In practical data analysis, the covariates may have concrete meanings. In those cases, you may not
want to scale the covariates in the way as Condition 14.1. However, the discussion below does not rely on
the choice of scaling although it requires centering the covariates and outcome.
Write the singular value decomposition of the covariate matrix as X = UDV^t, where D = diag(d1, . . . , dp) is diagonal, V is orthogonal, and U has orthonormal columns.
which does not equal β in general. So the ridge estimator is biased. We can also calculate
the covariance matrix of the ridge estimator:
cov{β̂^ridge(λ)} = σ² V diag(dj/(dj² + λ)) U^t U diag(dj/(dj² + λ)) V^t = σ² V diag(dj²/(dj² + λ)²) V^t.
The mean squared error (MSE) is a measure capturing the bias–variance trade-off:

mse(λ) = E[{β̂^ridge(λ) − β}^t {β̂^ridge(λ) − β}],

which decomposes into a squared bias term C1 and a variance term C2:

mse(λ) = [E{β̂^ridge(λ)} − β]^t [E{β̂^ridge(λ)} − β] + trace[cov{β̂^ridge(λ)}] = C1 + C2.
Second, we have

C2 = σ² trace{V diag(dj²/(dj² + λ)²) V^t} = σ² trace{diag(dj²/(dj² + λ)²)} = σ² Σ_{j=1}^p dj²/(dj² + λ)².
□
Theorem 14.1 shows the bias–variance trade-off for the ridge estimator. The MSE is

mse(λ) = C1 + C2 = λ² Σ_{j=1}^p γj²/(dj² + λ)² + σ² Σ_{j=1}^p dj²/(dj² + λ)².

When λ = 0, the ridge estimator reduces to the OLS estimator: the bias is zero and the variance σ² Σ_{j=1}^p dj⁻² dominates. When λ = ∞, the ridge estimator reduces to zero: the bias Σ_{j=1}^p γj² dominates and the variance is zero. As we increase λ from zero, the bias increases and the variance decreases. So we face a bias–variance trade-off.
which is equivalent to

λ Σ_{j=1}^p γj²dj²/(dj² + λ)³ = σ² Σ_{j=1}^p dj²/(dj² + λ)³. (14.6)
However, (14.6) is not directly useful because we do not know γ and σ 2 . Three methods
below try to solve (14.6) approximately.
Dempster et al. (1977) used OLS to construct an unbiased estimator σ̂ 2 and γ̂ = V t β̂,
and then solve λ from
λ Σ_{j=1}^p γ̂j²dj²/(dj² + λ)³ = σ̂² Σ_{j=1}^p dj²/(dj² + λ)³,
which is a nonlinear equation of λ.
Hoerl et al. (1975) assumed that X t X = Ip . Then d2j = 1 (j = 1, . . . , p) and γ = β, and
solve λ from
λ Σ_{j=1}^p β̂j²/(1 + λ)³ = σ̂² Σ_{j=1}^p 1/(1 + λ)³,
resulting in
λhkb = pσ̂ 2 /∥β̂∥2 .
Lawless (1976) used
λlw = pσ̂ 2 /β̂ t D2 β̂
to weight the βj ’s based on the eigenvalues of X t X.
But all these methods require estimating (β, σ 2 ). If the initial OLS estimator is not
reliable, then these estimates of λ are unlikely to be reliable. None of these methods work
for the case with p > n.
Golub et al. (1979) proposed the GCV criterion to simplify the calculation of the PRESS
statistic by replacing hii (λ) with their average value n−1 trace{H(λ)}:
gcv(λ) = Σ_{i=1}^n {ε̂i(λ)}² / [1 − n⁻¹ trace{H(λ)}]².
In the R package MASS, the function lm.ridge implements the ridge regression, kHKB and
kLW report two estimators for λ, and GCV contains the GCV values for a sequence of λ.
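A minimal sketch of computing gcv(λ) directly from its definition (an assumed implementation; some versions divide by n, which does not change the minimizer over λ):

gcv.ridge = function(X, Y, lambda)
{
    H    = X %*% solve(t(X) %*% X + lambda*diag(ncol(X)), t(X))   # H(lambda)
    ehat = Y - H %*% Y                                            # ridge residuals
    sum(ehat^2)/(1 - mean(diag(H)))^2
}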
These formulas allow us to compute the ridge coefficient and predictor vector for many
values of λ without inverting each X t X + λIp . We have similar formulas for the case with
n < p; see Problem 14.11.
A subtle point is due to the standardization of the covariates and the outcome. In R, the lm.ridge function first computes the ridge coefficients based on the standardized covariates and outcome, and then transforms them back to the original scale. Let x̄1, . . . , x̄p, ȳ be the means of the covariates and outcome, and let sdj = {n⁻¹ Σ_{i=1}^n (xij − x̄j)²}^{1/2} be the standard deviations of the covariates, which are reported as scales in the output of lm.ridge.
From the ridge coefficients {β̂1ridge (λ), . . . , β̂pridge (λ)} based on the standardized variables,
we can obtain the predicted values based on the original variables as
ŷi (λ) − ȳ = β̂1ridge (λ)(xi1 − x̄1 )/sd1 + · · · + β̂pridge (λ)(xip − x̄p )/sdp
or, equivalently,
ŷi (λ) = α̂ridge (λ) + β̂1ridge (λ)/sd1 × xi1 + · · · + β̂pridge (λ)/sdp × xip
where
α̂ridge (λ) = ȳ − β̂1ridge (λ)x̄1 /sd1 − · · · − β̂pridge (λ)x̄p /sdp .
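A minimal sketch of this back-transformation without relying on the internals of lm.ridge (an assumed implementation: standardize the covariates, center the outcome, compute the ridge coefficients, and map them back to the original scale):

ridge.original.scale = function(X, Y, lambda)
{
    xbar = colMeans(X)
    sdx  = apply(X, 2, function(v) sqrt(mean((v - mean(v))^2)))
    Xs   = scale(X, center = xbar, scale = sdx)           # standardized covariates
    Ys   = Y - mean(Y)                                    # centered outcome
    bs   = solve(t(Xs) %*% Xs + lambda*diag(ncol(X)), t(Xs) %*% Ys)
    beta  = as.vector(bs)/sdx                             # slopes on the original scale
    alpha = mean(Y) - sum(beta*xbar)                      # intercept
    c(intercept = alpha, beta)
}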
library(MASS)
n = 200
p = 100
beta = rep(1/sqrt(p), p)
sig = 1/2
X = matrix(rnorm(n*p), n, p)
X = scale(X)
X = X*sqrt(n/(n - 1))
Y = as.vector(X %*% beta + rnorm(n, 0, sig))
The following code calculates the theoretical bias, variance, and mean squared error,
reported in the (1, 1)th panel of Figure 14.2.
eigenxx = eigen(t(X) %*% X)
xis = eigenxx$values
gammas = t(eigenxx$vectors) %*% beta
The (1, 1)th panel also reported the λ’s based on different approaches.
ridge.fit = lm.ridge(Y ~ X, lambda = lambda.seq)
abline(v = lambda.seq[which.min(ridge.fit$GCV)],
       lty = 2, col = "grey")
abline(v = ridge.fit$kHKB, lty = 3, col = "grey")
abline(v = ridge.fit$kLW, lty = 4, col = "grey")
legend("bottomright",
       c("MSE", "GCV", "HKB", "LW"),
       lty = 1:4, col = "grey", bty = "n")
I also calculate the prediction error of the ridge estimator in the testing dataset, which
follows the same data-generating process as the training dataset. The (1, 2)th panel of
Figure 14.2 shows its relationship with λ. Overall, GCV, HKB, and LW are similar, but the
λ selected by the MSE criterion is the worst for prediction.
X.new = matrix(rnorm(n*p), n, p)
X.new = scale(X.new)
X.new = X.new*matrix(sqrt(n/(n - 1)), n, p)
Y.new = as.vector(X.new %*% beta + rnorm(n, 0, sig))
predict.error = Y.new - X.new %*% ridge.fit$coef
The second row of Figure 14.2 shows the bias-variance trade-off. Overall, GCV works
the best for selecting λ for prediction.
[Figure 14.2: bias, variance, and mse against λ (left column) and the predicted MSE against λ (right column), for independent covariates (top row) and correlated covariates (bottom row), with the λ values selected by the MSE, GCV, HKB, and LW criteria marked.]
We call XV2 the second principal component of X. By induction, we can define all the p
principal components, stacked in the following n × p matrix:
(XV1 , . . . , XVp ) = XV = U DV t V = U D.
where ⟨Uj , Y ⟩ = Ujt Y denotes the inner product of vectors Uj and Y. As a special case with
λ = 0, the OLS estimator yields the predicted value
Ŷ = UU^tY = Σ_{j=1}^p ⟨Uj, Y⟩Uj,
which is identical to the predicted value based on OLS of Y on the principal components
U . Moreover, the principal components in U are orthogonal and have unit length, so the
OLS fit of Y on U is equivalent to the component-wise OLS of Y on Uj with coefficient
⟨Uj, Y⟩ (j = 1, . . . , p). So the predicted value based on OLS equals a linear combination of
the principal components with coefficients ⟨Uj , Y ⟩; the predicted value based on ridge also
equals a linear combination of the principal components but the coefficients are shrunk by
the factors d2j /(d2j + λ).
When the columns of X are not linearly independent, for example, p > n, we cannot
run OLS of Y on X or OLS of Y on U , but we can still run ridge. Motivated by the
formulas above, another approach is to run OLS of Y on the first p∗ principal components
Ũ = (U1 , . . . , Up∗ ) with p∗ < p. This is called the principal component regression (PCR).
The predicted value is

Ŷ(p∗) = ŨŨ^tY = Σ_{j=1}^{p∗} ⟨Uj, Y⟩Uj,

which truncates the summation in the formula of Ŷ based on OLS. Compared to the pre-
dicted values of OLS and ridge, Ŷ (p∗ ) effectively imposes zero weights on the principal
components corresponding to small singular values. It depends on a tuning parameter p∗
similar to λ in the ridge. Since p∗ must be a positive integer and λ can be any positive real
value, PCR is a discrete procedure while ridge is a continuous procedure.
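A minimal sketch of PCR via the SVD (an assumed implementation): keep the first p∗ principal components and truncate the rest.

pcr.predict = function(X, Y, pstar)
{
    sv = svd(X)                           # X = U D V^t
    U  = sv$u[, 1:pstar, drop = FALSE]
    U %*% (t(U) %*% Y)                    # sum of <U_j, Y> U_j over j <= pstar
}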
Y | β ∼ N(Xβ, σ 2 In ), β ∼ N(0, τ 2 Ip ),
where sj = 1 for OLS, sj = dj²/(dj² + λ) for ridge, and sj = 1(j ≤ p∗) for PCR. Based on the unified formula, show that under Assumption 4.1, we have

E(Ŷ) = Σ_{j=1}^p sj dj γj Uj
Remark: This formula has several interesting implications. First, the left-hand side in-
volves inverting a p × p matrix, and it is more useful when p < n; the right-hand side
involves inverting an n × n matrix, so it is more useful when p > n. Second, from the form
on the right-hand side, we can see that the ridge estimator lies in C(X t ), the row space of
X. That is, the ridge estimator can be written as X^tδ, where δ = (XX^t + λIn)⁻¹Y ∈ R^n.
This always holds but is particularly interesting in the case with p > n when the row space
of X is not the entire R^p. Third, if p > n and XX^t is invertible, then we can let λ go to zero on the right-hand side, yielding

β̂^ridge(0) = X^t(XX^t)⁻¹Y,

which is the minimum norm estimator; see Problem 18.7. Using the definition of the pseu-
doinverse in Chapter A, we can further show that
β̂ ridge (0) = X + Y.
The two forms of lasso are equivalent in the sense that for a given λ in (15.2), there exists
a t such that the solution for (15.1) is identical to the solution for (15.2). In particular,
t = Σ_{j=1}^p |β̂j^lasso(λ)|. Technically, the minimizer of the lasso problem may not be unique
especially when p > n, so the right-hand sides of the optimization problems should be a
set. Fortunately, even though the minimizer may not be unique, the resulting predictor is
always unique. Tibshirani (2013) clarifies this issue.
Both forms of the lasso are useful. We will use the form (15.2) for computation and
use the form (15.1) for geometric intuition. Similar to the ridge estimator, the lasso is not
invariant to the linear transformation of X. We proceed after standardizing the covariates
and outcome as Condition 14.1. For the same reason as the ridge, we can drop the intercept
after standardization.
coefficient ∥b∥² = Σ_{j=1}^p bj², and the lasso uses an L1 penalty, i.e., the L1 norm of the coefficient ∥b∥₁ = Σ_{j=1}^p |bj|. Compared to the ridge, the lasso can give sparse solutions due to the non-smooth penalty term. That is, estimators of some coefficients are exactly zero.
Focus on the form (15.1). We can gain insights from the contour plot of the residual sum
of squares as a function of b. With a well-defined OLS estimator β̂, Theorem 3.2 ensures

(Y − Xb)^t(Y − Xb) = (Y − Xβ̂)^t(Y − Xβ̂) + (b − β̂)^t X^tX (b − β̂),

which equals a constant term plus a quadratic function centered at the OLS coefficient.
Without any penalty, the minimizer is of course the OLS coefficient. With the L1 penalty, the OLS coefficient may not be in the region defined by Σ_{j=1}^p |bj| ≤ t. If this happens, the intersection of the contour plot of (Y − Xb)^t(Y − Xb) and the border of the restriction region Σ_{j=1}^p |bj| ≤ t can be at some axis. For example, Figure 15.1 shows a case with p = 2, and
the lasso estimator hits the x-axis, resulting in a zero coefficient for the second coordinate.
However, this does not mean that lasso always generates sparse solutions because sometimes
the intersection of the contour plot of (Y − Xb)t (Y − Xb) and the border of the restriction
region is at an edge of the region. For example, Figure 15.2 shows a case with a non-sparse
lasso solution.
In contrast, the restriction region of the ridge is a circle, so the ridge solution does not
hit any axis unless the original OLS coefficient is zero. Figure 15.3 shows the general ridge
estimator.
The solution in Lemma 15.1 is a function of b0 and λ, and we will use the notation S(b0, λ) for it from now on, where S denotes the soft-thresholding operator. For a given λ > 0, it is a function of b0 illustrated by Figure 15.4. The proof of Lemma 15.1 is to solve the optimization problem directly. It is tricky since we cannot naively solve the first-order condition due to the non-smoothness of |b| at 0. Nevertheless, it is only a one-dimensional optimization problem, and I relegate the proof to Problem 15.2.
Initialize β̂.
2. Update β̂j given all other coefficients. Define the partial residual as rij = yi − Σ_{k≠j} β̂k xik. Updating β̂j is equivalent to minimizing

(2n)⁻¹ Σ_{i=1}^n (rij − bj xij)² + λ|bj|.

Define

β̂j,0 = Σ_{i=1}^n xij rij / Σ_{i=1}^n xij² = n⁻¹ Σ_{i=1}^n xij rij
[Figure 15.4: the soft-thresholding function S(b0, λ) against b0, with λ = 2.]
Then updating β̂j is equivalent to minimizing ½(bj − β̂j,0)² + λ|bj|. Lemma 15.1 implies β̂j = S(β̂j,0, λ).
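A minimal sketch of the full coordinate descent loop under Condition 14.1 (an assumed implementation for illustration; glmnet below is the practical choice):

soft = function(b0, lambda) sign(b0)*pmax(abs(b0) - lambda, 0)   # soft-thresholding S(b0, lambda)

lasso.cd = function(X, Y, lambda, n.iter = 100)
{
    n = nrow(X); p = ncol(X)
    beta = rep(0, p)
    for (it in 1:n.iter) {
        for (j in 1:p) {
            r.j     = Y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residuals
            beta[j] = soft(sum(X[, j]*r.j)/n, lambda)          # update beta_j = S(beta_{j,0}, lambda)
        }
    }
    beta
}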
But if we artificially add 200 columns of covariates of pure noise N(0, 1), then the ridge
and lasso perform much better. Lasso can automatically shrink many coefficients to zero.
See Figure 15.5(b).
> ## adding more noisy covariates
> n.noise = 200
> xnoise = matrix(rnorm(nsample*n.noise), nsample, n.noise)
> xmatrix = cbind(xmatrix, xnoise)
> dat = data.frame(yvector, xmatrix)
>
> ## linear regression
> bostonlm = lm(yvector ~ ., data = dat[trainindex, ])
> predicterror = dat$yvector[-trainindex] -
+     predict(bostonlm, dat[-trainindex, ])
> mse.ols = sum(predicterror^2)/length(predicterror)
>
> ## ridge regression
> lambdas = seq(100, 150, 0.01)
> lm0 = lm.ridge(yvector ~ ., data = dat[trainindex, ],
+     lambda = lambdas)
> coefridge = coef(lm0)[which.min(lm0$GCV), ]
> predicterrorridge = dat$yvector[-trainindex] -
+     cbind(1, xmatrix[-trainindex, ]) %*% coefridge
> mse.ridge = sum(predicterrorridge^2)/length(predicterrorridge)
[Figure 15.5: estimated coefficients against the indices of the covariates.]
> ## lasso
> cvboston = cv.glmnet(x = xmatrix[trainindex, ], y = yvector[trainindex])
> coeflasso = coef(cvboston, s = "lambda.min")
>
> predicterrorlasso = dat$yvector[-trainindex] -
+     cbind(1, xmatrix[-trainindex, ]) %*% coeflasso
> mse.lasso = sum(predicterrorlasso^2)/length(predicterrorlasso)
>
> c(mse.ols, mse.ridge, mse.lasso)
[1] 41.80376 33.33372 32.64287
or, by duality,
Figure 15.7 compares the constraints corresponding to the ridge, lasso, and elastic net.
Because the constraint of the elastic net is not smooth, it encourages sparse solutions in the
same way as the lasso. Due to the ridge penalty, the elastic net can deal with the collinearity
of the covariates better than the lasso.
Friedman et al. (2007) proposed to use the coordinate descent algorithm to solve for
the elastic net estimator, and Friedman et al. (2009) implemented it in an R package called
glmnet.
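A short usage sketch (with assumed data objects x and y): the alpha argument of glmnet mixes the two penalties, with alpha = 1 the lasso, alpha = 0 the ridge, and values in between the elastic net.

library(glmnet)
enet.fit = cv.glmnet(x, y, alpha = 0.5)   # cross-validated elastic net
coef(enet.fit, s = "lambda.min")          # coefficients at the selected lambda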
[Figure 15.7: the constraint regions corresponding to the ridge, elastic net, and lasso.]
Show that if β̂ (1) and β̂ (2) are two solutions, then αβ̂ (1) + (1 − α)β̂ (2) is also a solution for
any 0 ≤ α ≤ 1. Show that X β̂ (1) = X β̂ (2) must hold.
Hint: The function ∥·∥² is strongly convex. That is, for any v1, v2 and 0 < α < 1, we have

∥αv1 + (1 − α)v2∥² ≤ α∥v1∥² + (1 − α)∥v2∥²,

and the inequality is strict when v1 ≠ v2. The function ∥·∥ is convex. That is, for any v1, v2 and 0 < α < 1, we have

∥αv1 + (1 − α)v2∥ ≤ α∥v1∥ + (1 − α)∥v2∥.
X t 1n = 0, X t X = Ip .
For a fixed λ ≥ 0, find the explicit formulas of the jth coordinates of the following estimators
in terms of the corresponding jth coordinate of the OLS estimator β̂j and λ (j = 1, . . . , p):
where

∥b∥² = Σ_{j=1}^p bj², ∥b∥₁ = Σ_{j=1}^p |bj|, ∥b∥₀ = Σ_{j=1}^p 1(bj ≠ 0).
where

Ỹ = (Y^t, 0p^t)^t, X̃ = (X^t, √(λα) Ip)^t, λ̃ = λ(1 − α).
Hint: Use the result in Problem 14.4.
as

min_{u,v∈R^p} {∥Y − X(u ◦ v)∥² + λ(∥u∥² + ∥v∥²)/2},
where ◦ denotes the component-wise product of vectors. Hoff (2017, Lemma 1) showed that
a local minimum of the new problem must be a local minimum of the lasso problem.
Show that the new problem can be solved based on the following iterative ridge regres-
sions:
we can interpret β̂j in the following way: ceteris paribus, if xij increases by one unit, then
the proportional increase in the average outcome is β̂j . In economics, β̂j is the semi-elasticity
of y on xj in the model with log transformation on the outcome.
Sometimes, we may apply the log transformation on both the outcome and a certain
covariate:
log yi = β1 xi1 + · · · + βj log xij + · · · + εi , (i = 1, . . . , n).
The jth fitted coefficient becomes
∂ log ŷi / ∂ log xij = (∂ŷi/ŷi) / (∂xij/xij) = β̂j,
so ceteris paribus, if xij increases by 1%, then the average outcome will increase by β̂j %.
In economics, β̂j is the xj -elasticity of y in the model with log transformation on both the
outcome and xj .
The log transformation only works for positive variables. For a nonnegative outcome,
we can modify the log transformation to log(yi + 1).
If we treat the density function of Y as a function of (β, σ 2 , λ), then it is the likelihood func-
tion, defined as L(β, σ 2 , λ). Given (σ 2 , λ), maximizing the likelihood function is equivalent
to minimizing (Yλ − Xβ)t (Yλ − Xβ), i.e., we can run OLS of Yλ on X to obtain
β̂(λ) = (X t X)−1 X t Yλ .
Given λ, maximizing the likelihood function is equivalent to first obtaining β̂(λ) and then
obtaining σ̂²(λ) = n⁻¹Yλ^t(In − H)Yλ. The final step is to maximize the profile likelihood as
a function of λ:
L(β̂(λ), σ̂²(λ), λ) = {2πσ̂²(λ)}^{−n/2} exp{−nσ̂²(λ)/(2σ̂²(λ))} Π_{i=1}^n yi^{λ−1}.
The boxcox function in the R package MASS plots lp(λ), finds its maximizer λ̂, and constructs a 95% confidence interval [λ̂L, λ̂U] based on the following asymptotic pivotal quantity:

2{lp(λ̂) − lp(λ)} ∼ χ²₁,

which holds by Wilks' Theorem. In practice, we often use the λ values within [λ̂L, λ̂U] that
have more scientific meanings.
I use two datasets to illustrate the Box–Cox transformation, with the R code in
code16.1.2.R. For the jobs data, λ = 2 seems a plausible value.
library(MASS)
library(mediation)
par(mfrow = c(1, 3))
jobslm = lm(job_seek ~ treat + econ_hard + depress1 + sex + age + occp + marital +
                nonwhite + educ + income, data = jobs)
boxcox(jobslm, lambda = seq(1.5, 3, 0.1), plotit = TRUE)
jobslm2 = lm(I(job_seek^2) ~ treat + econ_hard + depress1 + sex + age + occp + marital +
                 nonwhite + educ + income, data = jobs)
hist(jobslm$residuals, xlab = "residual", ylab = "",
     main = "job_seek", font.main = 1)
hist(jobslm2$residuals, xlab = "residual", ylab = "",
     main = "job_seek^2", font.main = 1)
[Figure: the Box–Cox profile log-likelihood with its 95% interval for the jobs data, and histograms of the residuals from the OLS fits with job_seek and job_seek^2 as the outcomes.]
In the Penn bonus experiment data, λ = 0.3 seems a plausible value. However, the resid-
ual plot does not seem Normal, making the Box–Cox transformation not very meaningful.
penndata = read.table("pennbonus.txt")
par(mfrow = c(1, 3))
pennlm = lm(duration ~ ., data = penndata)
boxcox(pennlm, lambda = seq(0.2, 0.4, 0.05), plotit = TRUE)
pennlm.3 = lm(I(duration^0.3) ~ ., data = penndata)
hist(pennlm$residuals, xlab = "residual", ylab = "",
     main = "duration", font.main = 1)
hist(pennlm.3$residuals, xlab = "residual", ylab = "",
     main = "duration^0.3", font.main = 1)
[Figure: the Box–Cox profile log-likelihood with its 95% interval for the Penn bonus data, and histograms of the residuals from the OLS fits with duration and duration^0.3 as the outcomes.]
yi = β1 + β2 xi + β3 xi² + · · · + βp xi^{p−1} + εi.
In economics, it is almost the default choice to include the quadratic term of working
experience in the log wage equation. I give an example below using the data from Angrist
et al. (2006). The quadratic term of exper is significant.
> library(foreign)
> census00 = read.dta("census00.dta")
> head(census00)
  age educ    logwk     perwt exper exper2 black
1  48   12 6.670576 1.0850021    30    900     0
2  42   13 6.783905 0.9666383    23    529     0
3  49   13 6.762383 1.2132297    30    900     0
4  44   13 6.302851 0.4833191    25    625     0
5  45   16 6.043386 0.9666383    23    529     0
6  43   13 5.061138 1.0850021    24    576     0
>
> census00ols1 = lm(logwk ~ educ + exper + black,
+                   data = census00)
> census00ols2 = lm(logwk ~ educ + exper + I(exper^2) + black,
+                   data = census00)
We can also include polynomial terms of more than one covariate, for example,
or
(1, xi1, . . . , xi1^d, xi2, . . . , xi2^l, xi1xi2, . . . , xi1^d xi2^l).
We can also approximate the conditional mean function of the outcome by a linear
combination of some basis functions:
yi = f(xi) + εi ≈ Σ_{j=1}^J βj Sj(xi) + εi,
where the Sj (xi )’s are basis functions. The gam function in the mgcv package uses this strat-
egy including the automatic procedure of choosing the number of basis functions J. The
following example has a sine function as the truth, and the basis expansion approximation
yields reasonable performance with sample size n = 1000. Figure 16.3 plots both the true
and estimated curves.
library(mgcv)
n = 1000
dat = data.frame(x <- seq(0, 1, length.out = n),
                 true <- sin(x*10),
                 y <- true + rnorm(n))
np.fit = gam(y ~ s(x), data = dat)
plot(y ~ x, data = dat, bty = "n",
     pch = 19, cex = 0.1, col = "grey")
lines(true ~ x, col = "grey")
lines(np.fit$fitted.values ~ x, lty = 2)
legend("bottomright", c("true", "estimated"),
       lty = 1:2, col = c("grey", "black"),
       bty = "n")
yi = f1(xi1) + · · · + fp(xip) + εi ≈ Σ_{j=1}^{J1} β1j Sj(xi1) + · · · + Σ_{j=1}^{Jp} βpj Sj(xip) + εi.
The gam function in the mgcv package implements this strategy. Again I use the dataset from
Angrist et al. (2006) to illustrate the procedure with nonlinearity in educ and exper shown
in Figure 16.4.
[Figure 16.3: the simulated data, the true sine curve, and the estimated curve.]
The R code in this section is in code16.2.1.R. See Wood (2017) for more details about the
generalized additive model.
So

yi = β1 + β2 xi + εi if xi ≤ c, and yi = (β1 + β3) + (β2 + β4)xi + εi if xi > c.
Testing the discontinuity at c is equivalent to testing
(β1 + β3 ) + (β2 + β4 ) c = β1 + β2 c ⇐⇒ β3 + β4 c = 0.
[Figure 16.4: the estimated smooth functions s(educ, 8.98) and s(exper, 6.43) from the additive model.]

[Figure 16.5: examples of regression discontinuity (left) and regression kink (right).]
and

yi = β1 + β2(xi − c) + εi if xi ≤ c, and yi = (β1 + β3) + (β2 + β4)(xi − c) + εi if xi > c.

So testing the discontinuity at c is equivalent to testing β3 = 0.
The right panel of Figure 16.5 shows an example of regression kink, where the linear
functions before and after a cutoff point can differ but the whole regression line is continuous.
A simple way to capture the two regimes of linear regression is to fit the following model:
yi = β1 + β2 Rc (xi ) + β3 (xi − c) + εi
using Rc(x) = max(0, x − c), which equals 0 for x ≤ c and x − c for x > c.
So

yi = β1 + β3(xi − c) + εi if xi ≤ c, and yi = β1 + (β2 + β3)(xi − c) + εi if xi > c.
This ensures that the mean function is continuous at c with both left and right limits
equaling β1 . Testing the kink is equivalent to testing β2 = 0.
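A minimal sketch of fitting and testing both specifications (with assumed data x, y and cutoff c0):

xc = x - c0
rd.fit   = lm(y ~ xc*I(x > c0))        # discontinuity: test the coefficient of I(x > c0)TRUE
kink.fit = lm(y ~ pmax(0, xc) + xc)    # kink: test the coefficient of pmax(0, xc)
summary(rd.fit)
summary(kink.fit)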
These regressions have many applications in economics, but I omit the economic back-
ground. Readers can find more discussions in Angrist and Pischke (2008) and Card et al.
(2015).
yi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂12 xi1 xi2 + ε̂i . (17.1)
We can express the coefficients in terms of the means of the outcomes within four combi-
nations of the covariates. The following proposition is an algebraic result.
β̂0 = ȳ00 ,
β̂1 = ȳ10 − ȳ00 ,
β̂2 = ȳ01 − ȳ00 ,
β̂12 = (ȳ11 − ȳ10 ) − (ȳ01 − ȳ00 ),
where ȳf1 f2 is the average value of the yi ’s with xi1 = f1 and xi2 = f2 .
The proof of Proposition 17.1 is purely algebraic and is relegated to Problem 17.1. The proposition generalizes to OLS with more than two binary covariates. See Zhao and Ding (2022) for more details.
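A quick numerical check of Proposition 17.1 with simulated binary covariates (an assumed example, not from the text):

set.seed(1)
n  = 400
x1 = rbinom(n, 1, 0.5)
x2 = rbinom(n, 1, 0.5)
y  = 1 + x1 + 2*x2 + 3*x1*x2 + rnorm(n)
means = tapply(y, list(x1, x2), mean)    # ybar_{f1 f2}
c(coef(lm(y ~ x1*x2))["x1:x2"],
  (means["1", "1"] - means["1", "0"]) - (means["0", "1"] - means["0", "0"]))   # identical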
Practitioners also interpret the coefficient of the product term of two continuous variables
as an interaction. The coefficient β̂12 equals the difference between ȳ11 − ȳ10 , the effect of
xi2 on yi holding xi1 at level 1, and ȳ01 − ȳ00 , the effect of xi2 on yi holding xi1 at level 0.
It also equals
β̂12 = (ȳ11 − ȳ01 ) − (ȳ10 − ȳ00 ),
that is, the difference between ȳ11 − ȳ01 , the effect of xi1 on yi holding xi2 at level 1, and
ȳ10 − ȳ00 , the effect of xi1 on yi holding xi2 at level 0. The formula shows the symmetry of
xi1 and xi2 in defining interaction.
where E(εi | zi , xi ) = 0. So
E(yi | zi = 1, xi ) = β0 + β1 + (β2 + β3 )t xi
and
E(yi | zi = 0, xi ) = β0 + β2t xi ,
which implies that

E(yi | zi = 1, xi) − E(yi | zi = 0, xi) = β1 + β3^t xi.
The conditional average treatment effect is thus a linear function of the covariates. As long
as β3 ̸= 0, we have treatment effect heterogeneity, which is also called effect modification.
A statistical test for β3 = 0 is straightforward based on OLS and EHW standard error.
Note that (17.2) includes the interaction of the treatment and all covariates. With prior
knowledge, we may believe that the treatment effect varies with respect to a subset of
covariates, or, equivalently, we may set some components of β3 to be zero.
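A minimal sketch of this test (with assumed data: binary treatment z, covariates x1 and x2, and outcome y), using the sandwich and lmtest packages for the EHW covariance and the car package for the joint test:

library(sandwich); library(lmtest); library(car)
fit  = lm(y ~ z*(x1 + x2))
vEHW = vcovHC(fit, type = "HC2")
coeftest(fit, vcov. = vEHW)                                      # robust t-tests for each coefficient
linearHypothesis(fit, c("z:x1 = 0", "z:x2 = 0"), vcov. = vEHW)   # joint test of no effect modification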
Call:
lm(formula = log(y) ~ x1 * x2)

Residuals:
    Min      1Q  Median      3Q     Max
-3.7373 -0.6822 -0.0111  0.7084  3.1039

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.003214   0.031286   0.103    0.918
x1           1.056801   0.030649  34.480   <2e-16 ***
x2           1.009404   0.030778  32.797   <2e-16 ***
x1:x2       -0.017528   0.030526  -0.574    0.566
Call:
lm(formula = y ~ x1 * x2)

Residuals:
   Min     1Q Median     3Q    Max
-35.95  -5.17  -0.97   2.34 513.35

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2842     0.6686   7.903 7.17e-15 ***
x1            6.7565     0.6550  10.315  < 2e-16 ***
x2            4.9548     0.6577   7.533 1.11e-13 ***
x1:x2         7.3810     0.6524  11.314  < 2e-16 ***
Call:
lm(formula = read ~ math + socst, data = hsbdemo)

Residuals:
     Min       1Q   Median       3Q      Max
-18.8729  -4.8987  -0.6286   5.2380  23.6993

Coefficients:
Then we add the interaction term into the OLS, and suddenly we have significant inter-
action but not significant main effects.
> ols.fit = lm(read ~ math*socst, data = hsbdemo)
> summary(ols.fit)

Call:
lm(formula = read ~ math * socst, data = hsbdemo)

Residuals:
     Min       1Q   Median       3Q      Max
-18.6071  -4.9228  -0.7195   4.5912  21.8592

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.842715  14.545210   2.602  0.00998 **
math        -0.110512   0.291634  -0.379  0.70514
socst       -0.220044   0.271754  -0.810  0.41908
math:socst   0.011281   0.005229   2.157  0.03221 *
However, if we center the covariates, the main effects are significant again.
> hsbdemo$math.c = hsbdemo$math - mean(hsbdemo$math)
> hsbdemo$socst.c = hsbdemo$socst - mean(hsbdemo$socst)
> ols.fit = lm(read ~ math.c*socst.c, data = hsbdemo)
> summary(ols.fit)

Call:
lm(formula = read ~ math.c * socst.c, data = hsbdemo)

Residuals:
     Min       1Q   Median       3Q      Max
-18.6071  -4.9228  -0.7195   4.5912  21.8592

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    51.615327   0.568685  90.763  < 2e-16 ***
math.c          0.480654   0.063701   7.545 1.65e-12 ***
socst.c         0.373829   0.055546   6.730 1.82e-10 ***
math.c:socst.c  0.011281   0.005229   2.157   0.0322 *
and

n⁻¹ Σ_{i=1}^n ∂E(yi | xi1, xi2)/∂xi2 = n⁻¹ Σ_{i=1}^n (β2 + β12xi1) = β2 + β12x̄1,

which are called the average partial or marginal effects. So when the covariates are centered, we can interpret β1 and β2 as the main effects. In contrast, the interpretation of the interaction term does not depend on the centering of the covariates because

∂²E(yi | xi1, xi2)/(∂xi1∂xi2) = β12.
17.3.3 Power
Usually, statistical tests for interaction do not have enough power. Proposition 17.1 provides
a simple explanation. The variance of the interaction equals
var(β̂12) = σ²11/n11 + σ²10/n10 + σ²01/n01 + σ²00/n00,

where σ²_{f1f2} = var(yi | xi1 = f1, xi2 = f2). Therefore, its variance is driven by the smallest
value of n11 , n10 , n01 , n00 . Even when the total sample size is large, one of the subgroup
sample sizes can be small, resulting in a large variance of the estimator of the interaction.
ŷi = γ̂1 + xi^t β̂1 and ŷi = γ̂0 + xi^t β̂0

with data in group 1 and group 0, respectively. We can also fit a joint OLS using the pooled
data:
ŷi = α̂0 + α̂z zi + xti α̂x + zi xti α̂zx .
1. Find (α̂0 , α̂z , α̂x , α̂zx ) in terms of (γ̂1 , β̂1 , γ̂0 , β̂0 ).
2. Show that the fitted values ŷi ’s are the same from the separate and the pooled OLS for
all units i = 1, . . . , n.
3. Show that the leverage scores hii ’s are the same from the separate and the pooled OLS.
estimate the variance based on (17.3), but we cannot do so based on (17.4). The statistical
test discussed in the main paper does not apply. Chow (1960) proposed the following test
based on prediction.
Let γ̂0 and θ̂0 be the coefficients, and σ02 be the variance estimate based on OLS with
units zi = 0. Under the null hypothesis that γ0 = γ1 and θ0 = θ1 , predict the outcomes of
the units zi = 1:
ŷi = γ̂0 + θ̂0t xi
with the prediction error
di = yi − ŷi
following a multivariate Normal distribution. Propose an F test based on di with zi = 1.
Hint: It is more convenient to use the matrix form of OLS.
Under any location transformations of the covariates x′i1 = xi1 − c1 , x′i2 = xi2 − c2 , we can
fit the OLS
yi = β̃0 + β̃1 x′i1 + β̃2 x′i2 + β̃12 x′i1 x′i2 + ε̃i .
1. Express β̂0 , β̂1 , β̂2 , β̂12 in terms of β̃0 , β̃1 , β̃2 , β̃12 . Verify that β̂12 = β̃12 .
2. Show that the EHW standard errors for β̂12 and β̃12 are identical.
Hint: Use the results in Problems 3.4 and 6.4.
18
Restricted OLS
Assume that in the standard linear model Y = Xβ + ε, the parameter has restriction
Cβ = r (18.1)
where C is an l × p matrix and r is a l dimensional vector. Assume that C has linearly inde-
pendent row vectors; otherwise, some restrictions are redundant. We can use the restricted
OLS:
β̂r = arg min_{b∈R^p} ∥Y − Xb∥² s.t. Cb = r.
18.1 Examples
Example 18.1 (Short regression) Partition X into X1 and X2 with k and l columns,
respectively, with p = k + l. The short regression of Y on X1 yields OLS coefficient β̂1 . So
(β̂1t , 0tl ) = β̂r with
C = (0l×k , Il×l ), r = 0l .
Example 18.2 (Testing linear hypothesis) Consider testing the linear hypothesis
Cβ = r in the linear model. We have discussed in Chapter 5 the Wald test based on the
OLS estimator and its estimated covariance matrix under the Normal linear model. An al-
ternative strategy is to test the hypothesis based on comparing the residual sum of squares
under the OLS and restricted OLS. Therefore, we need to compute both β̂ and β̂r .
A canonical choice is βQ1 = 0, which is equivalent to dropping the last dummy variable due to its redundancy. Another canonical choice is Σ_{j=1}^{Q1} βj = 0. This restriction keeps the
symmetry of the regressors in the linear model and changes the interpretation of βj as the
deviation from the “effect” of level j with respect to the average “effect.” Both are special
cases of restricted OLS.
Example 18.4 (Two-way analysis of variance) With two factors of levels Q1 and
Q2 , respectively, the regressor xi contains the Q1 dummy variables of the first factor,
(fi1 , . . . , fiQ1 )t , the Q2 dummies of the second factor, (gi1 , . . . , giQ2 )t , and the Q1 Q2 dummy
variables of the interaction terms, (fi1 gi1 , . . . , fiQ1 giQ2 )t . We must impose restrictions on
the parameters in the linear model
yi = α + Σ_{j=1}^{Q1} βj fij + Σ_{k=1}^{Q2} γk gik + Σ_{j=1}^{Q1} Σ_{k=1}^{Q2} δjk fij gik + εi.
Similar to the discussion in Example 18.3, two canonical choices of restrictions are
and

Σ_{j=1}^{Q1} βj = 0, Σ_{k=1}^{Q2} γk = 0, Σ_{j=1}^{Q1} δjk = Σ_{k=1}^{Q2} δjk = 0, (j = 1, . . . , Q1; k = 1, . . . , Q2).
Proof of Theorem 18.1: The Lagrangian for the restricted optimization problem is

L(b, λ) = ∥Y − Xb∥² + 2λ^t(Cb − r),

with first-order condition

2X^t(Y − Xb) − 2C^tλ = 0,
which implies
X t Xb = X t Y − C t λ. (18.2)
Solve the linear system in (18.2) to obtain
b = (X t X)−1 (X t Y − C t λ).
C(X t X)−1 (X t Y − C t λ) = r
β̂r = Mr β̂,
where
Mr = Ip − (X t X)−1 C t {C(X t X)−1 C t }−1 C.
Moreover, Mr satisfies the following properties
Mr (X t X)−1 C t = 0, CMr = 0, {Ip − C t (CC t )−1 C}Mr = Mr .
The Mr matrix plays central roles below.
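A minimal sketch of the restricted OLS estimator in closed form (an assumed implementation of the general formula β̂r = β̂ − (X^tX)⁻¹C^t{C(X^tX)⁻¹C^t}⁻¹(Cβ̂ − r), which reduces to Mrβ̂ when r = 0):

rols = function(X, Y, C, r)
{
    XtXinv = solve(t(X) %*% X)
    bhat   = XtXinv %*% t(X) %*% Y                                   # unrestricted OLS
    bhat - XtXinv %*% t(C) %*% solve(C %*% XtXinv %*% t(C), C %*% bhat - r)
}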
The following result is also an immediate corollary of Theorem 18.1.
Corollary 18.2 Under the restriction (18.1), we have
E(β̂r ) = β,
cov(β̂r ) = σ 2 Mr (X t X)−1 Mrt .
Moreover, under the Normal linear model with the restriction (18.1), we can derive the
exact distribution of the restricted OLS estimator and propose an unbiased estimator for
σ2 .
Theorem 18.2 Assume the Normal linear model with the restriction (18.1). We have
Based on the results in Theorem 18.2, we can derive the t and F statistics for finite-
sample inference of β based on the estimator β̂r and the estimated covariance matrix
Corollary 18.3 and Theorem 18.2 extend the results for the OLS estimator. I leave their
proofs as Problem 18.3.
I then discuss statistical inference under the heteroskedastic linear model with the re-
striction (18.1). Corollary 18.2 implies that
where the ε̂i,r ’s are the residuals from the restricted OLS.
with X̃ = XΓ and C̃ = CΓ yield (β̂r , ϵ̂r , V̂ehw,r ) and (β̃r , ϵ̃r , Ṽehw,r ) as the coefficient vectors,
residuals, and robust covariances.
Prove that they must satisfy
is
β̂m = X t (XX t )−1 Y.
as long as the block matrix

W = ( X^tX  C^t
       C    0 )

is invertible.
Derive the statistical results in parallel with Section 18.3.
Remark: If X has full column rank p, then W must be invertible. Even if X does not
have full column rank, W can still be invertible. See Problem 18.9 below for more details.
It is unbiased because
In particular, cov(β̂Σ ) is smaller than or equal to cov(β̂) in the matrix sense1 . So based on
(19.4) and (19.5), we have the following pure linear algebra inequality:
Corollary 19.1 If X has linear independent columns and Σ is invertible, then
From the derivation above, we can also write the WLS estimator as
1 The matrix X^tΣ⁻¹X is positive definite and thus invertible, because (1) for any α ∈ R^p, Σ⁻¹ ⪰ 0 implies α^tX^tΣ⁻¹Xα ≥ 0, and (2) α^tX^tΣ⁻¹Xα = 0 ⟺ Xα = 0 ⟺ α = 0 since X has linearly independent columns.
where y∗i = wi^{1/2} yi and x∗i = wi^{1/2} xi. So WLS is equivalent to the OLS with transformed
variables, with the weights inversely proportional to the variances of the errors. By this
equivalence, WLS inherits many properties of OLS. See the problems in Section 19.5 for
more details.
Analogous to OLS, we can derive finite-sample exact inference based on the generalized
Normal linear model:
yi = xti β + εi , εi ∼ N(0, σ 2 /wi ),
or, equivalently,
y∗i = xt∗i β + ε∗i , ε∗i ∼ N(0, σ 2 ).
The lm function with weights reports the standard error, t-statistic, and p-value based on
this model. This assumes that the weights fully capture the heteroskedasticity, which is
unrealistic in many problems.
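A quick check of this equivalence (with assumed data: a design matrix X that includes the intercept column, outcome Y, and positive weights w):

wls1 = lm(Y ~ 0 + X, weights = w)                  # WLS directly
wls2 = lm(I(sqrt(w)*Y) ~ 0 + I(sqrt(w)*X))         # OLS on the transformed variables
cbind(coef(wls1), coef(wls2))                      # identical coefficients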
In addition, we can derive asymptotic inference based on the following heteroskedastic
model
yi = xti β + εi
where the εi ’s are independent with mean zero and variances σi2 (i = 1, . . . , n). It is possible
that wi ̸= 1/σi2 , i.e., the variances used to construct the WLS estimator can be misspec-
ified. Even though there is no guarantee that β̂w is BLUE, it is still unbiased. From the
decomposition
β̂w = (Σ_{i=1}^n wi xi xi^t)⁻¹ Σ_{i=1}^n wi xi yi = (Σ_{i=1}^n wi xi xi^t)⁻¹ Σ_{i=1}^n wi xi (xi^tβ + εi) = β + (n⁻¹ Σ_{i=1}^n wi xi xi^t)⁻¹ (n⁻¹ Σ_{i=1}^n wi xi εi),

we can apply the law of large numbers to show that β̂w is consistent for β and apply the CLT to show that β̂w is approximately N(β, Vw), where

Vw = n⁻¹ (n⁻¹ Σ_{i=1}^n wi xi xi^t)⁻¹ (n⁻¹ Σ_{i=1}^n wi² σi² xi xi^t) (n⁻¹ Σ_{i=1}^n wi xi xi^t)⁻¹.
where ε̂w,i = yi − xti β̂w is the residual from the WLS. Note that in the sandwich covariance,
wi appears in the “bread” but wi2 appears in the “meat.” This formula appeared in Magee
(1998) and Romano and Wolf (2017). The function hccm in the R package car can compute
various EHW covariance estimators based on WLS. To save space in the examples below,
I report only the standard errors based on the generalized Normal linear model and leave
the calculations of the EHW covariances as a homework problem.
β̂fgls = (Σ_{i=1}^n σ̂i⁻² xi xi^t)⁻¹ Σ_{i=1}^n σ̂i⁻² xi yi.
In Step 2, we can change the model based on our understanding of heteroskedasticity. Here
I use the Boston housing data to compare the OLS and FGLS, with R code in code18.3.1.R.
> library(mlbench)
> data(BostonHousing)
> ols.fit = lm(medv ~ ., data = BostonHousing)
> dat.res = BostonHousing
> dat.res$medv = log((ols.fit$residuals)^2)
> t.res.ols = lm(medv ~ ., data = dat.res)
> w.fgls = exp(-t.res.ols$fitted.values)
> fgls.fit = lm(medv ~ ., weights = w.fgls, data = BostonHousing)
> ols.fgls = cbind(summary(ols.fit)$coef[, 1:3],
+                  summary(fgls.fit)$coef[, 1:3])
> round(ols.fgls, 3)
            Estimate Std. Error t value Estimate Std. Error t value
(Intercept)   36.459      5.103   7.144    9.499      4.064    2.34
crim          -0.108      0.033  -3.287   -0.081      0.044   -1.82
zn             0.046      0.014   3.382    0.030      0.011    2.67
indus          0.021      0.061   0.334   -0.035      0.038   -0.92
chas1          2.687      0.862   3.118    1.462      1.119    1.31
nox          -17.767      3.820  -4.651   -7.161      2.784   -2.57
rm             3.810      0.418   9.116    5.675      0.364   15.59
age            0.001      0.013   0.052   -0.044      0.008   -5.50
dis           -1.476      0.199  -7.398   -0.927      0.139   -6.68
rad            0.306      0.066   4.613    0.170      0.051    3.31
tax           -0.012      0.004  -3.280   -0.010      0.002   -4.14
ptratio       -0.953      0.131  -7.283   -0.700      0.094   -7.45
b              0.009      0.003   3.467    0.014      0.002    6.54
lstat         -0.525      0.051 -10.347   -0.158      0.036   -4.38
Unfortunately, the point estimates and standard errors from OLS and FGLS are quite different
for several covariates. This suggests that the linear model may be misspecified: if it were
correctly specified, both estimators would be consistent for the same true coefficient, and
they should not differ so much due to randomness alone.
The above FGLS estimator is close to Wooldridge (2012, Chapter 8). Romano and Wolf
(2017) propose to regress log(max(δ², ε̂_i²)) on log|x_i1|, . . . , log|x_ip| to estimate the individual
variances. Their modification has two features: first, they truncate the small residuals by a
pre-specified positive number δ²; second, their regressors are the logs of the absolute values
of the original covariates. Romano and Wolf (2017) highlighted the efficiency gain from the
[Figure: simulated comparison of the WLS and OLS estimators across sample sizes n = 500, 1000, 1500, 2000.]
FGLS compared to OLS in the presence of heteroskedasticity. DiCiccio et al. (2019) proposed
some improved versions of the FGLS estimator even if the variance function is misspecified.
However, it is unusual for practitioners to use FGLS even though it can be more efficient
than OLS. There are several reasons. First, the EHW standard errors are convenient for
correcting the standard error of OLS under heteroskedasticity. Second, the efficiency gain
is usually small, and it is even possible that the FGLS is less efficient than OLS when the
variance function is misspecified. Third, the linear model is very likely to be misspecified,
and if so, OLS and FGLS estimate different parameters. OLS retains the interpretations as
the best linear predictor and the best linear approximation of the conditional mean, but the
FGLS has more complicated interpretations when the linear model is wrong. For these
reasons, we need to carefully justify the choice of FGLS over OLS in real data analyses.
Suppose we only observe precinct-level aggregated data (x̄_i·, ȳ_i·), which are the fraction of
black registered voters and the fraction voting, respectively. Can we infer the individual
voting behavior based on the aggregated data? In general, this is almost impossible. Under
some assumptions, we can make progress. Goodman's ecological regression below is one possibility.
Assume that for precinct i = 1, . . . , n, we have
y_ij | x_ij = 1 ∼iid Bernoulli(p_i1),  y_ij | x_ij = 0 ∼iid Bernoulli(p_i0),  (j = 1, . . . , n_i).
This is the individual-level model, where the pi1 ’s and pi0 ’s measure the association between
race and voting. We further assume that they are random and independent of the xij ’s, with
means
E(pi1 ) = p1 , E(pi0 ) = p0 . (19.6)
Then we can decompose the aggregated outcome variable as
ȳ_i· = n_i^{-1} Σ_{j=1}^{n_i} y_ij
     = n_i^{-1} Σ_{j=1}^{n_i} {x_ij y_ij + (1 − x_ij) y_ij}
     = n_i^{-1} Σ_{j=1}^{n_i} {x_ij p_1 + (1 − x_ij) p_0} + ε_i
     = p_1 x̄_i· + p_0 (1 − x̄_i·) + ε_i,
where
ε_i = n_i^{-1} Σ_{j=1}^{n_i} {x_ij (y_ij − p_1) + (1 − x_ij)(y_ij − p_0)}
satisfies
E(ε_i | x̄_i·) = 0.
Goodman (1953) suggested using the OLS of ȳ_i· on {x̄_i·, (1 − x̄_i·)} to estimate (p_1, p_0),
and Goodman (1959) suggested the corresponding WLS with weight n_i since the variance of ε_i is inversely proportional to n_i.
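The following simulation sketch (hypothetical precinct data, constant p_i1 = p_1 and p_i0 = p_0) illustrates Goodman's ecological regression fitted by OLS and by WLS with weight n_i:

# Simulated precincts: aggregated (xbar, ybar) and Goodman's regression
# of ybar on {xbar, 1 - xbar} without an intercept.
set.seed(2)
n    = 200                                  # number of precincts
ni   = sample(200:1000, n, replace = TRUE)  # precinct sizes
p1   = 0.6; p0 = 0.3
xbar = runif(n)
ybar = rbinom(n, ni, p1 * xbar + p0 * (1 - xbar)) / ni
ols.good = lm(ybar ~ 0 + xbar + I(1 - xbar))
wls.good = lm(ybar ~ 0 + xbar + I(1 - xbar), weights = ni)
rbind(coef(ols.good), coef(wls.good))       # both near (p1, p0)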
The assumption in (19.6) is crucial, and it can be too strong when the precinct-level p_i1's
and p_i0's vary in systematic but unobserved ways. When the assumption is violated, it is
possible that the ecological regression yields the opposite result compared to the individual-level
regression. This is called the ecological fallacy.
Another obvious problem of ecological regression is that the estimated coefficients may
lie outside of the interval [0, 1]. Problem 19.18 gives an example.
Gelman et al. (2001) gave an alternative set of assumptions justifying the ecological
regression. King (1997) proposed some extensions. Robinson (1950) warned that the ecolog-
ical correlation might not inform individual correlation. Freedman et al. (1991) warned that
the assumptions underlying the ecological regression might not be plausible in practice.
A smooth function can be approximated locally by a line, f(x) ≈ f(x_0) + f′(x_0)(x − x_0),
when x is near x_0. The left panel of Figure 19.2 shows that in the neighborhood of x_0 = 0.4,
even a sine function can be well approximated by a line. Based on data (x_i, y_i)_{i=1}^n, if we
want to predict the mean value of y given x = x_0, then we can fit a line using the local data
points close to x_0. It is also reasonable to down-weight the points that are far from x_0,
which motivates the following WLS:
(α̂, β̂) = arg min_{a,b} Σ_{i=1}^n w_i {y_i − a − b(x_i − x_0)}²
with wi = K {(xi − x0 )/h} where K(·) is called the kernel function and h is called the
bandwidth parameter. With the fitted line ŷ(x) = α̂ + β̂(x − x0 ), the predicted value at
x = x0 is the intercept α̂.
Technically, K(·) can be any density function, and two canonical choices are the standard
Normal density and the Epanechnikov kernel K(t) = 0.75(1 − t²)1(|t| ≤ 1). The choice of
[Figure 19.2: two panels of the simulated scatter plot of y against x ∈ [0, 1], illustrating the local linear approximation.]
the kernel does not matter that much, but the choice of the bandwidth matters much more.
With a large bandwidth, we have a poor linear approximation, leading to bias; with a small
bandwidth, we have few data points, leading to large variance. So we face a bias-variance
trade-off. In practice, we can use cross-validation or other criteria to select h.
In general, we can approximate a smooth function by a polynomial:
f(x) ≈ Σ_{k=0}^K f^{(k)}(x_0)/k! · (x − x_0)^k
when x is near x0 . So we can even fit a polynomial function locally, which is called local
polynomial regression (Fan and Gijbels, 1996). In the R package KernSmooth, the function
locpoly fits local polynomial regression, and the function dpill selects h based on Ruppert
et al. (1995). The default specification of locpoly is the local linear regression.
> library("KernSmooth")
> n = 500
> x = seq(0, 1, length.out = n)
> fx = sin(8 * x)
> y = fx + rnorm(n, 0, 0.5)
> plot(y ~ x, pch = 19, cex = 0.2, col = "grey", bty = "n",
+      main = "local linear fit", font.main = 1)
> lines(fx ~ x, lwd = 2, col = "grey")
> h = dpill(x, y)
> locp.fit = locpoly(x, y, bandwidth = h)
> lines(locp.fit, lty = 2)
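As a rough sketch of the cross-validation idea for choosing h mentioned above, using the simulated x and y from the code above (the helper functions below are hypothetical and not part of KernSmooth):

# Leave-one-out cross-validation for the bandwidth of a local linear fit
# with a Gaussian kernel.
loclin.predict = function(x0, x, y, h) {
  w   = dnorm((x - x0) / h)
  fit = lm(y ~ I(x - x0), weights = w)
  coef(fit)[1]                      # intercept = prediction at x0
}
cv.error = function(h, x, y) {
  n    = length(x)
  pred = sapply(1:n, function(i) loclin.predict(x[i], x[-i], y[-i], h))
  mean((y - pred)^2)
}
hs = seq(0.01, 0.2, by = 0.01)
hs[which.min(sapply(hs, cv.error, x = x, y = y))]   # CV-selected bandwidth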
If we have the large population with size N, then the ideal OLS estimator of the y_i's on
the x_i's is
β̂_ideal = (Σ_{i=1}^N x_i x_i^t)^{-1} Σ_{i=1}^N x_i y_i.
However, we do not observe all the data points in the large population; instead, we sample
each data point independently with probability
π_i = pr(I_i = 1 | x_i, y_i),
where I_i is the sampling indicator. With the sampled units, we can run WLS with weights
inversely proportional to the sampling probabilities, which yields the estimator
(Σ_{i=1}^N (I_i/π_i) x_i x_i^t)^{-1} Σ_{i=1}^N (I_i/π_i) x_i y_i.
This inverse probability weighting estimator is reasonable because
E( Σ_{i=1}^N (I_i/π_i) x_i x_i^t | X_N, Y_N ) = Σ_{i=1}^N x_i x_i^t,
E( Σ_{i=1}^N (I_i/π_i) x_i y_i | X_N, Y_N ) = Σ_{i=1}^N x_i y_i.
The inverse probability weighting estimators are called the Horvitz–Thompson estimators
(Horvitz and Thompson, 1952), which are the cornerstones of survey sampling.
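A small simulation sketch (hypothetical population) of the inverse probability weighting idea: the WLS with weights 1/π_i on the sampled units approximates the ideal population OLS, while the unweighted fit on the sample can be biased when the sampling depends on the outcome:

set.seed(3)
N  = 1e5
x  = rnorm(N)
y  = 1 + 2 * x + rnorm(N)
pi = plogis(-1 + 0.5 * x + 0.2 * y)        # sampling probabilities depending on (x, y)
I  = rbinom(N, 1, pi)
ideal = lm(y ~ x)                          # ideal OLS on the full population
ipw   = lm(y ~ x, weights = 1 / pi, subset = I == 1)
naive = lm(y ~ x, subset = I == 1)         # unweighted fit on the sampled units
rbind(coef(ideal), coef(ipw), coef(naive))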
Below I use the dataset census00.dta to illustrate the use of sampling weight, which is
the perwt variable. The R code is in code18.4.2.R.
> library(foreign)
> census00 = read.dta("census00.dta")
> ols.fit = lm(logwk ~ age + educ + exper + exper2 + black,
+              data = census00)
> wls.fit = lm(logwk ~ age + educ + exper + exper2 + black,
+              weights = perwt, data = census00)
> compare = cbind(summary(ols.fit)$coef[, 1:3],
+                 summary(wls.fit)$coef[, 1:3])
> round(compare, 4)
            Estimate Std. Error t value Estimate Std. Error t value
(Intercept)   5.1667     0.1282    40.3   5.0740     0.1268    40.0
age          -0.0148     0.0067    -2.2  -0.0084     0.0067    -1.3
educ          0.1296     0.0066    19.7   0.1228     0.0065    18.8
exper2        0.0003     0.0001     2.2   0.0002     0.0001     1.3
black        -0.2467     0.0085   -29.2  -0.2574     0.0080   -32.0
Partition X and Y into blocks X_1, . . . , X_K and Y_1, . . . , Y_K corresponding to Σ in (19.3),
such that X_k ∈ R^{n_k×p} and Y_k ∈ R^{n_k}. Show that the generalized least squares estimator is
β̂_Σ = (Σ_{k=1}^K X_k^t Σ_k^{-1} X_k)^{-1} Σ_{k=1}^K X_k^t Σ_k^{-1} Y_k.
are the weighted averages of the outcome under treatment and control, respectively.
Hint: You can use the result in Problem 19.3.
with ε̂i = yi − xti β̂. If so, give a theoretical justification; if not, give a counterexample.
Evaluate the finite-sample properties of β̂igls using simulated data.
equals the coefficient of X̃w,2 in the WLS fit of Ỹw on X̃w,2 , where X̃w,2 are the residual
vectors from the column-wise WLS of X2 on X1 , and Ỹw is the residual vector from the
WLS of Y on X1 .
where ε̂w , ε̃w , Ûw are the residuals. The last WLS fit means the WLS fit of each column of
X1 on X2 . Similar to Theorem 9.1, we have
recalling that hii is the leverage score and ε̂i is the residual of observation i.
Remark: Based on the above formula, we can compute the derivative of β̂_[−i](w) with respect to w:
∂β̂_[−i](w)/∂w = {1 − (1 − w)h_ii}^{-2} (X^t X)^{-1} x_i ε̂_i,
which reduces to
∂β̂_[−i](0)/∂w = (1 − h_ii)^{-2} (X^t X)^{-1} x_i ε̂_i
at w = 0 and
∂β̂_[−i](1)/∂w = (X^t X)^{-1} x_i ε̂_i
at w = 1.
at w = 1. Pregibon (1981) reviewed related formulas for OLS. Broderick et al. (2020)
discussed related formulas for general statistical models.
H_w = X(X^t W X)^{-1} X^t W. First, show that
W H_w = H_w^t W,  X^t W (I_n − H_w) = 0.
Second, prove an extended version of Theorem 11.1: with x_i = (1, x_i2^t)^t, the (i, i)th
diagonal element of H_w satisfies
h_{w,ii} = w_i / (Σ_{i′=1}^n w_{i′}) · (1 + D_{w,i}²)
where
D_{w,i}² = (x_i2 − x̄_{w,2})^t S_w^{-1} (x_i2 − x̄_{w,2})
with x̄_{w,2} = Σ_{i=1}^n w_i x_i2 / Σ_{i=1}^n w_i being the weighted average of the x_i2's and
S_w = Σ_{i=1}^n w_i (x_i2 − x̄_{w,2})(x_i2 − x̄_{w,2})^t / Σ_{i=1}^n w_i being the corresponding sample covariance matrix.
Remark: Li and Valliant (2009) presented the basic properties of Hw for WLS in the
context of survey data.
Many applications have binary outcomes yi ∈ {0, 1}. This chapter discusses statistical
models of binary outcomes, focusing on the logistic regression.
A binary outcome model specifies pr(y_i = 1 | x_i) = g(x_i^t β), where g(·): R → [0, 1] is a monotone
function, and its inverse is often called the link function. Mathematically, the distribution
function of any continuous random variable is a monotone function that maps from R to [0, 1].
So we have infinitely many choices for g(·). Four canonical choices, "logit", "probit", "cauchit",
and "cloglog", are below, which are the standard options in R:

name      functional form
logit     g(z) = e^z / (1 + e^z)
probit    g(z) = Φ(z), the standard Normal distribution function
cauchit   g(z) = 1/2 + arctan(z)/π, the standard Cauchy distribution function
cloglog   g(z) = 1 − exp(−e^z)
The g(z) for the probit model2 is the distribution function of a standard Normal distribu-
tion. The g(z) for the cauchit model is the distribution function of the standard Cauchy
distribution with density
1
g ′ (z) = .
π(1 + z 2 )
The g(z) for the cloglog model is the distribution function of the standard log-Weibull
distribution with density
g′(z) = exp(z − e^z).
I will give more motivations for the first three link functions in Section 20.7.1 and for the
fourth link function in Problem 22.4.
Figure 20.1 shows the distribution functions and densities of the corresponding link functions. The
distribution functions are quite similar for all links, but the density for cloglog is asymmetric
while the other three densities are symmetric.
This chapter will focus on the logit model, and extensions to other models are concep-
tually straightforward. We can also write the logit model as
pr(y_i = 1 | x_i) ≡ π(x_i, β) = e^{x_i^t β} / (1 + e^{x_i^t β}), (20.2)
for the conditional probability of y_i given x_i, or, equivalently,
logit{pr(y_i = 1 | x_i)} ≡ log [ pr(y_i = 1 | x_i) / {1 − pr(y_i = 1 | x_i)} ] = x_i^t β,
for the log of the odds of y_i given x_i, with the logit function
logit(π) = log{π/(1 − π)}.
1 Berkson (1944) was an early use of the logit model.
2 Bliss (1934) was an early use of the probit model.
[Figure 20.1: the distribution functions g(z) and densities dg(z)/dz of the probit, logit, cloglog, and cauchit links for z ∈ (−5, 5).]
Because yi is a binary random variable, its probability completely determines its distribu-
tion. So we can also write the logit model as
y_i | x_i ∼ Bernoulli( e^{x_i^t β} / (1 + e^{x_i^t β}) ).
Each coefficient β_j measures the impact of x_ij on the log odds of the outcome:
∂ logit{pr(y_i = 1 | x_i)} / ∂x_ij = β_j.
Epidemiologists also call βj the conditional log odds ratio because
β_j = logit{pr(y_i = 1 | . . . , x_ij + 1, . . .)} − logit{pr(y_i = 1 | . . . , x_ij , . . .)}
    = log [ pr(y_i = 1 | . . . , x_ij + 1, . . .) / {1 − pr(y_i = 1 | . . . , x_ij + 1, . . .)} ]
      − log [ pr(y_i = 1 | . . . , x_ij , . . .) / {1 − pr(y_i = 1 | . . . , x_ij , . . .)} ],
that is, the change of the log odds of yi if we increase xij by a unit holding other covariates
unchanged. Qualitatively, if βj > 0, then larger values of xij lead to larger probability of
yi = 1; if βj < 0, then larger values of xij lead to smaller probability of yi = 1.
Assume the y_i's are independent given the x_i's. The likelihood function is
L(β) = Π_{i=1}^n π(x_i, β)^{y_i} {1 − π(x_i, β)}^{1−y_i} = Π_{i=1}^n e^{y_i x_i^t β} / (1 + e^{x_i^t β}).
The log-likelihood function is
log L(β) = Σ_{i=1}^n { y_i x_i^t β − log(1 + e^{x_i^t β}) },
with score function
∂ log L(β)/∂β = Σ_{i=1}^n x_i { y_i − e^{x_i^t β}/(1 + e^{x_i^t β}) }
              = Σ_{i=1}^n x_i { y_i − g(x_i^t β) }
              = Σ_{i=1}^n x_i { y_i − π(x_i, β) },
3 The notation can be confusing because β denotes both the true parameter and the dummy variable for
The Hessian matrix is
∂² log L(β)/∂β∂β^t = −Σ_{i=1}^n π(x_i, β){1 − π(x_i, β)} x_i x_i^t,
so the Hessian matrix is negative semi-definite. If it is negative definite, then the likelihood
function has a unique maximizer.
The maximum likelihood estimate (MLE) must satisfy the following score or Normal equation:
Σ_{i=1}^n x_i { y_i − π(x_i, β̂) } = Σ_{i=1}^n x_i { y_i − e^{x_i^t β̂}/(1 + e^{x_i^t β̂}) } = 0.
If we view π(xi , β̂) as the fitted probability for yi , then yi − π(xi , β̂) is the residual, and the
score equation is similar to that of OLS. Moreover, if xi contains 1, then
Σ_{i=1}^n { y_i − π(x_i, β̂) } = 0 ⟹ n^{-1} Σ_{i=1}^n y_i = n^{-1} Σ_{i=1}^n π(x_i, β̂),
that is, the average of the outcomes equals the average of their fitted values.
However, the score equation is nonlinear, and in general, there is no explicit formula for
the MLE. We usually use Newton’s method to solve for the MLE based on the linearization of
the score equation. Starting from the old value β old , we can approximate the score equation
by a linear equation:
0 = ∂ log L(β)/∂β ≈ ∂ log L(β^old)/∂β + {∂² log L(β^old)/∂β∂β^t}(β − β^old),
and then update
β^new = β^old − {∂² log L(β^old)/∂β∂β^t}^{-1} ∂ log L(β^old)/∂β.
Using the matrix form, we can gain more insight from Newton's method. Recall that
Y = (y_1, . . . , y_n)^t,  X = (x_1, . . . , x_n)^t,
and define
Π^old = (π(x_1, β^old), . . . , π(x_n, β^old))^t,  W^old = diag[ π(x_i, β^old){1 − π(x_i, β^old)} ]_{i=1}^n.
Then
∂ log L(β^old)/∂β = X^t (Y − Π^old),
∂² log L(β^old)/∂β∂β^t = −X^t W^old X,
and Newton's method simplifies to
β^new = β^old + (X^t W^old X)^{-1} X^t (Y − Π^old)
      = (X^t W^old X)^{-1} { X^t W^old X β^old + X^t (Y − Π^old) }
      = (X^t W^old X)^{-1} X^t W^old Z^old,
where
Z^old = X β^old + (W^old)^{-1} (Y − Π^old).
So we can obtain β new based on the WLS fit of Z old on X with weights W old , the diagonal
elements of which are the conditional variances of the yi ’s given the xi ’s at β old . The glm
function in R uses the Fisher scoring algorithm, which is identical to Newton’s method for the
logit model4 . Sometimes, it is also called the iteratively reweighted least squares algorithm.
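A minimal sketch of the iteratively reweighted least squares updates for the logit model (a bare-bones illustration of the algorithm described above, not the glm implementation; the function name is hypothetical):

irls.logit = function(X, y, maxit = 50, tol = 1e-8) {
  beta = rep(0, ncol(X))
  for (it in 1:maxit) {
    pi  = as.vector(plogis(X %*% beta))     # fitted probabilities at the current beta
    w   = pi * (1 - pi)                     # diagonal of W
    z   = drop(X %*% beta) + (y - pi) / w   # working response Z
    new = lm.wfit(X, z, w)$coefficients     # WLS of Z on X with weights W
    if (max(abs(new - beta)) < tol) { beta = new; break }
    beta = new
  }
  beta
}
# e.g., with simulated data:
# X = cbind(1, rnorm(200)); y = rbinom(200, 1, plogis(X %*% c(-1, 2)))
# cbind(irls.logit(X, y), coef(glm(y ~ 0 + X, family = binomial)))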
The MLE is approximately Normal, with estimated covariance matrix (X^t Ŵ X)^{-1}, where
Ŵ = diag[ π(x_i, β̂){1 − π(x_i, β̂)} ]_{i=1}^n.
Based on this, the glm function reports the point estimate, standard error, z-value, and p-
value for each coordinate of β. It is almost identical to the output of the lm function, except
that the interpretation of the coefficient becomes the conditional log odds ratio.
I use the data from Hirano et al. (2000) to illustrate logistic regression, where the main
interest is the effect of the encouragement of receiving the flu shot via email on the binary
indicator of flu-related hospitalization. We can fit a logistic regression using the glm function
in R with family = binomial(link = logit).
> flu = read.table("fludata.txt", header = TRUE)
> flu = within(flu, rm(receive))
> assign.logit = glm(outcome ~ .,
+                    family = binomial(link = logit),
+                    data = flu)
> summary(assign.logit)

Call:
glm(formula = outcome ~ ., family = binomial(link = logit), data = flu)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.1957 -0.4566 -0.3821 -0.3048  2.6450

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.199815   0.408684  -5.383 7.34e-08 ***
assign      -0.197528   0.136235  -1.450  0.14709
age         -0.007986   0.005569  -1.434  0.15154
copd         0.337037   0.153939   2.189  0.02857 *
dm           0.454342   0.143593   3.164  0.00156 **
heartd       0.676190   0.153384   4.408 1.04e-05 ***
race        -0.242949   0.143013  -1.699  0.08936 .
renal        1.519505   0.365973   4.152 3.30e-05 ***
sex         -0.212095   0.144477  -1.468  0.14210
liverd       0.098957   1.084644   0.091  0.92731
Three subtle issues arise in the above code. First, flu = within(flu, rm(receive)) drops
receive, which is the indicator of whether a patient received the flu shot or not. The reason
is that assign is randomly assigned but receive is subject to selection bias, that is, patients
receiving the flu shot can be quite different from patients not receiving the flu shot.
Second, the Null deviance and Residual deviance are defined as −2 log L(β̃) and
−2 log L(β̂), respectively, where β̃ is the MLE assuming that all coefficients except the
intercept are zero, and β̂ is the MLE without any restrictions. They are not of independent
interest, but their difference is: Wilks’ theorem ensures that
{−2 log L(β̃)} − {−2 log L(β̂)} = 2 log{ L(β̂)/L(β̃) } ∼a χ²_{p−1}.
So we can test whether the coefficients of the covariates are all zero, which is analogous to
the joint F test in linear models.
> pchisq(assign.logit$null.deviance - assign.logit$deviance,
+        df = assign.logit$df.null - assign.logit$df.residual,
+        lower.tail = FALSE)
[1] 1.912952e-11
Third, the AIC is defined as −2 log L(β̂) + 2p, where p is the number of parameters in
the logit model. This is also the general formula of AIC for other parametric models.
20.3.2 Prediction
The logit model is often used for prediction or classification since the outcome is binary.
With the MLE β̂, we can predict the probability of being one as π̂n+1 = g(xtn+1 β̂) for a unit
with covariate value xn+1 , and we can easily dichotomize the fitted probability to predict
the outcome itself by ŷn+1 = 1(π̂n+1 ≥ c), for example, with c = 0.5.
We can even quantify the uncertainty in the fitted probability based on a linear approx-
imation (i.e., the delta method). Based on
We can use the predict function in R to calculate the predicted values based on a glm
object in the same way as the linear model. If we specify type="response", then we obtain the
fitted probabilities; if we specify se.fit = TRUE, then we also obtain the standard errors of the
fitted probabilities. In the following, I predict the probabilities of flu-related hospitalization
if a patient receives the email encouragement or not, fixing other covariates at their empirical
means.
> emp.mean = apply(flu, 2, mean)
> data.ave = rbind(emp.mean, emp.mean)
> data.ave[1, 1] = 1
> data.ave[2, 1] = 0
> data.ave = data.frame(data.ave)
> predict(assign.logit, newdata = data.ave,
+         type = "response", se.fit = TRUE)
$fit
  emp.mean emp.mean.1
0.06981828 0.08378818

$se.fit
   emp.mean  emp.mean.1
0.006689665 0.007526307
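As a hedged check of the delta-method calculation behind se.fit: on the response scale, the standard error of the fitted probability equals g′(x^t β̂) = π̂(1 − π̂) times the standard error of the linear predictor. Using the assign.logit fit and data.ave from above:

# Delta-method check: transform the link-scale standard errors.
link = predict(assign.logit, newdata = data.ave, type = "link", se.fit = TRUE)
p = plogis(link$fit)
p * (1 - p) * link$se.fit   # matches the se.fit reported on the response scale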
The margins function in the margins package can compute the average marginal effects and
the corresponding standard errors. In particular, the average marginal effect of assign is not
significant as shown below. The R code in this section is in code19.3.R.
> library("margins")
> ape = margins(assign.logit)
> summary(ape)
 factor     AME     SE      z      p  lower  upper
The interaction term is much more complicated, and contradictory suggestions are given
across fields. Consider the following model:
pr(y_i = 1 | x_i1, x_i2) = g(β_0 + β_1 x_i1 + β_2 x_i2 + β_12 x_i1 x_i2).
If the link is logit, then epidemiologists interpret e^{β_12} as the interaction between x_i1 and
x_i2 on the odds ratio scale. Consider a simple case with binary x_i1 and x_i2. Given x_i2 = 1,
the odds ratio of x_i1 on y_i equals e^{β_1 + β_12}; given x_i2 = 0, the odds ratio of x_i1 on y_i equals
e^{β_1}. Therefore, the ratio of the two odds ratios equals e^{β_12}. When we measure effects on the
odds ratio scale, the logistic model is a natural choice, and the interaction term in the logistic
model indeed measures the interaction of x_i1 and x_i2.
Ai and Norton (2003) gave a different suggestion. Define zi = β0 + β1 xi1 + β2 xi2 +
β12 xi1 xi2 . We have two ways to define the interaction effect: first,
n^{-1} Σ_{i=1}^n ∂pr(y_i = 1 | x_i1, x_i2)/∂(x_i1 x_i2) = n^{-1} Σ_{i=1}^n g′(z_i) β_12;
second,
n^{-1} Σ_{i=1}^n ∂²pr(y_i = 1 | x_i1, x_i2)/(∂x_i1 ∂x_i2)
= n^{-1} Σ_{i=1}^n ∂/∂x_i2 { ∂pr(y_i = 1 | x_i1, x_i2)/∂x_i1 }
= n^{-1} Σ_{i=1}^n ∂/∂x_i2 { g′(z_i)(β_1 + β_12 x_i2) }
= n^{-1} Σ_{i=1}^n { g′′(z_i)(β_2 + β_12 x_i1)(β_1 + β_12 x_i2) + g′(z_i) β_12 }.
Although the first one is more straightforward based on the definition of the average partial
effect, the second one is more reasonable based on the natural definition of interaction based
on the mixed derivative. Note that even if β12 = 0, the second definition of interaction does
not necessarily equal 0 since
n^{-1} Σ_{i=1}^n ∂²pr(y_i = 1 | x_i1, x_i2)/(∂x_i1 ∂x_i2) = n^{-1} Σ_{i=1}^n g′′(z_i) β_1 β_2.
This is due to the nonlinearity of the link function. The second definition quantifies inter-
action based on the probability itself while the parameters in the logistic model measure
the odds ratio. This combination of model and parameter does not seem a natural choice.
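Both definitions can be computed directly from a fitted logit model. Below is a sketch on simulated data (all names hypothetical), using g′(z) = g(z){1 − g(z)} and g′′(z) = g′(z){1 − 2g(z)} for the logistic link:

set.seed(0)
n  = 1000
x1 = rbinom(n, 1, 0.5); x2 = rnorm(n)
y  = rbinom(n, 1, plogis(-0.5 + x1 - 0.5 * x2 + 0.7 * x1 * x2))
fit = glm(y ~ x1 * x2, family = binomial)
b  = coef(fit); z = predict(fit, type = "link")
g1 = dlogis(z)                       # g'(z)
g2 = dlogis(z) * (1 - 2 * plogis(z)) # g''(z)
mean(g1 * b["x1:x2"])                # first definition
mean(g2 * (b["x2"] + b["x1:x2"] * x1) * (b["x1"] + b["x1:x2"] * x2) +
     g1 * b["x1:x2"])                # second definition (mixed derivative)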
Then I fit the data with the linear probability model and binary models with four link
functions.
> lpmfit = lm(y ~ x)
> probitfit = glm(y ~ x, family = binomial(link = "probit"))
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> logitfit = glm(y ~ x, family = binomial(link = "logit"))
> cloglogfit = glm(y ~ x, family = binomial(link = "cloglog"))
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> cauchitfit = glm(y ~ x, family = binomial(link = "cauchit"))
The coefficients are quite different because they measure the association between x and y
on different scales. These parameters are not directly comparable.
> betacoef = c(lpmfit$coef[2],
+              probitfit$coef[2],
+              logitfit$coef[2],
+              cloglogfit$coef[2],
+              cauchitfit$coef[2])
> names(betacoef) = c("lpm", "probit", "logit", "cloglog", "cauchit")
> round(betacoef, 2)
    lpm  probit   logit cloglog cauchit
  -0.10   -0.83   -1.47   -1.07   -2.09
However, if we care only about the prediction, then these five models give very similar
results.
> table(y, lpmfit$fitted.values > 0.5)
y   FALSE TRUE
  0    31    9
  1     5   55
> table(y, probitfit$fitted.values > 0.5)
y   FALSE TRUE
  0    31    9
  1     5   55
> table(y, logitfit$fitted.values > 0.5)
y   FALSE TRUE
  0    31    9
  1     5   55
> table(y, cloglogfit$fitted.values > 0.5)
y   FALSE TRUE
  0    34    6
  1     7   53
> table(y, cauchitfit$fitted.values > 0.5)
y   FALSE TRUE
  0    34    6
  1     7   53
[Five-panel scatter plot of fitted probability (vertical axis) against true probability (horizontal axis), one panel per model.]
FIGURE 20.2: Comparing the fitted probabilities from different link functions
Figure 20.2 shows the fitted probabilities versus the true probabilities pr(yi = 1 | xi ). The
patterns are quite similar although the linear probability model can give fitted probabilities
outside [0, 1]. When we use the cutoff point 0.5 to predict the binary outcome, the problem
of the linear probability model becomes rather minor.
An interesting fact is that the coefficients from the logit model approximately equal
those from the probit model multiplied by 1.7, a constant that minimizes maxy |glogit (by) −
gprobit (y)|. We can easily compute this constant numerically:
> d.logit.probit = function(b){
+   x = seq(-20, 20, 0.00001)
+   max(abs(plogis(b * x) - pnorm(x)))
+ }
>
> optimize(d.logit.probit, c(-10, 10))
$minimum
[1] 1.701743
$objective
[1] 0.009457425
Based on the above calculation, the maximum difference is approximately 0.009. Therefore,
the logit and probit link functions are extremely close up to the scaling factor 1.7. However,
minb maxy |glogit (by) − g∗ (y)| is much larger for the link functions of cauchit and cloglog.
where
ℓi (β) = yi (β0 + β1 xi1 + · · · + βp xip ) − log(1 + eβ0 +β1 xi1 +···+βp xip )
is the log-likelihood function. When α = 1, it gives the ridge analog of the logistic regression;
when α = 0, it gives the lasso analog; when α ∈ (0, 1), it gives the elastic net analog. The R
package glmpath uses the coordinate descent algorithm based on a quadratic approximation of
the log-likelihood function. We can select the tuning parameter λ based on cross-validation.
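As a hedged sketch, the glmnet package is a widely used alternative implementation of penalized logistic regression with cross-validation; note that in glmnet's own parametrization alpha = 1 corresponds to the lasso penalty and alpha = 0 to ridge, which may be labeled differently from the convention above. The data below are simulated for illustration:

library(glmnet)
set.seed(6)
n = 200; p = 10
X = matrix(rnorm(n * p), n, p)
y = rbinom(n, 1, plogis(X[, 1] - X[, 2]))
cvfit = cv.glmnet(X, y, family = "binomial", alpha = 1)  # lasso-penalized logit with CV
coef(cvfit, s = "lambda.min")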
□
Theorem 20.1 ensures that conditioning on si = 1, the model of yi given xi is still logit
with the intercept changing from β0 to β0 + log(p1 /p0 ). Although we cannot consistently
estimate the intercept without knowing (p1 , p0 ), we can still estimate all the slopes. Kagan
(2001) showed that the logistic link is the only one that enjoys this property.
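A simulation sketch of this property (hypothetical parameter values): sampling units with outcome-dependent probabilities p1 and p0 leaves the slope essentially unchanged and shifts the intercept by log(p1/p0).

set.seed(1)
n  = 1e5
x  = rnorm(n)
y  = rbinom(n, 1, plogis(-2 + x))
p1 = 0.9; p0 = 0.1                        # sampling probabilities given y = 1 and y = 0
s  = rbinom(n, 1, ifelse(y == 1, p1, p0)) # case-control sampling indicator
fit = glm(y ~ x, family = binomial, subset = s == 1)
coef(fit)                                 # slope close to 1; intercept close to -2 + log(p1/p0)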
Samarani et al. (2019) hypothesized that variation in the inherited activating Killer-cell
Immunoglobulin-like Receptor genes in humans is associated with their innate
susceptibility/resistance to developing Crohn's disease. They used a case-control study from three cities
(Manitoba, Montreal, and Ottawa) in Canada to investigate the potential association.
> dat = read.csv("samarani.csv")
> pool.glm = glm(case_comb ~ ds1 + ds2 + ds3 + ds4_a +
+                ds4_b + ds5 + ds1_3 + center,
+                family = binomial(link = logit),
+                data = dat)
> summary(pool.glm)

Call:
glm(formula = case_comb ~ ds1 + ds2 + ds3 + ds4_a + ds4_b + ds5 +
    ds1_3 + center, family = binomial(link = logit), data = dat)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.9982 -0.9274 -0.5291  1.0113  2.2289

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    -2.39681    0.21768 -11.011  < 2e-16 ***
ds1             0.55945    0.14437   3.875 0.000107 ***
ds2             0.42531    0.14758   2.882 0.003954 **
ds3             0.81377    0.14503   5.611 2.01e-08 ***
ds4_a           0.30270    0.30679   0.987 0.323802
ds4_b           0.29199    0.17726   1.647 0.099511 .
ds5             0.92049    0.14852   6.198 5.72e-10 ***
ds1_3           0.49982    0.14706   3.399 0.000677 ***
centerMontreal -0.05816    0.15889  -0.366 0.714316
centerOttawa    0.14164    0.20251   0.699 0.484292
Assume the binary outcome is generated by a latent linear model, y_i = 1(y_i* ≥ 0) with
y_i* = x_i^t β + ε_i,
where −ε_i has distribution function g(·) and is independent of x_i. From this latent linear
model, we can verify that
model, we can verify that
pr(yi = 1 | xi ) = pr(yi∗ ≥ 0 | xi )
= pr(xti β + εi ≥ 0 | xi )
= pr(−εi ≤ xti β | xi )
= g(xti β).
So the g(·) function can be interpreted as the distribution function of the error term in the
latent linear model.
This latent variable formulation provides another way to interpret the coefficients in the
models for binary data. It is a powerful way to generate models for more complex data. We
will see another example in the next chapter.
y_i ∼ Bernoulli(q), (20.5)
and
x_i | y_i = 1 ∼ N(µ_1, Σ),  x_i | y_i = 0 ∼ N(µ_0, Σ),
where x_i does not contain 1. This is called the linear discriminant model. We can verify
that y_i | x_i follows a logit model as shown in the theorem below.
Theorem 20.2 Under the linear discriminant model above, we have
logit{pr(y_i = 1 | x_i)} = α + x_i^t β,
where
α = log{q/(1 − q)} − (1/2)(µ_1^t Σ^{-1} µ_1 − µ_0^t Σ^{-1} µ_0),  β = Σ^{-1}(µ_1 − µ_0). (20.7)
Proof of Theorem 20.2: Using Bayes' formula, we have
pr(y_i = 1 | x_i) = pr(y_i = 1, x_i) / pr(x_i)
                  = pr(y_i = 1) pr(x_i | y_i = 1) / { pr(y_i = 1) pr(x_i | y_i = 1) + pr(y_i = 0) pr(x_i | y_i = 0) }
                  = e^∆ / (1 + e^∆),
where
∆ = log [ pr(y_i = 1) pr(x_i | y_i = 1) / { pr(y_i = 0) pr(x_i | y_i = 0) } ]
  = log [ q {(2π)^p det(Σ)}^{-1/2} exp{−(x_i − µ_1)^t Σ^{-1} (x_i − µ_1)/2} / [ (1 − q) {(2π)^p det(Σ)}^{-1/2} exp{−(x_i − µ_0)^t Σ^{-1} (x_i − µ_0)/2} ] ]
  = log [ q exp{−(x_i − µ_1)^t Σ^{-1} (x_i − µ_1)/2} / [ (1 − q) exp{−(x_i − µ_0)^t Σ^{-1} (x_i − µ_0)/2} ] ]
  = log [ q exp{−(−2 x_i^t Σ^{-1} µ_1 + µ_1^t Σ^{-1} µ_1)/2} / [ (1 − q) exp{−(−2 x_i^t Σ^{-1} µ_0 + µ_0^t Σ^{-1} µ_0)/2} ] ]
  = log{q/(1 − q)} − (1/2)(µ_1^t Σ^{-1} µ_1 − µ_0^t Σ^{-1} µ_0) + x_i^t Σ^{-1}(µ_1 − µ_0).
□
We can estimate q by the sample proportion of units with y_i = 1, and estimate µ_1 and µ_0 by
µ̂_1 and µ̂_0, the sample means of the x_i's for units with y_i = 1 and y_i = 0, respectively. The moment
estimator for Σ is
Σ̂ = { Σ_{i=1}^n y_i (x_i − µ̂_1)(x_i − µ̂_1)^t + Σ_{i=1}^n (1 − y_i)(x_i − µ̂_0)(x_i − µ̂_0)^t } / (n − 2),
the pooled covariance matrix, after centering the xi ’s by the y-specific means. Based on
Theorem 20.2, we can obtain estimates α̂ and β̂ by replacing the true parameters with their
moment estimators. This gives us another way to fit the logistic model.
Efron (1975) compared the above moment estimator and the MLE under the logistic
model. Since the linear discriminant model imposes stronger assumptions, the estimator
based on Theorem 20.2 is more efficient. In contrast, the MLE of the logistic model is more
robust because it does not impose the Normality assumption on xi .
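A sketch of this moment-based fit on simulated data (hypothetical parameter values, with Σ = I), compared with the direct MLE from glm:

set.seed(5)
n = 2000
y = rbinom(n, 1, 0.4)
x = matrix(rnorm(2 * n), n, 2)
x[, 1] = x[, 1] + y                 # mu1 = (1, 0), mu0 = (0, 0), Sigma = I
q.hat   = mean(y)
mu1.hat = colMeans(x[y == 1, ])
mu0.hat = colMeans(x[y == 0, ])
S.hat   = (crossprod(sweep(x[y == 1, ], 2, mu1.hat)) +
           crossprod(sweep(x[y == 0, ], 2, mu0.hat))) / (n - 2)
beta.hat  = solve(S.hat, mu1.hat - mu0.hat)
alpha.hat = log(q.hat / (1 - q.hat)) -
  0.5 * (sum(mu1.hat * solve(S.hat, mu1.hat)) - sum(mu0.hat * solve(S.hat, mu0.hat)))
rbind(moment = c(alpha.hat, beta.hat),
      mle    = coef(glm(y ~ x, family = binomial)))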
Given data (xi , zi , yi )ni=1 where xi denotes the covariates, zi ∈ {1, 0} denotes the bi-
nary group indicator, and yi denotes the binary outcome. We can fit two separate logistic
regressions:
logit{pr(yi = 1 | zi = 1, xi )} = γ1 + xti β1
and
logit{pr(yi = 1 | zi = 0, xi )} = γ0 + xti β0
with the treated and control data, respectively. We can also fit a joint logistic regression using the pooled data:
logit{pr(y_i = 1 | z_i, x_i)} = α_0 + α_z z_i + α_x^t x_i + α_zx^t (z_i x_i).
Let the parameters with hats denote the MLEs, for example, γ̂_1 is the MLE for γ_1. Find
(α̂_0, α̂_z, α̂_x, α̂_zx) in terms of (γ̂_1, β̂_1, γ̂_0, β̂_0).
logit{pr(y_i = 1 | x_i)} = α + x_i^t β + x_i^t Λ x_i,
where
α = log{q/(1 − q)} − (1/2) log{det(Σ_1)/det(Σ_0)} − (1/2)(µ_1^t Σ_1^{-1} µ_1 − µ_0^t Σ_0^{-1} µ_0),
β = Σ_1^{-1} µ_1 − Σ_0^{-1} µ_0,
Λ = −(1/2)(Σ_1^{-1} − Σ_0^{-1}).
Remark: This problem extends the linear discriminant model in Section 20.7.2 to the
quadratic discriminant model by allowing for heteroskedasticity in the conditional Normality
of x given y. It implies the logistic model with the linear, quadratic, and interaction terms
of the basic covariates.
The fitted values are π̂i = π(xi , β̂) in the logistic model, which have mean ȳ with the
intercept included in the model. Analogously, we can define R2 in the logistic model as
R²_model = ssm/sst,  R²_residual = 1 − ssr/sst,  R²_correlation = ρ̂²_{yπ̂} = C²_{yπ̂}/(sst · ssm),
where
sst = Σ_{i=1}^n (y_i − ȳ)²,  ssm = Σ_{i=1}^n (π̂_i − ȳ)²,  ssr = Σ_{i=1}^n (y_i − π̂_i)²,  C_{yπ̂} = Σ_{i=1}^n (y_i − ȳ)(π̂_i − ȳ).
These three definitions are not equivalent in general. In particular, R²_model differs from R²_residual since
sst = ssm + ssr + 2C_{ε̂π̂},
where
C_{ε̂π̂} = Σ_{i=1}^n (y_i − π̂_i)(π̂_i − ȳ).
1. Prove that R²_model ≥ 0, R²_correlation ≥ 0 with equality holding if π̂_i = ȳ for all i. Prove
that R²_model ≤ 1, R²_residual ≤ 1, R²_correlation ≤ 1 with equality holding if y_i = π̂_i for all i.
Note that R²_residual may be negative. Give an example.
2. Define
π̂̄_1 = Σ_{i=1}^n y_i π̂_i / Σ_{i=1}^n y_i,  π̂̄_0 = Σ_{i=1}^n (1 − y_i) π̂_i / Σ_{i=1}^n (1 − y_i)
as the average of the fitted values for units with y_i = 1 and y_i = 0, respectively. Define
D = π̂̄_1 − π̂̄_0.
Prove that
D = (R²_model + R²_residual)/2 = sqrt(R²_model · R²_correlation).
Further prove that D ≥ 0 with equality holding if π̂_i = ȳ for all i, and D ≤ 1 with
equality holding if y_i = π̂_i for all i.
3. McFadden (1974) defined the following R²:
R²_mcfadden = 1 − log L(β̂) / log L(β̃),
recalling that β̃ is the MLE assuming that all coefficients except the intercept are zero,
and β̂ is the MLE without any restrictions. This R² must lie within [0, 1]. Verify that
under the Normal linear model, the above formula does not reduce to the usual R².
4. Another definition, often denoted by R²_cs, is
R²_cs = 1 − {L(β̃)/L(β̂)}^{2/n}.
Verify that under the Normal linear model, this formula reduces to the usual R².
Remark: Tjur (2009) gave an excellent discussion of R²_model, R²_residual, R²_correlation and D.
Nagelkerke (1991) pointed out that the upper bound of this R²_cs is 1 − {L(β̃)}^{2/n} < 1 and
proposed to modify it as
R²_nagelkerke = [ 1 − {L(β̃)/L(β̂)}^{2/n} ] / [ 1 − {L(β̃)}^{2/n} ]
to ensure that its upper bound is 1. However, this modification seems purely ad hoc. Although
D is an appealing definition of R² for the logistic model, it does not generalize to other models.
Overall, I feel that R²_correlation is a better definition that easily generalizes to other models.
Zhang (2017) defined an R² based on the variance function of the outcome for generalized
linear models including the binary logistic model. Hu et al. (2006) studied the asymptotic
properties of some of the R²s above.
21
Logistic Regressions for Categorical Outcomes
Categorical outcomes are common in empirical research. The first type of categorical out-
come is nominal. For example, the outcome denotes the preference for fruits (apple, orange,
and pear) or transportation services (Uber, Lyft, or BART). The second type of categorical
outcome is ordinal. For example, the outcome denotes the course evaluation at Berkeley
(1, 2, . . . , 7) or Amazon review (1 to 5 stars). This chapter discusses statistical modeling
strategies for categorical outcomes, including two classes of models corresponding to the
nominal and ordinal outcomes, respectively.
Proof of Proposition 21.1: Without loss of generality, I will calculate the (1, 1)th and
the (1, 2)th element of the matrix. First, 1(y = 1) is Bernoulli with probability π1 , so the
(1, 1)th element equals var(1(y = 1)) = π1 (1 − π1 ). Similarly, the (2, 2)th element equals
var(1(y = 2)) = π2 (1 − π2 ).
Second, 1(y = 1) + 1(y = 2) is Bernoulli with probability π_1 + π_2, so var(1(y = 1) + 1(y = 2)) = (π_1 + π_2)(1 − π_1 − π_2). Therefore, the (1, 2)th element equals
cov(1(y = 1), 1(y = 2)) = [ (π_1 + π_2)(1 − π_1 − π_2) − π_1(1 − π_1) − π_2(1 − π_2) ] / 2 = −π_1 π_2. □
Here πk (xi ) is a general function of xi . The remaining parts of this chapter will discuss the
canonical choices of πk (xi ) for nominal and ordinal outcomes.
For a nominal outcome, the canonical choice models
π_k(x_i) = pr(y_i = k | x_i) = e^{x_i^t β_k} / Σ_{l=1}^K e^{x_i^t β_l},  (k = 1, . . . , K), (21.3)
where β = (β_1, . . . , β_{K−1}) denotes the parameter with β_K = 0 for the reference category.
From the ratio form of (21.3), we can only identify βk − βK for all k = 1, . . . , K. So for
convenience, we impose the restriction βK = 0. Model (21.3) is called the multinomial
logistic regression model.
¹An alternative strategy is to model 1(y_i = k) | x_i for each k. The advantage of this strategy is that
it reduces to binary logistic models. The disadvantage of this strategy is that it does not model the whole
distribution of y_i and can lose efficiency in estimation.
Similar to the binary logistic regression model, we can interpret the coefficients as the
conditional log odds ratio compared to the reference level:
β_{k,j} = log{ π_k(. . . , x_ij + 1, . . .) / π_K(. . . , x_ij + 1, . . .) } − log{ π_k(. . . , x_ij , . . .) / π_K(. . . , x_ij , . . .) }.
21.2.2 MLE
The likelihood function for the multinomial logistic model is
L(β) = Π_{i=1}^n Π_{k=1}^K {π_k(x_i)}^{1(y_i = k)}
     = Π_{i=1}^n Π_{k=1}^K { e^{x_i^t β_k} / Σ_{l=1}^K e^{x_i^t β_l} }^{1(y_i = k)}
     = Π_{i=1}^n [ { Π_{k=1}^K e^{1(y_i = k) x_i^t β_k} } / Σ_{k=1}^K e^{x_i^t β_k} ],
with score function
∂ log L(β)/∂β_k = Σ_{i=1}^n x_i { 1(y_i = k) − e^{x_i^t β_k} / Σ_{l=1}^K e^{x_i^t β_l} }
                = Σ_{i=1}^n x_i { 1(y_i = k) − π_k(x_i, β) } ∈ R^p,  (k = 1, . . . , K − 1),
and Hessian block
∂² log L(β)/∂β_k ∂β_k^t = −Σ_{i=1}^n x_i x_i^t { e^{x_i^t β_k} Σ_{l=1}^K e^{x_i^t β_l} − e^{x_i^t β_k} e^{x_i^t β_k} } / ( Σ_{l=1}^K e^{x_i^t β_l} )²
                        = −Σ_{i=1}^n π_k(x_i, β){1 − π_k(x_i, β)} x_i x_i^t ∈ R^{p×p},  (k = 1, . . . , K − 1).
We can verify that the Hessian matrix is negative semi-definite based on Proposition 21.1,
which is left as Problem 21.2.
In R, the function multinom in the nnet package uses Newton’s method to fit the multino-
mial logistic model. We can make inference about the parameters based on the asymptotic
Normality of the MLE. Based on a new observation with covariate xn+1 , we can make
prediction based on the fitted probabilities πk (xn+1 , β̂), and furthermore classify it into K
categories based on
ŷ_{n+1} = arg max_{1≤k≤K} π_k(x_{n+1}, β̂).
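A small sketch (simulated data, hypothetical coefficients) of fitting and classifying with nnet::multinom, as described above:

library(nnet)
set.seed(4)
n = 500
x = rnorm(n)
p = cbind(1, exp(1 + x), exp(-1 + 0.5 * x))        # unnormalized probabilities, K = 3
y = apply(p, 1, function(pi) sample(1:3, 1, prob = pi))
fit = multinom(factor(y) ~ x, trace = FALSE)
head(predict(fit, type = "probs"))                 # fitted pi_k(x)
head(predict(fit))                                 # classification by the largest probability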
We can show that this latent variable model implies (21.3). This follows from the lemma
below, which is due to McFadden (1974)3 . When K = 2, it also gives another latent variable
representation for the binary logistic regression, which is different from the one in Section
20.7.1.
y = arg max_{1≤l≤K} U_l
where the last line follows from conditioning on ε_k. By independence of the ε's, we have
pr(y = k) = ∫_{−∞}^{∞} Π_{l≠k} pr(ε_l < V_k − V_l + z) f(z) dz
          = ∫_{−∞}^{∞} Π_{l≠k} exp(−e^{−V_k+V_l−z}) exp(−e^{−z}) e^{−z} dz
          = ∫_{−∞}^{∞} exp( −Σ_{l≠k} e^{−V_k+V_l} e^{−z} ) exp(−e^{−z}) e^{−z} dz
          = ∫_{−∞}^{∞} exp( −C_k e^{−z} ) e^{−z} dz,
where
C_k = 1 + Σ_{l≠k} e^{−V_k+V_l}.
3 Daniel McFadden shared the 2000 Nobel Memorial Prize in Economic Sciences with James Heckman.
The integral simplifies to 1/C_k due to the density of the exponential distribution. Therefore,
pr(y = k) = 1 / ( 1 + Σ_{l≠k} e^{−V_k+V_l} )
          = e^{V_k} / ( e^{V_k} + Σ_{l≠k} e^{V_l} )
          = e^{V_k} / Σ_{l=1}^K e^{V_l}.
□
This lemma is remarkable. It extends to more complex utility functions. I will use it
again in Section 21.6.
Assume a latent linear model y_i* = x_i^t β + ε_i, and define the ordinal outcome by y_i = k if α_{k−1} < y_i* ≤ α_k, where
−∞ = α_0 < α_1 < · · · < α_{K−1} < α_K = ∞.
Figure 21.1 illustrates the data generating process with K = 4.
The unknown parameters are (β, α1 , . . . , αK−1 ). The distribution of the latent error
term g(·) is known, for example, it can be logistic or Normal. The former results in the
proportional odds logistic model, and the latter results in the ordered Probit model. Based
on the latent linear model, we can compute
pr(yi ≤ k | xi ) = pr(yi∗ ≤ αk | xi )
= pr(xti β + εi ≤ αk | xi )
= pr(εi ≤ αk − xti β | xi )
= g(αk − xti β).
I will focus on the proportional odds logistic model in the main text and defer the details
for the ordered Probit model to Problem 21.5. With this model, we have
pr(y_i ≤ k | x_i) = e^{α_k − x_i^t β} / (1 + e^{α_k − x_i^t β}),
[Figure 21.1: the density of the latent variable y* = x^t β + ε, partitioned by the cutoffs α_0 = −∞ < α_1 < α_2 < α_3 < α_4 = ∞ into the ordinal outcome categories 1, 2, 3, 4.]
or
logit{pr(y_i ≤ k | x_i)} = log{ pr(y_i ≤ k | x_i) / pr(y_i > k | x_i) } = α_k − x_i^t β. (21.7)
The model has the "proportional odds" property because the difference in the log odds of {y_i ≤ k} across two covariate values does not depend on k:
logit{pr(y_i ≤ k | x_i)} − logit{pr(y_i ≤ k | x_i′)} = −(x_i − x_i′)^t β.
The likelihood function is
L(β, α_1, . . . , α_{K−1}) = Π_{i=1}^n Π_{k=1}^K { g(α_k − x_i^t β) − g(α_{k−1} − x_i^t β) }^{1(y_i = k)}.
The log-likelihood function is concave (Pratt, 1981; Burridge, 1981), and it is strictly
concave in most cases. The function polr in the R package MASS computes the MLE of the
proportional odds model using the BFGS algorithm. It uses the explicit formulas of the
gradient of the log-likelihood function, and computes the Hessian matrix numerically. I
relegate the formulas of the gradient as a homework problem. For more details of the Hessian
matrix, see Agresti (2010), which is a textbook discussion on modeling ordinal data.
Call:
glm(formula = highdiag ~ age + rural + male, family = binomial(link = "logit"),
    data = karolinska)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-2.06147 -0.98645 -0.05759  1.01391  1.75696

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.46604    1.14545   3.026 0.002479 **
age         -0.03124    0.01481  -2.110 0.034854 *
rural       -1.26322    0.34530  -3.658 0.000254 ***
male        -0.97524    0.41303  -2.361 0.018216 *
Call:
glm(formula = hightreat ~ age + rural + male, family = binomial(link = "logit"),
    data = karolinska)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.2912 -0.9978  0.5387  0.8408  1.4810

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.44683    1.49544   4.311 1.63e-05 ***
age         -0.06297    0.01890  -3.332 0.000862 ***
rural       -1.28777    0.39572  -3.254 0.001137 **
male        -0.74856    0.45285  -1.653 0.098329 .
Both treatments are associated with the covariates. hightreat is more strongly associated
with age. Rubin (2008) argued that highdiag is more random than hightreat, and may have
weaker association with other hidden covariates. For each model below, I fit the data twice
corresponding to two choices of treatment. Overall, we should trust the results with highdiag
more based on Rubin (2008)’s argument.
Call:
glm(formula = loneyear ~ highdiag + age + rural + male, family = binomial(link = "logit"),
    data = karolinska)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.1755 -0.9936 -0.7739  1.3024  1.8557

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.22919    1.15545  -1.064   0.2874
highdiag     0.13684    0.36586   0.374   0.7084
age         -0.00389    0.01411  -0.276   0.7829
rural        0.33360    0.35798   0.932   0.3514
male         0.86706    0.44034   1.969   0.0489 *
Call:
glm(formula = loneyear ~ hightreat + age + rural + male, family = binomial(link = "logit"),
    data = karolinska)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.3767 -0.9683 -0.6784  1.0813  2.0833

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.353977   1.317942  -2.545  0.01093 *
hightreat    1.417458   0.455603   3.111  0.00186 **
age          0.008725   0.014840   0.588  0.55655
rural        0.633278   0.368525   1.718  0.08572 .
male         1.079973   0.452191   2.388  0.01693 *
Coefficients:
    (Intercept)    highdiag          age     rural      male
2-4   -1.075818 -0.06973187 -0.004624030 0.1744256 0.5028786
5+    -4.180416  0.64036289 -0.001846453 0.7365111 2.1628717

Std. Errors:
    (Intercept)  highdiag        age     rural      male
2-4    1.286987 0.4113006 0.01596377 0.4014718 0.4716831
5+     2.003581 0.5816365 0.02148936 0.5741017 1.0741239

Residual Deviance: 268.2616
AIC: 288.2616

> predict(yearmultinom, type = "probs")[1:5, ]
          1       2-4         5+
1 0.5950631 0.2647047 0.14023222
2 0.5941802 0.2655369 0.14028293
3 0.8081376 0.1718963 0.01996613
4 0.5950631 0.2647047 0.14023222
5 0.6366929 0.2260086 0.13729849
highdiag is not significant above. The predict function gives the fitted probabilities for
all categories of the outcome.
> yearmultinom = multinom(survival ~ hightreat + age + rural + male,
+                         data = karolinska)
# weights:  18 (10 variable)
initial  value 173.580742
iter  10 value 129.548642
final  value 129.283739
converged
> summary(yearmultinom)
Call:
multinom(formula = survival ~ hightreat + age + rural + male,
    data = karolinska)

Coefficients:
    (Intercept) hightreat age rural male

Std. Errors:
    (Intercept) hightreat        age     rural      male
2-4    1.463258 0.5141127 0.01660648 0.4085976 0.4806953
5+     2.190305 0.7320788 0.02244867 0.5645595 1.0739669

Residual Deviance: 258.5675
AIC: 278.5675
Coefficients:
             Value Std. Error t value
highdiag  0.216755    0.35892  0.6039
age      -0.002881    0.01378 -0.2091
rural     0.371898    0.35313  1.0532
male      0.943955    0.43588  2.1656

Intercepts:
        Value  Std. Error t value
1|2-4   1.4079 1.1309     1.2450
2-4|5+  2.9284 1.1514     2.5434

Residual Deviance: 271.0778
AIC: 283.0778
> predict(yearpo, type = "probs")[1:5, ]
          1       2-4         5+
1 0.5862465 0.2800892 0.13366427
2 0.5855475 0.2804542 0.13399823
3 0.8087341 0.1421065 0.04915948
4 0.5862465 0.2800892 0.13366427
5 0.6205983 0.2615112 0.11789050
highdiag is not significant above. The predict function gives the fitted probabilities of
three categories.
> yearpo = polr(survival ~ hightreat + age + rural + male,
+               Hess = TRUE,
+               data = karolinska)
> summary(yearpo)
Call:

Coefficients:
             Value Std. Error t value
hightreat 1.399538    0.44518  3.1438
age       0.008032    0.01438  0.5584
rural     0.638862    0.35450  1.8022
male      1.122698    0.44377  2.5299

Intercepts:
        Value  Std. Error t value
1|2-4   3.3273 1.2752     2.6092
2-4|5+  4.9258 1.3106     3.7583

Residual Deviance: 260.2831
AIC: 272.2831
Model (21.8) seems rather similar to model (21.3). However, there are many subtle
differences. First, a component of zik may vary only with choice k, for example, it can
represent the price of choice k. Partition zik into three types of covariates: xi that only vary
cross individuals, ck that only vary across choices, and wik that vary across both individuals
and choices. Model (21.8) reduces to
π_k(z_i) = e^{x_i^t θ_x + c_k^t θ_c + w_ik^t θ_w} / Σ_{l=1}^K e^{x_i^t θ_x + c_l^t θ_c + w_il^t θ_w} = e^{c_k^t θ_c + w_ik^t θ_w} / Σ_{l=1}^K e^{c_l^t θ_c + w_il^t θ_w},
that is, the individual-specific covariates drop out. Therefore, z_ik in model (21.8) does not
contain covariates that vary only across individuals. In particular, z_ik in model (21.8) does not
contain the constant; in contrast, the x_i in model (21.3) usually contains the intercept
by default.
Second, if we want to use individual-specific covariates in the model, they must enter with choice-specific coefficients.
Equivalently, we can create pseudo covariates zik as the original wik together with interac-
tion of xi and the dummy for choice k. For example, if K = 3 and xi contain the intercept
and a scalar individual-specific covariate, then (zi1 , zi2 , zi3 ) are
z_i1 = (w_i1, 1, 0, x_i, 0)^t,  z_i2 = (w_i2, 0, 1, 0, x_i)^t,  z_i3 = (w_i3, 0, 0, 0, 0)^t,
where K = 3 is the reference level. So with augmented covariates, the discrete choice model
(21.8) is strictly more general than the multinomial logistic model (21.3). In the special case
with K = 2, model (21.8) reduces to
pr(y_i = 1 | x_i) = e^{x_i^t β} / (1 + e^{x_i^t β}),
where x_i = z_i1 − z_i2.
Based on the model specification (21.8), the log-likelihood function is
log L(θ) = Σ_{i=1}^n Σ_{k=1}^K 1(y_i = k) { z_ik^t θ − log( Σ_{l=1}^K e^{z_il^t θ} ) },
with score function
∂ log L(θ)/∂θ = Σ_{i=1}^n Σ_{k=1}^K 1(y_i = k){ z_ik − E(z_ik; θ) },
where E(·; θ) is the average value of {z_i1, . . . , z_iK} over the probability mass function
p_k(θ) = e^{z_ik^t θ} / Σ_{l=1}^K e^{z_il^t θ},
and Hessian matrix
∂² log L(θ)/∂θ∂θ^t = −Σ_{i=1}^n Σ_{k=1}^K 1(y_i = k) cov(z_ik; θ),
where cov(·; θ) is the covariance matrix of {zi1 , . . . , ziK } over the probability mass function
defined above. From these formulas, we can easily compute the MLE using Newton’s method
and obtain its asymptotic distribution based on the inverse of the Fisher information matrix.
21.6.2 Example
The R package mlogit provides a function mlogit to fit the general discrete logistic model
(Croissant, 2020). Here I use an example from this package to illustrate the model fitting
of mlogit. The R code is in code20.6.R.
> library("nnet")
> library("mlogit")
> data("Fishing")
> head(Fishing)
     mode price.beach price.pier price.boat price.charter
1 charter     157.930    157.930    157.930       182.930
2 charter      15.114     15.114     10.534        34.534
3    boat     161.874    161.874     24.334        59.334
4    pier      15.134     15.134     55.930        84.930
5    boat     106.930    106.930     41.514        71.014
6 charter     192.474    192.474     28.934        63.934
  catch.beach catch.pier catch.boat catch.charter   income
1      0.0678     0.0503     0.2601        0.5391 7083.332
2      0.1049     0.0451     0.1574        0.4671 1250.000
3      0.5333     0.4522     0.2413        1.0266 3750.000
4      0.0678     0.0789     0.1643        0.5391 2083.333
5      0.0678     0.0503     0.1082        0.3240 4583.332
6      0.5333     0.4522     0.1665        0.3975 4583.332
The dataset Fishing is in the “wide” format, where mode denotes the choice of four modes
of fishing (beach, pier, boat and charter), price and catch denote the price and catching
rates which are choice-specific, income is individual-specific. We need to first transform the
dataset into “long” format.
> Fish = dfidx(Fishing,
+              varying = 2:9,
+              shape = "wide",
+              choice = "mode")
> head(Fish)
~~~~~~~
 first 10 observations out of 4728
~~~~~~~
    mode   income   price  catch       idx
1  FALSE 7083.332 157.930 0.0678   1:beach
2  FALSE 7083.332 157.930 0.2601    1:boat
3   TRUE 7083.332 182.930 0.5391 1:charter
4  FALSE 7083.332 157.930 0.0503    1:pier
5  FALSE 1250.000  15.114 0.1049   2:beach
6  FALSE 1250.000  10.534 0.1574    2:boat
7   TRUE 1250.000  34.534 0.4671 2:charter
8  FALSE 1250.000  15.114 0.0451    2:pier
9  FALSE 3750.000 161.874 0.5333   3:beach
10  TRUE 3750.000  24.334 0.2413    3:boat
Call:
mlogit(formula = mode ~ 0 + price + catch, data = Fish, method = "nr")

nr method
6 iterations, 0h:0m:0s
g'(-H)^-1g = 0.000179

Coefficients:
        Estimate Std. Error z-value  Pr(>|z|)
price -0.0204765  0.0012231 -16.742 < 2.2e-16 ***
catch  0.9530982  0.0894134  10.659 < 2.2e-16 ***

Log-Likelihood: -1312
If we do not enforce 0 + price, we allow for intercepts that vary across choices:
> summary(mlogit(mode ~ price + catch, data = Fish))

Call:
mlogit(formula = mode ~ price + catch, data = Fish, method = "nr")

nr method
7 iterations, 0h:0m:0s
g'(-H)^-1g = 6.22E-06
successive function values within tolerance limits

Coefficients:
                      Estimate Std. Error  z-value  Pr(>|z|)
(Intercept):boat     0.8713749  0.1140428   7.6408 2.154e-14 ***
(Intercept):charter  1.4988884  0.1329328  11.2755 < 2.2e-16 ***
(Intercept):pier     0.3070552  0.1145738   2.6800 0.0073627 **
price               -0.0247896  0.0017044 -14.5444 < 2.2e-16 ***
catch                0.3771689  0.1099707   3.4297 0.0006042 ***

Log-Likelihood: -1230.8
McFadden R^2: 0.17823
Likelihood ratio test: chisq = 533.88 (p.value = < 2.22e-16)
Call:
mlogit(formula = mode ~ 0 | income, data = Fish, method = "nr")

nr method
4 iterations, 0h:0m:0s
g'(-H)^-1g = 8.32E-07
gradient close to zero

Coefficients:
                       Estimate  Std. Error z-value  Pr(>|z|)
(Intercept):boat     7.3892e-01  1.9673e-01  3.7560 0.0001727 ***
(Intercept):charter  1.3413e+00  1.9452e-01  6.8955 5.367e-12 ***
(Intercept):pier     8.1415e-01  2.2863e-01  3.5610 0.0003695 ***
income:boat          9.1906e-05  4.0664e-05  2.2602 0.0238116 *
income:charter      -3.1640e-05  4.1846e-05 -0.7561 0.4495908
income:pier         -1.4340e-04  5.3288e-05 -2.6911 0.0071223 **

Log-Likelihood: -1477.2
McFadden R^2: 0.013736
Likelihood ratio test: chisq = 41.145 (p.value = 6.0931e-09)
Call:
mlogit(formula = mode ~ price + catch | income, data = Fish,
    method = "nr")

nr method
7 iterations, 0h:0m:0s
g'(-H)^-1g = 1.37E-05
successive function values within tolerance limits

Coefficients:
                       Estimate  Std. Error  z-value  Pr(>|z|)
(Intercept):boat     5.2728e-01  2.2279e-01   2.3667 0.0179485 *
(Intercept):charter  1.6944e+00  2.2405e-01   7.5624 3.952e-14 ***
(Intercept):pier     7.7796e-01  2.2049e-01   3.5283 0.0004183 ***
price               -2.5117e-02  1.7317e-03 -14.5042 < 2.2e-16 ***
catch                3.5778e-01  1.0977e-01   3.2593 0.0011170 **
income:boat          8.9440e-05  5.0067e-05   1.7864 0.0740345 .
income:charter      -3.3292e-05  5.0341e-05  -0.6613 0.5084031
income:pier         -1.2758e-04  5.0640e-05  -2.5193 0.0117582 *

Log-Likelihood: -1215.1
McFadden R^2: 0.18868
Likelihood ratio test: chisq = 565.17 (p.value = < 2.22e-16)
y_i ∼ Multinomial(1; q_1, . . . , q_K), (21.10)
and
x_i | y_i = k ∼ N(µ_k, Σ),  (k = 1, . . . , K),
where x_i does not contain 1. We can verify that y_i | x_i follows a multinomial logit model as
shown in the theorem below.
where
α_k = log q_k − (1/2) µ_k^t Σ^{-1} µ_k,  β_k = Σ^{-1} µ_k.
Prove Theorem 21.1.
21.3 Iteratively reweighted least squares algorithm for the multinomial logit model
Similar to the binary logistic model, Newton’s method for computing the MLE for the
multinomial logit model can be written as iteratively reweighted least squares. Give the
details.
A random variable for counts can take values in {0, 1, 2, . . .}. This type of variable is common
in applied statistics. For example, it can represent how many times you visit the gym every
week, how many lectures you have missed in the linear model course, how many traffic
accidents happened in certain areas during certain periods, etc. This chapter focuses on
statistical modeling of those outcomes given covariates. Hilbe (2014) is a textbook focusing
on count outcome regressions.
22.1.1 Poisson
A random variable y is Poisson(λ) if its probability mass function is
pr(y = k) = e^{−λ} λ^k / k!,  (k = 0, 1, 2, . . .),
which sums to 1 by the Taylor expansion formula e^λ = Σ_{k=0}^∞ λ^k / k!. The Poisson(λ) random
variable has the following properties:
Proposition 22.1 If y ∼ Poisson(λ), then E(y) = var(y) = λ.
Proposition 22.2 If y1 , . . . , yK are mutually independent with yk ∼ Poisson(λk ) for k =
1, . . . , K, then
y1 + · · · + yK ∼ Poisson(λ),
and
(y1 , . . . , yK ) | y1 + · · · + yK = n ∼ Multinomial (n, (λ1 /λ, . . . , λK /λ)) ,
where λ = λ1 + · · · + λK .
Conversely, if S ∼ Poisson(λ) with λ = λ1 + · · · + λK , and (y1 , . . . , yK ) | S =
n ∼ Multinomial (n, (λ1 /λ, . . . , λK /λ)), then y1 , . . . , yK are mutually independent with
yk ∼ Poisson(λk ) for k = 1, . . . , K.
Where does the Poisson random variable come from? One way to generate Poisson is
through independent Bernoulli random variables. I will review Le Cam (1960)’s theorem
below without giving a proof.
Theorem 22.1 Suppose the Xi 's are independent Bernoulli random variables with probabilities
pi (i = 1, . . . , n). Define λn = Σ_{i=1}^n pi and Sn = Σ_{i=1}^n Xi . Then

Σ_{k=0}^∞ | pr(Sn = k) − e^{−λn} λn^k / k! | ≤ 2 Σ_{i=1}^n pi^2 .
So the sum of IID Bernoulli random variables is approximately Poisson if the probability
has order 1/n. This is called the law of rare events, or Poisson limit theorem, or Le Cam’s
theorem. By Theorem 22.1, we can use Poisson as a model for the sum of many rare events.
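The following is a small numerical sketch (an illustration of mine, not the book's code) of the law of rare events: it compares Binomial(n, λ/n) probabilities with Poisson(λ) probabilities, and with the bound of Theorem 22.1 for IID Bernoulli(λ/n) summands.

# compare Binomial(n, lambda/n) with Poisson(lambda); Theorem 22.1 bounds the
# summed absolute difference by 2 * n * (lambda/n)^2 = 2 * lambda^2 / n
n      <- 1000
lambda <- 5
k      <- 0:30
sum(abs(dbinom(k, size = n, prob = lambda / n) - dpois(k, lambda)))  # actual discrepancy
2 * lambda^2 / n                                                     # upper bound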
22.1.2 Negative-Binomial
The Poisson distribution restricts the mean to equal the variance, so it
cannot capture overdispersed data with variance larger than the mean. The
Negative-Binomial is an extension of the Poisson that allows for overdispersion. Here the
definition of the Negative-Binomial below is different from its standard definition, but it is
more natural as an extension of the Poisson1 . Define y as the Negative-Binomial random
variable, denoted by NB(µ, θ) with µ > 0 and θ > 0, if
y | λ ∼ Poisson(λ),    λ ∼ Gamma(θ, θ/µ).    (22.1)
So the Negative-Binomial is the Poisson with a random Gamma intensity, that is, the
Negative-Binomial is a scale mixture of the Poisson. If θ → ∞, then λ is a point mass at µ
and the Negative-Binomial reduces to Poisson(µ). We can verify that it has the following
probability mass function.
Proposition 22.3 The Negative-Binomial random variable defined in (22.1) has the prob-
ability mass function
pr(y = k) = Γ(k + θ) / {Γ(k + 1)Γ(θ)} · {θ/(µ + θ)}^θ {µ/(µ + θ)}^k,    (k = 0, 1, 2, . . .).
Proof of Proposition 22.3: We have
pr(y = k) = ∫_0^∞ pr(y = k | λ) f(λ) dλ

          = ∫_0^∞ e^{−λ} λ^k / k! · (θ/µ)^θ / Γ(θ) · λ^{θ−1} e^{−(θ/µ)λ} dλ

          = (θ/µ)^θ / {k! Γ(θ)} · ∫_0^∞ λ^{k+θ−1} e^{−(1+θ/µ)λ} dλ.

The function in the integral is the density function of Gamma(k + θ, 1 + θ/µ) without the
normalizing constant (1 + θ/µ)^{k+θ} / Γ(k + θ).
1 With IID Bernoulli(p) trials, the Negative-Binomial distribution, denoted by y ∼ NB′(r, p), is the
number of successes before the rth failure. Its probability mass function is

pr(y = k) = C(k + r − 1, k) (1 − p)^r p^k,    (k = 0, 1, 2, . . .),

where C(·, ·) denotes the binomial coefficient. If p = µ/(µ + θ) and r = θ, then these two definitions
coincide. This definition is more restrictive because r must be an integer.
So

pr(y = k) = (θ/µ)^θ / {k! Γ(θ)} · Γ(k + θ) / (1 + θ/µ)^{k+θ}

          = Γ(k + θ) / {k! Γ(θ)} · {θ/(µ + θ)}^θ {µ/(µ + θ)}^k.

□
We can derive the mean and variance of the Negative-Binomial.
Proposition 22.4 The Negative-Binomial random variable defined in (22.1) has moments
E(y) = µ,    var(y) = µ + µ^2/θ > E(y).
Proof of Proposition 22.4: Recall Proposition B.2 that a Gamma(α, β) random variable
has mean α/β and variance α/β 2 . We have
E(y) = E{E(y | λ)} = E(λ) = θ/(θ/µ) = µ,

and

var(y) = E{var(y | λ)} + var{E(y | λ)} = E(λ) + var(λ) = θ/(θ/µ) + θ/(θ/µ)^2 = µ + µ^2/θ.
□
So the dispersion parameter θ controls the variance of the Negative-Binomial. With the
same mean, the Negative-Binomial has a larger variance than Poisson. Figure 22.1 further
compares the log probability mass functions of the Negative Binomial and Poisson. It shows
that the Negative Binomial has a slightly higher probability at zero but much heavier tails
than the Poisson.
FIGURE 22.1: Comparing the log probability mass functions of the Poisson and the
Negative-Binomial with the same mean (panels for µ = 1 and µ = 5)
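As a small simulation sketch (an illustration of mine, not the book's code), we can verify the Gamma-Poisson mixture representation (22.1) and the moments in Proposition 22.4, using the fact that R's dnbinom with the size/mu parametrization matches the probability mass function in Proposition 22.3.

set.seed(42)
mu <- 5; theta <- 2
lambda <- rgamma(1e5, shape = theta, rate = theta / mu)   # Gamma(theta, theta/mu) intensity
y      <- rpois(1e5, lambda)                              # Poisson draw given the intensity
c(mean(y), var(y), mu + mu^2 / theta)                     # mean and variance as in Proposition 22.4
k <- 0:20
cbind(k,
      simulated = as.numeric(table(factor(y, levels = k))) / 1e5,
      nbinom    = dnbinom(k, size = theta, mu = mu),      # NB(mu, theta) pmf
      poisson   = dpois(k, mu))                           # Poisson pmf with the same mean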
A zero-inflated Negative-Binomial random variable y is a mixture of a point mass at zero
and a NB(µ, θ) random variable, with probabilities p and 1 − p, respectively. So y has
probability mass function

pr(y = k) = p + (1 − p) {θ/(µ + θ)}^θ,                                            if k = 0,
            (1 − p) Γ(k + θ)/{Γ(k + 1)Γ(θ)} · {θ/(µ + θ)}^θ {µ/(µ + θ)}^k,        if k = 1, 2, . . . .
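A short sketch (mine, not from the text) writes this zero-inflated pmf directly in R and checks that it sums to one; dnbinom with size = θ and mu = µ supplies the NB(µ, θ) part.

dzinb <- function(k, p, mu, theta) {
  p * (k == 0) + (1 - p) * dnbinom(k, size = theta, mu = mu)   # zero-inflated NB pmf
}
sum(dzinb(0:1000, p = 0.3, mu = 5, theta = 2))                 # approximately 1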
Because
log E(yi | xi ) = xti β,
this model is sometimes called the log-linear model, with the coefficient βj interpreted as
the conditional log mean ratio:
log { E(yi | . . . , xij + 1, . . .) / E(yi | . . . , xij , . . .) } = βj .
which is negative semi-definite. When the Hessian is negative definite, the MLE is unique.
It must satisfy that
Σ_{i=1}^n xi ( yi − e^{xti β̂} ) = Σ_{i=1}^n xi { yi − λ(xi , β̂) } = 0.
where

X = (x1 , . . . , xn )^t ,    Y = (y1 , . . . , yn )^t ,    Λ^old = ( exp(xt1 β^old ), . . . , exp(xtn β^old ) )^t ,

and W^old = diag{ exp(xti β^old ) }_{i=1}^n , so that Newton's update can be written as

β^new = (X^t W^old X)^{−1} X^t W^old Z^old ,    where    Z^old = Xβ^old + (W^old )^{−1} (Y − Λ^old ).
So we have an iterative reweighted least squares algorithm. In R, we can use the glm function
with “family = poisson(link = "log")” to fit the Poisson regression, which uses Newton’s
method.
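Below is a minimal sketch (mine, not the book's code) of this iterative reweighted least squares update for Poisson regression on simulated data, compared with the glm fit.

set.seed(1)
n <- 500
x <- cbind(1, rnorm(n))                        # design matrix with an intercept
y <- rpois(n, exp(x %*% c(0.5, 0.3)))
beta <- rep(0, 2)                              # starting value
for (iter in 1:25) {
  lambda <- as.vector(exp(x %*% beta))         # fitted means
  w <- lambda                                  # weights W
  z <- x %*% beta + (y - lambda) / lambda      # working responses Z
  beta <- solve(t(x) %*% (w * x), t(x) %*% (w * z))
}
cbind(irls = as.vector(beta),
      glm  = coef(glm(y ~ x - 1, family = poisson(link = "log"))))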
Statistical inference is based on

β̂ ∼a N( β, { −∂^2 log L(β̂)/∂β∂β^t }^{−1} ) = N( β, (X^t Ŵ X)^{−1} ),

where Ŵ = diag{ exp(xti β̂) }_{i=1}^n .
After obtaining the MLE, we can predict the mean E(yi | xi ) by λ̂i = e^{xti β̂}. Because
Poisson regression is a fully parametrized model, we can also predict any other probability
quantities involving yi | xi . For example, we can predict pr(yi = 0 | xi ) by e^{−λ̂i}, and
pr(yi ≥ 3 | xi ) by 1 − e^{−λ̂i} (1 + λ̂i + λ̂i^2 /2).
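A short illustration (mine, using a hypothetical data frame dat) of such predictions from a fitted Poisson regression:

dat <- data.frame(y = rpois(200, 3), x = rnorm(200))     # hypothetical data
fit <- glm(y ~ x, family = poisson(link = "log"), data = dat)
lambda.hat <- predict(fit, type = "response")            # fitted means
pr0 <- exp(-lambda.hat)                                  # predicted pr(y = 0 | x)
pr3 <- 1 - exp(-lambda.hat) * (1 + lambda.hat + lambda.hat^2 / 2)  # predicted pr(y >= 3 | x)
head(cbind(lambda.hat, pr0, pr3))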
The corresponding first-order condition can be viewed as the estimating equation of Poisson
regression with weights (1 + µi /θ)−1 . Second,
∂^2 log L(β, θ)/∂β∂θ = Σ_{i=1}^n µi /(µi + θ)^2 · (yi − µi ) xi .
We can verify
E{ ∂^2 log L(β, θ)/∂β∂θ | X } = 0
since each term inside the summation has conditional expectation zero. This implies that
the Fisher information matrix is diagonal, so β̂ and θ̂ are asymptotically independent.
The glm.nb function in the MASS package iterates between β and θ: given θ, it updates β based
on Fisher scoring; given β, it updates θ based on Newton's algorithm. It reports standard errors
based on the inverse of the Fisher information matrix.2
which may give slightly different numbers compared with R. The BHHH algorithm is similar to Newton’s
algorithm but avoids calculating the Hessian matrix.
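As a hedged sketch (not the book's code), the following simulates NB(µ, θ) outcomes and fits them with MASS::glm.nb, which reports both β̂ and the estimated dispersion θ̂.

library(MASS)
set.seed(2)
n  <- 1000
x  <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)
y  <- rnbinom(n, size = 2, mu = mu)      # theta = 2 in the notation of this chapter
fit <- glm.nb(y ~ x)
summary(fit)$coefficients                # beta-hat with Fisher-information standard errors
fit$theta                                # estimated dispersion theta-hat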
FIGURE 22.2: Point estimates and confidence intervals of the coefficients of incentivecommit
and incentive from the linear, Poisson, and NB regressions, by week
Each worker was observed over time, so we run regressions with the outcome data ob-
served in each week. In the following, we compute the linear regression coefficients, standard
errors, and AICs.
> weekids = sort(unique(gym1$incentive_week))
> lweekids = length(weekids)
> coefincentivecommit = 1:lweekids
FIGURE 22.3: Left: variance against mean of the weekly outcomes. Right: log(θ) from the
Negative-Binomial regressions against weeks
and
regweek = glm.nb(f.reg, data = gymweek)
we obtain the corresponding results from Poisson and Negative-Binomial regressions. Figure
22.2 compares the regression coefficients with the associated confidence intervals over time.
Three regressions give very similar patterns: incentive_commit has both short-term and long-
term effects, but incentive only has short-term effects.
The left panel of Figure 22.3 shows that variances are larger than the means for out-
comes from all weeks, and the right panel of Figure 22.3 shows the point estimates and
confidence intervals of θ from Negative-Binomial regressions. Overall, overdispersion seems
an important feature of the data.
A large proportion of workers just did not go to the gym regardless of the treatments.
Therefore, it seems crucial to accommodate the zeros in the models.
FIGURE 22.4: Histograms of the weekly outcome y, which takes values 0-5 and has a large
number of zeros
We now fit zero-inflated Poisson regressions. The model has parameters for the zero
component and parameters for the Poisson components.
> library("pscl")
> coefincentivecommit0 = coefincentivecommit
> coefincentive0 = coefincentive
> seincentivecommit0 = seincentivecommit
> seincentive0 = seincentive
> AIC0poisson = AICnb
> for (i in 1:lweekids)
+ {
+   gymweek = gym1[which(gym1$incentive_week == weekids[i]), ]
+   regweek = zeroinfl(f.reg, dist = "poisson", data = gymweek)
+   regweekcoef = summary(regweek)$coef
+
+   coefincentivecommit[i] = regweekcoef$count[2, 1]
+   coefincentive[i] = regweekcoef$count[3, 1]
+   seincentivecommit[i] = regweekcoef$count[2, 2]
+   seincentive[i] = regweekcoef$count[3, 2]
+
+   coefincentivecommit0[i] = regweekcoef$zero[2, 1]
+   coefincentive0[i] = regweekcoef$zero[3, 1]
+   seincentivecommit0[i] = regweekcoef$zero[2, 2]
+   seincentive0[i] = regweekcoef$zero[3, 2]
+
+   AIC0poisson[i] = AIC(regweek)
+ }
Changing dist = "poisson" to dist = "negbin", we can fit the corresponding zero-inflated
Negative-Binomial regressions. Figure 22.5 plots
FIGURE 22.5: Point estimates and confidence intervals of the coefficients of incentivecommit
and incentive in the mean and zero components of the zero-inflated Poisson (top) and
zero-inflated NB (bottom) regressions, by week
the point estimates and the confidence intervals of the coefficients of the treatment. It shows
that the treatments do not have effects on the Poisson or Negative-Binomial components,
but have effects on the zero components. This suggests that the treatments affect the out-
come mainly by changing the workers’ behavior of whether to go to the gym.
Another interesting result is the large θ̂’s from the zero-inflated Negative-Binomial re-
gression:
> quantile(gymtheta, probs = c(0.01, 0.25, 0.5, 0.75, 0.99))
  1%  25%  50%  75%  99%
12.3 13.1 13.7 14.4 15.7
Once the zero-inflation has been modeled, it is not crucial to account for the
overdispersion. This is reasonable because the maximum outcome is five, ruling out heavy
tails. This is further corroborated by the following comparison of the AICs from five
regression models. Figure 22.6 shows that zero-inflated Poisson regressions have the smallest
AICs, beating the zero-inflated Negative-Binomial regressions, which are more flexible but
have more parameters to estimate.
FIGURE 22.6: AICs of the linear, Poisson, NB, zero-inflated Poisson, and zero-inflated NB
regressions, by week
22.4 Poisson latent variable and the binary regression model with the cloglog link
Assume that yi∗ | xi ∼ Poisson(e^{xti β}), and define yi = 1(yi∗ > 0) as the indicator that yi∗ is
not zero. Show that yi | xi follows a cloglog model, that is,
This chapter unifies Chapters 20–22 under the formulation of the generalized linear model
(GLM) by Nelder and Wedderburn (1972).
Example 23.1 The Normal linear model for continuous outcomes assumes
Example 23.4 The Negative-Binomial model for overdispersed count outcomes assumes
yi | xi ∼ NB(µi , δ),    with µi = e^{xti β}.    (23.4)
We use δ for the dispersion parameter to avoid confusion because θ means something else
below (Chapter 22 uses θ for the dispersion parameter).
In the above models, µi denotes the conditional mean. This chapter will unify Examples
23.1–23.4 as GLMs.
where (θi , ϕ) are unknown parameters, and {a(·), b(·), c(·, ·)} are known functions. The above
conditional density (23.5) is called the natural exponential family with dispersion, where θi
is the natural parameter depending on xi , and ϕ is the dispersion parameter. Sometimes,
a(ϕ) = 1 and c(yi , ϕ) = c(yi ), simplifying the conditional density to a natural exponential
family. Examples 23.1–23.4 have a unified structure as (23.5), as detailed below.
Example 23.1 (continued) Model (23.1) has conditional probability density function
f(yi | xi ; µi , σ^2 ) = (2πσ^2 )^{−1/2} exp{ −(yi − µi )^2 /(2σ^2 ) }

                      = exp{ (yi µi − µi^2 /2)/σ^2 − log(2πσ^2 )/2 − yi^2 /(2σ^2 ) },

with

θi = µi ,    b(θi ) = θi^2 /2,

and

ϕ = σ^2 ,    a(ϕ) = σ^2 = ϕ.
Example 23.2 (continued) Model (23.2) has conditional probability mass function
f(yi | xi ; µi ) = µi^{yi} (1 − µi )^{1−yi}

                 = (1 − µi ) {µi /(1 − µi )}^{yi}

                 = exp{ yi log(µi /(1 − µi )) − log(1/(1 − µi )) },

with

θi = log{µi /(1 − µi )}  ⟺  µi = e^{θi}/(1 + e^{θi}),    b(θi ) = log{1/(1 − µi )} = log(1 + e^{θi}),

and

a(ϕ) = 1.
Example 23.3 (continued) Model (23.3) has conditional probability mass function
f(yi | xi ; µi ) = e^{−µi} µi^{yi} / yi !

                 = exp{ yi log µi − µi − log yi ! },

with

θi = log µi ,    b(θi ) = µi = e^{θi},

and

a(ϕ) = 1.
Example 23.4 (continued) Model (23.4), for a fixed δ, has conditional probability mass
function
f(yi | xi ; µi ) = Γ(yi + δ)/{Γ(δ)Γ(yi + 1)} · {δ/(µi + δ)}^δ {µi /(µi + δ)}^{yi}

                 = exp{ yi log(µi /(µi + δ)) − δ log((µi + δ)/δ) + log Γ(yi + δ) − log Γ(δ) − log Γ(yi + 1) },

with

θi = log{µi /(µi + δ)}  ⟺  δ/(µi + δ) = 1 − e^{θi},    b(θi ) = δ log{(µi + δ)/δ} = −δ log(1 − e^{θi}),

and

a(ϕ) = 1.
The logistic and Poisson models are simpler without the dispersion parameter. The
Normal linear model has a dispersion parameter for the variance. The Negative-Binomial
model is more complex: without fixing δ it does not belong to the exponential family with
dispersion.
The exponential family (23.5) has nice properties derived from the classic Bartlett’s
identities (Bartlett, 1953). I first review Bartlett’s identities:
E{ ∂ log f(y | θ)/∂θ } = 0,

and

E{ (∂ log f(y | θ)/∂θ)^2 } = E{ −∂^2 log f(y | θ)/∂θ^2 }.
This lemma is well-known in classic statistical theory for likelihood, and I give a simple
proof below.
Proof of Lemma 23.1: Define ℓ(y | θ) = log f(y | θ) as the log likelihood function, so
e^{ℓ(y|θ)} is the density satisfying

∫ e^{ℓ(y|θ)} dy = ∫ f(y | θ) dy = 1

by the definition of a probability density function (we can replace the integral by summation
for a probability mass function). Differentiate the above identity to obtain

∂/∂θ ∫ e^{ℓ(y|θ)} dy = 0
  ⟹  ∫ ∂/∂θ e^{ℓ(y|θ)} dy = 0
  ⟹  ∫ e^{ℓ(y|θ)} ∂ℓ(y | θ)/∂θ dy = 0,

which implies Bartlett's first identity. Differentiate it twice to obtain

∫ ∂/∂θ { e^{ℓ(y|θ)} ∂ℓ(y | θ)/∂θ } dy = 0
  ⟹  ∫ [ e^{ℓ(y|θ)} {∂ℓ(y | θ)/∂θ}^2 + e^{ℓ(y|θ)} ∂^2 ℓ(y | θ)/∂θ^2 ] dy = 0,

which implies Bartlett's second identity. □
E(yi | xi ; θi , ϕ) ≡ µi = b′(θi )    and    var(yi | xi ; θi , ϕ) ≡ σi^2 = b′′(θi ) a(ϕ).
Proof of Theorem 23.1: The first two derivatives of the log conditional density are
(C2) the conditional mean µi = b′ (θi ) and variance σi2 = b′′ (θi )a(ϕ) in Theorem 23.1;
(C3) the function linking the conditional mean and covariates µi = µ(xti β).
Models (23.1)–(23.4) are the classical examples. Figure 23.1 illustrates the relationship
among key quantities in a GLM. In particular, the natural parameter

θi = (b′)^{−1}{ µ(xti β) }    (23.6)

depends on xi and β, with (b′)^{−1} indicating the inverse function of b′(·).
ℓi = log f(yi | xi ; β, ϕ) = { yi θi − b(θi ) }/a(ϕ) + c(yi , ϕ).
The general relationship (23.6) between θi and β is quite complicated. A natural choice
of µ(·) is to cancel (b′)^{−1} in (23.6) so that µ(·) = b′(·) and θi = xti β. This link function µ(·) is
called the canonical link or the natural link, which leads to further simplifications:
µi = b′(xti β)  ⟹  ∂µi /∂β = b′′(xti β) xi = b′′(θi ) xi = {σi^2 /a(ϕ)} xi ,

and

∂ℓi /∂β = {(yi − µi )/σi^2 } · {σi^2 /a(ϕ)} xi = a(ϕ)^{−1} xi (yi − µi )

  ⟹  Σ_{i=1}^n a(ϕ)^{−1} xi (yi − µi ) = 0

  ⟹  Σ_{i=1}^n xi (yi − µi ) = 0.    (23.8)
We have shown that the MLEs of models (23.1)–(23.3) all satisfy (23.8). However, the MLE
of (23.4) does not because it does not use the natural link function resulting in µ(·) ̸= b′ (·):
µ(∗) = e^∗ ,    b′(∗) = δ e^∗ /(1 − e^∗ ).
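A quick numerical check (a sketch of mine) of (23.8) for the Poisson model with the canonical log link: at the MLE, the residuals are orthogonal to every column of the design matrix.

set.seed(3)
n <- 300
x <- cbind(1, rnorm(n), rbinom(n, 1, 0.5))
y <- rpois(n, exp(x %*% c(0.2, 0.4, -0.3)))
fit <- glm(y ~ x - 1, family = poisson)
colSums(x * (y - fitted(fit)))        # numerically zero, as required by (23.8)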
Using Bartlett’s second identity in Lemma 23.1, we can write the expected Fisher infor-
mation conditional on covariates as
Σ_{i=1}^n E{ ∂ℓi /∂β ∂ℓi /∂β^t | xi } = Σ_{i=1}^n E[ {(yi − µi )/σi^2 }^2 ∂µi /∂β ∂µi /∂β^t | xi ]

                                      = Σ_{i=1}^n (1/σi^2 ) ∂µi /∂β ∂µi /∂β^t

                                      = Σ_{i=1}^n (1/σi^2 ) {µ′(xti β)}^2 xi xti

                                      = X^t W X,

where

W = diag{ (1/σi^2 ) {µ′(xti β)}^2 }_{i=1}^n .
With the canonical link, it further simplifies to
Σ_{i=1}^n E{ ∂ℓi /∂β ∂ℓi /∂β^t | xi } = Σ_{i=1}^n E[ {(yi − µi )/a(ϕ)}^2 xi xti | xi ]

                                      = {a(ϕ)}^{−2} Σ_{i=1}^n σi^2 xi xti .
We can obtain the estimated covariance matrix by replacing the unknown parameters
with their estimates. Now we review the estimated covariance matrices of the classical GLMs
with canonical links.
Example 23.1 (continued) In the Normal linear model, V̂ = σ̂ 2 (X t X)−1 with σ 2 esti-
mated further by the residual sum of squares.
I relegate the derivation of the formula for the Negative-Binomial regression to Problem
23.2. It is a purely theoretical exercise since δ is usually unknown in practice.
Examples 23.1–23.3 correspond to the second, the first, and the fifth choices above. Below
I will briefly discuss the third choice for the Gamma regression and omit the discussion of
other choices. See the help file of the glm function and McCullagh and Nelder (1989) for
more details.
The Gamma(α, β) random variable is positive with mean α/β and variance α/β 2 . For
convenience in modeling, we use a reparametrization Gamma′ (µ, ν) where
µ = α/β,  ν = α  ⟺  α = ν,  β = ν/µ.
So its mean equals µ and its variance equals µ^2 /ν, which is quadratic in µ. A feature of
the Gamma random variable is that its coefficient of variation equals 1/√ν, which does not
depend on the mean. So Gamma′(µ, ν) is a parametrization based on the mean and the
coefficient of variation (McCullagh and Nelder, 1989).1 Gamma′(µ, ν) has density
and we can verify that it belongs to the exponential family with dispersion. Gamma regres-
sion assumes
yi | xi ∼ Gamma′(µi , ν),

where µi = e^{xti β}. So it is also a log-linear model. This does not correspond to the canonical
link. Instead, we should specify Gamma(link = "log") to fit the log-linear Gamma regression
model.
The log-likelihood function is
log L(β, ν) = Σ_{i=1}^n { −ν yi /e^{xti β} + (ν − 1) log yi + ν log ν − ν xti β − log Γ(ν) }.
Then
∂ log L(β, ν)/∂β = Σ_{i=1}^n ( ν yi e^{−xti β} xi − ν xi ) = ν Σ_{i=1}^n e^{−xti β} ( yi − e^{xti β} ) xi

and

∂^2 log L(β, ν)/∂β∂ν = Σ_{i=1}^n e^{−xti β} ( yi − e^{xti β} ) xi .
Moreover, ∂ 2 log L(β, ν)/∂β∂ν has expectation zero so the Fisher information matrix is
1 The coefficient of variation of a random variable A equals √var(A) / E(A).
Firth (1988) compared Gamma and log-Normal regressions based on efficiency. However,
these two models are not entirely comparable: Gamma regression assumes that the log of
the conditional mean of yi given xi is linear in xi , whereas log-Normal regression assumes
that the conditional mean of log yi given xi is linear in xi . By Jensen’s inequality, log E(yi |
xi ) ≥ E(log yi | xi ). See Problem 23.3 for more discussions of Gamma regression.
and
var(log yi | xi ) = ψ ′ (ν)
where ψ(ν) = d log Γ(ν)/dν is the digamma function and ψ ′ (ν) is the trigamma function.
Remark: Use Proposition B.3 to calculate the moments. The above conditional mean
function ensures that the OLS estimator of log yi on xi is consistent for all components of
β except for the intercept.
24
From Generalized Linear Models to Restricted Mean
Models: the Sandwich Covariance Matrix
This chapter discusses the consequence of misspecified GLMs, extending the EHW covari-
ance estimator to its analogs under the GLMs. It serves as a stepping stone to the next
chapter on the generalized estimating equation.
for any σ̃ 2 that can be a function of xi , the true parameter β, and maybe ϕ. So we can
estimate β by solving the estimating equation:
Σ_{i=1}^n { yi − µ(xti β) }/σ̃^2 (xi , β) · ∂µ(xti β)/∂β = 0.    (24.1)
If σ̃ 2 (xi , β) = σ 2 (xi ) = var(yi | xi ), then the above estimating equation is the score equation
derived from the GLM of an exponential family. If not, (24.1) is not a score function but
it is still a valid estimating equation. In the latter case, σ̃ 2 (xi , β) is a “working” variance.
This has important implications for the practical data analysis. First, we can interpret the
MLE from a GLM more broadly: it is also valid under a restricted mean model even if the
□
We can estimate the asymptotic variance by replacing B and M by their sample analogs.
With β̂ and the residual ε̂i = yi − µ(xti β̂), we can conduct statistical inference based on the
following Normal approximation:
β̂ ∼a N(β, V̂ ),

with V̂ ≡ n^{−1} B̂^{−1} M̂ B̂^{−1} , where

B̂ = n^{−1} Σ_{i=1}^n 1/σ̃^2 (xi , β̂) · ∂µ(xti β̂)/∂β ∂µ(xti β̂)/∂β^t ,

M̂ = n^{−1} Σ_{i=1}^n ε̂i^2 /σ̃^4 (xi , β̂) · ∂µ(xti β̂)/∂β ∂µ(xti β̂)/∂β^t .
As a special case, when the GLM is correctly specified with σ 2 (x) = σ̃ 2 (x, β), then B =
M and the asymptotic variance reduces to the inverse of the Fisher information matrix
discussed in Section 23.2.
Example 24.1 (continued) In a working Normal linear model, σ̃ 2 (xi , β) = σ̃ 2 is constant
and ∂µ(xti β)/∂β = xi . So
B̂ = n^{−1} Σ_{i=1}^n (1/σ̃^2 ) xi xti ,    M̂ = n^{−1} Σ_{i=1}^n { ε̂i^2 /(σ̃^2 )^2 } xi xti ,

with residual ε̂i = yi − xti β̂, recovering the EHW variance estimator

V̂ = ( Σ_{i=1}^n xi xti )^{−1} ( Σ_{i=1}^n ε̂i^2 xi xti ) ( Σ_{i=1}^n xi xti )^{−1} .
Example 24.2 (continued) In a working binary logistic model, σ̃^2 (xi , β) = π(xi , β){1 −
π(xi , β)} and ∂µ(xti β)/∂β = π(xi , β){1 − π(xi , β)} xi , where π(xi , β) = µ(xti β) = e^{xti β}/(1 + e^{xti β}).
So

B̂ = n^{−1} Σ_{i=1}^n π̂i (1 − π̂i ) xi xti ,    M̂ = n^{−1} Σ_{i=1}^n ε̂i^2 xi xti

with fitted mean π̂i = e^{xti β̂}/(1 + e^{xti β̂}) and residual ε̂i = yi − π̂i , yielding a new covariance
estimator

V̂ = ( Σ_{i=1}^n π̂i (1 − π̂i ) xi xti )^{−1} ( Σ_{i=1}^n ε̂i^2 xi xti ) ( Σ_{i=1}^n π̂i (1 − π̂i ) xi xti )^{−1} .
with fitted mean λ̂i = e^{xti β̂} and residual ε̂i = yi − λ̂i , yielding a new covariance estimator

V̂ = ( Σ_{i=1}^n λ̂i xi xti )^{−1} ( Σ_{i=1}^n ε̂i^2 xi xti ) ( Σ_{i=1}^n λ̂i xi xti )^{−1} .
Again, I relegate the derivation of the formulas for the Negative-Binomial regression to
a homework problem. The R package sandwich implements the above covariance matrices
(Zeileis, 2006).
The sandwich function can compute HC0 and HC1, corresponding to adjusting for the
degrees of freedom or not; hccm and vcovHC can compute other HC standard errors.
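A hedged sketch (not the book's code) of computing these robust standard errors for a working Poisson regression with the sandwich and lmtest packages:

library(sandwich)
library(lmtest)
set.seed(4)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, size = 2, mu = exp(0.5 + 0.3 * x))   # true model is overdispersed (NB)
fit <- glm(y ~ x, family = poisson)                  # working Poisson model
coeftest(fit, vcov = sandwich(fit))                  # sandwich (HC0-type) standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))      # degrees-of-freedom adjusted version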
z test of coefficients :
z test of coefficients :
z test of coefficients :
z test of coefficients :
Because the true model is the Negative-Binomial regression, we can use the correct
model to fit the data. Theoretically, the MLE is the most efficient estimator. However, in
this particular dataset, the robust standard error from Poisson regression is no larger than
the one from Negative-Binomial regression. Moreover, the robust standard errors from the
Poisson and Negative-Binomial regressions are very close.
> nb.nb = glm.nb(y ~ x)
> summary(nb.nb)
Coefficients :
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.08047    0.07336  -1.097   0.2727
x            0.16487    0.07276   2.266   0.0234 *
z test of coefficients :
z test of coefficients :
>
> wr.nb = glm.nb(y ~ x)
There were 26 warnings (use warnings() to see them)
> summary(wr.nb)
Coefficients :
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.15984    0.05802   2.755  0.00587 **
x           -0.34622    0.05789  -5.981 2.22e-09 ***
z test of coefficients :
Overall, for count outcome regression, it seems that Poisson regression suffices as long
as we use the robust standard error. The Negative-Binomial is unlikely to offer more if only
the conditional mean is of interest.
reasonable if the parameter of interest is the risk ratio instead of the odds ratio. Importantly,
since the Poisson model is a wrong model, we must use the sandwich covariance estimator.
Previous chapters dealt with cross-sectional data, that is, we observe n units at a partic-
ular time point, collecting various covariates and outcomes. In addition, we assume that
these units are independent, and sometimes, we even assume they are IID draws. Many
applications have correlated data. Two canonical examples are
(E1) repeated measurements of the same set of units over time, which are often called lon-
gitudinal data in biostatistics (Fitzmaurice et al., 2012) or panel data in econometrics
(Wooldridge, 2010); and
(E2) clustered observations belonging to classrooms, villages, etc., which are common in
cluster-randomized experiments in education (Schochet, 2013) and public health
(Turner et al., 2017a,b).
Many excellent textbooks cover this topic extensively. This chapter focuses on a simple yet
powerful strategy, which is a natural extension of the GLM discussed in the last chapter.
It was initially proposed in Liang and Zeger (1986), the most cited paper published in
Biometrika in the past one hundred years (Titterington, 2013). For simplicity, we will use
the term “longitudinal data” for general correlated data.
4 0 0 0 86.446 30 44
5 0 0 0 74.032 30 44
6 0 0 0 71.693 30 44
The three-way table below shows the treatment combinations for 14 mice, from which
we can see that the Pten knockdown indicator varies within mice, but the fatty acid level
varies only between mice.
> table(Pten$mouseid, Pten$fa, Pten$pten)
, , = 0
0 1 2
0 30 0 0
1 58 0 0
2 18 0 0
3 2 0 0
4 56 0 0
5 0 39 0
6 0 33 0
7 0 58 0
8 0 60 0
9 0 0 15
10 0 0 27
11 0 0 7
12 0 0 34
13 0 0 22
, , = 1
0 1 2
0 44 0 0
1 68 0 0
2 33 0 0
3 11 0 0
4 76 0 0
5 0 55 0
6 0 55 0
7 0 75 0
8 0 92 0
9 0 0 34
10 0 0 29
11 0 0 20
12 0 0 53
13 0 0 38
The useful variables below are z, x, y, and vid, which denote the binary treatment
indicator, the covariate xit , the outcome, and the village id:
> hygaccess = read.csv("hygaccess.csv")
> hygaccess = hygaccess[, c("r4_hyg_access", "treat_cat_1",
+                           "bl_c_hyg_access", "vid", "eligible")]
> hygaccess = hygaccess[which(hygaccess$eligible == "Eligible" &
+                             hygaccess$r4_hyg_access != "Missing"), ]
> hygaccess$y = ifelse(hygaccess$r4_hyg_access == "Yes", 1, 0)
> hygaccess$z = hygaccess$treat_cat_1
> hygaccess$x = hygaccess$bl_c_hyg_access
In (25.1), σ̃ 2 (xi , β) is a working variance function usually motivated by a GLM, but it can
be misspecified. With an ni × 1 vector outcome and an ni × p covariate matrix
Yi = (yi1 , . . . , yini )^t ,    Xi = (xi1 , . . . , xini )^t ,    (i = 1, . . . , n)    (25.2)
where (25.3) and (25.6) are definitions, and (25.4) and (25.5) are the two key assumptions.
Assumption (25.4) requires that the conditional mean of yit depends only on xit but not
on any other xis with s ̸= t. Assumption (25.5) requires that the relationship between xit
and yit is stable across units and time points with the function µ(·) and the parameter β
not varying with respect to i or t. The model assumptions in (25.4) and (25.5) are really
strong, and I defer the critiques to the end of this chapter. Nevertheless, the marginal model
attracts practitioners for
(A1) its similarity to GLM and the restricted mean model, and
(A2) its simplicity of requiring only specification of the marginal conditional means, not
the whole joint distribution.
The advantage (A1) facilitates the interpretation of the coefficient, and the advantage (A2)
is crucial because of the lack of familiar multivariate distributions in statistics except for
the multivariate Normal. The generalized estimating equation (GEE) for β is the vector
form of (25.1):
Σ_{i=1}^n ∂µ(Xi , β)/∂β · Ṽ^{−1}(Xi , β) · { Yi − µ(Xi , β) } = 0,    (25.7)

(the three factors have dimensions p × ni , ni × ni , and ni × 1, and the right-hand side is p × 1)
where (25.7) has a similar form as (25.1) with three terms organized to match the dimension
so that matrix multiplications are well-defined:
(GEE1) the last term
Yi − µ(Xi , β) = ( yi1 − µ(xti1 β), . . . , yini − µ(xtini β) )^t ;
(GEE2) the second term is the inverse of Ṽ (Xi , β), a working covariance matrix of the
conditional distribution of Yi given Xi which may be misspecified:
It is relatively easy to specify the working variance σ̃^2 (xit , β) for each marginal
component, for example, based on the marginal GLM. So the key is to specify the
ni × ni dimensional correlation matrix Ri to obtain

Ṽ(Xi , β) = diag{σ̃(xit , β)}_{t=1}^{ni} Ri diag{σ̃(xit , β)}_{t=1}^{ni} .
We assume that the Ri ’s are given now, and will discuss how to choose them in a
later section.
(GEE3) the first term is the partial derivative of an ni × 1 vector µ(Xi , β) =
(µ(xti1 β), . . . , µ(xtini β))t with respect to a p × 1 vector β = (β1 , . . . , βp )t :
∂µ(Xi , β)/∂β = ( ∂µ(xti1 β)/∂β, . . . , ∂µ(xtini β)/∂β ),

a p × ni matrix whose (j, t)th element is ∂µ(xtit β)/∂βj ,
So given β^old , we update it as

β^new = β^old + { Σ_{i=1}^n Di (β^old ) Ṽ^{−1}(Xi , β^old ) Dit (β^old ) }^{−1}
                × Σ_{i=1}^n Di (β^old ) Ṽ^{−1}(Xi , β^old ) { Yi − µ(Xi , β^old ) }.    (25.8)
After obtaining β̂ and the residual vector ε̂i = Yi − µ(Xi , β̂) for unit i (i = 1, . . . , n), we
can conduct asymptotic inference based on the Normal approximation
β̂ ∼a N(β, n^{−1} B̂^{−1} M̂ B̂^{−1} ),

where

B̂ = n^{−1} Σ_{i=1}^n Di (β̂) Ṽ^{−1}(Xi , β̂) Dit (β̂),

M̂ = n^{−1} Σ_{i=1}^n Di (β̂) Ṽ^{−1}(Xi , β̂) ε̂i ε̂ti Ṽ^{−1}(Xi , β̂) Dit (β̂).
This covariance estimator, proposed by Liang and Zeger (1986), is robust to the misspeci-
fication of the marginal variances and the correlation structure as long as the conditional
mean of Yi given Xi is correctly specified.
This is simply the estimating equation of a restricted mean model treating all data points
(i, t) as independent observations. This implies that the point estimate assuming the inde-
pendent working correlation matrix is still consistent, although we must change the standard
error as in Section 25.3.2.
With this simple starting point, we have a consistent yet inefficient estimate of β, and
then we can compute the residuals. The correlation among the residuals contains information
about the true covariance matrix. With small and equal ni ’s, we can estimate the conditional
covariance without imposing any structure based on the residuals. Using the estimated
covariance matrix, we can update the GEE estimate to improve efficiency. This leads to a
two-step procedure.
An important intermediate case is motivated by the exchangeability of the data
points within the same unit i, so the working covariance matrix is Ṽ(Xi , β) =
diag{σ̃(xit )}_{t=1}^{ni} Ri (ρ) diag{σ̃(xit )}_{t=1}^{ni} , where Ri (ρ) is the ni × ni correlation matrix
with 1's on the diagonal and ρ on all off-diagonal entries.
main parameter of interest. In practice, the independent working covariance suffices in many
applications despite its potential efficiency loss. This is similar to the use of OLS in the pres-
ence of heteroskedasticity in linear models. Section 25.4 focuses on the independent working
covariance, which is common in econometrics. Section 25.6 gives further justifications for
this simple strategy.
25.4.1 OLS
An important special case is the marginal linear model with an independent working co-
variance matrix and homoskedasticity, resulting in the following estimating equation:
Σ_{i=1}^n Σ_{t=1}^{ni} xit ( yit − xtit β ) = 0.
So the point estimator is just the pooled OLS using all data points:
β̂ = ( Σ_{i=1}^n Σ_{t=1}^{ni} xit xtit )^{−1} Σ_{i=1}^n Σ_{t=1}^{ni} xit yit

   = ( Σ_{i=1}^n Xit Xi )^{−1} Σ_{i=1}^n Xit Yi

   = (X^t X)^{−1} X^t Y.
The three forms of β̂ above are identical: the first one is based on N observations, the
second one is based on n independent units, and the last one is based on the matrix form
with the pooled data. Although the point estimate is identical to the case with independent
data points, we must adjust for the standard error according to Section 25.3.2. From
Di (β) = (xi1 , . . . , xini ) = Xit ,
we can verify that
cov̂(β̂) = ( Σ_{i=1}^n Xit Xi )^{−1} ( Σ_{i=1}^n Xit ε̂i ε̂ti Xi ) ( Σ_{i=1}^n Xit Xi )^{−1} ,
where ε̂i = Yi − Xi β̂ = (ε̂i1 , . . . , ε̂ini )t is the residual vector of unit i. This is called the
(Liang–Zeger) cluster-robust covariance matrix in econometrics. The square roots of the
diagonal terms are called the cluster-robust standard errors. The cluster-robust covariance
matrix is often much larger than the (Eicker–Huber–White) heteroskedasticity-robust co-
variance matrix assuming independence of observations (i, t):
cov̂_ehw (β̂) = ( Σ_{i=1}^n Σ_{t=1}^{ni} xit xtit )^{−1} ( Σ_{i=1}^n Σ_{t=1}^{ni} ε̂it^2 xit xtit ) ( Σ_{i=1}^n Σ_{t=1}^{ni} xit xtit )^{−1} .
Note that

X^t X = Σ_{i=1}^n Xit Xi = Σ_{i=1}^n Σ_{t=1}^{ni} xit xtit ,

so the two covariance estimators share the same "bread" but differ in the "meat" matrices
in general.
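A hedged sketch (not from the text) of both covariance estimators for pooled OLS, using sandwich::vcovCL for the cluster-robust version and vcovHC for the EHW version, on simulated clustered data with a cluster-level covariate:

library(sandwich)
library(lmtest)
set.seed(5)
n  <- 100                                    # number of clusters
ni <- 5                                      # observations per cluster
id <- rep(1:n, each = ni)
x  <- rep(rnorm(n), each = ni)               # cluster-level covariate
u  <- rep(rnorm(n), each = ni)               # shared cluster-level noise
y  <- 1 + 0.5 * x + u + rnorm(n * ni)
fit <- lm(y ~ x)
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))   # EHW, ignores clustering
coeftest(fit, vcov = vcovCL(fit, cluster = id))   # Liang-Zeger cluster-robust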
where π(xit , β) = e^{xtit β}/(1 + e^{xtit β}). So the point estimator is the pooled logistic regression
using all data points, but we must adjust for the standard error according to Section 25.3.2.
From

where ε̂i = (ε̂i1 , . . . , ε̂ini )^t with residual ε̂it = yit − e^{xtit β̂}/(1 + e^{xtit β̂}), and V̂i =
diag{ π(xit , β̂){1 − π(xit , β̂)} }_{t=1}^{ni} . So the cluster-robust covariance estimator for logistic
regression is

cov̂(β̂) = ( Σ_{i=1}^n Xit V̂i Xi )^{−1} ( Σ_{i=1}^n Xit ε̂i ε̂ti Xi ) ( Σ_{i=1}^n Xit V̂i Xi )^{−1} .
I leave the cluster-robust covariance estimator for Poisson regression to Problem 25.5.
25.5 Application
We will use the gee package for all the analyses below.
Including two covariates, we have the following results. The covariates are predictive of
the outcome, changing the significance level of the main effect of fa. The interaction terms
between pten and fa are not significant either.
> Pten.gee = gee(somasize ~ factor(fa)*pten + numctrl + numpten,
+               id = mouseid,
+               family = gaussian,
+               corstr = "independence",
+               data = Pten)
> summary(Pten.gee)$coef
                 Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)       81.9422      2.791 29.3602      4.0917   20.026
factor(fa)1        6.2267      2.237  2.7835      4.2429    1.468
factor(fa)2       14.8956      2.657  5.6053      4.1839    3.560
pten              12.3771      2.020  6.1272      2.2477    5.507
numctrl            0.8721      0.120  7.2672      0.3028    2.880
numpten           -0.4843      0.101 -4.7948      0.2381   -2.034
factor(fa)1:pten   7.7498      2.744  2.8240      5.1064    1.518
factor(fa)2:pten  -2.9629      3.166 -0.9359      3.3105   -0.895
>
>
> Pten.gee = gee(somasize ~ factor(fa)*pten + numctrl + numpten,
+               id = mouseid,
+               family = gaussian,
+               corstr = "exchangeable",
+               data = Pten)
> summary(Pten.gee)$coef
                 Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)       85.3316     5.2872 16.1393      5.4095  15.7745
factor(fa)1        5.4952     4.2761  1.2851      4.0207   1.3667
factor(fa)2       12.2174     4.1669  2.9320      4.2363   2.8840
pten              11.8044     1.9718  5.9865      2.1946   5.3789
numctrl            0.9326     0.2867  3.2527      0.3479   2.6810
numpten           -0.5678     0.2504 -2.2674      0.2772  -2.0482
factor(fa)1:pten   8.5137     2.6777  3.1795      5.0612   1.6821
factor(fa)2:pten  -1.7755     3.0995 -0.5728      2.7547  -0.6445
From the regressions above, we observe that (1) two choices of the covariance matrix
do not lead to fundamental differences; and (2) without using the cluster-robust standard
error, the results can be misleading.
Covariate adjustment improves efficiency and makes the choice of the working covariance
matrix less important.
Using all data, we find a significant effect of incentive_commit but an insignificant effect
of incentive.
> normal.gee = gee(f.reg, id = id,
+                  family = gaussian,
+                  corstr = "independence",
+                  data = gym1)
> normal.gee = summary(normal.gee)$coef
> normal.gee
                 Estimate Naive S.E.  Naive z Robust S.E. Robust z
(Intercept)      -0.69005   0.011136  -61.968     0.08672  -7.9572
incentive_commit  0.15666   0.008358   18.745     0.06376   2.4569
incentive         0.01022   0.008275    1.235     0.05910   0.1729
target            0.62666   0.007465   83.949     0.06773   9.2527
member_gym_pre    1.14919   0.007077  162.375     0.06252  18.3801
However, this pooled analysis can be misleading because we have seen from the analysis
before that the treatments have no effects in the pre-experimental periods and smaller effects
in the long term. A pooled analysis can dilute the short-term effects, missing the treatment
effect heterogeneity across time. This can be fixed by the following subgroup analysis based
on time.
> normal.gee1 = gee(f.reg, id = id,
+                   subset = (incentive_week < 0),
+                   family = gaussian,
+                   corstr = "independence",
+                   data = gym1)
> normal.gee1 = summary(normal.gee1)$coef
> normal.gee1
                  Estimate Naive S.E.  Naive z Robust S.E.  Robust z
(Intercept)      -0.879374    0.04230 -20.7868     0.08739 -10.06224
incentive_commit -0.004241    0.03175  -0.1336     0.06243  -0.06794
incentive        -0.073884    0.03144  -2.3502     0.06223  -1.18728
target            0.742675    0.02836  26.1887     0.06701  11.08301
Changing the family parameter to poisson(link = log), we can fit a marginal log-linear
model with independent Poisson covariance. Figure 25.1 shows the point estimates and
confidence intervals based on the regressions above. The confidence intervals based on the
cluster-robust standard errors are much wider than those based on the EHW standard
errors. Without dealing with clustering, the confidence intervals are too narrow and give
wrong inference.
FIGURE 25.1: Point estimates and confidence intervals based on the EHW and the
cluster-robust standard errors, from the Normal and Poisson marginal models
With more complex data generating processes, Assumption (25.4) does not hold in general:
[Diagram: a data generating process with time-varying covariates xi1 , xi2 and outcomes
yi1 , yi2 under which Assumption (25.4) fails.]
Liang and Zeger (1986) assumed fixed covariates, ruling out the dynamics of x. Pepe and
Anderson (1994) pointed out the importance of Assumption (25.4) in GEE with random
time-varying covariates. Pepe and Anderson (1994) also showed that with an independent
working covariance matrix, we can drop Assumption (25.4) as long as the marginal condi-
tional mean is correctly specified. That is, if E(yit | xit ) = µ(xtit β), then
E{ Σ_{i=1}^n Σ_{t=1}^{ni} (yit − µ(xtit β))/σ̃^2 (xit , β) · ∂µ(xtit β)/∂β }

  = Σ_{i=1}^n Σ_{t=1}^{ni} E{ E(yit − µ(xtit β) | xit )/σ̃^2 (xit , β) · ∂µ(xtit β)/∂β }

  = 0.
This gives another justification for the use of the independent working covariance matrix
even though it can result in efficiency loss when Assumption (25.4) holds.
[Diagram: a data generating process in which xi affects both yi1 and yi2 .]
has conditional expectations E(yit | xi ) = αt + βxi if
E(εit | xi ) = 0. (25.10)
However, with direct dependence of yi2 on yi1 , the data generating process
[Diagram: the same process with an additional direct arrow from yi1 to yi2 .]
has conditional expectations E(yi1 | xi ) = α1 + βxi but
α1 = α2 + γα1 , β = βγ + δ,
future observation xis , the marginal model gives predicted outcome µ(xtis β̂) with the associ-
ated standard error computed based on the delta method. We can see two obvious problems
with this prediction. First, it does not depend on s. Consequently, predicting s = 10 is the
same as predicting s = 100. However, the intuition is overwhelming that predicting the
long-run outcome is much more difficult than predicting the short-run outcome, so we hope
the standard error should be much larger for predicting the outcome at s = 100. Second,
the prediction does not depend on the lag outcomes because the marginal model ignores
the dynamics of the outcome. With longitudinal observations, building a model with the
lag outcomes may increase the prediction ability.
where ε̂i = Yi − µ(Xi , β̂) and V̂i = diag{ e^{xtit β̂} }_{t=1}^{ni} .
With IID data (yi )_{i=1}^n , we can compute the sample mean

ȳ = n^{−1} Σ_{i=1}^n yi = arg min_{µ∈R} n^{−1} Σ_{i=1}^n (yi − µ)^2 ,
However, the mean can miss important information about y. How about other features
of the outcome y? Quantiles can characterize the distribution of y. For a random variable
y, we can define its distribution function as F (c) = pr(y ≤ c) and its τ th quantile as
F −1 (τ ) = inf {q : F (q) ≥ τ } .
where

ρτ (u) = u{τ − 1(u < 0)} = uτ for u ≥ 0, and −u(1 − τ) for u < 0,
is the check function (the name comes from its shape; see Figure 26.2). In particular, the
median of y is
median(y) = F^{−1}(0.5) = arg min_{q∈R} E{ |y − q| }.
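A small numerical sketch (mine, not the book's code): the check-loss minimizer recovers the sample quantile, illustrated for τ = 0.25; it agrees with quantile() up to its interpolation convention.

set.seed(6)
y   <- rexp(1000)
tau <- 0.25
check.loss <- function(q) mean((y - q) * (tau - (y - q < 0)))   # average of rho_tau(y - q)
optimize(check.loss, interval = range(y))$minimum               # numerical minimizer
quantile(y, tau)                                                # sample quantile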
Proof of Proposition 26.1: To simplify the proof, we further assume that y has density
function f (·). We will use Leibniz’s integral rule:
d/dx ∫_{a(x)}^{b(x)} f(x, t) dt = f(x, b(x)) b′(x) − f(x, a(x)) a′(x) + ∫_{a(x)}^{b(x)} ∂f(x, t)/∂x dt.

We can write

E{ρτ (y − q)} = ∫_{−∞}^q (τ − 1)(c − q) f(c) dc + ∫_q^∞ τ (c − q) f(c) dc.
Differentiating with respect to q using Leibniz's rule and setting the derivative to zero, we obtain

(1 − τ) pr(y ≤ q) − τ {1 − pr(y ≤ q)} = 0,

which implies that

τ = pr(y ≤ q),
so the τ th quantile satisfies the first-order condition. The second-order condition ensures it
is the minimizer:
∂^2 E{ρτ (y − q)}/∂q^2 |_{q=F^{−1}(τ)} = f{F^{−1}(τ)} > 0,
which may not be unique even though the population quantile is. We can view F̂ −1 (τ ) as
a set containing all minimizers, and with large samples the values in the set do not differ
much. Similar to the sample mean, the sample quantile also satisfies a CLT.
Theorem 26.1 Assume (yi )_{i=1}^n ∼iid y with distribution function F(·) that is strictly increas-
ing and density function f(·) that is positive at the τth quantile. The sample quantile is
consistent for the true quantile and is asymptotically Normal:

√n { F̂^{−1}(τ) − F^{−1}(τ) } → N( 0, τ(1 − τ)/[f{F^{−1}(τ)}]^2 )

in distribution.
Proof of Theorem 26.1: Based on the first order condition in Proposition 26.1, the
population quantile solves
E{mτ (y − q)} = 0,
and the sample quantile solves
n^{−1} Σ_{i=1}^n mτ (yi − q) = 0,
where the check function has a partial derivative with respect to u except for the point 0:

mτ (u) = τ − 1(u ≤ 0).
By Theorem D.1, we only need to find the bread and meat matrices, which are scalars now:
B = ∂E{mτ (y − q)}/∂q |_{q=F^{−1}(τ)}
  = ∂E{τ − 1(y ≤ q)}/∂q |_{q=F^{−1}(τ)}
  = −∂F(q)/∂q |_{q=F^{−1}(τ)}
  = −f{F^{−1}(τ)},
and

M = E[ {mτ (y − q)}^2 ] |_{q=F^{−1}(τ)}
  = E[ {τ − 1(y ≤ q)}^2 ] |_{q=F^{−1}(τ)}
  = E{ τ^2 + 1(y ≤ q) − 2 · 1(y ≤ q) τ } |_{q=F^{−1}(τ)}
  = τ^2 + τ − 2τ^2
  = τ(1 − τ).
Therefore, √n{ F̂^{−1}(τ) − F^{−1}(τ) } converges to Normal with mean zero and variance
M/B^2 = τ(1 − τ)/[f{F^{−1}(τ)}]^2 .

□
To conduct statistical inference for the quantile F −1 (τ ), we need to estimate the density
of y at the τ th quantile to obtain the estimated standard error of F̂ −1 (τ ). Alternatively, we
can use the bootstrap to obtain the estimated standard error. We will discuss the inference
of quantiles in R in Section 26.4.
We can use a linear function xt β to approximate the conditional mean with the population
OLS coefficient
β = arg min_{b∈R^p} E{ (y − x^t b)^2 } = {E(xx^t )}^{−1} E(xy),
We can use a linear function xt β(τ ) to approximate the conditional quantile function with
called the τ th sample regression quantile. As a special case, when τ = 0.5, we have the
regression median:
β̂(0.5) = arg min_{b∈R^p} n^{−1} Σ_{i=1}^n |yi − xti b|,
Example 26.1 Under the linear model yi = xti β + σvi , we can verify that
E(yi | xi ) = xti β
and
F −1 (τ | xi ) = xti β + σg −1 (τ ).
Therefore, with the first regressor being 1, we have
β1 (τ ) = β1 + σg −1 (τ ), βj (τ ) = βj , (j = 2, . . . , p).
In this case, both the true conditional mean and quantile functions are linear, and the
population regression quantiles are constant across τ except for the intercept.
Example 26.2 Under a heteroskedastic linear model yi = xti β + (xti γ)vi with xti γ > 0 for
all xi ’s, we can verify that
E(yi | xi ) = xti β
and
F −1 (τ | xi ) = xti β + xti γg −1 (τ ).
Therefore,
β(τ ) = β + γg −1 (τ ).
In this case, both the true conditional quantile functions are linear, and all coordinates of
the population regression quantiles vary with τ .
Example 26.3 Under the transformed linear model log yi = xti β + σvi , we can verify that
F^{−1}(τ | xi ) = exp{ xti β + σ g^{−1}(τ) }.
In this case, both the true conditional mean and quantile functions are log-linear in covari-
ates.
which is simply a linear function of the ui ’s and vi ’s. Of course, these ui ’s and vi ’s are not
arbitrary because they must satisfy the constraints by the data. Using the notation
Y = (y1 , . . . , yn )^t ,    X = (x1 , . . . , xn )^t ,    u = (u1 , . . . , un )^t ,    v = (v1 , . . . , vn )^t ,
finding the τ th regression quantile is equivalent to a linear programming problem with linear
objective function and linear constraints:
The function rq in the R package quantreg computes the regression quantiles with various
choices of methods.
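A brief sketch (not the book's code) of computing several regression quantiles with quantreg::rq on simulated heteroskedastic data, in the spirit of Example 26.2:

library(quantreg)
set.seed(7)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + (1 + x) * rnorm(n)      # heteroskedastic errors
fit <- rq(y ~ x, tau = c(0.25, 0.5, 0.75))
coef(fit)                                # slopes vary across tau under heteroskedasticity
summary(fit, se = "boot")                # bootstrap standard errors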
in distribution, where
B = E[ fy|x {x^t β(τ)} xx^t ],    M = E[ {τ − 1(y − x^t β(τ) ≤ 0)}^2 xx^t ],
E {mτ (y − xt b)x} = 0,
By Theorem D.1, we only need to calculate the explicit forms of B and M . Let Fy|x (·) and
fy|x (·) be the conditional distribution and density functions. We have
so
∂E{mτ (y − x^t b)x}/∂b^t = −E{ fy|x (x^t b) xx^t }.
□
Based on Theorem 26.2, we can estimate the asymptotic covariance matrix of β̂(τ ) by
n−1 B̂ −1 M̂ B̂ −1 , where
M̂ = n^{−1} Σ_{i=1}^n { τ − 1( yi − xti β̂(τ) ≤ 0 ) }^2 xi xti

and

B̂ = (2nh)^{−1} Σ_{i=1}^n 1{ |yi − xti β̂(τ)| ≤ h } xi xti

for a carefully chosen h. Powell (1991)'s theory suggests using h of order n^{−1/3}, but the
theory is not so helpful since it only gives the order of h. The quantreg package in R chooses
a specific h that satisfies this condition. In finite samples, the bootstrap often gives a better
estimate of the asymptotic covariance matrix.
FIGURE 26.3: Estimated standard errors (boot, iid, ker) of the sample quantiles and the true
asymptotic standard errors, for Exponential(1) and Normal(0,1) outcomes, across quantiles
0.1 to 0.9
library(quantreg)
mc = 2000
n = 200
taus = (1:9)/10
get.se = function(x){ x$coef[1, 2] }
q.normal = replicate(mc, {
  y = rnorm(n)
  qy = rq(y ~ 1, tau = taus)
  se.iid = summary(qy, se = "iid")
  se.ker = summary(qy, se = "ker")
  se.boot = summary(qy, se = "boot")
  qy = qy$coef
  se.iid = sapply(se.iid, get.se)
  se.ker = sapply(se.ker, get.se)
  se.boot = sapply(se.boot, get.se)
In the above, se = "iid", se = "ker", and se = "boot" correspond to the standard errors
by Koenker and Bassett Jr (1978), Powell (1991), and the bootstrap. I also run the same
simulation but replace the Normal outcome with Exponential: y = rexp(n). Figure 26.3 com-
pares the estimated standard errors with the true asymptotic standard error in Theorem
26.1. Bootstrap works the best, and the one involving kernel estimation of the density seems
biased.
The second data generating process replaces the error term with a Laplace distribution1 :
simu.laplace = replicate(mc, {
  y = 1 + x + rexp(n) - rexp(n)
  c(lm(y ~ x)$coef[2], rq(y ~ x)$coef[2])
})
OLS is the MLE under a Normal linear model, and LAD is the MLE under a linear
model with independent Laplace errors.
The third data-generating process replaces the error term with standard Exponential:
simu.exp = replicate(mc, {
  y = 1 + x + rexp(n)
  c(lm(y ~ x)$coef[2], rq(y ~ x)$coef[2])
})
which is a linear quantile model. The coefficients are different in the conditional mean and
quantile functions.
x = abs(x)
simu.x = replicate(mc, {
  y = 1 + rexp(n)*x
  c(lm(y ~ x)$coef[2], rq(y ~ x)$coef[2])
})
Figure 26.4 compares OLS and LAD under the above four data-generating processes.
With Normal errors, OLS is more efficient; with Laplace errors, LAD is more efficient.
This confirms the theory of MLE. With Exponential errors, LAD is also more efficient than
OLS. Under the fourth data-generating process, LAD is more efficient than OLS. In general,
however, OLS and LAD target the conditional mean and conditional median, respectively.
Since the parameters differ in general, the comparison of the standard errors is not very
meaningful. Both OLS and LAD give useful information about the data.
26.5 Application
26.5.1 Parents’ and children’s heights
I revisit Galton’s data introduced in Chapter 2. The following code gives the coefficients for
quantiles 0.1 to 0.9.
> library("HistData")
> taus = (1:9)/10
> qr.galton = rq(childHeight ~ midparentHeight,
+                tau = taus,
+                data = GaltonFamilies)
> coef.galton = qr.galton$coef
1 Note that the difference between two independent Exponentials has the same distribution as Laplace.
FIGURE 26.4: Sampling distributions of the LAD and OLS slope estimates β̂ under four
data-generating processes, with the corresponding standard errors (LAD vs OLS): Normal
0.09 vs 0.07, Laplace 0.08 vs 0.1, Exponential 0.07 vs 0.07, Exponential × x 0.11 vs 0.19
Figure 26.5 shows the quantile regression lines, which are almost parallel with different
intercepts. In Galton’s data, x and y are very close to a bivariate Normal distribution.
Theoretically, we can verify that with bivariate Normal (x, y), the conditional quantile
function F −1 (τ | x) is linear in x with the same slope. See Problem 26.2.
FIGURE 26.5: Quantile regression lines of childHeight on midparentHeight for τ =
0.1, . . . , 0.9 in Galton's data
Figure 26.6 shows the coefficient of educ across years and across quantiles. In 1980, the
coefficients are nearly constant across quantiles, showing no evidence of heterogeneity in the
return of education. Compared with 1980, the return of education in 1990 increases across
all quantiles, but it increases more at the upper quantiles. Compared with 1990, the return
of education in 2000 decreases at the lower quantiles and increases at the upper quantiles,
showing more dramatic heterogeneity across quantiles.
The original data used by Angrist et al. (2006) contain weights due to sampling. Ideally,
we should use the weights in the quantile regression. Like lm, the rq function also allows for
specifying weights.
The R code in this section is in code24.5.R.
FIGURE 26.6: Coefficient of educ across quantiles in years 1980, 1990, and 2000
26.6 Extensions
With clustered data, we must use the cluster-robust standard error which can be approxi-
mated by the clustered bootstrap with the rq function. I use Hagemann (2017)’s example
below where the students are clustered in classrooms. See code24.6.R.
> star = read.csv("star.csv")
> star.rq = rq(pscore ~ small + regaide + black +
+                girl + poor + tblack + texp +
+                tmasters + factor(fe),
+              data = star)
> res = summary(star.rq, se = "boot")$coef[2:9, ]
> res.clus = summary(star.rq, se = "boot",
+                    cluster = star$classid)$coef[2:9, ]
> round(res, 3)
           Value Std. Error t value Pr(>|t|)
small      6.500      1.122   5.795    0.000
regaide    0.294      1.071   0.274    0.784
black    -10.334      1.657  -6.237    0.000
girl       5.073      0.878   5.777    0.000
poor     -14.344      1.024 -14.011    0.000
tblack    -0.197      1.751  -0.113    0.910
texp       0.413      0.098   4.231    0.000
tmasters  -0.530      1.068  -0.497    0.619
> round(res.clus, 3)
           Value Std. Error t value Pr(>|t|)
small      6.500      1.662   3.912    0.000
regaide    0.294      1.627   0.181    0.857
black    -10.334      1.849  -5.588    0.000
girl       5.073      0.819   6.195    0.000
poor     -14.344      1.152 -12.455    0.000
tblack    -0.197      3.113  -0.063    0.949
texp       0.413      0.168   2.465    0.014
With high dimensional covariates, we can use regularized quantile regression. For in-
stance, the rq function can implement the lasso version with method = "lasso" and a prespec-
ified lambda. It does not implement the ridge version.
26.5 Joint asymptotic distribution of the sample median and the mean
Assume that y1 , . . . , yn ∼ y are IID. Find the joint asymptotic distribution of the sample
mean ȳ and median m̂.
Hint: The mean µ and median m satisfy the estimating equation with
w(y, µ, m) = ( y − µ, 0.5 − 1(y − m ≤ 0) )^t .
and re-analyze Angrist et al. (2006)’s data with weights. Note that similar to lm and glm,
the quantile regression function rq also has a parameter weights.
27
Modeling Time-to-Event Outcomes
27.1 Examples
Time-to-event data are common in biomedical and social sciences. Statistical analysis of
time-to-event data is called survival analysis in biostatistics and duration analysis in econo-
metrics. The former name comes from biomedical applications where the outcome denotes
the survival time or the time to the recurrence of the disease of interest. The latter name
comes from the economic applications where the outcome denotes the weeks unemployed
or days until the next arrest after being released from incarceration. See Kalbfleisch and
Prentice (2011) for biomedical applications and Heckman and Singer (1984) for economic
applications. Freedman (2008) gave a concise and critical introduction to survival analysis.
NALTREXONE and THERAPY are two treatment indicators. futime is the follow-up time, which
is censored if relapse equals 0. For those censored observations, futime equals 112, so it
is administrative censoring. Figure 27.1 shows the histograms of futime in four treatment
groups. A large number of patients have censored outcomes. Other variables are covariates.
FIGURE 27.1: Histograms of the time to event in the data from Lin et al. (2016)
age, etc. I use the version of the data analyzed by Keele (2010). The outcome acttime is
censored, as indicated by censor. The original paper contains more detailed explanations of
the variables.
> fda <- read.dta("fda.dta")
> names(fda)
 [1] "acttime"   "censor"    "hcomm"     "hfloor"    "scomm"
 [6] "sfloor"    "prespart"  "demhsmaj"  "demsnmaj"  "orderent"
[11] "stafcder"  "prevgenx"  "lethal"    "deathrt1"  "hosp01"
[16] "hospdisc"  "hhosleng"  "acutediz"  "orphdum"   "mandiz01"
[21] "femdiz01"  "peddiz01"  "natreg"    "natregsq"  "wpnoavg3"
[26] "vandavg3"  "condavg3"  "_st"       "_d"        "_t"
[31] "_t0"       "caseid"
An obvious feature of time-to-event data is that the outcome is non-negative. This can
be easily dealt with by the log transformation. However, the outcomes may be censored,
resulting in inadequate tail information. With right censoring, modeling the mean involves
extrapolation in the right tail.
pr(t ≤ T < t + ∆t | T ≥ t) ≈ λ(t)∆t,
so the hazard function denotes the death rate within a small interval conditioning on sur-
viving up to time t. Both the survival and hazard functions are commonly used to describe
a positive random variable. First, the survival function has a simple relationship with the
expectation.
Proposition 27.1 For a non-negative random variable T ,
E(T) = ∫_0^∞ S(t) dt.
Proposition 27.1 holds for both continuous and discrete non-negative random variables.
It states that the expectation of a nonnegative random variable equals the area under the
survival function. It does not require the existence of the density function of T .
Proof of Proposition 27.1: Fubini’s theorem allows us to swap the expectation and
integral below:
E(T) = E{ ∫_0^T dt }
     = E{ ∫_0^∞ 1(T > t) dt }
     = ∫_0^∞ E{1(T > t)} dt
     = ∫_0^∞ pr(T > t) dt
     = ∫_0^∞ S(t) dt.

□
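A quick numerical check (a sketch of mine) of Proposition 27.1 for an Exponential random variable with rate 0.5: the area under the survival function equals the mean 1/0.5 = 2.

rate <- 0.5
integrate(function(t) pexp(t, rate = rate, lower.tail = FALSE), 0, Inf)$value
1 / rate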
Second, the survival and hazard functions can determine each other in the following way.
Proposition 27.2 For a non-negative continuous random variable T ,
λ(t) = f(t)/S(t) = −(d/dt) log S(t),    S(t) = exp{ −∫_0^t λ(s) ds }.
FIGURE 27.2: Hazard functions of the Gamma distribution with parameters (0.5, 0.5),
(2, 0.5), (0.5, 2), and (2, 2) (left panel) and of the log-Normal distribution lnorm(µ, σ)
(right panel)
which implies

log S(t) − log S(0) = −∫_0^t λ(s) ds.

Because log S(0) = 0, we have log S(t) = −∫_0^t λ(s) ds, giving the final result. □
Example 27.1 (Exponential) The Exponential(λ) random variable T has density f(t) =
λ e^{−λt}, survival function S(t) = e^{−λt}, and constant hazard function λ(t) = λ. An impor-
tant feature of the Exponential random variable is its memoryless property as shown in
Proposition B.6.

Example 27.2 (Gamma) The Gamma(α, β) random variable T has density f(t) =
β^α t^{α−1} e^{−βt}/Γ(α). When α = 1, it reduces to Exponential(β) with a constant hazard func-
tion. In general, the survival function and hazard function do not have simple forms, but
we can use dgamma and pgamma to compute them numerically. The left panel of Figure 27.2
plots the hazard functions of Gamma(α, β). When α < 1, the hazard function is decreasing;
when α > 1, the hazard function is increasing.
Example 27.4 (Weibull) The Weibull distribution has many different parametrizations.
Here I follow the R function dweibull, which has a shape parameter a > 0 and scale parameter
b > 0. The Weibull(a, b) random variable T can be generated by
T = b Z^{1/a}    (27.1)

which is equivalent to

log T = log b + a^{−1} log Z,
FIGURE 27.3: Discrete survival function with masses (0.1, 0.05, 0.15, 0.2, 0.3, 0.2) at
(1, 2, 3, 4, 5, 6)
survival function

S(t) = exp{ −(t/b)^a },

and hazard function

λ(t) = (a/b) (t/b)^{a−1}.
So when a = 1, Weibull reduces to Exponential with constant hazard function. When a > 1,
the hazard function increases; when a < 1, the hazard function decreases.
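A short numerical sketch (mine): the Weibull hazard computed as dweibull/(1 − pweibull) matches the closed form (a/b)(t/b)^{a−1} above.

a <- 2; b <- 1.5
t <- seq(0.1, 3, by = 0.5)
hazard.numeric <- dweibull(t, shape = a, scale = b) /
                  pweibull(t, shape = a, scale = b, lower.tail = FALSE)
hazard.formula <- (a / b) * (t / b)^(a - 1)
cbind(t, hazard.numeric, hazard.formula)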
λk = pr(T = tk | T ≥ tk ) = f(tk )/S(tk −),
where S(tk −) denotes the left limit of the function S(t) at tk . Figure 27.3 shows an example
of a survival function for a discrete random variable, which shows that S(t) is a step function
and right-continuous with left limits.
The discrete hazard and survival functions have the following connection which will be
useful for the next section.
Proposition 27.3 For a positive discrete random variable T , its survival function is a step function determined by
$$S(t) = \text{pr}(T > t) = \prod_{k:\, t_k \le t}(1 - \lambda_k).$$
Note that S(t) is a step function, decreasing at each t_k, because each λ_k is a probability and thus bounded between zero and one.
Proof of Proposition 27.3: By definition, if t_2 ≤ t < t_3, then S(t) = pr(T > t) = pr(T > t_1) pr(T > t_2 | T > t_1) = (1 − λ_1)(1 − λ_2); the general case follows from the same argument. □
(S3) c_1, ..., c_K are the numbers of censored patients within the intervals [t_1, t_2), ..., [t_K, ∞).
Kaplan and Meier (1958) proposed the following simple estimator for the survival func-
tion.
Definition 27.1 (Kaplan–Meier curve) First estimate the discrete hazard function at the failure times {t_1, ..., t_K} as λ̂_k = d_k/r_k (k = 1, ..., K) and then estimate the survival function as
$$\hat S(t) = \prod_{k:\, t_k\le t}(1 - \hat\lambda_k).$$
The Ŝ(t) in Definition 27.1 is also called the product-limit estimator of the survival
function due to its mathematical form.
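As a hedged sketch (not from the text), the product-limit formula can be computed directly from the failure times and risk sets and compared with survfit; the toy vectors time and status below are hypothetical.

library(survival)
time   <- c(2, 3, 3, 5, 7, 8, 8, 9)          # observed times
status <- c(1, 1, 0, 1, 1, 0, 1, 0)          # 1 = failure, 0 = censored
tk <- sort(unique(time[status == 1]))        # distinct failure times t_k
dk <- sapply(tk, function(t) sum(time == t & status == 1))  # failures d_k
rk <- sapply(tk, function(t) sum(time >= t))                # at risk r_k
S.hat <- cumprod(1 - dk / rk)                # product-limit estimator
cbind(tk, dk, rk, S.hat)
summary(survfit(Surv(time, status) ~ 1), times = tk)$surv   # matches S.hat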
At each failure time t_k, we view d_k as the result of r_k Bernoulli trials with probability λ_k. So λ̂_k = d_k/r_k has variance λ_k(1 − λ_k)/r_k, which can be estimated by λ̂_k(1 − λ̂_k)/r_k. A Taylor expansion of
$$\log\hat S(t) = \sum_{k:\, t_k\le t}\log(1 - \hat\lambda_k) \approx \sum_{k:\, t_k\le t}\log(1 - \lambda_k) - \sum_{k:\, t_k\le t}(1 - \lambda_k)^{-1}(\hat\lambda_k - \lambda_k)$$
suggests estimating the variance of log Ŝ(t) by
$$\widehat{\mathrm{var}}\{\log\hat S(t)\} = \sum_{k:\, t_k\le t}(1 - \hat\lambda_k)^{-2}\,\widehat{\mathrm{var}}(\hat\lambda_k) = \sum_{k:\, t_k\le t}(1 - \hat\lambda_k)^{-2}\hat\lambda_k(1 - \hat\lambda_k)/r_k = \sum_{k:\, t_k\le t}\frac{d_k}{r_k(r_k - d_k)},$$
which is called Greenwood's formula (Greenwood, 1926). A hidden assumption above is the independence of the λ̂_k's, which cannot be justified because the events are dependent. However, a deeper theory of counting processes shows that Greenwood's formula is valid even without independence (Fleming and Harrington, 2011).
Based on Greenwood’s formula, we can construct a confidence interval for log S(t):
$$\log\hat S(t) \pm z_\alpha\sqrt{\widehat{\mathrm{var}}\{\log\hat S(t)\}},$$
which implies a confidence interval for S(t). However, the resulting interval can fall outside the range [0, 1] because log S(t) lies in (−∞, 0) while its Normal approximation ranges over (−∞, ∞). A better transformation is the log-log:
$$v(t) = \log\{-\log S(t)\}, \qquad \hat v(t) = \log\{-\log\hat S(t)\}.$$
The delta method suggests estimating the variance of v̂(t) by
$$\widehat{\mathrm{var}}\{\hat v(t)\} = \frac{\widehat{\mathrm{var}}\{\log\hat S(t)\}}{\{\log\hat S(t)\}^2}.$$
Based on this formula and Greenwood's formula above, we can construct a confidence interval for v(t):
$$\log\{-\log\hat S(t)\} \pm z_\alpha\sqrt{\widehat{\mathrm{var}}\{\log\hat S(t)\}}\Big/\log\hat S(t),$$
which implies another confidence interval for S(t). In the R package survival, the func-
tion survfit can fit the Kaplan–Meier curve, where the specifications conf.type = "log" and
conf.type = "log-log" return confidence intervals based on the log and log-log transforma-
tions, respectively.
Figure 27.5 plots four curves based on the combination of NALTREXONE and THERAPY using
the data of Lin et al. (2016). I do not show the confidence intervals due to the large overlap.
> km4groups = survfit(Surv(futime, relapse) ~ NALTREXONE + THERAPY,
+                     data = COMBINE)
> plot(km4groups, bty = "n", col = 1:4,
+      xlab = "t", ylab = "survival functions")
> legend("topright",
+        c("NALTREXONE=0, THERAPY=0",
+          "NALTREXONE=0, THERAPY=1",
+          "NALTREXONE=1, THERAPY=0",
+          "NALTREXONE=1, THERAPY=1"),
+        col = 1:4, lty = 1, bty = "n")
[Figure 27.5 appears here: Kaplan–Meier survival functions over time t from 0 to 100 for the four groups NALTREXONE=0, THERAPY=0; NALTREXONE=0, THERAPY=1; NALTREXONE=1, THERAPY=0; and NALTREXONE=1, THERAPY=1.]
The above discussion of the Kaplan–Meier curve is rather heuristic. More fundamentally, what censoring mechanism ensures that the distribution of the survival time can be recovered from the observed data? It turns out that we have implicitly assumed that the survival time and the censoring time are independent.
Homework problem 27.1 gives a theoretical statement.
For unit i with survival time T_i, censoring time C_i, and covariates x_i, the observed data are
$$y_i = \min(T_i, C_i), \qquad \delta_i = 1(T_i \le C_i),$$
the event time and the censoring indicator, respectively. A key assumption is that the censoring mechanism is noninformative:
Assumption 27.1 (noninformative censoring) T_i ⊥⊥ C_i | x_i.
We can start with parametric models:
$$\log T_i = x_i^t\beta + \varepsilon_i,$$
where the ε_i's are IID N(0, σ²), independent of the x_i's. This is a Normal linear model on log T_i.
Example 27.6 Assume that $T_i \mid x_i \sim \text{Weibull}(a,\ b = e^{x_i^t\beta})$. Based on the definition of the Weibull distribution in Example 27.4, we have
$$\log T_i = x_i^t\beta + \varepsilon_i,$$
where the ε_i's are IID a^{-1} log Exponential(1), independent of the x_i's.
The R package survival contains the function survreg to fit parametric survival models
including the choices of dist = "lognormal", dist = "weibull", etc. However, these parametric
models are not commonly used in practice. The parametric forms can be too strong, and
due to right censoring, the inference can be driven by extrapolation to the right tail.
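For concreteness, here is a hedged sketch of such a parametric fit on the COMBINE data used in this chapter; the formula mirrors the Cox fits below and is illustrative only.

library(survival)
aft.fit <- survreg(Surv(futime, relapse) ~ NALTREXONE * THERAPY +
                     AGE + GENDER + T0_PDA + site,
                   data = COMBINE, dist = "weibull")
summary(aft.fit)   # dist = "lognormal" gives the Normal linear model on log T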
Assumption 27.2 (Cox proportional hazards model) Assume the conditional hazard function has the form
$$\lambda(t\mid x) = \lambda_0(t)\exp(\beta^t x), \qquad (27.2)$$
where λ_0(·) is an unknown baseline hazard function and β is an unknown parameter. Unlike other regression models, x does not contain the intercept in (27.2). If the first component of x is 1, then we can write
$$\lambda(t\mid x) = \lambda_0(t)e^{\beta_1}\exp(\beta_2x_2 + \cdots + \beta_px_p)$$
and redefine λ_0(t)e^{β_1} as another unknown function. With an intercept, we cannot identify λ_0(t) and β_1 separately. So we drop the intercept to ensure identifiability.
From the log-linear form of the conditional hazard function, we have
$$\log\frac{\lambda(t\mid x')}{\lambda(t\mid x)} = (x' - x)^t\beta,$$
so each coordinate of β measures the log conditional hazard ratio holding the other covariates constant. Because of this, (27.2) is called the proportional hazards model. A positive β_j suggests a "positive" effect on the hazard function and thus a "negative" effect on the survival time itself. Consider the special case of a binary covariate x: the proportional hazards assumption implies that λ(t | 1) = γλ(t | 0) with γ = exp(β), and therefore the survival functions satisfy
$$S(t\mid 1) = \exp\left\{-\int_0^t\lambda(u\mid 1)\,\mathrm{d}u\right\} = \exp\left\{-\gamma\int_0^t\lambda(u\mid 0)\,\mathrm{d}u\right\} = \{S(t\mid 0)\}^\gamma,$$
FIGURE 27.6: Proportional hazards assumption with different baseline survival functions, where the power equals γ = exp(β).
The partial likelihood is
$$L(\beta) = \prod_{k=1}^K\frac{\exp(x_k^t\beta)}{\sum_{l\in R(t_k)}\exp(x_l^t\beta)},$$
where the product is over the K time points with failures, x_k is the covariate value of the failure at time t_k, and R(t_k) contains the indices of the units at risk at time t_k, i.e., the units not censored or failed right before time t_k.
Freedman (2008) gives a heuristic explanation of the partial likelihood based on the following result, which extends Proposition B.7 on the Exponential distribution.
Therefore, we have
$$\frac{\exp(x_k^t\beta)}{\sum_{l\in R(t_k)}\exp(x_l^t\beta)}$$
from Theorem 27.1. The product in the partial likelihood is based on the independence of
the events at the K failure times, which is more difficult to justify. A rigorous justifica-
tion relies on the deeper theory of counting processes (Fleming and Harrington, 2011) or
semiparametric statistics (Tsiatis, 2007).
The log-likelihood function is
$$\log L(\beta) = \sum_{k=1}^K\left\{x_k^t\beta - \log\sum_{l\in R(t_k)}\exp(x_l^t\beta)\right\},$$
Define
$$\pi_\beta(l\mid R_k) = \exp(x_l^t\beta)\Big/\sum_{l'\in R(t_k)}\exp(x_{l'}^t\beta), \qquad l\in R(t_k).$$
The Hessian of the log-likelihood satisfies
$$\frac{\partial^2\log L(\beta)}{\partial\beta\,\partial\beta^t} = -\sum_{k=1}^K\mathrm{cov}_\beta(x\mid R_k)\preceq 0,$$
where
$$\mathrm{cov}_\beta(x\mid R_k) = \frac{\sum_{l\in R(t_k)}\exp(x_l^t\beta)x_lx_l^t}{\sum_{l\in R(t_k)}\exp(x_l^t\beta)} - \frac{\left\{\sum_{l\in R(t_k)}\exp(x_l^t\beta)x_l\right\}\left\{\sum_{l\in R(t_k)}\exp(x_l^t\beta)x_l^t\right\}}{\left\{\sum_{l\in R(t_k)}\exp(x_l^t\beta)\right\}^2}
= \sum_{l\in R(t_k)}\pi_\beta(l\mid R_k)x_lx_l^t - \sum_{l\in R(t_k)}\pi_\beta(l\mid R_k)x_l\sum_{l\in R(t_k)}\pi_\beta(l\mid R_k)x_l^t.$$
The coxph function in the R package survival uses Newton’s method to compute the
maximizer β̂ of the partial likelihood function, and uses the inverse of the observed Fisher
information to approximate its asymptotic variance. Lin and Wei (1989) proposed a sand-
wich covariance estimator to allow for the misspecification of the Cox model. The coxph
function with robust = TRUE reports the corresponding robust standard errors.
27.4.3 Examples
Using Lin et al. (2016)’s data, we have the following results.
> cox.fit <- coxph(Surv(futime, relapse) ~ NALTREXONE * THERAPY +
+                    AGE + GENDER + T0_PDA + site,
+                  data = COMBINE)
> summary(cox.fit)
Call:
coxph(formula = Surv(futime, relapse) ~ NALTREXONE * THERAPY +
    AGE + GENDER + T0_PDA + site, data = COMBINE)

  n = 1226, number of events = 856
NALTREXONE has a significant negative log hazard ratio, but THERAPY has a nonsignificant
negative log hazard ratio. More interestingly, their interaction NALTREXONE:THERAPY has a sig-
nificant positive log hazard ratio. This suggests that combining NALTREXONE and THERAPY is
worse than using NALTREXONE alone to delay the first time of heavy drinking and other end-
points. This is also coherent with the survival curves in Figure 27.5, in which the best
Kaplan–Meier curve corresponds to NALTREXONE=1, THERAPY=0.
Using Keele (2010)’s data, we have the following results:
> cox.fit <- coxph(Surv(acttime, censor) ~
+                    hcomm + hfloor + scomm + sfloor +
+                    prespart + demhsmaj + demsnmaj +
+                    prevgenx + lethal +
+                    deathrt1 + acutediz + hosp01 +
+                    hospdisc + hhosleng +
+                    mandiz01 + femdiz01 + peddiz01 + orphdum +
+                    natreg + I(natreg^2) + vandavg3 + wpnoavg3 +
+                    condavg3 + orderent + stafcder,
+                  data = fda)
> summary(cox.fit)
Call:
coxph(formula = Surv(acttime, censor) ~ hcomm + hfloor + scomm +
    sfloor + prespart + demhsmaj + demsnmaj + prevgenx + lethal +
    deathrt1 + acutediz + hosp01 + hospdisc + hhosleng + mandiz01 +
    femdiz01 + peddiz01 + orphdum + natreg + I(natreg^2) + vandavg3 +
    wpnoavg3 + condavg3 + orderent + stafcder, data = fda)

  n = 408, number of events = 262
because
$$E_{\beta=0}(x\mid R_k) = \frac{\sum_{l\in R(t_k)}x_l}{\sum_{l\in R(t_k)}1} = \frac{r_{k1}}{r_k},$$
the ratio of the number of treated units at risk, r_{k1}, to the number of units at risk, r_k, at time t_k. The Fisher information at the null is
$$-\frac{\partial^2\log L(0)}{\partial\beta\,\partial\beta^t} = \sum_{k=1}^K\mathrm{cov}_{\beta=0}(x\mid R_k) = \sum_{k=1}^K\frac{r_{k1}}{r_k}\left(1 - \frac{r_{k1}}{r_k}\right).$$
So we reject the null at level α if |LR| is larger than the 1 − α/2 quantile of the standard Normal distribution. This is almost identical to the log-rank test without ties. Allowing for ties, Mantel (1966) proposed a more general form of the log-rank test.¹
The survdiff function in the survival package implements various tests including the log-
rank test as a special case. Below, I use the gehan dataset in the MASS package to illustrate
the log-rank test. The data were from a matched-pair experiment of 42 leukaemia patients
(Gehan, 1965). Treated units received the drug 6-mercaptopurine, and the rest are controls.
For illustration purposes, I ignore the pair indicators.
> library(MASS)
> head(gehan)
  pair time cens   treat
1    1    1    1 control
2    1   10    1    6-MP
3    2   22    1 control
4    2    7    1    6-MP
5    3    3    1 control
6    3   32    0    6-MP
> survdiff(Surv(time, cens) ~ treat,
+          data = gehan)
Call:
survdiff(formula = Surv(time, cens) ~ treat, data = gehan)

               N Observed Expected (O-E)^2/E (O-E)^2/V
treat=6-MP    21        9     19.3      5.46      16.8
treat=control 21       21     10.7      9.77      16.8
The treatment was quite effective, yielding an extremely small p-value even with a moderate sample size. This is also clear from the Kaplan–Meier curves in Figure 27.7 and from the results of fitting the Cox proportional hazards model.
1 Peto and Peto (1972) popularized the name log-rank test.
FIGURE 27.7: Kaplan–Meier curves with 95% confidence intervals based on Gehan (1965)'s data, for the 6-MP and control groups.
  n = 42, number of events = 30
Concordance = 0.69  (se = 0.041)
Likelihood ratio test = 16.35  on 1 df,   p = 5e-05
Wald test            = 14.53  on 1 df,   p = 1e-04
Score (logrank) test = 17.25  on 1 df,   p = 3e-05
The log-rank test is a standard tool in survival analysis. However, what it delivers is just
a special case of the Cox proportional hazards model. The p-value from the log-rank test is
close to the p-value from the score test of the Cox proportional hazards model with only a
binary treatment indicator. The latter can also adjust for other pretreatment covariates.
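A quick check of this claim on the gehan data (a sketch, assuming MASS and survival are available as in the code above):

library(MASS)
library(survival)
survdiff(Surv(time, cens) ~ treat, data = gehan)               # log-rank test
summary(coxph(Surv(time, cens) ~ treat, data = gehan))$sctest  # score test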
27.5 Extensions
27.5.1 Stratified Cox model
Many randomized trials are stratified. The Combined Pharmacotherapies and Behavioral
Interventions study reviewed at the beginning of this chapter is an example with site in-
dicating the strata. The previous analysis includes the dummy variables of site in the Cox
model. An alternative more flexible model is to allow for different baseline hazard functions
across strata. Technically, assume
$$\lambda_s(t\mid x) = \lambda_s(t)\exp(\beta^t x)$$
for strata s = 1, ..., S, where β is an unknown parameter and {λ_1(·), ..., λ_S(·)} are unknown baseline hazard functions. Therefore, within each stratum s, the proportional hazards assumption holds; across strata, the proportional hazards assumption may not hold. Within stratum s,
we can obtain the partial likelihood Ls (β); by independence of the data across strata, we
can obtain the joint partial likelihood
$$\prod_{s=1}^S L_s(\beta).$$
Based on the standard procedure, we can obtain the MLE and conduct inference based on large-sample theory. The coxph function allows for stratification via + strata() in the regression formula.
> cox.fit <- coxph(Surv(futime, relapse) ~ NALTREXONE * THERAPY +
+                    AGE + GENDER + T0_PDA + strata(site),
+                  robust = TRUE,
+                  data = COMBINE)
> summary(cox.fit)
Call:
coxph(formula = Surv(futime, relapse) ~ NALTREXONE * THERAPY +
    AGE + GENDER + T0_PDA + strata(site), data = COMBINE, robust = TRUE)

  n = 1226, number of events = 856

NALTREXONE          **
THERAPY             .
AGE                 ***
GENDERmale          .
T0_PDA              *
NALTREXONE:THERAPY  *

Concordance = 0.561  (se = 0.011)
Likelihood ratio test = 35.24  on 6 df,   p = 4e-06
Wald test            = 33.85  on 6 df,   p = 7e-06
Score (logrank) test = 34.94  on 6 df,   p = 4e-06,   Robust = 34.15  p = 6e-06
  n = 394, number of events = 155

treat   ***
adult
agedx

Concordance = 0.596  (se = 0.024)
Likelihood ratio test = 23.13  on 3 df,   p = 4e-05
Wald test            = 21.54  on 3 df,   p = 8e-05
Score (logrank) test = 23.01  on 3 df,   p = 4e-05,   Robust = 22.09  p = 6e-05

  n = 394, number of events = 155

treat   ***
adult
agedx

Concordance = 0.596  (se = 0.023)
Likelihood ratio test = 23.13  on 3 df,   p = 4e-05
Wald test            = 28.55  on 3 df,   p = 3e-06
Score (logrank) test = 23.01  on 3 df,   p = 4e-05,   Robust = 26.55  p = 7e-06
This ratio is difficult to interpret because the patients who have survived up to time t can be quite different in the treatment and control groups, especially when the treatment is effective. Even though patients are randomly assigned at baseline, the survivors up to time t are not. Hernán (2010) suggested focusing on the comparison of the survival functions.
$$\lambda_T(t) = \frac{\text{pr}(y = t, \delta = 1)}{\text{pr}(y \ge t)}.$$
Appendices
A
Linear Algebra
All vectors are column vectors in this book. This is coherent with R.
The equality holds if and only if yi = a + bxi for some a and b, for all i = 1, . . . , n. We can
use the Cauchy–Schwarz inequality to prove the triangle inequality
∥x + y∥ ≤ ∥x∥ + ∥y∥.
We say that x and y are orthogonal, denoted by x ⊥ y, if ⟨x, y⟩ = 0. We call a set of vec-
tors v1 , . . . , vm ∈ Rn orthonormal if they all have unit length and are mutually orthogonal.
Geometrically, we can define the cosine of the angle between two vectors x, y ∈ Rn as
$$\cos\angle(x, y) = \frac{\langle x, y\rangle}{\|x\|\|y\|} = \frac{\sum_{i=1}^n x_iy_i}{\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}}.$$
For unit vectors, it reduces to the inner product. When both x and y are orthogonal to 1_n, that is, x̄ = n^{-1}Σ_{i=1}^n x_i = 0 and ȳ = n^{-1}Σ_{i=1}^n y_i = 0, the formula for the cosine of the angle is identical to the sample Pearson correlation coefficient
$$\hat\rho_{xy} = \frac{\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i - \bar x)^2}\sqrt{\sum_{i=1}^n(y_i - \bar y)^2}}.$$
Sometimes, we simply say that the cosine of the angle of two vectors measures their corre-
lation even when they are not orthogonal to 1n .
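A minimal numerical check of this identity in R (the simulated x and y are arbitrary):

set.seed(1)
x <- rnorm(100); y <- rnorm(100)
xc <- x - mean(x); yc <- y - mean(y)          # center so that xbar = ybar = 0
cosine <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
c(cosine = cosine, pearson = cor(x, y))       # identical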
We can write an n × m matrix A in terms of its row vectors
$$A = \begin{pmatrix}a_1^t\\ \vdots\\ a_n^t\end{pmatrix},$$
where a_i ∈ R^m (i = 1, ..., n), or in terms of its column vectors
$$A = (A_1, \ldots, A_m),$$
where A_j ∈ R^n (j = 1, ..., m). In statistics, the rows correspond to the units, so the ith row vector contains the observations for unit i. Moreover, viewing A in terms of its column vectors can give more insights. Define the column space of A as
C(A) = {α1 A1 + · · · + αm Am : α1 , . . . , αm ∈ R} ,
which is the set of all linear combinations of the column vectors A1 , . . . , Am . The column
space is important because we can write Aα, with α = (α1 , . . . , αm )t , as
$$A\alpha = (A_1, \ldots, A_m)\begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_m\end{pmatrix} = \alpha_1A_1 + \cdots + \alpha_mA_m \in C(A).$$
We define the row space of A as the column space of At .
Proposition A.1 ensures that rank(B) ≥ k and rank(C) ≥ k so they must both have rank
k. The decomposition in Proposition A.2 is not unique since the choice of the maximally
linearly independent column vectors of A is not unique.
Determinant
The original definition of the determinant of a square matrix A = (aij ) has a very complex
form, which will not be used in this book.
The determinant of a 2 × 2 matrix has a simple form:
$$\det\begin{pmatrix}a & b\\ c & d\end{pmatrix} = ad - bc. \qquad (A.3)$$
The determinant of the n × n Vandermonde matrix has the following formula:
$$\det\begin{pmatrix}1 & x_1 & x_1^2 & \cdots & x_1^{n-1}\\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_n & x_n^2 & \cdots & x_n^{n-1}\end{pmatrix} = \prod_{1\le i<j\le n}(x_j - x_i). \qquad (A.4)$$
The properties of the determinant are more useful. I will review two.
Proposition A.3 For two square matrices A and B, we have
det(AB) = det(A)det(B) = det(BA).
Proposition A.4 For two square matrices A ∈ R^{m×m} and B ∈ R^{n×n}, we have
$$\det\begin{pmatrix}A & 0\\ C & B\end{pmatrix} = \det\begin{pmatrix}A & D\\ 0 & B\end{pmatrix} = \det(A)\det(B).$$
Inverse of a matrix
Let In be the n × n identity matrix. An n × n matrix A is invertible/nonsingular if there
exists an n × n matrix B such that AB = BA = In . We call B the inverse of A, denoted
by A−1 . If A is an orthogonal matrix, then At = A−1 .
A square matrix is invertible if and only if det(A) ̸= 0.
The inverse of a 2 × 2 matrix is
$$\begin{pmatrix}a & b\\ c & d\end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix}d & -b\\ -c & a\end{pmatrix}. \qquad (A.5)$$
The inverse of a 3 × 3 lower triangular matrix is
$$\begin{pmatrix}a & 0 & 0\\ b & c & 0\\ d & e & f\end{pmatrix}^{-1} = \frac{1}{acf}\begin{pmatrix}cf & 0 & 0\\ -bf & af & 0\\ be - cd & -ae & ac\end{pmatrix}. \qquad (A.6)$$
A useful identity is
(AB)−1 = B −1 A−1
if both A and B are invertible.
AP = P diag{λ1 , . . . , λn }
or, equivalently,
A(γ1 , · · · , γn ) = (λ1 γ1 , · · · , λn γn ),
then (λi , γi ) must be a pair of eigenvalue and eigenvector. Moreover, the eigendecomposition
in Theorem A.1 is unique up to the permutation of the columns of P and the corresponding
λi ’s.
The eigen-decomposition is also useful for defining the square root of an n×n symmetric
matrix. In particular, if the eigenvalues of A are nonnegative, then we can define
$$A^{1/2} = P\,\mathrm{diag}\{\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}\}P^t.$$
By definition, A1/2 is a symmetric matrix satisfying A1/2 A1/2 = A. There are other defini-
tions of the square root of a symmetric matrix, but we adopt this form in this book.
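A minimal R sketch of this definition, assuming a positive semi-definite A built from random data:

set.seed(1)
A <- crossprod(matrix(rnorm(9), 3, 3))        # positive semi-definite 3 x 3
eig <- eigen(A, symmetric = TRUE)
A.half <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)
all.equal(A.half %*% A.half, A)               # TRUE up to rounding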
From Theorem A.1, we can write A as
$$A = P\,\mathrm{diag}\{\lambda_1,\ldots,\lambda_n\}P^t = (\gamma_1,\cdots,\gamma_n)\,\mathrm{diag}\{\lambda_1,\ldots,\lambda_n\}\begin{pmatrix}\gamma_1^t\\ \vdots\\ \gamma_n^t\end{pmatrix} = \sum_{i=1}^n\lambda_i\gamma_i\gamma_i^t.$$
For an n × n symmetric matrix A, its rank equals the number of non-zero eigenvalues
and its determinant equals the product of all eigenvalues. The matrix A is of full rank if
all its eigenvalues are non-zero, which implies that its rank equals n and its determinant is
non-zero.
Quadratic form
For an n × n symmetric matrix A = (aij ) and an n-dimensional vector x, we can define the
quadratic form as
$$x^tAx = \langle x, Ax\rangle = \sum_{i=1}^n\sum_{j=1}^n a_{ij}x_ix_j.$$
We always consider a symmetric matrix in the quadratic form without loss of generality.
Otherwise, we can symmetrize A as à = (A + At )/2 without changing the value of the
quadratic form because
xt Ax = xt Ãx.
We call A positive semi-definite, denoted by A ⪰ 0, if xt Ax ≥ 0 for all x; we call A
positive definite, denoted by A ≻ 0, if xt Ax > 0 for all nonzero x.
We can also define the partial order between matrices. We call A ⪰ B if and only if
A − B ⪰ 0, and we call A ≻ B if and only if A − B ≻ 0. This is important in statistics
because we often compare the efficiency of estimators based on their variances or covariance
matrices. Given two unbiased estimators θ̂1 and θ̂2 for a scalar parameter θ, we say that
θ̂1 is more efficient than θ̂2 if var(θ̂2 ) ≥ var(θ̂1 ). In the vector case, we say that θ̂1 is more
efficient than θ̂2 if cov(θ̂2 ) ⪰ cov(θ̂1 ), which is equivalent to var(ℓt θ̂2 ) ≥ var(ℓt θ̂1 ) for any
linear combination of the estimators.
The eigenvalues of a symmetric matrix determine whether it is positive semi-definite or
positive definite.
Theorem A.2 For a symmetric matrix A, it is positive semi-definite if and only if all its
eigenvalues are nonnegative, and it is positive definite if and only if all its eigenvalues are
positive.
An important result is the relationship between the eigenvalues and the extreme values
of the quadratic form. Assume that the eigenvalues are rearranged in decreasing order such
that λ1 ≥ · · · ≥ λn . For a unit vector x, we have that
$$x^tAx = x^t\left(\sum_{i=1}^n\lambda_i\gamma_i\gamma_i^t\right)x = \sum_{i=1}^n\lambda_i\alpha_i^2,$$
where
$$\alpha = \begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_n\end{pmatrix} = \begin{pmatrix}\gamma_1^tx\\ \vdots\\ \gamma_n^tx\end{pmatrix} = P^tx$$
has length ∥α∥2 = ∥x∥2 = 1. Then the maximum value of xt Ax is λ1 which is achieved at
α1 = 1 and α2 = · · · = αn = 0 (for example, if x = γ1 , then α1 = 1 and α2 = · · · = αn = 0).
For a unit vector x that is orthogonal to γ1 , we have that
$$x^tAx = \sum_{i=2}^n\lambda_i\alpha_i^2.$$
Theorem A.3 Suppose that an n × n symmetric matrix A has eigen-decomposition $A = \sum_{i=1}^n\lambda_i\gamma_i\gamma_i^t$, where λ_1 ≥ · · · ≥ λ_n.
Define the Rayleigh quotient
$$r(x) = \frac{x^tAx}{x^tx}, \qquad x\in\mathbb{R}^n,\ x\neq 0.$$
Theorem A.4 (Rayleigh quotient and eigenvalues) The maximum and minimum eigenvalues of an n × n symmetric matrix A equal
$$\lambda_{\max}(A) = \max_{x\neq 0}r(x), \qquad \lambda_{\min}(A) = \min_{x\neq 0}r(x),$$
with the maximizer and minimizer being the eigenvectors corresponding to the maximum and minimum eigenvalues, respectively.
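A numerical check of Theorem A.4 (a sketch with an arbitrary symmetric matrix):

set.seed(1)
A <- crossprod(matrix(rnorm(16), 4, 4))             # symmetric 4 x 4
r <- function(x) c(t(x) %*% A %*% x / sum(x^2))     # Rayleigh quotient
ev <- eigen(A, symmetric = TRUE)
c(max.eigenvalue = ev$values[1], r.at.top.eigenvector    = r(ev$vectors[, 1]))
c(min.eigenvalue = ev$values[4], r.at.bottom.eigenvector = r(ev$vectors[, 4]))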
Trace
The trace of an n × n matrix A = (aij ) is the sum of all its diagonal elements, denoted by
$$\mathrm{trace}(A) = \sum_{i=1}^n a_{ii}.$$
The trace operator has two important properties that can sometimes help to simplify
calculations.
Proposition A.5 trace(AB) = trace(BA) as long as AB and BA are both square matrices.
We can verify Proposition A.5 by definition. It states that AB and BA have the same trace although AB differs from BA in general. It is particularly useful if the dimension of BA is much lower than the dimension of AB. For example, if A = (a_1, ..., a_n)^t is a column vector and B = (b_1, ..., b_n) is a row vector, then trace(AB) = trace(BA) = ⟨B^t, A⟩ = Σ_{i=1}^n a_i b_i.
Proposition A.6 The trace of an n × n symmetric matrix A equals the sum of its eigenvalues: trace(A) = Σ_{i=1}^n λ_i.
Proof of Proposition A.6: It follows from the eigen-decomposition and Proposition A.5.
Let Λ = diag{λ1 , . . . , λn }, and we have
$$\mathrm{trace}(A) = \mathrm{trace}(P\Lambda P^t) = \mathrm{trace}(\Lambda P^tP) = \mathrm{trace}(\Lambda) = \sum_{i=1}^n\lambda_i. \qquad\square$$
Projection matrix
An n × n matrix H is a projection matrix if it is symmetric and H 2 = H. The eigenvalues
of H must be either 1 or 0. To see this, we assume that Hx = λx for some nonzero vector x, and use two ways to calculate H²x: on the one hand, H²x = H(Hx) = λHx = λ²x; on the other hand, H²x = Hx = λx since H² = H. Therefore λ² = λ, so λ must be 0 or 1.
It is relatively easy to verify the first part of Theorem A.5; see Chapter 3. The second part
of Theorem A.5 follows from the eigen-decomposition of H, with the first p eigen-vectors
being the column vectors of X.
Cholesky decomposition
An n × n positive semi-definite matrix A can be decomposed as A = LLt where L is an
n × n lower triangular matrix with non-negative diagonal elements. If A is positive definite,
the decomposition is unique. In general, it is not. Take an arbitrary orthogonal matrix Q,
we have A = LQQt Lt = CC t where C = LQ. So we can decompose a positive semi-definite
matrix A as A = CC t , but this decomposition is not unique.
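In R, chol() returns the upper triangular factor R with A = R^t R, so the lower triangular factor above is its transpose; a minimal sketch:

set.seed(1)
A <- crossprod(matrix(rnorm(9), 3, 3)) + diag(3)    # positive definite
L <- t(chol(A))                                     # lower triangular factor
all.equal(L %*% t(L), A)                            # TRUE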
A = UDV^t
for the component-wise partial derivative, which must have the same dimension as x. It is
often called the gradient of f. For example, for a linear function f (x) = xt a = at x with
a, x ∈ Rp , we have
$$\frac{\partial a^tx}{\partial x} = \begin{pmatrix}\dfrac{\partial a^tx}{\partial x_1}\\ \vdots\\ \dfrac{\partial a^tx}{\partial x_p}\end{pmatrix} = \begin{pmatrix}\dfrac{\partial\sum_{j=1}^p a_jx_j}{\partial x_1}\\ \vdots\\ \dfrac{\partial\sum_{j=1}^p a_jx_j}{\partial x_p}\end{pmatrix} = \begin{pmatrix}a_1\\ \vdots\\ a_p\end{pmatrix} = a. \qquad (A.7)$$
These are two important rules of vector calculus used in this book, summarized below.
Proposition A.7 For a ∈ R^p and a symmetric matrix A ∈ R^{p×p},
$$\frac{\partial a^tx}{\partial x} = a, \qquad \frac{\partial x^tAx}{\partial x} = 2Ax.$$
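A finite-difference check of these two rules (a sketch; num.grad is a hypothetical helper using central differences):

set.seed(1)
p <- 3
a <- rnorm(p); A <- crossprod(matrix(rnorm(p^2), p, p)); x <- rnorm(p)
num.grad <- function(f, x, h = 1e-6)
  sapply(seq_along(x), function(j) {
    e <- rep(0, length(x)); e[j] <- h
    (f(x + e) - f(x - e)) / (2 * h)
  })
cbind(analytic = a, numeric = num.grad(function(x) sum(a * x), x))
cbind(analytic = c(2 * A %*% x),
      numeric  = num.grad(function(x) c(t(x) %*% A %*% x), x))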
We can also extend the definition to vector functions. If f (x) = (f1 (x), . . . , fq (x))t is a
function from Rp to Rq , then we use the notation
$$\frac{\partial f(x)}{\partial x} \equiv \left(\frac{\partial f_1(x)}{\partial x},\cdots,\frac{\partial f_q(x)}{\partial x}\right) = \begin{pmatrix}\dfrac{\partial f_1(x)}{\partial x_1} & \cdots & \dfrac{\partial f_q(x)}{\partial x_1}\\ \vdots & & \vdots\\ \dfrac{\partial f_1(x)}{\partial x_p} & \cdots & \dfrac{\partial f_q(x)}{\partial x_p}\end{pmatrix}, \qquad (A.8)$$
which is a p × q matrix with rows corresponding to the elements of x and the columns
corresponding to the elements of f (x). We can easily extend the first result of Proposition
A.7.
Proposition A.8 For a matrix B ∈ R^{p×q},
$$\frac{\partial B^tx}{\partial x} = B.$$
Proof of Proposition A.8: Partition B = (B1 , . . . , Bq ) in terms of its column vectors.
The jth element of B t x is Bjt x so the j-th column of ∂B t x/∂x is Bj based on Proposition
A.7. This verifies that ∂B t x/∂x equals B. □
Some authors define ∂f (x)/∂x as the transpose of (A.8). I adopt this form for its natural
connection with (A.7) when q = 1. Sometimes, it is indeed more convenient to work with
the transpose of ∂f (x)/∂x. Then I will use the notation
$$\frac{\partial f(x)}{\partial x^t} = \left\{\frac{\partial f(x)}{\partial x}\right\}^t = \left(\frac{\partial f(x)}{\partial x_1},\cdots,\frac{\partial f(x)}{\partial x_p}\right).$$
The above formulas become more powerful in conjunction with the chain rule. For ex-
ample, for any differentiable function h(z) mapping from R to R with derivative h′ (z), we
have
$$\frac{\partial h(a^tx)}{\partial x} = h'(a^tx)\,a, \qquad \frac{\partial h(x^tAx)}{\partial x} = 2h'(x^tAx)\,Ax.$$
For any differentiable function h(z) mapping from R^q to R with gradient ∂h(z)/∂z, we have
$$\frac{\partial h\{f(x)\}}{\partial x} = \frac{\partial f(x)}{\partial x}\,\frac{\partial h(z)}{\partial z}\bigg|_{z = f(x)}.$$
Remark: The result is a direct consequence of the standard triangle inequality but it
has an interesting implication. If ⟨u, v⟩ ≥ 1 − ϵ and ⟨v, w⟩ ≥ 1 − ϵ, then ⟨u, w⟩ ≥ 1 − 4ϵ.
This implied inequality is mostly interesting when ϵ is small. It states that when u and v are highly correlated and v and w are highly correlated, then u and w must also be highly correlated. Note that we can find counterexamples for the following relationship:
Remark: This result is not too difficult to prove but it says something fundamentally
interesting. If v is correlated with many vectors u1 , . . . , um , then at least some vectors in
u1 , . . . , um must be also correlated.
provided that all the inverses of the matrices exist. The two forms of the inverse imply the Woodbury formula
$$(A + uv^t)^{-1} = A^{-1} - \frac{A^{-1}uv^tA^{-1}}{1 + v^tA^{-1}u},$$
where A is an invertible square matrix, and u and v are two column vectors.
det(In + uv t ) = 1 + v t u.
The result is due to Von Neumann (1937) and Ruhe (1970). See also Chen and Li (2019,
Lemma 4.12).
Let "IID" denote "independent and identically distributed", let "$\stackrel{\text{iid}}{\sim}$" denote a sequence of random variables that are IID with some common distribution, and let "⊥⊥" denote independence between random variables. Define Euler's Gamma function as
$$\Gamma(z) = \int_0^\infty x^{z-1}e^{-x}\,\mathrm{d}x, \qquad z > 0,$$
which is a natural extension of the factorial since Γ(n) = (n − 1)!. Further define the
digamma function as ψ(z) = d log Γ(z)/dz and the trigamma function as ψ ′ (z). In R, we
can use
gamma(z)      # Gamma function Γ(z)
lgamma(z)     # log Γ(z)
digamma(z)    # digamma function ψ(z)
trigamma(z)   # trigamma function ψ'(z)
We can verify that the above density (B.1) is well-defined even if we change the integer n
to be an arbitrary positive real number ν, and call the corresponding random variable Qν
a chi-squared random variable with degrees of freedom ν, denoted by Qν ∼ χ2ν .
If X represents the survival time, then the probability of surviving another x time is
always the same no matter how long the existing survival time is.
Proof of Proposition B.6: Because pr(X > x) = e−λx , we have
$$\text{pr}(X \ge x + c\mid X \ge c) = \frac{\text{pr}(X \ge x + c)}{\text{pr}(X \ge c)} = \frac{e^{-\lambda(x+c)}}{e^{-\lambda c}} = e^{-\lambda x} = \text{pr}(X \ge x).$$
□
The minimum of independent exponential random variables also follows an exponential
distribution.
Proposition B.7 If X_1, ..., X_n are independent with X_i ∼ Exponential(λ_i) for i = 1, ..., n, then
$$X = \min(X_1, \ldots, X_n) \sim \text{Exponential}(\lambda_1 + \cdots + \lambda_n)$$
and
$$\text{pr}(X_i = X) = \frac{\lambda_i}{\lambda_1 + \cdots + \lambda_n}.$$
Proof of Proposition B.7: First,
$$\text{pr}(X > x) = \prod_{i=1}^n\text{pr}(X_i > x) = \prod_{i=1}^n e^{-\lambda_ix} = e^{-(\lambda_1+\cdots+\lambda_n)x},$$
so X is Exponential(λ_1 + · · · + λ_n).
Second, we have
$$\text{pr}(X_i = X) = \int_0^\infty\lambda_ie^{-\lambda_ix}\prod_{j\neq i}e^{-\lambda_jx}\,\mathrm{d}x = \int_0^\infty\lambda_ie^{-(\lambda_1+\cdots+\lambda_n)x}\,\mathrm{d}x = \frac{\lambda_i}{\lambda_1+\cdots+\lambda_n}. \qquad\square$$
The difference between two IID Exponential random variables follows the Laplace distribution.
Proposition B.8 If y_1 and y_2 are two IID Exponential(λ) random variables, then y = y_1 − y_2 has density
$$f(c) = \frac{\lambda}{2}\exp(-\lambda|c|), \qquad -\infty < c < \infty,$$
which is the density of a Laplace distribution with mean 0 and variance 2/λ².
Proof of Proposition B.8: Both y_1 and y_2 have density f(c) = λe^{−λc} and CDF F(c) = 1 − e^{−λc} for c > 0. The CDF of y = y_1 − y_2 at c ≤ 0 is
$$\text{pr}(y_1 - y_2 \le c) = \int_0^\infty\text{pr}(y_2 \ge z - c)\,\lambda e^{-\lambda z}\,\mathrm{d}z = \int_0^\infty e^{-\lambda(z-c)}\lambda e^{-\lambda z}\,\mathrm{d}z = \lambda e^{\lambda c}\int_0^\infty e^{-2\lambda z}\,\mathrm{d}z = \lambda e^{\lambda c}/(2\lambda) = e^{\lambda c}/2.$$
In the above definitions, the conditional mean and variance are both deterministic functions
of x. We can replace x by the random variable X to define E {g(Y ) | X} and var {g(Y ) | X},
which are functions of the random variable X and are thus random variables.
Below are two important laws of conditional expectation and variance.
Theorem B.2 (Law of total expectation) We have
E(Y ) = E {E(Y | X)} .
Theorem B.3 (Law of total variance or analysis of variance) We have
var(Y ) = E {var(Y | X)} + var {E(Y | X)} .
Independence
Random variables (X1 , . . . , Xn ) are mutually independent if
fX1 ···Xn (x1 , . . . , xn ) = fX1 (x1 ) · · · fXn (xn ).
Note that in this definition, each of (X1 , . . . , Xn ) can be vectors. We have the following
rules under independence.
Proposition B.10 If X ⊥⊥ Y, then h(X) ⊥⊥ g(Y) for any functions h(·) and g(·).
Proposition B.11 If X ⊥⊥ Y, then
$$f_{XY}(x, y) = f_X(x)f_Y(y), \quad f_{Y\mid X}(y\mid x) = f_Y(y), \quad E\{g(Y)\mid X\} = E\{g(Y)\}, \quad E\{g(Y)h(X)\} = E\{g(Y)\}E\{h(X)\}.$$
The statement of Theorem B.6 is extremely simple. But its proof is non-trivial and
beyond the scope of this book. See Benhamou et al. (2018) for a proof.
$$Y_1 \sim N(\mu_1, \Sigma_{11}), \qquad Y_2 \sim N(\mu_2, \Sigma_{22});$$
$$Y_1 \mid Y_2 = y_2 \sim N\left\{\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right\};$$
$$Y_1 - \Sigma_{12}\Sigma_{22}^{-1}(Y_2 - \mu_2) \sim N\left\{\mu_1,\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right\}.$$
$$\begin{pmatrix}Y_1\\ Y_2\end{pmatrix}\sim N\left\{\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix},\ \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}\right\}, \qquad \rho = \frac{\mathrm{cov}(Y_1, Y_2)}{\sqrt{\mathrm{var}(Y_1)\mathrm{var}(Y_2)}}.$$
Proof of Theorem B.8: The proof relies on the following two basic facts.
• E(YY^t) = cov(Y) + E(Y)E(Y^t) = Σ + µµ^t.
• For an n × n symmetric random matrix W = (w_{ij}), we have E{trace(W)} = trace{E(W)} because E(Σ_{i=1}^n w_{ii}) = Σ_{i=1}^n E(w_{ii}).
The conclusion follows from
$$E(Y^tAY) = E\{\mathrm{trace}(Y^tAY)\} = E\{\mathrm{trace}(AYY^t)\} = \mathrm{trace}\{E(AYY^t)\} = \mathrm{trace}\{AE(YY^t)\} = \mathrm{trace}\{A(\Sigma + \mu\mu^t)\} = \mathrm{trace}(A\Sigma) + \mu^tA\mu. \qquad\square$$
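A Monte Carlo check of this mean formula (a sketch using MASS::mvrnorm; the particular µ, Σ, and A are arbitrary):

library(MASS)
set.seed(1)
mu <- c(1, -1, 2)
Sigma <- crossprod(matrix(rnorm(9), 3, 3)) + diag(3)
A <- crossprod(matrix(rnorm(9), 3, 3))              # symmetric
Y <- mvrnorm(1e5, mu, Sigma)
c(monte.carlo = mean(rowSums((Y %*% A) * Y)),       # average of Y_i^t A Y_i
  theory      = sum(diag(A %*% Sigma)) + c(t(mu) %*% A %*% mu))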
The variance of the quadratic form is much more complicated for a general random vector. For the multivariate Normal random vector, we have the following formula.
Theorem B.9 If Y ∼ N(µ, Σ), then
$$\mathrm{var}(Y^tAY) = 2\,\mathrm{trace}(A\Sigma A\Sigma) + 4\mu^tA\Sigma A\mu.$$
If rank(Σ) = k ≤ n, then
(Y − µ)t Σ+ (Y − µ) ∼ χ2k .
Y t HY ∼ χ2K .
Y t Y ∼ χ2K .
$$H = P\,\mathrm{diag}\{1,\ldots,1,0,\ldots,0\}P^t, \qquad Y^tHY = Y^tP\,\mathrm{diag}\{1,\ldots,1,0,\ldots,0\}P^tY = Z^t\mathrm{diag}\{1,\ldots,1,0,\ldots,0\}Z,$$
then Y_1 + Y_2 ⊥⊥ Y_1 − Y_2.
Remark: This result holds for arbitrary ρ.
Remark: For a bivariate Normal distribution, the two conditional distributions are both
Normal. The converse of the statement is not true. That is, even if the two conditional
distributions are both Normal, the joint distribution may not be bivariate Normal. Gelman
and Meng (1991) reported this interesting result.
σ^{jk} = 0 ⟺ X_j ⊥⊥ X_k | X_{∖(j,k)}
1. if Aa = 0, then a^tY ⊥⊥ Y^tAY;
2. if AB = BA = 0, then Y^tAY ⊥⊥ Y^tBY.
Hint: To simplify the proof, you can use the pseudoinverse of A, which satisfies AA^+A = A. In fact, a stronger result holds. Ogasawara and Takahashi (1951) proved the following theorem; see also Styan (1970, Theorem 5).
Theorem B.11 Assume Y ∼ N(µ, Σ). Define quadratic forms Y^tAY and Y^tBY for two symmetric matrices A and B. Then Y^tAY and Y^tBY are independent if and only if
Hint: Write Y = µ + Σ1/2 Z and reduce the problem to calculating the moments of
standard Normals.
C
Limiting Theorems and Basic Asymptotics
This chapter reviews the basics of limiting theorems and asymptotic analyses that are useful
for this book. See Newey and McFadden (1994) and Van der Vaart (2000) for in-depth
discussions.
The above proposition does not require any conditions on the joint distribution of
(Zn , Wn ).
For an IID sequence of random vectors, we have the following weak law of large numbers:
Proposition C.3 (Khintchine's weak law of large numbers) If Z_1, ..., Z_n are IID with mean µ ∈ R^K, then n^{-1}Σ_{i=1}^n Z_i → µ in probability.
A more elementary tool is Markov's inequality, which implies
$$\text{pr}\{\|Z_n - Z\| > c\} \le E\|Z_n - Z\|^2/c^2. \qquad (C.2)$$
Proposition C.4 If random vectors Zn ∈ RK have mean zero and covariance cov(Zn ) =
an Cn where an → 0 and Cn → C < ∞, then Zn → 0 in probability.
Proof of Proposition C.4: By (C.2) with Z = 0,
$$\text{pr}\{\|Z_n\| > c\} \le c^{-2}E\|Z_n\|^2 = c^{-2}E(Z_n^tZ_n) = c^{-2}\mathrm{trace}\{E(Z_nZ_n^t)\} = c^{-2}\mathrm{trace}\{\mathrm{cov}(Z_n)\} = c^{-2}a_n\,\mathrm{trace}(C_n)\to 0,$$
so Z_n → 0 in probability. □
For IID sequences of random vectors, we have the Lindeberg–Lévy central limit theorem
(CLT):
Proposition C.7 (Lindeberg–Lévy CLT) If random vectors Z_1, ..., Z_n are IID with mean µ and covariance Σ, then n^{1/2}(Z̄_n − µ) = n^{-1/2}Σ_{i=1}^n(Z_i − µ) → N(0, Σ) in distribution.
The more general Lindeberg–Feller CLT holds for independent sequences of random
vectors:
Proposition C.8 For each n, let Z_{n1}, ..., Z_{n,k_n} be independent random vectors with finite variances such that
(LF1) $\sum_{i=1}^{k_n} E\left[\|Z_{ni}\|^2 1\{\|Z_{ni}\| > c\}\right] \to 0$ for every c > 0;
(LF2) $\sum_{i=1}^{k_n} \mathrm{cov}(Z_{ni}) \to \Sigma$.
Then $\sum_{i=1}^{k_n}\{Z_{ni} - E(Z_{ni})\}\to N(0, \Sigma)$ in distribution.
Condition (LF2) often holds by proper standardization, and the key is to verify Condition
(LF1). Condition (LF1) is general but it looks cumbersome. In many cases, we impose a
stronger moment condition that is easier to verify:
(LF1') $\sum_{i=1}^{k_n} E\|Z_{ni}\|^{2+\delta}\to 0$ for some δ > 0.
We can show that (LF1') implies (LF1):
$$\sum_{i=1}^{k_n}E\left[\|Z_{ni}\|^21\{\|Z_{ni}\| > c\}\right] = \sum_{i=1}^{k_n}E\left[\|Z_{ni}\|^{2+\delta}\|Z_{ni}\|^{-\delta}1\{\|Z_{ni}\|^\delta > c^\delta\}\right] \le \sum_{i=1}^{k_n}E\|Z_{ni}\|^{2+\delta}c^{-\delta}\to 0.$$
Theorem C.1 Assume Y = Xβ + ε where the covariates are fixed and error terms ε =
(ε1 , . . . , εn )t are IID non-Normal with mean zero and finite variance σ 2 . Recall the OLS
estimator β̂ = (X t X)−1 X t Y . Any linear combination of β̂ is asymptotically Normal if and
only if
max hii → 0,
1≤i≤n
where hii is the ith diagonal element of the hat matrix H = X(X t X)−1 X t .
In the main text, h_{ii} is called the leverage score of unit i, and the maximum leverage score is
$$\kappa = \max_{1\le i\le n}h_{ii}.$$
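In R, the leverage scores of a fitted linear model are returned by hatvalues(); a minimal sketch on the built-in cars data:

fit <- lm(dist ~ speed, data = cars)
max(hatvalues(fit))     # kappa, the maximum leverage score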
Proof of Theorem C.1: Without loss of generality, we can assume X^tX = I_p, so that
$$\hat\beta - \beta = (X^tX)^{-1}X^t\varepsilon = X^t\varepsilon,$$
and the hat matrix H = XX^t has diagonal elements h_{ii} = x_i^tx_i = ∥x_i∥² and off-diagonal elements h_{ij} = x_i^tx_j. We can also assume σ² = 1.
Consider a fixed vector a ∈ R^p with ∥a∥² = 1. We have
$$a^t\hat\beta - a^t\beta = a^tX^t\varepsilon \equiv s^t\varepsilon,$$
where s = Xa, that is, s_i = x_i^ta (i = 1, ..., n), satisfies
$$\|s\|^2 = a^tX^tXa = \|a\|^2 = 1$$
and
$$s_i^2 = (x_i^ta)^2 \le \|x_i\|^2\|a\|^2 = \|x_i\|^2 = h_{ii}$$
by the Cauchy–Schwarz inequality.
I first prove the sufficiency. The key term at β̂ − at β is a linear combination of the
IID errors, and it has mean 0 and variance var(st ε) = ∥s∥2 = 1. We only need to verify
Condition (LF1) to establish the CLT. It holds because for any fixed c > 0, we have
$$\sum_{i=1}^nE\left[s_i^2\varepsilon_i^21\{|s_i\varepsilon_i| > c\}\right] \le \sum_{i=1}^ns_i^2\,\max_{1\le i\le n}E\left[\varepsilon_i^21\{|s_i\varepsilon_i| > c\}\right] \qquad (C.4)$$
$$= \max_{1\le i\le n}E\left[\varepsilon_i^21\{|s_i\varepsilon_i| > c\}\right] \qquad (C.5)$$
$$\le E\left[\varepsilon_i^21\{\kappa^{1/2}|\varepsilon_i| > c\}\right] \qquad (C.6)$$
$$\to 0, \qquad (C.7)$$
where (C.4) follows from the property of the maximum, (C.5) follows from the fact that ∥s∥² = 1, (C.6) follows from the fact that |s_i| ≤ h_{ii}^{1/2} ≤ κ^{1/2}, and (C.7) follows from κ → 0 and the dominated convergence theorem in Proposition C.5.
I then prove the necessity. Pick one i∗ from arg max1≤i≤n hii . Consider a special lin-
ear combination of the OLS estimator: ŷi∗ = xti∗ β̂, which is the fitted value of the i∗ th
observation and has the form
ŷi∗ = xti∗ β̂
= xti∗ X t ε
Xn
= xti∗ xj εj
j=1
X
= hi∗ i∗ εi∗ + hi∗ j εj .
j̸=i∗
If ŷ_{i*} is asymptotically Normal, then both h_{i*i*}ε_{i*} and Σ_{j≠i*}h_{i*j}ε_j must have Normal limiting distributions by Theorem B.6. Therefore, h_{i*i*} must converge to zero because ε_{i*} has a non-Normal distribution. So max_{1≤i≤n} h_{ii} must converge to zero. □
1. Zn + Wn → Z + c in distribution;
2. Wn Zn → cZ in distribution;
3. Wn−1 Zn → c−1 Z in distribution if c ̸= 0.
The third tool is the delta method. I will present a special case below for asymptotically
Normal random vectors. Heuristically, it states that if Tn is asymptotically Normal, then
any function of Tn is also asymptotically Normal. This is true because any function is a
locally linear function by the first-order Taylor expansion.
Proposition C.11 Let f(z) be a function from R^p to R^q, and ∂f(z)/∂z ∈ R^{p×q} be the partial derivative matrix. If √n(Z_n − θ) → N(µ, Σ) in distribution, then
$$\sqrt{n}\{f(Z_n) - f(\theta)\}\to N\left(\frac{\partial f(\theta)}{\partial z^t}\mu,\ \frac{\partial f(\theta)}{\partial z^t}\Sigma\frac{\partial f(\theta)}{\partial z}\right)$$
in distribution.
Proof of Proposition C.11: I will give an informal proof. Using a Taylor expansion, we have
$$\sqrt{n}\{f(Z_n) - f(\theta)\} \approx \frac{\partial f(\theta)}{\partial z^t}\sqrt{n}(Z_n - \theta),$$
which is a linear transformation of √n(Z_n − θ). Because √n(Z_n − θ) → N(µ, Σ) in distribution, we have
$$\sqrt{n}\{f(Z_n) - f(\theta)\}\to\frac{\partial f(\theta)}{\partial z^t}N(\mu, \Sigma) = N\left(\frac{\partial f(\theta)}{\partial z^t}\mu,\ \frac{\partial f(\theta)}{\partial z^t}\Sigma\frac{\partial f(\theta)}{\partial z}\right)$$
in distribution. □
Proposition C.11 above is more useful when ∂f (θ)/∂z ̸= 0. Otherwise, we need to invoke
higher-order Taylor expansion to obtain a more accurate asymptotic approximation.
D
M-Estimation and MLE
where m(·, ·) is a vector function with the same dimension as b, and W = {w_i}_{i=1}^n are the observed data. Let β̂ denote the solution, which is an estimator of β. Under mild regularity conditions, β̂ is consistent and asymptotically Normal.¹ This is the classical theory of
M-estimation. I will review it below. See Stefanski and Boos (2002) for a reader-friendly
introduction that contains many interesting and important examples. The proofs below are
not rigorous. See Newey and McFadden (1994) for the rigorous ones.
D.1 M-estimation
I start with the simple case with IID data.
Theorem D.1 Assume that W = {w_i}_{i=1}^n are IID with the same distribution as w. The true parameter β ∈ R^p is the unique solution of
$$E\{m(w, b)\} = 0,$$
and β̂ solves the sample analog
$$\bar m(W, b) \equiv n^{-1}\sum_{i=1}^nm(w_i, b) = 0.$$
Under regularity conditions, β̂ → β in probability and
$$\sqrt{n}(\hat\beta - \beta)\to N\left\{0,\ B^{-1}M(B^{-1})^t\right\}$$
in distribution, where
$$B = -\frac{\partial E\{m(w,\beta)\}}{\partial b^t}, \qquad M = E\{m(w,\beta)m(w,\beta)^t\}.$$
Proof of Theorem D.1: I give a “physics” proof. When I use approximations, I mean the
error terms are of higher orders under some regularity conditions. The consistency follows
¹There are counterexamples in which β̂ is inconsistent; see Freedman and Diaconis (1982).
from swapping the order of “solving equation” and “taking the limit based on the law of
large numbers”:
The asymptotic Normality follows from three steps. First, from the Taylor expansion
$$0 = \bar m(W, \hat\beta) \approx \bar m(W, \beta) + \frac{\partial\bar m(W,\beta)}{\partial b^t}(\hat\beta - \beta),$$
we obtain
$$\sqrt{n}\left(\hat\beta - \beta\right) \approx \left\{-\frac{\partial\bar m(W,\beta)}{\partial b^t}\right\}^{-1}\left\{\frac{1}{\sqrt{n}}\sum_{i=1}^nm(w_i,\beta)\right\}.$$
Second, the law of large numbers ensures that
$$-\frac{\partial\bar m(W,\beta)}{\partial b^t}\to-\frac{\partial E\{m(w,\beta)\}}{\partial b^t} = B$$
in probability, and the CLT ensures that
$$n^{-1/2}\sum_{i=1}^nm(w_i,\beta)\to N(0, M)$$
in distribution.
For independent but not IID data, β̂ still solves
$$\bar m(W, b) = 0,$$
and, under regularity conditions, √n(β̂ − β) → N{0, B^{-1}M(B^{-1})^t} in distribution, where
$$B = -\lim_{n\to\infty}n^{-1}\sum_{i=1}^n\frac{\partial E\{m(w_i,\beta)\}}{\partial b^t}, \qquad M = \lim_{n\to\infty}n^{-1}\sum_{i=1}^n\mathrm{cov}\{m(w_i,\beta)\}.$$
For both cases above, we can further construct the following sandwich covariance estimator:
$$\left\{\sum_{i=1}^n\frac{\partial E\{m(w_i,\hat\beta)\}}{\partial b^t}\right\}^{-1}\left\{\sum_{i=1}^nm(w_i,\hat\beta)m(w_i,\hat\beta)^t\right\}\left\{\sum_{i=1}^n\frac{\partial E\{m(w_i,\hat\beta)\}}{\partial b}\right\}^{-1}.$$
For non-IID data, the above covariance estimator can be conservative unless E{m(wi , β)} =
0 for all i = 1, . . . , n.
Example D.1 Assume that $x_1,\ldots,x_n\stackrel{\text{iid}}{\sim}x$ with mean µ. The sample mean x̄ solves the estimating equation
$$n^{-1}\sum_{i=1}^n(x_i - \mu) = 0.$$
Applying Theorem D.1, we obtain
$$\sqrt{n}(\bar x - \mu)\to N\{0, \mathrm{var}(x)\}$$
in distribution, the standard CLT for the sample mean. Moreover, the sandwich covariance estimator is
$$\hat V = n^{-2}\sum_{i=1}^n(x_i - \bar x)^2,$$
which equals the sample variance of x, multiplied by (n − 1)/n2 ≈ 1/n. This is a standard
result.
If we only assume that x1 , . . . , xn are independent with the same mean µ but possibly
different variances σi2 (i = 1, . . . , n), the sample mean x̄ is still a reasonable estimator
for µ which solves the same estimating equation above. Moreover, the sandwich covariance
estimator V̂ is still a consistent estimator for the true variance of x̄. This is less standard.
If we assume that x_i ∼ [µ_i, σ_i²] are independent, we can still use x̄ to estimate µ = n^{-1}Σ_{i=1}^n µ_i. The estimating equation remains the same as above. The sandwich covariance estimator V̂ becomes conservative since E(x_i − µ) ≠ 0 in general.
Problem 4.1 gives more details.
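A minimal sketch of this sandwich variance estimate for the sample mean on simulated data (the exponential distribution is an arbitrary choice):

set.seed(1)
x <- rexp(200, rate = 2)
n <- length(x)
V.sandwich <- sum((x - mean(x))^2) / n^2      # n^{-2} sum (x_i - xbar)^2
V.classic  <- var(x) / n                      # differs by the factor (n-1)/n
c(sandwich = V.sandwich, classic = V.classic)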
Example D.2 Assume that $x_1,\ldots,x_n\stackrel{\text{iid}}{\sim}x$ with mean µ and variance σ². The sample mean and variance (x̄, σ̂²) jointly solve the estimating equation with
$$m(x, \mu, \sigma^2) = \begin{pmatrix}x - \mu\\ (x-\mu)^2 - \sigma^2\end{pmatrix},$$
ignoring the difference between n and n − 1 in the definition of the sample variance. Apply Theorem D.1 to obtain
$$B = \begin{pmatrix}-1 & 0\\ 0 & -1\end{pmatrix}, \qquad M = \begin{pmatrix}\sigma^2 & \mu_3\\ \mu_3 & \mu_4 - \sigma^4\end{pmatrix},$$
where µ_3 and µ_4 are the third and fourth central moments of x, which imply
$$\sqrt{n}\begin{pmatrix}\bar x - \mu\\ \hat\sigma^2 - \sigma^2\end{pmatrix}\to N\left\{0,\ \begin{pmatrix}\sigma^2 & \mu_3\\ \mu_3 & \mu_4 - \sigma^4\end{pmatrix}\right\}$$
in distribution.
Example D.3 Assume that (x_i, y_i)_{i=1}^n are IID draws from (x, y) with mean (µ_x, µ_y). Use x̄/ȳ to estimate γ = µ_x/µ_y. It satisfies the estimating equation with
$$m(x, y, \gamma) = x - \gamma y.$$
Apply Theorem D.1 to obtain B = −µ_y and M = var(x − γy), which imply
$$\sqrt{n}\left(\frac{\bar x}{\bar y} - \frac{\mu_x}{\mu_y}\right)\to N\left\{0,\ \frac{\mathrm{var}(x-\gamma y)}{\mu_y^2}\right\}$$
if µ_y ≠ 0.
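A Monte Carlo sketch of Example D.3, drawing x and y independently (a special case) so that var(x − γy) = 1 + γ² with unit variances:

set.seed(1)
mu.x <- 2; mu.y <- 4; gam <- mu.x / mu.y
n <- 500
est <- replicate(2000, {
  x <- rnorm(n, mu.x, 1); y <- rnorm(n, mu.y, 1)
  mean(x) / mean(y)
})
c(simulated.n.var = n * var(est),
  asymptotic.var  = (1 + gam^2) / mu.y^2)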
−E{log f (y | θ)},
where the expectation is over the true but unknown distribution y ∼ g(y). We can rewrite the population objective function as
$$-\int g(y)\log f(y\mid\theta)\,\mathrm{d}y = \int g(y)\log\frac{g(y)}{f(y\mid\theta)}\,\mathrm{d}y - \int g(y)\log g(y)\,\mathrm{d}y.$$
The first term is called the Kullback–Leibler divergence or relative entropy of g(y) and
f (y | θ), whereas the second term is called the entropy of g(y). The first term depends on
θ whereas the second term does not. Therefore, the targeted parameter of the MLE is the
minimizer of the Kullback–Leibler divergence. By Gibbs’ inequality, the Kullback–Leibler
divergence is non-negative in general and is 0 if g(y) = f (y | θ). Therefore, if the model is
correct, then the true θ indeed minimizes the Kullback–Leibler divergence with minimum
value 0
Example D.4 Assume that $y_1,\ldots,y_n\stackrel{\text{iid}}{\sim}N(\mu, 1)$. The log-likelihood contributed by unit i is
$$\log f(y_i\mid\mu) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}(y_i - \mu)^2,$$
so
$$\frac{\partial\log f(y_i\mid\mu)}{\partial\mu} = y_i - \mu, \qquad \frac{\partial^2\log f(y_i\mid\mu)}{\partial\mu^2} = -1.$$
The MLE is µ̂ = ȳ. If the model is correctly specified, we can use
$$I_n(\hat\mu)^{-1} = n^{-1} \quad\text{or}\quad J_n(\hat\mu)^{-1} = 1\Big/\sum_{i=1}^n(y_i - \hat\mu)^2$$
and
$$\frac{\partial^2\ell_i}{\partial\beta\,\partial\sigma^2} = -\frac{1}{(\sigma^2)^2}x_i(y_i - x_i^t\beta).$$
The MLE of β is the OLS estimator β̂ and the MLE of σ² is σ̃² = Σ_{i=1}^n ε̂_i²/n, where ε̂_i = y_i − x_i^tβ̂ is the residual.
We have
$$I_n(\hat\beta, \tilde\sigma^2) = \mathrm{diag}\left\{\frac{1}{\tilde\sigma^2}\sum_{i=1}^nx_ix_i^t,\ \frac{n}{2(\tilde\sigma^2)^2}\right\}$$
and
$$J_n(\hat\beta, \tilde\sigma^2) = \begin{pmatrix}\dfrac{1}{(\tilde\sigma^2)^2}\sum_{i=1}^n\hat\varepsilon_i^2x_ix_i^t & *\\ * & *\end{pmatrix},$$
where the ∗ terms do not matter for the later calculations. If the Normal linear model is
correctly specified, we can use the (1, 1)th block of In (β̂, σ̃ 2 )−1 as the covariance estimator
for β̂, which equals
$$\tilde\sigma^2\left(\sum_{i=1}^nx_ix_i^t\right)^{-1}.$$
If the Normal linear model is misspecified, we can use the (1, 1)th block of
In (β̂, σ̃ 2 )−1 Jn (β̂, σ̃ 2 )In (β̂, σ̃ 2 )−1 as the covariance estimator for β̂, which equals
$$\left(\sum_{i=1}^nx_ix_i^t\right)^{-1}\left(\sum_{i=1}^n\hat\varepsilon_i^2x_ix_i^t\right)\left(\sum_{i=1}^nx_ix_i^t\right)^{-1}.$$
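A hedged sketch computing this sandwich covariance directly from the formula above on simulated heteroskedastic data (the variable names are hypothetical):

set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
y  <- 1 + x1 - x2 + rnorm(n, sd = 1 + x1^2)   # heteroskedastic errors
fit  <- lm(y ~ x1 + x2)
X    <- model.matrix(fit)
ehat <- resid(fit)
bread <- solve(crossprod(X))                  # { sum x_i x_i^t }^{-1}
meat  <- crossprod(X * ehat)                  # sum ehat_i^2 x_i x_i^t
V.sandwich <- bread %*% meat %*% bread
sqrt(diag(V.sandwich))                        # robust standard errors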
Show that E(Ṽ ) = var(x̄) when x1 , . . . , xn are independent with the same mean µ, and
E(Ṽ ) ≥ var(x̄) when xi ∼ [µi , σi2 ] are independent.
$$\mu_{kl} = E\{(x - \mu_x)^k(y - \mu_y)^l\},$$
for example,
$$\mathrm{var}(x) = \mu_{20}, \quad \mathrm{var}(y) = \mu_{02}, \quad \mathrm{cov}(x, y) = \mu_{11}, \quad \rho = \mu_{11}/\sqrt{\mu_{20}\mu_{02}}.$$
Hint: Use the fact that β = (µ_x, µ_y, µ_{20}, µ_{02}, ρ)^t satisfies the estimating equation with
$$m(x, y, \beta) = \begin{pmatrix}x - \mu_x\\ y - \mu_y\\ (x - \mu_x)^2 - \mu_{20}\\ (y - \mu_y)^2 - \mu_{02}\\ (x - \mu_x)(y - \mu_y) - \rho\sqrt{\mu_{20}\mu_{02}}\end{pmatrix}$$
and the sample moments and Pearson correlation coefficient are the corresponding estima-
tors. You may also find the formula (A.6) useful.
Abadie, A., Imbens, G. W., and Zheng, F. (2014). Inference for misspecified models with
fixed regressors. Journal of the American Statistical Association, 109:1601–1614.
Agresti, A. (2010). Analysis of Ordinal Categorical Data. New York: John Wiley & Sons.
Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. John Wiley &
Sons.
Ai, C. and Norton, E. C. (2003). Interaction terms in logit and probit models. Economics
Letters, 80:123–129.
Aitkin, A. C. (1936). On least squares and linear combination of observations. Proceedings
of the Royal Society of Edinburgh, 55:42–48.
Anton, R. F., O’Malley, S. S., Ciraulo, D. A., Cisler, R. A., Couper, D., Donovan, D. M.,
Gastfriend, D. R., Hosking, J. D., Johnson, B. A., and LoCastro, J. S. (2006). Com-
bined pharmacotherapies and behavioral interventions for alcohol dependence: the COM-
BINE study: a randomized controlled trial. Journal of the American Medical Association,
295:2003–2017.
Bai, Z.-J. and Bai, Z.-Z. (2013). On nonsingularity of block two-by-two matrices. Linear
Algebra and Its Applications, 439:2388–2404.
Baron, R. M. and Kenny, D. A. (1986). The moderator–mediator variable distinction in so-
cial psychological research: Conceptual, strategic, and statistical considerations. Journal
of Personality and Social Psychology, 51:1173.
Berkson, J. (1944). Application of the logistic function to bio-assay. Journal of the American
Statistical Association, 39:357–365.
Berman, M. (1988). A theorem of Jacobi and its generalization. Biometrika, 75:779–783.
Berrington de González, A. and Cox, D. R. (2007). Interpretation of interaction: A review.
Annals of Applied Statistics, 1:371–385.
Buja, A., Brown, L., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K., and Zhao,
L. (2019a). Models as approximations i: Consequences illustrated with linear regression.
Statistical Science, 34:523–544.
Buja, A., Brown, L., Kuchibhotla, A. K., Berk, R., George, E., and Zhao, L. (2019b). Models
as approximations ii: A model-free theory of parametric regression. Statistical Science,
34:545–565.
Burridge, J. (1981). A note on maximum likelihood estimation for regression models using
grouped data. Journal of the Royal Statistical Society: Series B (Methodological), 43:41–
45.
Card, D., Lee, D. S., Pei, Z., and Weber, A. (2015). Inference on causal effects in a gener-
alized regression kink design. Econometrica, 83:2453–2483.
Carpenter, D. P. (2002). Groups, the media, agency waiting costs, and FDA drug approval.
American Journal of Political Science, 46:490–505.
Cinelli, C. and Hazlett, C. (2020). Making sense of sensitivity: Extending omitted variable
bias. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82:39–67.
Cochran, W. G. (1938). The omission or addition of an independent variate in multiple
linear regression. Supplement to the Journal of the Royal Statistical Society, 5:171–176.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics,
19:15–18.
Cox, D. R. (1960). Regression analysis when there is prior information about supplementary
variables. Journal of the Royal Statistical Society: Series B (Methodological), 22(1):172–
176.
Cox, D. R. (1961). Tests of separate families of hypotheses. In Neyman, J., editor, Pro-
ceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability,
pages 105–123.
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical
Society: Series B (Methodological), 34:187–202.
Ding, P. (2021b). Two seemingly paradoxical results in linear models: the variance inflation
factor and the analysis of covariance. Journal of Causal Inference, 9:1–8.
Ding, P. (2023). A First Course in Causal Inference. Chapman and Hall/CRC.
Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant
analysis. Journal of the American Statistical Association, 70:892–898.
Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. In
Cam, L. L. and Neyman, J., editors, Proceedings of the Fifth Berkeley Symposium on
Mathematical Statistics and Probability, pages 59–82. Berkeley, CA: University of Cali-
fornia Press.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and its Applications. London:
CRC Press.
Firth, D. (1988). Multiplicative errors: log-normal or gamma? Journal of the Royal Statis-
tical Society: Series B (Methodological), 50:266–268.
Fisher, L. D. and Lin, D. Y. (1999). Time-dependent covariates in the Cox proportional-
hazards regression model. Annual Review of Public Health, 20:145–157.
Fisher, R. A. (1925a). Statistical Methods for Research Workers. Edinburgh: Oliver and
Boyd, 1st edition.
Fisher, R. A. (1925b). Theory of statistical estimation. In Mathematical Proceedings of the
Cambridge Philosophical Society, volume 22, pages 700–725. Cambridge University Press.
Fisman, R. and Miguel, E. (2007). Corruption, norms, and legal enforcement: Evidence
from diplomatic parking tickets. Journal of Political Economy, 115:1020–1048.
Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2012). Applied Longitudinal Analysis.
New York: John Wiley & Sons.
Fleming, T. R. and Harrington, D. P. (2011). Counting Processes and Survival Analysis.
New York: John Wiley & Sons.
Frank, K. A. (2000). Impact of a confounding variable on a regression coefficient. Sociological
Methods and Research, 29:147–194.
Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression
tools. Technometrics, 35:109–135.
Freedman, D. A. (1981). Bootstrapping regression models. The Annals of Statistics, 9:1218–
1228.
Freedman, D. A. (1983). A note on screening regression equations. The American Statisti-
cian, 37:152–155.
Freedman, D. A. (2006). On the so-called “Huber sandwich estimator” and “robust standard
errors”. American Statistician, 60:299–302.
Freedman, D. A. (2008). Survival analysis: A primer. American Statistician, 62:110–119.
Freedman, D. A. (2009). Statistical Models: Theory and Practice. Cambridge: Cambridge
University Press.
Freedman, D. A. and Diaconis, P. (1982). On inconsistent M-estimators. Annals of Statistics,
10:454–461.
Freedman, D. A., Klein, S. P., Sacks, J., Smyth, C. A., and Everett, C. G. (1991). Ecological
regression and voting rights. Evaluation Review, 15:673–711.
Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate opti-
mization. Annals of Applied Statistics, 1:302–332.
Friedman, J., Hastie, T., and Tibshirani, R. (2009). glmnet: Lasso and elastic-net regularized
generalized linear models. R package version, 1(4).
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33:1–22.
Frisch, R. (1934). Statistical confluence analysis by means of complete regression systems
(vol. 5). Norway, University Institute of Economics.
Frisch, R. and Waugh, F. V. (1933). Partial time regressions as compared with individual
trends. Econometrica, 1:387–401.
Fuller, W. A. (1975). Regression analysis for sample survey. Sankhya, 37:117–132.
Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the
Anthropological Institute of Great Britain and Ireland, 15:246–263.
Geary, R. C. (1936). The distribution of “Student’s” ratio for non-normal samples. Supple-
ment to the Journal of the Royal Statistical Society, 3:178–184.
Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored
samples. Biometrika, 52:203–224.
Gelman, A. and Meng, X.-L. (1991). A note on bivariate distributions that are conditionally
normal. American Statistician, 45:125–126.
Gelman, A. and Park, D. K. (2009). Splitting a predictor at the upper quarter or third and
the lower quarter or third. American Statistician, 63:1–8.
Gelman, A., Park, D. K., Ansolabehere, S., Price, P. N., and Minnite, L. C. (2001). Models,
assumptions and model checking in ecological regressions. Journal of the Royal Statistical
Society: Series A (Statistics in Society), 164:101–118.
Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method
for choosing a good ridge parameter. Technometrics, 21:215–223.
Goodman, L. A. (1953). Ecological regressions and behavior of individuals. American
Sociological Review, 18:663–664.
Goodman, L. A. (1959). Some alternatives to ecological correlation. American Journal of
Sociology, 64:610–625.
Greene, W. H. and Seaks, T. G. (1991). The restricted least squares estimator: A pedagogical
note. The Review of Economics and Statistics, 73:563–567.
Greenwood, M. (1926). The natural duration of cancer. A Report on the Natural Duration
of Cancer., (33).
Guiteras, R., Levinsohn, J., and Mobarak, A. M. (2015). Encouraging sanitation investment
in the developing world: a cluster-randomized trial. Science, 348:903–906.
Hagemann, A. (2017). Cluster-robust bootstrap inference in quantile regression models.
Journal of the American Statistical Association, 112:446–456.
Harrison Jr, D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for
clean air. Journal of Environmental Economics and Management, 5:81–102.
Hoerl, A. E., Kannard, R. W., and Baldwin, K. F. (1975). Ridge regression: some simula-
tions. Communications in Statistics—Theory and Methods, 4:105–123.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthog-
onal problems. Technometrics, 12:55–67.
Huster, W. J., Brookmeyer, R., and Self, S. G. (1989). Modelling paired survival data with
covariates. Biometrics, pages 145–156.
Jann, B. (2008). The Blinder–Oaxaca decomposition for linear regression models. The Stata
Journal, 8:453–479.
Johnson, P. O. and Neyman, J. (1936). Tests of certain linear hypotheses and their appli-
cation to some educational problems. Statistical Research Memoirs, 1:57–93.
Kagan, A. (2001). A note on the logistic link function. Biometrika, 88:599–601.
Kalbfleisch, J. D. and Prentice, R. L. (2011). The Statistical Analysis of Failure Time Data.
New York: John Wiley & Sons.
Keele, L. (2010). Proportionally difficult: testing for nonproportional hazards in Cox models.
Political Analysis, 18:189–205.
King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual
Behavior from Aggregate Data. Princeton: Princeton University Press.
King, G. and Roberts, M. E. (2015). How robust standard errors expose methodological
problems they do not fix, and what to do about it. Political Analysis, 23:159–179.
Koenker, R. and Bassett Jr, G. (1978). Regression quantiles. Econometrica, 46:33–50.
Koopmann, R. (1982). Parameterschätzung bei a-priori-Information. Vandenhoeck &
Ruprecht.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with
experimental data. American Economic Review, 76:604–620.
Lawless, J. F. (1976). A simulation study of ridge and other regression estimators. Com-
munications in Statistics—Theory and Methods, 5:307–323.
Le Cam, L. (1960). An approximation theorem for the Poisson Binomial distribution. Pacific
Journal of Mathematics, 10:1181–1197.
Lehmann, E. L. (1951). A general concept of unbiasedness. The Annals of Mathematical
Statistics, 22:587–592.
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018). Distribution-
free predictive inference for regression. Journal of the American Statistical Association,
113:1094–1111.
Li, J. and Valliant, R. (2009). Survey weighted hat matrix and leverages. Survey Method-
ology, 35:15–24.
Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear
models. Biometrika, 73:13–22.
Lin, D.-Y., Gong, J., Gallo, P., Bunn, P. H., and Couper, D. (2016). Simultaneous inference
on treatment effects in survival studies with factorial designs. Biometrics, 72:1078–1085.
Lin, D. Y. and Wei, L.-J. (1989). The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association, 84:1074–1078.
Long, J. S. and Ervin, L. H. (1998). Correcting for heteroscedasticity with heteroscedasticity
consistent standard errors in the linear regression model: Small sample considerations.
Indiana University, Bloomington, IN, 47405.
Long, J. S. and Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in
the linear regression model. American Statistician, 54:217–224.
Lovell, M. C. (1963). Seasonal adjustment of economic time series and multiple regression
analysis. Journal of the American Statistical Association, 58:993–1010.
Lukacs, E. (1942). A characterization of the normal distribution. The Annals of Mathe-
matical Statistics, 13:91–93.
MacKinnon, J. G. and White, H. (1985). Some heteroskedasticity-consistent covariance ma-
trix estimators with improved finite sample properties. Journal of Econometrics, 29:305–
325.
Magee, L. (1998). Improving survey-weighted least squares regression. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 60:115–126.
Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising in
its consideration. Cancer Chemother. Rep., 50:163–170.
Marshall, A. (1890). Principles of Economics. New York: Macmillan and Company.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Boca Raton: Chapman
and Hall/CRC, second edition.
McCulloch, J. H. (1985). On heteros*edasticity. Econometrica, 53:483.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In Zarembka,
P., editor, Frontiers in Econometrics. Academic Press.
Miller Jr, R. G. (1974). An unbalanced jackknife. The Annals of Statistics, 2:880–891.
Moen, E. L., Fricano-Kugler, C. J., Luikart, B. W., and O’Malley, A. J. (2016). Analyzing
clustered data: why and how to account for multiple observations nested within a study
participant? PLoS One, 11:e0146721.
Monahan, J. F. (2008). A Primer on Linear Models. London: CRC Press.
Nagelkerke, N. (1991). A note on a general definition of the coefficient of determination.
Biometrika, 78:691–692.
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the
Royal Statistical Society: Series A (General), 135:370–384.
Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Engle, R. F. and McFadden, D. L., editors, Handbook of Econometrics, volume IV, pages 2112–2245. Amsterdam: Elsevier.
Oaxaca, R. (1973). Male-female wage differentials in urban labor markets. International
Economic Review, 14:693–709.
Ogasawara, T. and Takahashi, M. (1951). Independence of quadratic quantities in a normal
system. Journal of Science of the Hiroshima University, Series A, 15:1–9.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). On the lasso and its dual. Journal
of Computational and Graphical Statistics, 9:319–337.
Paloyo, A. R. (2014). When did we begin to spell ‘heteros*edasticity’ correctly? Philippine
Review of Economics, LI:162–178.
Pepe, M. S. and Anderson, G. L. (1994). A cautionary note on inference for marginal regres-
sion models with longitudinal data and general correlated response data. Communications
in Statistics—Simulation and Computation, 23:939–951.
Peto, R. and Peto, J. (1972). Asymptotically efficient rank invariant test procedures. Journal
of the Royal Statistical Society: Series A (General), 135:185–198.
Powell, J. L. (1991). Estimation of monotonic regression models under quantile restrictions. In Nonparametric and Semiparametric Methods in Econometrics, pages 357–384.
Pratt, J. W. (1981). Concavity of the log likelihood. Journal of the American Statistical
Association, 76:103–106.
Theil, H. (1971). Principles of Econometrics. New York: John Wiley & Sons.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Methodological), 58:267–288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73:273–282.
Tibshirani, R. J. (2013). The lasso problem and uniqueness. Electronic Journal of Statistics,
7:1456–1490.
Tikhonov, A. N. (1943). On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198.
Titterington, D. M. (2013). Biometrika highlights from volume 28 onwards. Biometrika,
100:17–73.
Tjur, T. (2009). Coefficients of determination in logistic regression models—a new proposal:
The coefficient of discrimination. American Statistician, 63:366–372.
Toomet, O. and Henningsen, A. (2008). Sample selection models in R: Package sampleSelection. Journal of Statistical Software, 27:1–23.
Train, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge: Cambridge
University Press.
Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable
minimization. Journal of Optimization Theory and Applications, 109:475–494.
Tsiatis, A. (2007). Semiparametric Theory and Missing Data. New York: Springer.
Tukey, J. (1958). Bias and confidence in not quite large samples. The Annals of Mathematical Statistics, 29:614.
Turner, E. L., Li, F., Gallis, J. A., Prague, M., and Murray, D. M. (2017a). Review of re-
cent methodological developments in group-randomized trials: part 1—design. American
Journal of Public Health, 107:907–915.
Turner, E. L., Prague, M., Gallis, J. A., Li, F., and Murray, D. M. (2017b). Review of recent
methodological developments in group-randomized trials: part 2—analysis. American
Journal of Public Health, 107:1078–1086.
Zhao, A. and Ding, P. (2022). Regression-based causal inference with factorial experiments:
estimands, model specifications, and design-based properties. Biometrika, 109:799–815.
Zou, G. (2004). A modified Poisson regression approach to prospective studies with binary
data. American Journal of Epidemiology, 159:702–706.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:301–320.