2  Linear Least Squares

The linear model is the main technique in regression problems and the primary
tool for it is least squares fitting. We minimize a sum of squared errors, or
equivalently the sample average of squared errors. That is a natural choice
when we’re interested in finding the regression function which minimizes the
corresponding expected squared error. This chapter presents the basic theory
of linear least squares estimation, looking at it with calculus, linear algebra
and geometry. It also develops some distribution theory for linear least squares
and computational aspects of linear regression. We will draw repeatedly on the
material here in later chapters that look at specific data analysis problems.

2.1 Least squares estimates


We begin with observed response values y1 , y2 , . . . , yn and features zij for i =
1, . . . , n and j = 1, . . . , p. We’re looking for the parameter values β1 , . . . , βp that
minimize
    S(β1, . . . , βp) = Σ_{i=1}^n ( yi − Σ_{j=1}^p zij βj )^2.        (2.1)

The minimizing values are denoted with hats: β̂1 , . . . , β̂p .


To minimize S, we set its partial derivative with respect to each βj to 0.
The solution satisfies
    (∂/∂βj) S = −2 Σ_{i=1}^n ( yi − Σ_{k=1}^p zik β̂k ) zij = 0,        j = 1, . . . , p.        (2.2)


We’ll show later that this indeed gives the minimum, not the maximum or a
saddle point. The p equations in (2.2) are known as the normal equations.
This is due to normal being a synonym for perpendicular or orthogonal, and
not due to any assumption about the normal distribution. Consider the vector
Zj = (z1j, . . . , znj)' ∈ R^n of values for the j'th feature. Equation (2.2) says
that this feature vector has a dot product of zero with the residual vector whose
i'th element is ε̂i = yi − Σ_{j=1}^p zij β̂j. Each feature vector is orthogonal (normal)
to the vector of residuals.
It is worthwhile writing equations (2.1) and (2.2) in matrix notation. Though
the transition from (2.1) to (2.2) is simpler with coordinates, our later manipulations
are easier in vector form. Let's pack the responses and feature values
into the vector Y and matrix Z via
   
    Y = [ y1 ]              [ z11  z12  · · ·  z1p ]
        [ y2 ]              [ z21  z22  · · ·  z2p ]
        [ ⋮  ] ,  and   Z = [  ⋮    ⋮    ⋱     ⋮  ] .
        [ yn ]              [ zn1  zn2  · · ·  znp ]

We also put β = (β1, . . . , βp)' and let β̂ denote the minimizer of the sum of
squares. Finally let ε̂ = Y − Zβ̂, the n vector of residuals.
Now equation (2.1) can be written S(β) = (Y − Zβ)'(Y − Zβ) and the
normal equations (2.2) become

    ε̂'Z = 0        (2.3)

after dividing by −2. These normal equations can also be written

    Z'Z β̂ = Z'Y.        (2.4)

When Z'Z is invertible we find that

    β̂ = (Z'Z)^{-1} Z'Y.        (2.5)

We'll suppose at first that Z'Z is invertible and then consider the singular case
in Section 2.7.
Equation (2.5) underlies another meaning of the word 'linear' in linear regression.
The estimated coefficient β̂ is a fixed linear combination of Y, meaning
that we get it by multiplying Y by the matrix (Z'Z)^{-1}Z'. The predicted value
of Y at any new point x0 with features z0 = φ(x0) is also linear in Y; it is
z0'(Z'Z)^{-1}Z'Y.
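As a concrete check of (2.4) and (2.5), here is a short sketch in Python with NumPy (our own illustration, with simulated data and made-up variable names) that compares the normal-equations solution with a library least squares solver. It is meant only to make the algebra tangible, not to recommend solving the normal equations in practice; see Section 2.9.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # feature matrix with intercept
beta = np.array([1.0, 2.0, -0.5])
y = Z @ beta + rng.normal(scale=0.3, size=n)                    # simulated responses

# Normal equations (2.4): solve Z'Z beta_hat = Z'y, giving (2.5)
beta_hat_ne = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Library least squares solver, for comparison
beta_hat_ls, *_ = np.linalg.lstsq(Z, y, rcond=None)

print(beta_hat_ne)
print(np.allclose(beta_hat_ne, beta_hat_ls))   # True: both minimize S(beta)
```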
Now let’s prove that β̂ = (Z 0 Z)−1 Z 0 Y is in fact the minimizer, not just a
point where the gradient of the sum of squares vanishes. It seems obvious that
a sum of squares like (2.1) cannot have a maximizing β. But we need to rule
out saddle points too, and we’ll also find that β̂ is the unique least squares
estimator. Since we already found an expression for β̂ we prove it is right by
expressing a generic β̃ ∈ R^p as β̂ + (β̃ − β̂) and then expanding S(β̂ + (β̃ − β̂)).
This adding and subtracting technique is often applicable in problems featuring
squared errors.

Theorem 2.1. Let Z be an n × p matrix with Z'Z invertible, and let Y be an
n vector. Define S(β) = (Y − Zβ)'(Y − Zβ) and set β̂ = (Z'Z)^{-1}Z'Y. Then
S(β) > S(β̂) holds whenever β ≠ β̂.

Proof. We know that Z'(Y − Zβ̂) = 0 and will use it below. Let β̃ be any point
in R^p and let γ = β̃ − β̂. Then

    S(β̃) = (Y − Zβ̃)'(Y − Zβ̃)
         = (Y − Zβ̂ − Zγ)'(Y − Zβ̂ − Zγ)
         = (Y − Zβ̂)'(Y − Zβ̂) − γ'Z'(Y − Zβ̂) − (Y − Zβ̂)'Zγ + γ'Z'Zγ
         = S(β̂) + γ'Z'Zγ.

Thus S(β̃) = S(β̂) + ‖Zγ‖^2 ≥ S(β̂). It follows that β̂ is a minimizer of S.
For uniqueness we need to show that γ ≠ 0 implies Zγ ≠ 0. Suppose to the
contrary that Zγ = 0 for some γ ≠ 0. Then Z'Zγ = 0 for γ ≠ 0, which contradicts
the assumption that Z'Z is invertible. Therefore if β̃ ≠ β̂ then

    S(β̃) = S(β̂) + ‖Z(β̃ − β̂)‖^2 > S(β̂).

2.2 Geometry of least squares


Figure xxx shows a sketch to illustrate linear least squares. The vector y =
(y1, . . . , yn)' is represented as a point in R^n. The set

M = {Zβ | β ∈ Rp } (2.6)

is a p dimensional linear subset of Rn . It is fully p dimensional here because


we have assumed that Z 0 Z is invertible and so Z has rank p. Under our model,
E(Y ) = Zβ ∈ M.
The idea behind least squares is to find β̂ so that ŷ = Z β̂ is the closest point
to y from within M. We expect to find this closest point to y by “dropping a
perpendicular line” to M. That is, the error ε̂ = y − ŷ should be perpendicular
to any line within M. From the normal equations (2.3), we have ε̂0 Z = 0 so ε̂
is actually perpendicular to every point Zβ ∈ M. Therefore ε̂ is perpendicular
to every line in M.
We can form a right angle triangle using the three points y, ŷ and Zβ̃ for
any β̃ ∈ R^p. Taking β̃ = 0 and using Pythagoras' theorem we get

    ‖y‖^2 = ‖ε̂‖^2 + ‖Zβ̂‖^2.

When the first column of Z consists of 1s then (ȳ, . . . , ȳ)' = Z(ȳ, 0, . . . , 0)' ∈ M
and Pythagoras' theorem implies

    Σ_{i=1}^n (yi − ȳ)^2 = Σ_{i=1}^n (ŷi − ȳ)^2 + Σ_{i=1}^n (yi − ŷi)^2.        (2.7)

The left side of (2.7) is called the centered sum of squares of the yi. It is
n − 1 times the usual estimate of the common variance of the Yi. The equation
decomposes this sum of squares into two parts. The first is the centered sum of
squares of the fitted values ŷi. The second is the sum of squared residuals yi − ŷi.
When the first column of Z consists of 1s then (1/n) Σ_{i=1}^n ŷi = ȳ; that is,
the fitted values have the same average as the yi. Now the two terms in (2.7) correspond to the sum of squares of the
fitted values ŷi about their mean and the sum of squared residuals. We write
this as SSTOT = SSFIT + SSRES. The total sum of squares is the sum of squared
fits plus the sum of squared residuals.
Some of the variation in Y is 'explained' by the model and some of it is left
unexplained. The fraction explained is denoted by

    R^2 = SSFIT / SSTOT = 1 − SSRES / SSTOT.

The quantity R2 is known as the coefficient of determination. It measures how


well Y is predicted or determined by Z β̂. Its nonnegative square root R is called
the coefficient of multiple correlation. It is one measure of how well the response
Y correlates with the p predictors in Z taken collectively. For the special case
of simple linear regression with zi = (1, xi)' ∈ R^2, R^2 is the square of the usual
Pearson correlation of x and y (Exercise xxx).
Equation (2.7) is an example of an ANOVA (short for analysis of variance)
decomposition. ANOVA decompositions split a variance (or a sum of squares)
into two or more pieces. Not surprisingly there is typically some orthogonality
or the Pythagoras theorem behind them.
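The decomposition (2.7) and the definition of R^2 are easy to check numerically. The sketch below (our own, with simulated data) fits a simple linear regression, verifies SSTOT = SSFIT + SSRES, and confirms that R^2 matches the squared Pearson correlation in the single-predictor case.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, size=n)
Z = np.column_stack([np.ones(n), x])             # intercept plus one predictor
y = 3.0 + 0.8 * x + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_hat = Z @ beta_hat

ss_tot = np.sum((y - y.mean())**2)               # centered sum of squares
ss_fit = np.sum((y_hat - y.mean())**2)           # sum of squares of the fits
ss_res = np.sum((y - y_hat)**2)                  # sum of squared residuals

print(np.isclose(ss_tot, ss_fit + ss_res))       # ANOVA identity (2.7)
r2 = 1.0 - ss_res / ss_tot
print(np.isclose(r2, np.corrcoef(x, y)[0, 1]**2))  # equals squared Pearson correlation
```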

2.3 Algebra of least squares


The predicted value for yi, using the least squares estimates, is ŷi = zi'β̂. We
can write the whole vector of fitted values as ŷ = Zβ̂ = Z(Z'Z)^{-1}Z'Y. That is
ŷ = Hy where

    H = Z(Z'Z)^{-1}Z'.
Tukey coined the term “hat matrix” for H because it puts the hat on y. Some
simple properties of the hat matrix are important in interpreting least squares.
By writing H^2 = HH out fully and cancelling we find H^2 = H. A matrix
H with H^2 = H is called idempotent. In hindsight, it is geometrically obvious
that we should have had H^2 = H. For any y ∈ R^n the closest point to y inside
of M is Hy. Since Hy is already in M, H(Hy) = Hy. That is H^2 y = Hy for
any y and so H^2 = H. Clearly H^k = H for any integer k ≥ 1.
The matrix Z 0 Z is symmetric, and so therefore is (Z 0 Z)−1 . It follows that
the hat matrix H is symmetric too. A symmetric idempotent matrix such as H
is called a perpendicular projection matrix.

Theorem 2.2. Let H be a symmetric idempotent real valued matrix. Then the
eigenvalues of H are all either 0 or 1.

Proof. Suppose that x is an eigenvector of H with eigenvalue λ, so Hx = λx.


Because H is idempotent H 2 x = Hx = λx. But we also have H 2 x = H(Hx) =
H(λx) = λ2 x. Therefore λx = λ2 x. The definition of eigenvector does not allow
x = 0 and so we must have λ2 = λ. Either λ = 0 or λ = 1.
Let H = PΛP' where the columns of P are eigenvectors pi of H for i =
1, . . . , n. Then H = Σ_{i=1}^n λi pi pi', where by Theorem 2.2 each λi is 0 or 1. With
no loss of generality, we can arrange for the ones to precede the zeros. Suppose
that there are r ones. Then H = Σ_{i=1}^r pi pi'. We certainly expect r to equal p
here. This indeed holds. The eigenvalues of H sum to r, so tr(H) = r. Also
tr(H) = tr(Z(Z'Z)^{-1}Z') = tr(Z'Z(Z'Z)^{-1}) = tr(Ip) = p. Therefore r = p and
H = Σ_{i=1}^p pi pi' where the pi are mutually orthogonal n vectors.
The prediction for observation i can be written as ŷi = Hi y where Hi is
the i'th row of the hat matrix. The i'th row of H is simply zi'(Z'Z)^{-1}Z' and
the ij element of the hat matrix is Hij = zi'(Z'Z)^{-1}zj. Because Hij = Hji the
contribution of yi to ŷj equals that of yj to ŷi. The diagonal elements of the
hat matrix will prove to be very important. They are

    Hii = zi'(Z'Z)^{-1}zi.        (2.8)

We are also interested in the residuals ε̂i = yi − ŷi . The entire vector of
residuals may be written ε̂ = y − ŷ = (I − H)y. It is easy to see that if
H is a perpendicular projection matrix, then so is I − H. Symmetry is trivial, and (I − H)(I − H) =
I − H − H + HH = I − H.
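The following NumPy sketch (ours, with random data) forms the hat matrix explicitly and checks the properties just described: symmetry, idempotence, tr(H) = p, ŷ = Hy, and the bound 0 ≤ Hii ≤ 1 that is proved after Theorem 2.3. Forming H explicitly is fine for a small illustration, though one would not do it for large n.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
Z = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)            # hat matrix H = Z (Z'Z)^{-1} Z'

print(np.allclose(H, H.T))                       # symmetric
print(np.allclose(H @ H, H))                     # idempotent: H^2 = H
print(np.isclose(np.trace(H), p))                # tr(H) = p
beta_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
print(np.allclose(H @ y, Z @ beta_hat))          # Hy equals the fitted values
h = np.diag(H)
print(h.min() >= 0 and h.max() <= 1)             # leverages lie in [0, 1]
```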
The model space M = {Zβ | β ∈ R^p} is the set of linear combinations of
columns of Z. A typical element of M is Σ_{j=1}^p βj Zj. Thus M is what is known
as the column space of Z, denoted col(Z). The hat matrix H projects vectors
onto col(Z). The set of vectors orthogonal to the columns of Z, that is with Z'v = 0, is a linear
subspace of R^n, known as the null space of Z'. We may write it as M⊥ or as
null(Z'). The column space of Z and the null space of Z' are orthogonal complements. Any
v ∈ R^n can be uniquely written as v1 + v2 where v1 ∈ M and v2 ∈ M⊥. In
terms of the hat matrix, v1 = Hv and v2 = (I − H)v.

2.4 Distributional results


We continue to suppose that X and hence Z is a nonrandom matrix and that
Z has full rank. The least squares model has Y = Zβ + ε. Then

    β̂ = (Z'Z)^{-1}Z'Y = (Z'Z)^{-1}(Z'Zβ + Z'ε) = β + (Z'Z)^{-1}Z'ε.        (2.9)

The only randomness in the right side of (2.9) comes from ε. This makes it easy
to work out the mean and variance of β̂ in the fixed x setting.
Lemma 2.1. If Y = Zβ + ε where Z is nonrandom and Z 0 Z is invertible and
E(ε) = 0, then

E(β̂) = β. (2.10)

Proof. E(β̂) = β + (Z 0 Z)−1 Z 0 E(ε) = β.

The least squares estimates β̂ are unbiased for β as long as ε has mean
zero. Lemma 2.1 does not require normally distributed errors. It does not even
make any assumptions about var(ε). To study the variance of β̂ we will need
assumptions on var(ε), but not on its mean.
Lemma 2.2. If Y = Zβ + ε where Z is nonrandom and Z 0 Z is invertible and
var(ε) = σ 2 I, for 0 ≤ σ 2 < ∞, then

var(β̂) = (Z 0 Z)−1 σ 2 . (2.11)

Proof. Using equation (2.9),

    var(β̂) = (Z'Z)^{-1}Z' var(ε) Z(Z'Z)^{-1}
           = (Z'Z)^{-1}Z'(σ^2 I)Z(Z'Z)^{-1}
           = (Z'Z)^{-1}σ^2,        (2.12)

after a particularly nice cancellation.


We will look at and interpret equation (2.12) for many specific linear mod-
els. For now we notice that it holds without regard to whether ε is normally
distributed or even whether it has mean 0.
Up to now we have studied the mean and variance of the estimate of β. Next
we turn our attention to σ 2 . The key to estimating σ 2 is the residual vector ε̂.
Lemma 2.3. Under the conditions of Lemma 2.2, E(ε̂'ε̂) = (n − p)σ^2. For
p < n the estimate

    s^2 = (1/(n − p)) Σ_{i=1}^n (Yi − zi'β̂)^2        (2.13)

satisfies E(s^2) = σ^2.
Proof. Recall that ε̂ = (I − H)ε. Therefore E(ε̂'ε̂) = E(ε'(I − H)ε) = tr((I −
H)Iσ^2) = (n − p)σ^2. Finally s^2 = ε̂'ε̂/(n − p).
Now we add in the strong assumption that ε ∼ N (0, σ 2 I). Normally dis-
tributed errors make for normally distributed least squares estimates, fits and
residuals.
Theorem 2.3. Suppose that Z is an n by p matrix of real values, Z 0 Z is
invertible, and Y = Zβ + ε where ε ∼ N (0, σ 2 I). Then

β̂ ∼ N (β, (Z 0 Z)−1 σ 2 ),
ŷ ∼ N (Zβ, Hσ 2 ),
ε̂ ∼ N (0, (I − H)σ 2 ),

where H = Z(Z 0 Z)−1 Z 0 . Furthermore, ε̂ is independent of β̂ and of ŷ.



Proof. Consider the vector

    v = [ β̂ ]   [ (Z'Z)^{-1}Z' ]
        [ ŷ ] = [      H       ] Y   ∈ R^{p+2n}.
        [ ε̂ ]   [    I − H     ]

The response Y has a multivariate normal distribution and so therefore does v.


Hence each of β̂, ŷ, and ε̂ is multivariate normal.
The expected value of v is

    E(v) = [ (Z'Z)^{-1}Z' ]        [    β     ]   [ β  ]
           [      H       ] Zβ  =  [   HZβ    ] = [ Zβ ]
           [    I − H     ]        [ (I−H)Zβ  ]   [ 0  ]

because HZ = Z, establishing the means listed above for β̂, ŷ and ε̂.
The variance of β̂ is (Z 0 Z)−1 σ 2 by Lemma 2.2. The variance of ŷ is var(ŷ) =
Hvar(Y )H 0 = H(σ 2 I)H = Hσ 2 because H is symmetric and idempotent. Sim-
ilarly ε̂ = (I − H)(Zβ + ε) = (I − H)ε and so var(ε̂) = (I − H)(σ 2 I)(I − H)0 =
(I − H)σ 2 because I − H is symmetric and idempotent.
We have established the three distributions displayed but have yet to prove
the claimed independence. To this end

cov(β̂, ε̂) = (Z 0 Z)−1 Z 0 (σ 2 I)(I − H) = (Z 0 Z)−1 (Z − HZ)0 σ 2 = 0

because Z = HZ. Therefore β̂ and ε̂ are uncorrelated and hence independent


because v has a normal distribution. Similarly cov(ŷ, ε̂) = H(I − H)0 σ 2 =
(H − HH 0 )σ 2 = 0 establishing the second and final independence claim.
We can glean some insight about the hat matrix from Theorem 2.3. First,
because var(ŷi) = σ^2 Hii we have Hii ≥ 0. Next, because var(ε̂i) = σ^2(1 − Hii)
we have 1 − Hii ≥ 0. Therefore 0 ≤ Hii ≤ 1 holds for i = 1, . . . , n.
Theorem 2.4. Assume the conditions of Theorem 2.3, and also that p < n.
Let s^2 = (1/(n − p)) Σ_{i=1}^n (Yi − zi'β̂)^2. Then

    (n − p)s^2 ∼ σ^2 χ^2_(n−p).        (2.14)

Proof. First

    (n − p)s^2 = (Y − Zβ̂)'(Y − Zβ̂) = Y'(I − H)Y = ε'(I − H)ε.

The matrix I − H can be written I − H = PΛP' where P is orthogonal and
Λ = diag(λ1, . . . , λn). From I − H = (I − H)^2 we get λi ∈ {0, 1}. Let ε̃ = P'ε ∼
N(0, σ^2 I). Then ε'(I − H)ε = ε̃'Λε̃ ∼ σ^2 χ^2_(k) where k is the number of λi equal to 1,
that is k = tr(I − H). Therefore

    k = n − tr(H) = n − tr(Z(Z'Z)^{-1}Z') = n − tr((Z'Z)^{-1}Z'Z) = n − tr(Ip) = n − p.



Equation (2.14) still holds trivially, even if σ = 0. The theorem statement


excludes the edge case with n = p because s2 is then not well defined (zero over
zero).
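A quick Monte Carlo (ours, not from the text) makes Lemma 2.3 and Theorem 2.4 tangible: simulate Y = Zβ + ε repeatedly with normal errors, and check that s^2 averages to σ^2 and that (n − p)s^2/σ^2 has the mean and variance of a χ^2_(n−p) variable.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 25, 3, 2.0
Z = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0, 0.5])

s2_draws = []
for _ in range(5000):
    y = Z @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
    r = y - Z @ beta_hat
    s2_draws.append(r @ r / (n - p))
s2_draws = np.array(s2_draws)

print(s2_draws.mean(), sigma**2)          # E(s^2) is close to sigma^2
scaled = (n - p) * s2_draws / sigma**2    # should behave like chi^2 with n-p degrees of freedom
print(scaled.mean(), n - p)               # chi^2 mean is n-p
print(scaled.var(), 2 * (n - p))          # chi^2 variance is 2(n-p)
```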
Very often we focus our attention on a specific linear combination of the
components of β. We write such a combination as cβ where c is a 1 × p row
vector. Taking c = (0, . . . , 0, 1, 0, . . . , 0) with the 1 in the j'th place gives cβ = βj.
If for instance cβ = βj = 0, then we question whether feature j helps us predict Y.
More generally, parameter combinations like β2 − β1 or β1 − 2β2 + β3 can be
studied by making a judicious choice of c. Taking c = z0' makes cβ the expected
value of Y when the feature vector is set to z0, and taking c = z̃0' − z0' lets us
study the difference between E(Y) at features z̃0 and z0.
For any such vector c, Theorem 2.3 implies that cβ̂ ∼ N (cβ, c(Z 0 Z)−1 c0 σ 2 ).
We ordinarily don’t know σ 2 but can estimate it via s2 from (2.13). In later
chapters we will test whether cβ takes a specific value cβ0 , and form confidence
intervals for cβ using Theorem 2.5.

Theorem 2.5. Suppose that Y ∼ N(Zβ, σ^2 I) where Z is an n × p matrix, Z'Z
is invertible, and σ > 0. Let β̂ = (Z'Z)^{-1}Z'Y and s^2 = (1/(n − p))(Y − Zβ̂)'(Y − Zβ̂).
Then for any nonzero 1 × p (row) vector c,

    (cβ̂ − cβ) / ( s √( c(Z'Z)^{-1}c' ) ) ∼ t_(n−p).        (2.15)
Proof. Let U = (cβ̂ − cβ)/(σ √( c(Z'Z)^{-1}c' )). Then U ∼ N(0, 1). Let V = s^2/σ^2,
so V ∼ χ^2_(n−p)/(n − p). Now U is a function of β̂ and V is a function of ε̂.
Therefore U and V are independent. Finally

    (cβ̂ − cβ) / ( s √( c(Z'Z)^{-1}c' ) )  =  [ (cβ̂ − cβ)/(σ √( c(Z'Z)^{-1}c' )) ] / ( s/σ )  =  U/√V ∼ t_(n−p)

by the definition of the t distribution.

Theorem 2.5 is the basis for t tests and related confidence intervals for linear
models. The recipe is to take the estimate cβ̂, subtract the hypothesized parameter
value cβ, and then divide by an estimate of the standard deviation of cβ̂.
Having gone through these steps we're left with a t_(n−p) distributed quantity.
We will use those steps repeatedly in a variety of linear model settings.
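For example, the sketch below (ours, using scipy.stats with simulated data and an arbitrary choice of c) computes the t statistic of Theorem 2.5 for a single coefficient and the corresponding 95% confidence interval for cβ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 40, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.5, 0.0])
y = Z @ beta + rng.normal(scale=0.7, size=n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
r = y - Z @ beta_hat
s2 = r @ r / (n - p)

c = np.array([0.0, 1.0, 0.0])                         # picks out the first slope coefficient
se = np.sqrt(s2 * (c @ np.linalg.solve(Z.T @ Z, c)))  # s * sqrt(c (Z'Z)^{-1} c')
t_stat = (c @ beta_hat - 0.0) / se                    # test H0: c beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)
half_width = stats.t.ppf(0.975, df=n - p) * se
print(t_stat, p_value)
print(c @ beta_hat - half_width, c @ beta_hat + half_width)  # 95% interval for c beta
```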
Sometimes Theorem 2.5 is not quite powerful enough. We may have r dif-
ferent linear combinations of β embodied in an r × p matrix C and we wish to
test whether Cβ = Cβ0 . We could test r rows of C individually, but testing
them all at once is different and some problems will require such a simultaneous
test. Theorem 2.6 is an r dimensional generalization of Theorem 2.5.

Theorem 2.6. Suppose that Y ∼ N(Zβ, σ^2 I) where Z is an n × p matrix, Z'Z
is invertible, and σ > 0. Let β̂ = (Z'Z)^{-1}Z'Y and s^2 = (1/(n − p))(Y − Zβ̂)'(Y − Zβ̂).
Then for any r × p matrix C with linearly independent rows,

    (1/r) (β̂ − β)'C' ( C(Z'Z)^{-1}C' )^{-1} C(β̂ − β) / s^2 ∼ F_{r,n−p}.        (2.16)
Proof. Let U = C(β̂ − β)/σ. Then U ∼ N(0, Σ) where Σ = C(Z'Z)^{-1}C'
is a nonsingular r × r matrix, and so U'Σ^{-1}U ∼ χ^2_(r). Let V = s^2/σ^2, so
V ∼ χ^2_(n−p)/(n − p). The left side of (2.16) is (1/r)U'Σ^{-1}U ∼ χ^2_(r)/r divided
by V ∼ χ^2_(n−p)/(n − p). These two χ^2 random variables are independent
because β̂ is independent of s^2.
To use Theorem 2.6 we formulate a hypothesis H0: Cβ = Cβ0, plug β = β0
into the left side of equation (2.16) and reject H0 at the level α if the result is
larger than F^{1−α}_{r,n−p}, the 1 − α quantile of the F_{r,n−p} distribution.
If we put r = 1 in (2.16) and write C as c then it simplifies to

    (cβ̂ − cβ)^2 / ( s^2 c(Z'Z)^{-1}c' ) ∼ F_{1,n−p}.

We might have expected to recover (2.15). In fact, the results are strongly
related. The statistic on the left of (2.16) is the square of the one on the left
of (2.15). The F1,n−p distribution on the right of (2.16) is the square of the
t(n−p) distribution on the right of (2.15). The difference is only that (2.15)
explicitly takes account of the sign of cβ̂ − cβ while (2.16) does not.

2.5 R^2 and the extra sum of squares


Constructing the matrix C and plugging it into Theorem 2.6 is a little awkward.
In practice there is a simpler and more direct way to test such hypotheses. Suppose
that we want to test the hypothesis that some given subset of the components
of β are all 0. Then E(Y) = Z̃γ where Z̃ has q < p of the columns of Z and
γ ∈ R^q contains coefficients for just those q columns. All we have to do is run
the regression both ways, first on Z and then on Z̃, and from the resulting sums
of squared errors we can form a test of the submodel.
Let us call E(Y) = Zβ the full model and E(Y) = Z̃γ the submodel.
Their sums of squared residuals are SSFULL and SSSUB respectively. Because
the full model can reproduce any fit that the submodel can, we always
have SSFULL ≤ SSSUB. But if the sum of squares increases too much under the
submodel, then that is evidence against the submodel. The extra sum of squares
is SSEXTRA = SSSUB − SSFULL. We reject the submodel when SSEXTRA/SSFULL
is too large to have arisen by chance.
Theorem 2.7. Suppose that Y ∼ N(Zβ, σ^2 I) where Z is an n × p matrix, Z'Z
is invertible, and σ > 0. Let Z̃ be an n × q submatrix of Z with q < p.
Let the full model have least squares estimate β̂ = (Z'Z)^{-1}Z'Y and sum of
squared errors SSFULL = (Y − Zβ̂)'(Y − Zβ̂). Similarly let the submodel have
least squares estimate γ̂ = (Z̃'Z̃)^{-1}Z̃'Y and sum of squared errors SSSUB =
(Y − Z̃γ̂)'(Y − Z̃γ̂).
If the submodel E(Y) = Z̃γ holds for some γ ∈ R^q then

    [ (SSSUB − SSFULL)/(p − q) ] / [ SSFULL/(n − p) ] ∼ F_{p−q,n−p}.        (2.17)

Proof. Let H = Z(Z'Z)^{-1}Z' and H̃ = Z̃(Z̃'Z̃)^{-1}Z̃' be the hat matrices for
the full model and submodel respectively. Then SSFULL = Y'(I − H)Y ∼
σ^2 χ^2_rank(I−H) because I − H is symmetric and idempotent. The extra sum of
squares is SSSUB − SSFULL = Y'(I − H̃)Y − Y'(I − H)Y = Y'(H − H̃)Y.
The matrix H − H̃ is symmetric. For any Y ∈ R^n we get HH̃Y = H̃Y
because H̃Y is in the column space of Z̃ and hence also in that of Z. Therefore
HH̃ = H̃ and by transposition H̃H = H̃ too. Therefore (H − H̃)(H − H̃) =
H − HH̃ − H̃H + H̃ = H − H̃. Thus H − H̃ is idempotent too. Therefore
SSSUB − SSFULL ∼ σ^2 χ^2_rank(H−H̃).
The matrices I − H and H − H̃ are orthogonal, that is (I − H)(H − H̃) = 0. It follows that
(I − H)Y and (H − H̃)Y are independent and therefore SSFULL is independent of
SSSUB − SSFULL. We have shown that

    [ (SSSUB − SSFULL)/rank(H − H̃) ] / [ SSFULL/rank(I − H) ] ∼ F_{rank(H−H̃), rank(I−H)}.

To complete the proof we note that when Z has full rank p then so does H. Also
Z̃ has rank q and so does H̃. For symmetric idempotent matrices like I − H
and H − H̃ the rank equals the trace. Then tr(I − H) = n − tr(H) = n − p
and tr(H − H̃) = tr(H) − tr(H̃) = p − q, completing the proof.

There is a geometric interpretation of equation (2.17). If Y = Z̃γ + ε with
ε ∼ N(0, σ^2 I) then we can write Y as YSUB + YFULL + YRES where YSUB is
in the q dimensional space spanned by the columns of Z̃, YFULL is in the p − q
dimensional space of vectors orthogonal to Z̃ but in the column space of Z, and
YRES is in the n − p dimensional space of vectors orthogonal to Z. A spherical
Gaussian vector like ε shows no preference for any of these spaces. The mean
squared norm of its projection onto each space is just σ^2 times the dimension of
the space. The F test is measuring whether the dimensions corresponding to
YFULL are getting more than their share of the projection.
In an important special case, the full model contains an intercept and the
submodel has only the intercept. Then SSSUB = Σ_{i=1}^n (yi − ȳ)^2 = SSTOT. The ANOVA
decomposition in equation (2.7) then gives SSSUB = SSFIT + SSFULL, so the extra
sum of squares is SSFIT. The F test then
rejects the submodel in favor of the full model when R^2 is large enough.
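Here is a sketch (ours, with simulated data) of the extra sum of squares F test of Theorem 2.7: fit the full model and the submodel, form the statistic (2.17), and compare it to the F_{p−q,n−p} distribution via scipy.stats.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 1.0 * x1 + 0.0 * x2 + rng.normal(size=n)   # x2 is truly irrelevant here

Z_full = np.column_stack([np.ones(n), x1, x2])        # full model, p = 3 columns
Z_sub = np.column_stack([np.ones(n), x1])             # submodel,   q = 2 columns

def rss(Z, y):
    """Residual sum of squares from the least squares fit of y on Z."""
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    r = y - Z @ b
    return r @ r

ss_full, ss_sub = rss(Z_full, y), rss(Z_sub, y)
p, q = Z_full.shape[1], Z_sub.shape[1]
F = ((ss_sub - ss_full) / (p - q)) / (ss_full / (n - p))   # statistic (2.17)
p_value = stats.f.sf(F, p - q, n - p)
print(F, p_value)   # a large p-value is consistent with dropping x2
```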

2.6 Random predictors


Here we consider the 'random X, random Y' setting. A conditioning argument
says that we can treat each Xi as if it were a fixed quantity equal to the observed
value xi, and then hints that this analysis is better than a random X analysis
even when the Xi are random. It is certainly a simpler analysis. A reader who
accepts the conditioning argument may want to skip this section. A reader who
is curious to see what happens with random X should read on.
If (Xi , Yi ) are IID pairs then so are (Zi , Yi ). We will need the expected value
of Y given that X = x. Call it µ(x). Similarly define σ 2 (x) = var(Y | X = x).
The least squares value of β is the one that minimizes E((Y1 − Z1'β)^2), where
we've picked observation i = 1 for definiteness. Using very similar methods to
the least squares derivation we get

    E( Z1(Y1 − Z1'β) ) = 0

as an equation defining β. In other words we have a population version of the


normal equations. The error Yi − Zi0 β is orthogonal to the feature vector Zi in
that their product has expectation zero.
As before we can rearrange the normal equations to get

    β = E(Z1 Z1')^{-1} E(Z1 Y1),

for the case where E(Z1 Z1') is invertible. The least squares estimator can, for
nonsingular Z'Z, be written

    β̂ = ( (1/n) Σ_{i=1}^n Zi Zi' )^{-1} ( (1/n) Σ_{i=1}^n Zi Yi ).

In other words the least squares estimator is obtained by plugging in estimates


of the numerator and denominator for β.
Now E(β̂) = E((Z 0 Z)−1 Z 0 Y ). When X is random then so is Z. The
expectation does not simplify to E((Z 0 Z)−1 ) × E(Z 0 Y ) because expectations of
products are not generally the corresponding products of expectations. Even
if it did E((Z 0 Z)−1 ) does not simplify to E(Z1 Z10 )−1 /n because an expected
inverse is not the inverse of the expectation. We cannot cancel our way to the
answer we want.
We will condition on X1 , . . . , Xn . Let X represent X1 , . . . , Xn . Conditionally
on X the Xi as well as Z and the Zi are fixed. Conditioning on X and then
working purely formally, we get

E(β̂) = E(E(β̂ | X ))
= E((Z 0 Z)−1 Z 0 E(Y | X ))
= E((Z 0 Z)−1 Z 0 µ(X)).

where µ(X) is the random vector with i’th element E(Y | X = Xi ).



So far we have not connected the function µ(x) to the linear regression model.
To make this case comparable to the fixed x, random Y case, let's suppose that
µ(x) = z'β where z = φ(x), and also that σ^2(x) = σ^2 is constant. From the first
of these assumptions we find

    E(β̂) = E( (Z'Z)^{-1}Z'Zβ ).        (2.18)

Equation (2.18) is pretty close to E(β̂) = β. If Z'Z has probability 0 of
being singular then E(β̂) = β as required. Even if Z'Z is very nearly but not
quite singular, equation (2.18) does not imply trouble. For example, even if

    Z'Z = [ 1  0 ]
          [ 0  ε ]

for small ε > 0, we still have (Z'Z)^{-1}Z'Zβ = β.
The biggest worry is that Z'Z might sometimes be singular, or so close to
singular numerically that an algorithm treats it as singular. It can be singular
even when E(Zi Zi') is not singular. Suppose for example that X = 1 with
probability 1/2 and 0 with probability 1/2 and that the regression has features
φ(X) = (1, X)'. Then with probability 2^{−n} all the xi equal 1 and Z'Z is singular.
The xi might all be 0 as well, so the probability of a singular Z'Z is 2^{1−n}. If
the Xi's are discrete and IID there is always some nonzero probability, though
perhaps a very tiny one, that Z'Z is singular. Then β̂ exists with probability
1 − 2^{1−n} and fails to exist with probability 2^{1−n}. If that happened we'd probably
drop feature 2, and implicitly or explicitly use β̂ = (Ȳ, 0)'. Then we have a bias
that is exponentially small in n. It may be a nuisance to have an algorithm
that sometimes behaves qualitatively differently, but such a small bias in and
of itself will seldom have practical importance.
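The 2^{1−n} calculation is easy to confirm by simulation. The sketch below (ours) draws binary X samples, counts how often Z'Z is singular, and compares the frequency with 2^{1−n}.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 6, 100_000            # n kept small so that singular samples actually occur
singular = 0
for _ in range(reps):
    x = rng.integers(0, 2, size=n)          # X is 0 or 1, each with probability 1/2
    Z = np.column_stack([np.ones(n), x])    # features (1, X)
    # Z'Z is singular exactly when the x column is constant
    if np.linalg.matrix_rank(Z.T @ Z) < 2:
        singular += 1

print(singular / reps, 2.0**(1 - n))        # empirical frequency vs 2^{1-n}
```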
The upshot is that when the linear model properly matches µ(x) and Z 0 Z is
never singular, then bias is either absent or negligible for the random X random
Y setting.
Let’s suppose that the possibility of singular Z 0 Z can be neglected. Then
keeping in mind that we have assumed var(Y | X = x) = σ 2 for all x,

    var(β̂) = var( E(β̂ | X) ) + E( var(β̂ | X) )
           = E( var(β̂ | X) )
           = E( (Z'Z)^{-1}Z'(σ^2 I)Z(Z'Z)^{-1} )
           = E( (Z'Z)^{-1} ) σ^2.        (2.19)


The variance is similar to the expression (Z 0 Z)−1 σ 2 that we get in the ’fixed X
random Y’ case.
The variance in the random X case uses the expected value of (Z 0 Z)−1 while
in the fixed case we got the actual observed value of (Z 0 Z)−1 . A ‘larger’ matrix
Z 0 Z gives us a more accurate estimate of β in the fixed X case. In the fixed X
setting we’re using the Z 0 Z we got rather than taking account of other values it
could have taken. It seems to be better to take account of the actual accuracy
we got than to factor in accuracy levels we might have had.
An analogy due to David Cox likens this to weighing an object on either an
accurate scale or a very poor one. Suppose that we have an object with weight ω.
We toss a fair coin, and if it comes up heads we use the good scale and get a
value Y ∼ N(ω, 1). If the coin comes up tails we get Y ∼ N(ω, 100). So if we
get heads then we could take a 95% confidence interval to be Y ± 1.96, but then
if we get tails we take Y ± 19.6 instead. Or we could ignore the coin and use
Y ± 16.45, because Pr(|Y − ω| ≤ 16.45) ≈ 0.95 when the coin is ignored. It seems pretty clear
that if we know how the coin toss came out then we should take account of it
and not use Y ± 16.45.
The matrix Z 0 Z is analogous to the coin. It affects the precision of our
answer but we assume it does not give us information on β. A quantity that
affects the precision of our answers without carrying information on them is said
to be “ancillary”, meaning secondary or subordinate.
In some settings n itself is random. The ancillarity argument says that we
should prefer to use the actual sample size we got rather than take account of
others we might have gotten but didn’t.
The weighing example and the random sample size example are constructed to
be extreme to make the point. If the scales had variances 1 and 1.001 we might
not bother so much with conditioning on the coin. If sampling fluctuations in
Z'Z are not so large then we might prefer to treat it as fixed most of the time,
but switch if treating it as random offered some advantage. For instance, some
bootstrap and cross-validatory analyses are simpler to present with random X's.

2.7 Rank deficient features


Sometimes Z 0 Z has no inverse and so (Z 0 Z)−1 Z 0 Y is not available as an esti-
mate. This will happen if one of the columns of Z is a linear combination of
the others. In fact that is the only way that a singular Z 0 Z matrix can arise:

Theorem 2.8. If Z 0 Z is not invertible then one column of Z is a linear com-


bination of the others.

Proof. If Z'Z is not invertible then Z'Zv = 0 for some nonzero v ∈ R^p. Then
v'Z'Zv = 0 too. But v'Z'Zv = ‖Zv‖^2, and so Zv = 0. Let vj be a nonzero
element of v. Then Zj = −Σ_{k≠j} Zk vk / vj.

There is more than one way that we could find one of our features to be
a linear combination of the others. Perhaps we have measured temperature
in degrees Fahrenheit as well as Celsius. Recalling that F = 1.8C + 32, we
will find linear dependence in Z if, as usual, the model includes an intercept.
Similarly, if a financial model makes features out of sales from every region and
also the total sales over all regions, then the 'total' column within Z will be the sum
of the regional ones. Finally a row of Z might take the form (1, 1M, 1F, . . . )' where
1M and 1F are indicator functions of male and female subjects respectively and
the dots represent additional features.
A singular matrix Z is often a sign that we have made a mistake in encoding
our regression. In that case we fix it by removing redundant features until Z 0 Z
is invertible, and then using methods and results for the invertible case. The
rest of this section describes settings where we might actually prefer to work
with a singular encoding.
One reason to work with a redundant encoding is for symmetry. We’ll see
examples of that in Chapter ?? on the one way analysis of variance. Another
reason is that we might be planning to do a large number of regressions, say
one for every gene in an organism, one for every customer of a company, or
one per second in an electronic device. These regressions might be all on the
same variables but some may be singular as a consequence of the data sets
they're given. It may not always be the same variable that is responsible for the
singularity. Also, no matter what the features are, we are guaranteed to have a
non-invertible Z'Z if p > n. When we cannot tweak each regression individually
to remove redundant features, then we need a graceful way to do regressions
with or without linearly dependent columns.
In the singular setting M = {Zβ | β ∈ R^p} is an r dimensional subset of R^n
where r < p. We have an r dimensional space indexed by a p dimensional vector
β. There is thus a p − r dimensional set of labels β for each point in M. Suppose
for example that Zi1 = 1, and that Zi2 = Fi and Zi3 = Ci are temperatures in
degrees Fahrenheit and Celsius for data points i = 1, . . . , n. Since Fi − 1.8Ci − 32 = 0,
for any t ∈ R,

    β0 + β1 Fi + β2 Ci = (β0 − 32t) + (β1 + t)Fi + (β2 − 1.8t)Ci.

If Yi is near to β0 + β1 Fi + β2 Ci then it is equally near to (β0 − 32t) + (β1 + t)Fi + (β2 − 1.8t)Ci for any value of t.
The problem with rank deficient Z is not, as we might have feared, that
there is no solution to Z'Zβ̂ = Z'y. It is that there is an infinite family of
solutions. That family is a translate of a linear subspace of R^p. A reasonable approach is to
simply choose the shortest solution vector. That is, we find the unique minimizer
of β'β subject to the constraint Z'Zβ = Z'y and we call it β̂. We'll see how to
do this numerically in Section 2.9.
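As a numerical illustration (ours, reusing the Fahrenheit/Celsius example), the sketch below builds a rank deficient Z and shows that numpy.linalg.lstsq and the pseudoinverse both return the shortest least squares solution, while the fitted values ŷ = Zβ̂ are unique.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20
c_temp = rng.uniform(0, 30, size=n)                 # Celsius
f_temp = 1.8 * c_temp + 32                          # Fahrenheit: exact linear dependence
Z = np.column_stack([np.ones(n), f_temp, c_temp])   # p = 3 columns but rank 2
y = 5.0 + 0.1 * c_temp + rng.normal(scale=0.5, size=n)

print(np.linalg.matrix_rank(Z))                     # 2 < p, so Z'Z is singular

# Both of these return the minimum-norm least squares solution
# when Z is rank deficient.
beta_min_norm = np.linalg.lstsq(Z, y, rcond=None)[0]
beta_pinv = np.linalg.pinv(Z) @ y
print(np.allclose(beta_min_norm, beta_pinv))        # same solution vector

# The fitted values are unique even though beta is not.
print(np.allclose(Z @ beta_min_norm, Z @ beta_pinv))
```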
When Z 0 Z is singular we may well find that some linear combinations cβ̂
are uniquely determined by the normal equations (Z 0 Z)β̂ = Z 0 y even though
β̂ itself is not. If we can confine our study of β to such linear combinations cβ
then singularity of Z 0 Z is not so deleterious.
From the geometry, it is clear that there has to be a unique point in M that
is closest to the vector y. Therefore ŷ = Zβ̂ is determined by the least squares
conditions. It follows that each component ŷi = zi'β̂ is also determined. In other
words, if c = zi' for one of our sampled feature vectors, then cβ̂ is determined.
It follows that if c is a linear combination of the zi', that is, a linear combination
of rows of the design matrix Z, then cβ̂ is determined. We can write such c as
c = Σ_{i=1}^n γi zi' = γ'Z for γ ∈ R^n. In fact, the only estimable cβ take this
form.
Definition 2.1 (Estimability). The linear combination cβ is estimable if there
exists a linear combination λ'Y = Σ_{i=1}^n λi Yi of the responses for which E(λ'Y) =
cβ holds for all β ∈ R^p.
The definition of estimability above stipulates that we can find a linear
combination of the Yi that is unbiased for cβ. The next theorem gives two
equivalent properties that could also have been used as the definition. One is
that estimable linear combinations cβ have unique least squares estimators. The
other is that estimable linear combinations are linear combinations of the rows
of the design matrix Z. If there is some cβ that we really want to estimate,
we can make sure it is estimable when we’re choosing the xi at which to gather
data.
Theorem 2.9. The following are equivalent:
1. cβ is estimable.
2. c = γ'Z for some γ ∈ R^n.
3. If Z'Zβ̂ = Z'Y and Z'Zβ̃ = Z'Y then cβ̂ = cβ̃.
Proof. [Some algebra goes here, or possibly it becomes an exercise.]

2.8 Maximum likelihood and Gauss Markov


This section looks at motivations or justifications for choosing least squares.
The first is via maximum likelihood. If we assume that the errors are normally
distributed then least squares emerges as the maximum likelihood estimator of
the regression coefficient β.
Using least squares is not the same as assuming normal errors. There are
other assumptions that can lead one to using least squares. Under the weaker
assumption that the error vector ε has mean zero and variance matrix σ 2 I, the
least squares estimators have the smallest variance among the class of linear
unbiased estimators. This optimality is made precise by the Gauss-Markov
theorem.
The least squares estimates β̂ can also be obtained as maximum likelihood
estimates of β under a normal distribution model. That is, we look for the value
of β which maximizes the probability of our observed data and we will find it
is the same β̂ that we found by least squares and studied via projections. The
MLE of σ^2 is

    σ̂^2 = (1/n) Σ_{i=1}^n (yi − zi'β̂)^2.        (2.20)

To derive this MLE, we suppose that Y ∼ N (Zβ, σ 2 I) for fixed and full
rank Z. The probability of observing Yi = yi , for i = 1, . . . , n, is of course
zero, which is not helpful. But our observations were only measured to finite
precision and we might interpret the observation yi as corresponding to the
event yi − ∆ ≤ Yi ≤ yi + ∆ where ∆ > 0 is our measuring accuracy. We don’t
have to actually know ∆, but it should always be small compared to σ. The
probability of observing Y in a small n dimensional box around y is
 
    Pr( |Yi − yi| ≤ ∆, 1 ≤ i ≤ n ) ≈ (2∆)^n / ( (2π)^{n/2} σ^n ) · exp( −(1/(2σ^2)) (y − Zβ)'(y − Zβ) ).        (2.21)

For any 0 < σ < ∞ the β which maximizes (2.21) is the one that minimizes
the sum of squares (y − Zβ)0 (y − Zβ). So the MLE is the least squares estimate
and can be found by solving the normal equations.
Next we find the MLE of σ. The logarithm of the right side of (2.21) is

    n log(2∆) − (n/2) log(2π) − n log σ − (1/(2σ^2)) (y − Zβ)'(y − Zβ).        (2.22)

Differentiating (2.22) with respect to σ yields

    −n/σ + (1/σ^3) (y − Zβ)'(y − Zβ),        (2.23)

which vanishes if and only if

    σ^2 = (1/n) (y − Zβ)'(y − Zβ).        (2.24)

Solving (2.24) with β = β̂ gives the MLE σ̂^2 from (2.20).
It is also necessary to investigate the second derivative of the log likelihood
with respect to σ at the point σ = σ̂. The value there is −2n/σ̂^2 < 0, and
so (2.20) really is the MLE and not a minimum.
In the case where y = Z β̂ exactly the log likelihood is degenerate. The MLE
formulas for β̂ and σ̂ are still interpretable. The first is the value that makes
error zero and the second reduces to 0.
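The difference between the MLE σ̂^2 and the unbiased estimate s^2 is just the divisor, n versus n − p. A small sketch (ours, with simulated data) makes the comparison explicit.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 200, 4, 1.5
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
y = Z @ beta + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
rss = np.sum((y - Z @ beta_hat)**2)

sigma2_mle = rss / n          # maximum likelihood estimate (2.20)
s2 = rss / (n - p)            # unbiased estimate (2.13)
print(sigma2_mle, s2, sigma**2)
# The two differ by the factor (n - p)/n, so the MLE is biased low,
# though the difference is negligible when n is much larger than p.
print(sigma2_mle / s2, (n - p) / n)
```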
The second justification for least squares is via the Gauss-Markov theorem.
We say that λ0 Y is a linear unbiased estimator for cβ if E(λ0 Y ) = cβ holds for
all β. Let’s agree to define the best linear unbiased estimator of cβ as the one
that attains the minimum variance. It need not be unique. The Gauss-Markov
theorem states that least squares provides best linear unbiased estimators.
Theorem 2.10 (Gauss-Markov). Let Y = Zβ + ε where Z is a nonrandom
n × p matrix, β is an unknown point in Rp and ε is a random vector with mean
0 and variance matrix σ 2 In . Let cβ be estimable and let β̂ be a least squares
estimate. Then cβ̂ is a best linear unbiased estimate of cβ.
Proof. [Full rank case] We prove the theorem for the full rank case with Z'Z
invertible. We already know that cβ̂ is unbiased and linear. Suppose that
E(λ'Y) = cβ for all β, so that λ'Z = c. Then

    var(λ'Y) = var( cβ̂ + (λ'Y − cβ̂) )
             = var(cβ̂) + var(λ'Y − cβ̂) + 2 cov(cβ̂, λ'Y − cβ̂).

Now

    cov(cβ̂, λ'Y − cβ̂) = cov( c(Z'Z)^{-1}Z'Y, (λ' − c(Z'Z)^{-1}Z')Y )
                       = c(Z'Z)^{-1}Z'(σ^2 I)(λ − Z(Z'Z)^{-1}c')
                       = σ^2 ( c(Z'Z)^{-1}Z'λ − c(Z'Z)^{-1}c' )
                       = 0

because Z'λ = c'. Therefore var(λ'Y) = var(cβ̂) + var(λ'Y − cβ̂) ≥ var(cβ̂).

For rank deficient Z, a similar proof can be made but takes a bit more care.
See for example Christiansen (xxxx). Intuitively we might expect the Gauss-
Markov theorem to hold up for estimable functions in the rank deficient case.
We could drop columns of Z until Z 0 Z has full rank, get a best linear unbiased
estimator in the resulting full rank problem, and then because we only dropped
redundant columns, reinstate the ones we dropped.

2.9 Least squares computation


It is by now rare for somebody analyzing data to have to program up a least
squares algorithm. In a research setting however, it might be simpler to embed
a least squares solver in one’s software than to write the software in some other
tool that has least squares solvers built in. Other times our problem is too big
to fit our usual data analysis environment and we have to compute least squares
some other way. Even outside of these special settings, we may find that our
software offers us several least squares algorithms to choose from and then it
pays to understand the choices.
This section can be skipped on first reading, especially for readers who are
willing to assume that their least squares software with its default settings sim-
ply works without any attention on their part.
Before considering special cases, we note that the cost of least squares
computation typically grows as O(np^2 + p^3). For example, when
solving the normal equations directly, it takes O(np^2) work simply to form
Z'Z and Z'y and then O(p^3) to solve Z'Zβ̂ = Z'y. We will consider more
sophisticated least squares algorithms based on the QR decomposition and the
singular value decomposition. These also cost O(np^2 + p^3), with different implied
constants.

Solving the normal equations


Given an n by p matrix Z and Y ∈ Rn the least squares problem is to solve
(Z 0 Z)β̂ = Z 0 Y for β̂. We have p equations in p unknowns. The standard way to
solve p linear equations in p unknowns is to apply Gaussian elimination. Some-
times Gaussian elimination encounters numerical difficulty because it requires
division by a small number. In that case pivoting methods, which rearrange the
order of the p equations, can improve accuracy.
When Z has full rank then Z'Z is positive definite. It can be factored
by the Cholesky decomposition into Z'Z = G'G where G is a p by p upper
triangular matrix. Then to solve G'Gβ̂ = Z'Y we solve two triangular systems
of equations: first we solve G'u = Z'Y for u and then we solve Gβ̂ = u for β̂.
A triangular system is convenient because one can start with a single equation
in a single unknown, and each equation thereafter brings in one new variable to
solve for.
It turns out that neither Gaussian elimination nor the Cholesky decomposition
is the best way to solve least squares problems. They are numerically unstable if
Z'Z is nearly singular and of course they fail if Z'Z is actually singular. The QR
decomposition is more reliable than solving the normal equations, when Z has
full rank. If Z does not have full rank, then the singular value decomposition can
be used to pick out one of the least squares solutions, or even to parameterize
the entire solution set.

QR decomposition
A better approach is to employ a QR decomposition of Z. In such a decomposition
we write Z = QR where Q is an n by n orthogonal matrix and R
is an n by p upper triangular matrix. When as usual n > p, the last n − p
rows of R are zero, and then we don't really need the last n − p columns of Q. Thus
there are two QR decompositions

    Z = QR = ( Q̃  Q* ) [ R̃ ] = Q̃R̃,
                       [ 0 ]

where Q̃ is an n × p matrix with orthonormal columns and R̃ is a p × p upper
triangular matrix.
triangular matrix.
In this section we’ll verify that a QR decomposition exists and see how to
find it. Then we’ll see how to use it in least squares problems. Finally we’ll
look into why using the QR decomposition should be better than solving the
normal equations directly. Implementation of the QR decomposition requires
care about storage and bookkeeping. For pseudo-code see Golub and van Loan
(xxxx).
The existence of a QR decomposition follows directly from the Gram-Schmidt
orthogonalization process. That process is essentially a regression of each col-
umn of Z onto the previous ones. Since we’ll be using QR to do regression, it is
almost circular to use regression to define QR. We’ll use Householder reflections
instead. They’re interesting in their own right, and they also underly many
implementations of the QR decomposition.
Suppose that u is a unit vector. Then H = H(u) = I −2uu0 is a Householder
reflection matrix. It is a symmetric orthogonal matrix. If v is a vector parallel
to u then Hv = −v. If v is a vector orthogonal to u then Hv = v. This is why H
is called a reflection. It reflects u onto −u while leaving any vector orthogonal
to u unchanged.
Suppose that v and w are distinct nonzero vectors of equal length. Then we
can find a unit vector u so that H(u) reflects v onto w, that is, (I − 2uu')v = w.
Then we'll be able to reflect the first column of Z onto a vector that has zeros
below the first entry.
The vector we need in order to reflect v onto w is u = (w − v)/‖w − v‖.
First notice that (w + v)/2 is orthogonal to u and (w − v)/2 is parallel to u.
Therefore

    Hv = H( (w + v)/2 − (w − v)/2 ) = (w + v)/2 + (w − v)/2 = w.
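The Householder construction is easy to code directly. The sketch below (ours, not the book's pseudo-code) builds H(u) for a random first column and verifies that it reflects that column onto (R11, 0, . . . , 0)' while being symmetric and orthogonal. The sign of R11 is chosen to avoid cancellation, a point discussed just below.

```python
import numpy as np

def householder(v, w):
    """Return H = I - 2uu' with u = (w - v)/||w - v||, so that H v = w.
    Assumes v and w are distinct vectors of equal length."""
    u = (w - v) / np.linalg.norm(w - v)
    return np.eye(len(v)) - 2.0 * np.outer(u, u)

rng = np.random.default_rng(9)
z1 = rng.normal(size=5)                              # first column of some Z
w = np.zeros(5)
w[0] = -np.copysign(np.linalg.norm(z1), z1[0])       # target (R11, 0, ..., 0)'

H1 = householder(z1, w)
print(np.allclose(H1 @ z1, w))                       # the reflection maps z1 onto w
print(np.allclose(H1, H1.T), np.allclose(H1 @ H1, np.eye(5)))  # symmetric and orthogonal
```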

Let Q1 = H1 be an n by n Householder reflection matrix taking Z1 onto
the vector (R11, 0, 0, . . . , 0)' where R11 = ±‖Z1‖. Then

    Q1 Z = ( (R11, 0, . . . , 0)'   H1 Z2   · · ·   H1 Zp ),

that is, the first column is (R11, 0, . . . , 0)' and the remaining columns are H1 Z2 through H1 Zp.
We have managed to put zeros below the diagonal in the first column on the
right side. In practice, the choice between ‖Z1‖ and −‖Z1‖ is made based on
which will lead to a more numerically stable answer. See Golub and van
Loan (xxxx) for details.
Loan (xxxx) for details.
We will use matrices Qj taking the form

    Qj = [ I_{j−1}   0  ]
         [    0      Hj ]

where, for j = 2, . . . , p, the Hj are (n − j + 1) by (n − j + 1) Householder reflection
matrices. For a vector v the product Qj v leaves the top j − 1 components of v
unchanged while rotating the rest of v. We can choose a rotation so that the
last n − j components of Qj v are zero.
After carefully choosing the matrices Hj we obtain

    Qp Qp−1 · · · Q2 Q1 Z = [ R11  R12  · · ·  R1p ]
                            [  0   R22  · · ·  R2p ]
                            [  ⋮    ⋮    ⋱     ⋮  ]
                            [  0    0   · · ·  Rpp ]        (2.25)
                            [  0    0   · · ·   0  ]
                            [  ⋮    ⋮          ⋮  ]
                            [  0    0   · · ·   0  ]

giving us Z = QR where Q = (Qp Qp−1 · · · Q2 Q1)' has orthonormal columns,
and R is the upper triangular matrix on the right of (2.25). The leftmost p
columns of Q are Q̃ and the top p rows of R are R̃.
Given that Z = Q̃R̃ we can plug it into the normal equations Z'Zβ̂ = Z'Y
and get R̃'R̃β̂ = R̃'Q̃'Y. When Z has full rank, then R̃' is a p × p full rank
matrix too, and so it is invertible. Any solution β̂ to R̃'(R̃β̂ − Q̃'Y) = 0 also
solves R̃β̂ − Q̃'Y = 0.
That is, given Z = Q̃R̃ we find Ỹ = Q̃'Y and then solve

    [ R11  R12  · · ·  R1p ]
    [  0   R22  · · ·  R2p ]  β̂ = Ỹ.
    [  ⋮    ⋮    ⋱     ⋮  ]
    [  0    0   · · ·  Rpp ]

This latter is a triangular system, so it is easy to solve for β̂p then β̂p−1 and so
on, ending with β̂1 .
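In practice one rarely forms the Householder matrices explicitly; library routines return Q̃ and R̃ directly. The sketch below (ours, using numpy.linalg.qr and scipy.linalg.solve_triangular) solves the triangular system R̃β̂ = Q̃'Y by back substitution and checks the answer against the normal equations.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(10)
n, p = 100, 5
Z = rng.normal(size=(n, p))
y = rng.normal(size=n)

Q_tilde, R_tilde = np.linalg.qr(Z, mode='reduced')   # thin QR: Z = Q_tilde R_tilde

# Back substitution on the p x p upper triangular system
beta_qr = solve_triangular(R_tilde, Q_tilde.T @ y, lower=False)

beta_ne = np.linalg.solve(Z.T @ Z, Z.T @ y)          # normal equations, for comparison
print(np.allclose(beta_qr, beta_ne))
```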
We are sure to run into trouble if R̃ is singular. At least that trouble will
be detected, because R̃ is singular if and only if Rii = 0 for some 1 ≤ i ≤ p. At
that point we'll notice that we're trying to reflect a vector of length zero. Also,
if some Rii is near zero then numerical near-singularity of Z is detected.
If we solve the normal equations directly, then we are solving R̃'R̃β̂ = Z'Y.
The matrix R̃'R̃ is less favorable to solve with than is R̃. A precise definition
requires the notion of matrix condition number. But roughly speaking, suppose
that in solving the system R̃a = b for a, roundoff errors in b get multiplied by a
factor f. Solving the normal equations directly is like applying two such solves,
one for R̃ and one for R̃', and might then multiply the error by f^2. Employing
the QR decomposition instead of solving the normal equations directly is roughly
equivalent to doubling one's numerical precision (from single to double or from
double to quadruple).

Singular value decomposition


Any n by p matrix Z may be written

Z = U ΣV 0 (2.26)

where U is an n × n orthogonal matrix, V is a p × p orthogonal matrix, and


Σ is an n × p diagonal matrix with nonnegative elements. For a proof of the
existence of the SVD and for algorithms to compute it, see Golub and van Loan
(xxxx).
If we multiply the matrix Z by a column vector w we find that Zw = U ΣV 0 w
corresponds to rotating w by V 0 then scaling each component of V 0 w by a
corresponding diagonal element of Σ, and then rotating the result again. Rotate,
stretch, and rotate is what happens to w when multiplied by Z and this holds
for any Z. We use the term rotate here loosely, to mean a rigid motion due to
the action of an orthogonal matrix. Reflections and coordinate permutations
are included in this notion of rotation.
Suppose that the i’th column of U is ui , the i’th column of V is vi and the
i’th diagonal element of Σ is σi . Then we can write
min(n,p)
X
Z= σi ui vi0
i=1

as a sum of rank one matrices. It is customary to order the indices so that


σ1 ≥ σ2 ≥ · · · ≥ σmin(n,p) ≥ 0. The rank of Z is the number r of nonzero σi .
For the QR deccomposition we ignored the last p − r columns of Q because
the corresponding rows of R were zero. When n > p we can make a similar
simplification because the last n − p rows of Σ are zero. That is we can also
write
Z=U e 0
e ΣV
2.10. SPECIAL LEAST SQUARES SYSTEMS 21

where U e U ∗ ) and Σ
e has the first p columns of U = (U e has the first p rows of Σ.
We begin by noting that the Euclidean length of the vector y − Zβ is preserved
when it is multiplied by an orthogonal matrix such as U. Then

    ‖y − Zβ‖^2 = ‖y − UΣV'β‖^2
               = ‖U'y − ΣV'β‖^2
               = ‖Ũ'y − Σ̃V'β‖^2 + ‖U*'y‖^2,

after splitting the first p from the last n − p components of U'y − ΣV'β. Any
vector β that minimizes ‖Ũ'y − Σ̃V'β‖ is a least squares solution. Let ỹ = Ũ'y ∈
R^p and β̃ = V'β ∈ R^p. In the nonsingular case each σi > 0. Then the least
squares solution is found by putting β̃i = ỹi/σi for i = 1, . . . , p and β̂ = Vβ̃.
In the singular case one or more of the σi are zero. Suppose that the last k
of the σi are zero. Then setting β̃i = ỹi/σi for i = 1, . . . , p − k and taking any
real values at all for β̃i with i = p − k + 1, . . . , p gives a least squares solution.
The shortest least squares solution is found by taking β̃i = 0 for i = p − k +
1, . . . , p. Writing

    σi^+ = 1/σi  if σi > 0,   and   σi^+ = 0  if σi = 0,

and Σ^+ = diag(σ1^+, . . . , σp^+), we find that the shortest least squares solution is

    β̂ = Vβ̃ = V Σ^+ Ũ'Y.        (2.27)
When we want to simply pick one least squares solution out of many, to use
as a default, then the one in (2.27) is a good choice. In practice the computed
singular values have some roundoff error in them. Then one should set σi^+ = 0
when σi ≤ σ1 × ε where ε is a measure of numerical precision. Note that the
threshold scales with the largest singular value. A matrix Z with σ1 = σ2 =
· · · = σp = ε/2 would lead to β̂ = 0 unless we use a relative threshold. In the unusual
event that all σi are zero then the condition σi ≤ σ1 × ε is fulfilled, as we would
like, while the condition σi < σ1 × ε is not.
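A sketch of this recipe (ours): compute a thin SVD, zero out the reciprocals of singular values below a relative threshold, and form β̂ = V Σ^+ Ũ'Y. On a rank deficient Z it agrees with the minimum-norm solution returned by numpy.linalg.lstsq.

```python
import numpy as np

def svd_least_squares(Z, y, rel_tol=1e-12):
    """Shortest least squares solution via the SVD, as in (2.27).
    Singular values below rel_tol times the largest one are treated as zero."""
    U, sing, Vt = np.linalg.svd(Z, full_matrices=False)     # thin SVD
    keep = sing > rel_tol * sing[0]
    sing_plus = np.where(keep, 1.0 / np.where(keep, sing, 1.0), 0.0)
    return Vt.T @ (sing_plus * (U.T @ y))

rng = np.random.default_rng(11)
n = 25
x = rng.normal(size=n)
Z = np.column_stack([np.ones(n), x, 2.0 * x - 3.0])         # third column is redundant
y = 1.0 + x + rng.normal(scale=0.2, size=n)

beta_svd = svd_least_squares(Z, y)
beta_lstsq = np.linalg.lstsq(Z, y, rcond=None)[0]           # also a minimum-norm solution
print(np.allclose(beta_svd, beta_lstsq))
```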
Now suppose that we want a parameterized form for the entire set of least
squares solutions. Perhaps we're going to pick one of them by numerically
optimizing some other criterion instead of the ‖β̂‖ used above. When k of the σi
are zero, then we can write the entire set as a function of τ ∈ R^k, as follows:

    V ( Σ^+ + [ 0     0    ] ) Ũ'Y,        τ ∈ R^k,
              [ 0  diag(τ) ]

where diag(τ) is a k × k diagonal matrix with the elements of τ on its diagonal, set
into the lower right block of a p × p matrix.

2.10 Special least squares systems


If the columns Zj of Z are orthogonal then least squares problems simplify
considerably. The reason is that Z 0 Z is diagonal and then so is (Z 0 Z)−1 in the
full rank case. The result is that

    β̂j = Σ_{i=1}^n Zij yi / Σ_{i=1}^n Zij^2 = Zj'y / ‖Zj‖^2,        j = 1, . . . , p,

so the coefficients can be computed independently, one at a time. In the even
more special case that each Zj is a unit vector, β̂j = Zj'y.
Orthogonal predictors bring great simplification. The cost of computation
is only O(np). The variance of β̂ is σ^2 diag(1/‖Zj‖^2), so the components of β̂
are uncorrelated. In the Gaussian case the β̂j are statistically independent, in
addition to the computational independence noted above.
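A quick check (ours) that with orthogonal columns the joint least squares fit reduces to p separate one-dimensional computations:

```python
import numpy as np

rng = np.random.default_rng(12)
n, p = 50, 4
Z, _ = np.linalg.qr(rng.normal(size=(n, p)))        # columns of Z are orthonormal
y = rng.normal(size=n)

beta_joint = np.linalg.lstsq(Z, y, rcond=None)[0]   # ordinary least squares fit
beta_onebyone = Z.T @ y / np.sum(Z**2, axis=0)      # beta_j = Z_j'y / ||Z_j||^2, one column at a time
print(np.allclose(beta_joint, beta_onebyone))       # identical for orthogonal columns
```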
In problems where scattered data are observed we won’t ordinarily find or-
thogonal columns. In designed experiments we might arrange for orthogonal
columns.
When the predictors xi lie on a regular grid in one or two dimensions then
we may construct some orthogonal features zi from them.

Orthogonal polynomials
B splines
Fourier series
Haar Wavelets

2.11 Leave one out formula


In this section we explore what happens to a regression model when one data
point is added or removed. We begin with the Sherman-Morrison formula.
Suppose that A is an invertible n × n matrix and that u and v are n vectors
with 1 + v'A^{-1}u ≠ 0. Then

    (A + uv')^{-1} = A^{-1} − A^{-1}uv'A^{-1} / (1 + v'A^{-1}u).        (2.28)

Equation (2.28) can be proved by multiplying the right hand side by A + uv'.
See Lemma ??. The condition 1 + v'A^{-1}u ≠ 0 is needed to avoid turning an
invertible A into a singular A + uv'.
Suppose that we delete observation i from the regression. Then Z'Z =
Σ_{ℓ=1}^n zℓ zℓ' is replaced by (Z'Z)_(−i) = Z'Z − zi zi', using a subscript of (−i) to
denote removal of observation i. We can fit this into (2.28) by taking A = Z'Z,
u = zi and v = −zi. Then

    (Z'Z)_(−i)^{-1} = (Z'Z)^{-1} + (Z'Z)^{-1} zi zi' (Z'Z)^{-1} / (1 − zi'(Z'Z)^{-1}zi)
                    = (Z'Z)^{-1} + (Z'Z)^{-1} zi zi' (Z'Z)^{-1} / (1 − Hii)

after recognizing the hat matrix diagonal from equation (2.8). We also find that
(Z'y)_(−i) = Z'y − zi yi. Now

    β̂_(−i) = ( (Z'Z)^{-1} + (Z'Z)^{-1} zi zi' (Z'Z)^{-1} / (1 − Hii) ) ( Z'Y − zi yi )
            = β̂ − (Z'Z)^{-1} zi yi + (Z'Z)^{-1} zi zi'β̂ / (1 − Hii) − (Z'Z)^{-1} zi Hii yi / (1 − Hii)
            = β̂ + (Z'Z)^{-1} zi zi'β̂ / (1 − Hii) − (Z'Z)^{-1} zi yi / (1 − Hii)
            = β̂ − (Z'Z)^{-1} zi (yi − ŷi) / (1 − Hii).

Now the prediction for yi when (zi, yi) is removed from the least squares fit is

    ŷ_{i,(−i)} = zi'β̂_(−i) = ŷi − Hii (yi − ŷi) / (1 − Hii).
Multiplying both sides by 1 − Hii and rearranging we find that

ŷi = Hii yi + (1 − Hii )ŷi,(−i) . (2.29)

Equation (2.29) has an important interpretation. The least squares fit ŷi is
a weighted combination of yi itself and the least squares prediction we would
have made for it, had it been left out of the fitting. The larger Hii is, the more
that ŷi depends on yi . It also means that if we want to compute a “leave one
out” residual yi − ŷi,(−i) we don’t have to actually take (zi , yi ) out of the data
and rerun the regression. We can instead use
    yi − ŷ_{i,(−i)} = (yi − ŷi) / (1 − Hii).        (2.30)
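Equation (2.30) is easy to verify numerically. The sketch below (ours) compares the shortcut (yi − ŷi)/(1 − Hii) with the residuals obtained by actually deleting each observation and refitting.

```python
import numpy as np

rng = np.random.default_rng(13)
n, p = 30, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = Z @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)           # hat matrix
y_hat = H @ y
h = np.diag(H)
loo_shortcut = (y - y_hat) / (1.0 - h)          # leave-one-out residuals via (2.30)

# Brute force: actually delete each observation and refit
loo_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(Z[keep], y[keep], rcond=None)[0]
    loo_brute[i] = y[i] - Z[i] @ b

print(np.allclose(loo_shortcut, loo_brute))     # identical, without n separate refits
```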
