Linear Least Squares
The linear model is the main technique in regression problems and the primary
tool for it is least squares fitting. We minimize a sum of squared errors, or
equivalently the sample average of squared errors. That is a natural choice
when we’re interested in finding the regression function which minimizes the
corresponding expected squared error. This chapter presents the basic theory
of linear least squares estimation, looking at it with calculus, linear algebra
and geometry. It also develops some distribution theory for linear least squares
and computational aspects of linear regression. We will draw repeatedly on the
material here in later chapters that look at specific data analysis problems.
We’ll show later that this indeed gives the minimum, not the maximum or a
saddle point. The p equations in (2.2) are known as the normal equations.
This is due to normal being a synonym for perpendicular or orthogonal, and
not due to any assumption about the normal distribution. Consider the vector
$Z_j = (z_{1j}, \dots, z_{nj})' \in \mathbb{R}^n$ of values for the $j$'th feature. Equation (2.2) says that this feature vector has a dot product of zero with the residual vector whose $i$'th element is $\hat\varepsilon_i = y_i - \sum_{j=1}^p z_{ij}\hat\beta_j$. Each feature vector is orthogonal (normal) to the vector of residuals.
It is worth while writing equations (2.1) and (2.2) in matrix notation. Though
the transition from (2.1) to (2.2) is simpler with coordinates, our later manip-
ulations are easier in vector form. Let’s pack the responses and feature values
into the vector Y and matrix Z via
\[
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},
\quad\text{and}\quad
Z = \begin{pmatrix}
z_{11} & z_{12} & \dots & z_{1p} \\
z_{21} & z_{22} & \dots & z_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
z_{n1} & z_{n2} & \dots & z_{np}
\end{pmatrix}.
\]
We also put β = (β1 , . . . , βp )0 and let β̂ denote the minimizer of the sum of
squares. Finally let ε̂ = Y − Z β̂, the n vector of residuals.
Now equation (2.1) can be written S(β) = (Y − Zβ)0 (Y − Zβ) and the
normal equations (2.2) become
ε̂0 Z = 0 (2.3)
after dividing by −2. These normal equations can also be written
Z 0 Z β̂ = Z 0 Y. (2.4)
When Z 0 Z is invertible then we find that
β̂ = (Z 0 Z)−1 Z 0 Y. (2.5)
We’ll suppose at first that Z 0 Z is invertible and then consider the singular case
in Section 2.7.
Equation (2.5) underlies another meaning of the word ‘linear’ in linear re-
gression. The estimated coefficient β̂ is a fixed linear combination of Y , meaning
that we get it by multiplying Y by the matrix (Z 0 Z)−1 Z 0 . The predicted value
of Y at any new point x0 with features z0 = φ(x0 ) is also linear in Y ; it is
z00 (Z 0 Z)−1 Z 0 Y .
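To make this linearity concrete, here is a minimal numerical sketch of (2.5), assuming Python with numpy is available; the synthetic data, the seed, and all variable names are illustrative only and not part of the text.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # first column of 1s
beta_true = np.array([1.0, 2.0, -0.5])
Y = Z @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations Z'Z beta_hat = Z'Y directly ...
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
# ... or let a least squares routine do the equivalent work.
beta_lstsq, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))    # True

# The prediction at a new feature vector z0 is also linear in Y: z0'(Z'Z)^{-1}Z'Y.
z0 = np.array([1.0, 0.2, -1.0])
print(z0 @ beta_hat)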
Now let’s prove that β̂ = (Z 0 Z)−1 Z 0 Y is in fact the minimizer, not just a
point where the gradient of the sum of squares vanishes. It seems obvious that
a sum of squares like (2.1) cannot have a maximizing β. But we need to rule
out saddle points too, and we’ll also find that β̂ is the unique least squares
estimator. Since we already found an expression for β̂ we prove it is right by expressing a generic β̃ ∈ Rp as β̂ + (β̃ − β̂) and then expanding S(β̂ + (β̃ − β̂)).
This adding and subtracting technique is often applicable in problems featuring
squared errors.
Proof. We know that Z'(Y − Zβ̂) = 0 and will use it below. Let β̃ be any point in Rp and let γ = β̃ − β̂. Then
\begin{align*}
S(\tilde\beta) &= (Y - Z\tilde\beta)'(Y - Z\tilde\beta)\\
&= (Y - Z\hat\beta - Z\gamma)'(Y - Z\hat\beta - Z\gamma)\\
&= (Y - Z\hat\beta)'(Y - Z\hat\beta) - \gamma'Z'(Y - Z\hat\beta) - (Y - Z\hat\beta)'Z\gamma + \gamma'Z'Z\gamma\\
&= S(\hat\beta) + \gamma'Z'Z\gamma.
\end{align*}
Because γ'Z'Zγ = ‖Zγ‖² ≥ 0, this shows S(β̃) ≥ S(β̂), with equality only when Zγ = 0. When Z'Z is invertible, Zγ = 0 forces γ = 0, so β̂ is the unique minimizer.
M = {Zβ | β ∈ Rp } (2.6)
The left side of (2.7) is called the centered sum of squares of the yi . It is
n − 1 times the usual estimate of the common variance of the Yi . The equation
decomposes this sum of squares into two parts. The first is the centered sum of squares of the fitted values ŷi. The second is the sum of squared model errors. When the first column of Z consists of 1s then $(1/n)\sum_{i=1}^n \hat y_i = \bar y$, that is, $\bar{\hat y} = \bar y$. Now the two terms in (2.7) correspond to the sum of squares of the
fitted values ŷi about their mean and the sum of squared residuals. We write
this as SSTOT = SSFIT + SSRES . The total sum of squares is the sum of squared
fits plus the sum of squared residuals.
Some of the variation in Y is ‘explained’ by the model and some of it is left unexplained. The fraction explained is denoted by
\[
R^2 = \frac{\mathrm{SS}_{\mathrm{FIT}}}{\mathrm{SS}_{\mathrm{TOT}}} = 1 - \frac{\mathrm{SS}_{\mathrm{RES}}}{\mathrm{SS}_{\mathrm{TOT}}}.
\]
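As a small sketch of the decomposition and of R², the following snippet (assuming numpy; the data and names are illustrative) computes SSTOT, SSFIT and SSRES for a fit that includes an intercept and evaluates R² both ways.

import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
Z = np.column_stack([np.ones(n), x])                 # intercept plus one feature
Y = 2.0 + 1.5 * x + rng.normal(scale=0.7, size=n)

beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
y_hat = Z @ beta_hat
ss_tot = np.sum((Y - Y.mean()) ** 2)
ss_fit = np.sum((y_hat - Y.mean()) ** 2)
ss_res = np.sum((Y - y_hat) ** 2)

print(np.isclose(ss_tot, ss_fit + ss_res))           # True when Z has a column of 1s
print(ss_fit / ss_tot, 1.0 - ss_res / ss_tot)        # two equal expressions for R^2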
Theorem 2.2. Let H be a symmetric idempotent real valued matrix. Then the
eigenvalues of H are all either 0 or 1.
We are also interested in the residuals ε̂i = yi − ŷi . The entire vector of
residuals may be written ε̂ = y − ŷ = (I − H)y. It is easy to see that if
H is a PPM, then so is I − H. Symmetry is trivial, and (I − H)(I − H) =
I − H − H + HH = I − H.
The model space $M = \{Z\beta \mid \beta \in \mathbb{R}^p\}$ is the set of linear combinations of columns of Z. A typical entry of M is $\sum_{j=1}^p \beta_j Z_j$. Thus M is what is known
as the column space of Z, denoted col(Z). The hat matrix H projects vectors
onto col(Z). The set of vectors orthogonal to the columns of Z, that is with Z'v = 0, is a linear subspace of Rn, known as the null space of Z'. We may write it as M⊥ or as null(Z'). The column space of Z and the null space of Z' are orthogonal complements. Any
v ∈ Rn can be uniquely written as v1 + v2 where v1 ∈ M and v2 ∈ M⊥ . In
terms of the hat matrix, v1 = Hv and v2 = (I − H)v.
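A brief numerical sketch of this projection view, assuming numpy and a synthetic Z chosen only for illustration: H is symmetric and idempotent, so is I − H, and any vector splits into a piece in col(Z) plus a piece orthogonal to it.

import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
Z = rng.normal(size=(n, p))
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)                # hat matrix Z(Z'Z)^{-1}Z'
I = np.eye(n)

print(np.allclose(H, H.T), np.allclose(H @ H, H))    # symmetric and idempotent
print(np.allclose((I - H) @ (I - H), I - H))         # so is I - H

v = rng.normal(size=n)
v1, v2 = H @ v, (I - H) @ v                          # v = v1 + v2
print(np.allclose(v, v1 + v2), np.allclose(Z.T @ v2, 0))   # v2 is orthogonal to col(Z)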
The only randomness in the right side of (2.9) comes from ε. This makes it easy
to work out the mean and variance of β̂ in the fixed x setting.
Lemma 2.1. If Y = Zβ + ε where Z is nonrandom and Z 0 Z is invertible and
E(ε) = 0, then
E(β̂) = β. (2.10)
The least squares estimates β̂ are unbiased for β as long as ε has mean
zero. Lemma 2.1 does not require normally distributed errors. It does not even
make any assumptions about var(ε). To study the variance of β̂ we will need
assumptions on var(ε), but not on its mean.
Lemma 2.2. If Y = Zβ + ε where Z is nonrandom and Z'Z is invertible and var(ε) = σ²I, for 0 ≤ σ² < ∞, then
\[
\mathrm{var}(\hat\beta) = (Z'Z)^{-1}\sigma^2, \tag{2.12}
\]
and s² = ε̂'ε̂/(n − p) satisfies E(s²) = σ².
Proof. Recall that ε̂ = (I − H)ε. Therefore E(ε̂0 ε̂) = E(ε0 (I − H)ε) = tr((I −
H)Iσ 2 ) = (n − p)σ 2 . Finally s2 = ε̂0 ε̂/(n − p).
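A small Monte Carlo sketch of this lemma, assuming numpy; the sample size, σ and the number of replications are arbitrary choices made only for illustration. The average of s² over many simulated data sets should be close to σ².

import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 40, 4, 1.5
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)

s2_values = []
for _ in range(2000):
    Y = Z @ beta + rng.normal(scale=sigma, size=n)
    resid = Y - H @ Y                                # epsilon_hat = (I - H)Y
    s2_values.append(resid @ resid / (n - p))

print(np.mean(s2_values), sigma**2)                  # the two should be close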
Now we add in the strong assumption that ε ∼ N (0, σ 2 I). Normally dis-
tributed errors make for normally distributed least squares estimates, fits and
residuals.
Theorem 2.3. Suppose that Z is an n by p matrix of real values, Z 0 Z is
invertible, and Y = Zβ + ε where ε ∼ N (0, σ 2 I). Then
β̂ ∼ N (β, (Z 0 Z)−1 σ 2 ),
ŷ ∼ N (Zβ, Hσ 2 ),
ε̂ ∼ N (0, (I − H)σ 2 ),
because HZ = Z, establishing the means listed above for β̂, ŷ and ε̂.
The variance of β̂ is (Z 0 Z)−1 σ 2 by Lemma 2.2. The variance of ŷ is var(ŷ) =
Hvar(Y )H 0 = H(σ 2 I)H = Hσ 2 because H is symmetric and idempotent. Sim-
ilarly ε̂ = (I − H)(Zβ + ε) = (I − H)ε and so var(ε̂) = (I − H)(σ 2 I)(I − H)0 =
(I − H)σ 2 because I − H is symmetric and idempotent.
We have established the three distributions displayed but have yet to prove
the claimed independence. To this end
Proof. First
\[
\frac{c\hat\beta - c\beta}{s\sqrt{c(Z'Z)^{-1}c'}} \sim t_{(n-p)}. \tag{2.15}
\]
Proof. Let $U = (c\hat\beta - c\beta)\big/\bigl(\sigma\sqrt{c(Z'Z)^{-1}c'}\bigr)$. Then U ∼ N(0, 1). Let V = s²/σ², so V ∼ χ²(n−p)/(n − p). Now U is a function of β̂ and V is a function of ε̂. Therefore U and V are independent. Finally
\[
\frac{c\hat\beta - c\beta}{s\sqrt{c(Z'Z)^{-1}c'}}
= \frac{(c\hat\beta - c\beta)\big/\bigl(\sigma\sqrt{c(Z'Z)^{-1}c'}\bigr)}{s/\sigma}
= \frac{U}{\sqrt{V}} \sim t_{(n-p)}.
\]
Theorem 2.5 is the basis for t tests and related confidence intervals for linear
models. The recipe is to take the estimate cβ̂, subtract the hypothesized parameter value cβ, and then divide by an estimate of the standard deviation of cβ̂.
Having gone through these steps we’re left with a t(n−p) distributed quantity.
We will use those steps repeatedly in a variety of linear model settings.
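A sketch of those steps in Python, assuming numpy and scipy are available; the data, the choice of c, and the 95% level are illustrative only. The snippet forms the t statistic of (2.15) for testing cβ = 0 and the matching confidence interval for cβ.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 60, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = Z @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = ZtZ_inv @ Z.T @ Y
resid = Y - Z @ beta_hat
s2 = resid @ resid / (n - p)

c = np.array([0.0, 1.0, 0.0])                        # picks out the second coefficient
se = np.sqrt(s2 * c @ ZtZ_inv @ c)                   # estimated sd of c beta_hat
t_stat = (c @ beta_hat - 0.0) / se                   # test of c beta = 0
crit = stats.t.ppf(0.975, df=n - p)
print(t_stat)
print(c @ beta_hat - crit * se, c @ beta_hat + crit * se)   # 95% interval for c beta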
Sometimes Theorem 2.5 is not quite powerful enough. We may have r dif-
ferent linear combinations of β embodied in an r × p matrix C and we wish to
test whether Cβ = Cβ0 . We could test r rows of C individually, but testing
them all at once is different and some problems will require such a simultaneous
test. Theorem 2.6 is an r dimensional generalization of Theorem 2.5.
\[
\frac{(c\hat\beta - c\beta)^2}{s^2\,c(Z'Z)^{-1}c'} \sim F_{1,\,n-p}.
\]
We might have expected to recover (2.15). In fact, the results are strongly
related. The statistic on the left of (2.16) is the square of the one on the left
of (2.15). The F1,n−p distribution on the right of (2.16) is the square of the
t(n−p) distribution on the right of (2.15). The difference is only that (2.15)
explicitly takes account of the sign of cβ̂ − cβ while (2.16) does not.
least squares estimate $\hat\gamma = (\tilde Z'\tilde Z)^{-1}\tilde Z'Y$ and sum of squared errors $\mathrm{SS}_{\mathrm{SUB}} = (Y - \tilde Z\hat\gamma)'(Y - \tilde Z\hat\gamma)$. If the submodel $E(Y) = \tilde Z\gamma$ holds for some $\gamma \in \mathbb{R}^q$ then
\[
\frac{\frac{1}{p-q}\bigl(\mathrm{SS}_{\mathrm{SUB}} - \mathrm{SS}_{\mathrm{FULL}}\bigr)}{\frac{1}{n-p}\,\mathrm{SS}_{\mathrm{FULL}}} \sim F_{p-q,\,n-p}. \tag{2.17}
\]
To complete the proof we note that when Z has full rank p then so does H. Also
Also Z̃ has rank q and so does H̃. For symmetric idempotent matrices like I − H and H − H̃ their rank equals their trace. Then tr(I − H) = n − tr(H) = n − p and tr(H − H̃) = tr(H) − tr(H̃) = p − q, which completes the proof.
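The following sketch applies (2.17) numerically, assuming numpy and scipy; the full model, the submodel consisting of the first q columns, and the simulated data are all illustrative choices. Because the submodel is true here, the F statistic should look like a draw from F with p − q and n − p degrees of freedom.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, q = 80, 5, 2
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.5, 0.0, 0.0, 0.0])           # the submodel actually holds
Y = Z @ beta + rng.normal(size=n)

def rss(design, y):
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    r = y - design @ coef
    return r @ r

ss_full = rss(Z, Y)
ss_sub = rss(Z[:, :q], Y)
F = ((ss_sub - ss_full) / (p - q)) / (ss_full / (n - p))
print(F, stats.f.sf(F, p - q, n - p))                # statistic and its p-value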
for the case where E(Z1 Z10 ) is invertible. The least squares estimator can, for
nonsingular Z 0 Z, be written
\[
\hat\beta = \Bigl(\frac{1}{n}\sum_{i=1}^n Z_i Z_i'\Bigr)^{-1}\frac{1}{n}\sum_{i=1}^n Z_i Y_i.
\]
E(β̂) = E(E(β̂ | X ))
= E((Z 0 Z)−1 Z 0 E(Y | X ))
= E((Z 0 Z)−1 Z 0 µ(X)).
So far we have not connected the function µ(x) to the linear regression model.
To make this case comparable to the fixed x, random Y case let's suppose that
µ(x) = z 0 β where z = φ(x) and also that σ 2 (x) = σ 2 is constant. From the first
of these assumptions we find
E(β̂) = E((Z 0 Z)−1 Z 0 Zβ) (2.18)
We toss a fair coin, and if it comes up heads we use the good scale and get a
value Y ∼ N (ω, 1). If the coin comes up tails we get Y ∼ N (ω, 100). So if we
get heads then we could take a 95% confidence interval to be Y ± 1.96, but then
if we get tails we take Y ± 19.6 instead. Or we could ignore the coin and use Y ± 16.45 because Pr(|Y − ω| ≤ 16.45) ≈ 0.95 in this case. It seems pretty clear
that if we know how the coin toss came out that we should take account of it
and not use Y ± 16.45.
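A quick check of that coverage claim, assuming scipy is available; ω is set to zero without loss of generality and the 16.45 half width is the one from the text.

from scipy import stats

def coverage(half_width):
    # equal mixture of N(0, 1) and N(0, 100), i.e. the coin is ignored
    good = stats.norm.cdf(half_width, scale=1.0) - stats.norm.cdf(-half_width, scale=1.0)
    bad = stats.norm.cdf(half_width, scale=10.0) - stats.norm.cdf(-half_width, scale=10.0)
    return 0.5 * good + 0.5 * bad

print(coverage(16.45))    # approximately 0.95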
The matrix Z 0 Z is analogous to the coin. It affects the precision of our
answer but we assume it does not give us information on β. A quantity that
affects the precision of our answers without carrying information on them is said
to be “ancillary”, meaning secondary or subordinate.
In some settings n itself is random. The ancillarity argument says that we
should prefer to use the actual sample size we got rather than take account of
others we might have gotten but didn’t.
The weighing example and random sample size examples are constructed to
be extreme to make the point. If the scales had variances 1 and 1.001 we might
not bother so much with conditioning on the coin. If sampling fluctuations in
Z 0 Z are not so large then we might prefer to treat it as fixed most of the time
but switch if treating it as random offered some advantage. For instance some
bootstrap and cross-validatory analyses are simpler to present with random X’s.
There is more than one way that we could find one of our features to be
a linear combination of the others. Perhaps we have measured temperature
in degrees Fahrenheit as well as Celsius. Recalling that “F = 1.8C + 32” we
will find linear dependence in Z, if as usual the model includes an intercept.
Similarly, if a financial model makes features out of sales from every region and
also the total sales over all regions, then the ‘total’ column within Z will be the sum of the regional ones. Finally a row of Z might take the form (1, 1M , 1F , . . . )0 where
1M and 1F are indicator functions of male and female subjects respectively and
the dots represent additional features.
A singular matrix Z is often a sign that we have made a mistake in encoding
our regression. In that case we fix it by removing redundant features until Z 0 Z
is invertible, and then using methods and results for the invertible case. The
rest of this section describes settings where we might actually prefer to work
with a singular encoding.
One reason to work with a redundant encoding is for symmetry. We’ll see
examples of that in Chapter ?? on the one way analysis of variance. Another
reason is that we might be planning to do a large number of regressions, say
one for every gene in an organism, one for every customer of a company, or
one per second in an electronic device. These regressions might be all on the
same variables but some may be singular as a consequence of the data sets
they’re given. It may not always be the same variable that is responsible for the
singularity. Also, no matter what the features are, we are guaranteed to have a
non-invertible Z 0 Z if p > n. When we cannot tweak each regression individually
to remove redundant features, then we need a graceful way to do regressions
with or without linearly dependent columns.
In the singular setting M = {Zβ | β ∈ Rp } is an r dimensional subset of Rn
where r < p. We have an r dimensional space indexed by a p dimensional vector
β. There is thus a p−r dimensional set of labels β for each point in M. Suppose
for example that Zi1 = 1, and that Zi2 = Fi and Zi3 = Ci are temperatures in
degrees Fahrenheit and Celsius of data points i = 1, . . . , n. Then for any t ∈ R,
β0 + β1 Fi + β2 Ci = (β0 − 32t) + (β1 + t)Fi + (β2 − 1.8t)Ci .
If Yi is near to β0 + β1 Fi + β2 Ci then it is equally near to (β0 − 32t) + (β1 + t)Fi + (β2 − 1.8t)Ci for any value of t.
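A numerical sketch of this redundancy, assuming numpy; the temperatures and response are simulated only for illustration. With an intercept, a Fahrenheit column and a Celsius column, Z has rank 2, and numpy's lstsq happens to return the shortest of the many equally good coefficient vectors, a point the text returns to below.

import numpy as np

rng = np.random.default_rng(6)
n = 30
celsius = rng.uniform(0, 35, size=n)
fahrenheit = 1.8 * celsius + 32.0
Z = np.column_stack([np.ones(n), fahrenheit, celsius])
Y = 5.0 + 0.2 * celsius + rng.normal(scale=0.1, size=n)

print(np.linalg.matrix_rank(Z))                      # expected 2, not 3
beta_min_norm, *_ = np.linalg.lstsq(Z, Y, rcond=None)

# Shifting along the redundancy direction (-32, 1, -1.8) leaves the fit unchanged.
t = 3.7
beta_other = beta_min_norm + t * np.array([-32.0, 1.0, -1.8])
print(np.allclose(Z @ beta_min_norm, Z @ beta_other))   # True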
The problem with rank deficient Z is not, as we might have feared, that
there is no solution to Z 0 Z β̂ = Z 0 y. It is that there is an infinite family of
solutions. That family is a linear subspace of Rp . A reasonable approach is to
simply choose the shortest solution vector. That is we find the unique minimizer
of β 0 β subject to the constraint Z 0 Zβ = Z 0 y and we call it β̂. We’ll see how to
do this numerically in Section 2.9.
When Z 0 Z is singular we may well find that some linear combinations cβ̂
are uniquely determined by the normal equations (Z 0 Z)β̂ = Z 0 y even though
β̂ itself is not. If we can confine our study of β to such linear combinations cβ
then singularity of Z 0 Z is not so deleterious.
From the geometry, it is clear that there has to be a unique point in M that
is closest to the vector y. Therefore ŷ = Z β̂ is determined by the least squares
conditions. It follows that each component ŷi = zi0 β̂ is also determined. In other
words, if c = zi0 for one of our sampled feature vectors, then cβ̂ is determined.
It follows that if c is a linear combination of the zi', that is, a linear combination of rows of the design matrix Z, then cβ̂ is determined. We can write such c as $c = \sum_{i=1}^n \gamma_i z_i' = \gamma'Z$ for γ ∈ Rn. In fact, the only estimable cβ are of this form.
Definition 2.1 (Estimability). The linear combination cβ is estimable if there exists a linear combination $\lambda'Y = \sum_{i=1}^n \lambda_i Y_i$ of the responses for which E(λ'Y) = cβ holds for any β ∈ Rp.
The definition of estimability above stipulates that we can find a linear
combination of the Yi that is unbiased for cβ. The next theorem gives two
equivalent properties that could also have been used as the definition. One is
that estimable linear combinations cβ have unique least squares estimators. The
other is that estimable linear combinations are linear combinations of the rows
of the design matrix Z. If there is some cβ that we really want to estimate,
we can make sure it is estimable when we’re choosing the xi at which to gather
data.
Theorem 2.9. The following are equivalent:
1. cβ is estimable.
2. c = γ 0 Z for some γ ∈ Rn .
3. If Z'Zβ̂ = Z'Y and Z'Zβ̃ = Z'Y then cβ̂ = cβ̃.
Proof. [Some algebra goes here, or possibly it becomes an exercise.]
To derive this MLE, we suppose that Y ∼ N (Zβ, σ 2 I) for fixed and full
rank Z. The probability of observing Yi = yi , for i = 1, . . . , n, is of course
zero, which is not helpful. But our observations were only measured to finite
precision and we might interpret the observation yi as corresponding to the
event yi − ∆ ≤ Yi ≤ yi + ∆ where ∆ > 0 is our measuring accuracy. We don’t
have to actually know ∆, but it should always be small compared to σ. The
probability of observing Y in a small n dimensional box around y is
\[
\Pr\bigl(|Y_i - y_i| \le \Delta,\ 1 \le i \le n\bigr) \doteq \frac{(2\Delta)^n}{(2\pi)^{n/2}\sigma^n}\exp\Bigl(-\frac{1}{2\sigma^2}(y - Z\beta)'(y - Z\beta)\Bigr). \tag{2.21}
\]
For any 0 < σ < ∞ the β which maximizes (2.21) is the one that minimizes
the sum of squares (y − Zβ)0 (y − Zβ). So the MLE is the least squares estimate
and can be found by solving the normal equations.
Next we find the MLE of σ. The logarithm of the right side of (2.21) is
\[
n\log(2\Delta) - \frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}(y - Z\beta)'(y - Z\beta). \tag{2.22}
\]
Differentiating (2.22) with respect to σ yields
\[
-\frac{n}{\sigma} + \frac{1}{\sigma^3}(y - Z\beta)'(y - Z\beta), \tag{2.23}
\]
which vanishes if and only if
\[
\sigma^2 = \frac{1}{n}(y - Z\beta)'(y - Z\beta). \tag{2.24}
\]
Solving (2.24) with β = β̂ gives the MLE σ̂ 2 from (2.20).
It is also necessary to investigate the second derivative of the log likelihood with respect to σ at the point σ = σ̂. The value there is −2n/σ̂² < 0, and
so (2.20) really is the MLE and is not a minimum.
In the case where y = Z β̂ exactly the log likelihood is degenerate. The MLE
formulas for β̂ and σ̂ are still interpretable. The first is the value that makes
error zero and the second reduces to 0.
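A short numerical contrast of the two variance estimates, assuming numpy and a synthetic fit chosen for illustration: the MLE divides the residual sum of squares by n while the unbiased estimate s² divides by n − p.

import numpy as np

rng = np.random.default_rng(7)
n, p = 25, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = Z @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
rss = np.sum((Y - Z @ beta_hat) ** 2)
print(rss / n, rss / (n - p))                        # MLE sigma_hat^2 versus s^2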
The second justification for least squares is via the Gauss-Markov theorem.
We say that λ0 Y is a linear unbiased estimator for cβ if E(λ0 Y ) = cβ holds for
all β. Let’s agree to define the best linear unbiased estimator of cβ as the one
that attains the minimum variance. It need not be unique. The Gauss-Markov
theorem states that least squares provides best linear unbiased estimators.
Theorem 2.10 (Gauss-Markov). Let Y = Zβ + ε where Z is a nonrandom
n × p matrix, β is an unknown point in Rp and ε is a random vector with mean
0 and variance matrix σ 2 In . Let cβ be estimable and let β̂ be a least squares
estimate. Then cβ̂ is a best linear unbiased estimate of cβ.
Proof. [Full rank case] We prove the theorem for the full rank case with Z 0 Z
invertible. We already know that cβ̂ is unbiased and linear. Suppose that
E(λ0 Y ) = cβ, and so λ0 Z = c. Then
var(λ0 Y ) = var(cβ̂ + (λ0 Y − cβ̂))
= var(cβ̂) + var(λ0 Y − cβ̂) + 2cov(cβ̂, λ0 Y − cβ̂).
Now
\begin{align*}
\mathrm{cov}(c\hat\beta,\ \lambda'Y - c\hat\beta) &= \mathrm{cov}\bigl(c(Z'Z)^{-1}Z'Y,\ (\lambda' - c(Z'Z)^{-1}Z')Y\bigr)\\
&= c(Z'Z)^{-1}Z'(\sigma^2 I)\bigl(\lambda - Z(Z'Z)^{-1}c'\bigr)\\
&= \sigma^2\bigl(c(Z'Z)^{-1}Z'\lambda - c(Z'Z)^{-1}c'\bigr)\\
&= 0
\end{align*}
because Z 0 λ = c0 . Therefore var(λ0 Y ) = var(cβ̂) + var(λ0 Y − cβ̂) ≥ var(cβ̂).
2.9. LEAST SQUARES COMPUTATION 17
For rank deficient Z, a similar proof can be made but takes a bit more care.
See for example Christiansen (xxxx). Intuitively we might expect the Gauss-
Markov theorem to hold up for estimable functions in the rank deficient case.
We could drop columns of Z until Z 0 Z has full rank, get a best linear unbiased
estimator in the resulting full rank problem, and then because we only dropped
redundant columns, reinstate the ones we dropped.
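Here is a numerical sketch of the Gauss-Markov comparison in the full rank case, assuming numpy; the matrix Z, the combination c and the perturbation m are all arbitrary illustrative choices. Any competing linear unbiased estimator λ'Y with λ'Z = c has variance at least var(cβ̂) = σ²c(Z'Z)⁻¹c'.

import numpy as np

rng = np.random.default_rng(8)
n, p, sigma2 = 30, 3, 1.0
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
ZtZ_inv = np.linalg.inv(Z.T @ Z)
c = np.array([0.0, 1.0, 1.0])

lam_ls = Z @ ZtZ_inv @ c                             # lam_ls'Y equals c beta_hat
m = rng.normal(size=n)
m -= Z @ ZtZ_inv @ (Z.T @ m)                         # make m orthogonal to col(Z)
lam_other = lam_ls + m                               # still unbiased: lam_other'Z = c'

print(sigma2 * lam_ls @ lam_ls)                      # var(c beta_hat)
print(sigma2 * lam_other @ lam_other)                # never smaller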
QR decomposition
A better approach is to employ a QR decomposition of Z. In such a decompo-
sition we write Z = QR where Q is an n by n orthogonal matrix and R
is an n by p upper triangular matrix. When as usual n > p then the last n − p
rows of R are zero. Then we don’t really need the last n − p columns of Q. Thus
there are two QR decompositions
\[
Z = QR = \begin{pmatrix} \tilde Q & Q^* \end{pmatrix}\begin{pmatrix} \tilde R \\ 0 \end{pmatrix} = \tilde Q\tilde R,
\]
where Q̃ consists of the first p columns of Q and R̃ of the first p rows of R.
We have managed to put zeros below the diagonal in the first column on the
right side. In practice, the choice between kZ1 k and −kZ1 k is made based on
which will lead to a more numerically stable answer. See Golub and van
Loan (xxxx) for details.
We will use matrices Qj taking the form
\[
Q_j = \begin{pmatrix} I_{j-1} & 0 \\ 0 & H_j \end{pmatrix}.
\]
This latter is a triangular system, so it is easy to solve for β̂p then β̂p−1 and so
on, ending with β̂1 .
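A sketch of the QR route in code, assuming numpy and scipy; the simulated Z and Y are illustrative. The thin factorization gives Z = Q̃R̃, and β̂ solves the triangular system R̃β̂ = Q̃'Y by back substitution.

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(9)
n, p = 40, 4
Z = rng.normal(size=(n, p))
Y = Z @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=n)

Q_tilde, R_tilde = np.linalg.qr(Z, mode='reduced')   # n x p and p x p factors
beta_qr = solve_triangular(R_tilde, Q_tilde.T @ Y, lower=False)

# Agrees with the normal equations solution, but is better conditioned.
beta_ne = np.linalg.solve(Z.T @ Z, Z.T @ Y)
print(np.allclose(beta_qr, beta_ne))                 # True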
We are sure to run into trouble if R̃ is singular. At least that trouble will be detected, because R̃ is singular if and only if R̃ii = 0 for some 1 ≤ i ≤ p. At that point we'll notice that we're trying to reflect a vector of length zero. Also if some R̃ii is near zero then numerical near-singularity of Z is detected.
If we solve the normal equations directly, then we are solving R̃'R̃β̂ = Z'Y. The matrix R̃'R̃ is less favorable to solve with than is R̃. A precise statement requires the notion of matrix condition number. But roughly speaking, suppose that in solving the system R̃a = b for a, roundoff errors in b get multiplied by a factor f. Solving the normal equations directly is like applying two such solves, one for R̃ and one for R̃', and might then multiply the error by f². Employing the QR decomposition instead of solving the normal equations directly is roughly equivalent to doubling one's numerical precision (from single to double or from double to quadruple).
\[
Z = U\Sigma V' \tag{2.26}
\]
where Ũ consists of the first p columns of U = (Ũ  U*) and Σ̃ of the first p rows of Σ.
We begin by noting that the Euclidean length of the vector y − Zβ is pre-
served when it is multiplied by an orthogonal matrix such as U . Then
\[
\|y - Z\beta\|^2 = \|y - U\Sigma V'\beta\|^2
= \|U'y - \Sigma V'\beta\|^2
= \|\tilde U'y - \tilde\Sigma V'\beta\|^2 + \|U^{*\prime}y\|^2,
\]
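A sketch of least squares through the SVD, assuming numpy; the rank deficient Z below (its third column duplicates the second, up to a factor) is an illustrative stand-in. Dropping the numerically zero singular values gives the shortest β̂ among all minimizers, which should agree with the pseudo-inverse solution.

import numpy as np

rng = np.random.default_rng(10)
n = 30
x = rng.normal(size=n)
Z = np.column_stack([np.ones(n), x, 2.0 * x])        # third column = 2 x second
Y = 1.0 + 3.0 * x + rng.normal(scale=0.2, size=n)

U, svals, Vt = np.linalg.svd(Z, full_matrices=False)
tol = max(Z.shape) * np.finfo(float).eps * svals.max()
inv_svals = np.array([1.0 / s if s > tol else 0.0 for s in svals])
beta_min_norm = Vt.T @ (inv_svals * (U.T @ Y))

print(beta_min_norm)
print(np.allclose(beta_min_norm, np.linalg.pinv(Z) @ Y))   # expected True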
Orthogonal polynomials
B splines
Fourier series
Haar Wavelets
\[
(A + uv')^{-1} = A^{-1} - \frac{A^{-1}uv'A^{-1}}{1 + v'A^{-1}u}. \tag{2.28}
\]
Equation (2.28) can be proved by multiplying the right hand side by A + uv 0 .
See Lemma ??. The condition 1 + v'A⁻¹u ≠ 0 is needed to avoid turning an
invertible A into a singular A + uv 0 .
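Formula (2.28) is easy to check numerically, for instance as follows, assuming numpy; A, u and v are arbitrary illustrative choices with A kept well conditioned.

import numpy as np

rng = np.random.default_rng(11)
p = 4
B = rng.normal(size=(p, p))
A = B @ B.T + p * np.eye(p)                          # well conditioned positive definite A
u, v = rng.normal(size=p), rng.normal(size=p)

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / (1.0 + v @ A_inv @ u)
print(np.allclose(lhs, rhs))                         # True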
Suppose that we delete observation i from the regression. Then $Z'Z = \sum_{\ell=1}^n z_\ell z_\ell'$ is replaced by $(Z'Z)_{(-i)} = Z'Z - z_i z_i'$, using a subscript of (−i) to denote removal of observation i. We can fit this into (2.28) by taking u = zi and v = −zi. Then
\[
(Z'Z)_{(-i)}^{-1} = (Z'Z)^{-1} + \frac{(Z'Z)^{-1}z_i z_i'(Z'Z)^{-1}}{1 - z_i'(Z'Z)^{-1}z_i}
= (Z'Z)^{-1} + \frac{(Z'Z)^{-1}z_i z_i'(Z'Z)^{-1}}{1 - H_{ii}},
\]
after recognizing the hat matrix diagonal from equation (2.8). We also find that
(Z 0 y)(−i) = Z 0 y − zi yi . Now
\begin{align*}
\hat\beta_{(-i)} &= \Bigl((Z'Z)^{-1} + \frac{(Z'Z)^{-1}z_i z_i'(Z'Z)^{-1}}{1 - H_{ii}}\Bigr)\bigl(Z'Y - z_i y_i\bigr)\\
&= \hat\beta - (Z'Z)^{-1}z_i y_i + \frac{(Z'Z)^{-1}z_i z_i'\hat\beta}{1 - H_{ii}} - \frac{(Z'Z)^{-1}z_i H_{ii} y_i}{1 - H_{ii}}\\
&= \hat\beta + \frac{(Z'Z)^{-1}z_i z_i'\hat\beta}{1 - H_{ii}} - \frac{(Z'Z)^{-1}z_i y_i}{1 - H_{ii}}\\
&= \hat\beta - \frac{(Z'Z)^{-1}z_i(y_i - \hat y_i)}{1 - H_{ii}}.
\end{align*}
Now the prediction for yi when (zi, yi) is removed from the least squares fit is
\[
\hat y_{i,(-i)} = z_i'\hat\beta_{(-i)} = \hat y_i - \frac{H_{ii}(y_i - \hat y_i)}{1 - H_{ii}},
\]
which we may rearrange as
\[
\hat y_i = H_{ii}y_i + (1 - H_{ii})\hat y_{i,(-i)}. \tag{2.29}
\]
Equation (2.29) has an important interpretation. The least squares fit ŷi is
a weighted combination of yi itself and the least squares prediction we would
have made for it, had it been left out of the fitting. The larger Hii is, the more
that ŷi depends on yi . It also means that if we want to compute a “leave one
out” residual yi − ŷi,(−i) we don’t have to actually take (zi , yi ) out of the data
and rerun the regression. We can instead use
\[
y_i - \hat y_{i,(-i)} = \frac{y_i - \hat y_i}{1 - H_{ii}}. \tag{2.30}
\]
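A final sketch verifying (2.30) numerically, assuming numpy and synthetic data chosen only for illustration: the leave one out residuals computed from the hat matrix diagonal match a brute force refit with each observation deleted in turn.

import numpy as np

rng = np.random.default_rng(12)
n, p = 20, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = Z @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
y_hat = H @ Y
loo_shortcut = (Y - y_hat) / (1.0 - np.diag(H))      # equation (2.30)

loo_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef, *_ = np.linalg.lstsq(Z[keep], Y[keep], rcond=None)
    loo_brute[i] = Y[i] - Z[i] @ coef

print(np.allclose(loo_shortcut, loo_brute))          # True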