Lecture: Projection Theorem (October 2022)
Vector space
Let H denote a vector space over the field of real numbers. An element of this space is
called a vector. Two examples of vector spaces that will feature prominently in what follows
are (i) ℝ^N (Euclidean space) and (ii) the L2 spaces. An element of the Euclidean space is
simply an N × 1 vector (or list) of real numbers. An element of an L2 space is a (function
of a) random variable with finite variance. Vector spaces need to include a null vector. In
Euclidean spaces the null vector is simply a list of zeros; in L2 spaces the null vector is a
degenerate random variable identically equal to zero. We will denote the null element of a
vector space by 0. We can add vectors in these spaces in the normal way (element-wise)
and also multiply (i.e., rescale) them by scalars. Chapter 2 of Luenberger (1969) presents a
formal development.
If we pair a vector space with an inner product defined on H × H we get what is called a
(pre-) Hilbert space. A valid inner product ⟨·, ·⟩ : H × H → ℝ satisfies the conditions:
1. Bi-linearity: ⟨ah1 + bh2, ch3 + dh4⟩ = ac⟨h1, h3⟩ + ad⟨h1, h4⟩ + bc⟨h2, h3⟩ + bd⟨h2, h4⟩ for a, b, c and d real scalars and h1, h2, h3 and h4 elements of H;
2. Symmetry: ⟨h1, h2⟩ = ⟨h2, h1⟩ for all h1, h2 ∈ H;
3. Positivity: ⟨h1, h1⟩ ≥ 0, with equality if, and only if, h1 is the null vector.
Let X and Y denote N × 1 vectors collecting years of completed schooling and earnings, respectively, for a sample of N adult male workers. Here X and Y may be regarded as elements of a Euclidean space, where we will work with the inner product
⟨X, Y⟩ = X′Y/N = (1/N) Σ_{i=1}^N Xi Yi.  (1)
It is an easy, but useful, exercise to verify that (1) satisfies our three conditions for a valid
inner product. Note that (1) is the familiar dot product (albeit divided by N ).
Let X and Y denote years of completed schooling and earnings for a generic random draw
from the population of adult male workers. Here X and Y may be regarded as elements of
an L2 space, where we will work with the inner product
⟨X, Y⟩ = E[XY],  (2)
where E[X] denotes the expected value, or population mean, of the random variable X.
Again, it is a useful exercise to verify that (2) satisfies our three conditions for a valid inner
product. I will sometimes refer to (2) as the covariance inner product.
Associated with an inner product is a norm, ‖h‖ = ⟨h, h⟩^{1/2}, which satisfies:
1. Positivity: ‖h‖ ≥ 0, with equality if, and only if, h is the null vector;
2. Homogeneity: ‖αh‖ = |α| ‖h‖ for any real scalar α;
3. Triangle Inequality: ‖h1 + h2‖ ≤ ‖h1‖ + ‖h2‖ for any h1, h2 ∈ H.
The first two properties of the norm are easy to verify. It is instructive to verify the third. To do this we will first prove:
Lemma 1. (Cauchy-Schwarz Inequality) For any h1, h2 ∈ H,
|⟨h1, h2⟩| ≤ ‖h1‖ ‖h2‖,
with equality if, and only if, h1 = αh2 for some real scalar α or h2 = 0.
Proof. If h2 = 0 the inequality holds trivially. For h2 ≠ 0, bi-linearity implies that, for any real scalar α,
0 ≤ ‖h1 − αh2‖² = ‖h1‖² − 2α⟨h1, h2⟩ + α²‖h2‖².  (3)
Next set α = ⟨h1, h2⟩ / ‖h2‖²; substituting into (3) yields
0 ≤ ‖h1‖² − ⟨h1, h2⟩² / ‖h2‖²,
which after re-arranging and taking square roots yields the result.
With the Cauchy-Schwarz Inequality in hand we can now verify the third property of the norm, the Triangle Inequality.
Proof. Applying the definition of the norm and using the bi-linearity property of the inner
product yields
‖h1 + h2‖² = ⟨h1 + h2, h1 + h2⟩
= ‖h1‖² + 2⟨h1, h2⟩ + ‖h2‖²
≤ ‖h1‖² + 2|⟨h1, h2⟩| + ‖h2‖²
≤ ‖h1‖² + 2‖h1‖ ‖h2‖ + ‖h2‖²
= (‖h1‖ + ‖h2‖)²,
where the fourth line follows from the Cauchy-Schwarz inequality. Taking the square root of
both sides gives the result.
The Cauchy-Schwarz (CS) and Triangle (TI) inequalities are widely-used in probability,
statistics and econometrics.
We say that the vectors X and Y are orthogonal if their inner product is zero. We denote
this by X ⊥ Y. When two vectors are orthogonal the cross term 2⟨X, Y⟩ in the expansion of ‖X + Y‖² vanishes, yielding Pythagoras' Theorem: ‖X + Y‖² = ‖X‖² + ‖Y‖². The formal proof is left as an exercise.
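As another illustrative aside, the Cauchy-Schwarz and Triangle Inequalities, and Pythagoras' Theorem for an orthogonal pair, can be confirmed numerically. A minimal NumPy sketch (simulated vectors and helper names are hypothetical):

import numpy as np

rng = np.random.default_rng(1)
N = 100
x, y = rng.normal(size=(2, N))

def inner(u, v):
    return u @ v / len(u)            # inner product (1)

def norm(u):
    return np.sqrt(inner(u, u))      # induced norm

# Cauchy-Schwarz: |<x, y>| <= ||x|| ||y||.
assert abs(inner(x, y)) <= norm(x) * norm(y) + 1e-12
# Triangle Inequality: ||x + y|| <= ||x|| + ||y||.
assert norm(x + y) <= norm(x) + norm(y) + 1e-12
# Pythagoras: e is y with its projection onto x removed, so x and e are orthogonal.
e = y - inner(x, y) / inner(x, x) * x
assert np.isclose(inner(x, e), 0.0, atol=1e-12)
assert np.isclose(norm(x + e) ** 2, norm(x) ** 2 + norm(e) ** 2)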
The value of Hilbert spaces is that they provide a mechanism for generalizing geometric intu-
itions familiar from Euclidean spaces to more complicated situations. For example, ‖h1 − h2‖
measures the distance between h1 and h2. Another example is that orthogonality corresponds
to perpendicularity. A good way to develop intuition about some of the results in
this note is to reflect on their geometric interpretations in 2 and 3 dimensions.
Projection Theorem
Let L be some linear subspace of H (i.e., if h1 and h2 are both in L, then so is ah1 + bh2 ).
In what follows, we will restrict ourselves to closed linear subspaces. This implies that if
hn is an element of L for all n and hn → h, then h is also in L; Luenberger (1969)
provides additional details. In what follows “subspace” refers to a closed linear subspace
unless explicitly stated otherwise.
Let X and Y be two elements of H. Then L might consist of all linear functions of X, or
(almost) any function of X. It is of considerable interest to consider the projection of Y ∈ H
onto the subspace L. Specifically we define the projection operator Π (·| L) : H → L by:
Π(Y|L) is the element Ŷ ∈ L that achieves
min_{Ŷ ∈ L} ‖Y − Ŷ‖.  (4)
In words we look for the element of L, a subspace of H, which is closest to Y (or “best
approximates” Y ). The notion of “best” is embodied in the chosen inner product and induced
norm.
It is instructive to consider two examples. Let Y and X be the N × 1 vectors of earnings and
years of schooling, respectively, for a sample of adult male workers. Let L be the linear span of 1 and X (i.e.,
vectors of the form α1 + βX). In that case finding Π (Y| L) corresponds to computing α̂
and β̂, the solutions to
min_{(α,β) ∈ ℝ²} ‖Y − α1 − βX‖² = min_{(α,β) ∈ ℝ²} (1/N) Σ_{i=1}^N (Yi − α − βXi)².  (5)
This corresponds to finding the ordinary least squares (OLS) fit of Y = (Y1, Y2, . . . , YN)′
onto a constant and X = (X1, X2, . . . , XN)′.
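A minimal NumPy sketch of this calculation with simulated (hypothetical) schooling and earnings vectors; it also checks that the resulting prediction error is orthogonal, under inner product (1), to both the constant and X:

import numpy as np

rng = np.random.default_rng(2)
N = 500
X = rng.integers(8, 21, size=N).astype(float)        # simulated years of schooling
Y = 1.0 + 0.1 * X + rng.normal(scale=0.5, size=N)     # simulated (log) earnings

# Solve (5): least squares fit of Y on a constant and X.
design = np.column_stack([np.ones(N), X])             # N x 2 matrix with columns 1 and X
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(design, Y, rcond=None)

Y_hat = alpha_hat + beta_hat * X                      # the projection of Y onto L
resid = Y - Y_hat
# Prediction error is orthogonal to the subspace spanned by 1 and X.
assert np.isclose(resid @ np.ones(N) / N, 0.0, atol=1e-10)
assert np.isclose(resid @ X / N, 0.0, atol=1e-8)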
Alternatively let (X, Y ) denote a schooling-earnings pair for a generic random draw from
the population of adult male workers. Let L consist of all linear functions of a constant and X; using the
norm associated with our L2 Hilbert space we have that Π(Y|L) corresponds to computing α0 and β0, the solutions to
min_{(α,β) ∈ ℝ²} ‖Y − α − βX‖² = min_{(α,β) ∈ ℝ²} E[(Y − α − βX)²].  (6)
This corresponds to finding the best (i.e., mean squared error minimizing) linear predictor
(LP) of Y given X.
Both (5) and (6) correspond to prediction problems. It turns out that, using the elementary
Hilbert space theory outlined above, we can provide a generic solution to both of them (and
indeed many other problems). The solution is a generalization of the idea familiar from
elementary school geometry that one can find the shortest distance between a point and
a line by “dropping the perpendicular”.
Theorem 2. (Projection Theorem) Let H be a vector space with an inner product and
associated norm and L a subspace of H. Then, for Y an arbitrary element of H, if there exists
a vector Ŷ ∈ L such that
‖Y − Ŷ‖ ≤ ‖Y − Ỹ‖ for all Ỹ ∈ L,  (7)
then
1. Ŷ = Π(Y|L) is unique;
2. a necessary and sufficient condition for Ŷ ∈ L to satisfy (7) is that the prediction error Y − Ŷ be orthogonal to L (i.e., ⟨Y − Ŷ, Ỹ⟩ = 0 for all Ỹ ∈ L).
Proof. See Luenberger (1969, Theorem 1). We begin by verifying that orthogonality is a
necessary condition for Ŷ to be norm minimizing. Suppose there exists a vector Ỹ ∈ L which is
not orthogonal to the prediction error Y − Ŷ. This implies that ⟨Y − Ŷ, Ỹ⟩ = α ≠ 0. We may, without loss of generality, normalize Ỹ so that ‖Ỹ‖ = 1 (otherwise replace Ỹ with Ỹ/‖Ỹ‖ and α with α/‖Ỹ‖). Now consider the candidate Ŷ + αỸ, which also belongs to L:
‖Y − Ŷ − αỸ‖² = ⟨Y − Ŷ − αỸ, Y − Ŷ − αỸ⟩
= ‖Y − Ŷ‖² − ⟨Y − Ŷ, αỸ⟩ − ⟨αỸ, Y − Ŷ⟩ + α²‖Ỹ‖²
= ‖Y − Ŷ‖² − α²,
where the last equality uses ⟨Y − Ŷ, Ỹ⟩ = α and ‖Ỹ‖ = 1. Since α² > 0, the element Ŷ + αỸ of L lies strictly closer to Y than Ŷ, so Ŷ cannot satisfy (7). Hence orthogonality of the prediction error is necessary.
Next we show sufficiency and uniqueness. Suppose Y − Ŷ is orthogonal to every element of L. Then, for any Ỹ ∈ L,
‖Y − Ỹ‖² = ‖(Y − Ŷ) + (Ŷ − Ỹ)‖²
= ‖Y − Ŷ‖² + ‖Ŷ − Ỹ‖²,
where the last equality follows from the fact that Ŷ − Ỹ ∈ L and Y − Ŷ is orthogonal to
any element of L. Next, by the properties of the norm, ‖Ŷ − Ỹ‖² ≥ 0, with equality if, and
only if, Ŷ = Ỹ. This implies (7) for all Ỹ ∈ L, with equality holding only if Ŷ = Ỹ. This
gives sufficiency and uniqueness.
Observe that we have not shown existence of a solution to (4). We have shown that, conditional
on the existence of a solution, the solution is unique and that the prediction error
Y − Ŷ is orthogonal to the subspace L. Proving existence is a more technical argument. For
a general result, which applies to closed linear subspaces (and all the cases considered here),
see Luenberger (1969, p. 51 - 52).
Three additional properties of projections will prove useful to us. First, they are linear
operators. To see this note that we can write, using the Projection Theorem,
X = Π ( X| L) + UX , UX ⊥ L
Y = Π ( Y | L) + UY , UY ⊥ L.
Now observe that, using bi-linearity of the inner product, for all W ∈ L,
⟨aX + bY − (aΠ(X|L) + bΠ(Y|L)), W⟩ = a⟨UX, W⟩ + b⟨UY, W⟩ = 0,
and hence, since aΠ(X|L) + bΠ(Y|L) ∈ L, the Projection Theorem implies
Π (aX + bY | L) = aΠ (X| L) + bΠ (Y | L) . (8)
Linearity of the projection operator will be useful for establishing several properties of linear
regression.
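In the Euclidean case the projection onto C(W), for an N × K matrix W with linearly independent columns, is the linear map Y ↦ W(W′W)^{-1}W′Y (anticipating the formula derived in the least squares discussion below), so linearity is easy to check numerically. An illustrative NumPy sketch (all names hypothetical):

import numpy as np

rng = np.random.default_rng(3)
N, K = 200, 3
W = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # spans the subspace L = C(W)
P = W @ np.linalg.solve(W.T @ W, W.T)                            # projection matrix onto C(W)

X, Y = rng.normal(size=(2, N))
a, b = 1.5, -0.7
# Linearity, equation (8): Pi(aX + bY | L) = a Pi(X | L) + b Pi(Y | L).
assert np.allclose(P @ (a * X + b * Y), a * (P @ X) + b * (P @ Y))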
A second property of the projection operator is idempotency. Idempotency of an operator
means that it can be applied multiple times without changing the result beyond the one
found after the initial application. In the context of projections this property implies that
Π (Π ( Y | L)| L) = Π ( Y | L) . (9)
The projection of a projection is itself (assuming the same subspace is projected onto in
both cases). To see this observe that, for any Ỹ ∈ L,
0 = ⟨Π(Π(Y|L)|L) − Π(Y|L), Ỹ⟩
= ⟨Y − Π(Y|L), Ỹ⟩ − ⟨Y − Π(Π(Y|L)|L), Ỹ⟩
= 0 − ⟨Y − Π(Π(Y|L)|L), Ỹ⟩,
where the first equality holds because Π(Π(Y|L)|L) is the projection of Π(Y|L) onto L (so that its prediction error is orthogonal to any Ỹ ∈ L) and the third because ⟨Y − Π(Y|L), Ỹ⟩ = 0;
but the last line is the necessary and sufficient condition for Π (Π ( Y | L)| L) to be the unique
projection of Y onto L. This gives (9) above.
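Idempotency can be checked the same way with a simulated Euclidean example (hypothetical names; P below is again the projection matrix onto C(W)):

import numpy as np

rng = np.random.default_rng(4)
N = 200
W = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
P = W @ np.linalg.solve(W.T @ W, W.T)   # projection matrix onto C(W)

Y = rng.normal(size=N)
# Idempotency, equation (9): projecting a projection changes nothing.
assert np.allclose(P @ (P @ Y), P @ Y)
assert np.allclose(P @ P, P)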
A third property of projections is that they are norm reducing. Let 1 denote a constant
vector. In a Euclidean space the constant vector equals an N × 1 vector of ones, denoted by
1. In the L2 space it is a degenerate random variable always equal to 1. In the Euclidean case
we have that Π(Y|1) = Ȳ1, where Ȳ = (1/N) Σ_{i=1}^N Yi is the sample mean; in the L2 case Π(Y|1) = E[Y].
It is a useful exercise to verify these claims. Observe further that, for the dot product and
covariance inner products respectively,
‖Y − Π(Y|1)‖² = (1/N) Σ_{i=1}^N (Yi − Ȳ)²,    ‖Y − Π(Y|1)‖² = V(Y),
the sample variance and population variance of Y.
Lemma 3. (Analysis of Variance) If the linear subspace L contains the constant vector,
then
‖Y − Π(Y|1)‖² = ‖Y − Π(Y|L)‖² + ‖Π(Y|L) − Π(Y|1)‖².
Proof. First observe that
⟨Y − Π(Y|1), Π(Y|L) − Π(Y|1)⟩ = ⟨Y − Π(Y|L) + Π(Y|L) − Π(Y|1), Π(Y|L) − Π(Y|1)⟩
= ⟨Y − Π(Y|L), Π(Y|L) − Π(Y|1)⟩ + ‖Π(Y|L) − Π(Y|1)‖²
= ‖Π(Y|L) − Π(Y|1)‖²,
where the last equality follows because Π(Y|L) − Π(Y|1) is an element of L (recall that L contains the constant vector) and Y − Π(Y|L) is orthogonal to every element of L. The same orthogonality, together with Pythagoras' Theorem applied to the decomposition Y − Π(Y|1) = [Y − Π(Y|L)] + [Π(Y|L) − Π(Y|1)], gives the result.
An immediate implication of Lemma 3 is that, since ‖Π(Y|L) − Π(Y|1)‖² ≥ 0,
‖Y − Π(Y|L)‖ ≤ ‖Y − Π(Y|1)‖:
adding predictors (weakly) reduces the norm of the prediction error.
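The decomposition in Lemma 3, and the norm-reducing property, can be verified numerically in the Euclidean case. A minimal NumPy sketch with simulated (hypothetical) data:

import numpy as np

rng = np.random.default_rng(5)
N = 300
X = rng.normal(size=N)
Y = 2.0 + 1.5 * X + rng.normal(size=N)

W = np.column_stack([np.ones(N), X])                # L contains the constant vector
Y_hat = W @ np.linalg.lstsq(W, Y, rcond=None)[0]    # Pi(Y | L)
Y_bar = Y.mean()                                    # Pi(Y | 1)

tss = np.sum((Y - Y_bar) ** 2) / N                  # ||Y - Pi(Y | 1)||^2
rss = np.sum((Y - Y_hat) ** 2) / N                  # ||Y - Pi(Y | L)||^2
ess = np.sum((Y_hat - Y_bar) ** 2) / N              # ||Pi(Y | L) - Pi(Y | 1)||^2
assert np.isclose(tss, rss + ess)                   # Lemma 3
assert np.sqrt(rss) <= np.sqrt(tss)                 # norm reducing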
Least squares
Let Y = (Y1, . . . , YN)′ denote an N × 1 vector of outcomes (e.g., earnings) for a sample of N adult male workers, and let X denote the corresponding N × K matrix of predictors (e.g., a constant, years of schooling, age dummies, parents' schooling etc.). Assume that the columns of X are linearly independent, such
that the rank of X is K.
The column space of X is the span – or set of all possible linear combinations – of its
column vectors. The set of all N × 1 vectors expressible as linear combinations of the K
columns of X is
L ≡ C(X) = {Xβ : β a K × 1 vector of real numbers}.  (10)
The projection of Y onto the column space of X satisfies, by the Projection Theorem intro-
duced earlier, the following necessary and sufficient condition
⟨Y − Π(Y|L), Xβ⟩ = ⟨Y − Xβ̂, Xβ⟩ = 0 for all β ∈ ℝ^K.  (11)
Note that both Y and (any element of) C (X) are Euclidean vectors; hence we work with
inner product (1) such that (11) coincides with
(1/N) Σ_{i=1}^N (Yi − Xi′β̂) Xi′β = (1/N) Σ_{i=1}^N Σ_{k=1}^K βk Xik (Yi − Xi′β̂) = 0.
Since this equality must hold for any β ∈ ℝ^K, setting βk = 1 and the remaining elements of β equal to zero gives, for each k = 1, . . . , K,
(1/N) Σ_{i=1}^N Xik (Yi − Xi′β̂) = 0,
or, stacking these K equations in vector form,
(1/N) Σ_{i=1}^N Xi (Yi − Xi′β̂) = 0.  (12)
Hence (11) implies (12). The converse, that (12) implies (11) for any β ∈ RK , follows
directly. This implies equivalence of the two conditions. We can therefore use (12) to find
the projection Π (Y| L) = Xβ̂.
Condition (12) is a system of K linear equations in the K unknown elements of β̂:
[(1/N) Σ_{i=1}^N Xi Yi] − [(1/N) Σ_{i=1}^N Xi Xi′] β̂ = 0,
where the first term in brackets is K × 1, the second is K × K, and the right-hand side is the K × 1 zero vector.
Solving this system for β̂ yields the familiar ordinary least squares (OLS) estimator:
β̂ = [(1/N) Σ_{i=1}^N Xi Xi′]^{-1} × [(1/N) Σ_{i=1}^N Xi Yi]  (13)
   = (X′X)^{-1} X′Y.
The projection of Y onto the column space of X therefore equals
Π(Y|L) = X(X′X)^{-1} X′Y = Xβ̂.
This is also called the least squares regression fit or simply the least squares fit.
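A NumPy sketch of (13) and the associated fitted values, using simulated (hypothetical) data; np.linalg.solve is used in place of an explicit inverse purely for numerical stability:

import numpy as np

rng = np.random.default_rng(6)
N, K = 400, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # N x K, first column a constant
beta_true = np.array([1.0, 0.5, -0.25])
Y = X @ beta_true + rng.normal(size=N)

# Equation (13): beta_hat = (X'X)^{-1} X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat                                             # Pi(Y | C(X)) = X beta_hat

# Agrees with a standard least squares routine.
assert np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
# Orthogonality condition (12): (1/N) sum_i X_i (Y_i - X_i' beta_hat) = 0.
assert np.allclose(X.T @ (Y - Y_hat) / N, 0.0, atol=1e-10)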
Inner product (1) extends naturally to matrices: for X and Y two conformable N × K matrices of real numbers,
⟨X, Y⟩ = (1/N) Tr(X′Y).  (14)
Note that, for K = 1, (14) coincides with our inner product definition for Euclidean vector
spaces, equation (1) above. Sometimes (14) is called the Frobenius Inner Product and
denoted by ⟨X, Y⟩_F. The division by N in (14) is non-standard. The associated norm is
‖X‖ = [(1/N) Tr(X′X)]^{1/2} = [(1/N) Σ_{i=1}^N Σ_{k=1}^K Xik²]^{1/2}  (15)
     = ‖X‖_F.
Equation (15) is the Frobenius norm (the division by N inside the [·] is typically omitted).
Linear regression/prediction
Let Y equal log earnings for a generic random draw from the population of adult males.
Let X be a corresponding K × 1 vector of respondent attributes (the first element of X is a
constant). The linear subspace spanned by X equals
L = {X′β : β a K × 1 vector of real numbers}.
Let Z = (X′, Y)′ denote a generic random draw from the population of interest (with distribution function F0 on support Z ∈ 𝒵 ⊆ ℝ^{K+1}). Let ‖X‖₂ = [Σ_{k=1}^K Xk²]^{1/2} denote the Euclidean norm of X and impose the following regularity condition on the population distribution function, F0.
Assumption 1. (i) E[Y²] < ∞, (ii) E[‖X‖₂²] < ∞, and (iii) E[(α′X)²] > 0 for any non-zero α ∈ ℝ^K.
Next consider the space of all 1-dimensional random functions of Z with finite variance (an L2
space). Condition (i) of Assumption 1 implies that Y is an element of this space. By the
Hölder Inequality (HI) we have
(E[(X′β)²])^{1/2} ≤ (E[‖X‖₂²])^{1/2} ‖β‖₂ < ∞,
so that, under condition (ii), X′β is also an element of this space for every β ∈ ℝ^K and the subspace L is well-defined. The Projection Theorem then implies that Π(Y|L) = X′β0 is characterized by the orthogonality condition
⟨Y − X′β0, X′β⟩ = E[(Y − X′β0) X′β] = 0
for all β ∈ ℝ^K. Using an argument similar to the one used in the analysis of least squares earlier,
we work with the equivalent K × 1 vector of conditions:
E[X(Y − X′β0)] = 0.  (17)
Re-arranging (17) yields
E[XX′] β0 − E[XY] = 0.
Part (iii) of Assumption 1 requires that no single predictor corresponds to a linear combi-
nation of the others (i.e., that the elements of X be linearly independent). This condition
ensures invertibility of E[XX′]: since
E[(α′X)²] = α′ E[XX′] α,
condition (iii) implies positive-definiteness of E[XX′]. This, in turn, implies that the determinant of E[XX′] is non-zero (non-singularity) and hence that E[XX′]^{-1} is well-defined.
Since E[XX′] is invertible we can solve directly for β0:
β0 = E[XX′]^{-1} × E[XY].  (18)
The projection of Y onto L is therefore
Π(Y|L) = E*[Y|X] = X′β0.  (19)
Here E∗ [Y | X] is special notation for the best linear predictor of Y given X. Unless stated
otherwise, I assume that a constant is a component of X.
Define U = Y − X′β0 to be the prediction error associated with (19). From condition (17) we
get
E [XU ] = 0. (20)
Equation (20) indicates that β0 is chosen to ensure that X and U are uncorrelated. Recall that the first element of X is a constant, so that (20) implies
E[U] = 0
and hence that C(X, U) = E[XU] − E[X] E[U] = 0.
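Population expectations are not available in code, but (18)-(20) can be illustrated by replacing them with sample moments on a large simulated draw (the data-generating process below is hypothetical; note that it is deliberately nonlinear, so X′β0 is only the best linear approximation):

import numpy as np

rng = np.random.default_rng(7)
n = 200_000                                       # large n: sample moments approximate expectations
S = rng.integers(8, 21, size=n).astype(float)     # "years of schooling"
X = np.column_stack([np.ones(n), S])              # X = (1, S)', constant in the first position
Y = 0.5 + 0.08 * S + 0.02 * (S - S.mean()) ** 2 + rng.normal(size=n)

# Sample analogue of (18): beta_0 = E[XX']^{-1} E[XY].
beta_0 = np.linalg.solve(X.T @ X / n, X.T @ Y / n)
U = Y - X @ beta_0                                # prediction error, as in (20)

print(np.mean(X * U[:, None], axis=0))            # E[XU] = 0 (both entries numerically zero)
print(U.mean())                                   # E[U] = 0
print(np.cov(S, U)[0, 1])                         # C(S, U) = 0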
Multivariate regression
Let H be a Hilbert space consisting of J-dimensional random functions of Z with finite variance.
That is, elements of H are functions
h : 𝒵 → ℝ^J with E[h(Z)′h(Z)] < ∞.
The null vector in this space is the degenerate random variable identically equal to a J × 1
vector of zeros. In what follows I use h to denote h(Z), with the dependence on the random
variable Z left implicit. Call this space the space of J-dimensional random functions with
finite variance.
We extend the covariance inner product, equation (2) above, to accommodate multi-dimensional h:
⟨h1, h2⟩ = E[h1(Z)′ h2(Z)].
Next let g(Z) be a K × 1 vector of random functions with E[g′g] < ∞. The linear subspace
spanned by g(Z) equals
L = {Πg(Z) : Π a J × K matrix of real numbers}.  (21)
Assume that the random functions g1(Z), g2(Z), . . . , gK(Z) are linearly independent, such
that the inverse E[g(Z)g(Z)′]^{-1} exists.
To find the projection of h ∈ H onto L we use the necessary and sufficient condition from
the Projection Theorem introduced earlier:
E[(h − Π0 g)′ Πg] = 0 for all real J × K matrices Π.  (22)
As in the least squares analysis above, we work with the equivalent J × K matrix of moment conditions
E[(h − Π0 g) g′] = 0.  (23)
To show the equivalence of (22) and (23) note that, using linearity of the expectation and
trace operators,
E[(h − Π0 g)′ Πg] = E[Tr((h − Π0 g)′ Πg)]
= E[Tr(Πg (h − Π0 g)′)]
= Σ_{j=1}^J Σ_{k=1}^K πjk E[gk (h − Π0 g)j],
where πjk is the jk-th element of the arbitrary real J × K matrix Π. For any pair (j, k) we can
set πjk = 1 and the remaining elements of Π equal to zero. This implies
E[gk (h − Π0 g)j] = 0
for all j and k, which is the elementwise statement of (23). The converse, that (23) implies (22) for any Π ∈ ℝ^{J×K}, follows directly.
Solving (23) for Π0 indicates that the unique projection of h onto L is:
Π(h|L) = Π0 g,   Π0 = E[hg′] × E[gg′]^{-1}.  (24)
This is the multivariate linear predictor (or multivariate linear regression) of h(Z) onto
g(Z). Letting Z = (Y1, . . . , YJ, X1, . . . , XK)′, h(Z) = Y = (Y1, . . . , YJ)′ and g(Z) = X =
(X1, . . . , XK)′ yields
Π(Y|L) = Π0 X,   Π0 = E[YX′] × E[XX′]^{-1}.  (25)
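A sample-moment sketch of (25) with J = 2 simulated outcomes and K = 3 predictors (the data-generating process and names are hypothetical); it also checks the defining moment condition (23):

import numpy as np

rng = np.random.default_rng(8)
n, J, K = 100_000, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # n x K, includes a constant
B = rng.normal(size=(J, K))                                     # "true" J x K coefficient matrix
Y = X @ B.T + rng.normal(size=(n, J))                           # n x J matrix of outcomes

# Sample analogue of (25): Pi_0 = E[YX'] E[XX']^{-1}.
Pi_0 = (Y.T @ X / n) @ np.linalg.inv(X.T @ X / n)               # J x K
fitted = X @ Pi_0.T                                             # row i holds (Pi_0 X_i)'

# Moment condition (23): E[(Y - Pi_0 X) X'] = 0, a J x K matrix of zeros.
assert np.allclose((Y - fitted).T @ X / n, 0.0, atol=1e-8)
print(np.round(Pi_0 - B, 3))                                    # close to zero for large n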
Bibliographic notes
The standard reference on vector space optimization is Luenberger (1969). These notes draw
extensively from this reference (glossing over details!). Appendix B.10 of Bickel & Doksum
(2015) and van der Vaart (1998, Chapter 10) provide compact introductions targeted toward
statistical applications. I have also found the presentation of Hilbert Space theory in Tsiatis
(2006) very helpful.
References
Bickel, P. J. & Doksum, K. A. (2015). Mathematical Statistics, volume 1. Boca Raton:
Chapman & Hall, 2nd edition.
Luenberger, D. G. (1969). Optimization by Vector Space Methods. New York: John Wiley
& Sons, Inc.
Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. New York: Springer.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.