
Projection Theorem

Bryan S. Graham, UC - Berkeley & NBER

November 27, 2022

Optimization and approximation problems arise frequently in econometrics. Many of these problems are solvable using vector space methods, in particular by orthogonal projection in a vector space endowed with an inner product. While there is a fixed cost associated with developing a vector space approach, the long-term payoff is considerable. Many properties of, for example, linear regression are easily derived using projection arguments. Projection arguments also play important roles in deriving the distributions of (complicated) sequences of statistics (cf. van der Vaart, 1998, Chapters 12-13) and in semiparametric efficiency bound analysis (cf. Newey, 1990). Our development of these tools will be informal, but of sufficient depth to allow for application to interesting problems.

Vector space
Let H denote a vector space over the field of real numbers. An element of this space is called a vector. Two examples of vector spaces that will feature prominently in what follows are (i) the Euclidean space R^N and (ii) the L^2 spaces. An element of Euclidean space is simply an N × 1 vector (or list) of real numbers. An element of an L^2 space is a (function of a) random variable with finite variance. Vector spaces need to include a null vector. In Euclidean spaces the null vector is simply a list of zeros; in L^2 spaces the null vector is a degenerate random variable identically equal to zero. We will denote the null element of a vector space by 0. We can add vectors in these spaces in the normal way (element-wise) and also multiply (i.e., rescale) them by scalars. Chapter 2 of Luenberger (1969) presents a formal development.
If we pair a vector space with an inner product defined on H × H we get what is called a (pre-) Hilbert space. A valid inner product ⟨·, ·⟩ : H × H → R satisfies the conditions:

1. Bi-linearity: ⟨ah1 + bh2, ch3 + dh4⟩ = ac⟨h1, h3⟩ + ad⟨h1, h4⟩ + bc⟨h2, h3⟩ + bd⟨h2, h4⟩ for a, b, c and d real scalars and h1, h2, h3 and h4 elements of H;

2. Symmetry: ⟨h1, h2⟩ = ⟨h2, h1⟩;

3. Positivity: ⟨h1, h1⟩ ≥ 0 with equality if, and only if, h1 is a null vector.

Let X = (X1, X2, ..., XN)′ and Y = (Y1, Y2, ..., YN)′ be N × 1 vectors of real numbers. For example (X1, Y1), ..., (XN, YN) may consist of pairs of years of completed schooling and adult earnings measures for a random sample of N adult male workers. Here X and Y are elements of the Euclidean space R^N and we will work with the inner product

⟨X, Y⟩ = X′Y/N = (1/N) Σ_{i=1}^N Xi Yi.    (1)

It is an easy, but useful, exercise to verify that (1) satisfies our three conditions for a valid inner product. Note that (1) is the familiar dot product (albeit divided by N).
Let X and Y denote years of completed schooling and earnings for a generic random draw from the population of adult male workers. Here X and Y may be regarded as elements of an L^2 space, where we will work with the inner product

⟨X, Y⟩ = E[XY],    (2)

where E[X] denotes the expected value, or population mean, of the random variable X. Again, it is a useful exercise to verify that (2) satisfies our three conditions for a valid inner product. I will sometimes refer to (2) as the covariance inner product.
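As a quick numerical illustration, here is a minimal Python sketch (using numpy; the simulated schooling/earnings data and variable names are purely hypothetical) that evaluates the scaled dot product (1) for a pair of Euclidean vectors and approximates the covariance inner product (2) by its sample analogue:

    import numpy as np

    rng = np.random.default_rng(0)

    def inner_euclid(x, y):
        # inner product (1): the dot product divided by N
        return x @ y / len(x)

    # Euclidean example: schooling (X) and log earnings (Y) for N workers
    N = 1_000
    X = rng.normal(12.0, 2.0, size=N)              # years of schooling
    Y = 1.0 + 0.1 * X + rng.normal(size=N)         # log earnings

    print(inner_euclid(X, Y))                      # <X, Y> under (1)
    print(inner_euclid(X, Y) - inner_euclid(Y, X)) # symmetry holds exactly

    # L^2 example: E[XY] in (2), approximated by a large-sample average
    print(np.mean(X * Y))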
Associated with an inner product is a norm ‖h‖ = ⟨h, h⟩^{1/2} which satisfies:

1. ‖h‖ = 0 if, and only if, h = 0;

2. ‖ah‖ = |a| ‖h‖ for any scalar a;

3. Triangle Inequality: ‖h1 + h2‖ ≤ ‖h1‖ + ‖h2‖.

The first two properties of the norm are easy to verify. It is instructive to verify the third. To do this we will first prove the following lemma.

Lemma 1. (Cauchy-Schwarz Inequality) For all (h1, h2) ∈ H × H,

|⟨h1, h2⟩| ≤ ‖h1‖ ‖h2‖

with equality if, and only if, h1 = αh2 for some real scalar α or h2 = 0.


Proof. If h2 = 0 the claim holds trivially, so assume h2 ≠ 0. Begin by observing that for all scalars α

0 ≤ ⟨h1 − αh2, h1 − αh2⟩
  = ⟨h1, h1⟩ − α⟨h1, h2⟩ − α⟨h2, h1⟩ + α²⟨h2, h2⟩
  = ‖h1‖² − 2α⟨h1, h2⟩ + α²‖h2‖².    (3)

Next set α = ⟨h1, h2⟩/‖h2‖²; substituting into (3) yields

0 ≤ ‖h1‖² − ⟨h1, h2⟩²/‖h2‖²,

which after re-arranging and taking square roots yields the result.

With Lemma 1 in hand it is straightforward to prove the Triangle Inequality.

Lemma 2. (Triangle Inequality) For all (h1, h2) ∈ H × H,

‖h1 + h2‖ ≤ ‖h1‖ + ‖h2‖.

Proof. Applying the definition of the norm and using the bi-linearity property of the inner
product yields

‖h1 + h2‖² = ⟨h1 + h2, h1 + h2⟩
           = ‖h1‖² + 2⟨h1, h2⟩ + ‖h2‖²
           ≤ ‖h1‖² + 2|⟨h1, h2⟩| + ‖h2‖²
           ≤ ‖h1‖² + 2‖h1‖‖h2‖ + ‖h2‖²
           = (‖h1‖ + ‖h2‖)²,

where the fourth line follows from the Cauchy-Schwarz inequality. Taking the square root of
both sides gives the result.

The Cauchy-Schwarz (CS) and Triangle (TI) inequalities are widely used in probability, statistics and econometrics.
We say that two vectors h1 and h2 are orthogonal if their inner product is zero. We denote this by h1 ⊥ h2. When two vectors are orthogonal the cross term in the expansion of ‖h1 + h2‖² vanishes, yielding Pythagoras' Theorem; the proof of which is left as an exercise.

Theorem 1. (Pythagorean Theorem) If h1 ⊥ h2, then ‖h1 + h2‖² = ‖h1‖² + ‖h2‖².
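Each of these inequalities is easy to check numerically under inner product (1). A small sketch (simulated vectors in Python/numpy; nothing here is specific to the lecture's data) verifying Cauchy-Schwarz, the Triangle Inequality, and the Pythagorean Theorem:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 500
    h1 = rng.normal(size=N)
    h2 = rng.normal(size=N)

    inner = lambda a, b: a @ b / N          # inner product (1)
    norm = lambda a: np.sqrt(inner(a, a))   # induced norm

    # Cauchy-Schwarz: |<h1, h2>| <= ||h1|| ||h2||
    assert abs(inner(h1, h2)) <= norm(h1) * norm(h2)

    # Triangle Inequality: ||h1 + h2|| <= ||h1|| + ||h2||
    assert norm(h1 + h2) <= norm(h1) + norm(h2)

    # Pythagoras: replace h2 with its component orthogonal to h1
    h2_perp = h2 - (inner(h1, h2) / inner(h1, h1)) * h1
    assert np.isclose(inner(h1, h2_perp), 0.0, atol=1e-12)
    assert np.isclose(norm(h1 + h2_perp) ** 2, norm(h1) ** 2 + norm(h2_perp) ** 2)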


The value of Hilbert spaces is that they provide a mechanism for generalizing geometric intuitions familiar from Euclidean spaces to more complicated settings. For example, ‖h1 − h2‖ measures the distance between h1 and h2, and orthogonality corresponds to perpendicularity. A good way to develop intuition about some of the results in this note is to reflect on their geometric interpretations in two and three dimensions.

Projection Theorem
Let L be some linear subspace of H (i.e., if h1 and h2 are both in L, then so is ah1 + bh2 ).
In what follows, we will restrict ourselves to closed linear subspaces. This implies that if
hN is an element of L for all N and hN → h, then h is also in L; Luenberger (1969)
provides additional details. In what follows “subspace” refers to a closed linear subspace
unless explicitly stated otherwise.
Let X and Y be two elements of H. Then L might consist of all linear functions of X, or
(almost) any function of X. It is of considerable interest to consider the projection of Y ∈ H
onto the subspace L. Specifically we define the projection operator Π (·| L) : H → L by:
Π (Y | L) is the element Ŷ ∈ L that achieves

min_{Ŷ ∈ L} ‖Y − Ŷ‖.    (4)

In words we look for the element of L, a subspace of H, which is closest to Y (or “best
approximates” Y ). The notion of “best” is embodied in the chosen inner product and induced
norm.
It is instructive to consider two examples. Let Y and X be the N × 1 vectors of earnings and schooling, respectively, for a sample of adult male workers. Let L be the linear span of 1 and X (i.e., vectors of the form α1 + βX). In that case finding Π(Y|L) corresponds to computing α̂ and β̂, the solutions to

min_{(α,β)∈R^2} ‖Y − α1 − βX‖² = min_{(α,β)∈R^2} (1/N) Σ_{i=1}^N (Yi − α − βXi)².    (5)

This corresponds to finding the ordinary least squares (OLS) fit of Y = (Y1, Y2, ..., YN)′ onto a constant and X = (X1, X2, ..., XN)′.
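As a check on this equivalence, the following sketch (simulated schooling/earnings data; scipy is used only as a generic minimizer) solves problem (5) numerically and confirms that the minimizer coincides with the textbook OLS intercept and slope:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    N = 200
    X = rng.normal(12.0, 2.0, size=N)                   # schooling
    Y = 0.5 + 0.08 * X + rng.normal(scale=0.4, size=N)  # log earnings

    def criterion(theta):
        alpha, beta = theta
        return np.mean((Y - alpha - beta * X) ** 2)     # objective in (5)

    res = minimize(criterion, x0=np.zeros(2))

    # closed-form OLS of Y on a constant and X
    beta_hat = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
    alpha_hat = Y.mean() - beta_hat * X.mean()

    print(res.x, (alpha_hat, beta_hat))                 # the two solutions agree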
Alternatively let (X, Y ) denote a schooling-earnings pair for a generic random draw from
the population of adult male workers. Let L consist of all linear functions of X; using the
norm associated with our L^2 Hilbert space we have that Π(Y|L) corresponds to computing


α0 and β0, the solutions to

min_{(α,β)∈R^2} ‖Y − α − βX‖² = min_{(α,β)∈R^2} E[(Y − α − βX)²].    (6)

This corresponds to finding the best (i.e., mean squared error minimizing) linear predictor
(LP) of Y given X.
Both (5) and (6) correspond to prediction problems. It turns out that, using the elementary
Hilbert space theory outlined above, we can provide a generic solution to both of them (and
indeed many other problems). The solution is a generalization of the idea familiar from
elementary school geometry that one can find the shortest distance between a point and a line by "dropping the perpendicular".

Theorem 2. (Projection Theorem) Let H be a vector space with an inner product and associated norm and L a subspace of H. Then for Y an arbitrary element of H, if there exists a vector Ŷ ∈ L such that

‖Y − Ŷ‖ ≤ ‖Y − Ỹ‖    (7)

for all Ỹ ∈ L, then

1. Ŷ = Π(Y|L) is unique;

2. A necessary and sufficient condition for Ŷ to be the uniquely minimizing vector in L is the orthogonality condition

⟨Y − Ŷ, Ỹ⟩ = 0 for all Ỹ ∈ L

(or Y − Π(Y|L) ⊥ Ỹ for all Ỹ ∈ L).

Proof. See Luenberger (1969, Theorem 1). We begin by verifying that orthogonality is a necessary condition for Ŷ to be norm minimizing. Suppose there exists a vector Ỹ ∈ L which is not orthogonal to the prediction error Y − Ŷ. This implies that ⟨Y − Ŷ, Ỹ⟩ = α ≠ 0.


We can, without loss of generality, assume that ‖Ỹ‖ = 1 (otherwise work with the normalized vector Ỹ* = Ỹ/‖Ỹ‖ and constant α* = α/‖Ỹ‖ in what follows) and evaluate

‖Y − Ŷ − αỸ‖² = ⟨Y − Ŷ − αỸ, Y − Ŷ − αỸ⟩
              = ‖Y − Ŷ‖² − ⟨Y − Ŷ, αỸ⟩ − ⟨αỸ, Y − Ŷ⟩ + α²‖Ỹ‖²
              = ‖Y − Ŷ‖² − α²,

which implies the contradiction ‖Y − Ŷ − αỸ‖ < ‖Y − Ŷ‖ (note that Ŷ + αỸ ∈ L). Next we show that if Y − Ŷ ⊥ L, then Ŷ is the unique minimizing vector. Let Ỹ be some arbitrary element of L; we have that

‖Y − Ỹ‖² = ‖(Y − Ŷ) + (Ŷ − Ỹ)‖²
         = ‖Y − Ŷ‖² + 2⟨Y − Ŷ, Ŷ − Ỹ⟩ + ‖Ŷ − Ỹ‖²
         = ‖Y − Ŷ‖² + ‖Ŷ − Ỹ‖²,

where the last equality follows from the fact that Ŷ − Ỹ ∈ L and Y − Ŷ is orthogonal to any element of L. Next, by the properties of the norm, ‖Ŷ − Ỹ‖ ≥ 0, with equality if, and only if, Ŷ = Ỹ. This implies (7) for all Ỹ ∈ L, with equality holding only if Ŷ = Ỹ. This gives sufficiency and uniqueness.

Observe that we have not shown existence of a solution to (4). We have shown that, conditional on the existence of a solution, the solution is unique and that the prediction error Y − Ŷ is orthogonal to the subspace L. Proving existence is a more technical argument. For a general result, which applies to closed linear subspaces (and all the cases considered here), see Luenberger (1969, pp. 51-52).
Three additional properties of projections will prove useful to us. First, they are linear
operators. To see this note that we can write, using the Projection Theorem,

X = Π(X|L) + UX,   UX ⊥ L
Y = Π(Y|L) + UY,   UY ⊥ L.



Rescaling and adding yields

aX + bY = aΠ (X| L) + bΠ (Y | L) + aUX + bUY .

Now observe that, using bi-linearity of the inner product, for all W ∈ L

⟨aUX + bUY, W⟩ = a⟨UX, W⟩ + b⟨UY, W⟩ = 0

and hence
Π (aX + bY | L) = aΠ (X| L) + bΠ (Y | L) . (8)

Linearity of the projection operator will be useful for establishing several properties of linear
regression.
A second property of the projection operator is idempotency. Idempotency of an operator
means that it can be applied multiple times without changing the result beyond the one
found after the initial application. In the context of projections this property implies that

Π (Π ( Y | L)| L) = Π ( Y | L) . (9)

The projection of a projection is itself (assuming the same subspace is projected onto in both cases). To see this observe that, for any Ỹ ∈ L,

0 = ⟨Π(Π(Y|L)|L) − Π(Y|L), Ỹ⟩
  = ⟨Y − Π(Y|L), Ỹ⟩ − ⟨Y − Π(Π(Y|L)|L), Ỹ⟩
  = 0 − ⟨Y − Π(Π(Y|L)|L), Ỹ⟩.

The first equality follows from the orthogonality condition for the projection of Π(Y|L) onto L, and the third from the orthogonality condition for the projection of Y onto L. But the last line is then the necessary and sufficient condition for Π(Π(Y|L)|L) to be the unique projection of Y onto L. This gives (9) above.
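Both properties are easy to see concretely for projection onto the column space of a Euclidean design matrix (the leading case developed below), where Π(·|L) is represented by the matrix X(X′X)^{-1}X′. A short numpy sketch with simulated data, checking (8) and (9):

    import numpy as np

    rng = np.random.default_rng(3)
    N, K = 100, 3
    Xmat = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
    X = rng.normal(size=N)
    Y = rng.normal(size=N)

    # matrix representing projection onto the column space of Xmat
    P = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T

    a, b = 2.0, -0.5
    # linearity, property (8)
    assert np.allclose(P @ (a * X + b * Y), a * (P @ X) + b * (P @ Y))
    # idempotency, property (9): projecting a projection changes nothing
    assert np.allclose(P @ (P @ Y), P @ Y)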
A third property of projections is that they are norm reducing. Let 1 denote a constant vector. In a Euclidean space the constant vector equals an N × 1 vector of ones, denoted by 1. In the L^2 space it is a degenerate random variable always equal to 1. In the Euclidean case we have that Π(Y|1) = (1/N) Σ_{i=1}^N Yi = Ȳ, the sample mean. In the L^2 case Π(Y|1) = E[Y]. It is a useful exercise to verify these claims. Observe further that, for the dot product and covariance inner products respectively,

‖Y − Π(Y|1)‖² = (1/N) Σ_{i=1}^N (Yi − Ȳ)²,        ‖Y − Π(Y|1)‖² = V(Y),

which correspond to the sample and population variance respectively.


We can use the results above, in conjunction with the Pythagorean Theorem, to develop
a more general version of the familiar analysis of variance decomposition (e.g., Goldberger,
1991, p. 48).

Lemma 3. (Analysis of Variance) If the linear subspace L contains the constant vector, then

‖Y − Π(Y|1)‖² = ‖Y − Π(Y|L)‖² + ‖Π(Y|L) − Π(Y|1)‖².

Proof. By the bi-linearity property of the inner product:

‖Y − Π(Y|L)‖² = ‖Y − Π(Y|1)‖² − 2⟨Y − Π(Y|1), Π(Y|L) − Π(Y|1)⟩ + ‖Π(Y|L) − Π(Y|1)‖².

The middle term in the expression above can be further manipulated:

⟨Y − Π(Y|1), Π(Y|L) − Π(Y|1)⟩ = ⟨Y − Π(Y|L) + Π(Y|L) − Π(Y|1), Π(Y|L) − Π(Y|1)⟩
                              = ⟨Y − Π(Y|L), Π(Y|L) − Π(Y|1)⟩ + ‖Π(Y|L) − Π(Y|1)‖²
                              = ‖Π(Y|L) − Π(Y|1)‖²,

where ⟨Y − Π(Y|L), Π(Y|L) − Π(Y|1)⟩ = 0 by the Projection Theorem since Π(Y|L) − Π(Y|1) ∈ L whenever L contains the constant vector. Re-arranging gives the result.

An immediate implication of Lemma 3 is that

‖Y − Π(Y|L)‖ ≤ ‖Y − Π(Y|1)‖

(i.e., Π(Y|L) is norm reducing).
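Lemma 3 can also be verified numerically with the dot product inner product (1). A minimal sketch, assuming a simulated regression whose design includes a constant:

    import numpy as np

    rng = np.random.default_rng(4)
    N = 500
    x = rng.normal(size=N)
    y = 1.0 + 2.0 * x + rng.normal(size=N)

    Xmat = np.column_stack([np.ones(N), x])            # L contains the constant vector
    beta_hat, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    fit = Xmat @ beta_hat                              # Pi(Y | L)
    ybar = y.mean()                                    # Pi(Y | 1)

    sq_norm = lambda v: np.mean(v ** 2)                # squared norm under (1)

    total = sq_norm(y - ybar)
    residual = sq_norm(y - fit)
    explained = sq_norm(fit - ybar)
    assert np.isclose(total, residual + explained)     # analysis of variance
    assert np.sqrt(residual) <= np.sqrt(total)         # norm reducing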

The least squares fit as a projection


You are likely already familiar with one important projection: the least squares fit. Let Y be an N × 1 vector of log earnings measures for a simple random sample of N adult males. Let X be a corresponding N × K matrix of covariates. The first column of this matrix consists of a vector of ones. The remaining columns include measures of various respondent attributes (years of completed schooling, Armed Forces Qualification Test (AFQT) score, ethnicity dummies, parents' schooling, etc.). Assume that the columns of X are linearly independent such that the rank of X is K.
The column space of X is the span – or set of all possible linear combinations – of its column vectors. The set of all N × 1 vectors expressible as linear combinations of the K columns of X is

L := C(X) = {Xβ : β is a K × 1 vector of real numbers}.    (10)

The projection of Y onto the column space of X satisfies, by the Projection Theorem intro-
duced earlier, the following necessary and sufficient condition
⟨Y − Π(Y|L), Xβ⟩ = ⟨Y − Xβ̂, Xβ⟩ = 0.    (11)

Note that both Y and (any element of) C (X) are Euclidean vectors; hence we work with
inner product (1) such that (11) coincides with

(1/N) Σ_{i=1}^N (Yi − Xi′β̂) Xi′β = (1/N) Σ_{i=1}^N Σ_{k=1}^K βk Xik (Yi − Xi′β̂) = 0.

By setting βk = 1 and βj = 0 for j ≠ k we get the implication

(1/N) Σ_{i=1}^N Xik (Yi − Xi′β̂) = 0

for k = 1, . . . , K or, stacking conditions into a vector,

(1/N) Σ_{i=1}^N Xi (Yi − Xi′β̂) = 0.    (12)

Hence (11) implies (12). The converse, that (12) implies (11) for any β ∈ R^K, follows directly. This implies equivalence of the two conditions. We can therefore use (12) to find the projection Π(Y|L) = Xβ̂.
Condition (12) is a system of K linear equations:

[(1/N) Σ_{i=1}^N Xi Yi] − [(1/N) Σ_{i=1}^N Xi Xi′] β̂ = 0,

where the first bracketed term is K × 1, the second is K × K, and β̂ (and hence the zero vector) is K × 1.


Solving this system for β̂ yields the familiar ordinary least squares (OLS) estimator:
" N
#−1 " N
#
1 X 1 X
β̂ = Xi Xi0 × Xi Yi (13)
N i=1 N i=1
−1
= (X0 X) X0 Y.

Hence the projection of Y onto the column space of X equals

Π(Y|L) = X(X′X)^{-1}X′Y = Xβ̂.

This is also called the least squares regression fit or simply the least squares fit.
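A compact numerical check of the development above (simulated design matrix and outcomes): compute β̂ as in (13), form the least squares fit, and verify the orthogonality condition (12).

    import numpy as np

    rng = np.random.default_rng(5)
    N, K = 300, 4
    Xmat = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
    beta_true = np.array([1.0, 0.5, -0.25, 0.1])
    Y = Xmat @ beta_true + rng.normal(size=N)

    beta_hat = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y)   # (X'X)^{-1} X'Y
    fit = Xmat @ beta_hat                                   # projection of Y onto C(X)

    # orthogonality condition (12): (1/N) sum_i X_i (Y_i - X_i' beta_hat) = 0
    print(Xmat.T @ (Y - fit) / N)                           # numerically ~ zero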

Space of real matrices


The discussion above provides a projection interpretation of least squares. It is useful to extend our Euclidean vector space to accommodate matrix prediction. This extension, as we will see later, has numerous applications in the area of panel data (where multiple outcomes for each sampling unit are observed). It also provides an interesting perspective on some basic results in matrix analysis.
Let Xi and Yi be K × 1 vectors of real numbers. For example, Yi = (Yi1, ..., YiT)′ might be realized log earnings for individual i in years t = 1, ..., T (here T = K). Define the real N × K matrices X = (X1, ..., XN)′ and Y = (Y1, ..., YN)′. Here X and Y are elements of the space of real N × K matrices, say M_{N×K}(R). For this space we will work with the inner product

⟨X, Y⟩ = Tr(XY′)/N = (1/N) Σ_{i=1}^N Xi′Yi.    (14)

Note that, for K = 1, (14) coincides with our inner product definition for Euclidean vector spaces, equation (1) above. Sometimes (14) is called the Frobenius inner product and denoted by ⟨X, Y⟩_F. The division by N in (14) is non-standard.


The associated norm is

‖X‖ = [Tr(XX′)/N]^{1/2}    (15)
    = [(1/N) Σ_{i=1}^N Xi′Xi]^{1/2}
    = [(1/N) Σ_{i=1}^N Σ_{j=1}^K |Xij|²]^{1/2}
    = N^{-1/2} ‖X‖_F.

Here ‖X‖_F denotes the Frobenius norm; the division by N inside the [·]^{1/2} in (15) is non-standard and is typically omitted in the definition of the Frobenius norm.
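A brief sketch (arbitrary simulated matrices) confirming that (14) and (15) match the element-wise formulas, and relating (15) to numpy's built-in (unscaled) Frobenius norm:

    import numpy as np

    rng = np.random.default_rng(6)
    N, K = 50, 3
    A = rng.normal(size=(N, K))
    B = rng.normal(size=(N, K))

    # inner product (14): Tr(AB')/N equals the average row-wise dot product
    assert np.isclose(np.trace(A @ B.T) / N, np.mean(np.sum(A * B, axis=1)))

    # norm (15): [Tr(AA')/N]^{1/2} equals the usual Frobenius norm divided by sqrt(N)
    assert np.isclose(np.sqrt(np.trace(A @ A.T) / N),
                      np.linalg.norm(A, 'fro') / np.sqrt(N))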

Linear regression/prediction
Let Y equal log earnings for a generic random draw from the population of adult males.
Let X be a corresponding K × 1 vector of respondent attributes (the first element of X is a
constant). The linear subspace spanned by X equals

L = {X′β : β is a K × 1 vector of real numbers}.    (16)

Let Z = (X′, Y)′ denote a generic random draw from the population of interest (with distribution function F0 on support Z ∈ 𝒵 ⊆ R^{K+1}). Let ‖X‖₂ = [Σ_{k=1}^K Xk²]^{1/2} denote the Euclidean norm of X and impose the following regularity condition on the population distribution function, F0.

Assumption 1. (i) E[Y²] < ∞, (ii) E[‖X‖₂²] < ∞, and (iii) E[(α′X)²] > 0 for any non-zero α ∈ R^K.

We wish to compute the projection of Y onto L:

min_{β∈R^K} ‖Y − X′β‖² = min_{β∈R^K} E[(Y − X′β)²].

Next consider the space of all 1-dimensional random functions of Z with finite variance (the L^2 space). Condition (i) of Assumption 1 implies that Y is an element of this space. By the Hölder (Cauchy-Schwarz) inequality we have

E[(X′β)²] ≤ E[‖X‖₂²] ‖β‖₂² < ∞

by part (ii) of Assumption 1. Hence any element of L is also in L^2.


From the Projection Theorem we get the necessary and sufficient condition

⟨Y − X′β0, X′β⟩ = E[(Y − X′β0) X′β] = 0

for all β ∈ R^K. Using an argument similar to one used in the analysis of least squares earlier, we work with the equivalent K × 1 vector of conditions:

E[X(Y − X′β0)] = 0.    (17)

Re-arranging (17) yields the system of K linear equations:

E[XX′] β0 − E[XY] = 0.

Part (iii) of Assumption 1 requires that no single predictor corresponds to a linear combination of the others (i.e., that the elements of X be linearly independent). This condition ensures invertibility of E[XX′]: since

E[(α′X)²] = α′E[XX′]α,

condition (iii) implies positive-definiteness of E[XX′]. This, in turn, implies that the determinant of E[XX′] is non-zero (non-singularity) and hence that E[XX′]^{-1} is well-defined. Since E[XX′] is invertible we can directly solve for β0:

β0 = E[XX′]^{-1} E[XY].    (18)

The corresponding best (i.e., MSE-minimizing) linear predictor (LP) of Y given X is

Π(Y|L) = E*[Y|X] = X′β0.    (19)

Here E*[Y|X] is special notation for the best linear predictor of Y given X. Unless stated otherwise, I assume that a constant is a component of X.
Define U = Y − X′β0 to be the prediction error associated with (19). From (17) we get

E[XU] = 0.    (20)

Equation (20) indicates that β0 is chosen to ensure that the covariance between X and U is zero. Recall that the first element of X is a constant, so that (20) implies

E[U] = 0,

or zero average prediction error.
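In practice β0 in (18) is approximated by replacing population moments with sample averages. A sketch under a hypothetical data-generating process (chosen nonlinear so the linear predictor is genuinely an approximation), checking conditions (20) and E[U] = 0:

    import numpy as np

    rng = np.random.default_rng(7)
    N = 200_000                                   # large sample standing in for the population
    schooling = rng.normal(12.0, 2.0, size=N)
    X = np.column_stack([np.ones(N), schooling])  # constant plus one regressor
    Y = np.exp(0.05 * schooling) + rng.normal(size=N)   # Y nonlinear in schooling

    # sample analogue of beta_0 = E[XX']^{-1} E[XY], equation (18)
    beta0 = np.linalg.solve(X.T @ X / N, X.T @ Y / N)
    U = Y - X @ beta0                             # prediction error

    print(beta0)
    print(X.T @ U / N)                            # E[XU] ~ 0; first entry is E[U] ~ 0, as in (20)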

Multivariate regression
Let H be a Hilbert space consisting of J-dimensional random functions with finite variance. That is,

h : Z → R^J,

where h(Z) = (h1(Z), ..., hJ(Z))′ is such that

E[h(Z)′h(Z)] < ∞.

The null vector in this space is the degenerate random variable identically equal to a J × 1 vector of zeros. In what follows I use h to denote h(Z), with the dependence on the random variable Z left implicit. Call this space the space of J-dimensional random functions with finite variance.
We extend the covariance inner product, equation (2) above, to accommodate multi-dimensional h:

⟨h1, h2⟩ = E[h1(Z)′h2(Z)].

Next let g(Z) be a K × 1 vector of random functions with E[g′g] < ∞. The linear subspace spanned by g(Z) equals

L = {Πg : Π is a J × K matrix of real numbers}.    (21)

Assume that the random functions g1(Z), g2(Z), ..., gK(Z) are linearly independent such that the inverse E[g(Z)g(Z)′]^{-1} exists.


To find the projection of h ∈ H onto L we use the necessary and sufficient condition from
the Projection Theorem introduced earlier:

⟨h − Π0 g, Πg⟩ = 0, for all Π ∈ R^{J×K}.


Under the covariance inner product this condition implies that

E[(h − Π0 g)′ Πg] = 0.    (22)

To find the form of Π0 observe that condition (22) is equivalent to

E[g (h − Π0 g)′] = 0_{K×J}.    (23)

To show the equivalence of (22) and (23) note that, using linearity of the expectation and trace operators,

E[(h − Π0 g)′ Πg] = E[Tr((h − Π0 g)′ Πg)]
                  = E[Tr(g (h − Π0 g)′ Π)]
                  = Σ_j Σ_k πjk E[gk (h − Π0 g)j],

where πjk is the (j, k)-th element of the arbitrary real J × K matrix Π. For any pair j, k we can set πjk = 1 and the rest of the elements of Π to zero. This implies

E[gk (h − Π0 g)j] = 0

for all j, k, which is (23). The converse, that (23) implies (22) for any Π ∈ R^{J×K}, follows directly.
Solving (23) for Π0 indicates that the unique projection of h onto L is:

Π(h|L) = Π0 g,    Π0 = E[hg′] E[gg′]^{-1}.    (24)

This is the multivariate linear predictor (or multivariate linear regression) of h(Z) onto g(Z). Letting Z = (Y1, ..., YJ, X1, ..., XK)′, h(Z) = Y = (Y1, ..., YJ)′ and g(Z) = X = (X1, ..., XK)′ yields

Π(Y|L) = Π0 X,    Π0 = E[YX′] E[XX′]^{-1}.    (25)
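A sketch of (25) with simulated draws (hypothetical data-generating process): the J × K coefficient matrix Π0 is formed from sample moments, and each of its rows coincides with the coefficients from a separate linear predictor of the corresponding outcome given X.

    import numpy as np

    rng = np.random.default_rng(8)
    N, J, K = 100_000, 2, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # constant plus K-1 regressors
    B = rng.normal(size=(J, K))
    Y = X @ B.T + rng.normal(size=(N, J))                           # J outcomes per draw

    # sample analogue of Pi_0 = E[YX'] E[XX']^{-1}, equation (25)
    Pi0 = (Y.T @ X / N) @ np.linalg.inv(X.T @ X / N)

    # row j of Pi0 equals the coefficients from regressing Y_j on X
    beta_0, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)
    print(np.allclose(Pi0[0], beta_0))                              # True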

Bibliographic notes
The standard reference on vector space optimization is Luenberger (1969). These notes draw extensively from this reference (glossing over details!). Appendix B.10 of Bickel & Doksum (2015) and van der Vaart (1998, Chapter 10) provide compact introductions targeted toward statistical applications. I have also found the presentation of Hilbert space theory in Tsiatis (2006) very helpful.

References
Bickel, P. J. & Doksum, K. A. (2015). Mathematical Statistics, volume 1. Boca Raton: Chapman & Hall, 2nd edition.

Goldberger, A. S. (1991). A Course in Econometrics. Cambridge, MA: Harvard University Press.

Luenberger, D. G. (1969). Optimization by Vector Space Methods. New York: John Wiley & Sons, Inc.

Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2), 99-135.

Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. New York: Springer.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.

© Bryan S. Graham, 2018, 2022
