
ECON2125/8013 Maths Notes

John Stachurski

March 4, 2015
Contents

1 Linear Algebra 1
1.1 Vector Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Spans and Linear Subspaces . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.4 Bases and Dimension . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Linear Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.2 Linear Maps from R^N to R^N . . . . . . . . . . . . . . . . . . . 16
1.2.3 Linear Maps Across Dimensions . . . . . . . . . . . . . . . . . 17
1.3 Matrices and Linear Equations . . . . . . . . . . . . . . . . . . . . . . 17
1.3.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.2 Matrices as Maps . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.3 Square Matrices and Invertibility . . . . . . . . . . . . . . . . . 22
1.3.4 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3.5 Solving Equations with Tall Matrices . . . . . . . . . . . . . . . 25
1.3.6 Solving Equations with Wide Matrices . . . . . . . . . . . . . . 26
1.4 Other Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4.1 Types of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 27


1.4.2 Transpose and Trace . . . . . . . . . . . . . . . . . . . . . . . . 28


1.4.3 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . 30
1.4.4 Similar Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.4.5 Matrix Norm and Neumann Series . . . . . . . . . . . . . . . . 33
1.4.6 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.6.1 Solutions to Selected Exercises . . . . . . . . . . . . . . . . . . 41

2 Probability 46
2.1 Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.1.1 Sample Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.1.2 Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.1.3 Dependence and Independence . . . . . . . . . . . . . . . . . . 50
2.1.4 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.2.1 Definition and Notation . . . . . . . . . . . . . . . . . . . . . . 54
2.2.2 Finite Random Variables . . . . . . . . . . . . . . . . . . . . . . 56
2.2.3 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.3 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.3.1 Distribution Functions . . . . . . . . . . . . . . . . . . . . . . . 61
2.3.2 Densities and PMFs . . . . . . . . . . . . . . . . . . . . . . . . 63
2.3.3 The Quantile Function . . . . . . . . . . . . . . . . . . . . . . . 65
2.3.4 Expectations from Distributions . . . . . . . . . . . . . . . . . 66
2.3.5 Common Distributions . . . . . . . . . . . . . . . . . . . . . . . 68
2.4 Joint Distributions and Independence . . . . . . . . . . . . . . . . . . 69

2.4.1 Distributions Across Random Variables . . . . . . . . . . . . . 70


2.4.2 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.4.3 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.4.4 Best Linear Predictors . . . . . . . . . . . . . . . . . . . . . . . 73
2.5 Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.5.1 Modes of Convergence . . . . . . . . . . . . . . . . . . . . . . . 75
2.5.2 The Law of Large Numbers . . . . . . . . . . . . . . . . . . . . 78
2.5.3 The Central Limit Theorem . . . . . . . . . . . . . . . . . . . . 80
2.6 Random Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . 83
2.6.1 Expectations for Vectors and Matrices . . . . . . . . . . . . . . 84
2.6.2 Multivariate Gaussians . . . . . . . . . . . . . . . . . . . . . . 85
2.6.3 Convergence of Random Matrices . . . . . . . . . . . . . . . . 86
2.6.4 Vector LLN and CLT . . . . . . . . . . . . . . . . . . . . . . . . 89
2.7 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.8.1 Solutions to Selected Exercises . . . . . . . . . . . . . . . . . . 96

3 Orthogonality and Projections 107


3.1 Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.1.1 Definition and Basic Properties . . . . . . . . . . . . . . . . . . 107
3.1.2 Orthogonal Decompositions . . . . . . . . . . . . . . . . . . . 110
3.1.3 Orthogonal Complements . . . . . . . . . . . . . . . . . . . . . 110
3.2 Orthogonal Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.2.1 The Orthogonal Projection Theorem . . . . . . . . . . . . . . . 111
3.2.2 Orthogonal Projection as a Mapping . . . . . . . . . . . . . . . 113
3.2.3 Projection as Decomposition . . . . . . . . . . . . . . . . . . . 115

3.3 Applications of Projection . . . . . . . . . . . . . . . . . . . . . . . . . 116


3.3.1 Projection Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.3.2 Overdetermined Systems of Equations . . . . . . . . . . . . . 117
3.3.3 Gram Schmidt Orthogonalization . . . . . . . . . . . . . . . . 119
3.3.4 Solving Systems via QR Decomposition . . . . . . . . . . . . . 119
3.4 Projections in L2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.4.1 The Space L2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.4.2 Application: Best Linear Prediction . . . . . . . . . . . . . . . 125
3.4.3 Measurability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.4.4 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . 128
3.4.5 The Vector/Matrix Case . . . . . . . . . . . . . . . . . . . . . . 131
3.4.6 An Exercise in Conditional Expectations . . . . . . . . . . . . 132
3.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.6.1 Solutions to Selected Exercises . . . . . . . . . . . . . . . . . . 136

I Appendices 142

4 Appendix A: Analysis 143


4.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.2.1 Convergence and Continuity . . . . . . . . . . . . . . . . . . . 150
4.3 Real-Valued Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Chapter 1

Linear Algebra

1.1 Vector Space

[roadmap]

1.1.1 Vectors

Let’s quickly review the basics. An important set for us will be, for arbitrary N ∈ N,
the set of all N-vectors, or vectors of length N. This set is denoted by R N , and a
typical element is of the form

    x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}
    \qquad \text{where } x_n \in \mathbb{R} \text{ for each } n

(As usual, R = R^1 represents the set of all real numbers, which is, in essence, the
union of the rational and irrational numbers.) Here x has been written vertically, as a
column of numbers, but we could also write it horizontally, like so: x = (x_1, . . . , x_N).
At this stage, we are viewing vectors just as sequences of numbers, so it makes no
difference whether they are written vertically or horizontally. It’s only when we
get to matrix multiplication that we have to concern ourselves with distinguishing
between column (vertical) and row (horizontal) vectors.


The vector of ones will be denoted 1, while the vector of zeros will be denoted 0:
    \mathbf{1} := \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}
    \qquad
    \mathbf{0} := \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}
For elements of R N there are two fundamental algebraic operations: addition and
scalar multiplication. If x ∈ R^N and y ∈ R^N, then the vector sum is defined by

    x + y := \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}
    + \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
    := \begin{pmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_N + y_N \end{pmatrix}
If α ∈ R, then the scalar product of α and x is defined to be

    \alpha x := \begin{pmatrix} \alpha x_1 \\ \alpha x_2 \\ \vdots \\ \alpha x_N \end{pmatrix}
Thus, addition and scalar multiplication are defined in terms of ordinary addition
and multiplication in R, and computed element-by-element, by adding and multi-
plying respectively. Figures 1.1 and 1.2 show examples of vector addition and scalar
multiplication in the case N = 2. In the figure, vectors are represented as arrows,
starting at the origin and ending at the location in R2 defined by the vector.
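For instance, taking N = 2 with x = (1, 2) and y = (3, −1) (numbers chosen purely for illustration),

    x + y = \begin{pmatrix} 1 + 3 \\ 2 + (-1) \end{pmatrix} = \begin{pmatrix} 4 \\ 1 \end{pmatrix}
    \qquad \text{and} \qquad
    -2x = \begin{pmatrix} -2 \\ -4 \end{pmatrix}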
Remark: In some instances, the notion of scalar multiplication includes multiplica-
tion of vectors by complex numbers. In what follows we will work almost entirely
with real scalars, and hence scalar multiplication means real scalar multiplication
unless otherwise stated.
We have defined addition and scalar multiplication of vectors, but not subtraction.
Subtraction is performed element by element, analogous to addition. The definition
can be given in terms of addition and scalar multiplication: x − y := x + (−1)y. An
illustration of this operation is given in figure 1.3. One way to remember this is to
draw a line from y to x, and then shift it to the origin.
The inner product of two vectors x and y in R^N is denoted by x'y, and defined as
the sum of the products of their elements:

    x'y := \sum_{n=1}^{N} x_n y_n

[Figure 1.1: Vector addition]

[Figure 1.2: Scalar multiplication]

[Figure 1.3: Difference between vectors]

Fact 1.1.1. For any α, β ∈ R and any x, y, z ∈ R^N, the following statements are true:

1. x'y = y'x

2. (αx)'(βy) = αβ(x'y)

3. x'(αy + βz) = α(x'y) + β(x'z)

These properties are easy to check from the definition. However, since this is our
first formal claim let’s step through one of the arguments carefully, just as an ex-
ercise: Consider the second equality, which is (αx)'(βy) = αβ(x'y). The claim is
that the stated equality holds for any α, β ∈ R and any x, y ∈ R^N. To verify a “for
any” claim we have to show that the stated property holds for arbitrary choices of
these objects. Thus we start by saying: pick any α, β ∈ R and any x, y ∈ R^N. By the
definitions of scalar multiplication and inner product respectively, we have

    (\alpha x)'(\beta y) = \sum_{n=1}^{N} \alpha x_n \beta y_n

Passing the scalars out of the sum, we see that this equals αβ ∑_{n=1}^{N} x_n y_n, which,
using the definition of inner product again, is αβ(x'y). This verifies the claim.
Perhaps this style of argument seemed like overkill for such a simple statement. But
getting into the habit of constructing careful arguments will help you a great deal
as we work through more complex material. One thing you’ll find in particular if
you’re still new to formal reasoning is that getting the first sentence of a proof right
(establishing exactly what it is you’re going to show) is critical. If you get this right
the rest should flow more easily.
The (Euclidean) norm of a vector x ∈ R^N is defined as

    \|x\| := \sqrt{x'x} := \left( \sum_{n=1}^{N} x_n^2 \right)^{1/2}
    \qquad (1.1)

and represents the length of the vector x. (In the arrow representation of vectors in
figures 1.1–1.3, the norm of the vector is equal to the length of the arrow.)

Fact 1.1.2. For any α ∈ R and any x, y ∈ R^N, the following statements are true:

1. ‖x‖ ≥ 0 and ‖x‖ = 0 if and only if x = 0

2. ‖αx‖ = |α| ‖x‖

3. ‖x + y‖ ≤ ‖x‖ + ‖y‖

4. |x'y| ≤ ‖x‖ ‖y‖

The first two properties you can verify yourself without difficulty. Proofs for the
second two are a bit harder. The third property is called the triangle inequality,
while the fourth is called the Cauchy-Schwarz inequality. The proof of the Cauchy-
Schwarz inequality is given as a solved exercise after we’ve built up some more
tools (see exercise 3.6.13 below). If you’re prepared to accept the Cauchy-Schwarz
inequality for now, then the triangle inequality follows, because, by the properties
of the inner product given in fact 1.1.1,

    ‖x + y‖² = (x + y)'(x + y) = x'x + 2x'y + y'y ≤ x'x + 2|x'y| + y'y

Applying the Cauchy-Schwarz inequality to the term |x'y| leads to
‖x + y‖² ≤ ‖x‖² + 2‖x‖‖y‖ + ‖y‖² = (‖x‖ + ‖y‖)². Taking
the square root gives the triangle inequality.
Given two vectors x and y, the value ‖x − y‖ has the interpretation of being the
“distance” between these points. To see why, consult figure 1.3 again.
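These definitions are easy to experiment with on a computer. Here is a minimal NumPy sketch (the vectors are arbitrary illustrative choices) that checks the triangle and Cauchy-Schwarz inequalities for one particular pair:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, -1.0, 0.5])

    inner = x @ y                    # inner product x'y
    norm_x = np.sqrt(x @ x)          # Euclidean norm, as in (1.1)

    print(np.isclose(norm_x, np.linalg.norm(x)))                  # True
    print(np.linalg.norm(x + y) <= norm_x + np.linalg.norm(y))    # triangle inequality
    print(abs(inner) <= norm_x * np.linalg.norm(y))               # Cauchy-Schwarz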

1.1.2 Spans and Linear Subspaces

One of the most elementary ways to work with vectors is to combine them using
linear operations. Given K vectors x1 , . . . , xK in R N , a linear combination of these
vectors is a new vector of the form

    y = \sum_{k=1}^{K} \alpha_k x_k = \alpha_1 x_1 + \cdots + \alpha_K x_K

for some collection of scalars α1 , . . . , αK (i.e., with αk ∈ R for all k).

Fact 1.1.3. Inner products of linear combinations satisfy the following rule:

    \left( \sum_{k=1}^{K} \alpha_k x_k \right)' \left( \sum_{j=1}^{J} \beta_j y_j \right)
    = \sum_{k=1}^{K} \sum_{j=1}^{J} \alpha_k \beta_j \, x_k' y_j

Given any nonempty X ⊂ R N , the set of all vectors that can be made by (finite) lin-
ear combinations of elements of X is called the span of X, and denoted by span( X ).
For example, the set of all linear combinations of X := {x_1, . . . , x_K} is

    \mathrm{span}(X) := \left\{ \text{all vectors } \sum_{k=1}^{K} \alpha_k x_k
    \text{ such that } \alpha := (\alpha_1, \ldots, \alpha_K) \in \mathbb{R}^K \right\}

Example 1.1.1. Let X = {1} = {(1, 1)} ⊂ R2 . The span of X is all vectors of the
form α1 = (α, α) with α ∈ R. This constitutes a line in the plane. Since we can take
α = 0, it follows that the origin 0 is in span( X ). In fact span( X ) is the unique line in
the plane that passes through both 0 and the vector 1 = (1, 1).

Example 1.1.2. Let x1 = (3, 4, 2) and let x2 = (3, −4, 0.4). This pair of vectors is
shown on the left-hand side of figure 1.4. The span of {x1 , x2 } is a plane in R3 that
passes through both of these vectors and the origin. The plane is shown on the
right-hand side of figure 1.4.

Let Y be any subset of R N , and let X := {x1 , . . . , xK }. If Y ⊂ span( X ), we say that the
vectors in X span the set Y, or that X is a spanning set for Y. This is a particularly
nice situation when Y is large but X is small, because it means that all the vectors in
the large set Y are “described” by the small number of vectors in X. We’ll have a lot
more to say about this idea.
[Figure 1.4: Span of x1, x2]

Example 1.1.3. Consider the vectors {e1 , . . . , e N } ⊂ R N , where en has all zeros ex-
cept for a 1 as the n-th element:

    e_1 := \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad
    e_2 := \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \quad \cdots, \quad
    e_N := \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}

The case of R2 is illustrated in figure 1.5. The vectors e1 , . . . , e N are called the canon-
ical basis vectors of R N —we’ll see why later on. One reason is that {e1 , . . . , e N }
spans all of R N . To see this in the case of N = 2 (check general N yourself), observe
that for any y ∈ R2 , we have

    y := \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
    = \begin{pmatrix} y_1 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ y_2 \end{pmatrix}
    = y_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + y_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix}
    = y_1 e_1 + y_2 e_2

Thus, y ∈ span{e1 , e2 } as claimed. Since y is just an arbitrary vector in R2 , we have


shown that {e1 , e2 } spans R2 .

Example 1.1.4. Consider the set P := {( x1 , x2 , 0) ∈ R3 : x1 , x2 ∈ R}. Graphically,


P corresponds to the flat plane in R3 , where the height coordinate is always zero.
If we take e1 = (1, 0, 0) and e2 = (0, 1, 0), then given y = (y1 , y2 , 0) ∈ P we have
y = y1 e1 + y2 e2 . In other words, any y ∈ P can be expressed as a linear combination
of e1 and e2 , and {e1 , e2 } is a spanning set for P.
[Figure 1.5: Canonical basis vectors in R^2]

Fact 1.1.4. Let X and Y be any two finite subsets of R N . If X ⊂ Y, then we have
span( X ) ⊂ span(Y ).

One of the key features of the span of a set X is that it is “closed” under the linear
operations of vector addition and scalar multiplication, in the sense that if we take
elements of the span and combine them using these operations, the resulting vectors
are still in the span.
To check that span(X) is closed under vector addition for arbitrary nonempty
X ⊂ R^N, just observe that if y, z ∈ span(X), then we can express them as finite
linear combinations of elements of X, like so:

    y = \sum_{k=1}^{K} \alpha_k x_k \quad \text{and} \quad z = \sum_{k=1}^{K'} \alpha_k' x_k'

Here each x_k and x_k' is an element of X, and each α_k and α_k' is a scalar. Adding y and
z gives

    y + z = \sum_{k=1}^{K} \alpha_k x_k + \sum_{k=1}^{K'} \alpha_k' x_k'
which is yet another finite linear combination of elements of X. Hence span( X )
is closed under vector addition as claimed. Another easy argument shows that
span( X ) is closed under scalar multiplication.
The notion of a set being closed under scalar multiplication and vector addition is
important enough to have its own name: A nonempty subset S of R N is called a
linear subspace (or just subspace) of R N if, for any x and y in S, and any α and β in
R, the linear combination αx + βy is also in S.

Example 1.1.5. It follows immediately from the preceding discussion that if X is
any nonempty subset of R^N, then span(X) is a linear subspace of R^N. For this
reason, span(X) is often called the linear subspace spanned by X.

In R3 , lines and planes that pass through the origin are linear subspaces. You will
get more of a feel for this as we go along. Other linear subspaces of R3 are the
singleton set containing the zero element 0, and the set R3 itself.

Fact 1.1.5. Let S be a linear subspace of R N . The following statements are true:

1. The origin 0 is an element of S.

2. If X ⊂ S, then span( X ) ⊂ S.

3. span(S) = S.

1.1.3 Linear Independence

In some sense, the span of X := {x1 , . . . , xK } is a measure of the “diversity” of the


vectors in X—the more diverse are the elements of X, the greater is the set of vectors
that can be represented as linear combinations of its elements. In particular, if X is
not very “diverse,” then some “similar” elements may be redundant, in the sense
that one can remove an element xi from the collection X without reducing its span.
Let’s consider two extremes. First consider the vectors e1 := (1, 0) and e2 := (0, 1)
in R2 (figure 1.5). As we saw in example 1.1.3, the span {e1 , e2 } is all of R2 . With just
these two vectors, we can span the whole plane. In algebraic terms, these vectors
are relatively diverse. We can also see their diversity in the fact that if we remove
one of the vectors from {e1 , e2 }, the span is no longer all of R2 . In fact it is just a line
in R2 . Hence both vectors have their own role to play in forming the span.
Now consider the pair e1 and x := −2e1 = (−2, 0), as shown in figure 1.6. This pair
is not very diverse. In fact, if y ∈ span{e1 , x}, then, for some α1 and α2 ,

y = α1 e1 + α2 x = α1 e1 + α2 (−2)e1 = (α1 − 2α2 )e1 ∈ span{e1 }

In other words, any element of span{e1 , x} is also an element of span{e1 }. We can


kick x out of the set {e1 , x} without reducing the span.
Let’s translate these ideas into formal definitions. In general, the set of vectors X :=
{x1 , . . . , xK } in R N is called linearly dependent if one (or more) vector(s) can be
removed without changing span(X). We call X linearly independent if it is not
linearly dependent.

[Figure 1.6: The vectors e1 and x]
To see the definition of independence in a slightly different light, suppose that X :=
{x1 , . . . , xK } is linearly dependent, with
span{x1 , . . . , xK } = span{x2 , . . . , xK }

Since x1 ∈ span{x1 , . . . , xK } certainly holds, this equality implies that

x1 ∈ span{x2 , . . . , xK }

Hence, there exist constants α2 , . . . , αK with

x1 = α2 x2 + · · · + αK xK

In other words, x1 can be expressed as a linear combination of the other elements in


X. This is a general rule: Linear dependence means that at least one vector in the
set can be written as a linear combination of the others. Linear independence means
the opposite is true.
There is yet a third way to express linear independence, which is very succinct and
often the most useful condition to aim for when checking linear independence. It is
given as the first part of the next fact, which clarifies the various relationships.
Fact 1.1.6 (Definitions of linear independence). The following statements are all
equivalent definitions of linear independence of X := {x1 , . . . , xK } ⊂ R N :

1. If α1 x1 + · · · + αK xK = 0 then α1 = · · · = αK = 0.

2. If X0 is a proper subset of X (that is, X0 ⊂ X and X0 ≠ X), then span(X0) is a
proper subset of span(X).

3. No vector in X can be written as a linear combination of the others.

Exercise 1.6.11 asks you to check all of these equivalences.


Perhaps the most important example of a linearly independent set in R^N is the set of
canonical basis vectors from example 1.1.3. Indeed, if α_j ≠ 0 for some j, then
∑_{n=1}^{N} α_n e_n = (α_1, . . . , α_N) ≠ 0. More examples are given in the exercises.
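In practice, a convenient way to test a finite set of vectors for linear independence is to stack them as the columns of a matrix and check whether the only solution of the corresponding homogeneous system is the zero vector; as §1.3.2 below makes precise, this is the same as the matrix having rank K. A rough NumPy sketch (the vectors are those from example 1.1.2 plus their sum, chosen purely for illustration):

    import numpy as np

    # columns are x_1, x_2 and x_1 + x_2, so the set is linearly dependent
    X = np.column_stack([np.array([3.0, 4.0, 2.0]),
                         np.array([3.0, -4.0, 0.4]),
                         np.array([6.0, 0.0, 2.4])])

    K = X.shape[1]
    print(np.linalg.matrix_rank(X) == K)   # False: the columns are not independent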

Fact 1.1.7. If X := {x1 , . . . , xK } is linearly independent, then

1. Every subset of X is linearly independent.

2. X does not contain 0.

3. X ∪ {x} is linearly independent for all x ∈ R^N such that x ∉ span(X).

The proof is a solved exercise (exercise 1.6.12 on page 39).


One reason for our interest in the concept of linear independence lies in the follow-
ing problem: We know when a point in R N can be expressed as a linear combination
of some fixed set of vectors X. This is true precisely when that point is in the span
of X. What we do not know is when that representation is unique. It turns out that
the relevant condition is independence:

Theorem 1.1.1. Let X := {x1 , . . . , xK } be any collection of vectors in R N . The following


statements are equivalent:

1. X is linearly independent.

2. For each y ∈ span(X) there exists exactly one set of scalars α_1, . . . , α_K such that

    y = \sum_{k=1}^{K} \alpha_k x_k
    \qquad (1.2)

Proof. Let X be linearly independent and pick any y ∈ span( X ). Since y is in the
span of X, we know that there exists at least one set of scalars such that (1.2) holds.
Suppose now that there are two. In particular, suppose that y = ∑_{k=1}^{K} α_k x_k =
∑_{k=1}^{K} β_k x_k. It follows from the second equality that ∑_{k=1}^{K} (α_k − β_k) x_k = 0. Using
fact 1.1.6, we conclude that α_k = β_k for all k. In other words, the representation is
unique.

On the other hand, suppose that for each y ∈ span(X) there exists exactly one set of
scalars α_1, . . . , α_K such that y = ∑_{k=1}^{K} α_k x_k. Since 0 ∈ span(X) must be true (why?),
this implies that there exists only one set of scalars such that 0 = ∑_{k=1}^{K} α_k x_k. Since
α_1 = · · · = α_K = 0 has this property, we conclude that no nonzero scalars yield
0 = ∑_{k=1}^{K} α_k x_k. In other words, X is linearly independent.

1.1.4 Bases and Dimension

In essence, the dimension of a linear subspace is the minimum number of vectors


needed to span it. For example, consider the plane

P := {( x1 , x2 , 0) ∈ R3 : x1 , x2 ∈ R} (1.3)

from example 1.1.4. Intuitively, this plane is a “two-dimensional” subset of R3 . This


intuition agrees with the definition above. Indeed, P cannot be spanned by one
vector, for if we take a single vector in R3 , then the span created by that singleton is
only a line in R3 , not a plane. On the other hand, P can be spanned by two vectors,
as we saw in example 1.1.4. Finally, while P can also be spanned by three or more
vectors, it turns out that one of the vectors will always be redundant—it does not
change the span. In other words, any collection of 3 or more vectors from P will be
linearly dependent.
Let’s discuss this more formally. We begin with the following deep result:

Theorem 1.1.2. If S is a linear subspace of R N and S is spanned by K vectors, then every


linearly independent subset of S has at most K vectors.

Put differently, if S is spanned by K vectors, then any subset of S with more than K
vectors will be linearly dependent. Many practical results lean on this theorem. The
proof is not particularly hard, but it is a little long. You can find it in most texts on
linear algebra (see, e.g., theorem 1.1 of [21]). Here’s an example of what we can do
with theorem 1.1.2.

Theorem 1.1.3. Let X := {x1 , . . . , x N } be any N vectors in R N . The following statements


are equivalent:

1. span( X ) = R N .

2. X is linearly independent.

Proof. Suppose first that X is linearly independent. Seeking a contradiction, suppose


in addition that there exists an x ∈ R^N with x ∉ span(X). By fact 1.1.7 it then
follows that the N + 1 element set X ∪ {x} ⊂ R^N is linearly independent. Since R^N
is spanned by the N canonical basis vectors, this statement stands in contradiction
to theorem 1.1.2.
Conversely, suppose that span( X ) = R N but X is not linearly independent. Then,
by fact 1.1.6, there exists a proper subset X0 of X with span( X0 ) = span( X ). Since X0
is a proper subset of X it contains K < N elements. We now have K vectors spanning
R N . In particular, the span of K vectors contains the N linearly independent vectors
e1 , . . . , e N . Once again, this statement stands in contradiction to theorem 1.1.2.

We now come to a key definition. If S is a linear subspace of R N and B is some finite


subset of R N , then B is called a basis of S if B spans S and is linearly independent.

Example 1.1.6. The pair {e1 , e2 } is a basis for the set P defined in (1.3).

Example 1.1.7. The set of canonical basis vectors {e1 , . . . , e N } ⊂ R N described in


example 1.1.3 is linearly independent and its span is equal to all of R N (see page 7).
As a result, {e1 , . . . , e N } is a basis for R N —as anticipated by the name.

As stated above, the dimension of a linear subspace is the minimum number of vec-
tors needed to span it. To formalize this idea, we will use the following fundamental
result about bases:

Theorem 1.1.4. If S is a linear subspace of R N distinct from {0}, then

1. S has at least one basis, and

2. every basis of S has the same number of elements.

The proof of part 1 can be found in any good text on linear algebra. See, for example,
theorem 1.1 of [21]. Part 2 follows from theorem 1.1.2 and is left as an exercise
(exercise 1.6.13).
If S is a linear subspace of R^N, then the common number identified in theorem 1.1.4 is
called the dimension of S, and written as dim(S). For example,

1. dim( P) = 2 for the plane in (1.3), because {e1 , e2 } ⊂ R3 is a basis.

2. dim(R N ) = N, because {e1 , . . . , e N } ⊂ R N is a basis.



In R N the singleton subspace {0} is said to have zero dimension. A line {αx ∈ R N :
α ∈ R} through the origin is obviously one dimensional.
If we take a set of K vectors, then how large will its span be in terms of dimension?
The next lemma answers this question.

Lemma 1.1.1. Let X := {x1 , . . . , xK } ⊂ R N . Then

1. dim(span( X )) ≤ K.

2. dim(span( X )) = K if and only if X is linearly independent.

Proof. Regarding part 1, let B be a basis of span( X ). By definition, B is a linearly in-


dependent subset of span( X ). Since span( X ) is spanned by K vectors, theorem 1.1.2
implies that B has no more than K elements. Hence, dim(span( X )) ≤ K.
Regarding part 2, suppose first that X is linearly independent. Then X is a basis for
span( X ). Since X has K elements, we conclude that dim(span( X )) = K. Conversely,
if dim(span( X )) = K then X must be linearly independent. For if X is not linearly
independent, then there exists a proper subset X0 of X such that span(X0) = span(X).
By part 1 of this lemma, we then have dim(span(X0)) ≤ #X0 ≤ K − 1. Therefore,
dim(span( X )) ≤ K − 1. Contradiction.

Part 2 of lemma 1.1.1 is important in what follows, and also rather intuitive. It says
that the span of a set will be large when its elements are algebraically diverse.
Let’s finish this section with facts that can be deduced from the preceding results.

Fact 1.1.8. The following statements are true:

1. Let S and T be K-dimensional linear subspaces of R N . If S ⊂ T, then S = T.

2. If S is an M-dimensional linear subspace of R^N where M < N, then S ≠ R^N.

The first part of fact 1.1.8 has many useful implications, one of which is that the only
N-dimensional linear subspace of R N is R N .

1.2 Linear Maps

[Roadmap]

1.2.1 Linearity

There are many different classes of functions and perhaps the most important of
these is the class of linear functions. In high school one learns about linear func-
tions as those whose graph is a straight line. We need a more formal (and accurate)
definition: A function T : RK → R N is called linear if

T (αx + βy) = αTx + βTy for any x, y ∈ RK and α, β ∈ R

Remark: Mathematicians usually write linear functions with upper case letters and
omit the parentheses around the argument where no confusion arises. It is a custom
of their tribe.
Example 1.2.1. The function T : R → R defined by Tx = 2x is linear because for
any α, β, x, y in R, we have

T (αx + βy) = 2(αx + βy) = α2x + β2y = αTx + βTy

Example 1.2.2. Given constants c_1, . . . , c_K, the function T : R^K → R defined by

    Tx = T(x_1, \ldots, x_K) = \sum_{k=1}^{K} c_k x_k

is linear, because if we take any α, β in R and x, y in RK , then


    T(\alpha x + \beta y) = \sum_{k=1}^{K} c_k [\alpha x_k + \beta y_k]
    = \alpha \sum_{k=1}^{K} c_k x_k + \beta \sum_{k=1}^{K} c_k y_k = \alpha Tx + \beta Ty

Example 1.2.3. The function f : R → R defined by f ( x ) = x2 is nonlinear, because


if we take α = β = x = y = 1, then

f (αx + βy) = f (2) = 4 while α f ( x ) + β f (y) = 1 + 1 = 2

Thinking of linear functions as those whose graph is a straight line is incorrect. For
example, the function f : R → R defined by f ( x ) = 1 + 2x is nonlinear, because if
we take α = β = x = y = 1, then f (αx + βy) = f (2) = 5, while α f ( x ) + β f (y) =
3 + 3 = 6. This kind of function is actually called an affine function.
The range of a linear map is the span of the images of the canonical basis vectors:
Lemma 1.2.1. If T : RK → R N is a linear map, then

rng( T ) = span(V ) where V := { Te1 , . . . , TeK }



Proof. Any x ∈ R^K can be expressed in terms of the basis vectors as ∑_{k=1}^{K} α_k e_k, for
some suitable choice of scalars. Hence rng(T) is the set of all points of the form

    Tx = T\left[ \sum_{k=1}^{K} \alpha_k e_k \right] = \sum_{k=1}^{K} \alpha_k T e_k

as we vary α1 , . . . , αK over all scalar combinations. This coincides with the definition
of span(V ).

For a linear map T : R^K → R^N, the null space or kernel of T is defined as

ker( T ) := {x ∈ RK : Tx = 0}

Fact 1.2.1. If T : RK → R N is linear, then

1. ker( T ) is a linear subspace of RK .

2. ker( T ) = {0} if and only if T is one-to-one.

The proofs are straightforward. For example, suppose that Tx = Ty for arbi-
trary x, y ∈ RK . Then x − y ∈ ker( T ). If ker( T ) = {0} then it must be that x = y.
Since x and y were arbitrary we conclude that T is one-to-one.

1.2.2 Linear Maps from R^N to R^N


Linear functions from R N to itself (linear self-mappings) have an extremely neat and
useful property: They are onto if and only if they are one-to-one. In other words,
for linear self-mappings (maps from a space back into itself), the properties of being
one-to-one, onto and a bijection are all entirely equivalent. The next theorem gives
details.

Theorem 1.2.1. If T is a linear function from R N to R N then all of the following are
equivalent:

1. T is a bijection.

2. T is onto.

3. T is one-to-one.

4. ker( T ) = {0}.

5. The set of vectors V := { Te1 , . . . , Te N } is linearly independent.

If any one of these equivalent conditions is true, then T is called nonsingular. (In
other words, a function from R N to itself is called a nonsingular function if it is a
linear bijection.) Otherwise T is called singular. The proof of theorem 1.2.1 is not
overly difficult given all the work we’ve already done. It is left as a (solved) exercise
(exercise 1.6.15).
If T is nonsingular, then, being a bijection, it must have an inverse function T −1
that is also a bijection (fact 4.2.1 on page 149). It turns out that this inverse function
inherits the linearity of T. In summary,
Fact 1.2.2. If T : R N → R N is nonsingular then so is T −1 .

1.2.3 Linear Maps Across Dimensions

Theorem 1.2.1 only applies to linear maps between spaces of the same dimension.
Here are fundamental results for the other two cases (smaller to bigger and bigger
to smaller).
Theorem 1.2.2. For a linear map T from R^K to R^N, the following statements are true:

1. If K < N then T is not onto.

2. If K > N then T is not one-to-one.

The most important implication: If N ≠ K, then we can forget about bijections.

Proof. Regarding part 1, let K < N and let T : RK → R N be linear. T cannot be onto,
for if T were onto then we would have rng( T ) = R N , in which case the vectors in
V = { Te1 , . . . , TeK } in lemma 1.2.1 would span R N , despite having only K < N
elements. This is impossible. (Why?)
The proof of part 2 is left as an exercise (exercise 1.6.16).

1.3 Matrices and Linear Equations

[Roadmap]

1.3.1 Basic Definitions

An N × K matrix is a rectangular array A of real numbers with N rows and K columns,


written in the following way:

    A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1K} \\
    a_{21} & a_{22} & \cdots & a_{2K} \\
    \vdots & \vdots & & \vdots \\
    a_{N1} & a_{N2} & \cdots & a_{NK} \end{pmatrix}
Often, the values a_{nk} in the matrix represent coefficients in a system of linear equa-
tions, such as

    \begin{aligned}
    a_{11} x_1 + a_{12} x_2 + \cdots + a_{1K} x_K &= b_1 \\
    a_{21} x_1 + a_{22} x_2 + \cdots + a_{2K} x_K &= b_2 \\
    &\;\;\vdots \\
    a_{N1} x_1 + a_{N2} x_2 + \cdots + a_{NK} x_K &= b_N
    \end{aligned}
    \qquad (1.4)

We’ll explore this relationship further after some more definitions.


In matrix A, the symbol ank stands for the element in the n-th row of the k-th column.
For obvious reasons, the matrix A is also called a vector if either N = 1 or K = 1. In
the former case, A is called a row vector, while in the latter case it is called a column
vector. If A is N × K and N = K, then A is called square. When convenient, we
will use the notation row_n(A) to refer to the n-th row of A, and col_k(A) to refer to
its k-th column.
The square matrix whose n, k-th element is 1 if n = k and zero otherwise is called
the identity matrix, and is denoted by I:

    I := \begin{pmatrix} 1 & 0 & \cdots & 0 \\
    0 & 1 & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & 1 \end{pmatrix}
Just as was the case for vectors, a number of algebraic operations are defined for
matrices. The first two, scalar multiplication and addition, are immediate general-
izations of the vector case: For γ ∈ R, we let

    \gamma \begin{pmatrix} a_{11} & \cdots & a_{1K} \\ \vdots & & \vdots \\ a_{N1} & \cdots & a_{NK} \end{pmatrix}
    := \begin{pmatrix} \gamma a_{11} & \cdots & \gamma a_{1K} \\ \vdots & & \vdots \\ \gamma a_{N1} & \cdots & \gamma a_{NK} \end{pmatrix}

while

    \begin{pmatrix} a_{11} & \cdots & a_{1K} \\ \vdots & & \vdots \\ a_{N1} & \cdots & a_{NK} \end{pmatrix}
    + \begin{pmatrix} b_{11} & \cdots & b_{1K} \\ \vdots & & \vdots \\ b_{N1} & \cdots & b_{NK} \end{pmatrix}
    := \begin{pmatrix} a_{11} + b_{11} & \cdots & a_{1K} + b_{1K} \\ \vdots & & \vdots \\ a_{N1} + b_{N1} & \cdots & a_{NK} + b_{NK} \end{pmatrix}

In the latter case, the matrices have to have the same number of rows and columns
in order for the definition to make sense.
Now let’s look at multiplication of matrices. If A and B are two matrices, then their
product C := AB is formed by taking as its i, j-th element the inner product of the
i-th row of A and the j-th column of B. That is,

    c_{ij} = \mathrm{row}_i(A)' \, \mathrm{col}_j(B) = \sum_{k=1}^{K} a_{ik} b_{kj}

Here’s the picture for i = j = 1:

    \begin{pmatrix} a_{11} & \cdots & a_{1K} \\ a_{21} & \cdots & a_{2K} \\ \vdots & & \vdots \\ a_{N1} & \cdots & a_{NK} \end{pmatrix}
    \begin{pmatrix} b_{11} & \cdots & b_{1J} \\ b_{21} & \cdots & b_{2J} \\ \vdots & & \vdots \\ b_{K1} & \cdots & b_{KJ} \end{pmatrix}
    = \begin{pmatrix} c_{11} & \cdots & c_{1J} \\ c_{21} & \cdots & c_{2J} \\ \vdots & & \vdots \\ c_{N1} & \cdots & c_{NJ} \end{pmatrix}

Since inner products are only defined for vectors of equal length, this requires that
the length of the rows of A is equal to the length of the columns of B. In other words,
if A is N × K and B is J × M, then we require K = J. The resulting matrix AB is
N × M. Here’s the rule to remember:

product of N × K and K × M is N × M

From the definition, it is clear that multiplication is not commutative, in that AB


and BA are not generally the same thing. Indeed BA is not well-defined unless
N = M also holds. Even in this case, the two are not generally equal. Other than
that, multiplication behaves pretty much as we’d expect:

Fact 1.3.1. For conformable matrices A, B, C and scalar α we have

1. A(BC) = (AB)C

2. A(B + C) = AB + AC

3. (A + B)C = AC + BC

4. AαB = αAB

Here, we are using the word “conformable” to indicate dimensions are such that
the operation in question makes sense. For example, we’ll say “for two conformable
matrices A and B, the product AB satisfies xyz” if the dimensions of A and B are
such that the product is well defined; and similarly for addition, etc.
You can verify that, assuming conformability, we always have

AI = IA = A

For this reason the identity matrix is sometimes called the “multiplicative unit”.
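These rules are easy to spot-check numerically. A small NumPy sketch (matrices chosen arbitrarily) illustrating the shape rule, the failure of commutativity, and the role of the identity:

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])          # 2 x 3
    B = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [2.0, -1.0]])              # 3 x 2

    print((A @ B).shape)                     # (2, 2): product of 2x3 and 3x2 is 2x2
    print((B @ A).shape)                     # (3, 3): AB and BA need not even have the same shape
    print(np.allclose(A @ np.eye(3), A))     # AI = A
    print(np.allclose(np.eye(2) @ A, A))     # IA = A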

1.3.2 Matrices as Maps

The single most useful way to think of a matrix is as a mapping over vector space.
In particular, an N × K matrix A can be thought of as a map sending a vector x ∈ RK
into a new vector y = Ax in R N . Among the collection of all functions from RK to
R N , these functions defined by matrices have a special property: they are all linear.
To see that this is so, take a fixed N × K matrix A and consider the function T : RK →
R N defined by Tx = Ax. To see that T is a linear function, pick any x, y in RK , and
any scalars α and β. The rules of matrix arithmetic (see fact 1.3.1) tell us that

T (αx + βy) := A(αx + βy) = Aαx + Aβy = αAx + βAy =: αTx + βTy

In other words, T is linear.


So matrices make linear functions. How about some examples of linear functions
that don’t involve matrices? Actually we can’t give any such examples because there
are none:

Theorem 1.3.1. Let T be a function from RK to R N . The following are equivalent:

1. T is linear.

2. There exists an N × K matrix A such that Tx = Ax for all x ∈ RK .

In other words, the set of linear functions from RK to R N and the set of N × K
matrices are in one-to-one correspondence.

Proof. We’ve already proved one implication. Regarding the second, let T : RK →
R N be linear. We aim to construct a matrix A such that Tx = Ax for all x ∈ RK . As
usual, let ek be the k-th canonical basis vector in RK . Define an N × K matrix A by
col_k(A) = Te_k. Pick any x = (x_1, . . . , x_K) ∈ R^K. We can also write x = ∑_{k=1}^{K} x_k e_k.
By linearity we have

    Tx = \sum_{k=1}^{K} x_k T e_k = \sum_{k=1}^{K} x_k \, \mathrm{col}_k(A)

This is just Ax, and we are done.
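The construction in the proof is also a practical recipe: apply T to each canonical basis vector and stack the images as columns. A short NumPy sketch, using an arbitrary linear map purely for illustration:

    import numpy as np

    def T(x):
        # an arbitrary linear map from R^3 to R^2, for illustration only
        return np.array([2 * x[0] + x[2], x[1] - x[2]])

    K = 3
    A = np.column_stack([T(e) for e in np.eye(K)])   # col_k(A) = T e_k

    x = np.array([1.0, -2.0, 0.5])
    print(np.allclose(A @ x, T(x)))                  # True: Tx = Ax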

Let A be an N × K matrix and consider the corresponding linear map T defined by


Tx = Ax. For the range of this map we typically write span(A) instead of rng(T). That
is,
span(A) := {Ax : x ∈ RK }
The reason for this notation is that the range of T, or the set of all Ax where x ∈ RK ,
is precisely the span of A’s columns:

span(A) = span{col1 (A), . . . , colK (A)} (1.5)

(This is an important point. Please work through it using the definition of matrix
multiplication if you are not fully convinced that it’s true.) For obvious reasons, this
set is also called the column space of A. It is a linear subspace of R N . (Why?)
How large is the column space of a given matrix? To answer that question we have
to say what “large” means. In the context of linear algebra, size of subspaces is
usually measured by dimension. The dimension of span(A) is known as the rank
of A. That is,
rank(A) := dim(span(A))
A is said to have full column rank if rank(A) is equal to K, the number of its
columns. The reason we say “full” rank here is that, by definition, span(A) is
the span of K vectors. Hence, by part 1 of lemma 1.1.1 on page 14, we must have
dim(span(A)) ≤ K. In other words, the rank of A is less than or equal to K. A is
said to have full column rank when this maximum is achieved.
When is this maximum achieved? By part 2 of lemma 1.1.1, this will be the case
precisely when the columns of A are linearly independent. Let’s state this as a fact:

Fact 1.3.2. Let A be an N × K matrix. The following statements are equivalent:

1. A is of full column rank.



2. The columns of A are linearly independent.

3. If Ax = 0, then x = 0.

The last equivalence follows from fact 1.1.6 on page 10.

1.3.3 Square Matrices and Invertibility

The underlying problem driving much of this discussion is the need to solve systems
of equations of the form (1.4), which can be written more conveniently in matrix
notation as Ax = b. Ideally we want general conditions on A such that a solution
to this equation will always exist, plus a way to compute the solution.
There are a variety of scenarios depending on the properties of A. To narrow things
down, let’s concentrate for now on perhaps the most important case, where A is a
square N × N matrix. Let’s also aim for the best case: A set of conditions on A such
that, for every single b ∈ R N , there exists exactly one x ∈ R N such that Ax = b.
The best way to understand this problem is to recall the discussion in §1.3.2, where
we started to think about matrices as maps. Letting T be the linear map Tx = Ax,
the question we are asking here is when does each point in R N have one and only
one preimage under T. In other words, when is T a bijection?
Recall from §1.2.2 that linear bijections have a special name—they are called non-
singular functions. Moreover, theorem 1.2.1 on page 16 gives us all sorts of nice
equivalences for this property. For example, it’s enough to know that Tx = Ax is
one-to-one, or onto, or that the image of the set of canonical basis vectors is linearly
independent. The next theorem simply replicates these equivalences in the language
of matrices.

Theorem 1.3.2. For N × N matrix A, the following are equivalent:

1. The columns of A are linearly independent.

2. rank(A) = N.

3. span(A) = R N .

4. If Ax = Ay, then x = y.

5. If Ax = 0, then x = 0.

6. For each b ∈ R N , the equation Ax = b has a solution.

7. For each b ∈ R N , the equation Ax = b has a unique solution.

Obtaining a proof of theorem 1.3.2 is just a matter of going back to theorem 1.2.1
on page 16 and checking the implications for Tx = Ax. For example, with a bit
of algebra you should be able to convince yourself that linear independence of
{ Te1 , . . . , Te N } is equivalent to linear independence of the columns of A.
Following common usage, if any of the equivalent conditions in theorem 1.3.2 are
true we will call not just the map T but also the underlying matrix A nonsingu-
lar. If any one and hence all of these conditions fail, then A is called singular. A
nonsingular matrix is sometimes also referred to as invertible.

Theorem 1.3.3. For nonsingular A the following statements are true:

1. There exists a square matrix B such that AB = BA = I, where I is the identity
matrix. The matrix B is called the inverse of A, and written as A^{-1}.

2. For each b ∈ R N , the unique solution to Ax = b is given by

    x = A^{-1} b
    \qquad (1.6)

Proof. Let T be the linear map associated with A via Tx = Ax. Since A is nonsin-
gular, T is also, by definition, nonsingular, and hence, by fact 1.2.2 on page 17, has
a nonsingular inverse T −1 . Being nonsingular, T −1 is necessarily linear, and hence,
by theorem 1.3.1 on page 20, there exists a matrix B such that T −1 x = Bx for all x.
By the definition of the inverse we have

ABx = T ( T −1 (x)) = x = Ix

Since this holds for any x we have AB = I (see exercise 1.6.27). A similar argument
shows that BA = I.
Regarding the second claim, A−1 b is a solution to Ax = b, since AA−1 b = Ib = b.
Uniqueness follows from theorem 1.3.2 (and in particular the fact that nonsingular-
ity of A includes the implication that the map x 7→ Ax is one-to-one).
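Numerically, one rarely forms A^{-1} and multiplies; library routines solve Ax = b directly. A minimal NumPy sketch with an arbitrary nonsingular matrix:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])                    # nonsingular: det = 5
    b = np.array([1.0, 2.0])

    x = np.linalg.solve(A, b)                     # solves Ax = b
    print(np.allclose(A @ x, b))                  # True
    print(np.allclose(x, np.linalg.inv(A) @ b))   # agrees with x = A^{-1} b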

The next handy fact tells us that to prove that B is an inverse of A, it suffices to
show either that B is a “left inverse” or that B is a “right inverse”. A short proof is
given in §1.3.4.

Fact 1.3.3. Let A be an N × N matrix. If there exists an N × N matrix B such that


either AB = I or BA = I, then A is nonsingular and A−1 = B.

The next fact collects more useful results about inverse matrices.

Fact 1.3.4. If A and B are nonsingular and α ≠ 0, then

1. (A−1 )−1 = A,

2. (αA)−1 = α−1 A−1 , and

3. (AB)−1 = B−1 A−1 .

The relation (AB)−1 = B−1 A−1 is just a special case of the analogous rule for inver-
sion of bijections—see fact 4.2.1 on page 149.

1.3.4 Determinants

To each square matrix A, we can associate a unique number det(A) called the de-
terminant of A. The determinant is a bit fiddly to describe, but it turns out to give a
neat one-number summary of certain useful properties.
To begin, let S( N ) be the set of all bijections from {1, . . . , N } to itself, also called the
set of permutations on {1, . . . , N }. For π ∈ S( N ) we define the signature of π as

    \mathrm{sgn}(\pi) := \prod_{m < n} \frac{\pi(m) - \pi(n)}{m - n}

The determinant of an N × N matrix A is then given as

    \det(A) := \sum_{\pi \in S(N)} \mathrm{sgn}(\pi) \prod_{n=1}^{N} a_{\pi(n)\, n}

In the 2 × 2 case it can be shown that this definition reduces to

    \det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc
    \qquad (1.7)

Solving for determinants of most larger matrices is a fiddly task, best left for com-
puters. However, the determinant has many neat properties that can be used for
proving various results.

Fact 1.3.5. If I is the N × N identity, A and B are N × N matrices and α ∈ R, then

1. det(I) = 1

2. A is nonsingular if and only if det(A) ≠ 0

3. det(AB) = det(A) det(B)

4. det(αA) = α^N det(A)

5. det(A^{-1}) = (det(A))^{-1}

As an example of how these facts can be useful, let’s establish fact 1.3.3. Fix square
matrix A and suppose that a right inverse B exists, in the sense that AB = I. This
equality immediately implies that both A and B are nonsingular. Indeed, if we apply
det to both sides of AB = I and use the rules in fact 1.3.5 we get det(A) det(B) = 1.
It follows that both det(A) and det(B) are nonzero, and hence both matrices are
nonsingular.
The rest is just a mopping up operation. To show that B is the inverse of A, we just
need to check that, in addition to AB = I we also have BA = I. To obtain the latter
equality from the former, premultiply the former by B to get BAB = B, and then
postmultiply by B−1 to get BA = I. The proof for the left inverse case is similar.
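The rules in fact 1.3.5 are also easy to spot-check numerically. A short NumPy sketch (random matrices, used purely for illustration, are nonsingular with probability one):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    B = rng.standard_normal((3, 3))

    print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))
    print(np.isclose(np.linalg.det(2 * A), 2**3 * np.linalg.det(A)))
    print(np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A)))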

1.3.5 Solving Equations with Tall Matrices

In §1.3.3 we talked about solving equations of the form Ax = b when A is square.


Now let’s turn to the case where A is N × K and K < N. Such a system of equations
is said to be overdetermined. This corresponds to the situation where the number
of equations (equal to N) is larger than the number of unknowns (the K elements of
x). Intuitively, in such a situation, we may not be able to find an x that satisfies all N
equations.
To repeat, we seek a solution x ∈ RK to Ax = b with K < N and b ∈ R N given.
As discussed in §1.3.2, A can be viewed as a mapping from RK to R N . A solution
to Ax = b exists precisely when b lies in the range of this map. As discussed on
page 21, its range is the linear subspace of R N spanned by the columns of A:

span(A) := {all vectors Ax with x ∈ RK } =: column space of A



Given our assumption that K < N, the scenario b ∈ span(A) is very rare. The
reason is that the dimension of span(A), which is precisely the rank of A, is less than
or equal to K (see §1.3.2) and hence strictly less than N. We know from fact 1.1.8 on
page 14 that such a space is not equal to R N , where b lives. In fact we can say more:
All K-dimensional subspaces of R N are “negligible,” and hence the “chance” of b
happening to lie in this subspace is likewise small. For example, consider the case
where N = 3 and K = 2. Then the column space of A forms at most a 2 dimensional
plane in R3 . Intuitively, this set has no volume because planes have no “thickness,”
and hence the chance of a randomly chosen b lying in this plane is essentially zero.²
As a result, the standard approach is to admit that an exact solution may not exist,
and instead focus on finding an x ∈ R^K such that Ax is as close to b as possible. This
problem is taken up in §3.3.2, after we have developed tools sufficient to tackle it.
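As a rough preview of §3.3.2, numerical libraries already provide this “closest Ax to b” computation; the sketch below uses NumPy’s least-squares routine on an arbitrary overdetermined system:

    import numpy as np

    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])            # N = 3 equations, K = 2 unknowns
    b = np.array([0.0, 1.0, 3.0])

    x, residuals, rank, svals = np.linalg.lstsq(A, b, rcond=None)
    print(x)                              # the x in R^2 making Ax as close to b as possible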

1.3.6 Solving Equations with Wide Matrices

Now let’s turn to the case where A is N × K and K > N. In this setting the system of
equations Ax = b is said to be underdetermined, meaning that the number of equa-
tions (equal to N) is strictly smaller than the number of unknowns (the K elements
of x). Intuitively, in such a situation, we may not have enough information to pin
down a unique solution x. Indeed, by theorem 1.2.2 on page 17, the map x 7→ Ax
cannot be one-to-one in this setting. In fact the following is true.

Fact 1.3.6. If A is N × K, the equation Ax = b has a solution and K > N, then the
same equation has an infinity of solutions.

Unlike overdetermined systems, working with underdetermined systems is rela-


tively rare in econometrics and economics more generally. It usually means that
you have insufficient theory to pin down the endogenous variables in your model—
you need to find another equation (an arbitrage condition, a feasibility constraint,
etc.) before the model can be solved. In other words, the problem is not how to
find a clever method to solve underdetermined systems. Rather, it is to better un-
derstand the underlying economic problem, and write down a system that is fully
determined.
² More formally, if K < N then any K-dimensional subspace of R^N has Lebesgue measure zero in
R^N. This result is available in most texts on measure theory.

1.4 Other Matrix Operations

[Roadmap]

1.4.1 Types of Matrices

For a square N × N matrix A, the N elements of the form a_{nn} for n = 1, . . . , N are
called the principal diagonal:

    \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\
    a_{21} & a_{22} & \cdots & a_{2N} \\
    \vdots & \vdots & & \vdots \\
    a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix}

A square matrix A is called diagonal if all entries off the principal diagonal are zero.
For example, the identity matrix is diagonal. The following notation is often used to
define diagonal matrices:

    \mathrm{diag}(d_1, \ldots, d_N) := \begin{pmatrix} d_1 & 0 & \cdots & 0 \\
    0 & d_2 & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & d_N \end{pmatrix}

The k-th power of a square matrix A is written A^k and indicates AA · · · A with k terms.
If it exists, the square root of A is written A^{1/2} and defined as the matrix such that
A^{1/2} A^{1/2} is A.
With diagonal matrices we can compute powers, roots, inverses and products very
easily:

Fact 1.4.1. Let C = diag(c_1, . . . , c_N) and D = diag(d_1, . . . , d_N). The following state-
ments are true:

1. CD = diag(c_1 d_1, . . . , c_N d_N).

2. D^k = diag(d_1^k, . . . , d_N^k) for any k ∈ N.

3. If d_n ≥ 0 for all n, then D^{1/2} exists and equals diag(√d_1, . . . , √d_N).

4. If d_n ≠ 0 for all n, then D is nonsingular and D^{-1} = diag(d_1^{-1}, . . . , d_N^{-1}).

You can check part 1 from the definition of matrix multiplication. The other parts
follow directly.
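A quick numerical illustration of these rules for one particular diagonal matrix (values arbitrary):

    import numpy as np

    D = np.diag([4.0, 9.0, 16.0])

    print(np.allclose(D @ D, np.diag([16.0, 81.0, 256.0])))           # D^2
    print(np.allclose(np.linalg.inv(D), np.diag([1/4, 1/9, 1/16])))   # D^{-1}
    root = np.diag(np.sqrt([4.0, 9.0, 16.0]))
    print(np.allclose(root @ root, D))                                # D^{1/2}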
A square matrix is called lower triangular if every element strictly above the princi-
pal diagonal is zero. It is called upper triangular if every element strictly below the
principal diagonal is zero. For example, in

    L := \begin{pmatrix} 1 & 0 & 0 \\ 2 & 5 & 0 \\ 3 & 6 & 1 \end{pmatrix}
    \quad \text{and} \quad
    U := \begin{pmatrix} 1 & 2 & 3 \\ 0 & 5 & 6 \\ 0 & 0 & 1 \end{pmatrix}

the matrices L and U are lower and upper triangular respectively. The great advan-
tage of triangular matrices is that the associated linear equations are trivial to solve
using either forward or backward substitution. For example, with the system

    \begin{pmatrix} 1 & 0 & 0 \\ 2 & 5 & 0 \\ 3 & 6 & 1 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
    = \begin{pmatrix} x_1 \\ 2x_1 + 5x_2 \\ 3x_1 + 6x_2 + x_3 \end{pmatrix}
    = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}

the top equation involves only x1 , so we can solve for its value directly. Plugging
that value into the second equation, we can solve out for x2 and so on.
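Here is a minimal forward-substitution routine for lower-triangular systems, written out explicitly to make the idea concrete (the helper name and the particular numbers are just for this illustration; production code would call a library solver):

    import numpy as np

    def forward_substitution(L, b):
        """Solve Lx = b for lower-triangular L with nonzero diagonal."""
        N = len(b)
        x = np.zeros(N)
        for i in range(N):
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
        return x

    L = np.array([[1.0, 0.0, 0.0],
                  [2.0, 5.0, 0.0],
                  [3.0, 6.0, 1.0]])
    b = np.array([1.0, 7.0, 10.0])
    x = forward_substitution(L, b)
    print(np.allclose(L @ x, b))    # True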

Fact 1.4.2. If A = (a_{mn}) is triangular, then det(A) = ∏_{n=1}^{N} a_{nn}.

1.4.2 Transpose and Trace

The transpose of an N × K matrix A is the K × N matrix A' such that col_n(A') = row_n(A).
For example, given

    A := \begin{pmatrix} 10 & 40 \\ 20 & 50 \\ 30 & 60 \end{pmatrix}
    \qquad
    B := \begin{pmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{pmatrix}
    \qquad (1.8)

the transposes are

    A' = \begin{pmatrix} 10 & 20 & 30 \\ 40 & 50 & 60 \end{pmatrix}
    \qquad
    B' = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}

Fact 1.4.3. For conformable matrices A and B, transposition satisfies



1. (A')' = A.

2. (AB)' = B'A'.

3. (A + B)' = A' + B'.

4. (cA)' = cA' for any constant c.

Fact 1.4.4. For each square matrix A, we have

1. det(A') = det(A), and

2. if A is nonsingular then so is A', and (A')^{-1} = (A^{-1})'.

A square matrix A is called symmetric if A' = A. This is equivalent to the statement
that a_{nk} = a_{kn} for every k and n. Note that A'A and AA' are always well-defined
and symmetric.
The trace of a square matrix is the sum of the elements on its principal diagonal:

    \mathrm{trace} \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\
    a_{21} & a_{22} & \cdots & a_{2N} \\
    \vdots & \vdots & & \vdots \\
    a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix} = \sum_{n=1}^{N} a_{nn}

Fact 1.4.5. Transposition does not alter trace: trace(A) = trace(A').

Fact 1.4.6. If A and B are N × N matrices and α and β are two scalars, then

trace(αA + βB) = α trace(A) + β trace(B)

Moreover, if A is N × M and B is M × N, then trace(AB) = trace(BA).

The rank of a matrix can be difficult to determine. One case where it is easy is where
the matrix is idempotent. A square matrix A is called idempotent if AA = A.

Fact 1.4.7. If A is idempotent, then rank(A) = trace(A).
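For example, the projection matrices studied in §3.3.1 are idempotent. The following NumPy sketch builds one such matrix (the particular X is an arbitrary illustrative choice) and checks fact 1.4.7:

    import numpy as np

    X = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
    P = X @ np.linalg.inv(X.T @ X) @ X.T          # idempotent: P @ P = P

    print(np.allclose(P @ P, P))                  # True
    print(np.linalg.matrix_rank(P), np.trace(P))  # 2 and (approximately) 2.0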



1.4.3 Eigenvalues and Eigenvectors

Let A be N × N. As in § 1.3.2, think of A as a linear map, so that Ax is the image of


x under A. In general A will map x to some arbitrary new location but sometimes x
will only be scaled. That is,
Ax = λx (1.9)
for some scalar λ. If x and λ satisfy (1.9) and x is nonzero, then x is called an eigen-
vector of A, λ is called an eigenvalue and (x, λ) is called an eigenpair. For example,
if I is the N × N identity then (x, 1) is an eigenpair of I for every nonzero x ∈ R^N.
Evidently any nonzero scalar multiple of an eigenvector of A is also an eigenvector of A.
We can give many other examples of matrices with eigenpairs. But what about the
matrix

    R = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}
This matrix induces counter-clockwise rotation on any point by 90◦ . For any real
number λ and x ∈ R2 the scaling in (1.9) clearly fails. However, if we admit the
possibility that λ and the elements of x can be complex we find that (1.9) can hold.
Thus, for this rotation matrix, you will be able to confirm that λ = i and x = (1, −i)'
is an eigenpair for R. This example shows that contemplation of complex eigenpairs
is useful. Hence an eigenpair is always taken to be complex valued unless explicitly
stated to be real.

Fact 1.4.8. If A is N × N and I is the N × N identity, then λ is an eigenvalue of A if


and only if
det(A − λI) = 0

For the 2 × 2 matrix in (1.7), one can use the rule for the 2 × 2 determinant, fact 1.4.8 and a little algebra to show that its eigenvalues are given by the two roots of the polynomial equation

$$\lambda^2 - (a + d)\lambda + (ad - bc) = 0$$

More generally, given any N × N matrix A, it can be shown via the fundamental
theorem of algebra that there exist complex numbers λ1 , . . . , λ N , not necessarily dis-
tinct, such that
$$\det(A - \lambda I) = \prod_{n=1}^{N} (\lambda_n - \lambda) \tag{1.10}$$

It is clear that each λn satisfies det(A − λn I) = 0 and hence is an eigenvalue of A.


In particular, λ1 , . . . , λ N is the set of eigenvalues of A, although we stress again that
not all are necessarily distinct.

Fact 1.4.9. Let A be N × N and let λ1 , . . . , λ N be the eigenvalues defined in (1.10).


The following statements are true:

1. $\det(A) = \prod_{n=1}^{N} \lambda_n$.

2. $\mathrm{trace}(A) = \sum_{n=1}^{N} \lambda_n$.

3. If A is symmetric, then λn ∈ R for all n.

4. If A is nonsingular, then the eigenvalues of A−1 are 1/λ1 , . . . , 1/λ N .

5. If A = diag(d1 , . . . , d N ), then λn = dn for all n.

It follows immediately from item 1 of fact 1.4.9 that A is nonsingular if and only if
all its eigenvalues are nonzero.
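A small numerical illustration of fact 1.4.9, assuming NumPy is available (the symmetric matrix below is an arbitrary example):

    import numpy as np

    A = np.array([[2., 1., 0.],
                  [1., 3., 1.],
                  [0., 1., 4.]])
    lams = np.linalg.eigvals(A)

    assert np.isclose(np.prod(lams), np.linalg.det(A))   # det(A) = product of eigenvalues
    assert np.isclose(np.sum(lams), np.trace(A))         # trace(A) = sum of eigenvalues
    assert np.allclose(np.asarray(lams).imag, 0.0)       # A symmetric => eigenvalues real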

1.4.4 Similar Matrices

An important concept in the field of dynamic systems is conjugacy. For example, let
f : A → A where A is any set. We are often interested in the evolution of sequences
defined recursively by such maps:

$$a_{t+1} = f(a_t), \qquad a_0 = \text{a given point in } A$$

Evidently this sequence also satisfies $a_t = f^t(a_0)$, where $f^t$ is the t-th composition of f with itself.
Even when f is known, the properties of $f^t$ are not always easy to discern. In this situation it can help to have another map $g \colon B \to B$ that is conjugate to f. That is, there exists some bijection $\tau \colon B \to A$ such that
$$f = \tau \circ g \circ \tau^{-1}$$
This means that $f(a) = \tau(g(\tau^{-1}(a)))$ for all $a \in A$. Observe that, under this conjugacy,
$$f^2 = f \circ f = \tau \circ g \circ \tau^{-1} \circ \tau \circ g \circ \tau^{-1} = \tau \circ g^2 \circ \tau^{-1}$$

Figure 1.7: A is similar to B

In other words, $f^2$ is likewise conjugate to $g^2$. An inductive argument shows that the same is true for all t, so $f^t = \tau \circ g^t \circ \tau^{-1}$. This is handy if computing $g^t$ is in some sense easier than computing $f^t$. We'll see an example of this in just a moment.
In the case of linear maps—that is, matrices—it is natural to study conjugacy in a
setting where the bijection is also required to be linear. In most texts this is called
similarity. In particular, matrix A is said to be similar to a matrix B if there exists an
invertible matrix P such that A = PBP−1 . Figure 1.7 shows the conjugate relation-
ship of the two matrices when thought of as maps.
As a special case of the reasoning for f and g given above, we get
Fact 1.4.10. If A is similar to B, then $A^t$ is similar to $B^t$ for all $t \in \mathbb{N}$.

As discussed above, similarity of A to a given matrix B is most useful when B is


somehow simpler than A, or more amenable to a given operation. About the sim-
plest kind of matrices we work with are diagonal matrices. Hence similarity to a
diagonal matrix is particularly desirable. If A is similar to a diagonal matrix, then A
is called diagonalizable.
One instance where this scenario is useful is in studying $A^t$ for some given $t \in \mathbb{N}$. If A is diagonalizable with $A = PDP^{-1}$ for some $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$, then, in view of fact 1.4.10 and fact 1.4.1 on page 27, we have
$$A^t = P \, \mathrm{diag}(\lambda_1^t, \ldots, \lambda_N^t) \, P^{-1}$$

In this expression we’ve used the symbol λn for the scalars, which is reminiscent of
our notation for eigenvalues. There is a reason for this:

Theorem 1.4.1. Let A be an N × N matrix. The following statements are true:

1. If A = PDP−1 for some D = diag(λ1 , . . . , λ N ), then (coln (P), λn ) is an eigenpair


of A for each n.

2. Conversely, if A has N linearly independent eigenvectors, then A can be diagonalized via $A = PDP^{-1}$, where $(\mathrm{col}_n(P), \lambda_n)$ is an eigenpair of A for each n.

Proof. If A = PDP−1 then AP = PD. Equating the n-th column on each side gives
Apn = λn pn , where pn := coln (P). Thus, to show that (pn , λn ) is an eigenpair, we
need only check that pn is not the zero vector. In fact this is immediate, because if it
was the zero vector then P would not be invertible. (Why?)
Conversely, if A has N linearly independent eigenvectors p1 , . . . , p N , then by form-
ing P via coln (P) = pn and taking D = diag(λ1 , . . . , λ N ) where λn is the eigenvalue
associated with pn , we can stack the individual vector equations Apn = λn pn into
the matrix form AP = PD. Using the invertibility implied by linear independence,
we then have A = PDP−1 . In particular, A is diagonalizable.
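The following sketch illustrates theorem 1.4.1 and the formula for $A^t$ above, assuming NumPy is available (the 2 × 2 matrix is an arbitrary diagonalizable example):

    import numpy as np

    A = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
    lams, P = np.linalg.eig(A)          # columns of P are eigenvectors
    D = np.diag(lams)

    # A = P D P^{-1}
    assert np.allclose(A, P @ D @ np.linalg.inv(P))

    # A^t = P diag(lam_1^t, ..., lam_N^t) P^{-1}
    t = 10
    At = np.linalg.matrix_power(A, t)
    assert np.allclose(At, P @ np.diag(lams**t) @ np.linalg.inv(P))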

1.4.5 Matrix Norm and Neumann Series

Consider the vector difference equation $x_{t+1} = Ax_t + b$, where $x_t \in \mathbb{R}^N$ represents the values of some variables of interest (e.g., consumption, investment, etc.) at time t, and A and b form the parameters in the law of motion for $x_t$. One question of interest for such systems is whether or not there is a vector $x \in \mathbb{R}^N$ such that $x_t = x$ implies $x_{t+1} = x$. In other words, we seek an $x \in \mathbb{R}^N$ that solves the system of equations
$$x = Ax + b, \quad \text{where } A \text{ is } N \times N \text{ and } b \text{ is } N \times 1$$
In considering this problem we can get some insight from the scalar case $x = ax + b$. Here we know that if $|a| < 1$, then this equation has the solution
$$\bar{x} = \frac{b}{1 - a} = b \sum_{k=0}^{\infty} a^k$$
The second equality follows from elementary results on geometric series.


It turns out a very similar result is true in $\mathbb{R}^N$ if we replace the condition $|a| < 1$ with $\|A\| < 1$, where $\|A\|$ is the matrix norm or operator norm of A. This is defined for any N × K matrix A as
$$\|A\| := \max\left\{ \frac{\|Ax\|}{\|x\|} : x \in \mathbb{R}^K, \ x \neq 0 \right\} \tag{1.11}$$
Note that in this (standard) notation there are two different norms in play. The left-hand side is a matrix norm. The norm expressions on the right-hand side are ordinary Euclidean vector norms.
The matrix norm behaves very much like the Euclidean norm. For example,

Fact 1.4.11. For any N × K matrices A and B, the matrix norm satisfies

1. $\|A\| = 0$ if and only if all entries of A are zero.

2. $\|\alpha A\| = |\alpha| \|A\|$ for any scalar α.

3. $\|A + B\| \leq \|A\| + \|B\|$.

4. If A is a square matrix, then
$$\|A\| = \max\{\lambda^{1/2} : \lambda \text{ is an eigenvalue of } A'A\} \tag{1.12}$$

Returning to the problem at hand, it's clear from (1.11) that if $\|A\| < 1$, then any nonzero point is "contracted" by A, in the sense of being pulled closer to the origin (i.e., $\|Ax\| < \|x\|$). In this sense its implications are similar to $|a| < 1$ in the scalar case. In particular, we have the following parallel result:

Theorem 1.4.2. Let b be any vector in $\mathbb{R}^N$ and let I be the N × N identity. If A is an N × N matrix with $\|A^j\| < 1$ for some $j \in \mathbb{N}$, then $I - A$ is invertible, and the system of equations $x = Ax + b$ has the unique solution
$$\bar{x} = (I - A)^{-1} b = \sum_{i=0}^{\infty} A^i b$$

The second equality means that $\sum_{i=0}^{k} A^i b$ converges to $(I - A)^{-1} b$ as $k \to \infty$. The infinite sum is called the Neumann series associated with A.
One way to test the conditions of theorem 1.4.2 is to use (1.12). Another is to use the following fact, which restricts the eigenvalues of A directly, rather than those of $A'A$. It is based on the spectral radius, which is defined for square A as

$$\rho(A) := \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } A \} \tag{1.13}$$


Here $|\lambda|$ is the modulus of the possibly complex number λ. In particular, if $\lambda = a + ib$, then $|\lambda| = (a^2 + b^2)^{1/2}$. If $\lambda \in \mathbb{R}$ then this reduces to the usual notion of absolute value.

Fact 1.4.12. If $\rho(A) < 1$, then $\|A^j\| < 1$ for some $j \in \mathbb{N}$.
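Here is a rough numerical sketch of theorem 1.4.2 and fact 1.4.12, assuming NumPy is available (the matrix A and vector b are arbitrary small examples): it checks that the spectral radius and operator norm are below one and that the truncated Neumann series approaches $(I - A)^{-1}b$.

    import numpy as np

    A = np.array([[0.5, 0.2],
                  [0.1, 0.4]])
    b = np.array([1.0, 2.0])

    rho = max(abs(np.linalg.eigvals(A)))        # spectral radius, as in (1.13)
    op_norm = np.linalg.norm(A, 2)              # operator norm, as in (1.11)-(1.12)
    print(rho, op_norm)                         # both less than 1 here

    x_bar = np.linalg.solve(np.eye(2) - A, b)   # (I - A)^{-1} b

    # truncated Neumann series: sum_{i=0}^{k} A^i b
    x_approx = np.zeros(2)
    term = b.copy()
    for i in range(50):
        x_approx += term
        term = A @ term

    assert np.allclose(x_bar, x_approx)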

1.4.6 Quadratic Forms

We’ve spent a lot of time discussing linear maps, one class of which is the linear
real-valued maps. By theorem 1.3.1 we know that any linear map from R N to R
takes the form x 7→ x0 a for some vector a ∈ R N . The next level of complexity is
quadratic real-valued maps. To describe them, let A be N × N and symmetric, and
let x be N × 1. The quadratic function or quadratic form on R N associated with A
is the map Q defined by
$$Q(x) := x'Ax = \sum_{j=1}^{N} \sum_{i=1}^{N} a_{ij} x_i x_j$$

To give a simple illustration, let N = 2 and let A be the identity matrix I. In this case,
$$Q(x) = \|x\|^2 = x_1^2 + x_2^2$$
A 3D graph of this function is shown in figure 1.8.

One thing you'll notice about this function is that its graph lies everywhere above zero, or $Q(x) \geq 0$. In fact we know that $\|x\|^2$ is nonnegative and will be zero only when $x = 0$. Hence the graph touches zero only at the point $x = 0$. Many other choices of A yield a quadratic form with this property. Such A are said to be positive definite. More generally, an N × N symmetric matrix A is called

• nonnegative definite if $x'Ax \geq 0$ for all $x \in \mathbb{R}^N$,

• positive definite if $x'Ax > 0$ for all $x \in \mathbb{R}^N$ with $x \neq 0$,

• nonpositive definite if $x'Ax \leq 0$ for all $x \in \mathbb{R}^N$, and

• negative definite if $x'Ax < 0$ for all $x \in \mathbb{R}^N$ with $x \neq 0$.


Figure 1.8: The quadratic function $Q(x) = x_1^2 + x_2^2$

Figure 1.9: The quadratic function $Q(x) = -x_1^2 - x_2^2$


Figure 1.10: The quadratic function $Q(x) = x_1^2/2 + 8x_1x_2 + x_2^2/2$

If A fits none of these categories then A is called indefinite. Figure 1.9 shows the
graph of a negative definite quadratic function. Now the function is hill-shaped,
and 0 is the unique global maximum. Figure 1.10 shows an indefinite form.
The easiest case for detecting definiteness is when the matrix A is diagonal, since

$$A = \mathrm{diag}(d_1, \ldots, d_N) \quad \text{implies} \quad Q(x) = d_1 x_1^2 + \cdots + d_N x_N^2$$

From the right hand expression we see that a diagonal matrix is positive definite
if and only if all diagonal elements are positive. Analogous statements are true for
nonnegative, nonpositive and negative definite matrices. The next fact generalizes
this idea and is proved in §3.1.2.

Fact 1.4.13. Let A be any symmetric matrix. A is

1. positive definite if and only if its eigenvalues are all positive,

2. negative definite if and only if its eigenvalues are all negative,

and similarly for nonpositive and nonnegative definite.



It immediately follows from fact 1.4.13 (why?) that

Fact 1.4.14. If A is positive definite, then A is nonsingular, with det(A) > 0.

Finally, here’s a necessary (but not sufficient) condition for each kind of definiteness.

Fact 1.4.15. If A is positive definite, then each element $a_{nn}$ on the principal diagonal is positive; analogous statements hold for nonnegative, nonpositive and negative definite matrices.
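As a computational illustration of fact 1.4.13, the sketch below classifies a symmetric matrix by the signs of its eigenvalues, assuming NumPy is available (the helper name definiteness and the tolerance are illustrative; the last matrix is the one behind the indefinite form in figure 1.10):

    import numpy as np

    def definiteness(A, tol=1e-12):
        # classify a symmetric matrix by the signs of its eigenvalues (fact 1.4.13)
        lams = np.linalg.eigvalsh(A)
        if np.all(lams > tol):
            return "positive definite"
        if np.all(lams < -tol):
            return "negative definite"
        if np.all(lams >= -tol):
            return "nonnegative definite"
        if np.all(lams <= tol):
            return "nonpositive definite"
        return "indefinite"

    print(definiteness(np.eye(2)))                            # positive definite
    print(definiteness(-np.eye(2)))                           # negative definite
    print(definiteness(np.array([[0.5, 4.0], [4.0, 0.5]])))   # indefinite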

1.5 Further Reading

To be written.

1.6 Exercises

Ex. 1.6.1. Given two vectors x and y, show that $\big| \|x\| - \|y\| \big| \leq \|x - y\|$.3

Ex. 1.6.2. Use fact 1.1.2 on page 5 to show that if y ∈ R N is such that y0 x = 0 for
every x ∈ R N , then y = 0.

Ex. 1.6.3. Fix nonzero $x \in \mathbb{R}^N$. Consider the optimization problem

$$\max_{y} \; x'y \quad \text{subject to } y \in \mathbb{R}^N \text{ and } \|y\| = 1$$

Show that the maximizer is $\hat{x} := x/\|x\|$.4

Ex. 1.6.4. Show that if S and S0 are two linear subspaces of R N , then S ∩ S0 is also a
linear subspace of R N .

Ex. 1.6.5. Show that every linear subspace of R N contains the origin 0.

Ex. 1.6.6. Show that the vectors (1, 1) and (−1, 2) are linearly independent.5
3 Hint: Use the triangle inequality.
4 Hint: There’s no need to go taking derivatives and setting them equal to zero. An easier proof
exists. If you’re stuck, consider the Cauchy-Schwarz inequality.
5 Hint: Look at the different definitions of linear independence. Choose the one that’s easiest to

work with in terms of algebra.



Ex. 1.6.7. Let a ∈ R N and let A := {x ∈ R N : a0 x = 0}. Show that A is a linear


subspace of R N .

Ex. 1.6.8. Let Q be the subset of R3 defined by

Q := {( x1 , x2 , x3 ) ∈ R3 : x2 = x1 + x3 }

Is Q a linear subspace of R3 ? Why or why not?

Ex. 1.6.9. Let Q be the subset of R3 defined by

Q := {( x1 , x2 , x3 ) ∈ R3 : x2 = 1}

Is Q a linear subspace of R3 ? Why or why not?

Ex. 1.6.10. Show that if T : R N → R N is a linear function and λ is any scalar, then
E := {x ∈ R N : Tx = λx} is a linear subspace of R N .

Ex. 1.6.11. Prove the equivalences in fact 1.1.6 on page 10.

Ex. 1.6.12. Prove fact 1.1.7 on page 11.

Ex. 1.6.13. Show that if S is a linear subspace of R N then every basis of S has the
same number of elements.

Ex. 1.6.14. Prove fact 1.1.8 on page 14.

Ex. 1.6.15. Prove theorem 1.2.1 on page 16.

Ex. 1.6.16. Show that if T : RK → R N is linear and K > N, then T is not one-to-one.

Ex. 1.6.17. Prove the claim in fact 1.4.14 on page 38 that if A is positive definite, then
A is nonsingular. If you can, prove it without invoking positivity of its eigenvalues.

Ex. 1.6.18. Prove fact 1.4.15 on page 38.

Ex. 1.6.19. Show that if A and B are invertible and conformable, then $(AB)^{-1} = B^{-1}A^{-1}$.6

Ex. 1.6.20. Let A be a constant N × N matrix. Assuming existence of the inverse


A−1 , show that (A0 )−1 = (A−1 )0 .
6 Hint: Look at the definition of the inverse! Always look at the definition, and then show that the
object in question has the stated property.

Ex. 1.6.21. Show that if ei and e j are the i-th and j-th canonical basis vectors of R N
respectively, and A is an N × N matrix, then ei0 Ae j = aij , the i, j-th element of A.
Ex. 1.6.22. Prove fact 1.3.6 on page 26.
Ex. 1.6.23. Let
$$A := \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad B := \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}$$
Show that

1. A is nonnegative definite.

2. B is not positive definite.


Ex. 1.6.24. Let $A_1, \ldots, A_J$ be invertible matrices. Use proof by induction and fact 1.3.4 on page 24 to show that the product of these matrices is invertible, and, in particular, that
$$(A_1 A_2 \cdots A_J)^{-1} = A_J^{-1} \cdots A_2^{-1} A_1^{-1}$$

Ex. 1.6.25. Show that for any matrix A, the matrix A0 A is well-defined (i.e., multi-
plication is possible), square, and nonnegative definite.
Ex. 1.6.26. Show that if A and B are positive definite and A + B is well defined, then
it is also positive definite.
Ex. 1.6.27. Let A be N × K. Show that if Ax = 0 for all K × 1 vectors x, then A = 0
(i.e., every element of A is zero). Show as a corollary that if A and B are N × K and
Ax = Bx for all K × 1 vectors x, then A = B.
Ex. 1.6.28. Let I N be the N × N identity matrix.

1. Explain briefly why I N is full column rank.

2. Show that I N is the inverse of itself.

3. Let A := αI N . Give a condition on α such that A is positive definite.


Ex. 1.6.29. Let $X := I_N - 2uu'$, where u is an N × 1 vector with $\|u\| = 1$. Show that X is symmetric and $XX = I_N$.
Ex. 1.6.30. Recall the definition of similarity of matrices, as given in §1.4.4. Let’s
write A ∼ B if A is similar to B. Show that ∼ is an equivalence relation on the set
of N × N matrices. In particular, show that, for any N × N matrices A, B and C, we
have (i) A ∼ A, (ii) A ∼ B implies B ∼ A and (iii) A ∼ B and B ∼ C implies A ∼ C.

Ex. 1.6.31. Confirm the claim in fact 1.4.8 on page 30.

Ex. 1.6.32. Show that A is nonsingular if and only if 0 is not an eigenvalue for A.

Ex. 1.6.33. Show that the only nonsingular idempotent matrix is the identity matrix.

Ex. 1.6.34. Let $\mathbf{1}$ be an N × 1 vector of ones. Consider the matrix

$$Z := \frac{1}{N} \mathbf{1}\mathbf{1}'$$

1. Show that if x is any N × 1 vector, then Zx is a vector with all elements equal
to the mean of the elements of x.

2. Show that Z is idempotent.

1.6.1 Solutions to Selected Exercises

Solution to Exercise 1.6.3. Fix nonzero $x \in \mathbb{R}^N$. Let $\hat{x} := x/\|x\|$. Comparing this point with any other $y \in \mathbb{R}^N$ satisfying $\|y\| = 1$, the Cauchy–Schwarz inequality yields
$$y'x \leq |y'x| \leq \|y\| \|x\| = \|x\| = \frac{x'x}{\|x\|} = \hat{x}'x$$
Hence x̂ is the maximizer, as claimed.

Solution to Exercise 1.6.7. Let x, y ∈ A and let α, β ∈ R. We must show that z :=


αx + βy ∈ A, or, equivalently, a0 z = a0 (αx + βy) = 0. This is immediate, because
a0 (αx + βy) = αa0 x + βa0 y = 0 + 0 = 0.

Solution to Exercise 1.6.8. If a := (1, −1, 1), then Q is all x with a0 x = 0. This set is
a linear subspace of R3 , as shown in exercise 1.6.7.

Solution to Exercise 1.6.11. We are asked to verify the equivalences in fact 1.1.6 on
page 10 for the set X := {x1 , . . . , xK }. The right way to do this is to establish a cycle,
such as part 1 implies part 2 implies part 3 implies part 1. It is then clear that part i
implies part j for any i and j.
First let’s show that part 1 implies part 2, which is that if X0 is a proper subset of X,
then span( X0 ) is a proper subset of span( X ). To save fiddly notation let’s take X0 :=
{x2 , . . . , xK }. Suppose to the contrary that span( X0 ) = span( X ). Since x1 ∈ span( X )

we must then have x1 ∈ span( X0 ), from which we deduce the existence of scalars
α2 , . . . , αK such that 0 = −x1 + α2 x2 + · · · + αK xK . Since −1 ≠ 0, this contradicts
part 1.
The next claim is that part 2 implies part 3; that is, that no vector in X can be
written as a linear combination of the others. Suppose to the contrary that x1 =
α2 x2 + · · · + αK xK , say. Let y ∈ span( X ), so that y = β 1 x1 + · · · + β K xK . If we
use the preceding equality to substitute out x1 , we get y as a linear combination of
{x2 , . . . , xK } alone. In other words, any element of span( X ) is in the span of the
proper subset {x2 , . . . , xK }. Contradiction.
The final claim is that part 3 implies part 1; that is, that α1 = · · · = αK = 0 whenever
α1 x1 + · · · + αK xK = 0. Suppose to the contrary that there exist scalars with α1 x1 +
· · · + αK xK = 0 and yet αk ≠ 0 for at least one k. It follows immediately that $x_k = (1/\alpha_k) \sum_{j \neq k} \alpha_j x_j$. Contradiction.

Solution to Exercise 1.6.12. The aim is to prove fact 1.1.7 on page 11. Regarding the first part, let's take X as linearly independent and show that the subset $X_0 := \{x_1, \ldots, x_{K-1}\}$ is linearly independent. (The argument for more general subsets is similar.) Suppose to the contrary that $X_0$ is linearly dependent. Then by the definition we can take $\alpha_1, \ldots, \alpha_{K-1}$ not all zero with $\sum_{k=1}^{K-1} \alpha_k x_k = 0$. Letting $\alpha_K = 0$ we can write this as $\sum_{k=1}^{K} \alpha_k x_k = 0$. Since not all coefficients are zero we have contradicted linear independence of X.

Regarding the second claim, let $X := \{x_1, \ldots, x_K\}$ be linearly independent and suppose that $x_j = 0$. Then by setting $\alpha_k = 0$ for $k \neq j$ and $\alpha_j = 1$ we can form scalars not all equal to zero with $\sum_{k=1}^{K} \alpha_k x_k = 0$.

Regarding the third claim, let $X := \{x_1, \ldots, x_K\} \subset \mathbb{R}^N$ be linearly independent and let $x_{K+1}$ be any point in $\mathbb{R}^N$ such that $x_{K+1} \notin \mathrm{span}(X)$. The claim is that $X \cup \{x_{K+1}\}$ is linearly independent. Suppose to the contrary that there exist $\alpha_1, \ldots, \alpha_K, \alpha_{K+1}$ not all zero such that $\sum_{k=1}^{K+1} \alpha_k x_k = 0$. There are two possibilities for $\alpha_{K+1}$, both of which lead to a contradiction. First, if $\alpha_{K+1} = 0$, then, since $\alpha_1, \ldots, \alpha_K, \alpha_{K+1}$ are not all zero, at least one of $\alpha_1, \ldots, \alpha_K$ is nonzero, and, moreover, $\sum_{k=1}^{K} \alpha_k x_k = \sum_{k=1}^{K+1} \alpha_k x_k = 0$. This contradicts our assumption of independence of X. On the other hand, if $\alpha_{K+1} \neq 0$, then from $\sum_{k=1}^{K+1} \alpha_k x_k = 0$ we can express $x_{K+1}$ as a linear combination of elements of X. This contradicts the hypothesis that $x_{K+1} \notin \mathrm{span}(X)$.

Solution to Exercise 1.6.13. Let B1 and B2 be two bases of S, with K1 and K2 ele-
ments respectively. By definition, B2 is a linearly independent subset of S. More-
over, S is spanned by the set B1 , which has K1 elements. Applying theorem 1.1.2, we

see that B2 has at most K1 elements. That is, K2 ≤ K1 . Reversing the roles of B1 and
B2 gives K1 ≤ K2 .

Solution to Exercise 1.6.14. The aim is to prove fact 1.1.8 on page 14. Suppose that
S and T are K-dimensional linear subspaces of R N with S ⊂ T. We claim that
S = T. To see this, observe that by the definition of dimension, S is equal to span( B)
where B is a set of K linearly independent basis vectors {b1 , . . . , bK }. If S 6= T,
then there exists a vector x ∈ T such that x ∉ span( B). In view of fact 1.1.6 on
page 10, the set {x, b1 , . . . , bK } is linearly independent. Moreover, since x ∈ T and
since B ⊂ S ⊂ T, we now have K + 1 linearly independent vectors inside T. At
the same time, being K-dimensional, we know that T is spanned by K vectors. This
contradicts theorem 1.1.2 on page 12.
Regarding part 2, suppose that S is an M-dimensional linear subspace of R N where
M < N and yet S = R N . Then we have a space S spanned by M < N vectors that at
the same time contains the N linearly independent canonical basis vectors. We are
led to another contradiction of theorem 1.1.2. Hence S = R N cannot hold.

Solution to Exercise 1.6.15. A collection of equivalent statements are usually proved


via a cycle of implications, such as 1 =⇒ 2 =⇒ · · · =⇒ 5 =⇒ 1. However in
this case the logic is clearer if we directly show that all statements are equivalent to
linear independence of V.
First observe equivalence of the onto property and linear independence of V via

T onto ⇐⇒ rng( T ) = R N ⇐⇒ span(V ) = R N

by lemma 1.2.1, and the last statement is equivalent to linear independence of V by


theorem 1.1.3 on page 12.
Next let's show that $\ker(T) = \{0\}$ implies linear independence of V. To this end, suppose that $\ker(T) = \{0\}$ and let $\alpha_1, \ldots, \alpha_N$ be such that $\sum_{n=1}^{N} \alpha_n T e_n = 0$. By linearity of T we then have $T(\sum_{n=1}^{N} \alpha_n e_n) = 0$. Since $\ker(T) = \{0\}$ this means that $\sum_{n=1}^{N} \alpha_n e_n = 0$, which, in view of independence of $\{e_1, \ldots, e_N\}$, implies $\alpha_1 = \cdots = \alpha_N = 0$. This establishes that V is linearly independent.

Now let's check that linear independence of V implies $\ker(T) = \{0\}$. To this end, let x be a vector in $\mathbb{R}^N$ such that $Tx = 0$. We can represent x in the form $\sum_{n=1}^{N} \alpha_n e_n$ for suitable scalars $\{\alpha_n\}$. From linearity and $Tx = 0$ we get $\sum_{n=1}^{N} \alpha_n T e_n = 0$. By linear independence of V this implies that each $\alpha_n = 0$, whence $x = 0$. Thus $\ker(T) = \{0\}$ as claimed.

From fact 1.2.1 we have ker( T ) = {0} iff T is one-to-one, so we can now state the
following equivalences

T onto ⇐⇒ V linearly independent ⇐⇒ T one-to-one (1.14)

Finally, if T is a bijection then T is onto and hence V is linearly independent by


(1.14). Conversely, if V is linearly independent then T is both onto and one-to-one
by (1.14). Hence T is a bijection.

Solution to Exercise 1.6.16. Let T be as described in the exercise and let $K > N$. Seeking a contradiction, suppose in addition that T is one-to-one. Let $\{\alpha_k\}_{k=1}^{K}$ be such that $\sum_{k=1}^{K} \alpha_k T e_k = 0$. By linearity, $T(\sum_{k=1}^{K} \alpha_k e_k) = 0$, and since T is one-to-one and $T0 = 0$, this in turn implies that $\sum_{k=1}^{K} \alpha_k e_k = 0$. Since the canonical basis vectors are linearly independent, it must be that $\alpha_1 = \cdots = \alpha_K = 0$. From this we conclude that $\{Te_1, \ldots, Te_K\}$ is linearly independent. Thus $\mathbb{R}^N$ contains K linearly independent vectors, despite the fact that $N < K$. This is impossible by theorem 1.1.2 on page 12.

Solution to Exercise 1.6.17. Let A be positive definite and consider the following: If
A is singular, then there exists nonzero x with Ax = 0 (see theorem 1.3.2 on page 22).
But then x0 Ax = 0 for nonzero x. Contradiction.

Solution to Exercise 1.6.18. If x = en then x0 Ax = ann . The claim follows.

Solution to Exercise 1.6.22. Since the columns of A consist of K vectors in R N , the


fact that K > N implies that not all of the columns of A are linearly independent.
(Recall theorem 1.1.2 on page 12.) It follows that Az = 0 for some nonzero z in
RK , and hence that Aλz = 0 for any scalar λ. Now suppose that some x solves
Ax = b. Then, for any λ ∈ R, we have Ax + Aλz = A(x + λz) = b. This proves
the claim.

Solution to Exercise 1.6.28. The solutions are as follows: (1) I N is full column rank
because its columns are the canonical basis vectors, which are independent. (2) By
definition, B is the inverse of A if BA = AB = I N . It follows immediately that I N
is the inverse of itself. (3) A sufficient condition is α > 0. If this holds, then given
$x \neq 0$, we have $x' \alpha I_N x = \alpha \|x\|^2 > 0$.

Solution to Exercise 1.6.29. First, X is symmetric because
$$X' = (I_N - 2uu')' = I_N' - 2(uu')' = I_N - 2(u')'u' = I_N - 2uu' = X$$
Second, $XX = I_N$, because
$$XX = (I_N - 2uu')(I_N - 2uu') = I_N I_N - 2 I_N 2uu' + (2uu')(2uu') = I_N - 4uu' + 4uu'uu' = I_N - 4uu' + 4uu' = I_N$$
The second last equality is due to the assumption that $u'u = \|u\|^2 = 1$.

Solution to Exercise 1.6.33. Suppose that A is both idempotent and nonsingular.


From idempotence we have AA = A. Premultiplying by A−1 gives A = I.
Chapter 2

Probability

[Roadmap]

2.1 Probabilistic Models

We begin with some foundations of probability theory. These involve a few set
theoretic operations—see Appendix 4.1 for a review.

2.1.1 Sample Spaces

In setting up a probabilistic model, we always start with the notion of a sample


space, which we think of as being the enumeration or “list” of all possible outcomes
in a given random experiment. In general, the sample space can be any nonempty
set, and is usually denoted Ω. A typical element of Ω is denoted ω. The general idea
is that a realization of uncertainty will lead to the selection of a particular ω ∈ Ω.

Example 2.1.1. Let Ω := {1, . . . , 6} represent the six different faces of a dice. A
realization of uncertainty corresponds to a roll of the dice, with the outcome being
an integer ω in the set {1, . . . , 6}.

The specification of all possible outcomes Ω is one part of our model. The other
thing we need to do is to assign probabilities to outcomes. One idea is to start by
assigning appropriate probabilities to every ω in Ω. It turns out that this is not the


right way forward. Instead, the standard approach is to directly assign probabilities
to subsets of Ω instead. In the language of probability theory, subsets of Ω are called
events. The set of events is usually denoted by F , and we follow this convention.1
Below we attach probabilities to events using notation P( B), where B is an event
(i.e., B ∈ F ). The symbol P( B) should be interpreted as representing “the probabil-
ity that event B occurs.” The way you should think about it is this:

P( B) represents the probability that when uncertainty is resolved and


some ω ∈ Ω is selected by “nature,” the statement ω ∈ B is true.

To illustrate, consider again example 2.1.1. Let B be the event {1, 2}. The number
P( B) represents the probability that the face ω selected by the roll is either 1 or 2.
Let Ω be any sample space. Two events we always find in F are Ω itself and the
empty set ∅, the latter because the empty set is regarded as being a subset of every
set.2 In this context, Ω is called the certain event because it always occurs (regard-
less of which outcome ω is selected, ω ∈ Ω is true by definition). The empty set ∅
is called the impossible event.

Remark 2.1.1. Why not start the other way, assigning probability to individual points
ω ∈ Ω, and then working out the probability of events by looking at the probability
of the points contained in those events? In fact this approach fails in many cases.
For example, suppose we take Ω to be the interval (0, 1) and set up a model where
all numbers in (0, 1) are “equally likely.” In this scenario, it can be shown that the
probability of an event of the form ( a, b) is just the length of the interval, or b − a.
Moreover, the probability of an individual point x ∈ (0, 1) occurring should be less
than the probability of some point in the interval (x − ε, x + ε) occurring, which is 2ε. Hence the
probability of hitting x is less than 2ε for every ε > 0 we choose. No positive number is
less than 2ε for every ε > 0. Hence the probability of x must be zero. As a result, we
can’t build probabilities of events from probabilities of individual elements, since
the probability of hitting any given point is zero. There is no way to build up a
proper probabilistic model with this information alone.
1 I’m skirting some technical details here. In many situations, we exclude some of the more “com-
plex” subsets of Ω from F because they are so messy that assigning probabilities to these sets causes
problems for the theory. See §2.1.4 for more discussion.
2 Why? Because in mathematics every sensible mathematical statement must be either true or

false, so ∅ ⊂ Ω must be true or false. To show the statement true, we need to show that every element
of ∅ is also in Ω. Since ∅ contains no elements we cannot test this statement directly. However, it is
certainly not false, since no counterexample can be given. As a result we regard it as true.

2.1.2 Probabilities

The second stage of our model construction is to assign probabilities to elements of


F . This is done with a function P that maps F to [0, 1]. In order to make sure our
model of probability is well behaved, we need to put certain restrictions on P. For
example, we wouldn’t want to have a B with P( B) = −93, as negative probabilities
don’t make much sense. These restrictions are imposed in the next definition:
A probability P on (Ω, F ) is a function from F to [0, 1] that satisfies

1. P(Ω) = 1, and

2. Additivity: P( A ∪ B) = P( A) + P( B) whenever A, B ∈ F with A ∩ B = ∅.

Together, the triple (Ω, F , P) is called a probability space. It describes a set of


events and their probabilities for a given experiment. Confession: I’ve simplified the
standard definition of a probability space slightly to avoid technical discussions that we don’t
need to go into right now. An outline of the technical issues can be found in §2.1.4.
The axioms in the definition of a probability are sensible. For example, it’s clear why
we require P(Ω) = 1, since the realization ω will always be chosen from Ω by its
definition. Also, the additivity property is natural: To find the probability of a given
event, we can determine all the different (i.e., disjoint) ways that the event could
occur, and then sum their probabilities. The examples below reinforce this idea.

Example 2.1.2. Let Ω := {1, . . . , 6} represent the six different faces of a dice, as in
example 2.1.1. We define a function P over all A ∈ F by

$$P(A) := \frac{\#A}{6} \quad \text{where } \#A := \text{number of elements in } A \tag{2.1}$$
For example, P{2, 4, 6} = 3/6 = 1/2. Let’s check the axioms that define a proba-
bility. It’s easy to see that 0 ≤ P( A) ≤ 1 for any A ∈ F , and that P(Ω) = 1. Re-
garding additivity, suppose that A and B are two disjoint subsets of {1, . . . , 6}. Then
#( A ∪ B) = #A + #B, since, by disjointness, the number of elements in the union is
just the number contributed by A plus the number contributed by B. Hence

$$P(A \cup B) = \frac{\#(A \cup B)}{6} = \frac{\#A + \#B}{6} = \frac{\#A}{6} + \frac{\#B}{6} = P(A) + P(B)$$

In the definition of P, additivity was defined for pairs of sets, but this is enough to imply additivity over any disjoint finite union. In particular, if $A_1, \ldots, A_J$ are disjoint in the sense that $A_i \cap A_j = \emptyset$ whenever $i \neq j$, then
$$P\left( \cup_{j=1}^{J} A_j \right) = \sum_{j=1}^{J} P(A_j)$$

also holds. See exercise 2.8.1.

Example 2.1.3. Continuing example 2.1.2, if we roll the dice, the probability of get-
ting an even number is the probability of getting a 2 plus that of getting a 4 plus that
of getting a 6. Formally,

$$P\{2, 4, 6\} = P[\{2\} \cup \{4\} \cup \{6\}] = P\{2\} + P\{4\} + P\{6\} = 1/6 + 1/6 + 1/6 = 1/2$$

Example 2.1.4. Consider a memory chip in a computer, made up of billions of tiny


switches. Imagine that a random number generator accesses a subset of N switches,
setting each one to “on” or “off” at random. One sample space for this experiment
is
Ω0 := {(b1 , . . . , b N ) : bn ∈ {on, off} for each n}
Letting zero represent off and one represent on, we can also use the more practical
space
Ω := {(b1 , . . . , b N ) : bn ∈ {0, 1} for each n}
Thus, Ω is the set of all binary sequences of length N. As our probability, we define

$$P(A) := 2^{-N} (\#A)$$
To see that this is indeed a probability on (Ω, F ) we need to check that 0 ≤ P( A) ≤ 1
for all A ⊂ Ω, that P(Ω) = 1, and that P is additive. Exercise 2.8.7 asks you to
confirm that P is additive. That P(Ω) = 1 follows from the fact that the number of
distinct binary sequences of length N is $2^N$.
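A minimal sketch in plain Python that enumerates the sample space of example 2.1.4 for a small N and checks that P(Ω) = 1 and that P is additive on a pair of disjoint events (the particular events are illustrative):

    from itertools import product

    N = 4
    Omega = list(product([0, 1], repeat=N))     # all binary sequences of length N

    def P(A):
        # P(A) = 2^{-N} * #A
        return len(A) / 2**N

    print(P(Omega))                             # 1.0

    A = [w for w in Omega if w[0] == 1]                  # first switch on
    B = [w for w in Omega if w[0] == 0 and w[1] == 1]    # disjoint from A
    print(P(A + B), P(A) + P(B))                         # equal, illustrating additivity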

Now let’s go back to the general case, where (Ω, F , P) is an arbitrary probability
space. From the axioms above, we can derive a suprising number of properties.
Let’s list the key ones, starting with the next fact.

Fact 2.1.1. Let (Ω, F , P) be a probability space, and let A, B ∈ F . If A ⊂ B, then



1. P( B \ A ) = P( B ) − P( A );
2. P( A ) ≤ P( B );
3. P( Ac ) := P(Ω \ A) = 1 − P( A); and
4. P(∅) = 0.

These claims are not hard to prove. Regarding part 1, if A ⊂ B, then we have B =
( B \ A) ∪ A. (Sketch the Venn diagram.) Since B \ A and A are disjoint, additivity
of P now gives

P( B ) = P( B \ A ) + P( A ) (whenever A ⊂ B)
This equality implies parts 1–4 of fact 2.1.1. Rearranging gives part 1, while nonneg-
ativity of P gives part 2. Specializing to B = Ω gives part 3, and setting B = A gives
part 4.
The property that A ⊂ B implies P( A) ≤ P( B) is called monotonicity, and is fun-
damental. If A ⊂ B, then we know that B occurs whenever A occurs (because if
ω lands in A, then it also lands in B). Hence, the probability of B should be larger.
Many crucial ideas in probability boil down to this one point.
Fact 2.1.2. If A and B are any (not necessarily disjoint) events, then

P( A ∪ B ) = P( A ) + P( B ) − P( A ∩ B )

It follows that for any A, B ∈ F , we have P( A ∪ B) ≤ P( A) + P( B). This is called


subadditivity. Thus, probabilities are subadditive over arbitrary pairs of events and
additive over disjoint pairs.

2.1.3 Dependence and Independence

If A and B are events, then the conditional probability of A given B is


$$P(A \mid B) := \frac{P(A \cap B)}{P(B)} \tag{2.2}$$
It represents the probability that A will occur, given the information that B has oc-
curred. For the definition to make sense, it requires that P( B) > 0. Events A and B
are called independent if
P( A ∩ B ) = P( A )P( B ) (2.3)
If A and B are independent, then the conditional probability of A given B is just the
probability of A.

Example 2.1.5. Consider an experiment where we roll a dice twice. A suitable sam-
ple space is the set of pairs (i, j), where i and j are between 1 and 6. The first element
i represents the outcome of the first roll, while the second element j represents the
outcome of the second roll. Formally,
Ω := {(i, j) : i, j ∈ {1, . . . , 6}}
For our probability, let’s define P( E) := #E/36. (Here elements of E are pairs, so #E
is the number of pairs in E.) Now consider the events
A := {(i, j) ∈ Ω : i is even} and B := {(i, j) ∈ Ω : j is even}
In this case we have
A ∩ B = {(i, j) ∈ Ω : i and j are even}
We can establish (2.3), indicating independence of A and B under P. To check this
we need to be able to count the number of elements in A, B and A ∩ B. The basic
principle for counting ordered tuples is that the total number of possible tuples is the
product of the number of possibilities for each element. For example, the number of
distinct tuples
(i, j, k) where i ∈ I, j ∈ J and k ∈ K
is (#I ) × (#J ) × (#K ). Hence, the number of elements in A is 3 × 6 = 18, the number
of elements in B is 6 × 3 = 18, and the number of elements in A ∩ B is 3 × 3 = 9. As
a result,
P( A ∩ B) = 9/36 = 1/4 = (18/36) × (18/36) = P( A)P( B)
Thus, A and B are independent, as claimed.
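The independence claim can also be verified by brute-force enumeration; here is a sketch in plain Python (illustrative only):

    from itertools import product

    Omega = list(product(range(1, 7), repeat=2))   # pairs (i, j) of dice rolls

    def P(E):
        return len(E) / 36

    A = [(i, j) for (i, j) in Omega if i % 2 == 0]
    B = [(i, j) for (i, j) in Omega if j % 2 == 0]
    AB = [w for w in A if w in B]                  # A intersect B

    print(P(AB), P(A) * P(B))                      # both 0.25, so A and B are independent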

A very useful result is the law of total probability, which says the following: Let $A \in \mathcal{F}$ and let $B_1, \ldots, B_M$ be a partition of Ω, so that $B_m \in \mathcal{F}$ for each m, $B_j \cap B_k = \emptyset$ when $j \neq k$, and $\cup_{m=1}^{M} B_m = \Omega$. If $P(B_m) > 0$ for all m, then
$$P(A) = \sum_{m=1}^{M} P(A \mid B_m) \cdot P(B_m)$$

The proof is straightforward, although you should check the manipulations of intersections and unions if you have not seen them before:
$$P(A) = P[A \cap (\cup_{m=1}^{M} B_m)] = P[\cup_{m=1}^{M} (A \cap B_m)] = \sum_{m=1}^{M} P(A \cap B_m) = \sum_{m=1}^{M} P(A \mid B_m) \cdot P(B_m)$$

Example 2.1.6. Suppose we flip a coin to decide whether to take part in a poker
game. If we play, the chance of losing money is 2/3. The overall chance of losing
money (LM) that evening is

P{LM } = P{LM | play}P{play} + P{LM | don’t play}P{don’t play}


which is (2/3) × (1/2) + 0 × (1/2) = 1/3.

2.1.4 Technical Details

As alluded to above, in the presentation of probability spaces I’ve swept some tech-
nical details under the carpet to make the presentation smooth. These details won’t
affect anything that follows, and this whole course can be completed successfully
without knowing anything about them. Hence you can skip this section on first
reading. Nevertheless, if you intend to keep going deeper into probability and
statistics, eventually you will have to work your way through them. So let’s note
them as points for future study.
First, it turns out that assigning probabilities to all subsets of an arbitrary set Ω
in a consistent way can be quite a difficult task. The reason is that if Ω is relatively
large—a continuum, say—then it contains an awful lot of subsets, and some of them
can be manipulated to exhibit very strange phenomena. (Look up the Banach-Tarski
paradox for a hint of what I mean.) Because of this we usually take our set of events
F to be a “well behaved” subset of the subsets of Ω, and only assign probabilities to
elements of F .
How to choose F ? As stated above, we don’t just choose freely because doing so
will make it hard to form a consistent theory. Instead, the usual method is to start
with a collection of sets that are reasonable and well behaved, and then permit extension to other sets that can be obtained from the original sets by some standard set operations.
For example, let’s suppose that A ∈ F , so that P( A) is well defined, and represents
the “probability of event A.” Now, given that we can assign a probability to the
event A, it would be a bit unfortunate if we couldn’t assign a probability to the event
“not A”, which corresponds to Ac . So normally we require that if A ∈ F , then Ac ∈
F . When this is true, we say that F is “closed under the taking of complements”.
Also, let’s suppose that A and B are both in F , so we assign probabilities to these
events. In this case, it would be natural to think about the probability of the event

“A and B”, which corresponds to A ∩ B. So we also require that if A and B are in F ,


then A ∩ B is also in F . We say that F is “closed under the taking of intersections.”
Perhaps we should also require that if A and B are in F , then A ∪ B is also in F ?
Actually, we don’t have to, because (see fact 4.1.1 on page 145),

A ∪ B = ( Ac ∩ Bc )c

Thus, if F is closed under the taking of complements and intersections, then F is


automatically closed under the taking of unions.
There is one more restriction that’s typically placed on F , which is the property
of being closed under “countable” unions. This just means that if A1 , A2 , . . . is a
sequence of sets in F , then its union is likewise in F . Since the details don’t matter
here we won’t discuss it further. When F satisfies all these properties and contains
the whole space Ω it is called a σ-algebra. Almost all of advanced probability rests
on event classes that form σ-algebras.
Also, in standard probability theory there is a restriction placed on the probability
P that has not yet been mentioned, called countable additivity. The definition of
countable additivity is that if $A_1, A_2, \ldots$ is a disjoint sequence of sets in $\mathcal{F}$ (disjoint means that $A_i \cap A_j = \emptyset$ for any $i \neq j$), then
$$P(\cup_i A_i) := P\{\omega \in \Omega : \omega \in A_i \text{ for some } i\} = \sum_{i=1}^{\infty} P(A_i)$$

Why strengthen additivity to countable additivity? Countable additivity works be-


hind the scenes to make probability theory run smoothly (expectations operators are
suitably continuous, and so on). None of these details will concern us in this course.
If you wish, you can learn all about σ-algebras and countable additivity in any text
on measure theory. There are many beautiful books on this subject. One of my
favorites at the introductory level is Williams (1991).

2.2 Random Variables

[Roadmap]

2.2.1 Definition and Notation

In elementary probability courses, a random variable is defined as a “value that


changes randomly,” or something to that effect. This isn’t a very satisfying defini-
tion. The proper definition of a random variable is a function from a sample space
Ω into R. Think of it this way:

1. “Nature” picks out an element ω in Ω according to some probability.

2. The random variable now sends this ω into numerical value x (ω ).

Thus random variables convert outcomes in sample space—which can be any kind
of object—into numerical outcomes. This is valuable because numerical outcomes are
easy to order, add, subtract, etc. In other words, random variables “report” the
outcome of an experiment in a format amenable to further analysis.3
Example 2.2.1. Consider the sample space
Ω := {(b1 , b2 , . . .) : bn ∈ {0, 1} for each n}
Ω is called the set of all infinite binary sequences. (This is an infinite version of
the sample space in example 2.1.4. Imagine a computer with an infinite amount of
memory.) Consider an experiment where we flip a coin until we get a “heads”. We
let 0 represent tails and 1 represent heads. The experiment of flipping until we get a
heads can be modeled with the random variable
x (ω ) = x (b1 , b2 , . . .) = min{n ∈ N : bn = 1}
The number of heads in the first 10 flips is given by the random variable
$$y(\omega) = y(b_1, b_2, \ldots) = \sum_{n=1}^{10} b_n$$

As per the definition, x and y are well-defined functions from Ω into R.4
3 I’m skipping some technical details again. If Ω is a continuum, then, when identifying random
variables with the class of all functions from Ω to R, we typically exclude some particularly compli-
cated functions. The remaining “nice” functions are called the measurable functions. These are our
random variables. In this course we will never meet the nasty functions, and there’s no need to go
into further details. Those who want to know more should consult any text on measure theory (e.g.,
Williams, 1991).
4 Is this true? What if $\omega = \omega_0$ is an infinite sequence containing only zeros? Then $\{n \in \mathbb{N} : b_n = 1\} = \emptyset$. So what is $x(\omega_0)$? The convention here is to set min ∅ = ∞. This is reasonable, but now x
is not a map into R, because it can take the value ∞. However, in most applications this event has
probability zero, and hence we can ignore it. For example, we can set x (ω0 ) = 0 without changing
anything significant. Now we’re back to a well-defined function from Ω to R.

Before going further, let’s discuss a common notational convention with random
variables that we’ve adopted above and that will be used below. With a random
variable x, you will often see notation such as

{ x has some property}

This is a shorthand for the event

{ω ∈ Ω : x (ω ) has some property}

We’ll follow this convention, but you should translate it backwards and forwards
in your mind to start off with. To give an example, consider the claim that, for any
random variable x,

P{ x ≤ a} ≤ P{ x ≤ b} whenever a ≤ b (2.4)

This is intuitively obvious. The mathematical argument goes as follows: Pick any
a, b ∈ R with a ≤ b. We have

{ x ≤ a} := {ω ∈ Ω : x (ω ) ≤ a} ⊂ {ω ∈ Ω : x (ω ) ≤ b} =: { x ≤ b}

where the inclusion ⊂ holds because if ω is such that x (ω ) ≤ a, then, since a ≤ b,


we also have x (ω ) ≤ b. (Any ω in the left-hand side is also in the right-hand side.)
The result in (2.4) now follows from monotonicity of P (fact 2.1.1 on page 49).

Example 2.2.2. Recall example 2.1.4, with sample space

Ω := {(b1 , . . . , b N ) : bn ∈ {0, 1} for each n}

The set of events and probability were defined as $\mathcal{F} :=$ all subsets of Ω and $P(A) := 2^{-N}(\#A)$. Consider a random variable x on Ω that returns the first element of any
given sequence. That is,
x (ω ) = x (b1 , . . . , b N ) = b1
The probability that x = 1 is 1/2. Indeed,

P{ x = 1} := P{ω ∈ Ω : x(ω ) = 1} = P{(b1 , . . . , bN ) : b1 = 1}


The number of length N binary sequences with $b_1 = 1$ is $2^{N-1}$, so $P\{x = 1\} = 1/2$.

2.2.2 Finite Random Variables

Aside from trivial random variables that take one value no matter what happens
(and hence provide no information), the simplest kind of random variables are bi-
nary or Bernoulli random variables. A binary random variable is a random variable
that takes values in {0, 1}. While binary random variables are by themselves rather
limited, in fact they play a role analogous to basis vectors in R N . In particular, a
great variety of random variables can be constructed as linear combinations of bi-
nary random variables. Things like expectation of a random variable can then be
defined in terms of these primitive components. This section explores these ideas.
There is a generic way to create binary random variables, using indicator functions.
If Q is a statement, such as “on the planet Uranus, there exists a tribe of three-headed
monkeys,” then 1{ Q} is considered as equal to 1 when the statement Q is true, and
zero when the statement Q is false. Hence, 1{ Q} is a binary indicator of the truth of
the statement Q. Another common variation on the notation is, for arbitrary C ⊂ Ω,
$$\mathbf{1}_C(\omega) := \mathbf{1}\{\omega \in C\} := \begin{cases} 1 & \text{if } \omega \in C \\ 0 & \text{otherwise} \end{cases}$$

Note that 1C is a binary random variable. In fact, when you think about it, any
binary random variable has the form

x ( ω ) = 1C ( ω ) : = : 1 { ω ∈ C } (2.5)

where C is some subset of Ω. Indeed, if x is any binary random variable, we can


write x in the form of (2.5) by setting C := {ω ∈ Ω : x (ω ) = 1}.
From binary random variables we can create finite random variables. A finite ran-
dom variable is a random variable with finite range (see §4.2 for the definition of
range). We can create finite random variables by taking linear combinations of bi-
nary random variables. For example, let A and B be disjoint subsets of Ω. The
random variable x defined by

x ( ω ) = s1 A ( ω ) + t1 B ( ω ) (2.6)

is a finite random variable taking the value s when ω falls in A, t when ω falls in B,
and zero otherwise. Figure 2.1 shows a graph of x when Ω = R.
In fact any finite random variable can be expressed as the linear combinations of
binary random variables. Thus we take as our generic expression for a finite random

Figure 2.1: Finite random variable $x(\omega) = s\mathbf{1}_A(\omega) + t\mathbf{1}_B(\omega)$

variable the expression
$$x(\omega) = \sum_{j=1}^{J} s_j \mathbf{1}_{A_j}(\omega) \tag{2.7}$$

Whenever we work with this expression we will always assume that

• the scalars s1 , . . . , s J are distinct, and


• the sets A1 , . . . , A J form a partition of Ω.

By a partition we mean that Ai ∩ A j = ∅ when i 6= j, and ∪ j A j = Ω.

Fact 2.2.1. If x is the random variable in (2.7), then, for all j,

1. x (ω ) = s j if and only if ω ∈ A j .

2. { x = s j } = A j .

3. P{ x = s j } = P( A j ).

Convince yourself of these results before continuing. (The second two statements
follow from the first.)

2.2.3 Expectations

Our next task is to define expectations of random variables and state their basic
properties. The idea behind expectations is that they give a kind of weighted aver-
age value, defined as the sum of all possible values of of the variable, weighted by
their probabilities. Let’s start with the finite case, where things are clearest. Letting
x be as defined in (2.7), the expectation of x is
$$E[x] := \sum_{j=1}^{J} s_j P(A_j) \tag{2.8}$$

To understand this expression, recall from fact 2.2.1 that P( A j ) is also P{ x = s j }, so


we can also write
$$E[x] = \sum_{j=1}^{J} s_j P\{x = s_j\}$$

As required, the expectation is the sum of the different values that x may take,
weighted by their probabilities.
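For instance, the expected value of the face shown by the fair dice of example 2.1.2 can be computed directly from this formula; a short Python sketch:

    # values s_j and probabilities P{x = s_j} for a fair dice
    values = [1, 2, 3, 4, 5, 6]
    probs = [1/6] * 6

    E_x = sum(s * p for s, p in zip(values, probs))
    print(E_x)    # 3.5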
We can already make some important observations.
Fact 2.2.2. E [1 A ] = P( A) for any A ∈ F .

Fact 2.2.2 seems trivial but in fact it represents the fundamental link between prob-
abilities and expectations. To see why it holds, observe that 1 A (ω ) = 1 × 1 A (ω ) +
0 × 1 Ac (ω ). Applying (2.8), we get E [1 A ] = 1 × P( A) + 0 × P( Ac ) = P( A).
Fact 2.2.3. If α ∈ R, then E [α] = α.

The right way to understand this is to take α to be the constant random variable
α1Ω . From (2.8), the expectation of this constant is E [α] := E [α1Ω ] = αP(Ω) = α.
How about expectation for arbitrary random variables, such as those with infinite
range? The same mode of definition doesn’t work but this presents no major prob-
lem because any random variable can be well approximated by finite random vari-
ables. A finite approximation xn to an arbitrary random variable x is shown in
figure 2.2. This approximation can be improved without limit if we allow the finite
approximation to take a larger and larger number of distinct values.
Hence we can take a sequence { xn } of finite random variables converging to any
selected arbitrary random variable x. The expectation of x is then defined as

$$E[x] := \lim_{n \to \infty} E[x_n] \tag{2.9}$$
Figure 2.2: Finite approximation to a general random variable

To complete the definition we need to make sure that this limit makes sense. There
are a couple of issues that arise here. Suppose first that x is nonnegative (i.e., x (ω ) ≥
0 for all ω ∈ Ω). It turns out that the sequence { xn } approximating x can then
be chosen so that E [ xn ] is monotone increasing, and therefore must converge to
something. While that something can be +∞, this causes no real problem as we just
say that E [ x ] = ∞.
If x is not nonnegative then some thought will convince you that we can write it as $x = x^+ - x^-$, where $x^+(\omega) := \max\{x(\omega), 0\}$ and $x^-(\omega) := -\min\{x(\omega), 0\}$.
Both x + and x − are nonnegative random variables. The expectation can then be de-
fined as E [ x ] = E [ x + ] − E [ x − ]. The only issue is that this expression might have
the form ∞ − ∞, which is not allowed. Hence for random variables that are not
necessarily nonnegative, we usually restrict attention to integrable random vari-
ables. An integrable random variable is a random variable x such that E [| x |] < ∞.
Since x + ≤ | x | and x − ≤ | x | this is enough to ensure that both E [ x + ] < ∞ and
E [ x− ] < ∞. Hence E [ x] = E [ x+ ] − E [ x− ] is well defined.
To be careful we should also check that the value in (2.9) does not depend on the
particular approximating sequence { xn } that we choose. This and other related
technical details can be found in a text such as Williams (1991). In that reference the
following facts are also established:

Fact 2.2.4. If x and y are integrable random variables and α and β are any constants,

then
E [αx + βy] = αE [ x] + βE [y]

This is called linearity of expectations. Here αx + βy should be understood as the


random variable (αx + βy)(ω ) := αx (ω ) + βy(ω ). Exercises 2.8.10 and 2.8.11 ask
you to check fact 2.2.4 in some special cases.

Fact 2.2.5. If x, y are integrable random variables and x ≤ y, then E [ x ] ≤ E [y].

This is called monotonicity of expectations. The statement x ≤ y means that x is


less than y for any realization of uncertainty. Formally, x (ω ) ≤ y(ω ) for all ω ∈ Ω.
Exercise 2.8.20 asks you to prove fact 2.2.5 in one special case.
Let x be a random variable and let k ∈ N. If E [| x |k ] < ∞ then the k-th moment of x
is said to exist, and is defined as the value E [ x k ]. For some random variables even
the first moment does not exist. For others every moment exists.
If x is a random variable with finite second moment, then the variance of x is

var[ x ] := E [( x − E [ x ])2 ]

This gives a measure of the dispersion of x. The standard deviation of x is often written as $\sigma_x$ and defined as
$$\sigma_x := \sqrt{\mathrm{var}[x]}$$

Fact 2.2.6. If the k-th moment of x exists, then so does the j-th moment for all j ≤ k.

Fact 2.2.7. If x and y are random variables with finite second moment, then
$$|E[xy]| \leq \sqrt{E[x^2]\, E[y^2]} \tag{2.10}$$

Fact 2.2.8. For any nonnegative random variable x and any δ > 0, we have
$$P\{x \geq \delta\} \leq \frac{E[x]}{\delta} \tag{2.11}$$

Fact 2.2.7 is called the Cauchy-Schwarz inequality for random variables, while
(2.11) is called Chebyshev’s inequality. A common variation of Chebyshev’s in-
equality is the bound
$$P\{|x| \geq \delta\} \leq \frac{E[x^2]}{\delta^2} \tag{2.12}$$
Exercise 2.8.36 asks you to check (2.11) and (2.12).
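A rough Monte Carlo illustration of the bound (2.12), assuming NumPy is available (the choice of distribution, sample size and δ is arbitrary):

    import numpy as np

    rng = np.random.default_rng(1234)
    x = rng.standard_normal(1_000_000)      # draws of a random variable with E[x^2] = 1

    delta = 2.0
    lhs = np.mean(np.abs(x) >= delta)       # estimate of P{|x| >= delta}
    rhs = np.mean(x**2) / delta**2          # estimate of E[x^2] / delta^2

    print(lhs, rhs)                          # lhs is (much) smaller than rhs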

In concluding this section, let us agree that to avoid repetition, we will assume that
every random variable introduced below is integrable unless stated otherwise. Also,
only random variables with finite second moment have a well defined variance, but
in what follows we will often talk about the variance of a given random variable
without adding the caveat “assuming it exists.”

2.3 Distributions

Distributions summarize the probabilities of different outcomes for random vari-


ables and help us compute expectations. In this section, we describe the link be-
tween random variables and their distributions.

2.3.1 Distribution Functions

Let x be a random variable on some probability space (Ω, F , P), and consider the
function F defined by

F ( s ) : = P{ x ≤ s } : = P{ ω ∈ Ω : x ( ω ) ≤ s } ( s ∈ R) (2.13)

It can be shown that, for any choice of random variable x, this function always has
the following properties:

1. right-continuity: F (sn ) ↓ F (s) whenever sn ↓ s

2. monotonicity: s ≤ s0 implies F (s) ≤ F (s0 )

3. lims→−∞ F (s) = 0 and lims→∞ F (s) = 1

For example, monotonicity is immediate from (2.4) on page 55. (The other properties
are a bit trickier to prove. See, e.g, Williams, 1991.)
Any function F : R → [0, 1] satisfying conditions 1–3 is called a cumulative dis-
tribution function or cdf on R. Thus, to each random variable, we can associate
a unique cdf on R. We say that F is the cdf of x, or, alternatively, that F is the
distribution of x, and write x ∼ F.

Example 2.3.1. The function F (s) = arctan(s)/π + 1/2 is a cdf—one variant of the
Cauchy distribution. A plot is given in figure 2.3.
Figure 2.3: Cauchy cdf

It’s worth noting that the following is also true: For every cdf F, there exists a prob-
ability space (Ω, F , P) and a random variable x : Ω → R such that the distribution
of x is F. Exercise 2.8.16 gives some hints on how one construction works.
Fact 2.3.1. If x ∼ F, then P{ a < x ≤ b} = F (b) − F ( a) for any a ≤ b.

Indeed, if a ≤ b, then { a < x ≤ b} = { x ≤ b} \ { x ≤ a} and { x ≤ a} ⊂ { x ≤ b}.


Applying fact 2.1.1 on page 49 gives fact 2.3.1.
A cdf F is called symmetric if F (−s) = 1 − F (s) for all s ∈ R.5 The proof of the next
fact is an exercise (exercise 2.8.14).
Fact 2.3.2. Let F be a cdf and let x ∼ F. If F is symmetric and P{ x = s} = 0 for all
s ∈ R, then the distribution F| x| of the absolute value | x | is given by

F| x| (s) := P{| x | ≤ s} = 2F (s) − 1 ( s ≥ 0)

One often needs to obtain the cdf of the transform of a random variable. This is easy
in the monotone case. For example, if x ∼ F and y = exp( x ), then the cdf of y is
G (s) := F (ln(s)), because

P{y ≤ s} = P{exp( x) ≤ s} = P{ x ≤ ln(s)} = F(ln(s)) =: G(s)


Note how monotonicity is used in the second equality.
5 Thus,the probability that x ≤ −s is equal to the probability that x > s. Centered normal distri-
butions and t-distributions have this property.

2.3.2 Densities and PMFs

Every random variable has a well-defined cdf via (2.13). However, as representa-
tions of distributions, cdfs have some disadvantages. For example, plotting cdfs
is a poor way to convey information about probabilities. The amount of probability
mass in different regions is determined by the slope of the cdf. Research shows that
humans are poor at extracting quantitative information from slopes. They do much
better with heights, which leads us into our discussion of densities and probability
mass functions (pmfs). Densities and pmfs correspond to two different, mutually
exclusive cases. The density case arises when the increase of the cdf in question is
smooth, and contains no jumps. The pmf case arises when the increase consists of
jumps alone.
Let’s have a look at these two situations, starting with the second. The pure jump
case occurs when the cdf represents a discrete random variable. To understand this,
suppose that x takes values s1 , . . . , s J . Let p j := P{ x = s j }. We then have 0 ≤ p j ≤ 1
for each j, and $\sum_{j=1}^{J} p_j = 1$ (exercise 2.8.15). A finite collection of numbers $p_1, \ldots, p_J$
such that 0 ≤ p j ≤ 1 and p1 + · · · + p J = 1 is called a probability mass function
(pmf). The cdf corresponding to this random variable is
$$F(s) := \sum_{j=1}^{J} \mathbf{1}\{s_j \leq s\}\, p_j \tag{2.14}$$

How do we arrive at this expression? Because, for this random variable,
$$P\{x \leq s\} = P\left( \bigcup_{j \,:\, s_j \leq s} \{x = s_j\} \right) = \sum_{j \,:\, s_j \leq s} P\{x = s_j\} = \sum_{j=1}^{J} \mathbf{1}\{s_j \leq s\}\, p_j$$
Visually, F is a step function, with a jump up of size p j at point s j . Figure 2.4 gives
an example with J = 2. The cdf is right continuous but not continuous.
The other case of interest is the density case. A density is a nonnegative function
p that integrates to 1. For example, suppose that F is a smooth cdf, so that the
derivative F′ exists. Let p := F′. By the fundamental theorem of calculus, we then have

∫_r^s p(t) dt = ∫_r^s F′(t) dt = F(s) − F(r)

From this equality and the properties of cdfs, we can see that p is nonnegative and ∫_{−∞}^{+∞} p(s) ds = 1. In other words, p is a density. Also, taking the limit as r → −∞ we
Figure 2.4: Discrete cdf

obtain

F(s) = ∫_{−∞}^s p(t) dt

which tells us that F can be recovered from p.
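As a concrete check (our own sketch, reusing the Cauchy case from example 2.3.1), the snippet below integrates the Cauchy density p(t) = 1/[π(1 + t²)] numerically up to s and compares the result with F(s) = arctan(s)/π + 1/2:

import numpy as np
from scipy.integrate import quad

def p(t):
    return 1.0 / (np.pi * (1.0 + t**2))    # Cauchy density

def F(s):
    return np.arctan(s) / np.pi + 0.5      # Cauchy cdf

for s in (-1.0, 0.0, 2.0):
    integral, error = quad(p, -np.inf, s)  # numerical integral of p from -inf to s
    print(s, integral, F(s))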
Not every random variable has a density. The exact necessary and sufficient condi-
tion for a density to exist is that F is “absolutely continuous.” This is a smoothness
condition, an important special case of which is differentiability.6 If F is absolutely
continuous and p : R → [0, ∞) satisfies
F(s) = ∫_{−∞}^s p(t) dt        for all s ∈ R

then p is called the density corresponding to F.

Fact 2.3.3. If x has a density, then P{ x = s} = 0 for all s ∈ R.

As discussed above, cdfs are useful because every random variable has one, but
pmfs and densities are nicer to work with, and visually more informative. For ex-
ample, consider figure 2.5, which shows the density corresponding to the Cauchy
cdf in figure 2.3. Information about probability mass is now conveyed by height
rather than slope, which is easier for us humans to digest.
6 In elementary texts, random variables with densities are often called “continuous random vari-
ables.” This terminology is confusing because “continuous” here has nothing to do with the usual
definition of continuity of functions.
Figure 2.5: Cauchy density

2.3.3 The Quantile Function

Let F be any cdf on R. Suppose that F is strictly increasing, so that the inverse
function F −1 exists:

F −1 (q) := the unique s such that F (s) = q (0 < q < 1) (2.15)

The inverse of the cdf is called the quantile function, and has many applications
in probability and statistics. For example, the quantile function associated with the
Cauchy cdf in example 2.3.1 is F −1 (q) = tan[π (q − 1/2)]. See figure 2.6.
Things are a bit more complicated when F is not strictly increasing, as the inverse
F^{-1} is not well defined. (If F is not strictly increasing, then there exist at least two distinct points s and s′ such that F(s) = F(s′).) This problem is negotiated by setting

F −1 (q) := inf{s ∈ R : F (s) ≥ q} (0 < q < 1)

This expression is a bit more complicated, but in the case where F is strictly increas-
ing, it reduces to (2.15).
The value F −1 (1/2) is called the median of F. It gives an alternative measure of
central tendency (alternative to the mean).
The quantile function features in hypothesis testing, where it can be used to define
critical values. An abstract version of the problem is as follows: Let x ∼ F, where F is
strictly increasing, differentiable (so that a density exists and x puts no probability
mass on any one point) and symmetric. Given α ∈ (0, 1), we want to find the c
Figure 2.6: Cauchy quantile function

such that P{−c ≤ x ≤ c} = 1 − α (see figure 2.7). The solution is given by c := F^{-1}(1 − α/2). That is,

c = F^{-1}(1 − α/2)    =⇒    P{|x| ≤ c} = 1 − α        (2.16)

To see this, fix α ∈ (0, 1). From fact 2.3.2, we have

P{| x| ≤ c} = 2F(c) − 1 = 2F[ F−1 (1 − α/2)] − 1 = 1 − α


In the case where F is the standard normal cdf Φ, this value c is usually denoted by
zα/2 . We will adopt the same notation:

zα/2 := Φ−1 (1 − α/2) (2.17)
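For example (a small sketch of ours, not part of the original text), scipy's norm.ppf plays the role of Φ^{-1}, so z_{α/2} can be computed directly and the identity P{|x| ≤ c} = 1 − α verified:

from scipy.stats import norm

alpha = 0.05
c = norm.ppf(1 - alpha / 2)          # z_{alpha/2}, approximately 1.96
prob = norm.cdf(c) - norm.cdf(-c)    # P{-c <= x <= c} for x ~ N(0, 1)
print(c, prob)                       # prob equals 1 - alpha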

2.3.4 Expectations from Distributions

Until now, we’ve been calculating expectations using the expectation operator E ,
which was defined from a given probability P in §2.2.3. One of the most useful
facts about distributions is that they encode all the information necessary to calcu-
late E [ x ]. For a full treatment of this topic you can consult a text such as Williams
(1991). Here we’ll stick to noting down the main facts. In all of what follows, h is an
arbitrary function from R to R.
Figure 2.7: Finding critical values

Fact 2.3.4. If x is a discrete random variable taking values s_1, . . . , s_J with probabilities p_1, . . . , p_J, then

E[h(x)] = ∑_{j=1}^J h(s_j) p_j        (2.18)

On the other hand, if x has density p, then

E[h(x)] = ∫_{−∞}^∞ h(s) p(s) ds        (2.19)

It’s convenient to have a piece of notation that captures both of these cases. Accordingly, if x ∼ F, then we will write

E[h(x)] = ∫ h(s) F(ds)

The way you should understand this expression is that when F is differentiable with derivative p = F′, then ∫ h(s) F(ds) is defined as ∫_{−∞}^∞ h(s) p(s) ds. If, on the other hand, F is the step function F(s) = ∑_{j=1}^J 1{s_j ≤ s} p_j corresponding to the discrete random variable in fact 2.3.4, then ∫ h(s) F(ds) is defined as ∑_{j=1}^J h(s_j) p_j.

Just for the record, for a given cdf F, the expression ∫ h(s) F(ds) has its own precise definition, as the “Lebesgue-Stieltjes” integral of h with respect to F. In the special cases where F is discrete or differentiable, one can prove that ∫ h(s) F(ds) reduces to (2.18) or (2.19) respectively. For details see, e.g., Williams (1991).


As an exercise, let’s prove the first part of fact 2.3.4. This is the discrete case, where x = ∑_{j=1}^J s_j 1_{A_j} and p_j := P{x = s_j} = P(A_j). As usual, the values {s_j} are distinct and the sets {A_j} are a partition of Ω. Let h : R → R. You should be able to convince yourself that

h(x(ω)) = ∑_{j=1}^J h(s_j) 1_{A_j}(ω)

(Pick an arbitrary A_j and check that the left- and right-hand sides are equal when ω ∈ A_j.) This is a discrete random variable, which we can take the expectation of using (2.8) on page 58. We get

E[h(x)] = ∑_{j=1}^J h(s_j) P(A_j) = ∑_{j=1}^J h(s_j) p_j

Equation (2.18) is confirmed.
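To make (2.18) and (2.19) concrete, here is a small sketch (ours, with an arbitrary made-up pmf and the standard normal density as examples) that computes E[h(x)] for h(s) = s² in both cases:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

h = lambda s: s**2

# Discrete case, as in (2.18): values s_j with probabilities p_j
s_vals = np.array([-1.0, 0.0, 2.0])
p_vals = np.array([0.25, 0.5, 0.25])
discrete_expectation = np.sum(h(s_vals) * p_vals)

# Density case, as in (2.19): x ~ N(0, 1), so E[x^2] should be close to 1
density_expectation, _ = quad(lambda s: h(s) * norm.pdf(s), -np.inf, np.inf)

print(discrete_expectation, density_expectation)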

2.3.5 Common Distributions

Let’s list a few well-known distributions. First, given a < b, the uniform distribution on interval [a, b] is the distribution associated with the density

p(s; a, b) := 1/(b − a)        (a ≤ s ≤ b)

(If s < a or s > b, then p(s; a, b) := 0.) The mean is

∫_a^b s p(s; a, b) ds = (a + b)/2

The univariate normal density or Gaussian density is a function p of the form

p(s) := p(s; µ, σ) := (2πσ²)^{−1/2} exp( −(s − µ)² / (2σ²) )

for some µ ∈ R and σ > 0. We represent this distribution symbolically by N(µ, σ²). The distribution N(0, 1) is called the standard normal distribution.

It is well-known that if x ∼ N (µ, σ2 ), then E [ x ] = µ, and var[ x ] = σ2 . Hence the


two parameters separately define the mean and the variance (or standard deviation),
and this is one of many attractive features of the distribution. Here’s another:

Fact 2.3.5. If x1 , . . . , x N are normally distributed and α0 , . . . , α N are any constants,


then α0 + ∑nN=1 αn xn is also normally distributed.

In other words, linear combinations of normals are normal. This is a very useful fact, telling us that linear models and normal distributions play very well together.7
The chi-squared distribution with k degrees of freedom is the distribution with
density

p(s; k) := s^{k/2 − 1} e^{−s/2} / ( 2^{k/2} Γ(k/2) )        (s ≥ 0)
where Γ is the Gamma function (details omitted). If x has a distribution described
by this density, then we write x ∼ χ2 (k ).
Student’s t-distribution with k degrees of freedom, or, more simply, the t-distribution
with k degrees of freedom, is the distribution on R with density
p(s; k) := [ Γ((k + 1)/2) / ( (kπ)^{1/2} Γ(k/2) ) ] (1 + s²/k)^{−(k+1)/2}

The F-distribution with parameters k1 , k2 is the distribution with the unlikely look-
ing density

p(s; k_1, k_2) := √( (k_1 s)^{k_1} k_2^{k_2} / (k_1 s + k_2)^{k_1 + k_2} ) / ( s B(k_1/2, k_2/2) )        (s ≥ 0)

where B is the Beta function (details omitted). The F-distribution arises in certain
hypothesis tests, some of which we will examine later.
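All of the distributions above are implemented in scipy.stats, which saves us from coding the densities by hand. The following sketch (ours; the parameter values are arbitrary) evaluates a few of them and prints their means and their densities at s = 1:

from scipy import stats

dists = {
    "uniform on [2, 5]": stats.uniform(loc=2, scale=3),  # a = 2, b = 5
    "N(1, 4)":           stats.norm(loc=1, scale=2),     # mu = 1, sigma = 2
    "chi-squared(5)":    stats.chi2(5),
    "t(10)":             stats.t(10),
    "F(3, 20)":          stats.f(3, 20),
}

for name, d in dists.items():
    print(name, d.mean(), d.pdf(1.0))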

2.4 Joint Distributions and Independence

[roadmap]
7 We should be a bit careful here—what if αn = 0 for all n? To save ourselves from embarrassment
we can declare a random variable concentrated on a point to be a normal random variable with “zero
variance”.

2.4.1 Distributions Across Random Variables

Consider a collection of N random variables x1 , . . . , x N . For each individual random


variable xn : Ω → R, the distribution Fn of xn is

Fn (s) := P{ xn ≤ s} (−∞ < s < ∞) (2.20)

This distribution tells us about the random properties of xn viewed as a single entity.
But we often want to know about the relationships between the variables x1 , . . . , x N ,
and outcomes for the group of variables as a whole. To quantify these things, we
define the joint distribution of x1 , . . . , x N to be

F ( s1 , . . . , s N ) : = P{ x1 ≤ s1 , . . . , x N ≤ s N } (−∞ < sn < ∞ ; n = 1, . . . , N )

In this setting, the distribution Fn of xn is sometimes called the marginal distribu-


tion, in order to distinguish it from the joint distribution.
The joint density of x1 , . . . , x N , if it exists, is a function p : R N → [0, ∞) satisfying
∫_{−∞}^{t_N} · · · ∫_{−∞}^{t_1} p(s_1, . . . , s_N) ds_1 · · · ds_N = F(t_1, . . . , t_N)        (2.21)

for all tn ∈ R, n = 1, . . . , N.
Typically, the joint distribution cannot be determined from the N marginal distri-
butions alone, since the marginals do not tell us about the interactions between the
different variables. One special case where we can recover the joint from the marginals
is when there is no interaction. This is called independence, and we treat it in the
next section.
From joint densities we can construct conditional densities. The conditional density
of xk+1 , . . . , x N given x1 = s1 , . . . , xk = sk is defined by

p(s_{k+1}, . . . , s_N | s_1, . . . , s_k) := p(s_1, . . . , s_N) / p(s_1, . . . , s_k)        (2.22)

Rearranging this expression we obtain a decomposition of the joint density:

p ( s 1 , . . . , s N ) = p ( s k +1 , . . . , s N | s 1 , . . . , s k ) p ( s 1 , . . . , s k ) (2.23)

This decomposition is useful in many situations.



2.4.2 Independence

Let x1 , . . . , x N be a collection of random variables with xn ∼ Fn , where Fn is a cdf.


The variables x_1, . . . , x_N are called identically distributed if F_n = F_m for all n, m. They
are called independent if, given any s1 , . . . , s N , we have

P{ x1 ≤ s1 , . . . , x N ≤ s N } = P{ x1 ≤ s1 } × · · · × P{ x N ≤ s N } (2.24)

Equivalently, if F is the joint distribution of x1 , . . . , x N and Fn is the marginal distri-


bution of x_n, then independence states that

F(s_1, . . . , s_N) = F_1(s_1) × · · · × F_N(s_N) = ∏_{n=1}^N F_n(s_n)

We use the abbreviation IID for collections of random variables that are both inde-
pendent and identically distributed.

Example 2.4.1. Consider a monkey throwing darts at a dartboard. Let x denote the
horizontal location of the dart relative to the center of the board, and let y denote
the vertical location. (For example, if x = −1 and y = 3, then the dart is 1cm to the
left of the center, and 3cm above.) At first pass, we might suppose that x and y are
independent and identically distributed.

Fact 2.4.1. If x_1, . . . , x_M are independent and E|x_m| is finite for each m, then

E[ ∏_{m=1}^M x_m ] = ∏_{m=1}^M E[x_m]

We won’t prove the last fact in the general case, as this involves measure theory.
However, we can illustrate the idea by showing that E [ xy] = E [ x ]E [y] when x
and y are independent and defined by (2.40). In this case, it can be shown (details
omitted) that the random variables x and y are independent precisely when the
events A and B are independent. Now observe that

( xy)(ω ) := x (ω )y(ω ) = s1{ω ∈ A}t1{ω ∈ B} = st1{ω ∈ A ∩ B}

Hence, by the definition of expectations, we have

E [ xy] = stP( A ∩ B) = stP( A)P( B) = sP( A)tP( B) = E [ x]E [y]


Fact 2.4.2. If x and y are independent and g and f are any functions, then f ( x ) and
g(y) are independent.

An important special case of the “independence means multiply” rule is as follows.

Fact 2.4.3. If random variables x1 , . . . , x N are independent, and each has density pn ,
then the joint density p exists, and is the product of the marginal densities:

p(s_1, . . . , s_N) = ∏_{n=1}^N p_n(s_n)

Here are some useful facts relating independence and certain common distributions.
Fact 2.4.4. If x_1, . . . , x_k are IID with each x_i ∼ N(0, 1), then

Q := ∑_{i=1}^k x_i² ∼ χ²(k)

Fact 2.4.5. If Q1 , . . . , Q J are independent with Q j ∼ χ2 (k j ), then ∑ jJ=1 Q j ∼ χ2 (∑ j k j ).

Fact 2.4.6. If Z and Q are two random variables such that

1. Z ∼ N (0, 1),

2. Q ∼ χ2 (k), and

3. Z and Q are independent,

then Z (k/Q)1/2 has the t-distribution with k degrees of freedom.

Fact 2.4.7. If Q_1 ∼ χ²(k_1) and Q_2 ∼ χ²(k_2) are independent, then

(Q_1/k_1) / (Q_2/k_2)

is distributed as F(k_1, k_2).

2.4.3 Covariance

The covariance of random variables x and y is defined as

cov[ x, y] := E [( x − E [ x ])(y − E [y])]



Fact 2.4.8. If x_1, . . . , x_N are random variables and α_1, . . . , α_N are constant scalars, then

var[ ∑_{n=1}^N α_n x_n ] = ∑_{n=1}^N α_n² var[x_n] + 2 ∑_{n<m} α_n α_m cov[x_n, x_m]

In particular, if α and β are real numbers and x and y are random variables, then
var[α] = 0,8 var[α + βx ] = β2 var[ x ], and

var[αx + βy] = α2 var[ x ] + β2 var[y] + 2αβ cov[ x, y]

Given two random variables x and y with finite variances σx2 and σy2 respectively, the
correlation of x and y is defined as

corr[x, y] := cov[x, y] / (σ_x σ_y)

If corr[ x, y] = 0, we say that x and y are uncorrelated. For this to occur, it is nec-
essary and sufficient that cov[ x, y] = 0. Positive correlation means that corr[ x, y] is
positive, while negative correlation means that corr[ x, y] is negative. The first part
of the next fact follows immediately from fact 2.2.7, while the second is just algebra.

Fact 2.4.9. Given any two random variables x, y and positive constants α, β, we have

−1 ≤ corr[ x, y] ≤ 1 and corr[αx, βy] = corr[ x, y]

Fact 2.4.10. If x and y are independent, then cov[ x, y] = corr[ x, y] = 0.

Note that the converse is not true: One can construct examples of dependent ran-
dom variables with zero covariance.

2.4.4 Best Linear Predictors

Let’s consider the problem of predicting the value of a random variable y given
knowledge of the value of a second random variable x (and also knowledge of
the underlying probability distributions, which makes this a problem in probability
8 Here var[α] should be understood as var[α1{ω ∈ Ω}], as was the case when we discussed
fact 2.2.3 on page 58.

rather than statistics). Thus, we seek a function f such that f ( x ) is close to y on av-
erage. To measure the “average distance” between f ( x ) and y, we will use the mean
squared deviation between f ( x ) and y, which is

E [(y − f ( x))2 ]
As we will learn in chapter 3, the minimizer of the mean squared deviation over all
functions of x is obtained by choosing f ( x ) = E [y | x ], where the right-hand side is
the conditional expectation of y given x. However, the conditional expectation may
be nonlinear and complicated, so let’s now consider the simpler problem of finding
a good predictor of y within a small and well-behaved class of functions. The class
of functions we will consider is the set of “linear” functions

L := { all functions of the form `( x ) = α + βx }

(While elementary courses refer to these functions as linear, in fact they are not linear
unless α = 0 (see §1.2.1). The class of functions L is more correctly known as the set
of affine functions.) Thus, we consider the problem

min_{ℓ∈L} E[(y − ℓ(x))²] = min_{α,β∈R} E[(y − α − βx)²]        (2.25)
Expanding the square on the right-hand side and using linearity of E , the objective
function becomes

ψ(α, β) := E [y2 ] − 2αE [y] − 2βE [ xy] + 2αβE [ x ] + α2 + β2 E [ x2 ]

Computing the derivatives and solving the equations

∂ψ(α, β)/∂α = 0        and        ∂ψ(α, β)/∂β = 0

we obtain (exercise 2.8.29) the minimizers

β∗ := cov[x, y] / var[x]        and        α∗ := E[y] − β∗ E[x]        (2.26)

The best linear predictor is therefore

ℓ∗(x) := α∗ + β∗ x

If you’ve studied elementary linear least squares regression before, you will realize
that α∗ and β∗ are the “population” counterparts for the coefficient estimates in the
regression setting. We’ll talk more about the connections below.
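As a quick illustration of (2.26) (a sketch of ours, with a simulated joint distribution chosen only for the example), we can compute sample analogues of α∗ and β∗ and compare them with the coefficients from an ordinary least squares fit of y on x:

import numpy as np

np.random.seed(42)
N = 100000
x = np.random.randn(N)
y = 2.0 + 1.5 * x + np.random.randn(N)   # simple linear data generating process

beta_star = np.cov(x, y, ddof=0)[0, 1] / np.var(x)  # sample analogue of cov[x, y] / var[x]
alpha_star = y.mean() - beta_star * x.mean()

slope, intercept = np.polyfit(x, y, 1)   # least squares fit of y on x
print(alpha_star, beta_star)
print(intercept, slope)

The two pairs of numbers should be essentially identical, anticipating the regression connection mentioned above.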

2.5 Asymptotics

[Roadmap]

2.5.1 Modes of Convergence

Let {x_n}_{n=1}^∞ be a sequence of random variables. We say that {x_n}_{n=1}^∞ converges to
random variable x in probability if
random variable x in probability if

for any δ > 0, P{| xn − x| > δ} → 0 as n → ∞


In symbols, this convergence is represented by x_n →^p x. In almost all the applications we consider, the limit x will be a constant. The next example illustrates the definition for this case.

Example 2.5.1. If x_n ∼ N(α, 1/n), then x_n →^p α. That is, for any δ > 0, we have P{|x_n − α| > δ} → 0. Fixing δ > 0, the probability P{|x_n − α| > δ} is shown in figure 2.8 for two different values of n, where it corresponds to the size of the shaded areas. This probability collapses to zero as n → ∞, because the variance 1/n shrinks and the density becomes more peaked around α.

A full proof of the convergence result in example 2.5.1 can be found by looking at the
normal density and bounding tail probabilities. However, a much simpler proof can
also be obtained by exploiting the connection between convergence in probability
and convergence in mean squared error. The details are below.
Fact 2.5.1. Regarding convergence in probability, the following statements are true:

1. If g : R → R is continuous at x and x_n →^p x, then g(x_n) →^p g(x).

2. If x_n →^p x and y_n →^p y, then x_n + y_n →^p x + y and x_n y_n →^p xy.

3. If x_n →^p x and α_n → α, then x_n + α_n →^p x + α and x_n α_n →^p xα.9

We say that { xn } converges to x in mean square if

E [( xn − x)2 ] → 0 as n → ∞ (2.27)
In symbols, this convergence is represented by x_n →^ms x.
9 Here {αn } is a nonrandom scalar sequence.
Figure 2.8: P{|x_n − α| > δ} for x_n ∼ N(α, 1/n); panel (a) shows n = 10, panel (b) shows n = 20

Fact 2.5.2. Regarding convergence in mean square, the following statements are
true:

1. If x_n →^ms x, then x_n →^p x.

2. If α is constant, then x_n →^ms α if and only if E[x_n] → α and var[x_n] → 0.

Part 1 of fact 2.5.2 follows from Chebyshev’s inequality (page 60). Using monotonic-
ity of P and then applying (2.12) to xn − x, we obtain

P{|x_n − x| > δ} ≤ P{|x_n − x| ≥ δ} ≤ E[(x_n − x)²] / δ²
Part 1 of fact 2.5.2 follows. Part 2 is implied by the equality

E [( xn − α)2 ] = var[ xn ] + (E [ xn ] − α)2


Verification of this equality is an exercise.
Example 2.5.2. In example 2.5.1, we stated that if x_n ∼ N(α, 1/n), then x_n →^p α. This
follows from parts 1 and 2 of fact 2.5.2, since E[x_n] = α and var[x_n] = 1/n → 0.
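The shrinking tail probabilities in figure 2.8 can also be checked by simulation. The sketch below (ours; the values of α, δ and n are arbitrary) estimates P{|x_n − α| > δ} by drawing from N(α, 1/n) and computing the fraction of draws outside the band:

import numpy as np

np.random.seed(0)
alpha, delta = 2.0, 0.1
num_draws = 100000

for n in (10, 100, 1000):
    xn = alpha + np.random.randn(num_draws) / np.sqrt(n)  # draws from N(alpha, 1/n)
    print(n, np.mean(np.abs(xn - alpha) > delta))

The printed fractions fall towards zero as n grows.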

Figure 2.9: t-distribution with k df converges to N(0, 1) as k → ∞ (curves shown for k = 1, 2, 10, 100 and the standard normal)

Let {F_n}_{n=1}^∞ be a sequence of cdfs, and let F be a cdf. We say that F_n converges
weakly to F if, for any s such that F is continuous at s, we have
weakly to F if, for any s such that F is continuous at s, we have

Fn (s) → F (s) as n→∞

Example 2.5.3. It is well-known that the cdf of the t-distribution with k degrees
of freedom converges to the standard normal cdf as k → ∞. This convergence is
illustrated in figure 2.9.

Sometimes densities are easier to work with than cdfs. In this connection, note that
pointwise convergence of densities implies weak convergence of the corresponding
distribution functions:

Fact 2.5.3. Let {F_n}_{n=1}^∞ be a sequence of cdfs, and let F be a cdf. Suppose that all these cdfs are differentiable, and let p_n and p be the densities of F_n and F respectively. If p_n(s) → p(s) for all s ∈ R, then F_n converges weakly to F.

Let {x_n}_{n=1}^∞ and x be random variables, where x_n ∼ F_n and x ∼ F. We say that x_n converges in distribution to x if F_n converges weakly to F. In symbols, this convergence is represented by x_n →^d x.

Fact 2.5.4. Regarding convergence in distribution, the following statements are true:

1. If g : R → R is continuous at x and x_n →^d x, then g(x_n) →^d g(x).

2. If x_n →^p x, then x_n →^d x.

3. If α is constant and x_n →^d α, then x_n →^p α.

The next result is sometimes known as Slutsky’s theorem.


Fact 2.5.5. If α is constant, x_n →^p α and y_n →^d y, then x_n + y_n →^d α + y and x_n y_n →^d αy.

An immediate but useful consequence is the following:

Fact 2.5.6. If x_n →^p 0 and y_n →^d y, then x_n y_n →^p 0.

Indeed, by Slutsky’s theorem (fact 2.5.5) we have x_n y_n →^d 0. Since the limit is constant, fact 2.5.4 then tells us that the convergence is in probability as well.

2.5.2 The Law of Large Numbers

Two of the most important theorems in both probability and statistics are the law
of large numbers and the central limit theorem. In their simplest forms, these theo-
rems deal with averages of independent and identically distributed (IID) sequences.
The law of large numbers tells us that these averages converge in probability to the
mean of the distribution in question. The central limit theorem tells us that a simple
transform of the average converges to a normal distribution.
Let’s start with the law of large numbers, which relates to the sample mean

x̄_N := (1/N) ∑_{n=1}^N x_n

of a given sample x_1, . . . , x_N.

Theorem 2.5.1. Let {x_n} be an IID sequence of random variables with common distribution F. If the first moment ∫ |s| F(ds) is finite, then

x̄_N →^p E[x_n] = ∫ s F(ds)        as N → ∞        (2.28)

To prove theorem 2.5.1, we can use fact 2.5.2 on page 76. In view of this fact, it suffices to show that E[x̄_N] → ∫ s F(ds) and var[x̄_N] → 0 as N → ∞. These steps are left as an exercise (exercise 2.8.39). When you do the exercise, note to yourself exactly where independence bites.10

Example 2.5.4. To illustrate the law of large numbers, consider flipping a coin until
10 heads have occurred. The coin is not fair: The probability of heads is 0.4. Let x
be the number of tails observed in the process. It is known that such an x has the
negative binomial distribution, and, with a little bit of googling, we find that the
mean E [ x ] is 15. This means that if we simulate many observations of x and take
the sample mean, we should get a value close to 15. Code to do this is provided in
the next listing. Can you see how this program works?11

Listing 1 Illustrates the LLN

import numpy as np
from random import uniform

num_repetitions = 10000
outcomes = np.empty(num_repetitions)

for i in range(num_repetitions):
    num_tails = 0
    num_heads = 0
    while num_heads < 10:
        b = uniform(0, 1)
        num_heads = num_heads + (b < 0.4)
        num_tails = num_tails + (b >= 0.4)
    outcomes[i] = num_tails

print(outcomes.mean())

At first glance, the law of large numbers (2.28) appears to only be a statement about
the sample mean, but actually it can be applied to functions of the random variable
10 The proof involves a bit of cheating, because it assumes that the variance of each xn is finite. This
second moment assumption is not necessary for the result, but it helps to simplify the proof.
11 Hint: If u is uniform on [0, 1] and q ∈ [0, 1], then P{u ≤ q} = q. This fact is used to simulate the coin flips. Also recall that in Python the boolean values True and False are treated as 1 and 0 respectively in algebraic expressions.

as well. For example, if h : R → R is such that ∫ |h(s)| F(ds) is finite, then

(1/N) ∑_{n=1}^N h(x_n) →^p E[h(x_n)] = ∫ h(s) F(ds)        (2.29)

This can be confirmed by letting yn := h( xn ) and then applying theorem 2.5.1.


Also, the law of large numbers applies to probabilities as well as expectations. To
see this, let x ∼ F, fix B ⊂ R, and consider the probability P{ x ∈ B}. Let h be
the function defined by h(s) = 1{s ∈ B} for all s ∈ R. Using the principle that
expectations of indicator functions equal probabilities of events (page 58), we have

E [h( x)] = E [1{ x ∈ B}] = P{ x ∈ B}


It now follows from (2.29) that if { xn } is an IID sample from F, then
N
1 p
N ∑ 1{ x n ∈ B } → P{ x n ∈ B } (2.30)
n =1

The left hand side is the fraction of the sample that falls in the set B, and (2.30) tells
us that this fraction converges to the probability that xn ∈ B.
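As an illustration of (2.30) (a sketch of ours, not a numbered listing; the choices F = standard normal and B = [−1, 1] are ours), the code below computes the fraction of an IID sample falling in B and compares it with Φ(1) − Φ(−1) ≈ 0.68:

import numpy as np
from scipy.stats import norm

np.random.seed(1)
N = 100000
x = norm.rvs(size=N)                            # IID draws from F = standard normal

fraction_in_B = np.mean((x >= -1) & (x <= 1))   # fraction of the sample in B = [-1, 1]
print(fraction_in_B, norm.cdf(1) - norm.cdf(-1))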

2.5.3 The Central Limit Theorem

The central limit theorem is another classical result from probability theory. It is
arguably one of the most beautiful and important results in all of mathematics. Rel-
ative to the LLN, it requires an additional second moment condition.

Theorem 2.5.2. Assume the conditions of theorem 2.5.1. If, in addition, the second moment ∫ s² F(ds) is finite, then

√N (x̄_N − µ) →^d y ∼ N(0, σ²)        as N → ∞        (2.31)

where µ := ∫ s F(ds) = E[x_n] and σ² := ∫ (s − µ)² F(ds) = var[x_n].

Another common statement of the central limit theorem is as follows: If all the conditions of theorem 2.5.2 are satisfied, then

z_N := √N (x̄_N − µ)/σ →^d z ∼ N(0, 1)        as N → ∞        (2.32)
Figure 2.10: Illustration of the CLT

Exercise 2.8.40 asks you to confirm this via theorem 2.5.2 and fact 2.5.4.
The central limit theorem tells us about the distribution of the sample mean when
N is large. Arguing informally, for N large we have

√N (x̄_N − µ) ≈ y ∼ N(0, σ²)

∴ x̄_N ≈ y/√N + µ ∼ N(µ, σ²/N)

Here ≈ means that the distributions are approximately equal. We see that x̄_N is approximately normal, with mean equal to µ := E[x_1] and variance converging to zero at a rate proportional to 1/N.
The convergence in (2.32) is illustrated by listing 2, the output of which is given
in figure 2.10. The listing generates 5,000 observations of the random variable z N
defined in (2.32), where each xn is χ2 (5). (The mean of this distribution is 5, and the
variance is 2 × 5 = 10.) The observations of z N are stored in the vector outcomes,
and then histogrammed. At the end of the listing we superimpose the density of the
standard normal distribution over the histogram. As predicted, the fit is relatively
good.
Before finishing this section, we briefly note the following asymptotic result, which
is frequently used in conjunction with the central limit theorem:

Listing 2 Illustrates the CLT

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

num_replications = 5000
outcomes = np.empty(num_replications)
N = 1000
k = 5  # Degrees of freedom
chi = st.chi2(k)

for i in range(num_replications):
    xvec = chi.rvs(N)
    outcomes[i] = np.sqrt(N / (2 * k)) * (xvec.mean() - k)

xmin, xmax = -4, 4
grid = np.linspace(xmin, xmax, 200)

fig, ax = plt.subplots()
ax.hist(outcomes, bins=50, density=True, alpha=0.4)
ax.plot(grid, st.norm.pdf(grid), 'k-', lw=2, alpha=0.7)
plt.show()

Theorem 2.5.3. Let {t_n} be a sequence of random variables and let θ be a constant. Suppose that √n (t_n − θ) →^d N(0, σ²) for some σ > 0. Suppose further that g : R → R is differentiable at θ and g′(θ) ≠ 0. Under these conditions we have

√n {g(t_n) − g(θ)} →^d N(0, g′(θ)² σ²)        as n → ∞        (2.33)

The technique illustrated in theorem 2.5.3 is referred to as the delta method. The delta method is extremely useful, particularly when one seeks the asymptotic distribution of certain kinds of estimators. We will see its importance in some applications later on. The proof of theorem 2.5.3 is based on a Taylor expansion of g around the point θ, and can be found in almost any text on mathematical statistics. Exercise 2.8.43 walks you through the most important ideas.
Instead of giving a full proof here, we will cover some parts of the proof of the
following corollary: If the conditions of theorem 2.5.2 are satisfied and g : R → R is
differentiable at µ with g′(µ) ≠ 0, then

√N {g(x̄_N) − g(µ)} →^d N(0, g′(µ)² σ²)        as N → ∞        (2.34)
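To see (2.34) in action, the sketch below (ours; the exponential sampling distribution and g(s) = s² are arbitrary choices) forms √N{g(x̄_N) − g(µ)} across many replications and compares its sample standard deviation with |g′(µ)|σ. For exponential draws with mean µ = 1 we have σ = 1 and g′(µ) = 2, so both printed numbers should be close to 2:

import numpy as np

np.random.seed(7)
num_replications = 5000
N = 1000
mu, sigma = 1.0, 1.0                 # mean and std dev of the exponential(1) distribution
g = lambda s: s**2
g_prime = lambda s: 2 * s

outcomes = np.empty(num_replications)
for i in range(num_replications):
    x = np.random.exponential(scale=1.0, size=N)
    outcomes[i] = np.sqrt(N) * (g(x.mean()) - g(mu))

print(outcomes.std(), abs(g_prime(mu)) * sigma)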

2.6 Random Vectors and Matrices

A random vector x is just a sequence of K random variables ( x1 , . . . , xK ). Each real-


ization of x is an element of RK . The distribution (or cdf) of x is the joint distribution
F of ( x1 , . . . , xK ). That is,

F(s) := F(s_1, . . . , s_K) := P{x_1 ≤ s_1, . . . , x_K ≤ s_K} =: P{x ≤ s}        (2.35)

for each s in RK . (Here and in what follows, the statement x ≤ s means that xn ≤ sn
for n = 1, . . . , K.)
Just as some but not all distributions on R have a density representation (see §2.3.2),
some but not all distributions on RK can be represented by a density. We say that
f : R^K → R is the density of random vector x := (x_1, . . . , x_K) if

∫_B f(s) ds = P{x ∈ B}        (2.36)

for every subset B of RK .12 Most of the distributions we work with in this course
have density representations.
For random vectors, the definition of independence mirrors the scalar case. In par-
ticular, a collection of random vectors x1 , . . . , x N is called independent if, given any
s1 , . . . , s N , we have

P{ x1 ≤ s1 , . . . , x N ≤ s N } = P{ x1 ≤ s1 } × · · · × P{ x N ≤ s N }
We note the following multivariate version of fact 2.4.2:

Fact 2.6.1. If x and y are independent and g and f are any functions, then f (x) and
g(y) are also independent.

A random N × K matrix X is a rectangular N × K array of random variables. In this


section, we briefly review some properties of random vectors and matrices.
12 Actually,some subsets of RK are so messy that it’s not possible to integrate over them, so we
only require (2.36) to hold for a large but suitably well-behaved class of sets called the Borel sets. See
any text on measure theory for details.

2.6.1 Expectations for Vectors and Matrices

Let x = (x_1, . . . , x_K) be a random vector taking values in R^K with µ_k := E[x_k] for all k = 1, . . . , K. The expectation E[x] of vector x is defined as the vector of expectations:

E[x] := (E[x_1], E[x_2], . . . , E[x_K])′ = (µ_1, µ_2, . . . , µ_K)′ =: µ
More generally, if X is a random N × K matrix, then its expectation E[X] is the N × K matrix of expectations, whose (n, k)-th element is E[x_nk].
Expectation of vectors and matrices maintains the linearity of scalar expectations:
Fact 2.6.2. If X and Y are random and A, B and C are conformable constant matrices,
then
E [A + BX + CY] = A + BE [X] + CE [Y]
The covariance between random N × 1 vectors x and y is

cov[x, y] := E[(x − E[x])(y − E[y])′]

The variance-covariance matrix of random vector x with µ := E[x] is defined as

var[x] := cov[x, x] = E[(x − E[x])(x − E[x])′] = E[(x − µ)(x − µ)′]

Expanding this out, var[x] is the N × N matrix whose (j, k)-th element is E[(x_j − µ_j)(x_k − µ_k)], the scalar covariance between x_j and x_k. As a result, the principal diagonal contains the variance of each x_n.

Some simple algebra yields the alternative expressions

cov[x, y] = E[xy′] − E[x]E[y]′        and        var[x] = E[xx′] − E[x]E[x]′

Fact 2.6.3. For any random vector x, the variance-covariance matrix var[x] is square,
symmetric and nonnegative definite.
Fact 2.6.4. For any random vector x, any constant conformable matrix A and any
constant conformable vector a, we have

var[a + Ax] = A var[x] A′

2.6.2 Multivariate Gaussians

The multivariate normal density or Gaussian density in R^N is a function p of the form

p(s) = (2π)^{−N/2} det(Σ)^{−1/2} exp( −(1/2)(s − µ)′ Σ^{−1} (s − µ) )
where µ is any N × 1 vector and Σ is a symmetric, positive definite N × N matrix. In
symbols, we represent this distribution by N (µ, Σ). Although we omit the deriva-
tions, it can be shown that if x ∼ N (µ, Σ), then

E [x] = µ and var[x] = Σ


We say that x is normally distributed if x ∼ N (µ, Σ) for some N × 1 vector µ and
symmetric, positive definite N × N matrix Σ. We say that x is standard normal if
µ = 0 and Σ = I.
Fact 2.6.5. An N × 1 random vector x is normally distributed if and only if a′x is normally distributed in R for every constant N × 1 vector a.13

Fact 2.6.6. If x ∼ N(µ, Σ), then a + Ax ∼ N(a + Aµ, AΣA′).

Here, the fact that a + Ax has mean a + Aµ and variance-covariance matrix AΣA′ is not surprising. What is important is that normality is preserved.
Fact 2.6.7. Normally distributed random variables are independent if and only if
they are uncorrelated. In particular, if both x and y are normally distributed and
cov[ x, y] = 0, then x and y are independent.
Fact 2.6.8. If x ∼ N(µ, Σ), then (x − µ)′ Σ^{−1} (x − µ) ∼ χ²(k), where k := length of x.

Fact 2.6.9. If x ∼ N(0, I) and A is a conformable idempotent and symmetric matrix with rank(A) = j, then x′Ax ∼ χ²(j). (In view of fact 1.4.7, when using this result it is sufficient to show that trace(A) = j.)
13 If a = 0 then we can interpret a′x as a “normal” random variable with zero variance.
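A quick simulation check of fact 2.6.8 (our sketch; the particular µ and Σ are arbitrary): draw from N(µ, Σ), compute the quadratic form, and compare its sample mean with 2, the mean of a χ²(2) random variable:

import numpy as np

np.random.seed(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

draws = np.random.multivariate_normal(mu, Sigma, size=100000)
dev = draws - mu
q = np.sum((dev @ Sigma_inv) * dev, axis=1)   # (x - mu)' Sigma^{-1} (x - mu) for each draw

print(q.mean())   # should be close to 2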

2.6.3 Convergence of Random Matrices

Next we extend the probabilistic notions of convergence discussed in §2.5.1 to random vectors and matrices. Beginning with the notion of convergence in probability (see §2.5.1 for the scalar case), let {X_n}_{n=1}^∞ be a sequence of random N × K matrices. We say that X_n converges to a random N × K matrix X in probability, and write X_n →^p X, if every element of X_n converges to the corresponding element of X in probability in the scalar sense. That is,

X_n →^p X    ⇐⇒    x_ij^n →^p x_ij    for all i and j

Similarly, a sequence of random vectors is said to converge in probability when the individual components converge in probability. That is,

x_n = (x_1^n, . . . , x_K^n)′ →^p x = (x_1, . . . , x_K)′    ⇐⇒    x_k^n →^p x_k    for all k

Fact 2.6.10. Let {X_n} be a sequence of random N × K matrices, let {x_n} be a sequence of random vectors in R^K, and let X and x be, respectively, a random matrix and vector of the same dimensions. Then,

1. x_n →^p x if and only if ‖x_n − x‖ →^p 0.

2. X_n →^p X if and only if ‖X_n − X‖ →^p 0.

Here the first norm is the ordinary Euclidean norm and the second is the matrix
norm of §1.4.5.

Now let’s extend the notion of convergence in distribution to random vectors. The
definition is almost identical to the scalar case, with only the obvious modifications.
Let {F_n}_{n=1}^∞ be a sequence of cdfs on R^K, and let F be a cdf on R^K. We say that F_n

converges to F weakly if, for any s such that F is continuous at s, we have

Fn (s) → F (s) as n → ∞

Let {x_n}_{n=1}^∞ and x be random vectors in R^K, where x_n ∼ F_n and x ∼ F. We say that x_n converges in distribution to x if F_n converges weakly to F. In symbols, this convergence is represented by x_n →^d x.

As discussed above, convergence of x_n to x in probability simply requires that the elements of x_n converge in probability (in the scalar sense) to the corresponding elements of x. For convergence in distribution this is not generally true:

x_k^n →^d x_k for all k    does not imply    x_n := (x_1^n, . . . , x_K^n)′ →^d (x_1, . . . , x_K)′ =: x

Put differently, convergence of the marginals does not necessarily imply conver-
gence of the joint distribution. (As you might have guessed, one setting where con-
vergence of the marginals implies convergence of the joint is when the elements of
the vectors are independent, and the joint is just the product of the marginals.)
The fact that elementwise convergence in distribution does not necessarily imply
convergence of the vectors is problematic, because vector convergence is harder
to work with than scalar convergence. Fortunately, we have the following results,
which provide a link from scalar to vector convergence:

Fact 2.6.11. Let x_n and x be random vectors in R^K.

1. If a′x_n →^d a′x for any a ∈ R^K, then x_n →^d x.

2. If a′x_n →^p a′x for any a ∈ R^K, then x_n →^p x.

The second of these results is quite straightforward to prove (exercise 2.8.45). The
first is more difficult (the standard argument uses characteristic functions). It is often
referred to as the Cramér-Wold device.
We noted a variety of useful results pertaining to convergence in probability and
distribution for sums and products in the scalar case. Most of these carry over to the
vector case essentially unchanged. For example,

Fact 2.6.12. Assuming conformability, the following statements are true:

1. If X_n →^p X and X_n and X are nonsingular, then X_n^{−1} →^p X^{−1}.

2. If X_n →^p X and Y_n →^p Y, then

X_n + Y_n →^p X + Y,    X_n Y_n →^p XY    and    Y_n X_n →^p YX

3. If X_n →^p X and A_n → A, then

X_n + A_n →^p X + A,    X_n A_n →^p XA    and    A_n X_n →^p AX

In part 3 of fact 2.6.12, the matrices An and A are nonrandom. The convergence
An → A means that each element of An converges in the usual scalar sense to the
corresponding element of A:

An → A means aijn → aij for all i and j

Alternatively, we can stack the matrices into vectors and take the norms, as dis-
cussed above. Then we say that A_n → A if ‖A_n − A‖ → 0. The two definitions can
be shown to be equivalent.

Example 2.6.1. To see how fact 2.6.12 can be used, let’s establish convergence of the quadratic form

a′X_n a →^p a′Xa    whenever a is a conformable constant vector and X_n →^p X        (2.37)

This follows from two applications of fact 2.6.12. Applying fact 2.6.12 once we get a′X_n →^p a′X. Applying it a second time yields the convergence in (2.37).

As in the scalar case, convergence in probability and convergence in distribution are


both preserved under continuous transformations:

Fact 2.6.13 (Continuous mapping theorem, vector case). Let xn and x be random
vectors in RK , and let g : RK → R J be continuous at x. In this setting,

1. If x_n →^d x, then g(x_n) →^d g(x).

2. If x_n →^p x, then g(x_n) →^p g(x).

Another result used routinely in econometric theory is the vector version of Slut-
sky’s theorem:

Fact 2.6.14 (Slutsky’s theorem, vector case). Let x_n and x be random vectors in R^K, let Y_n be random matrices, and let C be a constant matrix. If Y_n →^p C and x_n →^d x, then

Y_n x_n →^d Cx    and    Y_n + x_n →^d C + x
whenever the matrices are conformable.

The delta method from theorem 2.5.3 on page 81 extends to random vectors. For example, let g : R^K → R be differentiable at a vector θ ∈ R^K, in the sense that the gradient vector

∇g(θ) := ( ∂g(θ)/∂θ_1, . . . , ∂g(θ)/∂θ_K )′
is well defined (i.e., the limit defining each of the partial derivatives exists). In this
context,
Fact 2.6.15. If {t_n} is a sequence of random vectors in R^K with √n (t_n − θ) →^d N(0, Σ) for some θ ∈ R^K and positive definite K × K matrix Σ, then

√n {g(t_n) − g(θ)} →^d N(0, ∇g(θ)′ Σ ∇g(θ))        as n → ∞        (2.38)

whenever ∇g(θ)′ Σ ∇g(θ) is positive. This last assumption will be satisfied if, for example, at least one of the partial derivatives in ∇g(θ) is nonzero (why?).

2.6.4 Vector LLN and CLT

With the above definitions of convergence in hand, we can move on to the next topic:
Vector LLN and CLT. The scalar LLN and CLT that we discussed in §2.5 extend to
the vector case in a natural way. For example, we have the following result:
Theorem 2.6.1. Let {x_n} be an IID sequence of random vectors in R^K with E[‖x_n‖²] < ∞. Let µ := E[x_n] and let Σ := var[x_n]. For this sequence we have

x̄_N := (1/N) ∑_{n=1}^N x_n →^p µ    and    √N (x̄_N − µ) →^d N(0, Σ)        (2.39)

Figure 2.11 illustrates the LLN in two dimensions. The green dot is the point 0 =
(0, 0) in R2 . The black dots are IID observations x1 , . . . , x N of a random vector with
mean µ = 0. The red dot is the sample mean (1/N) ∑_{n=1}^N x_n. (Remember that we are
working with vectors here, so the summation and scalar multiplication in the sam-
ple mean x̄ N is done elementwise—in this case for two elements. In particular, the
sample mean is a linear combination of the observations x1 , . . . , x N .) By the vector
LLN, the red dot converges to the green dot.
The vector LLN in theorem 2.6.1 follows from the scalar LLN. To see this, let xn be as
in theorem 2.6.1, let a be any constant vector in RK and consider the scalar sequence

Figure 2.11: LLN, vector case

{y_n} defined by y_n = a′x_n. The sequence {y_n} inherits the IID property from {x_n}.14 By the scalar LLN (theorem 2.5.1) we have

(1/N) ∑_{n=1}^N y_n →^p E[y_n] = E[a′x_n] = a′E[x_n] = a′µ

But

(1/N) ∑_{n=1}^N y_n = (1/N) ∑_{n=1}^N a′x_n = a′ (1/N) ∑_{n=1}^N x_n = a′x̄_N

Since a was arbitrary, we have shown that

a′x̄_N →^p a′µ    for any a ∈ R^K

The claim x̄_N →^p µ now follows from fact 2.6.11.
The vector CLT in theorem 2.6.1 also follows from the scalar case. The proof is rather
similar to the vector LLN proof we have just completed. See exercise 2.8.48.
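The sketch below (ours; the bivariate normal sampling distribution and the sample sizes are arbitrary) illustrates both parts of theorem 2.6.1 in two dimensions: the sample mean of a large IID sample sits close to µ, while across replications the scaled deviations √N(x̄_N − µ) have a sample variance-covariance matrix close to Σ:

import numpy as np

np.random.seed(11)
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

# Vector LLN: one large sample, sample mean close to mu
big_sample = np.random.multivariate_normal(mu, Sigma, size=100000)
print(big_sample.mean(axis=0))

# Vector CLT: across replications, sqrt(N) * (x_bar - mu) has variance close to Sigma
N, num_replications = 500, 5000
scaled_means = np.empty((num_replications, 2))
for i in range(num_replications):
    x = np.random.multivariate_normal(mu, Sigma, size=N)
    scaled_means[i] = np.sqrt(N) * (x.mean(axis=0) - mu)

print(np.cov(scaled_means.T))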

2.7 Further Reading

To be written
14 Functions of independent random variables are themselves independent (fact 2.4.2, page 71).

2.8 Exercises

Ex. 2.8.1. Suppose that P is a probability on (Ω, F ), so that P( A ∪ B) = P( A) +


P( B) whenever A and B are disjoint. Show that if A, B and C are disjoint, then
P( A ∪ B ∪ C ) = P( A ) + P( B ) + P( C ).
Ex. 2.8.2. Prove fact 2.1.2: P( A ∪ B) = P( A) + P( B) − P( A ∩ B) for any A, B.15

Ex. 2.8.3. Given sample space Ω := {1, 2, 3}, let A := {1}, B := {2} and C := {3}.
Let P( A) = P( B) = 1/3. Compute P(C ), P( A ∪ B), P( A ∩ B), P( Ac ), P( Ac ∪ Bc )
and P( A | B). Are A and C independent?

Ex. 2.8.4. A die is designed so that the probability of getting face m is qm, where
m ∈ {1, . . . , 6} and q is a constant. Compute q.

Ex. 2.8.5. Let Ω be a nonempty finite set, and let ω0 be a fixed element of Ω. For
each A ⊂ Ω, define P( A) := 1{ω0 ∈ A}. Is P a probability on Ω? Why or why not?

Ex. 2.8.6. Let Ω be any sample space, and let P be a probability on the subsets F . Let
A ∈ F . Show that if P( A) = 0 or P( A) = 1, then A is independent of every other
event in F . Show that if A is independent of itself, then either P( A) = 0 or P( A) =
1. Show that if A and B are independent, then Ac and Bc are also independent.

Ex. 2.8.7. Let P and Ω be defined as in example 2.1.4. Show that P is additive, in
the sense that if A and B are disjoint events, then P( A ∪ B) = P( A) + P( B).

Ex. 2.8.8. Let P and Ω be defined as in example 2.1.4. Let A be the event that the
first switch is on, and let B be the event that the second switch is on. Show that A
and B are independent under P.

Ex. 2.8.9. Show that when Ω is finite, a random variable x on Ω can only take on a
finite set of values (i.e., has finite range).16

Ex. 2.8.10. Fact 2.2.4 on page 59 states that, among other things, if x is a random
variable, then E [αx ] = αE [ x ]. Show this for the case where x is a finite random
variable.
15 Hint: Sketching the Venn diagram, convince yourself that A = [( A ∪ B) \ B] ∪ ( A ∩ B). Finish
the proof using the definition of a probability and fact 2.1.1 (page 49).
16 Hint: Have a look at the definition of a function in §4.2.

Ex. 2.8.11. Fact 2.2.4 on page 59 states that, among other things, if x and y are two
random variables, then E [ x + y] = E [ x ] + E [y]. Instead of giving the full proof, try
to show this for the binary random variables

x (ω ) = 1 A (ω ) and y(ω ) = 1B (ω ) (2.40)

Note that even this is a little fiddly—see the solution if you get lost.

Ex. 2.8.12. Recall F defined in (2.13). We claimed that lims→∞ F (s) = 1. Verify this
when x is the finite-valued random variable in (2.7).

Ex. 2.8.13. Recall F defined in (2.13). Suppose that x is the finite-valued random
variable in (2.7). Show that lims→−∞ Fx (s) = 0. If you can, show that F is right-
continuous.

Ex. 2.8.14. Prove the claim in fact 2.3.2 on page 62.

Ex. 2.8.15. Let x be a discrete random variable taking values s1 , . . . , s J , and let p j :=
P{ x = s j }. Show that 0 ≤ p j ≤ 1 for each j, and ∑ jJ=1 p j = 1.
Ex. 2.8.16. This exercise describes the inverse transform method for generating
random variables with arbitrary distribution from uniform random variables. The
uniform cdf on [0, 1] is given by F (s) = 0 if s < 0, F (s) = s if 0 ≤ s ≤ 1, and F (s) = 1
if s > 1. Let G be another cdf on R. Suppose that G is strictly increasing, and let
G −1 be the inverse (quantile). Show that if u ∼ F, then G −1 (u) ∼ G.

Ex. 2.8.17. Let x ∼ F where F is the uniform cdf on [0, 1]. Give an expression for the
cdf G of the random variable y = x2 .

Ex. 2.8.18. Let F be the cdf on R defined by F (s) = es /(1 + es ) for s ∈ R.

1. Obtain the quantile function corresponding to F.

2. Let x ∼ F and compute P{0 ≤ x ≤ ln 2}.

Ex. 2.8.19. Let y ∼ F, where F is a cdf. Show that F (s) = E [1{y ≤ s}] for any s.

Ex. 2.8.20. Confirm monotonicity of expectations (fact 2.2.5 on page 60) for the spe-
cial case where x and y are the random variables in (2.40).

Ex. 2.8.21. Prove fact 2.2.6 (existence of k-th moment implies existence of j-th mo-
ment for all j ≤ k).

Ex. 2.8.22. Confirm the expression for variance of linear combinations in fact 2.4.8.

Ex. 2.8.23. Let x and y be scalar random variables. With reference to fact 2.4.9 on
page 73, is it true that corr[αx, βy] = corr[ x, y] for any constant scalars α and β? Why
or why not?

Ex. 2.8.24. Confirm the claim in fact 2.4.10: If x and y are independent, then cov[ x, y] =
corr[ x, y] = 0.

Ex. 2.8.25. Let x1 and x2 be random variables with densities p1 and p2 . Let q be their
joint density. Show that x1 and x2 are independent whenever q(s, s0 ) = p1 (s) p2 (s0 )
for every s, s0 ∈ R.

Ex. 2.8.26. Fact 2.4.2 tells us that if x and y are independent random variables and g
and f are any two functions, then f ( x ) and g(y) are independent. Prove this for the
case where f ( x ) = 2x and g(y) = 3y − 1.

Ex. 2.8.27. Let x and y be independent uniform random variables on [0, 1]. Let
z := max{ x, y}. Compute the cdf, density and mean of z.17 In addition, compute
the cdf of w := min{ x, y}.

Ex. 2.8.28. A discrete random variable x taking values in the set N0 := {0, 1, 2, . . .}
has pmf { p j } = { p0 , p1 , . . .}, if P{ x = j} = p j for all j ∈ N0 . (Here { p j } is a
sequence in [0, 1] with ∑_{j=0}^∞ p_j = 1.) Let u and v be independent random variables
taking values in N0 with pmfs { p j } and {q j } respectively, and let z := u + v. Obtain
an expression for the pmf of z in terms of { p j } and {q j }.

Ex. 2.8.29. Confirm the solutions in (2.26).

Ex. 2.8.30. Consider the setting of §2.4.4. Let α∗ , β∗ and `∗ be as defined there. Let
the prediction error u be defined as u := y − `∗ ( x ). Show that

1. E [`∗ ( x )] = E [y]

2. var[`∗ ( x )] = corr[ x, y]2 var[y]

3. var[u] = (1 − corr[ x, y]2 ) var[y]

Ex. 2.8.31. Continuing on from exercise 2.8.30, show that cov[`∗ ( x ), u] = 0.


Fix s ∈ R and compare the sets {z ≤ s} and { x ≤ s} ∩ {y ≤ s}. What is the relationship
17 Hint:

between these two sets?



Ex. 2.8.32. Show that if x is a random vector with E [xx0 ] = I and A is a conformable
constant matrix, then E [x0 Ax] = trace(A).

Ex. 2.8.33. Let x be random and let a be constant. Show that if E [x] = µ and
var[x] = Σ, then E [a0 x] = a0 µ and var[a0 x] = a0 Σa.

Ex. 2.8.34. Let x be a random K × 1 vector. Show that E [xx0 ] is nonnegative definite.

Ex. 2.8.35. Let x = ( x1 , . . . , x N ) ∼ N (0, I N ).

1. Are x1 and x2 independent? Why or why not?

2. What is the distribution of x12 ? Why?

3. What is the distribution of x12 /x22 ? Why?

4. What is the distribution of x1 [2/( x22 + x32 )]1/2 ? Why?

5. What is the distribution of kxk2 ? Why?

6. If a is an N × 1 constant vector, what is the distribution of a0 x?

Ex. 2.8.36. Prove the Chebyshev inequalities (2.11) and (2.12).

Ex. 2.8.37. Let {x_n} be a sequence of random variables satisfying x_n = y for all n, where y is a single random variable. Show that if P{y = −1} = P{y = 1} = 0.5, then x_n →^p 0 fails. Show that if P{y = 0} = 1, then x_n →^p 0 holds.
p d
Ex. 2.8.38. We saw in fact 2.5.4 that if xn → x, then xn → x. Show that the converse
is not generally true. In other words, give an example of a sequence of random
variables { xn } and random variable x such that xn converges to x in distribution,
but not in probability.

Ex. 2.8.39. In this exercise, we complete the proof of the LLN on page 78. Let { xn }
be an IID sequence of random variables with common distribution F. Show that
E[x̄_N] → ∫ s F(ds) and var[x̄_N] → 0 as N → ∞.

Ex. 2.8.40. Confirm (2.32) via theorem 2.5.2 and fact 2.5.4.

Ex. 2.8.41. Let u ∼ U [0, 1] and xn := n1{0 ≤ u ≤ 1/n} for n = 1, 2, . . .

1. Calculate the expectation of xn .


2. Show that x_n →^p 0 as n → ∞.

Ex. 2.8.42. Let x be any random variable with E [ x ] = µ and var[ x ] = σ2 < ∞. Show
that xn := x/n converges to zero in probability as n → ∞.

Ex. 2.8.43. This exercise covers some of the proof behind theorem 2.5.3 on page 81. Suppose that {t_n} is a sequence of random variables, θ is a constant, and

√n (t_n − θ) →^d N(0, σ²)        as n → ∞

Let g : R → R be differentiable at θ with g′(θ) ≠ 0. Taking a first order Taylor expansion of g around θ, we can write g(t_n) = g(θ) + g′(θ)(t_n − θ) + R(t_n − θ), where R(t_n − θ) is a remainder term. It turns out that under these conditions we have √n R(t_n − θ) →^p 0. The details are omitted. Using this fact, prove carefully that √n {g(t_n) − g(θ)} →^d N(0, g′(θ)² σ²).

Ex. 2.8.44. Using fact 2.5.1 (page 75) as appropriate, prove the following part of fact 2.6.12: If X_n →^p X and Y_n →^p Y, then X_n Y_n →^p XY whenever the matrices are conformable.
Ex. 2.8.45. Confirm the following claim in fact 2.6.11: If a′x_n →^p a′x for every a ∈ R^K, then x_n →^p x.

Ex. 2.8.46. Let {x_n} be a sequence of vectors in R², where x_n := (x_n, y_n) for each n. Suppose that x_n →^p 0 (i.e., x_n →^p 0 and y_n →^p 0). Show that ‖x_n‖ →^p 0.

Ex. 2.8.47. Verify the first part of fact 2.6.10 on page 86. (Note that exercise 2.8.46 is
a warm up to this exercise.)
Ex. 2.8.48. Confirm the claim √N (x̄_N − µ) →^d N(0, Σ) in theorem 2.6.1.

Ex. 2.8.49. Let {x_n} be an IID sequence of random vectors in R^K with E[x_n] = 0 and var[x_n] = I_K. Let

x̄_N := (1/N) ∑_{n=1}^N x_n    and    y_N := N · ‖x̄_N‖²

What is the asymptotic distribution of {y_N}?

2.8.1 Solutions to Selected Exercises

Solution to Exercise 2.8.1. If A, B and C are disjoint, then A ∪ B and C are also
disjoint, and A ∪ B ∪ C = ( A ∪ B) ∪ C. As a result, using additivity over pairs,

P( A ∪ B ∪ C) = P(( A ∪ B) ∪ C) = P( A ∪ B) + P(C) = P( A) + P( B) + P(C)


This result can be extended to an arbitrary number of sets by using induction.

Solution to Exercise 2.8.2. Pick any sets A, B ∈ F . To show that

P( A ∪ B ) = P( A ) + P( B ) − P( A ∩ B )
we start by decomposing A into the union of two disjoint sets: A = [( A ∪ B) \ B] ∪
( A ∩ B). Using additivity of P, we then have

P( A) = P[( A ∪ B) \ B] + P( A ∩ B)
Since B ⊂ ( A ∪ B), we can apply part 1 of fact 2.1.1 (page 49) to obtain

P( A ) = P( A ∪ B ) − P( B ) + P( A ∩ B )
Rearranging this expression gives the result that we are seeking.

Solution to Exercise 2.8.3. First, P(C ) = 1/3 as 1 = P(Ω) = P( A ∪ B ∪ C ) =


P( A) + P( B) + P(C) = 1/3 + 1/3 + P(C), and hence P(C) = 1/3. In addition,
P( A ∪ B) = 2/3, P( A ∩ B) = 0, P( Ac ) = 2/3, P( Ac ∪ Bc ) = P(( A ∩ B)c ) =
P(Ω) = 1, and P( A ∩ C) = 0 6= 1/9 = P( A)P(C). Therefore A is not independent
of C.

Solution to Exercise 2.8.4. When the dice is rolled one face must come up, so the
sum of the probabilities is one. More formally, letting Ω = {1, . . . , 6} be the sample
space, we have

P{1, . . . , 6} = P( ∪_{m=1}^6 {m} ) = ∑_{m=1}^6 P{m} = ∑_{m=1}^6 qm = 1

Solving the last equality for q, we get q = 1/21.

Solution to Exercise 2.8.5. To show that P is a probability on Ω we need to check


that

1. 1{ω0 ∈ A} ∈ [0, 1] for every A ⊂ Ω.

2. 1{ω0 ∈ Ω} = 1

3. If A ∩ B = ∅, then 1{ω0 ∈ A ∪ B} = 1{ω0 ∈ A} + 1{ω0 ∈ B}

1 is immediate from the definition of an indicator function. 2 holds because ω0 ∈ Ω.


Regarding 3, pick any disjoint A and B. If ω0 ∈ A, then ω0 ∈ / B, and we have
1 { ω0 ∈ A ∪ B } = 1 = 1 { ω0 ∈ A } + 1 { ω0 ∈ B }
If ω0 ∈ B, then ω0 ∈
/ A, and once again we have
1 { ω0 ∈ A ∪ B } = 1 = 1 { ω0 ∈ A } + 1 { ω0 ∈ B }
Finally, if ω0 is in neither A nor B, then
1 { ω0 ∈ A ∪ B } = 0 = 1 { ω0 ∈ A } + 1 { ω0 ∈ B }
We have shown that 1–3 hold, and hence P is a probability on Ω.

Solution to Exercise 2.8.6. Suppose that P( A) = 0 and that B ∈ F . We claim that


P( A ∩ B) = P( A)P( B), or, in this case, P( A ∩ B) = 0. Using nonnegativity and
monotonicity of P (fact 2.1.1), we obtain
0 ≤ P( A ∩ B ) ≤ P ( A ) = 0
Therefore P( A ∩ B) = 0 as claimed.
Now suppose that P( A) = 1. We claim that P( A ∩ B) = P( A)P( B), or, in this case,
P( A ∩ B) = P( B). In view of fact 2.1.2 on page 50, we have
P( A ∩ B ) = P( A ) + P( B ) − P( A ∪ B )
Since P( A) = 1, it suffices to show that P( A ∪ B) = 1. This last equality is implied
by monotonicity of P, because 1 = P( A) ≤ P( A ∪ B) ≤ 1.
Next, suppose that A is independent of itself. Then P( A) = P( A ∩ A) = P( A)P( A) =
P( A)2 . If a = a2 , then either a = 0 or a = 1.
Finally, let A and B be independent. We have
P( Ac ∩ Bc ) = P(( A ∪ B)c ) = 1 − P( A ∪ B)
Applying fact 2.1.2 and independence, we can transform the right-hand side to ob-
tain
P( Ac ∩ Bc ) = (1 − P( A))(1 − P( B)) = P( Ac )P( Bc )
In other words, Ac and Bc are independent.

Solution to Exercise 2.8.7. The proof is almost identical to the proof of additivity in
example 2.1.3 (page 49).

Solution to Exercise 2.8.8. The proof of independence is essentially the same as the
proof of independence of A and B in example 2.1.5 (page 51).

Solution to Exercise 2.8.10. We want to show that E[αx] = αE[x] is valid. To see this, observe that

αx(ω) = α ∑_{j=1}^J s_j 1_{A_j}(ω) = ∑_{j=1}^J αs_j 1_{A_j}(ω)

Hence, applying (2.8),

E[αx] = ∑_{j=1}^J αs_j P(A_j) = α ∑_{j=1}^J s_j P(A_j) = αE[x]

Solution to Exercise 2.8.11. Consider the sum x + y. By this, we mean the random
variable ( x + y)(ω ) := x (ω ) + y(ω ). We claim that E [ x + y] = E [ x ] + E [y]. To see
that this is the case, note first that

( x + y)(ω ) = 1 A\ B (ω ) + 1B\ A (ω ) + 21 A∩ B (ω )

(To check this, just go through the different cases for ω, and verify that the right
hand side of this expression agrees with x (ω ) + y(ω ). Sketching a Venn diagram
will help.) Therefore, by the definition of expectation,

E [ x + y ] = P ( A \ B ) + P ( B \ A ) + 2P ( A ∩ B ) (2.41)

Now observe that A = ( A \ B) ∪ ( A ∩ B) and hence, by disjointness,

E [ x ] : = P( A ) = P( A \ B ) + P( A ∩ B )
Performing a similar calculation with y produces

E [ y ] : = P( B ) = P( B \ A ) + P( A ∩ B )
Adding these two produces the value on the right-hand side of (2.41), and we have
confirmed that E [ x + y] = E [ x ] + E [y].

Solution to Exercise 2.8.12. We are assuming that x has finite range, and hence takes
only finitely many different values. Let m be the largest such value. For this m, we
have

lim_{s→∞} F_x(s) ≥ F_x(m) = P{ω ∈ Ω : x(ω) ≤ m} = P(Ω) = 1

(The inequality is due to the fact that F_x is increasing.) On the other hand,

lim_{s→∞} F_x(s) = lim_{s→∞} P{x ≤ s} ≤ lim_{s→∞} P(Ω) = 1

From these two inequalities we get 1 ≤ lim_{s→∞} F_x(s) ≤ 1, which is equivalent to lim_{s→∞} F_x(s) = 1.

Solution to Exercise 2.8.14. Fix s ≥ 0. Using additivity over disjoint sets, we have

F| x| (s) := P{| x | ≤ s} = P{−s ≤ x ≤ s} = P{ x = −s} + P{−s < x ≤ s}

By assumption, P{ x = −s} = 0. Applying fact 2.3.1 on page 62 then yields

F| x| (s) = P{−s < x ≤ s} = F (s) − F (−s)

The claim F| x| (s) = 2F (s) − 1 now follows from the definition of symmetry.

Solution to Exercise 2.8.15. That 0 ≤ p j ≤ 1 for each j follows immediately from


the definition of P. In addition, using additivity of P, we have
∑_{j=1}^J p_j = ∑_{j=1}^J P{x = s_j} = P( ∪_{j=1}^J {x = s_j} ) = P(Ω) = 1        (2.42)

(We are using the fact that the sets {x = s_j} are disjoint. Why is this always true? Look carefully at the definition of a function given in §4.2.)

Solution to Exercise 2.8.16. Let z := G −1 (u). We want to show that z ∼ G. Since G


is monotone increasing we have G ( a) ≤ G (b) whenever a ≤ b. As a result, for any
s ∈ R,

P{z ≤ s} = P{G−1 (u) ≤ s} = P{G(G−1 (u)) ≤ G (s)} = P{u ≤ G(s)} = G(s)


We have shown that z ∼ G as claimed.
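Here is a sketch of the inverse transform method in action (ours; it uses the logistic cdf G(s) = e^s/(1 + e^s) from exercise 2.8.18, whose quantile function is G^{-1}(q) = ln(q/(1 − q))): push uniform draws through G^{-1} and check that the empirical cdf of the result matches G.

import numpy as np

np.random.seed(5)
u = np.random.uniform(0, 1, size=100000)   # u ~ F, the uniform cdf on [0, 1]
z = np.log(u / (1 - u))                    # z = G^{-1}(u) for the logistic cdf G

G = lambda s: np.exp(s) / (1 + np.exp(s))
for s in (-1.0, 0.0, 1.5):
    print(s, np.mean(z <= s), G(s))        # empirical cdf of z versus G(s)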

Solution to Exercise 2.8.17. Evidently G(s) = 0 when s < 0. For s ≥ 0 we have

P{x² ≤ s} = P{|x| ≤ √s} = P{x ≤ √s} = F(√s)

Thus, G(s) = F(√s) 1{s ≥ 0}.

Solution to Exercise 2.8.20. If x (ω ) := 1{ω ∈ A} ≤ 1{ω ∈ B} =: y(ω ) for any


ω ∈ Ω, then A ⊂ B. (If ω ∈ A, then x (ω ) = 1. Since x (ω ) ≤ y(ω ) ≤ 1, we then
have y(ω ) = 1, and hence ω ∈ B.) Using fact 2.2.2 and monotonicity of P, we then
have
E [ x] = E [1{ω ∈ A}] = P( A) ≤ P( B) = E [1{ω ∈ B}] = E [y]
as was to be shown.

Solution to Exercise 2.8.21. Let a be any nonnegative number, and let j ≤ k. If a ≥ 1, then a^j ≤ a^k. If a < 1, then a^j ≤ 1. Thus, for any a ≥ 0, we have a^j ≤ a^k + 1, and for any random variable x we have |x|^j ≤ |x|^k + 1. Using monotonicity of expectations (fact 2.2.5 on page 60) and E[1] = 1, we then have E[|x|^j] ≤ E[|x|^k] + 1. Hence the j-th moment exists whenever the k-th moment exists.

Solution to Exercise 2.8.22. We have

var[∑_{n=1}^{N} αn xn] = E[ (∑_{n=1}^{N} αn xn − E[∑_{n=1}^{N} αn xn])² ]
    = E[ (∑_{n=1}^{N} αn (xn − E[xn]))² ]
    = E[ ∑_{n=1}^{N} αn² (xn − E[xn])² + 2 ∑_{n<m} αn αm (xn − E[xn])(xm − E[xm]) ]
    = ∑_{n=1}^{N} αn² var[xn] + 2 ∑_{n<m} αn αm cov[xn, xm]

as required.

Solution to Exercise 2.8.26. Let u := f ( x ) = 2x and v := g(y) = 3y − 1, where


x and y are independent. Independence of u and v can be confirmed via (2.24) on
page 71. Fixing s1 and s2 in R, we have

P{u ≤ s1 , v ≤ s2 } = P{ x ≤ s1 /2, y ≤ (s2 + 1)/3}


= P{ x ≤ s1 /2}P{y ≤ (s2 + 1)/3} = P{u ≤ s1 }P{v ≤ s2 }

Thus u and v are independent as claimed.



Solution to Exercise 2.8.27. As in the statement of the exercise, x and y are indepen-
dent uniform random variables on [0, 1], z := max{ x, y} and w := min{ x, y}. As
a first step to the proofs, you should convince yourself that if a, b and c are three
numbers, then

• max{ a, b} ≤ c if and only if a ≤ c and b ≤ c

• min{ a, b} ≤ c if and only if a ≤ c or b ≤ c

Using these facts, next convince yourself that, for any s ∈ R,

• {z ≤ s} = { x ≤ s} ∩ {y ≤ s}

• {w ≤ s} = { x ≤ s} ∪ {y ≤ s}

(For each, equality, show that if ω is in the right-hand side, then ω is in the left-hand
side, and vice versa.) Now, for s ∈ [0, 1], we have

P{z ≤ s} = P[{ x ≤ s} ∩ {y ≤ s}] = P{ x ≤ s}P{y ≤ s} = s2


By differentiating we get the density p(s) = 2s, and by integrating ∫₀¹ s p(s) ds we get
E[z] = 2/3. Finally, regarding the cdf of w, for s ∈ [0, 1] we have
P{w ≤ s} = P[{ x ≤ s} ∪ {y ≤ s}]
= P{ x ≤ s} + P{y ≤ s} − P[{ x ≤ s} ∩ {y ≤ s}]

Hence P{w ≤ s} = 2s − s2 .
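These computations are easy to check by simulation. The sketch below is not part of the exercise; it assumes NumPy is available, and the seed, the sample size and the test point s are arbitrary choices. It compares the empirical cdfs and mean with the expressions just derived.

    import numpy as np

    # Monte Carlo check of the cdfs of z = max{x, y} and w = min{x, y}
    # when x and y are independent uniforms on [0, 1].
    rng = np.random.default_rng(1234)
    n = 100_000
    x = rng.uniform(size=n)
    y = rng.uniform(size=n)
    z = np.maximum(x, y)
    w = np.minimum(x, y)

    s = 0.7
    print(np.mean(z <= s), s**2)        # empirical P{z <= s} vs s^2
    print(np.mean(w <= s), 2*s - s**2)  # empirical P{w <= s} vs 2s - s^2
    print(z.mean(), 2/3)                # empirical E[z] vs 2/3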

Solution to Exercise 2.8.31. Using y = ℓ*(x) + u and the results from exercise 2.8.30, we have

var[ℓ*(x) + u] = var[y]
    = corr[x, y]² var[y] + (1 − corr[x, y]²) var[y]
    = var[ℓ*(x)] + var[u]

It follows (why?) that cov[ℓ*(x), u] = 0 as claimed.



Solution to Exercise 2.8.35. First note that since x ∼ N(0, I_N) we have cov[x_i, x_j] = 0 for all i ≠ j. Since uncorrelated normal random variables are independent, we then have x_1, . . . , x_N IID ∼ N(0, 1). Since sums of squares of independent standard normals are chi-squared, we have in particular that

∑_{n=1}^{k} x_n² ∼ χ²(k)    (2.43)

for any k ≤ N. The solutions to the exercise can now be given:

1. Yes, for the reason just described.

2. x_1² ∼ χ²(1) by (2.43).

3. x_1²/x_2² ∼ F(1, 1), because if Q_1 ∼ χ²(k_1) and Q_2 ∼ χ²(k_2) and Q_1 and Q_2 are independent, then (Q_1/k_1)/(Q_2/k_2) ∼ F(k_1, k_2).

4. x_1 [2/(x_2² + x_3²)]^{1/2} ∼ t(2), because if Z ∼ N(0, 1) and Q ∼ χ²(k) and Z and Q are independent, then Z(k/Q)^{1/2} ∼ t(k).

5. ‖x‖² = ∑_{n=1}^{N} x_n² ∼ χ²(N) by (2.43).

6. Linear combinations of normals are normal, so y := a′x is normal. Evidently E[y] = E[a′x] = a′E[x] = 0. Using independence, we obtain

var[y] = ∑_{n=1}^{N} a_n² var[x_n] = ∑_{n=1}^{N} a_n²

Hence y ∼ N(0, ∑_{n=1}^{N} a_n²).

Solution to Exercise 2.8.36. Pick any nonnegative random variable x and δ > 0. By
considering what happens at an arbitrary ω ∈ Ω, you should be able to convince
yourself that
x = 1{ x ≥ δ } x + 1{ x < δ } x ≥ 1{ x ≥ δ } δ
Using fact 2.2.5 (page 60), fact 2.2.2 (page 58) and rearranging gives (2.11). Regard-
ing (2.12), observe that

x2 = 1{| x | ≥ δ} x2 + 1{| x | < δ} x2 ≥ 1{| x | ≥ δ}δ2

Proceeding as before leads to (2.12).



Solution to Exercise 2.8.37. From the definition of convergence in probability (see §2.5.1), the statement xn →p 0 means that, given any δ > 0, we have P{|xn| > δ} → 0. Consider first the case where P{y = −1} = P{y = 1} = 0.5. Take δ = 0.5. Then, since xn = y for all n,

P{|xn| > δ} = P{|y| > 0.5} = 1

Thus, the sequence does not converge to zero. Hence xn →p 0 fails. On the other hand, if P{y = 0} = 1, then for any δ > 0 we have

P{|xn| > δ} = P{|y| > δ} = 0

This sequence does converge to zero (in fact it's constant at zero), and xn →p 0 holds.

Solution to Exercise 2.8.38. We want to give an example of a sequence of random


variables { xn } and random variable x such that xn converges to x in distribution,
but not in probability. Many examples can be found by using IID sequences. For
example, if {xn}_{n=1}^∞ and x are IID standard normal random variables, then xn and
x have the same distribution for all n, and hence xn converges in distribution to x.
However, zn := xn − x has distribution N (0, 2) for all n. Letting z be any random
variable with distribution N (0, 2) and δ be any strictly positive constant, we have
P{| xn − x| ≥ δ} = P{|z| ≥ δ} > 0. Thus, P{| xn − x| ≥ δ} does not converge to
zero.

Solution to Exercise 2.8.39. By linearity of expectations,

E[x̄_N] = (1/N) ∑_{n=1}^{N} E[xn] = (N/N) ∫ s F(ds) = ∫ s F(ds)

This confirms that E[x̄_N] → ∫ s F(ds) as claimed. To see that var[x̄_N] → 0 as N → ∞, let σ² be the common variance of each xn. Using fact 2.4.8, we obtain

var[ (1/N) ∑_{n=1}^{N} xn ] = (1/N²) ∑_{n=1}^{N} σ² + (2/N²) ∑_{n<m} cov[xn, xm]

By independence, this reduces to var[x̄_N] = σ²/N, which converges to zero.
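If you want to see the σ²/N rate in action, here is a small simulation sketch (not part of the original exercise; it assumes NumPy, and the distribution, seed and sample sizes are chosen purely for illustration):

    import numpy as np

    # Check that the variance of the sample mean of N IID draws is close to sigma^2 / N.
    rng = np.random.default_rng(0)
    sigma2 = 4.0                                   # common variance of each x_n
    for N in (10, 100, 1000):
        draws = rng.normal(0.0, np.sqrt(sigma2), size=(50_000, N))
        xbar = draws.mean(axis=1)                  # one sample mean per replication
        print(N, xbar.var(), sigma2 / N)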



Solution to Exercise 2.8.44. Let Xn →p X and Yn →p Y. To prove that Xn Yn →p XY, we need to show that the i, j-th element of Xn Yn converges in probability to the i, j-th element of XY. By hypothesis, we have

x_{ik}^n →p x_{ik} and y_{kj}^n →p y_{kj} for all k

Applying fact 2.5.1 on page 75 twice, we obtain

x_{ik}^n y_{kj}^n →p x_{ik} y_{kj} for all k

and then

∑_k x_{ik}^n y_{kj}^n →p ∑_k x_{ik} y_{kj}

In other words, the i, j-th element of Xn Yn converges in probability to the i, j-th element of XY.

Solution to Exercise 2.8.45. If a′xn →p a′x for every a ∈ R^K, then we know in particular that this convergence holds for the canonical basis vectors. Hence

e_k′xn →p e_k′x for every k
∴ x_{nk} →p x_k for every k (elementwise convergence)
∴ xn →p x (vector convergence, by definition)

Solution to Exercise 2.8.46. From fact 2.5.1 on page 75, we know that if g : R → R is continuous and {un} is a scalar sequence of random variables with un →p u, then g(un) →p g(u). We also know that if un →p u and vn →p v, then un + vn →p u + v. By assumption, we have

xn →p 0 and yn →p 0
∴ xn² →p 0² = 0 and yn² →p 0² = 0
∴ ‖xn‖² = xn² + yn² →p 0 + 0 = 0
∴ ‖xn‖ = √(‖xn‖²) →p √0 = 0

Solution to Exercise 2.8.47. Let {xn} be a sequence of random vectors in R^K and x be a random vector in R^K. We need to show that

x_{kn} →p x_k for all k ⟺ ‖xn − x‖ →p 0

A special case of this argument can be found in the solution to exercise 2.8.46. The general case is similar: Suppose first that x_{kn} →p x_k for all k. Combining the various results about scalar convergence in probability in fact 2.5.1 (page 75), one can then verify (details left to you) that

‖xn − x‖ := ( ∑_{k=1}^{K} (x_{kn} − x_k)² )^{1/2} →p 0    (n → ∞)

Regarding the converse, suppose now that ‖xn − x‖ →p 0. Fix ε > 0 and arbitrary k. From the definition of the norm we see that |x_{kn} − x_k| ≤ ‖xn − x‖ is always true, and hence

|x_{kn} − x_k| > ε ⟹ ‖xn − x‖ > ε
∴ {|x_{kn} − x_k| > ε} ⊂ {‖xn − x‖ > ε}
∴ 0 ≤ P{|x_{kn} − x_k| > ε} ≤ P{‖xn − x‖ > ε} → 0

The proof is done.

Solution to Exercise 2.8.48. Define

zn := √N (x̄_N − µ) and z ∼ N(0, Σ)

We need to show that zn →d z. To do this, we apply the Cramer-Wold device (fact 2.6.11, page 87) and the scalar CLT (theorem 2.5.2, page 80). To begin, fix a ∈ R^K. Observe that

a′zn := √N (ȳn − E[yn])

where yn := a′xn. Since {yn} is IID (in particular, functions of independent random variables are independent) and

var[yn] = var[a′xn] = a′ var[xn] a = a′Σa

the scalar CLT yields

a′zn →d N(0, a′Σa)

Since a′z ∼ N(0, a′Σa), we have shown that a′zn →d a′z. Since a was arbitrary, the Cramer-Wold device tells us that zn converges in distribution to z.

Solution to Exercise 2.8.49. By assumption, {xn} is an IID sequence in R^K with E[xn] = 0 and var[xn] = I_K. It follows from the vector central limit theorem that

√N x̄_N →d z ∼ N(0, I_K)

Letting g(s) := ‖s‖² and applying the continuous mapping theorem (fact 2.6.13 on page 88), we obtain

y_N = ‖√N x̄_N‖² →d ‖z‖² = ∑_{k=1}^{K} z_k²

From fact 2.4.4 on page 72 we conclude that y_N →d χ²(K).
Chapter 3

Orthogonality and Projections

[roadmap]

3.1 Orthogonality

[roadmap]

3.1.1 Definition and Basic Properties

Let x and z be vectors in R N . If x0 z = 0, then we write x ⊥ z and call x and


z orthogonal. In R2 , x and z are orthogonal when they are perpendicular to one
another, as in figure 3.1. For x ∈ R N and S ⊂ R N , we say that x is orthogonal to S
if x ⊥ z for all z ∈ S (figure 3.2). In this case we write x ⊥ S.

Fact 3.1.1. If B ⊂ S with span( B) = S, then x ⊥ S if and only if x ⊥ b for all b ∈ B.

A set of vectors {x1 , . . . , xK } ⊂ R N is called an orthogonal set if its elements are


mutually orthogonal; that is, if x j ⊥ xk whenever j and k are not equal.

Fact 3.1.2 (Pythagorean law). If {x1, . . . , xK} is an orthogonal set, then

‖x1 + · · · + xK‖² = ‖x1‖² + · · · + ‖xK‖²

Orthogonal sets and linear independence are closely related. In particular,


Figure 3.1: x ⊥ z

Figure 3.2: x ⊥ S

Fact 3.1.3. Let O = {x1, . . . , xK} ⊂ R^N be an orthogonal set. If 0 ∉ O, then O is linearly independent.

While not every linearly independent set is orthogonal, a partial converse to fact 3.1.3
is given in §3.3.3.
A set of vectors O ⊂ R N is called an orthonormal set if it is an orthogonal set and
kuk = 1 for all u ∈ O. An orthonormal set that spans a linear subspace S of R N is
called an orthonormal basis for S. The standard example of an orthonormal basis
for all of R N is the canonical basis vectors e1 , . . . , e N .
One neat thing about orthonormal bases is the following: If O = {u1, . . . , uK} is any basis of S ⊂ R^N, then we can write any x ∈ S in terms of the basis vectors as x = ∑_{k=1}^{K} αk uk for suitable scalars α1, . . . , αK. The value of these scalars is not always transparent, but for an orthonormal basis we have αk = x′uk for each k. That is,

Fact 3.1.4. If {u1, . . . , uK} is an orthonormal set and x ∈ span{u1, . . . , uK}, then

x = ∑_{k=1}^{K} (x′uk) uk    (3.1)

The proof is an exercise.
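A quick numerical illustration of (3.1), assuming NumPy is available (the vectors below are just an example, not taken from the text):

    import numpy as np

    # Two orthonormal vectors in R^3
    u1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
    u2 = np.array([0.0, 0.0, 1.0])

    x = 2.0 * u1 - 3.0 * u2                  # a point in span{u1, u2}
    x_hat = (x @ u1) * u1 + (x @ u2) * u2    # the expansion in (3.1)
    print(np.allclose(x, x_hat))             # True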


An orthogonal matrix is an N × N matrix Q such that the columns of Q form an
orthonormal set. If Q is an orthogonal matrix then the columns of Q must form an
orthonormal basis for R N . (Why?) Since the columns of an orthogonal matrix Q are
linearly independent, we know that Q is invertible. Indeed, the inverse is just the
transpose. This and other useful facts are collected below:

Fact 3.1.5. Let Q and P be orthogonal N × N matrices. The following statements are
true:

1. Q−1 = Q0 , and Q0 is also orthogonal.

2. QP is an orthogonal matrix.

3. det(Q) ∈ {−1, 1}.



3.1.2 Orthogonal Decompositions

Recall from §1.4.4 that an N × N matrix A is diagonalizable if it is similar to a di-


agonal matrix, and this occurs precisely when the eigenvectors of A span R N . In
particular, by theorem 1.4.1 on page 33, we have A = PDP−1 , where (coln (P), λn )
is an eigenpair of A for each n. There is one important case where we can say more:

Theorem 3.1.1. If A is a symmetric N × N matrix, then an orthonormal basis of R N can


be formed from the eigenvectors of A. In particular, A can be diagonalized as A = QDQ0 ,
where Q is the orthogonal matrix constructed from this orthonormal basis, and D is the
diagonal matrix formed from the N eigenvalues of A.

This theorem is sometimes called the spectral decomposition theorem. One nice
application is a proof of fact 1.4.13 on page 37, which states among other things that
a symmetric matrix A is positive definite if and only if its eigenvalues are all positive.
Exercise 3.6.5 and its solution step you through the arguments.
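Numerically, the decomposition in theorem 3.1.1 is easy to compute with standard routines. Here is a sketch assuming NumPy, where eigh is the eigenvalue routine for symmetric matrices (the matrix A below is just an example):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])                 # a symmetric matrix

    eigvals, Q = np.linalg.eigh(A)             # columns of Q are orthonormal eigenvectors
    D = np.diag(eigvals)

    print(np.allclose(A, Q @ D @ Q.T))         # A = Q D Q'
    print(np.allclose(Q.T @ Q, np.eye(2)))     # Q'Q = I, so Q is orthogonal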
There is another kind of matrix decomposition using orthogonality that has many
important applications: the QR decomposition.

Theorem 3.1.2. If A is an N × K matrix with linearly independent columns, then there


exists an invertible upper triangular K × K matrix R and an N × K matrix Q with or-
thonormal columns such that A = QR.

Remark: Q isn’t referred to as an orthogonal matrix here because it isn’t in general


square. We’ll give a constructive proof of theorem 3.1.2 when we get to §3.3.3.
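A QR decomposition of the kind described in theorem 3.1.2 can likewise be computed with standard routines. A sketch assuming NumPy; the matrix is randomly generated purely for illustration, and 'reduced' mode returns Q as N × K with orthonormal columns and R as K × K upper triangular:

    import numpy as np

    rng = np.random.default_rng(42)
    A = rng.normal(size=(5, 3))               # N x K, columns independent with probability one

    Q, R = np.linalg.qr(A, mode='reduced')    # Q is 5 x 3, R is 3 x 3

    print(np.allclose(A, Q @ R))              # A = QR
    print(np.allclose(Q.T @ Q, np.eye(3)))    # Q'Q = I
    print(np.allclose(R, np.triu(R)))         # R is upper triangular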

3.1.3 Orthogonal Complements

Given S ⊂ R N , the orthogonal complement of S is defined as

S⊥ := {x ∈ R N : x ⊥ S}

In other words, S⊥ is the set of all vectors that are orthogonal to S. Figure 3.3 gives
an example in R2 .

Fact 3.1.6. For any nonempty S ⊂ R N , the set S⊥ is a linear subspace of R N .



Figure 3.3: Orthogonal complement of S

This is easy enough to confirm: Looking back at the definition of linear subspaces,
we see that the following statement must be verified: Given x, y ∈ S⊥ and α, β ∈ R,
the vector αx + βy is also in S⊥. Clearly this is the case, because if z ∈ S, then

(αx + βy)0 z = αx0 z + βy0 z (∵ linearity of inner products)


= α×0+β×0 = 0 (∵ x, y ∈ S⊥ and z ∈ S)

We have shown that αx + βy ⊥ z for any z ∈ S, thus confirming that αx + βy ∈ S⊥ .

Fact 3.1.7. For S ⊂ R N , we have S ∩ S⊥ = {0}.

3.2 Orthogonal Projections

[Roadmap]

3.2.1 The Orthogonal Projection Theorem

One problem that comes up in many different contexts is approximation of an el-


ement y of R N by an element of a given subspace S of R N . Stated more precisely,
the problem is, given y and S, to find the closest element ŷ of S to y. Closeness is in
terms of Euclidean norm, so ŷ is the minimizer of ky − zk over all z ∈ S:

ŷ = argmin_{z∈S} ‖y − z‖    (3.2)

Figure 3.4: Orthogonal projection

Existence of a minimizer is not immediately obvious, suggesting that ŷ may not be


well-defined. However, it turns out that we need not be concerned, as ŷ always
exists (given any S and y). The next theorem states this fact, as well as providing a
way to identify ŷ.

Theorem 3.2.1 (Orthogonal Projection Theorem I). Let y ∈ R N and let S be any
nonempty linear subspace of R N . The following statements are true:

1. The optimization problem (3.2) has exactly one solution.

2. Vector ŷ ∈ R N is the unique solution to (3.2) if and only if ŷ ∈ S and y − ŷ ⊥ S.

The vector ŷ in theorem 3.2.1 is called the orthogonal projection of y onto S. Al-
though we do not prove the theorem here, the intuition is easy to grasp from a
graphical presentation. Figure 3.4 illustrates. Looking at the figure, we can see that
the closest point ŷ to y within S is indeed the one and only point in S such that y − ŷ
is orthogonal to S.
[Add a proof?]

Example 3.2.1. Let y ∈ R N and let 1 ∈ R N be the vector of ones. Let S be the set of
constant vectors in R N , meaning that all elements are equal. Evidently S is the span
of {1}. The orthogonal projection of y onto S is ŷ := ȳ1, where ȳ is the scalar
ȳ := (1/N) ∑_{n=1}^{N} yn

Figure 3.5: Orthogonal projection under P

formed by averaging the elements of y. Since ŷ ∈ S clearly holds, to verify this claim we only need to check that y − ŷ is orthogonal to S, for which it suffices to show that (y − ŷ)′1 = 0. This is true because

(y − ŷ)′1 = y′1 − ŷ′1 = ∑_{n=1}^{N} yn − ȳ 1′1 = ∑_{n=1}^{N} yn − (1/N)(∑_{n=1}^{N} yn) N = 0

3.2.2 Orthogonal Projection as a Mapping

Holding S fixed, we can think of the operation

y 7→ the orthogonal projection of y onto S

as a function from R N to R N . The function is typically denoted by P, so that, for


each y ∈ R N, the symbol Py represents the image of y under P, which is the orthogonal
projection ŷ. (We should perhaps write PS but to simplify notation we’ll stick with
P. Hopefully the subspace that determines P will be clear from context.) In general,
P is called the orthogonal projection onto S. Figure 3.5 illustrates the action of P on
two different vectors.
Using this notation, we can restate the orthogonal projection theorem, as well as
adding some properties of P:

Theorem 3.2.2 (Orthogonal Projection Theorem II). Let S be any linear subspace, and
let P : R N → R N be the orthogonal projection onto S. The following statements are true:

1. The function P is linear.

Moreover, for any y ∈ R N , we have

2. Py ∈ S,

3. y − Py ⊥ S,

4. kyk2 = kPyk2 + ky − Pyk2 ,

5. kPyk ≤ kyk, and

6. Py = y if and only if y ∈ S.

These results are not difficult to prove, given theorem 3.2.1. Linearity of P is left as
an exercise (exercise 3.6.11). Parts 2 and 3 follow directly from theorem 3.2.1. To see
part 4, observe that y can be decomposed as y = Py + y − Py. Part 4 now follows
from parts 2–3 and the Pythagorean law. (Why?) Part 5 follows from part 4. (Why?)
Part 6 follows from the definition of Py as the closest point to y in S.
Fact 3.2.1. If {u1 , . . . , uK } is an orthonormal basis for S, then, for each y ∈ R N ,
Py = ∑_{k=1}^{K} (y′uk) uk    (3.3)

Fact 3.2.1 is a fundamental result. We can see it's true immediately because the right hand side of (3.3) clearly lies in S (being a linear combination of basis vectors) and, for any u_j in the basis set,

(y − Py)′u_j = y′u_j − ∑_{k=1}^{K} (y′uk)(uk′u_j) = y′u_j − y′u_j = 0

This confirms y − Py ⊥ S by fact 3.1.1 on page 107.


There’s one more very important property of P that we need to make note of: Sup-
pose we have two linear subspaces S1 and S2 of R N , where S1 ⊂ S2 . What then is
the difference between (a) first projecting a point onto the bigger subspace S2 , and
then projecting the result onto the smaller subspace S1 , and (b) projecting directly to
the smaller subspace S1 ? The answer is none—we get the same result.
Fact 3.2.2. Let S1 and S2 be two subspaces of R N , and let y ∈ R N . Let P1 and P2 be
the projections onto S1 and S2 respectively. If S1 ⊂ S2 , then
P1 P2 y = P2 P1 y = P1 y

Figure 3.6: Orthogonal projection (ŷ = Py and û = My)

3.2.3 Projection as Decomposition

There’s yet another way of stating the orthogonal projection theorem, which is also
informative. Recall the definition of orthogonal complements from §3.1.3. In the
orthogonal projection theorem, our interest was in projecting y onto S, but we have
just learned that S⊥ is itself a linear subspace, so we can also project y onto S⊥ . Just
as we used P to denote the function sending y into its projection onto S, so we’ll
use M to denote the function sending y into its projection onto S⊥ . The result we’ll
denote by û, so that û := My. Figure 3.6 illustrates. The figure suggests that we will
have y = ŷ + û, and indeed that is the case. The next theorem states this somewhat
more mathematically.
Theorem 3.2.3 (Orthogonal Projection Theorem III). Let S be a linear subspace of R N .
If P is the orthogonal projection onto S and M is the orthogonal projection onto S⊥ , then Py
and My are orthogonal, and
y = Py + My

If S1 and S2 are two subspaces of R N with S1 ⊂ S2 , then S2⊥ ⊂ S1⊥ . This means that
the result in fact 3.2.2 is reversed for M.
Fact 3.2.3. Let S1 and S2 be two subspaces of R N and let y ∈ R N . Let M1 and M2 be
the projections onto S1⊥ and S2⊥ respectively. If S1 ⊂ S2 , then,
M1 M2 y = M2 M1 y = M2 y
Fact 3.2.4. Py = 0 if and only if y ∈ S⊥, and My = 0 if and only if y ∈ S.1

1 For example, if Py = 0, then y = Py + My = My. Hence M does not shift y. If an orthogonal projection onto a subspace doesn't shift a point, that's because the point is already in that subspace (see, e.g., theorem 3.2.2). In this case the subspace is S⊥, and we conclude that y ∈ S⊥.

3.3 Applications of Projection

[Roadmap]

3.3.1 Projection Matrices

As stated in theorem 3.2.2, given any subspace S, the corresponding orthogonal


projection P is a linear map from R N to R N . In view of theorem 1.3.1 on page 20, it
then follows that there exists an N × N matrix P̂ such that Px = P̂x for all x ∈ R N . In
fact we’ve anticipated this in the notation P, and from now on P will also represent
the corresponding matrix. But what does this matrix look like?

Theorem 3.3.1. Let S be a subspace of R N . If P is the orthogonal projection onto S and


{b1 , . . . , bK } is any basis for S, then

P = B(B′B)^{-1}B′    (3.4)

where B is the matrix formed by taking this basis as its columns.

This result is actually a generalization of fact 3.2.1 on page 114. While the expression
isn’t quite as neat, the advantage is that it applies to any basis, not just orthonormal
ones. We further explore the connection between the two results in §3.3.3.

Proof of theorem 3.3.1. Fix y ∈ R N . The claim is that the vector ŷ := B(B0 B)−1 B0 y is
the orthogonal projection of y onto S, where B is the matrix defined by colk (B) = bk .
To verify this, we need to show that

1. ŷ ∈ S, and

2. y − ŷ ⊥ S.

Part 1 is true because ŷ can be written as ŷ = Bx where x := (B0 B)−1 B0 y. The vector
Bx is a linear combination of the columns of B. Since these columns form a basis of
S they must lie in S. Hence ŷ ∈ S as claimed.

Regarding part 2, from the assumption that B gives a basis for S, all points in S have
the form Bx for some x ∈ RK . Thus part 2 translates to the claim that

y − B(B0 B)−1 B0 y ⊥ Bx for all x ∈ RK

This is true, because if x ∈ RK , then

(Bx)0 [y − B(B0 B)−1 B0 y] = x0 [B0 y − B0 B(B0 B)−1 B0 y] = x0 [B0 y − B0 y] = 0

The proof of theorem 3.3.1 is done.

Theorem 3.3.1 and its proof implicitly assume that B0 B is nonsingular, but this is
justified because B is assumed to be full column rank (exercise 3.6.17).
The matrix P := B(B0 B)−1 B0 defined in (3.4) is called the projection matrix associ-
ated with the basis B (or the subspace S). It is common to also define the annihilator
associated with B as
M := I − P (3.5)
Here I is, as usual, the identity matrix (in this case N × N). In view of theorem 3.2.3,
the annihilator matrix projects any vector onto the orthogonal complement of S.

Fact 3.3.1. Both P and M are symmetric and idempotent.

The proof is an exercise (exercise 3.6.19). Idempotence is actually immediate be-


cause both P and M represent orthogonal projections onto their respective linear
subspaces. Applying the mapping a second time has no effect, because the vector is
already in the subspace. However you can obtain a more concrete proof by direct
computation.
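As a concrete illustration (a sketch assuming NumPy; the basis matrix below is randomly generated, so it has full column rank with probability one), we can build P and M from some matrix B and check the properties just described:

    import numpy as np

    rng = np.random.default_rng(7)
    B = rng.normal(size=(6, 2))                  # basis matrix for a 2-dimensional subspace of R^6
    P = B @ np.linalg.inv(B.T @ B) @ B.T         # projection matrix, as in (3.4)
    M = np.eye(6) - P                            # annihilator, as in (3.5)

    y = rng.normal(size=6)
    print(np.allclose(P, P.T), np.allclose(P @ P, P))   # P is symmetric and idempotent
    print(np.allclose(M, M.T), np.allclose(M @ M, M))   # M is symmetric and idempotent
    print(np.allclose(B.T @ (y - P @ y), 0.0))          # y - Py is orthogonal to the columns of B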

3.3.2 Overdetermined Systems of Equations

An initial discussion of overdetermined systems was given in §1.3.5. Let’s recall


the main idea, this time using notation oriented towards linear regression (which
is probably the most important application of the theory). We consider a system of
equations Xβ = y where X is N × K, β is K × 1, and y is N × 1. We regard X and y
as given, and seek a β ∈ RK that solves this equation.
If K = N and X has full column rank, then theorem 1.3.2 implies that this system has
precisely one solution, which is X−1 y. When N > K the system is overdetermined

and we cannot in general find a β that satisfies all N equations. We aim instead
for the “best available” vector, which is the β ∈ RK making Xβ as close to y
as possible (in the Euclidean sense). Using the orthogonal projection theorem, the
minimizer is easy to identify:

Theorem 3.3.2. Let X be an N × K matrix with full column rank and let y ∈ R N . The
minimization problem minβ∈RK ky − Xβk has a unique solution. The solution is

β̂ := (X0 X)−1 X0 y (3.6)

Proof. Let X and y be as in the statement of the theorem. Let β̂ be as in (3.6) and let
S := span(X). By the full column rank assumption, X forms a basis for S. In view of
theorem 3.3.1, the orthogonal projection of y onto S is

ŷ := X(X0 X)−1 X0 y = X β̂

Pick any β ∈ RK such that β 6= β̂. By the definition of S we have ỹ := Xβ ∈ S.


Moreover, since β 6= β̂ we must have ỹ 6= ŷ. (Why?!) Hence, by the orthogonal
projection theorem (page 112), we have ky − ŷk < ky − ỹk. In other words,

ky − X β̂k < ky − Xβk

Since β was arbitrary the proof is now done.

The solution β̂ in theorem 3.3.2 is sometimes called the least squares solution to the
minimization problem described in that theorem. The intuition behind this notation
will be given later on.
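Here is a numerical sketch of theorem 3.3.2, assuming NumPy; the data are simulated and purely illustrative. The formula (3.6) is compared with a library least squares routine:

    import numpy as np

    rng = np.random.default_rng(123)
    N, K = 50, 3
    X = rng.normal(size=(N, K))                        # full column rank with probability one
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

    beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y        # formula (3.6)
    beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # library least squares solution
    print(np.allclose(beta_hat, beta_lstsq))           # the two agree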

Remark 3.3.1. What happens if we drop the assumption that the columns of X are
linearly independent? The set span(X) is still a linear subspace, and the orthogonal
projection theorem still gives us a closest point ŷ to y in span(X). Since ŷ ∈ span(X),
there still exists a vector β̂ such that ŷ = X β̂. The problem is that now there exists
an infinity of such vectors. Exercise 3.6.18 asks you to prove this.

The projection matrix and the annihilator in this case are

P : = X ( X 0 X ) −1 X 0 and M := I − P (3.7)

Fact 3.3.2. The annihilator M associated with X satisfies MX = 0 (exercise 3.6.21).



3.3.3 Gram Schmidt Orthogonalization

The expression for projection onto a basis given in theorem 3.3.1 is extremely useful.
At the same time, it isn’t quite as neat as the expression for projection onto an or-
thonormal basis given in fact 3.2.1 (page 114). For this and other reasons, it’s nice to
know that any basis of a linear subspace can be converted to an orthonormal basis
in a straightforward way. The rest of this section gives details.

Theorem 3.3.3. Given any linearly independent subset {b1 , . . . , bK } of R N , there exists
an orthonormal set {u1 , . . . , uK } such that

span{b1 , . . . , bk } = span{u1 , . . . , uk } for k = 1, . . . , K

The proof of theorem 3.3.3 provides a concrete algorithm for generating the or-
thonormal set {u1 , . . . , uK }. The first step is to construct orthogonal sets {v1 , . . . , vk }
with span identical to {b1 , . . . , bk } for each k. At the end one can just normalize each
vector to produce orthonormal sets with the same span.
The construction of {v1 , . . . , vK } uses the so called Gram Schmidt orthogonaliza-
tion procedure, which runs as follows: First, for each k = 1, . . . , K, let Bk be the
N × k matrix formed by columns b1 , . . . , bk . Let Pk := Bk (B0k Bk )−1 B0k be the associ-
ated orthogonal projection matrix and let P0 := 0. Define v1 , . . . , vK by

vk := bk − Pk−1 bk    (3.8)

Looking at the definition of the annihilator in (3.5), the idea behind the algorithm is
clear—we are using the annihilator to project each successive bk into the orthogonal
complement of the span of the previous vectors b1 , . . . , bk−1 .
Exercises 3.6.22–3.6.24 step you through the process of verifying that {v1 , . . . , vk }
is orthogonal with span equal to that of {b1 , . . . , bk } for k = 1, . . . , K. As part of
the proof it is shown that no vk equals 0, and hence we can define uk := vk /kvk k.
The set {u1 , . . . , uk } is clearly orthonormal for each k, and its span is the same as
{v1 , . . . , vk }. (Why?) In other words, it has the properties stated in theorem 3.3.3.
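A minimal implementation of the procedure is sketched below, assuming NumPy; the function name gram_schmidt is ours. It computes vk = bk − Pk−1 bk by removing the projections onto the orthonormal vectors already constructed (which is the same thing, by fact 3.2.1), and then normalizes at each step.

    import numpy as np

    def gram_schmidt(B):
        """Return an N x K matrix U with orthonormal columns such that the span of
        the first k columns of U equals the span of the first k columns of B."""
        N, K = B.shape
        U = np.empty((N, K))
        for k in range(K):
            v = B[:, k].copy()
            for j in range(k):
                v = v - (U[:, j] @ B[:, k]) * U[:, j]   # remove projection onto earlier u_j
            U[:, k] = v / np.linalg.norm(v)             # v_k is nonzero by linear independence
        return U

    rng = np.random.default_rng(0)
    B = rng.normal(size=(5, 3))
    U = gram_schmidt(B)
    print(np.allclose(U.T @ U, np.eye(3)))              # columns are orthonormal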

3.3.4 Solving Systems via QR Decomposition

Let X be an N × K matrix with linearly independent columns x1 , . . . , xK . The QR


decomposition theorem on page 110 tells us that there exists an invertible upper

triangular K × K matrix R and an N × K matrix Q with orthonormal columns such


that X = QR. Now we have theorem 3.3.3 in hand, we can give a construction for
these matrices.
To begin, theorem 3.3.3 gives us existence of an orthonormal set {u1 , . . . , uK } such
that the span of {u1 , . . . , uk } equals that of {x1 , . . . , xk } for k = 1, . . . , K. In particular,
xk is in the span of {u1 , . . . , uk }. Appealing to fact 3.1.4 on page 109, we can write

x1 = (x10 u1 )u1
x2 = (x20 u1 )u1 + (x20 u2 )u2
x3 = (x30 u1 )u1 + (x30 u2 )u2 + (x30 u3 )u3

and so on. Sticking to the 3 × 3 case to simplify expressions, we can stack these equations horizontally to get

[ x1  x2  x3 ] = [ u1  u2  u3 ] [ (x1′u1)  (x2′u1)  (x3′u1) ]
                                [    0     (x2′u2)  (x3′u2) ]
                                [    0        0     (x3′u3) ]

or X = QR. This is our QR decomposition.


Given this decomposition X = QR, the least squares solution β̂ defined in (3.6) can
also be written as β̂ = R−1 Q0 y. Exercise 3.6.25 asks you to confirm this.
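Here is a sketch of that computation, assuming NumPy. (In practice one would solve Rβ = Q′y by back substitution rather than forming the inverse of R explicitly; the inverse is used here only to stay close to the formula.)

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))
    y = rng.normal(size=20)

    Q, R = np.linalg.qr(X, mode='reduced')         # X = QR
    beta_qr = np.linalg.inv(R) @ Q.T @ y           # R^{-1} Q'y
    beta_ls = np.linalg.inv(X.T @ X) @ X.T @ y     # (X'X)^{-1} X'y, as in (3.6)
    print(np.allclose(beta_qr, beta_ls))           # the two coincide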

3.4 Projections in L2

The main purpose of this section is to introduce conditional expectations and study
their properties. The definition of conditional expectations given in elementary
probability texts is often cumbersome to work with, and fails to provide the big
picture. In advanced texts, there are several different approaches to presenting con-
ditional expectations. The one I present here is less common than the plain vanilla
treatment, but it is, to my mind, by far the most intuitive. As you might expect given
the location of this discussion, the presentation involves orthogonal projection.

3.4.1 The Space L2

Sometimes we want to predict the value of a random variable y using another vari-
able x. In this case we usually choose x such that y and x are expected to be close

under each realization of uncertainty. A natural measure of closeness for random


variables is mean squared error (MSE), which is defined in this case as E [( x − y)2 ].
This same concept of distance between random variables came up previously in
our linear prediction problem of §2.4.4, as well as in measuring convergence of se-
quences of random variables (see, e.g., page 75). Mean square distance and its vari-
ations constitute the most commonly used measures of deviation between random
variables in probability theory.
One such variation is when we replace MSE with the root mean squared error
(RMSE), which is, as the name suggests, the square root of the MSE. Since we’ll
be using it a lot, let’s give the RMSE its own notation:
‖x − y‖ := √( E[(x − y)²] )

This notion of distance is quite reminiscent of Euclidean distance between vectors.


For example, it’s constructed from a norm, in the sense that if we define the L2 norm
as q
k z k : = E [ z2 ] (3.9)
then the RMSE between x and y is the norm of the random variable x − y. This
is similar to the idea of Euclidean distance, where the distance between x and y is
the norm of x − y. Moreover, the norm is obtained by taking the square root after
“summing” over squared values, just like Euclidean norm.
Of course the value (3.9) is only finite if z has finite second moment. So for the
remainder of this section we’ll restrict attention to such random variables. The stan-
dard name of this set of random variables is L2 . That is,

L2 := { all random variables x with E [ x2 ] < ∞}

The space L2 with its norm (3.9) is important partly because it shares so many prop-
erties with Euclidean space, and hence geometric intuition carries over from the
latter to L2 . We’ll see many examples of this below. Here’s a list of some of the
many ways that L2 is similar to Euclidean space

Fact 3.4.1. Let x and y be random variables in L2 and let α and β be any scalars. The
following statements are true:

1. kαx k = |α|k x k

2. k x k ≥ 0 and k x k = 0 if and only if x = 0.2

3. k x + yk ≤ k x k + kyk.

One important consequence of these results is that if α, β ∈ R and x, y ∈ L2 , then


αx + βy also has finite norm, and is therefore in L2 . More generally, any linear
combination
α1 x1 + · · · + αK xK,    αk ∈ R, xk ∈ L2
of L2 random variables is again L2 . The following definitions mimic the Euclidean
case: If X := { x1 , . . . , xK } is a subset of L2 , then the set of finite linear combinations
of elements of X is called the span of X, and denoted by span( X ):
span(X) := { all random variables ∑_{k=1}^{K} αk xk such that α := (α1, . . . , αK) ∈ R^K }

A subset S of L2 is called a linear subspace of L2 if it is closed under addition and


scalar multiplication. That is, for each x, y ∈ S and α, β ∈ R, we have αx + βy ∈ S.
Example 3.4.1. Let Z be the set of all zero mean random variables in L2 . The set
Z is a linear subspace of L2 because if x, y ∈ Z and α, β ∈ R, then E [αx + βy] =
αE [ x ] + βE [y] = 0.
Example 3.4.2. Fix p ∈ L2 and let S be all zero mean random variables uncorrelated
with p. That is, S := { x ∈ Z : cov[ x, p] = 0}. The set S is a linear subspace of L2
because if x, y ∈ S and α, β ∈ R, then

cov[αx + βy, p] = E [(αx + βy) p] = αE [ xp] + βE [yp] = α cov[ x, p] + β cov[y, p] = 0

Following on from fact 3.4.1, we can draw another parallel between L2 norm and the Euclidean norm. As we saw in §1.1.1, the Euclidean norm is defined in terms of the inner product on R^N. If x and y are two vectors in R^N, then the inner product is x′y, and the norm of vector x is ‖x‖ = √(x′x). Similarly, for random variables x and y, we define

⟨x, y⟩ := inner product of x and y := E[xy]
2 Actually, while kxk = 0 certainly implies that x = 0 in Euclidean space, it isn’t quite true that
k x k = 0 implies that x is the zero random variable (i.e., x (ω ) = 0 for all ω ∈ Ω). However, we can say
that if k x k = 0, then x = 0 with probability one. More formally, the set E := {ω ∈ Ω : | x (ω )| > 0}
satisfies P( E) = 0. In this sense, x differs from the zero random variable only in a trivial way. In
the applications we consider this caveat never causes problems. If you do continuous time stochastic
dynamics, however, you’ll find that there’s a lot of mathematical machinery built up to keep control
of this issue.

As for the Euclidean case, we then have


‖x‖ = √⟨x, x⟩ := √( E[x²] )
As in the Euclidean case, if the inner product of x and y is zero, then we say that
x and y are orthogonal, and write x ⊥ y. This terminology is used frequently in
econometrics (sometimes by people who aren’t actually sure why the term “orthog-
onal” is used—which puts you one step ahead of them). This is because if either x or
y is zero mean, then orthogonality of x and y is equivalent to cov[ x, y] = 0. Hence,
for centered random variables, orthogonality is equivalent to lack of correlation.
The next fact shows how the inner product on L2 has all the same essential proper-
ties as the inner product on Euclidean space (cf. fact 1.1.1 on page 4).
Fact 3.4.2. For any α, β ∈ R and any x, y, z ∈ L2 , the following statements are true:

1. ⟨x, y⟩ = ⟨y, x⟩

2. ⟨αx, βy⟩ = αβ⟨x, y⟩

3. ⟨x, (αy + βz)⟩ = α⟨x, y⟩ + β⟨x, z⟩

These properties follow immediately from linearity of E. The inequality

|⟨x, y⟩| ≤ ‖x‖ ‖y‖    (all x, y ∈ L2)
also carries over from the Euclidean case (see fact 1.1.2 on page 5). In fact we’ve al-
ready seen this inequality—it’s precisely the Cauchy-Schwarz inequality for random
variables shown in fact 2.2.7 on page 60. Now you know why these two inequalities
share the same name.
Given that inner products and norms in L2 share so many properties with their
Euclidean cousins, you might not be surprised to learn that a great variety of results
about Euclidean space carry over to L2 . Often we can just lift the proofs across with
barely a change to notation. Since we know a lot about vectors and Euclidean space,
this process yields many insights about random variables and their properties.
One result that carries across essentially unchanged is the orthogonal projection the-
orem. Let’s now state the L2 version. Given y ∈ L2 and linear subspace S ⊂ L2 , we
seek the closest element ŷ of S to y. Closeness is in terms of L2 norm, so ŷ is the
minimizer of ky − zk over all z ∈ S. That is,
ŷ = argmin_{z∈S} ‖y − z‖ = argmin_{z∈S} √( E[(y − z)²] )    (3.10)
The next theorem mimics theorem 3.2.1 on page 112.

Theorem 3.4.1 (Orthogonal Projection Theorem IV). Let y ∈ L2 and let S be any
nonempty linear subspace of L2 . The following statements are true:3

1. The optimization problem (3.10) has exactly one solution.

2. Vector ŷ ∈ L2 is the unique solution to (3.10) if and only if ŷ ∈ S and y − ŷ ⊥ S.

The vector ŷ in theorem 3.2.1 is called the orthogonal projection of y onto S. Hold-
ing S fixed, we can think of the operation

y 7→ the orthogonal projection of y onto S

as a function from L2 to L2 . The function will be denoted by P, so that, for each y ∈


L2, the symbol Py represents the image of y under P, which is the orthogonal projection
ŷ. Just as in the Euclidean case this function P is linear, in the sense that

P(αx + βy) = αPx + βPy (3.11)

for any x, y ∈ L2 and scalars α and β. In addition,

Fact 3.4.3. Let S be any linear subspace of L2 , and let P : L2 → L2 be the orthogonal
projection onto S. For any y ∈ L2 , we have

2. Py ∈ S

3. y − Py ⊥ S

4. kyk2 = k Pyk2 + ky − Pyk2

5. k Pyk ≤ kyk

6. Py = y if and only if y ∈ S

7. If {u1 , . . . , uK } is an orthonormal basis of S, then


Py = ∑_{k=1}^{K} ⟨y, uk⟩ uk    (3.12)
3 There are two small caveats I should mention. First, we actually require that S is a “closed”
linear subspace of L2 , which means that if { xn } ⊂ S, x ∈ L2 and k xn − x k → 0, then x ∈ S. For the
subspaces we consider here, this condition is always true. Second, when we talk about uniqueness in
L2 , we do not distinguish between elements x and x 0 of L2 such that P{ x = x 0 } = 1. A nice treatment
of orthogonal projection in Hilbert spaces (of which L2 is one example) is provided in Cheney (2001,
chapter 2). Most other books covering Hilbert space will provide some discussion.

In the final item, the meaning of orthonormal basis is that ⟨u_j, uk⟩ = 1 if j = k and


zero otherwise, and that S is equal to the span of {u1 , . . . , uK }. See (3.3) on page 114
for comparison. More generally, these results parallel those given for projection on
Euclidean space, and the proofs are essentially the same.

3.4.2 Application: Best Linear Prediction

There are many many important applications of orthogonal projection in L2 . We’ll


treat one of the most important ones in the next section—conditional expectations.
For now let's stick to a simpler “warm up” application. Our main objective is to
revisit the problem of best linear predictors from §2.4.4, establishing the same set of
results in a different way.
To begin, let

• 1 := 1Ω = the constant random variable that is always equal to 1

• S1 := span(1) = the linear subspace of all constant random variables

• P1 := the orthogonal projection onto S1

• M1 := the annihilator projection (cf. page 117), defined by M1 x = x − P1 x

Now let x ∈ L2 with E [ x ] = µ and var[ x ] = σx2 . Observe first that P1 x = µ1.
(We’ve written “µ1” instead of just “µ” to remind ourselves that we’re thinking of
a constant random variable, rather than just a constant.) To see this we need only
confirm that µ1 ∈ S1 , which is obvious, and that E [1( x − µ1)] = 0. The second
claim is also obvious, because E [1( x − µ1)] = E [ x − µ] = E [ x ] − µ.
Since P1 x = µ1, we have M1 x = x − µ1, or, written more conventionally, M1 x =
x − µ. In other words, M1 x is the “de-meaned” or “centered” version of x. The norm
of M1 x is

‖M1 x‖ = √( E[(x − µ)²] ) = σx
In other words, the norm of M1 x is the standard deviation of x.
Next fix x, y ∈ L2 and consider projecting y onto S2 := span(1, x ). That is, S2 is
equal to the set of random variables

α + βx := α1 + βx for scalars α, β

Let P2 and M2 be the orthogonal projection and annihilator associated with S2 .


Perhaps the easiest way to project y onto S2 is to find an orthonormal basis {u1 , u2 }
for S2 and then apply (3.12). Getting our inspiration from the Gram-Schmidt or-
thogonalization procedure (page 119), we take
u1 := 1 and u2 := (x − µ)/σx = M1 x / ‖M1 x‖
It's easy to check that ⟨u1, u2⟩ = E[u1 u2] = 0 and ‖u1‖ = ‖u2‖ = 1, so this pair
is indeed orthonormal. It’s also straightforward to show that span(u1 , u2 ) = S2 , so
{u1 , u2 } is an orthonormal basis for S2 . Hence, by (3.12),
P2 y = ⟨y, u1⟩u1 + ⟨y, u2⟩u2 = E[y] + (cov[x, y]/var[x]) (x − E[x])
The missing algebra between these two equations is not difficult to replicate. Alter-
natively, we can write
P2 y = α∗ + β∗ x    where β∗ := cov[x, y]/var[x] and α∗ := E[y] − β∗ E[x]
These are the same solutions we obtained using calculus in §2.4.4.
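A simulation sketch, assuming NumPy; the joint distribution of (x, y) below is arbitrary and chosen only for illustration. It checks that β∗ = cov[x, y]/var[x] and α∗ = E[y] − β∗ E[x] line up with a least squares fit of y on a constant and x:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 200_000
    x = rng.normal(2.0, 1.5, size=n)
    y = 1.0 + 0.7 * x + rng.normal(size=n)             # an arbitrary joint distribution for (x, y)

    beta_star = np.cov(x, y, ddof=0)[0, 1] / x.var()   # sample analogue of cov[x, y] / var[x]
    alpha_star = y.mean() - beta_star * x.mean()       # sample analogue of E[y] - beta* E[x]

    A = np.column_stack((np.ones(n), x))               # regress y on a constant and x
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    print(alpha_star, beta_star)
    print(coef)                                        # essentially the same pair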

3.4.3 Measurability

Now let’s turn towards a really fundamental application of orthogonal projection


in L2 . We’ll set up these results in several steps. To begin, recall that, in the case of
R N , orthogonal projection starts with a linear subspace S of R N . Once we have this
subspace, we think about how to project onto it. The space S is obviously crucial
because once we select S, we implicitly define the orthogonal projection mapping.
So when I tell you that conditional expectation is characterized by orthogonal pro-
jection, you will understand that the first thing we need to do is identify the linear
subspaces that we want to project onto.
The first step in this process is a definition at the very heart of probability theory:
measurability. Let x1 , . . . , x p be some collection of random variables, and let G :=
{ x1 , . . . , x p }. Thus, G is a set of random variables, often referred to in what follows
as the information set. We will say that another random variable z is G -measurable
if there exists a (nonrandom) function g : R p → R such that

z = g ( x1 , . . . , x p )

Informally, what this means is that once the values of the random variables x1 , . . . , x p
have been realized, the variable z is completely determined (i.e., no longer random)
and its realized value can be calculated (assuming that we can calculate the func-
tional form g). You might like to imagine it like this: Uncertainty is realized, in the
sense that some ω is selected from the sample space Ω. Suppose that we don’t get
to view ω itself, but we do get to view certain random outcomes. For example, we
might get to observe the realized values x1 (ω ), . . . , x p (ω ). If z is G -measurable, we
can now calculate the realized value z(ω) of z, even without knowing ω, because
we can compute z(ω ) = g( x1 (ω ), . . . , x p (ω )).4
As a matter of notation, if G = { x } and y is G -measurable, then we will also say that
y is x-measurable.
Example 3.4.3. Let x and z be two random variables. If z = 2x + 3, then z is x-
measurable. To see this formally, we can write z = g( x ) when g( x ) = 2x + 3. Less
formally, when x is realized, the value of z can be calculated.
Example 3.4.4. Let x1 , . . . , x N be random variables and let x̄ N be their sample mean.
If G = { x1 , . . . , x N }, then x̄ N := N −1 ∑nN=1 xn is clearly G -measurable.
Example 3.4.5. If x and y are independent and neither random variable is constant,
then y is not x-measurable. Indeed, if y was x-measurable, then we would have
y = g( x ) for some function g. This contradicts independence of x and y.
Example 3.4.6. Let x, y and z be three random variables with z = x + y. Suppose that
x and y are independent. Then z is not x-measurable. Intuitively, even if we know
the realized value of x, the realization of z cannot be computed until we know the
realized value of y. Formally, if z is x-measurable then z = g( x ) for some function
g. But then y = g( x ) − x, so y is x-measurable. This contradicts independence of x
and y.
Example 3.4.7. Let y = α, where α is a constant. This degenerate random variable
is G -measurable for any information set G , because y is already deterministic. For
example, if G = {x1, . . . , xp}, then we can take y = g(x1, . . . , xp) = α + ∑_{i=1}^{p} 0·xi.

If x and y are known given the information in G , then a third random variable that
depends only on x and y is likewise known given G . Hence G -measurability is
preserved under the taking of sums, products, etc. In particular,
4A technical note: In the definition of measurability above, where we speak of existence of the
function g, it is additionally required that the function g is “Borel measurable.” For the purposes of
this course, we can regard non-Borel measurable functions as a mere theoretical curiosity. As such,
the distinction will be ignored. See Williams (1991) or any similar text for further details.

Fact 3.4.4. Let α, β be any scalars, and let x and y be random variables. If x and y are
both G -measurable, then u := xy and v := αx + βy are also G -measurable.

Let G and H be two information sets with G ⊂ H. In this case, if random variable z is
G -measurable, then it is also H-measurable. This follows from our intuitive definition
of measurability: If the value z is known once the variables in G are known, then it
is certainly known when the extra information provided by H is available. The next
example helps to clarify.

Example 3.4.8. Let x, y and z be three random variables, let G = { x }, and let
H = { x, y}. Suppose that z = 2x + 3, so that z is G -measurable. Then z is also
H-measurable. Informally, we can see that z is deterministic once the variables in H
are realized. Formally, we can write z = g( x, y), where g( x, y) = 2x + 3 + 0y. Hence
z is also H-measurable as claimed.

Let’s note this idea as a fact:

Fact 3.4.5. If G ⊂ H and z is G -measurable, then z is H-measurable.

We started off this section by talking about how conditional expectations were going
to be projections onto linear subspaces, and we wanted to identify the subspaces.
Let’s clarify this now. Given G ⊂ L2 , define

L2 (G) := {the set of all G -measurable random variables in L2 }

In view of fact 3.4.4, we have the following result:

Fact 3.4.6. For any G ⊂ L2 , the set L2 (G) is a linear subspace of L2 .

From fact 3.4.5 we see that, in the sense of set inclusion, the linear subspace is in-
creasing with respect to the information set.

Fact 3.4.7. If G ⊂ H, then L2 (G) ⊂ L2 (H).

3.4.4 Conditional Expectation

Now it’s time to define conditional expectations. Let G ⊂ L2 and y be some ran-
dom variable in L2 . The conditional expectation of y given G is written as E [y | G]

or E G [y], and defined as the closest G -measurable random variable to y.5 More
formally,
E[y | G] := argmin_{z ∈ L2(G)} ‖y − z‖    (3.13)

This definition makes a lot of sense. Our intuitive understanding of the conditional
expectation E [y | G] is that it is the best predictor of y given the information con-
tained in G . The definition in (3.13) says the same thing. It simultaneously restricts
E [y | G] to be G -measurable, so we can actually compute it once the variables in G
are realized, and selects E [y | G] as the closest such variable to y in terms of RMSE.
While the definition makes sense, it still leaves many open questions. For example,
there are many situations where minimizers don't exist, or, if they do exist, there are
lots of them. So is our definition really a definition? Moreover, even assuming we
do have a proper definition, how do we actually go about computing conditional ex-
pectations in practical situations? And what properties do conditional expectations
have?
These look like tricky questions, but fortunately the orthogonal projection theorem
comes to the rescue.
Comparing (3.13) and (3.10), we see that y 7→ E [y | G] is exactly the orthogonal
projection function P in the special case where the subspace S is the G -measurable
functions L2 (G).
Okay, so E [y | G] is the orthogonal projection of y onto L2 (G). That’s kind of neat,
but what does it actually tell us? Well, it tells us quite a lot. For starters, theo-
rem 3.4.1 implies that E [y | G] is always well defined and unique. Second, it gives
us a useful characterization of E [y | G], because we now know that E [y | G] is the
unique point in L2 such that E [y | G] ∈ L2 (G) and y − E [y | G] ⊥ z for all z ∈ L2 (G).
Rewriting these conditions in a slightly different way, we can give an alternative
(and equivalent) definition of conditional expectation: E [y | G] ∈ L2 is the condi-
tional expectation of y given G if

1. E [y | G] is G -measurable, and

2. E [E [y | G] z] = E [yz] for all G -measurable z ∈ L2 .


5I prefer the notation E G [y] to E [y | G] because, as we will see, E G is a function (an orthogonal
projection) from L2 to L2 , and the former notation complements this view. However, the notation
E [y | G] is a bit more standard, so that’s the one we’ll use.

This definition seems a bit formidable, but it’s not too hard to use. Before giving an
application, let’s bow to common notation and define

E [y | x1 , . . . , x p ] := E [y | G]
Also, let’s note the following “obvious” fact:

Fact 3.4.8. Given { x1 , . . . , x p } and y in L2 , there exists a function g : R p → R such


that E [y | x1 , . . . , x p ] = g( x1 , . . . , x p ).

This is obvious because, by definition, E [y | G] is G -measurable. At the same time,


it’s worth keeping in mind: A conditional expectation with respect to a collection of
random variables is some function of those random variables.

Example 3.4.9. If x and w are independent and y = x + w, then E [y | x ] = x + E [w].

Let’s check this using the second definition of conditional expectations given above.
To check that x + E [w] is indeed the conditional expectation of y given G = { x }, we
need to show that x + E [w] is x-measurable and that E [( x + E [w]) z] = E [yz] for
all x-measurable z. The first claim is clearly true, because x + E [w] is a deterministic
function of x. The second claim translates to the claim that

E [( x + E [w]) g( x)] = E [( x + w) g( x)] (3.14)

for any function g. Verifying this equality is left as an exercise (exercise 3.6.27).
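The equality (3.14) can also be checked by simulation. A sketch assuming NumPy, with x and w independent and with the particular distributions and the choice g(x) = x² made purely for illustration:

    import numpy as np

    rng = np.random.default_rng(9)
    n = 1_000_000
    x = rng.normal(1.0, 2.0, size=n)       # any distribution for x
    w = rng.exponential(3.0, size=n)       # independent of x

    g = lambda t: t**2                     # an arbitrary choice of g

    lhs = np.mean((x + w.mean()) * g(x))   # sample analogue of E[(x + E[w]) g(x)]
    rhs = np.mean((x + w) * g(x))          # sample analogue of E[(x + w) g(x)]
    print(lhs, rhs)                        # close, as (3.14) predicts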
The next example shows that when x and y are linked by a conditional density (re-
member: densities don’t always exist), then our definition of conditional expectation
reduces to the one seen in elementary probability texts. The proof of the claim in the
example is the topic of exercise 3.6.32.

Example 3.4.10. If x and y are random variables and p(y | x) is the conditional density of y given x, then

E[y | x] = ∫ t p(t | x) dt

There are some additional goodies we can harvest using the fact that conditional
expectation is an orthogonal projection.

Fact 3.4.9. Let x and y be random variables in L2 , let α and β be scalars, and let G
and H be subsets of L2 . The following properties hold.

1. Linearity: E [αx + βy | G] = αE [ x | G] + βE [y | G].

2. If G ⊂ H, then E [E [y | H] | G] = E [y | G] and E [E [y | G]] = E [y].

3. If y is independent of the variables in G , then E [y | G] = E [y].

4. If y is G -measurable, then E [y | G] = y.

5. If x is G -measurable, then E [ xy | G] = xE [y | G].

Checking of these facts is mainly left to the exercises. Most are fairly straightfor-
ward. For example, consider the claim that if y is G -measurable, then E [y | G] = y.
In other words, we are saying that if y ∈ L2 (G), then y is projected into itself. This
is immediate from the last statement in theorem 3.4.1.
The fact that if G ⊂ H, then E [E [y | H] | G] = E [y | G] is called the “tower” property
of conditional expectations (by mathematicians), or the law of iterated expectations
(by econometricians). The law follows from the property of orthogonal projections
given in fact 3.2.2 on page 114: Projecting onto the bigger subspace L2 (H) and from
there onto L2 (G) is the same as projecting directly onto the smaller subspace L2 (G).

3.4.5 The Vector/Matrix Case

Conditional expectations of random matrices are defined using the notion of condi-
tional expectations for scalar random variables. For example, given random matri-
ces X and Y, we set

E[Y | X] :=
    [ E[y11 | X]   E[y12 | X]   · · ·   E[y1K | X] ]
    [ E[y21 | X]   E[y22 | X]   · · ·   E[y2K | X] ]
    [     ...           ...      ...        ...    ]
    [ E[yN1 | X]   E[yN2 | X]   · · ·   E[yNK | X] ]
where
E [ynk | X] := E [ynk | x11 , . . . , x`m , . . . , x LM ]
We also define
cov[x, y | Z] := E [xy0 | Z] − E [x | Z]E [y | Z]0
and
var[x | Z] := E [xx0 | Z] − E [x | Z]E [x | Z]0

Using the definitions, one can show that all of the results on conditional expectations
in fact 3.4.9 continue to hold in the current setting, replacing scalars with vectors and
matrices. We state necessary results for convenience:
Fact 3.4.10. Let X, Y and Z be random matrices, and let A and B be constant matrices.
Assuming conformability of matrix operations, the following results hold:

1. E [Y | Z]0 = E [Y0 | Z].

2. E [AX + BY | Z] = AE [X | Z] + BE [Y | Z].

3. E [E [Y | X]] = E [Y] and E [E [Y | X, Z] | X] = E [Y | X].

4. If X and Y are independent, then E [Y | X] = E [Y].

5. If g is a (nonrandom) function, so that g(X) is a matrix depending only on X,


then

• E [ g(X) | X] = g(X)
• E [ g ( X ) Y | X ] = g ( X )E [ Y | X ]
• E [Y g(X) | X] = E [Y | X] g(X)

3.4.6 An Exercise in Conditional Expectations

Let x and y be two random variables. We saw that E [y | x ] is a function f of x


such that f ( x ) is the best predictor of y in terms of root mean squared error. Since
monotone increasing transformations do not affect minimizers, f also minimizes the
mean squared error. In other words, f solves

min_{g∈G} E[(y − g(x))²]    (3.15)

where G is the set of functions from R to R. From this definition of conditional


expectations, we employed the orthogonal projection theorem to deduce various
properties of conditional expectations. We can also reverse this process, showing di-
rectly that f ( x ) := E [y | x ] solves (3.15), given the various properties of conditional
expectations listed in fact 3.4.9. To begin, suppose that the properties in fact 3.4.9
hold, and fix an arbitrary g ∈ G. We have

(y − g( x ))2 = (y − f ( x ) + f ( x ) − g( x ))2
= (y − f ( x ))2 + 2(y − f ( x ))( f ( x ) − g( x )) + ( f ( x ) − g( x ))2

Let’s consider the expectation of the cross-product term. From the law of iterated
expectations (fact 3.4.9), we obtain

E {(y − f ( x))( f ( x) − g( x))} = E { E [(y − f ( x))( f ( x) − g( x)) | x] } (3.16)

We can re-write the term inside the curly brackets on the right-hand side of (3.16) as

( f ( x ) − g( x ))E [(y − f ( x )) | x ]

(Which part of fact 3.4.9 are we using here?) Regarding the second term in this
product, we have (by which facts?) the result

E [y − f ( x ) | x ] = E [y | x ] − E [ f ( x ) | x ] = E [y | x ] − f ( x ) = E [y | x ] − E [y | x ] = 0
We conclude that the expectation in (3.16) is E [0] = 0. It then follows that

E [(y − g( x))2 ] = E [(y − f ( x))2 + 2(y − f ( x))( f ( x) − g( x)) + ( f ( x) − g( x))2 ]


= E [(y − f ( x ))2 ] + E [( f ( x ) − g( x ))2 ]

Since ( f ( x ) − g( x ))2 ≥ 0 we have E [( f ( x ) − g( x ))2 ] ≥ 0, and we conclude that

E [(y − g( x))2 ] ≥ E [(y − f ( x))2 ] :=: E [(y − E [y | x])2 ]


Since g was an arbitrary element of G, we conclude that

f = argmin_{g∈G} E[(y − g(x))²]

3.5 Further Reading

To be written.

3.6 Exercises

Ex. 3.6.1. Find two unit vectors (i.e., vectors with norm equal to one) that are or-
thogonal to (1, −2).

Ex. 3.6.2. Prove the Pythagorean law (fact 3.1.2 on page 107). See fact 1.1.3 if you
need a hint.

Ex. 3.6.3. Prove fact 3.1.3 on page 109.

Ex. 3.6.4. Let Q be an orthogonal matrix. Show that Q−1 = Q0 and det(Q) ∈
{−1, 1} both hold.
Ex. 3.6.5. Use theorem 3.1.1 (page 110) to prove the following part of fact 1.4.13
(page 37): A symmetric matrix A is positive definite if and only if its eigenvalues
are all positive.

Ex. 3.6.6. Prove theorem 3.2.3 using theorems 3.2.1–3.2.2.

Ex. 3.6.7. Prove fact 3.1.7: If S ⊂ R N , then S ∩ S⊥ = {0}.

Ex. 3.6.8. Prove fact 3.2.2.

Ex. 3.6.9. Let x and y be any two N × 1 vectors.

1. Show that kx + yk2 = kxk2 + kyk2 + 2x0 y

2. Explain the connection between this equality and the Pythagorean Law.

Ex. 3.6.10. Show that if Q is an N × N orthogonal matrix, then Q is an isometry on


R N . That is, for any x, y ∈ R N , we have kQx − Qyk = kx − yk.
Ex. 3.6.11. Let P be the orthogonal projection described in theorem 3.2.2 (page 113).
Confirm that P is a linear function from R N to R N , as defined in §1.2.1.

Ex. 3.6.12. Let P be an N × N matrix that is both symmetric and idempotent. Let
S := rng(P). Show that P is precisely the orthogonal projection mapping onto the
linear space S. (In other words, for any given y ∈ R N , the vector Py is the closest
point in S to y.)

Ex. 3.6.13. In this exercise you are asked to prove the Cauchy-Schwarz inequality
|x0 y| ≤ kxkkyk from fact 1.1.2 on page 5 via the orthogonal projection theorem.
Let y and x be nonzero vectors in R N (since if either equals zero then the inequality
is trivial), and let span(x) be all vectors of the form αx for α ∈ R.

1. Letting P be the orthogonal projection onto span(x), show that

Py = (x0 y / x0 x) x

2. Using this expression and any relevant properties of orthogonal projections (see
theorem 3.2.2 on page 113), confirm the Cauchy-Schwarz inequality.

Ex. 3.6.14. Consider theorem 3.3.2 on page 118. If N = K, what does β̂ reduce to?
Interpret.

Ex. 3.6.15. Let S := {( x1 , x2 , x3 ) ∈ R3 : x3 = 0}, and let y := 1 := (1, 1, 1). Using the
orthogonal projection theorem, find the closest point in S to y.

Ex. 3.6.16. Let P be the orthogonal projection described in theorem 3.2.2 (page 113).
Is it true that Px 6= Py whenever x 6= y? Why or why not?6

Ex. 3.6.17. Show that when N × K matrix B is full column rank, the matrix B0 B is
nonsingular.7

Ex. 3.6.18. Prove the claim at the end of remark 3.3.1 on page 118. In particular, let
X be N × K with linearly dependent columns, and let ŷ be the closest point to y in
span(X), existence of which follows from the orthogonal projection theorem. Prove
that there are infinitely many b such that ŷ = Xb.

Ex. 3.6.19. Show by direct computation that the projection matrix P and annihilator
M in fact 3.3.1 are both symmetric and idempotent.

Ex. 3.6.20. Let A be an N × N matrix.

1. Show that if I N − A is idempotent, then A is idempotent.

2. Show that if A is both symmetric and idempotent, then the matrix I N − 2A is


orthogonal.

Ex. 3.6.21. Verify fact 3.3.2 (i.e., MX = 0) directly using matrix algebra.

Ex. 3.6.22. Taking all notation as in §3.3.3 and adopting the assumptions of theo-
rem 3.3.3 on page 119, let Vk be the N × k matrix formed by columns {v1 , . . . , vk }
for each k = 1, . . . , K. Show that span(Vk ) ⊂ span(Bk ) for each k.

Ex. 3.6.23. Continuing on from exercise 3.6.22, show that {v1 , . . . , vK } is an orthog-
onal set.

Ex. 3.6.24. Following on from exercises 3.6.22 and 3.6.23, show that span(Vk ) =
span(Bk ) for each k. 8
6 Hint: Sketch the graph and think about it visually.
7 Hint: In view of fact 1.4.14, it suffices to show that B0 B is positive definite.
8 Hint: Use the results of both of these exercises and fact 1.1.8 on page 14.

Ex. 3.6.25. Let X be an N × K matrix with linearly independent columns and QR


factorization X = QR. (See §3.3.4.) Fix y ∈ R N . Show that β̂ defined in (3.6) can
also be written as β̂ = R−1 Q0 y.

Ex. 3.6.26. Prove the Cauchy-Schwarz inequality for random variables, which was
first stated as fact 2.2.7 on page 60. That is, show that for x, y ∈ L2 we have
|E [ xy]| ≤ √(E [ x2 ]E [y2 ]). Use the results on orthogonal projections in L2 as found in
§3.4.1. If you get stuck, follow the solution to exercise 3.6.13, adjusting from vectors
to L2 as required.

Ex. 3.6.27. Show that the equality in (3.14) holds when x and w are independent.

Ex. 3.6.28. In fact 3.4.9, it is stated that if y is independent of the variables in G ,


then E [y | G] = E [y]. Prove this using the (second) definition of the conditional
expectation E [y | G]. To make the proof a bit simpler, you can take G = { x }.

Ex. 3.6.29. Confirm the claim in fact 3.4.9 that if x is G -measurable, then E [ xy | G] =
xE [y | G].

Ex. 3.6.30. Let var[y | x ] := E [y2 | x ] − (E [y | x ])2 . Show that

var[y] = E [var[y | x ]] + var[E [y | x ]]

Ex. 3.6.31. Show that the conditional expectation of a constant α is α. In particular,


using the results in fact 3.4.9 (page 130) as appropriate, show that if α is a constant
and G is any information set, then E [α | G] = α.

Ex. 3.6.32. Prove the claim in example 3.4.10. (Warning: The proof is a little ad-
vanced and you should be comfortable manipulating double integrals.)

3.6.1 Solutions to Selected Exercises

Solution to Exercise 3.6.3. Let O = {x1 , . . . , xK } ⊂ R N be an orthogonal set that


does not contain 0. Let α1 , . . . , αK be such that α1 x1 + · · · + αK xK = 0. We claim that α j = 0
for any j. To see that this is so, fix j and take the inner product of both sides of
this equality with x j to obtain α j kx j k2 = 0. Since x j 6= 0, we conclude
that α j = 0. The proof is done.

Solution to Exercise 3.6.4. Let Q be an orthogonal matrix with columns u1 , . . . , u N .


By the definition of matrix multiplication, the m, n-th element of Q0 Q is u0m un , which

is 1 if m = n and zero otherwise. Hence Q0 Q = I. It follows from fact 1.3.3 on


page 24 that Q0 is the inverse of Q.
To see that det(Q) ∈ {−1, 1}, apply the results of fact 1.3.5 (page 25) and fact 1.4.4
(page 29) to the equality Q0 Q = I to obtain det(Q)2 = 1. The claim follows.
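A quick numerical check of both claims (our addition, assuming NumPy is available): the Q factor of the QR factorization of a random square matrix is orthogonal, so it can serve as a test case.

    # Numerical check: Q'Q = I and det(Q) is +1 or -1 for an orthogonal Q.
    import numpy as np

    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # Q is an orthogonal matrix
    print(np.allclose(Q.T @ Q, np.eye(4)))         # True
    print(np.linalg.det(Q))                        # approximately +1 or -1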

Solution to Exercise 3.6.5. Suppose that A is symmetric with eigenvalues λ1 , . . . , λ N .


By theorem 3.1.1 we can decompose it as A = QDQ0 where D is the diagonal matrix
formed from eigenvalues and Q is an orthogonal matrix. Fixing x ∈ R N and letting
y := Q0 x, we have

x0 Ax = (Q0 x)0 D(Q0 x) = y0 Dy = λ1 y21 + · · · + λ N y2N (3.17)

Suppose that all eigenvalues are positive. Take x to be nonzero. The vector y must
be nonzero (why?), and it follows from (3.17) that x0 Ax > 0. Hence A is positive
definite as claimed.
Conversely, suppose that A is positive definite. Fix n ≤ N and set x = Qen . Evi-
dently x is nonzero (why?). Hence x0 Ax > 0. Since Q0 is the inverse of Q, it follows
that
λn = e0n Den = (Q0 x)0 DQ0 x = x0 QDQ0 x = x0 Ax > 0
Since n was arbitrary, all eigenvalues are positive.
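As a numerical sanity check (our addition, assuming NumPy; the matrix below is built to be symmetric and positive definite by construction):

    # Check: a symmetric positive definite matrix has strictly positive eigenvalues,
    # and x'Ax > 0 for a randomly drawn nonzero x.
    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(4, 4))
    A = B.T @ B + 4 * np.eye(4)                # symmetric and positive definite
    print(np.all(np.linalg.eigvalsh(A) > 0))   # True
    x = rng.normal(size=4)
    print(x @ A @ x > 0)                       # True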

Solution to Exercise 3.6.7. Let S ⊂ R N . We aim to show that S ∩ S⊥ = {0}. Fix


a ∈ S ∩ S⊥ . Since a ∈ S⊥ , we know that a0 s = 0 for any s ∈ S. Since a ∈ S, we have
in particular, a0 a = kak2 = 0. As we saw in fact 1.1.2, the only such vector is 0.

Solution to Exercise 3.6.10. Fixing x, y ∈ R N and letting z := x − y we have

kQx − Qyk2 = kQzk2 = (Qz)0 Qz = z0 Q0 Qz = z0 z = kzk2 = kx − yk2

Solution to Exercise 3.6.11. Fix α, β ∈ R and x, y ∈ R N . The claim is that

P(αx + βy) = αPx + βPy

To verify this equality, we need to show that the right-hand side is the orthogonal
projection of αx + βy onto S. Going back to theorem 3.2.1, we need to show that (i)
αPx + βPy ∈ S and (ii) for any z ∈ S, we have

(αx + βy − (αPx + βPy))0 z = 0



Here (i) is immediate, because Px and Py are in S by definition; and, moreover S is


a linear subspace. To see that (ii) holds, just note that

(αx + βy − (αPx + βPy))0 z = α(x − Px)0 z + β(y − Py)0 z

By definition, the projections of x and y are orthogonal to S, so we have (x − Px)0 z =


(y − Py)0 z = 0. We are done.

Solution to Exercise 3.6.13. Regarding part 1, the expression for Py given in exer-
cise 3.6.13 can also be written as x(x0 x)−1 x0 y. Since x is a basis for span(x), the
validity of this expression as the projection onto span(x) follows immediately from
theorem 3.3.1. Regarding part 2, recall that orthogonal projections contract norms,
so that, in particular, kPyk ≤ kyk must hold. From part 1 we have kPyk = |x0 y|/kxk,
so this inequality reads |x0 y|/kxk ≤ kyk, and rearranging gives the desired bound |x0 y| ≤ kxkkyk.

Solution to Exercise 3.6.14. If N = K, then, in view of the full column rank as-
sumption and theorem 1.3.3 on page 23, the matrix X is nonsingular. By fact 1.4.4 on
page 29, X0 is likewise nonsingular. Applying the usual rule for inverse of products
(fact 1.3.4 on page 24), we have

β̂ = (X0 X)−1 X0 y = X−1 (X0 )−1 X0 y = X−1 y

This is of course the standard solution to the system Xβ = y.

Solution to Exercise 3.6.15. Let x = ( x1 , x2 , x3 ) be the closest point in S to y. Note


that e1 ∈ S and e2 ∈ S. By the orthogonal projection theorem we have (i) x ∈ S, and
(ii) y − x ⊥ S. From (i) we have x3 = 0. From (ii) we have

hy − x, e1 i = 0 and hy − x, e2 i = 0

These equations can be expressed more simply as 1 − x1 = 0 and 1 − x2 = 0. We


conclude that x = (1, 1, 0).
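The same point can be computed numerically from the projection formula of theorem 3.3.1, taking X to have columns e1 and e2 , which span S (a sketch we add here, assuming NumPy):

    # Orthogonal projection of y = (1, 1, 1) onto S = {x in R^3 : x3 = 0}.
    import numpy as np

    X = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])             # columns e1, e2 span S
    y = np.ones(3)
    P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix onto S
    print(P @ y)                           # [1. 1. 0.]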

Solution to Exercise 3.6.16. It is false to say that Px 6= Py whenever x 6= y: We can


find examples of vectors x and y such that x 6= y but Px = Py. Indeed, if we fix any
y with My 6= 0, where M := I − P, and then set x = Py + αMy for some constant α,
you should be able to confirm that Px = Py, and also that x 6= y when α 6= 1.

Solution to Exercise 3.6.17. Let A = B0 B. It suffices to show that A is positive def-


inite, since this implies that its determinant is strictly positive, and any matrix with
nonzero determinant is nonsingular. To see that A is positive definite, pick any
b 6= 0. We must show that b0 Ab > 0. To see this, observe that

b0 Ab = b0 B0 Bb = (Bb)0 Bb = kBbk2

By the properties of norms, this last term is zero only when Bb = 0. But this is not
true, because b 6= 0 and B is full column rank (see fact 1.1.6).
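Here is a small numerical illustration of the argument (our addition, assuming NumPy; the 5 × 2 matrix is an arbitrary example and has full column rank with probability one):

    # Check: B'B is positive definite, hence nonsingular, when B has full column rank.
    import numpy as np

    rng = np.random.default_rng(5)
    B = rng.normal(size=(5, 2))
    A = B.T @ B
    print(np.linalg.det(A) > 0)                               # True
    b = rng.normal(size=2)                                    # nonzero with probability one
    print(np.isclose(b @ A @ b, np.linalg.norm(B @ b)**2))    # True, and both are positive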

Solution to Exercise 3.6.18. By the definition of orthogonal projection we have ŷ ∈


span(X), and hence there exists a vector b such that ŷ = Xb. Since X has linearly
dependent columns, there exists a nonzero vector a such that Xa = 0. Hence X(λa) =
0 for all λ ∈ R. For each such λ we have ŷ = Xb = Xb + X(λa) = X(b + λa). Since
a 6= 0, the vectors b + λa are distinct for distinct values of λ, so there are infinitely
many such b.

Solution to Exercise 3.6.21. We have MX = X − X(X0 X)−1 X0 X = 0.

Solution to Exercise 3.6.22. To see that span(Vk ) ⊂ span(Bk ) for each k, observe
first that (3.8) can be rewritten as vk = bk − Bk−1 x for suitable choice of x. Hence
vk ∈ span(Bk ). Since spans increase as we add more elements, it follows that v j ∈
span(Bk ) for j ≤ k. Therefore span(Vk ) ⊂ span(Bk ).

Solution to Exercise 3.6.23. To show that {v1 , . . . , vK } is an orthogonal set, we must


show that arbitrary distinct elements are orthogonal, for which it suffices to check
that vk ⊥ v j whenever j < k. To see this, fix any such j and k. By construction, vk
lies in the orthogonal complement of span(Bk−1 ). On the other hand, as shown in
the solution to exercise 3.6.22, we have v j ∈ span(Bk−1 ). Hence vk ⊥ v j .

Solution to Exercise 3.6.24. We wish to confirm that span(Vk ) = span(Bk ) holds.


In exercise 3.6.22 we showed that span(Vk ) ⊂ span(Bk ). As a result, it is enough
to check that the columns of Vk are linearly independent (see, e.g., fact 1.1.8 on
page 14). As we’ve just shown the columns of Vk are mutually orthogonal (see
exercise 3.6.23), it suffices to show that none of them are zero (fact 3.1.3 on page 109).
This is easy enough, for if vk = 0 then, by (3.8), we have bk ∈ span(Bk−1 ), which
contradicts linear independence.
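For readers who like to compute, here is a minimal classical Gram-Schmidt sketch in the spirit of these exercises (our illustration, assuming NumPy; it is not written to be numerically robust):

    # Classical Gram-Schmidt: orthogonalize the columns b1, ..., bK of B in turn.
    import numpy as np

    def gram_schmidt(B):
        """Return V with orthogonal columns spanning the same space as B."""
        V = np.empty_like(B, dtype=float)
        for k in range(B.shape[1]):
            v = B[:, k].astype(float)
            for j in range(k):            # subtract projections onto v1, ..., v_{k-1}
                v -= (V[:, j] @ B[:, k]) / (V[:, j] @ V[:, j]) * V[:, j]
            V[:, k] = v
        return V

    B = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 2.0]])   # linearly independent columns
    V = gram_schmidt(B)
    print(np.round(V.T @ V, 8))                          # off-diagonal entries are zero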

Solution to Exercise 3.6.25. Let X, Q, R and y ∈ R N be as in the statement of the


exercise. The claim is that β̂ defined in (3.6) is equal to β̃ := R−1 Q0 y. To show this, in

view of linear independence of the columns of X, it suffices to show that X β̃ = X β̂,


or
X(X0 X)−1 X0 y = QRR−1 Q0 y
After simplifying, we see it suffices to show that X(X0 X)−1 X0 = QQ0 . Since X and
Q have the same column space, this follows from theorem 3.3.1 on page 116.
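A quick numerical confirmation that the two expressions for β̂ agree (our addition, assuming NumPy; the matrix dimensions are arbitrary):

    # Check: (X'X)^{-1} X'y equals R^{-1} Q'y when X = QR has linearly independent columns.
    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(6, 3))
    y = rng.normal(size=6)
    Q, R = np.linalg.qr(X)                          # reduced QR factorization
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)     # (X'X)^{-1} X'y
    beta_qr = np.linalg.solve(R, Q.T @ y)           # R^{-1} Q'y
    print(np.allclose(beta_ls, beta_qr))            # True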

Solution to Exercise 3.6.27. Let g be any function from R → R. Given indepen-


dence of x and w (and applying fact 2.4.2 on page 71), we have

E [( x + E [w]) g( x)] = E [ xg( x)] + E [w]E [ g( x)]


= E [ xg( x )] + E [wg( x )]
= E [( x + w) g( x )]
This confirms (3.14).

Solution to Exercise 3.6.28. Let y be independent of x. From the (second) definition


of conditional expectation, to show that E [y | x ] = E [y] we need to show that

1. E [y] is G -measurable, and

2. E [E [y] g( x )] = E [yg( x )] for any function g : R → R.

Part 1 is immediate, because E [y] is constant (see example 3.4.7 on page 127). Re-
garding part 2, if g is any function, then by facts 2.4.1 and 2.4.2 (see page 71) we have
E [yg( x)] = E [y]E [ g( x)]. By linearity of expectations, E [y]E [ g( x)] = E [E [y] g( x)].

Solution to Exercise 3.6.29. We need to show that if x is G -measurable, then E [ xy | G] =


xE [y | G]. To confirm this, we must show that

1. xE [y | G] is G -measurable, and

2. E [ xE [y | G]z] = E [ xyz] for any z ∈ L2 (G).

Regarding part 1, E [y | G] is G -measurable by definition, and x is G -measurable by


assumption, so xE [y | G] is G -measurable by fact 3.4.4 on page 128. Regarding part
2, fix z ∈ L2 (G), and let u := xz. Since x ∈ L2 (G), we have u ∈ L2 (G). We need to
show that
E [E [y | G]u] = E [yu]
Since u ∈ L2 (G), this is immediate from the definition of E [y | G].

Solution to Exercise 3.6.31. By fact 3.4.9 (page 130), we know that if α is G -measurable,
then E [α | G] = α. Example 3.4.7 on page 127 tells us that α is indeed G -measurable.

Solution to Exercise 3.6.32. As in example 3.4.10, let x and y be random variables


where p(y | x ) is the conditional density of y given x. Let g( x ) := ∫ tp(t | x )dt.

The claim is that E [y | x ] = g( x ). To prove this, we need to show that g( x ) is x-


measurable, and that

E [ g( x)h( x)] = E [yh( x)] for any function h : R → R (3.18)

The first claim is obvious. Regarding (3.18), let h be any such function. Using the
notation in (2.22) on page 70, we can write
E [ g( x )h( x )] = E [ ( ∫ tp(t | x )dt ) h( x ) ]
                = ∫ ( ∫ tp(t | s)dt ) h(s) p(s) ds
                = ∫ ( ∫ t [ p(s, t)/p(s) ] dt ) h(s) p(s) ds
                = ∫∫ t h(s) p(s, t) dt ds

This is equal to the right-hand side of (3.18), and the proof is done.
Part I

Appendices

Chapter 4

Appendix A: Analysis

4.1 Sets

In the course we often refer to the real numbers. This set is denoted by R, and we
understand it to contain “all of the numbers.” R can be visualized as the “continuous”
real line, and it contains both the rational and the irrational numbers.


What’s “real” about the real numbers? Well, “real” is in contrast to “imaginary,”
where the latter refers to the set of imaginary numbers. Actually, the imaginary
numbers are no more imaginary (or less real) than any other kind of numbers, but
we don’t need to talk any more about this.
R is an example of a set. A set is a collection of objects viewed as a whole. (In this
case the objects are numbers.) Other examples of sets are the set of all rectangles in
the plane, or the set of all monkeys in Japan.
If A is a set, then the statement x ∈ A means that x is contained in (alternatively, is
an element of) A. If B is another set, then A ⊂ B means that any element of A is also
an element of B, and we say that A is a subset of B. The statement A = B means
that A and B contain the same elements (each element of A is an element of B and

vice versa). For example, if I is the irrational numbers,1 then I ⊂ R. Also, 0 ∈ R,
π ∈ R, −3 ∈ R, e ∈ R, and so on.

Figure 4.1: Sets A and B in S
Commonly used subsets of R include the intervals. For arbitrary a and b in R, the
open interval ( a, b) is defined as

( a, b) := { x ∈ R : a < x < b}

while the closed interval [ a, b] is defined as

[ a, b] := { x ∈ R : a ≤ x ≤ b}

We also use half open intervals such as [ a, b) := { x ∈ R : a ≤ x < b}, half lines such
as (−∞, b) = { x ∈ R : x < b}, and so on.
Let S be a set and let A and B be two subsets of S, as illustrated in figure 4.1. The
union of A and B is the set of elements of S that are in A or B or both:

A ∪ B := { x ∈ S : x ∈ A or x ∈ B}

Here and below, “or” is used in the mathematical sense. It means “and/or”. The
intersection of A and B is the set of all elements of S that are in both A and B:

A ∩ B := { x ∈ S : x ∈ A and x ∈ B}

The set A \ B is all points in A that are not points in B:

A \ B := { x ∈ S : x ∈ A and x ∉ B}

The complement of A is the set of elements of S that are not contained in A:

Ac := S \ A :=: { x ∈ S : x ∉ A}
Here x ∉ A means that x is not an element of A. Figure 4.2 illustrates these definitions.

Figure 4.2: Unions, intersection and complements
For example, since R consists of the irrationals I and the rationals Q, we have

Q ⊂ R, I ⊂ R, Q ∪ I = R, Qc = I, etc.
Also,
N := {1, 2, 3, . . .} ⊂ Q ⊂ R
The empty set is, unsurprisingly, the set containing no elements. It is denoted by ∅.
If the intersection of A and B equals ∅, then A and B are said to be disjoint.
The next fact lists some well known rules for set theoretic operations.

Fact 4.1.1. Let A and B be subsets of S. The following statements are true:

1. A ∪ B = B ∪ A and A ∩ B = B ∩ A.

2. ( A ∪ B)c = Bc ∩ Ac and ( A ∩ B)c = Bc ∪ Ac .


3. A \ B = A ∩ Bc .

4. ( Ac )c = A.

1 The irrationals are those numbers, such as π and √2, that cannot be expressed as fractions of whole numbers.
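The set operations above can be tried out directly using Python's built-in set type (a small illustration we add here; only the standard library is used):

    # Set operations and one of the rules in fact 4.1.1 (De Morgan).
    S = set(range(10))
    A = {0, 1, 2, 3}
    B = {2, 3, 4, 5}

    print(A | B)             # union
    print(A & B)             # intersection
    print(A - B)             # A \ B
    print(S - A)             # complement of A relative to S
    # De Morgan: complement of the union equals intersection of complements -> True
    print((S - (A | B)) == ((S - A) & (S - B)))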

4.2 Functions

There are two fundamental primitives in mathematics: sets and functions.2 A brief
discussion of sets is given in §4.1. Here we recall some basic definitions concerning
functions.
Given arbitrary sets A and B, a function or map f from A to B is a rule that associates
to each element a of A one and only one element of B. This element of B is usually
called the image of a under f , and written f ( a). If we write f : A → B, this means
that f is a function from A to B.
Example 4.2.1. Think of the hands on an old school clock. If we know it’s morning,
then each position of the two hands is associated with one and only one time. If we
don’t know it’s morning, one position of the hands is associated with two possible
times, one in the am and one in the pm. The relationship is no longer functional.

Consider figure 4.3. Bottom left is not a function because the middle point on the
left-hand side is associated with two different points (images). Bottom right is not
a function because the top point on the left-hand side is not associated with any
image. From the definition, this is not allowed. Top right and top left are both
functions.
If f : A → B and g : B → C then the function h : A → C defined by h( a) = g( f ( a))
is called the composition of g and f , and written as g ◦ f . For example, if f : R → R
is defined by f ( x ) = e x :=: exp( x ) and g : R → R is defined by g( x ) = x2 , then

( g ◦ f )( x ) = g( f ( x )) = exp(2x )
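In code, composing two functions is just nesting one call inside the other (a small illustration we add; the helper name compose is our own, not a standard library function):

    # Function composition: h = g o f, so h(x) = g(f(x)).
    import math

    def compose(g, f):
        return lambda x: g(f(x))

    f = math.exp
    g = lambda x: x**2
    h = compose(g, f)
    print(h(1.0), math.exp(2.0))    # the two numbers agree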

If a and b are points such that f ( a) = b, then a is called a preimage of b. As shown
in figure 4.3, the set of preimages of a point can be empty, a singleton, or contain multiple
values. If f : A → B then the set A is called the domain of f . The set of points

{b ∈ B : f ( a) = b for some a ∈ A}
is called the range of f , and written as rng( f ). Thus b is in the range of f if it has at
least one preimage.

Figure 4.3: Functions and non-functions

2 Actually, functions can be represented as a special type of set containing ordered pairs, and hence
in pure mathematics cannot claim to be as foundational as sets. However, for our purposes, it will
be fine to think of a function as a primitive in its own right, defined as a certain kind of “rule.”
Figure 4.4 graphs a one-dimensional function f : [0, 1] → R. The red interval repre-
sents the range of f . Also shown is the preimage x of a point b ∈ rng( f ).
A vast multitude of mathematical problems come down to finding the x such that
f ( x ) = b for given function f and constant (or vector, or any other object) b. There
are two things that can go wrong here. One is that such an x might not be uniquely
defined; in other words, there are multiple preimages of b under f . The other po-
tential problem is lack of existence. This happens if b lies outside the range of f .
Figure 4.5 gives an example of failure of uniqueness. Both x1 and x2 solve f ( x ) = b.
We use some additional language to keep track of when these problems will occur.
A function f : A → B is called one-to-one if f ( a) = f ( a0 ) implies that a = a0 . For
example, top right in figure 4.3 is one-to-one while top left is not (because there exist
distinct points a and a0 with f ( a) = f ( a0 )). If f : A → B and f is one-to-one, then the
equation f ( x ) = b has at most one solution. (Why?)
A function f : A → B is called onto if rng( f ) = B; that is, if every b ∈ B has at
least one preimage. For example, top left in figure 4.3 is onto but top right is not. If
f : A → B and f is onto, then the equation f ( x ) = b has at least one solution. (Why?)
Figure 4.4: Preimage of b under f

Figure 4.5: Multiple solutions

Figure 4.6: f is a bijection

Figure 4.7: The inverse of g ◦ f is f −1 ◦ g−1

Finally, f : A → B is called a bijection if it is both one-to-one and onto. Bijections


are sometimes called one-to-one correspondences. The nice thing about bijections,
of course, is that the equation f ( x ) = b always has exactly one solution. We can
therefore define the inverse function to f , denoted by f −1 . In particular, f −1 is the
map from B → A such that f −1 (b) is the unique a with f ( a) = b.

Fact 4.2.1. Let f : A → B and g : B → C be bijections.

1. f −1 is a bijection and its inverse is f

2. f −1 ( f ( a)) = a for all a ∈ A

3. f ( f −1 (b)) = b for all b ∈ B

4. g ◦ f is a bijection from A to C and ( g ◦ f )−1 = f −1 ◦ g−1 .



Figure 4.8: Continuity

4.2.1 Convergence and Continuity

Let { xn }∞n=1 be a sequence of real numbers. (For each n = 1, 2, . . . we have a corre-
sponding xn ∈ R.) We say that xn converges to 0 if, given any neighborhood of 0,
the sequence points are eventually in that neighborhood. More formally (we won’t
use the formal definition, so feel free to skip this), given any e > 0, there exists an
N ∈ N such that | xn | < e whenever n ≥ N. Symbolically, xn → 0.
Now let {xn }∞n=1 be a sequence of vectors in R N . We say that xn converges to x ∈ R N
if kxn − xk → 0. Symbolically, xn → x. This is the fundamental notion of conver-
gence in R N . Whole branches of mathematics are built on this idea.
Let A ⊂ R N and B ⊂ R M . A function f : A → B is called continuous at x if
f (xn ) → f (x) whenever xn → x, and continuous if it is continuous at x for all
x ∈ A. Figure 4.8 illustrates.
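A simple numerical illustration of these definitions (our addition, assuming NumPy; the function and the sequence are arbitrary choices): as xn approaches x, both the distance kxn − xk and the gap | f (xn ) − f (x)| shrink toward zero.

    # Convergence of a sequence of vectors and continuity of a function on R^2.
    import numpy as np

    x = np.array([1.0, 2.0])
    f = lambda v: np.sin(v).sum()            # a continuous function from R^2 to R
    for n in [1, 10, 100, 1000]:
        xn = x + np.array([1.0, -1.0]) / n   # xn -> x as n grows
        print(np.linalg.norm(xn - x), abs(f(xn) - f(x)))   # both columns shrink to 0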

4.3 Real-Valued Functions

For any set A, if f : A → R then f is called a real-valued function. If f and g are


real-valued functions, then f + g is defined by ( f + g)( x ) = f ( x ) + g( x ), while α f
is defined by (α f )( x ) = α f ( x ). A maximizer of f on A is a point a∗ ∈ A such that

f ( a∗ ) ≥ f ( a) for all a ∈ A

Figure 4.9: Monotone transforms preserve maximizers

The value f ( a∗ ) is called the maximum of f on A. A minimizer of f on A is a point


b∗ ∈ A such that
f (b∗ ) ≤ f ( a) for all a ∈ A
The value f (b∗ ) is called the minimum of f on A.
Monotone increasing transformations of functions do not affect maximizers. To see
this, let f : R → R and let m be a monotone increasing function, in the sense that
if x ≤ x 0 , then m( x ) ≤ m( x 0 ), and let g be the function defined by g( a) = m( f ( a)).
Our claim is this:

Any maximizer of f on A is also a maximizer of g on A.

It’s easy to see why this is the case. Let a∗ be a maximizer of f on A and fix any a ∈ A.
Since a∗ is a maximizer of f , it must be the case that f ( a) ≤ f ( a∗ ). Since m is monotone
increasing, this implies that m( f ( a)) ≤ m( f ( a∗ )). Given that a was chosen arbitrarily,
we have now shown that

g( a∗ ) ≥ g( a) for all a ∈ A

In other words, a∗ is a maximizer of g on A.
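A quick numerical illustration on a grid (our addition, assuming NumPy; the particular f and m below are arbitrary choices):

    # A monotone increasing transform leaves the maximizer unchanged.
    import numpy as np

    A = np.linspace(-2.0, 2.0, 401)          # a grid standing in for the set A
    f = lambda a: -(a - 0.5)**2              # maximized at a = 0.5
    m = np.exp                               # a monotone increasing transform
    g = lambda a: m(f(a))
    print(A[np.argmax(f(A))], A[np.argmax(g(A))])   # the same point both times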


Before finishing this topic, let’s recall the notions of supremum and infimum. To
illustrate, consider the function f : (0, 1) → (0, 1) defined by f ( x ) = x. It should be
clear that f has no maximizer on (0, 1): given any a∗ ∈ (0, 1), we can always choose
another point a∗∗ ∈ (0, 1) such that a∗∗ = f ( a∗∗ ) > f ( a∗ ) = a∗ . No maximizer exists
and the optimization problem max_{x∈(0,1)} f ( x ) has no solution.
To get around this kind of problem, we often use the notion of supremum instead.
If A is a set, then the supremum s := sup A is the unique number s such that a ≤ s

for every a ∈ A, and, moreover, there exists a sequence { xn } ⊂ A such that xn → s.


For example, 1 is the supremum of both (0, 1) and [0, 1]. The infimum i := inf A is
the unique number i such that a ≥ i for every a ∈ A, and, moreover, there exists a
sequence { xn } ⊂ A such that xn → i. For example, 0 is the infimum of both (0, 1)
and [0, 1].
One can show that the supremum and infimum of any nonempty bounded set A exist; they
exist for any nonempty set A once the values −∞ and ∞ are admitted as possible infimum and supremum.
Returning to our original example with f ( x ) = x, while max_{x∈(0,1)} f ( x ) is not well
defined, sup_{x∈(0,1)} f ( x ) := sup{ f ( x ) : x ∈ (0, 1)} = sup(0, 1) = 1.
