Stats 310 Notes

This document outlines lecture notes on measures and integration theory. It begins with an introduction to measurable spaces and σ-algebras. It then covers outer measures, Carathéodory's extension theorem for constructing measures, and examples of non-measurable sets. Subsequent chapters discuss measurable functions, Lebesgue integration, product spaces, norms and inequalities, random variables, expectation, variance, independence, convergence of random variables, conditional expectation, martingales, ergodic theory, Markov chains, weak convergence, Brownian motion, and stochastic calculus. The document provides a comprehensive overview of foundational probability and measure theory concepts.


STATS 310A,B,C (MATH 230A,B,C) Lecture notes (ongoing, to be updated)

Sourav Chatterjee
Contents

Chapter 1. Measures 1
1.1. Measurable spaces 1
1.2. Measure spaces 2
1.3. Dynkin’s π-λ theorem 4
1.4. Outer measures 6
1.5. Carathéodory’s extension theorem 8
1.6. Construction of Lebesgue measure 10
1.7. Example of a non-measurable set 12
1.8. Completion of a measure space 12

Chapter 2. Measurable functions and integration 15


2.1. Measurable functions 15
2.2. Lebesgue integration 17
2.3. The monotone convergence theorem 19
2.4. Linearity of the Lebesgue integral 21
2.5. Fatou’s lemma and dominated convergence 23
2.6. The concept of almost everywhere 25

Chapter 3. Product spaces 27


3.1. Finite dimensional product spaces 27
3.2. Fubini’s theorem 29
3.3. Infinite dimensional product spaces 32
3.4. Kolmogorov’s extension theorem 34

Chapter 4. Norms and inequalities 37


4.1. Markov’s inequality 37
4.2. Jensen’s inequality 37
4.3. The first Borel–Cantelli lemma 39
4.4. Lp spaces and inequalities 40

Chapter 5. Random variables 45


5.1. Definition 45
5.2. Cumulative distribution function 46
5.3. The law of a random variable 47
5.4. Probability density function 47
5.5. Some standard densities 48
5.6. Standard discrete distributions 49

Chapter 6. Expectation, variance, and other functionals 51


6.1. Expected value 51
6.2. Variance and covariance 53
6.3. Moments and moment generating function 55
6.4. Characteristic function 56
6.5. Characteristic function of the normal distribution 58

Chapter 7. Independence 59
7.1. Definition 59
7.2. Expectation of a product under independence 60
7.3. The second Borel–Cantelli lemma 62
7.4. The Kolmogorov zero-one law 63
7.5. Zero-one laws for i.i.d. random variables 64
7.6. Random vectors 66
7.7. Convolutions 67

Chapter 8. Convergence of random variables 71


8.1. Four notions of convergence 71
8.2. Interrelations between the four notions 72
8.3. Uniform integrability 74
8.4. The weak law of large numbers 76
8.5. The strong law of large numbers 77
8.6. Tightness and Helly’s selection theorem 81
8.7. An alternative characterization of weak convergence 82
8.8. Inversion formulas 84
8.9. Lévy’s continuity theorem 86
8.10. The central limit theorem for i.i.d. sums 87
8.11. The Lindeberg–Feller central limit theorem 91
8.12. Stable laws 94

Chapter 9. Conditional expectation and martingales 101


9.1. Conditional expectation 101
9.2. Basic properties of conditional expectation 104
9.3. Jensen’s inequality for conditional expectation 108
9.4. Martingales 109
9.5. Stopping times 110
9.6. Optional stopping theorem 111
9.7. Submartingales and supermartingales 114
9.8. Optimal stopping and Snell envelopes 117
9.9. Almost sure convergence of martingales 121
9.10. Lévy’s downwards convergence theorem 124
9.11. De Finetti’s theorem 125
9.12. Lévy’s upwards convergence theorem 128
9.13. Lp convergence of martingales 129
9.14. Almost supermartingales 133

Chapter 10. Ergodic theory 137


10.1. Measure preserving transforms 137
10.2. Ergodic transforms 139
10.3. Birkhoff’s ergodic theorem 141
10.4. Stationary sequences 144

Chapter 11. Markov chains 149


11.1. The Ionescu-Tulcea existence theorem 149
11.2. Markov chains on countable state spaces 152
11.3. Pólya’s recurrence theorem 155
11.4. Markov chains on finite state spaces 159

Chapter 12. Weak convergence on Polish spaces 165


12.1. Definition 165
12.2. The portmanteau lemma 166
12.3. Independence on Polish spaces 169
12.4. Tightness and Prokhorov’s theorem 170
12.5. Skorokhod’s representation theorem 175
12.6. Convergence in probability on Polish spaces 177
12.7. Multivariate inversion formula 178
12.8. Multivariate Lévy continuity theorem 180
12.9. The Cramér–Wold device 180
12.10. The multivariate CLT for i.i.d. sums 181

Chapter 13. Brownian motion 183


13.1. The spaces C[0, 1] and C[0, ∞) 183
13.2. Tightness on C[0, 1] 184
13.3. Donsker’s theorem 186
13.4. Construction of Brownian motion 190
13.5. An application of Donsker’s theorem 192
13.6. Law of large numbers for Brownian motion 193
13.7. Nowhere differentiability of Brownian motion 194
13.8. The Brownian filtration 196
13.9. Markov property of Brownian motion 197
13.10. Law of the iterated logarithm for Brownian motion 199
13.11. Stopping times for Brownian motion 202
13.12. The strong Markov property 205
13.13. Multidimensional Brownian motion 207

Chapter 14. Martingales in continuous time 209


14.1. Optional stopping theorem in continuous time 209
14.2. Doob’s Lp inequality in continuous time 210
14.3. Martingales related to Brownian motion 211
14.4. Skorokhod’s embedding 214
14.5. Strassen’s coupling 216
14.6. The Hartman–Wintner LIL 218

Chapter 15. Introduction to stochastic calculus 219


15.1. Continuous stochastic processes 219
15.2. The Itô integral 219
15.3. The Itô integral as a continuous martingale 223
15.4. The quadratic variation process 224
15.5. Itô integrals for multidimensional Brownian motion 226
15.6. Itô’s formula 227
15.7. Stochastic differential equations 231
15.8. Chain rule for stochastic calculus 237
15.9. The Ornstein–Uhlenbeck process 241
15.10. Lévy’s characterization of Brownian motion 242
CHAPTER 1

Measures

The mathematical foundation of probability theory is measure theoretic. In this
chapter we will build some of the basic measure theoretic tools that are needed for
probability theory.

1.1. Measurable spaces


DEFINITION 1.1.1. Let Ω be a set. A σ-algebra F on Ω is a collection of
subsets such that
(1) ∅ ∈ F (where ∅ denotes the empty set),
(2) if A ∈ F, then Ac ∈ F (where Ac denotes the complement of A in Ω), and
(3) if A1, A2, ... is a countable collection of sets in F, then
    ∪_{n=1}^∞ An ∈ F.

If the countable union condition is replaced by a finite union condition, then
F is called an algebra instead of a σ-algebra. Note that σ-algebras are also closed
under finite unions, since we can append empty sets to convert a finite collection
into a countable collection.

DEFINITION 1.1.2. A pair (Ω, F), where Ω is a set and F is a σ-algebra on Ω,
is called a measurable space.

EXERCISE 1.1.3. Prove that a σ-algebra is closed under countable intersections.

EXERCISE 1.1.4. For any set Ω, show that the power set of Ω is a σ-algebra on Ω.

EXERCISE 1.1.5. Prove that the intersection of any arbitrary collection of
σ-algebras on a set is a σ-algebra.

EXERCISE 1.1.6. If Ω is a set and A is any collection of subsets of Ω, show
that there is a 'smallest' σ-algebra F containing A, in the sense that any σ-algebra
G containing A must also contain F. (Hint: Use the previous two exercises.)

The above exercise motivates the following definition.

DEFINITION 1.1.7. Let Ω be a set and A be a collection of subsets of Ω. The
smallest σ-algebra containing A is called the σ-algebra generated by A, and is
denoted by σ(A).
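For a finite Ω, the 'smallest σ-algebra' of Exercise 1.1.6 can be computed directly: start from A together with ∅ and Ω, and close under complements and pairwise unions (on a finite space, countable unions reduce to finite ones). A minimal Python sketch, with a function name of our own choosing:

```python
def generated_sigma_algebra(omega, sets):
    """Brute-force sigma(A) on a finite omega: start from the given sets
    together with the empty set and omega, then close under complements
    and pairwise unions.  On a finite space this closure is exactly the
    generated sigma-algebra."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(s) for s in sets}
    changed = True
    while changed:
        changed = False
        for a in list(family):
            candidates = [omega - a] + [a | b for b in family]
            for c in candidates:
                if c not in family:
                    family.add(c)
                    changed = True
    return family

sigma = generated_sigma_algebra({1, 2, 3, 4}, [{1}, {2}])
# sigma is built from the atoms {1}, {2}, {3,4}: 2^3 = 8 sets in total,
# e.g. {3,4} arises as the complement of {1} u {2}.
assert frozenset({3, 4}) in sigma and len(sigma) == 8
```

The intersection-of-all-σ-algebras definition and this closure computation agree on finite spaces; the closure view is just easier to run.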

The above definition makes it possible to define the following important class
of σ-algebras.

DEFINITION 1.1.8. Let Ω be a set endowed with a topology. The Borel
σ-algebra on Ω is the σ-algebra generated by the collection of open subsets of Ω. It
is sometimes denoted by B(Ω).

In particular, the Borel σ-algebra on the real line R is the σ-algebra generated
by all open subsets of R.

EXERCISE 1.1.9. Prove that the Borel σ-algebra on R is also generated by
the set of all open intervals (or half-open intervals, or closed intervals). (Hint:
Show that any open set is a countable union of open — or half-open, or closed —
intervals.)

EXERCISE 1.1.10. Show that intervals of the form (x, ∞) also generate B(R),
as do intervals of the form (−∞, x).

EXERCISE 1.1.11. Let (Ω, F) be a measurable space and take any Ω′ ∈ F.
Consider the set F′ := {A ∩ Ω′ : A ∈ F}. Show that this is a σ-algebra. Moreover,
show that if F is generated by a collection of sets A, then F′ is generated by the
collection A′ := {A ∩ Ω′ : A ∈ A}. (The σ-algebra F′ is called the restriction of
F to Ω′.)

EXERCISE 1.1.12. As a corollary of the above exercise, show that if Ω is a
topological space and Ω′ is a Borel subset of Ω endowed with the topology inherited
from Ω, then the restriction of B(Ω) to Ω′ equals the Borel σ-algebra of Ω′.

1.2. Measure spaces

A measurable space endowed with a measure is called a measure space. The
definition of measure is as follows.

DEFINITION 1.2.1. Let (Ω, F) be a measurable space. A measure µ on this
space is a function from F into [0, ∞] such that
(1) µ(∅) = 0, and
(2) if A1, A2, ... is a countable sequence of disjoint sets in F, then
    µ(∪_{n=1}^∞ An) = Σ_{n=1}^∞ µ(An).

The triple (Ω, F, µ) is called a measure space.

The second condition is known as the countable additivity condition. Note that
finite additivity is also valid, since we can append empty sets to a finite collection
to make it countable.
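On a finite space, a measure as in Definition 1.2.1 is just a nonnegative weight attached to each point, and countable additivity reduces to finite additivity. A small illustrative sketch (the helper name is ours):

```python
def make_measure(weights):
    """A measure on the power set of a finite omega: mu(A) is the sum of
    nonnegative point masses over A (Definition 1.2.1, with countable
    additivity reducing to finite additivity)."""
    def mu(a):
        return sum(weights[x] for x in a)
    return mu

mu = make_measure({"a": 0.5, "b": 0.25, "c": 0.25})
assert mu(set()) == 0                            # condition (1)
assert mu({"a", "b"}) == mu({"a"}) + mu({"b"})   # additivity on disjoint sets
assert mu({"a", "b", "c"}) == 1.0                # total mass 1: a probability measure
```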
EXERCISE 1.2.2. Let (Ω, F, µ) be a measure space. For any A, B ∈ F such
that A ⊆ B, show that µ(A) ≤ µ(B). For any A1, A2, ... ∈ F, show that
µ(∪Ai) ≤ Σ µ(Ai). (These are known as the monotonicity and countable
subadditivity properties of measures. Hint: Rewrite the union as a disjoint union of
B1, B2, ..., where Bi = Ai \ (A1 ∪ · · · ∪ Ai−1). Then use countable additivity.)

DEFINITION 1.2.3. If a measure µ on a measurable space (Ω, F) satisfies
µ(Ω) = 1, then it is called a probability measure, and the triple (Ω, F, µ) is called
a probability space. In this case, elements of F are often called 'events'.

EXERCISE 1.2.4. If (Ω, F, µ) is a probability space and A ∈ F, show that
µ(Ac) = 1 − µ(A).

EXERCISE 1.2.5. If (Ω, F, µ) is a measure space and A1, A2, ... ∈ F is
an increasing sequence of events (meaning that A1 ⊆ A2 ⊆ · · · ), prove that
µ(∪An) = lim µ(An). Moreover, if µ is a probability measure, and if A1, A2, ...
is a decreasing sequence of events (meaning that A1 ⊇ A2 ⊇ · · · ), prove that
µ(∩An) = lim µ(An). Lastly, show that the second assertion need not be true if
µ is not a probability measure. (Hint: For the first, rewrite the union as a disjoint
union and apply countable additivity. For the second, write the intersection as the
complement of a union and apply the first part.)

Let (Ω, F, µ) be a measure space. In measure theory, µ(A) is thought of as
a measure of the 'size' of A. When µ(A) is small, we think of A as a 'small'
set. Following this line of thought, we think of µ(A∆B) as a kind of distance
between two sets A and B (where A∆B is the symmetric difference of A and B).
If this is small, then A and B are 'almost the same set'. The following useful result
shows that if µ is a probability measure, then any element of F can be arbitrarily
well-approximated by elements of any generating algebra.

THEOREM 1.2.6. Let (Ω, F, µ) be a probability space, and let A be an algebra
of sets generating F. Then for any A ∈ F and any ε > 0, there is some B ∈ A
such that µ(A∆B) < ε, where A∆B is the symmetric difference of A and B.

PROOF. Let G be the collection of all A ∈ F for which the stated property
holds. Clearly, A ⊆ G. In particular, Ω ∈ G. Take A ∈ G and any ε > 0. Find
B ∈ A such that µ(A∆B) < ε. Since Ac∆Bc = A∆B, we get µ(Ac∆Bc) < ε.
Thus, Ac ∈ G. Finally, take any A1, A2, ... ∈ G. Let A := ∪Ai and take any
ε > 0. Since µ is a probability measure, there is some n large enough such that
    µ(A \ ∪_{i=1}^n Ai) < ε/2.
For each i, find Bi ∈ A such that µ(Ai∆Bi) < ε 2^{−i−1}. Let A′ := ∪_{i=1}^n Ai and
B′ := ∪_{i=1}^n Bi. It is not hard to see that
    A′∆B′ ⊆ ∪_{i=1}^n (Ai∆Bi).
Thus,
    µ(A′∆B′) ≤ Σ_{i=1}^n µ(Ai∆Bi) ≤ ε/2.
Again, it is not hard to check that A∆B′ ⊆ (A \ A′) ∪ (A′∆B′). Therefore
    µ(A∆B′) ≤ µ(A \ A′) + µ(A′∆B′) < ε.
Thus, A ∈ G. This proves that G is a σ-algebra, and hence contains σ(A) = F. □
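A concrete instance of Theorem 1.2.6, not taken from the notes: on (0, 1], the set A = (0, 1/3] can be approximated by elements of the algebra of finite unions of dyadic half-open intervals. The element B = (0, k/2^n] with k = round(2^n/3) satisfies λ(A∆B) ≤ 2^{−n−1}. A quick check with exact rational arithmetic (the function name is ours):

```python
from fractions import Fraction

def best_dyadic(target, n):
    """Approximate the set (0, target] by the algebra element (0, k/2^n]
    (a union of dyadic half-open intervals); the Lebesgue measure of the
    symmetric difference is |target - k/2^n| <= 2^(-n-1)."""
    k = round(Fraction(target) * 2**n)
    approx = Fraction(k, 2**n)
    return approx, abs(approx - Fraction(target))

_, err = best_dyadic(Fraction(1, 3), 10)
assert err <= Fraction(1, 2**11)   # within half a dyadic cell of width 2^-10
```

Refining the grid (larger n) drives the symmetric-difference measure to 0, which is the content of the theorem for this particular A.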

1.3. Dynkin's π-λ theorem

DEFINITION 1.3.1. Let Ω be a set. A collection P of subsets of Ω is called a
π-system if it is closed under finite intersections.

DEFINITION 1.3.2. Let Ω be a set. A collection L of subsets of Ω is called a
λ-system (or Dynkin system) if Ω ∈ L and L is closed under taking complements
and countable disjoint unions.

EXERCISE 1.3.3. Show that L is a λ-system if and only if
(1) Ω ∈ L,
(2) if A, B ∈ L and A ⊆ B, then B \ A ∈ L, and
(3) if A1, A2, ... ∈ L and Ai ⊆ Ai+1 for each i, then
    ∪_{i=1}^∞ Ai ∈ L.

LEMMA 1.3.4. If a λ-system is also a π-system, then it is a σ-algebra.

PROOF. Let L be a λ-system which is also a π-system. Then L is closed under
complements by definition, and clearly, ∅ ∈ L. Suppose that A1, A2, ... ∈ L. Let
B1 = A1, and for each i ≥ 2, let
    Bi = Ai ∩ A1^c ∩ A2^c ∩ · · · ∩ Ai−1^c.
Since L is a λ-system, each Ai^c is in L. Therefore, since L is also a π-system, each
Bi ∈ L. By construction, B1, B2, ... are disjoint sets and
    ∪_{i=1}^∞ Ai = ∪_{i=1}^∞ Bi.
Thus ∪Ai ∈ L. This shows that L is a σ-algebra. □
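On a finite Ω, the defining properties of π-systems, λ-systems, and σ-algebras can be checked exhaustively, which makes both Lemma 1.3.4 and the gap between λ-systems and σ-algebras easy to see on examples. A brute-force sketch (the predicate names are ours):

```python
def is_pi_system(omega, fam):
    """Closed under (finite) intersections."""
    return all((a & b) in fam for a in fam for b in fam)

def is_lambda_system(omega, fam):
    """Contains omega, closed under complements and disjoint unions (on a
    finite omega, countable disjoint unions reduce to pairwise ones)."""
    return (omega in fam
            and all((omega - a) in fam for a in fam)
            and all((a | b) in fam
                    for a in fam for b in fam if not (a & b)))

def is_sigma_algebra(omega, fam):
    return (frozenset() in fam
            and all((omega - a) in fam for a in fam)
            and all((a | b) in fam for a in fam for b in fam))

omega = frozenset({1, 2, 3, 4})
# Both a pi-system and a lambda-system, hence a sigma-algebra (Lemma 1.3.4):
fam = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}
assert is_pi_system(omega, fam) and is_lambda_system(omega, fam)
assert is_sigma_algebra(omega, fam)
# A lambda-system that is NOT a pi-system, and indeed not a sigma-algebra:
fam2 = fam | {frozenset({1, 3}), frozenset({2, 4})}
assert is_lambda_system(omega, fam2) and not is_pi_system(omega, fam2)
assert not is_sigma_algebra(omega, fam2)
```

The second family shows why the π-system hypothesis in the lemma cannot be dropped.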
THEOREM 1.3.5 (Dynkin's π-λ theorem). Let Ω be a set. Let P be a π-system
of subsets of Ω, and let L ⊇ P be a λ-system of subsets of Ω. Then L ⊇ σ(P).

PROOF. Since the intersection of all λ-systems containing P is again a λ-system,
we may assume that L is the smallest λ-system containing P.
Take any A ∈ P. Let
    GA := {B ∈ L : A ∩ B ∈ L}.    (1.3.1)
Then GA ⊇ P since P is a π-system and P ⊆ L. Clearly, Ω ∈ GA. If B ∈ GA,
then Bc ∈ L and
    A ∩ Bc = (Ac ∪ (A ∩ B))c ∈ L,
since Ac and A ∩ B are disjoint elements of L and L is a λ-system. Thus, Bc ∈ GA.
If B1, B2, ... are disjoint sets in GA, then
    A ∩ (B1 ∪ B2 ∪ · · · ) = (A ∩ B1) ∪ (A ∩ B2) ∪ · · · ∈ L,
again since L is a λ-system. Thus, GA is a λ-system containing P. By the
minimality of L, this shows that GA = L. In particular, if A ∈ P and B ∈ L, then
A ∩ B ∈ L.
Next, for A ∈ L, let GA be defined as in (1.3.1). By the deduction in the
previous paragraph, GA ⊇ P. As before, GA is a λ-system. Thus, GA = L. In
particular, L is a π-system. By Lemma 1.3.4, this completes the proof. □

An important corollary of the π-λ theorem is the following result about
uniqueness of measures.

THEOREM 1.3.6. Let P be a π-system. If µ1 and µ2 are measures on σ(P)
that agree on P, and there is a sequence A1, A2, ... ∈ P such that An increases
to Ω and µ1(An) and µ2(An) are both finite for every n, then µ1 = µ2 on σ(P).

PROOF. Take any A ∈ P such that µ1(A) = µ2(A) < ∞. Let
    L := {B ∈ σ(P) : µ1(A ∩ B) = µ2(A ∩ B)}.
Clearly, Ω ∈ L. If B ∈ L, then
    µ1(A ∩ Bc) = µ1(A) − µ1(A ∩ B) = µ2(A) − µ2(A ∩ B) = µ2(A ∩ Bc),
and hence Bc ∈ L. If B1, B2, ... ∈ L are disjoint and B is their union, then
    µ1(A ∩ B) = Σ_{i=1}^∞ µ1(A ∩ Bi) = Σ_{i=1}^∞ µ2(A ∩ Bi) = µ2(A ∩ B),
and therefore B ∈ L. This shows that L is a λ-system. Therefore by the π-λ
theorem, L = σ(P). In other words, for every B ∈ σ(P) and A ∈ P such that
µ1(A) < ∞, µ1(A ∩ B) = µ2(A ∩ B). By the given condition, there is a sequence
A1, A2, ... ∈ P such that µ1(An) < ∞ for every n and An ↑ Ω. Thus, for any
B ∈ σ(P),
    µ1(B) = lim_{n→∞} µ1(An ∩ B) = lim_{n→∞} µ2(An ∩ B) = µ2(B).
This completes the proof of the theorem. □



1.4. Outer measures

DEFINITION 1.4.1. Let Ω be any set and let 2^Ω denote its power set. A function
φ : 2^Ω → [0, ∞] is called an outer measure if it satisfies the following conditions:
(1) φ(∅) = 0.
(2) φ(A) ≤ φ(B) whenever A ⊆ B.
(3) For any A1, A2, ... ⊆ Ω,
    φ(∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ φ(Ai).

Note that there is no σ-algebra in the definition of an outer measure. In fact,
we will show below that an outer measure generates its own σ-algebra.

DEFINITION 1.4.2. If φ is an outer measure on a set Ω, a subset A ⊆ Ω is
called φ-measurable if for all B ⊆ Ω,
    φ(B) = φ(B ∩ A) + φ(B ∩ Ac).

Note that A is φ-measurable if and only if
    φ(B) ≥ φ(B ∩ A) + φ(B ∩ Ac)
for every B, since the opposite inequality follows by subadditivity. The following
result is the most important fact about outer measures.

THEOREM 1.4.3. Let Ω be a set and φ be an outer measure on Ω. Let F be
the collection of all φ-measurable subsets of Ω. Then F is a σ-algebra and φ is a
measure on F.
The proof of Theorem 1.4.3 is divided into a sequence of lemmas.

LEMMA 1.4.4. The collection F is an algebra.

PROOF. Clearly, ∅ ∈ F, since φ(∅) = 0. If A ∈ F, then Ac ∈ F by the
definition of F. If A, B ∈ F, let D := A ∪ B and note that by subadditivity of φ,
for any E we have
    φ(E ∩ D) + φ(E ∩ Dc)
        = φ((E ∩ A) ∪ (E ∩ B ∩ Ac)) + φ(E ∩ Ac ∩ Bc)
        ≤ φ(E ∩ A) + φ(E ∩ B ∩ Ac) + φ(E ∩ Ac ∩ Bc).
But since B ∈ F,
    φ(E ∩ B ∩ Ac) + φ(E ∩ Ac ∩ Bc) = φ(E ∩ Ac).
Thus,
    φ(E ∩ D) + φ(E ∩ Dc) ≤ φ(E ∩ A) + φ(E ∩ Ac).
But since A ∈ F, the right side equals φ(E). This completes the proof. □

LEMMA 1.4.5. If A1, ..., An ∈ F are disjoint and E ⊆ Ω, then
    φ(E ∩ (A1 ∪ · · · ∪ An)) = Σ_{i=1}^n φ(E ∩ Ai).

PROOF. For each j, let Bj := A1 ∪ · · · ∪ Aj. Since An ∈ F,
    φ(E ∩ Bn) = φ(E ∩ Bn ∩ An) + φ(E ∩ Bn ∩ An^c).
But since A1, ..., An are disjoint, Bn ∩ An^c = Bn−1 and Bn ∩ An = An. Thus,
    φ(E ∩ Bn) = φ(E ∩ An) + φ(E ∩ Bn−1).
The proof is now completed by induction. □
A consequence of the last two lemmas is the following.

LEMMA 1.4.6. If A1, A2, ... is a sequence of sets in F increasing to a set
A ⊆ Ω, then for any E ⊆ Ω,
    φ(E ∩ A) = lim_{n→∞} φ(E ∩ An).

PROOF. By monotonicity of φ, the left side dominates the right. For the
opposite inequality, let Bn := An ∩ (A1 ∪ · · · ∪ An−1)^c for each n, so that the sets
B1, B2, ... are disjoint and An = B1 ∪ · · · ∪ Bn for each n. By Lemma 1.4.4,
Bn ∈ F for each n. Thus, by Lemma 1.4.5,
    φ(E ∩ An) = Σ_{i=1}^n φ(E ∩ Bi).
Consequently,
    lim_{n→∞} φ(E ∩ An) = Σ_{i=1}^∞ φ(E ∩ Bi).
Since φ is countably subadditive, this completes the proof of the lemma. □
We are now ready to prove Theorem 1.4.3.

PROOF OF THEOREM 1.4.3. Let A1, A2, ... ∈ F and let A := ∪Ai. For each
n, let
    Bn := ∪_{i=1}^n Ai.
Take any E ⊆ Ω and any n. By Lemma 1.4.4, Bn ∈ F and hence
    φ(E) = φ(E ∩ Bn) + φ(E ∩ Bn^c).
By monotonicity of φ, φ(E ∩ Bn^c) ≥ φ(E ∩ Ac). Thus,
    φ(E) ≥ φ(E ∩ Bn) + φ(E ∩ Ac).
By Lemma 1.4.6,
    lim_{n→∞} φ(E ∩ Bn) = φ(E ∩ A).
Thus, A ∈ F. Together with Lemma 1.4.4, this shows that F is a σ-algebra. To
show that φ is a measure on F, take any disjoint collection of sets A1, A2, ... ∈ F.
Let B be the union of these sets, and let Bn = A1 ∪ · · · ∪ An. By the monotonicity
of φ and Lemma 1.4.5,
    φ(B) ≥ φ(Bn) = Σ_{i=1}^n φ(Ai).
Letting n → ∞, we get φ(B) ≥ Σ φ(Ai). On the other hand, subadditivity of φ
gives the opposite inequality. This shows that φ is a measure on F and completes
the proof. □

1.5. Carathéodory's extension theorem

Let Ω be a set and A be an algebra of subsets of Ω. A function µ : A → [0, ∞]
is called a measure on A if
(1) µ(∅) = 0, and
(2) if A1, A2, ... is a countable collection of disjoint elements of A such that
their union is also in A, then
    µ(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ µ(Ai).

A measure µ on an algebra A is called σ-finite if there is a countable family of sets
A1, A2, ... ∈ A such that µ(Ai) < ∞ for each i, and Ω = ∪Ai.

THEOREM 1.5.1 (Carathéodory's extension theorem). If A is an algebra of
subsets of a set Ω and µ is a measure on A, then µ has an extension to σ(A).
Moreover, if µ is σ-finite on A, then the extension is unique.

The plan of the proof is to construct an outer measure on Ω that agrees with µ
on A. Then Theorem 1.4.3 will give the required extension. The outer measure is
defined as follows. For each A ⊆ Ω, let
    µ*(A) := inf { Σ_{i=1}^∞ µ(Ai) : A1, A2, ... ∈ A, A ⊆ ∪_{i=1}^∞ Ai }.    (1.5.1)
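On a finite Ω, the infimum in (1.5.1) runs over finitely many covers, so µ* can be computed directly. A brute-force sketch (the names are ours), on a four-point space whose algebra is generated by the blocks {1, 2} and {3, 4}:

```python
from itertools import combinations

def outer_measure(algebra, mu, target):
    """mu*(target) as in (1.5.1), specialized to a finite omega: the
    infimum over covers of target by algebra members of the summed
    mu-values (finite covers suffice on a finite space)."""
    best = float("inf")
    members = list(algebra)
    for r in range(1, len(members) + 1):
        for cover in combinations(members, r):
            if target <= frozenset().union(*cover):  # cover contains target
                best = min(best, sum(mu[c] for c in cover))
    return best

omega = frozenset({1, 2, 3, 4})
algebra = [frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega]
mu = {frozenset(): 0, frozenset({1, 2}): 1, frozenset({3, 4}): 2, omega: 3}
assert outer_measure(algebra, mu, frozenset({1})) == 1     # cheapest cover: {1,2}
assert outer_measure(algebra, mu, frozenset({1, 2})) == 1  # agrees with mu on A (Lemma 1.5.3)
```

Note that µ* assigns a value to every subset, including {1}, which is not in the algebra; that is the point of an outer measure.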

The next lemma establishes that µ* is an outer measure on Ω.

LEMMA 1.5.2. The functional µ* is an outer measure on Ω.

PROOF. It is obvious from the definition of µ* that µ*(∅) = 0 and µ* is
monotone. For proving subadditivity, take any sequence of sets {Ai}_{i≥1} and let
A := ∪Ai. Fix some ε > 0, and for each i, let {Aij}_{j=1}^∞ be a collection of
elements of A such that Ai ⊆ ∪j Aij and
    Σ_{j=1}^∞ µ(Aij) ≤ µ*(Ai) + ε 2^{−i}.
Then {Aij}_{i,j=1}^∞ is a countable cover for A, and so
    µ*(A) ≤ Σ_{i,j=1}^∞ µ(Aij) ≤ Σ_{i=1}^∞ (µ*(Ai) + ε 2^{−i}) = ε + Σ_{i=1}^∞ µ*(Ai).
Since ε is arbitrary, this completes the proof of the lemma. □

The next lemma shows that µ* is a viable candidate for an extension.

LEMMA 1.5.3. For A ∈ A, µ*(A) = µ(A).

PROOF. Take any A ∈ A. By definition, µ*(A) ≤ µ(A). Conversely, take
any A1, A2, ... ∈ A such that A ⊆ ∪Ai. Then A = ∪(A ∩ Ai). Note that
each A ∩ Ai ∈ A, and their union is A, which is also in A. It is easy to check
that countable subadditivity (Exercise 1.2.2) continues to be valid for measures on
algebras, provided that the union belongs to the algebra. Thus, we get
    µ(A) ≤ Σ_{i=1}^∞ µ(A ∩ Ai) ≤ Σ_{i=1}^∞ µ(Ai).
This shows that µ(A) ≤ µ*(A). □

We are now ready to prove Carathéodory's extension theorem.

PROOF OF THEOREM 1.5.1. Let A* be the set of all µ*-measurable sets. By
Theorem 1.4.3, we know that A* is a σ-algebra and that µ* is a measure on A*.
We now claim that A ⊆ A*. To prove this, take any A ∈ A and E ⊆ Ω. Let
A1, A2, ... be any sequence of elements of A that cover E. Then {A ∩ Ai}_{i=1}^∞ is
a cover for E ∩ A and {Ac ∩ Ai}_{i=1}^∞ is a cover for E ∩ Ac. Consequently,
    µ*(E ∩ A) + µ*(E ∩ Ac) ≤ Σ_{i=1}^∞ (µ(A ∩ Ai) + µ(Ac ∩ Ai)) = Σ_{i=1}^∞ µ(Ai).
Taking infimum over all choices of {Ai}_{i=1}^∞, this shows that
µ*(E ∩ A) + µ*(E ∩ Ac) ≤ µ*(E), which means that A ∈ A*. Thus, A ⊆ A*. This proves the
existence part of the theorem. If µ is σ-finite on A, then the uniqueness of the
extension follows from Theorem 1.3.6. □

1.6. Construction of Lebesgue measure

Let A be the set of all subsets of R that are finite disjoint unions of half-open
intervals of the form (a, b] ∩ R, where −∞ ≤ a ≤ b ≤ ∞. Here we write (a, b] ∩ R
to ensure that the interval is (a, ∞) if b = ∞. If a = b, the interval is empty.

EXERCISE 1.6.1. Show that A is an algebra of subsets of R.

EXERCISE 1.6.2. Show that the algebra A generates the Borel σ-algebra of R.

Define a functional λ : A → [0, ∞] as:
    λ(∪_{i=1}^n (ai, bi] ∩ R) := Σ_{i=1}^n (bi − ai),
where remember that (a1, b1], ..., (an, bn] are disjoint. In other words, λ measures
the length of an element of A, as understood in the traditional sense. It is obvious
that λ is finitely additive on A (that is, λ(A1 ∪ · · · ∪ An) = λ(A1) + · · · + λ(An)
when A1, ..., An are disjoint elements of A). It is also obvious that λ is monotone,
that is, λ(A) ≤ λ(B) when A ⊆ B (just observe that B is the disjoint union of A
and B \ A, and apply finite additivity).
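The functional λ on A is easy to mirror in code: represent an element of A as a list of disjoint (a, b] intervals and sum the lengths. A minimal sketch, assuming the input intervals are disjoint (the function name is ours):

```python
def length(intervals):
    """lambda on the algebra A: the total length of a finite disjoint
    union of half-open intervals (a, b], given as (a, b) pairs."""
    return sum(b - a for a, b in intervals)

assert length([(0, 1), (2, 3.5)]) == 2.5
# finite additivity: splitting (0, 2] into (0, 1] and (1, 2] preserves length
assert length([(0, 2)]) == length([(0, 1)]) + length([(1, 2)])
```

The hard part of this section is not this formula but proving countable additivity, which is where compactness enters below.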

LEMMA 1.6.3. For any A1, ..., An ∈ A and any A ∈ A with A ⊆ A1 ∪ · · · ∪ An,
λ(A) ≤ Σ_{i=1}^n λ(Ai).

PROOF. Let B1 = A1 and Bi = Ai \ (A1 ∪ · · · ∪ Ai−1) for 2 ≤ i ≤ n. Then
B1, ..., Bn are disjoint and their union is the same as the union of A1, ..., An.
Therefore by the finite additivity and monotonicity of λ,
    λ(A) = Σ_{i=1}^n λ(A ∩ Bi) ≤ Σ_{i=1}^n λ(Bi) ≤ Σ_{i=1}^n λ(Ai),
where the last inequality holds because Bi ⊆ Ai for each i. □
PROPOSITION 1.6.4. The functional λ defined above is a σ-finite measure on A.

PROOF. Suppose that an element A ∈ A is a countable disjoint union of
elements A1, A2, ... ∈ A. We have to show that
    λ(A) = Σ_{i=1}^∞ λ(Ai).    (1.6.1)
It suffices to show this when A = (a, b] ∩ R and Ai = (ai, bi] ∩ R for each i, since
each element of A is a finite disjoint union of such intervals. There is nothing to
prove if a = b, so assume that a < b.
First suppose that −∞ < a < b < ∞. Take any δ > 0 such that a + δ < b,
and take any ε > 0. Then the closed interval [a + δ, b] is contained in the union of
(ai, bi + ε 2^{−i}), i ≥ 1. To see this, take any x ∈ [a + δ, b]. Then x ∈ (a, b], and
hence x ∈ (ai, bi] for some i. Thus, x ∈ (ai, bi + ε 2^{−i}).
Since [a + δ, b] is compact, it is therefore contained in the union of finitely
many (ai, bi + ε 2^{−i}). Consequently, there exists k such that
    (a + δ, b] ⊆ ∪_{i=1}^k (ai, bi + ε 2^{−i}].
Thus, by Lemma 1.6.3,
    b − a − δ ≤ Σ_{i=1}^k (bi + ε 2^{−i} − ai) ≤ ε + Σ_{i=1}^∞ (bi − ai).
Since this holds for any ε and δ, we get
    b − a ≤ Σ_{i=1}^∞ (bi − ai).    (1.6.2)
On the other hand, for any k, finite additivity and monotonicity of λ imply that
    b − a = λ(A) ≥ Σ_{i=1}^k λ(Ai) = Σ_{i=1}^k (bi − ai).
Thus,
    b − a ≥ Σ_{i=1}^∞ (bi − ai),    (1.6.3)
which proves (1.6.1) when a and b are finite. If either a or b is infinite, we find
finite a′, b′ such that (a′, b′] ⊆ (a, b] ∩ R. Repeating the above steps, we arrive at
the inequality
    b′ − a′ ≤ Σ_{i=1}^∞ (bi − ai).
Since this holds for any finite a′ > a and b′ < b, we recover (1.6.2), and (1.6.3)
continues to hold as before. This completes the proof of countable additivity of λ.
The σ-finiteness is trivial. □
COROLLARY 1.6.5. The functional λ has a unique extension to a measure on B(R).

PROOF. By Exercise 1.6.2, the algebra A generates the Borel σ-algebra. The
existence and uniqueness of the extension now follows by Proposition 1.6.4 and
Carathéodory's extension theorem. □

DEFINITION 1.6.6. The unique extension of λ given by Corollary 1.6.5 is
called the Lebesgue measure on the real line. The outer measure defined by the
formula (1.5.1) in this case is called Lebesgue outer measure, and the σ-algebra
induced by this outer measure is called the Lebesgue σ-algebra.

EXERCISE 1.6.7. Prove that for any −∞ < a ≤ b < ∞, λ([a, b]) = b − a.

EXERCISE 1.6.8. Define Lebesgue measure on R^n for general n by considering
disjoint unions of products of half-open intervals, and carrying out a similar
procedure as above.

EXERCISE 1.6.9. Let A be a Lebesgue measurable subset of R and let x ∈ R.
Let A + x denote the set {a + x : a ∈ A}. Prove that A + x is also Lebesgue
measurable, and has the same Lebesgue measure as A.

1.7. Example of a non-measurable set

In this section we will describe a standard construction of a subset of R that is
not Lebesgue measurable (and hence not Borel measurable). The critical ingredient
is the axiom of choice. Indeed, it can be shown that such a set cannot be produced
without invoking the axiom of choice.
Define an equivalence relation ∼ on [0, 1] as: x ∼ y if x − y ∈ Q. By the axiom
of choice, there is a subset A of [0, 1] consisting of exactly one element from each
equivalence class. We claim that A is not Lebesgue measurable. To prove this, let
q1, q2, ... be an enumeration of the rationals in [−1, 1], and define
    B := ∪_{i=1}^∞ (A + qi).
First, note that if i ≠ j, then (A + qi) ∩ (A + qj) = ∅. This is a simple consequence
of the definition of A. Next, take any x ∈ [0, 1]. Then there is some a ∈ A such
that x − a ∈ Q. But x − a ∈ [−1, 1]. Thus, x ∈ A + qi for some i. This implies that
B ⊇ [0, 1]. On the other hand, it is clear that B ⊆ [−1, 2]. Now, if A is Lebesgue
measurable, then by Exercise 1.6.9, so is B, and thus
    λ(B) = Σ_{i=1}^∞ λ(A + qi) = Σ_{i=1}^∞ λ(A),
which is ∞ if λ(A) > 0, and 0 if λ(A) = 0.
But since [0, 1] ⊆ B ⊆ [−1, 2], we have 1 ≤ λ(B) ≤ 3, which gives us a contradiction.
It can be shown that the Lebesgue σ-algebra is strictly bigger than the Borel
σ-algebra. That is, there is a Lebesgue measurable set that is not Borel measurable.
But this is a much more difficult result than the above example.

1.8. Completion of a measure space

Let (Ω, F, µ) be a measure space. Sometimes, it is convenient if for any set
A ∈ F with µ(A) = 0, any subset B of A is automatically also in F. This is
because in probability theory, we sometimes encounter events of probability zero
which are not obviously measurable. If F has this property, then it is called a
complete σ-algebra for the measure µ. If a σ-algebra is not complete, we may
want to find a bigger σ-algebra on which µ has an extension which is complete. The
smallest such extension is called the completion of the measure space (Ω, F, µ).

EXERCISE 1.8.1. Prove that the completion (Ω, F0, µ0) of an arbitrary measure
space (Ω, F, µ) can be obtained as follows.
(1) Let Z be the set of all subsets of the zero µ-measure sets in F.
(2) Let F0 := {A ∪ B : A ∈ F, B ∈ Z}.
(3) If C ∈ F0 equals A ∪ B for some A ∈ F and B ∈ Z, let µ0(C) := µ(A).
(First, show that this is well-defined.)
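On a finite space, the three steps of Exercise 1.8.1 can be carried out literally. A sketch (the names are ours) on Ω = {1, 2, 3}, where F is the four-set σ-algebra generated by the null set {1, 2}:

```python
from itertools import combinations

def complete(F, mu):
    """The completion of Exercise 1.8.1 on a finite space: Z collects all
    subsets of mu-null sets, F0 = {A u B : A in F, B in Z}, and
    mu0(A u B) = mu(A), which is well-defined because mu(B) = 0."""
    def subsets(s):
        s = sorted(s)
        return [frozenset(c) for r in range(len(s) + 1)
                for c in combinations(s, r)]
    Z = {b for a in F if mu[a] == 0 for b in subsets(a)}
    F0 = {a | b for a in F for b in Z}
    mu0 = {a | b: mu[a] for a in F for b in Z}
    return F0, mu0

F = [frozenset(), frozenset({1, 2}), frozenset({3}), frozenset({1, 2, 3})]
mu = {F[0]: 0, F[1]: 0, F[2]: 1, F[3]: 1}
F0, mu0 = complete(F, mu)
assert len(F0) == 8                 # the completion is the full power set
assert mu0[frozenset({1})] == 0     # {1} is a subset of the null set {1,2}
assert mu0[frozenset({1, 3})] == 1  # mu0({3} u {1}) = mu({3})
```

Here every subset of the null set {1, 2} gets adjoined with measure zero, which is exactly what completeness demands.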
EXERCISE 1.8.2. Prove that the completion of the Borel σ-algebra of R (or
R^n) obtained via the above prescription is the Lebesgue σ-algebra.

Recall that Lebesgue measure is defined on the Lebesgue σ-algebra. We will,
however, work with the Borel σ-algebra most of the time. When we say that a
function defined on R is 'measurable', we will mean Borel measurable unless otherwise
mentioned. On the other hand, abstract probability spaces, on which we will define
our random variables (measurable maps), will usually be taken to be complete.
CHAPTER 2

Measurable functions and integration

In this chapter, we will define measurable functions and Lebesgue integration,
and establish the basic properties of integrals.

2.1. Measurable functions

DEFINITION 2.1.1. Let (Ω, F) and (Ω′, F′) be two measurable spaces. A
function f : Ω → Ω′ is called measurable if f^{-1}(A) ∈ F for every A ∈ F′. Here,
as usual, f^{-1}(A) denotes the set of all x ∈ Ω such that f(x) ∈ A.

EXERCISE 2.1.2. If (Ωi, Fi) are measurable spaces for i = 1, 2, 3, f : Ω1 →
Ω2 is a measurable function, and g : Ω2 → Ω3 is a measurable function, show that
g ◦ f : Ω1 → Ω3 is a measurable function.

The main way to check that a function is measurable is the following lemma.

LEMMA 2.1.3. Let (Ω, F) and (Ω′, F′) be two measurable spaces and f :
Ω → Ω′ be a function. Suppose that there is a set A ⊆ F′ that generates F′, and
suppose that f^{-1}(A) ∈ F for all A ∈ A. Then f is measurable.

PROOF. It is easy to verify that the set of all B ⊆ Ω′ such that f^{-1}(B) ∈ F is
a σ-algebra. Since this set contains A, it must also contain the σ-algebra generated
by A, which is F′. Thus, f is measurable. □
Essentially all functions that arise in practice are measurable. Let us now see
why that is the case by identifying some large classes of measurable functions.

PROPOSITION 2.1.4. Suppose that Ω and Ω′ are topological spaces, and F
and F′ are their Borel σ-algebras. Then any continuous function from Ω into Ω′ is
measurable.

PROOF. Use Lemma 2.1.3 with A = the set of all open subsets of Ω′. □

A combination of Exercise 2.1.2 and Proposition 2.1.4 shows, for example,
that sums and products of real-valued measurable functions are measurable, since
addition and multiplication are continuous maps from R^2 to R, and if f, g : Ω → R
are measurable, then (f, g) : Ω → R^2 is measurable with respect to B(R^2) (easy
to show).

EXERCISE 2.1.5. Following the above sketch, show that sums and products of
measurable functions are measurable.

E XERCISE 2.1.6. Show that any right-continuous or left-continuous function
f : R → R is measurable.
E XERCISE 2.1.7. Show that any monotone function f : R → R is measurable.
E XERCISE 2.1.8. If Ω is a topological space endowed with its Borel σ-algebra,
show that any lower- or upper-semicontinuous f : Ω → R is measurable.
Often, we will have occasion to consider measurable functions that take value
in the set R∗ = R ∪ {−∞, ∞}, equipped with the σ-algebra generated by all
intervals of the form [a, b] where −∞ ≤ a ≤ b ≤ ∞.
P ROPOSITION 2.1.9. Let (Ω, F) be a measurable space and let {fn }n≥1 be a
sequence of measurable functions from Ω into R∗ . Let g(ω) := inf n≥1 fn (ω) and
h(ω) := supn≥1 fn (ω) for ω ∈ Ω. Then g and h are also measurable functions.

P ROOF. For any t ∈ R∗ , g(ω) ≥ t if and only if fn (ω) ≥ t for all n. Thus,
g −1 ([t, ∞]) = ⋂_{n=1}^∞ fn−1 ([t, ∞]),

which shows that g −1 ([t, ∞]) ∈ F. It is straightforward to verify that sets of the
form [t, ∞] generate the σ-algebra of R∗ . Thus, g is measurable. The proof for h
is similar. 
The following exercises are useful consequences of Proposition 2.1.9.
E XERCISE 2.1.10. If {fn }n≥1 is a sequence of R∗ -valued measurable func-
tions defined on the same measure space, show that the functions lim inf n→∞ fn
and lim supn→∞ fn are also measurable. (Hint: Write the lim sup as an infimum
of suprema and the lim inf as a supremum of infima.)
E XERCISE 2.1.11. If {fn }n≥1 is a sequence of R∗ -valued measurable func-
tions defined on the same measure space, and fn → f pointwise, show that f is
measurable. (Hint: Under the given conditions, f = lim supn→∞ fn .)
E XERCISE 2.1.12. If {fn }n≥1 is a sequence of [0, ∞]-valued measurable functions defined on the same measure space, show that ∑ fn is measurable.
E XERCISE 2.1.13. If {fn }n≥1 is a sequence of measurable R∗ -valued func-
tions defined on the same measurable space, show that the set of all ω where
lim fn (ω) exists is a measurable set.
It is sometimes useful to know that Exercise 2.1.11 has a generalization to
functions taking value in arbitrary separable metric spaces. In that setting, however,
the proof using lim sup does not work, and a different argument is needed. This is
given in the proof of the following result.
P ROPOSITION 2.1.14. Let (Ω, F) be a measurable space and let S be a sepa-
rable metric space endowed with its Borel σ-algebra. If {fn }n≥1 is a sequence of
measurable functions from Ω into S that converge pointwise to a limit function f ,
then f is also measurable.

P ROOF. Let B(x, r) denote the open ball with center x and radius r in S.
Since S is separable, any open set is a countable union of open balls. Therefore
by Lemma 2.1.3 it suffices to check that f −1 (B) ∈ F for any open ball B. Take
such a ball B(x, r). For any ω ∈ Ω, if f (ω) ∈ B(x, r), then there is some large
enough integer k such that f (ω) ∈ B(x, r − k −1 ). Since fn (ω) → f (ω), this
implies that fn (ω) ∈ B(x, r − k −1 ) for all large enough n. On the other hand, if
there exists k ≥ 1 such that fn (ω) ∈ B(x, r − k −1 ) for all large enough n, then
f (ω) ∈ B(x, r). Thus, we have shown that f (ω) ∈ B(x, r) if and only if there is
some integer k ≥ 1 such that fn (ω) ∈ B(x, r − k −1 ) for all large enough n. In set
theoretic notation, this statement can be written as
f −1 (B(x, r)) = ⋃_{k=1}^∞ ⋃_{N =1}^∞ ⋂_{n=N}^∞ fn−1 (B(x, r − k −1 )).

Since each fn is measurable, and σ-algebras are closed under countable unions and
intersections, this shows that f −1 (B(x, r)) ∈ F, completing the proof. 

Measurable maps generate σ-algebras of their own, which are important for
various purposes.
D EFINITION 2.1.15. Let (Ω, F) and (Ω0 , F 0 ) be two measurable spaces and
let f : Ω → Ω0 be a measurable function. The σ-algebra generated by f is defined
as
σ(f ) := {f −1 (A) : A ∈ F 0 }.

E XERCISE 2.1.16. Verify that σ(f ) in the above definition is indeed a σ-


algebra.

2.2. Lebesgue integration


Let (Ω, F, µ) be a measure space. This space will be fixed throughout this
section. Let f : Ω → R∗ be a measurable function, where R∗ = R ∪ {−∞, ∞}, as
defined in the previous section. Our goal in this section is to define the Lebesgue
integral
∫ f (ω)dµ(ω).

The definition comes in several steps. For A ∈ F, define the indicator function 1A
as
1A (ω) = 1 if ω ∈ A, and 1A (ω) = 0 if ω ∉ A.
Suppose that

f = ∑_{i=1}^n ai 1Ai (2.2.1)

for some a1 , . . . , an ∈ [0, ∞) and disjoint sets A1 , . . . , An ∈ F. Such functions


are called ‘nonnegative simple functions’. For a nonnegative simple function f as
above, define

∫_Ω f (ω)dµ(ω) := ∑_{i=1}^n ai µ(Ai ). (2.2.2)
A subtle point here is that the same simple function can have many different rep-
resentations like (2.2.1). It is not difficult to show that all of them yield the same
answer in (2.2.2).
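Definition (2.2.2) is directly computable when µ is a finite measure on a finite set. The sketch below (the helper name integrate_simple and the point-mass encoding of µ are ours, not from the text) also illustrates the subtle point just mentioned: two different representations of the same simple function yield the same integral.

```python
from fractions import Fraction

def integrate_simple(representation, mu):
    """Integral of a nonnegative simple function sum_i a_i * 1_{A_i},
    given as a list of (a_i, A_i) pairs, against a measure mu encoded
    as a dict mapping each point of a finite space to its mass."""
    return sum(a * sum(mu[x] for x in A) for a, A in representation)

# A finite measure on {1, 2, 3} and two representations of the same
# simple function (value 2 on {1, 2}, value 5 on {3}):
mu = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
rep1 = [(Fraction(2), {1, 2}), (Fraction(5), {3})]
rep2 = [(Fraction(2), {1}), (Fraction(2), {2}), (Fraction(5), {3})]
assert integrate_simple(rep1, mu) == integrate_simple(rep2, mu) == Fraction(5, 2)
```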
Next, take any measurable f : Ω → [0, ∞]. Let SF+ (f ) be the set of all
nonnegative simple functions g such that g(ω) ≤ f (ω) for all ω. Define
∫_Ω f (ω)dµ(ω) := sup_{g∈SF+ (f )} ∫_Ω g(ω)dµ(ω).

It is not difficult to prove that if f is a nonnegative simple function, then this defi-
nition gives the same answer as (2.2.2), so there is no inconsistency.
Finally, take any measurable f : Ω → R∗ . Define f + (ω) = max{f (ω), 0} and
f − (ω) = − min{f (ω), 0}. Then f + and f − are nonnegative measurable functions
(easy to show), and f = f + − f − . We say that the integral of f is defined when
the integral of at least one of f + and f − is finite. In this situation, we define

∫_Ω f (ω)dµ(ω) := ∫_Ω f + (ω)dµ(ω) − ∫_Ω f − (ω)dµ(ω).
Often, we will simply write ∫ f dµ for the integral of f .
D EFINITION 2.2.1. A measurable function f : Ω → R∗ is called integrable if
∫ f + dµ and ∫ f − dµ are both finite.
Note that for the integral ∫ f dµ to be defined, the function f need not be
integrable. As defined above, it suffices to have at least one of ∫ f + dµ and ∫ f − dµ
finite.
E XERCISE 2.2.2. Show that if f is an integrable function, then {ω : f (ω) =
∞ or − ∞} is a set of measure zero.
Sometimes, we will need to integrate a function f over a measurable subset
S ⊆ Ω rather than the whole of Ω. This is defined simply by making f zero
outside S and integrating the resulting function. That is, we define
∫_S f dµ := ∫_Ω f 1S dµ,
provided that the right side is defined. Here and everywhere else, we use the con-
vention ∞ · 0 = 0, which is a standard convention in measure theory and proba-
bility. The integral over S can also be defined in a different way, by considering S
itself as a set endowed with a σ-algebra and a measure. To be precise, let FS be
the restriction of F to S, that is, the set of all subsets of S that belong to F. Let
µS be the restriction of µ to FS . Let fS be the restriction of f to S. It is easy to
see that (S, FS , µS ) is a measure space and fS is a measurable function from this

space into R∗ . The integral of f over the set S can then be defined as the integral
of fS on the measure space (S, FS , µS ), provided that the integral exists. When
S = ∅, the integral can be defined to be zero.
E XERCISE 2.2.3. Show that the two definitions of ∫_S f dµ discussed above are
the same, in the sense that one is defined if and only if the other is defined, and in
that case they are equal.

2.3. The monotone convergence theorem


The monotone convergence theorem is a fundamental result of measure theory.
We will prove this result in this section. First, we need two lemmas. Throughout,
(Ω, F, µ) denotes a measure space.
L EMMA 2.3.1. If f, g : Ω → [0, ∞] are two measurable functions such that
f ≤ g everywhere, then ∫ f dµ ≤ ∫ gdµ.

P ROOF. If s ∈ SF+ (f ), then s is also in SF+ (g). Thus, the definition of ∫ gdµ
takes supremum over a larger set than that of ∫ f dµ. This proves the inequality. 

L EMMA 2.3.2. Let s : Ω → [0, ∞) be a measurable simple function. For each
S ∈ F, let ν(S) := ∫_S sdµ. Then ν is a measure on (Ω, F).

P ROOF. Suppose that

s = ∑_{i=1}^n ai 1Ai .
Since ν(∅) = 0 by definition, it suffices to show countable additivity of ν. Let
S1 , S2 , . . . be a sequence of disjoint sets in F, and let S be their union. Then
ν(S) = ∑_{i=1}^n ai µ(Ai ∩ S) = ∑_{i=1}^n ai ( ∑_{j=1}^∞ µ(Ai ∩ Sj ) ).

Since an infinite series with nonnegative summands may be rearranged any way
we like without altering the result, this gives
ν(S) = ∑_{j=1}^∞ ∑_{i=1}^n ai µ(Ai ∩ Sj ) = ∑_{j=1}^∞ ν(Sj ).

This proves the countable additivity of ν. 


T HEOREM 2.3.3 (Monotone convergence theorem). Suppose that {fn }n≥1 is
a sequence of measurable functions from Ω into [0, ∞], which are increasing pointwise to a limit function f . Then f is measurable, and ∫ f dµ = lim ∫ fn dµ.

P ROOF. The measurability of f has already been established earlier (Exercise
2.1.11). Since f ≥ fn for every n, Lemma 2.3.1 gives ∫ f dµ ≥ lim ∫ fn dµ, where
the limit on the right exists because the integrals on the right are increasing with n.
For the opposite inequality, take any s ∈ SF+ (f ). Define

ν(S) := ∫_S sdµ,
so that by Lemma 2.3.2, ν is a measure on Ω. Take any α ∈ (0, 1). Let
Sn := {ω : αs(ω) ≤ fn (ω)}.
It is easy to see that these sets are measurable and increasing with n. Moreover,
note that any ω ∈ Ω belongs to Sn for all sufficiently large n, because fn (ω)
increases to f (ω) and αs(ω) < f (ω) (unless f (ω) = 0, in which case αs(ω) =
fn (ω) = 0 for all n). Thus, Ω is the union of the increasing sequence S1 , S2 , . . ..
Since ν is a measure, this shows that
∫ sdµ = ν(Ω) = lim_{n→∞} ν(Sn ) = lim_{n→∞} ∫_{Sn} sdµ.

But αs ≤ fn on Sn . Moreover, it is easy to see that

∫_{Sn} αs dµ = α ∫_{Sn} s dµ

since s is a simple function. Thus, by Lemma 2.3.1,

α ∫_{Sn} s dµ = ∫_{Sn} αs dµ ≤ ∫_{Sn} fn dµ ≤ ∫_Ω fn dµ.
Combining the last two steps, we get α ∫ sdµ ≤ lim ∫ fn dµ. Since α ∈ (0, 1) is
arbitrary, this shows that ∫ sdµ ≤ lim ∫ fn dµ. Taking supremum over s, we get
the desired result. 
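As a toy illustration (ours, not from the text): on a finite set with counting measure, the truncations fn = min(f, n) increase pointwise to f, and their integrals, which are plain finite sums here, increase to the integral of f.

```python
# Counting measure on {0, ..., 99}; integrals are plain sums.
points = range(100)
f = {x: x for x in points}

def integral(g):
    return sum(g[x] for x in points)

# Truncations f_n = min(f, n) increase pointwise to f as n grows.
truncations = [{x: min(f[x], n) for x in points} for n in range(0, 200, 10)]
vals = [integral(g) for g in truncations]
assert vals == sorted(vals)        # the integrals are nondecreasing
assert vals[-1] == integral(f)     # and reach the integral of the limit
```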
E XERCISE 2.3.4. Produce a counterexample to show that the nonnegativity
condition in the monotone convergence theorem cannot be dropped.
The next exercise generalizes Lemma 2.3.2.

E XERCISE 2.3.5. Let f : Ω → [0, ∞] be a measurable function. Show that the
functional ν(S) := ∫_S f dµ defined on F is a measure.
The following result is often useful in applications of the monotone conver-
gence theorem.
P ROPOSITION 2.3.6. Given any measurable f : Ω → [0, ∞], there is a se-
quence of nonnegative simple functions increasing pointwise to f .

P ROOF. Take any n. If k2−n ≤ f (ω) < (k + 1)2−n for some integer 0 ≤
k < n2n , let fn (ω) = k2−n . If f (ω) ≥ n, let fn (ω) = n. It is easy to check that
the sequence {fn }n≥1 is a sequence of nonnegative simple functions that increase
pointwise to f . 
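The approximating sequence in the proof is explicit, so it can be checked numerically. A sketch (the helper name dyadic_approx is ours; floating-point evaluation is purely illustrative):

```python
def dyadic_approx(f, n):
    """n-th function from the proof of Proposition 2.3.6: round f down
    to an integer multiple of 2^-n, and cap the result at n."""
    def fn(x):
        y = f(x)
        return n if y >= n else int(y * 2 ** n) / 2 ** n
    return fn

f = lambda x: x * x
xs = [0.1 * k for k in range(30)]
for n in range(1, 8):
    fn, fnext = dyadic_approx(f, n), dyadic_approx(f, n + 1)
    # the sequence increases pointwise and never exceeds f
    assert all(fn(x) <= fnext(x) <= f(x) for x in xs)
```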

E XERCISE 2.3.7 (Change of variable formula). Let (Ω1 , F1 , µ1 ) be a measure


space and (Ω2 , F2 ) be a measurable space. Let g : Ω1 → Ω2 and f : Ω2 → R
be measurable functions such that f ◦ g is integrable. Let ν be the measure on Ω2
induced by g, that is, ν(A) := µ1 (g −1 (A)) for each A ∈ F2 . Then show that f is
integrable with respect to ν, and
∫_{Ω1} f ◦ g dµ1 = ∫_{Ω2} f dν.

(Hint: Start with f = 1A and use Proposition 2.3.6 and the monotone convergence
theorem to generalize.)
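On finite spaces the change of variable formula is a finite bookkeeping identity: ν just tallies the µ1-masses of the fibers of g. A sketch with made-up data (all names and values below are hypothetical):

```python
from collections import defaultdict

mu1 = {0: 0.2, 1: 0.3, 2: 0.5}          # a measure on Omega1 = {0, 1, 2}
g = {0: "a", 1: "b", 2: "a"}            # g : Omega1 -> Omega2 = {"a", "b"}
f = {"a": 10.0, "b": -4.0}              # f : Omega2 -> R

# The induced measure nu(A) = mu1(g^{-1}(A)), built point by point.
nu = defaultdict(float)
for x, mass in mu1.items():
    nu[g[x]] += mass

lhs = sum(f[g[x]] * mu1[x] for x in mu1)   # integral of f∘g against mu1
rhs = sum(f[y] * nu[y] for y in nu)        # integral of f against nu
assert abs(lhs - rhs) < 1e-12
```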

2.4. Linearity of the Lebesgue integral


A basic result about Lebesgue integration, which is not quite obvious from the
definition, is that the map f 7→ ∫ f dµ is linear. This is a little surprising, because
the Lebesgue integral is defined as a supremum, and functionals that are defined
as suprema or infima are rarely linear. In the following, when defining the sum
of two functions, we adopt the convention that ∞ − ∞ = 0. Typically such a
convention can lead to inconsistencies, but if the functions are integrable then such
occurrences will only take place on sets of measure zero and will not cause any
problems.
P ROPOSITION 2.4.1. If f and g are two integrable functions from Ω into R∗ ,
then for any α, β ∈ R, the function αf + βg is integrable and ∫ (αf + βg)dµ =
α ∫ f dµ + β ∫ gdµ. Moreover, if f and g are measurable functions from Ω into
[0, ∞] (but not necessarily integrable), then ∫ (f + g)dµ = ∫ f dµ + ∫ gdµ, and
for any α ∈ R, ∫ αf dµ = α ∫ f dµ.

P ROOF. It is easy to check that additivity of the integral holds for nonnega-
tive simple functions. Take any measurable f, g : Ω → [0, ∞]. Then by Propo-
sition 2.3.6, there exist sequences of nonnegative simple functions {un }n≥1 and
{vn }n≥1 increasing to f and g pointwise. Then un + vn increases to f + g point-
wise. Thus, by the linearity of integral for nonnegative simple functions and the
monotone convergence theorem,
Z Z
(f + g)dµ = lim (un + vn )dµ
n→∞
Z Z  Z Z
= lim un dµ + vn dµ = f dµ + gdµ.
n→∞

Next, take any α ≥ 0 and any measurable f : Ω → [0, ∞]. Any element of
SF+ (αf ) must be of the form αg for some g ∈ SF+ (f ). Thus
∫ αf dµ = sup_{g∈SF+ (f )} ∫ αg dµ = α sup_{g∈SF+ (f )} ∫ g dµ = α ∫ f dµ.

If α ≤ 0, then (αf )+ ≡ 0 and (αf )− = −αf . Thus,


∫ αf dµ = − ∫ (−αf )dµ = α ∫ f dµ.

This completes the proofs of the assertions about nonnegative functions. Next, take
any integrable f and g. It is easy to see that a consequence of integrability is that
the set where f or g take infinite values is a set of measure zero. The only problem in
defining the function f + g is that at some ω, it may happen that one of f (ω) and
g(ω) is ∞ and the other is −∞. At any such point, define (f + g)(ω) = 0. Then
f + g is defined everywhere, is measurable, and (f + g)+ ≤ f + + g + . Therefore,
by the additivity of integration for nonnegative functions and the integrability of f
and g, ∫ (f + g)+ dµ is finite. Similarly, ∫ (f + g)− dµ is finite. Thus, f + g is
integrable. Now,
(f + g)+ − (f + g)− = f + g = f + − f − + g + − g − ,
which can be written as
(f + g)+ + f − + g − = (f + g)− + f + + g + .
Note that the above identity holds even if one or more of the summands on either
side are infinity. Thus, by additivity of integration for nonnegative functions,
∫ (f + g)+ dµ + ∫ f − dµ + ∫ g − dµ = ∫ (f + g)− dµ + ∫ f + dµ + ∫ g + dµ,

which rearranges to give ∫ (f + g)dµ = ∫ f dµ + ∫ gdµ. Finally, if f is integrable
and α ≥ 0, then (αf )+ = αf + and (αf )− = αf − , which shows that αf is
integrable and ∫ αf dµ = α ∫ f dµ. Similarly, if α ≤ 0, then (αf )+ = −αf − and
(αf )− = −αf + , and the proof can be completed as before. 
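Linearity is easy to sanity-check numerically on a finite measure space; the random data below is purely illustrative and not from the text.

```python
import random

random.seed(0)
mu = {x: random.random() for x in range(50)}       # point masses
f = {x: random.uniform(-1, 1) for x in mu}
g = {x: random.uniform(-1, 1) for x in mu}

def integral(h):
    return sum(h[x] * mu[x] for x in mu)

# ∫(αf + βg)dµ should equal α∫f dµ + β∫g dµ, per Proposition 2.4.1.
alpha, beta = 2.5, -1.5
combo = {x: alpha * f[x] + beta * g[x] for x in mu}
assert abs(integral(combo) - (alpha * integral(f) + beta * integral(g))) < 1e-9
```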


An immediate consequence of Proposition 2.4.1 is that if f and g are measurable real-valued functions such that f ≥ g everywhere, then ∫ f dµ ≥ ∫ gdµ. This
can be seen easily by writing f = (f − g) + g, and observing that f − g is nonnegative. Another immediate consequence is that integrability of f is equivalent to the
condition that ∫ |f |dµ is finite, since |f | = f + + f − . Finally, another inequality
that we will often use, which is an easy consequence of the triangle inequality, the
definition of the Lebesgue integral, and the fact that |f | = f + + f − , is that for any
f : Ω → R∗ such that ∫ f dµ is defined, | ∫ f dµ| ≤ ∫ |f |dµ.
The reader may be wondering about the connection between the Lebesgue in-
tegral defined above and the Riemann integral taught in undergraduate analysis
classes. The following exercises clarify the relationship between the two.
E XERCISE 2.4.2. Let [a, b] be a finite interval, and let f : [a, b] → R be
a bounded measurable function. If f is Riemann integrable, show that it is also
Lebesgue integrable and that the two integrals are equal.
E XERCISE 2.4.3. Give an example of a Lebesgue integrable function on a
finite interval that is not Riemann integrable.
E XERCISE 2.4.4. Generalize Exercise 2.4.2 to higher dimensions.

E XERCISE 2.4.5. Consider the integral


∫_0^∞ (sin x / x) dx.
Show that this makes sense as a limit of integrals (both Lebesgue and Riemann)
from 0 to a as a → ∞. However, show that the above integral is not defined as an
integral in the Lebesgue sense.
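A numerical illustration (ours, not from the text) of why this happens: the partial integrals of sin x/x settle down near π/2, while the partial integrals of |sin x/x| keep growing, so neither the positive nor the negative part has finite integral. The midpoint-rule quadrature below is a rough sketch.

```python
import math

def riemann(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

sinc = lambda x: math.sin(x) / x

# Partial integrals of sin(x)/x converge (toward pi/2 ≈ 1.5708)...
partial = [riemann(sinc, 1e-9, a) for a in (50, 200, 800)]
assert all(abs(p - math.pi / 2) < 0.05 for p in partial)

# ...but the partial integrals of |sin(x)/x| grow without bound
# (roughly like (2/pi) log a), so sin(x)/x is not Lebesgue integrable.
absolute = [riemann(lambda x: abs(sinc(x)), 1e-9, a) for a in (50, 200, 800)]
assert absolute[0] < absolute[1] < absolute[2]
```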
Lebesgue integration with respect to the Lebesgue measure on R (or Rn ) shares
many properties of Riemann integrals. The following is one example.
E XERCISE 2.4.6. Let f : R → R be a Lebesgue integrable function. Take any
a ∈ R and let g(x) := f (x + a). Then show that ∫ gdλ = ∫ f dλ, where λ is
Lebesgue measure. (Hint: First prove this for simple functions.)
The following exercise has many applications in probability theory.

E XERCISE 2.4.7. If f1 , f2 , . . . are measurable functions from Ω into [0, ∞],
show that ∫ ∑ fi dµ = ∑ ∫ fi dµ.

2.5. Fatou’s lemma and dominated convergence


In this section we will establish two results about sequences of integrals that
are widely used in probability theory. Throughout, (Ω, F, µ) is a measure space.

T HEOREM 2.5.1 (Fatou’s lemma). Let {fn }n≥1 be a sequence of measurable
functions from Ω into [0, ∞]. Then ∫ (lim inf fn )dµ ≤ lim inf ∫ fn dµ.

P ROOF. For each n, let gn := inf m≥n fm . Then, as n → ∞, gn increases


pointwise to g := lim inf n→∞ fn . Moreover, gn ≤ fn everywhere and gn and
g are measurable by Exercise 2.1.10. Therefore by the monotone convergence
theorem,

∫ gdµ = lim_{n→∞} ∫ gn dµ ≤ lim inf_{n→∞} ∫ fn dµ,

which completes the proof. 
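The standard "moving bump" shows the inequality can be strict. A finite sketch (ours; the infinite sums and limits are truncated to finite ranges purely for computation): take counting measure on the nonnegative integers and fn = 1_{{n}}.

```python
def f(n, x):
    """f_n = indicator of the singleton {n} on the nonnegative integers."""
    return 1 if x == n else 0

# Every f_n integrates to 1 against counting measure (sum truncated).
integrals = [sum(f(n, x) for x in range(1000)) for n in range(20)]
assert all(v == 1 for v in integrals)

# But at each fixed x, f_n(x) = 0 for every n > x, so lim inf f_n = 0
# at every point, and Fatou's inequality reads 0 <= 1 here: strict.
tails = [max(f(n, x) for n in range(x + 1, x + 50)) for x in range(30)]
assert all(v == 0 for v in tails)
```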

T HEOREM 2.5.2 (Dominated convergence theorem). Let {fn }n≥1 be a sequence of measurable functions from Ω into R∗ that converge pointwise to a limit
function f . Suppose that there exists an integrable function h such that for each
n, |fn | ≤ h everywhere. Then lim ∫ fn dµ = ∫ f dµ. Moreover, we also have the
stronger result lim ∫ |fn − f |dµ = 0.

P ROOF. We know that f is measurable since it is the limit of a sequence of


measurable functions (Exercise 2.1.11). Moreover, |f | ≤ h everywhere. Thus,
fn + h and f + h are nonnegative integrable functions and fn + h → f + h
pointwise. Therefore by Fatou’s lemma, ∫ (f + h)dµ ≤ lim inf ∫ (fn + h)dµ.
Since all integrals are finite (due to the integrability of h and the fact that |fn | and
|f | are bounded by h), this gives ∫ f dµ ≤ lim inf ∫ fn dµ.
Now replace fn by −fn and f by −f . All the conditions still hold, and therefore the same deduction shows that ∫ (−f )dµ ≤ lim inf ∫ (−fn )dµ, which is the
same as ∫ f dµ ≥ lim sup ∫ fn dµ. Thus, we get ∫ f dµ = lim ∫ fn dµ.
Finally, to show that lim ∫ |fn − f |dµ = 0, observe that |fn − f | → 0 pointwise,
and |fn − f | is bounded by the integrable function 2h everywhere. Then apply the
first part. 
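A quick numerical sketch (ours, not from the text): fn(x) = x^n on [0, 1) tends pointwise to 0 and is dominated by the integrable function h ≡ 1, so the theorem forces ∫ fn dλ = 1/(n + 1) → 0. The midpoint quadrature below just checks this.

```python
def riemann(f, n_steps=100000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n_steps
    return h * sum(f((k + 0.5) * h) for k in range(n_steps))

ns = (1, 5, 25, 125)
vals = [riemann(lambda x, n=n: x ** n) for n in ns]
assert all(abs(v - 1.0 / (n + 1)) < 1e-4 for v, n in zip(vals, ns))
assert vals[-1] < 0.01            # the integrals tend to ∫ 0 dλ = 0
```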
E XERCISE 2.5.3. Produce a counterexample to show that the nonnegativity
condition in Fatou’s lemma cannot be dropped.
E XERCISE 2.5.4. Produce a counterexample to show that the domination con-
dition of the dominated convergence theorem cannot be dropped.
A very important application of the dominated convergence theorem is in giv-
ing conditions for differentiating under the integral sign.
P ROPOSITION 2.5.5. Let I be an open subset of R, and let (Ω, F, µ) be a
measure space. Suppose that f : I × Ω → R satisfies the following conditions:
(i) f (x, ω) is an integrable function of ω for each x ∈ I,
(ii) for all ω ∈ Ω, the derivative fx of f with respect to x exists for all x ∈ I,
and
(iii) there is an integrable function h : Ω → R such that |fx (x, ω)| ≤ h(ω)
for all x ∈ I and ω ∈ Ω.
Then for all x ∈ I,
(d/dx) ∫_Ω f (x, ω)dµ(ω) = ∫_Ω fx (x, ω)dµ(ω).

P ROOF. Take any x ∈ I and a sequence xn → x. Without loss of generality,


assume that xn 6= x and xn ∈ I for each n. Define
gn (ω) := (f (x, ω) − f (xn , ω)) / (x − xn ).
Since the derivative of f with respect to x is uniformly bounded by the function h,
it follows that |gn (ω)| ≤ h(ω). Since h is integrable and gn (ω) → fx (x, ω) for
each ω, the dominated convergence theorem gives us
lim_{n→∞} ∫_Ω gn (ω)dµ(ω) = ∫_Ω fx (x, ω)dµ(ω).

But

∫_Ω gn (ω)dµ(ω) = ( ∫_Ω f (x, ω)dµ(ω) − ∫_Ω f (xn , ω)dµ(ω) ) / (x − xn ),
which proves the claim. 
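A numerical sketch of the proposition (ours, not from the text): take f(x, ω) = sin(xω) with µ the uniform distribution on [0, 1], approximated by a midpoint sum over ω. Here fx(x, ω) = ω cos(xω) is dominated by the integrable function h(ω) = ω, so differentiation under the integral sign applies; a central finite difference of the integral matches the integral of the derivative.

```python
import math

M = 20000
omegas = [(k + 0.5) / M for k in range(M)]    # midpoint grid for µ

def F(x):                                     # x -> ∫ sin(xω) dµ(ω)
    return sum(math.sin(x * w) for w in omegas) / M

def dF_exchanged(x):                          # ∫ ω cos(xω) dµ(ω)
    return sum(w * math.cos(x * w) for w in omegas) / M

x, eps = 1.3, 1e-5
finite_diff = (F(x + eps) - F(x - eps)) / (2 * eps)
assert abs(finite_diff - dF_exchanged(x)) < 1e-6
```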
A slight improvement of Proposition 2.5.5 is given in Exercise 2.6.8 later. Sim-
ilar slight improvements of the monotone convergence theorem and the dominated
convergence theorem are given in Exercise 2.6.5.

2.6. The concept of almost everywhere


Let (Ω, F, µ) be a measure space. An event A ∈ F is said to ‘happen al-
most everywhere’ if µ(Ac ) = 0. Almost everywhere is usually abbreviated as
a.e. Sometimes, in probability theory, we say ‘almost surely (a.s.)’ instead of a.e.
When the measure µ may not be specified from the context, we say µ-a.e. If f
and g are two functions such that the set {ω : f (ω) = g(ω)} is an a.e. event, we
say that f = g a.e. Similarly, if {fn }n≥1 is a sequence of functions such that
{ω : limn→∞ fn (ω) = f (ω)} is an a.e. event, we say fn → f a.e. We have al-
ready seen an example of a class of a.e. events in Exercise 2.2.2. The following is
another result of the same type.

P ROPOSITION 2.6.1. Let f : Ω → [0, ∞] be a measurable function. Then
∫ f dµ = 0 if and only if f = 0 a.e.

P ROOF. First suppose that µ({ω : f (ω) > 0}) = 0. Take any nonnegative
measurable simple function g such that g ≤ f everywhere. Then g = 0 whenever
f = 0. Thus, µ({ω : g(ω) > 0}) = 0. Since g is a simple function, this implies
that ∫ gdµ = 0. Taking supremum over all such g gives ∫ f dµ = 0. Conversely,
suppose that µ({ω : f (ω) > 0}) > 0. Then

µ({ω : f (ω) > 0}) = µ( ⋃_{n=1}^∞ {ω : f (ω) > n−1 } )
= lim_{n→∞} µ({ω : f (ω) > n−1 }) > 0,
where the second equality follows from the observation that the sets in the union
form an increasing sequence. Therefore for some n, µ(An ) > 0, where An :=
{ω : f (ω) > n−1 }. But then
∫ f dµ ≥ ∫ f 1An dµ ≥ n−1 ∫ 1An dµ = n−1 µ(An ) > 0.

This completes the proof of the proposition. 


The following exercises are easy consequences of the above proposition.
E XERCISE 2.6.2. If f, g are integrable functions on Ω such that f = g a.e.,
then show that ∫ f dµ = ∫ gdµ and ∫ |f − g|dµ = 0. Conversely, show that if
∫ |f − g|dµ = 0, then f = g a.e.

E XERCISE 2.6.3. Suppose that f and g are two integrable functions on Ω such
that f ≥ g a.e. Then show that ∫ f dµ ≥ ∫ gdµ, and equality holds if and only if
f = g a.e.
Broadly speaking, the idea is that ‘almost everywhere’ and ‘everywhere’ can
be treated as basically the same thing. There are some exceptions to this rule of
thumb, but in most situations it is valid. Often, when a function or a limit is defined
almost everywhere, we will treat it as being defined everywhere by defining it arbi-
trarily (for example, equal to zero) on the set where it is undefined. The following

exercises show how to use this rule of thumb to get slightly better versions of the
convergence theorems of this chapter.
E XERCISE 2.6.4. If {fn }n≥1 is a sequence of measurable functions and f is a
function such that fn → f a.e., show that there is a measurable function g such that
g = f a.e. Moreover, if the σ-algebra is complete, show that f itself is measurable.
E XERCISE 2.6.5. Show that in the monotone convergence theorem and the
dominated convergence theorem, it suffices to have fn → f a.e. and f measurable.
E XERCISE 2.6.6. Let f : Ω → I be a measurable function, where I is an
interval in R∗ . The interval I is allowed to be finite or infinite, open, closed or
half-open. In all cases, show that ∫ f dµ ∈ I if µ is a probability measure. (Hint:
Use Exercise 2.6.3.)
E XERCISE 2.6.7. If f1 , f2 , . . . are measurable functions from Ω into R∗ such
that ∑ ∫ |fi |dµ < ∞, then show that ∑ fi exists a.e., and ∫ ∑ fi dµ = ∑ ∫ fi dµ.
E XERCISE 2.6.8. Show that in Proposition 2.5.5, every occurrence of ‘for all
ω ∈ Ω’ can be replaced by ‘for almost all ω ∈ Ω’.
CHAPTER 3

Product spaces

This chapter is about the construction of product measure spaces. Product


measures are necessary for defining sequences of independent random variables,
which form the backbone of many probabilistic models.

3.1. Finite dimensional product spaces


Let (Ω1 , F1 ), . . . , (Ωn , Fn ) be measurable spaces. Let Ω = Ω1 × · · · × Ωn
be the Cartesian product of Ω1 , . . . , Ωn . The product σ-algebra F on Ω is defined
as the σ-algebra generated by sets of the form A1 × · · · × An , where Ai ∈ Fi for
i = 1, . . . , n. It is usually denoted by F1 × · · · × Fn .
P ROPOSITION 3.1.1. Let Ω and F be as above. If each Ωi is endowed with a
σ-finite measure µi , then there is a unique measure µ on Ω which satisfies
µ(A1 × · · · × An ) = ∏_{i=1}^n µi (Ai ) (3.1.1)

for each A1 ∈ F1 , . . . , An ∈ Fn .

P ROOF. It is easy to check that the collection of finite disjoint unions of sets of
the form A1 × · · · × An (sometimes called ‘rectangles’) form an algebra. Therefore
by Carathéodory’s theorem, it suffices to show that µ defined through (3.1.1) is a
countably additive measure on this algebra.
We will prove this by induction on n. It is true for n = 1 by the definition of a
measure. Suppose that it holds for n − 1. Take any rectangular set A1 × · · · × An .
Suppose that this set is a disjoint union of Ai,1 × · · · × Ai,n for i = 1, 2, . . ., where
Ai,j ∈ Fj for each i and j. It suffices to show that

µ(A1 × · · · × An ) = ∑_{i=1}^∞ µ(Ai,1 × · · · × Ai,n ).

Take any x ∈ A1 × · · · × An−1 . Let I be the collection of indices i such that


x ∈ Ai,1 × · · · × Ai,n−1 . Take any y ∈ An . Then (x, y) ∈ A1 × · · · × An , and
hence (x, y) ∈ Ai,1 × · · · × Ai,n for some i. In particular, x ∈ Ai,1 × · · · × Ai,n−1
and hence i ∈ I. Also, y ∈ Ai,n . Thus,

An ⊆ ⋃_{i∈I} Ai,n .

On the other hand, if y ∈ Ai,n for some i ∈ I, then (x, y) ∈ Ai,1 × · · · × Ai,n .
Thus, (x, y) ∈ A1 × · · · × An , and therefore y ∈ An . This shows that An is the
union of Ai,n over i ∈ I. Now, if y ∈ Ai,n ∩ Aj,n for some distinct i, j ∈ I, then
since x ∈ (Ai,1 × · · · × Ai,n−1 ) ∩ (Aj,1 × · · · × Aj,n−1 ), we get that (x, y) ∈
(Ai,1 × · · · × Ai,n ) ∩ (Aj,1 × · · · × Aj,n ), which is impossible. Thus, the sets Ai,n ,
as i ranges over I, are disjoint. This gives
µn (An ) = ∑_{i∈I} µn (Ai,n ).

Finally, if x 6∈ A1 × · · · × An−1 and x ∈ Ai,1 × · · · × Ai,n−1 for some i, then Ai,n


must be empty, because otherwise (x, y) ∈ Ai,1 × · · · × Ai,n for any y ∈ Ai,n and
so (x, y) ∈ A1 × · · · × An , which implies that x ∈ A1 × · · · × An−1 . Therefore
we have proved that

1_{A1 ×···×An−1} (x)µn (An ) = ∑_{i=1}^∞ 1_{Ai,1 ×···×Ai,n−1} (x)µn (Ai,n ).

Let µ0 := µ1 × · · · × µn−1 , which exists by the induction hypothesis. Integrating both sides with respect to the measure µ0 on Ω1 × · · · × Ωn−1 , and applying
Exercise 2.4.7 to the right, we get

µ0 (A1 × · · · × An−1 )µn (An ) = ∑_{i=1}^∞ µ0 (Ai,1 × · · · × Ai,n−1 )µn (Ai,n ).

The desired result now follows by applying the induction hypothesis to µ0 on both
sides. 
The measure µ of Proposition 3.1.1 is called the product of µ1 , . . . , µn , and is
usually denoted by µ1 × · · · × µn .
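For two finite spaces with point masses, the product measure of Proposition 3.1.1 is just the pointwise product of masses, and (3.1.1) can be verified directly. A sketch with made-up data:

```python
from itertools import product

mu1 = {"x": 0.25, "y": 0.75}
mu2 = {0: 0.5, 1: 0.3, 2: 0.2}

# Product measure on Omega1 x Omega2: mass of (a, b) is mu1(a) * mu2(b).
prod_mu = {(a, b): mu1[a] * mu2[b] for a, b in product(mu1, mu2)}

def measure(m, S):
    return sum(m[s] for s in S)

# Check (3.1.1) on a rectangle A x B, and that total mass is 1 * 1.
A, B = {"x"}, {0, 2}
rectangle = set(product(A, B))
assert abs(measure(prod_mu, rectangle) - measure(mu1, A) * measure(mu2, B)) < 1e-12
assert abs(measure(prod_mu, set(prod_mu)) - 1.0) < 1e-12
```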
E XERCISE 3.1.2. Let (Ωi , Fi , µi ) be σ-finite measure spaces for i = 1, 2, 3.
Show that (µ1 × µ2 ) × µ3 = µ1 × µ2 × µ3 = µ1 × (µ2 × µ3 ).
E XERCISE 3.1.3. Let S be a separable metric space, endowed with its Borel
σ-algebra. Then S × S comes with its product topology, which defines its own
Borel σ-algebra. Show that this is the same as the product σ-algebra on S × S.
E XERCISE 3.1.4. Let S be as above, and let (Ω, F) be a measurable space. If
f and g are measurable functions from Ω into S, show that (f, g) : Ω → S × S is
a measurable function.
E XERCISE 3.1.5. Let S and Ω be as above. If ρ is the metric on S, show that
ρ : S × S → R is a measurable map.
E XERCISE 3.1.6. Let (Ω, F) be a measurable space and let S be a complete
separable metric space endowed with its Borel σ-algebra. Let {fn }n≥1 be a se-
quence of measurable functions from Ω into S. Show that
{ω ∈ Ω : lim_{n→∞} fn (ω) exists}

is a measurable subset of Ω. (Hint: Try to write this set as a combination of count-


able unions and intersections of measurable sets, using the Cauchy criterion for
convergence on complete metric spaces.)

3.2. Fubini’s theorem


We will now learn how to integrate with respect to product measures. The
natural idea is to compute an integral with respect to a product measure as an
iterated integral, integrating with respect to one coordinate at a time. This works
under certain conditions. The conditions are given by Fubini’s theorem, stated
below. We need a preparatory lemma.
L EMMA 3.2.1. Let (Ωi , Fi ) be measurable spaces for i = 1, 2, 3. Let f : Ω1 ×
Ω2 → Ω3 be a measurable function. Then for each x ∈ Ω1 , the map y 7→ f (x, y)
is measurable on Ω2 .

P ROOF. Take any A ∈ F3 and x ∈ Ω1 . Let B := f −1 (A) and


Bx := {y ∈ Ω2 : (x, y) ∈ B} = {y ∈ Ω2 : f (x, y) ∈ A}.
We want to show that Bx ∈ F2 . For the given x, let G be the set of all E ∈ F1 ×F2
such that Ex ∈ F2 , where Ex := {y ∈ Ω2 : (x, y) ∈ E}. An easy verification
shows that G is a σ-algebra. Moreover, it contains every rectangular set. Thus,
G ⊇ F1 × F2 , which shows in particular that Bx ∈ F2 for every x ∈ Ω1 , since
B ∈ F1 × F2 due to the measurability of f . 
T HEOREM 3.2.2 (Fubini’s theorem). Let (Ω1 , F1 , µ1 ) and (Ω2 , F2 , µ2 ) be two
σ-finite measure spaces. Let µ = µ1 × µ2 , and let f : Ω1 × Ω2 → R∗ be
a measurable function. If f is either nonnegative or integrable, then the map
x 7→ ∫_{Ω2} f (x, y)dµ2 (y) on Ω1 and the map y 7→ ∫_{Ω1} f (x, y)dµ1 (x) on Ω2 are
well-defined and measurable (when set equal to zero if the integral is undefined).
Moreover, we have

∫_{Ω1 ×Ω2} f (x, y)dµ(x, y) = ∫_{Ω1} ∫_{Ω2} f (x, y)dµ2 (y)dµ1 (x)
= ∫_{Ω2} ∫_{Ω1} f (x, y)dµ1 (x)dµ2 (y).

Finally, if either of

∫_{Ω1} ∫_{Ω2} |f (x, y)|dµ2 (y)dµ1 (x) or ∫_{Ω2} ∫_{Ω1} |f (x, y)|dµ1 (x)dµ2 (y)

is finite, then f is integrable.
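Before the proof, a quick finite sanity check (ours, not from the text): with counting measure on finite sets, all three integrals in the theorem are double sums, and the order of summation does not matter.

```python
X, Y = range(4), range(5)
f = lambda x, y: (x + 1) * (y - 2)

iterated_xy = sum(sum(f(x, y) for y in Y) for x in X)   # dµ2 first, then dµ1
iterated_yx = sum(sum(f(x, y) for x in X) for y in Y)   # dµ1 first, then dµ2
joint = sum(f(x, y) for x in X for y in Y)              # against the product
assert iterated_xy == iterated_yx == joint
```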

P ROOF. First, suppose that f = 1A for some A ∈ F1 × F2 . Then notice that


for any x ∈ Ω1 ,
∫_{Ω2} f (x, y)dµ2 (y) = µ2 (Ax ),

where Ax := {y ∈ Ω2 : (x, y) ∈ A}. The integral is well-defined by Lemma 3.2.1.


We will first prove that x 7→ µ2 (Ax ) is a measurable map, and its integral equals
µ(A). This will prove Fubini’s theorem for this f .
Let L be the set of all E ∈ F1 × F2 such that the map x 7→ µ2 (Ex ) is
measurable and integrates to µ(E). We will now show that L is a λ-system. We
will first prove this under the assumption that µ1 and µ2 are finite measures. Clearly
Ω1 × Ω2 ∈ L. If E1 , E2 , . . . ∈ L are disjoint, and E is their union, then for any x,
Ex is the disjoint union of (E1 )x , (E2 )x , . . ., and hence

µ2 (Ex ) = ∑_{i=1}^∞ µ2 ((Ei )x ).
By Exercise 2.1.12, this shows that x 7→ µ2 (Ex ) is measurable. The monotone
convergence theorem and the fact that Ei ∈ L for each i show that
∫_{Ω1} µ2 (Ex )dµ1 (x) = ∑_{i=1}^∞ ∫_{Ω1} µ2 ((Ei )x )dµ1 (x) = ∑_{i=1}^∞ µ(Ei ) = µ(E).
Thus, E ∈ L, and therefore L is closed under countable disjoint unions.
Finally, take any E ∈ L. Since µ1 and µ2 are finite measures, we have
µ2 ((E c )x ) = µ2 ((Ex )c ) = µ2 (Ω2 ) − µ2 (Ex ),
which proves that x 7→ µ2 ((E c )x ) is measurable. It also proves that
∫_{Ω1} µ2 ((E c )x )dµ1 (x) = µ1 (Ω1 )µ2 (Ω2 ) − ∫_{Ω1} µ2 (Ex )dµ1 (x)
= µ(Ω) − µ(E) = µ(E c ).
Thus, L is a λ-system. Since it contains the π-system of all rectangles, which
generates F1 × F2 , we have now established the claim that for any E ∈ F1 × F2 ,
the map x 7→ µ2 (Ex ) is measurable and integrates to µ(E), provided that µ1 and
µ2 are finite measures.
Now let µ1 and µ2 be σ-finite measures. Let {En,1 }n≥1 and {En,2 }n≥1 be
sequences of measurable sets of finite measure increasing to Ω1 and Ω2 , respec-
tively. For each n, let En := En,1 × En,2 , and define the functionals µn,1 (A) :=
µ1 (A ∩ En,1 ), µn,2 (B) := µ2 (B ∩ En,2 ) and µn (E) := µ(E ∩ En ) for A ∈ F1 ,
B ∈ F2 and E ∈ F1 ×F2 . It is easy to see that these are finite measures, increasing
to µ1 , µ2 and µ.
If f : Ω1 → [0, ∞] is a measurable function, it is easy to see that
    ∫_{Ω1} f(x) dµn,1(x) = ∫_{Ω1} f(x) 1_{En,1}(x) dµ1(x),    (3.2.1)
where we use the convention ∞ · 0 = 0 on the right. To see this, first note that it holds for indicator functions by the definition of µn,1. From this, pass to simple functions by linearity and then to nonnegative measurable functions by the monotone convergence theorem and Proposition 2.3.6.
Next, note that for any E ∈ F1 × F2 and any x ∈ Ω1, µ2(Ex) is the increasing limit of µn,2(Ex)1_{En,1}(x) as n → ∞. Firstly, this shows that x ↦ µ2(Ex) is a measurable map. Moreover, for each n, µn = µn,1 × µn,2, as can be seen by verifying on rectangles and using the uniqueness of product measures on σ-finite spaces. Therefore by the monotone convergence theorem and equation (3.2.1),
    ∫_{Ω1} µ2(Ex) dµ1(x) = lim_{n→∞} ∫_{Ω1} µn,2(Ex) 1_{En,1}(x) dµ1(x)
                         = lim_{n→∞} ∫_{Ω1} µn,2(Ex) dµn,1(x)
                         = lim_{n→∞} µn(E) = µ(E).
This completes the proof of Fubini's theorem for all indicator functions. By linearity, this extends to all simple functions. Using the monotone convergence theorem and Proposition 2.3.6, it is straightforward to extend the result to all nonnegative measurable functions.
Now take any integrable f : Ω1 × Ω2 → R∗. Applying Fubini's theorem to f⁺ and f⁻, we get
    ∫ f dµ = ∫_{Ω1} ∫_{Ω2} f⁺(x, y) dµ2(y) dµ1(x) − ∫_{Ω1} ∫_{Ω2} f⁻(x, y) dµ2(y) dµ1(x)
           = ∫_{Ω1} g(x) dµ1(x) − ∫_{Ω1} h(x) dµ1(x),
where
    g(x) := ∫_{Ω2} f⁺(x, y) dµ2(y),    h(x) := ∫_{Ω2} f⁻(x, y) dµ2(y).
By Fubini's theorem for nonnegative functions and the integrability of f, it follows that g and h are integrable functions. Thus,
    ∫ f dµ = ∫_{Ω1} (g(x) − h(x)) dµ1(x),
where we have adopted the convention that ∞ − ∞ = 0, as in Proposition 2.4.1.
Now, at any x where at least one of g(x) and h(x) is finite,
    g(x) − h(x) = ∫_{Ω2} (f⁺(x, y) − f⁻(x, y)) dµ2(y) = ∫_{Ω2} f(x, y) dµ2(y).    (3.2.2)
On the other hand, the set where g(x) and h(x) are both infinite is a set of measure zero, and therefore we may define
    ∫_{Ω2} f(x, y) dµ2(y) = 0
on this set, so that again (3.2.2) is valid under the convention ∞ − ∞ = 0. (The construction ensures, among other things, that x ↦ ∫_{Ω2} f(x, y) dµ2(y) is a measurable function.) This concludes the proof of Fubini's theorem for integrable functions. The final assertion of the theorem follows easily from Fubini's theorem for nonnegative functions. □
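On finite measure spaces, the theorem's content is concrete: the integral against the product measure and the two iterated integrals are just double sums taken in either order, so the identity can be sanity-checked numerically. A minimal sketch; the weights and the function below are arbitrary choices made for illustration, not taken from the text:

```python
import numpy as np

# Finite measure spaces: mu1 puts weights on 3 points, mu2 on 4 points.
mu1 = np.array([0.5, 1.0, 2.0])                # weights of mu1 on Omega1
mu2 = np.array([0.25, 0.75, 1.5, 0.5])         # weights of mu2 on Omega2
f = np.arange(12, dtype=float).reshape(3, 4)   # f(x, y) on Omega1 x Omega2

# Integral against the product measure: sum over (x, y) of f * mu1(x) * mu2(y).
product_integral = float((f * np.outer(mu1, mu2)).sum())

# Iterated integrals in both orders.
iter_xy = float((mu1 * (f * mu2).sum(axis=1)).sum())            # dmu2 first
iter_yx = float((mu2 * (mu1[:, None] * f).sum(axis=0)).sum())   # dmu1 first

print(product_integral, iter_xy, iter_yx)  # all three agree
```

All three quantities coincide exactly here because f is bounded and the measures are finite, so no integrability issue arises.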
EXERCISE 3.2.3. Produce a counterexample to show that the integrability condition in Fubini's theorem is necessary, in that otherwise the order of integration cannot be interchanged even if the two iterated integrals make sense.
EXERCISE 3.2.4. Compute the Lebesgue measure of the unit disk in R2 by integrating the indicator function of the disk using Fubini's theorem.
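A numerical sketch of the computation in Exercise 3.2.4 (not a measure-theoretic solution): by Fubini's theorem, the area equals the integral over x ∈ [−1, 1] of the length of the x-section of the disk, which is 2√(1 − x²). A midpoint-rule approximation of that one-dimensional integral should land close to π; the grid size is an arbitrary choice.

```python
import numpy as np

# Area of the unit disk via its x-sections: for fixed x, the section
# {y : x^2 + y^2 <= 1} is an interval of length 2*sqrt(1 - x^2).
n = 100_000
dx = 2.0 / n
x = -1.0 + dx * (np.arange(n) + 0.5)   # midpoints of a grid on [-1, 1]
area = float(np.sum(2.0 * np.sqrt(1.0 - x**2)) * dx)
print(area)  # close to pi
```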
3.3. Infinite dimensional product spaces
Suppose now that {(Ωi, Fi, µi)}_{i=1}^∞ is a countable sequence of σ-finite measure spaces. Let Ω = Ω1 × Ω2 × · · ·. The product σ-algebra F = F1 × F2 × · · · is defined to be the σ-algebra generated by all sets of the form A1 × A2 × · · ·, where Ai = Ωi for all but finitely many i. Such sets generalize the notion of rectangles to infinite product spaces.
Although the definition of the product σ-algebra is just as before, the existence
of a product measure is a slightly trickier question. In particular, we need that
each µi is a probability measure to even define the infinite product measure on
rectangles in a meaningful way. To see why this condition is necessary, suppose
that the measure spaces are all the same. Then if µi (Ωi ) > 1, each rectangle must
have measure ∞, and if µi (Ωi ) < 1, each rectangle must have measure zero. To
avoid these trivialities, we need to have µi (Ωi ) = 1.
THEOREM 3.3.1. Let all notation be as above. Suppose that {µi}_{i=1}^∞ are probability measures. Then there exists a unique probability measure µ on (Ω, F) that satisfies
    µ(A1 × A2 × · · ·) = Π_{i=1}^∞ µi(Ai)
for every A1 × A2 × · · · such that Ai = Ωi for all but finitely many i.

P ROOF. First note that by the existence of finite dimensional product mea-
sures, the measure νn := µ1 × · · · × µn is defined for each n.
Next, define Ω(n) := Ωn+1 × Ωn+2 × · · · . Let A ∈ F be called a cylinder set
if it is of the form B × Ω(n) for some n and some B ∈ F1 × · · · × Fn . Define
µ(A) := νn (B). It is easy to see that this is a well-defined functional on A, and
that it satisfies the product property for rectangles.
Let A be the collection of all cylinder sets. It is not difficult to check that
A is an algebra, and that µ is finitely additive and σ-finite on A. Therefore by
Carathéodory’s theorem, we only need to check that µ is countably additive on A.
Suppose that A ∈ A is the union of a sequence of disjoint sets A1 , A2 , . . . ∈ A.
For each n, let Bn := A \ (A1 ∪ · · · ∪ An ). Since A is an algebra, each Bn ∈ A.
Since µ is finitely additive on A, µ(A) = µ(Bn ) + µ(A1 ) + · · · + µ(An ) for each
n. Therefore we have to show that lim µ(Bn ) = 0. Suppose that this is not true.
Since {Bn}n≥1 is a decreasing sequence of sets, this implies that there is some ε > 0 such that µ(Bn) ≥ ε for all n. Using this, we will now get a contradiction to the fact that ∩Bn = ∅.
For each n, let A(n) be the algebra of all cylinder sets in Ω(n) , and let µ(n) be
defined on A(n) using µn+1 , µn+2 , . . . , the same way we defined µ on A. For any
n, m, and (x1 , . . . , xm ) ∈ Ω1 × · · · × Ωm , define
Bn (x1 , . . . , xm ) := {(xm+1 , xm+2 , . . .) ∈ Ω(m) : (x1 , x2 , . . .) ∈ Bn }.
Since Bn is a cylinder set, it is of the form Cn × Ω(m) for some m and some
Cn ∈ F1 × · · · × Fm . Therefore by Lemma 3.2.1, Bn (x1 ) ∈ A(1) for any x1 ∈ Ω1 ,
and by Fubini's theorem, the map x1 ↦ µ(1)(Bn(x1)) is measurable. (Although
we have not yet shown that µ(1) is a measure on the product σ-algebra of Ω(1) , it is
evidently a measure on the σ-algebra G of all sets of the form D × Ω(m) ⊆ Ω(1) ,
where D ∈ F2 ×· · ·×Fm . Moreover, Bn ∈ F1 ×G. This allows us to use Fubini’s
theorem to reach the above conclusion.) Thus, the set
    Fn := {x1 ∈ Ω1 : µ(1)(Bn(x1)) ≥ ε/2}
is an element of F1. But again by Fubini's theorem,
    µ(Bn) = ∫ µ(1)(Bn(x1)) dµ1(x1)
          = ∫_{Fn} µ(1)(Bn(x1)) dµ1(x1) + ∫_{Fn^c} µ(1)(Bn(x1)) dµ1(x1)
          ≤ µ1(Fn) + ε/2.
Since µ(Bn) ≥ ε, this shows that µ1(Fn) ≥ ε/2. Since {Fn}n≥1 is a decreasing sequence of sets, this shows that ∩Fn ≠ ∅. Choose a point x∗1 ∈ ∩Fn.
Now note that {Bn(x∗1)}n≥1 is a decreasing sequence of sets in F(1), such that µ(1)(Bn(x∗1)) ≥ ε/2 for each n. Repeating the above argument for the product space Ω(1) and the sequence {Bn(x∗1)}n≥1, we see that there exists x∗2 ∈ Ω2 such that µ(2)(Bn(x∗1, x∗2)) ≥ ε/4 for every n.
Proceeding like this, we obtain a point x = (x∗1, x∗2, . . .) ∈ Ω such that for any m and n,
    µ(m)(Bn(x∗1, . . . , x∗m)) ≥ 2^{−m} ε.
Take any n. Since Bn is a cylinder set, it is of the form Cn × Ω(m) for some m
and some Cn ∈ F1 × · · · × Fm . Since µ(m) (Bn (x∗1 , . . . , x∗m )) > 0, there is some
(xm+1 , xm+2 , . . .) ∈ Ω(m) such that (x∗1 , . . . , x∗m , xm+1 , xm+2 , . . .) ∈ Bn . But by
the form of Bn, this implies that x ∈ Bn. Thus, x ∈ ∩Bn. This completes the proof. □
Having constructed products of countably many probability spaces, one may
now wonder about uncountable products. Surprisingly, this is quite simple, given
that we know how to handle the countable case. Suppose that {(Ωi , Fi , µi )}i∈I
is an arbitrary collection of probability spaces. Let Ω := ×i∈I Ωi , and let F :=
×i∈I Fi be defined using rectangles as in the countable case. Now take any count-
able set J ⊆ I, and let FJ := ×j∈J Fj . Consider a set A ∈ F of the form
{(ωi )i∈I : (ωj )j∈J ∈ B}, where B is some element of FJ . Let G be the collection
of all such A, as B varies in FJ and J varies over all countable subsets of I. It is
not hard to check that G is in fact a σ-algebra. Moreover, it is contained in F, and it contains all rectangular sets. Therefore G = F. Thus we can define µ on this σ-algebra simply using the definition of the product measure on countable product spaces.
Sometimes showing that some function is measurable with respect to an infinite
product σ-algebra can be somewhat tricky. The following exercise gives such an
example, which arises in percolation theory.
EXERCISE 3.3.2. Take any d ≥ 1 and consider the integer lattice Zd with
the nearest-neighbor graph structure. Let E denote the set of edges. Take the
two-point set {0, 1} with its power set σ-algebra, and consider the product space
{0, 1}E . Given ω = (ωe )e∈E ∈ {0, 1}E , define a subgraph of Zd as follows: Keep
an edge e if ωe = 1, and delete it otherwise. Let N (ω) be the number of connected
components of this subgraph. Prove that N : {0, 1}E → {0, 1, 2, . . .} ∪ {∞} is a
measurable function.
3.4. Kolmogorov's extension theorem
While infinite dimensional product measures are sufficient for many purposes,
sometimes we need to construct non-product measures on infinite dimensional
spaces. The fundamental tool for such constructions, known as Kolmogorov’s
existence theorem (also as Kolmogorov’s consistency theorem, or the Daniell–
Kolmogorov theorem), is stated below. The version given here works only for
Euclidean spaces. There is a more general version that works for any complete
separable metric space — see Exercise 12.4.4 in Chapter 12.
THEOREM 3.4.1 (Kolmogorov's extension theorem). Take any d ≥ 1. Suppose
that for each n, we have a probability measure µn on ((Rd )n , B((Rd )n )) such that
the collection {µn }n≥1 satisfies the following consistency property: For each n, if
(X1, . . . , Xn+1) is a random vector with law µn+1, then (X1, . . . , Xn) has law µn.
Then there is a unique probability measure µ on the product σ-algebra of (Rd )N
such that if (X1 , X2 , . . .) ∼ µ, then for all n, (X1 , . . . , Xn ) ∼ µn .
To prove this result, we first need to establish that every probability measure
on Euclidean space has a property known as ‘regularity’.
LEMMA 3.4.2 (Regularity of probability measures). Let µ be a probability
measure on Rn . Then for any B ∈ B(Rn ),
µ(B) = inf{µ(U ) : U ⊇ B, U open}
= sup{µ(C) : C ⊆ B, C closed}
= sup{µ(K) : K ⊆ B, K compact}.
PROOF. Let C be the collection of Borel sets B for which the first and second identities displayed above hold true. It is easy to see that C contains all closed sets, since any closed set C can be written as the decreasing intersection of the open sets
    Uk := {x ∈ Rn : |x − y| < 1/k for some y ∈ C},
where |x − y| denotes the Euclidean norm of x − y. Thus, the proof of the first two
identities will be complete if we can show that C is a σ-algebra.
It is easy to see that ∅ ∈ C. Next, if B ∈ C, then it is not hard to see that B^c ∈ C, using the observation that if C ⊆ B is a closed set and U ⊇ B is an open set, then C^c ⊇ B^c is an open set and U^c ⊆ B^c is a closed set. Next, take any B1, B2, . . . ∈ C and let B be their union. Take any ε > 0. Find closed sets C1 ⊆ B1, C2 ⊆ B2, . . . such that µ(Ci) ≥ µ(Bi) − 2^{−i}ε for each i, and open sets U1 ⊇ B1, U2 ⊇ B2, . . . such that µ(Ui) ≤ µ(Bi) + 2^{−i}ε for each i. Let U := ∪_{i≥1} Ui, so that U is an open set, and
    µ(U) − µ(B) ≤ Σ_{i=1}^∞ (µ(Ui) − µ(Bi)) ≤ ε.
Similarly, letting C := ∪_{i≥1} Ci, we have
    µ(B) − µ(C) ≤ Σ_{i=1}^∞ (µ(Bi) − µ(Ci)) ≤ ε.
The problem is, C may not be closed. But this is easily resolved by taking k large enough so that µ(C) − µ(∪_{i=1}^k Ci) ≤ ε. Then C′ := ∪_{i=1}^k Ci is a closed set contained in B, and µ(B) − µ(C′) ≤ 2ε. This proves that C is a σ-algebra.
Finally, to prove the third identity, take any B ∈ B(Rn), any ε > 0, and a closed set C ⊆ B such that µ(B) − µ(C) ≤ ε. Let Br denote the closed ball of radius r centered at the origin. Then µ(C ∩ Br) → µ(C) as r → ∞, and C ∩ Br is a compact subset of C. Thus, it is possible to find a compact set K ⊆ C such that µ(B) − µ(K) ≤ 2ε. □
PROOF OF THEOREM 3.4.1. We will use Carathéodory's extension theorem (Theorem 1.5.1) to prove both the existence and the uniqueness of µ. Let A be the algebra of cylinder sets of (Rd)^N, that is, sets of the form
    {ω ∈ (Rd)^N : (ω0, . . . , ωk−1) ∈ B}
for some k and some B ∈ B((Rd)^k). Note that the consistency of the family {µn}n≥1 automatically implies the existence of a finitely additive probability measure µ on A defined in the natural way. To apply Carathéodory's theorem, we only have to show that µ is countably additive on A.
Let A1, A2, . . . be a disjoint sequence of elements of A such that their union, A, is also in A. We need to show that µ(A) = Σ_{i=1}^∞ µ(Ai). Note that for any n, the finite additivity of µ implies that
    µ(A) = µ(Bn) + Σ_{i=1}^n µ(Ai),
where Bn := A \ ∪_{i=1}^n Ai. Thus, we only need to show that µ(Bn) → 0 as n → ∞. We will prove this by contradiction. Suppose that µ(Bn) does not tend to 0. Then, since this is a decreasing sequence, it must converge to a positive real number ε. Under
this assumption, we will show that ∩n≥1 Bn ≠ ∅, which will give a contradiction, because we know that the intersection is empty.
Since the Bn ’s are cylinder sets, there exist integers m1 ≤ m2 ≤ · · · , and
Borel sets D1 ⊆ (Rd )m1 , D2 ⊆ (Rd )m2 , . . . such that for each n,
Bn = Dn × Rd × Rd × · · · .
Moreover, since Bn ⊇ Bn+1 for each n, we must have that
    Dn+1 ⊆ Dn × (Rd)^{m_{n+1}−m_n}    (3.4.1)
for each n.
Now recall the uniform positive lower bound ε on µ(Bn). By Lemma 3.4.2, we can find a compact set Kn ⊆ Dn for each n, such that
    µ_{m_n}(Dn \ Kn) ≤ 2^{−n−1} ε.
Define
    K′_n := (K1 × (Rd)^{m_n−m_1}) ∩ (K2 × (Rd)^{m_n−m_2}) ∩ · · · ∩ (Kn−1 × (Rd)^{m_n−m_{n−1}}) ∩ Kn.
Then K′_n, being a closed subset of Kn, is compact. Moreover,
    µ_{m_n}(Dn \ K′_n) ≤ Σ_{i=1}^n µ_{m_n}(Dn \ (Ki × (Rd)^{m_n−m_i}))
                      ≤ Σ_{i=1}^n µ_{m_n}((Di × (Rd)^{m_n−m_i}) \ (Ki × (Rd)^{m_n−m_i}))
                      = Σ_{i=1}^n µ_{m_i}(Di \ Ki) ≤ Σ_{i=1}^n 2^{−i−1} ε ≤ ε/2.
Since µ_{m_n}(Dn) = µ(Bn) ≥ ε, this shows that µ_{m_n}(K′_n) ≥ ε/2 for each n. In particular, each K′_n is nonempty. Now, from the definition of K′_n, it is easy to see that K′_{n+1} ⊆ K′_n × (Rd)^{m_{n+1}−m_n} for each n. Thus, if we choose (x^n_1, . . . , x^n_{m_n}) ∈ K′_n, then for each i < n, (x^n_1, . . . , x^n_{m_i}) ∈ K′_i. In particular, (x^n_1, . . . , x^n_{m_1}) ∈ K′_1 for each n, and so, by compactness, we can choose a subsequence of {(x^n_1, . . . , x^n_{m_1})}n≥1 converging to some (x1, . . . , x_{m_1}) ∈ K′_1. Passing to a further subsequence, we can have (x^n_1, . . . , x^n_{m_2}) converging to some (x1, . . . , x_{m_2}) ∈ K′_2, and so on. Then, by the standard diagonal argument, we can find a subsequence along which (x^n_1, . . . , x^n_{m_i}) converges to some (x1, . . . , x_{m_i}) ∈ K′_i for each i. Consider the vector x = (x1, x2, . . .) ∈ (Rd)^N. We claim that x ∈ ∩n≥1 Bn. Indeed, this is true, because for each n,
    x = (x1, . . . , x_{m_n}, x_{m_n+1}, . . .) ∈ K′_n × Rd × Rd × · · · ⊆ Dn × Rd × Rd × · · · = Bn.
This completes the proof of the theorem. □
CHAPTER 4

Norms and inequalities
The main goal of this chapter is to introduce Lp spaces and discuss some of
their properties. In particular, we will discuss some inequalities for Lp spaces that
are useful in probability theory.
4.1. Markov's inequality
The following simple but important inequality is called Markov’s inequality in
the probability literature.
THEOREM 4.1.1 (Markov's inequality). Let (Ω, F, µ) be a measure space and f : Ω → [0, ∞] be a measurable function. Then for any t > 0,
    µ({ω : f(ω) ≥ t}) ≤ (1/t) ∫ f dµ.
PROOF. Take any t > 0. Define a function g by setting g(ω) := 1 if f(ω) ≥ t, and g(ω) := 0 if f(ω) < t. Then g ≤ t^{−1} f everywhere. Thus,
    ∫ g dµ ≤ (1/t) ∫ f dµ.
The proof is completed by observing that ∫ g dµ = µ({ω : f(ω) ≥ t}) by the definition of the Lebesgue integral for simple functions. □
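Markov's inequality can be checked on data: for the empirical measure of any nonnegative sample it holds exactly, since (1/n) Σ 1{fᵢ ≥ t} ≤ (1/n) Σ fᵢ/t term by term. A minimal sketch; the exponential sample, the seed, and the value of t are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.exponential(scale=1.0, size=100_000)   # nonnegative "function" values

t = 3.0
lhs = float(np.mean(f >= t))    # empirical measure of the set {f >= t}
rhs = float(np.mean(f)) / t     # (1/t) times the integral of f
print(lhs, rhs)                 # lhs is bounded by rhs, exactly
```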
4.2. Jensen's inequality
Jensen’s inequality is another basic tool in probability theory. Unlike Markov’s
inequality, this inequality holds for probability measures only.
First, let us recall the definition of a convex function. Let I be an interval in
R∗ . The interval may be finite or infinite, open, closed or half-open. A function
φ : I → R∗ is called convex if for all x, y ∈ I and t ∈ [0, 1],
φ(tx + (1 − t)y) ≤ tφ(x) + (1 − t)φ(y).
EXERCISE 4.2.1. If φ is differentiable, show that φ is convex if and only if φ′ is an increasing function.
EXERCISE 4.2.2. If φ is twice differentiable, show that φ is convex if and only if φ″ is nonnegative everywhere.
EXERCISE 4.2.3. Show that the functions φ(x) = |x|^α for α ≥ 1 and φ(x) = e^{θx} for θ ∈ R are all convex.
An important property of convex functions is that they are continuous in the
interior of their domain.
EXERCISE 4.2.4. Let I be an interval and φ : I → R be a convex function.
Prove that φ is continuous at every interior point of I, and hence that φ is measur-
able.
Another important property of convex functions is that they have at least one tangent at every interior point of their domain.
EXERCISE 4.2.5. Let I be an interval and φ : I → R be a convex function.
Then for any x in the interior of I, show that there exist a, b ∈ R such that φ(x) =
ax + b and φ(y) ≥ ay + b for all y ∈ I. Moreover, if φ is nonlinear in every
neighborhood of x, show that a and b can be chosen such that φ(y) > ay + b for
all y ≠ x.
THEOREM 4.2.6 (Jensen's inequality). Let (Ω, F, P) be a probability space and f : Ω → R∗ be an integrable function. Let I be an interval containing the range of f, and let φ : I → R be a convex function. Let m := ∫ f dP. Then φ(m) ≤ ∫ φ ◦ f dP, provided that φ ◦ f is also integrable. Moreover, if φ is nonlinear in every open neighborhood of m, then equality holds in the above inequality if and only if f = m a.e.
PROOF. Integrability implies that f is finite a.e. Therefore we can replace f by a function that is finite everywhere, without altering either side of the claimed inequality. By Exercise 2.6.6, m ∈ I. If m equals either endpoint of I, then it is easy to show that f = m a.e., and there is nothing more to prove. So assume that m is in the interior of I. Then by Exercise 4.2.5, there exist a, b ∈ R such that ax + b ≤ φ(x) for all x ∈ I and am + b = φ(m). Thus,
    ∫_Ω φ(f(x)) dP(x) ≥ ∫_Ω (af(x) + b) dP(x) = am + b = φ(m),
which is the desired inequality. If φ is nonlinear in every open neighborhood of m, then by the second part of Exercise 4.2.5 we can guarantee that φ(x) > ax + b for all x ≠ m. Thus, by Exercise 2.6.3, equality can hold in the above display if and only if f = m a.e. □
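For the empirical measure of a finite sample, which is itself a probability measure, Jensen's inequality holds exactly; with φ(x) = x² the gap between the two sides is precisely the sample variance. A quick sketch; the sample and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=10_000)    # a sample; its empirical law is a probability measure

phi = lambda v: v ** 2         # a convex function
jensen_lhs = float(phi(np.mean(f)))   # phi applied to the mean of f
jensen_rhs = float(np.mean(phi(f)))   # mean of phi(f)
print(jensen_lhs, jensen_rhs)         # lhs <= rhs
```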
Jensen’s inequality is often used to derive inequalities for functions of real
numbers. An example is the following.
EXERCISE 4.2.7. If x1, . . . , xn ∈ R and p1, . . . , pn ∈ [0, 1] are numbers such that Σ pi = 1, and φ is a convex function on an interval containing the xi's, use Jensen's inequality to show that φ(Σ xi pi) ≤ Σ φ(xi) pi. (Strictly speaking, Jensen's inequality is not really needed for this proof; it can be done by simply using the definition of convexity and induction on n.)
A function φ is called concave if −φ is convex. Clearly, the opposite of
Jensen’s inequality holds for concave functions. An important concave function
is the logarithm, whose concavity can be verified by simply noting that its second
derivative is negative everywhere on the positive real line. A consequence of the
concavity of the logarithm is Young’s inequality, which we will use later in this
chapter to prove Hölder’s inequality.
EXERCISE 4.2.8 (Young's inequality). If x and y are positive real numbers, and p, q ∈ (1, ∞) are numbers such that 1/p + 1/q = 1, show that
    xy ≤ x^p/p + y^q/q.
(Hint: Take the logarithm of the right side and apply the definition of concavity.)
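Young's inequality can also be probed numerically on a grid: the gap x^p/p + y^q/q − xy should be nonnegative everywhere, vanishing along the curve y = x^{p−1}. A sketch; the exponent p and the grid are arbitrary choices:

```python
import numpy as np

p = 1.7
q = p / (p - 1.0)                  # conjugate exponent, so 1/p + 1/q = 1
x = np.linspace(0.01, 5.0, 300)
X, Y = np.meshgrid(x, x)

gap = X**p / p + Y**q / q - X * Y  # Young's inequality says gap >= 0
min_gap = float(gap.min())
print(min_gap)                     # nonnegative (up to float rounding)
```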
4.3. The first Borel–Cantelli lemma
The first Borel–Cantelli lemma is an important tool for proving limit theorems
in probability theory. In particular, we will use it in the next section to prove the
completeness of Lp spaces.
THEOREM 4.3.1 (The first Borel–Cantelli lemma). Let (Ω, F, µ) be a measure space, and let A1, A2, . . . ∈ F be events such that Σ µ(An) < ∞. Then µ({ω : ω ∈ infinitely many An's}) = 0.
PROOF. It is not difficult to see that in set theoretic notation,
    {ω : ω ∈ infinitely many An's} = ∩_{n=1}^∞ ∪_{k=n}^∞ Ak.
Since the inner union on the right is decreasing in n,
    µ({ω : ω ∈ infinitely many An's}) ≤ inf_{n≥1} µ(∪_{k=n}^∞ Ak) ≤ inf_{n≥1} Σ_{k=n}^∞ µ(Ak) = lim_{n→∞} Σ_{k=n}^∞ µ(Ak).
The finiteness of Σ µ(An) shows that the limit on the right equals zero, completing the proof. □
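The tail bound driving the proof, µ(∪_{k≥n} Ak) ≤ Σ_{k≥n} µ(Ak), can be visualized numerically for a convergent series: the tail sums shrink to zero, which is exactly what forces the limsup set to be null. Taking µ(Ak) = k^{−2} is an arbitrary choice for illustration:

```python
import numpy as np

# Truncate the series at 10^6 terms; the neglected remainder is ~ 1e-6.
mu = 1.0 / np.arange(1, 1_000_001, dtype=float) ** 2

def tail(n):
    # upper bound on the measure of the union of A_k for k >= n
    return float(mu[n - 1:].sum())

tails = [tail(n) for n in (1, 10, 100, 1000)]
print(tails)  # decreasing, roughly like 1/n
```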
EXERCISE 4.3.2. Produce a counterexample to show that the converse of the
first Borel–Cantelli lemma is not true, even for probability measures.
4.4. Lp spaces and inequalities
Let (Ω, F, µ) be a measure space, to be fixed throughout this section. Let f : Ω → R∗ be a measurable function. Given any p ∈ [1, ∞), the Lp norm of f is defined as
    ‖f‖_{Lp} := (∫ |f|^p dµ)^{1/p}.
The space Lp(Ω, F, µ) (or simply Lp(Ω) or Lp(µ)) is the set of all measurable f : Ω → R∗ such that ‖f‖_{Lp} is finite. In addition to the above, there is also the L∞ norm, defined as
    ‖f‖_{L∞} := inf{K ∈ [0, ∞] : |f| ≤ K a.e.}.
The right side is called the 'essential supremum' of the function |f|. It is not hard to see that |f| ≤ ‖f‖_{L∞} a.e. As before, L∞(Ω, F, µ) is the set of all f with finite L∞ norm.
It turns out that for any 1 ≤ p ≤ ∞, Lp(Ω, F, µ) is actually a vector space and the Lp norm is actually a norm on this space, provided that we first quotient out the space by the equivalence relation of being equal almost everywhere. Moreover, these norms are complete (that is, Cauchy sequences converge). The only thing that is obvious is that ‖αf‖_{Lp} = |α| ‖f‖_{Lp} for any α ∈ R and any p and f. Proving the other claims, however, requires some work, which we do below.
THEOREM 4.4.1 (Hölder's inequality). Take any measurable f, g : Ω → R∗, and any p ∈ [1, ∞]. Let q be the solution of 1/p + 1/q = 1. Then ‖fg‖_{L1} ≤ ‖f‖_{Lp} ‖g‖_{Lq}.
PROOF. If p = 1, then q = ∞. The claimed inequality then follows simply as
    ∫ |fg| dµ ≤ ‖g‖_{L∞} ∫ |f| dµ = ‖f‖_{L1} ‖g‖_{L∞}.
If p = ∞, then q = 1 and the proof is exactly the same. So assume that p ∈ (1, ∞), which implies that q ∈ (1, ∞). Let α := 1/‖f‖_{Lp} and β := 1/‖g‖_{Lq}, and let u := αf and v := βg. Then ‖u‖_{Lp} = α‖f‖_{Lp} = 1 and ‖v‖_{Lq} = β‖g‖_{Lq} = 1. Now, by Young's inequality,
    |uv| ≤ |u|^p/p + |v|^q/q
everywhere. Integrating both sides, we get
    ‖uv‖_{L1} ≤ ‖u‖_{Lp}^p/p + ‖v‖_{Lq}^q/q = 1/p + 1/q = 1.
It is now easy to see that this is precisely the inequality that we wanted to prove. □
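For the counting measure on a finite set, Hölder's inequality is a statement about sums, Σ|fg| ≤ (Σ|f|^p)^{1/p} (Σ|g|^q)^{1/q}, so it can be checked directly on vectors. A minimal sketch; the exponent p, the sample, and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
f = rng.normal(size=1000)
g = rng.normal(size=1000)
p = 3.0
q = p / (p - 1.0)   # conjugate exponent: 1/p + 1/q = 1

holder_lhs = float(np.sum(np.abs(f * g)))                 # ||fg||_1
holder_rhs = float(np.sum(np.abs(f) ** p) ** (1 / p)
                   * np.sum(np.abs(g) ** q) ** (1 / q))   # ||f||_p * ||g||_q
print(holder_lhs, holder_rhs)
```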
EXERCISE 4.4.2. If p, q ∈ (1, ∞) and f ∈ Lp(µ) and g ∈ Lq(µ), then show
that Hölder’s inequality becomes an equality if and only if there exist α, β ≥ 0, not
both of them zero, such that α|f |p = β|g|q a.e.
Using Hölder’s inequality, it is now easy to prove that the Lp norms satisfy the
triangle inequality.
THEOREM 4.4.3 (Minkowski's inequality). For any 1 ≤ p ≤ ∞ and any measurable f, g : Ω → R∗, ‖f + g‖_{Lp} ≤ ‖f‖_{Lp} + ‖g‖_{Lp}.
PROOF. If the right side is infinite, there is nothing to prove. So assume that the right side is finite. First, consider the case p = ∞. Since |f| ≤ ‖f‖_{L∞} a.e. and |g| ≤ ‖g‖_{L∞} a.e., it follows that |f + g| ≤ ‖f‖_{L∞} + ‖g‖_{L∞} a.e., which completes the proof by the definition of essential supremum.
On the other hand, if p = 1, then the claimed inequality follows trivially from the triangle inequality and the additivity of integration.
So let us assume that p ∈ (1, ∞). First, observe that by Exercise 4.2.7,
    |(f + g)/2|^p ≤ (|f|^p + |g|^p)/2,
which shows that ‖f + g‖_{Lp} is finite. Next note that by Hölder's inequality,
    ∫ |f + g|^p dµ = ∫ |f + g| |f + g|^{p−1} dµ
                   ≤ ∫ |f| |f + g|^{p−1} dµ + ∫ |g| |f + g|^{p−1} dµ
                   ≤ ‖f‖_{Lp} ‖|f + g|^{p−1}‖_{Lq} + ‖g‖_{Lp} ‖|f + g|^{p−1}‖_{Lq},
where q solves 1/p + 1/q = 1. But
    ‖|f + g|^{p−1}‖_{Lq} = (∫ |f + g|^{(p−1)q} dµ)^{1/q} = (∫ |f + g|^p dµ)^{1/q}.
Combining, and using the finiteness of ‖f + g‖_{Lp}, we get
    ‖f + g‖_{Lp} = (∫ |f + g|^p dµ)^{1−1/q} ≤ ‖f‖_{Lp} + ‖g‖_{Lp},
which is what we wanted to prove. □
EXERCISE 4.4.4. If p ∈ (1, ∞), show that equality holds in Minkowski's inequality if and only if f = λg for some λ ≥ 0 or g = 0.
Minkowski's inequality shows, in particular, that Lp(Ω, F, µ) is a vector space, and the Lp norm satisfies two of the three required properties of a norm on this space. The only property that it does not satisfy is that f may be nonzero even if ‖f‖_{Lp} = 0. But this is not a serious problem, since by Proposition 2.6.1, the vanishing of the Lp norm implies that f = 0 a.e. More generally, ‖f − g‖_{Lp} = 0 if and only if f = g a.e. This shows that if we quotient out Lp(Ω, F, µ) by the equivalence relation of being equal a.e., the resulting quotient space is a vector space where the Lp norm is indeed a norm. Since we already think of two functions which are equal a.e. as effectively the same function, we will not worry about this technicality too much and continue to treat our definition of Lp space as a vector space with Lp norm.
The fact that is somewhat nontrivial, however, is that the Lp norm is complete. That is, any sequence of functions that is Cauchy in the Lp norm converges to a limit in Lp space. A first step towards the proof is the following lemma, which is important in its own right.
LEMMA 4.4.5. If {fn}n≥1 is a Cauchy sequence in the Lp norm for some 1 ≤ p ≤ ∞, then there is a function f ∈ Lp(Ω, F, µ), and a subsequence {fnk}k≥1, such that fnk → f a.e. as k → ∞.
PROOF. First suppose that p ∈ [1, ∞). It is not difficult to see that using the Cauchy criterion, we can extract a subsequence {fnk}k≥1 such that ‖fnk − fnk+1‖_{Lp} ≤ 2^{−k} for every k. Define the event
    Ak := {ω : |fnk(ω) − fnk+1(ω)| ≥ 2^{−k/2}}.
Then by Markov's inequality,
    µ(Ak) = µ({ω : |fnk(ω) − fnk+1(ω)|^p ≥ 2^{−kp/2}}) ≤ 2^{kp/2} ∫ |fnk − fnk+1|^p dµ = 2^{kp/2} ‖fnk − fnk+1‖_{Lp}^p ≤ 2^{−kp/2}.
Thus, Σ µ(Ak) < ∞. Therefore by the first Borel–Cantelli lemma, µ(B) = 0, where B := {ω : ω ∈ infinitely many Ak's}. If ω ∈ B^c, then ω belongs to only finitely many of the Ak's. This means that |fnk(ω) − fnk+1(ω)| ≤ 2^{−k/2} for all sufficiently large k. From this, it follows that {fnk(ω)}k≥1 is a Cauchy sequence of real numbers. Define f(ω) to be the limit of this sequence. For ω ∈ B, define f(ω) = 0. Then f is measurable and fnk → f a.e. Moreover, by the a.e. version of Fatou's lemma,
    ∫ |f|^p dµ ≤ lim inf_{k→∞} ∫ |fnk|^p dµ = lim inf_{k→∞} ‖fnk‖_{Lp}^p < ∞,
since a Cauchy sequence of real numbers must be bounded. Thus, f has finite Lp norm.
Next, suppose that p = ∞. Extract a subsequence {fnk}k≥1 as before. Then for each k, |fnk − fnk+1| ≤ 2^{−k} a.e. Therefore, if we define
    E := {ω : |fnk(ω) − fnk+1(ω)| ≤ 2^{−k} for all k},
then µ(E^c) = 0. For any ω ∈ E, {fnk(ω)}k≥1 is a Cauchy sequence of real numbers. Define f(ω) to be the limit of this sequence. For ω ∈ E^c, define f(ω) = 0. Then f is measurable and fnk → f a.e. Moreover, on E,
    |f| ≤ |fn1| + Σ_{k=1}^∞ |fnk+1 − fnk| ≤ |fn1| + Σ_{k=1}^∞ 2^{−k},
which shows that f has finite L∞ norm. □
THEOREM 4.4.6 (Riesz–Fischer theorem). For any 1 ≤ p ≤ ∞, the Lp norm on Lp(Ω, F, µ) is complete.
PROOF. Take any sequence {fn}n≥1 that is Cauchy in Lp. Then note that by Lemma 4.4.5, there is a subsequence {fnk}k≥1 that converges a.e. to a function f which is also in Lp. First, suppose that p ∈ [1, ∞). Take any ε > 0, and find N such that for all m, n ≥ N, ‖fn − fm‖_{Lp} < ε. Take any n ≥ N. Then by the a.e. version of Fatou's lemma,
    ∫ |fn − f|^p dµ = ∫ lim_{k→∞} |fn − fnk|^p dµ ≤ lim inf_{k→∞} ∫ |fn − fnk|^p dµ ≤ ε^p.
This shows that fn → f in the Lp norm. Next, suppose that p = ∞. Take any ε > 0 and find N as before. Let
    E := {ω : |fn(ω) − fm(ω)| ≤ ε for all m, n ≥ N},
so that µ(E^c) = 0. Take any n ≥ N. Then for any ω such that ω ∈ E and fnk(ω) → f(ω),
    |fn(ω) − f(ω)| = lim_{k→∞} |fn(ω) − fnk(ω)| ≤ ε.
This shows that ‖fn − f‖_{L∞} ≤ ε, completing the proof that fn → f in the L∞ norm. □
The following fact about Lp spaces of probability measures is very important
in probability theory. It does not hold for general measures.
THEOREM 4.4.7 (Monotonicity of Lp norms for probability measures). Suppose that µ is a probability measure. Then for any measurable f : Ω → R∗ and any 1 ≤ p ≤ q ≤ ∞, ‖f‖_{Lp} ≤ ‖f‖_{Lq}.
PROOF. If p = q = ∞, there is nothing to prove. So let p be finite. If q = ∞, then
    ∫ |f|^p dµ ≤ ‖f‖_{L∞}^p µ(Ω) = ‖f‖_{L∞}^p,
which proves the claim. So let q be also finite. First assume that ‖f‖_{Lp} and ‖f‖_{Lq} are both finite. Applying Jensen's inequality with the convex function φ(x) = |x|^{q/p}, we get the desired inequality
    (∫ |f|^p dµ)^{q/p} ≤ ∫ |f|^q dµ.
Finally, let us drop the assumption of finiteness of ‖f‖_{Lp} and ‖f‖_{Lq}. Take a sequence of nonnegative simple functions {gn}n≥1 increasing pointwise to |f| (which exists by Proposition 2.3.6). Since µ is a probability measure, ‖gn‖_{Lp} and ‖gn‖_{Lq} are both finite. Therefore ‖gn‖_{Lp} ≤ ‖gn‖_{Lq} for each n. We can now complete the proof by applying the monotone convergence theorem to both sides. □
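Under the empirical measure of a sample (a probability measure), the monotonicity reads (mean |f|^p)^{1/p} ≤ (mean |f|^q)^{1/q} for p ≤ q, and it holds exactly for every sample. A quick sketch; the sample, seed, and exponents are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.normal(size=5000)

def lp_norm(f, p):
    # L^p norm of f under the empirical probability measure of the sample
    return float(np.mean(np.abs(f) ** p) ** (1.0 / p))

norms = [lp_norm(f, p) for p in (1.0, 2.0, 4.0, 8.0)]
print(norms)  # nondecreasing in p
```

The averaging (the 1/n weight) is what makes this work; for the counting measure without normalization the inequality goes the other way.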
EXERCISE 4.4.8. When Ω = R and µ is the Lebesgue measure, produce a
function f whose L2 norm is finite but L1 norm is infinite.
The space L2 holds a special status among all the Lp spaces, because it has a natural rendition as a Hilbert space with respect to the inner product
    (f, g) := ∫ fg dµ.    (4.4.1)
It is easy to verify that this is indeed an inner product on L2(Ω, F, µ), and the norm generated by this inner product is the L2 norm (after quotienting out by the equivalence relation of a.e. equality, as usual). The completeness of the L2 norm guarantees that this is indeed a Hilbert space. The Cauchy–Schwarz inequality on this Hilbert space is a special case of Hölder's inequality with p = q = 2.
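As a concrete instance, for the counting measure on a finite set the inner product (4.4.1) is the usual dot product, and the Cauchy–Schwarz inequality |(f, g)| ≤ ‖f‖₂ ‖g‖₂ can be checked directly. The random vectors below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
f = rng.normal(size=1000)
g = rng.normal(size=1000)

inner = float(np.sum(f * g))   # the inner product (f, g) for counting measure
cs_lhs = abs(inner)
cs_rhs = float(np.sqrt(np.sum(f * f)) * np.sqrt(np.sum(g * g)))  # ||f||_2 ||g||_2
print(cs_lhs, cs_rhs)
```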
CHAPTER 5

Random variables
In this chapter we will study the basic properties of random variables and re-
lated functionals.

5.1. Definition
If (Ω, F) is a measurable space, a measurable function from Ω into R or R∗ is called a random variable. More generally, if (Ω′, F′) is another measurable space, then a measurable function from Ω into Ω′ is called an Ω′-valued random variable defined on Ω. Unless otherwise mentioned, we will assume that any random variable that we are talking about is real-valued.
Let (Ω, F, P) be a probability space and let X be a random variable defined on
Ω. It is the common convention in probability theory to write P(X ∈ A) instead
of P({ω ∈ Ω : X(ω) ∈ A}). Similarly, we write {X ∈ A} to denote the set
{ω ∈ Ω : X(ω) ∈ A}. Similarly, if X and Y are two random variables, the event
{X ∈ A, Y ∈ B} is the set {ω : X(ω) ∈ A, Y (ω) ∈ B}.
Another commonly used convention is that if f : R → R is a measurable
function and X is a random variable, f (X) denotes the random variable f ◦X. The
σ-algebra generated by a random variable X is denoted by σ(X), and if {Xi }i∈I
is a collection of random variables defined on the same probability space, then
the σ-algebra σ({Xi }i∈I ) generated by the collection {Xi }i∈I is defined to be the
σ-algebra generated by the union of the sets σ(Xi ), i ∈ I.
EXERCISE 5.1.1. If X1, . . . , Xn are random variables defined on the same
probability space, and f : Rn → R is a measurable function, show that the random
variable f (X1 , . . . , Xn ) is measurable with respect to the σ-algebra generated by
the random variables X1 , . . . , Xn .
EXERCISE 5.1.2. Let {Xn}_{n=1}^∞ be a sequence of random variables. Show that
the random variables sup Xn , inf Xn , lim sup Xn and lim inf Xn are all measur-
able with respect to σ(X1 , X2 , . . .).
EXERCISE 5.1.3. Let {Xn}_{n=1}^∞ be a sequence of random variables defined on a probability space (Ω, F, P). For any A ∈ σ(X1, X2, . . .) and any ε > 0, show that there is some n ≥ 1 and some B ∈ σ(X1, . . . , Xn) such that P(A∆B) < ε. (Hint: Use Theorem 1.2.6.)
E XERCISE 5.1.4. Let {Xi }i∈I be a finite or countable collection of real-valued
random variables defined on a probability space (Ω, F, P). Show that any A ∈

σ({Xi }i∈I ) can be expressed as X −1 (B) for some measurable set B ⊆ RI , where
X : Ω → RI is the map X(ω) = (Xi (ω))i∈I .
E XERCISE 5.1.5. If X1 , . . . , Xn are random variables defined on the same
probability space, and X is a random variable that is measurable with respect to
σ(X1 , . . . , Xn ), then there is a measurable function f : Rn → R such that X =
f (X1 , . . . , Xn ). Hint: Start with indicator random variables.
E XERCISE 5.1.6. Extend the previous exercise to σ-algebras generated by
countably many random variables.

5.2. Cumulative distribution function


D EFINITION 5.2.1. Let (Ω, F, P) be a probability space and let X be a random
variable defined on Ω. The cumulative distribution function of X is a function
FX : R → [0, 1] defined as
FX (t) := P(X ≤ t).
Often, the cumulative distribution function is simply called the distribution func-
tion of X, or abbreviated as the c.d.f. of X.
P ROPOSITION 5.2.2. Let F : R → [0, 1] be a function that is non-decreasing,
right-continuous, and satisfies
lim_{t→−∞} F(t) = 0 and lim_{t→∞} F(t) = 1. (5.2.1)
Then there is a probability space (Ω, F, P) and a random variable X defined on
this space such that F is the cumulative distribution function of X. Conversely, the
cumulative distribution function of any random variable X has the above proper-
ties.

P ROOF. Take Ω = (0, 1), F = the restriction of B(R) to (0, 1), and P =
the restriction of Lebesgue measure to (0, 1), which is a probability measure. For
ω ∈ Ω, define X(ω) := inf{t ∈ R : F (t) ≥ ω}. Since F (t) → 1 as t → ∞
and F (t) → 0 as t → −∞, X is well-defined on Ω. Now, if ω ≤ F (t) for some
ω and t, then by the definition of X, X(ω) ≤ t. Conversely, if X(ω) ≤ t, then
there is a sequence tn ↓ t such that F (tn ) ≥ ω for each n. By the right-continuity
of F , this implies that F (t) ≥ ω. Thus, we have shown that X(ω) ≤ t if and
only if F (t) ≥ ω. By either Exercise 2.1.6 or Exercise 2.1.7, F is measurable.
Hence, the previous sentence shows that X is also measurable, and moreover that
P(X ≤ t) = P((0, F (t)]) = F (t). Thus, F is the c.d.f. of the random variable X.
Conversely, if F is the c.d.f. of a random variable, it is easy to show that F
is right-continuous and satisfies (5.2.1) by the continuity of probability measures
under increasing unions and decreasing intersections. Monotonicity of F follows
by the monotonicity of probability measures. 
Because of Proposition 5.2.2, any function satisfying the three conditions stated
in the statement of the proposition is called a cumulative distribution function (or
just distribution function or c.d.f.).
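The construction in the proof of Proposition 5.2.2 — setting X(ω) = inf{t : F(t) ≥ ω} for ω uniform on (0, 1) — is also a practical recipe for sampling from a given c.d.f. The following sketch (not part of the original text; the bisection search and the Exp(1) example are illustrative choices) approximates the generalized inverse numerically and checks it against the empirical c.d.f.

```python
import math
import random

def quantile(F, omega, lo, hi, tol=1e-9):
    """Generalized inverse X(omega) = inf{t : F(t) >= omega}, located by bisection.
    Assumes F is non-decreasing with F(lo) < omega <= F(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) >= omega:
            hi = mid   # mid still satisfies F(mid) >= omega, so the infimum is <= mid
        else:
            lo = mid
    return hi

# Illustrative target: the Exp(1) c.d.f., whose quantile function is -log(1 - omega).
F = lambda t: 1 - math.exp(-t) if t >= 0 else 0.0

random.seed(0)
samples = [quantile(F, random.random(), lo=-10.0, hi=100.0) for _ in range(5000)]
# The empirical c.d.f. at t = 1 should approximate F(1) = 1 - 1/e ≈ 0.632.
emp = sum(s <= 1.0 for s in samples) / len(samples)
```

The same bisection works for any non-decreasing right-continuous F, including ones with flat stretches and jumps, which is exactly why the proof uses the infimum rather than an inverse function.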

The following exercise is often used in proofs involving cumulative distribution


functions.
E XERCISE 5.2.3. Show that any cumulative distribution function can have
only countably many points of discontinuity. As a consequence, show that the
set of continuity points is a dense subset of R.

5.3. The law of a random variable


Any random variable X induces a probability measure µX on the real line,
defined as
µX (A) := P(X ∈ A).
This generalizes easily to random variables taking value in other spaces. The prob-
ability measure µX is called the law of X.
P ROPOSITION 5.3.1. Two random variables have the same cumulative distri-
bution function if and only if they have the same law.

P ROOF. If X and Y have the same law, it is clear that they have the same c.d.f.
Conversely, let X and Y be two random variables that have the same distribution
function. Let A be the set of all Borel sets A ⊆ R such that µX (A) = µY (A).
It is easy to see that A is a λ-system. Moreover, A contains all intervals of the
form (a, b], a, b ∈ R, which is a π-system that generates the Borel σ-algebra on R.
Therefore, by Dynkin’s π-λ theorem, A ⊇ B(R), and hence µX = µY . 
The above proposition shows that there is a one-to-one correspondence be-
tween cumulative distribution functions and probability measures on R. Moreover,
the following is true.
E XERCISE 5.3.2. If X and Y have the same law, and g : R → R is a measur-
able function, show that g(X) and g(Y ) also have the same law.

5.4. Probability density function


Suppose that f is a nonnegative integrable function on the real line, such that
∫_{−∞}^∞ f(x) dx = 1,
where dx = dλ(x) denotes integration with respect to the Lebesgue measure λ
defined in Section 1.6, and the range of the integral denotes integration over the
whole real line. Although this is Lebesgue integration, we retain the notation of
Riemann integration for the sake of familiarity. By Exercise 2.3.5, f defines a
probability measure ν on R as
ν(A) := ∫_A f(x) dx (5.4.1)
for each A ∈ B(R). The function f is called the probability density function
(p.d.f.) of the probability measure ν.

To verify that a given function f is the p.d.f. of a random variable, it is not


necessary to check (5.4.1) for every Borel set A. It suffices that it holds for a much
smaller class, as given by the following proposition.
P ROPOSITION 5.4.1. A function f is the p.d.f. of a random variable X if and
only if (5.4.1) is satisfied for every set A of the form [a, b] where a and b are
continuity points of the c.d.f. of X.

P ROOF. One implication is trivial. For the other, let F be the c.d.f. of X and
suppose that (5.4.1) is satisfied for every set A of the form [a, b] where a and b are
continuity points of F . We then claim that (5.4.1) holds for every [a, b], even if a
and b are not continuity points of F . This is easily established using Exercise 5.2.3
and the dominated convergence theorem. Once we know this, the result can be
completed by the π-λ theorem, observing that the set of closed intervals is a π-
system, and the set of all Borel sets A for which the identity (5.4.1) holds is a
λ-system. 
The following exercise relates the p.d.f. and the c.d.f. It is a simple consequence
of the above proposition.
E XERCISE 5.4.2. If f is a p.d.f. on R, show that the function
F(x) := ∫_{−∞}^x f(y) dy (5.4.2)
is the c.d.f. of the probability measure ν defined by f as in (5.4.1). Conversely,
if F is a c.d.f. on R for which there exists a nonnegative measurable function f
satisfying (5.4.2) for all x, show that f is a p.d.f. which generates the probability
measure corresponding to F .
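Relation (5.4.2) is easy to check numerically. The sketch below is illustrative (the Exp(1) density and the plain midpoint rule are our choices, not the text's): it integrates a p.d.f. up to x and compares with the closed-form c.d.f.

```python
import math

lam = 1.0
f = lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0   # Exp(1) density
F = lambda x: 1 - math.exp(-lam * x) if x >= 0 else 0.0     # its c.d.f.

def cdf_from_pdf(f, x, lo=-10.0, n=100000):
    """Approximate F(x) = ∫_{-∞}^x f(y) dy by a midpoint Riemann sum on [lo, x]."""
    if x <= lo:
        return 0.0
    h = (x - lo) / n
    return sum(f(lo + (k + 0.5) * h) for k in range(n)) * h

approx = cdf_from_pdf(f, 2.0)   # should be close to F(2) = 1 - e^{-2}
```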
The next exercise establishes that the p.d.f. is essentially unique when it exists.
E XERCISE 5.4.3. If f and g are two probability density functions for the same
probability measure on R, show that f = g a.e. with respect to Lebesgue measure.
Because of the above exercise, we will generally treat the probability density
function of a random variable X as a unique function (although it is only unique
up to almost everywhere equality), and refer to it as ‘the p.d.f.’ of X.

5.5. Some standard densities


An important example of a p.d.f. is the density function for the normal (or
Gaussian) distribution with mean parameter µ ∈ R and standard deviation param-
eter σ > 0, given by
f(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)).
If a random variable X has this distribution, we write X ∼ N (µ, σ 2 ). The special
case of µ = 0 and σ = 1 is known as the standard normal distribution.

E XERCISE 5.5.1. Verify that the p.d.f. of the standard normal distribution is
indeed a p.d.f., that is, its integral over the real line equals 1. Then use a change
of variable to prove that the normal p.d.f. is indeed a p.d.f. for any µ and σ. (Hint:
Square the integral and pass to polar coordinates.)
E XERCISE 5.5.2. If X ∼ N (µ, σ 2 ), show that for any a, b ∈ R, aX + b ∼
N (aµ + b, a2 σ 2 ).
Another class of densities that occur quite frequently are the exponential den-
sities. The exponential distribution with rate parameter λ has p.d.f.
f(x) = λ e^{−λx} 1_{x≥0},
where 1{x≥0} denotes the function that is 1 when x ≥ 0 and 0 otherwise. It is easy
to see that this is indeed a p.d.f., and its c.d.f. is given by
F(x) = (1 − e^{−λx}) 1_{x≥0}.
If X is a random variable with this c.d.f., we write X ∼ Exp(λ).
E XERCISE 5.5.3. If X ∼ Exp(λ), show that for any a > 0, aX ∼ Exp(λ/a).
The Gamma distribution with rate parameter λ > 0 and shape parameter r > 0
has probability density function
f(x) = (λ^r x^{r−1}/Γ(r)) e^{−λx} 1_{x≥0},
where Γ denotes the standard Gamma function:
Γ(r) = ∫₀^∞ x^{r−1} e^{−x} dx.
(Recall that Γ(r) = (r − 1)! if r is a positive integer.) If a random variable X has
this distribution, we write X ∼ Gamma(r, λ).
Yet another important class of distributions that have densities are uniform dis-
tributions. The uniform distribution on an interval [a, b] has the probability density
that equals 1/(b − a) in this interval and 0 outside. If a random variable X has this
distribution, we write X ∼ Unif[a, b].
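As a quick sanity check on these formulas, one can integrate the densities numerically. This is a sketch using a plain midpoint rule and the stdlib `math.gamma`; the parameter values are arbitrary illustrative choices.

```python
import math

def gamma_pdf(x, r, lam):
    """Gamma(r, λ) density; assumes r, λ > 0."""
    if x < 0:
        return 0.0
    return lam**r * x**(r - 1) * math.exp(-lam * x) / math.gamma(r)

def normal_pdf(x, mu, sigma):
    """N(µ, σ²) density."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def integral(f, lo, hi, n=100000):
    """Midpoint Riemann sum of f on [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (k + 0.5) * h) for k in range(n)) * h

# Both integrals should be close to 1; the truncated tails are negligible here.
total_gamma = integral(lambda x: gamma_pdf(x, 3.0, 2.0), 0.0, 40.0)
total_norm = integral(lambda x: normal_pdf(x, 1.0, 2.0), -30.0, 30.0)
```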

5.6. Standard discrete distributions


Random variables that have a continuous c.d.f. are known as continuous random
variables. A discrete random variable is a random variable that can only take values
in a finite or countable set. Note that it is possible that a random variable is neither
continuous nor discrete. The law of a discrete random variable is characterized by
its probability mass function (p.m.f.), which gives the probabilities of attaining the
various values in its range. It is not hard to see that the p.m.f. uniquely determines
the c.d.f. and hence the law. Moreover, any nonnegative function on a finite or
countable subset of R that adds up to 1 is a p.m.f. for a probability measure. The
simplest example of a discrete random variable is a Bernoulli random variable with
probability parameter p, which can take values 0 or 1, with p.m.f.
f(x) = (1 − p) 1_{x=0} + p 1_{x=1}.

If X has this p.m.f., we write X ∼ Ber(p). A generalization of the Bernoulli


distribution is the binomial distribution with parameters n and p, whose p.m.f. is
f(x) = ∑_{k=0}^n (n choose k) p^k (1 − p)^{n−k} 1_{x=k}.
If X has this p.m.f., we write X ∼ Bin(n, p). Note that when n = 1, this is
simply the Bernoulli distribution with parameter p.
The binomial distributions are, in some sense, discrete analogues of normal
distributions. The discrete analogues of exponential distributions are geometric
distributions. The geometric distribution with parameter p has p.m.f.
f(x) = ∑_{k=1}^∞ (1 − p)^{k−1} p 1_{x=k}.
If a random variable X has this p.m.f., we write X ∼ Geo(p). Again, the reader
probably knows already that this distribution models the waiting time for the first
head in a sequence of coin tosses where the chance of heads is p.
Finally, a very important class of probability distributions are the Poisson dis-
tributions. The Poisson distribution with parameter λ > 0 has p.m.f.
f(x) = ∑_{k=0}^∞ e^{−λ} (λ^k/k!) 1_{x=k}.
If X has this distribution, we write X ∼ Poi(λ).
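These p.m.f.s are easy to tabulate directly. A sketch follows; the parameter values are arbitrary, and the geometric and Poisson sums are truncated at points where the remaining tails are negligible.

```python
import math

def binom_pmf(k, n, p):
    """Bin(n, p) probability of the value k."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def geom_pmf(k, p):
    """Geo(p) probability of the value k; supported on k = 1, 2, ..."""
    return (1 - p)**(k - 1) * p

def poisson_pmf(k, lam):
    """Poi(λ) probability of the value k; supported on k = 0, 1, 2, ..."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Each p.m.f. sums to 1 over its support.
b = sum(binom_pmf(k, 10, 0.3) for k in range(11))
g = sum(geom_pmf(k, 0.4) for k in range(1, 200))
po = sum(poisson_pmf(k, 2.5) for k in range(100))
```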
CHAPTER 6

Expectation, variance, and other functionals

This chapter introduces a number of functionals associated with random vari-


ables — or, more accurately, with laws of random variables. The list includes ex-
pectation, variance, covariance, moments, moment generating function, and char-
acteristic function.

6.1. Expected value


Let (Ω, F, P) be a probability space and let X be a random variable defined
on Ω. The expected R value (or expectation, or mean) of X, denoted by E(X), is
simply the integral XdP, provided that the integral exists. A random variable X
is called integrable if it is integrable as a measurable function, that is, if E|X| < ∞.
A notation that is often used is that if X is a random variable and A is an event
(defined on the same space), then
E(X; A) := E(X1A ).
The following exercises show how to compute expected values in practice.
E XERCISE 6.1.1. If a random variable X has law µX , show that for any mea-
surable g : R → R,
E(g(X)) = ∫_R g(x) dµX(x),
in the sense that one side exists if and only if the other does, and then the two are
equal. (Hint: Start with simple functions.)
E XERCISE 6.1.2. Let S be a measurable space and let X be an S-valued ran-
dom variable (meaning that X is a measurable map from some probability space
into S). Then µX can be defined as usual, that is, µX (B) := P(X ∈ B). Let
g : S → R be a measurable map. Generalize the previous exercise to this setting.
E XERCISE 6.1.3. If a random variable X takes values in a countable or finite
set A, prove that for any g : A → R, g(X) is also a random variable, and
E(g(X)) = ∑_{a∈A} g(a) P(X = a),
provided that at least one of ∑_{a∈A} g(a)⁺ P(X = a) and ∑_{a∈A} g(a)⁻ P(X = a) is finite.
E XERCISE 6.1.4. If X has p.d.f. f and g : R → R is a measurable function,
show that E(g(X)) exists if and only if the integral
∫_{−∞}^∞ g(x) f(x) dx

exists in the Lebesgue sense, and in that case the two quantities are equal.

E XERCISE 6.1.5. Compute the means of normal, exponential, Gamma, uni-


form, Bernoulli, binomial, geometric and Poisson random variables.

We will sometimes make use of the following application of Fubini’s theorem


to compute expected values of integrals of functions of random variables.

E XERCISE 6.1.6. Let X be a random variable and let f : R2 → R be a


measurable function such that
∫_{−∞}^∞ E|f(t, X)| dt < ∞.

Then show that


∫_{−∞}^∞ E(f(t, X)) dt = E(∫_{−∞}^∞ f(t, X) dt),

in the sense that both sides exist and are equal. (Of course, here f (t, X) denotes
the random variable f (t, X(ω)).)

A very important representation of the expected value of a nonnegative random


variable is given by the following exercise, which is a simple consequence of the
previous one. It gives a way of calculating the expected value of a nonnegative
random variable if we know its law.

E XERCISE 6.1.7. If X is a nonnegative random variable, prove that


E(X) = ∫₀^∞ P(X ≥ t) dt.
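For a concrete check of this tail-integral identity, here is a numerical sketch; Exp(2) is an arbitrary choice, with P(X ≥ t) = e^{−2t} and E(X) = 1/2.

```python
import math

lam = 2.0
tail = lambda t: math.exp(-lam * t)   # P(X >= t) for X ~ Exp(2)

def integral(f, lo, hi, n=100000):
    """Midpoint Riemann sum of f on [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (k + 0.5) * h) for k in range(n)) * h

# ∫_0^∞ P(X >= t) dt, truncated at 30 where the tail is negligible;
# should be close to E(X) = 1/λ = 0.5.
mean_via_tail = integral(tail, 0.0, 30.0)
```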

A corollary of the above exercise is the following fact, which shows that the
expected value is a property of the law of a random variable rather than the random
variable itself.

E XERCISE 6.1.8. Prove that if two random variables X and Y have the same
law, then E(X) exists if and only if E(Y ) exists, and in this case they are equal.

An inequality related to Exercise 6.1.7 is the following. It is proved easily by


the monotone convergence theorem.

E XERCISE 6.1.9. If X is a nonnegative random variable, prove that



∑_{n=1}^∞ P(X ≥ n) ≤ E(X) ≤ ∑_{n=0}^∞ P(X ≥ n),

with equality on the left if X is integer-valued.
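The integer-valued case of equality can be seen numerically. In this sketch (Poisson(2.5) is an arbitrary choice, and all sums are truncated where the tails are negligible) the lower sum recovers E(X) = λ and the upper sum exceeds it by exactly P(X ≥ 0) = 1.

```python
import math

lam = 2.5
pmf = lambda k: math.exp(-lam) * lam**k / math.factorial(k)

def tail(n, cap=150):
    """P(X >= n) for X ~ Poisson(2.5), truncating the negligible tail beyond cap."""
    return sum(pmf(k) for k in range(n, cap))

lower = sum(tail(n) for n in range(1, 150))   # ∑_{n≥1} P(X ≥ n), should be E(X) = 2.5
upper = sum(tail(n) for n in range(0, 150))   # ∑_{n≥0} P(X ≥ n) = lower + P(X ≥ 0)
```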



6.2. Variance and covariance


The variance of X is defined as
Var(X) := E(X 2 ) − (E(X))2 ,
provided that E(X 2 ) is finite. Note that by the monotonicity of Hölder norms for
probability measures, the finiteness of E(X 2 ) automatically implies the finiteness
of E|X| and in particular the existence of E(X).
E XERCISE 6.2.1. Compute the variances of the normal, exponential, Gamma,
uniform, Bernoulli, binomial, geometric and Poisson distributions.
It is not difficult to verify that
Var(X) = E(X − E(X))2 ,
where X − E(X) denotes the random variable obtained by subtracting off the
constant E(X) from X at each ω. Note that for any a, b ∈ R, Var(aX + b) =
a2 Var(X). When the variance exists, an important property of the expected value
is that it is the constant a that minimizes E(X − a)2 . This follows from the easy-
to-prove identity
E(X − a)2 = Var(X) + (E(X) − a)2 .
The square-root of Var(X) is known as the standard deviation of X. A simple
consequence of Minkowski’s inequality is that the standard deviation of X + Y ,
where X and Y are two random variables defined on the same probability space, is
bounded by the sum of the standard deviations of X and Y . A simple consequence
of Markov’s inequality is the following result, which shows that X is likely to be
within a few standard deviations from the mean.
T HEOREM 6.2.2 (Chebychev’s inequality). Let X be any random variable
with E(X 2 ) < ∞. Then for any t > 0,
P(|X − E(X)| ≥ t) ≤ Var(X)/t².

P ROOF. By Markov’s inequality,


P(|X − E(X)| ≥ t) = P((X − E(X))² ≥ t²) ≤ E(X − E(X))²/t²,
and recall that Var(X) = E(X − E(X))2 . 
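A quick Monte Carlo illustration of the inequality (a sketch; the uniform distribution and the threshold t = 0.4 are arbitrary choices): for X ∼ Unif[0, 1], Var(X) = 1/12 ≈ 0.083, so Chebychev gives P(|X − 1/2| ≥ 0.4) ≤ 0.52, while the true probability is 0.2.

```python
import random

random.seed(1)
n = 100000
xs = [random.random() for _ in range(n)]   # X ~ Unif[0, 1]: E(X) = 1/2, Var(X) = 1/12

# Sample mean and variance stand in for E(X) and Var(X) in this sketch.
mean = sum(xs) / n
var = sum((x - mean)**2 for x in xs) / n

t = 0.4
emp = sum(abs(x - mean) >= t for x in xs) / n   # empirical P(|X - E(X)| >= t), truth 0.2
bound = var / t**2                              # Chebychev bound, about 0.52
```

As expected, the bound holds but is far from tight here; Chebychev trades sharpness for complete generality.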
Chebychev’s inequality is an example of what is known as a ‘tail bound’ for a
random variable. Tail bounds are indispensable tools in modern probability theory.
Another very useful inequality involving the E(X) and E(X 2 ) is the following.
It gives a kind of inverse for Chebychev’s inequality.
T HEOREM 6.2.3 (Paley–Zygmund inequality). Let X be a nonnegative ran-
dom variable with E(X 2 ) < ∞. Then for any t ∈ [0, E(X)),
P(X > t) ≥ (E(X) − t)²/E(X²).

P ROOF. Take any t ∈ [0, E(X)). Let Y := (X − t)+ . Then


0 ≤ E(X − t) ≤ E(X − t)+ = E(Y ) = E(Y ; Y > 0).
By the Cauchy–Schwarz inequality for L2 space,
(E(Y ; Y > 0))2 ≤ E(Y 2 )P(Y > 0) = E(Y 2 )P(X > t).
The proof is completed by observing that Y 2 ≤ X 2 . 
The covariance of two random variables X and Y defined on the same proba-
bility space is defined as
Cov(X, Y ) := E(XY ) − E(X)E(Y ),
provided that both X, Y and XY are integrable. Notice that
Var(X) = Cov(X, X).
It is straightforward to show that
Cov(X, Y ) = E((X − E(X))(Y − E(Y ))).
From this, it follows by the Cauchy–Schwarz inequality for L2 space that if X, Y ∈
L2 , then
|Cov(X, Y)| ≤ √(Var(X) Var(Y)).
In fact, the covariance itself is an inner product, on the Hilbert space obtained by
quotienting L2 by the subspace consisting of all a.e. constant random variables. In
particular, the covariance is a bilinear functional on L2 , and Cov(X, a) = 0 for
any random variable X and constant a (viewed as a random variable).
The correlation between X and Y is defined as
Corr(X, Y) := Cov(X, Y)/√(Var(X) Var(Y)),
provided that the variances are nonzero. By the Cauchy–Schwarz inequality, the
correlation always lies between −1 and 1. If the correlation is zero, we say that the
random variables are uncorrelated.
E XERCISE 6.2.4. Show that Corr(X, Y ) = 1 if and only if Y = aX + b for
some a > 0 and b ∈ R, and Corr(X, Y ) = −1 if and only if Y = aX + b for some
a < 0 and b ∈ R.
An important formula involving the covariance is the following.
P ROPOSITION 6.2.5. For any X1 , . . . , Xn defined on the same space,
Var(∑_{i=1}^n Xi) = ∑_{i,j=1}^n Cov(Xi, Xj)
= ∑_{i=1}^n Var(Xi) + 2 ∑_{1≤i<j≤n} Cov(Xi, Xj).

P ROOF. This is a direct consequence of the bilinearity of covariance. The


second identity follows from the observations that Cov(X, Y ) = Cov(Y, X) and
Cov(X, X) = Var(X). 
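Proposition 6.2.5 can be checked by simulation. In this sketch, the construction Y = X + Z with independent standard normals X and Z is an arbitrary example, giving Var(X) = 1, Var(Y) = 2, Cov(X, Y) = 1, hence Var(X + Y) = 1 + 2 + 2·1 = 5.

```python
import random

random.seed(2)
n = 200000
sums = []
for _ in range(n):
    x = random.gauss(0, 1)
    y = x + random.gauss(0, 1)   # Cov(X, Y) = Var(X) = 1
    sums.append(x + y)

m = sum(sums) / n
var_sum = sum((s - m)**2 for s in sums) / n   # should be close to 5
```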

6.3. Moments and moment generating function


For any positive integer k, the kth moment of a random variable X is defined
as E(X k ), provided that this expectation exists.
E XERCISE 6.3.1. If X ∼ N (0, 1), show that
E(X^k) = 0 if k is odd, and E(X^k) = (k − 1)!! if k is even,
where (k − 1)!! := (k − 1)(k − 3) · · · 5 · 3 · 1.
The moment generating function of X is defined as
mX (t) := E(etX )
for t ∈ R. Note that the moment generating function is allowed to take infinite
values. The moment generating function derives its name from the fact that
mX(t) = ∑_{k=0}^∞ (t^k/k!) E(X^k)
whenever
∑_{k=0}^∞ (|t|^k/k!) E|X|^k < ∞, (6.3.1)
k=0
as can be easily verified using the monotone convergence theorem and the domi-
nated convergence theorem.
E XERCISE 6.3.2. Carry out the above verification.
E XERCISE 6.3.3. If X ∼ N(µ, σ²), show that mX(t) = e^{tµ + t²σ²/2}.
Moments and moment generating functions provide tail bounds for random
variables that are more powerful than Chebychev’s inequality.
P ROPOSITION 6.3.4. For any random variable X, any t > 0, and any p > 0,
P(|X| ≥ t) ≤ E|X|^p/t^p.
Moreover, for any t ∈ R and θ ≥ 0,
P(X ≥ t) ≤ e^{−θt} mX(θ), P(X ≤ t) ≤ e^{θt} mX(−θ).

P ROOF. All of these inequalities are simple consequences of Markov’s in-


equality. For the first inequality, observe that |X| ≥ t if and only if |X|p ≥ tp
and apply Markov’s inequality. For the second and third, observe that X ≥ t if and
only if eθX ≥ eθt , and X ≤ t if and only if e−θX ≥ e−θt . 

Often, it is possible to get impressive tail bounds in a given problem by opti-


mizing over θ or p in the above result.
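For instance, for X ∼ N(0, σ²) one has mX(θ) = e^{θ²σ²/2} (Exercise 6.3.3 with µ = 0), and minimizing e^{−θt} mX(θ) over θ ≥ 0 gives θ = t/σ² and the bound e^{−t²/(2σ²)}. A numerical sketch of that optimization (the grid search and parameter values are illustrative choices):

```python
import math

sigma, t = 1.5, 3.0
mgf = lambda th: math.exp(th**2 * sigma**2 / 2)   # m_X(θ) for X ~ N(0, σ²)

# Minimize e^{-θt} m_X(θ) over a grid of θ in [0, 5); the optimum θ = t/σ² ≈ 1.33 lies inside.
best = min(math.exp(-th * t) * mgf(th) for th in [k * 0.001 for k in range(5000)])
closed_form = math.exp(-t**2 / (2 * sigma**2))    # value at the optimal θ
```

The grid minimum can never dip below the closed-form value, and it matches it to within the grid resolution.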
E XERCISE 6.3.5. If X ∼ N(0, σ²), use the above procedure to prove that for
any t ≥ 0, P(X ≥ t) and P(X ≤ −t) are bounded by e^{−t²/(2σ²)}.
The above bound is actually not optimal. The following exercise gives the
correct asymptotic behavior for the normal tail.
E XERCISE 6.3.6. If X ∼ N(0, 1), show that for any t > 0,
(1/t − 1/t³) e^{−t²/2}/√(2π) ≤ P(X ≥ t) ≤ e^{−t²/2}/(t√(2π)).
(Hint: For the upper bound, use the inequality ∫_t^∞ e^{−x²/2} dx ≤ ∫_t^∞ (x/t) e^{−x²/2} dx.
For the lower bound, write ∫_t^∞ e^{−x²/2} dx = ∫_t^∞ x^{−1} (x e^{−x²/2}) dx, use integration by
parts, and finally use a similar trick as above to bound the second term coming out
of the integration by parts.)
E XERCISE 6.3.7. If X is a nonnegative random variable, prove that for any
p > 0,
E(X^p) = ∫₀^∞ p t^{p−1} P(X ≥ t) dt.

6.4. Characteristic function


Another important function associated with a random variable X is its charac-
teristic function φX , defined as
φX (t) := E(eitX ),

where i = −1. Until now we have only dealt with expectations of random
variables, but this is not much different. The right side of the above expression
is defined simply as
E(e^{itX}) := E(cos tX) + i E(sin tX),
and the two expectations on the right always exist because cos and sin are bounded
functions. In fact, the expected value of any complex random variable can be de-
fined in the same manner, provided that the expected values of the real and imagi-
nary parts exist and are finite.
P ROPOSITION 6.4.1. If X and Y are integrable random variables and Z =
X + iY , and E(Z) is defined as E(X) + iE(Y ), then |E(Z)| ≤ E|Z|.

P ROOF. Let α := E(Z) and ᾱ denote the complex conjugate of α. It is easy


to check the linearity of expectation holds for complex random variables. Thus,
|E(Z)|² = ᾱ E(Z) = E(ᾱZ)
= E(ℜ(ᾱZ)) + i E(ℑ(ᾱZ)),

where ℜ(z) and ℑ(z) denote the real and imaginary parts of a complex number
z. Since |E(Z)|² is real, the above identity shows that |E(Z)|² = E(ℜ(ᾱZ)).
But ℜ(ᾱZ) ≤ |ᾱZ| = |α||Z|. Thus, |E(Z)|² ≤ |α| E|Z|, which completes the
proof. 
The above proposition shows in particular that |φX (t)| ≤ 1 for any random
variable X and any t.
E XERCISE 6.4.2. Show that the characteristic function can be written as a
power series in t if (6.3.1) holds.
E XERCISE 6.4.3. Show that the characteristic function of any random vari-
able is a uniformly continuous function. (Hint: Use the dominated convergence
theorem.)
Perhaps somewhat surprisingly, the characteristic function also gives a tail
bound. This bound is not very useful, but we will see at least one fundamental
application in a later chapter.
P ROPOSITION 6.4.4. Let X be a random variable with characteristic function
φX . Then for any t > 0,
P(|X| ≥ t) ≤ (t/2) ∫_{−2/t}^{2/t} (1 − φX(s)) ds.

P ROOF. Note that for any a > 0,
∫_{−a}^a (1 − φX(s)) ds = ∫₀^a (2 − φX(s) − φX(−s)) ds
= ∫₀^a E(2 − e^{isX} − e^{−isX}) ds
= 2 ∫₀^a E(1 − cos sX) ds.
By Fubini’s theorem (specifically, Exercise 6.1.6), expectation and integral can be
interchanged above, giving
∫_{−a}^a (1 − φX(s)) ds = 2a E(1 − (sin aX)/(aX)),
interpreting (sin x)/x = 1 when x = 0. Now notice that
1 − (sin aX)/(aX) ≥ 1/2 when |X| ≥ 2/a, and 1 − (sin aX)/(aX) ≥ 0 always.
Thus,
E(1 − (sin aX)/(aX)) ≥ (1/2) P(|X| ≥ 2/a).
Taking a = 2/t, this proves the claim. 

E XERCISE 6.4.5. Compute the characteristic functions of the Bernoulli, bino-


mial, geometric, Poisson, uniform, exponential and Gamma distributions.
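As one instance of this exercise: the Poisson(λ) characteristic function is exp(λ(e^{it} − 1)), which can be verified numerically against the defining sum. A sketch follows; the parameter values are arbitrary and the p.m.f. sum is truncated where the tail is negligible.

```python
import cmath
import math

lam, t = 2.0, 1.3
pmf = lambda k: math.exp(-lam) * lam**k / math.factorial(k)

# E(e^{itX}) computed directly from the (truncated) Poisson p.m.f.
phi_num = sum(cmath.exp(1j * t * k) * pmf(k) for k in range(100))
phi_closed = cmath.exp(lam * (cmath.exp(1j * t) - 1))   # closed-form Poisson ch.f.
```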

6.5. Characteristic function of the normal distribution


The characteristic function of a standard normal random variable will be useful
for us in the proof of the central limit theorem later. The calculation of this char-
acteristic function is not entirely trivial; the standard derivation involves contour
integration. The complete details are given below.
P ROPOSITION 6.5.1. If X ∼ N(0, 1), then φX(t) = e^{−t²/2}.

P ROOF. Note that


φX(t) = (1/√(2π)) ∫_{−∞}^∞ e^{itx} e^{−x²/2} dx
= (e^{−t²/2}/√(2π)) ∫_{−∞}^∞ e^{−(x−it)²/2} dx.
Take any R > 0. Let C be the contour in the complex plane that forms the bound-
ary of the box with vertices −R, R, R − it, −R − it. Since the map z ↦ e^{−z²/2} is
entire,
∮_C e^{−z²/2} dz = 0.
Let C1 be the part of C that lies on the real line, going from left to right. Let C2 be
the part that is parallel to C1 going from left to right. Let C3 and C4 be the vertical
parts, going from top to bottom. It is easy to see that as R → ∞,
∫_{C1} e^{−z²/2} dz → 0 and ∫_{C4} e^{−z²/2} dz → 0, where the first integral is over C3.
Thus, as R → ∞,
∫_{C1} e^{−z²/2} dz − ∫_{C2} e^{−z²/2} dz → 0.
Also, as R → ∞,
∫_{C1} e^{−z²/2} dz → ∫_{−∞}^∞ e^{−x²/2} dx = √(2π),
and
∫_{C2} e^{−z²/2} dz → ∫_{−∞}^∞ e^{−(x−it)²/2} dx.
This completes the proof. 
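The identity of Proposition 6.5.1 can also be confirmed by direct numerical integration (a sketch; the grid parameters are illustrative). By symmetry the imaginary part of E(e^{itX}) vanishes, so only ∫ cos(tx) e^{−x²/2} dx / √(2π) needs to be computed.

```python
import math

def phi_numeric(t, lo=-12.0, hi=12.0, n=200000):
    """Re E(e^{itX}) for X ~ N(0,1) by midpoint quadrature; the imaginary part
    vanishes by symmetry, and the tail beyond |x| = 12 is negligible."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        total += math.cos(t * x) * math.exp(-x * x / 2)
    return total * h / math.sqrt(2 * math.pi)

t = 1.7
approx = phi_numeric(t)
exact = math.exp(-t**2 / 2)   # the value given by Proposition 6.5.1
```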
As a final remark, we note that by Exercise 5.3.2, the expectation, variance,
moments, moment generating function, and characteristic function of a random
variable are all determined by its law. That is, if two random variables have the
same law, then the above functionals are also the same for the two.
CHAPTER 7

Independence

A central idea of probability theory, which distinguishes it from measure the-


ory, is the notion of independence. The reader may be already familiar with inde-
pendent random variables from undergraduate probability classes. In this chapter
we will bring the concept of independence into the measure-theoretic framework
and derive some important consequences.

7.1. Definition
D EFINITION 7.1.1. Let (Ω, F, P) be a probability space, and suppose that
G1 , . . . , Gn are sub-σ-algebras of F. We say that G1 , . . . , Gn are independent if
for any A1 ∈ G1 , A2 ∈ G2 , . . . , An ∈ Gn ,
P(A1 ∩ · · · ∩ An) = ∏_{i=1}^n P(Ai).

More generally, an arbitrary collection {Gi }i∈I of sub-σ-algebras are called inde-
pendent if any finitely many of them are independent.
The independence of σ-algebras is used to define the independence of random
variables and events.
D EFINITION 7.1.2. A collection of events {Ai }i∈I in F are said to be indepen-
dent if the σ-algebras Gi := {∅, Ai , Aci , Ω} generated by the Ai ’s are independent.
Moreover, an event A is said to be independent of a σ-algebra G if the σ-algebras
{∅, A, Ac , Ω} and G are independent.
E XERCISE 7.1.3. Show that a collection of events {Ai }i∈I are independent if
and only if for any finite J ⊆ I,
P(∩_{j∈J} Aj) = ∏_{j∈J} P(Aj).

D EFINITION 7.1.4. A collection of random variables {Xi }i∈I defined on Ω


are said to be independent if {σ(Xi )}i∈I are independent sub-σ-algebras of F.
Moreover, a random variable X is said to be independent of a σ-algebra G if the
σ-algebras σ(X) and G are independent.
A particularly important definition in probability theory is the notion of an
independent and identically distributed (i.i.d.) sequence of random variables. A

sequence {Xi }∞i=1 is said to be i.i.d. if the Xi ’s are independent and all have the
same distribution.
We end this section with a sequence of important exercises about independent
random variables and events.
E XERCISE 7.1.5. Let {Xn }∞ n=1 be a sequence of random variables defined on
the same probability space. If Xn+1 is independent of σ(X1 , . . . , Xn ) for each n,
prove that the whole collection is independent.
E XERCISE 7.1.6. If X1 , . . . , Xn are independent random variables and f :
Rn → R is a measurable function, show that the law of f (X1 , . . . , Xn ) is deter-
mined by the laws of X1 , . . . , Xn .
E XERCISE 7.1.7. If {Xi }i∈I is a collection of independent random variables
and {Ai }i∈I is a collection of measurable subsets of R, show that the events {Xi ∈
Ai }, i ∈ I are independent.
E XERCISE 7.1.8. If {Fi }i∈I is a family of cumulative distribution functions,
show that there is a probability space (Ω, F, P) and independent random variables
{Xi }i∈I defined on Ω such that for each i, Fi is the c.d.f. of Xi . (Hint: Use product
spaces.)
(The above exercise allows us to define arbitrary families of independent ran-
dom variables on the same probability space. The usual convention in probability
theory is to always have a single probability space (Ω, F, P) in the background, on
which all random variables are defined. For convenience, this probability space is
usually assumed to be complete.)
E XERCISE 7.1.9. If A is a π-system of sets and B is an event such that B and
A are independent for every A ∈ A, show that B and σ(A) are independent.
E XERCISE 7.1.10. If {Xi }i∈I is a collection of independent random variables,
then show that for any disjoint subsets J, K ⊆ I, the σ-algebras generated by
{Xi }i∈J and {Xi }i∈K are independent. (Hint: Use Exercise 7.1.9.)
E XERCISE 7.1.11. Let {Xi }i∈I and {Yj }j∈J be two collections of random
variables defined on the same probability space. Suppose that for any finite F ⊆ I
and G ⊆ J, the collections {Xi }i∈F and {Yj }j∈G are independent. Then prove
that {Xi }i∈I and {Yj }j∈J are independent. (Hint: Use Exercise 7.1.9.)

E XERCISE 7.1.12. If X1 , X2 , . . . is a sequence of independent Ber(p) random


variables where p ∈ (0, 1), and T := min{k : Xk = 1}, give a complete measure
theoretic proof of the fact that T ∼ Geo(p).
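A simulation companion to this exercise (a sketch, not the measure-theoretic proof asked for; p = 0.3 is an arbitrary choice): generate i.i.d. Ber(p) trials, record T = min{k : Xk = 1}, and compare the empirical frequencies with the Geo(p) p.m.f.

```python
import random

random.seed(3)
p = 0.3

def waiting_time():
    """T = min{k : X_k = 1} for i.i.d. Ber(p) trials."""
    k = 1
    while random.random() >= p:   # failure with probability 1 - p
        k += 1
    return k

n = 100000
counts = {}
for _ in range(n):
    t = waiting_time()
    counts[t] = counts.get(t, 0) + 1

# Compare empirical frequencies of {T = k} with the Geo(p) p.m.f. (1-p)^{k-1} p.
emp1 = counts.get(1, 0) / n   # should be near p = 0.3
emp2 = counts.get(2, 0) / n   # should be near (1-p)p = 0.21
```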

7.2. Expectation of a product under independence


A very important property of independent random variables is the following.
Together with the above exercise, it gives a powerful computational tool for prob-
abilistic models.

P ROPOSITION 7.2.1. If X and Y are independent random variables such that


X and Y are integrable, then the product XY is also integrable and E(XY ) =
E(X)E(Y ). The identity also holds if X and Y are nonnegative but not necessarily
integrable.

P ROOF. First, suppose that X = ∑_{i=1}^k ai 1_{Ai} and Y = ∑_{j=1}^m bj 1_{Bj} for some
nonnegative ai ’s and bj ’s, and measurable Ai ’s and Bj ’s. Without loss of gen-
erality, assume that all the ai ’s are distinct and all the bj ’s are distinct. Then each
Ai ∈ σ(X) and each Bj ∈ σ(Y ), because Ai = X −1 ({ai }) and Bj = Y −1 ({bj }).
Thus, for each i and j, Ai and Bj are independent. This gives
E(XY) = ∑_{i=1}^k ∑_{j=1}^m ai bj E(1_{Ai} 1_{Bj}) = ∑_{i=1}^k ∑_{j=1}^m ai bj P(Ai ∩ Bj)
= ∑_{i=1}^k ∑_{j=1}^m ai bj P(Ai) P(Bj) = E(X) E(Y).

Next take any nonnegative X and Y that are independent. Construct nonnegative
simple random variables Xn and Yn increasing to X and Y , using the method from
the proof of Proposition 2.3.6. From the construction, it is easy to see that each Xn
is σ(X)-measurable and each Yn is σ(Y )-measurable. Therefore Xn and Yn are
independent, and hence E(Xn Yn ) = E(Xn )E(Yn ), since we have already proved
this identity for nonnegative simple random variables. Now note that since Xn ↑ X
and Yn ↑ Y , we have Xn Yn ↑ XY . Therefore by the monotone convergence
theorem,
E(XY) = lim_{n→∞} E(Xn Yn) = lim_{n→∞} E(Xn) E(Yn) = E(X) E(Y).

Finally, take any independent X and Y. It is easy to see that X⁺ and X⁻ are
σ(X)-measurable, and Y⁺ and Y⁻ are σ(Y)-measurable. Therefore
E|XY| = E((X⁺ + X⁻)(Y⁺ + Y⁻))
= E(X⁺Y⁺) + E(X⁺Y⁻) + E(X⁻Y⁺) + E(X⁻Y⁻)
= E(X⁺)E(Y⁺) + E(X⁺)E(Y⁻) + E(X⁻)E(Y⁺) + E(X⁻)E(Y⁻)
= E|X|E|Y|.
Since E|X| and E|Y | are both finite, this shows that XY is integrable. Repeating
the steps in the above display starting with E(XY ) instead of E|XY |, we get
E(XY ) = E(X)E(Y ). 
The following exercise generalizes the above proposition. It is provable easily
by induction, starting with the case n = 2.
E XERCISE 7.2.2. If X1 , X2 , . . . , Xn are independent integrable random vari-
ables, show that the product X1 X2 · · · Xn is also integrable and
E(X1 X2 · · · Xn ) = E(X1 )E(X2 ) · · · E(Xn ).

Moreover, show that the identity also holds if the Xi ’s are nonnegative but not
necessarily integrable.
The following are some important consequences of the above exercise.
E XERCISE 7.2.3. If X and Y are independent integrable random variables,
show that Cov(X, Y ) = 0. In other words, independent random variables are
uncorrelated.
E XERCISE 7.2.4. Give an example of a pair of uncorrelated random variables
that are not independent.
A collection of random variables {Xi }i∈I is called pairwise independent if for
any two distinct i, j ∈ I, Xi and Xj are independent.
E XERCISE 7.2.5. Give an example of three random variables X1 , X2 , X3 that
are pairwise independent but not independent.

7.3. The second Borel–Cantelli lemma


Let {An }n≥1 be a sequence of events in a probability space (Ω, F, P). The
event {An i.o.} denotes the set of all ω that belong to infinitely many of the An ’s.
Here ‘i.o.’ means ‘infinitely often’. In this language, the first Borel–Cantelli lemma says
that if Σ P(A_n) < ∞, then P(A_n i.o.) = 0. By Exercise 4.3.2, we know that the
converse of the lemma is not true. The second Borel–Cantelli lemma, stated below,
says that the converse is true if we additionally impose the condition that the events
are independent. Although not as useful as the first lemma, it has some uses.
THEOREM 7.3.1 (The second Borel–Cantelli lemma). If {A_n}_{n≥1} is a sequence of independent events such that Σ P(A_n) = ∞, then P(A_n i.o.) = 1.

PROOF. Let B denote the event {A_n i.o.}. Then B^c is the set of all ω that
belong to only finitely many of the A_n's. In set-theoretic notation,

B^c = ∪_{n=1}^∞ ∩_{k=n}^∞ A_k^c.
Therefore

P(B^c) = lim_{n→∞} P(∩_{k=n}^∞ A_k^c).

Take any 1 ≤ n ≤ m. Then by independence,

P(∩_{k=n}^∞ A_k^c) ≤ P(∩_{k=n}^m A_k^c) = ∏_{k=n}^m (1 − P(A_k)).

By the inequality 1 − x ≤ e^{−x}, which holds for all x ≥ 0, this gives

P(∩_{k=n}^∞ A_k^c) ≤ exp(−Σ_{k=n}^m P(A_k)).

Since this holds for any m ≥ n, we get

P(∩_{k=n}^∞ A_k^c) ≤ exp(−Σ_{k=n}^∞ P(A_k)) = 0,

where the last equality holds because Σ_{k=1}^∞ P(A_k) = ∞. This shows that P(B^c) = 0 and completes the proof of the theorem. □
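The key estimate ∏(1 − P(A_k)) ≤ exp(−Σ P(A_k)) is easy to see numerically. A small sketch with the illustrative choice P(A_k) = 1/k, for which Σ P(A_k) diverges and the product telescopes:

```python
import math

def prob_none(n, m, p):
    """P(∩_{k=n}^m A_k^c) = ∏_{k=n}^m (1 - p(k)) for independent A_k."""
    out = 1.0
    for k in range(n, m + 1):
        out *= 1.0 - p(k)
    return out

p = lambda k: 1.0 / k  # Σ p(k) = ∞, so the products must tend to 0

for m in (10, 100, 1000):
    bound = math.exp(-sum(p(k) for k in range(2, m + 1)))
    assert prob_none(2, m, p) <= bound  # the 1 - x ≤ e^{-x} estimate

# Here the product even telescopes: ∏_{k=2}^m (1 - 1/k) = 1/m.
assert abs(prob_none(2, 1000, p) - 1e-3) < 1e-12
```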
E XERCISE 7.3.2. Let X1 , X2 , . . . be a sequence of i.i.d. random variables such
that E|X1 | = ∞. Prove that P(|Xn | ≥ n i.o.) = 1.
E XERCISE 7.3.3. Let X1 , X2 , . . . be an i.i.d. sequence of integer-valued ran-
dom variables. Take any m and any sequence of integers k1 , k2 , . . . , km such that
P(X1 = ki ) > 0 for each i. Prove that with probability 1, there are infinitely many
occurrences of the sequence k1 , . . . , km in a realization of X1 , X2 , . . ..

7.4. The Kolmogorov zero-one law


Let X1 , X2 , . . . be a sequence of random variables defined on the same proba-
bility space. The tail σ-algebra generated by this family is defined as

T(X_1, X_2, . . .) := ∩_{n=1}^∞ σ(X_n, X_{n+1}, . . .).

The following result is often useful for proving that some random variable is actu-
ally a constant.
THEOREM 7.4.1 (Kolmogorov's zero-one law). If {X_n}_{n≥1} is a sequence of independent random variables defined on a probability space (Ω, F, P), and T is the tail σ-algebra of this sequence, then for any A ∈ T, P(A) is either 0 or 1.

P ROOF. Take any n. Since A ∈ σ(Xn+1 , Xn+2 , . . .) and the Xi ’s are indepen-
dent, it follows by Exercise 7.1.10 that the event A is independent of the σ-algebra
σ(X1 , . . . , Xn ). Let

A := ∪_{n=1}^∞ σ(X_1, . . . , X_n).
It is easy to see that A is an algebra, and σ(A) = σ(X1 , X2 , . . .). From the first
paragraph we know that A is independent of B for every B ∈ A. Therefore by
Exercise 7.1.9, A is independent of σ(X1 , X2 , . . .). In particular, A is independent
of itself, which implies that P(A) = P(A ∩ A) = P(A)2 . This proves that P(A) is
either 0 or 1. 

Another way to state Kolmogorov’s zero-one law is to say that the tail σ-
algebra of a sequence of independent random variables is ‘trivial’, in the sense
that any event in it has probability 0 or 1. Trivial σ-algebras have the following
useful property.
E XERCISE 7.4.2. If G is a trivial σ-algebra for a probability measure P, show
that any random variable X that is measurable with respect to G must be equal to
a constant almost surely.
In particular, if {X_n}_{n≥1} is a sequence of independent random variables and
X is a random variable that is measurable with respect to the tail σ-algebra of this
sequence, then show that there is some constant c such that X = c a.e. Some
important consequences are the following.
EXERCISE 7.4.3. If {X_n}_{n≥1} is a sequence of independent random variables,
show that the random variables lim sup Xn and lim inf Xn are equal to constants
almost surely.

EXERCISE 7.4.4. Let {X_n}_{n≥1} be a sequence of independent random variables. Let S_n := X_1 + · · · + X_n, and let {a_n}_{n≥1} be a sequence of constants increasing to infinity. Then show that lim sup S_n/a_n and lim inf S_n/a_n are constants almost surely.

7.5. Zero-one laws for i.i.d. random variables


For i.i.d. random variables, one can prove a zero-one law that is quite a bit
stronger than Kolmogorov’s zero-one law.
T HEOREM 7.5.1. Let X = {Xi }i∈I be a countable collection of i.i.d. random
variables. Suppose that f : RI → R is a measurable function with the following
property. There is a collection Γ of one-to-one maps from I into I, such that
(i) f(ω^γ) = f(ω) for each γ ∈ Γ and ω ∈ R^I, where (ω^γ)_i := ω_{γ(i)}, and
(ii) for any finite J ⊆ I, there is some γ ∈ Γ such that γ(J) ∩ J = ∅.
Then the random variable f (X) equals a constant almost surely.

PROOF. First, assume that f is the indicator function of a measurable set E ⊆ R^I. Let A be the event that X ∈ E. Take any ε > 0. Let {i_1, i_2, . . .} be an enumeration of I. By Exercise 5.1.3, there is some n and some B ∈ σ(X_{i_1}, . . . , X_{i_n}) such that P(A∆B) < ε, where P is the probability measure on the space where X is defined. By Exercise 5.1.5, there is some measurable function g : R^n → R such that 1_B = g(X_{i_1}, . . . , X_{i_n}). Then P(A∆B) < ε is equivalent to saying that E|1_A − 1_B| < ε. Using condition (ii), find γ ∈ Γ such that {γ(i_1), . . . , γ(i_n)} does not intersect {i_1, . . . , i_n}. Let Y := X^γ, Z := f(Y) and W := g(Y_{i_1}, . . . , Y_{i_n}). Since the X_i's are i.i.d. and γ is an injection, Y has the same law as X. Therefore

E|Z − W| = E|f(X) − g(X_{i_1}, . . . , X_{i_n})| = E|1_A − 1_B| < ε.

But by condition (i), f(X) = f(Y). Therefore

E|1_B − W| ≤ E|1_B − 1_A| + E|Z − W| < 2ε.

On the other hand, notice that 1_B and W are independent and identically distributed
random variables. Moreover, both are {0, 1}-valued. Therefore

2ε > E|1_B − W| ≥ E(1_B − W)^2 = E(1_B + W − 2·1_B W)
   = 2P(B) − 2P(B)E(W) = 2P(B)(1 − P(B)).

Also,

|P(A)(1 − P(A)) − P(B)(1 − P(B))| ≤ |P(A) − P(B)| ≤ E|1_A − 1_B| < ε,

since the derivative of the map x ↦ x(1 − x) is bounded by 1 in [0, 1]. Combining,
we get P(A)(1 − P(A)) < 2ε. Since this is true for any ε > 0, we must have that
P(A) = 0 or 1. In particular, f(X) is a constant.
Now take an arbitrary measurable f with the given properties. Take any t ∈ R
and let E be the set of all ω ∈ RI such that f (ω) ≤ t. Then the function 1E
also satisfies the hypotheses of the theorem, and so we can apply the first part to
conclude that 1E (X) is a constant. Since this holds for any t, we may conclude
that f (X) is a constant. 
The following result is a useful consequence of Theorem 7.5.1.
C OROLLARY 7.5.2 (Zero-one law for translation-invariant events). Let Zd be
the d-dimensional integer lattice, for some d ≥ 1. Let E be the set of nearest-
neighbor edges of this lattice, and let {Xe }e∈E be a collection of i.i.d. random
variables. Let Γ be the set of all translations of E. Then any measurable function
f of {Xe }e∈E which is translation-invariant, in the sense that f ({Xe }e∈E ) =
f ({Xγ(e) }e∈E ) for any γ ∈ Γ, must be equal to a constant almost surely. The
same result holds if edges are replaced by vertices.

P ROOF. Clearly Γ satisfies property (ii) in Theorem 7.5.1 for any chosen enu-
meration of the edges. Property (i) is satisfied by the hypothesis of the corol-
lary. 
The following exercise demonstrates an important application of Corollary
7.5.2.
E XERCISE 7.5.3. Let {Xe }e∈E be a collection of i.i.d. Ber(p) random vari-
ables, for some p ∈ [0, 1]. Define a random subgraph of Zd by keeping an edge e
if Xe = 1 and deleting it if Xe = 0. Let N be the number of infinite connected
components of this random graph. (These are known as infinite percolation clus-
ters.) Exercise 3.3.2 shows that N is a random variable. Using Theorem 7.5.1 and
the above discussion, prove that N equals a constant in {0, 1, 2, . . .} ∪ {∞} almost
surely.
Another consequence of Theorem 7.5.1 is the following result.

COROLLARY 7.5.4 (Hewitt–Savage zero-one law). Let X_1, X_2, . . . be a sequence of i.i.d. random variables. Let f be a measurable function of this sequence
that has the property that f (X1 , X2 , . . .) = f (Xσ(1) , Xσ(2) , . . .) for any σ that
permutes finitely many of the indices. Then f (X1 , X2 , . . .) must be equal to a
constant almost surely.

P ROOF. Here Γ is the set of all permutations of the positive integers that fix
all but finitely many indices. Then clearly Γ satisfies the hypotheses of Theorem
7.5.1. 
A typical application of the Hewitt–Savage zero-one law is the following.
E XERCISE 7.5.5. Suppose that we have m boxes, and an infinite sequence of
balls are dropped into the boxes independently and uniformly at random. Set this
up as a problem in measure-theoretic probability, and prove that with probability
one, each box has the maximum number of balls among all boxes infinitely often.

7.6. Random vectors


A random vector is simply an R^n-valued random variable for some positive
integer n. In this way, it generalizes the notion of a random variable to multi-
ple dimensions. The cumulative distribution function of a random vector X =
(X1 , . . . , Xn ) is defined as
FX (t1 , . . . , tn ) = P(X1 ≤ t1 , X2 ≤ t2 , . . . , Xn ≤ tn ),
where P denotes the probability measure on the probability space on which X is
defined. The probability density function of X, if it exists, is a measurable function
f : Rn → [0, ∞) such that for any A ∈ B(Rn ),
P(X ∈ A) = ∫_A f(x_1, . . . , x_n) dx_1 · · · dx_n,
where dx1 · · · dxn denotes integration with respect to Lebesgue measure on Rn .
The characteristic function is defined as
φX (t1 , . . . , tn ) := E(ei(t1 X1 +···+tn Xn ) ).
The law µ_X of X is the probability measure on R^n induced by X, that is, µ_X(A) :=
P(X ∈ A). The following exercises are basic.
E XERCISE 7.6.1. Prove that the c.d.f. of a random vector uniquely character-
izes its law (that is, a multivariate analogue of Proposition 5.3.1).
E XERCISE 7.6.2. If X1 , . . . , Xn are independent random variables with laws
µ1 , . . . , µn , then show that the law of the random vector (X1 , . . . , Xn ) is the prod-
uct measure µ1 × · · · × µn .
E XERCISE 7.6.3. If X1 , . . . , Xn are independent random variables, show that
the cumulative distribution function and the characteristic function of the random
vector (X1 , . . . , Xn ) can be written as products of one-dimensional distribution
functions and characteristic functions.

E XERCISE 7.6.4. If X1 , . . . , Xn are independent random variables, and each


has a probability density function, show that (X1 , . . . , Xn ) also has a p.d.f. and it
is given by a product formula.
E XERCISE 7.6.5. Let X be an Rn -valued random vector and U ⊆ Rn be an
open set such that P(X ∈ U ) = 1. Suppose that the c.d.f. F of X is n times
differentiable in U . Let
f(x_1, . . . , x_n) := ∂^n F/(∂x_1 · · · ∂x_n)
for (x1 , . . . , xn ) ∈ U , and let f ≡ 0 outside U . Prove that f is the p.d.f. of X.
(Hint: First show that for any rectangle R ⊆ U , P(X ∈ R) equals the integral of f
over R. Then extend this to any Borel subset of U using the π-λ theorem.)
The mean vector of a random vector X = (X1 , . . . , Xn ) is the vector µ =
(µ1 , . . . , µn ), where µi = E(Xi ), assuming that the means exist. The covariance
matrix of a random vector X = (X1 , . . . , Xn ) is the n × n matrix Σ = (σij )ni,j=1 ,
where σij = Cov(Xi , Xj ), provided that these covariances exist.
E XERCISE 7.6.6. Prove that the covariance matrix of any random vector is a
positive semi-definite matrix.
An important kind of random vector is the multivariate normal (or Gaussian)
random vector. Given µ ∈ Rn and a strictly positive definite matrix Σ of order n,
the multivariate normal distribution with mean µ and covariance matrix Σ is the
probability measure on R^n with probability density function

f(x) = (2π)^{−n/2} (det Σ)^{−1/2} exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ)).
If X has this distribution, we write X ∼ Nn (µ, Σ).
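For diagonal Σ the density factors into a product of one-dimensional normal densities, which gives a quick sanity check on the formula. A sketch in pure Python (the point x, mean µ and variances below are arbitrary choices):

```python
import math

def mvn_pdf_diag(x, mu, var):
    """N_n(mu, Sigma) density when Sigma = diag(var): the quadratic form
    (x - mu)^T Sigma^{-1} (x - mu) reduces to a sum of squares."""
    n = len(x)
    det = math.prod(var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (n / 2) * math.sqrt(det))

def normal_pdf(x, mu, var):
    # one-dimensional N(mu, var) density
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

x, mu, var = [0.3, -1.2], [0.0, 1.0], [2.0, 0.5]
lhs = mvn_pdf_diag(x, mu, var)
rhs = normal_pdf(x[0], mu[0], var[0]) * normal_pdf(x[1], mu[1], var[1])
assert abs(lhs - rhs) < 1e-12  # joint density = product of marginals
```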
E XERCISE 7.6.7. Show that the above formula indeed describes a p.d.f. of a
probability law on Rn , and that this law has mean vector µ and covariance ma-
trix Σ.
E XERCISE 7.6.8. Let X ∼ Nn (µ, Σ), and let m be a positive integer ≤ n.
Show that for any a ∈ Rm and any m × n matrix A of full rank, AX + a ∼
Nm (a + Aµ, AΣAT ).

7.7. Convolutions
Given two probability measures µ1 and µ2 on R, their convolution µ1 ∗ µ2 is
defined to be the push-forward of µ1 × µ2 under the addition map from R2 to R.
That is, if φ(x, y) := x + y, then for any A ∈ B(R),
µ1 ∗ µ2 (A) := µ1 × µ2 (φ−1 (A)).
Exercise 7.6.2 shows that in the language of random variables, the convolution of
two probability measures has the following description: If X and Y are indepen-
dent random variables with laws µ1 and µ2 , then µ1 ∗ µ2 is the law of X + Y .

P ROPOSITION 7.7.1. Let X and Y be independent random variables. Suppose


that Y has probability density function g. Then the sum Z := X+Y has probability
density function h(z) = Eg(z − X).

P ROOF. Let µ1 be the law of X and µ2 be the law of Y . Let µ := µ1 × µ2 .


Then the discussion preceding the statement of the proposition shows that for any
A ∈ B(R),
P(Z ∈ A) = µ_1 ∗ µ_2(A) = µ(φ^{−1}(A)) = ∫_{R^2} 1_{φ^{−1}(A)}(x, y) dµ(x, y).
By Fubini’s theorem, this integral equals
∫_R ∫_R 1_{φ^{−1}(A)}(x, y) dµ_2(y) dµ_1(x).
But for any x,
∫_R 1_{φ^{−1}(A)}(x, y) dµ_2(y) = ∫_R 1_A(x + y) dµ_2(y)
                               = ∫_R 1_A(x + y) g(y) dy
                               = ∫_R 1_A(z) g(z − x) dz = ∫_A g(z − x) dz,
where the last step follows by the translation invariance of Lebesgue measure (Ex-
ercise 2.4.6). Thus again by Fubini’s theorem,
P(Z ∈ A) = ∫_R ∫_A g(z − x) dz dµ_1(x)
         = ∫_A ∫_R g(z − x) dµ_1(x) dz
         = ∫_A E g(z − X) dz.
This proves the claim. 
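Proposition 7.7.1 can be illustrated with a concrete (hypothetical) choice: X uniform on {0, 1} and Y with the Unif[0, 1] density g, so that h(z) = E g(z − X) should be the density 1/2 on [0, 2]:

```python
def g(y):
    """Unif[0,1] density."""
    return 1.0 if 0.0 <= y < 1.0 else 0.0

def h(z):
    # h(z) = E g(z - X), expectation over the two equally likely values of X
    return 0.5 * g(z - 0) + 0.5 * g(z - 1)

assert h(0.25) == 0.5 and h(1.25) == 0.5 and h(2.5) == 0.0

# Riemann-sum check that h integrates to 1, as a density must
step = 1e-4
total = sum(h(k * step) * step for k in range(int(3 / step)))
assert abs(total - 1.0) < 1e-3
```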
E XERCISE 7.7.2. If X1 ∼ N (µ1 , σ12 ) and X2 ∼ N (µ2 , σ22 ) are independent,
prove that X1 + X2 ∼ N (µ1 + µ2 , σ12 + σ22 ).
E XERCISE 7.7.3. As a consequence of the above exercise, prove that any linear
combination of independent normal random variables is normal with the appropri-
ate mean and variance.
Often, we need to deal with n-fold convolutions rather than the convolution of
just two probability measures. The following exercises are two useful results about
n-fold convolutions.
EXERCISE 7.7.4. If X_1, X_2, . . . is a sequence of independent Ber(p) random variables, and S_n := Σ_{i=1}^n X_i, give a complete measure-theoretic proof of the fact that S_n ∼ Bin(n, p).

EXERCISE 7.7.5. Use induction on n and the above convolution formula to prove that if X_1, . . . , X_n are i.i.d. Exp(λ) random variables, then X_1 + · · · + X_n ∼ Gamma(n, λ).
EXERCISE 7.7.6. If X_1, . . . , X_n are independent random variables in L^2, show that Var(Σ X_i) = Σ Var(X_i).

EXERCISE 7.7.7. If X_1, X_2, . . . , X_n are independent random variables and S = Σ X_i, show that the moment generating function m_S and the characteristic function φ_S are given by the product formulas m_S(t) = ∏ m_{X_i}(t) and φ_S(t) = ∏ φ_{X_i}(t).
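The product formula for characteristic functions is easy to verify numerically for simple laws, with the law of the sum computed as the push-forward of the product measure (the two laws and the point t below are arbitrary illustrative choices):

```python
import cmath
from itertools import product

X = [(0, 0.5), (1, 0.5)]     # Ber(1/2), as (value, probability) pairs
Y = [(1, 0.25), (3, 0.75)]   # another simple law

def cf(law, t):
    """Characteristic function E e^{itX} of a discrete law."""
    return sum(p * cmath.exp(1j * t * x) for x, p in law)

# Law of S = X + Y under independence: push-forward of the product measure
conv = {}
for (x, p), (y, q) in product(X, Y):
    conv[x + y] = conv.get(x + y, 0.0) + p * q

t = 0.7
assert abs(cf(list(conv.items()), t) - cf(X, t) * cf(Y, t)) < 1e-12
```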
CHAPTER 8

Convergence of random variables

This chapter discusses various notions of convergence of random variables,


laws of large numbers, and central limit theorems.

8.1. Four notions of convergence


Random variables can converge to limits in various ways. Four prominent
definitions are the following.
D EFINITION 8.1.1. A sequence of random variables {Xn }n≥1 is said to con-
verge to a random variable X almost surely if all of these random variables are
defined on the same probability space, and lim Xn = X a.e. This is often written
as Xn → X a.s.
D EFINITION 8.1.2. A sequence of random variables {Xn }n≥1 is said to con-
verge in probability to a random variable X if all of these random variables are
defined on the same probability space, and for each ε > 0,

lim_{n→∞} P(|X_n − X| ≥ ε) = 0.
This is usually written as X_n →p X or X_n → X in probability. If X is a constant,
then {Xn }n≥1 need not be all defined on the same probability space.
DEFINITION 8.1.3. For p ∈ [1, ∞], a sequence of random variables {X_n}_{n≥1} is said to converge in L^p to a random variable X if all of these random variables are defined on the same probability space, and

lim_{n→∞} ‖X_n − X‖_{L^p} = 0.
This is usually written as X_n →^{L^p} X or X_n → X in L^p. If X is a constant, then
{Xn }n≥1 need not be all defined on the same probability space.
D EFINITION 8.1.4. For each n, let Xn be a random variable with cumulative
distribution function F_{X_n}. Let X be a random variable with c.d.f. F_X. Then X_n is
said to converge in distribution to X if, for any t where F_X is continuous,

lim_{n→∞} F_{X_n}(t) = F_X(t).
This is usually written as X_n →d X, or X_n →D X, or X_n ⇒ X, or X_n ⇀ X, or X_n → X in distribution. Convergence in distribution is sometimes called convergence in law or weak convergence.
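A standard example separating these notions (on the probability space [0, 1] with Lebesgue measure) is X_n = n·1_{[0,1/n]}: it converges to 0 almost surely and in probability, but not in L^1. A small sketch of the relevant computations, using exact rational arithmetic:

```python
from fractions import Fraction

# X_n = n * 1_{[0, 1/n]} on ([0, 1], Lebesgue): a classic example.
def prob_exceeds(n):
    """P(|X_n| >= eps) for any 0 < eps <= n is the measure of [0, 1/n]."""
    return Fraction(1, n)

def l1_norm(n):
    """E|X_n| = n * P([0, 1/n])."""
    return n * Fraction(1, n)

# P(|X_n| >= eps) -> 0: convergence in probability (and a.s., since the
# sets [0, 1/n] decrease to {0}), yet the L^1 norms do not vanish.
assert prob_exceeds(10**6) == Fraction(1, 10**6)
assert all(l1_norm(n) == 1 for n in (1, 10, 100, 1000))
```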


8.2. Interrelations between the four notions


The four notions of convergence defined above are inter-related in interesting
ways.
P ROPOSITION 8.2.1. Almost sure convergence implies convergence in proba-
bility.

PROOF. Let X_n be a sequence converging a.s. to X. Take any ε > 0. Since X_n → X a.s.,

1 = P(∪_{n=1}^∞ ∩_{k=n}^∞ {|X_k − X| ≤ ε})
  = lim_{n→∞} P(∩_{k=n}^∞ {|X_k − X| ≤ ε})
  ≤ lim_{n→∞} P(|X_n − X| ≤ ε),

which proves the claim. □
E XERCISE 8.2.2. Give a counterexample to show that convergence in proba-
bility does not imply almost sure convergence.
Although convergence in probability does not imply almost sure convergence,
it does imply that there is a subsequence that converges almost surely.
P ROPOSITION 8.2.3. If {Xn }n≥1 is a sequence of random variables converg-
ing in probability to a limit X, then there is a subsequence {Xnk }k≥1 converging
almost surely to X.

P ROOF. Since Xn → X in probability, it is not hard to see that there is a


subsequence {Xnk }k≥1 such that for each k, P(|Xnk − Xnk+1 | > 2−k ) ≤ 2−k .
Therefore by the Borel–Cantelli lemma, P(|Xnk − Xnk+1 | > 2−k i.o.) = 0. How-
ever, if |Xnk (ω) − Xnk+1 (ω)| > 2−k happens only finitely many times for some
ω, then {Xnk (ω)}k≥1 is a Cauchy sequence. Let Y (ω) denote the limit. On the
set where this does not happen, define Y = 0. Then Y is a random variable, and
X_{n_k} → Y a.s. Then by Proposition 8.2.1, X_{n_k} → Y in probability. But, for any ε > 0 and any k,

P(|X − Y| ≥ ε) ≤ P(|X − X_{n_k}| ≥ ε/2) + P(|Y − X_{n_k}| ≥ ε/2).

Sending k → ∞, this shows that P(|X − Y| ≥ ε) = 0. Since this holds for every ε > 0, we get X = Y a.s. This completes the proof. □
Next, let us connect convergence in probability and convergence in distribu-
tion.
P ROPOSITION 8.2.4. Convergence in probability implies convergence in dis-
tribution.

P ROOF. Let Xn be a sequence of random variables converging in probability


to a random variable X. Let t be a continuity point of F_X. Take any ε > 0. Then

F_{X_n}(t) = P(X_n ≤ t) ≤ P(X ≤ t + ε) + P(|X_n − X| > ε),

which proves that lim sup F_{X_n}(t) ≤ F_X(t + ε). Since this is true for any ε > 0 and F_X is right-continuous, this gives lim sup F_{X_n}(t) ≤ F_X(t). A similar argument, using the continuity of F_X at t, gives lim inf F_{X_n}(t) ≥ F_X(t). □
E XERCISE 8.2.5. Show that the above proposition is not valid if we demanded
that FXn (t) → FX (t) for all t, instead of just the continuity points of FX .
E XERCISE 8.2.6. If Xn → c in distribution, where c is a constant, show that
Xn → c in probability.
The following result combines weak convergence and convergence in probability in a way that is useful for many purposes.
PROPOSITION 8.2.7 (Slutsky's theorem). If X_n →d X and Y_n →p c, where c is a constant, then X_n + Y_n →d X + c and X_n Y_n →d cX.

PROOF. Let F be the c.d.f. of X + c. Let t be a continuity point of F. For any ε > 0,

P(X_n + Y_n ≤ t) ≤ P(X_n + c ≤ t + ε) + P(Y_n − c < −ε).

If t + ε is also a continuity point of F, this shows that

lim sup_{n→∞} P(X_n + Y_n ≤ t) ≤ F(t + ε).

By Exercise 5.2.3 and the right-continuity of F, this allows us to conclude that

lim sup_{n→∞} P(X_n + Y_n ≤ t) ≤ F(t).
Next, take any ε > 0 such that t − ε is a continuity point of F. Since

P(X_n + c ≤ t − ε) ≤ P(X_n + Y_n ≤ t) + P(Y_n − c > ε),

we get

lim inf_{n→∞} P(X_n + Y_n ≤ t) ≥ F(t − ε).

By Exercise 5.2.3 and the continuity of F at t, this gives

lim inf_{n→∞} P(X_n + Y_n ≤ t) ≥ F(t).

Thus, lim_{n→∞} P(X_n + Y_n ≤ t) = F(t) for every continuity point t of F, and hence X_n + Y_n →d X + c. The proof of X_n Y_n →d cX is similar, with a slight difference in the case c = 0. □
Finally, let us look at the relation between Lp convergence and convergence in
probability.

P ROPOSITION 8.2.8. For any p > 0, convergence in Lp implies convergence


in probability.

PROOF. Suppose that X_n → X in L^p. Take any ε > 0. By Markov's inequality,

P(|X_n − X| > ε) = P(|X_n − X|^p > ε^p) ≤ E|X_n − X|^p / ε^p,

which proves the claim. □
The converse of the above proposition holds under an additional assumption.
P ROPOSITION 8.2.9. If Xn → X in probability and there is some constant c
such that |Xn | ≤ c a.s. for each n, then Xn → X in Lp for any p ∈ [1, ∞).

PROOF. It is easy to show from the given condition that |X| ≤ c a.s. Take any ε > 0. Then

E|X_n − X|^p ≤ E(|X_n − X|^p; |X_n − X| > ε) + ε^p ≤ (2c)^p P(|X_n − X| > ε) + ε^p.

Sending n → ∞, we get lim sup E|X_n − X|^p ≤ ε^p. Since ε is arbitrary, this completes the proof. □
Interestingly, there is no direct connection between convergence in Lp and
almost sure convergence.
E XERCISE 8.2.10. Take any p > 0. Give counterexamples to show that almost
sure convergence does not imply Lp convergence, and Lp convergence does not
imply almost sure convergence.

8.3. Uniform integrability


Under a certain condition known as uniform integrability, almost sure con-
vergence implies L1 convergence. We say that a sequence of random variables
{X_n}_{n≥1} is uniformly integrable if for any ε > 0, there is some K > 0 such that for all n,

E(|X_n|; |X_n| > K) ≤ ε.
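A family that fails this condition while keeping E|X_n| bounded is the (hypothetical) X_n taking value n with probability 1/n and 0 otherwise: E|X_n| = 1 for every n, yet the tail expectations do not become small uniformly in n:

```python
def tail_expectation(n, K):
    """E(|X_n|; |X_n| > K) for X_n = n w.p. 1/n and 0 otherwise.
    When n > K, the contribution is n * (1/n) = 1."""
    return 1.0 if n > K else 0.0

# For every K there are n with E(|X_n|; |X_n| > K) = 1, so no single K
# works for all n: the family is not uniformly integrable.
for K in (10, 100, 1000):
    assert max(tail_expectation(n, K) for n in range(1, 10 * K)) == 1.0
```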

P ROPOSITION 8.3.1. If a uniformly integrable sequence of random variables


{Xn }n≥1 converges almost surely to a limit random variable X as n → ∞, then
X is integrable and Xn → X in L1 .

PROOF. Take any ε > 0, and find K such that E(|X_n|; |X_n| > K) ≤ ε for all n. This implies, in particular, that

E|X_n| ≤ E(|X_n|; |X_n| > K) + E(|X_n|; |X_n| ≤ K) ≤ ε + K.

Therefore by Fatou's lemma, E|X| ≤ ε + K < ∞. So, by the dominated convergence theorem, lim_{L→∞} E(|X|; |X| > L) = 0. This implies that by increasing K if necessary, we can ensure that E(|X|; |X| > K) is also bounded by ε. Define a function

φ(x) := x if −K ≤ x ≤ K,   K if x > K,   −K if x < −K.

Note that φ is bounded and continuous. Therefore by the dominated convergence
theorem, E|φ(Xn ) − φ(X)| → 0. But
E|X_n − X| ≤ E|X_n − φ(X_n)| + E|φ(X_n) − φ(X)| + E|φ(X) − X|
           ≤ E(|X_n|; |X_n| > K) + E|φ(X_n) − φ(X)| + E(|X|; |X| > K)
           ≤ 2ε + E|φ(X_n) − φ(X)|.

Thus, lim sup_{n→∞} E|X_n − X| ≤ 2ε. Since this holds for any ε, we conclude that X_n → X in L^1. □
The following proposition gives a useful criterion for checking uniform inte-
grability.
P ROPOSITION 8.3.2. If supn≥1 E|Xn |p is finite for some p > 1, then the se-
quence {Xn }n≥1 is uniformly integrable.

P ROOF. Note that for any K,


E(|X_n|; |X_n| > K) ≤ K^{−(p−1)} E|X_n|^p.
Since E|Xn |p is uniformly bounded, the above bound can be made uniformly small
by choosing K large enough. 
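The inequality E(|X|; |X| > K) ≤ K^{−(p−1)} E|X|^p behind this proof can be checked directly on a small discrete law (the values, probabilities and p below are arbitrary choices):

```python
law = [(0.5, 0.4), (2.0, 0.3), (8.0, 0.3)]  # (value, probability) pairs
p = 2.0
moment_p = sum(abs(x) ** p * q for x, q in law)  # E|X|^p

for K in (1.0, 3.0, 5.0):
    tail = sum(abs(x) * q for x, q in law if abs(x) > K)
    # on {|X| > K}: |X| = |X|^p / |X|^{p-1} <= |X|^p / K^{p-1}
    assert tail <= moment_p / K ** (p - 1)
```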
Uniform integrability has the following equivalent formulation, which is some-
times useful.
PROPOSITION 8.3.3. A sequence {X_n}_{n≥1} is uniformly integrable if and only if sup_{n≥1} E|X_n| < ∞ and for all ε > 0, there is some δ > 0 such that for any event A with P(A) < δ, we have E(|X_n|; A) < ε for all n.

P ROOF. Suppose that {Xn }n≥1 is uniformly integrable. First, choose a > 0
such that
sup E(|Xn |; |Xn | > a) ≤ 1.
n≥1
Then for any n,
E|Xn | = E(|Xn |; |Xn | ≤ a) + E(|Xn |; |Xn | > a) ≤ a + 1,
which shows that supn≥1 E|Xn | < ∞. Next, for any a > 0, any event A, and
any n,
E(|Xn |; A) = E(|Xn |; A ∩ {|Xn | ≤ a}) + E(|Xn |; A ∩ {|Xn | > a})
≤ aP(A) + E(|Xn |; |Xn | > a).

By uniform integrability, the right side can be made uniformly small by choosing
a large enough and P(A) small enough.
Conversely, suppose that the alternative criterion holds. Then for any a > 0,
aP(|Xn | > a) ≤ E(|Xn |; |Xn | > a) ≤ sup E|Xn | < ∞.
n≥1

Thus, P(|X_n| > a) can be made uniformly smaller than any given δ by choosing a large enough. Applying the assumed condition with A = {|X_n| > a}, it follows that E(|X_n|; |X_n| > a) can be made uniformly small by choosing a large enough, which proves uniform integrability. □
An interesting corollary of the above result is the following. We will have
occasion to use it later.
COROLLARY 8.3.4. For any integrable random variable X and any ε > 0, there is some δ > 0 such that E(|X|; A) < ε whenever P(A) < δ.

P ROOF. By the dominated convergence theorem, E(|X|; |X| > K) → 0 as


K → ∞. Thus, the sequence X, X, X, . . . is uniformly integrable. Now apply
Proposition 8.3.3. 
The following proposition is also sometimes useful.
P ROPOSITION 8.3.5. If a sequence of random variables converges in L1 , then
it is uniformly integrable.

PROOF. Suppose that X_n → X in L^1. Take any ε > 0 and M > 0. By Corollary 8.3.4, there is some δ > 0 such that if P(A) < δ, then E(|X|; A) ≤ ε/4. Now, note that

E(|X_n|; |X_n| > M) − E(|X|; |X| > M/2)
  ≤ E(|X_n − X|; |X_n| > M) + E(|X|(1_{|X_n|>M} − 1_{|X|>M/2}))
  ≤ E|X_n − X| + E(|X|; |X_n − X| > M/2).

Since X_n → X in L^1, we can find n_0 large enough so that if n ≥ n_0, then E|X_n − X| ≤ ε/4 and P(|X_n − X| > M/2) < δ (the second assertion follows by Proposition 8.2.8). Thus, for n ≥ n_0,

E(|X_n|; |X_n| > M) ≤ E(|X|; |X| > M/2) + ε/2.

Now choose M so large that E(|X|; |X| > M/2) < ε/2, and also E(|X_n|; |X_n| > M) < ε for all n < n_0. Then for all n ≥ 1, E(|X_n|; |X_n| > M) ≤ ε. □

8.4. The weak law of large numbers


The weak law of large numbers is a fundamental result of probability theory.
Perhaps the best way to state the result is to state a quantitative version. It says
that the average of a finite collection of random variables is close to the average of
the expected values with high probability if the average of the covariances is small.
This allows wide applicability in a variety of problems.

T HEOREM 8.4.1 (Weak law of large numbers). Let X1 , . . . , Xn be L2 random


variables defined on the same probability space. Let µi := E(Xi ) and σij :=
Cov(X_i, X_j). Then for any ε > 0,

P(|(1/n) Σ_{i=1}^n X_i − (1/n) Σ_{i=1}^n µ_i| ≥ ε) ≤ (1/(ε^2 n^2)) Σ_{i,j=1}^n σ_{ij}.

P ROOF. Apply Chebychev’s inequality, together with the formula given by


Proposition 6.2.5 for the variance of a sum of random variables. 
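For i.i.d. random variables the covariance sum reduces to n·Var(X_1), and the bound is easy to compare against a simulation. A sketch (the sample size, ε and the Unif[0, 1] choice are arbitrary; this is an illustration, not a proof):

```python
import random

random.seed(0)
n, eps, trials = 1000, 0.05, 200
var = 1.0 / 12.0  # Var(Unif[0,1]); off-diagonal covariances vanish

bad = 0  # count deviations |average - mean| >= eps
for _ in range(trials):
    avg = sum(random.random() for _ in range(n)) / n
    if abs(avg - 0.5) >= eps:
        bad += 1

chebyshev = var / (eps ** 2 * n)  # the theorem's bound, here 1/30
assert bad / trials <= chebyshev + 0.05  # empirical frequency vs. bound
```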

An immediate corollary is the following theorem, which is traditionally known


as the L2 weak law of large numbers.
COROLLARY 8.4.2. If {X_n}_{n≥1} is a sequence of uncorrelated random variables with common mean µ and uniformly bounded second moments, then n^{-1} Σ_{i=1}^n X_i converges in probability to µ as n → ∞.
Actually, the above result holds true even if the second moment is not finite,
provided that the sequence is i.i.d. Since this is a simple consequence of the strong
law of large numbers that we will prove later, we will not worry about it here.
E XERCISE 8.4.3 (An occupancy problem). Let n balls be dropped uniformly
and independently at random into n boxes. Let Nn be the number of empty boxes.
Prove that Nn /n → e−1 in probability as n → ∞. (Hint: Write Nn as a sum of
indicator variables.)
E XERCISE 8.4.4 (Coupon collector’s problem). Suppose that there are n types
of coupons, and a collector wants to obtain at least one of each type. Each time
a coupon is bought, it is one of the n types with equal probability. Let Tn be the
number of trials needed to acquire all n types. Prove that Tn /(n log n) → 1 in
probability as n → ∞. (Hint: Let τk be the number of trials needed to acquire k
distinct types of coupons. Prove that τk − τk−1 are independent geometric random
variables with different means, and Tn is the sum of these variables.)
E XERCISE 8.4.5 (Erdős–Rényi random graphs). Define an undirected random
graph on n vertices by putting an edge between any two vertices with probability
p and excluding the edge with probability 1 − p, all edges independent. This is
known as the Erdős–Rényi G(n, p) random graph. First, formulate the model in the
measure theoretic framework using independent Bernoulli random variables. Next,
show that if Tn,p is the number of triangles in this random graph, then Tn,p /n3 →
p3 /6 in probability as n → ∞, if p remains fixed.

8.5. The strong law of large numbers


The strong law of large numbers is the almost sure version of the weak law.
The best version of the strong law was proved by Etemadi.

THEOREM 8.5.1 (Etemadi's strong law of large numbers). Let {X_n}_{n≥1} be a sequence of pairwise independent and identically distributed random variables, with E|X_1| < ∞. Then n^{-1} Σ_{i=1}^n X_i converges almost surely to E(X_1) as n tends to ∞.

P ROOF. Splitting each Xi into its positive and negative parts, we see that it
suffices to prove the theorem for nonnegative random variables. So assume that
the Xi ’s are nonnegative random variables.
The next step is to truncate the X_i's to produce random variables that are better behaved with respect to variance computations. Define Y_i := X_i 1_{X_i < i}. We claim that it suffices to show that n^{-1} Σ_{i=1}^n Y_i → µ a.s., where µ := E(X_1). To see why this suffices, notice that by Exercise 6.1.9 and the fact that the X_i's are identically distributed,

Σ_{i=1}^∞ P(X_i ≠ Y_i) ≤ Σ_{i=1}^∞ P(X_i ≥ i) = Σ_{i=1}^∞ P(X_1 ≥ i) ≤ E(X_1) < ∞.

Therefore by the first Borel–Cantelli lemma, P(X_i ≠ Y_i i.o.) = 0. Thus, it suffices to prove that n^{-1} Σ_{i=1}^n Y_i → µ a.s.
Next, note that µ − E(Y_i) = E(X_i; X_i ≥ i) → 0 as i → ∞ by the dominated convergence theorem. Therefore n^{-1} Σ_{i=1}^n E(Y_i) → µ as n → ∞. Thus, it suffices to prove that Z_n → 0 a.s., where

Z_n := (1/n) Σ_{i=1}^n (Y_i − E(Y_i)).

Take any α > 1. Let k_n := [α^n], where [x] denotes the integer part of a real number x. The penultimate step in Etemadi's proof is to show that for any choice of α > 1, Z_{k_n} → 0 a.s.
To show this, take any ε > 0. Recall that the X_i's are pairwise independent. Therefore so are the Y_i's, and hence Cov(Y_i, Y_j) = 0 for any i ≠ j. Thus for any n, by Theorem 8.4.1,

P(|Z_{k_n}| > ε) ≤ (1/(ε^2 k_n^2)) Σ_{i=1}^{k_n} Var(Y_i).

Therefore

Σ_{n=1}^∞ P(|Z_{k_n}| > ε) ≤ Σ_{n=1}^∞ (1/(ε^2 k_n^2)) Σ_{i=1}^{k_n} Var(Y_i)
                          = (1/ε^2) Σ_{i=1}^∞ Var(Y_i) Σ_{n : k_n ≥ i} 1/k_n^2.

It is easy to see that there is some β > 1, depending only on α, such that k_{n+1}/k_n ≥ β for all n large enough. Therefore for all i large enough,

Σ_{n : k_n ≥ i} 1/k_n^2 ≤ (1/i^2) Σ_{n=0}^∞ β^{−n} ≤ C/i^2,

where C depends only on α. Increasing C if necessary, the inequality can be made to hold for all i. Therefore by the monotone convergence theorem,
Σ_{n=1}^∞ P(|Z_{k_n}| > ε) ≤ (C/ε^2) Σ_{i=1}^∞ Var(Y_i)/i^2 ≤ (C/ε^2) Σ_{i=1}^∞ E(Y_i^2)/i^2
   = (C/ε^2) Σ_{i=1}^∞ E(X_i^2; X_i < i)/i^2 = (C/ε^2) Σ_{i=1}^∞ E(X_1^2; X_1 < i)/i^2
   ≤ (C/ε^2) E(X_1^2 Σ_{i > X_1} 1/i^2) ≤ (C′/ε^2) E(X_1),

where C 0 is some other constant depending only on α. By the first Borel–Cantelli


lemma, this shows that P(|Zkn | >  i.o.) = 0. Since this holds for any  > 0, it
follows that Zkn → 0 a.s. as n → ∞.
The final step of the proof is to deduce that Z_n → 0 a.s. For each n, let T_n := Σ_{i=1}^n Y_i. Take any m. If k_n < m ≤ k_{n+1}, then

(k_n/k_{n+1}) (T_{k_n}/k_n) = T_{k_n}/k_{n+1} ≤ T_m/m ≤ T_{k_{n+1}}/k_n = (T_{k_{n+1}}/k_{n+1}) (k_{n+1}/k_n).

Let m → ∞, so that k_n also tends to infinity. Since k_{n+1}/k_n → α and T_{k_n}/k_n → µ a.s., the above inequalities imply that

µ/α ≤ lim inf_{m→∞} T_m/m ≤ lim sup_{m→∞} T_m/m ≤ αµ a.s.
Since α > 1 is arbitrary, this completes the proof. 
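The strong law is easy to watch in simulation. The following sketch is illustrative only (the Exponential(1) sampler, seed, and sample size are arbitrary choices, not part of the notes); it averages i.i.d. draws with mean 1 and observes the average settling near E(X_1):

```python
import random

def sample_average(n, sampler):
    """Average of n i.i.d. draws from sampler."""
    return sum(sampler() for _ in range(n)) / n

rng = random.Random(310)
# Exponential(1) draws are nonnegative with E(X1) = 1, so the strong law
# says the running average should settle near 1 as n grows.
avg = sample_average(100_000, lambda: rng.expovariate(1.0))
print(avg)
```

Rerunning with other seeds illustrates the almost-sure nature of the convergence: every run's average drifts toward 1.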
E XERCISE 8.5.2. Using Exercise 7.3.2, show that if X1 , X2 , . . . is a sequence
of i.i.d. random variables such that E|X1 | = ∞, then
$$P\Big( \frac{1}{n} \sum_{i=1}^{n} X_i \text{ has a finite limit as } n \to \infty \Big) = 0.$$
E XERCISE 8.5.3. If X1 , X2 , . . . are i.i.d. random variables with E(X1 ) = ∞,
show that n^{-1} ∑_{i=1}^n X_i → ∞ a.s.
Although the strong law of large numbers is formulated for i.i.d. random vari-
ables, it can sometimes be applied even when the random variables are not inde-
pendent. A useful instance is the case of stationary m-dependent sequences.
D EFINITION 8.5.4. A sequence of random variables {X_n}_{n=1}^∞ is called sta-
tionary if for any n and m, the random vector (X_1 , . . . , X_n) has the same law as
(X_{m+1} , . . . , X_{m+n}).
80 8. CONVERGENCE OF RANDOM VARIABLES

D EFINITION 8.5.5. A sequence of random variables {X_n}_{n=1}^∞ is called m-
dependent if for any n, the collections {X_i}_{i=1}^n and {X_i}_{i=n+m+1}^∞ are independent.
Let {Y_i}_{i=1}^∞ be an i.i.d. sequence. An example of a sequence which is 1-
dependent and stationary is {Y_i Y_{i+1}}_{i=1}^∞. More generally, an example of an m-
dependent stationary sequence is a sequence {X_i}_{i=1}^∞ where each X_i is of the
form f(Y_i , . . . , Y_{i+m}) for some measurable function f : R^{m+1} → R.
T HEOREM 8.5.6. If {X_n}_{n=1}^∞ is a stationary m-dependent sequence for some
finite m, and E|X_1| < ∞, then n^{-1} ∑_{i=1}^n X_i converges a.s. to E(X_1) as n → ∞.

P ROOF. As in Etemadi’s proof of the strong law, we may break up each Xi


into its positive and negative parts and prove the result separately for the two, since
the positive and negative parts also give stationary m-dependent sequences. Let us
therefore assume without loss of generality that the Xi ’s are nonnegative random
variables. For each k ≥ 1, let
$$Y_k := \frac{1}{m} \sum_{i=m(k-1)+1}^{mk} X_i.$$

By stationarity and m-dependence, it is easy to see (using Exercise 7.1.5) that


Y1 , Y3 , Y5 , . . . is a sequence of i.i.d. random variables, and Y2 , Y4 , Y6 , . . . is another
sequence of i.i.d. random variables. Moreover E(Y1 ) = E(X1 ). Therefore by the
strong law of large numbers, the averages
$$A_n := \frac{1}{n} \sum_{i=1}^{n} Y_{2i-1}, \qquad B_n := \frac{1}{n} \sum_{i=1}^{n} Y_{2i}$$
both converge a.s. to E(X1 ). Therefore the average
$$C_n := \frac{A_n + B_n}{2} = \frac{1}{2n} \sum_{i=1}^{2n} Y_i$$
also converges a.s. to E(X1 ). But note that
$$C_n = \frac{1}{2nm} \sum_{i=1}^{2nm} X_i.$$
Now take any n ≥ 2m and let k be an integer such that 2mk ≤ n < 2m(k + 1).
Since the Xi ’s are nonnegative,
$$C_k \le \frac{1}{2mk} \sum_{i=1}^{n} X_i \le \frac{2m(k+1)}{2mk}\, C_{k+1}.$$
Since Ck and Ck+1 both converge a.s. to E(X1 ) as k → ∞, and 2mk/n → 1 as
n → ∞, this completes the proof. 
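Theorem 8.5.6 can likewise be tested numerically on the 1-dependent example {Y_i Y_{i+1}} mentioned above. A minimal sketch (the Uniform[0,1] choice, seed, and sample size are arbitrary):

```python
import random

rng = random.Random(0)
n = 200_000
Y = [rng.random() for _ in range(n + 1)]  # i.i.d. Uniform[0, 1]
# X_i = Y_i * Y_{i+1} is stationary and 1-dependent, with
# E(X_1) = E(Y_1)^2 = 1/4.
avg = sum(Y[i] * Y[i + 1] for i in range(n)) / n
print(avg)
```

Despite the dependence between consecutive terms, the average converges to E(X_1) = 0.25, exactly as the theorem predicts.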
Sometimes strong laws of large numbers can be proved using only moment
bounds and the first Borel–Cantelli lemma. The following exercises give such
examples.

E XERCISE 8.5.7 (SLLN under bounded fourth moment). Let {X_n}_{n=1}^∞ be a
sequence of independent random variables with mean zero and uniformly bounded
fourth moment. Prove that n^{-1} ∑_{i=1}^n X_i → 0 a.s. (Hint: Use a fourth moment
version of Chebychev's inequality.)
E XERCISE 8.5.8 (Random matrices). Let {Xij }1≤i≤j<∞ be a collection of
i.i.d. random variables with mean zero and all moments finite. Let Xji := Xij
if j > i. Let Wn be the n × n symmetric random matrix whose (i, j)th entry is
n−1/2 Xij . A matrix like Wn is called a Wigner matrix. Let λn,1 ≥ · · · ≥ λn,n be
the eigenvalues of Wn , repeated by multiplicities. For any integer k ≥ 1, show that
$$\frac{1}{n} \sum_{i=1}^{n} \lambda_{n,i}^k - E\Big( \frac{1}{n} \sum_{i=1}^{n} \lambda_{n,i}^k \Big) \to 0 \quad \text{a.s. as } n \to \infty.$$
(Hint: Express the sum as the trace of a power of Wn , and use Theorem 8.4.1. The
tail bound is strong enough to prove almost sure convergence.)
E XERCISE 8.5.9. If all the random graphs in Exercise 8.4.5 are defined on the
same probability space, show that the convergence is almost sure.

8.6. Tightness and Helly’s selection theorem


Starting in this section, we will gradually move towards the proof of the central
limit theorem, which is one of the most important basic results in probability the-
ory. For this, we will first have to develop our understanding of weak convergence
to a more sophisticated level. A concept that is closely related to convergence in
distribution is the notion of tightness, which we study in this section.
D EFINITION 8.6.1. A sequence of random variables {Xn }n≥1 is called a tight
family if for any ε > 0, there exists some K such that supn P(|Xn | ≥ K) ≤ ε.
E XERCISE 8.6.2. If Xn → X in distribution, show that {Xn }n≥1 is a tight
family.
E XERCISE 8.6.3. If {Xn }n≥1 is a tight family and {cn }n≥1 is a sequence of
constants tending to 0, show that cn Xn → 0 in probability.
A partial converse of Exercise 8.6.2 is the following theorem.
T HEOREM 8.6.4 (Helly’s selection theorem). If {Xn }n≥1 is a tight family, then
there is a subsequence {Xnk }k≥1 that converges in distribution.

P ROOF. Let Fn be the c.d.f. of Xn . By the standard diagonal argument, there
is a subsequence {nk }k≥1 of positive integers such that
$$F_*(q) := \lim_{k\to\infty} F_{n_k}(q)$$
exists for every rational number q. For each x ∈ R, define
$$F(x) := \inf_{q \in \mathbb{Q},\, q > x} F_*(q).$$

Then F∗ and F are non-decreasing functions. From tightness, it is easy to argue


that F (x) → 0 as x → −∞ and F (x) → 1 as x → ∞. Now, for any x, there is
a sequence of rationals q1 > q2 > · · · decreasing to x, such that F∗ (qn ) → F (x).
Then F(q_{n+1}) ≤ F∗(q_n) for each n, and hence
$$F(x) = \lim_{n\to\infty} F_*(q_n) \ge \lim_{n\to\infty} F(q_{n+1}),$$
which proves that F is right-continuous. Thus, by Proposition 5.2.2, F is a cumu-
lative distribution function.
We claim that Fnk converges weakly to F . To show this, let x be a continuity
point of F . Take any rational number q > x. Then Fnk (x) ≤ Fnk (q) for all k.
Thus,
$$\limsup_{k\to\infty} F_{n_k}(x) \le \lim_{k\to\infty} F_{n_k}(q) = F_*(q).$$
Since this holds for all rational q > x,
$$\limsup_{k\to\infty} F_{n_k}(x) \le F(x).$$
Next, take any y < x, and take any rational number q ∈ (y, x). Then
$$\liminf_{k\to\infty} F_{n_k}(x) \ge \lim_{k\to\infty} F_{n_k}(q) = F_*(q).$$
Since this holds for all rational q ∈ (y, x),
$$\liminf_{k\to\infty} F_{n_k}(x) \ge F(y).$$
Since this holds for all y < x and x is a continuity point of F , this completes the
proof. 

8.7. An alternative characterization of weak convergence


The following result gives an important equivalent criterion for convergence in
distribution.
P ROPOSITION 8.7.1. A sequence of random variables {Xn }n≥1 converges to
a random variable X in distribution if and only if
$$\lim_{n\to\infty} Ef(X_n) = Ef(X)$$
for every bounded continuous function f : R → R. In particular, two random
variables X and Y have the same law if and only if Ef (X) = Ef (Y ) for all
bounded continuous f .

P ROOF. First, suppose that Ef (Xn ) → Ef (X) for every bounded continuous
function f . Take any continuity point t of FX . Take any ε > 0. Let f be the
function that is 1 below t, 0 above t + ε, and goes down linearly from 1 to 0 in the
interval [t, t + ε]. Then f is a bounded continuous function, and so
$$\limsup_{n\to\infty} F_{X_n}(t) \le \limsup_{n\to\infty} Ef(X_n) = Ef(X) \le F_X(t+\epsilon).$$

Since this is true for all ε > 0 and FX is right-continuous, this gives
$$\limsup_{n\to\infty} F_{X_n}(t) \le F_X(t).$$
A similar argument shows that for any ε > 0,
$$\liminf_{n\to\infty} F_{X_n}(t) \ge F_X(t-\epsilon).$$
Since t is a continuity point of FX , this proves that lim inf FXn (t) ≥ FX (t). Thus,
Xn → X in distribution.
Conversely, suppose that Xn → X in distribution. Let f be a bounded con-
tinuous function. Take any ε > 0. By Exercise 8.6.2 there exists K such that
P(|Xn | ≥ K) ≤ ε for all n. Choose K so large that we also have P(|X| ≥ K) ≤ ε.
Let M be a number such that |f (x)| ≤ M for all x.
Since f is uniformly continuous in [−K, K], there is some δ > 0 such that
|f (x) − f (y)| ≤ ε whenever |x − y| ≤ δ and x, y ∈ [−K, K]. By Exercise 5.2.3,
we may assume that −K and K are continuity points of FX , and we can pick out
a set of points x1 ≤ x2 ≤ · · · ≤ xm ∈ [−K, K] such that each xi is a continuity
point of FX , x1 = −K, xm = K, and xi+1 − xi ≤ δ for each i. Now note that
$$Ef(X_n) = E(f(X_n);\, X_n > K) + E(f(X_n);\, X_n \le -K) + \sum_{i=1}^{m-1} E(f(X_n);\, x_i < X_n \le x_{i+1}),$$
which implies that
$$\begin{aligned}
\Big| Ef(X_n) - \sum_{i=1}^{m-1} f(x_i) P(x_i < X_n \le x_{i+1}) \Big| &\le M\epsilon + \sum_{i=1}^{m-1} E(|f(X_n) - f(x_i)|;\, x_i < X_n \le x_{i+1}) \\
&\le M\epsilon + \epsilon \sum_{i=1}^{m-1} P(x_i < X_n \le x_{i+1}) \le (M+1)\epsilon.
\end{aligned}$$
A similar argument gives
$$\Big| Ef(X) - \sum_{i=1}^{m-1} f(x_i) P(x_i < X \le x_{i+1}) \Big| \le (M+1)\epsilon.$$
Since x1 , . . . , xm are continuity points of FX and Xn → X in distribution,
$$\lim_{n\to\infty} P(x_i < X_n \le x_{i+1}) = \lim_{n\to\infty} (F_{X_n}(x_{i+1}) - F_{X_n}(x_i)) = F_X(x_{i+1}) - F_X(x_i) = P(x_i < X \le x_{i+1})$$
for each i. Combining the above observations, we get
$$\limsup_{n\to\infty} |Ef(X_n) - Ef(X)| \le 2(M+1)\epsilon.$$
Since ε was arbitrary, this completes the proof. □

An immediate consequence of Proposition 8.7.1 is the following result.


P ROPOSITION 8.7.2. If {X_n}_{n=1}^∞ is a sequence of random variables converg-
ing in distribution to a random variable X, then for any continuous f : R → R,
$$f(X_n) \xrightarrow{d} f(X).$$

P ROOF. Take any bounded continuous g : R → R. Then g ◦ f is also a
bounded continuous function. Therefore E(g ◦ f (Xn )) → E(g ◦ f (X)), which
shows that f (Xn ) → f (X) in distribution. □
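Proposition 8.7.1 can be illustrated with a purely deterministic example: X_n uniform on {1/n, 2/n, . . . , n/n} converges weakly to U ~ Uniform[0, 1], and for a bounded continuous f such as cosine, Ef(X_n) is a Riemann sum converging to Ef(U) = sin 1. (The choice of f and of the discrete approximating laws is an illustrative assumption, not from the notes.)

```python
import math

def Ef_discrete(n, f):
    # E f(X_n) for X_n uniform on the grid {1/n, 2/n, ..., n/n}
    return sum(f(k / n) for k in range(1, n + 1)) / n

limit = math.sin(1.0)  # E cos(U) = ∫_0^1 cos(x) dx = sin(1) for U ~ Uniform[0,1]
approx = Ef_discrete(10_000, math.cos)
print(approx, limit)
```

As n grows the gap shrinks like O(1/n), exactly the behavior the proposition's criterion demands of every bounded continuous test function.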

8.8. Inversion formulas


We know how to calculate the characteristic function of a probability law on
the real line. The following inversion formula allows us to go backward, and calculate
expectations of bounded continuous functions using the characteristic function.
T HEOREM 8.8.1. Let X be a random variable with characteristic function φ.
For each θ > 0, define
$$f_\theta(x) := \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx - \theta t^2} \phi(t)\, dt.$$
Then for any bounded continuous g : R → R,
$$E(g(X)) = \lim_{\theta\to 0} \int_{-\infty}^{\infty} g(x) f_\theta(x)\, dx.$$

P ROOF. Let µ be the law of X, so that
$$\phi(t) = \int_{-\infty}^{\infty} e^{ity}\, d\mu(y).$$
Since |φ(t)| ≤ 1 for all t, we may apply Fubini's theorem to conclude that
$$f_\theta(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{i(y-x)t - \theta t^2}\, dt\, d\mu(y).$$
But by Proposition 6.5.1,
$$\int_{-\infty}^{\infty} e^{i(y-x)t - \theta t^2}\, dt = \sqrt{\frac{\pi}{\theta}} \int_{-\infty}^{\infty} e^{i(2\theta)^{-1/2}(y-x)s}\, \frac{e^{-s^2/2}}{\sqrt{2\pi}}\, ds = \sqrt{\frac{\pi}{\theta}}\, e^{-(y-x)^2/4\theta}.$$
Therefore
$$f_\theta(x) = \int_{-\infty}^{\infty} \frac{e^{-(y-x)^2/4\theta}}{\sqrt{4\pi\theta}}\, d\mu(y).$$
But by Proposition 7.7.1, the above formula shows that fθ (x) is the p.d.f. of X + Zθ ,
where Zθ ∼ N (0, 2θ). Thus, we get
$$\int_{-\infty}^{\infty} g(x) f_\theta(x)\, dx = E(g(X + Z_\theta)).$$

Since Var(Zθ ) = 2θ (Exercise 6.2.1), it follows by Chebychev’s inequality that


Zθ → 0 in probability as θ → 0. Since g is a bounded continuous function, the
proof can now be completed using Slutsky’s theorem and Proposition 8.7.1. 
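The identity at the heart of the proof — that f_θ is the density of X + Z_θ with Z_θ ~ N(0, 2θ) — can be checked numerically. The sketch below (all numerical parameters are ad hoc) computes f_θ by direct integration for X = ±1 with probability 1/2, whose characteristic function is cos t, and compares it with the Gaussian-mixture density that the proof predicts:

```python
import cmath, math

def f_theta(x, theta, phi, T=60.0, steps=120_000):
    # trapezoidal approximation of (1/2π) ∫ e^{-itx - θt²} φ(t) dt
    h = 2 * T / steps
    total = 0.0
    for k in range(steps + 1):
        t = -T + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * (cmath.exp(-1j * t * x - theta * t * t) * phi(t)).real
    return total * h / (2 * math.pi)

def normal_pdf(y, var):
    return math.exp(-y * y / (2 * var)) / math.sqrt(2 * math.pi * var)

theta, x = 0.05, 0.7
lhs = f_theta(x, theta, lambda t: math.cos(t))
# density of X + Z_theta with Z_theta ~ N(0, 2θ): an equal mixture of
# N(1, 2θ) and N(-1, 2θ) densities
rhs = 0.5 * (normal_pdf(x - 1, 2 * theta) + normal_pdf(x + 1, 2 * theta))
print(lhs, rhs)
```

Note that X here is discrete and has no density, yet f_θ is a perfectly smooth function: the θ-smoothing is what makes the inversion work for arbitrary laws.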
An immediate corollary of the above theorem is the following important fact.
C OROLLARY 8.8.2. Two random variables have the same law if and only if
they have the same characteristic function.

P ROOF. If two random variables have the same law, then they obviously have
the same characteristic function. Conversely, suppose that X and Y have the same
characteristic function. Then by Theorem 8.8.1, E(g(X)) = E(g(Y )) for every
bounded continuous g. Therefore by Proposition 8.7.1, X and Y have the same
law. 
E XERCISE 8.8.3. Let φ be the characteristic function of a random variable X.
If |φ(t)| = 1 for all t, show that X is a degenerate random variable — that is,
there is a constant c such that P(X = c) = 1. (Hint: Start by showing that |φ(t)|²
is the characteristic function of X − X′, where X′ has the same law as X and is
independent of X. Then apply Corollary 8.8.2.)
E XERCISE 8.8.4. Let X be a non-degenerate random variable. If aX and bX
have the same distribution for some positive a and b, show that a = b. (Hint: Use
the previous exercise.)
E XERCISE 8.8.5. Let X be any random variable. If X + a and X + b have the
same distribution for some a, b ∈ R, show that a = b.
E XERCISE 8.8.6. Suppose that a random variable X has the same distribution
as the sum of n i.i.d. copies of X, for some n ≥ 2. Prove that X = 0 with
probability one. (Hint: Prove that the characteristic function takes value in a finite
set and must therefore be equal to 1 everywhere.)
Another important corollary of Theorem 8.8.1 is the following simplified inversion formula.
C OROLLARY 8.8.7. Let X be a random variable with characteristic function
φ. Suppose that
$$\int_{-\infty}^{\infty} |\phi(t)|\, dt < \infty.$$
Then X has a probability density function f , given by
$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx} \phi(t)\, dt.$$

P ROOF. Recall from the proof of Theorem 8.8.1 that fθ is the p.d.f. of X + Zθ ,
where Zθ ∼ N (0, 2θ). If φ is integrable, then it is easy to see by the dominated
convergence theorem that for every x,
$$f(x) = \lim_{\theta\to 0} f_\theta(x).$$

Moreover, the integrability of φ also shows that for any θ and x,
$$|f_\theta(x)| \le \frac{1}{2\pi} \int_{-\infty}^{\infty} |\phi(t)|\, dt < \infty.$$
Therefore by the dominated convergence theorem, for any −∞ < a ≤ b < ∞,
$$\int_a^b f(x)\, dx = \lim_{\theta\to 0} \int_a^b f_\theta(x)\, dx.$$
Therefore if a and b are continuity points of the c.d.f. of X, then Slutsky's theorem
implies that
$$P(a \le X \le b) = \int_a^b f(x)\, dx.$$
By Proposition 5.4.1, this completes the proof. 
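For an integrable characteristic function the corollary can be verified by direct numerical integration. For instance, N(0, 1) has φ(t) = e^{−t²/2} (Proposition 6.5.1), and inverting it recovers the standard normal density; the grid parameters below are arbitrary illustrative choices:

```python
import cmath, math

def inverted_density(x, phi, T=40.0, steps=80_000):
    # trapezoidal approximation of f(x) = (1/2π) ∫ e^{-itx} φ(t) dt
    h = 2 * T / steps
    total = 0.0
    for k in range(steps + 1):
        t = -T + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * (cmath.exp(-1j * t * x) * phi(t)).real
    return total * h / (2 * math.pi)

x = 1.3
approx = inverted_density(x, lambda t: math.exp(-t * t / 2))
exact = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # N(0,1) density at x
print(approx, exact)
```

The rapid decay of φ is what makes the truncation to [−T, T] harmless here; integrability of φ is exactly the hypothesis of the corollary.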
For integer-valued random variables, a different inversion formula is often useful.
T HEOREM 8.8.8. Let X be an integer-valued random variable with character-
istic function φ. Then for any x ∈ Z,
$$P(X = x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-itx} \phi(t)\, dt.$$

P ROOF. Let µ be the law of X. Then note that by Fubini's theorem,
$$\begin{aligned}
\frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-itx} \phi(t)\, dt &= \frac{1}{2\pi} \int_{-\pi}^{\pi} \int_{-\infty}^{\infty} e^{-itx} e^{ity}\, d\mu(y)\, dt \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\pi}^{\pi} e^{it(y-x)}\, dt\, d\mu(y) \\
&= \sum_{y \in \mathbb{Z}} P(X = y) \Big( \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{it(y-x)}\, dt \Big) \\
&= P(X = x),
\end{aligned}$$
where the last identity holds because x, y ∈ Z in the previous step. □
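Theorem 8.8.8 is easy to verify numerically, e.g. for a Poisson(λ) variable, taking for granted the standard formula φ(t) = exp(λ(e^{it} − 1)) for its characteristic function (λ and the grid size below are arbitrary choices):

```python
import cmath, math

def pmf_from_cf(x, phi, steps=20_000):
    # trapezoidal approximation of P(X = x) = (1/2π) ∫_{-π}^{π} e^{-itx} φ(t) dt
    h = 2 * math.pi / steps
    total = 0.0
    for k in range(steps + 1):
        t = -math.pi + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * (cmath.exp(-1j * t * x) * phi(t)).real
    return total * h / (2 * math.pi)

lam = 2.0
phi = lambda t: cmath.exp(lam * (cmath.exp(1j * t) - 1))  # Poisson(λ) cf
approx = pmf_from_cf(3, phi)
exact = math.exp(-lam) * lam ** 3 / math.factorial(3)  # P(X = 3)
print(approx, exact)
```

Because the integrand is smooth and 2π-periodic, the trapezoidal rule on a full period is extremely accurate here, so the two printed numbers agree to many digits.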

8.9. Lévy’s continuity theorem


In this section we will prove Lévy’s continuity theorem, which asserts that
convergence in distribution is equivalent to pointwise convergence of characteristic
functions.
T HEOREM 8.9.1 (Lévy’s continuity theorem). A sequence of random variables
{Xn }n≥1 converges in distribution to a random variable X if and only if the se-
quence of characteristic functions {φXn }n≥1 converges to the characteristic func-
tion φX pointwise.

P ROOF. If Xn → X in distribution, then φXn (t) → φX (t) for every t by
Proposition 8.7.1. Conversely, suppose that φXn (t) → φX (t) for every t. Take any
ε > 0. Recall that φX is a continuous function (Exercise 6.4.3), and φX (0) = 1.
Therefore we can choose a number a so small that |φX (s) − 1| ≤ ε/2 whenever
|s| ≤ a. Consequently,
$$\frac{1}{a} \int_{-a}^{a} (1 - \phi_X(s))\, ds \le \epsilon.$$
Therefore, since φXn → φX pointwise, the dominated convergence theorem shows
that
$$\lim_{n\to\infty} \frac{1}{a} \int_{-a}^{a} (1 - \phi_{X_n}(s))\, ds = \frac{1}{a} \int_{-a}^{a} (1 - \phi_X(s))\, ds \le \epsilon.$$
Let t := 2/a. Then by Proposition 6.4.4 and the above inequality,
$$\limsup_{n\to\infty} P(|X_n| \ge t) \le \epsilon.$$
Thus, P(|Xn | ≥ t) ≤ 2ε for all large enough n. This allows us to choose K large
enough such that P(|Xn | ≥ K) ≤ 2ε for all n. Since ε is arbitrary, this proves that
{Xn }n≥1 is a tight family.
Suppose that Xn does not converge in distribution to X. Then there is a
bounded continuous function f such that Ef (Xn ) does not converge to Ef (X). Passing to a subse-
quence if necessary, we may assume that there is some ε > 0 such that |Ef (Xn ) −
Ef (X)| ≥ ε for all n. By tightness, there exists some subsequence {Xnk }k≥1
that converges in distribution to a limit Y . Then Ef (Xnk ) → Ef (Y ) and hence
|Ef (Y ) − Ef (X)| ≥ ε. But by the first part of this theorem and the hypothesis that
φXn → φX pointwise, we have that φY = φX everywhere. Therefore Y and X
must have the same law by Corollary 8.8.2, and we get a contradiction by the second
assertion of Proposition 8.7.1 applied to this X and Y . □
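Lévy's continuity theorem underlies the classical Poisson approximation: if X_n ~ Bin(n, λ/n), the characteristic functions (1 − p + pe^{it})^n converge pointwise to exp(λ(e^{it} − 1)), the Poisson(λ) characteristic function, so X_n converges in distribution to Poisson(λ). A quick numeric look (λ, t, and the values of n are arbitrary; the binomial and Poisson characteristic-function formulas are standard facts assumed here):

```python
import cmath

def phi_binomial(n, p, t):
    # characteristic function of Bin(n, p) at t
    return (1 - p + p * cmath.exp(1j * t)) ** n

lam, t = 2.0, 1.5
phi_poisson = cmath.exp(lam * (cmath.exp(1j * t) - 1))  # Poisson(λ) cf at t
for n in (10, 100, 10_000):
    print(n, abs(phi_binomial(n, lam / n, t) - phi_poisson))
```

The printed gaps shrink roughly like 1/n, matching the pointwise convergence that the theorem converts into convergence in distribution.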
E XERCISE 8.9.2. If a sequence of characteristic functions {φn }n≥1 converges
pointwise to a characteristic function φ, prove that the convergence is uniform on
any bounded interval. (Hint: It suffices to show that if tn → t, then φn (tn ) → φ(t).
Deduce this from Lévy’s continuity theorem.)
E XERCISE 8.9.3. If a sequence of characteristic functions {φn }n≥1 converges
pointwise to some function φ, and φ is continuous at zero, prove that φ is also a
characteristic function. (Hint: Prove tightness and proceed from there.)

8.10. The central limit theorem for i.i.d. sums


Broadly speaking, a central limit theorem is any theorem that states that a
sequence of random variables converges weakly to a limit random variable that has
a continuous distribution (usually Gaussian). The following central limit theorem
suffices for many problems.

T HEOREM 8.10.1 (CLT for i.i.d. sums). Let X1 , X2 , . . . be i.i.d. random vari-
ables with mean µ and variance σ 2 . Then the random variable
$$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}}$$
converges weakly to the standard Gaussian distribution as n → ∞.

We need a few simple lemmas to prepare for the proof. The main argument is
based on Lévy’s continuity theorem.

L EMMA 8.10.2. For any x ∈ R and k ≥ 0,
$$\Big| e^{ix} - \sum_{j=0}^{k} \frac{(ix)^j}{j!} \Big| \le \frac{|x|^{k+1}}{(k+1)!}.$$

P ROOF. This follows easily from Taylor expansion, noting that all derivatives
of the map x 7→ eix are uniformly bounded by 1 in magnitude. 

C OROLLARY 8.10.3. For any x ∈ R,
$$\Big| e^{ix} - 1 - ix + \frac{x^2}{2} \Big| \le \min\Big( x^2, \frac{|x|^3}{6} \Big).$$

P ROOF. By Lemma 8.10.2,
$$\Big| e^{ix} - 1 - ix + \frac{x^2}{2} \Big| \le \frac{|x|^3}{6}.$$
On the other hand, by Lemma 8.10.2 we also have
$$\Big| e^{ix} - 1 - ix + \frac{x^2}{2} \Big| \le |e^{ix} - 1 - ix| + \frac{x^2}{2} \le \frac{x^2}{2} + \frac{x^2}{2} = x^2.$$
The proof is completed by combining the two bounds. □
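The bound in Corollary 8.10.3 is straightforward to sanity-check numerically at a few sample points (the sample points are arbitrary):

```python
import cmath

def taylor_gap(x):
    # |e^{ix} - 1 - ix + x²/2|
    return abs(cmath.exp(1j * x) - 1 - 1j * x + x * x / 2)

for x in (-10.0, -2.0, -0.3, 0.1, 1.0, 7.5):
    bound = min(x * x, abs(x) ** 3 / 6)
    # the corollary guarantees the gap never exceeds the bound
    assert taylor_gap(x) <= bound + 1e-12
print("corollary's bound holds at all sample points")
```

For small |x| the cubic term |x|³/6 is the binding bound, while for large |x| the quadratic term x² takes over; the proof of the CLT below exploits exactly this crossover.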

L EMMA 8.10.4. Let a1 , . . . , an and b1 , . . . , bn be complex numbers such that
|ai | ≤ 1 and |bi | ≤ 1 for each i. Then
$$\Big| \prod_{i=1}^{n} a_i - \prod_{i=1}^{n} b_i \Big| \le \sum_{i=1}^{n} |a_i - b_i|.$$

P ROOF. Writing the difference of the products as a telescoping sum and ap-
plying the triangle inequality, we get
$$\begin{aligned}
\Big| \prod_{i=1}^{n} a_i - \prod_{i=1}^{n} b_i \Big| &\le \sum_{i=1}^{n} |a_1 \cdots a_{i-1} b_i \cdots b_n - a_1 \cdots a_i b_{i+1} \cdots b_n| \\
&= \sum_{i=1}^{n} |a_1 \cdots a_{i-1} (b_i - a_i) b_{i+1} \cdots b_n| \le \sum_{i=1}^{n} |a_i - b_i|,
\end{aligned}$$
where the last inequality holds because |ai | and |bi | are ≤ 1 for each i. □
We are now ready to prove Theorem 8.10.1.
P ROOF OF T HEOREM 8.10.1. Replacing Xi by (Xi − µ)/σ, let us assume
without loss of generality that µ = 0 and σ = 1. Let Sn := n^{-1/2} ∑_{i=1}^n X_i.
Take any t ∈ R. By Lévy's continuity theorem and the formula for the character-
istic function of the standard normal distribution (Proposition 6.5.1), it is sufficient
to show that φ_{S_n}(t) → e^{−t²/2} as n → ∞, where φ_{S_n} is the characteristic function
of Sn . By the i.i.d. nature of the summands,
$$\phi_{S_n}(t) = \prod_{i=1}^{n} \phi_{X_i}(t/\sqrt{n}) = (\phi_{X_1}(t/\sqrt{n}))^n.$$
Therefore by Lemma 8.10.4, when n is so large that t² ≤ 2n,
$$\Big| \phi_{S_n}(t) - \Big( 1 - \frac{t^2}{2n} \Big)^n \Big| \le n \Big| \phi_{X_1}(t/\sqrt{n}) - 1 + \frac{t^2}{2n} \Big|.$$
Thus, we only need to show that the right side tends to zero as n → ∞. To prove
this, note that by Corollary 8.10.3,
$$n \Big| \phi_{X_1}(t/\sqrt{n}) - 1 + \frac{t^2}{2n} \Big| = n \Big| E\Big( e^{itX_1/\sqrt{n}} - 1 - \frac{itX_1}{\sqrt{n}} + \frac{t^2 X_1^2}{2n} \Big) \Big| \le E \min\Big( t^2 X_1^2, \frac{|t|^3 |X_1|^3}{6\sqrt{n}} \Big).$$
By the finiteness of E(X_1²) and the dominated convergence theorem, the above
expectation tends to zero as n → ∞. □
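The CLT for i.i.d. sums can be watched in simulation. The sketch below (the distribution, n, replication count, and seed are arbitrary choices) standardizes sums of Exponential(1) variables, for which µ = σ = 1, and checks one Gaussian probability:

```python
import math, random

rng = random.Random(230)

def standardized_sum(n):
    # (S_n - nµ) / (σ√n) for i.i.d. Exponential(1) draws (µ = σ = 1)
    s = sum(rng.expovariate(1.0) for _ in range(n))
    return (s - n) / math.sqrt(n)

samples = [standardized_sum(200) for _ in range(3000)]
frac_nonpositive = sum(z <= 0 for z in samples) / len(samples)
print(frac_nonpositive)  # the Gaussian limit gives Φ(0) = 1/2
```

Even though each summand is highly skewed, the standardized sums already look close to N(0, 1) at n = 200; trying other bounded continuous or indicator statistics gives the same picture.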

E XERCISE 8.10.5. Give a counterexample to show that the i.i.d. assumption


in Theorem 8.10.1 cannot be replaced by the assumption of identically distributed
and pairwise independent.
In the following exercises, 'prove a central limit theorem for Xn ' means 'prove
that
$$\frac{X_n - a_n}{b_n} \xrightarrow{d} N(0,1)$$
as n → ∞, for some appropriate sequences of constants an and bn '.

E XERCISE 8.10.6. Let Xn ∼ Bin(n, p). Prove a central limit theorem for
Xn . (Hint: Use Exercise 7.7.4 and the CLT for i.i.d. random variables.)
E XERCISE 8.10.7. Let Xn ∼ Gamma(n, λ). Prove a central limit theorem
for Xn . (Hint: Use Exercise 7.7.5 and the CLT for i.i.d. random variables.)
E XERCISE 8.10.8. Suppose that Xn ∼ Gamma(n, λn ), where {λn }n≥1 is
any sequence of positive constants. Prove a CLT for Xn .
Just like the strong law of large numbers, one can prove a central limit theorem
for a stationary m-dependent sequence of random variables.
T HEOREM 8.10.9 (CLT for stationary m-dependent sequences). Suppose that
X1 , X2 , . . . is a stationary m-dependent sequence of random variables with mean
µ and finite second moment. Let
$$\sigma^2 := \mathrm{Var}(X_1) + 2 \sum_{i=2}^{m+1} \mathrm{Cov}(X_1, X_i).$$
Then the random variable
$$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}}$$
converges weakly to the standard Gaussian distribution as n → ∞.

P ROOF. The proof of this result uses the technique of ‘big blocks and little
blocks’, which is also useful for other things. Without loss of generality, assume
that µ = 0. Take any r ≥ m. Divide up the set of positive integers into 'big blocks'
of size r, with intermediate 'little blocks' of size m. For example, the first big block
is {1, . . . , r}, which is followed by the little block {r + 1, . . . , r + m}. Enumerate
the big blocks as B1 , B2 , . . . and the little blocks as L1 , L2 , . . .. Let
$$Y_j := \sum_{i \in B_j} X_i, \qquad Z_j := \sum_{i \in L_j} X_i.$$

Note that by stationarity and m-dependence, Y1 , Y2 , . . . is a sequence of i.i.d. random
variables and Z1 , Z2 , . . . is also a sequence of i.i.d. random variables.
Let kn be the largest integer such that Bkn ⊆ {1, . . . , n}. Let Sn := ∑_{i=1}^n Xi ,
Tn := ∑_{j=1}^{kn} Yj , and Rn := Sn − Tn . Then by the central limit theorem for
i.i.d. sums and the fact that kn ∼ n/(r + m) as n → ∞, we have
$$\frac{T_n}{\sqrt{n}} \xrightarrow{d} N(0, \sigma_r^2), \quad \text{where} \quad \sigma_r^2 = \frac{\mathrm{Var}(Y_1)}{r+m}.$$
Now, it is not hard to see that if we let
$$R'_n := \sum_{j=1}^{k_n} Z_j,$$
then {Rn − R′n }n≥1 is a tight family of random variables. Therefore by Exercise
8.6.3, (Rn − R′n )/√n → 0 in probability. But again by the CLT for i.i.d. variables,
$$\frac{R'_n}{\sqrt{n}} \xrightarrow{d} N(0, \tau_r^2), \quad \text{where} \quad \tau_r^2 = \frac{\mathrm{Var}(Z_1)}{r+m}.$$

Therefore by Slutsky's theorem, Rn /√n also has the same weak limit.
Now let φn be the characteristic function of Sn /√n and ψn,r be the character-
istic function of Tn /√n. Then for any t,
$$|\phi_n(t) - \psi_{n,r}(t)| = |E(e^{itS_n/\sqrt{n}} - e^{itT_n/\sqrt{n}})| \le E|e^{itR_n/\sqrt{n}} - 1|.$$
Letting n → ∞ on both sides and using the observations made above, we get
$$\limsup_{n\to\infty} |\phi_n(t) - e^{-t^2\sigma_r^2/2}| \le E|e^{it\xi_r} - 1|,$$

where ξr ∼ N (0, τr2 ). Now send r → ∞. It is easy to check that σr → σ and
τr → 0, and complete the proof from there. □
E XERCISE 8.10.10. Let X1 , X2 , . . . be a sequence of i.i.d. random variables.
For each i ≥ 2, let
$$Y_i := \begin{cases} 1 & \text{if } X_i \ge \max\{X_{i-1}, X_{i+1}\}, \\ 0 & \text{if not.} \end{cases}$$
In other words, Yi is 1 if and only if the original sequence has a local maximum at
i. Prove a central limit theorem for ∑_{i=2}^{n} Yi .

8.11. The Lindeberg–Feller central limit theorem


In some applications, the CLT for i.i.d. sums does not suffice. For sums of
independent random variables, the most powerful result available in the literature
is the following theorem.
T HEOREM 8.11.1 (Lindeberg–Feller CLT). Let {kn }n≥1 be a sequence of pos-
itive integers increasing to infinity. For each n, let {Xn,i }1≤i≤kn be a collection of
independent random variables. Let µn,i := E(Xn,i ), σ²n,i := Var(Xn,i ), and
$$s_n^2 := \sum_{i=1}^{k_n} \sigma_{n,i}^2.$$
Suppose that for any ε > 0,
$$\lim_{n\to\infty} \frac{1}{s_n^2} \sum_{i=1}^{k_n} E((X_{n,i} - \mu_{n,i})^2;\, |X_{n,i} - \mu_{n,i}| \ge \epsilon s_n) = 0. \tag{8.11.1}$$

Then the random variable
$$\frac{\sum_{i=1}^{k_n} (X_{n,i} - \mu_{n,i})}{s_n}$$
converges in distribution to the standard Gaussian law as n → ∞.
The condition (8.11.1) is commonly known as Lindeberg's condition. The
proof of the Lindeberg–Feller CLT is similar to the proof of the CLT for i.i.d. sums,
but with a few minor additional technical subtleties.
L EMMA 8.11.2. For any x1 , . . . , xn ∈ [0, 1],
$$\Big| \exp\Big( -\sum_{i=1}^{n} x_i \Big) - \prod_{i=1}^{n} (1 - x_i) \Big| \le \frac{1}{2} \sum_{i=1}^{n} x_i^2.$$

P ROOF. By Taylor expansion, we have |e−x − (1 − x)| ≤ x2 /2 for any x ≥ 0.


The proof is now easily completed by Lemma 8.10.4. 

P ROOF OF T HEOREM 8.11.1. Replacing Xn,i by (Xn,i − µn,i )/sn , let us as-
sume without loss of generality that µn,i = 0 and sn = 1 for each n and i. Then
the condition (8.11.1) becomes
$$\lim_{n\to\infty} \sum_{i=1}^{k_n} E(X_{n,i}^2;\, |X_{n,i}| \ge \epsilon) = 0. \tag{8.11.2}$$
Note that for any ε > 0 and any n,
$$\max_{1\le i\le k_n} \sigma_{n,i}^2 = \max_{1\le i\le k_n} \big( E(X_{n,i}^2;\, |X_{n,i}| < \epsilon) + E(X_{n,i}^2;\, |X_{n,i}| \ge \epsilon) \big) \le \epsilon^2 + \max_{1\le i\le k_n} E(X_{n,i}^2;\, |X_{n,i}| \ge \epsilon).$$
Therefore by (8.11.2),
$$\limsup_{n\to\infty} \max_{1\le i\le k_n} \sigma_{n,i}^2 \le \epsilon^2.$$
Since ε is arbitrary, this shows that
$$\lim_{n\to\infty} \max_{1\le i\le k_n} \sigma_{n,i}^2 = 0. \tag{8.11.3}$$
Let Sn := ∑_{i=1}^{kn} Xn,i . Then by Exercise 7.7.7,
$$\phi_{S_n}(t) = \prod_{i=1}^{k_n} \phi_{X_{n,i}}(t).$$
By Corollary 8.10.3,
$$\Big| \phi_{X_{n,i}}(t) - 1 + \frac{t^2\sigma_{n,i}^2}{2} \Big| = \Big| E\Big( e^{itX_{n,i}} - 1 - itX_{n,i} + \frac{t^2 X_{n,i}^2}{2} \Big) \Big| \le E \min\Big( t^2 X_{n,i}^2, \frac{|t|^3 |X_{n,i}|^3}{6} \Big).$$

Now fix t. Equation (8.11.3) tells us that t²σ²n,i ≤ 2 for each i when n is sufficiently
large. Thus by Lemma 8.10.4,
$$\Big| \phi_{S_n}(t) - \prod_{i=1}^{k_n} \Big( 1 - \frac{t^2\sigma_{n,i}^2}{2} \Big) \Big| \le \sum_{i=1}^{k_n} \Big| \phi_{X_{n,i}}(t) - 1 + \frac{t^2\sigma_{n,i}^2}{2} \Big| \le \sum_{i=1}^{k_n} E \min\Big( t^2 X_{n,i}^2, \frac{|t|^3 |X_{n,i}|^3}{6} \Big).$$
Take any ε > 0. Then
$$E \min\Big( t^2 X_{n,i}^2, \frac{|t|^3 |X_{n,i}|^3}{6} \Big) \le t^2 E(X_{n,i}^2;\, |X_{n,i}| \ge \epsilon) + \frac{|t|^3}{6} E(|X_{n,i}|^3;\, |X_{n,i}| < \epsilon) \le t^2 E(X_{n,i}^2;\, |X_{n,i}| \ge \epsilon) + \frac{|t|^3\epsilon}{6} E(X_{n,i}^2).$$
Therefore for any fixed t and sufficiently large n,
$$\Big| \phi_{S_n}(t) - \prod_{i=1}^{k_n} \Big( 1 - \frac{t^2\sigma_{n,i}^2}{2} \Big) \Big| \le t^2 \sum_{i=1}^{k_n} E(X_{n,i}^2;\, |X_{n,i}| \ge \epsilon) + \frac{|t|^3\epsilon}{6} \sum_{i=1}^{k_n} \sigma_{n,i}^2 = t^2 \sum_{i=1}^{k_n} E(X_{n,i}^2;\, |X_{n,i}| \ge \epsilon) + \frac{|t|^3\epsilon}{6}.$$
Therefore by (8.11.2),
$$\limsup_{n\to\infty} \Big| \phi_{S_n}(t) - \prod_{i=1}^{k_n} \Big( 1 - \frac{t^2\sigma_{n,i}^2}{2} \Big) \Big| \le \frac{|t|^3\epsilon}{6}.$$
Since this holds for any ε > 0, the lim sup on the left must be equal to zero. But
by Lemma 8.11.2 and equation (8.11.3),
$$\begin{aligned}
\limsup_{n\to\infty} \Big| e^{-t^2/2} - \prod_{i=1}^{k_n} \Big( 1 - \frac{t^2\sigma_{n,i}^2}{2} \Big) \Big| &\le \limsup_{n\to\infty} \frac{1}{8} \sum_{i=1}^{k_n} t^4\sigma_{n,i}^4 \\
&\le \limsup_{n\to\infty} \frac{t^4 \max_{1\le i\le k_n}\sigma_{n,i}^2}{8} \sum_{i=1}^{k_n} \sigma_{n,i}^2 = \limsup_{n\to\infty} \frac{t^4 \max_{1\le i\le k_n}\sigma_{n,i}^2}{8} = 0.
\end{aligned}$$
Thus, φ_{S_n}(t) → e^{−t²/2} as n → ∞. By Lévy's continuity theorem and Proposi-
tion 6.5.1, this proves that Sn converges weakly to the standard Gaussian distribu-
tion. □
A corollary of the Lindeberg–Feller CLT that is useful for sums of independent
but not identically distributed random variables is the Lyapunov CLT.

T HEOREM 8.11.3 (Lyapunov CLT). Let {X_n}_{n=1}^∞ be a sequence of indepen-
dent random variables. Let µi := E(Xi ), σi² := Var(Xi ), and s²n = ∑_{i=1}^n σi². If
for some δ > 0,
$$\lim_{n\to\infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^{n} E|X_i - \mu_i|^{2+\delta} = 0, \tag{8.11.4}$$
then the random variable
$$\frac{\sum_{i=1}^{n} (X_i - \mu_i)}{s_n}$$
converges weakly to the standard Gaussian distribution as n → ∞.

P ROOF. To put this in the framework of the Lindeberg–Feller CLT, let kn = n
and Xn,i = Xi . Then for any ε > 0 and any n,
$$\frac{1}{s_n^2} \sum_{i=1}^{n} E((X_i - \mu_i)^2;\, |X_i - \mu_i| \ge \epsilon s_n) \le \frac{1}{s_n^2} \sum_{i=1}^{n} E\Big( \frac{|X_i - \mu_i|^{2+\delta}}{(\epsilon s_n)^\delta} \Big) = \frac{1}{\epsilon^\delta s_n^{2+\delta}} \sum_{i=1}^{n} E|X_i - \mu_i|^{2+\delta}.$$
The Lyapunov condition (8.11.4) implies that this upper bound tends to zero as
n → ∞, which completes the proof. 
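The Lyapunov condition is concrete enough to compute. For independent X_i = ±√i with equal probability (a hypothetical example, not from the notes), σ_i² = i and E|X_i|³ = i^{3/2}, so the δ = 1 ratio in (8.11.4) tends to zero and the theorem applies; a short simulation then confirms the Gaussian limit:

```python
import math, random

def lyapunov_ratio(n):
    # delta = 1 ratio in (8.11.4) for X_i = ±sqrt(i):
    # sigma_i^2 = i, E|X_i|^3 = i^{3/2}
    s2 = sum(range(1, n + 1))                        # s_n^2
    third = sum(i ** 1.5 for i in range(1, n + 1))   # sum of E|X_i|^3
    return third / s2 ** 1.5

print(lyapunov_ratio(10), lyapunov_ratio(10_000))  # the ratio shrinks with n

rng = random.Random(8)

def standardized(n):
    # (sum of X_i) / s_n for X_i = ±sqrt(i)
    total = sum(rng.choice((-1.0, 1.0)) * math.sqrt(i) for i in range(1, n + 1))
    return total / math.sqrt(n * (n + 1) / 2)

samples = [standardized(300) for _ in range(2000)]
frac = sum(abs(z) <= 1 for z in samples) / len(samples)
print(frac)  # should be near P(|N(0,1)| <= 1) ≈ 0.683
```

Note that the summands here are neither identically distributed nor uniformly bounded, so the i.i.d. CLT does not apply directly; this is exactly the situation the Lyapunov criterion was designed for.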
E XERCISE 8.11.4. Suppose that Xn ∼ Bin(n, pn ), where {pn }n≥1 is a se-
quence of constants such that npn (1 − pn ) → ∞. Prove a CLT for Xn .
E XERCISE 8.11.5. Let X1 , X2 , . . . be a sequence of uniformly bounded inde-
pendent random variables, and let Sn = ∑_{i=1}^n Xi . If Var(Sn ) → ∞, show that Sn
satisfies a central limit theorem.
8.12. Stable laws


The central limit theorem gives the limiting distribution for the normalized
sums of i.i.d. random variables with finite second moment. What if the second
moment is not finite? ‘Stable laws’ are a class of distributions that characterize all
possible limits of this sort.
D EFINITION 8.12.1. A real-valued random variable Y is said to have a 'stable
law' if there are i.i.d. random variables {Xn }n≥1 and sequences of real constants
{an }n≥1 and {bn }n≥1 , with an > 0 for each n, such that
$$\frac{X_1 + \cdots + X_n}{a_n} - b_n \xrightarrow{d} Y.$$
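A concrete non-Gaussian example is the standard Cauchy distribution, which (as a standard fact assumed here, not proved in this section) satisfies the definition with a_n = n and b_n = 0: the average of n i.i.d. standard Cauchy variables is again standard Cauchy, whose quartiles are ±1. A simulation sketch (seed and sizes arbitrary):

```python
import math, random

rng = random.Random(12)

def cauchy():
    # standard Cauchy via the tangent of a uniform angle
    return math.tan(math.pi * (rng.random() - 0.5))

n, reps = 50, 4000
avgs = sorted(sum(cauchy() for _ in range(n)) / n for _ in range(reps))
# If the averages are again standard Cauchy, their quartiles sit near -1 and 1
# regardless of n -- no law-of-large-numbers shrinking occurs.
q1, q3 = avgs[reps // 4], avgs[3 * reps // 4]
print(q1, q3)
```

This is consistent with Exercise 8.5.2: the Cauchy law has E|X_1| = ∞, so the averages cannot settle down, and instead they reproduce the original distribution.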

By the central limit theorem, the standard normal distribution is a stable law.
The goal of this section is to characterize the set of all stable laws. We need the
following lemma.

L EMMA 8.12.2. Suppose that Xn →d X, where X is non-degenerate. Let
αn > 0 and βn ∈ R be such that αn Xn + βn →d Y for some non-degenerate Y .
Then there exist α > 0 and β ∈ R such that αn → α, βn → β, and Y =d αX + β.

P ROOF. Let φn be the characteristic function of Xn and ψn be the character-
istic function of αn Xn + βn , so that
$$\psi_n(t) = e^{it\beta_n} \phi_n(\alpha_n t) \tag{8.12.1}$$
for all t. Let φ be the characteristic function of X and ψ be the characteristic
function of Y , so that φn converges pointwise to φ and ψn converges pointwise
to ψ. If αn → 0 through a subsequence, (8.12.1) shows that |ψ(t)| = 1 for all
t. Similarly, if αn → ∞ through a subsequence, (8.12.1) shows that |φ(t)| = 1
for all t. By Exercise 8.8.3, these are contradictions to the assumption that X and
Y are non-degenerate random variables. Thus, the sequence {αn }n≥1 is bounded
away from 0 and ∞.
Now let α and α′ be two subsequential limits of {αn }n≥1 . Note that these
are positive and finite, by the conclusion from the previous paragraph. By (8.12.1),
it follows that for all t,
$$|\phi(\alpha t)| = |\psi(t)| = |\phi(\alpha' t)|.$$
Suppose α′ > α. The above equation shows that for all t, |φ(t)| = |φ(at)|, where
a = α/α′. Iterating this, we get |φ(t)| = limn→∞ |φ(a^n t)| = |φ(0)| = 1. Again
by Exercise 8.8.3, this implies that X is a degenerate random variable, which con-
tradicts our assumption. Thus, α := limn→∞ αn exists and is in (0, ∞).
The last sentence of the previous paragraph, and the uniform convergence of
φn to φ in bounded neighborhoods, allow us to conclude that there is some t0 > 0
small enough such that for t ∈ (−t0 , t0 ), |φn (αn t)| > 1/2 for all n. Thus, by
equation (8.12.1), we get that eitβn converges uniformly to ψ(t)/φ(αt) in (−t0 , t0 ).
This shows that {βn }n≥1 cannot have a subsequence {βnk }k≥1 such that |βnk | →
∞, because otherwise we can take tk = π/βnk and have eiβnk tk = −1 for all k,
whereas tk → 0 and ψ(0)/φ(0) = 1.
Thus, {βn }n≥1 must be a bounded sequence. Let β and β′ be limit points of
this sequence. Then e^{itβ} = ψ(t)/φ(αt) = e^{itβ′} for all t ∈ (−t0 , t0 ), which implies
that β = β′. Thus, β := limn→∞ βn exists and is finite.
This completes the proofs of the first two claims of the lemma. The last claim
is now obvious. 
The next result gives an intrinsic characterization of stable laws.
T HEOREM 8.12.3. A non-degenerate random variable Y has a stable law if
and only if for each n, there exist An > 0 and Bn ∈ R such that
$$Y \stackrel{d}{=} \frac{Y_1 + \cdots + Y_n}{A_n} - B_n,$$
where Y1 , . . . , Yn are i.i.d. random variables with the same law as Y .

P ROOF. If Y has the stated property, then it is obvious that Y is stable. Con-
versely, suppose that Y is stable. Let Xn , an and bn be as in Definition 8.12.1.
Let
$$Z_n := \frac{X_1 + \cdots + X_n}{a_n} - b_n.$$
Take any k ≥ 1. For 1 ≤ j ≤ k, let
$$Z_n^{(j)} := \frac{X_{n(j-1)+1} + \cdots + X_{nj}}{a_n} - b_n,$$
so that
$$Z_{nk} = \frac{a_n}{a_{nk}} \sum_{j=1}^{k} Z_n^{(j)} + \frac{k a_n}{a_{nk}}\, b_n - b_{nk}.$$
Now, as k remains fixed and n → ∞,
$$Z_{nk} \xrightarrow{d} Y, \qquad \sum_{j=1}^{k} Z_n^{(j)} \xrightarrow{d} Y_1 + \cdots + Y_k.$$

By Lemma 8.12.2, this completes the proof. 

The next result shows that the constant An in Theorem 8.12.3 must necessarily
have a specific form.

T HEOREM 8.12.4. In the setting of Theorem 8.12.3, there exists a number
α ∈ (0, 2] such that An = n^{1/α} for all n.

We need the following lemma for the proof of Theorem 8.12.4. A random
variable X is called 'symmetric' if X =d −X.

L EMMA 8.12.5. Let X1 , . . . , Xn be i.i.d. symmetric random variables with
c.d.f. F . Then for any t > 0,
$$P(|X_1 + \cdots + X_n| > t) \ge \frac{1}{2} \big( 1 - e^{-2n(1 - F(t))} \big).$$

P ROOF. Let Yi := |Xi | and si := sign(Xi ), where the sign is taken to be


1 or −1 with equal probability if Xi = 0. It is a simple consequence of the
symmetry of Xi that si and Yi are independent. Let I be the minimum i such
that |Xi | = max1≤j≤n |Xj |. Let s := (s1 , . . . , sn ) and Y := (Y1 , . . . , Yn ). Let
s′ = (s′1 , . . . , s′n ) be the vector defined as s′i := −si if i ≠ I and s′i = si if i = I.
We claim that (s, Y ) =d (s′, Y ). To see this, note that by the independence of s and
X := (X1 , . . . , Xn ), we have that for any a ∈ {−1, 1}^n and A ∈ B(R^n ),
$$\begin{aligned}
P(s' = a,\, Y \in A) &= \sum_{i=1}^{n} P(s' = a,\, Y \in A,\, I = i) \\
&= \sum_{i=1}^{n} P(s_j = -a_j \text{ for } j \ne i,\, s_i = a_i,\, Y \in A,\, I = i) \\
&= \sum_{i=1}^{n} P(s_j = -a_j \text{ for } j \ne i,\, s_i = a_i)\, P(Y \in A,\, I = i) \\
&= 2^{-n} \sum_{i=1}^{n} P(Y \in A,\, I = i) = 2^{-n} P(Y \in A),
\end{aligned}$$
and this is equal to P(s = a, Y ∈ A) by a similar argument.


Now let M := X_I and T := ∑_{j≠I} Xj . Then note that M = s_I Y_I and
T = ∑_{j≠I} sj Yj . This implies that (M, T ) has the same law as (M, −T ), since
the function that maps (s, Y ) to (M, T ) maps (s′, Y ) to (M, −T ). Thus, for any
t > 0,
$$P(M > t) \le P(M > t,\, T \ge 0) + P(M > t,\, T \le 0) = P(M > t,\, T \ge 0) + P(M > t,\, -T \ge 0) = 2 P(M > t,\, T \ge 0).$$

Thus, by symmetry,

P(|X1 + · · · + Xn | > t) = 2P(X1 + · · · + Xn > t)


= 2P(M + T > t)
≥ 2P(M > t, T ≥ 0) ≥ P(M > t).

To complete the proof, note that again by symmetry,

      P(M > t) = (1/2) P(|M | > t) = (1/2) P(max_{1≤i≤n} |Xi | > t).

If t is a point of continuity of F , then

      P(max_{1≤i≤n} |Xi | > t) = 1 − P(max_{1≤i≤n} |Xi | ≤ t)
                               = 1 − (P(|X1 | ≤ t))^n = 1 − (1 − 2(1 − F (t)))^n .

By the inequality 1−x ≤ e−x , this proves the lemma when t is a point of continuity
of F . If t is not a point of continuity, then the result can be obtained by taking a
sequence of continuity points decreasing to t. 
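Lemma 8.12.5 is easy to test numerically. The sketch below is our own illustration, not part of the notes: it uses i.i.d. standard normal variables (which are symmetric) and compares a Monte Carlo estimate of the left-hand side with the analytic lower bound, computing the normal c.d.f. via `math.erf`.

```python
import math
import random

def lower_bound_check(n=10, t=1.0, trials=20000, seed=0):
    # Monte Carlo estimate of P(|X1 + ... + Xn| > t) for i.i.d. standard
    # normal (hence symmetric) variables, compared with the lower bound
    # (1/2)(1 - exp(-2n(1 - F(t)))) from Lemma 8.12.5.
    rng = random.Random(seed)
    hits = sum(
        abs(sum(rng.gauss(0.0, 1.0) for _ in range(n))) > t
        for _ in range(trials)
    )
    lhs = hits / trials
    F_t = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))  # normal c.d.f. at t
    rhs = 0.5 * (1.0 - math.exp(-2 * n * (1.0 - F_t)))
    return lhs, rhs

lhs, rhs = lower_bound_check()
print(lhs >= rhs)
```

For n = 10 and t = 1 the bound is roughly 0.48, well below the true probability, so the inequality holds with a comfortable margin in this example.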

P ROOF OF T HEOREM 8.12.4. Let Z := Y1 −Y2 , and let Z1 , Z2 , . . . be i.i.d. ran-


dom variables with the same law as Z. Then note that for any m and n,
      A_{m+n} Z =d Σ_{i=1}^{m+n} Zi = Σ_{i=1}^{m} Zi + Σ_{i=m+1}^{m+n} Zi =d Am Z1 + An Z2 .    (8.12.2)
Thus, for any t > 0,
      P(Z > t) = P(A_{m+n} Z > A_{m+n} t)
               ≥ P(Am Z1 > A_{m+n} t and Z2 ≥ 0) ≥ (1/2) P(Am Z1 > A_{m+n} t),
where the last inequality holds because Z2 is symmetrically distributed around
zero. Since Z1 =d Z and Z is nondegenerate and symmetric, this proves that
      B := sup_{1≤n≤k} An /Ak < ∞.                                (8.12.3)

Now, by (8.12.2) and induction, we have that for any r and k,


      A_{rk} Z =d Ar Z1 + Ar Z2 + · · · + Ar Zk =d Ar Ak Z.
By Exercise 8.8.4, this proves that
      A_{rk} = Ar Ak .                                             (8.12.4)
In particular, A_{r^k} = (Ar )^k . This, together with (8.12.3), implies that Ar ≥ 1 for all
r. By Exercise 8.8.6, Ar ≠ 1 for all r. Thus, Ar > 1 for all r.
Now take any r, s ≥ 2. Since Ar , As > 1, there are positive real numbers α
and β such that Ar = r1/α and As = s1/β . We claim that α = β. To prove this
claim, take any k and let m = sk . For k large enough, there is some j such that
n = rj satisfies n ≤ m ≤ rn. Then by (8.12.4),
      Am = A_{s^k} = (As )^k = s^{k/β} = m^{1/β} ≤ (rn)^{1/β} = r^{1/β} (An )^{α/β} .

But by (8.12.3) and the fact that m ≥ n, we have Am ≥ B^{−1} An . Thus, B^{−1} An ≤
r^{1/β} (An )^{α/β} . Since Ar > 1, it is impossible that this holds as k → ∞ unless α ≥ β.
Similarly, β ≥ α. This proves the claim. Thus, there is some α > 0 such that
An = n^{1/α} for all n.
Lastly, we prove that α ≤ 2. Suppose not. Let Sn := Z1 + · · · + Zn , so that
Sn =d An Z. Thus, there exists t > 0 such that for all n,
      P(|Sn | > tAn ) = P(|Z| > t) < 1/4.
By Lemma 8.12.5, this shows that B := sup_{n≥1} n(1 − F (tAn )) < ∞, where F
is the c.d.f. of Z. Take any x ≥ t. Then there is some n such that n^{1/α} ≤ x/t ≤
(n + 1)^{1/α} . For any such n,
      1 − F (x) ≤ 1 − F (tn^{1/α} ) = 1 − F (tAn ) ≤ B/n ≤ 2B/(n + 1) ≤ 2Bt^α /x^α .

Thus,
      E(Z^2 ) = ∫_0^∞ 2x P(|Z| ≥ x) dx ≤ t^2 + 4Bt^α ∫_t^∞ x^{1−α} dx,
which is finite since α > 2. But then, by the central limit theorem, (Z1 + · · · +
Zn )/n^{1/α} cannot converge in distribution as n → ∞. This gives a contradiction
which proves that α ≤ 2. 
E XERCISE 8.12.6. Show that if X is a symmetric stable random variable, it
must have characteristic function φ(t) = e^{−θ|t|^α} for some θ > 0 and α ∈ (0, 2].
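Theorem 8.12.4 can be illustrated with the standard Cauchy distribution, which is stable with α = 1, so An = n and (X1 + · · · + Xn )/n has the same law as X1 . The sketch below is our own illustration (not from the notes): it estimates P(|(X1 + · · · + Xn )/n| > 1), which equals 1/2 for a single standard Cauchy variable, and hence also for the scaled sum.

```python
import math
import random

def tail_prob_of_cauchy_mean(n=50, trials=20000, seed=1):
    # For the standard Cauchy law (stable with alpha = 1, A_n = n),
    # (X1 + ... + Xn)/n has the same law as X1, so P(|mean| > 1)
    # should be 1/2, exactly as for a single observation.
    rng = random.Random(seed)
    def cauchy():
        # inverse-c.d.f. sampling: tan(pi(U - 1/2)) is standard Cauchy
        return math.tan(math.pi * (rng.random() - 0.5))
    hits = sum(
        abs(sum(cauchy() for _ in range(n)) / n) > 1.0
        for _ in range(trials)
    )
    return hits / trials

p = tail_prob_of_cauchy_mean()
print(abs(p - 0.5) < 0.02)
```

Note the contrast with a finite-variance law, where the same average would concentrate near the mean as n grows; this is exactly the α = 2 boundary in the theorem.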
CHAPTER 9

Conditional expectation and martingales

In this chapter we will learn about the measure theoretic definition of condi-
tional expectation, about martingales and their properties, and some applications.

9.1. Conditional expectation


Conditional probability and conditional expectations have straightforward def-
initions when we are conditioning on events of nonzero probability, but problems
arise when we try to condition on events of probability zero — for example, when
trying to compute the distribution of Y given X = x, where X is a continuous
random variable. The following exercise gives an example.
E XERCISE 9.1.1. Let (X, Y ) be a point distributed uniformly in the unit disk
in R2 . What is the distribution of Y given X = 0? Show that the answer depends
on how we calculate P(Y ∈ A|X = 0). One way is to take the limit, as  → 0, of
P(Y ∈ A|X ∈ (−, )). Another way is to interpret the event {X = 0} as {Θ ∈
{π/2, 3π/2}}, where (R, Θ) is the representation of (X, Y ) in polar coordinates,
with Θ ranging in [0, 2π). In this second approach, P(Y ∈ A|X = 0) is the limit,
as  → 0, of P(Y ∈ A|Θ ∈ (π/2 − , π/2 + ) ∪ (3π/2 − , 3π/2 + )). Show
that the two answers are different. (This is an instance of the ‘Borel–Kolmogorov
paradox’, which is usually presented with a different example.)
The measure-theoretic approach to probability gives us a way of avoiding such
problematic issues. The key idea is to define the conditional expectation of a ran-
dom variable, not by conditioning on an event, but by conditioning on a σ-algebra.
Let (Ω, F, P) be a probability space and let X be a real-valued random variable
defined on this space. Suppose that X is integrable (that is, E|X| is finite). Let G
be a sub-σ-algebra of F. The conditional expectation of X given G, denoted by
E(X|G), is a G-measurable integrable random variable Y such that for any B ∈ G,
E(X; B) = E(Y ; B). It is not clear from the definition whether such a random
variable exists, but if it does, then almost sure uniqueness is not difficult to prove.
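On a finite probability space where G is generated by a partition, the defining property forces E(X|G) to be the block-by-block average of X, which makes the definition easy to verify by direct computation. The following sketch is our own illustration (the space, partition, and names are ours): it builds the candidate Y and checks E(X; B) = E(Y ; B) for every B ∈ G.

```python
from itertools import combinations

# Finite sample space with uniform probability; G is the sigma-algebra
# generated by a partition into blocks.
omega = [0, 1, 2, 3, 4, 5]
p = {w: 1.0 / 6.0 for w in omega}
partition = [{0, 1, 2}, {3, 4}, {5}]
X = {0: 1.0, 1: 4.0, 2: 1.0, 3: 2.0, 4: 6.0, 5: 3.0}

# Candidate for E(X|G): constant on each block, equal to the block average.
Y = {}
for block in partition:
    mass = sum(p[w] for w in block)
    avg = sum(X[w] * p[w] for w in block) / mass
    for w in block:
        Y[w] = avg

def E_on(Z, B):
    # E(Z; B) = E(Z 1_B)
    return sum(Z[w] * p[w] for w in B)

# Check the defining property for every B in G, i.e. every union of blocks.
for r in range(len(partition) + 1):
    for blocks in combinations(partition, r):
        B = set().union(*blocks) if blocks else set()
        assert abs(E_on(X, B) - E_on(Y, B)) < 1e-12
print("defining property E(X; B) = E(Y; B) holds for all B in G")
```

The uniqueness proved in the next lemma says that, up to null sets, this block-averaging Y is the only G-measurable random variable with this property.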
L EMMA 9.1.2. If Y and Z are two random variables that qualify as E(X|G),
then Y = Z a.s.

P ROOF. Let B := {ω : Y (ω) > Z(ω)}. Note that B ∈ G. Since E(X; B) =


E(Y ; B) = E(Z; B), we get E((Y − Z); B) = 0. But (Y − Z)1B is a nonnegative
random variable. Thus, (Y − Z)1B = 0 a.s., which implies that Y ≤ Z a.s.
Similarly, Y ≥ Z a.s. 

The definition of conditional expectation implies that we cannot have unique-


ness holding everywhere; almost everywhere is the best that we can hope for. How-
ever, we will generally treat conditional expectation as if it is a uniquely defined
random variable, because differences on null sets usually do not matter.
We will now show that conditional expectation always exists. To do this, we
will follow the usual route — that is, first show it for simple random variables, then
use monotonicity to show it for all nonnegative random variables, and finally use
positive and negative parts to cover all integrable random variables.
L EMMA 9.1.3. If X is a simple random variable, then E(X|G) exists.

P ROOF. Since X is simple, it is square-integrable. Let H denote the Hilbert


space L2 (Ω, G, P). Let {Yn }n≥1 be a sequence of elements of H such that
      lim_{n→∞} E(X − Yn )^2 = s := inf_{Y ∈H} E(X − Y )^2 .            (9.1.1)
Take any m and n. Let Z := (Ym + Yn )/2. Recall the parallelogram identity
(a − b)2 + (a + b)2 = 2a2 + 2b2 . Taking a = (X − Ym )/2 and b = (X − Yn )/2,
we get
      E(X − Z)^2 + (1/4) E(Ym − Yn )^2 = (1/2) E(X − Ym )^2 + (1/2) E(X − Yn )^2 .
Since Z ∈ H, E(X − Z)^2 ≥ s. Thus,
      (1/4) E(Ym − Yn )^2 ≤ (1/2) E(X − Ym )^2 + (1/2) E(X − Yn )^2 − s.
By (9.1.1), this proves that {Yn }n≥1 is a Cauchy sequence in H. Since H is a
Hilbert space, this sequence has a G-measurable L2 limit Y . Clearly,
      E(X − Y )^2 = lim_{n→∞} E(X − Yn )^2 = s.
Now take any Z ∈ H. Then for any λ ∈ R, Y + λZ ∈ H, and so by the above
property of Y , E(X − Y − λZ)2 is minimized when λ = 0. But this is just a
quadratic polynomial in λ, and so the condition that it is minimized at 0 implies
that the coefficient of λ, namely E((X − Y )Z), is zero. Taking Z = 1B for any
B ∈ G proves that Y = E(X|G). 
L EMMA 9.1.4. If X1 and X2 are simple random variables and a, b ∈ R, then
E(aX1 + bX2 |G) = aE(X1 |G) + bE(X2 |G) a.s.

P ROOF. It is easy to see from the definition that aE(X1 |G) + bE(X2 |G) is
a valid candidate for E(aX1 + bX2 |G), and therefore the result follows by the
uniqueness of conditional expectation. 
L EMMA 9.1.5. If X is a nonnegative simple random variable, then E(X|G) is
also a nonnegative random variable.

P ROOF. Let Y = E(X|G). Let B := {ω : Y (ω) < 0}. Then B ∈ G,


and hence E(Y ; B) = E(X; B). Since X is nonnegative, E(X; B) ≥ 0. Thus,
E(Y ; B) ≥ 0. But this implies that Y ≥ 0 a.s. 

L EMMA 9.1.6. If X is [0, ∞]-valued, E(X|G) exists and is [0, ∞]-valued. If


X is integrable, then so is E(X|G), and these random variables have the same
expected value. Lastly, if X1 and X2 are [0, ∞]-valued random variables and
a, b ≥ 0, then E(aX1 + bX2 |G) = aE(X1 |G) + bE(X2 |G) a.s.

P ROOF. By Proposition 2.3.6, we can find a sequence of nonnegative simple


functions {Xn }n≥1 increasing to X. By Lemma 9.1.3, Yn := E(Xn |G) exists for
every n. By Lemma 9.1.4, E(Xn − Xn−1 |G) = Yn − Yn−1 for every n. Therefore
by Lemma 9.1.5, {Yn }n≥1 is also an increasing sequence of nonnegative random
variables. Let Y be the pointwise limit of this sequence. Then for any B ∈ G, the
monotone convergence theorem gives us
      E(X; B) = lim_{n→∞} E(Xn ; B) = lim_{n→∞} E(Yn ; B) = E(Y ; B).
Thus, Y = E(X|G). Clearly, Y is [0, ∞]-valued. If E(X) < ∞, then E(Y ) =
E(Y ; Ω) = E(X; Ω) = E(X) is also finite. Linearity is proved exactly as in
Lemma 9.1.4. 
Finally, we arrive at the main result of this section.
T HEOREM 9.1.7. For any integrable random variable X, E(X|G) exists and
is unique almost everywhere. Moreover, |E(X|G)| ≤ E(|X||G) a.s.

P ROOF. We have already established uniqueness in Lemma 9.1.2. To prove


existence, let X + and X − denote the positive and negative parts of X. Then by
Lemma 9.1.6, Y1 := E(X + |G) and Y2 := E(X − |G) exist. Let Y := Y1 −Y2 . Since
X is integrable, so are X + and X − . So by the last assertion of Lemma 9.1.6, Y1
and Y2 are integrable, and thus, so is Y . Therefore for any B ∈ G,
E(X; B) = E(X + ; B) − E(X − ; B)
= E(Y1 ; B) − E(Y2 ; B) = E((Y1 − Y2 ); B) = E(Y ; B).
Thus, Y qualifies as E(X|G). Finally, by the linearity of conditional expectation
for nonnegative random variables, |Y | ≤ Y1 + Y2 = E(X + |G) + E(X − |G) =
E(|X||G) a.s. 
Thus, we have established the existence and uniqueness of the conditional ex-
pectation of an integrable random variable given a σ-algebra. One can similarly
define the conditional probability of an event, as P(A|G) := E(1A |G). Another
common notation is that if X and Y are two random variables, then E(Y |σ(X)) is
written as E(Y |X).
E XERCISE 9.1.8. Show that if A and B are two events with P(B) > 0, and G
is the σ-algebra generated by B (that is, G = {∅, B, B c , Ω}), then
      P(A|G) = P(A ∩ B)/P(B)   a.s. on the event B.

E XERCISE 9.1.9. If X is a random variable that takes values in some finite or
countable set X , then for any other integrable random variable Y , E(Y |X) = g(X) a.s.,
where g is defined as
      g(x) = E(Y |X = x) := E(Y ; X = x)/P(X = x),
when P(X = x) > 0. When P(X = x) = 0, g(x) may be defined arbitrarily.
E XERCISE 9.1.10. If (X, Y ) is a pair of real-valued random variables having a
joint probability density function f , then E(Y |X) = g(X) a.s., where g is defined
as
      g(x) = E(Y |X = x) := ∫_R y f (y|x) dy,
where f (y|x) is the conditional probability density function of Y given X = x,
defined as
      f (y|x) := f (x, y) / ∫_R f (x, z) dz
where the denominator is nonzero. If the denominator is zero, f (y|x) may be
defined arbitrarily.
E XERCISE 9.1.11. Let Ω = [0, 1), equipped with its Borel σ-algebra, so that a
real-valued random variable X defined on Ω is just a measurable map from [0, 1)
into R. Take any n ≥ 1. Let Fn be the σ-algebra generated by the intervals
[i/n, (i + 1)/n), 0 ≤ i ≤ n − 1. What is E(X|Fn )?
E XERCISE 9.1.12. If (X, Y ) is distributed uniformly on the unit disk in R2
and A is a Borel subset of R, compute P(Y ∈ A|X).
E XERCISE 9.1.13. In the above exercise, let (R, Θ) be the representation of
(X, Y ) in polar coordinates, with Θ ranging in [0, 2π). Compute P(Y ∈ A|Θ).
(Observe that in the measure-theoretic formulation, unlike our attempt in Exer-
cise 9.1.1, the ‘conditional probability of Y ∈ A given the event X = 0’ has
no meaning, and so there is no contradiction. We can use the random variable
P(Y ∈ A|X) to compute P(Y ∈ A, X ∈ B) for any B, by integrating P(Y ∈
A|X) over the set {X ∈ B}. Similarly, P(Y ∈ A|Θ) can be used to compute
P(Y ∈ A, Θ ∈ B) for any B, by integrating P(Y ∈ A|Θ) over the set {Θ ∈ B}.)

9.2. Basic properties of conditional expectation


It is clear from the definition that if conditional expectation exists, it must be
linear. That is,
E(aX1 + bX2 |G) = aE(X1 |G) + bE(X2 |G) a.s.
This holds because the right side is easily seen to be a valid candidate for the condi-
tional expectation of aX1 +bX2 given G, and we know that conditional expectation
is unique.

Conditional expectation behaves nicely under independence. If X is indepen-


dent of a σ-algebra G, then E(X|G) = E(X) a.s. To see this, simply note that for
any A ∈ G, E(X; A) = E(X)P(A) = E(E(X)1A ).
On the other hand, if X is G-measurable, it is clear that E(X|G) = X a.s.,
since X satisfies the defining property of E(X|G).
By Lemma 9.1.6, the conditional expectation of a nonnegative random variable
is a nonnegative random variable. Together with linearity, this implies that condi-
tional expectation is monotone. That is, if X ≤ Y a.s., then E(X|G) ≤ E(Y |G).
Moreover, by the monotone convergence theorem, this implies that if {Xn }n≥1 is
a sequence of nonnegative random variables increasing to a limit X, then
      E(X|G) = lim_{n→∞} E(Xn |G) a.s.
One may say that this is the monotone convergence theorem for conditional expec-
tation, or simply the conditional monotone convergence theorem. An immediate
consequence of the conditional monotone convergence theorem is conditional Fa-
tou’s lemma, which says that if {Xn }n≥1 is a sequence of nonnegative random
variables, then
      E(lim inf_{n→∞} Xn |G) ≤ lim inf_{n→∞} E(Xn |G) a.s.
To prove this, let Yn := inf_{k≥n} Xk . Then Yn increases to Y := lim inf_{n→∞} Xn ,
and Yn ≤ Xn for each n. Therefore by the conditional monotone convergence
theorem and the monotonicity of conditional expectation,
      E(Y |G) = lim_{n→∞} E(Yn |G) ≤ lim inf_{n→∞} E(Xn |G) a.s.
Similarly, we have the conditional dominated convergence theorem: Suppose that
Xn → X a.s. and |Xn | is uniformly bounded by an integrable random variable
Z. Then E(Xn |G) → E(X|G) a.s. as n → ∞. To prove this, first note that
{Xn +Z}n≥1 is a sequence of nonnegative random variables converging to X +Z,
and so by the conditional Fatou’s lemma,
      E(X + Z|G) ≤ lim inf_{n→∞} E(Xn + Z|G) a.s.,
which gives
      E(X|G) ≤ lim inf_{n→∞} E(Xn |G) a.s.
Similarly, Z − Xn is a sequence of nonnegative random variables converging to
Z − X, so again by the conditional Fatou’s lemma,
      E(Z − X|G) ≤ lim inf_{n→∞} E(Z − Xn |G) a.s.,
which gives
      E(X|G) ≥ lim sup_{n→∞} E(Xn |G) a.s.
Combining the two inequalities, we get the desired result. Under the hypothe-
ses of the conditional dominated convergence theorem, we can also prove that
E(Xn |G) → E(X|G) in L1 , as follows. First, note that for each n, |E(Xn |G)| ≤
E(|Xn ||G) ≤ E(Z|G). By the definition of conditional expectation, E(E(Z|G)) =
E(Z) < ∞. Lastly, we know that E(Xn |G) → E(X|G) a.s. So, applying the L1

assertion of the dominated convergence theorem, we get that E(Xn |G) → E(X|G)
in L1 .
The above properties of conditional expectation are all analogues of properties
of unconditional expectation. Conditional expectation has a few properties that
have no unconditional analogue. One example is the tower property, which says
that if we have two σ-algebras G′ ⊆ G, then
      E(X|G′ ) = E(E(X|G)|G′ ) a.s.
To prove this, take any B ∈ G′ . Then B ∈ G, and hence
      E(E(X|G); B) = E(X; B) = E(E(X|G′ ); B).
Thus, E(X|G′ ) satisfies the defining property of the conditional expectation of
E(X|G) given G′ , which proves the claim.
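The tower property can be checked concretely on a finite space with nested partition-generated σ-algebras, where conditioning is just block averaging. A minimal sketch (the partitions and names below are ours):

```python
# Nested partitions of a finite uniform space: 'coarse' generates G' and
# 'fine' refines it, so G' is a sub-sigma-algebra of G.
omega = list(range(8))
fine = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]
coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]
X = {w: float(w * w) for w in omega}

def cond_exp(Z, partition):
    # Conditional expectation given a partition-generated sigma-algebra:
    # replace Z by its average over each block (uniform measure).
    out = {}
    for block in partition:
        avg = sum(Z[w] for w in block) / len(block)
        for w in block:
            out[w] = avg
    return out

lhs = cond_exp(X, coarse)                  # E(X|G')
rhs = cond_exp(cond_exp(X, fine), coarse)  # E(E(X|G)|G')
assert all(abs(lhs[w] - rhs[w]) < 1e-12 for w in omega)
print("tower property verified")
```

The calculation works because averaging over a coarse block can be done in two stages: first over the fine blocks it contains, then over those averages.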
A second example is the property that if Y is G-measurable and XY is inte-
grable (in addition to X being integrable), then
E(XY |G) = Y E(X|G) a.s. (9.2.1)
To prove this, let us first prove the weaker statement
E(XY ) = E(Y E(X|G)). (9.2.2)
Note that this follows from the definition of conditional expectation if Y = 1B
for some B ∈ G. Therefore it holds for any simple G-measurable Y . Next, if
X and Y are both nonnegative, we can take a sequence of nonnegative, simple, G-
measurable random variables increasing to Y and apply the monotone convergence
theorem on both sides to get (9.2.2). In particular, this shows that if X and Y are
nonnegative and XY is integrable, then so is Y E(X|G).
Finally, in the general case, note that the integrability of XY implies that
X + Y + , X + Y − , X − Y + and X − Y − are all integrable. By linearity, this gives
us (9.2.2) for any X and Y such that X and XY are integrable, and Y is G-
measurable.
Next, to get (9.2.1), replace Y by Y 1B in (9.2.2), where B ∈ G. Since the
integrability of XY implies that of XY 1B , we get
E(XY ; B) = E(Y E(X|G); B).
Since this holds for any B ∈ G, we get (9.2.1).
The following result is another basic property of conditional expectation that
is often useful.
P ROPOSITION 9.2.1. Let S and T be two measurable spaces. Let X be an
S-valued random variable and Y be a T -valued random variable, defined on the
same probability space, such that X and Y are independent. Let φ : S ×T → R be
a measurable function (with respect to the product σ-algebra) such that φ(X, Y )
is an integrable random variable. For each x ∈ S, define
ψ(x) := E(φ(x, Y )).
Then ψ is a measurable function, ψ(X) is an integrable random variable and
ψ(X) = E(φ(X, Y )|X) a.s.

P ROOF. Take any A ∈ σ(X). Then A = X −1 (B) for some B in the σ-algebra
of S. Let µ be the law of X and ν be the law of Y . Then by Exercise 6.1.2, the
integrability of φ(X, Y ), and Fubini’s theorem,
      E(φ(X, Y ); A) = ∫_S ∫_T φ(x, y) 1B (x) dν(y) dµ(x)
                     = ∫_B ∫_T φ(x, y) dν(y) dµ(x) = ∫_B ψ(x) dµ(x)
                     = ∫_S ψ(x) 1B (x) dµ(x) = E(ψ(X); A).
This shows that ψ(X) is indeed a version of E(φ(X, Y )|X). The measurability of
ψ and the integrability of ψ(X) are also consequences of Fubini’s theorem. 
In the following exercises, (Ω, F, P) is a probability space on which all our
random variables are defined, and G is an arbitrary sub-σ-algebra of F.
E XERCISE 9.2.2. If Xn → X in L1 , prove that E(Xn |G) → E(X|G).

E XERCISE 9.2.3. On the unit interval with Lebesgue measure, let X(ω) = ω
and for an arbitrary positive integer n let Y (ω) = nω − [nω], where [x] denotes
the largest integer ≤ x. What is E(X|Y )? Explain your answer.

E XERCISE 9.2.4. Suppose E(Xi ) exists for i = 1, 2 and E(X1 ; A) ≤ E(X2 ; A)


for all A ∈ F. Show that P(X1 ≤ X2 ) = 1.

E XERCISE 9.2.5. Suppose X and Y are square integrable. Show that for every
sub-σ-algebra G ⊂ F, E[XE(Y |G)] = E[E(X|G)Y ].

E XERCISE 9.2.6. Let Var(X|F) := E(X^2 |F) − (E(X|F))^2 . Prove that


Var(X) = E(Var(X|F)) + Var(E(X|F)).

E XERCISE 9.2.7. Suppose that X, Y ∈ L1 (Ω, F, P) and that E(X|Y ) = Y


a.s. and E(Y |X) = X a.s. Prove that X = Y a.s.

E XERCISE 9.2.8. Let S be a random variable with P(S > t) = exp(−t) for
t > 0. Find E(S| max{S, t}) and E(S| min{S, t}) for each t > 0.

E XERCISE 9.2.9. Suppose θ is 1 or 0 with probabilities p and 1 − p, respec-


tively, and independently of θ, X and Y are independent and identically distributed.
Let Z = (Z1 , Z2 ), where Z1 = θX + (1 − θ)Y and Z2 = (1 − θ)X + θY . Show
that Z and θ are independent. Then, find an explicit expression for E(g(X, Y )|Z),
for an arbitrary bounded Borel-measurable function g.

E XERCISE 9.2.10. Suppose that E(Xn ) exists for all n, E(X) exists and X1 ≤
X2 ≤ · · · → X with probability one. Show that E(Xn |G) → E(X|G) a.e. on
{E(X1 |G) > −∞} as n → ∞. Hint: Consider B = {E(X1 |G) ≥ −b} and the
random variables Xn 1B .

E XERCISE 9.2.11. Let F1 ⊂ F2 ⊂ · · · ⊂ F = σ(∪_{i=1}^∞ Fi ). Let X ∈ L^1 (F).
Let A be an algebra that generates the σ-algebra F. First, show that for any  > 0
there exists a random variable Y that is a finite linear combination of indicator
variables of sets in A such that E|X − Y | < . (Hint: See Theorem 1.2.6.) Us-
ing only this result and the basic properties of conditional expectations show that
E(X|Fn ) − X → 0 in L1 as n → ∞.

E XERCISE 9.2.12. Suppose that 0 ≤ Xn → 0 in probability, Xn ≤ Y with


probability one for all n, and E(Y ) < ∞. Prove that E(Xn |G) → 0 in probability
as n → ∞. Hint: Break up E(Xn |G) into three parts according as Xn < ε,
ε ≤ Xn ≤ a, or Xn > a.

E XERCISE 9.2.13. Let (Ω, F, P) be a probability space, and let G1 and G2 be


two independent sub-σ-algebras of F. Let G := σ(G1 ∪ G2 ). If an integrable
random variable X has the property that σ(σ(X) ∪ G1 ) is independent of G2 , show
that E(X|G) = E(X|G1 ) a.s.

9.3. Jensen’s inequality for conditional expectation


The following result is known as Jensen’s inequality for conditional expecta-
tion. Its proof is a bit more involved than the proof of ordinary Jensen’s inequality.
T HEOREM 9.3.1 (Jensen’s inequality for conditional expectation). Let X be
an integrable random variable defined on some probability space (Ω, F, P). Let I
be an interval containing the range of X, and let φ : I → R be a convex function
such that φ(X) is integrable. Then for any σ-algebra G ⊆ F, E(φ(X)|G) ≥
φ(E(X|G)) a.s.
For the proof, we need a couple of lemmas about convex functions.
L EMMA 9.3.2. The supremum of any collection of convex functions defined on
an interval is convex.

P ROOF. Let A be a collection of convex functions defined on an interval I.


Let g(x) := supf ∈A f (x). Take any x, y ∈ I, with x ≤ y, and any t ∈ [0, 1]. Then
for any f ∈ A,
f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y) ≤ tg(x) + (1 − t)g(y).
Taking supremum over f ∈ A on the left, we get
g(tx + (1 − t)y) ≤ tg(x) + (1 − t)g(y).
Thus, g is convex. 
L EMMA 9.3.3. Let I be an interval and φ : I → R be a convex function. Then
there are sequences {an }n≥1 and {bn }n≥1 such that φ(x) = supn≥1 (an x + bn )
for all x ∈ I.

P ROOF. Take any x ∈ Q ∩ I. By Exercise 4.2.5, there are numbers ax and bx


such that φ(y) ≥ ax y + bx for all y ∈ I, and φ(x) = ax x + bx . This shows that if
we define
      g(x) := sup_{y∈Q∩I} (ay x + by )
for each x ∈ I, then φ(x) = g(x) for each x ∈ Q ∩ I. By Lemma 9.3.2, g is
convex, and hence continuous. Thus, φ and g are two continuous functions that
agree on a dense subset of I. Therefore φ and g must agree everywhere on I. 

P ROOF OF T HEOREM 9.3.1. Let an and bn be as in Lemma 9.3.3. Then for


any n,
E(φ(X)|G) ≥ E(an X + bn |G) = an E(X|G) + bn a.s.
Implicit in the above display is that we have chosen and fixed some versions of
the conditional expectations displayed on the two sides. Since the above inequal-
ity holds almost surely for any given n, they hold simultaneously for all n with
probability one. By the monotonicity of conditional expectation, it follows that
E(X|G) ∈ I a.s. Thus, with probability one,
      E(φ(X)|G) ≥ sup_{n≥1} (an E(X|G) + bn ) = φ(E(X|G)) a.s.,

completing the proof of the theorem. 
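For a partition-generated σ-algebra on a finite space, Theorem 9.3.1 reduces to the statement that on each block the average of φ(X) dominates φ of the average. This is easy to verify directly; the small sketch below, with φ(x) = x², is our own illustration and all names in it are ours.

```python
# Conditional Jensen with phi(x) = x^2 on a partition-generated
# sigma-algebra: on each block the average of the squares dominates
# the square of the average.
omega = list(range(6))
partition = [{0, 1, 2}, {3, 4, 5}]
X = {0: -1.0, 1: 2.0, 2: 0.5, 3: 3.0, 4: -2.0, 5: 1.0}

def cond_exp(Z, partition):
    # average Z over each block (uniform measure)
    out = {}
    for block in partition:
        avg = sum(Z[w] for w in block) / len(block)
        for w in block:
            out[w] = avg
    return out

lhs = cond_exp({w: X[w] ** 2 for w in omega}, partition)      # E(phi(X)|G)
rhs = {w: v ** 2 for w, v in cond_exp(X, partition).items()}  # phi(E(X|G))
assert all(lhs[w] >= rhs[w] - 1e-12 for w in omega)
print("E(phi(X)|G) >= phi(E(X|G)) pointwise")
```

On each block the gap lhs − rhs is exactly the conditional variance, which connects this inequality to Exercise 9.2.6.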

9.4. Martingales
Let (Ω, F, P) be a probability space. An increasing sequence of σ-algebras
F0 ⊆ F1 ⊆ F2 ⊆ · · · contained in F is called a filtration. A sequence of ran-
dom variables {Xn }n≥0 is said to be adapted to this filtration if for each n, Xn is
Fn -measurable. An adapted sequence is called a martingale if for each n, Xn is
integrable, and
E(Xn+1 |Fn ) = Xn a.s.
Note that by the tower property, E(Xn |Fm ) = Xm a.s. whenever n ≥ m. Also
note that E(Xn ) = E(X0 ) for all n.
E XAMPLE 9.4.1 (Simple random walk). Perhaps the simplest example of a
nontrivial martingale is a random walk with mean zero increments. Let X1 , X2 , . . .
be independent, integrable random variables with E(Xi ) = 0 for all i. Let S0 := 0
and Sn := X1 + · · · + Xn . Let Fn be the σ-algebra generated by X1 , . . . , Xn
for n ≥ 1, and let F0 be the trivial σ-algebra. Then it is easy to see that {Sn }n≥0
is a martingale adapted to the filtration {Fn }n≥0 . To see this, first note that since
E|Xi | < ∞ for each i, the triangle inequality shows that Sn is integrable for
expectation, and the fact that Sn−1 is Fn−1 measurable, we have
E(Sn |Fn−1 ) = E(Sn−1 + Xn |Fn−1 )
= Sn−1 + E(Xn |Fn−1 ).

But, since Xn is independent of Fn−1 , E(Xn |Fn−1 ) = E(Xn ) = 0. This proves


the martingale property.

E XERCISE 9.4.2. Let X1 , X2 , . . . be square-integrable independent random


variables, and let Fn := σ(X1 , . . . , Xn ). Let µi := E(Xi ) and σi2 := Var(Xi ).
Show that
      Zn := Σ_{i=1}^{n} (Xi − µi )^2 − Σ_{i=1}^{n} σi^2
is a martingale adapted to Fn .

E XERCISE 9.4.3. Let X1 , X2 , . . . be i.i.d. random variables and Sn := X1 +


· · · + Xn . Suppose that for some θ ∈ R \ {0}, m(θ) := E(eθX1 ) is finite. Then
show that
      Mn := e^{θSn} /m(θ)^n
is a martingale adapted to Fn := σ(X1 , . . . , Xn ).

E XERCISE 9.4.4. Let (Ω, F, P) be a probability space, and let {Fn }n≥0 be
a filtration of sub-σ-algebras of F. Let X be an integrable random variable de-
fined on Ω. Let Yn := E(X|Fn ). Show that {Yn }n≥0 is a martingale adapted to
{Fn }n≥0 .

9.5. Stopping times


A random variable T taking values in {0, 1, 2, . . .} ∪ {∞} is called a stopping
time for a filtration {Fn }n≥0 if for each finite n, the event {T = n} is Fn -
measurable.
A large class of stopping times arise in the following way. Let {Fn }n≥0 be a
filtration of σ-algebras. Let {Xn }n≥0 be a sequence of real-valued random vari-
ables adapted to this filtration — that is, for each n, Xn is Fn -measurable. Let A
be a Borel subset of R. Let T := inf{n : Xn ∈ A}, where the infimum is inter-
preted as ∞ if Xn ∉ A for all n. This is a stopping time for the filtration {Fn }n≥0 ,
because for any finite n,
      {T = n} = {Xn ∈ A} ∩ {Xj ∉ A for all j < n},
which is an element of Fn by the given conditions.
E XERCISE 9.5.1. Let {Sn }n≥0 be a simple symmetric random walk on Z start-
ing at the origin. That is, S0 = 0, and for each n, Sn+1 − Sn is 1 or −1 with
equal probability, irrespective of past events. Let Fn be the σ-algebra generated by
S0 , . . . , Sn . Take any integers a < 0 < b, and let Ta,b := min{n : Sn = a or Sn =
b}. Show that Ta,b is a stopping time with respect to the filtration {Fn }n≥0 .
E XERCISE 9.5.2. In the above exercise, show that E(Ta,b ) < ∞, and hence
that Ta,b is finite a.s. Hint: Consider the first occurrence of a continuous sequence
of a + b positive jumps of the walk, and use it to obtain a random upper bound on
Ta,b .

E XERCISE 9.5.3. Let Sn be as above. Let T := max{n : Sn ≥ n}. Show


that T < ∞ a.s., and that T is not a stopping time. Hint: For the first part, use the
strong law of large numbers. For the second, produce an argument by contradiction
to show that {T = 0} ∉ F0 .
E XERCISE 9.5.4. Let {Fn }n≥0 be a filtration. Show that a random variable T
taking value in {0, 1, 2, . . .} ∪ {∞} is a stopping time with respect to this filtration
if and only if {T > n} ∈ Fn for all n ≥ 0, and also if and only if {T ≤ n} ∈ Fn
for all n ≥ 0.
Given any stopping time T , there is an associated σ-algebra called the stopped
σ-algebra, denoted by FT . It is defined as
FT := {A ∈ F : A ∩ {T = n} ∈ Fn for all n}.
Intuitively, FT encodes ‘all the information up to time T ’. Alternatively, an event is
in FT if and only if we can determine whether it is true by observing what happens
up to time T .
E XERCISE 9.5.5. Show that a stopping time T is always measurable with re-
spect to the stopped σ-algebra FT .
E XERCISE 9.5.6. Give an example to show that the σ-algebra generated by a
stopping time T may be a proper subset of FT .
E XERCISE 9.5.7. Let Ta,b be as in Exercise 9.5.1. Take any k ≥ 1, and let A
be the event that by time Ta,b , the random walk visits 0 exactly k times. Show that
A ∈ FTa,b .
E XERCISE 9.5.8. Let {Xn }n≥0 be a sequence of random variables adapted to
a filtration {Fn }n≥0 . Let T be a stopping time with respect to this filtration. The
stopped random variable XT is defined as XT (ω) := X_{T (ω)} (ω). Show that XT is
FT -measurable.
An important observation is that if S and T are stopping times such that S ≤ T
always, then FS ⊆ FT . To see this, take any A ∈ FS . Then for any n,
A ∩ {T = n} = A ∩ {S ≤ n} ∩ {T = n},
since S ≤ T . But A ∩ {S ≤ n} ∈ Fn since A ∈ FS , and {T = n} ∈ Fn since T
is a stopping time. Therefore A ∩ {T = n} ∈ Fn .
Note that a random variable T that is identically equal to a positive integer n
is trivially a stopping time. For this T , it is easy to see that FT = Fn .
A stopping time is called bounded if there is a constant c such that T ≤ c
always.

9.6. Optional stopping theorem


One of the most important results relating martingales and stopping times is the
optional stopping theorem. The simplest version of this theorem goes as follows.

T HEOREM 9.6.1 (Optional stopping theorem). Let {Xn }n≥0 be a martingale


adapted to a filtration {Fn }n≥0 . Let S and T be bounded stopping times for
this filtration, such that S ≤ T always. Then XS and XT are integrable, and
E(XT |FS ) = XS a.s. In particular, E(XT ) = E(X0 ).

P ROOF. Since T is a bounded stopping time, there is an integer n such that


S ≤ T ≤ n always. To prove integrability of XS and XT , simply note that |XT |
and |XS | are both bounded by |X0 | + · · · + |Xn |. Take any A ∈ FS . Then note
that by the given conditions,
      E(Xn ; A) = Σ_{i=0}^{n} E(Xn ; {T = i} ∩ A).

But {T = i} ∩ A ∈ Fi since A ∈ FS ⊆ FT , and E(Xn |Fi ) = Xi since {Xj }j≥0


is a martingale. So,
E(Xn ; {T = i} ∩ A) = E(E(Xn |Fi ); {T = i} ∩ A)
= E(Xi ; {T = i} ∩ A).
Summing over i, we get
      E(Xn ; A) = Σ_{i=0}^{n} E(Xi ; {T = i} ∩ A) = E(XT ; A).

But by the same argument, E(Xn ; A) = E(XS ; A). Therefore E(XT ; A) =


E(XS ; A) for all A ∈ FS . By Exercise 9.5.8, XS is FS -measurable. Thus,
E(XT |FS ) = XS a.s. Considering the special case S ≡ 0, we get E(XT |F0 ) =
X0 , which gives E(XT ) = E(X0 ). 

Although most stopping times that occur in practice are unbounded, there is a
simple trick that lets us apply the optional stopping theorem in a great variety of situations.
Let T be a stopping time with respect to a filtration {Fn }n≥0 . Take any n ≥ 0,
and let T ∧ n denote the minimum of T and n. That is, T ∧ n is a random variable
defined as (T ∧ n)(ω) := min{T (ω), n}. Then T ∧ n is a stopping time. To see this,
note that for any k ≥ 0,
      {T ∧ n > k} = {T > k} if n > k,   and   {T ∧ n > k} = ∅ if n ≤ k,
and apply Exercise 9.5.4. But note that T ∧ n is also bounded. Thus, we can apply
the optional stopping theorem with T ∧ n, and later take n → ∞ to recover T , because
lim_{n→∞} T ∧ n = T on the set {T < ∞}.
To see how this works, consider the following. Let {Sn }n≥0 be a simple sym-
metric random walk on Z, starting at the origin. Take any a < 0 < b, and let
T := min{n : Sn = a or Sn = b}. Let Fn be the σ-algebra generated by
S0 , . . . , Sn . As noted in Section 9.4, {Sn }n≥0 is a martingale adapted to the filtra-
tion {Fn }n≥0 . Take any n. Then T ∧ n is a bounded stopping time, and hence, by

the optional stopping theorem,


E(ST ∧n ) = E(S0 ) = 0. (9.6.1)
By Exercise 9.5.2, T < ∞ a.s. Thus, for almost every ω, T (ω) ∧ n = T (ω) for all
sufficiently large n, and hence
      lim_{n→∞} ST ∧n (ω) = lim_{n→∞} S_{T (ω)∧n} (ω) = S_{T (ω)} (ω) = ST (ω).
In other words, ST ∧n → ST a.s. as n → ∞. Note also that Sn remains in the inter-
val [a, b] up to time T , and therefore, |ST ∧n | ≤ max{|a|, b} for all n. Combining
this with (9.6.1) and the dominated convergence theorem, we get that E(ST ) = 0.
But ST can take only two values, a and b. Thus,
0 = E(ST ) = aP(ST = a) + b(1 − P(ST = a)),
which gives
      P(ST = a) = b/(b − a).
Thus, the chance that the walk exits the interval [a, b] through a is b/(b−a), and the
chance that it exits through b is −a/(b − a). (This is known as the gambler’s ruin
problem: If a gambler starts with x dollars, and wins or loses 1 dollar with equal
probability at each turn, then what is the chance that he will reach a threshold of y
dollars before hitting 0 (that is, getting ruined)? It is easy to use the above formula
to show that in this case, the chance is x/y.)
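The exit probability b/(b − a) derived above is easy to corroborate by simulation. The sketch below is our own code (with a = −3 and b = 5, so b/(b − a) = 5/8); it also records the average exit time, which should come out near |a|b = 15, a value one can derive from the martingale Sn² − n.

```python
import random

def exit_stats(a=-3, b=5, trials=20000, seed=2):
    # Simulate a simple symmetric random walk from 0 until it hits a or b.
    # Returns the fraction of walks exiting at a and the average exit time.
    rng = random.Random(seed)
    hits_a = 0
    total_time = 0
    for _ in range(trials):
        s, n = 0, 0
        while a < s < b:
            s += 1 if rng.random() < 0.5 else -1
            n += 1
        hits_a += (s == a)
        total_time += n
    return hits_a / trials, total_time / trials

p_a, mean_T = exit_stats()
print(abs(p_a - 5 / 8) < 0.02)   # theory: P(S_T = a) = b/(b - a)
print(abs(mean_T - 15) < 0.5)    # theory: E(T) = |a|b
```

Each simulated walk terminates almost surely by Exercise 9.5.2, so the loop is safe; the Monte Carlo error with 20000 trials is small compared to the tolerances used.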
E XERCISE 9.6.2. In the above setting, compute E(T ) using the martingale
S_n^2 − n.
EXERCISE 9.6.3. If the random walk is biased, that is, the chance of a
positive step is p ≠ 1/2, show that (q/p)^{Sn} is a martingale, where q = 1 − p.
Using this, compute P(ST = a).
E XERCISE 9.6.4. Let X1 , X2 , . . . be a sequence of i.i.d. random variables with
negative mean and a nonzero probability of being positive, and finite moment gen-
erating function m. Show that there exists θ∗ > 0 such that m(θ∗ ) = 1, and that

{e^{θ∗ Sn}}n≥0 is a martingale, where Sn = X1 + · · · + Xn and S0 = 0. Check that
the martingale in the previous exercise is a special case of this one when p < 1/2.
Hint: For the first part, first show that m is differentiable everywhere, then show
that m′(0) < 0, and finally show that m(θ) → ∞ as θ → ∞.
E XERCISE 9.6.5. Let all notation be as in the previous exercise. Let M :=
maxn≥0 Sn . Show that M is finite a.s. Next, using the martingale from the previous exercise, prove that for any x ∈ (0, ∞), P(M ≥ x) ≤ e^{−θ∗ x}. Hint: Use the
stopping time T := inf{n ≥ 0 : Sn ≥ x} and find an upper bound on P(T < ∞).
E XERCISE 9.6.6. Let Sn be the total assets of an insurance company at the
end of year n. In year n, premiums totaling c > 0 are received and claims ζn
are paid where ζn ∼ N (µ, σ 2 ) and µ < c. To be precise, if ξn = c − ζn , then
Sn = Sn−1 + ξn . The company is ruined if its assets drop to 0 or less. Show that
if S0 > 0 is nonrandom, then P(ruin) ≤ exp(−2(c − µ)S0 /σ 2 ).

E XERCISE 9.6.7 (Wald’s equation). Let X1 , X2 , . . . be i.i.d. integrable random


variables with mean µ and let Sn := X1 + · · · + Xn . Let Fn = σ(X1 , . . . , Xn ) for
n ≥ 1, and let F0 be the trivial σ-algebra. Let T be a stopping time with respect to
the filtration {Fn }n≥0 . If E(T ) < ∞, show that E(ST ) = µE(T ). Hint: Show that
for all n, |ST ∧n | is bounded by Σ_{i=1}^{∞} |Xi | 1_{T ≥i} , and that this random variable is
integrable.
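Wald's equation can also be sanity-checked numerically. In the hypothetical example below (my own toy choice, not from the text), the steps are ±1 with P(Xi = 1) = 0.6, so µ = 0.2, and T is the first time Sn reaches level 10. Because the steps are ±1, ST = 10 exactly, so Wald's equation forces E(T) = 10/µ = 50:

```python
import random

rng = random.Random(1)
p, level = 0.6, 10               # arbitrary toy parameters
mu = 2 * p - 1                   # mean step, here 0.2
trials = 4000
total_T = 0
for _ in range(trials):
    s = n = 0
    while s < level:             # T = inf{n : S_n >= level}; finite a.s. since mu > 0
        s += 1 if rng.random() < p else -1
        n += 1
    total_T += n                 # here S_T = level exactly, since steps are +-1
mean_T = total_T / trials        # Monte Carlo estimate of E(T)
```

The product µ · mean_T should be close to the level 10, as Wald's equation E(ST) = µE(T) predicts.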
E XERCISE 9.6.8. Let X1 , X2 , . . . be i.i.d. integrable random variables with
mean 0 and let Sn := X1 + · · · + Xn . Take any x > 0, and let T := inf{n ≥ 0 :
Sn ≥ x}. Prove that E(T ) = ∞. Hint: Use the previous exercise.

9.7. Submartingales and supermartingales


Let {Fn }n≥0 be a filtration of σ-algebras. A sequence of integrable random
variables {Xn }n≥0 adapted to this filtration is called a submartingale if for each n,
we have
Xn ≤ E(Xn+1 |Fn ) a.s.,
and a supermartingale if for each n, we have
Xn ≥ E(Xn+1 |Fn ) a.s.
A simple way to produce a submartingale is to apply a convex function to a mar-
tingale. Indeed, if {Yn }n≥0 is a martingale and φ is a convex function defined on
an interval containing the ranges of the Yn ’s, such that φ(Yn ) is integrable for all
n, then by the conditional Jensen inequality, we have
E(φ(Yn+1 )|Fn ) ≥ φ(E(Yn+1 |Fn )) = φ(Yn ).
One can also produce submartingales from other submartingales by applying func-
tions that are both convex and non-decreasing. Indeed, if {Xn }n≥0 is a submartin-
gale and φ is such a function, and φ(Xn ) is integrable for all n, then
E(φ(Xn+1 )|Fn ) ≥ φ(E(Xn+1 |Fn )) ≥ φ(Xn ).
Similarly, supermartingales can be produced by applying concave functions to mar-
tingales, or concave non-increasing functions to supermartingales.
Many results for martingales have analogous versions for submartingales and
supermartingales. The following exercises contain some of these.
E XERCISE 9.7.1 (Optional stopping theorem for submartingales and super-
martingales). Let {Xn }n≥0 be a submartingale adapted to a filtration {Fn }n≥0 .
Let S and T be bounded stopping times for this filtration, such that S ≤ T al-
ways. Then XS and XT are integrable, and E(XT |FS ) ≥ XS a.s. In particular,
E(XT ) ≥ E(X0 ). Prove also the analogous result for supermartingales.
There is also a way to extract a martingale out of a submartingale or super-
martingale — or, in fact, any adapted sequence. This procedure, known as the
‘Doob decomposition’, goes as follows. Let {Xn }n≥0 be a sequence of integrable

random variables adapted to a filtration {Fn }n≥0 . Define M0 := 0 and for each
n ≥ 1,
n−1
X
Mn := (Xk+1 − E(Xk+1 |Fk )),
k=0
and
n−1
X
An := (E(Xk+1 |Fk ) − Xk ).
k=0
Then note that by definition,
Xn = X0 + Mn + An . (9.7.1)
It is easy to see that Mn and An are both integrable for each n. Next, note that Mn
is Fn -measurable for each n, and
E(Mn |Fn−1 ) = Σ_{k=0}^{n−2} (Xk+1 − E(Xk+1 |Fk )) = Mn−1 .

Thus, {Mn }n≥0 is a martingale adapted to {Fn }n≥0 . Finally, note that {An }n≥0
is not only an adapted process, but actually, An is Fn−1 -measurable for each n.
Such processes are called ‘predictable’. Thus, the Doob decomposition (9.7.1) ex-
presses any adapted process as the sum of the initial value, plus a martingale, plus
a predictable process. Additionally, if {Xn }n≥0 is a submartingale, we have that
E(Xn+1 |Fn ) ≥ Xn for each n, and hence {An }n≥0 is a nonnegative and increas-
ing process. Similarly, if {Xn }n≥0 is a supermartingale, {An }n≥0 is a nonpositive
and decreasing process.
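As a concrete illustration of the Doob decomposition (my example, not the text's): if Sn is a simple symmetric random walk and Xn = Sn^2, then E(Xk+1 |Fk ) = Xk + 1, so the predictable part is An = n and the martingale part is Mn = Sn^2 − n. The sketch below computes Mn and An directly from the defining sums, using this known one-step conditional expectation, and verifies the identity (9.7.1):

```python
import random

rng = random.Random(2)
S = [0]
for _ in range(200):
    S.append(S[-1] + rng.choice((-1, 1)))    # simple symmetric random walk
X = [s * s for s in S]                       # X_n = S_n^2, adapted to F_n

# One-step conditional expectation: E(X_{k+1} | F_k) = S_k^2 + 1 = X_k + 1
M, A = [0], [0]
for k in range(len(X) - 1):
    cond = X[k] + 1
    M.append(M[-1] + (X[k + 1] - cond))      # martingale part
    A.append(A[-1] + (cond - X[k]))          # predictable (compensator) part

decomposition_ok = all(X[n] == X[0] + M[n] + A[n] for n in range(len(X)))
compensator_ok = all(A[n] == n for n in range(len(X)))
martingale_ok = all(M[n] == S[n] ** 2 - n for n in range(len(X)))
```

All three checks are exact integer identities here, since the walk and the conditional expectations are integer-valued.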
The following construction is another method of obtaining submartingales, su-
permartingales and martingales from arbitrary adapted processes, which is often
useful in applications.
P ROPOSITION 9.7.2. Let {Xn }n≥0 be a sequence of integrable random vari-
ables adapted to a filtration {Fn }n≥0 . Let T be a stopping time for this filtration.
Suppose that for each n, E(Xn+1 |Fn ) ≥ Xn a.s. on the set {T > n}, that is,
P({E(Xn+1 |Fn ) < Xn } ∩ {T > n}) = 0.
Then {XT ∧n }n≥0 is a submartingale adapted to {Fn }n≥0 . Similarly, if we have
that E(Xn+1 |Fn ) ≤ Xn a.s. on {T > n}, then {XT ∧n }n≥0 is a supermartingale,
and if E(Xn+1 |Fn ) = Xn a.s. on {T > n}, then {XT ∧n }n≥0 is a martingale.
(The process {XT ∧n }n≥0 is often called the ‘stopped’ version of the process
{Xn }n≥0 , since it progresses as Xn up to time T , and then gets frozen at XT .)

PROOF. Take any n ≥ 0. By Exercise 9.5.8, XT ∧n is FT ∧n -measurable. Since T ∧ n ≤ n, and T ∧ n and n are both stopping times, we have that FT ∧n ⊆ Fn .
Thus, XT ∧n is Fn -measurable. Also, clearly, |XT ∧n | ≤ |X0 | + · · · + |Xn |, which

shows that XT ∧n is integrable. Now note that

E(XT ∧(n+1) |Fn ) = Σ_{i=0}^{n} E(XT ∧(n+1) 1_{T =i} |Fn ) + E(XT ∧(n+1) 1_{T >n} |Fn )
= Σ_{i=0}^{n} E(Xi 1_{T =i} |Fn ) + E(Xn+1 1_{T >n} |Fn )
= Σ_{i=0}^{n} Xi 1_{T =i} + 1_{T >n} E(Xn+1 |Fn ),

where the last identity holds because Xi and 1_{T =i} are Fi -measurable for each i. But

Σ_{i=0}^{n} Xi 1_{T =i} = XT 1_{T ≤n} ,
and by the assumed property, 1{T >n} E(Xn+1 |Fn ) ≥ 1{T >n} Xn a.s. Thus,
E(XT ∧(n+1) |Fn ) ≥ XT 1{T ≤n} + Xn 1{T >n} = XT ∧n .
This proves the claim for submartingales. The proofs for supermartingales and
martingales are exactly the same. 
The above result is sometimes used in the following way. Let {Xn }n≥0 be a
sequence of integrable random variables adapted to a filtration {Fn }n≥0 , starting
at X0 ≡ x for some nonrandom x > 0. Suppose that T is a stopping time with
respect to this filtration, with the property that there is some a ≤ x such that
XT ∧n ≥ a a.s., and there is some b > 0 such that E(Xn+1 |Fn ) ≤ Xn − b a.s. on
the set {T > n}. Then we have the bound
E(T ) ≤ (x − a)/b, (9.7.2)
by the following argument. Define
Yn := Xn + bn.
Then by the stated property, a.s. on the set {T > n},
E(Yn+1 |Fn ) = E(Xn+1 |Fn ) + b(n + 1)
≤ Xn − b + b(n + 1) = Xn + bn = Yn .
Therefore by Proposition 9.7.2, {YT ∧n }n≥0 is a supermartingale. In particular,
E(YT ∧n ) ≤ E(YT ∧0 ) = E(Y0 ) = x.
But
E(YT ∧n ) = E(XT ∧n ) + bE(T ∧ n).
Since XT ∧n ≥ a a.s., combining the last two displays shows that
E(T ∧ n) ≤ (x − a)/b.
Taking n → ∞ and applying the monotone convergence theorem, we get (9.7.2).
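A quick numerical check of (9.7.2), under toy parameters of my own choosing: start a walk at x = 10 with steps −1 with probability 3/4 and +1 with probability 1/4, so that E(Xn+1 |Fn ) = Xn − 1/2, and let T be the first time the walk reaches a = 0. Since the steps are ±1, XT ∧n ≥ 0 and XT = 0; moreover Yn = Xn + n/2 is here an exact martingale, so the bound E(T) ≤ (x − a)/b with b = 1/2 should be attained approximately with equality:

```python
import random

rng = random.Random(3)
x, a, b = 10, 0, 0.5             # toy parameters; per-step expected decrease is b
trials = 5000
total_T = 0
for _ in range(trials):
    s, n = x, 0
    while s > a:                 # T = first time the walk drops to a
        s += -1 if rng.random() < 0.75 else 1
        n += 1
    total_T += n
mean_T = total_T / trials        # Monte Carlo estimate of E(T)
bound = (x - a) / b              # (9.7.2) gives E(T) <= 20 here
```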
The following exercises demonstrate how this technique can be applied.

EXERCISE 9.7.3. A gambler wins or loses 1 dollar with equal probability at
each turn of a game. However, if his total assets at any point of time are bigger
than some amount a, he has to pay a tax of b dollars for playing that turn. Suppose
that the gambler starts playing with assets totaling x dollars. Let T be the first time
at which his assets go below a dollars. Prove that E(T ) ≤ (x − a)/b. In particular,
deduce that T is finite a.s.
E XERCISE 9.7.4. Consider the following random walk on the integers. The
walk starts at some odd integer x > 3. At each step, if the walk is at some y, then
it goes to either p − 2 or p + 2 with equal probability, where p is the largest prime
divisor of y. Let T be the first time the walk hits a prime number or 1. Prove that
E(T ) ≤ C log x, where C is a constant that does not depend on x. In particular,
deduce that T < ∞ a.s. (Note that if XT is a prime, then either XT + 2 or XT − 2
is also a prime, which means that XT is a twin prime.)

9.8. Optimal stopping and Snell envelopes


Let (Ω, F, P) be a probability space. Let F1 ⊆ F2 ⊆ · · · ⊆ FN be a filtration
of sub-σ-algebras of F. Let X1 , . . . , XN be integrable random variables defined
on Ω, not necessarily adapted to this filtration. The ‘finite horizon optimal stopping
problem’ in this setting is the problem of finding the stopping time T (with respect
to this filtration) that maximizes E(XT ). The idea is that Fn encodes ‘information
up to time n’, and Xn is the reward obtained if we execute some action at time n,
where our decision to execute the action is based solely on information available
up to time n. We want to devise a strategy for executing the action at an opportune
time, so that the expected reward is maximized. Surprisingly, this problem has a
solution at this complete level of generality, without any further assumptions.
The first step is to simplify the problem by defining Yn := E(Xn |Fn ) and
proving the following lemma.
L EMMA 9.8.1. For any stopping time T for the filtration {Fn }1≤n≤N (taking
value in {1, . . . , N }), E(XT ) = E(YT ).

PROOF. Note that

E(YT ) = Σ_{n=1}^{N} E(Yn ; T = n) = Σ_{n=1}^{N} E(Xn ; T = n),
because {T = n} ∈ Fn and Yn = E(Xn |Fn ). But the last expression is just
E(XT ). 
Next, we define a sequence of random variables known as the ‘Snell envelope’
of Y1 , . . . , YN . Let VN := YN , and define, by backward induction,
Vn := max{Yn , E(Vn+1 |Fn )} (9.8.1)
for n = N − 1, N − 2, . . . , 1. Note that this is well-defined, because each Vn is integrable (easy to show by backward induction). Also, clearly, Vn is Fn -measurable.
Observe that {Vn }1≤n≤N is a supermartingale adapted to {Fn }1≤n≤N , because it

is a trivial consequence of the above definition that Vn ≥ E(Vn+1 |Fn ) a.s. Note
also that Vn ≥ Yn a.s. for each n. In fact, the following holds (which is why it’s
called an ‘envelope’):
E XERCISE 9.8.2. Prove that {Vn }1≤n≤N is the ‘smallest’ supermartingale
dominating {Yn }1≤n≤N , in the sense that if {Un }1≤n≤N is any other supermartin-
gale adapted to {Fn }1≤n≤N such that Un ≥ Yn a.s. for all n, then Un ≥ Vn a.s. for
all n.
Finally, define
τ := min{n : Vn = Yn }.
Note that τ ∈ {1, . . . , N }, since there is always at least one n (namely, N ) such
that Vn = Yn . Note also that τ is a stopping time with respect to {Fn }1≤n≤N ,
since
{τ = n} = {Vn = Yn } ∩ {Vk ≠ Yk for all k < n}.
We now arrive at the main result of this section, which is that τ is the solution of
the optimal stopping problem mentioned above.
T HEOREM 9.8.3. The stopping time τ maximizes E(Xτ ) among all stopping
times adapted to the filtration {Fn }1≤n≤N . Moreover, this maximum value is equal
to E(V1 ).

PROOF. Let T be any other stopping time. By Lemma 9.8.1, it suffices to show that E(Yτ ) ≥ E(YT ). By the definitions of Vn and τ , we have that on the set
{τ > n}, Vn = E(Vn+1 |Fn ) a.s. Therefore, by Proposition 9.7.2, {Vτ ∧n }1≤n≤N
is a martingale adapted to {Fn }1≤n≤N . Moreover, Yτ = Vτ by the definition of τ .
Consequently,
E(Yτ ) = E(Vτ ) = E(Vτ ∧N ) = E(Vτ ∧1 ) = E(V1 ).
But, since {Vn }1≤n≤N is a supermartingale, T is a bounded stopping time, and
Vn ≥ Yn for all n, the optional stopping theorem implies that
E(V1 ) ≥ E(VT ) ≥ E(YT ).
This completes the proof. 
Let us now work out an application of the above method. Suppose that there
are N ≥ 2 candidates appearing for an interview in a certain order. Let ri be the
rank of candidate i among all candidates (with rank 1 being the best). Ideally, the
employer would want to hire the candidate with rank 1. The problem is that at
any time n, the employer only knows r_1^n , . . . , r_n^n , where r_i^n is the rank of candidate
i among the first n candidates. That is, the employer has no knowledge about
the candidates who are going to appear for the interview at later times. Also, the
employer is constrained by the rule that he has to either immediately offer the job
to a candidate after the interview, or let them go without the possibility of making
an offer later. There are various problems of this type, where you have to make a
decision at some point of time based on prior information, but are not allowed to
change your decision later.

To inject randomness, let us make the quite reasonable assumption that
(r1 , . . . , rN ) is uniformly distributed over the set of all permutations of {1, . . . , N }.
Since the employer knows only r_1^n , . . . , r_n^n at time n, let Fn be the σ-algebra generated by this vector. Clearly, F1 ⊆ F2 ⊆ · · · ⊆ FN . We want to find a stopping
time τ with respect to this filtration that maximizes P(rτ = 1). To put this into the
framework of Theorem 9.8.3, let Xn := 1{rn =1} , so that P(rT = 1) = E(XT ) for
any stopping time T .
First, let us compute Yn = E(Xn |Fn ). If r_n^n ≠ 1, then clearly rn ≠ 1. On the
other hand, given rn = 1, the remaining r_i^n ’s form a uniform random permutation
of 2, . . . , n. In particular, given rn = 1, all possible rankings of the first n − 1
candidates are equally likely. Consequently, for any permutation σ1 , . . . , σn−1 of
2, . . . , n,
P(r_1^n = σ1 , . . . , r_{n−1}^n = σn−1 , r_n^n = 1 | rn = 1) = 1/(n − 1)!.
By Bayes’ rule, this gives
P(rn = 1 | r_1^n = σ1 , . . . , r_{n−1}^n = σn−1 , r_n^n = 1)
= P(r_1^n = σ1 , . . . , r_{n−1}^n = σn−1 , r_n^n = 1 | rn = 1) P(rn = 1) / P(r_1^n = σ1 , . . . , r_{n−1}^n = σn−1 , r_n^n = 1)
= ((1/(n − 1)!)(1/N )) / (1/n!) = n/N.
The above argument shows that
Yn = E(Xn |Fn ) = P(rn = 1|Fn ) = (n/N ) 1_{r_n^n = 1} .
In other words, Yn = n/N if candidate n is the best among the first n candidates,
and 0 otherwise. The fact that makes it easy to apply Theorem 9.8.3 in this problem
is the following.
L EMMA 9.8.4. For each 2 ≤ n ≤ N , Yn is independent of Fn−1 . Moreover,
P(Yn = n/N ) = 1/n.

PROOF. Since Yn can take only two values, 0 and n/N , it suffices to show that the random variable P(Yn = n/N |Fn−1 ), which is the same as P(r_n^n = 1|Fn−1 ), is actually nonrandom. Take any permutation σ1 , . . . , σn−1 of 1, . . . , n − 1. Then

P(r_n^n = 1 | r_1^{n−1} = σ1 , . . . , r_{n−1}^{n−1} = σn−1 )
= P(r_n^n = 1, r_1^{n−1} = σ1 , . . . , r_{n−1}^{n−1} = σn−1 ) / P(r_1^{n−1} = σ1 , . . . , r_{n−1}^{n−1} = σn−1 )
= P(r_n^n = 1, r_1^n = σ1 + 1, . . . , r_{n−1}^n = σn−1 + 1) / P(r_1^{n−1} = σ1 , . . . , r_{n−1}^{n−1} = σn−1 )
= (1/n!) / (1/(n − 1)!) = 1/n.
Thus, P(Yn = n/N |Fn−1 ) ≡ 1/n. This completes the proof of the lemma. 
Now let V1 , . . . , VN be defined by backward induction as in (9.8.1), starting
with VN = YN . Lemma 9.8.4 yields the following corollary.

COROLLARY 9.8.5. For 1 ≤ n ≤ N − 1, E(Vn+1 |Fn ) is equal to a nonrandom quantity v_n^N .

PROOF. We will prove the claim by backward induction on n. By Lemma 9.8.4,

E(VN |FN−1 ) = E(YN |FN−1 ) = E(YN ) = 1/N.
So, the claim holds for n = N − 1. Suppose that it holds for some n. Then
E(Vn |Fn−1 ) = E(max{Yn , E(Vn+1 |Fn )}|Fn−1 ).
But, by the induction hypothesis, E(Vn+1 |Fn ) is equal to some nonrandom quantity v_n^N . Thus,
E(Vn |Fn−1 ) = E(max{Yn , v_n^N }|Fn−1 ).
But by Lemma 9.8.4, Yn is independent of Fn−1 , and therefore, so is max{Yn , v_n^N }. Thus, we get that v_{n−1}^N := E(Vn |Fn−1 ) is nonrandom. □
Let us now evaluate the quantities v_1^N , . . . , v_{N−1}^N . Already in the proof of Corollary 9.8.5, we have seen that v_{N−1}^N = 1/N , and also that for each n ≤ N − 1,
v_{n−1}^N = E(max{Yn , v_n^N }). (9.8.2)
This shows that v_1^N ≥ v_2^N ≥ · · · ≥ v_{N−1}^N = 1/N . Let

tN := min{n ≤ N − 1 : n/N ≥ v_n^N },
which is well defined since v_{N−1}^N = 1/N ≤ (N − 1)/N (because N ≥ 2). By the decreasing nature of v_n^N , it follows that n/N < v_n^N for all n < tN and n/N ≥ v_n^N for all n ≥ tN . In particular, for any n < tN , Yn ≤ n/N < v_n^N , and therefore, (9.8.2) shows that v_{n−1}^N = v_n^N . On the other hand, for tN ≤ n ≤ N − 1, (9.8.2)
and Lemma 9.8.4 imply that
v_{n−1}^N = (n/N ) P(Yn = n/N ) + v_n^N P(Yn = 0)
= 1/N + (1 − 1/n) v_n^N .
Using this, and the fact that v_{N−1}^N = 1/N , it is now easy to prove by backward induction that for tN − 1 ≤ n ≤ N − 1,
v_n^N = 1/N + (n/N ) Σ_{k=n+1}^{N−1} 1/k. (9.8.3)

Thus, tN is the unique integer n ∈ {1, . . . , N − 1} such that
1/N + ((n − 1)/N ) Σ_{k=n}^{N−1} 1/k > (n − 1)/N and n/N ≥ 1/N + (n/N ) Σ_{k=n+1}^{N−1} 1/k.

It is easy to see from these inequalities that asymptotically as N → ∞, tN ∼ N/e.


By Theorem 9.8.3, we now deduce that the optimal stopping rule τ in this problem
is given by

τ = min{n ≥ tN : Yn = n/N } = min{n ≥ tN : r_n^n = 1}.

That is, we should choose the first candidate appearing at or after time tN who is better than
everyone who came before him. Moreover, with this optimal rule, the probability of
choosing the best candidate is v_1^N , which is equal to v_{tN−1}^N . By the characterization
of tN given above, and the formula (9.8.3), this value is asymptotically 1/e.
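The backward recursion (9.8.2) is easy to run numerically. The sketch below is my own check (N = 10000 is an arbitrary choice); it uses P(Yn = n/N) = 1/n from Lemma 9.8.4 to compute the v_n^N, finds the threshold tN, and confirms that both tN/N and the success probability v_1^N are close to 1/e ≈ 0.3679:

```python
def secretary(N):
    """Backward recursion (9.8.2): v[n] holds v_n^N for 1 <= n <= N - 1."""
    v = [0.0] * N
    v[N - 1] = 1.0 / N
    for n in range(N - 1, 1, -1):
        # v_{n-1} = E(max{Y_n, v_n}), with Y_n = n/N w.p. 1/n and 0 otherwise
        v[n - 1] = (1.0 / n) * max(n / N, v[n]) + (1.0 - 1.0 / n) * v[n]
    t = min(n for n in range(1, N) if n / N >= v[n])   # the threshold t_N
    return t, v[1]

N = 10000
tN, v1 = secretary(N)
ratio = tN / N          # should be near 1/e, and so should v1
```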

E XERCISE 9.8.6. Let X1 , X2 , . . . , XN be i.i.d. integrable random variables.


Let K > 0 be a given constant, and let Sn := X1 + · · · + Xn and Fn :=
σ(X1 , . . . , Xn ). Consider all stopping times T for the filtration {Fn }1≤n≤N ,
which are allowed to take values in {1, . . . , N } ∪ {∞}. Stopping at time T gets
us a reward ST − K if T is finite, and 0 if T = ∞. We want to maximize the
expected reward. (This comes from ‘American options’, where an asset has price
Sn dollars on day n, and the holder of the asset has to pay a fee of K dollars to sell
the asset at any time before time N , which is the ‘expiration date’ for the option.
The holder has no obligation to sell, and if the option is not exercised by time N ,
then the reward is zero — this is the case T = ∞. The ‘sum of i.i.d.’ is a toy
model. In reality, more complex models are used.) Prove the following:
(1) Show that the problem can be reformulated as the problem of finding a
stopping time T taking values in {1, . . . , N } that maximizes E((ST −
K)+ ). Having found such a T , what would be an optimal strategy for the
original problem?
(2) Define a sequence of functions w0 , w1 , . . . from R into R as follows: Let
w0 (x) := (x − K)+ , and for each n ≥ 1, let

wn (x) := max{(x − K)+ , E(wn−1 (x + X1 ))}.

Finally, let T := min{1 ≤ n ≤ N : (Sn − K)+ = w_{N−n} (Sn )}. Show that T is an optimal stopping time for the revised problem described in
part (1).
(3) Show that each wn is a convex nondecreasing function.
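Parts (2) and (3) of this exercise can be explored numerically in a toy case of my own choosing: take each Xi = ±1 with equal probability, so that E(wn−1 (x + X1 )) = (wn−1 (x − 1) + wn−1 (x + 1))/2, with arbitrary parameters K and N. The sketch below computes the functions wn by the recursion of part (2) and checks, on an integer grid, the convexity and monotonicity claimed in part (3):

```python
from functools import lru_cache

K, N = 2, 6   # arbitrary toy parameters

@lru_cache(maxsize=None)
def w(n, x):
    """w_0(x) = (x - K)^+; w_n(x) = max{(x - K)^+, E w_{n-1}(x + X_1)},
    with X_1 = +-1 equally likely (a hypothetical step distribution)."""
    payoff = max(x - K, 0)
    if n == 0:
        return payoff
    return max(payoff, 0.5 * (w(n - 1, x - 1) + w(n - 1, x + 1)))

grid = range(-10, 11)
nondecreasing = all(w(N, x) <= w(N, x + 1) for x in grid)
convex = all(w(N, x - 1) + w(N, x + 1) >= 2 * w(N, x) for x in grid)
dominates = all(w(N, x) >= max(x - K, 0) for x in grid)
```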

9.9. Almost sure convergence of martingales


Martingales, submartingales and supermartingales often have nice convergence
properties. In this section, we will derive a criterion for almost sure convergence
due to Doob. A key ingredient in the proof is Doob’s upcrossing lemma. If
{Xn }n≥0 is any sequence of random variables, and [a, b] is a bounded interval,
the upcrossings of [a, b] by {Xn }n≥0 are defined as follows. Let S1 be the first n
such that Xn ≤ a. Let T1 be the first n after S1 such that Xn ≥ b. Let S2 be the
first n after T1 such that Xn ≤ a, and T2 be the first n after S2 such that Xn ≥ b.

And we keep going like this. That is, taking T0 = −1, we have that for all k ≥ 1,
Sk = inf{n > Tk−1 : Xn ≤ a},
Tk = inf{n > Sk : Xn ≥ b}.
In the above, we follow the convention that the infimum of an empty set is infinity.
The intervals {Sk , Sk + 1, . . . , Tk }, for all k such that Sk is finite, are called the
upcrossings of [a, b] by {Xn }n≥0 .
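In code, counting completed upcrossings is a small state machine that alternates between waiting for a value ≤ a (an Sk) and then a value ≥ b (a Tk), exactly as in the definition above. A minimal sketch:

```python
def upcrossings(xs, a, b):
    """Number of completed upcrossings of [a, b] by the finite sequence xs."""
    assert a < b
    count = 0
    waiting_for_low = True       # True: looking for S_k; False: looking for T_k
    for x in xs:
        if waiting_for_low:
            if x <= a:           # found S_k
                waiting_for_low = False
        elif x >= b:             # found T_k: one upcrossing completed
            count += 1
            waiting_for_low = True
    return count

demo = upcrossings([0, -1, 2, 0, -1, 3, 1], a=-1, b=2)   # completes two upcrossings
```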
L EMMA 9.9.1 (Upcrossing lemma). Let {Xn }n≥0 be a submartingale sequence
adapted to some filtration {Fn }n≥0 . Take any interval [a, b] with a < b. Let Um
be the number of upcrossings of this interval by the sequence {Xn }n≥0 that are
completed by time m. Then
E(Um ) ≤ [E(Xm − a)+ − E(X0 − a)+ ] / (b − a).

PROOF. For each n, let Yn := a + (Xn − a)+ . That is, Yn = a if Xn ≤ a and Yn = Xn if Xn > a. Note that x ↦ a + (x − a)+ is a convex non-decreasing function. Thus, {Yn }n≥0 is also a submartingale. Moreover, it has the
same upcrossings of [a, b] as the sequence {Xn }n≥0 . Note that if k, . . . , k + l
is an upcrossing, then Yk+l − Yk ≥ b − a, and moreover, for any k ≤ j ≤ k + l,
Yj − Yk ≥ 0 (the second property is not valid for the process {Xn }n≥0 ). Thus, if
we add up Yk+l − Yk for all upcrossings k, . . . , k + l that are completed by time
m, and also Ym − Yk for an incomplete upcrossing k, . . . , m at the end, the sum
will be at least (b − a)Um .
Let us now represent the above sum in a different way. For each n ≥ 1, let
Zn be the indicator of the event that the indices n and n − 1 are both in the same
upcrossing. The crucial observation is that Zn is Fn−1 -measurable. To see this,
note that there are four possibilities: (a) Yn−1 ≤ a. In this case, n − 1 is part of
an upcrossing (since the process is always above a between upcrossings), and n
is also a part of the same upcrossing. (b) a < Yn−1 < b, and n − 1 is part of
an upcrossing. In this case, n is part of the same upcrossing. (c) a < Yn−1 < b,
but n − 1 is not in an upcrossing. In this case, obviously, n and n − 1 are not
both in the same upcrossing. (d) Yn−1 ≥ b. In this case, n and n − 1 cannot be
in the same upcrossing. Thus, whether Zn = 1 or 0 is completely determined by
Y0 , . . . , Yn−1 , and hence, Zn is Fn−1 -measurable. Now, the sum defined in the
previous paragraph can be written as
Σ_{n=1}^{m} Zn (Yn − Yn−1 ).
Since Zn is Fn−1 -measurable and {0, 1}-valued, and {Yn }n≥0 is a submartingale
(which implies that E(Yn |Fn−1 ) − Yn−1 ≥ 0), we have
E(Zn (Yn − Yn−1 )) = E[E(Zn (Yn − Yn−1 )|Fn−1 )]
= E[Zn (E(Yn |Fn−1 ) − Yn−1 )]
≤ E[E(Yn |Fn−1 ) − Yn−1 ] = E(Yn − Yn−1 ).

Thus,
(b − a)E(Um ) ≤ Σ_{n=1}^{m} E(Yn − Yn−1 ) = E(Ym − Y0 ).

Since Yn = a + (Xn − a)+ , this completes the proof. 


Using the upcrossing lemma, we will now prove the following important result.
T HEOREM 9.9.2 (Submartingale convergence theorem). Let {Xn }n≥0 be a
submartingale adapted to a filtration {Fn }n≥0 . Suppose that supn≥0 E(Xn+ ) is
finite. Then there is a random variable X such that Xn → X a.s. and E|X| < ∞.

P ROOF. Take any interval [a, b] with a < b. Let Um be the number of up-
crossings of this interval by the sequence {Xn }n≥0 that are completed by time m.
Then {Um }m≥0 is an increasing sequence of random variables, whose limit U is
the total number of upcrossings of this interval. But by the upcrossing lemma and
the condition that supn≥0 E(Xn+ ) < ∞, we get that E(Um ) is uniformly bounded.
Thus by the monotone convergence theorem, E(U ) < ∞ and hence U < ∞
a.s. But this is true for any [a, b], and so it holds almost surely for all intervals
with rational endpoints. This implies that lim inf n→∞ Xn cannot be strictly less
than lim supn→∞ Xn , because otherwise there would exist an interval with ratio-
nal endpoints which is upcrossed infinitely many times. This proves the existence
of the limit X.
By Fatou’s lemma, E(X + ) ≤ lim inf n→∞ E(Xn+ ) < ∞. On the other hand,
since {Xn }n≥0 is a submartingale sequence,
E(Xn− ) = E(Xn+ ) − E(Xn ) ≤ E(Xn+ ) − E(X0 ),
which proves the uniform boundedness of E(Xn− ). Thus again by Fatou’s lemma,
E(X − ) < ∞ and hence E|X| < ∞. 
E XERCISE 9.9.3. Prove that a nonnegative supermartingale converges almost
surely to an integrable random variable.
E XERCISE 9.9.4 (Pólya’s urn). Consider an urn which initially has one black
ball and one white ball. At each turn, a ball is picked uniformly at random from
the urn, and replaced back into the urn with an additional ball of the same color.
Let Wn be the fraction of white balls at time n (so that W0 = 1/2). Prove that
{Wn }n≥0 is a martingale with respect to a suitable filtration, and show that it con-
verges almost surely to a limit. Show that the limit is uniformly distributed in [0, 1].
Hint: For the last part, prove by induction that each Wn is uniformly distributed on
a certain set depending on n.
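The urn is easy to simulate, and the simulation is consistent with the uniform limit. This is only a numerical illustration, not a proof; the run lengths and trial counts are arbitrary choices:

```python
import random

def polya_fraction(steps, rng):
    """Run Polya's urn starting from one white and one black ball;
    return the white fraction after the given number of draws."""
    white, black = 1, 1
    for _ in range(steps):
        if rng.random() < white / (white + black):
            white += 1
        else:
            black += 1
    return white / (white + black)

rng = random.Random(4)
samples = [polya_fraction(500, rng) for _ in range(2000)]
# If W_n converges to a Uniform[0, 1] limit, the empirical CDF of W_500
# should already be close to that of the uniform distribution:
cdf_half = sum(w < 0.5 for w in samples) / len(samples)
cdf_quarter = sum(w < 0.25 for w in samples) / len(samples)
```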
E XERCISE 9.9.5. Show by a counterexample that the convergence in the sub-
martingale convergence theorem need not hold in L1 . Hint: Consider the martin-
gale {ST ∧n }n≥0 , where {Sn }n≥0 is a simple symmetric random walk starting at 0,
and T is the first time to hit 1.

9.10. Lévy’s downwards convergence theorem


A second consequence of the upcrossing lemma is the backwards martingale
convergence theorem, stated below. This is also known as Lévy’s downwards con-
vergence theorem.
T HEOREM 9.10.1 (Backwards martingale convergence theorem, or, Lévy’s
downwards convergence theorem). Let X be an integrable random variable de-
fined on some probability space (Ω, F, P). Let F0 ⊇ F1 ⊇ · · · be a decreasing
sequence of sub-σ-algebras of F. Let F ∗ = ∩n≥0 Fn . Then E(X|Fn ) converges
to E(X|F ∗ ) a.s. and in L1 as n → ∞.

P ROOF. Let Xn := E(X|Fn ). Take any interval [a, b] with a < b. For each n,
let Un be the number of complete upcrossings of this interval by the finite sequence
Xn , Xn−1 , . . . , X0 . Note that this sequence is a martingale with respect to the
filtration Fn , Fn−1 , . . . , F0 . Therefore by the upcrossing lemma,
E(Un ) ≤ E(X0 − a)+ / (b − a).
Since X0 = E(X|F0 ) is integrable, the quantity on the right is finite. Moreover,
it does not depend on n. Thus, supn≥0 E(Un ) < ∞. But Un increases to a limit
as n → ∞, and if this limit is finite, then the sequence {Xn }n≥0 cannot attain
infinitely many values in both the intervals (−∞, a] and [b, ∞). Therefore, with
probability one, this cannot happen for any interval with rational endpoints. This
implies that Xn converges almost surely to a limit X ∗ .
The next step is to show that the sequence {Xn }n≥0 is uniformly integrable.
Since
|Xn | = |E(X|Fn )| ≤ E(|X||Fn ),
we have
E(|Xn |; |Xn | > K) ≤ E(E(|X||Fn ); E(|X||Fn ) > K)
= E(|X|; E(|X||Fn ) > K). (9.10.1)
Take any ε > 0. By Corollary 8.3.4, there is some δ > 0 such that for any
event A with P(A) < δ, we have E(|X|; A) < ε. Fixing K, let An be the event
{E(|X||Fn ) > K}. By Markov’s inequality,
E(E(|X||Fn )) E|X|
P(An ) ≤ = ,
K K
which can be made less than δ by choosing K large enough. Therefore, by (9.10.1),
E(|Xn |; |Xn | > K) ≤ E(|X|; An ) ≤ ε.
This proves the uniform integrability of {Xn }n≥0 . By Proposition 8.3.1, we con-
clude that Xn → X ∗ in L1 . This implies that for any B ∈ F ∗ ,
E(X ∗ ; B) = lim_{n→∞} E(Xn ; B) = E(X; B),
where the last identity holds because Xn = E(X|Fn ) and B ∈ Fn for every n.
So, to complete the proof we only have to show that X ∗ is F ∗ -measurable. To

show this, take any n. Since Fm ⊆ Fn for any m ≥ n, we have that Xm is Fn -measurable for any m ≥ n. Therefore X ∗ is Fn -measurable. But since this is true
for all n, X ∗ must be F ∗ -measurable. 

EXERCISE 9.10.2. Let X1 , X2 , . . . be a sequence of i.i.d. integrable random variables. Let Sn := X1 + · · · + Xn , and Gn := σ(Sn , Sn+1 , Sn+2 , . . .). Using
Exercise 9.2.13, show that E(X1 |Gn ) = E(X1 |Sn ) = E(Xi |Sn ) for any i ≤ n.
From this, deduce that E(X1 |Gn ) = Sn /n. Using this fact, the downwards conver-
gence theorem, and Kolmogorov’s zero-one law, show that Sn /n converges a.s. and
in L1 to E(X1 ) as n → ∞.

9.11. De Finetti’s theorem


Let {Xn }n≥1 be an infinite sequence of real-valued random variables defined
on a probability space (Ω, F, P). Let G := σ(X1 , X2 , . . .). By Exercise 5.1.6,
any A ∈ G can be expressed as X −1 (B) for some measurable set B ⊆ RN ,
where X : Ω → RN is the map X(ω) = (Xn (ω))n≥1 . We will say that A
is invariant under permutations of the first n coordinates if B is invariant under
permutations of the first n coordinates (for some choice of B such that A =
X −1 (B)) — that is, for any (x1 , x2 , . . .) ∈ B, and any permutation π of 1, . . . , n,
(xπ(1) , . . . , xπ(n) , xn+1 , xn+2 , . . .) is also in B. Let En be the σ-algebra consisting
of all A ∈ G that are invariant under permutations of the first n coordinates (it is
easy to see that this is a σ-algebra). Let

E := ∩_{n=1}^{∞} En
be the σ-algebra of all A ∈ G that are invariant under permutations of any
finitely many coordinates. This is known as the ‘exchangeable σ-algebra’ of the
sequence {Xn }n≥1 .
E XERCISE 9.11.1 (Hewitt–Savage zero-one law, alternative version). Recall
the Hewitt–Savage zero-one law proved in Chapter 7 (Corollary 7.5.4). Prove the
following equivalent version. If {Xn }n≥0 is a sequence of i.i.d. random variables,
show that the exchangeable σ-algebra E is trivial (meaning that for any A ∈ E,
P(A) is either 0 or 1).
An infinite sequence of random variables {Xn }n≥1 is called ‘exchangeable’ if
its joint distribution remains invariant under permutations of any finitely many co-
ordinates. To be more precise, the condition means that for any bijection π : N →
N that fixes all but finitely many coordinates, the sequences (Xπ(1) , Xπ(2) , . . .) and
(X1 , X2 , . . .) induce the same probability measure on RN .
The following fundamental result for infinite exchangeable sequences is called
De Finetti’s theorem.
T HEOREM 9.11.2 (De Finetti’s theorem). Let {Xn }n≥1 be an infinite exchange-
able sequence of real-valued random variables. Then they are independent and

identically distributed conditional on the exchangeable σ-algebra E, in the sense


that for any k and any Borel sets A1 , . . . , Ak ⊆ R,
P(X1 ∈ A1 , . . . , Xk ∈ Ak |E) = Π_{i=1}^{k} P(Xi ∈ Ai |E) = Π_{i=1}^{k} P(X1 ∈ Ai |E) a.s.

We need several lemmas to prove this theorem.


L EMMA 9.11.3. If a measurable function f : RN → R is invariant under
permutations of the first n coordinates, then f (X1 , X2 , . . .) is an En -measurable
random variable.

P ROOF. For any Borel set A ⊆ R, it is easy to check that the set
{ω : f (X1 (ω), X2 (ω), . . .) ∈ A}
is in En , because it is equal to X −1 (B), where B = f −1 (A), and B is invariant
under permutations of the first n coordinates. 
L EMMA 9.11.4. For any n, any bounded measurable function ϕ : Rn → R,
and any permutation π of 1, . . . , n,
E(ϕ(Xπ(1) , . . . , Xπ(n) )|En ) = E(ϕ(X1 , . . . , Xn )|En ) a.s.

P ROOF. Let Y and Z denote the two conditional expectations displayed above,
that have to be shown to be equal. Take any A ∈ En , so that A = X −1 (B) for
some measurable set B ⊆ RN that is invariant under permutations of the first n
coordinates. Let f := 1B . Then
E(Y ; A) = E(ϕ(Xπ(1) , . . . , Xπ(n) )f (X1 , X2 , . . .))
= E(ϕ(Xπ(1) , . . . , Xπ(n) )f (Xπ(1) , Xπ(2) , . . . , Xπ(n) , Xn+1 , Xn+2 , . . .))
= E(ϕ(X1 , . . . , Xn )f (X1 , X2 , . . . , Xn , Xn+1 , Xn+2 , . . .))
= E(Z; A),
where the second identity holds because f is invariant under permutations of the
first n coordinates, and the third identity holds because X1 , X2 , . . . are exchange-
able. 
L EMMA 9.11.5. For any bounded measurable function ϕ : R → R,
(1/n) Σ_{i=1}^{n} ϕ(Xi ) → E(ϕ(X1 )|E)
a.s. as n → ∞.

P ROOF. By the downwards convergence theorem,


E(ϕ(X1 )|En ) → E(ϕ(X1 )|E)

a.s. as n → ∞. By Lemma 9.11.4, E(ϕ(X1 )|En ) = E(ϕ(Xi )|En ) a.s. for any
1 ≤ i ≤ n. Thus,
E(ϕ(X1 )|En ) = (1/n) Σ_{i=1}^{n} E(ϕ(Xi )|En ) a.s.
= E((1/n) Σ_{i=1}^{n} ϕ(Xi ) | En ) a.s.
But by Lemma 9.11.3, (1/n) Σ_{i=1}^{n} ϕ(Xi ) is En -measurable. Thus,
E((1/n) Σ_{i=1}^{n} ϕ(Xi ) | En ) = (1/n) Σ_{i=1}^{n} ϕ(Xi ) a.s.

This completes the proof of the lemma. 

L EMMA 9.11.6. For any k, and any bounded measurable functions ϕ1 , . . . , ϕk :
R → R,

    E(ϕ1 (X1 ) · · · ϕk (Xk )|En ) − ∏_{i=1}^k ( (1/n) ∑_{j=1}^n ϕi (Xj ) ) → 0

a.s. as n → ∞.

P ROOF. Take any n ≥ k. By Lemma 9.11.4,

    E(ϕ1 (X1 ) · · · ϕk (Xk )|En ) = E(ϕ1 (X_{i1} ) · · · ϕk (X_{ik} )|En )   a.s.        (9.11.1)

for any distinct i1 , . . . , ik ∈ {1, . . . , n}. Let (n)k := n(n − 1) · · · (n − k + 1), and
define

    Yn := (1/(n)k ) ∑_{1≤i1 ,...,ik ≤n distinct} ϕ1 (X_{i1} ) · · · ϕk (X_{ik} )

and

    Zn := (1/n^k ) ∑_{1≤i1 ,...,ik ≤n} ϕ1 (X_{i1} ) · · · ϕk (X_{ik} ) = ∏_{i=1}^k ( (1/n) ∑_{j=1}^n ϕi (Xj ) ).

Since the ϕi ’s are bounded functions, a simple calculation shows that Yn −Zn → 0
as n → ∞. By equation (9.11.1), E(ϕ1 (X1 ) · · · ϕk (Xk )|En ) = E(Yn |En ), and by
Lemma 9.11.3, Yn is En -measurable. Combining these observations completes the
proof. 
Finally, we are ready to prove De Finetti’s theorem.
P ROOF OF T HEOREM 9.11.2. Take any k, and any bounded measurable func-
tions ϕ1 , . . . , ϕk : R → R. By the downwards convergence theorem,
E(ϕ1 (X1 ) · · · ϕk (Xk )|En ) → E(ϕ1 (X1 ) · · · ϕk (Xk )|E)

a.s. as n → ∞. By Lemma 9.11.5,

    ∏_{i=1}^k ( (1/n) ∑_{j=1}^n ϕi (Xj ) ) → ∏_{i=1}^k E(ϕi (X1 )|E)

a.s. as n → ∞. The above two displays, together with Lemma 9.11.6, give

    E(ϕ1 (X1 ) · · · ϕk (Xk )|E) = ∏_{i=1}^k E(ϕi (X1 )|E)   a.s.

As a special case of the above identity, we get E(ϕi (Xi )|E) = E(ϕi (X1 )|E) a.s. for
each i. This completes the proof. □
E XERCISE 9.11.7. Let X1 , X2 , . . . be an exchangeable infinite sequence of
{0, 1}-valued random variables. Prove that:
(1) n^{−1} ∑_{i=1}^n Xi converges a.s. to a limit Θ as n → ∞.
(2) Let µ be the law of Θ. Then for any n and any x1 , . . . , xn ∈ {0, 1}, show
that

        P(X1 = x1 , . . . , Xn = xn ) = ∫_{[0,1]} θ^k (1 − θ)^{n−k} dµ(θ),

where k = ∑_{i=1}^n xi . In other words, the law of the infinite sequence
(X1 , X2 , . . .) is a mixture of i.i.d. distributions. (This kind of result is
often cited as justification for Bayesian modeling.)
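The mixture representation in part (2) is easy to probe numerically. The sketch below makes the (hypothetical, for illustration only) choice µ = uniform measure on [0, 1]: it draws Θ first and then i.i.d. Ber(Θ) coordinates, which produces an exchangeable sequence. The running mean tracks Θ, and P(X1 = 1, X2 = 1) = ∫ θ^2 dµ(θ) = 1/3 rather than 1/4, so the coordinates are conditionally i.i.d. but not independent.

```python
import random

def exchangeable_bernoulli(n, rng):
    # Mixture representation: draw Theta from the mixing measure mu
    # (here, hypothetically, uniform on [0, 1]), then i.i.d. Ber(Theta).
    theta = rng.random()
    return theta, [1 if rng.random() < theta else 0 for _ in range(n)]

rng = random.Random(0)

# (1) The running mean of one realization converges to its Theta.
theta, xs = exchangeable_bernoulli(200_000, rng)
mean = sum(xs) / len(xs)

# (2) P(X1 = 1, X2 = 1) = integral of theta^2 dmu(theta) = 1/3 for uniform mu,
# strictly larger than 1/4: coordinates are conditionally i.i.d., not i.i.d.
trials = 100_000
both = sum(exchangeable_bernoulli(2, rng)[1] == [1, 1] for _ in range(trials))
p11 = both / trials
```

The gap between 1/3 and 1/4 is exactly the variance of Θ under µ, which is what makes the sequence exchangeable without being i.i.d.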

9.12. Lévy’s upwards convergence theorem


The following result, known as Lévy’s upwards convergence theorem, com-
plements the downwards convergence theorem from Section 9.10.
T HEOREM 9.12.1 (Lévy upwards convergence theorem). Let {Fn }n≥0 be a
filtration of σ-algebras and let F ∗ := σ(∪_{n=0}^∞ Fn ). Then for any integrable random
variable X, E(X|Fn ) → E(X|F ∗ ) a.s. and in L1 .

P ROOF. Let Xn := E(X|Fn ). Note that the sequence {Xn }n≥0 is a martin-
gale with respect to the filtration {Fn }n≥0 . Moreover, by Jensen’s inequality for
conditional expectation, E|Xn | ≤ E|X| for all n. Therefore by Theorem 9.9.2,
there is an integrable random variable Y such that Xn → Y a.s. By a similar argu-
ment as in the proof of the backward martingale convergence theorem, we see that
{Xn }n≥0 is uniformly integrable. Thus, Xn → Y in L1 . Finally, we need to show
that Y = E(X|F ∗ ) a.s. To prove this, take any A ∈ ∪n≥1 Fn . Then A ∈ Fn for
some n, and so
E(X; A) = E(Xn ; A).
But Xn = E(Xm |Fn ) for any m ≥ n. Thus, E(X; A) = E(Xm ; A) for any
m ≥ n. Since Xm → Y in L1 as m → ∞, this shows that E(X; A) = E(Y ; A). It
is now a simple exercise using the π-λ theorem to show that E(X; A) = E(Y ; A)
for all A ∈ F ∗ , which proves that Y = E(X|F ∗ ). 

E XERCISE 9.12.2. Let X be a Borel measurable, integrable map from [0, 1)


into R. Let Xn : [0, 1) → R be defined as follows. For each n and 0 ≤ i < 2n , and
for each x in the interval [2−n i, 2−n (i + 1)), let Xn (x) be the average value of X
in that interval. Show that Xn → X pointwise almost everywhere on [0, 1). Hint:
Use Exercise 9.1.11 and Exercise 9.4.4, along with the martingale convergence
theorem.
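Exercise 9.12.2 can be watched directly: the dyadic averages of an integrable function form a martingale converging pointwise a.e. The sketch below uses the arbitrary test function X(x) = x², whose average over any interval has a closed form, and checks that the error at a fixed point shrinks as the dyadic partition refines.

```python
def X(x):
    # arbitrary integrable test function on [0, 1), chosen for this sketch
    return x * x

def X_n(n, x):
    # Average of X over the dyadic interval [i 2^-n, (i+1) 2^-n) containing x;
    # for X(t) = t^2 the average over [a, b) is (b^3 - a^3) / (3 (b - a)).
    i = int(x * 2 ** n)
    a, b = i / 2 ** n, (i + 1) / 2 ** n
    return (b ** 3 - a ** 3) / (3 * (b - a))

x = 0.3
errors = [abs(X_n(n, x) - X(x)) for n in (2, 5, 10, 15)]
# errors shrink toward 0 as n grows, illustrating X_n(x) -> X(x)
```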

9.13. Lp convergence of martingales


In this section we discuss the necessary and sufficient conditions for Lp con-
vergence of martingales, for all p ≥ 1. The cases p > 1 and p = 1 are somewhat
different. Let us first discuss p > 1. An important ingredient in the Lp conver-
gence theorem is the following inequality, known as Doob’s maximal inequality
for submartingales.
T HEOREM 9.13.1 (Doob’s maximal inequality). Let {Xi }0≤i≤n be a sub-
martingale sequence adapted to a filtration {Fi }0≤i≤n . Let Mn := max0≤i≤n Xi .
Then for any t > 0,

    P(Mn ≥ t) ≤ E(Xn ; Mn ≥ t) / t.

P ROOF. For each 0 ≤ i ≤ n, let Ai be the event that Xj < t for j < i and
Xi ≥ t. Then the event {Mn ≥ t} is the union of the disjoint events A0 , . . . , An .
Thus,

    E(Xn ; Mn ≥ t) = ∑_{i=0}^n E(Xn ; Ai ).

But Ai is Fi -measurable and, by the submartingale property, E(Xn |Fi ) ≥ Xi . Thus,

    E(Xn ; Mn ≥ t) ≥ ∑_{i=0}^n E(Xi ; Ai ).

But if Ai happens, then Xi ≥ t. Thus,

    E(Xn ; Mn ≥ t) ≥ t ∑_{i=0}^n P(Ai ) = t P(Mn ≥ t),

which is what we wanted. □
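As a quick Monte Carlo sanity check of the inequality just proved, the sketch below uses a ±1 simple random walk (a martingale, hence a submartingale); the walk length, threshold, and trial count are arbitrary choices for this illustration.

```python
import random

rng = random.Random(1)
n_steps, t, trials = 50, 4.5, 20_000

hits = 0         # Monte Carlo count of {M_n >= t}
tail_sum = 0.0   # accumulates X_n 1{M_n >= t}, estimating E(X_n; M_n >= t)
for _ in range(trials):
    s, m = 0, 0
    for _ in range(n_steps):
        s += rng.choice((-1, 1))  # +-1 steps: a martingale, hence a submartingale
        m = max(m, s)
    if m >= t:
        hits += 1
        tail_sum += s

lhs = hits / trials            # P(M_n >= t)
rhs = (tail_sum / trials) / t  # E(X_n; M_n >= t) / t, which should dominate lhs
```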
Doob’s maximal inequality implies the following inequality, known as Doob’s
Lp inequality.
T HEOREM 9.13.2 (Doob’s Lp inequality). Let {Xi }0≤i≤n be a martingale
or nonnegative submartingale adapted to a filtration {Fi }0≤i≤n . Let Mn :=
max0≤i≤n |Xi |. Then for any p > 1,

    E(Mn^p ) ≤ (p/(p − 1))^p E|Xn |^p .

P ROOF. Under the given assumptions, {|Xi |}0≤i≤n is a submartingale se-


quence. Therefore by Doob’s maximal inequality,

    P(Mn ≥ t) ≤ E(|Xn |; Mn ≥ t) / t

for any t > 0. Therefore, by Exercise 6.3.7 and Fubini’s theorem,

    E(Mn^p ) = ∫_0^∞ p t^{p−1} P(Mn ≥ t) dt
             ≤ ∫_0^∞ p t^{p−2} E(|Xn |; Mn ≥ t) dt
             = E( |Xn | ∫_0^∞ p t^{p−2} 1_{Mn ≥t} dt )
             = (p/(p − 1)) E(|Xn | Mn^{p−1} ).

By Hölder’s inequality,

    E(|Xn | Mn^{p−1} ) ≤ (E|Xn |^p )^{1/p} (E(Mn^p ))^{(p−1)/p} .
Putting together the above two displays and rearranging, we get the desired in-
equality. 
Doob’s Lp inequality implies the following necessary and sufficient condition
for Lp convergence of martingales.
T HEOREM 9.13.3. Let {Xn }n≥0 be a martingale or nonnegative submartin-
gale adapted to a filtration {Fn }n≥0 . Suppose that supn≥0 E|Xn |p < ∞ for some
p > 1. Then there is a random variable X defined on the same probability space
such that Xn → X in Lp and also almost surely. Conversely, if Xn converges to
some X in Lp , then supn≥0 E|Xn |p < ∞.

P ROOF. The converse implication is trivial, so let us prove the forward direc-
tion only. Since the Lp norm of Xn is uniformly bounded, so is the L1 norm.
Therefore, by the submartingale convergence theorem, there exists a random vari-
able X such that Xn → X a.s. By Fatou’s lemma,
    E|X|^p = E(lim inf_{n→∞} |Xn |^p ) ≤ lim inf_{n→∞} E|Xn |^p < ∞.
Thus, X ∈ Lp . For each n, let

    Zn := sup_{0≤m≤n} |Xm − X|^p ,

and

    Z := lim_{n→∞} Zn = sup_{n≥0} |Xn − X|^p .

By Jensen’s inequality,

    |Xn − X|^p = 2^p |(Xn − X)/2|^p ≤ 2^p (|Xn |^p + |X|^p )/2.

Thus, by Doob’s Lp inequality (Theorem 9.13.2),

    E(Zn ) ≤ (p/(p − 1))^p 2^{p−1} (E|Xn |^p + E|X|^p ).

But, by the monotone convergence theorem, E(Z) = limn→∞ E(Zn ), and so,
E(Z) < ∞. But Z dominates |Xn − X|p for each n, and |Xn − X|p → 0 a.s.
Thus, by the dominated convergence theorem, we get the desired result. 
E XERCISE 9.13.4. Let {Xn }n≥0 be a martingale sequence that is uniformly
bounded in L2 . Show that for any n ≤ m,

    E(Xn − Xm )^2 = ∑_{i=n}^{m−1} E(Xi − Xi+1 )^2 ,

and using this, give a direct proof that Xn converges in L2 to some limit random
variable.
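The orthogonality-of-increments identity in the exercise above can be checked numerically. The sketch below uses the hypothetical L²-bounded martingale Xk = ∑_{i≤k} εi /i with i.i.d. ±1 signs εi (a choice made only for this illustration), for which the right-hand side is ∑_{j=n+1}^m 1/j².

```python
import random

rng = random.Random(7)
n, m, trials = 3, 8, 50_000

# X_k = sum_{i=1}^k eps_i / i with i.i.d. +-1 signs; increments are
# orthogonal, so E(X_n - X_m)^2 = sum_{j=n+1}^{m} 1/j^2.
lhs = 0.0
for _ in range(trials):
    diff = sum(rng.choice((-1.0, 1.0)) / j for j in range(n + 1, m + 1))
    lhs += diff * diff
lhs /= trials                                   # Monte Carlo E(X_n - X_m)^2

rhs = sum(1.0 / (j * j) for j in range(n + 1, m + 1))  # exact increment sum
```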
For the necessary and sufficient condition for L1 convergence of martingales,
recall the definition of uniform integrability from Section 8.3 of Chapter 8.
T HEOREM 9.13.5. A submartingale converges in L1 if and only if it is uni-
formly integrable, and in this situation, it also converges a.s. to the same limit.

P ROOF. Suppose that {Xn }n≥0 is a uniformly integrable submartingale. By


the definition of uniform integrability, it is obvious that supn≥1 E|Xn | < ∞.
Therefore, by the submartingale convergence theorem, there exists X such that
Xn → X a.s. But then by Proposition 8.3.1, it follows that Xn → X in L1 . Con-
versely, suppose that {Xn }n≥0 is a submartingale converging in L1 to some limit
X. Then by Proposition 8.3.5, {Xn }n≥0 is uniformly integrable. 

E XERCISE 9.13.6. Let {Xn }n≥0 be a martingale or a nonnegative submartin-


gale. If supn≥0 E|Xn |p < ∞ for some p > 1, show that the sequence {|Xn |p }n≥0
is uniformly integrable.
E XERCISE 9.13.7. Let {Xn }n≥0 be a square-integrable martingale. Define

    ⟨X⟩n := ∑_{i=1}^n E((Xi − Xi−1 )^2 |Fi−1 ).

Show that the sequence {⟨X⟩n }n≥0 is nonnegative, increasing, and predictable.
(It is called the ‘predictable quadratic variation’ of Xn .) Show that Xn^2 − ⟨X⟩n is
a martingale. Finally, let ⟨X⟩∞ := limn→∞ ⟨X⟩n , and show that Xn converges
a.s. on the set {⟨X⟩∞ < ∞}.
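For a ±1 simple random walk every conditional increment variance is 1, so ⟨X⟩n = n, and the exercise's claim that Xn² − ⟨X⟩n is a martingale forces E(Xn² − n) = 0. A minimal Monte Carlo check of that consequence:

```python
import random

rng = random.Random(2)
n, trials = 30, 50_000

# For a +-1 simple random walk, E((X_i - X_{i-1})^2 | F_{i-1}) = 1, so
# <X>_n = n and X_n^2 - n is a martingale; in particular E(X_n^2 - n) = 0.
total = 0.0
for _ in range(trials):
    s = sum(rng.choice((-1, 1)) for _ in range(n))
    total += s * s - n
avg = total / trials   # should fluctuate around 0
```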
E XERCISE 9.13.8 (Galton–Watson branching process). The Galton–Watson
branching process is defined as follows. At generation 0, there is a single organism,
which gives birth to a random number of offspring, which all belong to generation
1. Then, each organism in generation 1 independently gives birth to a random
number of offspring, which are the members of generation 2, and so on. The

assumptions are that the individuals give birth independently, and the distribution
of the number of offspring per individual is the same throughout. Mathematically,
if Zn denotes the number of individuals in generation n, then

    Zn+1 = ∑_{i=1}^{Zn} Xn,i ,

where Xn,i are i.i.d. random variables from the offspring distribution, which are
independent of everything up to generation n. Suppose that the mean number of
offspring per individual, µ, is finite. Then:
(1) Show that Mn := µ−n Zn is a martingale adapted to a suitable filtration.
(2) Verifying the required conditions, conclude that limn→∞ Mn exists a.s. and
is integrable. From this, show that if µ < 1, then Zn → 0 a.s.
(3) If µ = 1, use the above fact and the fact that Zn is integer-valued to show
that Zn → 0 a.s. when the number of offspring is not always exactly
equal to 1.
(4) Suppose that the variance of the offspring distribution, σ^2 , is finite. Then
show that E(Z_{n+1}^2 |Fn ) = µ^2 Zn^2 + σ^2 Zn .
(5) Use the above result to show that if µ ≠ 1,

        Nn := Mn^2 − (σ^2 /µ^{n+1}) · ((µ^n − 1)/(µ − 1)) · Mn

is a martingale.
(6) Using the above, show that supn≥1 E(Mn^2 ) < ∞ if µ > 1 and σ^2 <
∞. From this, conclude that P(limn→∞ Mn > 0) > 0, that is, there
is positive probability that Zn behaves like a positive multiple of µ^n as
n → ∞.
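The martingale Mn = µ^{−n} Zn from part (1) can be simulated directly. The sketch below uses an arbitrary supercritical offspring law chosen for illustration (0, 1, or 2 children with probabilities 1/4, 1/4, 1/2, so µ = 1.25 and the extinction probability is 1/2): the sample mean of Mn stays near 1, while a positive fraction of lineages survive, consistent with part (6).

```python
import random

rng = random.Random(3)

def offspring(rng):
    # Hypothetical offspring law for this sketch: 0, 1, or 2 children
    # with probabilities 1/4, 1/4, 1/2, giving mean mu = 1.25.
    u = rng.random()
    return 0 if u < 0.25 else (1 if u < 0.5 else 2)

mu = 1.25

def simulate_M(n_gen, rng):
    # Returns M_n = mu^{-n} Z_n for one lineage started from Z_0 = 1.
    z = 1
    for _ in range(n_gen):
        z = sum(offspring(rng) for _ in range(z))
        if z == 0:
            break
    return z / mu ** n_gen

trials, n_gen = 2000, 20
samples = [simulate_M(n_gen, rng) for _ in range(trials)]
mean_M = sum(samples) / trials                        # martingale: E(M_n) = 1
survived = sum(1 for m in samples if m > 0) / trials  # near 1/2 for this law
```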

E XERCISE 9.13.9. In Exercise 9.12.2, if X ∈ Lp for some p ≥ 1, show that


Xn → X in Lp .

E XERCISE 9.13.10. Using the above exercise, show that C[0, 1] is a dense
subset of L2 [0, 1]. (Hint: First show that step functions are dense, and then ap-
proximate step functions by continuous functions.)

E XERCISE 9.13.11. A trigonometric polynomial on [0, 1] is a function of the
form

    ∑_{n=0}^N an cos 2πnx + ∑_{n=1}^N bn sin 2πnx

for some N and some coefficients a0 , . . . , aN and b1 , . . . , bN .


(1) Show that the trigonometric polynomials form an algebra over the real
numbers — that is, they are closed under addition, multiplication, and
scalar multiplication.

(2) Take any 0 < a < b < 1, and any N . Define a function fN : [0, 1] →
[0, ∞) as

        fN (x) := √N ∫_a^b ( (1 + cos 2π(u − x))/2 )^N du.

Show that fN is uniformly bounded above by a constant that does not de-
pend on N , and that as N → ∞, fN (x) → 0 if x ∉ [a, b], and fN (x) → c
if x ∈ (a, b), where c is a positive real number.
(3) Using the above and Exercise 9.12.2, show that if g ∈ L1 [0, 1] is orthogo-
nal to sin 2πnx and cos 2πnx for all n (under the L2 inner product), then
g = 0 a.e. on [0, 1].

9.14. Almost supermartingales


A sequence of random variables {Xn }n≥0 adapted to a filtration {Fn }n≥0 is
called a nonnegative ‘almost supermartingale’ if each Xn is a nonnegative inte-
grable random variable, and there are two sequences of adapted nonnegative inte-
grable random variables {ξn }n≥0 and {ζn }n≥0 such that for each n,
E(Xn+1 |Fn ) ≤ Xn + ξn − ζn a.s.
The following theorem gives a criterion under which the above almost supermartin-
gale converges a.s. It is sometimes called the ‘Robbins–Siegmund almost super-
martingale theorem’.
T HEOREM 9.14.1. Let Xn , ξn and ζn be as above. Then with probability one
on the set {∑_{n=0}^∞ ξn < ∞}, limn→∞ Xn exists and is finite, and ∑_{n=0}^∞ ζn < ∞.

P ROOF. Let

    Yn := Xn − ∑_{k=0}^{n−1} (ξk − ζk ).

Then note that

    E(Yn+1 |Fn ) = E(Xn+1 |Fn ) − ∑_{k=0}^{n} (ξk − ζk )
                 ≤ Xn + ξn − ζn − ∑_{k=0}^{n} (ξk − ζk )
                 = Xn − ∑_{k=0}^{n−1} (ξk − ζk ) = Yn .

Thus, {Yn }n≥0 is a supermartingale. Take any a > 0. Let τ := inf{n : ∑_{k=0}^n ξk ≥
a}. Note that τ is allowed to be infinity. By the optional stopping theorem, {Yτ ∧n }
is a supermartingale adapted to {Fn }n≥0 . Moreover, since the Xn ’s and ζn ’s are
nonnegative random variables,

    Yτ ∧n = Xτ ∧n − ∑_{k=0}^{τ ∧n−1} (ξk − ζk ) > −a.

By the submartingale convergence theorem, any supermartingale that is uniformly
bounded below must converge a.s. to an integrable limit. Therefore, we get that
limn→∞ Yτ ∧n exists and is finite a.s. This shows that limn→∞ Yn exists and is
finite a.s. on the set {∑_{k=0}^∞ ξk < a}. Taking union over a = 1, 2, . . ., we conclude
that limn→∞ Yn exists and is finite a.s. on the set {∑_{k=0}^∞ ξk < ∞}.
Now, if Yn converges and ∑_{k=0}^∞ ξk is finite, then the definition of Yn shows
that

    sup_{n≥0} ( Xn + ∑_{k=0}^{n−1} ζk ) = sup_{n≥0} ( Yn + ∑_{k=0}^{n−1} ξk ) < ∞.

This implies that ∑_{k=0}^∞ ζk < ∞, and hence,

    lim_{n→∞} Xn = lim_{n→∞} Yn + ∑_{k=0}^∞ (ξk − ζk )

exists. This completes the proof of the theorem. □

Striking applications of almost supermartingales can be found in the literature


on stochastic approximation. The following is a special case of the celebrated
Robbins–Monro algorithm from this literature.
Suppose that there is an unknown measurable function f : R → R with a single
zero at some x∗ ∈ R, and our goal is to find x∗ . This kind of problem often comes
up in optimization, because maxima and minima of functions are characterized by
the zeros of their derivatives. The problem is that whenever we query the value of
f at some point x, we can only recover f (x) up to some random error with mean
zero but possibly a substantial variance. The following amazing algorithm allows
us to converge to x∗ even under such unfavorable conditions.
Start with X0 = x0 for some arbitrary x0 ∈ R. At step n of the algorithm
(starting with n = 0), we query the value of f at Xn , and obtain a value Yn , where
Yn is a function of Xn and some extra randomness that is independent of past
events, and E(Yn |Xn ) = f (Xn ). Define
Xn+1 := Xn − an Yn ,
where {an }n≥0 is a sequence of constants satisfying

    ∑_{n=0}^∞ an^2 < ∞,    ∑_{n=0}^∞ an = ∞.        (9.14.1)

For example, we can take an = (n + 1)^{−3/4} . The following theorem shows that under
mild conditions, Xn converges almost surely to x∗ as n → ∞.

T HEOREM 9.14.2 (Convergence of Robbins–Monro algorithm). In the above


setting, suppose that |f | is uniformly bounded, and that for any ε > 0,

    inf_{ε<x<1/ε} f (x∗ + x) > 0,    sup_{−1/ε<x<−ε} f (x∗ + x) < 0.        (9.14.2)

Further, suppose that Var(Yn |Xn ) is uniformly bounded above by a constant. Then
Xn → x∗ almost surely.

P ROOF. Define Zn := (Xn − x∗ )2 . Let Fn := σ(X0 , . . . , Xn ). Since Yn


is a function of Xn and some extra randomness that is independent of the past,
E(Yn |Fn ) = E(Yn |Xn ) and Var(Yn |Fn ) = Var(Yn |Xn ). In particular,
E(Yn − f (Xn )|Fn ) = 0.
Thus,

    E(Zn+1 |Fn ) = E((Xn+1 − x∗ )^2 |Fn )
                 = E((Xn − x∗ − an (Yn − f (Xn )) − an f (Xn ))^2 |Fn )
                 = (Xn − x∗ )^2 + an^2 E((Yn − f (Xn ))^2 |Fn ) + an^2 f (Xn )^2
                   − 2an (Xn − x∗ )E(Yn − f (Xn )|Fn ) + 2an^2 f (Xn )E(Yn − f (Xn )|Fn )
                   − 2an (Xn − x∗ )f (Xn )
                 = Zn + an^2 Var(Yn |Xn ) + an^2 f (Xn )^2 − 2an |Xn − x∗ ||f (Xn )|,
where in the last step we used the fact (implied by (9.14.2)) that Xn − x∗ and
f (Xn ) have the same sign. Let σ 2 be a uniform upper bound on Var(Yn |Xn ) and
B be a uniform upper bound on |f |. Then the above expression shows that
    E(Zn+1 |Fn ) ≤ Zn + ξn − ζn ,

where ξn = an^2 (σ^2 + B^2 ) and ζn = 2an |Xn − x∗ ||f (Xn )|.
Note that ξn and ζn are nonnegative and Fn -measurable. Thus, this expresses Zn
as an almost supermartingale. But here the ξn ’s are constants, and by assumption
(9.14.1), ∑_{n=0}^∞ ξn < ∞. Therefore, by the Robbins–Siegmund theorem (Theorem
9.14.1), limn→∞ Zn exists and is finite a.s., and ∑_{n=0}^∞ ζn < ∞ a.s. Let

    D := lim_{n→∞} √Zn = lim_{n→∞} |Xn − x∗ |.

Suppose that D ≠ 0 at some sample point. Then from assumption (9.14.2) we
deduce that f (Xn ) must stay bounded away from 0 as n → ∞, and also that
|Xn − x∗ | → D > 0. Since ∑_{n=0}^∞ an = ∞, this shows that at such a sample point,
we cannot have ∑_{n=0}^∞ ζn < ∞. This proves that D = 0 a.s., that is, Xn → x∗
a.s. □
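As a concrete illustration of the theorem just proved, here is a minimal simulation of the algorithm. The target f(x) = tanh(x − x∗) with x∗ = 2, the start X0 = −5, and the step sizes an = (n + 1)^{−3/4} are arbitrary choices for this sketch; they satisfy the hypotheses (f is bounded with the right signs, the an satisfy (9.14.1), and Gaussian query noise keeps Var(Yn |Xn ) bounded).

```python
import math
import random

rng = random.Random(0)
x_star = 2.0

def f(x):
    # Hypothetical target for this sketch: bounded, single zero at x_star,
    # positive to the right of x_star and negative to the left, as in (9.14.2).
    return math.tanh(x - x_star)

x = -5.0  # arbitrary starting point X_0
for n in range(200_000):
    a_n = (n + 1) ** -0.75          # sum a_n = infinity, sum a_n^2 < infinity
    y = f(x) + rng.gauss(0.0, 1.0)  # noisy query: E(Y_n | X_n) = f(X_n)
    x -= a_n * y

# x should now be close to the unknown zero x_star, despite the unit-variance
# noise dwarfing |f| near the end of the run.
```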

The following exercise is a simple fact from real analysis that will be needed
in the subsequent problems.

E XERCISE 9.14.3. Let {an }n≥0 be a sequence of positive real numbers in-
creasing to infinity. Then show that

    ∑_{n=0}^∞ (an+1 − an )/an+1 = ∞,    ∑_{n=0}^∞ (an+1 − an )/an+1^2 < ∞.

(Hint: For the second claim, simply compare with the integral of 1/x^2 . For the
first, separately consider the cases lim inf an /an+1 = 0 and > 0, and compare
with the integral of 1/x for the latter.)

E XERCISE 9.14.4 (Strong law of large numbers for martingales). Let {Sn }n≥0
be a mean zero, square-integrable martingale adapted to a filtration {Fn }n≥0 . Let
{cn }n≥1 be a predictable and increasing sequence of random variables (that is, cn
is Fn−1 -measurable and cn ≤ cn+1 for each n). Let σn^2 := Var(Sn |Fn−1 ). Then,
show that Sn /cn → 0 a.s. on the set

    { ∑_{n=1}^∞ σn^2 /cn^2 < ∞ and lim_{n→∞} cn = ∞ }.

(Hint: Define Zn := Sn^2 /cn^2 and show that it is an almost supermartingale, and then
apply the Robbins–Siegmund theorem.)

E XERCISE 9.14.5. Let X0 , X1 , . . . be a sequence of i.i.d. square-integrable
random variables with mean zero. Let Sn := ∑_{i=0}^n Xi Xi+1 . Using the SLLN for
martingales, prove that Sn /n → 0 a.s.

E XERCISE 9.14.6. Let {Xn }n≥0 be a sequence of {0, 1}-valued random vari-
ables adapted to a filtration {Fn }n≥0 . Let pn := P(Xn = 1|Fn−1 ) (with p0 :=
P(X0 = 1)) and let Sn := ∑_{k=0}^n Xk . Then show that the limit

    lim_{n→∞} Sn / ∑_{k=0}^n pk

exists a.s. Moreover, show that the limit is equal to 1 almost surely on the set
{∑_{n=0}^∞ pn = ∞}. (Hint: Define a suitable almost supermartingale.)
CHAPTER 10

Ergodic theory

Ergodic theory is the study of measure preserving transforms and their proper-
ties. It is useful in probability theory for the study of random processes whose laws
are invariant under various groups of transformations. This chapter gives an intro-
duction to the basic notions and results of ergodic theory and some applications in
probability.

10.1. Measure preserving transforms


Let (Ω, F, P) be a probability space. A measurable map ϕ : Ω → Ω is called
‘measure preserving’ if P(ϕ−1 A) = P(A) for all A ∈ F. (In ergodic theory, the
standard practice is to write ϕ−1 A instead of ϕ−1 (A), ϕ2 A instead of ϕ(ϕ(A)),
and so on.) The following lemma is often useful for proving that a given map is
measure preserving.
L EMMA 10.1.1. Let (Ω, F, P) be a probability space and ϕ : Ω → Ω be a
measurable map. Suppose that P(ϕ−1 A) = P(A) for all A in a π-system P that
generates F. Then ϕ is measure preserving.

P ROOF. By Dynkin’s π-λ theorem, it suffices to check that the set L of all
A ∈ F such that P(ϕ−1 A) = P(A) is a λ-system. First, note that ϕ−1 Ω = Ω.
Thus, Ω ∈ L. Next, if A ∈ L, then
P(ϕ−1 Ac ) = P((ϕ−1 A)c ) = 1 − P(ϕ−1 A) = 1 − P(A) = P(Ac ),
which shows that Ac ∈ L. Finally, if A1 , A2 , . . . are disjoint elements of L, and
A = ∪_{i=1}^∞ Ai , then

    P(ϕ−1 A) = P( ∪_{i=1}^∞ ϕ−1 Ai ) = ∑_{i=1}^∞ P(ϕ−1 Ai ) = ∑_{i=1}^∞ P(Ai ) = P(A).

Thus, L is a λ-system. □
E XAMPLE 10.1.2 (Bernoulli shift). Let Ω = {0, 1}N be the set of all infinite
sequences of 0’s and 1’s. Let F be the product σ-algebra on this space, and P be
the product of Ber(p) measures. The ‘shift operator’ on this space is defined as
ϕ(ω0 , ω1 , . . .) = (ω1 , ω2 , . . .).

We now show that this map is measure preserving. Consider the collection of all
sets of the form

    A = {ω : ω0 = x0 , . . . , ωn = xn }

for some n ≥ 0 and some x0 , . . . , xn ∈ {0, 1}. Clearly, this collection is a π-system
that generates the product σ-algebra. For a set A as displayed above, ϕ−1 A
is the set of all sequences (ω0 , ω1 , . . .) such that (ω1 , ω2 , . . .) ∈ A, that is, the set
of all ω such that ω1 = x0 , ω2 = x1 , . . . ωn+1 = xn . Under the product measure,
both A and ϕ−1 A have the same measure. By Lemma 10.1.1, this shows that ϕ is
measure preserving.
E XAMPLE 10.1.3 (Rotations of the circle). Let Ω = [0, 1) with the Borel σ-
algebra and Lebesgue measure. Take any θ ∈ [0, 1), and let ϕ(x) := x + θ (mod 1);
that is, ϕ(x) = x + θ if x + θ < 1, and ϕ(x) = x + θ − 1 if x + θ ≥ 1.
One way to view ϕ is to consider [0, 1) as the unit circle in the complex plane via
the map x 7→ e2πix , which makes ϕ a ‘rotation by angle θ’. We claim that ϕ is
measure preserving. To see this, simply note that the map ϕ is a bijection with
ϕ−1 (x) = x − θ (mod 1); that is, ϕ−1 (x) = x − θ if x − θ ≥ 0, and ϕ−1 (x) =
x − θ + 1 if x − θ < 0. Therefore, for any interval [a, b] ⊆ [0, 1),

    ϕ−1 ([a, b]) = [a − θ, b − θ] if a ≥ θ;  [0, b − θ] ∪ [1 + a − θ, 1] if a < θ ≤ b;
                   [1 + a − θ, 1 + b − θ] if b < θ.

In all three cases, the Lebesgue measure of ϕ−1 ([a, b]) equals b − a. Since the
set of closed intervals is a π-system that generates the Borel σ-algebra on [0, 1),
Lemma 10.1.1 shows that ϕ is measure preserving.
E XERCISE 10.1.4. Show that the map ϕ(x) := 2x (mod 1) preserves Lebesgue
measure on [0, 1).
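The exercise above can be probed numerically in two ways: exactly, on the π-system of intervals (the preimage of [a, b) under the doubling map is two intervals of total length b − a), and by Monte Carlo (the pushforward of a uniform sample stays uniform). The interval endpoints and sample sizes below are arbitrary choices for this sketch.

```python
import random

def phi(x):
    # the doubling map on [0, 1)
    return (2.0 * x) % 1.0

# Exact check on a pi-system of intervals: the preimage of [a, b) is
# [a/2, b/2) union [(1 + a)/2, (1 + b)/2), of total length b - a.
a, b = 0.2, 0.7
pieces = [(a / 2, b / 2), ((1 + a) / 2, (1 + b) / 2)]
preimage_length = sum(hi - lo for lo, hi in pieces)

# Monte Carlo check: if U is uniform on [0, 1), phi(U) lands in [a, b)
# with probability b - a.
rng = random.Random(5)
trials = 100_000
freq = sum(a <= phi(rng.random()) < b for _ in range(trials)) / trials
```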
Measure preserving transforms possess the following important property.
P ROPOSITION 10.1.5. If ϕ is a measure preserving map on a probability space
(Ω, F, P), then for any integrable random variable X defined on this space, X ◦ ϕ
has the same distribution as X, and hence, has the same expected value as X.

P ROOF. Simply note that for any Borel set A ⊆ R,

    P(X ◦ ϕ ∈ A) = P({ω : ϕ(ω) ∈ X −1 (A)})
                 = P(ϕ−1 X −1 (A)) = P(X −1 (A)) = P(X ∈ A),

where the second-to-last equality holds by the measure preserving property. □

10.2. Ergodic transforms


Let (Ω, F, P) be a probability space and ϕ : Ω → Ω be a measure preserving
transform. The ‘invariant σ-algebra’ of ϕ is the set of all A ∈ F such that ϕ−1 A =
A. It is easy to see that this is indeed a σ-algebra. The transform ϕ is called
‘ergodic’ for the probability measure P if its invariant σ-algebra I is trivial, that is,
P(A) ∈ {0, 1} for all A ∈ I.
E XAMPLE 10.2.1 (Ergodicity of Bernoulli shift). Recall the Bernoulli shift op-
erator ϕ defined in Example 10.1.2, and the associated probability space (Ω, F, P).
We claim that ϕ is ergodic. To see this, take any A ∈ I, where I is the invariant
σ-algebra of ϕ. Define a sequence of random variables (Xn )n≥0 as Xn (ω) := ωn .
Then, by the definition of P, these random variables are i.i.d. Ber(p). The invari-
ance of A means that for any k,
A = ϕ−k A
= {ω : (ωk , ωk+1 , . . .) ∈ A}
= {ω : (Xk (ω), Xk+1 (ω), . . .) ∈ A}.
Thus, A ∈ σ(Xk , Xk+1 , . . .). Since this holds for any k, we conclude that A is
an element of the tail σ-algebra of the sequence (Xn )n≥0 . Thus, by Kolmogorov’s
zero-one law, P(A) ∈ {0, 1}, and hence, ϕ is ergodic.
E XAMPLE 10.2.2 (Ergodicity of rotations). Recall the rotation map ϕ from
Example 10.1.3. We will now show that ϕ is ergodic if θ is irrational, but not
ergodic if θ is rational.
First, suppose that θ is rational. Then θ = m/n for some integers 0 ≤ m < n.
Let

    A := ∪_{i=0}^{n−1} [ i/n, i/n + 1/(2n) ).
Now note that for any i ≥ m,

    ϕ−1 [ i/n, i/n + 1/(2n) ) = [ (i − m)/n, (i − m)/n + 1/(2n) ),

and for any i < m,

    ϕ−1 [ i/n, i/n + 1/(2n) ) = [ (i − m + n)/n, (i − m + n)/n + 1/(2n) ).
From this, it is easy to see that ϕ−1 A = A. But the Lebesgue measure of A is 1/2.
Thus, ϕ is not ergodic if θ is rational.
Next, suppose that θ is irrational. Take any A such that A = ϕ−1 A. Then note
that for any integer k, Proposition 10.1.5 gives
    ck := ∫_0^1 1A (x) e^{2πikx} dx = ∫_0^1 1A (ϕ(x)) e^{2πikϕ(x)} dx.
But the invariance of A implies that
1A (ϕ(x)) = 1ϕ−1 A (x) = 1A (x),

and the fact that k is an integer implies that

    e^{2πikϕ(x)} = e^{2πik(x+θ)} ,

because ϕ(x) is either x + θ or x + θ − 1, and e^{−2πik} = 1. Thus,

    ck = e^{2πikθ} ck .
Since θ is irrational, kθ is not an integer for any nonzero integer k, and so, the
above identity implies that ck = 0 for all k 6= 0. Consequently, the inner product
of 1A − c0 with cos 2πnx or sin 2πnx, for any n, is zero. By Exercise 9.13.11, this
implies that 1A = c0 a.e., which shows that the Lebesgue measure of A must be 0
or 1.
E XERCISE 10.2.3. Prove that the map ϕ from Exercise 10.1.4 is ergodic.
E XERCISE 10.2.4. If I is the invariant σ-algebra of a measure preserving
transform ϕ, show that an R∗ -valued random variable Y is I-measurable if and
only if Y ◦ ϕ = Y .
E XERCISE 10.2.5. Give an example of an ergodic measure preserving trans-
formation ϕ such that ϕ2 is not ergodic.
E XERCISE 10.2.6. Let d ≥ 1 and Ω = {0, 1}E , where E is the set of edges of
Zd . Let Ω be endowed with the product σ-algebra. Let p ∈ (0, 1) and let µ be the
infinite product of Bernoulli(p) measures on Ω. That is, µ is the law of a collection
of i.i.d. Bernoulli(p) random variables, one attached to each edge of Zd . For each
x ∈ Zd , let Tx denote the map on Ω that translates by x; that is, if ω′ = Tx (ω),
then ω′_e = ω_{e+x} for each edge e, where e + x is the translation of e by x. Prove
that for any x, Tx is measure preserving, and if x 6= 0, then Tx is ergodic.
E XERCISE 10.2.7. Consider a collection of i.i.d. Bernoulli(p) random vari-
ables attached to edges of Zd , as in the preceding problem. Let Xe denote the
variable attached to edge e. Let us say that an edge is ‘open’ if Xe = 1 and
‘closed’ otherwise. This is the classical edge percolation model. Say that two ver-
tices are connected if there is a sequence of open edges connecting one to the other.
This is an equivalence relation that divides up Zd into a set of disjoint connected
‘clusters’. Using the previous problem, show that the number of infinite clusters is
almost surely a constant N , belonging to {0, 1, 2, . . .} ∪ {∞}.
E XERCISE 10.2.8. In the previous problem, suppose that N is finite but greater
than one. Then show that there is a set B ⊆ Zd such that B intersects more than
one infinite cluster with positive probability. Using this, produce an argument that
leads to a contradiction, and thus establish that N ∈ {0, 1} ∪ {∞}. (Actually, the
value ∞ is also impossible, but that is much harder to prove. This is a famous
result in percolation theory.)
E XERCISE 10.2.9. Let ϕ(x) = 4x(1 − x) for x ∈ [0, 1]. Prove that ϕ preserves
the probability measure on [0, 1] with density 1/(π √(x(1 − x))).
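The invariance asserted in the exercise above can be tested by sampling. If X has the arcsine law (density 1/(π√(x(1 − x))), CDF F(t) = (2/π) arcsin √t), then ϕ(X) should again have that law; the sketch below samples X via X = (1 − cos πU)/2 for uniform U and compares the empirical CDF of ϕ(X) to F at one (arbitrarily chosen) point.

```python
import math
import random

rng = random.Random(11)

def phi(x):
    return 4.0 * x * (1.0 - x)

def F(t):
    # CDF of the arcsine law, whose density is 1 / (pi * sqrt(x (1 - x)))
    return (2.0 / math.pi) * math.asin(math.sqrt(t))

# If U is uniform on [0, 1), then X = (1 - cos(pi U)) / 2 has the arcsine law.
trials, t = 100_000, 0.3
below = 0
for _ in range(trials):
    x = 0.5 * (1.0 - math.cos(math.pi * rng.random()))
    if phi(x) <= t:
        below += 1
emp = below / trials   # empirical P(phi(X) <= t); should match F(t)
```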

10.3. Birkhoff’s ergodic theorem


The following is one of the fundamental results of ergodic theory, known as
Birkhoff’s ergodic theorem, and also known as the pointwise ergodic theorem.
T HEOREM 10.3.1 (Birkhoff’s ergodic theorem). Let (Ω, F, P) be a probability
space and ϕ be a measure preserving transform on Ω. Let X be an integrable real-
valued random variable defined on Ω. Let I be the invariant σ-algebra of ϕ. Then

    (1/n) ∑_{m=0}^{n−1} X ◦ ϕ^m → E(X|I)

almost surely and in L1 as n → ∞. In particular, if ϕ is ergodic, then the averages
on the left converge to E(X) a.s. and in L1 .
The first step in the proof of Birkhoff’s ergodic theorem is the following result,
known as the maximal ergodic theorem.
T HEOREM 10.3.2 (Maximal ergodic theorem). Let X and ϕ be as in Birkhoff’s
ergodic theorem. For each k ≥ 0, let Xk := X ◦ ϕk , Sk := X0 + · · · + Xk , and
Mk := max{0, S0 , S1 , . . . , Sk }.
Then for any k, E(X; Mk > 0) ≥ 0.

P ROOF. Take any ω ∈ Ω. If j ≤ k, then Sj (ϕω) ≤ Mk (ϕω). Thus,


X(ω) + Mk (ϕω) ≥ X(ω) + Sj (ϕω) = Sj+1 (ω).
Thus,
X(ω) ≥ Sj+1 (ω) − Mk (ϕω)
for j = 0, . . . , k. But also,
X(ω) ≥ S0 (ω) − Mk (ϕω)
since S0 = X and Mk ≥ 0 everywhere. Thus,
    E(X; Mk > 0) ≥ ∫_{Mk (ω)>0} (max{S0 (ω), . . . , Sk (ω)} − Mk (ϕω)) dP(ω)
                 = ∫_{Mk (ω)>0} (Mk (ω) − Mk (ϕω)) dP(ω).
Now, by Proposition 10.1.5,

    ∫_Ω (Mk (ω) − Mk (ϕω)) dP(ω) = 0.

On the other hand, the nonnegativity of Mk implies that

    ∫_{Mk (ω)=0} (Mk (ω) − Mk (ϕω)) dP(ω) ≤ 0.

By the last three displays, we see that E(X; Mk > 0) must be nonnegative. 
We are now ready to prove Birkhoff’s ergodic theorem.

P ROOF OF T HEOREM 10.3.1. First, we claim that it suffices to prove the the-
orem under the assumption that E(X|I) = 0 a.s. To see this, suppose that the the-
orem holds under this assumption. Let Y := E(X|I). Since Y is I-measurable,
it satisfies Y ◦ ϕm = Y for any m, by Exercise 10.2.4. Let Z := X − Y , so that
E(Z|I) = 0 a.s. Thus, by the assumed version of the theorem,

    (1/n) ∑_{m=0}^{n−1} Z ◦ ϕ^m → 0

a.s. and in L1 . But Z ◦ ϕ^m = X ◦ ϕ^m − Y ◦ ϕ^m = X ◦ ϕ^m − Y , and thus,

    (1/n) ∑_{m=0}^{n−1} Z ◦ ϕ^m = (1/n) ∑_{m=0}^{n−1} X ◦ ϕ^m − E(X|I).
This gives us the full version of the theorem. Thus, let us henceforth work under the
assumption that E(X|I) = 0 a.s. Our goal, then, is to show that Sn /(n + 1) → 0
a.s. and in L1 , where Sn is as in the maximal ergodic theorem. Define
    X̄ := lim sup_{n→∞} Sn /(n + 1).

From this definition, it is clear that X̄ ◦ ϕ = X̄. By Exercise 10.2.4, this shows
that X̄ is I-measurable.
Next, fix some ε > 0, and define

    D := {ω : X̄(ω) > ε}.

Since X̄ is I-measurable, we conclude that D ∈ I. Define

    X ∗ := (X − ε)1D ,
and define, in analogy with Sk and Mk , the random variables
Sk∗ (ω) := X ∗ (ω) + X ∗ (ϕω) + · · · + X ∗ (ϕk ω),
Mk∗ (ω) := max{0, S0∗ (ω), . . . , Sk∗ (ω)}.
For each k, define the set
Fk := {ω : Mk∗ (ω) > 0},
and let F := ∪k≥0 Fk . We claim that F = D. To see this, first take any ω ∈ F .
Then Mk∗ (ω) > 0 for some k, and hence Sj∗ (ω) > 0 for some j. Thus, X ∗ (ϕi ω) >
0 for some i. In particular, this shows that ϕi ω ∈ D for some i. Since D ∈ I, this
shows that ω ∈ D. Thus, F ⊆ D. Conversely, take any ω ∈ D. Then ϕj ω ∈ D
for all j, and so,
    X ∗ (ϕ^j ω) = (X(ϕ^j ω) − ε)1D (ϕ^j ω) = X(ϕ^j ω) − ε.

Thus,

    Sk∗ (ω) = Sk (ω) − (k + 1)ε

for all k. Since ω ∈ D, we have

    lim sup_{k→∞} Sk (ω)/(k + 1) > ε.

Thus, the previous display shows that Sk∗ (ω) > 0 for some k, and hence, Mk∗ (ω) >
0 for some k. In other words, ω ∈ F . Thus, D ⊆ F , completing the proof of our
claim that F = D.
Since Mk∗ is an increasing sequence, so is Fk . Thus, X ∗ 1Fk → X ∗ 1F point-
wise as k → ∞. Moreover, |X ∗ 1Fk | ≤ |X ∗ | ≤ |X| + ε for all k, and X is
integrable. So, by the dominated convergence theorem, we get that
    lim_{k→∞} E(X ∗ ; Fk ) = E(X ∗ ; F ).
By the maximal ergodic theorem, E(X ∗ ; Fk ) ≥ 0 for any k. Since F = D, this
shows that
E(X ∗ ; F ) = E(X ∗ ; D) ≥ 0.
But
    E(X ∗ ; D) = E((X − ε)1D ) = E(E(X|I); D) − εP(D).

Since E(X|I) = 0 a.s., this shows that −εP(D) ≥ 0. Thus, P(D) = 0. Looking
back at the definition of D, we conclude that lim sup Sn /(n + 1) ≤ 0. Replacing
X by −X throughout the proof, we conclude that lim inf Sn /(n + 1) ≥ 0. This
completes the proof of the a.s. convergence of Sn /n to 0.
It remains to prove convergence in L1 . Fix any M > 0. Define X′M :=
X1{|X|≤M } and X″M := X − X′M . Then by the almost sure part of the theorem,

    (1/n) ∑_{m=0}^{n−1} X′M ◦ ϕ^m → E(X′M |I)
a.s. as n → ∞. Since the random variable on the right is uniformly bounded by M
in absolute value, the dominated convergence theorem implies that the convergence
also holds in L1 . Now, since E(X|I) = 0 a.s., we have

    E| (1/n) ∑_{m=0}^{n−1} X ◦ ϕ^m |
        ≤ E| (1/n) ∑_{m=0}^{n−1} X′M ◦ ϕ^m − E(X′M |I) | + E| (1/n) ∑_{m=0}^{n−1} X″M ◦ ϕ^m − E(X″M |I) |.
We have already shown that the first term on the right tends to zero as n → ∞. On
the other hand,
    E| (1/n) ∑_{m=0}^{n−1} X″M ◦ ϕ^m − E(X″M |I) | ≤ (1/n) ∑_{m=0}^{n−1} E|X″M ◦ ϕ^m | + E|E(X″M |I)|.
By Proposition 10.1.5, E|X″M ◦ ϕ^m | = E|X″M | for any m, and by Jensen’s inequal-
ity,

    E|E(X″M |I)| ≤ E(E(|X″M | |I)) = E|X″M |.
Combining everything, we see that

    lim sup_{n→∞} E| (1/n) ∑_{m=0}^{n−1} X ◦ ϕ^m | ≤ 2E|X″M |.

But M is arbitrary, and by the dominated convergence theorem, E|X″M | → 0 as
M → ∞. This proves the L1 part of the theorem. □

EXERCISE 10.3.3. In Birkhoff's ergodic theorem, show that if X ∈ L^p for some p > 1, then the convergence happens in L^p. (The special case of p = 2 is known as 'von Neumann's ergodic theorem' or the 'mean ergodic theorem'. It has a simpler direct proof using Hilbert spaces.)
EXERCISE 10.3.4. Let ϕ be the rotation map from Example 10.1.3. If θ ∈ [0, 1) is irrational, show that for any Borel set A ⊆ [0, 1), for a.e. x ∈ [0, 1) we have

    lim_{n→∞} (1/n) |{0 ≤ m ≤ n − 1 : ϕ^m x ∈ A}| = λ(A),

where λ(A) is the Lebesgue measure of A.
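This equidistribution is easy to observe numerically. The sketch below (our own illustration, with the hypothetical helper `orbit_frequency`; not code from the notes) counts how often the orbit of the rotation map visits an interval, for the irrational angle θ = √2 mod 1:

```python
import math

def orbit_frequency(theta, a, b, n):
    """Fraction of the first n orbit points x, phi(x), phi^2(x), ... of the
    rotation map phi(x) = (x + theta) mod 1 that land in [a, b), starting
    from x = 0."""
    x = 0.0
    count = 0
    for _ in range(n):
        if a <= x < b:
            count += 1
        x = (x + theta) % 1.0
    return count / n

# For irrational theta, the frequency should be close to lambda([a, b)) = b - a.
theta = math.sqrt(2) % 1.0
freq = orbit_frequency(theta, 0.2, 0.5, 200_000)
```

For θ = √2 mod 1 the observed frequency is within a fraction of a percent of b − a = 0.3, consistent with the exercise (and, since the set is an interval, with the stronger every-x statement of the next exercise).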
EXERCISE 10.3.5. In the previous exercise, suppose that A = [a, b) is a half-open interval contained in [0, 1). Then show that the claim actually holds for all x ∈ [0, 1) (instead of almost all), by the following steps:
(1) Take any k so large that b − a > 2/k. Let A_k := [a + 1/k, b − 1/k). Applying the previous exercise to the set A_k, find a set Ω_k ⊆ [0, 1) of Lebesgue measure 1 such that the conclusion of the previous exercise holds for all x ∈ Ω_k.
(2) Deduce that Ω_k is dense in [0, 1).
(3) Given any x ∈ [0, 1), find y_k ∈ Ω_k such that |x − y_k| < 1/k. Show that if ϕ^m y_k ∈ A_k, then ϕ^m x ∈ A.
(4) Using the above, conclude that for all large enough k,

    lim inf_{n→∞} (1/n) |{0 ≤ m ≤ n − 1 : ϕ^m x ∈ [a, b)}| ≥ b − a − 2/k.

(5) Working with the complement of [a, b), deduce the opposite inequality with lim sup. Combining the two, deduce that the conclusion of the previous exercise holds for all x ∈ [0, 1) if A = [a, b).

10.4. Stationary sequences


Recall that a sequence of real-valued random variables (Xn )n≥0 is called a
‘stationary sequence’ if for any n and k, the law of (X0 , . . . , Xn ) is the same as
that of (Xk , Xk+1 , . . . , Xk+n ). (See Definition 8.5.4 and the discussion following
it.) The following lemma connects stationary sequences of random variables with
measure preserving maps.
LEMMA 10.4.1. Let (Xn)n≥0 be a stationary sequence of real-valued random
variables. Let µ be the law of (Xn )n≥0 on RN . Let ϕ be the shift operator on RN ,
that is, ϕ(ω0 , ω1 , . . .) = (ω1 , ω2 , . . .). Then ϕ preserves µ.
PROOF. Take any rectangular set


A = {ω ∈ RN : ωi1 ∈ B1 , . . . , ωik ∈ Bk },
where i1 , . . . , ik are distinct indices and B1 , . . . , Bk are Borel subsets of R. Then
µ(ϕ−1 A) = P((X0 , X1 , . . .) ∈ ϕ−1 A)
= P(ϕ(X0 , X1 , . . .) ∈ A)
= P((X1 , X2 , . . .) ∈ A)
= P(Xi1 +1 ∈ B1 , Xi2 +1 ∈ B2 , . . . , Xik +1 ∈ Bk ).
But by stationarity, the last expression equals
P(Xi1 ∈ B1 , Xi2 ∈ B2 , . . . , Xik ∈ Bk ),
which is equal to µ(A). This proves that ϕ preserves µ. 

Now let X denote the sequence (Xn )n≥0 . Let I be the invariant σ-algebra
of the shift operator on RN . Then J := X −1 (I) is a sub-σ-algebra of F, where
(Ω, F, P) is the probability space on which X is defined. Let us call this the ‘in-
variant σ-algebra of X’. A stationary sequence is called ‘ergodic’ if its invariant
σ-algebra is trivial.
As a corollary of Birkhoff’s ergodic theorem, we get the following theorem for
infinite stationary sequences of real-valued random variables.
THEOREM 10.4.2. Let X = (Xn)n≥0 be a stationary sequence of integrable real-valued random variables, with invariant σ-algebra J. Then

    (1/n) ∑_{m=0}^{n−1} X_m → E(X0|J)

almost surely and in L1 as n → ∞. In particular, if X is ergodic, then the averages on the left converge to E(X0) a.s. and in L1.

PROOF. Let µ be the law of X on R^N, and let ϕ be the shift operator on R^N. Define a random variable Y : R^N → R as Y(ω0, ω1, . . .) := ω0. Under µ, the law of Y is the same as that of X0. In particular, Y is integrable. Therefore, by Lemma 10.4.1 and Birkhoff's ergodic theorem,

    (1/n) ∑_{m=0}^{n−1} Y ∘ ϕ^m → E(Y|I)

almost surely and in L1 as n → ∞, where I is the invariant σ-algebra of ϕ. Let A be a measurable subset of R^N of µ-measure 1 on which the above convergence takes place. Then 1 = µ(A) = P(X ∈ A), which implies that with P-probability 1,

    (1/n) ∑_{m=0}^{n−1} Y ∘ ϕ^m ∘ X → Z ∘ X,
where Z := E(Y|I). Now note that Y ∘ ϕ^m ∘ X = X_m. Thus, to prove the 'almost sure' part of the theorem, it suffices to show that Z ∘ X = E(X0|J). Take any E ∈ J. Then E = X⁻¹(F) for some F ∈ I. Thus,

    E(Z ∘ X; E) = ∫_E Z ∘ X dP.

By the general change of variable formula for Lebesgue integrals (Exercise 2.3.7),

    ∫_E Z ∘ X dP = ∫_F Z dµ = E(Z; F) = E(Y; F),

where the last identity holds because Z = E(Y|I) and F ∈ I. But again, by change of variable,

    E(Y; F) = ∫_F Y dµ = ∫_E Y ∘ X dP = E(X0; E),

where the last identity holds because Y ∘ X = X0. This shows that Z ∘ X = E(X0|J), and completes the proof of the almost sure part of the theorem. For the L1 part, simply note that

    E|(1/n) ∑_{m=0}^{n−1} X_m − E(X0|J)| = ∫ |(1/n) ∑_{m=0}^{n−1} Y ∘ ϕ^m ∘ X − Z ∘ X| dP
                                         = ∫ |(1/n) ∑_{m=0}^{n−1} Y ∘ ϕ^m − Z| dµ,

which tends to zero as n → ∞. This completes the proof of the theorem. □

EXERCISE 10.4.3. Use Theorem 10.4.2 to give an alternative proof of the SLLN for m-dependent stationary sequences (Theorem 8.5.6).
EXERCISE 10.4.4. A number x ∈ [0, 1) is said to be normal in base b if the digits in its base b expansion have the property that for any finite sequence d1, . . . , dk ∈ {0, 1, . . . , b − 1}, the fraction of times the pattern d1d2⋯dk occurs in the first n digits of x tends to b^{−k} as n → ∞. Show that if x is chosen uniformly at random from [0, 1), then its base b digits are i.i.d. random variables uniformly distributed in {0, 1, . . . , b − 1}. Using this, and the previous exercise, prove that almost every number x ∈ [0, 1) is normal in base b. Then use this to show that almost every x ∈ [0, 1) is a normal number, meaning that it is normal in every base. (It is a famous open problem to show that some "naturally occurring" element of [0, 1), such as 1/√2 or 1/e, is normal. As of now, we can construct only very artificial examples of normal numbers, even though almost all numbers are normal.)
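The mechanism behind this exercise is easy to simulate: the base-b digits of a uniform random x are i.i.d. uniform, so any fixed pattern of length k appears with limiting frequency b^{−k}. The following sketch (our own illustration with a hypothetical helper name, not code from the notes) checks this for the pattern "31" in base 10:

```python
import random

def pattern_frequency(b, pattern, n, seed=0):
    """Frequency of `pattern` among the first n base-b digits of a uniformly
    random x in [0, 1); those digits are i.i.d. uniform on {0, ..., b-1}."""
    rng = random.Random(seed)
    digits = [rng.randrange(b) for _ in range(n)]
    k = len(pattern)
    hits = sum(1 for i in range(n - k + 1) if digits[i:i + k] == pattern)
    return hits / (n - k + 1)

# The pattern "31" in base 10 should occur with limiting frequency 10^(-2) = 0.01.
freq = pattern_frequency(10, [3, 1], 500_000)
```

With half a million digits the empirical frequency sits very close to 0.01, in line with the ergodic-theorem argument.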
EXERCISE 10.4.5. Let X0, X1, . . . be a sequence of jointly normal random variables (meaning that any finitely many of them have a joint normal distribution), such that for each i, E(Xi) = 0 and Var(Xi) = 1, and for each i ≠ j, Cov(Xi, Xj) = ρ, where ρ ∈ [−1, 1] is some fixed constant. Show that this sequence is stationary, and that ρ ≥ 0. Then, let J denote the invariant σ-algebra of this sequence, and evaluate the distribution of E(X0|J). (This gives an example of a stationary sequence where the limit in Theorem 10.4.2 is not a constant.)
CHAPTER 11

Markov chains

In this chapter, we define and prove the existence of Markov chains on Eu-
clidean spaces and discuss some of their basic properties.

11.1. The Ionescu-Tulcea existence theorem


A Markov chain is defined by its transition kernel. A transition kernel, or
transition probability, is defined as follows.
DEFINITION 11.1.1. Let (S, S) be a measurable space. A function K : S × S → R is called a 'Markov transition probability' or 'Markov transition kernel' if
(1) for each x ∈ S, A ↦ K(x, A) is a probability measure on (S, S), and
(2) for each A ∈ S, x ↦ K(x, A) is a measurable function.

DEFINITION 11.1.2. Let (S, S) be a measurable space. Let {Xn}n≥0 be a sequence of S-valued random variables adapted to a filtration of σ-algebras {Fn}n≥0. Let K be a Markov transition kernel on S and σ be a probability measure on (S, S). We say that {Xn}n≥0 is a time-homogeneous Markov chain with kernel K and initial distribution σ if X0 ∼ σ, and for all n ≥ 0 and B ∈ S,

    P(Xn+1 ∈ B|Fn) = K(Xn, B) a.s.
Often, the initial distribution σ is taken to be the point mass at some point
x0 ∈ S. In that case, we say that x0 is the initial state of the chain. The follow-
ing theorem shows that given any kernel and any initial distribution on any state
space, it is possible to construct a Markov chain with the given kernel and initial
distribution.
THEOREM 11.1.3 (Ionescu-Tulcea theorem). Given any Markov transition kernel K on any state space (S, S) as above, and any probability measure σ on (S, S), it is possible to construct a Markov chain on S with kernel K and initial distribution σ.
We need the following lemma.
LEMMA 11.1.4. Let (S, S) be a measurable space and K be a Markov transition kernel on this space. For any n ≥ 2 and any bounded measurable f : S^n → R, the map

    g(x1, . . . , xn−1) := ∫_S f(x1, . . . , xn) K(xn−1, dxn)

is well-defined, bounded and measurable.


PROOF. First of all note that by Lemma 3.2.1, for any given x1, . . . , xn−1, the map xn ↦ f(x1, . . . , xn) is bounded and measurable. This proves that g is
well-defined. Next, note that if we can prove the claim when f = 1B for any mea-
surable set B ⊆ S n , the general claim will follow by the usual route of going from
indicator functions to simple functions and then to general measurable functions.
Thus, let us prove it when f = 1B . Suppose first that B = B1 × · · · × Bn for some
B1 , . . . , Bn ∈ S. Then

g(x1 , . . . , xn−1 ) = 1B1 (x1 ) · · · 1Bn−1 (xn−1 )K(xn−1 , Bn ),

which is measurable by the properties of K. Now consider the set of all measurable
B for which the claim is true. Using the properties of K, it is not hard to see that
this is a λ-system, and by the above deduction, it contains the π-system of all
rectangular sets. The proof is now completed by invoking the π-λ theorem. 

PROOF OF THEOREM 11.1.3. By Lemma 11.1.4, the following iterated integral is well-defined for every measurable B ⊆ S^n:

    µn(B) := ∫∫⋯∫ 1_B(x1, . . . , xn) K(xn−1, dxn) K(xn−2, dxn−1) ⋯ K(x1, dx2) dσ(x1).

Using the properties of K and the monotone convergence theorem, it is not hard to
show that µn is a probability measure on S n and moreover, that the family {µn }n≥1
is consistent, in the sense that for any n and any A ∈ S n , µn+1 (A × S) = µn (A).
We will now show that there is a probability measure µ on the infinite product
space such that µn is the marginal of µ on S n for each n.
Note that we cannot apply Kolmogorov's extension theorem, since S is not assumed to be a Euclidean space or a Polish space. Instead, consider the algebra A of all cylinder sets, that is, sets of the form B = A × S × S × ⋯ for some A ∈ S^n for some n. Define µ(B) = µn(A), which is well-defined due to the consistency of the family {µn}n≥1. Then µ is finitely additive on A. By Carathéodory's extension theorem, we only need to show that µ is countably additive on A.
As in the proof of Theorem 3.3.1, this can be reduced to showing that for any sequence of sets {An}n≥1 in A decreasing to ∅, µ(An) → 0. So, let us take any such sequence, and suppose that µ(An) ≥ ε > 0 for all n.
Take any B ∈ A. Take any n such that B = A×S ×S ×· · · for some A ∈ S n .
For 1 ≤ k ≤ n − 1, define the “conditional probability of the chain being in B
given that the first k coordinates are x1 , . . . , xk ” to be
    µ(B|x1, . . . , xk) := ∫⋯∫ 1_A(x1, . . . , xn) K(xn−1, dxn) ⋯ K(xk, dxk+1).

Additionally, define µ(B|x1 , . . . , xk ) = 1A (x1 , . . . , xn ) for k ≥ n. It is easy to


see that this definition does not depend on our choice of n, is a measurable function
on S^k (by Lemma 11.1.4), and that for all 1 ≤ j < k < ∞ and x1, . . . , xk ∈ S,

    µ(B|x1, . . . , xj) = ∫⋯∫ µ(B|x1, . . . , xk) K(xk−1, dxk) ⋯ K(xj, dxj+1),    (11.1.1)

and moreover,

    µ(B) = ∫ µ(B|x) dσ(x).    (11.1.2)

Now, given our sequence {An}n≥1, define

    mk(x1, . . . , xk) := lim_{n→∞} µ(An|x1, . . . , xk).

The limit exists because {An}n≥1 is a decreasing sequence, and is measurable because it is the limit of a sequence of measurable functions. Analogously, define m0 := lim_{n→∞} µ(An). By (11.1.1), (11.1.2), and the dominated convergence theorem, we get that for any 1 ≤ j < k,

    mj(x1, . . . , xj) = ∫⋯∫ mk(x1, . . . , xk) K(xk−1, dxk) ⋯ K(xj, dxj+1),

and

    m0 = ∫ m1(x) dσ(x).
By assumption, m0 > 0. Thus, there exists x1 such that m1(x1) > 0. Therefore, there exists x2 such that m2(x1, x2) > 0, and so on. Let x := (x1, x2, . . .). We claim that x ∈ ∩_{n≥1} An. To see this, take any n. Then An = B × S × S × ⋯ for some B ∈ S^N, for some N (depending on An). Since mN(x1, . . . , xN) > 0, we have µ(Ak|x1, . . . , xN) > 0 for all k. In particular, µ(An|x1, . . . , xN) > 0. But

    µ(An|x1, . . . , xN) = 1_B(x1, . . . , xN).

Thus, x ∈ B × S × S × ⋯ = An. This proves that x ∈ ∩_{n≥1} An, giving us the desired contradiction. Thus, µ extends to a probability measure on the infinite product σ-algebra.
Finally, to produce an infinite Markov chain with transition kernel K and ini-
tial distribution σ, take the space S N with its product σ-algebra, and let X1 , X2 , . . .
be the coordinate maps. Let Fn := σ(X1, . . . , Xn). Then note that for any measurable D ⊆ S^n and B ⊆ S,

    ∫_{S^{n+1}} 1_D(x1, . . . , xn) K(xn, B) dµ_{n+1}(x1, . . . , xn+1)
        = ∫_{S^n} 1_D(x1, . . . , xn) K(xn, B) dµn(x1, . . . , xn)
        = ∫_{S^{n+1}} 1_{D×B}(x1, . . . , xn, xn+1) K(xn, dxn+1) dµn(x1, . . . , xn)
        = µ_{n+1}(D × B) = ∫_{S^{n+1}} 1_D(x1, . . . , xn) 1_B(xn+1) dµ_{n+1}(x1, . . . , xn+1).
This shows that P(Xn+1 ∈ B|Fn ) = K(Xn , B). It is clear that X1 ∼ σ. This
completes the proof. 

11.2. Markov chains on countable state spaces


In this section, we will study some basic properties of Markov chains on count-
able state spaces. Let S be a finite or countable set and let {Xn }n≥0 be a Markov
chain on S. The matrix P = (pij)i,j∈S defined as

    pij = P(X1 = j|X0 = i)

is called the 'transition matrix' of the chain. It is easy to see that the entries of P are nonnegative and each row sums to 1. Let P^(n) be the transition matrix from time 0 to time n. That is, the (i, j)th entry of P^(n) is

    p_ij^(n) = P(Xn = j|X0 = i).

This is sometimes called the 'n-step transition matrix'.
PROPOSITION 11.2.1 (Chapman–Kolmogorov formula). For any n,

    P^(n) = P^n,

where the right side denotes the nth power of P under matrix multiplication.

PROOF. First note that by applying the Markov property in the second equality below,

    P(X0 = i0, . . . , Xn = in)
        = P(X0 = i0) ∏_{m=1}^{n} P(Xm = im | X0 = i0, . . . , Xm−1 = im−1)
        = P(X0 = i0) ∏_{m=1}^{n} P(Xm = im | Xm−1 = im−1)
        = P(X0 = i0) p_{i0 i1} p_{i1 i2} ⋯ p_{i_{n−1} i_n}.

This shows that

    P(Xn = j|X0 = i) = ∑_{i1,...,in−1 ∈ S} P(X1 = i1, . . . , Xn−1 = in−1, Xn = j|X0 = i)
                     = ∑_{i1,...,in−1 ∈ S} p_{i i1} p_{i1 i2} ⋯ p_{i_{n−1} j}.

But the last expression is just the (i, j)th element of P^n. Thus, P^(n) = P^n, as we wanted to show. □
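The Chapman–Kolmogorov formula is easy to check numerically on a small chain. The sketch below (our own illustration with hypothetical helper names) compares the matrix power P³ with the explicit sum over intermediate states from the proof:

```python
def mat_mult(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    """n-th power of P under matrix multiplication."""
    size = len(P)
    R = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        R = mat_mult(R, P)
    return R

# A two-state chain (each row of P sums to 1).
P = [[0.9, 0.1],
     [0.5, 0.5]]

P3 = mat_pow(P, 3)
# Direct sum over intermediate states i1, i2, as in the proof:
p3_00 = sum(P[0][i1] * P[i1][i2] * P[i2][0]
            for i1 in range(2) for i2 in range(2))
```

The path sum agrees with the (0, 0) entry of P³ up to floating-point rounding, and the rows of P³ again sum to 1, as they must for a transition matrix.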
A state i ∈ S is called 'recurrent' if

    P(Xn = i for some n ≥ 1 | X0 = i) = 1,

and 'transient' if the above probability is strictly less than 1. The following theorem gives an easily checkable condition for recurrence and transience.
THEOREM 11.2.2. Let {Xn}n≥0 be a time-homogeneous Markov chain on a finite or countably infinite state space S. Then a state i ∈ S is recurrent if

    ∑_{n=1}^{∞} p_ii^(n) = ∞,

and transient if this sum is finite. Moreover, if i is recurrent, the chain started at i returns to i infinitely many times with probability 1, and if i is transient, then the chain started at i returns to i only finitely many times with probability 1.

PROOF. Fix some state i. For proving the above theorem, it is convenient to define a random variable N which is the number of n ≥ 1 such that Xn = i. (Note that we are ignoring n = 0. So if the chain starts from i but never comes back to i, then N = 0. Also note that N is allowed to be ∞, in case the Markov chain visits i infinitely many times.) The monotone convergence theorem for conditional expectation implies that

    E(N|X0 = i) = E(∑_{n=1}^{∞} 1_{Xn=i} | X0 = i)
                = ∑_{n=1}^{∞} P(Xn = i|X0 = i) = ∑_{n=1}^{∞} p_ii^(n).    (11.2.1)

Define

    p = P(Xn = i for some n ≥ 1 | X0 = i).

The identity (11.2.1) implies that to prove the theorem, we have to show that p < 1 if E(N|X0 = i) < ∞ and p = 1 if E(N|X0 = i) = ∞. To prove this, it suffices to show that

    E(N|X0 = i) = ∑_{k=1}^{∞} p^k.    (11.2.2)

Note that by the monotone convergence theorem for conditional expectation,

    E(N|X0 = i) = E(∑_{k=1}^{∞} 1_{N≥k} | X0 = i) = ∑_{k=1}^{∞} P(N ≥ k|X0 = i).

Thus, to prove (11.2.2), it suffices to show that

    P(N ≥ k|X0 = i) = p^k.    (11.2.3)
By the definition of p, this holds for k = 1. Let us assume that this holds for k − 1.
For any n ≥ 1 and any j1 , . . . , jn ∈ S such that jn = i and exactly k − 2 of the
other jm ’s are equal to i, let Aj1 ,...,jn be the event {X1 = j1 , . . . , Xn = jn }. It is
not hard to see that the event {N ≥ k − 1} is the union of these events. Moreover,
these events are disjoint. Since {N ≥ k} ⊆ {N ≥ k − 1}, this gives
P(N ≥ k|X0 = i) = P({N ≥ k} ∩ {N ≥ k − 1}|X0 = i)
X
= P({N ≥ k} ∩ Aj1 ,...,jn |X0 = i)
X
= P(N ≥ k|Aj1 ,...,jn ∩ {X0 = i})P(Aj1 ,...,jn |X0 = i)

where the sum is over all n and j1 , . . . , jn as above. Now, for any m and i1 , . . . , im ,
let Bi1 ,...,im be the event {Xn+1 = i1 , . . . , Xn+m = im }. Using the definition of
Markov chain and the fact that jn = i, it is easy to see that
P(Bi1 ,...,im |Aj1 ,...,jn ∩ {X0 = i})
= P(Bi1 ,...,im |Xn = i) = pii1 pi1 i2 · · · pim−1 im .
Now, P(N ≥ k|Aj1 ,...,jn ∩ {X0 = i}) is just the sum of the above quantity over
all choices of m and i1, . . . , im such that im = i and il ≠ i for l < m. But on the
other hand, the sum of pii1 pi1 i2 · · · pim−1 im over the same set of m and i1 , . . . , im
equals p. Thus, for any such j1 , . . . , jn ,
P(N ≥ k|Aj1 ,...,jn ∩ {X0 = i}) = p.
Therefore,
X
P(N ≥ k|X0 = i) = P(N ≥ k|Aj1 ,...,jn ∩ {X0 = i})P(Aj1 ,...,jn |X0 = i)
X
=p P(Aj1 ,...,jn |X0 = i)
= pP(N ≥ k − 1|X0 = i).

Since P(N ≥ k − 1|X0 = i) = p^{k−1}, this completes the induction step.


To complete the proof of the final claim, simply observe that by (11.2.2), it
follows that if i is recurrent, then P(N ≥ k|X0 = i) = 1 for each k, which implies
that P(N = ∞|X0 = i) = 1, and conversely, if i is transient, then E(N |X0 =
i) < ∞, which implies that P(N < ∞|X0 = i) = 1. 
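The identity (11.2.2) can be checked by simulation on a toy chain where p is known. In the sketch below (our own example, not from the notes), state 0 jumps to an absorbing state with probability q and otherwise stays put, so p = 1 − q and E(N|X0 = 0) = ∑_k p^k = p/(1 − p):

```python
import random

def simulate_returns(q, trials, seed=1):
    """Chain on {0, 1}: from state 0, jump to the absorbing state 1 with
    probability q, else stay at 0 (each stay counts as a return to 0).
    Returns (average number of returns, fraction that return at least once)."""
    rng = random.Random(seed)
    total, returned = 0, 0
    for _ in range(trials):
        n = 0
        while rng.random() >= q:  # stay at 0: one more return
            n += 1
        total += n
        returned += (n >= 1)
    return total / trials, returned / trials

# With q = 0.5: p = 1 - q = 0.5, so E(N | X0 = 0) = p / (1 - p) = 1.
mean_returns, p_hat = simulate_returns(0.5, 200_000)
```

The simulated return probability and the simulated mean number of returns match p and p/(1 − p) to within Monte Carlo error, as the geometric structure in the proof predicts.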

Theorem 11.2.2 gives necessary and sufficient conditions for recurrence and
transience for a given state. What if we want such information for all states? For-
tunately, we do not usually have to check separately for each state, as we will see
below.
A state j of a Markov chain is said to be accessible from a state i if p_ij^(n) > 0 for some n ≥ 1. (Note that we are leaving out n = 0.) A Markov chain is called irreducible if every state j is accessible from every state i.
Let us say that a state j is accessible from a state i in n steps if p_ij^(n) > 0. The following lemma is often useful for checking irreducibility.
LEMMA 11.2.3. Take any k ≥ 1, i0, i1, . . . , ik ∈ S and n1, . . . , nk ≥ 1. Suppose that i_t is accessible from i_{t−1} in n_t steps, for t = 1, . . . , k. Let n = n1 + ⋯ + nk. Then ik is accessible from i0 in n steps.

PROOF. By the Chapman–Kolmogorov formula,

    P^(n) = P^n = P^{n1} P^{n2} ⋯ P^{nk} = P^{(n1)} P^{(n2)} ⋯ P^{(nk)}.

Therefore

    p_{i0 ik}^(n) = ∑_{j1,...,jk−1} p_{i0 j1}^{(n1)} p_{j1 j2}^{(n2)} ⋯ p_{j_{k−1} ik}^{(nk)}
                  ≥ p_{i0 i1}^{(n1)} p_{i1 i2}^{(n2)} ⋯ p_{i_{k−1} ik}^{(nk)},

which proves the claim. □


The following proposition shows that any two states that are accessible from
each other have to be either both recurrent or both transient.
PROPOSITION 11.2.4. If states i and j are accessible from each other, then
either both i and j are recurrent, or both i and j are transient. Consequently, if a
time-homogeneous irreducible Markov chain on a finite or countably infinite state
space has one recurrent state, then all states are recurrent. Similarly, if one state
is transient, then all states are transient.

PROOF. Take any two states i and j such that j is accessible from i and i is accessible from j. Thus, p_ij^(l) > 0 and p_ji^(m) > 0 for some l and m. By the Chapman–Kolmogorov formula (which continues to hold on countable state spaces),

    p_jj^(n) ≥ p_ji^(m) p_ii^(n−m−l) p_ij^(l)
for any n > m + l. Summing both sides over n, we see that the recurrence of
i implies the recurrence of j (by Theorem 11.2.2). Similarly, the recurrence of j
implies the recurrence of i. For an irreducible chain, every state is accessible from
every other state. Thus, either every state is recurrent or every state is transient. 
Because of Proposition 11.2.4, an irreducible chain as a whole may be called
recurrent or transient. In the next section, we will see important examples of recur-
rent and transient chains.

11.3. Pólya’s recurrence theorem


Simple symmetric random walk on Zd is a Markov chain with state space Zd ,
and kernel K(x, ·) = uniform distribution on the 2d nearest neighbors of x. That
is, if the walk is at x ∈ Zd at time n, it jumps to a uniformly chosen neighbor at
time n + 1. It is easy to see that this chain is irreducible, and so we can ask whether
the walk is recurrent or transient without mentioning the starting state. We will
now prove the following famous result.
THEOREM 11.3.1 (Pólya's recurrence theorem). Simple symmetric random walk on Z^d is recurrent if d = 1 or d = 2, and transient if d ≥ 3.

PROOF. Let rn denote the probability that the chain returns to the origin at time n, given that it started from the origin at time 0. The number of ways this can happen is counted as follows. First, suppose that the walk takes ni steps in the ith coordinate direction (either forward or backward). Here n1 + ⋯ + nd = n. The number of ways of allocating these steps is

    n!/(n1! n2! ⋯ nd!).

Having associated each step of the walk to a coordinate direction, we now have to decide whether each move is forward or backward. If the walk has to return to the origin at time n, the number of forward steps in any coordinate direction must be equal to the number of backward steps. For direction i, this allocation can be done in (ni choose ni/2) ways, provided that ni is even. Otherwise, this is zero. Combining, we see that the number of paths that return to the origin at time n is

    ∑_{n1,...,nd even, n1+⋯+nd=n} [n!/(n1! n2! ⋯ nd!)] ∏_{i=1}^{d} (ni choose ni/2).    (11.3.1)

Therefore, rn equals the above quantity times (2d)^{−n}. When d = 1, this gives

    r_{2n} = (2n choose n) 2^{−2n}

for any positive integer n. By Stirling's formula, this is asymptotically

    √(2π) (2n)^{2n+1/2} e^{−2n} 2^{−2n} / (√(2π) n^{n+1/2} e^{−n})^2 = 1/√(πn).

In other words,

    lim_{n→∞} r_{2n}/(πn)^{−1/2} = 1.

Since ∑ (πn)^{−1/2} diverges, the limit comparison test for series proves that

    ∑_{n=1}^{∞} r_{2n} = ∞.

Therefore by Theorem 11.2.2, simple symmetric random walk on Z is recurrent.
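The asymptotics above can be verified exactly, since r_{2n} = (2n choose n)2^{−2n} is computable in integer arithmetic. A quick check (our own sketch, with a hypothetical helper name):

```python
import math

def r_one_dim(n):
    """Exact return probability r_{2n} = C(2n, n) * 2^{-2n} for simple
    symmetric random walk on Z."""
    return math.comb(2 * n, n) / 4 ** n

# The ratio r_{2n} / (pi * n)^{-1/2} should approach 1.
ratio = r_one_dim(5000) / (math.pi * 5000) ** -0.5

# The partial sums of r_{2n} grow roughly like 2 * sqrt(N / pi), hence diverge.
partial_sum = sum(r_one_dim(n) for n in range(1, 1001))
```

At n = 5000 the ratio is already within about 1/(8n) of 1, and the partial sums through n = 1000 are near 35, consistent with divergence of the series and hence recurrence.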
Next, let us consider d = 2. By the formula (11.3.1), we get that if n is even, then

    rn = ∑_{k,l even, k+l=n} n! 4^{−n} / ((k/2)! (l/2)!)^2.

This can be rewritten as

    r_{2n} = ∑_{k=0}^{n} (2n)! 4^{−2n} / (k!(n − k)!)^2 = (2n choose n) 4^{−2n} ∑_{k=0}^{n} (n choose k)^2.
Now, since

    (n choose k) = (n choose n − k),

it follows that

    ∑_{k=0}^{n} (n choose k)^2

is the coefficient of x^n in the expansion of (1 + x)^n (1 + x)^n. But the coefficient of x^n in (1 + x)^{2n} is (2n choose n). Thus,

    ∑_{k=0}^{n} (n choose k)^2 = (2n choose n).

This shows that

    r_{2n} = (2n choose n)^2 4^{−2n}.

But this is just the square of r_{2n} in the one-dimensional case. Thus, it is asymptotic to (πn)^{−1}. Again by the limit comparison test, this proves the divergence of ∑_{n=1}^{∞} r_{2n} and hence the recurrence of the simple symmetric random walk on Z².
Finally, let us turn our attention to the case d ≥ 3. First, note that any path of length 2n that starts and ends at the origin can be extended to a path of length 2n + 2 by adding two steps at the end, the first one moving one step away from the origin in a fixed direction and the next one moving back to the origin. Since this gives a one-to-one map from the set of paths of length 2n that start and end at the origin, into the set of paths of length 2n + 2 that start and end at the origin, this shows that if N_{2n} is the number of paths of length 2n that start and end at the origin, then N_{2n} ≤ N_{2n+2}. Consequently, N_{2n} is a non-decreasing function of n. But r_{2n} = N_{2n}(2d)^{−2n}. Thus, for any k ≥ 1,

    r_{2n} ≤ N_{2n+2k}(2d)^{−2n} = r_{2n+2k}(2d)^{2k}.

Therefore,

    ∑_{n=1}^{∞} r_{2n} = ∑_{m=1}^{∞} (r_{2dm−2d+2} + r_{2dm−2d+4} + ⋯ + r_{2dm})
                       ≤ ∑_{m=1}^{∞} r_{2dm} ((2d)^{2d−2} + (2d)^{2d−4} + ⋯ + 1).

Thus, to prove ∑_{n=1}^{∞} r_{2n} < ∞, it suffices to show that

    ∑_{n=1}^{∞} r_{2dn} < ∞.
Now from (11.3.1), we get

    r_{2dn} = ∑_{k1,...,kd, k1+⋯+kd=dn} (2dn)! (2d)^{−2dn} / (k1! k2! ⋯ kd!)^2
            = (2dn choose dn) 2^{−2dn} ∑_{k1,...,kd, k1+⋯+kd=dn} ((dn)! d^{−dn} / (k1! k2! ⋯ kd!))^2.    (11.3.2)

Take any nonnegative integers k1, . . . , kd that sum to dn. Writing ki! as the product of 1, 2, . . . , ki, we can write k1! k2! ⋯ kd! as a product of integers between 1 and dn, possibly with many repeats. Using the condition k1 + ⋯ + kd = dn, a little bit of thought shows that in this product we can replace the integers bigger than n by integers less than or equal to n in such a way that the end result is exactly equal to (n!)^d. (If some ki is less than n, it 'falls short' of contributing n − ki terms to n!. On the other hand, if some ki is bigger than n, it contributes ki − n extra terms that are all bigger than n. Since k1 + ⋯ + kd = dn, the number of missing contributions exactly equals the number of excess terms.) This procedure shows that

    k1! k2! ⋯ kd! ≥ (n!)^d.

Thus,
    ∑_{k1+⋯+kd=dn} ((dn)! d^{−dn} / (k1! k2! ⋯ kd!))^2 ≤ ((dn)! d^{−dn} / (n!)^d) ∑_{k1+⋯+kd=dn} (dn)! d^{−dn} / (k1! k2! ⋯ kd!)
                                                       = (dn)! d^{−dn} / (n!)^d,

where the last identity was obtained by observing that the sum in the previous line is the sum of a multinomial probability mass function. By Stirling's formula, the above quantity is asymptotic to

    √(2π) (dn)^{dn+1/2} e^{−dn} d^{−dn} / (√(2π) n^{n+1/2} e^{−n})^d = √d / ((2π)^{(d−1)/2} n^{(d−1)/2}).
On the other hand, by our calculations in the one-dimensional case, the term

    (2dn choose dn) 2^{−2dn}

in (11.3.2) is asymptotic to (πdn)^{−1/2}. Thus, r_{2dn} is bounded above by a quantity that is asymptotic to C(d) n^{−d/2}, where C(d) is a constant that depends only on d. Since

    ∑_{n=1}^{∞} n^{−d/2} < ∞

when d ≥ 3, this proves the summability of r_{2dn}, and so by our previous discussion, the summability of r_{2n}. Since rn = 0 when n is odd, this completes the proof of transience of simple symmetric random walk in d ≥ 3. □
11.4. Markov chains on finite state spaces


Let {Xn}n≥0 be a time-homogeneous Markov chain on a finite state space S with transition matrix P. A probability measure on S is simply a set of values µ = (µi)i∈S such that µi ≥ 0 for each i, and ∑_{i∈S} µi = 1. We will think of µ as a row vector in R^N, where N is the size of S. This deviates from the usual convention of thinking of any vector as a column vector, but it is the usual practice in the Markov chain world, and is convenient for various purposes.
A probability distribution µ on S is called an invariant distribution (also called a stationary distribution or an equilibrium distribution) for the above Markov chain if µP = µ. In other words, for each j ∈ S,

    ∑_{i∈S} µi pij = µj.

The significance of an invariant distribution is that if the distribution of X0 is an invariant distribution, then the distribution of each Xn is the same as that of X0. To see this, suppose that X0 ∼ µ, where µ is an invariant distribution. Let µ^(n) be the distribution of Xn. Then note that

    µj^(n) = P(Xn = j) = ∑_{i∈S} P(Xn = j|X0 = i) P(X0 = i) = ∑_{i∈S} P(Xn = j|X0 = i) µi.

By the Chapman–Kolmogorov formula, this gives

    µ^(n) = µP^n.

But µP = µ. Thus, µ^(n) = µP^n = µ for any n. The following result shows that a time-homogeneous Markov chain on a finite state space always has at least one invariant distribution.

THEOREM 11.4.1. Let {Xn} be a time-homogeneous Markov chain on a finite state space S, with transition matrix P. Then there is at least one invariant distribution µ for this chain.

PROOF. Take any probability distribution µ0 on S. For each n, define

    µn = (µ0 + µ0P + µ0P^2 + ⋯ + µ0P^n)/(n + 1).

It is an easy consequence of the Chapman–Kolmogorov formula (as observed above) that µ0P^k is the distribution of Xk if X0 ∼ µ0. Since the average of a set of probability distributions is again a probability distribution, µn must be a probability measure on S. Now note that

    µn − µnP = (µ0 + µ0P + ⋯ + µ0P^n)/(n + 1) − (µ0P + µ0P^2 + ⋯ + µ0P^{n+1})/(n + 1)
             = (µ0 − µ0P^{n+1})/(n + 1).

Since µ0 and µ0P^{n+1} are both probability measures, the last expression tends to the zero vector as n → ∞. Thus,

    lim_{n→∞} (µn − µnP) = 0,

where the 0 on the right is the zero vector. Finally, note that since {µn}n≥0 is a bounded sequence (because each vector is a probability measure), it has a convergent subsequence. Let µ be a limit of such a subsequence. The above equation proves that µP = µ. The set of probability measures on S is a closed set, which implies that µ must be a probability measure. Thus, µ is an invariant distribution. □
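The proof is constructive enough to run: iterate µ0P^k and keep a Cesàro average. The sketch below (our own helper names; for this particular P the invariant distribution solves µP = µ and equals (5/6, 1/6)) follows the averaging scheme of the proof:

```python
def invariant_distribution(P, n_iter=2000):
    """Approximate an invariant distribution via the Cesaro averages
    mu_n = (mu0 + mu0 P + ... + mu0 P^n) / (n + 1), as in the proof above."""
    size = len(P)
    mu = [1.0 / size] * size          # arbitrary starting distribution mu0
    avg = mu[:]                       # running average, currently one term
    for n in range(1, n_iter + 1):
        mu = [sum(mu[i] * P[i][j] for i in range(size)) for j in range(size)]
        avg = [(n * a + m) / (n + 1) for a, m in zip(avg, mu)]
    return avg

P = [[0.9, 0.1],
     [0.5, 0.5]]
mu = invariant_distribution(P)
# For this chain, solving mu P = mu by hand gives mu = (5/6, 1/6).
```

The averages converge at rate O(1/n), which is slower than iterating P^n directly but matches the argument in the proof, where only a convergent subsequence is needed.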
Let {Xn}n≥0 be a time-homogeneous Markov chain on a finite state space S, with transition matrix P. Let µ be an invariant distribution for P. The Doeblin condition is a condition under which we can identify the limit of P^n as n → ∞, together with a rate of convergence.
Let M be the square matrix whose rows are all equal to µ. We say that P satisfies the Doeblin condition if for some k ≥ 1 and some ε ∈ (0, 1),

    P^k ≥ εM,    (11.4.1)

where the inequality means that each entry of the matrix on the left is greater than or equal to the corresponding entry of the matrix on the right. Under this condition, we can prove that P^n converges to M as n → ∞. In other words, P(Xn = j|X0 = i) converges to µj as n → ∞, irrespective of the value of i.
To state Doeblin's theorem with a rate of convergence, we need to define a matrix norm. There are many matrix norms. The one that will be useful for us is the following. For a matrix A = (aij)i,j∈S, define

    ‖A‖ = max_{i∈S} ∑_{j∈S} |aij|.

It is not hard to check that for any two matrices A and B,

    ‖A + B‖ ≤ ‖A‖ + ‖B‖,  ‖AB‖ ≤ ‖A‖ ‖B‖.
We are now ready to state Doeblin’s theorem.
THEOREM 11.4.2 (Doeblin's theorem). Suppose that P satisfies the Doeblin condition (11.4.1) for some k ≥ 1 and some ε ∈ (0, 1). Then for any n,

    ‖P^n − M‖ ≤ 2(1 − ε)^{[n/k]},

where [n/k] denotes the integer part of n/k.

PROOF. Define

    Q = (P^k − εM)/(1 − ε).

Since P^k and M are both stochastic matrices, it follows that each row of Q sums to 1. Moreover, by the Doeblin condition, the entries of Q are nonnegative. Thus, Q is a stochastic matrix. Now, note that

    P^k = (1 − ε)Q + εM.

Also note that since µ is a probability distribution, µP = µ, and P is a stochastic matrix, we have the identities

    M^2 = M,  MP = M,  PM = M.

This implies that QM = MQ = M. Consequently, any product of a sequence of Q's and M's that contains at least one M must be equal to M. Thus, for any m ≥ 1, the distributive law for matrix products gives

    P^{km} = ((1 − ε)Q + εM)^m
           = (1 − ε)^m Q^m + ∑_{j=1}^{m} (m choose j)(1 − ε)^{m−j} ε^j M
           = (1 − ε)^m Q^m + (1 − (1 − ε)^m) M.

Now take any n ≥ 1. Let m = [n/k], so that n = km + r, where 0 ≤ r ≤ k − 1. Then

    P^n − M = P^r (P^{km} − M) = (1 − ε)^m P^r (Q^m − M).

Since P^r, Q and M are stochastic matrices, this gives

    ‖P^n − M‖ ≤ (1 − ε)^m ‖P^r‖ (‖Q^m‖ + ‖M‖) ≤ 2(1 − ε)^m,

which completes the proof. □
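Doeblin's bound can be checked numerically. For the two-state chain below (our own example), the Doeblin condition holds with k = 1: taking ε as the smallest entrywise ratio P[i][j]/M[i][j] gives P ≥ εM, and ‖P^n − M‖ ≤ 2(1 − ε)^n should then hold for every n:

```python
def mat_mult(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def row_sum_norm(A):
    """The matrix norm ||A|| = max_i sum_j |a_ij| used in Doeblin's theorem."""
    return max(sum(abs(x) for x in row) for row in A)

P = [[0.9, 0.1],
     [0.5, 0.5]]
mu = [5 / 6, 1 / 6]            # invariant distribution of P
M = [mu[:], mu[:]]             # every row of M equals mu

# Largest eps with P >= eps * M entrywise (Doeblin condition with k = 1).
eps = min(P[i][j] / M[i][j] for i in range(2) for j in range(2))

Pn = [[1.0, 0.0], [0.0, 1.0]]
bound_holds = True
for n in range(1, 25):
    Pn = mat_mult(Pn, P)
    diff = [[Pn[i][j] - M[i][j] for j in range(2)] for i in range(2)]
    if row_sum_norm(diff) > 2 * (1 - eps) ** n + 1e-12:
        bound_holds = False
```

For this chain ε = 0.6, so the theorem predicts geometric convergence at rate at most 0.4 per step; the actual norm decays at exactly that rate, with a smaller constant.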
The period of a state i ∈ S is defined to be the greatest common divisor of the set of all n ≥ 1 such that p_ii^(n) > 0. If there is no such n, the period is ∞. A Markov chain is called aperiodic if all of its states have period 1. The following theorem is the main result of this section.
THEOREM 11.4.3. Let {Xn}n≥0 be an irreducible aperiodic Markov chain on a finite state space S, with transition matrix P. Then P has a unique invariant distribution µ. Moreover, if M is the square matrix whose rows are all equal to µ, then P^n → M as n → ∞.
The key idea is to show that any irreducible aperiodic chain on a finite state
space satisfies the Doeblin condition. The first step is to prove the following
lemma.
LEMMA 11.4.4. For any state i, p_ii^(n) > 0 for all sufficiently large n.
PROOF. By irreducibility, we know that P(X_n = i | X_0 = i) > 0 for some
n ≥ 1. Let A be the set of all such n. Since the g.c.d. of A is 1 (because the chain
is aperiodic), there must exist a_1, ..., a_k ∈ A and u_1, ..., u_k ∈ Z for some k ≥ 1
such that
    u_1 a_1 + · · · + u_k a_k = 1.
Define a positive integer
    m = a_1 + · · · + a_k.
Take any n ≥ 1. Let q and r be the quotient and remainder when n is divided by
m. Then
    n = qm + r = qm + r(u_1 a_1 + · · · + u_k a_k)
      = (q + r u_1) a_1 + · · · + (q + r u_k) a_k.
Since r is always bounded above by m − 1, and q → ∞ as n → ∞, the above
identity shows that any sufficiently large n is expressible as a linear combination
of a_1, ..., a_k with positive integer coefficients. The claim now follows from
Lemma 11.2.3. 
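The number-theoretic step in this proof can be seen concretely. A small dynamic program (an illustration, not part of the text, with hypothetical step sizes 6, 10, 15) checks that when a set of step sizes has g.c.d. 1, every sufficiently large integer is a nonnegative-integer combination of them:

```python
from math import gcd
from functools import reduce

def representable(steps, limit):
    # ok[n] is True when n is a nonnegative-integer combination of the steps.
    ok = [False] * (limit + 1)
    ok[0] = True
    for n in range(1, limit + 1):
        ok[n] = any(n >= a and ok[n - a] for a in steps)
    return ok

A = [6, 10, 15]                  # g.c.d. is 1, as for an aperiodic state
assert reduce(gcd, A) == 1
ok = representable(A, 100)
assert not ok[29]                # 29 is not representable, but...
assert all(ok[n] for n in range(30, 101))   # ...every n >= 30 is
```

In the proof above the same phenomenon appears with the coefficients q + r u_i, which are positive once q is large.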
By irreducibility, the above lemma generalizes to the following.
LEMMA 11.4.5. There is some n_0 such that for any n ≥ n_0, the entries of P^n
are all strictly positive.

PROOF. Take any state i. By Lemma 11.4.4, there is some m such that p^{(n)}_{ii} > 0
for all n ≥ m. By irreducibility and the finiteness of the state space, there is some
k ≥ 1 such that any j is accessible from i in less than k steps. Let m_0 = m + k.
Take any n ≥ m_0 and any state j. Then there is a number r < k such that j
is accessible from i in r steps. Also, since n − r > n − k ≥ m, the state i is
accessible from itself in n − r steps. Therefore by Lemma 11.2.3, j is accessible
from i in n steps. Thus, we have shown that any state is accessible from state i
in n steps whenever n is sufficiently large. From this it follows that when n is
sufficiently large, p^{(n)}_{ij} > 0 for all i and j, which is what we wanted to prove. 
By the finiteness of the state space, Lemma 11.4.5 immediately yields the following
corollary, because if the entries of P^n are all strictly positive, then for the given
matrix M, there is some ε > 0 such that P^n ≥ εM.
COROLLARY 11.4.6. Any irreducible aperiodic Markov chain on a finite state
space satisfies the Doeblin condition.
It is now easy to finish the proof of Theorem 11.4.3. By Theorem 11.4.1, there
is at least one invariant distribution µ. Let M be the matrix whose rows are all
equal to µ. By the above corollary, P n → M as n → ∞. To prove the uniqueness
of the invariant distribution, let ν be another invariant distribution. Then on the one
hand, νP^n = ν for every n. On the other hand,
    lim_{n→∞} νP^n = νM = µ,
since ν is a probability distribution and each column of M is a scalar multiple of
the vector of all 1's. Thus, µ = ν.
EXAMPLE 11.4.7 (Shuffling cards). Consider a deck of n cards, shuffled ac-
cording to the following rule. At each step, the card on the top is moved to a
uniformly chosen location in the deck (which may be the top position too). This is
known as the ‘top-to-random’ shuffle. Here the state space is the space Sn of all
permutations of 1, . . . , n.
First, note that any state can be accessed from itself in 1 step, which implies
that the chain is aperiodic. Next, note that any state can be accessed from any other
state by a sequence of ‘allowed’ moves. Therefore by Lemma 11.2.3, the chain is
irreducible. Thus, by Theorem 11.4.3, this Markov chain has a unique invariant
distribution µ, and irrespective of the starting state, the distribution of Xn tends to
µ as n → ∞.
To complete the discussion, let us identify the invariant distribution µ. We
claim that µ is the uniform distribution on S_n. To see this, note that any state
σ ∈ S_n can be accessed in one step from exactly n states π_1, ..., π_n (one of
which may be σ itself), and each of these n states has probability 1/n! under the
uniform distribution. Moreover, the transition from π_i to σ in one step has
probability 1/n. Thus, if X_0 is uniformly distributed on S_n, then
    P(X_1 = σ) = Σ_{i=1}^{n} P(X_1 = σ | X_0 = π_i) P(X_0 = π_i) = n · (1/n) · (1/n!) = 1/n!.
Since this holds for any σ, X_1 is again uniformly distributed on S_n. Thus, the uni-
form distribution is an invariant distribution for this chain. The claim now follows
by the uniqueness of the invariant distribution.
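The invariance computation can be verified exactly for a 3-card deck. The sketch below (an illustration with a hypothetical state encoding, not from the text) builds the top-to-random transition matrix on the 6 permutations and checks that the uniform distribution satisfies µP = µ:

```python
from fractions import Fraction as F
from itertools import permutations

n = 3
states = list(permutations(range(n)))   # decks, listed top to bottom

def moves(deck):
    # Top-to-random: remove the top card and reinsert it at a uniformly
    # chosen position (probability 1/n each; position 0 leaves the deck fixed).
    top, rest = deck[0], list(deck[1:])
    for i in range(n):
        yield tuple(rest[:i] + [top] + rest[i:])

P = {s: {t: F(0) for t in states} for s in states}
for s in states:
    for t in moves(s):
        P[s][t] += F(1, n)

mu = {s: F(1, len(states)) for s in states}   # uniform on S_3

# Each sigma has exactly n one-step predecessors, so (mu P)(sigma) = 1/6.
for t in states:
    assert sum(mu[s] * P[s][t] for s in states) == F(1, 6)
```

Exact fractions make the check a literal verification of µP = µ rather than a floating-point approximation.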

EXAMPLE 11.4.8 (Random walk on a graph). Let G = (V, E) be a finite,
simple, undirected graph with vertex set V and edge set E. A simple random walk
{X_n}_{n≥0} on G is a Markov chain on V that evolves according to the following
rule. If the chain is at a vertex v, it chooses one of its neighbors uniformly at
random and moves there. In general, simple random walks may not be aperiodic
(for example, if the graph is bipartite). There is an easy fix for this problem. We
just make the walk stay where it is with some fixed probability p > 0, and jump to
a uniformly chosen neighbor with probability 1 − p. This is known as a lazy simple
random walk on G with holding probability p. Lazy walks are always aperiodic.
Recall that a graph is called connected if any vertex can be reached from any
other by moving along a sequence of edges. By Lemma 11.2.3, a simple random
walk (or a lazy simple random walk) on a connected graph is irreducible.
Combining the above observations, we get that a lazy simple random walk on a
finite connected graph is irreducible and aperiodic. Therefore by Theorem 11.4.3,
there is a unique invariant distribution and the distribution of Xn converges to this
invariant distribution from any starting state.
It remains to identify the invariant distribution. Recall that the degree of a
vertex v, denoted by d_v, is the number of neighbors of v. We claim that the
invariant distribution is given by
    µ_v = d_v / Σ_{u∈V} d_u.
Let us verify this by direct calculation. First, note that this is indeed a probability
distribution, since µ_v ≥ 0 for all v and Σ_{v∈V} µ_v = 1. Next, let p_{uv} be the
probability of transition from u to v. That is,
    p_{uv} = p               if u = v,
             (1 − p)/d_u     if v is a neighbor of u,
             0               otherwise.
Then note that for any v,
    Σ_{u∈V} µ_u p_{uv} = p µ_v + (1 − p) Σ_{u∈N_v} µ_u / d_u,
where N_v is the set of neighbors of v. Plugging in the formula for µ_u, we see that
the right side equals
    p µ_v + (1 − p) Σ_{u∈N_v} (1/d_u) · d_u / Σ_{w∈V} d_w
        = p µ_v + (1 − p) Σ_{u∈N_v} 1 / Σ_{w∈V} d_w
        = p µ_v + (1 − p) d_v / Σ_{w∈V} d_w = p µ_v + (1 − p) µ_v = µ_v.
Thus, µ is indeed the unique invariant distribution for this Markov chain.
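The same verification can be run mechanically on a concrete graph. The example below (a hypothetical 4-vertex connected graph, chosen so the degrees are unequal) checks with exact arithmetic that µ_v = d_v / Σ_u d_u is invariant for the lazy walk:

```python
from fractions import Fraction as F

edges = [(0, 1), (1, 2), (2, 3), (1, 3)]   # connected; degrees 1, 3, 2, 2
V = [0, 1, 2, 3]
nbrs = {v: [] for v in V}
for u, w in edges:
    nbrs[u].append(w)
    nbrs[w].append(u)
deg = {v: len(nbrs[v]) for v in V}
p = F(1, 3)                                # holding probability of the lazy walk

def P(u, v):
    # Transition probabilities of the lazy simple random walk.
    if u == v:
        return p
    return (1 - p) / deg[u] if v in nbrs[u] else F(0)

total = sum(deg.values())
mu = {v: F(deg[v], total) for v in V}      # claimed invariant distribution

for v in V:                                # check (mu P)(v) = mu(v) exactly
    assert sum(mu[u] * P(u, v) for u in V) == mu[v]
```

Changing the holding probability p leaves the check valid, which matches the calculation above: the p-dependent terms cancel.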
CHAPTER 12

Weak convergence on Polish spaces

In this chapter we will develop the framework of weak convergence on com-
plete separable metric spaces, also called Polish spaces. The most important exam-
ples are finite dimensional Euclidean spaces and spaces of continuous functions.

12.1. Definition
Let S be a Polish space with metric ρ. The notions of almost sure convergence,
convergence in probability and Lp convergence remain the same, with |Xn − X|
replaced by ρ(Xn , X). Convergence in distribution is a bit more complicated,
since cumulative distribution functions do not make sense in a Polish space. It
turns out that the right way to define convergence in distribution on Polish spaces
is to generalize the equivalent criterion given in Proposition 8.7.1.
DEFINITION 12.1.1. Let (S, ρ) be a Polish space. A sequence X_n of S-valued
random variables is said to converge weakly to an S-valued random variable X if
for any bounded continuous function f : S → R,
    lim_{n→∞} Ef(X_n) = Ef(X).

Alternatively, a sequence of probability measures µ_n on S is said to converge weakly
to a probability measure µ on S if for every bounded continuous function f : S → R,
    lim_{n→∞} ∫_S f dµ_n = ∫_S f dµ.
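As a sanity check on the definition, take S = R, µ_n the point mass at 1/n, and µ the point mass at 0; the integral of f against µ_n is simply f(1/n). The snippet below (an illustration only, with arctan as an arbitrary bounded continuous test function) shows the defining limit holding, and hints at why continuity of f is imposed:

```python
import math

# Integral of a function against the point mass at x is just its value at x.
f = math.atan                      # bounded and continuous on R

# E f(X_n) = f(1/n) converges to E f(X) = f(0) = 0.
vals = [f(1.0 / n) for n in range(1, 10001)]
assert abs(vals[-1] - f(0.0)) < 1e-3

# For the discontinuous indicator of (0, infinity) the limit fails:
g = lambda x: 1.0 if x > 0 else 0.0
assert g(1.0 / 10000) == 1.0       # integral against mu_n is always 1
assert g(0.0) == 0.0               # integral against mu is 0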

Just as on the real line, the law of an S-valued random variable X is the prob-
ability measure µ_X on S defined as
    µ_X(A) := P(X ∈ A).
Of course, here the σ-algebra on S is the Borel σ-algebra generated by its topology.
By the following exercise, it follows that a sequence of random variables on S
converges weakly if and only if their laws converge weakly.
EXERCISE 12.1.2. Prove that the assertion of Exercise 6.1.1 holds on Polish
spaces.
For any n, the Euclidean space R^n with the usual Euclidean metric is an exam-
ple of a Polish space. The following exercises give some other examples of Polish
spaces.
EXERCISE 12.1.3. Let C[0, 1] be the space of all continuous functions from
[0, 1] into R, with the metric
    ρ(f, g) := sup_{x∈[0,1]} |f(x) − g(x)|.
Prove that this is a Polish space. (Often the distance ρ(f, g) is denoted by ‖f − g‖_∞
or ‖f − g‖_{[0,1]} or ‖f − g‖_{sup}, and is called the sup-metric.)
EXERCISE 12.1.4. Let C[0, ∞) be the space of all continuous functions from
[0, ∞) into R, with the metric
    ρ(f, g) := Σ_{j=0}^{∞} 2^{−j} ‖f − g‖_{[j,j+1]} / (1 + ‖f − g‖_{[j,j+1]}),
where
    ‖f − g‖_{[j,j+1]} := sup_{x∈[j,j+1]} |f(x) − g(x)|.
Prove that this is a Polish space. Moreover, show that fn → f on this space if and
only if fn → f uniformly on compact sets.
EXERCISE 12.1.5. Generalize the above exercise to R^n-valued continuous
functions on [0, ∞).
EXERCISE 12.1.6. If X is a C[0, 1]-valued random variable, show that for any
t ∈ [0, 1], X(t) is a real-valued random variable (that is, it’s a measurable map
from the sample space into the real line).
EXERCISE 12.1.7. If X is a C[0, 1]-valued random variable defined on a prob-
ability space (Ω, F, P), show that the map (ω, t) 7→ X(ω)(t) from Ω × [0, 1] into
R is measurable with respect to the product σ-algebra. (Hint: Approximate X by
a sequence of piecewise linear random functions, and use the previous exercise.)
EXERCISE 12.1.8. If X is a C[0, 1]-valued random variable, show that for any
bounded measurable f : R → R, ∫_0^1 f(X(t)) dt is a real-valued random variable.
Also, show that the map t 7→ Ef (X(t)) is measurable. (Hint: Use Exercise 12.1.7
and the measurability assertion from Fubini’s theorem.)
EXERCISE 12.1.9. Let X be a C[0, 1]-valued random variable. Prove that
max_{0≤t≤1} X(t) is a random variable.

12.2. The portmanteau lemma


The following result gives an important set of equivalent criteria for weak
convergence on Polish spaces. Recall that if (S, ρ) is a metric space, a function
f : S → R is called Lipschitz continuous if there is some L < ∞ such that
|f (x) − f (y)| ≤ Lρ(x, y) for all x, y ∈ S.
PROPOSITION 12.2.1 (Portmanteau lemma). Let (S, ρ) be a Polish space and
let {µ_n}_{n=1}^∞ be a sequence of probability measures on S. The following are
equivalent:
(a) µ_n → µ weakly.
(b) ∫ f dµ_n → ∫ f dµ for every bounded and uniformly continuous f.
(c) ∫ f dµ_n → ∫ f dµ for every bounded and Lipschitz continuous f.
(d) For every closed set F ⊆ S,
        lim sup_{n→∞} µ_n(F) ≤ µ(F).
(e) For every open set V ⊆ S,
        lim inf_{n→∞} µ_n(V) ≥ µ(V).
(f) For every Borel set A ⊆ S such that µ(∂A) = 0 (where ∂A denotes the
    topological boundary of A),
        lim_{n→∞} µ_n(A) = µ(A).
(g) ∫ f dµ_n → ∫ f dµ for every bounded measurable f : S → R that is
    continuous a.e. with respect to the measure µ.

PROOF. It is clear that (a) =⇒ (b) =⇒ (c). Suppose that (c) holds. Take any
closed set F ⊆ S. Define the function
    f(x) = ρ(x, F) := inf_{y∈F} ρ(x, y).                          (12.2.1)
Note that for any x, x′ ∈ S and y ∈ F, the triangle inequality gives ρ(x, y) ≤
ρ(x, x′) + ρ(x′, y). Taking infimum over y gives f(x) ≤ ρ(x, x′) + f(x′). Similarly,
f(x′) ≤ ρ(x, x′) + f(x). Thus,
    |f(x) − f(x′)| ≤ ρ(x, x′).                                    (12.2.2)
In particular, f is a Lipschitz continuous function. Since F is closed, it is easy to
see that f(x) = 0 if and only if x ∈ F. Thus, if we define g_k(x) := (1 − kf(x))^+,
then g_k is Lipschitz continuous, takes values in [0, 1], 1_F ≤ g_k everywhere, and
g_k → 1_F pointwise as k → ∞. Therefore by (c),
    lim sup_{n→∞} µ_n(F) ≤ lim sup_{n→∞} ∫ g_k dµ_n = ∫ g_k dµ
for any k, and hence by the dominated convergence theorem,
    lim sup_{n→∞} µ_n(F) ≤ lim_{k→∞} ∫ g_k dµ = µ(F),

which proves (d). The implication (d) =⇒ (e) follows simply by recalling that
open sets are complements of closed sets. Suppose that (d) and (e) both hold. Take
any A such that µ(∂A) = 0. Let F be the closure of A and V be the interior of A.
Then F is closed, V is open, V ⊆ A ⊆ F , and
µ(F ) − µ(V ) = µ(F \ V ) = µ(∂A) = 0.
Consequently, µ(A) = µ(V) = µ(F). Therefore by (d) and (e),
    µ(A) = µ(V) ≤ lim inf_{n→∞} µ_n(V) ≤ lim inf_{n→∞} µ_n(A)
                ≤ lim sup_{n→∞} µ_n(A) ≤ lim sup_{n→∞} µ_n(F) ≤ µ(F) = µ(A),

which proves (f). Next, suppose that (f) holds. Take any f as in (g), and let D_f
be the set of discontinuity points of f. Since f is bounded, we may apply a linear
transformation and assume without loss of generality that f takes values in [0, 1].
For t ∈ [0, 1], let A_t := {x : f(x) ≥ t}. Then by Fubini's theorem, for any
probability measure ν on S,
    ∫_S f(x) dν(x) = ∫_S ∫_0^1 1_{f(x)≥t} dt dν(x)
                   = ∫_0^1 ∫_S 1_{f(x)≥t} dν(x) dt = ∫_0^1 ν(A_t) dt.
Now take any t and x ∈ ∂A_t. Then there is a sequence y_n → x such that y_n ∉ A_t
for each n, and there is a sequence z_n → x such that z_n ∈ A_t for each n. Thus
if x ∉ D_f, then f(x) = t. Consequently, ∂A_t ⊆ D_f ∪ {x : f(x) = t}. Since
µ(D_f) = 0, this shows that µ(∂A_t) > 0 only if µ({x : f(x) = t}) > 0.
But µ({x : f(x) = t}) can be strictly positive for only countably many t. Thus,
µ(∂A_t) = 0 for all but countably many t. Therefore by (f) and the a.e. version of
the dominated convergence theorem (Exercise 2.6.5),
    lim_{n→∞} ∫ f dµ_n = lim_{n→∞} ∫_0^1 µ_n(A_t) dt = ∫_0^1 µ(A_t) dt = ∫ f dµ,
proving (g). Finally, if (g) holds then (a) is obvious. 
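The role of the boundary condition in (f) is easy to see on the real line with µ_n the point mass at 1/n and µ the point mass at 0 (a standalone illustration, not from the text). Criteria (d) and (e) hold, with strict inequality, while (f) fails for A = (0, 1) precisely because µ(∂A) = µ({0, 1}) = 1 ≠ 0:

```python
# Represent a measure by the map A -> measure of A, where a set A is
# given by its indicator function; mu_n is the point mass at 1/n.
def mu_n(n, A):
    return 1.0 if A(1.0 / n) else 0.0

mu = lambda A: 1.0 if A(0.0) else 0.0     # weak limit: point mass at 0

# (d) closed F = {0}: limsup mu_n(F) = 0 <= mu(F) = 1 (strict inequality).
F = lambda x: x == 0.0
assert all(mu_n(n, F) == 0.0 for n in range(1, 200))
assert mu(F) == 1.0

# (e) open V = (0, 1): liminf mu_n(V) = 1 >= mu(V) = 0 (strict inequality).
V = lambda x: 0.0 < x < 1.0
assert all(mu_n(n, V) == 1.0 for n in range(2, 200))
assert mu(V) == 0.0

# (f) fails for A = (0, 1): mu_n(A) = 1 for all n >= 2 but mu(A) = 0,
# consistent with the boundary {0, 1} carrying mass 1 under mu.
assert mu_n(50, V) == 1.0 and mu(V) == 0.0
```

The same sets also show why (d) and (e) are stated with limsup/liminf rather than limits.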
An important corollary of the portmanteau lemma is the following.
COROLLARY 12.2.2. Let S be a Polish space. If µ and ν are two probability
measures on S such that ∫ f dµ = ∫ f dν for every bounded continuous f : S → R,
then µ = ν.

PROOF. By the given condition, µ converges weakly to ν and ν converges
weakly to µ. Therefore by the portmanteau lemma, µ(F) ≤ ν(F) and ν(F) ≤
µ(F) for every closed set F. Thus, by Theorem 1.3.6, µ = ν. 
EXERCISE 12.2.3. Let (S, ρ) be a Polish space, and let {X_n}_{n=1}^∞ be a sequence
of S-valued random variables converging in law to a random variable X. Show that
for any continuous f : S → R, f(X_n) → f(X) in distribution.
EXERCISE 12.2.4. Let {X_n}_{n=1}^∞ be a sequence of C[0, 1]-valued random vari-
ables converging weakly to some X. Then prove that
    max_{0≤t≤1} X_n(t) → max_{0≤t≤1} X(t)   in distribution.
Exercise 12.2.6 given below requires an application of criterion (g) from the
portmanteau lemma. Although criterion (g) implicitly assumes that the set of dis-
continuity points is a measurable set of measure zero, it is often not trivial to prove
measurability of the set of discontinuity points. The following exercise is helpful.
EXERCISE 12.2.5. Let (S, ρ) be a metric space, and let f : S → R be any
function. Prove that the set of continuity points of f is measurable. (Hint: For any
open set U, define d(U) := sup_{x∈U} f(x) − inf_{x∈U} f(x). For ε > 0, let V_ε be
the union of all open U with d(U) < ε. Show that the set of continuity points of f
is exactly ∩_{n≥1} V_{1/n}.)
EXERCISE 12.2.6. In the setting of Exercise 12.2.4, suppose further that
P(X(t) = 0) = 0 for a.e. t ∈ [0, 1]. Then show that
    ∫_0^1 1_{X_n(t)≥0} dt → ∫_0^1 1_{X(t)≥0} dt   in distribution.
(Hint: Use criterion (g) from the portmanteau lemma.)
EXERCISE 12.2.7. In the previous exercise, give a counterexample to show
that the conclusion may not be valid without the condition that P(X(t) = 0) = 0
for a.e. t ∈ [0, 1].
EXERCISE 12.2.8. Suppose that S and S′ are Polish spaces and h : S → S′
is a measurable map. Let X_1, X_2, ... be a sequence of S-valued random variables
converging weakly to an S-valued random variable X. Let D be the set of dis-
continuity points of h. If P(X ∈ D) = 0, show that h(X_1), h(X_2), ... converge
weakly to h(X).

12.3. Independence on Polish spaces


Polish space valued random variables are said to be independent if the σ-
algebras generated by the random variables are independent. All of the measure-
theoretic results and exercises about independent real-valued random variables
from Chapter 7 remain valid for independent Polish space valued random vari-
ables.
The following result is sometimes useful for proving independence of Polish
space valued random variables. Although the result seems almost obvious, the
proof needs some work. We will use it later to prove results about Brownian mo-
tion.
PROPOSITION 12.3.1. Let (S, ρ) be a Polish space. Let {X_n}_{n≥1} be a se-
quence of S-valued random variables converging almost surely to a random vari-
able X. Suppose that each X_n is independent of a σ-algebra G. Then so is X.

PROOF. Take any bounded continuous f : S → R, and any A ∈ G. Then
    E(f(X_n) 1_A) = E(f(X_n)) P(A)
for each n. By the almost sure version of the dominated convergence theorem
(Exercise 2.6.5), E(f (Xn )1A ) → E(f (X)1A ) and E(f (Xn )) → E(f (X)) as
n → ∞. Thus, E(f (X)1A ) = E(f (X))P(A).
Now, for any bounded continuous g, h : R → R, the composition g ◦ f is a
bounded continuous function from S into R, and the random variable h(1A ) is just
a linear transformation of 1A . Thus, by the above deduction, E(g(f (X))h(1A )) =
E(g(f (X)))E(h(1A )). So by Corollary 12.7.4, the random variables f (X) and 1A
are independent.
Now take any closed set F ⊆ S, and let f(x) := min{ρ(x, F), 1}, where
ρ(x, F) is the distance from x to F as defined in (12.2.1). We have already seen
in the proof of the portmanteau lemma that x ↦ ρ(x, F) is continuous. Thus, f
is a continuous map. Therefore f (X) and 1A are independent. In particular, the
events {f (X) = 0} and A are independent. But f (X) = 0 if and only if X ∈ F .
Thus, the events {X ∈ F } and A are independent. Since this holds for every
closed F , the random variable X is independent of the event A (for example, by
Exercise 7.1.9). And since this holds for all A ∈ G, X and G are independent. 

12.4. Tightness and Prokhorov’s theorem


The appropriate generalization of the notion of tightness to probability mea-
sures on Polish spaces is the following.
DEFINITION 12.4.1. Let (S, ρ) be a Polish space and let {µ_n}_{n≥1} be a se-
quence of probability measures on S. The sequence is called tight if for any ε > 0,
there is a compact set K ⊆ S such that µ_n(K) ≥ 1 − ε for all n.
The following result was almost trivial on the real line, but requires effort to
prove on Polish spaces.
THEOREM 12.4.2. If a sequence of probability measures on a Polish space is
weakly convergent, then it is tight.

PROOF. Let (S, ρ) be a Polish space and let {µ_n}_{n=1}^∞ be a sequence of prob-
ability measures on S that converge weakly to some µ. First, we claim that if
{V_i}_{i=1}^∞ is any increasing sequence of open sets whose union is S, then for each
ε > 0 there is some i such that µ_n(V_i) > 1 − ε for all n. If not, then for each i
there exists n_i such that µ_{n_i}(V_i) ≤ 1 − ε. There cannot exist k such that n_i = k
for infinitely many i, because then µ_k(V_i) → 1 as i → ∞. Thus, n_i → ∞ as i → ∞.
Therefore by the portmanteau lemma, for each i,
    µ(V_i) ≤ lim inf_{j→∞} µ_{n_j}(V_i) ≤ lim inf_{j→∞} µ_{n_j}(V_j) ≤ 1 − ε.
But this is impossible since V_i ↑ S. This proves the claim.


Now take any ε > 0 and k ≥ 1. Recall that separable metric spaces have the
Lindelöf property, namely, that any open cover has a countable subcover. Thus,
there is a sequence of open balls {B_{k,i}}_{i=1}^∞ of radius 1/k that cover S. By the
above claim, we can choose n_k such that for any n,
    µ_n(∪_{i=1}^{n_k} B_{k,i}) ≥ 1 − ε 2^{−k}.
Define
    L := ∩_{k=1}^∞ ∪_{i=1}^{n_k} B_{k,i}.
Then for any n,
    µ_n(L) ≥ 1 − Σ_{k=1}^∞ µ_n((∪_{i=1}^{n_k} B_{k,i})^c) ≥ 1 − Σ_{k=1}^∞ ε 2^{−k} = 1 − ε.
Now recall that any totally bounded subset of a complete metric space is precom-
pact. By construction, L is totally bounded. Therefore L is precompact, and hence
the closure L̄ of L is a compact set which satisfies µ_n(L̄) ≥ 1 − ε for all n. 
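Conversely, a sequence along which mass escapes to infinity is not tight and cannot converge weakly. A tiny standalone check (illustration only) with µ_n the point mass at n on R:

```python
def mass_in_ball(n, R):
    # Mass that mu_n (the point mass at n) gives to the compact set [-R, R].
    return 1.0 if abs(n) <= R else 0.0

for R in [10, 100, 1000]:
    # Whatever compact set [-R, R] is proposed, mu_n misses it entirely
    # once n > R, so no single compact set can hold mass 1 - eps for all n.
    assert mass_in_ball(R + 1, R) == 0.0
    assert all(mass_in_ball(n, R) == 1.0 for n in range(1, R + 1))
```

By Theorem 12.4.2 this failure of tightness rules out weak convergence of the sequence to any probability measure.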

EXERCISE 12.4.3. Using Theorem 12.4.2, prove that any probability measure
on a Polish space is regular, in the sense of Lemma 3.4.2.
EXERCISE 12.4.4. Using the previous exercise, prove the analogue of Kol-
mogorov’s extension theorem (Theorem 3.4.1) for Polish spaces.
EXERCISE 12.4.5. Using the previous exercise, prove the existence of Markov
chains with any given kernel on a Polish space (i.e., the analogue of Theorem
11.1.3).
Helly’s selection theorem generalizes to Polish spaces. The generalization is
known as Prokhorov’s theorem.
THEOREM 12.4.6 (Prokhorov's theorem). Let (S, ρ) be a Polish space. Sup-
pose that {µ_n}_{n=1}^∞ is a tight family of probability measures on S. Then there is a
subsequence {µ_{n_k}}_{k=1}^∞ converging weakly to a probability measure µ as k → ∞.
There are various proofs of Prokhorov’s theorem. The proof given below is a
purely measure-theoretic argument. There are other proofs that are more functional
analytic in nature. In the proof below, the measure µ is constructed using the
technique of outer measures, as follows.
Let K_1 ⊆ K_2 ⊆ · · · be a sequence of compact sets such that µ_n(K_i) ≥ 1 − 1/i
for all n. Such a sequence exists because the family {µ_n}_{n=1}^∞ is tight. Let D be a
countable dense subset of S, which exists because S is separable. Let B be the set
of all closed balls with centers at elements of D and nonzero rational radii. Then
B is a countable collection. Let C be the collection of all finite unions of sets that
are of the form B ∩ K_i for some B ∈ B and some i. (In particular, C contains the
empty union, which equals ∅.) Then C is also countable. By the standard diagonal
argument, there is a subsequence {n_k}_{k=1}^∞ such that
    α(C) := lim_{k→∞} µ_{n_k}(C)
exists for every C ∈ C. For every open set V, define
    β(V) := sup{α(C) : C ⊆ V, C ∈ C}.
Finally, for every A ⊆ S, let
µ(A) := inf{β(V ) : V open, A ⊆ V }.
In particular, µ(V ) = β(V ) if V is open. We will eventually show that µ is an
outer measure. The proof requires several steps.
LEMMA 12.4.7. The functional α is monotone and finitely subadditive on the
class C. Moreover, if C_1, C_2 ∈ C are disjoint, then α(C_1 ∪ C_2) = α(C_1) + α(C_2).

PROOF. This is obvious from the definition of α and the properties of mea-
sures. 
LEMMA 12.4.8. If F is a closed set, V is an open set containing F, and some
C ∈ C also contains F, then there exists D ∈ C such that F ⊆ D ⊆ V.

PROOF. Since F ⊆ C for some C ∈ C, F is contained in some K_j. Since F
is closed, this implies that F is compact. For each x ∈ F, choose B_x ∈ B such
that B_x ⊆ V and x belongs to the interior B_x° of B_x. This can be done because
V is open and V ⊇ F. Then by the compactness of F, there exist finitely many
x_1, ..., x_n such that
    F ⊆ ∪_{i=1}^n B_{x_i}° ⊆ ∪_{i=1}^n B_{x_i} ⊆ V.
To complete the proof, take D = (B_{x_1} ∩ K_j) ∪ · · · ∪ (B_{x_n} ∩ K_j). 
LEMMA 12.4.9. The functional β is finitely subadditive on open sets.

PROOF. Take any two open sets V_1 and V_2, and any C ∈ C such that C ⊆
V_1 ∪ V_2. Define
    F_1 := {x ∈ C : ρ(x, V_1^c) ≥ ρ(x, V_2^c)},
    F_2 := {x ∈ C : ρ(x, V_2^c) ≥ ρ(x, V_1^c)},
where ρ(x, A) is defined as in (12.2.1) for any A. It is not hard to see (using
(12.2.2), for instance) that x ↦ ρ(x, A) is a continuous map for any A. Therefore
the sets F_1 and F_2 are closed. Moreover, if x ∉ V_1 and x ∈ F_1, then the defi-
nition of F_1 implies that x ∉ V_2, which is impossible since F_1 ⊆ C ⊆ V_1 ∪ V_2.
Thus, F_1 ⊆ V_1. Similarly, F_2 ⊆ V_2. Moreover, F_1 and F_2 are both subsets of C.
Therefore by Lemma 12.4.8, there exist C_1, C_2 ∈ C such that F_1 ⊆ C_1 ⊆ V_1 and
F_2 ⊆ C_2 ⊆ V_2. But then C = F_1 ∪ F_2 ⊆ C_1 ∪ C_2, and therefore by Lemma 12.4.7,
    α(C) ≤ α(C_1) + α(C_2) ≤ β(V_1) + β(V_2).
Taking supremum over C completes the proof. 
LEMMA 12.4.10. The functional β is countably subadditive on open sets.
PROOF. Let V_1, V_2, ... be a sequence of open sets and let C be an element of C
that is contained in the union of these sets. Since C is compact, there is some finite
n such that C ⊆ V_1 ∪ · · · ∪ V_n. Then by the definition of β and Lemma 12.4.9,
    α(C) ≤ β(V_1 ∪ · · · ∪ V_n) ≤ Σ_{i=1}^n β(V_i) ≤ Σ_{i=1}^∞ β(V_i).
Taking supremum over C completes the proof. 
LEMMA 12.4.11. The functional µ is an outer measure.

PROOF. It is clear from the definition that µ is monotone and satisfies µ(∅) =
0. We only need to show that µ is subadditive. Take any sequence of sets A_1, A_2, ...
contained in S and let A be their union. Take any ε > 0. For each i, let V_i be an
open set containing A_i such that β(V_i) ≤ µ(A_i) + ε 2^{−i}. Then by Lemma 12.4.10,
    µ(A) ≤ β(∪_{i=1}^∞ V_i) ≤ Σ_{i=1}^∞ β(V_i)
         ≤ Σ_{i=1}^∞ (µ(A_i) + ε 2^{−i}) = ε + Σ_{i=1}^∞ µ(A_i).
Since ε is arbitrary, this completes the proof. 
LEMMA 12.4.12. For any open V and closed F,
    β(V) ≥ µ(V ∩ F) + µ(V ∩ F^c).

PROOF. Choose any C_1 ∈ C with C_1 ⊆ V ∩ F^c. Having chosen C_1, choose
C_2 ∈ C with C_2 ⊆ V ∩ C_1^c. Then C_1 and C_2 are disjoint, C_1 ∪ C_2 ∈ C, and
C_1 ∪ C_2 ⊆ V. Therefore by Lemma 12.4.7,
    β(V) ≥ α(C_1 ∪ C_2) = α(C_1) + α(C_2).
Taking supremum over all choices of C_2, we get
    β(V) ≥ α(C_1) + β(V ∩ C_1^c) ≥ α(C_1) + µ(V ∩ F),
where the second inequality holds because V ∩ C_1^c is an open set that contains
V ∩ F. Now taking supremum over C_1 completes the proof. 
We are finally ready to prove Prokhorov’s theorem.
PROOF OF THEOREM 12.4.6. Let F be the set of all µ-measurable sets, as
defined in Section 1.4. Then recall that by Theorem 1.4.3, F is a σ-algebra and µ
is a measure on F. We need to show that (a) F contains the Borel σ-algebra of S,
(b) µ is a probability measure, and (c) µ_{n_k} converges weakly to µ.
To prove (a), take any closed set F and any set D ⊆ S. Then for any open
V ⊇ D, Lemma 12.4.12 gives
    β(V) ≥ µ(V ∩ F) + µ(V ∩ F^c) ≥ µ(D ∩ F) + µ(D ∩ F^c),
where the second inequality holds because µ is monotone. Now taking infimum
over V shows that F ∈ F. Therefore F contains the Borel σ-algebra. To prove
(b), take any i and observe that by the compactness of K_i, K_i can be covered by
finitely many elements of B. Consequently, K_i itself is an element of C. Therefore
    µ(S) = β(S) ≥ α(K_i) ≥ 1 − 1/i.
Since this holds for all i, we get µ(S) ≥ 1. On the other hand, it is clear from the
definition of µ that µ(S) ≤ 1. Thus, µ is a probability measure. Finally, to prove
(c), notice that for any open set V and any C ∈ C with C ⊆ V,
    α(C) = lim_{k→∞} µ_{n_k}(C) ≤ lim inf_{k→∞} µ_{n_k}(V),
and take supremum over C. 

EXERCISE 12.4.13. Let {X_n}_{n=1}^∞ be a sequence of R^d-valued random vectors.
Show that the sequence is tight if and only if for every ε > 0, there is some R such
that P(|X_n| > R) ≤ ε for all n, where |X_n| is the Euclidean norm of X_n.

EXERCISE 12.4.14. Let S be the set of all functions from Z into R. Equip S
with the topology of pointwise convergence. Prove that this topology is metrizable,
and under a compatible metric, it is a Polish space. If (X_i)_{i∈Z} is a sequence of real-
valued random variables defined on some probability space, show that the whole
sequence can be viewed as an S-valued random variable. If (X_n)_{n≥1} is a sequence
of S-valued random variables, where X_n = (X_{n,i})_{i∈Z}, prove that it is tight if and
only if for each i ∈ Z, the sequence of real-valued random variables (X_{n,i})_{n≥1} is
tight.

EXERCISE 12.4.15. Let {µ_n}_{n≥1} be a tight family of probability measures
on a Polish space S. Suppose that any convergent subsequence converges to the
same limit µ. Using Prokhorov's theorem, prove that µ_n → µ weakly. (Hint: Use
contradiction.)

EXERCISE 12.4.16 (The Lévy–Prokhorov metric). Let (S, ρ) be a Polish space.
For any A ⊆ S and ε > 0, let
    A^ε := {x ∈ S : ρ(x, y) < ε for some y ∈ A}.
For any two probability measures µ and ν on S, define
    d(µ, ν) := inf{ε > 0 : µ(A) ≤ ν(A^ε) + ε and
                           ν(A) ≤ µ(A^ε) + ε for all A ∈ B(S)},
where B(S) is the Borel σ-algebra of S. Prove that d is a metric on the set of
probability measures on S, and µ_n → µ weakly if and only if d(µ_n, µ) → 0.
(Thus, d metrizes weak convergence of probability measures.)
12.5. Skorokhod’s representation theorem


We know that convergence in law does not imply almost sure convergence. In
fact, weak convergence does not even assume that the random variables are defined
on the same probability space. It turns out, however, that a certain kind of converse
can be proved. It is sometimes useful for proving theorems and constructing ran-
dom variables.
THEOREM 12.5.1 (Skorokhod's representation theorem). Let S be a Polish
space and let {µ_n}_{n=1}^∞ be a sequence of probability measures on S that converge
weakly to a limit µ. Then it is possible to construct a probability space (Ω, F, P),
a sequence of S-valued random variables {X_n}_{n=1}^∞ on Ω, and another S-valued
random variable X on Ω, such that X_n ∼ µ_n for each n, X ∼ µ, and X_n → X
almost surely.
We need a topological lemma about Polish spaces. Recall that the diameter of
a set A in a metric space (S, ρ) is defined as
    diam(A) = sup_{x,y∈A} ρ(x, y).

LEMMA 12.5.2. Let S be a Polish space and µ be a probability measure on S.
Then for any ε > 0, there is a partition A_0, A_1, ..., A_n of S into measurable sets
such that µ(A_0) < ε, A_0 is open, µ(∂A_i) = 0 and µ(A_i) > 0 for 1 ≤ i ≤ n, and
diam(A_i) ≤ ε for 1 ≤ i ≤ n.

PROOF. Let B(x, r) denote the closed ball of radius r centered at a point x
in S. Since ∂B(x, r) and ∂B(x, s) are disjoint for any distinct r and s, it follows
that for any x, there can be only countably many r such that µ(∂B(x, r)) > 0. In
particular, for each x we can find r_x ∈ (0, ε/2) such that µ(∂B(x, r_x)) = 0. Then
the interiors of the balls B(x, r_x) form an open cover of S. Since S is
a separable metric space, it has the property that any open cover has a countable
subcover (the Lindelöf property). Thus, there exist x_1, x_2, ... such that
    S = ∪_{i=1}^∞ B(x_i, r_{x_i}).
Now choose n so large that
    µ(S) − µ(∪_{i=1}^n B(x_i, r_{x_i})) < ε.
Let B_i := B(x_i, r_{x_i}) for i = 1, ..., n. Define A_1 = B_1 and
    A_i = B_i \ (B_1 ∪ · · · ∪ B_{i−1})
for 2 ≤ i ≤ n. Finally, let A_0 := S \ (B_1 ∪ · · · ∪ B_n). Then by our choice of
n, µ(A_0) < ε. By construction, A_0 is open and diam(A_i) ≤ diam(B_i) ≤ ε for
1 ≤ i ≤ n. Note that ∂A_i ⊆ ∂B_1 ∪ · · · ∪ ∂B_i for 1 ≤ i ≤ n, which shows
that µ(∂A_i) = 0 because µ(∂B_j) = 0 for every j. Finally, we merge with A_0
those A_i for which µ(A_i) = 0, so that µ(A_i) > 0 for all i that remain. 
PROOF OF THEOREM 12.5.1. For each j ≥ 1, choose sets A_0^j, ..., A_{k_j}^j satis-
fying the conditions of Lemma 12.5.2 with ε = 2^{−j} and µ as in the statement of
Theorem 12.5.1.
Next, for each j, find n_j such that if n ≥ n_j, then
    µ_n(A_i^j) ≥ (1 − 2^{−j}) µ(A_i^j)                        (12.5.1)
for all 0 ≤ i ≤ k_j. We can find such n_j by the portmanteau lemma, because
µ_n → µ weakly, A_0^j is open, and µ(∂A_i^j) = 0 for i ≥ 1. Without loss of
generality, we can choose {n_j}_{j=1}^∞ to be a strictly increasing sequence.
Take any n ≥ n_1, and find j such that n_j ≤ n < n_{j+1}. For 0 ≤ i ≤ k_j, define
a probability measure µ_{n,i} on S as
    µ_{n,i}(A) := µ_n(A ∩ A_i^j) / µ_n(A_i^j)
if µ_n(A_i^j) > 0. If µ_n(A_i^j) = 0, define µ_{n,i} to be some arbitrary probability
measure on A_i^j if 1 ≤ i ≤ k_j, and some arbitrary probability measure on S if
i = 0. This can be done because A_i^j is nonempty for i ≥ 1. It is easy to see that
µ_{n,i} is indeed a probability measure by the above construction, and moreover if
i ≥ 1, then µ_{n,i}(A_i^j) = 1. Next, for each 0 ≤ i ≤ k_j, let
    p_{n,i} := 2^j (µ_n(A_i^j) − (1 − 2^{−j}) µ(A_i^j)).

By (12.5.1), p_{n,i} ≥ 0. Moreover,
    Σ_{i=0}^{k_j} p_{n,i} = 2^j Σ_{i=0}^{k_j} (µ_n(A_i^j) − (1 − 2^{−j}) µ(A_i^j))
                = 2^j (µ_n(S) − (1 − 2^{−j}) µ(S)) = 1.

Therefore the convex combination
    ν_n(A) := Σ_{i=0}^{k_j} µ_{n,i}(A) p_{n,i}
also defines a probability measure on S.


Next, let X ∼ µ, Y_n ∼ µ_n, Y_{n,i} ∼ µ_{n,i}, Z_n ∼ ν_n, and U ∼ Unif[0, 1] be
independent random variables defined on a probability space (Ω, F, P), where we
take all n ≥ n_1 and, for each n, all 0 ≤ i ≤ k_j where j is the number such that
n_j ≤ n < n_{j+1}. Take any such n and j, and define
    X_n := 1_{U > 2^{−j}} Σ_{i=0}^{k_j} 1_{X ∈ A_i^j} Y_{n,i} + 1_{U ≤ 2^{−j}} Z_n.
Then for any A ∈ B(S),
    P(X_n ∈ A) = P(U > 2^{−j}) Σ_{i=0}^{k_j} P(X ∈ A_i^j) P(Y_{n,i} ∈ A) + P(U ≤ 2^{−j}) P(Z_n ∈ A)
               = (1 − 2^{−j}) Σ_{i=0}^{k_j} µ(A_i^j) µ_{n,i}(A) + 2^{−j} Σ_{i=0}^{k_j} µ_{n,i}(A) p_{n,i}
               = Σ_{i=0}^{k_j} µ_n(A_i^j) µ_{n,i}(A) = Σ_{i=0}^{k_j} µ_n(A ∩ A_i^j) = µ_n(A).
Thus, Xn ∼ µn . To complete the proof, we need to show that Xn → X a.s. To
show this, first note that by the first Borel–Cantelli lemma,
P(X ∉ A_0^j and U > 2^{-j} for all sufficiently large j) = 1.
If the above event happens, then for all sufficiently large n, X_n = Y_{n,i} for some i ≥ 1 such that X ∈ A_i^j. In particular, ρ(X_n, X) ≤ 2^{-j}, because diam(A_i^j) ≤ 2^{-j}. Since j → ∞ as n → ∞, this proves that X_n → X a.s. □

12.6. Convergence in probability on Polish spaces


Let (S, ρ) be a Polish space. A sequence {X_n}_{n=1}^∞ of S-valued random variables is said to converge in probability to a random variable X if for every ε > 0,
lim_{n→∞} P(ρ(X_n, X) ≥ ε) = 0.
Here we implicitly assumed that all the random variables are defined on a common
probability space (Ω, F, P). If X is a constant, this assumption can be dropped.
P ROPOSITION 12.6.1. Convergence in probability implies convergence in dis-
tribution on Polish spaces.

P ROOF. Let (S, ρ) be a Polish space, and suppose that {Xn }∞n=1 is a sequence
of S-valued random variables converging to a random variable X in probability.
Take any bounded uniformly continuous function f : S → R. Take any ε > 0. Then there is some δ > 0 such that |f(x) − f(y)| ≤ ε whenever ρ(x, y) ≤ δ. Thus,
P(|f(X_n) − f(X)| > ε) ≤ P(ρ(X_n, X) > δ),
which shows that f (Xn ) → f (X) in probability. Since f is bounded, Proposition
8.2.9 shows that f (Xn ) → f (X) in L1 . In particular, Ef (Xn ) → Ef (X). Thus,
Xn → X in distribution. 
P ROPOSITION 12.6.2. Let (S, ρ) be a Polish space. A sequence of S-valued
random variables {Xn }∞ n=1 converges to a constant c ∈ S in probability if and
only if Xn → c in distribution.

P ROOF. If Xn → c in probability, then it follows from Proposition 12.6.1 that


Xn → c in distribution. Conversely, suppose that Xn → c in distribution. Take
any ε > 0 and let F := {x ∈ S : ρ(x, c) ≥ ε}. Then F is a closed set, and so by assertion (d) in the portmanteau lemma,
lim sup_{n→∞} P(ρ(X_n, c) ≥ ε) = lim sup_{n→∞} P(X_n ∈ F) ≤ 0,
since c ∉ F. This shows that X_n → c in probability. □
There is also a version of Slutsky’s theorem for Polish spaces.
P ROPOSITION 12.6.3 (Slutsky’s theorem for Polish spaces). Let (S, ρ) be a
Polish space. Let {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ be two sequences of S-valued random
variables, defined on the same probability space, such that Xn → X in distribution
and Yn → c in probability, where X is an S-valued random variable and c ∈ S is a
constant. Then, as random variables on S × S, (Xn , Yn ) → (X, c) in distribution.

P ROOF. There are many metrics that metrize the product topology on S × S.
For example, we can use the metric
d((x, y), (w, z)) := ρ(x, w) + ρ(y, z).
Suppose that f : S × S → R is a bounded and uniformly continuous function. Take any ε > 0. Then there is some δ > 0 such that |f(x, y) − f(w, z)| ≤ ε whenever d((x, y), (w, z)) ≤ δ. Then
P(|f(X_n, Y_n) − f(X_n, c)| > ε) ≤ P(ρ(Y_n, c) > δ),
which implies that f (Xn , Yn )−f (Xn , c) → 0 in probability. By Proposition 8.2.9,
this shows that f (Xn , Yn ) − f (Xn , c) → 0 in L1 . In particular, Ef (Xn , Yn ) −
Ef (Xn , c) → 0. On the other hand, x 7→ f (x, c) is a bounded continuous function
on S, and so Ef (Xn , c) → Ef (X, c). Thus, Ef (Xn , Yn ) → Ef (X, c). By the
portmanteau lemma, this completes the proof. 

12.7. Multivariate inversion formula


The inversion formula for characteristic functions of random vectors is anal-
ogous to the univariate formula that was presented in Theorem 8.8.1. In the fol-
lowing, a · b denotes the scalar product of two vectors a and b, and |a| denotes the
Euclidean norm of a.
T HEOREM 12.7.1. Let X be an n-dimensional random vector with character-
istic function φ. For each θ > 0, define a function fθ : Rn → C as
f_θ(x) := (2π)^{-n} ∫_{R^n} e^{−it·x − θ|t|²} φ(t) dt.
Then for any bounded continuous g : Rn → R,
E(g(X)) = lim_{θ→0} ∫_{R^n} g(x) f_θ(x) dx.

P ROOF. Proceeding exactly as in the proof of Theorem 8.8.1, we can deduce that f_θ is the p.d.f. of X + Z_θ, where Z_θ is an independent multivariate normal random vector with mean zero and i.i.d. components with variance 2θ. It is easy to show that Z_θ → 0 in probability as θ → 0, and therefore by Slutsky's theorem for Polish spaces, X + Z_θ → X in distribution as θ → 0. The proof is now completed as before. □

Just like Corollary 8.8.2, the above theorem has the following corollary about
random vectors.
C OROLLARY 12.7.2. Two random vectors have the same law if and only if
they have the same characteristic function.

P ROOF. Same as the proof of Corollary 8.8.2, using Theorem 12.7.1 and Corol-
lary 12.2.2 instead of Theorem 8.8.1 and Proposition 8.7.1. 

The above corollary has the following important consequences.


C OROLLARY 12.7.3. Let (X_1, . . . , X_n) be a random vector with characteristic function φ. Let φ_i be the characteristic function of X_i. Then φ(t_1, . . . , t_n) = φ_1(t_1) · · · φ_n(t_n) for all t_1, . . . , t_n if and only if X_1, . . . , X_n are independent.

P ROOF. Let Y1 , . . . , Yn be independent random variables, with Yi having the


same distribution as Xi for each i. Let ψ be the characteristic function of the
random vector (Y1 , . . . , Yn ). By Exercise 7.6.3, ψ(t1 , . . . , tn ) = φ1 (t1 ) · · · φn (tn ).
The claim now follows by Corollary 12.7.2. 
C OROLLARY 12.7.4. Let X and Y be real-valued random variables defined
on the same probability space. Then X and Y are independent if and only if
E(f (X)g(Y )) = E(f (X))E(g(Y )) for any bounded continuous f, g : R → R.

P ROOF. If X and Y are independent, we know that by the product rule for ex-
pectation, E(f (X)g(Y )) = E(f (X))E(g(Y )) for any bounded continuous f, g :
R → R. Conversely, suppose that this criterion holds. Considering real and imag-
inary parts, we can assume that the criterion holds even if f and g are complex
valued. Then taking f (x) = eisx and g(y) = eity for arbitrary s, t ∈ R, we get that
the characteristic function of (X, Y ) is the product of the characteristic functions
of X and Y . Therefore by Corollary 12.7.3, X and Y are independent. 
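The factorization criterion of Corollary 12.7.3 can be checked by hand in the simplest discrete case. The snippet below (my own illustration, not part of the notes) computes the joint characteristic function of two independent Rademacher (±1-valued) variables by exact enumeration and confirms that it factors as cos(s)·cos(t), the product of the marginal characteristic functions:

```python
import cmath
import math

def joint_cf(s, t):
    # Exact characteristic function of (X, Y) for independent Rademacher
    # variables: average exp(i(s*x + t*y)) over the four equally likely
    # outcomes (x, y) in {-1, +1}^2.
    return sum(cmath.exp(1j * (s * x + t * y))
               for x in (-1, 1) for y in (-1, 1)) / 4

# The marginal characteristic function of a Rademacher variable is cos(t),
# so independence predicts joint_cf(s, t) = cos(s) * cos(t).
for s, t in [(0.3, 1.7), (-2.0, 0.5), (1.0, 1.0)]:
    assert abs(joint_cf(s, t) - math.cos(s) * math.cos(t)) < 1e-12
```

Because the enumeration is exact, the factorization holds to machine precision; for dependent variables (e.g., Y = X, whose joint characteristic function is cos(s + t)) the analogous check fails.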
E XERCISE 12.7.5. Prove a multivariate version of Corollary 8.8.7.
E XERCISE 12.7.6. Let X be a real-valued random variable defined on a prob-
ability space (Ω, F, P) and let G be a sub-σ-algebra of F. Let φ be a characteristic
function. Prove that X is independent of G and has characteristic function φ if and
only if E(eitX ; A) = φ(t)P(A) for all t ∈ R and all A ∈ G. (Hint: Show that X
and 1A are independent random variables for all A ∈ G.)

12.8. Multivariate Lévy continuity theorem


The multivariate version of Lévy’s continuity theorem is a straightforward gen-
eralization of the univariate theorem. The only slightly tricky part of the proof is
the proof of tightness, since we did not prove a tail bound for random vectors in terms of characteristic functions.
T HEOREM 12.8.1 (Multivariate Lévy continuity theorem). A sequence of ran-
dom vectors {Xn }n≥1 converges in distribution to a random vector X if and only if
the sequence of characteristic functions {φXn }n≥1 converges to the characteristic
function φX pointwise.

P ROOF. If X_n → X in distribution, then it follows from the definition of weak convergence that the characteristic functions converge. Conversely, suppose that φ_{X_n}(t) → φ_X(t) for all t ∈ R^m, where m is the dimension of the random vectors. Let X_n^1, . . . , X_n^m denote the coordinates of X_n. Similarly, let X^1, . . . , X^m be the coordinates of X. Then note that the pointwise convergence of φ_{X_n} to φ_X automatically implies the pointwise convergence of φ_{X_n^i} to φ_{X^i} for each i. Consequently, by Lévy's continuity theorem, X_n^i → X^i in distribution for each i. In particular, for each i, {X_n^i}_{n=1}^∞ is a tight family of random variables. Thus, given any ε > 0, there is some K^i > 0 such that P(X_n^i ∈ [−K^i, K^i]) ≥ 1 − ε/m for all n. Let K = max{K^1, . . . , K^m}, and let R be the cube [−K, K]^m. Then for any n,
P(X_n ∉ R) ≤ Σ_{i=1}^m P(X_n^i ∉ [−K, K]) ≤ ε.
Thus, we have established that {X_n}_{n=1}^∞ is a tight family of R^m-valued random vectors. We can complete the proof of the theorem as we did for the original Lévy continuity theorem, using Corollary 12.7.2. □
E XERCISE 12.8.2. Let (S, ρ) be a Polish space. Suppose that for each 1 ≤ j ≤
m, {Xn,j }∞ n=1 is a sequence of S-valued random variables converging weakly to a
random variable Xj . Suppose that for each n, the random variables Xn,1 , . . . , Xn,m
are defined on the same probability space and are independent, and the same
holds for (X_1, . . . , X_m). Then show that (X_{n,1}, . . . , X_{n,m}) converges weakly to (X_1, . . . , X_m) as random vectors on S^m.

12.9. The Cramér–Wold device


The Cramér–Wold device is a simple idea about proving weak convergence of
random vectors using weak convergence of random variables. We will use it to
prove the multivariate central limit theorem.
P ROPOSITION 12.9.1 (Cramér–Wold theorem). Let {X_n}_{n=1}^∞ be a sequence of m-dimensional random vectors and X be another m-dimensional random vector. Then X_n → X in distribution if and only if t · X_n → t · X in distribution for every t ∈ R^m.

P ROOF. If X_n → X in distribution, then Ef(t · X_n) → Ef(t · X) for every bounded continuous function f : R → R and every t ∈ R^m. This shows that t · X_n → t · X in distribution for every t. Conversely, suppose that t · X_n → t · X in distribution for every t. Then
φ_{X_n}(t) = E(e^{it·X_n}) → E(e^{it·X}) = φ_X(t).
Therefore, X_n → X in distribution by the multivariate Lévy continuity theorem. □

12.10. The multivariate CLT for i.i.d. sums


In this section we will prove a multivariate version of the central limit theorem
for sums of i.i.d. random variables. The proof is a simple consequence of the
univariate CLT and the Cramér–Wold device. Recall that Nm (µ, Σ) denotes the
m-dimensional normal distribution with mean vector µ and covariance matrix Σ.
T HEOREM 12.10.1 (Multivariate CLT for i.i.d. sums). Let X_1, X_2, . . . be i.i.d. m-dimensional random vectors with mean vector µ and covariance matrix Σ. Then, as n → ∞,
(1/√n) Σ_{i=1}^n (X_i − µ) → N_m(0, Σ) in distribution.

P ROOF. Let Z ∼ N_m(0, Σ). Take any nonzero t ∈ R^m. Let Y := t · Z. Then by Exercise 7.6.8, Y ∼ N(0, t^T Σt). Next, let Y_i := t · (X_i − µ). Then Y_1, Y_2, . . . are i.i.d. random variables, with E(Y_i) = 0 and Var(Y_i) = t^T Σt. Therefore by the univariate CLT for i.i.d. sums,
(1/√n) Σ_{i=1}^n Y_i → N(0, t^T Σt) in distribution.
Combining the two, we get
t · ( (1/√n) Σ_{i=1}^n (X_i − µ) ) → t · Z in distribution.
Since this happens for every t ∈ R^m, the result now follows by the Cramér–Wold device. □
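The theorem is easy to see in simulation. The following sketch (all parameter choices and names are mine) draws i.i.d. two-dimensional vectors with a deliberately non-Gaussian, uniform-based distribution, forms the normalized sums, and checks that their empirical mean and covariance are close to 0 and Σ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target mean vector and covariance matrix for the i.i.d. summands.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)

# Summands are linear images of vectors with i.i.d. Uniform[-sqrt(3), sqrt(3)]
# entries (mean 0, variance 1), so each X_i has mean mu and covariance Sigma.
n, reps = 400, 5000
U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(reps, n, 2))
X = mu + U @ L.T
S = (X - mu).sum(axis=1) / np.sqrt(n)   # one normalized sum per repetition

# By the multivariate CLT, S is approximately N_2(0, Sigma).
emp_cov = np.cov(S.T)
assert np.allclose(S.mean(axis=0), 0.0, atol=0.1)
assert np.allclose(emp_cov, Sigma, atol=0.2)
```

Note that the covariance check involves no CLT error at all (Cov(S) = Σ exactly for every n); only the approximate normality of S requires n to be large.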
CHAPTER 13

Brownian motion

This chapter introduces an important probabilistic object known as Brownian


motion. We will construct Brownian motion, see how it arises as a scaling limit of
random walks (Donsker’s theorem), and prove a number of results about it.

13.1. The spaces C[0, 1] and C[0, ∞)


Recall the spaces C[0, 1] and C[0, ∞) defined in Exercises 12.1.3 and 12.1.4.
In this section we will study some probabilistic aspects of these Polish spaces.
D EFINITION 13.1.1. For each n and each t1 , . . . , tn ∈ [0, 1], the projection
map πt1 ,...,tn : C[0, 1] → Rn is defined as
πt1 ,...,tn (f ) := (f (t1 ), . . . , f (tn )).
The projection maps are defined similarly on C[0, ∞).
It is easy to see that the projection maps are continuous and hence measurable.
However, more is true:
P ROPOSITION 13.1.2. The finite-dimensional projection maps generate the
Borel σ-algebras of C[0, 1] and C[0, ∞).

P ROOF. Let us first consider the case of C[0, 1]. Let B be the Borel σ-algebra
of C[0, 1] and F be the σ-algebra generated by the finite dimensional projection
maps. Note that by the definition of F, each projection map πt1 ,...,tn is measurable
from (C[0, 1], F) into R^n. Given f ∈ C[0, 1] and some n and some t_1, . . . , t_n, we can reconstruct an ‘approximation’ of f from π_{t_1,...,t_n}(f) as the function g which satisfies g(t_i) = f(t_i) for each i and is defined by linear interpolation between consecutive t_i's. Define g to be constant to the left of the smallest t_i and to the right of the largest t_i, so that continuity is preserved. The map ρ_{t_1,...,t_n} that constructs g from
πt1 ,...,tn (f ) is a continuous map from Rn into C[0, 1], and therefore measurable
from Rn into (C[0, 1], B). Thus, if we let
ξt1 ,...,tn := ρt1 ,...,tn ◦ πt1 ,...,tn ,
then ξt1 ,...,tn is measurable from (C[0, 1], F) into (C[0, 1], B). Now let {t1 , t2 , . . .}
be a dense subset of [0, 1] chosen in such a way that
lim_{n→∞} max_{0≤t≤1} min_{1≤i≤n} |t − t_i| = 0.    (13.1.1)

(It is easy to construct such a sequence.) Let fn := ξt1 ,...,tn (f ). By the uniform
continuity of f , the construction of fn , and the property (13.1.1) of the sequence

{tn }∞
n=1 , it is not hard to show that fn → f in the topology of C[0, 1]. Thus, the
map ξt1 ,...,tn converges pointwise to the identity map as n → ∞. So by Proposi-
tion 2.1.14, it follows that the identity map from (C[0, 1], F) into (C[0, 1], B) is
measurable. This proves that F ⊇ B. We have already observed the opposite in-
clusion in the sentence preceding the statement of the proposition. This completes
the argument for C[0, 1].
For C[0, ∞), the argument is exactly the same, except that we have to replace
the maximum over t ∈ [0, 1] in (13.1.1) with maximum over t ∈ [0, K], and then
impose that the condition should hold for every K. The rest of the argument is left
as an exercise for the reader. 
Given any probability measure µ on C[0, 1] or C[0, ∞), the push-forwards of
µ under the projection maps are known as the finite-dimensional distributions of
µ. In the language of random variables, if X is a random variable with law µ,
then the finite-dimensional distributions of µ are the laws of random vectors like
(X(t1 ), . . . , X(tn )), where n and t1 , . . . , tn are arbitrary.
P ROPOSITION 13.1.3. On C[0, 1] and C[0, ∞), a probability measure is de-
termined by its finite-dimensional distributions.

P ROOF. Given a probability measure, the finite-dimensional distributions determine the probabilities of all sets of the form π_{t_1,...,t_n}^{-1}(A), where n and t_1, . . . , t_n
are arbitrary, and A ∈ B(Rn ). Let A denote the collection of all such sets. It is not
difficult to see that A is an algebra. Moreover, by Proposition 13.1.2, A generates
the Borel σ-algebra of C[0, 1] (or C[0, ∞)). By Theorem 1.3.6, this shows that the
finite-dimensional distributions determine the probability measure. 
C OROLLARY 13.1.4. If {µn }∞ n=1 is a tight family of probability measures on
C[0, 1] or C[0, ∞), whose finite-dimensional distributions converge to limiting dis-
tributions, then the sequence itself converges weakly to a limit. Moreover, the lim-
iting probability measure is uniquely determined by the limiting finite-dimensional
distributions.

P ROOF. By Prokhorov’s theorem, any subsequence has a further subsequence


that converges weakly. By Proposition 13.1.3, there can be only one such limit
point. The result now follows by Exercise 12.4.15. 

13.2. Tightness on C[0, 1]


In this section we investigate criteria for tightness of sequences of probability
measures on C[0, 1]. Recall that the modulus of continuity of a function f ∈
C[0, 1] is defined as
ωf (δ) := sup{|f (s) − f (t)| : 0 ≤ s, t ≤ 1, |s − t| ≤ δ}.
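As an aside, the modulus of continuity can be approximated numerically from a grid of function values; a small illustrative sketch (names mine), tested against f(t) = t², for which ω_f(δ) = 2δ − δ² (the supremum is attained at s = 1, t = 1 − δ):

```python
import numpy as np

def modulus_of_continuity(f_vals, ts, delta):
    # Approximate omega_f(delta) = sup{|f(s) - f(t)| : |s - t| <= delta}
    # over the grid ts, where f_vals[i] = f(ts[i]) and ts is increasing.
    w = 0.0
    n = len(ts)
    for i in range(n):
        for j in range(i + 1, n):
            if ts[j] - ts[i] > delta:
                break
            w = max(w, abs(f_vals[j] - f_vals[i]))
    return w

ts = np.linspace(0.0, 1.0, 1001)
f = ts ** 2          # f(t) = t^2 on [0, 1]
delta = 0.1
w = modulus_of_continuity(f, ts, delta)
assert abs(w - (2 * delta - delta ** 2)) < 1e-2
```

The grid approximation underestimates the true supremum by at most the variation of f over one grid step, which is why a coarse tolerance is used above.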
Recall that a family of functions F ⊆ C[0, 1] is called equicontinuous if for any ε > 0, there is some δ > 0 such that for all f ∈ F, |f(s) − f(t)| ≤ ε whenever |s − t| ≤ δ. The family F is called uniformly bounded if there is some finite M

such that |f (t)| ≤ M for all f ∈ F and all t ∈ [0, 1]. Finally, recall the Arzelà–
Ascoli theorem, which says that a closed set F ⊆ C[0, 1] is compact if and only if
it is uniformly bounded and equicontinuous.
P ROPOSITION 13.2.1. Let {Xn }∞ n=1 be a sequence of C[0, 1]-valued random
variables. The sequence is tight if and only if the following two conditions hold:
(i) For any ε > 0 there is some a > 0 such that for all n,
P(|X_n(0)| > a) ≤ ε.
(ii) For any ε > 0 and η > 0, there is some δ > 0 such that for all large enough n (depending on ε and η),
P(ω_{X_n}(δ) > η) ≤ ε.

P ROOF. First, suppose that the sequence {X_n}_{n=1}^∞ is tight. Take any ε > 0. Then there is some compact K ⊆ C[0, 1] such that P(X_n ∉ K) ≤ ε for all n. By the Arzelà–Ascoli theorem, there is some finite M such that |f(t)| ≤ M for all f ∈ K and all t ∈ [0, 1]. Thus, for any n,
P(|X_n(0)| > M) ≤ P(X_n ∉ K) ≤ ε.
This proves condition (i). Next, take any positive η. Again by the Arzelà–Ascoli theorem, the family K is equicontinuous. Thus, there exists δ such that ω_f(δ) ≤ η for all f ∈ K. Therefore, for any n,
P(ω_{X_n}(δ) > η) ≤ P(X_n ∉ K) ≤ ε.
This proves condition (ii).
Conversely, suppose that conditions (i) and (ii) hold. We will first assume that (ii) holds for all n. Take any ε > 0. Choose a so large that P(|X_n(0)| > a) ≤ ε/2 for all n. Next, for each k, choose δ_k so small that for all n,
P(ω_{X_n}(δ_k) > k^{-1}) ≤ ε 2^{-k-1}.
Finally, let
K := {f ∈ C[0, 1] : |f(0)| ≤ a, ω_f(δ_k) ≤ k^{-1} for all k}.
Then by construction, P(X_n ∉ K) ≤ ε for all n. Moreover, it is easy to see that K
is closed, uniformly bounded, and equicontinuous. Therefore by the Arzelà–Ascoli
theorem, K is compact. This proves tightness.
Note that we have proved tightness under the assumption that (ii) holds for all n. Now suppose that (ii) holds only for n ≥ n_0, where n_0 depends on ε and η. By Theorem 12.4.2, any single random variable is tight. This, and the fact that (i) and (ii) hold for any tight family (which we have shown above), allows us to decrease δ sufficiently so that (ii) holds for n < n_0 too. □

13.3. Donsker’s theorem


In this section, we will prove a ‘functional version’ of the central limit theorem, known as Donsker's theorem or Donsker's invariance principle.
Let us start with a sequence of i.i.d. random variables {Xn }∞ n=1 , with mean
zero and variance one. For each n, let us use them to construct a C[0, 1]-valued
random variable Bn as follows. Let Bn (0) = 0. When t = i/n for some 1 ≤ i ≤
n, let
B_n(t) = (1/√n) Σ_{j=1}^{i} X_j.

Finally, define Bn between (i − 1)/n and i/n by linear interpolation.
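The construction of B_n is straightforward to implement; the following sketch (function and variable names are mine) builds the polygonal path from the partial sums, with B_n(i/n) = S_i/√n at the knots and linear interpolation in between:

```python
import numpy as np

def donsker_path(steps, ts):
    # Build B_n from i.i.d. steps X_1, ..., X_n: B_n(0) = 0,
    # B_n(i/n) = S_i / sqrt(n), linear interpolation between knots.
    n = len(steps)
    knots = np.arange(n + 1) / n
    vals = np.concatenate(([0.0], np.cumsum(steps))) / np.sqrt(n)
    return np.interp(ts, knots, vals)

rng = np.random.default_rng(1)
n = 1000
steps = rng.choice([-1.0, 1.0], size=n)   # Rademacher steps: mean 0, variance 1
S = np.cumsum(steps)
ts = np.linspace(0.0, 1.0, 501)
B = donsker_path(steps, ts)

assert abs(B[0]) < 1e-12
assert abs(B[-1] - S[-1] / np.sqrt(n)) < 1e-12
```

Plotting such paths for increasing n gives an informal picture of the weak limit described by Donsker's theorem below.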


Donsker’s theorem identifies the limiting distribution of Bn as n → ∞. Just
like in the central limit theorem, it turns out that the limiting distribution does not
depend on the law of the Xi ’s.
T HEOREM 13.3.1 (Donsker’s invariance principle). As n → ∞, the sequence
{Bn }∞
n=1 converges weakly to a C[0, 1]-valued random variable B. The finite-
dimensional distributions of B are as follows: B(0) = 0, and for any m and
any 0 < t1 < · · · < tm ≤ 1, the random vector (B(t1 ), . . . , B(tm )) has a
multivariate normal distribution with mean vector zero and covariance structure
given by Cov(B(ti ), B(tj )) = min{ti , tj }.
The limit random variable B is called Brownian motion, and its law is called
the Wiener measure on C[0, 1]. The proof of Donsker’s theorem comes in a number
of steps. First, we identify the limits of the finite-dimensional distributions.
L EMMA 13.3.2. As n → ∞, the finite-dimensional distributions of Bn con-
verge weakly to the limits described in Theorem 13.3.1.

P ROOF. Since B_n(0) = 0 for all n, B_n(0) → 0 in distribution. Take any m and any 0 < t_1 < t_2 < · · · < t_m ≤ 1. Let k_i := ⌊nt_i⌋, and define
W_{n,i} := (1/√n) Σ_{j=1}^{k_i} X_j.

Also let W_{n,0} = 0 and t_0 = 0. It is a simple exercise to show by the Lindeberg–Feller CLT that for any 0 ≤ i ≤ m − 1,
W_{n,i+1} − W_{n,i} → N(0, t_{i+1} − t_i) in distribution.
Moreover, for any n, Wn,1 − Wn,0 , Wn,2 − Wn,1 , . . . , Wn,m − Wn,m−1 are inde-
pendent random variables. Therefore by Exercise 12.8.2, the random vector
Wn := (Wn,1 − Wn,0 , Wn,2 − Wn,1 , . . . , Wn,m − Wn,m−1 )
converges in distribution to the random vector Z = (Z1 , . . . , Zm ), where the ran-
dom variables Z1 , . . . , Zm are independent and Zi ∼ N (0, ti − ti−1 ) for each i.

Now notice that for each n and i,
|W_{n,i} − B_n(t_i)| = (1/√n) | Σ_{j=1}^{k_i} X_j − ( Σ_{j=1}^{k_i} X_j + (nt_i − k_i) X_{k_i+1} ) | ≤ |X_{k_i+1}| / √n,
where the right side is interpreted as 0 if k_i = n. Thus, for any ε > 0,
P(|W_{n,i} − B_n(t_i)| > ε) ≤ P(|X_{k_i+1}| > ε√n) = P(|X_1| > ε√n),
which tends to zero as n → ∞. This shows that as n → ∞, Wn,i − Bn (ti ) → 0 in
probability. Therefore, if we let
Un := (Bn (t1 ), Bn (t2 ) − Bn (t1 ), . . . , Bn (tm ) − Bn (tm−1 )),
then W_n − U_n → 0 in probability. Thus, U_n → Z in distribution. The claimed result is easy to deduce from this. □

Next, we have to prove tightness. For that, the following maximal inequality is
useful.
L EMMA 13.3.3. Let X_1, X_2, . . . be independent random variables with mean zero and variance one. For each n, let S_n := Σ_{i=1}^n X_i. Then for any n ≥ 1 and t ≥ 0,
P( max_{1≤i≤n} |S_i| ≥ t√n ) ≤ 2 P( |S_n| ≥ (t − √2)√n ).

P ROOF. Define
A_i := { max_{1≤j<i} |S_j| < t√n ≤ |S_i| },
where the maximum on the left is interpreted as zero if i = 1. Also let
B := { |S_n| ≥ (t − √2)√n }.
Then
P( max_{1≤i≤n} |S_i| ≥ t√n ) = P( ∪_{i=1}^n A_i )
= P( B ∩ ∪_{i=1}^n A_i ) + P( B^c ∩ ∪_{i=1}^n A_i )
≤ P(B) + Σ_{i=1}^{n−1} P(A_i ∩ B^c).

Now, A_i ∩ B^c implies |S_n − S_i| ≥ √(2n). But this event is independent of A_i. Thus, we have
P(A_i ∩ B^c) ≤ P(A_i ∩ {|S_n − S_i| ≥ √(2n)})
= P(A_i) P(|S_n − S_i| ≥ √(2n))
≤ P(A_i) · (n − i)/(2n) ≤ (1/2) P(A_i),
where the second-to-last inequality follows by Chebychev's inequality. Combining, we get
P( max_{1≤i≤n} |S_i| ≥ t√n ) ≤ P(B) + (1/2) Σ_{i=1}^{n−1} P(A_i).
But the A_i's are disjoint events. Therefore,
Σ_{i=1}^{n−1} P(A_i) = P( ∪_{i=1}^{n−1} A_i ) ≤ P( max_{1≤i≤n} |S_i| ≥ t√n ).

Plugging this upper bound into the right side of the previous display and rearranging, we get the desired result. □
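Since Lemma 13.3.3 is exact for every n ≥ 1 and t ≥ 0, it can be checked by brute force in the Rademacher case, where all 2^n paths are equally likely (an illustrative sketch; names mine):

```python
from itertools import product
from math import sqrt

def maximal_inequality_gap(n, t):
    # Exact check of Lemma 13.3.3 for Rademacher steps (mean 0, variance 1):
    # returns (lhs, rhs) as path counts over all 2^n equally likely paths,
    # lhs = #{max_{1<=i<=n} |S_i| >= t*sqrt(n)},
    # rhs = 2 * #{|S_n| >= (t - sqrt(2))*sqrt(n)}.
    thr_max = t * sqrt(n)
    thr_end = (t - sqrt(2.0)) * sqrt(n)
    lhs = rhs = 0
    for steps in product((-1, 1), repeat=n):
        s, m = 0, 0
        for x in steps:
            s += x
            m = max(m, abs(s))
        if m >= thr_max:
            lhs += 1
        if abs(s) >= thr_end:
            rhs += 2
    return lhs, rhs

# The lemma asserts lhs <= rhs for every n >= 1 and t >= 0.
for t in (0.5, 1.0, 1.5, 2.0):
    lhs, rhs = maximal_inequality_gap(10, t)
    assert lhs <= rhs
```

Here lhs and rhs are raw path counts, so lhs ≤ rhs is the lemma's inequality with both sides multiplied by 2^n.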
C OROLLARY 13.3.4. Let Sn be as in Lemma 13.3.3. Then for any t > 3, there
is some n0 such that for all n ≥ n0 ,
P( max_{1≤i≤n} |S_i| ≥ t√n ) ≤ 6 e^{−t²/8}.


P ROOF. Take any t > 3. Then t − √2 ≥ t/2. Let Z ∼ N(0, 1). Then by the central limit theorem,
lim_{n→∞} P(|S_n| ≥ (t − √2)√n) = P(|Z| ≥ t − √2) ≤ P(|Z| ≥ t/2).
To complete the proof, recall that by Exercise 6.3.5, we have the tail bound P(|Z| ≥ t/2) ≤ 2 e^{−t²/8}. □

We are now ready to prove tightness.


L EMMA 13.3.5. The sequence {Bn }∞
n=1 is tight.

P ROOF. Choose any η, ε, δ > 0. Note that for any 0 ≤ s ≤ t ≤ 1 such that t − s ≤ δ, we have ⌈nt⌉ − ⌊ns⌋ ≤ ⌈nδ⌉ + 2. From this it is not hard to see that
ω_{B_n}(δ) ≤ max{ |S_l − S_k|/√n : 0 ≤ k ≤ l ≤ n, l − k ≤ ⌈nδ⌉ + 2 }.
Let E be the event that the quantity on the right is greater than η. Let 0 = k_0 ≤ k_1 ≤ · · · ≤ k_r = n satisfy k_{i+1} − k_i = ⌈nδ⌉ + 2 for each 0 ≤ i ≤ r − 2, and ⌈nδ⌉ + 2 ≤ k_r − k_{r−1} ≤ 2(⌈nδ⌉ + 2). Let n be so large that ⌈nδ⌉ + 2 ≤ 2nδ.

Suppose that the event E happens, and let k ≤ l be a pair that witnesses that.
Then either ki ≤ k ≤ l ≤ ki+1 for some i, or ki ≤ k ≤ ki+1 ≤ l ≤ ki+2 for some
i. In either case, the following event must happen:
E′ := { max_{k_i ≤ m ≤ k_{i+1}} |S_m − S_{k_i}| > η√n/3 for some i },
because if it does not happen, then |S_l − S_k| cannot be greater than η√n. But by Corollary 13.3.4, we have for any 0 ≤ i ≤ r − 1,
P( max_{k_i ≤ m ≤ k_{i+1}} |S_m − S_{k_i}| > η√n/3 )
= P( max_{k_i ≤ m ≤ k_{i+1}} |S_m − S_{k_i}| > (η√n / (3√(k_{i+1} − k_i))) · √(k_{i+1} − k_i) )
≤ P( max_{k_i ≤ m ≤ k_{i+1}} |S_m − S_{k_i}| > (η / (3√(2δ))) · √(k_{i+1} − k_i) ) ≤ C_1 e^{−C_2 η²/δ},
where C1 and C2 do not depend on n, η or δ, and n is large enough, depending on
η and δ. Thus, for all large enough n,
P(ω_{B_n}(δ) > η) ≤ P(E) ≤ P(E′) ≤ C_1 r e^{−C_2 η²/δ} ≤ (C_1/δ) e^{−C_2 η²/δ},
where the last inequality holds since r is bounded above by a constant multiple of 1/δ (the constant being absorbed into C_1). Since (C_1/δ) e^{−C_2 η²/δ} → 0 as δ → 0, given any η, ε > 0 we can choose δ such that condition (ii) of Proposition 13.2.1 holds for all large enough n. Condition (i) is automatic. This completes the proof. □
We now have all the ingredients to prove Donsker’s theorem.
P ROOF OF T HEOREM 13.3.1. Note that by Lemma 13.3.5 and Prokhorov's theorem, any subsequence of {B_n}_{n=1}^∞ has a weakly convergent subsequence. By Proposition 13.1.3 and Lemma 13.3.2, any two such weak limits must be the same. This suffices to prove that the whole sequence converges (by Exercise 12.4.15). □
E XERCISE 13.3.6. Let X_1, X_2, . . . be a sequence of i.i.d. random variables with mean zero and variance one. For each n, let S_n := Σ_{i=1}^n X_i, with S_0 = 0. Prove that the random variables
(1/√n) max_{0≤i≤n} S_i
and
(1/n) Σ_{i=1}^n 1_{S_i ≥ 0}
converge in law as n → ∞, and the limiting distributions do not depend on the distribution of the X_i's. In fact, the first sequence converges in law to max_{0≤t≤1} B(t), and the second sequence converges in law to ∫_0^1 1_{B(t) ≥ 0} dt. (Hint: Use Donsker's

theorem and Exercises 12.2.4 and 12.2.6. The second one is technically more chal-
lenging.)
E XERCISE 13.3.7. Prove a version of Donsker’s theorem for sums of station-
ary m-dependent sequences. (Hint: The key challenge is to generalize Lemma
13.3.3. The rest goes through as in the i.i.d. case, applying the CLT for stationary
m-dependent sequences that we derived earlier.)

E XERCISE 13.3.8. Let X_1, X_2, . . . be i.i.d. standard Cauchy random variables. Let S_n := Σ_{i=1}^n X_i, and let Z_n(t) := S_j/n when t = j/n. Linearly interpolate to define Z_n(t) for all t ≥ 0. Show that the finite dimensional distributions of Z_n converge, but the random functions Z_n do not converge in law on C[0, ∞). (That is, you need to show that {Z_n}_{n≥1} is not a tight family in C[0, ∞).)

13.4. Construction of Brownian motion


The probability measure on C[0, 1] obtained by taking the limit n → ∞ in
Donsker’s theorem is known as the Wiener measure on C[0, 1]. The statement of
Donsker’s theorem gives the finite-dimensional distributions of this measure. A
random function B with this law is called Brownian motion on the time interval
[0, 1]. Brownian motion on the time interval [0, ∞) is a similar C[0, ∞)-valued
random variable. One can define it in the spirit of Donsker’s theorem (see Exer-
cise 13.4.2 below), but one can also define it more directly using a sequence of
independent Brownian motions on [0, 1], as follows.
Let B 1 , B 2 , . . . be a sequence of i.i.d. Brownian motions on the time interval
[0, 1]. Define a C[0, ∞)-valued random variable B as follows. If k ≤ t < k + 1
for some integer k ≥ 0, define
B(t) := Σ_{j=1}^{k} B^j(1) + B^{k+1}(t − k).

The random function B is called standard Brownian motion. We often write Bt


instead of B(t).
E XERCISE 13.4.1. Check that B defined above is indeed a C[0, ∞)-valued
random variable, and that for any n and any 0 ≤ t1 ≤ · · · ≤ tn < ∞, the vector
(B(t1 ), . . . , B(tn )) is a multivariate Gaussian random vector with mean vector
zero and covariance structure given by Cov(B(ti ), B(tj )) = min{ti , tj }.
E XERCISE 13.4.2. Let X1 , X2 , . . . be a sequence of i.i.d. random variables
with mean zero and variance one. Define Bn (t) as in Donsker’s theorem, but for
all t ∈ [0, ∞), so that Bn is now a C[0, ∞)-valued random variable. Prove that Bn
converges weakly to the random function B defined above.
E XERCISE 13.4.3. Let B be standard Brownian motion. Prove that
(1) B(t) ∼ N (0, t) for any t.
(2) B(t) − B(s) ∼ N (0, t − s) for any 0 ≤ s ≤ t.

(3) For any n and 0 ≤ t1 ≤ · · · ≤ tn , the random variables B(t1 ), B(t2 ) −


B(t1 ), . . . , B(tn ) − B(tn−1 ) are independent. (This is known as the ‘in-
dependent increments’ property.)
Conversely, if B is a random continuous function with the above properties, prove
that its law is the Wiener measure.

E XERCISE 13.4.4 (Invariance under sign change). If B is standard Brownian


motion, prove that −B is again standard Brownian motion.

E XERCISE 13.4.5 (Invariance under time translation). Let B be standard Brow-


nian motion. Take any s ≥ 0. Define W (t) := B(s + t) − B(s) for t ≥ 0. Prove
that W is standard Brownian motion.

E XERCISE 13.4.6 (Invariance under time scaling). Let B be standard Brow-


nian motion. Take any c > 0. Define W (t) := c−1/2 B(ct). Prove that W is
standard Brownian motion.

E XERCISE 13.4.7 (Lévy’s construction of Brownian motion). Construct a se-


quence of random elements X1 , X2 , . . . of C[0, 1] as follows. Let X1 (0) = 0,
X1 (1) ∼ N (0, 1), and define X1 (t) for all other t by linear interpolation. Then,
define X2 (1/2) = X1 (1/2) + Z, where Z ∼ N (0, 1/4). Also let X2 (1) = X1 (1)
and X2 (0) = 0. Having defined X2 (t) at t = 0, 1/2, 1, define it everywhere else
by linear interpolation. In general, having defined Xk , define Xk+1 as follows. If
t = j2−k for some even j, let Xk+1 (t) = Xk (t). If t = j2−k for some odd j, let

X_{k+1}(t) = X_k(t) + an independent N(0, 2^{−k−1}) random variable.

For all other t, define Xk+1 (t) by linear interpolation. Then:


(1) Prove that the finite dimensional distributions of the sequence {Xn }n≥1
converge to those of Brownian motion.
(2) Prove that
Σ_{n=1}^∞ E ‖X_{n+1} − X_n‖_{[0,1]} < ∞.

(3) Using the above, prove that {Xn }n≥1 is almost surely a Cauchy sequence
in C[0, 1], and therefore has a limit. Prove that the limit is Brownian
motion.
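The refinement step of this construction is easy to code. In the sketch below (my own illustration), vals holds the current approximation at dyadic points with spacing h; at each new midpoint, the linearly interpolated value (the average of the two neighbors) gets an independent N(0, h/4) perturbation, which matches the N(0, 2^{−k−1}) in the exercise because h = 2^{−(k−1)} at step k:

```python
import numpy as np

def levy_refine(vals, h, rng):
    # One refinement step: vals holds X_k at dyadic points with spacing h.
    # The new function agrees with X_k at the old points; at each new
    # midpoint it equals the linear interpolation of X_k plus an
    # independent N(0, h/4) perturbation.
    m = len(vals)
    new = np.empty(2 * m - 1)
    new[0::2] = vals
    mids = 0.5 * (vals[:-1] + vals[1:])
    new[1::2] = mids + rng.normal(0.0, np.sqrt(h / 4.0), size=m - 1)
    return new, h / 2.0

rng = np.random.default_rng(2)
vals = np.array([0.0, rng.normal()])   # X_1 on the grid {0, 1}
h = 1.0
for _ in range(10):
    old = vals
    vals, h = levy_refine(vals, h, rng)
    # X_{k+1} agrees with X_k at the coarser dyadic points.
    assert np.array_equal(vals[0::2], old)

assert len(vals) == 2 ** 10 + 1
```

Iterating this refinement and plotting the successive polygonal approximations gives a concrete picture of the almost sure convergence asserted in part (3).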

E XERCISE 13.4.8. Let (B(t))0≤t≤1 be standard Brownian motion on the time


interval [0, 1]. Let W (t) = B(t) − tB(1). This process is known as the ‘Brown-
ian bridge’. What are the finite dimensional distributions of the Brownian bridge?
Prove that if W is a Brownian bridge, and Z is an independent N (0, 1) random
variable, and U (t) = W (t) + tZ, then U is standard Brownian motion. Conse-
quently, deduce that if B and W are as above, then the random function W and the
random variable B(1) are independent.

13.5. An application of Donsker’s theorem



Let {X_n}_{n=1}^∞ be i.i.d. random variables with mean 0 and variance 1. Let S_n = Σ_{i=1}^n X_i, with S_0 = 0. What is the limiting distribution of max_{0≤i≤n} S_i/√n? By
Exercise 13.3.6, we know that there is a limiting distribution and it does not depend
on the law of the Xi ’s. It is therefore enough to figure out the limiting distribution
in one case.
For convenience, we choose the Xi ’s to be 1 or −1 with equal probability, so
that Sn is a simple symmetric random walk on Z. Let Mn be the maximum of
S0 , . . . , Sn . Since S0 = 0, we have Mn ≥ 0. Take any integer a > 0. Let P be the
set of all possible values of the vector (S0 , . . . , Sn ). Let A be the subset consisting
of all (s0 , . . . , sn ) ∈ P such that max0≤i≤n si ≥ a and sn < a. Let B be the
subset consisting of all (s0 , . . . , sn ) ∈ P such that max0≤i≤n si ≥ a and sn > a.
The first key observation is that A and B have the same size. To prove this,
we exhibit a bijection φ from A onto B. Take any (s_0, . . . , s_n) ∈ A. Define (s′_0, . . . , s′_n) = φ(s_0, . . . , s_n) as follows. Let m ≤ n be the smallest index such that s_m = a. This exists by the definition of A. Define (s′_0, . . . , s′_n) by ‘reflecting (s_0, . . . , s_n) across level a beyond the time point m’, as
s′_i := s_i if i ≤ m, and s′_i := a − (s_i − a) if i > m.
Since s′_n = 2a − s_n and s_n < a, we have s′_n > a. Thus, φ is indeed a map from A into B. Now, it is clear from the definition that m is the smallest number such that s′_m = a. Moreover, s_i = s′_i for i ≤ m, and s_i = a − (s′_i − a) for i > m. Thus, if we define φ exactly as above on B, then φ(s′_0, . . . , s′_n) = (s_0, . . . , s_n). This proves that φ is a bijection between A and B, and hence |A| = |B|.
Now let C be the set of all (s0 , . . . , sn ) ∈ P such that max0≤i≤n si ≥ a and
s_n = a. Then
P(M_n ≥ a) = |A ∪ B ∪ C| / |P|.
But the sets A, B and C are disjoint, and |A| = |B|. Therefore
P(M_n ≥ a) = (2|B| + |C|) / |P|.
Now we make the second crucial observation in the proof: since s_n > a implies max_{0≤i≤n} s_i ≥ a, B is just the set of all (s_0, . . . , s_n) ∈ P such that s_n > a, and similarly C is just the set of all (s_0, . . . , s_n) ∈ P such that s_n = a. Thus,
P(Mn ≥ a) = 2P(Sn > a) + P(Sn = a).
Using this identity and the central limit theorem (or just Stirling’s approximation)
it is easy to show that for any x ≥ 0,

lim_{n→∞} P(M_n ≥ x√n) = 2 P(Z ≥ x),

where Z ∼ N (0, 1). Since 2P(Z ≥ x) = P(|Z| ≥ x), we arrive at the following
result.

P ROPOSITION 13.5.1. Let X_1, X_2, . . . be i.i.d. random variables with mean 0 and variance 1, and let S_i := X_1 + · · · + X_i, with S_0 = 0. Then as n → ∞,
(1/√n) max_{0≤i≤n} S_i → |Z| in distribution,
where Z ∼ N(0, 1). In particular, max_{0≤t≤1} B(t) has the same law as |Z|.
Incidentally, the argument used above is known as the ‘reflection principle’.
It has important extensions to other settings, especially in higher dimensions. An
important corollary of the above proposition is the following.
COROLLARY 13.5.2. Let B be standard Brownian motion. Then for all x > 0,

P(max_{0≤t≤1} |B(t)| ≥ x) ≤ 4e^{−x²/2}.

PROOF. Since −B is again a standard Brownian motion (by Exercise 13.4.4), Proposition 13.5.1 gives

P(max_{0≤t≤1} |B(t)| ≥ x) ≤ P(max_{0≤t≤1} B(t) ≥ x) + P(max_{0≤t≤1} (−B(t)) ≥ x) = 2P(max_{0≤t≤1} B(t) ≥ x) = 4P(Z ≥ x).

The proof is now completed by the normal tail bound from Exercise 6.3.5. □
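As a numerical sanity check of this bound, one can approximate B by a Gaussian random walk and estimate the left-hand side by Monte Carlo; the step and sample counts below are arbitrary illustrative choices:

```python
import math, random

# Monte Carlo check of Corollary 13.5.2: the empirical estimate of
# P(max_{0<=t<=1} |B(t)| >= x) should sit below 4 exp(-x^2/2).
# Brownian motion is approximated by a walk with N(0, 1/n) increments.
random.seed(0)
n_steps, n_paths, x = 500, 10_000, 2.0
hits = 0
for _ in range(n_paths):
    s, m = 0.0, 0.0
    for _ in range(n_steps):
        s += random.gauss(0.0, 1.0 / math.sqrt(n_steps))  # increment ~ N(0, 1/n)
        m = max(m, abs(s))
    hits += (m >= x)
emp = hits / n_paths
bound = 4 * math.exp(-x * x / 2)
print(emp, "<=", bound)
```

The bound is far from tight here (the true probability is roughly 4P(Z ≥ x) for moderate x), which the simulation also makes visible.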
EXERCISE 13.5.3. Let B be standard Brownian motion. Given t > 0, figure out the law of max_{0≤s≤t} B(s).

EXERCISE 13.5.4. Let B be standard Brownian motion. Given a > 0, let T := inf{t : B(t) ≥ a}. Prove that T is a random variable and that it is finite almost surely. Then compute the probability density function of T. (Hint: Use the previous problem.)

13.6. Law of large numbers for Brownian motion


The following result is known as the law of large numbers for Brownian mo-
tion.
PROPOSITION 13.6.1. Let B be standard Brownian motion. As t → ∞,
B(t)/t → 0 almost surely.

PROOF. Since B(1), B(2) − B(1), B(3) − B(2), . . . are i.i.d. standard normal random variables, B(n)/n → 0 almost surely as n → ∞ by the strong law of large numbers. Now if n ≤ t ≤ n + 1, then

|B(t)|/t ≤ |B(t)|/n ≤ |B(n)|/n + (1/n) max_{n≤s≤n+1} |B(s) − B(n)|.
Therefore it suffices to show that Mn /n → 0 almost surely as n → ∞, where
Mn := maxn≤s≤n+1 |B(s) − B(n)|. Take any n. Let W (t) := B(n + t) − B(n)
for t ≥ 0. By Exercise 13.4.5, W is again standard Brownian motion. Therefore by Corollary 13.5.2,
P(Mn ≥ x) = P(max_{0≤t≤1} |W(t)| ≥ x) ≤ 4e^{−x²/2}.

Thus,

Σ_{n=1}^∞ P(Mn ≥ √n) ≤ Σ_{n=1}^∞ 4e^{−n/2} < ∞.

By the first Borel–Cantelli lemma, this proves that Mn/n → 0 almost surely. □


EXERCISE 13.6.2 (Time inversion). Let B be standard Brownian motion. Define W(t) := tB(1/t) for t > 0, and W(0) = 0. Prove that W is again standard Brownian motion. (Hint: You will have to use the law of large numbers for continuity at zero.)
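A quick numerical check of the covariance structure behind this exercise (not a proof, and with arbitrarily chosen time points s = 0.5 and t = 2): for W(t) = tB(1/t) one should find Cov(W(s), W(t)) = min(s, t).

```python
import math, random, statistics

# For W(t) := t B(1/t), check Cov(W(s), W(t)) = min(s, t) by sampling.
# Here 1/t = 0.5 and 1/s = 2, so we sample B(0.5) directly and then
# B(2) = B(0.5) + an independent N(0, 1.5) increment.
random.seed(1)
s, t, n = 0.5, 2.0, 200_000
prods = []
for _ in range(n):
    b_half = random.gauss(0.0, math.sqrt(0.5))           # B(1/t) = B(0.5)
    b_two = b_half + random.gauss(0.0, math.sqrt(1.5))   # B(1/s) = B(2)
    w_s = s * b_two    # W(0.5) = 0.5 * B(2)
    w_t = t * b_half   # W(2)   = 2 * B(0.5)
    prods.append(w_s * w_t)
cov = statistics.fmean(prods)   # E[W(s)W(t)]; both variables have mean 0
print(cov)  # should be close to min(s, t) = 0.5
```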
EXERCISE 13.6.3. Let B be standard Brownian motion. Given a > 0, let T := sup{t : B(t) ≥ at}. Prove that T is a random variable and that it is finite almost surely. Then compute the probability density function of T. (Hint: Use Exercise 13.5.4 and time inversion.)
EXERCISE 13.6.4. Suppose that X1, X2, . . . are i.i.d. random variables with mean zero and variance one. Let Sn = X1 + · · · + Xn. For each ε > 0, let N(ε) be the smallest number such that Sn/n < ε for all n > N(ε). Compute the limiting distribution of ε²N(ε) as ε → 0. (Hint: Use the previous exercise.)
EXERCISE 13.6.5. Prove that with probability one,

lim sup_{t→∞} B(t)/√t = ∞,  lim inf_{t→∞} B(t)/√t = −∞.

13.7. Nowhere differentiability of Brownian motion


We know that Brownian motion is continuous, but is it differentiable? The
following result of Paley, Wiener and Zygmund shows that with probability one,
the Brownian path is nowhere differentiable. Note that this is much stronger than
proving that it is not differentiable at a given point with probability one, since there
are uncountably many points. A minor technical point is that it is not clear that the
set of nowhere differentiable continuous functions is a Borel subset of C[0, ∞).
However, since we only wish to show that it has probability zero, it suffices to
work with the completion of the Borel σ-algebra (see Proposition ??) and prove
that the set of nowhere differentiable continuous functions is a subset of a Borel set
of probability zero.
THEOREM 13.7.1. With probability one, standard Brownian motion is nowhere
differentiable.

PROOF. If we prove that B is nowhere differentiable in [0, 1] with probability one (where differentiability at the endpoints is one-sided), then it follows that with
probability one, B is not differentiable in [k, k + 1] for any positive integer k,
and therefore nowhere differentiable on [0, ∞). So let us prove that B is nowhere
differentiable on [0, 1].
Suppose that in a particular realization, B is differentiable at a point t ∈ [0, 1].
Then
lim sup_{h→0} |B(t + h) − B(t)|/|h| < ∞.
By the boundedness of B in [0, 1], this implies that

sup_{s∈[0,1]\{t}} |B(s) − B(t)|/|s − t| < ∞.
Thus, there is some positive integer M (depending on the realization of B) such
that for all s ∈ [0, 1],
|B(s) − B(t)| ≤ M |s − t|.
Fix M and t as above. Take any n ≥ 4. Then for any 0 ≤ k ≤ n − 1,
|B(k/n) − B((k + 1)/n)| ≤ |B(k/n) − B(t)| + |B(t) − B((k + 1)/n)|
≤ M |(k/n) − t| + M |((k + 1)/n) − t|.
Choose 0 ≤ j ≤ n − 3 such that |j/n − t| ≤ 3/n. (This can be done, for example,
by choosing j such that j/n ≤ t ≤ (j + 1)/n if t ≤ 1 − 3/n, and j = n − 3
otherwise.) Then by the above inequality, the quantities |B(j/n) − B((j + 1)/n)|,
|B((j +1)/n)−B((j +2)/n)| and |B((j +2)/n)−B((j +3)/n)| are all bounded
above by 11M/n.
For any positive integers M ≥ 1, n ≥ 4, and 0 ≤ j ≤ n − 3, let EM,n,j be
the event described in the previous sentence. Thus, we have shown that if B has a
point of differentiability in [0, 1], then there is some positive integer M such that
for all n ≥ 4, there is some 0 ≤ j ≤ n − 3 for which the event EM,n,j happens. In
other words,
{B has a point of differentiability in [0, 1]} ⊆ ∪_{M=1}^∞ ∩_{n=4}^∞ ∪_{j=0}^{n−3} E_{M,n,j}. (13.7.1)
The set on the right is clearly measurable. So it suffices to prove that it has probability zero. To see this, note that

P(∪_{M=1}^∞ ∩_{n=4}^∞ ∪_{j=0}^{n−3} E_{M,n,j}) ≤ Σ_{M=1}^∞ P(∩_{n=4}^∞ ∪_{j=0}^{n−3} E_{M,n,j})
  ≤ Σ_{M=1}^∞ inf_{n≥4} P(∪_{j=0}^{n−3} E_{M,n,j})
  ≤ Σ_{M=1}^∞ inf_{n≥4} Σ_{j=0}^{n−3} P(E_{M,n,j}).
By independence of increments and the fact that B((i + 1)/n) − B(i/n) is a
N (0, 1/n) random variable for any i, it is easy to see that for any given M , n, and
j as above,
P(E_{M,n,j}) ≤ Cn^{−3/2},
where C depends only on M (and not on n or j). Thus, for any given M ,
inf_{n≥4} Σ_{j=0}^{n−3} P(E_{M,n,j}) = 0.
This proves that the event on the right in (13.7.1) indeed has probability zero. □
EXERCISE 13.7.2. For any function f : [0, 1] → R, the total variation of f is defined as

V(f) := sup_{n≥1} sup_{0=x0≤x1≤···≤xn=1} Σ_{i=1}^n |f(xi) − f(xi−1)|.

Prove that Brownian motion on the time interval [0, 1] almost surely has infinite total variation.

13.8. The Brownian filtration


Let (Ω, F, P) be a probability space. A continuous-time filtration {Ft }t≥0 is a
collection of sub-σ-algebras of F such that Fs ⊆ Ft whenever s ≤ t. For example,
if B is standard Brownian motion and we let Ft be the σ-algebra generated by
the collection of random variables {B(s)}0≤s≤t , then {Ft }t≥0 is a filtration. The
following exercise points out a subtle but important fact.
EXERCISE 13.8.1. Viewing the restriction of B to [0, t] as a C[0, t]-valued
random variable, prove that the σ-algebra Ft is the same as the σ-algebra generated
by this random variable. Prove that this holds also for t = ∞, that is, the σ-
algebras generated by B and the collection {B(t)}t≥0 are the same. (Hint: Note
that B|[0,t] is the limit of a sequence of polygonal approximations, each of which
is Ft -measurable. Then use Proposition 2.1.14. The converse is easy.)
It turns out that it is much more useful to consider a slight enlargement of Ft ,
defined as

Ft+ := ∩_{s>t} Fs.
Clearly, {Ft+ }t≥0 is also a filtration. This is called the right-continuous filtration
of Brownian motion, or simply the Brownian filtration. The important property of
this filtration is that it is right-continuous, meaning that
Ft+ = ∩_{s>t} Fs+.
To see this, just note that
Ft+ = ∩_{s>t} Fs = ∩_{s>t} ∩_{u>s} Fu = ∩_{s>t} Fs+.

EXAMPLE 13.8.2. The following is an example of an event in Ft+. Take any t and let
τ := inf{s > t : B(s) = B(t)}.
Then the event {τ = t} means that there is some sequence tn strictly decreasing to
t such that B(tn) = B(t) for each n. So for any ε > 0,

{τ = t} = ∩_{n≥1/ε} En,

where En is the event that B(t + u) = B(t) for some u ∈ (0, 1/n]. Note that by the continuity of B,

En = ∪_{k=n}^∞ {B(t + u) = B(t) for some u ∈ [1/k, 1/n]}
   = ∪_{k=n}^∞ ∩_{j=1}^∞ {|B(t + u) − B(t)| < 1/j for some u ∈ [1/k, 1/n] ∩ Q}.

This shows that En ∈ F_{t+1/n}. Thus, for n ≥ 1/ε, En ∈ F_{t+1/n} ⊆ F_{t+ε}, and hence {τ = t} ∈ F_{t+ε}. Since this is true for every ε > 0, we see that {τ = t} ∈ Ft+.
To see that Ft+ may not be equal to Ft , consider the special case t = 0. Sup-
pose that the probability space on which B is defined is simply C[0, 1] endowed
with its Borel σ-algebra and the Wiener measure, and B is the identity map on this
space. Since B(0) = 0, F0 is the trivial σ-algebra consisting of only Ω and ∅. But
clearly, the event {τ = 0}, as a subset of this space, is neither Ω nor ∅.

13.9. Markov property of Brownian motion


The following result is known as the Markov property of Brownian motion. It
is the precursor of a more powerful property, known as the strong Markov property,
that we will discuss in the next section.
THEOREM 13.9.1. Let B be standard Brownian motion. Take any s ≥ 0.
Define W (t) := B(s + t) − B(s) for t ≥ 0. Then W is a standard Brownian
motion and is independent of the σ-algebra Fs+ .

PROOF. We have already seen that W is a standard Brownian motion in Exercise 13.4.5. So it only remains to prove the independence. We will first prove
that W and Fs are independent. Take any m, n ≥ 1, any 0 ≤ t1 < · · · <
tn < ∞, and any 0 ≤ s1 < · · · < sm ≤ s. By the independence of incre-
ments of Brownian motion, it follows that the random vectors (B(s1 ), . . . , B(sm ))
and (W (t1 ), . . . , W (tn )) are independent. Therefore by Exercise 7.1.11, the σ-
algebras generated by {B(t)}0≤t≤s and {W (t)}t≥0 are independent. But by Exer-
cise 13.8.1, these are the σ-algebras Fs and σ(W ). This proves the independence
of W and Fs .
Next, let us prove the independence of W and Fs+ . Take a sequence {sn }n≥1
strictly decreasing to s. For each n, define Wn (t) = B(sn + t) − B(sn ) for t ≥ 0.

Then by the previous paragraph, Wn and Fsn are independent. Since Fs+ ⊆ Fsn ,
we get that Wn and Fs+ are independent for each n. But Wn → W in C[0, ∞).
Therefore by Proposition 12.3.1, W is independent of Fs+. □
An interesting consequence of the Markov property is Blumenthal’s zero-one
law, which says that any event in F0+ must have probability 0 or 1 with respect to
Wiener measure. Indeed, take any A ∈ F0+ . Then due to the Markov property of
B, the event A is independent of σ(B). On the other hand, since F0+ ⊆ σ(B), A
is an element of σ(B). Therefore A is independent of itself, and hence P(A) must
be 0 or 1.
Blumenthal’s zero-one law is a basic fact about Brownian motion, with a num-
ber of important consequences. Let us work out a few of these. First, let
τ + := inf{t > 0 : B(t) > 0},
τ − := inf{t > 0 : B(t) < 0},
τ := inf{t > 0 : B(t) = 0}.
It is easy to see (just as we did in the previous section) that the event {τ + = 0} is
in F0+. Now for any t > 0,

P(τ+ ≤ t) ≥ P(B(t) > 0) = 1/2.
Since {τ+ = 0} is the decreasing limit of {τ+ ≤ 1/n} as n → ∞, this shows that P(τ+ = 0) ≥ 1/2. Thus, by Blumenthal's zero-one law, P(τ+ = 0) = 1. By
the same argument, P(τ − = 0) = 1. But if τ + = 0 and τ − = 0, then B attains
both positive and negative values in arbitrarily small neighborhoods of 0. By the
continuity of B, this implies that τ = 0. Thus, P(τ = 0) = 1.
Combined with the Markov property, the above result implies the more general
statement that for any given t, inf{s > t : B(s) = B(t)} = t almost surely. Note
that this does not imply that there will not exist any exceptional t for which this
fails in a given realization of B. To see this, note that B(1) ≠ 0 with probability 1.
Thus, if we let T := sup{t ∈ [0, 1] : B(t) = 0}, then T < 1 almost surely. Now
note that inf{t > T : B(t) = B(T )} > 1 > T .
Similarly, inf{s > t : B(s) > B(t)} = t and inf{s > t : B(s) < B(t)} = t
almost surely. As a consequence, we get the following result.
PROPOSITION 13.9.2. With probability one, Brownian motion has no interval
of increase or decrease (that is, no nontrivial interval where Brownian motion is
increasing or decreasing monotonically).

PROOF. For a given interval [a, b], where a < b, we know from the discussion
preceding the proposition that with probability one, B cannot be monotonically in-
creasing or decreasing in [a, b]. Therefore with probability one, this cannot happen
in any interval with rational endpoints. But if there is a nontrivial interval where B
is monotonically increasing or decreasing, then there is a subinterval with rational
endpoints where this happens. □

EXERCISE 13.9.3. Prove that with probability one, every local maximum of
Brownian motion is a strict local maximum. (Hint: First prove that the values of
the maxima in two given disjoint intervals are different almost surely. Then take
intersection over intervals with rational endpoints.)
EXERCISE 13.9.4. Prove that with probability one, the set of local maxima
of Brownian motion is countable and dense. (Hint: Use Proposition 13.9.2 and
Exercise 13.9.3.)
EXERCISE 13.9.5. For any fixed s > 0, show that the random function V(t) :=
B(s − t) − B(s), defined for 0 ≤ t ≤ s, is a standard Brownian motion on the time
interval [0, s]. (This is called ‘time reversal’.)
EXERCISE 13.9.6. Prove that for any s ∈ [0, 1],

P(max_{0≤t≤s} B(t) = max_{s≤t≤1} B(t)) = 0.

Using this, prove that with probability one, Brownian motion has a unique point of global maximum in the time interval [0, 1].
EXERCISE 13.9.7. Let T be the unique point at which B(t) attains its global maximum in the time interval [0, 1]. Compute the probability distribution function and the probability density function of T. (This is known as the 'arcsine distribution'. Hint: Try to write P(T ≤ s) as a probability involving the maxima of two independent Brownian motions. You will have to apply time reversal and the Markov property.)
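The known answer, the arcsine law P(T ≤ s) = (2/π) arcsin √s, can be illustrated by simulating the argmax of a random walk; the step and sample counts below are illustrative, and the simulation is of course not a substitute for the computation asked for in the exercise:

```python
import math, random

# Simulate the argmax T of Brownian motion on [0, 1] via the argmax of
# a Gaussian random walk, and compare P(T <= s) with the arcsine law
# (2/pi) arcsin(sqrt(s)) at the test point s = 0.25 (value 1/3 there).
random.seed(2)
n_steps, n_paths, s = 500, 10_000, 0.25
count = 0
for _ in range(n_paths):
    path_max, argmax, cur = 0.0, 0, 0.0   # S_0 = 0 is a candidate max
    for k in range(1, n_steps + 1):
        cur += random.gauss(0.0, 1.0)
        if cur > path_max:
            path_max, argmax = cur, k
    count += (argmax / n_steps <= s)
emp = count / n_paths
theory = (2 / math.pi) * math.asin(math.sqrt(s))
print(emp, theory)
```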

13.10. Law of the iterated logarithm for Brownian motion


The following theorem is known as the law of the iterated logarithm (LIL) for
Brownian motion. It gives a precise measure of the maximum fluctuations of B(t)
as t → ∞ or t → 0. The proof makes several crucial uses of the Markov property.
THEOREM 13.10.1. Let B be standard Brownian motion. Then with probability one,

lim sup_{t→∞} B(t)/√(2t log log t) = 1,  lim inf_{t→∞} B(t)/√(2t log log t) = −1.

The same statements hold if we replace t → ∞ by t → 0, and log log t by log log(1/t).

PROOF. Using time inversion, it is not difficult to check that it suffices to prove the result only for t → ∞. Also, the lim inf result follows by changing B to −B, so it suffices to prove only for lim sup. Let h(t) := √(2t log log t). Take any δ > 0. Let tn := (1 + δ)^n. Let An be the event

sup_{s∈[tn,tn+1]} B(s)/h(s) > 1 + δ.

Let M(t) := max_{0≤s≤t} B(s). Note that h(t) is an increasing function of t if t is sufficiently large. So if n is sufficiently large and An happens, then there is some
s ∈ [tn, tn+1] such that

1 + δ < B(s)/h(s) ≤ B(s)/h(tn),
which implies that M (tn+1 ) ≥ (1 + δ)h(tn ). But we know the exact distribution
of M(tn+1). Therefore, by Exercise 6.3.5,

P(An) ≤ P(M(tn+1) ≥ (1 + δ)h(tn)) ≤ 2e^{−(1+δ)²h(tn)²/(2tn+1)} = 2e^{−(1+δ) log log tn} = 2(n log(1 + δ))^{−(1+δ)}.


Since this upper bound is summable in n, the first Borel–Cantelli lemma tells us
that with probability one, the complement An^c happens for all sufficiently large n. But that implies

lim sup_{t→∞} B(t)/h(t) ≤ 1 + δ a.s.
Since this holds for all δ > 0, we can take a countable sequence of δ’s tending to
zero and conclude that
lim sup_{t→∞} B(t)/h(t) ≤ 1 a.s. (13.10.1)

Next, take any δ ∈ (0, 1) and let sn := δ^{−n}. Let Xn := B(sn) − B(sn−1) and let En be the event Xn ≥ (1 − δ)h(sn). Let

an := (1 − δ)h(sn)/√(sn − sn−1) = √(2(1 − δ) log log(δ^{−n})). (13.10.2)
Then by Exercise 6.3.6,

P(En) ≥ ((1/an) − (1/an³)) · e^{−an²/2}/√(2π).

From the expression (13.10.2) for an, it follows that Σ_{n=1}^∞ P(En) = ∞. Since
the random variables X1 , X2 , . . . are independent, so are the events E1 , E2 , . . ..
Therefore by the second Borel–Cantelli lemma, En occurs infinitely often with
probability one. In other words, with probability one,
(B(sn) − B(sn−1))/h(sn) ≥ 1 − δ i.o. (13.10.3)
But by (13.10.1) and the fact that −B is also a standard Brownian motion, we have that

lim sup_{n→∞} −B(sn−1)/h(sn) = lim sup_{n→∞} [−B(sn−1)/h(sn−1)] · [h(sn−1)/h(sn)]
  = √δ lim sup_{n→∞} −B(sn−1)/h(sn−1) ≤ √δ.

Thus, with probability one, −B(sn−1)/h(sn) ≤ 2√δ for all sufficiently large n. Plugging this into (13.10.3), we get that with probability one,

B(sn)/h(sn) ≥ 1 − δ − 2√δ i.o. (13.10.4)

Since δ > 0 is arbitrary, this shows that lim sup_{t→∞} B(t)/h(t) ≥ 1 a.s. □

EXERCISE 13.10.2. Prove that as t → ∞, B(t)/√(2t log log t) → 0 in probability but does not tend to zero almost surely.
By a refinement of the above proof, one can prove the following stronger ver-
sion of the law of the iterated logarithm for Brownian motion. We will use it later
to prove the law of the iterated logarithm for sums of i.i.d. random variables.
THEOREM 13.10.3. Let B be standard Brownian motion. Then with probability one, for all sequences {tn}n≥1 such that tn → ∞ and tn+1/tn → 1,

lim sup_{n→∞} B(tn)/√(2tn log log tn) = 1,  lim inf_{n→∞} B(tn)/√(2tn log log tn) = −1.

The same statements hold if we replace tn → ∞ by tn → 0, and log log tn by log log(1/tn).
(Note that the crucial point in this theorem is that tn is allowed to be random.)
PROOF. As before, it suffices to prove only for lim sup and only for tn → ∞.
Take a realization of B where the conclusion of the LIL is satisfied, and let {tn }n≥1
be any sequence such that tn → ∞ and tn+1 /tn → 1. Then immediately from
Theorem 13.10.1 we have

lim sup_{n→∞} B(tn)/√(2tn log log tn) ≤ lim sup_{t→∞} B(t)/√(2t log log t) = 1.
The opposite inequality requires a bit more work. First, let δ, sn and En be as in
the second half of the proof of Theorem 13.10.1. Let Dn be the event

min_{sn≤t≤sn+1} (B(t) − B(sn)) ≥ −√sn.
By the Markov property of Brownian motion, it is easy to see that the events En and
Dn are independent. Also, by Proposition 13.5.1, it is easy to check that P(Dn ) ≥
c(δ) > 0 for some constant c(δ) that depends on δ but not on n. Therefore from
our previously obtained lower bound on P(En), we get

Σ_{n=1}^∞ P(E_{2n} ∩ D_{2n}) = ∞.
But again, note that En and Dn are determined by the process
Wn (t) := B(sn−1 + t) − B(sn−1 ),
while Ek and Dk are determined by {B(t)}t≤sn−1 for k ≤ n − 2. So by successive
applications of the Markov property, we see that the events {E2n ∩ D2n }n≥1 are
independent. Therefore with probability one, these events occur infinitely often.
Now if En ∩ Dn occurs for some n, then we have

min_{sn≤t≤sn+1} B(t) ≥ B(sn) − √sn.
Combining with (13.10.4), we get that with probability one,

min_{sn≤t≤sn+1} B(t) ≥ (1 − δ − 2√δ)h(sn) − √sn i.o.
Take a realization of B that satisfies the above property for all δ in a sequence
tending to zero. Take any sequence {tn }n≥1 such that tn → ∞ and tn+1 /tn → 1.
For each k, let n = n(k) be the smallest number such that sk ≤ tn < sk+1 . Then
by the above property of the particular realization of B, there exist infinitely many
n such that
B(tn) ≥ (1 − δ − 2√δ)h(sk) − √sk.
As k → ∞, we have n = n(k) → ∞, and so tn /tn−1 → 1. This means, in
particular, that tn /sk → 1 as k → ∞, because otherwise tn−1 would be between
sk and tn infinitely often. Thus,
lim sup_{n→∞} B(tn)/h(tn) ≥ lim sup_{k→∞} B(tn(k))/h(tn(k))
  ≥ lim_{k→∞} [(1 − δ − 2√δ) h(sk)/h(tn(k)) − √sk/h(tn(k))]
  = 1 − δ − 2√δ.

Since this holds for all δ in a sequence tending to zero, this completes the proof of the theorem. □

13.11. Stopping times for Brownian motion


Let B be standard Brownian motion. A [0, ∞]-valued random variable T ,
defined on the same probability space as B, is said to be a stopping time for B if
for every t ∈ [0, ∞), the event {T ≤ t} is in Ft+ . Interestingly, the right-continuity
of the filtration implies that it does not matter whether we put T ≤ t or T < t in
the definition.

EXERCISE 13.11.1. Show that a random variable T is a stopping time for Brownian motion if and only if for every t ≥ 0, the event {T < t} is in Ft+.

Any constant is trivially a stopping time. A slightly less trivial example is the
following.

EXAMPLE 13.11.2. Let

T = inf{t ≥ 0 : B(t) = a},



where a ∈ R is some given number. Then T is a stopping time, because

{T ≤ t} = ∪_{s∈[0,t]} {B(s) = a} = ∩_{n≥1} ∪_{s∈[0,t]∩Q} {|B(s) − a| ≤ 1/n},

where both identities hold because B is continuous. The last event is clearly in
Ft , and therefore in Ft+ . By the continuity of B, it also follows that T can be
alternately defined as inf{t ≥ 0 : B(t) ≥ a} if a ≥ 0 and as inf{t ≥ 0 : B(t) ≤ a}
if a ≤ 0.
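By continuity, {T ≤ 1} = {max_{0≤t≤1} B(t) ≥ a}, and by Proposition 13.5.1 this maximum has the law of |Z|, so P(T ≤ 1) = P(|Z| ≥ a). A Monte Carlo sketch of this (with the illustrative choice a = 1 and a random-walk approximation of B):

```python
import math, random
from statistics import NormalDist

# Estimate P(T <= 1) for T = inf{t : B(t) = a} and compare it with
# P(|Z| >= a), which equals the law-of-the-maximum prediction.
random.seed(3)
n_steps, n_paths, a = 500, 10_000, 1.0
hit = 0
for _ in range(n_paths):
    s = 0.0
    for _ in range(n_steps):
        s += random.gauss(0.0, 1.0 / math.sqrt(n_steps))
        if s >= a:           # by continuity, reaching level a means hitting it
            hit += 1
            break
emp = hit / n_paths
theory = 2 * (1 - NormalDist().cdf(a))   # P(|Z| >= 1)
print(emp, theory)
```

The discretized walk slightly underestimates the continuous maximum, so the empirical value sits a little below the theoretical one.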
EXERCISE 13.11.3. If T and S are stopping times for Brownian motion, prove
that min{S, T }, max{S, T } and S + T are also stopping times.
Associated with any stopping time T is a σ-algebra known as the stopped σ-
algebra of T . It is defined as
FT+ := {A ∈ σ(B) : A ∩ {T ≤ t} ∈ Ft+ for all t ≥ 0}.

EXERCISE 13.11.4. Prove that if we defined FT+ using {T < t} instead of {T ≤ t}, the definition would remain the same.
Intuitively, FT+ contains the record of all information up to the random time T
and infinitesimally beyond. For example, T itself is FT+ -measurable. To see this,
note that for any s, t ≥ 0,
{T ≤ s} ∩ {T ≤ t} = {T ≤ min{s, t}} ∈ F_{min{s,t}}+ ⊆ Ft+.

Thus, {T ≤ s} ∈ FT+ for any s ≥ 0. This proves that T is FT+ -measurable.


An important and often useful fact is the following.
PROPOSITION 13.11.5. If S and T are stopping times for Brownian motion, such that S ≤ T always, then FS+ ⊆ FT+.

PROOF. Take any t ≥ 0 and any A ∈ FS+. Since S ≤ T, we have


A ∩ {T ≤ t} = A ∩ {S ≤ t} ∩ {T ≤ t}.
Since A ∈ FS+ , we have A ∩ {S ≤ t} ∈ Ft+ . Since T is a stopping time,
{T ≤ t} ∈ Ft+ . Thus, A ∩ {T ≤ t} ∈ Ft+ . Since this holds for every t, we
conclude that A ∈ FT+. □

The next result is also often useful for technical purposes.


PROPOSITION 13.11.6. Let {Tn}n≥1 be a sequence of stopping times (for the Brownian filtration) decreasing to a stopping time T. Then FT+ = ∩_{n≥1} F_{Tn}+.

PROOF. By Proposition 13.11.5, FT+ ⊆ ∩_{n≥1} F_{Tn}+. To prove the opposite inclusion, take any A ∈ ∩_{n≥1} F_{Tn}+. Then for any t ≥ 0,

A ∩ {T < t} = A ∩ (∪_{n=1}^∞ ∩_{m=n}^∞ {Tm < t}) = ∪_{n=1}^∞ ∩_{m=n}^∞ (A ∩ {Tm < t}).

But A ∩ {Tm < t} ∈ Ft+ for each m. Thus, A ∩ {T < t} ∈ Ft+, and so by Exercise 13.11.4, A ∈ FT+. □
Using the above results, we now prove a very important fact.
PROPOSITION 13.11.7. Let T be a stopping time for Brownian motion. Define the 'stopped process'

U(t) := B(t) if t ≤ T, and U(t) := B(T) if t > T.

Then U is FT+-measurable.

PROOF. For each n ≥ 1, define

Tn := (m + 1)2^{−n} if m2^{−n} ≤ T < (m + 1)2^{−n}, and Tn := ∞ if T = ∞.
We claim that each Tn is a stopping time. To see this, take any t ≥ 0. Let m be such that m2^{−n} ≤ t < (m + 1)2^{−n}. Then

{Tn ≤ t} = {Tn ≤ m2^{−n}} = {T < m2^{−n}} ∈ F_{m2^{−n}}+ ⊆ Ft+.

Moreover, Tn decreases to T as n → ∞. Therefore by Propositions 13.11.5 and 13.11.6, F_{Tn}+ decreases to FT+.
Now let Un be the stopped process for Tn. Note that Tn can take only countably many values. Let t be one of these values, and let U^(t) denote Brownian motion stopped at time t. It is easy to see that U^(t) is Ft+-measurable. Also, {Tn = t} is Ft+-measurable. Therefore, for any Borel subset A of C[0, ∞), the set

{Un ∈ A} ∩ {Tn = t} = {U^(t) ∈ A} ∩ {Tn = t}

is Ft+-measurable. From this, it follows that Un is F_{Tn}+-measurable. Thus, for any n, Um is F_{Tn}+-measurable for all m ≥ n. Now, Um → U in C[0, ∞) as m → ∞. Therefore by Proposition 2.1.14, U is F_{Tn}+-measurable. Since this holds for every n, Proposition 13.11.6 implies that U is FT+-measurable. □
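The dyadic approximation Tn used in the proof is easy to compute explicitly; the following sketch (with an arbitrary illustrative value of T) checks the three properties the argument relies on: Tn > T, Tn is nonincreasing in n, and Tn → T.

```python
import math

# Tn rounds T up to the next dyadic level (m+1)/2^n, where
# m/2^n <= T < (m+1)/2^n; each refinement can only move the
# approximation down, and the error is at most 2^-n.
def dyadic_up(T, n):
    m = math.floor(T * 2 ** n)
    return (m + 1) / 2 ** n

T = 0.3   # an arbitrary illustrative stopping-time value
approx = [dyadic_up(T, n) for n in range(1, 12)]
print(approx)
assert all(a > T for a in approx)                          # Tn always exceeds T
assert all(x >= y for x, y in zip(approx, approx[1:]))     # nonincreasing
assert abs(approx[-1] - T) <= 2 ** -11                     # converges to T
```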
EXERCISE 13.11.8. If T is a stopping time for Brownian motion, prove that
the ‘stopped’ random variable B(T ) is FT+ -measurable.
EXERCISE 13.11.9. Generalize Proposition 13.11.7 and the above exercise to
any continuous process adapted to the Brownian filtration.

13.12. The strong Markov property


The following result is known as the strong Markov property of Brownian
motion.
THEOREM 13.12.1. Let B be standard Brownian motion and let T be a stop-
ping time for B. Define W (t) := B(T + t) − B(T ) for all t ≥ 0 if T < ∞.
If T = ∞, let W be an independent Brownian motion. Then W is a standard
Brownian motion and is independent of the σ-algebra FT+ .

PROOF. First, let us prove the claim under the assumption that T takes values
in a countable set {t1 , t2 , . . .}, which may include ∞. Take any A ∈ FT+ and any
E in the Borel σ-algebra of C[0, ∞). For each n such that tn is finite, define a
process Wn (t) := B(tn + t) − B(tn ). If tn = ∞, let Wn be an independent
Brownian motion. If T = tn , then W = Wn . Thus,
{W ∈ E} ∩ A ∩ {T = tn } = {Wn ∈ E} ∩ A ∩ {T = tn }.
Now note that if tn is finite, then A∩{T = tn } = A∩{T ≤ tn }∩{T = tn }. Since
A ∈ FT+ , A ∩ {T ≤ tn } ∈ Ft+n . Also, {T = tn } = {T ≤ tn } \ {T < tn } ∈ Ft+n
(by Exercise 13.11.1). Thus, A ∩ {T = tn } ∈ Ft+n . On the other hand, by the
Markov property, Wn is a standard Brownian motion and is independent of Ft+n .
Thus,
P({Wn ∈ E} ∩ A ∩ {T = tn }) = P(Wn ∈ E)P(A ∩ {T = tn })
= P(B ∈ E)P(A ∩ {T = tn }).
If tn = ∞, then also the above identity holds. Thus, for any n,
P({W ∈ E} ∩ A ∩ {T = tn }) = P(B ∈ E)P(A ∩ {T = tn }).
Summing over n gives
P({W ∈ E} ∩ A) = P(B ∈ E)P(A).
Taking A = Ω gives P(W ∈ E) = P(B ∈ E). Thus, W is a Brownian motion,
and is independent of FT+ .
Now take a general stopping time T . Let Tn be defined as in the proof of
Proposition 13.11.7. Define Wn (t) := B(Tn + t) − B(Tn ) if Tn < ∞, and let Wn
be an independent Brownian motion if Tn = ∞. Since Tn = ∞ implies T = ∞,
it also implies that Tm = ∞ for all m. Thus, we can take Wn to be the same
independent Brownian motion for every n if T = ∞.
Now, since Tn can take only countably many values, we know by the previ-
ous deduction that Wn is a standard Brownian motion and that Wn and FT+n are
independent. Since Tn ≥ T , Proposition 13.11.5 implies that FT+n ⊇ FT+ . Thus,
Wn is independent of FT+ for each n. But note that Tn → T as n → ∞, and
hence Wn → W in C[0, ∞). Thus W is also a standard Brownian motion, and by
Proposition 12.3.1, W is independent of FT+. □
The strong Markov property can sometimes be used to show that certain ran-
dom variables are not stopping times. The following is an example.

EXAMPLE 13.12.2. Let


T := sup{t ∈ [0, 1] : B(t) = 0}.
Although it is intuitively clear that T is not a stopping time, it is not clear how
to prove it directly from the definition. The strong Markov property of Brownian
motion allows us to give an indirect proof. Since P(B(1) = 0) = 0, the continuity
of B implies that P(T < 1) = 1. Let W (t) := B(T + t) − B(T ) for t ≥ 0. Note
that W does not change sign in the nonempty interval (0, 1 − T ). Thus, W cannot
be Brownian motion. So we conclude that T is not a stopping time.
EXERCISE 13.12.3. Let T := sup{t : B(t) = t}. Prove that T is not a
stopping time.
Another interesting application of the strong Markov property is the reflection
principle for Brownian motion.
PROPOSITION 13.12.4. Let B be standard Brownian motion. Take any a > 0. Let T := inf{t : B(t) = a}. Define a new process B̃ as

B̃(t) := B(t) if t ≤ T, and B̃(t) := 2a − B(t) if t > T.

Then B̃ is also a standard Brownian motion.

PROOF. Define a map φ : C[0, ∞) × C[0, ∞) × [0, ∞) → C[0, ∞) as

φ(f, g, t)(s) := f(s) if s ≤ t, and φ(f, g, t)(s) := f(s) + g(s − t) − g(0) if s > t.
It is easy to see that φ is continuous and hence measurable.
Now let W (t) := B(T + t) − B(T ) for t ≥ 0. Let U be the Brownian motion
B stopped at time T , as defined in Proposition 13.11.7. By Proposition 13.11.7,
U is FT+ -measurable. We have also seen earlier that T is FT+ -measurable. By the
strong Markov property, W is a standard Brownian motion and is independent of
FT+. Finally, recall that −W is also a standard Brownian motion. Thus, the joint law of (U, W, T) is the same as that of (U, −W, T). But B = φ(U, W, T) and B̃ = φ(U, −W, T), where φ is the map defined above. Thus, B̃ and B must have the same law. □
As an application of the reflection principle, let us now calculate the joint law
of (M (t), B(t)), where M (t) = max0≤s≤t B(s). Take any t ≥ 0 and a > 0. Let
T := inf{s : B(s) = a}. Then for any b < a, the event {M(t) ≥ a, B(t) ≤ b} happens if and only if B̃(t) ≥ 2a − b, where B̃ is the reflected process. Thus, for b < a,

P(M(t) ≥ a, B(t) ≤ b) = P(B̃(t) ≥ 2a − b) = P(√t Z ≥ 2a − b),
where Z ∼ N (0, 1). Now note that M (t) > 0 almost surely, and also B(t) <
M (t) almost surely by time reversal and Blumenthal’s zero-one law. Therefore
P((M (t), B(t)) ∈ U ) = 1, where U := {(a, b) : a > 0, −∞ < b < a}. So by

Exercise 7.6.5, (M(t), B(t)) has a probability density function, whose value at a point (a, b) ∈ U is given by

−∂²/∂a∂b P(√t Z ≥ 2a − b) = −∂²/∂a∂b ∫_{(2a−b)/√t}^∞ (1/√(2π)) e^{−x²/2} dx
  = −∂/∂a [(1/√(2πt)) e^{−(2a−b)²/2t}]
  = √(2/π) · ((2a − b)/t^{3/2}) e^{−(2a−b)²/2t}.
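The intermediate identity P(M(t) ≥ a, B(t) ≤ b) = P(√t Z ≥ 2a − b) can be checked by simulation; the parameters t = 1, a = 1, b = 0 below are illustrative:

```python
import math, random
from statistics import NormalDist

# Monte Carlo check of P(M(1) >= a, B(1) <= b) = P(Z >= 2a - b)
# for b < a, with B approximated by a Gaussian random walk.
random.seed(4)
n_steps, n_paths = 500, 10_000
a, b = 1.0, 0.0
count = 0
for _ in range(n_paths):
    s, m = 0.0, 0.0
    for _ in range(n_steps):
        s += random.gauss(0.0, 1.0 / math.sqrt(n_steps))
        m = max(m, s)
    count += (m >= a and s <= b)
emp = count / n_paths
theory = 1 - NormalDist().cdf(2 * a - b)   # P(Z >= 2a - b), with t = 1
print(emp, theory)
```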
EXERCISE 13.12.5. Let B be standard Brownian motion and let M(t) be the maximum of B up to time t. Prove that for any t, M(t) − B(t) has the same law as |B(t)|.

13.13. Multidimensional Brownian motion


Let d be a positive integer. A d-dimensional standard Brownian motion B is
simply a d-tuple (B1 , . . . , Bd ) of i.i.d. standard Brownian motions. We think of
B(t) as the trajectory of a particle moving in Rd , where the coordinates perform
independent Brownian motions. A multidimensional Brownian motion B gener-
ates its own right continuous filtration {Ft+ }t≥0 ; the definition is just the same as
in the unidimensional case. It is not hard to see that Ft+ is the σ-algebra generated by the union of the σ-algebras F_{i,t}+, i = 1, . . . , d, where {F_{i,t}+}t≥0 is the right-continuous filtration of Bi.
A stopping time for multidimensional Brownian motion is a nonnegative ran-
dom variable T such that for each t ≥ 0, {T ≤ t} ∈ Ft+ . The results of Sec-
tion 13.11 remain valid for such stopping times, with identical proofs.
Note that d-dimensional Brownian motion is a random variable taking value
in C[0, ∞)d , which is a Polish space under the product topology. Since Proposi-
tion 12.3.1 works for any Polish space, the proof of the Markov property easily
carries over to the multidimensional setting. For the same reason, the proof of the
strong Markov property also carries over.
In spite of being just a d-tuple of i.i.d. Brownian motions, d-dimensional Brow-
nian motion has many interesting properties that are not shared by one-dimensional
Brownian motion. For one thing, the filtration generated by multidimensional
Brownian motion is much more complex than the filtration in one dimension. So
are the stopping times. The path properties are also richer and more complex. We
will get a glimpse of some of this in later chapters.
CHAPTER 14

Martingales in continuous time

In continuous time, a collection of σ-algebras {Ft}t≥0 is called a filtration if Fs ⊆ Ft whenever s ≤ t. The filtration is called right-continuous if for any t,

Ft = ∩_{s>t} Fs.

A stopping time T for a filtration {Ft }t≥0 is a [0, ∞]-valued random variable such
that for any t ∈ [0, ∞), the event {T ≤ t} is in Ft . A stopping time T defines a
stopped σ-algebra FT as
FT := {A ∈ F : A ∩ {T ≤ t} ∈ Ft for all t ≥ 0}.
If S and T are stopping times such that S ≤ T always, then FS ⊆ FT . The proof
is exactly the same as the proof of Proposition 13.11.5. Similarly, if the filtration
is right-continuous, and {Tn }n≥1 is a sequence of stopping times decreasing to a
stopping time T , then

FT = ∩_{n≥1} FTn .
Again, the proof is exactly identical to the proof of Proposition 13.11.6.
A stopping time is called bounded if it is always bounded above by some
deterministic constant.
A collection of random variables {Xt }t≥0 , called a stochastic process, is said
to be adapted to this filtration if Xt is Ft -measurable for each t. The stochastic
process is called right-continuous if for any sample point ω, the map t 7→ Xt (ω) is
right-continuous.
A stochastic process {Xt }t≥0 adapted to a filtration {Ft }t≥0 is called a mar-
tingale if E|Xt | < ∞ for each t, and for any s ≤ t, E(Xt |Fs ) = Xs a.s.

14.1. Optional stopping theorem in continuous time


The following result extends the optional stopping theorem to continuous time.
T HEOREM 14.1.1 (Optional stopping theorem in continuous time). Suppose
that {Xt }t≥0 is a right-continuous martingale adapted to a right-continuous fil-
tration {Ft }t≥0 . Let S and T be bounded stopping times for this filtration. Then
XS and XT are integrable and E(XT |FS ) = XS a.s.


P ROOF. Let k be an integer such that T ≤ k − 1 always. For each n ≥ 1,


define
Tn := (m + 1)2^{−n} if m2^{−n} ≤ T < (m + 1)2^{−n} , and Tn := ∞ if T = ∞.
Repeating the argument given in the proof of Proposition 13.11.7, we see that each
Tn is a stopping time, Tn ≤ k for any n, and Tn decreases to T as n → ∞.
Now fix n and consider the sequence {Xj2−n }j≥0 . It is easy to verify that
this sequence is a discrete time martingale adapted to the filtration {Fj2−n }j≥0 .
Moreover, we claim that Tn is a stopping time for this filtration, and the stopped
σ-algebra of Tn with respect to this filtration is the same as the stopped σ-algebra
with respect to the original filtration. To prove the first claim, note that for any j,
{Tn = j2−n } = {(j − 1)2−n ≤ T < j2−n } ∈ Fj2−n .
For the second claim, first take any A ∈ F such that A ∩ {Tn = j2−n } ∈ Fj2−n
for all j. Take any t ≥ 0 and find m such that m2−n ≤ t < (m + 1)2−n . Then
A ∩ {Tn ≤ t} = A ∩ {Tn ≤ m2−n } ∈ Fm2−n ⊆ Ft .
Conversely, suppose that A ∩ {Tn ≤ t} ∈ Ft for all t. Then clearly for any j,
A ∩ {Tn = j2−n } = (A ∩ {Tn ≤ j2−n }) \ (A ∩ {Tn ≤ (j − 1)2−n })
∈ Fj2−n .
This proves the second claim. Thus, we can apply the optional stopping theorem
for discrete time martingales and deduce that XTn is integrable and E(Xk |FTn ) =
XTn a.s. Since this holds for every n, and ∩n≥1 FTn = FT , the backwards mar-
tingale convergence theorem implies that XTn → E(Xk |FT ) a.s. and in L1 as
n → ∞. But by the right continuity of {Xt }t≥0 , we know that XTn → XT
everywhere. Thus, XT is integrable and E(Xk |FT ) = XT a.s.
But the same logic implies that XS is integrable and E(Xk |FS ) = XS a.s.
Therefore,
E(XT |FS ) = E(E(Xk |FT )|FS ) = E(Xk |FS ) = XS a.s.
This completes the proof of the theorem. 

14.2. Doob’s Lp inequality in continuous time


We also have a useful extension of Doob’s Lp inequality to right-continuous
martingales.
T HEOREM 14.2.1. Let {Xt }t≥0 be a right-continuous martingale adapted to
a filtration {Ft }t≥0 . Let Mt := max_{0≤s≤t} |Xs |. Then for any t ≥ 0 and p > 1,
E(Mt^p ) ≤ (p/(p − 1))^p E|Xt |^p .

P ROOF. Take any t ≥ 0 and some integer n ≥ 1. Then {Xkt2−n }0≤k≤2n is


a martingale adapted to {Fkt2−n }0≤k≤2n . Therefore by Doob’s Lp inequality for
discrete time martingales,
E( max_{0≤k≤2^n} |X_{kt2^{−n}} |^p ) ≤ (p/(p − 1))^p E|Xt |^p .
As n → ∞, the right side stays fixed, whereas the right-continuity of X ensures
that the random variable inside the expectation on the left side increases to Mt^p .
Therefore the claim follows by the monotone convergence theorem. 
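The p = 2 case of the inequality can be checked numerically for Brownian motion on [0, 1]. The following sketch (step count, sample size, and seed are arbitrary choices of ours) approximates paths by scaled Gaussian random walks and compares E(M_1^2) against 4E(B(1)^2):

```python
import numpy as np

# Monte Carlo check of Doob's inequality with p = 2 for the martingale
# X = standard Brownian motion at t = 1: E(M_1^2) <= 4 E|B(1)|^2.
rng = np.random.default_rng(0)
n_steps, n_paths = 1000, 5000
dB = rng.normal(0.0, np.sqrt(1.0 / n_steps), size=(n_paths, n_steps))
B = np.cumsum(dB, axis=1)                          # B(k/n), k = 1, ..., n
lhs = float(np.mean(np.abs(B).max(axis=1) ** 2))   # estimate of E(M_1^2)
rhs = float(4.0 * np.mean(B[:, -1] ** 2))          # 4 E|B(1)|^2, near 4
print(lhs, rhs)
```

The estimated left side sits strictly below the bound, as the theorem predicts; the discretized maximum slightly underestimates the true running maximum, which only makes the inequality easier to satisfy.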

14.3. Martingales related to Brownian motion


Let B be standard Brownian motion. The Markov property of B gives rise to a
class of continuous martingales adapted to the right continuous Brownian filtration.
E XAMPLE 14.3.1. The simplest example of a martingale related to B is B
itself. That is, E(B(t)|Fs+ ) = B(s) a.s. whenever s ≤ t. To see this, define
W (u) := B(s + u) − B(s) for u ≥ 0. Then W is independent of Fs+ , and
B(t) − B(s) = W (t − s). Thus,
E(B(t)|Fs+ ) = E(B(t) − B(s) + B(s)|Fs+ )
= E(W (t − s)|Fs+ ) + B(s)
= E(W (t − s)) + B(s) a.s.
= B(s).

E XAMPLE 14.3.2. The process B(t)2 − t is another martingale related to B.


To see this, note that
E(B(t)2 − t|Fs+ ) = E[(B(t) − B(s) + B(s))2 |Fs+ ] − t
= E[(B(t) − B(s))2 + 2B(s)(B(t) − B(s)) + B(s)2 |Fs+ ] − t
= E[W (t − s)2 + 2B(s)W (t − s)|Fs+ ] + B(s)2 − t.
By the Markov property, E(W (t−s)2 |Fs+ ) = E(W (t−s)2 ) = t−s a.s. Similarly,
since B(s) is Fs+ -measurable and B(s)W (t − s) ∈ L1 ,
E(B(s)W (t − s)|Fs+ ) = B(s)E(W (t − s)|Fs+ ) = 0 a.s.
Thus, for any 0 ≤ s ≤ t,
E(B(t)2 − t|Fs+ ) = B(s)2 − s a.s.
By a similar procedure, we can find polynomials in B(t) and t of any degree that
are martingales with respect to the Brownian filtration.
E XAMPLE 14.3.3. Another class of examples is given by the exponential mar-
tingales of Brownian motion. For any λ ∈ R, e^{λB(t)−λ²t/2} is a martingale adapted

to the Brownian filtration. Again, this is a simple application of the Markov prop-
erty. Take any s ≤ t and define W (u) := B(s + u) − B(s), as before. Then
E(e^{λB(t)−λ²t/2} |Fs+ ) = e^{λB(s)−λ²t/2} E(e^{λ(B(t)−B(s))} |Fs+ )
= e^{λB(s)−λ²t/2} E(e^{λW (t−s)} |Fs+ )
= e^{λB(s)−λ²t/2} E(e^{λW (t−s)} )
= e^{λB(s)−λ²t/2} e^{λ²(t−s)/2} = e^{λB(s)−λ²s/2} .
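Each of the three martingales above has constant expectation, equal to its value at time 0. Since B(t) ~ N(0, t), this can be checked directly by sampling; the values t = 2 and λ = 0.5 below are arbitrary choices of ours:

```python
import numpy as np

# Check that E B(t) = 0, E(B(t)^2 - t) = 0, and
# E exp(lam*B(t) - lam^2*t/2) = 1, using B(t) ~ N(0, t).
rng = np.random.default_rng(1)
t, lam, n = 2.0, 0.5, 10**6
B = rng.normal(0.0, np.sqrt(t), size=n)
m1 = float(B.mean())                                    # near 0
m2 = float((B**2 - t).mean())                           # near 0
m3 = float(np.exp(lam * B - 0.5 * lam**2 * t).mean())   # near 1
print(m1, m2, m3)
```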

Let us now work out some applications of the martingales obtained above.
E XAMPLE 14.3.4. Take any a < 0 < b. Let T := inf{t : B(t) ∈ / (a, b)}. Then
B(T ) can take only two possible values, namely, a and b. The optional stopping
theorem helps us calculate the respective probabilities, as follows.
Take any t ≥ 0 and let T ∧ t denote the minimum of T and t. This is the
standard notation in this area. By Exercise 13.11.3, T ∧ t is also a stopping time.
Moreover, it is a bounded stopping time. Therefore by the optional stopping theo-
rem for right continuous martingales, E(B(T ∧ t)) = E(B(0)) = 0 for any t.
Now notice that since T ∧ t ≤ T , B(T ∧ t) ∈ [a, b]. Also, letting M (t) :=
max0≤s≤t B(s), we have
P(T = ∞) = lim_{t→∞} P(T ≥ t) ≤ lim_{t→∞} P(M (t) ≤ b) = 0,

where the last identity follows from the fact that M (t) has the same law as the
absolute value of a N (0, t) random variable. Thus, P(T < ∞) = 1, and so
B(T ∧ t) → B(T ) a.s. as t → ∞. Combining all of these observations, we see
that by the dominated convergence theorem,
E(B(T )) = lim_{t→∞} E(B(T ∧ t)) = 0.

On the other hand,


E(B(T )) = aP(B(T ) = a) + bP(B(T ) = b)
= aP(B(T ) = a) + b(1 − P(B(T ) = a)).
Since the above expression must be equal to zero, we get
P(B(T ) = a) = b/(b − a), P(B(T ) = b) = −a/(b − a).

E XAMPLE 14.3.5. Let us now use the martingale B(t)2 − t to compute E(T ).
As before, the optional stopping theorem gives us
E(B(T ∧ t)2 ) = E(T ∧ t)
for every t. We have already observed that B(T ∧t) ∈ [a, b] for all t, and converges
to B(T ) as t → ∞. Thus, by the dominated convergence theorem, the left side
of the above identity approaches E(B(T )2 ) as t → ∞. On the other hand, the
monotone convergence theorem implies that the right side approaches E(T ).
E(T ) = E(B(T )2 ).
But we already know the distribution of B(T ). This gives us
E(T ) = a²P(B(T ) = a) + b²P(B(T ) = b)
= a² · b/(b − a) + b² · (−a)/(b − a) = −ab.
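The two formulas above can be checked against a discrete analogue: for the simple random walk, the same optional stopping arguments give exit probability b/(b − a) at a and expected exit time −ab. A quick simulation sketch (the choice (a, b) = (−2, 3) and the run count are arbitrary):

```python
import numpy as np

# Simple random walk started at 0, stopped on exiting (a, b):
# P(hit a first) should be near b/(b-a) = 0.6, E(T) near -a*b = 6.
rng = np.random.default_rng(2)
a, b, n_runs = -2, 3, 20000
hits_a = 0
total_steps = 0
for _ in range(n_runs):
    s = 0
    while a < s < b:
        s += 1 if rng.random() < 0.5 else -1
        total_steps += 1
    hits_a += (s == a)
p_a = hits_a / n_runs
mean_T = total_steps / n_runs
print(p_a, mean_T)
```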

E XAMPLE 14.3.6. The exponential martingales help us calculate the distribu-


tion of the maximum excess of Brownian motion above a linear boundary. Take
any θ > 0. Let
M := sup_{t≥0} (B(t) − θt).
By the law of large numbers for Brownian motion, it is easy to see that M < ∞
a.s., because otherwise B(t) will have to exceed θt along a sequence of times
tending to infinity. Also, M ≥ 0 since B(t) − θt = 0 at t = 0. We will now use
the exponential martingales to derive the distribution of M .
Let λ := 2θ and let Xt := e^{λB(t)−λ²t/2} , so that Xt is a martingale. Fix any
x > 0. Let T := inf{t ≥ 0 : B(t) ≥ θt + x}. Here we allow T to be infinity,
and that is actually an important matter. Clearly, T is a stopping time, and by the
optional stopping theorem, E(XT ∧t ) = 1 for any t ≥ 0. Now note that for any t,
T ∧ t ≤ T and hence B(T ∧ t) ≤ θ(T ∧ t) + x. Therefore,
0 ≤ XT ∧t ≤ e^{λ(θ(T ∧t)+x)−λ²(T ∧t)/2}
= e^{λx+(λθ−λ²/2)(T ∧t)} = e^{2θx} .
Now, as t → ∞, XT ∧t tends to
XT = e^{λB(T )−λ²T /2} = e^{λ(θT +x)−λ²T /2} = e^{2θx}
if T < ∞. On the other hand, if T = ∞, then
XT ∧t = Xt = e^{2θ(B(t)−θt)}
for all t, and so by the law of large numbers for Brownian motion, XT ∧t → 0 as
t → ∞. Therefore by the dominated convergence theorem,
1 = E( lim_{t→∞} XT ∧t ) = e^{2θx} P(T < ∞).

Thus, P(T < ∞) = e−2θx . Consequently, P(M ≥ x) = e−2θx . That is, M


follows an exponential distribution with rate 2θ.
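A simulation sketch of this conclusion: approximating drifted Brownian paths on a long finite horizon (θ, the step size, and the horizon below are arbitrary choices of ours), the tail P(M ≥ x) should be close to e^{−2θx}, up to a small downward discretization bias in the running maximum:

```python
import numpy as np

# Estimate P(M >= 0.5) for M = sup_t (B(t) - theta*t) with theta = 1;
# the target value is exp(-2*theta*0.5) = exp(-1) ~ 0.368.
rng = np.random.default_rng(3)
theta, dt, n_steps, n_paths = 1.0, 0.01, 5000, 5000
pos = np.zeros(n_paths)        # current value of B(t) - theta*t per path
M = np.zeros(n_paths)          # running supremum (sup includes t = 0)
for _ in range(n_steps):
    pos += rng.normal(-theta * dt, np.sqrt(dt), size=n_paths)
    np.maximum(M, pos, out=M)
p_est = float(np.mean(M >= 0.5))
print(p_est, float(np.exp(-1.0)))
```

The finite horizon is harmless here: after time 50 the drift term dominates and further records are extremely unlikely.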
E XERCISE 14.3.7. Let (Bt )t≥0 be standard Brownian motion starting at zero.
For any b > 0, show that the process Vt = cosh(b|Bt |)e^{−b²t/2} is a martingale
adapted to the right continuous Brownian filtration. As an application, take any
a > 0 and compute the moment generating function of the stopping time τa :=
inf{t : |Bt | = a}.

E XERCISE 14.3.8. Take any a > 0 and let T := inf{t ≥ 0 : |B(t)| = √(t + a)}.
Prove that E(T ) = ∞.
E XERCISE 14.3.9. Let B1 and B2 be independent standard Brownian motions
started at 0. Let {Ft+ }t≥0 denote the right continuous filtration generated by the
pair (B1 , B2 ). Prove that for any λ ∈ R, the process
X(t) := eλB1 (t) cos(λB2 (t))
is a continuous martingale adapted to this filtration. Let
T := inf{t ≥ 0 : B1 (t) = 1}.
Show that T is a stopping time for the above filtration, and then using the above
martingale, compute the characteristic function of B2 (T ). (This identifies the law
of the y coordinate when the x coordinate of a 2D Brownian motion hits 1.)

14.4. Skorokhod’s embedding


Skorokhod’s embedding is a technique for constructing a randomized stopping
time T for Brownian motion such that the stopped random variable B(T ) has the
same law as any given random variable X with mean zero and finite variance.
Take any a < 0 < b. Let Ta,b := inf{t ≥ 0 : B(t) ∈ / (a, b)}. We derived the
law of B(T ) in the previous section. Using that, it follows that for any function
φ : R → R,
E[φ(B(Ta,b ))] = φ(a)P(B(Ta,b ) = a) + φ(b)P(B(Ta,b ) = b)
= (φ(a)b − φ(b)a)/(b − a). (14.4.1)
Using the above formula, and choosing a and b randomly, we will now construct
Skorokhod’s embedding.
T HEOREM 14.4.1. Let X be a random variable with mean zero and finite
variance. Let µ be the law of X. Define a probability measure ν on the set
(−∞, 0) × (0, ∞) ∪ {(0, 0)} using the formulas ν({(0, 0)}) = µ({0}), and for
any Borel set A ⊆ (−∞, 0) × (0, ∞),
ν(A) := (1/c) ∫∫_A (v − u)dµ(u)dµ(v),
where c := ∫_{(0,∞)} v dµ(v). Let (U, V ) be a pair of random variables with joint
law ν, and let B be an independent standard Brownian motion. Let T := TU,V ,
where Ta,b is defined as above. Then B(T ) has the same law as X. Moreover,
E(T ) = E(X 2 ).

P ROOF. It is not obvious that ν is a probability measure, so let us first prove


that. Since E(X) = 0, we have
0 = ∫_R x dµ(x) = ∫_{(0,∞)} v dµ(v) + ∫_{(−∞,0)} u dµ(u).
Thus, the number c can be alternately expressed as ∫_{(−∞,0)} (−u)dµ(u). By Exer-
cise 2.3.5, ν is a measure. To prove that it is a probability measure, we only have
to show that ν((−∞, 0) × (0, ∞) ∪ {(0, 0)}) = 1. By construction, ν({(0, 0)}) =
µ({0}). On the other hand, by Fubini’s theorem,
ν((−∞, 0) × (0, ∞)) = (1/c) ∫_{(−∞,0)} ∫_{(0,∞)} (v − u)dµ(v)dµ(u)
= (1/c) ∫_{(−∞,0)} ∫_{(0,∞)} v dµ(v)dµ(u) − (1/c) ∫_{(−∞,0)} ∫_{(0,∞)} u dµ(v)dµ(u)
= ∫_{(−∞,0)} dµ(u) + ∫_{(0,∞)} dµ(v) = 1 − µ({0}).

Thus, ν is a probability measure. Now, since T = inf{t ≥ 0 : B(t) ∈/ (U, V )},
we can consider B(T ) as a function of B, U and V . It is not hard to show that it is a
measurable function. Therefore by Proposition 9.2.1 and equation (14.4.1), for any
bounded continuous function φ : R → R,
 
E[φ(B(T ))|U, V ] = ((φ(U )V − φ(V )U )/(V − U )) 1_{(U,V )≠(0,0)} + φ(0) 1_{(U,V )=(0,0)} .
So, to show that B(T ) has the same law as X, it suffices to prove that the expected
value of the right side equals E(φ(X)). To show this, note that
(1/c) ∫_{(−∞,0)×(0,∞)} ((φ(u)v − φ(v)u)/(v − u)) (v − u)dµ(v)dµ(u)
= (1/c) ∫_{(−∞,0)×(0,∞)} φ(u)v dµ(v)dµ(u) − (1/c) ∫_{(−∞,0)×(0,∞)} φ(v)u dµ(v)dµ(u)
= ∫_{(−∞,0)} φ(u)dµ(u) + ∫_{(0,∞)} φ(v)dµ(v).
(A slight modification of the above argument also shows integrability, which is
left to the reader.) Also, by construction, P((U, V ) = (0, 0)) = P(X = 0).
Thus, B(T ) has the same law as X. To show that E(T ) = E(X 2 ), we again use
Proposition 9.2.1 and the fact that E(Ta,b ) = −ab (derived in the previous section)
to get
E(T |U, V ) = −U V
and hence
E(T ) = −E(U V ) = −(1/c) ∫_{(−∞,0)×(0,∞)} uv(v − u)dµ(v)dµ(u)
= −(1/c) ∫_{(−∞,0)×(0,∞)} uv² dµ(v)dµ(u) + (1/c) ∫_{(−∞,0)×(0,∞)} u²v dµ(v)dµ(u)
= ∫_{(0,∞)} v² dµ(v) + ∫_{(−∞,0)} u² dµ(u) = E(X²).
This completes the proof of the theorem. 
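For a concrete two-point example, take X with P(X = −1) = 2/3 and P(X = 2) = 1/3, so that E(X) = 0 and E(X²) = 2. Here ν is a point mass at (U, V ) = (−1, 2), and T is the exit time of (−1, 2). Since Brownian motion observed at its successive ±1 displacement times is a simple random walk, a walk simulation recovers both the law of B(T ) and E(T ); the sketch below is ours, with arbitrary run count:

```python
import numpy as np

# Skorokhod embedding check for the two-point X above: the embedded
# walk exits {-1, ..., 2} at the same values as B exits (-1, 2), and
# each +-1 displacement of B takes unit expected time, so the mean
# number of walk steps estimates E(T).
rng = np.random.default_rng(4)
n_runs = 20000
hit_two = 0
total_steps = 0
for _ in range(n_runs):
    s = 0
    while -1 < s < 2:
        s += 1 if rng.random() < 0.5 else -1
        total_steps += 1
    hit_two += (s == 2)
p2 = hit_two / n_runs        # estimate of P(B(T) = 2); target 1/3
eT = total_steps / n_runs    # estimate of E(T); target E(X^2) = 2
print(p2, eT)
```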

14.5. Strassen’s coupling


Let X1 , X2 , . . . be a sequence of i.i.d. random variables with mean zero and
variance one. Let S0 := 0 and Sn := ∑_{i=1}^n Xi for n ≥ 1. Strassen’s coupling is a
way of constructing the process {Sn }n≥0 and a standard Brownian motion B on the
same probability space such that with probability one, max_{k≤n} |Sk − B(k)| = o(√n)
as n → ∞. The construction proceeds as follows. Let X be a random variable
with the law of the Xi ’s, and let (U, V ) be as in Skorokhod’s embedding from
the previous section. Let (U1 , V1 ), (U2 , V2 ), . . . be a sequence of i.i.d. random
pairs with the same law as (U, V ). Let B be a standard Brownian motion that is
independent of this sequence. Define a sequence of random variables T1 , T2 , . . . as
follows. Let T0 := 0, T1 := inf{t ≥ 0 : B(t) ∈ / (U1 , V1 )}, and for each n ≥ 1, let
Tn := inf{t ≥ Tn−1 : B(t) − B(Tn−1 ) ∈
/ (Un , Vn )}.
The following theorem establishes the coupling property.
T HEOREM 14.5.1. The sequence {B(Tn )}n≥0 has the same law as the se-
quence {Sn }n≥0 . Moreover, the sequence {Tn − Tn−1 }n≥1 is i.i.d. with mean 1.
We need a small lemma about measurability.
L EMMA 14.5.2. Let Ta,b be as in the previous section. Then for any t ∈ R, the
maps (a, b) 7→ P(B(Ta,b ) ≤ t) and (a, b) 7→ P(Ta,b ≤ t) are measurable.

P ROOF. The value of P(B(Ta,b ) ≤ t) is explicitly given by equation (14.4.1)


with φ(x) = 1{x≤t} , which shows its measurability as a function of (a, b). The
map (a, b) 7→ P(Ta,b ≤ t) is increasing in a and decreasing in b. It is not difficult
to show that any such map is measurable. 

P ROOF OF T HEOREM 14.5.1. For each n, let Yn := B(Tn ) − B(Tn−1 ). We


will show that Y1 , Y2 , . . . are i.i.d. with the same law as the Xi ’s. Since the law
of a random vector is characterized by its cumulative distribution function (Ex-
ercise 7.6.1) and the law of an infinite sequence of random variables is uniquely
characterized by its finite dimensional distributions (Theorem 3.3.1), it suffices to
show that for any n and any t1 , . . . , tn ∈ R,
P(X1 ≤ t1 , . . . , Xn ≤ tn ) = P(Y1 ≤ t1 , . . . , Yn ≤ tn ). (14.5.1)

Fix t1 , . . . , tn ∈ R and let φi (x) := 1{x≤ti } for each i. Suppose that instead of
random (Ui , Vi ), we had deterministic (ui , vi ) and defined Tn and Yn as above
with these ui ’s and vi ’s. Then each Tn would be a stopping time for the Brownian
filtration, and 0 = T0 ≤ T1 ≤ T2 ≤ · · · . Let Wn (t) := B(Tn + t) − B(Tn ).
Then by the strong Markov property, each Wn is a standard Brownian motion, and
is independent of FT+n . Also, note that for each n,

Tn − Tn−1 = inf{t ≥ 0 : Wn−1 (t) ∈/ (un , vn )},

and Y1 , . . . , Yn−1 are FT+n−1 -measurable (because B(Ti ) is FT+i -measurable for all
i and FT+i ⊆ FT+j when i ≤ j). Lastly, note that Yn = Wn−1 (Tn − Tn−1 ). Thus,
E(φ1 (Y1 ) · · · φn (Yn )|FT+n−1 )
= φ1 (Y1 ) · · · φn−1 (Yn−1 )E[φn (Wn−1 (Tn − Tn−1 ))|FT+n−1 ]
= φ1 (Y1 ) · · · φn−1 (Yn−1 )E[φn (Wn−1 (Tn − Tn−1 ))].
By Lemma 14.5.2 and the observations made above, E[φn (Wn−1 (Tn − Tn−1 ))] is a
measurable function of (un , vn ). Let’s call it ψn (un , vn ). Proceeding inductively,
we get
E(φ1 (Y1 ) · · · φn (Yn )) = ψ1 (u1 , v1 ) · · · ψn (un , vn ).
Now consider the case of random (Ui , Vi ), generated as before. Considering the
random variables Y1 , . . . , Yn as functions of B and (U1 , V1 ), . . . , (Un , Vn ), and
applying Proposition 9.2.1 and the above identity, we get
E(φ1 (Y1 ) · · · φn (Yn )) = ∏_{i=1}^n E(ψi (Ui , Vi )).
But by Skorokhod’s embedding theorem and our definition of the Ti ’s,
E(ψi (Ui , Vi )) = E(φi (Xi )).
This proves (14.5.1). So we have shown that (14.5.1) holds for every n, which
completes the proof of the first claim of the theorem. The proof of the second
claim is exactly similar, using Zi := Ti − Ti−1 instead of Yi . 
Strassen’s coupling can also be used to give an alternative proof of Donsker’s
theorem, assuming that we construct Brownian motion by some other means (for
example, by Exercise 13.4.7). This is outlined in the following exercises.
E XERCISE 14.5.3. Let X1 , X2 , . . . be i.i.d. random variables with mean zero
and variance one. Let Sn := ∑_{i=1}^n Xi . Using Strassen’s coupling, prove that it
is possible to construct {Sn }n≥1 and a standard Brownian motion B on the same
probability space such that n−1/2 max0≤k≤n |Sk − B(k)| → 0 in probability as
n → ∞. (Hint: First, show that n−1 max0≤k≤n |Tk − k| → 0 a.s. This is
a simple consequence of the fact that Tn /n → 1 a.s. Then use this to control
max0≤k≤n |B(Tk ) − B(k)|.)
E XERCISE 14.5.4. Let Bn be the piecewise linear path from Donsker’s theo-
rem. Using the previous exercise, prove that Bn and a standard Brownian motion
B can be constructed on the same probability space such that for any ε > 0,
lim_{n→∞} P(‖Bn − B‖[0,1] > ε) = 0.
(In other words, Bn → B in probability on C[0, 1].) As a consequence, prove
Donsker’s theorem.
Incidentally, if we have more information about the Xi ’s, such as finiteness
of higher moments or the moment generating function, it is possible to construct
better couplings. For example, under the assumption of finite moment generating

function, the famous Komlós–Major–Tusnády embedding constructs {Sn }n≥0 and


B on the same space such that {(log n)−1 max0≤k≤n |Sk − B(k)|}n≥0 is a tight
family. Proving this is beyond the scope of this discussion.
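A rough numerical sketch of the coupling for Rademacher steps: here each (Ui , Vi ) = (−1, 1), so Tk is the first time the path moves by ±1 from its previously embedded value, and Sk := B(Tk ) is (up to discretization overshoot) a simple random walk. The step size dt and horizon n below are arbitrary choices of ours; the output illustrates that max_{k≤n} |Sk − B(k)| is small compared to √n and that Tn /n is near 1:

```python
import numpy as np

# One fine-grid Brownian path; embedded walk values S_k = B(T_k) are
# recorded when the path has moved by >= 1 from the last embedded
# value, and B is also recorded at integer times for comparison.
rng = np.random.default_rng(5)
dt, n = 1e-3, 200
per_unit = int(round(1.0 / dt))   # fine steps per unit of time
sqdt = np.sqrt(dt)
b = 0.0
S = [0.0]        # embedded walk: S_k = B(T_k)
T = [0.0]        # embedding times T_k
B_int = [0.0]    # B(0), B(1), ..., B(n)
i = 0
while len(B_int) <= n or len(S) <= n:
    b += sqdt * rng.standard_normal()
    i += 1
    if i % per_unit == 0 and len(B_int) <= n:
        B_int.append(b)
    if len(S) <= n and abs(b - S[-1]) >= 1.0:
        S.append(b)
        T.append(i * dt)
dev = float(np.max(np.abs(np.array(S) - np.array(B_int))) / np.sqrt(n))
print(dev, T[n] / n)
```

This is only an illustration on a single path; the theorem is an almost sure asymptotic statement, not a bound at a fixed n.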

14.6. The Hartman–Wintner LIL


Combined with the better version of the law of iterated logarithm for Brownian
motion (Theorem 13.10.3), Strassen’s coupling immediately implies the law of the
iterated logarithm for sums of i.i.d. random variables with finite variance. This is
known as the Hartman–Wintner LIL.
C OROLLARY 14.6.1. Let X1 , X2 , . . . be a sequence of i.i.d. random variables
with mean zero and variance one, and let Sn := ∑_{i=1}^n Xi . Then with probability
one,
lim sup_{n→∞} Sn /√(2n log log n) = 1, lim inf_{n→∞} Sn /√(2n log log n) = −1.

P ROOF. Let Tn be as in Theorem 14.5.1, so that the sequence {B(Tn )}n≥0


has the same law as the sequence {Sn }n≥0 . From Theorem 14.5.1 and the strong
law of large numbers, we get Tn /n → 1 a.s. In particular, Tn+1 /Tn → 1 a.s.
Therefore by Theorem 13.10.3,
lim sup_{n→∞} B(Tn )/√(2n log log n) = lim sup_{n→∞} B(Tn )/√(2Tn log log Tn ) = 1 a.s.
Replacing Sn by −Sn , we get the result for lim inf. 
CHAPTER 15

Introduction to stochastic calculus

This chapter introduces the basics of stochastic calculus and some applications
of these methods. For technical simplicity, we will not attempt to prove the most
general versions of the results, but only as much as needed.

15.1. Continuous stochastic processes


Let (Ω, F, P) be a probability space. A collection of real-valued random vari-
ables {X(t)}t≥0 is called a stochastic process in continuous time. Such a process
is called continuous if for each ω ∈ Ω, the map t 7→ X(t)(ω) is continuous.
We will encounter such processes often in this chapter. A continuous stochastic
process X can also be viewed as a C[0, ∞)-valued random variable by defining
X(ω)(t) := X(t)(ω). To see that X is measurable, simply observe that it is the
pointwise limit (in C[0, ∞)) of a sequence of piecewise linear C[0, ∞)-valued
random variables, and apply Proposition 2.1.14.
A continuous stochastic process X is said to be adapted to a filtration {Ft }t≥0
if for each t, X(t) is Ft -measurable. As above, it is easy to see that the restriction
of X to [0, t] is a C[0, t]-valued random variable.

15.2. The Itô integral


Let (Ω, F, P) be a probability space on which a standard Brownian motion B
is defined. Let {Ft+ }t≥0 be the right continuous filtration of B. Let {X(t)}t≥0 be
a stochastic process adapted to this filtration. Given some t ≥ 0, Itô integration is
a way of giving meaning to the integral
∫_0^t X(s)dB(s)
as a limit of sums like
∑_{i=0}^{n−1} X(si )(B(si+1 ) − B(si ))
where 0 = s0 < s1 < · · · < sn = t, as the mesh size max0≤i≤n−1 |si+1 −
si | tends to zero. Although this is in the spirit of the Riemann–Stieltjes integral
∫_0^t f (s)dg(s) of one function f with respect to another function g, the situation here
is more complicated. For instance, the Riemann–Stieltjes integral is defined only
when g has finite total variation, and by Exercise 13.7.2, we know that Brownian
motion does not have finite total variation. Indeed, the partial sum displayed above
usually does not converge almost surely as the mesh size tends to zero. Itô’s insight,

however, was to realize that convergence takes place in L2 under mild conditions.
We will now carry out this construction.
Fix some t ≥ 0. The first step is to define the Itô integral for simple pro-
cesses. A stochastic process X = {X(s)}s≤t is called simple if there exist some
deterministic M , n and 0 = s0 < s1 < · · · < sn = t such that for each i < n,
X(s) = X(si ) for all s ∈ [si , si+1 ), and |X(si )| ≤ M always. Suppose that
X is adapted to the Brownian filtration, meaning that for each s, X(s) is Fs+ -
measurable. Note that this is equivalent to saying that X(si ) is Fs+i -measurable for
each i. We define
∫_0^t X(s)dB(s) := ∑_{i=0}^{n−1} X(si )(B(si+1 ) − B(si )).
For simplicity, let us denote the integral by I(X). It is not hard to see that this is
well-defined, in the sense that if we take two different representations of X as a
simple process, the value of I(X) would be the same for both.
The Itô integral on simple processes enjoys two important properties. First, it
is linear, meaning that if X and Y are two simple processes, and a and b are real
numbers, then aX + bY is also a simple process, and I(aX + bY ) = aI(X) +
bI(Y ). This is easy to verify from the definition. The second important property is
the Itô isometry property, which says that
E(I(X)²) = ∫_0^t E(X(s)²)ds.
To see this, note that for any 0 ≤ j < i < n, the facts that X is adapted to the
Brownian filtration, X is bounded, and B is a martingale together imply that
E[X(si )(B(si+1 ) − B(si ))X(sj )(B(sj+1 ) − B(sj ))|Fs+i ]
= X(si )X(sj )(B(sj+1 ) − B(sj ))E[B(si+1 ) − B(si )|Fs+i ] = 0 a.s.,
and therefore
E[X(si )(B(si+1 ) − B(si ))X(sj )(B(sj+1 ) − B(sj ))] = 0.
Similarly, by the Markov property of Brownian motion,
E[X(si )2 (B(si+1 ) − B(si ))2 |Fs+i ] = X(si )2 E[(B(si+1 ) − B(si ))2 |Fs+i ]
= X(si )2 (si+1 − si ).
As a consequence of these two observations, we get
E(I(X)²) = ∑_{i=0}^{n−1} E(X(si )²)(si+1 − si ) = ∫_0^t E(X(s)²)ds,
which is the Itô isometry. This is called an isometry because of the following
reason. Consider the product space [0, t] × Ω endowed with the product σ-algebra
and the product measure. We may view X as a measurable map (s, ω) 7→ X(s)(ω)

from this product space into the real line. Then the right side of the Itô isometry is
the squared L2 norm of X, and the left side is the L2 norm of I(X). So the map I
is an isometry from a subspace of L2 ([0, t] × Ω) into L2 (Ω).
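The isometry can be checked numerically for one concrete simple adapted process: take X(s) := ±1 according to the sign of B(si ) on [si , si+1 ), an arbitrary bounded adapted choice of ours with E(X(s)²) = 1, so that both sides of the isometry equal t. A Monte Carlo sketch:

```python
import numpy as np

# Ito integral of the simple adapted process X(s) = sign(B(s_i)) on
# [s_i, s_{i+1}); the isometry predicts E(I(X)^2) = t = 1.
rng = np.random.default_rng(6)
t, n_pts, n_paths = 1.0, 100, 20000
dB = rng.normal(0.0, np.sqrt(t / n_pts), size=(n_paths, n_pts))
B_left = np.cumsum(dB, axis=1) - dB          # B(s_i) at left endpoints
X = np.where(B_left >= 0, 1.0, -1.0)         # adapted, |X| = 1
I = np.sum(X * dB, axis=1)                   # I(X) for each path
lhs = float(np.mean(I ** 2))                 # estimate of E(I(X)^2)
print(lhs)
```

Note that X uses only information available at the left endpoint of each interval; evaluating it at the right endpoint would break adaptedness and the isometry with it.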
We will now extend the definition of I to continuous adapted processes. It is
possible to further extend it to square-integrable processes, but we will refrain from
doing that to avoid technical complications.
Let {X(s)}s≤t be a continuous stochastic process adapted to the Brownian
filtration, meaning that X(s) is Fs+ -measurable for each s, and for each ω ∈ Ω,
the map s 7→ X(s)(ω) is continuous. Suppose that
∫_0^t E(X(s)²)ds < ∞. (15.2.1)
(We will see below that the integral makes sense because the continuity of X im-
plies that the map s 7→ E(X(s)2 ) is measurable.) Under this condition, we want
to define I(X). The key observation is the following.
L EMMA 15.2.1. Let X be as above. Define X(s, ω) := X(s)(ω). Then X ∈
L2 ([0, t] × Ω), and there is a sequence of simple adapted processes {Xn }n≥1 such
that Xn → X in L2 ([0, t] × Ω).

P ROOF. For each n, define


Zn (s, ω) := X(kt/n, ω) if kt/n ≤ s < (k + 1)t/n. (15.2.2)
It is easy to show that each Zn is measurable. Since X is the pointwise limit of Zn
as n → ∞ (due to the continuity of s 7→ X(s, ω) for each ω), Proposition 2.1.14
tells us that X is also measurable. Furthermore, by the assumption (15.2.1) and
Fubini’s theorem, X ∈ L2 ([0, t] × Ω).
Suppose that there is some constant M such that |X(s, ω)| ≤ M for all s and
ω. Let Zn be as above. Then |Zn (s, ω)| ≤ M for all n, s and ω and Zn → X
pointwise. So by the dominated convergence theorem,
lim_{n→∞} ‖Zn − X‖_{L²([0,t]×Ω)} = 0. (15.2.3)
Let us now drop the assumption that X is bounded. For each positive integer M ,
define

YM (s, ω) := M if X(s, ω) > M ; X(s, ω) if −M ≤ X(s, ω) ≤ M ; −M if X(s, ω) < −M. (15.2.4)

By the condition (15.2.1) and the dominated convergence theorem, we get that
lim_{M→∞} ‖YM − X‖_{L²([0,t]×Ω)} = 0.
But YM is a continuous adapted process that is bounded by a constant. Therefore
by our previous deduction, we can find a simple adapted process XM such that
‖YM − XM ‖_{L²([0,t]×Ω)} ≤ 1/M.

The sequence {XM }M ≥1 satisfies the requirement of the lemma. 

We are now ready to define the Itô integral I(X) for any continuous adapted
process X. Take a sequence of simple processes {Xn }n≥1 that converge to X
in L2 ([0, t] × Ω). Then by the Itô isometry and linearity of I, {I(Xn )}n≥1 is a
Cauchy sequence in L2 (Ω). We define I(X) to be its L2 limit. The Itô isom-
etry ensures that this limit does not depend on the choice of the approximating
sequence. Furthermore, since I(X) is constructed as an L2 limit, it is clear that
X 7→ I(X) is linear and the Itô isometry continues to hold. Lastly, note that
I(X) is Ft+ -measurable, since by Lemma 4.4.5, I(Xn ) converges almost surely
to I(X) along a subsequence, and each I(Xn ) is Ft+ -measurable (and then apply
Proposition 2.1.14).
The above procedure defines the integral from 0 to t, for any t. What about
integration from s to t for some 0 ≤ s ≤ t? We can retrace the exact same steps to
define ∫_s^t X(u)dB(u) for any continuous adapted process X satisfying
∫_s^t E(X(u)²)du < ∞.

This integral satisfies the Itô isometry


E[(∫_s^t X(u)dB(u))²] = ∫_s^t E(X(u)²)du.

Moreover, when X satisfies (15.2.1), the above integral also satisfies the natural
identity
∫_0^t X(u)dB(u) = ∫_0^s X(u)dB(u) + ∫_s^t X(u)dB(u) a.s.

All of the above can be proved by first considering simple processes and then pass-
ing to the L2 limit. The details are left to the reader.
E XERCISE 15.2.2. Let X and Y be two continuous adapted processes satis-
fying (15.2.1) for some t ≥ 0. Suppose that ∫_0^t X(s)dB(s) = ∫_0^t Y (s)dB(s) a.s.
Then prove that with probability one, X(s) = Y (s) for all s ∈ [0, t].
E XERCISE 15.2.3. Let X and Y be two continuous adapted processes satisfy-
ing (15.2.1). Prove that for any t ≥ 0, the event
{∫_0^t X(s)dB(s) ≠ ∫_0^t Y (s)dB(s)} ∩ {X(s) = Y (s) for all s ∈ [0, t]}

has probability zero. (Hint: Approximate by simple processes.)


E XERCISE 15.2.4. Let f : [0, t] → R be a (deterministic) continuous function.
Prove that ∫_0^t f (s)dB(s) is a normal random variable with mean zero and variance
∫_0^t f (s)² ds.
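A sketch of this for f (s) = s on [0, 1], where the variance should be ∫_0^1 s² ds = 1/3. The Riemann-sum approximation of the integral (grid size and sample count below are arbitrary choices of ours) is exactly Gaussian, so we just inspect its mean and variance:

```python
import numpy as np

# Wiener integral of f(s) = s over [0, 1], approximated as a sum of
# f(s_i) * (B(s_{i+1}) - B(s_i)) over a uniform grid of left endpoints;
# each sample is Gaussian with mean 0 and variance near 1/3.
rng = np.random.default_rng(7)
n_pts, n_paths = 200, 20000
s = np.arange(n_pts) / n_pts                 # left endpoints s_i
dB = rng.normal(0.0, np.sqrt(1.0 / n_pts), size=(n_paths, n_pts))
I = (s * dB).sum(axis=1)
print(float(I.mean()), float(I.var()))
```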

15.3. The Itô integral as a continuous martingale


Let all notation be as in the previous section. Let X = {X(t)}t≥0 be a con-
tinuous stochastic process adapted to the Brownian filtration, and satisfying the
square-integrability condition (15.2.1) for all t. We now know how to define
Z(t) := ∫_0^t X(s)dB(s)
for any t ≥ 0. However, this definition is only in an almost sure sense, since
the Itô integral is defined as an L2 limit. Can we produce versions of Z(t) in a
way such that with probability one, t 7→ Z(t) is continuous? It turns out that this
is possible. Moreover, this process is a continuous martingale. We will need to
assume that Ft+ is a complete σ-algebra for every t, which can be easily arranged
by replacing it with its completion (see Section 1.8). Henceforth, we will work
under this assumption.
T HEOREM 15.3.1. Let X be as above. Then there exists a continuous martin-
gale {Z(t)}t≥0 adapted to the Brownian filtration such that for each t ≥ 0,
Z(t) = ∫_0^t X(s)dB(s) a.s.
Moreover, if Z′ is another such martingale, then P(Z(t) = Z′(t) for all t) = 1.

P ROOF. Fix some t ≥ 0. Temporarily, denote the restriction of X to the


interval [0, t] also by X. We will first construct a continuous martingale {Z(s)}s≤t
such that for each s ∈ [0, t], Z(s) = ∫_0^s X(u)dB(u) a.s. Later, we will patch these
martingales together to create a single continuous martingale.
By Lemma 15.2.1, find a sequence of simple adapted processes {Xn }n≥1 that
converge to X in L2 ([0, t] × Ω). For each n and each s ≤ t, define
Zn (s) := ∫_0^s Xn (u)dB(u)
using the definition of the Itô integral for simple processes. Then Zn = {Zn (s)}s≤t
is a continuous adapted process. Moreover, using the Markov property of Brownian
motion, it is not difficult to see that Zn is a martingale with respect to the Brownian
filtration. Therefore by Doob’s L2 maximal inequality for continuous martingales,
 
E[ sup_{0≤s≤t} (Zn (s) − Zm (s))² ] ≤ 4E[(Zn (t) − Zm (t))²].

But by the Itô isometry,


E[(Zn (t) − Zm (t))²] = ‖Xn − Xm ‖²_{L²([0,t]×Ω)} .
Since {Xn }n≥1 is a Cauchy sequence in L2 ([0, t] × Ω), this allows us to find a
subsequence {Znk }k≥1 such that for each k,
 
E[ sup_{0≤s≤t} (Znk (s) − Znk+1 (s))² ] ≤ 2^{−k} .

A simple application of the first Borel–Cantelli lemma now shows that with prob-
ability one, the sequence {Znk }k≥1 , viewed as a sequence of random continuous
functions on [0, t], is Cauchy with respect to the supremum norm. Let Z denote
its limit. Being the uniform limit of a sequence of continuous functions, Z is
continuous. Since for each s, the restriction of Xnk to [0, s] converges to the re-
striction of X to [0, s] in L²([0, s] × Ω), we conclude that Z(s) must be a version
of ∫_0^s X(u)dB(u). Moreover, this also shows that Znk (s) → Z(s) in L².
Since each Znk (s) is Fs+ -measurable and Znk (s) → Z(s) a.s., the complete-
ness of Fs+ shows that Z(s) is Fs+ -measurable (see Exercise 2.6.4). Finally, to
prove that Z is a martingale, recall that Zn is a martingale for each n. So for any
0 ≤ u ≤ s ≤ t and any k,
E(Znk (s)|Fu+ ) = Znk (u) a.s.
We know that the right side converges to Z(s) a.s. as k → ∞. For the left side,
recall that Znk (s) → Z(s) in L2 and hence also in L1 . Therefore by Exercise 9.2.2,
E(Znk (s)|Fu+ ) → E(Z(s)|Fu+ ) a.s. as k → ∞. Thus, Z is a martingale.
Using the above procedure we can construct, for each n ≥ 1, a continuous
martingale {Z n (t)}0≤t≤n such that for each t ∈ [0, n], Z n (t) = ∫_0^t X(s)dB(s)
a.s. Take any 1 ≤ m ≤ n. Then with probability one, for each rational t ∈ [0, m],
Z m (t) = Z n (t). But Z m and Z n are continuous processes. So, with probability
one, Z m (t) = Z n (t) for all t ∈ [0, m]. Thus, if we define Z(t) := Z n (t) for n =
⌈t⌉, then {Z(t)}t≥0 is a continuous martingale satisfying Z(t) = ∫_0^t X(s)dB(s)
a.s. for each given t. Uniqueness follows by continuity. 
EXERCISE 15.3.2. Let f : R → R be a continuous function. Let X(t) be the continuous martingale representing ∫_0^t f(s)dB(s). Prove that X is a centered Gaussian process, meaning that the finite dimensional distributions are jointly normal with mean zero. Show that Cov(X(s), X(t)) = ∫_0^{s∧t} f(u)^2 du.

15.4. The quadratic variation process


Let B be a standard Brownian motion and X be a continuous stochastic pro-
cess adapted to the filtration of B, and satisfying the square-integrability condi-
tion (15.2.1) for every t. Let
Z(t) := ∫_0^t X(s)dB(s)
be the continuous martingale obtained in the previous section. The quadratic vari-
ation process of Z is defined as
[Z](t) := ∫_0^t X(s)^2 ds.
By Exercise 15.2.2, [Z] is indeed determined by Z, in the sense that if Z is alter-
natively expressed as the stochastic integral of some other process Y , we will end
up with the same [Z]. Note that [Z] is an adapted, continuous, and non-decreasing
stochastic process. The following is one of the main properties of this process.

THEOREM 15.4.1. Let Z and [Z] be as above. Then the process Z(t)^2 − [Z](t) is a continuous martingale adapted to the Brownian filtration.
We need the following simple lemma.
LEMMA 15.4.2. If {U_n}_{n≥1} is a sequence of real-valued random variables converging in L^2 to U, then U_n^2 → U^2 in L^1.

PROOF. By the Cauchy–Schwarz inequality and the triangle inequality,
E|U_n^2 − U^2| = E|(U_n − U)(U_n + U)|
≤ ‖U_n − U‖_{L^2} ‖U_n + U‖_{L^2}
≤ ‖U_n − U‖^2_{L^2} + 2‖U_n − U‖_{L^2} ‖U‖_{L^2}.
Since U_n → U in L^2, the right side tends to 0 as n → ∞. □

PROOF OF THEOREM 15.4.1. Fix t ≥ 0. Let {X_n}_{n≥1} be a sequence of simple adapted processes converging to X in L^2([0,t] × Ω). Such a sequence exists by Lemma 15.2.1. Let Z_n(s) := ∫_0^s X_n(u)dB(u) for each s, and let [Z_n] be the quadratic variation process of Z_n. It is easy to verify by direct computation that for any s ≤ t and any n,
E((Z_n(t) − Z_n(s))^2 | F_s^+) = E( ∫_s^t X_n(u)^2 du | F_s^+ ).

Since Z_n is a martingale adapted to the Brownian filtration, this shows that
E(Z_n(t)^2 | F_s^+)
= E((Z_n(t) − Z_n(s))^2 + 2Z_n(s)(Z_n(t) − Z_n(s)) + Z_n(s)^2 | F_s^+)
= E( ∫_s^t X_n(u)^2 du | F_s^+ ) + Z_n(s)^2
= E([Z_n](t) − [Z_n](s) | F_s^+) + Z_n(s)^2.
Since [Z_n](s) is F_s^+-measurable, we get
E(Z_n(t)^2 − [Z_n](t) | F_s^+) = Z_n(s)^2 − [Z_n](s).
Now, by Lemma 15.4.2, it follows that Z_n(s)^2 → Z(s)^2 in L^1 and [Z_n](s) → [Z](s) in L^1. Combining this information with the above identity, we get
E(Z(t)^2 − [Z](t) | F_s^+) = Z(s)^2 − [Z](s) a.s.,
which completes the proof. 
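Theorem 15.4.1 can be sanity-checked by simulation. The sketch below (not part of the notes; it assumes NumPy, and the seed and parameters are arbitrary) takes X(s) = B(s), so that Z(t) = ∫_0^t B(s)dB(s) and [Z](t) = ∫_0^t B(s)^2 ds, and verifies by Monte Carlo that E[Z(t)^2 − [Z](t)] ≈ 0 = Z(0)^2 − [Z](0):

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, paths = 1.0, 500, 100_000
dt = t / n

# Simulate many Brownian paths, accumulating the Ito sum Z and
# the Riemann sum QV for the quadratic variation process [Z]
B = np.zeros(paths)
Z = np.zeros(paths)
QV = np.zeros(paths)
for _ in range(n):
    dB = rng.normal(0.0, np.sqrt(dt), paths)
    Z += B * dB        # left-endpoint Ito sum for int_0^t B dB
    QV += B**2 * dt    # Riemann sum for int_0^t B(s)^2 ds
    B += dB

# Z(t)^2 - [Z](t) is a martingale started at 0, so its mean stays 0
mean_diff = (Z**2 - QV).mean()
assert abs(mean_diff) < 0.03
```

For this discretization the identity E[Z^2] = E[QV] holds exactly at every partition point, so the only error is Monte Carlo noise.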

An important corollary of Theorem 15.4.1 is the following result, which allows


us to apply the Itô isometry up to a bounded stopping time. Depending on the
situation, one can then hope to extend it to unbounded stopping times.

C OROLLARY 15.4.3. Let X be a continuous adapted process satisfying the


square-integrability condition (15.2.1). Let S and T be bounded stopping times
for the Brownian filtration, such that S ≤ T always. Then
E[ ( ∫_S^T X(s)dB(s) )^2 − ∫_S^T X(s)^2 ds | F_S^+ ] = 0 a.s.

PROOF. Let Z(t) be the continuous martingale version of ∫_0^t X(s)dB(s). By
Theorem 15.4.1, Z(t)2 − [Z](t) is also a continuous martingale. Therefore by
the optional stopping theorem for continuous martingales, we get E(Z(T )|FS+ ) =
Z(S) and E(Z(T )2 − [Z](T )|FS+ ) = Z(S)2 − [Z](S). The second identity can be
rewritten as
E(Z(T)^2 − Z(S)^2 | F_S^+) = E([Z](T) − [Z](S) | F_S^+) = E( ∫_S^T X(s)^2 ds | F_S^+ ).

Thus,
E( ( ∫_S^T X(s)dB(s) )^2 | F_S^+ ) = E((Z(T) − Z(S))^2 | F_S^+)
= E(Z(T)^2 − 2Z(T)Z(S) + Z(S)^2 | F_S^+)
= E(Z(T)^2 − Z(S)^2 | F_S^+).
Combined with the preceding display, this proves the corollary. 
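Taking S = 0 and expectations in Corollary 15.4.3 gives, in particular, E[(∫_0^T X dB)^2] = E[∫_0^T X^2 ds] for bounded stopping times T. A hedged numerical sketch (assuming NumPy; the choice X = B and the exit level 1 are arbitrary) with T = t ∧ (first exit time of B from [−1, 1]) on a discrete grid:

```python
import numpy as np

rng = np.random.default_rng(1)
t, n, paths = 1.0, 500, 50_000
dt = t / n

B = np.zeros(paths)
Z = np.zeros(paths)    # running int_0^s B dB
QV = np.zeros(paths)   # running int_0^s B(u)^2 du
alive = np.ones(paths, dtype=bool)  # paths that have not yet left [-1, 1]
for _ in range(n):
    dB = rng.normal(0.0, np.sqrt(dt), paths)
    Z[alive] += B[alive] * dB[alive]
    QV[alive] += B[alive] ** 2 * dt
    B[alive] += dB[alive]
    alive &= np.abs(B) < 1.0  # freeze a path once it exits [-1, 1]

# E[(int_0^T B dB)^2] should match E[int_0^T B^2 ds]
assert abs(np.mean(Z**2) - np.mean(QV)) < 0.03
```

The discrete stopped process is itself a martingale, so the two averages agree up to Monte Carlo error.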

15.5. Itô integrals for multidimensional Brownian motion


Let B be d-dimensional standard Brownian motion and let {X(t)}t≥0 be a
real-valued continuous process adapted to the right continuous filtration of B, and
satisfying the condition
∫_0^t E(X(s)^2) ds < ∞
for each t ≥ 0. Then the stochastic integral
∫_0^t X(s)dB_i(s)

is defined for each i and t, and can be represented as a continuous martingale


adapted to the filtration of B. Note that the only difference here is that X is not adapted to the filtration of B_i alone, but to the larger filtration generated by B. The definitions of the integral and the continuous martingale are just as before; all steps go through without any problem because the Markov property is valid for multidimensional Brownian motion, as we observed in Section 13.13.

15.6. Itô’s formula


Itô’s formula is the stochastic analogue of the fundamental theorem of calculus.
In its simplest form, it says the following.
T HEOREM 15.6.1. Let B be standard Brownian motion. Take any twice con-
tinuously differentiable function f : R → R such that for each t ≥ 0,
∫_0^t E(f′(B(s))^2) ds < ∞.
Then with probability one, we have that for all t ≥ 0,
f(B(t)) = f(B(0)) + ∫_0^t f′(B(s))dB(s) + (1/2) ∫_0^t f″(B(s))ds,
where the first integral should be interpreted as the continuous martingale given by Theorem 15.3.1.
Note that if B were a function of finite total variation, we would only have the first integral on the right side. The appearance of the second integral is the main distinctive feature of Itô’s formula. The reason for the appearance of this term is the following lemma.
L EMMA 15.6.2. Let B be standard Brownian motion and g : R → R be a
continuous function. Take any t ≥ 0 and a partition 0 = s0 < s1 < · · · < sn = t.
Then the sum
∑_{i=0}^{n−1} g(B(s_i))(B(s_{i+1}) − B(s_i))^2
converges in probability to ∫_0^t g(B(s))ds as the mesh size max_{0≤i<n} |s_{i+1} − s_i| approaches zero.

P ROOF. Let us first prove the lemma under the assumption that g is bounded,
in addition to being continuous. Since s ↦ g(B(s)) is a continuous map, the sum
∑_{i=0}^{n−1} g(B(s_i))(s_{i+1} − s_i)
converges to ∫_0^t g(B(s))ds as the mesh size approaches zero. So it suffices to prove
that
E[ ( ∑_{i=0}^{n−1} g(B(s_i))((B(s_{i+1}) − B(s_i))^2 − (s_{i+1} − s_i)) )^2 ]   (15.6.1)

tends to zero as the mesh size goes to zero. When we expand the square and take the
expectation inside, the cross terms vanish due to the Markov property of Brownian
motion, which implies that for any j < i,
E((B(s_{i+1}) − B(s_i))^2 | F_{s_i}^+) = s_{i+1} − s_i a.s.

Thus, the expectation displayed in (15.6.1) equals
∑_{i=0}^{n−1} E[ g(B(s_i))^2 ((B(s_{i+1}) − B(s_i))^2 − (s_{i+1} − s_i))^2 ].
There is a constant C such that |g(x)| ≤ C for all x, and B(s_{i+1}) − B(s_i) ∼ N(0, s_{i+1} − s_i), which gives
E[ ((B(s_{i+1}) − B(s_i))^2 − (s_{i+1} − s_i))^2 ] = (s_{i+1} − s_i)^2 E((Z^2 − 1)^2),
where Z ∼ N (0, 1). Since
∑_{i=0}^{n−1} (s_{i+1} − s_i)^2 ≤ max_{0≤i<n} |s_{i+1} − s_i| ∑_{i=0}^{n−1} (s_{i+1} − s_i) = t max_{0≤i<n} |s_{i+1} − s_i|,

this shows that the quantity displayed in (15.6.1) indeed converges to zero as the
mesh size goes to zero.
Next, let us drop the assumption that g is bounded. Take any ε, δ > 0. Find M so large that P(max_{0≤s≤t} |B(s)| > M) < δ/2. Define
h(x) := g(x) if |x| ≤ M,   g(M) if x > M,   g(−M) if x < −M.

Then h is a bounded continuous function, and so
∑_{i=0}^{n−1} h(B(s_i))(B(s_{i+1}) − B(s_i))^2 →^P ∫_0^t h(B(s))ds

as the mesh size tends to zero. On the other hand, note that
P( ∑_{i=0}^{n−1} h(B(s_i))(B(s_{i+1}) − B(s_i))^2 ≠ ∑_{i=0}^{n−1} g(B(s_i))(B(s_{i+1}) − B(s_i))^2 )
≤ P( max_{0≤s≤t} |B(s)| > M ) < δ/2.
Similarly,
P( ∫_0^t h(B(s))ds ≠ ∫_0^t g(B(s))ds ) ≤ P( max_{0≤s≤t} |B(s)| > M ) < δ/2.
Combining the above observations, it follows that
lim sup P( | ∑_{i=0}^{n−1} g(B(s_i))(B(s_{i+1}) − B(s_i))^2 − ∫_0^t g(B(s))ds | > ε ) ≤ δ,
where the lim sup is taken over a sequence of partitions with mesh size tending to zero. Since ε and δ are arbitrary, this completes the proof of the lemma. □
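Lemma 15.6.2 is easy to see in simulation. A quick sketch (not part of the notes; it assumes NumPy, and g = cos is an arbitrary bounded continuous choice): on one fine uniform partition, the weighted quadratic-variation sum is close to the Riemann sum for ∫_0^t g(B(s))ds.

```python
import numpy as np

rng = np.random.default_rng(2)
t, n = 1.0, 200_000
dt = t / n

# One Brownian path on a fine uniform partition of [0, t]
dB = rng.normal(0.0, np.sqrt(dt), n)
B = np.concatenate([[0.0], np.cumsum(dB)])

g = np.cos  # an arbitrary bounded continuous function
# Weighted quadratic-variation sum from the lemma ...
qv_sum = np.sum(g(B[:-1]) * dB**2)
# ... versus the Riemann sum for int_0^t g(B(s)) ds
riemann = np.sum(g(B[:-1])) * dt
assert abs(qv_sum - riemann) < 0.05
```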
We also need the analogous lemma for first-order differences.

LEMMA 15.6.3. Let B be standard Brownian motion and g : R → R be a continuous function such that for each t ≥ 0,
∫_0^t E(g(B(s))^2) ds < ∞.
Take any t ≥ 0 and a partition 0 = s0 < s1 < · · · < sn = t. Then the sum
∑_{i=0}^{n−1} g(B(s_i))(B(s_{i+1}) − B(s_i))
converges in probability to ∫_0^t g(B(s))dB(s) as the mesh size max_{0≤i<n} |s_{i+1} − s_i| approaches zero.

PROOF. The proof is immediate from the definition of the Itô integral when g is bounded and continuous. When g is not bounded, we apply the same truncation trick as in the proof of Lemma 15.6.2, and then apply Exercise 15.2.3 to get the required upper bound on P( ∫_0^t h(B(s))dB(s) ≠ ∫_0^t g(B(s))dB(s) ). □

With the aid of the above lemmas, we are now ready to prove Itô’s formula.

PROOF OF THEOREM 15.6.1. First, fix some t > 0. Take a partition 0 = s_0 < s_1 < · · · < s_n = t. For each δ, M > 0, define
ω(δ, M) := sup_{x,y∈[−M,M], |x−y|≤δ} |f″(x) − f″(y)|.

Then by Taylor approximation,
| f(B(s_{i+1})) − f(B(s_i)) − f′(B(s_i))(B(s_{i+1}) − B(s_i)) − (1/2) f″(B(s_i))(B(s_{i+1}) − B(s_i))^2 |
≤ (1/2) ω(δ_B, M_B)(B(s_{i+1}) − B(s_i))^2,
where M_B := max_{0≤s≤t} |B(s)| and δ_B := max_{0≤i<n} |B(s_{i+1}) − B(s_i)|. Summing over i and applying the triangle inequality, we get
| f(B(t)) − f(B(0)) − ∑_{i=0}^{n−1} f′(B(s_i))(B(s_{i+1}) − B(s_i)) − (1/2) ∑_{i=0}^{n−1} f″(B(s_i))(B(s_{i+1}) − B(s_i))^2 |
≤ (1/2) ω(δ_B, M_B) ∑_{i=0}^{n−1} (B(s_{i+1}) − B(s_i))^2.

Now suppose that we take a sequence of partitions such that the mesh size tends to zero. Then M_B remains fixed, but by the continuity of Brownian motion, δ_B → 0

a.s. Therefore ω(δ_B, M_B) → 0 a.s. By Lemma 15.6.2,
∑_{i=0}^{n−1} f″(B(s_i))(B(s_{i+1}) − B(s_i))^2 →^P ∫_0^t f″(B(s))ds

and
∑_{i=0}^{n−1} (B(s_{i+1}) − B(s_i))^2 →^P t.
Finally, by Lemma 15.6.3,
∑_{i=0}^{n−1} f′(B(s_i))(B(s_{i+1}) − B(s_i)) →^P ∫_0^t f′(B(s))dB(s).

Combining these observations, we have
f(B(t)) = f(B(0)) + ∫_0^t f′(B(s))dB(s) + (1/2) ∫_0^t f″(B(s))ds a.s.

Therefore, with probability one, this equation holds for all rational t. But if we use the continuous martingale version of ∫_0^t f′(B(s))dB(s), then all terms in the above display are continuous functions of t. Thus, with probability one, the equation holds for all t. □

Itô’s formula is commonly used to evaluate stochastic integrals. For example, let us evaluate ∫_0^t B(s)dB(s).
EXAMPLE 15.6.4. To put the problem in the framework of Itô’s formula, we choose a function f such that f′(x) = x. Let’s take f(x) = x^2/2. Then Itô’s formula gives
B(t)^2/2 − B(0)^2/2 = ∫_0^t B(s)dB(s) + (1/2) ∫_0^t ds,
which gives
∫_0^t B(s)dB(s) = (1/2)(B(t)^2 − t).
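The identity of Example 15.6.4 can be checked numerically (an illustrative sketch, not part of the notes, assuming NumPy): the left-endpoint Riemann sum ∑ B(s_i)(B(s_{i+1}) − B(s_i)) should be close to (B(t)^2 − t)/2 on a fine partition.

```python
import numpy as np

rng = np.random.default_rng(3)
t, n = 1.0, 100_000
dt = t / n
dB = rng.normal(0.0, np.sqrt(dt), n)
B = np.concatenate([[0.0], np.cumsum(dB)])

# Left-endpoint (Ito) Riemann sum for int_0^t B dB
ito_sum = np.sum(B[:-1] * dB)
closed_form = 0.5 * (B[-1] ** 2 - t)
assert abs(ito_sum - closed_form) < 0.02
```

Algebraically the sum equals (B(t)^2 − ∑(ΔB_i)^2)/2, so the discrepancy is exactly half the gap between the realized quadratic variation and t.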

Theorem 15.6.1 gives the simplest version of Itô’s formula. There are many
generalizations. One version that suffices for most purposes is the following.
THEOREM 15.6.5. Let d be a positive integer and let B = (B_1, . . . , B_d) be d-dimensional standard Brownian motion. Let f(t, x_1, . . . , x_d) be a function from [0, ∞) × R^d into R which is continuously differentiable in t (where the derivative at 0 is the right derivative) and twice continuously differentiable in (x_1, . . . , x_d). Suppose that for each t ≥ 0 and each i,
∫_0^t E[ ( ∂f/∂x_i (s, B(s)) )^2 ] ds < ∞.

Then with probability one, we have that for all t ≥ 0,
f(t, B(t)) = f(0, B(0)) + ∑_{i=1}^d ∫_0^t ∂f/∂x_i (s, B(s)) dB_i(s)
+ ∫_0^t [ ∂f/∂t (s, B(s)) + (1/2) ∑_{i=1}^d ∂^2f/∂x_i^2 (s, B(s)) ] ds,

where the stochastic integrals are interpreted as continuous martingales.


The proof of this theorem is very similar to that of Theorem 15.6.1, except
that we need the following enhancement of Lemma 15.6.2.
LEMMA 15.6.6. Let B be d-dimensional standard Brownian motion, and g : [0, ∞) × R^d → R be a continuous function. Take any t ≥ 0 and a partition 0 = s_0 < s_1 < · · · < s_n = t. Then for any 1 ≤ j ≠ k ≤ d, as the mesh size max_{0≤i<n} |s_{i+1} − s_i| approaches zero,
∑_{i=0}^{n−1} g(s_i, B(s_i))(B_j(s_{i+1}) − B_j(s_i))^2 →^P ∫_0^t g(s, B(s))ds
and
∑_{i=0}^{n−1} g(s_i, B(s_i))(B_j(s_{i+1}) − B_j(s_i))(B_k(s_{i+1}) − B_k(s_i)) →^P 0.

P ROOF. The proof goes just as for Lemma 15.6.2, by first assuming that g
is bounded and showing convergence in L2 , and then dropping the boundedness
assumption using the same trick as in the proof of Lemma 15.6.2. 

P ROOF OF T HEOREM 15.6.5. The proof is an easy generalization of the proof


of Theorem 15.6.1. As in that proof, we use second order Taylor expansion of f
in the x coordinates and first order expansion in t, and then use Lemma 15.6.6 to
calculate the limits of the second order terms. The mixed second order derivatives
do not appear in the limit because of the second limit in Lemma 15.6.6. 

15.7. Stochastic differential equations


Let B be standard (one-dimensional) Brownian motion. Let Y(t) := e^{B(t)}. Then by Itô’s formula, we have that for any t,
Y(t) − Y(0) = ∫_0^t e^{B(s)}dB(s) + (1/2) ∫_0^t e^{B(s)}ds
= ∫_0^t Y(s)dB(s) + (1/2) ∫_0^t Y(s)ds.

Therefore the process Y satisfies a ‘stochastic integral equation’. The related dif-
ferential equation would be
dY(t)/dt = (1/2) Y(t) + Y(t) dB(t)/dt.
However, Brownian motion is not differentiable. So instead, we write
dY(t) = (1/2) Y(t)dt + Y(t)dB(t).   (15.7.1)
This is known as a ‘stochastic differential equation’ (s.d.e.). The equation is just
shorthand for writing the integral equation displayed above. It should not be inter-
preted as a differential equation. (Note that we wrote dt before dB(t). Although
dt comes after dB(t) in Itô’s formula, it is customary to place the dt term before
the dB(t) term when writing s.d.e.’s.)
Now consider the following generalization of the stochastic differential equa-
tion (15.7.1):
dY (t) = µY (t)dt + σY (t)dB(t), (15.7.2)
with Y (0) = c, where c, µ ∈ R and σ > 0 are some given constants. To solve the
equation, let us try to guess a solution, and then verify that it works. Suppose that
there is some solution of the form X(t) = f (t, B(t)) for some smooth function
f (t, x). Then by Itô’s formula,
dX(t) = [ ∂f/∂t (t, B(t)) + (1/2) ∂^2f/∂x^2 (t, B(t)) ] dt + ∂f/∂x (t, B(t)) dB(t).
Thus, we need f to satisfy
∂f/∂x = σf,   ∂f/∂t + (1/2) ∂^2f/∂x^2 = µf.
From the first equation, we see that f must be of the form A(t)e^{σx} for some function A. Plugging this into the second equation, we get
A′(t) + (1/2) σ^2 A(t) = µA(t).
Solving this, we get A(t) = A(0)e^{(µ − σ^2/2)t}. Thus,
f(t, x) = A(0)e^{(µ − σ^2/2)t + σx}.
The condition X(0) = c implies that A(0) = c. Therefore, our guess for the solution to (15.7.2) is
X(t) = c e^{(µ − σ^2/2)t + σB(t)}.   (15.7.3)
It is easy to check that this is a continuous adapted process that satisfies (15.2.1).
Therefore, we can apply Itô’s formula and check that it is indeed a solution of the
s.d.e. (15.7.2) with initial condition X(0) = c.
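The solution (15.7.3) can also be compared against a naive discretization of the s.d.e. (15.7.2). The sketch below (not part of the notes; it assumes NumPy, and the constants are arbitrary) runs the Euler–Maruyama scheme X_{i+1} = X_i(1 + µΔt + σΔB_i) on a fine grid and checks that it lands near the exact solution driven by the same noise:

```python
import numpy as np

rng = np.random.default_rng(4)
c, mu, sigma = 1.0, 0.5, 0.3  # arbitrary illustrative constants
t, n = 1.0, 50_000
dt = t / n
dB = rng.normal(0.0, np.sqrt(dt), n)

# Euler-Maruyama step for dX = mu X dt + sigma X dB:
#   X_{i+1} = X_i (1 + mu dt + sigma dB_i)
X_em = c * np.prod(1.0 + mu * dt + sigma * dB)

# Exact solution (15.7.3) driven by the same noise
X_exact = c * np.exp((mu - 0.5 * sigma**2) * t + sigma * dB.sum())
assert abs(X_em - X_exact) < 0.01
```

Note that the spurious-looking −σ^2/2 correction in the exponent is exactly what makes the exact solution agree with the discrete scheme as Δt → 0.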

EXERCISE 15.7.1. Find a solution for the stochastic differential equation
dX(t) = −(1/2) X(t)dt + √(1 − X(t)^2) dB(t)
with initial condition X(0) = 0.
So we now know how to use Itô’s formula to produce solutions of stochastic
differential equations. But are these solutions unique? The following theorem gives
a sufficient condition for uniqueness within the class of adapted continuous solu-
tions. It shows, for example, that the solution (15.7.3) of the equation (15.7.2) is
unique. But it has its limitations; for example, it does not apply to Exercise 15.7.1.
T HEOREM 15.7.2. Let B be standard Brownian motion. Let a(t, x) and b(t, x)
be continuous functions from [0, ∞) × R into R satisfying the local Lipschitz con-
dition that for any n, there is a number Ln such that for all t and all x, y ∈ [−n, n],
|a(t, x) − a(t, y)| ≤ Ln |x − y|, |b(t, x) − b(t, y)| ≤ Ln |x − y|.
Let c be a real number. Consider the s.d.e.
dX(t) = a(t, X(t))dt + b(t, X(t))dB(t)
with initial condition X(0) = c. Let X and Y be continuous adapted processes
solving the above equation with the given initial condition. Then with probability
one, X(t) = Y (t) for all t.

PROOF. Fix some n ≥ 1. Let
T_n := inf{t ≥ 0 : |X(t)| > n or |Y(t)| > n}.
Since X and Y are continuous adapted processes, it is not difficult to see that T_n is a stopping time for the Brownian filtration. Let X̃(t) := X(t ∧ T_n) and Ỹ(t) := Y(t ∧ T_n). Since X and Y are solutions of the given s.d.e., we have
X̃(t) = X(t ∧ T_n) = c + ∫_0^{t∧T_n} a(s, X(s))ds + ∫_0^{t∧T_n} b(s, X(s))dB(s),

and a similar identity holds for Ỹ(t). Therefore by the Cauchy–Schwarz inequality and Corollary 15.4.3,
E[(X̃(t) − Ỹ(t))^2] ≤ 2E[ ( ∫_0^{t∧T_n} (a(s, X(s)) − a(s, Y(s)))ds )^2 ]
+ 2E[ ( ∫_0^{t∧T_n} (b(s, X(s)) − b(s, Y(s)))dB(s) )^2 ]
≤ 2tE[ ∫_0^{t∧T_n} (a(s, X(s)) − a(s, Y(s)))^2 ds ]
+ 2E[ ∫_0^{t∧T_n} (b(s, X(s)) − b(s, Y(s)))^2 ds ].

By the local Lipschitz condition on a and b, note that |a(s, X(s)) − a(s, Y(s))| and |b(s, X(s)) − b(s, Y(s))| are both bounded by L_n|X(s) − Y(s)| when s ≤ t ∧ T_n. But if s ≤ t ∧ T_n, then s = s ∧ T_n. Thus, putting K := 2L_n^2 t + 2L_n^2, we have
E[(X̃(t) − Ỹ(t))^2] ≤ KE[ ∫_0^{t∧T_n} (X(s) − Y(s))^2 ds ]
= KE[ ∫_0^{t∧T_n} (X(s ∧ T_n) − Y(s ∧ T_n))^2 ds ]
≤ KE[ ∫_0^t (X(s ∧ T_n) − Y(s ∧ T_n))^2 ds ].

Thus, if we define f(t) := E[(X̃(t) − Ỹ(t))^2], then f satisfies the recursive inequality
f(t) ≤ K ∫_0^t f(s)ds.
Moreover, since K is an increasing function of t, this shows that for all s ≤ t,
f(s) ≤ K ∫_0^t f(u)du.

Also, notice that f(s) ≤ 4n^2 for all s, since |X̃(t) − Ỹ(t)| ≤ 2n. Therefore we can iterate the recursive inequality and get that for any j ≥ 1,
f(t) ≤ K^j ∫_0^t ∫_0^{s_1} ∫_0^{s_2} · · · ∫_0^{s_{j−1}} f(s_j) ds_j ds_{j−1} · · · ds_1
≤ K^j ∫_0^t ∫_0^{s_1} ∫_0^{s_2} · · · ∫_0^{s_{j−1}} 4n^2 ds_j ds_{j−1} · · · ds_1
≤ 4n^2 (Kt)^j / j!.
Letting j → ∞, we get f (t) = 0. Thus, X(t ∧ Tn ) = Y (t ∧ Tn ) a.s. Since X and
Y are continuous processes, it follows that Tn → ∞ a.s. as n → ∞. This shows
that X(t) = Y (t) a.s. Since this holds for any t, the continuities of X and Y imply
that with probability one, X(t) = Y (t) for all t. 
EXERCISE 15.7.3. Solve the stochastic differential equation
dX(t) = −(1/(1 + t)) dt + (1/(1 + t)) dB(t)
with initial condition X(0) = 0, and show that the solution is unique in the class
of adapted continuous solutions.
Theorem 15.7.2 gives a sufficient condition for the uniqueness of a solution.
The condition is mild enough to be widely applicable. But what about existence?
In the special case of equation (15.7.2), we could produce an explicit solution, but
that requires good luck. The following theorem gives a sufficient condition for the
existence of a solution adapted to the Brownian filtration (such solutions are known
as ‘strong solutions’). The condition is similar to that of Theorem 15.7.2, but with

the important difference that now we require a and b to be globally Lipschitz. This is a strong and quite restrictive assumption, but not so different from similar conditions for the existence of solutions of ordinary differential equations.
T HEOREM 15.7.4. Let B be standard Brownian motion. Let a(t, x) and b(t, x)
be continuous functions from [0, ∞) × R into R satisfying the global Lipschitz
conditions
|a(t, x) − a(t, y)| ≤ L|x − y|, |b(t, x) − b(t, y)| ≤ L|x − y|
for all x, y and t, where L is a constant. Let c be a real number. Then there is a
continuous adapted process {X(t)}t≥0 that solves the s.d.e.
dX(t) = a(t, X(t))dt + b(t, X(t))dB(t),
with initial condition X(0) = c.

P ROOF. We will construct a solution by Picard iteration. First, let X0 (t) := c


for all t. Iteratively, define
X_{n+1}(t) := c + ∫_0^t a(s, X_n(s))ds + ∫_0^t b(s, X_n(s))dB(s),
where the stochastic integral on the right should be interpreted as a continuous
martingale. It is easy to show by induction, using the conditions on a and b, that
each Xn is a continuous adapted process satisfying (15.2.1). This shows that the
stochastic integral in the above display is well-defined.
For any t ≥ 0 and n ≥ 1, define
∆_n(t) := E‖X_{n+1} − X_n‖^2_{[0,t]},
where the term inside the expectation on the right is the supremum norm of the
process Xn+1 − Xn on the interval [0, t]. Also define
U_{n+1}(t) := ∫_0^t a(s, X_n(s))ds,   V_{n+1}(t) := ∫_0^t b(s, X_n(s))dB(s).
Now fix some n ≥ 1 and 0 ≤ s ≤ t. The Lipschitz condition on a gives us, for any s′ ≤ s,
(U_{n+1}(s′) − U_n(s′))^2 ≤ ( ∫_0^{s′} (a(u, X_n(u)) − a(u, X_{n−1}(u)))du )^2
≤ s′ ∫_0^{s′} (a(u, X_n(u)) − a(u, X_{n−1}(u)))^2 du
≤ L^2 s′ ∫_0^{s′} (X_n(u) − X_{n−1}(u))^2 du
≤ L^2 t ∫_0^s (X_n(u) − X_{n−1}(u))^2 du.
Thus,
‖U_{n+1} − U_n‖^2_{[0,s]} ≤ L^2 t ∫_0^s (X_n(u) − X_{n−1}(u))^2 du.   (15.7.4)

Next, note that by the Itô isometry and the Lipschitz condition on b,
E[(V_{n+1}(s) − V_n(s))^2] = ∫_0^s E[(b(u, X_n(u)) − b(u, X_{n−1}(u)))^2]du
≤ L^2 ∫_0^s E[(X_n(u) − X_{n−1}(u))^2]du.
But V_{n+1} − V_n is a continuous martingale. So by Doob’s L^2 inequality for continuous martingales (Theorem 14.2.1),
E‖V_{n+1} − V_n‖^2_{[0,s]} ≤ 4E[(V_{n+1}(s) − V_n(s))^2]
≤ 4L^2 ∫_0^s E[(X_n(u) − X_{n−1}(u))^2]du.
Combining this with (15.7.4) and using the inequality (x + y)^2 ≤ 2x^2 + 2y^2, we get
∆_n(s) = E‖X_{n+1} − X_n‖^2_{[0,s]}
≤ 2E‖U_{n+1} − U_n‖^2_{[0,s]} + 2E‖V_{n+1} − V_n‖^2_{[0,s]}
≤ K ∫_0^s E[(X_n(u) − X_{n−1}(u))^2]du
≤ K ∫_0^s ∆_{n−1}(u)du,
where K := 2L^2 t + 8L^2. Let C := ∆_0(t). We claim that
∆_n(s) ≤ C(Ks)^n / n!   (15.7.5)
for each n ≥ 1 and 0 ≤ s ≤ t. This is easily shown by induction. Suppose that
this is true for n − 1. Then the recursive inequality shows that for any s ≤ t,
∆_n(s) ≤ K ∫_0^s ∆_{n−1}(u)du ≤ K ∫_0^s C(Ku)^{n−1}/(n − 1)! du = C(Ks)^n / n!.
Thus, we get
∑_{n=1}^∞ E‖X_{n+1} − X_n‖_{[0,t]} ≤ ∑_{n=1}^∞ √(∆_n(t)) ≤ ∑_{n=1}^∞ √(C(Kt)^n / n!) < ∞.
By the monotone convergence theorem, this shows that
∑_{n=1}^∞ ‖X_{n+1} − X_n‖_{[0,t]} < ∞ a.s.

This, in turn, shows that with probability one, the sequence {Xn }n≥1 is Cauchy
in C[0, t] under the supremum norm. Therefore with probability one, {Xn }n≥1
is Cauchy in C[0, M ] for every integer M . Consequently, {Xn }n≥1 is Cauchy in
C[0, ∞) with probability one. Let X denote the limit. Clearly, X is a continuous
adapted process.
Now, for any given t and n,
∫_0^t E[(X_{n+1}(s) − X_n(s))^2]ds ≤ ∫_0^t ∆_n(s)ds.

Using this, and the bound (15.7.5) on ∆n (s), it is easy to see that {Xn }n≥1 is a
Cauchy sequence in L2 ([0, t]×Ω). By Fatou’s lemma, the limit must necessarily be
equal to X. This shows that X satisfies the square-integrability condition (15.2.1)
for every t. Thus, by the Lipschitz properties of a and b, the following processes
are well-defined, continuous and adapted:
U(t) := ∫_0^t a(s, X(s))ds,   V(t) := ∫_0^t b(s, X(s))dB(s).

To show that X satisfies the required s.d.e., we need to prove that with probability
one, X(t) = c+U (t)+V (t) for all t. Since X, U and V are continuous processes,
it suffices to show that this holds almost surely for any given t. So let us fix some t.
Since Xn (t) = c + Un (t) + Vn (t) for each n, and Xn (t) → X(t) a.s. as n → ∞,
it suffices to show that Un (t) → U (t) and Vn (t) → V (t) in L2 (because then there
will exist subsequences converging a.s.). To prove this, first observe that
E[(U(t) − U_n(t))^2] ≤ t ∫_0^t E[(a(s, X(s)) − a(s, X_{n−1}(s)))^2]ds
≤ L^2 t ‖X − X_{n−1}‖^2_{L^2([0,t]×Ω)}.

But we have already seen above that X_n → X in L^2([0,t] × Ω). Thus, U_n(t) → U(t) in L^2. Next, by the Itô isometry,
E[(V(t) − V_n(t))^2] = ∫_0^t E[(b(s, X(s)) − b(s, X_{n−1}(s)))^2]ds
≤ L^2 ‖X − X_{n−1}‖^2_{L^2([0,t]×Ω)}.

Again, this shows that Vn (t) → V (t) in L2 . Thus, X is indeed a solution to the
required s.d.e. with initial condition X(0) = c. 
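The Picard iteration used in the proof can be imitated numerically. The following sketch (not part of the notes; it assumes NumPy, and the coefficients a(t,x) = −x/2 and b(t,x) = 0.2 sin x are arbitrary globally Lipschitz choices) applies the map X ↦ c + ∫_0^· a(s, X(s))ds + ∫_0^· b(s, X(s))dB(s) on one fixed discretized Brownian path and watches the uniform gap between successive iterates shrink:

```python
import numpy as np

rng = np.random.default_rng(5)
t, n = 0.5, 5_000
dt = t / n
s = np.linspace(0.0, t, n, endpoint=False)  # left endpoints of the grid
dB = rng.normal(0.0, np.sqrt(dt), n)
c = 1.0

def a(u, x):            # drift, globally Lipschitz
    return -0.5 * x

def b(u, x):            # diffusion, globally Lipschitz
    return 0.2 * np.sin(x)

def picard_step(X):
    """One Picard iteration X -> c + int a(s, X)ds + int b(s, X)dB on the grid."""
    drift = np.cumsum(a(s, X[:-1]) * dt)
    noise = np.cumsum(b(s, X[:-1]) * dB)
    out = np.empty_like(X)
    out[0] = c
    out[1:] = c + drift + noise
    return out

X = np.full(n + 1, c)   # X_0(t) = c
gaps = []
for _ in range(15):
    X_new = picard_step(X)
    gaps.append(np.max(np.abs(X_new - X)))
    X = X_new

# The iterates converge uniformly on [0, t]
assert gaps[-1] < 1e-3 < gaps[0]
```

The factorial decay in (15.7.5) is visible here: after a handful of iterations the sup-norm gaps collapse rapidly.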

15.8. Chain rule for stochastic calculus


The Itô formula has a generalization that is often useful for solving stochas-
tic differential equations. Suppose that a continuous adapted process {X(t)}t≥0
satisfies the s.d.e.
dX(t) = a(t)dt + b(t)dB(t),

where {a(t)}t≥0 and {b(t)}t≥0 are continuous adapted processes, with b satisfy-
ing (15.2.1) for each t. Let Y (t) := f (t, X(t)), where f (t, x) is a twice continu-
ously differentiable function. The following result identifies the s.d.e. satisfied by
the process {Y (t)}t≥0 .

THEOREM 15.8.1. Let X and Y be as above. Suppose that for each t ≥ 0,
∫_0^t E[ ( ∂f/∂x (s, X(s)) )^2 b(s)^2 ] ds < ∞.

Then
dY(t) = ∂f/∂t (t, X(t))dt + ∂f/∂x (t, X(t))dX(t) + (1/2) ∂^2f/∂x^2 (t, X(t))(dX(t))^2,
where dX(t) = a(t)dt + b(t)dB(t) and (dX(t))^2 stands for b(t)^2 dt.

PROOF. The proof is almost exactly the same as for Theorem 15.6.1, except that we have to substitute Brownian motion with the process X. A sketch of the proof goes as follows; the details are left to the reader. To deal with potential unboundedness of X, we start by ‘localizing’ X as follows. Choose some large number M, and define

T := inf{t : |X(t)| > M or |a(t)| > M or |b(t)| > M }.

Since X, a and b are adapted continuous processes, T is a stopping time. Define the localized processes X̃(t) := X(t ∧ T) and Ỹ(t) := Y(t ∧ T) = f(t ∧ T, X̃(t)). We will show that with probability one, for any t ≥ 0,
Ỹ(t) − Ỹ(0) = ∫_0^{t∧T} ∂f/∂x (s, X(s)) b(s) dB(s)
+ ∫_0^{t∧T} [ ∂f/∂t (s, X(s)) + ∂f/∂x (s, X(s)) a(s) + (1/2) ∂^2f/∂x^2 (s, X(s)) b(s)^2 ] ds.   (15.8.1)

Suppose that we have shown this. Then let M → ∞. By the continuity of X, a, and b, the stopping time T must tend to infinity. Therefore the left side approaches Y(t) − Y(0) and the right side approaches the sum of the same two integrals, but integrated from 0 to t. This proves the theorem, provided that we can establish equation (15.8.1) for the localized process.
Taking any t ≥ 0 and 0 = s_0 < s_1 < · · · < s_n = t, observe that
X̃(s_{i+1}) − X̃(s_i) = ∫_{s_i∧T}^{s_{i+1}∧T} a(u)du + ∫_{s_i∧T}^{s_{i+1}∧T} b(u)dB(u).   (15.8.2)

So, for any function g(t, x),
∑_{i=0}^{n−1} g(s_i ∧ T, X̃(s_i))(X̃(s_{i+1}) − X̃(s_i))
= ∑_{i=0}^{n−1} [ ∫_{s_i∧T}^{s_{i+1}∧T} g(s_i ∧ T, X̃(s_i))a(u)du + ∫_{s_i∧T}^{s_{i+1}∧T} g(s_i ∧ T, X̃(s_i))b(u)dB(u) ].

If g is continuous, then using the dominated convergence theorem and Corollary 15.4.3, it is not hard to show that the above random variable converges in probability to
∫_0^{t∧T} g(s, X(s))a(s)ds + ∫_0^{t∧T} g(s, X(s))b(s)dB(s)

as the mesh size max_{0≤i<n}(s_{i+1} − s_i) tends to zero. Similarly,
∑_{i=0}^{n−1} g(s_i ∧ T, X̃(s_i))(s_{i+1} ∧ T − s_i ∧ T) →^P ∫_0^{t∧T} g(s, X(s))ds.

This takes care of the first-order terms in (15.8.1). For the second-order term, we proceed as follows. First, note that
(X̃(s_{i+1}) − X̃(s_i))^2 = ( ∫_{s_i∧T}^{s_{i+1}∧T} a(u)du )^2 + ( ∫_{s_i∧T}^{s_{i+1}∧T} b(u)dB(u) )^2
+ 2 ( ∫_{s_i∧T}^{s_{i+1}∧T} a(u)du ) ( ∫_{s_i∧T}^{s_{i+1}∧T} b(u)dB(u) ).   (15.8.3)

By the Cauchy–Schwarz inequality,
( ∫_{s_i∧T}^{s_{i+1}∧T} a(u)du )^2 ≤ (s_{i+1} − s_i) ∫_{s_i∧T}^{s_{i+1}∧T} a(u)^2 du.

Thus,
∑_{i=0}^{n−1} ( ∫_{s_i∧T}^{s_{i+1}∧T} a(u)du )^2 ≤ max_{0≤i<n}(s_{i+1} − s_i) ∫_0^{t∧T} a(u)^2 du
≤ M^2 t max_{0≤i<n}(s_{i+1} − s_i),

which tends to zero as the mesh size goes to zero. Similarly, using the Cauchy–Schwarz inequality and Corollary 15.4.3, we get an upper bound for the L^1 norm of the third term on the right side of (15.8.3), which also tends to zero as the mesh

size goes to zero. Combining, we get
∑_{i=0}^{n−1} g(s_i ∧ T, X̃(s_i)) [ (X̃(s_{i+1}) − X̃(s_i))^2 − ( ∫_{s_i∧T}^{s_{i+1}∧T} b(u)dB(u) )^2 ] →^P 0   (15.8.4)
as the mesh size goes to zero. Now, by Corollary 15.4.3,
E[ ( ∫_{s_i∧T}^{s_{i+1}∧T} b(u)dB(u) )^2 − ∫_{s_i∧T}^{s_{i+1}∧T} b(u)^2 du | F^+_{s_i∧T} ] = 0.
Note that
∑_{i=0}^{n−1} ( ∫_{s_i∧T}^{s_{i+1}∧T} b(u)dB(u) )^4 ≤ max_{0≤i<n} ( ∫_{s_i∧T}^{s_{i+1}∧T} b(u)dB(u) )^2 ∑_{i=0}^{n−1} ( ∫_{s_i∧T}^{s_{i+1}∧T} b(u)dB(u) )^2,
which tends to zero in probability as the mesh size goes to zero, because the ex-
pected value of the sum remains bounded (which implies that it forms a tight fam-
ily) and the maximum goes to zero almost surely by the continuity of the stochastic
integral. Similarly,
∑_{i=0}^{n−1} ( ∫_{s_i∧T}^{s_{i+1}∧T} b(u)^2 du )^2 →^P 0.

With all this information at our disposal, we can now proceed as in the proof of Lemma 15.6.2 to show that
∑_{i=0}^{n−1} g(s_i ∧ T, X̃(s_i)) [ ( ∫_{s_i∧T}^{s_{i+1}∧T} b(u)dB(u) )^2 − ∫_{s_i∧T}^{s_{i+1}∧T} b(u)^2 du ] →^P 0

as the mesh size goes to zero. Combined with (15.8.4), this gives
∑_{i=0}^{n−1} g(s_i ∧ T, X̃(s_i)) [ (X̃(s_{i+1}) − X̃(s_i))^2 − ∫_{s_i∧T}^{s_{i+1}∧T} b(u)^2 du ] →^P 0.

It is now easy to complete the argument. 


E XERCISE 15.8.2. Write down the details for all the steps in the proof of The-
orem 15.8.1.
A simple way to remember the identity (dX(t))^2 = b(t)^2 dt is to memorize the following rules of thumb:
(dt)^2 = 0,   dt·dB(t) = 0,   (dB(t))^2 = dt.
Using these rules and the identity dX(t) = a(t)dt + b(t)dB(t), it is easy to ‘derive’ (dX(t))^2 = b(t)^2 dt. Such rules of thumb are particularly useful if we have a

process adapted to multidimensional Brownian motion. In this case we have an s.d.e. like
dX(t) = a(t)dt + ∑_{i=1}^d b_i(t)dB_i(t).

As above, we have, for each i,
(dt)^2 = 0,   dt·dB_i(t) = 0,   (dB_i(t))^2 = dt.

Additionally, we also have dB_i(t)dB_j(t) = 0 for i ≠ j. These rules give
(dX(t))^2 = ∑_{i=1}^d b_i(t)^2 dt.

E XERCISE 15.8.3. Formulate and prove a version of Theorem 15.8.1 for pro-
cesses adapted to multidimensional Brownian motion.

15.9. The Ornstein–Uhlenbeck process


Consider the stochastic differential equation

dX(t) = −µX(t)dt + σdB(t) (15.9.1)

with initial condition X(0) = c ∈ R, where µ and σ are strictly positive constants. This looks almost the same as the s.d.e. (15.7.2), except that we do not have an X(t) in front of dB(t), and the constant in front of X(t)dt is strictly negative. A simple verification shows that in this problem we cannot find a solution of the form f(t, B(t)), so we have to implement a different plan. Note that by Theorems 15.7.4 and 15.7.2, an adapted continuous solution exists and is unique. So we only have to guess the solution and verify that the guess is valid. Taking a cue from the deterministic case σ = 0, we define Y(t) := e^{µt} X(t). By Theorem 15.8.1,

dY(t) = µe^{µt} X(t)dt + e^{µt} dX(t) = σe^{µt} dB(t).

The initial condition for Y is Y(0) = c. Thus,
Y(t) = c + σ ∫_0^t e^{µs} dB(s),

and therefore,
X(t) = ce^{−µt} + σ ∫_0^t e^{µ(s−t)} dB(s).

Since Y is a continuous adapted process, so is X. It is simple to verify that this is indeed a solution of (15.9.1). The process X is known as an Ornstein–Uhlenbeck process. Clearly, E(X(t)) = ce^{−µt}. By Exercise 15.3.2,
Cov(X(s), X(t)) = σ^2 e^{−µ(s+t)} ∫_0^{s∧t} e^{2µu} du
= (σ^2/2µ) e^{−µ(s+t)} (e^{2µ(s∧t)} − 1)
= (σ^2/2µ) (e^{−µ|s−t|} − e^{−µ(s+t)}).
The formulas for the mean and the covariance completely identify the distribution of the Ornstein–Uhlenbeck process. Note that as t → ∞, X(t) converges in distribution to N(0, σ^2/2µ). The special case µ = 1, σ = √2 is sometimes called the standard Ornstein–Uhlenbeck process. In this case, X(t) → N(0, 1) in distribution as t → ∞.
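The limiting distribution can be checked by simulation. A hedged sketch (not part of the notes; it assumes NumPy, and the initial value c = 2 and horizon t = 5 are arbitrary): sample the Itô–Riemann sum for X(t) = ce^{−µt} + σ∫_0^t e^{µ(s−t)}dB(s) across many independent paths and compare the empirical mean and variance with 0 and σ^2/2µ.

```python
import numpy as np

rng = np.random.default_rng(6)
c, mu, sigma = 2.0, 1.0, np.sqrt(2.0)  # standard OU: mu = 1, sigma = sqrt(2)
t, n, paths = 5.0, 1_000, 50_000
dt = t / n

# Ito-Riemann sum for X(t) = c e^{-mu t} + sigma int_0^t e^{mu(s - t)} dB(s)
stoch = np.zeros(paths)
for i in range(n):
    s = i * dt
    stoch += np.exp(mu * (s - t)) * rng.normal(0.0, np.sqrt(dt), paths)
X_t = c * np.exp(-mu * t) + sigma * stoch

# For large t, X(t) is approximately N(0, sigma^2/(2 mu)) = N(0, 1)
assert abs(X_t.mean()) < 0.05
assert abs(X_t.var() - 1.0) < 0.05
```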
EXERCISE 15.9.1. Let B be standard Brownian motion. Take three numbers c ∈ R, µ > 0 and σ > 0, and define X(t) := ce^{−µt} + σe^{−µt} B((e^{2µt} − 1)/2µ). Show that X has the same distribution as the Ornstein–Uhlenbeck process defined above.
E XERCISE 15.9.2. Let X be an Ornstein–Uhlenbeck process. Find a function
f (t) such that lim supt→∞ X(t)/f (t) and lim inf t→∞ X(t)/f (t) are finite con-
stants almost surely.

15.10. Lévy’s characterization of Brownian motion


Let B be standard Brownian motion. We have seen that B(t) and B(t)^2 − t are
continuous martingales with respect to the right-continuous filtration generated by
B. It turns out that Brownian motion is the only continuous process having these
two properties. This is known as Lévy’s characterization of Brownian motion.
T HEOREM 15.10.1. Let {X(t)}t≥0 be a continuous stochastic process adapted
to a right-continuous filtration {F_t}_{t≥0}, with X(0) = 0. Suppose that X(t) and X(t)^2 − t are martingales adapted to this filtration. Then X is standard Brownian motion.

PROOF. It is not hard to see that it is sufficient to prove that for any 0 ≤ s ≤ t, X(t) − X(s) ∼ N(0, t − s) and X(t) − X(s) is independent of F_s. By Exercise 12.7.6, we have to show that for any A ∈ F_s and any θ ∈ R,
E(e^{iθ(X(t)−X(s))}; A) = P(A) e^{−θ^2(t−s)/2}.   (15.10.1)
Fix some s ≥ 0 and a positive integer M , and let
T := inf{t ≥ s : |X(t)| > M }.
By the continuity of X, T is a stopping time for the filtration {Ft }t≥0 , and T is
always ≥ s. For each t ≥ 0, let
Y(t) := X(t ∧ T),   Z(t) := X(t ∧ T)^2 − (t ∧ T).

Let Gt := Ft∧T . Then by the results of Section 13.11 (which hold for any right-
continuous filtration), {Gt }t≥0 is also a right-continuous filtration. Moreover, by
the optional stopping theorem for continuous martingales, Y (t) and Z(t) are mar-
tingales adapted to Gt . A consequence of the martingale properties of Y (t) and
Z(t) is that for any 0 ≤ u ≤ t,
E((Y (t) − Y (u))2 |Gu ) = E(Y (t)2 − 2Y (t)Y (u) + Y (u)2 |Gu )
= E(Y (t)2 |Gu ) − Y (u)2
= E(Z(t) + t ∧ T |Gu ) − Y (u)2
= Z(u) + E(t ∧ T |Gu ) − Y (u)2
= E(t ∧ T − u ∧ T |Gu ). (15.10.2)
Now fix some t ≥ s, A ∈ Fs , and θ ∈ R. Let f (x) := eiθx . Take a partition
s = s0 < s1 < · · · < sn = t of the interval [s, t]. By the martingale property of Y
and the fact that Fs ⊆ Gu for any u ≥ s,
\[
\sum_{i=0}^{n-1} E\bigl[(Y(s_{i+1}) - Y(s_i))\, f'(Y(s_i) - Y(s)); A\bigr] = 0. \tag{15.10.3}
\]
By (15.10.2),
\[
\begin{aligned}
\sum_{i=0}^{n-1} E\bigl[(Y(s_{i+1}) - Y(s_i))^2 f''(Y(s_i) - Y(s)); A\bigr]
&= \sum_{i=0}^{n-1} E\bigl[(s_{i+1}\wedge T - s_i\wedge T)\, f''(Y(s_i) - Y(s)); A\bigr] \\
&= E\biggl[\sum_{i=0}^{n-1} (s_{i+1}\wedge T - s_i\wedge T)\, f''(Y(s_i) - Y(s)); A\biggr].
\end{aligned}
\]
As the mesh size goes to zero, the sum inside the expectation approaches
\[
\int_s^{t\wedge T} f''(Y(u) - Y(s))\,du.
\]
Also, by the boundedness of f 00 , the sums are uniformly bounded by a constant.
Therefore by the dominated convergence theorem,
\[
\sum_{i=0}^{n-1} E\bigl[(Y(s_{i+1}) - Y(s_i))^2 f''(Y(s_i) - Y(s)); A\bigr]
\to E\biggl[\int_s^{t\wedge T} f''(Y(u) - Y(s))\,du; A\biggr] \tag{15.10.4}
\]
as the mesh size goes to zero.
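The convergence behind (15.10.4) rests on the sum of squared increments over a shrinking partition concentrating around the elapsed time, which can be seen directly for Brownian motion. A small simulation sketch (our own, with a fixed seed and illustrative tolerances):

```python
import math
import random

random.seed(7)  # fixed seed for reproducibility

def qv_sum(s, t, n):
    """Sum of squared increments of Brownian motion over a uniform
    n-step partition of [s, t] (the discrete quadratic variation)."""
    dt = (t - s) / n
    return sum(random.gauss(0.0, math.sqrt(dt))**2 for _ in range(n))

s, t = 1.0, 3.0
for n in (10, 100, 10_000):
    # The sum has mean t - s and variance 2 (t - s)^2 / n, so it
    # concentrates around t - s as the mesh shrinks.
    assert abs(qv_sum(s, t, n) - (t - s)) < 10 * math.sqrt(2) * (t - s) / math.sqrt(n)
```

In the proof, the same concentration is obtained from the martingale identity (15.10.2) together with the fourth-moment bound derived below, since X is not assumed Gaussian.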
Next, let δ := max_{0≤i≤n−1} |Y(s_{i+1}) − Y(s_i)|, and let
\[
\omega(\delta) := \max_{x,y\in\mathbb{R},\ |x-y|\le \delta} |f''(x) - f''(y)|.
\]

Since f''' is uniformly bounded and Y is a continuous process, ω(δ) → 0 as the mesh size goes to zero. Since f'' is uniformly bounded, ω(δ) is bounded by a constant. Therefore by the dominated convergence theorem, as the mesh size goes to zero,
\[
E(\omega(\delta)^2) \to 0. \tag{15.10.5}
\]
Note that by (15.10.2),
\[
E\bigl[(Y(s_{i+1}) - Y(s_i))^2 (Y(s_{j+1}) - Y(s_j))^2\bigr] \le (s_{i+1} - s_i)(s_{j+1} - s_j)
\]
when i ≠ j. Now, if T = s, then Y(t) is unchanging beyond time s, and if T > s, then |Y(t)| ≤ M beyond time s. So in either case, we have that for any i, |Y(s_{i+1}) − Y(s_i)| ≤ 2M. This gives
\[
E\bigl[(Y(s_{i+1}) - Y(s_i))^4\bigr] \le 4M^2\, E\bigl[(Y(s_{i+1}) - Y(s_i))^2\bigr] \le 4M^2 (s_{i+1} - s_i).
\]
Using the last two displays, we get
\[
E\biggl[\biggl(\sum_{i=0}^{n-1} (Y(s_{i+1}) - Y(s_i))^2\biggr)^{2}\biggr] \le 4M^2 (t-s) + (t-s)^2.
\]
In particular, the above expectation remains bounded as the mesh size tends to zero.
Combining this information with (15.10.5) and applying the Cauchy–Schwarz inequality, we get that
\[
E\biggl[\omega(\delta) \sum_{i=0}^{n-1} (Y(s_{i+1}) - Y(s_i))^2\biggr] \to 0 \tag{15.10.6}
\]
as the mesh size tends to zero. Now, by Taylor expansion,
\[
\begin{aligned}
\biggl| f(Y(t) - Y(s)) - f(0) &- \sum_{i=0}^{n-1} f'(Y(s_i) - Y(s))(Y(s_{i+1}) - Y(s_i)) \\
&- \frac{1}{2} \sum_{i=0}^{n-1} f''(Y(s_i) - Y(s))(Y(s_{i+1}) - Y(s_i))^2 \biggr| \\
&\le \frac{1}{2}\, \omega(\delta) \sum_{i=0}^{n-1} (Y(s_{i+1}) - Y(s_i))^2.
\end{aligned}
\]
Using (15.10.3), (15.10.4), (15.10.6), and the above bound, we get
\[
\begin{aligned}
E(f(Y(t) - Y(s)); A) &= P(A) + \frac{1}{2}\, E\biggl[\int_s^{t\wedge T} f''(Y(u) - Y(s))\,du; A\biggr] \\
&= P(A) + \frac{1}{2}\, E\biggl[\int_s^{t\wedge T} f''(X(u) - X(s))\,du; A\biggr],
\end{aligned}
\]
where the last step holds because Y (u) = X(u) when u ≤ T . Now let M → ∞,
so that T → ∞ (because X is a continuous process). By the boundedness of f

and f'' and the dominated convergence theorem, we can take the limits inside the expectations in the above identity and get
\[
E(f(X(t) - X(s)); A) = P(A) + \frac{1}{2}\, E\biggl[\int_s^t f''(X(u) - X(s))\,du; A\biggr].
\]
Let φ(t) := E(f(X(t) − X(s)); A) for t ≥ s. Since f''(x) = −θ²f(x), the above identity yields the following integral equation for φ:
\[
\varphi(t) = P(A) - \frac{\theta^2}{2} \int_s^t \varphi(u)\,du.
\]
The unique solution of this equation is φ(t) = P(A)e^{−θ²(t−s)/2}. (Uniqueness is easily established, for example, by Picard iteration.) This proves (15.10.1), and hence completes the proof of the theorem. □
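The Picard iteration mentioned at the end of the proof can also be carried out numerically. The sketch below (our own; grid size, iteration count, and parameter values are arbitrary choices) iterates the map underlying the integral equation and checks convergence to P(A)e^{−θ²(t−s)/2}:

```python
import math

def picard_solve(p_A, theta, t_grid, n_iter=60):
    """Iterate phi_{k+1}(t) = p_A - (theta^2/2) * int_s^t phi_k(u) du
    on a uniform grid, using the trapezoid rule for the integral."""
    n = len(t_grid)
    h = t_grid[1] - t_grid[0]
    phi = [p_A] * n  # initial guess: the constant function p_A
    for _ in range(n_iter):
        new, integral = [p_A], 0.0
        for i in range(1, n):
            integral += 0.5 * (phi[i - 1] + phi[i]) * h  # trapezoid rule on phi_k
            new.append(p_A - 0.5 * theta**2 * integral)
        phi = new
    return phi

p_A, theta, s, t_end, n = 0.4, 1.1, 0.5, 2.5, 2001
t_grid = [s + i * (t_end - s) / (n - 1) for i in range(n)]
phi = picard_solve(p_A, theta, t_grid)

# The iterates converge to the claimed solution P(A) e^{-theta^2 (t-s)/2}.
err = max(abs(phi[i] - p_A * math.exp(-theta**2 * (t_grid[i] - s) / 2))
          for i in range(n))
assert err < 1e-5
```

The rapid convergence reflects the standard Picard estimate: after k iterations the error is bounded by a term of order (θ²(t − s)/2)^k / k!.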
