STAT733 Notes
Lecture Notes
Hanbaek Lyu
www.hanbaeklyu.com
Contents
Chapter 2. Independence
2.1. Definition of Independence
2.2. Sufficient condition for independence
2.3. Independence, distribution, and expectation
2.4. Sums of independent RVs – Convolution
Bibliography
CHAPTER 1
Many things in life are uncertain. Can we 'measure' and compare such uncertainty so that it helps
us to make more informed decisions? Probability theory provides a systematic way of doing so.
Example 1.1.1 (coin flip). Consider a random experiment of flipping a coin, which comes up heads ‘H ’ or
tails 'T'. Hence our sample space of all possible outcomes is Ω = {H, T}. Next, let 2^Ω = {∅, {H}, {T}, {H, T}}, which consists of all subsets of Ω. For instance, the subset {H} corresponds to the event that a random coin flip comes up heads; the subset {H, T} corresponds to the event that a random coin flip comes up heads or tails; and the subset ∅ corresponds to the event that a random coin flip comes up with nothing. We would like to be able to measure the probability of all such events. Hence we set the σ-algebra (or the set of events) to be 2^Ω. Lastly, for the probability measure, fix a parameter p ∈ [0, 1], and define a function P_p : 2^Ω → [0, 1] by P_p(∅) = 0, P_p({H}) = p, P_p({T}) = 1 − p, P_p({H, T}) = 1. The resulting probability space (Ω = {H, T}, F = 2^Ω, P = P_p) is a probability model for the random experiment of flipping a 'probability-p
coin’. ▲
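To make the example concrete, here is a minimal Python sketch (standard library only) of the finite probability space (Ω = {H, T}, 2^Ω, P_p); the names power_set and P_p are illustrative choices, not from the text.

```python
# The finite probability space (Ω = {H, T}, F = 2^Ω, P_p) as plain data.
from itertools import chain, combinations

def power_set(omega):
    """All subsets (events) of a finite sample space."""
    s = list(omega)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def P_p(event, p):
    """Probability of an event under the probability-p coin: add point masses."""
    masses = {"H": p, "T": 1 - p}
    return sum(masses[outcome] for outcome in event)

omega = {"H", "T"}
for A in power_set(omega):
    print(sorted(A), P_p(A, p=0.3))   # P(∅)=0, P({H})=0.3, P({T})=0.7, P(Ω)=1
```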
Recall that the probability measure P is a certain function that assigns a numerical value between 0
and 1 to every event, which we regard as some sort of ‘size’ of that event. Hence it is natural to require that
the empty set is assigned with zero probability, and that the probability of the union of disjoint events is
the sum of the probabilities of the individual events. This leads us to the following formal definition of
‘measures’.
Note that in Definition 1.1.2, we did not require any particular condition for the collection A of
subsets of Ω. If Ω is countable (i.e., finite or countably infinite), then we usually take A = 2Ω . In general,
if A is too large, then it could be impossible to define a measure on A . The right choice of A on which
a measure can be defined is what is called the ‘σ-algebra’, which we will discuss later.
Exercise 1.1.3. Let (Ω, F , P) be a probability space and let A ⊆ Ω be an event. Show that P(A c ) = 1− P(A).
Example 1.1.4 (Counting measure). Suppose Ω is countable and F = 2Ω . Define a set function γ : 2Ω →
[−∞, ∞] by
γ(A) = |A| (the number of elements in A) if A is finite, and γ(A) = ∞ if A is infinite.
FIGURE 1.1.1. Sample space representation for a roll of two independent fair dice and events of fixed sum of the two dice.
Show that the coefficient of x^k in the right-hand side (x + x² + ··· + x⁶)³ equals the size of the intersection between Ω and the plane x + y + z = k. Conclude that
P(X + Y + Z = 10) = P(X + Y + Z = 11) = 27/6³ = 1/8.
(This way of calculating probabilities is called the generating function method.)
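As a sanity check of the generating function method, the following Python sketch (standard library only; all names are illustrative) expands (x + x² + ··· + x⁶)³ by repeated convolution and compares the coefficient of x^10 with a brute-force count of dice triples.

```python
# Expand (x + x^2 + ... + x^6)^3 by convolution; the coefficient of x^k counts
# the dice triples with x + y + z = k.
def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists (index = degree)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

die = [0] + [1] * 6                      # coefficients of x + x^2 + ... + x^6
gf = poly_mul(poly_mul(die, die), die)   # generating function of X + Y + Z

brute = sum(1 for x in range(1, 7) for y in range(1, 7) for z in range(1, 7)
            if x + y + z == 10)
print(gf[10], brute, gf[10] / 6**3)      # 27 27 0.125 = 1/8
```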
Then one can easily check that P is a probability measure on 2^Ω; the function f is called the probability mass function (PMF) of P. For instance, the PMF of the probability measure P_p in the coin flip example in Example 1.1.1 is given by f(H) = p and f(T) = 1 − p. ▲
Exercise 1.1.9. Show that the function P : 2Ω → [0, 1] defined in (1) is a probability measure on Ω. Con-
versely, show that every probability measure in a discrete probability space can be defined in this way (in
this case, identify the PMF).
Next, we discuss the requirement for a collection of subsets F of the sample space Ω to be considered
as the set of events. In all examples we have seen above, our sample space Ω was countable and we took
the power set 2Ω to be the set of events. However, when Ω is uncountable (e.g., Ω = [0, 1] or Ω = R),
then the power set 2Ω is ‘too large’ to be considered as the set of events. It turns out that F needs to be
a system of subsets of Ω that satisfies certain properties. Namely, it is natural to expect that the union,
intersection, and complements of events are also events. In other words, the set of events F , which
is a system of subsets of Ω, should be ‘closed under’ elementary set operations such as taking union,
intersection, and complements. Furthermore, we would like to make F rich enough so that we can
perform such operations countably many times. This motivates the following definition:
Definition 1.1.10 (σ-algebra). Let Ω be a set and let F be a set of subsets of Ω. We call F a σ-algebra on
Ω if it satisfies the following properties:
(i) (contains the whole set) Ω ∈ F ;
(ii) (closed under complements) A ∈ F =⇒ A c = Ω \ A ∈ F ;
(iii) (closed under countable unions) A_i ∈ F for i ∈ N =⇒ ⋃_{i=1}^∞ A_i ∈ F.
Note that a σ-algebra contains the empty set ∅, being the complement of the whole set Ω. While condition (iii) only states that the union of a countably infinite collection of events is again an event, since the A_i's can be taken to be the empty set, it also entails closure under finite unions.
Exercise 1.1.11. Let F be a σ-algebra on a set Ω. Show that F is closed under countable intersection.
That is, show that if A_i ∈ F for i ∈ N, then ⋂_{i=1}^∞ A_i ∈ F.
Exercise 1.1.12 (Power set is a σ-algebra). Let Ω be a set and let 2Ω denote the set of all subsets of Ω,
which is called the power set of Ω (note that ∅ ∈ 2^Ω). Show that 2^Ω is a σ-algebra on Ω.
Exercise 1.1.13 (σ-algebra generated by a partition). Let Ω be a sample space and let Ω = ⋃_{k≥1} A_k for some disjoint subsets A_1, A_2, ... of Ω. Then show that for each B ∈ σ({A_1, A_2, ...}), there exists an index set I ⊆ N such that B = ⋃_{k∈I} A_k.
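For a finite partition, this can be checked by brute force: the generated σ-algebra consists exactly of the 2^m unions of blocks. A small Python sketch, with a hypothetical 3-block partition of Ω = {1, ..., 6}:

```python
# σ({A_1, A_2, A_3}) for a 3-block partition: exactly the 2^3 = 8 unions of blocks.
from itertools import combinations

blocks = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5, 6})]

sigma_algebra = set()
for r in range(len(blocks) + 1):
    for choice in combinations(blocks, r):
        sigma_algebra.add(frozenset().union(*choice))   # union over an index set I

print(len(sigma_algebra))                       # 8
print(sorted(sorted(B) for B in sigma_algebra))
```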
Exercise 1.1.14 (σ-algebra on a countable set). Let Ω be a countable set. Let F be an arbitrary σ-algebra
on Ω containing all singleton sets: {x} ∈ F for all x ∈ Ω. Show that F = 2Ω .
Now we can give a formal definition of a probability space.
Definition 1.1.15 (Probability spaces). Let Ω be a set, not necessarily countable. Let F be any σ-algebra
on Ω and let P be any probability measure on F . Then the triple (Ω, F , P) defines a probability space. If
Ω is countable, then (Ω, F , P) is called a discrete probability space.
The following are important properties of a measure.
Theorem 1.1.16. Let (Ω, F, µ) be a measure space. The following hold:
(i) (Monotonicity) If A, B ∈ F and A ⊆ B, then µ(A) ≤ µ(B);
(ii) (Subadditivity) If A, A_1, A_2, ··· ∈ F and A ⊆ ⋃_{i=1}^∞ A_i, then µ(A) ≤ Σ_{i=1}^∞ µ(A_i);
(iii) (Continuity from below) If A_1 ⊆ A_2 ⊆ ··· and A := ⋃_{i=1}^∞ A_i (written A_n ↗ A), then µ(A_n) ↗ µ(A);
(iv) (Continuity from above) If A_1 ⊇ A_2 ⊇ ···, A := ⋂_{i=1}^∞ A_i (written A_n ↘ A), and µ(A_1) < ∞, then µ(A_n) ↘ µ(A).
PROOF. (i) Since A ⊆ B, write B = A ∪ (B \ A). Note that A and B \ A are disjoint. Hence by the second axiom of a measure (countable additivity), we get µ(B) = µ(A) + µ(B \ A) ≥ µ(A).
(ii) Define
B_1 = A_1 ⊆ A_1,
B_2 = (A_1 ∪ A_2) \ A_1 ⊆ A_2,
B_3 = (A_1 ∪ A_2 ∪ A_3) \ (A_1 ∪ A_2) ⊆ A_3,
and so on. Then clearly the B_i's are disjoint and their union is the same as the union of the A_i's. Now by part (i) and countable additivity of the measure, we get
µ(A) ≤ µ(⋃_{i=1}^∞ A_i) = µ(⋃_{i=1}^∞ B_i) = Σ_{i=1}^∞ µ(B_i) ≤ Σ_{i=1}^∞ µ(A_i).
(iii) Define a collection of disjoint subsets B i ’s similarly as in (ii). In this case, they will be
B1 = A1 ⊆ A1
B2 = A2 \ A1 ⊆ A2
B3 = A3 \ A2 ⊆ A3,
and so on. Then ⋃_{i=1}^∞ A_i = ⋃_{i=1}^∞ B_i and A_n = ⋃_{i=1}^n A_i = ⋃_{i=1}^n B_i. Hence we get
µ(A) = µ(⋃_{i=1}^∞ A_i) = µ(⋃_{i=1}^∞ B_i) = Σ_{i=1}^∞ µ(B_i)
= lim_{n→∞} Σ_{i=1}^n µ(B_i) = lim_{n→∞} µ(B_1 ∪ ··· ∪ B_n) = lim_{n→∞} µ(A_n).
(iv) Since A_1 \ A_n ↗ A_1 \ A, by (iii) we get µ(A_1 \ A_n) ↗ µ(A_1 \ A). Also, since A ⊆ A_1, we have µ(A_1 \ A) = µ(A_1) − µ(A), and similarly µ(A_1 \ A_n) = µ(A_1) − µ(A_n) (using µ(A_1) < ∞). Hence µ(A_n) = µ(A_1) − µ(A_1 \ A_n) ↘ µ(A_1) − µ(A_1 \ A) = µ(A). □
Some immediate consequences of Theorem 1.1.16 (ii) are given in the following exercise.
Exercise 1.1.18 (Inclusion-exclusion). Let (Ω, F, P) be a probability space and let A_1, A_2, ..., A_k ⊆ Ω be events. Show the following.
(i) For any events A, B ⊆ Ω, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Remark 1.1.19. Later, we will show the general inclusion-exclusion in a much easier way using random variables and expectation.
1.1.2. Constructing σ-algebras bottom-up. Note that Definition 1.1.10 gives an 'axiomatic definition' of a σ-algebra, but it does not tell us how to construct such an object. For instance, suppose that F is a σ-algebra on Ω = R. Suppose that F contains C, the collection of all open intervals of the form (a, b) for a, b ∈ R, a ≤ b. Note that the set A = (1, 2) ∪ (3, 4) is itself not an open interval, so it is not contained in C. But A ∈ F, since it is obtained by taking the union of two sets (1, 2) and (3, 4) that are members of the σ-algebra F. In this way, one can build new sets from C by taking complements, countable unions, and countable intersections, until the collection of subsets becomes a σ-algebra. By only including the sets that can be made in this way, one obtains the σ-algebra 'generated by C', which is denoted as σ(C). This is an important way of constructing a σ-algebra from a small collection of subsets.
Definition 1.1.20 (σ-algebra generated by a collection of subsets). Let Ω be a set and let C be a set of
subsets of Ω (i.e., C ⊆ 2Ω ). The σ-algebra generated by C is the smallest σ-algebra on Ω that contains C .
Exercise 1.1.21. Let Ω be a set and let F denote an arbitrary non-empty collection of σ-algebras on Ω. Let A denote the intersection of all σ-algebras in F:
A := ⋂_{F∈F} F = {B ⊆ Ω | B ∈ F for all F ∈ F}.
Show that A is itself a σ-algebra on Ω.
Proposition 1.1.22. Let Ω be a set and let C be a set of subsets of Ω. Then σ(C ) exists and is unique. More
precisely,
σ(C) = ⋂ {F | F is a σ-algebra on Ω containing C}.
PROOF. (Uniqueness): Let A and A′ be two smallest σ-algebras on Ω containing C. Then by definition, A ⊆ A′ and A′ ⊆ A. Hence A = A′.
(Existence): Let F denote the collection of all σ-algebras on Ω that contain C. Recall that by Exercise 1.1.12, the power set 2^Ω is a σ-algebra on Ω and it contains C, so 2^Ω ∈ F. Since F is non-empty, if we let A be the intersection of all σ-algebras in F, then it is a σ-algebra by Exercise 1.1.21 and it contains C. Moreover, A is a smallest σ-algebra containing C. Indeed, if A′ is an arbitrary σ-algebra that contains C, then by definition A′ ∈ F, so A ⊆ A′. By uniqueness, we must then have A = σ(C). □
Exercise 1.1.23. Let F be a σ-algebra on a sample space Ω. Let C be a collection of subsets of Ω such
that C ⊆ F . Then show that σ(C ) ⊆ F .
Definition 1.1.24 (Borel σ-algebra on R^d). Let Ω = R^d be the d-dimensional Euclidean space. An open ball in R^d is a subset of R^d given by B(x, r) := {y ∈ R^d | ‖x − y‖ < r}, where ‖·‖ denotes the Euclidean norm on R^d. Let C denote the set of all open balls in R^d. Then the σ-algebra generated by all open balls in R^d, σ(C), is called the Borel σ-algebra on R^d, which is typically denoted as B^d. When d = 1, we denote B^1 = B.
Exercise 1.1.25. Let B be the Borel σ-algebra on R. Show that all closed intervals and half-open intervals
are contained in B.
Example 1.1.26 (The Cantor set). Starting from the unit interval [0, 1], take away the middle third interval
iteratively. The limit of this process defines a subset C of [0, 1], which is known as the Cantor set (see
Figure 1.1.2). A more precise way to construct the Cantor set is as follows. Define a sequence of subsets
C0 , C1 , . . . of [0, 1] recursively as C0 := [0, 1] and
C_{n+1} := (C_n / 3) ∪ (2/3 + C_n / 3), n = 0, 1, 2, ....
Now we define the Cantor set C as
C := ⋂_{n=1}^∞ C_n.
FIGURE 1.1.2. Construction of the Cantor set. Starting from the unit interval [0, 1], take away the middle third interval iteratively. The limit of this process defines a subset C of [0, 1], which is known as the Cantor set. (Figure credit: https://en.wikipedia.org/wiki/Cantor_set)
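One can follow the recursion numerically. A short Python sketch (using exact Fraction arithmetic; the interval-list representation is an illustrative choice) that generates the intervals of C_n and their total length (2/3)^n:

```python
# Generate C_n via C_{n+1} = C_n/3 ∪ (2/3 + C_n/3), with exact rational endpoints.
from fractions import Fraction

def cantor_step(intervals):
    """Map each closed interval [a, b] to its two outer thirds."""
    out = []
    for a, b in intervals:
        third = (b - a) / 3
        out.append((a, a + third))
        out.append((b - third, b))
    return out

C = [(Fraction(0), Fraction(1))]         # C_0 = [0, 1]
for _ in range(3):
    C = cantor_step(C)

print(len(C), sum(b - a for a, b in C))  # 8 intervals, total length (2/3)^3 = 8/27
```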
FIGURE 1.1.3. Bottom-up construction of a measure by outer measure extension and restriction onto measurable sets.
Definition 1.1.27 (Outer measures). Let Ω be a set. A set function ρ : 2Ω → [0, ∞] is an outer measure on
Ω if the following hold:
(i) (null empty set) ρ(∅) = 0;
(ii) (countable subadditivity) For subsets A, B_1, B_2, ··· ⊆ Ω,
A ⊆ ⋃_{i=1}^∞ B_i =⇒ ρ(A) ≤ Σ_{i=1}^∞ ρ(B_i).
Exercise 1.1.28. Show that outer measures have monotonicity. That is, if ρ is an outer measure on Ω and if A, B ⊆ Ω are such that A ⊆ B, then ρ(A) ≤ ρ(B).
Definition 1.1.29 (Outer measure extension). Let Ω be a set and C be a collection of subsets of Ω (i.e., C ⊆ 2^Ω) containing the empty set. Let α : C → [0, ∞] be a set function such that α(∅) = 0. Define a set function α* : 2^Ω → [0, ∞] by
α*(E) := inf { Σ_{i=1}^∞ α(C_i) | C_1, C_2, ··· ∈ C and E ⊆ ⋃_{i=1}^∞ C_i }. (2)
We say α* is an outer measure extension of α. (Take inf(∅) = ∞.)
Exercise 1.1.30. Let C ⊆ 2^Ω contain ∅ and cover Ω (i.e., Ω = ⋃C). Fix an arbitrary set function α : C → [0, ∞] such that α(∅) = 0. Show that the outer measure extension α* (see (2)) is indeed an outer measure on Ω.
Example 1.1.31. Let Ω = R and C be the collection of all half-open intervals of the form (a, b], a, b ∈ R. Let α : C → [0, ∞], α((a, b]) = b − a. Note that α gives the usual 'length' b − a of the interval (a, b]. For now α is only defined on half-open intervals. Hence one cannot evaluate α((1, 2] ∪ (3, 4]), for example. Let α* be the outer measure constructed from α as in (2). Then note that
α*((1, 2] ∪ (3, 4]) = α((1, 2]) + α((3, 4]) = (2 − 1) + (4 − 3) = 2.
For another example, consider the open interval (0, 1). We first write (0, 1) as the following disjoint union of half-open intervals:
(0, 1) = (0, 1 − 2^{-1}] ∪ (1 − 2^{-1}, 1 − 3^{-1}] ∪ (1 − 3^{-1}, 1 − 4^{-1}] ∪ ....
Then invoking the definition of the outer measure extension α* of α in (2),
α*((0, 1)) = α((0, 1 − 2^{-1}]) + α((1 − 2^{-1}, 1 − 3^{-1}]) + α((1 − 3^{-1}, 1 − 4^{-1}]) + ···
= (1 − 2^{-1}) + (2^{-1} − 3^{-1}) + (3^{-1} − 4^{-1}) + ···
= 1.
In general, one can show that α* is countably additive on C: if I_1, I_2, ... are disjoint elements of C, then
α*(⋃_{i=1}^∞ I_i) = Σ_{i=1}^∞ α(I_i).
▲
Let A, E ⊆ Ω be arbitrary. Then by using subadditivity of ρ and since A ⊆ (A ∩ E ) ∪ (A \ E ), we have
ρ(A) ≤ ρ(A ∩ E ) + ρ(A \ E ).
If the above inequality becomes an equality for all A ⊆ Ω, then we say E is ρ-measurable. See below.
Definition 1.1.32 (Measurability w.r.t. an outer measure). Let ρ be an outer measure on a set Ω. A subset E ⊆ Ω is said to be ρ-measurable (or Carathéodory-measurable relative to ρ) if
ρ(A) = ρ(A ∩ E) + ρ(A \ E) for all A ⊆ Ω.
Lemma 1.1.33 (Carathéodory's lemma). Let ρ be an outer measure on a set Ω. Let F be the collection of all
ρ-measurable subsets of Ω.
(i) F is a σ-algebra.
(ii) ρ restricted to F is a measure on F .
PROOF. We first show (i). We will proceed in a few steps.
(F is closed under complementation) Let E ∈ F . This means ρ(A) = ρ(A ∩ E ) + ρ(A \ E ) for all A ⊆ Ω.
Note that since A ∩ E c = A \ E and A \ E = A ∩ E c , this implies ρ(A) = ρ(A ∩ E c ) + ρ(A \ E c ) for
all A ⊆ Ω. This shows E c ∈ F . Since E ∈ F was arbitrary, this shows that F is closed under
complementation.
(F is closed under intersection and union) Let E 1 , E 2 ∈ F and let A ⊆ Ω be arbitrary. Then by ρ-
measurability of E 1 and E 2 , we have
ρ(A) = ρ(A ∩ E_1) + ρ(A \ E_1),
ρ(A ∩ E_1) = ρ(A ∩ (E_1 ∩ E_2)) + ρ((A ∩ E_1) \ E_2),
ρ(A \ (E_1 ∩ E_2)) = ρ([A \ (E_1 ∩ E_2)] ∩ E_1) + ρ([A \ (E_1 ∩ E_2)] \ E_1) = ρ((A ∩ E_1) \ E_2) + ρ(A \ E_1).
By combining the above identities, we have
ρ(A) = ρ(A ∩ (E_1 ∩ E_2)) + ρ((A ∩ E_1) \ E_2) + ρ(A \ E_1) = ρ(A ∩ (E_1 ∩ E_2)) + ρ(A \ (E_1 ∩ E_2)).
This shows E_1 ∩ E_2 ∈ F; then also E_1 ∪ E_2 = (E_1^c ∩ E_2^c)^c ∈ F by closure under complements.
(F is closed under countable disjoint unions) Let B_1, B_2, ··· ∈ F be disjoint, and set C_n := ⋃_{i=1}^n B_i and C_∞ := ⋃_{i=1}^∞ B_i. Fix A ⊆ Ω. One can show by induction on n that
ρ(A) = Σ_{i=1}^n ρ(A ∩ B_i) + ρ(A \ C_n) ≥ Σ_{i=1}^n ρ(A ∩ B_i) + ρ(A \ C_∞),
where for the inequality above, we used the monotonicity of the outer measure ρ (which follows from subadditivity, Exercise 1.1.28) and that A \ C_n ⊇ A \ C_∞. Taking n → ∞ and using subadditivity of ρ with ⋃_{i=1}^∞ (A ∩ B_i) ⊇ A ∩ C_∞,
ρ(A) ≥ Σ_{i=1}^∞ ρ(A ∩ B_i) + ρ(A \ C_∞) ≥ ρ(A ∩ C_∞) + ρ(A \ C_∞) ≥ ρ(A).
Hence equality holds throughout, so C_∞ ∈ F.
Definition 1.1.34 (Semi-ring). Let Ω be a set. A collection R ⊆ 2^Ω is a semi-ring if:
(i) ∅ ∈ R;
(ii) A, B ∈ R =⇒ A ∩ B ∈ R;
(iii) A, B ∈ R =⇒ A \ B = ⨆_{i=1}^∞ C_i for some disjoint C_i ∈ R, i ≥ 1.
A prime example of a semi-ring (and the only example you will need to know) is the set of half-open intervals. For instance, letting A = (0, 3] and B = (1, 2], we have
A \ B = (0, 1] ∪ (2, 3],
which is not a half-open interval of the form (a, b], but is indeed the disjoint union of two such intervals.
Exercise 1.1.35 (Set of half-open intervals is a semi-ring). Let Ω = R and let R be the set of all half-open
intervals of the form (a, b] for a ≤ b, a, b ∈ R. Show that R is a semi-ring.
Theorem 1.1.36 (Carathéodory extension theorem). Let Ω be a set and let R ⊆ 2^Ω. Fix a set function α : R → [0, ∞]. Suppose the following hold:
(A1) R is a semi-ring;
(A2) α is countably additive on R: for each A ∈ R such that A = ⨆_{i=1}^∞ B_i for some disjoint B_1, B_2, ··· ∈ R, we have α(A) = Σ_{i=1}^∞ α(B_i).
Then the following hold:
(i) There exists a measure µ on σ(R) that extends α on R (i.e., µ|_R = α).
(ii) If α is σ-finite, that is, Ω = ⋃_{i=1}^∞ A_i for some A_i ∈ R, i ≥ 1, such that α(A_i) < ∞, then the measure µ in (i) is unique.
SKETCH OF PROOF. Let α* : 2^Ω → [0, ∞] be the outer measure extension of α : R → [0, ∞]. Let F denote the set of all α*-measurable subsets of Ω. By Lemma 1.1.33, we know that F is a σ-algebra and α* restricted to F is a measure on F.
Now with the additional hypotheses that R is a semi-ring and α is countably additive on R, one can show that R ⊆ F. Recall that F is itself a σ-algebra. Since σ(R) is the smallest σ-algebra that contains R, it follows that σ(R) ⊆ F. Now restricting α* to σ(R) shows the existence of a measure µ on σ(R) that extends α on R.
For the uniqueness, suppose µ′ is another measure on σ(R) that extends α. Then by Lemma 1.1.38, one must have µ = µ′ over σ(R). □
Theorem 1.1.37 (Dynkin's π−λ theorem). Let Ω be a set. Let P ⊆ 2^Ω be a π-system (i.e., closed under intersection) and let Q ⊆ 2^Ω be a λ-system (i.e., Ω ∈ Q; A, B ∈ Q with A ⊆ B ⇒ B \ A ∈ Q; Q is closed under ascending unions). If P ⊆ Q, then σ(P) ⊆ Q.
PROOF. The proof of the above theorem is short but non-trivial (see [Dur19, Appendix A]). □
A typical application of Dynkin’s π − λ theorem is the following uniqueness of measure agreeing on a
π-system.
Lemma 1.1.38 (Uniqueness of measures agreeing on a π-system). Let Ω be a sample space and let R ⊆ 2^Ω be a π-system. Let µ and µ′ be two measures on σ(R) such that µ(A) = µ′(A) for all A ∈ R. Then µ(A) = µ′(A) for all A ∈ σ(R).
PROOF. Define
Q := {B ∈ σ(R) | µ(B) = µ′(B)}.
Then one can verify that Q is a λ-system containing the π-system R. Hence by the π−λ theorem, we have σ(R) ⊆ Q. This implies that µ = µ′ over σ(R), as desired. □
1.1.4. Stieltjes and Lebesgue measure on Rd . Now we are ready to define Stieltjes measures on the Borel
σ-algebra on R. A special case of such measures is the Lebesgue measure.
Definition 1.1.39 (Stieltjes measure on (R, B)). Let Ω = R and let B denote the Borel σ-algebra on R (see Definition 1.1.24). Recall that B = σ(R), where R is the collection of all half-open intervals (a, b] in R. A function F : R → R is called a Stieltjes measure function if
(i) F is non-decreasing (i.e., F(a) ≤ F(b) if a ≤ b);
(ii) F is right-continuous, that is, lim_{y↘x} F(y) = F(x).
We can think of the quantity ∆_{(a,b]} F (see (4)) as the 'volume' of the rectangle (a, b] induced by the 'potential function' F.
Exercise 1.1.41 (Borel σ-algebra is generated by rectangles). Recall that the Borel σ-algebra B on Rd is
generated by the set of all open balls in Rd . Let C denote the collection of ‘half-open rectangles’ of the
form (a, b] in (3). Show that B = σ(C). (Hint: Show that any open ball is the countable union of some open balls with rational centers and rational radii; then show that every open ball with rational center and rational radius is the countable union of some half-open rectangles in (3) with rational endpoints.)
Exercise 1.1.42. This is a higher-dimensional analog of Exercise 1.1.35, which shows that the collection of half-open intervals forms a semi-ring (see Definition 1.1.34). Let R be the collection of all half-open rectangles (a, b] ⊆ R^d for all a, b ∈ R^d with a ≤ b. Show that R is a semi-ring on R^d.
Definition 1.1.43 (Stieltjes measures on (R^d, B^d)). Let Ω = R^d and let B^d denote the Borel σ-algebra on R^d (see Definition 1.1.24). Recall that B^d = σ(R), where R is the collection of all half-open rectangles (a, b] in R^d (see Exercise 1.1.41). A function F : R^d → R is called a Stieltjes measure function if
(i) F is non-decreasing (i.e., F(a) ≤ F(b) if a ≤ b);
(ii) F is right-continuous, that is, lim_{y↘x} F(y) = F(x);
(iii) if x_n → (−∞, ..., −∞) as n → ∞, then F(x_n) → 0; if x_n → (∞, ..., ∞) as n → ∞, then F(x_n) → 1;
(iv) for all a, b ∈ R^d with a ≤ b, we have ∆_{(a,b]} F ≥ 0 (see (4)).
Fix a Stieltjes measure function F : Rd → R. Define a set function α : R → [0, ∞] as
α ((a, b]) := ∆(a,b] F for a, b ∈ Rd with a ≤ b.
According to Exercise 1.1.42, we know that R is a semi-ring. Then by Theorem 1.1.36, there exists a unique measure µ on B^d that extends α. We call µ the Stieltjes measure associated with F. In particular, if F(x) = ∏_{i=1}^d F_i(x_i) for x = (x_1, ..., x_d), where F_i : R → R for i = 1, ..., d, then we can write
α((a, b]) = ∏_{i=1}^d (F_i(b_i) − F_i(a_i)).
In particular, if F_i(x) = x, then
α((a, b]) = ∏_{i=1}^d (b_i − a_i),
where the right-hand side is the usual volume of the rectangle (a, b]. In this case, the corresponding Stieltjes measure µ is called the Lebesgue measure on R^d. ▲
Then these functions are random variables. In order to see this for X, let A be any Borel subset of R. Then X^{-1}(A) = {(ω_1, ω_2) ∈ Ω | ω_1 ∈ A} is a subset of Ω, and hence it is an element of the power set 2^Ω. In other words, X is (2^Ω − B)-measurable, where B denotes the Borel σ-algebra on R. Likewise, Y and Z are also random variables.
Now since Z pulls back Borel subsets of R to measurable subsets of Ω, we can compute the probability of such inverse images. For instance, recall that {7} ∈ B. Consider the following event:
{Z = 7} = Z −1 ({7}) = {ω ∈ Ω | Z (ω) = 7}
= {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}.
Hence
P(Z = 7) = P(Z^{-1}({7})) = |Z^{-1}({7})| / |Ω| = 6/36 = 1/6.
▲
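A brute-force Python check of this computation (the variable names are illustrative):

```python
# Ω = {1,...,6}^2 with the uniform measure, Z(ω) = ω1 + ω2.
omega = [(w1, w2) for w1 in range(1, 7) for w2 in range(1, 7)]
preimage = [w for w in omega if w[0] + w[1] == 7]   # Z^{-1}({7})
print(len(preimage), len(preimage) / len(omega))    # 6, 0.1666... = 1/6
```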
Example 1.2.6 (Uniform RV on [0, 1]). Consider the probability space ([0, 1], B, µ), where B is the Borel σ-algebra on [0, 1] (i.e., the smallest σ-algebra generated by the intervals (a, b] for 0 ≤ a ≤ b ≤ 1) and µ is the Lebesgue measure. Let U : [0, 1] → R be the identity function, U(x) = x. Then U is a random variable on ([0, 1], B, µ).
In order to show U : ([0, 1], B) → (R, B_R) is measurable, we need to show that for any Borel subset B ∈ B_R, U^{-1}(B) is Borel in [0, 1]. But because the Borel σ-algebra on R is generated by the intervals (a, b] for a, b ∈ R, we only need to check this for B = (a, b] for all a, b ∈ R. Now in general, it holds that
U^{-1}(B) = B ∩ [0, 1].
Hence in all cases, B ∩ [0, 1] is Borel in [0, 1]. This shows U : ([0, 1], B) → (R, B_R) is measurable. Hence U is a RV.
Also, the CDF of U is the identity on [0, 1], as
µ(U ≤ x) = µ({y ∈ [0, 1] |U (y) ≤ x}) = µ({y ∈ [0, 1] | y ≤ x}) = µ([0, x]) = x.
Remark 1.2.7 (σ-algebra and topology). Fix a set Ω. A σ-algebra F on Ω is a system of subsets of Ω that
gives rise to the notion of ‘measurability’. Another well-known system of subsets in mathematics is called
the 'topology', which gives rise to the notion of 'open subsets'. It is helpful to compare the definitions of σ-algebra and topology on Ω. Recall that a collection T of subsets of Ω is a topology on Ω if it satisfies the
following conditions:
(i) ;, Ω ∈ T ;
(ii) T is closed under arbitrary union (not necessarily only the countable ones);
(iii) T is closed under finite intersection.
Note that the above properties are abstracted from the collection of open balls in the Euclidean space Rn .
A pair (Ω, T) of a sample space Ω and a topology T on it is called a topological space. If we have two topological spaces (Ω, T) and (Ω′, T′), then a function f : Ω → Ω′ is said to be continuous if f^{-1}(A′) ∈ T for all A′ ∈ T′ (i.e., the inverse image of an open subset is open).
Proposition 1.2.8 (Check measurability on basis). Let (Ω, F , P) be a probability space and let (S, R) be a
measurable space. A function X : Ω → S is (F − R)-measurable if there exists a collection A of subsets of S
such that R = σ(A ) and X −1 (A) ∈ F for all A ∈ A .
PROOF. Let Q := {B ∈ R | X^{-1}(B) ∈ F}, which is the collection of R-measurable subsets of S that are pulled back by X to an F-measurable set. It is easy to see that Q is a σ-algebra. Indeed, if B ∈ Q, then X^{-1}(B^c) = Ω \ X^{-1}(B) ∈ F, so Q is closed under complementation. Also, if B_1, B_2, ··· ∈ Q, then
X^{-1}(⋃_{i=1}^∞ B_i) = ⋃_{i=1}^∞ X^{-1}(B_i) ∈ F, (why?)
so ⋃_{i=1}^∞ B_i ∈ Q. Now by the hypothesis, A ⊆ Q. It follows that R = σ(A) ⊆ Q ⊆ R. Hence Q = R, as desired. □
Example 1.2.9 (Monotone functions are Borel measurable). Let f : R → R be a monotone function. For instance, assume f is non-decreasing, that is, f(x) ≤ f(y) if x ≤ y. Then f : (R, B) → (R, B), that is, f is Borel measurable. Indeed, note that
f^{-1}((−∞, a]) = {x : f(x) ≤ a} = (−∞, b] or (−∞, b), where b := sup{y : f(y) ≤ a} ∈ [−∞, ∞].
In either case, the inverse image of the interval (−∞, a] under f is an interval, which is a Borel set. Hence f is Borel measurable by Proposition 1.2.8. ▲
Proposition 1.2.10 (Building RVs from other RVs). The following hold:
(i) (Composition I) If X : (Ω, F) → (Ω′, F′) and f : (Ω′, F′) → (Ω″, F″) are measurable maps, then f(X) = f ∘ X is a measurable map (Ω, F) → (Ω″, F″).
(ii) (Composition II) If X_1, ..., X_n : (Ω, F) → (R, B) are RVs and if f : (R^n, B^n) → (R, B) is measurable, then f(X_1, ..., X_n) is a RV.
(iii) (Sum and product) If X_1, ..., X_n : (Ω, F) → (R, B) are RVs, then Σ_{i=1}^n X_i and ∏_{i=1}^n X_i are also RVs.
(iv) (One-sided limits) If X_1, X_2, ··· : (Ω, F) → (R, B) are RVs, then the following are also RVs:
inf_{n≥1} X_n, sup_{n≥1} X_n, liminf_{n→∞} X_n, limsup_{n→∞} X_n.
(iii) According to (ii), it suffices to verify that the functions f : (x_1, ..., x_n) ↦ x_1 + ··· + x_n and g : (x_1, ..., x_n) ↦ x_1 ··· x_n are measurable. By Proposition 1.2.8, one only needs to check that the inverse images of open intervals under these maps are Borel measurable, which is straightforward.
(iv) Note that {inf_{n≥1} X_n < a} = ⋃_{n≥1} {X_n < a} ∈ F and {sup_{n≥1} X_n > a} = ⋃_{n≥1} {X_n > a} ∈ F for all a ∈ R. Hence inf_{n≥1} X_n and sup_{n≥1} X_n are RVs. Next, recall that liminf_{n→∞} X_n and limsup_{n→∞} X_n are the smallest and the largest subsequential limits of X_n, respectively. Hence we can write
liminf_{n→∞} X_n = sup_{n≥1} (inf_{m≥n} X_m), limsup_{n→∞} X_n = inf_{n≥1} (sup_{m≥n} X_m).
Since we have verified that inf_{n≥1} X_n and sup_{n≥1} X_n are RVs, their suprema and infima over n are also RVs. Hence the above are also RVs.
□
Suppose we have a sequence of RVs X_1, X_2, ... on a probability space (Ω, F, P). From Proposition 1.2.10, it follows that
Ω_0 := {ω ∈ Ω | lim_{n→∞} X_n(ω) exists} = {ω ∈ Ω | liminf_{n→∞} X_n(ω) = limsup_{n→∞} X_n(ω)} ∈ F.
If P(Ω_0) = 1, then we say X_n converges almost surely as n → ∞. Note that the limiting RV X_∞ := lim_{n→∞} X_n has well-defined values only on Ω_0. That is, X_∞(ω) ∈ [−∞, ∞] if ω ∈ Ω_0, and X_∞(ω) does not exist otherwise.
1.2.2. Distributions. Let (Ω, F , P) be a probability space and let X : Ω → R be a random variable. Note
that if A ⊆ R is a Borel set, then since X is a random variable, we have X −1 (A) ∈ F . Hence we can measure
the ‘size’ of X −1 (A) using the probability measure P on F .
Definition 1.2.11 (Distribution). Let (Ω, F, P) be a probability space and let X : Ω → R be a random variable. The distribution of X is the probability measure µ on (R, B), where B = Borel σ-algebra on R, defined as
µ(A) := P(X^{-1}(A)) = P(X ∈ A) for A ∈ B.
The (cumulative) distribution function (CDF) of X is the function F : R → [0, 1] defined by F(x) := P(X ≤ x) for x ∈ R. If the CDF of X has the integral representation^5
P(X ≤ x) = ∫_{−∞}^x f_X(y) dy ∀ x ∈ R
for some function f_X : R → R, then we call the function f_X the probability density function of X and say X is a continuous RV. If there exists a countable subset S ⊂ R such that P(X ∈ S) = 1, then X is a discrete RV. In this case, if we write S = {x_1, x_2, ...}, then we can write X = Σ_{i=1}^∞ x_i 1(X = x_i).^6
^5 We have not actually defined the Lebesgue integral yet. For now, regard integrals in the sense of the Riemann integral.
^6 Equivalently, X is a discrete RV iff the CDF of X is a step function with countably many jumps.
The set function µ : B → [0, ∞] defined by µ(A) = P(X^{-1}(A)) = P(X ∈ A) in Definition 1.2.11 is indeed a probability measure on B. In order to see this, first note that µ(R) = P(X ∈ R) = 1, so µ(∅) = 1 − µ(R) = 0; second, if A_1, A_2, ··· ∈ B are disjoint, then
µ(⋃_{i=1}^∞ A_i) = P(X ∈ ⋃_{i=1}^∞ A_i) = P(⋃_{i=1}^∞ X^{-1}(A_i)) = Σ_{i=1}^∞ P(X^{-1}(A_i)) = Σ_{i=1}^∞ µ(A_i).
Proposition 1.2.12. Let F : R → R be the CDF of a random variable X . It satisfies the following properties.
(i) F is non-decreasing;
(ii) limx→−∞ F (x) = 0 and limx→∞ F (x) = 1;
(iii) F is right-continuous, that is, lim_{y↘x} F(y) = F(x);
(iv) Let F(x−) := lim_{y↗x} F(y). Then F(x−) = P(X < x);
(v) P(X = x) = F(x) − F(x−).
PROOF. We assume the RV X is defined on some probability space (Ω, F, P).
(i) Follows from the monotonicity of probability measures and {X ≤ x} ⊆ {X ≤ y} if x ≤ y.
(ii) By continuity of measures, lim_{x→∞} P(X ≤ x) = P(X^{-1}((−∞, ∞))) = P(Ω) = 1. Similarly, lim_{x→−∞} P(X ≤ x) = P(X^{-1}(∅)) = P(∅) = 0.
(iii) Note that if a sequence a_n ↘ 0 as n → ∞, then ⋂_{n=1}^∞ {X ≤ x + a_n} = {X ≤ x}. So by continuity of measure, we get lim_{n→∞} P(X ≤ x + a_n) = P(X ≤ x). Since this holds for any sequence a_n ↘ 0, it holds that lim_{y↘x} F(y) = F(x).
(iv) Fix a sequence a_n ↘ 0 as n → ∞. Then ⋃_{n=1}^∞ {X ≤ x − a_n} = {X < x}. So by continuity of measure, we get lim_{n→∞} P(X ≤ x − a_n) = P(X < x). Since this holds for any sequence a_n ↘ 0, it holds that lim_{y↗x} F(y) = F(x−) = P(X < x).
(v) By (iv), F(x) − F(x−) = P(X ≤ x) − P(X < x) = P(X = x).
□
Proposition 1.2.13 (Constructing RV from its CDF). If a function F : R → R satisfies Proposition 1.2.12
(i)-(iii), then F is the CDF of some random variable X .
PROOF. Consider the probability space (Ω, B, µ), where Ω = [0, 1], B is the Borel σ-algebra on [0, 1],
and µ is the Lebesgue measure. Now define a function X : [0, 1] → R by
X (ω) := sup{y : F (y) < ω} = “F −1 (ω)”.
(Note that if F is 1-1, then X(ω) = F^{-1}(ω).) We need to verify two points: (1) X is a random variable; (2) the CDF of X is F. To this end, we claim that
{ω : X(ω) ≤ x} = {ω : ω ≤ F(x)}. (5)
Note that this yields X^{-1}((−∞, x]) = [0, F(x)]. That is, the inverse image of the interval (−∞, x] under X is the interval [0, F(x)], which is a Borel set. This also yields that X^{-1}((a, b]) = (F(a), F(b)], which is also a Borel set. Hence by Proposition 1.2.8, X is indeed a random variable. Furthermore, assuming (5), we can easily verify that the CDF of X is indeed F:
µ(X ≤ x) = µ(ω ≤ F(x)) = µ([0, F(x)]) = F(x).
It remains to verify (5). Indeed, on the one hand, suppose X (ω) ≤ x. Suppose for contradiction that
F (x) < ω. Then by using the right-continuity of F , there exists ε > 0 such that F (x + ε) < ω. It follows
that x + ε ≤ X (ω) ≤ x, which is a contradiction. On the other hand, suppose ω ≤ F (x). Suppose for
contradiction that X (ω) > x. This implies that there exists x < y ≤ X (ω) such that F (y) < ω. But since F is
non-decreasing and x < y, F (y) ≥ F (x) ≥ ω, which is a contradiction. Thus X (ω) ≤ x. □
Remark 1.2.14. If the function F in Proposition 1.2.13 is strictly increasing so that it has an inverse
function, then the RV X we constructed in the proof of Proposition 1.2.13 is simply
X = F −1 (U ),
where U is the uniform RV on [0, 1] in Example 1.2.6. Note that X is a RV since F −1 is monotone and
hence measurable (see Example 1.2.9). Also, indeed its CDF is F :
P(X ≤ x) = P(F −1 (U ) ≤ x) = P(U ≤ F (x)) = F (x).
This is in fact how random variables are generated numerically in computer software. Namely, one first generates a uniform RV U and then outputs F^{-1}(U).
Exercise 1.2.15. Using your favorite programming language (e.g., R, Python, C++), sample i.i.d. uniform RVs U_1, ..., U_n ∼ Uniform([0, 1]). Compute the values of X_1, ..., X_n, where X_i := g(U_i) and g(x) = −λ^{-1} log(1 − x) for some fixed λ > 0. Numerically compute and plot the empirical CDF of the samples X_1, ..., X_n. What is the distribution of X := g(U), U ∼ Uniform([0, 1])?
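A minimal Python sketch of this exercise (standard library only; the plotting step is omitted): g is the inverse of the Exponential(λ) CDF F(x) = 1 − e^{−λx}, so X = g(U) ∼ Exponential(λ), which answers the last question.

```python
import math
import random

lam, n = 2.0, 100_000
g = lambda u: -math.log(1.0 - u) / lam            # inverse of 1 - exp(-lam*x)
samples = sorted(g(random.random()) for _ in range(n))

def empirical_cdf(x):
    """Fraction of samples <= x (a linear scan; bisect would be faster)."""
    return sum(s <= x for s in samples) / n

for x in (0.25, 0.5, 1.0, 2.0):
    print(f"x={x}: empirical {empirical_cdf(x):.4f} vs exact {1 - math.exp(-lam * x):.4f}")
```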
Example 1.2.16 (Uniform distribution on [0, 1]). Let f(x) := 1_{[0,1]}(x) and let F(x) := ∫_{−∞}^x f(y) dy. Then
F(x) = 0 if x < 0; F(x) = x if 0 ≤ x ≤ 1; F(x) = 1 if x > 1.
This F is called the uniform distribution on [0, 1]. ▲
Example 1.2.17 (Exponential distribution with rate λ). Fix a constant λ > 0, let f(x) := λe^{−λx} 1_{[0,∞)}(x), and let F(x) := ∫_{−∞}^x f(y) dy. Then
F(x) = (1 − e^{−λx}) 1_{[0,∞)}(x)
is called the exponential distribution with rate λ. ▲
Example 1.2.18 (Standard normal distribution). Let f(x) := (1/√(2π)) e^{−x²/2}. Then F(x) := ∫_{−∞}^x f(y) dy is called the standard normal distribution. There is no closed-form expression for F in this case. ▲
Exercise 1.2.19. Let F(x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy be the standard normal distribution. Show that for all x > 0,
(1/√(2π)) (x^{-1} − x^{-3}) exp(−x²/2) ≤ 1 − F(x) ≤ (1/√(2π)) x^{-1} exp(−x²/2).
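A quick numeric sanity check of these bounds in Python, using the standard-library identity 1 − F(x) = erfc(x/√2)/2 (no external packages assumed):

```python
import math

def normal_tail(x):
    """1 - F(x) for the standard normal CDF F."""
    return 0.5 * math.erfc(x / math.sqrt(2))

c = 1.0 / math.sqrt(2 * math.pi)
for x in (1.0, 2.0, 4.0):
    lower = c * (1 / x - 1 / x**3) * math.exp(-x * x / 2)
    upper = c * (1 / x) * math.exp(-x * x / 2)
    print(x, lower <= normal_tail(x) <= upper, (lower, normal_tail(x), upper))
```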
Definition 1.2.20 (Equivalence in distribution). Let X, Y be RVs, possibly defined on different probability spaces. We say X and Y are equal in distribution and write X =_d Y if they have the same CDF (equivalently, distribution):^7
X =_d Y :⟺ CDF of X = CDF of Y ⟺ P(X ≤ x) = P(Y ≤ x) ∀x ∈ R. (6)
Exercise 1.2.21. Construct an example of two random variables X, Y on the same probability space (Ω, F, P) where X =_d Y but P(X = −Y) = 1.
Definition 1.2.22 (Absolute continuity). Let (Ω, F) be a measurable space and let µ, ν be two measures on it. We say ν is absolutely continuous with respect to µ and write ν ≪ µ if
µ(A) = 0 =⇒ ν(A) = 0 ∀A ∈ F.
A measure ρ on (R, B) is absolutely continuous if it is absolutely continuous w.r.t. the Lebesgue measure; otherwise ρ is said to be singular. A measure ξ on (Ω, F) is discrete if there exists a countable set S ⊆ Ω such that ξ(S^c) = 0.
Example 1.2.23 (Point mass). Let (Ω, F, µ) be a probability space and suppose that µ({y}) = 1 for some y ∈ Ω. Then µ is called the point mass at y and denoted as δ_y. If Ω = R, then the point mass at y corresponds to the distribution function F(x) = 1(x ≥ y). Clearly point masses are discrete probability measures. ▲
^7 If X : (Ω_X, F_X, P_X) → (R, B) and Y : (Ω_Y, F_Y, P_Y) → (R, B), then the last expression in (6) precisely means P_X(X ≤ x) = P_Y(Y ≤ x) ∀x ∈ R. However, we usually use the generic notation P for probability measures and interpret it as the probability measure on the appropriate probability space.
Example 1.2.24 (Uniform distribution on the Cantor set). Recall the Cantor set C defined in Example 1.1.26, which is Borel-measurable with Lebesgue measure zero. We are going to consider the 'uniform probability distribution' on C. Namely, let C_n be the set obtained from [0, 1] by repeating the procedure of omitting the middle third of every remaining interval n times. Since C_n is a disjoint union of 2^n intervals, one can consider the uniform distribution on C_n, with CDF denoted F_n. Namely,
F_n(x) := (1/µ(C_n)) ∫_0^x 1(y ∈ C_n) dy = (3/2)^n ∫_0^x 1(y ∈ C_n) dy.
FIGURE 1.2.1. CDF of the Cantor distribution, which is also known as the "devil's staircase". Figure excerpted from [DMRV06].
Note that F_n is a piecewise linear function with slopes either 0 or (3/2)^n. It can be shown that F_n converges to some limiting function F_∞ pointwise. Since each F_n is monotone, the convergence F_n → F_∞ is uniform, so the limiting function F_∞ is also continuous. Note that F_∞ cannot have a density function, since F_∞ is locally constant on C^c and µ(C^c) = 1. This means any density function (if it exists) of F_∞ has to be zero on the set C^c of measure 1, so it integrates over [0, 1] to zero, not one. A distribution function not having a density function is in fact equivalent to the associated Stieltjes measure being singular. This uses the famous Radon–Nikodym theorem (pointer will be added). ▲
1.3. Integration
1.3.1. Definition of Lebesgue integral and basic properties. In this section, we develop the theory of Lebesgue integration. One should be familiar with the notion of the Riemann integral, where one partitions the domain of the function into intervals or rectangles and approximates the area or volume under the graph of the function by a sum of areas of rectangles. In contrast, the key point of the Lebesgue integral is to partition the range of the function, rather than its domain (see Figure 1.3.1).
An important building block of the Lebesgue integral is the simple function, which will play the role of rectangles in the Riemann integral.
Definition 1.3.1 (Simple functions). Let (Ω, F, µ) be a measure space. A function φ : Ω → R is simple if it takes only finitely many values on sets of finite measure. That is, there exist disjoint measurable sets A_1, ..., A_n ∈ F with µ(A_i) < ∞ for i = 1, ..., n and a_1, ..., a_n ∈ R such that
φ = Σ_{i=1}^n a_i 1_{A_i}.
FIGURE 1.3.1. The Riemann integral (left) partitions the domain of the function, whereas the Lebesgue integral partitions the range of the function. Figure by Xichu Zhang.
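The contrast can be seen numerically. The following Python sketch (purely illustrative: f(x) = x² on [0, 1], with the Lebesgue measure approximated on a fine grid) computes a Riemann sum over a domain partition and a Lebesgue lower sum over a range partition; both approach ∫₀¹ x² dx = 1/3.

```python
N, m = 10_000, 100
grid = [(i + 0.5) / N for i in range(N)]
vals = [x * x for x in grid]

# Riemann: partition the domain into N cells of length 1/N
riemann = sum(v / N for v in vals)

# Lebesgue lower sum: partition the range [0, 1) into m levels y_k = k/m and
# weigh each level y_k by the (approximate) measure of {y_k <= f < y_{k+1}}
lebesgue = sum((k / m) * sum(1 for v in vals if k / m <= v < (k + 1) / m) / N
               for k in range(m))
print(riemann, lebesgue)   # both near 1/3; the lower sum approaches from below
```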
Definition 1.3.3 (Lebesgue integral). Let (Ω, F, µ) be a measure space.
(I) (Simple functions) For a simple function φ = Σ_{i=1}^n a_i 1_{A_i}, define ∫ φ dµ := Σ_{i=1}^n a_i µ(A_i).
(II) (Bounded functions) For a bounded measurable function f with µ({f ≠ 0}) < ∞, define ∫ f dµ := sup_{φ simple, φ ≤ f} ∫ φ dµ.
(III) (Nonnegative functions) For a nonnegative measurable function f, define ∫ f dµ := sup_g ∫ g dµ, where the supremum runs over all bounded functions 0 ≤ g ≤ f such that µ({ω | g(ω) > 0}) < ∞.
(IV) (General measurable functions) Let a∨b := max(a, b) and a∧b := min(a, b) for a, b ∈ R. For a general measurable function f : Ω → R, define two nonnegative functions f_+, f_− : Ω → R by^8
f_+(ω) := f(ω) ∨ 0, f_−(ω) := −(f(ω) ∧ 0).
Then we have
f = f_+ − f_− and |f| = f_+ + f_−.
We say f is (Lebesgue) integrable^9 if f is measurable and ∫ |f| dµ < ∞, or equivalently, ∫ f_+ dµ < ∞ and ∫ f_− dµ < ∞. In this case, we define
∫ f dµ := ∫ f_+ dµ − ∫ f_− dµ.
Otherwise, ∫ f dµ is undefined and we say ∫ f dµ does not exist.
^8 The ReLU function x ↦ x ∨ 0 is monotone and hence measurable. Hence f_+ is measurable, being the composition of two measurable functions; f_− is also measurable for a similar reason.
^9 ∫ f dµ is defined (with ∫ f dµ ∈ [−∞, ∞]) iff either ∫ f_+ dµ < ∞ or ∫ f_− dµ < ∞.
Remark 1.3.4 (Integral notations). Here we introduce some notations for Lebesgue integrals for measure
space (Ω, F , µ) and a measurable function f : Ω → R.
(i) For an event A ⊆ Ω, we write
∫_A f dµ := ∫ f 1_A dµ,
where the left-hand side is called the 'integral of f over A'.
(ii) When (Ω, F, µ) = (R^d, B^d, λ), where λ = Lebesgue measure, we write ∫ f dµ = ∫ f(x) dx.
(iii) When (Ω, F, µ) = (R, B, λ), where λ = Lebesgue measure, we write ∫_{[a,b]} f dµ = ∫_a^b f(x) dx.
(iv) When (Ω, F, µ) = (R, B, µ) and µ((a, b]) = G(b) − G(a) for a < b,^10 we write ∫ f dµ = ∫ f(x) dG(x).
(v) If Ω is countable, F = 2^Ω, and µ is the counting measure,^11 we write ∫ f dµ = Σ_{x∈Ω} f(x).
Exercise 1.3.5. Let f : (Ω, F, µ) → (R, B) be a measurable function such that f = 0 µ-a.e.. Show that ∫ f dµ = 0. Deduce that, if A ∈ F with µ(A) = 0 and f is an arbitrary measurable function, then ∫_A f dµ = 0. Conversely, if ∫ |f| dµ = 0, then show that f = 0 µ-a.e..
One should get familiar with this order of development (simple → bounded → nonnegative → general). This is how we usually prove certain identities (e.g., linearity of the integral, the change of variables formula, Fubini's theorem) for the Lebesgue integral.
Proposition 1.3.6 (Upper and lower Lebesgue sums). Let (Ω, F, µ) be a measure space. Let f : Ω → R be a bounded measurable function such that µ({f ≠ 0}) < ∞. Then
sup_{φ simple, φ ≤ f} ∫ φ dµ = inf_{ϕ simple, ϕ ≥ f} ∫ ϕ dµ.
PROOF. Write E := {f ≠ 0}, so µ(E) < ∞. Without loss of generality, we may assume that the simple functions φ ≤ f and ϕ ≥ f in the statement vanish outside E, since f ≡ 0 on E^c.
We first show
sup_{φ simple ≤ f} ∫ φ dµ ≤ inf_{ϕ simple ≥ f} ∫ ϕ dµ. (7)
Fix simple functions φ ≤ f and ϕ ≥ f. Write φ = Σ_i a_i 1_{A_i} and ϕ = Σ_j b_j 1_{B_j}. The support of φ + ϕ is contained in ⋃_i A_i ∪ ⋃_j B_j. We can rewrite this union as the disjoint union of the sets B_j \ ⋃_i A_i, A_i \ ⋃_j B_j, and A_i ∩ B_j, for i, j ≥ 1. Enumerate all these sets as C_k for k = 1, ..., n for some n ≥ 1. Then we can write φ = Σ_{k=1}^n x_k 1_{C_k} and ϕ = Σ_{k=1}^n y_k 1_{C_k}. Since φ ≤ f ≤ ϕ, we have x_k ≤ y_k for k = 1, ..., n. Then
∫ φ dµ = Σ_{k=1}^n x_k µ(C_k) ≤ Σ_{k=1}^n y_k µ(C_k) = ∫ ϕ dµ.
This holds for all simple functions φ ≤ f and ϕ ≥ f . Hence taking supremum on the left hand side and
then taking infimum on the right hand side, we obtain (7).
To prove the other inequality, we suppose |f| < M for some M > 0. For each m ≥ 1, define
E_{ℓ,m} := {ω ∈ E | ℓM/m ≤ f(ω) < (ℓ+1)M/m} = f^{-1}([ℓM/m, (ℓ+1)M/m)) for −m ≤ ℓ ≤ m,
φ_m := Σ_{ℓ=−m}^m (ℓM/m) 1_{E_{ℓ,m}}, ϕ_m := Σ_{ℓ=−m}^m ((ℓ+1)M/m) 1_{E_{ℓ,m}}.
By definition ϕ_m − φ_m = (M/m) 1_E, so we have
sup_{φ simple ≤ f} ∫ φ dµ ≥ ∫ φ_m dµ = ∫ ϕ_m dµ − (M/m) µ(E)
≥ (inf_{ϕ simple ≥ f} ∫ ϕ dµ) − (M/m) µ(E).
Now since µ(E ) < ∞, we can take m → ∞ and obtain the desired inequality. □
The main goal of this section is to establish the following basic properties of the Lebesgue integral, which should be familiar from Riemann integrals.
Theorem 1.3.7 (Basic properties of the integral). Let (Ω, F, µ) be a measure space and let f, g : Ω → R be integrable functions such that their supports {f ≠ 0} and {g ≠ 0} are σ-finite.^12 Then the following hold:
(i) If f ≥ 0 µ-a.e., then ∫ f dµ ≥ 0;
(ii) For all a ∈ R, ∫ af dµ = a ∫ f dµ;
(iii) ∫ (f + g) dµ = ∫ f dµ + ∫ g dµ;
(iv) If g ≤ f µ-a.e., then ∫ g dµ ≤ ∫ f dµ;
(v) |∫ f dµ| ≤ ∫ |f| dµ.
PROOF OF THEOREM 1.3.7 FOR BOUNDED OR NONNEGATIVE FUNCTIONS. We will first verify each statement for simple functions and then move on to bounded functions and then to nonnegative functions.
(i) It holds clearly if f is a nonnegative simple function. If f is a bounded function that vanishes outside some set E with µ(E) < ∞, then it also holds by definition since we can take the zero simple function φ ≡ 0. Lastly, if f ≥ 0 µ-a.e., then for any bounded measurable function g such that 0 ≤ g ≤ f, we have µ(g > 0) ≤ µ(f > 0) < ∞, so ∫ g dµ ≥ 0 by the previous case. Taking supremum over all such g, this shows ∫ f dµ ≥ 0.
(ii) It holds clearly if f is a simple function. Also, one can easily verify the a = 0 case directly, so we may assume a ≠ 0. Next, suppose f is a bounded function that vanishes outside some set E with µ(E) < ∞. Suppose a > 0. Then φ is a simple function with φ ≤ f if and only if aφ is a simple function with aφ ≤ af. Hence
a ∫ f dµ = a sup_{φ≤f} ∫ φ dµ = sup_{φ≤f} a ∫ φ dµ = sup_{φ≤f} ∫ aφ dµ = sup_{aφ≤af} ∫ aφ dµ = ∫ af dµ.
If a < 0, then noting that φ is a simple function with φ ≤ f if and only if aφ is a simple function with aφ ≥ af, using Proposition 1.3.6, we get
a ∫ f dµ = a sup_{φ≤f} ∫ φ dµ = inf_{φ≤f} a ∫ φ dµ = inf_{φ≤f} ∫ aφ dµ = inf_{aφ≥af} ∫ aφ dµ = ∫ af dµ.
Next, suppose f ≥ 0 µ-a.e.. In this case, we will verify (ii) only for the case a > 0 (the case a < 0 will be handled later). Note that h is a bounded measurable function with µ({h ≠ 0}) < ∞ and 0 ≤ h ≤ f if and only if 0 ≤ ah ≤ af and µ({ah ≠ 0}) < ∞. Hence
a ∫ f dµ = a sup_{0≤h≤f} ∫ h dµ = sup_{0≤h≤f} a ∫ h dµ = sup_{0≤h≤f} ∫ ah dµ = sup_{0≤ah≤af} ∫ ah dµ = ∫ af dµ.
(iii) For simple functions, write φ = Σ_i a_i 1_{A_i} and ϕ = Σ_j b_j 1_{B_j}. The support of φ + ϕ is contained in ⋃_i A_i ∪ ⋃_j B_j. We can rewrite this union as the disjoint union of the sets B_j \ ⋃_i A_i, A_i \ ⋃_j B_j, and A_i ∩ B_j, for i, j ≥ 1. Enumerate all these sets as C_k for k = 1, ..., n for some n ≥ 1. Then we can write φ = Σ_{k=1}^n x_k 1_{C_k} and ϕ = Σ_{k=1}^n y_k 1_{C_k}, so φ + ϕ = Σ_{k=1}^n (x_k + y_k) 1_{C_k}. Hence
∫ (φ + ϕ) dµ = Σ_{k=1}^n (x_k + y_k) µ(C_k) = Σ_{k=1}^n x_k µ(C_k) + Σ_{k=1}^n y_k µ(C_k) = ∫ φ dµ + ∫ ϕ dµ.
^12 Given a measure space (Ω, F, µ), a set A ⊆ Ω is σ-finite if there exist countably many events E_n ∈ F, n ≥ 1, with µ(E_n) < ∞ and A ⊆ ⋃_{n≥1} E_n. If µ is a finite measure (i.e., µ(Ω) < ∞) or Ω itself is σ-finite, then all subsets of Ω are σ-finite.
Next, assume f, g are bounded measurable functions with finite-measure support. If φ ≤ f and ϕ ≤ g are simple functions, then φ + ϕ is a simple function ≤ f + g, so
∫ (f + g) dµ ≥ ∫ (φ + ϕ) dµ = ∫ φ dµ + ∫ ϕ dµ.
Then taking sup_{φ≤f} and sup_{ϕ≤g} on the right-hand side, we get
∫ (f + g) dµ ≥ ∫ f dµ + ∫ g dµ.
Similarly, if φ ≥ f and ϕ ≥ g are simple functions, then φ + ϕ is a simple function ≥ f + g. Taking inf_{φ≥f} and inf_{ϕ≥g} and again using Proposition 1.3.6, we get
∫ (f + g) dµ ≤ ∫ f dµ + ∫ g dµ.
Lastly, suppose f, g ≥ 0. Fix bounded functions 0 ≤ h ≤ f and 0 ≤ r ≤ g with µ(h ≠ 0) < ∞ and µ(r ≠ 0) < ∞. Then using (iii) for bounded functions, we have ∫ h dµ + ∫ r dµ = ∫ (h + r) dµ. Taking supremum and noting that h + r is a bounded function such that 0 ≤ h + r ≤ f + g and µ(h + r ≠ 0) = µ(h ≠ 0 or r ≠ 0) ≤ µ(h ≠ 0) + µ(r ≠ 0) < ∞, we get
∫ f dµ + ∫ g dµ = sup_{0≤h≤f} ∫ h dµ + sup_{0≤r≤g} ∫ r dµ = sup_{0≤h≤f, 0≤r≤g} ∫ (h + r) dµ ≤ ∫ (f + g) dµ.
To show the other direction, first note that since f, g ≥ 0 and the supports of f and g are σ-finite, the support of f + g is also σ-finite. Hence there exist events E_n ∈ F, n ≥ 1, such that E_n ↗ {f + g ≠ 0} and µ(E_n) < ∞. Then for each n ≥ 1, using (f + g) ∧ n ≤ f ∧ n + g ∧ n and (iii)–(iv) for bounded functions, we have
∫_{E_n} (f + g) ∧ n dµ ≤ ∫_{E_n} f ∧ n dµ + ∫_{E_n} g ∧ n dµ.
Then by using Lemma 1.3.8, as n → ∞, the first and the last expressions above converge to ∫ (f + g) dµ and ∫ f dµ + ∫ g dµ, respectively. This shows the desired inequality.
(iv) It follows immediately from applying (i) to h := f − g ≥ 0 and using (iii).
(v) Note that ±f ≤ |f|, so ±∫ f dµ ≤ ∫ |f| dµ by (ii) and (iv). This gives |∫ f dµ| ≤ ∫ |f| dµ.
□
The following lemma shows that the integral of a possibly unbounded function f can be achieved by the integrals of its truncations f ∧ n for n ≥ 1. In other words, the supremum in the definition of ∫ f dµ in Definition 1.3.3 (III) could be taken only over the bounded functions g := (f ∧ n) 1_{E_n}.
Lemma 1.3.8 (Truncation of integrals). Let (Ω, F, µ) be a measure space and fix a nonnegative measurable function f : Ω → R. Suppose there exist events E_n ∈ F, n ≥ 1, such that µ(E_n) < ∞ and E_n ↗ {f > 0} (i.e., E_1 ⊆ E_2 ⊆ ... and ⋃_{n≥1} E_n = {f > 0}). Then
lim_{n→∞} ∫_{E_n} f ∧ n dµ = ∫ f dµ. (8)
PROOF. Note that f ∧ n is the function that maps ω to f(ω) ∧ n. Hence (f ∧ n) 1_{E_n} is bounded above by n, non-decreasing in n, and has support contained in E_n, which has finite µ-measure by the hypothesis. Hence by Theorem 1.3.7 (iv) for the case of bounded functions, we see that the integrals in the LHS of (8) are non-decreasing in n. Moreover, since g := (f ∧ n) 1_{E_n} is itself a bounded measurable function such that 0 ≤ g ≤ f and µ(g > 0) ≤ µ(E_n) < ∞, by definition of ∫ f dµ, we have
lim_{n→∞} ∫_{E_n} f ∧ n dµ ≤ ∫ f dµ.
To show that the limit equals the integral in the RHS, we use the definition of the integral in the RHS. Namely, fix a bounded measurable function h such that 0 ≤ h ≤ f and µ(h > 0) < ∞. Suppose sup_ω h(ω) < M for some constant M > 0. Then for n ≥ M, we have h ≤ f ∧ n ≤ f. Hence, again using Theorem 1.3.7 (iv) for the case of bounded functions,
∫_{E_n} f ∧ n dµ ≥ ∫_{E_n} h dµ = ∫ h dµ − ∫_{E_n^c} h dµ
= ∫ h dµ − ∫_{{h>0}\E_n} h dµ
≥ ∫ h dµ − M µ({h > 0} \ E_n).
Since E_n ↗ {f > 0}, {h > 0} ⊆ {f > 0}, and µ(h > 0) < ∞, we have µ({h > 0} \ E_n) → 0 as n → ∞. So
∫ h dµ ≤ lim_{n→∞} ∫_{E_n} f ∧ n dµ.
Taking supremum over all bounded measurable functions h such that 0 ≤ h ≤ f and µ(h > 0) < ∞, by using the definition of ∫ f dµ, we obtain
lim_{n→∞} ∫_{E_n} f ∧ n dµ ≥ ∫ f dµ. □
Lemma 1.3.9. Let f = f_1 − f_2, where f_1, f_2 : Ω → R are nonnegative measurable functions with σ-finite supports such that at least one of ∫ f_1 dµ and ∫ f_2 dµ is finite. If ∫ f dµ exists, then
∫ f dµ = ∫ f_1 dµ − ∫ f_2 dµ.
PROOF. Since f = f_+ − f_−, we have f_+ + f_2 = f_− + f_1. Then since all four involved functions are nonnegative with σ-finite support, by Theorem 1.3.7 for such functions, we have
∫ f_+ dµ + ∫ f_2 dµ = ∫ f_− dµ + ∫ f_1 dµ. (9)
Since ∫ f dµ exists, one of f_+ and f_− has finite integral. First suppose ∫ f_+ dµ = ∞. Then ∫ f_− dµ < ∞, and the above identity and the hypothesis imply ∫ f_1 dµ = ∞ and ∫ f_2 dµ < ∞. Hence we can subtract the finite numbers ∫ f_− dµ < ∞ and ∫ f_2 dµ < ∞ from both sides to get
∫ f dµ = ∫ f_+ dµ − ∫ f_− dµ = ∫ f_1 dµ − ∫ f_2 dµ = ∞.
A similar argument works by symmetry if ∫ f_− dµ = ∞. Lastly, if both f_± have finite integral, then (9) yields that f_1 and f_2 both have finite integrals. Again subtracting the finite numbers ∫ f_− dµ and ∫ f_2 dµ from both sides of (9) and using Definition 1.3.3 (IV), the assertion follows. □
Now we finish the proof of Theorem 1.3.7 for the general case.
PROOF OF THEOREM 1.3.7 FOR INTEGRABLE FUNCTIONS. All functions here are assumed to have σ-finite support.
(i) We have already shown this for nonnegative measurable functions.
(ii) Suppose a > 0. Note that a f_+ = (af)_+ and a f_− = (af)_−. Then by using Definition 1.3.3 (IV) and Theorem 1.3.7 (ii) for nonnegative functions and nonnegative scalars,
∫ af dµ = ∫ (af)_+ dµ − ∫ (af)_− dµ = ∫ a f_+ dµ − ∫ a f_− dµ = a ∫ f_+ dµ − a ∫ f_− dµ = a (∫ f_+ dµ − ∫ f_− dµ) = a ∫ f dµ.
Theorem 1.3.13 (Hölder's inequality). Let 1 < p, q < ∞ with 1/p + 1/q = 1 and let f, g be measurable functions. Then ∫ |fg| dµ ≤ ‖f‖_p ‖g‖_q, where ‖f‖_p := (∫ |f|^p dµ)^{1/p}.
PROOF. (Sketch; we may normalize so that ‖f‖_p = ‖g‖_q = 1.) Recall Young's inequality
xy ≤ x^p/p + y^q/q ∀ x, y ≥ 0.
Using this inequality with x = |f| and y = |g| and integrating,
∫ |fg| dµ ≤ (1/p) ∫ |f|^p dµ + (1/q) ∫ |g|^q dµ = 1/p + 1/q = 1 = ‖f‖_p ‖g‖_q.
This shows the assertion. □
Next, we will study when we can interchange limit and integral. Namely, if we have a sequence of functions f_n such that f_n → f in some appropriate sense, we would like to know sufficient conditions under which
lim_{n→∞} ∫ f_n dµ = ∫ lim_{n→∞} f_n dµ = ∫ f dµ.
Definition 1.3.14 (Convergence in measure and almost everywhere convergence). Let (Ω, F, µ) be a measure space and let f, f_n : Ω → R for n ≥ 1 be measurable functions. We say f_n → f as n → ∞ in measure if for each ε > 0,
lim_{n→∞} µ({ω : |f_n(ω) − f(ω)| > ε}) = 0.
We say f_n → f µ-almost everywhere (µ-a.e.) if µ({ω : lim_{n→∞} f_n(ω) ≠ f(ω)}) = 0.
Exercise 1.3.15 (Convergence µ-a.e. implies convergence in measure). Suppose µ is a finite measure
and assume f n → f µ-a.e.. Show that f n → f in measure.
Theorem 1.3.16 (Bounded Convergence Theorem). Let (Ω, F , µ) be a measure space and let f , f n : Ω → R
for n ≥ 1 be measurable functions such that f n → f in measure. Assume that there exists an event E ∈ F
such that µ(E ) < ∞ and f n ≡ 0 on E c . Further assume | f n | < M for all n ≥ 1 for some constant M > 0. Then
lim_{n→∞} ∫ f_n dµ = ∫ f dµ.
PROOF. The idea is to decompose the integrals on sets B_n where |f_n − f| is small and A_n where |f_n − f| is large. The difference of the integrals is small either because f_n ≈ f or because the underlying set has small measure. (The proof is much simpler if we also assume f ≡ 0 on E^c and |f| < M, but we argue for the general case below.)
We first claim that
µ({f ≠ 0} ∩ E^c) = 0 and µ({|f| > M}) = 0. (11)
Indeed, since f_n ≡ 0 on E^c and f_n → f in measure, for each fixed ε > 0,
µ({|f| > ε} ∩ E^c) = µ({|f_n − f| > ε} ∩ E^c) → 0 as n → ∞.
Since the left-hand side does not depend on n, we have µ({|f| > ε} ∩ E^c) = 0. Since this holds for arbitrary ε > 0, using that {f ≠ 0} = ⋃_{n=1}^∞ {|f| > n^{-1}}, we get (by using the union bound, or subadditivity of measure, see Thm 1.1.16)
µ({f ≠ 0} ∩ E^c) = µ(⋃_{n=1}^∞ {|f| > n^{-1}} ∩ E^c) ≤ Σ_{n=1}^∞ µ({|f| > n^{-1}} ∩ E^c) = 0,
as desired. Similarly, for each ε > 0,
µ(|f| ≥ M + ε) ≤ µ(|f − f_n| ≥ ε) → 0 as n → ∞.
Hence the left-hand side equals 0. This holds for all ε > 0, so we deduce µ(|f| > M) = 0.
Now fix ε > 0. Let A_n := {|f_n − f| > ε}, B_n := {|f_n − f| ≤ ε}, and C := {|f| ≤ M}. Then we have
|∫ f_n dµ − ∫ f dµ| = |∫ (f_n − f) dµ| (∵ Thm 1.3.7 (iii))
≤ ∫ |f_n − f| dµ (∵ Thm 1.3.7 (v))
= ∫_{A_n} |f_n − f| dµ + ∫_{B_n} |f_n − f| dµ (∵ Thm 1.3.7 (iii))
≤ 2M µ(A_n) + ε µ(E),
where the last inequality holds since |f_n − f| ≤ 2M µ-a.e. (as |f_n| < M and µ(C^c) = 0 by (11)), and since |f_n − f| ≤ ε on B_n and |f_n − f| vanishes off E up to a µ-null set (again by (11)). Since f_n → f in measure, µ(A_n) → 0 as n → ∞. Hence
limsup_{n→∞} |∫ f_n dµ − ∫ f dµ| ≤ ε µ(E).
Since ε > 0 was arbitrary and µ(E) < ∞, the assertion follows. □
Theorem 1.3.17 (Fatou's lemma). Let (Ω, F, µ) be a measure space and let f_n : Ω → R, n ≥ 1, be nonnegative measurable functions. Then
liminf_{n→∞} ∫ f_n dµ ≥ ∫ liminf_{n→∞} f_n dµ.
PROOF. (The proof is simpler if there exists E_n ↗ Ω with µ(E_n) < ∞.) Denote g_n := inf_{m≥n} f_m. Then f_n ≥ g_n ↗ g := liminf_{n→∞} f_n as n → ∞. Hence liminf_{n→∞} ∫ f_n dµ ≥ liminf_{n→∞} ∫ g_n dµ, so it suffices to show that
liminf_{n→∞} ∫ g_n dµ ≥ ∫ g dµ.
To this end, we claim that there exist finite-µ-measure events C_m ↗ {g > 0}. For this, we can use a 'diagonal argument'. For each n ≥ 1, let E_{n,m} ↗ {f_n > 0} as m → ∞ with µ(E_{n,m}) < ∞ for m ≥ 1. Arrange these sets as below:
f_1 : E_{1,1}, E_{1,2}, E_{1,3}, ...
f_2 : E_{2,1}, E_{2,2}, E_{2,3}, ...
f_3 : E_{3,1}, E_{3,2}, E_{3,3}, ...
We can then enumerate these sets in a diagonal fashion, namely, E_{1,1}, E_{1,2}, E_{2,1}, E_{1,3}, E_{2,2}, E_{3,1}, .... Let B_m denote the union of the first m sets in this enumeration. Then µ(B_m) < ∞ for m ≥ 1 and ⋃_{n≥1} {f_n > 0} ⊆ ⋃_{m≥1} B_m. In particular, this implies {g > 0} ⊆ ⋃_{m≥1} B_m. Then setting C_m := B_m ∩ {g > 0} proves the claim.
Now, by using the truncation argument (Lemma 1.3.8), we have
∫ g dµ = lim_{m→∞} ∫_{C_m} g ∧ m dµ.
On the other hand, fix m ≥ 1 and observe that h_n := (g_n ∧ m) 1_{C_m}, n ≥ 1, is a sequence of bounded measurable functions with supports all contained in C_m, which has finite µ-measure. Moreover, h_n → h := (g ∧ m) 1_{C_m} µ-a.e. as n → ∞. Hence by BCT (Theorem 1.3.16),
∫_{C_m} g ∧ m dµ = lim_{n→∞} ∫_{C_m} g_n ∧ m dµ.
Then we have
liminf_{n→∞} ∫ g_n dµ ≥ liminf_{n→∞} ∫ g_n ∧ m dµ
≥ liminf_{n→∞} ∫_{C_m} g_n ∧ m dµ (∵ g_n ≥ 0)
= ∫_{C_m} g ∧ m dµ → ∫ g dµ as m → ∞. □
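The inequality in Fatou's lemma can be strict. A standard example is f_n = n·1_{(0,1/n)} on (0, 1) with Lebesgue measure; the Python sketch below (a crude grid approximation, purely illustrative) shows ∫ f_n ≈ 1 for every n, while liminf_n f_n ≡ 0, so ∫ liminf f_n = 0 < 1 = liminf ∫ f_n.

```python
# f_n = n * 1_{(0, 1/n)} on (0, 1): every integral is 1, but f_n -> 0 pointwise.
N = 100_000
grid = [(i + 0.5) / N for i in range(N)]

def f(n, x):
    return n if 0 < x < 1 / n else 0

for n in (10, 100, 1000):
    print(n, sum(f(n, x) for x in grid) / N)   # ≈ 1 for every n
# ∫ liminf f_n dµ = 0 < 1 = liminf ∫ f_n dµ
```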
Namely, we need to define a natural σ-algebra on X × Y and a measure on that σ-algebra. A simple choice for the σ-algebra would be the Cartesian product of σ-algebras A × B. The sets in A × B take the form A × B for A ∈ A and B ∈ B, and such a set is called a (measurable) rectangle. However, A × B is not necessarily a σ-algebra. Hence, we consider the smallest σ-algebra generated by all rectangles, namely, σ(A × B). This will be our canonical choice of σ-algebra on the product space Ω = X × Y. It then remains to define a measure on this σ-algebra. This is provided by the following result.
Theorem 1.4.1 (Construction of product measure). Let (Ω1 , A , µ1 ) and (Ω2 , B, µ2 ) be two σ-finite mea-
sure spaces. Consider the product measurable space (Ω1 × Ω2 , σ(A × B)). Then there exists a unique mea-
sure µ on σ(A × B) such that
µ(A × B ) = µ1 (A) µ2 (B ) ∀ A ∈ A , B ∈ B.
PROOF. Denote S := A × B, the set of all measurable rectangles. Define a premeasure α : S → [0, ∞] by α(A × B) := µ_1(A) µ_2(B). Notice that S forms a semi-ring: for A, C ∈ A and B, D ∈ B,
(A × B) ∩ (C × D) = (A ∩ C) × (B ∩ D),
(A × B) \ (C × D) = (A × B) ∩ (C × D)^c
= (A × B) ∩ ((C^c × D) ⊔ (C × D^c) ⊔ (C^c × D^c))
= ((A ∩ C^c) × (B ∩ D)) ⊔ ((A ∩ C) × (B ∩ D^c)) ⊔ ((A ∩ C^c) × (B ∩ D^c)).
It remains to verify σ-additivity of α on S, since then Carathéodory's extension theorem (Thm. 1.1.36) will imply the desired result.
To this end, fix a rectangle A × B and suppose A × B = ⨆_{i=1}^∞ A_i × B_i for disjoint rectangles A_i × B_i, i ≥ 1. We wish to show that
µ_1(A) µ_2(B) = Σ_{i=1}^∞ µ_1(A_i) µ_2(B_i). (12)
To show the above, fix (x, y) ∈ A × B. Since A × B = ⨆_{i=1}^∞ A_i × B_i, we have 1(x ∈ A) 1(y ∈ B) = Σ_{i=1}^∞ 1(x ∈ A_i) 1(y ∈ B_i). Integrating with respect to y and using MCT, we get
1(x ∈ A) µ_2(B) = ∫ Σ_{i=1}^∞ 1(x ∈ A_i) 1(y ∈ B_i) dµ_2(y)
= Σ_{i=1}^∞ ∫ 1(x ∈ A_i) 1(y ∈ B_i) dµ_2(y)
= Σ_{i=1}^∞ 1(x ∈ A_i) ∫ 1(y ∈ B_i) dµ_2(y)
= Σ_{i=1}^∞ 1(x ∈ A_i) µ_2(B_i).
Then integrating with respect to x and using MCT again shows (12). □
Remark 1.4.2 (Lebesgue measure on R^d). By using Theorem 1.4.1 and an induction on the number of products, we can construct product measures on an arbitrary product space with finitely many factors. Namely, if (Ω_i, F_i, µ_i), i ≥ 1, are σ-finite measure spaces, then for any n ≥ 1, we have the product measure space
(∏_{i=1}^n Ω_i, σ(∏_{i=1}^n F_i), µ_1 ⊗ ··· ⊗ µ_n),
where µ_1 ⊗ ··· ⊗ µ_n is the unique measure on σ(∏_{i=1}^n F_i) such that
(µ_1 ⊗ ··· ⊗ µ_n)(A_1 × ··· × A_n) = ∏_{i=1}^n µ_i(A_i).
In particular, if (Ω_i, F_i, µ_i) = (R, B, λ), where B is the Borel σ-algebra and λ is the Lebesgue measure on R, then λ ⊗ ··· ⊗ λ = ⊗_{i=1}^n λ is the Lebesgue measure on the Borel σ-algebra on R^n.
Theorem 1.4.3 (Fubini's theorem). Let (Ω_1, F_1, µ_1) and (Ω_2, F_2, µ_2) be two σ-finite measure spaces. Let (Ω_1 × Ω_2, σ(F_1 × F_2), µ_1 ⊗ µ_2) be the product measure space. If a Borel measurable function f : Ω_1 × Ω_2 → R satisfies f ≥ 0 or ∫ |f| d(µ_1 ⊗ µ_2) < ∞, then
∫_{Ω_1×Ω_2} f d(µ_1 ⊗ µ_2) = ∫_{Ω_2} ∫_{Ω_1} f(x, y) dµ_1(x) dµ_2(y) (13)
= ∫_{Ω_1} ∫_{Ω_2} f(x, y) dµ_2(y) dµ_1(x).
There are two technical points we need to address before we can prove Fubini's theorem. In order to make sense of the second expression in (13), we need to make sure that:
(1) for each fixed x ∈ Ω_1, the function f_x : Ω_2 → R, y ↦ f(x, y) is measurable;
(2) the function g : Ω_1 → R, x ↦ ∫_{Ω_2} f(x, y) dµ_2(y) is measurable.
We will assume the above two points for the following proof.
P ROOF. We will only prove the first equality in (13). The second equality follows from symmetry.
(Indicator functions) Let f = 1E for some E ∈ σ(F1 × F2 ). Further assume E is a rectangle E 1 × E 2 . Then
1((x, y) ∈ E ) = 1(x ∈ E 1 )1(y ∈ E 2 ), so we can verify the assertion as
∫_{Ω1×Ω2} f d(µ1 ⊗ µ2) = (µ1 ⊗ µ2)(E) = µ1(E1) µ2(E2),

∫_{Ω2} ∫_{Ω1} f(x, y) dµ1(x) dµ2(y) = ∫_{Ω2} µ1(E1) 1(y ∈ E2) dµ2(y) = µ1(E1) µ2(E2).
What if E is not a rectangle? Then we proceed as follows. Let E denote the set of all measurable
sets in σ(F1 × F2 ) for which the assertion holds. Then one can show that E is a λ-system. (i.e.,
Ω ∈ E ; A, B ∈ E with A ⊆ B ⇒ B \ A ∈ E ; E is closed under ascending union). The measurable
rectangles form a π-system (i.e., closed under intersection) and we just showed that they belong
to E . Hence by Dynkin’s π − λ theorem (see Theorem 1.1.37), E should contain the σ-algebra
σ(F1 × F2 ) generated by the rectangles.
(Simple functions) Follows from linearity of integral and the assertion for characteristic functions.
(Nonnegative functions) Suppose f ≥ 0. Note that (µ1 ⊗µ2 )(Ω1 ×Ω2 ) = µ1 (Ω1 )µ2 (Ω2 ) by the construction
of the product measure. By the hypothesis on σ-finiteness, it follows that µ1 ⊗ µ2 is a σ-finite
measure. Choose measurable sets A_n ↗ Ω1 × Ω2 such that (µ1 ⊗ µ2)(A_n) < ∞. Define f_n : Ω1 ×
Ω2 → R, f_n(x, y) := ⌊2^n f(x, y) 1((x, y) ∈ A_n)⌋/2^n ∧ n, where ⌊·⌋ denotes the floor function.
Note that f_n is a simple function: it takes at most n 2^n distinct values on sets
of finite measure, as (µ1 ⊗ µ2)(f_n ≠ 0) ≤ (µ1 ⊗ µ2)(A_n) < ∞. Also, f_n ↗ f (µ1 ⊗ µ2)-a.e. Then the
assertion follows from the previous case and MCT:
∫_{Ω1×Ω2} f d(µ1 ⊗ µ2) = lim_{n→∞} ∫_{Ω1×Ω2} f_n d(µ1 ⊗ µ2)
= lim_{n→∞} ∫_{Ω2} ∫_{Ω1} f_n(x, y) dµ1(x) dµ2(y)
= ∫_{Ω2} ∫_{Ω1} f(x, y) dµ1(x) dµ2(y),

where the first and last equalities use MCT (the last one twice, for the inner and then the outer integral) and the middle equality is the assertion for simple functions. The case of general integrable f follows by writing f = f⁺ − f⁻ as before. □

An important special case is when both µ1 and µ2 are the counting measure on the positive integers, so that the iterated integrals become the iterated double sums ∑_{n=1}^∞ ∑_{m=1}^∞ f(n, m) and ∑_{m=1}^∞ ∑_{n=1}^∞ f(n, m).
Applying Fubini's theorem for the counting measure, we know that these two double sums agree if f ≥ 0 or if
f is absolutely summable: ∑_{n=1}^∞ ∑_{m=1}^∞ |f(n, m)| < ∞.
Exercise 1.4.5. Construct a counterexample to Fubini's theorem, where the infinite double sums
∑_{n=1}^∞ ∑_{m=1}^∞ f(n, m) and ∑_{m=1}^∞ ∑_{n=1}^∞ f(n, m) are not equal to each other.
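One classic candidate (our choice of example, not prescribed by the exercise) is f(n, m) = 1 if m = n, f(n, m) = −1 if m = n + 1, and f(n, m) = 0 otherwise. Each row and each column of this array has only finitely many nonzero entries, so the inner sums below are evaluated exactly; the two iterated sums come out to 0 and 1. A minimal Python sketch:

```python
def f(n, m):
    # Non-absolutely-summable array: +1 on the diagonal,
    # -1 just above it (indices n, m >= 1).
    if m == n:
        return 1
    if m == n + 1:
        return -1
    return 0

N = 1000  # outer truncation; the inner sums are exact since each row/column
          # of f has only finitely many nonzero entries near the diagonal

# Sum over m first (row sums), then over n: every row sums to 0.
row_first = sum(sum(f(n, m) for m in range(1, n + 2)) for n in range(1, N + 1))

# Sum over n first (column sums), then over m: column m = 1 sums to 1, the rest to 0.
col_first = sum(sum(f(n, m) for n in range(max(1, m - 1), m + 1)) for m in range(1, N + 1))

print(row_first, col_first)  # prints: 0 1
```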
1.5. Expectation
1.5.1. Measure-theoretic definition of expectation. The following is the most general definition of the
expectation of a RV in terms of a Lebesgue integral on the probability space.
Definition 1.5.1 (Expectation). Let X be a random variable from a probability space (Ω, F, P) into the real
line (R, B, µ), where B is the Borel σ-algebra on R and µ denotes the Lebesgue measure. The expectation
of X is defined as the Lebesgue integral of X with respect to the measure P,

E[X] := ∫_Ω X(ω) dP(ω),

provided the integral is well-defined (e.g., X ≥ 0 or E[|X|] < ∞).
An advantage of defining the expectation of a RV as a Lebesgue integral is that all properties and re-
sults we have established for Lebesgue integral (e.g., linearity, various inequalities, and limit theorems)
readily apply to expectation. For instance, from Theorem 1.3.7, if we have two integrable random vari-
ables X, Y on the same probability space (Ω, F, P), then for any a, b ∈ R, aX + bY is integrable and
(Linearity of expectation) E[aX + bY ] = a E[X ] + b E[Y ]. (14)
To see the measure-theoretic definition of expectation of a RV in Definition 1.5.1 agrees with our
undergraduate-level definition of expectation using PMFs and PDFs, we need a change of variables for-
mula ('u-substitution'). But for the case of discrete RVs, a direct computation shows the connection
between these two definitions of expectation. Consider the following computation: if we have a
discrete RV X = ∑_{i=1}^∞ x_i 1_{A_i}, then
E[X] = ∫_Ω ∑_{i=1}^∞ x_i 1(ω ∈ A_i) dP(ω)
(?) = ∑_{i=1}^∞ ∫_Ω x_i 1(ω ∈ A_i) dP(ω)
= ∑_{i=1}^∞ x_i ∫_Ω 1(ω ∈ A_i) dP(ω)
= ∑_{i=1}^∞ x_i P(A_i)
= ∑_{i=1}^∞ x_i P(X = x_i).
Note that the last expression is the usual formula for the expectation of the discrete RV X . The only caveat
in this computation is the second equality, where we swapped the integral on Ω and the infinite sum (or
integral against the counting measure). This would be immediate if the sum over i were finite, by linearity of
the integral in Theorem 1.3.7. In general, Fubini's theorem (Section 1.4) gives sufficient conditions under which we can
interchange the order of two integrals. Here, we give a direct
justification of the above computation using MCT.
Proposition 1.5.2 (Expectation of discrete RV). Suppose X is a discrete random variable taking constant
values x i on disjoint measurable sets A i for i ≥ 1. If X is integrable, then
E[X] = ∑_{i=1}^∞ x_i P(X = x_i).
P ROOF. We first assume X ≥ 0. Let X_n := ∑_{i=1}^n x_i 1_{A_i}. Then X_n ↗ X. So we have

E[X] = lim_{n→∞} E[X_n]    (∵ MCT, Theorem 1.3.19)
= lim_{n→∞} ∑_{i=1}^n x_i P(X = x_i)    (∵ definition of E)
= ∑_{i=1}^∞ x_i P(X = x_i)    (∵ def. of infinite sum).
Hence the assertion holds for nonnegative RVs. For the general case, write X = X⁺ − X⁻. Since X is
integrable, E[X⁺], E[X⁻] < ∞. Note that X⁺ = ∑_{i=1}^∞ (x_i ∨ 0) 1_{A_i} and X⁻ = ∑_{i=1}^∞ (−x_i ∨ 0) 1_{A_i}. By the assertion
for nonnegative RVs and linearity,

E[X] = E[X⁺] − E[X⁻] = ∑_{i=1}^∞ (x_i ∨ 0) P(A_i) − ∑_{i=1}^∞ (−x_i ∨ 0) P(A_i) = ∑_{i=1}^∞ x_i P(X = x_i),

where the second-to-last equality uses linearity of the integral against the counting measure. □
Theorem 1.5.3 (Change of variables). Let X be a measurable map from a probability space (Ω, F, P) to
a measurable space (Ω′, F′).¹⁴ Let µ denote the measure on F′ induced by X, that is, µ(A) := P(X ∈ A) for A ∈ F′.¹⁵ If f : Ω′ → R
is Borel measurable and satisfies f ≥ 0 or E[|f(X)|] < ∞, then

E[f(X)] = ∫_{Ω′} f(u) dµ(u).
P ROOF. We verify the formula following the usual pipeline by establishing the assertion for f = char-
acteristic function, simple function, nonnegative function, and then general integrable function.
(Indicator functions) Let B ∈ F′ and f := 1_B. Then

E[f(X)] = E[1_B(X)] = E[1(X ∈ B)] = P(X ∈ B) = µ(B) = ∫_{Ω′} f dµ.
(Simple functions) Let f := ∑_{i=1}^m a_i 1_{A_i} for disjoint events A_1, …, A_m ∈ F′. Then by linearity of expectation (14) and of the Lebesgue integral (see Theorem 1.3.7) and the assertion for indicator functions,

E[f(X)] = E[ ∑_{i=1}^m a_i 1(X ∈ A_i) ] = ∑_{i=1}^m a_i E[1(X ∈ A_i)] = ∑_{i=1}^m a_i ∫_{Ω′} 1_{A_i} dµ = ∫_{Ω′} ∑_{i=1}^m a_i 1_{A_i} dµ = ∫_{Ω′} f dµ.
(Nonnegative functions) Let f ≥ 0. Note that µ is a finite measure since µ(Ω′) = P(X ∈ Ω′) = 1. Define
f_n : Ω′ → R, f_n(x) := ⌊2^n f(x)⌋/2^n ∧ n, where ⌊·⌋ denotes the floor function. Note that f_n is a simple function, since it takes at most n 2^n distinct values on sets of finite
measure (note that µ(f_n ≠ 0) < ∞ since µ is a finite measure). Also, f_n ↗ f µ-a.e. Furthermore,
f_n(X) is also a simple function and f_n(X) ↗ f(X) P-a.e. Hence by using the monotone convergence theorem for the two sequences f_n(X) and f_n, together with the assertion for simple functions,

E[f(X)] = lim_{n→∞} E[f_n(X)] = lim_{n→∞} ∫_{Ω′} f_n dµ = ∫_{Ω′} lim_{n→∞} f_n dµ = ∫_{Ω′} f dµ.
(Integrable functions) Write f = f⁺ − f⁻. The hypothesis E[|f(X)|] < ∞ ensures that E[f⁺(X)], E[f⁻(X)] < ∞.
Then using the assertion for nonnegative functions and by linearity of expectation and of the
Lebesgue integral,

E[f(X)] = E[f⁺(X)] − E[f⁻(X)] = ∫_{Ω′} f⁺ dµ − ∫_{Ω′} f⁻ dµ = ∫_{Ω′} (f⁺ − f⁻) dµ = ∫_{Ω′} f dµ.

□
¹⁴ X is called a random element in Ω′. In the special case Ω′ = R, X is called a random variable.
¹⁵ Such a measure µ is called the pushforward of P under X, denoted X∗(P). When Ω′ = R and F′ = B is the Borel σ-algebra, the
pushforward measure is called the distribution of X (see Def. 1.2.11).
Remark 1.5.4. The above "change of variables" formula is the measure-theoretic version of "u-substitution".
It allows us to transfer an integral on one domain Ω to another domain Ω′. Usually we use it with the new domain
Ω′ = R. To make such a connection, writing h := X and µ := P ◦ h⁻¹, Theorem 1.5.3 states that

∫_Ω f(h(ω)) dP(ω) = ∫_{Ω′} f(u) d(P ◦ h⁻¹)(u).

In the LHS integral, the variable ω lives in the original sample space Ω. In the RHS integral, the variable
u lives in the new space Ω′. Here we are performing the 'u-substitution' u := h(ω).
Exercise 1.5.5. Suppose X is a discrete random variable taking values x_i on disjoint measurable sets
A_i for i ≥ 1. Suppose X is integrable. Then by Theorem 1.5.3 with Ω′ = R and f(x) = x,

E[X] = ∫_R u d(P ◦ X⁻¹)(u).

From this, derive the formula E[X] = ∑_{i=1}^∞ x_i P(X = x_i). (Hint: note that P ◦ X⁻¹ is a probability measure
on R that puts mass P(A_i) on {x_i} for i ≥ 1.)
Exercise 1.5.6 (Stieltjes integral and Lebesgue integral). Let (Ω, F, µ) be a measure space with µ being
σ-finite. Let ν(A) = ∫ ρ 1_A dµ for some Borel measurable function ρ : Ω → R (that is, ν ≪ µ with Radon–Nikodym derivative dν/dµ = ρ) such that ∫ ρ dµ < ∞. Following the standard procedure, show that for an
integrable function f : Ω → R,

∫ f dν = ∫ f ρ dµ.
Now we can connect the measure theoretic definition of expectation with the usual definition of
expectation for continuous RVs.
Proposition 1.5.7 (Expectation of a continuous RV). Suppose X is a continuous random variable with PDF
f, that is, P(X ∈ B) = ∫_B f dµ for each Borel set B, where µ is the Lebesgue measure. Then

E[X] = ∫_R x f(x) dx.
Exercise 1.5.8 (Tail sum formula for expectation of discrete RVs). Let X be a RV taking values in the nonnegative integers.
(i) Show that P(X ≥ x) = ∑_{y=x}^∞ P(X = y) for each integer x ≥ 1.
(ii) Use (i) and Fubini's theorem (see Theorem 1.4.3) to show

∑_{x=1}^∞ P(X ≥ x) = ∑_{y=1}^∞ ∑_{x=1}^y P(X = y) = ∑_{y=1}^∞ y P(X = y) = E[X].
Exercise 1.5.9 (Tail sum formula for expectation of continuous RVs). Let X be a continuous RV with PDF
f X and suppose f X (x) = 0 for all x < 0. Use Fubini’s theorem to show that
E[X] = ∫_0^∞ P(X > x) dx.
The following statement generalizes the tail-sum formulas in Exercises 1.5.8 and 1.5.9.
Proposition 1.5.10 (Tail sum formula for expectation for nonnegative RVs). Let X ≥ 0 be a RV on a prob-
ability space (Ω, F , P). Then for each p ≥ 1,
E[X^p] = ∫_0^∞ p x^{p−1} P(X > x) dx.
P ROOF. The statement for general p ≥ 1 follows from that for p = 1 and a change of variable. Namely,
since |X |p itself is a nonnegative RV,
E[|X|^p] = ∫_0^∞ P(|X|^p > x) dx = ∫_0^∞ P(|X| > t) p t^{p−1} dt

by the change of variables t = x^{1/p} (i.e., x = t^p). Below we will show the statement for p = 1.
Let λ denote the Lebesgue measure on R. Consider the joint measure space (Ω × R, σ(F × B), P × λ),
where B denotes the Borel σ-algebra on R. Denote A := {(ω, x) | 0 ≤ x < X (ω)}. We claim that A ∈ σ(F ×
B). Indeed, define a function G : Ω × R → R by G(ω, x) := X (ω) − x. Note that the functions (ω, x) 7→
X (ω) and (ω, x) 7→ x are measurable by considering their inverse images of the sets (−∞, a], and G is the
difference of these functions, so it is measurable. Hence
G −1 ((−∞, 0)) = {(ω, x) : G(ω, x) < 0} = {(ω, x) : x < X (ω)} ∈ σ(F × B).
It follows that
A = G −1 ((−∞, 0)) ∩ (Ω × [0, ∞)) ∈ σ(F × B).
Thus 1_A : Ω × R → R is measurable and nonnegative, so by Fubini's theorem,

∫_Ω ∫_R 1_A dλ(x) dP(ω) = ∫_R ∫_Ω 1_A dP(ω) dλ(x).
Note that the LHS equals

∫_Ω ∫_R 1_A dλ(x) dP(ω) = ∫_Ω ∫_R 1(0 ≤ x < X(ω)) dλ(x) dP(ω) = ∫_Ω ∫_0^{X(ω)} dλ(x) dP(ω) = ∫_Ω X(ω) dP(ω).
The RHS equals

∫_R ∫_Ω 1_A dP(ω) dλ(x) = ∫_R ∫_Ω 1(0 ≤ x < X(ω)) dP(ω) dλ(x) = ∫_0^∞ ∫_Ω 1(x < X(ω)) dP(ω) dλ(x) = ∫_0^∞ P(X > x) dλ(x).
This shows the desired assertion. □
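As a sanity check on Proposition 1.5.10, the sketch below (Python, standard library only) compares E[X^p] for X ∼ Exp(1), which equals p! for integer p, against a midpoint-rule evaluation of ∫_0^∞ p x^{p−1} P(X > x) dx with the exponential tail P(X > x) = e^{−x}.

```python
import math

def tail_sum_moment(p, tail, x_max=60.0, steps=600_000):
    """Approximate E[X^p] = int_0^inf p x^(p-1) P(X > x) dx by a midpoint Riemann sum."""
    dx = x_max / steps
    return sum(p * ((k + 0.5) * dx) ** (p - 1) * tail((k + 0.5) * dx)
               for k in range(steps)) * dx

exp_tail = lambda x: math.exp(-x)   # P(X > x) for X ~ Exp(1)

for p in (1, 2, 3):
    approx = tail_sum_moment(p, exp_tail)
    exact = math.factorial(p)       # E[X^p] = p! for Exp(1)
    print(p, round(approx, 6), exact)   # the two columns agree to several digits
```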
Next, we will see a nice application of linearity of expectation.
Example 1.5.12 (Doubling strategy). Suppose we bet $x on a game where we have to predict whether
a fair coin flip comes up heads. We win $x on heads and lose $x on tails. Suppose we are playing the
‘doubling strategy’. Namely, we start betting $1, and until the first time we win, we double our bet; upon
the first win, we quit the game. Let X be the random variable giving the net gain of the overall game. How
can we evaluate this strategy?
Let N be the random number of coin flips we have to encounter until we see the first head. For
instance,
P(N = 1) = P({H }) = 1/2,
P(N = 2) = P({(T, H )}) = 1/22
P(N = 3) = P({(T, T, H )}) = 1/23 .
In general, we have
P(N = k) = 1/2k .
Note that on the event that N = k (i.e., we bet k times), we lose 1 + 2 + ··· + 2^{k−2} = 2^{k−1} − 1 dollars on the first k − 1 bets and win 2^{k−1} dollars on the kth bet, so the net gain is X|{N = k} = 2^{k−1} − (2^{k−1} − 1) = 1.
Hence our net gain is $1 with probability 1. In particular, the expected net gain is also 1:

E[X] = 1 · P(X = 1) = 1.
What if we use a tripling strategy? ▲
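A quick simulation makes the example concrete: the doubling strategy always nets exactly $1, but the capital put at risk along the way is unbounded, and the tripling variant (our own modification, included for comparison) no longer yields a deterministic gain. A minimal Python sketch:

```python
import random

def play(multiplier=2):
    """Bet 1, multiply the stake by `multiplier` after each loss, stop at first win.
    Returns (net gain, total amount wagered)."""
    stake, net, wagered = 1, 0, 0
    while True:
        wagered += stake
        if random.random() < 0.5:      # heads: win the stake and quit
            return net + stake, wagered
        net -= stake                   # tails: lose the stake and raise the bet
        stake *= multiplier

random.seed(0)
games = [play() for _ in range(100_000)]
print(set(g for g, _ in games))        # {1}: the net gain is always exactly $1
print(max(w for _, w in games))        # but the capital at risk is occasionally huge
print(sorted(set(g for g, _ in (play(3) for _ in range(100_000)))))  # tripling: varies
```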
Exercise 1.5.13 (Jensen's inequality for multivariate convex functions). Suppose a function φ : R^d → R is
convex, that is, φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y) for all x, y ∈ R^d and λ ∈ [0, 1]. Let X_1, …, X_d be RVs
such that E[|X_i|] < ∞ for i = 1, …, d and E[|φ(X_1, …, X_d)|] < ∞. Then show that

E[φ(X_1, …, X_d)] ≥ φ(E[X_1], …, E[X_d]).
1.5.2. Variance and moments. Say you play two different games where in the first game, you win or
lose $1 depending on a fair coin flip, and in the second game, you win or lose $10. In both games, your
expected winning is 0. But the two games are different in how much the outcome fluctuates around the
mean. This notion of fluctuation is captured by the following quantity called 'variance'.
Definition 1.5.14 (Covariance and variance). Let X, Y be RVs on the same probability space (Ω, F, P). We
define the covariance of X and Y as Cov(X, Y) := E[(X − E[X])(Y − E[Y])]. The variance of X is defined by
Var(X) := Cov(X, X) = E[(X − E[X])²].
Exercise 1.5.15 (Covariance is symmetric and bilinear). Let X , Y be RVs on the same probability space
(Ω, F , P) and fix constants a, b ∈ R. Show the following.
(i) Cov(aX + b, Y ) = aCov(X , Y ).
(ii) Cov(X + Z , Y ) = Cov(X , Y ) + Cov(Z , Y ).
(iii) Cov(X , Y ) = Cov(Y , X ).
Definition 1.5.16 (Moments of RVs). Let X be a RV. For each k ≥ 1, the kth moment of X is defined as
E[X k ].
Exercise 1.5.17 (Finite second moment implies finite first moment). Let X be a RV with E[X 2 ] < ∞.
(i) Show that
E[|X |] = E[|X |1(|X | ≤ 1)] + E[|X |1(|X | > 1)] ≤ 1 + E[X 2 1(|X | > 1)] ≤ 1 + E[X 2 ] < ∞.
Hence X has finite first moment and is integrable.
(ii) Deduce that Var(X ) < ∞.
Example 1.5.18 (Higher moments). Let X be a continuous RV with PDF f_X and suppose f_X(x) = 0 for all
x < 0. We will show that for any real number α > 0,

E[X^α] = ∫_0^∞ x^α f_X(x) dx.

First, use Exercise 1.5.9 (applied to the nonnegative RV X^α) to write

E[X^α] = ∫_0^∞ P(X^α > x) dx = ∫_0^∞ P(X > x^{1/α}) dx = ∫_0^∞ ∫_{x^{1/α}}^∞ f_X(t) dt dx.
We then use Fubini's theorem to change the order of integration. This gives

E[X^α] = ∫_0^∞ ∫_{x^{1/α}}^∞ f_X(t) dt dx = ∫_0^∞ ∫_0^{t^α} f_X(t) dx dt = ∫_0^∞ t^α f_X(t) dt,
as desired. ▲
Exercise 1.5.19. Let X be a continuous RV with PDF f_X and suppose f_X(x) = 0 for all x < 0. Let g : R → R
be a strictly increasing function. Use Fubini's theorem and the tail sum formula for expectation to show

E[g(X)] = ∫_0^∞ g(x) f_X(x) dx.

Furthermore, assume X is a general continuous RV and g is a measurable function with E[|g(X)|] < ∞.
Show that

E[g(X)] = ∫_R g(x) f_X(x) dx

by using Theorem 1.5.3 and Exercise 1.5.6.
The following exercise ties the expectation and the variance of a RV into a problem of finding a point
estimator that minimizes the mean squared error.
Exercise 1.5.20 (Variance as minimum MSE). Let X be a RV. Let x̂ ∈ R be a number, which we consider
as a ‘guess’ (or ‘estimator’ in statistics) of X . Let E[(X − x̂)2 ] be the mean squared error (MSE) of this
estimation.
(i) Show that
E[(X − x̂)2 ] = (x̂ − E[X ])2 + Var(X ).
(ii) Conclude that the MSE is minimized when x̂ = E[X ] and the global minimum is Var(X ). In this sense,
E[X ] is the ‘best guess’ for X and Var(X ) is the corresponding MSE.
Example 1.5.21 (Linear transform). In this example, we argue that Cov(X , Y ) measures the ‘linear ten-
dency’ between X and Y . Let X be a RV, and define another RV Y by Y = aX + b for some constants
a, b ∈ R. Let’s compute their covariance using linearity of expectation.
Cov(X , Y ) = Cov(X , aX + b)
= E(aX 2 + bX ) − E(X )E(aX + b)
= a E(X 2 ) + b E(X ) − E(X )(a E(X ) + b)
= a[E(X 2 ) − E(X )2 ]
= a Var(X ).
Thus, Cov(X, aX + b) > 0 if a > 0 and Cov(X, aX + b) < 0 if a < 0. In other words, if Cov(X, Y) > 0, then X
and Y tend to be large at the same time; if Cov(X, Y) < 0, then Y tends to be small when X is large.
▲
Next, let’s say four RVs X , Y , Z , and W are given. Suppose that Cov(X , Y ) > Cov(Z ,W ) > 0. Can we
say that ‘the positive linear relation’ between X and Y is stronger than that between Z and W ? Not quite.
So to compare the magnitude of covariance, we first need to properly normalize covariance so that
the effect of fluctuation (variance) of each coordinate is not counted: then only the correlation between
the two coordinates will contribute. This is captured by the following quantity.
Definition 1.5.23 (Correlation coefficient). Given two RVs X and Y, their correlation coefficient ρ(X, Y)
is defined by

ρ(X, Y) = Cov(X, Y) / ( √Var(X) √Var(Y) ).
Example 1.5.24. Suppose X is a RV and fix constants a, b ∈ R with a ≠ 0. Then

ρ(X, aX + b) = a Cov(X, X) / ( √Var(X) √Var(aX + b) ) = a Var(X) / ( √Var(X) √(a² Var(X)) ) = a/|a| = sign(a).

▲
Exercise 1.5.25 (Cauchy–Schwarz inequality). Let X, Y be RVs. Suppose E[Y²] > 0. We will show that the
'inner product' of X and Y is at most the product of their 'magnitudes'.
(i) For any t ∈ R, show that

E[(X − tY)²] = E[Y²] ( t − E[XY]/E[Y²] )² + ( E[X²]E[Y²] − E[XY]² ) / E[Y²].

Conclude that

0 ≤ E[ ( X − (E[XY]/E[Y²]) Y )² ] = ( E[X²]E[Y²] − E[XY]² ) / E[Y²].

(ii) Show that a RV Z satisfies E[Z²] = 0 if and only if P(Z = 0) = 1.
(iii) Show that

E[XY] ≤ √(E[X²]) √(E[Y²]),

where the equality holds if and only if

P( X = (E[XY]/E[Y²]) Y ) = 1.
Exercise 1.5.26. Let X, Y be RVs such that Var(Y) > 0. Let X̄ = X − E[X] and Ȳ = Y − E[Y].
(i) Use Exercise 1.5.25 to show that

0 ≤ E[ ( X̄ − (Cov(X, Y)/Var(Y)) Ȳ )² ] = Var(X) (1 − ρ(X, Y)²).

(ii) Show that |ρ(X, Y)| ≤ 1.
(iii) Show that |ρ(X, Y)| = 1 if and only if X̄ = a Ȳ for some constant a ≠ 0.
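These properties are easy to observe numerically; the Python sketch below estimates sample correlation coefficients and recovers ρ(X, aX + b) = sign(a) (up to floating-point error), plus a value strictly inside (−1, 1) once independent noise is added.

```python
import random, math

def corr(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(vx * vy)

random.seed(1)
X = [random.gauss(0, 1) for _ in range(50_000)]
print(round(corr(X, [2 * x + 5 for x in X]), 4))    # ~ +1.0 (a > 0)
print(round(corr(X, [-3 * x + 1 for x in X]), 4))   # ~ -1.0 (a < 0)
Z = [random.gauss(0, 1) for _ in X]
print(round(corr(X, [x + 2 * z for x, z in zip(X, Z)]), 4))  # strictly between -1 and 1
```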
1.6. Examples
1.6.1. Discrete RVs.
Example 1.6.1 (Bernoulli RV). A RV X is a Bernoulli variable with (success) probability p ∈ [0, 1] if it
takes value 1 with probability p and 0 with probability 1− p. In this case we write X ∼ Bernoulli(p). Then
E(X ) = p and Var(X ) = p(1 − p).
Example 1.6.2 (Indicator variables). Let (Ω, F, P) be a probability space and let E ∈ F be an event. The
indicator variable of the event E, which is denoted by 1_E, is the RV such that 1_E(ω) = 1 if ω ∈ E and
1_E(ω) = 0 if ω ∈ E^c. Then 1_E is a Bernoulli variable with success probability p = P(E). ▲
Example 1.6.3 (Binomial RV). Let X_1, X_2, …, X_n be independent and identically distributed Bernoulli(p)
variables. Let X = X_1 + ··· + X_n. One can think of flipping the same probability-p coin n times; then X is
the total number of heads. Note that X has the following PMF:

P(X = k) = (n choose k) p^k (1 − p)^{n−k}

for integers 0 ≤ k ≤ n, and P(X = k) = 0 otherwise. We say X follows the Binomial distribution with
parameters n and p, and write X ∼ Binomial(n, p).
We can compute the mean and variance of X using the above PMF directly, but it is much easier to
break it up into Bernoulli variables and use linearity. Recall that X i ∼ Bernoulli(p) and we have E[X i ] = p
and Var(X_i) = p(1 − p) for each 1 ≤ i ≤ n (from Example 1.6.1). So by linearity of expectation, E[X] = E[X_1] + ··· + E[X_n] = np.
On the other hand, since the X_i's are independent, the variance of X is the sum of the variances of the X_i's, so Var(X) = np(1 − p). ▲
Example 1.6.4 (Geometric RV). Suppose we flip a probability p coin until it lands heads. Let X be the
total number of trials until the first time we see heads. Then in order for X = k, the first k − 1 flips must
land on tails and the kth flip should land on heads. Since the flips are independent of each other,

P(X = k) = (1 − p)^{k−1} p.

This is valid for each positive integer k, and P(X = k) = 0 otherwise. Such a RV is called a Geometric RV with
(success) parameter p, and we write X ∼ Geom(p).
The mean and variance of X can be easily computed using its moment generating function, which
we will learn soon in this course. A direct computation (differentiating the geometric series ∑_{k≥1} k x^{k−1} = 1/(1−x)² at x = 1 − p) gives

E(X) = 1/p.

(In fact, one can apply Exercise 1.5.8 and quickly compute the expectation of a Geometric RV.)
Exercise 1.6.5. Let X ∼ Geom(p). Use a similar computation as we had in Example 1.6.4 to show E(X 2 ) =
(2 − p)/p 2 . Using the fact that E(X ) = 1/p, conclude that Var(X ) = (1 − p)/p 2 .
A RV X is said to have the Poisson distribution with parameter λ > 0 if

P(X = k) = λ^k e^{−λ} / k!    (15)

for all nonnegative integers k ≥ 0. We write X ∼ Poisson(λ).
Poisson distribution is obtained as a limit of the Binomial distribution as the number n of trials tend
to infinity while the mean np is kept at constant λ. Namely, let Y ∼ Binomial(n, p) and suppose np = λ.
This means that we expect to see λ successes out of n trials. Then what is the probability that we see,
say, k successes out of n trials, when n is large? Since the mean is λ, this probability should be very small
when k is large compared to λ. Indeed, we can rewrite the Binomial PMF as

P(Y = k) = [ n(n − 1)(n − 2) ··· (n − k + 1) / k! ] p^k (1 − p)^{n−k}
= (1 − 1/n)(1 − 2/n) ··· (1 − (k − 1)/n) · ((np)^k / k!) (1 − p)^{n−k}
= (1 − 1/n)(1 − 2/n) ··· (1 − (k − 1)/n) · (λ^k / k!) (1 − λ/n)^{n−k}.
As n tends to infinity, the limit of the last expression is precisely the right-hand side of (15).
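This limit is easy to see numerically; the Python sketch below compares the Binomial(n, λ/n) PMF with the Poisson(λ) PMF for λ = 3 and increasing n.

```python
from math import comb, exp, factorial

lam = 3.0

def binom_pmf(n, k):
    p = lam / n
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

for n in (10, 100, 1000):
    gap = max(abs(binom_pmf(n, k) - poisson_pmf(k)) for k in range(8))
    print(n, gap)   # the maximum PMF discrepancy over k = 0..7 shrinks as n grows
```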
If µ = 0 and σ² = 1, then X is called a standard normal RV. Note that if X ∼ N(µ, σ²), then Y := X − µ has
PDF

f_Y(x) = (1/√(2πσ²)) e^{−x²/(2σ²)}.

Since this is an even function, it follows that E(Y) = 0. Hence E(X) = µ. ▲
Exercise 1.6.13 (Gaussian integral). In this exercise, we will show ∫_{−∞}^∞ e^{−x²} dx = √π.
(i) Show that

∫ x e^{−x²} dx = −(1/2) e^{−x²} + C.

(ii) Let I = ∫_{−∞}^∞ e^{−x²} dx. Show that

I² = ∫_{−∞}^∞ ∫_{−∞}^∞ e^{−(x²+y²)} dx dy.

(iii) Use polar coordinates (r, θ) to rewrite

I² = ∫_0^{2π} ∫_0^∞ r e^{−r²} dr dθ = 2π ∫_0^∞ r e^{−r²} dr.

Then use (i) to deduce I² = π. Conclude I = √π.
Exercise 1.6.14. Let X ∼ N(µ, σ²). In this exercise, we will show Var(X) = σ².
(i) Show that Var(X) = Var(X − µ).
(ii) Use integration by parts and Exercise 1.6.13 to show that

∫_0^∞ x² e^{−x²} dx = [ x · (−(1/2) e^{−x²}) ]_0^∞ + (1/2) ∫_0^∞ e^{−x²} dx = √π/4.

(iii) Use the change of variables x = √2 σ t and (ii) to show

∫_0^∞ (x²/√(2πσ²)) e^{−x²/(2σ²)} dx = (2σ²/√π) ∫_0^∞ t² e^{−t²} dt = σ²/2.

Use (i) to conclude Var(X) = σ².
Proposition 1.6.15 (Linear transform). Let X be a RV with PDF f_X. Fix constants a, b ∈ R with a ≠ 0, and
define a new RV Y = aX + b. Then

f_{aX+b}(y) = (1/|a|) f_X((y − b)/a).
P ROOF. First suppose a > 0. Then

F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(X ≤ (y − b)/a) = ∫_{−∞}^{(y−b)/a} f_X(t) dt.

By differentiating the last integral in y, we get

f_Y(y) = dF_Y(y)/dy = (1/a) f_X((y − b)/a).

For a < 0, a similar calculation shows

F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(X ≥ (y − b)/a) = ∫_{(y−b)/a}^∞ f_X(t) dt,

so we get

f_Y(y) = dF_Y(y)/dy = −(1/a) f_X((y − b)/a).

This shows the assertion. □
Example 1.6.16 (Linear transform of normal RV is normal). Let X ∼ N(µ, σ²) and fix constants a, b ∈ R
with a ≠ 0. Define a new RV Y = aX + b. Then since

f_X(x) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ),

by Proposition 1.6.15, we have

f_Y(y) = (1/√(2π(aσ)²)) exp( −(y − b − aµ)²/(2(aσ)²) ).

Notice that this is the PDF of a normal RV with mean aµ + b and variance (aσ)². In particular, if we take
a = 1/σ and b = −µ/σ, then Y = (X − µ)/σ ∼ N(0, 1), the standard normal RV. This is called standardization
of a normal RV.
Proposition 1.6.17 (Sum of ind. normal RVs is normal). Let X ∼ N (µ1 , σ21 ) and Y ∼ N (µ2 , σ22 ) be indepen-
dent normal RVs. Then
X + Y ∼ N (µ1 + µ2 , σ21 + σ22 ).
P ROOF. Details omitted. One can use the convolution formula or moment generating functions. □
Exercise 1.6.18. Compute the following probabilities using a standard normal table.
(i) P(−1 ≤ X ≤ 2) where X ∼ N (0, 32 ).
(ii) P(X 2 + X − 1 ≥ 0) where X ∼ N (1, 1).
(iii) P(exp(2X + 2Y ) − 3 exp(X + Y ) + 2 ≤ 0) where X ∼ N (0, 1) and Y ∼ N (−2, 3).
Exercise 1.6.19 (Beta distribution). A random variable X taking values in [0, 1] has the Beta distribution with
parameters α and β, which we denote by Beta(α, β), if it has PDF

f_X(x) = [ Γ(α + β) / (Γ(α)Γ(β)) ] x^{α−1} (1 − x)^{β−1},

where Γ(z) is the Euler Gamma function defined by

Γ(z) = ∫_0^∞ x^{z−1} e^{−x} dx.

(i) Use integration by parts to show the following recursion:

Γ(z + 1) = zΓ(z).

Deduce that Γ(n) = (n − 1)! for all integers n ≥ 1.

(ii) Write A_{n,k} = ∫_0^1 y^k (1 − y)^{n−k} dy. Use integration by parts to show that

A_{n,k} = [ k / (n − k + 1) ] A_{n,k−1}

for all 1 ≤ k ≤ n. Conclude that for all 0 ≤ k ≤ n,

A_{n,k} = 1 / ( (n choose k) (n + 1) ).

(iii) Let X ∼ Beta(k + 1, n − k + 1). Use (i) to show that

f_X(x) = [ (n + 1)! / (k!(n − k)!) ] x^k (1 − x)^{n−k} = x^k (1 − x)^{n−k} / A_{n,k}.

Use (ii) to verify that the above function is indeed a PDF (i.e., it integrates to 1).

(iv) Show that if X ∼ Beta(α, β), then

E[X] = α/(α + β),  Var(X) = αβ / ( (α + β)² (α + β + 1) ).
CHAPTER 2
Independence
According to Durrett [Dur19], “Measure theory ends and probability theory begins with the definition
of independence.” Also, Kac [Kac87] says “Independence is the central concept of probability theory and
few would believe today that understanding what it meant was ever a problem.”
Definition 2.1.1 (Independence between two objects). Let (Ω, F , P) be a probability space and let X , Y
be random variables on it.
(i) Two events A, B ∈ F are independent if P(A ∩ B ) = P(A) P(B ).
(ii) The RVs X , Y are independent if P(X ∈ C , Y ∈ D) = P(X ∈ C ) P(Y ∈ D) for all Borel subsets C , D of R.
Equivalently, X, Y are independent if the events X⁻¹(C) and Y⁻¹(D) are independent for all
Borel subsets C , D of R.
(iii) Let G, H ⊆ F be two σ-algebras on Ω contained in F. Then G, H are independent if the events
A, B are independent for all A ∈ G and B ∈ H.
Definition 2.1.2 (σ-algebra generated by a function). Let Ω be a sample space and let f : Ω → R be an
arbitrary function. The σ-algebra generated by f, denoted by σ(f), is defined as

σ(f) := { f⁻¹(B) : B ⊆ R a Borel subset }.
Exercise 2.1.3 (σ-algebra generated by functions). Let Ω be a sample space and let f : Ω → R be an
arbitrary function. Show the following:
(i) Show that σ( f ) is indeed a σ-algebra.
(ii) Show that σ( f ) is the smallest σ-algebra that makes f Borel measurable. That is, show that, denoting
B = Borel σ-algebra on R,
σ(f) = ∩ { F ⊆ 2^Ω : F is a σ-algebra such that f is (F − B)-measurable }.
(iii) Let X be a RV on a probability space (Ω, F , P). Show that σ(X ) ⊆ F .
The following proposition shows that independence between RVs is equivalent to independence be-
tween the σ-algebras generated by them.
Proposition 2.1.4. Let (Ω, F , P) be a probability space and let X , Y be random variables on it. The follow-
ing hold:
(i) If X , Y are independent, then σ(X ), σ(Y ) are independent.
(ii) If G, H ⊆ F are independent σ-algebras on Ω and if the functions X, Y : Ω → R are (G − B)- and (H − B)-
measurable, respectively,1 then X, Y are independent. In particular, if σ(X), σ(Y) are independent, then X, Y are independent.
P ROOF. (i) First recall that by Exercise 2.1.3, σ(X ), σ(Y ) are σ-algebras on Ω contained in F . Fix
A ∈ σ(X) and B ∈ σ(Y). By the definition of the σ-algebra generated by a function, there exist Borel
sets C, D ⊆ R such that A = X⁻¹(C), B = Y⁻¹(D). Then
P(A ∩ B ) = P(X −1 (C ) ∩ Y −1 (D)) = P(X ∈ C , Y ∈ D) = P(X ∈ C ) P(Y ∈ D)
= P(X −1 (C )) P(Y −1 (D)) = P(A) P(B ).
So A, B are independent. Since A, B were arbitrary, σ(X ), σ(Y ) are independent.
(ii) Let C , D be Borel subsets of R. We wish to show that P(X ∈ C , Y ∈ D) = P(X ∈ C ) P(Y ∈ D). But
since X −1 (C ) ∈ G , Y −1 (D) ∈ H and since G , H are independent, we have P(X −1 (C )∩Y −1 (D)) =
P(X⁻¹(C)) P(Y⁻¹(D)). The assertion then follows.
□
The following proposition shows that independence between events is equivalent to the indepen-
dence between the corresponding indicator RVs.
Proposition 2.1.5. Let (Ω, F, P) be a probability space and let A, B ∈ F be events.
(i) If A, B are independent, then so are the pairs (A, B^c), (A^c, B), and (A^c, B^c).
(ii) A, B are independent if and only if the indicator RVs 1_A and 1_B are independent.

P ROOF. (i) Subtracting P(A ∩ B) = P(A)P(B) from P(B) = P(B) shows P(A^c ∩ B) = P(A^c)P(B). By
symmetry, we also get P(B c ∩ A) = P(B c )P(A). Subtracting the former from P(A c ) = P(A c ) yields
P(A c ∩ B c ) = P(A c )P(B c ).
(ii) Recall that σ(1E ) = {;, E , E c , Ω} for any event E ∈ F (see Example 1.2.4). Hence part (i) shows that
A, B are independent if and only if σ(1 A ) and σ(1B ) are independent. This is if and only if 1 A
and 1B are independent by Proposition 2.1.7.
□
Next, we introduce independence between several objects.
Definition 2.1.6 (Independence between finitely many objects). Let (Ω, F , P) be a probability space.
(i) Events A_1, …, A_n ∈ F are independent if for all subsets I ⊆ {1, …, n},

P( ∩_{i∈I} A_i ) = ∏_{i∈I} P(A_i).
P ROOF. (i) Suppose X_1, …, X_n are independent. Fix events A_i ∈ σ(X_i) for i = 1, …, n. Then there
exist Borel subsets B_i ⊆ R with A_i = X_i⁻¹(B_i) for i = 1, …, n. Then

P( ∩_{i=1}^n A_i ) = P(X_1 ∈ B_1, …, X_n ∈ B_n) = ∏_{i=1}^n P(X_i ∈ B_i) = ∏_{i=1}^n P(A_i).

Conversely, suppose σ(X_1), …, σ(X_n) ⊆ F are independent. Fix Borel subsets B_i ⊆ R for i =
1, …, n. Then

P( ∩_{i=1}^n {X_i ∈ B_i} ) = P( ∩_{i=1}^n X_i⁻¹(B_i) ) = ∏_{i=1}^n P( X_i⁻¹(B_i) ) = ∏_{i=1}^n P(X_i ∈ B_i).

Hence X_1, …, X_n are independent.
(ii) It suffices to show that for any index set I ⊆ {1, …, n},

P( ∩_{i∈I} A′_i ) = ∏_{i∈I} P(A′_i),

where we set A′_i = A_i^c for i = 1 and A′_i = A_i for i ≠ 1. If I does not contain 1, then the equality
above follows from the definition of independence of A_1, …, A_n with the same index set I.
Hence we may assume 1 ∈ I. By permuting the indices if necessary, we may assume that I =
{1, 2, …, r} for some 1 ≤ r ≤ n. Then by applying the definition of independence between the events
A_1, …, A_n with index sets J = {1, 2, …, r} and J′ = {2, …, r}, we have P( ∩_{i=1}^r A_i ) = ∏_{i=1}^r P(A_i) and
P( ∩_{i=2}^r A_i ) = ∏_{i=2}^r P(A_i). Subtracting the former from the latter, we get

P( A_1^c ∩ A_2 ∩ ··· ∩ A_r ) = P(A_1^c) P(A_2) ··· P(A_r),

as desired.
(iii) Repeating the same argument as in the proof of (i), we can show that if A_1, …, A_n are independent,
then for any subset J ⊆ {1, …, n}, the events A′_1, …, A′_n are independent, where A′_i = A_i^c if i ∈ J
and A′_i = A_i if i ∉ J. Recall that σ(1_E) = {∅, E, E^c, Ω} for any event E ∈ F (see Example 1.2.4).
Hence σ(1_{A_1}), …, σ(1_{A_n}) are independent if and only if for any subset J ⊆ {1, …, n}, the events
A′_1, …, A′_n are independent, where A′_i = A_i^c if i ∈ J and A′_i = A_i if i ∉ J. The latter holds if and only
if A_1, …, A_n are independent, by definition and the observation we made at the beginning of this
paragraph. To conclude, note that 1_{A_1}, …, 1_{A_n} are independent if and only if σ(1_{A_1}), …, σ(1_{A_n})
are independent by part (i).
□
Example 2.1.8 (Pairwise independence does not imply independence). Events A_1, …, A_n are said to be
pairwise independent if P(A_i ∩ A_j) = P(A_i) P(A_j) for all i ≠ j, i, j ∈ {1, …, n}. Notice that this is weaker
than the definition of independence between events A_1, …, A_n. In fact, pairwise independence is a
strictly weaker notion than independence. To see this, let X_1, X_2, X_3 be independent RVs such that
X_i ∼ Bernoulli(1/2) for i = 1, 2, 3. Define events

A_1 := {X_2 = X_3}, A_2 := {X_3 = X_1}, A_3 := {X_1 = X_2}.

Intuitively speaking, knowing two of the three events A_1, A_2, A_3 completely determines the remaining
event. However, knowing only one of these events does not give any information on any other single
event. ▲
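This example can be checked exactly by enumerating the 8 equally likely outcomes of (X_1, X_2, X_3); a short Python sketch:

```python
from itertools import product

# Exact computation over the 8 equally likely outcomes of (X1, X2, X3).
outcomes = list(product([0, 1], repeat=3))
A1 = {o for o in outcomes if o[1] == o[2]}   # {X2 = X3}
A2 = {o for o in outcomes if o[2] == o[0]}   # {X3 = X1}
A3 = {o for o in outcomes if o[0] == o[1]}   # {X1 = X2}

P = lambda E: len(E) / 8
print(P(A1), P(A2), P(A3))                   # each 1/2
print(P(A1 & A2), P(A2 & A3), P(A1 & A3))    # each 1/4 = (1/2)^2: pairwise independent
print(P(A1 & A2 & A3))                       # 1/4 != (1/2)^3: not independent
```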
The following proposition shows that independence between σ-algebras generated by some collec-
tions of subsets can be checked by only checking the independence between the generating sets.
Proposition 2.2.2 (Independence of generating sets imply independence of σ-algebras). Let (Ω, F , P) be
a probability space and let A1 , . . . , An ⊆ F such that Ω ∈ Ai for all i = 1, . . . , n. Further assume that each
Ai is a π-system. If A1 , . . . , An are independent, then σ(A1 ), . . . , σ(An ) are independent.
P ROOF. Fix events A_2 ∈ A2, …, A_n ∈ An, let F := A_2 ∩ ··· ∩ A_n, and define L := {A ∈ F : P(A ∩ F) = P(A) P(F)}.
That is, L is the collection of all events that are independent of F. Since A1, …, An are independent,
it follows that A1 ⊆ L. We wish to show that σ(A1) ⊆ L. By using Dynkin's π−λ theorem (see Theorem
1.1.37), we only need to verify that L is a λ-system.
(i) (Ω ∈ L ): P(Ω ∩ F ) = P(F ) = 1 · P(F ) = P(Ω) P(F ).
(ii) (A, B ∈ L with A ⊆ B ⇒ B \ A ∈ L): Since A ⊆ B, we have (B \ A) ∩ F = (B ∩ F) \ (A ∩ F). Hence

P( (B \ A) ∩ F ) = P(B ∩ F) − P(A ∩ F) = P(B) P(F) − P(A) P(F) = (P(B) − P(A)) P(F) = P(B \ A) P(F).

This shows B \ A ∈ L.
(iii) (A_1, A_2, ··· ∈ L, A_1 ⊆ A_2 ⊆ ··· ⇒ A := ∪_{i≥1} A_i ∈ L): Note that A_i ∩ F ↗ A ∩ F. Hence by using
continuity of measure from below,
P(A ∩ F ) = lim P(A n ∩ F ) = lim P(A n ) P(F ) = P(F ) lim P(A n ) = P(F ) P(A).
n→∞ n→∞ n→∞
This shows A ∈ L .
The above three items verify that L is a λ-system. Thus we conclude σ(A1 ) ⊆ L .
The above discussion shows that, under the hypothesis, σ(A 1 ), A2 , . . . , An are independent. Repeat-
ing the same arguments, one shows that σ(A1 ), σ(A2 ), . . . , σ(An ) are independent. □
As a corollary, we deduce that RVs are independent if their joint CDF factorizes into the product of
marginal CDFs.
P ROOF. By Proposition 2.1.4, X 1 , . . . , X n are independent if and only if σ(X 1 ), . . . , σ(X n ) are indepen-
dent. Let A denote the set of all intervals of the form (−∞, a] for a ∈ R and R itself. Then σ(A ) = B =Borel
σ-algebra on R. Hence for each i = 1, . . . , n, σ(X i ) is generated by the sets X i−1 (A) for A ∈ A . The condi-
tion in the assertion exactly states that the collections Ai := {X_i⁻¹(A) : A ∈ A} for i = 1, …, n are independent. Also note that each Ai is a π-system: indeed, X_i⁻¹((−∞, a]) ∩ X_i⁻¹((−∞, b]) = X_i⁻¹((−∞, a ∧ b]). Hence the
assertion follows from Proposition 2.2.2. □
Proposition 2.2.4. Let (Ω, F, P) be a probability space. Let F_{i,j} for 1 ≤ i, j ≤ n be independent σ-algebras
on Ω. For each 1 ≤ i ≤ n, let G_i := σ( ∪_{1≤j≤n} F_{i,j} ). Then G_1, …, G_n are independent.
P ROOF. Let Ai be the collection of sets of the form ∩_{1≤j≤n} A_{i,j} where A_{i,j} ∈ F_{i,j}. Then each Ai
contains Ω and σ(Ai) ⊇ ∪_{1≤j≤n} F_{i,j}. Furthermore, A1, …, An are independent since, by using the independence of the F_{i,j},

P( ∩_i ∩_j A_{i,j} ) = ∏_i ∏_j P(A_{i,j}) = ∏_i P( ∩_j A_{i,j} ).

Hence by Proposition 2.2.2, it follows that σ(A1), …, σ(An) are independent. Since σ(Ai) ⊇ ∪_{1≤j≤n} F_{i,j},
we also have σ(Ai) ⊇ G_i. Hence G_1, …, G_n are independent. □
Remark 2.2.5. In the proposition above, we can take F_{i,j} = {∅, Ω} (the trivial σ-algebra) for some j's if one wants to apply the
statement with a different number of F_{i,j}'s for each i.
An important consequence of the previous proposition is that functions of disjoint sets of indepen-
dent RVs are independent.
Proposition 2.2.6. Let X_{i,j} for 1 ≤ i, j ≤ n be independent RVs on a probability space (Ω, F, P). Let
f_i : R^n → R be Borel measurable for each i = 1, …, n. Define Y_i := f_i(X_{i,1}, …, X_{i,n}) for i = 1, …, n. Then
Y_1, …, Y_n are independent.
P ROOF. By Proposition 2.1.4, Y_1, …, Y_n are independent if and only if σ(Y_1), …, σ(Y_n) are independent. Let F_{i,j} := σ(X_{i,j}) and G_i := σ( ∪_j F_{i,j} ). Then σ(Y_i) ⊆ G_i for i = 1, …, n. Since the X_{i,j}'s are independent, the F_{i,j}'s are independent. Then by Proposition 2.2.4, the G_i's are independent. It follows that the σ(Y_i)'s
are also independent. □
Lemma 2.3.1 (Independence and product measure). Let X 1 , . . . , X n be independent RVs on a probability
space (Ω, F , P). Let µi := P ◦ X i−1 denote the distribution of X i for i = 1, . . . , n. Then the random vector
(X 1 , . . . , X n ) : (Ω, F , P) → (Rn , B n ) has distribution µ1 ⊗ · · · ⊗ µn .
P ROOF. For Borel sets A_1, …, A_n ⊆ R,

P( (X_1, …, X_n) ∈ A_1 × ··· × A_n ) = ∏_{i=1}^n P(X_i ∈ A_i) = ∏_{i=1}^n µ_i(A_i) = (µ1 ⊗ ··· ⊗ µn)(A_1 × ··· × A_n),

where the three equalities above follow from independence, the definition of distribution, and the definition of
the product measure, respectively. This shows that the distribution of (X_1, …, X_n) and the product measure µ1 ⊗ ··· ⊗ µn,
which are both probability measures on B^n, agree on all measurable rectangles, which form a π-system.
Two measures agreeing on a π-system must be identical on the σ-algebra generated by that π-system
(see Lemma 1.1.38). Hence they agree on the Borel σ-algebra B^n. □
Lemma 2.3.2 (Expectation of a function of independent RVs). Let X, Y be independent RVs on a probability space (Ω, F, P) with distributions µ := P ◦ X⁻¹ and ν := P ◦ Y⁻¹. If h : R² → R is a measurable function
with either h ≥ 0 or E[|h(X, Y)|] < ∞, then

E[h(X, Y)] = ∬ h(x, y) dµ(x) dν(y).

In particular, if f, g : R → R are measurable functions with f, g ≥ 0 or E[|f(X)|], E[|g(Y)|] < ∞, then

E[f(X) g(Y)] = E[f(X)] E[g(Y)].
P ROOF. By the change of variables formula (Theorem 1.5.3) applied to the random vector (X, Y), together with Lemma 2.3.1 and Fubini's theorem (Theorem 1.4.3),

E[h(X, Y)] = ∫_{R²} h d(P ◦ (X, Y)⁻¹) = ∫_{R²} h d(µ ⊗ ν) = ∬ h(x, y) dµ(x) dν(y).

This shows the first assertion. For the second assertion, assume h(x, y) = f(x)g(y) for (x, y) ∈ R². From
the first assertion,

E[f(X) g(Y)] = E[h(X, Y)]
= ∬ f(x) g(y) dµ(x) dν(y)
= ∫ g(y) ( ∫ f(x) dµ(x) ) dν(y)
= ∫ g(y) E[f(X)] dν(y)   (∵ change of variables, see Theorem 1.5.3)
= E[f(X)] ∫ g(y) dν(y) = E[f(X)] E[g(Y)]. □
Example 2.3.4 (Uncorrelated but dependent). Two random variables X, Y are said to be uncorrelated if
E[XY] = E[X]E[Y]. By Exercise 2.3.3, we know that independence implies uncorrelatedness. However,
the converse is not true. That is, two random variables can be uncorrelated but still be dependent.
Let (X, Y) be a uniformly sampled point from the unit circle in the two-dimensional plane. Parameterize the unit circle by S¹ = {(cos θ, sin θ) | 0 ≤ θ < 2π}. Then we can first sample a uniform angle
Θ ∼ Uniform([0, 2π)), and then define (X, Y) = (cos Θ, sin Θ). Recall the double-angle identity

sin 2t = 2 cos t sin t.
Now, since Θ has density 1/(2π) on [0, 2π),

E(XY) = E(cos Θ sin Θ) = (1/2) E(sin 2Θ) = (1/4π) ∫_0^{2π} sin 2t dt = (1/4π) [ −(1/2) cos 2t ]_0^{2π} = 0.
On the other hand,

E(X) = E(cos Θ) = (1/2π) ∫_0^{2π} cos t dt = 0
and likewise E(Y) = 0. This shows Cov(X, Y) = 0, so X and Y are uncorrelated. However, they satisfy the
following deterministic relation:

X² + Y² = 1,

so clearly they cannot be independent. Intuitively, it is clear why the x- and y-coordinates of a uniformly sampled point from the unit circle are uncorrelated: they have no linear relation. ▲
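A simulation (Python) reflects both facts at once: the sample covariance of the coordinates is near 0, while X² + Y² = 1 holds exactly for every sample.

```python
import random, math

random.seed(2)
n = 200_000
thetas = [random.uniform(0, 2 * math.pi) for _ in range(n)]
X = [math.cos(t) for t in thetas]
Y = [math.sin(t) for t in thetas]

mean = lambda v: sum(v) / len(v)
cov = mean([x * y for x, y in zip(X, Y)]) - mean(X) * mean(Y)
print(round(cov, 4))                                      # ~ 0: uncorrelated
print(max(abs(x * x + y * y - 1) for x, y in zip(X, Y)))  # ~ 0: fully dependent
```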
Example 2.4.1 (Two dice). Roll two dice independently and let their outcomes be recorded by RVs X and
Y. Note that both X and Y are uniformly distributed over {1, 2, 3, 4, 5, 6}. So the pair (X, Y) is uniformly
distributed over the (6 × 6) integer grid {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}. In other words,

P((X, Y) = (x, y)) = P(X = x) P(Y = y) = (1/6) · (1/6) = 1/36.

Now, what is the distribution of the sum Z = X + Y? Since each point (x, y) in the grid is equally probable,
we just need to count the number of such points on the line x + y = z, for each value of z. In other words,

P(X + Y = z) = ∑_{x=1}^6 P(X = x) P(Y = z − x).

This is easy to compute from Figure 2.4.1: for example, P(X + Y = 7) = 6/36 = 1/6. ▲

[Figure: the 6 × 6 grid of equally likely outcomes, with the diagonal lines x + y = z for z = 2, …, 12 drawn across it.]
F IGURE 2.4.1. Probability space for two dice and the lines on which the sum of the two is constant.
Proposition 2.4.2 (Convolution of PMFs). Let X, Y be two independent integer-valued RVs. Let Z = X + Y.
Then

P(Z = z) = ∑_x P(X = x) P(Y = z − x).
P ROOF. Note that the pair (X , Y ) is distributed over Z2 according to the distribution
P(X = x, Y = y) = P(X = x)P(Y = y),
since X and Y are independent. Hence in order to get P(Z = z) = P(X + Y = z), we need to add up the
probabilities of all pairs (x, y) on the line x + y = z. If we first fix the value of x, then y must take the
value z − x. Varying x over its range, we get the asserted formula. □
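The proposition is literally a discrete convolution, and it is a few lines of code; the following Python sketch (using exact rational arithmetic) convolves two fair-die PMFs and recovers P(X + Y = 7) = 1/6.

```python
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}   # PMF of one fair die

def convolve(pmf_x, pmf_y):
    """PMF of X + Y for independent integer-valued X, Y."""
    pmf_z = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            pmf_z[x + y] = pmf_z.get(x + y, 0) + px * py
    return pmf_z

two_dice = convolve(die, die)
print(two_dice[7])               # 1/6
print(sum(two_dice.values()))    # 1: still a probability distribution
```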
Exercise 2.4.3 (Sum of ind. Poisson RVs is Poisson). Let X ∼ Poisson(λ1 ) and Y ∼ Poisson(λ2 ) be inde-
pendent Poisson RVs. Show that X + Y ∼ Poisson(λ1 + λ2 ).
For the continuous case, a similar observation should hold as well. Namely, we should be integrating
all the probabilities of the pair (X , Y ) at points (x, y) along the line x + y = z in order to get the probability
density f X +Y (z). We first derive a general convolution formula for CDFs in Proposition 2.4.4 and then
deduce the convolution formula for PDFs in Proposition 2.4.5.
Proposition 2.4.4 (Convolution of CDFs). If X, Y are independent RVs on a probability space (Ω, F, P),
then letting F(x) := P(X ≤ x) and G(y) := P(Y ≤ y),

P(X + Y ≤ z) = ∫_R F(z − y) dG(y) =: (F ∗ G)(z),

where ∫ · dG is shorthand for integrating with respect to the measure ν with distribution function G.
Here F ∗ G is called the convolution of F and G.
P ROOF. Let µ and ν denote the probability measures on R with distribution functions F and G, respectively (i.e., µ((−∞, x]) = F(x) and so on). Note that

P(X + Y ≤ z) = E[1(X + Y ≤ z)]
= ∫_{R²} 1(x + y ≤ z) d(P ◦ (X, Y)⁻¹)   (∵ change of variables, see Theorem 1.5.3)
= ∫_{R²} 1(x + y ≤ z) d(µ ⊗ ν)(x, y)   (∵ Lemma 2.3.1)
= ∬ 1(x ≤ z − y) dµ(x) dν(y)   (∵ Fubini's theorem, see Theorem 1.4.3)
= ∫ F(z − y) dν(y)   (∵ def of F)
= ∫ F(z − y) dG(y)   (∵ def of dG).

□
Proposition 2.4.5 (Convolution of PDFs). Let X, Y be independent RVs on a probability space (Ω, F, P).
Let Z := X + Y.
(i) If X is continuous with PDF f_X, then Z has PDF

f_Z(z) = ∫_{−∞}^∞ f_X(z − y) dG(y).

(ii) If furthermore Y is continuous with PDF f_Y, then Z has PDF

f_Z(z) = ∫_{−∞}^∞ f_X(z − y) f_Y(y) dy.
P ROOF. We begin by computing the CDF of Z. By Proposition 2.4.4,

P(Z ≤ z) = ∫ F(z − y) dG(y)
= ∫_{−∞}^∞ ∫_{−∞}^{z−y} f_X(x) dx dG(y)
= ∫_{−∞}^∞ ∫_{−∞}^z f_X(x − y) dx dG(y)
= ∫_{−∞}^z ∫_{−∞}^∞ f_X(x − y) dG(y) dx.

The last expression shows that Z has PDF ∫_{−∞}^∞ f_X(z − y) dG(y), as asserted in (i). Then (ii) follows from
the definition of dG(y) and Exercise 1.5.6. □
Example 2.4.6. Let X, Y ∼ N(0, 1) be independent standard normal RVs. Let Z = X + Y. We will show
that Z ∼ N(0, 2) using the convolution formula. Recall that X and Y have the following PDFs:

f_X(x) = (1/√(2π)) e^{−x²/2},  f_Y(y) = (1/√(2π)) e^{−y²/2}.

By taking the convolution of the above PDFs, we have

f_Z(z) = ∫_{−∞}^∞ (1/√(2π)) exp(−x²/2) · (1/√(2π)) exp(−(z − x)²/2) dx
= (1/2π) ∫_{−∞}^∞ exp( −x²/2 − (z − x)²/2 ) dx
= (1/2π) ∫_{−∞}^∞ exp( −x² + xz − z²/2 ) dx
= (1/2π) ∫_{−∞}^∞ exp( −(x − z/2)² − z²/4 ) dx
= (1/√(4π)) e^{−z²/4} ∫_{−∞}^∞ (1/√π) exp( −(x − z/2)² ) dx = (1/√(4π)) e^{−z²/4},

where we have recognized the integrand in the last line as the PDF of N(z/2, 1/2), so that the integral is 1.
Since the last expression is the PDF of N(0, 2), it follows that Z ∼ N(0, 2). ▲
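The same computation can be confirmed numerically: a Riemann-sum discretization of the convolution integral of two N(0, 1) densities matches the N(0, 2) density pointwise. A minimal Python sketch:

```python
import math

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)        # N(0,1) PDF
target = lambda z: math.exp(-z * z / 4) / math.sqrt(4 * math.pi)     # N(0,2) PDF

def f_Z(z, lo=-10.0, hi=10.0, steps=20_000):
    """Midpoint Riemann-sum approximation of the convolution integral."""
    dx = (hi - lo) / steps
    return sum(phi(lo + (k + 0.5) * dx) * phi(z - (lo + (k + 0.5) * dx))
               for k in range(steps)) * dx

for z in (0.0, 1.0, 2.5):
    print(z, round(f_Z(z), 6), round(target(z), 6))   # the two columns agree
```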
The following example generalizes the observation we made in the previous example.
Example 2.4.7 (Sum of ind. normal RVs is normal). Let X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²) be independent
normal RVs. We will see that Z = X + Y is again a normal random variable with distribution N(µ1 + µ2, σ1² +
σ2²). The usual convolution computation for this is pretty messy (cf. the Wikipedia article). Instead, let's save
some work by using the fact that normal distributions are preserved under linear transforms (Example
1.6.16). So instead of X and Y, we may consider X′ := (X − µ1)/σ1 and Y′ := (Y − µ1)/σ1 (it is important to
note that we must use the same linear transform for both X and Y). Then X′ ∼ N(0, 1), and Y′ ∼ N(µ, σ²)
where µ = (µ2 − µ1)/σ1 and σ = σ2/σ1. Now it suffices to show that Z′ := X′ + Y′ ∼ N(µ, 1 + σ²) (see the
following exercise for details).
We compute the convolution of the corresponding normal PDFs:

f_{Z′}(z) = ∫_{−∞}^∞ (1/√(2π)) exp( −x²/2 ) · (1/√(2πσ²)) exp( −(z − x − µ)²/(2σ²) ) dx
= (1/(2πσ)) ∫_{−∞}^∞ exp( −x²/2 − (z − x − µ)²/(2σ²) ) dx.
At this point, we need to 'complete the square' in x for the expression inside the exponential, as below:

x²/2 + (z − x − µ)²/(2σ²) = (1/(2σ²)) ( σ²x² + (x + µ − z)² )
= ((1 + σ²)/(2σ²)) ( x² + 2(µ − z)x/(1 + σ²) + (µ − z)²/(1 + σ²) )
= ((1 + σ²)/(2σ²)) [ ( x + (µ − z)/(1 + σ²) )² + (µ − z)²/(1 + σ²) − (µ − z)²/(1 + σ²)² ]
= ((1 + σ²)/(2σ²)) ( x + (µ − z)/(1 + σ²) )² + (z − µ)²/(2(1 + σ²)).
Suppose each increment X_k has a finite mean µ. Then by linearity of expectation and independence
of the increments, we have

E[S_n/n] = E[S_n]/n = µ,
Var(S_n/n) = Var(S_n)/n² = n Var(X_1)/n² = Var(X_1)/n.

So the sample mean S_n/n has constant expectation and shrinking variance. Hence it makes sense to
guess that it should behave as the constant µ, without taking the expectation. That is,

lim_{n→∞} S_n/n = µ.
But this expression is shaky, since the left-hand side is a limit of RVs while the right-hand side is a constant.
In what sense do the random sample means converge to µ? This is the content of the law of large numbers,
of which we will prove weak and strong versions.
The first limit theorem we will encounter is called the Weak Law of Large Numbers (WLLN), which is
stated below:

Theorem 3.1.1 (WLLN, preliminary ver.). Let (X_k)_{k≥1} be i.i.d. RVs with mean µ < ∞ and let S_n = ∑_{k=1}^n X_k,
n ≥ 1, be a random walk. Then for any positive constant ε > 0,

lim_{n→∞} P( |S_n/n − µ| > ε ) = 0.
In words, the probability that the sample mean S n /n is not within ε distance from its expectation µ
decays to zero as n tends to infinity. In this case, we say the sequence of RVs (S n /n)n≥1 converges to µ in
probability.
The second version of the law of large numbers is called the strong law of large numbers (SLLN), which is
available if the increments have finite variance.
Theorem 3.1.2 (SLLN, preliminary ver.). Let (X_k)_{k≥1} be i.i.d. RVs and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random
walk. Suppose E[X_1] = µ < ∞ and E[X_1²] < ∞. Then

P( lim_{n→∞} S_n/n = µ ) = 1.
To make sense of this, notice that the limit of the sample means, lim_{n→∞} S_n/n, is itself a RV. Then SLLN says
that this RV is well-defined and its value is µ with probability 1. In this case, we say the sequence of RVs
(S_n/n)_{n≥1} converges to µ with probability 1, or almost surely.
Perhaps one of the most celebrated theorems in probability theory is the central limit theorem (CLT),
which describes how the sample mean S_n/n "fluctuates" around its mean µ. From the computation at the beginning of this section, if we denote
σ² = Var(X_1) < ∞, we know that Var(S_n/n) = σ²/n → 0 as n → ∞. So the fluctuation decays as we add
up more increments. To see the effect of the fluctuation, we first center the sample mean by subtracting its
expectation and then "zoom in" by dividing by the standard deviation σ/√n. This is where the name 'central
limit' comes from: it describes the limit of centered (and rescaled) random walks.
Theorem 3.1.3 (CLT, preliminary ver.). Let (X_k)_{k≥1} be i.i.d. RVs and let S_n = ∑_{k=1}^n X_k, n ≥ 1, be a random
walk. Suppose E[X_1] = µ < ∞ and Var(X_1) = σ² < ∞. Let Z ∼ N(0, 1) be a standard normal RV and define

Z_n = (S_n − µn)/(σ√n) = (S_n/n − µ)/(σ/√n).

Then for all z ∈ R,

lim_{n→∞} P(Z_n ≤ z) = P(Z ≤ z) = (1/√(2π)) ∫_{−∞}^z e^{−x²/2} dx.
Proposition 3.2.1 (Markov’s inequality). Let X ≥ 0 be a nonnegative RV with finite expectation. Then for
any a > 0, we have
P(X ≥ a) ≤ E[X]/a.
P ROOF. Consider an auxiliary RV Y defined as follows:

Y = a if X ≥ a;  Y = 0 if X < a.

Note that we always have Y ≤ X. Hence we should have E[Y] ≤ E[X]. But since E[Y] = a P(X ≥ a), we
have

a P(X ≥ a) ≤ E[X]. □
Example 3.2.2. Suppose Z is a RV with E[Z²] = 0. Then by Markov's inequality applied to the nonnegative RV Z², for any a > 0,

P(Z² ≥ a) ≤ E[Z²]/a = 0.

By continuity of measure, it follows that P(Z² = 0) = lim_{ε↘0} P(Z² < ε) = 1, so P(Z = 0) = 1. ▲
Proposition 3.2.3 (Chebyshev's inequality). Let X be any RV with E[X] = µ < ∞ and Var(X) < ∞. Then for
any a > 0, we have

P(|X − µ| ≥ a) ≤ Var(X)/a².
P ROOF. Applying Markov's inequality to the nonnegative RV (X − µ)², we get

P(|X − µ| ≥ a) = P((X − µ)² ≥ a²) ≤ E[(X − µ)²]/a² = Var(X)/a². □

Example 3.2.4. Let X ∼ Exp(λ), so that E[X] = 1/λ and Var(X) = 1/λ². The exact tail probability is

P(X ≥ a) = e^{−λa},

which decays exponentially in a. In contrast, Markov's inequality gives P(X ≥ a) ≤ 1/(λa), while Chebyshev's inequality gives P(X ≥ a) ≤ P(|X − 1/λ| ≥ a − 1/λ) ≤ 1/(λ²(a − 1/λ)²) for a > 1/λ.
As we can see, both Markov's and Chebyshev's inequalities give loose estimates, but the latter gives a
slightly stronger bound for large a. ▲
Example 3.2.5 (Chebyshev's inequality for bounded RVs). Let X be a RV taking values in the interval
[a, b]. Suppose we don't know anything else about X. Can we say anything useful about the tail probability
P(X ≥ λ)? If we were to use Markov's inequality, then certainly a ≤ E[X] ≤ b, and in the worst case E[X] =
b. Hence we can at least conclude

P(X ≥ λ) ≤ b/λ.

On the other hand, let's get a bound on Var(X) and use Chebyshev's inequality instead. We claim that

Var(X) ≤ (b − a)²/4,

which would yield by Chebyshev's inequality that

P(|X − E[X]| ≥ λ) ≤ (b − a)²/(4λ²).

Intuitively speaking, Var(X) is largest when the values of X are as spread out as possible at the
two extremes, a and b. Hence the largest variance should be achieved when X takes a and b with
equal probabilities. In this case, E[X] = (a + b)/2, so

Var(X) = E[X²] − E[X]² = (a² + b²)/2 − (a + b)²/4 = (b − a)²/4.

▲
Exercise 3.2.6. Let X be a RV taking values in the interval [a, b].
(i) Use the usual 'completing the square' trick for the second moment to show that

0 ≤ E[(X − t)²] = (t − E[X])² + Var(X)  ∀ t ∈ R.

(ii) Conclude that E[(X − t)²] is minimized when t = E[X] and the minimum is Var(X).
(iii) By plugging t = (a + b)/2 into (i), show that

Var(X) = E[(X − a)(X − b)] + (b − a)²/4 − ( E[X] − (a + b)/2 )².

(iv) Show that E[(X − a)(X − b)] ≤ 0.
(v) Conclude that Var(X) ≤ (b − a)²/4, where the equality holds if and only if X takes the extreme values
a and b with equal probabilities.
Exercise 3.2.7 (Paley–Zygmund inequality). Let X be a nonnegative RV with E[X²] < ∞. Fix a constant
θ ≥ 0. We prove the Paley–Zygmund inequality, which gives a lower bound on tail probabilities and
also implies the so-called 'second moment method'.
(i) Write X = X 1(X > θE[X]) + X 1(X ≤ θE[X]). Show that

E[X] = E[X 1(X ≤ θE[X])] + E[X 1(X > θE[X])] ≤ θE[X] + E[X 1(X > θE[X])].

(ii) Use the Cauchy–Schwarz inequality (Exercise 1.5.25) to show

( E[X 1(X > θE[X])] )² ≤ E[X²] E[1(X > θE[X])²] = E[X²] E[1(X > θE[X])] = E[X²] P(X > θE[X]).

(iii) From (i) and (ii), derive

E[X] ≤ θE[X] + √( E[X²] P(X > θE[X]) ).

(Note that since X ≥ 0, the above inequality is meaningful only when θ ≤ 1.) Conclude that, for
θ ∈ [0, 1],

P(X > θE[X]) ≥ (1 − θ)² E[X]² / E[X²].

(iv) (Second moment method) From (iii), conclude that

P(X > 0) ≥ E[X]² / E[X²].

Alternatively, apply the Cauchy–Schwarz inequality (Exercise 1.5.25) to the RV X 1(X > 0) to
deduce the above result.
Proposition 3.2.8 (Chernoff bound). Let X_n := ξ_1 + ··· + ξ_n, where ξ_1, …, ξ_n are i.i.d. RVs such that E[exp(θξ_1)] <
∞ for θ ∈ [0, c). Denote the log moment generating function of ξ_1 by φ(θ) = log E[exp(θξ_1)]. Then

P(X_n ≥ t) ≤ exp( − sup_{θ∈[0,c)} ( θt − nφ(θ) ) ).    (16)

Furthermore, assume that E[ξ_1] = 0. Then L := sup_{0≤θ≤c/2} |φ″(θ)| < ∞ and

P(X_n ≥ t) ≤ exp( −t²/(2nL) ) if t ≤ cnL/2 (Gaussian tail);
P(X_n ≥ t) ≤ exp( −ct/2 + c²nL/8 ) if t > cnL/2 (exponential tail).
P ROOF. For any parameter θ ∈ [0, c), by exponentiating and applying Markov's inequality,

P(X_n ≥ t) = P( e^{θX_n} ≥ e^{θt} ) ≤ e^{−θt} E[e^{θX_n}] = e^{−θt} E[e^{θξ_1}]^n = exp( −(θt − nφ(θ)) ),

where the identity above uses the independence between the increments ξ_i. This holds for all θ ∈ [0, c).
This shows (16).
Next, denote the moment generating function of ξ_1 by ψ(θ) = E[exp(θξ_1)]. Recall that it is a power
series with radius of convergence ≥ c by the hypothesis. Hence we can differentiate it term by term
when θ ∈ (−c, c). In particular, this gives ψ(0) = 1, ψ′(0) = E[ξ_1] = 0, and ψ″(0) = E[ξ_1²] ≥ 0. It follows that
φ(0) = 0, φ′(0) = ψ′(0) = 0, and φ″(0) = ψ″(0) ≥ 0. Then by Taylor's theorem,

φ(θ) ≤ φ(0) + φ′(0)θ + (L/2)θ² = (L/2)θ²  for all θ ∈ [0, c/2].
This and (16) give

P(X_n ≥ t) ≤ exp( − sup_{θ∈[0,c/2]} ( θt − (nL/2)θ² ) ).

The quadratic function θ ↦ θt − (nL/2)θ² is globally maximized at θ = t/(nL), with maximum value t²/(2nL). Hence
when t ≤ cnL/2, the supremum in the above bound is attained at θ = t/(nL). If t > cnL/2, the same quadratic
function is increasing on [0, c/2], so the constrained maximization is solved at θ = c/2. □
Example 3.2.9 (Chernoff bound for the Gamma distribution). Let ξ_1, …, ξ_n be i.i.d. Exp(c) RVs (with mean
1/c). Let X_n := ξ_1 + ··· + ξ_n ∼ Gamma(n, c). Recall that

ψ(θ) = E[exp(θξ_1)] = c/(c − θ)  for θ < c.
Definition 3.3.4. Let (X n )n≥1 be a sequence of RVs and let µ ∈ R be a constant. We say X n converges to µ
in probability if for each ε > 0,
lim_{n→∞} P( |X_n − µ| > ε ) = 0.
Before we proceed further, let us take a moment and think about the definition of convergence in
probability. Recall that a sequence of real numbers (x n )n≥0 converges to x if for each ‘error level’ ε > 0,
there exists a large integer N (ε) > 0 such that
|x n − x| < ε ∀n ≥ N (ε).
If we would like to say that a sequence of RVs (X_n)_{n≥0} 'converges' to some real number x, how should we
formulate this? Since X_n is an RV, {|X_n − x| < ε} is an event. On the other hand, we can also view each x_n
as an RV, even though it is a real number. Then we can rewrite the above condition as

P(|X_n − x| < ε) = 1  ∀ n ≥ N(ε).

For general RVs, requiring P(|X_n − x| < ε) = 1 for all large n might not be possible. But we can fix any
desired level of 'confidence' δ > 0, and require

P(|X_n − x| < ε) ≥ 1 − δ

for all sufficiently large n, where the threshold may now depend on both ε and δ.
Example 3.3.6 (Empirical frequency). Let A be an event of interest. We would like to estimate the unknown probability p = P(A) by observing a sequence of independent experiments. Namely, let (X_k)_{k≥1}
be a sequence of i.i.d. RVs, where X_k is the indicator variable of the event A in the kth experiment. Let
p̂_n := (X_1 + ··· + X_n)/n. Since E[X_1] = P(A) = p, by the WLLN we conclude that, for any ε > 0,

P( |p̂_n − p| > ε ) → 0 as n → ∞.

▲
Example 3.3.7 (Polling). Let E_A be the event that a randomly selected voter supports candidate A. Using
a poll, we would like to estimate p = P(E_A), which can be understood as the proportion of supporters of
candidate A. As before, we observe a sequence of i.i.d. indicator variables X_k = 1(E_A). Let p̂_n := S_n/n be
the empirical proportion of supporters of A out of n samples. We know by the WLLN that p̂_n converges to p
in probability. But if we want to guarantee a certain confidence level α for an error bound ε, how many
samples should we take?

By Chebyshev's inequality, we get the following estimate:

P( |p̂_n − p| > ε ) ≤ Var(p̂_n)/ε² = Var(X_1)/(nε²) ≤ 1/(4nε²).

Note that for the last inequality, we used that X_1 ∈ [0, 1] together with Exercise 3.2.6 (or the fact that for
Y ∼ Bernoulli(p), Var(Y) = p(1 − p) ≤ 1/4). Hence, for instance, if ε = 0.01 and α = 0.95, then we would
need to set n large enough so that

P( |p̂_n − p| > 0.01 ) ≤ 10000/(4n) ≤ 0.05.

This yields n ≥ 50,000. In other words, if we survey at least n = 50,000 independent voters, then the
empirical frequency p̂_n is between p − 0.01 and p + 0.01 with probability at least 0.95. In still other words,
the true frequency p is between p̂_n − 0.01 and p̂_n + 0.01 with probability at least 0.95 if n ≥ 50,000. We
don't actually need this many samples; we will improve this result later using the central limit theorem. ▲
Example 3.3.8 (Bernstein's polynomial approximation). Let f : [0, 1] → R be a continuous function. For
each n ≥ 1, define the Bernstein polynomial f_n of degree n by

f_n(x) := ∑_{m=0}^n (n choose m) x^m (1 − x)^{n−m} f(m/n).

We claim that

lim_{n→∞} sup_{x∈[0,1]} | f(x) − f_n(x) | = 0.
P ROOF. While the above statement is completely analytic, we will use probability theory to prove it.
Fix p ∈ [0, 1] and let X_1, X_2, … be i.i.d. RVs with distribution Bernoulli(p). Then E[X_i] = p and Var(X_i) =
p(1 − p) for i ≥ 1. Let S_n := X_1 + ··· + X_n. Note that

E[f(S_n/n)] = ∑_{m=0}^n f(m/n) P(S_n = m) = ∑_{m=0}^n f(m/n) (n choose m) p^m (1 − p)^{n−m} = f_n(p).
For each δ > 0, by using Chebyshev's inequality and the fact that p(1 − p) ≤ 1/4 for p ∈ [0, 1],

P( |S_n/n − p| > δ ) ≤ Var(S_n/n)/δ² = p(1 − p)/(nδ²) ≤ 1/(4nδ²)  ∀ p ∈ [0, 1].    (18)

Now since f is continuous on the compact interval [0, 1], it is uniformly continuous. That is, for each
ε > 0, there exists δ > 0 such that |x − y| < δ implies |f(x) − f(y)| < ε. Fix ε > 0, and let M := sup_{x∈[0,1]} |f(x)|.
Then by Jensen's inequality and using (18),

| f(p) − f_n(p) | = | f(p) − E[f(S_n/n)] |
≤ E[ | f(S_n/n) − f(p) | ]
= E[ | f(S_n/n) − f(p) | 1( |S_n/n − p| ≤ δ ) ] + E[ | f(S_n/n) − f(p) | 1( |S_n/n − p| > δ ) ]
≤ ε + 2M P( |S_n/n − p| > δ ) ≤ ε + M/(2nδ²).
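The construction is easy to experiment with; the Python sketch below evaluates Bernstein polynomials of f(x) = |x − 1/2| (continuous but not smooth) and prints the sup-norm error on a fine grid, which decreases as n grows, slowly, consistent with the 1/(nδ²) term in the bound above.

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the degree-n Bernstein polynomial of f at x in [0, 1]."""
    return sum(comb(n, m) * x**m * (1 - x)**(n - m) * f(m / n)
               for m in range(n + 1))

f = lambda x: abs(x - 0.5)
grid = [k / 200 for k in range(201)]
for n in (10, 100, 1000):
    err = max(abs(f(x) - bernstein(f, n, x)) for x in grid)
    print(n, round(err, 4))   # the sup-norm error shrinks as n grows
```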
A change of perspective might help us. Instead of waiting to collect all n coupons at once, let's progressively
collect k distinct coupons for k = 1 to n. Namely, for each 1 ≤ k ≤ n, define τ_k to be the number of draws
until we first hold k distinct coupons; the increments τ_k − τ_{k−1} are then independent geometric RVs, as
formalized in the following exercise.
Exercise 3.3.14. For each n ≥ 1, let X_{1,n}, X_{2,n}, …, X_{n−1,n} be a sequence of independent geometric RVs
where X_{k,n} ∼ Geom((n − k)/n). Define τ_n = X_{1,n} + X_{2,n} + ··· + X_{n−1,n}.
(i) Show that E[τ_n] = n ∑_{k=1}^{n−1} k⁻¹. Using Exercise 3.3.13 (ii), deduce that E[τ_n]/(n log n) → 1 as n → ∞.
(ii) Using Var(Geom(p)) = (1 − p)/p² ≤ p⁻² and Exercise 3.3.13 (iii), show that

Var(τ_n) ≤ n² ∑_{k=1}^{n−1} k⁻² ≤ 2n².

(iii) Conclude that

(τ_n − E[τ_n])/(n log n) → 0 as n → ∞ in probability.

(iv) By using part (i), conclude that

τ_n/(n log n) → 1 as n → ∞ in probability.
Exercise 3.3.15 (Tail bound on the coupon collecting time). Let τ denote the first time to collect n distinct types of coupons, where each draw picks one of the n types independently uniformly at random. Show that for each c > 0,
P(τ ≥ ⌈n log n + cn⌉) ≤ e^{−c}.
(Hint: Let A_i denote the event that the ith coupon is not selected among the first ⌈n log n + cn⌉ draws. Then
P(τ ≥ ⌈n log n + cn⌉) ≤ P(∪_{i=1}^{n} A_i) ≤ Σ_{i=1}^{n} P(A_i) = Σ_{i=1}^{n} (1 − 1/n)^{⌈n log n + cn⌉} ≤ n exp(−log n − c) = e^{−c}.)
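The hint can be checked by simulation; below is a minimal sketch assuming NumPy, with n and c arbitrary choices. The empirical tail probability should sit below the bound e^{−c}.

```python
# Sketch: empirical P(tau >= ceil(n log n + c n)) versus the bound e^{-c}.
import numpy as np

rng = np.random.default_rng(1)
n, c, trials = 50, 1.0, 5_000
threshold = int(np.ceil(n * np.log(n) + c * n))

def collect_time():
    seen, draws = set(), 0
    while len(seen) < n:             # draw until all n coupon types are seen
        seen.add(int(rng.integers(n)))
        draws += 1
    return draws

taus = np.array([collect_time() for _ in range(trials)])
print(f"empirical P(tau >= {threshold}) = {np.mean(taus >= threshold):.4f}, "
      f"bound e^-c = {np.exp(-c):.4f}")
```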
3.3.2. Weak law without finite second moment. In this section, we will state and prove more general versions of the weak law of large numbers than Theorem 3.3.3, which assumed finite second moments. The goal of this section is to establish the following general WLLN:
Theorem 3.3.16 (Weak Law of Large Numbers). Let X_1, X_2, . . . be i.i.d. RVs. Suppose that
lim_{x→∞} x P(|X_1| > x) = 0.  (20)
Let S_n := X_1 + · · · + X_n and µ_n := E[X_1 1(|X_1| ≤ n)]. Then
(S_n − nµ_n)/n → 0 as n → ∞ in probability.
Note that in Theorem 3.3.16, we do not even assume a finite first moment for the X_i's. In this case, we cannot use Chebyshef's inequality directly as we did in the proof of Theorem 3.3.3. An immediate consequence of the above result is the following familiar version of WLLN assuming finite mean:
Corollary 3.3.17 (Weak Law of Large Numbers with finite mean). Let X_1, X_2, . . . be i.i.d. RVs. Suppose that E[|X_1|] < ∞. Let S_n := X_1 + · · · + X_n and µ := E[X_1]. Then
S_n/n → µ as n → ∞ in probability.
PROOF. Note that
x P(|X_1| > x) ≤ E[|X_1| 1(|X_1| > x)] = E[|X_1|] − E[|X_1| 1(|X_1| ≤ x)] → 0 as x → ∞,
where for the limit we have applied MCT (Theorem 1.3.19) to the increasing sequence of RVs |X_1| 1(|X_1| ≤ x) ↗ |X_1| and the fact that E[|X_1|] < ∞. Hence (20) holds, and by Theorem 3.3.16, we have (S_n − nµ_n)/n → 0 in probability, where µ_n = E[X_1 1(|X_1| ≤ n)]. Also note that µ_n → µ as n → ∞ by DCT (Theorem 1.3.20), noting that X_1 1(|X_1| ≤ n) → X_1 a.s., |X_1 1(|X_1| ≤ n)| ≤ |X_1|, and E[|X_1|] < ∞. Then S_n/n → µ in probability, since for each ε > 0, by taking n large enough so that |µ_n − µ| < ε/2,
P(|S_n/n − µ| > ε) ≤ P(|S_n/n − µ_n| + |µ_n − µ| > ε) ≤ P(|S_n/n − µ_n| > ε/2) → 0 as n → ∞,
where the limit follows since (S_n − nµ_n)/n → 0 in probability. □
The key idea to extend WLLN to the "heavy-tail" case is to use truncation and then apply Chebyshef. In order to prove Theorem 3.3.16, we first prove a lemma concerning the WLLN for a triangular array of RVs. (The usual WLLN for a single sequence of RVs is a special case.)
X_{1;1} ← independent
X_{2;1}, X_{2;2} ← independent
X_{3;1}, X_{3;2}, X_{3;3} ← independent
...
Namely, for each n ≥ 1, the nth row of the triangular array consists of the n RVs X_{n;1}, . . . , X_{n;n}, and the RVs within each row are independent.
Lemma 3.3.18 (WLLN for triangular array). For each n ≥ 1, let X_{n;1}, . . . , X_{n;n} be independent. Let b_n > 0 with b_n → ∞ as n → ∞. Let X̄_{n;k} := X_{n;k} 1(|X_{n;k}| ≤ b_n). Suppose that
(a) Σ_{k=1}^{n} P(|X_{n;k}| > b_n) = o(1); and
(b) b_n^{−2} Σ_{k=1}^{n} E[X̄_{n;k}²] = o(1).
Let S_n := X_{n;1} + · · · + X_{n;n} and a_n := Σ_{k=1}^{n} E[X̄_{n;k}]. Then
(S_n − a_n)/b_n → 0 as n → ∞ in probability.
PROOF. Let S̄_n := X̄_{n;1} + · · · + X̄_{n;n}. For each ε > 0, we have the decomposition
P(|S_n − a_n|/b_n > ε) ≤ P(S_n ≠ S̄_n) + P(|S̄_n − a_n|/b_n > ε).
In order to bound the first term, we use the union bound and (a) to get
P(S_n ≠ S̄_n) ≤ P(∪_{1≤k≤n} {|X_{n;k}| > b_n}) ≤ Σ_{k=1}^{n} P(|X_{n;k}| > b_n) = o(1).
In order to bound the second term, note that E[S̄_n] = a_n and that X̄_{n;1}, . . . , X̄_{n;n} are independent, so by Chebyshef's inequality,
P(|S̄_n − a_n|/b_n > ε) ≤ Var(S̄_n)/(b_n² ε²) = (1/(b_n² ε²)) Σ_{k=1}^{n} Var(X̄_{n;k}) ≤ (1/(b_n² ε²)) Σ_{k=1}^{n} E[X̄_{n;k}²] = o(1),
where the last estimate uses (b). Then the conclusion follows. □
PROOF OF THEOREM 3.3.16. We apply Lemma 3.3.18 with X_{n;k} := X_k and b_n = n, so it suffices to verify conditions (a)-(b) in Lemma 3.3.18. For (a), note that by (20),
Σ_{k=1}^{n} P(|X_{n;k}| > b_n) = n P(|X_1| > n) = o(1).
For (b), we need to show that n^{−1} E[X_1² 1(|X_1| ≤ n)] = o(1). To this end, let X̄ := X_1 1(|X_1| ≤ n). Use Exercise 1.5.9 to write
E[X̄²] = ∫_0^∞ P(X̄² ≥ x) dx
= ∫_0^∞ P(|X̄| ≥ x^{1/2}) dx
= ∫_0^∞ 2u P(|X̄| ≥ u) du
= ∫_0^n 2u P(|X̄| ≥ u) du
≤ ∫_0^n 2u P(|X_1| ≥ u) du.   (∵ |X̄| ≤ |X_1|)
Thus, by making the change of variable u = nt and denoting g(u) := 2u P(|X_1| ≥ u),
n^{−1} E[X_1² 1(|X_1| ≤ n)] ≤ (1/n) ∫_0^n g(u) du = ∫_0^1 g(nt) dt.
The last expression vanishes as n → ∞ since g(nt) → 0 as n → ∞ for each t > 0. To give the details, note that since 0 ≤ g(u) ≤ 2u for all u and g(u) = o(1) as u → ∞, g is bounded by some constant, say M > 0. Fix ε > 0. Since g(u) = o(1), there exists N > 0 such that g(y) ≤ ε for all y ≥ N. Then for all n ≥ N/ε,
0 ≤ ∫_0^1 g(nt) dt = ∫_0^ε g(nt) dt + ∫_ε^1 g(nt) dt ≤ εM + (1 − ε)ε.
Letting ε ↘ 0 then shows that ∫_0^1 g(nt) dt = o(1), as desired. □
The following exercise demonstrates a direct route to prove Corollary 3.3.17 without using triangular arrays.
Exercise 3.3.19 (WLLN for RVs with finite mean). In this exercise, we will prove the WLLN for RVs with possibly infinite variance by using a 'truncation argument'. (Note that we cannot use Chebyshef directly here.)
Let (X_k)_{k≥1} be i.i.d. RVs such that E[|X_1|] < ∞ and Var(X_1) ∈ [0, ∞]. Let µ = E[X_1] and S_n = Σ_{k=1}^{n} X_k, n ≥ 1. We will show that for any constant ε > 0,
lim_{n→∞} P(|S_n/n − µ| > ε) = 0.  (21)
P
(i) Fix M ≥ 1. Let S n≤M := ni=1 X i 1(|X i | ≤ M ) and µ≤M := E[X 1 1(|X 1 | ≤ M )]. Show that n −1 S n≤M → µ≤M as
n → ∞ in probability.
(ii) Show that µ≤M = E[X 1 1(|X 1 | ≤ M )] → µ as M → ∞. (Hint: Use dominated convergence theorem.)
P
(iii) Let S n>M := ni=1 X i 1(|X i | > M ). Use Markov’s inequality and (ii) to show that, for any δ > 0,
µ¯ >M ¯ ¶
¯ Sn ¯
P ¯¯ ¯ ≥ δ ≤ δ−1 E [|X 1 |1(|X 1 | > M )] → 0 as M → ∞.
n ¯
(iv) Fix ε, δ > 0, and show the following inequality:
P(|S_n/n − µ| ≥ ε) ≤ P(|S_n^{≤M}/n − µ^{≤M}| ≥ ε/3) + P(|S_n^{>M}/n| ≥ ε/3) + 1(|µ^{≤M} − µ| ≥ ε/3).
Use (ii)-(iii) to deduce that there exists a large M_0 ≥ 1 such that
P(|S_n/n − µ| ≥ ε) ≤ P(|S_n^{≤M_0}/n − µ^{≤M_0}| ≥ ε/3) + δ/2.
Finally, use (i) to show that there exists a large n_0 ≥ 1 such that for all n ≥ n_0,
P(|S_n/n − µ| ≥ ε) ≤ δ.
Conclude (21).
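The finite-mean/infinite-variance regime of this exercise can be visualized with a quick simulation; the sketch below (assuming NumPy) uses Pareto variables with tail index 1.5, an arbitrary choice with E[|X_1|] < ∞ but Var(X_1) = ∞.

```python
# Sketch: S_n/n -> mu for a heavy-tailed distribution with infinite variance.
import numpy as np

rng = np.random.default_rng(2)
alpha = 1.5                                  # tail index; mean = alpha/(alpha-1) = 3
mu = alpha / (alpha - 1)
for n in (10**3, 10**5, 10**7):
    X = (1 - rng.random(n)) ** (-1 / alpha)  # inverse-CDF sampling of Pareto(alpha)
    print(n, abs(X.mean() - mu))
```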
Definition 3.4.1 (Almost sure convergence). Let X, (X_n)_{n≥1} be RVs on the same probability space. We say that X_n converges to X almost surely (or with probability 1) if
P(lim_{n→∞} X_n = X) = 1.
Definition 3.4.2 (Infinitely often). Let (Ω, F , P) be a probability space and let (A n )n≥1 be a sequence of
events A n ∈ F . We say A n occurs infinitely often (i.o. for short) if the following event occurs:
{A n i.o.} := {ω ∈ Ω : ω ∈ A n i.o.} := {ω ∈ Ω : ω ∈ A n for infinitely many n’s}.
Exercise 3.4.3. Let (Ω, F, P) be a probability space and let (A_n)_{n≥1} be a sequence of events A_n ∈ F. Recall that
lim sup_{n→∞} A_n := ∩_{m≥1} ∪_{n=m}^{∞} A_n,  lim inf_{n→∞} A_n := ∪_{m≥1} ∩_{n=m}^{∞} A_n.
(i) Show that lim sup_{n→∞} 1_{A_n} = 1(lim sup_{n→∞} A_n) and lim inf_{n→∞} 1_{A_n} = 1(lim inf_{n→∞} A_n).
(ii) Show that P(lim sup_{n→∞} A_n) ≥ lim sup_{n→∞} P(A_n) and P(lim inf_{n→∞} A_n) ≤ lim inf_{n→∞} P(A_n).
(iii) Let X_n, n ≥ 1 be RVs on (Ω, F, P) and let x ∈ R. Show that X_n → x a.s. as n → ∞ if and only if for all ε > 0, P(|X_n − x| > ε i.o.) = 0.
Example 3.4.4 (a.s. convergence is strictly stronger than in-probability convergence). In this example, we will see that convergence in probability does not necessarily imply convergence with probability 1. Define a sequence of RVs (T_n)_{n≥1} as follows. Let T_1 = 1, T_2 ∼ Uniform({2, 3}), T_3 ∼ Uniform({4, 5, 6}), and so on: in general, T_k ∼ Uniform({(k − 1)k/2 + 1, . . . , k(k + 1)/2}) for all k ≥ 2. Let X_n = 1(some T_k takes value n). Think of
T_n = nth arrival time of customers,
X_n = 1(some customer arrives at time n).
Then note that
P(X 1 = 1) = 1,
P(X 2 = 1) = P(X 3 = 1) = 1/2,
P(X 4 = 1) = P(X 5 = 1) = P(X 6 = 1) = 1/3,
and so on. Hence it is clear that lim_{n→∞} P(X_n = 1) = 0. Since X_n is an indicator variable, this yields that lim_{n→∞} P(|X_n − 0| > ε) = lim_{n→∞} P(X_n = 1) = 0 for all ε ∈ (0, 1), that is, X_n converges to 0 in probability. On the other hand, X_n → 0 a.s. means P(lim_{n→∞} X_n = 0) = 1, which would imply that, with probability one, X_n = 0 for all sufficiently large n ≥ 1. However, X_n = 1 for infinitely many n's, since customers always arrive after any large time N. Hence X_n cannot converge to 0 almost surely. ▲
Exercise 3.4.5 (a.s. convergence implies in-probability convergence). Let (X_n)_{n≥1} be a sequence of RVs and let a be a real number. Suppose X_n converges to a with probability 1.
(i) Show that
P(lim_{n→∞} |X_n − a| ≤ ε) = 1  ∀ε > 0.
(ii) Fix ε > 0. Let A_k be the event that |X_n − a| ≤ ε for all n ≥ k. Show that A_1 ⊆ A_2 ⊆ · · · and
P(lim_{n→∞} |X_n − a| ≤ ε/2) ≤ P(∪_{k=1}^{∞} A_k).
(iii) Justify the following steps: for each ε > 0,
lim_{n→∞} P(|X_n − a| ≤ ε) ≥ lim_{n→∞} P(A_n) = P(∪_{k=1}^{∞} A_k) ≥ P(lim_{n→∞} |X_n − a| ≤ ε/2) = 1.
Conclude that X_n → a in probability.
A typical tool for proving convergence with probability 1 is the following.
Lemma 3.4.6 (Borel-Cantelli lemma). Let (Ω, F, P) be a probability space and let (A_n)_{n≥1} be a sequence of events A_n ∈ F. Suppose that
Σ_{n=1}^{∞} P(A_n) < ∞  (i.e., the P(A_n)'s are summable).
Then we have
P(A_n not i.o.) = P(A_n occurs only for finitely many n's) = 1.
PROOF. Let N = Σ_{n=1}^{∞} 1(A_n), which is the number of n's for which A_n occurs. By Fubini's theorem (or MCT, Theorem 1.3.19),
E[N] = E[Σ_{n=1}^{∞} 1(A_n)] = Σ_{n=1}^{∞} E[1(A_n)] = Σ_{n=1}^{∞} P(A_n) < ∞.
Since E[N] < ∞, the RV N cannot take the value ∞ with positive probability (otherwise E[N] = ∞). Thus P(A_n not i.o.) = P(N < ∞) = 1. □
Exercise 3.4.7 (BC Lemma and a.s. convergence). Let (X_n)_{n≥1} be a sequence of RVs and fix x ∈ R. We will show that X_n → x a.s. if the tail probabilities are 'summable'. (This is the typical application of the Borel-Cantelli lemma.)
(i) Fix ε > 0. Suppose Σ_{n=1}^{∞} P(|X_n − x| > ε) < ∞. Use the Borel-Cantelli lemma to deduce that, with probability one, |X_n − x| > ε for only finitely many n's.
(ii) Conclude that, if Σ_{n=1}^{∞} P(|X_n − x| > ε) < ∞ for all ε > 0, then X_n → x a.s.
Example 3.4.8. Let (X_n)_{n≥1} be a sequence of i.i.d. Exp(λ) RVs. Define Y_n = min(X_1, X_2, · · · , X_n). Recall that in Exercise 3.3.10, we have shown that
P(|Y_n − 0| > ε) = e^{−λεn}
for all ε > 0, and that Y_n → 0 in probability as n → ∞. In fact, Y_n → 0 with probability 1. To see this, we note that, for all ε > 0,
Σ_{n=1}^{∞} P(|Y_n − 0| > ε) = Σ_{n=1}^{∞} e^{−λεn} = e^{−λε}/(1 − e^{−λε}) < ∞.
By the Borel-Cantelli lemma (or Exercise 3.4.7), we conclude that Y_n → 0 a.s. ▲
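Along a single simulated sample path, the running minimum indeed decays to 0, as the almost sure convergence predicts; a minimal sketch (assuming NumPy, with λ an arbitrary choice) follows.

```python
# Sketch: Y_n = min(X_1, ..., X_n) -> 0 along one path of i.i.d. Exp(lambda) RVs.
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0
X = rng.exponential(1 / lam, size=10**6)
Y = np.minimum.accumulate(X)                 # Y_n = running minimum
for n in (10, 10**2, 10**4, 10**6):
    print(n, Y[n - 1])
```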
Example 3.4.9. Let (X_n)_{n≥1} be a sequence of i.i.d. Uniform([0, 1]) RVs.
(i) We show that X_n^{1/n} converges to 1 almost surely as n → ∞. Fix any ε ∈ (0, 1). Since X_n ≤ 1, we have (X_n)^{1/n} ≤ 1, so
P(|(X_n)^{1/n} − 1| > ε) = P((X_n)^{1/n} < 1 − ε) = P(X_n < (1 − ε)^n) = (1 − ε)^n,
which goes to 0 as n → ∞. Therefore, X_n^{1/n} converges to 1 in probability. Moreover, since Σ_{n=1}^{∞} (1 − ε)^n < ∞, Exercise 3.4.7 shows that X_n^{1/n} also converges to 1 almost surely.
which doesn’t converge to zero (but to a positive number), since 1 +1/22 +· · ·+1/n 2 is a conver-
gent series. Therefore, Vn doesn’t converge to zero in probability. Hence it also doesn’t converge
to 0 a.s..
▲
The following proposition is often used to ‘upgrade’ in-probability convergence to a.s. convergence.
Proposition 3.4.10 (Subsequential a.s. conv. = in prob. conv.). Let (X n )n≥1 be a sequence of RVs and fix
x ∈ R. Then X n → x as n → ∞ in probability if and only if for every subsequence X n(k) , k ≥ 1, there is a
further subsequence X n(k(m)) , m ≥ 1 such that X n(k(m)) → x a.s. as m → ∞.
PROOF. Suppose X_n → x in probability, and fix a subsequence X_{n(k)}, k ≥ 1. Then X_{n(k)} → x in probability as well, so for each m ≥ 1 we can choose k(m) > k(m − 1) such that P(|X_{n(k(m))} − x| > 1/m) ≤ 2^{−m}. Then for each ε > 0, Σ_{m≥1} P(|X_{n(k(m))} − x| > ε) < ∞, so by the Borel-Cantelli Lemma 3.4.6 (or Exercise 3.4.7), it holds that |X_{n(k(m))} − x| > ε for only finitely many m's with probability one. This means X_{n(k(m))} → x a.s. as m → ∞.
Conversely, suppose that for each subsequence X n(k) , k ≥ 1 of (X n )n≥1 , there is a further subsequence
X n(k(m)) , m ≥ 1 such that X n(k(m)) → x a.s. as m → ∞. Fix ε > 0. Let y n := P(|X n − x| > ε). We wish to show
that y n → 0 as n → ∞. Since a.s. convergence implies in probability convergence, the hypothesis implies
that for each subsequence y n(k) of y n , there exists a further subsequence n(k(m)) such that y n(k(m)) → 0.
This implies that y n cannot have a limit point other than zero. If y n ↛ 0, then there is an open neighbor-
hood U of 0 and a subsequence y n(k) such that y n(k) ∉ U for all k ≥ 1. But then no subsequence y n(k(m))
of y n(k) can converge to 0 since they are all outside of U . Hence y n must converge to zero. □
Example 3.4.11 (Fatou's lemma with in-probability convergence). Let X_n ≥ 0 and X_n → X in probability. Then the conclusion of Fatou's lemma (Thm. 1.3.18), namely lim inf_{n→∞} E[X_n] ≥ E[X], still holds despite only in-probability convergence. To see this, let n(k) be a subsequence such that E[X_{n(k)}] → lim inf_{n→∞} E[X_n] as k → ∞. By Prop. 3.4.10, there exists a further subsequence n(k(m)), m ≥ 1 such that X_{n(k(m))} → X almost surely. Then we apply Fatou's lemma along this sub-subsequence to get
lim inf_{n→∞} E[X_n] = lim_{k→∞} E[X_{n(k)}] = lim_{m→∞} E[X_{n(k(m))}] ≥ E[X]. ▲
Proposition 3.4.12 (In-prob. convergence implies weak convergence). Let X, (X_n)_{n≥1} be RVs on the same probability space such that X_n → X in probability (i.e., X_n − X → 0 in prob.). If f : R → R is continuous, then f(X_n) → f(X) in probability. Furthermore, if f is continuous and bounded, then E[f(X_n)] → E[f(X)] as n → ∞.
PROOF. Fix a subsequence n(k), k ≥ 1. By Proposition 3.4.10, there exists a further subsequence n(k(m)), m ≥ 1 such that X_{n(k(m))} → X a.s.. Then f(X_{n(k(m))}) → f(X) a.s..¹ By Proposition 3.4.10, it follows that f(X_n) → f(X) in probability.
Further assume that f is bounded. As above, every subsequence of (X_n)_{n≥1} has a further subsequence n(k(m)) along which f(X_{n(k(m))}) → f(X) a.s., so by BCT (Theorem 1.3.16), E[f(X_{n(k(m))})] → E[f(X)] as m → ∞. Since every subsequence of E[f(X_n)] thus has a further subsequence converging to E[f(X)], it follows that E[f(X_n)] → E[f(X)] as n → ∞. □
Now we prove a special case of the strong law of large numbers. The proof of the full statement (Theorem 3.1.2) has extra technicality, so here we prove the result under the stronger assumption of a finite fourth moment.
Theorem 3.4.13 (SLLN with fourth moment). Let (X_n)_{n≥1} be a sequence of i.i.d. RVs such that E[X_n⁴] < ∞. Let S_n = X_1 + · · · + X_n for all n ≥ 1. Then S_n/n converges to E[X_1] with probability 1.
PROOF. After reducing to the case E[X_1] = 0 (see below), our aim is to show that
Σ_{n=1}^{∞} E[(S_n/n)⁴] < ∞.
Then by Markov's inequality and the Borel-Cantelli lemma (Exercise 3.4.7), (S_n/n)⁴ converges to 0 with probability 1, hence S_n/n converges to 0 with probability 1, as desired.
As a preparation, we first verify that X_1 has finite first and second moments. It is easy to verify the inequality |x| ≤ 1 + x⁴ for all x ∈ R, so we have
E[|X_1|] ≤ 1 + E[X_1⁴] < ∞.
Hence E[X_1] exists. By shifting, we may assume that E[X_1] = 0. Similarly, x² ≤ c + x⁴ for all x ∈ R if c > 0 is large enough (e.g., c = 1), whence E[X_1²] < ∞.
Note that
E[S_n⁴] = E[(Σ_{k=1}^{n} X_k)⁴] = E[Σ_{1≤i,j,k,ℓ≤n} X_i X_j X_k X_ℓ] = Σ_{1≤i,j,k,ℓ≤n} E[X_i X_j X_k X_ℓ].
By independence and the assumption that E[X_1] = 0, we have E[X_i X_j X_k X_ℓ] = 0 if at least one of the four indices does not repeat. For instance,
E[X_1 X_2³] = E[X_1] E[X_2³] = 0,  E[X_1 X_2² X_3] = E[X_1] E[X_2²] E[X_3] = 0.
Hence, by collecting terms according to the pattern of repeated indices, we have
Σ_{1≤i,j,k,ℓ≤n} E[X_i X_j X_k X_ℓ] = Σ_{i=1}^{n} E[X_i⁴] + \binom{4}{2} Σ_{1≤i<j≤n} E[X_i²] E[X_j²] = n E[X_1⁴] + 3n(n − 1)(E[X_1²])².
Therefore E[(S_n/n)⁴] ≤ n^{−3} E[X_1⁴] + 3n^{−2}(E[X_1²])², which is summable in n. Hence by the Borel-Cantelli lemma, we conclude that (S_n/n)⁴ converges to 0 with probability 1. The same conclusion holds for S_n/n. This shows the assertion. □
¹To see this, assume Y_n → Y a.s. and fix ω ∈ Ω such that Y_n(ω) → Y(ω). Then by continuity of f, f(Y_n(ω)) → f(Y(ω)).
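The key moment identity E[S_n⁴] = n E[X_1⁴] + 3n(n − 1)(E[X_1²])² used above can be checked numerically; below is a minimal sketch assuming NumPy, with centered uniform variables as an arbitrary choice (for which E[X²] = 1/12 and E[X⁴] = 1/80).

```python
# Sketch: Monte Carlo estimate of E[S_n^4] versus the exact fourth-moment identity.
import numpy as np

rng = np.random.default_rng(4)
n, trials = 100, 50_000
X = rng.random((trials, n)) - 0.5            # i.i.d. Uniform(-1/2, 1/2), mean 0
S = X.sum(axis=1)
exact = n * (1 / 80) + 3 * n * (n - 1) * (1 / 12) ** 2
print(np.mean(S**4), "vs exact", exact)
```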
The following example shows that the converse of the Borel-Cantelli lemma is false.
Example 3.4.14 (The converse of BC lemma is false). Let ([0, 1], B, µ) be the probability space on the unit interval [0, 1], where B is the Borel σ-algebra on [0, 1] and µ is the Lebesgue measure on [0, 1]. Fix a sequence (a_n)_{n≥1} in (0, 1] and let A_n := (0, a_n). Suppose a_n = o(1). Then
{A_n i.o.} = lim sup_{n→∞} A_n = ∅.
However, if in addition a_n ≥ 1/n for all n, then
Σ_{n=1}^{∞} µ(A_n) = Σ_{n=1}^{∞} a_n = ∞.
Hence in this case, the probabilities of the A_n are not summable but still A_n does not occur infinitely often. ▲
Lemma 3.4.15 (The second Borel-Cantelli lemma). Let (Ω, F, P) be a probability space and let (A_n)_{n≥1} be independent events in F. Then we have the following implication:
Σ_{n=1}^{∞} P(A_n) = ∞  ⟹  P(A_n occurs i.o.) = 1.
PROOF. Fix integers N ≥ M ≥ 1. Then by using independence and 1 − x ≤ exp(−x) for all x ∈ R,
P(max_{M≤n≤N} 1_{A_n} = 0) = P(∩_{n=M}^{N} A_n^c) = Π_{n=M}^{N} (1 − P(A_n)) ≤ Π_{n=M}^{N} exp(−P(A_n)) = exp(−Σ_{n=M}^{N} P(A_n)) → 0 as N → ∞.
It follows that P(sup_{n≥M} 1_{A_n} = 0) = 0 for all M ≥ 1. Noting that sup_{n≥M} 1_{A_n} takes values in {0, 1}, we have sup_{n≥M} 1_{A_n} = 1 a.s.. Hence
P(A_n i.o.) = E[1(A_n i.o.)] = E[lim sup_{n→∞} 1_{A_n}] = E[lim_{M→∞} sup_{n≥M} 1_{A_n}] = E[lim_{M→∞} 1] = 1. □
The following is a typical application of the second BC Lemma.
Proposition 3.4.16 (SLLN does not hold without first moment). Let (X_n)_{n≥1} be i.i.d. RVs on the same probability space with E[|X_n|] = ∞. Then P(|X_n| ≥ n i.o.) = 1. Furthermore, for S_n = X_1 + · · · + X_n,
P(ω : lim_{n→∞} S_n(ω)/n exists in (−∞, ∞)) = 0.
PROOF. By the tail-sum formula (Prop. 1.5.10) and since x ↦ P(|X_1| > x) is non-increasing,
∞ = E[|X_1|] = ∫_0^∞ P(|X_1| > x) dx ≤ 1 + Σ_{n=1}^{∞} P(|X_1| ≥ n) = 1 + Σ_{n=1}^{∞} P(|X_n| ≥ n),
where the last equality uses that the X_n are identically distributed. Since the events {|X_n| ≥ n} are independent, by the second BC Lemma (see Lemma 3.4.15), P(|X_n| ≥ n i.o.) = 1. To show the second part, write
A := {ω : lim_{n→∞} S_n(ω)/n exists in (−∞, ∞)},  B := {ω : |X_n(ω)| ≥ n i.o.}.
Since we have shown that P(B) = 1, we get P(A \ B) = 0, as 0 ≤ P(A \ B) ≤ P(B^c) = 0. Hence
P(A) = P(A ∩ B).
We would like to show that the RHS equals zero. For this, it suffices to show that A ∩ B = ∅. To this end, suppose there exists ω ∈ A ∩ B. Write
S_n/n − S_{n+1}/(n + 1) = S_n/(n(n + 1)) − X_{n+1}/(n + 1).
On the one hand, since ω ∈ A, the LHS must converge to zero, and also S_n(ω)/(n(n + 1)) → 0. On the other hand, since ω ∈ B, we have |X_{n+1}(ω)|/(n + 1) ≥ 1 i.o., so |S_n(ω)/(n(n + 1)) − X_{n+1}(ω)/(n + 1)| > 1/2 i.o.. This is a contradiction. Thus A ∩ B = ∅, as desired. □
From Proposition 3.4.16, we now know that E[|X_1|] < ∞ is necessary in order for SLLN to hold. In fact, this condition is also sufficient for SLLN to hold, as we will prove later.
It will be useful for later examples to strengthen the second BC Lemma into the following version:
Lemma 3.4.17 (A quantitative version of the 2nd BC Lemma). Let (Ω, F, P) be a probability space and let A_1, A_2, . . . be pairwise independent events. Suppose that Σ_{n=1}^{∞} P(A_n) = ∞. Then, almost surely,
(# of occurrences of A_1, . . . , A_n)/(expected # of occurrences of A_1, . . . , A_n) = Σ_{m=1}^{n} 1(A_m) / Σ_{m=1}^{n} P(A_m) → 1  as n → ∞.
PROOF. Let X_m := 1(A_m) and S_n = X_1 + · · · + X_n. We wish to show that S_n/E[S_n] → 1 a.s.. We first show that the convergence holds in probability. Since the events A_1, A_2, . . . are pairwise independent, so are the indicators 1(A_1), 1(A_2), . . ., so
Var(S_n) = Var(X_1) + · · · + Var(X_n) ≤ E[X_1²] + · · · + E[X_n²] = E[X_1] + · · · + E[X_n] = E[S_n],
where the second-to-last equality uses the fact that the X_m are 0-1 RVs. Now by Chebyshef's inequality and the hypothesis that E[S_n] → ∞, for each δ > 0,
P(|S_n − E[S_n]| > δ E[S_n]) ≤ Var(S_n)/(δ² E[S_n]²) ≤ 1/(δ² E[S_n]) → 0 as n → ∞.
This shows that S_n/E[S_n] → 1 in probability.
In order to upgrade the above convergence to almost sure convergence, we use a subsequence and monotonicity argument. Denote W_n := S_n/E[S_n]. We wish to show that W_n → 1 a.s.. Recall that we have shown the following: for all δ > 0 and n ≥ 1,
P(|W_n − 1| > δ) ≤ 1/(δ² E[S_n]).
The key point of the argument to follow is this: in order to show X_n/c_n → 1 a.s., where c_n → ∞, X_n ≥ 0, and X_n is non-decreasing, it suffices to show that X_{n(k)}/c_{n(k)} → 1 a.s. and c_{n(k+1)}/c_{n(k)} → 1 for some subsequence n(k) → ∞.
Example 3.4.18 (Record values). Suppose X_1, X_2, . . . are i.i.d. RVs on a probability space (Ω, F, P) with a continuous distribution (so that ties have probability zero). Let A_n := {X_n > sup_{1≤k<n} X_k} denote the event that X_n exceeds all preceding RVs, i.e., that a record is made at the nth draw. Let R_n := Σ_{k=1}^{n} 1(A_k) denote the number of records up to time n. We will show that
R_n / log n → 1 a.s. as n → ∞.
In order to show this, we claim the following:
(∗) A_1, A_2, . . . are mutually independent and P(A_n) = 1/n.
Once we have the above claim, noting that E[R_n] = Σ_{k=1}^{n} k^{−1} ∼ log n, we can deduce R_n/log n → 1 a.s. by Lemma 3.4.17.
It remains to justify the claim (∗). To do so, fix n ≥ 1 and consider the order statistics Y_{1;n} ≥ Y_{2;n} ≥ · · · ≥ Y_{n;n} of X_1, . . . , X_n. For each sample ω ∈ Ω, there exists a permutation σ(ω) on {1, . . . , n} such that Y_{i;n}(ω) = X_{σ(ω)(i)}(ω). Omitting the dependence on ω, we obtain a random permutation σ on {1, . . . , n}. Since X_1, . . . , X_n are i.i.d., the distribution of the order statistics is invariant under permuting the order of X_1, . . . , X_n. It follows that σ is uniformly distributed among all n! permutations on {1, . . . , n}. From here, we deduce
P(A_n) = P(σ(1) = n) = (n − 1)!/n! = 1/n.
To show mutual independence, notice that, for 1 ≤ m < n, the event A_m is determined by the relative order of X_1, . . . , X_m alone, and under the uniform random permutation these relative orders of initial segments are independent; one checks that P(A_{n_1} ∩ · · · ∩ A_{n_k}) = Π_{j=1}^{k} n_j^{−1} for all n_1 < · · · < n_k, which proves (∗). ▲
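The conclusion R_n/log n → 1 can be watched in simulation; below is a minimal sketch assuming NumPy, with standard normal samples as an arbitrary continuous distribution. The convergence is slow, of order 1/log n.

```python
# Sketch: number of records R_n compared to log n along one sample path.
import numpy as np

rng = np.random.default_rng(5)
n = 10**6
X = rng.standard_normal(n)
prev_max = np.maximum.accumulate(np.concatenate(([-np.inf], X[:-1])))
R = np.cumsum(X > prev_max)                  # R_m = number of records up to time m
for m in (10**2, 10**4, 10**6):
    print(m, R[m - 1] / np.log(m))
```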
By the BC Lemma (see Lemma 3.4.6), it follows that, almost surely, there exists N_ε > 0 such that ℓ_n ≤ (1 + ε) log₂ n for all n ≥ N_ε. Then for all n ≥ N_ε, since x ↦ log₂ x is increasing, we get L_n ≤ max(L_{N_ε}, (1 + ε) log₂ n) ≤ (1 + ε) log₂ n for all sufficiently large n, and hence lim sup_{n→∞} L_n/log₂ n ≤ 1 + ε a.s..
For the other direction, fix n ≥ 1 and partition {1, . . . , n} into disjoint blocks of size R := [1 + (1 − ε) log₂ n]: namely, consider the interval partition (0, R], (R, 2R], and so on. We consider the indicators 1(ℓ_{kR} < R), which are determined by the non-overlapping sets of RVs {X_{(k−1)R+1}, . . . , X_{kR}}; hence these indicator RVs are independent. Furthermore, P(ℓ_{kR} < R) = 1 − 2^{−R} = 1 − n^{−1+ε}. Then
P(L_n ≤ (1 − ε) log₂ n) ≤ P(ℓ_R, ℓ_{2R}, . . . , ℓ_{kR} < R) = (1 − n^{−1+ε})^{n/R} ≤ exp(−n^ε/(2 log₂ n)).
Since the last term is summable in n, by the BC Lemma, L_n > (1 − ε) log₂ n for all sufficiently large n almost surely. This yields lim inf_{n→∞} L_n/log₂ n ≥ 1 − ε a.s.. Letting ε ↘ 0 then shows the claim. ▲
(a.s. convergence along a subsequence) Fix α > 1. We use an exponential subsequence n(k) := [α^k], k ≥ 1. We will show (23) along the subsequence n(k), k ≥ 1. We start as usual by estimating the tail probabilities using Chebyshef: for each ε > 0,
Σ_{k=1}^{∞} P(|S̄_{n(k)} − E[S̄_{n(k)}]| > ε n(k)) ≤ Σ_{k=1}^{∞} Var(S̄_{n(k)})/(ε² n(k)²) = Σ_{k=1}^{∞} (1/(ε² n(k)²)) Σ_{m=1}^{n(k)} Var(X̄_m)
= ε^{−2} Σ_{m=1}^{∞} Var(X̄_m) Σ_{k : n(k)≥m} n(k)^{−2},
where the last equality uses Fubini's theorem for interchanging the double sum of nonnegative terms.
Now noting that n(k) = [α^k] ≥ α^k/2 for k ≥ 1,
Σ_{k : n(k)≥m} n(k)^{−2} ≤ 4 Σ_{k : α^k ≥ m} α^{−2k} ≤ (4/(1 − α^{−2})) m^{−2},
where we have used the fact that (α^{−2k})_{k : α^k ≥ m} is a geometric sequence with ratio α^{−2} and first term at most m^{−2}. This and Lemma 3.5.2 give
Σ_{m=1}^{∞} Var(X̄_m) Σ_{k : n(k)≥m} n(k)^{−2} ≤ (4/(1 − α^{−2})) Σ_{m=1}^{∞} Var(X̄_m)/m² ≤ 16 E[|X_1|]/(1 − α^{−2}) < ∞.
Combining with the first displayed inequality in this case, we have shown that
Σ_{k=1}^{∞} P(|S̄_{n(k)} − E[S̄_{n(k)}]| > ε n(k)) < ∞.
Since this holds for all ε > 0, by the Borel-Cantelli lemma, it follows that
(S̄_{n(k)} − E[S̄_{n(k)}])/n(k) → 0 a.s. as k → ∞.
To complete the proof of the current case, it remains to show that
n(k)^{−1} E[S̄_{n(k)}] → µ as k → ∞.
Note that by DCT, we have E[X̄_n] = E[X_1 1(|X_1| ≤ n)] → E[X_1] = µ as n → ∞. Hence n^{−1} E[S̄_n] is the average of quantities that each converge to µ, so it should also converge to µ (Cesàro). For a more rigorous justification, note that E[|X_k| 1(|X_k| > k)] = E[|X_k|] − E[|X_k| 1(|X_k| ≤ k)] → 0 as k → ∞ by MCT. Fix ε > 0. Then there exists N = N(ε) ≥ 1 such that E[|X_k| 1(|X_k| > k)] ≤ ε for all k ≥ N. Then by Jensen's inequality,
|E[S_n] − E[S̄_n]| ≤ |Σ_{k=1}^{n} E[X_k 1(|X_k| > k)]| ≤ Σ_{k=1}^{n} E[|X_k| 1(|X_k| > k)] ≤ nε + Σ_{k=1}^{N(ε)} E[|X_k|].
Dividing both sides by n and letting n → ∞ shows lim sup_{n→∞} n^{−1} |E[S_n] − E[S̄_n]| ≤ ε. Since ε > 0 was arbitrary, this shows lim_{n→∞} n^{−1} |E[S_n] − E[S̄_n]| = 0, which is enough to conclude.
(Interpolation) For each m ≥ 1, let k = k(m) be the unique integer such that n(k) ≤ m < n(k + 1). Since we are assuming X_n ≥ 0 for all n ≥ 1, we have
S_{n(k)} ≤ S_m ≤ S_{n(k+1)}.
This yields
(n(k)/n(k + 1)) · (S_{n(k)}/n(k)) ≤ S_m/m ≤ (S_{n(k+1)}/n(k + 1)) · (n(k + 1)/n(k)).
Recall that by the choice of n(k), we have n(k + 1)/n(k) → α as k → ∞. Using the previous part, we conclude that, almost surely,
α^{−1} µ ≤ lim inf_{m→∞} S_m/m ≤ lim sup_{m→∞} S_m/m ≤ α µ.
Since α > 1 was arbitrary, letting α ↘ 1 then shows the assertion. □
The following lemma was used in the proof of SLLN above.
Lemma 3.5.2 (An estimate used in the proof of SLLN). Let (X_n)_{n≥1} be a sequence of i.i.d. RVs such that E[|X_n|] < ∞. Let X̄_n := X_n 1(|X_n| ≤ n). Then
Σ_{n=1}^{∞} Var(X̄_n)/n² ≤ 4 E[|X_1|] < ∞.
PROOF. We first show two claims. Note that for any RV Y with E[Y²] < ∞,
E[Y²] = ∫_0^∞ P(Y² > y) dy = ∫_0^∞ P(|Y| > √y) dy = ∫_0^∞ 2t P(|Y| > t) dt,
where we made the change of variable y = t². Moreover, for each y ≥ 0, we claim that
Σ_{n=1}^{∞} y n^{−2} 1(y ≤ n) ≤ 2.
Indeed, note that for y > 1,
Σ_{n=1}^{∞} y n^{−2} 1(y ≤ n) = Σ_{n=1}^{∞} y n^{−2} 1(⌈y⌉ ≤ n) ≤ y ∫_{⌈y⌉−1}^{∞} x^{−2} dx = y/(⌈y⌉ − 1) ≤ 2.
For 0 ≤ y ≤ 1, we have
Σ_{n=1}^{∞} y n^{−2} 1(y ≤ n) ≤ Σ_{n=1}^{∞} n^{−2} = 1 + Σ_{n=2}^{∞} n^{−2} ≤ 1 + ∫_1^∞ x^{−2} dx = 2.
Now suppose X_1, X_2, . . . are i.i.d. with distribution function F, and let F_n denote the empirical distribution function of the first n samples. For each fixed x ∈ R, the RVs Y_k := 1(X_k ≤ x) are i.i.d., so by SLLN, as n → ∞, we must have
F_n(x) = n^{−1} Σ_{k=1}^{n} Y_k → E[Y_1] = P(X_1 ≤ x) = F(x) a.s.
This shows F_n → F pointwise, almost surely. In fact, a stronger statement holds: the worst-case error sup_x |F_n(x) − F(x)| also converges to zero almost surely. This is the content of the following Glivenko-Cantelli theorem.
Theorem 3.5.4 (Glivenko-Cantelli theorem). Let X_1, X_2, . . . be i.i.d. RVs on the same probability space (Ω, F, P) with distribution function F, and let F_n be the empirical distribution function. Then
P(lim_{n→∞} sup_x |F_n(x) − F(x)| = 0) = 1.
PROOF. Fix m > 0. For each j = 1, . . . , m − 1, define x_{m,j} := inf{y : F(y) ≥ j/m}. Since F is non-decreasing, x_{m,1} ≤ x_{m,2} ≤ · · · ≤ x_{m,m−1}. Denote x_{m,0} := −∞ and x_{m,m} := ∞. Then for each j = 1, . . . , m,
F(x_{m,j}−) − F(x_{m,j−1}) ≤ j/m − (j − 1)/m = m^{−1}.
Fix x ∈ R. Then there exists k ∈ {1, . . . , m} such that x_{m,k−1} ≤ x < x_{m,k}. Then by monotonicity,
F_n(x) − F(x) ≤ F_n(x_{m,k}−) − F(x_{m,k−1}) ≤ F_n(x_{m,k}−) − F(x_{m,k}−) + m^{−1}.
Similarly,
F(x) − F_n(x) ≤ F(x_{m,k}−) − F_n(x_{m,k−1}) ≤ F(x_{m,k−1}) − F_n(x_{m,k−1}) + m^{−1}.
Thus we have
|F_n(x) − F(x)| ≤ m^{−1} + Σ_{k=1}^{m} (|F_n(x_{m,k}) − F(x_{m,k})| + |F_n(x_{m,k}−) − F(x_{m,k}−)|).
The RHS above does not depend on x, so we may take the supremum over x on the LHS. Also, by SLLN (see Theorem 3.5.1), for each fixed y ∈ R, F_n(y) → F(y) almost surely as n → ∞. Furthermore, F_n(y−) = n^{−1} Σ_{k=1}^{n} 1(X_k < y) → P(X_1 < y) = F(y−) almost surely as n → ∞. Hence, almost surely, lim sup_{n→∞} sup_x |F_n(x) − F(x)| ≤ m^{−1}; letting m → ∞ completes the proof. □
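A minimal numerical illustration of the theorem is sketched below, assuming NumPy; Exp(1) data is an arbitrary choice. Since F is continuous here, the supremum of |F_n − F| is attained by comparing F at the order statistics from the left and from the right.

```python
# Sketch: sup_x |F_n(x) - F(x)| -> 0 along one sample path (Glivenko-Cantelli).
import numpy as np

rng = np.random.default_rng(6)
X = rng.exponential(size=10**6)
F = lambda x: 1 - np.exp(-x)                 # true CDF of Exp(1)

for n in (10**2, 10**4, 10**6):
    xs = np.sort(X[:n])
    right = np.arange(1, n + 1) / n          # F_n at the order statistics
    left = np.arange(0, n) / n               # F_n just below the order statistics
    d = max(np.max(np.abs(right - F(xs))), np.max(np.abs(left - F(xs))))
    print(n, d)
```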
Example 3.5.5 (Shannon's theorem). Let X_1, X_2, . . . be i.i.d. RVs on a probability space (Ω, F, P) taking values from A := {1, . . . , r} with PMF p(k) := P(X_1 = k) > 0 for all k ∈ {1, . . . , r}. Think of A as an alphabet and of the sequence X_1, . . . , X_n as a random string of length n in this alphabet. For each n ≥ 1, define a RV Σ_n as
Σ_n := p(X_1) p(X_2) · · · p(X_n).
Then by SLLN, almost surely as n → ∞,
−n^{−1} log Σ_n = −(1/n) Σ_{i=1}^{n} log p(X_i) → E[−log p(X_1)] = −Σ_{k=1}^{r} p(k) log p(k) =: H.
The quantity H is called the entropy of the PMF p, and it measures how random p is. The above result implies that −n^{−1} log Σ_n → H in probability, so for each ε > 0,
P(exp(−n(H + ε)) ≤ Σ_n ≤ exp(−n(H − ε))) = P(|H + n^{−1} log Σ_n| ≤ ε) → 1 as n → ∞. ▲
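The convergence −n^{−1} log Σ_n → H is easy to watch numerically; below is a minimal sketch assuming NumPy, with a three-letter alphabet and an arbitrary PMF.

```python
# Sketch: -log(Sigma_n)/n -> H (entropy) for an i.i.d. random string.
import numpy as np

rng = np.random.default_rng(7)
p = np.array([0.5, 0.3, 0.2])                 # hypothetical PMF on a 3-letter alphabet
H = -np.sum(p * np.log(p))

X = rng.choice(3, size=10**6, p=p)            # random string (letters coded 0, 1, 2)
logp = np.log(p)[X]                           # log p(X_i) for each letter
for n in (10**2, 10**4, 10**6):
    print(n, -logp[:n].mean(), "vs H =", H)
```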
Note that these three processes (arrival times, inter-arrival times, and counting) determine each other:
(T_k)_{k≥1} ⟺ (τ_k)_{k≥1} ⟺ (N(t))_{t≥0}.
Exercise 3.6.1. Let (T_k)_{k≥1} be any arrival process and let (N(t))_{t≥0} be its associated counting process. Show that these two processes determine each other via the relation
{T_n ≤ t} = {N(t) ≥ n}.
In words, the nth customer arrives by time t if and only if at least n customers arrive up to time t.
FIGURE 3.6.1. Illustration of a continuous-time arrival process (T_k)_{k≥1} and its associated counting process (N(t))_{t≥0}. The τ_k's denote inter-arrival times; N(t) ≡ 3 for T_3 < t ≤ T_4.
Exercise 3.6.2 (Well-definedness of the counting process). Let (N(t))_{t≥0} denote the counting process of an arrival process with inter-arrival times (τ_k)_{k≥1} (not necessarily independent or identically distributed) such that, almost surely, there exists ε > 0 for which τ_k > ε infinitely often. Recall that N(t) and the τ_k's are related by
N(t) = Σ_{n=1}^{∞} 1(τ_1 + · · · + τ_n ≤ t).
(i) Show that T_n ≥ Σ_{k=1}^{n} ε 1(τ_k ≥ ε) for n ≥ 1.
(ii) Fix t ≥ 0. Justify the following steps:
P(N(t) = ∞) = P(sup_{n≥1} T_n ≤ t) ≤ P(Σ_{k=1}^{∞} ε 1(τ_k ≥ ε) ≤ t) = P(Σ_{k=1}^{∞} 1(τ_k ≥ ε) ≤ t/ε)
≤ P(∃N ≥ 1 s.t. τ_k < ε for all k ≥ N) ≤ Σ_{N=1}^{∞} P(τ_k < ε for all k ≥ N) = 0.
Deduce that P(N(t) < ∞) = 1 for all t ≥ 0.
Definition 3.6.3 (Renewal process). A counting process (N (t ))t ≥0 is called a renewal process if its inter-
arrival times τ1 , τ2 , · · · are i.i.d. with E[τ1 ] < ∞.
A cornerstone in the theory of renewal processes is the following strong law of large numbers for renewal processes.
Theorem 3.6.4 (Renewal SLLN). Let (T_k)_{k≥0} be a renewal process and let (τ_k)_{k≥1} and (N(t))_{t≥0} be the associated inter-arrival times and counting process, respectively. Let E[τ_k] = µ be the mean inter-arrival time. If 0 < µ < ∞, then
P(lim_{t→∞} N(t)/t = 1/µ) = 1.
PROOF. First, write T_k = τ_1 + τ_2 + · · · + τ_k. Since the inter-arrival times are i.i.d. with mean µ < ∞, the strong law of large numbers implies
P(lim_{k→∞} T_k/k = µ) = 1.  (24)
Next, fix t ≥ 0 and suppose N(t) = n, so that there are n arrivals up to time t. Then the nth arrival time T_n must occur by time t, whereas the (n + 1)st arrival time T_{n+1} must occur after time t; hence T_n ≤ t < T_{n+1}. In general, we have
T_{N(t)} ≤ t < T_{N(t)+1}.
Dividing through by N(t), we get
T_{N(t)}/N(t) ≤ t/N(t) < (T_{N(t)+1}/(N(t) + 1)) · ((N(t) + 1)/N(t)).  (25)
To take the limit as t → ∞, we note that P(T_k < ∞) = 1 for all k, since P(T_k = ∞) ≤ Σ_{i=1}^{k} P(τ_i = ∞) = 0 (P(τ_i = ∞) > 0 would imply E[τ_i] = ∞, a contradiction). Hence, for each fixed k, we have N(t) ≥ k for all large enough t; since k was arbitrary, N(t) ↗ ∞ as t → ∞ with probability 1. Therefore, according to (24) and SLLN (see Theorem 3.5.1), we get
P(lim_{t→∞} T_{N(t)}/N(t) = µ) = P(lim_{t→∞} T_{N(t)+1}/(N(t) + 1) = µ) = P(lim_{t→∞} (N(t) + 1)/N(t) = 1) = 1.
FIGURE 3.6.2. Illustration of the inequalities (25).
Since µ > 0, we can take the reciprocal inside the above probability. This shows the assertion. □
Example 3.6.5 (Poisson process). Suppose (N(t))_{t≥0} is a renewal process whose inter-arrival times (τ_k)_{k≥1} have the Exp(λ) distribution for some λ > 0. In this case, we call (N(t))_{t≥0} a 'Poisson process with arrival rate λ'. Note that the mean inter-arrival time is E[τ_1] = 1/λ, so the renewal SLLN yields
P(lim_{t→∞} N(t)/t = λ) = 1.
Namely, with probability 1, we tend to see about λt arrivals during [0, t] as t → ∞; in other words, we tend to see about λ arrivals during an interval of unit length. Hence it makes sense to call the parameter λ the 'arrival rate' of the process. ▲
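A minimal simulation of this (assuming NumPy; λ and the horizons are arbitrary choices) builds the process from its Exp(λ) inter-arrival times and checks N(t)/t against λ:

```python
# Sketch: N(t)/t -> lambda for a Poisson process with rate lambda.
import numpy as np

rng = np.random.default_rng(8)
lam = 3.0
tau = rng.exponential(1 / lam, size=10**6)    # i.i.d. Exp(lambda) inter-arrivals
T = np.cumsum(tau)                            # arrival times T_k
for t in (10.0, 10**2, 10**4):
    N_t = np.searchsorted(T, t, side="right") # number of arrivals up to time t
    print(t, N_t / t)
```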
Next, we consider a renewal process together with rewards, one given for each inter-arrival time. This simple extension of renewal processes will greatly improve the applicability of our theory.
Let (T_k)_{k≥0} be a renewal process and let (τ_k)_{k≥1} and (N(t))_{t≥0} be the associated inter-arrival times and counting process, respectively. Suppose we have a sequence of rewards (Y_k)_{k≥1}, where the reward Y_k is received at the kth arrival time T_k. We define the reward process (R(t))_{t≥0} associated with the sequence (τ_k, Y_k)_{k≥1} as
R(t) = Σ_{k=1}^{N(t)} Y_k.
Namely, upon the kth arrival at time T_k = τ_1 + · · · + τ_k, we receive a reward of Y_k, so R(t) is the total reward up to time t.
As we looked at the average number of arrivals N(t)/t as t → ∞, a natural quantity to consider for the reward process is the 'average reward' R(t)/t as t → ∞. Intuitively, since everything refreshes upon each new arrival, we should expect
R(t)/t → (expected reward during one 'cycle')/(expected duration of one 'cycle')
as t → ∞ almost surely. This is made precise by the following result.
Theorem 3.6.6 (Renewal reward SLLN). Let (R(t))_{t≥0} be the reward process associated with an i.i.d. sequence of inter-arrival times (τ_k)_{k≥1} and an i.i.d. sequence of rewards (Y_k)_{k≥1}. Suppose E[|Y_k|] < ∞ and E[τ_k] ∈ (0, ∞). Then
P(lim_{t→∞} R(t)/t = E[Y_1]/E[τ_1]) = 1.
PROOF. Let (τ_k)_{k≥1} denote the inter-arrival times of the underlying renewal process (T_k)_{k≥0}. Note that
R(t)/t = ((1/N(t)) Σ_{k=1}^{N(t)} Y_k) · (N(t)/t).
Since N(t) → ∞ a.s., by SLLN the 'average reward' in the first factor converges to E[Y_1] almost surely. Moreover, the average number of arrivals N(t)/t converges to 1/E[τ_1] almost surely by Theorem 3.6.4. Hence the assertion follows. □
Remark 3.6.7. Theorem 3.6.4 can be obtained as a special case of the above reward version of SLLN, simply by choosing Y_k ≡ 1 for k ≥ 1 so that R(t) = N(t).
Example 3.6.8 (Long-run car costs). This example is excerpted from [DD99]. Mr. White does not drive the same car for more than t* years, where t* > 0 is some fixed number: he changes to a new car when the old one breaks down or reaches t* years. Let X_k be the lifetime of the kth car that Mr. White drives; the lifetimes are i.i.d. with finite expectation. Let τ_k be the duration for which he keeps his kth car. According to his policy, we have
τ_k = min(X_k, t*).
Let T_k = τ_1 + · · · + τ_k be the time at which Mr. White is done with the kth car. Then (T_k)_{k≥0} is a renewal process. Note that the expected running time for the kth car is
E[τ_k] = E[τ_k | X_k < t*] P(X_k < t*) + E[τ_k | X_k ≥ t*] P(X_k ≥ t*) = E[X_k | X_k < t*] P(X_k < t*) + t* P(X_k ≥ t*).
Suppose that the cost of a car over one cycle is g(τ_k), where
g(t) = A + B if t < t*,  and  g(t) = A if t ≥ t*.
Namely, if the car breaks down before t* years, then Mr. White has to pay A + B dollars; otherwise, the cost is only A dollars. Then the expected cost for one cycle is
E[g(τ_k)] = A + B P(τ_k < t*) = A + B P(X_k < t*).
Thus by Theorem 3.6.6, the long-run car cost of Mr. White is
lim_{t→∞} R(t)/t = E[g(τ_k)]/E[τ_k] = (A + B P(X_k < t*))/(E[X_k | X_k < t*] P(X_k < t*) + t* P(X_k ≥ t*)).
For a more concrete example, let X_k ∼ Uniform([0, 10]) and let A = 10 and B = 3. Then
E[g(τ_k)] = 10 + 3t*/10.
On the other hand,
E[τ_k] = E[X_k | X_k < t*] P(X_k < t*) + t* P(X_k ≥ t*) = (t*/2)(t*/10) + t* (10 − t*)/10 = t* − (t*)²/20.
For E[X_k | X_k < t*] = t*/2, observe that a uniform RV over [0, 10] conditioned on lying in [0, t*] is uniformly distributed over [0, t*]. This yields
E[g(τ_k)]/E[τ_k] = (10 + 0.3t*)/(t*(1 − t*/20)).
Lastly, in order to minimize the above long-run cost, we differentiate it in t* and find the global minimum. A straightforward computation shows that the long-run cost is minimized at
t* = (−1 + √1.6)/0.03 ≈ 8.83.
Thus the optimal strategy for Mr. White in this situation is to drive each car for up to 8.83 years. ▲
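A quick Monte Carlo check of this policy (assuming NumPy, with the lifetimes, A, and B as in the example) estimates the long-run cost by simulating many cycles and dividing total cost by total time; the minimum should appear near t* ≈ 8.83.

```python
# Sketch: simulated long-run cost of Mr. White's policy for several t*.
import numpy as np

rng = np.random.default_rng(9)

def long_run_cost(t_star, cycles=10**6):
    X = rng.random(cycles) * 10.0             # car lifetimes ~ Uniform([0, 10])
    tau = np.minimum(X, t_star)               # how long each car is driven
    cost = 10.0 + 3.0 * (X < t_star)          # A + B if it breaks down before t*
    return cost.sum() / tau.sum()             # total cost per unit time

for t_star in (5.0, 8.83, 10.0):
    print(t_star, long_run_cost(t_star))      # analytic: (10 + 0.3 t*)/(t*(1 - t*/20))
```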
(i) Let τ = inf{k ≥ 0 : |S_k| ≥ t} denote the first time that |S_k| reaches level t. Show that
E[S_n²] ≥ Σ_{k=1}^{n} E[S_n² 1(τ = k)].
(ii) For each 1 ≤ k ≤ n, note that S_k 1(τ = k) and S_n − S_k = X_{k+1} + · · · + X_n are independent. Deduce that
E[S_k 1(τ = k)(S_n − S_k)] = 0.
(iii) For each 1 ≤ k ≤ n, write S_n² = S_k² + 2S_k(S_n − S_k) + (S_n − S_k)². Using (ii), show that
E[S_n² 1(τ = k)] ≥ E[(S_k² + 2S_k(S_n − S_k)) 1(τ = k)] = E[S_k² 1(τ = k)] + 2E[S_k 1(τ = k)(S_n − S_k)] = E[S_k² 1(τ = k)] ≥ t² P(τ = k).
(iv) From (i)-(iii), deduce that
E[S_n²] ≥ t² Σ_{k=1}^{n} E[1(τ = k)] = t² E[Σ_{k=1}^{n} 1(τ = k)] = t² P(τ ≤ n) = t² P(max_{1≤k≤n} |S_k| ≥ t).
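The inequality just derived (Kolmogorov's maximal inequality) can be sanity-checked by simulation; a minimal sketch assuming NumPy follows, with ±1 steps and the level t as arbitrary choices (so that E[S_n²] = n).

```python
# Sketch: empirical P(max_k |S_k| >= t) versus the bound E[S_n^2]/t^2.
import numpy as np

rng = np.random.default_rng(10)
n, t, trials = 100, 25.0, 100_000
steps = rng.choice([-1.0, 1.0], size=(trials, n))
S = np.cumsum(steps, axis=1)                  # partial sums S_1, ..., S_n per trial
lhs = np.mean(np.max(np.abs(S), axis=1) >= t)
print(lhs, "<=", n / t**2)                    # E[S_n^2] = n for +-1 steps
```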
but this is a contradiction, since the LHS above is a sequence of real numbers, so it cannot converge weakly to a normal RV.
Lastly, suppose (i) and (iii) hold. Then by Lemma 3.7.2, Σ_{n≥1} (Y_n − E[Y_n]) converges a.s.. We have seen that the a.s. convergence of Σ_{n≥1} X_n together with (i) implies that Σ_{n≥1} Y_n converges a.s., so in our case Σ_{n≥1} Y_n converges a.s.. Now
Σ_{m=1}^{n} E[Y_m] = (Σ_{m=1}^{n} Y_m) − (Σ_{m=1}^{n} (Y_m − E[Y_m]))
is the difference of two a.s. convergent series, so it converges. This shows that (ii) holds. □
Exercise 3.7.4. Let X_1, X_2, . . . be independent RVs on the same probability space such that P(X_n = n^{−α}) = P(X_n = −n^{−α}) = 1/2 for n ≥ 1. Show that Σ_{n=1}^{∞} X_n converges almost surely if and only if α > 1/2. (Note that Σ_{n=1}^{∞} |X_n| converges if and only if α > 1, as |X_n| = n^{−α} a.s.; due to cancellation, convergence of Σ_{n=1}^{∞} X_n allows a slower decay rate.)
Exercise 3.7.5 (Kronecker's lemma). If a_n ↗ ∞ and Σ_{n=1}^{∞} x_n/a_n converges, then show that (1/a_n) Σ_{m=1}^{n} x_m → 0.
Theorem 3.7.6 (SLLN with a rate of convergence). Let X_1, X_2, . . . be independent RVs on the same probability space. Suppose that E[X_n] = 0 and E[X_n²] = σ² < ∞ for all n. Let S_n = X_1 + · · · + X_n. Then for any ε > 0,
S_n / (√n (log n)^{(1/2)+ε}) → 0 a.s. as n → ∞.
PROOF. Let a_n := √n (log n)^{(1/2)+ε} for n ≥ 2 and set a_1 = 1. Then
Σ_{n=1}^{∞} Var(X_n/a_n) = σ² (1 + Σ_{n=2}^{∞} 1/(n (log n)^{1+2ε})) < ∞.
By Lemma 3.7.2, it follows that Σ_{n=1}^{∞} X_n/a_n converges almost surely. Then by Kronecker's lemma (see Exercise 3.7.5), a_n^{−1} S_n → 0 almost surely, as desired. □
Remark 3.7.7. The correct rate of convergence for SLLN is given by the law of the iterated logarithm:
lim sup_{n→∞} S_n / √(n log log n) = σ√2 a.s.
The rate given by Theorem 3.7.6 is not too far from this optimal rate.
In this section, we have studied necessary and sufficient conditions for
P(Σ_{n=1}^{∞} X_n converges) = 1
to hold. But what if we only want some positive probability that the random series converges? Is there some condition weaker than those stated in Theorem 3.7.3 that implies
P(Σ_{n=1}^{∞} X_n converges) > 0?
The answer is no. Surprisingly, the only possible values for P(Σ_{n=1}^{∞} X_n converges) are 0 and 1, so if this probability is positive, it has to be one. This type of result is known as a '0-1 law'. See Exercise 3.7.8 for details.
Exercise 3.7.8 (Kolmogorov's 0-1 law). Let X_1, X_2, . . . be RVs on the same probability space. For each n ≥ 1, let F_n := σ(X_n, X_{n+1}, . . . ) and let T := ∩_{n=1}^{∞} F_n. Here T is called the tail σ-algebra for the sequence of RVs (X_n)_{n≥1}.³ If A ∈ T, then A is called a tail event.
Now let X_1, X_2, . . . be independent RVs on the same probability space and let T denote the corresponding tail σ-algebra. Let A ∈ T be any tail event. In this exercise, we will show that P(A) ∈ {0, 1}. This is called Kolmogorov's 0-1 law.
³Note that A ∈ T if and only if the occurrence of A does not change upon changing the values of any finite number of the X_n's; that is, A depends only on the 'tail' of the sequence (X_n)_{n≥1}.
(i) If B ∈ σ(X 1 , . . . , X m ) and C ∈ σ(X m+1 , X m+2 , . . . ), then show that B and C are independent. (Hint: if
C ∈ σ(X m+1 , . . . , X m+k ) for some k ≥ 1 use Prop. 2.2.4. For the general case, the latter σ-algebra is
S
generated by k≥1 σ(X m+1 , . . . , X m+k ), which is independent from σ(X 1 , . . . , X m ). Then use Prop.
2.2.2 to conclude.)
(ii) If B ∈ σ(X 1 , X 2 , . . . ) and C ∈ T , then B and C are independent. (Hint: if B ∈ σ(X 1 , . . . , X k ) for some
k ≥ 1 use Prop. 2.2.4. For the general case, argue similarly as before.)
(iii) From (i)-(ii), deduce that A is independent of itself. Hence P(A) = P(A ∩ A) = P(A)², so P(A) ∈ {0, 1}.
Example 3.7.9. If B n is a Borel subset of R, then {X n ∈ B n i.o.} is a tail event. If X n = 1(A n ) and B n = {1},
then it follows that {A n i.o.} is also a tail event. Also, {limn→∞ X n exists} is a tail event. By Exercise 3.7.8,
these events all have probability 0 or 1.
Exercise 3.7.10 (Exchangeable sequence and Hewitt-Savage 0-1 law). Let X_1, X_2, . . . be a sequence of RVs on a probability space (Ω, F, P). An event A ∈ F is called exchangeable if there exists a Borel set B ⊆ R^N such that
{(X_1, X_2, . . . ) ∈ B} = {(X_{σ(1)}, X_{σ(2)}, . . . ) ∈ B}
for all finite permutations σ : N → N (here σ permutes only the first n entries for some finite n).
Now further assume that X_1, X_2, . . . are i.i.d. and F = σ(X_1, X_2, . . . ). Let A ∈ F be an exchangeable event. In this exercise, we will show that P(A) ∈ {0, 1}. This is called the Hewitt-Savage 0-1 law.
(i) Let F_n := σ(X_1, . . . , X_n). Show that there is a sequence of events A_n ∈ F_n, n ≥ 1, such that P(A_n Δ A) → 0 as n → ∞ (here C Δ D := (C \ D) ∪ (D \ C) denotes the symmetric difference). We will write A_n = {(X_1, . . . , X_n) ∈ B_n} for some Borel B_n ⊆ R^n.
(ii) Let Ã_n := {(X_{n+1}, . . . , X_{2n}) ∈ B_n}. Show that P(Ã_n Δ A) = P(A_n Δ A) = o(1). (Hint: Let σ be the permutation on N that swaps {1, . . . , n} with {n + 1, . . . , 2n} and leaves all other integers fixed. Then σ maps A_n to Ã_n but fixes A. Since the RVs are i.i.d., applying σ to the indices does not change probabilities.)
(iii) Deduce that P(A_n ∩ Ã_n) → P(A) and P(A_n ∩ Ã_n) = P(A_n) P(Ã_n) = P(A_n)² → P(A)² as n → ∞. Conclude that P(A) ∈ {0, 1}. (Hint: Use (ii) and that the RVs are i.i.d..)
Remark 3.7.11. Given a stochastic process with i.i.d. increments, the event that a state is visited infinitely often is in the tail σ-algebra of the values of the process, but not in the tail σ-algebra of the increments, so Kolmogorov's 0-1 law does not apply. It is, however, an exchangeable event, and so it occurs with probability 0 or 1 by the Hewitt-Savage 0-1 law.
CHAPTER 4
In this chapter, we will study the celebrated central limit theorem (CLT) that we briefly touched upon
in Theorem 3.1.3.
Theorem 4.1.1 (De Moivre-Laplace CLT). Let (X_k)_{k≥1} be i.i.d. RVs with P(X_k = 1) = P(X_k = −1) = 1/2, and let S_n = X_1 + · · · + X_n. Then for any a < b,
lim_{n→∞} P(a√(2n) ≤ S_{2n} ≤ b√(2n)) = ∫_a^b (2π)^{−1/2} e^{−x²/2} dx.
The proof is based on an asymptotic expression for the PMF of the simple symmetric random walk (Lemma 4.1.3), which in turn rests on Stirling's approximation (Exercise 4.1.4). We will also use the following elementary limit (Proposition 4.1.2): if c_j → 0, a_j → ∞, and a_j c_j → λ, then
lim_{j→∞} (1 + c_j)^{a_j} = e^λ.
Below, a_n ∼ b_n means a_n/b_n → 1 as n → ∞.
PROOF. Note that S_{2n} = 2k if and only if n + k of the X_i's are +1 and n − k of the X_i's are −1. Hence
P(S_{2n} = 2k) = \binom{2n}{n+k} 2^{−2n} = (2n)! 2^{−2n} / ((n + k)!(n − k)!).
By Stirling's approximation (27),
(2n)! 2^{−2n} / ((n + k)!(n − k)!) ∼ 2^{−2n} (2n/e)^{2n} √(4πn) / (((n + k)/e)^{n+k} √(2π(n + k)) · ((n − k)/e)^{n−k} √(2π(n − k)))
= (1/√π) · n^{2n} √n / ((n + k)^{n+k} √(n + k) · (n − k)^{n−k} √(n − k))
= (1/√(πn)) (1 + k/n)^{−n−k} (1 − k/n)^{−n+k} (1 + k/n)^{−1/2} (1 − k/n)^{−1/2}
= (1/√(πn)) (1 − k²/n²)^{−n} (1 + k/n)^{−k} (1 − k/n)^{k} (1 + k/n)^{−1/2} (1 − k/n)^{−1/2}.
Now set 2k = x√(2n), so that k²/n → x²/2 as n → ∞. Using Proposition 4.1.2, we have
lim_{n→∞} (1 − k²/n²)^{−n} = exp(lim_{n→∞} k²/n) = exp(x²/2),
lim_{n→∞} (1 + k/n)^{−k} = exp(−lim_{n→∞} k²/n) = exp(−x²/2),
lim_{n→∞} (1 − k/n)^{k} = exp(−lim_{n→∞} k²/n) = exp(−x²/2),
while (1 + k/n)^{−1/2} (1 − k/n)^{−1/2} → 1. Multiplying these limits together, we conclude
P(S_{2n} = x√(2n)) ∼ (πn)^{−1/2} exp(−x²/2). □
PROOF OF THEOREM 4.1.1. Let 2Z denote the set of all even integers. Then
P(a√(2n) ≤ S_{2n} ≤ b√(2n)) = Σ_{m ∈ [a√(2n), b√(2n)] ∩ 2Z} P(S_{2n} = m) = Σ_{x ∈ [a,b] ∩ (2Z/√(2n))} P(S_{2n} = x√(2n)),
where we used the change of variable x = m/√(2n), and 2Z/√(2n) denotes the set of all even integers divided by √(2n). Then by Lemma 4.1.3,
Σ_{x ∈ [a,b] ∩ (2Z/√(2n))} P(S_{2n} = x√(2n)) ≈ Σ_{x ∈ [a,b] ∩ (2Z/√(2n))} (1/√(πn)) e^{−x²/2} = Σ_{x ∈ [a,b] ∩ (2Z/√(2n))} (1/√(2π)) e^{−x²/2} · (1/√(n/2)).
We recognize the last summation as a Riemann sum for the definite integral ∫_a^b (2π)^{−1/2} e^{−x²/2} dx, since we are summing over rectangles of width 1/√(n/2) within the interval [a, b]. This gives
Σ_{x ∈ [a,b] ∩ (2Z/√(2n))} P(S_{2n} = x√(2n)) ≈ ∫_a^b (1/√(2π)) e^{−x²/2} dx.
The integral above equals P(a ≤ Z ≤ b), where Z ∼ N(0, 1). This shows the assertion. □
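A minimal numerical check of the theorem (assuming NumPy; a, b, n are arbitrary choices) compares the empirical probability with the normal integral:

```python
# Sketch: P(a sqrt(2n) <= S_2n <= b sqrt(2n)) versus P(a <= Z <= b), Z ~ N(0, 1).
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(11)
n, a, b, trials = 500, -1.0, 1.5, 200_000
S = 2 * rng.binomial(2 * n, 0.5, size=trials) - 2 * n   # S_2n for +-1 steps
emp = np.mean((a * sqrt(2 * n) <= S) & (S <= b * sqrt(2 * n)))
Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))            # standard normal CDF
print(emp, "vs", Phi(b) - Phi(a))
```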
4.1. THE DE MOIVRE-LAPLACE CLT 92
Exercise 4.1.5 (Convergence of products). This exercise generalizes Proposition 4.1.2 to triangular arrays. Consider a triangular array of sequences c_{1,n}, c_{2,n}, . . . , c_{n,n} for n ≥ 1. Suppose
max_{1≤k≤n} |c_{k,n}| = o(1),  lim_{n→∞} Σ_{k=1}^{n} c_{k,n} = λ,  sup_{n≥1} Σ_{k=1}^{n} |c_{k,n}| < ∞.
Show that lim_{n→∞} Π_{k=1}^{n} (1 + c_{k,n}) = e^λ.
Example 4.2.7 (Birthday problem). Let X_1, X_2, . . . be i.i.d. RVs with uniform distribution on {1, . . . , N}. For each N ≥ 1, define T_N := max{n ≥ 1 : X_1, . . . , X_n are all distinct}. Note that
P(T_N ≥ n) = Π_{m=2}^{n} (1 − (m − 1)/N).  (28)
When N = 365, the above equals the probability of not having two identical birthdays in a class of n students. This quantity is surprisingly small even for modest values of n. To see this, let c_{m,N} := (m − 1)/N and consider the triangular array c_{1,N}, . . . , c_{⌊x√N⌋,N} for N ≥ 1. Then
max_{1≤m≤⌊x√N⌋} c_{m,N} = (⌊x√N⌋ − 1)/N = o(1),
Σ_{m=1}^{⌊x√N⌋} (m − 1)/N = (1/N) · ⌊x√N⌋(⌊x√N⌋ − 1)/2 → x²/2,
sup_{N≥1} Σ_{m=1}^{⌊x√N⌋} (m − 1)/N = sup_{N≥1} (1/N) · ⌊x√N⌋(⌊x√N⌋ − 1)/2 < ∞.
Hence by Exercise 4.1.5, we have
lim_{N→∞} P(T_N/√N ≥ x) = lim_{N→∞} P(T_N ≥ ⌊x√N⌋) = exp(−x²/2) for x ≥ 0.
Take N = 365. Noting that 22/√365 ≈ 1.1515 and (1.1515)²/2 ≈ 0.6630, this yields
P(T_365 ≥ 22) ≈ exp(−0.6630) ≈ 0.515.
The above approximation is within 2% of the true value P(T_365 ≥ 22) ≈ 0.524, which can be computed from the exact formula (28). Thus, in a class of at least 22 students, there is at least a 47% chance that two students have exactly the same birthday. ▲
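The quality of the exp(−x²/2) approximation is easy to tabulate against the exact formula (28); a minimal sketch assuming NumPy follows.

```python
# Sketch: exact P(T_365 >= n) from (28) versus the limit exp(-n^2/(2N)).
import numpy as np

N = 365
for n in (10, 22, 40):
    exact = np.prod(1 - np.arange(1, n) / N)  # product over m = 2, ..., n
    print(n, exact, np.exp(-(n / np.sqrt(N))**2 / 2))
```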
4.2.2. Theory on weak convergence. In this subsection, we give some theoretical results concerning weak convergence.
The following result shows that a weak convergence of distribution functions can always be realized by a sequence of RVs converging almost surely.
Lemma 4.2.8 (Weak convergence realized as a.s. convergence). If F_n ⇒ F, then there are random variables Y, Y_n, n ≥ 1, with distribution functions F, F_n, respectively, such that Y_n → Y a.s.
PROOF. Following Proposition 1.2.13, we define RVs on the 'standard probability space' (Ω, B, P), where Ω = (0, 1), B is the Borel σ-algebra on (0, 1), and P denotes the Lebesgue measure on (0, 1). Let Y_n : (0, 1) → R be defined by Y_n(ω) := sup{y : F_n(y) < ω} and Y(ω) := sup{y : F(y) < ω}. For convenience, we denote Y_n(ω) = F_n^{−1}(ω) and Y(ω) = F^{−1}(ω). In Proposition 1.2.13, we have shown that F^{−1}, F_n^{−1}, n ≥ 1 are RVs with the prescribed distribution functions. It remains to show that F_n^{−1} → F^{−1} a.s.. To this end, we make the following claims:
(a) For each ω ∈ (0, 1), define a_ω := sup{y : F(y) < ω} and b_ω := inf{y : F(y) > ω}. Let Ω_0 = {ω : a_ω = b_ω}. Then Ω \ Ω_0 is countable.
(b) F_n^{−1}(ω) → F^{−1}(ω) for all ω ∈ Ω_0.
Given the above claims, P(lim_{n→∞} Y_n = Y) ≥ P(Ω_0) = 1 − P(Ω \ Ω_0) = 1, since P is the Lebesgue measure on (0, 1) and P(Ω \ Ω_0) = 0 as Ω \ Ω_0 is countable.
To justify (a), first note that a_ω ≤ b_ω by definition. Also from the definition, the nonempty open intervals (a_ω, b_ω) for ω ∈ Ω \ Ω_0 are disjoint. Hence each of these open intervals contains a distinct rational number, so the cardinality of Ω \ Ω_0 is at most the cardinality of the rational numbers. It follows that Ω \ Ω_0 is countable.
Similarly, if µ_n ⇒ µ is given, then by Prop. 1.2.13 we can construct a sequence of RVs with µ_n = P ∘ X_n^{−1} and µ = P ∘ X^{−1}, so µ_n ⇒ µ iff F_n ⇒ F iff X_n ⇒ X iff E[g(X_n)] → E[g(X)] for all test functions g iff
∫_R g(x) dµ_n(x) → ∫_R g(x) dµ(x)
for all test functions, by the change of variables formula.
PROOF OF PROPOSITION 4.2.9. According to Lemma 4.2.8, we can choose a sequence of RVs Y, Y_n, n ≥ 1 such that X_n =d Y_n, X =d Y, and Y_n → Y a.s.. Then by BCT (see Theorem 1.3.16),
lim_{n→∞} E[g(X_n)] = lim_{n→∞} E[g(Y_n)] = E[g(Y)] = E[g(X)].
For the converse direction, suppose that for every bounded continuous function g : R → R, we have E[g(X_n)] → E[g(X)]. We wish to show that X_n ⇒ X. This would be immediate if we could take g to be the indicator function g_x(y) = 1(y ≤ x), but such an indicator 'test function' is not allowed since we only use bounded and continuous test functions. Instead, we approximate g_x by a piecewise linear function. That is, fix x ∈ R and ε > 0, and define a bounded and continuous function g_{x,ε} by
g_{x,ε}(y) = 1 if y ≤ x;  g_{x,ε}(y) = 1 − (y − x)/ε if x ≤ y ≤ x + ε;  g_{x,ε}(y) = 0 if y ≥ x + ε.
Then 1(y ≤ x) ≤ g_{x,ε}(y) ≤ 1(y ≤ x + ε), so
lim sup_{n→∞} P(X_n ≤ x) = lim sup_{n→∞} E[1(X_n ≤ x)] ≤ lim sup_{n→∞} E[g_{x,ε}(X_n)] = E[g_{x,ε}(X)] ≤ E[1(X ≤ x + ε)] = P(X ≤ x + ε).
Letting ε ↘ 0 and using the right continuity of the CDF (see Prop. 1.2.12 (iii)), we deduce
lim sup_{n→∞} P(X_n ≤ x) ≤ P(X ≤ x) for all x ∈ R.
For the other direction, fix x ∈ R and ε > 0, and define a bounded and continuous function h_{x,ε} by
h_{x,ε}(y) = 1 if y ≤ x − ε;  h_{x,ε}(y) = (x − y)/ε if x − ε ≤ y ≤ x;  h_{x,ε}(y) = 0 if y ≥ x.
Then 1(y ≤ x − ε) ≤ h_{x,ε}(y) ≤ 1(y ≤ x), so
lim inf_{n→∞} P(X_n ≤ x) = lim inf_{n→∞} E[1(X_n ≤ x)] ≥ lim inf_{n→∞} E[h_{x,ε}(X_n)] = E[h_{x,ε}(X)] ≥ E[1(X ≤ x − ε)] = P(X ≤ x − ε).
Letting ε ↘ 0 and using continuity of the probability measure from below (see Prop. 1.2.12 (iv)), we deduce
lim inf_{n→∞} P(X_n ≤ x) ≥ P(X < x) for all x ∈ R.
To finish, let x be any continuity point of F. Then
lim inf_{n→∞} P(X_n ≤ x) ≥ P(X < x) = P(X ≤ x) ≥ lim sup_{n→∞} P(X_n ≤ x).
This shows lim_{n→∞} P(X_n ≤ x) = P(X ≤ x), as desired. □
Remark 4.2.11 (Test functions). At a high level, the role of a 'test function' is to test some aspect of an unknown object. Think of slicing a high-dimensional, complicated shape with a two-dimensional plane: you can look at the resulting section (now it's 2D) and study its properties. That is certainly not enough to determine the whole object, but it still gives you some information. You can also take sections along multiple other planes and study those. If you study a rich enough collection of such sections, sometimes that gives you a complete understanding of the object.
In our situation, a RV X is the object we want to understand, and a function f : R → R gives a 'section' of X through the expectation E[f(X)]. If we take f to be of the form f(x) = 1(x ∈ A) for intervals A, then knowing E[f(X)] for all such indicator test functions is enough to determine the distribution of X. Also, if we want to know whether X_n ⇒ X, it is enough to know whether E[f(X_n)] → E[f(X)] for all bounded continuous functions f (Prop. 4.2.9).
As another example, in modern graph theory there is a notion of a sequence of graphs G_n converging to a limiting object U called a 'graphon' in the topology induced by the 'cut metric'. A well-known theorem of Lovász states that G_n → U if and only if f(G_n) → f(U) for every f that maps a graph/graphon to its homomorphism density with respect to a fixed finite graph.
Theorem 4.2.12 (Continuous mapping theorem). Let g be a measurable function and let D_g denote the set of discontinuities of g. If X_n ⇒ X and P(X ∈ D_g) = 0, then g(X_n) ⇒ g(X). If in addition g is bounded, then E[g(X_n)] → E[g(X)].
PROOF. First note that D_g is a Borel set. By Lemma 4.2.8, we may choose RVs Y_n =d X_n and Y =d X with Y_n → Y a.s.. If f is continuous, then D_{f∘g} ⊆ D_g, so P(Y ∈ D_{f∘g}) ≤ P(Y ∈ D_g) = P(X ∈ D_g) = 0. It follows that
P(f(g(Y_n)) → f(g(Y))) = P({ω : lim_{n→∞} f ∘ g(Y_n(ω)) = f ∘ g(Y(ω))} ∩ {ω : Y(ω) ∉ D_{f∘g}})
≥ P({ω : Y_n(ω) → Y(ω)} ∩ {ω : Y(ω) ∉ D_{f∘g}}) = 1.
Hence f(g(Y_n)) → f(g(Y)) a.s.. If, in addition, f is bounded, then the bounded convergence theorem implies E[f(g(Y_n))] → E[f(g(Y))]. Since this holds for all bounded continuous functions f, it follows from Proposition 4.2.9 that g(X_n) ⇒ g(X). For the second part of the assertion, note that Y_n → Y a.s. and P(Y ∈ D_g) = 0, so by a similar argument,
P(g(Y_n) → g(Y)) ≥ P({ω : Y_n(ω) → Y(ω)} ∩ {ω : Y(ω) ∉ D_g}) = 1.
Hence g(Y_n) → g(Y) a.s., and the desired result follows from the bounded convergence theorem. □
Proposition 4.2.13 (Equivalent conditions for weak convergence). The following are all equivalent:
(i) X_n ⇒ X;
(ii) for all open sets G ⊆ R, lim inf_{n→∞} P(X_n ∈ G) ≥ P(X ∈ G);
(iii) for all closed sets K ⊆ R, lim sup_{n→∞} P(X_n ∈ K) ≤ P(X ∈ K);
(iv) for all Borel sets B ⊆ R with P(X ∈ ∂B) = 0, lim_{n→∞} P(X_n ∈ B) = P(X ∈ B).
PROOF. (i) ⇒ (ii): By Lemma 4.2.8, we may choose RVs Y_n =d X_n and Y =d X with Y_n → Y a.s., where Y, Y_n are defined on the same probability space (Ω, F, P). Then since G is open, almost surely,
lim inf_{n→∞} 1(Y_n ∈ G) ≥ 1(Y ∈ G).
To see this, let A := {ω : Y_n(ω) → Y(ω)}, so that P(A) = 1 by hypothesis. Fix ω ∈ A with lim inf_{n→∞} 1(Y_n(ω) ∈ G) = 0, and choose a subsequence n(k) so that 1(Y_{n(k)}(ω) ∈ G) = 0 for all k, i.e., Y_{n(k)}(ω) ∈ G^c for all k. Since G^c is closed and Y_{n(k)}(ω) → Y(ω), it follows that Y(ω) ∈ G^c, i.e., 1(Y(ω) ∈ G) = 0. Since lim inf_{n→∞} 1(Y_n ∈ G) and 1(Y ∈ G) are RVs taking values in {0, 1}, this shows the displayed inequality almost surely. Now applying Fatou's lemma (see Thm. 1.3.18) gives
lim inf_{n→∞} P(X_n ∈ G) = lim inf_{n→∞} P(Y_n ∈ G) ≥ E[lim inf_{n→∞} 1(Y_n ∈ G)] ≥ E[1(Y ∈ G)] = P(Y ∈ G) = P(X ∈ G).
(ii)⇒(iii) : Apply the previous implication by noting that K c is open.
(ii), (iii) ⇒ (iv): Let B̄ and B° denote the closure and the interior of B, respectively, which are closed and open in R, respectively, and let ∂B := B̄ \ B° denote the boundary. Since P(X ∈ ∂B) = 0,
P(X ∈ B̄) = P(X ∈ B° ⊔ ∂B) = P(X ∈ B°) + P(X ∈ ∂B) = P(X ∈ B°).
Since B° ⊆ B ⊆ B̄, we get
P(X ∈ B̄) = P(X ∈ B) = P(X ∈ B°).
Thus by using (ii) and (iii),
lim sup_{n→∞} P(X_n ∈ B) ≤ lim sup_{n→∞} P(X_n ∈ B̄) ≤ P(X ∈ B̄) = P(X ∈ B),
lim inf_{n→∞} P(X_n ∈ B) ≥ lim inf_{n→∞} P(X_n ∈ B°) ≥ P(X ∈ B°) = P(X ∈ B).
PROOF. We will first construct a candidate limiting function G on the rationals. Let q_1, q_2, . . . be an enumeration of the rationals. Since F_n(q_1) is a sequence in the compact set [0, 1], there exists a subsequence n_1(k) such that F_{n_1(k)}(q_1) converges to some point in [0, 1], which we denote by G(q_1). Next, F_{n_1(k)}(q_2) is a sequence in [0, 1], so we can choose a further subsequence n_2(k) of n_1(k) such that F_{n_2(k)}(q_2) converges to some point in [0, 1], which we denote by G(q_2). Repeating this process, we can construct nested subsequences (n_1(k))_{k≥1} ⊇ (n_2(k))_{k≥1} ⊇ · · · and a function G : Q → [0, 1] such that
lim_{k→∞} F_{n_i(k)}(q_i) = G(q_i) for all i ≥ 1.
Note that G is non-decreasing on Q: if q, q' ∈ Q and q < q', then F_n(q) ≤ F_n(q') for all n, so taking limits along the diagonal subsequence n_k(k) gives G(q) ≤ G(q').
Definition 4.2.16 (Tightness). A sequence of RVs (X_n)_{n≥1} is tight if for each ε > 0, there exists M_ε > 0 such that
lim sup_{n→∞} P(|X_n| > M_ε) ≤ ε
(in words, 'no mass escapes to ±∞'). A sequence of distribution functions F_n, n ≥ 1 is tight if for each ε > 0, there exists M_ε > 0 such that
lim sup_{n→∞} [F_n(−M_ε) + 1 − F_n(M_ε)] ≤ ε.
Exercise 4.2.17 (Tightness and weak convergence). Show the following statement: Let F n be a sequence
of distribution functions. Then every subsequential limit of F n is a distribution function if and only if
(F n )n≥1 is tight.
Exercise 4.2.18 (A sufficient condition for tightness). Let F_n be a sequence of distribution functions. Show that (F_n)_{n≥1} is tight if there exists a function φ ≥ 0 such that φ(x) → ∞ as |x| → ∞ and
sup_{n≥1} ∫ φ(x) dF_n(x) < ∞.
Exercise 4.2.19 (The Lévy metric). Let D denote the space of all distribution functions. Define a function
ρ : D × D → R by
ρ(F,G) = inf {ε : F (x − ε) − ε ≤ G(x) ≤ F (x + ε) + ε for all x ∈ R} .
Show that ρ defines a metric on D. Furthermore, show that if F, F n ,n ≥ 1 are in D, then F n ⇒ F if and only
if ρ(F n , F ) → 0.
4.3.1. Definition and the inversion formula. One can define complex-valued RVs in a straightforward manner. Let (Ω, F, P) be a probability space. We endow C with the Borel σ-algebra B, namely the usual Borel σ-algebra on R² under the identification of a + ib with (a, b). Then a complex-valued RV is simply a measurable function Z : (Ω, F) → (C, B).
Proposition 4.3.2 (Basic properties of characteristic functions). Let X be any RV and let φ be its charac-
teristic function. Then the following hold:
(i) φ(0) = 1;
(ii) φ(−t) = φ(t)*, the complex conjugate of φ(t);
(iii) |φ(t)| = |E[exp(itX)]| ≤ E[|exp(itX)|] = 1;
(iv) |φ(t + h) − φ(t)| ≤ E[|exp(ihX) − 1|], which does not depend on t, so φ is uniformly continuous on (−∞, ∞);
(v) φ_{aX+b}(t) = exp(itb) φ_X(at).
Proposition 4.3.3 (Factorization property of the characteristic functions). Let X , Y be independent RVs
on the same probability space. Then φ X +Y (t ) = φ X (t ) φY (t ) for all t .
P ROOF. Note that exp(i t X ) and exp(i t Y ) are independent since X and Y are independent (see Prop.
2.2.6 and Lem. 2.3.2). Hence
φ_{X+Y}(t) = E[exp(it(X + Y))] = E[exp(itX) exp(itY)]
           = E[exp(itX)] E[exp(itY)] = φ_X(t) φ_Y(t).
□
Example 4.3.4 (Coin flips). Let X be an RV with P(X = 1) = P(X = −1) = 1/2. Then

φ_X(t) = (e^{it} + e^{−it})/2 = cos(t). ▲
Example 4.3.5 (Poisson). Let X ∼ Poisson(λ), so P(X = k) = λ^k e^{−λ}/k! for k = 0, 1, . . . . Then

φ_X(t) = Σ_{k=0}^∞ e^{itk} λ^k e^{−λ}/k! = e^{−λ} Σ_{k=0}^∞ (λe^{it})^k/k! = e^{−λ} e^{λe^{it}} = e^{λ(e^{it}−1)}. ▲
Example 4.3.6 (Normal). Let X ∼ N(0, 1), so X has PDF f_X(x) = (2π)^{−1/2} e^{−x²/2}. Then

φ_X(t) = (1/√(2π)) ∫_R e^{itx} e^{−x²/2} dx
       = exp(−t²/2) ∫_R (1/√(2π)) exp( −(x − it)²/2 ) dx.

The integral in the last expression should be 1, since it is formally the integral of the PDF of the normal
distribution with 'mean it' and variance 1. This shows

φ_X(t) = exp(−t²/2).

(See [Dur19, Ex 3.3.5] for a more rigorous justification.) ▲
Example 4.3.7 (Uniform). Let X ∼ Uniform((a, b)), so X has PDF f_X(x) = (b − a)^{−1} 1(x ∈ (a, b)). Then

φ_X(t) = (1/(b − a)) ∫_a^b e^{itx} dx = (e^{itb} − e^{ita}) / (it(b − a)).

In particular, if a = −b, then φ_X(t) = (e^{itb} − e^{−itb}) / (2itb) = sin(tb)/(tb). ▲
Example 4.3.8 (Exponential). Let X ∼ Exp(λ), so X has PDF f_X(x) = λe^{−λx} 1(x ≥ 0). Then

φ_X(t) = λ ∫_0^∞ e^{itx} e^{−λx} dx = λ ∫_0^∞ e^{(it−λ)x} dx = λ [ e^{(it−λ)x} / (it − λ) ]_0^∞ = λ/(λ − it). ▲
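The closed forms above are easy to sanity-check numerically. The following is a minimal sketch (not part of
the formal development; the sample size, the grid of t values, and the seed are arbitrary choices) that
estimates φ_X(t) = E[e^{itX}] by Monte Carlo in Python and compares against cos(t) and λ/(λ − it):

import numpy as np

# Monte Carlo sanity check (illustrative only): estimate phi_X(t) = E[exp(itX)]
# and compare with the closed forms derived above.
rng = np.random.default_rng(0)
t = np.linspace(-5, 5, 11)
n = 200_000

X = rng.choice([-1, 1], size=n)                    # coin flip: phi(t) = cos(t)
mc = np.exp(1j * np.outer(t, X)).mean(axis=1)
print(np.max(np.abs(mc - np.cos(t))))              # small, of order n^{-1/2}

lam = 2.0                                          # Exp(lam): phi(t) = lam/(lam - it)
Y = rng.exponential(1 / lam, size=n)
mc = np.exp(1j * np.outer(t, Y)).mean(axis=1)
print(np.max(np.abs(mc - lam / (lam - 1j * t))))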
The following result is a characteristic function version of the well-known Fourier inversion for-
mula. An important corollary is that the characteristic function uniquely determines the distribution
(see Corollary 4.3.11).
Theorem 4.3.9 (Inversion formula). Let φ(t) := ∫ e^{itx} μ(dx), where μ is a probability measure. If a < b,
then

lim_{T→∞} (1/2π) ∫_{−T}^T (e^{−ita} − e^{−itb})/(it) · φ(t) dt = μ((a, b)) + μ({a, b})/2.
P ROOF. Denote

I_T := ∫_{−T}^T (e^{−ita} − e^{−itb})/(it) · φ(t) dt = ∫_{−T}^T ∫ g(t, x) μ(dx) dt,   where g(t, x) := (e^{−ita} − e^{−itb})/(it) · e^{itx}.    (34)

We would like to apply Fubini's theorem to interchange the integrals in the last expression. For this,
observe that

(e^{−ita} − e^{−itb})/(it) = ∫_a^b e^{−itx} dx,    (35)

so by the multivariate Jensen's inequality (see Exercise 1.5.13),

| (e^{−ita} − e^{−itb})/(it) | ≤ ∫_a^b |e^{−itx}| dx = b − a.    (36)

Since |e^{itx}| ≤ 1, it follows that the integrand g(t, x) in the last expression in (34) is uniformly bounded by
b − a. Since μ is a probability measure, it follows that ∫_{−T}^T ∫ |g(t, x)| μ(dx) dt ≤ 2T(b − a) < ∞. Hence we
can apply Fubini's theorem and use the fact that cos(t(x − a))/t is an odd function of t (so its integral over
[−T, T] vanishes) to get

I_T = ∫_R ∫_{−T}^T ( e^{it(x−a)} − e^{it(x−b)} )/(it) dt μ(dx)
    = ∫_R ( ∫_{−T}^T sin(t(x − a))/t dt − ∫_{−T}^T sin(t(x − b))/t dt ) μ(dx).    (37)

Now we introduce the following notation:

R(θ, T) := ∫_{−T}^T sin(θt)/t dt,    S(T) := ∫_0^T (sin t)/t dt,    sgn(θ) := 1(θ > 0) − 1(θ < 0).

Then we have

R(θ, T) = 2 ∫_0^T sin(θt)/t dt = 2 sgn(θ) ∫_0^{|θ|T} (sin u)/u du = 2 sgn(θ) S(|θ|T).

Note that lim_{T→∞} S(T) = π/2 by Exercise 4.3.10, so we get lim_{T→∞} R(θ, T) = π sgn(θ). It follows that

lim_{T→∞} [ R(x − a, T) − R(x − b, T) ] = R_{a,b}(x) :=  0   if x < a or x > b,
                                                         2π  if a < x < b,
                                                         π   if x ∈ {a, b}.

Now (37) and BCT yield the desired result, since

lim_{T→∞} I_T = ∫ R_{a,b}(x) μ(dx) = 2π μ((a, b)) + π μ({a, b}),

and dividing by 2π gives the assertion. □
Exercise 4.3.10. Show that (x, y) ↦ e^{−xy} sin x is integrable on (0, a) × (0, ∞) with respect to the Lebesgue
measure on R². Use Fubini's theorem to deduce

∫_0^a (sin x)/x dx = arctan(a) − (cos a) ∫_0^∞ e^{−ay}/(1 + y²) dy − (sin a) ∫_0^∞ y e^{−ay}/(1 + y²) dy.

Conclude that lim_{a→∞} ∫_0^a (sin x)/x dx = π/2.
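The inversion formula can also be checked numerically. Below is a hedged sketch in Python for μ = N(0, 1),
where φ(t) = e^{−t²/2}: we approximate the truncated integral in Theorem 4.3.9 by a midpoint Riemann sum
(the truncation level T and the step size dt are ad hoc choices) and compare with Φ(b) − Φ(a); since μ has no
atoms, the boundary term vanishes.

import numpy as np
from math import erf

# Numerical illustration of Theorem 4.3.9 for mu = N(0,1); Phi below is the
# standard normal CDF. T and dt are arbitrary illustrative choices.
def Phi(x):
    return 0.5 * (1 + erf(x / np.sqrt(2)))

a, b, T = -1.0, 2.0, 60.0
dt = 1e-3
t = np.arange(-T, T, dt) + dt / 2          # midpoint rule, avoids t = 0
integrand = (np.exp(-1j*t*a) - np.exp(-1j*t*b)) / (1j*t) * np.exp(-t**2 / 2)
approx = (integrand.real * dt).sum() / (2 * np.pi)
print(approx, Phi(b) - Phi(a))             # both close to mu((a, b))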
Corollary 4.3.11. Let X and Y be RVs with identical characteristic functions. Then X =_d Y.

P ROOF. Let μ and ν denote the distributions of X and Y, respectively. By Theorem 4.3.9,

μ((a, b)) + μ({a, b})/2 = ν((a, b)) + ν({a, b})/2   for all a, b ∈ R with a < b.    (38)

Let A_X := {x ∈ R : μ({x}) > 0}, A_Y := {x ∈ R : ν({x}) > 0}, and B := R \ (A_X ∪ A_Y). Note that A_X corresponds to
the set of discontinuity points of the distribution function of X, x ↦ P(X ≤ x), so it is countable. Likewise,
A_Y is countable. The set B consists of the common continuity points of the distribution functions of X
and Y. Note that B is dense in R.
Fix z ∈ A_X. Then we may choose y < z with y ∈ B, so that (38) gives

μ((y, z)) + μ({z})/2 = ν((y, z)) + ν({z})/2.

Letting y ↗ z with y ∈ B (we can do this since B is dense in R), and using continuity of probability
measures, we deduce μ({z}) = ν({z}). This holds for all z ∈ A_X. By symmetry, μ({z}) = ν({z}) for all z ∈ R,
and hence (38) gives μ((a, b)) = ν((a, b)) for all a < b.
Since the open intervals form a π-system generating the Borel σ-algebra on R, by Lemma 1.1.38,
we conclude μ = ν. □
Exercise 4.3.12. Let X be an RV. If its characteristic function φ_X takes values in R, then show that X =_d −X.
Exercise 4.3.13 (Sum of independent normals = Normal). Let X 1 and X 2 be independent normal RVs.
Use the inversion formula (Theorem 4.3.9) to deduce that X 1 + X 2 has the normal distribution with mean
E[X 1 ] + E[X 2 ] and variance Var(X 1 ) + Var(X 2 ).
When the characteristic function φ of a probability measure μ is integrable, μ admits a PDF,
given by the inverse Fourier transform of φ.

Proposition 4.3.14 (Inverse Fourier transform of an integrable characteristic function). Let μ be a probability
measure on (R, B) and let φ be its characteristic function. Suppose ∫ |φ(t)| dt < ∞. Then μ has a bounded
and continuous density function

f(y) := (1/2π) ∫_R e^{−ity} φ(t) dt.
P ROOF. We first note that, by the inversion formula (see Thm. 4.3.9), for a < b,

μ((a, b)) + μ({a, b})/2 = lim_{T→∞} (1/2π) ∫_{−T}^T (e^{−ita} − e^{−itb})/(it) · φ(t) dt.

Let f_T(t) := (e^{−ita} − e^{−itb})/(it) · φ(t) 1(−T ≤ t ≤ T). Then |f_T(t)| ≤ (b − a)|φ(t)| (see (36)), so f_T is dominated by
(b − a)|φ|, which is integrable by the hypothesis. Hence by DCT, we have

μ((a, b)) + μ({a, b})/2 = (1/2π) ∫_R (e^{−ita} − e^{−itb})/(it) · φ(t) dt.    (40)

By (36), the right hand side above is at most

(b − a)/(2π) ∫_{−∞}^∞ |φ(t)| dt < ∞.

Letting b ↓ a, this bound tends to zero, while the left hand side of (40) is at least μ({a})/2. It follows that μ is
a continuous measure (i.e., has no point mass: μ({a}) = 0 for all a ∈ R). Hence from (40), (35), and Fubini's
theorem,

μ((a, b)) = (1/2π) ∫_R ( ∫_a^b e^{−ity} dy ) φ(t) dt
          = ∫_a^b ( (1/2π) ∫_R e^{−ity} φ(t) dt ) dy.

The above holds for all open intervals (a, b). Hence by Lemma 1.1.38, it follows that f in the statement
is indeed a density function for μ. That f is bounded follows directly from the hypothesis. The continuity
of f follows from DCT. Indeed, noting that

lim_{h→0} |f(x + h) − f(x)| = lim_{h→0} | (1/2π) ∫_R e^{−itx} (e^{−ith} − 1) φ(t) dt |
                            ≤ lim_{h→0} (1/2π) ∫_R |(e^{−ith} − 1) φ(t)| dt
                            = (1/2π) ∫_R lim_{h→0} |(e^{−ith} − 1) φ(t)| dt = 0,

where for the last equality we used DCT with g_h(t) := |(e^{−ith} − 1) φ(t)| → 0 pointwise and g_h(t) ≤ 2|φ(t)|. □
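As a quick illustration of Proposition 4.3.14 (again only a numerical sketch, with an ad hoc grid), we can
recover the N(0, 1) density from its integrable characteristic function φ(t) = e^{−t²/2}:

import numpy as np

# Sketch: f(y) = (1/2pi) * integral of e^{-ity} phi(t) dt should recover the
# N(0,1) PDF when phi(t) = exp(-t^2/2). Grid parameters are arbitrary.
dt = 1e-3
t = np.arange(-40, 40, dt)
phi = np.exp(-t**2 / 2)
for y in [0.0, 1.0, 2.5]:
    f = (np.exp(-1j * t * y) * phi).real.sum() * dt / (2 * np.pi)
    print(f, np.exp(-y**2 / 2) / np.sqrt(2 * np.pi))   # the two agree closely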
P ROOF. (i) Let X_n, 1 ≤ n ≤ ∞, be RVs with distribution functions x ↦ μ_n((−∞, x]) (see Proposition
1.2.13). Fix t ∈ R. Since x ↦ e^{itx} is a bounded and continuous function, by Proposition 4.2.9,

φ_n(t) = E[exp(itX_n)] → E[exp(itX_∞)] = φ_∞(t)   as n → ∞.

(ii) In order to show tightness of (μ_n)_{n≥1}, we first note that by Exercise 4.4.2, the tail of μ_n can be
controlled by its characteristic function φ_n as

μ_n({x : |x| > 2/u}) ≤ u^{−1} ∫_{[−u, u]} (1 − φ_n(t)) dt   for all u > 0.    (41)

Fix ε > 0. First we observe φ(0) = lim_{n→∞} φ_n(0) = 1. By hypothesis φ is continuous at 0, so
we get lim_{t→0} φ(t) = 1. It follows that

lim_{u→0} u^{−1} ∫_{[−u, u]} (1 − φ(t)) dt = 0.

Hence we may choose u(ε) > 0 small enough so that the integral above is bounded
by ε/2 for all u < u(ε). Also, note that by BCT, for each fixed u,

lim_{n→∞} u^{−1} ∫_{[−u, u]} (1 − φ_n(t)) dt = u^{−1} ∫_{[−u, u]} (1 − φ(t)) dt ≤ ε/2.

Thus we may choose N(ε) ≥ 1 such that for all n ≥ N(ε), the integral on the LHS is at most ε.
Hence by (41),

μ_n({x : |x| > 2/u}) ≤ ε   for all u < u(ε) and n ≥ N(ε).

This shows that (μ_n)_{n≥1} is tight.
Now we conclude (ii). Let F_n denote the distribution function of μ_n for n ≥ 1. Choose an
arbitrary diverging subsequence (n(k))_{k≥1}. By Helly's selection theorem (see Thm. 4.2.14), there exists
a further subsequence n(k(m)) such that F_{n(k(m))} ⇒ F as m → ∞ for some right-continuous
and non-decreasing function F. By tightness and Exercise 4.2.17, F must be a probability distribu-
tion function. Let μ denote the probability measure whose distribution function is F (such μ is
unique by Lem. 1.1.38). By (i), the characteristic function of μ is φ. Since the characteristic function
uniquely determines the associated probability measure (see Cor. 4.3.11), it follows that μ does
not depend on the choice of the subsequence n(k) or the further subsequence n(k(m)).
To finish the proof, let f : R → R be a bounded and continuous function. Consider the
sequence of real numbers a_n := ∫ f dμ_n. From what we just proved in the above paragraph, every
subsequence of a_n has a further subsequence that converges to a := ∫ f dμ. It
follows that a_n → a as n → ∞. Since this holds for all bounded continuous test functions f, by
Prop. 4.2.9, we conclude that μ_n ⇒ μ.
□
Exercise 4.4.2 (Tail bound and characteristic function). Let X be an RV and let φ be its characteristic
function. Show that for each u > 0,

P(|X| ≥ 2/u) ≤ u^{−1} ∫_{−u}^u (1 − φ(t)) dt.
Exercise 4.4.3 (Tail bound on the Taylor expansion of the complex exponential). Show that for all x ∈ R
and n ≥ 0,

| e^{ix} − Σ_{m=0}^n (ix)^m/m! | ≤ |x|^{n+1}/(n + 1)! ∧ 2|x|^n/n!.
Lemma 4.4.4 (Moments and characteristic function). Let X be an RV. Then the following hold:

(i) | φ_X(t) − Σ_{m=0}^n E[(itX)^m]/m! | ≤ E[ |tX|^{n+1} ∧ 2|tX|^n ];
(ii) If E[X²] < ∞, then φ_X(t) = 1 + itE[X] − (t²/2)E[X²] + o(t²) as t → 0.
P ROOF. (i) From Exercise 4.4.3, we have

| e^{itX} − Σ_{m=0}^n (itX)^m/m! | ≤ |tX|^{n+1}/(n + 1)! ∧ 2|tX|^n/n! ≤ |tX|^{n+1} ∧ 2|tX|^n

almost surely. Taking expectations and using |E[·]| ≤ E[|·|] then shows the assertion.
(ii) From (i) with n = 2, we have

| φ_X(t) − ( 1 + itE[X] − (t²/2)E[X²] ) | ≤ t² E[ |t||X|³ ∧ 2|X|² ].
Let Y_t denote the RV inside the expectation on the RHS above. Then Y_t → 0 a.s. as t → 0, and
|Y_t| ≤ 2|X|². Since E[X²] < ∞, by DCT it follows that E[Y_t] = o(1) as t → 0.
□
Now we are finally ready to prove the classical central limit theorem for i.i.d. increments under a sec-
ond moment assumption.

Theorem 4.4.5 (CLT). Let (X_k)_{k≥1} be i.i.d. RVs with E[X_1] = μ and Var(X_1) = σ² ∈ (0, ∞), and let
S_n = Σ_{k=1}^n X_k for n ≥ 1. Let Z ∼ N(0, 1) be a standard normal RV and define

Z_n = (S_n − μn)/(σ√n) = (S_n/n − μ)/(σ/√n) = (X_1 − μ)/(σ√n) + · · · + (X_n − μ)/(σ√n).

Then Z_n ⇒ Z as n → ∞.
P ROOF. First notice that we can assume E[X_1] = 0 without loss of generality (replace X_k by X_k − μ). Then
σ² = Var(X_1) = E[X_1²] − E[X_1]² = E[X_1²]. Since the increments X_i are i.i.d., by Prop. 4.3.3 we have

φ_{S_n}(t) = E[e^{itS_n}] = E[e^{itX_1}] E[e^{itX_2}] · · · E[e^{itX_n}] = φ_{X_1}(t)^n.

By Lemma 4.4.4 (ii),

φ_{X_1}(t) = 1 − t²σ²/2 + o(t²).

This and Prop. 4.3.2 (v) yield

φ_{Z_n}(t) = φ_{X_1}( t/(σ√n) )^n = ( 1 − t²/(2n) + o(1/n) )^n.

It is easy to extend Prop. 4.1.2 to sequences of complex numbers. Hence the RHS above converges to
exp(−t²/2) for all t ∈ R as n → ∞. With Example 4.3.6, we conclude that

φ_{Z_n}(t) → exp(−t²/2) = φ_Z(t)   for all t ∈ R.

Then by the continuity theorem (Lemma 4.4.1), it follows that Z_n ⇒ Z. □
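Before turning to applications, here is a small Monte Carlo sketch of Theorem 4.4.5 (the Exponential(1)
increments and the sizes n and reps below are arbitrary illustrative choices, not part of the theorem):

import numpy as np

# Standardized sums of i.i.d. Exponential(1) RVs (mu = sigma = 1) should be
# approximately N(0,1); we compare a few empirical CDF values with Phi(z).
rng = np.random.default_rng(1)
n, reps = 400, 10_000
X = rng.exponential(1.0, size=(reps, n))
Zn = (X.sum(axis=1) - n) / np.sqrt(n)     # (S_n - mu*n) / (sigma*sqrt(n))
for z in [-1.0, 0.0, 1.96]:
    print(z, (Zn <= z).mean())            # roughly 0.159, 0.5, 0.975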
As a typical application of CLT, we can approximate Binomial(n, p) variables by normal RVs.
Then by letting ε ↓ 0, we see that the limsup on the LHS above is zero for each t ∈ R.
Next, recall that for each fixed t ∈ R, |φ_{Z_n}(t)| ≤ 1 (see Prop. 4.3.2). Also, hypothesis (i) yields that for all
large enough n ≥ 1, | 1 − t²E[X_{n;m}²]/2 | ≤ 1. Hence by Prop. 4.4.9 with θ = 1, we deduce

lim_{n→∞} | φ_{Z_n}(t) − Π_{m=1}^n ( 1 − t²E[X_{n;m}²]/2 ) | ≤ lim_{n→∞} Σ_{m=1}^n | φ_{X_{n;m}}(t) − ( 1 − t²E[X_{n;m}²]/2 ) | = 0.
Thus, in order to conclude, it is enough to show that

lim_{n→∞} Π_{m=1}^n ( 1 − t²E[X_{n;m}²]/2 ) = exp(−t²σ²/2).    (42)

Let c_{n;m} := −t²E[X_{n;m}²]/2. By hypothesis (i), we have lim_{n→∞} Σ_{m=1}^n c_{n;m} = −t²σ²/2. Hence by Exercise
4.1.5, we will have (42), provided that we check the following conditions:

max_{1≤m≤n} |c_{n;m}| = o(1),    sup_{n≥1} Σ_{m=1}^n |c_{n;m}| < ∞.    (43)
The second condition follows directly from hypothesis (i). For the first condition, fix ε > 0 and write

E[X_{n;m}²] ≤ E[X_{n;m}² 1(|X_{n;m}| ≤ ε)] + E[X_{n;m}² 1(|X_{n;m}| > ε)]
           ≤ ε² + E[X_{n;m}² 1(|X_{n;m}| > ε)]
           ≤ ε² + Σ_{m=1}^n E[X_{n;m}² 1(|X_{n;m}| > ε)].

According to hypothesis (ii), this shows

lim sup_{n→∞} max_{1≤m≤n} E[X_{n;m}²] ≤ ε².

By letting ε ↓ 0, we see that the limsup on the LHS above is zero. This yields the first condition in (43), as
desired. □
Proposition 4.4.9 (Comparing products of complex numbers). Let a_1, . . . , a_n and b_1, . . . , b_n be complex
numbers of modulus at most θ. Then

| Π_{m=1}^n a_m − Π_{m=1}^n b_m | ≤ θ^{n−1} Σ_{m=1}^n |a_m − b_m|.
P ROOF. By the triangle inequality,

| Π_{m=1}^n a_m − Π_{m=1}^n b_m | ≤ | a_1 Π_{m=2}^n a_m − a_1 Π_{m=2}^n b_m | + | a_1 Π_{m=2}^n b_m − b_1 Π_{m=2}^n b_m |
                                  ≤ θ | Π_{m=2}^n a_m − Π_{m=2}^n b_m | + |a_1 − b_1| θ^{n−1}.

Then the assertion follows by induction. □
Exercise 4.4.10 (Lyapunov's CLT). Suppose we have a triangular array of independent RVs X_{n;1}, . . . , X_{n;n}
for n ≥ 1. Suppose Lyapunov's condition holds: there exists δ > 0 such that

Σ_{i=1}^n E[ |X_{n;i} − E[X_{n;i}]|^{2+δ} ] / ( Σ_{k=1}^n Var(X_{n;k}) )^{(2+δ)/2} → 0   as n → ∞.    (44)

Writing S_n = X_{n;1} + · · · + X_{n;n}, show that the hypothesis of the Lindeberg-Feller CLT in Thm. 4.4.8 holds
for the normalized triangular array

X_{n;1}/√Var(S_n), . . . , X_{n;n}/√Var(S_n).
4.4.2. Applications of CLT for confidence intervals. Suppose we have n i.i.d. RVs X_1, . . . , X_n with mean
μ and finite second moment. We are interested in the deviation probability P(|X̄ − μ| > ε), which will give
a confidence interval for the population mean μ using the sample mean X̄ := n^{−1} S_n as its estimator. A
typical practice in statistics is to "use CLT" as

P( |S_n/n − μ| > ε ) = P( |Z_n| > ε√n ) ≈ P( |Z| > ε√n ),   ("CLT")

where Z_n = n^{−1/2} Σ_{k=1}^n (X_k − μ) and Z ∼ N(0, σ²). However, the approximation step marked "CLT" does
not make sense, since CLT holds when the sample size n → ∞, but the last expression compares |Z|
with the threshold ε√n → ∞ as n → ∞. A cleaner approach is to start from the CLT statement P(Z_n ≤ z) → P(Z ≤ z)
as n → ∞ and manipulate the events to match up with the desired event involving X̄, as discussed in
Example 4.4.11.
Example 4.4.11 (Confidence interval for the mean with known variance). Let (X_n)_{n≥1} be a sequence of
i.i.d. RVs with unknown mean μ but known variance σ². Here we do not know if the X_i are drawn from a
normal distribution. By CLT, we have

Z_n = (X̄ − μ)/(σ/√n) ⇒ Z ∼ N(0, 1)

as the sample size n → ∞. In other words, the distribution of Z_n becomes closer and closer to that of a
standard normal RV, so for any z ∈ R,

P(Z_n ≤ z) ≈ P(Z ≤ z)

when n is large enough.
From this we can construct a confidence interval for μ. Namely, with z_{α/2} denoting the number such
that P(Z > z_{α/2}) = α/2,

P( −z_{α/2} ≤ (X̄ − μ)/(σ/√n) ≤ z_{α/2} ) ≈ 1 − α.    (45)

Rewriting, this gives

P( μ ∈ [ X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n ] ) ≈ 1 − α.

That is, the probability that the random interval [ X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n ] contains the population
mean μ is approximately 1 − α. In this sense, this random interval is called the 100(1 − α)% confidence in-
terval for μ. For instance, noting that z_{0.05/2} = 1.96, [ x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n ] is a 95% confidence interval
for μ. ▲
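A minimal computational sketch of this interval follows; the data below are simulated, with hypothetical
values of μ, σ, and n, and only σ is treated as known to the statistician:

import numpy as np

# Sketch of the 95% interval in Example 4.4.11 on simulated data.
rng = np.random.default_rng(2)
sigma, mu, n = 3.0, 10.0, 50          # hypothetical population and sample size
x = rng.normal(mu, sigma, size=n)     # stand-in for observed data
xbar = x.mean()
half = 1.96 * sigma / np.sqrt(n)      # z_{0.025} = 1.96
print(f"95% CI for mu: [{xbar - half:.3f}, {xbar + half:.3f}]")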
Exercise 4.4.12. Let X equal the length of life of a 60-watt light bulb marketed by a certain manufacturer.
Assume that Var(X) = 1296. If a random sample of n = 27 bulbs is tested until they
burn out, yielding a sample mean of x̄ = 1478 hours, what is the corresponding 95% confidence interval
for μ = E[X]?
Example 4.4.13 (Confidence interval for the difference of two means). Let X and Y be independent RVs
such that E[X] = μ_X, Var(X) = σ_X², E[Y] = μ_Y, and Var(Y) = σ_Y². Suppose we do not know if X and Y have
normal distributions, but we know σ_X and σ_Y. Then by using CLT, all our confidence interval estimates in
the previous example hold asymptotically. Namely, by CLT, X̄ is approximately N(μ_X, σ_X²/n) and Ȳ is
approximately N(μ_Y, σ_Y²/m) as n, m → ∞. As all samples are independent, we should have, as n, m → ∞,

X̄ − Ȳ ≈ N( μ_X − μ_Y, σ_X²/n + σ_Y²/m ).

(See Exercise 4.4.14 for a rigorous justification.) Thus when n and m are large, X̄ − Ȳ nearly
follows the normal distribution above. This implies the following 100(1 − α)% (approximate) confidence
interval for μ_X − μ_Y:

P( μ_X − μ_Y ∈ [ (X̄ − Ȳ) − z_{α/2} √(σ_X²/n + σ_Y²/m), (X̄ − Ȳ) + z_{α/2} √(σ_X²/n + σ_Y²/m) ] ) ≈ 1 − α.

For instance, let n = 60, m = 40, x̄ = 70.1, ȳ = 75.3, σ_X² = 60, σ_Y² = 40, and 1 − α = 0.90. Note that
1 − α/2 = 0.95 = P(Z ≤ 1.645) for Z ∼ N(0, 1). Hence z_{α/2} = 1.645, so

z_{α/2} √(σ_X²/n + σ_Y²/m) = 1.645 √(60/60 + 40/40) = 1.645 · √2 = 2.326.

Noting that x̄ − ȳ = −5.2, we conclude that [−5.2 − 2.326, −5.2 + 2.326] = [−7.526, −2.874] is a 90% confidence
interval for μ_X − μ_Y. In particular, we can claim, with at least 90% confidence, that μ_Y is larger than μ_X
by at least 2.87. ▲
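The arithmetic at the end of the example can be reproduced in a few lines (this just re-computes the
numbers above):

import numpy as np

# Reproducing the two-sample interval at the end of Example 4.4.13.
n, m = 60, 40
xbar, ybar = 70.1, 75.3
var_x, var_y = 60.0, 40.0
z = 1.645                                  # z_{0.05} for a 90% interval
half = z * np.sqrt(var_x / n + var_y / m)  # = 1.645 * sqrt(2) = 2.326
diff = xbar - ybar
print(diff - half, diff + half)            # about [-7.526, -2.874]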
Exercise 4.4.14 (Lyapunov's CLT for the difference of sample means). Let X, Y be two RVs with E[X] =
μ_X, E[Y] = μ_Y, σ_X² := Var(X) < ∞, and σ_Y² := Var(Y) < ∞. Let X_1, . . . , X_n be i.i.d. samples of X and let
Y_1, . . . , Y_m be i.i.d. samples of Y. Further assume that the X_i and Y_j are all independent. The goal of this
exercise is to prove that the difference X̄ − Ȳ of the sample means satisfies the CLT, that is,

( (X̄ − Ȳ) − (μ_X − μ_Y) ) / √( σ_X²/n + σ_Y²/m ) ⇒ N(0, 1)   as n, m → ∞.    (46)

The standard CLT for i.i.d. RVs does not apply, since the X_i and Y_j have possibly different distribu-
tions. We will instead use a particular instance of the more general Lindeberg-Feller CLT for triangular
arrays, via Lyapunov's condition (see Exercise 4.4.10).

(i) Show that E[X̄ − Ȳ] = μ_X − μ_Y and Var(X̄ − Ȳ) = σ_X²/n + σ_Y²/m.
(ii) Fix sequences (n_N)_{N≥1}, (m_N)_{N≥1} such that n_N + m_N = N for all N ≥ 1. In light of Exercise 4.4.10,
consider the following 'triangular array' of N RVs:

Nth row:  W_{1;N} = X_1/n_N, W_{2;N} = X_2/n_N, . . . , W_{n_N;N} = X_{n_N}/n_N, W_{n_N+1;N} = Y_1/m_N, . . . , W_{N;N} = Y_{m_N}/m_N.    (47)

(iii) From (ii), deduce that Lyapunov's condition (44) for the triangular array (47) is satisfied for some δ > 0
whenever the moment conditions in (48) hold.
(iv) From (i), (iii), and Lyapunov's CLT (see Exercise 4.4.10), deduce that (46) holds whenever the first
two moment conditions in (48) hold.
4.5. CLT with rate of convergence: Berry-Esséen theorem
Theorem 4.5.1 (Berry-Esséen theorem). Let X_1, . . . , X_n be i.i.d. RVs with zero mean, unit variance, and
finite third moment. Let Z_n := n^{−1/2}(X_1 + · · · + X_n) and Z ∼ N(0, 1). Then for any z ∈ R,

P(Z_n ≤ z) − P(Z ≤ z) = O( n^{−1/2} E[|X_1|³] ),

where the implied constant does not depend on z.

Theorem 4.5.2 (Berry-Esséen theorem, weak form). Let X_1, . . . , X_n be i.i.d. RVs with zero mean, unit
variance, and finite third moment. Let Z_n := n^{−1/2}(X_1 + · · · + X_n) and Z ∼ N(0, 1). Then for any smooth
function f : R → R with uniformly bounded derivatives up to third order,

E[f(Z_n)] − E[f(Z)] = O( n^{−1/2} E[|X_1|³] sup_{x∈R} |f′′′(x)| ).
P ROOF. A first key step is to decompose a single standard normal RV Z into the sum of n i.i.d. normal
RVs. Let Y_1, . . . , Y_n be i.i.d. standard normal RVs and let W_n := n^{−1/2}(Y_1 + · · · + Y_n). Note that W_n =_d Z ∼
N(0, 1) by Exercise 4.3.13. Next, we consider a 'continuous transform' from Z_n to W_n by replacing X_k with
Y_k for k = 1, . . . , n (a.k.a. the Lindeberg replacement trick):

√n Z_{n;0} := X_1 + X_2 + · · · + X_n
√n Z_{n;1} := Y_1 + X_2 + · · · + X_n
√n Z_{n;2} := Y_1 + Y_2 + · · · + X_n
  ⋮
√n Z_{n;n} := Y_1 + Y_2 + · · · + Y_n.

Using the above sequence of intermediate RVs, we consider the following telescoping sum:

E[f(Z_n)] − E[f(Z)] = E[f(Z_{n;0})] − E[f(Z_{n;n})] = Σ_{k=1}^n ( E[f(Z_{n;k−1})] − E[f(Z_{n;k})] ),    (49)

where Z_{n;k−1} and Z_{n;k} differ only in the kth summand. For k = 1, . . . , n, denote

S_{n;k} := n^{−1/2} ( X_1 + X_2 + · · · + X_{k−1} + Y_{k+1} + · · · + Y_n ).

Since

Z_{n;k−1} = S_{n;k} + n^{−1/2} X_k,    Z_{n;k} = S_{n;k} + n^{−1/2} Y_k,

using the second order Taylor expansion with remainder, we can write

f(Z_{n;k−1}) = f(S_{n;k} + n^{−1/2} X_k)
            = f(S_{n;k}) + f′(S_{n;k}) n^{−1/2} X_k + f′′(S_{n;k}) X_k²/(2n) + O( n^{−3/2} |X_k|³ sup_{x∈R} |f′′′(x)| ),
f(Z_{n;k})  = f(S_{n;k} + n^{−1/2} Y_k)
            = f(S_{n;k}) + f′(S_{n;k}) n^{−1/2} Y_k + f′′(S_{n;k}) Y_k²/(2n) + O( n^{−3/2} |Y_k|³ sup_{x∈R} |f′′′(x)| ).
The right hand side can be upper bounded by C(ε + n^{−1/2} ε^{−3}) for some constant C > 0. Optimizing
over ε, we find the bound is minimized at ε = O(n^{−1/8}), with minimum value of order O(n^{−1/8}). Hence we
deduce

|P(Z_n ≤ z) − P(Z ≤ z)| = O(n^{−1/8}).

The implied constant in O(·) above does not depend on z, so we in fact get

sup_{z∈R} |P(Z_n ≤ z) − P(Z ≤ z)| = O(n^{−1/8}).

Notice that the bound O(n^{−1/8}) on the rate of convergence in the CLT is slower than the O(n^{−1/2}) of the full
Berry-Esséen theorem (Theorem 4.5.1).
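The n^{−1/2} rate is visible in simulations. The following rough sketch (with arbitrary choices: centered ±1
coin-flip increments, 20,000 replications, and a fixed z-grid) estimates sup_z |P(Z_n ≤ z) − Φ(z)| for a few n:

import numpy as np
from math import erf

# Empirical look at the Berry-Esseen rate: +-1 increments already have zero
# mean, unit variance, and E|X_1|^3 = 1, so the bound is ~ C / sqrt(n).
rng = np.random.default_rng(3)
Phi = np.vectorize(lambda x: 0.5 * (1 + erf(x / 2 ** 0.5)))
zs = np.linspace(-4, 4, 401)
for n in [16, 64, 256]:
    X = rng.choice([-1.0, 1.0], size=(20_000, n))
    Zn = X.sum(axis=1) / np.sqrt(n)
    emp = (Zn[None, :] <= zs[:, None]).mean(axis=1)
    print(n, np.max(np.abs(emp - Phi(zs))))    # shrinks roughly like n^{-1/2}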
Exercise 4.5.4 (Third order smooth approximation of a step function). Fix z ∈ R and ε > 0. We will con-
struct a function f : R → R such that f(x) ≡ 1 for x ≤ z, f(x) ≡ 0 for x ≥ z + ε, and f′′′ is uniformly bounded
of order O(ε^{−3}), where the implied constant does not depend on z.
(i) Fix a < b and consider g′(x) = λ(x − a)(x − b), so that g(x) = λ( x³/3 − ((a + b)/2) x² + abx ) + C. For λ > 0,
g has a local maximum at a, a local minimum at b, and is strictly decreasing on [a, b]. Let a = z and
b = z + ε. Find λ and C such that g(z) = 1 and g(z + ε) = 0. Show that λ = O(ε^{−3}).
(ii) Take f(x) = 1 for x ≤ z, f(x) = g(x) for z ≤ x ≤ z + ε, and f(x) = 0 for x ≥ z + ε, with a = z, b = z + ε, and
λ, C as in (i). Show that such f satisfies all the required conditions.
CHAPTER 5

Martingales
Theorem 5.1.4 (Radon-Nikodym theorem). Let ν and μ be σ-finite measures on (Ω, F). Then ν ≪ μ if
and only if there exists a measurable function f : (Ω, F) → (R, B) such that

ν(B) = ∫_B f dμ   for all B ∈ F.

In this case, the function f is denoted by dν/dμ and is called the Radon-Nikodym derivative of ν w.r.t. μ.

P ROOF. The "only if" part is the core, and a proof of it can be found in [Dur19, Thm. A.4.8]. For the
"if" part, suppose B ∈ F with μ(B) = 0. Then ν(B) = ∫_B f dμ = 0 by the definition of the Lebesgue integral. □
Example 5.1.5. Let X : (Ω, F, P) → (R, B) be a continuous RV with probability density function f_X. Con-
sider the induced probability measure ν on R given by ν(B) := P(X ∈ B) = ∫_B f_X dμ for B ∈ B, where μ
denotes the Lebesgue measure on R. Then ν ≪ μ, and in this case dν/dμ = f_X. ▲
Exercise 5.1.6. Let (Ω, F_0, P) be a probability space and let X, X′ : Ω → R be (F_0 − B)-measurable RVs
that are absolutely integrable. Suppose that P(X 1_B = X′ 1_B) = 1 for some B ∈ F, where F ⊆ F_0 is a σ-
algebra on Ω. Show that E[X | F] = E[X′ | F] a.s. on B. (Hint: Repeat the uniqueness argument in the
proof of Prop. 5.1.3.)
5.1.2. Examples. In this section, we connect the measure-theoretic definition of conditional expecta-
tion in Def. 5.1.2 with more familiar versions of conditional expectation in some special cases. It is helpful
to recall the following interpretation of the objects involved:

F_0 = baseline information;   X : (Ω, F_0) → (R, B) = RV of interest;
F (⊆ F_0) = additional information;   E[X | F] = best guess of X given F.
Example 5.1.7 (Conditioning on perfect information). If X is measurable w.r.t. F, then E[X | F] = X.
That is, the best guess of X knowing X is X itself. Indeed, the iterated expectation condition E[X 1_B] = E[X 1_B]
holds trivially for all B ∈ F, so the only thing that could prevent X from being a version of E[X | F] is
measurability w.r.t. F. Hence if X is F-measurable, then E[X | F] = X. In particular, if P(X = c) = 1 for some
constant c ∈ R, then X is measurable w.r.t. any F, so E[c | F] = c. ▲
Example 5.1.8 (Conditioning on no information). Suppose X is independent of F, that is, σ(X) and
F are independent. Then by the definition of independence, σ(X) and σ(1_B) = {∅, B, B^c, Ω} are independent
for all B ∈ F. By Prop. 2.1.4, it follows that

X ⊥ 1_B for all B ∈ F.

In this case, knowing F would not help us improve our current guess of X, so we should have E[X | F] =
E[X]. To verify this, first note that the constant RV E[X] is measurable w.r.t. F. Second, for the
iterated expectation, we use the independence noted above:

E[X 1_B] = E[X] E[1_B] = E[ E[X] 1_B ] for all B ∈ F.

This shows E[X | F] = E[X].
Conversely, if E[X | F] = E[X], then

E[X 1_B] = E[ E[X | F] 1_B ] = E[ E[X] 1_B ] = E[X] E[1_B] for all B ∈ F,

i.e., X is uncorrelated with every indicator 1_B, B ∈ F ('mean-independence'). Note, however, that this
is in general weaker than full independence of σ(X) and F. ▲
Example 5.1.9 (Conditioning on discrete outcomes). Suppose B_1, B_2, . . . are disjoint events in F_0 that
partition the whole sample space, i.e., Ω = ⊔_{k=1}^∞ B_k. Suppose P(B_k) > 0 for each k ≥ 1. Let F :=
σ({B_1, B_2, . . . }). Then we claim that

E[X | F](ω) = Σ_{k=1}^∞ ( E[X 1_{B_k}] / P(B_k) ) 1(ω ∈ B_k).    (51)

Due to the partitioning, for each ω ∈ Ω there exists a unique k ≥ 1 such that ω ∈ B_k, so only one term on
the RHS is possibly nonzero. That is, for each ω ∈ Ω, we know the unique k ≥ 1 such that ω ∈ B_k, and in this
case E[X | F](ω) = E[X 1_{B_k}] / P(B_k). In words, the information F tells us in which B_k of the partition each
outcome ω lies, and the best guess of X is the average value of X over that B_k.
To verify (51), denote the RV on the RHS of (51) by Y. First note that Y is measurable w.r.t. F, since it
is constant on each B_k. Denoting its value on B_k by β_k, for each Borel A ⊆ R we have

Y^{−1}(A) = ∪_{k≥1: β_k ∈ A} B_k ∈ F = σ({B_1, B_2, . . . }).

Second, for the iterated expectation, fix an event B ∈ F. Note that we can write B = ∪_{k∈I} B_k for some
I ⊆ {1, 2, . . . } (see Exc. 1.1.13). Then we need to verify

E[X 1_B] = E[Y 1_B] = E[ Σ_{k∈I} ( E[X 1_{B_k}] / P(B_k) ) 1(B_k) ].

For X ≥ 0, this follows by expanding both sides over the disjoint events B_k, k ∈ I, and using MCT. For the
general case, we write X = X^+ − X^− and use the fact that E[X | F] = E[X^+ | F] − E[X^− | F] (see (50)).
A degenerate but notable special case is when F = {∅, Ω}. In this case, (51) yields E[X | F] = E[X]. ▲
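A finite toy version of (51) can be computed directly; the sketch below (with a hypothetical six-point
sample space and a two-block partition, chosen only for illustration) checks both the averaging formula
and the iterated expectation property:

import numpy as np

# Omega = {0,...,5} with the uniform measure, partitioned into B_1 = {0,1,2}
# and B_2 = {3,4,5}; X(omega) = omega. Then E[X|F] averages X over each block.
omega = np.arange(6)
p = np.full(6, 1 / 6)
X = omega.astype(float)
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]

condE = np.empty(6)
for B in blocks:
    condE[B] = (X[B] * p[B]).sum() / p[B].sum()   # block average, as in (51)
print(condE)                              # [1, 1, 1, 4, 4, 4]
print((condE * p).sum(), (X * p).sum())   # iterated expectation: both 2.5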
Example 5.1.10 (Undergraduate conditional probability). Suppose X = 1_A and consider the partition
Ω = B ∪ B^c for some event B ∈ F with P(B) ∈ (0, 1). Then (51) yields

P(A | 1_B) := E[1_A | σ(1_B)] = P(A ∩ B)/P(B) on B;   P(A ∩ B^c)/P(B^c) on B^c.

Denoting P(A|C) := P(A ∩ C)/P(C) for all A, C ∈ F with P(C) > 0, we can rewrite the above as

P(A | 1_B) = E[1_A | σ(1_B)] = P(A|B) on B;   P(A|B^c) on B^c. ▲
Example 5.1.11 (Conditioning on a discrete RV). Let T be a discrete RV on (Ω, F_0, P) taking countably
many values t_1, t_2, . . . . We denote

E[X | T = t_k] := E[X 1(T = t_k)] / P(T = t_k) = E[ X · 1(T = t_k)/P(T = t_k) ].    (53)

Note that if X is also a discrete RV taking values x_i for i ≥ 1, then X 1(T = t_k) takes value x_i with proba-
bility P(X = x_i, T = t_k) for i ≥ 1 and is otherwise zero, so by Prop. 1.5.2,

E[X | T = t_k] = Σ_{i≥1} x_i P(X = x_i, T = t_k)/P(T = t_k) = Σ_{i≥1} x_i P(X = x_i | T = t_k).

The above agrees with the undergraduate formula for the conditional expectation of a discrete RV given
another discrete RV.
Now (51) with the notation in (53) gives

E[X | T](ω) = Σ_{k=1}^∞ ( E[X 1(T = t_k)] / P(T = t_k) ) 1(T(ω) = t_k)
            = Σ_{k=1}^∞ E[X | T = t_k] 1(T(ω) = t_k),

which is the usual formula in undergraduate probability for conditioning on a discrete
RV. ▲
Example 5.1.12 (Conditional expectation from a continuous joint distribution). Suppose we have two RVs
X, Y on the same probability space (Ω, F, P) with joint density f(x, y), that is,

P((X, Y) ∈ B) = ∫_B f(x, y) dx dy for B ∈ B².

Fix a function g : R → R and suppose E[|g(X)|] < ∞. Can we guess what E[g(X) | Y] = h(Y) is? A naive
computation suggests the following:

E[g(X) | Y] = h(Y),   where h(y) := ∫_R g(x) f_{X|Y=y}(x) dx,    (54)

and f_{X|Y=y}(x) is the "conditional probability density function of X given Y = y", computed as

f_{X|Y=y}(x) = f(x, y) / ∫_R f(x, y) dx = f(x, y)/f_Y(y).

It is customary to denote E[g(X) | Y = y] := h(y).
Now we show (54). Clearly h(Y) ∈ σ(Y). To check the iterated expectation, let A ∈ σ(Y); then A = Y^{−1}(B)
for some B ∈ B. We note the following series of identities:

E[h(Y) 1_A] = ∫_Ω h(Y(ω)) 1(Y(ω) ∈ B) dP
        (a) = ∫_{R²} h(y) 1(y ∈ B) dP ∘ (X, Y)^{−1}(x, y)
        (b) = ∫_{R²} h(y) 1(y ∈ B) f(x, y) dx dy
        (c) = ∫_R h(y) 1(y ∈ B) ( ∫_R f(x, y) dx ) dy
        (d) = ∫_R ∫_R g(x) 1(y ∈ B) f(x, y) dx dy
        (e) = ∫_{R²} g(x) 1(y ∈ B) dP ∘ (X, Y)^{−1}(x, y)
        (f) = E[g(X) 1(Y ∈ B)]
            = E[g(X) 1_A].

Here, (a) follows from the change of variables (Thm. 1.5.3) for the composite RV (X, Y),
(b) follows from Exc. 1.5.6, (c) is Fubini's theorem, (d) uses the definition of h, (e) follows again from Exc.
1.5.6, and (f) is another change of variables. Hence we conclude that h(Y) is a version of E[g(X) | Y]. ▲
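For a concrete instance of this recipe (chosen only for illustration), take the joint density f(x, y) = x + y on
[0,1]² and g(x) = x; then f_Y(y) = 1/2 + y and E[X | Y = y] = (1/3 + y/2)/(1/2 + y), which the following
sketch verifies by numerical integration:

import numpy as np

# Check E[X | Y = y] for f(x, y) = x + y on [0,1]^2 against the closed form.
dx = 1e-4
x = np.arange(0, 1, dx) + dx / 2          # midpoint grid on [0, 1]
for y in [0.0, 0.5, 1.0]:
    f = x + y                             # the joint density along the slice
    num = (x * f).sum() * dx              # int x f(x, y) dx
    den = f.sum() * dx                    # f_Y(y) = int f(x, y) dx
    print(num / den, (1/3 + y/2) / (1/2 + y))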
Example 5.1.13 (A discrete RV conditional on a continuous RV). Suppose we have two RVs X, Y on the
same probability space (Ω, F, P). Suppose X is a discrete RV taking values in {x_1, x_2, . . . } and Y has
PDF f_Y. Suppose there exists a function h(x, y) such that

P(X = x, Y ∈ B) = ∫_B h(x, y) f_Y(y) dy for x ∈ R and B ∈ B.

(In this case, we denote P(X = x | Y = y) := h(x, y).) Note that by Fubini's theorem,

∫_B f_Y(y) dy = P(Y ∈ B) = Σ_{k≥1} P(X = x_k, Y ∈ B) = Σ_{k≥1} ∫_B h(x_k, y) f_Y(y) dy = ∫_B f_Y(y) Σ_{k≥1} h(x_k, y) dy.

Since this holds for all Borel sets B ⊆ R, we must have Σ_{k≥1} h(x_k, y) = 1 for all y outside a set of
Lebesgue measure zero.
Now we claim that

E[X | Y] = Σ_{k≥1} x_k h(x_k, Y) =: W.

Clearly W is measurable w.r.t. σ(Y). To check the iterated expectation, let A ∈ σ(Y); then A = Y^{−1}(B) for
some B ∈ B. By change of variables and Fubini's theorem,

E[W 1_A] = ∫_B ( Σ_{k≥1} x_k h(x_k, y) ) f_Y(y) dy
         = Σ_{k≥1} x_k ∫_B h(x_k, y) f_Y(y) dy
         = Σ_{k≥1} x_k P(X = x_k, Y ∈ B)
         = E[X 1_A].

Hence, writing E[X | Y = y] := Σ_{k≥1} x_k h(x_k, y), the iterated expectation gives

E[X] = E[E[X | Y]] = ∫_R Σ_{k≥1} x_k h(x_k, y) f_Y(y) dy = ∫_R E[X | Y = y] f_Y(y) dy. ▲
Example 5.1.14 (Conditioning on an independent RV). Suppose we have two independent RVs X, Y on the
same probability space (Ω, F, P). Fix a measurable function φ : R² → R and suppose E[|φ(X, Y)|] < ∞.
Let g(x) := E_Y[φ(x, Y)]. Then we claim that

E[φ(X, Y) | X] = g(X).

To verify, first note that g(X) ∈ σ(X). Second, fix A ∈ σ(X). Then A = X^{−1}(B) for some Borel B ⊆ R. Let
μ := P ∘ X^{−1} and ν := P ∘ Y^{−1}. Since X ⊥ Y, P ∘ (X, Y)^{−1} = μ ⊗ ν (see Lem. 2.3.1). By change of variables,

E[g(X) 1_A] = E[g(X) 1(X ∈ B)] = ∫_R g(x) 1(x ∈ B) μ(dx)
            = ∫_R ( ∫_R φ(x, y) ν(dy) ) 1(x ∈ B) μ(dx)
            = ∫_R ∫_R φ(x, y) 1(x ∈ B) ν(dy) μ(dx)
            = ∫_{R²} φ(x, y) 1(x ∈ B) dP ∘ (X, Y)^{−1}
            = ∫_Ω φ(X(ω), Y(ω)) 1(X(ω) ∈ B) dP(ω)
            = E[φ(X, Y) 1(X ∈ B)] = E[φ(X, Y) 1_A]. ▲
Example 5.1.15 (Reducing redundant information). Suppose X, Y, Z are independent RVs on the same
probability space (Ω, F, P). Fix a measurable function φ : R² → R with E[|φ(X, Z)|] < ∞ and let g(x) :=
E[φ(x, Z)]. Then we claim that

E[φ(X, Z) | X, Y] = g(X) = E[φ(X, Z) | X].

This should be reasonable, since φ(X, Z) is independent of Y. The second equality above was shown
in Ex. 5.1.14, so we only need to justify the first.
Note that g(X) ∈ σ(X) ⊆ σ(X, Y). Fix A ∈ σ(X, Y). Then A = (X, Y)^{−1}(B) for some Borel B ⊆ R². Let μ := P ∘ X^{−1},
λ := P ∘ Y^{−1}, and ν := P ∘ Z^{−1}. Since X, Y, Z are independent, P ∘ (X, Y)^{−1} = μ ⊗ λ and P ∘ (X, Y, Z)^{−1} =
μ ⊗ λ ⊗ ν.
We will use the unconditional tail-sum formula (Prop. 1.5.10) and the definition of conditional expectation
to show this. Fix A ∈ F and x > 0. Clearly the RHS of (55) is F-measurable. Note that

P(X 1(A) ≥ x) = E[1(X 1(A) ≥ x)]
          (a) = E[1(X ≥ x) 1(A)]
          (b) = E[ E[1(X ≥ x) | F] 1(A) ],

where (a) uses X ≥ 0 and x > 0, and (b) uses the definition of E[1(X ≥ x) | F]. Then by using the uncondi-
tional tail-sum formula (Prop. 1.5.10) and Fubini's theorem,

E[X 1(A)] = ∫_0^∞ P(X 1(A) ≥ x) dx
          = ∫_0^∞ E[ E[1(X ≥ x) | F] 1(A) ] dx
          = E[ ∫_0^∞ E[1(X ≥ x) | F] 1(A) dx ]
          = E[ ( ∫_0^∞ P(X ≥ x | F) dx ) 1(A) ].
Exercise 5.1.20. Suppose we have a stick of length L. Break it into two pieces at a uniformly chosen point
and let X_1 be the length of the longer piece. Break this longer piece into two pieces at a uniformly chosen
point and let X_2 be the length of the longer one. Define X_3, X_4, · · · in a similar way.
(i) Let U ∼ Uniform([0, L]). Show that X_1 takes values in [L/2, L] and that X_1 = max(U, L − U).
(ii) From (i), deduce that for any L/2 ≤ x ≤ L, we have

P(X_1 ≥ x) = P(U ≥ x or L − U ≥ x) = P(U ≥ x) + P(U ≤ L − x) = 2(L − x)/L.

Conclude that X_1 ∼ Uniform([L/2, L]). What is E[X_1]?
(iii) Show that X_2 ∼ Uniform([x_1/2, x_1]) conditional on X_1 = x_1. That is,

P(X_2 ≥ x | X_1) = 2(X_1 − x)/X_1 for X_1/2 ≤ x ≤ X_1.

(Hint: Use the results in Ex. 5.1.12.) Using iterated expectation, show that E[X_2] = (3/4)² L.
(iv) In general, show that X_{n+1} | X_n ∼ Uniform([X_n/2, X_n]). Conclude that E[X_n] = (3/4)^n L.
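A simulation can lend confidence in the answer to (iv) before proving it (this is only a sanity check, not a
proof; the number of replications and steps are arbitrary):

import numpy as np

# Repeatedly keep the longer piece of a uniform break; compare E[X_n] with (3/4)^n L.
rng = np.random.default_rng(4)
L, reps, steps = 1.0, 200_000, 5
length = np.full(reps, L)
for n in range(1, steps + 1):
    cut = rng.uniform(0, length)                 # uniform break point
    length = np.maximum(cut, length - cut)       # keep the longer piece
    print(n, length.mean(), (3 / 4) ** n * L)    # the two columns agree closely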
Proposition 5.1.21 (Basic properties of conditional expectation). Suppose we have two RVs X, Y on the
same probability space (Ω, F_0, P) and let F ⊆ F_0 be a sub-σ-algebra. Suppose E[|X|] < ∞ and E[|Y|] < ∞
for the first two properties below.
(i) (Linearity) E[aX + Y | F] = a E[X | F] + E[Y | F].
(ii) (Monotonicity) If X ≤ Y, then E[X | F] ≤ E[Y | F] a.s.
(iii) (Continuity from below) If X_n ≥ 0 and X_n ↗ X with E[X] < ∞, then E[X_n | F] ↗ E[X | F].
P ROOF. For the proofs of (i) and (ii), denote W := E[X | F] and Z := E[Y | F].
(i) Since W and Z are F-measurable, so is their linear combination aW + Z. For the iterated expectation,
fix A ∈ F. Then by linearity of expectation,

E[(aX + Y) 1_A] = a E[X 1_A] + E[Y 1_A] = a E[W 1_A] + E[Z 1_A] = E[(aW + Z) 1_A].

Hence aW + Z is a version of E[aX + Y | F].
(ii) Fix ε > 0 and let A := {W − Z ≥ ε} ∈ F. Then since (X − Y) 1_A ≤ 0,

0 ≥ E[(X − Y) 1_A] = E[X 1_A] − E[Y 1_A] = E[W 1_A] − E[Z 1_A] = E[(W − Z) 1_A] ≥ ε P(A).

It follows that P(W − Z ≥ ε) = P(A) = 0. By taking ε ↓ 0, we deduce P(W − Z > 0) = 0. Hence
W ≤ Z a.s., as desired.
(iii) Let Y_n := X − X_n. Then Y_n ↘ 0 a.s. By (i)-(ii), it suffices to show that Z_n := E[Y_n | F] ↘ 0 a.s. Since
Y_n ↘, by (ii) we also have Z_n ↘. Also, Y_n ≥ 0 and (ii) imply Z_n ≥ 0. Hence for each ω ∈ Ω, Z_n(ω) is a
non-increasing sequence of nonnegative reals, so it converges as n → ∞. Hence Z_n ↘ Z for some
RV Z measurable with respect to F. So it is enough to show that Z = 0 a.s.
For this, fix A ∈ F and note that by DCT,

lim_{n→∞} E[Z_n 1_A] = E[Z 1_A],

as 0 ≤ Z_n ≤ Z_1 and E[|Z_1|] = E[Z_1] = E[Y_1] = E[X] − E[X_1] ≤ E[X] < ∞. But since Y_n ↘ 0, 0 ≤ Y_n ≤
Y_1, and E[Y_1] ≤ E[X] < ∞, by DCT again we have

lim_{n→∞} E[Z_n 1_A] = lim_{n→∞} E[Y_n 1_A] = E[ lim_{n→∞} Y_n 1_A ] = E[0] = 0.

Hence E[Z 1_A] = 0 for all A ∈ F; taking A = {Z > 0} gives Z = 0 a.s. □
Proposition 5.1.23 (Conditional expectation and nested σ-algebras). Fix a probability space (Ω, F0 , P),
RVs X , Y on it, and sub-σ-algebras F ⊆ G ⊆ F0 .
(i) If E[X | G ] is F -measurable, then E[X | G ] = E[X | F ].
(ii) E[E[X | G ] | F ] = E[X | F ].
Proposition 5.1.24 (Conditional expectation with a known factor). Fix a probability space (Ω, F_0, P), RVs
X, Y on it with E[|Y|] < ∞ and E[|XY|] < ∞, and a sub-σ-algebra F ⊆ F_0. If X ∈ F, then

E[XY | F] = X E[Y | F].
P ROOF. Clearly X E[Y | F] is F-measurable by the hypothesis. To check the iterated expectation, we wish
to show that for any A ∈ F,

E[ X E[Y | F] 1_A ] = E[XY 1_A].    (56)

We verify the above following the standard pipeline. First assume X = 1_B for some B ∈ F. Then A ∩ B ∈ F,
so

E[ 1_B E[Y | F] 1_A ] = E[ E[Y | F] 1_{A∩B} ] = E[Y 1_{A∩B}] = E[1_B Y 1_A].

By linearity of expectation, (56) also holds for simple functions X. If X, Y ≥ 0 and X_n ↗ X, where each X_n is
a simple function, then by MCT (note that Y ≥ 0 implies E[Y | F] ≥ 0, so X_n E[Y | F] ↗ X E[Y | F]),

E[ X E[Y | F] 1_A ] = lim_{n→∞} E[ X_n E[Y | F] 1_A ] = lim_{n→∞} E[X_n Y 1_A] = E[XY 1_A].

For the general case, decompose X = X^+ − X^− and Y = Y^+ − Y^−, and use linearity of expectation and of
conditional expectation. □
Proposition 5.1.25 (Conditional expectation as a projection). Suppose E[X²] < ∞ and fix a sub-σ-algebra
F ⊆ F_0. Then E[X | F] is the closest F-measurable RV to X, in the sense that

E[X | F] = argmin_{Y : RV, Y ∈ F} E[ |X − Y|² ].
In this case, we do have a limiting expectation lim_{n→∞} E[X_n], but not necessarily conver-
gence of the RVs themselves. A simple counterexample would be when X_n = ±n with equal probabilities.
Hence monotonicity of expected values is too weak to draw meaningful conclusions.
It turns out that the sweet spot between flexibility and nice convergence properties is to require
monotonicity of conditional expectations, e.g., E[X_{n+1} | F_n] ≤ X_n for all n.
Here we need an accompanying, growing sequence of σ-algebras (F_n)_{n≥1}, called a 'filtration', cor-
responding (in analogy) to our growing knowledge of the market. In this case, we can ensure that the
limiting random variable lim_{n→∞} X_n exists almost surely. This is the content of the 'martingale convergence
theorem', which we will study soon in this section.

5.2.2. Definition and examples. In this section, we will define martingales and their cousins, super-
martingales and submartingales, and take the first steps in developing their theory. A stochastic process,
or a process for short, is a sequence (X_n)_{n≥1} of RVs. A 'martingale' is a process that models the value of a
financial portfolio at market equilibrium (where no arbitrage exists).
Definition 5.2.1 (Martingale). Let (Fn )n≥1 be a filtration, i.e., an increasing sequence of σ-fields: F1 ⊆
F2 ⊆ · · · . A process (X n )n≥1 is said to be adapted to (Fn )n≥1 if X n ∈ Fn for all n (i.e., X n is (Fn − B)-
measurable). If X = (X n )n≥1 is a process with
(i) (finite expectation) E[|X n |] < ∞,
(ii) (adaptedness) X n is adapted to Fn ,
(iii) (conditional increments) E[X n+1 |Fn ] = X n for all n,
then X is said to be a martingale (w.r.t. (Fn )n≥1 ). X is a supermartingale or submartingale if the equality
in (iii) is replaced with ≤ or ≥, respectively.
Remark 5.2.2 (Harmonic functions and martingales). Let f : Ω ⊆ R^d → R be a twice differentiable func-
tion. Its Laplacian is defined as

Δf := Σ_{i=1}^d ∂²f/∂x_i² = tr(Hessian(f)).

Harmonic functions (Δf ≡ 0) equal the average of their values over small spheres; analogously, a martingale
equals the conditional average of its next value:

X_n = E[X_{n+1} | F_n].
Example 5.2.3 (Martingale w.r.t. the filtration generated by a stochastic process). In many cases, we
work with filtrations generated by observing a single reference process. Let (Yn )n≥1 denote a discrete-
time stochastic process on Ω and for each n, let Fn = σ(Y1 , . . . , Yn ) denote the collection of all events
that can be determined by the values of Y1 , . . . , Yn . Then (Fn )n≥1 is called the filtration generated by the
process (Yn )n≥1 .
In order to get familiar with martingales, we provide three examples of martingales related to random
walks. Let (ξ_n)_{n≥1} be a sequence of i.i.d. increments with E[|ξ_1|] < ∞ and E[ξ_1] = μ. Let S_n = S_0 + ξ_1 + · · · + ξ_n,
where S_0 is a constant. Denote F_n := σ(ξ_1, . . . , ξ_n) and let F_0 = {∅, Ω}. Then (S_n)_{n≥0} is called a random walk.
Example 5.2.4 (Linear martingale from RW). Define a stochastic process (X_n)_{n≥0} by

X_n := S_n − μn.

Then (X_n)_{n≥0} is a martingale with respect to the filtration (F_n)_{n≥0}. Indeed, note that X_n ∈ F_n and

E[|X_n|] = E[|S_n − μn|] ≤ E[|S_n|] + |μ|n ≤ |S_0| + n E[|ξ_1|] + |μ|n < ∞,

and by linearity of conditional expectation,

E[X_{n+1} | F_n] = E[S_{n+1} − μ(n + 1) | F_n]
               = E[S_{n+1} | F_n] − μ(n + 1)   (∵ μ(n + 1) is a constant)
               = E[S_n | F_n] + E[ξ_{n+1} | F_n] − μ(n + 1)
               = S_n + μ − μ(n + 1)   (∵ S_n ∈ F_n and E[ξ_{n+1} | F_n] = E[ξ_{n+1}] = μ)
               = X_n.

In particular,

E[S_{n+1} | F_n] = S_n + μ.

Hence if μ = 0, then (S_n)_{n≥0} is a martingale; if μ ≥ 0, then (S_n)_{n≥0} is a submartingale; and if μ ≤ 0, then
(S_n)_{n≥0} is a supermartingale. ▲
Example 5.2.5 (Quadratic martingale from RW). Suppose we have the same random walk (S_n)_{n≥0}, now
with increments having mean zero and finite variance σ² > 0. Then we claim that S_n² − σ²n is a martingale
w.r.t. the usual filtration F_n = σ(ξ_1, . . . , ξ_n) with F_0 = {∅, Ω}.
Finite expectation and adaptedness are easy to check. For the conditional increment condition, note
that since S_n ∈ F_n and ξ_{n+1} ⊥ F_n,

E[S_{n+1}² | F_n] = E[S_n² + 2S_n ξ_{n+1} + ξ_{n+1}² | F_n]
                = S_n² + 2S_n E[ξ_{n+1}] + E[ξ_{n+1}²]
                = S_n² + σ².

Thus S_n² is a submartingale with positive drift σ² > 0 (informally, the variance grows at rate σ²). Subtracting
this linear bias from S_n² makes it a martingale. Indeed,

E[S_{n+1}² − σ²(n + 1) | F_n] = S_n² − σ²n. ▲
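The martingale property implies E[M_n] = E[M_0] for all n. Here is a quick empirical check of this for
M_n = S_n² − n with simple symmetric increments (a sketch with arbitrary simulation sizes):

import numpy as np

# M_n = S_n^2 - n should have constant mean for the simple symmetric walk.
rng = np.random.default_rng(5)
reps, N = 200_000, 50
xi = rng.choice([-1, 1], size=(reps, N))
S = xi.cumsum(axis=1)                      # S_1, ..., S_N with S_0 = 0
M = S**2 - np.arange(1, N + 1)
print(M[:, [0, 9, 49]].mean(axis=0))       # all near 0 = E[M_n]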
Example 5.2.6 (Product martingale). Let (ξ_n)_{n≥1} be a sequence of independent RVs with ξ_n ≥ 0 and
E[ξ_n] = 1 for all n ≥ 1. For each n ≥ 1, let F_n = σ(ξ_1, . . . , ξ_n) and F_0 = {∅, Ω}. Define

X_n := X_0 ξ_1 ξ_2 · · · ξ_n for n ≥ 1,

where X_0 is a constant. Then (X_n)_{n≥0} is a martingale with respect to (F_n)_{n≥0}. ▲
Lemma 5.2.8. Let (X_n)_{n≥0} be a Markov chain on a discrete state space S with transition matrix P. For
each n ≥ 0, let f_n be a bounded function S → R such that

f_n(x) = Σ_{y∈S} P(x, y) f_{n+1}(y)   for all x ∈ S.

Then M_n := f_n(X_n) defines a martingale with respect to the filtration (F_n)_{n≥0}, where F_n = σ(X_1, . . . , X_n).

P ROOF. Since f_n is bounded, E[|f_n(X_n)|] < ∞. By definition, M_n is given as a function f_n(X_n) of the
reference process (X_n)_{n≥0}, so M_n ∈ F_n. For the conditional increment property, note that by the Markov
property,

E[M_{n+1} − M_n | F_n] = E[f_{n+1}(X_{n+1}) | F_n] − f_n(X_n)
                      = Σ_{y∈S} P(X_n, y) f_{n+1}(y) − f_n(X_n) = 0.  □
In order to see this, define the function h_n(x) = h(x) := ((1 − p)/p)^x, which does not depend on n. According
to Lemma 5.2.8, it suffices to show that h is a 'harmonic function' with respect to the transition matrix of
the RW (alternatively, one can check directly that M_n is a martingale via the product martingale of Example
5.2.6). Namely,

Σ_{y∈Z} P(x, y) h_{n+1}(y) = p h(x + 1) + (1 − p) h(x − 1)
                          = p ((1 − p)/p)^{x+1} + (1 − p) ((1 − p)/p)^{x−1}
                          = ((1 − p)/p)^x ( (1 − p) + p ) = ((1 − p)/p)^x = h_n(x).

Hence by Lemma 5.2.8, (M_n)_{n≥0} is a martingale with respect to the filtration (F_n)_{n≥0}. ▲
Example 5.2.10 (Simple symmetric random walk). Let (ξ_n)_{n≥1} be a sequence of i.i.d. RVs with

P(ξ_k = 1) = P(ξ_k = −1) = 1/2.

Let S_n = S_0 + ξ_1 + · · · + ξ_n. Note that (S_n)_{n≥0} is a Markov chain on Z. Define

M_n = S_n² − n.

Then (M_n)_{n≥0} is a martingale with respect to F_n = σ(ξ_1, . . . , ξ_n).
We have already shown this claim in Example 5.2.5. One can also check the martingale condition
for S_n² − n using Lemma 5.2.8. Namely, for each n ≥ 0, define a function f_n : Z → R by f_n(x) = x² − n. By
Lemma 5.2.8, it suffices to check that f_n(x) is the average of f_{n+1}(y) with respect to the transition matrix
of S_n. Namely,

Σ_{y∈Z} P(x, y) f_{n+1}(y) = f_{n+1}(x + 1)/2 + f_{n+1}(x − 1)/2
                          = ( (x + 1)² − (n + 1) )/2 + ( (x − 1)² − (n + 1) )/2 = x² − n = f_n(x).

Hence by Lemma 5.2.8, (M_n)_{n≥0} is a martingale with respect to the filtration (F_n)_{n≥0}. ▲
5.2.3. Basic properties of martingales. If a martingale is a fair gambling strategy, then one can think of
supermartingales and submartingales as unfavorable and favorable gambling strategies, respectively. For
instance, the expected winnings from gambling on an unfavorable game should be non-increasing in time.
This is an immediate consequence of the definition and iterated expectation.
Proposition 5.2.11. Let (X n )n≥0 be a stochastic process adapted to a filtration (Fn )n≥0 .
(i) If (X n )n≥0 is a supermartingale w.r.t. (Fn )n≥0 , then E[X n | Fm ] ≤ X m for all n ≥ m ≥ 0.
(ii) If (X_n)_{n≥0} is a submartingale w.r.t. (F_n)_{n≥0}, then E[X_n | F_m] ≥ X_m for all n ≥ m ≥ 0.
(iii) If (X n )n≥0 is a martingale w.r.t. (Fn )n≥0 , then E[X n | Fm ] = X m for all n ≥ m ≥ 0.
P ROOF. Note that −X_n is a submartingale if X_n is a supermartingale, so (ii) and (iii) follow directly
from (i). We will only show (i). Let (X_n)_{n≥0} be a supermartingale w.r.t. (F_n)_{n≥0}, so

E[X_{n+1} | F_n] ≤ X_n for all n.

Then by iterated expectation (e.g., Prop. 5.1.23 with F_m ⊆ F_{m+1} ⊆ F_0), and using
E[X_{m+2} − X_m | F_{m+1}] = E[X_{m+2} | F_{m+1}] − X_m ≤ X_{m+1} − X_m,

E[X_{m+2} − X_m | F_m] = E[ E[X_{m+2} − X_m | F_{m+1}] | F_m ] ≤ E[X_{m+1} − X_m | F_m] ≤ 0.

Then the assertion follows by induction. □
A simple but very important result is that a martingale stopped at a stopping time is a martin-
gale. Recall that we denote a ∧ b = min{a, b}.
P ROOF. The key observation is the following alternative expression for X_{n∧N}:

X_{n∧N} = X_0 + Σ_{m=1}^n 1{N ≥ m} · (X_m − X_{m−1}).    (58)

The above identity can be checked by verifying it on each event {N = m} for m ≥ 0. Then the triangle
inequality and the finite expectation assumption show that E[|X_{n∧N}|] < ∞. For adaptedness, note that
X_{n∧N} ∈ F_n, since for each m = 1, . . . , n,

1{N ≥ m} = 1{N ≠ 0} 1{N ≠ 1} · · · 1{N ≠ m − 1} ∈ F_{m−1} ⊆ F_n    (59)

and X_0, . . . , X_n ∈ F_n. For the conditional increment condition, use linearity of conditional expectation
and (59) to get

E[X_{(n+1)∧N} | F_n] = X_0 + Σ_{m=1}^n 1{N ≥ m} · (X_m − X_{m−1}) + E[ 1{N ≥ n + 1} · (X_{n+1} − X_n) | F_n ]
                    = X_{n∧N} + 1{N ≥ n + 1} E[X_{n+1} − X_n | F_n]
                    ≥ X_{n∧N},

since E[X_{n+1} − X_n | F_n] ≥ 0 for a submartingale (with equality in the martingale case). □
The key property of stopping times we used in the proof of Theorem 5.2.17 was that 1{N ≥ n + 1} ∈ F_n.
That is, at time n, one already knows whether N ≥ n + 1 or not. This property is generalized by the following
notion of 'predictability'.

Definition 5.2.18 (Predictable sequence). Let (F_n)_{n≥0} be a filtration. A stochastic process (H_n)_{n≥1} is
predictable w.r.t. (F_n)_{n≥0} if H_n ∈ F_{n−1} for all n ≥ 1.
Example 5.2.19. If N is a stopping time w.r.t. Fn , then Hn = 1{N ≥n} is predictable w.r.t. Fn .
A natural interpretation of a predictable sequence H_n is as the number of shares of a stock held between
time n − 1 and n, determined by the investor using the information available at time n − 1. If we let X_n be
the value of one share of that stock at time n, then the gain between time n − 1 and n is H_n(X_n − X_{n−1}),
and the total gain from time 0 to time n is the stochastic integral of H against X:

∫_0^n H dX := Σ_{m=1}^n H_m (X_m − X_{m−1}).
Example 5.2.20 (Doubling strategy). Let (ξ_n)_{n≥1} be i.i.d. RVs with P(ξ_n = 1) = p and P(ξ_n = −1) = 1 − p.
Let X_n = X_0 + ξ_1 + · · · + ξ_n, so that ξ_n = X_n − X_{n−1}. A famous gambling strategy known as the "martingale"
is defined by the predictable sequence (H_n)_{n≥1}, where H_1 = 1 and

H_n = 2H_{n−1} if ξ_{n−1} = −1;   1 if ξ_{n−1} = 1.

In words, we double our bet H_n when we lose (ξ_{n−1} = −1), so that if we lose k times and then win, our
net winnings will be 2^k − (2^{k−1} + 2^{k−2} + · · · + 1) = 1. ▲
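A short simulation of the doubling strategy with p = 1/2 (a sketch; the stopping rule 'play until the first
win' is the one described above) shows that the net winnings at the stopping time are always +1, while the
bets along the way grow exponentially:

import numpy as np

# Bet H_n, win or lose that amount, and double the bet after each loss.
rng = np.random.default_rng(6)
for _ in range(3):
    wealth, bet, rounds = 0, 1, 0
    while True:
        rounds += 1
        if rng.random() < 0.5:      # win: collect the bet and stop
            wealth += bet
            break
        wealth -= bet               # lose: double the next bet
        bet *= 2
    print(rounds, wealth)           # wealth is always +1 at the stopping time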
Exercise 5.2.21 (The casino always wins). Let X = (X_n)_{n≥0} be a supermartingale w.r.t. a filtration (F_n) and
let H = (H_n)_{n≥1} be any predictable sequence w.r.t. (F_n). Suppose that H_n is bounded and nonnegative
for each n ≥ 1. Show that ∫_0^n H dX is a supermartingale w.r.t. (F_n). (Hint: Mimic the proof of Theorem
5.2.17.) Also show the similar results for submartingales and martingales. (For the martingale case, it
holds without assuming H_n ≥ 0.)
Next, we introduce the notion of ‘upcrossings’.
Definition 5.2.22 (Upcrossing). Let (X_n)_{n≥0} be a process adapted to a filtration (F_n)_{n≥0}. Fix a, b ∈ R with
a < b. Let N_0 = −1 and for all k ≥ 1, define the random times

N_{2k−1} := inf{m > N_{2k−2} : X_m ≤ a},    (60)
N_{2k}   := inf{m > N_{2k−1} : X_m ≥ b}.

Namely, between times N_{2k−1} and N_{2k}, X crosses from below a to above b; this is called the kth up-
crossing of (X_n)_{n≥0} between levels a and b (see Figure 5.2.1). The total number of upcrossings completed
by time n is denoted

U_n := sup{k ≥ 0 : N_{2k} ≤ n}.    (61)
F IGURE 5.2.1. Illustration of the stopping times N_{2k−1} and N_{2k} and the upcrossings during
[N_{2k−1}, N_{2k}] for k ≥ 1. Solid lines depict the increments counted by H_n in (62).
Lemma 5.2.23 (Upcrossing inequality). Let X = (X_n)_{n≥0} be a submartingale w.r.t. a filtration (F_n)_{n≥0}.
Fix a < b and let U_n be as in (61). Then

(b − a) E[U_n] ≤ E[(X_n − a)^+] − E[(X_0 − a)^+].
P ROOF. By a similar argument as in Exercise 5.2.16, the random times N_{2k−1} and N_{2k} in (60)
are stopping times. Define a sequence H = (H_n)_{n≥1} by

H_m = 1 if N_{2k−1} < m ≤ N_{2k} for some k ≥ 1;   0 otherwise.    (62)

Note that

{N_{2k−1} < m ≤ N_{2k}} = {N_{2k−1} ≤ m − 1} ∩ {N_{2k} ≤ m − 1}^c ∈ F_{m−1}.

Hence

{H_m = 1} = ∪_{k≥1} {N_{2k−1} < m ≤ N_{2k}} ∈ F_{m−1},

so H is predictable w.r.t. (F_n)_{n≥0}.
Define Y_n := a + (X_n − a)^+ for n ≥ 0. By Proposition 5.2.13, (Y_n)_{n≥0} is a submartingale. Note that
Y_n = X_n if X_n ≥ a and Y_n = a if X_n < a. Hence Y has an upcrossing of [a, b] whenever X has. From this we
deduce

(b − a) U_n ≤ ∫_0^n H dY.    (63)

Indeed, every complete upcrossing contributes at least b − a to the RHS above. If there is an
incomplete upcrossing after the last complete one (see Fig. 5.2.1), then it gives a nonnegative
contribution to the RHS. (Note that the last incomplete upcrossing may give a negative contribution to
∫_0^n H dX, so (63) may be false with X in place of Y.)
Let K_m := 1 − H_m for m ≥ 1. Note that K = (K_n)_{n≥1} is predictable w.r.t. (F_n) (since H is). So by Exc.
5.2.21, ∫_0^n K dY is a submartingale. By Prop. 5.2.11,

E[ ∫_0^n K dY ] ≥ ∫_0^0 K dY = 0.

Noting that

Y_n − Y_0 = ∫_0^n 1 dY = ∫_0^n H dY + ∫_0^n K dY

and using (63), we can now deduce

E[Y_n − Y_0] ≥ E[ ∫_0^n H dY ] ≥ (b − a) E[U_n].

This finishes the proof. □
We can now prove one of the main results in martingale theory.
Theorem 5.2.24 (Martingale Convergence Theorem). Let X = (X n )n≥0 be a submartingale w.r.t. a filtra-
tion (Fn )n≥0 . If supn≥0 E[X n+ ] < ∞, then as n → ∞, X n converges almost surely to a limit X with E[|X |] < ∞.
P ROOF. Fix a < b. Let U_n denote the number of upcrossings of [a, b] up to time n, and let U := sup_{n≥0} U_n,
the number of upcrossings over the entire horizon [0, ∞). Then U_n ↗ U. Since (X_n − a)^+ ≤ X_n^+ + |a|, by
Lemma 5.2.23 and the hypothesis,

(b − a) E[U] = (b − a) sup_{n≥0} E[U_n] ≤ sup_{n≥0} E[X_n^+] + |a| < ∞,

where the first equality follows from the monotone convergence theorem (Thm. 1.3.19) and the fact that
U_n is increasing in n. It follows that E[U] < ∞, and hence U < ∞ almost surely. This yields

P( lim inf_{n→∞} X_n < a < b < lim sup_{n→∞} X_n ) = 0,    (64)

since the event in the probability above implies infinitely many upcrossings of [a, b]. Now
since (64) holds for all reals a < b, in particular it holds for all rationals a < b. By the union bound, we deduce

P( lim inf_{n→∞} X_n ≠ lim sup_{n→∞} X_n ) ≤ P( ∪_{a<b∈Q} { lim inf_{n→∞} X_n < a < b < lim sup_{n→∞} X_n } )
                                          ≤ Σ_{a<b∈Q} P( lim inf_{n→∞} X_n < a < b < lim sup_{n→∞} X_n ) = 0.

This shows that X_n converges to some limit X almost surely. (We can write X = lim inf_{n→∞} X_n.) It also follows
that both X_n^+ := X_n ∨ 0 and X_n^− := (−X_n) ∨ 0 converge almost surely.
Now it remains to show that E[|X|] < ∞. On the one hand, by Fatou's lemma (Thm. 1.3.18),

E[X^+] = E[lim inf_{n→∞} X_n^+] ≤ lim inf_{n→∞} E[X_n^+] ≤ sup_{n≥0} E[X_n^+] < ∞.

On the other hand, since X_n is a submartingale,

E[X_n^−] = E[X_n^+] − E[X_n] ≤ E[X_n^+] − E[X_0].

Hence again by Fatou's lemma,

E[X^−] = E[lim inf_{n→∞} X_n^−] ≤ lim inf_{n→∞} E[X_n^−] ≤ sup_{n≥0} E[X_n^+] − E[X_0] < ∞.

Thus we conclude

E[|X|] = E[X^+] + E[X^−] < ∞.
An important consequence of Theorem 5.2.24 is the following.
Corollary 5.2.25. If X_n ≥ 0 is a supermartingale, then as n → ∞, X_n → X almost surely for some RV X
with E[X] ≤ E[X_0].

P ROOF. Y_n := −X_n is a submartingale bounded above by 0, so sup_n E[Y_n^+] = 0 < ∞. Thus by Theorem
5.2.24, Y_n → Y almost surely for some integrable Y. Then X_n → −Y =: X almost surely and E[|X|] = E[|Y|] < ∞.
Lastly, by Fatou's lemma (using X_n ≥ 0) and the supermartingale property E[X_n] ≤ E[X_0],

E[X] = E[lim inf_{n→∞} X_n] ≤ lim inf_{n→∞} E[X_n] ≤ E[X_0].

This finishes the proof. □
Recall that almost sure convergence does not necessarily imply convergence in expectation (i.e., conver-
gence in L¹). The following example shows that this can fail even for martingales.

Example 5.2.26 (Martingale convergence need not hold in L¹). Let (S_n)_{n≥0} be a simple symmetric random
walk on Z with S_0 = 1. Let N = inf{m ≥ 0 : S_m = 0} be the first hitting time of the origin (i.e., stop gambling
when broke). Since S_n is a martingale w.r.t. the filtration F_n = σ(S_0, . . . , S_n) and N is a stopping time
w.r.t. the same filtration, by Theorem 5.2.17, X_n := S_{n∧N} is also a martingale. Since X_n ≥ 0, it is a
nonnegative supermartingale, so by Corollary 5.2.25, X_n → X a.s. as n → ∞. Note that X ≡ 0: for any k > 0,

P(X = k) ≤ P(X_n = k for all but finitely many n) ≤ P(X_n = X_{n+1} = k for some n) = 0,

since before time N the walk moves by ±1 at each step, and after time N it sits at 0. Thus P(X = 0) = 1.
But since X_n is a martingale, E[X_n] = E[X_0] = E[S_0] = 1 ≠ 0 = E[X]. ▲
Example 5.2.27. This example is taken from [Dur19, Ex. 4.2.14]. We will construct a martingale
(X_n)_{n≥0} on a large enough⁸ common probability space (Ω, F, P) recursively as follows. Let X_0 = 0 and
for k ≥ 1, define F_{k−1} := σ(X_1, . . . , X_{k−1}); on the event that X_{k−1} = 0, set

X_k = 1 with prob. 1/(2k);   −1 with prob. 1/(2k);   0 with prob. 1 − 1/k,

independently of F_{k−1}, and on the event that X_{k−1} ≠ 0, set

X_k = k X_{k−1} with prob. 1/k;   0 with prob. 1 − 1/k,

independently of F_{k−1}.
Often we define an RV by only specifying its distribution as above, without specifying where in the sample
space Ω it takes a specific value. We claim that X_n is a martingale w.r.t. the filtration F_n. This should be
intuitive, since conditional on knowing the value of X_{k−1}, the conditional expectation of X_k is 0 if X_{k−1} = 0
and X_{k−1} if X_{k−1} ≠ 0. Below we will give a detailed verification.
⁸This is where we need the probability space to be 'large enough'. If it weren't, we could expand it by taking the product
with another probability space so that it can accommodate events independent from F_{k−1}.
So by the Borel-Cantelli lemma (Lem. 5.3.3), X_n ≠ 0 infinitely often with probability 1. But notice that X_n is
an integer-valued process, so if it converged to 0 with positive probability, then we would have X_n ≡ 0 for all
sufficiently large n with positive probability. Thus P(lim_{n→∞} X_n = 0) = 0. ▲
5.3.1. Bounded increments. We first show that martingales with bounded increments either converge
or oscillate between +∞ and −∞.

Theorem 5.3.1 (Asymptotic behavior of martingales with bounded increments). Let (X_n)_{n≥0} be a mar-
tingale w.r.t. a filtration (F_n)_{n≥0}. Suppose that it has bounded increments: sup_{n≥1} |X_{n+1} − X_n| ≤ M almost
surely for some constant M > 0. Let

C := { lim_{n→∞} X_n exists and is finite },   D := { lim sup_{n→∞} X_n = +∞ and lim inf_{n→∞} X_n = −∞ }.

Then P(C ∪ D) = 1.

P ROOF. Since X_n − X_0 is a martingale, we may WLOG assume X_0 = 0. Fix K ∈ (0, ∞) and define N :=
inf{n ≥ 0 : X_n ≤ −K}, the first hitting time of (−∞, −K]. This is a stopping time, so by Theorem 5.2.17, X_{n∧N}
is a martingale. Note that X_{(n∧N)−1} > −K, so

X_{n∧N} = X_{(n∧N)−1} + ( X_{n∧N} − X_{(n∧N)−1} ) ≥ −K − M.

⁹To see the first identity in (68), note that on the event lim inf_{n→∞} X_n > −∞, we have X_n > −∞ for all n ≥ N_0 for some
(random) N_0. Since the X_n are integrable, X_n > −∞ for all n < N_0. Hence inf_n X_n > −∞ on this event.
P ROOF. The statement says that the submartingale X_n decomposes into a 'martingale part' M_n and a
predictable increasing part A_n. The formula (69) makes sense, since E[X_m − X_{m−1} | F_{m−1}] ≥ 0 is the one-step
conditional increment of the submartingale X_n, which must be subtracted off to produce a martingale. Further-
more, this conditional increment is F_{m−1}-measurable, so A_n ∈ F_{n−1}.
We first show that the formula (69) works. As noted, A_n defined in (69) is predictable and non-decreasing.
To show that M_n := X_n − A_n is a martingale (finite expectation and adaptedness are easy):

E[M_n | F_{n−1}] = E[X_n − A_n | F_{n−1}]
               = E[X_{n−1} + (X_n − X_{n−1}) − A_n | F_{n−1}]
               = X_{n−1} − A_n + E[X_n − X_{n−1} | F_{n−1}]
               = X_{n−1} − A_n + (A_n − A_{n−1})
               = X_{n−1} − A_{n−1} = M_{n−1}.

Next, suppose there exists such a Doob decomposition X_n = M_n + A_n. We show that it must satisfy
(69). Indeed, M_n and A_n should satisfy

E[M_n | F_{n−1}] = M_{n−1},    A_n ∈ F_{n−1},    A_n ↗.

It follows that

E[X_n | F_{n−1}] = E[M_n | F_{n−1}] + E[A_n | F_{n−1}] = M_{n−1} + A_n = X_{n−1} − A_{n−1} + A_n.

Since X_{n−1} ∈ F_{n−1}, this shows that A_n − A_{n−1} = E[X_n − X_{n−1} | F_{n−1}], which then yields (69). □
An important application of Theorems 5.3.1 and 5.3.2 is the following generalization of the second
Borel-Cantelli Lemma 3.4.15.
Lemma 5.3.3 (Second Borel-Cantelli Lemma, conditional ver.). Let (Fn )n≥0 be a filtration with F0 =
{;, Ω}. Let (B n )n≥0 be a sequence of events such that B n ∈ Fn . Then possibly except a set of probability zero,
½∞ ¾ ½∞ ¾
X X
{B n i.o.} = 1(B n ) = ∞ = P(B n | Fn−1 ) = ∞ .
n=1 n=1
Pn
P ROOF. Let X 0 = 0 and X n = m=1 1(B m ). Then X n is a submartingale. Applying Doob’s decomposi-
tion (Thm. 5.3.2), we can write X n = M n + A n where
X
n X
n
An = P(B m | Fm−1 ) and M n = 1(B m ) − P(B m | Fm−1 ).
m=1 m=1
Here M n is a martingale with incremenet bounded by 1. Using the notation in Theorem 5.3.1, we have
X
∞ X
∞
on C , 1(B n ) = ∞ ⇐⇒ P(B n | Fn−1 ) = ∞,
n=1 n=1
X∞ X
∞
on D, 1(B n ) = ∞ and P(B n | Fn−1 ) = ∞.
n=1 n=1
5.3.2. Polya’s Urn. An urn contains r red and g green balls. At each time we draw a ball out, then replace
it, and add c more balls of the color drawn. Let X n be the fraction of green balls after the nth draw. Will X n
converge to some random fraction? This may not be obvious, but we can see it by using the martingale
convergence theorem.
5.3. APPLICATIONS OF MARTINGALE CONVERGENCE 138
Let Fn denote the σ-algebra generated by the first n draws. Then X n is a martingale w.r.t. Fn . To see
this, assuming that there are i red balls and j green balls at time n, then
( j +c j
with prob. i + j
X n+1 = i + jj +c i
i + j +c with prob. i + j .
So we have
j +c j j i j (i + j + c) j
E[X n+1 | Fn ] = · + · = = = Xn .
i + j + c i + j i + j + c i + j (i + j + c)(i + j ) i + j
Since X n is a nonnegative martingale, by Corollary 5.2.25, X n converges to some limiting RV X ∞ , which
is the limiting fraction of green balls. We also know that E[X ∞ ] ≤ E[X 0 ]. But the actual value of X ∞ is
random. What is the distribution of X ∞ ?
We make the following observation. Let ξk denote the indicator that we get a green ball at the kth
draw. Then
¡ ¢ g g +c g + (m − 1)c r r + (ℓ − 1)c
P (ξ1 , . . . , ξn ) = (1, . . . , 1, 0, . . . , 0) = · ··· · ··· .
| {z } | {z } r +g r +g +c r + g + (m − 1)c r + g + mc r + g + (n − 1)c
m 1’s ℓ 0’s
In fact, the above probability is unchanged for any binary sequence of length n with the same number of
1’s, since the denominators in the RHS are unchanged and the numerators are only permuted.
Now for concrete examples, suppose r = g = c = 1. By the above computation, for any m = 0, . . . , n,
à ! à !
Xn n m! ℓ! 1
P ξk = m = = .
k=1 m (n + 1)! n + 1
This shows that X n is uniform among all possible values. Since X n → X ∞ a.s. implies convergence in
probability, for each x ∈ [0, 1],
P(X ∞ ≤ x) = lim P(X n ≤ x) = x.
n→∞
This shows that X ∞ ∼ Uniform(0, 1).
Next, suppose r = c = 1 and g = 2. Then
à ! à !
Xn n (2 · 3 · · · m + 1) ℓ! 2(m + 1)
P ξk = m = = → 2x
k=1 m 3 · 4 · · · (n + 2) n +2
provided n → ∞ and m/n → x.
In general, X ∞ is a continuous RV on [0, 1] with density
Γ((g + r )/c) (g /c)−1
x (1 − x)r /c − 1,
Γ(g /c)Γ(r /c)
which is the Beta distribution with parameters g /c and r /c (see Exc. 1.6.19).
Below in Figure 5.3.1 (generated by using Python), we provide some simulation results of Polya’s urn
for various parameter choices.
Exercise 5.3.4. Use your favorate programming language (e.g., python, R, matlab, C++) and reproduce
plots similar to the ones in Figure 5.3.1.
5.3.3. Branching processes. The martingale theory makes an important application in analyzing branch-
ing processes, a stochastic model for population growth.
Imagine a specie where every individual gives birth to a random number of offsprings for the next
generation and dies at the end of the current generation. Suppose for simplicity that the number of
offsprings that the i th individual at generation n, denoted as ξni , are i.i.d. from some fixed ‘offspring
distribution’ p = (p 0 , p 1 , . . . , ), i.e., P(ξni = k) = p k for k ≥ 0. Suppose there are Zn individuals at generation
n. Then Zn+1 , the population at the next generation, should be given by the sum of all offsprings of every
individuals at generation n. This is the idea behind the branching process in the following definition.
5.3. APPLICATIONS OF MARTINGALE CONVERGENCE 139
F IGURE 5.3.1. 10 (top row) and 100 (bottom row) sample trajectories of Polya’s urn process.
Definition 5.3.5 (Branching processes). Let (ξni )i ,n≥1 be a sequence of doubly indexed i.i.d. nonnegative
integer-valued random variables with distribution p = (p 1 , p 2 , . . . ). Define a sequence Zn , n ≥ 0, by Z0 = 1
and
(
ξn+1 + · · · + ξn+1 if Zn ≥ 1
Zn+1 := 1 Zn
0 if Zn = 0.
Then (Zn )n≥0 is called the branching process (or a Galton-Watson process) with offspring distribution p.
P
We call µ := E[ξni ] = ∞
k=1
kp k the mean offspring number.
Every individual on average gives µ number of offsprings. It would be reasonable to guess that the
following behavior:
(Subcritical phase): Zn → 0 almost surely and exponentially fast if µ < 1;
(Critical phase): Zn is ‘unbiased’ but Zn → 0 almost surely if µ = 1;10
(Supercritical phase): Zn ∼ µn almost surely if µ > 1.
We will use martingale theory to justify the above speculations. The key connection between the
branching processes and martingales is the following.
P ROOF. Finite expectation and adaptedness are clear. For martingale increments, since Zn ∈ Fn , we
have
E[Zn+1 | Fn ] = E[ξn+1
1 + · · · + ξn+1
Zn | Fn ] = µZn . (70)
10Exclusing the trivial case when Z ≡ 1 a.s., which occurs when ξn ≡ 1 a.s. for all i , n.
n i
5.3. APPLICATIONS OF MARTINGALE CONVERGENCE 140
The desired result follows from multiplying both sides above with 1/µn+1 .11 □
Proposition 5.3.7. Zn /µn converges almost surely to some limiting RV.
Note that since Zn /µn is a martingale by Prop. 5.3.7, E[Zn /µn ] = E[Z0 ] = 1. Therefore
E[Zn ] = µn . (71)
This confirms the conjectured phase transition behavior of the branching process in expectation.
Next, we identify the limit in Prop. 5.3.7. We start with the subcritical phase µ < 1, in which case the
limit should be zero.
Proposition 5.3.8 (Extinction of subcritical branching process). If µ < 1, then Zn ≡ 0 for all n sufficiently
large, so Zn /µn → 0 almost surely.
Hence if µ ∈ (0, 1), then P(Zn > 0) → 0 exponentially fast. By Borel-Canteli lemma (Lem. 3.4.6), Zn > 0
only for finitely many ns almost surely. Thus Zn = 0 for all sufficiently large n almost surely. □
Next we consider the critical case µ = 1. Clearly if every individual gives birth to exactly one offspring,
then Zn ≡ 1 almost surely. If we exclude this trivial case, then any small amount of randomness is enough
to get the entrie population extinct.12
11If you are not completely satisfied with the argument for (70) since Z is random (albeit being in F ), we can proceed
n n
F∞
by showing the claimed identity with partitioning the sample space. That is, write Ω = k=0
{Zn = k}. Then for each k, {Zn =
k} ∈ Fn and Zn+1 = ξn+1
1 + · · · + ξn+1
k
on {Zn = k}. Hence by Exc. 5.1.6,
E[Zn+1 | Fn ] = E[ξn+1
1 + · · · + ξkn+1 | Fn ] = µk = µZn .
Lastly, we consider the supercritical branching process (µ > 1). In this case, we will show that the
population survives forever with a positive probability, meaning that Zn > 0 for all n with a positive prob-
ability. Furthermore, we can exactly compute this survival probability. using the generating function of
the offspring distribution.
Define the generating function of the offspring distribution p = (p 0 , p 1 , . . . ) as
X
∞
φ(s) := E[s Z1 ] = pk sk .
k=0
Observe that
X
∞ X
∞
φ0 (s) = kp k s k−1 > 0, φ00 (s) = k(k − 1)p k s k−2 > 0.
k=1 k=2
Hence φ has the following properties:
1. φ is strictly increasing on [0, 1);
2. φ00 > 0 on (0, 1), and hence φ0 is strictly increasing and φ is strictly convex on (0, 1);
3. φ(1) = 1.
4. φ(0) = p 0 and φ0 (0) = p 1 .
We make some observations on the generating functions.
Lemma 5.3.10. Define φn (t ) := E[t Zn ]. Then φn is the n-fold composition of φ by itself, that is,
φ0 (t ) = t ,
φn+1 (t ) = φ(φn (t )) = φn (φ(t )).
P ROOF. We present two argumetns depending on conditioning on F1 or Fn . First, note that Zn+1 is
the sum of Z1 independent copies of Zn , due to the recursive structure. Hence, denoting Zni for i ≥ 1 to
be i.i.d. copies of Zn ,
£ ¤
φn+1 (t ) = E t Zn+1
£ £ ¤¤
= E E t Zn+1 | F1
h h 1 Z1 ii
= E E t Zn +···+Zn | F1
£ ¤
= E φn (t ) Z1
= φ(φn (t )).
For the second argument, note that
£ ¤
φn+1 (t ) = E t Zn+1
£ £ ¤¤
= E E t Zn+1 | Fn
h h n+1 ii
= E E t ξ1 +···+ξZn | Fn
n+1
£ ¤
= E φ(t ) Zn
= φn (φ(t )).
□
A key notion for describing the behavior of a branching process is its extinction time, which is the
first time that the population reaches zero:
τ := inf{n ≥ 0 | Zn = 0}. (72)
With the convention that inf ; = ∞, we have τ = ∞ if Zn never goes extinct. What is the probability of
extinction, P(τ < ∞)? The following lemma is a key result that relates the extinction probability and the
fixed point of the generating function.
5.3. APPLICATIONS OF MARTINGALE CONVERGENCE 142
Lemma 5.3.11 (Extinction probability). The extinction probability ζ := P(τ < ∞) is the smallest nonnega-
tive root of the fixed point equation φ(t ) = t .
P ROOF. We present two approaches. First we appeal to the recursive nature of the branching process.
Consider what has to happen for the event τ < ∞. There are Z1 first-generation nodes, and descending
from them there are independent copies of the entire branching process. Therefore by partioning on the
values of Z1 ,
X∞
ζ = p0 + p k ζk = φ(ζ).
k=1
0
Second, observe that (since 0 = 1)
P(Zn = 0) = φn (0).
Note that Zn+1 = 0 if Zn = 0, so P(Zn = 0) is non-decreasing in n. It follows that, by continuity of proba-
bility measure (Thm. 1.1.16), ζ = limn→∞ P(Zn = 0) = limn→∞ φn (0). It follows that by Lem. 5.3.10,
ζ = lim φ(φn (0)) = φ( lim φn−1 (0)) = φ(ζ).
n→∞ n→∞
Lastly, we conclude that ζ is the smallest nonnegative root of φ(t ) = t . This follows from the mono-
tonicity of φ. Indeed, suppose ζ0 is another nonnegative root of the fixed point equation. Then since φ is
strictly increasing and by Lem. 5.3.10, φn = φ ◦ · · · ◦ φ is also strictly increasing. Since 0 ≤ ζ0 ,
φn (0) ≤ φn (ζ0 ).
Letting n → ∞, we get ζ ≤ ζ0 . So we must have ζ = ζ0 . □
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
F IGURE 5.3.2. Phase transition in the fixed point of the generating function φ of the offspring
distribution ξ ∼ p = Binomial(3, p), where the mean offspring number µ = 3p. When µ ≤ 1, ζ = 1
is the only fixed point of φ. When p > 1/3 so that µ > 1, there appears a new fixed point ζ < 1.
The following is a key observation for the generating function φ of the offpsring distribution p.
Theorem 5.3.12 (Phase transition in extinction probability). Suppose p 1 < 1 and let ζ = P(τ < ∞) denote
the exinction probability. Then the followings hold:
(i) ( subcritical and critical regime) If µ ≤ 1, then ζ = 1.
(ii) ( supercritical regime) If µ > 1, then ζ < 1.
P ROOF. By Lemma 5.3.11, we know that ζ is the smallest nonnegative fixed point of the generationg
function φ of the offspring distribution p.
Suppose µ ≤ 1. Recall that φ is strictly convex with strictly increasing first derivative and positive
second derivative. Note that limt %1 φ0 (t ) = µ ≤ 1, so φ0 (t ) < 1 for all t ∈ (0, 1). Then by mean value
5.3. APPLICATIONS OF MARTINGALE CONVERGENCE 143
theorem, there cannot be any fixed point of φ in [0, 1), for if ζ = φ(ζ) for some ζ ∈ [0, 1), then since φ(1) = 1,
there must be some ζ0 ∈ (ζ, 1) such that φ0 (ζ0 ) = 1, but this contradicts the fact that φ0 < 1 on (0, 1). Since
φ(1) = 1, the smallest nonnegative fixed point of φ is 1. Therefore ζ = 1.
Next, suppose µ > 1. Recall that φ(0) = p 0 . Hence if p 0 , then φ(0) = 0 and φ(1) = 1. Since φ is
strictly convex, there is no other fixed point in (0, 1). Now suppose p 0 > 0 so that φ(0) = p 0 > 0. Since
φ0 (1) = µ > 1, when t % 1 we should have φ0 (t ) < t by Taylor expansion. Hence there exists t ∗ ∈ (0, 1)
such that φ(t ∗ ) < t ∗ . Since φ(0) > 0, by the intermediate value theorem, there exists a fixed point ζ in
(0, t ∗ ). It follows that ζ < 1. □
Theorem 5.3.12 only states that ζ < 1 in the supercritical case but does not provide what that value is.
Lemma 5.3.11 tells us that we only need to solve the fixed point equation φ(t ) = t to find it, but solving
this equation exactly might be difficult. In Exercise 5.3.13 below, we provide a simple iterative algorithm
that converges to the extinction probability ζ for the supercritcal case µ > 1 at an exponential rate.
Exercise 5.3.13 (Approximating the extinction probability by fixed point iteration). Suppose µ > 1. In
order to compute the extinction probability ζ = P(τ < ∞) ∈ [0, 1), consider the following ‘fixed point
iterates’: θ0 = 0 and
θn := φ(θn−1 ),
Our last point of investigation for the branching processes is how does the tail of the extinction time
τ behaves. For instance, Theorem 5.3.12 shows that for µ ≤ 1, the extinction time τ is almost surely finite.
But how likely is it to exceed a certain value? That is, how does the probability of the branching process
surviving up to time n behave? We can answer this question again by using generating fuctions. The key
relation is
P(τ > n) = P(Zn > 0) = 1 − P(Zn = 0) = 1 − φn (0).
Thus the speed at which P(τ > n) decays is determined by the speed at which φn (0) converges to one.
Exercise 5.3.14 (Tail of the extinction time in the subcritical regime). Suppose µ < 1 and E[Z12 ] < ∞.
(i) Use a similar approach illustrated in Exc. 5.3.13 to show that 1 − φn (0) ≤ µn . Conclude that
P(τ > n) ≤ µn .
(ii)* By Taylor’s theorem, show that
¡ ¢
1 − φn+1 (0) = φ(1) − φ(φn (0)) = µ(1 − φn (0)) + O (1 − φn (0))2 .
Using convexity of φ, deduce that, for some constant C > 0,
¡ ¢
µ(1 − φn (0)) −C (1 − φn (0))2 ≤ 1 − φn+1 (0) ≤ µ(1 − φn (0)).
5.4. MARTINGALE CONCENTRATION INEQUALITIES 144
Exercise 5.3.15 (Tail of the extinction time in the critical regime). Suppose µ = 1, p 1 6= 1, and E[Z12 ] < ∞.
(i)* Show that there exists a constant C > 0 such that
P(τ > n) ∼ C /n.
(Hint: Use the Taylor expansion of φ near ζ = 1:
φ00 (1)
1 − φn+1 (0) ≈ 1 − φn (0) − (1 − φn (0))2 .
2
Letting x n := 1−φn (0), they satisfy the recursion x n+1 = x n −bx n2 . Then making a further change
of variable y n = x n−1 , deduce
y n2 b b
y n+1 = = yn + = yn + .
yn − b 1 − bx n 1 − o(1)
Hence y n is asymptotically an arithmetic sequence.)
(ii) Use (i) to deduce that the extinction time has the following scaling property: For each x > 1,
P(τ > xn | τ > n) = C /x + o(1).
Theorem 5.4.1 (Azuma-Hoeffding inequality). Let (X k )0≤k≤n be a martingale w.r.t. a filtration (Fk )0≤k≤n
with bounded increments: |X k − X k−1 | ≤ σk for some constant σk ∈ (0, ∞) for k = 1, . . . , n. Then
à !
t2
P (X n − X 0 ≥ t ) ≤ exp − Pn . (73)
2 k=1 σ2k
P
If σk = O(1), then the sum nk=1 σ2k grows linearly in n. This growing denominator in the exponential
p p
function above is killed only when t À n; if t = o( n), then the concentration bound (73) is not useful.
p
Thus (73) implies that the martingale X n cannot exceed n by much with high probability. For instance,
p
P(X n ≥ n) = exp(−O(n)) is exponentially small. If X n had independent increments, this O( n) scaling is
expected by the central limit theorem (Thm. 4.4.5). We will use a ‘variational Jensen’s inequality’ in Exc.
5.4.2 in the proof of Theorem 5.4.1.
5.4. MARTINGALE CONCENTRATION INEQUALITIES 145
P ROOF OF T HEOREM 5.4.1. The proof follows the usual recipe for proving Hoeffding’s inequality but
we use conditional expectation everywhere. Without loss of generality, assume X 0 = 0.
Write ξn = X n − X n−1 for the increments. Since |ξn | ≤ σn a.s., ξn is a bounded RV so it has finite
exponential moments everywhere: E[exp(θσn )] < ∞ for all θ ∈ R. Note that, for any parameter θ > 0, by
exponentiating and taking Markov’s inequality,
P(X n ≥ t ) ≤ P(exp(θX n ) ≥ exp(θt )) (74)
≤ exp(−θt )E[exp(θX n )].
Next we use iterated expectation to write
E[exp(θX n )] = E[E[exp(θX n ) | Fn−1 ]]
= E[E[exp(θX n−1 + θξn ) | Fn−1 ]]
= E[exp(θX n−1 ) E[exp(θξn ) | Fn−1 ]]
since exp(θX n−1 ) ∈ Fn−1 . Now, ξn conditional on Fn−1 is a bounded random variable taking values from
[−σk , σk ]. By Exercise 5.4.2 (this is where we use the martingale condition E[ξn | Fn−1 ] = 0), we have
E[exp(θξn ) | Fn−1 ]] ≤ exp(θ 2 σ2n /2).
Since exp(θX n−1 ) ≥ 0, it follows that
E[exp(θX n )] ≤ E[exp(θX n−1 )] exp(θ 2 σ2n /2).
Proceeding by an induction, we deduce
à !
θ2 X
n
E[exp(θX n )] ≤ exp σ2 .
2 k=1 k
Combining with (74), we get
à !
θ2 X
n
2
P(X n ≥ t ) ≤ exp −θt + σ .
2 k=1 k
The above bound holds for all θ > 0 so we can optimize the above bound in θ. Note that the exponent in
the RHS above is a convex quadratic function in θ, which is mimized at θ = Pn t σ2 with minimum value
P k=1 k
−t 2 /2 nk=1 σ2k . This shows the assertion. □
Exercise 5.4.2 (A variational Jensen’s inequality). Let X be a mean zero RV taking values from an interval
[−A, B ]. Fix a convex function φ : R → R. We will show that
B A
E[φ(X )] ≤ φ(−A) + φ(B ) . (75)
A +B A +B
In words, over all possible distributions of X over [−A, B ], the most extreme distribution that maximizes
E[φ(X )] is the one that puts point mass on −A and B as in the right-hand side.
(i) Let Y be a RV taking values from [0, 1] and mean p ∈ [0, 1]. Suppose that for any convex function
ψ : R → R, we have
E[ψ(Y )] ≤ (1 − p)ψ(0) + pψ(1). (76)
Then deduce (75) from this. (Hint: Rescale X and make appropriate change to φ.)
(ii) Here we will deduce (76). Let Y be as before. Let U ∼ Uniform(0, 1) independent from Y . Argue that
1(U ≤ Y ) | Y ∼ Bernoulli(Y ) and 1(U ≤ Y ) ∼ Bernoulli(p).
(You may use Ex. 5.1.14 for the first part.) Then use Jensen’s inequality to deduce
(1 − p)φ(0) + pφ(1) = E[φ(1(U ≤ Y ))] ≥ E[φ(Y )].
5.4. MARTINGALE CONCENTRATION INEQUALITIES 146
(iii) (Hoeffding’s lemma) Let φ(x) = e θx for a fixed θ > 0 and assume A = B > 0. Deduce that
E[exp(−θ A)] + E[exp(θ A)]
E[exp(θX )] ≤ ≤ exp(θ 2 A 2 /2).
2
A useful consequence of Azuma-Hoeffding’s inequality is the following McDiarmid’s inequality, which
is also known as the ‘bounded difference inequality’.
Theorem 5.4.3 (McDiarmid’s inequality). Let X 1 , . . . , X n be indepdendent RVs. Let g : Rn → R be a “Lips-
chitz” function in the following sense: If x, x0 ∈ Rn differ only in the kth coordinate, then
| f (x) − f (x0 )| ≤ σk (77)
for some constant σk ∈ (0, ∞). Then for any t > 0,
à !
¡ ¢ t2
P | f (X 1 , . . . , X n ) − E[ f (X 1 , . . . , X n )]| ≥ t ≤ 2 exp − Pn .
2 k=1 σ2k
P ROOF. Define a filtration (Fk )0≤k≤n by F0 = {;, Ω} and Fk := σ(X 1 , . . . , X k ) for k = 1, . . . , n. For k =
1, . . . , n, denote
Yk := E[ f (X 1 , . . . , X n ) | Fk ].
Pn
Then |Yk | ≤ k=1 σk by using the Lipschitz property of f , so E[|Yk |] < ∞. Clearly Yk ∈ Fk . Also by iterated
expectation, E[Yk+1 | Fk ] = Yk , so (Yk )1≤k≤n is a martingale w.r.t. (Fk )1≤k≤n 13. Furthermore, by (77),
|Yk − Yk−1 | ≤ σk . To see this, let X k0 be an independent copy of X k , so X k0 ⊥ Fk . Then (using Ex. 5.1.15)
E[ f (X 1 , . . . , X k−1 , X k0 , X k+1 , . . . , X n ) | Fk ] = E[ f (X 1 , . . . , X k−1 , X k0 , X k+1 , . . . , X n ) | Fk−1 ] = Yk−1 ,
so by (77) and a triangle inequality,
¯ ¯
|Yk − Yk−1 | = ¯E[ f (X 1 , . . . , X k−1 , X k , X k+1 , . . . , X n ) − f (X 1 , . . . , X k−1 , X 0 , X k+1 , . . . , X n ) | Fk ]¯
k
≤ σk .
Now the statement follows from Azuma-Hoeffding (Thm. 5.4.1) and noting that Y0 = E[ f (X 1 , . . . , X n )]. □
Below we will see some interesting applications of the martingale concentration inequalities in the
context of Erdős-Rényi random graphs.
Definition 5.4.4 (Erdős-Renyí random graphs). Construct a random graph with n nodes in the following
manner. For each pair of nodes (i , j ), include an edge i j independently with probability p and leave it
as a non-adjacent pair with probability 1 − p. The resulting random graph is denoted as G(n, p) and is
called a Erdős-Renyí random graph.
Exercise 5.4.5 (Number of triangles in G(n, p)). Let T = T (n, p) denote the total number of triangles in
G(n, p).
(i) For each three distinct nodes i , j , k in G, let Yi j k := 1(i j , j k, ki ∈ E ), which is the indicator variable
for the event that there is a triangle with node set {i , j , k}. Show that
Yi j k ∼ Bernoulli(p 3 ).
(ii) Show that we can write
X
T= 1(i j , j k, ki ∈ E ). (78)
1≤i < j <k≤n
(Hint: First compute E[T 2 ] and use the fact that Var(T ) = E[T 2 ] − E[T ]2 . For computing E[T 2 ],
use (78) and consider possible cases according to the number of overlapping edges.) Thus
Std(T (n, p)) = Θ(n 2 ). If CLT holds for T (n, p), then T (n, p) should fluctuate around its mean
by Θ(n 2 ). Can we conclude this by CLT?
(iv) Show that for each t ≥ 0,
ï à ! ¯ ! µ ¶
¯ n 3 ¯¯ t2
¯
P ¯T (n, p) − p ¯ ≥ t ≤ 2 exp − .
¯ 3 ¯ n(n − 1)(n − 2)2
Deduce that the above probability is o(1) if t À n 2 . Specifically, for any ε > 0,
ï à ! ¯ !
¯ n 3 ¯¯ ¡ ¢
¯ 2+ε
P ¯T (n, p) − p ¯≥n ≤ 2 exp −n 2ε .
¯ 3 ¯
Thus, McDiarmid’s inequality almost confirms the upper tail of fluctuation of T (n, p) predicted
by CLT. (Hint: Let X 1 , . . . , X (n ) denote the indicator of there being an edge for the kth pair of
2
distinct nodes. Let f (X 1 , . . . , X (n ) ) denote the number of triangles using the edges indicated by
2
X k s. Consider the “edge exposure filtration” (Fn )0≤n≤(n ) , where we reveal the connectedness
2
of every pair of distinct nodes (i , j ) sequentially. Argue that there at most n − 2 triangles that
contains a given edge. Then use Theorem 5.4.3.)
Exercise 5.4.6 (A moment bound for sub-exponential RVs). Suppose X is a sub-exponential RV, i.e., there
exists a constant c > 0 such that for all x ≥ 0,
Theorem 5.5.3
© (Doob’s inequality).
ª Let (X n )n≥0 be a submartingale w.r.t. a filtration (Fn )n≥0 . Fix λ > 0
+
and let A := max0≤m≤n X m > λ . Then
λ P(A) ≤ E[X n 1 A ] ≤ E[X n+ ].
In particular,
µ ¶
+
P max Xm > λ ≤ λ−1 E[X n+ ].
0≤m≤n
Thus we recover Kolmogorov’s maximal inequality (see Exc. 3.7.1) from Doob’s inequality. ▲
Theorem 5.5.5 (L p maximum inequality). Let (X n )n≥0 be a submartingale w.r.t. a filtration (Fn )n≥0 .
+
Denote X n := max0≤m≤n X m . Then for 1 < p < ∞,
µ ¶
p p p
E[X n ] ≤ E[(X n+ )p ].
p −1
P ROOF. Fix a constant M ∈ (0, ∞). We will consider the truncated variable X n ∧ M instead of X n to
ensure integrability. By Thm. 5.5.3, we have
P(X n > x) ≤ x −1 E[X n+ 1(X n > x)].
Note that {X n ∧ M > x} = {X n > x} if x < M and is ; if x ≥ M . Thus the above yields
P(X n ∧ M > x) ≤ x −1 E[X n+ 1(X n ∧ M > x)].
Then by Prop. 1.5.10 and Fubini’s theorem, we get
Z∞
E[(X n ∧ M )p ] = px p−1 P((X n ∧ M ) > x) d x
0
Z∞
≤ px p−2 E[X n+ 1(X n ∧ M > x)] d x
0
· Z∞ ¸
+ p−2
= E Xn px 1(X n ∧ M > x) d x
0
" Z X n ∧M #
p
= E X n+ (p − 1)x p−2 d x
p −1 0
p h i
= E X n+ (X n ∧ M )p−1 .
p −1
5.5. DOOB’S INEQUALITY AND CONVERGENCE IN L p FOR p > 1 151
Now let q := p/(p − 1) denote the conjugate of p. Applying Hölder’s inequality (Prop. 1.3.13), we get
h i
E X n+ (X n ∧ M )p−1 ≤ E[|X n+ |p ]1/p E[(X n ∧ M )p ]1/q .
p−1
Since 1 − (1/q) = 1 − p = 1/p, dividing both sides by E[(X n ∧ M )p ]1/q 14, it follows that
µ ¶
p p
E[(X n ∧ M )p ] = E[(X n ∧ M )p(1−(1/q)) ] ≤ E[|X n+ |p ].
p −1
Lastly, we let M % ∞ and use monotone convergence theorem (Thm. 1.3.19) to conclude. □
Theorem 5.5.6 (L p convergence theorem). If X n is a martingale with supn≥1 E[|X n |p ] < ∞ for some p > 1,
then X n → X a.s. and in L p as n → ∞.
P ROOF. Almost sure convergence follows directly from the martingale convergence theorem (Thm.
5.2.24) since (E[X n+ ])p ≤ E[|X n |p ] ≤ E[|X n |]p . For the L p convergence, we will use the fact that
µ ¶p
|X n − X |p ≤ 2 sup |X n | ,
n≥0
which follows from |X n | ≤ supn≥0 |X n | and |X | = limn→∞ |X n | ≤ supn≥0 |X n |. Thus if supn≥0 |X n | is in L p ,
then by dominated convergence theorem (Thm. 1.3.20), E[|X n − X |p ] → 0 as desired. Indeed, |X n | is a
submartingale (Prop. 5.2.12) so by Theorem 5.5.5,
·µ ¶p ¸ µ ¶
p
E sup |X n | ≤ E[|X n |]p < ∞.
0≤n≤m p −1
Taking m → ∞ and using MCT (Thm. 1.3.19), we deduce that supn≥0 |X n | is in L p , as desired. □
Exercise 5.5.7 (Orthogonality of martingale increments). Let X n be a martingale with E[X n2 ] < ∞ for all
n. Show that
(i) If m ≤ n and Y ∈ F m , then E[(X n − X m )Y ] = 0.
(ii) If ℓ ≤ m ≤ n, then E[(X n − X m )(X m − X ℓ )] = 0.
(iii) (Conditional variance formula) If m ≤ n, then
E[(X n − X m )2 | Fm ] = E[X n2 | Fm ] − X m
2
.
A nice application of the L p martingale convergence theorem is the exponential growth of supercrit-
ical BP conditioned to survive.
Exercise 5.5.8 (Supercritical branching process on survival). Consider supercritical branching process
(Zn )n≥0 (recall the notations in Sec. 5.3.3) with mean offspring number µ = E[ξin ] > 1 and suppose
Var(ξin ) = σ2 < ∞. By Theorem 5.3.12, we know that the extinction probability ζ = P(τ < ∞) is nonzero,
where τ denotes the extinction time (see (72)). By Prop. 5.3.7, we also know that X n := Zn /µn converges
a.s. to some limiting RV X . It is reasonable that on the survival event τ = ∞, X should be positive so the
population grows asymptotically exponentially as X µn . The goal of this exercise is to justify this:
n o
X := lim Zn /µn > 0 = {τ = ∞} . (80)
n→∞
(i) Show that
E[X n2 | Fn−1 ] = X n−1
2
+ µ−2n E[(Zn − µZn−1 )2 | Fn−1 ].
(ii) Show that
E[(Zn − µZn−1 )2 | Fn−1 ] = σ2 Zn−1 .
(Hint: Show that the identity holds on the events {Zn−1 = k} for all k ≥ 0 (partitioning the sample
space).)
14We can do so since E[(X ∧ M )p ] < ∞. Without the truncation, this is not necessarily true.
n
5.6. UNIFORM INTEGRABILITY AND CONVERGENCE IN L 1 152
(iv) Show that X n → X in L and E[X n ] → E[X ] for some RV X . (Hint: Use L p martingale convergence
2
where φ(s) = E[s Z1 ] is the generating function of the offspring distribution. Deduce that θ = ζ =
P(τ < ∞) and conclude (80). (Hint: “⊆” in (80) is trivial, so it is enough to show θ = ζ. For this,
use a first-step analysis using the recursive property of BP. Then use Lemma 5.3.11.)
We can certainly do whenever we can apply any of the convergence theorems (i.e., MCT, BCT, DCT). But
there are some counterexamples as well. One such example is given in Ex. 1.3.17. Below we give another
one that will motivate a key definition in this section.
Example 5.6.1 (Mass escaping to R infinity). Consider (R, B, µ), where µ =Lebesgue measure. Let f n :=
1[n,n+1] . Then f n ≤ 1 for n ≥ 1 and f n d µ = 1 for all n ≥ 1. Also, f n → f = 0 almost everywhere, since for
each x ∈ R, f n (x) = 1(n ≤ x ≤ n + 1) ≡ 0 for all n > x. Thus
Z Z
0 = f d µ 6= lim f n d µ = 1.
n→∞
A similar but more probabilistic example is the following. Let X 1 , X 2 , · · · be independent RVs with
(
(n + 1)2 with prob. 1/(n + 1)2
Xn =
0 with prob. 1 − 1/(n + 1)2 .
P P
Then n≥1 P(X n 6= 0) = n≥1 (n + 1)−2 < ∞, so X n is nonzero only for finitely many n’s almost surely by
Borel-Cantelli. Hence X n → X = 0 almost surely. Now
0 = E[X ] 6= lim E[X n ] = 1.
n→∞
In both examples, some amount of mass is escaping to the infinity and the pointwise limit does not hold
it in the limit. ▲
In this section, we will give necessary and sufficient conditions for a martingale to converge in L 1 .
The key to this is the following definition.
Note that uniform integrability implies a uniform bound on the expectations. Indeed, if we choose
M large enough so that the above supremum is at most 1, then writing E[|X i |] ≤ E[|X i |1(|X i | ≥ M )] + M ,
we have
sup E[|X i |] ≤ M + 1.
i ∈I
Since Y 1(Y < M ) % Y a.s. M → ∞, MCT (Thm. 1.3.19) yields E[Y 1(Y < M )] → E[Y ] < ∞. Consequently,
E[Y 1(Y ≥ M )] = E[Y ]− E[Y 1(Y < M )] → 0 as M → ∞. This shows the UI of (X i )i ∈I . As a consequence, any
uniformly bounded collection of RVs is UI. ▲
Another important collection of UI RVs is provided by the following result.
Proposition 5.6.4 (Collection of conditional expectations). Given a probability space (Ω, F0 , F ) and a RV
X ∈ L 1 , let I denote the set of all sub-σ algebras of F0 . Then we will show that (E[X | F ])F ∈I is uniformly
integrable.
But this is impossible since then |X |1(A n ) → 0 a.s. by Borel-Cantelli and |X |1(A n ) ≤ |X | with X ∈ L 1 , so
E[|X |1(A n )] → 0 as n → ∞ by DCT.
We now go back to (82). Fix ε > 0. Then by the claim above, there exists δ > 0 such that whenever
P(A) ≤ δ, E[|X |1(A)] ≤ ε. Choose M large enough so that E[|X |]/M ≤ δ. Then by Markov’s and Jensen’s
inequalities,
P(|Y | ≥ M ) ≤ M −1 E[|E[X | F ]|] ≤ M −1 E[E[|X | | F ]] = M −1 E[|X |] ≤ δ,
regardless of the choice of F ∈ I . Thus by the choice of δ, it follows that
sup E[|X |1(|Y | ≥ M )] ≤ ε.
F ∈I
Proposition 5.6.5. Let φ : R → [0, ∞) be any function such that φ(x) À x. (e.g., φ(x) = x p for p > 1 or
φ(x) = x(0 ∨ log x)). Then a collection (X i )i ∈I of RVs is UI if supi ∈I E[φ(|X i |)] ≤ C for some constant C > 0.
5.6. UNIFORM INTEGRABILITY AND CONVERGENCE IN L 1 154
P ROOF. For each M > 0, denote εM := sup{x/φ(x) : x ≥ M }. Note that for each x ≥ 0,
x
x1(x ≥ M ) = φ(x)1(x ≥ M ) ≤ εM φ(x)1(x ≥ M ).
φ(x)
It follows that
E[|X i |1(|X i | ≥ M )] ≤ εM E[φ(|X i |)1(|X i | ≥ M )] ≤ C εM .
By the hypothesis, εM → 0 as M → ∞. This verifies the UI. □
Uniform integrability is in fact necessary and sufficient for convergence in L 1 , as the following result
states.
Theorem 5.6.6 (UI ⇐⇒ L 1 convergence). Suppose that E[|X n |] < ∞ for all n ≥ 1. If X n → X in probability
then the following are equivalent:
(i) {X n : n ≥ 0} is UI;
(ii) X n → X in L 1 ;
(iii) E[|X n |] → E[|X |] < ∞.
P ROOF. “(i) ⇒(ii):” We wish to show that E [|X n − X |] = o(1). To this effect, define the ‘clipping func-
tion’
M if x > M
φM (x) := x if |x| ≤ M
−M if x < −M .
By triangle inequality, write
|X n − X | ≤ |X n − φM (X n )| + |φM (X n ) − φM (X )| + |φM (X ) − X |.
Note that for each x ∈ R, |φM (x) − x| = (|x| − M )+ ≤ |x|1(|x| ≥ M ). Hence taking expecation, we get
£ ¤
E [|X n − X |] ≤ E [|X n |1(|X n | ≥ M )] + E [|X |1(|X | ≥ M )] + E |φM (X n ) − φM (X )|
£ ¤
≤ sup E [|X n |1(|X n | ≥ M )] + E [|X |1(|X | ≥ M )] + E |φM (X n ) − φM (X )| .
n≥0 | {z } | {z }
| {z } =(b) =(c)
=(a)
Note that (a) tends to zero as M → ∞ by UI in (i). For (b), first note that UI implies supn≥1 E[|X n |] < ∞.
Then by the in-probability Fatou’s lemma (Ex. 3.4.11)
E[|X |] ≤ lim inf E[|X n |] < ∞.
n→∞
Thus (b) tends to zero as M → ∞ by MCT. Lastly, to show that (c) also tends to zero, we note that X n → X
in probability φM being continuous imply that φM (X n ) → φM (X ) in probability as n → ∞ (see Prop.
3.4.12). Since φM is also bounded, then BCT (Thm. 1.3.16) yields that (c) tends to zero as n → ∞.
To finish, we fix ε > 0 and choose M large enough so that (a) and (b) combined are ≤ ε. For this M ,
(c) tends to zero as n → ∞, so this shows
lim sup E [|X n − X |] ≤ ε.
n→∞
Now since ε > 0 was arbitrary, the limsup above must be zero.
“(ii) ⇒(iii):” By Jensen’s and triangle inequalities,
h¯ ¯i
¯ ¯
|E[|X n | − |X |]| ≤ E ¯|X n | − |X |¯ ≤ E[|X n − X |].
Since L 1 convergence means the last expression tends to zero, this yields (iii).
5.6. UNIFORM INTEGRABILITY AND CONVERGENCE IN L 1 155
“(iii) ⇒(i):” Fix ε > 0. For each M ≥ 0, let ψM denote a continuous, piece-wise linear approximation
of x1(x ≥ M ):
x if x ∈ [0, M − 1]
ψM (x) := 0 if x ≥ M
linear if |x| ≤ M .
Since X n → X in probability, ψM (|X n |) → ψM (|X |) in probability (Prop. 3.4.12) so by BCT (Thm. 1.3.16),
E[ψM (|X n |)] → E[ψM (|X |)] as n → ∞. Hence combining with (iii), we can choose N ≥ 1 large such that
|E[|X n | − |X |]| ≤ ε/3 and |E[ψM (|X n |)] → E[ψM (|X |)]| ≤ ε/3 for all n ≥ N .
Next, by MCT, we will choose M large enough so that
max E[|X n |1(|X n | > M )] ≤ ε.
1≤n<N
Since ψM (|X |) → |X | a.s. as M → ∞, |ψM (X )| ≤ |X |, and E[|X |] < ∞, DCT (Thm. 1.3.20) implies that
E[ψM (|X |)] → E[|X |] as M → ∞. Thus, we may choose larger M , if necessary, so that
E[|X |] − E[ψM (|X |)] ≤ ε/3.
It follows that for all n ≥ N ,
E[|X n |1(|X n | > M )] ≤ E[|X n |] − E[ψM (|X n |)]
¡ ¢ ¡ ¢
= (E[|X n |] − E[|X |]) + E[|X |] − E[ψM (|X |)] + E[ψM (|X n |)] − E[ψM (|X |)] ≤ ε.
Then we conclude
½ ¾
sup E[|X n |1(|X n | > M )] ≤ max max E[|X n |1(|X n | > M )], sup E[|X n |1(|X n | > M )] ≤ ε.
n≥1 1≤n<N n≥N
Since ε > 0 was arbitrary, this shows the UI in (i). □
We are now ready to state and derive the first main result in this section.
Theorem 5.6.7 (L 1 convergence of supermartingales). For a supermartingale, the following are equiva-
lent:
(i) It is uniformly integrable;
(ii) It converges a.s. and in L 1 ;
(iii) It converges in L 1 .
P ROOF. “(i) ⇒(ii):” UI implies sup E[|X n |] < ∞ so the martingale convergence theorem (Thm. 5.2.24)
implies X n → X a.s. (hence in probability), and Theorem 5.6.6 implies X n → X in L 1 .
“(ii) ⇒(iii):” Trivial.
“(iii) ⇒(i):” X n → X in L 1 implies X n → X in probability by Markov’s inequality. Hence (i) holds by
Theorem 5.6.6. □
Next, we aim to deduce an analogue of Theorem 5.6.7 for martingales. We will use the following two
simple observations in the proof.
Proposition 5.6.8. Let (X n )n≥1 be a sequence of RVs on the common probability space (Ω, F , P) that con-
verges to some RV X in L 1 . Then for any A ∈ F , E[X n 1(A)] → E[X 1(A)].
P ROOF. By linearity of expectation and Jensen’s inequality,
|E[X n 1(A)] − E[X 1(A)]| = |E[(X n − X )1(A)]| ≤ E[|X n − X |] = o(1).
□
Proposition 5.6.9. If (X n )n≥0 is a martingale w.r.t. a filtration (Fn )n≥0 converging to some RV X in L 1 ,
then X n = E[X | Fn ].
5.6. UNIFORM INTEGRABILITY AND CONVERGENCE IN L 1 156
P ROOF. Since (X n )n≥0 is a martingale, E[X m | Fn ] = X n for all m ≥ n. By definition of conditional ex-
pectation, for each A ∈ Fn and m ≥ n, E[X m 1(A)] = E[X n 1(A)]. By Prop. 5.6.8, we also have E[X m 1(A)] →
E[X 1(A)] as m → ∞. It follows that
E[X n 1(A)] = lim E[X m 1(A)] = E[X 1(A)].
m→∞
Since the above holds for all A ∈ Fn and since X n ∈ Fn , it follows that E[X | Fn ] = X n . □
Theorem 5.6.10 (L 1 convergence of martingales). For a martingale, the following are equivalent:
(i) It is uniformly integrable;
(ii) It converges a.s. and in L 1 ;
(iii) It converges in L 1 ;
(iv) There is an integrable random variable X so that X n = E[X | Fn ].
P ROOF. “(i) ⇒(ii):” Since martingales are also submartingales, this follows from Theorem 5.6.7.
“(ii) ⇒(iii):” Trivial.
“(iii) ⇒(iv):” Follows from Prop. 5.6.9.
“(iv) ⇒(i):” Follows from Prop. 5.6.4. □
Theorem 5.6.11 (Lévy’s upward convergence). Fix a probability space (Ω, F , P) and let (Fn )n≥0 be filtra-
S
tion on Ω such that Fn ⊆ F for n ≥ 0. Denote F∞ := σ( n≥0 Fn ) (i.e., Fn % F∞ ⊆ F ). Then for any
F -measurable RV X ,
E[X | Fn ] → E[X | F∞ ] as n → ∞ a.s. and in L 1 .
In particular, if X ∈ F∞ , then
E[X | Fn ] → X as n → ∞ a.s. and in L 1 .
P ROOF. Denote Yn := E[X | Fn ]. The tower property of conditional expectation shows that (Yn )n≥0 is
a martingale w.r.t. (Fn )n≥0 . By Prop. 5.6.4, (Yn )n≥0 is uniformly integrable. Then by Theorem 5.6.10, we
have that Yn → Y∞ a.s. and in L 1 for some RV Y∞ . Then by Prop. 5.6.9, for all n ≥ 0,
E[X | Fn ] = Yn = E[Y∞ | Fn ].
S
It follows that, denoting R := n≥0 Fn ,
For the same reason as before, the above result might not be very surprising. However, we can im-
mediately deduce Kolmogorov’s 0-1 law from it, so it should be far from trivial.
Example 5.6.14 (Kolmogorov’s 0-1 law). Let X 1 , X 2 , . . . be independent RVs on the same probability
space. For each n ≥ 1, let Fn := σ(X 1 , . . . , X n ) and let T denote the tail σ-algebra for (X n )n≥1 . Let A ∈ T .
We would like to show that P(A) ∈ {0, 1} (Kolmogorov’s 0-1 law, see Exc. 3.7.8).
To this effect, first observe that (Fn )n≥1 is a filtration. Also since A is a tail event and X n ’s are inde-
pendent, 1(A) ⊥ Fn for each n ≥ 115. Hence by Lévy’s 0-1 law, as almost surely as n → ∞,
Thus, the constant function ω 7→ P(A) is almost surely the same as the indicator function 1(A). It follows
that P(A) ∈ {0, 1}, as desired. ▲
Another nice consequence of Levy’s upward convergence thoerem is the following DCT for condi-
tional expectation.
Theorem 5.6.15 (Dominated convergence theorem for conditional expectations). Suppose Yn → Y al-
most surely and |Yn | ≤ Z for all n where E[Z ] < ∞. If Fn % F∞ then
The above holds for all N and WN & 0 as N → ∞, so continuity of conditional expectation from below
(Prop. 5.1.21) implies E[WN |F∞ ] & 0. Then Jensens inequality and the above give us
as n → ∞. Theorem 5.6.11 then implies E[Y |Fn ] → E[Y |F∞ ] almost surely. The desired result follows
from the last two conclusions and the triangle inequality. □
Exercise 5.6.16 (Approximation of measurable function by stepfunction in L 1 ). Let f : [0, 1) → R be a
Borel measurable function. In this exercise, we will show that for each k ≥ 1, there exists a stepfunction
g k with stepsize 2−k such that k f − g k k1 → 0 as k → ∞, a well-known fact in real analysis16. We will use a
filteration given by diadic partition and Levy’s upward converence theorem (Thm. 5.6.11).
(i) Fix an integer L ≥ 1 and denote the intervals I L;i := [ i −1 i
L , L ) for i = 1, . . . , L that partition [0, 1). Let U be
an independent Uniform([0, 1]) RV and let FL denote the σ-algebra generataed by the events
{U ∈ I L;i } for i = 1, . . . , L. Define
£ ¤
f L := E f (U ) | FL .
Show that f L is the block average of f over the interval partition [0, 1) = I L;1 t · · · t I L;L , that is,
for each ω ∈ I L;i for i = 1, . . . , L,
Z Z
1
f L (ω) = f (x) d x = L f (x) d x.
|I i | I i Ii
15If X = X a.s. for all n ≥ 1 for some RV X , clearly this is not the case.
n
16We can also use the fact that continuous functions are desne in L 1 ([0, 1]) and then use uniform continuity for a direct
construction.
5.7. OPTIONAL STOPPING THEOREMS 158
(ii) Now take L = 2k for k = 1, 2, . . . . Show that (F2k )k≥0 defines a filtration and that ( f 2k )k≥0 is a martin-
gale w.r.t. this filtration. Conclude that
k f 2k → f k1 → 0 as k → ∞.
(Hint: Use Lévy’s upward convergence theorem).
Exercise 5.6.17 (Stepfunction approximation of graphons). A symmetric integrable function W : [0, 1]2 →
[0, 1] is called a graphon, a continuum generalization of graphs which also arise as the limit object for
sequences of dense graphs. A ‘block graphon’ is a special graphon that takes constant values over rect-
angles that partition [0, 1]2 . Use the approach in Exc. 5.6.16 to show that, for each k ≥ 1, there exists a
block graphon Wk with square blocks of side lengths 2−k such that
kU −Uk k1 → 0 as k → ∞.
The next result on optional stopping does not require uniform integrability:
P ROOF. Note that E[X 0 ] ≥ E[X N ∧n ] ≥ E[X n ] by Lemma 5.5.1. Using the first inequality and Fatou’s
lemma,
h i
E[X 0 ] ≥ lim inf E[X N ∧n ] ≥ E lim inf X N ∧n = E[X N ].
n→∞ n→∞
The last optimal stopping theorem is for submartingales with increments whose conditional expec-
tations are uniformly bounded. This result is quite useful since its hypothesis is easy to check for many
applications (e.g., i.i.d. integerable increments).
Theorem 5.7.5 (Optional stopping for submartingales with increments of uniformly bounded condi-
tional expectation). Suppose (X n )n≥1 is a submartingale and suppose that there exists a constant B > 0
such that E[|X n − X n−1 | | Fn−1 ] ≤ B a.s. for all n ≥ 1. If N is a stopping time with E[N ] < ∞, then X N ∧n is
uniformly integrable and hence E[X 0 ] ≤ E[X N ].
We will show that Y is integrable. Then by the result in Example 5.6.3, we can conclude that X N ∧n is UI.
To this end, note that since {N ≥ m} = {N < m}c ∈ Fm−1 ,
E[|X m − X m−1 |1(N ≥ m)] = E [E [|X m − X m−1 | | Fm−1 ] 1(N ≥ m)]
≤ E [B 1(N ≥ m)]
= B P(N ≥ m).
Then
X
n
E[|Y |] ≤ B P(N ≥ m) ≤ B E[N ] < ∞.
m=1
5.7.1. Application of martingales to random walks. In this section, we harvest some interesting results
from the optional stopping theorems for random walks. We first recall some notations. Let (ξn )n≥1 be a
sequence of i.i.d. increments with E[ξi ] = µ < ∞. Let S n = S 0 +ξ1 +· · ·+ξn , where S 0 is a constant. Denote
Fn := σ(ξ1 , . . . , ξn ) and let F0 = {;, Ω}. Then (S n )n≥0 defines a random walk with increments ξk . Recall
the three martingales for random walks — the linear martingale (Ex. 5.2.4), the quadratic martingale (Ex.
5.2.5), and the exponential martingale (Ex. 5.2.7):
1. (Linear martingale) X n = S n − µn;
2. (Quadratic martingale) X n = S n2 − σ2 n, where E[ξk = 0] and E[ξ2k ] = σ2 < ∞;
3. (Exponential martingale) X n = exp(θS n )/φ(θ), where φ(θ) := E[exp(θξk )] < ∞.
We first deduce Wald’s equaiton from the linear martingale.
Theorem 5.7.6 (Wald’s equation). Let (ξn )n≥1 be a sequence of i.i.d. increments with E[ξi ] = µ < ∞. Let
S n = ξ1 + · · · + ξn , where S 0 is a constant. Let N be a stopping time with E[N ] < ∞. Then E[S N ] = µE[N ].
P ROOF. Let X n = S n − µn be the linear martingale. Note that E[|X n − X n−1 | | Fn−1 ] = E[|ξn |] = µ for all
n ≥ 1. Hence by Theorem 5.7.5 applied to X n and −X n , we get
0 = E[X 0 ] = E[X N ] = E[S N ] − µ E[N ],
as desired. □
Theorem 5.7.7 (Symmetric Gambler’s ruin). Let S n = S 0 + ξ1 + · · · + ξn for n ≥ 0 be a random walk with
symmetric increments ξk with P(ξk = 1) = P(ξk = −1) = 1/2. Fix integers a ≤ x ≤ b and let N := inf{n ≥
0 | S n ∈ {a, b}, the first time that S n hits either a or b. Write Px (·) = P(· | S 0 = x) and Ex (·) = E[· | S 0 = x].
b−x x −a
(i) Px (S N = a) = and Px (S N = b) = .
b−a b−a
(ii) Ex [N ] = (b − x)(x − a).
(iii) Let T y := inf{n ≥ 0 | S n = y}. Then for m ≥ 1,
1 M −1
P1 (T M < T0 ) = and P1 (T M > T0 ) = .
M M
(iv) P1 (T0 < ∞) = 1.
P ROOF. First we show (i). The key is to show optional stopping equation: E[S N ] = E[S 0 ] = x. Indeed,
noting that S N ∈ {a, b} with probability one, we would deduce
x = E[S N ] = a Px (S N = a) + b Px (S N = b)
= a Px (S N = a) + b(1 − Px (S N = a)),
and solving this equaiton for Px (S N = a) gives the desired formula for it. Subtracting this from one also
gives the desired formula for the other probability.
We will use Theorem 5.7.5 to verify optional stopping equation. For this, we only need to check
E[N ] < ∞. For this, observe that a consecutive run of +1 increments of length b − a will get the random
walk out of (a, b). This occurs with probability 2−(b−a) . Hence in order to be confined in the interval (a, b)
for m(b − a) steps, we need to keep avoiding such runs m times. This gives
³ ´m
Px (N > m(b − a)) ≤ 1 − 2−(b−a)
for all m ≥ 1. Since the right-hand side decays exponentially fast, using the tail-sum formula the fact that
Px (N ≥ k) decreases in k,
X
∞ X
∞
Ex [N ] = P(N ≥ k) ≤ (b − a) P(N ≥ m(b − a)) < ∞.
k=0 m=0
5.7. OPTIONAL STOPPING THEOREMS 161
Next, we show (ii). For this, we will leverage the quardaric martingale S n2 −n (note that σ2 = E[ξ21 ] = 1).
Applying Theorem 5.2.17, we get
x 2 = Ex [S 02 − 0] = Ex [S 2N ∧n − (N ∧ n)].
By MCT, we have E[N ∧ n] % E[N ] as n → ∞. Also, recall that S N ∧n ∈ [a, b] for all n ≥ 0, so by BCT,
b−x x −a
Ex [S 2N ∧n ] → Ex [S 2N ] = a 2 + b2
b−a b−a
1
= (−ab(b − a) + x(b − a)(b + a))
b−a
= −ab + x(b + a).
It follows that
E[N ] = −ab + x(b + a) − x 2 = (b − x)(x − a).
(iii) follows easily from the previous parts by taking a = 0, x = 1, and b = M ≥ 1.
Lastly, (iv) follows from (iii) since {T0 < T M } % {T0 < ∞} as M → ∞ (see continuity of measures from
below, Thm. 1.1.16),
M −1
P1 (T0 < ∞) = lim P1 (T0 < T M ) = lim = 1.
M →∞ M →∞ M
□
The technique we used in the proof of Theorem 5.7.7 (ii) is noteworthy. Namely, we first apply the
optional stopping theorem for bounded stopping time N ∧ n (Lem. 5.5.1) and the take n → ∞ with
appropriate convergence theorem.
Next, we ‘apply’ optional stopping to the exponential martingale and obtain the exponential moment
generating function for the first hitting time. This will in tern give us the exact distribution of hitting time.
Theorem 5.7.8. Let S n be the simple symmetric random walk on Z with S 0 = 0. Let T1 = inf{n ≥ 1 | S 1 = 1}.
Then
p
T1 1 − 1 − s2
E[s ] = .
s
Inverting the generating function, we get
1 (2n)! −2n
P(T1 = 2n − 1) = · 2 .
2n − 1 n! n!
P ROOF. Let X n = exp(θS n )/ϕ(θ)n denote the exponential martingale, where ϕ(θ) = E[exp(θξ1 )] is the
generating function of the increment. By Theorem 5.7.7 and the symmetry of SRW, we know P0 (T1 <
∞) = 1, so X T1 is well-defined. Note that
1
ϕ(θ) = (e θ + e −θ ).
2
So ϕ(0) = 1, ϕ0 (0) = E[ξ1 ] = 0, and ϕ00 = ϕ > 0 (convesity of a log MGF is a general fact, see Exc. 5.7.9). So
ϕ(θ) > 1 for θ > 0 in the domain. Choose any θ > 0. Consider the stopped martingale X T1 ∧n (Thm 5.2.17).
Noting that S T1 ∧n ≤ 1, for all n ≥ 1,
0 ≤ X T1 ∧n ≤ exp(θS T1 ∧n ) ≤ exp(θ).
Thus by BCT (Thm. 1.3.16)17,
£ ¤
1 = E[X 0 ] = E[X T1 ∧n ] → E[X T1 ] = E exp(θ)/ϕ(θ)T1 .
17Here, we in fact have proved a version of optional stopping theorem for any stopping time N that holds when the stopped
martingale X N ∧n is uniformly bounded, by using the fact that X N ∧n itself is a martingale and BCT. We can also apply Prop. 5.7.1
directly.
5.7. OPTIONAL STOPPING THEOREMS 162
This yields
£ ¤
E ϕ(θ)−T1 = exp(−θ). (83)
To convert the above as the probability generating function of T1 , choose s so that
1
ϕ(θ) = (e θ + e −θ ) = 1/s.
2
Substituting e θ = x, this becomes x + x −1 = 2/s, or sx 2 − 2x + s = 0. Solving for x = e θ ≥ 0, we get
p
1 + 1 − s2
x= .
s
Now (83) gives
p
X m T1 −1 s 1 − 1 − s2
s P(T1 = m) = E[s ] = x = p = .
m≥0 1 + 1 − s2 s
Expanding the function in the RHS as a power-series and matching the coefficients¡ ¢ will then give us the
formula for the moments of T1 . Some standard computation shows (denoting xr = x(x−1) · · · (x−r +1)/r !
for real x > 0)
p à ! à ! à !
1 − 1 − s2 1/2 1/2 3 1/2 5
= s− s + s −···
s 1 3 5
µ ¶
X 1 (2n)! −2n 2n−1
= · 2 s .
n≥1 2n − 1 n! n!
Theorem 5.7.10 (Asymmetric Gambler’s ruin). Let S n = S 0 + ξ1 + · · · + ξn for n ≥ 0 be a random walk with
symmetric increments ξk with P(ξk = 1) = p and P(ξk = −1) = 1 − p for some p 6= 1/2.
³ ´
1−p y
(i) Let T y := inf{n ≥ 0 | S n = y} and φ(y) := p . Then for a < x < b,
For (ii), choose x = 0, a < 0, and b > 0. Observe that P0 (Tb ≥ b) = 1 since the increments are bounded
by one. Hence {T a < Tb } % {T a < ∞}, so by (i) and letting b → ∞,
φ(b) − φ(0) 1
P0 (T a < ∞) = lim P0 (T a < Tb ) = lim = ,
b→∞ b→∞ φ(b) − φ(a) φ(a)
where we have also used the fact that φ(b) → 0 as b → ∞ since p > 1/2.
For (iii), proceeding as before,
φ(0) − φ(a)
P0 (Tb < ∞) = lim P0 (Tb < T a ) = lim =1
a→−∞ b→∞ φ(b) − φ(a)
where we have also used the fact that φ(a) → 0 as a → −∞ since p > 1/2.
For the last conclusion (which makes sense since the average speed is 2p − 1), we use the linear
martingale S n − (2p − 1)n. By Theorem 5.2.17,
£ ¤
0 = E S Tb ∧n − (2p − 1)E[Tb ∧ n].
By MCT, E[Tb ∧ n] % E[Tb ] as n → ∞. Also, note that
b ≥ S Tb ∧n ≥ inf S n
n≥0
and by tail-sum formula and (ii),
·¯ ¯¸ · ¸ µ ¶
¯ ¯ X
¯ ¯
E ¯ inf S n ¯ = E − inf S n = P inf S n ≤ −a < ∞.
n≥0 n≥0 a≥0 n≥0
£ ¤
Thus by DCT, E S Tb ∧n → E[S Tb ] = b as n → ∞. It follows that
b = (2p − 1) E[Tb ],
as desired. □
CHAPTER 6
Markov chains
In this chapter, we delve into the study of a pivotal class of stochastic processes known as Markov
chains. Roughly speaking, Markov chains are used to model temporally changing systems where the
future state depends only on the current state. For instance, if the price of bitcoin tomorrow depends
only on its price today, then the bitcoin price can be modeled as a Markov process. (Of course, the entire
history of prices often influences the decisions of buyers/sellers, so this assumption may not be realistic.)
The significance of Markov chains is twofold: Firstly, they find application across diverse realms
including the physical, biological, social, and economic domains. Secondly, their theory is robustly es-
tablished, enabling computation and accurate prediction through model utilization.
164
6.1. DEFINITION AND EXAMPLES 165
Example 6.1.2. Let Ω = {1, 2} and let (X t )t ≥0 be a Markov chain on Ω with the following transition matrix
· ¸
p 11 p 12
P= .
p 21 p 22
We can also represent this Markov chain pictorially as in Figure 6.3.2, which is called the ‘state space
diagram’ of the chain (X t )t ≥0 .
𝑝
𝑝 1 2 𝑝
𝑝
and similarly,
In general, the distribution of X t +1 can be computed from that of X t via a simple linear algebra. Note
that for i = 1, 2,
𝑝 𝑃
𝑝
1 0 1 2 3 4 1
1−𝑝 1−𝑝 1−𝑝
P(X t +1 = k + 1 | X t = k) = p ∀1 ≤ k < N
P(X
t +1 = k | X t = k + 1) = 1 − p ∀1 ≤ k < N
P(X t +1 = 0 | X t = 0) = 1
P(X t +1 = N | X t = N ) = 1.
For example, the transition matrix P for N = 5 is given by
1 0 0 0 0 0
1 − p 0
0 p 0 0
0 1−p 0 p 0 0
P = .
𝑝 0 0 1−p 0 p 0
𝑝
0 𝑝 0 0 1 − p 0 p
1 2
𝑝 0 0 0 0 0 1
We call the resulting Markov chain (X t )t ≥0 the gambler’s chain. ▲
Example 6.1.4 (Ehrenfest Chain). This chain is originated from the physics literature as a model for two
cubical volumes of air connected by a thin 𝑝tunnel. Suppose there𝑃 are total N indistinguishable balls split
𝑝
1 0
into two “urns” A and B . At each1step, we pick up one 2 of the N balls uniformly
3 4 move
at random, and 1 it to
1−𝑝
the other urn. Let X t denote the number1of − balls
𝑝 1 − t𝑝 steps. This is a Markov chain called the
in urn A after
Ehrenfest chain. (See the state space diagram in Figure 6.1.3.)
F IGURE 6.1.3. State space diagram of the Ehrenfest chain with 4 balls
It is easy to figure out the transition probabilities by considering different cases. If X t = k, then urn
B has N − k balls at time t . If 0 < k < N , then with probability k/N we move one ball from A to B and
with probability (N − k)/N we move one from B to A. If k = 0, then we must pick up a ball from urn
B so X t +1 = 1 with probability 1. If k = N , then we must move one from A to B and X t +1 = N − 1 with
probability 1. Hence, the transition kernel is given by
P(X t +1 = k + 1 | X t = k) = (N − k)/N ∀0 ≤ k < N
P(X = k − 1 | X = k) = k/N ∀0 < k ≤ N
t +1 t
P(X t +1 = 1 | X t = 0) = 1
P(X t +1 = N − 1 | X t = N ) = 1.
For example, the transition matrix P for N = 5 is given by
0 1 0 0 0 0
1/5 0 4/5 0 0
0
0 2/5 0 3/5 0 0
P = .
0 0 3/5 0 2/5 0
0 0 0 4/5 0 1/5
0 0 0 0 1 0
6.1. DEFINITION AND EXAMPLES 167
▲
Next, we take a look at an example of an 𝑁 important
(𝑡) class of Markov chains, which is called the ran-
dom walk on graphs. This is the basis of many algorithms involving machine learning on networks (e.g.,
Google’s 𝑝 Rate 𝜆 = 𝑝𝜆
𝑁(𝑡)PageRank).
Example 6.1.5 (Random walk on graphs). A graph G consists of a pair (V, E ) of sets of nodes V and edges
E ⊆ Rate
V 2 . 𝜆A graph G can
1 −be
𝑝 concisely represented 𝑁 (𝑡) as a |V | × |V | matrix A, which is called the adjacency
matrix of G. Namely, the (i , j ) entry of A is defined by
Rate 𝜆 = (1 − 𝑝)𝜆
A(i , j ) := 1(nodes i and j are adjacent in G) = 1(i ∼ j ).
We say G is simple if (i , j ) ∈ E implies ( j , i ) ∈ E and (i , i ) ∉ E for all i ∈ V . For a simple graph G = (V, E ), we
say a node j is adjacent to i if (i , j ) ∈ E . We denote degG (i ) the number of neighbors of i in G, which we
call the degree of i .
2 Consider we hop around the nodes of a given simple graph G = (V, E ): at each time, we jump from
one node to 𝑁one(𝑡)of the neighbors with equal probability. For instance, if we are currently at node 2 and if
1
2 is adjacent to 3, 5, and 6, then we jump to one of the three neighbors with probability 1/3. The location
Rate 𝜆 𝑁(𝑡) chain. Namely, a Markov chain (X )
of this jump process at time t can be described as a Markov t t ≥0 on
0 or
the node set V is called a random walk on G if
𝑁 (𝑡) Rate 𝜆 = 𝜆 + 𝜆
A(i , j )
P(X t +1 = j | X t = i ) = .
Rate 𝜆 deg G (i )
Note that its transition matrix P is obtained by normalizing each row of the adjacency matrix A by the
corresponding degree. That is,
P = D −1 A,
where D is the diagonal matrix of the degrees of nodes in G (i.e., the degree matrix of G).
3 4
1
⋯
𝐺
F IGURE 6.1.4. A 4-node simple graph G, its adjacency matrix A, and associated random walk
transition matrix P
2 Random walk on G can be defined (and more importantly, simulated) by the following recursive
sampling rule:
X t +1 | X t ∼ Uniform(N (X t )),
2 where N (v) denote the set of all neighbors of node v in G. So X t +1 given X t is chosen uniformly at
random among the neighbors of X t . ▲
Exercise 6.1.6 (Natural filtration for stochastic processes with countable state spaces). Let (Ω, F , P) be
a probability space and let (S , 2S ) be a countable measurable space. Let (X n )n≥0 be a sequence of S -
valued random variables. Let Fn := σ(X 0 , . . . , X n ). Show that
Fn = σ ({{X 0 = x 0 , · · · , X n = x n } : x 0 , . . . , x n ∈ S }) .
Exercise 6.1.7. Repeat rolling two four sided dices with numbers 1, 2, 3, and 4 on them. Let Yk be the
some of the two dice at the kth roll. Let S n = Y1 + Y2 + · · · + Yn be the total of the first n rolls, and define
X t = S t (mod6). Show that (X t )t ≥0 is a Markov chain on the state space Ω = {0, 1, 2, 3, 4, 5}. Furthermore,
identify its transition matrix.
𝜏
6.1. DEFINITION AND EXAMPLES 168
While right-multiplication of P advances a given row vector of distribution one step forward in time,
left-multiplication of P on a column vector computes the expectation of a given function with respect to
the one-step future distribution. This point is clarified in the following exercise.
Exercise 6.1.10 (Right-multiplication by transition matrix). Let (X t )t ≥0 be a Markov chain on a countable
state space S with transition matrix P . Let f : S → R be a function. Let f = [ f (1), f (2), · · · ]T be the column
vector representing the reward function f . Show that
X
[P f](i ) = P (i , j ) f ( j ) = E[ f (X 1 ) | X 0 = i ].
j ∈S
That is, P f is a function on the state space whose value on each state i is the one-step conditional expec-
tation of f (X 1 ) given X 0 = i .
Consider quadratic forms of the form rP v, where P is the Markov transition matrix, r is a PMF on the
state space, and v is a column vector reprenting a function f on the state space. The value of this quaratic
form is in fact the expectation E X 0 ∼r [ f (X 1 )], where the initial state X 0 is drawn from r. A generalization
of this observation is given in the following exercise.
Exercise 6.1.11 (Expected reward in the future). Let (X t )t ≥0 be a Markov chain on a countable state
space S with transition matrix P . Let f : S → R be a function. Suppose that if the chain X t has state x
at time t , then we get a ‘reward’ of f (x). Let rt = [P(X t = 1), P(X t = 2), . . . ] be the distribution of X t . Let
v = [ f (1), f (2), · · · ]T be the column vector representing the reward function f .
(i) Show that the expected reward at time t is given by
X
E[ f (X t )] = f (i )P(X t = i ) = rt v.
i ∈S
(ii) Use part (i) and Exercise 6.1.8 to show that
E[ f (X t )] = r0 P t v.
Pt
(iii) The total reward up to time t is a RV given by R t = k=0 f (X t ). Show that
E[R t ] = r0 (I + P + P 2 + · · · + P t )v.
If a stochastic process depends on the past two consecutive states, then one can simply extend the
state space into the set of paired states and define a Markov chain on that extended state space. The
following example illustrates this.
6.2. STRONG MARKOV PROPERTY 169
Exercise 6.1.12 (Multi-step dependence). Suppose that the probability it rains today is 0.4 if neither of
the last two days was rainy, but 0.5 if at least one of the last two days was rainy. Let Ω = {S, R}, where
S=sunny and R=rainy. Let Wt be the weather of day t .
(i) Show that (Wt )t ≥0 is not a Markov chain.
(ii) Expand the state space into the set of pairs Σ := Ω2 . For each t ≥ 0, define X t = (Wt −1 ,Wt ) ∈ Σ. Show
that (X t )t ≥0 is a Markov chain on Σ. Identify its transition matrix.
(iii) What is the two-step transition matrix?
(iv) What is the probability that it will rain on Wednesday if it didn’t rain on Sunday and Monday?
follows directly from the strong Markov property. For the first part, partition the sample space according
to the value of τ.)
6.2.2. Irreducibility, transience, and recurrence. Being able to restart a Markov chain at stopping times
using the strong Markov property allows us to decompose the trajectory of the chain into i.i.d. ‘excur-
sions’ from a ‘recurrent’ state. This gives a powerful means to analyze the behavior of Markov chains.
A Markov chain is ‘irreducible’ if every state is ‘accessible’ from any other state. More formal defini-
tion is given below.
Definition 6.2.3 (Irreducibility). Let P be the transition matrix of a Markov chain (X n )n≥0 on a countable
state space S . We say the chain (and P ) is irreducible if for any i , j ∈ S , there exists an integer k =
k(i , j ) ≥ 0 such that
P(X_k = j \mid X_0 = i) = P^k(i, j) > 0.
Exercise 6.2.4 (RW on connected graphs is irreducible). A graph G = (V, E ) is connected if for every pair
of distinct nodes x, y ∈ V , there exists a sequence of nodes z_0, z_1, . . . , z_m for some m ≥ 1 such that z_0 = x,
z_m = y, and z_i ∼ z_{i+1} for all i = 0, 1, . . . , m − 1. Show that the RW on a graph G is irreducible if and only if G
is connected.
Definition 6.2.5 (Hitting and return times). Let (X t )t ≥0 be a Markov chain on state space S . For x ∈ S ,
define the hitting time for x to be
τx := inf{t ≥ 0 | X t = x},
the first time at which the chain is at state x. We also define the first return time to x as
\tau_x^+ := \inf\{t \ge 1 \mid X_t = x\}.
For states x, y ∈ S , define
\rho_{xy} := P_x(\tau_y^+ < \infty),
which is the probability that the chain X_t eventually visits y starting from x.
Using the hitting times, we can classify the states into a few classes.
Definition 6.2.6 (Classification of states). Let (X n )n≥0 be a Markov chain on a countable state space S
with transition matrix P . We classify the states into the following classes:
(Transient) We say a state x ∈ S is transient if ρ_{xx} < 1.
(Recurrent) We say a state x ∈ S is recurrent if ρ_{xx} = 1. A recurrent state x is said to be
(Positive recurrent) if E_x[\tau_x^+] < \infty; and
(Null recurrent) if E_x[\tau_x^+] = \infty.
We say the chain and the transition matrix P are recurrent (resp., positive/null recurrent) if all states are recurrent (resp., positive/null recurrent).
Example 6.2.7 (SRW on Z is null recurrent). Consider SRW on Z. We recall the following asymptotic of the 'persistence probability' of SRW on Z:
P(X_1 \ge 1, X_2 \ge 1, \ldots, X_n \ge 1 \mid X_0 = 0) \sim \sqrt{\frac{2}{\pi n}}.
(For reference, see, e.g., [LS19].) It follows that, by the tail-sum formula for the expectation of nonnegative RVs,
E_0[\tau_0^+] = \sum_{n=1}^{\infty} P_0(\tau_0^+ \ge n) \sim \sum_{n=1}^{\infty} \sqrt{\frac{2}{\pi n}} = \infty.
Thus the origin is null recurrent. By symmetry, all other states are also null recurrent. ▲
Exercise 6.2.8. Show the following equivalence:
\rho_{xx} > 0 \iff P^m(x, x) > 0 \text{ for some } m \ge 1.
Hint: Use
\sup_{n \ge 1} P(X_n = x \mid X_0 = x) \le P\Big( \bigcup_{n=1}^{\infty} \{X_n = x\} \,\Big|\, X_0 = x \Big) \le \sum_{n=1}^{\infty} P(X_n = x \mid X_0 = x).
The following lemma implies that every state of a finite-state irreducible MC is positive recurrent.
Lemma 6.2.9 (Expected return time). Let (X t )t ≥0 be an irreducible MC with a transition matrix P on a
finite state space S . Then for each x, y ∈ S , we have (denoting Ex [·] = E[· | X 0 = x])
E_x[\tau_y^+] < \infty.
In particular, P is positive recurrent.
Proof. Since P is irreducible, for each x, y ∈ S , there exists an integer k(x, y) ≥ 0 such that P^{k(x,y)}(x, y) > 0. Let
r := \max_{x,y} k(x, y) \quad \text{and} \quad \varepsilon := \min_{x,y} P^{k(x,y)}(x, y).
Since S is finite, we have r < ∞ and ε > 0. Then note that for each x, y ∈ S , there exists j ∈ {1, . . . , r}⁴ such that
P^j(x, y) \ge \varepsilon.
Thus, for arbitrary y ∈ S and t ≥ 0, the chain hits y between some time t and t + r with probability at
least ε regardless of X t . It follows that, for k ≥ 1,
P_x(\tau_y^+ > kr) \le (1 - \varepsilon)\, P_x(\tau_y^+ > (k-1)r).
By induction and the strong Markov property, this shows
P_x(\tau_y^+ > kr) \le (1 - \varepsilon)^k.
Thus, it is exponentially unlikely to not visit y during k consecutive time intervals of length r . Then by
the tail-sum formula (Prop. 1.5.10),
E_x[\tau_y^+] = \sum_{m=0}^{\infty} P_x(\tau_y^+ > m) = \sum_{k=0}^{\infty} \sum_{m=kr}^{(k+1)r - 1} P_x(\tau_y^+ > m) \le \sum_{k=0}^{\infty} r\, P_x(\tau_y^+ > kr) \le r \sum_{k=0}^{\infty} (1 - \varepsilon)^k < \infty,
where we have used the fact that the tail probability P_x(\tau_y^+ > m) is non-increasing in m.
Positive recurrence of P follows immediately by noting that E_x[\tau_x^+] < \infty for all x. □
⁴Take j = k(x, y).
Proposition 6.2.10 (Irreducible recurrent chains visit every state). Let P be an irreducible and recurrent transition matrix on a countable state space S . Then P_x(\tau_y^+ < \infty) = 1 for all x, y ∈ S .

Proof. The idea is to look at excursions from x. Namely, starting from x, the chain returns to x at least once in some a.s. finite time \tau_x^+, since x is assumed to be recurrent. Restarting the chain from the stopping time \tau_x^+ and again using the recurrence of x with the strong Markov property, one can deduce that the chain returns to x for the second time in some finite time. Each trajectory of the chain from x to x is called an excursion from x. By the strong Markov property, (any measurable functions of) the excursions are i.i.d., and the duration of each excursion is distributed as \tau_x^+.
Now during each excursion, there is a positive probability (say δ > 0) of visiting y. To see this, suppose not; then since the excursions are i.i.d., the chain a.s. never visits y starting from x, which contradicts irreducibility. Then by the strong Markov property,
P_x(\tau_y^+ = \infty) \le P_x(\text{the chain does not visit } y \text{ during the first } k \text{ excursions}) \le (1 - \delta)^k.
We can conclude by taking k → ∞. □
Exercise 6.2.11 (Probability of mth visit). Fix states x, y ∈ S and let \tau_y^{(m)} denote the time of the mth visit to y. Show that for each m ≥ 1,
P_x(\tau_y^{(m)} < \infty) = \rho_{xy}\, \rho_{yy}^{m-1}.
(Hint: Use strong Markov property.)
Next, we will give a more quantitative characterization of recurrent and transient states by using the
expected total number of visits. That is, for each state y, define the following counting variable
N_y := \sum_{n=1}^{\infty} \mathbf{1}(X_n = y) = \sum_{m=1}^{\infty} \mathbf{1}(\tau_y^{(m)} < \infty),
which equals the total number of visits to state y by the Markov chain.
Proposition 6.2.12. The following hold:
E_x[N_y] = \begin{cases} 0 & \text{if } \rho_{xy} = 0 \\ \infty & \text{if } \rho_{xy} > 0 \text{ and } y \text{ is recurrent} \\ \rho_{xy} / (1 - \rho_{yy}) & \text{if } \rho_{xy} > 0 \text{ and } y \text{ is transient.} \end{cases}
verifying the first assertion. But Proposition 6.2.12 implies that x is recurrent if and only if E[N x | X 0 =
x] = ∞. Hence the assertion follows. □
Definition 6.3.1. A probability distribution π (viewed as a row vector S → [0, 1]) on a countable state space S is a stationary distribution for a Markov transition matrix P on S if π = πP, that is,
\pi(x) = \sum_{y \in S} \pi(y) P(y, x) \quad \text{for all } x \in S.
A Markov chain may have multiple stationary distributions, as the following example illustrates.
Example 6.3.3 (Stationary distribution of RW on G). What is the stationary distribution of random walk on G? There is a typical one that always works. Define a probability distribution π on V by
\pi(i) = \frac{\deg_G(i)}{\sum_{j \in V} \deg_G(j)} = \frac{\deg_G(i)}{2|E|}.
Namely, the probability given to node i is proportional to the degree of node i. Then observe that π is a stationary distribution for P, the transition matrix for RW on G. Indeed,
\sum_{i \in V} \pi(i) P(i, j) = \sum_{i \in V} \frac{\deg_G(i)}{2|E|} \frac{A(i, j)}{\deg_G(i)} = \frac{1}{2|E|} \sum_{i \in V} A(i, j) = \frac{\deg_G(j)}{2|E|} = \pi(j).
As we will see later, the stationary distribution is crucially related to the long-term behavior of the Markov chain. Roughly speaking, for irreducible Markov chains, the stationary distribution equals the limiting frequency of the Markov chain visiting each state in the long run (see Thm. 6.3.18). See Figure 6.3.1 for an
empirical validation of this claim for RW on the Facebook network among students in Caltech in 2007. ▲
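As a sanity check of Example 6.3.3, the following sketch verifies π = deg/2|E| numerically on a small made-up graph:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # adjacency matrix of a hypothetical 4-node graph
deg = A.sum(axis=1)
P = A / deg[:, None]                        # RW transition matrix: P(i,j) = A(i,j)/deg(i)
pi = deg / deg.sum()                        # deg(i)/(2|E|), since the degrees sum to 2|E|
print(np.allclose(pi @ P, pi))              # True: pi = pi P
```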
In the following exercise, we compute the stationary distribution of the so-called birth-death chain.
Figure 6.3.1. Normalized proportion of times that RW on the Caltech network spends at each node for N = 100, 1000, 10000 steps (top three rows) and the normalized degrees deg(v)/2|E| (bottom).
Exercise 6.3.4 (Birth-Death chain). Let S = {0, 1, 2, · · · , N} be the state space. Let (X_t)_{t≥0} be a Markov chain on S with transition probabilities
P(X_{t+1} = k + 1 \mid X_t = k) = p \quad \text{for all } 0 \le k < N,
P(X_{t+1} = k - 1 \mid X_t = k) = 1 - p \quad \text{for all } 1 \le k \le N,
P(X_{t+1} = 0 \mid X_t = 0) = 1 - p,
P(X_{t+1} = N \mid X_t = N) = p.
This is called a Birth-Death chain; its state space diagram is a path 0, 1, . . . , N with rightward transition probability p and leftward transition probability 1 − p.
(i) Let π = [π_0, π_1, · · · , π_N] be a distribution on S . Show that π is a stationary distribution of the Birth-Death chain if and only if it satisfies the following 'balance equation':
p\, \pi_k = (1 - p)\, \pi_{k+1} \quad \text{for all } 0 \le k < N.
(ii) Let ρ = p/(1 − p). From (i), deduce that π_k = ρ^k π_0 for all 0 ≤ k ≤ N.
(iii) Using the normalization condition π_0 + π_1 + · · · + π_N = 1, show that π_0 = 1/(1 + ρ + ρ^2 + · · · + ρ^N). Conclude that
\pi_k = \frac{\rho^k}{1 + \rho + \rho^2 + \cdots + \rho^N}, \quad 0 \le k \le N, \qquad (85)
so that the Birth-Death chain has a unique stationary distribution given by (85), which becomes the uniform distribution on S when p = 1/2.
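A short numerical check of (85), with made-up values of N and p:

```python
import numpy as np

N, p = 5, 0.3
P = np.zeros((N + 1, N + 1))
for k in range(N + 1):
    if k < N: P[k, k + 1] = p          # birth step
    if k > 0: P[k, k - 1] = 1 - p      # death step
P[0, 0] = 1 - p                        # holding at the left boundary
P[N, N] = p                            # holding at the right boundary

rho = p / (1 - p)
pi = rho ** np.arange(N + 1)
pi /= pi.sum()                         # pi_k = rho^k / (1 + rho + ... + rho^N)
print(np.allclose(pi @ P, pi))         # True: (85) is stationary
```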
Exercise 6.3.5 (Success run chain). Consider a Markov chain (X_n)_{n≥0} on the state space S = Z_{≥0} of nonnegative integers with transition matrix P given by
P(x, x + 1) = p \quad \text{and} \quad P(x, 0) = 1 - p \quad \text{for all } x \in \mathbb{Z}_{\ge 0},
where p ∈ [0, 1] is a fixed parameter. In words, one flips independent coins with success probability p; every time the coin comes up heads (probability p), X_n increases by 1, and when it comes up tails, X_n resets back to 0.
(i) Let π : Z≥0 → [0, ∞) be a function that satisfies πP = π. Show that the following recursion holds:
\pi(0) = (1 - p) \sum_{k=0}^{\infty} \pi(k),
\pi(1) = p\, \pi(0),
\pi(2) = p\, \pi(1),
\vdots
Next, we introduce reversibility and the time-reversal of a transition matrix. While the equation π = πP that defines a stationary distribution requires solving a linear system, there is a much simpler 'detailed balance equation' which, once satisfied, implies stationarity.
Definition 6.3.6 (Reversibility and time-reversal). Let P be a transition matrix on a countable state space S and let π be a probability distribution on S . A matrix P⃗ on S is a time-reversal of P w.r.t. π if it satisfies the following 'detailed balance equation':
\pi(x) P(x, y) = \pi(y) \vec{P}(y, x) \quad \text{for all } x, y \in S. \qquad (86)
We say P is reversible w.r.t. π if P itself is a time-reversal of P w.r.t. π, that is, π(x)P(x, y) = π(y)P(y, x) for all x, y ∈ S .
The following proposition will be used crucially in the proof of uniqueness of stationary distribution
(Thm. 6.3.12).
Proposition 6.3.7 (Properties of time-reversal matrix). Let P⃗ be a time-reversal of P w.r.t. π. Assume that
π = πP and π > 0 (entrywise). Then the following hold:
(i) πP⃗ = π.
(ii) P⃗ is itself a Markov transition matrix on S . Furthermore, π is a stationary distribution for P⃗.
(iii) Let µ be another stationary distribution for P . Define a column vector h = µ^T/π^T (entrywise division⁵). Then P⃗h = h, that is, h is P⃗-harmonic (see Def. 6.3.14).
(iv) If P is irreducible and recurrent, then so is P⃗.
Proof. To show (i), note that for each state x, since the rows of P sum to one,
[\pi \vec{P}](x) = \sum_{y} \pi(y) \vec{P}(y, x) = \sum_{y} \pi(x) P(x, y) = \pi(x) \sum_{y} P(x, y) = \pi(x).
⁵Entrywise division.
So (i) holds.⁶ To show (ii), it is enough to show that the rows of P⃗ sum to one. Indeed, for each state x, using that πP = π and π > 0,
\sum_{y} \vec{P}(x, y) = \pi(x)^{-1} \sum_{y} P(y, x) \pi(y) = \pi(x)^{-1} [\pi P](x) = \pi(x)^{-1} \pi(x) = 1.
Thus P⃗ is a Markov transition matrix on S . By (i), we also have that π is a stationary distribution for P⃗.
To show (iii), fix a state x and note that
\sum_{y} \vec{P}(x, y) h(y) = \pi(x)^{-1} \sum_{y} P(y, x) \pi(y) h(y) = \pi(x)^{-1} \sum_{y} P(y, x) \mu(y) = \pi(x)^{-1} [\mu P](x) = \pi(x)^{-1} \mu(x) = h(x),
where the fourth equality uses that µ is stationary for P.
This shows (iii).
Lastly, suppose that P is irreducible and recurrent. For any states x, y, there is a path from y to x of positive probability under P. Since π > 0, traversing this path backwards gives a path from x to y of positive probability under P⃗. Hence P⃗ is irreducible. For recurrence, first note that we can write
\vec{P} = \operatorname{diag}(\pi)^{-1} P^T \operatorname{diag}(\pi).
Taking the t-th power, we get
\vec{P}^{\,t} = \operatorname{diag}(\pi)^{-1} (P^t)^T \operatorname{diag}(\pi).
Evaluating the matrices on both sides at (x, x), we get \vec{P}^{\,t}(x, x) = P^t(x, x). This holds for all states x and times t ≥ 0. Since P is recurrent, \sum_{t \ge 0} P^t(x, x) = \infty for all x (see Thm. 6.2.13). Thus \sum_{t \ge 0} \vec{P}^{\,t}(x, x) = \infty for all x, so P⃗ is recurrent. □
Exercise 6.3.8 (Reversibility of random walk). Let P = D −1 A denote the transition matrix of a random
walk on a finite graph G = (V, E ), where D denotes the diagonal matrix of degrees and A denotes the
adjacency matrix of G. Define a probability distribution π on the node set V = {1, . . . , m} as
\pi := \frac{1}{2|E|} \big[ \deg(1), \ldots, \deg(m) \big].
Show that P is reversible w.r.t. π. Deduce that π is a stationary distribution for P .
6.3.2. Uniqueness and existence of stationary distribution: Finite state space. In this subsection, we
will show the fundamental result that for irreducible transition matrix P on a finite state space, there is
a unique stationary distribution. The uniqueness uses uniqueness of harmonic functions; the existence
uses a linear algebraic argument with a lazy chain.
Theorem 6.3.9 (Uniqueness and existence of stationary distribution for finite irreducible chain). Let P be
an irreducible transition matrix on a finite state space S = {1, . . . , m}. Then there exists a unique stationary
distribution π on S for P (i.e., π = πP ) and π > 0.
Proof. Existence. Stationary distributions are closely related to the eigenvectors and eigenvalues of the transition matrix P. Namely, suppose π = πP; taking transposes,
\pi^T = P^T \pi^T.
Hence, the column vector π^T is an eigenvector of P^T associated with eigenvalue 1. By Exercise 6.3.10, the dimension of the eigenspace of P^T with eigenvalue 1 equals the dimension of the eigenspace of P
⁶In fact, the argument above shows that (i) holds for a general probability distribution π on S .
with eigenvalue 1. But note that 1 is an eigenvalue of P with an eigenvector 1 (all-ones vector) since P is
a transition matrix. Thus,
\dim \ker(P^T - I) = \dim \ker(P - I) \ge 1. \qquad (87)
Now we can argue for the uniqueness and the existence of the stationary distribution.⁷
From (87), it follows that there exists an eigenvector v of P^T with eigenvalue 1. If the entries of v all have the same sign, then ±v^T/‖v‖_1 is a stationary distribution of P. In order to check that the entries of v indeed share a sign, we will use a 'lazy' version of P. Namely, define the 'lazy' transition matrix Q = (P + I)/2.⁸ Clearly Q is irreducible since P is so (use the probabilistic interpretation of Q). Then for any states x, y, Q^{k(x,y)}(x, y) > 0 for some integer k(x, y) ≥ 0. Since there are finitely many states, r := max_{x,y} k(x, y) ≥ 0 is a finite integer. By laziness, a positive entry of Q^t stays positive in Q^{t+1}, so W := Q^r is a positive Markov transition matrix.
Now let v be an eigenvector of P^T with eigenvalue 1. We claim that the entries of v must have the same sign. Suppose not. Note that v is an eigenvector of Q^T with eigenvalue 1, hence also an eigenvector of W^T with eigenvalue 1. So for each x,
|v_x| = \Big| \sum_{y} W^T(x, y) v_y \Big| < \sum_{y} W^T(x, y) |v_y|,
where the strict inequality follows since the sum in the middle has some cancellation due to the mixed signs of the v_y's and W^T > 0. Then summing over x,
\sum_{x} |v_x| < \sum_{x} \sum_{y} W^T(x, y) |v_y| = \sum_{y} \Big( \sum_{x} W(y, x) \Big) |v_y| = \sum_{y} |v_y|,
which is a contradiction. In turn, this shows the existence of a stationary distribution for P.
Positivity. Now we know that there exists a nonnegative eigenvector v for P^T (hence for W^T) with eigenvalue 1. Then since W^T > 0, the equation v = W^T v gives that each entry of v is positive. This shows the existence of a positive stationary distribution for P.
Uniqueness. We have shown that every eigenvector of P^T with eigenvalue 1 must have all coordinates of the same strict sign. In particular, the eigenspace of P^T with eigenvalue 1 must have dimension one (otherwise we could choose two orthogonal eigenvectors, but two vectors whose entries are all of one strict sign cannot be orthogonal). Since a positive eigenvector exists, this shows the uniqueness of the stationary distribution for P. □
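The existence argument above suggests a practical recipe: compute an eigenvector of P^T for the eigenvalue 1 and normalize it. A minimal sketch, on a hypothetical 2-state chain:

```python
import numpy as np

P = np.array([[0.2, 0.8],
              [0.6, 0.4]])                 # hypothetical transition matrix
w, V = np.linalg.eig(P.T)
v = V[:, np.argmin(np.abs(w - 1))].real    # eigenvector for the eigenvalue closest to 1
pi = v / v.sum()                           # normalize; entries share a common sign
print(pi)                                  # [3/7, 4/7] for this chain
print(np.allclose(pi @ P, pi))             # True
```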
Exercise 6.3.10 (Transpose and eigenspaces). Let A be a real square matrix. Show that A and A T have
the same eigenvalues and the corresponding eigenspaces have the same dimension. (Hint: To show that they have the same set of eigenvalues, note det(A − λI) = det((A − λI)^T) = det(A^T − λI). To show that the corresponding eigenspaces have the same dimension, use the fact that rank(A − λI) = rank(A^T − λI).)
Exercise 6.3.11 (Spectral radius of a transition matrix). Let P be a Markov transition matrix on a finite
state space. Show that the spectral radius of P (i.e., the maximum modulus of the eigenvalues of P ) is
one. (Hint: Use the Gershgorin circle theorem and the fact that P is a transition matrix. Namely, the eigenvalues of P reside in the union of circles in the complex plane with centers P_{ii} and radii \sum_{j \ne i} |P_{ij}| = \sum_{j \ne i} P_{ij} = 1 - P_{ii}. Then show that 1 is an eigenvalue of P.)
6.3.3. Uniqueness of stationary distribution: Countable state space. In this section, we will use harmonicity and a martingale argument to show uniqueness of the stationary distribution for an irreducible and recurrent transition matrix P on a (possibly infinite) countable state space S . Note that irreducibility implies (positive) recurrence by Lemma 6.2.9 for finite state spaces, but this is not necessarily true for countably infinite state spaces (e.g., asymmetric SRW on Z).
⁷Here both parts use the irreducibility of P . Later, we will see that irreducibility is not needed for the existence of a stationary distribution. See Lemma 6.3.20.
⁸For each transition with Q, flip an independent fair coin, and stay put upon heads and move according to P upon tails.
Our goal in this section is to show the uniqueness of stationary distribution in the general countable
state space case:
Theorem 6.3.12 (Uniqueness of positive stationary distribution). Suppose P is an irreducible and recur-
rent transition matrix on a countable state space S . If there is a stationary distribution for P , then P must
have a unique stationary distribution.
A simple but important observation is that stationary distribution for an irreducible transition matrix
must be positive everywhere.
Exercise 6.3.13 (Stationary distribution for irreducible chain is positive). Let P be an irreducible transition matrix on S with a stationary distribution π. Show that π(x) > 0 for all x ∈ S . (Hint: Show that zeros in π propagate backwards along transitions: if π(x) = 0 for some x, then π(y) = 0 for all y s.t. P(y, x) > 0.)
Our argument for this result is based on harmonic functions and martingales.
In Exercise 6.1.8, we have seen that multiplying the PMF of the current state (a row vector) on the right by the transition matrix advances it by one time step. In Exercise 6.1.10, we have also seen that multiplying a function (represented as a column vector) on the left by the transition matrix gives the expectation of the function w.r.t. the next-time PMF. We call distributions invariant under this action of P stationary. What about functions that are invariant under P?
Definition 6.3.14 (Harmonic functions w.r.t. P). Let P be a transition matrix on a countable state space S . A function h : S → R is said to be P-harmonic at x if⁹
h(x) = \sum_{y \in S} P(x, y) h(y).
We say h is P-harmonic if it is P-harmonic at every x ∈ S .
Exercise 6.3.15 (P-harmonic on a finite state space is constant). Let P be an irreducible transition matrix on a finite state space S and let h be a P-harmonic function on S . Show that h is a constant function. (Hint: Use the maximum principle. Namely, h attains its maximum at some state, and using harmonicity, show that all the 'neighboring' states are also global maximizers. Iterate, using irreducibility, to conclude that h attains its global maximum everywhere.)
Lemma 6.3.16 (Nonnegative P -harmonic functions are constant). Suppose P is an irreducible and recur-
rent transition matrix on a countable state space S . Let h be a nonnegative P -harmonic function on S .
Then h is a constant.
Proof. Let (X_n)_{n≥0} be a Markov chain on S with transition matrix P. Recall that h(X_n) is a martingale (see Lem. 5.2.8). Thus if τ_y is the hitting time of a state y, then by using the stopped martingale (Thm. 5.2.17),
h(x) = E_x\big[ h(X_{n \wedge \tau_y}) \big] \ge h(y)\, P_x(\tau_y \le n),
where we have used that h ≥ 0 for the inequality. The above inequality holds for all states x, y and times n ≥ 0. Since P is irreducible and recurrent, by Prop. 6.2.10, we have P_x(τ_y < ∞) = 1. Thus letting n → ∞ and using continuity of measure (Thm. 1.1.16), we get P_x(τ_y ≤ n) → 1, giving us h(x) ≥ h(y). Since x, y are arbitrary states, this shows that h is a constant function. □
⁹Recall Lem. 5.2.8 on martingales obtained by plugging a Markov chain into a harmonic function.
Proof of Theorem 6.3.12. Suppose there is a stationary distribution π for P. Since P is irreducible, π is positive by Exercise 6.3.13. Let µ be any other stationary distribution for P. We wish to show that π = µ. Since π > 0, the function (column vector) h = µ^T/π^T ≥ 0 (entrywise division) is well-defined. Let P⃗ be the time-reversal of P w.r.t. π. By Prop. 6.3.7, P⃗ is irreducible and recurrent with P⃗h = h, that is, h is P⃗-harmonic. Then h is constant by Lemma 6.3.16, so there exists a constant c ≥ 0 such that h(x) = µ(x)/π(x) ≡ c. Since both µ and π are probability distributions, we must have c = 1, so π = µ as desired. □
6.3.4. Characterization of stationary distribution. In this section, we will show that if an irreducible Markov chain has a stationary distribution, then the stationary distribution must be given by the reciprocal of the expected return time, and all states must be positive recurrent. This is provided by the following well-known lemma of Kac.
Lemma 6.3.17 (Kac). Let (X_t)_{t≥0} be an irreducible Markov chain with transition matrix P on a countable state space S . Suppose that there is a stationary distribution π solving π = πP.
(i) For any set S ⊆ S ,
\sum_{x \in S} \pi(x)\, E_x[\tau_S^+] = 1; \qquad (88)
equivalently, the expected return time to S, when the chain is started from π conditioned on S, is π(S)^{-1}. In particular, E_x[\tau_x^+] = 1/\pi(x) for all x ∈ S .
(ii) P is positive recurrent.
Proof. Let (Y_t)_{t≥0} be the time-reversed chain with transition matrix P⃗ defined w.r.t. π. We will first show that both P and P⃗ are recurrent. Fix a state x and define
\alpha(t) := P_\pi(X_t = x,\; X_s \ne x \text{ for all } s > t).
By stationarity and the strong Markov property,
\alpha(t) = P_\pi(X_t = x)\, P_x(\tau_x^+ = \infty) = \pi(x)\, P_x(\tau_x^+ = \infty).
Since the events {X_t = x, X_s ≠ x for all s > t} are disjoint for distinct t, we have
1 \ge \sum_{t \ge 0} \alpha(t) = \sum_{t \ge 0} \pi(x)\, P_x(\tau_x^+ = \infty).
Since the summand in the last sum does not depend on t and since π > 0 by Exc. 6.3.13, it follows that P_x(\tau_x^+ = \infty) = 0 for all x. This shows that P is recurrent. Since π is a stationary distribution for P⃗ (see Prop. 6.3.7), the same argument shows that P⃗ is also recurrent.
By using the detailed balance equation (86) recursively, for each sequence of states (z_0, . . . , z_t),
\pi(z_0) P(z_0, z_1) P(z_1, z_2) \cdots P(z_{t-1}, z_t) = \pi(z_t) \vec{P}(z_t, z_{t-1}) \cdots \vec{P}(z_1, z_0).
Summing the above over all sequences with z_0 = x, z_1, . . . , z_{t-1} ∉ S, and z_t = y, we obtain
\pi(x)\, P_x(\tau_S^+ \ge t,\; X_t = y) = \pi(y)\, \vec{P}_y(\tau_S^+ = t,\; Y_t = x).
(We write \vec{P} for the probability measure corresponding to the reversed chain.) Then summing over all x ∈ S, y ∈ S , and t ≥ 1 shows that
\sum_{x \in S} \sum_{t=1}^{\infty} \pi(x)\, P_x(\tau_S^+ \ge t) = \vec{P}_\pi(\tau_S^+ < \infty) = 1,
where the last equality follows from the recurrence of (Y_t). Since \tau_S^+ takes only positive integer values, the tail-sum formula gives (88). Furthermore, dividing both sides of (88) by \pi(S) = \sum_{x \in S} \pi(x), we get
\frac{1}{\pi(S)} = \sum_{x \in S} \frac{\pi(x)}{\pi(S)}\, E_x[\tau_S^+] = E_\pi[\tau_S^+ \mid X_0 \in S].
The second conclusion follows immediately by taking S = {x}, which yields π(x) E_x[\tau_x^+] = 1. This shows (i).
For (ii), recall that π > 0 due to irreducibility and Exc. 6.3.13. Thus E_x[\tau_x^+] = 1/\pi(x) < \infty for all x by part (i). This shows positive recurrence of P. □
6.3.5. Construction of stationary distribution. Does a Markov chain always have a stationary distribution? In Theorem 6.3.9, we answered this question positively for irreducible transition matrices on a finite state space, using an indirect linear-algebraic argument. But such an algebraic approach does not provide much intuition about the behavior of the Markov chain itself. In this section, we give a probabilistic and constructive argument showing that any Markov chain with a positive recurrent state has a stationary distribution of a certain canonical form. Importantly, we will show that irreducibility of the transition matrix is not required for there to be a stationary distribution.¹⁰
For each y ∈ S , let Vn (y) denote the number of visits to y in the first n steps:
V_n(y) := \sum_{k=1}^{n} \mathbf{1}(X_k = y).
We will see that, under reasonable assumptions, a Markov chain will admit a unique stationary distribution given by
\pi(y) := \lim_{n \to \infty} \frac{1}{n} E_x[V_n(y)], \qquad (89)
which is the expected proportion of time spent at y starting from x; it turns out not to depend on x. The main result in this section, Theorem 6.3.18, states that when the Markov chain satisfies certain reasonable conditions, (89) is indeed a stationary distribution of the Markov chain. However, it is not a priori clear that the limit in (89) exists. Instead of taking the limiting frequency of visits to y, we will construct a measure by counting only the visits to y during a single 'excursion' from x to x, which occurs during the time interval [0, \tau_x^+]. As we will see later, such a construction gives a valid probability distribution if and only if E_x[\tau_x^+] is finite.
The main result we will prove in this section is the following:
Theorem 6.3.18 (Existence and uniqueness for irreducible chains). Let P be an irreducible transition matrix on a countable state space S . Then P has a stationary distribution if and only if every state is positive recurrent, in which case the stationary distribution π is unique and π(x) = 1/E_x[\tau_x^+] for all x ∈ S .
Corollary 6.3.19. Assume S is finite and P is irreducible. Then P has a unique stationary distribution π given by (90).
The gist of the construction of a stationary distribution using the visit counts is given in the following lemma, where we construct a stationary measure for general Markov chains with a positive recurrent state. Namely, we consider the expected number of visits to each state, starting from a positive recurrent state, during a single excursion. These expected counts turn out to give an invariant measure. Normalizing it to be a probability measure then gives a stationary distribution.
Lemma 6.3.20 (Excursion and stationary measure). Consider a Markov chain (X_n)_{n≥0} on S with transition matrix P with at least one recurrent state, say x. For each y ∈ S , define
\lambda_x(y) := E_x\big[ V_{\tau_x^+}(y) \big],
the expected number of visits to y during a single excursion from x. Then the following hold:
(i) \sum_{y \in S} \lambda_x(y) = E_x[\tau_x^+];
(ii) \lambda_x P = \lambda_x, i.e., \lambda_x is an invariant measure for P;
(iii) if x is positive recurrent, then \pi_x := \lambda_x / E_x[\tau_x^+] is a stationary distribution for P.

Proof. To show (i), compute
\sum_{y \in S} \lambda_x(y) = \sum_{y \in S} E_x\big[ V_{\tau_x^+}(y) \big] = \sum_{y \in S} E_x\Big[ \sum_{n=1}^{\tau_x^+} \mathbf{1}(X_n = y) \Big] = E_x\Big[ \sum_{n=1}^{\tau_x^+} \underbrace{\sum_{y \in S} \mathbf{1}(X_n = y)}_{=1} \Big] = E_x[\tau_x^+],
where the third equality uses Fubini's theorem for nonnegative summands.
To show (ii), our goal is to verify, for each y ∈ S ,
\sum_{z \in S} \lambda_x(z) P(z, y) = \lambda_x(y).
For this, first write
\lambda_x(z) = E_x\Big[ \sum_{n=1}^{\tau_x^+} \mathbf{1}(X_n = z) \Big] = \sum_{n=1}^{\infty} P_x(\tau_x^+ \ge n,\; X_n = z),
where we have converted the random number of summands into an infinite sum with an additional indicator. Next, by time-homogeneity and the Markov property, we observe that
P_x(\tau_x^+ \ge n,\; X_n = z)\, P(z, y) = P_x(\tau_x^+ \ge n,\; X_n = z)\, P(X_{n+1} = y \mid X_n = z)
= P_x(\tau_x^+ \ge n,\; X_n = z)\, P(X_{n+1} = y \mid \tau_x^+ \ge n,\; X_n = z)
= P_x(\tau_x^+ \ge n,\; X_n = z,\; X_{n+1} = y).
Proof of Theorem 6.3.18. If P has a stationary distribution, then all states are positive recurrent by Lemma 6.3.17. Conversely, suppose all states are positive recurrent. Then for each x ∈ S , we have a stationary distribution π_x defined in Lemma 6.3.20 (iii). By irreducibility, π_x > 0 (see Exc. 6.3.13). Hence by Theorem 6.3.12, there is a unique stationary distribution for P, which we will denote by π. Then for each x ∈ S , it holds that
\pi(x) = \pi_x(x) = \frac{\lambda_x(x)}{E_x[\tau_x^+]} = \frac{1}{E_x[\tau_x^+]},
where we used that \lambda_x(x) = 1, since each excursion from x visits x exactly once, at time \tau_x^+. Lastly, the return times to x form a renewal process (Def. 3.6.3) due to the strong Markov property. Thus by the renewal theorem (Thm. 3.6.4), almost surely,
\lim_{n \to \infty} \frac{V_n(x)}{n} = \frac{1}{E_x[\tau_x^+]}. \qquad □
Exercise 6.3.21 (Positive recurrence is a class property). Let x, y be two communicating states, meaning that y can be reached from x with positive probability and vice versa. Show that if one of them is positive recurrent, then so is the other.
6.4.1. Total variation distance and mixing time. We first need to introduce a metric that measures the 'distance' between two probability distributions on the same measurable space. The one we use most in Markov chain theory is called the total variation distance, which is the worst-case difference between the probabilities assigned to a single event by the two distributions.
Definition 6.4.1 (Total variation distance). Let µ and λ be two probability distributions on a measurable space (S , F ). Then the total variation distance between µ and λ, ‖µ − λ‖_TV, is defined by
\|\mu - \lambda\|_{TV} := \sup_{A \in F} |\mu(A) - \lambda(A)|. \qquad (91)
The definition in (91) involves a supremum over all events, so it is not easy to work with. Below we provide three useful characterizations of it.
Proposition 6.4.2 (TV distance; pointwise form). Let µ and ν be two probability distributions on a countable state space S . Then
\|\mu - \nu\|_{TV} = \sum_{x : \mu(x) \ge \nu(x)} \big( \mu(x) - \nu(x) \big) = \frac{1}{2} \sum_{x \in S} |\mu(x) - \nu(x)|. \qquad (92)
Proof. Let B = {x : µ(x) ≥ ν(x)} and let A ⊆ S be any event. Then
\mu(A) - \nu(A) \le \mu(A \cap B) - \nu(A \cap B) \le \mu(B) - \nu(B). \qquad (93)
The first inequality is true because any x ∈ A ∩ B c satisfies µ(x) − ν(x) < 0, so the difference in probability
cannot decrease when such elements are eliminated. For the second inequality, note that including
more elements of B cannot decrease the difference in probability. Since (93) holds for all events A, the
first equality in (92) follows.
By exactly parallel reasoning,
\nu(A) - \mu(A) \le \nu(B^c) - \mu(B^c). \qquad (94)
Fortunately, the upper bounds on the right-hand sides of (93) and (94) are actually the same (as can be seen by subtracting them; see Figure 6.4.1). Furthermore, when we take A = B (or B^c), then |µ(A) − ν(A)| is equal to the upper bound. Thus
\|\mu - \nu\|_{TV} = \frac{1}{2} \big[ \mu(B) - \nu(B) + \nu(B^c) - \mu(B^c) \big] = \frac{1}{2} \sum_{x \in S} |\mu(x) - \nu(x)|.
□
Figure 6.4.1. Recall that B = {x : µ(x) ≥ ν(x)}. Region I has area µ(B) − ν(B). Region II has area ν(B^c) − µ(B^c). Since the total area under each of µ and ν is 1, regions I and II must have the same area, and that area is ‖µ − ν‖_TV. Figure excerpted from [LP17].
Remark 6.4.3. From Proposition 6.4.2 and the triangle inequality for real numbers, it is easy to see that
total variation distance satisfies the triangle inequality: for probability distributions µ, ν, and η,
kµ − νkTV ≤ kµ − ηkTV + kη − νkTV .
Proposition 6.4.4 (TV distance; variational form). Let µ and ν be two probability distributions on a countable state space S . Then
\|\mu - \nu\|_{TV} = \frac{1}{2} \sup \Big\{ \sum_{x \in S} f(x) \mu(x) - \sum_{x \in S} f(x) \nu(x) : \|f\|_\infty \le 1 \Big\}. \qquad (95)
Proof. On the one hand, since f has supremum norm at most 1, the triangle inequality gives
\Big| \sum_{x \in S} f(x) \mu(x) - \sum_{x \in S} f(x) \nu(x) \Big| \le \sum_{x \in S} |f(x)|\, |\mu(x) - \nu(x)| \le \sum_{x \in S} |\mu(x) - \nu(x)|.
Thus “≥” in (95) follows from Prop. 6.4.2.
On the other hand, consider the function f* on S defined as
f^*(x) = \mathbf{1}(\mu(x) \ge \nu(x)) - \mathbf{1}(\mu(x) < \nu(x)).
Then
\sum_{x \in S} f^*(x) \mu(x) - \sum_{x \in S} f^*(x) \nu(x) = \sum_{x : \mu(x) \ge \nu(x)} \big( \mu(x) - \nu(x) \big) - \sum_{x : \mu(x) < \nu(x)} \big( \mu(x) - \nu(x) \big) = \sum_{x} |\mu(x) - \nu(x)| = 2 \|\mu - \nu\|_{TV},
where the last equality follows from Prop. 6.4.2. This shows “≤” in (95). □
It is useful to introduce a parameter that measures the time required by a Markov chain for the dis-
tance to stationarity to be small.
Definition 6.4.5 (Mixing time). Let P be a Markov transition matrix on a countable state space S with a stationary distribution π. The mixing time of P is defined by
t_{mix}(\varepsilon) := \min\{ t : d(t) \le \varepsilon \}, \qquad t_{mix} := t_{mix}(1/4),
where d(t) := \sup_{x \in S} \|P^t(x, \cdot) - \pi\|_{TV}.
Exercise 6.4.6. Let P be a Markov transition matrix on a countable state space S with a stationary dis-
tribution π.
(i) (Standardizing distance from stationarity) Define
d(t) := \sup_{x} \|P^t(x, \cdot) - \pi\|_{TV} \quad \text{and} \quad \bar{d}(t) := \sup_{x,y} \|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV}.
Show that d(t) ≤ \bar{d}(t) ≤ 2d(t). (Hint: For the first inequality, use that since π = πP, we have π(A) = \sum_y π(y) P^t(y, A). Then |P^t(x, A) − π(A)| ≤ \sum_y π(y) |P^t(x, A) − P^t(y, A)| ≤ \bar{d}(t).)
(ii) (Submultiplicativity) Show that \bar{d}(s + t) ≤ \bar{d}(s)\, \bar{d}(t). (Hint: Use an optimal coupling; see Prop. 6.4.23.)
(iii) (Multiples of the mixing time) Show that d(\ell\, t_{mix}) ≤ 2^{-\ell}.
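Given (92), computing d(t) and t_mix for a small chain takes only a few lines; below is a sketch on a hypothetical 2-state chain, following the stopping rule of Definition 6.4.5:

```python
import numpy as np

P = np.array([[0.2, 0.8],
              [0.6, 0.4]])         # hypothetical transition matrix
pi = np.array([3/7, 4/7])          # its stationary distribution

def d(Pt, pi):
    # d(t) = max_x ||P^t(x,.) - pi||_TV, via the pointwise formula (92)
    return 0.5 * np.abs(Pt - pi).sum(axis=1).max()

Pt, t = np.eye(2), 0
while d(Pt, pi) > 1/4:
    Pt, t = Pt @ P, t + 1
print(t)                           # t_mix = min{t : d(t) <= 1/4}
```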
6.4.2. Examples and aperiodicity. Let (X_n)_{n≥0} be an irreducible positive recurrent Markov chain on a countable state space Ω. By Theorem 6.3.18, we know that the chain has a unique stationary distribution π. Denote by π_t the distribution of X_t. When can we expect that π_t → π in TV distance?
We first look at some examples, starting with a 2-state chain whose convergence we establish through a diagonalization of its transition matrix.
Example 6.4.7. Let (X_n)_{n≥0} be a Markov chain on S = {1, 2} with transition matrix
P = \begin{bmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{bmatrix},
which has a unique stationary distribution π = [3/7, 4/7]. By diagonalizing P, we can write
P^t = \begin{bmatrix} 1 & -4/3 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & (-2/5)^t \end{bmatrix} \begin{bmatrix} 1 & -4/3 \\ 1 & 1 \end{bmatrix}^{-1}.
Figure 6.4.2. (Left) Torus graph (figure excerpted from [LP17]). (Right) RW on the torus G = Z_6 × Z_6 has period 2.
The key issue behind the 2-periodicity of RW on G = Z_6 × Z_6 is that it takes an even number of steps to return to any given node. Generalizing this observation, we introduce the following notion of periodicity.
Definition 6.4.10. Let P be the transition matrix of a Markov chain (X t )t ≥0 on a countable state space S .
For each state x ∈ S , let T (x) := {t ≥ 1 | P t (x, x) > 0} be the set of times when it is possible for the chain
to return to starting state x. We define the period of x by the greatest common divisor of T (x). We say
the chain X t is aperiodic if all states have period 1.
Example 6.4.11. Let T(x) = {4, 6, 8, 10, · · · }. Then the period of x is 2, even though it is not possible to go from x to x in 2 steps. If T(x) = {4, 6, 8, 10, · · · } ∪ {3, 6, 9, 12, · · · }, then the period of x is 1, even though the set of possible return times to x is irregular. For the RW on G = Z_6 × Z_6 in Example 6.4.9, all nodes have period 2. ▲
Exercise 6.4.12 (Aperiodicity of RW on graphs). Let (X t )t ≥0 be a random walk on a connected graph
G = (V, E ).
(i) Show that all nodes have the same period.
(ii) If G contains an odd cycle C (e.g., triangle), show that all nodes in C have period 1.
(iii) Show that X t is aperiodic if and only if G contains an odd cycle.
(iv) Show that X t is periodic if and only if G is bipartite. (A graph G is bipartite if there exists a partition
V = A ∪ B of nodes such that all edges are between A and B .)
Remark 6.4.13. If (X t )t ≥0 is an irreducible Markov chain on a finite state space Ω, then all states x ∈ Ω
must have the same period. The argument is similar to that for Exercise 6.4.12 (i).
Exercise 6.4.14 (Aperiodicity implies irreducibility for the product chain). Let P be an irreducible tran-
sition matrix for a Markov chain on S . Let Q be a matrix on S × S defined by
Q((x, y), (z, w)) = P (x, z) P (y, w) for x, y, z, w ∈ S .
(i) Consider a Markov chain Z t := (X t , Y t ) on S × S that evolves by independently evolving its two co-
ordinates according to P . Show that Q above is the transition matrix for Z t .
(ii) Show that Q is irreducible if P is aperiodic.
The following exercise shows that if a RW on G is irreducible and aperiodic, then it is possible to
reach any node from any other node in a fixed number of steps.
Exercise 6.4.15. Let (X t )t ≥0 be a RW on a connected graph G = (V, E ). Let P denote its transition matrix.
Suppose G contains an odd cycle, so that the walk is irreducible and aperiodic. For each x ∈ V , let T (x)
denote the set of all possible return times to the state x.
(i) For any x ∈ V , show that a, b ∈ T (x) implies a + b ∈ T (x).
(ii) For any x ∈ V , show that T (x) contains 2 and some odd integer b.
(iii) For each x ∈ V , let b x be the smallest odd integer contained in T (x). Show that m ∈ T (x) whenever
m ≥ bx .
(iv) Let b ∗ = maxx∈V b x . Show that m ∈ T (x) for all x ∈ V whenever m ≥ b ∗ .
(v) Let r = |V | + b ∗ . Show that P r (x, y) > 0 for all x, y ∈ V .
With a little more work, one can also show a similar statement for general irreducible and aperiodic
Markov chains.
Exercise 6.4.16. Let P be the transition matrix of a Markov chain (X t )t ≥0 on a finite state space S . Show
that (i) implies (ii):
(i) P is irreducible and aperiodic.
(ii) There exists an integer r ≥ 0 such that every entry of P r is positive.
Furthermore, show that (ii) implies (i) if X_t is a RW on some graph G = (V, E). (Hint for (i)⇒(ii): You may use the fact that if a subset T of the positive integers is closed under addition with gcd(T) = 1, then it must contain all but finitely many positive integers. The proof is similar to Exercise 6.4.15, but a bit more number-theoretic. See [LP17, Lem. 1.27].)
Exercise 6.4.17. Let (X t ) and (Y t ) be independent random walks on a connected graph G = (V, E ). Sup-
pose that G contains an odd cycle. Let P be the transition matrix of the random walk on G.
(i) Let t ≥ 0 be arbitrary. Use Exercise 6.4.15 to deduce that for some integer r ≥ 1, we have
P(X_{t+r} = Y_{t+r} \mid X_t = x,\, Y_t = y) = \sum_{z \in V} P(X_{t+r} = z \mid X_t = x)\, P(Y_{t+r} = z \mid Y_t = y) = \sum_{z \in V} P^r(x, z)\, P^r(y, z) > 0.
(ii) Let r ≥ 1 be as in (i) and let
\delta := \min_{x,y \in V} \sum_{z \in V} P^r(x, z)\, P^r(y, z) > 0.
Use (i) and the Markov property to show that in every r steps, X_t and Y_t meet with probability at least δ > 0.
(iii) Let τ be the first time that X_t and Y_t meet. From (ii) and the Markov property, deduce that
P(\tau \ge kr) \le (1 - \delta)^k \to 0 \quad \text{as } k \to \infty.
6.4.3. Convergence theorem: Finite state space. In this section, we prove the convergence theorem for
irreducible aperiodic chains on finite state spaces.
Theorem 6.4.18 (Exponential mixing for finite irreducible aperiodic chains). Let (X_t)_{t≥0} be an irreducible aperiodic Markov chain on a finite state space S . Let π denote the unique stationary distribution of X_t. Then there exist constants C > 0 and α ∈ [0, 1) such that for all t ≥ 0,
\max_{x \in S} \|P^t(x, \cdot) - \pi\|_{TV} \le C \alpha^t. \qquad (96)
Proof. Since P is irreducible and aperiodic, by Exercise 6.4.16, there exists an r ≥ 0 such that P^r has strictly positive entries. Let Π be the S × S matrix whose every row equals the row vector π. For sufficiently small δ > 0, we have
P^r(x, y) \ge \delta\, \pi(y) \quad \text{for all } x, y \in S. \qquad (97)
Let θ := 1 − δ ∈ [0, 1). If δ = 1, then P^r(x, ·) = π for all x, so (96) holds with θ = 0 (i.e., the chain mixes completely in r steps). Hence we may assume θ ∈ (0, 1).
Let Q be the matrix satisfying the equation
P^r = (1 - \theta)\, \Pi + \theta\, Q. \qquad (98)
Multiplying by the column vector 1 and using P^r 1 = 1 = Π1 with θ > 0, it follows that Q1 = 1. Moreover, (97) implies that Q ≥ 0, so Q is itself a Markov transition matrix on S .¹¹
Next, we use induction to show that, for any k ≥ 1,
P^{rk} = (1 - \theta^k)\, \Pi + \theta^k Q^k. \qquad (99)
The above holds for k = 1 due to (98). For the induction step, note that MΠ = Π for any Markov transition matrix M on S and that ΠM = Π for any matrix M such that πM = π. In particular, ΠP^r = Π, Q^nΠ = Π, and ΠQ = Π (since πQ = π by (98)). Hence, assuming that (99) holds for k = n,
P^{r(n+1)} = P^{rn} P^r = \big[ (1 - \theta^n) \Pi + \theta^n Q^n \big] \big( (1 - \theta) \Pi + \theta Q \big) = (1 - \theta^{n+1})\, \Pi + \theta^{n+1} Q^{n+1}.
From (99), for any t = rk + j with 0 ≤ j < r,
P^{rk+j} - \Pi = \theta^k \big( Q^k P^j - \Pi \big). \qquad (100)
To finish, sum the absolute values of the entries in row x on both sides of (100) and divide by 2. By Proposition 6.4.2, this gives
\|P^{rk+j}(x, \cdot) - \pi\|_{TV} = \theta^k\, \|[Q^k P^j](x, \cdot) - \pi\|_{TV} \le \theta^k.
For the middle term, think of initializing a Markov chain with distribution Q^k(x, ·), evolving it for j steps according to P, and then comparing the resulting distribution with π. By definition, the TV distance between any two probability distributions is at most one, which gives the last inequality. Then taking α = θ^{1/r} and C = 1/θ finishes the proof. □
¹¹Equation (98) can be interpreted probabilistically as follows: in each block of r steps, flip an independent probability-θ coin; if heads, evolve the state according to Q, and if tails, sample the next state independently from π.
6.4.4. Coupling and total variation distance. One of the favorite arguments of probabilists involves the use of 'coupling'. The coupling technique is a powerful tool used to compare and relate different probability distributions. It involves constructing a joint distribution for two random variables on a single probability space such that each variable has a desired marginal distribution. By aligning the marginals in this way, coupling allows for direct comparison between distributions, enabling one to establish relationships, bounds, and convergence properties. Coupling is particularly valuable in the study of stochastic processes, where it facilitates the analysis of complex systems and of convergence behavior.
Definition 6.4.19 (Coupling). A coupling of two probability distributions µ and ν is a pair of random
variables (X , Y ) defined on a single probability space such that the marginal distribution of X is µ and
the marginal distribution of Y is ν. That is, a coupling (X, Y) satisfies P(X = x) = µ(x) and P(Y = y) = ν(y) for all x, y.
Example 6.4.20. Let µ and ν both be the "fair coin" measure giving weight 1/2 to the elements of {0, 1}.
(i) One way to couple µ and ν is to define (X , Y ) to be a pair of independent coins, so that P{X = x, Y =
y} = 1/4 for all x, y ∈ {0, 1}.
(ii) Another way to couple µ and ν is to let X be a fair coin toss and define Y = X . In this case, P{X = Y =
0} = 1/2, P{X = Y = 1} = 1/2, and P{X 6= Y } = 0.
Example 6.4.21. Let µ = Bernoulli(1/3) and ν = Bernoulli(1/2). Let U ∼ Uniform(0, 1) and define X = 1(U ≤ 1/3) and Y = 1(U ≤ 1/2). Then (X, Y) is a coupling of µ and ν and satisfies X ≤ Y almost surely.
Exercise 6.4.22 (Coupling between ER random graphs). Let µ = G(n, p) and ν = G(n, q) for 0 ≤ p ≤ q ≤ 1
(recall Def. 5.4.4). Construct a coupling (X , Y ) between µ and ν such that X ∼ G(n, p), Y ∼ G(n, q), and
X is a subgraph of Y almost surely.
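One possible construction for Exercise 6.4.22 (a sketch, not the unique answer): attach an independent Uniform(0,1) label to each potential edge and threshold it at p and at q.

```python
import numpy as np

n, p, q = 6, 0.3, 0.7                 # made-up parameters with p <= q
rng = np.random.default_rng(3)
U = rng.uniform(size=(n, n))
U = np.triu(U, 1)                     # one uniform label per unordered pair {i,j}, i < j
X = (U > 0) & (U <= p)                # upper-triangular adjacency of G(n,p)
Y = (U > 0) & (U <= q)                # upper-triangular adjacency of G(n,q)
print(np.all(Y[X]))                   # True: every edge of X is also an edge of Y
```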
Suppose µ, ν are probability distributions on a countable state space S . Given a coupling (X, Y) of µ and ν, if q is the joint distribution of (X, Y) on S × S , meaning that q(x, y) = P(X = x, Y = y), then the row and column marginals of q equal µ and ν, respectively. That is,
\sum_{y \in S} q(x, y) = \sum_{y \in S} P(X = x,\, Y = y) = P(X = x) = \mu(x) \quad \text{for } x \in S,
\sum_{x \in S} q(x, y) = \sum_{x \in S} P(X = x,\, Y = y) = P(Y = y) = \nu(y) \quad \text{for } y \in S.
Conversely, given a probability distribution q on the product space S × S with marginals µ and ν, there is a pair of random variables (X, Y) having q as their joint distribution, and consequently this pair (X, Y) is a coupling of µ and ν (why?). In summary, a coupling can be specified either by a pair of random variables (X, Y) defined on a common probability space or by a distribution q on S × S .
Every pair of distributions µ and ν possesses an independent coupling. Nonetheless, when µ and ν differ, it is impossible for X and Y to always take the same value. How close can a coupling come to making X and Y identical? The total variation distance provides the answer.
Proposition 6.4.23 (Coupling and TV distance). Let µ and ν be two probability distributions on S . Then
\|\mu - \nu\|_{TV} = \inf \{ P(X \ne Y) : (X, Y) \text{ is a coupling of } \mu \text{ and } \nu \}.
Furthermore, there exists a coupling that achieves the infimum above, which is called an 'optimal coupling'.
Proof. First, we note that for any coupling (X, Y) of µ and ν and any event A ⊆ S ,
\mu(A) - \nu(A) = P(X \in A) - P(Y \in A) \le P(X \in A,\, Y \notin A) \le P(X \ne Y).
It immediately follows that
\|\mu - \nu\|_{TV} \le \inf \{ P(X \ne Y) : (X, Y) \text{ is a coupling of } \mu \text{ and } \nu \}.
For the converse direction, we will construct a coupling that attains the infimum. Let p = \sum_{x \in S} \mu(x) \wedge \nu(x) \in [0, 1]. Flip a coin with probability of heads equal to p. If it lands heads, draw Z from the distribution p^{-1}(\mu \wedge \nu) and let X = Y = Z (this case never occurs if p = 0). Otherwise, draw X and Y independently from the distributions (1-p)^{-1}(\mu - \nu)\mathbf{1}(\mu > \nu) and (1-p)^{-1}(\nu - \mu)\mathbf{1}(\nu > \mu), respectively. It is easy to verify that X and Y have distributions µ and ν, respectively, and that X = Y if and only if the coin lands heads. Furthermore, note that
\sum_{x \in S} \mu(x) \wedge \nu(x) = \sum_{x : \mu(x) \le \nu(x)} \mu(x) + \sum_{x : \mu(x) > \nu(x)} \nu(x).
Adding and subtracting \sum_{x : \mu(x) > \nu(x)} \mu(x) on the right-hand side, we deduce
P(X \ne Y) = 1 - p = 1 - \sum_{x \in S} \mu(x) \wedge \nu(x) = \sum_{x : \mu(x) > \nu(x)} \big( \mu(x) - \nu(x) \big) = \|\mu - \nu\|_{TV}. \qquad □
This argument showcases the power of coupling. We were able to couple the two chains together in
such a way that X t ≤ Y t always, and from this fact about the random variables, we could easily derive
information about the distributions. ▲
Figure 6.4.4. Coupling between two random walks on Z_{≥0}. Time goes from left to right. Once they meet, they stay synchronized. Figure excerpted from [LP17].
Lemma 6.4.26 (Bounding TV distance by coalescence time). Let {(X_t, Y_t)} be a Markovian coupling for a transition matrix P for which X_0 = x and Y_0 = y. Then
\|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV} \le P_{x,y}(\tau_{couple} > t).
Proof. Notice that P^t(x, z) = P_{x,y}(X_t = z) and P^t(y, z) = P_{x,y}(Y_t = z). Consequently, (X_t, Y_t) is a coupling of P^t(x, ·) with P^t(y, ·). So by Prop. 6.4.23,
\|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV} \le P_{x,y}(X_t \ne Y_t) = P_{x,y}(\tau_{couple} > t).
This shows the assertion. □
We now prove the convergence theorem for irreducible and aperiodic Markov chains on countable
state spaces. The proof uses coupling.
Theorem 6.4.27 (Mixing for irreducible aperiodic chains). Let (X_t)_{t≥0} be an irreducible, aperiodic, and positive recurrent Markov chain on a countable state space S with transition matrix P. Then P has a unique stationary distribution π and, for each state x,
\|P^t(x, \cdot) - \pi\|_{TV} \to 0 \quad \text{as } t \to \infty.
Proof. The existence of the unique stationary distribution π for P follows from Theorem 6.3.18. For the convergence, we will couple two P-chains (X_t) and (Y_t), where X_0 = x and Y_0 ∼ π, moving independently according to P in each step until the first time τ_couple they meet, after which they stay synchronized and move together according to P. Let δ_x denote the Dirac delta mass at x and P_{δ_x ⊗ π} the probability measure for the joint chain (X_t, Y_t). By Prop. 6.4.23 and since X_t ∼ P^t(x, ·) and Y_t ∼ π,
\|P^t(x, \cdot) - \pi\|_{TV} \le P_{\delta_x \otimes \pi}(X_t \ne Y_t) \le P_{\delta_x \otimes \pi}(\tau_{couple} > t).
Therefore, it suffices to show that τ_couple < ∞ almost surely.
We will use aperiodicity to show that τcouple < ∞ almost surely. (Otherwise it is not true. Why?)
Consider first the ‘product chain’, a Markov chain on S × S with transition matrix Q given by
Q((x, y), (z, w)) = P (x, z) P (y, w) for all (x, y), (z, w) ∈ S × S .
This chain makes independent moves in the two coordinates, each according to the matrix P . Aperi-
odicity of P implies that Q is irreducible (Exc. 6.4.14). Hence, if (X t , Y t ) is a chain started with product
distribution µ ⊗ ν and run with transition matrix Q, then (X t ) is a Markov chain with transition matrix P
and initial distribution µ, and (Y t ) is a Markov chain with transition matrix P and initial distribution ν.
Note that
[(\pi \otimes \pi) Q](z, w) = \sum_{(x,y) \in S \times S} \pi(x) \pi(y) P(x, z) P(y, w) = \sum_{x \in S} \pi(x) P(x, z) \sum_{y \in S} \pi(y) P(y, w).
Since π = πP, the right-hand side equals π(z)π(w) = (π ⊗ π)(z, w). Thus π ⊗ π is a stationary distribution for Q. By Theorem 6.3.18, the chain (X_t, Y_t) is positive recurrent. In particular, for any fixed x_0, if
\tau := \min\{ t > 0 : (X_t, Y_t) = (x_0, x_0) \},
then by Prop. 6.2.10,
P_{x,y}(\tau < \infty) = 1 \quad \text{for all } x, y \in S.
It follows that
P_{x,\pi}(\tau < \infty) = \sum_{y \in S} \pi(y)\, P_{x,y}(\tau < \infty) = 1.
Thus if we initialize the product chain at δ_x ⊗ π (δ_x = point mass at x), then it eventually visits a fixed 'diagonal state' (x_0, x_0). Since τ_couple ≤ τ (why?), this is enough to conclude. □
Let τ denote the first time to collect n distinct types of coupons, where each draw picks one of the n types independently uniformly at random. Show that for each c > 0,
P(\tau \ge \lceil n \log n + cn \rceil) \le e^{-c}.
(Hint: Let A_i denote the event that the ith coupon is not selected among the first \lceil n \log n + cn \rceil draws. Then
P(\tau \ge \lceil n \log n + cn \rceil) \le P\Big( \bigcup_{i=1}^{n} A_i \Big) \le \sum_{i=1}^{n} P(A_i) \le \sum_{i=1}^{n} \Big( 1 - \frac{1}{n} \Big)^{\lceil n \log n + cn \rceil} \le n \exp(-\log n - c) = e^{-c}.)
6.5. Markov chain Monte Carlo

So far, we were given a Markov chain (X_t)_{t≥0} on a finite state space Ω and studied existence and uniqueness of its stationary distribution and convergence to it. In this section, we consider the reverse problem. Namely, given a distribution π on a sample space Ω, can we construct a Markov chain (X_t)_{t≥0} such that π is a stationary distribution? If in addition the chain is irreducible and aperiodic, then by the convergence theorem (Theorem 6.4.18), we know that the distribution π_t of X_t converges to π. Hence if we run the chain for long enough, the state of the chain is asymptotically distributed as π. In other words, we can sample a random element of Ω according to the prescribed distribution π by emulating it through a suitable Markov chain. This method of sampling is called Markov chain Monte Carlo (MCMC).
Figure 6.5.1. MCMC simulation of the Ising model on a 200 × 200 torus at temperature T = 1 (left), 2 (middle), and 5 (right).
Example 6.5.1 (Uniform distribution on regular graphs). Let G = (V, E) be a connected regular graph, meaning that all nodes have the same degree. Let µ be the uniform distribution on the node set V. How can we sample a random node according to µ? If we have a list of all nodes, then we can label them from 1 to N = |V|, choose a random number between 1 and N, and find the corresponding node. But oftentimes we do not have the full list of nodes, especially when we want to sample a random node from a social network. Given only local information (the set of neighbors of each given node), can we still sample a uniform random node from G?
One answer is given by random walk. Indeed, random walks on graphs are defined using only local information of the underlying graph: choose a random neighbor and move there. Moreover, since G is connected, there is a unique stationary distribution π for the walk, which is given by
\pi(x) = \frac{\deg_G(x)}{2|E|}.
Since G is regular, any two nodes have the same degree, so π(x) = π(y) for all x, y ∈ V. This means π equals the uniform distribution µ on V. Hence the sampling algorithm we propose is as follows:
(∗) Run a random walk on G for t ≫ 1 steps, and take the random node that the walk sits on.
However, there is a possible issue of convergence. Namely, if the graph G does not contain any odd cycle, then the random walk on G is periodic (see Exercise 6.4.12), so we are not guaranteed to have convergence. We can overcome this by using a lazy random walk instead, which is introduced in Exercise 6.5.2. We will see that the lazy RW on G is irreducible, aperiodic, and has the same set of stationary distributions as the ordinary RW on G. Hence we can apply the sampling algorithm (∗) above with the lazy random walk on G to sample a uniform random node in G. ▲
Exercise 6.5.2 (Lazy RW on graphs). Let G = (V, E) be a graph. Let (X_t)_{t≥0} be a Markov chain on the node set V with transition probabilities
P(X_{t+1} = j \mid X_t = i) = \begin{cases} 1/2 & \text{if } j = i \\ 1/(2 \deg_G(i)) & \text{if } j \text{ is adjacent to } i \\ 0 & \text{otherwise.} \end{cases}
This chain is called the lazy random walk on G. In words, the usual random walker on G now flips a fair coin to decide whether to stay at the same node or to move to one of its neighbors.
(i) Show that for any connected graph G, the lazy random walk on G is irreducible and aperiodic.
(ii) Let P be the transition matrix for the usual random walk on G. Show that the matrix
Q = \frac{1}{2}(P + I)
is the transition matrix for the lazy random walk on G.
(iii) For any distribution π on V, show that πQ = π if and only if πP = π. Conclude that the usual and lazy random walks on G have the same set of stationary distributions.
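A sketch of the sampling algorithm (∗) from Example 6.5.1, run with this lazy walk on a made-up 6-cycle (which is 2-regular and bipartite, so laziness is genuinely needed for aperiodicity):

```python
import numpy as np

neighbors = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}   # hypothetical 6-cycle
rng = np.random.default_rng(5)

def lazy_rw_sample(t_steps, x0=0):
    x = x0
    for _ in range(t_steps):
        if rng.uniform() < 0.5:                # stay put with probability 1/2
            continue
        nbrs = neighbors[x]
        x = nbrs[rng.integers(len(nbrs))]      # else move to a uniform random neighbor
    return x

counts = np.bincount([lazy_rw_sample(100) for _ in range(30_000)], minlength=6)
print(counts / counts.sum())                   # roughly uniform over the 6 nodes
```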
Example 6.5.3 (Finding local minima). Let G = (V, E) be a connected graph and let f : V → [0, ∞) be a 'cost' function. The objective is to find a node x* ∈ V at which f attains its global minimum. This problem has many applications in machine learning, for example. Note that if the domain V is very large, then an exhaustive search is too expensive to use.
Here is a simple form of the popular stochastic gradient descent algorithm, which lies at the heart of many important machine learning algorithms.
(i) Initialize the first guess X_0 = x_0 ∈ V.
(ii) Suppose X_t = x ∈ V is chosen. Let
D_t = \{ y \text{ a neighbor of } x \text{ or } x \text{ itself} \mid f(y) \le f(x) \}.
Define X_{t+1} to be a uniform random node from D_t.
There is a neat solution for finding the global minimum. The idea is to allow uphill moves with a small probability.
Example 6.5.4 (Finding global minimum). Let G = (V, E) be a connected regular graph and let f : V → [0, ∞) be a cost function. Let
V^* = \Big\{ x \in V \;\Big|\; f(x) = \min_{y \in V} f(y) \Big\}
denote the set of global minimizers of f. Fix a parameter λ ∈ (0, 1) and define a probability distribution π_λ on V by
\pi_\lambda(x) = \frac{\lambda^{f(x)}}{Z_\lambda},
where Z_\lambda = \sum_{x \in V} \lambda^{f(x)} is the normalizing constant. Since π_λ(x) is decreasing in f(x), it favors nodes x for which f(x) is small.
Let (X_t)_{t≥0} be a Markov chain on V whose transitions are defined as follows. If X_t = x, then let y be a uniform random neighbor of x. If f(y) ≤ f(x), then move to y; if f(y) > f(x), then move to y with probability λ^{f(y)−f(x)} and stay at x with probability 1 − λ^{f(y)−f(x)}. We analyze this MC below:
(i) (Irreducibility) Since G is connected and every move (either downhill or uphill) has positive probability, we can go from any node to any other in some number of steps. Hence the chain (X_t)_{t≥0} is irreducible.
(ii) (Aperiodicity) By (i) and Remark 6.4.13, all nodes have the same period. Moreover, let x ∈ V* be an arbitrary node where f attains its global minimum. Then all return times to x are possible, so x has period 1. Hence all nodes have period 1, so the chain is aperiodic.
(iii) (Stationarity) Here we show that π_λ is a stationary distribution of the chain. To do this, we first need to write down the transition matrix P. Namely, if we let A_G(y, z) denote the indicator that y and z are adjacent, then
P(x, y) = \begin{cases} \dfrac{A_G(x, y)}{\deg_G(x)} \min\big(1, \lambda^{f(y) - f(x)}\big) & \text{if } x \ne y \\ 1 - \sum_{y' \ne x} P(x, y') & \text{if } y = x. \end{cases}
Note that
\sum_{z \in V} \pi_\lambda(z) P(z, y) = \pi_\lambda(y) P(y, y) + \sum_{z \ne y} \pi_\lambda(z) P(z, y) = \pi_\lambda(y) - \sum_{z \ne y} \pi_\lambda(y) P(y, z) + \sum_{z \ne y} \pi_\lambda(z) P(z, y).
Hence it suffices to verify the detailed balance equation
\pi_\lambda(y) P(y, z) = \pi_\lambda(z) P(z, y) \qquad (101)
for any z ≠ y. Indeed, considering the two cases f(z) ≤ f(y) and f(z) > f(y), we have
\pi_\lambda(y) P(y, z) = \frac{\lambda^{f(y)}}{Z_\lambda} \frac{A_G(y, z)}{\deg_G(y)} \min\big(1, \lambda^{f(z) - f(y)}\big) = \frac{1}{Z_\lambda} \frac{A_G(y, z)}{\deg_G(y)} \lambda^{\max(f(y), f(z))},
\pi_\lambda(z) P(z, y) = \frac{\lambda^{f(z)}}{Z_\lambda} \frac{A_G(z, y)}{\deg_G(z)} \min\big(1, \lambda^{f(y) - f(z)}\big) = \frac{1}{Z_\lambda} \frac{A_G(z, y)}{\deg_G(z)} \lambda^{\max(f(y), f(z))}.
Now since A_G(z, y) = A_G(y, z) and G is regular (so deg_G(y) = deg_G(z)), this yields (101), as desired. Hence π_λ is a stationary distribution for the chain (X_t)_{t≥0}.
(iv) (Convergence) By (i), (iii), and Theorem 6.3.18, we see that π_λ is the unique stationary distribution for the chain X_t. Furthermore, by (i)–(iii) and Theorem 6.4.18, we conclude that the distribution of X_t converges to π_λ.
(v) (Global minimum) Let f^* = \min_{x \in V} f(x) be the global minimum of f. Note that by definition of V*, we have f(x) = f^* for any x ∈ V*. Then observe that
\lim_{\lambda \to 0} \pi_\lambda(x) = \lim_{\lambda \to 0} \frac{\lambda^{f(x)}}{\sum_{y \in V} \lambda^{f(y)}} = \lim_{\lambda \to 0} \frac{\lambda^{f(x)}/\lambda^{f^*}}{\sum_{y \in V} \lambda^{f(y)}/\lambda^{f^*}} = \lim_{\lambda \to 0} \frac{\lambda^{f(x) - f^*}}{|V^*| + \sum_{y \in V \setminus V^*} \lambda^{f(y) - f^*}} = \frac{\mathbf{1}(x \in V^*)}{|V^*|}.
Thus for λ very small, π_λ is approximately the uniform distribution on the set V* of all nodes where f attains its global minimum. ▲
Exercise 6.5.5 (Detailed Balance equation). Let P be a transition matrix of a Markov chain on state space
Ω. Suppose π is a probability distribution on Ω that satisfies the following detailed balance equation
π(x)P (x, y) = π(y)P (y, x) for all x, y ∈ Ω.
(i) Show that for all x ∈ Ω,
\sum_{y \in \Omega} \pi(y) P(y, x) = \sum_{y \in \Omega} \pi(x) P(x, y) = \pi(x) \sum_{y \in \Omega} P(x, y) = \pi(x).
(iii) Since we also want the Markov chain to converge fast, we want to choose the acceptance probability A(a, b) ∈ [0, 1] to be as large as possible for each a, b ∈ Ω. Show that the following choice (denoting α ∧ β := min(α, β))
A(x, y) = \frac{\pi(y) P(y, x)}{\pi(x) P(x, y)} \wedge 1 \quad \text{for all } x, y \in \Omega,\; x \ne y,
satisfies the condition in (ii), and that each A(x, y) is maximized for all x ≠ y.
(iv) Let (Y_t)_{t≥0} be a random walk on the 5-wheel graph G = (V, E) as shown in Figure 6.5.2. Show that π =
Brownian Motion
Below we give an axiomatic definition of Brownian motion, listing all the properties we wish to endow it with.
In conditions (i)-(iii) in the definition of Brownian motion, we specified the joint law of B (0) and all
the increments (B (t 1 ) − B (0), B (t 2 ) − B (t 1 ), . . . , B (t n ) − B (t n−1 )), for all 0 ≤ t 1 ≤ t 2 ≤ . . . ≤ t n . This specifies
the distribution of the tuple of ‘snapshots’ (B (t 1 ), B (t 2 ), . . . , B (t n )) of the Brownian motion for any given
sequence of times 0 ≤ t 1 < t 2 < · · · < t n . In other words, this specifies the ‘marginal distributions’ of
Brownian motion.
Definition 7.1.4 (Marginal distribution). By the marginal distributions of a stochastic process (B(t) : t ≥ 0), we mean the laws of all the finite-dimensional random vectors (B(t_1), B(t_2), . . . , B(t_n)), for all n ∈ N and all 0 ≤ t_1 ≤ t_2 ≤ . . . ≤ t_n.
Condition (iv) in Def. 7.1.3 concerns almost sure continuity of the 'sample paths' of the Brownian motion. We have defined Brownian motion as a stochastic process (B(t) : t ≥ 0), which is just a family of (uncountably many) random variables ω ↦ B(t, ω) defined on a single probability space (Ω, F , P). At the same time, a stochastic process can also be interpreted as a 'random function', with the sample functions defined by t ↦ B(t, ω), which we refer to as the 'sample path' with randomness specified by ω (see Fig. 7.0.1). The sample path properties of a stochastic process are the properties of these random functions, and it is these properties we will be most interested in in the study of Brownian motion.
Sample path properties (e.g., continuity and differentiability) goes beyond the marginal distributions
of the process in the sense above. For instance, the set {ω ∈ Ω : t 7→ B (t , ω) continuous} is, in general, not
in the σ-algebra generated by the random vectors (B (t 1 ), , . . . , B (t n )), for n ∈ N. Hence, if we are interested
in the sample path properties of a stochastic process, we may need to specify more than just its marginal
distributions.
Definition 7.1.5 (Sample path properties). Suppose C is a property that a function might or might not have,
like continuity, differentiability, etc. We say that a process (X(t) : t ≥ 0) has property C almost surely if
there exists A ∈ F such that P(A) = 1 and
\[
A \subseteq \{\omega \in \Omega : t \mapsto X(t, \omega) \text{ has property } C\}.
\]
(Note that the set on the right need not lie in F.)
7.1.2. Existence of Brownian motion: Lévy’s construction. In this section, we show the existence of
Brownian motion by constructing it. The first step is to reduce the problem to the construction of a 1D standard
Brownian motion (SBM).
In the remainder of this section, we will prove the following celebrated result on the existence of
SBM, due to Wiener (1923): standard Brownian motion exists. The argument uses Lévy’s construction of 1D SBM.
PROOF. The following is a famous construction of SBM due to Paul Lévy. We will construct Brownian
motion on the unit interval [0, 1] as a random element in the space C[0, 1] of continuous functions
[0, 1] → R. The key idea for ensuring continuity is to construct it as the uniform limit of continuous functions
over discrete sets of time points that become finer in the limit.
We will use the ‘dyadic partition’ of the unit interval [0, 1]. Define, for n ≥ 0,
\[
D_n := \{k 2^{-n} : 0 \leq k \leq 2^n\}.
\]
Namely, if we partition [0, 1] into 2^n intervals of equal lengths, then the endpoints of these intervals form D_n.
We will define the values of the BM on the points of D_n and linearly interpolate between them. We will then check
that the uniform limit of these continuous functions exists as n → ∞ and is a Brownian motion.
Step 1. Constructing a Gaussian process over dyadic points
Let D = ⋃_{n=0}^∞ D_n, and let (Ω, F, P) be a probability space on which a collection {Z_t : t ∈ D} of inde-
pendent, standard normally distributed random variables can be defined. Inductively on n ≥ 0, we will
define B(x) for all points x ∈ D. For n = 0, define B(0) := 0 and B(1) := Z_1. Having defined B on D_{n−1},
define B(x) for x ∈ D_n \ D_{n−1} by
\[
B(x) = \frac{B(x - 2^{-n}) + B(x + 2^{-n})}{2} + \frac{Z_x}{2^{(n+1)/2}} \qquad \text{for all } n \geq 1 \text{ and } x \in D_n \setminus D_{n-1}.
\]
This defines a process B on all points in D.
Next, we show that the process B on D is a ‘Gaussian process’ in the following sense. For each n ≥ 1,
(1) the vectors (B (x) : x ∈ Dn ) and (Z t : t ∈ D \ Dn ) are independent; and
(2) for all r < s < t in Dn , the random variable B (t ) − B (s) ∼ N (0, t − s) and is independent of B (s) −
B (r ).
From the above construction, it is clear that B (x) for x ∈ Dn is determined by the normal variables Z s for
s ∈ Dn . Since all these normal variables are independent, the first property follows.
Next, in order to show the second property above, it suffices to show that all increments B(x) − B(x −
2^{-n}) over intervals of length 2^{-n}, for x ∈ D_n \ {0}, are independent. To show this, it suffices to
show that they are pairwise independent, as the vector of these increments is Gaussian (why?).
First, we show that two increments of the form B(x) − B(x − 2^{-n}) and B(x + 2^{-n}) − B(x), for x ∈ D_n \ D_{n−1}, are
independent. Indeed, as ½[B(x + 2^{-n}) − B(x − 2^{-n})] depends only on (Z_t : t ∈ D_{n−1}), it is independent of
Z_x / 2^{(n+1)/2}. Both terms are normally distributed with mean zero and variance 2^{-(n+1)}. Hence, their sum
B(x) − B(x − 2^{-n}) and their difference B(x + 2^{-n}) − B(x) are independent and normally distributed with
mean zero and variance 2^{-n} by Exercise 7.1.8.
Second, the other possibility is that the two increments are over intervals separated by some x ∈ D_{n−1}.
Choose x ∈ D_j with this property and with j minimal, so that the two intervals are contained in [x − 2^{-j}, x]
and [x, x + 2^{-j}], respectively. By induction, the increments over these two intervals of length 2^{-j} are inde-
pendent, and the increments over the intervals of length 2^{-n} are constructed from the independent in-
crements B(x) − B(x − 2^{-j}) and B(x + 2^{-j}) − B(x), respectively, using disjoint sets of variables (Z_t : t ∈ D_n).
Hence, they are independent. This proves the second property and completes the induction step.
Step 2. Linear interpolation, partial sums representation, and pointwise limit
Next, having thus chosen the values of the process B on all dyadic points in D, we interpolate be-
tween them, thereby extending B to all points in [0, 1].
Formally, define
\[
F_0(t) =
\begin{cases}
Z_1 & \text{for } t = 1, \\
0 & \text{for } t = 0, \\
\text{linear} & \text{in between,}
\end{cases}
\]
and for each n ≥ 1, define
\[
F_n(t) =
\begin{cases}
2^{-(n+1)/2} Z_t & \text{for } t \in D_n \setminus D_{n-1}, \\
0 & \text{for } t \in D_{n-1}, \\
\text{linear} & \text{between consecutive points in } D_n.
\end{cases}
\]
These functions are piecewise linear, and hence continuous. Define a function B̃ : [0, 1] → R by
\[
\widetilde{B}(x) = \sum_{i=0}^{\infty} F_i(x). \tag{102}
\]
At this point, we have not shown that the infinite series on the RHS above converges at every point of
[0, 1]. But it does so trivially for dyadic points x ∈ D, since F_i(x) ≡ 0 for all i > n whenever x ∈ D_n, by construction.
We claim that for all n ≥ 0,
\[
B(x) = \sum_{i=0}^{n} F_i(x) = \sum_{i=0}^{\infty} F_i(x) = \widetilde{B}(x) \qquad \text{for all } x \in D_n. \tag{103}
\]
We have just shown that the second equality holds, and the third is just the definition (102). The first
equality can be seen by induction on n. It holds for n = 0 trivially. Suppose it holds for n − 1, and let
x ∈ D_n \ D_{n−1}. For 0 ≤ i ≤ n − 1, the function F_i is linear on [x − 2^{-n}, x + 2^{-n}], and x ± 2^{-n} ∈ D_{n−1}.
So by the induction hypothesis,
\[
\sum_{i=0}^{n-1} F_i(x) = \sum_{i=0}^{n-1} \frac{F_i(x - 2^{-n}) + F_i(x + 2^{-n})}{2}
= \frac{1}{2} \left( \sum_{i=0}^{n-1} F_i(x - 2^{-n}) + \sum_{i=0}^{n-1} F_i(x + 2^{-n}) \right)
= \frac{1}{2} \left( B(x - 2^{-n}) + B(x + 2^{-n}) \right).
\]
Since F_n(x) = 2^{-(n+1)/2} Z_x, this shows
\[
\sum_{i=0}^{n} F_i(x) = \sum_{i=0}^{n-1} F_i(x) + F_n(x) = \frac{B(x - 2^{-n}) + B(x + 2^{-n})}{2} + \frac{Z_x}{2^{(n+1)/2}} = B(x).
\]
This completes the induction, so we have shown that (103) holds for all n ≥ 0.
Step 4. Obtaining a uniform limit of continuous linear interpolations
Next, we show that the series defining B̃ in (102) converges uniformly over [0, 1]. Since
each partial sum is continuous, this will imply that the limit B̃ is continuous on [0, 1]. For this, we will
show that the supremum norm ‖F_n‖_∞ is small enough to be summable.
First note that for any s > 0,
\[
P(\|F_n\|_\infty \geq s) \leq P\left( 2^{-(n+1)/2} |Z_x| \geq s \text{ for some } x \in D_n \right).
\]
Hence, choosing s = c√n 2^{-n/2} for a constant c > 0, we have
\[
\sum_{n=0}^{\infty} P\left( \|F_n\|_\infty \geq c \sqrt{n}\, 2^{-n/2} \right)
\leq \sum_{n=0}^{\infty} P\left( |Z_x| \geq c \sqrt{n} \text{ for some } x \in D_n \right)
\leq \sum_{n=0}^{\infty} (2^n + 1)\, P\left( |Z_0| \geq c \sqrt{n} \right)
\leq \sum_{n=0}^{\infty} (2^n + 1) \exp\left( -c^2 n / 2 \right),
\]
where the second inequality follows from a union bound and |D_n| ≤ 2^n + 1, and the third inequality follows
from a tail bound on the Gaussian distribution (Exc. 7.1.9). The last sum converges as soon as c > √(2 log 2).
Fix such a c. By the Borel–Cantelli lemma, there exists a random (but almost surely finite) N such that
\[
\|F_n\|_\infty \leq c \sqrt{n}\, 2^{-n/2} \qquad \text{for all } n > N, \text{ almost surely.} \tag{104}
\]
This implies that, almost surely, the series
\[
\widetilde{B}(t) = \sum_{n=0}^{\infty} F_n(t)
\]
converges uniformly on [0, 1], as desired. Since B̃ agrees with B on D, we can identify them and extend
the values of B to the entire unit interval [0, 1] by B̃. Henceforth, we will write B for B̃.
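The construction above is easy to implement. The following minimal Python sketch of Steps 1–4 (the number of levels and the RNG seed are arbitrary choices, not from the notes) builds the values on the dyadic grid by midpoint refinement, which is exactly the partial sum of the F_n evaluated on D_{n_max}:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def levy_bm(n_max):
    """Partial sums of Levy's construction on the dyadic grid D_{n_max}."""
    B = np.array([0.0, rng.standard_normal()])   # B(0) = 0, B(1) = Z_1
    for n in range(1, n_max + 1):
        mid = (B[:-1] + B[1:]) / 2               # linear interpolation at midpoints
        mid += rng.standard_normal(mid.size) / 2 ** ((n + 1) / 2)
        new = np.empty(2 * B.size - 1)           # displacement by Z_x / 2^{(n+1)/2}
        new[0::2], new[1::2] = B, mid            # interleave old points and midpoints
        B = new
    t = np.linspace(0.0, 1.0, B.size)            # the grid D_{n_max}
    return t, B

t, B = levy_bm(12)                               # 2^12 + 1 points on [0, 1]
# Increments over disjoint dyadic intervals should be ~ N(0, 2^{-12}):
inc = np.diff(B)
print(inc.mean(), inc.var(), 2.0 ** -12)
\end{verbatim}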
Step 5. Verifying the marginal distributions of SBM
It remains to check that the increments of the limiting continuous process have the right marginal
distributions. This follows directly from the properties of B on the dense set D ⊂ [0, 1] and the continuity
of the paths. Indeed, suppose that t_1 < t_2 < . . . < t_n are in [0, 1]. We find t_{1,k} ≤ t_{2,k} ≤ . . . ≤ t_{n,k} in D with
lim_{k→∞} t_{i,k} = t_i, and infer from the continuity of B that, for 1 ≤ i ≤ n − 1,
\[
B(t_{i+1}) - B(t_i) = \lim_{k \to \infty} \left( B(t_{i+1,k}) - B(t_{i,k}) \right).
\]
As lim_{k→∞} E[B(t_{i+1,k}) − B(t_{i,k})] = 0 and
\[
\lim_{k \to \infty} \operatorname{Cov}\left( B(t_{i+1,k}) - B(t_{i,k}),\, B(t_{j+1,k}) - B(t_{j,k}) \right)
= \lim_{k \to \infty} \mathbf{1}(i = j)\, (t_{i+1,k} - t_{i,k}) = \mathbf{1}(i = j)\, (t_{i+1} - t_i),
\]
the increments B(t_{i+1}) − B(t_i) are, by Exc. 7.1.10, independent Gaussian random variables with mean 0
and variance t_{i+1} − t_i, as required.
Step 6. Extending to the whole real line
We have thus constructed a continuous process B : [0, 1] → R with the same marginal distributions as
Brownian motion. Take a sequence B_0, B_1, B_2, . . . of independent C[0, 1]-valued random variables with the
distribution of this process, and define (B(t) : t ≥ 0) by gluing together the parts. That is,
\[
B(t) = B_{\lfloor t \rfloor}(t - \lfloor t \rfloor) + \sum_{i=0}^{\lfloor t \rfloor - 1} B_i(1) \qquad \text{for } t \geq 0.
\]
This defines a continuous random function B : [0, ∞) → R, and one can see easily from what we have
shown so far that the requirements of a standard Brownian motion are fulfilled. □
Exercise 7.1.8. Let X, Y ∼ N(0, σ²) be independent normal random variables with mean zero and variance σ².
Show that X + Y, X − Y ∼ N(0, 2σ²) and that X + Y and X − Y are independent.
Exercise 7.1.9 (Tail bounds on the standard normal distribution). Suppose Z ∼ N(0, 1). Then for all x > 0,
\[
\frac{x}{x^2 + 1} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \leq P(Z > x) \leq \frac{1}{x} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.
\]
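For a quick numerical sanity check of these bounds, one can use the exact identity P(Z > x) = erfc(x/√2)/2; a minimal sketch (the test values of x are arbitrary):
\begin{verbatim}
import math

def gauss_tail(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))   # P(Z > x) exactly

for x in [0.5, 1.0, 2.0, 4.0]:
    phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    lower, upper = x / (x * x + 1.0) * phi, phi / x
    assert lower <= gauss_tail(x) <= upper
    print(f"x={x}: {lower:.3e} <= {gauss_tail(x):.3e} <= {upper:.3e}")
\end{verbatim}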
Exercise 7.1.10 (Gaussianity preserved under limits). Suppose {X n : n ∈ N} is a sequence of Gaussian
random vectors and limn→∞ X n = X almost surely. If b := limn→∞ E[X n ] and C := limn→∞ Cov[X n ] exist,
then X is Gaussian with mean b and covariance matrix C .
Exercise 7.1.11 (BM as a Gaussian process). A stochastic process (Y (t ) : t ≥ 0) is called a Gaussian pro-
cess if for all t 1 < t 2 < . . . < t n , the vector (Y (t 1 ), . . . , Y (t n )) is a Gaussian random vector (see Exc. 7.1.2).
Show that Brownian motion starting at x ∈ R is a Gaussian process.
7.1.3. Scaling invariance of Brownian motion. The behavior of Brownian motion is significantly influenced
by a fundamental property known as scaling invariance. This property describes a transformation
on the space of functions that alters individual Brownian random functions while preserving their distri-
bution.
Lemma 7.1.12 (Scaling invariance). Suppose (B(t) : t ≥ 0) is a standard Brownian motion and let a > 0.
Then the process (X(t) : t ≥ 0) defined by X(t) = (1/a) B(a²t) is also a standard Brownian motion.
PROOF. Continuity of the paths, independence, and stationarity of the increments remain unchanged
under the rescaling. It remains to observe that X(t) − X(s) = (1/a)(B(a²t) − B(a²s)) is normally distributed
with expectation 0 and variance (1/a²)(a²t − a²s) = t − s. □
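For illustration, the lemma is easy to probe by simulation: rescaling a discretized Brownian path by a should again produce standard Brownian increments. A minimal sketch (grid size, number of paths, and the value of a are arbitrary choices):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Simulate paths of B on [0, a^2] as cumulative sums of Gaussian increments.
a, n, paths = 2.0, 1024, 5_000
dt = a**2 / n
B = np.cumsum(rng.standard_normal((paths, n)) * np.sqrt(dt), axis=1)

# X(t) = B(a^2 t)/a should again be a standard BM; check two variances.
X1 = B[:, -1] / a                  # X(1) = B(a^2)/a
Xh = B[:, n // 2 - 1] / a          # X(1/2) = B(a^2/2)/a
print(X1.var(), (X1 - Xh).var())   # approximately 1.0 and 0.5
\end{verbatim}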
Remark 7.1.13. Scaling invariance has many useful consequences. For instance, let a < 0 < b, and look
at T(a, b) = inf{t ≥ 0 : B(t) = a or B(t) = b}, the first exit time of a one-dimensional standard Brownian
motion from the interval [a, b]. Then, with X(t) = (1/a) B(a²t), we have
\[
E[T(a, b)] = a^2\, E\left[ \inf\left\{ t \geq 0 : X(t) = 1 \text{ or } X(t) = \frac{b}{a} \right\} \right] = a^2\, E[T(b/a, 1)],
\]
so that the expected exit times from general intervals can be recovered from those of intervals with one
endpoint equal to 1.
A further useful invariance property of Brownian motion, invariance under time inversion, can be
easily identified. Similar to scaling invariance, this transformation on the space of functions alters indi-
vidual Brownian random functions without changing their distribution.
Theorem 7.1.14 (Time inversion of BM). Suppose (B(t) : t ≥ 0) is a standard Brownian motion. Then the
process (X(t) : t ≥ 0) defined by
\[
X(t) =
\begin{cases}
0 & \text{if } t = 0, \\
t\, B(1/t) & \text{if } t > 0
\end{cases}
\]
is also a standard Brownian motion.
PROOF. Recall that the finite-dimensional marginals (B(t_1), . . . , B(t_n)) of Brownian motion are Gaussian
random vectors (Exc. 7.1.11) and are therefore characterized by E[B(t_i)] = 0 and Cov(B(t_i), B(t_j)) = t_i
for 0 ≤ t_i ≤ t_j.
Obviously, (X(t) : t ≥ 0) is also a Gaussian process, and the Gaussian random vectors (X(t_1), . . . , X(t_n))
have expectation 0. The covariances, for t > 0 and h ≥ 0, are given by
\[
\operatorname{Cov}(X(t+h), X(t)) = \operatorname{Cov}\left( (t+h)\, B\left( \tfrac{1}{t+h} \right),\, t\, B\left( \tfrac{1}{t} \right) \right)
= t(t+h) \operatorname{Cov}\left( B\left( \tfrac{1}{t+h} \right), B\left( \tfrac{1}{t} \right) \right)
= t(t+h) \cdot \frac{1}{t+h} = t.
\]
Hence, the laws of all the finite-dimensional marginals (X(t_1), X(t_2), . . . , X(t_n)), for 0 ≤ t_1 ≤ . . . ≤ t_n, are the
same as for Brownian motion. The paths t ↦ X(t) are clearly continuous at all t > 0. At t = 0, we
use the following two facts. First, the distribution of X restricted to the rationals Q ∩ (0, ∞) is the same as for a Brownian
motion, hence
\[
\lim_{t \searrow 0,\ t \in \mathbb{Q}} X(t) = 0 \qquad \text{almost surely.}
\]
Second, since X is almost surely continuous on (0, ∞), the limit along the rationals determines the full limit,
so lim_{t↘0} X(t) = 0 = X(0) almost surely. □
Time inversion is a useful tool to relate the properties of Brownian motion in a neighborhood of time
t = 0 to its properties at infinity. To illustrate the use of time inversion, we exploit Theorem 7.1.14 to derive an
interesting statement about the long-term behavior from a trivial statement at the origin.
Corollary 7.1.16 (Law of large numbers). Almost surely, limt →∞ B (t )/t = 0.
PROOF. Let (X(t) : t ≥ 0) be as defined in Theorem 7.1.14. Using this theorem, we see that
\[
\lim_{t \to \infty} \frac{B(t)}{t} = \lim_{t \to \infty} X(1/t) = X(0) = 0. \qquad \Box
\]
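For illustration, the corollary is easily visible in simulation; a minimal sketch (sampling B at integer times as a cumulative sum of unit-variance increments; the horizon is an arbitrary choice):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# B at integer times: cumulative sum of i.i.d. N(0,1) increments.
B = rng.standard_normal(10**6).cumsum()
for t in (10**2, 10**4, 10**6):
    print(t, B[t - 1] / t)   # B(t)/t tends to 0
\end{verbatim}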
7.1.4. Modulus of continuity and nowhere differentiability of Brownian motion. The definition of
Brownian motion already requires that the sample functions are continuous almost surely. This implies
that on the interval [0, 1] (or any other compact interval), the sample functions are uniformly continu-
ous. That is, there exists some (random) function φ with limh&0 φ(h) = 0, called a modulus of continuity
of the function B : [0, 1] → R, such that
\[
\limsup_{h \searrow 0}\ \sup_{0 \leq t \leq 1-h} \frac{|B(t+h) - B(t)|}{\varphi(h)} \leq 1.
\]
Can we achieve such a bound with a deterministic function φ, i.e., is there a nonrandom modulus of
continuity for the Brownian motion? The answer is yes, as the following theorem shows.
Theorem 7.1.17. Let B ∼ BM(0, 0, 1) be an SBM. There exists a constant C > 0 such that, almost surely, for
every sufficiently small h > 0 and all 0 ≤ t ≤ 1 − h,
\[
|B(t+h) - B(t)| \leq C \sqrt{h \log(1/h)}.
\]
PROOF. This follows quite elegantly from Lévy's construction of Brownian motion. Recall the notation
introduced there, and that we have represented Brownian motion as a series
\[
B(t) = \sum_{n=0}^{\infty} F_n(t),
\]
where each F_n is a piecewise linear function (see (103)). Hence, for each t, t + h ∈ [0, 1] and any l ∈ N, using
the mean value theorem, we can write
\[
|B(t+h) - B(t)| \leq \sum_{n=0}^{\infty} |F_n(t+h) - F_n(t)| \leq \sum_{n=0}^{l} h \|F_n'\|_\infty + \sum_{n=l+1}^{\infty} 2 \|F_n\|_\infty. \tag{105}
\]
We will find that the RHS above is bounded above by C√(h log(1/h)) almost surely whenever h is suffi-
ciently small.
The derivative of F_n exists almost everywhere, and by definition,
\[
\|F_n'\|_\infty \leq \sup_{x \in [0,1] \setminus D_n} \frac{F_n(x + 2^{-n}) - F_n(x)}{2^{-n}} \leq \frac{2 \|F_n\|_\infty}{2^{-n}} = 2^{n+1} \|F_n\|_\infty.
\]
By (104), for any c > √(2 log 2), there exists an almost surely finite N ∈ N such that, for all n > N,
\[
\|F_n'\|_\infty \leq 2^{n+1} \|F_n\|_\infty \leq 2c \sqrt{n}\, 2^{n/2}.
\]
Hence, using (104) again, from (105) we deduce
\[
|B(t+h) - B(t)| \leq h \sum_{n=0}^{N} \|F_n'\|_\infty + 2ch \sum_{n=N+1}^{l} \sqrt{n}\, 2^{n/2} + 2c \sum_{n=l+1}^{\infty} \sqrt{n}\, 2^{-n/2}.
\]
We now suppose that h is (again random and) small enough that the first summand is smaller than
√(h log(1/h)), and for such h, choose l so that 2^{-l} < h ≤ 2^{-l+1}. By choosing h sufficiently small, we can
ensure that such l exceeds N. For this choice of l, the second and the third summands are also bounded by
a constant multiple of √(h log(1/h)), as both sums are dominated by their largest terms. Hence we get
the desired inequality with the deterministic function φ(h) = C√(h log(1/h)). □
This upper bound is pretty close to the optimal result. The following lower bound confirms that the
only missing bit is the precise value of the constant.
Theorem 7.1.18. Let B ∼ BM(0, 0, 1) be an SBM. For every constant c < √2 and each ε > 0,
\[
P\left( \text{there exist } h \in (0, \varepsilon) \text{ and } t \in [0, 1-h] \text{ s.t. } |B(t+h) - B(t)| > c \sqrt{h \log(1/h)} \right) = 1.
\]
PROOF. Let c < √2 and define, for integers k, n ≥ 0, the events
\[
A_{k,n} = \left\{ B\left( (k+1) e^{-n} \right) - B\left( k e^{-n} \right) > c \sqrt{n}\, e^{-n/2} \right\}.
\]
Denoting h = e^{-n} and x = k e^{-n}, we can rewrite the above event as
\[
A_{k,n} = \left\{ B(x+h) - B(x) > c \sqrt{h \log(1/h)} \right\}.
\]
Thus, for each ε > 0 and all n large enough that e^{-n} < ε,
\[
P\left( \text{for all } h \in (0, \varepsilon) \text{ and } t \in [0, 1-h],\ |B(t+h) - B(t)| \leq c \sqrt{h \log(1/h)} \right)
\leq P\left( \bigcap_{k=0}^{\lfloor e^n \rfloor - 1} A_{k,n}^c \right)
\leq \left( 1 - P(A_{0,n}) \right)^{\lfloor e^n \rfloor}
\leq \exp\left( -\lfloor e^n \rfloor\, P(A_{0,n}) \right),
\]
where the second inequality follows since the increments defining A_{k,n}, k ≥ 0, are i.i.d., and the last
inequality uses 1 − x ≤ e^{-x} for all x. Thus it remains to show e^n P(A_{0,n}) → ∞ as n → ∞. This follows easily
from scaling invariance (Lem. 7.1.12) and the Gaussian tail bound (Exc. 7.1.9):
\[
P(A_{k,n}) = P\left\{ B(e^{-n}) > c \sqrt{n}\, e^{-n/2} \right\} = P\left\{ B(1) > c \sqrt{n} \right\}
\geq \frac{c \sqrt{n}}{c^2 n + 1} \frac{1}{\sqrt{2\pi}}\, e^{-c^2 n / 2}.
\]
By our assumption on c, we have e^n P(A_{k,n}) → ∞ as n → ∞, as desired. □
One can determine the constant c in the best possible modulus of continuity φ(h) = c√(h log(1/h))
precisely. Indeed, our proof of the lower bound yields the value c = √2, which turns out to be optimal.
This striking result is due to Paul Lévy.
Theorem 7.1.19 (Lévy's modulus of continuity (1937)). Let B ∼ BM(0, 0, 1) be an SBM. Almost surely,
\[
\limsup_{h \to 0}\ \sup_{0 \leq t \leq 1-h} \frac{|B(t+h) - B(t)|}{\sqrt{2h \log(1/h)}} = 1.
\]
PROOF. Omitted. □
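For illustration, one can probe Lévy's modulus numerically: for a simulated path on a fine grid, the normalized maximal increment sup_t |B(t+h) − B(t)|/√(2h log(1/h)) should be of order 1 for small h. A minimal sketch (grid size and lags are arbitrary choices):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

n = 2**20                          # grid resolution on [0, 1]
dt = 1.0 / n
B = np.concatenate([[0.0], np.cumsum(rng.standard_normal(n) * np.sqrt(dt))])

for k in (2**6, 2**10, 2**14):     # lag k corresponds to h = k * dt
    h = k * dt
    sup_inc = np.abs(B[k:] - B[:-k]).max()
    print(f"h = {h:.1e}: ratio = {sup_inc / np.sqrt(2*h*np.log(1/h)):.3f}")
\end{verbatim}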
We now turn our attention to nowhere differentiability of BM.
Proposition 7.1.20 (Non-monotonicity of BM). Almost surely, for all 0 < a < b < ∞, Brownian motion is
not monotone on the interval [a, b].
PROOF. First fix an interval [a, b]. If [a, b] is an interval of monotonicity, i.e., if B(s) ≤ B(t) for all
a ≤ s ≤ t ≤ b, then picking numbers a = a_1 < . . . < a_{n+1} = b divides [a, b] into n sub-intervals [a_i, a_{i+1}],
and each increment B(a_{i+1}) − B(a_i) has to have the same sign. As the increments are independent, this has
probability 2 · 2^{-n}, and taking n → ∞ shows that the probability that [a, b] is an interval of monotonic-
ity must be 0. Taking a countable union gives that, almost surely, there is no interval of monotonicity with rational
endpoints; but each interval of monotonicity would contain a nontrivial monotone sub-interval with rational endpoints. □
We utilize the time-inversion technique to explore the differentiability of Brownian motion, connect-
ing differentiability at t = 0 to a long-term property. This idea complements the law of large numbers:
while Corollary 7.1.16 shows that Brownian motion grows more slowly than linearly, the subsequent proposi-
tion demonstrates that the fluctuations of B(t) exceed √t.
Proposition 7.1.21 (√n-fluctuations of BM). Almost surely,
\[
\limsup_{n \to \infty} \frac{B(n)}{\sqrt{n}} = +\infty \qquad \text{and} \qquad \liminf_{n \to \infty} \frac{B(n)}{\sqrt{n}} = -\infty.
\]
PROOF. By Fatou's lemma and scaling invariance,
\[
P\left( \limsup_{n \to \infty} \frac{B(n)}{\sqrt{n}} > c \right) \geq \limsup_{n \to \infty} P\left( \frac{B(n)}{\sqrt{n}} > c \right) = P(B(1) > c) > 0.
\]
Let X_n := B(n) − B(n − 1), and note that
\[
\left\{ \limsup_{n \to \infty} \frac{B(n)}{\sqrt{n}} > c \right\}
= \left\{ B(n) > c \sqrt{n} \text{ infinitely often} \right\}
= \left\{ \sum_{i=1}^{n} X_i > c \sqrt{n} \text{ infinitely often} \right\}
\]
is an exchangeable¹ event for the sequence of i.i.d. RVs X_1, X_2, . . . . Hence, by the Hewitt–Savage 0-1 law (Exc.
3.7.10), the above event has probability 0 or 1. Since its probability is positive by the first display, it must
occur with probability one. Taking the intersection over all positive integers c gives the first part of the
statement; the second part is proved analogously. □
For a function f, we define the upper and lower right derivatives as
\[
D^* f(t) := \limsup_{h \searrow 0} \frac{f(t+h) - f(t)}{h} \qquad \text{and} \qquad D_* f(t) := \liminf_{h \searrow 0} \frac{f(t+h) - f(t)}{h},
\]
respectively.
We now show that for any fixed time t, almost surely, Brownian motion is not differentiable at t. For
this, we use Proposition 7.1.21 and the invariance under time inversion.
Theorem 7.1.22 (Non-differentiability of BM at a fixed point). Fix t ≥ 0. Then, almost surely, Brownian
motion is not differentiable at t. Moreover, D* B(t) = +∞ and D_* B(t) = −∞.
PROOF. By translation invariance of BM, it suffices to show that BM is almost surely not differentiable at 0². Given
a standard Brownian motion B, we construct a further Brownian motion X by time inversion as in The-
orem 7.1.14. Then, by Prop. 7.1.21,
\[
D^* X(0) \geq \limsup_{n \to \infty} \frac{X(1/n) - X(0)}{1/n} = \limsup_{n \to \infty} n\, X(1/n)
\geq \limsup_{n \to \infty} \sqrt{n}\, X(1/n) = \limsup_{n \to \infty} \frac{B(n)}{\sqrt{n}} = \infty,
\]
where we used that n X(1/n) ≥ √n X(1/n) whenever X(1/n) ≥ 0, and that √n X(1/n) = B(n)/√n by the definition
of X. Similarly, D_* X(0) = −∞, showing that X is not differentiable at 0. Since X is an SBM, this completes
the proof. □
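Numerically, the blow-up of the difference quotients is already visible at a fixed time; a minimal sketch (using only that B(h) ∼ N(0, h), so |B(h) − B(0)|/h is distributed as |Z| h^{-1/2}; sample sizes are arbitrary):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# |B(h) - B(0)|/h is distributed as |Z| * h^{-1/2} with Z ~ N(0, 1),
# so the difference quotients at a fixed time diverge as h -> 0.
for h in (1e-2, 1e-4, 1e-6):
    q = np.abs(rng.standard_normal(100_000)) * np.sqrt(h) / h
    print(h, q.mean())   # grows like h^{-1/2}
\end{verbatim}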
The previous proof demonstrates that, for every fixed t, almost surely t is a point of non-differen-
tiability for Brownian motion. However, this does not imply that, almost surely, every t is a point of
non-differentiability for Brownian motion! The order of the quantifiers "for all t" and "almost surely"
in results like Theorem 7.1.22 is crucial. In this case, the assertion holds for all Brownian paths except
those within a set of probability zero, which might vary depending on t. The union of all such
sets of probability zero need not, in itself, be a set of probability zero. To illustrate this point, consider the
following example.
Example 7.1.23. The argument in the proof of Theorem 7.1.22 also shows that the Brownian motion X
crosses the level 0 at arbitrarily small times s > 0. Defining the level sets Z(t) = {s > 0 : X(s) = X(t)}, this shows
that every fixed t is almost surely an accumulation point from the right for Z(t). But not every point t ∈ [0, 1]
is an accumulation point from the right for Z(t): the last zero of {X(t) : t ≥ 0} before time
1 is, by definition, never an accumulation point from the right for Z(0). This example illustrates
¹An event that does not change under permutations of finitely many of the RVs; see Exc. 3.7.10.
²Fix t ≥ 0 and let Y(s) := B(t + s) − B(t), which defines a standard Brownian motion. Then non-differentiability of Y at zero
is equivalent to non-differentiability of B at t.
that there can be random exceptional times at which Brownian motion exhibits atypical behavior. These
times are so rare that any fixed (i.e., nonrandom) time is almost surely not of this kind. ▲
Brownian motion is by definition (and construction) almost surely continuous. But the following
classical result, due to Paley, Wiener, and Zygmund, shows that it is nowhere differentiable!
Theorem 7.1.24 (Paley, Wiener, and Zygmund 1933). Almost surely, Brownian motion is nowhere differ-
entiable. Furthermore, almost surely, for all t, either D* B(t) = +∞ or D_* B(t) = −∞, or both.
PROOF. Omitted. □
7.2. Brownian motion as a Markov process
For a Brownian motion B, consider the filtrations F⁰(t) := σ(B(s) : 0 ≤ s ≤ t) and F⁺(t) := ∩_{ε>0} F⁰(t + ε).
Clearly, the family (F⁺(t) : t ≥ 0) is again a filtration and F⁺(s) ⊇ F⁰(s), but intuitively F⁺(s) is a bit
larger than F⁰(s), since it allows an additional ‘infinitesimal glance into the future’. However, it turns out
that F⁺(s) still does not contain any information about the future.
Theorem 7.2.4 (Strict Markov property). For every s ≥ 0, the process {B (t + s) − B (s) : t ≥ 0} is independent
of F + (s).
Theorem 7.2.5 (Blumenthal's 0-1 law). Let x ∈ R^d and A ∈ F⁺(0). Then P_x(A) ∈ {0, 1}.
PROOF. Using Theorem 7.2.4 for s = 0, we see that any A ∈ σ(B(t) : t ≥ 0) is independent of F⁺(0).
This applies in particular to A ∈ F⁺(0), which is therefore independent of itself. Hence P_x(A) = P_x(A ∩
A) = P_x(A) P_x(A), which yields that P_x(A) is zero or one. □
As a first application, we show that a standard linear Brownian motion has positive and negative
values and zeros in every small interval to the right of 0. We have studied this remarkable property of
Brownian motion already by different means, in the discussion following Theorem 7.1.22.
Theorem 7.2.6. Suppose {B (t ) : t ≥ 0} is a linear Brownian motion. Define τ = inf{t > 0 : B (t ) > 0} and
σ = inf{t > 0 : B (t ) = 0}. Then
P0 {τ = 0} = P0 {σ = 0} = 1.
Theorem 7.2.7 (Kolmogorov's 0-1 law). Let x ∈ R^d and A ∈ T, the tail σ-algebra. Then P_x(A) ∈ {0, 1}.
PROOF. It suffices to consider the case x = 0. Under the time inversion of Brownian motion, the tail
σ-algebra is mapped onto the germ σ-algebra F⁺(0), which is trivial by Blumenthal's 0-1 law. □
As a final example of this section, we now exploit the Markov property to show that the set of local
extrema of a linear Brownian motion is a countable, dense subset of [0, ∞). We shall use the easy fact,
proved in Exercise 7.2.9, that every local maximum of Brownian motion is a strict local maximum.
Proposition 7.2.8. The set M of times where a linear Brownian motion assumes its local maxima is almost
surely countable and dense.
PROOF. Consider the function from the set of non-degenerate closed intervals with rational end-
points to R given by
\[
[a, b] \mapsto \inf\left\{ t \geq a : B(t) = \max_{a \leq s \leq b} B(s) \right\}.
\]
If x ∈ M is a local maximum, then by Exercise 7.2.9 it is a strict local maximum almost surely, so there
exist rationals a < b such that a < x < b and B(x) = max_{a≤s≤b} B(s)³. It follows that M is almost surely contained
in the image of the map defined above. Since the domain is countable, the image is also countable,
so M is countable almost surely.
To show that M is dense, recall that Brownian motion has no interval of increase or decrease almost
surely (Prop. 7.1.20). It follows that it almost surely has a local maximum in every non-degenerate inter-
val. □
Exercise 7.2.9. Show that every local maximum of a one-dimensional Brownian motion is almost surely a strict local
maximum. (Hint: a ‘flat’ local maximum over a small interval would require some Gaussian
increment to be exactly zero, which occurs with probability zero.)
7.2.2. The strong Markov property and the reflection principle. As we discussed for discrete-time Markov
chains, the Markov property of BM can be extended to the strong Markov property, so that a BM can be
restarted afresh at stopping times. Below we recall the definition of a stopping time, this time for continuous-time
processes.
Definition 7.2.10 (Stopping time). A random variable T with values in [0, ∞], defined on a probability
space with filtration (F (t ) : t ≥ 0), is called a stopping time if {T < t } ∈ F (t ), for every t ≥ 0. It is called a
strict stopping time if {T ≤ t } ∈ F (t ), for every t ≥ 0.
It is easy to see that every strict stopping time is also a stopping time. This follows from
\[
\{T < t\} = \bigcup_{n=1}^{\infty} \{T \leq t - n^{-1}\} \in \mathcal{F}(t).
\]
For certain nice filtrations, strict stopping times and stopping times agree. In order to put ourselves in this
situation, we are going to work with the ‘germ filtration’ (F⁺(t) : t ≥ 0) in the case of Brownian motion,
and refer to the notions of stopping time, etc., always with respect to this filtration. As this filtration is
larger than (F⁰(t) : t ≥ 0), our choice produces more stopping times.
The crucial property which distinguishes (F⁺(t) : t ≥ 0) from (F⁰(t) : t ≥ 0) is right-continuity, which
means that
\[
\bigcap_{\varepsilon > 0} \mathcal{F}^+(t + \varepsilon) = \mathcal{F}^+(t).
\]
To see this, note that
\[
\bigcap_{\varepsilon > 0} \mathcal{F}^+(t + \varepsilon) = \bigcap_{n=1}^{\infty} \bigcap_{k=1}^{\infty} \mathcal{F}^+(t + n^{-1} + k^{-1}) = \mathcal{F}^+(t),
\]
which implies that any stopping time with respect to any right-continuous filtration is automatically a
strict stopping time.
Theorem 7.2.11 (Every stopping time is strict w.r.t. a right-continuous filtration). Every stopping time
T with respect to the filtration (F + (t ) : t ≥ 0), or indeed with respect to any right-continuous filtration, is
automatically a strict stopping time.
PROOF. Suppose that T is a stopping time. Then
\[
\{T \leq t\} = \bigcap_{k=1}^{\infty} \{T < t + k^{-1}\} \in \bigcap_{k=1}^{\infty} \mathcal{F}(t + k^{-1}) \subseteq \bigcap_{k=1}^{\infty} \mathcal{F}^+(t + k^{-1}) = \mathcal{F}^+(t).
\]
Note that we have only used the right-continuity of (F⁺(t)) in the proof above (for the last identity). □
Example 7.2.12 (Discrete approximation of a continuous stopping time). Let T be a stopping time. For
each n ≥ 1, define a stopping time T_n by
\[
T_n := \min\{ m 2^{-n} : m \in \mathbb{N},\ T < m 2^{-n} \}.
\]
³The maximum is achieved a.s. over the compact set [a, b] since B is a.s. continuous.
In other words, we stop at the first time of the form k2−n after T . It is easy to see that Tn is a stopping
time. We will use it later as a discrete approximation to T . ▲
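In code, T_n is just T rounded up to the next strictly larger dyadic point of level n; a minimal sketch (the test value of T is arbitrary):
\begin{verbatim}
import math

def dyadic_approx(T, n):
    """Smallest dyadic point m * 2^-n strictly larger than T."""
    return (math.floor(T * 2**n) + 1) / 2**n

T = 0.7291
print([dyadic_approx(T, n) for n in (1, 2, 4, 8)])  # decreases to T from above
\end{verbatim}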
We define, for every stopping time T, the corresponding σ-algebra⁴
\[
\mathcal{F}^+(T) = \{ A \in \mathcal{F} : A \cap \{T < t\} \in \mathcal{F}^+(t) \text{ for all } t \geq 0 \}.
\]
This means that the part of A that lies in {T < t} should be measurable with respect to the information
available at time t. Heuristically, this is the collection of events that happened before the stopping time
T. As in the proof of the last theorem, we can infer that for right-continuous filtrations like our (F⁺(t) :
t ≥ 0), the event {T ≤ t} may replace {T < t} without changing the definition.
Theorem 7.2.13 (Strong Markov property of BM). Let B be a standard Brownian motion in R. For every
almost surely finite stopping time T , the process {B (T + t ) − B (T ) : t ≥ 0} is a standard Brownian motion
independent of F + (T ).
PROOF. We first show our statement for the stopping times T_n = min{m 2^{-n} : T < m 2^{-n}}, which dis-
cretely approximate T from above (see Ex. 7.2.12). Write B_k = {B_k(t) : t ≥ 0} for the Brownian motion
defined by B_k(t) = B(t + k 2^{-n}) − B(k 2^{-n}), and B* = {B*(t) : t ≥ 0} for the process defined by B*(t) =
B(t + T_n) − B(T_n). By Theorem 7.2.2, we know that B_k is an SBM, restarted at time k 2^{-n}. However, we do
not yet know whether B* is also an SBM, restarted at the stopping time T_n.
We wish to show that B* is an SBM independent of F⁺(T_n). Suppose that E ∈ F⁺(T_n); we will
show that B* is an SBM independent of E. Almost sure continuity of the sample paths of B* is clear, so
we only need to verify independent and Gaussian increments. To this end, we will show that, for every
event {B* ∈ A}⁵, we have
\[
P(\{B^* \in A\} \cap E) = P(B \in A)\, P(E). \tag{106}
\]
This would be enough to conclude. To show (106), first note that
\[
P(\{B^* \in A\} \cap E) = \sum_{k=0}^{\infty} P\left( \{B_k \in A\} \cap E \cap \{T_n = k 2^{-n}\} \right)
= \sum_{k=0}^{\infty} P(B_k \in A) \cdot P\left( E \cap \{T_n = k 2^{-n}\} \right),
\]
using that {B_k ∈ A} is independent of E ∩ {T_n = k 2^{-n}} ∈ F⁺(k 2^{-n}) by the (strict) Markov property (Thm.
7.2.4). Now, by the Markov property (Thm. 7.2.2), B_k and B have the same distribution as an SBM, so
P(B_k ∈ A) = P(B ∈ A). It follows that
\[
\sum_{k=0}^{\infty} P(B_k \in A) \cdot P\left( E \cap \{T_n = k 2^{-n}\} \right)
= P(B \in A) \sum_{k=0}^{\infty} P\left( E \cap \{T_n = k 2^{-n}\} \right)
= P(B \in A)\, P(E).
\]
This shows (106), as desired.
It remains to generalize this to a general stopping time T. As T_n ↘ T, we have that {B(s + T_n) − B(T_n) :
s ≥ 0} is a Brownian motion independent of F⁺(T_n) ⊇ F⁺(T). Hence the increments
\[
B(s + t + T) - B(t + T) = \lim_{n \to \infty} \left( B(s + t + T_n) - B(t + T_n) \right)
\]
of the process {B(r + T) − B(T) : r ≥ 0} are independent and normally
distributed with mean zero and variance s. As the process is obviously almost surely continuous, it is a
Brownian motion. Moreover, all increments, and hence the process itself, are independent of F⁺(T). □
There are numerous applications of the strong Markov property of Brownian motion. The next result,
the reflection principle, is particularly interesting. It states that Brownian motion
reflected at some stopping time T is still a Brownian motion. More formally:
⁴Compare with the one in Lem. 6.2.1.
⁵Here A is a Borel subset of R^{[0,∞)}, describing an event for the trajectory of a continuous-time process.
Theorem 7.2.14 (Reflection principle). Let T be a stopping time and (B(t) : t ≥ 0) a standard Brownian
motion. Then the process
\[
B^*(t) := B(t)\, \mathbf{1}(t \leq T) + (2 B(T) - B(t))\, \mathbf{1}(t > T), \qquad t \geq 0,
\]
called Brownian motion reflected at T, is also a standard Brownian motion. □
Now we specialize to the case of linear Brownian motion. Let M (t ) = max0≤s≤t B (s). A priori it is not
at all clear what the distribution of this random variable is, but we can determine it as a consequence of
the reflection principle.
Theorem 7.2.15 (Distribution of maximum of BM). Let B be a BM in R and let M (t ) = max0≤s≤t B (s)
denote the running maximum of B . If a > 0, then P{M (t ) ≥ a} = 2P{B (t ) ≥ a} = P{|B (t )| ≥ a}. That is, M (t )
and |B (t )| have the same distribution. In particular,
\[
P(M(t) \geq a) \leq \frac{\sqrt{2t}}{a \sqrt{\pi}} \exp\left( -\frac{a^2}{2t} \right).
\]
PROOF. Let T = inf{t ≥ 0 : B(t) = a}, and let {B*(t) : t ≥ 0} be Brownian motion reflected at T. Note
that
\[
\{M(t) \geq a\} = \{T \leq t\} = \{T \leq t,\ B(t) > a\} \sqcup \{T \leq t,\ B(t) \leq a\}
= \{B(t) > a\} \sqcup \{B^*(t) \geq a\},
\]
where the last equality holds since {B(t) > a} ⊆ {T ≤ t} by continuity of the paths, and on {T ≤ t} we have
B(t) ≤ a if and only if B*(t) = 2B(T) − B(t) = 2a − B(t) ≥ a. Since B* is a standard Brownian motion by the
reflection principle (Thm. 7.2.14), it follows that
\[
P\{M(t) \geq a\} = P\{B(t) > a\} + P\{B^*(t) \geq a\} = 2 P\{B(t) \geq a\} = P\{|B(t)| \geq a\}.
\]
The displayed tail bound then follows from Exc. 7.1.9 applied to B(t)/√t ∼ N(0, 1). □
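For illustration, the identity P{M(t) ≥ a} = 2P{B(t) ≥ a} can be checked by Monte Carlo on a discrete grid; a minimal sketch (grid size and sample count are arbitrary, and the discretization slightly undercounts the true maximum):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

t, n, paths, a = 1.0, 1024, 10_000, 1.0
dt = t / n
B = np.cumsum(rng.standard_normal((paths, n)) * np.sqrt(dt), axis=1)

lhs = (B.max(axis=1) >= a).mean()   # P(M(t) >= a), grid approximation
rhs = 2 * (B[:, -1] >= a).mean()    # 2 P(B(t) >= a)
print(lhs, rhs)                     # both near 2*(1 - Phi(1)) ~ 0.317
\end{verbatim}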
Note that S_N ↗ S as N → ∞. Hence, letting N → ∞ and using dominated convergence for conditional
expectations (Thm. 5.6.15) gives the result. □
Theorem 7.2.19 (L^p maximal inequality in continuous time). Suppose {X(t) : t ≥ 0} is a continuous sub-
martingale and p > 1. Then, for any t ≥ 0,
\[
E\left[ \left( \sup_{0 \leq s \leq t} X(s)^+ \right)^p \right] \leq \left( \frac{p}{p-1} \right)^p E\left[ (X(t)^+)^p \right].
\]
PROOF. Again, this is proved in discrete time in Theorem 5.5.5, and can be extended
by approximation. Fix N ∈ N and define a discrete-time submartingale by X_n = X(t n 2^{-N}), 0 ≤ n ≤ 2^N, with
respect to the filtration (G(n) : n ∈ N) given by G(n) = F(t n 2^{-N}). By the L^p maximal inequality,
\[
E\left[ \left( \sup_{1 \leq k \leq 2^N} X_k^+ \right)^p \right] \leq \left( \frac{p}{p-1} \right)^p E\left[ (X_{2^N}^+)^p \right]
= \left( \frac{p}{p-1} \right)^p E\left[ (X(t)^+)^p \right].
\]
Note that the LHS is non-decreasing in N and that, by continuity of the sample paths, sup_{1≤k≤2^N} X_k^+ ↗
sup_{0≤s≤t} X(s)^+ as N → ∞. Hence, letting N → ∞ and using monotone convergence gives the claim. □
We now use the martingale property and the optional stopping theorem to prove Wald’s lemmas for
Brownian motion. These results identify the first and second moments of the value of Brownian motion
at well-behaved stopping times.
Lemma 7.2.20 (Wald’s lemma for Brownian motion). Let {B (t ) : t ≥ 0} be a standard linear Brownian
motion, and T be a stopping time such that either
(i) E[T ] < ∞, or
(ii) {B (t ∧ T ) : t ≥ 0} is L 1 -bounded.
Then we have E[B (T )] = 0.
PROOF. The assertion holds under condition (ii) by the optional stopping theorem (Prop. 7.2.18).
Hence it is enough to show that condition (i) implies condition (ii). Suppose E[T] < ∞, and define
\[
M_k := \max_{0 \leq t \leq 1} |B(t + k) - B(k)|, \qquad \text{and} \qquad M := \sum_{k=1}^{\lceil T \rceil} M_k.
\]