
PROBABILITY THEORY 1 LECTURE NOTES

JOHN PIKE

These lecture notes were written for MATH 6710 at Cornell University in the Fall semester of 2013. They were revised in the Fall of 2015 and the schedule on the following page reflects that semester. These notes are for personal educational use only and are not to be published or redistributed. Almost all of the material, much of the structure, and some of the language comes directly from the course text, Probability: Theory and Examples by Rick Durrett. Gerald Folland's Real Analysis: Modern Techniques and Their Applications is the source of most of the auxiliary measure theory details. There are likely typos and mistakes in these notes. All such errors are the author's fault and corrections are greatly appreciated.

Day 1: Finished Section 1

Day 2: Finished Section 2

Day 3: Up to definition of semialgebra

Day 4: Finished Section 3

Day 5: Through review of integration

Day 6: Through Limit Theorems

Day 7: Through Corollary 6.2

Day 8: Through Theorem 6.6

Day 9: Finished Section 6

Day 10: Through Corollary 7.1

Day 11: Through Example 7.5

Day 12: Through Theorem 7.4

Day 13: Finished Section 7

Day 14: Through Theorem 8.3

Day 15: Finished Section 8

Day 16: Through Theorem 9.2

Day 17: Through Example 10.2

Day 18: Up to Claim in 3-series Theorem

Day 19: Finished Section 10

Day 20: Through Theorem 11.1

Day 21: Through Theorem 11.4

Day 22: Finished Section 11

Day 23: Through Theorem 12.1

Day 24: Up to Theorem 12.4

Day 25: Through Lemma 13.2

Day 26: Through Theorem 13.3

Day 27: Finished Section 13

Day 28: Through Example 14.1

Day 29: Finished Section 14

Day 30: Through beginning of proof of Lemma 15.2

Day 31: Through Theorem 15.2

Day 32: Through Theorem 15.6

Day 33: Finished Section 16

Day 34: Through Proposition 17.2

Day 35: Started proof of Theorem 18.1 (skipped Wald 2)

Day 36: Through Theorem 18.4

Day 37: Through Lemma 19.3 (skipped Chung-Fuchs)

Day 38: Finished Section 19

1. Introduction

Probability Spaces.
A probability space is a measure space (Ω, F, P ) with P (Ω) = 1.

The sample space Ω can be any set, and it can be thought of as the collection of all possible outcomes of

some experiment or all possible states of some system. Elements of Ω are referred to as elementary outcomes.

The σ-algebra (or σ-field) F ⊆ 2^Ω satisfies

1) F is nonempty
2) E ∈ F ⇒ E^C ∈ F
3) For any countable collection {E_i}_{i∈I} ⊆ F, ∪_{i∈I} E_i ∈ F.

Elements of F are called events, and can be regarded as sets of elementary outcomes about which one can say
something meaningful. Before the experiment has occurred (or the observation has been made), a meaningful

statement about E∈F is P (E). Afterward, a meaningful statement is whether or not E occurred.

The probability measure P : F → [0, 1] satisfies

1) P(Ω) = 1
2) for any countable disjoint collection {E_i}_{i∈I}, P(∪_{i∈I} E_i) = Σ_{i∈I} P(E_i).

The interpretation is that P (A) represents the chance that event A occurs (though there is no general

consensus about what that actually means).

If p is some property and A = {ω ∈ Ω : p(ω) is true} is such that P(A) = 1, then we say that p holds almost surely, or a.s. for short. This is equivalent to almost everywhere in measure theory. Note that it is possible to have an event E ∈ F with E ≠ ∅ and P(E) = 0. Thus, for instance, there is a distinction between impossible and with probability zero as discussed in Example 1.3 below.

Example 1.1. Rolling a fair die: Ω = {1, 2, 3, 4, 5, 6}, F = 2^Ω, P(E) = |E|/6.

Example 1.2. Flipping a (possibly biased) coin: Ω = {H, T }, F = 2Ω = {∅, {H}, {T }, {H, T }}, P satises

P ({H}) = p and P ({T }) = 1 − p for some p ∈ (0, 1).

Example 1.3. Random point in the unit interval: Ω = [0, 1], F = B[0,1] = Borel Sets, P = Lebesgue measure.

The experiment here is to pick a real number between 0 and 1 uniformly at random. Generally speaking,

uniformity corresponds to translation invariance, which is the primary defining property of Lebesgue measure.

Observe that each outcome x ∈ [0, 1] has P ({x}) = 0, so the experiment must result in the realization of an

outcome with probability zero.

Example 1.4. Standard normal distribution: Ω = R, F = B, P(E) = (1/√(2π)) ∫_E e^{−x²/2} dx.

Example 1.5. Poisson distribution with mean λ: Ω = N ∪ {0}, F = 2^Ω, P(E) = Σ_{k∈E} e^{−λ} λ^k / k!.
Why Measure Theory.
Historically, probability was defined in terms of a finite number of equally likely outcomes (Example 1.1), so that |Ω| < ∞, F = 2^Ω, and P(E) = |E|/|Ω|.

When the sample space is countably infinite (Example 1.5), or finite but the outcomes are not necessarily equally likely (Example 1.2), one can speak of probabilities in terms of weighted outcomes by taking a function p : Ω → [0, 1] with Σ_{ω∈Ω} p(ω) = 1 and setting P(E) = Σ_{ω∈E} p(ω).

For most practical purposes, this can be generalized to the case where Ω ⊆ R by taking a weighting function f : Ω → [0, ∞) with ∫ f(x) dx = 1 and setting P(E) = ∫_E f(x) dx (Examples 1.3 and 1.4), but one must be careful since the integral is not defined for all sets E (e.g. Vitali sets*).

Those who have taken undergraduate probability will recognize p and f as p.m.f.s and p.d.f.s, respectively. In measure theoretic terms, f = dP/dm is the Radon-Nikodym derivative of P with respect to Lebesgue measure, m. Similarly, p = dP/dc where c is counting measure on Ω.

Measure theory provides a unifying framework in which these ideas can be made rigorous, and it enables

further extensions to more general sample spaces and probability functions.

Also, note that in the formal axiomatic construction of probability as a measure space with total mass 1,
there is absolutely no mention of chance or randomness, so we can use probability without worrying about

any philosophical issues.

Random Variables and Expectation.


Given a measurable space (S, G), we define an (S, G)-valued random variable to be a measurable function X : Ω → S. In this class, the unqualified term random variable will refer to the case (S, G) = (R, B).

We typically think of X as an observable, or a measurement to be taken after the experiment has been

performed.

An extremely useful example is given by taking any A ∈ F and defining the indicator function

1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 if ω ∈ A^C.

Note that if (Ω,F, P ) is a probability space and X is an (S, G)-valued random variable, then X induces

the pushforward probability measure µ = P ◦ X −1 on (S, G). Frequently, we will abuse notation and write

P (X ∈ B) = P (X −1 (B)) = P ({ω ∈ Ω : X(ω) ∈ B}) for µ(B).

X also induces the sub-σ -algebra σ(X) = {X −1 (E) : E ∈ G} ⊆ F . If we think of Ω as the possible outcomes
of an experiment and X as a measurement to be performed, then σ(X) represents the information we can

learn from that measurement.

In contrast to other areas of measure theory, in probability we are often interested in various sub-σ-algebras
F0 ⊆ F , which we think of in terms of information content.

For instance, if the experiment is rolling a six-sided die (Example 1.1), then F_0 = {∅, {1, 3, 5}, {2, 4, 6}, Ω} represents the information concerning the parity of the value rolled.
The expectation (or mean or expected value) of a real-valued random variable X on (Ω, F, P) is defined as E[X] = ∫_Ω X(ω) dP(ω) whenever the integral is well-defined.

Expectation is generally interpreted as a weighted average which gives the best guess for the value of the

random quantity X.
We will study random variables and their expectations in greater detail soon. For now, the point is that many familiar objects from undergraduate probability can be rigorously and simply defined using the language of measure theory.

That said, it should be emphasized that probability is not just the study of measure spaces with total mass 1. As useful and necessary as the rigorous measure theoretic foundations are, it is equally important to cultivate a probabilistic way of thinking whereby one conceptualizes problems in terms of coin tossing, card shuffling, particle trajectories, and so forth.

* An example of a subset of [0, 1] which has no well-defined Lebesgue measure is given by the following construction:

Define an equivalence relation on [0, 1] by x ∼ y if and only if x − y ∈ Q.

Using the axiom of choice, let E ⊆ [0, 1] consist of exactly one point from each equivalence class.

For q ∈ Q ∩ [0, 1), define E_q = E + q (mod 1). By construction, E_q ∩ E_r = ∅ for r ≠ q and ∪_{q∈Q∩[0,1)} E_q = [0, 1). Thus, by countable additivity, we must have

1 = m([0, 1)) = m(∪_{q∈Q∩[0,1)} E_q) = Σ_{q∈Q∩[0,1)} m(E_q).

However, Lebesgue measure is translation invariant, so m(E_q) = m(E) for all q.

We see that m(E) is not well-defined, as m(E) = 0 implies 1 = 0 and m(E) > 0 implies 1 = ∞.
The existence of non-measurable sets can be proved using slightly weaker assumptions than the axiom of

choice (such as the Boolean prime ideal theorem), but it has been shown that the existence of non-measurable

sets is not provable in Zermelo-Fraenkel alone.

In three or more dimensions, the Banach-Tarski paradox shows that in ZFC, there is no finitely additive measure defined on all subsets of Euclidean space which is invariant under translation and rotation. (The paradox is that one can cut a unit ball into five pieces and reassemble them using only rigid motions to obtain two disjoint unit balls.)

2. Preliminary Results

At this point, we need to establish some fundamental facts about probability measures and σ -algebras in

preparation for a discussion of probability distributions and to reacquaint ourselves with the style of measure

theoretic arguments.

Probability Measures.
The following simple facts are extremely useful and will be employed frequently throughout this course.

Theorem 2.1. Let P be a probability measure on (Ω, F).

(i) Complements: For any A ∈ F, P(A^C) = 1 − P(A).

(ii) Monotonicity: For any A, B ∈ F with A ⊆ B, P(A) ≤ P(B).

(iii) Subadditivity: For any countable collection {E_i}_{i=1}^∞ ⊆ F, P(∪_{i=1}^∞ E_i) ≤ Σ_{i=1}^∞ P(E_i).

(iv) Continuity from below: If A_i ↗ A (i.e. A_1 ⊆ A_2 ⊆ ... and ∪_{i=1}^∞ A_i = A), then lim_{n→∞} P(A_n) = P(A).

(v) Continuity from above: If A_i ↘ A = ∩_{i=1}^∞ A_i, then lim_{n→∞} P(A_n) = P(A).

Proof.
For (i), 1 = P(Ω) = P(A ⊔ A^C) = P(A) + P(A^C) by countable additivity.

For (ii), P(B) = P(A ⊔ (B \ A)) = P(A) + P(B \ A) ≥ P(A).

For (iii), we disjointify the sets by defining F_1 = E_1 and F_i = E_i \ (∪_{j=1}^{i−1} E_j) for i > 1, and observe that the F_i's are disjoint and ∪_{i=1}^n F_i = ∪_{i=1}^n E_i for all n ∈ N ∪ {∞}. Since F_i ⊆ E_i for all i, we have

P(∪_{i=1}^∞ E_i) = P(∪_{i=1}^∞ F_i) = Σ_{i=1}^∞ P(F_i) ≤ Σ_{i=1}^∞ P(E_i).

For (iv), set B_1 = A_1 and B_i = A_i \ A_{i−1} for i > 1, and note that the B_i's are disjoint with ∪_{i=1}^n B_i = A_n and ∪_{i=1}^∞ B_i = A. Then

P(A) = P(∪_{i=1}^∞ B_i) = Σ_{i=1}^∞ P(B_i) = lim_{n→∞} Σ_{i=1}^n P(B_i) = lim_{n→∞} P(∪_{i=1}^n B_i) = lim_{n→∞} P(A_n).

For (v), if A_1 ⊇ A_2 ⊇ ... and A = ∩_{i=1}^∞ A_i, then A_1^C ⊆ A_2^C ⊆ ... and A^C = (∩_{i=1}^∞ A_i)^C = ∪_{i=1}^∞ A_i^C, so it follows from (i) and (iv) that

P(A) = 1 − P(A^C) = 1 − lim_{n→∞} P(A_n^C) = lim_{n→∞} (1 − P(A_n^C)) = lim_{n→∞} P(A_n). □

Note that (ii)-(iv) hold for any measure space (S, G, ν), (v) is true for arbitrary measure spaces under the assumption that there is some A_i with ν(A_i) < ∞, and (i) holds for all finite measures upon replacing 1 with ν(S).
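For a finite sample space, the properties in Theorem 2.1 can be verified directly. A small sketch using the fair-die measure of Example 1.1, in exact rational arithmetic (function names are ours):

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))

def P(event):
    # Uniform measure on a fair die: P(E) = |E| / 6
    return Fraction(len(frozenset(event) & OMEGA), 6)

A, B = {1, 2}, {1, 2, 3, 4}
E1, E2 = {1, 2, 3}, {3, 4}

assert P(OMEGA - A) == 1 - P(A)        # (i) complements
assert P(A) <= P(B)                    # (ii) monotonicity, since A ⊆ B
assert P(E1 | E2) <= P(E1) + P(E2)     # (iii) subadditivity (strict here, as E1 ∩ E2 ≠ ∅)
```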
Sigma Algebras.
We now review some basic facts about σ-algebras. Our first observation is

Proposition 2.1. For any index set I , if {Fi }i∈I is a collection of σ-algebras on Ω, then so is ∩i∈I Fi .

It follows easily from Proposition 2.1 that for any collection of sets A ⊆ 2Ω , there is a smallest σ -algebra
containing A - namely, the intersection of all σ -algebras containing A.
This is called the σ -algebra generated by A and is denoted by σ(A).
Note that if F is a σ -algebra and A ⊆ F, then σ(A) ⊆ F .

An important class of examples are the Borel σ-algebras: If (X, T) is a topological space, then B_X = σ(T) is called the Borel σ-algebra. It is worth recalling that for the standard topology on R, the Borel sets are generated by any of the open intervals, closed intervals, half-open intervals, open rays, or closed rays.

Our main technical result about σ-algebras is Dynkin's π-λ Theorem, an extremely useful result which is often omitted in courses on measure theory. To state the result, we will need the following definitions.

Definition. A collection of sets P ⊆ 2^Ω is called a π-system if it is closed under intersection.

Definition. A collection of sets L ⊆ 2^Ω is called a λ-system if

(1) Ω ∈ L
(2) If A, B ∈ L and A ⊆ B, then B \ A ∈ L
(3) If A_n ∈ L with A_n ↗ A, then A ∈ L

Theorem 2.2 (Dynkin). If P is a π-system and L is a λ-system with P ⊆ L, then σ(P) ⊆ L.

Proof. We begin by observing that the intersection of any number of λ-systems is a λ-system, so for any collection A, there is a smallest λ-system ℓ(A) containing A. Thus it will suffice to show

a) ℓ(P) is a σ-algebra (since then σ(P) ⊆ ℓ(P) ⊆ L).

In fact, as one easily checks that a λ-system which is closed under intersection is a σ-algebra (A^C = Ω \ A, A ∪ B = (A^C ∩ B^C)^C, and ∪_{i=1}^n A_i ↗ ∪_{i=1}^∞ A_i), we need only demonstrate

b) ℓ(P) is closed under intersection.

To this end, define G_A = {B : A ∩ B ∈ ℓ(P)} for any set A. To complete the proof, we will first show

c) G_A is a λ-system for each A ∈ ℓ(P),

and then prove that b) follows from c).

To establish c), let A be an arbitrary member of ℓ(P). Then A = Ω ∩ A ∈ ℓ(P), so Ω ∈ G_A. Also, for any B, C ∈ G_A with B ⊆ C, we have A ∩ (C \ B) = (A ∩ C) \ (A ∩ B) ∈ ℓ(P) since A ∩ B, A ∩ C ∈ ℓ(P) and ℓ(P) is a λ-system, hence G_A is closed under subset differences. Finally, for any sequence {B_n} in G_A with B_n ↗ B, we have (A ∩ B_n) ↗ (A ∩ B) ∈ ℓ(P), so G_A is closed under countable increasing unions as well and thus is a λ-system.
It remains only to show that c) implies b). To see that this is the case, first note that since P is a π-system, P ⊆ G_A for every A ∈ P, so it follows from c) that ℓ(P) ⊆ G_A for every A ∈ P. In particular, for any A ∈ P, B ∈ ℓ(P), we have A ∩ B ∈ ℓ(P). Interchanging A and B yields: A ∈ ℓ(P) and B ∈ P implies A ∩ B ∈ ℓ(P). But this means that if A ∈ ℓ(P), then P ⊆ G_A, and thus c) implies that ℓ(P) ⊆ G_A. Therefore, it follows from the definition of G_A that for any A, B ∈ ℓ(P), A ∩ B ∈ ℓ(P). □

It is not especially important to commit the details of this proof to memory, but it is worth seeing once, and you should definitely know the statement of the theorem. Though it seems a bit obscure upon first encounter, we will use this result in a variety of contexts throughout the course. In typical applications, we show that a property holds on a π-system that we know generates the σ-algebra of interest. We then show that the collection of all sets for which the property holds is a λ-system in order to conclude that the property holds on the entire σ-algebra.

A related result which is probably more familiar to those who have taken measure theory is the monotone

class lemma used to prove Fubini-Tonelli.

Definition. A monotone class is a collection of subsets which is closed under countable increasing unions and countable decreasing intersections.

Like π -systems, λ-systems, and σ -algebras, the intersection of monotone classes is a monotone class, so it

makes sense to talk about the monotone class generated by a collection of subsets.

Lemma 2.1 (Monotone Class Lemma). If A is an algebra of subsets, then the monotone class generated by
A is σ(A).

Note that σ -algebras are λ-systems and λ-systems are monotone classes, but the converses need not be true.

3. Distributions

Recall that a random variable on a probability space (Ω, F, P) is a function X : Ω → R that is measurable with respect to the Borel sets.

Every random variable induces a probability measure µ on R (called its distribution) by

µ(A) = P(X⁻¹(A))

for all A ∈ B.
To check that µ is a probability measure, note that since X is a function, if A_1, A_2, ... ∈ B are disjoint, then so are {X ∈ A_1}, {X ∈ A_2}, ... ∈ F, hence

µ(∪_i A_i) = P({X ∈ ∪_i A_i}) = P(∪_i {X ∈ A_i}) = Σ_i P({X ∈ A_i}) = Σ_i µ(A_i).

The distribution of a random variable X is usually described in terms of its distribution function
F (x) = P (X ≤ x) = µ ((−∞, x]) .

In cases where confusion may arise, we will emphasize dependence on the random variable using subscripts

- i.e. µX , FX .

Theorem 3.1. If F is the distribution function of a random variable X, then

(i) F is nondecreasing
(ii) F is right-continuous (i.e. lim_{x→a⁺} F(x) = F(a) for all a ∈ R)
(iii) lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
(iv) If F(x⁻) = lim_{y→x⁻} F(y), then F(x⁻) = P(X < x)
(v) P(X = x) = F(x) − F(x⁻)

Proof.
For (i), note that if x ≤ y, then {X ≤ x} ⊆ {X ≤ y}, so F(x) = P(X ≤ x) ≤ P(X ≤ y) = F(y) by monotonicity.

For (ii), observe that if x ↘ a, then {X ≤ x} ↘ {X ≤ a}, and apply continuity from above.

For (iii), we have {X ≤ x} ↘ ∅ as x ↘ −∞ and {X ≤ x} ↗ Ω as x ↗ ∞.

For (iv), {X ≤ y} ↗ {X < x} as y ↗ x. (Note that the limit exists since F is monotone.)

For (v), {X = x} = {X ≤ x} \ {X < x}. □

In fact, the first three properties in Theorem 3.1 are sufficient to characterize a distribution function.

Theorem 3.2. If F : R → R satisfies properties (i), (ii), and (iii) from Theorem 3.1, then it is the distribution function of some random variable.

Proof. (Draw Picture)

Let Ω = (0, 1), F = B_{(0,1)}, P = Lebesgue measure, and define X : (0, 1) → R by

X(ω) = F⁻¹(ω) := inf{y ∈ R : F(y) ≥ ω}.

Note that properties (i) and (iii) ensure that X is well-defined.


To see that F is indeed the distribution function of X, it suffices to show that

{ω : X(ω) ≤ x} = {ω : ω ≤ F(x)}

for all x ∈ R, as this implies

P(X ≤ x) = P({ω : X(ω) ≤ x}) = P({ω : ω ≤ F(x)}) = F(x)

where the final equality uses the definition of Lebesgue measure and the fact that F(x) ∈ [0, 1].

Now if ω ≤ F(x), then x ∈ {y ∈ R : F(y) ≥ ω}, so X(ω) = inf{y ∈ R : F(y) ≥ ω} ≤ x. Consequently, {ω : ω ≤ F(x)} ⊆ {ω : X(ω) ≤ x}.

To establish the reverse inclusion, note that if ω > F(x), then properties (i) and (ii) imply that there is an ε > 0 such that F(x) ≤ F(x + ε) < ω. Since F is nondecreasing, it follows that x + ε is a lower bound for {y ∈ R : F(y) ≥ ω}, hence X(ω) ≥ x + ε > x. Therefore, {ω : ω ≤ F(x)}^C ⊆ {ω : X(ω) ≤ x}^C and thus {ω : X(ω) ≤ x} ⊆ {ω : ω ≤ F(x)}. □
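The map ω ↦ F⁻¹(ω) constructed in this proof is exactly the inverse transform sampling method. A numerical sketch using the exponential distribution F(x) = 1 − e^{−x}, for which the generalized inverse has a closed form (the names and setup are ours, not from the course text):

```python
import math
import random

def F(x):
    # Exponential(1) distribution function
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def X(omega):
    # X(omega) = inf{y : F(y) >= omega} = -log(1 - omega) for this F
    return -math.log(1.0 - omega)

random.seed(0)
samples = [X(random.random()) for _ in range(100_000)]

# Under Lebesgue measure on (0, 1), P(X <= x) should be close to F(x)
empirical = sum(s <= 1.0 for s in samples) / len(samples)
assert abs(empirical - F(1.0)) < 0.01
```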

Theorem 3.2 shows that any function satisfying properties (i) - (iii) gives rise to a random variable X, and

thus to a probability measure µ, the distribution of X. The following result shows that the measure is

uniquely determined.

Theorem 3.3. If F is a function satisfying (i)-(iii) in Theorem 3.1, then there is a unique probability measure µ on (R, B) with µ((−∞, x]) = F(x) for all x ∈ R.

Proof. Theorem 3.2 gives the existence of a random variable X with distribution function F. The measure

it induces is the desired µ.


To establish uniqueness, suppose that µ and ν both have distribution function F. Define

P = {(−∞, a] : a ∈ R}
L = {A ∈ B : µ(A) = ν(A)}.

Observe that for any a ∈ R, µ ((−∞, a]) = F (a) = ν ((−∞, a]), so P ⊆ L.


Also, for any a, b ∈ R, (−∞, a] ∩ (−∞, b] = (−∞, a ∧ b] ∈ P , hence P is a π -system.
Finally, L is a λ-system since

(1) µ(Ω) = 1 = ν(Ω), so Ω ∈ L.

(2) For any A, B ∈ L with A ⊆ B, we have

µ(B \ A) = µ(B) − µ(A) = ν(B) − ν(A) = ν(B \ A)

(by countable additivity and the definition of L), so B \ A ∈ L.

(3) If A_n ∈ L with A_n ↗ A, then

µ(A) = lim_{n→∞} µ(A_n) = lim_{n→∞} ν(A_n) = ν(A)

(by continuity from below and the definition of L), so A ∈ L.


Since the closed rays generate the Borel sets, the π-λ Theorem implies that B = σ(P) ⊆ L and thus µ(E) = ν(E) for all E ∈ B. □



To summarize, every random variable induces a probability measure on (R, B), every probability measure defines a function satisfying properties (i)-(iii) in Theorem 3.1, and every such function uniquely determines a probability measure.

Consequently, it is equivalent to give the distribution or the distribution function of a random variable. However, one should be aware that distributions/distribution functions do not determine random variables, even neglecting differences on null sets.

For example, if X is uniform on [−1, 1] (so that µ_X = (1/2) m|_{[−1,1]}), then −X also has distribution µ_X, but −X ≠ X almost surely.

When two random variables X and Y have the same distribution function, we say that they are equal in
distribution and write X =d Y .

Note that random variables can be equal in distribution even if they are defined on different probability spaces.
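The uniform example above can be seen empirically as well; a sketch comparing the empirical distribution functions of X and −X for X uniform on [−1, 1] (function names are ours):

```python
import random

random.seed(1)
xs = [random.uniform(-1.0, 1.0) for _ in range(200_000)]
neg_xs = [-x for x in xs]

def ecdf(samples, t):
    # Empirical distribution function of the sample at t
    return sum(s <= t for s in samples) / len(samples)

# X and -X have (essentially) the same distribution function...
for t in (-0.5, 0.0, 0.5):
    assert abs(ecdf(xs, t) - ecdf(neg_xs, t)) < 0.02

# ...yet X = -X only on the null event {X = 0}
assert all(x != -x for x in xs)
```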

Constructing Measures on R. (Brief Review)

It is worth mentioning that we kind of cheated in Theorem 3.2 since we assumed the existence of Lebesgue

measure. In fact, the standard derivation of Lebesgue measure in terms of Stieltjes measure functions implies

the results in Theorems 3.2 and 3.3. Presumably everyone has seen this argument before and since it is fairly

long, we will content ourselves with a brief outline.

Recall that an algebra of sets on S is a non-empty collection A ⊆ 2^S which is closed under complements and finite unions.

A premeasure µ_0 on A is a function µ_0 : A → [0, ∞] such that

1) µ_0(∅) = 0
2) If A_1, A_2, ... is a sequence of disjoint sets in A whose union also belongs to A, then µ_0(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ_0(A_i).

(If µ_0(S) < ∞, then 2) implies that µ_0(∅) = µ_0(∅) + µ_0(∅), which implies 1).)

An outer measure µ* on S is a function µ* : 2^S → [0, ∞] such that

i) µ*(∅) = 0
ii) µ*(A) ≤ µ*(B) if A ⊆ B
iii) µ*(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ µ*(A_i),

and a set A ⊆ S is said to be µ*-measurable if

µ*(E) = µ*(E ∩ A) + µ*(E ∩ A^C)

for all E ⊆ S.

It can be shown that if µ_0 is a premeasure on the algebra A, then the set function defined by

µ*(E) = inf{Σ_{i=1}^∞ µ_0(A_i) : A_i ∈ A, E ⊆ ∪_{i=1}^∞ A_i}

is an outer measure satisfying

a) µ*|_A = µ_0
b) Every set in A is µ*-measurable.

To obtain a measure, one then appeals to the Carathéodory Extension Theorem:

Theorem 3.4 (Carathéodory). If µ∗ is an outer measure on S , then the collection M of µ∗ -measurable sets
is a σ-algebra, and the restriction of µ∗ to M is a (complete) measure.

Finally, one can show that if µ_0 is σ-finite, then the measure µ = µ*|_M is the unique extension of µ_0 to M.
Using these ideas, one can construct Borel measures on R by taking any nondecreasing, right-continuous function F (called a Lebesgue-Stieltjes measure function) and defining a premeasure on the algebra

A = {∪_{i=1}^n (a_i, b_i] : −∞ ≤ a_i ≤ b_i ≤ ∞, (a_i, b_i] ∩ (a_j, b_j] = ∅, n ∈ N}

by

µ_0(∅) = 0,
µ_0(⊔_{i=1}^n (a_i, b_i]) = Σ_{i=1}^n [F(b_i) − F(a_i)].

Lebesgue measure is the special case F(x) = x.

The above construction is typical of how one builds premeasures on algebras from more elementary objects:

A semialgebra S is a nonempty collection of sets satisfying

i) A, B ∈ S implies A ∩ B ∈ S
ii) A ∈ S implies there exists a finite collection of disjoint sets A_1, ..., A_n ∈ S with A^C = ⊔_{i=1}^n A_i.

(Some authors require that S ∈ S as well and call the above a semiring. We will not worry about this

distinction as we are ultimately only concerned with the algebra S generates.)

An important example of a semialgebra on R is the collection of h-intervals, that is, sets of the form (a, b] or (a, ∞) or ∅ with −∞ ≤ a < b < ∞. On R^d, the collection of products of h-intervals, e.g. (a_1, b_1] × ··· × (a_d, b_d], is a semialgebra.

If S is a semialgebra, then one readily verifies that S̄ = {finite disjoint unions of sets in S} is an algebra (called the algebra generated by S). Note that this construction ensures that σ(S) = σ(S̄).

Given a semialgebra S and a function ν : S → [0, ∞) such that if A ∈ S is the disjoint union of A_1, ..., A_n ∈ S, then ν(A) = Σ_{i=1}^n ν(A_i), define ν̄ : S̄ → [0, ∞) by ν̄(⊔_{i=1}^m B_i) = Σ_{i=1}^m ν(B_i).

It is easy to check that ν̄ is well-defined, finite, and finitely additive on S̄.



To verify countable additivity (so that ν̄ is a premeasure on S̄), it suffices to show that if {B_n}_{n=1}^∞ is a sequence of sets in S̄ with B_n ↘ ∅, then ν̄(B_n) ↘ 0.

Indeed, if {A_i}_{i=1}^∞ is a countable collection of disjoint sets in S̄ such that A = ∪_{i=1}^∞ A_i ∈ S̄, then for any n ∈ N, B_n = ∪_{i=n}^∞ A_i = A \ ∪_{i=1}^{n−1} A_i belongs to the algebra S̄, so finite additivity implies that ν̄(A) = Σ_{i=1}^{n−1} ν̄(A_i) + ν̄(B_n).

Alternatively, one can show that ν̄ is countably additive on S̄ if ν is countably subadditive on S, that is, for every countable disjoint collection {A_i}_{i∈I} ⊆ S such that ⊔_{i∈I} A_i ∈ S, one has ν(⊔_{i∈I} A_i) ≤ Σ_{i∈I} ν(A_i). (The implication is immediate if ν is countably additive on S.)

Thus if one takes a finitely additive [0, ∞)-valued function ν on a semialgebra S, extends it in the obvious way to the function ν̄ on the algebra S̄, and then checks that ν̄ is countably additive, then the Carathéodory construction guarantees the existence of a unique measure µ on σ(S) which agrees with ν on S.


Classifying Distributions on R.
At this point, we recall the following denitions from measure theory:

Definition. If µ and ν are measures on (S, G), then we say that ν is absolutely continuous with respect to µ (and write ν ≪ µ) if ν(A) = 0 for all A ∈ G with µ(A) = 0.

Definition. If µ and ν are measures on (S, G), then we say that µ and ν are mutually singular (and write µ ⊥ ν) if there exist E, F ∈ G such that

i) E ∩ F = ∅
ii) E ∪ F = S
iii) µ(F) = 0 = ν(E).

A fundamental result in measure theory is the Lebesgue-Radon-Nikodym Theorem (which we state only for positive measures).

Theorem 3.5 (Lebesgue-Radon-Nikodym). If µ and ν are σ-finite measures on (S, G), then there exist unique σ-finite measures λ, ρ on (S, G) such that

λ ⊥ µ, ρ ≪ µ, ν = λ + ρ.

Moreover, there is a measurable function f : S → [0, ∞) such that ρ(E) = ∫_E f dµ for all E ∈ G.

The function f from Theorem 3.5 is called the Radon-Nikodym derivative of ρ with respect to µ, and one writes f = dρ/dµ (or dρ = f dµ).
If ν is a finite measure, then λ and ρ are finite, so f is µ-integrable.

If a random variable X has distribution µ which is absolutely continuous with respect to Lebesgue measure, then we say that (the distribution of) X has density function f = dµ/dm.

Thus for all E ∈ B, P(X ∈ E) = µ(E) = ∫_E f(x) dx. In particular, the distribution function of X can be written as

F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt.

Accordingly, F is an absolutely continuous function and is m-almost everywhere differentiable with F′ = f.


Conversely, if g is a nonnegative measurable function with ∫_R g(x) dx = 1, then G(x) = ∫_{−∞}^x g(t) dt satisfies (i)-(iii) in Theorem 3.1, so Theorem 3.2 gives a random variable with density g.

In undergraduate probability, such an X is called continuous. This is actually somewhat of a misnomer. Rather, we have

Definition. If the distribution of X has a density, then we say that X is absolutely continuous.
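The passage from a density g to the distribution function G can be carried out numerically. A sketch using the standard normal density of Example 1.4 and the trapezoid rule (the truncation point and names are our choices):

```python
import math

def g(t):
    # Standard normal density
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def G(x, lo=-10.0, n=50_000):
    # G(x) ≈ integral of g from -infinity to x; the mass below lo is negligible
    h = (x - lo) / n
    s = 0.5 * (g(lo) + g(x)) + sum(g(lo + i * h) for i in range(1, n))
    return s * h

# G behaves like a distribution function: nondecreasing with limits 0 and 1
assert G(-8.0) < 1e-6
assert abs(G(8.0) - 1.0) < 1e-5
assert G(0.0) < G(1.0)
assert abs(G(0.0) - 0.5) < 1e-5
```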

The other class of random variables discussed in undergraduate probability are discrete random variables.

Definition. A measure µ is said to be discrete if there is a countable set S with µ(S^C) = 0. A random variable is called discrete if its distribution is.

Note that if X is discrete, then µ⊥m.

An example of a discrete distribution is the point mass at a: P (X = a) = 1, F (x) = 1[a,∞) (x).

More generally, given any countable set S ⊂ R and any sequence of nonnegative numbers p_1, p_2, ... with Σ_{i=1}^∞ p_i = 1, if we enumerate S by S = {s_1, s_2, ...}, then the random variable X with P(X = s_i) = p_i, F(x) = Σ_{i=1}^∞ p_i 1_{[s_i,∞)}(x) is discrete, and indeed all discrete random variables are of this form. (Countable additivity implies that µ is determined by its values on singleton subsets of S.)

In the case S = Q and p_i > 0 for all i, we have a discrete random variable whose distribution function is discontinuous on a dense set.
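A discrete distribution function is a step function computed by summing point masses; a short sketch (function names ours):

```python
def discrete_cdf(atoms, masses, x):
    # F(x) = sum of p_i over atoms s_i <= x: a right-continuous step function
    return sum(p for s, p in zip(atoms, masses) if s <= x)

# Three atoms: P(X = -1) = 0.2, P(X = 0) = 0.5, P(X = 2) = 0.3
s, p = [-1, 0, 2], [0.2, 0.5, 0.3]

assert discrete_cdf(s, p, -2) == 0.0
assert abs(discrete_cdf(s, p, 0) - 0.7) < 1e-12   # atom at 0 included (right-continuity)
assert abs(discrete_cdf(s, p, 5) - 1.0) < 1e-12
# The jump at an atom recovers its mass: F(0) - F(0-) = 0.5
assert abs(discrete_cdf(s, p, 0) - discrete_cdf(s, p, -1e-9) - 0.5) < 1e-12
```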

If we think of summation as integration with respect to counting measure, then just as the absolutely continuous random variables correspond to densities (f ≥ 0 with ∫ f dm = 1), we see that the discrete random variables correspond to mass functions (p ≥ 0 with ∫ p dc = 1).

There is also a third fundamental class of random variables, which we almost never have to deal with, but mention for the sake of completeness. To describe it, we need another definition.

Definition. A measure µ is called continuous if µ({x}) = 0 for all x ∈ R.

By countable additivity, a discrete probability measure is not continuous and vice versa.

Absolutely continuous distributions are continuous, but it is possible for a continuous distribution to be

singular with respect to Lebesgue measure.

Definition. A random variable X with continuous distribution µ ⊥ m is called singular continuous.

An example is given by the uniform distribution on the Cantor set formed by taking [0, 1] and successively removing the open middle third of all remaining intervals. The distribution function is the Cantor function given by F(x) = 1/2 for x ∈ [1/3, 2/3], F(x) = 1/4 for x ∈ [1/9, 2/9], F(x) = 3/4 for x ∈ [7/9, 8/9], etc.
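The Cantor function is simple to evaluate from the ternary expansion of its argument. A sketch for x ∈ [0, 1) (our implementation, not from the course text):

```python
def cantor_F(x, depth=40):
    # Read ternary digits of x until the first 1; digits 0 and 2 become
    # binary digits 0 and 1, and the first 1 contributes a final binary 1.
    value, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)
        x -= digit
        if digit == 1:
            return value + scale
        value += scale * (digit // 2)
        scale /= 2
    return value

assert cantor_F(0.0) == 0.0
assert cantor_F(0.5) == 0.5    # any x in [1/3, 2/3] maps to 1/2
assert cantor_F(0.15) == 0.25  # any x in [1/9, 2/9] maps to 1/4
assert cantor_F(0.8) == 0.75   # any x in [7/9, 8/9] maps to 3/4
```

The resulting function is continuous and nondecreasing, yet it is constant off the Cantor set, which has Lebesgue measure 0: exactly the singular continuity just defined.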

Analogous to the singular/absolutely continuous decomposition in Theorem 3.5, we have the following result for finite Borel measures on R.

Theorem 3.6. Any finite Borel measure µ can be uniquely written as

µ = µ_d + µ_c

where µ_d is discrete and µ_c is continuous.


Proof. Let E = {x ∈ R : µ({x}) > 0}.

For any countable F ⊆ E, Σ_{x∈F} µ({x}) = µ(F) < ∞ by countable additivity and finiteness.

It follows that E_k = {x ∈ R : µ({x}) > k⁻¹} is finite for all k ∈ N.

Consequently, E = ∪_{k=1}^∞ E_k is a countable union of finite sets and thus is countable.

The result follows by defining µ_d(A) = µ(A ∩ E), µ_c(A) = µ(A ∩ E^C). □

(The proof is easily modified to accommodate σ-finite measures.)

Thus if µ is a probability distribution, then it follows from the Radon-Nikodym Theorem that µ = µ_ac + µ_s where µ_ac ≪ m and µ_s ⊥ m. By Theorem 3.6, µ_s = µ_d + µ_sc where µ_d is discrete and µ_sc is singular continuous. Since µ is a probability measure, each of µ_ac, µ_d, µ_sc is finite and thus is identically zero or a multiple of a probability measure. Accordingly, we have

Theorem 3.7. Every distribution is a convex combination of an absolutely continuous distribution, a discrete distribution, and a singular continuous distribution.

Remark. Theorem 3.7 is not especially useful in practice. Rather, we mention these facts because so many introductory texts make a big deal about distinguishing between discrete and continuous random variables. There are certainly important practical differences between the two, and it is worth knowing that more pathological examples exist as well. However, one of the advantages of the measure theoretic approach is a more unified perspective, and excessive focus on differences in detail can sometimes obscure the bigger picture.

4. Random Variables

In general, a random variable is a measurable function from (Ω, F) to some measurable space (S, G), but we have agreed to reserve the unqualified term for the case (S, G) = (R, B). If (S, G) = (R^d, B^d), we will say that X is a random vector.
We now collect some results that will help us establish that various quantities of interest are indeed random

variables.

Through a slight abuse of notation, we sometimes write X ∈ F to indicate that X is (F, B)-measurable.

Theorem 4.1. If A generates G (in the sense that G is the smallest σ-algebra containing A) and
X^{-1}(A) = {ω ∈ Ω : X(ω) ∈ A} ∈ F for all A ∈ A, then X is measurable.

Proof. Because X^{-1}(∪_i Ei) = ∪_i X^{-1}(Ei) and X^{-1}(E^C) = X^{-1}(E)^C, we have that
E = {E ⊆ S : X^{-1}(E) ∈ F} is a σ-algebra. Thus, since A ⊆ E and A generates G by assumption, G ⊆ E,
so X is measurable. □

The fact that inverses commute with set operations also shows that for any function X : Ω → S, if G is a
σ-algebra on S, then σ(X) = {X^{-1}(E) : E ∈ G} is a σ-algebra on Ω (called the σ-algebra generated by X).
By construction, it is the smallest σ-algebra on Ω that makes X measurable with respect to G.

Proposition 4.1. If A generates G, then X^{-1}(A) = {X^{-1}(A) : A ∈ A} generates σ(X).

Proof. (Homework)
Since A ⊆ G, the definition of σ(X) implies that X^{-1}(A) ⊆ σ(X) and thus σ(X^{-1}(A)) ⊆ σ(X).
On the other hand, Theorem 4.1 shows that X is measurable as a map from (Ω, σ(X^{-1}(A))) to (S, G),
so we must have that σ(X) ⊆ σ(X^{-1}(A)). □

Example 4.1. If (S, G) = (R, B), some useful generating sets are

A1 = {(−∞, x] : x ∈ R}, A2 = {(a, b) : a, b ∈ Q}.

Example 4.2. If (S, G) = (Rd , B d ), a convenient choice is

A = {(a1 , b1 ] × · · · × (ad , bd ] : −∞ < ai < bi < ∞}.

More generally, given an indexed collection of measurable spaces {(Sα, Gα)}_{α∈A}, the product σ-algebra
⊗_{α∈A} Gα on S = Π_{α∈A} Sα is generated by {πα^{-1}(Gα) : Gα ∈ Gα, α ∈ A}, where πα : S → Sα is projection
onto the α coordinate.
In other words, the product σ-algebra is the smallest σ-algebra for which the projections are measurable.
This is because we want a function taking values in the product space to be measurable precisely when its
components are measurable.
This is analogous to the definition of the product topology as the initial topology with respect to the
coordinate projections.

Proposition 4.2. If A is countable, then ⊗_{α∈A} Gα is generated by the rectangles {Π_{α∈A} Gα : Gα ∈ Gα}.
If, in addition, Gα is generated by Eα ∋ Sα for every α ∈ A, then ⊗_{α∈A} Gα is generated by
{Π_{α∈A} Eα : Eα ∈ Eα}.

Proof. If Gα ∈ Gα, then πα^{-1}(Gα) = Π_{β∈A} Gβ where Gβ = Sβ for all β ≠ α, hence

σ({πα^{-1}(Gα) : Gα ∈ Gα, α ∈ A}) ⊆ σ({Π_{α∈A} Gα : Gα ∈ Gα}).

On the other hand, Π_{α∈A} Gα = ∩_{α∈A} πα^{-1}(Gα), so

σ({Π_{α∈A} Gα : Gα ∈ Gα}) ⊆ σ({πα^{-1}(Gα) : Gα ∈ Gα, α ∈ A}).

The second statement will follow from the above argument once we show that ⊗_{α∈A} Gα is generated by
F1 = {πα^{-1}(Eα) : Eα ∈ Eα, α ∈ A}. To this end, observe that F1 ⊆ {πα^{-1}(Gα) : Gα ∈ Gα, α ∈ A} by
definition, so σ(F1) ⊆ ⊗_{α∈A} Gα.
On the other hand, arguing as in the proof of Theorem 4.1, we see that for each α ∈ A,
{E ⊆ Sα : πα^{-1}(E) ∈ σ(F1)} is a σ-algebra containing Eα (and thus Gα), so πα^{-1}(E) ∈ σ(F1) for all E ∈ Gα,
hence σ({πα^{-1}(Gα) : Gα ∈ Gα, α ∈ A}) ⊆ σ(F1) as well. □

Because the product and metric topologies coincide for R^n, Proposition 4.2 justifies Example 4.2.
(In general, one can show that if S1, ..., Sn are separable metric spaces and S = Π_{i=1}^n Si is equipped with
the product metric, then ⊗_{i=1}^n B_{Si} = B_S.)

A simple but extremely useful observation is

Theorem 4.2. If X : (Ω, F) → (S, G) and f : (S, G) → (T, E) are measurable maps, then
f(X) : (Ω, F) → (T, E) is measurable.

Proof. For any B ∈ E, f^{-1}(B) ∈ G since f is measurable, thus

{ω ∈ Ω : f(X(ω)) ∈ B} = {ω ∈ Ω : X(ω) ∈ f^{-1}(B)} ∈ F

since X is measurable. □

Theorem 4.2 is the familiar statement that compositions of measurable maps are measurable.

Thus if f : R → R is measurable (e.g. if f is any continuous function) and X is a random variable, then
f(X) is also a random variable.

An important application of Theorem 4.2 is given by

Theorem 4.3. If X1 , ..., Xn are random variables and f : (Rn , B n ) → (R, B) is measurable, then f (X1 , ..., Xn )
is a random variable.
Proof. In light of Theorem 4.2, it suces to show that (X1 , ..., Xn ) is a random vector. To this end, observe

that if A1 , ..., An are Borel sets, then

{(X1, ..., Xn) ∈ A1 × · · · × An} = ∩_{i=1}^n {Xi ∈ Ai} ∈ F.

Since B^n is generated by {A1 × · · · × An : Ai ∈ B}, the result follows from Theorem 4.1. □

Corollary 4.1. If X1, ..., Xn are random variables, then so are Sn = Σ_{i=1}^n Xi and Vn = Π_{i=1}^n Xi.

It is sometimes convenient to allow random variables to assume the values ±∞, and we observe that almost
all of our results generalize easily to (R̄, B̄) where R̄ = R ∪ {±∞} and B̄ = {E ⊆ R̄ : E ∩ R ∈ B}, which is
generated, for example, by rays of the form [−∞, a) with a ∈ R.

Theorem 4.4. If X1, X2, ... are random variables, then so are

inf_{n∈N} Xn,  sup_{n∈N} Xn,  lim inf_{n→∞} Xn,  lim sup_{n→∞} Xn.

Proof. For any a ∈ R, the infimum of a sequence is strictly less than a if and only if some term is strictly
less than a, hence

{inf_{n∈N} Xn < a} = ∪_{n∈N} {Xn < a} ∈ F,

hence inf_{n∈N} Xn is measurable since {[−∞, a) : a ∈ R} generates B̄.


To see that supn∈N Xn is a random variable, note that supn∈N Xn = − inf n∈N −Xn and f : x 7→ −x is

measurable.

Arguing as in the rst case, inf m≥n Xm is measurable for all m ∈ N, so it follows from the second case that
 
lim inf Xn = sup inf Xm
n→∞ n∈N m≥n

is a random variable. The lim sup case is similar. 

It follows from Theorem 4.4 that

{lim_{n→∞} Xn exists} = {lim inf_{n→∞} Xn = lim sup_{n→∞} Xn} = {lim inf_{n→∞} Xn − lim sup_{n→∞} Xn = 0}

is measurable since it is the preimage of {0} ∈ B under the map (lim inf_{n→∞} Xn) − (lim sup_{n→∞} Xn), which
is the difference of measurable functions and thus measurable.

When P ({limn→∞ Xn exists}) = 1, we say that the sequence converges almost surely to X := lim supn→∞ Xn ,
and write Xn → X a.s.

5. Expectation

Integration. (Brief Review)

First recall that the indicator function of a measurable set E is defined as

1_E(x) = 1 if x ∈ E,  1_E(x) = 0 if x ∉ E,

and a simple function φ = Σ_{i=1}^n a_i 1_{E_i} is a linear combination of indicator functions (where we may
assume that the coefficients are distinct).

A fundamental observation is that we can approximate a measurable function with simple functions by

partitioning the codomain.

Theorem 5.1. If (S, G) is a measurable space and f : S → [0, ∞] is measurable, then there is a sequence
{φn}_{n=1}^∞ of simple functions with 0 ≤ φ1 ≤ φ2 ≤ ... ≤ f such that φn → f pointwise, and the convergence is
uniform on any set on which f is bounded.

Proof. For n = 1, 2, ... and k = 0, 1, ..., 4^n − 1, define

E_n^k = f^{-1}( (k 2^{-n}, (k + 1) 2^{-n}] )  and  Fn = f^{-1}( (2^n, ∞] ),

and set

φn = Σ_{k=0}^{4^n − 1} k 2^{-n} 1_{E_n^k} + 2^n 1_{Fn}. □
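The dyadic approximation in the proof is easy to see in code. The sketch below uses a floor-based (left-closed interval) variant of the partition, which is equivalent for purposes of illustration; the choice of f and of the evaluation points is arbitrary.

```python
import math

def phi(f, n):
    """n-th dyadic simple-function approximation of a nonnegative f:
    round f(x) down to the nearest multiple of 2^-n, capped at 2^n."""
    def phi_n(x):
        fx = f(x)
        return 2 ** n if fx >= 2 ** n else math.floor(fx * 2 ** n) / 2 ** n
    return phi_n

f = lambda x: x ** 2          # an unbounded nonnegative function
for x in [0.3, 1.7, 5.0]:
    vals = [phi(f, n)(x) for n in range(1, 12)]
    # the approximations increase monotonically and converge pointwise
    assert all(a <= b for a, b in zip(vals, vals[1:]))
    assert abs(vals[-1] - f(x)) < 2 ** -10
```

Since f is bounded on any bounded interval, the convergence there is uniform, as the theorem asserts.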

Now let (S, G, µ) be a measure space. We construct the integral as follows:

(i) For any E ∈ G, ∫ 1_E dµ = µ(E).

(ii) For any simple function φ = Σ_{i=1}^n a_i 1_{E_i},

∫ φ dµ = Σ_{i=1}^n a_i ∫ 1_{E_i} dµ = Σ_{i=1}^n a_i µ(E_i)

with the convention that 0 · ∞ = 0.

(iii) For any measurable function f : S → [0, ∞],

∫ f dµ = sup { ∫ φ dµ : 0 ≤ φ ≤ f, φ is simple }.

(This is equal to lim_{n→∞} ∫ φn dµ with φn as in the proof of Theorem 5.1 by the MCT.)

(iv) For any measurable f : S → R with ∫ |f| dµ < ∞ (called an integrable function),

∫ f dµ = ∫ (f ∨ 0) dµ − ∫ (−f ∨ 0) dµ.

For f integrable and A ∈ G, we define the integral of f over A as ∫_A f dµ = ∫ f 1_A dµ.
When we wish to emphasize dependence on the argument, we write ∫ f dµ = ∫ f(x) dµ(x), or sometimes
∫ f(x) µ(dx).
´ ´ ´
Proposition 5.1. For any a, b ∈ R and any integrable functions f, g, ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ.
If f ≤ g a.e., then ∫ f dµ ≤ ∫ g dµ.
Definition.
If X is a random variable on (Ω, F, P) with X ≥ 0 a.s., then we define its expectation as E[X] = ∫ X dP,
which always makes sense, but may be +∞.
If X is an arbitrary random variable, then we can write X = X^+ − X^− where X^+ = max{0, X} and
X^− = max{0, −X} are nonnegative random variables.
If at least one of E[X^+], E[X^−] is finite, then we define E[X] = E[X^+] − E[X^−].
Note that E[X] may be defined even if X isn't an integrable function.
A trivial but extremely useful observation is that P(A) = E[1_A] for any event A ∈ F.

Inequalities.
Recall that a function ϕ:R→R is said to be convex if for every x, y ∈ R, λ ∈ [0, 1], we have

ϕ (λx + (1 − λ)y) ≤ λϕ(x) + (1 − λ)ϕ(y).

That is, given any two points x, y ∈ R, the line from (x, ϕ(x)) to (y, ϕ(y)) lies above the graph of ϕ.

Lemma 5.1. If ϕ : R → R is convex, then

(ϕ(y) − ϕ(x))/(y − x) ≤ (ϕ(z) − ϕ(x))/(z − x) ≤ (ϕ(z) − ϕ(y))/(z − y)

for every x < y < z.

Proof. (Homework)
Writing λ = (y − x)/(z − x) ∈ (0, 1), we have y = λz + (1 − λ)x, so it follows from convexity that
ϕ(y) ≤ λϕ(z) + (1 − λ)ϕ(x), and thus

ϕ(y) − ϕ(x) ≤ λ(ϕ(z) − ϕ(x)) = ((y − x)/(z − x))(ϕ(z) − ϕ(x)).

Dividing by y − x > 0 gives the first inequality.
Similarly, setting µ = (z − y)/(z − x) = 1 − λ ∈ (0, 1), we have y = µx + (1 − µ)z, so ϕ(y) ≤ µϕ(x) + (1 − µ)ϕ(z),
and thus

ϕ(y) − ϕ(z) ≤ µ(ϕ(x) − ϕ(z)) = ((z − y)/(z − x))(ϕ(x) − ϕ(z)),

hence

(ϕ(z) − ϕ(y))/(z − y) ≥ (ϕ(z) − ϕ(x))/(z − x). □

Lemma 5.2 (Supporting Hyperplane Theorem in R^2). If ϕ is a convex function, then for any c ∈ R, there
is a linear function l(x) which satisfies l(c) = ϕ(c) and l(x) ≤ ϕ(x) for all x ∈ R.

Proof. (Homework)
For any h > 0, taking x = c − h, y = c, z = c + h in Lemma 5.1, it follows from the outer inequality
that

(ϕ(c) − ϕ(c − h))/h ≤ (ϕ(c + h) − ϕ(c))/h.

Also, for any 0 < h1 < h2, we have c − h2 < c − h1 < c, so the second inequality in Lemma 5.1 shows that
(ϕ(c) − ϕ(c − h2))/h2 ≤ (ϕ(c) − ϕ(c − h1))/h1.
Similarly, since c < c + h1 < c + h2, the first inequality in Lemma 5.1 shows that
(ϕ(c + h2) − ϕ(c))/h2 ≥ (ϕ(c + h1) − ϕ(c))/h1.
Consequently, the one-sided derivatives exist and satisfy

ϕ'_l(c) := lim_{h→0+} (ϕ(c) − ϕ(c − h))/h ≤ lim_{h→0+} (ϕ(c + h) − ϕ(c))/h =: ϕ'_r(c).

Now let a ∈ [ϕ'_l(c), ϕ'_r(c)] and define the linear function l(x) = a(x − c) + ϕ(c). Clearly, l(c) = ϕ(c).
To see that l(x) ≤ ϕ(x) for all x ∈ R, note that if x < c, then x = c − k for some k > 0, so

l(x) − ϕ(x) = a(x − c) + ϕ(c) − ϕ(c − k) = −k ( a − (ϕ(c) − ϕ(c − k))/k ) ≤ 0

since (ϕ(c) − ϕ(c − k))/k ≤ ϕ'_l(c) ≤ a by monotonicity. The x > c case is similar. □

Theorem 5.2 (Jensen). If ϕ is a convex function and X is a random variable, then


ϕ (E[X]) ≤ E [ϕ(X)]

whenever the expectations exist.


Proof. Lemma 5.2 gives the existence of a function l(x) = ax + b which satisfies l(E[X]) = ϕ(E[X]) and
l(x) ≤ ϕ(x) for all x ∈ R.
By monotonicity and linearity, we have

E[ϕ(X)] = ∫ ϕ(X) dP ≥ ∫ l(X) dP = ∫ (aX + b) dP
= a ∫ X dP + b = aE[X] + b = l(E[X]) = ϕ(E[X]). □

The triangle inequality E|X| ≥ |E[X]| is an important special case.
(I remember the direction in Jensen's inequality by E[X^2] − E[X]^2 = Var(X) ≥ 0.)
A function is called strictly convex if the defining inequality is strict. For such functions, modifying the
preceding arguments where necessary shows that Jensen's inequality is strict unless X is a.s. constant.
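Since Jensen's inequality also holds for the empirical distribution of any finite sample, it is easy to illustrate numerically. The sketch below uses ϕ(x) = x^2, so the gap between the two sides is exactly the sample variance, matching the mnemonic above; the choice of distribution is arbitrary.

```python
import random

# Jensen's inequality for phi(x) = x^2: phi(E[X]) <= E[phi(X)].
random.seed(0)
xs = [random.uniform(-1, 3) for _ in range(100_000)]  # X ~ Uniform(-1, 3)

mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)

# phi(E[X]) <= E[phi(X)]; the gap is the (empirical) variance of X
assert mean ** 2 <= mean_sq
```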

To state the next inequality, we define the L^p norm of a random variable by ‖X‖_p = E[|X|^p]^{1/p} for p ∈ [1, ∞)
and ‖X‖_∞ = inf{M : P(|X| > M) = 0}.
We define L^p = L^p(Ω, F, P) = {X : ‖X‖_p < ∞} (where random variables X and Y define the same element
of L^p if they are equal almost surely), and one can prove that L^p is a Banach space for p ≥ 1.

Theorem 5.3 (Hölder). If p, q ∈ [1, ∞] with 1/p + 1/q = 1 (where 1/∞ = 0), then

‖XY‖_1 ≤ ‖X‖_p ‖Y‖_q.

Proof.
We first note that the result holds trivially if the right-hand side is infinity, and if ‖X‖_p = 0 or ‖Y‖_q = 0,
then |XY| = 0 a.s. Accordingly, we may assume that 0 < ‖X‖_p, ‖Y‖_q < ∞. In fact, since constants factor
out of L^p-norms, it suffices to establish the result when ‖X‖_p = ‖Y‖_q = 1.
Also, the case p = ∞, q = 1 (and symmetrically) is immediate since |X| ≤ ‖X‖_∞ a.s., thus

E|XY| ≤ E[‖X‖_∞ |Y|] = ‖X‖_∞ E|Y| = ‖X‖_∞ ‖Y‖_1.

Accordingly, we will assume henceforth that p, q ∈ (1, ∞).
Now fix y ≥ 0, and define the function ϕ : [0, ∞) → R by ϕ(x) = x^p/p + y^q/q − xy.
Since ϕ'(x) = x^{p−1} − y and ϕ''(x) = (p − 1)x^{p−2} > 0 for x > 0, ϕ attains its minimum at x0 = y^{1/(p−1)}.
Thus, as the conjugacy of p and q implies that 1/(p−1) + 1 = p/(p−1) = (1 − 1/p)^{−1} = q, we have that

ϕ(x) ≥ ϕ(x0) = x0^p/p + y^q/q − x0 y = y^{p/(p−1)}/p + y^q/q − y^{1/(p−1) + 1} = y^q (1/p + 1/q) − y^q = 0

for all x ≥ 0. It follows that x^p/p + y^q/q ≥ xy for every x, y ≥ 0.
In particular, taking x = |X|, y = |Y|, and integrating, we have

E|XY| = ∫ |X||Y| dP ≤ (1/p) ∫ |X|^p dP + (1/q) ∫ |Y|^q dP
= ‖X‖_p^p/p + ‖Y‖_q^q/q = 1/p + 1/q = 1 = ‖X‖_p ‖Y‖_q. □
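Hölder's inequality applies in particular to the empirical measure of any finite sample, so a simulation provides a quick sanity check. The sketch below uses the conjugate pair p = 3, q = 3/2 and a deliberately dependent pair (X, Y); all of these choices are arbitrary.

```python
import random

# Spot-check of Hölder's inequality E|XY| <= ||X||_p ||Y||_q
# for conjugate exponents p = 3, q = 3/2.
random.seed(1)
p, q = 3.0, 1.5
xs = [random.gauss(0, 1) for _ in range(50_000)]
ys = [x + random.gauss(0, 1) for x in xs]   # Y depends on X

n = len(xs)
lhs = sum(abs(x * y) for x, y in zip(xs, ys)) / n
norm_x = (sum(abs(x) ** p for x in xs) / n) ** (1 / p)
norm_y = (sum(abs(y) ** q for y in ys) / n) ** (1 / q)

assert abs(1 / p + 1 / q - 1) < 1e-12   # p and q are conjugate
assert lhs <= norm_x * norm_y           # Hölder for the empirical measure
```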

Some useful corollaries of Hölder's inequality are:

Corollary 5.1 (Cauchy-Schwarz). E|XY| ≤ √(E[X^2] E[Y^2]).

Alternate Proof. For all t ∈ R,

0 ≤ E[(|X| + t|Y|)^2] = E[X^2] + 2tE|XY| + t^2 E[Y^2] = q(t),

so q(t) has at most one real root and thus a nonpositive discriminant

(2E|XY|)^2 − 4E[X^2]E[Y^2] ≤ 0. □
Corollary 5.2. For any random variable X and any 1 ≤ r < s ≤ ∞, ‖X‖_r ≤ ‖X‖_s.
Therefore, we have the inclusion L^s ⊆ L^r.

Proof. For s = ∞, we have |X|^r ≤ ‖X‖_∞^r a.s., hence

‖X‖_r^r = ∫ |X|^r dP ≤ ∫ ‖X‖_∞^r dP = ‖X‖_∞^r.

For s < ∞, apply Hölder's inequality to |X|^r and 1 with p = s/r, q = s/(s − r) to get

‖X‖_r^r = E[|X|^r] ≤ ‖|X|^r‖_p ‖1‖_q = ( ∫ |X|^s dP )^{r/s} = ‖X‖_s^r. □

Note that for Corollary 5.2, it is important that our measure was finite.
Of course, we could also prove the inclusion by breaking up the integral according to whether |X| is greater
or less than 1, though we would not obtain the inequality in that case.

The proof of our last big inequality should be familiar from measure theory (convergence in L1 implies

convergence in measure).

Theorem 5.4 (Chebychev). For any nonnegative random variable X and any a > 0,

P(X ≥ a) ≤ E[X]/a.

Proof. Let A = {ω : X(ω) ≥ a}. Then

aP(X ≥ a) = a ∫ 1_A dP ≤ ∫ X1_A dP ≤ ∫ X dP = E[X]. □

Corollary 5.3. For any (S, G)-valued random variable X and any measurable function ϕ : S → [0, ∞),

P(ϕ(X) ≥ a) ≤ E[ϕ(X)]/a.

Some important cases of Corollary 5.3 for real-valued X are

• ϕ(x) = |x|: to control the probability that an integrable random variable is large.
• ϕ(x) = (x − E[X])^2: to control the probability that a random variable with finite variance is far
from its mean.
• ϕ(x) = e^{tx}: to establish exponential decay for random variables with moment generating functions
("concentration inequalities").
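Because x ≥ a·1{x ≥ a} for any nonnegative x, the Chebychev bound holds sample-by-sample as well, which makes it easy to visualize in a simulation. The sketch below uses X ~ Exponential(1) and a = 3, both arbitrary choices.

```python
import random

# Empirical illustration of P(X >= a) <= E[X]/a for nonnegative X.
random.seed(2)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # X ~ Exp(1)
a = 3.0

tail = sum(x >= a for x in xs) / len(xs)    # empirical P(X >= a)
bound = (sum(xs) / len(xs)) / a             # empirical E[X]/a

assert tail <= bound
# For Exp(1) the true tail is e^{-3} ~ 0.05 while the bound is ~ 1/3,
# so the inequality can be far from tight.
```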

Limit Theorems.
We now briefly recall the three main results for interchanging limits and integration. The proofs can be
found in any book on measure theory.

Theorem 5.5 (Monotone Convergence Theorem). If 0 ≤ Xn ↑ X, then E[Xn] ↑ E[X].

Theorem 5.6 (Fatou's Lemma). If Xn ≥ 0, then lim inf_{n→∞} E[Xn] ≥ E[lim inf_{n→∞} Xn].

Note that if Xn ≥ M, then Xn − M ≥ 0, so since constants behave nicely with respect to limits and
expectation, nonnegative can be relaxed to bounded below in the statement of Theorems 5.5 and 5.6.
Also, since Xn ↑ X if and only if (−Xn) ↓ (−X) and lim inf_{n→∞} Xn = − lim sup_{n→∞}(−Xn), one has
immediate corollaries for lim sups and for monotone decreasing sequences (provided that they are bounded
above).

Theorem 5.7 (Dominated Convergence Theorem). If Xn → X and there exists some Y ≥ 0 with E[Y ] < ∞
and |Xn | ≤ Y for all n, then E[Xn ] → E[X].

When Y is a constant, Theorem 5.7 is known as the bounded convergence theorem. In that case, it is
important that we're working on a finite measure space.

In each of these theorems, the assumptions need only hold almost surely since one can modify random
variables on null sets without affecting their expectations.

Change of Variables.
Though integration over arbitrary measure spaces is nice in theory, in order to actually compute expectations,
we will typically need to work in more familiar spaces like R^d.
The following change of variables theorem allows us to compute expectations by integrating functions of a
random variable against its distribution.

Theorem 5.8. Let X be a random variable taking values in the measurable space (S, G), and let µ = P ◦ X^{-1}
be the pushforward measure on (S, G).
If f is a measurable function from (S, G) to (R, B) such that f ≥ 0 or E|f(X)| < ∞, then

E[f(X)] = ∫_S f(s) dµ(s).

Proof. We will proceed by verifying the result in increasingly general cases paralleling the construction of
the integral.
To begin with, let B ∈ G and f = 1_B. Then

E[f(X)] = E[1_B(X)] = P(X ∈ B) = µ(B) = ∫_S 1_B(s) dµ(s) = ∫_S f(s) dµ(s).

Now suppose that f = Σ_{i=1}^n a_i 1_{B_i} is a simple function. Then by linearity and the previous case,

E[f(X)] = Σ_{i=1}^n a_i E[1_{B_i}(X)] = Σ_{i=1}^n a_i ∫_S 1_{B_i}(s) dµ(s) = ∫_S f(s) dµ(s).

If f ≥ 0, then Theorem 5.1 gives a sequence of simple functions φn ↑ f, so the previous case and two
applications of the MCT give

E[f(X)] = lim_{n→∞} E[φn(X)] = lim_{n→∞} ∫_S φn(s) dµ(s) = ∫_S f(s) dµ(s).

Finally, suppose that E|f(X)| < ∞, and set f^+(x) = max{f(x), 0}, f^−(x) = max{−f(x), 0}. Then
f^+, f^− ≥ 0, f = f^+ − f^−, and E[f^+(X)], E[f^−(X)] ≤ E|f(X)| < ∞, so it follows from the previous result
and linearity that

E[f(X)] = E[f^+(X)] − E[f^−(X)] = ∫_S f^+(s) dµ(s) − ∫_S f^−(s) dµ(s) = ∫_S f(s) dµ(s). □

In light of Theorem 5.8, if X is an integrable random variable on (Ω, F, P) with distribution µ, then

E[X] = ∫_Ω X dP = ∫_R x dµ(x).

If X has density f = dµ/dm, then for any measurable g : R → R with g ≥ 0 a.s. or ∫ |g| dµ < ∞,

E[g(X)] = ∫_{−∞}^{∞} g(x)f(x) dx.
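The density formula can be checked numerically by computing both sides of the change of variables: a Monte Carlo average on the probability-space side and a Riemann sum on the distribution side. The sketch below uses X ~ Exponential(1), with density f(x) = e^{-x} on [0, ∞), and g(x) = x^2, for which the exact value is E[X^2] = 2; both choices are arbitrary.

```python
import math
import random

random.seed(3)

# Monte Carlo estimate of E[g(X)] on the probability-space side:
n = 200_000
mc = sum(random.expovariate(1.0) ** 2 for _ in range(n)) / n

# Riemann-sum estimate of the integral of g(x) f(x) dx (the tail
# beyond x = 50 is negligible for this integrand):
dx = 0.001
quad = sum((k * dx) ** 2 * math.exp(-k * dx) * dx for k in range(1, 50_000))

assert abs(quad - 2.0) < 1e-2   # both sides are close to E[X^2] = 2
assert abs(mc - quad) < 0.06
```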

If X is a random variable, then for any k ∈ N, we say that E[X^k] is the kth moment of X.
The first moment E[X] is called the mean and is usually denoted E[X] = µ.
The mean is a measure of the center of the distribution of X.
If X has finite second moment E[X^2] < ∞, then we define the variance (or second central moment) of X as
Var(X) = E[(X − µ)^2].
The variance provides a measure of the dispersion of the distribution of X and is usually denoted
Var(X) = σ^2.
By linearity, we have the useful identity

Var(X) = E[(X − µ)^2] = E[X^2] − 2µE[X] + µ^2 = E[X^2] − E[X]^2.
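Since the identity is an algebraic consequence of linearity, it holds for any sample's empirical distribution, where both sides can be computed directly. A minimal sketch (the Gaussian parameters are arbitrary):

```python
import random

# Var(X) = E[(X - mu)^2] = E[X^2] - E[X]^2, checked on the empirical
# distribution of a sample; both sides agree up to floating-point error.
random.seed(6)
xs = [random.gauss(2.0, 3.0) for _ in range(10_000)]
n = len(xs)

mu = sum(xs) / n
var_central = sum((x - mu) ** 2 for x in xs) / n       # E[(X - mu)^2]
var_identity = sum(x * x for x in xs) / n - mu ** 2    # E[X^2] - E[X]^2

assert abs(var_central - var_identity) < 1e-8
```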

6. Independence

Heuristically, two objects are independent if information concerning one of them does not contribute to one's

knowledge about the other. The correct way to formally codify this notion in a manner amenable to proving

theorems is in terms of a sort of multiplication rule.

• Two events A and B are independent if P (A ∩ B) = P (A)P (B).


• Two random variables X and Y are independent if P (X ∈ E, Y ∈ F ) = P (X ∈ E)P (Y ∈ F ) for all

E, F ∈ B . (That is, if the events {X ∈ E} and {Y ∈ F } are independent.)

• Two sub-σ -algebras F1 and F2 are independent if for all A ∈ F1 , B ∈ F2 , the events A and B are

independent.

Observe that if A∈F has P (A) = 0 or P (A) = 1, then A is independent of every B ∈ F.


This also implies that if X is a.s. constant, then X is independent of every Y ∈ F.

An infinite collection of objects (sub-σ-algebras, random variables, events) is said to be independent if every
finite subcollection is independent, where

• Events A1, ..., An ∈ F are independent if for any I ⊆ [n], we have

P( ∩_{i∈I} Ai ) = Π_{i∈I} P(Ai).

• Random variables X1, ..., Xn are independent if for any choice of Ei ∈ B, i = 1, ..., n, we have

P(X1 ∈ E1, ..., Xn ∈ En) = Π_{i=1}^n P(Xi ∈ Ei).

• Sub-σ-algebras F1, ..., Fn are independent if for any choice of Ai ∈ Fi, i = 1, ..., n, we have

P( ∩_{i=1}^n Ai ) = Π_{i=1}^n P(Ai).

Note that σ-algebras and random variables are implicitly subject to the same subcollection constraint as
events since special cases of the definition include taking Ai = Ω, Ei = R for i ∈ I^C.

For any n ∈ N, it is possible to construct families of objects which are not independent, but every subcollection
of size m ≤ n satisfies the applicable multiplication rule. For example, just because a collection of events
A1, ..., An satisfies P(Ai ∩ Aj) = P(Ai)P(Aj) for all i ≠ j (called pairwise independence), it is not necessarily
the case that A1, ..., An is an independent collection of events.
(Flip two fair coins and let A = {1st coin heads}, B = {2nd coin heads}, C = {both coins same}.)
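The two-coin example is small enough to verify exhaustively by enumerating the four equally likely outcomes:

```python
from itertools import product

# Two fair coins: A = {1st heads}, B = {2nd heads}, C = {coins agree}
# are pairwise independent but not mutually independent.
omega = list(product("HT", repeat=2))     # uniform sample space, P = 1/4 each
P = lambda event: sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] == "H"
B = lambda w: w[1] == "H"
C = lambda w: w[0] == w[1]

for E, F in [(A, B), (A, C), (B, C)]:     # pairwise independence holds
    assert P(lambda w: E(w) and F(w)) == P(E) * P(F)

# ... but the triple intersection fails the product rule:
# P(A and B and C) = 1/4 while P(A)P(B)P(C) = 1/8.
assert P(lambda w: A(w) and B(w) and C(w)) != P(A) * P(B) * P(C)
```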

One can show that independence of events is a special case of independence of random variables (via indicator
functions), which in turn is a special case of independence of σ-algebras (via the σ-algebras the random
variables generate). We will take as our running definition of independence the further generalization:

Definition. Given a probability space (Ω, F, P), collections of events A1, ..., An ⊆ F are independent if for
all I ⊆ [n],

P( ∩_{i∈I} Ai ) = Π_{i∈I} P(Ai)

whenever Ai ∈ Ai for each i ∈ I.
An infinite collection of subsets of F is independent if every finite subcollection is.
Observe that if A1, ..., An is independent and we set Āi = Ai ∪ {Ω}, then Ā1, ..., Ān is independent as well.
In this case, the independence criterion reduces to P( ∩_{i=1}^n Ai ) = Π_{i=1}^n P(Ai) for any choice of Ai ∈ Āi.

Sufficient Conditions for Independence.

The preceding definitions often require us to check a huge number of cases to determine whether a given
collection of objects is independent. The following results are useful for simplifying this task.

Theorem 6.1. Suppose that A1, ..., An are independent collections of events. If each Ai is a π-system, then
the sub-σ-algebras σ(A1), ..., σ(An) are independent.

Proof. Because σ(Ai) = σ(Āi) where Āi = Ai ∪ {Ω}, we can assume without loss of generality that Ω ∈ Ai
for all i so that we need only consider intersections/products over [n].
Let A2, ..., An be events with Ai ∈ Ai, set F = ∩_{i=2}^n Ai, and set L = {A ∈ F : P(A ∩ F) = P(A)P(F)}.
Since P(Ω ∩ F) = P(F) = P(Ω)P(F), we have that Ω ∈ L.
Now suppose that A, B ∈ L with A ⊆ B. Then

P((B \ A) ∩ F) = P((B ∩ F) \ (A ∩ F)) = P(B ∩ F) − P(A ∩ F)
= P(B)P(F) − P(A)P(F) = (P(B) − P(A))P(F) = P(B \ A)P(F),

hence (B \ A) ∈ L.
Finally, let B1, B2, ... ∈ L with Bn ↑ B. Then (Bn ∩ F) ↑ (B ∩ F), so

P(B ∩ F) = lim_{n→∞} P(Bn ∩ F) = lim_{n→∞} P(Bn)P(F) = P(B)P(F),

so B ∈ L as well.
Therefore, L is a λ-system, so, since A1 is a π-system contained in L by assumption, the π-λ Theorem shows
that σ(A1) ⊆ L.
Because A2, ..., An were arbitrary members of A2, ..., An, we conclude that σ(A1), A2, ..., An are independent.
Repeating this argument for A2, A3, ..., An, σ(A1) shows that σ(A2), A3, ..., An, σ(A1) are independent, and
n − 2 more iterations completes the proof. □

A useful corollary is given by

Corollary 6.1. Random variables X1, ..., Xn are independent if

P(X1 ≤ x1, ..., Xn ≤ xn) = Π_{i=1}^n P(Xi ≤ xi)  for all x1, ..., xn ∈ R.

Proof. Let Ai = {{Xi ≤ x} : x ∈ R} for i = 1, ..., n.
Since {Xi ≤ x} ∩ {Xi ≤ y} = {Xi ≤ x ∧ y}, the Ai's are π-systems, so σ(A1), ..., σ(An) are independent by
Theorem 6.1.
Because {(−∞, x] : x ∈ R} generates B, σ(Ai) = σ(Xi), and the result follows. □

Since the converse of Corollary 6.1 is true by definition, independence of random variables X1, ..., Xn is
equivalent to the condition that their joint cdf factors as a product of the marginal cdfs.
One can prove analogous results for density and mass functions using the same basic ideas.
It is clear that if X1, ..., Xn are independent random variables and f1, ..., fn : R → R are measurable, then
f1(X1), ..., fn(Xn) are independent random variables since for any choice of Bi ∈ B,

P(f1(X1) ∈ B1, ..., fn(Xn) ∈ Bn) = P(X1 ∈ f1^{-1}(B1), ..., Xn ∈ fn^{-1}(Bn))
= Π_{i=1}^n P(Xi ∈ fi^{-1}(Bi)) = Π_{i=1}^n P(fi(Xi) ∈ Bi).

With the help of Theorem 6.1, we can prove the stronger result that functions of disjoint sets of independent
random variables are independent.

Lemma 6.1. Suppose Fi,j, 1 ≤ i ≤ n, 1 ≤ j ≤ m(i), are independent sub-σ-algebras and let
Gi = σ( ∪_j Fi,j ). Then G1, ..., Gn are independent.

Proof.
Let Ai = { ∩_j Ai,j : Ai,j ∈ Fi,j }.
If ∩_j Ai,j, ∩_j Bi,j ∈ Ai, then ( ∩_j Ai,j ) ∩ ( ∩_j Bi,j ) = ∩_j (Ai,j ∩ Bi,j) ∈ Ai, thus Ai is a π-system, so
σ(A1), ..., σ(An) are independent by Theorem 6.1.
Because F ∈ ∪_j Fi,j implies F ∈ Fi,k for some k and thus F = Ω ∩ · · · ∩ Ω ∩ F ∩ Ω ∩ · · · ∩ Ω ∈ Ai, we have
that ∪_j Fi,j ⊆ Ai, so Gi = σ( ∪_j Fi,j ) ⊆ σ(Ai). Consequently, G1, ..., Gn are independent. □

Corollary 6.2. If Xi,j, 1 ≤ i ≤ n, 1 ≤ j ≤ m(i), are independent random variables and the functions
fi : R^{m(i)} → R are measurable, then f1(X1,1, ..., X1,m(1)), ..., fn(Xn,1, ..., Xn,m(n)) are independent.

Proof. Let Fi,j = σ(Xi,j). Since fi(Xi,1, ..., Xi,m(i)) is measurable with respect to Gi = σ( ∪_j Fi,j ), the
result follows from Lemma 6.1. □

Product Measure.
We now pause to recall the construction of product measures.

Proposition 6.1. Given finite measure spaces (Ω1, F1, µ1) and (Ω2, F2, µ2), there exists a unique measure
µ1 × µ2 on (Ω1 × Ω2, F1 ⊗ F2) which satisfies (µ1 × µ2)(A × E) = µ1(A)µ2(E) for all A ∈ F1, E ∈ F2.

Proof.
Let S = {A × E : A ∈ F1, E ∈ F2}.
If A1 × E1, A2 × E2 ∈ S, then (A1 × E1) ∩ (A2 × E2) = (A1 ∩ A2) × (E1 ∩ E2) and
(A1 × E1)^C = (A1^C × E1) ⊔ (A1 × E1^C) ⊔ (A1^C × E1^C), hence S is a semialgebra.
Define ν : S → [0, ∞) by ν(A × E) = µ1(A)µ2(E).
In light of the discussion in Section 3, the result will follow if we can show that for any countable disjoint
union of sets {Ai × Ei}_{i∈I} in S such that A × E = ⊔_{i∈I} (Ai × Ei) ∈ S, we have ν(A × E) = Σ_{i∈I} ν(Ai × Ei).
To see that this is so, observe that for all (x, y) ∈ Ω1 × Ω2,

1_A(x)1_E(y) = 1_{A×E}(x, y) = Σ_{i∈I} 1_{Ai×Ei}(x, y) = Σ_{i∈I} 1_{Ai}(x)1_{Ei}(y).

Consequently,

µ1(A)1_E(y) = ∫_{Ω1} 1_A(x)1_E(y) dµ1(x) = ∫_{Ω1} Σ_{i∈I} 1_{Ai}(x)1_{Ei}(y) dµ1(x)
= Σ_{i∈I} ∫_{Ω1} 1_{Ai}(x)1_{Ei}(y) dµ1(x) = Σ_{i∈I} ( ∫_{Ω1} 1_{Ai}(x) dµ1(x) ) 1_{Ei}(y)
= Σ_{i∈I} µ1(Ai)1_{Ei}(y).

(The interchange of summation and integration is justified by the monotone convergence theorem.)
Integrating against µ2 gives

ν(A × E) = µ1(A)µ2(E) = ∫_{Ω2} µ1(A)1_E(y) dµ2(y) = ∫_{Ω2} Σ_{i∈I} µ1(Ai)1_{Ei}(y) dµ2(y)
= Σ_{i∈I} µ1(Ai) ∫_{Ω2} 1_{Ei}(y) dµ2(y) = Σ_{i∈I} µ1(Ai)µ2(Ei) = Σ_{i∈I} ν(Ai × Ei). □

* The above holds for σ-finite measure spaces as well by the same argument, but we mainly care about finite
measure spaces in probability.

An induction argument easily extends Proposition 6.1 to arbitrary finite products.

Independence, Distribution, and Expectation.
We now consider the joint distribution of independent random variables.

Theorem 6.2. If X1, ..., Xn are independent random variables with distributions µ1, ..., µn, respectively,
then the random vector (X1, ..., Xn) has distribution µ1 × · · · × µn.

Proof. Given any sets A1, ..., An ∈ B, we have

P((X1, ..., Xn) ∈ A1 × · · · × An) = P(X1 ∈ A1, ..., Xn ∈ An) = Π_{i=1}^n P(Xi ∈ Ai)
= Π_{i=1}^n µi(Ai) = (µ1 × · · · × µn)(A1 × · · · × An).

In the proof of Theorem 3.3, we showed that for any probability measures µ, ν, L = {A : µ(A) = ν(A)} is a
λ-system. Because the collection of rectangle sets is a π-system which generates B^n, the result follows from
the π-λ Theorem. □

In other words, random variables are independent if their joint distribution is the product of their marginal
distributions.

At this point, it is appropriate to recall the theorems of Fubini and Tonelli, whose proofs can be found in

any book on measure theory.

Theorem 6.3. Suppose that (R, F, µ) and (S, G, ν) are σ-finite measure spaces.
I) Tonelli: If f : R × S → [0, ∞) is a measurable function, then

(∗)  ∫_{R×S} f d(µ × ν) = ∫_S ( ∫_R f(x, y) dµ(x) ) dν(y) = ∫_R ( ∫_S f(x, y) dν(y) ) dµ(x).

II) Fubini: If f : R × S → R is integrable (i.e. ∫ |f| d(µ × ν) < ∞), then (∗) holds.

In the language of probability, we have

Theorem 6.4. Suppose that X and Y are independent with distributions µ and ν. If f : R^2 → R is a
measurable function with f ≥ 0 or E|f(X, Y)| < ∞, then

E[f(X, Y)] = ∫∫ f(x, y) dµ(x) dν(y).

In particular, if g, h : R → R are measurable functions with g, h ≥ 0 or E|g(X)|, E|h(Y)| < ∞, then
E[g(X)h(Y)] = E[g(X)]E[h(Y)].

Proof. It follows from Theorem 6.2 and the change of variables formula (Theorem 5.8) that

E[f(X, Y)] = ∫_{R^2} f(x, y) d(µ × ν)(x, y),

so the first statement follows from Fubini-Tonelli.
Now suppose that g, h : R → R are nonnegative measurable functions. Then Tonelli's Theorem gives

E[g(X)h(Y)] = ∫∫ g(x)h(y) dµ(x) dν(y) = ∫ h(y) ( ∫ g(x) dµ(x) ) dν(y)
= ∫ h(y)E[g(X)] dν(y) = E[g(X)] ∫ h(y) dν(y) = E[g(X)]E[h(Y)].

If g, h are integrable, then applying the above result to |g|, |h| gives E|g(X)h(Y)| = E|g(X)| E|h(Y)| < ∞,
and we can repeat the above argument using Fubini's Theorem. □

Note that the second part of the preceding proof is typical of multiple integral arguments: One uses Tonelli's

theorem to verify integrability by computing the integral of the absolute value as an iterated integral (or

interchanging order of integration), and then one applies Fubini's Theorem to compute the desired integral.

Theorem 6.4 can easily be extended to handle any finite number of random variables:

Theorem 6.5. If X1, ..., Xn are independent and have Xi ≥ 0 for all i, or E|Xi| < ∞ for all i, then

E[ Π_{i=1}^n Xi ] = Π_{i=1}^n E[Xi].

Proof. Corollary 6.2 shows that X1 and X2 · · · Xn are independent, so Theorem 6.4 (with g = h the identity
function) gives

E[ Π_{i=1}^n Xi ] = E[X1] E[ Π_{i=2}^n Xi ],

and the result follows by induction. □

(To make Theorem 6.5 look more like Theorem 6.4, recall that if X1, ..., Xn are independent and
f1, ..., fn : R → R are measurable, then f1(X1), ..., fn(Xn) are independent.)

Note that it is possible that E[XY] = E[X]E[Y] without X and Y being independent.
For example, let X ∼ N(0, 1), Y = X^2. Then X and Y are clearly dependent, but a little calculus shows
that E[X] and E[XY] = E[X^3] are both 0 and E[Y] = E[X^2] = 1, so E[XY] = 0 = E[X]E[Y].

Definition. If X and Y are random variables with E[X^2], E[Y^2] < ∞ and E[XY] = E[X]E[Y], then we
say that X and Y are uncorrelated.

Often, independence is invoked solely to argue that the expectation of the product is the product of the

expectations. In such cases, one can weaken the assumption from independence to uncorrelatedness.

Of course, we can obtain a partial converse to Theorem 6.4 if we require the expectation to factor over a
sufficiently large class of functions.

Proposition 6.2. X and Y are independent if E[f (X)g(Y )] = E[f (X)]E[g(Y )] for all bounded continuous
functions f and g.

Proof. Given any x, y ∈ R, define

fn(t) = 1 for t ≤ x,   fn(t) = 1 − n(t − x) for x < t ≤ x + 1/n,   fn(t) = 0 for t > x + 1/n,

and define gn analogously with y in place of x.
Then bounded convergence and the assumptions give

P(X ≤ x, Y ≤ y) = E[ lim_{n→∞} fn(X)gn(Y) ] = lim_{n→∞} E[fn(X)] E[gn(Y)]
= E[ lim_{n→∞} fn(X) ] E[ lim_{n→∞} gn(Y) ] = P(X ≤ x)P(Y ≤ y). □

Before moving on, we mention that the ideas in this section can be used to analyze the sum of independent

random variables.

Theorem 6.6. Suppose that X and Y are independent with distributions µ, ν and distribution functions
F, G. Then X + Y has distribution function

P(X + Y ≤ z) = ∫ F(z − y) dG(y).

If X has density f, then X + Y has density h(z) = ∫ f(z − y) dG(y).
If, additionally, Y has density g, then h(z) = ∫ f(z − y)g(y) dy = f ∗ g(z) - that is, the density of the sum is
the convolution of the densities.

Proof. The change of variables formula, independence, and Tonelli's theorem give

P(X + Y ≤ z) = ∫_Ω 1_{(−∞,z]}(X + Y) dP = ∫_{R^2} 1_{(−∞,z]}(x + y) d(µ × ν)(x, y)
= ∫_R ∫_R 1_{(−∞,z]}(x + y) dµ(x) dν(y) = ∫_R ( ∫_R 1_{(−∞,z−y]}(x) dµ(x) ) dν(y)
= ∫_R F(z − y) dν(y) = ∫ F(z − y) dG(y).

The final equality is just interpreting an integral against ν as a Riemann-Stieltjes integral with respect to G.
Now if X has density f, then the previous result with u-substitution and Tonelli yield

P(X + Y ≤ z) = ∫_R F(z − y) dν(y) = ∫_R ∫_{−∞}^{z−y} f(x) dx dν(y)
= ∫_R ∫_{−∞}^{z} f(x − y) dx dν(y) = ∫_{−∞}^{z} ∫_R f(x − y) dν(y) dx
= ∫_{−∞}^{z} ( ∫ f(x − y) dG(y) ) dx,

which means that the density of X + Y is as claimed.
The third assertion follows from the change of variables formula for absolutely continuous random variables
- which reads dG(y) = g(y) dy in the present context. □

Though one can use these convolution results to derive useful facts about distributions of sums, tools such

as characteristic and moment generating functions are generally much better suited for this task, so we will

not pursue the issue right now.
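Though we will not pursue convolutions further here, the formula of Theorem 6.6 is easy to check numerically. The following sketch (the choice X, Y ~ Uniform(0, 1) is an illustrative assumption, not from the notes) approximates the convolution by a Riemann sum and compares it with the known triangular density of the sum:

```python
import numpy as np

# If X, Y ~ Uniform(0,1) are independent, X + Y has the triangular
# density h(z) = z on [0,1] and h(z) = 2 - z on [1,2].
z = np.linspace(0.0, 2.0, 2001)
dz = z[1] - z[0]

def f(t):
    # density of Uniform(0,1)
    return ((t >= 0) & (t <= 1)).astype(float)

# h(z) = \int f(z - y) g(y) dy with g = f, via a Riemann sum
h = np.array([np.sum(f(zi - z) * f(z)) * dz for zi in z])

exact = np.where(z <= 1, z, 2 - z)
max_err = float(np.abs(h - exact).max())
```

The maximum discrepancy is on the order of the grid spacing, as one expects from a Riemann-sum approximation.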


Constructing Independent Random Variables.
To see that we have not done all of this work for nothing, we now show that independent random variables

actually exist!

Given a finite collection of distribution functions F1, ..., Fn, it is easy to construct independent random variables X1, ..., Xn with P(Xi ≤ x) = Fi(x).


Namely, let Ω = R^n, F = B^n, and P = µ1 × · · · × µn where µi is the measure on (R, B) with distribution function Fi. The product measure P is well-defined and satisfies

P((a1, b1] × · · · × (an, bn]) = (F1(b1) − F1(a1)) · · · (Fn(bn) − Fn(an)).

If we define Xi to be the projection map Xi((ω1, ..., ωn)) = ωi, then it is clear that the Xi's are independent with the appropriate distributions.

In order to build an infinite sequence of independent random variables with given distribution functions, we need to perform the above construction on the infinite product space

R^N = {(ω1, ω2, ...) : ωi ∈ R} = {functions ω : N → R}.

The product σ-algebra B^N is generated by cylinder sets of the form

{ω ∈ R^N : ωi ∈ (ai, bi] for i = 1, ..., n},

and the random variables are the projections Xi(ω) = ωi.

(In the definition of cylinders, we take −∞ ≤ ai ≤ bi ≤ ∞ with the interpretation that (ai, ∞] = (ai, ∞). Taking aj = bj for any j gives the empty set.)

Clearly, the desired measure should satisfy

P({ω ∈ R^N : ωi ∈ (ai, bi] for i = 1, ..., n}) = Π_{i=1}^n (Fi(bi) − Fi(ai))

on the cylinders.

To see that we can uniquely extend this to all of B^N, we appeal to

Theorem 6.7 (Kolmogorov). Suppose that we are given a sequence of probability measures µn on (R^n, B^n) which are consistent in the sense that

µ_{n+1}((a1, b1] × · · · × (an, bn] × R) = µn((a1, b1] × · · · × (an, bn]).

Then there is a unique probability measure P on (R^N, B^N) with

P({ω ∈ R^N : ωi ∈ (ai, bi], i = 1, ..., n}) = µn((a1, b1] × · · · × (an, bn]).

In particular, given distribution functions F1, F2, ..., if we define the µn's by the condition

µn((a1, b1] × · · · × (an, bn]) = Π_{i=1}^n (Fi(bi) − Fi(ai)),

then the projections Xn(ω) = ωn are independent with P(Xn ≤ x) = Fn(x).


Proof of Theorem 6.7. Let {µn} be a consistent sequence of probability measures, let S be the collection of cylinder sets, and define Q : S → [0, 1] by

Q({ω ∈ R^N : ωi ∈ (ai, bi], 1 ≤ i ≤ n}) = µn((a1, b1] × · · · × (an, bn]).

Let A = {finite disjoint unions of sets in S} be the algebra generated by S and define P0 : A → [0, 1] by P0(S1 ⊔ · · · ⊔ Sn) = Σ_{k=1}^n Q(Sk) for S1, ..., Sn disjoint sets in S.
As S is a semialgebra which generates B^N, the discussion in Section 3 shows that it suffices to prove

Claim. If Bn ∈ A with Bn ↘ ∅, then P0(Bn) ↘ 0.

Proof. To further simplify our task, let Fn be the sub-σ-algebra of B^N consisting of all sets of the form E* = E × R × R × · · · with E ∈ B^n. We use this asterisk notation throughout to denote the B^n component of sets in Fn.

We begin by showing that we may assume without loss of generality that Bn ∈ Fn for all n.

To see this, note that Bn ∈ A implies that there is a j(n) ∈ N such that Bn ∈ Fk for all k ≥ j(n). Let k(1) = j(1) and k(n) = k(n − 1) + j(n) for n ≥ 2. Then k(1) < k(2) < · · · and Bn ∈ F_{k(n)} for all n. Define B̃i = R^N for i < k(1) and B̃i = Bn for k(n) ≤ i < k(n + 1). Then B̃n ∈ Fn for all n, and the collections {Bn} and {B̃n} differ only in that the latter possibly includes R^N and repeats sets. The assertion follows since B̃n ↘ ∅ if and only if Bn ↘ ∅, and P0(B̃n) ↘ 0 if and only if P0(Bn) ↘ 0.

Now suppose that P0(Bn) ≥ δ > 0 for all n. We will derive a contradiction by approximating the Bn* from within by compact sets and then using a diagonal argument to obtain ∩_n Bn ≠ ∅.

Since Bn is nonempty and belongs to A ∩ Fn, we can write

Bn = ∪_{k=1}^{K(n)} {ω : ωi ∈ (a_{i,k}, b_{i,k}], i = 1, ..., n}   where −∞ ≤ a_{i,k} < b_{i,k} ≤ ∞.

By a continuity from below argument, we can find a set En ⊆ Bn of the form

En = ∪_{k=1}^{K(n)} {ω : ωi ∈ [ã_{i,k}, b̃_{i,k}], i = 1, ..., n},   −∞ < ã_{i,k} < b̃_{i,k} < ∞,

with µn(Bn* \ En*) ≤ δ/2^{n+1}.
Let Fn = ∩_{m=1}^n Em. Since Bn ⊆ Bm for any m ≤ n, we have

Bn \ Fn = Bn ∩ (∪_{m=1}^n Em^C) = ∪_{m=1}^n (Bn ∩ Em^C) ⊆ ∪_{m=1}^n (Bm ∩ Em^C),

hence

µn(Bn* \ Fn*) ≤ Σ_{m=1}^n µm(Bm* \ Em*) ≤ δ/2.

Since µn(Bn*) = P0(Bn) ≥ δ, this means that µn(Fn*) ≥ δ/2, hence Fn* is nonempty.

Moreover, En* is a finite union of closed and bounded rectangles, so

Fn* = En* ∩ (E*_{n−1} × R) ∩ · · · ∩ (E1* × R^{n−1})

is compact.
For each m ∈ N, choose some ω^m ∈ Fm. As Fm ⊆ F1, ω^m_1 (the first coordinate of ω^m) is in F1*.

By compactness, we can find a subsequence m(1, j) ≥ j such that ω^{m(1,j)}_1 converges to a limit θ1 ∈ F1*.

For m ≥ 2, Fm ⊆ F2, so (ω^m_1, ω^m_2) ∈ F2*. Because F2* is compact, we can find a subsequence of {m(1, j)}, which we denote by m(2, j), such that ω^{m(2,j)}_2 converges to a limit θ2 with (θ1, θ2) ∈ F2*.

In general, we can find a subsequence m(n, j) of m(n − 1, j) such that ω^{m(n,j)}_n converges to θn with (θ1, ..., θn) ∈ Fn*.

Finally, define the sequence ω(i) = ω^{m(i,i)}. Then {ω(i)}_{i≥n} is a subsequence of each {ω^{m(n,j)}}_j, so lim_{i→∞} ω(i)_k = θk for all k. Since (θ1, ..., θn) ∈ Fn* for all n, θ = (θ1, θ2, ...) ∈ Fn for all n, hence

θ ∈ ∩_{n=1}^∞ Fn ⊆ ∩_{n=1}^∞ Bn,

a contradiction! □

Note that the proof of Theorem 6.7 used certain topological properties of R^n. As one might expect, the theorem does not hold for infinite products of arbitrary measurable spaces (S, G). However, one can show that it does hold for nice spaces, where (S, G) is said to be nice if there exists an injection ϕ : S → R such that ϕ and ϕ^{−1} are measurable.

The collection of nice spaces is rich enough for our purposes. For example, if S is (homeomorphic to) a complete and separable metric space and G is the collection of Borel subsets of S, then (S, G) is nice.

7. Weak Law of Large Numbers

We are now in a position to establish various laws of large numbers, which give conditions for the arithmetic

average of repeated observations to converge in certain senses. Among other things, these laws justify and

formalize our intuitive notions of probability as representing some kind of measure of long-term relative

frequency.

Convergence in Lp and Probability.


The weak law of large numbers is concerned with convergence in probability, where

Definition. A sequence of random variables X1, X2, ... is said to converge to X in probability if for every ε > 0, lim_{n→∞} P(|Xn − X| > ε) = 0. In this case, we write Xn →p X.

In analysis we would call this convergence in measure.

Note that if Xn →p X , then limn→∞ P (|Xn − X| < ε) = 1 for all ε > 0, while Xn → X a.s. implies that

P(lim_{n→∞} |Xn − X| < ε) = 1 for all ε > 0. The following proposition and example show the importance of the placement of the limit in the two definitions.

Proposition 7.1. If Xn → X a.s., then Xn →p X .

Proof. Let ε > 0 be given and define

An = ∪_{m≥n} {|Xm − X| > ε},   A = ∩_{n=1}^∞ An,   E = {ω : lim_{n→∞} Xn(ω) ≠ X(ω)}.

Since A1 ⊇ A2 ⊇ ..., continuity from above implies that P(A) = lim_{n→∞} P(An). Now if ω ∈ A, then for every n ∈ N, there is an m ≥ n with |Xm(ω) − X(ω)| > ε, so lim_{n→∞} Xn(ω) ≠ X(ω), and thus A ⊆ E. Because we also have the inclusion {|Xn − X| > ε} ⊆ An, monotonicity implies that

lim_{n→∞} P(|Xn − X| > ε) ≤ lim_{n→∞} P(An) = P(A) ≤ P(E) = 0

where the final equality is the definition of almost sure convergence. □

Example 7.1 (Scanning Interval). On the interval [0, 1) with Lebesgue measure, define

X1 = 1_{[0,1)}, X2 = 1_{[0,1/2)}, X3 = 1_{[1/2,1)}, ..., X_{2^n+k} = 1_{[k/2^n, (k+1)/2^n)}, ...

It is straightforward that Xn →p 0 (for any ε ∈ (0, 1), m ≥ 2^n implies P(|Xm − 0| > ε) ≤ 2^{−n}), but lim_{n→∞} Xn(ω) does not exist for any ω (there are infinitely many values of n with Xn(ω) = 1 and infinitely many values with Xn(ω) = 0), thus Xn ↛ 0 a.s.
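The scanning behavior is easy to compute with directly. In the sketch below (the indexing convention and the sample point ω = 0.3 are illustrative assumptions), the probability that an indicator equals 1 tends to 0 along the whole sequence, yet each fixed ω is hit exactly once per dyadic generation, so the sequence hits 1 infinitely often:

```python
# Scanning-interval indicators indexed as m = 2^n + k with 0 <= k < 2^n:
# X_m is the indicator of [k/2^n, (k+1)/2^n).

def X(m, omega):
    n = m.bit_length() - 1      # generation: m = 2^n + k
    k = m - (1 << n)
    return 1 if k / 2**n <= omega < (k + 1) / 2**n else 0

omega = 0.3
# m runs over generations n = 0, ..., 9; omega lands in exactly one
# dyadic interval per generation, so there are 10 hits in total.
hits = [m for m in range(1, 2**10) if X(m, omega) == 1]
# P(X_m = 1) = 2^{-n} -> 0 along the sequence.
probs = [1 / 2 ** (m.bit_length() - 1) for m in range(1, 2**10)]
```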

The preceding shows that convergence in probability is weaker than almost sure convergence. In fact, this

is the source of weak in the weak law of large numbers.


Our rst set of weak laws make use of L2 convergence where

Definition. For p ∈ (0, ∞], a sequence of random variables X1, X2, ... is said to converge to X in L^p if lim_{n→∞} ||Xn − X||_p = 0. (For p ∈ (0, ∞), this is equivalent to E[|Xn − X|^p] → 0.)

Our first observation about L^p convergence is

Proposition 7.2. For any 1 ≤ r < s ≤ ∞, if Xn → X in L^s, then Xn → X in L^r.

Proof. If Xn → X in L^s, then Corollary 5.2 implies ||Xn − X||_r ≤ ||Xn − X||_s → 0. □

To see how Lp convergence compares with our other notions of convergence, note that

Proposition 7.3. If Xn → X in L^p for p > 0, then Xn →p X.

Proof. For any ε > 0, Chebychev's inequality gives

P(|Xn − X| > ε) = P(|Xn − X|^p > ε^p) ≤ ε^{−p} E[|Xn − X|^p] → 0. □

Example 7.2. On the interval [0, 1] with Lebesgue measure, define a sequence of random variables by Xn = n^{1/p} 1_{(0,n^{−1}]}. Then Xn → 0 a.s. (and thus in probability) since for all ω ∈ (0, 1], Xn(ω) = 0 whenever n > ω^{−1}. However, E[|Xn − 0|^p] = 1 for all n, so Xn ↛ 0 in L^p.

Proposition 7.3 and Example 7.2 show that Lp convergence is stronger than convergence in probability.

Example 7.2 also shows that almost sure convergence need not imply convergence in Lp
(unless one makes additional assumptions such as boundedness or uniform integrability).

Conversely, Example 7.1 shows that Lp convergence does not imply almost sure convergence.

It is perhaps worth noting that a.s. convergence and convergence in probability are preserved by continuous functions. (The latter claim can be shown directly from the ε-δ definition of continuity, but we will give an easier proof in Theorem 8.2.) However, L^p convergence need not be. For example, on [0, 1] with Lebesgue measure, Xn = n^{1/2} 1_{(0,n^{−p})} converges to 0 in L^p, p > 0, but if f(x) = x², then ||f(Xn) − f(0)||_p = 1 for all n.

Now recall that random variables X and Y with finite second moments are said to be uncorrelated if E[XY] = E[X]E[Y]. If we denote E[X] = µX, E[Y] = µY, then the covariance of X and Y is defined as

Cov(X, Y) := E[(X − µX)(Y − µY)] = E[XY − XµY − µX Y + µX µY] = E[XY] − 2µX µY + µX µY = E[XY] − E[X]E[Y],

so for random variables with finite second moments, uncorrelated is equivalent to zero covariance.

We say that a family of random variables {Xi}_{i∈I} is uncorrelated if Cov(Xi, Xj) = 0 for all i ≠ j.
Before stating our first weak law, we record the following simple observation about sums of uncorrelated random variables.

Lemma 7.1. If X1 , X2 , ..., Xn are uncorrelated, then


Var(X1 + ... + Xn ) = Var(X1 ) + ... + Var(Xn ).

Proof. Let µi = E[Xi] and Sn = Σ_{i=1}^n Xi. Then E[Sn] = Σ_{i=1}^n µi by linearity, so

Var(Sn) = E[(Sn − E[Sn])²] = E[(Σ_{i=1}^n (Xi − µi))²]
= E[Σ_{i=1}^n (Xi − µi)² + Σ_{i≠j} (Xi − µi)(Xj − µj)]
= Σ_{i=1}^n E[(Xi − µi)²] + Σ_{i≠j} E[(Xi − µi)(Xj − µj)]
= Σ_{i=1}^n Var(Xi) + Σ_{i≠j} Cov(Xi, Xj) = Σ_{i=1}^n Var(Xi). □

We also observe that for any a, b ∈ R,

Var(aX + b) = E[((aX + b) − (aµX + b))²] = a² E[(X − µX)²] = a² Var(X).

With these results in hand, the L2 weak law follows easily.

Theorem 7.1. Let X1, X2, ... be uncorrelated random variables with common mean E[Xi] = µ and uniformly bounded variance Var(Xi) ≤ C < ∞. Writing Sn = X1 + ... + Xn, we have that (1/n)Sn → µ in L² and in probability.

Proof. Since E[(1/n)Sn] = (1/n) Σ_{i=1}^n µ = µ, we see that

E[((1/n)Sn − µ)²] = Var((1/n)Sn) = (1/n²) Σ_{i=1}^n Var(Xi) ≤ nC/n² → 0

as n → ∞, hence (1/n)Sn → µ in L². By Proposition 7.3, (1/n)Sn →p µ as well. □
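The L² weak law is also easy to observe in simulation. In the sketch below (Exponential(1) summands, the tolerance ε = 0.1, and the sample sizes are all illustrative choices), the empirical probability of a deviation of the sample mean from µ = 1 shrinks as n grows, consistent with the Chebychev bound Var(X̄n)/ε² = 1/(nε²):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.1, 2000

def deviation_prob(n):
    # empirical estimate of P(|S_n/n - mu| > eps) for Exponential(1),
    # which has mean mu = 1 and variance 1
    sample_means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    return float(np.mean(np.abs(sample_means - 1.0) > eps))

p10, p1000 = deviation_prob(10), deviation_prob(1000)
```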

Specializing to the case where the Xi's are independent and identically distributed (or i.i.d.), we have the oft-quoted weak law

Corollary 7.1. If X1, X2, ... are i.i.d. with mean µ and variance σ² < ∞, then X̄n = (1/n) Σ_{i=1}^n Xi converges in probability to µ.

The statistical interpretation of Corollary 7.1 is that under mild conditions, if the sample size is sufficiently large, then the sample mean will be close to the population mean with high probability.
Examples.
Our first applications of these ideas involve statements that appear to be unrelated to probability.

Example 7.3. Let f : [0, 1] → R be a continuous function and let

fn(x) = Σ_{k=0}^n (n choose k) x^k (1 − x)^{n−k} f(k/n)

be the Bernstein polynomial of degree n associated with f. Then lim_{n→∞} sup_{x∈[0,1]} |fn(x) − f(x)| = 0.

Proof. Given any p ∈ [0, 1], let X1, X2, ... be i.i.d. with P(X1 = 1) = p and P(X1 = 0) = 1 − p. One easily calculates E[X1] = p and Var(X1) = p(1 − p) ≤ 1/4.

Letting Sn = Σ_{i=1}^n Xi, we have that P(Sn = k) = (n choose k) p^k (1 − p)^{n−k}, so E[f((1/n)Sn)] = fn(p). Also, Corollary 7.1 shows that X̄n = (1/n)Sn converges to p in probability.

To establish the desired result, we have to appeal to the proof of our weak law. First, for any α > 0, Chebychev's inequality and the fact that E[X̄n] = p, Var(X̄n) = p(1 − p)/n < 1/(4n) give

P(|X̄n − p| ≥ α) ≤ Var(X̄n)/α² ≤ 1/(4nα²).

Now since f is continuous on the compact set [0, 1] it is uniformly continuous and uniformly bounded. Let M = sup_{x∈[0,1]} |f(x)|, and for a given ε > 0, let δ > 0 be such that |x − y| < δ implies |f(x) − f(y)| < ε for all x, y ∈ [0, 1]. Since the absolute value function is convex, Jensen's inequality yields

|E[f(X̄n)] − f(p)| ≤ E|f(X̄n) − f(p)| ≤ εP(|X̄n − p| < δ) + 2M P(|X̄n − p| ≥ δ) ≤ ε + M/(2nδ²).

As this does not depend on p, the result follows upon letting n → ∞. □
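The Bernstein polynomials can be evaluated directly from the definition. The following sketch (the test function f(x) = |x − 1/2| and the evaluation grid are illustrative assumptions) exhibits the uniform error shrinking as the degree grows:

```python
import math

def bernstein(f, n, x):
    # f_n(x) = sum_k C(n,k) x^k (1-x)^{n-k} f(k/n)
    return sum(math.comb(n, k) * x**k * (1 - x) ** (n - k) * f(k / n)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)
grid = [i / 200 for i in range(201)]

def sup_error(n):
    # approximate sup-norm error over a fine grid
    return max(abs(bernstein(f, n, x) - f(x)) for x in grid)

e10, e200 = sup_error(10), sup_error(200)
```

The error at x = 1/2 decays like n^{−1/2} for this f, matching the Var(X̄n) = p(1 − p)/n rate in the proof.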

Our next amusing result can be interpreted as saying that a high-dimensional cube is almost a sphere.

Example 7.4. Let X1, X2, ... be independent and uniformly distributed on [−1, 1]. Then X1², X2², ... are also independent with E[Xi²] = ∫_{−1}^{1} (x²/2) dx = 1/3 and Var(Xi²) ≤ E[Xi⁴] ≤ 1, so Corollary 7.1 shows that (1/n) Σ_{i=1}^n Xi² converges to 1/3 in probability.

Now given ε ∈ (0, 1), write A_{n,ε} = {x ∈ R^n : (1 − ε)√(n/3) ≤ ||x|| ≤ (1 + ε)√(n/3)} where ||x|| = (x1² + ... + xn²)^{1/2} is the usual Euclidean distance, and let m denote Lebesgue measure. We have

m(A_{n,ε} ∩ [−1, 1]^n)/2^n = P((X1, ..., Xn) ∈ A_{n,ε}) = P((1 − ε)√(n/3) ≤ (Σ_{i=1}^n Xi²)^{1/2} ≤ (1 + ε)√(n/3))
= P((1/3)(1 − 2ε + ε²) ≤ (1/n) Σ_{i=1}^n Xi² ≤ (1/3)(1 + 2ε + ε²))
≥ P(|(1/n) Σ_{i=1}^n Xi² − 1/3| ≤ (2ε − ε²)/3),

so that m(A_{n,ε} ∩ [−1, 1]^n)/2^n → 1 as n → ∞. In words, most of the volume of the cube [−1, 1]^n comes from A_{n,ε}, which is almost the boundary of the ball centered at the origin with radius √(n/3). □
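A Monte Carlo experiment makes this concentration visible. In the sketch below (ε = 0.1 and the dimensions 2 and 500 are illustrative choices), essentially every uniform sample from the high-dimensional cube lands in the thin shell A_{n,ε}, while in low dimension the shell captures only a modest fraction of the volume:

```python
import numpy as np

rng = np.random.default_rng(1)
eps, trials = 0.1, 5000

def shell_fraction(n):
    # fraction of uniform samples from [-1,1]^n lying in A_{n,eps}
    x = rng.uniform(-1.0, 1.0, size=(trials, n))
    r = np.linalg.norm(x, axis=1)
    lo, hi = (1 - eps) * np.sqrt(n / 3), (1 + eps) * np.sqrt(n / 3)
    return float(np.mean((lo <= r) & (r <= hi)))

low_dim, high_dim = shell_fraction(2), shell_fraction(500)
```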
The next set of examples concern the limiting behavior of row sums of triangular arrays, for which we appeal to the following easy generalization of Theorem 7.1.

Theorem 7.2. Given a triangular array of integrable random variables {X_{n,k}}_{n∈N, 1≤k≤n}, let Sn = Σ_{k=1}^n X_{n,k} denote the nth row sum, and write µn = E[Sn], σn² = Var(Sn). If the sequence {bn}_{n=1}^∞ satisfies lim_{n→∞} σn²/bn² = 0, then

(Sn − µn)/bn →p 0.

Proof. By assumption, E[((Sn − µn)/bn)²] = Var(Sn)/bn² → 0 as n → ∞, so the result follows since L² convergence implies convergence in probability. □

Example 7.5 (Coupon Collector's Problem). Suppose that there are n distinct types of coupons and each time one obtains a coupon it is, independent of prior selections, equally likely to be any one of the types. We are interested in the number of draws needed to obtain a complete set. To this end, let T_{n,k} denote the number of draws needed to collect k distinct types for k = 1, ..., n and note that T_{n,1} = 1. Set X_{n,1} = 1 and X_{n,k} = T_{n,k} − T_{n,k−1} for k = 2, ..., n so that X_{n,k} is the number of trials needed to obtain a type different from the first k − 1. The number of draws needed to obtain a complete set is given by

Tn := T_{n,n} = 1 + Σ_{k=2}^n (T_{n,k} − T_{n,k−1}) = 1 + Σ_{k=2}^n X_{n,k}.

By construction, X_{n,2}, ..., X_{n,n} are independent with P(X_{n,k} = m) = ((n − k + 1)/n)((k − 1)/n)^{m−1} for m ∈ N.

Now a random variable X with P(X = m) = p(1 − p)^{m−1} is said to be geometric with success probability p. A little calculus gives

E[X] = Σ_{m=1}^∞ m p(1 − p)^{m−1} = p Σ_{m=1}^∞ −(d/dp)(1 − p)^m
= −p (d/dp) Σ_{m=1}^∞ (1 − p)^m = −p (d/dp)[(1 − p)/p] = 1/p

and

E[X²] = Σ_{m=1}^∞ m² p(1 − p)^{m−1} = Σ_{m=1}^∞ [m(m − 1) + m] p(1 − p)^{m−1}
= p(1 − p) Σ_{m=2}^∞ m(m − 1)(1 − p)^{m−2} + Σ_{m=1}^∞ m p(1 − p)^{m−1}
= p(1 − p) (d²/dp²) Σ_{m=2}^∞ (1 − p)^m + E[X] = p(1 − p) (d²/dp²)[(1 − p)²/p] + 1/p
= 2(1 − p)/p² + 1/p = (2 − p)/p²,

hence

Var(X) = E[X²] − E[X]² = (1 − p)/p² ≤ 1/p².
It follows that

E[Tn] = 1 + Σ_{k=2}^n E[X_{n,k}] = 1 + Σ_{k=2}^n n/(n − k + 1) = 1 + n Σ_{j=1}^{n−1} 1/j = n Σ_{j=1}^n 1/j

and

Var(Tn) = Σ_{k=2}^n Var(X_{n,k}) ≤ Σ_{k=2}^n (n/(n − k + 1))² = n² Σ_{j=1}^{n−1} 1/j² ≤ n² Σ_{j=1}^∞ 1/j² = π²n²/6.

Taking bn = n log(n), we have Var(Tn)/bn² ≤ π²/(6 log(n)²) → 0, so Theorem 7.2 implies (Tn − n Σ_{k=1}^n 1/k)/(n log(n)) →p 0. Using the inequality

log(n) ≤ Σ_{k=1}^n 1/k ≤ log(n) + 1

(which can be seen by bounding log(n) = ∫_1^n dx/x with the upper Riemann sum Σ_{k=1}^{n−1} 1/k ≤ Σ_{k=1}^n 1/k and the lower Riemann sum Σ_{k=2}^n 1/k = Σ_{k=1}^n 1/k − 1), we conclude that Tn/(n log(n)) →p 1.
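The conclusion Tn/(n log n) →p 1 is easy to see empirically. The sketch below (n = 2000 coupons and 20 repetitions are illustrative choices) simulates the collection process directly:

```python
import math
import random

random.seed(42)

def collect(n):
    # number of uniform draws needed to see all n coupon types
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

n = 2000
ratios = [collect(n) / (n * math.log(n)) for _ in range(20)]
avg_ratio = sum(ratios) / len(ratios)   # should be near 1
```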

Example 7.6 (Occupancy Problem). Suppose that we drop rn balls at random into n bins where rn/n → c. Letting X_{n,k} = 1{bin k is empty}, the number of empty bins is Xn = Σ_{k=1}^n X_{n,k}. It is clear that

E[Xn] = Σ_{k=1}^n E[X_{n,k}] = Σ_{k=1}^n P(bin k is empty) = n((n − 1)/n)^{rn}

and

E[Xn²] = E[Σ_{k=1}^n X_{n,k}² + 2 Σ_{i<j} X_{n,i} X_{n,j}] = Σ_{k=1}^n E[X_{n,k}] + 2 Σ_{i<j} E[X_{n,i} X_{n,j}]
= Σ_{k=1}^n P(bin k is empty) + 2 Σ_{i<j} P(bins i and j are empty)
= n((n − 1)/n)^{rn} + 2 (n choose 2) ((n − 2)/n)^{rn} = n(1 − 1/n)^{rn} + n(n − 1)(1 − 2/n)^{rn},

so

Var(Xn) = E[Xn²] − E[Xn]² = n(1 − 1/n)^{rn} + n(n − 1)(1 − 2/n)^{rn} − n²(1 − 1/n)^{2rn}.

Now L'Hospital's rule gives lim_{n→∞} log((n − 1)/n)/n^{−1} = lim_{n→∞} [1/(n(n − 1))]/(−n^{−2}) = lim_{n→∞} −n/(n − 1) = −1, so, since rn/n → c, we have that

log(((n − 1)/n)^{rn}) = (rn/n) · [log((n − 1)/n)/n^{−1}] → −c and thus ((n − 1)/n)^{rn} → e^{−c} as n → ∞.

Similarly, (1 − 2/n)^{rn} → e^{−2c} and (1 − 1/n)^{2rn} → e^{−2c}.

Consequently, E[Xn]/n = ((n − 1)/n)^{rn} → e^{−c} and

Var(Xn)/n² = (1 − 1/n)^{rn}/n + (n(n − 1)/n²)(1 − 2/n)^{rn} − (1 − 1/n)^{2rn} → 0 + 1 · e^{−2c} − e^{−2c} = 0

as n → ∞, so taking bn = n in Theorem 7.2 shows that the proportion of empty bins, Xn/n, converges to e^{−c} in probability.
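The occupancy limit can likewise be checked by simulation. In the sketch below (c = 2 and n = 10⁵ are illustrative choices), the observed fraction of empty bins is close to e^{−c}:

```python
import math
import numpy as np

rng = np.random.default_rng(7)
n, c = 100_000, 2.0
# drop r = c*n balls into n bins uniformly at random and tally each bin
counts = np.bincount(rng.integers(0, n, size=int(c * n)), minlength=n)
frac_empty = float(np.mean(counts == 0))   # should be near exp(-c)
```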
Weak Law of Large Numbers.
We begin by providing a simple analysis proof of the weak law in its classical form. The general trick is to

use truncation in order to consider cases where we have control over the size and the probability, respectively.

Theorem 7.3. Suppose that X1, X2, ... are i.i.d. with E|X1| < ∞. Let Sn = Σ_{i=1}^n Xi and µ = E[X1]. Then (1/n)Sn → µ in probability.

Proof. In what follows, the arithmetic average of the first n terms of a sequence of random variables Y1, Y2, ... will be denoted by Ȳn = (1/n) Σ_{i=1}^n Yi.

We first note that, by replacing Xi with Xi' = Xi − µ if necessary, we may suppose without loss of generality that E[Xi] = 0.

Thus we need to show that for given ε, δ > 0, there is an N ∈ N such that P(|X̄n| > ε) < δ whenever n ≥ N. To this end, we pick C < ∞ large enough that E[|X1| 1{|X1| > C}] < η for some η to be determined.

(This is possible since |X1| 1{|X1| ≤ n} ≤ |X1| and E|X1| < ∞, so lim_{n→∞} E[|X1| 1{|X1| ≤ n}] = E|X1| by the dominated convergence theorem, hence E[|X1| 1{|X1| > n}] = E|X1| − E[|X1| 1{|X1| ≤ n}] → 0.)
Now define

Wi = Xi 1{|Xi| ≤ C} − E[Xi 1{|Xi| ≤ C}],
Zi = Xi 1{|Xi| > C} − E[Xi 1{|Xi| > C}].

By assumption, we have that

E|Zi| ≤ 2E[|X1| 1{|X1| > C}] < 2η,

and thus, for every n ∈ N,

E|Z̄n| = E|(1/n) Σ_{i=1}^n Zi| ≤ (1/n) Σ_{i=1}^n E|Zi| ≤ 2η.

Also, the Wi's are i.i.d. with mean zero and satisfy |Wi| ≤ 2C by construction, so

E[W̄n²] = (1/n²)(Σ_{i=1}^n E[Wi²] + Σ_{i≠j} E[Wi Wj]) = E[W1²]/n ≤ 4C²/n,

and thus

(E|W̄n|)² ≤ E[W̄n²] ≤ 4C²/n

by Jensen's inequality.

Consequently, if n ≥ N := ⌈4C²/η²⌉, then E|W̄n| ≤ η.

Finally, Chebychev's inequality and the fact that

|X̄n| = |W̄n + Z̄n| ≤ |W̄n| + |Z̄n|

imply that for n ≥ N,

P(|X̄n| > ε) ≤ P(|W̄n| + |Z̄n| > ε) ≤ (E|W̄n| + E|Z̄n|)/ε < 3η/ε.

Taking η = εδ/3 completes the proof. □

We now turn to a weak law for triangular arrays which can be useful even in situations involving infinite means.

Theorem 7.4. For each n ∈ N, let X_{n,1}, ..., X_{n,n} be independent. Let {bn}_{n=1}^∞ be a sequence of positive numbers with lim_{n→∞} bn = ∞ and let X̃_{n,k} = X_{n,k} 1{|X_{n,k}| ≤ bn}. Suppose that as n → ∞,

(1) Σ_{k=1}^n P(|X_{n,k}| > bn) → 0, and
(2) bn^{−2} Σ_{k=1}^n E[X̃_{n,k}²] → 0.

If we let Sn = Σ_{k=1}^n X_{n,k} and an = Σ_{k=1}^n E[X̃_{n,k}], then (Sn − an)/bn →p 0.

Proof. Let S̃n = Σ_{k=1}^n X̃_{n,k}. By partitioning the event {|(Sn − an)/bn| > ε} according to whether or not Sn = S̃n, we see that

P(|(Sn − an)/bn| > ε) ≤ P(Sn ≠ S̃n) + P(|(S̃n − an)/bn| > ε).

To estimate the first term, we observe that

P(Sn ≠ S̃n) ≤ P(∪_{k=1}^n {X_{n,k} ≠ X̃_{n,k}}) ≤ Σ_{k=1}^n P(X_{n,k} ≠ X̃_{n,k}) = Σ_{k=1}^n P(|X_{n,k}| > bn) → 0

where the first inequality is due to the fact that Sn ≠ S̃n implies that there is some k ∈ [n] with X_{n,k} ≠ X̃_{n,k}, and the second inequality is countable subadditivity.

For the second term, we use Chebychev's inequality, E[S̃n] = an, the independence of the X̃_{n,k}'s, and our second assumption to obtain

P(|(S̃n − an)/bn| > ε) ≤ ε^{−2} E[((S̃n − an)/bn)²] = ε^{−2} bn^{−2} Var(S̃n)
= ε^{−2} bn^{−2} Σ_{k=1}^n Var(X̃_{n,k}) ≤ ε^{−2} bn^{−2} Σ_{k=1}^n E[X̃_{n,k}²] → 0. □

Theorem 7.4 was so easy to prove because we assumed exactly what we needed. Essentially, these are the

correct hypotheses for the weak law, but they are a little clunky so we usually talk about special cases that

take a nicer form.

In order to prove our weak law for sequences of i.i.d. random variables, we need the following simple lemma.

Lemma 7.2 (Layer cake representation). If Y ≥ 0 and p > 0, then

E[Y^p] = ∫_0^∞ p y^{p−1} P(Y > y) dy.

Proof. Tonelli's theorem gives

∫_0^∞ p y^{p−1} P(Y > y) dy = ∫_0^∞ p y^{p−1} (∫_Ω 1{Y > y} dP) dy
= ∫_Ω (∫_0^∞ p y^{p−1} 1{y < Y} dy) dP
= ∫_Ω (∫_0^Y p y^{p−1} dy) dP = ∫_Ω Y^p dP = E[Y^p]. □
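The layer cake formula is straightforward to verify numerically. The sketch below (Y ~ Exponential(1) and p = 5/2 are illustrative choices) compares a trapezoid approximation of ∫_0^∞ p y^{p−1} P(Y > y) dy with the known value E[Y^p] = Γ(p + 1):

```python
import math
import numpy as np

p = 2.5
y = np.linspace(0.0, 50.0, 200_001)        # e^{-50} makes the tail negligible
integrand = p * y ** (p - 1) * np.exp(-y)  # p y^{p-1} P(Y > y) for Exp(1)
dy = y[1] - y[0]
# trapezoid rule by hand
layer_cake = dy * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
direct = math.gamma(p + 1)                 # E[Y^p] for Y ~ Exp(1)
```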

We now have all the necessary ingredients for

Theorem 7.5 (Weak Law of Large Numbers). Let X1, X2, ... be i.i.d. with xP(|X1| > x) → 0 as x → ∞. Let Sn = X1 + ... + Xn and µn = E[X1 1{|X1| ≤ n}]. Then (1/n)Sn − µn → 0 in probability.

Proof. We will apply Theorem 7.4 with X_{n,k} = Xk and bn = n (hence an = nµn).

The first assumption is satisfied since

Σ_{k=1}^n P(|X_{n,k}| > n) = nP(|X1| > n) → 0.

For the second assumption, we have X̃_{n,k} = Xk 1{|Xk| ≤ n}, so we must show that

(1/n) E[X̃_{n,1}²] = (1/n²) Σ_{k=1}^n E[X̃_{n,k}²] → 0.

Lemma 7.2 shows that

E[X̃_{n,1}²] = ∫_0^∞ 2y P(|X̃_{n,1}| > y) dy ≤ ∫_0^n 2y P(|X1| > y) dy

since P(|X̃_{n,1}| > y) = 0 for y > n and P(|X̃_{n,1}| > y) = P(|X1| > y) − P(|X1| > n) for y ≤ n, so we will be done once we prove that

(1/n) ∫_0^n 2y P(|X1| > y) dy → 0.

To see that this is the case, note that since 2yP(|X1| > y) → 0 as y → ∞, for any ε > 0, there is an N ∈ N such that 2yP(|X1| > y) < ε whenever y ≥ N. Because 2yP(|X1| > y) < 2N for y < N, we see that for all n > N,

(1/n) ∫_0^n 2y P(|X1| > y) dy = (1/n) ∫_0^N 2y P(|X1| > y) dy + (1/n) ∫_N^n 2y P(|X1| > y) dy
≤ (1/n) ∫_0^N 2N dy + (1/n) ∫_N^n ε dy = 2N²/n + ((n − N)/n) ε,

hence

lim sup_{n→∞} (1/n) ∫_0^n 2y P(|X1| > y) dy ≤ lim sup_{n→∞} [2N²/n + ((n − N)/n) ε] = ε,

and the result follows since ε was arbitrary. □

Remark. Theorem 7.5 implies Theorem 7.3 since if E|X1| < ∞, then the dominated convergence theorem gives

µn = E[X1 1{|X1| ≤ n}] → E[X1] = µ as n → ∞,
xP(|X1| > x) ≤ E[|X1| 1{|X1| > x}] → 0 as x → ∞.

On the other hand, the improvement is not vast since xP(|X1| > x) → 0 implies that there is M ∈ N so that xP(|X1| > x) ≤ 1 for x ≥ M, and thus for any ε > 0, Lemma 7.2 with p = 1 − ε yields

E[|X1|^{1−ε}] = ∫_0^∞ (1 − ε) y^{−ε} P(|X1| > y) dy = (1 − ε) ∫_0^∞ y^{−(1+ε)} · yP(|X1| > y) dy
≤ (1 − ε) ∫_0^M y^{−ε} dy + (1 − ε) ∫_M^∞ y^{−(1+ε)} dy < ∞,

where in the first integral we bounded P(|X1| > y) ≤ 1 and in the second yP(|X1| > y) ≤ 1.

Example 7.7 (The St. Petersburg Paradox). Suppose that I offered to pay you 2^j dollars if it takes j flips of a fair coin for the first head to appear. That is, your winnings are given by the random variable X with P(X = 2^j) = 2^{−j} for j ∈ N. How much would you pay to play the game n times? The paradox is that E[X] = Σ_{j=1}^∞ 2^j · 2^{−j} = ∞, but most sensible people would not pay anywhere near $40 a game.

Using Theorem 7.4, we will show that a fair price for playing n times is $log₂(n) per play, so that one would need to play about a trillion rounds to reasonably expect to break even at $40 a play.

Proof. To cast this problem in terms of Theorem 7.4, we will take X1, X2, ... to be independent random variables which are equal in distribution to X and set X_{n,k} = Xk. Then Sn = Σ_{k=1}^n Xk denotes your total winnings after n games. We need to choose bn so that

nP(X > bn) = Σ_{k=1}^n P(X_{n,k} > bn) → 0,
(n/bn²) E[X² 1{X ≤ bn}] = bn^{−2} Σ_{k=1}^n E[(X_{n,k} 1{|X_{n,k}| ≤ bn})²] → 0.

To this end, let m(n) = log₂(n) + K(n) where K(n) is such that m(n) ∈ N and K(n) → ∞ as n → ∞. If we set bn = 2^{m(n)} = n2^{K(n)}, we have

nP(X > bn) ≤ nP(X ≥ bn) = n Σ_{i=m(n)}^∞ 2^{−i} = n2^{−m(n)+1} = 2^{−K(n)+1} → 0

and

E[X² 1{X ≤ bn}] = Σ_{i=1}^{m(n)} 2^{2i} · 2^{−i} = 2^{m(n)+1} − 2 ≤ 2bn,

so that

(n/bn²) E[X² 1{|X| ≤ bn}] ≤ 2n/bn = 2^{−K(n)+1} → 0.

Since

an = Σ_{k=1}^n E[X_{n,k} 1{|X_{n,k}| ≤ bn}] = nE[X 1{X ≤ bn}] = n Σ_{i=1}^{m(n)} 2^i · 2^{−i} = nm(n),

Theorem 7.4 gives

(Sn − n log₂(n) − nK(n))/(n2^{K(n)}) →p 0.

If we take K(n) ≤ log₂(log₂(n)), then the conclusion holds with n log₂(n) in the denominator, so we get

Sn/(n log₂(n)) →p 1. □
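The slow growth of the fair price can be seen in simulation, though the heavy upper tail of the payout makes individual runs noisy. The sketch below (the run length 20000 and 11 repetitions are illustrative choices) looks at the median of Sn/(n log₂ n) over several independent runs:

```python
import math
import random

random.seed(0)

def payout():
    # winnings 2^j with probability 2^{-j}: flip until the first head
    j = 1
    while random.random() < 0.5:
        j += 1
    return 2**j

def ratio(n):
    return sum(payout() for _ in range(n)) / (n * math.log2(n))

runs = sorted(ratio(20_000) for _ in range(11))
median_ratio = runs[5]   # typically close to 1
```

Using the median rather than the mean is deliberate: E[Sn] = ∞, so the empirical mean of the ratios is dominated by rare enormous payouts.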

8. Borel-Cantelli Lemmas

Given a sequence of events A1, A2, ... ∈ F, we define

lim sup_n An := ∩_{n=1}^∞ ∪_{m=n}^∞ Am = {ω : ω is in infinitely many An},

which is often abbreviated as {An i.o.} where i.o. stands for infinitely often.

The nomenclature derives from the straightforward identity lim sup_{n→∞} 1_{An} = 1_{lim sup_n An}.

One can likewise define the limit inferior by

lim inf_n An := ∪_{n=1}^∞ ∩_{m=n}^∞ Am = {ω : ω is in all but finitely many An},

but little is gained by doing so since lim inf_n An = (lim sup_n An^C)^C.

To illustrate the utility of this notion, observe that Xn → X a.s. if and only if P(|Xn − X| > ε i.o.) = 0 for every ε > 0.

Lemma 8.1 (Borel-Cantelli I). If Σ_{n=1}^∞ P(An) < ∞, then P(An i.o.) = 0.

Proof. Let N = Σ_{n=1}^∞ 1_{An} denote the number of events that occur. Tonelli's theorem (or MCT) gives

E[N] = Σ_{n=1}^∞ E[1_{An}] = Σ_{n=1}^∞ P(An) < ∞,

so it must be the case that N < ∞ a.s. □
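The dichotomy between summable and non-summable tail probabilities shows up clearly in simulation. In the sketch below (uniform variables and the cutoffs 1/n² versus 1/n are illustrative choices), An = {Un < 1/n²} occurs only a handful of times, while An = {Un < 1/n}, whose probabilities sum to ∞, keeps occurring at a logarithmic rate (as guaranteed by the second Borel-Cantelli lemma, since the Un's are independent):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
u = rng.uniform(size=N)
n = np.arange(1, N + 1)
# sum P(U_n < 1/n^2) = pi^2/6 < infinity: finitely many occurrences expected
count_summable = int(np.sum(u < 1 / n**2))
# sum P(U_n < 1/n) diverges: occurrences grow like log N (about 12 here)
count_divergent = int(np.sum(u < 1 / n))
```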

A nice application of the first Borel-Cantelli lemma is

Theorem 8.1. Xn →p X if and only if every subsequence {X_{n_m}}_{m=1}^∞ has a further subsequence {X_{n_{m(k)}}}_{k=1}^∞ such that X_{n_{m(k)}} → X a.s. as k → ∞.

Proof. Suppose that Xn →p X and let {X_{n_m}}_{m=1}^∞ be any subsequence. Then X_{n_m} →p X, so for every k ∈ N, P(|X_{n_m} − X| > 1/k) → 0 as m → ∞. It follows that we can choose a further subsequence {X_{n_{m(k)}}}_{k=1}^∞ such that P(|X_{n_{m(k)}} − X| > 1/k) ≤ 2^{−k} for all k ∈ N. Since

Σ_{k=1}^∞ P(|X_{n_{m(k)}} − X| > 1/k) ≤ 1 < ∞,

the first Borel-Cantelli lemma shows that P(|X_{n_{m(k)}} − X| > 1/k i.o.) = 0.

Because {|X_{n_{m(k)}} − X| > ε i.o.} ⊆ {|X_{n_{m(k)}} − X| > 1/k i.o.} for every ε > 0, we see that X_{n_{m(k)}} → X a.s.

To prove the converse, we first observe

Lemma 8.2. Let {yn}_{n=1}^∞ be a sequence of elements in a topological space. If every subsequence {y_{n_m}}_{m=1}^∞ has a further subsequence {y_{n_{m(k)}}}_{k=1}^∞ that converges to y, then yn → y.

Proof. If yn ↛ y, then there is an open set U containing y such that for every N ∈ N, there is an n ≥ N with yn ∉ U, hence there is a subsequence {y_{n_m}}_{m=1}^∞ with y_{n_m} ∉ U for all m. By construction, no subsequence of {y_{n_m}}_{m=1}^∞ can converge to y, and the result follows by contraposition. □

Now if every subsequence of {Xn}_{n=1}^∞ has a further subsequence that converges to X almost surely, then applying Lemma 8.2 to the sequence yn = P(|Xn − X| > ε) for an arbitrary ε > 0 shows that Xn →p X. □

Remark. Since there are sequences which converge in probability but not almost surely (e.g. Example 7.1),

it follows from Theorem 8.1 and Lemma 8.2 that a.s. convergence does not come from a topology.

(In contrast, one of the homework problems shows that convergence in probability is metrizable.)

Theorem 8.1 can sometimes be used to upgrade results depending on almost sure convergence.

For example, you are asked to show in your homework that the assumptions in Fatou's lemma and the

dominated convergence theorem can be weakened to require only convergence in probability.

To get a feel for how this works, we prove

Theorem 8.2. If f is continuous and Xn →p X , then f (Xn ) →p f (X). If, in addition, f is bounded, then
E[f (Xn )] → E[f (X)].

Proof. If {X_{n_m}} is a subsequence, then Theorem 8.1 guarantees the existence of a further subsequence {X_{n_{m(k)}}} which converges to X a.s. Since limits commute with continuous functions, this means that f(X_{n_{m(k)}}) → f(X) a.s. The other direction of Theorem 8.1 now implies that f(Xn) →p f(X).

If f is bounded as well, then the dominated convergence theorem yields E[f(X_{n_{m(k)}})] → E[f(X)]. Applying Lemma 8.2 to the sequence yn = E[f(Xn)] establishes the second part of the theorem.

(Since f is bounded, the same argument shows that f(Xn) → f(X) in L¹.) □

We will now use the first Borel-Cantelli lemma to prove a weak form of the Strong Law of Large Numbers.

Theorem 8.3. Let X1, X2, ... be i.i.d. with E[X1] = µ and E[X1⁴] < ∞. If Sn = X1 + ... + Xn, then (1/n)Sn → µ almost surely.

Proof. By taking Xi' = Xi − µ, we can suppose without loss of generality that µ = 0. Now

E[Sn⁴] = E[(Σ_{i=1}^n Xi)(Σ_{j=1}^n Xj)(Σ_{k=1}^n Xk)(Σ_{l=1}^n Xl)] = Σ_{1≤i,j,k,l≤n} E[Xi Xj Xk Xl].

By independence, terms of the form E[Xi³ Xj], E[Xi² Xj Xk] and E[Xi Xj Xk Xl] with distinct indices are all zero (since the expectation of the product is the product of the expectations).

The only non-vanishing terms are thus of the form E[Xi⁴] and E[Xi² Xj²], of which there are n of the former and 3n(n − 1) of the latter (determined by the (n choose 2) ways of picking the indices and the (4 choose 2) ways of picking which two of the four sums gave rise to the smaller and larger indices).
Because E[Xi² Xj²] = E[Xi²]² ≤ E[Xi⁴], we have

E[Sn⁴] ≤ nE[X1⁴] + 3n(n − 1)E[X1²]² ≤ Cn²

where C = 3E[X1⁴] < ∞ by assumption.

It follows from Chebychev's inequality that

P((1/n)|Sn| > ε) = P(|Sn|⁴ > (nε)⁴) ≤ Cn²/(n⁴ε⁴) = C/(n²ε⁴),

hence

Σ_{n=1}^∞ P((1/n)|Sn| > ε) ≤ Cε^{−4} Σ_{n=1}^∞ 1/n² < ∞.

Therefore, P((1/n)|Sn| > ε i.o.) = 0 by Borel-Cantelli, so, since ε > 0 was arbitrary, (1/n)Sn → 0 a.s. □

The converse of the Borel-Cantelli lemma is false in general:

Example 8.1. Let Ω = [0, 1], F = Borel sets, P = Lebesgue measure, and define An = (0, 1/n). Then Σ_{n=1}^∞ P(An) = Σ_{n=1}^∞ 1/n = ∞ and lim sup_{n→∞} An = ∅.

However, if the An's are independent, then we have

Lemma 8.3 (Borel-Cantelli II). If the events A1, A2, ... are independent, then Σ_{n=1}^∞ P(An) = ∞ implies P(An i.o.) = 1.

Proof. For each n ∈ N, the sequence B_{n,1}, B_{n,2}, ... defined by B_{n,k} = ∩_{m=n}^{n+k} Am^C decreases to Bn := ∩_{m=n}^∞ Am^C.

Also, since the Am's (and thus their complements) are independent, we have

P(B_{n,k}) = P(∩_{m=n}^{n+k} Am^C) = Π_{m=n}^{n+k} P(Am^C)
= Π_{m=n}^{n+k} (1 − P(Am)) ≤ Π_{m=n}^{n+k} e^{−P(Am)} = e^{−Σ_{m=n}^{n+k} P(Am)}

where the inequality is due to the Taylor series bound e^{−x} ≥ 1 − x for x ∈ [0, 1].

Because Σ_{m=n}^∞ P(Am) = ∞ by assumption, it follows from continuity from above that

P(Bn) = lim_{k→∞} P(B_{n,k}) ≤ lim_{k→∞} e^{−Σ_{m=n}^{n+k} P(Am)} = 0,

hence P(∪_{m=n}^∞ Am) = P(Bn^C) = 1 for all n ∈ N.

Since ∪_{m=n}^∞ Am ↘ lim sup_{n→∞} An = {An i.o.}, another application of continuity from above gives

P(An i.o.) = lim_{n→∞} P(∪_{m=n}^∞ Am) = 1. □

Taken together, the Borel-Cantelli lemmas show that if A1 , A2 , ... is a sequence of independent events, then

the event {An i.o.} occurs either with probability 0 or probability 1.


Thus if A1 , A2 , ... are independent, then P (An i.o.) >0 implies P (An i.o.) = 1.
It follows from the second Borel-Cantelli lemma that innitely many independent trials of a random experi-

ment will almost surely result in innitely many realizations of any event having positive probability.

For example, given any finite string from a finite alphabet (e.g. the complete works of Shakespeare in chronological order), an infinite string with characters chosen independently and uniformly from the alphabet (produced by the proverbial monkey at a typewriter, say) will almost surely contain infinitely many instances of said string.
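As a quick numerical illustration (a sketch, not from the notes: the two-letter alphabet, the target pattern "abba", the string length, and the seed are all arbitrary choices), one can simulate a long uniform random string and count occurrences of a fixed pattern; the count grows roughly linearly with the length, in line with the second Borel-Cantelli lemma applied to the events that the pattern occurs in each disjoint block.

```python
import random

def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(1 for i in range(len(text) - len(pattern) + 1)
               if text[i:i + len(pattern)] == pattern)

rng = random.Random(0)  # fixed seed so the run is reproducible
text = "".join(rng.choice("ab") for _ in range(100_000))

# Each length-4 window matches "abba" with probability (1/2)^4 = 1/16,
# so roughly 100_000/16 = 6250 occurrences are expected.
hits = count_occurrences(text, "abba")
print(hits)
```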

Similarly, many leading cosmological theories imply the existence of infinitely many universes which may be regarded as being i.i.d., with the current state of our universe having positive probability. If any of these theories is true, then Borel-Cantelli says that there are infinitely many copies of us throughout the multiverse having this discussion!

A more serious application demonstrates the necessity of the integrability assumption in the strong law.

Theorem 8.4. If X1, X2, ... are i.i.d. with E|X1| = ∞, then P(|Xn| ≥ n i.o.) = 1.
Thus if Sn = ∑_{i=1}^n Xi, then P(lim_{n→∞} Sn/n exists in R) = 0.

Proof. Lemma 7.2 and the fact that G(x) := P(|X1| > x) is nonincreasing give

E|X1| = ∫_0^∞ P(|X1| > x) dx ≤ ∑_{n=0}^∞ P(|X1| > n) ≤ ∑_{n=0}^∞ P(|X1| ≥ n).

Because E|X1| = ∞ and the Xn's are i.i.d., it follows from the second Borel-Cantelli lemma that P(|Xn| ≥ n i.o.) = 1.
To establish the second claim, we will show that C = {lim_{n→∞} Sn/n exists in R} and {|Xn| ≥ n i.o.} are disjoint, hence P(|Xn| ≥ n i.o.) = 1 implies P(C) = 0.


To this end, observe that

Sn/n − Sn+1/(n + 1) = ((n + 1)Sn − n(Sn + Xn+1))/(n(n + 1)) = Sn/(n(n + 1)) − Xn+1/(n + 1).
Now suppose that ω ∈ C. Then it must be the case that lim_{n→∞} Sn(ω)/(n(n + 1)) = 0, so there is an N ∈ N with |Sn(ω)/(n(n + 1))| < 1/2 whenever n ≥ N.
If ω ∈ {|Xn| ≥ n i.o.} as well, then there would be infinitely many n ≥ N with |Xn+1(ω)|/(n + 1) ≥ 1.
But this would mean that |Sn(ω)/n − Sn+1(ω)/(n + 1)| = |Sn(ω)/(n(n + 1)) − Xn+1(ω)/(n + 1)| > 1/2 for infinitely many n, so that the sequence {Sn(ω)/n}_{n=1}^∞ is not Cauchy, contradicting ω ∈ C. □

Our next example is a typical application where the two Borel-Cantelli lemmas are used together to obtain

results on the limit superior of a (suitably scaled) sequence of i.i.d. random variables.

Example 8.2. Let X1, X2, ... be a sequence of i.i.d. exponential random variables with rate 1 (so that Xi ≥ 0 with P(Xi ≤ x) = 1 − e^{−x} for x ≥ 0).

We will show that

lim sup_{n→∞} Xn/log(n) = 1 a.s.

First observe that

P(Xn/log(n) ≥ 1) = P(Xn ≥ log(n)) = P(Xn > log(n)) = e^{−log(n)} = 1/n,

so

∑_{n=1}^∞ P(Xn/log(n) ≥ 1) = ∑_{n=1}^∞ 1/n = ∞,

thus, since the Xn's are independent, the second Borel-Cantelli lemma implies that P(Xn/log(n) ≥ 1 i.o.) = 1, and we conclude that lim sup_{n→∞} Xn/log(n) ≥ 1 almost surely.
On the other hand, for any ε > 0,

P(Xn/log(n) ≥ 1 + ε) = P(Xn > (1 + ε)log(n)) = 1/n^{1+ε},

which is summable, so it follows from the first Borel-Cantelli lemma that P(Xn/log(n) ≥ 1 + ε i.o.) = 0.
Since ε > 0 was arbitrary, this means that lim sup_{n→∞} Xn/log(n) ≤ 1 almost surely, and the claim is proved.
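A consequence of the two bounds above is that the running maximum Mn = max_{k≤n} Xk satisfies Mn/log(n) → 1 a.s., which can be probed numerically (a sketch, not part of the notes; the sample size and seed are arbitrary, and convergence is slow since Mn − log(n) has fluctuations of constant order).

```python
import math
import random

rng = random.Random(1)  # fixed seed for reproducibility
n = 100_000
running_max = 0.0
for _ in range(n):
    running_max = max(running_max, rng.expovariate(1.0))  # rate-1 exponential

ratio = running_max / math.log(n)
print(ratio)  # should be near 1; the error is of order 1/log(n)
```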

We conclude with a cute example in which an a.s. convergence result cannot be upgraded to pointwise

convergence.

Example 8.3. We will show that for any sequence of random variables {Xn}_{n=1}^∞, one can find a sequence of real numbers {cn}_{n=1}^∞ such that Xn/cn → 0 a.s., but that in general, no such sequence can be found such that the convergence is pointwise.

The first statement is an easy application of the first Borel-Cantelli lemma: Given {Xn}_{n=1}^∞, let {cn}_{n=1}^∞ be a sequence of positive numbers such that P(|Xn| > cn/n) ≤ 2^{−n}. Such a sequence can be found since P(|Xn| > x) → 0 as x → ∞. Then

∑_{n=1}^∞ P(|Xn|/cn > 1/n) ≤ 1 < ∞,

so for all ε > 0, P(|Xn|/cn > ε i.o.) ≤ P(|Xn|/cn > 1/n i.o.) = 0, hence Xn/cn → 0 a.s.

The interesting observation is that we cannot always choose {cn}_{n=1}^∞ so that the convergence is pointwise.

To see this, let C denote the Cantor set. Since C has the cardinality of the continuum, there is a bijection f : C → {{an}_{n=1}^∞ : an ∈ N for all n}.
Define the random variables {Xn}_{n=1}^∞ on [0, 1] with Borel sets and Lebesgue measure by

Xn(ω) = f(ω)n + 1 if ω ∈ C, and Xn(ω) = 1 if ω ∉ C,

where f(ω)n denotes the nth term of the sequence f(ω).

For any sequence {cn}_{n=1}^∞, the sequence {c̃n}_{n=1}^∞ defined by c̃n = ⌈|cn|⌉ is equal to f(ω′) for some ω′ ∈ C, hence |Xn(ω′)/cn| ≥ Xn(ω′)/c̃n > 1 for all n, so there is no sequence of reals for which the convergence is sure.

9. Strong Law of Large Numbers

Our goal at this point is to strengthen the conclusion of Theorem 7.3 from convergence in probability to

almost sure convergence. The following proof is due to Nasrollah Etemadi.

Theorem 9.1 (Strong Law of Large Numbers). Suppose that X1, X2, ... are pairwise independent and identically distributed with E|X1| < ∞. Let Sn = ∑_{k=1}^n Xk and µ = E[X1]. Then (1/n)Sn → µ almost surely as n → ∞.

Proof.
We begin by noting that Xk+ = max{Xk, 0} and Xk− = max{−Xk, 0} satisfy the theorem's assumptions, so, since Xk = Xk+ − Xk−, we may suppose without loss of generality that the Xk's are nonnegative.

Next, we observe that it suffices to consider truncated versions of the Xk's:

Claim 9.1. If Yk = Xk·1{Xk ≤ k} and Tn = ∑_{k=1}^n Yk, then (1/n)Tn → µ a.s. implies (1/n)Sn → µ a.s.
Proof. Lemma 7.2 and the fact that G(t) = P(X1 > t) is nonincreasing imply

∑_{k=1}^∞ P(Xk ≠ Yk) = ∑_{k=1}^∞ P(Xk > k) = ∑_{k=1}^∞ P(X1 > k) ≤ ∫_0^∞ P(X1 > t) dt = E|X1| < ∞,

so the first Borel-Cantelli lemma gives P(Xk ≠ Yk i.o.) = 0. Thus for all ω in a set of probability one, sup_n |Sn(ω) − Tn(ω)| < ∞, hence Sn/n − Tn/n → 0 a.s. and the claim follows. □
The truncation step should not be too surprising as it is generally easier to work with bounded random

variables. The reason that we reduced the problem to the Xk ≥ 0 case is that this assures that the sequence

T1 , T2 , ... is nondecreasing.

Our strategy will be to prove convergence along a cleverly chosen subsequence and then exploit monotonicity to handle intermediate values.
Specifically, for α > 1, let k(n) = ⌊αⁿ⌋, the greatest integer less than or equal to αⁿ.
Chebychev's inequality and Tonelli's theorem give

∑_{n=1}^∞ P(|Tk(n) − E[Tk(n)]| > εk(n)) ≤ ∑_{n=1}^∞ Var(Tk(n))/(ε²k(n)²) = ε⁻² ∑_{n=1}^∞ k(n)⁻² ∑_{m=1}^{k(n)} Var(Ym)
  = ε⁻² ∑_{m=1}^∞ Var(Ym) ∑_{n: k(n)≥m} k(n)⁻² ≤ ε⁻² ∑_{m=1}^∞ E[Ym²] ∑_{n: αⁿ≥m} ⌊αⁿ⌋⁻².

Since ⌊αⁿ⌋ ≥ (1/2)αⁿ for n ≥ 1 (by considering separately the cases αⁿ < 2 and αⁿ ≥ 2),

∑_{n: αⁿ≥m} ⌊αⁿ⌋⁻² ≤ 4 ∑_{n ≥ log_α m} α⁻²ⁿ ≤ 4α^{−2 log_α m} ∑_{n=0}^∞ α⁻²ⁿ = 4(1 − α⁻²)⁻¹m⁻²,

hence

∑_{n=1}^∞ P(|Tk(n) − E[Tk(n)]| > εk(n)) ≤ ε⁻² ∑_{m=1}^∞ E[Ym²] ∑_{n: αⁿ≥m} ⌊αⁿ⌋⁻² ≤ 4(1 − α⁻²)⁻¹ε⁻² ∑_{m=1}^∞ E[Ym²]/m².
At this point, we note that

Claim 9.2. ∑_{m=1}^∞ E[Ym²]/m² < ∞.

Proof. By Lemma 7.2,

E[Ym²] = ∫_0^∞ 2yP(Ym > y) dy = ∫_0^m 2yP(Ym > y) dy ≤ ∫_0^m 2yP(X1 > y) dy,

so Tonelli's theorem gives

∑_{m=1}^∞ E[Ym²]/m² ≤ ∑_{m=1}^∞ m⁻² ∫_0^m 2yP(X1 > y) dy = ∫_0^∞ 2y (∑_{m>y} m⁻²) P(X1 > y) dy.

Since ∫_0^∞ P(X1 > y) dy = E[X1] < ∞, we will be done if we can show that y ∑_{m>y} m⁻² is uniformly bounded.

To see that this is the case, observe that

y ∑_{m>y} m⁻² ≤ ∑_{m=1}^∞ m⁻² = π²/6 < 2

for y ∈ [0, 1], and for j ≥ 2,

∑_{m=j}^∞ m⁻² ≤ ∫_{j−1}^∞ x⁻² dx = (j − 1)⁻¹,

so

y ∑_{m>y} m⁻² = y ∑_{m=⌊y⌋+1}^∞ m⁻² ≤ y/⌊y⌋ ≤ 2

for y > 1. □


It follows that ∑_{n=1}^∞ P(|Tk(n) − E[Tk(n)]| > εk(n)) < ∞, so, since ε > 0 is arbitrary, the first Borel-Cantelli lemma implies that (Tk(n) − E[Tk(n)])/k(n) → 0 a.s.
Now lim_{k→∞} E[Yk] = E[X1] by the dominated convergence theorem, so lim_{n→∞} E[Tk(n)]/k(n) = E[X1].
Thus we have shown that Tk(n)/k(n) → µ almost surely.
Finally, if k(n) ≤ m < k(n + 1), then

(k(n)/k(n + 1))·(Tk(n)/k(n)) = Tk(n)/k(n + 1) ≤ Tm/m ≤ Tk(n+1)/k(n) = (Tk(n+1)/k(n + 1))·(k(n + 1)/k(n))

since Tn is nondecreasing.
Because k(n + 1)/k(n) = ⌊α^{n+1}⌋/⌊αⁿ⌋ → α as n → ∞, we see that

µ/α ≤ lim inf_{m→∞} Tm/m ≤ lim sup_{m→∞} Tm/m ≤ αµ,

and we're done since α > 1 is arbitrary. □
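A simulation sanity check of the strong law (a sketch, not part of the proof; the Uniform(0,1) distribution, sample size, and seed are arbitrary choices):

```python
import random

rng = random.Random(2)  # fixed seed for reproducibility
n = 100_000
s = 0.0
for _ in range(n):
    s += rng.random()   # Uniform(0,1) draws, so mu = 1/2

average = s / n         # S_n / n
print(average)          # should be close to 0.5
```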

The next result shows that the strong law holds whenever E[X1 ] exists.

Theorem 9.2. Let X1, X2, ... be i.i.d. with E[X1+] = ∞ and E[X1−] < ∞. Then (1/n)Sn → ∞ a.s.

Proof.
For any M ∈ N, let Xi^M = Xi ∧ M. Then the Xi^M's are i.i.d. with E|X1^M| < ∞, so, writing Sn^M = ∑_{i=1}^n Xi^M, it follows from Theorem 9.1 that (1/n)Sn^M → E[X1^M] almost surely as n → ∞.

Now Xi ≥ Xi^M for all M, so lim inf_{n→∞} Sn/n ≥ lim_{n→∞} Sn^M/n = E[X1^M].
The monotone convergence theorem implies that

lim_{M→∞} E[(X1^M)+] = E[lim_{M→∞} (X1^M)+] = E[X1+] = ∞,

so

E[X1^M] = E[(X1^M)+] − E[(X1^M)−] = E[(X1^M)+] − E[X1−] ↗ ∞,

thus lim inf_{n→∞} Sn/n = ∞ a.s. and the theorem follows. □

Our first application of the strong law of large numbers comes from renewal theory.

Example 9.1. Let X1, X2, ... be i.i.d. with 0 < X1 < ∞, and let Tn = X1 + ... + Xn. Here we are thinking of the Xi's as times between successive occurrences of events and Tn as the time until the nth event occurs.
For example, consider a janitor who replaces a light bulb the instant it burns out. The first bulb is put in at time 0 and Xi is the lifetime of the ith bulb. Then Tn is the time that the nth bulb burns out and Nt = sup{n : Tn ≤ t} is the number of light bulbs that have burned out by time t.
Theorem 9.3 (Elementary Renewal Theorem). If E[X1] = µ ≤ ∞, then Nt/t → 1/µ a.s. as t → ∞ (with the convention that 1/∞ = 0).
Proof. Theorems 9.1 and 9.2 imply that lim_{n→∞} Tn/n = µ a.s., and it follows from the definition of Nt that TNt ≤ t < TNt+1, hence

TNt/Nt ≤ t/Nt < (TNt+1/(Nt + 1))·((Nt + 1)/Nt).

Since Tn < ∞ for all n, we have that Nt ↗ ∞ as t ↗ ∞. Thus there is a set Ω0 with P(Ω0) = 1 such that lim_{n→∞} Tn(ω)/n = µ and lim_{t→∞} Nt(ω) = ∞, hence

TNt(ω)(ω)/Nt(ω) → µ and (Nt(ω) + 1)/Nt(ω) → 1

for all ω ∈ Ω0.
It follows that t/Nt → µ on Ω0, which implies the result. □
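The light-bulb example is easy to simulate (a sketch; the mean lifetime, time horizon, and seed below are arbitrary choices): with exponential lifetimes of mean µ = 2, the burnout rate Nt/t should approach 1/µ = 0.5.

```python
import random

rng = random.Random(3)   # fixed seed for reproducibility
mu = 2.0                 # mean bulb lifetime
t = 50_000.0             # time horizon
elapsed, burned_out = 0.0, 0
while True:
    lifetime = rng.expovariate(1.0 / mu)   # Exponential with mean mu
    if elapsed + lifetime > t:
        break
    elapsed += lifetime
    burned_out += 1      # burned_out equals N_t when the loop exits

rate = burned_out / t
print(rate)              # should be close to 1/mu = 0.5
```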

Example 9.2. A common situation in statistics is that one has a sequence of random variables which is assumed to be i.i.d., but the underlying distribution is unknown. A popular estimate for the true distribution function F(x) = P(X1 ≤ x) is given by the empirical distribution function

Fn(x) = (1/n) ∑_{i=1}^n 1_{(−∞,x]}(Xi).

That is, one approximates the true probability of being at most x with the observed frequency of values ≤ x in the sample. The strong law provides some justification for this method of inference by showing that for every x ∈ R, Fn(x) → F(x) almost surely as n → ∞. The next result shows that the convergence is actually uniform in x.

Theorem 9.4 (Glivenko-Cantelli). As n → ∞,

sup_x |Fn(x) − F(x)| → 0 a.s.

Proof.
Fix x ∈ R and let Yn = 1{Xn < x}. Then Y1, Y2, ... are i.i.d. with E[Y1] = P(X1 < x) = F(x−), so the strong law implies that Fn(x−) = (1/n) ∑_{i=1}^n Yi → F(x−) a.s. as n → ∞. Similarly, Fn(x) → F(x) a.s.

In general, for any countable collection {xi} ⊆ R, there is a set Ω0 with P(Ω0) = 1 such that Fn(xi)(ω) → F(xi) and Fn(xi−)(ω) → F(xi−) for all ω ∈ Ω0.
For each k ∈ N and j = 1, ..., k − 1, set xj,k = inf{y : F(y) ≥ j/k}. The pointwise convergence of Fn(x) and Fn(x−) implies that we can pick Nk(ω) ∈ N such that

|Fn(xj,k−)(ω) − F(xj,k−)|, |Fn(xj,k)(ω) − F(xj,k)| < 1/k for all j = 1, ..., k − 1

whenever n ≥ Nk(ω). Setting x0,k := −∞ and xk,k := +∞, we see that the above inequalities also hold for j = 0, k.
Thus if xj−1,k < x < xj,k with 1 ≤ j ≤ k and n ≥ Nk, then the inequality F(xj,k−) − F(xj−1,k) ≤ 1/k and the monotonicity of Fn and F imply

Fn(x) ≤ Fn(xj,k−) ≤ F(xj,k−) + 1/k ≤ F(xj−1,k) + 2/k ≤ F(x) + 2/k,
Fn(x) ≥ Fn(xj−1,k) ≥ F(xj−1,k) − 1/k ≥ F(xj,k−) − 2/k ≥ F(x) − 2/k.

Consequently, for n ≥ Nk we have sup_{x∈R} |Fn(x) − F(x)| ≤ 2/k, and the theorem follows. □
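For a Uniform(0,1) sample the sup-distance sup_x |Fn(x) − F(x)| can be computed exactly from the order statistics, since the supremum is attained at a data point (approached from the left or the right). A quick sketch (the sample sizes and seed are arbitrary choices):

```python
import random

def ks_distance_uniform(sample):
    """Exact sup_x |F_n(x) - x| for a sample from Uniform(0,1)."""
    xs = sorted(sample)
    n = len(xs)
    # At the i-th order statistic (0-based), F_n jumps from i/n to (i+1)/n.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(4)  # fixed seed for reproducibility
d_small = ks_distance_uniform([rng.random() for _ in range(100)])
d_large = ks_distance_uniform([rng.random() for _ in range(10_000)])
print(d_small, d_large)  # the larger sample gives a much smaller distance
```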

10. Random Series

We now give an alternative proof of the SLLN which allows us to introduce some other interesting results

and to estimate the rate of convergence.


Definition. Given a sequence of random variables X1, X2, ..., we define the tail σ-field T = ∩_{n=1}^∞ σ(Xn, Xn+1, ...).
Our next theorem is an example of a 0−1 law - that is, a statement that certain classes of events are trivial

in the sense that their probabilities are either 0 or 1.

Theorem 10.1 (Kolmogorov). If X1 , X2 , ... are independent and A ∈ T , then P (A) ∈ {0, 1}.

Proof. We will show that A is independent of itself, so that P(A)² = P(A)P(A) = P(A ∩ A) = P(A).
To do so, we first note that B ∈ σ(X1, ..., Xk) and C ∈ σ(Xk+1, Xk+2, ...) are independent.
This follows from Lemma 6.1 if C ∈ σ(Xk+1, ..., Xk+j). Since σ(X1, ..., Xk) and ∪_{j=1}^∞ σ(Xk+1, ..., Xk+j) are π-systems, Theorem 6.1 shows this is true in general.

Next, we observe that E ∈ σ(X1, X2, ...) and F ∈ T are independent.
If E ∈ σ(X1, ..., Xk), then this follows from the previous observation since F ∈ T ⊆ σ(Xk+1, Xk+2, ...).
Since ∪_{k=1}^∞ σ(X1, ..., Xk) and T are π-systems, Theorem 6.1 shows it is true in general.

Because T ⊆ σ(X1 , X2 , ...), the last observation shows that A∈T is independent of itself. 

Example 10.1. If B1 , B2 , ... ∈ B , then {Xn ∈ Bn i.o.} ∈T. Taking Xn = 1An , Bn = {1}, we have

{Xn ∈ Bn i.o.} = {An i.o.}, so Theorem 10.1 shows that if A1 , A2 , ... are independent, then

P (An i.o.) ∈ {0, 1}. Of course, this also follows from the Borel-Cantelli lemmas.

Example 10.2. Let Sn = X1 + ... + Xn. Then
• {lim_{n→∞} Sn exists} ∈ T (since convergence of series only depends on their tails).
• A = {lim sup_{n→∞} Sn > 0} ∉ T in general (since the initial terms can affect the sign of the sum).
• If cn → ∞, then {lim sup_{n→∞} (1/cn)Sn > x} ∈ T for all x ∈ R (since the contribution from any finite number of terms of Sn will be killed by cn).

The first item in the previous example shows that sums of independent random variables either converge almost surely or diverge almost surely.
Our next result can be useful in determining when the former is the case.

Theorem 10.2 (Kolmogorov's maximal inequality). Suppose that X1, X2, ... are independent with E[Xk] = 0 and Var(Xk) < ∞, and let Sn = X1 + ... + Xn. Then

P(max_{1≤k≤n} |Sk| ≥ x) ≤ Var(Sn)/x².

Remark. Note that under the same hypotheses, Chebychev only gives P(|Sn| ≥ x) ≤ Var(Sn)/x².

Proof. We will partition the event in question according to the first time that the sum exceeds x by defining

Ak = {|Sk| ≥ x and |Sj| < x for all j < k}.

Since the Ak's are disjoint with ∪_{k=1}^n Ak ⊆ Ω and (Sn − Sk)² ≥ 0, we see that

E[Sn²] ≥ ∑_{k=1}^n ∫_{Ak} Sn² dP = ∑_{k=1}^n ∫_{Ak} (Sk² + 2Sk(Sn − Sk) + (Sn − Sk)²) dP
       ≥ ∑_{k=1}^n ∫_{Ak} Sk² dP + 2 ∑_{k=1}^n ∫ Sk1Ak(Sn − Sk) dP.

Our assumptions guarantee that Sk1Ak ∈ σ(X1, ..., Xk) and Sn − Sk ∈ σ(Xk+1, ..., Xn) are independent and E[Sn − Sk] = 0, so

∫ Sk1Ak(Sn − Sk) dP = E[Sk1Ak(Sn − Sk)] = E[Sk1Ak]E[Sn − Sk] = 0.

Accordingly, we have

E[Sn²] ≥ ∑_{k=1}^n ∫_{Ak} Sk² dP ≥ x² ∑_{k=1}^n P(Ak) = x² P(max_{1≤k≤n} |Sk| ≥ x). □
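Both sides of the inequality are easy to compare by Monte Carlo for simple ±1 steps (a sketch; the walk length, threshold, trial count, and seed are arbitrary choices). The bound is typically quite loose:

```python
import random

rng = random.Random(5)   # fixed seed for reproducibility
n, x, trials = 100, 25.0, 2_000
exceed = 0
for _ in range(trials):
    s, max_abs = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))    # E[X_k] = 0, Var(X_k) = 1
        max_abs = max(max_abs, abs(s))
    if max_abs >= x:
        exceed += 1

estimate = exceed / trials          # estimate of P(max_k |S_k| >= x)
bound = n / x ** 2                  # Var(S_n)/x^2 = 100/625 = 0.16
print(estimate, bound)
```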

We now have the tools needed to provide a sufficient criterion for the a.s. convergence of random series.
(As usual, a series is said to converge if its sequence of partial sums converges.)

Theorem 10.3 (Kolmogorov's two-series theorem). Suppose X1, X2, ... are independent with E[Xn] = µn and Var(Xn) = σn². If ∑_{n=1}^∞ µn converges in R and ∑_{n=1}^∞ σn² < ∞, then ∑_{n=1}^∞ Xn converges almost surely.
P∞

Proof.
Since Var(Xn − µn) = Var(Xn) and convergence of ∑_{n=1}^∞ µn means that ∑_{n=1}^∞ (Xn(ω) − µn) converges if and only if ∑_{n=1}^∞ Xn(ω) converges, we may assume without loss of generality that E[Xn] = 0.
Let SN = ∑_{n=1}^N Xn. Theorem 10.2 gives

P(max_{M≤m≤N} |Sm − SM| > ε) ≤ ε⁻² Var(SN − SM) = ε⁻² ∑_{n=M+1}^N Var(Xn).

Letting N → ∞ gives

P(sup_{m≥M} |Sm − SM| > ε) ≤ ε⁻² ∑_{n=M+1}^∞ σn² → 0 as M → ∞.

Accordingly, for all ε > 0,

P(sup_{m,n≥M} |Sm − Sn| > 2ε) ≤ P(sup_{m≥M} |Sm − SM| > ε) → 0,

so WM := sup_{m,n≥M} |Sm − Sn| → 0 in probability. By Theorem 8.1, WM has a subsequence which converges to 0 a.s. Since WM is nonincreasing in M, this means that WM → 0 a.s.
In other words, Sn is a.s. Cauchy and thus a.s. convergent. □
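A standard example covered by the theorem is the random-sign harmonic series ∑ εn/n with P(εn = ±1) = 1/2: the means are 0 and the variances 1/n² are summable, so the series converges a.s. even though ∑ 1/n diverges. A numerical sketch (the seed and checkpoints are arbitrary choices):

```python
import random

rng = random.Random(6)  # fixed seed for reproducibility
partial = 0.0
checkpoints = {}
for n in range(1, 200_001):
    partial += rng.choice((-1, 1)) / n   # one term eps_n / n
    if n in (100_000, 200_000):
        checkpoints[n] = partial

# The variance of the increment over (10^5, 2*10^5] is sum of 1/n^2 over that
# range, about 5e-6, so the partial sums barely move between the checkpoints.
tail_move = abs(checkpoints[200_000] - checkpoints[100_000])
print(tail_move)
```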

Before moving on to prove the strong law, we take a slight detour to present a general theorem on the

convergence of random series.


Theorem 10.4 (Kolmogorov's three-series theorem). Let X1, X2, ... be independent, let A > 0, and let Yn = Xn1{|Xn| ≤ A}. Then ∑_{n=1}^∞ Xn converges almost surely if and only if the following conditions hold:
(1) ∑_{n=1}^∞ P(|Xn| > A) < ∞,
(2) ∑_{n=1}^∞ E[Yn] converges,
(3) ∑_{n=1}^∞ Var(Yn) < ∞.

Proof. To see that the conditions are sufficient, observe that Condition 1 and the first Borel-Cantelli lemma imply that P(Xn ≠ Yn i.o.) = 0, so it suffices to show that ∑_{n=1}^∞ Yn converges a.s. This is assured by Conditions 2 and 3 along with Theorem 10.3.


Conversely, suppose that ∑_{n=1}^∞ Xn converges a.s.
It is clear that Condition 1 must hold because if ∑_{n=1}^∞ P(|Xn| > A) = ∞, then the second Borel-Cantelli lemma shows that P(|Xn| > A i.o.) = 1, which implies that the series diverges with full probability by the basic divergence test from calculus.


Since Condition 1 holds, we know that ∑_{n=1}^∞ Xn converges a.s. if and only if ∑_{n=1}^∞ Yn converges a.s.
Now suppose that we have proved that Condition 3 holds. Then Theorem 10.3 shows that ∑_{n=1}^∞ (Yn − E[Yn]) converges a.s., which, together with the a.s. convergence of ∑_{n=1}^∞ Yn, implies Condition 2.
Thus it remains only to prove that if Y1, Y2, ... are independent and uniformly bounded, then a.s. convergence of ∑_{n=1}^∞ Yn implies ∑_{n=1}^∞ Var(Yn) < ∞.
In fact, we can further assume that E[Yn] = 0. Indeed, letting {Yn′}_{n=1}^∞ be an independent copy of {Yn}_{n=1}^∞, the random variables Zn = Yn − Yn′ are independent and uniformly bounded with Var(Zn) = 2Var(Yn) and ∑_{n=1}^∞ Zn = ∑_{n=1}^∞ Yn − ∑_{n=1}^∞ Yn′ a.s. convergent.

To summarize, the proof will be complete upon showing

Claim. Suppose that Z1, Z2, ... is a sequence of independent random variables with E[Zn] = 0 and |Zn| ≤ C for some C > 0. If ∑_{n=1}^∞ Zn converges a.s., then ∑_{n=1}^∞ Var(Zn) < ∞.

Proof.
Let Sn = ∑_{k=1}^n Zk. Since Sn converges a.s., we can find an L ∈ N such that P(sup_{n≥1} |Sn| < L) > 0.
(The events Em = {sup_{n≥1} |Sn| < m} increase to {sup_{n≥1} |Sn| < ∞} ⊇ {lim_{n→∞} Sn exists}.)
For this L, let τL = min{k ≥ 1 : |Sk| ≥ L} and observe that the assumption |Zk| ≤ C for all k implies |Sn∧τL| ≤ L + C for all n.


Accordingly,

(L + C)² ≥ E[S²_{n∧τL}] = E[(∑_{j=1}^n Zj1{j ≤ τL})²] = ∑_{j=1}^n E[Zj²1{j ≤ τL}] + 2 ∑_{1≤i<j≤n} E[ZiZj1{j ≤ τL}].

Now {j ≤ τL} = {τL ≤ j − 1}^C ∈ σ(Z1, ..., Zj−1), so independence of the Zk's and the mean zero assumption give

(L + C)² ≥ ∑_{j=1}^n E[Zj²1{j ≤ τL}] + 2 ∑_{1≤i<j≤n} E[ZiZj1{j ≤ τL}]
         = ∑_{j=1}^n Var(Zj)P(j ≤ τL) + 2 ∑_{1≤i<j≤n} E[Zi1{j ≤ τL}]E[Zj] ≥ P(τL = ∞) ∑_{j=1}^n Var(Zj).

Taking n → ∞ and noting that P(τL = ∞) = P(sup_{n≥1} |Sn| < L) > 0, we get

∑_{j=1}^∞ Var(Zj) ≤ (L + C)²/P(τL = ∞) < ∞.

This completes the proof of the claim and the theorem. □

The connection between Kolmogorov's two-series theorem and the strong law is given by

Theorem 10.5 (Kronecker's lemma). If an ↗ ∞ and ∑_{n=1}^∞ xn/an converges, then lim_{n→∞} a_n⁻¹ ∑_{m=1}^n xm = 0.

Proof.
Let a0 = 0, b0 = 0, and bm = ∑_{k=1}^m xk/ak for m ≥ 1.
Then xm = am(bm − bm−1), so

a_n⁻¹ ∑_{m=1}^n xm = a_n⁻¹(∑_{m=1}^n ambm − ∑_{m=1}^n ambm−1)
  = a_n⁻¹(anbn + ∑_{m=2}^n am−1bm−1 − ∑_{m=2}^n ambm−1)
  = bn − ∑_{m=2}^n ((am − am−1)/an)bm−1 = bn − ∑_{m=1}^n ((am − am−1)/an)bm−1.
By assumption, bn → b∞, so we will be done if we can show that ∑_{m=1}^n ((am − am−1)/an)bm−1 → b∞ as well.
Given ε > 0, choose M ∈ N such that |bm − b∞| < ε/2 for m ≥ M.
Set B = sup_{n≥1} |bn| (which is finite since bn converges) and choose N > M such that aM/an < ε/(4B) for n ≥ N.
Since am − am−1 ≥ 0 for all m and these coefficients sum to 1, we see that for all n ≥ N,

|∑_{m=1}^n ((am − am−1)/an)bm−1 − b∞| = |∑_{m=1}^n ((am − am−1)/an)(bm−1 − b∞)|
  ≤ (1/an) ∑_{m=1}^M (am − am−1)|bm−1 − b∞| + ∑_{m=M+1}^n ((am − am−1)/an)|bm−1 − b∞|
  ≤ (aM/an)·2B + ((an − aM)/an)·(ε/2) < ε,

and the result follows. □
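A deterministic sanity check of the lemma (the choices xn = (−1)^{n+1} and an = n are arbitrary illustrations): ∑ xn/an is then the alternating harmonic series, which converges to log(2), so the lemma predicts that the averages (1/n) ∑_{m≤n} xm tend to 0.

```python
import math

N = 100_001  # odd, so the alternating partial sum below equals 1
series = sum((-1) ** (m + 1) / m for m in range(1, N + 1))  # sum of x_m / a_m
cesaro = sum((-1) ** (m + 1) for m in range(1, N + 1)) / N  # (1/N) * sum of x_m

series_err = abs(series - math.log(2))
print(series_err, cesaro)  # both are tiny
```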

We can now give an

Alternative proof of the SLLN.
Let X1, X2, ... be i.i.d. with E|X1| < ∞ and set µ = E[X1], Sn = ∑_{k=1}^n Xk.
We wish to show that (1/n)Sn → µ a.s.
Setting Yk = Xk1{|Xk| ≤ k}, Tn = ∑_{k=1}^n Yk, and arguing as in Claim 9.1 shows that it suffices to prove (1/n)Tn → µ a.s.
 
Writing Zk = Yk − E[Yk], we have Var(Zk) = Var(Yk) ≤ E[Yk²], so Claim 9.2 gives

∑_{k=1}^∞ Var(Zk/k) = ∑_{k=1}^∞ Var(Zk)/k² ≤ ∑_{k=1}^∞ E[Yk²]/k² < ∞.
58

Since E[Zk/k] = 0, Theorem 10.3 shows that ∑_{k=1}^∞ Zk/k converges a.s., hence

Tn/n − (1/n) ∑_{k=1}^n E[Yk] = n⁻¹ ∑_{k=1}^n Zk → 0 a.s.

by Theorem 10.5.
Finally, the DCT gives E[Yk] → µ as k → ∞, thus (1/n) ∑_{k=1}^n E[Yk] → µ as n → ∞, and we conclude that Tn/n → µ a.s. □

As promised, we will conclude our discussion with an estimate on the rate of convergence in the strong law.

Theorem 10.6. Let X1, X2, ... be i.i.d. random variables with E[X1] = 0 and E[X1²] = σ² < ∞, and set Sn = X1 + ... + Xn. Then for all ε > 0,

Sn/(n^{1/2} log(n)^{1/2+ε}) → 0 a.s.

Proof.
Let an = n^{1/2} log(n)^{1/2+ε} for n ≥ 2 and a1 > 0. We have

∑_{n=1}^∞ Var(Xn/an) = σ²/a1² + σ² ∑_{n=2}^∞ 1/(n log(n)^{1+2ε}) < ∞,

so Theorem 10.3 implies ∑_{n=1}^∞ Xn/an converges a.s. The claim then follows from Theorem 10.5. □

Note that there is no loss in assuming mean zero.

The law of the iterated logarithm shows that

lim sup_{n→∞} Sn/√(2σ²n log(log(n))) = 1 a.s.

under the same assumptions, so the above result is not far from optimal.

See Durrett for convergence rates under the assumption that Xn has finite absolute pth moment for 1 < p < 2, and for a generalization of Theorem 8.4.

11. Weak Convergence

Definition. A sequence of distribution functions F1, F2, ... converges weakly to a distribution function F∞ (written Fn ⇒ F∞) if lim_{n→∞} Fn(x) = F∞(x) for all x at which F∞ is continuous.
Random variables X1, X2, ... converge weakly (or converge in distribution) to a random variable X∞ (written Xn ⇒ X∞) if Fn ⇒ F∞ where Fn(x) = P(Xn ≤ x) for 1 ≤ n ≤ ∞.

Note that since the definition of weak convergence of random variables depends only on their distribution functions, one can speak of a sequence X1, X2, ... converging weakly even if the Xn's are not defined on the same probability space. This is not the case with the other modes of convergence we have discussed.

Also, since distribution functions are right-continuous and have only countably many discontinuities, we see

that F∞ is uniquely determined by its values at continuity points.

Example 11.1. As a trivial example, suppose that X has distribution function F and let {an}_{n=1}^∞ be any sequence of real numbers which decreases to 0. Then Xn = X − an has distribution function Fn(x) = P(X − an ≤ x) = F(x + an), hence lim_{n→∞} Fn(x) = F(x) for all x ∈ R (since F is right-continuous) and thus Xn ⇒ X.
On the other hand, Xn = X + an has distribution function Fn(x) = F(x − an), which converges to F only at continuity points. Thus we still have Xn ⇒ X, but the distribution functions do not necessarily converge pointwise.

For some more interesting examples, we first need some elementary facts:

Fact 11.1. If cj → 0, aj → ∞, and ajcj → λ, then (1 + cj)^{aj} → e^λ.

Proof. lim_{x→0} log(1 + x)/x = lim_{x→0} 1/(1 + x) = 1 by L'Hospital's rule, so aj log(1 + cj) = (ajcj)·(log(1 + cj)/cj) → λ, hence (1 + cj)^{aj} = e^{aj log(1+cj)} → e^λ. □
n n n
Fact 11.2. If lim_{n→∞} max_{1≤j≤n} |cj,n| = 0, lim_{n→∞} ∑_{j=1}^n cj,n = λ, and sup_n ∑_{j=1}^n |cj,n| < ∞, then lim_{n→∞} ∏_{j=1}^n (1 + cj,n) = e^λ.

Proof. (Homework) It suffices to show that ∑_{j=1}^n log(1 + cj,n) → λ since then

∏_{j=1}^n (1 + cj,n) = ∏_{j=1}^n e^{log(1+cj,n)} = e^{∑_{j=1}^n log(1+cj,n)} → e^λ.

To this end, note that the first condition ensures that we can choose n large enough that |cj,n| < 1, hence

log(1 + cj,n) = −∑_{m=1}^∞ (−1)^m cj,n^m/m,

and thus |log(1 + cj,n) − cj,n| < (cj,n)²/2 by standard results for alternating series. It follows that
|∑_{j=1}^n log(1 + cj,n) − λ| ≤ ∑_{j=1}^n |log(1 + cj,n) − cj,n| + |∑_{j=1}^n cj,n − λ|
  ≤ (1/2) ∑_{j=1}^n (cj,n)² + |∑_{j=1}^n cj,n − λ| ≤ (1/2) max_{1≤j≤n} |cj,n| ∑_{j=1}^n |cj,n| + |∑_{j=1}^n cj,n − λ|
  ≤ (1/2) max_{1≤j≤n} |cj,n| sup_n ∑_{j=1}^n |cj,n| + |∑_{j=1}^n cj,n − λ|.

The first and third assumptions ensure that the first term goes to zero and the second assumption ensures that the second term goes to zero. □

Example 11.2. Let Xp be the number of trials until the first success in a sequence of independent Bernoulli trials with success probability p ∈ (0, 1). (That is, Xp is geometric with parameter p.)
Then P(Xp > n) = (1 − p)ⁿ, so Fact 11.1 shows that

P(pXp > x) = P(Xp > x/p) = (1 − p)^{⌊x/p⌋} → e^{−x}

as p ↘ 0, hence pXp converges weakly to the rate 1 exponential distribution.
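This convergence is already visible numerically for small p (a sketch; the values of p and x are arbitrary choices):

```python
import math

p = 1e-4
diffs = []
for x in (0.5, 1.0, 2.0):
    exact = (1 - p) ** math.floor(x / p)  # P(pX_p > x) = (1-p)^floor(x/p)
    diffs.append(abs(exact - math.exp(-x)))

print(max(diffs))  # already on the order of 1e-5 for this p
```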

Example 11.3. Let X1, X2, ... be independent and uniformly distributed over {1, 2, ..., N}, and let TN = min{n : Xn = Xm for some m < n}.
We have

P(TN > n) = ∏_{k=2}^n (1 − (k − 1)/N) = ∏_{k=1}^{n−1} (1 − k/N).
When N = 365, this is the probability that no two people in a group of size n have a common birthday.
j √ k
n+1 2 2
− x2 ) and the observation that

Using Fact 11.2 (with n = x N − 1, cj,n = −j/ x , and λ=
√ j √ k j √ k 
bx X
N c−1
x N −1
1 1 x N x2
lim j = lim = ,
N →∞ N N →∞ N 2 2
j=1

we see that √
  bx YN c−1  
TN k x2
P √ >x = 1− → e− 2
N N
k=1
as N → ∞.
The approximation P(TN > x√N) ≈ e^{−x²/2} with N = 365 yields P(T365 > 22) ≈ e^{−0.663} ≈ 0.515 and P(T365 > 23) ≈ e^{−0.725} ≈ 0.484.
This is the birthday paradox: in a room of 23 or more people, it is more likely than not that two share a birthday.
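The quality of the e^{−x²/2} approximation is easy to check against the exact product (a sketch; the cutoffs 22 and 23 come from the example above):

```python
import math

def exact_no_match(n, N=365):
    """Exact P(T_N > n): probability that n people all have distinct birthdays."""
    prob = 1.0
    for k in range(1, n):
        prob *= 1 - k / N
    return prob

def gaussian_approx(n, N=365):
    """The e^{-x^2/2} approximation with x = n/sqrt(N)."""
    return math.exp(-n * n / (2 * N))

p22, p23 = exact_no_match(22), exact_no_match(23)
print(p22, gaussian_approx(22))  # roughly 0.524 vs 0.515
print(p23, gaussian_approx(23))  # roughly 0.493 vs 0.484
```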

Though distributional convergence is defined in terms of distribution functions, it is often convenient to be able to work with random variables when proving theorems.

Theorem 11.1 (Skorokhod Representation). If Fn ⇒ F∞, then there are random variables Yn, 1 ≤ n ≤ ∞, on a common probability space (Ω, F, P) such that Fn is the distribution of Yn and Yn → Y∞ a.s.

Proof.
Let Ω = (0, 1), F = Borel sets, P = Lebesgue measure, and set Yn(ω) = Fn⁻¹(ω) := inf{y : Fn(y) ≥ ω}.
We have seen that Yn has c.d.f. Fn. Also, we know that D = {y : F∞ is discontinuous at y} is countable, so given ε > 0 and ω ∈ (0, 1), there is some x ∈ D^C with Y∞(ω) − ε < x < Y∞(ω).
By construction, we have that F∞(x) < ω, so, since Fn(x) → F∞(x), there is an N ∈ N such that Fn(x) < ω and thus Yn(ω) ≥ x > Y∞(ω) − ε for all n ≥ N. Since ε > 0 was arbitrary, lim inf_{n→∞} Yn(ω) ≥ Y∞(ω).
Now for any ω′ > ω, there is a y ∈ D^C such that Y∞(ω′) < y < Y∞(ω′) + ε, hence ω < ω′ ≤ F∞(y).
It follows that for n large enough, ω < Fn(y) and thus Yn(ω) ≤ y < Y∞(ω′) + ε, so that lim sup_{n→∞} Yn(ω) ≤ Y∞(ω′) for all ω′ > ω.
If Y∞ is continuous at ω, then letting ω′ ↘ ω gives lim sup_{n→∞} Yn(ω) ≤ Y∞(ω), so lim_{n→∞} Yn(ω) = Y∞(ω).
Of course, Y∞ is nondecreasing in ω by construction, so it has only countably many discontinuities, and we conclude that the convergence is almost sure. □
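The quantile construction in the proof is quite concrete. For instance (a sketch; the exponential family and the grid of ω values are arbitrary choices), if Fn is the exponential distribution with rate λn = 1 + 1/n → 1, then Yn(ω) = Fn⁻¹(ω) = −log(1 − ω)/λn converges to Y∞(ω) = −log(1 − ω) for every ω ∈ (0, 1):

```python
import math

def quantile_exp(omega, rate):
    """Inverse of F(y) = 1 - exp(-rate*y): the quantile transform Y(omega)."""
    return -math.log(1 - omega) / rate

omegas = [0.05 * i for i in range(1, 20)]   # grid inside (0, 1)
n = 1_000_000
gap = max(abs(quantile_exp(w, 1 + 1 / n) - quantile_exp(w, 1.0))
          for w in omegas)
print(gap)  # of order 1/n
```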

Note that if Yn → Y∞ a.s., then D = {ω : Yn(ω) ↛ Y∞(ω)} has probability zero. (Since modifying a random variable on a null set does not change its distribution, we can define Zn(ω) = Yn(ω) for ω ∉ D and Zn(ω) = 0 for ω ∈ D, for 1 ≤ n ≤ ∞. Then Zn =d Yn and Zn(ω) → Z∞(ω) for all ω, thus almost sure convergence can be replaced by sure convergence in the above theorem.)
Our next result gives an equivalent definition of weak convergence. The basic idea is that Cb(R), the space of bounded continuous functions from R to R equipped with the supremum norm, is a Banach space. It follows from the Riesz representation theorem that its (continuous) dual Cb(R)*, the space of continuous linear functionals on Cb(R), may be identified with the space of finite and finitely additive signed Radon measures. From this perspective, weak convergence of probability measures corresponds to weak-* convergence in Cb(R)*:
Theorem 11.2. Xn ⇒ X∞ if and only if for every bounded continuous function g we have
E[g(Xn )] → E[g(X∞ )].

Proof. First suppose that Xn ⇒ X∞ . Theorem 11.1 shows that there exist random variables with Yn =d Xn
and Yn → Y∞ a.s. If g is bounded and continuous, then g(Yn ) → g(Y∞ ) a.s. and bounded convergence gives

E[g(Xn )] = E[g(Yn )] → E[g(Y∞ )] = E[g(X∞ )].

To prove the converse, define for each x ∈ R and ε > 0 the function

gx,ε(y) = 1 for y ≤ x,  gx,ε(y) = 0 for y ≥ x + ε,  gx,ε(y) = 1 − (y − x)/ε for x < y < x + ε.

Since gx,ε is continuous with 1_{(−∞,x]} ≤ gx,ε ≤ 1_{(−∞,x+ε)} pointwise, we have


 
lim sup_{n→∞} P(Xn ≤ x) = lim sup_{n→∞} E[1_{(−∞,x]}(Xn)] ≤ lim sup_{n→∞} E[gx,ε(Xn)]
  = E[gx,ε(X∞)] ≤ E[1_{(−∞,x+ε]}(X∞)] = P(X∞ ≤ x + ε).

Letting ε ↘ 0 gives lim sup_{n→∞} P(Xn ≤ x) ≤ P(X∞ ≤ x).
Similarly,

lim inf_{n→∞} P(Xn ≤ x) ≥ lim inf_{n→∞} E[gx−ε,ε(Xn)] = E[gx−ε,ε(X∞)] ≥ P(X∞ ≤ x − ε),

so lim inf_{n→∞} P(Xn ≤ x) ≥ P(X∞ < x).
This completes the proof since P(X∞ < x) = P(X∞ ≤ x) if x is a continuity point of F∞. □
We now show that weak convergence is preserved under (almost) continuous functions.

Theorem 11.3. Let g : R → R be measurable and set Dg = {x : g is discontinuous at x}. If Xn ⇒ X∞ and


P (X∞ ∈ Dg ) = 0, then g(Xn ) ⇒ g(X∞ ). If g is bounded as well, then E[g(Xn )] → E[g(X∞ )].

Proof. Let Yn =d Xn with Yn → Y∞ a.s. If f is continuous, then Df ◦g ⊆ Dg , so P (Y∞ ∈ Df ◦g ) = 0 and

thus f (g(Yn )) → f (g(Y∞ )) a.s.

If f is bounded as well, then bounded convergence implies

E [f (g(Xn ))] = E [f (g(Yn ))] → E [f (g(Y∞ ))] = E [f (g(X∞ ))] .

As this is true for all f ∈ Cb (R), Theorem 11.2 shows that g(Xn ) ⇒ g(X∞ ).
The second assertion follows by noting that g(Yn) → g(Y∞) a.s. and likewise applying bounded convergence. □

At this point, we have characterized weak convergence in terms of convergence of distribution functions at continuity points and as weak-* convergence when probability measures are viewed as living in the dual of Cb(R). Here are some further useful characterizations.

Theorem 11.4 (Portmanteau Theorem). The following statements are equivalent:


(i): Xn ⇒ X∞
(ii): For all open sets U , lim inf n→∞ P (Xn ∈ U ) ≥ P (X∞ ∈ U )
(iii): For all closed sets K , lim supn→∞ P (Xn ∈ K) ≤ P (X∞ ∈ K)
(iv): For all sets A with P (X∞ ∈ ∂A) = 0, limn→∞ P (Xn ∈ A) = P (X∞ ∈ A)
(Such an A is called a continuity set for the distribution of X∞ .)
Proof. We establish equivalence by showing (i) implies (ii); (ii) implies (iii); (ii) and (iii) imply (iv); and

(iv) implies (i).


Suppose that (i) holds. Then there exist random variables Yn, 1 ≤ n ≤ ∞, on a common probability space (Ω, F, P) with Yn =d Xn and Yn → Y∞ pointwise.

Now let ω ∈ Ω be such that Y∞(ω) ∈ U. Since Yn(ω) → Y∞(ω), for every open set V ∋ Y∞(ω), there is an NV ∈ N with Yn(ω) ∈ V whenever n ≥ NV. In particular, there is an NU ∈ N such that Yn(ω) ∈ U for all n ≥ NU.
In other words, for all ω ∈ Ω with 1U(Y∞(ω)) = 1, we have 1U(Yn(ω)) = 1 for n sufficiently large, hence lim inf_{n→∞} 1U(Yn(ω)) = 1. It follows that lim inf_{n→∞} 1U(Yn) ≥ 1U(Y∞) pointwise and thus, by Fatou's lemma and monotonicity, we have

lim inf_{n→∞} P(Xn ∈ U) = lim inf_{n→∞} P(Yn ∈ U) = lim inf_{n→∞} E[1U(Yn)]
  ≥ E[lim inf_{n→∞} 1U(Yn)] ≥ E[1U(Y∞)] = P(Y∞ ∈ U) = P(X∞ ∈ U),

which is statement (ii).


To see that (ii) implies (iii), observe that if K is closed, then K^C is open, so (ii) implies that lim inf_{n→∞} P(Xn ∈ K^C) ≥ P(X∞ ∈ K^C), and thus

P(X∞ ∈ K) = 1 − P(X∞ ∈ K^C) ≥ 1 − lim inf_{n→∞} P(Xn ∈ K^C)
  = lim sup_{n→∞} (1 − P(Xn ∈ K^C)) = lim sup_{n→∞} P(Xn ∈ K).
Now assume both (ii) and (iii) are true and let A be such that P(X∞ ∈ ∂A) = 0. Since ∂A = Ā \ A°, this means that P(X∞ ∈ Ā) = P(X∞ ∈ A°). Because A° ⊆ A ⊆ Ā, monotonicity implies that the common value is equal to P(X∞ ∈ A). Applying (ii) to A° ⊆ A and (iii) to Ā ⊇ A gives

lim inf_{n→∞} P(Xn ∈ A) ≥ lim inf_{n→∞} P(Xn ∈ A°) ≥ P(X∞ ∈ A°) = P(X∞ ∈ A),
lim sup_{n→∞} P(Xn ∈ A) ≤ lim sup_{n→∞} P(Xn ∈ Ā) ≤ P(X∞ ∈ Ā) = P(X∞ ∈ A),

and (iv) follows.

Finally, suppose that (iv) holds and let Fn denote the distribution function of Xn . Let x be any continuity

point of F∞ . Then P (X∞ ∈ {x}) = 0, so, since {x} = ∂(−∞, x], we have

Fn (x) = P (Xn ∈ (−∞, x]) → P (X∞ ∈ (−∞, x]) = F∞ (x),

hence Xn ⇒ X∞ . 

Our next set of theorems forms a sort of compactness result for certain families of probability measures. We begin with

Theorem 11.5 (Helly's Selection Theorem). If {Fn}_{n=1}^∞ is any sequence of distribution functions, then there is a subsequence {Fn(m)}_{m=1}^∞ and a nondecreasing, right-continuous function F with lim_{m→∞} Fn(m)(x) = F(x) at all continuity points x of F.

Proof.
We begin with a diagonalization argument: Let q1, q2, ... be an enumeration of Q. Since the sequence {F_n(q1)}_{n=1}^∞ is contained in the compact set [0, 1], it has a convergent subsequence by the Bolzano-Weierstrass theorem. That is, there exist n1(1) < n1(2) < ··· such that {F_{n1(m)}(q1)}_{m=1}^∞ converges to some value G(q1) ∈ [0, 1]. Similarly, the sequence {F_{n1(m)}(q2)}_{m=1}^∞ has a subsequence {F_{n2(m)}(q2)}_{m=1}^∞ which converges to G(q2).

In general, we can find a subsequence {n_{k+1}(m)}_{m=1}^∞ of {n_k(m)}_{m=1}^∞ such that lim_{m→∞} F_{n_{k+1}(m)}(q_{k+1}) = G(q_{k+1}) for each k ≥ 1.

Define the subsequence {F_{n(m)}}_{m=1}^∞ by F_{n(m)} = F_{n_m(m)} (so that {F_{n(k+j)}(q_k)}_{j=1}^∞ is a subsequence of {F_{n_k(m)}(q_k)}_{m=1}^∞ for all k ≥ 1).

By construction, limm→∞ Fn(m) (q) = G(q) for all q ∈ Q.


Also, if r, s ∈ Q with r < s, then Fn(m) (r) ≤ Fn(m) (s) for all m, hence G(r) ≤ G(s).

Now dene the function F : R → [0, 1] by

F (x) = inf{G(q) : q ∈ Q, q > x}.

To see that F is nondecreasing, note that for any x < y, there is some r∈Q with x < r < y.
Since G(r) ≤ G(s) for all rational r < s, we have

F (x) = inf{G(q) : q ∈ Q, q > x} ≤ G(r) ≤ inf{G(s) : s ∈ Q, s > r} ≤ inf{G(s) : s ∈ Q, s > y} = F (y).

Now for each x ∈ R, ε > 0, there is some rational q>x such that G(q) ≤ F (x) + ε. Thus if x ≤ y < q, then

F (y) ≤ G(q) ≤ F (x) + ε. Since ε>0 was arbitrary, we see that F is right-continuous as well.
Finally, suppose that F is continuous at x. Given ε > 0, there exist r1, r2, s ∈ Q with r1 < r2 < x < s such that

F(x) − ε < F(r1) ≤ F(r2) ≤ F(x) ≤ F(s) < F(x) + ε.

Since F_{n(m)}(r2) → G(r2) ≥ F(r1) and F_{n(m)}(s) → G(s) ≤ F(s) as m → ∞, we see that for m sufficiently large,

F(x) − ε < F_{n(m)}(r2) ≤ F_{n(m)}(x) ≤ F_{n(m)}(s) < F(x) + ε,
hence Fn(m) (x) → F (x). 

It should be noted that the subsequential limit F from Theorem 11.5 is not necessarily a distribution

function since the boundary conditions may not hold. When a sequence of distribution functions converges

to a nondecreasing right-continuous function at its continuity points, the sequence is said to exhibit vague
convergence, which will be denoted by ⇒v .
In these terms, Helly's selection theorem says that every sequence of distribution functions has a vaguely

convergent subsequence.

Because all of the distribution functions take values in [0, 1], their limit must as well. The limit is thus a

Stieltjes measure function for some subprobability measure on R - a positive measure ν with ν(R) ≤ 1.
Thus vague convergence means that the distribution functions converge to a distribution function of a

subprobability measure, whereas weak convergence means that they converge to the distribution function of

a probability measure.

More generally, just as weak convergence is weak-* convergence with respect to C_b(R), vague convergence is weak-* convergence with respect to the subspaces C_K(R) or C_0(R), the spaces of continuous functions with compact support or which vanish at infinity.

The distinction between these notions is illustrated in the following example.

Example 11.4. Choose any a, b, c > 0 with a+b+c=1 and any distribution function G(x), and dene

Fn (x) = a1(x ≥ n) + b1(x ≥ −n) + cG(x).

One easily checks that the F_n's are distribution functions and F_n(x) → F(x) := b + cG(x). However,

lim_{x→−∞} F(x) = b and lim_{x→∞} F(x) = 1 − a,

so F is not a distribution function.

In words, an amount of mass a escapes to +∞ and mass b escapes to −∞.

Intuitively, the test functions in C_K or C_0 that define vague convergence can't detect mass lost to infinity, whereas C_b test functions can.
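This escaping-mass behavior is easy to see numerically. Below is a minimal sketch (not from the notes), using the hypothetical choices a = 0.3, b = 0.2, c = 0.5 with G the Uniform(0, 1) distribution function:

```python
def G(x):
    # distribution function of Uniform(0, 1), an arbitrary choice for G
    return min(max(x, 0.0), 1.0)

def F_n(n, x):
    # F_n(x) = a*1(x >= n) + b*1(x >= -n) + c*G(x) with a, b, c = 0.3, 0.2, 0.5
    return 0.3 * (x >= n) + 0.2 * (x >= -n) + 0.5 * G(x)

def F_limit(x):
    # pointwise limit b + c*G(x): the masses a and b have escaped to +/- infinity
    return 0.2 + 0.5 * G(x)

# for a fixed x, F_n(x) agrees with the limit once n exceeds |x|
gap = abs(F_n(100, 5.0) - F_limit(5.0))
# the limit is defective: it only increases from b = 0.2 to 1 - a = 0.7
low, high = F_limit(-1e9), F_limit(1e9)
```

Note that the vague limit has total mass (1 − a) − b = c, illustrating a subprobability limit.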

An immediate question is: under which conditions do the two definitions coincide? Equivalently, is there a property of a vaguely convergent sequence of distribution functions which prevents mass from being lost in the limit?

Definition. A sequence of distribution functions is tight if for every ε > 0, there is an Mε > 0 such that

limsup_{n→∞} [1 − F_n(Mε) + F_n(−Mε)] ≤ ε.

Theorem 11.6. A sequence of distribution functions is tight if and only if every subsequential limit is a
distribution function.
Proof. Suppose the sequence is tight and F_{n(m)} ⇒_v F. Given ε > 0, let r < −Mε and s > Mε be continuity points of F. Since F_{n(m)}(r) → F(r) and F_{n(m)}(s) → F(s), we have

1 − F(s) + F(r) = lim_{m→∞} [1 − F_{n(m)}(s) + F_{n(m)}(r)] ≤ ε.

As ε was arbitrary, we see that F is indeed a distribution function.

On the other hand, suppose that {F_n}_{n=1}^∞ is not tight. Then there is an ε > 0 and a subsequence {F_{n(m)}}_{m=1}^∞ with

1 − F_{n(m)}(m) + F_{n(m)}(−m) ≥ ε

for all m. Helly's theorem says that there is a further subsequence {F_{n(m_k)}}_{k=1}^∞ which converges vaguely to F. Let r < 0 < s be continuity points of F. Then

1 − F(s) + F(r) = lim_{k→∞} [1 − F_{n(m_k)}(s) + F_{n(m_k)}(r)]
    ≥ liminf_{k→∞} [1 − F_{n(m_k)}(m_k) + F_{n(m_k)}(−m_k)] ≥ ε,

so letting r → −∞ and s → ∞ along continuity points of F shows that F is not a distribution function. □

Roughly, we have that weak convergence equals vague convergence plus tightness.

We conclude with a sufficient condition for tightness.

Theorem 11.7. If there is a nonnegative function φ such that φ(x) → ∞ as x → ±∞ and

C = sup_n ∫ φ(x) dF_n(x) < ∞,

then {F_n}_{n=1}^∞ is tight.

Proof. If the assumptions hold, then for every n

1 − F_n(M) + F_n(−M) ≤ ∫_{|x|≥M} dF_n(x) ≤ C / inf_{|x|≥M} φ(x),

which goes to 0 as M → ∞ by assumption. □

12. Characteristic Functions

An extremely useful construct in probability (and the primary ingredient in the classical proofs of many

central limit theorems) is the characteristic function of a random variable, which is essentially the (inverse)

Fourier transform of its distribution.

Definition. The characteristic function of a random variable X is defined as ϕ(t) = E[e^{itX}]. When confusion may arise, we will indicate the dependence on the random variable with a subscript.

Though we have restricted our attention to real-valued random variables thus far, no new theory is required

since if Z is complex valued, E[Z] = E [Re(Z)] + iE[Im(Z)] provided that the expectations of the real and

imaginary parts are well defined.

In the case of characteristic functions, Euler's formula gives eitX = cos(tX) + i sin(tX), and the sine and

cosine functions are bounded and thus integrable against µX .


(Note that we are still assuming that the underlying random variables are real-valued.)

Several properties of characteristic functions are immediate from the denition.

(1) ϕ(0) = E[1] = 1.

(2) ϕ(−t) = E[cos(−tX)] + iE[sin(−tX)] = E[cos(tX)] − iE[sin(tX)] = \overline{ϕ(t)}, the complex conjugate of ϕ(t).

(3) |ϕ(t)| = |E[e^{itX}]| ≤ E[|e^{itX}|] = 1.

(4) |ϕ(t + h) − ϕ(t)| ≤ E[|e^{i(t+h)X} − e^{itX}|] = E[|e^{itX}| |e^{ihX} − 1|] = E[|e^{ihX} − 1|]. Since the last term goes to zero as h → 0 (by the bounded convergence theorem), ϕ is uniformly continuous.

(5) ϕ_{aX+b}(t) = E[e^{it(aX+b)}] = E[e^{i(at)X} e^{itb}] = e^{itb} ϕ_X(at).

(6) ϕ_{−X}(t) = ϕ_X(−t) = \overline{ϕ_X(t)} by 2 and 5.

(7) If X1 and X2 are independent, then

ϕ_{X1+X2}(t) = E[e^{it(X1+X2)}] = E[e^{itX1} e^{itX2}] = E[e^{itX1}] E[e^{itX2}] = ϕ_{X1}(t) ϕ_{X2}(t).

We now turn to some examples.

Example 12.1 (Rademacher). If P(X = 1) = P(X = −1) = 1/2, then its ch.f. is given by

ϕ(t) = (1/2)e^{it} + (1/2)e^{−it} = cos(t).
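Property (7) can be checked concretely against this example: the sum of two independent Rademacher variables takes the values −2, 0, 2 with probabilities 1/4, 1/2, 1/4, so its ch.f. should equal cos(t)². A quick numerical sketch (the test point t is arbitrary):

```python
import cmath
import math

t = 0.9  # arbitrary test point

# ch.f. of X1 + X2 computed directly from the distribution of the sum
direct = 0.25 * cmath.exp(2j * t) + 0.5 + 0.25 * cmath.exp(-2j * t)

# product formula from property (7): phi_{X1+X2}(t) = cos(t)^2
product = math.cos(t) ** 2

err = abs(direct - product)
```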
Example 12.2 (Poisson). If P(X = k) = e^{−λ} λ^k/k! for k = 0, 1, 2, ..., then its ch.f. is given by

ϕ(t) = Σ_{k=0}^∞ e^{itk} e^{−λ} λ^k/k! = e^{−λ} Σ_{k=0}^∞ (λe^{it})^k/k! = e^{−λ} e^{λe^{it}} = e^{λ(e^{it}−1)}.
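The closed form can be sanity-checked by summing the defining series directly; a small sketch with arbitrarily chosen λ and t:

```python
import cmath
import math

lam, t = 2.5, 0.7  # arbitrary parameter and test point

# partial sum of E[e^{itX}] = sum_k e^{itk} e^{-lam} lam^k / k!
series = sum(cmath.exp(1j * t * k) * math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(60))

closed_form = cmath.exp(lam * (cmath.exp(1j * t) - 1))
err = abs(series - closed_form)
```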

Example 12.3 (Normal). If X has density f_X(x) = (1/√(2π)) e^{−x²/2}, then its ch.f. is given by ϕ(t) = e^{−t²/2}.

Naive derivation:

ϕ(t) = (1/√(2π)) ∫_R e^{itx} e^{−x²/2} dx = (1/√(2π)) ∫_R e^{−[(x−it)² + t²]/2} dx = e^{−t²/2} (1/√(2π)) ∫_R e^{−(x−it)²/2} dx = e^{−t²/2}

since (1/√(2π)) e^{−(x−it)²/2} is the density of a normal random variable with mean it and variance 1.

Formal proof: Since sin(tx) e^{−x²/2} is odd and integrable,

ϕ(t) = (1/√(2π)) ∫_R e^{itx} e^{−x²/2} dx = (1/√(2π)) ∫_R cos(tx) e^{−x²/2} dx + (i/√(2π)) ∫_R sin(tx) e^{−x²/2} dx
    = (1/√(2π)) ∫_R cos(tx) e^{−x²/2} dx.

Differentiating with respect to t (which can be justified using a DCT argument) and then integrating by parts gives

ϕ′(t) = (1/√(2π)) ∫_R (d/dt)[cos(tx)] e^{−x²/2} dx = (1/√(2π)) ∫_R −x sin(tx) e^{−x²/2} dx
    = (1/√(2π)) ∫_R sin(tx) (d/dx)[e^{−x²/2}] dx = −(1/√(2π)) ∫_R t cos(tx) e^{−x²/2} dx = −tϕ(t).

It follows from the method of integrating factors that (d/dt)[e^{t²/2} ϕ(t)] = 0, hence e^{t²/2} ϕ(t) = e^{0²/2} ϕ(0) = 1 for all t.

Example 12.4 (Exponential). If X is absolutely continuous with density f_X(x) = e^{−x} 1_{[0,∞)}(x), then its ch.f. is given by

ϕ(t) = ∫_0^∞ e^{itx} e^{−x} dx = ∫_0^∞ e^{(it−1)x} dx = lim_{b→∞} [e^{(it−1)x}/(it − 1)]_0^b = 1/(1 − it).
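Because the integrand decays like e^{−x}, the computation can be verified by truncating the integral at a moderate bound and applying the trapezoid rule; a sketch (truncation point and grid size are arbitrary):

```python
import cmath

t = 1.3                # arbitrary test point
B, N = 50.0, 50_000    # truncation bound and number of trapezoid panels
h = B / N

def f(x):
    # integrand e^{itx} e^{-x} = e^{(it - 1)x}
    return cmath.exp((1j * t - 1.0) * x)

integral = (f(0.0) + f(B)) / 2 * h + h * sum(f(k * h) for k in range(1, N))
closed_form = 1.0 / (1.0 - 1j * t)
err = abs(integral - closed_form)
```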

Our next task is to show that the characteristic function uniquely determines the distribution.

We first observe that

Proposition 12.1. For all T > 0,

|∫_0^T (sin(t)/t) dt − π/2| ≤ (T + 1)/T².

Proof. (Homework)

For all T > 0 the function e^{−uv} sin(u) is bounded in absolute value by e^{−uv}, which is integrable over R_T = {(u, v) : 0 < u < T, v > 0}, so it follows from Fubini's theorem that

∫_0^T (sin(u)/u) du = ∫_0^T (∫_0^∞ e^{−uv} sin(u) dv) du = ∫_0^∞ (∫_0^T e^{−uv} sin(u) du) dv
    = ∫_0^∞ [−(1/(1 + v²)) e^{−uv} (cos(u) + v sin(u))]_{u=0}^{u=T} dv
    = ∫_0^∞ dv/(1 + v²) − cos(T) ∫_0^∞ (e^{−Tv}/(1 + v²)) dv − sin(T) ∫_0^∞ (v e^{−Tv}/(1 + v²)) dv.

Since ∫_0^∞ dv/(1 + v²) = π/2, we have

|∫_0^T (sin(t)/t) dt − π/2| = |cos(T) ∫_0^∞ (e^{−Tv}/(1 + v²)) dv + sin(T) ∫_0^∞ (v e^{−Tv}/(1 + v²)) dv|
    ≤ |cos(T)| ∫_0^∞ (1/(1 + v²)) e^{−Tv} dv + |sin(T)| ∫_0^∞ (v/(1 + v²)) e^{−Tv} dv
    ≤ ∫_0^∞ e^{−Tv} dv + ∫_0^∞ v e^{−Tv} dv = 1/T + 1/T² = (T + 1)/T². □

A little more calculus gives

Lemma 12.1. For every θ ∈ R,

lim_{T→∞} ∫_{−T}^T (sin(θt)/t) dt = π sgn(θ),

where sgn(θ) = −1 for θ < 0, 0 for θ = 0, and 1 for θ > 0.

Proof. (Homework)

Since lim_{t→0} sin(θt)/t = lim_{t→0} θ cos(θt) = θ, it is easy to see that the integral ∫_{−T}^T (sin(θt)/t) dt exists for all T > 0. Because sin(θt)/t is even, it follows by u-substitution that

∫_{−T}^T (sin(θt)/t) dt = 2 ∫_0^T (sin(θt)/t) dt = 2 ∫_0^{θT} (sin(u)/u) du = 2 sgn(θ) ∫_0^{|θ|T} (sin(u)/u) du.

Proposition 12.1 shows that lim_{T→∞} ∫_0^{|θ|T} (sin(u)/u) du = π/2 for all θ ≠ 0, so, since θ = 0 implies that ∫_0^{|θ|T} (sin(u)/u) du = 0 for all T, we have

lim_{T→∞} ∫_{−T}^T (sin(θt)/t) dt = 2 sgn(θ) lim_{T→∞} ∫_0^{|θ|T} (sin(u)/u) du = π sgn(θ)

for all θ ∈ R. □

With the previous result at our disposal, we are in a position to prove

Theorem 12.1 (Inversion Formula). Let ϕ(t) = ∫ e^{itx} dμ(x) where μ is a probability measure on (R, B). If a < b, then

lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = μ((a, b)) + (1/2)μ({a, b}).

Proof. We begin by noting that

(∗)  |(e^{−ita} − e^{−itb})/(it)| = |∫_a^b e^{−ity} dy| ≤ ∫_a^b |e^{−ity}| dy = b − a,

so, since [−T, T] is finite and μ is a probability measure, Fubini's theorem gives

I_T = ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = ∫_{−T}^T (∫_R ((e^{−ita} − e^{−itb})/(it)) e^{itx} dμ(x)) dt
    = ∫_R (∫_{−T}^T ((e^{it(x−a)} − e^{it(x−b)})/(it)) dt) dμ(x).

Now

(e^{it(x−a)} − e^{it(x−b)})/(it) = (sin(t(x−a)) − sin(t(x−b)))/t + i (cos(t(x−b)) − cos(t(x−a)))/t,

and it follows from (∗) and the inequality |Im(z)| ≤ |z| that ∫_{−T}^T ((cos(t(x−b)) − cos(t(x−a)))/t) dt exists. Thus, since (cos(t(x−b)) − cos(t(x−a)))/t is an odd function, we must have

I_T = ∫_R (∫_{−T}^T ((sin(t(x−a)) − sin(t(x−b)))/t) dt) dμ(x)
    = ∫_R (∫_{−T}^T (sin(t(x−a))/t) dt − ∫_{−T}^T (sin(t(x−b))/t) dt) dμ(x).

Lemma 12.1 shows that ∫_{−T}^T (sin(θt)/t) dt converges to the finite limit π sgn(θ) as T → ∞ (and Proposition 12.1 shows that these integrals are uniformly bounded), so it follows from the bounded convergence theorem and Lemma 12.1 that

lim_{T→∞} I_T = ∫_R lim_{T→∞} (∫_{−T}^T (sin(t(x−a))/t) dt − ∫_{−T}^T (sin(t(x−b))/t) dt) dμ(x)
    = π ∫_R [sgn(x − a) − sgn(x − b)] dμ(x).

Since a < b by assumption, we have that sgn(x − a) − sgn(x − b) equals 0 for x < a or x > b; 1 for x = a or x = b; and 2 for a < x < b, thus

lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = (1/2π) lim_{T→∞} I_T = ∫_R ((sgn(x−a) − sgn(x−b))/2) dμ(x)
    = ∫_{(−∞,a)∪(b,∞)} 0 dμ(x) + ∫_{{a,b}} (1/2) dμ(x) + ∫_{(a,b)} 1 dμ(x) = (1/2)μ({a, b}) + μ((a, b)). □
Remark. Note that the Cauchy principal value lim_{T→∞} ∫_{−T}^T f(x) dx is not necessarily the same as

∫_{−∞}^∞ f(x) dx := lim_{a→−∞, b→∞} ∫_a^b f(x) dx.

For example, lim_{a→∞} ∫_{−a}^a (2x/(1 + x²)) dx = 0 since the integrand is odd, but

lim_{a→∞} ∫_{−2a}^a (2x/(1 + x²)) dx = lim_{a→∞} [log(1 + x²)]_{−2a}^a = lim_{a→∞} log((1 + a²)/(1 + 4a²)) = −log(4),

so the improper Riemann integral ∫_{−∞}^∞ (2x/(1 + x²)) dx is not defined.

Of course, since |2x/(1 + x²)| ≈ 2/|x| for large |x|, ∫_{−∞}^∞ (2x/(1 + x²)) dx is not defined as a Lebesgue integral either.

It is left as a homework exercise to imitate the proof of Theorem 12.1 to obtain

Theorem 12.2. Under the assumptions of Theorem 12.1,

μ({a}) = lim_{T→∞} (1/2T) ∫_{−T}^T e^{−ita} ϕ(t) dt.

Combining Theorems 12.1 and 12.2, and noting that a Borel measure is specified by its values on open intervals (by a π−λ argument), shows that probability distributions are uniquely determined by their characteristic functions.
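As a sanity check on Theorem 12.2, one can evaluate the truncated integral numerically for a distribution with a known atom. A sketch for Bernoulli(1/2), whose ch.f. is (1 + e^{it})/2; the truncation level T and grid size are arbitrary:

```python
import cmath

phi = lambda t: (1 + cmath.exp(1j * t)) / 2   # ch.f. of Bernoulli(1/2)
a = 1.0                                       # the atom to recover: mu({1}) = 1/2
T, N = 500.0, 100_000                         # truncation level and grid size
h = 2 * T / N

def g(t):
    return cmath.exp(-1j * t * a) * phi(t)

# trapezoid approximation of (1/2T) * integral_{-T}^{T} e^{-ita} phi(t) dt
integral = (g(-T) + g(T)) / 2 * h + h * sum(g(-T + k * h) for k in range(1, N))
atom = (integral / (2 * T)).real
```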

To prove our next big result, we need the following bound on the tail probabilities of a distribution in terms of its characteristic function.

Lemma 12.2. If ϕ is the characteristic function corresponding to the distribution μ, then for all u > 0,

μ({x : |x| > 2/u}) ≤ u^{−1} ∫_{−u}^u (1 − ϕ(t)) dt.

Proof. It follows from the parity of the sine and cosine functions that

∫_{−u}^u (1 − e^{itx}) dt = 2u − ∫_{−u}^u (cos(tx) + i sin(tx)) dt = 2 (u − sin(ux)/x).

Dividing by u, integrating against dμ(x), and appealing to Fubini gives

u^{−1} ∫_{−u}^u (1 − ϕ(t)) dt = u^{−1} ∫ (∫_{−u}^u (1 − e^{itx}) dt) dμ(x) = 2 ∫ (1 − sin(ux)/(ux)) dμ(x).

Since |sin(y)| = |∫_0^y cos(x) dx| ≤ |y| for all y, we see that the integrand on the right is nonnegative, so we have

u^{−1} ∫_{−u}^u (1 − ϕ(t)) dt = 2 ∫ (1 − sin(ux)/(ux)) dμ(x) ≥ 2 ∫_{|x|>2/u} (1 − sin(ux)/(ux)) dμ(x)
    ≥ 2 ∫_{|x|>2/u} (1 − 1/|ux|) dμ(x) ≥ μ({x : |x| > 2/u}). □

We are now able to take our next major step toward proving the central limit theorem by relating weak

convergence to the convergence of the corresponding characteristic functions.

Theorem 12.3 (Continuity Theorem). Let µn , 1 ≤ n ≤ ∞, be probability distributions with characteristic


´
functions ϕn (t) = R eitx dµn (x).
(i) If µn ⇒ µ∞ , then ϕn (t) → ϕ∞ (t) for all t ∈ R.
(ii) If ϕn (t) converges pointwise to a limit ϕ(t) that is continuous at t = 0, then the sequence {µn }
is tight and converges weakly to the distribution µ with characteristic function ϕ(t).

Proof.
For (i), note that since eitx is bounded and continuous, if µn ⇒ µ∞ , then it follows from Theorem 11.2 that

ϕn (t) → ϕ∞ (t).

For (ii), we observe that


ˆ u
u−1 (1 − ϕ(t)) dt ≤ 2 sup{1 − ϕ(t) : |t| ≤ u} → 0 as u→0
−u

since ϕ is continuous at 0 and thus limt→0 ϕ(t) = ϕ(0) = 1.


It follows that for any ε > 0, there is a v > 0 such that
ˆ v
−1 ε
v (1 − ϕ(t)) dt < .
−v 2
Because |1 − ϕ(t)| ≤ 2 and ϕn (t) → ϕ(t) for all t, the bounded convergence theorem shows that there is an

N ∈N such that ˆ v ˆ v
−1 −1
ε
v (1 − ϕn (t)) dt − v (1 − ϕ(t)) dt <

−v −v 2
whenever n ≥ N.
The last two observations and Lemma 12.2 show that
  ˆ v
2
µn x : |x| > ≤ v −1 (1 − ϕn (t)) dt < ε
v −v

for all n ≥ N, so, since ε was arbitrary, it follows that {µn }∞


n=1 is tight.

Now let {µnm }∞


m=1 be any subsequence. Tightness and Theorems 11.5 and 11.6 imply that there is a further

subsequence which converges weakly to some probability measure µ∞ .


It then follows from part (i) that the corresponding characteristic functions converge pointwise to the char-

acteristic function of µ∞ .
Because ϕn (t) → ϕ(t) for all t and characteristic functions uniquely characterize distributions, it must be

the case that µ∞ = µ.


Therefore, every subsequence of {µn }∞
n=1 has a further subsequence which converges weakly - that is, in the

weak-∗ topology - to µ, so Lemma 8.2 shows that µn ⇒ µ. 

The crux of the proof of the nontrivial part of Theorem 12.3 was establishing tightness of the sequence {µn },
and this is where we used the assumption that the limiting characteristic function is continuous at 0.

As an illustration of how weak convergence may fail without the continuity assumption, consider the case μ_n = N(0, n). Then μ_n has ch.f.

ϕ_n(t) = e^{−nt²/2} → 0 for t ≠ 0 and → 1 for t = 0,

which is discontinuous at 0. To see that μ_n has no weak limit, observe that for any x ∈ R,

μ_n((−∞, x]) = (1/√(2πn)) ∫_{−∞}^x e^{−t²/(2n)} dt = (1/√(2π)) ∫_{−∞}^{x/√n} e^{−s²/2} ds → (1/√(2π)) ∫_{−∞}^0 e^{−s²/2} ds = 1/2.

We turn next to the problem of representing characteristic functions as power series with explicit remainders.

To start, we have

Theorem 12.4. If E[|X|^n] < ∞, then the characteristic function ϕ of X has a continuous derivative of order n given by

ϕ^{(n)}(t) = ∫ (ix)^n e^{itx} dμ(x).

Proof. (Homework)

Argue by induction and justify differentiating under the integral with the DCT or some applicable corollary thereof. □

It follows from Theorem 12.4 that if E[|X|^n] < ∞, then ϕ^{(n)}(0) = ∫ (ix)^n dμ(x) = i^n E[X^n].

The above observation combined with Taylor's theorem shows that if X has finite absolute nth moment, then

ϕ(t) = Σ_{k=0}^n (ϕ^{(k)}(0)/k!) t^k + r_n(t) t^n = Σ_{k=0}^n ((it)^k/k!) E[X^k] + r_n(t) t^n,

where the Peano remainder r_n(t) → 0 as t → 0.

In particular, we have

Corollary 12.1. If X has mean 0 and finite variance σ², then

ϕ(t) = 1 + itE[X] − (t²/2)E[X²] + r_2(t) t² = 1 − (1/2)σ²t² + o(t²),

where o(t²) denotes a quantity which, when divided by t², tends to 0 as t → 0.

Remark. To verify the statement of Taylor's theorem given above, set

P_n(t) = Σ_{k=0}^n (ϕ^{(k)}(0)/k!) t^k = ϕ(0) + ϕ′(0)t + ... + (ϕ^{(n)}(0)/n!) t^n,
r_n(t) = (ϕ(t) − P_n(t))/t^n for t ≠ 0, and r_n(0) = 0.

Then one needs only to prove that lim_{t→0} r_n(t) = 0.

Applying L'Hospital's rule n − 1 times gives

lim_{t→0} r_n(t) = lim_{t→0} (ϕ(t) − P_n(t))/t^n = lim_{t→0} (d/dt)[ϕ(t) − P_n(t)] / ((d/dt) t^n)
    = ... = lim_{t→0} (d^{n−1}/dt^{n−1})[ϕ(t) − P_n(t)] / ((d^{n−1}/dt^{n−1}) t^n)
    = lim_{t→0} (ϕ^{(n−1)}(t) − ϕ^{(n−1)}(0) − tϕ^{(n)}(0)) / (n! t)
    = (1/n!) lim_{t→0} [(ϕ^{(n−1)}(t) − ϕ^{(n−1)}(0))/t − ϕ^{(n)}(0)] = (1/n!)[ϕ^{(n)}(0) − ϕ^{(n)}(0)] = 0.

Corollary 12.1 is enough to get the classical central limit theorem for i.i.d. sequences, but when we consider

the Lindeberg-Feller CLT for triangular arrays, we will need a little better control on the error term. With

this end in mind, we prove

Lemma 12.3. For all x ∈ R and n = 0, 1, 2, ...,

|e^{ix} − Σ_{k=0}^n (ix)^k/k!| ≤ min{|x|^{n+1}/(n + 1)!, 2|x|^n/n!}.

Proof. Integrating by parts gives

∫_0^x (x − s)^n e^{is} ds = [−((x − s)^{n+1}/(n + 1)) e^{is}]_0^x + (i/(n + 1)) ∫_0^x (x − s)^{n+1} e^{is} ds
    = x^{n+1}/(n + 1) + (i/(n + 1)) ∫_0^x (x − s)^{n+1} e^{is} ds.

Taking n = 0 shows that

(e^{ix} − 1)/i = ∫_0^x e^{is} ds = x + i ∫_0^x (x − s) e^{is} ds,

and thus

e^{ix} = 1 + ix + i² ∫_0^x (x − s) e^{is} ds.

If we assume that

e^{ix} = Σ_{k=0}^n (ix)^k/k! + (i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds,

then the integration by parts identity gives

e^{ix} = Σ_{k=0}^n (ix)^k/k! + (i^{n+1}/n!) [x^{n+1}/(n + 1) + (i/(n + 1)) ∫_0^x (x − s)^{n+1} e^{is} ds]
    = Σ_{k=0}^{n+1} (ix)^k/k! + (i^{(n+1)+1}/(n + 1)!) ∫_0^x (x − s)^{n+1} e^{is} ds,

so it follows from the principle of induction that

e^{ix} − Σ_{k=0}^n (ix)^k/k! = (i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds

for all n = 0, 1, 2, ...

We will be done if we can show that the modulus of the right hand side is bounded above by both |x|^{n+1}/(n + 1)! and 2|x|^n/n!.

In the first case we have

|(i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds| ≤ (1/n!) |∫_0^x |x − s|^n |e^{is}| ds| ≤ (1/n!) ∫_0^{|x|} s^n ds = |x|^{n+1}/(n + 1)!.

For the second case, note that

(i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds = (i^n/(n − 1)!) ∫_0^x ((x − s)^n/n) (d/ds)[e^{is}] ds
    = (i^n/(n − 1)!) [−x^n/n + ∫_0^x (x − s)^{n−1} e^{is} ds]
    = (i^n/(n − 1)!) [−∫_0^x (x − s)^{n−1} ds + ∫_0^x (x − s)^{n−1} e^{is} ds]
    = (i^n/(n − 1)!) ∫_0^x (x − s)^{n−1} (e^{is} − 1) ds,

hence

|(i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds| ≤ (1/(n − 1)!) |∫_0^x |x − s|^{n−1} |e^{is} − 1| ds|
    ≤ (2/(n − 1)!) ∫_0^{|x|} s^{n−1} ds = 2|x|^n/n!. □

Observe that the upper bound |x|^{n+1}/(n + 1)! is better for small values of |x|, while the bound 2|x|^n/n! is better for |x| > 2(n + 1).
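Both bounds in Lemma 12.3 are easy to probe numerically; a small sketch over a few arbitrary values of x and n:

```python
import cmath
import math

def taylor_remainder(x, n):
    # |e^{ix} - sum_{k=0}^{n} (ix)^k / k!|
    partial = sum((1j * x) ** k / math.factorial(k) for k in range(n + 1))
    return abs(cmath.exp(1j * x) - partial)

checked = 0
for x in (0.3, 2.0, 10.0, -7.5):
    for n in range(6):
        bound = min(abs(x) ** (n + 1) / math.factorial(n + 1),
                    2 * abs(x) ** n / math.factorial(n))
        assert taylor_remainder(x, n) <= bound + 1e-9
        checked += 1
```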

Applying Lemma 12.3 to x = tX and taking expected values gives

Corollary 12.2.

|ϕ(t) − Σ_{k=0}^n ((it)^k/k!) E[X^k]| = |E[e^{itX} − Σ_{k=0}^n (itX)^k/k!]|
    ≤ E[|e^{itX} − Σ_{k=0}^n (itX)^k/k!|] ≤ E[min{|tX|^{n+1}/(n + 1)!, 2|tX|^n/n!}].

13. Central Limit Theorems

We are almost ready to prove the central limit theorem for i.i.d. sequences, but first we need a few more elementary facts.

Lemma 13.1. Let z_1, ..., z_n and w_1, ..., w_n be complex numbers, each having modulus at most θ. Then

|Π_{k=1}^n z_k − Π_{k=1}^n w_k| ≤ θ^{n−1} Σ_{k=1}^n |z_k − w_k|.

Proof. The inequality holds trivially for n = 1.

Now assume that it is true for 1 ≤ m < n. Then

|Π_{k=1}^n z_k − Π_{k=1}^n w_k| ≤ |z_1 Π_{k=2}^n z_k − z_1 Π_{k=2}^n w_k| + |z_1 Π_{k=2}^n w_k − w_1 Π_{k=2}^n w_k|
    ≤ θ |Π_{k=2}^n z_k − Π_{k=2}^n w_k| + |z_1 − w_1| θ^{n−1}
    ≤ θ · θ^{n−2} Σ_{k=2}^n |z_k − w_k| + θ^{n−1} |z_1 − w_1| = θ^{n−1} Σ_{k=1}^n |z_k − w_k|,

and the result follows by the principle of induction. □

Lemma 13.2. If z ∈ C has |z| ≤ 1, then |e^z − (1 + z)| ≤ |z|².

Proof. Expanding the analytic function e^z in a power series about 0 gives

|e^z − (1 + z)| = |Σ_{k=2}^∞ z^k/k!| ≤ Σ_{k=2}^∞ |z|^k/k! = |z|² Σ_{k=2}^∞ |z|^{k−2}/k! ≤ |z|² Σ_{k=1}^∞ 1/2^k = |z|². □

Theorem 13.1. If {c_n}_{n=1}^∞ is a sequence of complex numbers which converges to c, then

lim_{n→∞} (1 + c_n/n)^n = e^c.

Proof. Choose n large enough that |c_n| < 2|c| and |c_n|/n ≤ 1. Then |1 + c_n/n| ≤ 1 + |c_n|/n ≤ e^{|c_n|/n} ≤ e^{2|c|/n}, so taking z_m = 1 + c_n/n, w_m = e^{c_n/n}, θ = e^{2|c|/n} in the statement of Lemma 13.1 and then appealing to Lemma 13.2 gives

|(1 + c_n/n)^n − e^{c_n}| ≤ (e^{2|c|/n})^{n−1} · n · |e^{c_n/n} − (1 + c_n/n)| ≤ e^{2|c|} n |c_n/n|²
    ≤ e^{2|c|} n (4|c|²/n²) = 4|c|² e^{2|c|}/n → 0,

hence

|(1 + c_n/n)^n − e^c| ≤ |(1 + c_n/n)^n − e^{c_n}| + |e^{c_n} − e^c| → 0. □
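The statement is easy to test numerically with a complex limit; a sketch using an arbitrary c and the hypothetical sequence c_n = c + 1/n:

```python
import cmath

c = 0.5 + 1.2j                  # arbitrary complex limit
c_n = lambda n: c + 1.0 / n     # any sequence converging to c

n = 10 ** 6
approx = (1 + c_n(n) / n) ** n
err = abs(approx - cmath.exp(c))
```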

After all of the work of the past two sections, we are finally able to prove the classical central limit theorem!

Theorem 13.2 (Central Limit Theorem). If X1, X2, ... are i.i.d. with E[X1] = μ and Var(X1) = σ² ∈ (0, ∞), then

(S_n − nμ)/(σ√n) ⇒ Z ∼ N(0, 1).

Proof. By considering X_k′ = X_k − μ if necessary, it suffices to prove the result for μ = 0.

We have seen that the standard normal has characteristic function ϕ_Z(t) = e^{−t²/2}, which is continuous at t = 0, so Theorem 12.3 shows that we only need to demonstrate that the characteristic functions of S_n/(σ√n) converge pointwise to ϕ_Z(t).

Since X1 has ch.f. ϕ(t) = 1 − (1/2)σ²t² + o(t²) by Corollary 12.1, it follows from Theorem 13.1 and the basic properties of characteristic functions that S_n/(σ√n) has ch.f.

ϕ_n(t) = E[exp(i(t/(σ√n)) Σ_{k=1}^n X_k)] = E[Π_{k=1}^n exp(i(t/(σ√n)) X_k)] = Π_{k=1}^n E[exp(i(t/(σ√n)) X_k)]
    = (E[exp(i(t/(σ√n)) X1)])^n = ϕ(t/(σ√n))^n = (1 − (1/2)σ²(t/(σ√n))² + o((t/(σ√n))²))^n
    = (1 − t²/(2n) + o(1/n))^n → e^{−t²/2}. □
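The theorem can be illustrated by simulation: standardized sums of i.i.d. Uniform(0, 1) variables should land in {z ≤ 1} with frequency close to Φ(1) ≈ 0.8413. A rough sketch (the sample sizes and seed are arbitrary):

```python
import math
import random

random.seed(0)
n, trials = 400, 2000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)   # mean and sd of Uniform(0, 1)

hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (sigma * math.sqrt(n))   # standardized sum
    if z <= 1.0:
        hits += 1

freq = hits / trials
Phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))   # P(Z <= 1)
```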

Multiplying the standardized sum by 1 = (1/n)/(1/n) gives (X̄_n − μ)/(σ/√n) ⇒ Z, where X̄_n = (1/n)S_n is the sample mean and σ/√n is the standard error. Thus the CLT can be interpreted as a statement about how sample averages fluctuate about the population mean.

The following poetic description of the CLT is due to Francis Galton:

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the Law of Frequency of Error. The Law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and complete self-effacement amidst the wildest confusion. The huger the mob and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of unreason.

The most common application of the central limit theorem is to provide justification for approximating a sum of i.i.d. random variables (possibly with unknown distributions) with a normal random variable (for which probabilities can be read off from a table).

However, one should keep in mind that, a priori, the identification is only valid in the n → ∞ limit. It is truly amazing that it holds at all, and often it is remarkably accurate even for small values of n, but one needs empirical evidence or more advanced theory to justify the approximation for finite sample averages. The issue of convergence rates is addressed in a subsequent section.

Our next order of business is to adapt the argument from Theorem 13.2 to prove one of the most well-known generalizations of the central limit theorem, which applies to triangular arrays of independent (but not necessarily identically distributed) random variables.
We will use the notation E[Y; A] := E[Y 1_A] for the expectation of the random variable Y restricted to the event A.

Theorem 13.3 (Lindeberg-Feller). For each n ∈ N, let X_{n,1}, ..., X_{n,n} be independent random variables with E[X_{n,m}] = 0. If

(1) lim_{n→∞} Σ_{m=1}^n E[X_{n,m}²] = σ² ∈ (0, ∞),

(2) lim_{n→∞} Σ_{m=1}^n E[X_{n,m}²; |X_{n,m}| > ε] = 0 for all ε > 0,

then S_n := Σ_{m=1}^n X_{n,m} ⇒ σZ where Z has the standard normal distribution.

Proof. Let ϕ_{n,m}(t) = E[e^{itX_{n,m}}] and σ_{n,m}² = E[X_{n,m}²]. By Theorem 12.3, it suffices to show that

Π_{m=1}^n ϕ_{n,m}(t) → e^{−t²σ²/2}.

From Corollary 12.2, we have

|ϕ_{n,m}(t) − (1 − t²σ_{n,m}²/2)| = |ϕ_{n,m}(t) − Σ_{k=0}^2 ((it)^k/k!) E[X_{n,m}^k]| ≤ E[min{|tX_{n,m}|³/3!, 2|tX_{n,m}|²/2!}]
    ≤ |t|³ E[|X_{n,m}|³; |X_{n,m}| ≤ ε] + t² E[X_{n,m}²; |X_{n,m}| > ε]
    ≤ ε|t|³ E[|X_{n,m}|²] + t² E[X_{n,m}²; |X_{n,m}| > ε].

Summing over m ∈ [n], taking limits, and appealing to assumptions 1 and 2 gives

limsup_{n→∞} Σ_{m=1}^n |ϕ_{n,m}(t) − (1 − t²σ_{n,m}²/2)| ≤ ε|t|³ limsup_{n→∞} Σ_{m=1}^n E[|X_{n,m}|²]
    + t² limsup_{n→∞} Σ_{m=1}^n E[X_{n,m}²; |X_{n,m}| > ε] = ε|t|³σ²,

hence lim_{n→∞} Σ_{m=1}^n |ϕ_{n,m}(t) − (1 − t²σ_{n,m}²/2)| = 0 as ε can be taken arbitrarily small.

We now observe that σ_{n,m}² ≤ ε² + E[|X_{n,m}|²; |X_{n,m}| > ε] for all ε > 0, and the latter term goes to 0 (uniformly in m) as n → ∞ by the second assumption.

Accordingly, for any fixed t, we can find n large enough that 1 ≥ 1 − t²σ_{n,m}²/2 ≥ −1 for all m. Since |ϕ_{n,m}(t)| ≤ 1 as well, z_m = ϕ_{n,m}(t) and w_m = 1 − t²σ_{n,m}²/2 satisfy the assumptions of Lemma 13.1 with θ = 1 for large n, and thus

limsup_{n→∞} |Π_{m=1}^n ϕ_{n,m}(t) − Π_{m=1}^n (1 − t²σ_{n,m}²/2)| ≤ limsup_{n→∞} Σ_{m=1}^n |ϕ_{n,m}(t) − (1 − t²σ_{n,m}²/2)| = 0.

Finally, since lim_{n→∞} Σ_{j=1}^n (−t²σ_{n,j}²/2) = −t²σ²/2, it follows from Fact 11.2 that Π_{m=1}^n (1 − t²σ_{n,m}²/2) → e^{−t²σ²/2} as n → ∞, and the proof is complete. □

Roughly, Theorem 13.3 says that a sum of a large number of small independent eects has an approximately

normal distribution.

Example 13.1. Let πn be a permutation chosen from the uniform distribution on Sn and let Kn = K(πn )
be the number of cycles in πn . For example, if π is the permutation of {1, 2, . . . , 6} written in one-line

notation as 532146, then π can be expressed in cycle notation as (154)(23)(6), so K(π) = 3.

* Observe that there are n!/(Π_{k=1}^n k^{λ_k} λ_k!) permutations having λ_k k-cycles, k = 1, ..., n.
k=1 k λk !

Indeed, once we have fixed the placement of parentheses dictated by the cycle type - say beginning with λ_1 pairs of parentheses having room for 1 symbol, followed by λ_2 pairs of parentheses having room for 2 symbols, and so forth - there are n! ways to distribute the n symbols amongst the parentheses.

But this overcounts since we can permute each of the λ_k k-cycles amongst themselves and we can write each k-cycle in k different ways.

For this reason, it is sometimes helpful to use the canonical cycle notation wherein the largest element appears first within a cycle and cycles are sorted in increasing order of their first element.

For example, we would write π = (32)(541)(6).

** Note that the map which drops the parentheses in the canonical cycle notation of σ to obtain σ′ in one-line notation (so that π′ = 325416, for example) gives a bijection between permutations with k cycles and permutations with k record values.

(A record value of σ ∈ Sn is a number j ∈ [n] such that σ(j) > σ(i) for all i < j. Here we are thinking of

σ(j) as the ultimate ranking of the jth competitor.)

Now the number of permutations of [n] having k cycles is the unsigned Stirling number of the rst kind,

denoted c(n, k).


These numbers can be computed using the recurrence c(n + 1, k) = nc(n, k) + c(n, k − 1). This is because every permutation of [n + 1] having k cycles either has n + 1 as a fixed point (that is, in a cycle of size 1) or not. The number of the former is just c(n, k − 1), and the number of the latter is nc(n, k), as n + 1 can follow any of the first n symbols divided into k cyclically ordered groups.
Thus, in principle, one can explicitly compute P(K_n = k) = c(n, k)/n!, but this is computationally prohibitive for large n.
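The recurrence is straightforward to implement; a sketch that builds the row c(n, ·) and checks it against the known row for n = 4 (namely 6, 11, 6, 1) and the identity Σ_k c(n, k) = n!:

```python
import math

def cycle_counts(n):
    # c[k] = c(n, k), built row by row via c(n+1, k) = n*c(n, k) + c(n, k-1)
    c = [1] + [0] * n              # n = 0: the empty permutation
    for m in range(n):             # produce row m+1 from row m
        new = [0] * (n + 1)
        for k in range(1, n + 1):
            new[k] = m * c[k] + c[k - 1]
        c = new
    return c

row4 = cycle_counts(4)             # [0, 6, 11, 6, 1]
total = sum(row4)                  # should equal 4! = 24
```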

We will show that when suitably standardized, Kn is asymptotically normal. To do so, we will construct

random permutations using the Chinese Restaurant Process:

In a restaurant with many large circular tables, Person 1 enters and sits at a table. Then Person 2 enters

and either sits to the right of Person 1 or at a new table with equal probability. In general, when person k
enters, they are equally likely to sit to the right of any of the k−1 seated customers or to sit at an empty

table. We associate the seating arrangement after n people have entered with the permutation whose cycles

are the tables with occupants read off clockwise.

That this generates a permutation from the uniform distribution follows by induction: It is certainly true

when n = 1, and if we have a seating arrangement corresponding to a uniform permutation of [n − 1] before

person n sits down, then the rules of the process ensure that we have a uniform permutation of n afterward

by the same line of reasoning used to establish the recursion for c(n, k).
79
If we let X_{n,k} be the indicator that Person k sits at an unoccupied table, then K_n = Σ_{k=1}^n X_{n,k}. Since the X_{n,k}'s are clearly independent, we have

E[K_n] = Σ_{k=1}^n E[X_{n,k}] = Σ_{k=1}^n P(k sits at a new table) = Σ_{k=1}^n 1/k ≈ log(n)

and

Var(K_n) = Σ_{k=1}^n Var(X_{n,k}) = Σ_{k=1}^n (1/k − 1/k²) ≈ log(n).

More precisely, if we set Y_{n,k} = (X_{n,k} − 1/k)/√(log(n)), then E[Y_{n,k}] = 0 and

Σ_{k=1}^n E[Y_{n,k}²] = (1/log(n)) Var(K_n) = (1/log(n)) Σ_{k=1}^n (1/k − 1/k²) → 1.

Also,

Σ_{k=1}^n E[Y_{n,k}²; |Y_{n,k}| > ε] → 0

since the sum is 0 once log(n)^{−1/2} < ε.

Therefore, Theorem 13.3 implies that Σ_{k=1}^n Y_{n,k} ⇒ Z ∼ N(0, 1). Because Σ_{k=2}^n 1/k ≤ ∫_1^n (dx/x) ≤ Σ_{k=1}^{n−1} 1/k, the conclusion can be written as (K_n − log(n))/√(log(n)) ⇒ Z.
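Via the Chinese Restaurant Process, K_n has the same law as a sum of independent Bernoulli(1/k) indicators, so its exact pmf can be computed by convolving one indicator at a time and compared with c(n, k)/n!; a sketch for n = 4, where c(4, 2) = 11:

```python
def cycle_count_pmf(n):
    # pmf of K_n = sum of independent Bernoulli(1/k), k = 1, ..., n
    pmf = [1.0]
    for k in range(1, n + 1):
        p = 1.0 / k
        new = [0.0] * (len(pmf) + 1)
        for j, q in enumerate(pmf):
            new[j] += q * (1 - p)      # person k joins an occupied table
            new[j + 1] += q * p        # person k starts a new table/cycle
        pmf = new
    return pmf

pmf4 = cycle_count_pmf(4)
# should agree with c(4, k)/4!, e.g. P(K_4 = 2) = 11/24
```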

In terms of sequences of independent random variables, Theorem 13.3 specializes to

Corollary 13.1. Suppose that X1, X2, ... are independent random variables with E[X_k] = 0 and Var(X_k) = σ_k² ∈ (0, ∞) for all k. Let S_n = Σ_{k=1}^n X_k and s_n² = Σ_{k=1}^n σ_k². If the sequence satisfies Lindeberg's condition:

lim_{n→∞} (1/s_n²) Σ_{k=1}^n E[X_k²; |X_k| > εs_n] = 0

for every ε > 0, then

S_n/s_n ⇒ Z ∼ N(0, 1).

Proof. Take X_{n,m} = X_m/s_n in Theorem 13.3. □

Of course the mean zero condition is just a matter of convenience since finite variance implies finite mean and we can always consider X_k′ = X_k − E[X_k].

Note that the classical central limit theorem is an immediate consequence of Corollary 13.1, since X1, X2, ... i.i.d. with mean zero and finite variance σ² gives s_n² = σ²n and

lim_{n→∞} (1/s_n²) Σ_{k=1}^n E[X_k²; |X_k| > εs_n] = (1/σ²) lim_{n→∞} (1/n) Σ_{k=1}^n E[X_k²; |X_k| > εσ√n]
    = (1/σ²) lim_{n→∞} E[X1²; |X1| > εσ√n] = 0

by the DCT and finite variance.


We conclude with an example showing that one can have normal convergence even when the Lindeberg

condition is not satised.

Example 13.2. Let X1 , X2 , ... be independent with X1 ∼ N (0, 1) and Xk ∼ N (0, 2k−2 ) for k ≥ 2.
Pn
Setting Sn = k=1 Xk , we have

n n n−2
X X X 2n−1 − 1
s2n = Var(Xk ) =1+ 2k−2 = 1 + 2k = 1 + = 2n−1 .
2−1
k=1 k=2 k=0
 
For any ε ∈ 0, √12 n ≥ 2, ,

E Xn2 ; |Xn | > εsn ≥ E[Xn2 ] − E Xn2 ; |Xn | ≤ εsn ≥ E[Xn2 ] − ε2 s2n P (|Xn | > εsn ) ≥ 2n−2 − ε2 2n−1 ,
   

thus
n
E Xn2 ; |Xn | > εsn
 
1 X 2 2n−2 − ε2 2n−1 1
2
E[Xk ; |Xk | > εsn ] ≥ 2
≥ = − ε2 > 0
sn sn 2n−1 2
k=1
for all n ≥ 2.
However, we observe that if W1 and W2 are independent with Wk ∼ N (µk , σk2 ), then Wk has ch.f.
σ 2 t2 2 +σ 2 )t2
(σ1
iµk t− k2 i(µ1 +µ2 )t− 2
ϕk (t) = e , hence W1 + W2 has ch.f. ϕW1 +W2 (t) = ϕ1 (t)ϕ2 (t) = e 2 , so

W1 + W2 ∼ N (µ1 + µ2 , σ12 + σ22 ).


By induction, we have Sums of independent normals are normal and the means and variances add.
Pn Sn
Applied to the case at hand, we see that Sn = k=1 Xk ∼ N (0, s2n ), hence
sn ∼ N (0, 1) for all n and thus

in the n→∞ limit.

14. Poisson Convergence

We now turn our attention to one of the more ubiquitous discrete limiting distributions, the Poisson, beginning with the law of rare events (or weak law of small numbers). It is instructive to compare the following result and its corresponding proof with that of the Lindeberg-Feller theorem.

Theorem 14.1. For each n ∈ N, let X_{n,1}, ..., X_{n,n} be independent with P(X_{n,m} = 1) = p_{n,m} and P(X_{n,m} = 0) = 1 − p_{n,m}. Suppose that as n → ∞,

    (1) ∑_{m=1}^n p_{n,m} → λ ∈ (0, ∞),
    (2) max_{1≤m≤n} p_{n,m} → 0.

If S_n = X_{n,1} + ... + X_{n,n}, then S_n ⇒ W where W ∼ Poisson(λ).

Proof. The characteristic function of X_{n,m} is

    φ_{n,m}(t) = E[e^{itX_{n,m}}] = 1 − p_{n,m} + p_{n,m}e^{it} = 1 + p_{n,m}(e^{it} − 1),

so it follows from the independence assumption that S_n has ch.f.

    φ_{S_n}(t) = E[e^{itS_n}] = ∏_{m=1}^n (1 + p_{n,m}(e^{it} − 1)).

Now, for p ∈ [0, 1],

    |exp(p(e^{it} − 1))| = exp(Re[p(e^{it} − 1)]) = exp(p(cos(t) − 1)) ≤ 1

and |1 + p(e^{it} − 1)| = |p·e^{it} + (1 − p)·1| ≤ 1 since it is on the line segment connecting 1 to e^{it}, which is a chord of the unit circle in C.

Thus, taking z_m = 1 + p_{n,m}(e^{it} − 1), w_m = exp(p_{n,m}(e^{it} − 1)) in Lemma 13.1, we have

    |∏_{m=1}^n (1 + p_{n,m}(e^{it} − 1)) − exp(∑_{m=1}^n p_{n,m}(e^{it} − 1))| ≤ ∑_{m=1}^n |exp(p_{n,m}(e^{it} − 1)) − (1 + p_{n,m}(e^{it} − 1))|.

By assumption 2, we have max_{1≤m≤n} p_{n,m} ≤ 1/2, and thus max_{1≤m≤n} |p_{n,m}(e^{it} − 1)| ≤ 1, for n sufficiently large. Using Lemma 13.2, we conclude that

    |∏_{m=1}^n (1 + p_{n,m}(e^{it} − 1)) − exp(∑_{m=1}^n p_{n,m}(e^{it} − 1))| ≤ ∑_{m=1}^n |exp(p_{n,m}(e^{it} − 1)) − (1 + p_{n,m}(e^{it} − 1))|
                                                                            ≤ ∑_{m=1}^n p_{n,m}^2 |e^{it} − 1|^2 ≤ 4 ∑_{m=1}^n p_{n,m}^2
                                                                            ≤ 4 max_{1≤m≤n} p_{n,m} ∑_{m=1}^n p_{n,m} → 0

by assumptions 1 and 2.

Therefore, since assumption 1 implies exp(∑_{m=1}^n p_{n,m}(e^{it} − 1)) → e^{λ(e^{it}−1)},

    φ_{S_n}(t) = ∏_{m=1}^n (1 + p_{n,m}(e^{it} − 1)) → e^{λ(e^{it}−1)} = φ_W(t),

and the result follows from the continuity theorem. □

An easy consequence of Theorem 14.1 is

Theorem 14.2. Let X_{n,m}, 1 ≤ m ≤ n, be independent N_0-valued random variables with P(X_{n,m} = 1) = p_{n,m} and P(X_{n,m} ≥ 2) = ε_{n,m}, where

    (1) ∑_{m=1}^n p_{n,m} → λ,
    (2) max_{1≤m≤n} p_{n,m} → 0,
    (3) ∑_{m=1}^n ε_{n,m} → 0

as n → ∞. Then S_n = X_{n,1} + ... + X_{n,n} converges weakly to the Poisson(λ) distribution.

Proof. Let Y_{n,m} = 1{X_{n,m} = 1} and T_n = ∑_{m=1}^n Y_{n,m}. Theorem 14.1 implies T_n ⇒ W ∼ Poisson(λ), so, since P(S_n ≠ T_n) ≤ ∑_{m=1}^n ε_{n,m} → 0 (thus S_n − T_n →_p 0), S_n = T_n + (S_n − T_n) ⇒ W by Slutsky's theorem. □

It is worth mentioning that, just as in the normal case, independence is not a strictly necessary condition for Poisson convergence. To relax the assumption in general one needs to use a different proof strategy than convergence of characteristic functions. However, it is sometimes possible to give direct proofs by simple calculations.

Example 14.1 (Hat check, Lazy Secretary, etc.).

Define X_{n,m} = X_{n,m}(π) = 1{π(m) = m} where π is chosen from the uniform measure on S_n, the symmetric group on {1, ..., n}. Then T_n = ∑_{m=1}^n X_{n,m} is the number of fixed points in a random permutation of length n.
Inclusion-exclusion gives the probability of at least one fixed point as

    P(T_n > 0) = P(∪_{m=1}^n {X_{n,m} = 1})
               = ∑_{m=1}^n P(X_{n,m} = 1) − ∑_{l<m} P(X_{n,l} = X_{n,m} = 1) + ∑_{k<l<m} P(X_{n,k} = X_{n,l} = X_{n,m} = 1) − ... + (−1)^{n+1} P(X_{n,1} = ... = X_{n,n} = 1)
               = ∑_{k=1}^n (−1)^{k+1} C(n, k) (n − k)!/n! = ∑_{k=1}^n (−1)^{k+1}/k!

since the number of permutations with k specified fixed points is (n − k)!.

It follows that the probability of a derangement is given by

    P(T_n = 0) = 1 − P(T_n > 0) = 1 − ∑_{k=1}^n (−1)^{k+1}/k! = ∑_{k=0}^n (−1)^k/k! → e^{−1}.
To compute other values of the mass function for T_n, note that

    P(T_n = m) = C(n, m) ((n − m)!/n!) P(T_{n−m} = 0) = (1/m!) P(T_{n−m} = 0) → (1/m!) e^{−1}.

Therefore, for every x ∈ R, we have

    P(T_n ≤ x) = ∑_{m∈N_0: m≤x} P(T_n = m) → ∑_{m∈N_0: m≤x} (1/m!) e^{−1}

(as the above sums contain finitely many terms), so the number of fixed points in a permutation of length n converges weakly to W ∼ Poisson(1) as n → ∞.
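A quick simulation agrees with this computation. The sketch below is ours (the notes do not include code); the permutation length n = 50, the trial count, and the seed are arbitrary choices. The empirical derangement frequency should be close to e^{−1} ≈ 0.3679, and the empirical mean number of fixed points close to E[T_n] = 1.

```python
import math
import random

random.seed(0)

def fixed_points(n):
    perm = list(range(n))
    random.shuffle(perm)                     # uniform random permutation
    return sum(1 for i, v in enumerate(perm) if i == v)

n, trials = 50, 20000
counts = [fixed_points(n) for _ in range(trials)]
p_derangement = sum(1 for c in counts if c == 0) / trials   # ~ 1/e
mean_fixed = sum(counts) / trials                           # ~ 1
```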

For most common discrete random variables, rather than memorize the p.m.f.s, one needs only to understand the stories that they tell, and the probabilities follow easily from combinatorial considerations.

For example, if X ∼ Binomial(n, p), then the story is that X gives the number of heads in n independent flips of a coin with heads probability p. To compute the probability that X = k for k = 0, 1, ..., n, we note that any sequence of k heads and n − k tails has probability p^k (1 − p)^{n−k} by independence. Since the number of such sequences is determined by specifying where the heads occur, of which there are C(n, k) possibilities, the binomial p.m.f. is given by P(X = k) = C(n, k) p^k (1 − p)^{n−k}, k = 0, 1, ..., n.

Similarly, if Y ∼ Hypergeometric(N, M, n), then the story is that we sample n items without replacement from a set of N items of which M are distinguished, and Y counts the number of distinguished items in our sample. For k ≤ min(n, M), there are C(M, k) ways to choose k distinguished items, C(N − M, n − k) ways to choose the remaining n − k items, and C(N, n) possible samples, so P(Y = k) = C(M, k) C(N − M, n − k) / C(N, n).
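The counting behind these stories is easy to sanity-check by brute force. The sketch below is our own toy example (the parameters N = 8, M = 3, n = 4 are arbitrary): it enumerates every possible sample and recovers the hypergeometric p.m.f. from the counting story above.

```python
import math
from itertools import combinations

N, M, n = 8, 3, 4  # 8 items, 3 distinguished, samples of size 4

def hyper_pmf(k):
    # C(M, k) choices of distinguished items times C(N - M, n - k) for the rest,
    # out of C(N, n) equally likely samples
    return math.comb(M, k) * math.comb(N - M, n - k) / math.comb(N, n)

counts = {}
for sample in combinations(range(N), n):     # items 0, ..., M-1 are distinguished
    k = sum(1 for i in sample if i < M)
    counts[k] = counts.get(k, 0) + 1
empirical = {k: c / math.comb(N, n) for k, c in counts.items()}
```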
n (n)

Because the Poisson distribution assigns positive mass to infinitely many outcomes, determining the p.m.f. is not such a simple matter of counting. Nonetheless, the preceding results do supply us with an appropriate story:

For 0 ≤ s < t, let N(s, t) denote the number of occurrences of a given type of event in the time interval (s, t] - say, the number of arrivals at a restaurant between s and t minutes after it opens. Suppose that

    (1) The numbers of occurrences in disjoint time intervals are independent.
    (2) The distribution of N(s, t) depends only on t − s (stationary increments).
    (3) P(N(0, h) = 1) = λh + o(h).
    (4) P(N(0, h) ≥ 2) = o(h).

Theorem 14.3. If properties 1-4 hold, then N(0, t) ∼ Poisson(λt).

Proof. Define X_{n,m} = N((m−1)t/n, mt/n). Property 1 shows that X_{n,1}, ..., X_{n,n} are independent; properties 2 and 3 show that p_{n,m} = P(X_{n,m} = 1) = P(X_{n,1} = 1) = λt/n + o(t/n); and properties 2 and 4 show that ε_{n,m} = P(X_{n,m} ≥ 2) = P(X_{n,1} ≥ 2) = o(t/n).

Since ∑_{m=1}^n p_{n,m} = n(λt/n + o(1/n)) → λt and ∑_{m=1}^n ε_{n,m} = n·o(1/n) → 0 as n → ∞, Theorem 14.2 implies that X_{n,1} + ... + X_{n,n} ⇒ W ∼ Poisson(λt). The result follows by observing that X_{n,1} + ... + X_{n,n} = N(0, t) for all n. □
The random variables N(0, t) as t ranges over [0, ∞) are an example of a continuous time stochastic process:

Definition. A family of random variables {N(t)}_{t≥0} is called a Poisson process with rate λ if it satisfies:

    (1) For any 0 = t_0 < t_1 < ... < t_n, the random variables N(t_k) − N(t_{k−1}), k = 1, ..., n, are independent;
    (2) N(t) − N(s) ∼ Poisson(λ(t − s)).

To better understand the process {N(t)}_{t≥0}, it is useful to consider the following construction, which explains our arrivals story and provides a bridge between the Poisson and exponential distributions:

Let ξ_1, ξ_2, ... be i.i.d. exponentials with mean λ^{−1} - that is, P(ξ_i > t) = e^{−λt} for t ≥ 0. Define T_n = ∑_{i=1}^n ξ_i and N(t) = sup{n : T_n ≤ t}. If we think of the ξ_i's as interarrival times, then T_n gives the time of the nth arrival and N(t) is the number of arrivals by time t.
Since a sum of n i.i.d. Exponential(λ) R.V.s has a Γ(n, λ^{−1}) distribution*, we see that T_n has density

    f_{T_n}(s) = (λ^n s^{n−1}/(n − 1)!) e^{−λs} for s ≥ 0.

Accordingly,

    P(N(t) = 0) = P(T_1 > t) = e^{−λt} = e^{−λt} (λt)^0/0!

and

    P(N(t) = n) = P(T_n ≤ t < T_{n+1}) = ∫_0^t f_{T_n}(s) P(ξ_{n+1} > t − s) ds
                = ∫_0^t (λ^n s^{n−1}/(n − 1)!) e^{−λs} e^{−λ(t−s)} ds = (λ^n/n!) e^{−λt} ∫_0^t n s^{n−1} ds = e^{−λt} (λt)^n/n!

for n ≥ 1, so N(t) ∼ Poisson(λt).
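This construction translates directly into a simulation. The sketch below is ours (rate λ = 2, horizon t = 3, and seed are arbitrary choices): it generates exponential interarrival times until they overshoot t and counts the arrivals, whose empirical mean and variance should both be close to λt.

```python
import random

random.seed(1)

def arrivals_by_time(lam, t):
    # N(t) = sup{n : T_n <= t}, where T_n = xi_1 + ... + xi_n, xi_i ~ Exp(lam)
    total, n = 0.0, 0
    while True:
        total += random.expovariate(lam)
        if total > t:
            return n
        n += 1

lam, t, trials = 2.0, 3.0, 20000
samples = [arrivals_by_time(lam, t) for _ in range(trials)]
mean = sum(samples) / trials                         # ~ lam * t = 6
var = sum((x - mean) ** 2 for x in samples) / trials # ~ lam * t as well
```

That the mean and variance agree is one of the Poisson distribution's signatures.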
To check that the number of arrivals in disjoint intervals is independent, we note that for all n ∈ N and all u > t > 0,

    P(T_{n+1} ≥ u, N(t) = n) = P(T_{n+1} ≥ u, T_n ≤ t) = ∫_0^t f_{T_n}(s) P(ξ_{n+1} ≥ u − s) ds
                             = ∫_0^t (λ^n s^{n−1}/(n − 1)!) e^{−λs} e^{−λ(u−s)} ds = e^{−λu} (λt)^n/n! = e^{−λ(u−t)} e^{−λt} (λt)^n/n!
                             = e^{−λ(u−t)} P(N(t) = n),

and thus

    P(T_{n+1} ≥ u | N(t) = n) = P(T_{n+1} ≥ u, N(t) = n)/P(N(t) = n) = e^{−λ(u−t)}.

Writing s = u − t, the above is equivalent to

    P(T_{n+1} − t ≥ s | N(t) = n) = e^{−λs}

for all n ∈ N, s, t > 0.

It follows that T_1' = T_{N(t)+1} − t is independent of N(t) and

    P(T_1' ≥ s) = ∑_{n=0}^∞ P(T_{n+1} − t ≥ s | N(t) = n) P(N(t) = n) = e^{−λs},

hence T_1' ∼ Exponential(λ).
Setting T_k' = T_{N(t)+k} − T_{N(t)+k−1} = ξ_{N(t)+k} for k ≥ 2 and observing that

    P(N(t) = n, T_1' ≥ u − t, T_k' ≥ v_k for k = 2, ..., K)
        = P(T_n ≤ t, T_{n+1} ≥ u, T_{n+k} − T_{n+k−1} ≥ v_k for k = 2, ..., K)
        = P(T_n ≤ t, T_{n+1} ≥ u) ∏_{k=2}^K P(ξ_{n+k} ≥ v_k),

we see that T_1', T_2', ... are i.i.d. Exponential(λ) and independent of N(t). In other words, the arrivals after time t are independent of N(t) and have the same distribution as the original arrival sequence. (Essentially, this is due to the memorylessness property of the exponential.)

It follows that for any 0 = t_0 < t_1 < ... < t_n, N(t_1) − N(t_0), ..., N(t_n) − N(t_{n−1}) are independent Poissons. This is because the vector (N(t_2) − N(t_1), ..., N(t_n) − N(t_{n−1})) is measurable with respect to σ(T_1', T_2', ...) (where the T_i''s are constructed as above with t = t_1) and so is independent of N(t_1). Then an induction argument gives

    P(N(t_1) − N(t_0) = k_1, ..., N(t_n) − N(t_{n−1}) = k_n) = ∏_{i=1}^n e^{−λ(t_i − t_{i−1})} (λ(t_i − t_{i−1}))^{k_i}/k_i!.

* To keep the discussion self-contained, we show that sums of independent exponentials are gammas. This is a situation where convolution is more convenient than characteristic functions.

First, recall that for α, β > 0, X ∼ Γ(α, β) has density f_X(x) = (1/(Γ(α)β^α)) x^{α−1} e^{−x/β}, x > 0, where the gamma function Γ satisfies Γ(n) = (n − 1)! for n ∈ N.

Also, for λ > 0, W ∼ Exp(λ) has density f_W(w) = λe^{−λw}, w > 0. Thus W ∼ Γ(1, λ^{−1}).

Suppose that Y ∼ Γ(n, λ^{−1}) and W ∼ Exp(λ) are independent and set Z = W + Y. Then Z has positive support and Theorem 6.6 shows that for z > 0,

    f_Z(z) = (f_W ∗ f_Y)(z) = ∫_{−∞}^∞ f_W(z − y) f_Y(y) dy
           = ∫_0^z λe^{−λ(z−y)} (λ^n/(n − 1)!) y^{n−1} e^{−λy} dy
           = (λ^{n+1}/(n − 1)!) e^{−λz} ∫_0^z y^{n−1} dy = (λ^{n+1}/Γ(n + 1)) z^{(n+1)−1} e^{−λz}.

It follows by induction that if X_1, ..., X_n are i.i.d. Exp(λ), then X_1 + ... + X_n ∼ Γ(n, λ^{−1}).

15. Stein's Method

We have mentioned previously that a shortcoming of the limit theorems presented thus far is that they do

not come with rates of convergence.

A proof of the Berry-Esseen theorem for normal convergence rates in the Kolmogorov metric is given in

Durrett, and a proof of Poisson convergence with rates in the total variation metric is given there as well.

Rather than reproduce these classical results, we will obtain similar bounds using Stein's method in order to

give a glimpse of this relatively modern technique which can be applied to all sorts of different distributions

and often allows one to weaken assumptions such as independence as well.

As our purpose is expository, we will present some of the more straightforward approaches rather than seek

out the best possible constants and conditions.

Stein's method refers to a framework based on solutions of certain differential or difference equations for

bounding the distance between the distribution of a random variable X and that of a random variable Z
having some specied target distribution.

The metrics for which this approach is applicable are of the form

    d_H(L(X), L(Z)) = sup_{h∈H} |E[h(X)] − E[h(Z)]|

for some suitable class of functions H, and include the Kolmogorov, Wasserstein, and total variation distances
as special cases. These cases arise by taking H to be the set of indicators of the form 1(−∞,a] , 1-Lipschitz
functions, and indicators of Borel sets, respectively. Convergence in each of these three metrics is strictly

stronger than weak convergence (which can be metrized by taking H as the set of 1-Lipschitz functions with

sup norm at most 1).

The basic idea is to find an operator A such that E[(Af)(X)] = 0 for all f belonging to some sufficiently

large class of functions F if and only if L (X) = L (Z).


For example, we will see that Z ∼ N (0, 1) if and only if E [f 0 (Z) − Zf (Z)] = 0 for all Lipschitz functions f.
If one can then show that for any h ∈ H, the equation

(Af )(x) = h(x) − E[h(Z)]

has solution f_h ∈ F, then upon taking expectations, absolute values, and suprema, one finds that

    d_H(L(X), L(Z)) = sup_{h∈H} |E[h(X)] − E[h(Z)]| = sup_{h∈H} |E[(Af_h)(X)]|.

Remarkably, it is often easier to work with the right-hand side of this equation and the techniques for

analyzing distances between probability distributions in this manner are collectively known as Stein's method.

Stein's method is a vast field with over a thousand existing articles and books and new ones written all the

time, so we will only be able to scratch the surface here. In particular, we will not prove any results for

dependent random variables. (Other than supplying convergence rates, the principal advantage of Stein's

method is that it often enables one to prove limit theorems when there is some weak or local dependence,

whereas characteristic function approaches typically fall apart when there is dependence of any sort.)

An excellent place to learn more about Stein's method (and the primary reference for this exposition) is the

survey Fundamentals of Stein's method by Nathan Ross.


Normal Distribution.

We begin by establishing a characterizing operator for the standard normal.

Lemma 15.1. Define the operator A by

    (Af)(x) = f′(x) − xf(x).

If Z ∼ N(0, 1), then E[(Af)(Z)] = 0 for all absolutely continuous f with E|f′(Z)| < ∞.
Proof. Let f be as in the statement of the lemma. Then Fubini's theorem gives

    E[f′(Z)] = (1/√(2π)) ∫_R f′(x) e^{−x^2/2} dx = (1/√(2π)) ∫_{−∞}^0 f′(x) e^{−x^2/2} dx + (1/√(2π)) ∫_0^∞ f′(x) e^{−x^2/2} dx
             = (1/√(2π)) ∫_{−∞}^0 f′(x) (−∫_{−∞}^x y e^{−y^2/2} dy) dx + (1/√(2π)) ∫_0^∞ f′(x) (∫_x^∞ y e^{−y^2/2} dy) dx
             = (1/√(2π)) ∫_{−∞}^0 (−y e^{−y^2/2}) (∫_y^0 f′(x) dx) dy + (1/√(2π)) ∫_0^∞ y e^{−y^2/2} (∫_0^y f′(x) dx) dy
             = (1/√(2π)) ∫_{−∞}^0 y e^{−y^2/2} (f(y) − f(0)) dy + (1/√(2π)) ∫_0^∞ y e^{−y^2/2} (f(y) − f(0)) dy
             = (1/√(2π)) ∫_{−∞}^∞ y f(y) e^{−y^2/2} dy − f(0) (1/√(2π)) ∫_{−∞}^∞ y e^{−y^2/2} dy
             = E[Zf(Z)] − f(0)E[Z] = E[Zf(Z)]. □
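The identity E[f′(Z)] = E[Zf(Z)] is easy to probe numerically. The sketch below is our own check (not in the notes); the test function f(x) = sin x is an arbitrary Lipschitz choice and the seed and trial count are ours. It estimates E[f′(Z) − Zf(Z)] by Monte Carlo and finds it near zero.

```python
import math
import random

random.seed(3)

# Test function f(x) = sin(x), so (Af)(x) = f'(x) - x f(x) = cos(x) - x sin(x)
trials = 200000
acc = 0.0
for _ in range(trials):
    z = random.gauss(0.0, 1.0)
    acc += math.cos(z) - z * math.sin(z)
estimate = acc / trials   # should be close to 0 by Lemma 15.1
```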

Of course, if ‖f′‖_∞ < ∞, then E|f′(Z)| < ∞. It turns out that the condition E[(Af)(W)] = 0 for all absolutely continuous f with ‖f′‖_∞ < ∞ is also sufficient for W ∼ N(0, 1).

To see that this is the case, we prove

Lemma 15.2. If Φ is the distribution function for the standard normal, then the unique bounded solution to the differential equation

    f′(w) − wf(w) = 1_{(−∞,x]}(w) − Φ(x)

is given by

    f_x(w) = √(2π) e^{w^2/2} (1 − Φ(x)) Φ(w) for w ≤ x,
    f_x(w) = √(2π) e^{w^2/2} Φ(x) (1 − Φ(w)) for w > x.

Moreover, f_x is absolutely continuous with ‖f_x‖_∞ ≤ √(π/2) and ‖f_x′‖_∞ ≤ 2.
Proof. Multiplying both sides of the equation f′(t) − tf(t) = 1_{(−∞,x]}(t) − Φ(x) by the integrating factor e^{−t^2/2} shows that a bounded solution f_x must satisfy

    (d/dt)[e^{−t^2/2} f_x(t)] = e^{−t^2/2}[f_x′(t) − t f_x(t)] = e^{−t^2/2}(1_{(−∞,x]}(t) − Φ(x)),

and integration gives

    f_x(w) = e^{w^2/2} ∫_{−∞}^w e^{−t^2/2}(1_{(−∞,x]}(t) − Φ(x)) dt = −e^{w^2/2} ∫_w^∞ e^{−t^2/2}(1_{(−∞,x]}(t) − Φ(x)) dt,

where the second expression follows because ∫_{−∞}^∞ e^{−t^2/2}(1_{(−∞,x]}(t) − Φ(x)) dt = √(2π)(Φ(x) − Φ(x)) = 0.

When w ≤ x, we have

    f_x(w) = e^{w^2/2} ∫_{−∞}^w e^{−t^2/2}(1_{(−∞,x]}(t) − Φ(x)) dt = e^{w^2/2} ∫_{−∞}^w e^{−t^2/2}(1 − Φ(x)) dt
           = √(2π) e^{w^2/2}(1 − Φ(x)) (1/√(2π)) ∫_{−∞}^w e^{−t^2/2} dt = √(2π) e^{w^2/2}(1 − Φ(x)) Φ(w),

and when w > x, we have

    f_x(w) = −e^{w^2/2} ∫_w^∞ e^{−t^2/2}(1_{(−∞,x]}(t) − Φ(x)) dt = −e^{w^2/2} ∫_w^∞ e^{−t^2/2}(0 − Φ(x)) dt
           = √(2π) e^{w^2/2} Φ(x) (1/√(2π)) ∫_w^∞ e^{−t^2/2} dt = √(2π) e^{w^2/2} Φ(x)(1 − Φ(w)).
To check boundedness, we first observe that for any z ≥ 0,

    1 − Φ(z) = (1/√(2π)) ∫_z^∞ e^{−t^2/2} dt = (1/√(2π)) ∫_0^∞ e^{−(s+z)^2/2} ds
             = (e^{−z^2/2}/√(2π)) ∫_0^∞ e^{−s^2/2} e^{−sz} ds ≤ e^{−z^2/2} (1/√(2π)) ∫_0^∞ e^{−s^2/2} ds = (1/2) e^{−z^2/2},

and, by symmetry, for any z ≤ 0,

    Φ(z) = 1 − Φ(|z|) ≤ (1/2) e^{−z^2/2}.
Since f_x is nonnegative and f_x(w) = f_{−x}(−w), it suffices to show that f_x is bounded above for x ≥ 0.

If w > x ≥ 0, then

    f_x(w) = √(2π) e^{w^2/2} Φ(x)(1 − Φ(w)) ≤ √(2π) e^{w^2/2} · 1 · (1/2) e^{−w^2/2} = √(π/2);

if 0 < w ≤ x, then

    f_x(w) = √(2π) e^{w^2/2}(1 − Φ(x)) Φ(w) ≤ √(2π) e^{w^2/2} · (1/2) e^{−x^2/2} · 1 ≤ √(2π) e^{w^2/2} · (1/2) e^{−w^2/2} = √(π/2);

and if w ≤ 0 ≤ x, then

    f_x(w) = √(2π) e^{w^2/2}(1 − Φ(x)) Φ(w) ≤ √(2π) e^{w^2/2} · 1 · (1/2) e^{−w^2/2} = √(π/2).

The claim that f_x is the only bounded solution follows by observing that the homogeneous equation f′(w) − wf(w) = 0 has solution f_h(w) = Ce^{w^2/2} for C ∈ R, so the general solution is given by f_x(w) + Cf_h(w), which is bounded if and only if C = 0.

Finally, we observe that, by construction, f_x is differentiable at all points w ≠ x with f_x′(w) = wf_x(w) + 1_{(−∞,x]}(w) − Φ(x), so that

    |f_x′(w)| ≤ |wf_x(w)| + |1_{(−∞,x]}(w) − Φ(x)| ≤ |wf_x(w)| + 1.

For w > 0,

    |wf_x(w)| = |−we^{w^2/2} ∫_w^∞ e^{−t^2/2}(1_{(−∞,x]}(t) − Φ(x)) dt| ≤ we^{w^2/2} ∫_w^∞ e^{−t^2/2}|1_{(−∞,x]}(t) − Φ(x)| dt
              ≤ we^{w^2/2} ∫_w^∞ e^{−t^2/2} dt ≤ we^{w^2/2} ∫_w^∞ (t/w) e^{−t^2/2} dt = e^{w^2/2} ∫_w^∞ t e^{−t^2/2} dt = e^{w^2/2} e^{−w^2/2} = 1,

and for w < 0,

    |wf_x(w)| = |−wf_{−x}(−w)| ≤ 1,

hence |f_x′(w)| ≤ |wf_x(w)| + 1 ≤ 2.

Since f_x is continuous and differentiable at all points w ≠ x with uniformly bounded derivative, it is Lipschitz and thus absolutely continuous. □
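Both the differential equation and the bounds in Lemma 15.2 can be verified numerically. The sketch below is ours (the choices of x, the evaluation points, and the step size are arbitrary, and the points deliberately avoid the kink at w = x): it checks the ODE residual by central differences and the bound ‖f_x‖_∞ ≤ √(π/2).

```python
import math

def Phi(w):
    # Standard normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))

def f_x(x, w):
    # The bounded solution from Lemma 15.2
    c = math.sqrt(2.0 * math.pi) * math.exp(w * w / 2.0)
    if w <= x:
        return c * (1.0 - Phi(x)) * Phi(w)
    return c * Phi(x) * (1.0 - Phi(w))

x, h = 0.7, 1e-6
residuals, values = [], []
for w in (-2.0, -0.5, 0.3, 1.5):        # points away from the kink at w = x
    deriv = (f_x(x, w + h) - f_x(x, w - h)) / (2.0 * h)
    indicator = 1.0 if w <= x else 0.0
    residuals.append(deriv - w * f_x(x, w) - (indicator - Phi(x)))
    values.append(f_x(x, w))
```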

An immediate consequence of the preceding lemma is

Theorem 15.1. A random variable W has the standard normal distribution if and only if

    E[f′(W) − Wf(W)] = 0

for all Lipschitz f.

Proof. Lemma 15.1 establishes necessity.

For sufficiency, observe that for any x ∈ R, taking f_x as in Lemma 15.2 implies

    |P(W ≤ x) − Φ(x)| = |E[1_{(−∞,x]}(W) − Φ(x)]| = |E[f_x′(W) − Wf_x(W)]| = 0. □

The methodology of Lemma 15.2 can be extended to cover more general test functions than indicators of half-lines.

Indeed, the argument given there shows that for any function h : R → R such that

    Nh := E[h(Z)] = (1/√(2π)) ∫_{−∞}^∞ h(z) e^{−z^2/2} dz

exists in R, the differential equation

    f′(w) − wf(w) = h(w) − Nh

has solution

    (∗) f_h(w) = e^{w^2/2} ∫_{−∞}^w (h(t) − Nh) e^{−t^2/2} dt.

Some fairly tedious computations which we will not undertake here show that

Lemma 15.3. For any h : R → R such that Nh exists, let f_h be given by (∗). If h is bounded, then

    ‖f_h‖_∞ ≤ √(π/2) ‖h − Nh‖_∞,  ‖f_h′‖_∞ ≤ 2‖h − Nh‖_∞.

If h is absolutely continuous, then

    ‖f_h‖_∞ ≤ 2‖h′‖_∞,  ‖f_h′‖_∞ ≤ √(2/π) ‖h′‖_∞,  ‖f_h″‖_∞ ≤ 2‖h′‖_∞.

(That the relevant derivatives are defined almost everywhere is part of the statement of Lemma 15.3.)
We can now give bounds on the error in normal approximation for sums of i.i.d. random variables.

We will work in the Wasserstein metric

    d_W(L(W), L(Z)) = sup_{h∈H_W} |E[h(W)] − E[h(Z)]|

where

    H_W = {h : R → R such that |h(x) − h(y)| ≤ |x − y| for all x, y ∈ R}.

If Z ∼ N(0, 1), then the preceding analysis shows that

    d_W(L(W), L(Z)) = sup_{h∈H_W} |E[f_h′(W) − Wf_h(W)]|

where f_h is given by (∗).

Since Lipschitz functions are absolutely continuous, the second part of Lemma 15.3 applies with ‖h′‖_∞ = 1.

From these observations and some elementary manipulations we have

Theorem 15.2. Suppose that X_1, X_2, ..., X_n are independent random variables with E[X_i] = 0 and E[X_i^2] = 1 for all i = 1, ..., n. If W = (1/√n) ∑_{i=1}^n X_i and Z ∼ N(0, 1), then

    d_W(L(W), L(Z)) ≤ (3/n^{3/2}) ∑_{i=1}^n E[|X_i|^3].

Proof. Let f be any differentiable function with f′ absolutely continuous and ‖f‖_∞, ‖f′‖_∞, ‖f″‖_∞ < ∞. For each i = 1, ..., n, set

    W_i = (1/√n) ∑_{j≠i} X_j = W − X_i/√n.

Then X_i and W_i are independent, so E[X_i f(W_i)] = E[X_i]E[f(W_i)] = 0. It follows that

    E[Wf(W)] = E[(1/√n) ∑_{i=1}^n X_i f(W)] = E[(1/√n) ∑_{i=1}^n X_i (f(W) − f(W_i))].

Adding and subtracting E[(1/√n) ∑_{i=1}^n X_i (W − W_i) f′(W_i)] yields

    E[Wf(W)] = E[(1/√n) ∑_{i=1}^n X_i (f(W) − f(W_i) − (W − W_i)f′(W_i))] + E[(1/√n) ∑_{i=1}^n X_i (W − W_i) f′(W_i)].

The independence and unit variance assumptions show that

    E[X_i (W − W_i) f′(W_i)] = E[(1/√n) X_i^2 f′(W_i)] = (1/√n) E[X_i^2] E[f′(W_i)] = (1/√n) E[f′(W_i)],

so

    E[Wf(W)] = E[(1/√n) ∑_{i=1}^n X_i (f(W) − f(W_i) − (W − W_i)f′(W_i))] + E[(1/n) ∑_{i=1}^n f′(W_i)],

and thus

    |E[f′(W) − Wf(W)]|
        = |E[(1/√n) ∑_{i=1}^n X_i (f(W) − f(W_i) − (W − W_i)f′(W_i))] + E[(1/n) ∑_{i=1}^n f′(W_i)] − E[f′(W)]|
        = |E[(1/√n) ∑_{i=1}^n X_i (f(W) − f(W_i) − (W − W_i)f′(W_i))] + E[(1/n) ∑_{i=1}^n (f′(W_i) − f′(W))]|
        ≤ (1/√n) ∑_{i=1}^n E|X_i (f(W) − f(W_i) − (W − W_i)f′(W_i))| + (1/n) ∑_{i=1}^n E|f′(W_i) − f′(W)|.

The Taylor expansion (with Lagrange remainder)

    f(w) = f(z) + f′(z)(w − z) + (f″(ζ)/2)(w − z)^2

for some ζ between w and z gives the bound

    |f(w) − f(z) − (w − z)f′(z)| ≤ (‖f″‖_∞/2)(w − z)^2,

so

    (1/√n) ∑_{i=1}^n E|X_i (f(W) − f(W_i) − (W − W_i)f′(W_i))| ≤ (1/√n) ∑_{i=1}^n E[|X_i| (‖f″‖_∞/2)(W − W_i)^2]
        = (‖f″‖_∞/(2√n)) ∑_{i=1}^n E[|X_i| (X_i/√n)^2] = (‖f″‖_∞/(2n^{3/2})) ∑_{i=1}^n E[|X_i|^3].

Also, the mean value theorem shows that

    (1/n) ∑_{i=1}^n E|f′(W_i) − f′(W)| ≤ (1/n) ∑_{i=1}^n E[‖f″‖_∞ |W_i − W|] = (‖f″‖_∞/n^{3/2}) ∑_{i=1}^n E|X_i|.

Since 1 = E[X_i^2] = E[|X_i|^2] ≤ E[|X_i|^3]^{2/3}, we have E[|X_i|^3] ≥ 1, hence

    E|X_i| ≤ E[|X_i|^3]^{1/3} ≤ E[|X_i|^3]. (The conclusion is trivial if E[|X_i|^3] = ∞.)

Putting all of this together gives

    |E[f′(W) − Wf(W)]| ≤ (1/√n) ∑_{i=1}^n E|X_i (f(W) − f(W_i) − (W − W_i)f′(W_i))| + (1/n) ∑_{i=1}^n E|f′(W_i) − f′(W)|
        ≤ (‖f″‖_∞/(2n^{3/2})) ∑_{i=1}^n E[|X_i|^3] + (‖f″‖_∞/n^{3/2}) ∑_{i=1}^n E|X_i| ≤ (3‖f″‖_∞/(2n^{3/2})) ∑_{i=1}^n E[|X_i|^3],

and the result follows since

    d_W(L(W), L(Z)) = sup_{h∈H_W} |E[f_h′(W) − Wf_h(W)]|

and ‖f_h″‖_∞ ≤ 2‖h′‖_∞ = 2 for all h ∈ H_W. □

Of course the mean zero, variance one condition is just the usual normalization in the CLT and so imposes no real loss of generality. If the random variables have uniformly bounded third moments, then Theorem 15.2 gives a rate of order n^{−1/2}, which is the best possible.

We conclude with an example of a CLT with local dependence which can be proved using very similar (albeit more computationally intensive) methods.

Definition. A collection of random variables {X_1, ..., X_n} is said to have dependency neighborhoods N_i ⊆ {1, 2, ..., n}, i = 1, ..., n, if i ∈ N_i and X_i is independent of {X_j}_{j∉N_i}.

Theorem 15.3. Let X_1, ..., X_n be mean zero random variables with finite fourth moments. Set σ^2 = Var(∑_{i=1}^n X_i) and define W = σ^{−1} ∑_{i=1}^n X_i. Let N_1, ..., N_n denote the dependency neighborhoods of {X_1, ..., X_n} and let D = max_{i∈[n]} |N_i|. Then for Z ∼ N(0, 1),

    d_W(L(W), L(Z)) ≤ (D^2/σ^3) ∑_{i=1}^n E[|X_i|^3] + (√28 D^{3/2}/(√π σ^2)) √(∑_{i=1}^n E[X_i^4]).

Poisson Distribution.

To illustrate some of the diversity in Stein's method techniques, we now look at size-biased couplings in Poisson approximation.

Definition. For a random variable X ≥ 0 with μ = E[X] ∈ (0, ∞), we say that X^s has the size-biased distribution with respect to X if E[Xf(X)] = μE[f(X^s)] for all f such that E|Xf(X)| < ∞.

To see that X^s exists, note that our assumptions imply that Qf := (1/μ)E[Xf(X)] is a well-defined linear functional on the space of continuous functions with compact support. Since X is nonnegative, we have that Qf ≥ 0 for f ≥ 0. Therefore, the Riesz representation theorem implies that there is a unique positive measure ν with Qf = ∫ f dν. Since Q1 = (1/μ)E[X] = 1, ν is a probability measure. Thus X^s ∼ ν satisfies

    (1/μ)E[Xf(X)] = Qf = ∫ f dν = E[f(X^s)].

Alternatively, one can adapt the argument from the following lemma to construct the distribution function of X^s in terms of that of X.

Lemma 15.4. Let X be a nondegenerate N_0-valued random variable with finite mean μ. Then X^s has mass function

    P(X^s = k) = kP(X = k)/μ.

Proof.

    μE[f(X^s)] = ∑_{k=0}^∞ μf(k)P(X^s = k) = ∑_{k=0}^∞ kf(k)P(X = k) = E[Xf(X)]. □
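For instance, applying Lemma 15.4 to X ∼ Poisson(λ) shows that X^s is distributed as X + 1 (a standard fact that the notes do not prove: kP(X = k)/λ = e^{−λ}λ^{k−1}/(k − 1)! = P(X + 1 = k)). A quick numerical check, entirely our own:

```python
import math

lam = 3.0   # arbitrary rate

def poisson_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def size_biased_pmf(k):
    # Lemma 15.4: P(X^s = k) = k P(X = k) / mu, with mu = E[X] = lam here
    return k * poisson_pmf(k) / lam

# P(X^s = k) should equal P(X + 1 = k) = poisson_pmf(k - 1)
max_err = max(abs(size_biased_pmf(k) - poisson_pmf(k - 1)) for k in range(1, 30))
```

This is one way to see why E|X + 1 − X^s| in Theorem 15.5 below vanishes exactly when X is Poisson.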

Size-biasing is an important consideration in statistical sampling.

For example, suppose that a school has N(k) classes with k students. Then the total number of classes is n = ∑_{k=1}^∞ N(k) and the total number of students is N = ∑_{k=1}^∞ kN(k).

If an outside observer were interested in estimating class-size statistics, they might ask a random teacher how large their class is. Letting X denote the teacher's response, we have P(X = k) = N(k)/n since N(k) of the n classes have k students.

On the other hand, they might ask a random student how large their class is. The student's response, Y, would have P(Y = k) = kN(k)/N because kN(k) of the N students are in a class of k students.

Noting that the expected number of students in a random class is

    E[X] = ∑_{k=1}^∞ k N(k)/n = (1/n) ∑_{k=1}^∞ kN(k) = N/n,

we see that

    P(Y = k) = kN(k)/N = (k/(N/n)) (N(k)/n) = kP(X = k)/E[X],

so Y = X^s.

Observe that the average number of classmates of a random student (their self included) is

    E[Y] = ∑_{k=1}^∞ k · kN(k)/N = (n/N) ∑_{k=1}^∞ k^2 N(k)/n = E[X^2]/E[X] ≥ E[X].

The inequality is strict unless all classes have the same number of students.
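With concrete class counts, this inspection paradox is immediate. The numbers below are made up by us purely for illustration:

```python
# Toy class-size data: N[k] classes with k students each (our own numbers)
N = {10: 4, 30: 4, 100: 2}
n = sum(N.values())                          # total number of classes
students = sum(k * c for k, c in N.items())  # total number of students

E_X = sum(k * (c / n) for k, c in N.items())             # ask a random teacher
E_Y = sum(k * (k * c / students) for k, c in N.items())  # ask a random student
```

Here most classes are small, but most students sit in the large classes, so the student-reported average E_Y exceeds the teacher-reported average E_X.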

Lemma 15.5. Let X_1, ..., X_n ≥ 0 be independent random variables with E[X_i] = μ_i, and let X_i^s have the size-biased distribution w.r.t. X_i. Let I be a random variable, independent of all else, with P(I = i) = μ_i/μ, i = 1, ..., n, where μ = ∑_{i=1}^n μ_i. If W = ∑_{i=1}^n X_i and W_i = W − X_i, then W^s = W_I + X_I^s has the W size-biased distribution.

Proof.

    μE[g(W^s)] = μ ∑_{i=1}^n (μ_i/μ) E[g(W_i + X_i^s)] = ∑_{i=1}^n E[X_i g(W_i + X_i)] = E[∑_{i=1}^n X_i g(W)] = E[Wg(W)],

where the second equality uses the independence of X_i and W_i together with the definition of X_i^s. □

Lemma 15.6. If P(X = 1) = 1 − P(X = 0) = p, then Y ≡ 1 has the X size-biased distribution.

Proof.

    P(X^s = 1) = 1 · P(X = 1)/E[X] = p/p = 1. □
To connect size-biasing with Poisson approximation, we need the following facts, which are proved in much the same fashion as the analogous results for the normal distribution.

Theorem 15.4. Let P_λ denote the Poisson(λ) distribution. An N_0-valued random variable X has law P_λ if and only if

    E[λf(X + 1) − Xf(X)] = 0

for all bounded f.

Also, for each A ⊆ N_0, the unique solution of the difference equation

    λf(k + 1) − kf(k) = 1_A(k) − P_λ(A),  f_A(0) = 0,

is given by

    f_A(k) = λ^{−k} e^λ (k − 1)! [P_λ(A ∩ U_k) − P_λ(A) P_λ(U_k)] where U_k = {0, 1, ..., k − 1}.

Finally, writing the forward difference as Δg(k) := g(k + 1) − g(k), we have

    ‖f_A‖_∞ ≤ min{1, λ^{−1/2}} and ‖Δf_A‖_∞ ≤ (1 − e^{−λ})/λ.

We can now prove

Theorem 15.5. Let X be an N_0-valued random variable with E[X] = λ, and let Z ∼ Poisson(λ). Then

    d_TV(X, Z) ≤ (1 − e^{−λ}) E|X + 1 − X^s|.

Proof. Letting f_A be as in Theorem 15.4, the definitions of total variation and size-biasing imply

    d_TV(X, Z) = sup_A |P(X ∈ A) − P(Z ∈ A)|
               = sup_A |λE[f_A(X + 1)] − E[Xf_A(X)]|
               = sup_A |λE[f_A(X + 1)] − λE[f_A(X^s)]|
               ≤ λ sup_A E|f_A(X + 1) − f_A(X^s)|
               ≤ λ sup_A ‖Δf_A‖_∞ E|X + 1 − X^s|
               ≤ (1 − e^{−λ}) E|X + 1 − X^s|.

The penultimate inequality follows by writing f_A(X + 1) − f_A(X^s) as a telescoping sum of |X + 1 − X^s| first order differences. □

We conclude with a simple proof of Theorem 14.1 complete with a total variation bound.

Theorem 15.6. Let X_1, ..., X_n be independent random variables with P(X_i = 1) = 1 − P(X_i = 0) = p_i, and set W = ∑_{i=1}^n X_i, λ = E[W] = ∑_{i=1}^n p_i. Let Z ∼ Poisson(λ). Then

    d_TV(W, Z) ≤ ((1 − e^{−λ})/λ) ∑_{i=1}^n p_i^2.

Proof. Lemmas 15.5 and 15.6 show that W^s = W_I + X_I^s = W − X_I + 1, where I is a random variable, independent of the X_i's, with P(I = i) = p_i/λ. Thus, by Theorem 15.5,

    d_TV(W, Z) ≤ (1 − e^{−λ}) E|W + 1 − W^s| = (1 − e^{−λ}) E|X_I|
              = (1 − e^{−λ}) ∑_{i=1}^n E|X_i| P(I = i)
              = (1 − e^{−λ}) ∑_{i=1}^n p_i (p_i/λ) = ((1 − e^{−λ})/λ) ∑_{i=1}^n p_i^2. □
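The bound in Theorem 15.6 is easy to test against an exact computation. The sketch below is our own (the values of p_i are arbitrary): it convolves the Bernoulli laws to get the law of W exactly and compares d_TV(W, Z) with the Stein bound.

```python
import math

ps = [0.1, 0.05, 0.2, 0.15, 0.08]
lam = sum(ps)

# Exact pmf of W = X_1 + ... + X_n by convolving the Bernoulli(p_i) laws
pmf = [1.0]
for p in ps:
    nxt = [0.0] * (len(pmf) + 1)
    for k, q in enumerate(pmf):
        nxt[k] += q * (1 - p)
        nxt[k + 1] += q * p
    pmf = nxt

def poisson_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

K = 60  # the Poisson tail beyond K is negligible at this lambda
tv = 0.5 * sum(abs((pmf[k] if k < len(pmf) else 0.0) - poisson_pmf(k))
               for k in range(K))
bound = (1 - math.exp(-lam)) / lam * sum(p * p for p in ps)
```

As expected, the exact distance sits below the bound, typically by a comfortable margin.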

One can also prove Theorem 15.6 without taking a detour through size-biasing.

Indeed, suppose that f satisfies ‖f‖_∞, ‖Δf‖_∞ < ∞. Then

    E[Wf(W)] = E[∑_{i=1}^n X_i f(W)] = ∑_{i=1}^n p_i E[f(W) | X_i = 1] = ∑_{i=1}^n p_i E[f(W_i + 1)].

Since λ = ∑_{i=1}^n p_i, we have

    |λE[f(W + 1)] − E[Wf(W)]| = |∑_{i=1}^n p_i E[f(W + 1)] − ∑_{i=1}^n p_i E[f(W_i + 1)]|
                              ≤ ∑_{i=1}^n p_i E|f(W + 1) − f(W_i + 1)|
                              ≤ ∑_{i=1}^n p_i ‖Δf‖_∞ E|(W + 1) − (W_i + 1)|
                              = ‖Δf‖_∞ ∑_{i=1}^n p_i E|X_i| = ‖Δf‖_∞ ∑_{i=1}^n p_i^2.

Therefore,

    d_TV(W, Z) = sup_A |P(W ∈ A) − P(Z ∈ A)| = sup_A |λE[f_A(W + 1)] − E[Wf_A(W)]| ≤ ((1 − e^{−λ})/λ) ∑_{i=1}^n p_i^2,

using the bound ‖Δf_A‖_∞ ≤ (1 − e^{−λ})/λ from Theorem 15.4.

16. Random Walk Preliminaries

Thus far, we have been primarily interested in the large n behavior of S_n = ∑_{i=1}^n X_i where X_1, X_2, ... are independent and identically distributed. We now turn our attention to the sequence S_1, S_2, ..., which we think of as successive states of a random walk.

Recall that the existence of an infinite sequence of random variables with specified finite dimensional distributions is ensured by Kolmogorov's extension theorem.

Here the sample space is Ω = R^N = {(ω_1, ω_2, ...) : ω_i ∈ R}, the σ-algebra is B^N (which is generated by cylinder sets), and a consistent sequence of distributions gives rise to a unique probability measure with appropriate marginals via the extension theorem. The random variables are the coordinate maps X_i((ω_1, ω_2, ...)) = ω_i.

If S is a Polish space (i.e. a separable and completely metrizable topological space) and 𝒮 is the Borel σ-algebra for S, then this Kolmogorov construction can be carried out with Ω = S^N and F = 𝒮^N. When the X_i's are independent (S, 𝒮)-valued random variables with X_i ∼ μ_i, the measure P arises from the sequence of product measures P_n = μ_1 × ··· × μ_n.

We assume in what follows that we are working in (S^N, 𝒮^N, P) and X_1, X_2, ... are given by X_i((ω_1, ω_2, ...)) = ω_i.

Recall that Kolmogorov's 0-1 law showed that if X_1, X_2, ... are independent, then the tail field T = ∩_{n=1}^∞ σ(X_n, X_{n+1}, ...) is trivial in the sense that every A ∈ T has P(A) ∈ {0, 1}.

Our first main result is another 0-1 law. We begin with some terminology.

Definition. A finite permutation is a bijection π : N → N such that |{m : π(m) ≠ m}| < ∞. If π is a finite permutation and ω ∈ S^N, then we define πω by (πω)_i = ω_{π(i)}.

Definition. A ∈ 𝒮^N is permutable if π^{−1}A = {ω : πω ∈ A} is equal to A for any finite permutation π. In other words, for every n ∈ N and every π ∈ S_n, we have

    (X_1, ..., X_n, X_{n+1}, ...) ∈ A ⟺ (X_{π(1)}, ..., X_{π(n)}, X_{n+1}, ...) ∈ A.

Proposition 16.1. The collection of permutable events is a σ-algebra. It is called the exchangeable σ-algebra and is denoted E.

Taking S = R and S_n = ∑_{i=1}^n X_i, some examples of permutable events are E = {S_n ∈ B_n i.o.} and F = {lim sup_{n→∞} S_n/c_n ≥ 1} for any sequence of Borel sets {B_n}_{n=1}^∞ and real numbers {c_n}_{n=1}^∞. Also, every event in the tail σ-algebra is also in the exchangeable σ-algebra.

We observe that, in general, E, F ∉ T, though F is in the tail field if we assume that c_n → ∞. Similarly, {lim_{n→∞} S_n exists}, {lim sup_{n→∞} S_n = ∞} ∈ T while, in general, {lim sup_{n→∞} S_n > 0} ∈ E \ T.
The proof of the Hewitt-Savage 0-1 law will make use of the following result.

Lemma 16.1. For any I ∈ 𝒮^N, there is a sequence of events I_1, I_2, ... such that I_n ∈ σ(X_1, ..., X_n) and P(I_n △ I) → 0, where A△B = (A \ B) ∪ (B \ A).

Proof. σ(X_1, ..., X_n) is precisely the sub-σ-algebra of 𝒮^N consisting of the cylinders {ω : (ω_1, ..., ω_n) ∈ B} as B ranges over 𝒮^n. Accordingly, P = ∪_{n=1}^∞ σ(X_1, ..., X_n) is a π-system which generates 𝒮^N. The claim follows from Theorem 2.2 upon noting that L = {J ∈ 𝒮^N : there exist I_n ∈ σ(X_1, ..., X_n) with P(I_n △ J) → 0} is a λ-system containing P. □

Theorem 16.1 (Hewitt-Savage). If X_1, X_2, ... are i.i.d. and A ∈ E, then P(A) ∈ {0, 1}.

Proof. As with Kolmogorov's 0-1 law, we will show that A is independent of itself.

We begin by taking a sequence of events A_n ∈ σ(X_1, ..., X_n) such that P(A_n △ A) → 0, which is justified by Lemma 16.1.

Now let π be the finite permutation

    π(j) = j + n for j ≤ n,  π(j) = j − n for n < j ≤ 2n,  π(j) = j for j > 2n.

In words, π transposes j and n + j for j = 1, ..., n.

Because the coordinates are i.i.d., P(π^{−1}(A_n △ A)) = P(A_n △ A), so, setting A_n' = π^{−1}A_n and noting that A ∈ E implies π^{−1}A = A, we see that

    P(A_n △ A) = P(π^{−1}(A_n △ A)) = P(π^{−1}A_n △ π^{−1}A) = P(A_n' △ A).

Thus, since

    A △ (A_n ∩ A_n') = (A \ A_n) ∪ (A \ A_n') ∪ [(A_n ∩ A_n') \ A] ⊆ (A_n △ A) ∪ (A_n' △ A),

we have

    P(A △ (A_n ∩ A_n')) ≤ P(A_n △ A) + P(A_n' △ A) = 2P(A_n △ A) → 0,

hence P(A_n ∩ A_n') → P(A).

As A_n and A_n' are independent by construction and P(A_n), P(A_n') → P(A), we conclude that

    P(A)^2 = lim_{n→∞} P(A_n)P(A_n') = lim_{n→∞} P(A_n ∩ A_n') = P(A). □

Because T ⊂ E, Hewitt-Savage supersedes Kolmogorov in the case of i.i.d. random variables. However, the latter only requires independence, so it can be used in situations where the former cannot. Also, note that in the examples preceding Theorem 16.1, the sequences En = {Sn ∈ Bn} and Fn = {Sn/cn ≥ 1} are each dependent, so the Borel-Cantelli lemmas do not imply that E or F is trivial.

A nice application of Theorem 16.1 is

Theorem 16.2. For a random walk on R, there are only four possibilities, one of which has probability one:
(1) Sn = 0 for all n
(2) Sn → ∞
(3) Sn → −∞
(4) −∞ = lim inf n→∞ Sn < lim supn→∞ Sn = ∞

Proof. Theorem 16.1 implies that lim sup_{n→∞} Sn is a constant c ∈ [−∞, ∞]. Let S′n = Sn+1 − X1. Since {S′n}n≥1 has the same distribution as {Sn}n≥1, lim sup_{n→∞} S′n = c a.s. as well, and lim sup_{n→∞} S′n = c − X1, so c = c − X1 a.s. If c ∈ (−∞, ∞), then it must be the case that X1 ≡ 0, so the first case occurs. Conversely, if X1 is not identically zero, then c = ±∞.

Of course, the exact same argument applies to the lim inf, so either (1) holds or

lim inf_{n→∞} Sn, lim sup_{n→∞} Sn ∈ {±∞}.

As lim sup_{n→∞} Sn ≥ lim inf_{n→∞} Sn, this implies that we are in one of cases (2), (3), or (4). □

17. Stopping Times

Given a sequence of random variables X1, X2, ... on a probability space (Ω, F, P), consider the sub-σ-algebras Fn = σ(X1, ..., Xn), n ≥ 1. If we think of X1, X2, ... as observations taken at times 1, 2, ..., then Fn can be interpreted as the information available at time n.

Note that F1 ⊆ F2 ⊆ · · · ⊆ F. Such an increasing sequence of sub-σ-algebras is known as a filtration, and the space (Ω, F, {Fn}, P) is called a filtered probability space. (For the time being, we will assume that the filtration is indexed by N, but one can consider more general index sets such as [0, ∞) as well.)

If X1, X2, ... satisfies Xn ∈ Fn for all n, we say that the sequence {Xn} is adapted to the filtration {Fn}. {σ(X1, ..., Xn)} is the smallest filtration to which {Xn} is adapted.

Definition. Given a filtered probability space (Ω, F, {Fn}n∈N, P), a random variable N : Ω → N ∪ {∞} is said to be a stopping time if {N = n} ∈ Fn for all n ∈ N.

The following proposition gives an equivalent definition of stopping times. When working with more general index sets than N, this is the appropriate definition.

Proposition 17.1. A random variable N : Ω → N ∪ {∞} on a filtered probability space (Ω, F, {Fn}, P) is a stopping time if and only if {N ≤ n} ∈ Fn for all n ∈ N.

Proof. Let n ∈ N be given. If N is a stopping time, then for each m ≤ n, {N = m} ∈ Fm ⊆ Fn, so

{N ≤ n} = ⋃_{m=1}^n {N = m} ∈ Fn.

Conversely, if {N ≤ m} ∈ Fm for all m, then {N ≤ n − 1} ∈ Fn−1 ⊆ Fn, so

{N = n} = {N ≤ n} \ {N ≤ n − 1} ∈ Fn. □

To motivate the definition, suppose that X1, X2, ... represent one's winnings in successive games of roulette. Let N be a rule for when to stop gambling (in the sense that one quits playing after N games). The requirement that {N = n} ∈ Fn means that the decision to stop playing after n games can only depend on the outcomes of the first n games.

For example, the random variable N ≡ 6 corresponds to the rule that one will play exactly six games, regardless of the outcomes.

The random variable N = inf{n : X1 + ... + Xn < −10} corresponds to quitting once one's losses exceed $10.

The random variable N = inf{n : X1 + ... + Xn ≥ X1 + ... + Xm for all m ∈ N}, corresponding to quitting once one has attained the maximum amount one ever will, is not a stopping time since it depends on the future as well as the past and present.


The second rule is a canonical example of a stopping time: it is the hitting time of (−∞, −10). In general, the random variable N = inf{n : Sn ∈ A} is a stopping time known as the hitting time of A. To verify that N is indeed a stopping time, observe that {N = n} = {S1 ∈ A^C, ..., Sn−1 ∈ A^C, Sn ∈ A} ∈ Fn.
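A hitting-time rule is easy to state in code. The sketch below (a hypothetical helper, not from the text) scans the partial sums of a finite list of increments and reports the first time they land in a target set; it is only an illustration of the definition under that finite-horizon assumption.

```python
def hitting_time(increments, in_target):
    """Return the first n >= 1 with S_n = x_1 + ... + x_n in the target set,
    or None if the partial sums never enter it within the given sequence."""
    s = 0
    for n, x in enumerate(increments, start=1):
        s += x
        if in_target(s):
            return n
    return None

# quitting once losses exceed $10, i.e. N = inf{n : S_n < -10}
losses = [-1] * 15                              # fifteen consecutive $1 losses
print(hitting_time(losses, lambda s: s < -10))  # S_11 = -11 is the first sum below -10
```

Because the decision at time n inspects only the first n increments, this rule is a stopping time in the sense above.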

Associated with each stopping time N is the stopped σ-algebra FN, which we think of as the information known at time N. Formally, FN = {A ∈ F : A ∩ {N = n} ∈ Fn for all n ∈ N}. That is, on {N = n}, A must be measurable with respect to the information known at time n. It is worth noting that the definition implies {N ≤ n} ∈ FN for all n ∈ N, hence N is FN-measurable.

* Clearly FN contains the empty set and is closed under countable unions. To see that it is closed under complements, write A^C ∩ {N = n} = (A^C ∪ {N ≠ n}) ∩ {N = n} = (A ∩ {N = n})^C ∩ {N = n}.

Theorem 17.1. Suppose that X1 , X2 , ... are i.i.d., Fn = σ(X1 , ..., Xn ), and N is a stopping time for Fn .
Conditional on {N < ∞}, {XN +n }n≥1 is independent of FN and has the same distribution as the original
sequence.

Proof. Let A ∈ FN, k ∈ N, and B1, ..., Bk ∈ S be given, and let μ denote the common distribution of the Xi's. For any n ∈ N, we have

P(A, N = n, X_{N+1} ∈ B1, ..., X_{N+k} ∈ Bk) = P(A, N = n, X_{n+1} ∈ B1, ..., X_{n+k} ∈ Bk)
= P(A ∩ {N = n}) ∏_{j=1}^k P(X_{n+j} ∈ Bj) = P(A ∩ {N = n}) ∏_{j=1}^k μ(Bj)

since A ∩ {N = n} ∈ Fn and (X_{n+1}, ..., X_{n+k}) is independent of Fn.

Summing over n gives

P(A, N < ∞, X_{N+1} ∈ B1, ..., X_{N+k} ∈ Bk) = ∑_{n=1}^∞ P(A, N = n, X_{N+1} ∈ B1, ..., X_{N+k} ∈ Bk)
= ∑_{n=1}^∞ P(A ∩ {N = n}) ∏_{j=1}^k μ(Bj) = P(A ∩ {N < ∞}) ∏_{j=1}^k μ(Bj),

proving independence, and taking A = Ω shows that

P(N < ∞, X_{N+1} ∈ B1, ..., X_{N+k} ∈ Bk) / P(N < ∞) = ∏_{j=1}^k μ(Bj). □

We now introduce the shift function θ : Ω → Ω, which is defined coordinatewise by (θω)n = ω_{n+1}. That is, applying θ results in dropping the first term and shifting all others one place to the left. Higher order shifts are defined by iterating θ:

θ¹ = θ and θⁿ = θ ◦ θⁿ⁻¹ for n > 1.

Thus θⁿ acts on ω by dropping the first n terms and shifting the remaining terms n places to the left, so that (θⁿω)i = ω_{n+i}.
We extend the shift function to stopping times by setting

θ^N ω = θⁿω on {N = n},    θ^N ω = Δ on {N = ∞},

where Δ is an extra point we add to Ω to make various natural constructions work out nicely.

Example 17.1 (Returns to zero). Suppose that S = R^d and let τ(ω) = inf{n : ω1 + ... + ωn = 0}, where inf ∅ = ∞ and τ(Δ) := ∞. Thus τ gives the first time the random walk visits 0.
Setting τ2(ω) = τ(ω) + τ(θ^τ ω), we see that on {τ < ∞},

τ(θ^τ ω) = inf{n : (θ^τ ω)1 + ... + (θ^τ ω)n = 0} = inf{n : ω_{τ+1} + ... + ω_{τ+n} = 0},

hence

τ2(ω) = τ(ω) + τ(θ^τ ω) = inf{m > τ : ω1 + ... + ωm = 0}.


Because of the convention that τ(Δ) = ∞, this is well defined for all ω and gives the time of the second visit to zero.

The same reasoning shows that if we set τ0 = 0, then

τn(ω) = τ_{n−1}(ω) + τ(θ^{τ_{n−1}} ω)

is well-defined for all n ∈ N and gives the time of the nth visit to zero.

Of course, this idea is applicable to stopping times in general: if T is any stopping time, then setting T0 = 0, we can define the iterates of T by

Tn(ω) = T_{n−1}(ω) + T(θ^{T_{n−1}} ω)

for n ≥ 1.
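These iterates are straightforward to compute for a concrete finite sequence. The sketch below (hypothetical helper names, not from the text) implements Tn(ω) = T_{n−1}(ω) + T(θ^{T_{n−1}}ω) using list slicing for the shift θ, with None playing the role of ∞.

```python
def tau(omega):
    """First visit time of the walk to 0: tau = inf{n : omega_1 + ... + omega_n = 0}."""
    s = 0
    for n, x in enumerate(omega, start=1):
        s += x
        if s == 0:
            return n
    return None  # inf of the empty set, i.e. "infinity"

def iterate(T, omega, n):
    """T_n(omega) = T_{n-1}(omega) + T(theta^{T_{n-1}} omega); None means infinity."""
    total = 0
    for _ in range(n):
        t = T(omega[total:])   # omega[total:] is theta^total applied to omega
        if t is None:
            return None
        total += t
    return total

omega = [1, -1, -1, 1, 1, -1]  # walk values: 1, 0, -1, 0, 1, 0
print([iterate(tau, omega, n) for n in (1, 2, 3)])  # visits to zero at times 2, 4, 6
```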

Proposition 17.2. In the above setting, if we assume that P = µ×µ×· · · , then P (Tn < ∞) = P (T < ∞)n .

Proof. We argue by induction. The base case is trivial, so let us assume that the statement holds for n − 1. Applying Theorem 17.1 to T_{n−1}, we see that {X_{T_{n−1}+k}}_{k≥1} = {(θ^{T_{n−1}} X)_k}_{k≥1} is independent of F_{T_{n−1}} on {T_{n−1} < ∞} and has the same distribution as {Xk}_{k≥1}.

Consequently, {T ◦ θ^{T_{n−1}} < ∞} is independent of {T_{n−1} < ∞} and has the same probability as {T < ∞}, hence

P(Tn < ∞) = P(T_{n−1} < ∞, T ◦ θ^{T_{n−1}} < ∞) = P(T_{n−1} < ∞) P(T ◦ θ^{T_{n−1}} < ∞)
= P(T < ∞)^{n−1} P(T < ∞) = P(T < ∞)^n

and the result follows. □

Our next result about stopping times is the famous

Theorem 17.2 (Wald's equation). Let X1 , X2 , ... be i.i.d. with E |X1 | < ∞. If N is a stopping time with
E[N ] < ∞, then E[SN ] = E[X1 ]E[N ].

Proof. First suppose that the Xi's are nonnegative. Then

E[SN] = ∫ SN dP = ∑_{n=1}^∞ ∫ 1{N = n} Sn dP = ∑_{n=1}^∞ ∫ 1{N = n} ∑_{m=1}^n Xm dP
= ∑_{m=1}^∞ ∑_{n=m}^∞ ∫ 1{N = n} Xm dP = ∑_{m=1}^∞ ∫ 1{N ≥ m} Xm dP,

where interchanging the order of summation is justified by the nonnegativity assumption.

Since {N ≥ m} = {N ≤ m − 1}^C ∈ F_{m−1} and Xm is independent of F_{m−1}, we have

E[SN] = ∑_{m=1}^∞ ∫ 1{N ≥ m} Xm dP = ∑_{m=1}^∞ E[1{N ≥ m} Xm]
= ∑_{m=1}^∞ P(N ≥ m) E[Xm] = E[X1] ∑_{m=1}^∞ P(N ≥ m) = E[X1] E[N].

To prove the result for general Xi, we run the last argument in reverse to conclude that

∞ > E|X1| E[N] = ∑_{m=1}^∞ P(N ≥ m) E|Xm| = ∑_{m=1}^∞ ∫ 1{N ≥ m} |Xm| dP
= ∑_{m=1}^∞ ∑_{n=m}^∞ ∫ 1{N = n} |Xm| dP ≥ ∑_{n=1}^∞ ∫ 1{N = n} |Sn| dP.

Since the double integrals converge absolutely, we can invoke Fubini to conclude that

E[X1] E[N] = ∑_{m=1}^∞ P(N ≥ m) E[Xm] = ∑_{m=1}^∞ ∫ 1{N ≥ m} Xm dP = ∑_{m=1}^∞ ∑_{n=m}^∞ ∫ 1{N = n} Xm dP
= ∑_{n=1}^∞ ∑_{m=1}^n ∫ 1{N = n} Xm dP = ∑_{n=1}^∞ ∫ 1{N = n} Sn dP = E[SN]. □

One consequence of Wald's equation is that one can gain no advantage in a fair or unfavorable game by employing a length-of-play strategy which does not depend on the possibility of infinitely many games, infinite payoff, or the ability to see the future.

One hears people advocate the policy of playing until they are ahead. Denoting the outcomes of successive games by X1, X2, ..., this stopping rule is given by α = inf{n : X1 + ... + Xn > 0}. If E|X1| < ∞ and E[X1] ≤ 0, then E[α] < ∞ would imply 0 < E[Sα] = E[α]E[X1] ≤ 0, a contradiction. Thus the expected waiting time until one shows a profit on a sequence of independent and identical fair or unfavorable bets is infinite.

Some other amusing consequences involve variations on the following game: suppose that you are to roll a die repeatedly until a number of your choice appears. You are then awarded an amount of money equal to the sum of your rolls. Wald's equation shows that any number you choose will result in the same expected winnings.

Indeed, the outcomes of each roll, X1, X2, ..., are i.i.d. uniform over {1, 2, ..., 6}. The waiting time, Ni, until any number i ∈ {1, ..., 6} appears is geometric with success probability 1/6. Thus your expected winnings are

E[S_{Ni}] = E[Ni]E[X1] = 6 · (1 + 2 + ... + 6)/6 = 21

regardless of the number you choose. There is no advantage in choosing six over one, say, in terms of expected winnings. (Of course, there is an advantage in terms of things like maximizing your minimum potential winnings.)
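The conclusion can also be checked without Wald's equation: conditional on Ni = n, the first n − 1 rolls are uniform over the five faces other than i, the last roll equals i, and E[Ni − 1] = 5. The sketch below (an illustrative computation, not from the text) evaluates E[S_{Ni}] exactly from this decomposition.

```python
from fractions import Fraction

def expected_winnings(i):
    """E[S_{N_i}] for 'roll until face i appears, collect the sum of all rolls'."""
    mean_other = Fraction(21 - i, 5)   # mean of a roll conditioned on not being i
    waits = 5                          # E[N_i - 1] for N_i geometric(1/6)
    return waits * mean_other + i      # non-i rolls plus the final roll of i

print([expected_winnings(i) for i in range(1, 7)])  # every entry equals 21
```

The algebra collapses for every face: 5·(21 − i)/5 + i = 21, matching the Wald computation.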


We now consider an application of Wald's equation to the analysis of simple random walk.

Example 17.2 (Simple Random Walk).


Let X1, X2, ... be i.i.d. with P(X1 = 1) = P(X1 = −1) = 1/2. Let a < 0 < b be integers and set N = inf{n : Sn ∉ (a, b)}, the first time the walk exits (a, b).

We first note that for any x ∈ (a, b), P(x + S_{b−a} ∉ (a, b)) ≥ 2^{−(b−a)} since b − a steps in the positive direction, say, will take us out of (a, b). Iterating this inequality shows that P(N > n(b − a)) ≤ (1 − 2^{−(b−a)})^n, so E[N] < ∞. (In fact, the exponential tail decay implies that N has moments of all orders.)

Applying Wald's equation shows that bP(SN = b) + aP(SN = a) = E[SN] = E[N]E[X1] = 0. Since P(SN = a) = 1 − P(SN = b), we have (b − a)P(SN = b) = −a, hence

P(SN = b) = −a/(b − a),    P(SN = a) = b/(b − a).

Writing Ta = inf{n : Sn = a} and Tb = inf{n : Sn = b} shows that P(Ta < Tb) = P(SN = a) = b/(b − a) for all integers a < 0 < b. Because P(Ta < ∞) ≥ P(Ta < Tb) for all b > 0, sending b → ∞ shows that P(Ta < ∞) = 1 for all integers a < 0.

Symmetry implies that P(Tb < ∞) = 1 for all integers b > 0, and since the walk must pass through 0 to get from 1 to −1 or vice versa, we see that P(T0 < ∞) = 1 as well.

However, Wald's equation shows that the expected time to visit any nonzero integer is infinite: if E[Tx] < ∞ for x ∈ Z \ {0}, we would have x = E[S_{Tx}] = E[Tx]E[X1] = 0. By conditioning on the first step, we see that the expected time for the walk to return to 0 is

E[T0] = E[(1/2)(1 + T−1) + (1/2)(1 + T1)] = ∞.

To recap, simple random walk will visit any given integer in finite time with probability one, but the expected time to do so is infinite!
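The exit probabilities above can also be obtained by solving the one-step recursion p(x) = (p(x−1) + p(x+1))/2 with boundary values p(a) = 0, p(b) = 1, whose solution is linear in x. The small check below (illustrative code, not from the text) verifies that the linear formula satisfies the recursion at every interior point and recovers P(S_N = b) = −a/(b − a).

```python
from fractions import Fraction

def exit_prob_b(a, b):
    """P(S_N = b) for SRW started at 0, N the exit time of the interval (a, b)."""
    # p(x) = P(hit b before a | start at x) = (x - a)/(b - a), linear in x
    p = {x: Fraction(x - a, b - a) for x in range(a, b + 1)}
    # sanity check: p is harmonic, i.e. satisfies the first-step recursion
    assert all(p[x] == (p[x - 1] + p[x + 1]) / 2 for x in range(a + 1, b))
    return p[0]

print(exit_prob_b(-3, 5))   # -a/(b - a) = 3/8
```

Since any function satisfying the recursion with these boundary values is unique, this confirms the Wald-based answer exactly.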

We can compute the variance of random sums whose index of summation is a stopping time using

Theorem 17.3 (Wald's second equation). Let X1 , X2 , ... be i.i.d. with E[X1 ] = 0 and E[X12 ] = σ2 < ∞.
If T is a stopping time with E[T ] < ∞, then E[ST2 ] = σ2 E[T ].

Proof. We first note that for all n ∈ N,

S²_{T∧n} = (∑_{i=1}^{T∧n} Xi)² = (∑_{i=1}^{T∧(n−1)} Xi + Xn 1{T ≥ n})²
= S²_{T∧(n−1)} + (2Xn S_{n−1} + X²n) 1{T ≥ n}.

Since {T ≥ n} = {T ≤ n − 1}^C ∈ F_{n−1} and Xn is independent of F_{n−1}, taking expectations yields

E[S²_{T∧n}] = E[S²_{T∧(n−1)}] + 2E[Xn] E[S_{n−1} 1{T ≥ n}] + E[X²n] E[1{T ≥ n}]
= E[S²_{T∧(n−1)}] + σ² P(T ≥ n).

By assumption, all expectations exist and are finite, so induction on n gives

E[S²_{T∧n}] = σ² ∑_{k=1}^n P(T ≥ k).

Now E[T] = ∑_{k=1}^∞ P(T ≥ k) = lim_{n→∞} ∑_{k=1}^n P(T ≥ k), so, since S_{T∧n} → S_T pointwise, if we can show that S_{T∧n} is Cauchy in L², then it will follow that

E[S²_T] = lim_{n→∞} E[S²_{T∧n}] = σ² E[T].

To this end, observe that for any n > m,

E[(S_{T∧n} − S_{T∧m})²] = E[(∑_{k=m+1}^{T∧n} Xk)²] = σ² ∑_{k=m+1}^n P(T ≥ k) ≤ σ² ∑_{k=m+1}^∞ P(T ≥ k),

where the second equality follows from the exact same argument as above.

Since this goes to 0 as m → ∞, S_{T∧n} is indeed Cauchy in L² and the proof is complete. □

A consequence of Wald's second equation is

Theorem 17.4. Let X1, X2, ... be i.i.d. with E[X1] = 0, E[X1²] = 1, and set Tc = inf{n ≥ 1 : |Sn| > c√n}. Then E[Tc] is finite if and only if c < 1.

Proof. If E[Tc] < ∞, then Wald's second equation implies

E[Tc] = E[Tc]E[X1²] = E[S²_{Tc}].

However, E[S²_{Tc}] > E[(c√Tc)²] = c²E[Tc] by construction, so when c ≥ 1, the assumption that E[Tc] < ∞ leads to the contradiction E[Tc] = E[S²_{Tc}] > c²E[Tc]. Thus it remains only to show that E[Tc] < ∞ when c ∈ [0, 1).

To this end, we let τn = Tc ∧ n and note that S²_{τn−1} ≤ c²(τn − 1) ≤ c²τn, so Theorem 17.3 and Cauchy-Schwarz give

E[τn] = E[S²_{τn}] = E[S²_{τn−1} + 2S_{τn−1}X_{τn} + X²_{τn}]
≤ c²E[τn] + 2c√(E[τn]E[X²_{τn}]) + E[X²_{τn}].

To complete the proof, we show

Lemma 17.1. If X1, X2, ... are i.i.d. with E[X1²] < ∞ and T is a stopping time with E[T] = ∞, then

lim_{n→∞} E[X²_{T∧n}] / E[T ∧ n] = 0.

It will then follow that if E[Tc] = ∞ for c ∈ [0, 1), then for any ε ∈ (0, (1 − c)²), there is an N ∈ N so that E[X²_{τn}] < εE[τn] whenever n ≥ N, giving the contradiction

E[τn] ≤ c²E[τn] + 2c√(E[τn]E[X²_{τn}]) + E[X²_{τn}] < (c² + 2c√ε + ε)E[τn] = (c + √ε)²E[τn] < E[τn],

since √ε < 1 − c implies (c + √ε)² < 1.

To prove Lemma 17.1, let ε > 0 and observe that

E[X²_{T∧n}] = E[X²_{T∧n}; X²_{T∧n} ≤ ε(T∧n)] + ∑_{j=1}^n E[X²_{T∧n}; X²_{T∧n} > εj, T∧n = j]
≤ εE[T∧n] + ∑_{j=1}^n E[X²_j; T∧n = j, X²_j > εj].

Now, since E[X²_j; X²_j > εj] → 0 as j → ∞ (by the DCT), we can choose N large enough that

∑_{j=1}^n E[X²_j; X²_j > εj] < nε whenever n > N.

Then for all n > N, we have

∑_{j=1}^n E[X²_j; T∧n = j, X²_j > εj] = ∑_{j=1}^N E[X²_j; T∧n = j, X²_j > εj] + ∑_{j=N+1}^n E[X²_j; T∧n = j, X²_j > εj]
≤ N E[X1²] + ∑_{j=N+1}^n E[X²_j; T∧n = j, X²_j > εj].

To bound the latter sum, we compute

∑_{j=N+1}^n E[X²_j; T∧n = j, X²_j > εj] ≤ ∑_{j=N+1}^n E[X²_j; T∧n ≥ j, X²_j > εj]
= ∑_{j=N+1}^n E[X²_j 1{T∧n ≥ j} 1{X²_j > εj}]
= ∑_{j=N+1}^n P(T∧n ≥ j) E[X²_j; X²_j > εj]
= ∑_{j=N+1}^n ∑_{k=j}^∞ P(T∧n = k) E[X²_j; X²_j > εj]
≤ ∑_{k=N+1}^∞ P(T∧n = k) ∑_{j=1}^k E[X²_j; X²_j > εj]
≤ ∑_{k=N+1}^∞ P(T∧n = k) kε ≤ εE[T∧n],

where the third line uses the independence of Xj and {T∧n ≥ j} ∈ F_{j−1}.

It follows that n > N implies

E[X²_{T∧n}] ≤ εE[T∧n] + ∑_{j=1}^n E[X²_j; T∧n = j, X²_j > εj]
≤ εE[T∧n] + N E[X1²] + ∑_{j=N+1}^n E[X²_j; T∧n = j, X²_j > εj]
≤ εE[T∧n] + N E[X1²] + εE[T∧n] = 2εE[T∧n] + N E[X1²].

Since E[T∧n] → E[T] = ∞ (by the MCT), we see that lim sup_{n→∞} E[X²_{T∧n}]/E[T∧n] ≤ 2ε and the lemma follows because ε > 0 is arbitrary. □

18. Recurrence

In this section, we will consider some questions regarding the recurrence behavior of random walk in R^d. Throughout, we will take (S, S) to be R^d with the Borel σ-algebra, we will denote the position of the random walk at time n by Sn = X1 + ... + Xn with X1, X2, ... i.i.d., and we will work with the norm ‖x‖ = max_{1≤i≤d} |xi|.

Definition. x ∈ R^d is called a recurrent value for the random walk Sn if for every ε > 0, P(‖Sn − x‖ < ε i.o.) = 1. We denote the set of recurrent values by V.

Note that the Hewitt-Savage 0−1 law implies that if P(‖Sn − x‖ < ε i.o.) is less than one, then it is zero.

Definition. x ∈ R^d is called a possible value for Sn if for every ε > 0, there is an n ∈ N with P(‖Sn − x‖ < ε) > 0. The set of possible values is denoted U.

Clearly the set of recurrent values is contained in the set of possible values. In fact, we have

Theorem 18.1. The set V is either ∅ or a closed subgroup of Rd . In the latter case, V = U .
Proof. Suppose that V ≠ ∅. If z ∈ V^C, then there is an ε > 0 such that P(‖Sn − z‖ < ε i.o.) = 0, and thus Bε(z) = {w ∈ R^d : ‖w − z‖ < ε} ⊆ V^C. It follows that V^C is open, so V is closed.

The rest of the theorem will follow upon showing

(∗) If x ∈ U and y ∈ V, then y − x ∈ V.

This is because V ⊆ U, so for any v, w ∈ V, taking x = y = v shows that 0 ∈ V; taking x = v, y = 0 shows that −v ∈ V; and taking x = −w, y = v shows that v + w ∈ V. It follows that V is a subgroup of R^d.

Also, for any u ∈ U, taking x = u, y = 0 shows that −u ∈ V. As V is a subgroup, this implies that u ∈ V, and thus U ⊆ V. Consequently, U = V.

To prove (∗), we note that if y − x ∉ V, then there exist ε > 0 and m ∈ N such that

P(‖Sn − (y − x)‖ ≥ 2ε for all n ≥ m) > 0.

Also, since x ∈ U, there is some k ∈ N with P(‖Sk − x‖ < ε) > 0.

Now for any n ≥ m + k, Sn − Sk = X_{k+1} + ... + Xn has the same distribution as S_{n−k} and is independent of Sk. It follows that {‖Sn − Sk − (y − x)‖ ≥ 2ε for all n ≥ m + k} and {‖Sk − x‖ < ε} are independent and each have positive probability.

Because ‖Sk(ω) − x‖ < ε and 2ε ≤ ‖Sn(ω) − Sk(ω) − (y − x)‖ ≤ ‖Sn(ω) − y‖ + ‖Sk(ω) − x‖ together imply ‖Sn(ω) − y‖ > ε, we conclude that

P(‖Sn − y‖ ≥ ε for all n ≥ m + k) ≥ P(‖Sn − Sk − (y − x)‖ ≥ 2ε for all n ≥ m + k) P(‖Sk − x‖ < ε) > 0.

But this contradicts y ∈ V, so we must have y − x ∈ V. □


When V = ∅, the random walk is called transient; otherwise it is called recurrent. It follows from Theorem 18.1 that a random walk is recurrent if and only if 0 is a recurrent value.

By definition, a sufficient condition for this to be the case is P(Sn = 0 i.o.) = 1. (That is, if 0 is point recurrent, then it is neighborhood recurrent. The distinction arises when the range of the Xi's is dense, but the two are equivalent for simple random walk.)

By Proposition 17.2, if we set τ0 = 0 and let τn = inf{m > τ_{n−1} : Sm = 0} be the nth time the walk visits 0, then P(τn < ∞) = P(τ1 < ∞)^n. From this observation, we arrive at

Theorem 18.2. For any random walk, the following are equivalent:
(1) P(τ1 < ∞) = 1
(2) P(Sn = 0 i.o.) = 1
(3) ∑_{n=1}^∞ P(Sn = 0) = ∞

Proof. If P(τ1 < ∞) = 1, then P(τn < ∞) = 1^n = 1 for all n, hence P(Sn = 0 i.o.) = 1.

The contrapositive of the first Borel-Cantelli lemma shows that P(Sn = 0 i.o.) = 1 implies ∑_{n=1}^∞ P(Sn = 0) = ∞.

Finally, the number of visits to zero can be expressed both as V = ∑_{n=1}^∞ 1{Sn = 0} and as V = ∑_{n=1}^∞ 1{τn < ∞}. Thus if P(τ1 < ∞) < 1, then

∑_{n=1}^∞ P(Sn = 0) = E[V] = ∑_{n=1}^∞ P(τn < ∞) = ∑_{n=1}^∞ P(τ1 < ∞)^n = P(τ1 < ∞)/(1 − P(τ1 < ∞)) < ∞.

It follows that ∑_{n=1}^∞ P(Sn = 0) = ∞ implies P(τ1 < ∞) = 1. □

Analogous to the one-dimensional case, we say that Sn = X1 + ... + Xn defines a simple random walk on R^d (equivalently, Z^d) if X1, X2, ... are i.i.d. with

P(Xi = ej) = P(Xi = −ej) = 1/(2d)

for each of the d standard basis vectors ej.

We will show that simple random walk is recurrent in dimensions d = 1, 2 and transient otherwise. Essentially, this is because P(Sn = 0) ≈ Cd n^{−d/2}, which is summable for d ≥ 3 but not for d = 1, 2.

The argument is combinatorial and relies on

Proposition 18.1 (Stirling's formula).

n! ≈ √(2πn) (n/e)^n

in the sense that the ratio of the two sides approaches 1 as n → ∞.
Theorem 18.3. Simple random walk is recurrent in dimensions one and two.

Proof. When d = 1, P(S_{2n−1} = 0) = 0 and

P(S_{2n} = 0) = 2^{−2n} C(2n, n) = 2^{−2n} (2n)!/(n!)² ≈ 2^{−2n} · √(4πn)(2n/e)^{2n} / (2πn (n/e)^{2n}) = 1/√(πn)

for all n ∈ N, writing C(n, k) for the binomial coefficient.

Since ∑_{n=1}^∞ 1/√(πn) diverges, it follows from the limit comparison test that

∑_{n=1}^∞ P(Sn = 0) = ∑_{n=1}^∞ P(S_{2n} = 0) = ∞,

so P(Sn = 0 i.o.) = 1 and thus Sn is recurrent.

Similarly, when d = 2, P(S_{2n−1} = 0) = 0 and

P(S_{2n} = 0) = 4^{−2n} ∑_{m=0}^n (2n)!/(m! m! (n−m)! (n−m)!) = 4^{−2n} C(2n, n) ∑_{m=0}^n C(n, m) C(n, n−m)
= (2^{−2n} C(2n, n))² ≈ 1/(πn)

since we have seen that 2^{−2n} C(2n, n) ≈ 1/√(πn).

* The identity ∑_{m=0}^n C(n, m) C(n, n−m) = C(2n, n) follows by noting that the number of ways to choose a committee of size n from a population of size 2n containing n men and n women is the sum over m = 0, 1, ..., n of the number of such committees consisting of m men and n − m women.

Arguing as in the d = 1 case shows that Sn is recurrent. □
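Both asymptotics are easy to check numerically: in one dimension P(S_{2n} = 0) = 2^{−2n} C(2n, n) exactly, and in two dimensions it is the square of that quantity. The sketch below (illustrative, not from the text) compares the exact values with 1/√(πn) and 1/(πn).

```python
from math import comb, pi, sqrt

def p0_1d(n):
    """Exact P(S_{2n} = 0) for simple random walk on Z."""
    return comb(2 * n, n) / 4 ** n

for n in (10, 100, 1000):
    # both ratios tend to 1 as n grows
    print(n, p0_1d(n) * sqrt(pi * n), p0_1d(n) ** 2 * pi * n)
```

The printed ratios creep toward 1, consistent with the Stirling estimates used in the proof.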

In contrast to the d = 1, 2 cases, we have

Theorem 18.4. Simple random walk is transient in three or more dimensions.

Proof. When d = 3,

P(S_{2n} = 0) = 6^{−2n} ∑ (2n)!/(n1! n2! n3!)² = 2^{−2n} C(2n, n) ∑ (3^{−n} n!/(n1! n2! n3!))²,

where the sums extend over nonnegative n1, n2, n3 with n1 + n2 + n3 = n.

Now 3^{−n} n!/(n1! n2! n3!) ≥ 0 for each choice of n1, n2, n3, n, and the multinomial theorem gives

∑ 3^{−n} n!/(n1! n2! n3!) = (1/3 + 1/3 + 1/3)^n = 1,

so

∑ (3^{−n} n!/(n1! n2! n3!))² ≤ (max_{n1+n2+n3=n} 3^{−n} n!/(n1! n2! n3!)) · ∑ 3^{−n} n!/(n1! n2! n3!) = 3^{−n} max_{n1+n2+n3=n} n!/(n1! n2! n3!).

The latter quantity is maximized when n1! n2! n3! is minimized. This happens when n1, n2, n3 are as close together as possible: if ni < nj − 1 for i < j, then ni! nj! > ((ni + 1)/nj) ni! nj! = (ni + 1)!(nj − 1)!.

It follows from Stirling's formula that

3^{−n} max n!/(n1! n2! n3!) ≈ 3^{−n} n!/((n/3)!)³ ≈ 3^{−n} · √(2πn)(n/e)^n / ((2πn/3)^{3/2}(n/(3e))^n) = 3^{3/2}/(2πn).

Putting all this together and recalling that 2^{−2n} C(2n, n) ≈ 1/√(πn) shows that

P(S_{2n} = 0) = 2^{−2n} C(2n, n) ∑ (3^{−n} n!/(n1! n2! n3!))² = O(n^{−3/2}),

hence ∑_{n=1}^∞ P(Sn = 0) < ∞ and we conclude that SRW is transient in 3 dimensions.

Transience in higher dimensions follows by letting Tn = (Sn¹, Sn², Sn³) be the projection of Sn onto the first three coordinates and letting N(n) = inf{m > N(n−1) : Tm ≠ T_{N(n−1)}} be the nth time that the random walker moves in any of the first three coordinates (with the convention that N(0) = 0). Then T_{N(n)} is a simple random walk in three dimensions and the probability that T_{N(n)} = 0 infinitely often is 0. Since the first three coordinates of Sn are constant between N(n) and N(n+1) and N(n+1) − N(n) is almost surely finite, this implies that Sn is transient. □
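One can also compute P(S_{2n} = 0) in three dimensions exactly from the formula at the start of the proof and watch the partial sums of ∑ P(S_{2n} = 0) level off, in contrast to the divergent one- and two-dimensional series. The code below (illustrative, not from the text) does this with exact integer arithmetic.

```python
from math import comb, factorial

def p0_3d(n):
    """Exact P(S_{2n} = 0) for simple random walk on Z^3."""
    inner = sum(
        (factorial(n) // (factorial(a) * factorial(b) * factorial(n - a - b))) ** 2
        for a in range(n + 1) for b in range(n + 1 - a)
    )
    return comb(2 * n, n) * inner / 6 ** (2 * n)

partial = sum(p0_3d(n) for n in range(1, 41))
print(p0_3d(1), partial)   # P(S_2 = 0) = 1/6; the partial sums stay bounded
```

The terms decay like n^{−3/2}, so the partial sums converge, which is exactly the summability that forces transience via Theorem 18.2.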

In the case of more general random walks on R^d, the Chung-Fuchs theorem says that

• Sn is recurrent in d = 1 if Sn/n →p 0.
• Sn is recurrent in d = 2 if Sn/√n ⇒ N(0, Σ).
• Sn is transient in d ≥ 3 if it is truly (at least) three-dimensional (meaning that it does not live in a plane through the origin).

More generally, one can show that a necessary and sufficient condition for recurrence is

∫_{(−δ,δ)^d} Re(1/(1 − φ(y))) dy = ∞

for δ > 0, where φ(t) = E[e^{it·X1}] is the ch.f. of one step in the walk.
We will content ourselves with proofs of the d = 1, 2 results. The proof for d ≥ 3 can be found in Durrett.

We begin with some lemmas which are valid in any dimension. The first is analogous to Theorem 18.2.

Lemma 18.1. Let X1, X2, ... be i.i.d. and take Sn = ∑_{i=1}^n Xi. Then Sn is recurrent if and only if

∑_{n=1}^∞ P(‖Sn‖ < ε) = ∞

for every ε > 0.

Proof. If Sn is recurrent, then P(‖Sn‖ < ε i.o.) = 1 for all ε > 0, so the contrapositive of the first Borel-Cantelli lemma implies that ∑_{n=1}^∞ P(‖Sn‖ < ε) = ∞ for all ε > 0.

For the converse, fix k ≥ 1 and define Zk = ∑_{i=1}^∞ 1{‖Si‖ < ε, ‖S_{i+j}‖ ≥ ε for all j ≥ k}. Then Zk ≤ k by construction, so

k ≥ E[Zk] = ∑_{i=1}^∞ P(‖Si‖ < ε, ‖S_{i+j}‖ ≥ ε for all j ≥ k)
≥ ∑_{i=1}^∞ P(‖Si‖ < ε, ‖S_{i+j} − Si‖ ≥ 2ε for all j ≥ k)
= ∑_{i=1}^∞ P(‖Si‖ < ε) P(‖S_{i+j} − Si‖ ≥ 2ε for all j ≥ k)
= P(‖Sj‖ ≥ 2ε for all j ≥ k) ∑_{i=1}^∞ P(‖Si‖ < ε).

Thus if ∑_{i=1}^∞ P(‖Si‖ < ε) = ∞, then it must be the case that P(‖Sj‖ ≥ 2ε for all j ≥ k) = 0. As this is true for all k ∈ N, we see that P(‖Sj‖ < 2ε i.o.) = 1, so, since this holds for all ε > 0, Sn is recurrent. □

Note that one could equivalently take the lower index of summation to be 0 in the preceding lemma, since adding or subtracting a single term does not change whether or not the sum diverges.

Everything that we have done thus far is true for any norm. Our reason for working with the supremum norm is the following result. As all norms on R^d are equivalent, this choice entails no loss of generality; the definition of neighborhood recurrence is topological.

Lemma 18.2. For any m ∈ N and ε > 0,

∑_{n=0}^∞ P(‖Sn‖ < mε) ≤ (2m)^d ∑_{n=0}^∞ P(‖Sn‖ < ε).

Proof. The left hand side gives the expected number of visits to the open cube (−mε, mε)^d. This can be obtained by summing the expected number of visits to each of the (2m)^d subcubes of side length ε obtained by dividing each side of the cube into 2m equal segments. Thus it suffices to show that the expected number of visits to any of these subcubes is at most ∑_{n=0}^∞ P(‖Sn‖ < ε).

To this end, let C be any such cube of side length ε in R^d and let T = inf{n : Sn ∈ C} be the hitting time of C. If T = ∞ then the walk never visits C, while if T = m, then on every subsequent visit to C the walk is within ε of Sm, so

∑_{n=0}^∞ P(Sn ∈ C) = ∑_{n=0}^∞ ∑_{m=0}^n P(Sn ∈ C, T = m) = ∑_{m=0}^∞ ∑_{n=m}^∞ P(Sn ∈ C, T = m)
≤ ∑_{m=0}^∞ ∑_{n=m}^∞ P(‖Sn − Sm‖ < ε, T = m)
= ∑_{m=0}^∞ ∑_{n=m}^∞ P(‖Sn − Sm‖ < ε) P(T = m)
= ∑_{m=0}^∞ P(T = m) ∑_{k=0}^∞ P(‖Sk‖ < ε)
≤ ∑_{k=0}^∞ P(‖Sk‖ < ε). □

P∞
The preceding lemma shows that establishing convergence/divergence of n=1 P (kSn k < ε) for a single

value of ε>0 is sucient for determining transience/recurrence of Sn . In particular, we have


Corollary 18.1. Sn is recurrent if and only if ∑_{n=1}^∞ P(‖Sn‖ < 1) = ∞.

Proof. Lemma 18.1 shows that if Sn is recurrent, then ∑_{n=1}^∞ P(‖Sn‖ < 1) = ∞.

On the other hand, suppose that ∑_{n=1}^∞ P(‖Sn‖ < 1) = ∞ and let ε > 0 be given. Applying Lemma 18.2 with m > ε^{−1} yields

∞ = ∑_{n=1}^∞ P(‖Sn‖ < 1) ≤ (2m)^d ∑_{n=0}^∞ P(‖Sn‖ < m^{−1}) ≤ (2m)^d ∑_{n=0}^∞ P(‖Sn‖ < ε),

so, since ε was arbitrary, it follows from Lemma 18.1 that Sn is recurrent. □

With the previous results at our disposal, we are able to show

Theorem 18.5. If X1, X2, ... are i.i.d. R-valued random variables and (1/n)Sn →p 0, then Sn is recurrent.

Proof. By Corollary 18.1, it suffices to prove that ∑_{n=1}^∞ P(|Sn| < 1) = ∞.

Lemma 18.2 shows that for any m, K ∈ N,

∑_{n=0}^∞ P(|Sn| < 1) ≥ (1/(2m)) ∑_{n=0}^∞ P(|Sn| < m) ≥ (1/(2m)) ∑_{n=0}^{Km} P(|Sn| < m)
≥ (1/(2m)) ∑_{n=0}^{Km} P(|Sn| < n/K) = (K/2) · (1/(Km)) ∑_{n=0}^{Km} P(|Sn| < n/K).

By hypothesis, P(|Sn| < n/K) → 1 as n → ∞, so the Cesàro averages converge to 1 as well, and sending m to ∞ shows that ∑_{n=0}^∞ P(|Sn| < 1) ≥ K/2. The proof is complete since K was arbitrary. □

It is worth remarking that if the Xi's have a well-defined expectation E[Xi] = μ ≠ 0, then the strong law implies Sn/n → μ a.s. In this case, we must have |Sn| → ∞, so the walk is transient. If E[Xi] = 0, then the weak law implies Sn/n →p 0. Thus if the increments have an expectation μ = E[Xi] ∈ [−∞, ∞], then the walk is recurrent if and only if μ = 0.

We now show that, in dimension 2, a random walk is recurrent if a mean 0 central limit theorem holds. In the case where the limit distribution N(0, Σ) is degenerate (that is, the covariance matrix Σ has rank(Σ) < 2), the random walk is either always at 0 or is essentially one-dimensional, in which case recurrence follows from Theorem 18.5. Thus we will assume in what follows that N(0, Σ) is nondegenerate and thus has a density with respect to Lebesgue measure on R².

Theorem 18.6. If Sn is a random walk in R² and n^{−1/2}Sn converges weakly to a nondegenerate normal distribution, then Sn is recurrent.

Proof. As before, we need to show that ∑_{n=1}^∞ P(‖Sn‖ < 1) = ∞.

By Lemma 18.2,

∑_{n=0}^∞ P(‖Sn‖ < 1) ≥ (1/(4m²)) ∑_{n=0}^∞ P(‖Sn‖ < m),

and we can write

(1/m²) ∑_{n=0}^∞ P(‖Sn‖ < m) = ∫_0^∞ P(‖S_{⌊m²θ⌋}‖ < m) dθ

since ⌊m²θ⌋ = n on the segment n/m² ≤ θ < (n+1)/m² of length 1/m².

Also, letting n(y) denote the limiting normal density, we have

P(‖S_{⌊m²θ⌋}‖ < m) ≈ P(‖S_{⌊m²θ⌋}‖/(m√θ) < 1/√θ) → ∫_{‖y‖<θ^{−1/2}} n(y) dy

as m → ∞, so Fatou's lemma shows that

4 ∑_{n=0}^∞ P(‖Sn‖ < 1) ≥ lim inf_{m→∞} (1/m²) ∑_{n=0}^∞ P(‖Sn‖ < m)
= lim inf_{m→∞} ∫_0^∞ P(‖S_{⌊m²θ⌋}‖ < m) dθ ≥ ∫_0^∞ (∫_{‖y‖<θ^{−1/2}} n(y) dy) dθ.

Since n is positive and continuous at 0, letting |·| denote Lebesgue measure, we have

∫_{‖y‖<θ^{−1/2}} n(y) dy ≈ |{y : ‖y‖ ≤ θ^{−1/2}}| · n(0) = 4n(0)/θ

as θ → ∞.

It follows that the integral with respect to θ (and thus the sum of interest) diverges, and we conclude that Sn is recurrent. □

19. Path Properties

We conclude our investigation of random walk with a look at the trajectories of simple random walk on Z. Here we think of a random walk S1, S2, ... as being represented by the polygonal curve in R² having vertices (1, S1), (2, S2), ..., where successive vertices (n, Sn) and (n + 1, Sn+1) are connected by a line segment.

We will call any polygonal curve which is a possible realization of simple random walk a path.
Now any path from (0, 0) to (n, x) consists of a steps in the positive direction and b steps in the negative direction, where a and b satisfy

a + b = n,    a − b = x,

hence a = (n + x)/2 and b = (n − x)/2. (Note that if (n, x) is a vertex on a possible trajectory of SRW, then n ∈ N0 and x ∈ Z must have the same parity and satisfy |x| ≤ n.)

As each such path is uniquely determined by the locations of the positive steps, the total number is

N_{n,x} = C(n, a) = C(n, (n + x)/2).
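The count N_{n,x} is a one-liner; the sketch below (illustrative, not from the text) also enforces the parity and |x| ≤ n constraints just noted.

```python
from math import comb

def num_paths(n, x):
    """Number of SRW paths from (0,0) to (n,x): choose the (n+x)/2 up-steps."""
    if (n + x) % 2 != 0 or abs(x) > n:
        return 0
    return comb(n, (n + x) // 2)

print(num_paths(4, 2), num_paths(4, 3), num_paths(6, 0))  # 4, 0, 20
```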

Our first result is an enumeration of the paths beginning and ending above the x-axis which hit 0 at some point.

Lemma 19.1 (Reflection principle). For any q, t, n ∈ N, the number of paths from (0, q) to (n, t) that are 0 at some point is equal to the number of paths from (0, −q) to (n, t).
Proof. (Draw picture)

Suppose that (0, r0), (1, r1), ..., (n, rn) is a path from (0, q) to (n, t) which is 0 at some point. Let K = min{k : rk = 0} be the first time the path touches the x-axis. Then, setting r′i = −ri for 0 ≤ i < K and r′i = ri for K ≤ i ≤ n, we see that (0, r′0), (1, r′1), ..., (n, r′n) is a path from (0, −q) to (n, t).

Conversely, if (0, s0), (1, s1), ..., (n, sn) is a path from (0, −q) to (n, t), then it must cross the x-axis at some point. Let L = min{l : sl = 0} be the first time this happens. Then (0, s′0), (1, s′1), ..., (n, s′n) defined by s′j = −sj for 0 ≤ j < L and s′j = sj for L ≤ j ≤ n is a path from (0, q) to (n, t) which is 0 at time L.

Thus the set of paths from (0, q) to (n, t) which are 0 at some point is in one-to-one correspondence with the set of paths from (0, −q) to (n, t), and the lemma is proved. □

A consequence of this simple observation is

Theorem 19.1 (The Ballot Theorem). Suppose that in an election candidate A gets α votes and candidate
B gets β votes where α > β . The probability that candidate A is always in the lead when the votes are
α−β
counted one by one is .
α+β
Proof. Let x = α−β and n = α + β. The number of favorable outcomes is equal to the number of paths from

(1, 1) to (n, x) which are never 0. (Think of a vote for A as a positive step and a vote for B as a negative

step, keeping in mind that the rst vote counted must be for A.)
Shifting by one time step shows that this is equal to the number of paths from (0, 1) to (n − 1, x) which are

never zero.
114
The number of paths from (0, 1) to (n − 1, x) that hit 0 is equal to the number of paths from (0, −1) to (n − 1, x), so subtracting this from the total number of paths gives the number of favorable outcomes.

Shifting in the vertical direction to get paths starting at zero shows that the number in question is

Nn−1,x−1 − Nn−1,x+1 = ((n−1) choose ((n−1)+(x−1))/2) − ((n−1) choose ((n−1)+(x+1))/2) = ((n−1) choose (α−1)) − ((n−1) choose α)

= (n−1)!/((α−1)!(n−α)!) − (n−1)!/(α!(n−α−1)!) = (n−1)!(α − (n−α))/(α!(n−α)!) = ((2α−n)/n)·(n choose α).

Since the total number of sequences of α A's and β B's is (n choose α), the probability that A always leads is

[((2α−n)/n)·(n choose α)] / (n choose α) = (2α − n)/n = (2α − (α+β))/(α+β) = (α − β)/(α + β). □
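The formula is easy to confirm by exhaustive enumeration for small elections; the function name below is my own choice, not from the notes:

```python
from itertools import combinations
from math import comb

def ballot_probability(alpha, beta):
    """Exact P(A is always strictly ahead), enumerating the positions of
    A's votes among all alpha + beta counting orders."""
    n = alpha + beta
    favorable = 0
    for a_positions in combinations(range(n), alpha):
        pos, tally, ok = set(a_positions), 0, True
        for i in range(n):
            tally += 1 if i in pos else -1
            if tally <= 0:  # A must be strictly ahead at every stage
                ok = False
                break
        favorable += ok
    return favorable / comb(n, alpha)

assert ballot_probability(5, 3) == (5 - 3) / (5 + 3)  # = 1/4
```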

The kind of reasoning involved in the preceding arguments is useful in computing the distribution of the hitting time of {0} for simple random walk.

Lemma 19.2. For simple random walk on Z,

P (S1 ≠ 0, ..., S2n ≠ 0) = P (S2n = 0).

Proof. If S1 ≠ 0, ..., S2n ≠ 0, then either S1, ..., S2n > 0 or S1, ..., S2n < 0 (since simple random walk cannot skip over integers), and the two events are equally likely by symmetry.

As such, it suffices to show that

P (S1 > 0, ..., S2n > 0) = (1/2) P (S2n = 0).
Breaking up the event {S1, ..., S2n > 0} according to the value of S2n (which is necessarily an even number less than or equal to 2n) gives

P (S1 > 0, ..., S2n > 0) = ∑_{r=1}^{n} P (S1 > 0, ..., S2n−1 > 0, S2n = 2r).

Now each path of length 2n has probability 2^{−2n} of being realized, and the number of paths from (0, 0) to (2n, 2r) which are never zero at positive times is N2n−1,2r−1 − N2n−1,2r+1 by the argument in the proof of the Ballot Theorem.

Accordingly,

P (S1 > 0, ..., S2n > 0) = ∑_{r=1}^{n} P (S1 > 0, ..., S2n−1 > 0, S2n = 2r)

= (1/2^{2n}) ∑_{r=1}^{n} (N2n−1,2r−1 − N2n−1,2r+1)

= (1/2^{2n}) [(N2n−1,1 − N2n−1,3) + (N2n−1,3 − N2n−1,5) + ... + (N2n−1,2n−1 − N2n−1,2n+1)]

= (N2n−1,1 − N2n−1,2n+1)/2^{2n} = N2n−1,1/2^{2n},

where the final equality is because you can't get to 2n + 1 in 2n − 1 steps of size 1.
To complete the proof, we observe that

P (S2n = 0) = P (S2n = 0, S2n−1 = 1) + P (S2n = 0, S2n−1 = −1)

= 2P (S2n = 0, S2n−1 = 1)

= 2P (S2n−1 = 1, X2n = −1)

= 2P (X2n = −1) P (S2n−1 = 1)

= P (S2n−1 = 1) = N2n−1,1/2^{2n−1},

hence

P (S1 > 0, ..., S2n > 0) = N2n−1,1/2^{2n} = (1/2) P (S2n = 0). □

Lemma 19.2 and our previous computations for simple random walk show that the distribution function of α = inf{n ∈ N : Sn = 0} is given by

P (α ≤ 2n) = 1 − P (S1 ≠ 0, ..., S2n ≠ 0) = 1 − P (S2n = 0) = 1 − (1/2^{2n})·(2n choose n) ≈ (√(πn) − 1)/√(πn),

P (α ≤ 2n + 1) = P (α ≤ 2n)

for n = 1, 2, ...
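To get a feel for the quality of this approximation, one can compare the exact value of P(α ≤ 2n) with (√(πn) − 1)/√(πn) for a few values of n; the names and sample sizes in this script are mine:

```python
from math import comb, pi, sqrt

def prob_hit_zero_by(two_n):
    """Exact P(alpha <= 2n) = 1 - (2n choose n)/2^{2n}."""
    n = two_n // 2
    return 1 - comb(2 * n, n) / 2**(2 * n)

# Stirling's formula gives (2n choose n)/2^{2n} ~ 1/sqrt(pi*n), so
# P(alpha <= 2n) ~ (sqrt(pi*n) - 1)/sqrt(pi*n); the error decays like n^{-3/2}
for n in [10, 100, 1000]:
    exact = prob_hit_zero_by(2 * n)
    approx = (sqrt(pi * n) - 1) / sqrt(pi * n)
    assert abs(exact - approx) < 1 / n
```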

Our final set of results concerns the so-called arcsine laws, which show that certain suitably normalized random walk statistics have limiting distributions that can be described using the arcsine function.

These theorems are typically stated in terms of Brownian motion, which arises as a scaling limit of random walk.

We first consider the arcsine law associated with

L2n = max{0 ≤ m ≤ 2n : Sm = 0},

the time of the last visit to zero up to time 2n.

We begin with the following simple lemma.

Lemma 19.3. Let u2m = P (S2m = 0). Then P (L2n = 2k) = u2k u2n−2k for k = 0, 1, ..., n.

Proof. Using Lemma 19.2, we compute

P (L2n = 2k) = P (S2k = 0, S2k+1 ≠ 0, ..., S2n ≠ 0)

= P (S2k = 0, X2k+1 ≠ 0, ..., X2k+1 + ... + X2n ≠ 0)

= P (S2k = 0) P (X2k+1 ≠ 0, ..., X2k+1 + ... + X2n ≠ 0)

= P (S2k = 0) P (S1 ≠ 0, ..., S2n−2k ≠ 0) = u2k u2n−2k. □
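As with the ballot problem, the identity can be confirmed by enumerating all paths for small n; the helper names here are my own:

```python
from itertools import product
from math import comb

def u(two_m):
    return comb(two_m, two_m // 2) / 2**two_m  # u_{2m} = P(S_{2m} = 0)

def last_zero_pmf(two_n):
    """P(L_{2n} = m) for m = 0, ..., 2n, by enumerating all paths."""
    counts = [0] * (two_n + 1)
    for steps in product([-1, 1], repeat=two_n):
        s, last = 0, 0
        for i, x in enumerate(steps, start=1):
            s += x
            if s == 0:
                last = i
        counts[last] += 1
    return [c / 2**two_n for c in counts]

n = 4
pmf = last_zero_pmf(2 * n)
for k in range(n + 1):
    assert abs(pmf[2 * k] - u(2 * k) * u(2 * n - 2 * k)) < 1e-12
```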

From here, it is a small step to deduce the limit law.

Theorem 19.2. For 0 < a < b < 1,

P (a ≤ L2n/2n ≤ b) → ∫_a^b dx/(π√(x(1 − x))).

Proof. We first note that

nP (L2n = 2k) = n·u2k·u2(n−k) ≈ n · (1/√(πk)) · (1/√(π(n−k))) = 1/(π√((nk − k²)/n²)) = 1/(π√((k/n)(1 − k/n))),

so if k/n → x, then

nP (L2n = 2k) → 1/(π√(x(1 − x))).

Now define an and bn so that 2nan is the smallest even integer greater than or equal to 2na and 2nbn is the largest even integer less than or equal to 2nb.


Setting fn(x) = nP (L2n = 2k) for k/n ≤ x < (k+1)/n, we see that

P (a ≤ L2n/2n ≤ b) = P (2nan ≤ L2n ≤ 2nbn) = ∑_{k=n·an}^{n·bn} nP (L2n = 2k) · (1/n) = ∫_{an}^{bn + 1/n} fn(x) dx.

Moreover, our work above shows that fn(x) → f(x) = (1/(π√(x(1 − x))))·1_{(0,1)}(x) uniformly on compact sets, so

sup_{an ≤ x ≤ bn + 1/n} fn(x) → sup_{a ≤ x ≤ b} f(x) < ∞

for any 0 < a < b < 1, thus we can apply the bounded convergence theorem to conclude

P (a ≤ L2n/2n ≤ b) = ∫_{an}^{bn + 1/n} fn(x) dx → ∫_a^b f(x) dx. □


To see the reason for the name, observe that the substitution y = √x yields

∫_a^b dx/(π√(x(1 − x))) = (2/π) ∫_{√a}^{√b} dy/√(1 − y²) = (2/π)(sin⁻¹(√b) − sin⁻¹(√a)).

Note that P (L2n/2n ≤ 1/2) → (2/π) sin⁻¹(1/√2) = 1/2. (This symmetry is also apparent in the mass function derived in Lemma 19.3.)
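Numerically, the arcsine limit is already accurate for moderate n. The following sketch (parameter choices are mine) compares the exact probability from Lemma 19.3 with the limiting integral; for a = 1/4, b = 3/4 the limit is exactly 1/3:

```python
from math import comb, asin, sqrt, pi

def u(two_m):
    return comb(two_m, two_m // 2) / 2**two_m  # u_{2m} = P(S_{2m} = 0)

n = 500
a, b = 0.25, 0.75
# Exact P(a <= L_{2n}/(2n) <= b) via the mass function of Lemma 19.3
exact = sum(u(2 * k) * u(2 * n - 2 * k)
            for k in range(n + 1) if a <= k / n <= b)
# Arcsine limit: (2/pi)(arcsin(sqrt(b)) - arcsin(sqrt(a)))
limit = (2 / pi) * (asin(sqrt(b)) - asin(sqrt(a)))
assert abs(exact - limit) < 0.01
```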

An amusing consequence is that if two people were to bet $1 on a coin flip every day of the year, then with probability approximately 1/2, one player would remain consistently ahead from July 1st onwards. In other words, if you were to play this game against me, then the probability that you would be in the lead for the first two days is about the same as the probability that you would be in the lead for the entire second half of the year!

Finally, note that the proof of Theorem 19.2 shows that any statistic Tn satisfying P (T2n = 2k) = u2k u2n−2k for k = 0, 1, ..., n obeys the asymptotic arcsine law

P (a ≤ T2n/2n ≤ b) → ∫_a^b dx/(π√(x(1 − x))), 0 < a < b < 1.

Theorem 19.3. Let π2n be the number of line segments from (k − 1, Sk−1) to (k, Sk), 1 ≤ k ≤ 2n, that lie above the x-axis, and let um = P (Sm = 0). Then

P (π2n = 2k) = u2k u2n−2k.

Proof. Write β2k,2n = P (π2n = 2k). We will proceed by (strong) induction.

When n = 1, it is clear that

β0,2 = β2,2 = 1/2 = u0 u2.

(After two steps, the walk has either been always nonpositive or always nonnegative, each being equally likely, and of the four equiprobable two-step paths, half end up at 0.)


For a general n, the proof of Lemma 19.2 shows that

(1/2) u2n u0 = (1/2) u2n = P (S1 > 0, ..., S2n > 0)

= P (S1 = 1, S2 − S1 ≥ 0, ..., S2n − S1 ≥ 0)

= (1/2) P (S1 ≥ 0, ..., S2n−1 ≥ 0)

= (1/2) P (S1 ≥ 0, ..., S2n ≥ 0) = (1/2) β2n,2n,

where the penultimate equality is due to the fact that S2n−1 ≥ 0 implies S2n−1 ≥ 1 (so that S2n ≥ 0 automatically).

This proves the result for k = n, and since β0,2n = β2n,2n (replacing Sn with −Sn shows that always

nonnegative is as likely as always nonpositive), we see that the result also holds for k = 0.
Suppose now that 1 ≤ k ≤ n − 1. Let R be the time of the first return to 0 (so that R = 2m with 0 < m < n) and write f2m = P (R = 2m). Breaking things up according to whether the first excursion was on the positive or negative side gives
positive or negative side gives

β2k,2n = (1/2) ∑_{m=1}^{k} f2m β2k−2m,2n−2m + (1/2) ∑_{m=1}^{n−k} f2m β2k,2n−2m

= (1/2) ∑_{m=1}^{k} f2m u2k−2m u2n−2k + (1/2) ∑_{m=1}^{n−k} f2m u2k u2n−2m−2k

= (1/2) u2n−2k ∑_{m=1}^{k} f2m u2k−2m + (1/2) u2k ∑_{m=1}^{n−k} f2m u2n−2m−2k,

where the second equality used the inductive hypothesis.

Since

u2k = ∑_{m=1}^{k} f2m u2k−2m,   u2n−2k = ∑_{m=1}^{n−k} f2m u2n−2k−2m

(by considering the time of the first return to 0), we have

β2k,2n = (1/2) u2n−2k u2k + (1/2) u2k u2n−2k = u2k u2n−2k

and the result follows from the principle of induction. □
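Theorem 19.3 can likewise be verified by exhaustive enumeration for small n. In the sketch below (helper names mine), a segment is counted as lying above the axis precisely when at least one of its endpoints is strictly positive, which for ±1 steps matches the geometric description in the statement:

```python
from itertools import product
from math import comb

def u(two_m):
    return comb(two_m, two_m // 2) / 2**two_m  # u_{2m} = P(S_{2m} = 0)

def time_above_pmf(two_n):
    """P(pi_{2n} = j): distribution of the number of segments above the axis."""
    counts = [0] * (two_n + 1)
    for steps in product([-1, 1], repeat=two_n):
        s, above = 0, 0
        for x in steps:
            prev, s = s, s + x
            # Steps have size 1, so a segment cannot strictly straddle the axis:
            # it lies above exactly when one of its endpoints is positive.
            if prev > 0 or s > 0:
                above += 1
        counts[above] += 1
    return [c / 2**two_n for c in counts]

n = 3
pmf = time_above_pmf(2 * n)
for k in range(n + 1):
    assert abs(pmf[2 * k] - u(2 * k) * u(2 * n - 2 * k)) < 1e-12
assert all(pmf[j] == 0 for j in range(1, 2 * n, 2))  # pi_{2n} is always even
```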

Since f(x) = 1/(π√(x(1 − x))) has a minimum at x = 1/2 and goes to ∞ as x → 0, 1, Theorem 19.3 shows that an equal division of steps above and below the axis is least likely, and completely one-sided divisions have the greatest probability.