6710 Notes
JOHN PIKE
These lecture notes were written for MATH 6710 at Cornell University in the Fall semester of 2013. They were revised in the Fall of 2015 and the schedule on the following page reflects that semester. These notes are for personal educational use only and are not to be published or redistributed. Almost all of the material, much of the structure, and some of the language comes directly from the course text, Probability: Theory and Examples by Rick Durrett. Gerald Folland's Real Analysis: Modern Techniques and Their Applications is the source of most of the auxiliary measure theory details. There are likely typos and mistakes in these notes. All such errors are the author's fault and corrections are greatly appreciated.
Day 1: Finished Section 1
1. Introduction
Probability Spaces.
A probability space is a measure space (Ω, F, P ) with P (Ω) = 1.
The sample space Ω can be any set, and it can be thought of as the collection of all possible outcomes of
some experiment or all possible states of some system. Elements of Ω are referred to as elementary outcomes.
The σ-algebra F is a collection of subsets of Ω satisfying:
1) F is nonempty
2) E ∈ F implies E^C ∈ F
3) For any countable collection {E_i}_{i∈I} ⊆ F, ∪_{i∈I} E_i ∈ F.
Elements of F are called events, and can be regarded as sets of elementary outcomes about which one can say
something meaningful. Before the experiment has occurred (or the observation has been made), a meaningful
statement about E∈F is P (E). Afterward, a meaningful statement is whether or not E occurred.
The interpretation is that P(A) represents the chance that event A occurs (though there is no general consensus about what chance really means).
If p is some property and A = {ω ∈ Ω : p(ω) is true} is such that P(A) = 1, then we say that p holds almost surely, or a.s. for short. This is equivalent to almost everywhere in measure theory. Note that it is possible to have an event E ∈ F with E ≠ ∅ and P(E) = 0. Thus, for instance, there is a distinction between impossible and with probability zero, as discussed in Example 1.3 below.
Example 1.2. Flipping a (possibly biased) coin: Ω = {H, T}, F = 2^Ω = {∅, {H}, {T}, {H, T}}, P satisfies P({H}) = p and P({T}) = 1 − p for some p ∈ [0, 1].
Example 1.3. Random point in the unit interval: Ω = [0, 1], F = B_[0,1] = Borel sets, P = Lebesgue measure. The experiment here is to pick a real number between 0 and 1 uniformly at random. Generally speaking, uniformity corresponds to translation invariance, which is the primary defining property of Lebesgue measure. Observe that each outcome x ∈ [0, 1] has P({x}) = 0, so the experiment must result in the realization of an outcome having probability zero.
Example 1.4. Standard normal distribution: Ω = R, F = B, P(E) = (1/√(2π)) ∫_E e^{−x²/2} dx.
When the sample space is countably infinite (Example 1.5), or finite but the outcomes are not necessarily equally likely (Example 1.2), one can speak of probabilities in terms of weighted outcomes by taking a function p : Ω → [0, 1] with Σ_{ω∈Ω} p(ω) = 1 and setting P(E) = Σ_{ω∈E} p(ω).
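This weighted-outcome recipe is easy to make concrete on a computer. The following sketch (my own illustration; the biased-coin weights are made up) builds P directly from a mass function p on a finite sample space:

```python
# Build a probability measure P from a probability mass function p on a
# finite sample space, via P(E) = sum of p(w) for w in E.
def make_measure(p):
    assert abs(sum(p.values()) - 1.0) < 1e-12  # weights must sum to 1
    return lambda E: sum(p[w] for w in E)

# A biased coin (illustrative weights): P({H}) = 0.7, P({T}) = 0.3.
p = {"H": 0.7, "T": 0.3}
P = make_measure(p)
heads = P({"H"})             # 0.7
everything = P({"H", "T"})   # P(Omega) = 1.0
```

Countable additivity is automatic here since P(E) is a sum of the weights of the elementary outcomes in E.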
For most practical purposes, this can be generalized to the case where Ω ⊆ R by taking a weighting function f : Ω → [0, ∞) with ∫_Ω f(x)dx = 1 and setting P(E) = ∫_E f(x)dx (Examples 1.3 and 1.4), but one must be careful since the integral is not defined for all sets E (e.g. Vitali sets*).
Those who have taken undergraduate probability will recognize p and f as p.m.f.s and p.d.f.s, respectively. In measure theoretic terms, f = dP/dm is the Radon-Nikodym derivative of P with respect to Lebesgue measure, m. Similarly, p = dP/dc where c is counting measure on Ω.
Measure theory provides a unifying framework in which these ideas can be made rigorous, and it enables us to treat discrete and continuous random variables (as well as mixtures and more exotic examples) on an equal footing.
Also, note that in the formal axiomatic construction of probability as a measure space with total mass 1, there is absolutely no mention of chance or randomness, so we can use probability without worrying about the philosophical issues such notions raise.
Definition. Given a measurable space (S, G), an (S, G)-valued random variable is a measurable function X : Ω → S. In this class, the unqualified term random variable will refer to the case (S, G) = (R, B).
We typically think of X as an observable, or a measurement to be taken after the experiment has been
performed.
An extremely useful example is given by taking any A ∈ F and defining the indicator function
1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 if ω ∈ A^C.
Note that if (Ω, F, P) is a probability space and X is an (S, G)-valued random variable, then X induces the pushforward probability measure µ = P ∘ X^{-1} on (S, G). Frequently, we will abuse notation and write P(X ∈ E) for µ(E) = P(X^{-1}(E)).
X also induces the sub-σ-algebra σ(X) = {X^{-1}(E) : E ∈ G} ⊆ F. If we think of Ω as the possible outcomes of an experiment and X as a measurement to be performed, then σ(X) represents the information we can obtain about the outcome by observing X.
In contrast to other areas of measure theory, in probability we are often interested in various sub-σ-algebras F_0 ⊆ F, which we think of in terms of information content.
For instance, if the experiment is rolling a six-sided die (Example 1.1), then F0 = {Ø, {1, 3, 5}, {2, 4, 6}, Ω}
represents the information concerning the parity of the value rolled.
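On a finite sample space, σ(X) can be computed by brute force as the collection of all unions of level sets {ω : X(ω) = s}. A small sketch (my own illustration) recovers the parity σ-algebra F_0 above:

```python
from itertools import combinations

def sigma_generated_by(X, omega):
    """sigma(X) on a finite sample space: all unions of the level sets
    X^{-1}({s}) = {w : X(w) = s} (the empty union gives the empty set)."""
    levels = {}
    for w in omega:
        levels.setdefault(X(w), set()).add(w)
    blocks = list(levels.values())
    events = set()
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            events.add(frozenset().union(*combo))
    return events

omega = {1, 2, 3, 4, 5, 6}
F0 = sigma_generated_by(lambda w: w % 2, omega)   # X = parity of the roll
# F0 = { {}, {1,3,5}, {2,4,6}, {1,2,3,4,5,6} }
```

A coarser X (fewer level sets) yields a smaller σ-algebra, matching the intuition that σ(X) encodes exactly the information X reveals.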
The expectation (or mean or expected value) of a real-valued random variable X on (Ω, F, P) is defined as E[X] = ∫_Ω X(ω)dP(ω) whenever the integral is well-defined.
Expectation is generally interpreted as a weighted average which gives the best guess for the value of the
random quantity X.
We will study random variables and their expectations in greater detail soon. For now, the point is that many familiar objects from undergraduate probability can be rigorously and simply defined using the language of measure theory.
That said, it should be emphasized that probability is not just the study of measure spaces with total mass 1. As useful and necessary as the rigorous measure theoretic foundations are, it is equally important to cultivate a probabilistic way of thinking whereby one conceptualizes problems in terms of coin tossing, card shuffling, and the like.
* An example of a subset of [0, 1] which has no well-defined Lebesgue measure is given by the following construction: declare x ~ y if x − y ∈ Q, and use the axiom of choice to select a set V containing exactly one representative from each equivalence class. Translation invariance and countable additivity then show that no value of the measure of V is consistent. The construction requires some form of the axiom of choice (such as the Boolean prime ideal theorem), but it has been shown that the existence of non-measurable sets cannot be established in ZF alone.
In three or more dimensions, the Banach-Tarski paradox shows that in ZFC, there is no finitely additive measure defined on all subsets of Euclidean space which is invariant under translation and rotation. (The paradox is that one can cut a unit ball into five pieces and reassemble them using only rigid motions to obtain two disjoint copies of the original ball.)
2. Preliminary Results
At this point, we need to establish some fundamental facts about probability measures and σ -algebras in
preparation for a discussion of probability distributions and to reacquaint ourselves with the style of measure
theoretic arguments.
Probability Measures.
The following simple facts are extremely useful and will be employed frequently throughout this course. For any probability space (Ω, F, P):
(i) P(A^C) = 1 − P(A) for all A ∈ F.
(ii) Monotonicity: If A ⊆ B, then P(A) ≤ P(B).
(iii) Subadditivity: For any countable collection {E_i}_{i=1}^∞ ⊆ F, P(∪_{i=1}^∞ E_i) ≤ Σ_{i=1}^∞ P(E_i).
(iv) Continuity from below: If A_n ↑ A (i.e. A_1 ⊆ A_2 ⊆ ... and A = ∪_{i=1}^∞ A_i), then P(A_n) → P(A).
(v) Continuity from above: If A_n ↓ A (i.e. A_1 ⊇ A_2 ⊇ ... and A = ∩_{i=1}^∞ A_i), then P(A_n) → P(A).
Proof.
For (i), 1 = P(Ω) = P(A ⊔ A^C) = P(A) + P(A^C) by countable additivity.
For (ii), P(B) = P(A) + P(B \ A) ≥ P(A).
For (iii), set F_1 = E_1 and F_i = E_i \ (∪_{j<i} E_j) for i > 1. The F_i are disjoint with F_i ⊆ E_i and ∪_{i=1}^∞ F_i = ∪_{i=1}^∞ E_i, so
P(∪_{i=1}^∞ E_i) = P(∪_{i=1}^∞ F_i) = Σ_{i=1}^∞ P(F_i) ≤ Σ_{i=1}^∞ P(E_i).
For (iv), set B_1 = A_1 and B_i = A_i \ A_{i−1} for i > 1, and note that the B_i's are disjoint with ∪_{i=1}^n B_i = A_n and ∪_{i=1}^∞ B_i = A. Then
P(A) = P(∪_{i=1}^∞ B_i) = Σ_{i=1}^∞ P(B_i) = lim_{n→∞} Σ_{i=1}^n P(B_i) = lim_{n→∞} P(∪_{i=1}^n B_i) = lim_{n→∞} P(A_n).
For (v), if A_1 ⊇ A_2 ⊇ ... and A = ∩_{i=1}^∞ A_i, then A_1^C ⊆ A_2^C ⊆ ... and A^C = (∩_{i=1}^∞ A_i)^C = ∪_{i=1}^∞ A_i^C, so it follows from (i) and (iv) that P(A_n) = 1 − P(A_n^C) → 1 − P(A^C) = P(A). ∎
Note that (ii)-(iv) hold for any measure space (S, G, ν), (v) is true for arbitrary measure spaces under the assumption that there is some A_i with ν(A_i) < ∞, and (i) holds for all finite measures upon replacing 1 with ν(S).
Sigma Algebras.
We now review some basic facts about σ-algebras. Our first observation is
Proposition 2.1. For any index set I , if {Fi }i∈I is a collection of σ-algebras on Ω, then so is ∩i∈I Fi .
It follows easily from Proposition 2.1 that for any collection of sets A ⊆ 2Ω , there is a smallest σ -algebra
containing A - namely, the intersection of all σ -algebras containing A.
This is called the σ -algebra generated by A and is denoted by σ(A).
Note that if F is a σ -algebra and A ⊆ F, then σ(A) ⊆ F .
An important class of examples are the Borel σ -algebras: If (X, T ) is a topological space, then BX = σ(T )
is called the Borel σ-algebra. It is worth recalling that for the standard topology on R, the Borel sets are generated by the open intervals, and equally well by the closed intervals, the half-open intervals, the open rays, or the closed rays.
Our main technical result about σ-algebras is Dynkin's π-λ Theorem, an extremely useful result which is often omitted in courses on measure theory. To state the result, we will need the following definitions.
Definition. A collection P ⊆ 2^Ω is a π-system if it is closed under finite intersections.
Definition. A collection L ⊆ 2^Ω is a λ-system if
(1) Ω ∈ L
(2) If A, B ∈ L and A ⊆ B, then B \ A ∈ L
(3) If A_n ∈ L with A_n ↑ A, then A ∈ L.
Theorem (Dynkin's π-λ Theorem). If P is a π-system and L is a λ-system with P ⊆ L, then σ(P) ⊆ L.
Proof. We begin by observing that the intersection of any number of λ-systems is a λ-system, so for any collection A, there is a smallest λ-system `(A) containing A. Thus it will suffice to show
a) a λ-system which is closed under pairwise intersection is a σ-algebra,
b) `(P) is closed under pairwise intersection,
c) for every A ∈ `(P), the collection G_A = {B : A ∩ B ∈ `(P)} is a λ-system,
for then `(P) is a σ-algebra containing P, whence σ(P) ⊆ `(P) ⊆ L.
For a), note that such a collection contains Ω, is closed under complements (since A^C = Ω \ A), and is closed under countable unions (since ∪_{n=1}^∞ A_n is the increasing limit of the finite unions ∪_{k=1}^n A_k, and finite unions can be expressed using complements and intersections).
To establish c), let A be an arbitrary member of `(P). Then A = Ω ∩ A ∈ `(P), so Ω ∈ G_A. Also, for any B_1, B_2 ∈ G_A with B_1 ⊆ B_2, we have (B_2 \ B_1) ∩ A = (B_2 ∩ A) \ (B_1 ∩ A) ∈ `(P) since `(P) is a λ-system, hence G_A is closed under subset differences. Finally, for any sequence {B_n} in G_A with B_n ↑ B, we have (B_n ∩ A) ↑ (B ∩ A), so B ∩ A ∈ `(P) and B ∈ G_A. Therefore, G_A contains Ω and is closed under subset differences and increasing limits, thus is a λ-system.
It remains only to show that c) implies b). To see that this is the case, first note that since P is a π-system, P ⊆ G_A for every A ∈ P, so it follows from c) and the minimality of `(P) that `(P) ⊆ G_A for every A ∈ P. In particular, for any A ∈ P and B ∈ `(P), we have A ∩ B ∈ `(P). Interchanging A and B yields: A ∈ `(P) and B ∈ P implies A ∩ B ∈ `(P). But this means that if A ∈ `(P), then P ⊆ G_A, and thus c) implies that `(P) ⊆ G_A. Therefore, A ∩ B ∈ `(P) for all A, B ∈ `(P), which is precisely b). ∎
It is not especially important to commit the details of this proof to memory, but it is worth seeing once and you should definitely know the statement of the theorem. Though it seems a bit obscure upon first encounter, we will use this result in a variety of contexts throughout the course. In typical applications, we show that a property holds on a π-system that we know generates the σ-algebra of interest. We then show that the collection of all sets for which the property holds is a λ-system in order to conclude that the property holds on the entire σ-algebra.
A related result which is probably more familiar to those who have taken measure theory is the monotone class lemma.
Definition. A monotone class is a collection of subsets which is closed under countable increasing unions and countable decreasing intersections.
Like π -systems, λ-systems, and σ -algebras, the intersection of monotone classes is a monotone class, so it
makes sense to talk about the monotone class generated by a collection of subsets.
Lemma 2.1 (Monotone Class Lemma). If A is an algebra of subsets, then the monotone class generated by
A is σ(A).
Note that σ -algebras are λ-systems and λ-systems are monotone classes, but the converses need not be true.
3. Distributions
Recall that a random variable on a probability space (Ω, F, P) is a function X : Ω → R that is measurable with respect to F and the Borel σ-algebra B - that is, X^{-1}(A) = {ω : X(ω) ∈ A} ∈ F for all A ∈ B. Every random variable X induces a probability measure µ on (R, B), called its distribution, given by µ(A) = P(X^{-1}(A)) for all A ∈ B.
To check that µ is a probability measure, note that since X is a function, if A_1, A_2, ... ∈ B are disjoint, then the preimages X^{-1}(A_1), X^{-1}(A_2), ... are disjoint as well, so countable additivity of P gives µ(∪_i A_i) = P(∪_i X^{-1}(A_i)) = Σ_i P(X^{-1}(A_i)) = Σ_i µ(A_i), and µ(R) = P(Ω) = 1.
The distribution of a random variable X is usually described in terms of its distribution function
F (x) = P (X ≤ x) = µ ((−∞, x]) .
In cases where confusion may arise, we will emphasize dependence on the random variable using subscripts
- i.e. µX , FX .
Theorem 3.1. The distribution function F(x) = P(X ≤ x) of a random variable X satisfies
(i) F is nondecreasing,
(ii) F is right-continuous,
(iii) lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1,
(iv) F(x−) := lim_{y↑x} F(y) = P(X < x), so P(X = x) = F(x) − F(x−).
Proof.
For (i), note that if x ≤ y, then {X ≤ x} ⊆ {X ≤ y}, so F(x) = P(X ≤ x) ≤ P(X ≤ y) = F(y) by monotonicity.
For (ii), observe that if x ↓ a, then {X ≤ x} ↓ {X ≤ a}, and apply continuity from above. Properties (iii) and (iv) follow from similar applications of continuity from above and below.
In fact, the first three properties in Theorem 3.1 are sufficient to characterize a distribution function.
Theorem 3.2. If F : R → R satisfies properties (i), (ii), and (iii) from Theorem 3.1, then it is the distribution function of some random variable.
Proof. Take (Ω, F, P) = ((0, 1), B_(0,1), m), where m is Lebesgue measure, and define X(ω) = sup{y : F(y) < ω}. The claim is that
{ω : X(ω) ≤ x} = {ω : ω ≤ F(x)}, hence P(X ≤ x) = m({ω ∈ (0, 1) : ω ≤ F(x)}) = F(x),
where the final equality uses the definition of Lebesgue measure and the fact that F(x) ∈ [0, 1].
To see that {ω : ω ≤ F(x)} ⊆ {ω : X(ω) ≤ x}, note that if ω ≤ F(x), then x ∉ {y : F(y) < ω}, so X(ω) = sup{y : F(y) < ω} ≤ x since F is nondecreasing.
To establish the reverse inclusion, note that if ω > F(x), then properties (i) and (ii) imply that there is an ε > 0 with F(x + ε) < ω, so X(ω) ≥ x + ε > x. ∎
Theorem 3.2 shows that any function satisfying properties (i) - (iii) gives rise to a random variable X, and
thus to a probability measure µ, the distribution of X. The following result shows that the measure is
uniquely determined.
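The proof of Theorem 3.2 is constructive: X(ω) = sup{y : F(y) < ω} works. A numerical sketch of this inverse-distribution-function construction (my own illustration, using the Exp(1) distribution function as an arbitrary concrete choice and bisection in place of an exact sup):

```python
import math
import random

F = lambda x: 1 - math.exp(-x) if x > 0 else 0.0   # Exp(1) distribution function

def X(omega, lo=-1.0, hi=50.0, iters=100):
    """Approximate sup{y : F(y) < omega} by bisection (valid since F is
    nondecreasing and omega lies in (0, 1))."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if F(mid) < omega:
            lo = mid
        else:
            hi = mid
    return lo

random.seed(0)
samples = [X(random.random()) for _ in range(20000)]
# Empirical check: P(X <= 1) should be close to F(1) = 1 - 1/e
frac = sum(s <= 1 for s in samples) / len(samples)
```

Feeding uniform ω's through X reproduces the target distribution, which is exactly the content of the theorem (and the standard "inverse transform" sampling method).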
Theorem 3.3. If F is a function satisfying (i)-(iii) in Theorem 3.1, then there is a unique probability measure µ on (R, B) with µ((−∞, x]) = F(x) for all x ∈ R.
Proof. Theorem 3.2 gives the existence of a random variable X with distribution function F, and the distribution µ of X satisfies µ((−∞, x]) = F(x) for all x ∈ R. For uniqueness, suppose that µ and ν are probability measures with µ((−∞, x]) = F(x) = ν((−∞, x]) for all x ∈ R, and define
P = {(−∞, a] : a ∈ R},
L = {A ∈ B : µ(A) = ν(A)}.
Then P is a π-system with σ(P) = B, and countable additivity shows that L is a λ-system containing P, so Dynkin's π-λ theorem gives B = σ(P) ⊆ L, i.e. µ = ν. ∎
Conversely, any probability measure µ on (R, B) defines a function F(x) = µ((−∞, x]) satisfying properties (i)-(iii) in Theorem 3.1, and every such function uniquely determines a probability measure.
Consequently, it is equivalent to give the distribution or the distribution function of a random variable. However, one should be aware that distributions/distribution functions do not determine random variables.
For example, if X is uniform on [−1, 1] (so that µ_X = (1/2)m|_[−1,1]), then −X also has distribution µ_X, but −X ≠ X almost surely.
When two random variables X and Y have the same distribution function, we say that they are equal in distribution and write X =_d Y.
Note that random variables can be equal in distribution even if they are defined on different probability spaces.
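A quick numerical illustration of the uniform example (my own sketch): X uniform on [−1, 1] and −X have identical distribution functions even though X ≠ −X pointwise.

```python
# For X uniform on [-1, 1], F_X(x) = (x + 1)/2 clipped to [0, 1].
F_X = lambda x: min(max((x + 1) / 2, 0.0), 1.0)

# Since X has a continuous distribution, P(-X <= x) = P(X >= -x) = 1 - F_X(-x).
F_negX = lambda x: 1.0 - F_X(-x)

grid = [i / 100 for i in range(-150, 151)]
max_gap = max(abs(F_X(x) - F_negX(x)) for x in grid)   # 0: equal in distribution
```

The distribution functions agree everywhere, yet X(ω) = −X(ω) only at ω with X(ω) = 0, an event of probability zero.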
Constructing Measures on R. (Brief Review)
It is worth mentioning that we kind of cheated in Theorem 3.2 since we assumed the existence of Lebesgue measure. In fact, the standard derivation of Lebesgue measure in terms of Stieltjes measure functions implies the results in Theorems 3.2 and 3.3. Presumably everyone has seen this argument before, and since it is fairly standard we only outline the main steps here.
Recall that an algebra of sets on S is a non-empty collection A ⊆ 2^S which is closed under complements and finite unions.
Definition. A premeasure on an algebra A is a function µ_0 : A → [0, ∞] such that
1) µ_0(∅) = 0
2) If A_1, A_2, ... is a sequence of disjoint sets in A whose union also belongs to A, then µ_0(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ_0(A_i).
(If µ_0(S) < ∞, then 2) implies that µ_0(∅) = µ_0(∅) + µ_0(∅), which implies 1).)
Definition. An outer measure on S is a function µ* : 2^S → [0, ∞] such that
i) µ*(∅) = 0
ii) µ*(A) ≤ µ*(B) if A ⊆ B
iii) µ*(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ µ*(A_i).
A set A ⊆ S is called µ*-measurable if
µ*(E) = µ*(E ∩ A) + µ*(E ∩ A^C)
for all E ⊆ S.
It can be shown that if µ_0 is a premeasure on the algebra A, then the set function defined by
µ*(E) = inf{ Σ_{i=1}^∞ µ_0(A_i) : A_i ∈ A, E ⊆ ∪_{i=1}^∞ A_i }
is an outer measure satisfying
a) µ*|_A = µ_0
b) Every set in A is µ*-measurable.
Theorem 3.4 (Carathéodory). If µ∗ is an outer measure on S , then the collection M of µ∗ -measurable sets
is a σ-algebra, and the restriction of µ∗ to M is a (complete) measure.
Finally, one can show that if µ_0 is σ-finite, then the measure µ = µ*|_M is the unique extension of µ_0 to M.
Using these ideas, one can construct Borel measures on R by taking any nondecreasing, right-continuous function F (called a Lebesgue-Stieltjes measure function) and defining a premeasure on the algebra
A = { ∪_{i=1}^n (a_i, b_i] : −∞ ≤ a_i ≤ b_i ≤ ∞, (a_i, b_i] ∩ (a_j, b_j] = ∅, n ∈ N }
by
µ_0(∅) = 0,  µ_0(⊔_{i=1}^n (a_i, b_i]) = Σ_{i=1}^n [F(b_i) − F(a_i)].
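A sketch of this premeasure in code (my own illustration; taking F(x) = x recovers the Lebesgue premeasure, while a right-continuous step in F produces an atom):

```python
def mu0(intervals, F):
    """Premeasure of a finite disjoint union of half-open intervals (a, b],
    given a nondecreasing right-continuous F: sum of F(b) - F(a)."""
    return sum(F(b) - F(a) for a, b in intervals)

F = lambda x: x                             # F(x) = x gives Lebesgue premeasure
total = mu0([(0.0, 0.5), (0.75, 1.0)], F)   # 0.5 + 0.25 = 0.75

# A jump in F places an atom at the jump point: here a unit mass at 0,
# picked up by any (a, b] with a < 0 <= b.
G = lambda x: 0.0 if x < 0 else 1.0         # right-continuous step at 0
atom = mu0([(-1.0, 0.0)], G)                # G(0) - G(-1) = 1.0
```

The half-open convention matters: right-continuity of F is exactly what makes the mass of the jump belong to intervals whose right endpoint is the jump location.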
The above construction is typical of how one builds premeasures on algebras from more elementary objects.
Definition. A semialgebra is a collection S of subsets of a set such that
i) A, B ∈ S implies A ∩ B ∈ S
ii) A ∈ S implies there exists a finite collection of disjoint sets A_1, ..., A_n ∈ S with A^C = ⊔_{i=1}^n A_i.
(Some authors require that S ∈ S as well and call the above a semiring. We will not worry about this distinction.)
An important example of a semialgebra on R is the collection of h-intervals - that is, sets of the form (a, b] or (a, ∞) or ∅ with −∞ ≤ a < b < ∞.
On R^d, the collection of products of h-intervals - e.g. (a_1, b_1] × ··· × (a_d, b_d] - is a semialgebra.
If S is a semialgebra, then one readily verifies that S̄ = {finite disjoint unions of sets in S} is an algebra (called the algebra generated by S). Note that this construction ensures that σ(S) = σ(S̄).
Given a semialgebra S and a function ν : S → [0, ∞) such that if A ∈ S is the disjoint union of A_1, ..., A_n ∈ S, then ν(A) = Σ_{i=1}^n ν(A_i), define ν̄ : S̄ → [0, ∞) by ν̄(⊔_{i=1}^m B_i) = Σ_{i=1}^m ν(B_i).
Alternatively, one can show that ν̄ is countably additive on S̄ if ν is countably subadditive on S - that is, for every countable disjoint collection {A_i}_{i∈I} ⊆ S such that ⊔_{i∈I} A_i ∈ S, one has ν(⊔_{i∈I} A_i) ≤ Σ_{i∈I} ν(A_i). (The implication is immediate if ν is countably additive on S.)
Thus if one takes a finitely additive [0, ∞)-valued function ν on a semialgebra S, extends it in the obvious way to the function ν̄ on the algebra S̄, and then checks that ν̄ is countably additive, then the Carathéodory construction yields a measure on σ(S) extending ν.
Definition. If µ and ν are measures on (S, G), then we say that ν is absolutely continuous with respect to µ (and write ν ≪ µ) if ν(A) = 0 for every A ∈ G with µ(A) = 0.
Definition. If µ and ν are measures on (S, G), then we say that µ and ν are mutually singular (and write µ ⊥ ν) if there exist E, F ∈ G with
i) E ∩ F = ∅
ii) E ∪ F = S
iii) µ(F) = 0 = ν(E).
A fundamental result in measure theory is the Lebesgue-Radon-Nikodym Theorem (which we state only for
positive measures).
Theorem 3.5 (Lebesgue-Radon-Nikodym). If µ and ν are σ-finite measures on (S, G), then there exist unique σ-finite measures λ, ρ on (S, G) such that
λ ⊥ µ, ρ ≪ µ, ν = λ + ρ.
Moreover, there is a measurable function f : S → [0, ∞) such that ρ(E) = ∫_E f dµ for all E ∈ G.
The function f from Theorem 3.5 is called the Radon-Nikodym derivative of ρ with respect to µ, and one writes f = dρ/dµ (or dρ = f dµ).
If ν is a finite measure, then λ and ρ are finite, so f is µ-integrable.
If a random variable X has distribution µ which is absolutely continuous with respect to Lebesgue measure, then f = dµ/dm is called the density of X, and F(x) = µ((−∞, x]) = ∫_{−∞}^x f(t)dt. Conversely, if g : R → [0, ∞) is measurable with ∫ g dm = 1, then G(x) = ∫_{−∞}^x g(t)dt satisfies (i)-(iii) in Theorem 3.1, so Theorem 3.2 gives a random variable with density g. Rather than single out a particular construction, we have
Definition. If the distribution of X has a density, then we say that X is absolutely continuous.
The other class of random variables discussed in undergraduate probability are discrete random variables.
Definition. A measure µ is said to be discrete if there is a countable set S with µ(S^C) = 0. A random variable is said to be discrete if its distribution is discrete.
More generally, given any countable set S ⊂ R and any sequence of nonnegative numbers p_1, p_2, ... with Σ_{i=1}^∞ p_i = 1, if we enumerate S by S = {s_1, s_2, ...}, then the random variable X with P(X = s_i) = p_i and distribution function F(x) = Σ_{i=1}^∞ p_i 1_{[s_i,∞)}(x) is discrete, and indeed all discrete random variables are of this form. (Countable additivity implies that µ is determined by its values on singleton subsets of S.)
In the case S = Q and p_i > 0 for all i, we have a discrete random variable whose distribution function has a jump at every rational, hence is discontinuous on a dense set.
If we think of summation as integration with respect to counting measure, then just as the absolutely continuous random variables correspond to densities (f ≥ 0 with ∫ f dm = 1), we see that the discrete random variables correspond to mass functions (p ≥ 0 with ∫ p dc = 1).
There is also a third fundamental class of random variables, which we almost never have to deal with, but mention for the sake of completeness. To describe it, we need another definition.
Definition. A measure µ on (R, B) is said to be continuous if µ({x}) = 0 for all x ∈ R.
By countable additivity, a discrete probability measure is not continuous and vice versa. Absolutely continuous distributions are continuous, but it is possible for a continuous distribution to be singular with respect to Lebesgue measure.
Definition. A random variable X with continuous distribution µ ⊥ m is called singular continuous.
An example is given by the uniform distribution on the Cantor set formed by taking [0, 1] and successively removing the open middle third of all remaining intervals. The distribution function is the Cantor function given by F(x) = 1/2 for x ∈ [1/3, 2/3], F(x) = 1/4 for x ∈ [1/9, 2/9], F(x) = 3/4 for x ∈ [7/9, 8/9], etc.
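The Cantor function can be evaluated from the ternary expansion: halve each ternary digit 0 or 2 into a binary digit, and stop after emitting 1 the first time a ternary digit 1 appears. A minimal sketch (my own; the scan depth and edge-case handling are arbitrary choices):

```python
def cantor_F(x, depth=48):
    """Cantor function on [0, 1]: scan ternary digits of x, turning digits
    0/2 into binary digits 0/1; on the first ternary digit 1, emit a final
    binary 1 and stop (x lies in a removed middle third)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    value, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)
        x -= digit
        if digit == 1:
            return value + scale
        value += scale * (digit // 2)
        scale /= 2
    return value

# Matches the plateaus above: F = 1/2 on [1/3, 2/3], 1/4 on [1/9, 2/9],
# 3/4 on [7/9, 8/9], etc.
```

The function is constant on every removed middle third, which is why the resulting distribution assigns all of its mass to the Cantor set yet has no atoms.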
Analogous to the singular/absolutely continuous decomposition in Theorem 3.5, we have the following decomposition into discrete and continuous parts.
Theorem 3.6. If µ is a σ-finite measure on (R, B), then µ = µ_d + µ_c where µ_d is discrete and µ_c is continuous.
Proof sketch: σ-finiteness implies that the set E = {x : µ({x}) > 0} is countable, and the result follows by defining µ_d(A) = µ(A ∩ E), µ_c(A) = µ(A ∩ E^C).
Thus if µ is a probability distribution, then it follows from the Radon-Nikodym Theorem that µ = µ_ac + µ_s where µ_ac ≪ m and µ_s ⊥ m. By Theorem 3.6, µ_s = µ_d + µ_sc where µ_d is discrete and µ_sc is singular continuous. Since µ is a probability measure, each of µ_ac, µ_d, µ_sc is finite and thus is identically zero or a positive multiple of a probability measure. We have proved
Theorem 3.7. Every distribution is a convex combination of an absolutely continuous distribution, a discrete distribution, and a singular continuous distribution.
Remark. Theorem 3.7 is not especially useful in practice. Rather, we mention these facts because so many introductory texts make a big deal about distinguishing between discrete and continuous random variables. There are certainly important practical differences between the two, and it is worth knowing that more pathological examples exist as well. However, one of the advantages of the measure theoretic approach is a more unified perspective, and excessive focus on differences in detail can sometimes obscure the bigger picture.
4. Random Variables
In general, a random variable is a measurable function from (Ω, F) to some measurable space (S, G), but we
have agreed to reserve the unqualied term for the case (S, G) = (R, B).
If (S, G) = (R^d, B^d), we will say that X is a random vector.
We now collect some results that will help us establish that various quantities of interest are indeed random
variables.
Through a slight abuse of notation, we sometimes write X∈F to indicate that X is (F -B)-measurable.
Theorem 4.1. If A generates G (in the sense that G is the smallest σ-algebra containing A) and
X −1 (A) = {ω ∈ Ω : X(ω) ∈ A} ∈ F for all A ∈ A, then X is measurable.
The fact that inverses commute with set operations also shows that for any function X : Ω → S, if G is a σ-algebra on S, then {X^{-1}(A) : A ∈ G} is a σ-algebra on Ω.
Proof. (Homework)
Example 4.1. If (S, G) = (R, B), some useful generating sets are the open sets, the open intervals, and the rays {(−∞, x] : x ∈ R}.
More generally, given an indexed collection of measurable spaces {(S_α, G_α)}_{α∈A}, the product σ-algebra ⊗_{α∈A} G_α on S = Π_{α∈A} S_α is generated by {π_α^{-1}(G_α) : G_α ∈ G_α, α ∈ A}, where π_α : S → S_α is projection onto the α coordinate.
In other words, the product σ-algebra is the smallest σ-algebra for which the projections are measurable.
This is because we want a function taking values in the product space to be measurable precisely when its coordinate functions are all measurable.
This is analogous to the denition of the product topology as the initial topology with respect to the
coordinate projections.
Proposition 4.2. If A is countable, then ⊗_{α∈A} G_α is generated by the rectangles {Π_{α∈A} G_α : G_α ∈ G_α}. Moreover, if G_α = σ(E_α) for each α ∈ A, then ⊗_{α∈A} G_α is generated by {Π_{α∈A} E_α : E_α ∈ E_α}.
Proof. Each generator π_α^{-1}(G_α) = G_α × Π_{β≠α} S_β is itself a rectangle, so
σ({π_α^{-1}(G_α) : G_α ∈ G_α, α ∈ A}) ⊆ σ({Π_{α∈A} G_α : G_α ∈ G_α}).
On the other hand, Π_{α∈A} G_α = ∩_{α∈A} π_α^{-1}(G_α) is a countable intersection of generators, so
σ({Π_{α∈A} G_α : G_α ∈ G_α}) ⊆ σ({π_α^{-1}(G_α) : G_α ∈ G_α, α ∈ A}).
The second statement will follow from the above argument once we show that ⊗_{α∈A} G_α is generated by {π_α^{-1}(E_α) : E_α ∈ E_α, α ∈ A}.
Because the product and metric topologies coincide for R^n, Proposition 4.2 justifies Example 4.2. (In general, one can show that if S_1, ..., S_n are separable metric spaces and S = Π_{i=1}^n S_i is equipped with the product metric, then ⊗_{i=1}^n B_{S_i} = B_S.)
Theorem 4.2. If X : (Ω, F) → (S, G) and f : (S, G) → (T, E) are measurable maps, then f(X) : (Ω, F) → (T, E) is measurable.
Proof. For any E ∈ E, measurability of f gives f^{-1}(E) ∈ G, so {ω : f(X(ω)) ∈ E} = X^{-1}(f^{-1}(E)) ∈ F since X is measurable.
Theorem 4.2 is the familiar statement that compositions of measurable maps are measurable. Thus if f : R → R is measurable (e.g. if f is any continuous function) and X is a random variable, then f(X) is a random variable as well.
Theorem 4.3. If X1 , ..., Xn are random variables and f : (Rn , B n ) → (R, B) is measurable, then f (X1 , ..., Xn )
is a random variable.
Proof. In light of Theorem 4.2, it suffices to show that (X_1, ..., X_n) is a random vector. To this end, observe that for any A_1, ..., A_n ∈ B,
{(X_1, ..., X_n) ∈ A_1 × ··· × A_n} = ∩_{i=1}^n {X_i ∈ A_i} ∈ F.
Since B^n is generated by {A_1 × ··· × A_n : A_i ∈ B}, the result follows from Theorem 4.1.
Corollary 4.1. If X_1, ..., X_n are random variables, then so are S_n = Σ_{i=1}^n X_i and V_n = Π_{i=1}^n X_i.
It is sometimes convenient to allow random variables to assume the values ±∞, and we observe that almost all of our results generalize easily to (R̄, B̄) where R̄ = R ∪ {±∞} and B̄ = {E ⊆ R̄ : E ∩ R ∈ B}, which is easily checked to be a σ-algebra.
Theorem 4.4. If X_1, X_2, ... are random variables, then so are inf_n X_n, sup_n X_n, lim inf_{n→∞} X_n, and lim sup_{n→∞} X_n, and the set {ω : lim_{n→∞} X_n(ω) exists} is measurable.
Proof. For any a ∈ R, the infimum of a sequence is strictly less than a if and only if some term is strictly less than a, so {inf_n X_n < a} = ∪_n {X_n < a} ∈ F, and it follows that inf_n X_n is measurable. Similarly, {sup_n X_n > a} = ∪_n {X_n > a} ∈ F, so sup_n X_n is measurable.
Arguing as in the first case, inf_{m≥n} X_m is measurable for all n ∈ N, so it follows from the second case that
lim inf_{n→∞} X_n = sup_{n∈N} inf_{m≥n} X_m
is measurable, and likewise lim sup_{n→∞} X_n = inf_{n∈N} sup_{m≥n} X_m is measurable. Finally,
{lim_{n→∞} X_n exists} = {lim inf_{n→∞} X_n − lim sup_{n→∞} X_n = 0}
is measurable since it is the preimage of {0} ∈ B̄ under the map (lim inf_{n→∞} X_n) − (lim sup_{n→∞} X_n), which is measurable.
When P({lim_{n→∞} X_n exists}) = 1, we say that the sequence converges almost surely to X := lim sup_{n→∞} X_n, and write X_n → X a.s.
5. Expectation
Recall that a simple function φ = Σ_{i=1}^n a_i 1_{E_i} is a linear combination of indicator functions (where we may assume that the E_i are disjoint measurable sets).
A fundamental observation is that we can approximate a measurable function with simple functions by discretizing its range:
Theorem 5.1. If (S, G) is a measurable space and f : S → [0, ∞] is measurable, then there is a sequence {φ_n}_{n=1}^∞ of simple functions with 0 ≤ φ_1 ≤ φ_2 ≤ ... ≤ f such that φ_n → f pointwise, and the convergence is uniform on any set on which f is bounded.
Proof. For n = 1, 2, ... and k = 0, 1, ..., 4^n − 1, define
E_n^k = f^{-1}([k/2^n, (k + 1)/2^n)) and F_n = f^{-1}((2^n, ∞]),
and set
φ_n = Σ_{k=0}^{4^n − 1} (k/2^n) 1_{E_n^k} + 2^n 1_{F_n}.
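The dyadic construction in this proof is straightforward to implement; the sketch below (my own, using f(x) = x² as an arbitrary test function) exhibits the monotone convergence φ_n ↑ f:

```python
def phi(n, f, x):
    """n-th dyadic approximation in the spirit of Theorem 5.1: round f(x)
    down to a multiple of 2^-n, capping at 2^n where f exceeds 2^n."""
    y = f(x)
    if y > 2 ** n:
        return 2 ** n
    return int(y * 2 ** n) / 2 ** n     # floor(y * 2^n) / 2^n <= y

f = lambda x: x * x                     # arbitrary nonnegative test function
xs = [i / 100 for i in range(300)]      # grid in [0, 3)

# phi_n increases in n, sits below f, and the error is at most 2^-n
# wherever f <= 2^n -- i.e. convergence is uniform where f is bounded.
err = max(f(x) - phi(8, f, x) for x in xs)
```

Each refinement halves the mesh of the range partition, which is exactly why φ_n ≤ φ_{n+1} and why the error bound 2^{-n} is uniform on sets where f is bounded.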
For f integrable and A ∈ G, we define the integral of f over A as ∫_A f dµ = ∫ f 1_A dµ.
When we wish to emphasize dependence on the argument, we write ∫ f dµ = ∫ f(x)dµ(x), or sometimes ∫ f(x)µ(dx).
Proposition 5.1. For any a, b ∈ R and any integrable functions f, g, we have ∫(af + bg)dµ = a∫f dµ + b∫g dµ, and if f ≤ g a.e., then ∫f dµ ≤ ∫g dµ.
Definition. If X is a random variable on (Ω, F, P) with X ≥ 0 a.s., then we define its expectation as E[X] = ∫X dP, which always makes sense, but may be +∞.
If X is an arbitrary random variable, then we can write X = X^+ − X^− where X^+ = max{0, X} and X^− = max{0, −X}. If at least one of E[X^+], E[X^−] is finite, then we define E[X] = E[X^+] − E[X^−].
Note that E[X] may be defined even if X isn't an integrable function.
A trivial but extremely useful observation is that P (A) = E[1A ] for any event A ∈ F.
Inequalities.
Recall that a function ϕ : R → R is said to be convex if for every x, y ∈ R and λ ∈ [0, 1], we have ϕ(λx + (1 − λ)y) ≤ λϕ(x) + (1 − λ)ϕ(y). That is, given any two points x, y ∈ R, the line segment from (x, ϕ(x)) to (y, ϕ(y)) lies above the graph of ϕ.
Theorem 5.2 (Jensen's Inequality). If ϕ is convex and X is an integrable random variable with E|ϕ(X)| < ∞, then ϕ(E[X]) ≤ E[ϕ(X)].
Proof. (Homework)
Lemma 5.1. If ϕ is convex and x < y < z, then
(ϕ(y) − ϕ(x))/(y − x) ≤ (ϕ(z) − ϕ(x))/(z − x) ≤ (ϕ(z) − ϕ(y))/(z − y).
Proof.
Writing λ = (y − x)/(z − x) ∈ (0, 1), we have y = λz + (1 − λ)x, so it follows from convexity that
ϕ(y) − ϕ(x) ≤ λ(ϕ(z) − ϕ(x)) = ((y − x)/(z − x))(ϕ(z) − ϕ(x)).
Dividing by y − x > 0 gives the first inequality.
Similarly, setting µ = (z − y)/(z − x) = 1 − λ ∈ (0, 1), we have y = µx + (1 − µ)z, so ϕ(y) ≤ µϕ(x) + (1 − µ)ϕ(z), and thus
ϕ(y) − ϕ(z) ≤ µ(ϕ(x) − ϕ(z)) = ((z − y)/(z − x))(ϕ(x) − ϕ(z)),
hence
(ϕ(z) − ϕ(y))/(z − y) ≥ (ϕ(z) − ϕ(x))/(z − x).
Lemma 5.2 (Supporting Hyperplane Theorem in R2 ). If ϕ is a convex function, then for any c ∈ R, there
is a linear function l(x) which satises l(c) = ϕ(c) and l(x) ≤ ϕ(x) for all x ∈ R.
Proof. (Homework)
For any h > 0, taking x = c − h, y = c, z = c + h in Lemma 5.1, it follows from the outer inequality that
(ϕ(c) − ϕ(c − h))/h ≤ (ϕ(c + h) − ϕ(c))/h.
Also, for any 0 < h_1 < h_2, we have c − h_2 < c − h_1 < c, so the second inequality in Lemma 5.1 shows that
(ϕ(c) − ϕ(c − h_2))/h_2 ≤ (ϕ(c) − ϕ(c − h_1))/h_1.
Similarly, since c < c + h_1 < c + h_2, the first inequality in Lemma 5.1 shows that
(ϕ(c + h_2) − ϕ(c))/h_2 ≥ (ϕ(c + h_1) − ϕ(c))/h_1.
Keeping track of strict inequalities in the preceding arguments where necessary shows that Jensen's inequality is strict unless X is a.s. constant.
To state the next inequality, we define the L^p norm of a random variable by ‖X‖_p = E[|X|^p]^{1/p} for p ∈ [1, ∞) and ‖X‖_∞ = inf{M : P(|X| > M) = 0}.
We define L^p = L^p(Ω, F, P) = {X : ‖X‖_p < ∞} (where random variables X and Y define the same element of L^p if they are equal almost surely), and one can prove that L^p is a Banach space for p ≥ 1.
Theorem 5.3 (Hölder's Inequality). If p, q ∈ [1, ∞] are conjugate exponents (i.e. 1/p + 1/q = 1, with the convention 1/∞ = 0), then E|XY| ≤ ‖X‖_p‖Y‖_q.
Proof.
We first note that the result holds trivially if the right-hand side is infinite, and if ‖X‖_p = 0 or ‖Y‖_q = 0, then |XY| = 0 a.s. Accordingly, we may assume that 0 < ‖X‖_p, ‖Y‖_q < ∞. In fact, since constants factor out of the norms, we may divide through and assume that ‖X‖_p = ‖Y‖_q = 1. When p = 1 and q = ∞, the result follows from |XY| ≤ |X|‖Y‖_∞ a.s.
Accordingly, we will assume henceforth that p, q ∈ (1, ∞).
Now fix y ≥ 0, and define the function ϕ : [0, ∞) → R by ϕ(x) = x^p/p + y^q/q − xy.
Since ϕ′(x) = x^{p−1} − y and ϕ′′(x) = (p − 1)x^{p−2} > 0 for x > 0, ϕ attains its minimum at x_0 = y^{1/(p−1)}.
Thus, as the conjugacy of p and q implies that 1/(p−1) + 1 = p/(p−1) = q (because 1/q = 1 − 1/p = (p−1)/p), we have
ϕ(x) ≥ ϕ(x_0) = x_0^p/p + y^q/q − x_0y = y^{p/(p−1)}/p + y^q/q − y^{1/(p−1)+1} = y^q/p + y^q/q − y^q = 0
for all x ≥ 0. It follows that x^p/p + y^q/q ≥ xy for every x, y ≥ 0.
In particular, taking x = |X|, y = |Y|, and integrating, we have
E|XY| = ∫|X||Y| dP ≤ (1/p)∫|X|^p dP + (1/q)∫|Y|^q dP = ‖X‖_p^p/p + ‖Y‖_q^q/q = 1/p + 1/q = 1 = ‖X‖_p‖Y‖_q. ∎
(When p = q = 2, Hölder's inequality is the Cauchy-Schwarz inequality, which also admits a direct proof: the quadratic q(t) = E[(|X| + t|Y|)^2] = E[Y^2]t^2 + 2E|XY|t + E[X^2] is nonnegative for all t ∈ R, so q(t) has at most one real root and thus a nonpositive discriminant:
(2E|XY|)^2 − 4E[X^2]E[Y^2] ≤ 0.)
Corollary 5.2. For any random variable X and any 1 ≤ r < s ≤ ∞, ‖X‖_r ≤ ‖X‖_s. Therefore, we have the inclusion L^s ⊆ L^r.
Proof. For s = ∞, we have |X|^r ≤ ‖X‖_∞^r a.s., hence
‖X‖_r^r = ∫|X|^r dP ≤ ∫‖X‖_∞^r dP = ‖X‖_∞^r.
For s < ∞, Hölder's inequality with conjugate exponents p = s/r and q = (1 − r/s)^{-1}, applied to |X|^r and 1, gives
‖X‖_r^r = E[|X|^r] ≤ ‖|X|^r‖_p‖1‖_q = (∫|X|^s dP)^{r/s} = ‖X‖_s^r.
Note that for Corollary 5.2, it is important that our measure is finite. Of course, we could also prove the inclusion by breaking up the integral according to whether |X| is greater or less than 1, though we would not obtain the inequality in that case.
The proof of our last big inequality should be familiar from measure theory (convergence in L1 implies
convergence in measure).
Theorem 5.4 (Chebychev). For any nonnegative random variable X and any a > 0,
P(X ≥ a) ≤ E[X]/a.
Proof. Let A = {ω : X(ω) ≥ a}. Then
aP(X ≥ a) = ∫a·1_A dP ≤ ∫X·1_A dP ≤ ∫X dP = E[X].
Corollary 5.3. For any (S, G)-valued random variable X, any measurable function ϕ : S → [0, ∞), and any a > 0,
P(ϕ(X) ≥ a) ≤ E[ϕ(X)]/a.
Some important cases of Corollary 5.3 for real-valued X are
• ϕ(x) = |x|: to control the probability that an integrable random variable is large.
• ϕ(x) = (x − E[X])^2: to control the probability that a random variable with finite variance is far from its mean.
• ϕ(x) = e^{tx}: to establish exponential decay for random variables with moment generating functions (concentration inequalities).
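A quick Monte Carlo illustration of the basic bound (my own sketch; Exp(1) samples are an arbitrary choice). Note that applying Theorem 5.4 to the empirical distribution shows the bound must hold for sample averages as well:

```python
import random

random.seed(1)
samples = [random.expovariate(1.0) for _ in range(100000)]   # mean 1, X >= 0

def markov_holds(a):
    """Check P(X >= a) <= E[X]/a against the empirical distribution."""
    tail = sum(s >= a for s in samples) / len(samples)
    mean = sum(samples) / len(samples)
    return tail <= mean / a

checks = [markov_holds(a) for a in (0.5, 1.0, 2.0, 4.0, 8.0)]
# The true tail e^{-a} is far below the bound 1/a for large a: Theorem 5.4
# is crude but completely general.
```

Comparing e^{−a} against 1/a shows how loose the inequality can be; the exponential ϕ(x) = e^{tx} from the third bullet is how one recovers decay of the correct order.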
Limit Theorems.
We now briefly recall the three main results for interchanging limits and integration. The proofs can be found in any standard measure theory text.
Theorem 5.5 (Monotone Convergence Theorem). If 0 ≤ X_n ↑ X, then E[X_n] ↑ E[X].
Theorem 5.6 (Fatou's Lemma). If X_n ≥ 0 for all n, then E[lim inf_{n→∞} X_n] ≤ lim inf_{n→∞} E[X_n].
Note that if X_n ≥ M, then X_n − M ≥ 0, so since constants behave nicely with respect to limits and expectation, nonnegative can be relaxed to bounded below in the statements of Theorems 5.5 and 5.6.
Also, since X_n ↑ X if and only if (−X_n) ↓ (−X) and lim inf_{n→∞} X_n = − lim sup_{n→∞}(−X_n), one has immediate corollaries for lim sups and for monotone decreasing sequences (provided that they are bounded above).
Theorem 5.7 (Dominated Convergence Theorem). If Xn → X and there exists some Y ≥ 0 with E[Y ] < ∞
and |Xn | ≤ Y for all n, then E[Xn ] → E[X].
When Y is a constant, Theorem 5.7 is known as the bounded convergence theorem.
In each of these theorems, the assumptions need only hold almost surely since one can modify random variables on null sets without affecting their expectations.
Change of Variables.
Though integration over arbitrary measure spaces is nice in theory, in order to actually compute expectations, it is usually easiest to work with the distribution on the range space. The bridge is the following change of variables formula.
Theorem 5.8. Let X be an (S, G)-valued random variable with distribution µ. If f : S → R is measurable with f ≥ 0 or E|f(X)| < ∞, then
E[f(X)] = ∫_S f(s)dµ(s).
Proof. We will proceed by verifying the result in increasingly general cases paralleling the construction of the integral.
First, if f = 1_B for some B ∈ G, then E[1_B(X)] = P(X ∈ B) = µ(B) = ∫_S 1_B(s)dµ(s).
Next, if f = Σ_{i=1}^n a_i 1_{B_i} is simple, then linearity gives
E[f(X)] = Σ_{i=1}^n a_i E[1_{B_i}(X)] = Σ_{i=1}^n a_i ∫_S 1_{B_i}(s)dµ(s) = ∫_S f(s)dµ(s).
If f ≥ 0, then Theorem 5.1 gives a sequence of simple functions φ_n ↑ f, so the previous case and two applications of the monotone convergence theorem yield E[f(X)] = lim_{n→∞} E[φ_n(X)] = lim_{n→∞} ∫_S φ_n(s)dµ(s) = ∫_S f(s)dµ(s).
Finally, suppose that E|f(X)| < ∞, and set f^+(x) = max{f(x), 0}, f^−(x) = max{−f(x), 0}. Then f^+, f^− ≥ 0, f = f^+ − f^−, and E[f(X)^+], E[f(X)^−] ≤ E|f(X)| < ∞, so it follows from the previous case and linearity that E[f(X)] = E[f^+(X)] − E[f^−(X)] = ∫_S f^+ dµ − ∫_S f^− dµ = ∫_S f dµ. ∎
In light of Theorem 5.8, if X is an integrable random variable on (Ω, F, P) with distribution µ, then
E[X] = ∫X dP = ∫_R x dµ(x).
If X has density f = dµ/dm, then for any measurable g : R → R with g ≥ 0 a.s. or ∫|g| dµ < ∞,
E[g(X)] = ∫_{−∞}^∞ g(x)f(x)dx.
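As a concrete instance of the last formula (my own sketch): for X standard normal and g(x) = x², numerically integrating ∫g(x)f(x)dx recovers E[X²] = 1 without any reference to the underlying probability space.

```python
import math

f = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)   # N(0,1) density
g = lambda x: x * x

# Midpoint-rule approximation of E[g(X)] = int g(x) f(x) dx over [-10, 10]
# (the tails beyond +-10 are negligible).
n, lo, hi = 20000, -10.0, 10.0
h = (hi - lo) / n
mids = [lo + (k + 0.5) * h for k in range(n)]
mass = sum(f(x) for x in mids) * h        # ~ 1, total probability
Eg = sum(g(x) * f(x) for x in mids) * h   # ~ 1 = E[X^2] = Var(X)
```

This is the practical payoff of Theorem 5.8: expectations reduce to ordinary integrals over R.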
If X is a random variable, then for any k ∈ N, we say that E[X^k] is the kth moment of X. The first moment E[X] is called the mean and is usually denoted E[X] = µ. The mean is a measure of the center of the distribution of X.
If X has finite second moment E[X^2] < ∞, then we define the variance (or second central moment) of X as Var(X) = E[(X − µ)^2]. The variance provides a measure of the dispersion of the distribution of X and is usually denoted Var(X) = σ^2.
6. Independence
Heuristically, two objects are independent if information concerning one of them does not contribute to one's
knowledge about the other. The correct way to formally codify this notion in a manner amenable to proving
• Two sub-σ -algebras F1 and F2 are independent if for all A ∈ F1 , B ∈ F2 , the events A and B are
independent.
An innite collection of objects (sub-σ -algebras, random variables, events) is said to be independent if every
• Random variables X1, ..., Xn are independent if for any choice of Borel sets E_i, i = 1, ..., n, we have

P(X1 ∈ E1, ..., Xn ∈ En) = Π_{i=1}^n P(X_i ∈ E_i).

• Sub-σ-algebras F1, ..., Fn are independent if for any choice of A_i ∈ F_i, i = 1, ..., n, we have

P(∩_{i=1}^n A_i) = Π_{i=1}^n P(A_i).
Note that σ-algebras and random variables are implicitly subject to the same subcollection constraint as events since special cases of the definition include taking A_i = Ω, E_i = R for i ∈ I^C.
For any n ∈ N, it is possible to construct families of objects which are not independent, but every subcollection of size m ≤ n satisfies the applicable multiplication rule. For example, just because a collection of events A1, ..., An satisfies P(A_i ∩ A_j) = P(A_i)P(A_j) for all i ≠ j (called pairwise independence), it is not necessarily the case that A1, ..., An is an independent collection of events.
(Flip two fair coins and let A = {1st coin heads}, B = {2nd coin heads}, C = {both coins same}.)
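The two-coin example can be checked exhaustively. The following sketch (helper names are my own) enumerates the four equally likely outcomes and verifies pairwise, but not mutual, independence:

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips, each of the 4 outcomes has probability 1/4.
omega = list(product("HT", repeat=2))

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"      # 1st coin heads
B = lambda w: w[1] == "H"      # 2nd coin heads
C = lambda w: w[0] == w[1]     # both coins the same

# Pairwise independence: every pair multiplies.
for E, F in [(A, B), (A, C), (B, C)]:
    assert prob(lambda w: E(w) and F(w)) == prob(E) * prob(F)

# But the triple intersection {HH} has probability 1/4, not (1/2)^3 = 1/8.
triple = prob(lambda w: A(w) and B(w) and C(w))
print(triple, prob(A) * prob(B) * prob(C))
```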
One can show that independence of events is a special case of independence of random variables (via indicator functions), which in turn is a special case of independence of σ-algebras (via the σ-algebras the random variables generate). We will take as our running definition of independence the further generalization:
Definition. Given a probability space (Ω, F, P), collections of events A1, ..., An ⊆ F are independent if for all I ⊆ [n],

P(∩_{i∈I} A_i) = Π_{i∈I} P(A_i)

whenever A_i ∈ A_i for each i ∈ I.
An infinite collection of subsets of F is independent if every finite subcollection is.
Observe that if A1, ..., An is independent and we set Ã_i = A_i ∪ {Ω}, then Ã1, ..., Ãn is independent as well. In this case, the independence criterion reduces to P(∩_{i=1}^n A_i) = Π_{i=1}^n P(A_i) for any choice of A_i ∈ Ã_i.
In general, it can be tedious to verify directly that a collection of objects is independent. The following results are useful for simplifying this task.
Theorem 6.1. Suppose that A1 , ..., An are independent collections of events. If each Ai is a π-system, then
the sub-σ-algebras σ(A1 ), ..., σ(An ) are independent.
Proof. Because σ(A_i) = σ(Ã_i) where Ã_i = A_i ∪ {Ω}, we can assume without loss of generality that Ω ∈ A_i for all i so that we need only consider intersections/products over [n].
Let A2, ..., An be events with A_i ∈ A_i, set F = ∩_{i=2}^n A_i, and set L = {A ∈ F : P(A ∩ F) = P(A)P(F)}.
Since P(Ω ∩ F) = P(F) = P(Ω)P(F), we have that Ω ∈ L.
Now suppose that A, B ∈ L with A ⊆ B. Then
P ((B \ A) ∩ F ) = P ((B ∩ F ) \ (A ∩ F )) = P (B ∩ F ) − P (A ∩ F )
= P (B)P (F ) − P (A)P (F ) = (P (B) − P (A)) P (F ) = P (B \ A)P (F ),
hence (B \ A) ∈ L.
Finally, let B1, B2, ... ∈ L with Bn % B. Then (Bn ∩ F) % (B ∩ F), so continuity from below gives

P(B ∩ F) = lim_{n→∞} P(Bn ∩ F) = lim_{n→∞} P(Bn)P(F) = P(B)P(F),

so B ∈ L as well.
Therefore, L is a λ-system, so, since A1 is a π -system contained in L by assumption, the π -λ Theorem shows
that σ(A1 ) ⊆ L.
Because A2, ..., An were arbitrary members of A2, ..., An, we conclude that σ(A1), A2, ..., An are independent.
Repeating this argument for A2, A3, ..., An, σ(A1) shows that σ(A2), A3, ..., An, σ(A1) are independent, and continuing in this fashion establishes Theorem 6.1.
Since the converse of Corollary 6.1 is true by definition, independence of random variables X1, ..., Xn is equivalent to the condition that their joint cdf factors as the product of the marginal cdfs.
One can prove analogous results for density and mass functions using the same basic ideas.
It is clear that if X1, ..., Xn are independent random variables and f1, ..., fn : R → R are measurable, then f1(X1), ..., fn(Xn) are independent random variables since for any choice of Borel sets B_i,

P(f1(X1) ∈ B1, ..., fn(Xn) ∈ Bn) = P(X1 ∈ f1⁻¹(B1), ..., Xn ∈ fn⁻¹(Bn)) = Π_{i=1}^n P(X_i ∈ f_i⁻¹(B_i)) = Π_{i=1}^n P(f_i(X_i) ∈ B_i).
With the help of Theorem 6.1, we can prove the stronger result that functions of disjoint sets of independent random variables are independent.
Lemma 6.1. Suppose F_{i,j}, 1 ≤ i ≤ n, 1 ≤ j ≤ m(i), are independent sub-σ-algebras and let G_i = σ(∪_j F_{i,j}). Then G1, ..., Gn are independent.
Proof. Let A_i = {∩_j A_{i,j} : A_{i,j} ∈ F_{i,j}}.
If ∩_j A_{i,j}, ∩_j B_{i,j} ∈ A_i, then ∩_j A_{i,j} ∩ ∩_j B_{i,j} = ∩_j (A_{i,j} ∩ B_{i,j}) ∈ A_i, thus A_i is a π-system, so Theorem 6.1 shows that σ(A1), ..., σ(An) are independent. Since F_{i,j} ⊆ A_i (take A_{i,k} = Ω for k ≠ j) and A_i ⊆ G_i, we have σ(A_i) = G_i, and the result follows.
Corollary 6.2. If X_{i,j}, 1 ≤ i ≤ n, 1 ≤ j ≤ m(i), are independent random variables and the functions f_i : R^{m(i)} → R are measurable, then f1(X_{1,1}, ..., X_{1,m(1)}), ..., fn(X_{n,1}, ..., X_{n,m(n)}) are independent.
Proof. Let F_{i,j} = σ(X_{i,j}). Since f_i(X_{i,1}, ..., X_{i,m(i)}) is measurable with respect to G_i = σ(∪_j F_{i,j}), the result follows from Lemma 6.1.
Product Measure.
We now pause to recall the construction of product measures.
Proposition 6.1. Given nite measure spaces (Ω1 , F1 , µ1 ) and (Ω2 , F2 , µ2 ), there exists a unique measure
µ1 × µ2 on (Ω1 × Ω2 , F1 ⊗ F2 ) which satises (µ1 × µ2 )(A × E) = µ1 (A)µ2 (E) for all A ∈ F1 , E ∈ F2 .
Proof. Let S = {A × E : A ∈ F1, E ∈ F2}.
If A1 × E1, A2 × E2 ∈ S, then (A1 × E1) ∩ (A2 × E2) = (A1 ∩ A2) × (E1 ∩ E2) and

(A1 × E1)^C = (A1^C × E1) ⊔ (A1 × E1^C) ⊔ (A1^C × E1^C),

hence S is a semialgebra.
Consequently, setting π(A × E) = µ1(A)µ2(E) on S, the extension results from Section 3 apply once we verify countable additivity on S. If A × E = ⊔_{i∈I} (A_i × E_i) is a countable disjoint union of sets in S, then 1_A(x)1_E(y) = Σ_{i∈I} 1_{A_i}(x)1_{E_i}(y), so integrating over x gives

µ1(A)1_E(y) = ∫_{Ω1} 1_A(x)1_E(y)dµ1(x) = ∫_{Ω1} Σ_{i∈I} 1_{A_i}(x)1_{E_i}(y)dµ1(x)
= Σ_{i∈I} ∫_{Ω1} 1_{A_i}(x)1_{E_i}(y)dµ1(x) = Σ_{i∈I} µ1(A_i)1_{E_i}(y),

and integrating over y then gives µ1(A)µ2(E) = Σ_{i∈I} µ1(A_i)µ2(E_i).
(The interchange of summation and integration is justified by the monotone convergence theorem.)
* The above holds for σ-finite measure spaces as well by the same argument, but we mainly care about the case of probability measures.
Independence, Distribution, and Expectation.
We now consider the joint distribution of independent random variables.
Theorem 6.2. If X1 , ..., Xn are independent random variables with distributions µ1 , ..., µn , respectively,
then the random vector (X1 , ..., Xn ) has distribution µ1 × · · · × µn .
Proof. For any A1, ..., An ∈ B,

P((X1, ..., Xn) ∈ A1 × ··· × An) = P(X1 ∈ A1, ..., Xn ∈ An) = Π_{i=1}^n P(X_i ∈ A_i)
= Π_{i=1}^n µ_i(A_i) = (µ1 × ··· × µn)(A1 × ··· × An).
In the proof of Theorem 3.3, we showed that for any probability measures µ, ν, L = {A : µ(A) = ν(A)} is a λ-system. Because the collection of rectangle sets is a π-system which generates B^n, the result follows from the π-λ Theorem.
In other words, random variables are independent if and only if their joint distribution is the product of their marginal distributions.
At this point, it is appropriate to recall the theorems of Fubini and Tonelli, whose proofs can be found in standard measure theory references such as Folland.
Theorem 6.3. Suppose that (R, F, µ) and (S, G, ν) are σ-nite measure spaces.
I) Tonelli: If f : R × S → [0, ∞) is a measurable function, then

(∗) ∫_{R×S} f d(µ × ν) = ∫_S (∫_R f(x, y)dµ(x)) dν(y) = ∫_R (∫_S f(x, y)dν(y)) dµ(x).

II) Fubini: If f : R × S → R is integrable (i.e. ∫|f| d(µ × ν) < ∞), then (∗) holds.
Theorem 6.4. Suppose that X and Y are independent with distributions µ and ν. If f : R² → R is a measurable function with f ≥ 0 or E|f(X, Y)| < ∞, then

E[f(X, Y)] = ∫∫ f(x, y)dµ(x)dν(y).

In particular, if g, h : R → R are measurable functions with g, h ≥ 0 or E|g(X)|, E|h(Y)| < ∞, then

E[g(X)h(Y)] = E[g(X)]E[h(Y)].
Proof. It follows from Theorem 6.2 and the change of variables formula (Theorem 5.8) that

E[f(X, Y)] = ∫_{R²} f(x, y)d(µ × ν)(x, y),

so the case f ≥ 0 follows from Tonelli's theorem. Taking f(x, y) = g(x)h(y) with g, h ≥ 0 gives E[g(X)h(Y)] = ∫ g dµ ∫ h dν = E[g(X)]E[h(Y)].
If g, h are integrable, then applying the above result to |g| , |h| gives E |g(X)h(Y )| = E |g(X)| E |h(Y )| < ∞,
and we can repeat the above argument using Fubini's Theorem.
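For a concrete sanity check of the product rule E[g(X)h(Y)] = E[g(X)]E[h(Y)], one can enumerate a finite example exactly. The sketch below (the dice and the test functions g, h are my own choices) uses the fact that the joint pmf of two independent dice is the product of the marginals:

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice; joint pmf is the product of the marginals.
sides = range(1, 7)
p_joint = Fraction(1, 36)
p_marg = Fraction(1, 6)

g = lambda x: x * x        # arbitrary measurable test functions
h = lambda y: 3 * y + 1

E_gh = sum(g(x) * h(y) * p_joint for x, y in product(sides, sides))
E_g = sum(g(x) * p_marg for x in sides)
E_h = sum(h(y) * p_marg for y in sides)

print(E_gh, E_g * E_h)  # equal, as Theorem 6.4 predicts
```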
Note that the second part of the preceding proof is typical of multiple integral arguments: One uses Tonelli's
theorem to verify integrability by computing the integral of the absolute value as an iterated integral (or
interchanging order of integration), and then one applies Fubini's Theorem to compute the desired integral.
Theorem 6.4 can easily be extended to handle any finite number of random variables:
Theorem 6.5. If X1, ..., Xn are independent and have X_i ≥ 0 for all i, or E|X_i| < ∞ for all i, then

E[Π_{i=1}^n X_i] = Π_{i=1}^n E[X_i].
Proof. Corollary 6.2 shows that X1 and X2 ··· Xn are independent, so Theorem 6.4 (with g = h the identity function) gives

E[Π_{i=1}^n X_i] = E[X1] E[Π_{i=2}^n X_i],

and the result follows by induction.
(To make Theorem 6.5 look more like Theorem 6.4, recall that if X1, ..., Xn are independent and f1, ..., fn are measurable, then f1(X1), ..., fn(Xn) are independent as well.)
Note that it is possible that E[XY ] = E[X]E[Y ] without X and Y being independent.
For example, let X ∼ N(0, 1), Y = X². Then X and Y are clearly dependent, but a little calculus shows that E[X] and E[XY] = E[X³] are both 0 and E[Y] = E[X²] = 1, so E[XY] = 0 = E[X]E[Y].
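This uncorrelated-but-dependent pair can be observed in a quick Monte Carlo sketch (illustrative only; the sample size and seed are arbitrary choices, and the estimates are subject to sampling error):

```python
import random

# X ~ N(0,1), Y = X^2: Y is a function of X, yet Cov(X, Y) = E[X^3] = 0.
random.seed(0)
n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

E_X = sum(xs) / n                        # estimates E[X] = 0
E_Y = sum(x * x for x in xs) / n         # estimates E[X^2] = 1
E_XY = sum(x ** 3 for x in xs) / n       # estimates E[X^3] = 0

print(E_XY, E_X * E_Y)  # both near 0
```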
Definition. If X and Y are random variables with E[X²], E[Y²] < ∞ and E[XY] = E[X]E[Y], then we say that X and Y are uncorrelated.
Often, independence is invoked solely to argue that the expectation of the product is the product of the expectations. In such cases, one can weaken the assumption from independence to uncorrelatedness.
Of course, we can obtain a partial converse to Theorem 6.4 if we require the expectation to factor over a sufficiently rich class of functions.
Proposition 6.2. X and Y are independent if E[f (X)g(Y )] = E[f (X)]E[g(Y )] for all bounded continuous
functions f and g.
Proof. Given any x, y ∈ R, define

fn(t) = 1 for t ≤ x, fn(t) = 1 − n(t − x) for x < t ≤ x + 1/n, fn(t) = 0 for t > x + 1/n,

and define gn analogously with y in place of x. These functions are bounded and continuous with fn → 1_{(−∞,x]} and gn → 1_{(−∞,y]} pointwise as n → ∞, so the bounded convergence theorem gives

P(X ≤ x, Y ≤ y) = lim_{n→∞} E[fn(X)gn(Y)] = lim_{n→∞} E[fn(X)]E[gn(Y)] = P(X ≤ x)P(Y ≤ y).

Thus the joint cdf of (X, Y) factors as the product of the marginal cdfs, and independence follows.
Before moving on, we mention that the ideas in this section can be used to analyze the sum of independent
random variables.
Theorem 6.6. Suppose that X and Y are independent with distributions µ, ν and distribution functions
F, G. Then X + Y has distribution function
P(X + Y ≤ z) = ∫ F(z − y)dG(y).

If X has density f, then X + Y has density h(z) = ∫ f(z − y)dG(y).
If, additionally, Y has density g, then h(z) = ∫ f(z − y)g(y)dy = (f ∗ g)(z); that is, the density of the sum is the convolution of the densities.
Proof. The change of variables formula, independence, and Tonelli's theorem give

P(X + Y ≤ z) = ∫_Ω 1_{(−∞,z]}(X + Y)dP = ∫_{R²} 1_{(−∞,z]}(x + y)d(µ × ν)(x, y)
= ∫_R ∫_R 1_{(−∞,z]}(x + y)dµ(x)dν(y) = ∫_R (∫_R 1_{(−∞,z−y]}(x)dµ(x)) dν(y)
= ∫_R F(z − y)dν(y) = ∫ F(z − y)dG(y).
The final equality is just interpreting an integral against ν as a Riemann-Stieltjes integral with respect to G.
Now if X has density f, then the previous result with u-substitution and Tonelli yield

P(X + Y ≤ z) = ∫_R F(z − y)dν(y) = ∫_R ∫_{−∞}^{z−y} f(x)dx dν(y)
= ∫_R ∫_{−∞}^{z} f(x − y)dx dν(y) = ∫_{−∞}^{z} ∫_R f(x − y)dν(y)dx
= ∫_{−∞}^{z} (∫ f(x − y)dG(y)) dx,

which shows that X + Y has density h(z) = ∫ f(z − y)dG(y).
The third assertion follows by writing dG(y) = g(y)dy, using the change of variables formula for absolutely continuous random variables.
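The discrete analogue of this convolution identity, that the pmf of a sum of independent discrete random variables is the convolution of the pmfs, can be verified by direct enumeration. A minimal sketch (the fair-die pmf and helper name are my own):

```python
from fractions import Fraction

# One fair die: P(X = k) = 1/6, k = 1, ..., 6.
die = {k: Fraction(1, 6) for k in range(1, 7)}

def convolve(p, q):
    """Pmf of X + Y for independent X ~ p and Y ~ q (discrete convolution)."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, 0) + px * qy
    return out

two_dice = convolve(die, die)
print(two_dice[7])  # 6/36 = 1/6, the most likely total
```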
Though one can use these convolution results to derive useful facts about distributions of sums, tools such as characteristic and moment generating functions are generally much better suited for this task, so we will not pursue the matter further here. Before going on, however, we should verify that infinite sequences of independent random variables actually exist!
Given a finite collection of distribution functions F1, ..., Fn, it is easy to construct independent random variables X1, ..., Xn such that each X_i has distribution function F_i: let µ_i be the probability measure on (R, B) with distribution function F_i and take (Ω, F, P) = (R^n, B^n, µ1 × ··· × µn).
The product measure P is well-defined and satisfies P((a1, b1] × ··· × (an, bn]) = Π_{i=1}^n (F_i(b_i) − F_i(a_i)).
If we define X_i to be the projection map X_i((ω1, ..., ωn)) = ω_i, then it is clear that the X_i's are independent with X_i ∼ F_i.
In order to build an infinite sequence of independent random variables with given distribution functions, we would like to carry out the analogous construction on sequence space (R^N, B^N) using a measure P satisfying

P({ω ∈ R^N : ω_i ∈ (a_i, b_i] for i = 1, ..., n}) = Π_{i=1}^n (F_i(b_i) − F_i(a_i))

on the cylinders.
Theorem 6.7 (Kolmogorov). Suppose that we are given a sequence of probability measures µn on (R^n, B^n) which are consistent in the sense that

µ_{n+1}((a1, b1] × ··· × (an, bn] × R) = µn((a1, b1] × ··· × (an, bn]).

Then there is a unique probability measure P on (R^N, B^N) such that

P({ω ∈ R^N : ω_i ∈ (a_i, b_i], 1 ≤ i ≤ n}) = µn((a1, b1] × ··· × (an, bn]).

In particular, given distribution functions F1, F2, ..., if we define the µn's by the condition

µn((a1, b1] × ··· × (an, bn]) = Π_{i=1}^n (F_i(b_i) − F_i(a_i)),

then the coordinate projections on (R^N, B^N, P) form an infinite sequence of independent random variables with X_i ∼ F_i.

Let S denote the collection of cylinder sets of the form {ω ∈ R^N : ω_i ∈ (a_i, b_i], 1 ≤ i ≤ n}, together with ∅ and R^N, and define Q : S → [0, 1] by

Q({ω ∈ R^N : ω_i ∈ (a_i, b_i], 1 ≤ i ≤ n}) = µn((a1, b1] × ··· × (an, bn]).

Let A = {finite disjoint unions of sets in S} be the algebra generated by S and define P0 : A → [0, 1] by
P0(⊔_{k=1}^n S_k) = Σ_{k=1}^n Q(S_k) for S1, ..., Sn disjoint sets in S.
As S is a semialgebra which generates B^N, the discussion in Section 3 shows that it suffices to prove that P0 is countably additive on A; by finite additivity, this reduces to showing that P0(Bn) ↓ 0 whenever B1, B2, ... ∈ A with Bn ↓ ∅.
Proof. To further simplify our task, let Fn be the sub-σ-algebra of B^N consisting of all sets of the form E* = E × R × R × ··· with E ∈ B^n. We use this asterisk notation throughout to denote the B^n component of sets in Fn.
We begin by showing that we may assume without loss of generality that Bn ∈ Fn for all n.
To see this, note that Bn ∈ A implies that there is a j(n) ∈ N such that Bn ∈ F_k for all k ≥ j(n). Let k(1) = j(1) and k(n) = k(n − 1) + j(n) for n ≥ 2. Then k(1) < k(2) < ··· and Bn ∈ F_{k(n)} for all n. Define B̃_i = R^N for i < k(1) and B̃_i = Bn for k(n) ≤ i < k(n + 1). Then B̃n ∈ Fn for all n and the collections {Bn} and {B̃n} differ only in that the latter possibly includes R^N and repeats sets. The assertion follows since B̃n ↓ ∅ if and only if Bn ↓ ∅ and P0(B̃n) ↓ 0 if and only if P0(Bn) ↓ 0.
Now suppose that P0(Bn) ≥ δ > 0 for all n. We will derive a contradiction by approximating the Bn* from within by compact sets and then using a diagonal argument to obtain ∩_n Bn ≠ ∅.
Since Bn ∈ A ∩ Fn, we can write

Bn = ∪_{k=1}^{K(n)} {ω : ω_i ∈ (a_{i,k}, b_{i,k}], i = 1, ..., n} where −∞ ≤ a_{i,k} < b_{i,k} ≤ ∞.

Approximating the half-open intervals from within by closed bounded ones, we can choose

En = ∪_{k=1}^{K(n)} {ω : ω_i ∈ [ã_{i,k}, b̃_{i,k}], i = 1, ..., n}, −∞ < ã_{i,k} < b̃_{i,k} < ∞,

with En ⊆ Bn and µn(Bn* \ En*) ≤ δ/2^{n+1}.
Let Fn = ∩_{m=1}^n Em. Since Bn ⊆ Bm for any m ≤ n, we have

Bn \ Fn = Bn ∩ (∪_{m=1}^n Em^C) = ∪_{m=1}^n (Bn ∩ Em^C) ⊆ ∪_{m=1}^n (Bm ∩ Em^C),

hence

µn(Bn* \ Fn*) ≤ Σ_{m=1}^n µm(Bm* \ Em*) ≤ δ/2.

Since µn(Bn*) = P0(Bn) ≥ δ, this means that µn(Fn*) ≥ δ/2, hence Fn* is nonempty. Moreover, Fn* is a closed and bounded subset of R^n, so Fn* is compact.
For each m ∈ N, choose some ω^m ∈ Fm. As Fm ⊆ F1, ω1^m (the first coordinate of ω^m) is in F1*.
By compactness, we can find a subsequence m(1, j) ≥ j such that ω1^{m(1,j)} converges to a limit θ1 ∈ F1*.
For m ≥ 2, Fm ⊆ F2, so (ω1^m, ω2^m) ∈ F2*. Because F2* is compact, we can find a subsequence of {m(1, j)}, which we denote by m(2, j), such that ω2^{m(2,j)} converges to a limit θ2 with (θ1, θ2) ∈ F2*.
In general, we can find a subsequence m(n, j) of m(n − 1, j) such that ωn^{m(n,j)} converges to θn with (θ1, ..., θn) ∈ Fn*. Since this holds for every n, the sequence θ = (θ1, θ2, ...) satisfies

θ ∈ ∩_{n=1}^∞ Fn ⊆ ∩_{n=1}^∞ Bn,

a contradiction!
Note that the proof of Theorem 6.7 used certain topological properties of Rn .
As one might expect, the theorem does not hold for innite products of arbitrary measurable spaces (S, G).
However, one can show that it does hold for nice spaces, where (S, G) is said to be nice if there exists an injection ϕ : S → R such that ϕ and ϕ⁻¹ are measurable.
The collection of nice spaces is rich enough for our purposes. For example, if S is (homeomorphic to) a
complete and separable metric space and G is the collection of Borel subsets of S, then (S, G) is nice.
7. Weak Law of Large Numbers
We are now in a position to establish various laws of large numbers, which give conditions for the arithmetic
average of repeated observations to converge in certain senses. Among other things, these laws justify and
formalize our intuitive notions of probability as representing some kind of measure of long-term relative
frequency.
Definition. A sequence of random variables X1, X2, ... is said to converge to X in probability (written Xn →p X) if for every ε > 0, P(|Xn − X| > ε) → 0 as n → ∞.
Note that if Xn →p X, then lim_{n→∞} P(|Xn − X| < ε) = 1 for all ε > 0, while Xn → X a.s. implies that P(lim_{n→∞} |Xn − X| < ε) = 1 for all ε > 0. The following proposition and example show that the placement of the limit matters: almost sure convergence implies convergence in probability, but not conversely.
Proof. Fix ε > 0, let An = ∪_{m≥n} {|Xm − X| > ε} and A = ∩_n An, and set E = {ω : lim_{n→∞} Xn(ω) ≠ X(ω)}.
Since A1 ⊇ A2 ⊇ ..., continuity from above implies that P(A) = lim_{n→∞} P(An).
Now if ω ∈ A, then for every n ∈ N, there is an m ≥ n with |Xm(ω) − X(ω)| > ε, so lim_{n→∞} Xn(ω) ≠ X(ω), and thus A ⊆ E.
Because we also have the inclusion {|Xn − X| > ε} ⊆ An, monotonicity implies that

P(|Xn − X| > ε) ≤ P(An) → P(A) ≤ P(E) = 0

whenever Xn → X almost surely.
Example 7.1 (Scanning Interval). On the interval [0, 1) with Lebesgue measure, for n = 2^k + j with 0 ≤ j < 2^k, define Xn = 1_{[j2^{−k}, (j+1)2^{−k})}.
It is straightforward that Xn →p 0 (for any ε ∈ (0, 1), m ≥ 2^n implies P(|Xm − 0| > ε) ≤ 2^{−n}), but lim_{n→∞} Xn(ω) does not exist for any ω (there are infinitely many values of n with Xn(ω) = 1 and infinitely many with Xn(ω) = 0).
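The geometry of the scanning intervals can be checked computationally. The sketch below (my own illustration; the test point ω = 0.3 and the cutoff 2^12 are arbitrary) confirms that every ω is hit once per dyadic generation, while the interval lengths, and hence P(Xn = 1), shrink to 0:

```python
# For n = 2^k + j with 0 <= j < 2^k, X_n is the indicator of [j/2^k, (j+1)/2^k).
def interval(n):
    k = n.bit_length() - 1       # n = 2^k + j with 0 <= j < 2^k
    j = n - (1 << k)
    return (j / 2**k, (j + 1) / 2**k)

omega = 0.3
hits = [n for n in range(1, 2**12) if interval(n)[0] <= omega < interval(n)[1]]

# X_n(omega) = 1 exactly once per "scan" of [0,1), i.e. once per k = 0, ..., 11 ...
print(len(hits))
# ... yet the interval widths (= P(X_n = 1)) tend to 0, so X_n ->p 0.
widths = [interval(n)[1] - interval(n)[0] for n in range(1, 2**12)]
print(min(widths))
```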
The preceding shows that convergence in probability is strictly weaker than almost sure convergence.
Definition. For p ∈ (0, ∞], a sequence of random variables X1, X2, ... is said to converge to X in Lp if

lim_{n→∞} ||Xn − X||_p = 0. (For p ∈ (0, ∞), this is equivalent to E[|Xn − X|^p] → 0.)
To see how Lp convergence compares with our other notions of convergence, note that Chebychev's inequality gives

P(|Xn − X| > ε) = P(|Xn − X|^p > ε^p) ≤ ε^{−p} E[|Xn − X|^p] → 0,

so Lp convergence implies convergence in probability.
Example 7.2. On the interval [0, 1] with Lebesgue measure, define a sequence of random variables by Xn = n^{1/p} 1_{(0, n^{−1}]}. Then Xn → 0 a.s. (and thus in probability) since for all ω ∈ (0, 1], Xn(ω) = 0 whenever n > ω^{−1}. However, E[|Xn − 0|^p] = 1 for all n, so Xn does not converge to 0 in Lp.
Proposition 7.3 and Example 7.2 show that Lp convergence is stronger than convergence in probability.
Example 7.2 also shows that almost sure convergence need not imply convergence in Lp
(unless one makes additional assumptions such as boundedness or uniform integrability).
Conversely, Example 7.1 shows that Lp convergence does not imply almost sure convergence.
It is perhaps worth noting that a.s. convergence and convergence in probability are preserved by continuous functions. (The latter claim can be shown directly from the ε-δ definition of continuity, but we will give an easier proof in Theorem 8.2.) However, Lp convergence need not be. For example, on [0, 1] with Lebesgue measure, Xn = n^{1/p} 1_{(0, n^{−2})} converges to 0 in Lp, p > 0, but if f(x) = x², then ||f(Xn) − f(0)||_p = 1 for all n.
Now recall that random variables X and Y with finite second moments are said to be uncorrelated if E[XY] = E[X]E[Y].
If we denote E[X] = µ_X, E[Y] = µ_Y, then the covariance of X and Y is defined as

Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y,

so X and Y are uncorrelated precisely when Cov(X, Y) = 0.
We say that a family of random variables {X_i}_{i∈I} is uncorrelated if Cov(X_i, X_j) = 0 for all i ≠ j.
Before stating our first weak law, we record the following simple observation about sums of uncorrelated random variables.

Lemma 7.1. If X1, ..., Xn are uncorrelated random variables with finite second moments and Sn = Σ_{i=1}^n X_i, then Var(Sn) = Σ_{i=1}^n Var(X_i).

Proof. Let µ_i = E[X_i] and Sn = Σ_{i=1}^n X_i. Then E[Sn] = Σ_{i=1}^n µ_i by linearity, so

Var(Sn) = E[(Sn − E[Sn])²] = E[(Σ_{i=1}^n (X_i − µ_i))²]
= E[Σ_{i=1}^n (X_i − µ_i)² + Σ_{i≠j} (X_i − µ_i)(X_j − µ_j)]
= Σ_{i=1}^n E[(X_i − µ_i)²] + Σ_{i≠j} E[(X_i − µ_i)(X_j − µ_j)]
= Σ_{i=1}^n Var(X_i) + Σ_{i≠j} Cov(X_i, X_j) = Σ_{i=1}^n Var(X_i).
Theorem 7.1. Let X1, X2, ... be uncorrelated random variables with common mean E[X_i] = µ and uniformly bounded variance Var(X_i) ≤ C < ∞. Writing Sn = X1 + ... + Xn, we have that (1/n)Sn → µ in L² and in probability.
Proof. Since E[(1/n)Sn] = (1/n) Σ_{i=1}^n µ = µ, we see that

E[((1/n)Sn − µ)²] = Var((1/n)Sn) = (1/n²) Σ_{i=1}^n Var(X_i) ≤ nC/n² → 0

as n → ∞, hence (1/n)Sn → µ in L². By Proposition 7.3, (1/n)Sn →p µ as well.
Specializing to the case where the X_i's are independent and identically distributed (or i.i.d.), we have the following.

Corollary 7.1. If X1, X2, ... are i.i.d. with mean µ and variance σ² < ∞, then X̄n = (1/n) Σ_{i=1}^n X_i converges in probability to µ.
The statistical interpretation of Corollary 7.1 is that under mild conditions, if the sample size is sufficiently large, then the sample mean will be close to the population mean with high probability.
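This concentration is easy to watch numerically. A minimal simulation sketch (sample sizes and seed are arbitrary choices of mine) for i.i.d. fair coin flips with µ = 1/2:

```python
import random

# Corollary 7.1 in action: sample means of Bernoulli(1/2) flips
# concentrate around mu = 1/2 as n grows.
random.seed(42)

def sample_mean(n):
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in [100, 10_000, 1_000_000]:
    print(n, sample_mean(n))
```

By Chebychev's inequality the deviation at sample size n is of order 1/√n, so the printed means hug 0.5 more tightly as n increases.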
Examples.
Our first applications of these ideas involve statements that appear to be unrelated to probability.

Example 7.3 (Bernstein Polynomials). Let f : [0, 1] → R be continuous and let

fn(x) = Σ_{k=0}^n (n choose k) x^k (1 − x)^{n−k} f(k/n)

be the Bernstein polynomial of degree n associated with f. Then lim_{n→∞} sup_{x∈[0,1]} |fn(x) − f(x)| = 0.
Proof. Given any p ∈ [0, 1], let X1, X2, ... be i.i.d. with P(X1 = 1) = p and P(X1 = 0) = 1 − p.
One easily calculates E[X1] = p and Var(X1) = p(1 − p) ≤ 1/4.
Letting Sn = Σ_{i=1}^n X_i, we have that P(Sn = k) = (n choose k) p^k (1 − p)^{n−k}, so E[f((1/n)Sn)] = fn(p).
Also, Corollary 7.1 shows that X̄n = (1/n)Sn converges to p in probability.
To establish the desired result, we have to appeal to the proof of our weak law.
First, for any α > 0, Chebychev's inequality and the fact that E[X̄n] = p, Var(X̄n) = p(1 − p)/n ≤ 1/(4n) give

P(|X̄n − p| ≥ α) ≤ Var(X̄n)/α² ≤ 1/(4nα²).
Now since f is continuous on the compact set [0, 1], it is uniformly continuous and uniformly bounded. Let M = sup_{x∈[0,1]} |f(x)|, and for a given ε > 0, let δ > 0 be such that |x − y| < δ implies |f(x) − f(y)| < ε for all x, y ∈ [0, 1]. Since the absolute value function is convex, Jensen's inequality yields

|fn(p) − f(p)| = |E[f(X̄n) − f(p)]| ≤ E|f(X̄n) − f(p)| ≤ ε + 2M P(|X̄n − p| ≥ δ) ≤ ε + M/(2nδ²),

where we split the expectation according to whether or not |X̄n − p| < δ. As the bound does not depend on p, lim sup_{n→∞} sup_{p∈[0,1]} |fn(p) − f(p)| ≤ ε, and the result follows since ε was arbitrary.
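The uniform convergence of Bernstein polynomials can be observed directly. A sketch (the choice f(x) = |x − 1/2|, which is continuous but not smooth, and the grid resolution are my own):

```python
from math import comb

# Bernstein polynomial f_n(x) = sum_k C(n,k) x^k (1-x)^{n-k} f(k/n).
f = lambda x: abs(x - 0.5)

def bernstein(f, n, x):
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) * f(k / n)
               for k in range(n + 1))

# The (grid-approximated) uniform error shrinks as n grows.
grid = [i / 200 for i in range(201)]
for n in [10, 100, 400]:
    err = max(abs(bernstein(f, n, x) - f(x)) for x in grid)
    print(n, err)
```

Consistent with the proof's ε + M/(2nδ²) bound, the worst error occurs near the kink at x = 1/2 and decays slowly, on the order of 1/√n for this f.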
Our next amusing result can be interpreted as saying that a high-dimensional cube is almost a sphere.
Example 7.4. Let X1, X2, ... be independent and uniformly distributed on [−1, 1]. Then X1², X2², ... are also independent with E[X_i²] = ∫_{−1}^{1} (x²/2) dx = 1/3 and Var(X_i²) ≤ E[X_i⁴] ≤ 1, so Corollary 7.1 shows that (1/n) Σ_{i=1}^n X_i² converges to 1/3 in probability.
Now given ε ∈ (0, 1), write

A_{n,ε} = {x ∈ R^n : (1 − ε)√(n/3) ≤ ||x|| ≤ (1 + ε)√(n/3)}

where ||x|| = (x1² + ... + xn²)^{1/2} is the usual Euclidean distance, and let m denote Lebesgue measure. We have

m(A_{n,ε} ∩ [−1, 1]^n)/2^n = P((X1, ..., Xn) ∈ A_{n,ε})
= P((1 − ε)√(n/3) ≤ √(Σ_{i=1}^n X_i²) ≤ (1 + ε)√(n/3))
= P((1/3)(1 − 2ε + ε²) ≤ (1/n) Σ_{i=1}^n X_i² ≤ (1/3)(1 + 2ε + ε²))
≥ P(|(1/n) Σ_{i=1}^n X_i² − 1/3| ≤ (2ε − ε²)/3),

so that m(A_{n,ε} ∩ [−1, 1]^n)/2^n → 1 as n → ∞. In words, most of the volume of the cube [−1, 1]^n comes from A_{n,ε}, which is almost the boundary of the ball centered at the origin with radius √(n/3).
The next set of examples concern the limiting behavior of row sums of triangular arrays, for which we appeal to the following easy generalization of Theorem 7.1.

Theorem 7.2. Given a triangular array of integrable random variables {X_{n,k}}_{n∈N, 1≤k≤n}, let Sn = Σ_{k=1}^n X_{n,k} denote the nth row sum, and write µn = E[Sn], σn² = Var(Sn). If {bn} is a sequence of positive numbers with σn²/bn² → 0, then

(Sn − µn)/bn →p 0.

Proof. By assumption, E[((Sn − µn)/bn)²] = Var(Sn)/bn² → 0 as n → ∞, so the result follows since L² convergence implies convergence in probability.
Example 7.5 (Coupon Collector's Problem). Suppose that there are n distinct types of coupons and each time one obtains a coupon it is, independent of prior selections, equally likely to be any one of the types.
We are interested in the number of draws needed to obtain a complete set. To this end, let T_{n,k} denote the number of draws needed to collect k distinct types for k = 1, ..., n and note that T_{n,1} = 1. Set X_{n,1} = 1 and X_{n,k} = T_{n,k} − T_{n,k−1} for k = 2, ..., n so that X_{n,k} is the number of trials needed to obtain a type different from the first k − 1. The number of draws needed to obtain a complete set is given by

Tn := T_{n,n} = 1 + Σ_{k=2}^n (T_{n,k} − T_{n,k−1}) = 1 + Σ_{k=2}^n X_{n,k}.
Now a random variable X with P(X = m) = p(1 − p)^{m−1} is said to be geometric with success probability p.
A little calculus gives

E[X] = Σ_{m=1}^∞ m p(1 − p)^{m−1} = p Σ_{m=1}^∞ (−d/dp)(1 − p)^m
= −p (d/dp) Σ_{m=1}^∞ (1 − p)^m = −p (d/dp)[(1 − p)/p] = 1/p

and

E[X²] = Σ_{m=1}^∞ m² p(1 − p)^{m−1} = Σ_{m=1}^∞ [m(m − 1) + m] p(1 − p)^{m−1}
= p(1 − p) Σ_{m=1}^∞ m(m − 1)(1 − p)^{m−2} + Σ_{m=1}^∞ m p(1 − p)^{m−1}
= p(1 − p) (d²/dp²) Σ_{m=2}^∞ (1 − p)^m + E[X] = p(1 − p) (d²/dp²)[(1 − p)²/p] + 1/p
= 2(1 − p)/p² + 1/p = (2 − p)/p²,

hence

Var(X) = E[X²] − E[X]² = (1 − p)/p² ≤ 1/p².
It follows that

E[Tn] = 1 + Σ_{k=2}^n E[X_{n,k}] = 1 + Σ_{k=2}^n n/(n − k + 1) = 1 + n Σ_{j=1}^{n−1} 1/j = n Σ_{j=1}^{n} 1/j

and

Var(Tn) = Σ_{k=2}^n Var(X_{n,k}) ≤ Σ_{k=2}^n (n/(n − k + 1))² = n² Σ_{j=1}^{n−1} 1/j² ≤ n² Σ_{j=1}^{∞} 1/j² = π²n²/6.

Taking bn = n log(n), we have Var(Tn)/bn² ≤ π²/(6 log(n)²) → 0, so Theorem 7.2 implies

(Tn − n Σ_{k=1}^n k⁻¹)/(n log(n)) →p 0.
Using the inequality

log(n) ≤ Σ_{k=1}^n 1/k ≤ log(n) + 1

(which can be seen by bounding log(n) = ∫_1^n dx/x with the upper Riemann sum Σ_{k=1}^{n−1} 1/k ≤ Σ_{k=1}^n 1/k and the lower Riemann sum Σ_{k=2}^n 1/k = Σ_{k=1}^n 1/k − 1), we conclude that Tn/(n log(n)) →p 1.
Example 7.6 (Occupancy Problem). Suppose that we drop rn balls at random into n bins where rn/n → c.
Letting X_{n,k} = 1{bin k is empty}, the number of empty bins is Xn = Σ_{k=1}^n X_{n,k}.
It is clear that

E[Xn] = Σ_{k=1}^n E[X_{n,k}] = Σ_{k=1}^n P(bin k is empty) = n((n − 1)/n)^{rn}
and

E[Xn²] = E[Σ_{k=1}^n X_{n,k}² + 2 Σ_{i<j} X_{n,i}X_{n,j}] = Σ_{k=1}^n E[X_{n,k}] + 2 Σ_{i<j} E[X_{n,i}X_{n,j}]
= Σ_{k=1}^n P(bin k is empty) + 2 Σ_{i<j} P(bins i and j are empty)
= n((n − 1)/n)^{rn} + 2 (n choose 2)((n − 2)/n)^{rn} = n(1 − 1/n)^{rn} + n(n − 1)(1 − 2/n)^{rn},

so

Var(Xn) = E[Xn²] − E[Xn]² = n(1 − 1/n)^{rn} + n(n − 1)(1 − 2/n)^{rn} − n²(1 − 1/n)^{2rn}.
Now L'Hospital's rule gives

lim_{n→∞} log((n − 1)/n)/n⁻¹ = lim_{n→∞} [1/(n(n − 1))]/(−n⁻²) = lim_{n→∞} −n/(n − 1) = −1,

so, since rn/n → c, we have that

rn log((n − 1)/n) = (rn/n) · [log((n − 1)/n)/n⁻¹] → −c and thus ((n − 1)/n)^{rn} → e^{−c}

as n → ∞. Similarly, (1 − 2/n)^{rn} → e^{−2c} and (1 − 1/n)^{2rn} → e^{−2c}.
Consequently, E[Xn]/n = ((n − 1)/n)^{rn} → e^{−c} and

Var(Xn)/n² = (1 − 1/n)^{rn}/n + ((n − 1)/n)(1 − 2/n)^{rn} − (1 − 1/n)^{2rn} → 0 + 1 · e^{−2c} − e^{−2c} = 0

as n → ∞, so taking bn = n in Theorem 7.2 shows that the proportion of empty bins, Xn/n, converges to e^{−c} in probability.
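An illustrative simulation of the c = 1 case (bin count and seed are arbitrary choices of mine) shows the empty-bin fraction settling near e^{−1} ≈ 0.3679:

```python
import random

# Drop r_n = n balls uniformly into n bins (so c = 1); the fraction of
# empty bins should concentrate near e^{-1} ≈ 0.3679.
random.seed(7)
n = 100_000
occupied = [False] * n
for _ in range(n):
    occupied[random.randrange(n)] = True

empty_fraction = occupied.count(False) / n
print(empty_fraction)
```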
Weak Law of Large Numbers.
We begin by providing a simple analysis proof of the weak law in its classical form. The general trick is to
use truncation in order to consider cases where we have control over the size and the probability, respectively.
Theorem 7.3. Suppose that X1, X2, ... are i.i.d. with E|X1| < ∞. Let Sn = Σ_{i=1}^n X_i and µ = E[X1]. Then (1/n)Sn → µ in probability.
Proof. In what follows, the arithmetic average of the first n terms of a sequence of random variables Y1, Y2, ... will be denoted by Ȳn = (1/n) Σ_{i=1}^n Y_i.
We first note that, by replacing X_i with X_i′ = X_i − µ if necessary, we may suppose without loss of generality that E[X_i] = 0.
Thus we need to show that for given ε, δ > 0, there is an N ∈ N such that P(|X̄n| > ε) < δ whenever n ≥ N.
To this end, we pick C<∞ large enough that E [|X1 | 1 {|X1 | > C}] < η for some η to be determined.
(This is possible since |X1 | 1 {|X1 | ≤ n} ≤ |X1 | and E |X1 | < ∞, so limn→∞ E [|X1 | 1 {|X1 | ≤ n}] = E |X1 |
by the dominated convergence theorem, hence E [|X1 | 1 {|X1 | > n}] = E |X1 | − E [|X1 | 1 {|X1 | ≤ n}] → 0.)
Now define

W_i = X_i 1{|X_i| ≤ C} − E[X_i 1{|X_i| ≤ C}], Z_i = X_i 1{|X_i| > C} − E[X_i 1{|X_i| > C}],

so that X_i = W_i + Z_i (using E[X_i] = 0). The W_i are i.i.d. with mean zero and |W_i| ≤ 2C, while E|Z_i| ≤ 2E[|X1| 1{|X1| > C}] < 2η. Since the W_i are independent, Var(W̄n) = Var(W1)/n ≤ 4C²/n, and thus

E[|W̄n|]² ≤ E[W̄n²] ≤ 4C²/n

by Jensen's inequality.
Consequently, if n ≥ N := ⌈4C²/η²⌉, then E[|W̄n|] ≤ η.
Finally, Chebychev's inequality and the fact that

|X̄n| = |W̄n + Z̄n| ≤ |W̄n| + |Z̄n|

give P(|X̄n| > ε) ≤ ε⁻¹ E[|X̄n|] ≤ ε⁻¹ (E[|W̄n|] + E[|Z̄n|]) ≤ 3η/ε for all n ≥ N, which is less than δ upon choosing η = εδ/3.
We now turn to a weak law for triangular arrays which can be useful even in situations involving infinite means.
Theorem 7.4. For each n ∈ N, let X_{n,1}, ..., X_{n,n} be independent. Let {bn}_{n=1}^∞ be a sequence of positive numbers with lim_{n→∞} bn = ∞ and let X̃_{n,k} = X_{n,k} 1{|X_{n,k}| ≤ bn}. Suppose that as n → ∞

(1) Σ_{k=1}^n P(|X_{n,k}| > bn) → 0
(2) bn⁻² Σ_{k=1}^n E[X̃_{n,k}²] → 0.

If we let Sn = Σ_{k=1}^n X_{n,k} and an = Σ_{k=1}^n E[X̃_{n,k}], then

(Sn − an)/bn →p 0.
Proof. Let S̃n = Σ_{k=1}^n X̃_{n,k}. By partitioning the event {|Sn − an|/bn > ε} according to whether or not Sn = S̃n, we see that

P(|Sn − an|/bn > ε) ≤ P(Sn ≠ S̃n) + P(|S̃n − an|/bn > ε).
For the first term,

P(Sn ≠ S̃n) ≤ P(∪_{k=1}^n {X_{n,k} ≠ X̃_{n,k}}) ≤ Σ_{k=1}^n P(X_{n,k} ≠ X̃_{n,k}) = Σ_{k=1}^n P(|X_{n,k}| > bn) → 0,

where the first inequality is due to the fact that Sn ≠ S̃n implies that there is some k ∈ [n] with X_{n,k} ≠ X̃_{n,k}, and the second inequality is countable subadditivity.
For the second term, we use Chebychev's inequality, E[S̃n] = an, the independence of the X̃_{n,k}'s, and our second assumption to obtain

P(|S̃n − an|/bn > ε) ≤ ε⁻² E[((S̃n − an)/bn)²] = ε⁻² bn⁻² Var(S̃n)
= ε⁻² bn⁻² Σ_{k=1}^n Var(X̃_{n,k}) ≤ ε⁻² bn⁻² Σ_{k=1}^n E[X̃_{n,k}²] → 0.
Theorem 7.4 was so easy to prove because we assumed exactly what we needed. Essentially, these are the correct hypotheses for the weak law, but they are a little clunky, so we usually work with special cases that are easier to verify.
In order to prove our weak law for sequences of i.i.d. random variables, we need the following simple lemma.
Lemma 7.2. If Y ≥ 0 and p > 0, then E[Y^p] = ∫_0^∞ p y^{p−1} P(Y > y)dy.

We now have all the necessary ingredients for
Theorem 7.5 (Weak Law of Large Numbers). Let X1, X2, ... be i.i.d. with

xP(|X1| > x) → 0 as x → ∞.

Let Sn = X1 + ... + Xn and µn = E[X1 1{|X1| ≤ n}]. Then (1/n)Sn − µn → 0 in probability.
Proof. We will apply Theorem 7.4 with X_{n,k} = X_k and bn = n (hence an = nµn).
The first condition holds since

Σ_{k=1}^n P(|X_{n,k}| > n) = nP(|X1| > n) → 0.

For the second condition, we need

n⁻² Σ_{k=1}^n E[X̃_{n,k}²] = (1/n) E[X̃_{n,1}²] → 0.

By Lemma 7.2 with p = 2, E[X̃_{n,1}²] = ∫_0^∞ 2yP(|X̃_{n,1}| > y)dy ≤ ∫_0^n 2yP(|X1| > y)dy, so it suffices to show that (1/n) ∫_0^n 2yP(|X1| > y)dy → 0.
To see that this is the case, note that since 2yP(|X1| > y) → 0 as y → ∞, for any ε > 0, there is an N ∈ N such that 2yP(|X1| > y) < ε whenever y ≥ N. Because 2yP(|X1| > y) < 2N for y < N, we see that for all n > N,

(1/n) ∫_0^n 2yP(|X1| > y)dy = (1/n) ∫_0^N 2yP(|X1| > y)dy + (1/n) ∫_N^n 2yP(|X1| > y)dy
≤ (1/n) ∫_0^N 2N dy + (1/n) ∫_N^n ε dy = 2N²/n + ((n − N)/n)ε,

hence

lim sup_{n→∞} (1/n) ∫_0^n 2yP(|X1| > y)dy ≤ lim sup_{n→∞} [2N²/n + ((n − N)/n)ε] = ε,

and the result follows since ε was arbitrary.
Remark. Theorem 7.5 implies Theorem 7.3 since if E|X1| < ∞, then the dominated convergence theorem gives µn = E[X1 1{|X1| ≤ n}] → µ and xP(|X1| > x) ≤ E[|X1| 1{|X1| > x}] → 0 as x → ∞.
On the other hand, the improvement is not vast since xP(|X1| > x) → 0 implies that there is M ∈ N so that xP(|X1| > x) ≤ 1 for x ≥ M, and thus for any ε > 0, Lemma 7.2 with p = 1 − ε yields

E[|X1|^{1−ε}] = ∫_0^∞ (1 − ε)y^{−ε} P(|X1| > y)dy = (1 − ε) ∫_0^∞ y^{−(1+ε)} · yP(|X1| > y)dy
≤ (1 − ε) ∫_0^M y^{−ε} dy + (1 − ε) ∫_M^∞ y^{−(1+ε)} dy < ∞,

where we bounded P(|X1| > y) ≤ 1 on [0, M] and yP(|X1| > y) ≤ 1 on [M, ∞).
Example 7.7 (The St. Petersburg Paradox). Suppose that I offered to pay you 2^j dollars if it takes j flips of a fair coin for the first head to appear. That is, your winnings are given by the random variable X with P(X = 2^j) = 2^{−j} for j ∈ N. How much would you pay to play the game n times? The paradox is that E[X] = Σ_{j=1}^∞ 2^j · 2^{−j} = ∞, but most sensible people would not pay anywhere near $40 a game.
Using Theorem 7.4, we will show that a fair price for playing n times is $log2(n) per play, so that one would need to play about a trillion rounds to reasonably expect to break even at $40 a play.
Proof. To cast this problem in terms of Theorem 7.4, we will take X1, X2, ... to be independent random variables which are equal in distribution to X and set X_{n,k} = X_k. Then Sn = Σ_{k=1}^n X_k denotes your total winnings after n games. We need to choose bn so that

nP(X > bn) = Σ_{k=1}^n P(X_{n,k} > bn) → 0

and

(n/bn²) E[X² 1{X ≤ bn}] = bn⁻² Σ_{k=1}^n E[(X_{n,k} 1{|X_{n,k}| ≤ bn})²] → 0.
To this end, let m(n) = log2(n) + K(n) where K(n) is such that m(n) ∈ N and K(n) → ∞ as n → ∞.
If we set bn = 2^{m(n)} = n2^{K(n)}, we have

nP(X > bn) ≤ nP(X ≥ bn) = n Σ_{i=m(n)}^∞ 2^{−i} = n2^{−m(n)+1} = 2^{−K(n)+1} → 0

and

E[X² 1{X ≤ bn}] = Σ_{i=1}^{m(n)} 2^{2i} · 2^{−i} = 2^{m(n)+1} − 2 ≤ 2bn,

so that

(n/bn²) E[X² 1{|X| ≤ bn}] ≤ 2n/bn = 2^{−K(n)+1} → 0.
Since

an = Σ_{k=1}^n E[X_{n,k} 1{|X_{n,k}| ≤ bn}] = nE[X 1{X ≤ bn}] = n Σ_{i=1}^{m(n)} 2^i · 2^{−i} = nm(n),

Theorem 7.4 gives

(Sn − n log2(n) − nK(n))/(n2^{K(n)}) →p 0.
If we take K(n) ≤ log2(log2(n)), then the conclusion holds with n log2(n) in the denominator, so we get

Sn/(n log2(n)) →p 1.
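Two of the exact computations in the proof are easy to verify numerically (this check is my own illustration): each term of the truncated mean Σ_{i=1}^{m} 2^i · 2^{−i} contributes exactly 1, so the mean truncated at 2^m is m, and the tail probability P(X > 2^m) is 2^{−m}:

```python
# St. Petersburg payoffs: P(X = 2^j) = 2^{-j}, j = 1, 2, ...
def truncated_mean(m):
    # E[X 1{X <= 2^m}] = sum_{j=1}^{m} 2^j * 2^{-j} = m
    return sum(2**j * 2.0**-j for j in range(1, m + 1))

def tail_prob(m):
    # P(X > 2^m) = sum_{j > m} 2^{-j} = 2^{-m}
    return 2.0 ** -m

for m in [10, 20, 30]:
    print(m, truncated_mean(m), tail_prob(m))
```

With bn = 2^{m(n)} these are exactly the quantities controlling conditions (1) and (2) of Theorem 7.4, and they explain why the fair price grows like log2(n).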
8. Borel-Cantelli Lemmas
Given a sequence of events A1, A2, ..., we define

lim sup_n An := ∩_{n=1}^∞ ∪_{m=n}^∞ Am = {ω : ω is in infinitely many An},

which is often abbreviated as {An i.o.} where i.o. stands for infinitely often.
The nomenclature derives from the straightforward identity lim sup_{n→∞} 1_{An} = 1_{lim sup_n An}.
One could also define

lim inf_n An := ∪_{n=1}^∞ ∩_{m=n}^∞ Am = {ω : ω is in all but finitely many An},

but little is gained by doing so since lim inf_n An = (lim sup_n An^C)^C.
To illustrate the utility of this notion, observe that Xn → X a.s. if and only if P (|Xn − X| > ε i.o.) =0
for every ε > 0.
(First Borel-Cantelli Lemma.) If Σ_{n=1}^∞ P(An) < ∞, then P(An i.o.) = 0.

Proof. Let N = Σ_{n=1}^∞ 1_{An} denote the number of events that occur. Tonelli's theorem (or MCT) gives

E[N] = Σ_{n=1}^∞ E[1_{An}] = Σ_{n=1}^∞ P(An) < ∞,

so N < ∞ almost surely, which is precisely the statement that P(An i.o.) = 0.
Theorem 8.1. Xn →p X if and only if every subsequence {X_{n_m}} has a further subsequence {X_{n_{m(k)}}} that converges to X almost surely.

Proof. Suppose that Xn →p X and let {X_{n_m}}_{m=1}^∞ be any subsequence. Then X_{n_m} →p X, so for every k ∈ N, P(|X_{n_m} − X| > 1/k) → 0 as m → ∞. It follows that we can choose a further subsequence {X_{n_{m(k)}}}_{k=1}^∞ such that P(|X_{n_{m(k)}} − X| > 1/k) ≤ 2^{−k} for all k ∈ N. Since

Σ_{k=1}^∞ P(|X_{n_{m(k)}} − X| > 1/k) ≤ 1 < ∞,

the first Borel-Cantelli lemma shows that P(|X_{n_{m(k)}} − X| > 1/k i.o.) = 0.
Because {|X_{n_{m(k)}} − X| > ε i.o.} ⊆ {|X_{n_{m(k)}} − X| > 1/k i.o.} for every ε > 0, we see that X_{n_{m(k)}} → X a.s.
Lemma 8.2. Let {yn}_{n=1}^∞ be a sequence of elements in a topological space. If every subsequence {y_{n_m}}_{m=1}^∞ has a further subsequence {y_{n_{m(k)}}}_{k=1}^∞ that converges to y, then yn → y.
Proof. If yn does not converge to y, then there is an open set U ∋ y such that for every N ∈ N, there is an n ≥ N with yn ∉ U. The corresponding subsequence lies entirely outside of U, so it can have no further subsequence converging to y, a contradiction.
For the other direction of Theorem 8.1, applying Lemma 8.2 to the sequence yn = P(|Xn − X| > ε) for an arbitrary ε > 0 shows that Xn →p X.
Remark. Since there are sequences which converge in probability but not almost surely (e.g. Example 7.1),
it follows from Theorem 8.1 and Lemma 8.2 that a.s. convergence does not come from a topology.
(In contrast, one of the homework problems shows that convergence in probability is metrizable.)
Theorem 8.1 can sometimes be used to upgrade results depending on almost sure convergence.
For example, you are asked to show in your homework that the assumptions in Fatou's lemma and the dominated convergence theorem can be weakened from almost sure convergence to convergence in probability.
Theorem 8.2. If $f$ is continuous and $X_n \to_p X$, then $f(X_n) \to_p f(X)$. If, in addition, $f$ is bounded, then $E[f(X_n)] \to E[f(X)]$.

Proof. If $\{X_{n_m}\}$ is a subsequence, then Theorem 8.1 guarantees the existence of a further subsequence $\{X_{n_{m(k)}}\}$ which converges to $X$ a.s. Since limits commute with continuous functions, this means that $f(X_{n_{m(k)}}) \to f(X)$ a.s. The other direction of Theorem 8.1 now implies that $f(X_n) \to_p f(X)$.

If $f$ is bounded as well, then the dominated convergence theorem yields $E[f(X_{n_{m(k)}})] \to E[f(X)]$. Applying Lemma 8.2 to the sequence $y_n = E[f(X_n)]$ establishes the second part of the theorem.
We will now use the first Borel-Cantelli lemma to prove a weak form of the Strong Law of Large Numbers.

Theorem 8.3. Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_1] = \mu$ and $E[X_1^4] < \infty$. If $S_n = X_1 + \cdots + X_n$, then $\frac{1}{n}S_n \to \mu$ almost surely.

Proof. By taking $X_i' = X_i - \mu$, we can suppose without loss of generality that $\mu = 0$. Now
$$E[S_n^4] = E\left[\left(\sum_{i=1}^n X_i\right)\left(\sum_{j=1}^n X_j\right)\left(\sum_{k=1}^n X_k\right)\left(\sum_{l=1}^n X_l\right)\right] = \sum_{1 \leq i,j,k,l \leq n} E[X_i X_j X_k X_l].$$
By independence, terms of the form $E[X_i^3 X_j]$, $E[X_i^2 X_j X_k]$ and $E[X_i X_j X_k X_l]$ (indices distinct) are all zero (since the factors are independent with mean zero). The surviving terms are the $n$ terms $E[X_i^4]$ and the $3n(n-1)$ terms $E[X_i^2 X_j^2] = E[X_1^2]^2 \leq E[X_1^4]$, so
$$E[S_n^4] = nE[X_1^4] + 3n(n-1)E[X_1^2]^2 \leq Cn^2,$$
where $C = 3E[X_1^4] < \infty$ by assumption.

It follows from Chebychev's inequality that
$$P\left(\frac{1}{n}|S_n| > \varepsilon\right) = P\left(|S_n|^4 > (n\varepsilon)^4\right) \leq \frac{E[S_n^4]}{n^4\varepsilon^4} \leq \frac{C}{n^2\varepsilon^4},$$
hence
$$\sum_{n=1}^{\infty} P\left(\frac{1}{n}|S_n| > \varepsilon\right) \leq C\varepsilon^{-4}\sum_{n=1}^{\infty} \frac{1}{n^2} < \infty.$$
Therefore, $P\left(\frac{1}{n}|S_n| > \varepsilon \text{ i.o.}\right) = 0$ by Borel-Cantelli, so, since $\varepsilon > 0$ was arbitrary, $\frac{1}{n}S_n \to 0$ a.s.
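One can watch the running averages settle along a single sample path. The sketch below uses Uniform$(\mu-1,\mu+1)$ increments, an arbitrary bounded choice (so certainly $E[X_1^4] < \infty$), and records the worst deviation over the second half of the path:

```python
import random

def max_tail_deviation(mu, n, rng):
    """Largest |S_k/k - mu| over the second half of one sample path of length n."""
    s, worst = 0.0, 0.0
    for k in range(1, n + 1):
        s += rng.uniform(mu - 1.0, mu + 1.0)  # bounded increments, all moments finite
        if k > n // 2:
            worst = max(worst, abs(s / k - mu))
    return worst

rng = random.Random(42)
deviation = max_tail_deviation(2.0, 50_000, rng)
print(deviation)  # small: the running average has nearly converged to mu = 2
```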
Second Borel-Cantelli Lemma. If $A_1, A_2, \ldots$ are independent and $\sum_{n=1}^{\infty} P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$.

Proof. For each $n \in \mathbb{N}$, the sequence $B_{n,1}, B_{n,2}, \ldots$ defined by $B_{n,k} = \bigcap_{m=n}^{n+k} A_m^C$ decreases to $B_n := \bigcap_{m=n}^{\infty} A_m^C$. Also, since the $A_m$'s (and thus their complements) are independent, we have
$$P(B_{n,k}) = P\left(\bigcap_{m=n}^{n+k} A_m^C\right) = \prod_{m=n}^{n+k} P(A_m^C) = \prod_{m=n}^{n+k} (1 - P(A_m)) \leq \prod_{m=n}^{n+k} e^{-P(A_m)} = e^{-\sum_{m=n}^{n+k} P(A_m)},$$
where the inequality is due to the Taylor series bound $e^{-x} \geq 1 - x$ for $x \in [0,1]$.

Because $\sum_{m=n}^{\infty} P(A_m) = \infty$ by assumption, it follows from continuity from above that
$$P(B_n) = \lim_{k\to\infty} P(B_{n,k}) \leq \lim_{k\to\infty} e^{-\sum_{m=n}^{n+k} P(A_m)} = 0,$$
hence $P\left(\bigcup_{m=n}^{\infty} A_m\right) = P(B_n^C) = 1$ for all $n \in \mathbb{N}$.

Since $\bigcup_{m=n}^{\infty} A_m \searrow \limsup_{n\to\infty} A_n = \{A_n \text{ i.o.}\}$, another application of continuity from above gives
$$P(A_n \text{ i.o.}) = \lim_{n\to\infty} P\left(\bigcup_{m=n}^{\infty} A_m\right) = 1.$$
Taken together, the Borel-Cantelli lemmas show that if $A_1, A_2, \ldots$ is a sequence of independent events, then $P(A_n \text{ i.o.})$ is $0$ or $1$ according as $\sum_{n=1}^{\infty} P(A_n)$ converges or diverges. In particular, independent repetition of an experiment will almost surely result in infinitely many realizations of any event having positive probability.

For example, given any finite string from a finite alphabet (e.g. the complete works of Shakespeare in chronological order), an infinite string with characters chosen independently and uniformly from the alphabet (produced by the proverbial monkey at a typewriter, say) will almost surely contain infinitely many instances of said string.

Similarly, many leading cosmological theories imply the existence of infinitely many universes which may be regarded as being i.i.d. with the current state of our universe having positive probability. If any of these theories is true, then Borel-Cantelli says that there are infinitely many copies of us throughout the multiverse.
A more serious application demonstrates the necessity of the integrability assumption in the strong law.
Theorem 8.4. If $X_1, X_2, \ldots$ are i.i.d. with $E|X_1| = \infty$, then $P(|X_n| \geq n \text{ i.o.}) = 1$. Thus if $S_n = \sum_{i=1}^n X_i$, then $P\left(\lim_{n\to\infty} \frac{S_n}{n} \text{ exists in } \mathbb{R}\right) = 0$.
Proof. Lemma 7.2 and the fact that $G(x) := P(|X_1| > x)$ is nonincreasing give
$$E|X_1| = \int_0^{\infty} P(|X_1| > x)\,dx \leq \sum_{n=0}^{\infty} P(|X_1| > n) \leq \sum_{n=0}^{\infty} P(|X_1| \geq n).$$
Because $E|X_1| = \infty$ and the $X_n$'s are i.i.d., it follows from the second Borel-Cantelli lemma that $P(|X_n| \geq n \text{ i.o.}) = 1$.
To establish the second claim we will show that $C = \left\{\omega : \lim_{n\to\infty} \frac{S_n(\omega)}{n} \text{ exists in } \mathbb{R}\right\}$ and $\{|X_n| \geq n \text{ i.o.}\}$ are disjoint. Suppose that $\omega \in C$, so that $\frac{S_n(\omega)}{n(n+1)} \to 0$; in particular, there is an $N$ with $\left|\frac{S_n(\omega)}{n(n+1)}\right| < \frac{1}{2}$ for all $n \geq N$. If $\omega \in \{|X_n| \geq n \text{ i.o.}\}$ as well, then there would be infinitely many $n \geq N$ with $\frac{|X_{n+1}(\omega)|}{n+1} \geq 1$. But this would mean that
$$\left|\frac{S_n(\omega)}{n} - \frac{S_{n+1}(\omega)}{n+1}\right| = \left|\frac{S_n(\omega)}{n(n+1)} - \frac{X_{n+1}(\omega)}{n+1}\right| > \frac{1}{2} \text{ for infinitely many } n,$$
so that the sequence $\left\{\frac{S_n(\omega)}{n}\right\}_{n=1}^{\infty}$ is not Cauchy, contradicting $\omega \in C$. Since $P(|X_n| \geq n \text{ i.o.}) = 1$, it follows that $P(C) = 0$.
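The standard Cauchy distribution (density $\frac{1}{\pi(1+x^2)}$, so $E|X_1| = \infty$) illustrates the theorem. The inverse-CDF sampler and the comparison distribution below are illustrative choices:

```python
import math
import random

def standard_cauchy(rng):
    """Sample a standard Cauchy variable by inverting its CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(13)
n_max = 200_000
# |X_n| >= n happens infinitely often for Cauchy samples (E|X_1| = infinity) ...
cauchy_hits = sum(abs(standard_cauchy(rng)) >= n for n in range(1, n_max + 1))
# ... but essentially never for a bounded distribution like Uniform(-1, 1).
uniform_hits = sum(abs(rng.uniform(-1.0, 1.0)) >= n for n in range(1, n_max + 1))
print(cauchy_hits, uniform_hits)
```

For the Cauchy case the expected number of hits up to $N$ grows like $\frac{2}{\pi}\log N$, so the count keeps climbing as `n_max` grows.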
Our next example is a typical application where the two Borel-Cantelli lemmas are used together to obtain
results on the limit superior of a (suitably scaled) sequence of i.i.d. random variables.
Example 8.2. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. exponential random variables with rate $1$ (so that $X_i \geq 0$ with $P(X_i \leq x) = 1 - e^{-x}$). We will show that $\limsup_{n\to\infty} \frac{X_n}{\log(n)} = 1$ almost surely.
First observe that
$$P\left(\frac{X_n}{\log(n)} \geq 1\right) = P(X_n \geq \log(n)) = P(X_n > \log(n)) = e^{-\log(n)} = \frac{1}{n},$$
so
$$\sum_{n=1}^{\infty} P\left(\frac{X_n}{\log(n)} \geq 1\right) = \sum_{n=1}^{\infty} \frac{1}{n} = \infty,$$
thus, since the $X_n$'s are independent, the second Borel-Cantelli lemma implies that $P\left(\frac{X_n}{\log(n)} \geq 1 \text{ i.o.}\right) = 1$, and we conclude that $\limsup_{n\to\infty} \frac{X_n}{\log(n)} \geq 1$ almost surely.
On the other hand, for any $\varepsilon > 0$,
$$P\left(\frac{X_n}{\log(n)} \geq 1 + \varepsilon\right) = P(X_n > (1+\varepsilon)\log(n)) = \frac{1}{n^{1+\varepsilon}},$$
which is summable, so it follows from the first Borel-Cantelli lemma that $P\left(\frac{X_n}{\log(n)} \geq 1 + \varepsilon \text{ i.o.}\right) = 0$. Since $\varepsilon > 0$ was arbitrary, this means that $\limsup_{n\to\infty} \frac{X_n}{\log(n)} \leq 1$ almost surely, and the claim is proved.
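The two-sided conclusion is visible in simulation. A minimal sketch (the sample size and the index range over which the maximum is taken are arbitrary choices):

```python
import math
import random

rng = random.Random(7)
n_max = 200_000
# The ratios X_n / log(n) for n = 2, ..., n_max:
ratios = [rng.expovariate(1.0) / math.log(n) for n in range(2, n_max + 1)]

# Over a late block of indices the largest ratio hovers near 1,
# matching limsup X_n / log(n) = 1.
late_peak = max(ratios[100_000:])
print(late_peak)  # typically close to 1
```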
We conclude with a cute example in which an a.s. convergence result cannot be upgraded to pointwise
convergence.
Example 8.3. We will show that for any sequence of random variables $\{X_n\}_{n=1}^{\infty}$, one can find a sequence of real numbers $\{c_n\}_{n=1}^{\infty}$ such that $\frac{X_n}{c_n} \to 0$ a.s., but that in general, no such sequence can be found such that the convergence is pointwise.
The first statement is an easy application of the first Borel-Cantelli lemma: Given $\{X_n\}_{n=1}^{\infty}$, let $\{c_n\}_{n=1}^{\infty}$ be a sequence of positive numbers such that $P\left(|X_n| > \frac{c_n}{n}\right) \leq 2^{-n}$. Such a sequence can be found since $P(|X_n| > M) \to 0$ as $M \to \infty$. Then
$$\sum_{n=1}^{\infty} P\left(\left|\frac{X_n}{c_n}\right| > \frac{1}{n}\right) \leq \sum_{n=1}^{\infty} 2^{-n} = 1 < \infty,$$
so the first Borel-Cantelli lemma gives $P\left(\left|\frac{X_n}{c_n}\right| > \frac{1}{n} \text{ i.o.}\right) = 0$. Hence for all $\varepsilon > 0$, $P\left(\left|\frac{X_n}{c_n}\right| > \varepsilon \text{ i.o.}\right) \leq P\left(\left|\frac{X_n}{c_n}\right| > \frac{1}{n} \text{ i.o.}\right) = 0$, so $\frac{X_n}{c_n} \to 0$ a.s.
To see this, let $C$ denote the Cantor set. Since $C$ has the cardinality of the continuum, there is a bijection
$$f : C \to \{\{a_n\}_{n=1}^{\infty} : a_n \in \mathbb{N} \text{ for all } n\}.$$
Define the random variables $\{X_n\}_{n=1}^{\infty}$ on $[0,1]$ with the Borel sets and Lebesgue measure by
$$X_n(\omega) = \begin{cases} f(\omega)_n + 1, & \omega \in C \\ 1, & \omega \notin C \end{cases},$$
where $f(\omega)_n$ denotes the $n$th term of the sequence $f(\omega)$. For any sequence $\{c_n\}_{n=1}^{\infty}$, the sequence $\{\tilde{c}_n\}_{n=1}^{\infty}$ defined by $\tilde{c}_n = \lceil |c_n| \rceil$ is equal to $f(\omega')$ for some $\omega' \in C$, hence $\left|\frac{X_n(\omega')}{c_n}\right| > 1$ for all $n$, so there is no sequence of reals for which the convergence is sure.
9. Strong Law of Large Numbers
Our goal at this point is to strengthen the conclusion of Theorem 7.3 from convergence in probability to almost sure convergence.

Theorem 9.1 (Strong Law of Large Numbers). Suppose that $X_1, X_2, \ldots$ are pairwise independent and identically distributed with $E|X_1| < \infty$. Let $S_n = \sum_{k=1}^n X_k$ and $\mu = E[X_1]$. Then $\frac{1}{n}S_n \to \mu$ almost surely as $n \to \infty$.
Proof. We begin by noting that $X_k^+ = \max\{X_k, 0\}$ and $X_k^- = \max\{-X_k, 0\}$ satisfy the theorem's assumptions, so, since $X_k = X_k^+ - X_k^-$, we may suppose without loss of generality that the $X_k$'s are nonnegative.

Next, we truncate by setting $Y_k = X_k 1\{X_k \leq k\}$ and $T_n = Y_1 + \cdots + Y_n$. Since
$$\sum_{k=1}^{\infty} P(X_k \neq Y_k) = \sum_{k=1}^{\infty} P(X_k > k) = \sum_{k=1}^{\infty} P(X_1 > k) \leq \int_0^{\infty} P(X_1 > t)\,dt = E|X_1| < \infty,$$
the first Borel-Cantelli lemma gives $P(X_k \neq Y_k \text{ i.o.}) = 0$. Thus for all $\omega$ in a set of probability one, $\sup_n |S_n(\omega) - T_n(\omega)| < \infty$, hence $\frac{S_n}{n} - \frac{T_n}{n} \to 0$ a.s., and it suffices to show that $\frac{T_n}{n} \to \mu$ almost surely.

The truncation step should not be too surprising as it is generally easier to work with bounded random variables. The reason that we reduced the problem to the $X_k \geq 0$ case is that this ensures that the sequence $T_1, T_2, \ldots$ is nondecreasing.
Our strategy will be to prove convergence along a cleverly chosen subsequence and then exploit monotonicity to interpolate. Specifically, for $\alpha > 1$, let $k(n) = \lfloor \alpha^n \rfloor$, the greatest integer less than or equal to $\alpha^n$. Chebychev's inequality and Tonelli's theorem give
$$\sum_{n=1}^{\infty} P\left(\left|T_{k(n)} - E[T_{k(n)}]\right| > \varepsilon k(n)\right) \leq \sum_{n=1}^{\infty} \frac{\operatorname{Var}(T_{k(n)})}{\varepsilon^2 k(n)^2} = \varepsilon^{-2}\sum_{n=1}^{\infty} k(n)^{-2}\sum_{m=1}^{k(n)} \operatorname{Var}(Y_m)$$
$$= \varepsilon^{-2}\sum_{m=1}^{\infty} \operatorname{Var}(Y_m)\sum_{n : k(n) \geq m} k(n)^{-2} \leq \varepsilon^{-2}\sum_{m=1}^{\infty} E[Y_m^2]\sum_{n : \alpha^n \geq m} \lfloor \alpha^n \rfloor^{-2}.$$
Since $\lfloor \alpha^n \rfloor \geq \frac{1}{2}\alpha^n$ for $n \geq 1$ (by casing out according to whether $\alpha^n$ is smaller or bigger than $2$),
$$\sum_{n : \alpha^n \geq m} \lfloor \alpha^n \rfloor^{-2} \leq 4\sum_{n \geq \log_\alpha m} \alpha^{-2n} \leq 4\alpha^{-2\log_\alpha m}\sum_{n=0}^{\infty} \alpha^{-2n} = 4(1 - \alpha^{-2})^{-1} m^{-2},$$
hence
$$\sum_{n=1}^{\infty} P\left(\left|T_{k(n)} - E[T_{k(n)}]\right| > \varepsilon k(n)\right) \leq 4(1 - \alpha^{-2})^{-1}\varepsilon^{-2}\sum_{m=1}^{\infty} \frac{E[Y_m^2]}{m^2}.$$
At this point, we note that

Claim 9.2. $\displaystyle\sum_{m=1}^{\infty} \frac{E[Y_m^2]}{m^2} < \infty$.

Proof. Lemma 7.2 gives
$$\sum_{m=1}^{\infty} \frac{E[Y_m^2]}{m^2} \leq \sum_{m=1}^{\infty} m^{-2}\int_0^m 2y\,P(X_1 > y)\,dy = \int_0^{\infty} 2y\left(\sum_{m > y} m^{-2}\right)P(X_1 > y)\,dy.$$
Since $\int_0^{\infty} P(X_1 > y)\,dy = E[X_1] < \infty$, we will be done if we can show that $y\sum_{m>y} m^{-2}$ is uniformly bounded. For $y < 1$,
$$y\sum_{m>y} m^{-2} \leq \sum_{m=1}^{\infty} m^{-2} = \frac{\pi^2}{6} < 2,$$
while for $y \geq 1$, $y\sum_{m>y} m^{-2} \leq y\int_{\lfloor y \rfloor}^{\infty} x^{-2}\,dx = \frac{y}{\lfloor y \rfloor} \leq 2$, which proves the claim.
It follows that $\sum_{n=1}^{\infty} P\left(\left|T_{k(n)} - E[T_{k(n)}]\right| > \varepsilon k(n)\right) < \infty$, so, since $\varepsilon > 0$ is arbitrary, the first Borel-Cantelli lemma implies that
$$\frac{T_{k(n)} - E[T_{k(n)}]}{k(n)} \to 0 \text{ a.s.}$$
Now $\lim_{k\to\infty} E[Y_k] = E[X_1]$ by the dominated convergence theorem, so $\lim_{n\to\infty} \frac{E[T_{k(n)}]}{k(n)} = E[X_1]$. Thus we have shown that $\frac{T_{k(n)}}{k(n)} \to \mu$ almost surely.
Finally, if $k(n) \leq m < k(n+1)$, then, since the $Y_k$'s are nonnegative, $T_{k(n)} \leq T_m \leq T_{k(n+1)}$ and thus
$$\frac{k(n)}{k(n+1)} \cdot \frac{T_{k(n)}}{k(n)} \leq \frac{T_m}{m} \leq \frac{k(n+1)}{k(n)} \cdot \frac{T_{k(n+1)}}{k(n+1)}.$$
Since $\frac{k(n+1)}{k(n)} \to \alpha$ as $n \to \infty$, it follows that
$$\frac{\mu}{\alpha} \leq \liminf_{m\to\infty} \frac{T_m}{m} \leq \limsup_{m\to\infty} \frac{T_m}{m} \leq \alpha\mu \text{ a.s.}$$
Letting $\alpha \searrow 1$ along a countable sequence gives $\frac{T_m}{m} \to \mu$ a.s., and the theorem follows.
The next result shows that the strong law holds whenever $E[X_1]$ exists.

Theorem 9.2. Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_1^+] = \infty$ and $E[X_1^-] < \infty$. Then $\frac{1}{n}S_n \to \infty$ a.s.

Proof. For any $M \in \mathbb{N}$, let $X_i^M = X_i \wedge M$. Then the $X_i^M$'s are i.i.d. with $E|X_1^M| < \infty$, so, writing $S_n^M = \sum_{i=1}^n X_i^M$, it follows from Theorem 9.1 that $\frac{1}{n}S_n^M \to E[X_1^M]$ almost surely as $n \to \infty$. Now $X_i \geq X_i^M$ for all $M$, so
$$\liminf_{n\to\infty} \frac{S_n}{n} \geq \lim_{n\to\infty} \frac{S_n^M}{n} = E[X_1^M].$$
The monotone convergence theorem implies that
$$\lim_{M\to\infty} E\left[(X_1^M)^+\right] = E\left[\lim_{M\to\infty} (X_1^M)^+\right] = E[X_1^+] = \infty,$$
so
$$E[X_1^M] = E\left[(X_1^M)^+\right] - E\left[(X_1^M)^-\right] = E\left[(X_1^M)^+\right] - E[X_1^-] \nearrow \infty,$$
thus $\liminf_{n\to\infty} \frac{S_n}{n} \geq \lim_{M\to\infty} E[X_1^M] = \infty$ a.s. and the theorem follows.
Our first application of the strong law of large numbers comes from renewal theory.

Example 9.1. Let $X_1, X_2, \ldots$ be i.i.d. with $0 < X_1 < \infty$, and let $T_n = X_1 + \cdots + X_n$. Here we are thinking of the $X_i$'s as times between successive occurrences of events and $T_n$ as the time until the $n$th event occurs.

For example, consider a janitor who replaces a light bulb the instant it burns out. The first bulb is put in at time $0$ and $X_i$ is the lifetime of the $i$th bulb. Then $T_n$ is the time that the $n$th bulb burns out and $N_t = \sup\{n : T_n \leq t\}$ is the number of light bulbs that have burned out by time $t$.
Theorem 9.3 (Elementary Renewal Theorem). If $E[X_1] = \mu \leq \infty$, then $\frac{N_t}{t} \to \frac{1}{\mu}$ a.s. as $t \to \infty$ (with the convention that $\frac{1}{\infty} = 0$).
Proof. Theorems 9.1 and 9.2 imply that $\lim_{n\to\infty} \frac{T_n}{n} = \mu$ a.s., and it follows from the definition of $N_t$ that $T_{N_t} \leq t < T_{N_t+1}$, hence
$$\frac{T_{N_t}}{N_t} \leq \frac{t}{N_t} < \frac{T_{N_t+1}}{N_t+1} \cdot \frac{N_t+1}{N_t}.$$
Since $T_n < \infty$ for all $n$, we have that $N_t \nearrow \infty$ as $t \nearrow \infty$. Thus there is a set $\Omega_0$ with $P(\Omega_0) = 1$ such that $\lim_{n\to\infty} \frac{T_n(\omega)}{n} = \mu$ and $\lim_{t\to\infty} N_t(\omega) = \infty$ for all $\omega \in \Omega_0$. For such $\omega$, both sides of the display above converge to $\mu$, hence $\frac{t}{N_t(\omega)} \to \mu$ and thus $\frac{N_t(\omega)}{t} \to \frac{1}{\mu}$.
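The light-bulb picture is easy to simulate. A minimal sketch (exponential lifetimes with mean $\mu = 2$ and the horizon $t$ are arbitrary choices):

```python
import random

def renewals_by_time(t, mean_life, rng):
    """N_t: the number of bulbs burned out by time t, with i.i.d. exponential lifetimes."""
    elapsed, burned_out = 0.0, 0
    while True:
        elapsed += rng.expovariate(1.0 / mean_life)  # lifetime of the next bulb
        if elapsed > t:
            return burned_out
        burned_out += 1

rng = random.Random(11)
mu, t = 2.0, 100_000.0
rate = renewals_by_time(t, mu, rng) / t
print(rate)  # close to 1/mu = 0.5
```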
Example 9.2. A common situation in statistics is that one has a sequence of random variables which is assumed to be i.i.d., but the underlying distribution is unknown. A popular estimate for the true distribution function $F(x) = P(X_1 \leq x)$ is the empirical distribution function
$$F_n(x) = \frac{1}{n}\sum_{i=1}^n 1_{(-\infty,x]}(X_i).$$
That is, one approximates the true probability of being at most x with the observed frequency of values ≤x
in the sample. The strong law provides some justication for this method of inference by showing that for
every x ∈ R, Fn (x) → F (x) almost surely as n → ∞. The next result shows that the convergence is actually
uniform in x.
Theorem (Glivenko-Cantelli). With $F_n$ as above, $\sup_{x\in\mathbb{R}} |F_n(x) - F(x)| \to 0$ almost surely as $n \to \infty$.

Proof. Fix $x \in \mathbb{R}$ and let $Y_n = 1\{X_n < x\}$. Then $Y_1, Y_2, \ldots$ are i.i.d. with $E[Y_1] = P(X_1 < x) = F(x^-)$, so the strong law implies that $F_n(x^-) = \frac{1}{n}\sum_{i=1}^n Y_i \to F(x^-)$ a.s. as $n \to \infty$. Similarly, $F_n(x) \to F(x)$ a.s.

In general, for any countable collection $\{x_i\} \subseteq \mathbb{R}$, there is a set $\Omega_0$ with $P(\Omega_0) = 1$ such that $F_n(x_i)(\omega) \to F(x_i)$ and $F_n(x_i^-)(\omega) \to F(x_i^-)$ for all $\omega \in \Omega_0$.

For each $k \in \mathbb{N}$, set $x_{j,k} = \inf\left\{y : F(y) \geq \frac{j}{k}\right\}$ for $j = 1, \ldots, k-1$, and $x_{0,k} = -\infty$, $x_{k,k} = \infty$. The pointwise convergence of $F_n(x)$ and $F_n(x^-)$ along the countable collection $\{x_{j,k}\}$ gives (for $\omega$ in a set of full probability) an $N_k$ such that $|F_n(x_{j,k}) - F(x_{j,k})| < \frac{1}{k}$ and $|F_n(x_{j,k}^-) - F(x_{j,k}^-)| < \frac{1}{k}$ for all $j$ whenever $n \geq N_k$.

Thus if $x_{j-1,k} < x < x_{j,k}$ with $1 \leq j \leq k$ and $n \geq N_k$, then the inequality $F(x_{j,k}^-) - F(x_{j-1,k}) \leq \frac{1}{k}$ and the monotonicity of $F_n$ and $F$ imply
$$F_n(x) \leq F_n(x_{j,k}^-) \leq F(x_{j,k}^-) + \frac{1}{k} \leq F(x_{j-1,k}) + \frac{2}{k} \leq F(x) + \frac{2}{k},$$
$$F_n(x) \geq F_n(x_{j-1,k}) \geq F(x_{j-1,k}) - \frac{1}{k} \geq F(x_{j,k}^-) - \frac{2}{k} \geq F(x) - \frac{2}{k}.$$
Consequently, we have $\sup_{x\in\mathbb{R}} |F_n(x) - F(x)| \leq \frac{2}{k}$ for all $n \geq N_k$, and the theorem follows.
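The uniform convergence can be computed exactly for a concrete distribution. The sketch below takes Uniform$(0,1)$ samples, where $F(x) = x$; since $F_n$ is a step function, the supremum is attained at the order statistics, so it suffices to check both sides of every jump:

```python
import random

def sup_distance_uniform(sample):
    """sup_x |F_n(x) - x| for a sample assumed to come from Uniform(0,1)."""
    xs = sorted(sample)
    n = len(xs)
    # F_n jumps from i/n to (i+1)/n at the i-th order statistic (0-indexed).
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(5)
small = sup_distance_uniform([rng.random() for _ in range(100)])
large = sup_distance_uniform([rng.random() for _ in range(10_000)])
print(small, large)  # the discrepancy shrinks as n grows (at rate about 1/sqrt(n))
```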
10. Random Series
We now give an alternative proof of the SLLN which allows us to introduce some other interesting results.

Definition. Given a sequence of random variables $X_1, X_2, \ldots$, we define the tail $\sigma$-field $\mathcal{T} = \bigcap_{n=1}^{\infty} \sigma(X_n, X_{n+1}, \ldots)$.
Our next theorem is an example of a $0$-$1$ law, that is, a statement that certain classes of events are trivial in the sense that each has probability $0$ or $1$.
Theorem 10.1 (Kolmogorov). If X1 , X2 , ... are independent and A ∈ T , then P (A) ∈ {0, 1}.
Proof. We will show that $A$ is independent of itself, so that $P(A) = P(A \cap A) = P(A)P(A) = P(A)^2$, which forces $P(A) \in \{0,1\}$.

First, note that $B \in \sigma(X_1, \ldots, X_k)$ and $C \in \sigma(X_{k+1}, X_{k+2}, \ldots)$ are independent. This follows from Lemma 6.1 if $C \in \sigma(X_{k+1}, \ldots, X_{k+j})$. Since $\sigma(X_1, \ldots, X_k)$ and $\bigcup_{j=1}^{\infty} \sigma(X_{k+1}, \ldots, X_{k+j})$ are $\pi$-systems, Theorem 6.1 shows that the claim holds in general.

Second, note that $E \in \bigcup_{k=1}^{\infty} \sigma(X_1, \ldots, X_k)$ and $F \in \mathcal{T}$ are independent. If $E \in \sigma(X_1, \ldots, X_k)$, then this follows from the previous observation since $F \in \mathcal{T} \subseteq \sigma(X_{k+1}, X_{k+2}, \ldots)$. Since $\bigcup_{k=1}^{\infty} \sigma(X_1, \ldots, X_k)$ and $\mathcal{T}$ are $\pi$-systems, Theorem 6.1 shows it is true in general.

Because $\mathcal{T} \subseteq \sigma(X_1, X_2, \ldots)$, the last observation shows that $A \in \mathcal{T}$ is independent of itself.
Example 10.1. If $B_1, B_2, \ldots \in \mathcal{B}$, then $\{X_n \in B_n \text{ i.o.}\} \in \mathcal{T}$. Taking $X_n = 1_{A_n}$, $B_n = \{1\}$, we have $\{X_n \in B_n \text{ i.o.}\} = \{A_n \text{ i.o.}\}$, so Theorem 10.1 shows that if $A_1, A_2, \ldots$ are independent, then $P(A_n \text{ i.o.}) \in \{0,1\}$. Of course, this also follows from the Borel-Cantelli lemmas.
The first item in the previous example shows that sums of independent random variables either converge a.s. or diverge a.s. Our next result can be useful in determining when the former is the case.
Theorem 10.2 (Kolmogorov's maximal inequality). Suppose that $X_1, X_2, \ldots$ are independent with $E[X_k] = 0$ and $\operatorname{Var}(X_k) < \infty$, and let $S_n = X_1 + \cdots + X_n$. Then
$$P\left(\max_{1\leq k\leq n} |S_k| \geq x\right) \leq \frac{\operatorname{Var}(S_n)}{x^2}.$$

Remark. Note that under the same hypotheses, Chebychev only gives $P(|S_n| \geq x) \leq \frac{\operatorname{Var}(S_n)}{x^2}$.
Proof. We will partition the event in question according to the first time that the sum exceeds $x$ by defining
$$A_k = \left\{|S_k| \geq x \text{ and } |S_j| < x \text{ for all } j < k\right\}.$$
Since the $A_k$'s are disjoint,
$$E[S_n^2] \geq \sum_{k=1}^n \int_{A_k} S_n^2\,dP = \sum_{k=1}^n \int_{A_k} \left(S_k + (S_n - S_k)\right)^2 dP \geq \sum_{k=1}^n \int_{A_k} S_k^2\,dP + 2\sum_{k=1}^n \int_{A_k} S_k(S_n - S_k)\,dP.$$
Our assumptions guarantee that $S_k 1_{A_k} \in \sigma(X_1, \ldots, X_k)$ and $S_n - S_k \in \sigma(X_{k+1}, \ldots, X_n)$ are independent and $E[S_n - S_k] = 0$, so
$$\int S_k 1_{A_k}(S_n - S_k)\,dP = E[S_k 1_{A_k}(S_n - S_k)] = E[S_k 1_{A_k}]\,E[S_n - S_k] = 0.$$
Accordingly, we have
$$E[S_n^2] \geq \sum_{k=1}^n \int_{A_k} S_k^2\,dP \geq x^2\sum_{k=1}^n P(A_k) = x^2 P\left(\max_{1\leq k\leq n} |S_k| \geq x\right).$$
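One can compare the empirical maximal probability with Kolmogorov's bound; the sketch below uses $\pm 1$ steps, and all parameter values are arbitrary:

```python
import random

rng = random.Random(9)
n, x, trials = 1000, 50.0, 2000
exceedances = 0
for _ in range(trials):
    s, peak = 0.0, 0.0
    for _ in range(n):
        s += rng.choice((-1.0, 1.0))
        peak = max(peak, abs(s))
    exceedances += peak >= x  # did max_{k<=n} |S_k| reach x on this path?

empirical = exceedances / trials
kolmogorov_bound = n / x**2  # Var(S_n)/x^2 = n/x^2 since Var(X_k) = 1
print(empirical, kolmogorov_bound)  # the empirical frequency respects the bound
```

The bound is typically far from tight (here the true probability is well below $n/x^2$), which is consistent with it being a worst-case Chebychev-type estimate.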
We now have the tools needed to provide a sufficient criterion for the a.s. convergence of random series. (As usual, a series is said to converge if its sequence of partial sums converges.)

Theorem 10.3 (Kolmogorov's two-series theorem). Suppose $X_1, X_2, \ldots$ are independent with $E[X_n] = \mu_n$ and $\operatorname{Var}(X_n) = \sigma_n^2$. If $\sum_{n=1}^{\infty} \mu_n$ converges in $\mathbb{R}$ and $\sum_{n=1}^{\infty} \sigma_n^2 < \infty$, then $\sum_{n=1}^{\infty} X_n$ converges almost surely.
Proof. Since $\operatorname{Var}(X_n - \mu_n) = \operatorname{Var}(X_n)$, and convergence of $\sum_{n=1}^{\infty} \mu_n$ means that $\sum_{n=1}^{\infty} (X_n(\omega) - \mu_n)$ converges if and only if $\sum_{n=1}^{\infty} X_n(\omega)$ converges, we may assume without loss of generality that $E[X_n] = 0$.

Let $S_N = \sum_{n=1}^N X_n$. Theorem 10.2 gives
$$P\left(\max_{M\leq m\leq N} |S_m - S_M| > \varepsilon\right) \leq \varepsilon^{-2}\operatorname{Var}(S_N - S_M) = \varepsilon^{-2}\sum_{n=M+1}^N \operatorname{Var}(X_n).$$
Letting $N \to \infty$ gives
$$P\left(\sup_{m\geq M} |S_m - S_M| > \varepsilon\right) \leq \varepsilon^{-2}\sum_{n=M+1}^{\infty} \sigma_n^2 \to 0 \text{ as } M \to \infty,$$
so $W_M := \sup_{m,n\geq M} |S_m - S_n| \to_p 0$. By Theorem 8.1, $W_M$ has a subsequence which converges to $0$ a.s. Since $W_M$ is nonincreasing in $M$, this means that $W_M \to 0$ a.s., so the partial sums are almost surely Cauchy and the series converges.
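For instance, the random harmonic series $\sum_n \varepsilon_n/n$ with i.i.d. fair signs $\varepsilon_n = \pm 1$ satisfies the hypotheses ($\mu_n = 0$ and $\sum \sigma_n^2 = \sum 1/n^2 < \infty$), so it converges almost surely. A sketch of one sample path (the path length and tail cutoff are arbitrary):

```python
import random

rng = random.Random(1)
partial_sums, s = [], 0.0
for n in range(1, 200_001):
    s += rng.choice((-1.0, 1.0)) / n  # random-sign harmonic series
    partial_sums.append(s)

# The tail of the partial-sum sequence oscillates very little,
# consistent with a.s. convergence to a (random) limit.
tail = partial_sums[100_000:]
oscillation = max(tail) - min(tail)
print(oscillation)  # small
```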
Before moving on to prove the strong law, we take a slight detour to present a general theorem on the
Proof. To see that the conditions are sucient, observe that Condition 1 and the rst Borel-Cantelli lemma
P∞
imply that P (Xn 6= Yn i.o.) = 0, so it suces to show that n=1 Yn converges a.s. This is assured by
lemma shows that P (|Xn | > A i.o.) = 1, which implies that the series diverges with full probability by the
Claim. Suppose that $Z_1, Z_2, \ldots$ is a sequence of independent random variables with $E[Z_n] = 0$ and $|Z_n| \leq C$ for some $C > 0$. If $\sum_{n=1}^{\infty} Z_n$ converges a.s., then $\sum_{n=1}^{\infty} \operatorname{Var}(Z_n) < \infty$.
Proof. Let $S_n = \sum_{k=1}^n Z_k$. Since $S_n$ converges a.s., we can find an $L \in \mathbb{N}$ such that $P\left(\sup_{n\geq 1} |S_n| < L\right) > 0$. (The events $E_m = \left\{\sup_{n\geq 1} |S_n| < m\right\}$ form a countable increasing union which converges to $\left\{\sup_{n\geq 1} |S_n| < \infty\right\} \supseteq \{\lim_{n\to\infty} S_n \text{ exists}\}$.)

For this $L$, let $\tau_L = \min\{k \geq 1 : |S_k| \geq L\}$ and observe that the assumption $|Z_k| \leq C$ for all $k$ implies $|S_{n\wedge\tau_L}| \leq L + C$ for all $n$ (before $\tau_L$ the sums are below $L$, and the first excursion past $L$ is by at most $C$).

Now $\{j \leq \tau_L\} = \{\tau_L \leq j-1\}^C \in \sigma(Z_1, \ldots, Z_{j-1})$, so independence of the $Z_k$'s and the mean zero assumption give
$$(L+C)^2 \geq E\left[S_{n\wedge\tau_L}^2\right] = \sum_{j=1}^n E\left[Z_j^2 1\{j \leq \tau_L\}\right] + 2\sum_{1\leq i<j\leq n} E\left[Z_i Z_j 1\{j \leq \tau_L\}\right]$$
$$= \sum_{j=1}^n \operatorname{Var}(Z_j)P(j \leq \tau_L) + 2\sum_{1\leq i<j\leq n} E\left[Z_i 1\{j \leq \tau_L\}\right]E[Z_j] \geq P(\tau_L = \infty)\sum_{j=1}^n \operatorname{Var}(Z_j).$$
Taking $n \to \infty$, and noting that $P(\tau_L = \infty) \geq P\left(\sup_{n\geq 1} |S_n| < L\right) > 0$, we get
$$\sum_{j=1}^{\infty} \operatorname{Var}(Z_j) \leq \frac{(L+C)^2}{P(\tau_L = \infty)} < \infty.$$
The connection between Kolmogorov's two-series theorem and the strong law is given by

Theorem 10.5 (Kronecker's lemma). If $a_n \nearrow \infty$ and $\sum_{n=1}^{\infty} \frac{x_n}{a_n}$ converges, then $\lim_{n\to\infty} a_n^{-1}\sum_{m=1}^n x_m = 0$.
Proof. Let $a_0 = 0$, $b_0 = 0$, and $b_m = \sum_{k=1}^m \frac{x_k}{a_k}$ for $m \geq 1$. Then $x_m = a_m(b_m - b_{m-1})$, so
$$a_n^{-1}\sum_{m=1}^n x_m = a_n^{-1}\left(\sum_{m=1}^n a_m b_m - \sum_{m=1}^n a_m b_{m-1}\right) = a_n^{-1}\left(a_n b_n + \sum_{m=2}^n a_{m-1} b_{m-1} - \sum_{m=2}^n a_m b_{m-1}\right)$$
$$= b_n - \sum_{m=2}^n \frac{a_m - a_{m-1}}{a_n}\,b_{m-1} = b_n - \sum_{m=1}^n \frac{a_m - a_{m-1}}{a_n}\,b_{m-1},$$
where the last equality uses $b_0 = 0$. By assumption, $b_n \to b_\infty$, so we will be done if we can show that $\sum_{m=1}^n \frac{a_m - a_{m-1}}{a_n}\,b_{m-1} \to b_\infty$ as well.

Given $\varepsilon > 0$, choose $M \in \mathbb{N}$ such that $|b_m - b_\infty| < \frac{\varepsilon}{2}$ for $m \geq M$. Set $B = \sup_{n\geq 1} |b_n|$ (which is finite since $b_n$ converges) and choose $N > M$ such that $\frac{a_M}{a_n} < \frac{\varepsilon}{4B}$ for $n \geq N$. Since $a_m - a_{m-1} \geq 0$ for all $m$ and $\sum_{m=1}^n \frac{a_m - a_{m-1}}{a_n} = 1$, we see that for all $n \geq N$,
$$\left|\sum_{m=1}^n \frac{a_m - a_{m-1}}{a_n}\,b_{m-1} - b_\infty\right| = \left|\sum_{m=1}^n \frac{a_m - a_{m-1}}{a_n}\,(b_{m-1} - b_\infty)\right|$$
$$\leq \frac{1}{a_n}\sum_{m=1}^{M} (a_m - a_{m-1})|b_{m-1} - b_\infty| + \sum_{m=M+1}^n \frac{a_m - a_{m-1}}{a_n}\,|b_{m-1} - b_\infty| \leq \frac{a_M}{a_n}\cdot 2B + \frac{a_n - a_M}{a_n}\cdot\frac{\varepsilon}{2} < \varepsilon,$$
and the result follows.
Returning to the strong law (with $Y_k = X_k 1\{X_k \leq k\}$ and $T_n = Y_1 + \cdots + Y_n$ as before), set $Z_k = Y_k - E[Y_k]$. By Claim 9.2,
$$\sum_{k=1}^{\infty} \operatorname{Var}\left(\frac{Z_k}{k}\right) = \sum_{k=1}^{\infty} \frac{\operatorname{Var}(Z_k)}{k^2} \leq \sum_{k=1}^{\infty} \frac{E[Y_k^2]}{k^2} < \infty.$$
Since $E\left[\frac{Z_k}{k}\right] = 0$, Theorem 10.3 shows that $\sum_{k=1}^{\infty} \frac{Z_k}{k}$ converges a.s., hence
$$\frac{T_n}{n} - \frac{1}{n}\sum_{k=1}^n E[Y_k] = n^{-1}\sum_{k=1}^n Z_k \to 0 \text{ a.s.}$$
by Theorem 10.5. Finally, the DCT gives $E[Y_k] \to \mu$ as $k \to \infty$, thus $\frac{1}{n}\sum_{k=1}^n E[Y_k] \to \mu$ as $n \to \infty$, and we conclude that $\frac{T_n}{n} \to \mu$ a.s.
As promised, we will conclude our discussion with an estimate on the rate of convergence in the strong law.

Theorem 10.6. Let $X_1, X_2, \ldots$ be i.i.d. random variables with $E[X_1] = 0$ and $E[X_1^2] = \sigma^2 < \infty$, and set $S_n = X_1 + \cdots + X_n$. Then for all $\varepsilon > 0$,
$$\frac{S_n}{n^{\frac{1}{2}}\log(n)^{\frac{1}{2}+\varepsilon}} \to 0 \text{ a.s.}$$

Proof. Let $a_n = n^{\frac{1}{2}}\log(n)^{\frac{1}{2}+\varepsilon}$ for $n \geq 2$ and $a_1 > 0$. We have
$$\sum_{n=1}^{\infty} \operatorname{Var}\left(\frac{X_n}{a_n}\right) = \frac{\sigma^2}{a_1^2} + \sigma^2\sum_{n=2}^{\infty} \frac{1}{n\log(n)^{1+2\varepsilon}} < \infty,$$
so Theorem 10.3 implies $\sum_{n=1}^{\infty} \frac{X_n}{a_n}$ converges a.s. The claim then follows from Theorem 10.5.
The law of the iterated logarithm (which we will not prove) states that
$$\limsup_{n\to\infty} \frac{S_n}{\sqrt{2\sigma^2 n\log(\log(n))}} = 1 \text{ a.s.}$$
under the same assumptions, so the above result is not far from optimal.

See Durrett for convergence rates under the assumption that $X_n$ has finite absolute $p$th moment for $1 < p < 2$ and for a generalization of Theorem 8.4.
11. Weak Convergence
Distribution functions $F_1, F_2, \ldots$ converge weakly to $F_\infty$ (written $F_n \Rightarrow F_\infty$) if $F_n(x) \to F_\infty(x)$ at every continuity point $x$ of $F_\infty$. Random variables $X_1, X_2, \ldots$ converge weakly (or converge in distribution) to a random variable $X_\infty$ (written $X_n \Rightarrow X_\infty$) if $F_n \Rightarrow F_\infty$ where $F_n(x) = P(X_n \leq x)$ for $1 \leq n \leq \infty$.

Note that since the definition of weak convergence of random variables depends only on their distribution functions, one can speak of a sequence $X_1, X_2, \ldots$ converging weakly even if the $X_n$'s are not defined on the same probability space. This is not the case with the other modes of convergence we have discussed.

Also, since distribution functions are right-continuous and have only countably many discontinuities, we see that the set of continuity points is dense, so a weak limit is uniquely determined by the values of its distribution function there.
Example 11.1. As a trivial example, suppose that $X$ has distribution function $F$ and let $\{a_n\}_{n=1}^{\infty}$ be any sequence of real numbers which decreases to $0$. Then $X_n = X - a_n$ has distribution function $F_n(x) = P(X - a_n \leq x) = F(x + a_n)$, hence $\lim_{n\to\infty} F_n(x) = F(x)$ for all $x \in \mathbb{R}$ (since $F$ is right-continuous) and thus $X_n \Rightarrow X$.

On the other hand, $X_n = X + a_n$ has distribution function $F_n(x) = F(x - a_n)$, which converges to $F$ only at continuity points. Thus we still have $X_n \Rightarrow X$, but the distribution functions do not necessarily converge pointwise.
For some more interesting examples, we first need some elementary facts:

Fact 11.1. If $c_j \to 0$ and $a_j c_j \to \lambda$, then $(1 + c_j)^{a_j} \to e^\lambda$.

Proof. $\lim_{x\to 0} \frac{\log(1+x)}{x} = \lim_{x\to 0} \frac{1}{1+x} = 1$ by L'Hospital's rule, so $a_j\log(1+c_j) = (a_j c_j)\frac{\log(1+c_j)}{c_j} \to \lambda$, hence $(1+c_j)^{a_j} = e^{a_j\log(1+c_j)} \to e^\lambda$.
Fact 11.2. If $\lim_{n\to\infty} \max_{1\leq j\leq n} |c_{j,n}| = 0$, $\lim_{n\to\infty} \sum_{j=1}^n c_{j,n} = \lambda$, and $\sup_n \sum_{j=1}^n |c_{j,n}| < \infty$, then $\lim_{n\to\infty} \prod_{j=1}^n (1 + c_{j,n}) = e^\lambda$.

Proof. (Homework) It suffices to show that $\sum_{j=1}^n \log(1 + c_{j,n}) \to \lambda$ since then
$$\prod_{j=1}^n (1 + c_{j,n}) = \prod_{j=1}^n e^{\log(1+c_{j,n})} = e^{\sum_{j=1}^n \log(1+c_{j,n})} \to e^\lambda.$$
To this end, note that the first condition ensures that we can choose $n$ large enough that $|c_{j,n}| < 1$, hence
$$\log(1 + c_{j,n}) = -\sum_{m=1}^{\infty} \frac{(-1)^m c_{j,n}^m}{m} \quad\text{and thus}\quad |\log(1+c_{j,n}) - c_{j,n}| < \frac{(c_{j,n})^2}{2}$$
by standard results for alternating series. It follows that
$$\left|\sum_{j=1}^n \log(1+c_{j,n}) - \lambda\right| \leq \sum_{j=1}^n |\log(1+c_{j,n}) - c_{j,n}| + \left|\sum_{j=1}^n c_{j,n} - \lambda\right| \leq \frac{1}{2}\sum_{j=1}^n (c_{j,n})^2 + \left|\sum_{j=1}^n c_{j,n} - \lambda\right|$$
$$\leq \max_{1\leq j\leq n} |c_{j,n}|\,\sup_n\sum_{j=1}^n |c_{j,n}| + \left|\sum_{j=1}^n c_{j,n} - \lambda\right|.$$
The first and third assumptions ensure that the first term goes to zero, and the second assumption ensures that the second term does as well.
Example 11.2. Let $X_p$ be the number of trials until the first success in a sequence of independent Bernoulli trials with success probability $p \in (0,1)$. (That is, $X_p$ is geometric with parameter $p$.) Then $P(X_p > n) = (1-p)^n$, so Fact 11.1 shows that
$$P(pX_p > x) = P\left(X_p > \frac{x}{p}\right) = (1-p)^{\lfloor\frac{x}{p}\rfloor} \to e^{-x}$$
as $p \searrow 0$, hence $pX_p$ converges weakly to the rate $1$ exponential distribution.
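A sketch checking the limit for one small $p$ (the values of $p$, the sample size, and the threshold $x = 1$ are arbitrary choices):

```python
import math
import random

def geometric(p, rng):
    """Sample P(X = n) = (1-p)^{n-1} p by inverse transform: P(X > n) = (1-p)^n."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return max(1, math.ceil(math.log(u) / math.log(1.0 - p)))

rng = random.Random(3)
p, samples = 1e-3, 50_000
frac_above_one = sum(p * geometric(p, rng) > 1.0 for _ in range(samples)) / samples
print(frac_above_one)  # near the exponential tail e^{-1} ~ 0.368 at x = 1
```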
Example 11.3. Let $X_1, X_2, \ldots$ be independent and uniformly distributed over $\{1, 2, \ldots, N\}$, and let $T_N = \min\{n : X_n = X_m \text{ for some } m < n\}$ be the first time a value is repeated. Since $P(T_N > n) = \prod_{k=1}^{n-1}\left(1 - \frac{k}{N}\right)$, applying Fact 11.2 with $c_{k,N} = -\frac{k}{N}$, we see that
$$P\left(\frac{T_N}{\sqrt{N}} > x\right) = \prod_{k=1}^{\lfloor x\sqrt{N}\rfloor - 1}\left(1 - \frac{k}{N}\right) \to e^{-\frac{x^2}{2}}$$
as $N \to \infty$.

The approximation $P\left(T_N > x\sqrt{N}\right) \approx e^{-\frac{x^2}{2}}$ with $N = 365$ yields $P(T_{365} > 22) \approx e^{-0.663} \approx 0.515$ and $P(T_{365} > 23) \approx e^{-0.725} \approx 0.484$. This is the birthday paradox: in a room of 23 or more people, it is more likely than not that two share a birthday.
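Both the exact probabilities and the $e^{-x^2/2}$ approximation are easy to compute; a sketch:

```python
import math

def prob_all_distinct(k, days=365):
    """Exact P(T_days > k): the first k birthdays are all distinct."""
    p = 1.0
    for i in range(k):
        p *= (days - i) / days
    return p

for k in (22, 23):
    exact = prob_all_distinct(k)
    approx = math.exp(-k**2 / (2 * 365))  # the e^{-x^2/2} approximation, x = k/sqrt(365)
    print(k, round(exact, 3), round(approx, 3))
```

The exact values straddle $\frac{1}{2}$ between $k = 22$ and $k = 23$, in agreement with the approximation.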
Our next result gives an equivalent definition of weak convergence. The basic idea is that $C_b(\mathbb{R})$, the space of bounded continuous functions from $\mathbb{R}$ to $\mathbb{R}$ equipped with the supremum norm, is a Banach space. It follows from the Riesz representation theorem that its (continuous) dual $C_b(\mathbb{R})^*$, the space of continuous linear functionals on $C_b(\mathbb{R})$, may be identified with the space of finite and finitely additive signed Radon measures. From this perspective, weak convergence of probability measures corresponds to weak-* convergence in $C_b(\mathbb{R})^*$:
Theorem 11.2. $X_n \Rightarrow X_\infty$ if and only if for every bounded continuous function $g$ we have $E[g(X_n)] \to E[g(X_\infty)]$.

Proof. First suppose that $X_n \Rightarrow X_\infty$. Theorem 11.1 shows that there exist random variables with $Y_n =_d X_n$ and $Y_n \to Y_\infty$ a.s. If $g$ is bounded and continuous, then $g(Y_n) \to g(Y_\infty)$ a.s. and bounded convergence gives $E[g(X_n)] = E[g(Y_n)] \to E[g(Y_\infty)] = E[g(X_\infty)]$.

Conversely, suppose $E[g(X_n)] \to E[g(X_\infty)]$ for all bounded continuous $g$. Given $x \in \mathbb{R}$ and $\varepsilon > 0$, let $g_{x,\varepsilon}$ be the continuous function which is $1$ on $(-\infty, x]$, $0$ on $[x+\varepsilon, \infty)$, and linear in between. Then $1_{(-\infty,x]} \leq g_{x,\varepsilon} \leq 1_{(-\infty,x+\varepsilon]}$, so
$$\limsup_{n\to\infty} P(X_n \leq x) \leq \lim_{n\to\infty} E[g_{x,\varepsilon}(X_n)] = E[g_{x,\varepsilon}(X_\infty)] \leq P(X_\infty \leq x+\varepsilon),$$
and similarly
$$\liminf_{n\to\infty} P(X_n \leq x) \geq \liminf_{n\to\infty} E[g_{x-\varepsilon,\varepsilon}(X_n)] = E[g_{x-\varepsilon,\varepsilon}(X_\infty)] \geq P(X_\infty \leq x - \varepsilon).$$
Letting $\varepsilon \searrow 0$ shows that $F_n(x) \to F_\infty(x)$ at every continuity point $x$ of $F_\infty$, hence $X_n \Rightarrow X_\infty$.

An immediate consequence is a continuous mapping theorem for weak convergence: if $g$ is continuous and $X_n \Rightarrow X_\infty$, then for any $f \in C_b(\mathbb{R})$, $E[f(g(X_n))] = E[f(g(Y_n))] \to E[f(g(Y_\infty))] = E[f(g(X_\infty))]$ by bounded convergence. As this is true for all $f \in C_b(\mathbb{R})$, Theorem 11.2 shows that $g(X_n) \Rightarrow g(X_\infty)$. The second assertion (that $E[g(X_n)] \to E[g(X_\infty)]$ when $g$ is also bounded) follows by noting that $g(Y_n) \to g(Y_\infty)$ a.s. and likewise applying bounded convergence.
At this point, we have characterized weak convergence in terms of convergence of distribution functions at continuity points and as weak-* convergence when probability measures are viewed as living in the dual of $C_b(\mathbb{R})$. The following further characterizations are also equivalent:
(i) $X_n \Rightarrow X_\infty$;
(ii) $\liminf_{n\to\infty} P(X_n \in U) \geq P(X_\infty \in U)$ for all open $U$;
(iii) $\limsup_{n\to\infty} P(X_n \in K) \leq P(X_\infty \in K)$ for all closed $K$;
(iv) $\lim_{n\to\infty} P(X_n \in A) = P(X_\infty \in A)$ for all Borel $A$ with $P(X_\infty \in \partial A) = 0$.
We sketch the main implications. For (i) $\Rightarrow$ (ii), let $U$ be open and use Theorem 11.1 to take $Y_n =_d X_n$ with $Y_n \to Y_\infty$ a.s.

Now let $\omega \in \Omega$ be such that $Y_\infty(\omega) \in U$. Since $Y_n(\omega) \to Y_\infty(\omega)$, for every open set $V \ni Y_\infty(\omega)$, there is an $N_V \in \mathbb{N}$ with $Y_n(\omega) \in V$ whenever $n \geq N_V$. In particular, there is an $N_U \in \mathbb{N}$ such that $Y_n(\omega) \in U$ for all $n \geq N_U$.

In other words, for all $\omega \in \Omega$ with $1_U(Y_\infty(\omega)) = 1$, we have $1_U(Y_n(\omega)) = 1$ for $n$ sufficiently large, hence $\liminf_{n\to\infty} 1_U(Y_n(\omega)) = 1$. It follows that $\liminf_{n\to\infty} 1_U(Y_n) \geq 1_U(Y_\infty)$ pointwise and thus, by Fatou's lemma,
$$\liminf_{n\to\infty} P(X_n \in U) = \liminf_{n\to\infty} P(Y_n \in U) = \liminf_{n\to\infty} E[1_U(Y_n)] \geq E\left[\liminf_{n\to\infty} 1_U(Y_n)\right] \geq E[1_U(Y_\infty)] = P(Y_\infty \in U) = P(X_\infty \in U).$$
Note that (iii) follows from (ii) by taking complements.
For (iv), suppose (ii) and (iii) hold and $P(X_\infty \in \partial A) = 0$, so that $P(X_\infty \in A^\circ) = P(X_\infty \in \bar{A})$ and each value is equal to $P(X_\infty \in A)$. Applying (ii) to $A^\circ \subseteq A$ and (iii) to $\bar{A} \supseteq A$ gives
$$P(X_\infty \in A) \leq \liminf_{n\to\infty} P(X_n \in A) \leq \limsup_{n\to\infty} P(X_n \in A) \leq P(X_\infty \in A).$$
Finally, suppose that (iv) holds and let $F_n$ denote the distribution function of $X_n$. Let $x$ be any continuity point of $F_\infty$. Then $P(X_\infty \in \{x\}) = 0$, so, since $\{x\} = \partial(-\infty, x]$, we have $F_n(x) = P(X_n \in (-\infty, x]) \to P(X_\infty \in (-\infty, x]) = F_\infty(x)$, hence $X_n \Rightarrow X_\infty$.
Our next set of theorems forms a sort of compactness result for certain families of probability measures. We begin with

Theorem 11.5 (Helly's Selection Theorem). If $\{F_n\}_{n=1}^{\infty}$ is any sequence of distribution functions, then there is a subsequence $\{F_{n(m)}\}_{m=1}^{\infty}$ and a nondecreasing, right-continuous function $F$ with $\lim_{m\to\infty} F_{n(m)}(x) = F(x)$ at all continuity points $x$ of $F$.
Proof. We begin with a diagonalization argument: Let $q_1, q_2, \ldots$ be an enumeration of $\mathbb{Q}$. Since the sequence $\{F_n(q_1)\}_{n=1}^{\infty}$ is contained in $[0,1]$, it has a convergent subsequence; iterating along $q_2, q_3, \ldots$ and taking the diagonal yields a subsequence $\{F_{n(m)}\}_{m=1}^{\infty}$ and a function $G : \mathbb{Q} \to [0,1]$ with $F_{n(m)}(q) \to G(q)$ for all $q \in \mathbb{Q}$. Now define $F(x) = \inf\{G(q) : q \in \mathbb{Q},\, q > x\}$.

To see that $F$ is nondecreasing, note that for any $x < y$, there is some $r \in \mathbb{Q}$ with $x < r < y$. Since $G(r) \leq G(s)$ for all rational $r < s$, we have $F(x) \leq G(r) \leq F(y)$.

Now for each $x \in \mathbb{R}$, $\varepsilon > 0$, there is some rational $q > x$ such that $G(q) \leq F(x) + \varepsilon$. Thus if $x \leq y < q$, then $F(y) \leq G(q) \leq F(x) + \varepsilon$. Since $\varepsilon > 0$ was arbitrary, we see that $F$ is right-continuous as well.
Finally, suppose that $F$ is continuous at $x$. Given $\varepsilon > 0$, there exist $r_1, r_2, s \in \mathbb{Q}$ with $r_1 < r_2 < x < s$ such that
$$F(x) - \varepsilon < F(r_1) \leq G(r_2) \leq F(x) \leq G(s) \leq F(s) < F(x) + \varepsilon.$$
Since $F_{n(m)}(r_2) \to G(r_2) \geq F(r_1)$ and $F_{n(m)}(s) \to G(s) \leq F(s)$ as $m \to \infty$, we see that for $m$ sufficiently large,
$$F(x) - \varepsilon < F_{n(m)}(r_2) \leq F_{n(m)}(x) \leq F_{n(m)}(s) < F(x) + \varepsilon,$$
hence $F_{n(m)}(x) \to F(x)$.
It should be noted that the subsequential limit F from Theorem 11.5 is not necessarily a distribution
function since the boundary conditions may not hold. When a sequence of distribution functions converges
to a nondecreasing right-continuous function at its continuity points, the sequence is said to exhibit vague
convergence, which will be denoted by ⇒v .
In these terms, Helly's selection theorem says that every sequence of distribution functions has a vaguely
convergent subsequence.
Because all of the distribution functions take values in [0, 1], their limit must as well. The limit is thus a
Stieltjes measure function for some subprobability measure on R - a positive measure ν with ν(R) ≤ 1.
Thus vague convergence means that the distribution functions converge to a distribution function of a
subprobability measure, whereas weak convergence means that they converge to the distribution function of
a probability measure.
More generally, just as weak convergence is weak-* convergence with respect to $C_b(\mathbb{R})$, vague convergence is weak-* convergence with respect to the subspaces $C_K(\mathbb{R})$ or $C_0(\mathbb{R})$, the spaces of continuous functions with compact support and of continuous functions vanishing at infinity, respectively.
Example 11.4. Choose any $a, b, c > 0$ with $a + b + c = 1$ and any distribution function $G(x)$, and define
$$F_n(x) = a\,1_{[n,\infty)}(x) + b\,1_{[-n,\infty)}(x) + c\,G(x).$$
One easily checks that the $F_n$'s are distribution functions and $F_n(x) \to F(x) := b + cG(x)$. However, $F$ is not a distribution function: $\lim_{x\to-\infty} F(x) = b \neq 0$ and $\lim_{x\to\infty} F(x) = b + c \neq 1$, as the mass $a$ has escaped to $+\infty$ and the mass $b$ to $-\infty$.

Intuitively, the test functions in $C_K$ or $C_0$ that define vague convergence can't detect mass lost to infinity, whereas the constant function $1 \in C_b$ can.
An immediate question is: under which conditions do the two definitions coincide? Equivalently, is there a property of a vaguely convergent sequence of distribution functions which prevents mass from being lost in the limit?

Definition. A sequence of distribution functions is tight if for every $\varepsilon > 0$, there is an $M_\varepsilon > 0$ such that $\limsup_{n\to\infty}\left(1 - F_n(M_\varepsilon) + F_n(-M_\varepsilon)\right) \leq \varepsilon$.
Theorem 11.6. A sequence of distribution functions is tight if and only if every subsequential limit is a distribution function.

Proof. Suppose the sequence is tight and $F_{n(m)} \Rightarrow_v F$. Given $\varepsilon > 0$, let $r < -M_\varepsilon$ and $s > M_\varepsilon$ be continuity points of $F$. Since $F_{n(m)}(r) \to F(r)$ and $F_{n(m)}(s) \to F(s)$, we have
$$1 - F(s) + F(r) = \lim_{m\to\infty}\left(1 - F_{n(m)}(s) + F_{n(m)}(r)\right) \leq \varepsilon.$$
Letting $r \to -\infty$ and $s \to \infty$ along continuity points of $F$ shows that $F$ is a distribution function.

Conversely, if the sequence is not tight, then there exist $\varepsilon > 0$ and a subsequence $\{F_{n(m)}\}$ with $1 - F_{n(m)}(m) + F_{n(m)}(-m) \geq \varepsilon$ for all $m$. By Helly's theorem, a further subsequence $\{F_{n(m_k)}\}$ converges vaguely to some $F$, and for any continuity points $r < 0 < s$ of $F$,
$$1 - F(s) + F(r) = \lim_{k\to\infty}\left(1 - F_{n(m_k)}(s) + F_{n(m_k)}(r)\right) \geq \liminf_{k\to\infty}\left(1 - F_{n(m_k)}(m_k) + F_{n(m_k)}(-m_k)\right) \geq \varepsilon,$$
so letting $r \to -\infty$ and $s \to \infty$ along continuity points of $F$ shows that $F$ is not a distribution function.
Roughly, we have that weak convergence equals vague convergence plus tightness. A useful sufficient condition: if $\varphi \geq 0$ satisfies $\varphi(x) \to \infty$ as $|x| \to \infty$ and $\sup_n \int \varphi\,dF_n < \infty$, then $\{F_n\}_{n=1}^{\infty}$ is tight (by a Markov-type estimate).
12. Characteristic Functions
An extremely useful construct in probability (and the primary ingredient in the classical proofs of many central limit theorems) is the characteristic function of a random variable $X$, defined by $\varphi(t) = E\left[e^{itX}\right]$, which is essentially the (inverse) Fourier transform of its distribution. When confusion may arise, we will indicate the dependence on the random variable with a subscript.

Though we have restricted our attention to real-valued random variables thus far, no new theory is required since if $Z$ is complex valued, $E[Z] = E[\operatorname{Re}(Z)] + iE[\operatorname{Im}(Z)]$ provided that the expectations of the real and imaginary parts exist. In the case of characteristic functions, Euler's formula gives $e^{itX} = \cos(tX) + i\sin(tX)$, and the sine and cosine terms are bounded, so $\varphi(t)$ is defined for every $t \in \mathbb{R}$.

Some basic properties:
(1) $\varphi(0) = E[1] = 1$.
(2) $\varphi(-t) = E\left[e^{-itX}\right] = \overline{\varphi(t)}$.
(3) $|\varphi(t)| = \left|E\left[e^{itX}\right]\right| \leq E\left[\left|e^{itX}\right|\right] = 1$.
(4) $|\varphi(t+h) - \varphi(t)| \leq E\left[\left|e^{i(t+h)X} - e^{itX}\right|\right] = E\left[\left|e^{itX}\right|\left|e^{ihX} - 1\right|\right] = E\left[\left|e^{ihX} - 1\right|\right]$. Since the last term goes to zero as $h \to 0$ (by the bounded convergence theorem), $\varphi$ is uniformly continuous.
(5) $\varphi_{aX+b}(t) = E\left[e^{it(aX+b)}\right] = E\left[e^{i(at)X}e^{itb}\right] = e^{itb}\varphi_X(at)$.

Example 12.1 (Coin flip). If $P(X = 1) = P(X = -1) = \frac{1}{2}$, then
$$\varphi(t) = \frac{1}{2}e^{it} + \frac{1}{2}e^{-it} = \cos(t).$$
Example 12.2 (Poisson). If $P(X = k) = e^{-\lambda}\frac{\lambda^k}{k!}$ for $k = 0, 1, 2, \ldots$, then its ch.f. is given by
$$\varphi(t) = \sum_{k=0}^{\infty} e^{itk}e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{\left(\lambda e^{it}\right)^k}{k!} = e^{-\lambda}e^{\lambda e^{it}} = e^{\lambda(e^{it}-1)}.$$
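One can sanity-check the closed form against a direct (truncated) evaluation of $E\left[e^{itX}\right]$; the parameter values and truncation point below are arbitrary:

```python
import cmath
import math

def poisson_cf_direct(lam, t, terms=100):
    """E[e^{itX}] for X ~ Poisson(lam), summing the pmf term by term (truncated)."""
    return sum(cmath.exp(1j * t * k) * math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(terms))

lam, t = 3.0, 1.7
closed_form = cmath.exp(lam * (cmath.exp(1j * t) - 1))
error = abs(poisson_cf_direct(lam, t) - closed_form)
print(error)  # essentially zero
```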
Example 12.3 (Standard normal). If $X$ has density $\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$, then $\varphi(t) = e^{-\frac{t^2}{2}}$.

Naive derivation:
$$\varphi(t) = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{itx}e^{-\frac{x^2}{2}}\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-\frac{1}{2}\left[(x-it)^2 + t^2\right]}\,dx = e^{-\frac{t^2}{2}}\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-\frac{(x-it)^2}{2}}\,dx = e^{-\frac{t^2}{2}}$$
since $\frac{1}{\sqrt{2\pi}}e^{-\frac{(x-it)^2}{2}}$ is "the density of a normal random variable with mean $it$ and variance $1$". (Of course, $it$ is not a real number, so this step requires justification; the following computation avoids the issue.)

Formal proof: Since $\sin(tx)e^{-\frac{x^2}{2}}$ is odd and integrable,
$$\varphi(t) = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{itx}e^{-\frac{x^2}{2}}\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} \cos(tx)e^{-\frac{x^2}{2}}\,dx + \frac{i}{\sqrt{2\pi}}\int_{\mathbb{R}} \sin(tx)e^{-\frac{x^2}{2}}\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} \cos(tx)e^{-\frac{x^2}{2}}\,dx.$$
Differentiating with respect to $t$ (which can be justified using a DCT argument) gives
$$\varphi'(t) = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} \frac{d}{dt}\cos(tx)e^{-\frac{x^2}{2}}\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} -x\sin(tx)e^{-\frac{x^2}{2}}\,dx$$
$$= \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} \sin(tx)\,\frac{d}{dx}e^{-\frac{x^2}{2}}\,dx = -\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} t\cos(tx)e^{-\frac{x^2}{2}}\,dx = -t\varphi(t),$$
where the third equality used integration by parts. It follows from the method of integrating factors that $\frac{d}{dt}\left[e^{\frac{t^2}{2}}\varphi(t)\right] = 0$, hence $e^{\frac{t^2}{2}}\varphi(t) = e^{\frac{0^2}{2}}\varphi(0) = 1$ for all $t$, i.e. $\varphi(t) = e^{-\frac{t^2}{2}}$.
Example 12.4 (Exponential). If $X$ is absolutely continuous with density $f_X(x) = e^{-x}1_{[0,\infty)}(x)$, then its ch.f. is given by
$$\varphi(t) = \int_0^{\infty} e^{itx}e^{-x}\,dx = \int_0^{\infty} e^{(it-1)x}\,dx = \lim_{b\to\infty}\left[\frac{1}{it-1}e^{(it-1)x}\right]_0^b = \frac{1}{1-it}.$$
Our next task is to show that the characteristic function uniquely determines the distribution.

Proposition 12.1. For all $T > 0$, $\displaystyle\left|\int_0^T \frac{\sin(t)}{t}\,dt - \frac{\pi}{2}\right| \leq \frac{T+1}{T^2}$.
Proof. (Homework) For all $T > 0$, the function $e^{-uv}\sin(u)$ is bounded in absolute value by $e^{-uv}$, which is integrable over $R_T = \{(u,v) : 0 < u < T,\, v > 0\}$, so it follows from Fubini's theorem that
$$\int_0^T \frac{\sin(u)}{u}\,du = \int_0^T\left(\int_0^{\infty} e^{-uv}\sin(u)\,dv\right)du = \int_0^{\infty}\left(\int_0^T e^{-uv}\sin(u)\,du\right)dv$$
$$= \int_0^{\infty}\left[-\frac{1}{1+v^2}e^{-uv}\left(\cos(u) + v\sin(u)\right)\right]_{u=0}^{u=T}dv$$
$$= \int_0^{\infty} \frac{dv}{1+v^2} - \cos(T)\int_0^{\infty} \frac{e^{-Tv}}{1+v^2}\,dv - \sin(T)\int_0^{\infty} \frac{v\,e^{-Tv}}{1+v^2}\,dv.$$
Since $\int_0^{\infty} \frac{dv}{1+v^2} = \frac{\pi}{2}$, we have
$$\left|\int_0^T \frac{\sin(t)}{t}\,dt - \frac{\pi}{2}\right| = \left|\cos(T)\int_0^{\infty} \frac{e^{-Tv}}{1+v^2}\,dv + \sin(T)\int_0^{\infty} \frac{v\,e^{-Tv}}{1+v^2}\,dv\right|$$
$$\leq |\cos(T)|\int_0^{\infty} \frac{e^{-Tv}}{1+v^2}\,dv + |\sin(T)|\int_0^{\infty} \frac{v}{1+v^2}e^{-Tv}\,dv \leq \int_0^{\infty} e^{-Tv}\,dv + \int_0^{\infty} ve^{-Tv}\,dv = \frac{1}{T} + \frac{1}{T^2} = \frac{T+1}{T^2}.$$
Lemma 12.1. For every $\theta \in \mathbb{R}$, $\displaystyle\lim_{T\to\infty}\int_{-T}^T \frac{\sin(\theta t)}{t}\,dt = \pi\,\operatorname{sgn}(\theta)$, where $\operatorname{sgn}(\theta)$ is $-1$ for $\theta < 0$, $0$ for $\theta = 0$, and $1$ for $\theta > 0$.

Proof. (Homework) Since $\lim_{t\to 0}\frac{\sin(\theta t)}{t} = \lim_{t\to 0}\frac{\theta\cos(\theta t)}{1} = \theta$, it is easy to see that the integral $\int_{-T}^T \frac{\sin(\theta t)}{t}\,dt$ exists for all $T > 0$. Because $\frac{\sin(\theta t)}{t}$ is even, it follows by $u$-substitution that
$$\int_{-T}^T \frac{\sin(\theta t)}{t}\,dt = 2\int_0^T \frac{\sin(\theta t)}{t}\,dt = 2\int_0^{\theta T} \frac{\sin(u)}{u}\,du = 2\,\operatorname{sgn}(\theta)\int_0^{|\theta|T} \frac{\sin(u)}{u}\,du.$$
Proposition 12.1 shows that $\lim_{T\to\infty}\int_0^{|\theta|T} \frac{\sin(u)}{u}\,du = \frac{\pi}{2}$ for all $\theta \neq 0$, so, since $\theta = 0$ implies that $\int_0^{|\theta|T} \frac{\sin(u)}{u}\,du = 0$ for all $T$, we have
$$\lim_{T\to\infty}\int_{-T}^T \frac{\sin(\theta t)}{t}\,dt = 2\,\operatorname{sgn}(\theta)\lim_{T\to\infty}\int_0^{|\theta|T} \frac{\sin(u)}{u}\,du = \pi\,\operatorname{sgn}(\theta)$$
for all $\theta \in \mathbb{R}$.
Theorem 12.1 (Inversion Formula). Let $\varphi(t) = \int e^{itx}\,d\mu(x)$ where $\mu$ is a probability measure on $(\mathbb{R}, \mathcal{B})$. If $a < b$, then
$$\lim_{T\to\infty} \frac{1}{2\pi}\int_{-T}^T \frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\,dt = \mu((a,b)) + \frac{1}{2}\mu(\{a,b\}).$$
Proof. We begin by noting that
$$(*)\qquad \left|\frac{e^{-ita} - e^{-itb}}{it}\right| = \left|\int_a^b e^{-ity}\,dy\right| \leq \int_a^b \left|e^{-ity}\right|dy = b - a,$$
so, since $[-T, T]$ is finite and $\mu$ is a probability measure, Fubini's theorem gives
$$I_T := \int_{-T}^T \frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\,dt = \int_{-T}^T \frac{e^{-ita} - e^{-itb}}{it}\left(\int_{\mathbb{R}} e^{itx}\,d\mu(x)\right)dt = \int_{\mathbb{R}}\left(\int_{-T}^T \frac{e^{it(x-a)} - e^{it(x-b)}}{it}\,dt\right)d\mu(x).$$
Now
$$\frac{e^{it(x-a)} - e^{it(x-b)}}{it} = \frac{\sin(t(x-a)) - \sin(t(x-b))}{t} + i\,\frac{\cos(t(x-b)) - \cos(t(x-a))}{t},$$
and it follows from $(*)$ and the inequality $|\operatorname{Im}(z)| \leq |z|$ that $\int_{-T}^T \frac{\cos(t(x-b)) - \cos(t(x-a))}{t}\,dt$ exists; since the integrand is odd in $t$, this integral is zero. The inner integrals are uniformly bounded (by $(*)$, or by Proposition 12.1), so bounded convergence and Lemma 12.1 give
$$\lim_{T\to\infty} I_T = \lim_{T\to\infty}\int_{\mathbb{R}}\left(\int_{-T}^T \frac{\sin(t(x-a))}{t}\,dt - \int_{-T}^T \frac{\sin(t(x-b))}{t}\,dt\right)d\mu(x)$$
$$= \int_{\mathbb{R}} \lim_{T\to\infty}\left(\int_{-T}^T \frac{\sin(t(x-a))}{t}\,dt - \int_{-T}^T \frac{\sin(t(x-b))}{t}\,dt\right)d\mu(x) = \pi\int_{\mathbb{R}}\left[\operatorname{sgn}(x-a) - \operatorname{sgn}(x-b)\right]d\mu(x).$$
Since $a < b$ by assumption, we have that
$$\operatorname{sgn}(x-a) - \operatorname{sgn}(x-b) = \begin{cases} 0, & x < a \text{ or } x > b \\ 1, & x = a \text{ or } x = b \\ 2, & a < x < b \end{cases},$$
thus
$$\lim_{T\to\infty} \frac{1}{2\pi}\int_{-T}^T \frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\,dt = \frac{1}{2\pi}\lim_{T\to\infty} I_T = \int_{\mathbb{R}} \frac{\operatorname{sgn}(x-a) - \operatorname{sgn}(x-b)}{2}\,d\mu(x)$$
$$= \int_{(-\infty,a)\cup(b,\infty)} 0\,d\mu(x) + \int_{\{a,b\}} \frac{1}{2}\,d\mu(x) + \int_{(a,b)} 1\,d\mu(x) = \frac{1}{2}\mu(\{a,b\}) + \mu((a,b)).$$
70
Remark. Note that the Cauchy principal value $\lim_{T\to\infty}\int_{-T}^T f(x)\,dx$ is not necessarily the same as
$$\int_{-\infty}^\infty f(x)\,dx := \lim_{\substack{b\to\infty\\ a\to-\infty}}\int_a^b f(x)\,dx.$$
For example, $\lim_{a\to\infty}\int_{-a}^a\frac{2x}{1+x^2}\,dx = 0$ since the integrand is odd, but
$$\lim_{a\to\infty}\int_{-2a}^a\frac{2x}{1+x^2}\,dx = \lim_{a\to\infty}\log\left(1+x^2\right)\Big|_{-2a}^a = \lim_{a\to\infty}\log\left(\frac{1+a^2}{1+4a^2}\right) = -\log(4),$$
so the improper Riemann integral $\int_{-\infty}^\infty\frac{2x}{1+x^2}\,dx$ is not defined.
Of course, since $\left|\frac{2x}{1+x^2}\right|\approx\frac{2}{|x|}$ for large $|x|$, $\int_{-\infty}^\infty\frac{2x}{1+x^2}\,dx$ is not defined as a Lebesgue integral either.
Combining Theorems 12.1 and 12.2, and noting that a Borel measure is specified by its values on open intervals (by a π-λ argument), shows that probability distributions are uniquely determined by their characteristic functions.
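The inversion formula can be checked numerically for a distribution with a known characteristic function. The sketch below is illustrative only (not part of the notes): it truncates the integral at $T = 40$ and uses $\varphi(t) = e^{-t^2/2}$, the characteristic function of the standard normal, so the result should approximate $P(-1 < Z < 1)$.

```python
import cmath, math

def inversion(phi, a, b, T=40.0, steps=100_000):
    """Approximate mu((a,b)) + mu({a,b})/2 via (1/2pi) * int_{-T}^{T} of
    (e^{-ita} - e^{-itb})/(it) * phi(t) dt, using the midpoint rule."""
    h = 2 * T / steps
    total = 0.0
    for k in range(steps):
        t = -T + (k + 0.5) * h            # midpoints avoid t = 0
        kernel = (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t)
        total += (kernel * phi(t)).real * h
    return total / (2 * math.pi)

phi_normal = lambda t: math.exp(-t * t / 2)   # ch.f. of N(0,1)
est = inversion(phi_normal, -1.0, 1.0)
exact = math.erf(1 / math.sqrt(2))            # P(-1 < Z < 1)
```

Since the normal law puts no mass on $\{-1, 1\}$, `est` approximates $\mu((-1,1))$ directly; truncating at $T = 40$ is harmless because $e^{-t^2/2}$ is negligible there.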
To prove our next big result, we need the following bound on the tail probabilities of a distribution in terms of the behavior of its characteristic function near the origin.

Lemma 12.2. If $\varphi$ is the characteristic function corresponding to the distribution $\mu$, then for all $u > 0$,
$$\mu\left(\left\{x : |x| > \frac{2}{u}\right\}\right) \leq u^{-1}\int_{-u}^u\left(1 - \varphi(t)\right)dt.$$
Proof. It follows from the parity of the sine and cosine functions that
$$\int_{-u}^u\left(1 - e^{itx}\right)dt = \int_{-u}^u\left(1 - \cos(tx) - i\sin(tx)\right)dt = 2u - \frac{2\sin(ux)}{x} = 2\left(u - \frac{\sin(ux)}{x}\right).$$
Dividing by $u$, integrating against $d\mu(x)$, and appealing to Fubini gives
$$u^{-1}\int_{-u}^u\left(1 - \varphi(t)\right)dt = u^{-1}\int\int_{-u}^u\left(1 - e^{itx}\right)dt\,d\mu(x) = 2\int\left(1 - \frac{\sin(ux)}{ux}\right)d\mu(x).$$
Since $|\sin(y)| = \left|\int_0^y\cos(x)\,dx\right| \leq |y|$ for all $y$, we see that the integrand on the right is nonnegative, so we have
$$u^{-1}\int_{-u}^u\left(1 - \varphi(t)\right)dt = 2\int\left(1 - \frac{\sin(ux)}{ux}\right)d\mu(x) \geq 2\int_{|x|>\frac{2}{u}}\left(1 - \frac{\sin(ux)}{ux}\right)d\mu(x)$$
$$\geq 2\int_{|x|>\frac{2}{u}}\left(1 - \frac{1}{|ux|}\right)d\mu(x) \geq \mu\left(\left\{x : |x| > \frac{2}{u}\right\}\right).$$
We are now able to take our next major step toward proving the central limit theorem by relating weak convergence of distributions to pointwise convergence of the associated characteristic functions.

Theorem 12.3 (Continuity Theorem). Let $\mu_n$, $1 \leq n \leq \infty$, be probability measures with characteristic functions $\varphi_n$. (i) If $\mu_n \Rightarrow \mu_\infty$, then $\varphi_n(t) \to \varphi_\infty(t)$ for all $t$. (ii) If $\varphi_n(t)$ converges pointwise to a limit $\varphi(t)$ that is continuous at $0$, then $\mu_n \Rightarrow \mu_\infty$, where $\mu_\infty$ is the probability measure with characteristic function $\varphi$.

Proof.
For (i), note that since $e^{itx}$ is bounded and continuous, if $\mu_n \Rightarrow \mu_\infty$, then it follows from Theorem 11.2 that $\varphi_n(t) \to \varphi_\infty(t)$.
For (ii), fix $\varepsilon > 0$. Since $\varphi$ is continuous at $0$ with $\varphi(0) = \lim_n \varphi_n(0) = 1$, we can choose $v > 0$ small enough that $v^{-1}\int_{-v}^v(1-\varphi(t))\,dt < \frac{\varepsilon}{2}$, and since $\varphi_n \to \varphi$ pointwise, bounded convergence yields an $N \in \mathbb{N}$ such that
$$\left|v^{-1}\int_{-v}^v\left(1 - \varphi_n(t)\right)dt - v^{-1}\int_{-v}^v\left(1 - \varphi(t)\right)dt\right| < \frac{\varepsilon}{2}$$
whenever $n \geq N$.
The last two observations and Lemma 12.2 show that
$$\mu_n\left(\left\{x : |x| > \frac{2}{v}\right\}\right) \leq v^{-1}\int_{-v}^v\left(1 - \varphi_n(t)\right)dt < \varepsilon$$
for all $n \geq N$, so the sequence $\{\mu_n\}$ is tight. Helly's selection theorem then guarantees that every subsequence has a further subsequence converging weakly to some probability measure, and part (i) shows that $\varphi$ is the characteristic function of any such limit $\mu_\infty$.
Because $\varphi_n(t) \to \varphi(t)$ for all $t$ and characteristic functions uniquely characterize distributions, it must be the case that all subsequential limits agree, and therefore $\mu_n \Rightarrow \mu_\infty$.

The crux of the proof of the nontrivial part of Theorem 12.3 was establishing tightness of the sequence $\{\mu_n\}$, and this is where we used the assumption that the limiting characteristic function is continuous at $0$.
As an illustration of how weak convergence may fail without the continuity assumption, consider the case $\mu_n = N(0, n)$: here $\varphi_n(t) = e^{-nt^2/2} \to 1_{\{0\}}(t)$, which is not continuous at $0$, and indeed $\mu_n$ does not converge weakly since $\mu_n((-\infty, x]) \to \frac{1}{2}$ for every $x$.
We turn next to the problem of representing characteristic functions as power series with explicit remainders. To start, we have

Theorem 12.4. If $E[|X|^n] < \infty$, then the characteristic function $\varphi$ of $X$ is $n$ times differentiable with $\varphi^{(k)}(t) = \int (ix)^k e^{itx}\,d\mu(x)$ for $k = 1, \ldots, n$.

Proof. (Homework)
Argue by induction and justify differentiating under the integral with the DCT or some applicable corollary thereof.
It follows from Theorem 12.4 that if $E[|X|^n] < \infty$, then $\varphi^{(n)}(0) = \int (ix)^n\,d\mu(x) = i^n E[X^n]$.

Corollary 12.1. The above observation combined with Taylor's theorem shows that if $X$ has finite absolute $n$th moment, then
$$\varphi(t) = \sum_{k=0}^n\frac{\varphi^{(k)}(0)}{k!}t^k + r_n(t)t^n = \sum_{k=0}^n\frac{(it)^k}{k!}E\left[X^k\right] + r_n(t)t^n.$$
In particular, we have
$$P_n(t) = \sum_{k=0}^n\frac{\varphi^{(k)}(0)}{k!}t^k = \varphi(0) + \varphi'(0)t + \ldots + \frac{\varphi^{(n)}(0)}{n!}t^n, \qquad r_n(t) = \begin{cases}\dfrac{\varphi(t) - P_n(t)}{t^n}, & t \neq 0\\ 0, & t = 0,\end{cases}$$
and repeated applications of L'Hôpital's rule show that $r_n(t) \to 0$ as $t \to 0$:
$$\lim_{t\to 0}r_n(t) = \lim_{t\to 0}\frac{\varphi(t) - P_n(t)}{t^n} = \lim_{t\to 0}\frac{\frac{d}{dt}\left[\varphi(t) - P_n(t)\right]}{\frac{d}{dt}t^n} = \ldots = \lim_{t\to 0}\frac{\frac{d^{n-1}}{dt^{n-1}}\left[\varphi(t) - P_n(t)\right]}{n!\,t} = \lim_{t\to 0}\frac{\varphi^{(n-1)}(t) - \varphi^{(n-1)}(0) - t\varphi^{(n)}(0)}{n!\,t}$$
$$= \frac{1}{n!}\lim_{t\to 0}\left(\frac{\varphi^{(n-1)}(t) - \varphi^{(n-1)}(0)}{t} - \varphi^{(n)}(0)\right) = \frac{1}{n!}\left[\varphi^{(n)}(0) - \varphi^{(n)}(0)\right] = 0.$$
Corollary 12.1 is enough to get the classical central limit theorem for i.i.d. sequences, but when we consider the Lindeberg-Feller CLT for triangular arrays, we will need a little better control on the error term. With this in mind, we record

Lemma 12.3.
$$\left|e^{ix} - \sum_{k=0}^n\frac{(ix)^k}{k!}\right| \leq \min\left\{\frac{|x|^{n+1}}{(n+1)!}, \frac{2|x|^n}{n!}\right\}.$$

Proof. We first establish by induction the remainder formula
$$e^{ix} - \sum_{k=0}^n\frac{(ix)^k}{k!} = \frac{i^{n+1}}{n!}\int_0^x (x-s)^n e^{is}\,ds.$$
The case $n = 0$ is the fundamental theorem of calculus. If we assume that
$$e^{ix} = \sum_{k=0}^n\frac{(ix)^k}{k!} + \frac{i^{n+1}}{n!}\int_0^x (x-s)^n e^{is}\,ds,$$
then integrating by parts, we get
$$e^{ix} = \sum_{k=0}^n\frac{(ix)^k}{k!} + \frac{i^{n+1}}{n!}\left(\frac{x^{n+1}}{n+1} + \frac{i}{n+1}\int_0^x(x-s)^{n+1}e^{is}\,ds\right) = \sum_{k=0}^{n+1}\frac{(ix)^k}{k!} + \frac{i^{(n+1)+1}}{(n+1)!}\int_0^x(x-s)^{n+1}e^{is}\,ds,$$
which completes the induction. Taking absolute values in the remainder formula gives the bound $\frac{|x|^{n+1}}{(n+1)!}$, while applying the formula at level $n-1$ and subtracting the $n$th term to write the remainder as $\frac{i^n}{(n-1)!}\int_0^x(x-s)^{n-1}\left(e^{is} - 1\right)ds$ and using $\left|e^{is} - 1\right| \leq 2$ gives the bound $\frac{2|x|^n}{n!}$.

Observe that the upper bound $\frac{|x|^{n+1}}{(n+1)!}$ is better for small values of $|x|$, while the bound $\frac{2|x|^n}{n!}$ is better for $|x| > 2(n+1)$.
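The two bounds in Lemma 12.3 are easy to probe numerically. The following check is purely illustrative (not from the notes); it sweeps a few values of $x$ and $n$ and confirms the inequality holds, with a small tolerance for floating-point roundoff.

```python
import cmath
from math import factorial

def taylor_remainder(x, n):
    """|e^{ix} - sum_{k<=n} (ix)^k / k!| computed directly."""
    partial = sum((1j * x) ** k / factorial(k) for k in range(n + 1))
    return abs(cmath.exp(1j * x) - partial)

def lemma_bound(x, n):
    """The bound min{|x|^{n+1}/(n+1)!, 2|x|^n/n!} from Lemma 12.3."""
    return min(abs(x) ** (n + 1) / factorial(n + 1), 2 * abs(x) ** n / factorial(n))

checks = [(x, n) for n in range(5) for x in (-10.0, -1.0, -0.1, 0.3, 2.0, 25.0)]
all_ok = all(taylor_remainder(x, n) <= lemma_bound(x, n) + 1e-9 for x, n in checks)
```

For instance at $x = 25$, $n = 0$ the first bound ($25$) is useless but the second gives $2$, matching $|e^{ix} - 1| \leq 2$.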
Corollary 12.2.
$$\left|\varphi(t) - \sum_{k=0}^n \frac{(it)^k}{k!}E\left[X^k\right]\right| = \left|E\left[e^{itX} - \sum_{k=0}^n\frac{(itX)^k}{k!}\right]\right| \leq E\left|e^{itX} - \sum_{k=0}^n\frac{(itX)^k}{k!}\right| \leq E\left[\min\left\{\frac{|tX|^{n+1}}{(n+1)!}, \frac{2|tX|^n}{n!}\right\}\right].$$
13. Central Limit Theorems
We are almost ready to prove the central limit theorem for i.i.d. sequences, but first we need a few more elementary facts.

Lemma 13.1. Let $z_1, ..., z_n$ and $w_1, ..., w_n$ be complex numbers, each having modulus at most $\theta$. Then
$$\left|\prod_{k=1}^n z_k - \prod_{k=1}^n w_k\right| \leq \theta^{n-1}\sum_{k=1}^n |z_k - w_k|.$$
After all of the work of the past two sections, we are finally able to prove the classical central limit theorem!

Theorem 13.2 (Central Limit Theorem). If $X_1, X_2, ...$ are i.i.d. with $E[X_1] = \mu$ and $\operatorname{Var}(X_1) = \sigma^2 \in (0, \infty)$, then
$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \Rightarrow Z \sim N(0, 1).$$
Multiplying the standardized sum by $\frac{(1/n)}{(1/n)}$ gives $\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \Rightarrow Z$, where $\overline{X}_n = \frac{1}{n}S_n$ is the sample mean and $\frac{\sigma}{\sqrt{n}}$ is the standard error. Thus the CLT can be interpreted as a statement about how sample averages fluctuate about the true mean.
"I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the Law of Frequency of Error. The Law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and complete self-effacement amidst the wildest confusion. The huger the mob and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of unreason." (Francis Galton)
The most common application of the central limit theorem is to provide justification for approximating a sum of i.i.d. random variables (possibly with unknown distributions) with a normal random variable (for large values of $n$). However, one should keep in mind that, a priori, the identification is only valid in the $n \to \infty$ limit. It is truly amazing that it holds at all, and often it is remarkably accurate even for small values of $n$, but one needs empirical evidence or more advanced theory to justify the approximation for finite sample averages.
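A small Monte Carlo experiment gives a feel for the quality of the approximation at moderate $n$; the simulation below is illustrative only (not from the notes), using Exponential(1) summands, a visibly skewed distribution with mean and standard deviation both equal to 1.

```python
import math, random

random.seed(42)

def standardized_sum(n, sampler, mean, sd):
    """(S_n - n*mu) / (sigma * sqrt(n)) for n i.i.d. draws from sampler."""
    s = sum(sampler() for _ in range(n))
    return (s - n * mean) / (sd * math.sqrt(n))

n, reps = 400, 5000
vals = [standardized_sum(n, lambda: random.expovariate(1.0), 1.0, 1.0)
        for _ in range(reps)]
# Under the normal approximation these should be near Phi(0) = 0.5
# and Phi(1) - Phi(-1) ~ 0.683, respectively
frac_below_zero = sum(v <= 0 for v in vals) / reps
frac_within_one = sum(abs(v) <= 1 for v in vals) / reps
```

Even at $n = 400$ the fractions land close to the normal predictions, though residual skew is still visible in finer statistics.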
Our next order of business is to adapt the argument from Theorem 13.2 to prove one of the most well-known generalizations of the central limit theorem, which applies to triangular arrays of independent (but not necessarily identically distributed) random variables. (Throughout, $E[X; A]$ denotes $E[X 1_A]$, the expectation of $X$ restricted to the event $A$.)
Theorem 13.3 (Lindeberg-Feller). For each $n \in \mathbb{N}$, let $X_{n,1}, ..., X_{n,n}$ be independent random variables with $E[X_{n,m}] = 0$. If
(1) $\lim_{n\to\infty}\sum_{m=1}^n E\left[X_{n,m}^2\right] = \sigma^2 \in (0, \infty)$,
(2) $\lim_{n\to\infty}\sum_{m=1}^n E\left[X_{n,m}^2; |X_{n,m}| > \varepsilon\right] = 0$ for all $\varepsilon > 0$,
then $S_n := \sum_{m=1}^n X_{n,m} \Rightarrow \sigma Z$ where $Z$ has the standard normal distribution.

Proof.
Let $\varphi_{n,m}(t) = E\left[e^{itX_{n,m}}\right]$ and $\sigma_{n,m}^2 = E\left[X_{n,m}^2\right]$. By Theorem 12.3, it suffices to show that
$$\prod_{m=1}^n \varphi_{n,m}(t) \to e^{-\frac{t^2\sigma^2}{2}}.$$
By Corollary 12.2 (with $n = 2$) and the mean zero assumption,
$$\left|\varphi_{n,m}(t) - \left(1 - \frac{t^2\sigma_{n,m}^2}{2}\right)\right| \leq E\left[\min\left\{\frac{|tX_{n,m}|^3}{6}, t^2X_{n,m}^2\right\}\right] \leq \varepsilon|t|^3 E\left[X_{n,m}^2\right] + t^2 E\left[X_{n,m}^2; |X_{n,m}| > \varepsilon\right],$$
using the first bound on $\{|X_{n,m}| \leq \varepsilon\}$ and the second on $\{|X_{n,m}| > \varepsilon\}$. Summing over $m \in [n]$, taking limits, and appealing to assumptions 1 and 2 gives
$$\limsup_{n\to\infty}\sum_{m=1}^n\left|\varphi_{n,m}(t) - \left(1 - \frac{t^2\sigma_{n,m}^2}{2}\right)\right| \leq \varepsilon|t|^3\limsup_{n\to\infty}\sum_{m=1}^n E\left[X_{n,m}^2\right] + t^2\limsup_{n\to\infty}\sum_{m=1}^n E\left[X_{n,m}^2; |X_{n,m}| > \varepsilon\right] = \varepsilon|t|^3\sigma^2,$$
hence $\lim_{n\to\infty}\sum_{m=1}^n\left|\varphi_{n,m}(t) - \left(1 - \frac{t^2\sigma_{n,m}^2}{2}\right)\right| = 0$ as $\varepsilon$ can be taken arbitrarily small.
We now observe that $\sigma_{n,m}^2 \leq \varepsilon^2 + E\left[X_{n,m}^2; |X_{n,m}| > \varepsilon\right]$ for all $\varepsilon > 0$, and the latter term goes to $0$ as $n \to \infty$ uniformly in $m$ by assumption 2, so $\max_{1\leq m\leq n}\sigma_{n,m}^2 \to 0$. In particular, for $n$ large enough, both $\varphi_{n,m}(t)$ and $1 - \frac{t^2\sigma_{n,m}^2}{2}$ have modulus at most $1$, thus Lemma 13.1 gives
$$\limsup_{n\to\infty}\left|\prod_{m=1}^n\varphi_{n,m}(t) - \prod_{m=1}^n\left(1 - \frac{t^2\sigma_{n,m}^2}{2}\right)\right| \leq \limsup_{n\to\infty}\sum_{m=1}^n\left|\varphi_{n,m}(t) - \left(1 - \frac{t^2\sigma_{n,m}^2}{2}\right)\right| = 0.$$
Finally, since $\lim_{n\to\infty}\sum_{j=1}^n\left(-\frac{t^2\sigma_{n,j}^2}{2}\right) = -\frac{t^2\sigma^2}{2}$, it follows from Fact 11.2 that $\prod_{m=1}^n\left(1 - \frac{t^2\sigma_{n,m}^2}{2}\right) \to e^{-\frac{t^2\sigma^2}{2}}$ as $n\to\infty$, and the proof is complete.
Roughly, Theorem 13.3 says that a sum of a large number of small independent effects has an approximately normal distribution.
Example 13.1. Let $\pi_n$ be a permutation chosen from the uniform distribution on $S_n$ and let $K_n = K(\pi_n)$ be the number of cycles in $\pi_n$. For example, if $\pi$ is the permutation of $\{1, 2, \ldots, 6\}$ with cycle decomposition $(3\,2)(5\,4\,1)(6)$, then $K(\pi) = 3$.
* Observe that there are $\frac{n!}{\prod_{k=1}^n k^{\lambda_k}\lambda_k!}$ ways to write a permutation having $\lambda_k$ $k$-cycles, $k = 1, ..., n$.
Indeed, once we have fixed the placement of parentheses dictated by the cycle type - say beginning with $\lambda_1$ pairs of parentheses having room for 1 symbol, followed by $\lambda_2$ pairs of parentheses having room for 2 symbols, and so forth - there are $n!$ ways to distribute the $n$ symbols amongst the parentheses.
But this overcounts since we can permute each of the $\lambda_k$ $k$-cycles amongst themselves and we can write each $k$-cycle in $k$ different ways.
For this reason, it is sometimes helpful to use the canonical cycle notation wherein the largest element appears first within a cycle and cycles are sorted in increasing order of their first element.
** Note that the map which drops the parentheses in the canonical cycle notation of $\sigma$ to obtain $\sigma'$ in one-line notation (so that $\sigma = (3\,2)(5\,4\,1)(6)$ becomes $\sigma' = 325416$, for example) gives a bijection between permutations with $k$ cycles and permutations with $k$ record values.
(A record value of $\sigma \in S_n$ is a number $j \in [n]$ such that $\sigma(j) > \sigma(i)$ for all $i < j$. Here we are thinking of $\sigma$ in one-line notation.)
Now the number of permutations of $[n]$ having $k$ cycles is the unsigned Stirling number of the first kind, $c(n, k)$, which satisfies $c(n+1, k) = c(n, k-1) + n\,c(n, k)$: a permutation of $[n+1]$ with $k$ cycles either has $n+1$ as a fixed point (a cycle of size 1) or not. The number of the former is just $c(n, k - 1)$ and the number of the latter is $n\,c(n, k)$, as $n+1$ can follow any of the first $n$ symbols divided into $k$ cyclically ordered groups.
Thus, in principle, one can explicitly compute $P(K_n = k) = \frac{c(n, k)}{n!}$, but this is computationally prohibitive for large $n$.
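For moderate $n$, the recursion above makes the exact computation painless. The sketch below is illustrative (not from the notes): it builds the row $c(n, \cdot)$ from the recursion $c(m+1, k) = c(m, k-1) + m\,c(m, k)$ and converts it into the p.m.f. of $K_n$.

```python
from fractions import Fraction
from math import factorial

def stirling_first_row(n):
    """Unsigned Stirling numbers of the first kind c(n, k), k = 0..n, via
    c(m+1, k) = c(m, k-1) + m * c(m, k), starting from c(0, 0) = 1."""
    row = [1]
    for m in range(n):
        row = [(row[k - 1] if k >= 1 else 0) + m * (row[k] if k < len(row) else 0)
               for k in range(len(row) + 1)]
    return row

n = 6
row = stirling_first_row(n)                       # [0, 120, 274, 225, 85, 15, 1]
pmf = [Fraction(c, factorial(n)) for c in row]    # P(K_n = k), exact rationals
```

Since every permutation has some number of cycles, the row must sum to $n!$, which serves as a built-in consistency check.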
We will show that when suitably standardized, $K_n$ is asymptotically normal. To do so, we will construct a uniform random permutation via the following "Chinese restaurant" scheme: In a restaurant with many large circular tables, Person 1 enters and sits at a table. Then Person 2 enters and either sits to the right of Person 1 or at a new table with equal probability. In general, when person $k$ enters, they are equally likely to sit to the right of any of the $k-1$ seated customers or to sit at an empty table. We associate the seating arrangement after $n$ people have entered with the permutation whose cycles are given by the tables.
That this generates a permutation from the uniform distribution follows by induction: It is certainly true when $n = 1$, and if all permutations of $[n-1]$ are equally likely before person $n$ sits down, then the rules of the process ensure that we have a uniform permutation of $[n]$ afterward by the same line of reasoning used to establish the recursion for $c(n, k)$.
If we let $X_{n,k}$ be the indicator that Person $k$ sits at an unoccupied table, then $K_n = \sum_{k=1}^n X_{n,k}$.
Since the $X_{n,k}$'s are clearly independent, we have
$$E[K_n] = \sum_{k=1}^n E[X_{n,k}] = \sum_{k=1}^n P(k \text{ sits at a new table}) = \sum_{k=1}^n \frac{1}{k} \approx \log(n)$$
and
$$\operatorname{Var}(K_n) = \sum_{k=1}^n \operatorname{Var}(X_{n,k}) = \sum_{k=1}^n\left(\frac{1}{k} - \frac{1}{k^2}\right) \approx \log(n).$$
More precisely, if we set $Y_{n,k} = \frac{X_{n,k} - \frac{1}{k}}{\sqrt{\log(n)}}$, then $E[Y_{n,k}] = 0$ and
$$\sum_{k=1}^n E\left[Y_{n,k}^2\right] = \frac{1}{\log(n)}\operatorname{Var}(K_n) = \frac{1}{\log(n)}\sum_{k=1}^n\left(\frac{1}{k} - \frac{1}{k^2}\right) \to 1.$$
Also,
$$\sum_{k=1}^n E\left[Y_{n,k}^2; |Y_{n,k}| > \varepsilon\right] \to 0$$
since the sum is $0$ once $\log(n)^{-\frac{1}{2}} < \varepsilon$.
Therefore, Theorem 13.3 implies that $\sum_{k=1}^n Y_{n,k} \Rightarrow Z \sim N(0, 1)$.
Because $\sum_{k=2}^n \frac{1}{k} \leq \int_1^n \frac{dx}{x} \leq \sum_{k=1}^{n-1}\frac{1}{k}$, the conclusion can be written as $\frac{K_n - \log(n)}{\sqrt{\log(n)}} \Rightarrow Z$.
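The independent-indicator representation makes $K_n$ trivial to simulate. The following illustrative sketch (not part of the notes) samples the indicators directly and checks that the average number of cycles matches the harmonic sum $E[K_n] = \sum_{k\leq n} 1/k$.

```python
import random

random.seed(0)

def sample_num_cycles(n):
    """Person k starts a new table (equivalently, a new cycle) with
    probability 1/k, independently over k, so the sum is distributed as K_n."""
    return sum(random.random() < 1.0 / k for k in range(1, n + 1))

n, reps = 200, 4000
avg = sum(sample_num_cycles(n) for _ in range(reps)) / reps
harmonic = sum(1.0 / k for k in range(1, n + 1))   # E[K_n], exactly
```

With 4000 replications the Monte Carlo error is roughly $\sqrt{\log n}/\sqrt{4000} \approx 0.04$, so `avg` should be very close to `harmonic` ($\approx 5.88$ for $n = 200$).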
Corollary 13.1. Suppose that $X_1, X_2, ...$ are independent random variables with $E[X_k] = 0$ and $\operatorname{Var}(X_k) = \sigma_k^2 \in (0, \infty)$ for all $k$. Let $S_n = \sum_{k=1}^n X_k$ and $s_n^2 = \sum_{k=1}^n \sigma_k^2$. If the sequence satisfies Lindeberg's condition:
$$\lim_{n\to\infty}\frac{1}{s_n^2}\sum_{k=1}^n E\left[X_k^2; |X_k| > \varepsilon s_n\right] = 0 \quad\text{for all } \varepsilon > 0,$$
then $\frac{S_n}{s_n} \Rightarrow Z \sim N(0, 1)$.
Of course the mean zero condition is just a matter of convenience since finite variance implies finite mean, so one can always subtract off the means.
Note that the classical central limit theorem is an immediate consequence of Corollary 13.1 since $X_1, X_2, ...$ i.i.d. with mean zero and finite variance $\sigma^2$ gives $s_n = \sigma\sqrt{n}$ and
$$\lim_{n\to\infty}\frac{1}{s_n^2}\sum_{k=1}^n E\left[X_k^2; |X_k| > \varepsilon s_n\right] = \frac{1}{\sigma^2}\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n E\left[X_k^2; |X_k| > \varepsilon\sigma\sqrt{n}\right] = \frac{1}{\sigma^2}\lim_{n\to\infty}E\left[X_1^2; |X_1| > \varepsilon\sigma\sqrt{n}\right] = 0$$
by the dominated convergence theorem.
Example 13.2. Let $X_1, X_2, ...$ be independent with $X_1 \sim N(0, 1)$ and $X_k \sim N(0, 2^{k-2})$ for $k \geq 2$.
Setting $S_n = \sum_{k=1}^n X_k$, we have
$$s_n^2 = \sum_{k=1}^n \operatorname{Var}(X_k) = 1 + \sum_{k=2}^n 2^{k-2} = 1 + \sum_{k=0}^{n-2} 2^k = 1 + \frac{2^{n-1} - 1}{2 - 1} = 2^{n-1}.$$
For any $\varepsilon \in \left(0, \frac{1}{\sqrt{2}}\right)$ and $n \geq 2$,
$$E\left[X_n^2; |X_n| > \varepsilon s_n\right] \geq E[X_n^2] - E\left[X_n^2; |X_n| \leq \varepsilon s_n\right] \geq E[X_n^2] - \varepsilon^2 s_n^2 P(|X_n| \leq \varepsilon s_n) \geq 2^{n-2} - \varepsilon^2 2^{n-1},$$
thus
$$\frac{1}{s_n^2}\sum_{k=1}^n E\left[X_k^2; |X_k| > \varepsilon s_n\right] \geq \frac{E\left[X_n^2; |X_n| > \varepsilon s_n\right]}{s_n^2} \geq \frac{2^{n-2} - \varepsilon^2 2^{n-1}}{2^{n-1}} = \frac{1}{2} - \varepsilon^2 > 0$$
for all $n \geq 2$, so Lindeberg's condition fails.
However, we observe that if $W_1$ and $W_2$ are independent with $W_k \sim N(\mu_k, \sigma_k^2)$, then $W_k$ has ch.f. $\varphi_k(t) = e^{i\mu_k t - \frac{\sigma_k^2 t^2}{2}}$, hence $W_1 + W_2$ has ch.f. $\varphi_{W_1+W_2}(t) = \varphi_1(t)\varphi_2(t) = e^{i(\mu_1+\mu_2)t - \frac{(\sigma_1^2+\sigma_2^2)t^2}{2}}$, so sums of independent normals are normal. In particular, $\frac{S_n}{s_n} \sim N(0, 1)$ for every $n$, hence $\frac{S_n}{s_n} \Rightarrow N(0, 1)$ even though Lindeberg's condition fails: the condition is sufficient but not necessary.
14. Poisson Convergence
We now turn our attention to one of the more ubiquitous discrete limiting distributions, the Poisson, beginning with the law of rare events (or weak law of small numbers). It is instructive to compare the following result and its corresponding proof with that of the Lindeberg-Feller theorem.
Theorem 14.1. For each $n \in \mathbb{N}$, let $X_{n,1}, ..., X_{n,n}$ be independent with $P(X_{n,m} = 1) = p_{n,m}$ and $P(X_{n,m} = 0) = 1 - p_{n,m}$.
Suppose that as $n \to \infty$,
(1) $\sum_{m=1}^n p_{n,m} \to \lambda \in (0, \infty)$,
(2) $\max_{1\leq m\leq n} p_{n,m} \to 0$.
Then $S_n = \sum_{m=1}^n X_{n,m} \Rightarrow W \sim \operatorname{Poisson}(\lambda)$.

Proof. Independence gives
$$\varphi_{S_n}(t) = E\left[e^{itS_n}\right] = \prod_{m=1}^n\left(1 + p_{n,m}\left(e^{it} - 1\right)\right),$$
and $\left|1 + p\left(e^{it} - 1\right)\right| = \left|p\cdot e^{it} + (1-p)\cdot 1\right| \leq 1$ since it is on the line segment connecting $1$ to $e^{it}$, which is a chord of the unit circle.
By assumption 2, we have $\max_{1\leq m\leq n} p_{n,m} \leq \frac{1}{2}$ for $n$ sufficiently large, and thus $\max_{1\leq m\leq n}\left|p_{n,m}\left(e^{it} - 1\right)\right| \leq 1$.
Since $\left|e^z - (1+z)\right| \leq |z|^2$ for $|z| \leq 1$ and $\left|e^{p_{n,m}(e^{it}-1)}\right| = e^{p_{n,m}(\cos t - 1)} \leq 1$, Lemma 13.1 (with $\theta = 1$) gives
$$\left|\prod_{m=1}^n\left(1 + p_{n,m}\left(e^{it} - 1\right)\right) - \exp\left(\sum_{m=1}^n p_{n,m}\left(e^{it} - 1\right)\right)\right| \leq \sum_{m=1}^n\left|p_{n,m}\left(e^{it} - 1\right)\right|^2 \leq 4\max_{1\leq m\leq n}p_{n,m}\sum_{m=1}^n p_{n,m} \to 0$$
by assumptions 1 and 2.
Therefore, since assumption 1 implies $\exp\left(\sum_{m=1}^n p_{n,m}\left(e^{it} - 1\right)\right) \to e^{\lambda(e^{it}-1)}$,
$$\varphi_{S_n}(t) = \prod_{m=1}^n\left(1 + p_{n,m}\left(e^{it} - 1\right)\right) \to e^{\lambda(e^{it}-1)} = \varphi_W(t),$$
where $W \sim \operatorname{Poisson}(\lambda)$, and the result follows from Theorem 12.3.
The hypotheses can be relaxed to allow values other than $0$ and $1$: if the $X_{n,m}$ are independent $\mathbb{N}_0$-valued random variables with $P(X_{n,m} = 1) = p_{n,m}$ and $P(X_{n,m} \geq 2) = \varepsilon_{n,m}$, and assumptions 1 and 2 hold along with $\sum_{m=1}^n \varepsilon_{n,m} \to 0$, then the conclusion $S_n \Rightarrow W \sim \operatorname{Poisson}(\lambda)$ persists.

Proof.
Let $Y_{n,m} = 1\{X_{n,m} = 1\}$ and $T_n = \sum_{m=1}^n Y_{n,m}$. Theorem 14.1 implies $T_n \Rightarrow W \sim \operatorname{Poisson}(\lambda)$, so, since $P(S_n \neq T_n) \leq \sum_{m=1}^n \varepsilon_{n,m} \to 0$ (thus $S_n - T_n \to_p 0$), $S_n = T_n + (S_n - T_n) \Rightarrow W$ by Slutsky's theorem.

It is worth mentioning that, just as in the normal case, independence is not a strictly necessary condition for Poisson convergence. To relax the assumption in general one needs to use a different proof strategy than convergence of characteristic functions. However, it is sometimes possible to give direct proofs by simple calculations.
For example, let $\pi$ be a uniformly random permutation of $[n]$ and let $X_{n,m} = 1\{\pi(m) = m\}$, so that $T_n = \sum_{m=1}^n X_{n,m}$ counts the fixed points of $\pi$. The $X_{n,m}$ are not independent, but inclusion-exclusion gives
$$P(T_n > 0) = P\left(\bigcup_{m=1}^n\{X_{n,m} = 1\}\right) = \sum_{m=1}^n P(X_{n,m} = 1) - \sum_{l<m}P(X_{n,l} = X_{n,m} = 1)$$
$$+ \sum_{k<l<m}P(X_{n,k} = X_{n,l} = X_{n,m} = 1) - \ldots + (-1)^{n+1}P(X_{n,1} = \ldots = X_{n,n} = 1)$$
$$= \sum_{k=1}^n (-1)^{k+1}\binom{n}{k}\frac{(n-k)!}{n!} = \sum_{k=1}^n\frac{(-1)^{k+1}}{k!},$$
so
$$P(T_n = 0) = 1 - P(T_n > 0) = 1 - \sum_{k=1}^n\frac{(-1)^{k+1}}{k!} = \sum_{k=0}^n\frac{(-1)^k}{k!} \to e^{-1}.$$
To compute other values of the mass function for $T_n$, note that
$$P(T_n = m) = \binom{n}{m}\frac{(n-m)!}{n!}P(T_{n-m} = 0) = \frac{1}{m!}P(T_{n-m} = 0) \to \frac{1}{m!}e^{-1}.$$
Therefore, for every $x \in \mathbb{R}$, we have
$$P(T_n \leq x) = \sum_{m\in\mathbb{N}_0:\, m\leq x}P(T_n = m) \to \sum_{m\in\mathbb{N}_0:\, m\leq x}\frac{1}{m!}e^{-1}$$
(as the above sums contain finitely many terms), so the number of fixed points in a permutation of length $n$ converges weakly to the Poisson(1) distribution as $n \to \infty$.
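A quick simulation corroborates the Poisson(1) limit; the following sketch is illustrative only (not part of the notes) and checks both $P(T_n = 0) \approx e^{-1}$ and the exact identity $E[T_n] = 1$.

```python
import random
from math import exp

random.seed(1)

def num_fixed_points(n):
    """Number of fixed points of a uniformly random permutation of [n]."""
    perm = list(range(n))
    random.shuffle(perm)
    return sum(perm[i] == i for i in range(n))

n, reps = 50, 20_000
samples = [num_fixed_points(n) for _ in range(reps)]
p_zero = sum(s == 0 for s in samples) / reps   # limit: e^{-1} ~ 0.3679
mean_fp = sum(samples) / reps                  # E[T_n] = 1 for every n
```

Note that the expectation check is exact for every $n$ (each position is fixed with probability $1/n$), while the derangement probability only matches $e^{-1}$ in the limit, though the convergence is extremely fast.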
For most common discrete random variables, rather than memorize the p.m.f.s, one needs only to understand the stories that they tell, and the probabilities follow easily from combinatorial considerations.
For example, if $X \sim \operatorname{Binomial}(n, p)$, then the story is that $X$ gives the number of heads in $n$ independent flips of a coin with heads probability $p$. To compute the probability that $X = k$ for $k = 0, 1, ..., n$, we note that any sequence of $k$ heads and $n-k$ tails has probability $p^k(1-p)^{n-k}$ by independence. Since the number of such sequences is determined by specifying where the heads occur, of which there are $\binom{n}{k}$ possibilities, the binomial p.m.f. is given by $P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$, $k = 0, 1, ..., n$.
Similarly, if $Y \sim \operatorname{Hypergeometric}(N, M, n)$, then the story is that we sample $n$ items without replacement from a set of $N$ items of which $M$ are distinguished, and $Y$ counts the number of distinguished items in our sample. For $k \leq \min(n, M)$, there are $\binom{M}{k}$ ways to choose $k$ distinguished items, $\binom{N-M}{n-k}$ ways to choose the remaining $n - k$ items, and $\binom{N}{n}$ possible samples, so $P(Y = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}$.
Because the Poisson distribution assigns positive mass to infinitely many outcomes, determining the p.m.f. is not such a simple matter of counting. Nonetheless, the preceding results do supply us with an appropriate story:
For $0 \leq s < t$, let $N(s, t)$ denote the number of occurrences of a given type of event in the time interval $(s, t]$ - say, the number of arrivals at a restaurant between $s$ and $t$ minutes after it opens. Suppose that
(1) the numbers of events in disjoint time intervals are independent;
(2) the distribution of $N(s, t)$ depends only on $t - s$;
(3) $P(N(0, h) = 1) = \lambda h + o(h)$ as $h \to 0$;
(4) $P(N(0, h) \geq 2) = o(h)$ as $h \to 0$.
Then $N(0, t) \sim \operatorname{Poisson}(\lambda t)$.
Proof.
Define $X_{n,m} = N\left(\frac{(m-1)t}{n}, \frac{mt}{n}\right)$. Property 1 shows that $X_{n,1}, ..., X_{n,n}$ are independent; properties 2 and 3 show that $p_{n,m} = P(X_{n,m} = 1) = P(X_{n,1} = 1) = \frac{\lambda t}{n} + o\left(\frac{t}{n}\right)$; and properties 2 and 4 show that $\sum_{m=1}^n P(X_{n,m} \geq 2) = nP(X_{n,1} \geq 2) \to 0$. The relaxed version of Theorem 14.1 then shows that $X_{n,1} + ... + X_{n,n} \Rightarrow W \sim \operatorname{Poisson}(\lambda t)$. The result follows by observing that $X_{n,1} + ... + X_{n,n} = N(0, t)$ for all $n$.
The random variables $N(0, t)$ as $t$ ranges over $[0, \infty)$ are an example of a continuous time stochastic process:

Definition. A family of random variables $\{N(t)\}_{t\geq 0}$ is called a Poisson process with rate $\lambda$ if it satisfies:
(1) For any $0 = t_0 < t_1 < ... < t_n$, the random variables $N(t_k) - N(t_{k-1})$, $k = 1, ..., n$ are independent;
(2) For any $0 \leq s < t$, $N(t) - N(s) \sim \operatorname{Poisson}(\lambda(t - s))$.
To better understand the process $\{N(t)\}_{t\geq 0}$, it is useful to consider the following construction which explains our arrivals story and provides a bridge between the Poisson and exponential distributions:
Let $\xi_1, \xi_2, ...$ be i.i.d. exponentials with mean $\lambda^{-1}$ - that is, $P(\xi_i > t) = e^{-\lambda t}$ for $t \geq 0$.
Define $T_n = \sum_{i=1}^n \xi_i$ and $N(t) = \sup\{n : T_n \leq t\}$.
If we think of the $\xi_i$'s as interarrival times, then $T_n$ gives the time of the $n$th arrival and $N(t)$ is the number of arrivals by time $t$.
Since a sum of $n$ i.i.d. $\operatorname{Exponential}(\lambda)$ R.V.s has a $\Gamma(n, \lambda^{-1})$ distribution*, we see that $T_n$ has density
$$f_{T_n}(s) = \frac{\lambda^n s^{n-1}}{(n-1)!}e^{-\lambda s}, \quad s \geq 0.$$
Accordingly,
$$P(N(t) = 0) = P(T_1 > t) = e^{-\lambda t} = e^{-\lambda t}\frac{(\lambda t)^0}{0!}$$
and
$$P(N(t) = n) = P(T_n \leq t < T_{n+1}) = \int_0^t f_{T_n}(s)P(\xi_{n+1} > t - s)\,ds = \int_0^t \frac{\lambda^n s^{n-1}}{(n-1)!}e^{-\lambda s}e^{\lambda(s-t)}\,ds = e^{-\lambda t}\frac{\lambda^n}{n!}\int_0^t ns^{n-1}\,ds = e^{-\lambda t}\frac{(\lambda t)^n}{n!}$$
for $n \geq 1$, so $N(t) \sim \operatorname{Poisson}(\lambda t)$.
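The exponential-interarrival construction is also a recipe for simulation. The sketch below is illustrative only (not part of the notes): it builds $N(t)$ by accumulating Exponential($\lambda$) waiting times and checks two Poisson($\lambda t$) facts.

```python
import random
from math import exp

random.seed(7)

def poisson_process_count(lam, t):
    """N(t) from the construction: accumulate i.i.d. Exponential(lam)
    interarrival times until they exceed t, counting arrivals."""
    total, n = 0.0, 0
    while True:
        total += random.expovariate(lam)
        if total > t:
            return n
        n += 1

lam, t, reps = 2.0, 3.0, 10_000
counts = [poisson_process_count(lam, t) for _ in range(reps)]
mean_count = sum(counts) / reps              # theory: lam * t = 6
p_zero = sum(c == 0 for c in counts) / reps  # theory: exp(-lam * t)
```

Both the mean and the probability of no arrivals should match the Poisson($6$) values up to Monte Carlo error.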
To check that the number of arrivals in disjoint intervals is independent, we note that for all $n \in \mathbb{N}$ and all $u > t > 0$,
$$P(T_{n+1} \geq u, N(t) = n) = P(T_{n+1} \geq u, T_n \leq t) = \int_0^t f_{T_n}(s)P(\xi_{n+1} \geq u - s)\,ds$$
$$= \int_0^t\frac{\lambda^n s^{n-1}}{(n-1)!}e^{-\lambda s}e^{\lambda(s-u)}\,ds = e^{-\lambda u}\frac{(\lambda t)^n}{n!} = e^{-\lambda(u-t)}e^{-\lambda t}\frac{(\lambda t)^n}{n!} = e^{-\lambda(u-t)}P(N(t) = n),$$
and thus
$$P(T_{n+1} \geq u \mid N(t) = n) = \frac{P(T_{n+1} \geq u, N(t) = n)}{P(N(t) = n)} = e^{-\lambda(u-t)}.$$
Writing $s = u - t$ and letting $T_1' = T_{N(t)+1} - t$ denote the waiting time for the first arrival after time $t$, the above is equivalent to $P(T_1' \geq s \mid N(t) = n) = e^{-\lambda s}$, so
$$P(T_1' \geq s) = \sum_{n=0}^\infty P(T_{n+1} - t \geq s \mid N(t) = n)P(N(t) = n) = e^{-\lambda s}.$$
Applying the same reasoning to the subsequent interarrival times $T_k' = T_{N(t)+k} - T_{N(t)+k-1}$, $k \geq 2$, we see that $T_1', T_2', ...$ are i.i.d. $\operatorname{Exponential}(\lambda)$ and independent of $N(t)$.
In other words, the arrivals after time $t$ are independent of $N(t)$ and have the same distribution as the original arrival sequence.
It follows that for any $0 = t_0 < t_1 < ... < t_n$, the increments $N(t_1) - N(t_0), ..., N(t_n) - N(t_{n-1})$ are independent Poissons.
This is because the vector $(N(t_2) - N(t_1), ..., N(t_n) - N(t_{n-1}))$ is measurable with respect to $\sigma(T_1', T_2', ...)$ (where the $T_i'$'s are constructed as above with $t = t_1$) and so is independent of $N(t_1)$.
Then an induction argument gives
$$P\left(N(t_1) - N(t_0) = k_1, \ldots, N(t_n) - N(t_{n-1}) = k_n\right) = \prod_{i=1}^n e^{-\lambda(t_i - t_{i-1})}\frac{\left(\lambda(t_i - t_{i-1})\right)^{k_i}}{k_i!}.$$
* To keep the discussion self-contained, we show that sums of independent exponentials are gammas.
Suppose that $Y \sim \Gamma\left(n, \lambda^{-1}\right)$ and $W \sim \operatorname{Exp}(\lambda)$ are independent and set $Z = W + Y$.
Then $Z$ has positive support and Theorem 6.6 shows that for $z > 0$,
$$f_Z(z) = (f_W * f_Y)(z) = \int_{-\infty}^\infty f_W(z - y)f_Y(y)\,dy = \int_0^z \lambda e^{-\lambda(z-y)}\frac{\lambda^n}{(n-1)!}y^{n-1}e^{-\lambda y}\,dy = \frac{\lambda^{n+1}}{(n-1)!}e^{-\lambda z}\int_0^z y^{n-1}\,dy = \frac{\lambda^{n+1}}{\Gamma(n+1)}z^{(n+1)-1}e^{-\lambda z}.$$
It follows by induction that if $X_1, \ldots, X_n$ are i.i.d. $\operatorname{Exp}(\lambda)$, then $X_1 + \ldots + X_n \sim \Gamma\left(n, \lambda^{-1}\right)$.
15. Stein's Method
We have mentioned previously that a shortcoming of the limit theorems presented thus far is that they do not come with any quantitative information about the rate of convergence.
A proof of the Berry-Esseen theorem for normal convergence rates in the Kolmogorov metric is given in Durrett, and a proof of Poisson convergence with rates in the total variation metric is given there as well.
Rather than reproduce these classical results, we will obtain similar bounds using Stein's method in order to give a glimpse of this relatively modern technique which can be applied to all sorts of different distributions.
As our purpose is expository, we will present some of the more straightforward approaches rather than seek optimal constants.
Stein's method refers to a framework based on solutions of certain differential or difference equations for bounding the distance between the distribution of a random variable $X$ and that of a random variable $Z$ having some specified target distribution.
The metrics for which this approach is applicable are of the form
$$d_{\mathcal{H}}(X, Z) = \sup_{h\in\mathcal{H}}\left|E[h(X)] - E[h(Z)]\right|$$
for some suitable class of functions $\mathcal{H}$, and include the Kolmogorov, Wasserstein, and total variation distances as special cases. These cases arise by taking $\mathcal{H}$ to be the set of indicators of the form $1_{(-\infty,a]}$, 1-Lipschitz functions, and indicators of Borel sets, respectively. Convergence in each of these three metrics is strictly stronger than weak convergence (which can be metrized by taking $\mathcal{H}$ as the set of 1-Lipschitz functions with $\|h\|_\infty \leq 1$).
The basic idea is to find an operator $\mathcal{A}$ such that $E[(\mathcal{A}f)(X)] = 0$ for all $f$ belonging to some sufficiently rich class $\mathcal{F}$ if and only if $X$ has the target distribution. If, for each $h \in \mathcal{H}$, the equation
$$(\mathcal{A}f)(x) = h(x) - E[h(Z)]$$
has solution $f_h \in \mathcal{F}$, then upon taking expectations, absolute values, and suprema, one finds that
$$d_{\mathcal{H}}(X, Z) = \sup_{h\in\mathcal{H}}\left|E[(\mathcal{A}f_h)(X)]\right|.$$
Remarkably, it is often easier to work with the right-hand side of this equation, and the techniques for analyzing distances between probability distributions in this manner are collectively known as Stein's method.
Stein's method is a vast field with over a thousand existing articles and books and new ones written all the time, so we will only be able to scratch the surface here. In particular, we will not prove any results for dependent random variables. (Other than supplying convergence rates, the principal advantage of Stein's method is that it often enables one to prove limit theorems when there is some weak or local dependence, whereas characteristic function approaches typically fall apart when there is dependence of any sort.)
An excellent place to learn more about Stein's method (and the primary reference for this exposition) is the survey Fundamentals of Stein's Method by Nathan Ross.

Lemma 15.1. Define $(\mathcal{A}f)(x) = f'(x) - xf(x)$. If $Z \sim N(0, 1)$, then $E[(\mathcal{A}f)(Z)] = 0$ for all absolutely continuous $f$ with $E|f'(Z)| < \infty$.
Proof. Let $f$ be as in the statement of the lemma. Then Fubini's theorem gives
$$E[f'(Z)] = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}f'(x)e^{-\frac{x^2}{2}}\,dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^0 f'(x)e^{-\frac{x^2}{2}}\,dx + \frac{1}{\sqrt{2\pi}}\int_0^\infty f'(x)e^{-\frac{x^2}{2}}\,dx$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^0 f'(x)\left(-\int_{-\infty}^x ye^{-\frac{y^2}{2}}\,dy\right)dx + \frac{1}{\sqrt{2\pi}}\int_0^\infty f'(x)\left(\int_x^\infty ye^{-\frac{y^2}{2}}\,dy\right)dx$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^0\left(-ye^{-\frac{y^2}{2}}\right)\left(\int_y^0 f'(x)\,dx\right)dy + \frac{1}{\sqrt{2\pi}}\int_0^\infty ye^{-\frac{y^2}{2}}\left(\int_0^y f'(x)\,dx\right)dy$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^0 ye^{-\frac{y^2}{2}}\left(f(y) - f(0)\right)dy + \frac{1}{\sqrt{2\pi}}\int_0^\infty ye^{-\frac{y^2}{2}}\left(f(y) - f(0)\right)dy$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty yf(y)e^{-\frac{y^2}{2}}\,dy - f(0)\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty ye^{-\frac{y^2}{2}}\,dy = E[Zf(Z)] - f(0)E[Z] = E[Zf(Z)].$$
Of course, if $\|f'\|_\infty < \infty$, then $E|f'(Z)| < \infty$. It turns out that the condition $E[(\mathcal{A}f)(W)] = 0$ for all absolutely continuous $f$ with $\|f'\|_\infty < \infty$ is also sufficient for $W \sim N(0, 1)$.
Lemma 15.2. If $\Phi$ is the distribution function for the standard normal, then the unique bounded solution to the differential equation
$$f'(w) - wf(w) = 1_{(-\infty,x]}(w) - \Phi(x)$$
is given by
$$f_x(w) = \begin{cases}\sqrt{2\pi}\,e^{\frac{w^2}{2}}\left(1 - \Phi(x)\right)\Phi(w), & w \leq x\\ \sqrt{2\pi}\,e^{\frac{w^2}{2}}\,\Phi(x)\left(1 - \Phi(w)\right), & w > x.\end{cases}$$
Moreover, $f_x$ is absolutely continuous with $\|f_x\|_\infty \leq \sqrt{\frac{\pi}{2}}$ and $\|f_x'\|_\infty \leq 2$.
Proof. Multiplying both sides of the equation $f'(t) - tf(t) = 1_{(-\infty,x]}(t) - \Phi(x)$ by the integrating factor $e^{-\frac{t^2}{2}}$ shows that a bounded solution $f_x$ must satisfy
$$\frac{d}{dt}\left(e^{-\frac{t^2}{2}}f_x(t)\right) = e^{-\frac{t^2}{2}}\left[f_x'(t) - tf_x(t)\right] = e^{-\frac{t^2}{2}}\left(1_{(-\infty,x]}(t) - \Phi(x)\right),$$
and integration gives
$$f_x(w) = e^{\frac{w^2}{2}}\int_{-\infty}^w e^{-\frac{t^2}{2}}\left(1_{(-\infty,x]}(t) - \Phi(x)\right)dt = -e^{\frac{w^2}{2}}\int_w^\infty e^{-\frac{t^2}{2}}\left(1_{(-\infty,x]}(t) - \Phi(x)\right)dt.$$
When $w \leq x$, we have
$$f_x(w) = e^{\frac{w^2}{2}}\int_{-\infty}^w e^{-\frac{t^2}{2}}\left(1_{(-\infty,x]}(t) - \Phi(x)\right)dt = e^{\frac{w^2}{2}}\int_{-\infty}^w e^{-\frac{t^2}{2}}\left(1 - \Phi(x)\right)dt = \sqrt{2\pi}\,e^{\frac{w^2}{2}}\left(1 - \Phi(x)\right)\frac{1}{\sqrt{2\pi}}\int_{-\infty}^w e^{-\frac{t^2}{2}}dt = \sqrt{2\pi}\,e^{\frac{w^2}{2}}\left(1 - \Phi(x)\right)\Phi(w),$$
and when $w > x$,
$$f_x(w) = -e^{\frac{w^2}{2}}\int_w^\infty e^{-\frac{t^2}{2}}\left(1_{(-\infty,x]}(t) - \Phi(x)\right)dt = \sqrt{2\pi}\,e^{\frac{w^2}{2}}\,\Phi(x)\frac{1}{\sqrt{2\pi}}\int_w^\infty e^{-\frac{t^2}{2}}dt = \sqrt{2\pi}\,e^{\frac{w^2}{2}}\,\Phi(x)\left(1 - \Phi(w)\right).$$
To check boundedness, we first observe that for any $z \geq 0$,
$$1 - \Phi(z) = \frac{1}{\sqrt{2\pi}}\int_z^\infty e^{-\frac{t^2}{2}}dt = \frac{1}{\sqrt{2\pi}}\int_0^\infty e^{-\frac{(s+z)^2}{2}}ds = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}\int_0^\infty e^{-\frac{s^2}{2}}e^{-sz}\,ds \leq e^{-\frac{z^2}{2}}\frac{1}{\sqrt{2\pi}}\int_0^\infty e^{-\frac{s^2}{2}}ds = \frac{1}{2}e^{-\frac{z^2}{2}},$$
and, by symmetry, for any $z \leq 0$,
$$\Phi(z) = 1 - \Phi(|z|) \leq \frac{1}{2}e^{-\frac{z^2}{2}}.$$
Since $f_x$ is nonnegative and $f_x(w) = f_{-x}(-w)$, it suffices to show that $f_x$ is bounded above for $x \geq 0$.
If $w > x \geq 0$, then
$$f_x(w) = \sqrt{2\pi}\,e^{\frac{w^2}{2}}\,\Phi(x)\left(1 - \Phi(w)\right) \leq \sqrt{2\pi}\,e^{\frac{w^2}{2}}\cdot 1\cdot\frac{1}{2}e^{-\frac{w^2}{2}} = \sqrt{\frac{\pi}{2}};$$
if $0 < w \leq x$, then
$$f_x(w) = \sqrt{2\pi}\,e^{\frac{w^2}{2}}\left(1 - \Phi(x)\right)\Phi(w) \leq \sqrt{2\pi}\,e^{\frac{w^2}{2}}\cdot\frac{1}{2}e^{-\frac{x^2}{2}}\cdot 1 \leq \sqrt{2\pi}\,e^{\frac{w^2}{2}}\cdot\frac{1}{2}e^{-\frac{w^2}{2}} = \sqrt{\frac{\pi}{2}};$$
and if $w \leq 0 \leq x$, then
$$f_x(w) = \sqrt{2\pi}\,e^{\frac{w^2}{2}}\left(1 - \Phi(x)\right)\Phi(w) \leq \sqrt{2\pi}\,e^{\frac{w^2}{2}}\cdot 1\cdot\frac{1}{2}e^{-\frac{w^2}{2}} = \sqrt{\frac{\pi}{2}}.$$
The claim that $f_x$ is the only bounded solution follows by observing that the homogeneous equation $f'(w) - wf(w) = 0$ has solution $f_h(w) = Ce^{\frac{w^2}{2}}$ for $C \in \mathbb{R}$, so the general solution is given by $f_x(w) + Cf_h(w)$, which is bounded if and only if $C = 0$.
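As a sanity check on the formula for $f_x$ (illustrative only, not part of the notes), one can verify the differential equation numerically by comparing a central difference quotient of $f_x$ with the right-hand side, and confirm the sup-norm bound $\sqrt{\pi/2}$ on a grid.

```python
from math import sqrt, pi, exp, erf

Phi = lambda w: 0.5 * (1 + erf(w / sqrt(2)))   # standard normal c.d.f.

def f_x(w, x):
    """The claimed bounded solution of f'(w) - w f(w) = 1{w <= x} - Phi(x)."""
    if w <= x:
        return sqrt(2 * pi) * exp(w * w / 2) * (1 - Phi(x)) * Phi(w)
    return sqrt(2 * pi) * exp(w * w / 2) * Phi(x) * (1 - Phi(w))

x, w, h = 0.5, -0.7, 1e-6
lhs = (f_x(w + h, x) - f_x(w - h, x)) / (2 * h) - w * f_x(w, x)
rhs = (1.0 if w <= x else 0.0) - Phi(x)
sup_ok = all(f_x(v / 10, x) <= sqrt(pi / 2) + 1e-9 for v in range(-40, 41))
```

The test point $w = -0.7$ is away from the kink at $w = x$, where $f_x$ is continuous but $f_x'$ jumps by $1$.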
For $w > 0$,
$$|wf_x(w)| = \left|-we^{\frac{w^2}{2}}\int_w^\infty e^{-\frac{t^2}{2}}\left(1_{(-\infty,x]}(t) - \Phi(x)\right)dt\right| \leq we^{\frac{w^2}{2}}\int_w^\infty e^{-\frac{t^2}{2}}\left|1_{(-\infty,x]}(t) - \Phi(x)\right|dt$$
$$\leq we^{\frac{w^2}{2}}\int_w^\infty e^{-\frac{t^2}{2}}dt \leq we^{\frac{w^2}{2}}\int_w^\infty\frac{t}{w}e^{-\frac{t^2}{2}}dt = e^{\frac{w^2}{2}}\int_w^\infty te^{-\frac{t^2}{2}}dt = e^{\frac{w^2}{2}}e^{-\frac{w^2}{2}} = 1,$$
and for $w < 0$, the representation $f_x(w) = e^{\frac{w^2}{2}}\int_{-\infty}^w e^{-\frac{t^2}{2}}\left(1_{(-\infty,x]}(t) - \Phi(x)\right)dt$ yields the same bound by a symmetric computation. Since $f_x'(w) = wf_x(w) + 1_{(-\infty,x]}(w) - \Phi(x)$, it follows that $\|f_x'\|_\infty \leq 1 + 1 = 2$.

Theorem 15.1. A random variable $W$ has the standard normal distribution if and only if
$$E[f'(W) - Wf(W)] = 0$$
for all absolutely continuous $f$ with $\|f'\|_\infty < \infty$.

Proof. Necessity was established in Lemma 15.1. For sufficiency, observe that for any $x \in \mathbb{R}$, taking $f_x$ as in Lemma 15.2 implies
$$0 = E[f_x'(W) - Wf_x(W)] = E\left[1_{(-\infty,x]}(W) - \Phi(x)\right] = P(W \leq x) - \Phi(x),$$
so $W$ has distribution function $\Phi$.
The methodology of Lemma 15.2 can be extended to cover more general test functions than indicators of half-lines.
Indeed, the argument given there shows that for any function $h : \mathbb{R} \to \mathbb{R}$ such that
$$Nh := E[h(Z)] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty h(z)e^{-\frac{z^2}{2}}dz$$
exists, the equation $f'(w) - wf(w) = h(w) - Nh$ has solution
$$(*)\qquad f_h(w) = e^{\frac{w^2}{2}}\int_{-\infty}^w\left(h(t) - Nh\right)e^{-\frac{t^2}{2}}dt.$$
Some fairly tedious computations which we will not undertake here show that

Lemma 15.3. For any $h : \mathbb{R} \to \mathbb{R}$ such that $Nh$ exists, let $f_h$ be given by $(*)$.
If $h$ is bounded, then
$$\|f_h\|_\infty \leq \sqrt{\frac{\pi}{2}}\,\|h - Nh\|_\infty, \qquad \|f_h'\|_\infty \leq 2\|h - Nh\|_\infty.$$
If $h$ is absolutely continuous, then
$$\|f_h\|_\infty \leq 2\|h'\|_\infty, \qquad \|f_h'\|_\infty \leq \sqrt{\frac{2}{\pi}}\,\|h'\|_\infty, \qquad \|f_h''\|_\infty \leq 2\|h'\|_\infty.$$
(That the relevant derivatives are defined almost everywhere is part of the statement of Lemma 15.3.)
We can now give bounds on the error in normal approximation for sums of i.i.d. random variables.

Theorem 15.2. Suppose that $X_1, X_2, ..., X_n$ are independent random variables with $E[X_i] = 0$ and $E[X_i^2] = 1$ for all $i = 1, ..., n$. If $W = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$ and $Z \sim N(0, 1)$, then
$$d_W(\mathcal{L}(W), \mathcal{L}(Z)) \leq \frac{3}{n^{\frac{3}{2}}}\sum_{i=1}^n E\left[|X_i|^3\right],$$
where $d_W$ denotes the Wasserstein distance.

Proof. Since Lipschitz functions are absolutely continuous, the second part of Lemma 15.3 applies with $\|h'\|_\infty = 1$. Write $W_i = W - \frac{X_i}{\sqrt{n}}$, and note that $X_i$ and $W_i$ are independent, so $E[X_i f(W_i)] = E[X_i]E[f(W_i)] = 0$.
It follows that
$$E[Wf(W)] = E\left[\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i f(W)\right] = E\left[\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i\left(f(W) - f(W_i)\right)\right].$$
Adding and subtracting $E\left[\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i(W - W_i)f'(W_i)\right]$ yields
$$E[Wf(W)] = E\left[\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i\left(f(W) - f(W_i) - (W - W_i)f'(W_i)\right)\right] + E\left[\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i(W - W_i)f'(W_i)\right].$$
Since $X_i(W - W_i) = \frac{X_i^2}{\sqrt{n}}$, $E[X_i^2] = 1$, and $X_i$ is independent of $W_i$, the second term equals $\frac{1}{n}\sum_{i=1}^n E[f'(W_i)]$, and thus
$$\left|E[f'(W) - Wf(W)]\right| = \left|E\left[\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i\left(f(W) - f(W_i) - (W - W_i)f'(W_i)\right)\right] + \frac{1}{n}\sum_{i=1}^n E[f'(W_i)] - E[f'(W)]\right|$$
$$= \left|E\left[\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i\left(f(W) - f(W_i) - (W - W_i)f'(W_i)\right)\right] + E\left[\frac{1}{n}\sum_{i=1}^n\left(f'(W_i) - f'(W)\right)\right]\right|$$
$$\leq \frac{1}{\sqrt{n}}\sum_{i=1}^n E\left|X_i\left(f(W) - f(W_i) - (W - W_i)f'(W_i)\right)\right| + \frac{1}{n}\sum_{i=1}^n E\left|f'(W_i) - f'(W)\right|.$$
Taylor's theorem in the form
$$f(w) = f(z) + f'(z)(w - z) + \frac{f''(\zeta)}{2}(w - z)^2$$
for some $\zeta$ between $w$ and $z$ gives the bound
$$\left|f(w) - f(z) - (w - z)f'(z)\right| \leq \frac{\|f''\|_\infty}{2}(w - z)^2,$$
so
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n E\left|X_i\left(f(W) - f(W_i) - (W - W_i)f'(W_i)\right)\right| \leq \frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\|f''\|_\infty}{2}E\left[|X_i|(W - W_i)^2\right] = \frac{\|f''\|_\infty}{2\sqrt{n}}\sum_{i=1}^n E\left[|X_i|\left(\frac{X_i}{\sqrt{n}}\right)^2\right] = \frac{\|f''\|_\infty}{2n^{\frac{3}{2}}}\sum_{i=1}^n E\left[|X_i|^3\right].$$
Similarly, $\left|f'(W_i) - f'(W)\right| \leq \|f''\|_\infty|W - W_i| = \|f''\|_\infty\frac{|X_i|}{\sqrt{n}}$, so the second sum is at most $\frac{\|f''\|_\infty}{n^{\frac{3}{2}}}\sum_{i=1}^n E|X_i|$.
Since $1 = E[X_i^2] = E\left[|X_i|^2\right] \leq E\left[|X_i|^3\right]^{\frac{2}{3}}$, we have $E\left[|X_i|^3\right] \geq 1$, hence $E|X_i| \leq E\left[|X_i|^3\right]^{\frac{1}{3}} \leq E\left[|X_i|^3\right]$. Combining the last three displays with $\|f''\|_\infty \leq 2\|h'\|_\infty = 2$ from Lemma 15.3 gives
$$\left|E[h(W)] - E[h(Z)]\right| = \left|E[f_h'(W) - Wf_h(W)]\right| \leq \frac{3}{n^{\frac{3}{2}}}\sum_{i=1}^n E\left[|X_i|^3\right],$$
and taking the supremum over 1-Lipschitz $h$ completes the proof. (The conclusion is trivial if $E|X_i|^3 = \infty$.)
Of course the mean zero variance one condition is just the usual normalization in the CLT and so imposes no real loss of generality. If the random variables have uniformly bounded third moments, then Theorem 15.2 gives a rate of order $n^{-\frac{1}{2}}$, which is best possible.
We conclude with an example of a CLT with local dependence which can be proved using very similar (albeit slightly more involved) arguments.
Definition. A collection of random variables $\{X_1, ..., X_n\}$ is said to have dependency neighborhoods $N_i \subseteq \{1, 2, ..., n\}$, $i = 1, ..., n$, if $i \in N_i$ and $X_i$ is independent of $\{X_j\}_{j\notin N_i}$.

Theorem 15.3. Let $X_1, ..., X_n$ be mean zero random variables with finite fourth moments.
Set $\sigma^2 = \operatorname{Var}\left(\sum_{i=1}^n X_i\right)$ and define $W = \sigma^{-1}\sum_{i=1}^n X_i$. Let $N_1, ..., N_n$ denote the dependency neighborhoods of $\{X_1, ..., X_n\}$ and let $D = \max_{i\in[n]}|N_i|$. Then for $Z \sim N(0, 1)$,
$$d_W(\mathcal{L}(W), \mathcal{L}(Z)) \leq \frac{D^2}{\sigma^3}\sum_{i=1}^n E\left[|X_i|^3\right] + \frac{D^{\frac{3}{2}}}{\sigma^2}\sqrt{\frac{28}{\pi}\sum_{i=1}^n E\left[X_i^4\right]}.$$
Poisson Distribution.
To illustrate some of the diversity in Stein's method techniques, we now look at size-biased couplings in
Poisson approximation.
Definition. For a random variable $X \geq 0$ with $\mu = E[X] \in (0, \infty)$, we say that $X^s$ has the size-biased distribution with respect to $X$ if $E[Xf(X)] = \mu E[f(X^s)]$ for all $f$ such that $E|Xf(X)| < \infty$.

To see that $X^s$ exists, note that our assumptions imply that $Qf := \frac{1}{\mu}E[Xf(X)]$ is a well-defined linear functional on the space of continuous functions with compact support. Since $X$ is nonnegative, we have that $Qf \geq 0$ for $f \geq 0$. Therefore, the Riesz representation theorem implies that there is a unique positive measure $\nu$ with $Qf = \int f\,d\nu$. Since $Q1 = \frac{1}{\mu}E[X] = 1$, $\nu$ is a probability measure. Thus $X^s \sim \nu$ satisfies
$$\frac{1}{\mu}E[Xf(X)] = Qf = \int f\,d\nu = E[f(X^s)].$$
Alternatively, one can adapt the argument from the following lemma to construct the distribution function of $X^s$ in terms of that of $X$.

Lemma 15.4. Let $X$ be a nondegenerate $\mathbb{N}_0$-valued random variable with finite mean $\mu$. Then $X^s$ has mass function
$$P(X^s = k) = \frac{kP(X = k)}{\mu}.$$

Proof.
$$\mu E[f(X^s)] = \mu\sum_{k=0}^\infty f(k)P(X^s = k) = \sum_{k=0}^\infty kf(k)P(X = k) = E[Xf(X)].$$
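For instance (an illustrative check, not from the notes), Lemma 15.4 shows that size-biasing a Poisson variable simply shifts it by one, since $kP(X=k)/\lambda = e^{-\lambda}\lambda^{k-1}/(k-1)!$:

```python
from math import exp, factorial

lam = 2.5
pois = lambda k: exp(-lam) * lam ** k / factorial(k)   # Poisson(lam) p.m.f.

# Lemma 15.4: P(X^s = k) = k P(X = k) / E[X]; for X ~ Poisson(lam) this
# simplifies to the p.m.f. of 1 + Poisson(lam)
size_biased = [k * pois(k) / lam for k in range(1, 40)]
shifted = [pois(k - 1) for k in range(1, 40)]
max_diff = max(abs(a - b) for a, b in zip(size_biased, shifted))
```

The agreement is exact up to floating-point roundoff, reflecting the algebraic identity rather than any approximation.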
Size-biasing is an important consideration in statistical sampling.
For example, suppose that a school has N (k) classes with k students.
Then the total number of classes is $n = \sum_{k=1}^\infty N(k)$ and the total number of students is $N = \sum_{k=1}^\infty kN(k)$.
If an outside observer were interested in estimating class-size statistics, they might ask a random teacher how many students are in their class.
Letting $X$ denote the teacher's response, we have $P(X = k) = \frac{N(k)}{n}$ since $N(k)$ of the $n$ classes have $k$ students.
On the other hand, they might ask a random student how large their class is.
The student's response, $Y$, would have $P(Y = k) = \frac{kN(k)}{N}$ because $kN(k)$ of the $N$ students are in a class of $k$ students.
Since
$$E[X] = \sum_{k=1}^\infty k\,\frac{N(k)}{n} = \frac{1}{n}\sum_{k=1}^\infty kN(k) = \frac{N}{n},$$
we see that
$$P(Y = k) = \frac{kN(k)}{N} = \frac{k\,\frac{N(k)}{n}}{\frac{N}{n}} = \frac{kP(X = k)}{E[X]},$$
so $Y = X^s$.
Observe that the average number of classmates of a random student (themselves included) is
$$E[Y] = \sum_{k=1}^\infty k\,\frac{kN(k)}{N} = \frac{n}{N}\sum_{k=1}^\infty k^2\,\frac{N(k)}{n} = \frac{E\left[X^2\right]}{E[X]} \geq E[X].$$
The inequality is strict unless all classes have the same number of students.
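With made-up numbers (a hypothetical school, purely for illustration), the gap between the two sampling schemes is concrete:

```python
from fractions import Fraction

# Hypothetical school: N[k] classes have k students each
N = {10: 4, 20: 4, 40: 2}
n_classes = sum(N.values())                    # n = 10 classes
n_students = sum(k * c for k, c in N.items())  # N = 200 students

# Teacher's answer X: uniform over classes; student's answer Y = X^s
EX = Fraction(sum(k * c for k, c in N.items()), n_classes)
EY = Fraction(sum(k * k * c for k, c in N.items()), n_students)
# EY = E[X^2]/E[X] >= E[X], strict here because class sizes differ
```

Here the class-average size is 20 students, but the average student sits in a class of 26, an instance of the "inspection paradox".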
Lemma 15.5. Let $X_1, ..., X_n \geq 0$ be independent random variables with $E[X_i] = \mu_i$, and let $X_i^s$ have the size-bias distribution w.r.t. $X_i$. Let $I$ be a random variable, independent of all else, with $P(I = i) = \frac{\mu_i}{\mu}$, where $\mu = \sum_{i=1}^n \mu_i$. If $W = \sum_{i=1}^n X_i$ and $W_i = W - X_i$, then $W^s = W_I + X_I^s$ has the $W$ size-biased distribution.

Proof.
$$\mu E[g(W^s)] = \mu\sum_{i=1}^n\frac{\mu_i}{\mu}E\left[g\left(W_i + X_i^s\right)\right] = \sum_{i=1}^n E\left[X_i g(W_i + X_i)\right] = E\left[\sum_{i=1}^n X_i g(W)\right] = E[Wg(W)],$$
where the middle equality uses the independence of $X_i$ and $W_i$ together with the definition of $X_i^s$.
The Stein operator, equation, and bounds for the Poisson distribution are developed in the same fashion as the analogous results for the normal distribution.

Theorem 15.4. Let $P_\lambda$ denote the $\operatorname{Poisson}(\lambda)$ distribution. An $\mathbb{N}_0$-valued random variable $X$ has law $P_\lambda$ if and only if
$$E\left[\lambda f(X + 1) - Xf(X)\right] = 0$$
for all bounded $f$.
Also, for each $A \subseteq \mathbb{N}_0$, the unique solution of the difference equation
$$\lambda f(k + 1) - kf(k) = 1_A(k) - P_\lambda(A), \qquad f_A(0) = 0$$
is given by
$$f_A(k) = \lambda^{-k}e^\lambda(k - 1)!\left[P_\lambda(A\cap U_k) - P_\lambda(A)P_\lambda(U_k)\right], \quad\text{where } U_k = \{0, 1, ..., k - 1\}.$$
Finally, writing the forward difference as $\Delta g(k) := g(k + 1) - g(k)$, we have
$$\|f_A\|_\infty \leq \min\left\{1, \lambda^{-\frac{1}{2}}\right\} \quad\text{and}\quad \|\Delta f_A\|_\infty \leq \frac{1 - e^{-\lambda}}{\lambda}.$$
Theorem 15.5. Let $X$ be an $\mathbb{N}_0$-valued random variable with $E[X] = \lambda$, and let $Z \sim \operatorname{Poisson}(\lambda)$. Then
$$d_{TV}(X, Z) \leq \left(1 - e^{-\lambda}\right)E\left|X + 1 - X^s\right|.$$
Proof. Letting $f_A$ be as in Theorem 15.4, the definitions of total variation and size-biasing imply
$$d_{TV}(X, Z) = \sup_A\left|P(X \in A) - P_\lambda(A)\right| = \sup_A\left|E\left[\lambda f_A(X + 1) - Xf_A(X)\right]\right| = \lambda\sup_A\left|E\left[f_A(X + 1) - f_A(X^s)\right]\right|$$
$$\leq \lambda\sup_A\|\Delta f_A\|_\infty E\left|X + 1 - X^s\right| \leq \left(1 - e^{-\lambda}\right)E\left|X + 1 - X^s\right|.$$
We conclude with a simple proof of Theorem 14.1 complete with a total variation bound.
Theorem 15.6. Let X_1, ..., X_n be independent random variables with P(X_i = 1) = 1 − P(X_i = 0) = p_i, and set W = Σ_{i=1}^n X_i, λ = E[W] = Σ_{i=1}^n p_i. Let Z ∼ Poisson(λ). Then

d_TV(W, Z) ≤ ((1 − e^{−λ})/λ) Σ_{i=1}^n p_i².
Proof. Lemmas 15.5 and 15.6 show that W^s = W_I + X_I^s = W − X_I + 1 where I is a random variable, independent of the X_i's, with P(I = i) = p_i/λ.
Thus, by Theorem 15.5,

d_TV(W, Z) ≤ (1 − e^{−λ}) E|W + 1 − W^s| = (1 − e^{−λ}) E|X_I|
= (1 − e^{−λ}) Σ_{i=1}^n E|X_i| P(I = i)
= (1 − e^{−λ}) Σ_{i=1}^n p_i · (p_i/λ) = ((1 − e^{−λ})/λ) Σ_{i=1}^n p_i².
One can also prove Theorem 15.6 without taking a detour through size-biasing.
16. Random Walk Preliminaries
Thus far, we have been primarily interested in the large-n behavior of S_n = Σ_{i=1}^n X_i where X_1, X_2, ... are independent and identically distributed. We now turn our attention to the sequence S_1, S_2, ..., which we regard as a random walk evolving in time.
Recall that the existence of an infinite sequence of random variables with specified finite dimensional distributions is guaranteed by the Kolmogorov extension theorem.
Here the sample space is Ω = R^N = {(ω_1, ω_2, ...) : ω_i ∈ R}, the σ-algebra is B^N (which is generated by cylinder sets), and a consistent sequence of distributions gives rise to a unique probability measure with appropriate marginals via the extension theorem. The random variables are the coordinate maps X_i((ω_1, ω_2, ...)) = ω_i.
If S is a Polish space (i.e. a separable and completely metrizable topological space) and S is the Borel σ-algebra for S, then this Kolmogorov construction can be carried out with Ω = S^N and F = S^N. When the X_i's are independent (S, S)-valued random variables with X_i ∼ µ_i, the measure P arises from the sequence of product measures P_n = µ_1 × · · · × µ_n.
We assume in what follows that we are working in (S^N, S^N, P) and X_1, X_2, ... are given by X_i((ω_1, ω_2, ...)) = ω_i.
Recall that Kolmogorov's 0−1 law showed that if X_1, X_2, ... are independent, then the tail field T = ∩_{n=1}^∞ σ(X_n, X_{n+1}, ...) is trivial in the sense that every A ∈ T has P(A) ∈ {0, 1}.
Our first main result is another 0−1 law. We begin with some terminology.
Definition. A finite permutation is a bijection π : N → N such that |{m : π(m) ≠ m}| < ∞. A finite permutation π acts on Ω = S^N by (πω)_i = ω_{π(i)}, and an event A ∈ S^N is said to be permutable if π^{−1}A = {ω : πω ∈ A} = A for every finite permutation π.
Proposition 16.1. The collection of permutable events is a σ-algebra. It is called the exchangeable σ-algebra
and is denoted E .
Taking S = R and S_n = Σ_{i=1}^n X_i, some examples of permutable events are E = {S_n ∈ B_n i.o.} and F = {lim sup_{n→∞} S_n/c_n ≥ 1} for any sequence of Borel sets {B_n}_{n=1}^∞ and real numbers {c_n}_{n=1}^∞.
Also, every event in the tail σ -algebra is also in the exchangeable σ -algebra.
We observe that, in general, E, F ∉ T, though F is in the tail field if we assume that c_n → ∞.
Similarly, {lim_{n→∞} S_n exists}, {lim sup_{n→∞} S_n = ∞} ∈ T while, in general, {lim sup_{n→∞} S_n > 0} ∈ E \ T.
The proof of the Hewitt-Savage 0−1 law will make use of the following result.
Lemma 16.1. For any I ∈ S^N, there is a sequence of events I_1, I_2, ... such that I_n ∈ σ(X_1, ..., X_n) and P(I_n △ I) → 0, where A△B = (A \ B) ∪ (B \ A).
Proof. σ(X_1, ..., X_n) is precisely the sub-σ-algebra of S^N consisting of the cylinders {ω : (ω_1, ..., ω_n) ∈ B} as B ranges over S^n. Accordingly, P = ∪_{n=1}^∞ σ(X_1, ..., X_n) is a π-system which generates S^N. The claim follows since the collection of events that can be approximated in this manner is a λ-system containing P.
Theorem 16.1 (Hewitt-Savage). If X_1, X_2, ... are i.i.d. and A ∈ E, then P(A) ∈ {0, 1}.
Proof. As with Kolmogorov's 0−1 law, we will show that A is independent of itself.
We begin by taking a sequence of events A_n ∈ σ(X_1, ..., X_n) such that P(A_n △ A) → 0, which is justified by Lemma 16.1.
Now let π be the finite permutation with π(j) = j + n for j ≤ n, π(j) = j − n for n < j ≤ 2n, and π(j) = j for j > 2n.
Thus, since the X_i's are i.i.d. and A is permutable, π(A_n) ∈ σ(X_{n+1}, ..., X_{2n}) satisfies P(π(A_n) △ A) = P(A_n △ A) → 0. Since A_n and π(A_n) are independent, we have

P(A) = lim_{n→∞} P(A_n ∩ π(A_n)) = lim_{n→∞} P(A_n)P(π(A_n)) = P(A)²,

so P(A) ∈ {0, 1}.
Because T ⊂ E, Hewitt-Savage supersedes Kolmogorov in the case of i.i.d. random variables. However, the latter only requires independence, so it can be used in situations where the former cannot. Also, note that in the examples preceding Theorem 16.1, the sequences E_n = {S_n ∈ B_n} and F_n = {S_n/c_n ≥ 1} are each sequences of events that need not themselves be permutable; it is only the limiting events E and F that are.
A nice application of Theorem 16.1 is
Theorem 16.2. For a random walk on R, there are only four possibilities, one of which has probability one:
(1) Sn = 0 for all n
(2) Sn → ∞
(3) Sn → −∞
(4) −∞ = lim inf n→∞ Sn < lim supn→∞ Sn = ∞
Proof. Theorem 16.1 implies that lim sup_{n→∞} S_n is a.s. equal to a constant c ∈ [−∞, ∞]. Let S_n' = S_{n+1} − X_1.
Since S_n' =_d S_n, we must have that c = c − X_1. If c ∈ (−∞, ∞), then it must be the case that X_1 ≡ 0, so the first case occurs. Conversely, if X_1 is not identically zero, then c = ±∞.
Of course, the exact same argument applies to the lim inf, so either (1) holds or lim inf_{n→∞} S_n and lim sup_{n→∞} S_n both belong to {−∞, ∞}, which yields cases (2), (3), and (4).
17. Stopping Times
Given a sequence of random variables X_1, X_2, ... on a probability space (Ω, F, P), consider the sub-σ-algebras F_n = σ(X_1, ..., X_n), n ≥ 1. If we think of X_1, X_2, ... as observations taken at times 1, 2, ..., then F_n can be thought of as the information available at time n.
Note that F_1 ⊆ F_2 ⊆ · · · ⊆ F. Such an increasing sequence of sub-σ-algebras is known as a filtration, and the space (Ω, F, {F_n}, P) is called a filtered probability space.
(For the time being, we will assume that the filtration is indexed by N, but one can consider more general index sets.)
If X_1, X_2, ... satisfies X_n ∈ F_n for all n, we say that the sequence {X_n} is adapted to the filtration {F_n}. {σ(X_1, ..., X_n)} is the smallest filtration with respect to which {X_n} is adapted.
Definition. Given a filtered probability space (Ω, F, {F_n}_{n∈N}, P), a random variable N : Ω → N ∪ {∞} is called a stopping time if {N = n} ∈ F_n for all n ∈ N.
The following proposition gives an equivalent definition of stopping times. When working with more general index sets, the condition {N ≤ t} ∈ F_t is the appropriate one to take as the definition.
Proposition 17.1. A random variable N : Ω → N ∪ {∞} on a filtered probability space (Ω, F, {F_n}, P) is a stopping time if and only if {N ≤ n} ∈ F_n for all n ∈ N.
Proof. Let n ∈ N be given. If N is a stopping time, then for each m ≤ n, {N = m} ∈ F_m ⊆ F_n, so {N ≤ n} = ∪_{m=1}^n {N = m} ∈ F_n.
Conversely, if {N ≤ m} ∈ F_m for all m ∈ N, then {N = n} = {N ≤ n} \ {N ≤ n − 1} ∈ F_n.
To motivate the definition, suppose that X_1, X_2, ... represent one's winnings in successive games of roulette.
Let N be a rule for when to stop gambling (in the sense that one quits playing after N games).
The requirement that {N = n} ∈ Fn means that the decision to stop playing after n games can only depend
on the outcomes of the rst n games.
For example, the random variable N ≡ 6 corresponds to the rule that one will play exactly six games.
The random variable N = inf{n : Σ_{i=1}^n X_i < −10} corresponds to quitting once one's losses exceed $10.
The random variable N = inf{n : Σ_{i=1}^n X_i ≥ Σ_{i=1}^m X_i for all m ∈ N}, corresponding to quitting once one has attained the maximum amount they ever will, is not a stopping time since it depends on the future as well as the past.
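To make the "no peeking at the future" requirement concrete, here is a sketch (with simulated ±$1 games; the setup and names are illustrative assumptions) of the second rule computed causally, alongside the quit-at-the-maximum rule, which needs the entire trajectory:

```python
import random

rng = random.Random(1)
X = [rng.choice((-1, 1)) for _ in range(10_000)]   # winnings per game

def losses_exceed_10(xs):
    """N = inf{n : S_n < -10}: decided from X_1, ..., X_n only."""
    s = 0
    for n, x in enumerate(xs, start=1):
        s += x
        if s < -10:
            return n
    return None  # N = infinity on this sample path

N = losses_exceed_10(X)

# The quit-at-the-running-maximum rule requires the whole future:
S, s = [], 0
for x in X:
    s += x
    S.append(s)
M = 1 + max(range(len(S)), key=S.__getitem__)  # argmax uses future values

print(N, M)
```

The first function returns as soon as the condition is met, using only past observations; computing M requires scanning the entire path, which is exactly why it fails to be a stopping time.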
Associated with each stopping time N is the stopped σ-algebra FN , which we think of as the information
known at time N.
Formally, F_N = {A ∈ F : A ∩ {N = n} ∈ F_n for all n ∈ N}. That is, on {N = n}, A must be measurable with respect to the information known at time n. It is worth noting that the definition implies {N ≤ n} ∈ F_N for all n ∈ N, hence N is F_N-measurable.
* Clearly F_N contains the empty set and is closed under countable unions. To see that it is closed under complements, write A^C ∩ {N = n} = (A^C ∪ {N ≠ n}) ∩ {N = n} = (A ∩ {N = n})^C ∩ {N = n}.
Theorem 17.1. Suppose that X1 , X2 , ... are i.i.d., Fn = σ(X1 , ..., Xn ), and N is a stopping time for Fn .
Conditional on {N < ∞}, {XN +n }n≥1 is independent of FN and has the same distribution as the original
sequence.
Proof. Let A ∈ F_N, k ∈ N, B_1, ..., B_k ∈ S be given. Let µ denote the common distribution of the X_i's.
For any n ∈ N, the event A ∩ {N = n} ∈ F_n is independent of X_{n+1}, ..., X_{n+k}, so

P(A, N = n, X_{N+1} ∈ B_1, ..., X_{N+k} ∈ B_k) = P(A, N = n) ∏_{j=1}^k µ(B_j).

Summing over n gives P(A, N < ∞, X_{N+1} ∈ B_1, ..., X_{N+k} ∈ B_k) = P(A, N < ∞) ∏_{j=1}^k µ(B_j). Taking A = Ω shows that

P(N < ∞, X_{N+1} ∈ B_1, ..., X_{N+k} ∈ B_k) / P(N < ∞) = ∏_{j=1}^k µ(B_j),

which establishes both the independence and the distributional claims.
We now introduce the shift function θ : Ω → Ω, which is defined coordinatewise by (θω)_n = ω_{n+1}.
That is, applying θ results in dropping the first term and shifting all others one place to the left. Higher powers are defined inductively by θ^n = θ ∘ θ^{n−1}.
For example, let T = inf{n ≥ 1 : S_n = 0} be the time of the first visit to zero, and set T_0 = 0 and T_n = T_{n−1} + T ∘ θ^{T_{n−1}} on {T_{n−1} < ∞} (with T_n = ∞ otherwise); then T_n is well-defined for all n ∈ N and gives the time of the nth visit to zero.
Proposition 17.2. In the above setting, if we assume that P = µ × µ × · · · , then P(T_n < ∞) = P(T < ∞)^n.
Proof. We argue by induction. The base case is trivial, so let us assume that the statement holds for n − 1.
Applying Theorem 17.1 to T_{n−1}, we see that, conditional on {T_{n−1} < ∞}, the sequence {X_{T_{n−1}+k}}_{k≥1} is independent of F_{T_{n−1}} and has the same distribution as the original sequence, so

P(T_n < ∞) = P(T_{n−1} < ∞, T ∘ θ^{T_{n−1}} < ∞) = P(T_{n−1} < ∞) P(T ∘ θ^{T_{n−1}} < ∞ | T_{n−1} < ∞)
= P(T < ∞)^{n−1} P(T < ∞) = P(T < ∞)^n.
Theorem 17.2 (Wald's equation). Let X1 , X2 , ... be i.i.d. with E |X1 | < ∞. If N is a stopping time with
E[N ] < ∞, then E[SN ] = E[X1 ]E[N ].
Proof. First suppose that the X_i's are nonnegative. Then

E[S_N] = ∫ S_N dP = Σ_{n=1}^∞ ∫ 1{N = n} S_n dP = Σ_{n=1}^∞ ∫ 1{N = n} Σ_{m=1}^n X_m dP
= Σ_{m=1}^∞ Σ_{n=m}^∞ ∫ 1{N = n} X_m dP = Σ_{m=1}^∞ ∫ 1{N ≥ m} X_m dP
= Σ_{m=1}^∞ E[1{N ≥ m} X_m] = Σ_{m=1}^∞ P(N ≥ m) E[X_m]
= E[X_1] Σ_{m=1}^∞ P(N ≥ m) = E[X_1] E[N],

where the interchange of sum and integral is justified by Tonelli's theorem, and E[1{N ≥ m} X_m] = P(N ≥ m)E[X_m] because {N ≥ m} = {N ≤ m − 1}^C ∈ F_{m−1} is independent of X_m.
To prove the result for general X_i, we run the last argument in reverse to conclude that

∞ > E|X_1| E[N] = Σ_{m=1}^∞ P(N ≥ m) E|X_m| = Σ_{m=1}^∞ ∫ 1{N ≥ m} |X_m| dP
= Σ_{m=1}^∞ Σ_{n=m}^∞ ∫ 1{N = n} |X_m| dP ≥ Σ_{n=1}^∞ ∫ 1{N = n} |S_n| dP.

Since the double integrals converge absolutely, we can invoke Fubini to conclude that

E[X_1] E[N] = Σ_{m=1}^∞ P(N ≥ m) E[X_m] = Σ_{m=1}^∞ ∫ 1{N ≥ m} X_m dP = Σ_{m=1}^∞ Σ_{n=m}^∞ ∫ 1{N = n} X_m dP
= Σ_{n=1}^∞ Σ_{m=1}^n ∫ 1{N = n} X_m dP = Σ_{n=1}^∞ ∫ 1{N = n} S_n dP = E[S_N].
One consequence of Wald's equation is that one can gain no advantage in a fair or unfavorable game by employing a length-of-play strategy which does not depend on the possibility of infinitely many games.
One hears people advocate the policy of playing until they are ahead. Denoting the outcomes of successive games by X_1, X_2, ..., this stopping rule is given by α = inf{n : X_1 + ... + X_n > 0}. If E|X_1| < ∞ and E[X_1] ≤ 0, then E[α] < ∞ would imply 0 < E[S_α] = E[α]E[X_1] ≤ 0, a contradiction. Thus the expected waiting time until one shows a profit on a sequence of independent and identical fair or unfavorable bets is infinite.
Suppose that you are to roll a die repeatedly until a number of your choice appears. You are then awarded an amount of money equal to the sum of your rolls. Wald's equation shows that any number you choose will yield the same expected winnings.
Indeed, the outcomes of the rolls, X_1, X_2, ..., are i.i.d. uniform over {1, 2, ..., 6}. The waiting time, N_i, until any number i ∈ {1, ..., 6} appears is geometric with success probability 1/6. Thus your expected winnings are E[S_{N_i}] = E[N_i]E[X_1] = 6 · (1 + 2 + ... + 6)/6 = 21 regardless of the number you choose. There is no advantage in choosing six over one, say, in terms of expected winnings. (Of course there is an advantage in terms of things other than expectation.)
N = inf{n : S_n ∉ (a, b)}, the first time the walk exits (a, b).
Applying Wald's equation shows that bP(S_N = b) + aP(S_N = a) = E[S_N] = E[N]E[X_1] = 0.
Since P(S_N = a) = 1 − P(S_N = b), we have (b − a)P(S_N = b) = −a, hence

P(S_N = b) = −a/(b − a),   P(S_N = a) = b/(b − a).

Writing T_a = inf{n : S_n = a}, T_b = inf{n : S_n = b} shows that P(T_a < T_b) = P(S_N = a) = b/(b − a) for all integers a < 0 < b. Because P(T_a < ∞) ≥ P(T_a < T_b) for all b > 0, sending b → ∞ shows that P(T_a < ∞) = 1 for all integers a < 0.
Symmetry implies that P(T_x < ∞) = 1 for all integers x > 0, and since the walk must pass through 0 to get from one side of the axis to the other, it returns to 0 almost surely as well.
However, Wald's equation shows that the expected time to visit any nonzero integer is infinite: if we had E[T_a] < ∞ for some a ≠ 0, then Wald's equation would give a = E[S_{T_a}] = E[T_a]E[X_1] = 0, a contradiction.
By conditioning on the first step, we see that the expected time for the walk to return to 0 is

E[T_0] = E[(1/2)(1 + T_{−1}) + (1/2)(1 + T_1)] = ∞.

To recap, simple random walk will visit any given integer in finite time with full probability, but the expected time to do so is infinite.
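The exit probabilities above are easy to corroborate by simulation. This is a sketch with arbitrarily chosen a = −3, b = 5:

```python
import random

# Estimate P(S_N = a) for simple random walk exiting (a, b); theory: b/(b - a).
rng = random.Random(42)
a, b = -3, 5
trials = 100_000
hits_a = 0
for _ in range(trials):
    s = 0
    while a < s < b:
        s += rng.choice((-1, 1))
    hits_a += (s == a)
assert abs(hits_a / trials - b / (b - a)) < 0.01   # b/(b-a) = 0.625
```

The empirical frequency settles near 0.625 well within Monte Carlo error at this sample size.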
We can compute the variance of random sums whose index of summation is a stopping time using
Theorem 17.3 (Wald's second equation). Let X_1, X_2, ... be i.i.d. with E[X_1] = 0 and E[X_1²] = σ² < ∞. If T is a stopping time with E[T] < ∞, then E[S_T²] = σ²E[T].
Proof. By assumption, all expectations exist and are finite, so induction on n gives

E[S²_{T∧n}] = σ² Σ_{k=1}^n P(T ≥ k).

Now E[T] = Σ_{k=1}^∞ P(T ≥ k) = lim_{n→∞} Σ_{k=1}^n P(T ≥ k), so, since S_{T∧n} → S_T pointwise, if we can show that S_{T∧n} is Cauchy in L², then it will follow that

E[S_T²] = lim_{n→∞} E[S²_{T∧n}] = σ²E[T].

To this end, observe that for any n > m,

E[(S_{T∧n} − S_{T∧m})²] = E[(Σ_{k=m+1}^{T∧n} X_k)²] = σ² Σ_{k=m+1}^n P(T ≥ k) ≤ σ² Σ_{k=m+1}^∞ P(T ≥ k),

where the second equality follows from the exact same argument as above. Since Σ_{k=1}^∞ P(T ≥ k) = E[T] < ∞, the right-hand side tends to 0 as m → ∞, so {S_{T∧n}} is Cauchy in L².
Theorem 17.4. Let X_1, X_2, ... be i.i.d. with E[X_1] = 0, E[X_1²] = 1, and set T_c = inf{n ≥ 1 : |S_n| > c√n}. Then E[T_c] is finite if and only if c < 1.
Lemma 17.1. If X_1, X_2, ... are i.i.d. with E[X_1²] < ∞ and T is a stopping time with E[T] = ∞, then

lim_{n→∞} E[X²_{T∧n}] / E[T ∧ n] = 0.
It will then follow that if E[T_c] = ∞ for some c ∈ [0, 1), then for any ε ∈ (0, (1 − c)²), there is an N ∈ N so that E[X²_{T_c∧n}] ≤ εE[T_c ∧ n] whenever n ≥ N.
To prove Lemma 17.1, we observe that

E[X²_{T∧n}] = E[X²_{T∧n}; X²_{T∧n} ≤ ε(T ∧ n)] + Σ_{j=1}^n E[X²_{T∧n}; X²_{T∧n} > εj, T ∧ n = j]
≤ εE[T ∧ n] + Σ_{j=1}^n E[X_j²; T ∧ n = j, X_j² > εj].

Now, since E[X_j²; X_j² > εj] → 0 as j → ∞ (by the DCT), we can choose N large enough that

Σ_{j=1}^n E[X_j²; X_j² > εj] < nε whenever n > N.

Splitting the sum at N gives

Σ_{j=1}^n E[X_j²; T ∧ n = j, X_j² > εj] = Σ_{j=1}^N E[X_j²; T ∧ n = j, X_j² > εj] + Σ_{j=N+1}^n E[X_j²; T ∧ n = j, X_j² > εj]
≤ N E[X_1²] + Σ_{j=N+1}^n E[X_j²; T ∧ n = j, X_j² > εj].

For the remaining sum,

Σ_{j=N+1}^n E[X_j²; T ∧ n = j, X_j² > εj] ≤ Σ_{j=N+1}^n E[X_j²; T ∧ n ≥ j, X_j² > εj]
= Σ_{j=N+1}^n E[X_j² 1{T ∧ n ≥ j} 1{X_j² > εj}]
= Σ_{j=N+1}^n P(T ∧ n ≥ j) E[X_j²; X_j² > εj]
= Σ_{j=N+1}^n Σ_{k=j}^∞ P(T ∧ n = k) E[X_j²; X_j² > εj]
≤ Σ_{k=N+1}^∞ P(T ∧ n = k) Σ_{j=1}^k E[X_j²; X_j² > εj]
≤ Σ_{k=N+1}^∞ P(T ∧ n = k) kε ≤ εE[T ∧ n],

where the third line uses that {T ∧ n ≥ j} = {T ≤ j − 1}^C ∈ F_{j−1} is independent of X_j for j ≤ n.
Combining these estimates,

E[X²_{T∧n}] ≤ εE[T ∧ n] + Σ_{j=1}^n E[X_j²; T ∧ n = j, X_j² > εj] ≤ 2εE[T ∧ n] + N E[X_1²].

Since E[T ∧ n] → E[T] = ∞ (by the MCT), we see that lim sup_{n→∞} E[X²_{T∧n}]/E[T ∧ n] ≤ 2ε, and the lemma follows because ε > 0 is arbitrary.
18. Recurrence
In this section, we will consider some questions regarding the recurrence behavior of random walk in Rd .
Throughout, we will take (S, S) to be R^d with the Borel σ-algebra, we will denote the position of the random walk at time n by S_n = X_1 + ... + X_n with X_1, X_2, ... i.i.d., and we will work with the norm ‖x‖ = max_{1≤i≤d} |x_i|.
Definition. x ∈ R^d is called a recurrent value for the random walk S_n if for every ε > 0, P(‖S_n − x‖ < ε i.o.) = 1. We denote the set of recurrent values by V.
Note that the Hewitt-Savage 0−1 law implies that if P(‖S_n − x‖ < ε i.o.) is less than one, then it is zero.
Definition. x ∈ R^d is called a possible value for S_n if for every ε > 0, there is an n ∈ N with P(‖S_n − x‖ < ε) > 0. We denote the set of possible values by U.
Clearly the set of recurrent values is contained in the set of possible values. In fact, we have
Theorem 18.1. The set V is either ∅ or a closed subgroup of R^d. In the latter case, V = U.
Proof. Suppose that V ≠ ∅. If z ∈ V^C, then there is an ε > 0 such that P(‖S_n − z‖ < ε i.o.) = 0, and thus B_ε(z) = {w ∈ R^d : ‖w − z‖ < ε} ⊆ V^C. It follows that V^C is open, so V is closed.
To see that V is a group, it suffices to show that if x ∈ U and y ∈ V, then y − x ∈ V. This is because V ⊆ U, so for any v, w ∈ V, taking x = y = v shows that 0 ∈ V; taking x = v, y = 0 shows that −v ∈ V; and then taking x = −v, y = w shows that w + v ∈ V.
Now suppose, seeking a contradiction, that x ∈ U, y ∈ V, but y − x ∉ V: then there is an ε > 0 with P(‖S_n − (y − x)‖ < 2ε i.o.) = 0, hence an m ∈ N with P(‖S_n − (y − x)‖ ≥ 2ε for all n ≥ m) > 0, and (shrinking ε if necessary) a k ∈ N with P(‖S_k − x‖ < ε) > 0.
For any n ≥ m + k, S_n − S_k = X_{k+1} + ... + X_n has the same distribution as S_{n−k} and is independent of S_k.
It follows that {‖S_n − S_k − (y − x)‖ ≥ 2ε for all n ≥ m + k} and {‖S_k − x‖ < ε} are independent and each has positive probability.
Because ‖S_k(ω) − x‖ < ε and 2ε ≤ ‖S_n(ω) − S_k(ω) − (y − x)‖ ≤ ‖S_n(ω) − y‖ + ‖S_k(ω) − x‖ imply that ‖S_n(ω) − y‖ ≥ ε, we conclude that P(‖S_n − y‖ ≥ ε for all n ≥ m + k) > 0, contradicting y ∈ V.
the Xi 's is dense, but the two are equivalent for simple random walk.)
By Proposition 17.2, if we set τ_0 = 0 and let τ_n = inf{m > τ_{n−1} : S_m = 0} be the nth time the walk visits 0, then P(τ_n < ∞) = P(τ_1 < ∞)^n. From this observation, we arrive at
Theorem 18.2. For any random walk, the following are equivalent:
(1) P (τ1 < ∞) = 1
(2) P (Sn = 0 i.o.) = 1
(3) Σ_{n=1}^∞ P(S_n = 0) = ∞
Proof.
If P(τ_1 < ∞) = 1, then P(τ_n < ∞) = 1^n = 1 for all n, hence P(S_n = 0 i.o.) = 1.
The contrapositive of the first Borel-Cantelli lemma shows that P(S_n = 0 i.o.) = 1 implies Σ_{n=1}^∞ P(S_n = 0) = ∞.
Finally, the number of visits to zero can be expressed as V = Σ_{n=1}^∞ 1{S_n = 0} and as V = Σ_{n=1}^∞ 1{τ_n < ∞}.
Thus if P(τ_1 < ∞) < 1, then

Σ_{n=1}^∞ P(S_n = 0) = E[V] = Σ_{n=1}^∞ P(τ_n < ∞) = Σ_{n=1}^∞ P(τ_1 < ∞)^n = P(τ_1 < ∞)/(1 − P(τ_1 < ∞)) < ∞.

It follows that Σ_{n=1}^∞ P(S_n = 0) = ∞ implies P(τ_1 < ∞) = 1.
Analogous to the one-dimensional case, we say that S_n = X_1 + ... + X_n defines a simple random walk on R^d (equivalently, Z^d) if X_1, X_2, ... are i.i.d. with

P(X_i = e_j) = P(X_i = −e_j) = 1/(2d)

for each of the d standard basis vectors e_j.
We will show that simple random walk is recurrent in dimensions d = 1, 2 and transient otherwise.
Essentially, this is because P(S_n = 0) ≈ C_d n^{−d/2}, which is summable for d ≥ 3 but not for d = 1, 2.
Indeed, when d = 1, P(S_2n = 0) = 2^{−2n}(2n choose n) ≈ 1/√(πn) for all n ∈ N.
Since Σ_{n=1}^∞ 1/√(πn) diverges, it follows from the limit comparison test that

Σ_{n=1}^∞ P(S_n = 0) = Σ_{n=1}^∞ P(S_2n = 0) = ∞,

so Theorem 18.2 shows that simple random walk is recurrent in one dimension.
* The identity

Σ_{m=0}^n (n choose m)(n choose n − m) = (2n choose n)

follows by noting that the number of ways to choose a committee of size n from a population of size 2n containing n men and n women is the sum over m = 0, 1, ..., n of the number of such committees consisting of m men and n − m women.
Proof. When d = 3,

P(S_2n = 0) = 6^{−2n} Σ (2n)!/(n_1! n_2! n_3!)² = 2^{−2n} (2n choose n) Σ (3^{−n} n!/(n_1! n_2! n_3!))²,

where the sums range over nonnegative n_1, n_2, n_3 with n_1 + n_2 + n_3 = n.
Now 3^{−n} n!/(n_1! n_2! n_3!) ≥ 0 for each choice of n_1, n_2, n_3, n, and the multinomial theorem gives

Σ_{n_1+n_2+n_3=n} 3^{−n} n!/(n_1! n_2! n_3!) = Σ_{n_1+n_2+n_3=n} (n choose n_1, n_2, n_3)(1/3)^{n_1}(1/3)^{n_2}(1/3)^{n_3} = (1/3 + 1/3 + 1/3)^n = 1,

so

Σ_{n_1+n_2+n_3=n} (3^{−n} n!/(n_1! n_2! n_3!))² ≤ (max_{0≤n_1≤n_2≤n_3 : n_1+n_2+n_3=n} 3^{−n} n!/(n_1! n_2! n_3!)) Σ_{n_1+n_2+n_3=n} 3^{−n} n!/(n_1! n_2! n_3!)
= 3^{−n} max_{0≤n_1≤n_2≤n_3 : n_1+n_2+n_3=n} n!/(n_1! n_2! n_3!).
The latter quantity is maximized when n_1! n_2! n_3! is minimized. This happens when n_1, n_2, n_3 are as close as possible: if n_i < n_j − 1 for i < j, then n_i! n_j! > ((n_i + 1)/n_j) n_i! n_j! = (n_i + 1)!(n_j − 1)!.
It follows from Stirling's formula that

max_{0≤n_1≤n_2≤n_3 : n_1+n_2+n_3=n} n!/(n_1! n_2! n_3!) ≈ n!/((n/3)!)³ ≈ (√(2πn)(n/e)^n) / ((2πn/3)^{3/2}(n/(3e))^n) = (3^{3/2} · 3^n)/(2πn).
Putting all this together and recalling that 2^{−2n}(2n choose n) ≈ 1/√(πn) shows that

P(S_2n = 0) = 2^{−2n}(2n choose n) Σ_{n_1+n_2+n_3=n} (3^{−n} n!/(n_1! n_2! n_3!))² = O(n^{−3/2}),
hence Σ_{n=1}^∞ P(S_n = 0) < ∞ and we conclude that SRW is transient in 3 dimensions.
Transience in higher dimensions follows by letting T_n = (S_n^1, S_n^2, S_n^3) be the projection onto the first three coordinates and letting N(n) = inf{m > N(n − 1) : T_m ≠ T_{N(n−1)}} be the nth time that the random walker moves in any of the first three coordinates (with the convention that N(0) = 0). Then T_{N(n)} is a simple random walk in three dimensions and the probability that T_{N(n)} = 0 infinitely often is 0. Since the first three coordinates of S_n are constant between N(n) and N(n + 1) and N(n + 1) − N(n) is almost surely finite, it follows that S_n = 0 only finitely often with probability one.
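The exact formula for P(S_2n = 0) in the d = 3 proof can be evaluated numerically, making the n^{−3/2} decay visible. This is an illustrative sketch; the function name is an assumption of the example:

```python
from math import comb

# P(S_2n = 0) for simple random walk on Z^3, via
# 2^{-2n} C(2n,n) * sum over n1+n2+n3=n of (3^{-n} n!/(n1! n2! n3!))^2.
def p_return_3d(n):
    total = 0.0
    for n1 in range(n + 1):
        for n2 in range(n - n1 + 1):
            w = comb(n, n1) * comb(n - n1, n2) / 3 ** n  # 3^{-n} n!/(n1!n2!n3!)
            total += w * w
    return comb(2 * n, n) / 4 ** n * total

assert abs(p_return_3d(1) - 1 / 6) < 1e-12      # return in two steps
r = p_return_3d(32) / p_return_3d(16)           # should be near 2^{-3/2} ~ 0.354
assert 0.25 < r < 0.45
```

Doubling n multiplies the return probability by roughly 2^{−3/2}, which is exactly the summable decay used above.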
In the case of more general random walks on R^d, the Chung-Fuchs theorem says that
• S_n is recurrent in d = 1 if S_n/n →_p 0.
• S_n is recurrent in d = 2 if S_n/√n ⇒ N(0, Σ).
• S_n is transient in d ≥ 3 if it is truly (at least) three dimensional (meaning that it does not live on a two-dimensional subspace of R^d).
More generally, one can show that a necessary and sufficient condition for recurrence is

∫_{(−δ,δ)^d} Re(1/(1 − ϕ(y))) dy = ∞

for δ > 0 where ϕ(t) = E[e^{it·X_1}] is the ch.f. of one step in the walk.
We will content ourselves with proofs of the d = 1, 2 results. The proof for d ≥ 3 can be found in Durrett.
We begin with some lemmas which are valid in any dimension. The first is analogous to Theorem 18.2.
Lemma 18.1. Let X_1, X_2, ... be i.i.d. and take S_n = Σ_{i=1}^n X_i. Then S_n is recurrent if and only if

Σ_{n=1}^∞ P(‖S_n‖ < ε) = ∞

for every ε > 0.
Proof. If S_n is recurrent, then P(‖S_n‖ < ε i.o.) = 1 for all ε > 0, so the contrapositive of the first Borel-Cantelli lemma implies that Σ_{n=1}^∞ P(‖S_n‖ < ε) = ∞ for all ε > 0.
For the converse, fix k ≥ 1 and define Z_k = Σ_{i=1}^∞ 1{‖S_i‖ < ε, ‖S_{i+j}‖ ≥ ε for all j ≥ k}.
Then Z_k ≤ k by construction, so

k ≥ E[Z_k] = Σ_{i=1}^∞ P(‖S_i‖ < ε, ‖S_{i+j}‖ ≥ ε for all j ≥ k)
≥ Σ_{i=1}^∞ P(‖S_i‖ < ε, ‖S_{i+j} − S_i‖ ≥ 2ε for all j ≥ k)
= Σ_{i=1}^∞ P(‖S_i‖ < ε) P(‖S_{i+j} − S_i‖ ≥ 2ε for all j ≥ k)
= P(‖S_j‖ ≥ 2ε for all j ≥ k) Σ_{i=1}^∞ P(‖S_i‖ < ε).

Thus if Σ_{i=1}^∞ P(‖S_i‖ < ε) = ∞, then it must be the case that P(‖S_j‖ ≥ 2ε for all j ≥ k) = 0.
As this is true for all k ∈ N, we see that P(‖S_j‖ < 2ε i.o.) = 1, so, since this holds for all ε > 0, S_n is recurrent.
Note that one could equivalently take the lower index of summation to be 0 in the preceding theorem since
adding or subtracting 1 does not change whether or not the sum diverges to innity.
Everything that we have done thus far is true for any norm. Our reason for working with the supremum norm is the following result. As all norms on R^d are equivalent, this choice entails no loss of generality.
Lemma 18.2. For any ε > 0 and m ∈ N,

Σ_{n=0}^∞ P(‖S_n‖ < mε) ≤ (2m)^d Σ_{n=0}^∞ P(‖S_n‖ < ε).
Proof. The left hand side gives the expected number of visits to the open cube (−mε, mε)^d. This can be obtained by summing the number of visits to each of the (2m)^d subcubes of side length ε obtained by dividing each side of the cube into 2m equal segments. Thus it suffices to show that the expected number of visits to any of these subcubes is at most Σ_{n=0}^∞ P(‖S_n‖ < ε).
To this end, let C be any such side length ε cube in R^d and let T = inf{n : S_n ∈ C} be the hitting time for C. If T = ∞ then the walk never visits C, while if T = m, then on every subsequent visit to C the walk is within ε of S_m, so

Σ_{n=0}^∞ P(S_n ∈ C) = Σ_{n=0}^∞ Σ_{m=0}^n P(S_n ∈ C, T = m) = Σ_{m=0}^∞ Σ_{n=m}^∞ P(S_n ∈ C, T = m)
≤ Σ_{m=0}^∞ Σ_{n=m}^∞ P(‖S_n − S_m‖ < ε, T = m)
= Σ_{m=0}^∞ Σ_{n=m}^∞ P(‖S_n − S_m‖ < ε) P(T = m)
= Σ_{m=0}^∞ P(T = m) Σ_{k=0}^∞ P(‖S_k‖ < ε)
≤ Σ_{k=0}^∞ P(‖S_k‖ < ε).
The preceding lemma shows that establishing convergence/divergence of Σ_{n=1}^∞ P(‖S_n‖ < ε) for a single value of ε is enough:
Corollary 18.1. S_n is recurrent if and only if Σ_{n=1}^∞ P(‖S_n‖ < 1) = ∞.
Proof. Lemma 18.1 shows that if S_n is recurrent, then Σ_{n=1}^∞ P(‖S_n‖ < 1) = ∞.
On the other hand, suppose that Σ_{n=1}^∞ P(‖S_n‖ < 1) = ∞ and let ε > 0 be given.
Applying Lemma 18.2 with m > ε^{−1} yields

∞ = Σ_{n=1}^∞ P(‖S_n‖ < 1) ≤ (2m)^d Σ_{n=0}^∞ P(‖S_n‖ < m^{−1}) ≤ (2m)^d Σ_{n=0}^∞ P(‖S_n‖ < ε),

so, since ε was arbitrary, it follows from Lemma 18.1 that S_n is recurrent.
Theorem 18.5. If X_1, X_2, ... are i.i.d. R-valued random variables and S_n/n →_p 0, then S_n is recurrent.
Proof. By Corollary 18.1, it suffices to prove that Σ_{n=1}^∞ P(|S_n| < 1) = ∞.
Lemma 18.2 shows that for any m ∈ N,

Σ_{n=0}^∞ P(|S_n| < 1) ≥ (1/2m) Σ_{n=0}^∞ P(|S_n| < m) ≥ (1/2m) Σ_{n=0}^{Km} P(|S_n| < m)
≥ (1/2m) Σ_{n=0}^{Km} P(|S_n| < n/K) = (K/2) · (1/Km) Σ_{n=0}^{Km} P(|S_n| < n/K)

for any K ∈ N. By hypothesis, P(|S_n| < n/K) → 1 as n → ∞, so, sending m to ∞ shows that Σ_{n=0}^∞ P(|S_n| < 1) ≥ K/2, and the proof is complete since K was arbitrary.
It is worth remarking that if the X_i's have a well-defined expectation E[X_i] = µ ≠ 0, then the strong law implies S_n/n → µ a.s. In this case, we must have |S_n| → ∞, so the walk is transient. If E[X_i] = 0, then the weak law implies S_n/n →_p 0. Thus if the increments have an expectation µ = E[X_i] ∈ [−∞, ∞], then the walk is recurrent if and only if µ = 0.
We now show that, in dimension 2, a random walk is recurrent if a mean 0 central limit theorem holds.
In the case where the limit distribution N(0, Σ) is degenerate - that is, the covariance matrix Σ has rank(Σ) < 2 - the random walk is either always at 0 or is essentially one-dimensional, in which case recurrence follows from Theorem 18.5. Thus we will assume in what follows that N(0, Σ) is nondegenerate and thus has a density with respect to Lebesgue measure on R².
Writing n(y) for the density of N(0, Σ) and combining Lemma 18.2 (with ε = 1), the substitution θ = n/m², and Fatou's lemma with the central limit theorem, we have

4 Σ_{n=0}^∞ P(‖S_n‖ < 1) ≥ lim inf_{m→∞} (1/m²) Σ_{n=0}^∞ P(‖S_n‖ < m)
= lim inf_{m→∞} ∫_0^∞ P(‖S_{⌊m²θ⌋}‖ < m) dθ
≥ ∫_0^∞ (∫_{‖y‖<θ^{−1/2}} n(y) dy) dθ.

Since ∫_{‖y‖<θ^{−1/2}} n(y) dy is asymptotic to a constant multiple of 1/θ as θ → ∞, the final integral diverges, so Corollary 18.1 shows that S_n is recurrent.
19. Path Properties
We conclude our investigation of random walk with a look at the trajectories of simple random walk on Z.
Here we think of a random walk S_1, S_2, ... as being represented by the polygonal curve in R² having vertices (1, S_1), (2, S_2), ... where successive vertices (n, S_n) and (n + 1, S_{n+1}) are connected by a line segment.
We will call any polygonal curve which is a possible realization of simple random walk a path.
Now any path from (0, 0) to (n, x) consists of a steps in the positive direction and b steps in the negative direction, where

a + b = n,   a − b = x,

hence a = (n + x)/2 and b = (n − x)/2.
(Note that if (n, x) is a vertex on a possible trajectory of SRW, then n ∈ N_0 and x ∈ Z must have the same parity.)
Our first result is an enumeration of the paths beginning and ending above the x-axis which hit 0 at some point.
Lemma 19.1 (Reection principle). For any q, t, n ∈ N, the number of paths from (0, q) to (n, t) that are 0
at some point is equal to the number of paths from (0, −q) to (n, t).
Proof. (Draw picture)
Suppose that (0, r_0), (1, r_1), ..., (n, r_n) is a path from (0, q) to (n, t) which is 0 at some point.
Let K = min{k : r_k = 0} be the first time the path touches the x-axis. Then, setting r_i' = −r_i for 0 ≤ i < K and r_i' = r_i for K ≤ i ≤ n, we see that (0, r_0'), (1, r_1'), ..., (n, r_n') is a path from (0, −q) to (n, t).
Conversely, if (0, s_0), (1, s_1), ..., (n, s_n) is a path from (0, −q) to (n, t), then it must cross the x-axis at some point. Let L = min{l : s_l = 0} be the first time this happens. Then (0, s_0'), (1, s_1'), ..., (n, s_n') defined by s_j' = −s_j for 0 ≤ j < L and s_j' = s_j for L ≤ j ≤ n is a path from (0, q) to (n, t) which is 0 at time L.
Thus the set of paths from (0, q) to (n, t) which are 0 at some point is in 1−1 correspondence with the set
of paths from (0, −q) to (n, t), and the theorem is proved.
Theorem 19.1 (The Ballot Theorem). Suppose that in an election candidate A gets α votes and candidate B gets β votes where α > β. The probability that candidate A is always in the lead when the votes are counted one by one is (α − β)/(α + β).
Proof. Let x = α − β and n = α + β. The number of favorable outcomes is equal to the number of paths from (1, 1) to (n, x) which are never 0. (Think of a vote for A as a positive step and a vote for B as a negative step, keeping in mind that the first vote counted must be for A.)
Shifting by one time step shows that this is equal to the number of paths from (0, 1) to (n − 1, x) which are never zero.
114
The number of paths from (0, 1) to (n − 1, x) that hit 0 is equal to the number of paths from (0, −1) to (n − 1, x), so subtracting this from the total number of paths gives the number of favorable outcomes.
Shifting in the vertical direction to get paths starting at zero shows that the number in question is

N_{n−1,x−1} − N_{n−1,x+1} = (n−1 choose ((n−1)+(x−1))/2) − (n−1 choose ((n−1)+(x+1))/2) = (n−1 choose α−1) − (n−1 choose α)
= (n−1)!/((α−1)!(n−α)!) − (n−1)!/(α!(n−α−1)!) = (n−1)!(α − (n−α))/(α!(n−α)!) = ((2α−n)/n)(n choose α).

Since the total number of sequences of α A's and β B's is (n choose α), the probability that A always leads is

((2α−n)/n)(n choose α) / (n choose α) = (2α − n)/n = (2α − (α+β))/(α+β) = (α − β)/(α + β).
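For small vote counts, the Ballot Theorem can be verified by brute force. This is an illustrative sketch; the function name is an assumption of the example:

```python
from fractions import Fraction
from itertools import product

# P(A always strictly ahead) over all orderings of alpha A-votes, beta B-votes.
def ballot_probability(alpha, beta):
    favorable = total = 0
    for seq in product((1, -1), repeat=alpha + beta):
        if seq.count(1) != alpha:
            continue
        total += 1
        s = 0
        if all((s := s + v) > 0 for v in seq):  # partial counts stay positive
            favorable += 1
    return Fraction(favorable, total)

assert ballot_probability(4, 1) == Fraction(3, 5)
assert ballot_probability(5, 3) == Fraction(2, 8)   # (alpha-beta)/(alpha+beta)
```

Exact rational arithmetic makes the agreement with (α − β)/(α + β) an equality rather than an approximation.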
The kind of reasoning involved in the preceding arguments is useful in computing the distribution of the time of the first return to 0.
Lemma 19.2. For simple random walk, P(S_1 ≠ 0, ..., S_2n ≠ 0) = P(S_2n = 0).
Proof.
If S_1 ≠ 0, ..., S_2n ≠ 0, then either S_1, ..., S_2n > 0 or S_1, ..., S_2n < 0 (since simple random walk cannot skip over integers) and the two events are equally likely by symmetry. Thus it suffices to show that

P(S_1 > 0, ..., S_2n > 0) = (1/2) P(S_2n = 0).
Breaking up the event {S_1, ..., S_2n > 0} according to the value of S_2n (which is necessarily an even number),

P(S_1 > 0, ..., S_2n > 0) = Σ_{r=1}^n P(S_1 > 0, ..., S_{2n−1} > 0, S_2n = 2r).
Now each path of length 2n has probability 2^{−2n} of being realized, and the number of paths from (0, 0) to (2n, 2r) which are never zero at positive times is N_{2n−1,2r−1} − N_{2n−1,2r+1} by the argument in the proof of the Ballot Theorem.
Accordingly,

P(S_1 > 0, ..., S_2n > 0) = Σ_{r=1}^n P(S_1 > 0, ..., S_{2n−1} > 0, S_2n = 2r)
= 2^{−2n} Σ_{r=1}^n (N_{2n−1,2r−1} − N_{2n−1,2r+1})
= 2^{−2n} [(N_{2n−1,1} − N_{2n−1,3}) + (N_{2n−1,3} − N_{2n−1,5}) + ... + (N_{2n−1,2n−1} − N_{2n−1,2n+1})]
= (N_{2n−1,1} − N_{2n−1,2n+1})/2^{2n} = N_{2n−1,1}/2^{2n},

where the final equality is because you can't get to 2n + 1 in 2n − 1 steps of size 1.
To complete the proof, we observe that N_{2n−1,1} = (2n−1 choose n) = (1/2)(2n choose n), so P(S_1 > 0, ..., S_2n > 0) = (1/2) · 2^{−2n}(2n choose n) = (1/2) P(S_2n = 0).
Lemma 19.2 and our previous computations for simple random walk show that the distribution function of α = inf{n ∈ N : S_n = 0} is given by

P(α ≤ 2n) = 1 − P(S_1 ≠ 0, ..., S_2n ≠ 0) = 1 − P(S_2n = 0) = 1 − 2^{−2n}(2n choose n) ≈ (√(πn) − 1)/√(πn),
P(α ≤ 2n + 1) = P(α ≤ 2n)

for n = 1, 2, ...
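Lemma 19.2 can also be confirmed by enumerating all 2^{2n} paths for small n (an illustrative sketch):

```python
from fractions import Fraction
from itertools import product
from math import comb

# Check P(S_1 != 0, ..., S_2n != 0) = P(S_2n = 0) = C(2n, n)/2^{2n} for small n.
def p_no_zero(n):
    count = 0
    for steps in product((1, -1), repeat=2 * n):
        s = 0
        if all((s := s + x) != 0 for x in steps):  # path never touches zero
            count += 1
    return Fraction(count, 4 ** n)

for n in (1, 2, 3, 4):
    assert p_no_zero(n) == Fraction(comb(2 * n, n), 4 ** n)
```

Since both sides are exact rationals, the identity holds with no numerical slack.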
Our final set of results concern the so-called arcsine laws, which show that certain suitably normalized random walk statistics have limiting distributions that can be described using the arcsine function.
These theorems are typically stated in terms of Brownian motion, which arises as a scaling limit of random walk.
Lemma 19.3. Let L_2n = max{m ≤ 2n : S_m = 0} be the time of the last visit to 0 by time 2n, and let u_2m = P(S_2m = 0). Then P(L_2n = 2k) = u_2k u_{2n−2k} for k = 0, 1, ..., n.
Theorem 19.2. For 0 < a < b < 1,

P(a ≤ L_2n/2n ≤ b) → ∫_a^b 1/(π√(x(1−x))) dx.

Proof. By Lemma 19.3 and the approximation u_2m ≈ 1/√(πm),

nP(L_2n = 2k) = n u_2k u_{2(n−k)} ≈ n · (1/√(πk)) · (1/√(π(n−k))) = 1/(π√((k/n)(1 − k/n))),

so if k/n → x, then

nP(L_2n = 2k) → 1/(π√(x(1−x))).
Now define a_n and b_n so that 2na_n is the smallest even integer greater than or equal to 2na and 2nb_n is the largest even integer less than or equal to 2nb, and let f_n be the step function with f_n(x) = nP(L_2n = 2k) for x ∈ [k/n, (k + 1)/n).
Moreover, our work above shows that f_n(x) → f(x) = (1/(π√(x(1−x)))) 1_{(0,1)}(x) uniformly on compact subsets of (0, 1), so for any 0 < a < b < 1 we can apply the bounded convergence theorem to conclude

P(a ≤ L_2n/2n ≤ b) = ∫_{a_n}^{b_n + 1/n} f_n(x) dx → ∫_a^b f(x) dx.
To see the reason for the name, observe that the substitution y = √x yields

∫_a^b 1/(π√(x(1−x))) dx = (2/π) ∫_{√a}^{√b} dy/√(1 − y²) = (2/π)(sin^{−1}√b − sin^{−1}√a).
Note that P(L_2n/2n ≤ 1/2) → (2/π) sin^{−1}(1/√2) = 1/2. (This symmetry is also apparent in the mass function derived in Lemma 19.3.)
An amusing consequence is that if two people were to bet $1 on a coin flip every day of the year, then with probability approximately 1/2, one player would remain consistently ahead from July 1st onwards. In other words, if you were to play this game against me, then the probability that you would be in the lead for the first two days is about the same as the probability that you would be in the lead for the entire second half of the year!
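A quick simulation of the last-tie time over a 364-day year (sample size and seed are arbitrary choices of this sketch):

```python
import random

# Estimate P(L_2n / 2n <= 1/2) for 2n = 364 daily coin-flip bets.
# The arcsine law predicts a limit of (2/pi) arcsin(1/sqrt(2)) = 1/2.
rng = random.Random(7)
steps, trials, hits = 364, 10_000, 0
for _ in range(trials):
    s, last_zero = 0, 0
    for t in range(1, steps + 1):
        s += rng.choice((-1, 1))
        if s == 0:
            last_zero = t
    hits += (last_zero <= steps // 2)
assert abs(hits / trials - 0.5) < 0.03
```

The estimate sits near 1/2, matching the midyear statement above.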
Finally, note that the proof of Theorem 19.2 shows that any statistic T_n satisfying P(T_2n = 2k) = u_2k u_{2n−2k} for k = 0, 1, ..., n obeys the asymptotic arcsine law

P(a ≤ T_2n/2n ≤ b) → ∫_a^b 1/(π√(x(1−x))) dx,   0 < a < b < 1.
Theorem 19.3. Let π_2n be the number of line segments (k − 1, S_{k−1}) to (k, S_k) that lie above the x-axis and let u_m = P(S_m = 0). Then

P(π_2n = 2k) = u_2k u_{2n−2k}.

Proof. Write β_{2k,2n} = P(π_2n = 2k), and note that β_{2n,2n} = P(S_1 ≥ 0, ..., S_2n ≥ 0). Arguing as in the proof of Lemma 19.2,

(1/2) u_2n u_0 = (1/2) u_2n = P(S_1 > 0, ..., S_2n > 0)
= P(S_1 = 1, S_2 − S_1 ≥ 0, ..., S_2n − S_1 ≥ 0)
= (1/2) P(S_1 ≥ 0, ..., S_{2n−1} ≥ 0)
= (1/2) P(S_1 ≥ 0, ..., S_2n ≥ 0) = (1/2) β_{2n,2n},

where the penultimate equality is due to the fact that S_{2n−1} ≥ 0 implies S_{2n−1} ≥ 1 (so that S_2n ≥ 0 as well).
This proves the result for k = n, and since β_{0,2n} = β_{2n,2n} (replacing S_n with −S_n shows that always nonnegative is as likely as always nonpositive), we see that the result also holds for k = 0.
Suppose now that 1 ≤ k ≤ n − 1. Let R be the time of the first return to 0 (so that R = 2m with 0 < m < n) and write f_2m = P(R = 2m). Breaking things up according to whether the first excursion was on the positive or the negative side of the axis, and applying the inductive hypothesis, we have

β_{2k,2n} = (1/2) Σ_{m=1}^k f_2m β_{2k−2m,2n−2m} + (1/2) Σ_{m=1}^{n−k} f_2m β_{2k,2n−2m}
= (1/2) Σ_{m=1}^k f_2m u_{2k−2m} u_{2n−2k} + (1/2) Σ_{m=1}^{n−k} f_2m u_2k u_{2n−2m−2k}
= (1/2) u_{2n−2k} Σ_{m=1}^k f_2m u_{2k−2m} + (1/2) u_2k Σ_{m=1}^{n−k} f_2m u_{2n−2m−2k}.

Since

u_2k = Σ_{m=1}^k f_2m u_{2k−2m},   u_{2n−2k} = Σ_{m=1}^{n−k} f_2m u_{2n−2k−2m}

(by considering the time of the first return to 0), we have

β_{2k,2n} = (1/2) u_{2n−2k} u_2k + (1/2) u_2k u_{2n−2k} = u_2k u_{2n−2k},

and the result follows from the principle of induction.
Since f(x) = 1/(π√(x(1−x))) has a minimum at x = 1/2 and goes to ∞ as x → 0, 1, Theorem 19.3 shows that an equal division of steps above and below the axis is least likely, and completely one-sided divisions have the greatest probability.
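This U-shape is visible in the exact mass function from Lemma 19.3; the short sketch below uses 2n = 20 steps (an arbitrary illustrative choice):

```python
from math import comb

# P(L_2n = 2k) = u_{2k} u_{2n-2k} with u_{2m} = C(2m, m)/4^m: the endpoints are
# the most likely values and the middle is the least likely.
def u(m):  # P(S_{2m} = 0) for simple random walk
    return comb(2 * m, m) / 4 ** m

n = 10  # so 2n = 20 steps
pmf = [u(k) * u(n - k) for k in range(n + 1)]
assert abs(sum(pmf) - 1) < 1e-12            # the masses sum to one
assert pmf[0] == pmf[n] == max(pmf)         # one-sided divisions most likely
assert min(pmf) == pmf[n // 2]              # the even split is least likely
```

The same computation applies verbatim to π_2n by Theorem 19.3, since the two statistics share this mass function.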