
Math 617: Measure Theory

David Aristoff

January 2017

1 Introduction to measure

1.1 Informal ideas

A measure µ assigns a scalar size to certain subsets of a space. The ingredients we need are:

• A space, E

• A collection, E, of subsets of E which can be measured

• A rule for assigning measure to each element of E.

Formally, µ : E → [0, ∞]. (Sets have nonnegative but possibly infinite measure.) Elements of E
are called measurable sets. Of course, µ must satisfy some natural properties. The most obvious
is that the measure of the union of disjoint sets equals the sum of the measures of each set
separately:
µ(A ∪ B) = µ(A) + µ(B) for disjoint A, B ∈ E. (1)

This is called finite additivity. Actually, we'll assume an analogous property for countable unions:

µ(A_1 ∪ A_2 ∪ . . .) = µ(A_1) + µ(A_2) + . . . for disjoint A_1, A_2, . . . ∈ E. (2)

The latter property is called countable additivity. Though it isn't explicit in (2), for the equation
above to make sense, a countable disjoint union of measurable sets must be measurable:

if A_1, A_2, . . . ∈ E are disjoint, then A_1 ∪ A_2 ∪ . . . ∈ E.

Both (1) and (2) are geometrically reasonable, so why do we assume (2) instead of (1)? The
reason is that countable additivity allows us to define measures of complicated sets from
simpler ones by taking limits. Of course, countable additivity implies finite additivity.
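To see the last claim, take A_1 = A, A_2 = B, and A_k = ∅ for k ≥ 3 in (2); then, granting µ(∅) = 0 (an assumption discussed at the end of this subsection),

µ(A ∪ B) = µ(A) + µ(B) + µ(∅) + µ(∅) + . . . = µ(A) + µ(B).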

Example 1.1. (Discrete measure.) E = a countable set, E = 2^E = all subsets of E, and for
each x ∈ E, µ({x}) is an arbitrary fixed value in [0, ∞]. Then by countable additivity,

µ(A) = ∑_{x∈A} µ({x}) for A ⊆ E.

Thus, the measure of any set is the sum of the measures of its singleton subsets.
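For concreteness, here is a minimal computational sketch of a discrete measure (our own illustration; the weights and the helper name mu are arbitrary choices, not part of the notes):

    # A discrete measure on (a truncated piece of) the countable space E = {0, 1, 2, ...},
    # determined by arbitrary nonnegative weights on singletons.
    weights = {0: 0.5, 1: 1.25, 2: 0.0, 3: float("inf")}  # values in [0, infinity]

    def mu(A):
        # Measure of a subset A of E: the sum of the singleton weights over A.
        return sum(weights.get(x, 0.0) for x in A)

    print(mu({0, 1}))   # 1.75
    print(mu({1, 3}))   # inf
    print(mu(set()))    # 0.0, consistent with mu(empty set) = 0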

The trickiest question here is somewhat hidden. Once we define measure on some simple
sets, property (2) essentially leads to a rule for assigning measure to general sets. The trouble
is in deciding what these sets will be, that is, which sets are consistent with (2). It is tempting
to just set E = 2^E, all subsets of E. While this is fine in the countable case (see
Example 1.1), there are many difficulties in the uncountable case (e.g. E = R). At the moment,
we will only assume that the whole space E and the empty set ∅ are measurable. We would
also like to have µ(∅) = 0; this is guaranteed whenever at least one set has finite measure, an
assumption we adopt from here on.

1.2 Jordan measure

In the case of a discrete (or countable) space, it is easy to consistently define a measure by assigning
values to each point. The continuous (or uncountable) space case is surprisingly subtle. Below
we will consider E = R^n. Our first attempt at defining a measure uses a simple idea: a set
can be approximately measured by filling it up with finitely many tiny non-overlapping
rectangles, and then summing the volumes of the rectangles. This leads to Jordan measure,
which is not a true measure, as its measurable sets are not closed under disjoint countable
unions.
Consider E = R^n. For our simple sets we take

S = {all half-open rectangles of the form [a_1, b_1) × [a_2, b_2) × . . . × [a_n, b_n)}.

A natural choice for the measure of such rectangles is

µ([a_1, b_1) × [a_2, b_2) × . . . × [a_n, b_n)) = (b_1 − a_1)(b_2 − a_2) . . . (b_n − a_n). (3)

(We allow a_i = b_i, in which case the rectangle is the empty set.) The reason for using half-open
intervals (instead of, say, open or closed intervals) is that any finite union of elements of S can be
written as a disjoint finite union of elements of S. This allows us to use additivity to measure
such sets. Intuitively, to measure a more general bounded set A, we can fill in most of the
inside of A with finitely many rectangles from S, and then cover the remaining boundary of
A with finitely many smaller rectangles. If the boundary is small, A is measurable. Formally,
define

µ^*(A) = inf_{S_1 ∪ . . . ∪ S_k ⊇ A} ∑_{i=1}^k µ(S_i),
µ_*(A) = sup_{S_1 ∪ . . . ∪ S_k ⊆ A} ∑_{i=1}^k µ(S_i),

for bounded A, where the inf and sup are over disjoint sets S_1, . . . , S_k ∈ S, with k arbitrary.
Bounded sets are declared measurable when these two quantities agree:

A ∈ E ⇐⇒ A is bounded and µ^*(A) = µ_*(A), (4)

in which case we set µ(A) = µ^*(A) = µ_*(A). The resulting µ is called Jordan measure. (Often
it is called Jordan content because it is not a true measure; see Exercise 1.2.) Jordan measure
has some nice properties. Notably, it is congruence invariant: applying rotations, reflections or
translations to a set A ∈ E will not change its measure (or measurability). However, it turns
out that Jordan measure is not countably additive, in the sense that its measurable sets are not
closed under disjoint countable unions. (However, Jordan measure is finitely additive.)
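To make the inf and sup in the definition concrete, here is a small numerical sketch (our own illustration, not part of the notes) that brackets the area of the unit disk in R^2 between inner and outer sums over half-open grid squares of side 1/n; both bounds approach π as n grows:

    # Inner/outer Jordan sums for the unit disk {x^2 + y^2 <= 1} over the grid
    # of half-open squares [i/n, (i+1)/n) x [j/n, (j+1)/n).
    def jordan_bounds(n):
        inner = outer = 0
        for i in range(-n, n + 1):
            for j in range(-n, n + 1):
                xs, ys = (i / n, (i + 1) / n), (j / n, (j + 1) / n)
                # min and max of x^2 + y^2 over the closed square
                min2 = ((0.0 if xs[0] <= 0 <= xs[1] else min(x * x for x in xs))
                        + (0.0 if ys[0] <= 0 <= ys[1] else min(y * y for y in ys)))
                max2 = max(x * x for x in xs) + max(y * y for y in ys)
                if max2 <= 1:   # square contained in the disk
                    inner += 1
                if min2 <= 1:   # square meets the disk, so a cover needs it
                    outer += 1
        return inner / n**2, outer / n**2

    print(jordan_bounds(100))   # roughly (3.1, 3.2), squeezing pi = 3.14159...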

Exercise 1.2. (Jordan measure is not a true measure.) Consider E defined by (4). Show that
single points are in E, but Q ∩ [0, 1] ∉ E.

Actually, Jordan measurable sets fail to be closed under disjoint countable unions by a simpler
argument: the intervals [n, n + 1) are in E, but their countable union, R, is not in E since it isn't
bounded. Exercise 1.2 shows Jordan measure is not countably additive even when we restrict
to bounded sets.

Exercise 1.3. Consider E defined by (4) and let f : [0, 1] → R be continuous. Prove that the region

A = {(x, y) : 0 ≤ x < 1, 0 ≤ y ≤ f(x)}

under the graph of f is in E.


Exercise 1.4. Consider the fat Cantor set C constructed by starting with [0, 1] and, at step k,
removing an open middle interval of length (1/4)^k from each remaining interval. Find µ^*(C) and µ_*(C). Conclude
that E defined by (4) fails to contain certain compact sets and certain bounded open sets.

2 Formal measure theory

Having tried and failed at a naive theory, we now proceed to formal measure theory.

Definition 2.1. A σ-algebra over E is a collection E of subsets of E that contains ∅ and is
closed under complements and countable unions.

By closure under complements, we mean if A ∈ E then E \ A ∈ E. By closure under relative
complements, we mean if A, B ∈ E then B \ A ∈ E. The σ-algebra will be our collection of
measurable sets. Definition 2.1 essentially comes from countable additivity: provided we assume
E is closed under relative complements, countable additivity enforces closure under countable
unions (see Exercise 2.2).

Exercise 2.2. Show that if E is a collection of subsets of E that is closed under relative
complements and disjoint countable unions, then E is closed under countable unions.

Recall from Section 1 that we want to define measure by starting from simple sets. This
means our σ-algebra should contain all the simple sets. If the σ-algebra is minimal with respect
to this property, it is said to be generated by the simple sets.
Definition 2.3. The σ-algebra generated by a collection C of subsets of E, denoted σ(C), is the
smallest σ-algebra which contains C.

Explicitly, σ(C) is the intersection of all σ-algebras that contain C. This intersection is over a
nonempty collection, since 2^E is a σ-algebra containing C. We leave it as an exercise to show
that this intersection is actually a σ-algebra.

Exercise 2.4. Let G be a nonempty collection of σ-algebras over E. Show that

E := ∩_{F∈G} F = {A : A ∈ F for all F ∈ G}

is a σ-algebra.
One σ-algebra will be of particular interest to us:

Definition 2.5. The σ-algebra generated by the open sets in R^n is the Borel σ-algebra.

Elements of the Borel σ-algebra are called Borel sets. Which sets are Borel? Single points,
the rationals, the irrationals, and rectangles (of any type) are all examples of Borel sets. Actually,
half-open rectangles generate the Borel σ-algebra (Exercise 2.6). It turns out that not all subsets
of R^n are Borel, though; see the discussion below Theorem 2.9.

Exercise 2.6. Let E = R^n, and let E be the σ-algebra generated by all half-open rectangles of
the form [a_1, b_1) × . . . × [a_n, b_n). Show that E is the Borel σ-algebra.

We can easily see the Borel σ-algebra contains some sets that do not have Jordan measure.

Exercise 2.7. Show that the fat Cantor set of Exercise 1.4 is a Borel set.

We are now ready to give a precise definition of measure.

Definition 2.8. Let E be a σ-algebra over E. A function µ : E → [0, ∞] is a measure on (E, E)
if µ(∅) = 0 and, for any disjoint collection {A_k}_{k=1}^∞ of elements of E,

µ(∪_{k=1}^∞ A_k) = ∑_{k=1}^∞ µ(A_k). (5)

As above, elements of E are called measurable sets. Here and below, (E, E) is called a
measurable space, and (E, E, µ) is called a measure space. Note that the only thing we added to
the discussion in Section 1 is the explicit structure of E.

Our next step is to define a natural (true) measure on E = R^n. Here, by natural we mean
the measure should be congruence invariant (like Jordan measure) and the unit cube should
have measure 1. The following result furnishes the desired µ:
Theorem 2.9. (Borel measure.) Let E = R^n and E the Borel σ-algebra over E. There is a
unique measure µ on (E, E) whose value on half-open rectangles is

µ([a_1, b_1) × . . . × [a_n, b_n)) = (b_1 − a_1) . . . (b_n − a_n). (6)

Moreover, µ is congruence invariant.
The µ in Theorem 2.9 is called Borel measure. The proof of Theorem 2.9 is surprisingly
technical, considering the simplicity of the ideas considered so far. For this reason we will
postpone the proof to Section 3. For the remainder of this section we will discuss two deficiencies
of Borel measure.

The main deficiency is that not all subsets of E = R^n are Borel sets. This problem cannot
be fixed, at least in the ZFC framework, as we show in the following discussion. We will focus

on dimension n = 1, though the arguments generalize to any dimension. Consider the unit circle
S^1 as [0, 1] with mod 1 arithmetic. Define an equivalence relation on S^1 by

x ∼ y ⇐⇒ x − y ∈ Q ∩ [0, 1],

where the subtraction is mod 1. Choose a set R of representatives, one from each equivalence
class. For a set S and a point p, define p + S = {p + s : s ∈ S}. Each equivalence class for our
relation on S^1 has the form r + Q ∩ [0, 1], where r ∈ R and the addition is mod 1. Since the
equivalence classes partition S^1,

[0, 1] = ∪_{r∈R} (r + Q ∩ [0, 1]) = ∪_{r∈R, q∈Q∩[0,1]} {r + q} = ∪_{q∈Q∩[0,1]} (q + R),

where the union on the right-hand side is disjoint, and the addition is mod 1. By congruence
invariance (in particular, translation invariance) and countable additivity,

µ(∪_{q∈Q∩[0,1]} (q + R)) = ∑_{q∈Q∩[0,1]} µ(q + R) = ∑_{q∈Q∩[0,1]} µ(R).

The right-hand side here is either 0 or ∞. However, µ([0, 1]) = 1. This shows that R is not a
Borel measurable set. Such sets are called Vitali sets. (The interval [0, 1] is not special here: a
nonmeasurable Vitali-type set can be constructed from any set of positive Borel measure. Think
about why this is true.)
The second deficiency of Borel measure is that some sets of Borel measure zero have
non-Borel subsets (see Exercise 2.10). This is strange, since we'd expect subsets of measure zero
sets to also have measure zero. This deficiency is remedied with Lebesgue measure. Lebesgue
measure is an extension of Borel measure in which subsets of measure zero sets are always
measurable (with measure 0). For this reason Lebesgue measure is called the completion of
Borel measure.

One way to see that not all subsets of measure zero Borel sets are Borel is the following
cardinality argument. The cardinality of the collection of Borel sets is the same as the cardinality of R, while
the Cantor set C has the cardinality of R, so its power set 2^C has strictly larger cardinality. That
means some subset of the Cantor set must be non-Borel. (The Cantor set has Borel measure 0;
think about why.) An explicit construction is in Exercise 2.10.

Exercise 2.10. Let c : [0, 1] → [0, 1] be the Cantor function and define ψ(x) = x + c(x). Show
that ψ is invertible and maps Borel sets to Borel sets. Use ψ to construct a non-Borel set A
such that A ⊆ B, where B is a Borel set with µ(B) = 0. (Hint: let C be the Cantor set and look
at ψ(C).)
The Vitali set example shows that a (countably additive) measure on E = R^n that is
translation invariant and has nontrivial values on rectangles cannot assign measure to every
subset of E. Can this be fixed by replacing countable additivity with finite additivity? The
Banach-Tarski paradox says the answer is no, at least in dimension three and higher: a sphere can
be cut into finitely many pieces, which can be reassembled using only congruences to produce
two copies of the original sphere, thus violating finite additivity.

3 Existence and uniqueness of Borel measure

In this section we prove Theorem 2.9. The proof is somewhat long, so we organize it as follows.
First, we show existence using Carathéodory's extension theorem in Section 3.1. We postpone
the proof of Carathéodory's theorem, which is quite technical, to Section 3.4. In Section 3.2, we
prove uniqueness using Dynkin's lemma. Throughout, we will focus on E = R so that notation
is simpler. However, all of the arguments can easily be generalized to E = R^n.

3.1 Existence

We begin by defining rings of sets.

Definition 3.1. A ring over E is a collection A of subsets of E that contains ∅ and is closed
under relative complements and finite unions. That is, if A, B ∈ A then B \ A ∈ A and
A ∪ B ∈ A.

Over E = R, the ring A we have in mind is the set of finite disjoint unions of half-open
intervals. (Recall these were the sets we used to define Jordan measure!)

Exercise 3.2. Let A be the set of finite unions of half-open intervals of the form [a, b). (We
allow a = b, in which case [a, b) = ∅.) Show that A is a ring.
Exercise 3.3. Show that rings are closed under finite intersections.
Existence of Borel measure comes from Carathéodory's extension theorem:

Theorem 3.4. (Carathéodory) Let A be a ring over E and suppose µ : A → [0, ∞] satisfies

µ(∪_{k=1}^∞ A_k) = ∑_{k=1}^∞ µ(A_k) (7)

whenever A_1, A_2, . . . ∈ A are disjoint and ∪_{k=1}^∞ A_k ∈ A. Then µ extends to a measure on (E, σ(A)).
We postpone the proof of Theorem 3.4 to Section 3.4. Note that the condition on µ above
is slightly different from what we have called countable additivity, since we do not assume A is
closed under disjoint countable unions.

Proof of existence in Theorem 2.9. Let A be the ring of finite unions of half-open intervals.
Exercise 2.6 shows that A generates the Borel σ-algebra. It is easy to see that we can define µ
on A satisfying finite additivity and µ([a, b)) = b − a. Thus, all we must do is demonstrate
the countable additivity condition in Theorem 3.4. Let A_1, A_2, . . . ∈ A be disjoint and suppose
A := A_1 ∪ A_2 ∪ . . . ∈ A. By finite additivity, it suffices to show µ(A) = lim_{k→∞} µ(A_1 ∪ . . . ∪ A_k).
Using finite additivity again, this is equivalent to showing lim_{k→∞} µ(A \ (A_1 ∪ . . . ∪ A_k)) = 0.
Define B_k := A \ (A_1 ∪ . . . ∪ A_k). Suppose for contradiction that there is ε > 0 such that
µ(B_k) ≥ ε for all k. For each k we can pick C_k ∈ A such that C̄_k ⊆ B_k (the bar
denotes closure) and µ(B_k \ C_k) ≤ ε2^{−k−1}. Then

µ(B_k \ (C_1 ∩ . . . ∩ C_k)) = µ((B_k \ C_1) ∪ . . . ∪ (B_k \ C_k))
    ≤ µ((B_1 \ C_1) ∪ . . . ∪ (B_k \ C_k))
    ≤ µ(B_1 \ C_1) + . . . + µ(B_k \ C_k)
    ≤ ∑_{i=1}^k ε2^{−i−1} < ε/2.

Thus, µ(C_1 ∩ . . . ∩ C_k) ≥ ε/2, so in particular C_1 ∩ . . . ∩ C_k is nonempty for each k. By construction,

∩_{k=1}^∞ C̄_k ⊆ ∩_{k=1}^∞ B_k = ∅.

But the left-hand side is nonempty by Cantor's intersection theorem (a decreasing sequence of
nonempty compact sets has nonempty intersection), which is our desired contradiction.

We used two properties of µ in the proof above that easily follow from finite additivity of µ,
but that we have not explicitly stated thus far. One is monotonicity, which states that if A, B
are in the domain of µ and A ⊆ B, then µ(A) ≤ µ(B). The other is subadditivity, which states
that if A, B and A ∪ B are in the domain of µ, then µ(A ∪ B) ≤ µ(A) + µ(B).
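For completeness, here are one-line derivations of both properties from finite additivity (a sketch; both use that the relevant differences lie in the domain, as they do for a ring):

µ(B) = µ(A) + µ(B \ A) ≥ µ(A) for A ⊆ B (monotonicity),
µ(A ∪ B) = µ(A) + µ(B \ A) ≤ µ(A) + µ(B) (subadditivity, by monotonicity applied to B \ A ⊆ B).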

Definition 3.5. A sequence B_1, B_2, . . . of sets is called decreasing if B_1 ⊇ B_2 ⊇ . . . and
increasing if B_1 ⊆ B_2 ⊆ . . .

Exercise 3.6. Let µ be a measure on (E, E) and B_1, B_2, . . . be an increasing sequence in E. Let
B = ∪_{k=1}^∞ B_k. Show that µ(B) = lim_{k→∞} µ(B_k). This property is called continuity from below.
Prove an analogous property for a decreasing sequence in E, called continuity from above. (Note:
you need an extra assumption for continuity from above...)

3.2 Uniqueness

We begin with some technical definitions.

Definition 3.7. A π-system over E is a collection A of subsets of E that contains ∅ and is
closed under finite intersections.

Over E = R, the π-system we have in mind is the collection A of half-open intervals. Recall
that in Exercise 2.6 we showed that this π-system generates the Borel σ-algebra.

Exercise 3.8. Let A be the set of half-open intervals of the form [a, b). Show that A is a
π-system.

Definition 3.9. A d-system over E is a collection D of subsets of E such that

• E ∈ D
• if A, B ∈ D with A ⊆ B, then B \ A ∈ D
• if A_1, A_2, . . . ∈ D is an increasing sequence, then ∪_{k=1}^∞ A_k ∈ D.

The d-system we will look at will be introduced shortly, in the proof of uniqueness of Borel
measure. Roughly speaking, the d-system will consist of all sets on which two candidate Borel
measures have the same values. The proof of uniqueness relies on Dynkin's lemma below. We
leave two ingredients of the proof as exercises.

Exercise 3.10. Show that if A is both a π-system and a d-system over E, then A is a σ-algebra over E.

Exercise 3.11. Let D be a d-system over E and fix C ∈ D. Show that

D_C = {A ⊆ E : A ∩ C ∈ D}

is a d-system over E.

Lemma 3.12. (Dynkin) Let A be a π-system over E. Then any d-system over E that contains
A must also contain σ(A).
Proof of Lemma 3.12. Let D be the smallest d-system that contains A; that is, D is the
intersection of all d-systems containing A. (As in Exercise 2.4, it is easy to check that D is a
d-system.) By Exercise 3.10, it suffices to show D is also a π-system. Since d-systems contain
∅, we only have to show D is closed under finite intersections. Let C ∈ A and define

D_C = {A ⊆ E : A ∩ C ∈ D}. (8)

Since C ∈ A and A is closed under finite intersections, D_C contains A. By Exercise 3.11, D_C is
a d-system, so D_C contains D. Since C ∈ A was arbitrary, it follows that A ∩ C ∈ D for all A ∈ D
and C ∈ A. Now suppose C ∈ D and define D_C as in (8). By the same argument as above, D_C
contains D, and so A ∩ C ∈ D for all A, C ∈ D.

Now we are ready for the proof of uniqueness of Borel measure.

Proof of uniqueness in Theorem 2.9. Let E = R and E the Borel σ-algebra, and let µ_1 and µ_2
be two measures on (E, E) such that µ_1([a, b)) = µ_2([a, b)) = b − a for each half-open interval
[a, b). For A ∈ E define

ν_1(A) = µ_1(A ∩ [0, 1)), ν_2(A) = µ_2(A ∩ [0, 1)).

Then ν_1 = ν_2 on the collection A of half-open intervals of the form [a, b). By Exercises 2.6
and 3.8, A is a π-system that generates the Borel σ-algebra. Consider

D = {A ∈ E : ν_1(A) = ν_2(A)}.

We will show below that D is a d-system. Since D contains A, by Dynkin's lemma (Lemma 3.12), D
contains the Borel σ-algebra. Thus ν_1 = ν_2 on E; repeating the argument with [0, 1) replaced
by [n, n + 1) for each n ∈ Z and using countable additivity, it follows that µ_1 = µ_2 on E.

Note that E = R ∈ D since ν_1(R) = µ_1([0, 1)) = µ_2([0, 1)) = ν_2(R). For A, B ∈ D
with A ⊆ B, we get ν_1(B \ A) = ν_2(B \ A) by additivity. (It is here that we needed to
use ν_i instead of µ_i, to guarantee ν_i(E) < ∞.) Thus, B \ A ∈ D. Let A_1, A_2, . . . ∈ D
be an increasing sequence and A = ∪_{k=1}^∞ A_k. Then by continuity from below (Exercise 3.6),
ν_1(A) = lim_{k→∞} ν_1(A_k) = lim_{k→∞} ν_2(A_k) = ν_2(A). Thus, A ∈ D.

The following fact, established by the same Dynkin-lemma argument as in the proof above, will
be useful below:

Proposition 3.13. If µ, ν are measures on (E, E) with µ(E) = ν(E) < ∞, and µ, ν agree on
a π-system that generates E, then µ = ν.
3.3 Completion of measures and Lebesgue measure

The aim of this section is to show that Borel measure can be extended to correct the deficiency
discussed at the end of Section 2, namely, that not all subsets of Borel measure zero sets are Borel.

Definition 3.14. Let (E, E, µ) be a measure space. A set N ⊆ E is called null if there is some
A ∈ E such that N ⊆ A and µ(A) = 0.

For example, if µ is Borel measure on E = R, then any subset of the Cantor set is a null set.
Recall from Exercise 2.10 that some subsets of the Cantor set are not Borel.

Exercise 3.15. Let (E, E, µ) be a measure space and let

F = {A ∪ N : A ∈ E, N null}.

Show that F is a σ-algebra.

Exercise 3.16. Let (E, E, µ) be a measure space and F = {A ∪ N : A ∈ E, N null}. Extend µ
to F by defining

µ(A ∪ N) = µ(A) for A ∈ E, N null. (9)

Show that this extension is well-defined and countably additive.


It is not hard to see that the extension is unique, that is, there is only one extension of µ
to F that satises (9). When E=R and E is the Borel σ -algebra, the extension of µ to F is
called Lebesgue measure, and the elements of F are the Lebesgue measurable sets.
Lebesgue measure is congruence invariant (so in particular translation invariant) and assigns
nonzero measure to intervals. Thus, a non-measurable Vitali set can be constructed in the exact
same fashion as in Section 2. So though the Lebesgue σ -algebra is larger than the Borel σ-
algebra, it still does not contain every subset of R.

3.4 Proof of Carathéodory's extension theorem

For convenience we restate Carathéodory's theorem:

Theorem 3.17. (Carathéodory) Let A be a ring over E and suppose µ : A → [0, ∞] satisfies

µ(∪_{k=1}^∞ A_k) = ∑_{k=1}^∞ µ(A_k) (10)

whenever A_1, A_2, . . . ∈ A are disjoint and ∪_{k=1}^∞ A_k ∈ A. Then µ extends to a measure on (E, σ(A)).

Let µ and A be as in the theorem. The main ingredient in the proof of Carathéodory's
theorem is a candidate for the extension of µ. This is the so-called outer measure µ^*, defined by

µ^*(A) = inf_{A ⊆ ∪_{k=1}^∞ A_k} ∑_{k=1}^∞ µ(A_k) for A ⊆ E, (11)

where the inf is over all sequences A_1, A_2, . . . in A (the A_k's need not be disjoint). If A is not
contained in any countable union of elements of A, we set µ^*(A) = ∞. We must show µ^* is
countably additive on σ(A). Actually, we will show µ^* is countably additive on F, where

A ∈ F ⇐⇒ µ^*(B) = µ^*(B ∩ A) + µ^*(B ∩ A^c) for all B ⊆ E. (12)

(Here and below we write A^c = E \ A for the absolute complement of A in E.) The proof is
finished by showing F is a σ-algebra containing A.
Recall that we defined Jordan measurable sets as those whose inner Jordan measure equals
their outer Jordan measure (see (4)). When µ is Borel or Lebesgue measure on E = R, there is
a similar characterization of F above: defining µ_*(A) = sup_{C⊆A} µ(C), where the sup is over
compact subsets C of A, a set A (of finite outer measure) is in F if and only if µ_*(A) = µ^*(A).
We will stick to the characterization (12) due to Carathéodory, since it is a more general
construction.

Exercise 3.18. Let µ and A be as in Theorem 3.17 and µ^* as in (11). Show that the restriction
of µ^* to A equals µ.
We now turn to the proof of Carathéodory's theorem. The first step will be to show µ^* is
countably subadditive.

Lemma 3.19. Let A_1, A_2, . . . ⊆ E and A ⊆ ∪_{k=1}^∞ A_k. Then

µ^*(A) ≤ ∑_{k=1}^∞ µ^*(A_k). (13)

Proof. Let ε > 0. We may assume µ^*(A_k) < ∞ for each k. For each k, pick B_{k1}, B_{k2}, . . . ∈ A
so that A_k ⊆ ∪_{ℓ=1}^∞ B_{kℓ} and

µ^*(A_k) ≥ −ε2^{−k} + ∑_{ℓ=1}^∞ µ(B_{kℓ}).

Then since A ⊆ ∪_{k,ℓ=1}^∞ B_{kℓ},

µ^*(A) ≤ ∑_{k=1}^∞ ∑_{ℓ=1}^∞ µ(B_{kℓ}) ≤ ε + ∑_{k=1}^∞ µ^*(A_k).

Since ε > 0 was arbitrary, we are done.

Exercise 3.20. Let A be as in Theorem 3.17 and F as in (12). Show that F ⊇ A.


Proof of Theorem 3.17. We will show that F is a σ-algebra and µ^* is a measure on (E, F). Then by
Exercises 3.18 and 3.20, we are done. Clearly, F contains E and is closed under complements.
Thus, it suffices to show F is closed under finite intersections and disjoint countable unions.
(See Exercise 2.2.) If A, B ∈ F and C ⊆ E, then

µ^*(C) = µ^*(C ∩ A) + µ^*(C ∩ A^c)
       = µ^*(C ∩ A ∩ B) + µ^*(C ∩ A ∩ B^c) + µ^*(C ∩ A^c)
       = µ^*(C ∩ A ∩ B) + µ^*(C ∩ (A ∩ B)^c ∩ A) + µ^*(C ∩ (A ∩ B)^c ∩ A^c)
       = µ^*(C ∩ (A ∩ B)) + µ^*(C ∩ (A ∩ B)^c).

This shows A ∩ B ∈ F, so F is closed under finite intersections and finite unions. We now show
countable additivity. Let A_1, A_2, . . . ∈ F be disjoint and set A = ∪_{k=1}^∞ A_k. For any B ⊆ E,

µ^*(B) = µ^*(B ∩ A_1) + µ^*(B ∩ A_1^c)
       = µ^*(B ∩ A_1) + µ^*(B ∩ A_1^c ∩ A_2) + µ^*(B ∩ A_1^c ∩ A_2^c)
       = µ^*(B ∩ A_1) + µ^*(B ∩ A_2) + µ^*(B ∩ A_1^c ∩ A_2^c).

Continuing on in this way,

µ^*(B) = ∑_{i=1}^k µ^*(B ∩ A_i) + µ^*(B ∩ A_1^c ∩ . . . ∩ A_k^c). (14)

By Lemma 3.19, µ^*(B ∩ A^c) ≤ µ^*(B ∩ A_1^c ∩ . . . ∩ A_k^c). Letting k → ∞ in (14) thus gives

µ^*(B) ≥ ∑_{k=1}^∞ µ^*(B ∩ A_k) + µ^*(B ∩ A^c) ≥ µ^*(B ∩ A) + µ^*(B ∩ A^c), (15)

where the last inequality used Lemma 3.19 again. The reverse inequality µ^*(B) ≤ µ^*(B ∩ A) +
µ^*(B ∩ A^c) holds by Lemma 3.19. Thus, µ^*(B) = µ^*(B ∩ A) + µ^*(B ∩ A^c), which shows A ∈ F.
Putting B = A in (15) now shows that

µ^*(A) = ∑_{k=1}^∞ µ^*(A_k),

as desired.

Notice that the main property of µ^* that we needed to prove the results above was countable
subadditivity. Of course, for a measure, countable subadditivity follows from countable additivity:
if (E, E, µ) is a measure space and A, A_1, A_2, . . . ∈ E with A ⊆ ∪_{k=1}^∞ A_k, then µ(A) ≤ ∑_{k=1}^∞ µ(A_k).

When E = R, µ is Borel measure, and A is the ring of finite unions of half-open intervals,
µ^* is called the Lebesgue outer measure. In this case, for a Lebesgue measurable set A (defined
in Section 3.3), we have λ(A) = µ^*(A), where λ is Lebesgue measure. Note that the difference
between Lebesgue and Jordan outer measure is that for Lebesgue outer measure, the infimum
in the definition is taken over countable instead of finite unions of half-open intervals. This turns
out to be a crucial difference!
out to be a crucial dierence!
When µ is Borel measure, what sets are in the F defined in (12)? The proofs above show that
F contains the Borel σ-algebra. Though we won't prove it, it turns out that F is exactly the
collection of Lebesgue measurable sets. The following exercise shows that, at least, F contains
all the Lebesgue measurable sets.

Exercise 3.21. Let E = R, µ Borel measure, and F as in (12). Show that null sets are in F.

Notice also that we have not used a Lebesgue inner measure to define measurability. However,
this can be done. Consider Lebesgue measure restricted to [0, 1]. Define the inner measure of
A ⊆ [0, 1] as µ_*(A) = 1 − µ^*([0, 1] \ A). Then it turns out that A ∈ F if and only if µ_*(A) = µ^*(A).
The proof involves a set approximation argument along the lines of the ones above, so we will
skip it.
The construction of outer measure in Theorem 3.17 is interesting in its own right. For later
use, we note that:

Proposition 3.22 (Lebesgue outer measure). If A is Lebesgue measurable, then

µ(A) = µ^*(A) := inf_{A ⊆ ∪_{k=1}^∞ I_k} ∑_{k=1}^∞ µ(I_k), (16)

where the infimum is over (half-open, open, or closed) intervals I_k.

4 Probability measures

Let (E, E, µ) be a measure space. If µ(E) = 1 then this is called a probability space. In this
setting, instead of (E, E, µ), it is standard to write (Ω, F, P), where now the ingredients are

• A set Ω of outcomes

• A collection of events, F
• A rule, P, for assigning probability to each event.

(Ω, F, P) is called a probability space. For now, we use the word event to refer to any subset of
Ω, though it is common to assume measurability implicitly.

For example, for a single coin flip, Ω = {heads, tails}, F = 2^Ω, and P({heads}) = P({tails}) =
1/2. For a single die roll, Ω = {1, 2, 3, 4, 5, 6}, F = 2^Ω, and P({i}) = 1/6 for i = 1, . . . , 6. Note
that in each of these examples the probability of each outcome is the same. This is what is
called the uniform probability measure (or uniform distribution).

Example 4.1. (Discrete uniform probability measure.) Ω = a set with n < ∞ elements,
F = 2^Ω, P({ω}) = 1/n for each ω ∈ Ω.
Example 4.2. (Continuous uniform probability measure.) Ω = [0, 1], F = {A ⊆ [0, 1] :
A is Borel}, P([a, b)) = b − a for 0 ≤ a ≤ b ≤ 1.
(Notice we used the Borel instead of Lebesgue σ -algebra; we will see why in Section 5.)
Consider now a coin tossing experiment with infinitely many tosses. The set of outcomes is
Ω = {0, 1}^N, where now 0, 1 represent heads and tails, respectively. That is, ω = (ω_1, ω_2, . . .) ∈ Ω
represents the outcome where the kth toss is ω_k ∈ {0, 1}. Here all single outcomes have
probability zero, but events can have positive probability. For example, E_k = {ω ∈ Ω : ω_k = 0},
the event that the kth toss is heads, has probability P(E_k) = 1/2.
Exercise 4.3. For the infinite coin tossing experiment, let A be the collection of all events that
depend only on finitely many tosses, together with ∅. Show that A is a ring.
We now sketch how to define a probability measure P in the infinite coin tossing example.
First, following Carathéodory's theorem, we define P on A. Let A_n be the set of events that
depend only on the first n tosses. More formally, E ∈ A_n if and only if there is E_n ⊆ {0, 1}^n
such that E = {ω ∈ Ω : (ω_1, . . . , ω_n) ∈ E_n}. Clearly, A = ∪_{n=1}^∞ A_n. Define P on A by

P(E) = 2^{−n} |E_n|, if E = {ω ∈ Ω : (ω_1, . . . , ω_n) ∈ E_n} ∈ A_n. (17)

Since the A_n's are not disjoint (in fact A_1 ⊆ A_2 ⊆ . . .), we should check that P is well-defined
on A. We leave it as an exercise to check that P is well-defined and finitely additive on A. Following
the arguments in Section 3, we can then show that P extends to a unique measure on F = σ(A),
which is easily seen to be a probability measure.
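As a quick illustration of (17) (our own sketch, not part of the notes), the probability of an event E ∈ A_n given by a predicate on the first n tosses can be computed by enumerating E_n ⊆ {0, 1}^n:

    from itertools import product

    def prob(n, pred):
        # P(E) = 2^{-n} |E_n| for E = {omega : pred((omega_1, ..., omega_n))}.
        E_n = [w for w in product((0, 1), repeat=n) if pred(w)]
        return len(E_n) / 2**n

    # The event that the 3rd toss is heads (= 0), viewed in A_3 and in A_5:
    print(prob(3, lambda w: w[2] == 0))   # 0.5
    print(prob(5, lambda w: w[2] == 0))   # 0.5, same value: P is well-defined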

Exercise 4.4. Let P be as in (17) and A as in Exercise 4.3. Show that P is well-defined and
finitely additive on A.

Existence and uniqueness of P is a special case of Kolmogorov's extension theorem (Theorem 8.9 below).
It turns out that F = σ(A) is the right σ-algebra. Indeed, all reasonable
subsets of Ω are in σ(A). The problem with trying to measure all subsets of Ω, which we
also faced while constructing Borel measure, is that the space Ω is uncountable.¹ (Intuitively, it is
difficult to use countable additivity to measure all subsets of an uncountable space.)

Exercise 4.5. Consider the infinite coin tossing experiment, and let E be the event that infinitely
many of the tosses are heads. Show that E ∈ F = σ(A), where A is defined as in Exercise 4.3.
(See also Section 4.2 below.)

4.1 Independence

In this section, events will refer only to measurable subsets of Ω, that is, elements of F. One
important element of probability theory that is missing from measure theory is the concept
of independence. Consider a dice rolling experiment with two rolls. Here Ω = {1, 2, . . . , 6}^2,
F = 2^Ω, and P is uniform on Ω. Consider the events E_1 = {first roll is odd} and E_2 =
{second roll is ≤ 2}. By counting the number of ways both E_1, E_2 can occur, and dividing by
the 36 total possibilities, we get P(E_1 ∩ E_2) = (3·2)/36 = (3/6)·(2/6) = P(E_1)P(E_2).

Definition 4.6. Events A, B are independent if

P(A ∩ B) = P(A)P(B). (18)

A collection {A_i}_{i∈I} of events is independent if for each finite subset J of I,

P(∩_{j∈J} A_j) = ∏_{j∈J} P(A_j). (19)

If A, B are events in a finite outcome space Ω that "have nothing to do with each other,"
equation (18) can be obtained from a simple counting argument. Note that events with
probability 0 or 1 are automatically independent of all other events. Another way to understand
independence is as follows. Assume P(B) > 0 and rearrange (18) as

P(A) = P(A ∩ B)/P(B). (20)

Define another probability measure Q on (Ω, F) by

Q(C) = P(C ∩ B)/P(B), C ∈ F.

Then Q is the probability measure on events that assumes B has occurred, and equation (20),
which says Q(A) = P(A), shows that the occurrence of B has no effect on A. The RHS of (20) is
the conditional probability of A given B:
¹ For a specific example of a non-measurable set in this context, see https://arxiv.org/pdf/math/0610705v2.pdf

Definition 4.7. Let A, B be events with P(B) > 0. The conditional probability of A given B is

P(A|B) := P(A ∩ B)/P(B). (21)

As an example, consider two die rolls. Let B be the event that the first roll is odd, and A
the event that the rolls add to 5. Consider first P(A|B). Assuming B occurred, we know that
the first roll is either 1, 3 or 5, each with the same probability 1/3. If the first roll is 1 or
3, the sum of the rolls is 5 only when the second roll is 4 or 2, respectively, both of which have
probability 1/6. If the first roll is 5 there is no way the sum is 5. Thus,

P(A|B) = (1/3)·(1/6) + (1/3)·(1/6) + (1/3)·0 = 1/9.

On the other hand, consider P(A ∩ B). There are only two outcomes, namely (1, 4) and (3, 2),
for which the first roll is odd and the rolls add to 5. Thus, P(A ∩ B) = 2/36 = 1/18. Since
P(B) = 1/2, we have verified equation (21).
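A brute-force check of this computation (our own sketch), enumerating the 36 equally likely outcomes:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))    # 36 outcomes, uniform
    B = [w for w in outcomes if w[0] % 2 == 1]         # first roll odd
    AB = [w for w in B if w[0] + w[1] == 5]            # ... and rolls add to 5

    print(len(AB) / 36)                      # P(A and B) = 2/36 = 1/18
    print((len(AB) / 36) / (len(B) / 36))    # P(A|B) = 1/9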
Note that if A, B are independent and P(B) > 0, then P(A|B) = P(A). Conversely, if
P(A|B) = P(A), then A, B are independent. It is important to note that pairwise independence
does not imply independence: (19) may not hold even if (18) holds for each pair A_i, A_j with i ≠ j ∈ I.

Definition 4.8. σ-algebras F_1, F_2 over Ω are called independent if E_1 and E_2 are independent
whenever E_1 ∈ F_1 and E_2 ∈ F_2.

One application of Dynkin's lemma (Lemma 3.12) is the following:

Exercise 4.9. Let A and A′ be π-systems over Ω contained in F such that

P(E ∩ E′) = P(E)P(E′) (22)

whenever E ∈ A and E′ ∈ A′. Show that σ(A) and σ(A′) are independent.


Consider the infinite coin tossing experiment above. Let A be the collection of all events
that depend only on a finite number of tosses, together with ∅. Let E be the event that infinitely
many of the tosses are heads, and define A′ = {E, ∅}. Intuitively, A and A′ must satisfy the
condition (22), since the event that infinitely many tosses are heads cannot depend on finitely
many tosses. This means F = σ(A) and σ(A′) are independent. However, F is the collection
of all (measurable) events! How can all events be independent of the event E? Note that, in
particular, E is independent of itself, so P(E ∩ E) = P(E)². That is, P(E) must be zero or one!
This result is a special case of Kolmogorov's zero-one law. (In this case, P(E) = 1, as we will
see below.)

4.2 Borel-Cantelli lemmas

Consider again the infinite coin tossing example. Let E_k be the event that the kth toss is heads
and E the event that infinitely many tosses are heads. (See Exercise 4.5.) Consider the event

∪_{k≥n} E_k.

This is the event that some toss after the (n − 1)st toss is heads. Now consider the event

∩_{n=1}^∞ ∪_{k≥n} E_k.

This is the event that, after every toss, some future toss is heads. In other words,

E := {infinitely many tosses are heads} = ∩_{n=1}^∞ ∪_{k≥n} E_k.

This leads us to the next definition.

Definition 4.10. Let E_1, E_2, . . . ∈ F. We define

lim sup_{n→∞} E_n = ∩_{n=1}^∞ ∪_{k≥n} E_k

and

lim inf_{n→∞} E_n = ∪_{n=1}^∞ ∩_{k≥n} E_k.

The above arguments are easily generalized to show that lim sup_{n→∞} E_n is the event that
infinitely many of the E_n occur. For this reason, we sometimes write

lim sup_{n→∞} E_n = {E_n infinitely often}.

Similar arguments show that lim inf_{n→∞} E_n is the event that every E_n occurs past some finite
n. For this reason we sometimes write

lim inf_{n→∞} E_n = {E_n eventually}.

Exercise 4.11. Show that {E_n infinitely often}^c = {E_n^c eventually}.


Consider again the coin toss example with E_n the event that the nth toss is heads. Let
ω = (0, 1, 0, 1, 0, 1, . . .) and ω′ = (0, 0, 0, 0, . . .). Then ω, ω′ ∈ {E_n infinitely often} and
ω′ ∈ {E_n eventually}, but ω ∉ {E_n eventually}.
Recall the definition of increasing and decreasing sequences of sets (Definition 3.5). A
sequence of sets is monotone if it is either increasing or decreasing.

Exercise 4.12. Show that if E_n is a monotone sequence, then lim sup_{n→∞} E_n = lim inf_{n→∞} E_n.
Lemma 4.13. (Borel-Cantelli.) If E_1, E_2, . . . are independent events with ∑_{n=1}^∞ P(E_n) = ∞, then

P({E_n infinitely often}) = 1.

Proof. By Exercise 4.11, it suffices to show that P({E_n^c eventually}) = 0. We have

P(∩_{k≥n} E_k^c) = ∏_{k≥n} P(E_k^c)
              = ∏_{k≥n} (1 − P(E_k))
              ≤ ∏_{k≥n} exp(−P(E_k))
              = exp(−∑_{k≥n} P(E_k)) = 0,

where we used independence and 1 − x ≤ e^{−x}. Thus, from countable subadditivity,

P({E_n^c eventually}) = P(∪_{n=1}^∞ ∩_{k≥n} E_k^c) = 0.

Exercise 4.14. Consider the infinite coin tossing example and let E be the event that infinitely
many tosses are heads. Use the Borel-Cantelli lemma to show that P(E) = 1.

Exercise 4.15. Prove the following result, also due to Borel-Cantelli. If E_1, E_2, . . . are events
such that ∑_{n=1}^∞ P(E_n) < ∞, then P({E_n infinitely often}) = 0.

Note that the result of Exercise 4.15 applies to finite measures (i.e. P(Ω) = 1 isn't needed).
Consider a game where at step n you lose 2^n dollars with probability 1/(2^n + 1) or win 1
dollar with probability 2^n/(2^n + 1). At each step, your average or expected winnings are

−2^n · 1/(2^n + 1) + 1 · 2^n/(2^n + 1) = 0

(see Section 6.1 below). Let E_n be the event that you lose in the nth step. Assuming these
events are independent, Exercise 4.15 (note ∑_n 1/(2^n + 1) < ∞) shows that, with probability
one, you lose in only finitely many steps. Thus, you can continue playing to win infinite money,
even though the games are fair in the sense that, on average, you break even at each step!
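A simulation sketch of this game (our own illustration; the horizon and seed are arbitrary). Since the loss probabilities are summable, typical runs show only a few losses, all at small n:

    import random

    random.seed(1)
    losses = [n for n in range(1, 40)
              if random.random() < 1 / (2**n + 1)]   # lose 2^n dollars at step n
    print(losses)   # typically a short list of small n; losses become rare fast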

5 Measurable functions

A measurable function is a map between measurable spaces that respects measurability. This
means that the preimage of a measurable set should be measurable:

Definition 5.1. Let (E, E) and (F, F) be two measurable spaces. A function g : E → F is
measurable if g^{−1}(A) ∈ E for each A ∈ F.

It turns out that the condition in Definition 5.1 can be replaced by a slightly weaker one: if
A is a collection of subsets of F that generates F, then it is enough to check measurability of
preimages of elements of A.
Proposition 5.2. Let (E, E) and (F, F) be two measurable spaces and suppose F = σ(A) for
some collection A of subsets of F. Then g : E → F is measurable if and only if g^{−1}(A) ∈ E for
each A ∈ A.

Proof. The "only if" is clear, so we prove the "if" part. Suppose g^{−1}(A) ∈ E for each A ∈ A,
and consider D = {A ⊆ F : g^{−1}(A) ∈ E}. We will show D is a σ-algebra; since D ⊇ A by
assumption, it follows that D ⊇ σ(A) = F and so g is measurable. Suppose A_1, A_2, . . . ∈ D.
Then g^{−1}(A_k) ∈ E for each k, so

g^{−1}(∪_{k=1}^∞ A_k) = ∪_{k=1}^∞ g^{−1}(A_k) ∈ E,

which shows ∪_{k=1}^∞ A_k ∈ D. Now suppose A ∈ D. Then g^{−1}(A) ∈ E, so

g^{−1}(F \ A) = E \ g^{−1}(A) ∈ E,

which shows that F \ A ∈ D. This completes the proof.

Whenever E (resp. F) is a Borel subset of R, unless we specify otherwise, the corresponding
σ-algebra is E = {A Borel : A ⊆ E} (resp. F = {A Borel : A ⊆ F}), the Borel σ-algebra
restricted to E (resp. F). We will often consider the case where E ⊆ R is a Borel set and
E = {A ⊆ E : A Borel}. In this case a measurable function g : E → R is sometimes called a
Borel function. When (Ω, F, P) is a probability space, a measurable function X : Ω → R is
called a random variable.

Let E = F = R and E = F = the Borel σ-algebra. Recall that for a continuous function,
preimages of open sets are open. Since the open sets generate the Borel σ-algebra, Proposition 5.2
implies that continuous functions g : E → F are measurable. Another important class
of measurable functions are the indicator functions 1_A of Borel sets A (defined as 1_A(x) = 1
if x ∈ A and 0 otherwise). Their importance will become clear below, when we show that
measurability of functions is closed under sums, products and limits (Proposition 5.5). This
will allow us to demonstrate measurability of relatively complicated functions by writing them
as limits of sums of indicator functions (Theorem 5.7).

Exercise 5.3. Let (E, E) be a measurable space. Use Proposition 5.2 to show that g : E → R
is measurable if and only if {x ∈ E : g(x) < a} ∈ E for each a ∈ R.

It is not hard to see that the < in Exercise 5.3 can be replaced by >, ≤, or ≥.

Exercise 5.4. Let E ⊆ R be a Borel set and f : E → R an increasing function. Show that f is
a Borel function.
Proposition 5.5. Let (E, E) be a measurable space. The set G of measurable functions g : E →
R is closed under finite addition, finite multiplication, and lim inf and lim sup of sequences.

Proof. To see closure under addition, for f, g ∈ G write

{x ∈ E : (f + g)(x) < a} = ∪_{r∈Q} ({x ∈ E : f(x) < r} ∩ {x ∈ E : g(x) < a − r}) ∈ E.

Next, note that f² ∈ G whenever f ∈ G, since (for a > 0; for a ≤ 0 the set below is empty)

{x ∈ E : f²(x) < a} = {x ∈ E : f(x) < √a} ∩ {x ∈ E : f(x) > −√a} ∈ E.

Similarly, cf ∈ G whenever f ∈ G and c ∈ R. Since 4fg = (f + g)² − (f − g)², we conclude G is
closed under finite multiplication. We leave closure under lim sup and lim inf as an exercise
(Exercise 5.6).

Exercise 5.6. Finish the proof of Proposition 5.5 by showing that G is closed under lim sup and
lim inf of sequences.

Of course, closure under lim inf and lim sup implies closure under limits, when they exist.
The collection of Borel functions g : R → R is also closed under composition. Actually, this
is the reason for taking F as the Borel σ-algebra in Example 4.2: Borel functions of random
variables should be measurable.

Let 1 denote the function identically equal to 1, and recall 1_A is equal to 1 on A and zero
elsewhere. We write g_n ↗ g to indicate g_n : E → R is a nondecreasing sequence of functions
with pointwise limit g. That is, g_1(x) ≤ g_2(x) ≤ . . . and lim_{n→∞} g_n(x) = g(x) for each x ∈ E.

Theorem 5.7. (Monotone class theorem.) Let (E, E) be a measurable space and A a π-system
such that E = σ(A). Let G be a collection of bounded functions g : E → R that is closed under
finite addition and scalar multiplication. If in addition

(i) 1 ∈ G and 1_A ∈ G for all A ∈ A,
(ii) if g_n ∈ G with 0 ≤ g_n ↗ g for a bounded function g, then g ∈ G,

then G contains every bounded measurable function.


Proof. First we show the indicator functions of measurable sets are in G . Let D = {A ⊆ E :
1A ∈ G}. Then D is a d-system containing A, so by Dynkin's lemma, D ⊇ σ(A). Let g : E → R
be bounded, nonnegative and measurable, and let M = 1 + sup g . Consider the functions
gn : E → R dened by

2n −1
MX   
k k k+1
gn (x) = 1A , An := x ∈ E : g(x) ∈ , .
2n n 2n 2n
k=0

Note that each An is measurable, so gn ∈ G by closure of G under nite addition and scalar
multiplication. We leave it as an exercise to show that g1 , g2 , . . . is nonnegative and increasing
with limn→∞ gn = g . Thus, g ∈ G. We assumed g is nonnegative, but any bounded measurable
function is the dierence of two nonnegative bounded measurable functions.

Examples of applications of the monotone class theorem include proofs of the change of variables
and Fubini theorems for integration. In those examples, E = R (change of variables) or E = R²
(Fubini's theorem), A is the collection of intervals (change of variables) or rectangles (Fubini's
theorem), and G is the collection of bounded functions for which change of variables or Fubini's
theorem holds. Closure of G under finite addition and scalar multiplication comes from linearity
of integrals; (i) is easy to check, since it involves integrating indicator functions of intervals or
rectangles; and (ii) follows from the monotone convergence theorem (Theorem 6.3 below).

5.1 Images of measures and probability distribution functions

An image measure is a measure induced by a measurable function.

Definition 5.8. Let (E, E, µ) be a measure space and (F, F) a measurable space. A measurable
function g : E → F induces an image measure ν on (F, F) defined by

ν(A) = µ(g^{−1}(A)), A ∈ F.

Recall that in a probability space, a real-valued measurable function is called a random variable:

Definition 5.9. Let (Ω, F, P) be a probability space. A random variable is a measurable function
X : Ω → R.

In this case, the image measure induced by X is called the distribution of X. In this setting
it is standard to write P(X ∈ A) instead of P(X^{−1}(A)). Similarly, when A = (−∞, x] we write
P(X ≤ x) instead of P(X^{−1}((−∞, x])).
Definition 5.10. The cumulative distribution function (CDF) of a random variable X is

F_X(x) = P(X ≤ x), x ∈ R.

The CDF of a random variable completely determines its distribution, and vice versa. For
this reason, random variables are usually defined simply by specifying a CDF, without ever
referring to an underlying measurable space (Ω, F). (It is possible that two random variables
have the same distribution but are defined on different probability spaces. We will consider such
random variables equivalent.)

It turns out that a function F : R → R is the CDF of a random variable if and only if F
is right continuous and nondecreasing with lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. (See
Theorem 5.16 below.)

Exercise 5.11. Let µ be the uniform measure on Ω = [0, 1] from Example 4.2. Fix λ > 0 and
define X : Ω → R by X(ω) = −λ^{−1} log(1 − ω). Find the CDF of X.

The distribution in Exercise 5.11 is called the exponential distribution (with rate λ).
In many cases a random variable X takes only nonnegative integer values. In this
setting it is more convenient to consider the probability mass function of X. In this case we
write P(X = x) instead of P(X^{−1}({x})).

Definition 5.12. Let (Ω, F, P) be a probability space and X : Ω → N a random variable. The
probability mass function (PMF) of X is

f_X(x) = P(X = x), x ∈ N.

The simplest example of a nonnegative integer-valued random variable is the Bernoulli-p
random variable:

Example 5.13. Fix p ∈ (0, 1). A Bernoulli-p random variable has PMF

f_X(x) = p if x = 1, f_X(x) = 1 − p if x = 0,

and f_X(x) = 0 if x ∉ {0, 1}.


Note that a Bernoulli random variable with p = 1/2 has the same PMF as an unbiased coin
toss. For p ∈ (0, 1), we can think of a Bernoulli random variable as a (possibly biased) coin toss.

Exercise 5.14. Consider the infinite coin toss experiment from Section 4, but let the coin be
biased so that the probability of tails and heads is p and 1 − p, respectively. Let X be the number
of tosses before the first heads. Find the PMF of X.
The distribution in Exercise 5.14 is called the geometric distribution.

Exercise 5.15. Let X have either a geometric or an exponential distribution. Show that

P(X > t + s | X > s) = P(X > t)

for any s, t ∈ N if X is geometric, or any s, t ∈ (0, ∞) if X is exponential. (See Definition 4.7.)


What functions F are CDFs of random variables? Note that if F(x) = P(X ≤ x) for
a random variable X, then P(X ∈ (a, b]) = F(b) − F(a). Thus, we consider a related but
more general question: for which functions F does there exist a measure dF on R such that
dF((a, b]) = F(b) − F(a)?

Theorem 5.16. Let F : R → R be nonconstant, nondecreasing and right continuous. Then
there exists a unique measure dF on R (with the Borel σ-algebra) such that

dF((a, b]) = F(b) − F(a).

The proof of Theorem 5.16 will use the following lemma.

Lemma 5.17. Let F : R → R be nonconstant, nondecreasing and right continuous. Set I =
(lim_{x→−∞} F(x), lim_{x→∞} F(x)) and define g : I → R by g(x) = inf{z ∈ R : F(z) ≥ x}. Then
for all x ∈ I and y ∈ R,

g(x) ≤ y ⇐⇒ x ≤ F(y).

Moreover, g is a Borel function.

Proof. For x ∈ I define I_x = {z ∈ R : F(z) ≥ x}. Suppose x ∈ I and y ∈ R are such
that g(x) ≤ y. Then for any n ∈ N, y + 1/n is not a lower bound of I_x, so we may pick
y_n ∈ [y, y + 1/n) such that y_n ∈ I_x. Thus F(y_n) ≥ x for each n, and using right continuity of F,
we get F(y) = lim_{n→∞} F(y_n) ≥ x. Now suppose x ∈ I and y ∈ R are such that g(x) > y. Then
y is strictly smaller than the greatest lower bound of I_x, so y ∉ I_x, meaning F(y) < x. Finally,
since {x ∈ I : g(x) ≤ y} = {x ∈ I : x ≤ F(y)} = I ∩ (−∞, F(y)], g is a Borel function.

Proof of Theorem 5.16. Let F and g be as in Lemma 5.17. Let µ be Borel measure, and consider
the image measure defined by dF(A) = µ(g^{−1}(A)) for Borel sets A. Then

dF((a, b]) = µ({x ∈ I : a < g(x) ≤ b})
           = µ({x ∈ I : F(a) < x ≤ F(b)})
           = µ((F(a), F(b)]) = F(b) − F(a).

Uniqueness of dF follows from the arguments in the proof of uniqueness of Borel measure.
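The map g in Lemma 5.17 is the basis of inverse transform sampling. Here is a sketch (our own illustration), using the exponential CDF F(z) = 1 − e^{−λz} from Exercise 5.11, for which g has a closed form:

    import math, random

    LAM = 2.0   # rate parameter lambda (arbitrary choice)

    def g(x):
        # Generalized inverse of F(z) = 1 - exp(-LAM*z): g(x) = inf{z : F(z) >= x}.
        return -math.log(1.0 - x) / LAM

    random.seed(0)
    samples = [g(random.random()) for _ in range(100_000)]
    # Empirical check of dF((a, b]) = F(b) - F(a) on (a, b] = (0.1, 0.5]:
    frac = sum(0.1 < s <= 0.5 for s in samples) / len(samples)
    print(frac, math.exp(-LAM * 0.1) - math.exp(-LAM * 0.5))   # both approx 0.451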

Exercise 5.18. Let F be the Heaviside step function: F(x) = 0 for x < 0, F(x) = 1 for x ≥ 0.
For Borel sets A ⊆ R, define dF by dF(A) = 1 if 0 ∈ A, dF(A) = 0 if 0 ∉ A. Check that dF is
a measure and dF((a, b]) = F(b) − F(a). (dF is called the Dirac delta distribution.)
Let E be the Borel σ-algebra. A nonzero measure µ on (R, E) with the property that
µ(K) < ∞ whenever K is compact is called a Radon measure.

Exercise 5.19. Show that any Radon measure can be obtained from a function F as in Theorem 5.16.
That is, for any Radon measure µ, there is a function F satisfying the hypotheses of
Theorem 5.16 such that µ((a, b]) = F(b) − F(a).
We conclude this section by defining independence for random variables.

Definition 5.20. Random variables {X_i}_{i∈I} are independent if

P(∩_{j∈J} {X_j ∈ A_j}) = ∏_{j∈J} P(X_j ∈ A_j)

for each finite subcollection J of I and Borel sets A_j, j ∈ J.

It is not hard to see that {X_i}_{i∈I} are independent if

P(∩_{j∈J} {X_j ≤ x_j}) = ∏_{j∈J} P(X_j ≤ x_j)

for each finite subcollection J of I and real numbers {x_j}_{j∈J}. A random variable X generates
a σ-algebra σ(X) := {X^{−1}(A) : A Borel}, and it is also not hard to see that X_1, X_2 are
independent if and only if σ(X_1), σ(X_2) are independent (see Definition 4.8).

Example 5.21. In the infinite coin toss, define random variables X_n(ω) = ω_n, Y(ω) = ω_1 +
. . . + ω_5 and Z(ω) = ω_6ω_7. Then X_1, X_2, . . . are independent, and Y, Z are independent.

5.2 Rademacher functions

Let (Ω, F, P) correspond to the uniform probability measure on Ω = (0, 1] (see Example 4.2).
Note that each ω ∈ Ω has a binary expansion

ω = 0.ω_1ω_2 . . . ,

where we forbid infinite sequences of 0's to make the expansion unique. Define random variables
R_n by R_n(ω) = ω_n. These are called Rademacher functions. The Rademacher functions
R_1, R_2, . . . are independent Bernoulli-1/2 random variables (see Example 5.13). Thus, the
Rademacher functions provide an alternative formulation of the infinite coin toss example!
Actually, we can use these functions to construct a much wider class of sequences of independent
random variables. Consider a sequence of random variables X_n with CDFs F_n. More precisely,
and without losing generality, the F_n are functions satisfying the hypotheses of Theorem 5.16 such
that lim_{x→∞} F_n(x) = 1 and lim_{x→−∞} F_n(x) = 0 for each n. The next theorem shows how to
define a sequence of independent random variables with these CDFs on the same probability
space.

Theorem 5.22. Let (Ω, F, P) correspond to the uniform probability measure on Ω = (0, 1], and
let F_n : R → R be a sequence of cumulative distribution functions as described above. Then there
exists a sequence X_1, X_2, . . . of independent random variables X_n : Ω → R such that F_{X_n} = F_n
for all n.

Proof. Let b : N² → N be any bijection and define Y_{n,m} = R_{b(n,m)} and

Y_n = ∑_{m=1}^∞ 2^{−m} Y_{n,m}.

It is not hard to see that the Y_n's are independent with common CDF P(Y_n ≤ x) = x. Define
g_n(x) = inf{z ∈ R : F_n(z) ≥ x}. By the proof of Theorem 5.16, X_n = g_n(Y_n) is a random
variable with CDF F_n. Since the Y_n's are independent, so are the X_n's (we leave this as an
exercise).
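A sketch of the Rademacher functions (our own illustration): extract binary digits of ω ∈ (0, 1] and check, by Monte Carlo under the uniform measure, that R_1 and R_2 behave like independent fair coin tosses:

    import random

    def R(n, omega):
        # n-th binary digit of omega: the n-th Rademacher function.
        return int(omega * 2**n) % 2

    random.seed(0)
    draws = [random.random() for _ in range(100_000)]
    p1 = sum(R(1, w) for w in draws) / len(draws)
    p2 = sum(R(2, w) for w in draws) / len(draws)
    p12 = sum(R(1, w) * R(2, w) for w in draws) / len(draws)
    print(p1, p2, p12)   # approx 0.5, 0.5, 0.25, consistent with independence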

6 Integration

Throughout this section, (E, E, µ) is a measure space. Let f : E → R be a measurable function.
For brevity, we often say a function f is measurable without explicit mention of E, E or µ. The
integral of f with respect to µ, if it exists, will be written

µ(f) := ∫_E f dµ ≡ ∫_E f(x) µ(dx).

To define µ(f), we first consider simple functions:

Definition 6.1. A simple function is a function of the form

f = ∑_{k=1}^n c_k 1_{A_k}, (23)

where c_k ≥ 0 and A_k ∈ E for each k.

For the simple function f in equation (23), we define

µ(f) = ∑_{k=1}^n c_k µ(A_k).

(The representation of a simple function is not unique, but it is easy to check this is well-defined.)
For a nonnegative measurable function f : E → R, set

µ(f) = sup{µ(g) : g ≤ f, g simple}. (24)

For an arbitrary measurable function f : E → R, let

µ(f) = µ(f^+) − µ(f^−)

whenever µ(f^+) < ∞ and µ(f^−) < ∞, where

f^+ = f ∨ 0 := max{f, 0}, f^− = −f ∨ 0 := max{−f, 0}.

When µ(f^+) < ∞ and µ(f^−) < ∞, we say f is integrable with respect to µ. If f ≥ 0 is
measurable but not integrable, we adopt the convention that µ(f) = ∞. It is easy to check that
|µ(f)| ≤ µ(|f|) and that f is integrable if and only if µ(|f|) < ∞.
We say that f = 0 a.e. (almost everywhere) if µ({x : f(x) ≠ 0}) = 0. In the context of
probability, sometimes a.s. is written instead of a.e., meaning almost surely. More generally,
we say that a property holds a.e. or a.s. if it holds everywhere except on a null set.
Exercise 6.2. Show that (1) µ acts linearly on simple functions; (2) if f, g are measurable
with 0 ≤ f ≤ g, then µ(f) ≤ µ(g); and (3) if f is simple, then f = 0 a.e. if and
only if µ(f) = 0.
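For instance, with µ Borel measure on R and the simple function f = 2·1_{[0,1)} + 3·1_{[1,4)},

µ(f) = 2µ([0, 1)) + 3µ([1, 4)) = 2·1 + 3·3 = 11,

and any other representation of the same f yields the same value.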
When µ is Borel or Lebesgue measure on a subset E of R, µ(f) is called the Lebesgue
integral. It is easy to see that some functions that are not Riemann integrable are still Lebesgue
integrable. For example, let E = [0, 1], let µ be Borel or Lebesgue measure, and f(x) = 1 if
x ∉ Q, f(x) = 0 if x ∈ Q. Then f = 1_{[0,1]\Q} is Lebesgue integrable with µ(f) = µ([0, 1]\Q) = 1.
However, all upper Darboux sums for f are ≥ 1 while all lower Darboux sums are 0 (see
definitions at the end of this section). Thus, f is not Riemann integrable. Later (Exercise 6.13)
we will see that all Riemann integrable functions are Lebesgue integrable.
Since the integral in (24) is defined as a supremum over simple functions, it is often easy to
get lower bounds for µ(f) by considering convenient simple functions. For the reverse inequality,
we'll show that if simple functions increase to f ≥ 0, then their integrals increase to the integral
of f. This is a consequence of the monotone convergence theorem (Theorem 6.3 below). The
monotone convergence theorem gives conditions under which a limit can be brought inside
an integral; see also Fatou's lemma (Theorem 6.8) and the dominated convergence theorem
(Theorem 6.9) below.

We write x_n ↗ x if x_1, x_2, . . . is a nondecreasing sequence of real numbers that converges to
x. For functions f, f_n : E → R, we write f_n ↗ f if f_n(x) ↗ f(x) for every x ∈ E. When the
nondecreasing condition is relaxed, we write x_n → x and f_n → f.

Theorem 6.3 (Monotone convergence theorem). Let f ≥ 0 be measurable and suppose f_n ≥ 0
are measurable functions such that f_n ↗ f. Then µ(f_n) ↗ µ(f).

Proof. We divide the proof into steps, showing that the result holds when:

1. f and f_n are indicator functions
2. f_n are simple and f is an indicator function
3. f_n and f are simple
4. f_n are simple and f ≥ 0 is measurable
5. f_n ≥ 0 and f ≥ 0 are measurable.

1. Suppose f = 1_A and f_n = 1_{A_n}. Then f_n ↗ f implies A_n is increasing with ∪_{n=1}^∞ A_n = A.
By continuity from below, µ(1_A) = µ(A) = lim_{n→∞} µ(A_n) = lim_{n→∞} µ(1_{A_n}), which shows
µ(f_n) ↗ µ(f).

2. Suppose f_n are simple and f = 1_A. Fix ε > 0 and let A_n = {x : f_n(x) > 1 − ε}. Then
(1 − ε)1_{A_n} ≤ f_n ≤ f, which shows (1 − ε)µ(A_n) ≤ µ(f_n) ≤ µ(f). Since 1_{A_n} ↗ 1_A, by part 1, we
get µ(A_n) ↗ µ(A). By the squeeze theorem (letting ε → 0), we conclude µ(f_n) ↗ µ(f).

3. Suppose f_n and f are simple, and write f = ∑_{k=1}^m c_k 1_{A_k}, where c_k > 0 and the A_k's are
disjoint. For each k, f_n 1_{A_k} ↗ c_k 1_{A_k}. Since f_n 1_{A_k} is simple, by part 2 (applied to f_n 1_{A_k}/c_k),
µ(f_n 1_{A_k}) ↗ c_k µ(A_k). Since f_n = ∑_{k=1}^m f_n 1_{A_k}, we get
µ(f_n) = ∑_{k=1}^m µ(f_n 1_{A_k}) ↗ ∑_{k=1}^m c_k µ(A_k) = µ(f).

4. Suppose f_n are simple and f ≥ 0 is measurable. Let g be simple with g ≤ f. Then
f_n ∧ g := min{f_n, g} ↗ g, so by part 3, µ(f_n ∧ g) ↗ µ(g). Since µ(f_n) ≥ µ(f_n ∧ g) and g ≤ f
was arbitrary, recalling the definition (24), we conclude µ(f_n) ↗ µ(f).

5. Suppose f_n ≥ 0 and f ≥ 0 are measurable. Define

g_n = ∑_{k=1}^{n2^n −1} (k/2^n) 1_{A_k} + n 1_{B_n},

where

A_k = {x : k/2^n ≤ f_n(x) < (k + 1)/2^n}, B_n = {x : f_n(x) ≥ n}.

Observe that g_n ≤ f_n ≤ f and, since f_n ↗ f, by construction of g_n we get g_n ↗ f. By part 4,
this means µ(g_n) ↗ µ(f), and it follows that µ(f_n) ↗ µ(f).

Monotone convergence can be rephrased so that it looks like a countable additivity condition
(or perhaps a continuity from below condition, which is closely related) for integrals.

Corollary 6.4. Let f_n be a sequence of nonnegative measurable functions. Then

µ(∑_{n=1}^∞ f_n) = ∑_{n=1}^∞ µ(f_n).
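Indeed, this follows by applying monotone convergence to the partial sums g_N = ∑_{n=1}^N f_n, which satisfy g_N ↗ ∑_{n=1}^∞ f_n, and using linearity (Proposition 6.6 below):

µ(∑_{n=1}^∞ f_n) = lim_{N→∞} µ(g_N) = lim_{N→∞} ∑_{n=1}^N µ(f_n) = ∑_{n=1}^∞ µ(f_n).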

Exercise 6.5. Give an example to show that the assumption fn is nondecreasing is needed in
the monotone convergence theorem. Give another example to show that the assumption fn ≥ 0
is needed.
Monotone convergence leads to the following reassuring result.
Proposition 6.6. Let f, g ≥ 0 be measurable. Then µ(af + bg) = aµ(f ) + bµ(g) for all a, b ≥ 0.
Moreover, f = 0 a.e. if and only if µ(f ) = 0.
Proof. Let fn , gn be simple functions such that fn % f and gn % g . (We have seen how to
construct such functions in step 5 in the proof of Theorem 6.3 above; see also the proof of
Theorem 5.7.) Since a, b ≥ 0, we have afn + bgn % af + bg . Since µ is linear on simple functions
(Exercise 6.2), we have µ(afn + bgn ) = aµ(fn ) + bµ(gn ). Letting n → ∞ and using monotone
convergence gives the linearity result. Now if f = 0 a.e., then since 0 ≤ fn ≤ f , we must have
fn = 0 a.e. for each n. By Exercise 6.2 it follows that µ(fn ) = 0 for each n, and monotone
convergence implies µ(f ) = 0. Conversely, suppose µ(f ) = 0. Then since 0 ≤ fn ≤ f , we must
have µ(fn ) = 0 for each n. By Exercise 6.2, fn = 0 a.e. for each n. We conclude f = 0 a.e.
Exercise 6.7. Generalize Proposition 6.6 by showing that µ(af + bg) = aµ(f ) + bµ(g) holds for
integrable functions f, g and any a, b ∈ R. Show also that for integrable f , if f = 0 a.e. then
µ(f ) = 0, but the converse is not true.
The monotone convergence theorem leads to Fatou's lemma and the dominated convergence
theorem below.
Theorem 6.8 (Fatou's lemma). Let fn ≥ 0 be measurable. Then

µ(lim inf_{n→∞} fn) ≤ lim inf_{n→∞} µ(fn).
Proof. Note that inf k≥n fk ≤ fm for all m ≥ n. Thus, µ(inf k≥n fk ) ≤ µ(fm ) for all m ≥ n,
and so µ(inf k≥n fk ) ≤ inf k≥n µ(fk ) ≤ lim inf n→∞ µ(fn ). Since inf k≥n fk % lim inf n→∞ fn , the
monotone convergence theorem gives the result.
Theorem 6.9 (Dominated convergence theorem). Let fn be measurable such that fn → f,
where f is measurable. Suppose there is an integrable function g such that |fn| ≤ g for all n.
Then fn, f are integrable and µ(fn) → µ(f).
Proof. fn is integrable since |fn| ≤ g. Since |fn| ≤ g for all n, we have |f| ≤ g and f is
integrable. Since g + fn and g − fn are nonnegative, we may apply Fatou's lemma to get (see
Exercise 6.7)

µ(g) + µ(f) = µ(lim inf_{n→∞} (g + fn)) ≤ lim inf_{n→∞} µ(g + fn) = µ(g) + lim inf_{n→∞} µ(fn)
µ(g) − µ(f) = µ(lim inf_{n→∞} (g − fn)) ≤ lim inf_{n→∞} µ(g − fn) = µ(g) − lim sup_{n→∞} µ(fn).

Cancelling the µ(g) terms and combining the above equations,

µ(f) ≤ lim inf_{n→∞} µ(fn) ≤ lim sup_{n→∞} µ(fn) ≤ µ(f).
Note that in a probability space, since P(Ω) = 1 < ∞, the dominated convergence theorem
applies whenever the fn are uniformly bounded in n (take g to be a constant).
Exercise 6.10. Give an example to show the assumption fn ≥ 0 is needed in Fatou's lemma.
Give another example to show that the assumption g is integrable is needed in the dominated
convergence theorem.
Exercise 6.11. Prove a generalization of the dominated convergence theorem where the assump-
tion |fn | ≤ g with g integrable is replaced by |fn | ≤ gn where gn → g and gn , g are integrable with
µ(gn ) → µ(g). Use this to prove that if fn , f are integrable with fn → f and µ(|fn |) → µ(|f |),
then µ(|fn − f |) → 0.
When E = [a, b], µ is Lebesgue measure, and f : E → R is integrable, we write

µ(f) ≡ ∫_a^b f(x) dx.

When ∫_a^b |f(x)| dx < ∞, we say f is Lebesgue integrable.
Theorem 6.12 (Fundamental theorem of calculus). (i) Let f : [a, b] → R be Lebesgue integrable
and define F(x) = ∫_a^x f(t) dt for x ∈ [a, b]. If f is continuous at c ∈ [a, b], then F is differentiable
at c and F′(c) = f(c). (ii) Suppose F : [a, b] → R has a Lebesgue integrable derivative f.
Then F(b) − F(a) = ∫_a^b f(x) dx.
Part (ii) of the theorem can fail even if F admits a Lebesgue integrable derivative at almost
every point. For instance, the Cantor function c has c′ = 0 almost everywhere, and c(1) − c(0) =
1 ≠ 0 = ∫_0^1 c′(x) dx. The reason is that the Cantor function fails to be absolutely continuous
with respect to Lebesgue measure. Part (ii) will be generalized to arbitrary measures, assuming
absolute continuity, in the Radon-Nikodym theorem (Theorem 9.4 below).
We end this section by connecting the Lebesgue integral and the Riemann integral. We
briefly review the definition of the Riemann integral. A partition P of [a, b] is a collection x0, . . . , xn
such that a = x0 < x1 < . . . < xn = b. For a bounded function f : [a, b] → R define the upper and
lower Darboux sums U(P, f) and L(P, f) by

U(P, f) = Σ_{k=1}^n (xk − xk−1) Mk,  Mk = sup_{xk−1 ≤ x ≤ xk} f(x),
L(P, f) = Σ_{k=1}^n (xk − xk−1) mk,  mk = inf_{xk−1 ≤ x ≤ xk} f(x).
Then f is Riemann integrable if it is bounded and

inf_P U(P, f) = sup_P L(P, f)

where the sup and inf are over all partitions P. When f is Riemann integrable, its integral is
the common value inf_P U(P, f) = sup_P L(P, f). We leave it as an exercise to show that the
Lebesgue integral generalizes the Riemann integral.
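Since continuous functions attain their sups and infs approximately on fine sample grids, Darboux sums can be explored numerically. The following Python sketch (the cell sampling is an assumption that is only reasonable for continuous f) shows U(P, f) and L(P, f) squeezing toward 1/3 for f(x) = x^2 as the partition of [0, 1] is refined, in contrast with f = 1_{[0,1]\Q} above:

```python
# Approximate Darboux sums for a continuous f by sampling each cell
# (this sampling only approximates M_k and m_k; it is reasonable for
# continuous f). Here f(x) = x^2 on [0, 1]; both sums approach 1/3.
def darboux_sums(f, a, b, n, samples=100):
    h = (b - a) / n
    U = L = 0.0
    for k in range(n):
        vals = [f(a + k * h + j * h / samples) for j in range(samples + 1)]
        U += h * max(vals)          # ~ (x_k - x_{k-1}) * M_k
        L += h * min(vals)          # ~ (x_k - x_{k-1}) * m_k
    return U, L

for n in (10, 100, 1000):
    print(n, darboux_sums(lambda x: x * x, 0.0, 1.0, n))
# For f = 1_{[0,1]\Q}, by contrast, every cell has sup 1 and inf 0, so
# U(P, f) = 1 and L(P, f) = 0 no matter how fine the partition is.
```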
Exercise 6.13. Show that if f : [a, b] → R is Riemann integrable, then it is Lebesgue integrable,
and the two integrals are the same.
6.1 Expectation

Let (Ω, F, P) be a probability space and X : Ω → R a random variable. In this setting the
integral of X is written

E(X) = ∫ X dP = ∫ X(ω) P(dω)

and called the expected value or expectation of X. Explicitly, if X = Σ_{k=1}^n ck 1_{Ak} is a simple
random variable (meaning Ak ∈ F and ck ≥ 0), then

E(X) = Σ_{k=1}^n ck P(Ak),

and the extension of E(X) to integrable X is exactly the same as in the last section.
E(X) is interpreted as the average value of X with respect to the probabilities assigned by
P. For instance, consider the two die roll experiment and let X be the sum of the two rolls. The
average value of X is

2 · (1/36) + 3 · (2/36) + 4 · (3/36) + 5 · (4/36) + . . .

where we count each value of the sum times the probability of that value. This idea is formalized
and generalized in the following exercises.
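For the two-dice example this average can be computed exactly from the definition of the expectation of a simple random variable; a small Python check, using exact rational arithmetic, gives 7:

```python
from fractions import Fraction

# E(X) for the sum X of two fair dice, straight from the definition of
# expectation of a simple random variable: sum of value * probability.
E = sum(Fraction(1, 36) * (i + j) for i in range(1, 7) for j in range(1, 7))
print(E)   # 7
```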
Exercise 6.14. Let X be a random variable with values in N, and let f be its PMF. Show that
E(X) = Σ_{k=1}^∞ k f(k). Use this to find the expected value of a geometric random variable (see
Exercise 5.14).
When X : Ω → R is a random variable and g : R → R is a Borel function, then g ◦ X is
another random variable, usually written g(X).
Exercise 6.15. Let X be a random variable and F its CDF. Let g : R → R be Borel. Show that if
g(X) is integrable, E(g(X)) = dF(g) ≡ ∫ g(x) dF(dx), where dF is defined as in Theorem 5.16.
Exercise 6.16. In Exercise 6.15, let F be differentiable with Lebesgue integrable derivative f.
Show that then E(g(X)) = ∫ g(x)f(x) dx. In particular, if X is integrable, E(X) = ∫ xf(x) dx.
Use this to find the expected value of an exponential random variable (see Exercise 5.11).
When X has a CDF F with derivative f = F′, we call f the probability density function
(PDF) of X. Exercise 6.16 shows that if X has a PDF f, then E(X) = ∫ xf(x) dx. Thinking
of f(x) dx as the probability that X is near x, we can think of E(X) = ∫ xf(x) dx as a way to
say the expected value of X is the integral of the values of X multiplied by their probability
densities. Note that this is an analogue of the result in Exercise 6.14.
We end this section by connecting expectation with independence (Definition 5.20).
Proposition 6.17. Suppose X, Y : Ω → R are independent random variables, and f, g : R → R
are Borel functions. Then

E(f(X)g(Y)) = E(f(X))E(g(Y)) (25)

whenever f(X)g(Y), f(X), and g(Y) are all integrable.
Proof. Fix f = 1A where A is Borel. Then for measurable g = 1B, we have E(f(X)g(Y)) =
P({X ∈ A} ∩ {Y ∈ B}) = P(X ∈ A)P(Y ∈ B) = E(f(X))E(g(Y)) (see Definition 5.20). So (25)
holds for indicator functions g. By linearity of the integral, it also holds for simple functions g.
Now let g ≥ 0 be measurable, and pick gn simple such that gn % g. Then by the monotone
convergence theorem (Theorem 6.3), (25) holds for g. Now fix g ≥ 0 and repeat the same
arguments with f to see that (25) holds whenever f, g ≥ 0 are measurable. To finish the proof,
write f = f⁺ − f⁻ and g = g⁺ − g⁻ and use linearity of expectation.
7 Conditional expectation

Let (Ω, F, P) be a probability space and A, B ∈ F events. Recall the conditional probability
(Definition 4.7) of A given B is (if P(B) > 0)

P(A|B) = P(A ∩ B) / P(B).
We now generalize this idea to the expected value of a random variable, given some other
information (for instance about another random variable). The information will be encoded by
a sub-σ -algebra of F.
Definition 7.1. Let X be a random variable. The σ-algebra generated by X is
σ(X) := {X⁻¹(A) : A ⊆ R Borel}.
If Xi , i ∈ I are random variables, the σ-algebra generated by the Xi is
σ({Xi}_{i∈I}) := σ({Xi⁻¹(A) : A ⊆ R Borel, i ∈ I}).
Definition 7.2. Let G ⊆ F be a σ-algebra over Ω. A random variable X : Ω → R is
G-measurable if X⁻¹(A) ∈ G for each Borel set A ⊆ R.
The Borel sets A in the last two definitions can be replaced by intervals (−∞, a]. Note that
in Definition 7.2 every G-measurable random variable is (F-)measurable, but not conversely.
Exercise 7.3. Let Ω = (0, 1] and P the uniform probability measure. Let G1 = {∅, Ω},
G2 = {∅, Ω, (0, 1/2], (1/2, 1]}. Identify which random variables are Gi -measurable for i = 1, 2.
Interpreting P as the infinite coin toss probability measure, say which coin toss random variables
are Gi -measurable.
Definition 7.4. Let (Ω, F, P) be a probability space, X : Ω → R an integrable random variable,
and G ⊆ F a σ-algebra over Ω. The conditional expectation of X given G, denoted Y = E(X|G),
is defined to be the random variable Y such that (i) Y is G-measurable and (ii) E(Y 1A) =
E(X 1A) for all A ∈ G.
Existence of conditional expectation will be proved later using the Radon-Nikodym theorem
(Theorem 9.4 below). At the moment we only prove uniqueness.
Proof of uniqueness of conditional expectation. Let Y and Y′ be two G-measurable random variables
with E(Y 1A) = E(X 1A) = E(Y′ 1A) for all A ∈ G. Define Bn = {ω ∈ Ω : Y(ω) − Y′(ω) > 1/n}. Then
Bn ∈ G and so

0 = E((Y − Y′)1Bn) ≥ (1/n) P(Bn) ≥ 0.

Thus P(Bn) = 0 for all n. Now P(Y − Y′ > 0) = P(∪_{n=1}^∞ Bn) ≤ Σ_{n=1}^∞ P(Bn) = 0. By symmetry
we conclude P(Y − Y′ < 0) = 0 and so P(Y = Y′) = 1.
In the context of conditional expectation, a σ-algebra represents some information about a
random variable or collection of random variables, and conditional expectation represents the
expected value of a random variable, given the information from the σ-algebra. As an example,
consider the infinite coin toss experiment where Ω = {0, 1}^N and the probability of heads is
1/2. Let Gn be the σ-algebra generated by the sum of the first n tosses. Define X : Ω → R
by X(ω) = ω1 + . . . + ωm where m > n, and consider the conditional expectation
of X given Gn. The σ-algebra represents the information from the sum of the first n tosses.
Informally, Y = E(X|Gn) is the expected value of X given that we know the outcome of the
sum of the first n tosses. This should be Y(ω) := ω1 + . . . + ωn + (m − n)/2. A proof can be
found in Exercise 7.6 below.
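This formula can also be checked empirically. The following Python sketch (a Monte Carlo estimate with arbitrarily chosen n = 3, m = 10, and sample size; not a proof) groups simulated outcomes by the sum of the first n tosses and compares the empirical conditional averages of X with ω1 + . . . + ωn + (m − n)/2:

```python
import random

# Monte Carlo check (not a proof) of E(X | G_n) = omega_1 + ... + omega_n
# + (m - n)/2 for fair coin tosses, with X = omega_1 + ... + omega_m.
# The choices n = 3, m = 10 and the sample size are arbitrary assumptions.
n, m, trials = 3, 10, 200_000
totals = {}                                   # keyed by sum of first n tosses
for _ in range(trials):
    omega = [random.randint(0, 1) for _ in range(m)]
    s = sum(omega[:n])
    t, c = totals.get(s, (0, 0))
    totals[s] = (t + sum(omega), c + 1)

for s, (t, c) in sorted(totals.items()):
    print(s, round(t / c, 3), s + (m - n) / 2)   # empirical vs predicted
```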
The next exercise shows that conditional expectation generalizes conditional probability:

Exercise 7.5. Let A, B be events, X = 1A, and G = {∅, Ω, B, B^c}. Let Y = E(X|G). Show
that Y(ω) = P(A|B) if ω ∈ B and Y(ω) = P(A|B^c) if ω ∈ B^c. (See Definition 4.7.)
The next exercise makes the discussion about the infinite coin toss above rigorous:

Exercise 7.6. Let X and Y be integrable independent random variables and G = σ(Y ). Show
that E(X + Y |G) = E(X) + Y . (Hint: use Proposition 6.17.)
We leave it as an exercise to show that conditional expectation is linear, i.e. E(aX + b|G) =
aE(X|G) + b for a, b ∈ R, and monotonic, i.e. if X ≤ Y a.s. then E(X|G) ≤ E(Y|G).
7.1 Martingales

Consider independent random variables Xi, i = 1, 2, . . . with P(Xi = −1) = P(Xi = 1) = 1/2.
Let Sn = X1 + . . . + Xn. Then Sn is called a simple random walk: starting at zero, at each step
you move one unit right or left with equal probability. Define Fn = σ(X1, . . . , Xn). Note that
Sn+1 − Sn = Xn+1 is independent of X1, . . . , Xn. By arguments similar to Exercise 7.6,

E(Sn+1 |Fn) = Sn.
This says the expected position of the walk at step n + 1, knowing all information from the
first n steps of the walk, is the position of the walk at step n. Informally, this follows from the
simple fact that we're equally likely to move right or left at the (n + 1)th step.

Definition 7.7. Let F0 ⊆ F1 ⊆ . . . be an increasing sequence of σ-algebras. A sequence
S0, S1, . . . of integrable random variables is a martingale with respect to Fn if for each n ≥ 0, Sn is
Fn-measurable and

E(Sn+1 |Fn) = Sn. (26)
The theory of martingales arose from mathematical analysis of betting strategies. In that
context, Sn = X1 + . . . + Xn is usually a bettor's cumulative earnings after n plays of some game,
with Xi the earnings in the ith game; Fn = σ(X1, . . . , Xn) is the information from the first n
plays; and equation (26) means that the game is fair, i.e., the expected cumulative earnings by
the (n + 1)th play, given the information from the first n plays, is just the cumulative earnings
from the first n plays. The simple random walk can be understood in this context: starting
with 0 dollars, in each game, you win or lose 1 dollar with equal probability.
Let Xi, i = 1, 2, . . . be independent with P(Xi = 2^{i−1}) = P(Xi = −2^{i−1}) = 1/2, and define
Sn = X1 + . . . + Xn and Fn = σ(X1, . . . , Xn). Note that Sn is a martingale (the argument is
the same as for the simple random walk). Think of Sn as the cumulative dollar earnings after
n games. Since Sn is a martingale the games are fair. Suppose we stop playing the games as
soon as Sn > 0, which happens on the first winning play n. (This n is finite a.s.; why?) Then
we win −(2^0 + . . . + 2^{n−2}) + 2^{n−1} = 1 dollar, even though all the games were fair! This is the
so-called double or nothing strategy. (For another martingale paradox, see the example at the
end of Section 4.2 above.)
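The double-or-nothing strategy is easy to simulate. In the Python sketch below (the helper play_until_first_win is ours, not from the notes), the profit is exactly 1 dollar on every run, with only the stopping time random:

```python
import random

# Simulating the double-or-nothing strategy: stake 2^{i-1} dollars on the
# i-th fair game and stop at the first win. The profit is always exactly
# 1 dollar; only the (a.s. finite) stopping time is random.
def play_until_first_win():
    earnings, i = 0, 1
    while True:
        stake = 2 ** (i - 1)
        earnings += stake if random.random() < 0.5 else -stake
        if earnings > 0:
            return earnings, i      # (profit, number of games played)
        i += 1

print([play_until_first_win() for _ in range(5)])
# e.g. [(1, 3), (1, 1), (1, 6), (1, 1), (1, 2)]
```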
Exercise 7.8. Show that if G ⊆ H, then E(E(X|H)|G) = E(X|G).
Exercise 7.9. Use Exercise 7.8 to show that if Sn is a martingale with respect to Fn , then
E(Sn |F0 ) = S0 and so E(Sn ) = E(S0 ) for every n. (Hint: Note that E(X) = E(X|{∅, Ω})).

We conclude this section by giving a famous example of a martingale, due to Polya. Consider
an urn consisting initially of a single red ball and a single green ball. At each game, a ball is chosen
uniformly at random from the urn and then returned along with another ball of the same color.
Let Xn be the number of red balls after n games, Sn = Xn/(n + 2) the proportion of red balls
after n games, and Fn = σ(X1, . . . , Xn) the information from the first n games. Then Sn is a
martingale with respect to Fn.
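The martingale property of the Polya urn reduces to a one-step identity: given Xn = k red balls out of n + 2, the expected next proportion is again k/(n + 2). The following Python sketch verifies this identity with exact rational arithmetic for small n and k:

```python
from fractions import Fraction

# Exact one-step check of the Polya urn martingale property: if X_n = k
# red balls out of n + 2 total, the expected next proportion is k/(n+2).
def one_step_mean(n, k):
    total = n + 2
    p_red = Fraction(k, total)      # probability of drawing red
    return (p_red * Fraction(k + 1, total + 1)
            + (1 - p_red) * Fraction(k, total + 1))

for n in range(4):
    for k in range(1, n + 2):       # after n games there are 1..n+1 red balls
        assert one_step_mean(n, k) == Fraction(k, n + 2)
print("E(S_{n+1} | X_n = k) = k/(n+2) verified for small n and k")
```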
8 Product measures and Fubini's Theorem

Definition 8.1. Let E1 and E2 be σ-algebras over E1 and E2, respectively. The product
σ-algebra of E1 and E2 is

E1 ⊗ E2 = σ({A1 × A2 : Ai ∈ Ei, i = 1, 2}).
The definition extends to finite products of σ-algebras in the obvious way. When Ei = E for
i = 1, . . . , n, we write ⊗_{i=1}^n Ei ≡ E^{⊗n} for the product σ-algebra.
Consider two die rolls, where E1 = E2 = {1, . . . , 6}, E1 = E2 = 2^{{1,...,6}}, and Ω = E1 × E2.
It is easy to check that F := 2^Ω = E1 ⊗ E2. Not all elements of F have the form A1 × A2 where
Ai ∈ Ei; for instance {(1, 2), (2, 1)} cannot be written in this form. As another example, take
E1 = E2 = R and Ei the Borel σ-algebra over R. Then Exercise 2.6 shows that E1 ⊗ E2 is the
Borel σ-algebra over R^2.

Throughout this section we assume (E1, E1, µ1) and (E2, E2, µ2) are finite measure spaces,
that is, µi(Ei) < ∞ for i = 1, 2.

Proposition 8.2. Suppose f : E1 × E2 → R is E1 ⊗ E2-measurable. Then for fixed x1 ∈ E1, the
map x2 ↦ f(x1, x2) is E2-measurable.
Proof. Let G be the collection of E1 ⊗ E2-measurable functions for which the conclusion of
Proposition 8.2 holds. Note that A = {A1 × A2 : Ai ∈ Ei, i = 1, 2} is a π-system that generates
E1 ⊗ E2. Clearly 1 ∈ G and 1A ∈ G whenever A ∈ A. By Proposition 5.5, G is closed under
addition and scalar multiplication. Let fn ∈ G be such that 0 ≤ fn % f. Then by Proposition 5.5
again, f ∈ G. By the monotone class theorem (Theorem 5.7), G contains all bounded measurable
functions. If f ≥ 0, then fn = min{n, f} ∈ G and fn % f, so by Proposition 5.5, f ∈ G. The
proof is completed by writing f = f⁺ − f⁻.
Exercise 8.3. Let f : E1 × E2 → R be bounded and E1 ⊗ E2-measurable. Show that
x1 ↦ ∫_{E2} f(x1, x2) µ2(dx2) is bounded and E1-measurable.
Exercise 8.4. Show that there is a measure µ on E1 × E2 such that µ(A1 × A2) = µ1(A1)µ2(A2)
for all A1 ∈ E1 and A2 ∈ E2. (Hint: for existence, define

µ(A) = ∫_{E1} ( ∫_{E2} 1A(x1, x2) µ2(dx2) ) µ1(dx1) (27)

for A ∈ E1 ⊗ E2.)
By Proposition 3.13, the measure in Exercise 8.4 is unique. It is called the product measure:

Definition 8.5. Let (E1, E1, µ1) and (E2, E2, µ2) be measure spaces. The product measure µ of
µ1 and µ2, written µ = µ1 ⊗ µ2, is the unique measure on (E1 × E2, E1 ⊗ E2) satisfying

µ(A1 × A2) = µ1(A1)µ2(A2)

for all A1 ∈ E1 and A2 ∈ E2.
Consider again two die rolls, where E1 = E2 = {1, . . . , 6}, E1 = E2 = 2^{{1,...,6}}. Let Pi be
uniform on (Ei, Ei). Then the product measure P = P1 ⊗ P2 is the unique measure under which
the die rolls are independent. More precisely, for 1 ≤ i ≤ n, let (Ωi, Fi, Pi) be probability
spaces and Xi : Ωi → R random variables. Let Ω = Ω1 × . . . × Ωn, F = F1 ⊗ . . . ⊗ Fn, and
P = P1 ⊗ . . . ⊗ Pn. Define X̂i : Ω → R by X̂i(ω) = Xi(ωi). Then the X̂i's are independent
random variables with respect to P, and they have the same distribution under P as the Xi's have
under the Pi. This is one way of constructing finite sequences of independent random variables.
Recall the ingenious construction in Theorem 5.22 above of infinite independent sequences of
random variables using Rademacher functions. An alternative construction can be done with
infinite product measures, the construction of which relies on Kolmogorov's extension theorem,
Theorem 8.9 below.
We are now ready to prove Fubini's theorem.

Theorem 8.6 (Fubini). Let f : E1 × E2 → R be E1 ⊗ E2-measurable and nonnegative. Then

µ(f) = ∫_{E1} ( ∫_{E2} f(x1, x2) µ2(dx2) ) µ1(dx1) = ∫_{E2} ( ∫_{E1} f(x1, x2) µ1(dx1) ) µ2(dx2). (28)

The conclusion remains true if f is E1 ⊗ E2-measurable and integrable.

Proof. The hint in Exercise 8.4 shows that the conclusion of Fubini's theorem holds for indicator
functions 1A, A = A1 × A2, Ai ∈ Ei. The second equality in (28) comes from the symmetry
in the choice of ordering. Note that this symmetry may fail if the measures are not finite; see
Exercise 8.7. By linearity of integrals (Exercise 6.7), Fubini's theorem holds for simple functions.
By monotone convergence (Theorem 6.3), it holds for nonnegative functions. We leave the rest
of the proof as an exercise.
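For continuous integrands the equality of the two iterated integrals in (28) can be tested numerically. The Python sketch below (a midpoint-rule approximation, an assumption suitable only for well-behaved f) computes both orders of integration for f(x, y) = xy^2 on [0, 1]^2:

```python
# Midpoint-rule check of (28) for f(x, y) = x * y**2 on [0, 1]^2, where
# both marginal measures are Lebesgue measure restricted to [0, 1]
# (finite, so the hypotheses of Fubini's theorem hold).
N = 2000
dx = 1.0 / N
grid = [(k + 0.5) * dx for k in range(N)]

dy_first = sum(sum(x * y * y for y in grid) * dx for x in grid) * dx
dx_first = sum(sum(x * y * y for x in grid) * dx for y in grid) * dx
print(dy_first, dx_first)   # both ~ (1/2)*(1/3) = 1/6
```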
Exercise 8.7. Let Ei = [0, 1] and Ei the Borel σ-algebra. Let µ1 be Lebesgue measure and µ2
counting measure, i.e., µ2(A) = |A| ∈ N ∪ {∞} for A Borel. Show that µ2 is a measure and that
Fubini's theorem fails for f = 1A where A = {(x, y) ∈ [0, 1]^2 : y = x}.
Fubini's theorem can be extended to the case where µ1, µ2 are σ-finite measures. A measure
µ on (E, E) is σ-finite if E = ∪_{n=1}^∞ En for En ∈ E such that µ(En) < ∞ for each n. For instance,
Borel and Lebesgue measures on R^n are σ-finite, but the counting measure in Exercise 8.7 is
not σ-finite.

8.1 Infinite products and Markov chains

Definition 8.8. Let Ei be σ-algebras over Ei for i = 1, 2, . . .. The product σ-algebra is

⊗_{i=1}^∞ Ei = σ({A1 × . . . × An × En+1 × En+2 × . . . : n ∈ N, Ai ∈ Ei, i = 1, . . . , n}). (29)

If Ei ≡ E for each i, we write ⊗_{i=1}^∞ Ei = E^{⊗N}.
The sets generating the σ-algebra in (29) are sometimes called cylinder sets.
For example, consider the infinite coin toss, where Fi ≡ F = 2^{{0,1}} are σ-algebras over
Ωi ≡ Ω = {0, 1}. In this case ⊗_{i=1}^∞ Fi ≡ F^{⊗N} is generated by the collection of events that
depend on only finitely many tosses.
Theorem 8.9 (Kolmogorov extension theorem). Let F be the Borel σ-algebra over R. Suppose
Pn, n = 1, 2, . . . are probability measures on (R^n, F^{⊗n}) such that for each n

Pn+1((a1, b1] × . . . × (an, bn] × R) = Pn((a1, b1] × . . . × (an, bn]). (30)

Then there is a unique probability measure P on (R^N, F^{⊗N}) such that

P((a1, b1] × . . . × (an, bn] × R × R × . . .) = Pn((a1, b1] × . . . × (an, bn]). (31)

Proof. Let A be the ring of finite unions of sets of the form (a1, b1] × . . . × (an, bn] × R × R × . . ..
Define P on sets of the form (a1, b1] × . . . × (an, bn] × R × R × . . . by using (31), and then extend
the definition to A by enforcing finite additivity. Equation (30) together with additivity of the
Pn ensures that this works. To show existence, by Theorem 3.4, it suffices to prove that P is
countably additive on A. This can be done by following the arguments in the proof of existence
of Borel measure (Theorem 2.9, bottom of page 6). Uniqueness follows from Dynkin's lemma
(see Proposition 3.13).
In Kolmogorov's extension theorem, (R, F) can be replaced by any standard probability space,
that is, a probability space isomorphic to an interval with Lebesgue measure, a finite or countably
infinite set, or a union of an interval with a finite or countably infinite set. (Here an isomorphism
is a bijective measure-preserving map.) For instance, a finite or countably infinite space E,
together with its natural σ-algebra 2^E, is a standard probability space. In the latter case the
sets (a1, b1] × . . . × (an, bn] × R × R × . . . in Kolmogorov's theorem are replaced by {i1} × . . . ×
{in} × E × E × . . . where ik ∈ E, k = 1, . . . , n.
In particular, Kolmogorov's theorem provides a construction of the probability measure of
the infinite coin toss. This is the canonical way to construct the infinite coin toss probability
measure, and in general, any probability measure corresponding to sequences of independent
random variables (see the section on Rademacher functions for an alternative construction):
Exercise 8.10. Let Fn , n = 1, 2, . . . be CDFs. Use Theorem 8.9 to construct a probability space
(Ω, F, P) and
random variables Xn : Ω → R such that X1 , X2 , . . . are independent and Xn has
CDF Fn .
More generally, Kolmogorov's theorem can be used to construct Markov chains. Recall the
simple random walk in Section 7.1 above: starting at zero, at each step you add or subtract
1 with equal probability. That is, Xn = Z0 + . . . + Zn where the Zi are independent random
variables with P(Zi = 1) = P(Zi = −1) = 1/2. More generally, a random walk has the form
Xn = Z0 + . . . + Zn, where the step sizes Zi are independent random variables. Markov chains
are generalizations of random walks, where the step sizes are random variables that are allowed
to depend on the current position of the walk.

Definition 8.11. A sequence Xn, n = 0, 1, . . . of random variables is a Markov chain if

E(1{Xn+1 ∈ A} |σ(X0, . . . , Xn)) = E(1{Xn+1 ∈ A} |σ(Xn)) (32)

for each Borel A ⊆ R. Equation (32) is called the Markov property.
Every random walk is a Markov chain, but not conversely.
Note that in Definition 8.11 we have not referred to the underlying probability space; usually
it is not necessary. When the Xn's take uncountably many values, the natural underlying
probability space is (R^N, E^{⊗N}, P), with E the Borel σ-algebra. When the Xn's take values
only in a finite or countable set E, it is more convenient to consider the probability space
(E^N, (2^E)^{⊗N}, P). In both settings the Xn's are defined by Xn(ω) = ωn. Existence of P such
that (32) holds is shown (in the case where E is countable) in Exercise 8.15.
Equation (32) shows that Xn+1 depends on X0, . . . , Xn only through Xn. Usually, we think
of n as time and Xn as position after n time steps. Informally, the next position of a Markov
chain depends only on its current position. When Xn has values in a finite or countably infinite
set E ⊆ R, the Markov property (32) can be rewritten as follows:
Definition 8.12. A sequence Xn, n = 0, 1, . . . of random variables with values only in a countable
set E is a Markov chain if for each j, i, i0, . . . , in−1 ∈ E,

P(Xn+1 = j|X0 = i0, . . . , Xn−1 = in−1, Xn = i) = P(Xn+1 = j|Xn = i). (33)
(See Exercise 7.5 for the connection between conditional probability and conditional expectation
that leads to Definition 8.12 above.) In equation (33),

P^{(n+1)}_{ij} := P(Xn+1 = j|Xn = i)

are called the transition matrices of Xn. They are characterized by the following properties:

Definition 8.13. A transition matrix on a countable set E is a matrix (Pij)_{i,j∈E} such that
0 ≤ Pij ≤ 1 for all i, j ∈ E and Σ_{j∈E} Pij = 1 for each i ∈ E.
Example 8.14. The transition matrix

P =
[  0   1/2  1/2 ]
[ 1/3   0   2/3 ]
[  0    0    1  ]

defines a Markov chain X0, X1, . . . such that, if Xn = 1, then Xn+1 = 2 or 3 each with probability 1/2,
if Xn = 2, then Xn+1 = 1 or 3 with probabilities 1/3 and 2/3, respectively, and if Xn = 3, then Xn+1 = 3.
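A Markov chain is straightforward to simulate from its transition matrix. The Python sketch below (the dictionary encoding of P is just one convenient choice) samples a path of the chain in Example 8.14:

```python
import random

# Sampling a path of the chain in Example 8.14. Row i gives the
# distribution of X_{n+1} given X_n = i (states 1, 2, 3).
P = {1: [(2, 0.5), (3, 0.5)],
     2: [(1, 1 / 3), (3, 2 / 3)],
     3: [(3, 1.0)]}

def step(i):
    u, acc = random.random(), 0.0
    for j, p in P[i]:
        acc += p
        if u < acc:
            return j
    return P[i][-1][0]

path, state = [1], 1
for _ in range(10):
    state = step(state)
    path.append(state)
print(path)   # the path is eventually absorbed at state 3
```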
If (33) holds, it is straightforward to check that

P^{(1)}_{i0 i1} · · · P^{(n)}_{in−1 in} = P(Xn = in, Xn−1 = in−1, . . . , X1 = i1 |X0 = i0)

and

(P^{(1)} · · · P^{(n)})_{ij} = P(Xn = j|X0 = i).

Thus, the transition matrices of a Markov chain completely determine its evolution.
We are often interested in the case where all the P^{(n)}'s are the same and equal P. In this
case we call P the transition matrix of the Markov chain.
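When all the P^{(n)} equal P, the identity above says (P^n)_{ij} = P(Xn = j|X0 = i), which can be computed by repeated matrix multiplication. The sketch below (plain Python, no libraries) does this for the matrix of Example 8.14:

```python
# n-step transition probabilities via matrix powers: (P^n)_{ij} is
# P(X_n = j | X_0 = i) for the chain of Example 8.14.
P = [[0.0, 0.5, 0.5],
     [1 / 3, 0.0, 2 / 3],
     [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

Pn = P
for _ in range(19):                  # compute P^20
    Pn = matmul(Pn, P)
print([round(x, 4) for x in Pn[0]])  # mass concentrates on absorbing state 3
```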

Exercise 8.15. Let P^{(n)}, n = 1, 2, . . . be transition matrices on a countable set E. Use
Kolmogorov's theorem to construct a Markov chain Xn with transition matrices P^{(n)}. More
precisely, use Kolmogorov's extension theorem to show there exists a measure P on (E^N, (2^E)^{⊗N})
and random variables Xn : E^N → E which satisfy (33) with P(Xn+1 = j|Xn = i) = P^{(n+1)}_{ij}.
From Kolmogorov's theorem, the probability measure P in Exercise 8.15 is unique once we
assign a value (or more generally a distribution) to X0. This is called the canonical construction
of Xn.
Exercise 8.16. Give an example of (i) a Markov chain that is not a random walk, (ii) a Markov
chain that is not a martingale, (iii) a martingale that is not a Markov chain.
9 Radon-Nikodym theorem

In this section we state the Radon-Nikodym theorem (Theorem 9.4), which can be seen as a
generalization of the fundamental theorem of calculus. Before stating the theorem we need the
following definition.

Definition 9.1. Let µ and ν be measures on (E, E). We say µ is absolutely continuous with
respect to ν, written µ ≪ ν, if for all A ∈ E, ν(A) = 0 implies µ(A) = 0.

For example, let ν be Lebesgue measure and δ the delta distribution, that is, δ(A) = 1 if
0 ∈ A and δ(A) = 0 otherwise. Then ν({0}) = 0 but δ({0}) = 1 so δ ̸≪ ν.

Exercise 9.2. Let µ be Lebesgue measure, δ be as above, and ν = µ + δ. Show µ ≪ ν but ν ̸≪ µ.
An alternate definition of absolute continuity is given by the following result. Recall a
measure µ on (E, E) is finite if µ(E) < ∞, and σ-finite if there exist A1, A2, . . . ∈ E such that
E = ∪_{n=1}^∞ An and µ(An) < ∞ for each n.

Lemma 9.3. Let µ be a finite measure and ν a σ-finite measure on (E, E). Then µ ≪ ν ⇐⇒

for each ε > 0, there exists δ > 0 such that for all A ∈ E, ν(A) < δ ⇒ µ(A) < ε. (34)

Proof. We begin with the "if" part. Suppose (34) holds. Let A ∈ E with µ(A) > 0. Pick δ > 0
which satisfies (34) for this A and ε = µ(A). Then ν(A) ≥ δ > 0. Consider now the "only
if" part. Suppose (34) is false. We want to show µ ̸≪ ν. Pick ε > 0 and An ∈ E such that
ν(An) < 2^{−n} but µ(An) ≥ ε for all n. Consider A = ∩_{n=1}^∞ ∪_{k≥n} Ak. By Exercise 4.15, ν(A) = 0,
and by continuity from above, µ(A) ≥ ε > 0.

Theorem 9.4 (Radon-Nikodym). Let ν be a finite measure and µ a σ-finite measure on (E, E)
such that ν ≪ µ. Then there is a µ-a.e. unique measurable f : E → [0, ∞) such that ν(A) =
∫_A f dµ ≡ µ(f 1A). In this case we write f = dν/dµ.

We call f = dν/dµ the Radon-Nikodym derivative of ν with respect to µ. Theorem 9.4 will
be stated more generally in the next section. The proof is postponed to Section 10.

Exercise 9.5. Let µ and ν be as in Exercise 9.2. Find the Radon-Nikodym derivative dµ/dν.
To see how Theorem 9.4 generalizes the fundamental theorem of calculus, we need to define
absolute continuity for functions.

Definition 9.6. Let I ⊆ R be an interval. A function F : I → R is absolutely continuous if for
each ε > 0, there is δ > 0 such that for any finite collection (x1, y1), . . . , (xn, yn) ⊆ I of disjoint
intervals, Σ_{k=1}^n (yk − xk) < δ implies Σ_{k=1}^n |F(yk) − F(xk)| < ε.
Using open intervals in the above definition is not essential; we could instead take disjoint
closed or half open intervals. By taking n = 1 in the definition above, we see that absolutely
continuous functions are uniformly continuous. The converse is not true, with one counterexample
being the Cantor function.
We use the same word absolute continuity for both functions and measures because of the
following result.
Lemma 9.7. Let µ be a finite measure. Then µ is absolutely continuous with respect to Lebesgue
measure if and only if F(x) := µ((−∞, x]) is absolutely continuous.

Proof. Let (x1, y1], . . . , (xn, yn] be disjoint intervals, set A = ∪_{k=1}^n (xk, yk], and let ν be Lebesgue
measure. Then Σ_{k=1}^n (yk − xk) = ν(A) and Σ_{k=1}^n |F(yk) − F(xk)| = µ(A). Appealing to
Lemma 9.3 proves the "only if" part. It remains to prove the "if" part. Let F be absolutely
continuous, assume A has zero Lebesgue measure and let ε > 0. Choose the corresponding δ > 0
from the definition of absolute continuity of F. We may pick disjoint intervals [a1, b1), [a2, b2), . . .
covering A such that Σ_{k=1}^∞ (bk − ak) < δ. (This follows from Proposition 3.22.) Define An =
∪_{k=1}^n [ak, bk). By choice of δ, µ(An) = Σ_{k=1}^n |F(bk) − F(ak)| < ε for all n. By continuity from
below, µ(A) ≤ lim_{n→∞} µ(An) ≤ ε. Since ε > 0 was arbitrary, we are done.
The next exercise is an important special case of the Radon-Nikodym theorem.
Exercise 9.8. Let X be a random variable with CDF F, and µ the associated image measure
µ((a, b]) := F(b) − F(a) (also called the distribution of X; see the remarks after Definition 5.9.)
Suppose F′ = f is Lebesgue integrable. Show that µ is absolutely continuous with respect to
Lebesgue measure ν, and dµ/dν = f. (Hint: use the fundamental theorem of calculus, Theorem 6.12,
and Lemma 9.9 below.)
In Exercise 9.8, recall that f is called the probability density function (PDF) of X; see the
remarks below Exercise 6.16. In the above, the assumption on F can be replaced by assuming
F is absolutely continuous. It turns out that this is equivalent to saying F has a derivative f
almost everywhere, such that f is Lebesgue integrable with F(b) − F(a) = ∫_a^b f(t) dt. These
results fall into the realm of analysis and not measure theory, so we omit proof.

Lemma 9.9. Let (E, E, µ) be a measure space and f : E → R an integrable function. Then for
each ε > 0 there exists δ > 0 such that µ(A) < δ implies ∫_A |f| dµ < ε.

Proof. Let ε > 0 and define Bk = {x ∈ E : |f(x)| > k}. By dominated convergence, we may
choose k such that ∫_{Bk} |f| dµ < ε/2. Let δ = ε/(2k) and assume µ(A) < δ. Then

∫_A |f| dµ = ∫_{A\Bk} |f| dµ + ∫_{A∩Bk} |f| dµ
≤ ∫_A k dµ + ∫_{Bk} |f| dµ
≤ kµ(A) + ε/2 < ε.
9.1 Signed measures

We will use the Radon-Nikodym theorem to prove existence of conditional expectation. For this
application, we need to consider signed measures.

Definition 9.10. Let (E, E) be a measurable space. A function φ : E → R is a signed measure
on (E, E) if for any disjoint collection {Ak}_{k=1}^∞ of elements of E,

φ(∪_{k=1}^∞ Ak) = Σ_{k=1}^∞ φ(Ak). (35)
Compare with the definition of measure (Definition 2.8). The difference is that we allow
signed measures to have negative values, but not infinite values. In particular, a finite measure
is a signed measure, but not conversely. (Throughout we write measure or finite measure for an
unsigned measure.)
Continuity from above and from below hold for signed measures, which we leave as an
exercise.

Definition 9.11. Let φ be a signed measure on (E, E), and µ a measure on (E, E). Then we
say φ is absolutely continuous with respect to µ, written φ ≪ µ, if for each A ∈ E, µ(A) = 0
implies φ(A) = 0.
Theorem 9.12 (Radon-Nikodym, signed version). Let φ be a signed measure and µ a σ-finite
measure on (E, E) such that φ ≪ µ. Then there is a µ-a.e. unique measurable f : E → R such
that φ(A) = ∫_A f dµ ≡ µ(f 1A). In this case we write f = dφ/dµ.
We are now ready to prove existence of conditional expectation.
Exercise 9.13. Use Theorem 9.12 to prove existence of conditional expectation (see Deni-
tion 7.4). That is, show that if (Ω, F, P) is a probability space, G ⊆ F is a sub-σ-algebra, and
X : Ω → R is an integrable (F -measurable) random variable, then there exists a G -measurable
random variable Y such that E(Y 1A ) = E(X 1A ) for each A ∈ G .
We now turn to two structure theorems for signed measures. The first says that a signed
measure can be decomposed as a difference of two finite measures that are disjoint in the
following sense.

Definition 9.14. Two signed measures φ, ψ on (E, E) are called mutually singular, written
φ ⊥ ψ, if they can be supported on disjoint sets, that is, if there exist A, B ∈ E such that
A ∩ B = ∅, A ∪ B = E, and φ|A ≡ 0, ψ|B ≡ 0.
Above and below, for a signed measure φ and measurable set S , φ|S := φ(· ∩ S).
Theorem 9.15 (Jordan decomposition). Every signed measure φ can be uniquely decomposed
as φ = µ − ν, where µ and ν are finite measures with µ ⊥ ν.

Proof. Call a measurable set A totally positive if φ|A ≥ 0 (equivalently, φ|A is a finite
measure) and totally negative if φ|A ≤ 0 (equivalently, −φ|A is a finite measure). Consider the
sup, m, of φ(A) over all totally positive A. There is a sequence An of totally positive sets such
that lim_{n→∞} φ(An) = m. Without loss of generality A1 ⊆ A2 ⊆ . . . and continuity from below
shows that φ(∪_{n=1}^∞ An) = m. Define P = ∪_{n=1}^∞ An and N = P^c. We check that P is totally
positive and N is totally negative. Taking µ = φ|P and ν = −φ|N will then complete the proof.
If P is not totally positive, there is measurable A ⊆ P such that φ(A) < 0. It follows that
0 > φ(A) = φ(∪_{n=1}^∞ A ∩ An) = lim_{n→∞} φ(A ∩ An) so φ(A ∩ An) < 0 for some n, a contradiction
to An being totally positive. The argument showing N is totally negative is analogous. We
leave proof of uniqueness as an exercise.
It is now straightforward to extend the results of the last section to signed measures.

Lemma 9.16. Let φ be a signed measure and µ a σ-finite measure on (E, E). Then φ ≪ µ if and
only if for each ε > 0, there exists δ > 0 such that for all A ∈ E, µ(A) < δ ⇒ |φ(A)| < ε.
Proof. Suppose φ ≪ µ. Then by Theorem 9.12, we may write φ(A) = ∫_A f dµ for some integrable
f : E → R. By Lemma 9.9, given ε > 0, we may choose δ > 0 such that µ(A) < δ implies
|φ(A)| ≤ ∫_A |f| dµ < ε. Conversely, suppose |φ(A)| > 0, let ε = |φ(A)|, and choose δ as in the
condition above. Then we must have µ(A) ≥ δ > 0.

We can now prove the following equivalence:

Theorem 9.17. Let φ be a signed measure and µ a σ-finite measure on (E, E). Then TFAE:

1. φ ≪ µ

2. There is a µ-a.e. unique measurable f : E → R such that for each A ∈ E, φ(A) = ∫_A f dµ

3. Given ε > 0, there is δ > 0 such that if A ∈ E with µ(A) < δ, then |φ(A)| < ε.

Proof. By Lemma 9.16, 1 and 3 are equivalent. Theorem 9.12 states that 2 follows from 1, and
Lemma 9.9 shows that 3 follows from 2.

We can now connect absolute continuity of signed measures with absolute continuity of
functions (see also Lemma 9.7).

Exercise 9.18. Let φ be a signed measure. Show that φ is absolutely continuous with respect to
Lebesgue measure if and only if F(x) := φ((−∞, x]) is absolutely continuous.

10 Proof of Radon-Nikodym theorem

Lemma 10.1. Let µ and ν be finite measures on (E, E). Then either µ ⊥ ν, or there exist
ε > 0 and B ∈ E such that µ(B) > 0 and ν(A) ≥ εµ(A) for each A ∈ E with A ⊆ B.

Proof. Let (Pn, Nn) be the totally positive/negative sets in the proof of Theorem 9.15 in the
case where the signed measure is φn := ν − n^{−1}µ. Let P = ∪_{n=1}^∞ Pn and N = ∩_{n=1}^∞ Nn. Then
φn(N) ≤ 0 or equivalently ν(N) ≤ n^{−1}µ(N) for each n. Thus, ν(N) = 0. Using the fact that
Pn is increasing and Nn is decreasing with Pn ∪ Nn = E and Pn ∩ Nn = ∅, we can check
that P ∪ N = E and P ∩ N = ∅. If µ(P) = 0, then µ ⊥ ν. So suppose µ(P) > 0. Using
µ(P) = lim_{n→∞} µ(Pn) we see that µ(Pn) > 0 for some n. Taking ε = 1/n for this n yields the
result.

Proof of Theorem 9.4. We consider the case where µ is finite and leave the σ-finite case to an
exercise. We will only prove existence; uniqueness will follow from the proof of Theorem 10.4
below. Let G be the set of measurable functions g : E → [0, ∞) such that ∫_A g dµ ≤ ν(A) for
all A ∈ E. Observe G is nonempty since 0 ∈ G. We claim that h = max{f, g} ∈ G whenever
f, g ∈ G. To see this, let B = {x ∈ E : f(x) > g(x)} and check that

∫_A h dµ = ∫_{A∩B} f dµ + ∫_{A∩B^c} g dµ ≤ ν(A ∩ B) + ν(A ∩ B^c) = ν(A).
The monotone convergence theorem also shows G is closed under increasing limits. Now let
m := sup_{g∈G} ∫_E g dµ ≤ ν(E). Choose gn ∈ G such that lim_{n→∞} ∫_E gn dµ = m. Take fn =
max{g1, . . . , gn}. Then fn ∈ G is increasing and

∫_E gn dµ ≤ ∫_E fn dµ ≤ m

and so lim_{n→∞} ∫_E fn dµ = m. Since fn is increasing it has a pointwise limit f = lim_{n→∞} fn, where
f ∈ G. Note that f ∈ G implies f is integrable, hence finite µ-almost everywhere. We will show
that f = dν/dµ.
To see this, define ν0(A) = ν(A) − ∫_A f dµ for A ∈ E. Note that ν0 is a finite measure since
f ∈ G. We claim that ν0 ⊥ µ. Once the claim is established, we are done: since ν ≪ µ by
assumption, ν1(A) := ∫_A f dµ is absolutely continuous with respect to µ, and ν0 = ν − ν1, we
have ν0 ≪ µ. If also ν0 ⊥ µ, we conclude ν0 ≡ 0 and hence ν(A) = ∫_A f dµ for all A ∈ E.
To prove the claim, suppose that ν0 ̸⊥ µ. Then by Lemma 10.1, we may pick ε > 0 and
B0 ∈ E such that µ(B0) > 0 and B0 is totally positive for ν0 − εµ. Then

ν(A) = ν0(A) + ∫_A f dµ
≥ ν0(A ∩ B0) + ∫_A f dµ
≥ εµ(A ∩ B0) + ∫_A f dµ
= ∫_A (f + ε1_{B0}) dµ

which shows f + ε1_{B0} ∈ G. But

∫_E (f + ε1_{B0}) dµ = m + εµ(B0) > m,

which contradicts the definition of m := sup_{g∈G} ∫_E g dµ.
Exercise 10.2. Fill in the following detail of the proof of Theorem 9.4. Show that if φ is a
signed measure and µ a measure with φ ≪ µ and φ ⊥ µ, then φ ≡ 0.

Exercise 10.3. Finish the proof of Theorems 9.4 and 9.12 by generalizing to the cases where µ
is σ-finite and φ is a signed measure.
The proof of the Radon-Nikodym theorem leads to the following structure theorem, which
says that any signed measure has an absolutely continuous and a singular part with respect to an
arbitrary σ-finite measure.

Theorem 10.4 (Lebesgue decomposition). Let φ be a signed measure and µ a σ-finite measure
on (E, E). Then there exist unique signed measures ν0 and ν1 such that φ = ν0 + ν1 and ν0 ⊥ µ,
ν1 ≪ µ.

Proof. We will prove the theorem in the case where φ and µ are finite measures. We leave the
generalization to signed φ and σ-finite µ as an exercise.
Existence is an immediate consequence of the proof of Theorem 9.4 above. For uniqueness,
suppose φ = α0 + α1 where α0 ⊥ µ and α1 ≪ µ. Then φ = α0 + α1 = ν0 + ν1 so α0 − ν0 = ν1 − α1.
Since ν1 ≪ µ and α1 ≪ µ we have ν1 − α1 ≪ µ. We claim that α0 − ν0 ⊥ µ. Once this
is established, we are done by Exercise 10.2. To prove the claim, choose A, B, S, T such that
A = B^c, S = T^c and ν0|A = µ|B = α0|S = µ|T = 0. Then (A ∩ S)^c = B ∪ T and (α0 − ν0)|_{A∩S} = 0
and µ|_{B∪T} = 0. Thus α0 − ν0 ⊥ µ.

Exercise 10.5. Give an example of a non-σ-finite measure µ for which the conclusion of
Theorem 9.4 is false. Then give an example of a non-σ-finite measure φ for which the conclusion
of Theorem 10.4 is false.
Exercise 10.6. Let f, g : R → [0, ∞) be Lebesgue integrable functions, such that f/g is Lebesgue
integrable. Define measures ν, µ by ν(A) = ∫_A f(t) dt, µ(A) = ∫_A g(t) dt for Borel A ⊆ R. Find
the Lebesgue decomposition ν = ν0 + ν1 where ν0 ⊥ µ and ν1 ≪ µ. Then find dν1/dµ.
The following is a chain rule for Radon-Nikodym derivatives:

Theorem 10.7. Let ν, µ and λ be finite measures on (E, E) such that ν ≪ µ and µ ≪ λ. Then

dν/dλ = (dν/dµ)(dµ/dλ), λ-a.e.

Proof. It is easy to check that ν ≪ λ. We want to show that

ν(A) = ∫_A (dν/dµ)(dµ/dλ) dλ.

Observe that this is equivalent to

∫_A (dν/dµ) dµ = ∫_A (dν/dµ)(dµ/dλ) dλ. (36)

Let f = dν/dµ. More generally we show ∫_A f dµ = ∫_A f (dµ/dλ) dλ: this holds when f = 1B for
B ∈ E (by definition of dµ/dλ), by linearity it holds for simple f, and by monotone convergence
it holds for nonnegative f. It holds for integrable f by considering the positive and negative
part as usual.
11 Ergodic theory

Definition 11.1. Let (Ω, F, P) be a probability space. A mapping T : Ω → Ω is measure-
preserving if it is measurable and

P(T⁻¹(A)) = P(A), for all A ∈ F. (37)

In Definition 11.1, it is enough to check (37) on a π-system. That is, if A is a π-system
generating F and P(T⁻¹(A)) = P(A) for each A ∈ A, then T is measure-preserving.
Example 11.2. Let Ω = [0, 1] and P the uniform measure on Borel sets. It is easy to check
that T : Ω → Ω, T (ω) = 1 − ω is measure-preserving.
Example 11.3. Let Ω = [0, 1) and P the uniform measure. For each r ∈ [0, 1), the map
Tr : Ω → Ω, Tr (ω) = ω + r mod 1 is measure-preserving.
Another interesting example is the following. Consider the infinite coin toss probability
space (Ω, F, P) with Ω = {0, 1}^N, and define T(ω1, ω2, . . .) = (ω2, ω3, . . .). Let A = A1 × . . . ×
An × {0, 1} × {0, 1} × . . . ∈ F. Define Xi : Ω → R by Xi(ω) = ωi. Then

P(T⁻¹(A)) = P(X2 ∈ A1, . . . , Xn+1 ∈ An)
= P(X2 ∈ A1) . . . P(Xn+1 ∈ An)
= P(X1 ∈ A1) . . . P(Xn ∈ An)
= P(X1 ∈ A1, . . . , Xn ∈ An)
= P(A).
Since such sets A form a π -system generating F, the map T is measure-preserving. This T is
called the shift operator. It is not hard to check that the shift operator is measure-preserving
on product spaces corresponding to sequences of independent identically distributed random
variables. We will show below it is also measure-preserving on product spaces corresponding to
certain Markov chains.

Definition 11.4. Let P be a transition matrix on a finite set E. A left (row) eigenvector of P
corresponding to eigenvalue 1 is called a stationary vector of P. We assume that a stationary
vector π is normalized so that Σ_{i∈E} πi = 1.
Such eigenvectors exist because the rows of P sum to 1, hence 1 is an eigenvalue of P (take
the all ones vector as a right eigenvector). It turns out that these eigenvectors have nonnegative
entries. This means we can think of a stationary vector on E as defining a probability measure
on E. A stationary vector π is often also called invariant since πP = π, or equivalently,

Σ_{i∈E} πi Pij = πj.
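For a small chain, a stationary vector can be computed exactly by solving πP = π. The following Python sketch does this for a hypothetical two-state transition matrix (not one appearing in these notes):

```python
from fractions import Fraction

# Solving pi P = pi exactly for a hypothetical two-state transition matrix
# P = [[9/10, 1/10], [1/2, 1/2]]. Writing pi = (a, 1 - a), stationarity in
# the first coordinate gives a = P10 / (1 - P00 + P10) = 5/6.
P = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(1, 2), Fraction(1, 2)]]
a = P[1][0] / (1 - P[0][0] + P[1][0])
pi = [a, 1 - a]
print(pi)                                                    # [5/6, 1/6]
print([pi[0] * P[0][j] + pi[1] * P[1][j] for j in range(2)]) # pi P = pi
```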
The next exercise shows why we use the word stationary.
Exercise 11.5. Let Xn be a Markov chain with transition matrix P on a nite set E . Let π be
a stationary vector of P . Show that if P(X0 = i) = πi for all i ∈ E , then P(Xn = i) = πi for
all i ∈ E and n ≥ 0.
In Exercise 11.5, since P(X0 = i) = πi for all i ∈ E , we say that Xn starts at π . Informally,
the exercise shows that if Xn starts at π , then it stays at π . Exercise 11.6 below shows that the
shift operator is measure-preserving for a Markov chain that starts at a stationary vector.
Recall (Exercise 8.15) that the canonical construction of a Markov chain with transition
matrix P consists of a probability measure P on (E^N, (2^E)^{⊗N}) and random variables Xn : E^N →
E defined by Xn(ω) = ωn, such that

P(Xn+1 = j|Xn = i) = Pij.
P is unique once we specify an initial distribution, that is, probabilities for X0 . We will always
assume Markov chains have this canonical construction.

Exercise 11.6. Let E be a finite set, P a transition matrix, and P the probability measure from
the canonical construction of a Markov chain with transition matrix P. Let π be a stationary
vector for P and suppose P(X0 = i) = πi for all i ∈ E. Show that the shift T : E^N → E^N,
T(ω0, ω1, . . .) = (ω1, ω2, . . .) is a measure-preserving map.
Definition 11.7. Let (Ω, F, P) be a probability space and T : Ω → Ω. A set A ∈ F is called
invariant for T if T⁻¹(A) = A a.s. A random variable Y : Ω → R is called T-invariant if
Y ◦ T = Y a.s. Sometimes we say almost invariant to emphasize that these equalities hold a.s.
In the above, A = B a.s. means P(A △ B) = 0.
Exercise 11.8. Identify the invariant sets in Examples 11.2 and 11.3.
It is easy to check that the collection of almost invariant sets forms a σ -algebra I .
Theorem 11.9 (Birkhoff-Khinchin). Let (Ω, F, P) be a probability space, T : Ω → Ω a measure-
preserving map, and X : Ω → R an integrable random variable. Then

lim_{n→∞} n⁻¹ Σ_{k=0}^{n−1} X ◦ T^k = E(X|I) a.s., (38)

where I ⊆ F is the σ-algebra of T-almost invariant sets.
We prove Theorem 11.9 in Section 12 below. Note that in (38), the conditional expectation
E(X|I) is T-invariant. In the case where I is trivial, that is, P(A) = 0 or 1 for every A ∈ I,
we say that T is ergodic. In this case the RHS of (38) becomes the constant value E(X).
Exercise 11.10. Consider the setting of Example 11.2 and interpret the conclusion of the
Birkhoff-Khinchin theorem, as follows. Given a random variable X : Ω → R, identify the
right hand side of (38) and check it satisfies the properties of conditional expectation (see
Definition 7.4). Do the same for Example 11.3 when r = 1/m, m ∈ N. You may use Exercise 12.4
below without proof.
Exercise 11.11. Let

P =
[ 1/2  1/2   0    0    0  ]
[ 1/3   0   2/3   0    0  ]
[  0    0    1    0    0  ]
[  0    0    0   1/2  1/2 ]
[  0    0    0   1/2  1/2 ]

Find all the stationary vectors of P.
For each stationary vector π, consider a Markov chain Xn with transition matrix P starting at
π (i.e. P(X0 = i) = πi, i = 1, . . . , 5), and let T be the shift transformation (see Exercise 11.6).
Let X = f ◦ X0 ≡ f(X0) where f : {1, . . . , 5} → R is some function. Identify the LHS and RHS
of (38). Check that the RHS satisfies the properties of conditional expectation.
11.1 Ergodic transformations and the irrational circle rotation

Definition 11.12. Let (Ω, F, P) be a probability space and T : Ω → Ω a measure-preserving
map. Then T is called ergodic if each almost invariant set A has P(A) = 0 or 1.
An important example is the following. Suppose P is a transition matrix on a finite set E
with a unique stationary vector π. Let Xn be a Markov chain starting at π, let T be the shift,
and let X = f ◦ X0 ≡ f(X0) where f : E → R is some function. The fact that π is unique
implies I is trivial. (We omit proof of this statement, but compare with Exercise 11.11 in which
π is nonunique and I is nontrivial.) Then (38) becomes

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} f(Xk) = E(f(X0)) = Σ_{i∈E} πi f(i) a.s. (39)
We think of the RHS as a space average as it is an average with respect to π defined on the space
E. We think of the LHS as a time average. Note that the LHS of (39) is a function of paths
ω = (ω1, ω2, . . .) and the RHS is constant. Thus (39) says that time averages over paths equal
the appropriate space average, with probability 1, provided the path starting point X0(ω) = ω0
has πω0 > 0. (In fact, (39) remains true even without the assumption on the starting point.)
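Equation (39) can be observed numerically: simulate one long path and compare the running time average of f(Xk) with the space average Σ πi f(i). The sketch below uses the same hypothetical two-state matrix as in the stationary-vector sketch above, with π = (5/6, 1/6) and f(i) = i + 1; the path starts from state 0, which by the remark above does not matter:

```python
import random

# Time average vs space average, as in (39), for the hypothetical two-state
# chain with P = [[0.9, 0.1], [0.5, 0.5]] and stationary vector
# pi = (5/6, 1/6). Take f(i) = i + 1 on states {0, 1}.
P = [[0.9, 0.1], [0.5, 0.5]]
pi = [5 / 6, 1 / 6]
f = lambda i: i + 1

state, total, n = 0, 0.0, 10**6
for _ in range(n):
    total += f(state)
    state = 0 if random.random() < P[state][0] else 1

print(total / n)                                # time average along one path
print(sum(p * f(i) for i, p in enumerate(pi)))  # space average = 7/6
```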
A special case of (39) is the strong law of large numbers:
Theorem 11.13 (Strong law of large numbers). Suppose Xk , k = 0, 1, . . . are independent and
identically distributed (iid) integrable random variables. Then

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} Xk = E(X0) a.s.
The theorem is obtained by noticing that the shift transformation T (ω1 , ω2 , . . .) = (ω2 , ω3 , . . .)
is measure-preserving and ergodic on the probability space constructed in Exercise 4.9. Alter-
natively, the sequence X0 , X1 , . . . can be seen as a Markov chain with a stationary distribution
that is the same as the distribution of X0 .
For the remainder of this section we study irrational rotations of the unit circle. Such
transformations are ergodic:

Proposition 11.14. Let P be the uniform probability measure on S^1 = [0, 1), fix r ∈ (0, 1) \ Q,
and define T : S^1 → S^1 by T(t) = t + r mod 1. Then T is ergodic.
Below we omit "mod 1" in our notation, and we use analysis instead of probability notation,
e.g. writing µ instead of P for Lebesgue measure on S^1. To prove Proposition 11.14 we use the
following lemmas.

Lemma 11.15. Each orbit Os = {T^k(s) : k ∈ N} is dense in S^1.

Proof. Note that if T^k(s) = T^ℓ(s), then (k − ℓ)r = 0 mod 1 and so k = ℓ. So Os is infinite. Suppose
Os is not dense in S^1. Then there is ε > 0 and a closed interval I of width ε which is disjoint
from Os. If there are n, m ∈ N with m > n such that |T^n(s) − T^m(s)| < ε, then
{T^{k(m−n)}(s) : k ∈ N} is a collection of distinct points with consecutive spacing less than ε, so it
intersects I. We conclude there are no such m, n. But then Os can contain no more than 1/ε
points, a contradiction.
Lemma 11.16. Let A ⊆ S^1 be Lebesgue measurable. Then given ε > 0, there exist disjoint
intervals I1, . . . , In ⊆ S^1 such that the Lebesgue measure of A △ (I1 ∪ . . . ∪ In) is less than ε.
Proof. Let µ be Lebesgue measure. Using the Lebesgue outer measure characterization (Proposition 3.22),
we may choose intervals Ik, k = 1, 2, . . . such that A ⊆ ∪_{k=1}^∞ Ik and µ(A) >
−ε + Σ_{k=1}^∞ µ(Ik). WLOG the Ik's can be chosen disjoint. Since Σ_{k=1}^∞ µ(Ik) < ∞, we may
choose n ∈ N such that Σ_{k=n+1}^∞ µ(Ik) < ε. Now

µ(A \ (I1 ∪ . . . ∪ In)) < µ(A \ ∪_{k=1}^∞ Ik) + ε = ε

and

µ((I1 ∪ . . . ∪ In) \ A) ≤ µ((∪_{k=1}^∞ Ik) \ A) < ε,

and so µ(A △ (I1 ∪ . . . ∪ In)) < 2ε.
Lemma 11.17. Let A ⊆ S^1 be Lebesgue measurable. Then given ε > 0, there exists a continuous
function f : S^1 → R such that ∫_{S^1} |f(t) − 1A(t)| dt < ε.

Proof. Let ε > 0. Choose disjoint intervals I1, . . . , In ⊆ S^1 as in Lemma 11.16. Define f(x) = 1
on I := ∪_{k=1}^n Ik and f(x) = 0 on the set S^1 \ ∪_{k=1}^n (Ik)_{ε2^{−k}}, where Bδ denotes the δ-neighborhood of
B. Finally define f on the rest of S^1 by linear interpolation. Then

∫_{S^1} |f(t) − 1A(t)| dt ≤ ∫_{S^1} |f(t) − 1I(t)| dt + ∫_{S^1} |1I(t) − 1A(t)| dt ≤ 2ε + ε = 3ε.
Proof of Proposition 11.14. Suppose A ⊆ S^1 is a T-almost invariant set. Let ε > 0 and choose
f as in Lemma 11.17. Then ∫_{S^1} |f(T^k(t)) − 1A(t)| dt < ε for each k ∈ N and so ∫_{S^1} |f(T^k(t)) −
f(T^ℓ(t))| dt < 2ε for each k, ℓ ∈ N. By Lemma 11.15, the orbit {T^k(t) : k ∈ N} is dense in S^1.
It follows that ∫_{S^1} |f(t + s) − f(t)| dt ≤ 2ε for all s ∈ S^1. By Fubini's theorem,

∫_{S^1} | f(t) − ∫_{S^1} f(s) ds | dt = ∫_{S^1} | ∫_{S^1} (f(t) − f(t + s)) ds | dt
≤ ∫_{S^1} ∫_{S^1} |f(t) − f(t + s)| dt ds ≤ 2ε.
Let c be the Lebesgue measure of A. Then

∫_{S^1} | ∫_{S^1} f(s) ds − c | dt = ∫_{S^1} | ∫_{S^1} (f(s) − 1A(s)) ds | dt ≤ ε

and so

∫_{S^1} |1A(t) − c| dt ≤ ∫_{S^1} |1A(t) − f(t)| dt + ∫_{S^1} | f(t) − ∫_{S^1} f(s) ds | dt
+ ∫_{S^1} | ∫_{S^1} f(s) ds − c | dt
≤ ε + 2ε + ε = 4ε.

Since ε > 0 was arbitrary, 1A = c a.e., so the Lebesgue measure of A is 0 or 1.
The result in Lemma 11.17 is interesting in its own right. We pursue this below.

Lemma 11.18. Let (E, E, µ) be a measure space and fn, f : E → R measurable functions with
Σ_{n=1}^∞ ∫_E |fn − f| dµ < ∞. Then fn → f a.e. on E.

Proof. Let ε > 0, set An = {x ∈ E : |fn(x) − f(x)| ≥ ε}, and Bn = ∪_{k≥n} Ak. Note that

ε Σ_{n=1}^∞ µ(An) ≤ Σ_{n=1}^∞ ∫_E |fn − f| dµ < ∞.

Thus,

0 ≤ µ(Bn) ≤ Σ_{k≥n} µ(Ak) → 0 as n → ∞,

so µ(∩_{n=1}^∞ Bn) = 0, i.e., µ-a.e. x belongs to lim inf_{n→∞} A^c_n. Taking a union over ε = 1/m,
m ∈ N, completes the proof.
Exercise 11.19. Let f : [a, b] → R be an integrable function. Use Lemmas 11.17 and 11.18 to
show there exist continuous functions fn : [a, b] → R such that fn → f a.e.
Exercise 11.19, together with Egoroff's theorem below, shows that measurable functions are
almost continuous.
Theorem 11.20 (Egoroff's theorem). Let (E, E, µ) be a finite measure space, and fn : E → R
measurable functions such that fn → f a.e. Then for each ε > 0, there exists measurable A ⊆ E
such that µ(A) < ε and fn → f uniformly on A^c.

Proof of Egoroff's theorem. Assume WLOG that fn → f everywhere on E and define An,k =
∪_{m=n}^∞ {x ∈ E : |fm(x) − f(x)| ≥ 1/k}. Since A1,k ⊇ A2,k ⊇ . . . and ∩_{n=1}^∞ An,k = ∅, continuity
from above implies lim_{n→∞} µ(An,k) = 0. Let ε > 0 and for each k ∈ N pick nk such that
µ(A_{nk,k}) < ε2^{−k}. Set A = ∪_{k=1}^∞ A_{nk,k}. Then µ(A) < ε and |fn(x) − f(x)| < 1/k for n > nk and
x ∈ A^c. Thus fn → f uniformly on A^c.
Theorem 11.21 (Lusin's theorem). Let f : [a, b] → R be measurable. Then for each ε > 0,
there exists a compact C ⊆ [a, b] with Lebesgue measure greater than b − a − ε such that f|C is
continuous.
Exercise 11.22. Use Exercise 11.19 and Theorem 11.20 to prove Theorem 11.21. (Hint: first
use Lebesgue outer measure, Proposition 3.22, to show that if B ⊆ [a, b] is Lebesgue
measurable, then given ε > 0 there exists compact C ⊆ B such that µ(B \ C) < ε.)

12 Proof of ergodic theorem

We begin with the ergodic case, that is, the case where the invariant σ-algebra I is trivial.
Theorem 12.1. Let (Ω, F, P) be a probability space, T : Ω → Ω measure preserving, and
X : Ω → R integrable. Assume T is ergodic. With

Sn(ω) := Σ_{k=0}^{n−1} X(T^k(ω)),

we have

lim_{n→∞} Sn/n = E(X) a.s. (40)
We will need the following lemma:

Lemma 12.2. Let X, T, and Sn be as in Theorem 12.1. If E(X) > 0 then P(Sn > 0 ∀ n) > 0.
Proof of Theorem 12.1. By subtracting E(X) from both sides of (40), we may assume WLOG
that E(X) = 0. Let ε > 0. Applying Lemma 12.2 to X + ε shows that P(Sn/n > −ε ∀ n) > 0,
and thus

P( lim inf_{n→∞} Sn/n ≥ −ε ) > 0. (41)
The assumption that T is ergodic means that the σ-algebra I of almost invariant events is trivial.
Note that lim inf_{n→∞} Sn/n is T-almost invariant, or equivalently, I-measurable. (See Exercise 12.3.)
This means lim inf_{n→∞} Sn/n is constant a.s., so (41) shows that

lim inf_{n→∞} Sn/n ≥ −ε a.s.

As ε > 0 was arbitrary, lim inf_{n→∞} Sn/n ≥ 0 a.s. Applying the same argument to −X + ε gives
lim sup_{n→∞} Sn/n ≤ 0 a.s., which completes the proof.

Proof of Lemma 12.2. Assume E(X) > 0 and let A = {Sn ≤ 0, some n}. We want to show
that P(A^c) > 0. We will actually prove a stronger statement, namely that E(1A X) ≤ 0 (see
Lemma 12.5 below). Then since E(X) > 0, we have E(1_{A^c} X) > 0 and so P(A^c) > 0 as desired.
Let An = {Sk ≤ 0, some 1 ≤ k ≤ n}. Then An is increasing with ∪_{n=1}^∞ An = A. So
by the dominated convergence theorem, it suffices to show that E(1_{An} X) ≤ 0. Set Fn =
min{0, S1, . . . , Sn} where the minimum is pointwise. Note that Sn ◦ T + X = Sn+1. Thus,
Fn ◦ T + X ≤ Sk ◦ T + X = Sk+1 for 0 ≤ k ≤ n and so Fn ◦ T + X ≤ min{S1, . . . , Sn}. Note
that min{S1, . . . , Sn} = Fn on An and so

X 1_{An} ≤ (Fn − Fn ◦ T)1_{An}.

Thus,

E(1_{An} X) ≤ E(1_{An} Fn) − E(1_{An} Fn ◦ T)
= E(Fn) − E(1_{An} Fn ◦ T)
≤ E(Fn) − E(Fn ◦ T)
= 0,

where the last three lines of the display above follow from the facts that Fn = 0 on A^c_n, Fn ◦ T ≤ 0,
and T is measure-preserving (see Exercise 12.4 below), respectively.
Exercise 12.3. Fill in the following detail in the proof of Theorem 12.1: Show that if limn→∞ Sn /n
exists a.s., then the limit is a T -almost invariant and hence I -measurable function.
Exercise 12.4. Fill in the following detail in the proof of Lemma 12.2: Show that if Y : Ω →
R is integrable and T : Ω → Ω is measure-preserving and A ∈ I is almost invariant, then
E((Y ◦ T )1A ) = E(Y 1A ). (Lemma 12.2 uses the case A = Ω.)
From the proof of Lemma 12.2 we actually obtain the following stronger statement:
Lemma 12.5 (Maximal ergodic theorem). Let (Ω, F, P) be a probability space, T : Ω → Ω
measure preserving, X : Ω → R integrable, and define Sn = X + X ◦ T + . . . + X ◦ T^{n−1}. Let
A = ∪_{n=1}^∞ {Sn < 0}. Then

E(1A X) ≤ 0.
(Note that an inequality became strict in the definition of A above, but the proof given for
Lemma 12.2 still goes through.) Lemma 12.5 holds whether or not T is ergodic, and indeed
we will use it to prove Theorem 11.9 in the case where I is not trivial.
The following corollary will prove useful:
Corollary 12.6. Consider the setup of Lemma 12.5, and let Bα = ∪_{n=1}^∞ {Sn/n < α}. If A is
T-almost invariant, then E(1_{Bα ∩ A} X) ≤ αP(Bα ∩ A).
Proof. First suppose A = Ω, and set X̃ = X − α, S̃n = Sn − nα. Then Bα = ∪_{n=1}^∞ {S̃n < 0}, so by
Lemma 12.5, E(1_{Bα} X̃) ≤ 0, or equivalently, E(1_{Bα} X) ≤ αP(Bα). To finish the proof, repeat
the above with T|A in place of T. (Since A is almost invariant, we may assume T|A : A → A
and T|A is measure-preserving.)
We are ready to prove the ergodic theorem in the general case, where I may be nontrivial.
For convenience we restate the theorem:

Theorem 12.7 (Birkhoff-Khinchin). Let (Ω, F, P) be a probability space, T : Ω → Ω a measure-
preserving map, and X : Ω → R an integrable random variable. Then

lim_{n→∞} Sn/n = E(X|I) a.s., (42)

where

Sn := Σ_{k=0}^{n−1} X ◦ T^k (43)

and I ⊆ F is the σ-algebra of T-almost invariant sets.
Proof. We first show that the limit on the LHS of (42) exists a.s. Let

X∗ = lim inf_{n→∞} Sn/n,  X^∗ = lim sup_{n→∞} Sn/n,

and Aα,β = {X∗ < α, X^∗ > β}. Then

A := ∪_{α<β, α,β∈Q} Aα,β = {X∗ ≠ X^∗}.
We claim P(A) = 0; by countable additivity it suffices to show P(Aα,β) = 0 whenever α < β.
Since X∗, X^∗ are T-invariant (by an argument identical to Exercise 12.3), so is Aα,β. Let
Bα = ∪_{n=1}^∞ {Sn/n < α}. Notice that Aα,β ⊆ Bα. By Corollary 12.6,

E(1_{Aα,β} X) = E(1_{Bα ∩ Aα,β} X) ≤ αP(Bα ∩ Aα,β) = αP(Aα,β). (44)

In the above arguments, we can replace X, α and β with −X, −β and −α, respectively, to get

E(1_{Aα,β} X) ≥ βP(Aα,β).

Thus, βP(Aα,β) ≤ αP(Aα,β). So if α < β then P(Aα,β) = 0 as desired.
We have shown that

Y := lim_{n→∞} S_n/n

exists a.s. We will show Y = E(X|I) by checking the properties of conditional expectation
(see Definition 7.4). By Exercise 12.3, Y is I-measurable. Suppose A ∈ I. By Exercise 12.4,
E((X ◦ T^k) 1_A) = E(X 1_A) for each k ≥ 0, and so

E((S_n/n) 1_A) = E(X 1_A). (45)
So if we can show that

lim_{n→∞} E((S_n/n) 1_A) = E(Y 1_A), (46)

then E(X 1_A) = E(Y 1_A) and we are done.
It only remains to establish (46). For λ > 0 set

C_λ = {sup_{n≥1} |S_n|/n > λ} = {inf_{n≥1} S_n/n < −λ} ∪ {sup_{n≥1} S_n/n > λ}.

By Corollary 12.6, since {inf_{n≥1} S_n/n < −λ} = B_{−λ},

P(inf_{n≥1} S_n/n < −λ) ≤ −λ^{−1} E(1_{B_{−λ}} X) ≤ λ^{−1} E(|X|).

Using minus signs again and repeating the above arguments, we can get

P(sup_{n≥1} S_n/n > λ) ≤ λ^{−1} E(|X|),

and so

P(C_λ) ≤ 2λ^{−1} E(|X|).
Let D_{γ,k} = {|X ◦ T^k| > γ}. Since |S_n| ≤ Σ_{k=0}^{n−1} |X ◦ T^k|, we have

E((|S_n|/n) 1_{C_λ}) ≤ (1/n) Σ_{k=0}^{n−1} E(1_{C_λ} |X ◦ T^k|)
                   ≤ (1/n) Σ_{k=0}^{n−1} [ E(1_{C_λ} 1_{D_{γ,k}} |X ◦ T^k|) + γ E(1_{C_λ} 1_{D_{γ,k}^c}) ]      (47)
                   ≤ (1/n) Σ_{k=0}^{n−1} E(1_{D_{γ,k}} |X ◦ T^k|) + γ P(C_λ)
                   ≤ E(1_{D_{γ,0}} |X|) + 2(γ/λ) E(|X|),

where the last line uses that T is measure-preserving, so that E(1_{D_{γ,k}} |X ◦ T^k|) = E(1_{D_{γ,0}} |X|),
together with the bound P(C_λ) ≤ 2λ^{−1} E(|X|) above. Since X is integrable,

E(1_{D_{γ,0}} |X|) = ∫_{|X|>γ} |X| dP → 0 as γ → ∞.

Let E_λ = {|S_n/n| > λ}. Note that E_λ ⊆ C_λ. By taking γ = √λ in (47), we find that

E((|S_n|/n) 1_{E_λ}) ≤ E((|S_n|/n) 1_{C_λ}) ≤ 2λ^{−1/2} E(|X|) + E(1_{D_{√λ,0}} |X|) → 0 as λ → ∞, uniformly in n.

This is enough to establish (46), as we show in Theorem 12.8 below.

Theorem 12.8. Suppose X_n is a sequence of random variables such that X_n → X a.s. and
E(1_{{|X_n|>λ}} |X_n|) → 0 as λ → ∞, uniformly in n. Then X is integrable and

lim_{n→∞} E(|X_n − X|) = 0.

In particular, lim_{n→∞} E(X_n) = E(X).


Proof. Choose λ such that E(1_{{|X_n|>λ}} |X_n|) < 1 for all n. By Fatou's lemma (Theorem 6.8),

E(|X|) ≤ lim inf_{n→∞} E(|X_n|) ≤ λ + lim inf_{n→∞} E(1_{{|X_n|>λ}} |X_n|) < λ + 1.

So X is integrable. We leave the rest of the proof as an exercise (Exercise 12.9).

Exercise 12.9. Finish the proof of Theorem 12.8. (Hint: Use Egoroff's theorem, Theorem 11.20.)

The condition E(1_{{|X_n|>λ}} |X_n|) → 0 as λ → ∞, uniformly in n, in Theorem 12.8 is often
called uniform integrability. You can add it to the list of conditions which allow for limits to go
inside of integrals!
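As a quick numerical illustration (ours, not part of the development above), Theorem 12.7 can be checked by simulation. The sketch below uses the circle rotation T(ω) = ω + α mod 1, which is measure-preserving for the uniform measure on [0, 1) and, for irrational α, ergodic; we borrow this standard fact without proof. Ergodicity makes I trivial, so the theorem predicts S_n/n → E(X).

```python
import numpy as np

# A minimal sketch of Theorem 12.7 for the rotation T(w) = (w + alpha) mod 1.
# Assumption (borrowed, not proved in these notes): for irrational alpha this
# map is measure-preserving and ergodic for the uniform measure on [0, 1),
# so that I is trivial and E(X|I) = E(X).
alpha = np.sqrt(2) - 1                     # an irrational rotation number
X = lambda w: np.cos(2 * np.pi * w) ** 2   # E(X) = 1/2 under the uniform measure

w0 = 0.123                                 # an arbitrary starting point
n = 200_000
orbit = (w0 + alpha * np.arange(n)) % 1.0  # w0, T(w0), T^2(w0), ...
birkhoff = np.cumsum(X(orbit)) / np.arange(1, n + 1)  # S_n / n

print(birkhoff[-1])  # ~0.5, matching E(X) as (42) predicts
```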

13 Inequalities

13.1 Jensen's inequality

We begin by reviewing definitions of convexity in Euclidean space:

Denition 13.1. A set S ⊆ Rn is convex if for all x, y ∈ S and t ∈ [0, 1], tx + (1 − t)y ∈ S .
Definition 13.2. Let S ⊆ R^n be convex. A function φ : S → R is convex if

φ(tx + (1 − t)y) ≤ tφ(x) + (1 − t)φ(y)

for all x, y ∈ S and t ∈ [0, 1].


Jensen's inequality relies on the following lemma.

Lemma 13.3. Let S ⊆ Rn be a closed convex set and p a point on ∂S . Then there exists
nonzero u ∈ Rn such that u · (x − p) ≥ 0 for all x ∈ S .
Proof. Take a sequence p_n ∈ R^n \ S converging to p. Such a sequence exists because S is closed
and p is on its boundary. For each n define

x_n = arg min_{x∈S} ‖x − p_n‖².

It is easy to check that x_n exists, since the minimization can be done inside the compact set
S ∩ {x ∈ R^n : ‖x − p_n‖ ≤ M}, where M is chosen sufficiently large. Note that ‖x_n − p_n‖ > 0
since x_n ∈ S and p_n ∉ S. Define v_n = x_n − p_n. We claim that v_n · (x − p_n) > 0 for all x ∈ S.
Suppose instead that v_n · (x − p_n) ≤ 0 for some x ∈ S and consider x_t = (1 − t)x_n + tx, which
lies in S by convexity. Then

‖x_t − p_n‖² = ‖(1 − t)(x_n − p_n) + t(x − p_n)‖²
             = (1 − t)²‖x_n − p_n‖² + 2t(1 − t)(x_n − p_n) · (x − p_n) + t²‖x − p_n‖²
             ≤ (1 − t)²‖x_n − p_n‖² + t²‖x − p_n‖²
             < ‖x_n − p_n‖²

for t sufficiently small, contradicting the minimality of x_n. Now define u_n = v_n/‖v_n‖. Since the u_n
lie on the unit sphere, which is compact, there is a subsequence u_{n_k} converging to a unit vector
u. Since u_n · (x − p_n) > 0 for each n, we must have u · (x − p) ≥ 0.

Theorem 13.4 (Jensen's inequality). Let (Ω, F, P) be a probability space, G ⊆ F a sub-σ-algebra,
X : Ω → R an integrable random variable, and φ : R → R a convex function. Then

φ(E(X|G)) ≤ E(φ(X)|G).

Proof. Let C be the collection of functions of the form g(t) = a + bt, where a, b ∈ R are such that
g ≤ φ. It is easy to check that the region S = {(x, y) ∈ R² : y ≥ φ(x)} is closed and convex.
For each x, applying Lemma 13.3 at the boundary point p = (x, φ(x)) yields a line through p orthogonal to u that corresponds
to a function g ∈ C with g ≤ φ and g(x) = φ(x). Thus, C is nonempty and φ(x) = sup_{g∈C} g(x).
Let g ∈ C and write g(t) = a + bt. Then

E(φ(X)|G) ≥ E(g(X)|G) = a + bE(X|G) = g(E(X|G)).

Thus,

φ(E(X|G)) = sup_{g∈C} g(E(X|G)) ≤ E(φ(X)|G).

In measure-theoretic notation: for a measure µ on (E, E) with µ(E) = 1, an integrable function f : E →
R, and a convex function φ : R → R, we have φ(µ(f)) ≤ µ(φ ◦ f), or

φ( ∫_E f dµ ) ≤ ∫_E φ(f) dµ.
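As a sanity check (our illustration, not from the notes), both sides of Jensen's inequality can be estimated by Monte Carlo; here G is taken trivial, so E(X|G) = E(X):

```python
import numpy as np

# Monte Carlo check of Jensen's inequality with trivial G: for convex phi,
# phi(E(X)) <= E(phi(X)).
rng = np.random.default_rng(1)
X = rng.normal(size=10**6)   # X ~ N(0, 1)

phi = np.exp                 # a convex function
lhs = phi(X.mean())          # phi(E X): should be ~ e^0 = 1
rhs = phi(X).mean()          # E phi(X): should be ~ e^{1/2} ~ 1.65

print(lhs <= rhs, lhs, rhs)  # True, ~1.0, ~1.65
```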

13.2 Markov, Chebyshev and Chernoff inequalities

The following very simple inequality leads to many important results in probability:

Theorem 13.5 (Markov inequality). If X is a non-negative random variable and a > 0, then

P(X ≥ a) ≤ E(X)/a. (48)

Proof. Note that a 1_{{X≥a}} ≤ X. Thus, aP(X ≥ a) ≤ E(X).
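For instance (a small numerical check of ours), the bound can be compared with the exact tail of an exponential random variable, where E(X) = 1 and P(X ≥ a) = e^{−a}:

```python
import numpy as np

# Empirical check of the Markov inequality (48) for X ~ Exponential(1):
# the exact tail e^{-a} should sit below the bound E(X)/a = 1/a.
rng = np.random.default_rng(2)
X = rng.exponential(scale=1.0, size=10**6)

for a in [1.0, 2.0, 5.0]:
    empirical = (X >= a).mean()   # estimates P(X >= a) = e^{-a}
    print(a, empirical, 1.0 / a)  # empirical tail vs the Markov bound
```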

Definition 13.6. The variance of a random variable X is

Var(X) = E((X − E(X))²).

It is easy to check that Var(X) = E(X²) − E(X)² ≥ 0. The variance of a random variable
is a measure of how much it tends to deviate from its mean, or expected value.

Proposition 13.7. If X is a random variable and a, b ∈ R, then Var(aX + b) = a² Var(X). If
X_1, . . . , X_n are independent, then Var(X_1 + . . . + X_n) = Var(X_1) + . . . + Var(X_n).

We leave the proof of Proposition 13.7 as an exercise.

Theorem 13.8 (Chebyshev inequality). Let X be a random variable with E(X) = µ and
Var(X) = σ² < ∞. Then for s > 0,

P(|X − µ| ≥ sσ) ≤ 1/s². (49)

Proof. Applying (48) with (X − µ)² and s²σ² in place of X and a, respectively, gives

P(|X − µ| ≥ sσ) = P((X − µ)² ≥ s²σ²) ≤ E((X − µ)²)/(s²σ²) = 1/s².

Chebyshev's inequality leads to the weak law of large numbers:


Theorem 13.9 (Weak law of large numbers). Let X_k, k = 1, 2, . . . be independent and identically
distributed (iid) random variables with E(X_1) = µ < ∞ and Var(X_1) = σ² < ∞. Let S_n =
Σ_{k=1}^n X_k. Then for each ε > 0,

lim_{n→∞} P(|S_n/n − µ| ≥ ε) = 0. (50)

Proof. Let ε > 0. Without loss of generality assume µ = 0. Then E(S_n/n) = 0. By Proposition 13.7, Var(S_n/n) = σ²/n. Applying (49) with S_n/n in place of X and s = ε√n/σ gives

P(|S_n/n| ≥ ε) ≤ σ²/(ε²n).
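The bound in the proof can be compared with simulated frequencies (our sketch, for fair coin flips, where µ = 1/2 and σ² = 1/4):

```python
import numpy as np

# Simulated check of the weak law: the frequency of |S_n/n - mu| >= eps
# should fall below the Chebyshev bound sigma^2/(eps^2 n) from the proof.
rng = np.random.default_rng(3)
eps, trials = 0.05, 5_000

for n in [100, 1_000, 10_000]:
    flips = rng.integers(0, 2, size=(trials, n), dtype=np.uint8)
    dev = np.abs(flips.mean(axis=1) - 0.5)              # |S_n/n - mu|
    print(n, (dev >= eps).mean(), 0.25 / (eps**2 * n))  # frequency vs bound
```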

The convergence in (50) is called convergence in probability. We discuss this and other modes
of convergence in Section 14 below. Convergence in probability is weaker than convergence a.s.,
which is why Theorem 13.9 is called the weak law of large numbers. The strong law of large
numbers states that, if X_k, k = 1, 2, . . . are iid with E(|X_1|) < ∞, then

lim_{n→∞} (X_1 + . . . + X_n)/n = µ, a.s. (51)

The strong law is a consequence of the ergodic theorem, Theorem 12.7.

Theorem 13.10 (Chernoff inequality). Let X_1, . . . , X_n be iid random variables and let S_n =
X_1 + . . . + X_n. Then

P(S_n ≥ a) ≤ min_{t>0} e^{−ta} E(e^{tX_1})^n,   P(S_n ≤ a) ≤ min_{t>0} e^{ta} E(e^{−tX_1})^n. (52)

Proof. For t > 0, apply the Markov inequality (48) with e^{tS_n} and e^{ta} in place of X and a to get

P(S_n ≥ a) = P(e^{tS_n} ≥ e^{ta}) ≤ E(e^{tS_n})/e^{ta}.

Since X_1, . . . , X_n are iid,

E(e^{tS_n}) = E(e^{tX_1} · · · e^{tX_n}) = E(e^{tX_1}) · · · E(e^{tX_n}) = E(e^{tX_1})^n.

The second inequality in (52) follows by replacing S_n with −S_n and a with −a.

The Chernoff inequalities are sharper than the Markov inequality, but they require the
assumption of independence. For example, consider the infinite coin toss, where X_1, . . . , X_n are
iid Bernoulli-1/2, that is, P(X_i = 1) = P(X_i = 0) = 1/2. Then, since

E(e^{tX_1}) = 1/2 + e^t/2 = 1 + (e^t − 1)/2 ≤ e^{(e^t − 1)/2},

Chernoff's inequality gives

P(S_n ≥ (1 + δ)n/2) ≤ min_{t>0} e^{−tn(1+δ)/2} e^{n(e^t − 1)/2}.

Taking t = log(1 + δ) gives

P(S_n ≥ (1 + δ)n/2) ≤ e^{nδ/2}/(1 + δ)^{n(1+δ)/2} = ( e^δ/(1 + δ)^{1+δ} )^{n/2}.

To compare the Markov and Chernoff inequalities here, take δ = 1. Then the calculation above
gives P(S_n ≥ n) ≤ (e/4)^{n/2}, while the Markov inequality gives P(S_n ≥ n) ≤ 1/2. Note that the
actual probability is P(S_n ≥ n) = (1/2)^n.
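Plugging in numbers makes the comparison concrete (our sketch):

```python
import numpy as np

# Bounds on P(S_n >= n) for n fair coin flips (the delta = 1 case above):
# Markov gives 1/2, Chernoff gives (e/4)^(n/2), and the exact probability
# is (1/2)^n, since all n tosses must come up heads.
for n in [10, 50, 100]:
    markov = 0.5
    chernoff = (np.e / 4) ** (n / 2)
    exact = 0.5 ** n
    print(f"n={n:3d}  Markov={markov:.2e}  Chernoff={chernoff:.2e}  exact={exact:.2e}")
```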

14 Modes of convergence

The most common type of convergence we have considered above is pointwise convergence:

Definition 14.1. A sequence X_n of random variables converges to X a.s. if

P( lim_{n→∞} X_n = X ) = 1,

written X_n →^{a.s.} X.

X_n →^{a.s.} X means that X_n → X pointwise, up to a set of probability zero.

Definition 14.2. A sequence X_n of random variables converges to X in mean if

lim_{n→∞} E(|X_n − X|) = 0,

written X_n →^{L¹} X.

Sometimes this is called convergence in L¹. More generally, if lim_{n→∞} E(|X_n − X|^p) = 0,
we say X_n converges to X in L^p. There is no direct relationship between convergence a.s. and
convergence in mean. However, note that convergence a.s. implies convergence in mean under the
assumptions in the monotone and dominated convergence theorems. Both of these theorems
generalize in a straightforward way to convergence in L^p. Conversely, convergence in mean
implies convergence a.s. along a subsequence:

Proposition 14.3. If X_n →^{L¹} X, then there is a subsequence X_{n_k} such that X_{n_k} →^{a.s.} X.

Proof. If X_n →^{L¹} X, we may choose n_k such that E(|X_{n_k} − X|) < 2^{−k}. The result then follows
from Lemma 11.18.

We now introduce the type of convergence from the weak law of large numbers (Theorem 13.9):

Definition 14.4. A sequence X_n of random variables converges to X in probability if

lim_{n→∞} P(|X_n − X| ≥ ε) = 0

for each ε > 0, written X_n →^{p} X.

Convergence in probability is weaker than convergence a.s., as the next proposition shows.

Proposition 14.5. If X_n →^{a.s.} X, then X_n →^{p} X. The converse fails in general.

Proof. Suppose X_n →^{a.s.} X and let ε > 0 and δ > 0. Using Egoroff's theorem (Theorem 11.20),
choose A such that P(A) < δ and X_n → X uniformly on A^c. Choose N such that n ≥ N implies
|X_n(ω) − X(ω)| < ε whenever ω ∈ A^c. Then n ≥ N implies P(|X_n − X| ≥ ε) ≤ P(A) < δ. Since
δ was arbitrary, lim_{n→∞} P(|X_n − X| ≥ ε) = 0. To see the converse does not hold, let P be the
uniform measure on [0, 1], for each n ∈ N write n = 2^m + k with m ∈ N and k ∈ {0, . . . , 2^m − 1}, and
consider the sequence X_n = 1_{[k/2^m, (k+1)/2^m]}. Then P(|X_n| ≥ ε) ≤ 2^{−m} → 0 as n → ∞ whenever
0 < ε ≤ 1, so X_n →^{p} 0. However, for each ω ∈ [0, 1], X_n(ω) = 1 for infinitely many n, and so
X_n does not converge to 0 a.s.
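The "typewriter" sequence in this counterexample is easy to explore numerically (our sketch):

```python
# The typewriter sequence from the proof: X_n = 1_{[k/2^m,(k+1)/2^m]} where
# n = 2^m + k, 0 <= k < 2^m. Fix a point w and scan each dyadic block.
def X(n, w):
    m = n.bit_length() - 1        # n = 2^m + k with 0 <= k < 2^m
    k = n - 2**m
    return 1.0 if k / 2**m <= w <= (k + 1) / 2**m else 0.0

w = 0.3
for m in range(4, 14):
    block = [X(n, w) for n in range(2**m, 2**(m + 1))]
    # max(block) = 1 in every block: the moving interval passes over w, so
    # X_n(w) = 1 infinitely often and X_n(w) does not converge to 0.
    # Meanwhile P(|X_n| >= eps) = 2^{-m} -> 0: convergence in probability.
    print(m, max(block), 2.0 ** -m)
```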

Proposition 14.6. If X_n →^{L¹} X, then X_n →^{p} X. The converse fails in general.

Proof. Suppose X_n →^{L¹} X. Since ε 1_{{|X_n − X| ≥ ε}} ≤ |X_n − X|, we have P(|X_n − X| ≥ ε) ≤
ε^{−1} E(|X_n − X|) → 0 as n → ∞. Thus, X_n →^{p} X. To see the converse does not hold, let P be
uniform on [0, 1] and set X_n = n 1_{[0, 1/n]}. Then P(|X_n| ≥ ε) = 1/n whenever ε < 1, so X_n →^{p} 0,
but E(|X_n|) = 1 for each n, so X_n does not converge to 0 in mean.
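This counterexample can also be checked directly (our sketch):

```python
import numpy as np

# The counterexample above: X_n = n * 1_{[0,1/n]} under the uniform measure
# on [0,1]. Then P(|X_n| >= eps) = 1/n -> 0, yet E|X_n| = 1 for every n.
rng = np.random.default_rng(4)
w = rng.random(10**6)              # uniform samples from [0, 1]

for n in [10, 100, 1_000]:
    Xn = n * (w <= 1.0 / n)
    print(n, (Xn >= 0.5).mean(), Xn.mean())  # ~1/n and ~1
```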

The last type of convergence we consider has to do with distributions:

Definition 14.7. A sequence X_n of random variables converges to X in distribution, written
X_n →^{d} X, if the CDFs F_n of X_n converge to the CDF F of X at every point where F is
continuous.
Exercise 14.8 (Optional). Let X_n be random variables with P(X_n = n²) = 1/n and P(X_n =
0) = 1 − 1/n. Decide in what sense(s) X_n converges to X ≡ 0, or whether not enough information
is given.

Exercise 14.9 (Optional). Let X_n be independent random variables with P(X_n = 1) = 1/n and
P(X_n = 0) = 1 − 1/n. Decide in what sense(s) X_n converges to X ≡ 0, or whether not enough
information is given.
An important characterization of convergence in distribution is:

Theorem 14.10. X_n →^{d} X if and only if

lim_{n→∞} E(g(X_n)) = E(g(X)) for all bounded continuous g : R → R. (53)

Proof. Assume (53) holds, let δ > 0, and let x be a point at which the CDF of X is continuous.
Let g_δ : R → R be a continuous function with values in [0, 1] such that g_δ(t) = 1 on (−∞, x]
and g_δ(t) = 0 on (x + δ, ∞). By assumption, for each δ,

lim_{n→∞} E(g_δ(X_n)) = E(g_δ(X)).

By the choice of g_δ we also have

P(X_n ≤ x) ≤ E(g_δ(X_n)),   E(g_δ(X)) ≤ P(X ≤ x + δ).

Thus,

lim sup_{n→∞} P(X_n ≤ x) ≤ lim_{n→∞} E(g_δ(X_n)) = E(g_δ(X)) ≤ P(X ≤ x + δ).

Since δ > 0 was arbitrary, by continuity from above,

lim sup_{n→∞} P(X_n ≤ x) ≤ P(X ≤ x).

Similar arguments show that lim sup_{n→∞} P(X_n ≥ x) ≤ P(X ≥ x), so by taking complements,

lim inf_{n→∞} P(X_n < x) ≥ P(X < x).

Thus,

P(X ≤ x) ≥ lim sup_{n→∞} P(X_n ≤ x) ≥ lim sup_{n→∞} P(X_n < x)
         ≥ lim inf_{n→∞} P(X_n < x) (54)
         ≥ P(X < x).

Since F is continuous at x, we have equality in (54); in particular, lim_{n→∞} F_n(x) = F(x). This
proves the "if" part. Consider the "only if" part. Let F and F_n be the CDFs of X and X_n,
respectively, suppose X_n →^{d} X, and let g : R → R be bounded and continuous. Let ε > 0 and
choose an interval [a, b] such that F is continuous at a and b and P(X ∈ [a, b]) ≥ 1 − ε. We may
do this since lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1 and, as F is nondecreasing, its set of
discontinuities is countable. Now choose points x_1 < . . . < x_k in [a, b] such that for each j, F is
continuous at x_j and x, y ∈ [x_j, x_{j+1}] implies |g(x) − g(y)| < ε. Uniform continuity of g on
closed bounded intervals ensures this is possible. Set

S = Σ_{j=1}^{k−1} g(x_j)(F(x_{j+1}) − F(x_j)),   S_n = Σ_{j=1}^{k−1} g(x_j)(F_n(x_{j+1}) − F_n(x_j)).

Let M = sup |g|, and let dF be the measure on R such that dF((a, b]) = F(b) − F(a); see
Theorem 5.16. By Exercise 6.15,

|E[g(X)] − S| = | ∫_{−∞}^{∞} g(x) dF(dx) − S |
             ≤ | ∫_a^b g(x) dF(dx) − S | + M P(X ∉ [a, b])
             ≤ (M + 1)ε.

Similarly, let dF_n be the measure on R such that dF_n((a, b]) = F_n(b) − F_n(a); by Exercise 6.15,

|E[g(X_n)] − S_n| = | ∫_{−∞}^{∞} g(x) dF_n(dx) − S_n |
               ≤ | ∫_a^b g(x) dF_n(dx) − S_n | + M P(X_n ∉ [a, b])
               ≤ ε + M P(X_n ∉ [a, b]),

and since P(X_n ∉ [a, b]) ≤ 1 − (F_n(b) − F_n(a)) → 1 − (F(b) − F(a)) ≤ ε, we get
lim sup_{n→∞} |E[g(X_n)] − S_n| ≤ (M + 1)ε.
As the x_j's are continuity points of F, we have lim_{n→∞} F_n(x_j) = F(x_j) and so lim_{n→∞} S_n = S.
Together with the last two displays, this completes the proof.

Proposition 14.11. If X_n →^{p} X, then X_n →^{d} X. The converse fails in general.

Proof. Let ε > 0, let g : R → R be bounded and continuous, and set M = 2 sup |g|. Then

|g(X_n) − g(X)| ≤ ε + M 1_{{|g(X_n) − g(X)| ≥ ε}}.

Thus,

E(|g(X_n) − g(X)|) ≤ ε + M P(|g(X_n) − g(X)| ≥ ε). (55)

Let

B_δ = {x ∈ R : ∃ y ∈ R s.t. |x − y| < δ and |g(x) − g(y)| ≥ ε}.

By continuity of g, for each x ∈ R there is δ_x > 0 such that |x − y| < δ_x implies |g(x) − g(y)| < ε.
Thus, the B_{1/n} are decreasing with ∩_{n=1}^∞ B_{1/n} = ∅. Now if X_n →^{p} X, then for each δ > 0,

P(|g(X_n) − g(X)| ≥ ε) ≤ P(|X_n − X| ≥ δ) + P(X ∈ B_δ). (56)

The first term on the RHS of (56) goes to 0 as n → ∞ for each δ > 0, and the second goes to 0
as δ → 0 by continuity from above. Since δ > 0 was arbitrary, we conclude lim_{n→∞} P(|g(X_n) −
g(X)| ≥ ε) = 0. From (55) it follows that lim sup_{n→∞} E(|g(X_n) − g(X)|) ≤ ε; since ε > 0 was
arbitrary, lim_{n→∞} E(g(X_n)) = E(g(X)), and so X_n →^{d} X by Theorem 14.10.
To see the converse fails, take P uniform on Ω = [0, 1] and let X_n(ω) = ω for n odd and
X_n(ω) = 1 − ω for n even. Then each X_n has the same CDF, so X_n →^{d} X_1. However, the X_n do not
converge in probability (we leave this as an exercise).

14.1 Central limit theorem

Consider an iid sequence X_k, k = 1, 2, . . . of random variables with mean E(X_1) = µ. Recall
that the strong law of large numbers (Theorem 11.13) says that

(X_1 + . . . + X_n − nµ)/n →^{a.s.} 0 as n → ∞.

This says that an average of iid random variables concentrates around its mean. If we change
the scaling, we get an idea of the deviation around the mean. The central limit theorem states
that (under appropriate conditions)

(X_1 + . . . + X_n − nµ)/(σ√n) →^{d} Z, (57)

where σ² = Var(X_1) and Z is a standard normal random variable:
Definition 14.12. A standard normal random variable has CDF

F(x) = ∫_{−∞}^x e^{−t²/2}/√(2π) dt. (58)
The standard normal random variable has a bell-shaped PDF (namely the integrand in (58)),
often called a Gaussian. For example, consider a simple random walk, where Z_n = X_1 + . . . + X_n
and the X_i's are independent with P(X_i = 1) = P(X_i = −1) = 1/2. Then by (57),

Z_n/√n →^{d} Z

with Z a standard normal random variable. This means the deviation of a random walk from 0
at long times has a bell-shaped distribution. As another example, suppose X_i is the response
to a survey question from person i. Assuming responses are independent, then (57) shows that
for large n,

P( |(X_1 + . . . + X_n)/n − µ| ≥ zσ/√n ) ≈ P(|Z| ≥ z) = 2 ∫_z^∞ e^{−t²/2}/√(2π) dt. (59)
Suppose we want to be 90% confident about the mean response. Then by choosing z such that
the RHS of (59) equals 0.1, we can conclude that

(X_1 + . . . + X_n)/n ∈ ( µ − zσ/√n, µ + zσ/√n )

with 90% confidence, that is, about 90% of the time if we were to repeat the survey many times.
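Concretely (our sketch, with made-up survey numbers; norm.ppf from scipy inverts the standard normal CDF), the 90% value of z solves P(|Z| ≥ z) = 0.1, i.e. F(z) = 0.95:

```python
import numpy as np
from scipy.stats import norm

# Solve P(|Z| >= z) = 0.1, i.e. F(z) = 0.95, for the 90% confidence level.
z = norm.ppf(0.95)  # ~1.645

# Hypothetical survey data (made-up numbers, for illustration only).
rng = np.random.default_rng(5)
responses = rng.integers(1, 6, size=400).astype(float)  # 400 answers on a 1-5 scale
mean = responses.mean()
sigma = responses.std(ddof=1)        # sample estimate of sigma
half = z * sigma / np.sqrt(responses.size)
print(f"90% CI for the mean: [{mean - half:.3f}, {mean + half:.3f}]")
```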
Our proof of the central limit theorem will use moment generating functions:

Definition 14.13. The moment generating function of a random variable X is φ_X(t) = E(e^{tX}).

Note that these appeared already in the Chernoff inequalities (Theorem 13.10). The reason
for the name is that, if φ_X is finite in a neighborhood of 0,

(d^n/dt^n) φ_X(t) |_{t=0} = E(X^n) =: the nth moment of X.

Note that, from Jensen's inequality, φ_X(t) ≥ e^{tE(X)}, and by Markov's inequality (for t > 0), φ_X(t) ≥
e^{ta} P(X ≥ a). The moment generating function φ_X can be understood as the Laplace transform
of X. A key property of the Laplace transform is that it is invertible under very general conditions. We
borrow without proof the following result from Laplace transform theory:

Theorem 14.14. If X, Y are random variables such that φ_X = φ_Y < ∞ in a neighborhood of
0, then X and Y have the same distribution.

We will also use the following result:

Theorem 14.15. Suppose X_1, X_2, . . . and X are random variables such that lim_{n→∞} φ_{X_n} =
φ_X < ∞ in a neighborhood of 0. Then X_n →^{d} X.

The proof of Theorem 14.15 relies on Prokhorov's theorem. We omit the proof.

Exercise 14.16 (Optional). Use Theorem 14.15 to prove the following version of the law of
large numbers: If X_k, k = 1, 2, . . . are iid with MGF finite in a neighborhood of 0, then
(X_1 + . . . + X_n)/n →^{d} E(X_1).

Exercise 14.17 (Optional). Find the MGF of a standard normal random variable.
Theorem 14.18 (Central limit theorem). Let X_k, k = 1, 2, . . . be iid with MGF finite in a
neighborhood of 0. Then µ = E(X_1) < ∞, σ² = Var(X_1) < ∞, and

(X_1 + . . . + X_n − nµ)/(σ√n) →^{d} Z,

with Z a standard normal random variable.

The assumption that the MGF of X_1 is finite in a neighborhood of 0 is not needed, but it
makes the proof easy. It is enough to assume the mean µ and variance σ² of X_1 are finite.

Proof of Theorem 14.18. Assume WLOG µ = 0. Let φ be the MGF of X_1. By Taylor expansion,

φ(s) = 1 + s²σ²/2 + O(s³).

Let φ_n be the MGF of (X_1 + . . . + X_n)/(σ√n). By independence of the X_i's,

φ_n(t) = φ( t/(σ√n) )^n = ( 1 + t²/(2n) + O(n^{−3/2}) )^n → exp(t²/2) as n → ∞.

The last expression is the MGF of a standard normal random variable (see Exercise 14.17). By
Theorem 14.15, we are done.
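To close, a simulation of the simple random walk example from earlier in this section (our sketch) shows the normalized sums approaching the standard normal CDF (58):

```python
import numpy as np
from math import erf, sqrt

# Simulated check of Theorem 14.18 for the simple random walk: X_i = +-1
# with probability 1/2 each, so mu = 0, sigma = 1 and Z_n/sqrt(n) should be
# approximately standard normal for large n.
rng = np.random.default_rng(6)
n, trials = 1_000, 20_000
steps = rng.integers(0, 2, size=(trials, n), dtype=np.int8) * 2 - 1
Zn = steps.sum(axis=1) / np.sqrt(n)

F = lambda x: 0.5 * (1 + erf(x / sqrt(2)))   # the standard normal CDF (58)
for x in [-1.0, 0.0, 1.0]:
    print(x, (Zn <= x).mean(), F(x))         # empirical CDF vs F(x)
```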
