Measure Theory
David Aristoff
January 2017
1 Introduction to measure
A measure µ assigns a scalar size to certain subsets of a space. The ingredients we need are:
• A space, E
• A collection E of measurable subsets of E
• A rule, µ, for assigning a size to each measurable set.
Formally, µ : E → [0, ∞]. (Sets have nonnegative but possibly infinite measure.) Elements of E
are called measurable sets. Of course, µ must satisfy some natural properties. The most obvious
is that the measure of the union of two disjoint sets equals the sum of the measures of each set
separately:
µ(A ∪ B) = µ(A) + µ(B) for disjoint A, B ∈ E. (1)
This is called finite additivity. Actually, we'll assume an analogous property for countable unions:

µ(∪∞k=1 Ak) = ∑∞k=1 µ(Ak) for disjoint A1, A2, . . . ∈ E. (2)

The latter property is called countable additivity. Though it isn't explicit in (2), for the equation
above to make sense, a countable disjoint union of measurable sets must be measurable:

A1, A2, . . . ∈ E disjoint ⇒ ∪∞k=1 Ak ∈ E.
Both (1) and (2) are geometrically reasonable, so why do we assume (2) instead of (1)? The
reason is that countable additivity allows us to define measures of complicated sets from simpler
ones by taking limits. Of course, countable additivity implies finite additivity.
Example 1.1. (Discrete measure.) E = a countable set, E = 2E = all subsets of E, and for
each x ∈ E, µ({x}) is an arbitrary fixed value in [0, ∞]. Then by countable additivity,

µ(A) = ∑x∈A µ({x}) for A ⊆ E.

Thus, the measure of any set is the sum of the measures of its singleton subsets.
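To make Example 1.1 concrete, here is a minimal Python sketch of a discrete measure, where the measure of a set is the sum of point masses. The particular weights below are an illustrative choice, not part of the original notes.

    # A discrete measure on E = {0, 1, 2, ...} is determined by its point masses mu({x}).
    # Illustrative choice: mu({x}) = 2^-(x+1), so that mu(E) = 1.
    def point_mass(x):
        return 2.0 ** (-(x + 1))

    def mu(A):
        """Measure of a finite subset A of E, computed by additivity."""
        return sum(point_mass(x) for x in A)

    print(mu({0}))        # 0.5
    print(mu({0, 1, 2}))  # 0.875
    print(mu(set()))      # 0.0, consistent with mu(empty set) = 0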
The trickiest question here is somewhat hidden. Once we define measure on some simple
sets, property (2) essentially leads to a rule for assigning measure to general sets. The trouble
is in deciding what these sets will be, that is, what sets are consistent with (2). It is tempting
to just set E = 2E to be all subsets of E. While this is fine in the countable space case (see
Example 1.1), there are many difficulties in the uncountable case (e.g. E = R). At the moment,
we will only assume that the whole space E and the empty set ∅ are measurable. We would
also like to have µ(∅) = 0; this is guaranteed whenever at least one set has finite measure, an
assumption we adopt from here on.
In the case of discrete (or countable) space, it is easy to consistently define a measure by assigning
values to each point. The continuous (or uncountable) space case is surprisingly subtle. Below
we will consider E = Rn. Our first attempt at defining a measure uses a simple idea: that
a set can be approximately measured by filling it up with finitely many tiny non-overlapping
rectangles, and then summing the volumes of the rectangles. This leads to Jordan measure,
which is not a true measure, as its measurable sets are not closed under disjoint countable
unions.
Consider E = Rn. For our simple sets we take the half-open rectangles

S = {[a1, b1) × . . . × [an, bn) : ai ≤ bi, i = 1, . . . , n},

with volumes µ([a1, b1) × . . . × [an, bn)) = (b1 − a1) · · · (bn − an).
(We allow ai = bi, in which case the rectangle is the empty set.) The reason for using half-open
intervals (instead of, say, open or closed intervals) is that any finite union of elements of S can be
written as a disjoint finite union of elements of S. This allows us to use additivity to measure
such sets. Intuitively, to measure a more general bounded set A, we can fill in most of the
inside of A with finitely many rectangles from S, and then cover the remaining boundary of
A with finitely many smaller rectangles. If the boundary is small, A is measurable. Formally,
define
µ∗(A) = inf_{S1∪...∪Sk⊇A} ∑ki=1 µ(Si),

µ_∗(A) = sup_{S1∪...∪Sk⊆A} ∑ki=1 µ(Si),

for bounded A, where the inf and sup are over disjoint sets S1, . . . , Sk ∈ S, with k arbitrary.
Bounded sets are declared measurable when these two quantities agree:

E = {A ⊆ E bounded : µ∗(A) = µ_∗(A)}, (4)

in which case we set µ(A) = µ∗(A) = µ_∗(A). The resulting µ is called Jordan measure. (Often
it is called Jordan content because it is not a true measure; see Exercise 1.2.) Jordan measure
has some nice properties. Notably, it is congruence invariant: applying rotations, reflections or
translations to a set A ∈ E will not change its measure (or measurability). However, it turns
out that Jordan measure is not countably additive, in the sense that its measurable sets are not
closed under disjoint countable unions. (Jordan measure is, however, finitely additive.)
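As a numerical illustration (an addition, not from the original notes), the following Python sketch approximates the inner and outer Jordan sums of a disk using a grid of tiny squares; the membership test by corners is a crude but serviceable choice for this convex set. Both sums approach the disk's area π.

    def jordan_sums(inside, lo, hi, n):
        """Approximate inner/outer Jordan sums of a bounded set in [lo,hi)^2,
        using a grid of n*n half-open squares. A square counts toward the inner
        sum if all four corners lie in the set, and toward the outer sum if any
        corner does (a crude but illustrative membership test)."""
        h = (hi - lo) / n
        inner = outer = 0.0
        for i in range(n):
            for j in range(n):
                corners = [(lo + (i + di) * h, lo + (j + dj) * h)
                           for di in (0, 1) for dj in (0, 1)]
                flags = [inside(x, y) for x, y in corners]
                if all(flags):
                    inner += h * h
                if any(flags):
                    outer += h * h
        return inner, outer

    disk = lambda x, y: x * x + y * y < 1.0
    for n in (50, 200, 800):
        print(n, jordan_sums(disk, -1.0, 1.0, n))  # both sums approach pi = 3.14159...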
Exercise 1.2. (Jordan measure is not a true measure.) Consider E defined by (4). Show that
single points are in E, but Q ∩ [0, 1] ∉ E.
Actually Jordan measurable sets are not closed under disjoint countable unions by a simpler
argument: the intervals [n, n + 1) are in E but their countable union, R, is not in E since it isn't
bounded. Exercise 1.2 shows Jordan measure is not countably additive even when we restrict
to bounded sets.
Exercise 1.3. Consider E defined by (4) and let f : [0, 1] → R be continuous. Prove that the region

A = {(x, y) : 0 ≤ x < 1, 0 ≤ y ≤ f(x)}

is Jordan measurable, i.e., A ∈ E.
Having tried and failed at a naive theory, we now proceed to formal measure theory.
2 σ-algebras and measures

Definition 2.1. A σ-algebra E over a set E is a collection of subsets of E such that E ∈ E, E
is closed under complements (A ∈ E implies E \ A ∈ E), and E is closed under countable unions.

Exercise 2.2. Show that if E is a collection of subsets of E that is closed under relative com-
plements and disjoint countable unions, then E is closed under countable unions.
Recall from Section 1 that we want to define measure by starting from simple sets. This
means our σ-algebra should contain all the simple sets. If the σ-algebra is minimal with respect
to this property, it is said to be generated by the simple sets.
Definition 2.3. The σ-algebra generated by a collection C of subsets of E, denoted σ(C), is the
smallest σ-algebra which contains C.

Explicitly, σ(C) is the intersection of all σ-algebras that contain C. The intersection is over at
least one σ-algebra, since 2E contains all possible subsets of E. We leave it as an exercise to
show that such an intersection is always a σ-algebra.
Exercise 2.4. Let G be a nonempty collection of σ-algebras over E. Show that

E := ∩F∈G F = {A : A ∈ F for all F ∈ G}

is a σ-algebra.
One σ-algebra will be of particular interest to us:

Definition 2.5. The σ-algebra generated by the open sets in Rn is the Borel σ-algebra.
Elements of the Borel σ -algebra are called Borel sets. Which sets are Borel? Single points,
the rationals, the irrationals, and rectangles (of any type) are all examples of Borel sets. Actually,
half-open rectangles generate the Borel σ -algebra (Exercise 2.6). It turns out that not all subsets
of Rn are Borel, though; see the discussion below Theorem 2.9.
Exercise 2.6. Let E = Rn , and let E be the σ-algebra generated by all half-open rectangles of
the form [a1 , b1 ) × . . . × [an , bn ). Show that E is the Borel σ-algebra.
We can easily see the Borel σ -algebra contains some sets that do not have Jordan measure.
Exercise 2.7. Show that the fat Cantor set of Exercise 1.4 is a Borel set.
We are now ready to give a precise definition of measure.

Definition 2.8. Let E be a σ-algebra over a set E. A measure on (E, E) is a function
µ : E → [0, ∞] such that µ(∅) = 0 and µ is countably additive:

µ(∪∞k=1 Ak) = ∑∞k=1 µ(Ak) for disjoint A1, A2, . . . ∈ E. (5)
As above, elements of E are called measurable sets. Here and below (E, E) is called a
measurable space, and (E, E, µ) is called a measure space. Note that the only thing we added to
the discussion in Section 1 is the explicit structure of E.
Our next step is to define a natural (true) measure on E = Rn. Here, by natural we mean
the measure should be congruence invariant (like Jordan measure) and the unit cube should
have measure 1. The following result furnishes the desired µ:
Theorem 2.9. (Borel measure.) Let E = Rn and E the Borel σ-algebra over E . There is a
unique measure µ on (E, E) whose measure on half-open rectangles is
µ([a1 , b1 ) × . . . × [an , bn )) = (b1 − a1 ) . . . (bn − an ). (6)
Moreover, µ is congruence-invariant.
The µ in Theorem 2.9 is called Borel measure. The proof of Theorem 2.9 is surprisingly
technical, considering the simplicity of the ideas considered so far. For this reason we will
postpone the proof to Section 3. For the remainder of this section we will discuss two deficiencies
of the Borel measure.

The main deficiency is that not all subsets of E = Rn are Borel sets. This problem cannot
be fixed, at least in the ZFC framework, as we show in the following discussion. We will focus
on dimension n = 1, though the arguments generalize to any dimension. Consider the unit circle
S1 as [0, 1] with mod 1 arithmetic. Define an equivalence relation on S1 by

x ∼ y ⇐⇒ x − y ∈ Q ∩ [0, 1],

where the subtraction is mod 1. Choose a set R of representatives. For a set S and a point
p, define p + S = {p + s : s ∈ S}. Each equivalence class for our relation on S1 has the form
r + Q ∩ [0, 1] where r ∈ R and the addition is mod 1. Since the equivalence classes partition S1,
[0, 1] = ∪r∈R (r + Q ∩ [0, 1]) = ∪r∈R, q∈Q∩[0,1] {r + q} = ∪q∈Q∩[0,1] (q + R),
where the union on the right hand side is disjoint, and the addition is mod 1. By congruence
invariance (in particular, translation invariance) and countable additivity,
µ(∪q∈Q∩[0,1] (q + R)) = ∑q∈Q∩[0,1] µ(q + R) = ∑q∈Q∩[0,1] µ(R).
The right hand side here is either 0 or ∞. However, µ([0, 1]) = 1. This shows that R is not a
Borel measurable set. Such sets are called Vitali sets. (The interval [0, 1] is not special here: a
nonmeasurable Vitali-type set can be constructed from any set of positive Borel measure. Think
about why this is true...)
The second deficiency of Borel measure is that some sets of Borel measure zero have
non-Borel subsets (see Exercise 2.10). This is strange, since we'd expect subsets of measure zero
sets to also have measure zero. This deficiency is remedied with Lebesgue measure. Lebesgue
measure is an extension of Borel measure in which subsets of measure zero sets are always
measurable (with measure 0). For this reason Lebesgue measure is called the completion of
Borel measure.
One way to see that not all subsets of measure zero Borel sets are Borel is the following
cardinality argument. The cardinality of the Borel sets is the same as the cardinality of R, while
the Cantor set C has the cardinality of R so its power set 2C has strictly larger cardinality. That
means some subset of the Cantor set must be non-Borel. (The Cantor set has Borel measure 0;
think about why.) An explicit construction is in Exercise 2.10.
Exercise 2.10. Let c : [0, 1] → [0, 1] be the Cantor function and define ψ(x) = x + c(x). Show
that ψ is invertible and maps Borel sets to Borel sets. Use ψ to construct a non-Borel set A
such that A ⊆ B where B is a Borel set with µ(B) = 0. (Hint: let C be the Cantor set and look
at ψ(C).)
The Vitali set example shows that a (countably additive) measure on E = Rn that is
translation invariant and has nontrivial values on rectangles cannot assign measure to every
subset of E. Can this be fixed by replacing countable additivity by finite additivity? The
Banach-Tarski paradox says the answer is no: a sphere can be cut into finitely many pieces,
which can be reconstructed using only congruences to produce two copies of the original sphere,
thus violating finite additivity.
3 Existence and uniqueness of Borel measure
In this section we prove Theorem 2.9. The proof is somewhat long so we organize it as follows.
First, we show existence using Carathéodory's extension theorem in Section 3.1. We postpone
the proof of Carathéodory's theorem, which is quite technical, to Section 3.4. In Section 3.2, we
prove uniqueness using Dynkin's lemma. Throughout we will focus on E = R so that notation
is simpler. However, all of the arguments can easily be generalized to E = Rn .
3.1 Existence

Definition 3.1. A ring over E is a collection A of subsets of E such that ∅ ∈ A and A is closed
under relative complements and finite unions: if A, B ∈ A, then B \ A ∈ A and A ∪ B ∈ A.

Over E = R, the ring A we have in mind is the set of finite disjoint unions of half-open
intervals. (Recall these were the sets we used to define Jordan measure!)

Exercise 3.2. Let A be the set of finite unions of half-open intervals of the form [a, b). (We
allow a = b, in which case [a, b) = ∅.) Show that A is a ring.
Exercise 3.3. Show that rings are closed under finite intersections.
Existence of Borel measure comes from Carathéodory's extension theorem:

Theorem 3.4. (Carathéodory) Let A be a ring over E and suppose µ : A → [0, ∞] satisfies µ(∅) = 0 and

µ(∪∞k=1 Ak) = ∑∞k=1 µ(Ak) (7)

whenever A1, A2, . . . ∈ A are disjoint with ∪∞k=1 Ak ∈ A. Then µ extends to a measure on σ(A).
Proof of existence in Theorem 2.9. Let A be the ring of finite unions of half-open intervals.
Exercise 2.6 shows that A generates the Borel σ-algebra. It is easy to see that we can define µ
on A satisfying finite additivity and µ([a, b)) = b − a. Thus, all we must do is demonstrate
the countable additivity condition in Theorem 3.4. Let A1, A2, . . . ∈ A be disjoint and suppose
A := A1 ∪ A2 ∪ . . . ∈ A. By finite additivity, it suffices to show µ(A) = limk→∞ µ(A1 ∪ . . . ∪ Ak).
Using finite additivity again, this is equivalent to showing limk→∞ µ(A \ (A1 ∪ . . . ∪ Ak)) = 0.
Define Bk := A \ (A1 ∪ . . . ∪ Ak). Suppose for contradiction that there is ε > 0 such that
µ(Bk) ≥ ε for all k. For each k we can pick Ck ∈ A such that C̄k ⊆ Bk (the bar denotes closure)
and µ(Bk \ Ck) ≤ ε2^{-k-1}. Then, by subadditivity and since Bk ⊆ Bj for j ≤ k,

µ(Bk \ (C1 ∩ . . . ∩ Ck)) ≤ µ((B1 \ C1) ∪ . . . ∪ (Bk \ Ck)) ≤ ε/2.

Thus, µ(C1 ∩ . . . ∩ Ck) ≥ ε/2, so in particular C1 ∩ . . . ∩ Ck is nonempty for each k. By construction,

∩∞k=1 C̄k ⊆ ∩∞k=1 Bk = ∅.

The LHS above is nonempty by Cantor's intersection theorem, which is our desired contradiction.
We used two properties of µ in the proof above that easily follow from finite additivity of µ,
but that we have not explicitly stated thus far. One is monotonicity, which states that if A, B
are in the domain of µ and A ⊆ B, then µ(A) ≤ µ(B). The other is subadditivity, which states
that if A, B and A ∪ B are in the domain of µ, then µ(A ∪ B) ≤ µ(A) + µ(B).
3.2 Uniqueness

Recall that a π-system over E is a collection of subsets of E that contains ∅ and is closed under
finite intersections.

Exercise 3.8. Let A be the set of half-open intervals of the form [a, b). Show that A is a
π-system.
A d-system over E is a collection D of subsets of E such that:
• E ∈ D
• if A, B ∈ D with A ⊆ B, then B \ A ∈ D
• if A1, A2, . . . ∈ D is an increasing sequence, then ∪∞k=1 Ak ∈ D
The d-system we will look at will be introduced shortly in the proof of uniqueness of Borel
measure. Roughly speaking, the d-system will consist of all sets on which two candidate Borel
measures have the same values. The proof of uniqueness relies on Dynkin's lemma below. We
leave two ingredients of the proof as exercises.
Exercise 3.10. Show if A is a π-system and a d-system over E , then A is a σ-algebra over E .
Exercise 3.11. Let D be a d-system over E and fix C ∈ D. Show that

DC = {A ⊆ E : A ∩ C ∈ D}

is a d-system over E.
Lemma 3.12. (Dynkin) Let A be a π-system over E. Then any d-system over E that contains
A must also contain σ(A).

Proof of Lemma 3.12. Let D be the smallest d-system that contains A. That is, D is the
intersection of all d-systems containing A. (Similar to Exercise 2.4, it is easy to check D is a
d-system.) By Exercise 3.10, it suffices to show D is also a π-system. Since d-systems contain
∅, we only have to show D is closed under finite intersections. Let C ∈ A and define

DC = {A ⊆ E : A ∩ C ∈ D}. (8)

By Exercise 3.11, DC is a d-system, and DC ⊇ A since A is a π-system containing C. By
minimality of D, then, D ⊆ DC; that is, A ∩ C ∈ D whenever A ∈ D and C ∈ A. Now fix
C ∈ D instead. By what we just showed, DC ⊇ A, so minimality again gives D ⊆ DC. That
is, D is closed under finite intersections.
Proof of uniqueness in Theorem 2.9. Let E = R and E the Borel σ-algebra, and let µ1 and µ2
be two measures on (E, E) such that µ1([a, b)) = µ2([a, b)) = b − a for each half-open interval
[a, b). For n = 1, 2, . . . define

Dn = {A ∈ E : µ1(A ∩ [−n, n)) = µ2(A ∩ [−n, n))}.

Using countable additivity and the fact that µi([−n, n)) < ∞, one checks that each Dn is a
d-system, and Dn contains the π-system of half-open intervals. By Dynkin's lemma, Dn ⊇ E
for each n. Letting n → ∞ and using countable additivity again, µ1 = µ2.
The following fact, established by the same d-system argument as in the proof above, will be useful below:
Proposition 3.13. If µ, ν are measures on (E, E) with µ(E) = ν(E) < ∞, and µ, ν agree on
a π-system that generates E , then µ = ν .
3.3 Completion of measures and Lebesgue measure

The aim of this section is to show that Borel measure can be extended to correct the deficiency
discussed at the end of Section 2, namely, that not all subsets of Borel measure 0 are Borel.

Definition 3.14. Let (E, E, µ) be a measure space. A set N ⊆ E is called null if there is some
A ∈ E such that N ⊆ A and µ(A) = 0.

For example, if µ is Borel measure on E = R, then any subset of the Cantor set is a null set.
Recall from Exercise 2.10 that some subsets of the Cantor set are not Borel.
3.4 Proof of Carathéodory's extension theorem

Theorem 3.17. (Carathéodory) Let A be a ring over E and suppose µ : A → [0, ∞] satisfies µ(∅) = 0 and

µ(∪∞k=1 Ak) = ∑∞k=1 µ(Ak) (10)

whenever A1, A2, . . . ∈ A are disjoint with ∪∞k=1 Ak ∈ A. Then µ extends to a measure on σ(A).
Let µ and A be as in the theorem. The main ingredient in the proof of Carathéodory's
theorem is a candidate for the extension of µ. This is the so-called outer measure µ∗, defined by

µ∗(A) = inf_{A⊆∪∞k=1 Ak} ∑∞k=1 µ(Ak) for A ⊆ E, (11)
where the inf is over all sequences A1, A2, . . . in A (the Ak's need not be disjoint). If A is not
contained in any countable union of elements of A we set µ∗(A) = ∞. We must show µ∗ is
countably additive on σ(A). Actually, we will show µ∗ is countably additive on F, where

F = {A ⊆ E : µ∗(C) = µ∗(C ∩ A) + µ∗(C ∩ Ac) for all C ⊆ E}. (12)

(Here and below we write Ac = E \ A for the absolute complement of A in E.) The proof is
finished by showing F is a σ-algebra containing A.
Recall that we defined Jordan measurable sets as those whose inner Jordan measure equals
their outer Jordan measure (see (4)). When µ is Borel or Lebesgue measure on E = R, there is
a similar characterization of F above: if µ_∗(A) = sup_{C⊆A} µ(C), where the sup is over compact
subsets C of A, then A ∈ F if and only if µ_∗(A) = µ∗(A). We will stick to the characterization (12)
due to Carathéodory, since it is a more general construction.
Exercise 3.18. Let µ and A be as in Theorem 3.17 and µ∗ as in (11). Show that the restriction
of µ∗ to A equals µ.
We now turn to the proof of Carathéodory's theorem. The first step will be to show µ∗ is
countably subadditive.
Lemma 3.19. Let A1, A2, . . . ⊆ E and A ⊆ ∪∞k=1 Ak. Then

µ∗(A) ≤ ∑∞k=1 µ∗(Ak). (13)
Proof. Let ε > 0. We may assume µ∗(Ak) < ∞ for each k. For each k pick Bk1, Bk2, . . . ∈ A so
that Ak ⊆ ∪∞ℓ=1 Bkℓ and

µ∗(Ak) ≥ −ε2^{-k} + ∑∞ℓ=1 µ(Bkℓ).

Then since A ⊆ ∪∞k,ℓ=1 Bkℓ,

µ∗(A) ≤ ∑∞k=1 ∑∞ℓ=1 µ(Bkℓ) ≤ ε + ∑∞k=1 µ∗(Ak).

Since ε > 0 was arbitrary, (13) follows.
Now suppose A, B ∈ F and let C ⊆ E be arbitrary. Applying the definition (12) of F repeatedly,

µ∗(C) = µ∗(C ∩ A) + µ∗(C ∩ Ac)
      = µ∗(C ∩ A ∩ B) + µ∗(C ∩ A ∩ Bc) + µ∗(C ∩ Ac)
      = µ∗(C ∩ A ∩ B) + µ∗(C ∩ (A ∩ B)c ∩ A) + µ∗(C ∩ (A ∩ B)c ∩ Ac)
      = µ∗(C ∩ (A ∩ B)) + µ∗(C ∩ (A ∩ B)c).
This shows A ∩ B ∈ F, so F is closed under finite intersections and finite unions. We now show
countable additivity. Let A1, A2, . . . ∈ F be disjoint and set A = ∪∞k=1 Ak. For any B ⊆ E,

µ∗(B) = µ∗(B ∩ A1) + µ∗(B ∩ Ac1)
      = µ∗(B ∩ A1) + µ∗(B ∩ Ac1 ∩ A2) + µ∗(B ∩ Ac1 ∩ Ac2)
      = µ∗(B ∩ A1) + µ∗(B ∩ A2) + µ∗(B ∩ Ac1 ∩ Ac2).
Continuing on in this way,

µ∗(B) = ∑ki=1 µ∗(B ∩ Ai) + µ∗(B ∩ Ac1 ∩ . . . ∩ Ack). (14)

By monotonicity (a special case of Lemma 3.19), µ∗(B ∩ Ac1 ∩ . . . ∩ Ack) ≥ µ∗(B ∩ Ac). Letting
k → ∞ in (14),

µ∗(B) ≥ ∑∞k=1 µ∗(B ∩ Ak) + µ∗(B ∩ Ac) (15)
      ≥ µ∗(B ∩ A) + µ∗(B ∩ Ac),

where the last inequality used Lemma 3.19 again.
where the last inequality used Lemma 3.19 again. The reverse inequality µ∗ (B) ≤ µ∗ (B ∩ A) +
µ∗ (B ∩ Ac ) holds by Lemma 3.19. Thus, µ∗ (B) = µ∗ (B ∩ A) + µ∗ (B ∩ Ac ), which shows A ∈ F.
Putting B = A in (15) now shows that

µ∗(A) = ∑∞k=1 µ∗(Ak),

as desired.
Notice that the main property of µ∗ that we needed to prove the results above was countable
subadditivity. Of course, for a measure, subadditivity follows from countable additivity. That is,
if (E, E, µ) is a measure space and A, A1, A2, . . . ∈ E with A ⊆ ∪∞k=1 Ak, then µ(A) ≤ ∑∞k=1 µ(Ak).
When E = R, µ is Borel measure, and A is the ring of finite unions of half-open intervals,
µ∗ is called the Lebesgue outer measure. In this case, for a Lebesgue measurable set A (defined
in Section 3.3), we have λ(A) = µ∗(A) where λ is Lebesgue measure. Note that the difference
between Lebesgue and Jordan outer measure is that for Lebesgue outer measure, the infimum
in the definition is taken over countable instead of finite unions of half-open intervals. This turns
out to be a crucial difference!
When µ is Borel measure, what sets are in F defined in (12)? The proofs above show that
F contains the Borel σ-algebra. Though we won't prove it, it turns out that F is exactly the
collection of Lebesgue measurable sets. The following exercise shows that at least F contains
all the Lebesgue measurable sets.

Exercise 3.21. Let E = R, µ Borel measure, and F as in (12). Show that null sets are in F.
Notice also that we have not used a Lebesgue inner measure to define measurability. However,
this can be done. Consider Lebesgue measure restricted to [0, 1]. Define the inner measure of
A ⊆ [0, 1] as µ_∗(A) = 1 − µ∗(Ac). Then it turns out that A ∈ F if and only if µ_∗(A) = µ∗(A).
The proof involves a set approximation argument along the lines of the ones above, so we will
skip it.
The construction of outer measure in Theorem 3.17 is interesting in its own right. For later
use, we note that:
Proposition 3.22 (Lebesgue outer measure). If A is Lebesgue measurable, then

µ(A) = µ∗(A) := inf_{A⊆∪∞k=1 Ik} ∑∞k=1 µ(Ik), (16)

where the inf is over all sequences I1, I2, . . . of half-open intervals with A ⊆ ∪∞k=1 Ik.
4 Probability measures
Let (E, E, µ) be a measure space. If µ(E) = 1 then this is called a probability space. In this
setting, instead of (E, E, µ), it is standard to write (Ω, F, P), where now the ingredients are
• A set Ω of outcomes
• A collection of events, F
• A rule, P, for assigning probability to each event.
(Ω, F, P) is called a probability space. For now, we use the word event to refer to any subset of
Ω, though it is common to assume measurability implicitly.
For example, for a single coin flip, Ω = {heads, tails}, F = 2Ω, and P({heads}) = P({tails}) =
1/2. For a single die roll, Ω = {1, 2, 3, 4, 5, 6}, F = 2Ω, and P({i}) = 1/6 for i = 1, . . . , 6. Note
that in each of these examples the probability of each outcome is the same. This is what is
called the uniform probability measure (or uniform distribution).
Example 4.1. (Discrete uniform probability measure.) Ω = a set with n < ∞ elements,
F = 2Ω , P({ω}) = 1/n for each ω ∈ Ω.
Example 4.2. (Continuous uniform probability measure.) Ω = [0, 1], F = {A ⊆ [0, 1] :
A is Borel}, P([a, b)) = b − a for 0 ≤ a ≤ b ≤ 1.
(Notice we used the Borel instead of Lebesgue σ -algebra; we will see why in Section 5.)
Consider now a coin tossing experiment with infinitely many tosses. The set of outcomes is
Ω = {0, 1}N where now 0, 1 represent heads and tails. That is, ω = (ω1, ω2, . . .) ∈ Ω represents
the outcome where the kth toss is ωk ∈ {0, 1}. Here all single outcomes have probability zero,
but events can have positive probability. For example, Ek = {ω ∈ Ω : ωk = 0}, the event that
the kth toss is heads, has probability P(Ek) = 1/2.
Exercise 4.3. For the infinite coin tossing experiment, let A be the collection of all events that
depend only on finitely many tosses, together with ∅. Show that A is a ring.
We now sketch how to define a probability measure P in the infinite coin tossing example.
First, following Carathéodory's theorem, we define P on A. Let An be the set of events that
depend only on the first n tosses. More formally, E ∈ An if and only if there is En ⊆ {0, 1}^n
such that E = {ω ∈ Ω : (ω1, . . . , ωn) ∈ En}. Clearly, A = ∪∞n=1 An. Define P on A by

P(E) = |En|/2^n for E ∈ An as above. (17)
Exercise 4.4. Let P be as in (17) and A as in Exercise 4.3. Show that P is well-defined and
finitely additive on A.
Existence and uniqueness of P is a special case of Kolmogorov's extension theorem (Theo-
rem 8.9 below). It turns out that F = σ(A) is the right σ-algebra. Indeed, all reasonable
subsets of Ω are in σ(A). The problem with trying to measure all subsets of Ω, which we
also faced while constructing Borel measure, is that the space Ω is uncountable.¹ (Intuitively,
it is difficult to use countable additivity to measure all subsets of an uncountable space.)
Exercise 4.5. Consider the infinite coin tossing experiment, and let E be the event that infinitely
many of the tosses are heads. Show that E ∈ F = σ(A) where A is defined as in Exercise 4.3.
(See also Section 4.2 below.)
4.1 Independence

In this section events will refer only to measurable subsets of Ω, that is, elements of F. One
important element of probability theory that is missing from measure theory is the concept
of independence. Consider a dice rolling experiment with two rolls. Here Ω = {1, 2, . . . , 6}2,
F = 2Ω, and P is uniform on Ω. Consider the events E1 = {first roll is odd} and E2 =
{second roll is ≤ 2}. By counting the number of ways both E1, E2 can occur, and dividing by
the 36 total possibilities, we get

P(E1 ∩ E2) = (3 · 2)/36 = (3/6) · (2/6) = P(E1)P(E2).
This motivates the following definition: events A, B ∈ F are independent if

P(A ∩ B) = P(A)P(B), (18)

and more generally, events Ei ∈ F, i ∈ I, are independent if

P(∩i∈J Ei) = ∏i∈J P(Ei) (19)

for each finite subcollection J of I.

If A, B are events in a finite outcome space Ω that "have nothing to do with each other,"
equation (18) can be obtained from a simple counting argument. Note that events with prob-
ability 0 or 1 are automatically independent of all other events. Another way to understand
independence is as follows. Assume P(B) > 0 and rearrange (18) as

P(A) = P(A ∩ B)/P(B). (20)
Define another probability measure Q on (Ω, F) by

Q(C) = P(C ∩ B)/P(B), C ∈ F.
Then Q is the probability measure on events that assumes B has occurred, and equation (20),
which says Q(A) = P(A), shows that occurrence of B has no effect on A. The RHS of (20) is
the conditional probability of A given B:
¹For a specific example of a non-measurable set in this context see https://arxiv.org/pdf/math/0610705v2.pdf
Definition 4.7. Let A, B be events with P(B) > 0. The conditional probability of A given B is

P(A|B) := P(A ∩ B)/P(B). (21)
As an example, consider two die rolls. Let B be the event that the first roll is odd, and A
the event that the rolls add to 5. Consider first P(A|B). Assuming B occurred, we know that
the first roll is either 1, 3 or 5, and each has the same probability 1/3. If the first roll is 1 or
3, the sum of the rolls is 5 only when the second roll is 4 or 2, respectively, both of which have
probability 1/6. If the first roll is 5 there is no way the sum is 5. Thus,

P(A|B) = (1/3) · (1/6) + (1/3) · (1/6) + (1/3) · 0 = 1/9.

On the other hand, consider P(A ∩ B). There are only two outcomes, namely (1, 4) and (3, 2),
for which the first roll is odd and the rolls add to 5. Thus, P(A ∩ B) = 2/36 = 1/18. Since
P(B) = 1/2, we have verified equation (21).
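As a quick sanity check (an illustrative addition, not part of the original notes), the computation above can be verified by brute-force enumeration of the 36 equally likely outcomes in Python:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))        # 36 equally likely pairs
    B = [(r1, r2) for r1, r2 in outcomes if r1 % 2 == 1]   # first roll odd
    A_and_B = [(r1, r2) for r1, r2 in B if r1 + r2 == 5]   # ...and the rolls add to 5

    p_B = len(B) / 36
    p_A_and_B = len(A_and_B) / 36
    print(p_A_and_B / p_B)  # 0.1111... = 1/9, matching P(A|B) above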
Note that if A, B are independent and P(B) > 0, then P(A|B) = P(A). Conversely if
P(A|B) = P(A) then A, B are independent. It is important to note that pairwise independence
does not imply independence: (19) may not hold even if (18) holds for each pair Ai, Aj, i ≠ j ∈ I.
Denition 4.8. σ-algebras F1 , F2 over Ω are called independent if E1 and E2 are independent
whenever E1 ∈ F1 and E2 ∈ F2 .
One application of Dynkin's lemma (Lemma 3.12) is the following: if A1, A2 are π-systems
over Ω such that P(E1 ∩ E2) = P(E1)P(E2) whenever E1 ∈ A1 and E2 ∈ A2, then σ(A1) and
σ(A2) are independent.

4.2 The Borel-Cantelli lemmas

Consider again the infinite coin tossing example. Let Ek be the event that the kth toss is heads
and E the event that infinitely many tosses are heads. (See Exercise 4.5.) Consider the event

∪k≥n Ek.
This is the event that some toss after the (n − 1)st toss is heads. Now consider the event

∩∞n=1 ∪k≥n Ek.

This is the event that, after every toss, some future toss is heads. In other words,

E := {infinitely many tosses are heads} = ∩∞n=1 ∪k≥n Ek.
For a general sequence of events En, we accordingly define

lim supn→∞ En := ∩∞n=1 ∪k≥n Ek

and

lim infn→∞ En := ∪∞n=1 ∩k≥n Ek.
The above arguments are easily generalized to show that lim supn→∞ En is the event that
infinitely many of the En occur. For this reason, we sometimes write

{En infinitely often} := lim supn→∞ En.

Similar arguments show that lim infn→∞ En is the event that every En occurs past some finite
n. For this reason we sometimes write

{En eventually} := lim infn→∞ En.

Exercise 4.11. Show that {En infinitely often}c = {Enc eventually}.
Exercise 4.12. Show that if En is a monotone sequence, then lim supn→∞ En = lim inf n→∞ En .
Lemma 4.13. (Borel-Cantelli.) If E1, E2, . . . are independent events with ∑∞n=1 P(En) = ∞,
then P({En infinitely often}) = 1.
Proof. By Exercise 4.11, it suffices to show that P({Enc eventually}) = 0. We have

P(∩k≥n Ekc) = ∏k≥n P(Ekc)
            = ∏k≥n (1 − P(Ek))
            ≤ ∏k≥n exp(−P(Ek))
            = exp(−∑k≥n P(Ek)) = 0.

Since {Enc eventually} = ∪∞n=1 ∩k≥n Ekc, countable subadditivity gives P({Enc eventually}) = 0.
Exercise 4.14. Consider the infinite coin tossing example and let E be the event that infinitely
many tosses are heads. Use the Borel-Cantelli lemma to show that P(E) = 1.

Exercise 4.15. Prove the following result, also due to Borel-Cantelli. If E1, E2, . . . are events
such that ∑∞n=1 P(En) < ∞, then P({En infinitely often}) = 0.
Note that the result of Exercise 4.15 applies to finite measures (i.e. P(Ω) = 1 isn't needed).

Consider a game where at step n you lose 2^n dollars with probability 1/(2^n + 1) or win 1
dollar with probability 2^n/(2^n + 1). At each step, your average or expected winnings are

−2^n · 1/(2^n + 1) + 1 · 2^n/(2^n + 1) = 0

(see Section 6.1 below). Let En be the event that you lose in the nth step. Assuming these
events are independent, Exercise 4.15 shows that a.s. you lose only finitely many games. Thus, you
can continue playing to win infinite money, even though the games are fair in the sense
that on average, you break even at each step!
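A quick simulation of this game (an illustrative addition; the random seed and 60-step horizon are arbitrary choices) shows the typical behavior: losses stop occurring after the first few steps, so cumulative winnings eventually grow by 1 dollar per play.

    import random

    random.seed(0)
    wealth = 0.0
    for n in range(1, 61):
        p_lose = 1.0 / (2 ** n + 1)   # lose 2^n dollars with this probability
        if random.random() < p_lose:
            wealth -= 2 ** n
        else:
            wealth += 1               # otherwise win 1 dollar
        # Each step has expected winnings -2^n * p_lose + 1 * (1 - p_lose) = 0.

    # Since sum of p_lose over n is finite, Exercise 4.15 says only finitely
    # many losses occur a.s., so wealth eventually increases without bound.
    print(wealth)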
5 Measurable functions
A measurable function is a map between measurable spaces that respects measurability. This
means that the preimage of a measurable set should be measurable:
Definition 5.1. Let (E, E) and (F, F) be two measurable spaces. A function g : E → F is
measurable if g−1(A) ∈ E for each A ∈ F.
It turns out that the condition in Definition 5.1 can be replaced by a slightly weaker one: if
A is a collection of subsets of F that generates F, then it is enough to check measurability of
preimages of elements of A.
Proposition 5.2. Let (E, E) and (F, F) be two measurable spaces and suppose F = σ(A) for
some collection A of subsets of F . Then g : E → F is measurable if and only if g−1 (A) ∈ E for
each A ∈ A.
Proof. The "only if" is clear, so we prove the "if" part. Suppose g−1(A) ∈ E for each A ∈ A,
and consider D = {A ⊆ F : g−1(A) ∈ E}. We will show D is a σ-algebra; since D ⊇ A by
assumption, it follows that D ⊇ σ(A) = F and so g is measurable. Suppose A1, A2, . . . ∈ D.
Then g−1(Ak) ∈ E for each k, so

g−1(∪∞k=1 Ak) = ∪∞k=1 g−1(Ak) ∈ E,

which shows ∪∞k=1 Ak ∈ D. Now suppose A ∈ D. Then g−1(A) ∈ E, so

g−1(F \ A) = E \ g−1(A) ∈ E,

which shows F \ A ∈ D. Since also F ∈ D, D is a σ-algebra.
Exercise 5.3. Let (E, E) be a measurable space. Use Proposition 5.2 to show that g : E → R
is measurable if and only if {x ∈ E : g(x) < a} ∈ E for each a ∈ R.
It is not hard to see that the < in Exercise 5.3 can be replaced by >, ≤, ≥.
Exercise 5.4. Let E ⊆ R be a Borel set and f : E → R an increasing function. Show that f is
a Borel function.
Proposition 5.5. Let (E, E) be a measurable space. The set G of measurable functions g : E →
R is closed under nite addition, nite multiplication, and lim inf and lim sup of sequences.
Proof. Suppose f, g ∈ G. For any a ∈ R,

{x ∈ E : (f + g)(x) < a} = ∪r∈Q ({x ∈ E : f(x) < r} ∩ {x ∈ E : g(x) < a − r}) ∈ E,

so f + g ∈ G by Exercise 5.3.
Next, note that f2 ∈ G whenever f ∈ G, since for a > 0,

{x ∈ E : f2(x) < a} = {x ∈ E : f(x) < √a} ∩ {x ∈ E : f(x) > −√a} ∈ E,

while for a ≤ 0 this set is empty. Since fg = ((f + g)2 − f2 − g2)/2, it follows that G is closed
under finite multiplication.
Exercise 5.6. Finish the proof of Proposition 5.5 by showing G is closed under lim sup and
lim inf of sequences.
Of course, closure under lim inf and lim sup implies closure under limits, when they exist.
The collection of Borel functions g : R → R is also closed under composition. Actually, this
is the reason for taking F as the Borel σ-algebra in Example 4.2: Borel functions of random
variables should be measurable.
Let 1 denote the constant function equal to 1, and recall 1A is equal to 1 on A and zero elsewhere.
We write gn % g to indicate gn : E → R is a nondecreasing sequence of functions with pointwise
limit g. That is, g1(x) ≤ g2(x) ≤ . . . and limn→∞ gn(x) = g(x) for each x ∈ E.
Theorem 5.7. (Monotone class theorem.) Let (E, E) be a measurable space and A a π-system
such that E = σ(A). Let G be a collection of bounded functions g : E → R that is closed under
finite addition and scalar multiplication. If in addition
(i) 1 ∈ G and 1A ∈ G for all A ∈ A,
(ii) if gn ∈ G with 0 ≤ gn % g for a bounded function g, then g ∈ G,
then G contains every bounded measurable function g : E → R.
Proof. (Sketch.) Using (i), (ii) and Dynkin's lemma, one can show that 1A ∈ G for every A ∈ E
(the collection {A ∈ E : 1A ∈ G} is a d-system containing the π-system A). Now suppose g is a
nonnegative measurable function bounded by M. Define

gn(x) = ∑_{k=0}^{M2^n−1} (k/2^n) 1_{Akn}(x), Akn := {x ∈ E : g(x) ∈ [k/2^n, (k+1)/2^n)}.

Note that each Akn is measurable, so gn ∈ G by closure of G under finite addition and scalar
multiplication. We leave it as an exercise to show that g1, g2, . . . is nonnegative and increasing
with limn→∞ gn = g. Thus, g ∈ G. We assumed g is nonnegative, but any bounded measurable
function is the difference of two nonnegative bounded measurable functions.
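The dyadic approximation in the proof is easy to visualize. Here is a small Python sketch (an illustrative addition; the choice g(x) = x² on [0, 1] is arbitrary) showing the staircase values gₙ(x) increasing to g(x):

    import math

    def g(x):
        return x * x          # an arbitrary bounded nonnegative function on [0, 1]

    def g_n(x, n):
        """Dyadic staircase approximation: round g(x) down to a multiple of 2^-n."""
        return math.floor(g(x) * 2 ** n) / 2 ** n

    x = 0.7
    for n in (1, 2, 4, 8, 16):
        print(n, g_n(x, n))    # increases monotonically to g(0.7) = 0.49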
Examples of applications of the monotone class theorem include proofs of change of variable
and Fubini's theorems for integration. In those examples, E = R (change of variables) or E = R2
(Fubini's theorem), A is the collection of intervals (change of variables) or rectangles (Fubini's
theorem), and G is the collection of bounded functions for which change of variables or Fubini's
theorem holds. Closure of G under finite addition and scalar multiplication comes from linearity
of integrals; (i) is easy to check, since it involves integrating indicator functions of intervals or
rectangles; and (ii) follows from the monotone convergence theorem (Theorem 6.3 below).
5.1 Images of measures and probability distribution functions
A random variable is a measurable function X : Ω → R on a probability space (Ω, F, P); its
distribution is the image measure A → P(X−1(A)) on the Borel sets, and its cumulative
distribution function (CDF) is FX(x) := P(X ≤ x). The CDF of a random variable completely
determines its distribution, and vice-versa. For this reason, random variables are usually defined
simply by specifying a CDF, without ever referring to an underlying measurable space (Ω, F).
(It is possible that two random variables have the same distribution but are defined on different
probability spaces. We will consider such random variables equivalent.)
It turns out that a function F : R → R is the CDF of a random variable if and only if F
is right-continuous and non-decreasing with limx→−∞ F (x) = 0 and limx→∞ F (x) = 1. (See
Theorem 5.16 below.)
Exercise 5.11. Let µ be the uniform measure on Ω = [0, 1] from Example 4.2. Fix λ > 0 and
define X : Ω → R by X(ω) = −λ−1 log(1 − ω). Find the CDF of X.

The distribution in Exercise 5.11 is called the exponential distribution.
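Exercise 5.11 is an instance of inverse transform sampling: applying the inverse CDF to a uniform random variable produces a sample with the desired distribution. A minimal Python sketch (an illustrative addition; the rate λ = 2.0, seed, and sample size are arbitrary choices):

    import math
    import random

    lam = 2.0          # rate parameter of the exponential distribution
    random.seed(1)

    def sample_exponential():
        u = random.random()                  # uniform on [0, 1)
        return -math.log(1.0 - u) / lam      # X = -log(1 - U)/lambda, as in Exercise 5.11

    samples = [sample_exponential() for _ in range(100_000)]
    print(sum(samples) / len(samples))       # close to the mean 1/lambda = 0.5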
In many cases a random variable X takes only nonnegative integer values. In this
setting it is more convenient to consider the probability mass function of X. In this case we
write P(X = x) instead of P(X−1({x})).
Definition 5.12. Let (Ω, F, P) be a probability space and X : Ω → N a random variable. The
probability mass function (PMF) of X is fX(x) := P(X = x), x ∈ N.
Example 5.13. Fix p ∈ (0, 1). A Bernoulli-p random variable has PMF

fX(x) = p if x = 1, and fX(x) = 1 − p if x = 0.
Exercise 5.14. Consider the infinite coin toss experiment from Section 4, but let the coin be
biased so that the probability of tails and heads is p and 1 − p, respectively. Let X be the number
of tosses before the first heads. Find the PMF of X.

The distribution in Exercise 5.14 is called the geometric distribution.
Exercise 5.15. Let X have either a geometric or an exponential distribution. Show that

P(X > t + s|X > s) = P(X > t)

for all s, t ≥ 0 (with s, t integers in the geometric case). This is called the memoryless property.
Theorem 5.16. Let F : R → R be non-decreasing and right-continuous. Then there is a unique
measure dF on the Borel sets of R such that dF((a, b]) = F(b) − F(a) for all a ≤ b.

Lemma 5.17. Let F be as in Theorem 5.16 and define g(y) = inf{x ∈ R : F(x) ≥ y}. Then g
is non-decreasing and g−1((a, b]) = (F(a), F(b)] for all a ≤ b.

Proof of Theorem 5.16. Let F and g be as in Lemma 5.17. Let µ be Borel measure, and consider
the image measure defined by dF(A) = µ(g−1(A)) for Borel sets A. Then

dF((a, b]) = µ(g−1((a, b])) = µ((F(a), F(b)]) = F(b) − F(a).

Uniqueness of dF follows from the arguments in the proof of uniqueness of Borel measure.
Exercise 5.18. Let F be the Heaviside step function F(x) = 0 for x < 0, F(x) = 1 for x ≥ 0.
For Borel sets A ⊆ R, define dF by dF(A) = 1 if 0 ∈ A, dF(A) = 0 if 0 ∉ A. Check that dF is
a measure and dF((a, b]) = F(b) − F(a). (dF is called the Dirac delta distribution.)
Let E be the Borel σ -algebra. A nonzero measure µ on (R, E) with the property that
µ(K) < ∞ whenever K is compact is called a Radon measure.
Exercise 5.19. Show that any Radon measure can be obtained from a function F as in Theo-
rem 5.16. That is, for any Radon measure µ, there is a function F satisfying the hypotheses of
Theorem 5.16 such that µ((a, b]) = F (b) − F (a).
We conclude this section by defining independence for random variables.

Definition 5.20. Random variables Xi, i ∈ I, on a common probability space are independent if

P(Xj ≤ xj for all j ∈ J) = ∏j∈J P(Xj ≤ xj)

for each finite subcollection J of I and real numbers {xj}j∈J. A random variable X generates
a σ-algebra σ(X) := {X−1(A) : A Borel}, and it is also not hard to see that X1, X2 are
independent if and only if σ(X1), σ(X2) are independent (see Definition 4.8).
Example 5.21. In the innite coin toss, dene random variables Xn (ω) = ωn , Y (ω) = ω1 +
. . . + ω5 and Z(ω) = ω6 ω7 . Then X1 , X2 , . . . are independent, and Y, Z are independent.
Let (Ω, F, P) correspond to the uniform probability measure on Ω = (0, 1] (see Example 4.2).
Note that each ω ∈ Ω has a binary expansion

ω = 0.ω1 ω2 . . .

where we forbid infinite sequences of 0's to make the expansion unique. Define random vari-
ables Rn by Rn(ω) = ωn. These are called Rademacher functions. The Rademacher func-
tions R1, R2, . . . are independent Bernoulli-1/2 random variables (see Example 5.13). Thus, the
Rademacher functions provide an alternative formulation of the infinite coin toss example! Ac-
tually, we can use these functions to construct a much wider class of sequences of independent
random variables. Consider a sequence of random variables Xn with CDFs Fn. More precisely,
and without losing generality, the Fn are functions satisfying the hypotheses of Theorem 5.16 such
that limx→∞ Fn(x) = 1 and limx→−∞ Fn(x) = 0 for each n. The next theorem shows how to
define a sequence of independent random variables with these CDFs on the same probability
space.
Theorem 5.22. Let (Ω, F, P) correspond to the uniform probability measure on Ω = (0, 1], and
let Fn : R → R be a sequence of cumulative distribution functions as described above. Then there
exists a sequence X1 , X2 , . . . of independent random variables Xn : Ω → R such that FXn = Fn
for all n.
Proof. Let b : N2 → N be any bijection and define Yn,m = Rb(n,m) and

Yn = ∑∞m=1 2^{-m} Yn,m.

The Yn,m are independent Bernoulli-1/2 random variables, so each Yn is uniformly distributed
on (0, 1], and Y1, Y2, . . . are independent since they are built from disjoint collections of the
Rademacher functions. Now set Xn = gn(Yn), where gn is associated to Fn as in Lemma 5.17;
then X1, X2, . . . are independent with FXn = Fn.
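A small Python sketch (an illustrative addition) of the Rademacher functions and the interleaving trick in the proof, with the bijection b(n, m) taken, as an arbitrary choice, to be the Cantor pairing function:

    def rademacher(omega, n):
        """n-th binary digit of omega in (0, 1]: the Rademacher function R_n."""
        return int(omega * 2 ** n) % 2

    def pair(n, m):
        """A bijection b : N^2 -> N (Cantor pairing), used to split one uniform
        omega into infinitely many independent uniforms."""
        return (n + m) * (n + m + 1) // 2 + m

    def Y(omega, n, digits=30):
        """Y_n = sum over m of 2^-m * R_{b(n,m)}(omega), truncated to `digits` terms."""
        return sum(2.0 ** -m * rademacher(omega, pair(n, m)) for m in range(1, digits))

    omega = 0.6180339887
    print([rademacher(omega, n) for n in range(1, 9)])  # binary digits of omega
    print(Y(omega, 1), Y(omega, 2))                     # two "independent" uniforms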
6 Integration

Throughout this section (E, E, µ) is a measure space. Let f : E → R be a measurable function.
For brevity, we often say a function f is measurable without explicit mention of E, E or µ. The
integral of f with respect to µ, if it exists, will be written

µ(f) := ∫E f dµ ≡ ∫E f(x) µ(dx).
A simple function is a function of the form f = ∑nk=1 ck 1Ak, where ck ≥ 0 and Ak ∈ E for
k = 1, . . . , n. For such f, define

µ(f) = ∑nk=1 ck µ(Ak).

(The representation of a simple function is not unique, but it is easy to check this is well-defined.)
For a nonnegative measurable function f : E → R, set

µ(f) = sup{µ(g) : g simple, 0 ≤ g ≤ f}. (24)
For general measurable f, write f = f+ − f− where f+ := max(f, 0) and f− := max(−f, 0).
When µ(f+) < ∞ and µ(f−) < ∞, we say f is integrable with respect to µ, and we define
µ(f) = µ(f+) − µ(f−). If f ≥ 0 is measurable but not integrable, we adopt the convention that
µ(f) = ∞. It is easy to check that |µ(f)| ≤ µ(|f|) and f is integrable if and only if µ(|f|) < ∞.
We say that f = 0 a.e. (almost everywhere) if µ({x : f(x) ≠ 0}) = 0. In the context of
probability, sometimes a.s. is written instead of a.e., meaning almost surely. More generally,
we say that a property holds a.e. or a.s. if it holds everywhere except on a null set.
Exercise 6.2. Show that (1) µ acts linearly on simple functions; (2) if f, g are nonnegative and
measurable with 0 ≤ f ≤ g, then µ(f ) ≤ µ(g); and (3) if f is simple, then f = 0 a.e. if and
only if µ(f ) = 0.
When µ is Borel or Lebesgue measure on a subset E of R, then µ(f) is called the Lebesgue
integral. It is easy to see that some functions that are not Riemann integrable are still Lebesgue
integrable. For example, let E = [0, 1], let µ be Borel or Lebesgue measure, and f(x) = 1 if
x ∉ Q, f(x) = 0 if x ∈ Q. Then f = 1[0,1]\Q is Lebesgue integrable with µ(f) = µ([0, 1]\Q) = 1.
However, all upper Darboux sums for f are ≥ 1 while all lower Darboux sums are 0 (see
definitions at the end of this section). Thus, f is not Riemann integrable. Later (Exercise 6.13)
we will see that all Riemann integrable functions are Lebesgue integrable.
Since the integral in (24) is defined as a supremum over simple functions, it is often easy to
get lower bounds for µ(f) by considering convenient simple functions. For the reverse inequality,
we'll show that if simple functions increase to f ≥ 0, then their integrals increase to the integral
of f. This is a consequence of the monotone convergence theorem (Theorem 6.3 below). The
monotone convergence theorem gives conditions under which a limit can be brought inside
an integral; see also Fatou's lemma (Theorem 6.8) and the dominated convergence theorem
(Theorem 6.9) below.

We write xn % x if x1, x2, . . . is a nondecreasing sequence of real numbers that converges to
x. For functions f, fn : E → R, we write fn % f if fn(x) % f(x) for every x ∈ E. When the
nondecreasing condition is relaxed, we write xn → x and fn → f.
Theorem 6.3. (Monotone convergence.) Let fn ≥ 0 be measurable with fn % f. Then
µ(fn) % µ(f).

3. Suppose fn and f are simple and write f = ∑nk=1 ck 1Ak, where ck > 0 and the Ak's are
disjoint. For each k, fn 1Ak % ck 1Ak. Since fn 1Ak is simple, by part 2, µ(fn 1Ak) % ck µ(Ak).
5. For the general case, define

gn = ∑_{k=1}^{n2^n−1} (k/2^n) 1_{Ak} + n 1_{Bn},

where

Ak = {x : k/2^n ≤ fn(x) < (k+1)/2^n}, Bn = {x : fn(x) ≥ n}.

Observe that gn ≤ fn ≤ f and, since fn % f, by construction of gn we get gn % f. By part 4,
this means µ(gn) % µ(f), and it follows that µ(fn) % µ(f).
Monotone convergence can be rephrased so that it looks like a countable additivity condition
(or perhaps a continuity from below condition, which is closely related) for integrals.
Exercise 6.5. Give an example to show that the assumption fn is nondecreasing is needed in
the monotone convergence theorem. Give another example to show that the assumption fn ≥ 0
is needed.
Monotone convergence leads to the following reassuring result.
Proposition 6.6. Let f, g ≥ 0 be measurable. Then µ(af + bg) = aµ(f ) + bµ(g) for all a, b ≥ 0.
Moreover, f = 0 a.e. if and only if µ(f ) = 0.
Proof. Let fn , gn be simple functions such that fn % f and gn % g . (We have seen how to
construct such functions in step 5 in the proof of Theorem 6.3 above; see also the proof of
Theorem 5.7.) Since a, b ≥ 0, we have afn + bgn % af + bg . Since µ is linear on simple functions
(Exercise 6.2), we have µ(afn + bgn ) = aµ(fn ) + bµ(gn ). Letting n → ∞ and using monotone
convergence gives the linearity result. Now if f = 0 a.e., then since 0 ≤ fn ≤ f , we must have
fn = 0 a.e. for each n. By Exercise 6.2 it follows that µ(fn ) = 0 for each n, and monotone
convergence implies µ(f ) = 0. Conversely, suppose µ(f ) = 0. Then since 0 ≤ fn ≤ f , we must
have µ(fn ) = 0 for each n. By Exercise 6.2, fn = 0 a.e. for each n. We conclude f = 0 a.e.
Exercise 6.7. Generalize Proposition 6.6 by showing that µ(af + bg) = aµ(f ) + bµ(g) holds for
integrable functions f, g and any a, b ∈ R. Show also that for integrable f , if f = 0 a.e. then
µ(f ) = 0, but the converse is not true.
The monotone convergence theorem leads to Fatou's lemma and the dominated convergence
theorem below.
Theorem 6.8 (Fatou's lemma). Let fn ≥ 0 be measurable. Then

µ(lim infn→∞ fn) ≤ lim infn→∞ µ(fn).
Proof. Note that inf k≥n fk ≤ fm for all m ≥ n. Thus, µ(inf k≥n fk ) ≤ µ(fm ) for all m ≥ n,
and so µ(inf k≥n fk ) ≤ inf k≥n µ(fk ) ≤ lim inf n→∞ µ(fn ). Since inf k≥n fk % lim inf n→∞ fn , the
monotone convergence theorem gives the result.
Theorem 6.9 (Dominated convergence). Let fn, f be measurable with fn → f pointwise, and
suppose |fn| ≤ g for all n, where g is integrable. Then µ(fn) → µ(f).

Proof. Note that g + fn ≥ 0 and g − fn ≥ 0. Applying Fatou's lemma to each,

µ(g) + µ(f) = µ(lim infn→∞ (g + fn)) ≤ lim infn→∞ µ(g + fn) = µ(g) + lim infn→∞ µ(fn)

and

µ(g) − µ(f) = µ(lim infn→∞ (g − fn)) ≤ lim infn→∞ µ(g − fn) = µ(g) − lim supn→∞ µ(fn).

Since µ(g) < ∞, it follows that lim supn→∞ µ(fn) ≤ µ(f) ≤ lim infn→∞ µ(fn), so µ(fn) → µ(f).
Note that in a probability space, since P(Ω) = 1 < ∞, the dominated convergence theorem
holds if the fn are uniformly bounded in n.
Exercise 6.10. Give an example to show the assumption fn ≥ 0 is needed in Fatou's lemma.
Give another example to show that the assumption g is integrable is needed in the dominated
convergence theorem.
Exercise 6.11. Prove a generalization of the dominated convergence theorem where the assump-
tion |fn | ≤ g with g integrable is replaced by |fn | ≤ gn where gn → g and gn , g are integrable with
µ(gn ) → µ(g). Use this to prove that if fn , f are integrable with fn → f and µ(|fn |) → µ(|f |),
then µ(|fn − f |) → 0.
When E = [a, b], µ is Lebesgue measure, and f : E → R is integrable, we write

µ(f) ≡ ∫_a^b f(x) dx.

When ∫_a^b |f(x)| dx < ∞, we say f is Lebesgue integrable.
Theorem 6.12 (Fundamental theorem of calculus). (i) Let f : [a, b] → R be Lebesgue integrable
and define F(x) = ∫_a^x f(y) dy for x ∈ [a, b]. If f is continuous at c ∈ [a, b], then F is differen-
tiable at c and F′(c) = f(c). (ii) Suppose F : [a, b] → R has a Lebesgue integrable derivative f.
Then F(b) − F(a) = ∫_a^b f(x) dx.
Part (ii) of the theorem can fail even if F admits a Lebesgue integrable derivative at almost
every point. For instance, the Cantor function c has c′ = 0 almost everywhere, and c(1) − c(0) =
1 ≠ 0 = ∫_0^1 c′(x) dx. The reason is that the Cantor function fails to be absolutely continuous
with respect to Lebesgue measure. Part (ii) will be generalized to arbitrary measures, assuming
absolute continuity, in the Radon-Nikodym theorem (Theorem 9.4 below).
We end this section by connecting the Lebesgue integral and the Riemann integral. We
briefly review the definition of Riemann integral. A partition P of [a, b] is a collection x0, . . . , xn
such that a = x0 < x1 < . . . < xn = b. For a bounded function f : [a, b] → R define upper and
lower Darboux sums U(P, f) and L(P, f) by

U(P, f) = ∑nk=1 (xk − xk−1) Mk, Mk = sup_{xk−1≤x≤xk} f(x),

L(P, f) = ∑nk=1 (xk − xk−1) mk, mk = inf_{xk−1≤x≤xk} f(x).

A bounded function f is called Riemann integrable if infP U(P, f) = supP L(P, f),
where the sup and inf are over all partitions P. When f is Riemann integrable, its integral is
the common value infP U(P, f) = supP L(P, f). We leave it as an exercise to show that the
Lebesgue integral generalizes the Riemann integral.
Exercise 6.13. Show that if f : [a, b] → R is Riemann integrable, then it is Lebesgue integrable,
and the two integrals are the same.
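For concreteness, here is a small Python sketch (an illustrative addition; the function f(x) = x² on [0, 1] and the uniform partitions are arbitrary choices) computing upper and lower Darboux sums, which squeeze the integral 1/3:

    def darboux_sums(f, a, b, n):
        """Upper and lower Darboux sums of f on [a, b] for the uniform n-point
        partition. The sup/inf on each subinterval are taken at the endpoints,
        which is exact for a monotone f such as x**2 on [0, 1]."""
        xs = [a + (b - a) * k / n for k in range(n + 1)]
        upper = lower = 0.0
        for k in range(1, n + 1):
            width = xs[k] - xs[k - 1]
            vals = (f(xs[k - 1]), f(xs[k]))
            upper += width * max(vals)
            lower += width * min(vals)
        return upper, lower

    for n in (10, 100, 1000):
        print(n, darboux_sums(lambda x: x * x, 0.0, 1.0, n))  # both approach 1/3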
6.1 Expectation

Let (Ω, F, P) be a probability space and X : Ω → R a random variable. In this setting the
integral of X is written

E(X) = ∫ X dP = ∫ X(ω) P(dω)

and is called the expectation (or expected value) of X. Thus, for simple X = ∑nk=1 ck 1Ak,

E(X) = ∑nk=1 ck P(Ak),

and the extension of E(X) to integrable X is exactly the same as in the last section.
E(X) is interpreted as the average value of X with respect to the probabilities assigned by
P. For instance, consider the two die roll experiment and let X be the sum of the two rolls. The
average value of X is

E(X) = 2 · (1/36) + 3 · (2/36) + 4 · (3/36) + 5 · (4/36) + . . .

where we count the sum of each value times the probability of the value. This idea is formalized
and generalized in the following exercises.
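Before turning to those exercises, a quick illustrative check (not part of the original notes): enumerating the 36 outcomes in Python gives E(X) = 7 for the sum of two rolls.

    from itertools import product

    # E(X) for X = sum of two fair die rolls, computed by direct enumeration.
    expectation = sum((r1 + r2) / 36 for r1, r2 in product(range(1, 7), repeat=2))
    print(expectation)  # 7.0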
Exercise 6.14. Let X be a random variable with values in N, and let f be its PMF. Show that
E(X) = ∑∞k=1 k f(k). Use this to find the expected value of a geometric random variable (see
Exercise 5.14).
When X : Ω → R is a random variable and g : R → R is a Borel function, then g ◦ X is
another random variable, usually written g(X).
Exercise 6.15. Let X be a random variable and F its CDF. Let g : R → R be Borel. Show that if
g(X) is integrable, E(g(X)) = dF(g) ≡ ∫ g(x) dF(dx), where dF is defined as in Theorem 5.16.
Exercise 6.16. In Exercise 6.15, let F be differentiable with Lebesgue integrable derivative f.
Show that then E(g(X)) = ∫ g(x)f(x) dx. In particular, if X is integrable, E(X) = ∫ xf(x) dx.
Use this to find the expected value of an exponential random variable (see Exercise 5.11).
When X has a CDF F with derivative f = F′, we call f the probability density function
(PDF) of X. Exercise 6.16 shows that if X has a PDF f, then E(X) = ∫ xf(x) dx. Thinking
of f(x) as the probability that X is near x, we can think of E(X) = ∫ xf(x) dx as a way to
say the expected value of X is the integral of the values of X multiplied by their probability
densities. Note that this is an analogue of the result in Exercise 6.14.
We end this section by connecting expectation with independence (Definition 5.20).

Proposition 6.17. If X, Y are independent integrable random variables, then E(XY) = E(X)E(Y).
7 Conditional expectation

Let (Ω, F, P) be a probability space and A, B ∈ F events. Recall the conditional probability
(Definition 4.7) of A given B is (if P(B) > 0)

P(A|B) = P(A ∩ B)/P(B).

We now generalize this idea to the expected value of a random variable, given some other
information (for instance about another random variable). The information will be encoded by
a sub-σ-algebra of F.
Definition 7.1. Let X be a random variable. The σ-algebra generated by X is

σ(X) := {X−1(A) : A ⊆ R Borel}.
If Xi, i ∈ I are random variables, the σ-algebra generated by the Xi is

σ({Xi}i∈I) := σ({Xi−1(A) : A ⊆ R Borel, i ∈ I}).
The Borel sets A in the last two definitions can be replaced by intervals (−∞, a].

Definition 7.2. Let G ⊆ F be a σ-algebra over Ω. A random variable Y : Ω → R is G-measurable
if Y−1(A) ∈ G for each Borel set A ⊆ R.

Note that in Definition 7.2 every G-measurable random variable is (F-)measurable, but not conversely.
Exercise 7.3. Let Ω = (0, 1] and P the uniform probability measure. Let G1 = {∅, Ω},
G2 = {∅, Ω, (0, 1/2], (1/2, 1]}. Identify which random variables are Gi -measurable for i = 1, 2.
Interpreting P as the innite coin toss probability measure, say which coin toss random variables
are Gi -measurable.
Definition 7.4. Let (Ω, F, P) be a probability space, X : Ω → R an integrable random variable,
and G ⊆ F a σ-algebra over Ω. The conditional expectation of X given G, denoted Y = E(X|G),
is defined to be the random variable Y such that (i) Y is G-measurable and (ii) E(Y 1A) =
E(X 1A) for all A ∈ G.
Existence of conditional expectation will be proved later using the Radon-Nikodym theorem
(Theorem 9.4 below). At the moment we only prove uniqueness.
Proof of uniqueness of conditional expectation. Let Y and Y′ be two G-measurable random vari-
ables with E(Y 1A) = E(Y′ 1A) = E(X 1A) for all A ∈ G. Define Bn = {ω ∈ Ω : Y(ω) − Y′(ω) > 1/n}.
Then Bn ∈ G and so

0 = E((Y − Y′)1Bn) ≥ (1/n) P(Bn) ≥ 0.

Thus P(Bn) = 0 for all n. Now P(Y − Y′ > 0) = P(∪∞n=1 Bn) ≤ ∑∞n=1 P(Bn) = 0. By symmetry
we conclude P(Y − Y′ < 0) = 0 and so P(Y = Y′) = 1.
Exercise 7.6. Let X and Y be integrable independent random variables and G = σ(Y ). Show
that E(X + Y |G) = E(X) + Y . (Hint: use Proposition 6.17.)
We leave it as an exercise to show that conditional expectation is linear, i.e. E(aX + b|G) =
aE(X|G) + b for a, b ∈ R, and monotonic, i.e. if X ≤ Y a.s. then E(X|G) ≤ E(Y|G).
7.1 Martingales

Consider the simple random walk Sn = X1 + . . . + Xn, where the Xi are independent with
P(Xi = 1) = P(Xi = −1) = 1/2, and let Fn = σ(X1, . . . , Xn). Then

E(Sn+1|Fn) = Sn. (26)
This says the expected position of the walk at step n + 1, knowing all information from the
first n steps of the walk, is the position of the walk at step n. Informally, this follows from the
simple fact that we're equally likely to move right or left at the (n + 1)th step. In general, a
sequence Sn of integrable random variables satisfying E(Sn+1|Fn) = Sn for all n, where each Sn
is Fn-measurable and F1 ⊆ F2 ⊆ . . ., is called a martingale with respect to Fn.
The theory of martingales arose from mathematical analysis of betting strategies. In that
context, Sn = X1 + . . . + Xn is usually a bettor's cumulative earnings after n plays of some game,
with Xi the earnings in the ith game; Fn = σ(X1, . . . , Xn) is the information from the first n
plays; and equation (26) means that the game is fair, i.e., the expected cumulative earnings by
the (n + 1)th play, given the information from the first n plays, is just the cumulative earnings
from the first n plays. The simple random walk can be understood in this context: starting
with 0 dollars, in each game, you win or lose 1 dollar with equal probability.
Let Xi, i = 1, 2, . . . be independent with P(Xi = 2^{i−1}) = P(Xi = −2^{i−1}) = 1/2, and define
Sn = X1 + . . . + Xn and Fn = σ(X1, . . . , Xn). Note that Sn is a martingale (the argument is
the same as for the simple random walk). Think of Sn as the cumulative dollar earnings after
n games. Since Sn is a martingale the games are fair. Suppose we stop playing the games as
soon as Sn > 0, which happens on the first winning play n. (This n is finite a.s.; why?) Then
we win −(2^0 + . . . + 2^{n−2}) + 2^{n−1} = 1 dollar, even though all the games were fair! This is the
so-called double or nothing strategy. (For another martingale paradox, see the example at the
end of Section 4.2 above.)
We conclude this section by giving a famous example of a martingale, due to Polya. Con-
sider an urn consisting initially of one red ball and one green ball. At each game, a ball is chosen
uniformly at random from the urn and then returned along with another ball of the same color.
Let Xn be the number of red balls after n games, Sn = Xn/(n + 2) the proportion of red balls
after n games, and Fn = σ(X1, . . . , Xn) the information from the first n games. Then Sn is a
martingale with respect to Fn.
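The martingale property of Polya's urn can be checked by direct computation. Here is a small Python verification (an illustrative addition, not from the original notes) using exact rational arithmetic:

    from fractions import Fraction

    def expected_next_proportion(k, n):
        """E(S_{n+1} | X_n = k) for Polya's urn with n+2 balls after n games,
        k of them red: draw red with prob k/(n+2), then add a ball of that color."""
        total = n + 2
        p_red = Fraction(k, total)
        return p_red * Fraction(k + 1, total + 1) + (1 - p_red) * Fraction(k, total + 1)

    # Verify the martingale property E(S_{n+1} | F_n) = S_n = k/(n+2) exactly:
    for n in range(1, 6):
        for k in range(1, n + 2):
            assert expected_next_proportion(k, n) == Fraction(k, n + 2)
    print("martingale property verified for small n")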
8 Product measures and Fubini's Theorem
Definition 8.1. Let E1 and E2 be σ-algebras over E1 and E2, respectively. The product σ-
algebra of E1 and E2 is
E1 ⊗ E2 = σ ({A1 × A2 : Ai ∈ Ei , i = 1, 2}) .
The definition extends to finite products of σ-algebras in the obvious way. When Ei = E for
i = 1, . . . , n, we write ⊗ni=1 Ei ≡ E⊗n for the product σ-algebra.
Consider two die rolls, where E1 = E2 = {1, . . . , 6}, E1 = E2 = 2{1,...,6}, and Ω = E1 × E2.
It is easy to check that F := 2Ω = E1 ⊗ E2. Not all elements of F have the form A1 × A2 where
Ai ∈ Ei; for instance {(1, 2)} ∪ {(2, 1)} cannot be written in this form. As another example, take
E1 = E2 = R and Ei the Borel σ-algebra over R. Then Exercise 2.6 shows that E1 ⊗ E2 is the
Borel σ-algebra over R2.
Throughout this section we assume (E1, E1, µ1) and (E2, E2, µ2) are finite measure spaces,
that is, µi(Ei) < ∞ for i = 1, 2.
Exercise 8.4. Show that there is a measure µ on (E1 × E2, E1 ⊗ E2) such that µ(A1 × A2) = µ1(A1)µ2(A2)
for all A1 ∈ E1 and A2 ∈ E2. (Hint: for existence, define

µ(A) = ∫E1 (∫E2 1A(x1, x2) µ2(dx2)) µ1(dx1) (27)

for A ∈ E1 ⊗ E2.)
By Proposition 3.13, the measure in Exercise 8.4 is unique. It is called the product measure:

Definition 8.5. Let (E1, E1, µ1) and (E2, E2, µ2) be measure spaces. The product measure µ of
µ1 and µ2, written µ = µ1 ⊗ µ2, is the unique measure on (E1 × E2, E1 ⊗ E2) satisfying

µ(A1 × A2) = µ1(A1)µ2(A2) for all A1 ∈ E1, A2 ∈ E2.
Consider again two die rolls, where E1 = E2 = {1, . . . , 6}, E1 = E2 = 2{1,...,6}. Let Pi be
uniform on (Ei, Ei). Then the product measure P = P1 ⊗ P2 is the unique measure under which
the die rolls are independent. More precisely, for 1 ≤ i ≤ n, let (Ωi, Fi, Pi) be probability
spaces and Xi : Ωi → R random variables. Let Ω = Ω1 × . . . × Ωn, F = F1 ⊗ . . . ⊗ Fn, and
P = P1 ⊗ . . . ⊗ Pn. Define X̂i : Ω → R by X̂i(ω) = Xi(ωi). Then the X̂i's are independent
random variables with respect to P, and they have the same distribution under P as the Xi's have
under the Pi. This is one way of constructing finite sequences of independent random variables.
Recall the ingenious construction in Theorem 5.22 above of infinite independent sequences of
random variables using Rademacher functions. An alternative construction can be done with
infinite product measures, the construction of which relies on Kolmogorov's extension theorem,
Theorem 8.9 below.
We are now ready to state Fubini's theorem.

Theorem 8.6 (Fubini). Let µ = µ1 ⊗ µ2 and let f : E1 × E2 → R be measurable. If f ≥ 0 or
f is µ-integrable, then

µ(f) = ∫E1 (∫E2 f(x1, x2) µ2(dx2)) µ1(dx1) = ∫E2 (∫E1 f(x1, x2) µ1(dx1)) µ2(dx2).
Exercise 8.7. Let Ei = [0, 1] and Ei the Borel σ-algebra. Let µ1 be Lebesgue measure and µ2
counting measure, i.e., µ2 (A) = |A| ∈ N ∪ {∞} for A Borel. Show that µ2 is a measure and that
Fubini's theorem fails for f = 1A where A = {(x, y) ∈ [0, 1]2 : y = x}.
Fubini's theorem can be extended to the case where µ1, µ2 are σ-finite measures. A measure
µ on (E, E) is σ-finite if E = ∪∞n=1 En for En ∈ E such that µ(En) < ∞ for each n. For instance,
Borel and Lebesgue measures on Rn are σ-finite, but the counting measure in Exercise 8.7 is
not σ-finite.
Definition 8.8. Let Fi, i = 1, 2, . . ., be σ-algebras over Ωi. The product σ-algebra is

⊗∞i=1 Fi := σ({A1 × A2 × . . . : Ai ∈ Fi for all i, with Ai = Ωi for all but finitely many i}). (29)

The sets generating the σ-algebra in (29) are sometimes called cylinder sets.

For example, consider the infinite coin toss, where Fi ≡ F = 2{0,1} are σ-algebras over
Ωi ≡ Ω = {0, 1}. In this case ⊗∞i=1 Fi ≡ F⊗N is generated by the collection of events that
depend on only finitely many tosses.
Theorem 8.9 (Kolmogorov extension theorem). Let F be the Borel σ-algebra over R. Suppose
Pn, n = 1, 2, . . . are probability measures on (Rn, F⊗n) such that for each n,

Pn+1(A × R) = Pn(A) for all A ∈ F⊗n. (30)

Then there is a unique probability measure P on (RN, F⊗N) such that

P(A × R × R × . . .) = Pn(A) for all A ∈ F⊗n and all n. (31)
Proof. Let A be the ring of finite unions of sets of the form (a1, b1] × . . . × (an, bn] × R × R × . . ..
Define P on sets of the form (a1, b1] × . . . × (an, bn] × R × R × . . . by using (31), and then extend
the definition to A by enforcing finite additivity. Equation (30) together with additivity of the
Pn ensures that this works. To show existence, by Theorem 3.4, it suffices to prove that P is
countably additive on A. This can be done by following the arguments in the proof of existence
of Borel measure (see Section 3.1). Uniqueness follows from Dynkin's lemma
(see Proposition 3.13).
In Kolmogorov's extension theorem, (R, F) can be replaced by any standard probability space,
that is, a probability space isomorphic to an interval with Lebesgue measure, a finite or countably
infinite set, or a union of an interval with a finite or countably infinite set. (Here an isomorphism
is a bijective measure-preserving map.) For instance, a finite or countably infinite space E,
together with its natural σ-algebra 2E, is a standard probability space. In the latter case the
sets (a1, b1] × . . . × (an, bn] × R × R × . . . in Kolmogorov's theorem are replaced by {i1} × . . . ×
{in} × E × E × . . . where ik ∈ E, k = 1, . . . , n.
In particular, Kolmogorov's theorem provides a construction of the probability measure of
the infinite coin toss. This is the canonical way to construct the infinite coin toss probability
measure, and in general, any probability measure corresponding to sequences of independent
random variables (see the section on Rademacher functions for an alternative construction):
Exercise 8.10. Let Fn , n = 1, 2, . . . be CDFs. Use Theorem 8.9 to construct a probability space
(Ω, F, P) and
random variables Xn : Ω → R such that X1 , X2 , . . . are independent and Xn has
CDF Fn .
More generally, Kolmogorov's theorem can be used to construct Markov chains. Recall the
simple random walk in Section 7.1 above: starting at zero, at each step you add or subtract
1 with equal probability. That is, Xn = Z1 + . . . + Zn where the Zi are independent random
variables with P(Zi = 1) = P(Zi = −1) = 1/2. More generally, a random walk has the form
Xn = Z1 + . . . + Zn, where the step sizes Zi are independent random variables. Markov chains
are generalizations of random walks, where the step sizes are random variables that are allowed
to depend on the current position of the walk.
Definition 8.11. A sequence X1, X2, . . . of random variables is a Markov chain if

P(Xn+1 ∈ A | X1, . . . , Xn) = P(Xn+1 ∈ A | Xn) (32)

for each n and each Borel set A.

Note that in Definition 8.11 we have not referred to the underlying probability space; usually
it is not necessary. When the Xn's take uncountably many values, the natural underlying
probability space is (RN, E⊗N, P), with E the Borel σ-algebra. When the Xn's take values
only in a finite or countable set E, it is more convenient to consider the probability space
(EN, (2E)⊗N, P). In both settings the Xn's are defined by Xn(ω) = ωn. Existence of P such
that (32) holds is shown (in the case where E is countable) in Exercise 8.15.

Equation (32) shows that Xn+1 depends on X1, . . . , Xn only through Xn. Usually, we think
of n as time and Xn as position after n time steps. Informally, the next position of a Markov
chain depends only on its current position. When Xn has values in a finite or countably infinite
set E ⊆ R, then the Markov property (32) can be rewritten as follows:
(See Exercise 7.5 for the connection between conditional probability and conditional expec-
tation that leads to Denition 8.12 above.) In equation (33),
(n+1)
Pij := P(Xn+1 = j|Xn = i)
are called thetransition matrices of Xn . They are characterized by the following properties:
Definition 8.13. A transition matrix on a countable set E is a matrix (P_{ij})_{i,j∈E} such that 0 ≤ P_{ij} ≤ 1 for all i, j ∈ E and Σ_{j∈E} P_{ij} = 1 for each i ∈ E.
Example 8.14. The transition matrix

P = ( 0    1/2  1/2
      1/3  0    2/3
      0    0    1   )

defines a Markov chain X_0, X_1, . . . such that, if X_n = 1, then X_{n+1} = 2 or 3 each with probability 1/2; if X_n = 2, then X_{n+1} = 1 or 3 with probabilities 1/3 and 2/3, respectively; and if X_n = 3, then X_{n+1} = 3.
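A short simulation sketch of this chain (states are 0-indexed in the code, so 0, 1, 2 stand for 1, 2, 3 in the example):

import numpy as np

P = np.array([[0.0, 0.5, 0.5],
              [1/3, 0.0, 2/3],
              [0.0, 0.0, 1.0]])  # the transition matrix from Example 8.14

def simulate_chain(P, x0, n_steps, rng):
    # Row P[i] is the conditional law of the next state given the current state i.
    path = [x0]
    for _ in range(n_steps):
        path.append(rng.choice(len(P), p=P[path[-1]]))
    return path

rng = np.random.default_rng(2)
print(simulate_chain(P, 0, 10, rng))  # eventually absorbed in state 2 (i.e., 3)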
If (33) holds, it is straightforward to check that

P^{(1)}_{i_0 i_1} · · · P^{(n)}_{i_{n−1} i_n} = P(X_n = i_n, X_{n−1} = i_{n−1}, . . . , X_1 = i_1 | X_0 = i_0)

and

(P^{(1)} · · · P^{(n)})_{ij} = P(X_n = j | X_0 = i).

Thus, the transition matrices of a Markov chain completely determine its evolution.
We are often interested in the case where all the P^{(n)}'s are the same and equal to a fixed matrix P. In this case we call P the transition matrix of the Markov chain.
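When all the transition matrices equal P, the identity above says that row i of the matrix power P^n is the law of X_n given X_0 = i. A quick numerical sketch, reusing the matrix from Example 8.14 (an arbitrary choice for this illustration):

import numpy as np

P = np.array([[0.0, 0.5, 0.5],
              [1/3, 0.0, 2/3],
              [0.0, 0.0, 1.0]])

Pn = np.linalg.matrix_power(P, 10)
print(Pn[0])  # law of X_10 given X_0 = 1; mass accumulates at the absorbing state 3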
Exercise 8.15. Let P^{(n)}, n = 1, 2, . . . be transition matrices on a countable set E. Use Kolmogorov's theorem to construct a Markov chain X_n with transition matrices P^{(n)}. More precisely, use Kolmogorov's extension theorem to show there exists a measure P on (E^N, (2^E)^{⊗N}) and random variables X_n : E^N → E which satisfy (33) with P(X_{n+1} = j | X_n = i) = P^{(n+1)}_{ij}.
From Kolmogorov's theorem, the probability measure P in Exercise 8.15 is unique once we assign a value (or more generally a distribution) to X_0. This is called the canonical construction of X_n.
Exercise 8.16. Give an example of (i) a Markov chain that is not a random walk, (ii) a Markov
chain that is not a martingale, (iii) a martingale that is not a Markov chain.
9 Radon-Nikodym theorem
In this section we state the Radon-Nikodym theorem (Theorem 9.4), which can be seen as a generalization of the fundamental theorem of calculus. Before stating the theorem we need the following definition.
Definition 9.1. Let µ and ν be measures on (E, E). We say µ is absolutely continuous with respect to ν, written µ ≪ ν, if for all A ∈ E, ν(A) = 0 implies µ(A) = 0.

For example, let ν be Lebesgue measure and δ the delta distribution, that is, δ(A) = 1 if 0 ∈ A and δ(A) = 0 otherwise. Then ν({0}) = 0 but δ({0}) = 1, so δ ̸≪ ν.
Exercise 9.2. Let µ be Lebesgue measure, δ be as above, and ν = µ + δ. Show µ ≪ ν but ν ̸≪ µ.
An alternate definition of absolute continuity is given by the following result. Recall a measure µ on (E, E) is finite if µ(E) < ∞, and σ-finite if there exist A_1, A_2, . . . ∈ E such that E = ∪_{n=1}^∞ A_n and µ(A_n) < ∞ for each n.
Lemma 9.3. Let µ be a finite measure and ν a σ-finite measure on (E, E). Then µ ≪ ν if and only if for each ε > 0, there exists δ > 0 such that for all A ∈ E, ν(A) < δ ⇒ µ(A) < ε. (34)
Theorem 9.4 (Radon-Nikodym). Let ν be a finite measure and µ a σ-finite measure on (E, E) such that ν ≪ µ. Then there is a µ-a.e. unique measurable f : E → [0, ∞) such that ν(A) = ∫_A f dµ ≡ µ(f 1_A). In this case we write f = dν/dµ.
We call f = dν/dµ the Radon-Nikodym derivative of ν with respect to µ. Theorem 9.4 will be stated more generally in the next section. The proof is postponed to Section 10.
Exercise 9.5. Let µ and ν be as in Exercise 9.2. Find the Radon-Nikodym derivative dµ/dν.
To see how Theorem 9.4 generalizes the fundamental theorem of calculus, we need to define absolute continuity for functions.
Definition 9.6. A function F : R → R is absolutely continuous if for each ε > 0 there exists δ > 0 such that for any disjoint intervals (x_1, y_1], . . . , (x_n, y_n] with Σ_{k=1}^n (y_k − x_k) < δ, we have Σ_{k=1}^n |F(y_k) − F(x_k)| < ε.
Lemma 9.7. Let µ be a finite measure. Then µ is absolutely continuous with respect to Lebesgue measure if and only if F(x) := µ((−∞, x]) is absolutely continuous.

Proof. Let (x_1, y_1], . . . , (x_n, y_n] be disjoint intervals, set A = ∪_{k=1}^n (x_k, y_k], and let ν be Lebesgue measure. Then Σ_{k=1}^n (y_k − x_k) = ν(A) and Σ_{k=1}^n |F(y_k) − F(x_k)| = µ(A). Appealing to Lemma 9.3 proves the only if part. It remains to prove the if part. Let F be absolutely continuous, assume A has zero Lebesgue measure, and let ε > 0. Choose the corresponding δ > 0 from the definition of absolute continuity of F. We may pick disjoint intervals [a_1, b_1), [a_2, b_2), . . . covering A such that Σ_{k=1}^∞ (b_k − a_k) < δ. (This follows from Proposition 3.22.) Define A_n = ∪_{k=1}^n [a_k, b_k). By choice of δ, µ(A_n) = Σ_{k=1}^n |F(b_k) − F(a_k)| < ε for all n. By continuity from below, µ(A) = lim_{n→∞} µ(A_n) ≤ ε. Since ε > 0 was arbitrary, we are done.
Exercise 9.8. Let X be a random variable with CDF F, and µ the associated image measure µ((a, b]) := F(b) − F(a) (also called the distribution of X; see the remarks after Definition 5.9). Suppose F′ = f is Lebesgue integrable. Show that µ is absolutely continuous with respect to Lebesgue measure ν, and dµ/dν = f. (Hint: use the fundamental theorem of calculus, Theorem 6.12, and Lemma 9.9 below.)
In Exercise 9.8, recall that f is called the probability density function (PDF) of X; see the remarks below Exercise 6.16. In the above, the assumption on F can be replaced by assuming F is absolutely continuous. It turns out that this is equivalent to saying F has a derivative f almost everywhere, such that f is Lebesgue integrable with F(b) − F(a) = ∫_a^b f(t) dt. These results fall into the realm of analysis and not measure theory, so we omit the proofs.
Lemma 9.9. Let (E, E, µ) be a measure space and f : E → R an integrable function. Then for each ε > 0 there exists δ > 0 such that µ(A) < δ implies ∫_A |f| dµ < ε.
Proof. Let ε > 0 and define B_k = {x ∈ E : |f(x)| > k}. By dominated convergence, we may choose k such that ∫_{B_k} |f| dµ < ε/2. Let δ = ε/(2k) and assume µ(A) < δ. Then

∫_A |f| dµ = ∫_{A\B_k} |f| dµ + ∫_{A∩B_k} |f| dµ
          ≤ ∫_A k dµ + ∫_{B_k} |f| dµ
          ≤ kµ(A) + ε/2 < ε.
We will use the Radon-Nikodym theorem to prove existence of conditional expectation. For this
application, we need to consider signed measures.
Definition 9.10. Let (E, E) be a measurable space. A function φ : E → R is a signed measure on (E, E) if for any disjoint collection {A_k}_{k=1}^∞ of elements of E,

φ(∪_{k=1}^∞ A_k) = Σ_{k=1}^∞ φ(A_k). (35)
Compare with the definition of measure (Definition 2.8). The difference is that we allow signed measures to have negative values, but not infinite values. In particular, a finite measure is a signed measure, but not conversely. (Throughout we write measure or finite measure for an unsigned measure.)

Continuity from above and from below hold for signed measures; we leave the proofs as an exercise.
Definition 9.11. Let φ be a signed measure on (E, E), and µ a measure on (E, E). Then we say φ is absolutely continuous with respect to µ, written φ ≪ µ, if for each A ∈ E, µ(A) = 0 implies φ(A) = 0.
Theorem 9.12 (Radon-Nikodym, signed version). Let φ be a signed measure and µ a σ-finite measure on (E, E) such that φ ≪ µ. Then there is a µ-a.e. unique measurable f : E → R such that φ(A) = ∫_A f dµ ≡ µ(f 1_A). In this case we write f = dφ/dµ.
We are now ready to prove existence of conditional expectation.
Exercise 9.13. Use Theorem 9.12 to prove existence of conditional expectation (see Definition 7.4). That is, show that if (Ω, F, P) is a probability space, G ⊆ F is a sub-σ-algebra, and X : Ω → R is an integrable (F-measurable) random variable, then there exists a G-measurable random variable Y such that E(Y 1_A) = E(X 1_A) for each A ∈ G.
We now turn to two structure theorems for signed measures. The first says that a signed measure can be decomposed as a difference of two finite measures that are disjoint in the following sense.
Definition 9.14. Two signed measures φ, ψ on (E, E) are called mutually singular, written φ ⊥ ψ, if they are supported on disjoint sets, that is, if there exist A, B ∈ E such that A ∩ B = ∅, A ∪ B = E, and φ|_A ≡ 0, ψ|_B ≡ 0.

Above and below, for a signed measure φ and measurable set S, φ|_S := φ(· ∩ S).
Theorem 9.15 (Jordan decomposition). Every signed measure φ can be uniquely decomposed as φ = µ − ν, where µ and ν are finite measures with µ ⊥ ν.

Proof. Call a measurable set A totally positive if φ|_A ≥ 0 (equivalently, φ|_A is a finite measure) and totally negative if φ|_A ≤ 0 (equivalently, −φ|_A is a finite measure). Consider the sup, m, of φ(A) over all totally positive A. There is a sequence A_n of totally positive sets such that lim_{n→∞} φ(A_n) = m. Without loss of generality A_1 ⊆ A_2 ⊆ . . ., and continuity from below shows that φ(∪_{n=1}^∞ A_n) = m. Define P = ∪_{n=1}^∞ A_n and N = P^c. We check that P is totally positive and N is totally negative. Taking µ = φ|_P and ν = −φ|_N will then complete the proof. If P is not totally positive, there is measurable A ⊆ P such that φ(A) < 0. It follows that 0 > φ(A) = φ(∪_{n=1}^∞ A ∩ A_n) = lim_{n→∞} φ(A ∩ A_n), so φ(A ∩ A_n) < 0 for some n, a contradiction to A_n being totally positive. The argument showing N is totally negative is analogous. We leave the proof of uniqueness as an exercise.
It is now straightforward to extend the results of the last section to signed measures.
Lemma 9.16. Let φ be a signed measure and µ a σ-finite measure on (E, E). Then φ ≪ µ if and only if for each ε > 0, there exists δ > 0 such that for all A ∈ E, µ(A) < δ ⇒ |φ(A)| < ε.
Proof. Suppose φ ≪ µ. Then by Theorem 9.12, we may write φ(A) = ∫_A f dµ for some integrable f : E → R. By Lemma 9.9, given ε > 0, we may choose δ > 0 such that µ(A) < δ implies |φ(A)| ≤ ∫_A |f| dµ < ε. Conversely, suppose the ε-δ condition holds, and let A ∈ E with |φ(A)| > 0. Taking ε = |φ(A)| and choosing δ as in the condition, we must have µ(A) ≥ δ > 0. Thus µ(A) = 0 implies φ(A) = 0, that is, φ ≪ µ.
Theorem 9.17. Let φ be a signed measure and µ a σ-finite measure on (E, E). Then TFAE:

1. φ ≪ µ.

2. There is a µ-a.e. unique measurable f : E → R such that for each A ∈ E, φ(A) = ∫_A f dµ.

3. Given ε > 0, there is δ > 0 such that if A ∈ E with µ(A) < δ, then |φ(A)| < ε.

Proof. By Lemma 9.16, 1 and 3 are equivalent. Theorem 9.12 states that 2 follows from 1, and Lemma 9.9 shows that 3 follows from 2.
We can now connect absolute continuity of signed measures with absolute continuity of
functions (see also Lemma 9.7).
Exercise 9.18. Let φ be a signed measure. Show that φ is absolutely continuous with respect to
Lebesgue measure if and only if F (x) := φ((−∞, x]) is absolutely continuous.
10 Proof of the Radon-Nikodym theorem

Lemma 10.1. Let µ and ν be finite measures on (E, E). Then either µ ⊥ ν, or there exist ε > 0 and B ∈ E such that µ(B) > 0 and ν(A) ≥ εµ(A) for each A ∈ E with A ⊆ B.
Proof. Let (P_n, N_n) be the totally positive/negative sets in the proof of Theorem 9.15 in the case where the signed measure is φ_n := ν − n^{−1}µ. Let P = ∪_{n=1}^∞ P_n and N = ∩_{n=1}^∞ N_n. Then φ_n(N) ≤ 0, or equivalently ν(N) ≤ n^{−1}µ(N), for each n. Thus, ν(N) = 0. Using the fact that P_n is increasing and N_n is decreasing with P_n ∪ N_n = E and P_n ∩ N_n = ∅, we can check that P ∪ N = E and P ∩ N = ∅. If µ(P) = 0, then µ ⊥ ν. So suppose µ(P) > 0. Using µ(P) = lim_{n→∞} µ(P_n) we see that µ(P_n) > 0 for some n. Taking ε = 1/n and B = P_n for this n yields the result.
Proof of Theorem 9.4. We consider the case where µ is finite and leave the σ-finite case to an exercise. We will only prove existence; uniqueness will follow from the proof of Theorem 10.4 below. Let G be the set of measurable functions g : E → [0, ∞) such that ∫_A g dµ ≤ ν(A) for all A ∈ E. Observe G is nonempty since 0 ∈ G. We claim that h = max{f, g} ∈ G whenever f, g ∈ G. To see this, let B = {x ∈ E : f(x) > g(x)} and check that

∫_A h dµ = ∫_{A∩B} f dµ + ∫_{A∩B^c} g dµ ≤ ν(A ∩ B) + ν(A ∩ B^c) = ν(A).
The monotone convergence theorem also shows G is closed under increasing limits. Now let m := sup_{g∈G} ∫_E g dµ ≤ ν(E). Choose g_n ∈ G such that lim_{n→∞} ∫_E g_n dµ = m. Take f_n = max{g_1, . . . , g_n}. Then f_n ∈ G is increasing and

∫_E g_n dµ ≤ ∫_E f_n dµ ≤ m,

so f := lim_{n→∞} f_n belongs to G and satisfies ∫_E f dµ = m. Define ν_0(A) := ν(A) − ∫_A f dµ; since f ∈ G, ν_0 is a (finite, unsigned) measure, and ν ≪ µ implies ν_0 ≪ µ. If ν_0 ⊥ µ, then by Exercise 10.2 below ν_0 ≡ 0, so that ν(A) = ∫_A f dµ for all A ∈ E and we are done. Otherwise, by Lemma 10.1 there exist ε > 0 and B_0 ∈ E with µ(B_0) > 0 such that ν_0(A) ≥ εµ(A) for all measurable A ⊆ B_0. Then for each A ∈ E,

ν(A) = ν_0(A) + ∫_A f dµ
     ≥ εµ(A ∩ B_0) + ∫_A f dµ
     = ∫_A (f + ε1_{B_0}) dµ,

so f + ε1_{B_0} ∈ G. But

∫_E (f + ε1_{B_0}) dµ = m + εµ(B_0) > m,

which contradicts the definition of m := sup_{g∈G} ∫_E g dµ.
Exercise 10.2. Fill in the following detail of the proof of Theorem 9.4. Show that if φ is a signed measure and µ a measure with φ ≪ µ and φ ⊥ µ, then φ ≡ 0.

Exercise 10.3. Finish the proof of Theorems 9.4 and 9.12 by generalizing to the cases where µ is σ-finite and φ is a signed measure.
The proof of the Radon-Nikodym theorem leads to the following structure theorem, which says that any signed measure has an absolutely continuous part and a singular part with respect to an arbitrary σ-finite measure.
Theorem 10.4 (Lebesgue decomposition). Let φ be a signed measure and µ a σ-finite measure on (E, E). Then there exist unique signed measures ν_0 and ν_1 such that φ = ν_0 + ν_1 and ν_0 ⊥ µ, ν_1 ≪ µ.
Proof. We will prove the theorem in the case where φ and µ are finite measures. We leave the generalization to signed φ and σ-finite µ as an exercise.
Existence is an immediate consequence of the proof of Theorem 9.4 above. For uniqueness, suppose φ = α_0 + α_1 where α_0 ⊥ µ and α_1 ≪ µ. Then φ = α_0 + α_1 = ν_0 + ν_1, so α_0 − ν_0 = ν_1 − α_1. Since ν_1 ≪ µ and α_1 ≪ µ we have ν_1 − α_1 ≪ µ. We claim that α_0 − ν_0 ⊥ µ. Once this is established, we are done by Exercise 10.2. To prove the claim, choose A, B, S, T such that A = B^c, S = T^c and ν_0|_A = µ|_B = α_0|_S = µ|_T = 0. Then (A ∩ S)^c = B ∪ T, and (α_0 − ν_0)|_{A∩S} = 0 and µ|_{B∪T} = 0. Thus α_0 − ν_0 ⊥ µ.
Exercise 10.5. Give an example of a non-σ-finite measure µ for which the conclusion of Theorem 9.4 is false. Then give an example of a non-σ-finite measure µ for which the conclusion of Theorem 10.4 is false.
Exercise 10.6. Let f, g : R → [0, ∞) be Lebesgue integrable functions, such that f/g is Lebesgue integrable. Define measures ν, µ by ν(A) = ∫_A f(t) dt, µ(A) = ∫_A g(t) dt for Borel A ⊆ R. Find dν/dµ.
Theorem 10.7. Let ν, µ and λ be finite measures on (E, E) such that ν ≪ µ and µ ≪ λ. Then

dν/dλ = (dν/dµ)(dµ/dλ), λ-a.e.
Proof. It is easy to check that ν ≪ λ. We want to show that

ν(A) = ∫_A (dν/dµ)(dµ/dλ) dλ.

Observe that this is equivalent to

∫_A (dν/dµ) dµ = ∫_A (dν/dµ)(dµ/dλ) dλ. (36)

Let f = dν/dµ. Note that (36) holds when f = 1_B for B ∈ E. By linearity, (36) holds for simple f, and by monotone convergence it holds for nonnegative f. It holds for integrable f by considering the positive and negative parts as usual.
11 Ergodic theory
Another interesting example is the following. Consider the infinite coin toss probability space (Ω, F, P) with Ω = {0, 1}^N, and define T(ω_1, ω_2, . . .) = (ω_2, ω_3, . . .). Let A = A_1 × . . . × A_n × {0, 1} × {0, 1} × . . . ∈ F. Define X_i : Ω → R by X_i(ω) = ω_i. Then, since the X_i are iid,

P(T^{−1}(A)) = P({0, 1} × A_1 × . . . × A_n × {0, 1} × . . .) = P(A_1 × . . . × A_n × {0, 1} × . . .) = P(A).

Since such sets A form a π-system generating F, the map T is measure-preserving. This T is called the shift operator. It is not hard to check that the shift operator is measure-preserving on product spaces corresponding to sequences of independent identically distributed random variables. We will show below it is also measure-preserving on product spaces corresponding to certain Markov chains.
Definition 11.4. Let P be a transition matrix on a finite set E. A left (row) eigenvector of P corresponding to eigenvalue 1 is called a stationary vector of P. We assume that a stationary vector π is normalized so that Σ_{i∈E} π_i = 1.
Such eigenvectors exist because the rows of P sum to 1, hence 1 is an eigenvalue of P (take the all-ones vector as a right eigenvector). It turns out that these eigenvectors can be chosen with nonnegative entries. This means we can think of a stationary vector on E as defining a probability measure on E. A stationary vector π is often also called invariant since πP = π, or equivalently,

Σ_i π_i P_{ij} = π_j.
Exercise 11.5. Let X_n be a Markov chain with transition matrix P on a finite set E. Let π be a stationary vector of P. Show that if P(X_0 = i) = π_i for all i ∈ E, then P(X_n = i) = π_i for all i ∈ E and n ≥ 0.
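Numerically, a stationary vector can be found as a left eigenvector with eigenvalue 1; a sketch (the transition matrix below is an arbitrary example chosen for this illustration):

import numpy as np

def stationary_vector(P):
    # Left eigenvectors of P are right eigenvectors of P^T; pick eigenvalue closest to 1.
    w, V = np.linalg.eig(P.T)
    i = np.argmin(np.abs(w - 1.0))
    pi = np.real(V[:, i])
    return pi / pi.sum()

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_vector(P)
print(pi, pi @ P)  # pi and pi P agree, i.e. pi is invariant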
In Exercise 11.5, since P(X0 = i) = πi for all i ∈ E , we say that Xn starts at π . Informally,
the exercise shows that if Xn starts at π , then it stays at π . Exercise 11.6 below shows that the
shift operator is measure-preserving for a Markov chain that starts at a stationary vector.
Recall (Exercise 8.15) that the canonical construction of a Markov chain with transition matrix P consists of a probability measure P on (E^N, (2^E)^{⊗N}) and random variables X_n : E^N → E defined by X_n(ω) = ω_n, such that (33) holds with P(X_{n+1} = j | X_n = i) = P_{ij}. The measure P is unique once we specify an initial distribution, that is, probabilities for X_0. We will always assume Markov chains have this canonical construction.
Exercise 11.6. Let E be a finite set, P a transition matrix, and P the probability measure from the canonical construction of a Markov chain with transition matrix P. Let π be a stationary vector for P and suppose P(X_0 = i) = π_i for all i ∈ E. Show that the shift T : E^N → E^N, T(ω_0, ω_1, . . .) = (ω_1, ω_2, . . .), is a measure-preserving map.
Definition 11.7. Let (Ω, F, P) be a probability space and T : Ω → Ω. A set A ∈ F is called invariant for T if T^{−1}(A) = A a.s. A random variable Y : Ω → R is called T-invariant if Y ◦ T = Y a.s. Sometimes we say almost invariant to emphasize that these equalities hold only a.s. In the above, A = B a.s. means P(A △ B) = 0.
Exercise 11.8. Identify the invariant sets in Examples 11.2 and 11.3.
It is easy to check that the collection of almost invariant sets forms a σ -algebra I .
Theorem 11.9 (Birkhoff-Khinchin). Let (Ω, F, P) be a probability space, T : Ω → Ω a measure-preserving map, and X : Ω → R an integrable random variable. Then

lim_{n→∞} n^{−1} Σ_{k=0}^{n−1} X ◦ T^k = E(X|I) a.s. (38)

In particular, consider a Markov chain X_n with values in a finite set E, started at a stationary vector π and constructed canonically as above, and suppose the associated shift map is ergodic, so that I is trivial. Then for f : E → R, applying (38) with X = f(X_0) gives

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} f(X_k) = E(f(X_0)) = Σ_{i∈E} π_i f(i) a.s. (39)
We think of the RHS as a space average, as it is an average with respect to π defined on the space E. We think of the LHS as a time average. Note that the LHS of (39) is a function of paths ω = (ω_0, ω_1, . . .) and the RHS is constant. Thus (39) says that time averages over paths equal the appropriate space average, with probability 1, provided the path starting point X_0(ω) = ω_0 has π_{ω_0} > 0. (In fact, (39) remains true even without the assumption on the starting point.)
A special case of (39) is the strong law of large numbers:

Theorem 11.13 (Strong law of large numbers). Suppose X_k, k = 0, 1, . . . are independent and identically distributed (iid) integrable random variables. Then

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} X_k = E(X_0) a.s.
The theorem is obtained by noticing that the shift transformation T (ω1 , ω2 , . . .) = (ω2 , ω3 , . . .)
is measure-preserving and ergodic on the probability space constructed in Exercise 4.9. Alter-
natively, the sequence X0 , X1 , . . . can be seen as a Markov chain with a stationary distribution
that is the same as the distribution of X0 .
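A quick numerical illustration of the strong law (a sketch; the exponential distribution is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=100_000)           # iid X_k with E(X_0) = 1
running_avg = np.cumsum(x) / np.arange(1, len(x) + 1)
print(running_avg[[99, 9_999, 99_999]])                # running averages approach 1.0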
For the remainder of this section we study irrational rotations of the unit circle. Such transformations are ergodic:

Proposition 11.14. Let P be the uniform probability measure on S^1 = [0, 1), fix r ∈ (0, 1) \ Q, and define T : S^1 → S^1 by T(t) = t + r mod 1. Then T is ergodic.
Below we omit mod 1 in our notation, and we use analysis instead of probability notation, e.g. writing µ instead of P for Lebesgue measure on S^1. To prove Proposition 11.14 we use the following lemmas.
Lemma 11.15. Let T be as in Proposition 11.14 and s ∈ S^1. Then the orbit O_s = {T^k(s) : k ∈ N} is dense in S^1.

Proof. Since r is irrational, the points T^k(s), k ∈ N, are distinct, so O_s is infinite. Suppose O_s is not dense, so that some interval I of length ε > 0 contains no point of O_s. If there were m > n with |T^m(s) − T^n(s)| < ε, then iterating T^{m−n} starting from s would move around the circle in equal steps of size less than ε, so some iterate would land in I, that is, O_s intersects I. We conclude there is no such m, n. But then O_s can contain no more than 1/ε points, a contradiction.
Lemma 11.16. Let A ⊆ S^1 be Lebesgue measurable. Then given ε > 0, there exist disjoint intervals I_1, . . . , I_n ⊆ S^1 such that the Lebesgue measure of A △ (I_1 ∪ . . . ∪ I_n) is less than ε.

Proof. Let µ be Lebesgue measure. Using the Lebesgue outer measure characterization (Proposition 3.22), we may choose intervals I_k, k = 1, 2, . . . such that A ⊆ ∪_{k=1}^∞ I_k and µ(A) > −ε + Σ_{k=1}^∞ µ(I_k). WLOG the I_k's can be chosen disjoint. Since Σ_{k=1}^∞ µ(I_k) < ∞, we may choose n ∈ N such that Σ_{k=n+1}^∞ µ(I_k) < ε. Now

µ(A \ (I_1 ∪ . . . ∪ I_n)) ≤ µ(∪_{k=n+1}^∞ I_k) < ε

and

µ((I_1 ∪ . . . ∪ I_n) \ A) ≤ µ((∪_{k=1}^∞ I_k) \ A) < ε.
Lemma 11.17. Let A ⊆ S^1 be Lebesgue measurable. Then given ε > 0, there exists a continuous function f : S^1 → R such that ∫_{S^1} |f(t) − 1_A(t)| dt < ε.

Proof. Let ε > 0. Choose disjoint intervals I_1, . . . , I_n ⊆ S^1 as in Lemma 11.16. Define f(x) = 1 on I := ∪_{k=1}^n I_k and f(x) = 0 on S^1 \ ∪_{k=1}^n B_{ε2^{−k}}(I_k), where B_δ(B) denotes the δ-neighborhood of B. Finally define f on the rest of S^1 by linear interpolation. Then

∫_{S^1} |f(t) − 1_A(t)| dt ≤ ∫_{S^1} |f(t) − 1_I(t)| dt + ∫_{S^1} |1_I(t) − 1_A(t)| dt ≤ 2ε + ε = 3ε.

Replacing ε with ε/3 gives the result.
Proof of Proposition 11.14. Let A ∈ F be invariant, set c = µ(A), and let ε > 0. Using Lemma 11.17, choose a continuous f : S^1 → R with ∫_{S^1} |f(t) − 1_A(t)| dt < ε. Since A is invariant and T preserves µ, ∫_{S^1} |f(T^k(t)) − 1_A(t)| dt < ε, and hence ∫_{S^1} |f(T^k(t)) − f(T^ℓ(t))| dt < 2ε for each k, ℓ ∈ N. By Lemma 11.15, the orbit {T^k(0) : k ∈ N} = {kr mod 1 : k ∈ N} is dense in S^1. It follows, by uniform continuity of f, that ∫_{S^1} |f(t + s) − f(t)| dt ≤ 2ε for all s ∈ S^1. By Fubini's theorem,

∫_{S^1} |f(t) − ∫_{S^1} f(s) ds| dt = ∫_{S^1} |∫_{S^1} (f(t) − f(t + s)) ds| dt
                                   ≤ ∫_{S^1} ∫_{S^1} |f(t) − f(t + s)| dt ds ≤ 2ε,

and so

∫_{S^1} |1_A(t) − c| dt ≤ ∫_{S^1} |1_A(t) − f(t)| dt + ∫_{S^1} |f(t) − ∫_{S^1} f(s) ds| dt + ∫_{S^1} |∫_{S^1} f(s) ds − c| dt
                        ≤ ε + 2ε + ε = 4ε.

Since ε > 0 was arbitrary, 1_A = c a.e., which forces c = µ(A) ∈ {0, 1}. Thus T is ergodic.
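Proposition 11.14 combined with the ergodic theorem says that time averages along an irrational rotation converge to the space average; a numerical sketch (r and the observable f below are arbitrary choices):

import numpy as np

r = np.sqrt(2) - 1                          # an irrational rotation number
f = lambda t: np.sin(2 * np.pi * t) ** 2    # observable with space average 1/2

t0, n = 0.1, 200_000
orbit = (t0 + r * np.arange(n)) % 1.0       # t0, T(t0), T^2(t0), ...
print(f(orbit).mean())                      # approximately 0.5, the space average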
Lemma 11.18. Let (E, E, µ) be a measure space and f_n, f : E → R measurable functions with Σ_{n=1}^∞ ∫_E |f_n − f| dµ < ∞. Then f_n → f a.e. on E.

Proof. Let ε > 0, set A_n = {x ∈ E : |f_n(x) − f(x)| ≥ ε}, and B_n = ∪_{k≥n} A_k. Note that

Σ_{n=1}^∞ µ(A_n) ≤ ε^{−1} Σ_{n=1}^∞ ∫_E |f_n − f| dµ < ∞.

Thus,

0 ≤ µ(B_n) ≤ Σ_{k≥n} µ(A_k) → 0 as n → ∞,

so µ(∩_{n=1}^∞ B_n) = 0; that is, a.e. x lies in A_n for only finitely many n. Taking a union of the corresponding null sets over ε = 1/m, m ∈ N, completes the proof.
Exercise 11.19. Let f : [a, b] → R be an integrable function. Use Lemmas 11.17 and 11.18 to show there exist continuous functions f_n : [a, b] → R such that f_n → f a.e.

Exercise 11.19, together with Egoroff's theorem below, shows that measurable functions are almost continuous.
Theorem 11.20 (Egoroff's theorem). Let (E, E, µ) be a finite measure space, and f_n : E → R measurable functions such that f_n → f a.e. Then for each ε > 0, there exists measurable A ⊆ E such that µ(A) < ε and f_n → f uniformly on A^c.

Proof of Egoroff's theorem. Assume WLOG that f_n → f everywhere on E and define A_{n,k} = ∪_{m=n}^∞ {x ∈ E : |f_m(x) − f(x)| ≥ 1/k}. Since A_{1,k} ⊇ A_{2,k} ⊇ . . . and ∩_{n=1}^∞ A_{n,k} = ∅, continuity from above implies lim_{n→∞} µ(A_{n,k}) = 0. Let ε > 0 and for each k ∈ N pick n_k such that µ(A_{n_k,k}) < ε2^{−k}. Set A = ∪_{k=1}^∞ A_{n_k,k}. Then µ(A) < ε and |f_n(x) − f(x)| < 1/k for n > n_k and x ∈ A^c. Thus f_n → f uniformly on A^c.
Theorem 11.21 (Lusin's theorem). Let f : [a, b] → R be measurable. Then for each ε > 0, there exists a compact C ⊆ [a, b] with Lebesgue measure greater than b − a − ε such that f|_C is continuous.

Exercise 11.22. Use Exercise 11.19 and Theorem 11.20 to prove Theorem 11.21. (Hint: first use Lebesgue outer measure, Proposition 3.22, to show that if B ⊆ [a, b] is Lebesgue measurable, then given ε > 0 there exists compact C ⊆ B such that µ(B \ C) < ε.)
12 Proof of the ergodic theorem

We begin with the ergodic case, that is, the case where the invariant σ-algebra I is trivial.

Theorem 12.1. Let (Ω, F, P) be a probability space, T : Ω → Ω an ergodic measure-preserving map, and X : Ω → R an integrable random variable. Then, with S_n := Σ_{k=0}^{n−1} X ◦ T^k, we have

lim_{n→∞} S_n/n = E(X) a.s. (40)
Lemma 12.2. Let X , T , and Sn be as in Theorem 12.1. If E(X) > 0 then P(Sn > 0 ∀ n) > 0.
Proof of Theorem 12.1. By subtracting E(X) from both sides of (40), we may assume WLOG that E(X) = 0. Let ε > 0. Applying Lemma 12.2 to X + ε shows that P(S_n/n > −ε ∀ n) > 0, and thus

P(lim inf_{n→∞} S_n/n ≥ −ε) > 0. (41)
The assumption that T is ergodic means that the σ-algebra I of almost invariant events is trivial. Note that lim inf_{n→∞} S_n/n is T-almost invariant, or equivalently, I-measurable. (See Exercise 12.3.) This means lim inf_{n→∞} S_n/n is constant a.s., so (41) shows that

lim inf_{n→∞} S_n/n ≥ −ε a.s.

As ε > 0 was arbitrary, lim inf_{n→∞} S_n/n ≥ 0 a.s. Applying the same argument to −X + ε gives lim sup_{n→∞} S_n/n ≤ 0 a.s., which completes the proof.
Proof of Lemma 12.2. Assume E(X) > 0 and let A = {S_n ≤ 0, some n}. We want to show that P(A^c) > 0. We will actually prove a stronger statement, namely that E(1_A X) ≤ 0 (see Lemma 12.5 below). Then since E(X) > 0, we have E(1_{A^c} X) > 0 and so P(A^c) > 0 as desired.

Let A_n = {S_k ≤ 0, some 1 ≤ k ≤ n}. Then A_n is increasing with ∪_{n=1}^∞ A_n = A. So by the dominated convergence theorem, it suffices to show that E(1_{A_n} X) ≤ 0 for each n. Set F_n = min{0, S_1, . . . , S_n}, where the minimum is pointwise. Note that S_k ◦ T + X = S_{k+1}. Thus, F_n ◦ T + X ≤ S_k ◦ T + X = S_{k+1} for 0 ≤ k ≤ n, and so F_n ◦ T + X ≤ min{S_1, . . . , S_n}. Note that min{S_1, . . . , S_n} = F_n on A_n, and so X ≤ F_n − F_n ◦ T on A_n. Thus,

E(1_{A_n} X) ≤ E(1_{A_n}(F_n − F_n ◦ T))
            = E(F_n) − E(1_{A_n}(F_n ◦ T))
            ≤ E(F_n) − E(F_n ◦ T)
            = 0,

where the last three lines of the display above follow from the facts that F_n = 0 on A_n^c, F_n ◦ T ≤ 0, and T is measure-preserving (see Exercise 12.4 below), respectively.
Exercise 12.3. Fill in the following detail in the proof of Theorem 12.1: Show that if lim_{n→∞} S_n/n exists a.s., then the limit is a T-almost invariant and hence I-measurable function.

Exercise 12.4. Fill in the following detail in the proof of Lemma 12.2: Show that if Y : Ω → R is integrable, T : Ω → Ω is measure-preserving, and A ∈ I is almost invariant, then E((Y ◦ T)1_A) = E(Y 1_A). (Lemma 12.2 uses the case A = Ω.)
From the proof of Lemma 12.2 we actually obtain the following stronger statement:

Lemma 12.5. Let T be a measure-preserving map, X an integrable random variable, S_n as in Theorem 12.1, and B = ∪_{n=1}^∞ {S_n < 0}. Then E(1_B X) ≤ 0.

(Note that an inequality became strict compared with the definition of A above, but the proof of Lemma 12.2 still goes through.) Lemma 12.5 holds whether or not T is ergodic, and indeed we will use it to prove Theorem 11.9 in the case where I is not trivial.

The following corollary will prove useful:
Corollary 12.6. Consider the setup of Lemma 12.5, and let B_α = ∪_{n=1}^∞ {S_n/n < α}. If A is T-almost invariant, then E(1_{B_α ∩ A} X) ≤ αP(B_α ∩ A).

Proof. First suppose A = Ω, and set X̃ = X − α, S̃_n = S_n − nα. Then B_α = ∪_{n=1}^∞ {S̃_n < 0}, so by Lemma 12.5, E(1_{B_α} X̃) ≤ 0, or equivalently, E(1_{B_α} X) ≤ αP(B_α). To finish the proof, repeat the above with T|_A in place of T. (Since A is almost invariant, we may assume T|_A : A → A and T|_A is measure-preserving.)
We are ready to prove the ergodic theorem in the general case, where I may be nontrivial. For convenience we restate the theorem:

Theorem 12.7 (Birkhoff-Khinchin). Let (Ω, F, P) be a probability space, T : Ω → Ω a measure-preserving map, and X : Ω → R an integrable random variable. Then

lim_{n→∞} S_n/n = E(X|I) a.s., (42)

where

S_n := Σ_{k=0}^{n−1} X ◦ T^k (43)

and I ⊆ F is the σ-algebra of T-almost invariant sets.
Proof. We first show that the limit on the LHS of (42) exists a.s. Let

X_* = lim inf_{n→∞} S_n/n,  X^* = lim sup_{n→∞} S_n/n,

and A_{α,β} = {X_* < α, X^* > β}. Then

A := ∪_{α<β, α,β∈Q} A_{α,β} = {X_* ≠ X^*}.

Each A_{α,β} is T-almost invariant (see Exercise 12.3), and A_{α,β} ⊆ B_α = ∪_{n=1}^∞ {S_n/n < α}, so Corollary 12.6 gives

E(1_{A_{α,β}} X) ≤ αP(A_{α,β}).

In the above arguments, we can replace X, α and β with −X, −β and −α, respectively, to get

E(1_{A_{α,β}} X) ≥ βP(A_{α,β}).

Thus, βP(A_{α,β}) ≤ αP(A_{α,β}). So if α < β, then P(A_{α,β}) = 0, and hence P(A) = 0, as desired.
We have shown that

Y := lim_{n→∞} S_n/n

exists a.s. We will show Y = E(X|I) by checking the properties of conditional expectation (see Definition 7.4). By Exercise 12.3, Y is I-measurable. Suppose A ∈ I. By Exercise 12.4, E((X ◦ T^k) 1_A) = E(X 1_A) for each k ≥ 0, and so

E((S_n/n) 1_A) = E(X 1_A). (45)
So if we can show that

lim_{n→∞} E((S_n/n) 1_A) = E(Y 1_A), (46)

then E(X 1_A) = E(Y 1_A) and we are done.
It only remains to establish (46). For λ > 0 set

C_λ = {sup_{n≥1} |S_n|/n > λ} = {inf_{n≥1} S_n/n < −λ} ∪ {sup_{n≥1} S_n/n > λ}.

By Corollary 12.6,

P(inf_{n≥1} S_n/n < −λ) = P(B_{−λ}) ≤ −λ^{−1} E(1_{B_{−λ}} X) ≤ λ^{−1} E(|X|).

Using minus signs again and repeating the above arguments, we get

P(sup_{n≥1} S_n/n > λ) ≤ λ^{−1} E(|X|),

and so

P(C_λ) ≤ 2λ^{−1} E(|X|).
Let D_{γ,k} = {|X ◦ T^k| > γ}. We have

E((|S_n|/n) 1_{C_λ}) ≤ (1/n) Σ_{k=0}^{n−1} E(1_{C_λ} |X ◦ T^k|)
  ≤ (1/n) Σ_{k=0}^{n−1} [E(1_{C_λ} 1_{D_{γ,k}} |X ◦ T^k|) + γ E(1_{C_λ} 1_{D_{γ,k}^c})]
  ≤ (1/n) Σ_{k=0}^{n−1} E(1_{D_{γ,k}} |X ◦ T^k|) + γ P(C_λ)                         (47)
  ≤ E(1_{D_{γ,0}} |X|) + 2(γ/λ) E(|X|).
Since X is integrable,

E(1_{D_{γ,0}} |X|) = ∫_{{|X|>γ}} |X| dP → 0 as γ → ∞.
Let E_λ = {|S_n/n| > λ}. By taking γ = √λ in (47), we find that

E((|S_n|/n) 1_{E_λ}) ≤ E((|S_n|/n) 1_{C_λ}) → 0 as λ → ∞, uniformly in n.

This is enough to establish (46), as we show in Theorem 12.8 below.
Theorem 12.8. Suppose X_n is a sequence of random variables such that X_n → X a.s. and E(1_{{|X_n|>λ}} |X_n|) → 0 as λ → ∞, uniformly in n. Then X is integrable and

lim_{n→∞} E(|X_n − X|) = 0.

Proof. Choose λ such that E(1_{{|X_n|>λ}} |X_n|) < 1 for all n. By Fatou's lemma,

E(|X|) ≤ lim inf_{n→∞} E(|X_n|) ≤ λ + lim inf_{n→∞} E(1_{{|X_n|>λ}} |X_n|) < λ + 1.
Exercise 12.9. Finish the proof of Theorem 12.8. (Hint: Use Egoroff's theorem, Theorem 11.20.)
The condition E(1_{{|X_n|>λ}} |X_n|) → 0 as λ → ∞, uniformly in n, in Theorem 12.8 is often called uniform integrability. You can add it to the list of conditions which allow limits to pass inside integrals!
13 Inequalities
Definition 13.1. A set S ⊆ R^n is convex if for all x, y ∈ S and t ∈ [0, 1], tx + (1 − t)y ∈ S.

Definition 13.2. Let S ⊆ R^n be convex. A function φ : S → R is convex if, for all x, y ∈ S and t ∈ [0, 1],

φ(tx + (1 − t)y) ≤ tφ(x) + (1 − t)φ(y).
Lemma 13.3. Let S ⊆ R^n be a closed convex set and p a point on ∂S. Then there exists nonzero u ∈ R^n such that u · (x − p) ≥ 0 for all x ∈ S.

Proof. Take a sequence p_n ∈ R^n \ S converging to p. Such a sequence exists because S is closed and p is on its boundary. For each n define

x_n = argmin_{x∈S} ‖x − p_n‖.

It is easy to check that x_n exists, since the minimization can be done inside the compact set S ∩ {x ∈ R^n : ‖x − p_n‖ ≤ M} where M is chosen sufficiently large. Note that ‖x_n − p_n‖ > 0 since x_n ∈ S and p_n ∉ S. Define v_n = x_n − p_n. We claim that v_n · (x − p_n) > 0 for all x ∈ S. Suppose instead that v_n · (x − p_n) ≤ 0 for some x ∈ S, and consider x_t = (1 − t)x_n + tx ∈ S. Then

‖x_t − p_n‖² = ‖x_n − p_n‖² + 2t v_n · (x − x_n) + t²‖x − x_n‖² < ‖x_n − p_n‖²

for t sufficiently small, since v_n · (x − x_n) = v_n · (x − p_n) − ‖v_n‖² ≤ −‖v_n‖² < 0; this contradicts minimality of x_n. Now define u_n = v_n/‖v_n‖. Since the u_n lie on the unit sphere, which is compact, there is a subsequence u_{n_k} converging to a unit vector u. Since u_n · (x − p_n) > 0 for each n, we must have u · (x − p) ≥ 0.
Theorem 13.4 (Jensen's inequality). Let (Ω, F, P) be a probability space, G ⊆ F a sub-σ-algebra, X : Ω → R an integrable random variable, and φ : R → R a convex function. Then

φ(E(X|G)) ≤ E(φ(X)|G).

Proof. Using Lemma 13.3 (applied to the epigraph of φ), a convex φ : R → R is the pointwise supremum of a countable collection C of affine functions g with g ≤ φ. For affine g ≤ φ, linearity and monotonicity of conditional expectation give g(E(X|G)) = E(g(X)|G) ≤ E(φ(X)|G). Thus,

φ(E(X|G)) = sup_{g∈C} g(E(X|G)) ≤ E(φ(X)|G).
The following very simple inequality leads to many important results in probability. For a nonnegative random variable X and a > 0, the Markov inequality states that

P(X ≥ a) ≤ E(X)/a, (48)

which follows by taking expectations in a1_{{X≥a}} ≤ X.
Theorem 13.8 (Chebyshev inequality). Let X be a random variable with E(X) = µ < ∞ and Var(X) = σ² < ∞. Then

P(|X − µ| ≥ sσ) ≤ 1/s². (49)
Proof. Applying (48) with (X − µ)² and s²σ² in place of X and a, respectively, gives

P(|X − µ| ≥ sσ) = P((X − µ)² ≥ s²σ²) ≤ E((X − µ)²)/(s²σ²) = 1/s².
Theorem 13.9 (Weak law of large numbers). Let X_k, k = 1, 2, . . . be iid random variables with E(X_1) = µ and Var(X_1) = σ² < ∞, and set S_n = X_1 + . . . + X_n. Then for each ε > 0,

lim_{n→∞} P(|S_n/n − µ| ≥ ε) = 0. (50)

Proof. Let ε > 0. Without loss of generality assume µ = 0. Then E(S_n/n) = 0. By Proposition 13.7, Var(S_n/n) = σ²/n. Applying (49) with S_n/n in place of X and s = ε√n/σ gives

P(|S_n/n| ≥ ε) ≤ σ²/(ε²n).
The convergence in (50) is called convergence in probability. We discuss this and other modes of convergence in Section 14 below. Convergence in probability is weaker than convergence a.s., which is why Theorem 13.9 is called the weak law of large numbers. The strong law of large numbers states that, if X_k, k = 1, 2, . . . are iid with E(|X_1|) < ∞, then

lim_{n→∞} (X_1 + . . . + X_n)/n = µ a.s. (51)

The strong law is a consequence of the ergodic theorem, Theorem 12.7.
Theorem 13.10 (Chernoff inequality). Let X_1, . . . , X_n be iid random variables and let S_n = X_1 + . . . + X_n. Then

P(S_n ≥ a) ≤ min_{t>0} e^{−ta} E(e^{tX_1})^n,  P(S_n ≤ a) ≤ min_{t<0} e^{−ta} E(e^{tX_1})^n. (52)

Proof. For t > 0, apply the Markov inequality (48) with e^{tS_n} and e^{ta} in place of X and a to get

P(S_n ≥ a) = P(e^{tS_n} ≥ e^{ta}) ≤ E(e^{tS_n})/e^{ta}.

Since X_1, . . . , X_n are iid, E(e^{tS_n}) = E(e^{tX_1})^n. Minimizing over t > 0 gives the first equation in (52). The second equation in (52) follows by replacing S_n with −S_n and a with −a.
The Chernoff inequalities are sharper than the Markov inequality, but they require the assumption of independence. For example, consider the infinite coin toss, where X_1, . . . , X_n are iid Bernoulli-1/2, that is, P(X_i = 1) = P(X_i = 0) = 1/2. Then, since

E(e^{tX_1}) = 1/2 + e^t/2 = 1 + (e^t − 1)/2 ≤ e^{(e^t − 1)/2},

Chernoff's inequality gives

P(S_n ≥ (1 + δ)n/2) ≤ min_{t>0} e^{−tn(1+δ)/2} e^{n(e^t − 1)/2}.

Minimizing over t > 0 (the minimum is at e^t = 1 + δ) yields

P(S_n ≥ (1 + δ)n/2) ≤ e^{nδ/2}/(1 + δ)^{n(1+δ)/2} = (e^δ/(1 + δ)^{1+δ})^{n/2}.

To compare the Markov and Chernoff inequalities here, take δ = 1. Then the calculation above gives P(S_n ≥ n) ≤ (e/4)^{n/2}, while the Markov inequality gives P(S_n ≥ n) ≤ 1/2. Note that the actual probability is P(S_n ≥ n) = (1/2)^n.
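The difference between the bounds is easy to see numerically (a small sketch using the constants derived above):

import numpy as np

# Bounds on P(S_n >= n) for n fair coin flips, versus the exact value (1/2)^n.
for n in [10, 50, 100]:
    markov = 0.5
    chernoff = (np.e / 4.0) ** (n / 2.0)
    exact = 0.5 ** n
    print(n, markov, chernoff, exact)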
14 Modes of convergence
The most common type of convergence we have considered above is pointwise convergence:

Definition 14.1. A sequence X_n of random variables converges to X a.s. if

P(lim_{n→∞} X_n = X) = 1,

written X_n →^{a.s.} X.

Thus X_n →^{a.s.} X means that X_n → X pointwise, up to a set of probability zero.
Definition 14.2. A sequence X_n of random variables converges to X in mean if

lim_{n→∞} E(|X_n − X|) = 0,

written X_n →^{L¹} X.

Sometimes this is called convergence in L¹. More generally, if lim_{n→∞} E(|X_n − X|^p) = 0, we say X_n converges to X in L^p. There is no direct relationship between convergence a.s. and convergence in mean. However, note that convergence a.s. implies convergence in mean under the assumptions in the monotone and dominated convergence theorems. Both of these theorems generalize in a straightforward way to convergence in L^p. Conversely, convergence in mean implies convergence a.s. along a subsequence:
Proposition 14.3. If X_n →^{L¹} X, then there is a subsequence X_{n_k} such that X_{n_k} →^{a.s.} X.

Proof. If X_n →^{L¹} X, we may choose n_k such that E(|X_{n_k} − X|) < 2^{−k}. The result then follows from Lemma 11.18.
We now introduce the type of convergence from the weak law of large numbers (Theorem 13.9):

Definition 14.4. A sequence X_n of random variables converges to X in probability, written X_n →^p X, if for each ε > 0,

lim_{n→∞} P(|X_n − X| ≥ ε) = 0.

Convergence in probability is weaker than convergence a.s., as the next proposition shows.

Proposition 14.5. If X_n →^{a.s.} X, then X_n →^p X. The converse does not hold.
Proof. Suppose X_n →^{a.s.} X and let ε > 0 and δ > 0. Using Egoroff's theorem (Theorem 11.20), choose A such that P(A) < δ and X_n → X uniformly on A^c. Choose N such that n ≥ N implies |X_n(ω) − X(ω)| < ε whenever ω ∈ A^c. Then n ≥ N implies P(|X_n − X| ≥ ε) ≤ P(A) < δ. Since δ was arbitrary, lim_{n→∞} P(|X_n − X| ≥ ε) = 0. To see the converse does not hold, let P be the uniform measure on [0, 1], for each n ∈ N write n = 2^m + k, m ∈ N, k ∈ {0, . . . , 2^m − 1}, and consider the sequence X_n = 1_{[k/2^m, (k+1)/2^m]}. Then P(|X_n| ≥ ε) ≤ 2^{−m} → 0 as n → ∞ whenever 0 < ε ≤ 1, so X_n →^p 0. However, for each ω ∈ [0, 1], X_n(ω) = 1 for infinitely many n, and so X_n does not converge to 0 a.s.
Proposition 14.6. If X_n →^{L¹} X, then X_n →^p X. The converse does not hold.

Proof. Suppose X_n →^{L¹} X. Since ε1_{{|X_n−X|≥ε}} ≤ |X_n − X|, we have P(|X_n − X| ≥ ε) ≤ ε^{−1} E(|X_n − X|) → 0 as n → ∞. Thus, X_n →^p X. To see the converse does not hold, let P be uniform on [0, 1] and set X_n = n1_{[0,1/n]}. Then P(|X_n| ≥ ε) = 1/n whenever 0 < ε < 1, so X_n →^p 0; but E(|X_n|) = 1 for each n, so X_n does not converge to 0 in mean.
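A quick numerical check of this example (a sketch):

import numpy as np

rng = np.random.default_rng(4)
omega = rng.uniform(size=1_000_000)                      # P uniform on [0, 1]
for n in [10, 100, 1000]:
    xn = n * (omega <= 1.0 / n)                          # X_n = n on [0, 1/n], else 0
    print(n, (np.abs(xn) >= 0.5).mean(), xn.mean())      # P(|X_n| >= 1/2) -> 0 but E|X_n| ~ 1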
Theorem 14.10. X_n →^d X if and only if

lim_{n→∞} E(g(X_n)) = E(g(X)) for all bounded continuous g : R → R. (53)
Proof. Assume (53) holds, let δ > 0, and let x be a point at which the CDF F of X is continuous. Let g_δ : R → R be a continuous function with values in [0, 1] such that g_δ(t) = 1 on (−∞, x] and g_δ(t) = 0 on (x + δ, ∞). Since 1_{(−∞,x]} ≤ g_δ ≤ 1_{(−∞,x+δ]}, by assumption

lim sup_{n→∞} P(X_n ≤ x) ≤ lim_{n→∞} E(g_δ(X_n)) = E(g_δ(X)) ≤ P(X ≤ x + δ).

Similar arguments show that lim sup_{n→∞} P(X_n ≥ x) ≤ P(X ≥ x − δ), so by taking complements, lim inf_{n→∞} P(X_n ≤ x) ≥ P(X < x − δ). Letting δ → 0 (by continuity from above),

P(X < x) ≤ lim inf_{n→∞} P(X_n ≤ x) ≤ lim sup_{n→∞} P(X_n ≤ x) ≤ P(X ≤ x). (54)
Since F is continuous at x, we have equality throughout (54), so lim_{n→∞} F_n(x) = F(x); this proves the if part. Consider the only if part. Let F and F_n be the CDFs of X and X_n, respectively, suppose X_n →^d X, and let g : R → R be bounded and continuous. Let ε > 0 and choose an interval [a, b] such that F is continuous at a and b and P(X ∈ [a, b]) ≥ 1 − ε. We may do this since lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1 and, as F is nondecreasing, its set of discontinuities is countable. Now choose points x_1 < . . . < x_k in [a, b] such that for each j, F is continuous at x_j and x, y ∈ [x_j, x_{j+1}] implies |g(x) − g(y)| < ε. Uniform continuity of g on closed bounded intervals ensures this is possible. Set
S = Σ_{j=1}^{k−1} g(x_j)(F(x_{j+1}) − F(x_j)),  S_n = Σ_{j=1}^{k−1} g(x_j)(F_n(x_{j+1}) − F_n(x_j)).

Let M = sup |g|. Let dF be the measure on R such that dF((a, b]) = F(b) − F(a); see Theorem 5.16. By Exercise 6.15,

|E[g(X)] − S| = |∫_{−∞}^∞ g(x) dF(dx) − S|
             ≤ |∫_a^b g(x) dF(dx) − S| + M P(X ∉ [a, b])
             ≤ (M + 1)ε.
Similarly, let dF_n be the measure on R such that dF_n((a, b]) = F_n(b) − F_n(a); by Exercise 6.15,

|E[g(X_n)] − S_n| = |∫_{−∞}^∞ g(x) dF_n(dx) − S_n|
                 ≤ |∫_a^b g(x) dF_n(dx) − S_n| + M P(X_n ∉ [a, b])
                 ≤ ε + M P(X_n ∉ [a, b])
                 → (M + 1)ε as n → ∞.

As the x_j's are continuity points of F, we have lim_{n→∞} F_n(x_j) = F(x_j) and so lim_{n→∞} S_n = S. Together with the last two displays, this completes the proof.
Proposition 14.11. If X_n →^p X, then X_n →^d X. The converse does not hold.

Proof. Let ε > 0, let g : R → R be bounded and continuous, and set M = 2 sup |g|. Then

E(|g(X_n) − g(X)|) ≤ ε + M P(|g(X_n) − g(X)| ≥ ε). (55)

Let

B_δ = {x ∈ R : ∃y ∈ R s.t. |x − y| < δ and |g(x) − g(y)| ≥ ε}.

By continuity of g, for each x ∈ R there is δ_x > 0 such that |x − y| < δ_x implies |g(x) − g(y)| < ε. Thus, the B_{1/n} are decreasing with ∩_{n=1}^∞ B_{1/n} = ∅. Now if X_n →^p X, then for each δ > 0,

P(|g(X_n) − g(X)| ≥ ε) ≤ P(|X_n − X| ≥ δ) + P(X ∈ B_δ). (56)

The first term on the RHS of (56) goes to 0 as n → ∞ for each δ > 0, and the second goes to 0 as δ → 0 by continuity from above. Since δ > 0 was arbitrary, we conclude lim_{n→∞} P(|g(X_n) − g(X)| ≥ ε) = 0. From (55) it follows that lim_{n→∞} E(|g(X_n) − g(X)|) = 0, which implies lim_{n→∞} E(g(X_n)) = E(g(X)), and Theorem 14.10 gives X_n →^d X.

To see the converse fails, take P uniform on Ω = [0, 1] and let X_n(ω) = ω for n odd and X_n(ω) = 1 − ω for n even. Then each X_n has the same CDF, so X_n →^d X_1; however, the X_n do not converge in probability (we leave this as an exercise).
Recall that the strong law of large numbers says that, for iid integrable X_k with mean µ,

(X_1 + . . . + X_n − nµ)/n →^{a.s.} 0, as n → ∞.

This says that an average of iid random variables concentrates around its mean. If we change the scaling, we get an idea of the deviation around the mean. The central limit theorem states that (under appropriate conditions)

(X_1 + . . . + X_n − nµ)/(σ√n) →^d Z, (57)

where σ² = Var(X_1) and Z is a standard normal random variable:
Definition 14.12. A standard normal random variable has CDF

F(x) = ∫_{−∞}^x e^{−t²/2}/√(2π) dt. (58)
The standard normal random variable has a bell-shaped PDF (namely the integrand in (58)), often called a Gaussian. For example, consider a simple random walk, where Z_n = X_1 + . . . + X_n and the X_i's are independent with P(X_i = 1) = P(X_i = −1) = 1/2. Then by (57),

Z_n/√n →^d Z

with Z a standard normal random variable. This means the deviation of a random walk from 0 at long times has a bell-shaped distribution. As another example, suppose X_i is the response to a survey question from person i. Assuming responses are independent, then (57) shows that for large n,

P(|(X_1 + . . . + X_n)/n − µ| ≥ zσ/√n) ≈ P(|Z| ≥ z) = 2∫_z^∞ e^{−t²/2}/√(2π) dt. (59)
Suppose we want to be 90% confident about the mean response. Then by choosing z such that the RHS of (59) equals 0.1, we can conclude that

(X_1 + . . . + X_n)/n ∈ (µ − zσ/√n, µ + zσ/√n)

with 90% confidence, that is, about 90% of the time if we were to repeat the survey many times.
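A simulation sketch of this confidence statement; the response distribution is an arbitrary choice, scipy is assumed available for the normal quantile, and the sample standard deviation stands in for σ:

import numpy as np
from scipy.stats import norm   # assumption: scipy available, used only for the quantile

rng = np.random.default_rng(5)
z = norm.ppf(1 - 0.1 / 2)                 # P(|Z| >= z) = 0.1, so z ~ 1.645

mu, n, trials = 3.0, 400, 10_000          # responses uniform on {1,...,5}, true mean 3
hits = 0
for _ in range(trials):
    x = rng.integers(1, 6, size=n).astype(float)
    half_width = z * x.std(ddof=1) / np.sqrt(n)
    hits += abs(x.mean() - mu) <= half_width
print(hits / trials)                      # approximately 0.90 coverage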
Our proof of the central limit theorem will use moment generating functions:

Definition 14.13. The moment generating function (MGF) of a random variable X is φ_X(t) = E(e^{tX}).

Note that these appeared already in the Chernoff inequalities (Theorem 13.10). The reason for the name is that, if φ_X is finite in a neighborhood of 0,

(d^n/dt^n) φ_X(t) |_{t=0} = E(X^n) =: the nth moment of X.
Note that, from Jensen's inequality, φ_X(t) ≥ e^{tE(X)}, and by Markov's inequality, φ_X(t) ≥ e^{ta} P(X ≥ a) for t > 0. The moment generating function φ_X can be understood as the Laplace transform of the distribution of X. A key property of the Laplace transform is that it is invertible under very general conditions. We borrow without proof the following result from Laplace transform theory:

Theorem 14.15. Suppose X_1, X_2, . . . and X are random variables such that lim_{n→∞} φ_{X_n} = φ_X < ∞ in a neighborhood of 0. Then X_n →^d X.
Exercise 14.16 (Optional). Use Theorem 14.15 to prove the following version of the law of large numbers: If X_k, k = 1, 2, . . . are iid with MGF which is finite in a neighborhood of 0, then (X_1 + . . . + X_n)/n →^d E(X_1).
Exercise 14.17 (Optional). Find the MGF of a standard normal random variable.
Theorem 14.18 (Central limit theorem). Let X_k, k = 1, 2, . . . be iid with MGF which is finite in a neighborhood of 0. Then µ = E(X_1) < ∞, σ² = Var(X_1) < ∞, and

(X_1 + . . . + X_n − nµ)/(σ√n) →^d Z,

where Z is a standard normal random variable.
Proof of Theorem 14.18. Assume WLOG µ = 0. Let φ be the MGF of X_1. By Taylor expansion,

φ(s) = 1 + s²σ²/2 + O(s³).

Let φ_n be the MGF of (X_1 + . . . + X_n)/(σ√n). By independence of the X_i's,

φ_n(t) = φ(t/(σ√n))^n = (1 + t²/(2n) + O(n^{−3/2}))^n → exp(t²/2) as n → ∞.

The last expression is the MGF of a standard normal random variable (see Exercise 14.17). By Theorem 14.15, we are done.
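A numerical sketch of the central limit theorem (uniform summands are an arbitrary choice):

import numpy as np

rng = np.random.default_rng(6)
n, trials = 100, 10_000
x = rng.uniform(size=(trials, n))                # X_i ~ Uniform(0,1): mu = 1/2, sigma^2 = 1/12
z = (x.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)
print(z.mean(), z.var())                         # approximately 0 and 1
print((np.abs(z) >= 1.96).mean())                # approximately 0.05, the normal two-sided tail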