Probability and Measure

The document consists of lecture notes on Probability and Measure, covering topics such as measure spaces, σ-algebras, Lebesgue measure, independence of events, and convergence of random variables. It also includes discussions on integration, inequalities in Lp spaces, the Fourier transform, and ergodic theory. The notes are based on lectures by J. Miller and are not officially endorsed, containing modifications and potential errors by the note-taker, Dexter Chua.


Part II — Probability and Measure

Based on lectures by J. Miller


Notes taken by Dexter Chua

Michaelmas 2016

These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.

Analysis II is essential

Measure spaces, σ-algebras, π-systems and uniqueness of extension, statement *and
proof* of Carathéodory’s extension theorem. Construction of Lebesgue measure on R.
The Borel σ-algebra of R. Existence of non-measurable subsets of R. Lebesgue-Stieltjes
measures and probability distribution functions. Independence of events, independence
of σ-algebras. The Borel–Cantelli lemmas. Kolmogorov’s zero-one law. [6]
Measurable functions, random variables, independence of random variables. Construc-
tion of the integral, expectation. Convergence in measure and convergence almost
everywhere. Fatou’s lemma, monotone and dominated convergence, differentiation
under the integral sign. Discussion of product measure and statement of Fubini’s
theorem. [6]
Chebyshev’s inequality, tail estimates. Jensen’s inequality. Completeness of Lp for
1 ≤ p ≤ ∞. The Hölder and Minkowski inequalities, uniform integrability. [4]
L2 as a Hilbert space. Orthogonal projection, relation with elementary conditional
probability. Variance and covariance. Gaussian random variables, the multivariate
normal distribution. [2]
The strong law of large numbers, proof for independent random variables with bounded
fourth moments. Measure preserving transformations, Bernoulli shifts. Statements
*and proofs* of maximal ergodic theorem and Birkhoff’s almost everywhere ergodic
theorem, proof of the strong law. [4]
The Fourier transform of a finite measure, characteristic functions, uniqueness and
inversion. Weak convergence, statement of Lévy’s convergence theorem for characteristic
functions. The central limit theorem. [2]


Contents
0 Introduction

1 Measures
1.1 Measures
1.2 Probability measures

2 Measurable functions and random variables
2.1 Measurable functions
2.2 Constructing new measures
2.3 Random variables
2.4 Convergence of measurable functions
2.5 Tail events

3 Integration
3.1 Definition and basic properties
3.2 Integrals and limits
3.3 New measures from old
3.4 Integration and differentiation
3.5 Product measures and Fubini’s theorem

4 Inequalities and Lp spaces
4.1 Four inequalities
4.2 Lp spaces
4.3 Orthogonal projection in L2
4.4 Convergence in L1(P) and uniform integrability

5 Fourier transform
5.1 The Fourier transform
5.2 Convolutions
5.3 Fourier inversion formula
5.4 Fourier transform in L2
5.5 Properties of characteristic functions
5.6 Gaussian random variables

6 Ergodic theory
6.1 Ergodic theorems

7 Big theorems
7.1 The strong law of large numbers
7.2 Central limit theorem

Index


0 Introduction
In measure theory, the main idea is that we want to assign “sizes” to different
sets. For example, we might think [0, 2] ⊆ R has size 2, while perhaps Q ⊆ R has
size 0. This is known as a measure. One of the main applications of a measure
is that we can use it to come up with a new definition of an integral. The idea
is very simple, but it is going to be very powerful mathematically.
Recall that if f : [0, 1] → R is continuous, then the Riemann integral of f is
defined as follows:
(i) Take a partition 0 = t0 < t1 < · · · < tn = 1 of [0, 1].

(ii) Consider the Riemann sum
\[ \sum_{j=1}^{n} f(t_j)(t_j - t_{j-1}). \]

(iii) The Riemann integral is
\[ \int f = \text{limit of Riemann sums as the mesh size of the partition} \to 0. \]

[Figure: the x-axis from 0 to 1 with the partition points t1, t2, t3, . . . , tk, tk+1, . . . marked.]

The idea of measure theory is to use a different approximation scheme. Instead
of partitioning the domain, we partition the range of the function. We fix some
numbers r0 < r1 < r2 < · · · < rn .
We then approximate the integral of f by
\[ \sum_{j=1}^{n} r_j \cdot \big(\text{``size of } f^{-1}([r_{j-1}, r_j])\text{''}\big). \]

We then define the integral as the limit of approximations of this type as the
mesh size of the partition → 0.
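The two approximation schemes can be compared numerically. The following is only a rough sketch: the function names, the choice f(x) = x², and the use of the fraction of grid points as a stand-in for the "size" of a preimage are all assumptions made for illustration, and half-open bands (r_{j-1}, r_j] are used so that no point is counted twice.

```python
from bisect import bisect_left

def riemann_sum(f, n):
    """Partition the domain [0, 1] into n equal pieces, using right endpoints t_j."""
    return sum(f(j / n) * (1 / n) for j in range(1, n + 1))

def range_partition_sum(f, levels, grid=20_000):
    """Partition the range by levels r_0 < ... < r_n and return
    sum_j r_j * size(f^{-1}((r_{j-1}, r_j])).
    The size of each preimage is estimated (an assumption of this sketch)
    by the fraction of grid midpoints whose value lands in that band.
    Assumes all values of f lie in [r_0, r_n]."""
    counts = [0] * len(levels)
    for i in range(grid):
        y = f((i + 0.5) / grid)
        j = bisect_left(levels, y)      # levels[j-1] < y <= levels[j]
        counts[j] += 1
    return sum(levels[j] * counts[j] / grid for j in range(len(levels)))

def square(x):
    return x * x

levels = [k / 200 for k in range(201)]  # r_0 = 0 < r_1 < ... < r_200 = 1
approx_domain = riemann_sum(square, 1000)
approx_range = range_partition_sum(square, levels)
```

Both numbers come out close to 1/3, the integral of x² over [0, 1]; the range-based sum overshoots by at most the mesh size of the levels, since each value is rounded up to the top of its band.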


We can make an analogy with bankers. If a Riemann banker is given a stack
of money, they would just add up the values of the notes in the order they come.
A measure-theoretic banker will sort the bank notes by type, and then find
the total value by multiplying the number of notes of each type by its value,
and adding up.
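As a toy illustration of the analogy (the stack of notes below is made up for this sketch), both bankers can be written in a few lines and must of course agree on the total:

```python
from collections import Counter

notes = [20, 5, 10, 5, 20, 20, 50, 10, 5]   # hypothetical stack of bank notes

# Riemann banker: add the values in the order they arrive.
total_in_order = 0
for value in notes:
    total_in_order += value

# Measure-theoretic banker: sort by type, then sum (value * number of that type).
counts = Counter(notes)
total_by_type = sum(value * count for value, count in counts.items())
```

Here `counts` plays the role of the "size" of each preimage: it records how many notes take each value.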
Why would we want to do so? It turns out this leads to a much more
general theory of integration on much more general spaces. Instead of integrating
functions [a, b] → R only, we can replace the domain with any measure space.
Even in the context of R, this theory of integration is much more powerful
than the Riemann integral, and can integrate a much wider class of functions. While
you probably don’t care about those pathological functions anyway, being able
to integrate more things means that we can state more general theorems about
integration without having to put in funny conditions.
That was all about measures. What about probability? It turns out the
concepts we develop for measures correspond exactly to many familiar notions
from probability if we restrict it to the particular case where the total measure
of the space is 1. Thus, when studying measure theory, we are also secretly
studying probability!


1 Measures
In this course, we will write fn ↗ f for “fn converges to f and the sequence
(fn) is increasing”, and fn ↘ f similarly for decreasing sequences. Unless
otherwise specified, convergence is taken to be pointwise.

1.1 Measures
The starting point of all these is to come up with a function that determines
the “size” of a given set, known as a measure. It turns out we cannot sensibly
define a size for all subsets of [0, 1]. Thus, we need to restrict our attention to a
collection of “nice” subsets. Specifying which subsets are “nice” would involve
specifying a σ-algebra.
This section is mostly technical.
Definition (σ-algebra). Let E be a set. A σ-algebra ℰ on E is a collection of
subsets of E such that

(i) ∅ ∈ ℰ.
(ii) A ∈ ℰ implies that A^C = E \ A ∈ ℰ.
(iii) For any sequence (An) in ℰ, we have that
\[ \bigcup_n A_n \in \mathcal{E}. \]

The pair (E, ℰ) is called a measurable space.

Note that the axioms imply that σ-algebras are also closed under countable
intersections, since by De Morgan's laws we have
\[ \bigcap_n A_n = \Big(\bigcup_n A_n^C\Big)^C. \]

Definition (Measure). A measure on a measurable space (E, ℰ) is a function
µ : ℰ → [0, ∞] such that
(i) µ(∅) = 0
(ii) Countable additivity: For any disjoint sequence (An) in ℰ, we have
\[ \mu\Big(\bigcup_n A_n\Big) = \sum_{n=1}^{\infty} \mu(A_n). \]

Example. Let E be any countable set, and ℰ = P(E) be the set of all subsets
of E. A mass function is any function m : E → [0, ∞]. We can then define a
measure by setting
\[ \mu(A) = \sum_{x \in A} m(x). \]
In particular, if we put m(x) = 1 for all x ∈ E, then we obtain the counting
measure.
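On a finite set this construction is easy to write down explicitly. The sketch below uses a made-up mass function on a three-point set (the names `measure_from_mass`, `mass` and `counting` are illustrative only) and exact rational arithmetic.

```python
from fractions import Fraction

def measure_from_mass(mass):
    """Return the set function mu(A) = sum of m(x) over x in A,
    for a mass function m given as a dict on a finite set E."""
    def mu(subset):
        return sum((mass[x] for x in subset), Fraction(0))
    return mu

# Hypothetical mass function on E = {"a", "b", "c"}.
mass = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
mu = measure_from_mass(mass)

# Counting measure: m(x) = 1 for every x in E.
counting = measure_from_mass({x: Fraction(1) for x in mass})
```

For instance, `mu({"a"}) + mu({"b", "c"})` equals `mu({"a", "b", "c"})`, as additivity on disjoint sets requires, and `counting` simply returns the number of elements.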


Countable spaces are nice, because we can always take ℰ = P(E), and the
measure can be defined on all possible subsets. However, for “bigger” spaces, we
have to be more careful. The set of all subsets is often “too large”. We will see
a concrete and also important example of this later.
In general, σ-algebras on large spaces are often described in terms of a smaller
collection of sets, known as the generating sets.
Definition (Generator of σ-algebra). Let E be a set, and let 𝒜 ⊆ P(E) be a
collection of subsets of E. We define

σ(𝒜) = {A ⊆ E : A ∈ ℰ for all σ-algebras ℰ that contain 𝒜}.

In other words, σ(𝒜) is the smallest σ-algebra that contains 𝒜. This is
known as the σ-algebra generated by 𝒜.
Example. Take E = Z, and A = {{x} : x ∈ Z}. Then σ(A) is just P (E), since
every subset of E can be written as a countable union of singletons.
Example. Take E = Z, and let A = {{x, x + 1, x + 2, x + 3, · · · } : x ∈ E}. Then
again σ(A) is the set of all subsets of E.
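On a finite set, the generated σ-algebra can be computed by brute force: keep adding complements and unions until nothing new appears. The sketch below does this on the four-point set {0, 1, 2, 3}, which is a finite stand-in for Z chosen for illustration; `generated_sigma_algebra` is a hypothetical helper, not something from the course.

```python
from itertools import combinations

def generated_sigma_algebra(universe, generators):
    """Smallest collection containing the generators that is closed under
    complement and union. On a finite set, countable unions reduce to
    finite ones, so this is the generated sigma-algebra."""
    universe = frozenset(universe)
    sets = {frozenset(), universe} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        for a in current:
            if universe - a not in sets:
                sets.add(universe - a)
                changed = True
        for a, b in combinations(current, 2):
            if a | b not in sets:
                sets.add(a | b)
                changed = True
    return sets

E = {0, 1, 2, 3}
singletons = [{x} for x in E]
tails = [{y for y in E if y >= x} for x in E]   # finite analogue of {x, x+1, ...}
coarse = [{0, 1}]

from_singletons = generated_sigma_algebra(E, singletons)
from_tails = generated_sigma_algebra(E, tails)
from_coarse = generated_sigma_algebra(E, coarse)
```

Both `from_singletons` and `from_tails` contain all 16 subsets of E, mirroring the two examples above, while the single generator {0, 1} only produces ∅, {0, 1}, {2, 3} and E.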
The following is the most important σ-algebra in the course:
Definition (Borel σ-algebra). Let E = R, and A = {U ⊆ R : U is open}. Then
σ(A) is known as the Borel σ-algebra, which is not the set of all subsets of R.
We can equivalently define this by à = {(a, b) : a < b, a, b ∈ Q}. Then σ(Ã)
is also the Borel σ-algebra.
Often, we would like to prove results that allow us to deduce properties
about the σ-algebra just by checking it on a generating set. However, usually,
we cannot just check it on an arbitrary generating set. Instead, the generating
set has to satisfy some nice closure properties. We are now going to introduce a
number of definitions that you need not aim to remember (except
when exams are near).
Definition (π-system). Let 𝒜 be a collection of subsets of E. Then 𝒜 is called
a π-system if
(i) ∅ ∈ 𝒜
(ii) If A, B ∈ 𝒜, then A ∩ B ∈ 𝒜.
Definition (d-system). Let 𝒜 be a collection of subsets of E. Then 𝒜 is called
a d-system if
(i) E ∈ 𝒜
(ii) If A, B ∈ 𝒜 and A ⊆ B, then B \ A ∈ 𝒜
(iii) For all increasing sequences (An) in 𝒜, we have that ⋃n An ∈ 𝒜.
The point of d-systems and π-systems is that they separate the axioms of a
σ-algebra into two parts. More precisely, we have
Proposition. A collection A is a σ-algebra if and only if it is both a π-system
and a d-system.


This follows rather straightforwardly from the definitions.


The following definitions are also useful:
Definition (Ring). A collection of subsets A is a ring on E if ∅ ∈ A and for all
A, B ∈ A, we have B \ A ∈ A and A ∪ B ∈ A.
Definition (Algebra). A collection of subsets A is an algebra on E if ∅ ∈ A,
and for all A, B ∈ A, we have AC ∈ A and A ∪ B ∈ A.
So an algebra is like a σ-algebra, but it is only required to be closed under
finite unions, rather than countable unions.
While the names π-system and d-system are rather arbitrary, we can make
some sense of the names “ring” and “algebra”. Indeed, a ring forms a ring
(without unity) in the algebraic sense with symmetric difference as “addition”
and intersection as “multiplication”. Then the empty set acts as the additive
identity, and E, if present, acts as the multiplicative identity. Similarly, an
algebra is a Boolean subalgebra of the Boolean algebra P(E).
A very important lemma about these things is Dynkin’s lemma:
Lemma (Dynkin’s π-system lemma). Let A be a π-system. Then any d-system
which contains A contains σ(A).
This will be very useful in the future. If we want to show that all elements of
σ(A) satisfy a particular property for some generating π-system A, we just have
to show that the elements of A satisfy that property, and that the collection of
things that satisfy the property form a d-system.
While this use case might seem rather contrived, it is surprisingly common
when we have to prove things.
Proof. Let 𝒟 be the intersection of all d-systems containing 𝒜, i.e. the smallest
d-system containing 𝒜. We show that 𝒟 contains σ(𝒜). To do so, we will show
that 𝒟 is a π-system, hence a σ-algebra.
There are two steps to the proof, both of which are straightforward verifications:
(i) We first show that if B ∈ 𝒟 and A ∈ 𝒜, then B ∩ A ∈ 𝒟.
(ii) We then show that if A, B ∈ 𝒟, then A ∩ B ∈ 𝒟.
Then the result immediately follows from the second part.
We let
𝒟′ = {B ∈ 𝒟 : B ∩ A ∈ 𝒟 for all A ∈ 𝒜}.
We note that 𝒟′ ⊇ 𝒜 because 𝒜 is a π-system, and is hence closed under
intersections. We check that 𝒟′ is a d-system. It is clear that E ∈ 𝒟′. If we
have B1, B2 ∈ 𝒟′, where B1 ⊆ B2, then for any A ∈ 𝒜, we have

(B2 \ B1) ∩ A = (B2 ∩ A) \ (B1 ∩ A).

By definition of 𝒟′, we know B2 ∩ A and B1 ∩ A are elements of 𝒟, and
B1 ∩ A ⊆ B2 ∩ A. Since 𝒟 is a d-system, we know this difference is in 𝒟.
So B2 \ B1 ∈ 𝒟′.
Finally, suppose that (Bn) is an increasing sequence in 𝒟′, with B = ⋃n Bn.
Then for every A ∈ 𝒜, the sets Bn ∩ A form an increasing sequence in 𝒟, so
\[ B \cap A = \Big(\bigcup_n B_n\Big) \cap A = \bigcup_n (B_n \cap A) \in \mathcal{D}. \]
Therefore B ∈ 𝒟′.
Therefore 𝒟′ is a d-system contained in 𝒟, which also contains 𝒜. By our
choice of 𝒟, we know 𝒟′ = 𝒟.
We now let

𝒟″ = {B ∈ 𝒟 : B ∩ A ∈ 𝒟 for all A ∈ 𝒟}.

Since 𝒟′ = 𝒟, we again have 𝒜 ⊆ 𝒟″, and the same argument as above implies
that 𝒟″ is a d-system which is between 𝒜 and 𝒟. But the only way that can
happen is if 𝒟″ = 𝒟, and this implies that 𝒟 is a π-system.
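Dynkin's lemma can be sanity-checked on a finite set, where increasing unions add nothing new and the smallest d-system is obtained by closing under E and nested differences. The helpers `close`, `d_system_step` and `sigma_algebra_step` below are hypothetical names for this sketch, and the two generating families are chosen only to show that the π-system hypothesis matters.

```python
def close(universe, generators, step):
    """Repeatedly apply `step` (which yields sets that must also belong
    to the collection) until nothing new appears."""
    universe = frozenset(universe)
    sets = {frozenset(g) for g in generators}
    while True:
        new = set(step(universe, sets)) - sets
        if not new:
            return sets
        sets |= new

def d_system_step(universe, sets):
    yield universe                      # E belongs to every d-system
    for a in sets:
        for b in sets:
            if a <= b:
                yield b - a             # differences B \ A with A a subset of B
    # increasing unions give nothing new on a finite set

def sigma_algebra_step(universe, sets):
    yield frozenset()
    for a in sets:
        yield universe - a              # complements
        for b in sets:
            yield a | b                 # (finite) unions

E = {1, 2, 3, 4}
pi_system = [set(), {2}, {1, 2}, {2, 3}]    # closed under intersection
not_pi_system = [{1, 2}, {2, 3}]            # {2} = {1,2} ∩ {2,3} is missing

d_pi = close(E, pi_system, d_system_step)
sigma_pi = close(E, pi_system, sigma_algebra_step)
d_other = close(E, not_pi_system, d_system_step)
sigma_other = close(E, not_pi_system, sigma_algebra_step)
```

For the π-system, the smallest d-system already equals the generated σ-algebra (all 16 subsets). For the second family it has only 6 elements, while the generated σ-algebra again has 16, so the conclusion of the lemma can fail without the π-system assumption.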
After defining all sorts of things that are “weaker versions” of σ-algebras, we
now define a bunch of measure-like objects that satisfy fewer properties. Again,
no one really remembers these definitions:
Definition (Set function). Let 𝒜 be a collection of subsets of E with ∅ ∈ 𝒜. A
set function is a function µ : 𝒜 → [0, ∞] such that µ(∅) = 0.
Definition (Increasing set function). A set function is increasing if it has the
property that for all A, B ∈ A with A ⊆ B, we have µ(A) ≤ µ(B).
Definition (Additive set function). A set function is additive if whenever
A, B ∈ A and A ∪ B ∈ A, A ∩ B = ∅, then µ(A ∪ B) = µ(A) + µ(B).
Definition (Countably additive set function). A set function is countably
additive if whenever (An) is a sequence of disjoint sets in 𝒜 with ⋃n An ∈ 𝒜, then
\[ \mu\Big(\bigcup_n A_n\Big) = \sum_n \mu(A_n). \]
Under these definitions, a measure is just a countably additive set function
defined on a σ-algebra.
Definition (Countably subadditive set function). A set function is countably
subadditive if whenever (An) is a sequence of sets in 𝒜 with ⋃n An ∈ 𝒜, then
\[ \mu\Big(\bigcup_n A_n\Big) \leq \sum_n \mu(A_n). \]

The big theorem that allows us to construct measures is the Carathéodory
extension theorem. In particular, this will help us construct the Lebesgue measure
on R.
Theorem (Carathéodory extension theorem). Let 𝒜 be a ring on E, and µ
a countably additive set function on 𝒜. Then µ extends to a measure on the
σ-algebra generated by 𝒜.
Proof. (non-examinable) We start by defining what we want our measure to be.
For B ⊆ E, we set
\[ \mu^*(B) = \inf\Big\{ \sum_n \mu(A_n) : (A_n) \text{ is a sequence in } \mathcal{A} \text{ and } B \subseteq \bigcup_n A_n \Big\}. \]
n

If it happens that there is no such sequence, we set this to be ∞. This µ∗ is
known as the outer measure. It is clear that µ∗(∅) = 0, and that µ∗ is increasing.
We say a set A ⊆ E is µ∗-measurable if

µ∗(B) = µ∗(B ∩ A) + µ∗(B ∩ A^C)

for all B ⊆ E. We let

ℳ = {µ∗-measurable sets}.

We will show the following:

(i) ℳ is a σ-algebra containing 𝒜.
(ii) µ∗ is a measure on ℳ with µ∗|𝒜 = µ.

Note that it is not true in general that ℳ = σ(𝒜). However, we will always
have ℳ ⊇ σ(𝒜).
We are going to break this up into five nice bite-size chunks.
Claim. µ∗ is countably subadditive.
Suppose B ⊆ ⋃n Bn. We need to show that µ∗(B) ≤ Σn µ∗(Bn). We can
wlog assume that µ∗(Bn) is finite for all n, or else the inequality is trivial. Let
ε > 0. Then by definition of the outer measure, for each n, we can find a
sequence (Bn,m), m = 1, 2, . . . , in 𝒜 with the property that
\[ B_n \subseteq \bigcup_m B_{n,m} \]
and
\[ \mu^*(B_n) + \frac{\varepsilon}{2^n} \geq \sum_m \mu(B_{n,m}). \]
Then we have
\[ B \subseteq \bigcup_n B_n \subseteq \bigcup_{n,m} B_{n,m}. \]
Thus, by definition of µ∗, we have
\[ \mu^*(B) \leq \sum_{n,m} \mu(B_{n,m}) \leq \sum_n \Big( \mu^*(B_n) + \frac{\varepsilon}{2^n} \Big) = \varepsilon + \sum_n \mu^*(B_n). \]
Since ε was arbitrary, we are done.


Claim. µ∗ agrees with µ on 𝒜.
In the first example sheet, we will show that if 𝒜 is a ring and µ is a countably
additive set function on 𝒜, then µ is in fact countably subadditive and increasing.
Assuming this, suppose that A, (An) are in 𝒜 and A ⊆ ⋃n An. Then by
subadditivity, we have
\[ \mu(A) \leq \sum_n \mu(A \cap A_n) \leq \sum_n \mu(A_n), \]
using that µ is countably subadditive and increasing. Note that we have to do
this in two steps, rather than just applying countable subadditivity, since we did
not assume that ⋃n An ∈ 𝒜. Taking the infimum over all sequences, we have

µ(A) ≤ µ∗(A).

Also, we see by definition that µ(A) ≥ µ∗(A), since A covers A. So we get that
µ(A) = µ∗(A) for all A ∈ 𝒜.
Claim. ℳ contains 𝒜.
Suppose that A ∈ 𝒜 and B ⊆ E. We need to show that

µ∗(B) = µ∗(B ∩ A) + µ∗(B ∩ A^C).

Since µ∗ is countably subadditive, we immediately have µ∗(B) ≤ µ∗(B ∩ A) +
µ∗(B ∩ A^C). For the other inequality, we first observe that it is trivial if µ∗(B)
is infinite. If it is finite, then by definition, given ε > 0, we can find some (Bn)
in 𝒜 such that B ⊆ ⋃n Bn and
\[ \mu^*(B) + \varepsilon \geq \sum_n \mu(B_n). \]
Then we have
\[ B \cap A \subseteq \bigcup_n (B_n \cap A), \qquad B \cap A^C \subseteq \bigcup_n (B_n \cap A^C). \]
We notice that Bn ∩ A^C = Bn \ A ∈ 𝒜. Thus, by definition of µ∗, we have
\begin{align*}
\mu^*(B \cap A) + \mu^*(B \cap A^C) &\leq \sum_n \mu(B_n \cap A) + \sum_n \mu(B_n \cap A^C) \\
&= \sum_n \big( \mu(B_n \cap A) + \mu(B_n \cap A^C) \big) \\
&= \sum_n \mu(B_n) \\
&\leq \mu^*(B) + \varepsilon.
\end{align*}
Since ε was arbitrary, the result follows.


Claim. ℳ is an algebra.
We first show that E ∈ ℳ. This is true since we obviously have

µ∗(B) = µ∗(B ∩ E) + µ∗(B ∩ E^C)

for all B ⊆ E.
Next, note that if A ∈ ℳ, then by definition we have, for all B,

µ∗(B) = µ∗(B ∩ A) + µ∗(B ∩ A^C).

Now note that this definition is symmetric in A and A^C. So we also have
A^C ∈ ℳ.


Finally, we have to show that ℳ is closed under intersection (which is
equivalent to being closed under union when we have complements). Suppose
A1, A2 ∈ ℳ and B ⊆ E. Then we have
\begin{align*}
\mu^*(B) &= \mu^*(B \cap A_1) + \mu^*(B \cap A_1^C) \\
&= \mu^*(B \cap A_1 \cap A_2) + \mu^*(B \cap A_1 \cap A_2^C) + \mu^*(B \cap A_1^C) \\
&= \mu^*(B \cap (A_1 \cap A_2)) + \mu^*(B \cap (A_1 \cap A_2)^C \cap A_1) + \mu^*(B \cap (A_1 \cap A_2)^C \cap A_1^C) \\
&= \mu^*(B \cap (A_1 \cap A_2)) + \mu^*(B \cap (A_1 \cap A_2)^C).
\end{align*}
So we have A1 ∩ A2 ∈ ℳ. So ℳ is an algebra.
Claim. ℳ is a σ-algebra, and µ∗ is a measure on ℳ.
To show that ℳ is a σ-algebra, we need to show that it is closed under
countable unions. Since ℳ is an algebra, it suffices to consider disjoint unions.
So let (An) be a disjoint sequence of sets in ℳ. We want to show that
A = ⋃n An ∈ ℳ and µ∗(A) = Σn µ∗(An).
Suppose that B ⊆ E. Then we have
\[ \mu^*(B) = \mu^*(B \cap A_1) + \mu^*(B \cap A_1^C). \]
Using the fact that A2 ∈ ℳ and A1 ∩ A2 = ∅, we have
\begin{align*}
\mu^*(B) &= \mu^*(B \cap A_1) + \mu^*(B \cap A_2) + \mu^*(B \cap A_1^C \cap A_2^C) \\
&= \cdots \\
&= \sum_{i=1}^{n} \mu^*(B \cap A_i) + \mu^*(B \cap A_1^C \cap \cdots \cap A_n^C) \\
&\geq \sum_{i=1}^{n} \mu^*(B \cap A_i) + \mu^*(B \cap A^C).
\end{align*}
Taking the limit as n → ∞, we have
\[ \mu^*(B) \geq \sum_{i=1}^{\infty} \mu^*(B \cap A_i) + \mu^*(B \cap A^C). \]
By the countable subadditivity of µ∗, we have
\[ \mu^*(B \cap A) \leq \sum_{i=1}^{\infty} \mu^*(B \cap A_i). \]
Thus we obtain
\[ \mu^*(B) \geq \mu^*(B \cap A) + \mu^*(B \cap A^C). \]
By countable subadditivity, we also have the inequality in the other direction. So
equality holds. So A ∈ ℳ. So ℳ is a σ-algebra.
To see that µ∗ is a measure on ℳ, note that the above implies that
\[ \mu^*(B) = \sum_{i=1}^{\infty} \mu^*(B \cap A_i) + \mu^*(B \cap A^C). \]
Taking B = A, this gives
\[ \mu^*(A) = \sum_{i=1}^{\infty} \mu^*(A \cap A_i) + \mu^*(A \cap A^C) = \sum_{i=1}^{\infty} \mu^*(A_i). \]

Note that when A itself is actually a σ-algebra, the outer measure can be
simply written as

µ∗ (B) = inf{µ(A) : A ∈ A, B ⊆ A}.

Caratheodory gives us the existence of some measure extending the set function
on A. Could there be many? In general, there could. However, in the special
case where the measure is finite, we do get uniqueness.
Theorem. Suppose that µ1, µ2 are measures on (E, ℰ) with µ1(E) = µ2(E) <
∞. If 𝒜 is a π-system with σ(𝒜) = ℰ, and µ1 agrees with µ2 on 𝒜, then µ1 = µ2.

Proof. Let
𝒟 = {A ∈ ℰ : µ1(A) = µ2(A)}.
We know that 𝒟 ⊇ 𝒜. By Dynkin’s lemma, it suffices to show that 𝒟 is a
d-system. The things to check are:
(i) E ∈ 𝒟: this follows by assumption.
(ii) If A, B ∈ 𝒟 with A ⊆ B, then B \ A ∈ 𝒟. Indeed, we have the equations

µ1(B) = µ1(A) + µ1(B \ A) < ∞
µ2(B) = µ2(A) + µ2(B \ A) < ∞.

Since µ1(B) = µ2(B) and µ1(A) = µ2(A), we must have µ1(B \ A) =
µ2(B \ A).
(iii) Let (An) be an increasing sequence in 𝒟 with ⋃n An = A. Then
\[ \mu_1(A) = \lim_{n \to \infty} \mu_1(A_n) = \lim_{n \to \infty} \mu_2(A_n) = \mu_2(A). \]
So A ∈ 𝒟.
The assumption that µ1 (E) = µ2 (E) < ∞ is necessary. The theorem does
not necessarily hold without it. We can see this from a simple counterexample:
Example. Let E = Z, and let ℰ = P(E). We let

𝒜 = {{x, x + 1, x + 2, · · · } : x ∈ E} ∪ {∅}.

This is a π-system with σ(𝒜) = ℰ. We let µ1(A) be the number of elements in A,
and µ2 = 2µ1. Then obviously µ1 ≠ µ2, but µ1(A) = ∞ = µ2(A) for all
non-empty A ∈ 𝒜, and µ1(∅) = 0 = µ2(∅).
Definition (Borel σ-algebra). Let E be a topological space. We define the
Borel σ-algebra as
B(E) = σ({U ⊆ E : U is open}).
We write B for B(R).


Definition (Borel measure and Radon measure). A measure µ on (E, B(E)) is
called a Borel measure. If µ(K) < ∞ for all K ⊆ E compact, then µ is a Radon
measure.
The most important example of a Borel measure we will consider is the
Lebesgue measure.
Theorem. There exists a unique Borel measure µ on R with µ([a, b]) = b − a.
Proof. We first show uniqueness. Suppose µ̃ is another measure on B satisfying
the above property. We want to apply the previous uniqueness theorem, but our
measure is not finite. So we need to carefully get around that problem.
For each n ∈ Z, we set
µn(A) = µ(A ∩ (n, n + 1]),
µ̃n(A) = µ̃(A ∩ (n, n + 1]).
Then µn and µ̃n are finite measures on B which agree on the π-system of intervals
of the form (a, b] with a, b ∈ R, a < b. Therefore we have µn = µ̃n for all n ∈ Z.
Now we have
\[ \mu(A) = \sum_{n \in \mathbb{Z}} \mu(A \cap (n, n+1]) = \sum_{n \in \mathbb{Z}} \mu_n(A) = \sum_{n \in \mathbb{Z}} \tilde{\mu}_n(A) = \tilde{\mu}(A) \]
for all Borel sets A.


To show existence, we want to use the Caratheodory extension theorem. We
let A be the collection of finite, disjoint unions of the form
A = (a1 , b1 ] ∪ (a2 , b2 ] ∪ · · · ∪ (an , bn ].
Then A is a ring of subsets of R, and σ(A) = B (details are to be checked on
the first example sheet).
We set
\[ \mu(A) = \sum_{i=1}^{n} (b_i - a_i). \]
We note that µ is well-defined, since if
\[ A = (a_1, b_1] \cup \cdots \cup (a_n, b_n] = (\tilde{a}_1, \tilde{b}_1] \cup \cdots \cup (\tilde{a}_m, \tilde{b}_m] \]
are two such disjoint decompositions, then
\[ \sum_{i=1}^{n} (b_i - a_i) = \sum_{i=1}^{m} (\tilde{b}_i - \tilde{a}_i). \]
Also, if A, B ∈ 𝒜, A ∩ B = ∅ and A ∪ B ∈ 𝒜, we obviously have
µ(A ∪ B) = µ(A) + µ(B). So µ is additive.
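The set function on finite unions of half-open intervals is simple enough to compute directly. In the sketch below, an interval (a, b] is represented by the pair `(a, b)`, and `normalise` and `length` are hypothetical helper names; merging first makes the value independent of how the union is written, which is the well-definedness being claimed.

```python
from fractions import Fraction

def normalise(intervals):
    """Rewrite a finite union of half-open intervals (a, b] as a sorted list
    of disjoint, non-adjacent pieces. Since (0, 1] and (1, 2] together make
    (0, 2], pieces that touch are merged too."""
    merged = []
    for a, b in sorted((a, b) for a, b in intervals if a < b):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged

def length(intervals):
    """mu(A) = sum of (b_i - a_i) over the normalised pieces of A."""
    return sum((b - a for a, b in normalise(intervals)), Fraction(0))

half = Fraction(1, 2)
first_way = [(0, half), (half, 1), (2, 3)]                 # one decomposition of A
second_way = [(0, 1), (2, 2 + half), (2 + half, 3)]        # another decomposition of A
```

Both decompositions describe the same set (0, 1] ∪ (2, 3] and give the same value, and `length` is additive on disjoint unions such as (0, 1] and (2, 3].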
Finally, we have to show that µ is in fact countably additive. Let (An) be a
disjoint sequence in 𝒜, and let A = ⋃n An ∈ 𝒜. Then we need to show that
\[ \mu(A) = \sum_{n=1}^{\infty} \mu(A_n). \]
Since µ is additive, we have
\begin{align*}
\mu(A) &= \mu(A_1) + \mu(A \setminus A_1) \\
&= \mu(A_1) + \mu(A_2) + \mu(A \setminus (A_1 \cup A_2)) \\
&= \sum_{i=1}^{n} \mu(A_i) + \mu\Big( A \setminus \bigcup_{i=1}^{n} A_i \Big).
\end{align*}


To finish the proof, we show that
\[ \mu\Big( A \setminus \bigcup_{i=1}^{n} A_i \Big) \to 0 \quad \text{as } n \to \infty. \]
We are going to reduce this to the finite intersection property of compact sets
in R: if (Kn) is a sequence of compact sets in R with the property that
K1 ∩ · · · ∩ Kn ≠ ∅ for all n, then the intersection of all the Kn is non-empty.
We first introduce some new notation. We let
\[ B_n = A \setminus \bigcup_{m=1}^{n} A_m. \]
We now suppose, for contradiction, that µ(Bn) ↛ 0 as n → ∞. Since the Bn’s
are decreasing, there must exist ε > 0 such that µ(Bn) ≥ 2ε for every n.
For each n, we take Cn ∈ 𝒜 with the property that the closure C̄n ⊆ Bn and
µ(Bn \ Cn) ≤ ε/2^n. This is possible since each Bn is just a finite union of
intervals. Thus we have
\begin{align*}
\mu(B_n) - \mu\Big( \bigcap_{m=1}^{n} C_m \Big) &= \mu\Big( B_n \setminus \bigcap_{m=1}^{n} C_m \Big) \\
&\leq \mu\Big( \bigcup_{m=1}^{n} (B_m \setminus C_m) \Big) \\
&\leq \sum_{m=1}^{n} \mu(B_m \setminus C_m) \\
&\leq \sum_{m=1}^{n} \frac{\varepsilon}{2^m} \\
&\leq \varepsilon.
\end{align*}
On the other hand, we also know that µ(Bn) ≥ 2ε. So
\[ \mu\Big( \bigcap_{m=1}^{n} C_m \Big) \geq \varepsilon \]
for all n. We now let Kn = C̄1 ∩ · · · ∩ C̄n. Then Kn is compact and contains
C1 ∩ · · · ∩ Cn, which has measure at least ε, and in particular Kn ≠ ∅ for all n.
Thus, the finite intersection property says
\[ \emptyset \neq \bigcap_{n=1}^{\infty} K_n \subseteq \bigcap_{n=1}^{\infty} B_n = \emptyset. \]
This is a contradiction. So we have µ(Bn) → 0 as n → ∞. So done.


Definition (Lebesgue measure). The Lebesgue measure is the unique Borel
measure µ on R with µ([a, b]) = b − a.
Note that the Lebesgue measure is not a finite measure, since µ(R) = ∞.
However, it is a σ-finite measure.


Definition (σ-finite measure). Let (E, ℰ) be a measurable space, and µ a
measure. We say µ is σ-finite if there exists a sequence (En) in ℰ such that
⋃n En = E and µ(En) < ∞ for all n.

This is the next best thing we can hope for after finiteness, and often proofs that
involve finiteness carry over to σ-finite measures.
Proposition. The Lebesgue measure is translation invariant, i.e.

µ(A + x) = µ(A)

for all A ∈ B and x ∈ R, where

A + x = {y + x : y ∈ A}.

Proof. We use the uniqueness of the Lebesgue measure. We let

µx (A) = µ(A + x)

for A ∈ B. Then this is a measure on B satisfying µx([a, b]) = b − a. So the
uniqueness of the Lebesgue measure shows that µx = µ.
It turns out that translation invariance actually characterizes the Lebesgue
measure.
Proposition. Let µ̃ be a Borel measure on R that is translation invariant and
satisfies µ̃([0, 1]) = 1. Then µ̃ is the Lebesgue measure.
Proof. We show that any such measure must satisfy

µ̃([a, b]) = b − a.

By additivity and translation invariance, we can show that µ̃([p, q]) = q − p for all
rational p < q. By considering µ̃([p, p + 1/n]) for all n and using the increasing
property, we know µ̃({p}) = 0. So µ̃([p, q)) = µ̃((p, q]) = µ̃((p, q)) = q − p for
all rational p < q.
Finally, by countable additivity, we can extend this to all real intervals. Then
the result follows from the uniqueness of the Lebesgue measure.
In the proof of the Caratheodory extension theorem, we constructed a measure
µ∗ on the σ-algebra M of µ∗ -measurable sets which contains A. This contains
B = σ(A), but could in fact be bigger than it. We call M the Lebesgue σ-algebra.
Indeed, it can be given by

M = {A ∪ N : A ∈ B, N ⊆ B ∈ B with µ(B) = 0}.

If A ∪ N ∈ M, then µ(A ∪ N ) = µ(A). The proof is left for the example sheet.
It is also true that M is strictly larger than B, so there exists A ∈ M with
A ∉ B. Construction of such a set was on last year’s exam (2016).
On the other hand, it is also true that not all sets are Lebesgue measurable.
This is a rather funny construction.


Example. For x, y ∈ [0, 1), we say x ∼ y if x − y is rational. This defines an
equivalence relation on [0, 1). By the axiom of choice, we pick a representative
of each equivalence class, and put them into a set S ⊆ [0, 1). We will show that
S is not Lebesgue measurable.
Suppose that S were Lebesgue measurable. We are going to get a contra-
diction to the countable additivity of the Lebesgue measure. For each rational
r ∈ [0, 1) ∩ Q, we define

Sr = {s + r mod 1 : s ∈ S}.

By translation invariance, we know Sr is also Lebesgue measurable, and µ(Sr) =
µ(S).
Also, by construction of S, we know the sets Sr, for r ∈ [0, 1) ∩ Q, are disjoint,
and their union is [0, 1). Now by countable additivity, we have
\[ 1 = \mu([0, 1)) = \mu\Big( \bigcup_{r \in [0,1) \cap \mathbb{Q}} S_r \Big) = \sum_{r \in [0,1) \cap \mathbb{Q}} \mu(S_r) = \sum_{r \in [0,1) \cap \mathbb{Q}} \mu(S), \]
which is clearly not possible. Indeed, if µ(S) = 0, then this says 1 = 0; if
µ(S) > 0, then this says 1 = ∞. Both are absurd.

1.2 Probability measures


Since the course is called “probability and measure”, we’d better start talking
about probability! It turns out the notions we care about in probability theory
are very naturally just special cases of the concepts we have previously considered.
Definition (Probability measure and probability space). Let (E, ℰ, µ) be a measure
space with the property that µ(E) = 1. Then we often call µ a probability measure,
and (E, ℰ, µ) a probability space.
Probability spaces are usually written as (Ω, F, P) instead.
Definition (Sample space). In a probability space (Ω, F, P), we often call Ω
the sample space.
Definition (Events). In a probability space (Ω, F, P), we often call the elements
of F the events.
Definition (Probability). In a probability space (Ω, F, P), if A ∈ F, we often
call P[A] the probability of the event A.
These are exactly the same things as measures, but with different names!
However, thinking of them as probabilities could make us ask different questions
about these measure spaces. For example, in probability, one is often interested
in independence.
Definition (Independence of events). A sequence of events (An) is said to be
independent if
\[ \mathbb{P}\Big[ \bigcap_{n \in J} A_n \Big] = \prod_{n \in J} \mathbb{P}[A_n] \]
for all finite subsets J ⊆ N.
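The condition has to hold for every finite subfamily, not just for pairs. A small sketch on a made-up probability space (two fair coin flips with the uniform measure; `prob` and `independent` are hypothetical helper names) shows three events that are pairwise independent but not independent:

```python
from fractions import Fraction
from itertools import combinations, product

omega = list(product("HT", repeat=2))        # sample space: two coin flips

def prob(event):
    """Uniform probability on omega."""
    return Fraction(len(event), len(omega))

def independent(events):
    """Check P[intersection over J] = product over J of P[A_n]
    for every subfamily J with at least two members."""
    for size in range(2, len(events) + 1):
        for family in combinations(events, size):
            joint = set(omega).intersection(*family)
            expected = Fraction(1)
            for event in family:
                expected *= prob(event)
            if prob(joint) != expected:
                return False
    return True

first_heads = {w for w in omega if w[0] == "H"}
second_heads = {w for w in omega if w[1] == "H"}
same_face = {w for w in omega if w[0] == w[1]}
```

Each pair of these events passes the check, but the triple fails: the intersection of all three is {HH}, which has probability 1/4 rather than 1/8.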


However, it turns out that talking about independence of events is usually
too restrictive. Instead, we want to talk about the independence of σ-algebras:

Definition (Independence of σ-algebras). A sequence of σ-algebras (𝒜n) with
𝒜n ⊆ F for all n is said to be independent if the following is true: if (An) is a
sequence where An ∈ 𝒜n for all n, then (An) is independent.
Proposition. Events (An ) are independent iff the σ-algebras σ(An ) are inde-
pendent.

While proving this directly would be rather tedious (but not too hard), it is
an immediate consequence of the following theorem:
Theorem. Suppose 𝒜1 and 𝒜2 are π-systems in F. If

P[A1 ∩ A2] = P[A1]P[A2]

for all A1 ∈ 𝒜1 and A2 ∈ 𝒜2, then σ(𝒜1) and σ(𝒜2) are independent.

Proof. This will follow from two applications of the fact that a finite measure is
determined by its values on a π-system which generates the entire σ-algebra.
We first fix A1 ∈ 𝒜1. We define the measures

µ(A) = P[A ∩ A1]

and
ν(A) = P[A]P[A1]
for all A ∈ F. By assumption, we know µ and ν agree on 𝒜2, and we have that
µ(Ω) = P[A1] = ν(Ω) ≤ 1 < ∞. So µ and ν agree on σ(𝒜2). So we have

P[A1 ∩ A2] = µ(A2) = ν(A2) = P[A1]P[A2]

for all A2 ∈ σ(𝒜2).
So we have now shown that if 𝒜1 and 𝒜2 are independent, then 𝒜1 and
σ(𝒜2) are independent. By symmetry, the same argument applied to the
π-systems σ(𝒜2) and 𝒜1 shows that σ(𝒜1) and σ(𝒜2) are independent.
Say we are rolling a die. Instead of asking what the probability of getting
a 6 is, we might be interested instead in the probability of getting a 6 infinitely
often. Intuitively, the answer is “it happens with probability 1”, because in each
roll, we have a probability of 1/6 of getting a 6, and the rolls are all independent.
We would like to make this precise and actually prove it. It turns out that
the notions of “occurs infinitely often” and also “occurs eventually” correspond
to more analytic notions of lim sup and lim inf.

Definition (limsup and liminf). Let $(A_n)$ be a sequence of events. We define
\[
\begin{aligned}
  \limsup A_n &= \bigcap_n \bigcup_{m \geq n} A_m, \\
  \liminf A_n &= \bigcup_n \bigcap_{m \geq n} A_m.
\end{aligned}
\]


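To parse these definitions more easily, we can read ∩ as “for all”, and ∪ as
“there exists”. For example, we can write
\[
\begin{aligned}
  \limsup A_n &= \{\forall n, \exists m \geq n \text{ such that } A_m \text{ occurs}\} \\
  &= \{x : \forall n, \exists m \geq n, x \in A_m\} \\
  &= \{A_m \text{ occurs infinitely often}\} \\
  &= \{A_m \text{ i.o.}\}.
\end{aligned}
\]
Similarly, we have
\[
\begin{aligned}
  \liminf A_n &= \{\exists n, \forall m \geq n \text{ such that } A_m \text{ occurs}\} \\
  &= \{x : \exists n, \forall m \geq n, x \in A_m\} \\
  &= \{A_m \text{ occurs eventually}\} \\
  &= \{A_m \text{ e.v.}\}.
\end{aligned}
\]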
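To make the two definitions concrete, here is a small sketch for a periodic sequence of finite sets (an assumption made purely so that the infinite unions and intersections reduce to one period; the helper names are hypothetical). For such a sequence, lim sup is the set of points appearing in some set of the period, and lim inf is the set of points appearing in all of them.

```python
def tail_union(sets, n):
    """Union of A_m over m >= n, for a periodic sequence given by one period."""
    p = len(sets)
    # A periodic tail visits every set in the period, so one period suffices.
    return set().union(*(sets[(n + i) % p] for i in range(p)))

def tail_intersection(sets, n):
    """Intersection of A_m over m >= n, for the same periodic sequence."""
    p = len(sets)
    return set.intersection(*(set(sets[(n + i) % p]) for i in range(p)))

def lim_sup(sets):
    """Intersection over n of the tail unions (points that occur i.o.)."""
    return set.intersection(*(tail_union(sets, n) for n in range(len(sets))))

def lim_inf(sets):
    """Union over n of the tail intersections (points that occur eventually)."""
    return set().union(*(tail_intersection(sets, n) for n in range(len(sets))))

period = [{0, 1}, {0, 2}, {0, 1}]  # A_0, A_1, A_2, then repeating
upper = lim_sup(period)
lower = lim_inf(period)
```

In this example the point 0 lies in every set, so it occurs eventually, while 1 and 2 each occur infinitely often but not eventually.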

We are now going to prove two “obvious” results, known as the Borel–Cantelli
lemmas. The first gives a necessary condition for events to happen infinitely
often with positive probability, namely that the sum of their probabilities
diverges. In the case where the events are independent, the second says this
condition is also sufficient.
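Lemma (Borel–Cantelli lemma). If
\[
  \sum_n \mathbb{P}[A_n] < \infty,
\]
then
\[
  \mathbb{P}[A_n \text{ i.o.}] = 0.
\]
Proof. For each k, we have
\[
\begin{aligned}
  \mathbb{P}[A_n \text{ i.o.}] &= \mathbb{P}\left[\bigcap_n \bigcup_{m \geq n} A_m\right] \\
  &\leq \mathbb{P}\left[\bigcup_{m \geq k} A_m\right] \\
  &\leq \sum_{m = k}^\infty \mathbb{P}[A_m] \\
  &\to 0
\end{aligned}
\]
as k → ∞. So we have $\mathbb{P}[A_n \text{ i.o.}] = 0$.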


Note that we did not need to use the fact that we are working with a
probability measure. So in fact this holds for any measure space.
Lemma (Borel–Cantelli lemma II). Let $(A_n)$ be independent events. If
\[
  \sum_n \mathbb{P}[A_n] = \infty,
\]
then
\[
  \mathbb{P}[A_n \text{ i.o.}] = 1.
\]


Note that independence is crucial. If we flip a fair coin, and we set all the
$A_n$ to be equal to “getting a heads”, then
$\sum_n \mathbb{P}[A_n] = \sum_n \frac{1}{2} = \infty$, but we certainly
do not have $\mathbb{P}[A_n \text{ i.o.}] = 1$. Instead it is just $\frac{1}{2}$.

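Proof. By example sheet, if $(A_n)$ is independent, then so is $(A_n^C)$. Then we
have
\[
\begin{aligned}
  \mathbb{P}\left[\bigcap_{m = n}^N A_m^C\right] &= \prod_{m = n}^N \mathbb{P}[A_m^C] \\
  &= \prod_{m = n}^N (1 - \mathbb{P}[A_m]) \\
  &\leq \prod_{m = n}^N \exp(-\mathbb{P}[A_m]) \\
  &= \exp\left(-\sum_{m = n}^N \mathbb{P}[A_m]\right) \\
  &\to 0
\end{aligned}
\]
as N → ∞, where the inequality uses $1 - x \leq e^{-x}$, and the limit holds as
we assumed that $\sum_n \mathbb{P}[A_n] = \infty$. So we have
\[
  \mathbb{P}\left[\bigcap_{m = n}^\infty A_m^C\right] = 0.
\]
By countable subadditivity, we have
\[
  \mathbb{P}\left[\bigcup_n \bigcap_{m = n}^\infty A_m^C\right] = 0.
\]
This in turn implies that
\[
  \mathbb{P}\left[\bigcap_n \bigcup_{m = n}^\infty A_m\right]
  = 1 - \mathbb{P}\left[\bigcup_n \bigcap_{m = n}^\infty A_m^C\right] = 1.
\]
So we are done.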
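The key estimate in the proof can be checked numerically. The sketch below assumes a hypothetical choice of probabilities, p_m = 1/m (so the sum diverges), and compares the exact probability that none of the independent events A_n, ..., A_N occurs with the exponential bound from the proof. Function names are illustrative.

```python
import math
from fractions import Fraction

def prob_none_occur(probs):
    """P[no A_m occurs] for independent events: the product of (1 - p_m)."""
    result = Fraction(1)
    for p in probs:
        result *= 1 - p
    return result

def exp_bound(probs):
    """The upper bound exp(-sum p_m) used in the proof."""
    return math.exp(-float(sum(probs)))

def harmonic_probs(n, N):
    """Hypothetical choice p_m = 1/m for m = n, ..., N (divergent sum)."""
    return [Fraction(1, m) for m in range(n, N + 1)]

# With p_m = 1/m the product telescopes to (n - 1) / N, which tends to 0.
exact = prob_none_occur(harmonic_probs(2, 100))
bound = exp_bound(harmonic_probs(2, 100))
```

With this choice the product telescopes, so the decay to zero as N grows can be read off exactly, and it always stays below the exponential bound.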


2 Measurable functions and random variables


We’ve had enough of measurable sets. As in most of mathematics, not only
should we talk about objects, but also maps between objects. Here we want to
talk about maps between measure spaces, known as measurable functions. In
the case of a probability space, a measurable function is a random variable!
In this chapter, we are going to start by defining a measurable function and
investigate some of its basic properties. In particular, we are going to prove the
monotone class theorem, which is the analogue of Dynkin’s lemma for measurable
functions. Afterwards, we turn to the probabilistic aspects, and see how we can
make sense of the independence of random variables. Finally, we are going to
consider different notions of “convergence” of functions.

2.1 Measurable functions


The definition of a measurable function is somewhat like the definition of a
continuous function, except that we replace “open” with “in the σ-algebra”.
Definition (Measurable functions). Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$
be measurable spaces. A map f : E → G is measurable if for every
$A \in \mathcal{G}$, we have
\[
  f^{-1}(A) = \{x \in E : f(x) \in A\} \in \mathcal{E}.
\]
If $(G, \mathcal{G}) = (\mathbb{R}, \mathcal{B})$, then we will just say that f is
measurable on E.
If $(G, \mathcal{G}) = ([0, \infty], \mathcal{B})$, then we will just say that f is
non-negative measurable.
If E is a topological space and $\mathcal{E} = \mathcal{B}(E)$, then we call f a
Borel function.
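How do we actually check in practice that a function is measurable? It turns
out we are lucky. We can simply check that $f^{-1}(A) \in \mathcal{E}$ for A in
any generating set $\mathcal{Q}$ of $\mathcal{G}$.
Lemma. Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be measurable spaces, and
$\mathcal{G} = \sigma(\mathcal{Q})$ for some $\mathcal{Q}$. If
$f^{-1}(A) \in \mathcal{E}$ for all $A \in \mathcal{Q}$, then f is measurable.
Proof. We claim that
\[
  \mathcal{A} = \{A \subseteq G : f^{-1}(A) \in \mathcal{E}\}
\]
is a σ-algebra on G. Then the result follows immediately by definition of
$\sigma(\mathcal{Q})$: since $\mathcal{Q} \subseteq \mathcal{A}$, we get
$\mathcal{G} = \sigma(\mathcal{Q}) \subseteq \mathcal{A}$.
Indeed, the claim follows from the fact that $f^{-1}$ preserves everything. More
precisely, we have
\[
  f^{-1}\left(\bigcup_n A_n\right) = \bigcup_n f^{-1}(A_n), \quad
  f^{-1}(A^C) = (f^{-1}(A))^C, \quad
  f^{-1}(\emptyset) = \emptyset.
\]
So if, say, all $A_n \in \mathcal{A}$, then so is $\bigcup_n A_n$.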
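On a finite set the lemma can be checked by brute force. The following sketch assumes small hypothetical spaces E = {1, 2, 3, 4} and G = {0, 1, 2}; all names are illustrative. It generates σ-algebras by closing under complements and unions (which suffices on a finite set), then confirms that checking preimages of the generators gives the same answer as checking every set in the generated σ-algebra.

```python
from itertools import combinations

def generate_sigma_algebra(universe, generators):
    """Smallest sigma-algebra on a finite set containing the generators."""
    universe = frozenset(universe)
    sets = {frozenset(), universe} | {frozenset(g) for g in generators}
    while True:
        new_sets = {universe - a for a in sets}
        new_sets |= {a | b for a, b in combinations(sets, 2)}
        if new_sets <= sets:
            return sets
        sets |= new_sets

def preimage(f, domain, target):
    """f^{-1}(target) as a subset of the (finite) domain."""
    return frozenset(x for x in domain if f(x) in target)

def is_measurable(f, domain, sigma_E, sets_to_check):
    """Check that the preimage of every listed set lies in sigma_E."""
    return all(preimage(f, domain, a) in sigma_E for a in sets_to_check)

E = {1, 2, 3, 4}
sigma_E = generate_sigma_algebra(E, [{1, 2}])   # four sets
Q = [{0}, {1}]                                  # generators on G = {0, 1, 2}
sigma_G = generate_sigma_algebra({0, 1, 2}, Q)  # the full power set

f = lambda x: 0 if x <= 2 else 1  # constant on {1, 2} and on {3, 4}
h = lambda x: x % 2               # separates 1 from 2
```

Here f passes the test on the two generators and hence on all eight sets of sigma_G, while h already fails on a generator.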
Example. In the particular case where we have a function f : E → R, we know
that $\mathcal{B} = \mathcal{B}(\mathbb{R})$ is generated by (−∞, y] for y ∈ R. So
we just have to check that
\[
  \{x \in E : f(x) \leq y\} = f^{-1}((-\infty, y]) \in \mathcal{E}.
\]


Example. Let E, F be topological spaces, and f : E → F be continuous. We
will see that f is a measurable function (under the Borel σ-algebras). Indeed,
by definition, whenever U ⊆ F is open, we have $f^{-1}(U)$ open as well. So
$f^{-1}(U) \in \mathcal{B}(E)$ for all U ⊆ F open. But since $\mathcal{B}(F)$ is
the σ-algebra generated by the open sets, this implies that f is measurable.
This is one very important example. We can do another very important
example.

Example. Suppose that A ⊆ E. The indicator function of A is
$1_A : E \to \{0, 1\}$ given by
\[
  1_A(x) =
  \begin{cases}
    1 & x \in A \\
    0 & x \notin A
  \end{cases}.
\]
Suppose we give {0, 1} the non-trivial σ-algebra (i.e. the power set). Then $1_A$
is a measurable function iff $A \in \mathcal{E}$.
Example. The identity function is always measurable.
Example. Compositions of measurable functions are measurable. More precisely,
if $(E, \mathcal{E})$, $(F, \mathcal{F})$ and $(G, \mathcal{G})$ are measurable
spaces, and the functions f : E → F and g : F → G are measurable, then the
composition g ◦ f : E → G is measurable.
Indeed, if $A \in \mathcal{G}$, then $g^{-1}(A) \in \mathcal{F}$, so
$f^{-1}(g^{-1}(A)) \in \mathcal{E}$. But
$f^{-1}(g^{-1}(A)) = (g \circ f)^{-1}(A)$. So done.
Definition (σ-algebra generated by functions). Now suppose we have a set E,
and a family of real-valued functions $\{f_i : i \in I\}$ on E. We then define
\[
  \sigma(f_i : i \in I) = \sigma(f_i^{-1}(A) : A \in \mathcal{B}, i \in I).
\]
This is the smallest σ-algebra on E which makes all the $f_i$’s measurable.


This is analogous to the notion of initial topologies for topological spaces.
If we want to construct more measurable functions, the following definition
will be rather useful:
Definition (Product measurable space). Let $(E, \mathcal{E})$ and
$(G, \mathcal{G})$ be measurable spaces. We define the product measurable space
as E × G, whose σ-algebra is generated by the projections
\[
  \pi_1 : E \times G \to E, \quad \pi_2 : E \times G \to G.
\]
More explicitly, the σ-algebra is given by
\[
  \mathcal{E} \otimes \mathcal{G} = \sigma(\{A \times B : A \in \mathcal{E}, B \in \mathcal{G}\}).
\]
More generally, if $(E_i, \mathcal{E}_i)$ is a collection of measurable spaces,
the product measurable space has underlying set $\prod_i E_i$, and the σ-algebra
generated by the projection maps $\pi_i : \prod_j E_j \to E_i$.

This satisfies the following property:


Proposition. Let $f_i : E \to F_i$ be functions. Then the $f_i$ are all
measurable iff $(f_i) : E \to \prod_i F_i$ is measurable, where the function
$(f_i)$ is defined by setting the $i$th component of $(f_i)(x)$ to be $f_i(x)$.

Proof. If the map $(f_i)$ is measurable, then by composition with the projections
$\pi_i$, we know that each $f_i$ is measurable.
Conversely, if all $f_i$ are measurable, then since the σ-algebra of
$\prod_i F_i$ is generated by sets of the form $\pi_j^{-1}(A)$ with
$A \in \mathcal{F}_j$, and the pullback of such sets along $(f_i)$ is exactly
$f_j^{-1}(A)$, we know the function $(f_i)$ is measurable.
Using this, we can prove that a whole lot more functions are measurable.
Proposition. Let $(E, \mathcal{E})$ be a measurable space. Let
$(f_n : n \in \mathbb{N})$ be a sequence of non-negative measurable functions on
E. Then the following are measurable:
\[
  f_1 + f_2, \quad f_1 f_2, \quad \max\{f_1, f_2\}, \quad \min\{f_1, f_2\}, \quad
  \inf_n f_n, \quad \sup_n f_n, \quad \liminf_n f_n, \quad \limsup_n f_n.
\]
The same is true with “non-negative” replaced with “real”, provided the new
functions are real (i.e. not infinity).
Proof. This is an (easy) exercise on the example sheet. For example, the sum
$f_1 + f_2$ can be written as the following composition:
\[
  E \xrightarrow{(f_1, f_2)} [0, \infty]^2 \xrightarrow{+} [0, \infty].
\]
We know the second map is continuous, hence measurable. The first function is
also measurable since the $f_i$ are. So the composition is also measurable.
The product follows similarly, but for the infimum and supremum, we need to
check explicitly that the corresponding maps
$[0, \infty]^{\mathbb{N}} \to [0, \infty]$ are measurable.
Notation. We will write

f ∧ g = min{f, g}, f ∨ g = max{f, g}.

We are now going to prove the monotone class theorem, which is a “Dynkin’s
lemma” for measurable functions. As in the case of Dynkin’s lemma, it will
sound rather awkward but will prove itself to be very useful.

Theorem (Monotone class theorem). Let $(E, \mathcal{E})$ be a measurable space,
and $\mathcal{A} \subseteq \mathcal{E}$ be a π-system with
$\sigma(\mathcal{A}) = \mathcal{E}$. Let $\mathcal{V}$ be a vector space of
functions such that
(i) The constant function $1 = 1_E$ is in $\mathcal{V}$.
(ii) The indicator functions $1_A \in \mathcal{V}$ for all $A \in \mathcal{A}$.
(iii) $\mathcal{V}$ is closed under bounded, monotone limits.
More explicitly, if $(f_n)$ is a bounded non-negative sequence in $\mathcal{V}$,
$f_n \nearrow f$ (pointwise) and f is also bounded, then $f \in \mathcal{V}$.
Then $\mathcal{V}$ contains all bounded measurable functions.


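Note that the conditions for $\mathcal{V}$ are rather like the conditions for a
d-system, where taking a bounded, monotone limit is something like taking
increasing unions.
Proof. We first deduce that $1_A \in \mathcal{V}$ for all $A \in \mathcal{E}$. Let
\[
  \mathcal{D} = \{A \in \mathcal{E} : 1_A \in \mathcal{V}\}.
\]
We want to show that $\mathcal{D} = \mathcal{E}$. By assumption (ii), we have
$\mathcal{A} \subseteq \mathcal{D}$, so it suffices to show that $\mathcal{D}$ is
a d-system.
(i) Since $1_E \in \mathcal{V}$, we know $E \in \mathcal{D}$.
(ii) If $1_A \in \mathcal{V}$, then $1 - 1_A = 1_{E \setminus A} \in \mathcal{V}$.
So $E \setminus A \in \mathcal{D}$.
(iii) If $(A_n)$ is an increasing sequence in $\mathcal{D}$, then
$1_{A_n} \to 1_{\bigcup A_n}$ monotonically increasingly. So
$1_{\bigcup A_n} \in \mathcal{V}$, i.e. $\bigcup A_n \in \mathcal{D}$.
So, by Dynkin’s lemma, we know $\mathcal{D} = \mathcal{E}$. So $\mathcal{V}$
contains indicators of all measurable sets. We will now try to obtain any
measurable function by approximating.
Suppose that f is bounded and non-negative measurable. We want to show
that $f \in \mathcal{V}$. To do this, we approximate it by letting
\[
  f_n = 2^{-n} \lfloor 2^n f \rfloor
      = \sum_{k = 0}^\infty k 2^{-n} 1_{\{k 2^{-n} \leq f < (k + 1) 2^{-n}\}}.
\]
Note that since f is bounded, this is a finite sum. So it is a finite linear
combination of indicators of elements in $\mathcal{E}$. So
$f_n \in \mathcal{V}$, and $0 \leq f_n \to f$ monotonically. So
$f \in \mathcal{V}$.
More generally, if f is bounded and measurable, then we can write
\[
  f = (f \vee 0) + (f \wedge 0) \equiv f^+ - f^-,
\]
where $f^+ = f \vee 0$ and $f^- = -(f \wedge 0)$.
Then $f^+$ and $f^-$ are bounded and non-negative measurable, so they lie in
$\mathcal{V}$. Since $\mathcal{V}$ is a vector space, $f \in \mathcal{V}$.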
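The dyadic approximation used in the proof is easy to experiment with. This is a minimal sketch, assuming a hypothetical bounded non-negative function f(x) = x² on [0, 1] and a handful of sample points; it evaluates f_n = 2^{-n}⌊2^n f⌋ pointwise.

```python
import math

def dyadic_approx(f, n):
    """Return f_n = 2^{-n} * floor(2^n * f), which takes finitely many values."""
    scale = 2 ** n
    return lambda x: math.floor(scale * f(x)) / scale

f = lambda x: x * x  # hypothetical bounded non-negative function on [0, 1]
points = [i / 10 for i in range(11)]

# Values of the first few approximations at the sample points.
table = {n: [dyadic_approx(f, n)(x) for x in points] for n in range(6)}
```

At each point the values increase with n, never exceed f, and are within 2^{-n} of f, which is the monotone convergence the proof relies on.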
Unfortunately, we will not have a chance to use this result until the next
chapter where we discuss integration. There we will use this a lot.

2.2 Constructing new measures


We are going to look at two ways to construct new measures on spaces based on
some measurable function we have.
Definition (Image measure). Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be
measurable spaces. Suppose µ is a measure on $\mathcal{E}$ and f : E → G is a
measurable function. We define the image measure $\nu = \mu \circ f^{-1}$ on
$\mathcal{G}$ by
\[
  \nu(A) = \mu(f^{-1}(A)).
\]
It is a routine check that this is indeed a measure.
If we have a strictly increasing continuous function, then we know it is
invertible (if we restrict the codomain appropriately), and the inverse is also
strictly increasing. It is also clear that these conditions are necessary for an
inverse to exist. However, if we relax the conditions a bit, we can get some sort
of “pseudoinverse” (some categorists may call them “left adjoints” (and will tell
you that it is a trivial consequence of the adjoint functor theorem)).
Recall that a function g is right continuous if $x_n \searrow x$ implies
$g(x_n) \to g(x)$, and similarly f is left continuous if $x_n \nearrow x$ implies
$f(x_n) \to f(x)$.


Lemma. Let g : R → R be non-constant, non-decreasing and right continuous.
We set
\[
  g(\pm\infty) = \lim_{x \to \pm\infty} g(x).
\]
We set $I = (g(-\infty), g(\infty))$. Since g is non-constant, this is non-empty.
Then there is a non-decreasing, left continuous function f : I → R such that
for all x ∈ I and y ∈ R, we have
\[
  x \leq g(y) \Leftrightarrow f(x) \leq y.
\]
Thus, taking the negation of this, we have
\[
  x > g(y) \Leftrightarrow f(x) > y.
\]
Explicitly, for x ∈ I, we define
\[
  f(x) = \inf\{y \in \mathbb{R} : x \leq g(y)\}.
\]

Proof. We just have to verify that it works. For x ∈ I, consider
\[
  J_x = \{y \in \mathbb{R} : x \leq g(y)\}.
\]
Since g is non-decreasing, if $y \in J_x$ and $y' \geq y$, then $y' \in J_x$.
Since g is right-continuous, if $y_n \in J_x$ is such that $y_n \searrow y$, then
$y \in J_x$. So we have
\[
  J_x = [f(x), \infty).
\]
Thus, for y ∈ R, we have
\[
  x \leq g(y) \Leftrightarrow f(x) \leq y.
\]
So we just have to prove the remaining properties of f. Now for $x \leq x'$, we
have $J_{x'} \subseteq J_x$. So $f(x) \leq f(x')$. So f is non-decreasing.
Similarly, if $x_n \nearrow x$, then we have $J_x = \bigcap_n J_{x_n}$. So
$f(x_n) \to f(x)$. So this is left continuous.
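The defining equivalence of the lemma can be tested on a concrete step function. The sketch below assumes a hypothetical right-continuous step function g with jumps at 0, 1 and 2 (so that I = (0, 1)); the constants and function names are illustrative only.

```python
import bisect

# Hypothetical step function g: it equals LEVELS[0] to the left of JUMPS[0] and
# LEVELS[k] on [JUMPS[k-1], JUMPS[k]); non-decreasing and right continuous.
JUMPS = [0.0, 1.0, 2.0]
LEVELS = [0.0, 0.2, 0.7, 1.0]

def g(y):
    # bisect_right counts the jumps <= y, which makes g right continuous.
    return LEVELS[bisect.bisect_right(JUMPS, y)]

def f(x):
    """f(x) = inf{y : x <= g(y)} for x in I = (0, 1)."""
    # The infimum is attained at the first jump where g reaches at least x.
    k = bisect.bisect_left(LEVELS, x)
    return JUMPS[k - 1]

xs = [0.1, 0.2, 0.5, 0.7, 0.9]
ys = [-1.0 + 0.25 * i for i in range(17)]  # grid on [-1, 3]
agree = all((x <= g(y)) == (f(x) <= y) for x in xs for y in ys)
```

Note that f is constant on each interval of values that g jumps over, and jumps wherever g is flat, which is the usual picture of a "pseudoinverse".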
Example. [Figures omitted: the graph of a sample function g, and the graph of the
corresponding function f.]


This allows us to construct new measures on R with ease.


Theorem. Let g : R → R be non-constant, non-decreasing and right continuous.
Then there exists a unique Radon measure dg on B such that

dg((a, b]) = g(b) − g(a).

Moreover, we obtain all non-zero Radon measures on R in this way.


We have already seen an instance of this when g was the identity function.
Given the lemma, this is very easy.
Proof. Take I and f as in the previous lemma, and let µ be the restriction of
the Lebesgue measure to Borel subsets of I. Now f is measurable since it is left
continuous. We define $dg = \mu \circ f^{-1}$. Then we have
\[
\begin{aligned}
  dg((a, b]) &= \mu(\{x \in I : a < f(x) \leq b\}) \\
  &= \mu(\{x \in I : g(a) < x \leq g(b)\}) \\
  &= \mu((g(a), g(b)]) = g(b) - g(a).
\end{aligned}
\]
So dg is a Radon measure with the required property.


There are no other such measures by the argument used for uniqueness of
the Lebesgue measure.
To show we get all non-zero Radon measures this way, suppose we have a non-zero
Radon measure ν on R; we want to produce a g such that ν = dg. We set
\[
  g(y) =
  \begin{cases}
    -\nu((y, 0]) & y \leq 0 \\
    \nu((0, y]) & y > 0
  \end{cases}.
\]
Then ν((a, b]) = g(b) − g(a). Since ν is non-zero, g is non-constant.
It is also easy to see it is non-decreasing and right continuous (the latter by
continuity of measure). So ν = dg by the uniqueness statement of the theorem.

2.3 Random variables


We are now going to look at these ideas in the context of probability. It turns
out they are concepts we already know and love!
Definition (Random variable). Let (Ω, F, P) be a probability space, and (E, E)
a measurable space. Then an E-valued random variable is a measurable function
X : Ω → E.
By default, we will assume the random variables are real.
Usually, when we have a random variable X, we might ask questions like
“what is the probability that X ∈ A?”. In other words, we are asking for the
“size” of the set of things that get sent to A. This is just the image measure!

Definition (Distribution/law). Given a random variable X : Ω → E, the
distribution or law of X is the image measure
$\mu_X = \mathbb{P} \circ X^{-1}$. We usually write
\[
  \mathbb{P}(X \in A) = \mu_X(A) = \mathbb{P}(X^{-1}(A)).
\]


If E = R, then $\mu_X$ is determined by its values on the π-system of intervals
(−∞, y]. We set
\[
  F_X(x) = \mu_X((-\infty, x]) = \mathbb{P}(X \leq x).
\]
This is known as the distribution function of X.
Proposition. We have
\[
  F_X(x) \to
  \begin{cases}
    0 & x \to -\infty \\
    1 & x \to +\infty
  \end{cases}.
\]
Also, $F_X(x)$ is non-decreasing and right-continuous.
We call any function F with these properties a distribution function.
Definition (Distribution function). A distribution function is a non-decreasing,
right continuous function F : R → [0, 1] satisfying
\[
  F(x) \to
  \begin{cases}
    0 & x \to -\infty \\
    1 & x \to +\infty
  \end{cases}.
\]

We now want to show that every distribution function is indeed the distribution
function of some random variable.
Proposition. Let F be any distribution function. Then there exists a probability
space $(\Omega, \mathcal{F}, \mathbb{P})$ and a random variable X such that
$F_X = F$.
Proof. Take
$(\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}(0, 1), \text{Lebesgue})$.
We take X : Ω → R to be
\[
  X(\omega) = \inf\{x : \omega \leq F(x)\}.
\]
Then we have
\[
  X(\omega) \leq x \iff \omega \leq F(x).
\]
So we have
\[
  F_X(x) = \mathbb{P}[X \leq x] = \mathbb{P}[(0, F(x)]] = F(x).
\]
Therefore $F_X = F$.
This construction is actually very useful in practice. If we are writing
a computer program and want to sample a random variable, we will use this
procedure. The computer usually comes with a uniform (pseudo)-random number
generator. Then using this procedure allows us to produce random variables of
any distribution from a uniform sample.
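As an illustration of this sampling procedure, here is a minimal sketch for one hypothetical target, the exponential distribution with rate 2, where inf{x : ω ≤ F(x)} has a closed form. The rate, seed and function names are assumptions chosen for the example.

```python
import math
import random

RATE = 2.0  # hypothetical rate parameter of an exponential distribution

def F(x):
    """Distribution function of Exp(RATE)."""
    return 1.0 - math.exp(-RATE * x) if x > 0 else 0.0

def quantile(omega):
    """X(omega) = inf{x : omega <= F(x)} for omega in (0, 1), in closed form."""
    return -math.log1p(-omega) / RATE

def sample(n, seed=0):
    """Draw n samples by pushing uniform samples through the quantile map."""
    rng = random.Random(seed)
    return [quantile(rng.random()) for _ in range(n)]

# Deterministic check: push an evenly spaced grid of omegas through X and
# compare the fraction landing in (-inf, 0.5] with F(0.5).
grid = [quantile((i + 0.5) / 1000) for i in range(1000)]
fraction_below = sum(1 for x in grid if x <= 0.5) / len(grid)
```

For distributions without a closed-form inverse, the same idea works with a numerical search for the infimum, at some cost in speed.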
The next thing we want to consider is the notion of independence of random
variables. Recall that for random variables X, Y , we used to say that they are
independent if for any A, B, we have

P[X ∈ A, Y ∈ B] = P[X ∈ A]P[Y ∈ B].

But this is exactly the statement that the σ-algebras generated by X and Y are
independent!
Definition (Independence of random variables). A family (Xn ) of random vari-
ables is said to be independent if the family of σ-algebras (σ(Xn )) is independent.


Proposition. Two real-valued random variables X, Y are independent iff
\[
  \mathbb{P}[X \leq x, Y \leq y] = \mathbb{P}[X \leq x]\mathbb{P}[Y \leq y]
\]
for all x, y ∈ R. More generally, if $(X_n)$ is a sequence of real-valued random
variables, then they are independent iff
\[
  \mathbb{P}[X_1 \leq x_1, \cdots, X_n \leq x_n] = \prod_{j = 1}^n \mathbb{P}[X_j \leq x_j]
\]
for all n and $x_j$.


Proof. The ⇒ direction is obvious. For the other direction, we simply note that
{(−∞, x] : x ∈ R} is a generating π-system for the Borel σ-algebra of R.
In probability, we often say things like “let X1 , X2 , · · · be iid random vari-
ables”. However, how can we guarantee that iid random variables do indeed
exist? We start with the less ambitious goal of finding iid Bernoulli(1/2) random
variables:
Proposition. Let
\[
  (\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}(0, 1), \text{Lebesgue})
\]
be our probability space. Then there exists a sequence $(R_n)$ of independent
Bernoulli(1/2) random variables.
Proof. Suppose we have ω ∈ Ω = (0, 1). Then we write ω as a binary expansion
\[
  \omega = \sum_{n = 1}^\infty \omega_n 2^{-n},
\]
where $\omega_n \in \{0, 1\}$. We make the binary expansion unique by disallowing
expansions that end in an infinite sequence of zeroes.
We define $R_n(\omega) = \omega_n$. We will show that $R_n$ is measurable. Indeed,
we can write
\[
  R_1(\omega) = \omega_1 = 1_{(1/2, 1]}(\omega),
\]
where $1_{(1/2, 1]}$ is the indicator function. Since indicator functions of
measurable sets are measurable, we know $R_1$ is measurable. Similarly, we have
\[
  R_2(\omega) = 1_{(1/4, 1/2]}(\omega) + 1_{(3/4, 1]}(\omega).
\]
So this is also a measurable function. More generally, we can do this for any
$R_n(\omega)$: we have
\[
  R_n(\omega) = \sum_{j = 1}^{2^{n - 1}} 1_{(2^{-n}(2j - 1), 2^{-n}(2j)]}(\omega).
\]
So each $R_n$ is a random variable, as each can be expressed as a sum of
indicators of measurable sets.
Now let’s calculate
\[
  \mathbb{P}[R_n = 1] = \sum_{j = 1}^{2^{n - 1}} 2^{-n}((2j) - (2j - 1))
  = \sum_{j = 1}^{2^{n - 1}} 2^{-n} = \frac{1}{2}.
\]


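Then we have
\[
  \mathbb{P}[R_n = 0] = 1 - \mathbb{P}[R_n = 1] = \frac{1}{2}
\]
as well. So $R_n \sim \text{Bernoulli}(1/2)$.
We can straightforwardly check that $(R_n)$ is an independent sequence, since
for n ≠ m, we have
\[
  \mathbb{P}[R_n = 0 \text{ and } R_m = 0] = \frac{1}{4} = \mathbb{P}[R_n = 0]\mathbb{P}[R_m = 0],
\]
and similarly for the other values and for any finite collection of indices.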
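The independence claim can be verified exactly for the first few digits. The sketch below (with hypothetical helper names) computes the n-th digit of the non-terminating binary expansion using exact rational arithmetic, and then the Lebesgue measure of each joint event {R_1 = v_1, ..., R_N = v_N} by summing the lengths of the dyadic intervals on which it holds.

```python
from fractions import Fraction
from itertools import product
from math import ceil

def R(n, omega):
    """n-th binary digit of omega in (0, 1], using the non-terminating expansion."""
    # omega lies in (k 2^{-n}, (k + 1) 2^{-n}] with k = ceil(2^n omega) - 1,
    # and the n-th digit is 1 exactly when k is odd.
    return (ceil(2 ** n * omega) - 1) % 2

def joint_prob(values):
    """Lebesgue measure of {R_1 = values[0], ..., R_N = values[N-1]}."""
    N = len(values)
    total = Fraction(0)
    for k in range(2 ** N):
        # Each R_n with n <= N is constant on (k 2^{-N}, (k + 1) 2^{-N}],
        # so it is enough to evaluate at the right endpoint.
        omega = Fraction(k + 1, 2 ** N)
        if all(R(n + 1, omega) == v for n, v in enumerate(values)):
            total += Fraction(1, 2 ** N)
    return total

probs = {v: joint_prob(v) for v in product([0, 1], repeat=3)}
```

Every one of the eight digit patterns gets measure 1/8, the product of three factors of 1/2, as independence requires.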
We will now use the $(R_n)$ to construct an independent sequence with any
prescribed distributions.
Proposition. Let
\[
  (\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}(0, 1), \text{Lebesgue}).
\]
Given any sequence $(F_n)$ of distribution functions, there is a sequence $(X_n)$
of independent random variables with $F_{X_n} = F_n$ for all n.
Proof. Let $m : \mathbb{N}^2 \to \mathbb{N}$ be any bijection, and relabel
\[
  Y_{k,n} = R_{m(k,n)},
\]
where the $R_j$ are as in the previous proposition. We let
\[
  Y_n = \sum_{k = 1}^\infty 2^{-k} Y_{k,n}.
\]
Then we know that $(Y_n)$ is an independent sequence of random variables, and
each is uniform on (0, 1). As before, we define
\[
  G_n(y) = \inf\{x : y \leq F_n(x)\}.
\]
We set $X_n = G_n(Y_n)$. Then $(X_n)$ is a sequence of independent random
variables with $F_{X_n} = F_n$.

We end the section with a random fact: let $(\Omega, \mathcal{F}, \mathbb{P})$ and
$R_j$ be as above. Then $\frac{1}{n} \sum_{j = 1}^n R_j$ is the average of n
independent Bernoulli(1/2) random variables. The weak law of large numbers says
for any ε > 0, we have
\[
  \mathbb{P}\left[\left|\frac{1}{n} \sum_{j = 1}^n R_j - \frac{1}{2}\right| \geq \varepsilon\right] \to 0
  \text{ as } n \to \infty.
\]
The strong law of large numbers, which we will prove later, says that
\[
  \mathbb{P}\left[\left\{\omega : \frac{1}{n} \sum_{j = 1}^n R_j(\omega) \to \frac{1}{2}\right\}\right] = 1.
\]
So “almost every number” in (0, 1) has an equal proportion of 0’s and 1’s in its
binary expansion. This is known as the normal number theorem.


2.4 Convergence of measurable functions


The next thing to look at is the convergence of measurable functions. In measure
theory, wonderful things happen when we talk about convergence. In analysis,
most of the time we had to require uniform convergence, or even stronger
notions, if we want limits to behave well. However, in measure theory, the kinds
of convergence we talk about are somewhat pointwise in nature. In fact, it
will be weaker than pointwise convergence. Yet, we are still going to get good
properties out of them.
Definition (Convergence almost everywhere). Suppose that $(E, \mathcal{E}, \mu)$
is a measure space. Suppose that $(f_n)$, f are measurable functions. We say
$f_n \to f$ almost everywhere (a.e.) if
\[
  \mu(\{x \in E : f_n(x) \not\to f(x)\}) = 0.
\]
If $(E, \mathcal{E}, \mu)$ is a probability space, this is called almost sure
convergence.
To see this makes sense, i.e. the set in there is actually measurable, note that
\[
  \{x \in E : f_n(x) \not\to f(x)\} = \{x \in E : \limsup |f_n(x) - f(x)| > 0\}.
\]
We have previously seen that $\limsup |f_n - f|$ is non-negative measurable. So
the set $\{x \in E : \limsup |f_n(x) - f(x)| > 0\}$ is measurable.
Another useful notion of convergence is convergence in measure.
Definition (Convergence in measure). Suppose that $(E, \mathcal{E}, \mu)$ is a
measure space. Suppose that $(f_n)$, f are measurable functions. We say
$f_n \to f$ in measure if for each ε > 0, we have
\[
  \mu(\{x \in E : |f_n(x) - f(x)| \geq \varepsilon\}) \to 0 \text{ as } n \to \infty.
\]
If $(E, \mathcal{E}, \mu)$ is a probability space, then this is called convergence
in probability.
In the case of a probability space, this says
\[
  \mathbb{P}(|X_n - X| \geq \varepsilon) \to 0 \text{ as } n \to \infty
\]
for all ε, which is how we stated the weak law of large numbers in the past.
After we define integration, we can consider the norms of a function f given by
\[
  \|f\|_p = \left(\int |f(x)|^p \, dx\right)^{1/p}.
\]
Then in particular, if $\|f_n - f\|_p \to 0$, then $f_n \to f$ in measure, and
this provides an easy way to see that functions converge in measure.
In general, neither of these notions implies the other. However, the following
theorem provides us with a convenient dictionary to translate between the two
notions.
Theorem.
(i) If µ(E) < ∞, then fn → f a.e. implies fn → f in measure.


(ii) For any E, if $f_n \to f$ in measure, then there exists a subsequence
$(f_{n_k})$ such that $f_{n_k} \to f$ a.e.

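Proof.
(i) First suppose µ(E) < ∞, and fix ε > 0. Consider
\[
  \mu(\{x \in E : |f_n(x) - f(x)| \leq \varepsilon\}).
\]
We use the result from the first example sheet that for any sequence of
events $(A_n)$, we have
\[
  \liminf \mu(A_n) \geq \mu(\liminf A_n).
\]
Applying to the above sequence says
\[
\begin{aligned}
  \liminf \mu(\{x : |f_n(x) - f(x)| \leq \varepsilon\})
  &\geq \mu(\{x : |f_m(x) - f(x)| \leq \varepsilon \text{ eventually}\}) \\
  &\geq \mu(\{x \in E : |f_m(x) - f(x)| \to 0\}) \\
  &= \mu(E).
\end{aligned}
\]
As µ(E) < ∞, we can subtract from µ(E) to get
$\mu(\{x \in E : |f_n(x) - f(x)| > \varepsilon\}) \to 0$ as n → ∞.

(ii) Suppose that $f_n \to f$ in measure. We pick a subsequence $(n_k)$ such that
\[
  \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k}\right\}\right) \leq 2^{-k}.
\]
Then we have
\[
  \sum_{k = 1}^\infty \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k}\right\}\right)
  \leq \sum_{k = 1}^\infty 2^{-k} = 1 < \infty.
\]
By the first Borel–Cantelli lemma, we know
\[
  \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k} \text{ i.o.}\right\}\right) = 0.
\]
So $f_{n_k} \to f$ a.e.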

It is important that we assume that µ(E) < ∞ for the first part.
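Example. Consider $(E, \mathcal{E}, \mu) = (\mathbb{R}, \mathcal{B}, \text{Lebesgue})$.
Take $f_n(x) = 1_{[n, \infty)}(x)$. Then $f_n(x) \to 0$ for all x, and in
particular almost everywhere. However, we have
\[
  \mu\left(\left\{x \in \mathbb{R} : |f_n(x)| > \frac{1}{2}\right\}\right) = \mu([n, \infty)) = \infty
\]
for all n.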
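In the other direction, a standard example not taken from these notes is the "typewriter" sequence on [0, 1): indicators of dyadic intervals that sweep across the unit interval at ever finer scales. The sketch below, with hypothetical function names, shows that the supports have measure tending to 0 (so the sequence converges to 0 in measure), that every point is nevertheless hit once per sweep (so there is no pointwise convergence), and that the subsequence n_k = 2^k does converge at any fixed x > 0.

```python
from fractions import Fraction

def typewriter_interval(n):
    """Interval [j 2^{-k}, (j + 1) 2^{-k}) for n = 2^k + j with 0 <= j < 2^k."""
    k = n.bit_length() - 1
    j = n - 2 ** k
    return Fraction(j, 2 ** k), Fraction(j + 1, 2 ** k)

def f(n, x):
    """Indicator of the n-th typewriter interval, as a function on [0, 1)."""
    a, b = typewriter_interval(n)
    return 1 if a <= x < b else 0

def measure_of_support(n):
    """Lebesgue measure of {f_n != 0}, which is the length of the interval."""
    a, b = typewriter_interval(n)
    return b - a

x = Fraction(1, 3)
# Number of indices n in the k-th sweep [2^k, 2^{k+1}) with f_n(x) = 1.
hits_per_sweep = [sum(f(n, x) for n in range(2 ** k, 2 ** (k + 1))) for k in range(8)]
# Values along the subsequence n_k = 2^k, whose intervals are [0, 2^{-k}).
subsequence_values = [f(2 ** k, x) for k in range(8)]
```

This matches part (ii) of the theorem: convergence in measure alone gives almost everywhere convergence only along a suitable subsequence.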

There is one last type of convergence we are interested in. We will first
formulate it only in the probability setting, but there is an analogous notion in
measure theory known as weak convergence, which we will discuss much later on
in the course.


Definition (Convergence in distribution). Let $(X_n)$, X be random variables
with distribution functions $F_{X_n}$ and $F_X$. Then we say $X_n \to X$ in
distribution if $F_{X_n}(x) \to F_X(x)$ for all x ∈ R at which $F_X$ is
continuous.

Note that here we do not need that (Xn ) and X live on the same probability
space, since we only talk about the distribution functions.
But why do we have the condition with continuity points? The idea is that
if the resulting distribution has a “jump” at x, it doesn’t matter which side of
the jump FX (x) is at. Here is a simple example that tells us why this is very
important:
Example. Let $X_n$ be uniform on [0, 1/n]. Intuitively, this should converge
to the random variable that is always zero.
We can compute
\[
  F_{X_n}(x) =
  \begin{cases}
    0 & x \leq 0 \\
    nx & 0 < x < 1/n \\
    1 & x \geq 1/n
  \end{cases}.
\]
We can also compute the distribution function of the zero random variable as
\[
  F_0(x) =
  \begin{cases}
    0 & x < 0 \\
    1 & x \geq 0
  \end{cases}.
\]
But $F_{X_n}(0) = 0$ for all n, while $F_0(0) = 1$.
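The two distribution functions in this example are simple enough to evaluate directly. The following sketch (function names are illustrative) shows that for large n they agree at every tested point other than the jump at 0, which is exactly the point the definition excludes.

```python
def F_n(n, x):
    """Distribution function of the uniform distribution on [0, 1/n]."""
    if x <= 0:
        return 0.0
    if x < 1 / n:
        return n * x
    return 1.0

def F_0(x):
    """Distribution function of the constant random variable 0."""
    return 1.0 if x >= 0 else 0.0

continuity_points = [-1.0, -0.01, 0.01, 0.5]
n_large = 10 ** 6
agreement = [F_n(n_large, x) == F_0(x) for x in continuity_points]
at_zero = [F_n(n, 0.0) for n in (1, 10, 100, 1000)]
```

At x = 0 the values stay at 0 for every n, so pointwise convergence of the distribution functions fails only at the single discontinuity of the limit.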

One might now think of cheating by cooking up some random variable such
that F is discontinuous at so many points that random, unrelated things converge
to F. However, this cannot be done, because F is a non-decreasing function,
and thus can only have countably many points of discontinuity.
The big theorem we are going to prove about convergence in distribution is
that actually it is very boring and doesn’t give us anything new.

Theorem (Skorokhod representation theorem of weak convergence).
(i) If $(X_n)$, X are defined on the same probability space, and $X_n \to X$ in
probability, then $X_n \to X$ in distribution.
(ii) If $X_n \to X$ in distribution, then there exist random variables
$(\tilde{X}_n)$ and $\tilde{X}$ defined on a common probability space with
$F_{\tilde{X}_n} = F_{X_n}$ and $F_{\tilde{X}} = F_X$ such that
$\tilde{X}_n \to \tilde{X}$ a.s.
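Proof. Let $S = \{x \in \mathbb{R} : F_X \text{ is continuous at } x\}$.

(i) Assume that $X_n \to X$ in probability. Fix x ∈ S. We need to show that
$F_{X_n}(x) \to F_X(x)$ as n → ∞.
We fix ε > 0. Since x ∈ S, this implies that there is some δ > 0 such that
\[
\begin{aligned}
  F_X(x - \delta) &\geq F_X(x) - \frac{\varepsilon}{2}, \\
  F_X(x + \delta) &\leq F_X(x) + \frac{\varepsilon}{2}.
\end{aligned}
\]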


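We fix N large such that n ≥ N implies
$\mathbb{P}[|X_n - X| \geq \delta] \leq \frac{\varepsilon}{2}$. Then
\[
  F_{X_n}(x) = \mathbb{P}[X_n \leq x] = \mathbb{P}[(X_n - X) + X \leq x].
\]
We now notice that
$\{(X_n - X) + X \leq x\} \subseteq \{X \leq x + \delta\} \cup \{|X_n - X| > \delta\}$.
So we have
\[
\begin{aligned}
  F_{X_n}(x) &\leq \mathbb{P}[X \leq x + \delta] + \mathbb{P}[|X_n - X| > \delta] \\
  &\leq F_X(x + \delta) + \frac{\varepsilon}{2} \\
  &\leq F_X(x) + \varepsilon.
\end{aligned}
\]
We similarly have
\[
\begin{aligned}
  F_{X_n}(x) &= \mathbb{P}[X_n \leq x] \\
  &\geq \mathbb{P}[X \leq x - \delta] - \mathbb{P}[|X_n - X| > \delta] \\
  &\geq F_X(x - \delta) - \frac{\varepsilon}{2} \\
  &\geq F_X(x) - \varepsilon.
\end{aligned}
\]
Combining, we have that n ≥ N implies $|F_{X_n}(x) - F_X(x)| \leq \varepsilon$.
Since ε was arbitrary, we are done.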
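(ii) Suppose $X_n \to X$ in distribution. We again let
\[
  (\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}((0, 1)), \text{Lebesgue}).
\]
We let
\[
\begin{aligned}
  \tilde{X}_n(\omega) &= \inf\{x : \omega \leq F_{X_n}(x)\}, \\
  \tilde{X}(\omega) &= \inf\{x : \omega \leq F_X(x)\}.
\end{aligned}
\]
Recall from before that $\tilde{X}_n$ has the same distribution function as
$X_n$ for all n, and $\tilde{X}$ has the same distribution as X. Moreover, we have
\[
\begin{aligned}
  \tilde{X}_n(\omega) \leq x &\Leftrightarrow \omega \leq F_{X_n}(x), \\
  x < \tilde{X}_n(\omega) &\Leftrightarrow F_{X_n}(x) < \omega,
\end{aligned}
\]
and similarly if we replace $X_n$ with X.
We are now going to show that with this particular choice, we have
$\tilde{X}_n \to \tilde{X}$ a.s.
Note that $\tilde{X}$ is a non-decreasing function (0, 1) → R. Then by general
analysis, $\tilde{X}$ has at most countably many discontinuities. We write
\[
  \Omega_0 = \{\omega \in (0, 1) : \tilde{X} \text{ is continuous at } \omega\}.
\]
Then $(0, 1) \setminus \Omega_0$ is countable, and hence has Lebesgue measure 0. So
\[
  \mathbb{P}[\Omega_0] = 1.
\]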


We are now going to show that $\tilde{X}_n(\omega) \to \tilde{X}(\omega)$ for all
$\omega \in \Omega_0$.
Note that $F_X$ is a non-decreasing function, and hence the set of points of
discontinuity $\mathbb{R} \setminus S$ is also countable. So S is dense in R. Fix
$\omega \in \Omega_0$ and ε > 0. We want to show that
$|\tilde{X}_n(\omega) - \tilde{X}(\omega)| \leq \varepsilon$ for all n large enough.
Since S is dense in R, we can find $x^-, x^+$ in S such that
\[
  x^- < \tilde{X}(\omega) < x^+
\]
and $x^+ - x^- < \varepsilon$. What we want to do is to use the characteristic
property of $\tilde{X}$ and $F_X$ to say that this implies
\[
  F_X(x^-) < \omega < F_X(x^+).
\]
Then since $F_{X_n} \to F_X$ at the points $x^-, x^+$, for sufficiently large n, we
have
\[
  F_{X_n}(x^-) < \omega < F_{X_n}(x^+).
\]
Hence we have
\[
  x^- < \tilde{X}_n(\omega) < x^+.
\]
Then it follows that $|\tilde{X}_n(\omega) - \tilde{X}(\omega)| < \varepsilon$.
However, this doesn’t work, since $\tilde{X}(\omega) < x^+$ only implies
$\omega \leq F_X(x^+)$, and our argument will break down. So we do a funny thing
where we introduce a new variable $\omega^+$.
Since $\tilde{X}$ is continuous at ω, we can find $\omega^+ \in (\omega, 1)$ such
that $\tilde{X}(\omega^+) < x^+$.

[Figure omitted: the graph of $\tilde{X}$ near ω, with the levels $x^-$ and $x^+$
(less than ε apart) and the points ω and $\omega^+$ marked.]

Then we have
\[
  x^- < \tilde{X}(\omega) \leq \tilde{X}(\omega^+) < x^+.
\]
Then we have
\[
  F_X(x^-) < \omega < \omega^+ \leq F_X(x^+).
\]
So for sufficiently large n, we have
\[
  F_{X_n}(x^-) < \omega < F_{X_n}(x^+).
\]
So we have
\[
  x^- < \tilde{X}_n(\omega) \leq x^+,
\]
and we are done.


2.5 Tail events


Finally, we are going to quickly look at tail events. These are events that depend
only on the asymptotic behaviour of a sequence of random variables.
Definition (Tail σ-algebra). Let $(X_n)$ be a sequence of random variables. We
let
\[
  \mathcal{T}_n = \sigma(X_{n+1}, X_{n+2}, \cdots),
\]
and
\[
  \mathcal{T} = \bigcap_n \mathcal{T}_n.
\]
Then $\mathcal{T}$ is the tail σ-algebra.


Then T -measurable events and random variables only depend on the asymp-
totic behaviour of the Xn ’s.
Example. Let $(X_n)$ be a sequence of real-valued random variables. Then
\[
  \limsup_{n \to \infty} \frac{1}{n} \sum_{j = 1}^n X_j, \quad
  \liminf_{n \to \infty} \frac{1}{n} \sum_{j = 1}^n X_j
\]
are $\mathcal{T}$-measurable random variables. Finally,
\[
  \left\{\lim_{n \to \infty} \frac{1}{n} \sum_{j = 1}^n X_j \text{ exists}\right\} \in \mathcal{T},
\]
since this is just the set of all points where the previous two things agree.
Theorem (Kolmogorov 0-1 law). Let (Xn ) be a sequence of independent (real-
valued) random variables. If A ∈ T , then P[A] = 0 or 1.
Moreover, if X is a T -measurable random variable, then there exists a
constant c such that
P[X = c] = 1.
Proof. The proof is very funny the first time we see it. We are going to prove
the theorem by checking something that seems very strange. We are going to
show that if $A \in \mathcal{T}$, then A is independent of A. It then follows that
\[
  \mathbb{P}[A] = \mathbb{P}[A \cap A] = \mathbb{P}[A]\mathbb{P}[A],
\]
so $\mathbb{P}[A] = 0$ or 1. In fact, we are going to prove that $\mathcal{T}$ is
independent of $\mathcal{T}$.
Let
\[
  \mathcal{F}_n = \sigma(X_1, \cdots, X_n).
\]
This σ-algebra is generated by the π-system of events of the form
\[
  A = \{X_1 \leq x_1, \cdots, X_n \leq x_n\}.
\]
Similarly, $\mathcal{T}_n = \sigma(X_{n+1}, X_{n+2}, \cdots)$ is generated by the
π-system of events of the form
\[
  B = \{X_{n+1} \leq x_{n+1}, \cdots, X_{n+k} \leq x_{n+k}\},
\]
where k is any natural number.


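Since the $X_n$ are independent, we know for any such A and B, we have
\[
  \mathbb{P}[A \cap B] = \mathbb{P}[A]\mathbb{P}[B].
\]
Since this is true for all A and B, it follows that $\mathcal{F}_n$ is independent
of $\mathcal{T}_n$. Since $\mathcal{T} = \bigcap_k \mathcal{T}_k \subseteq \mathcal{T}_n$
for each n, we know $\mathcal{F}_n$ is independent of $\mathcal{T}$.
Now $\bigcup_k \mathcal{F}_k$ is a π-system, which generates the σ-algebra
$\mathcal{F}_\infty = \sigma(X_1, X_2, \cdots)$. We know that if
$A \in \bigcup_n \mathcal{F}_n$, then there has to exist an index n such that
$A \in \mathcal{F}_n$. So A is independent of $\mathcal{T}$. So
$\mathcal{F}_\infty$ is independent of $\mathcal{T}$.
Finally, note that $\mathcal{T} \subseteq \mathcal{F}_\infty$. So $\mathcal{T}$ is
independent of $\mathcal{T}$.
To find the constant, suppose that X is $\mathcal{T}$-measurable. Then
\[
  \mathbb{P}[X \leq x] \in \{0, 1\}
\]
for all x ∈ R since $\{X \leq x\} \in \mathcal{T}$. Now take
\[
  c = \inf\{x \in \mathbb{R} : \mathbb{P}[X \leq x] = 1\}.
\]
Then with this particular choice of c, it is easy to see that
$\mathbb{P}[X = c] = 1$. This completes the proof of the theorem.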


3 Integration
3.1 Definition and basic properties
We are now going to work towards defining the integral of a measurable function
on a measure space $(E, \mathcal{E}, \mu)$. Different sources use different
notations for the integral. The following notations are all commonly used:
\[
  \mu(f) = \int_E f \, d\mu = \int_E f(x) \, d\mu(x) = \int_E f(x) \, \mu(dx).
\]
In the case where
$(E, \mathcal{E}, \mu) = (\mathbb{R}, \mathcal{B}, \text{Lebesgue})$, people often
just write this as
\[
  \mu(f) = \int_{\mathbb{R}} f(x) \, dx.
\]
On the other hand, if
$(E, \mathcal{E}, \mu) = (\Omega, \mathcal{F}, \mathbb{P})$ is a probability
space, and X is a random variable, then people write the integral as
$\mathbb{E}[X]$, the expectation of X.
So how are we going to define the integral? There are two steps to defining
the integral. The idea is that we first define the integral on simple functions,
and then extend the definition to more general measurable functions by taking
the limit. When we do the definition for simple functions, it will be obvious that
the definition satisfies the nice properties, and we will have to check that they
are preserved when we take the limit.
Definition (Simple function). A simple function is a measurable function that
can be written as a finite non-negative linear combination of indicator functions
of measurable sets, i.e.
\[
  f = \sum_{k = 1}^n a_k 1_{A_k}
\]
for some $A_k \in \mathcal{E}$ and $a_k \geq 0$.


Note that some sources do not assume that ak ≥ 0, but assuming this makes
our life easier.
It is obvious that
Proposition. A function is simple iff it is measurable, non-negative, and takes
on only finitely-many values.
Definition (Integral of simple function). The integral of a simple function
$$f = \sum_{k=1}^n a_k 1_{A_k}$$

is given by

$$\mu(f) = \sum_{k=1}^n a_k \mu(A_k).$$

Note that it can happen that µ(Ak) = ∞ but ak = 0. When this happens, we
are just going to declare that 0 · ∞ = 0 (this makes sense because this means
we are ignoring all 0 · 1A terms for any A). After we do this, we can check that the
integral is well-defined, i.e. it does not depend on how f is written as a
combination of indicator functions.
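
To make the definition concrete, here is a minimal Python sketch (not part of the lectures). It assumes a simple function is handed to us as a list of pairs (a_k, µ(A_k)); the name `integrate_simple` is made up for illustration. It applies the convention 0 · ∞ = 0 explicitly.

```python
import math

def integrate_simple(terms):
    """Integral of f = sum_k a_k 1_{A_k}, given as pairs (a_k, mu(A_k)).

    Terms with a_k = 0 or mu(A_k) = 0 are skipped, which implements
    the convention 0 * inf = 0.
    """
    total = 0.0
    for a, mass in terms:
        if a < 0 or mass < 0:
            raise ValueError("coefficients and measures must be non-negative")
        if a == 0 or mass == 0:
            continue  # 0 * inf = 0 by convention
        total += a * mass
    return total

# f = 2 * 1_[0,1] + 0 * 1_R + 3 * 1_[1,3] with respect to Lebesgue measure
value = integrate_simple([(2.0, 1.0), (0.0, math.inf), (3.0, 2.0)])
```

Two different representations of the same function, such as 1_[0,2] and 1_[0,1] + 1_(1,2], give the same value, which is the well-definedness claimed above.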


We are now going to extend this definition to non-negative measurable
functions by a limiting procedure. Once we’ve done this, we are going to extend
the definition to general measurable functions by linearity of the integral, writing
a function as the difference of its positive and negative parts. Then we
would have a definition of the integral, and we are going to deduce properties of
the integral using approximation.
Definition (Integral). Let f be a non-negative measurable function. We set
µ(f) = sup{µ(g) : g ≤ f, g is simple}.
For arbitrary measurable f, we write
f = f⁺ − f⁻ = (f ∨ 0) + (f ∧ 0),
where f⁺ = f ∨ 0 and f⁻ = (−f) ∨ 0.
We put |f| = f⁺ + f⁻. We say f is integrable if µ(|f|) < ∞. In this case, set
µ(f) = µ(f⁺) − µ(f⁻).
If only one of µ(f⁺), µ(f⁻) is finite, then we can still make the above definition,
and the result will be infinite.
In the case where we are integrating over (a subset of) the reals, we call it
the Lebesgue integral .
Proposition. Let f : [0, 1] → R be Riemann integrable. Then it is also Lebesgue
integrable, and the two integrals agree.
We will not prove this, but this immediately gives us results like the funda-
mental theorem of calculus, and also helps us to actually compute the integral.
However, note that this does not hold for infinite domains, as you will see in the
second example sheet.
But the Lebesgue integrable functions are better. A lot of functions are
Lebesgue integrable but not Riemann integrable.
Example. Take the standard non-Riemann integrable function
f = 1_{[0,1]\Q}.
Then f is not Riemann integrable, but it is Lebesgue integrable, since
µ(f) = µ([0, 1] \ Q) = 1.
We are now going to study some basic properties of the integral. We will first
look at the properties of integrals of simple functions, and then extend them to
general integrable functions.
For f, g simple, and α, β ≥ 0, we have that
µ(αf + βg) = αµ(f ) + βµ(g).
So the integral is linear.
Another important property is monotonicity — if f ≤ g, then µ(f ) ≤ µ(g).
Finally, we have f = 0 a.e. iff µ(f ) = 0. It is absolutely crucial here that we
are talking about non-negative functions.
Our goal is to show that these three properties are also satisfied for arbitrary
non-negative measurable functions, and the first two hold for integrable functions.
In order to achieve this, we prove a very important tool — the monotone
convergence theorem. Later, we will also learn about the dominated convergence
theorem and Fatou’s lemma. These are the main and very important results
about exchanging limits and integration.


Theorem (Monotone convergence theorem). Suppose that (fn), f are non-negative
measurable with fn ↗ f. Then µ(fn) ↗ µ(f).
In the proof we will use the fact that the integral is monotonic, which we
shall prove later.
Proof. We will split the proof into four steps. We will prove each of the following
in turn:
(i) If fn and f are indicator functions, then the theorem holds.
(ii) If f is an indicator function, then the theorem holds.
(iii) If f is simple, then the theorem holds.
(iv) If f is non-negative measurable, then the theorem holds.
Each part follows rather straightforwardly from the previous one, and the reader
is encouraged to try to prove it themself.

We first consider the case where fn = 1_{An} and f = 1_A. Then fn ↗ f is true
iff An ↗ A. On the other hand, µ(fn) ↗ µ(f) iff µ(An) ↗ µ(A).
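For convenience, we let A0 = ∅. We can write

$$\begin{aligned}
\mu(A) &= \mu\left(\bigcup_{n=1}^\infty (A_n \setminus A_{n-1})\right)\\
&= \sum_{n=1}^\infty \mu(A_n \setminus A_{n-1})\\
&= \lim_{N \to \infty} \sum_{n=1}^N \mu(A_n \setminus A_{n-1})\\
&= \lim_{N \to \infty} \mu(A_N).
\end{aligned}$$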

So done.

We next consider the case where f = 1A for some A. Fix ε > 0, and set

An = {fn > 1 − ε} ∈ E.

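Then we know that An ↗ A, as fn ↗ f. Moreover, by definition, we have

(1 − ε)1_{An} ≤ fn ≤ f = 1_A.

As An ↗ A, the previous case gives µ(An) → µ(A) = µ(f), so

$$(1 - \varepsilon)\mu(f) = (1 - \varepsilon) \lim_{n \to \infty} \mu(A_n) \le \lim_{n \to \infty} \mu(f_n) \le \mu(f),$$

where the last inequality holds since fn ≤ f. Since ε is arbitrary, we know that

$$\lim_{n \to \infty} \mu(f_n) = \mu(f).$$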


Next, we consider the case where f is simple. We write

$$f = \sum_{k=1}^m a_k 1_{A_k},$$

where ak > 0 and the Ak are pairwise disjoint. Since fn ↗ f, we know

$$a_k^{-1} f_n 1_{A_k} \nearrow 1_{A_k}.$$

So we have

$$\mu(f_n) = \sum_{k=1}^m \mu(f_n 1_{A_k}) = \sum_{k=1}^m a_k \mu(a_k^{-1} f_n 1_{A_k}) \to \sum_{k=1}^m a_k \mu(A_k) = \mu(f).$$

Suppose f is non-negative measurable. Suppose g ≤ f is a simple function.
As fn ↗ f, we know fn ∧ g ↗ f ∧ g = g. So by the previous case, we know that
µ(fn ∧ g) → µ(g).
We also know that
µ(fn) ≥ µ(fn ∧ g).
So we have
$$\lim_{n \to \infty} \mu(f_n) \ge \mu(g)$$
for all simple g ≤ f. Taking the supremum over all such g, we obtain
$$\lim_{n \to \infty} \mu(f_n) \ge \mu(f)$$
by definition of the integral. However, we also know that µ(fn) ≤ µ(f) for all n,
again by definition of the integral. So we must have equality. So we have
$$\mu(f) = \lim_{n \to \infty} \mu(f_n).$$

Theorem. Let f, g be non-negative measurable, and α, β ≥ 0. We have that


(i) µ(αf + βg) = αµ(f ) + βµ(g).
(ii) f ≤ g implies µ(f ) ≤ µ(g).
(iii) f = 0 a.e. iff µ(f ) = 0.
Proof.
(i) Let
$$f_n = 2^{-n} \lfloor 2^n f \rfloor \wedge n, \qquad g_n = 2^{-n} \lfloor 2^n g \rfloor \wedge n.$$
Then fn, gn are simple with fn ↗ f and gn ↗ g. Hence µ(fn) ↗ µ(f)
and µ(gn) ↗ µ(g) and µ(αfn + βgn) ↗ µ(αf + βg), by the monotone
convergence theorem. As fn, gn are simple, we have that
µ(αfn + βgn) = αµ(fn) + βµ(gn).
Taking the limit as n → ∞, we get
µ(αf + βg) = αµ(f) + βµ(g).


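(ii) We shall be careful not to use the monotone convergence theorem, since its
proof relied on monotonicity of the integral. We have

$$\begin{aligned}
\mu(g) &= \sup\{\mu(h) : h \le g \text{ simple}\}\\
&\ge \sup\{\mu(h) : h \le f \text{ simple}\}\\
&= \mu(f).
\end{aligned}$$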

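(iii) Suppose it is not the case that f = 0 a.e., i.e. the set {x : f(x) ≠ 0} has
positive measure. Let

$$A_n = \left\{x : f(x) > \frac{1}{n}\right\}.$$

Then

$$\{x : f(x) \ne 0\} = \bigcup_n A_n.$$

Since the left hand set has positive measure, it follows by countable subadditivity
that there is some An with positive measure. For that n, we define

$$h = \frac{1}{n} 1_{A_n}.$$

Then µ(f) ≥ µ(h) > 0. So µ(f) ≠ 0.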
Conversely, suppose f = 0 a.e. We let

$$f_n = 2^{-n} \lfloor 2^n f \rfloor \wedge n$$

be a simple function. Then fn ↗ f and fn = 0 a.e. So

$$\mu(f) = \lim_{n \to \infty} \mu(f_n) = 0.$$
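
The dyadic approximation used twice in this proof is easy to experiment with. The following Python sketch is only an illustration (the helper name `dyadic_approx` and the test function x ↦ x² are our own choices): it evaluates fn = 2⁻ⁿ⌊2ⁿ f⌋ ∧ n at a few points and lets us observe that the values increase with n and approach f from below.

```python
import math

def dyadic_approx(f, n):
    """Return the function f_n = min(2^-n * floor(2^n * f), n)."""
    def f_n(x):
        return min(math.floor((2 ** n) * f(x)) / (2 ** n), n)
    return f_n

f = lambda x: x * x                  # any non-negative function would do
xs = [0.0, 0.3, 1.7, 2.5]            # sample points

# approximations[n - 1][i] is f_n evaluated at xs[i], for n = 1, ..., 11
approximations = [[dyadic_approx(f, n)(x) for x in xs] for n in range(1, 12)]
```

Each fn takes only finitely many values (multiples of 2⁻ⁿ that are at most n), which is why it is simple whenever f is measurable.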

We now prove the analogous statement for general integrable functions.


Theorem. Let f, g be integrable, and α, β ≥ 0. We have that
(i) µ(αf + βg) = αµ(f ) + βµ(g).
(ii) f ≤ g implies µ(f ) ≤ µ(g).
(iii) f = 0 a.e. implies µ(f ) = 0.
Note that in the last case, the converse is no longer true, as one can easily
see from the sign function sgn : [−1, 1] → R.
Proof.
(i) We are going to prove these by applying the previous theorem.
By definition of the integral, we have µ(−f) = −µ(f). Also, if α ≥ 0, then

µ(αf) = µ(αf⁺) − µ(αf⁻) = αµ(f⁺) − αµ(f⁻) = αµ(f).

Combining these two properties, it then follows that if α is a real number,
then
µ(αf) = αµ(f).


To finish the proof of (i), we have to show that µ(f + g) = µ(f) + µ(g).
We know that this is true for non-negative functions, so we need to employ
a little trick to make this a statement about non-negative functions. If
we let h = f + g, then we can write this as

h⁺ − h⁻ = (f⁺ − f⁻) + (g⁺ − g⁻).

We now rearrange this as

h⁺ + f⁻ + g⁻ = f⁺ + g⁺ + h⁻.

Now everything is non-negative measurable. So applying µ gives

µ(h⁺) + µ(f⁻) + µ(g⁻) = µ(f⁺) + µ(g⁺) + µ(h⁻).

Since all of these quantities are finite, we can rearrange to obtain

µ(h⁺) − µ(h⁻) = µ(f⁺) − µ(f⁻) + µ(g⁺) − µ(g⁻).

This is exactly the same thing as saying

µ(f + g) = µ(h) = µ(f) + µ(g).

(ii) If f ≤ g, then g − f ≥ 0. So µ(g − f) ≥ 0. By (i), we know µ(g) − µ(f) ≥ 0.
So µ(g) ≥ µ(f).
(iii) If f = 0 a.e., then f⁺ = f⁻ = 0 a.e. So µ(f⁺) = µ(f⁻) = 0. So µ(f) =
µ(f⁺) − µ(f⁻) = 0.
As mentioned, the converse to (iii) is no longer true. However, we do have
the following partial converse:
Proposition. If A is a π-system with E ∈ A and σ(A) = E, and f is an
integrable function such that
µ(f 1_A) = 0
for all A ∈ A, then f = 0 a.e.
Proof. Let
D = {A ∈ E : µ(f 1_A) = 0}.
It follows immediately from the properties of the integral that D is a d-system,
and D contains the π-system A by assumption. So D = E by Dynkin’s lemma. Let

A⁺ = {x ∈ E : f(x) > 0},
A⁻ = {x ∈ E : f(x) < 0}.

Then A± ∈ E, and
µ(f 1_{A⁺}) = µ(f 1_{A⁻}) = 0.
Since f 1_{A⁺} is non-negative and f 1_{A⁻} is non-positive, both vanish a.e.
So f vanishes a.e.
Proposition. Suppose that (gn ) is a sequence of non-negative measurable
functions. Then we have
$$\mu\left(\sum_{n=1}^\infty g_n\right) = \sum_{n=1}^\infty \mu(g_n).$$


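Proof. We know

$$\sum_{n=1}^N g_n \nearrow \sum_{n=1}^\infty g_n$$

as N → ∞. So by the monotone convergence theorem, we have

$$\sum_{n=1}^N \mu(g_n) = \mu\left(\sum_{n=1}^N g_n\right) \nearrow \mu\left(\sum_{n=1}^\infty g_n\right).$$

But we also know that

$$\sum_{n=1}^N \mu(g_n) \nearrow \sum_{n=1}^\infty \mu(g_n)$$

by definition. So we are done.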


So for non-negative measurable functions, we can always switch the order of
integration and summation.
Note that we can consider summation as integration. We let E = N and
E = {all subsets of N}. We let µ be the counting measure, so that µ(A) is the
size of A. Then integrability (and having a finite integral) is the same as absolute
convergence. If the sum converges absolutely, then we have
$$\int f \,\mathrm{d}\mu = \sum_{n=1}^\infty f(n).$$

So we can just view our proposition as proving that we can swap the order of
two integrals. The general statement is known as Fubini’s theorem.
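
The non-negativity assumption in the proposition cannot simply be dropped. The following Python sketch uses a standard signed array (our own choice of example, not taken from the lectures) with +1 on the diagonal and −1 immediately to its right. Every row sums to 0, while the first column sums to 1 and all other columns sum to 0, so the two iterated sums disagree.

```python
def a(n, m):
    """Signed array: +1 on the diagonal, -1 just right of it, 0 elsewhere."""
    if m == n:
        return 1
    if m == n + 1:
        return -1
    return 0

def row_sum(n):
    # Row n has non-zero entries only in columns n and n + 1.
    return a(n, n) + a(n, n + 1)

def col_sum(m):
    # Column m has non-zero entries only in rows m - 1 and m (rows start at 1).
    return sum(a(n, m) for n in (m - 1, m) if n >= 1)

N = 1000
rows_then_cols = sum(row_sum(n) for n in range(1, N + 1))   # sum over m first
cols_then_rows = sum(col_sum(m) for m in range(1, N + 1))   # sum over n first
```

Here the inner sums are computed exactly, since each row and column has at most two non-zero entries; only the outer sum is truncated at N, and its value does not depend on N.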

3.2 Integrals and limits


We are now going to prove more things about exchanging limits and integrals.
These are going to be extremely useful in the future, as we want to exchange
limits and integrals a lot.
Theorem (Fatou’s lemma). Let (fn ) be a sequence of non-negative measurable
functions. Then
µ(lim inf fn ) ≤ lim inf µ(fn ).
Note that a special case was proven in the first example sheet, where we did
it for the case where fn are indicator functions.

Proof. We start with the trivial observation that if k ≥ n, then we always have
that
$$\inf_{m \ge n} f_m \le f_k.$$

By the monotonicity of the integral, we know that

$$\mu\left(\inf_{m \ge n} f_m\right) \le \mu(f_k)$$

for all k ≥ n.


So we have
$$\mu\left(\inf_{m \ge n} f_m\right) \le \inf_{k \ge n} \mu(f_k) \le \liminf_m \mu(f_m).$$

It remains to show that the left hand side converges to µ(lim inf fm) as n → ∞. Indeed,
we know that
$$\inf_{m \ge n} f_m \nearrow \liminf_m f_m.$$

Then by monotone convergence, we have

$$\mu\left(\inf_{m \ge n} f_m\right) \nearrow \mu\left(\liminf_m f_m\right).$$

So we have
$$\mu\left(\liminf_m f_m\right) \le \liminf_m \mu(f_m).$$

No one ever remembers which direction Fatou’s lemma goes, and this leads to
many incorrect proofs and results, so it is helpful to keep the following example
in mind:
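Example. We let (E, E, µ) = (R, B, Lebesgue). We let

fn = 1_{[n,n+1]}.

Then fn(x) → 0 for every x, so we have
$$\liminf_n f_n = 0.$$

On the other hand, we have
µ(fn) = 1 for all n.
So we have
$$\liminf_n \mu(f_n) = 1, \qquad \mu\left(\liminf_n f_n\right) = 0.$$
So the inequality in Fatou’s lemma is strict here:
$$\mu\left(\liminf_n f_n\right) < \liminf_n \mu(f_n).$$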

The next result we want to prove is the dominated convergence theorem.
This is like the monotone convergence theorem, but we are going to remove the
increasing and non-negative measurable condition, and add in something else.

Theorem (Dominated convergence theorem). Let (fn), f be measurable with
fn(x) → f(x) for all x ∈ E. Suppose that there is an integrable function g such
that
|fn| ≤ g
for all n. Then we have
µ(fn) → µ(f)
as n → ∞.


Proof. Note that

$$|f| = \lim_n |f_n| \le g.$$

So we know that
µ(|f|) ≤ µ(g) < ∞.
So we know that f, fn are integrable.
Now note also that
0 ≤ g + fn, 0 ≤ g − fn
for all n. We are now going to apply Fatou’s lemma twice with these sequences. We
have that

$$\begin{aligned}
\mu(g) + \mu(f) &= \mu(g + f)\\
&= \mu\left(\liminf_n (g + f_n)\right)\\
&\le \liminf_n \mu(g + f_n)\\
&= \liminf_n (\mu(g) + \mu(f_n))\\
&= \mu(g) + \liminf_n \mu(f_n).
\end{aligned}$$

Since µ(g) is finite, we know that

$$\mu(f) \le \liminf_n \mu(f_n).$$

We now do the same thing with g − fn. We have

$$\begin{aligned}
\mu(g) - \mu(f) &= \mu(g - f)\\
&= \mu\left(\liminf_n (g - f_n)\right)\\
&\le \liminf_n \mu(g - f_n)\\
&= \liminf_n (\mu(g) - \mu(f_n))\\
&= \mu(g) - \limsup_n \mu(f_n).
\end{aligned}$$

Again, since µ(g) is finite, we know that

$$\mu(f) \ge \limsup_n \mu(f_n).$$

These combine to tell us that

$$\mu(f) \le \liminf_n \mu(f_n) \le \limsup_n \mu(f_n) \le \mu(f).$$

So they must be all equal, and thus µ(fn ) → µ(f ).
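
To see what the domination hypothesis buys us, here is a small numerical sketch in Python. Both examples are our own illustrations and the integrals are only approximated by a midpoint rule (the helper `midpoint_integral` is hypothetical). The first sequence, fn(x) = xⁿ on [0, 1], is dominated by the constant function 1 and converges pointwise to 1_{{1}}, whose integral is 0. The second, fn = n · 1_{(0,1/n)}, converges pointwise to 0 but admits no integrable dominating function.

```python
def midpoint_integral(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def dominated(n):
    # f_n(x) = x^n on [0, 1]; |f_n| <= 1, and f_n -> 0 for every x < 1
    return lambda x: x ** n

def spike(n):
    # f_n = n * 1_(0, 1/n); f_n -> 0 pointwise, but sup_n f_n is not integrable
    return lambda x: float(n) if 0.0 < x < 1.0 / n else 0.0

ns = (1, 10, 100, 1000)
dominated_integrals = [midpoint_integral(dominated(n), 0.0, 1.0) for n in ns]
spike_integrals = [midpoint_integral(spike(n), 0.0, 1.0) for n in ns]
```

The first list decays like 1/(n + 1), as the theorem predicts, while the second stays at 1 even though the pointwise limit has integral 0.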

3.3 New measures from old


We have previously considered several ways of constructing measures from old
ones, such as the image measure. We are now going to study a few more ways of
constructing new measures, and see how integrals behave when we do these.


Definition (Restriction of measure space). Let (E, E, µ) be a measure space,
and let A ∈ E. The restriction of the measure space to A is (A, EA, µA), where

EA = {B ∈ E : B ⊆ A},

and µA is the restriction of µ to EA , i.e.

µA (B) = µ(B)

for all B ∈ EA .
It is easy to check the following:
Lemma. For (E, E, µ) a measure space and A ∈ E, the restriction to A is a
measure space.

Proposition. Let (E, E, µ) and (F, F, µ′) be measure spaces and A ∈ E. Let
f : E → F be a measurable function. Then f|A is EA-measurable.
Proof. Let B ∈ F. Then

(f|A)⁻¹(B) = f⁻¹(B) ∩ A ∈ EA.

Similarly, we have
Proposition. If f is integrable, then f |A is µA -integrable and µA (f |A ) =
µ(f 1A ).
Note that this means we have
$$\mu(f 1_A) = \int_E f 1_A \,\mathrm{d}\mu = \int_A f \,\mathrm{d}\mu_A.$$

Usually, we are lazy and just write

$$\mu(f 1_A) = \int_A f \,\mathrm{d}\mu.$$

In the particular case of Lebesgue integration, if A is an interval with left and
right end points a, b (i.e. it can be open, closed or half-open), then
we write
$$\int_A f \,\mathrm{d}\mu = \int_a^b f(x) \,\mathrm{d}x.$$
There is another construction we would be interested in.
Definition (Pushforward/image of measure). Let (E, E) and (G, G) be measurable
spaces, and f : E → G a measurable function. If µ is a measure on (E, E), then

ν = µ ◦ f⁻¹

is a measure on (G, G), known as the pushforward or image measure.

We have already seen this before, but we can apply this to integration as
follows:


Proposition. If g is a non-negative measurable function on G, then

ν(g) = µ(g ◦ f ).

Proof. Exercise using the monotone class theorem (see example sheet).
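
On a finite set the identity ν(g) = µ(g ◦ f) can be checked by hand. The Python sketch below is only an illustration under that assumption: measures are stored as dictionaries from points to masses, and the names `pushforward` and `integrate` are made up for this example.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(mu, f):
    """Image measure nu = mu o f^{-1} of a measure on a finite set."""
    nu = defaultdict(Fraction)
    for x, mass in mu.items():
        nu[f(x)] += mass          # all mass at x is moved to f(x)
    return dict(nu)

def integrate(mu, g):
    """Integral of g against a measure on a finite set."""
    return sum(g(x) * mass for x, mass in mu.items())

mu = {x: Fraction(1, 6) for x in range(1, 7)}   # uniform measure on {1, ..., 6}
f = lambda x: x % 3                             # map into {0, 1, 2}
g = lambda y: y * y + 1                         # non-negative function on the target

nu = pushforward(mu, f)
lhs = integrate(nu, g)                          # nu(g)
rhs = integrate(mu, lambda x: g(f(x)))          # mu(g o f)
```

Both sides equal 8/3 here, since ν puts mass 1/3 on each of 0, 1 and 2.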
Finally, we can specify a measure by specifying a density.
Definition (Density). Let (E, E, µ) be a measure space, and f be a non-negative
measurable function. We define

ν(A) = µ(f 1_A).

Then ν is a measure on (E, E), and we say that ν has density f with respect to µ.

Proposition. The ν defined above is indeed a measure.

Proof.
(i) ν(∅) = µ(f 1_∅) = µ(0) = 0.
(ii) If (An) is a disjoint sequence in E, then
$$\nu\left(\bigcup_n A_n\right) = \mu\left(f 1_{\bigcup_n A_n}\right) = \mu\left(\sum_n f 1_{A_n}\right) = \sum_n \mu(f 1_{A_n}) = \sum_n \nu(A_n).$$

Definition (Density). Let X be a random variable. We say X has a density if
its law µX has a density with respect to the Lebesgue measure. In other words,
there exists fX non-negative measurable so that
$$\mu_X(A) = \mathbb{P}[X \in A] = \int_A f_X(x) \,\mathrm{d}x.$$

In this case, for any non-negative measurable g, we have that
$$\mathbb{E}[g(X)] = \int_{\mathbb{R}} g(x) f_X(x) \,\mathrm{d}x.$$

3.4 Integration and differentiation


In “normal” calculus, we had three results involving both integration and dif-
ferentiation. One was the fundamental theorem of calculus, which we already
stated. The others are the change of variables formula, and differentiating under
the integral sign.
We start by proving the change of variables formula.
Proposition (Change of variables formula). Let φ : [a, b] → R be continuously
differentiable and increasing. Then for any bounded Borel function g, we have
$$\int_{\phi(a)}^{\phi(b)} g(y) \,\mathrm{d}y = \int_a^b g(\phi(x)) \phi'(x) \,\mathrm{d}x. \tag{$*$}$$

We will use the monotone class theorem.


Proof. We let

V = {Borel functions g such that (∗) holds}.

We will want to use the monotone class theorem to show that this includes all
bounded Borel functions.
We already know that
(i) V contains 1_A for all A in the π-system of intervals of the form
[u, v] ⊆ [φ(a), φ(b)]. This is just the fundamental theorem of calculus.
(ii) By linearity of the integral, V is indeed a vector space.
(iii) Finally, let (gn) be a sequence in V, and gn ≥ 0, gn ↗ g. Then we know
that
$$\int_{\phi(a)}^{\phi(b)} g_n(y) \,\mathrm{d}y = \int_a^b g_n(\phi(x)) \phi'(x) \,\mathrm{d}x.$$

By the monotone convergence theorem, these converge to

$$\int_{\phi(a)}^{\phi(b)} g(y) \,\mathrm{d}y = \int_a^b g(\phi(x)) \phi'(x) \,\mathrm{d}x.$$

So g ∈ V.

Then by the monotone class theorem, V contains all bounded Borel functions.
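
As a sanity check of (∗), the following Python sketch compares both sides numerically. The choices φ(x) = x² on [1, 2] and g = cos are ours, and the midpoint rule in `midpoint_integral` is just a convenient stand-in for the integral.

```python
import math

def midpoint_integral(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

phi = lambda x: x * x          # continuously differentiable and increasing on [1, 2]
dphi = lambda x: 2 * x         # derivative of phi
g = math.cos                   # a bounded Borel function

a, b = 1.0, 2.0
lhs = midpoint_integral(g, phi(a), phi(b))                       # integral of g over [phi(a), phi(b)]
rhs = midpoint_integral(lambda x: g(phi(x)) * dphi(x), a, b)     # integral of g(phi(x)) phi'(x) over [a, b]
exact = math.sin(phi(b)) - math.sin(phi(a))                      # antiderivative of cos
```

The two approximations agree with each other and with sin 4 − sin 1 up to the quadrature error.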
The next problem is differentiation under the integral sign. We want to know
when we can say
$$\frac{\mathrm{d}}{\mathrm{d}t} \int f(t, x) \,\mathrm{d}x = \int \frac{\partial f}{\partial t}(t, x) \,\mathrm{d}x.$$
Theorem (Differentiation under the integral sign). Let (E, E, µ) be a measure space,
and U ⊆ R be an open set, and f : U × E → R. We assume that
(i) For any t ∈ U fixed, the map x ↦ f(t, x) is integrable;
(ii) For any x ∈ E fixed, the map t ↦ f(t, x) is differentiable;
(iii) There exists an integrable function g such that

$$\left|\frac{\partial f}{\partial t}(t, x)\right| \le g(x)$$

for all x ∈ E and t ∈ U.

Then the map
$$x \mapsto \frac{\partial f}{\partial t}(t, x)$$
is integrable for all t, and also the function
$$F(t) = \int_E f(t, x) \,\mathrm{d}\mu$$
is differentiable, and
$$F'(t) = \int_E \frac{\partial f}{\partial t}(t, x) \,\mathrm{d}\mu.$$


The reason why we want the derivative to be bounded by an integrable function
is that we want to apply the dominated convergence theorem.
Proof. Measurability of the derivative follows from the fact that it is a limit of
measurable functions, and then integrability follows since it is bounded by g.
Suppose (hn) is a sequence of non-zero numbers with hn → 0, so that
t + hn ∈ U for all large n. Then let

$$g_n(x) = \frac{f(t + h_n, x) - f(t, x)}{h_n} - \frac{\partial f}{\partial t}(t, x).$$
Since f is differentiable in t, we know that gn(x) → 0 as n → ∞. Moreover, by the
mean value theorem, we know that

|gn(x)| ≤ 2g(x).

On the other hand, by definition of F(t), we have

$$\frac{F(t + h_n) - F(t)}{h_n} - \int_E \frac{\partial f}{\partial t}(t, x) \,\mathrm{d}\mu = \int_E g_n(x) \,\mathrm{d}\mu.$$

By dominated convergence, we know the RHS tends to 0. So we know

$$\lim_{n \to \infty} \frac{F(t + h_n) - F(t)}{h_n} = \int_E \frac{\partial f}{\partial t}(t, x) \,\mathrm{d}\mu.$$

Since hn was arbitrary, it follows that F′(t) exists and is equal to the integral.
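
Here is a numerical illustration of the theorem in Python, under assumptions of our own choosing: E = [0, 1] with Lebesgue measure, f(t, x) = exp(tx) and U = (0, 2), so that |∂f/∂t| = x exp(tx) ≤ e² is dominated by an integrable function. The helper `midpoint_integral` is a hypothetical quadrature routine, and the derivative of F is approximated by a central difference.

```python
import math

def midpoint_integral(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def F(t):
    # F(t) = integral over [0, 1] of f(t, x) = exp(t * x)
    return midpoint_integral(lambda x: math.exp(t * x), 0.0, 1.0)

def integral_of_derivative(t):
    # integral over [0, 1] of (d/dt) exp(t * x) = x * exp(t * x)
    return midpoint_integral(lambda x: x * math.exp(t * x), 0.0, 1.0)

t, h = 1.0, 1e-4
difference_quotient = (F(t + h) - F(t - h)) / (2 * h)   # approximates F'(t)
swapped = integral_of_derivative(t)                     # derivative taken inside the integral
```

At t = 1 the integral of x exp(x) over [0, 1] is exactly 1, so both quantities should be close to 1.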

3.5 Product measures and Fubini’s theorem


Recall the following definition of the product σ-algebra.
Definition (Product σ-algebra). Let (E1, E1, µ1) and (E2, E2, µ2) be finite measure
spaces. We let

A = {A1 × A2 : A1 ∈ E1, A2 ∈ E2}.

Then A is a π-system on E1 × E2 . The product σ-algebra is

E = E1 ⊗ E2 = σ(A).

We now want to construct a measure on the product σ-algebra. We can, of
course, just apply the Caratheodory extension theorem, but we would want a
more explicit description of the integral. The idea is to define, for A ∈ E1 ⊗ E2,
$$\mu(A) = \int_{E_1} \left(\int_{E_2} 1_A(x_1, x_2) \,\mu_2(\mathrm{d}x_2)\right) \mu_1(\mathrm{d}x_1).$$

Doing this has the advantage that it would help us in a step of proving Fubini’s
theorem.
However, before we can make this definition, we need to do some preparation
to make sure the above statement actually makes sense:
Lemma. Let E = E1 × E2, equipped with the product σ-algebra E = E1 ⊗ E2.
Suppose f : E → R is an E-measurable function. Then


(i) For each x2 ∈ E2, the function x1 ↦ f(x1, x2) is E1-measurable.
(ii) If f is bounded or non-negative measurable, then
$$f_2(x_2) = \int_{E_1} f(x_1, x_2) \,\mu_1(\mathrm{d}x_1)$$
is E2-measurable.
Proof. The first part follows immediately from the fact that for a fixed x2,
the map ι1 : E1 → E given by ι1(x1) = (x1, x2) is measurable, and that the
composition of measurable functions is measurable.
For the second part, we use the monotone class theorem. We let V be
the set of all measurable functions f such that
$x_2 \mapsto \int_{E_1} f(x_1, x_2) \,\mu_1(\mathrm{d}x_1)$ is E2-measurable.
(i) It is clear that 1_E, 1_A ∈ V for all A ∈ A (where A is as in the definition
of the product σ-algebra).
(ii) V is a vector space by linearity of the integral.
(iii) Suppose (fn) is a non-negative sequence in V and fn ↗ f, then
$$\left(x_2 \mapsto \int_{E_1} f_n(x_1, x_2) \,\mu_1(\mathrm{d}x_1)\right) \nearrow \left(x_2 \mapsto \int_{E_1} f(x_1, x_2) \,\mu_1(\mathrm{d}x_1)\right)$$
by the monotone convergence theorem. Since a pointwise limit of measurable
functions is measurable, f ∈ V.


So the monotone class theorem tells us V contains all bounded measurable
functions.
Now if f is a general non-negative measurable function, then f ∧ n is bounded
and measurable, hence f ∧ n ∈ V . Therefore f ∈ V by the monotone convergence
theorem.
Theorem. There exists a unique measure µ = µ1 ⊗ µ2 on E such
that
µ(A1 × A2) = µ1(A1)µ2(A2)
for all A1 × A2 ∈ A.
Here it is crucial that the measure spaces are finite. Actually, everything
still works for σ-finite measure spaces, as we can just reduce to the finite case.
However, things start to go wrong if we don’t have σ-finite measure spaces.
Proof. One might be tempted to just apply the Caratheodory extension theorem,
but we have a more direct way of doing it here, by using integrals. We define
$$\mu(A) = \int_{E_1} \left(\int_{E_2} 1_A(x_1, x_2) \,\mu_2(\mathrm{d}x_2)\right) \mu_1(\mathrm{d}x_1).$$

Here the previous lemma is very important. It tells us that these integrals
actually make sense!
We first check that this is a measure:
(i) µ(∅) = 0 is immediate since 1∅ = 0.


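(ii) Suppose (An) is a disjoint sequence and A = ⋃ An. Then we have
$$\begin{aligned}
\mu(A) &= \int_{E_1} \left(\int_{E_2} 1_A(x_1, x_2) \,\mu_2(\mathrm{d}x_2)\right) \mu_1(\mathrm{d}x_1)\\
&= \int_{E_1} \left(\int_{E_2} \sum_n 1_{A_n}(x_1, x_2) \,\mu_2(\mathrm{d}x_2)\right) \mu_1(\mathrm{d}x_1).
\end{aligned}$$

We now use the fact that integration commutes with the sum of non-negative
measurable functions to get
$$\begin{aligned}
\mu(A) &= \int_{E_1} \left(\sum_n \int_{E_2} 1_{A_n}(x_1, x_2) \,\mu_2(\mathrm{d}x_2)\right) \mu_1(\mathrm{d}x_1)\\
&= \sum_n \int_{E_1} \left(\int_{E_2} 1_{A_n}(x_1, x_2) \,\mu_2(\mathrm{d}x_2)\right) \mu_1(\mathrm{d}x_1)\\
&= \sum_n \mu(A_n).
\end{aligned}$$

So we have a working measure, and it clearly satisfies

µ(A1 × A2) = µ1(A1)µ2(A2).

Uniqueness follows because µ is finite, and is thus characterized by its values on
the π-system A that generates E.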
Exercise. Show the non-uniqueness of the product of the Lebesgue measure on
[0, 1] and the counting measure on [0, 1].
Note that we could as well have defined the measure as
$$\mu(A) = \int_{E_2} \left(\int_{E_1} 1_A(x_1, x_2) \,\mu_1(\mathrm{d}x_1)\right) \mu_2(\mathrm{d}x_2).$$

The same proof would go through, so we have another measure on the space.
However, by uniqueness, we know they must be the same! Fubini’s theorem
generalizes this to arbitrary functions.
Theorem (Fubini’s theorem).
(i) If f is non-negative measurable, then
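$$\mu(f) = \int_{E_1} \left(\int_{E_2} f(x_1, x_2) \,\mu_2(\mathrm{d}x_2)\right) \mu_1(\mathrm{d}x_1). \tag{$*$}$$

In particular, we have
$$\int_{E_1} \left(\int_{E_2} f(x_1, x_2) \,\mu_2(\mathrm{d}x_2)\right) \mu_1(\mathrm{d}x_1) = \int_{E_2} \left(\int_{E_1} f(x_1, x_2) \,\mu_1(\mathrm{d}x_1)\right) \mu_2(\mathrm{d}x_2).$$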

This is sometimes known as Tonelli’s theorem.


(ii) If f is integrable, and

$$A = \left\{x_1 \in E_1 : \int_{E_2} |f(x_1, x_2)| \,\mu_2(\mathrm{d}x_2) < \infty\right\},$$

then
µ1(E1 \ A) = 0.
If we set
$$f_1(x_1) = \begin{cases} \int_{E_2} f(x_1, x_2) \,\mu_2(\mathrm{d}x_2) & x_1 \in A\\ 0 & x_1 \notin A \end{cases},$$
then f1 is a µ1-integrable function and

µ1(f1) = µ(f).

Proof.
(i) Let V be the set of all measurable functions such that (∗) holds. Then V
is a vector space since integration is linear.

(a) By definition of µ, we know 1_E and 1_A are in V for all A ∈ A.
(b) The monotone convergence theorem on both sides tells us that V is
closed under monotone limits of the form fn ↗ f, fn ≥ 0.
By the monotone class theorem, we know V contains all bounded measurable
functions. If f is non-negative measurable, then (f ∧ n) ∈ V, and
monotone convergence for f ∧ n ↗ f gives that f ∈ V.
(ii) Assume that f is µ-integrable. Then
$$x_1 \mapsto \int_{E_2} |f(x_1, x_2)| \,\mu_2(\mathrm{d}x_2)$$
is E1-measurable, and, by (i), is µ1-integrable. So A, being the complement of the
inverse image of {∞} under that map, lies in E1. Moreover, µ1(E1 \ A) = 0 because
integrable functions can only be infinite on sets of measure 0.
We set
$$f_1^+(x_1) = \int_{E_2} f^+(x_1, x_2) \,\mu_2(\mathrm{d}x_2), \qquad f_1^-(x_1) = \int_{E_2} f^-(x_1, x_2) \,\mu_2(\mathrm{d}x_2).$$

Then we have
$$f_1 = (f_1^+ - f_1^-) 1_A.$$
So the result follows since

$$\mu(f) = \mu(f^+) - \mu(f^-) = \mu_1(f_1^+) - \mu_1(f_1^-) = \mu_1(f_1)$$

by (i).
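
For finite sets everything in Fubini’s theorem reduces to finite sums, which makes it easy to check by machine. The Python sketch below is an illustration only: the two finite measures, the signed function f and all variable names are our own choices.

```python
from fractions import Fraction
from itertools import product

# Two finite measures on finite sets (not probability measures).
mu1 = {"a": Fraction(1, 2), "b": Fraction(3, 2)}
mu2 = {0: Fraction(1, 3), 1: Fraction(2, 3), 2: Fraction(1)}

def f(x1, x2):
    # A signed (hence merely integrable) function on the product space.
    return (x2 - 1) * (1 if x1 == "a" else -2)

# Product measure, determined by mu({(x1, x2)}) = mu1({x1}) * mu2({x2}).
mu = {(x1, x2): mu1[x1] * mu2[x2] for x1, x2 in product(mu1, mu2)}

double_integral = sum(f(x1, x2) * m for (x1, x2), m in mu.items())
x2_then_x1 = sum(sum(f(x1, x2) * mu2[x2] for x2 in mu2) * mu1[x1] for x1 in mu1)
x1_then_x2 = sum(sum(f(x1, x2) * mu1[x1] for x1 in mu1) * mu2[x2] for x2 in mu2)
```

All three quantities equal −5/3, whichever order the sums are taken in.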


Since the Lebesgue measure on R is σ-finite, we know that we can sensibly talk
about the d-fold product of the Lebesgue measure on R to obtain the Lebesgue
measure on R^d.
What σ-algebra is the Lebesgue measure on R^d defined on? We know the
Lebesgue measure on R is defined on B. So the Lebesgue measure on R^d is defined on

B ⊗ · · · ⊗ B = σ(B1 × · · · × Bd : Bi ∈ B).

By looking at the definition of the product topology, we see that this is just the
Borel σ-algebra on R^d!
Recall that when we constructed the Lebesgue measure, the Caratheodory
extension theorem yields a measure on the “Lebesgue σ-algebra” M, which
was strictly bigger than the Borel σ-algebra. It was shown in the first example
sheet that M is complete, i.e. if we have A ⊆ B ⊆ R with B ∈ M, µ(B) = 0,
then A ∈ M. We can also take the Lebesgue measure on R^d to be defined on
M ⊗ · · · ⊗ M. However, it happens that M ⊗ M together with the Lebesgue
measure on R^2 is no longer complete (proof is left as an exercise for the reader).
We now turn to probability. Recall that random variables X1 , · · · , Xn are
independent iff the σ-algebras σ(X1 ), · · · , σ(Xn ) are independent. We will show
that random variables are independent iff their laws are given by the product
measure.

Proposition. Let X1 , · · · , Xn be random variables on (Ω, F, P) with values in


(E1 , E1 ), · · · , (En , En ) respectively. We define

E = E1 × · · · × En , E = E1 ⊗ · · · ⊗ En .

Then X = (X1 , · · · , Xn ) is E-measurable and the following are equivalent:

(i) X1 , · · · , Xn are independent.


(ii) µX = µX1 ⊗ · · · ⊗ µXn .
(iii) For any f1, · · · , fn bounded and measurable, we have
$$\mathbb{E}\left[\prod_{k=1}^n f_k(X_k)\right] = \prod_{k=1}^n \mathbb{E}[f_k(X_k)].$$

Proof.
– (i) ⇒ (ii): Let ν = µX1 ⊗ · · · ⊗ µXn. We want to show that ν = µX. To
do so, we just have to check that they agree on a π-system generating the
entire σ-algebra. We let

A = {A1 × · · · × An : A1 ∈ E1, · · · , An ∈ En}.

Then A is a generating π-system of E. Moreover, if A = A1 × · · · × An ∈ A,
then by independence we have

$$\begin{aligned}
\mu_X(A) &= \mathbb{P}[X \in A]\\
&= \mathbb{P}[X_1 \in A_1, \cdots, X_n \in A_n]\\
&= \prod_{k=1}^n \mathbb{P}[X_k \in A_k]\\
&= \nu(A).
\end{aligned}$$

So we know that µX = ν = µX1 ⊗ · · · ⊗ µXn on E.


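– (ii) ⇒ (iii): By assumption, we can evaluate the expectation
$$\begin{aligned}
\mathbb{E}\left[\prod_{k=1}^n f_k(X_k)\right] &= \int_E \prod_{k=1}^n f_k(x_k) \,\mu_X(\mathrm{d}x)\\
&= \prod_{k=1}^n \int_{E_k} f_k(x_k) \,\mu_{X_k}(\mathrm{d}x_k)\\
&= \prod_{k=1}^n \mathbb{E}[f_k(X_k)].
\end{aligned}$$

Here in the middle we have used Fubini’s theorem.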


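– (iii) ⇒ (i): Take fk = 1_{Ak} for Ak ∈ Ek. Then we have
$$\begin{aligned}
\mathbb{P}[X_1 \in A_1, \cdots, X_n \in A_n] &= \mathbb{E}\left[\prod_{k=1}^n 1_{A_k}(X_k)\right]\\
&= \prod_{k=1}^n \mathbb{E}[1_{A_k}(X_k)]\\
&= \prod_{k=1}^n \mathbb{P}[X_k \in A_k].
\end{aligned}$$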

So X1 , · · · , Xn are independent.
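
The equivalence of (i) and (ii) can be tested directly on a small finite probability space. In the Python sketch below (an illustration with hypothetical helper names `law` and `is_product`), Ω consists of two fair coin flips; X and Y are the two flips and S is their sum. The joint law of (X, Y) is the product of the marginals, whereas the joint law of (X, S) is not.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product([0, 1], repeat=2))          # two fair coin flips
P = {w: Fraction(1, 4) for w in omega}

def law(*variables):
    """Joint law of the given random variables, as {tuple of values: probability}."""
    out = Counter()
    for w, p in P.items():
        out[tuple(v(w) for v in variables)] += p
    return dict(out)

def is_product(joint, marginal_x, marginal_y):
    """Check joint({(x, y)}) = marginal_x({x}) * marginal_y({y}) for all x, y."""
    return all(
        joint.get((x, y), 0) == marginal_x[(x,)] * marginal_y[(y,)]
        for (x,) in marginal_x
        for (y,) in marginal_y
    )

X = lambda w: w[0]
Y = lambda w: w[1]
S = lambda w: w[0] + w[1]

xy_independent = is_product(law(X, Y), law(X), law(Y))
xs_independent = is_product(law(X, S), law(X), law(S))
```

On a finite space it suffices to compare the laws on singletons, since these generate everything by finite additivity.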

4 Inequalities and Lp spaces II Probability and Measure

4 Inequalities and Lp spaces


Eventually, we will want to define the Lp spaces as follows:
Definition (L^p spaces). Let (E, E, µ) be a measure space. For 1 ≤ p < ∞,
we define L^p = L^p(E, E, µ) to be the set of all measurable functions f such that
$$\|f\|_p = \left(\int |f|^p \,\mathrm{d}\mu\right)^{1/p} < \infty.$$

For p = ∞, we let L^∞ = L^∞(E, E, µ) be the space of measurable functions with

$$\|f\|_\infty = \inf\{\lambda \ge 0 : |f| \le \lambda \text{ a.e.}\} < \infty.$$

However, it is not clear that this is a norm. First of all, ‖f‖p = 0 does not
imply that f = 0. It only means that f = 0 a.e. But this is easy to solve. We
simply quotient out the vector space by the functions that vanish a.e., i.e. we
identify functions that differ only on a set of measure zero. The more serious
problem is that we don’t know how to prove the triangle inequality.
To do so, we are going to prove some inequalities. Apart from enabling us to
show that ‖ · ‖p is indeed a norm, they will also be very helpful in the future
when we want to bound integrals.

4.1 Four inequalities


The four inequalities we are going to prove are the following:
(i) Chebyshev/Markov inequality
(ii) Jensen’s inequality
(iii) Hölder’s inequality

(iv) Minkowski’s inequality.


So let’s start proving the inequalities.
Proposition (Chebyshev’s/Markov’s inequality). Let f be non-negative mea-
surable and λ > 0. Then
$$\mu(\{f \ge \lambda\}) \le \frac{1}{\lambda} \mu(f).$$
This is often used when this is a probability measure, so that we are bounding
the probability that a random variable is big.
The proof is essentially one line.
Proof. We write
f ≥ f 1_{f ≥ λ} ≥ λ 1_{f ≥ λ}.
Taking µ gives the desired answer.
This is incredibly simple, but also incredibly useful!
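
As a quick check, the following Python sketch verifies the inequality exactly on a finite measure space. The measure with masses 2⁻ᵏ on {1, ..., 10} and the function f(k) = k² are arbitrary choices of ours, and `markov_holds` is a hypothetical helper.

```python
from fractions import Fraction

# A finite measure on {1, ..., 10} with mass 2^-k at the point k.
mu = {k: Fraction(1, 2 ** k) for k in range(1, 11)}
f = lambda k: k * k                      # a non-negative function

def integral(g):
    """mu(g) for a function g on the finite set."""
    return sum(g(k) * m for k, m in mu.items())

def markov_holds(lam):
    """Check mu({f >= lam}) <= mu(f) / lam."""
    lhs = sum(m for k, m in mu.items() if f(k) >= lam)
    return lhs <= integral(f) / lam

checks = [markov_holds(lam) for lam in (1, 2, 5, 10, 50, 100)]
```

Note that µ is not a probability measure here; the inequality holds for any measure.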
The next inequality is Jensen’s inequality. To state it, we need to know what
a convex function is.


Definition (Convex function). Let I ⊆ R be an interval. Then c : I → R is
convex if for any t ∈ [0, 1] and x, y ∈ I, we have

c(tx + (1 − t)y) ≤ tc(x) + (1 − t)c(y).

[Figure: graph of a convex function c, with the chord value (1 − t)c(x) + tc(y) lying above the graph at the point (1 − t)x + ty between x and y.]

Note that if c is twice differentiable, then this is equivalent to c″ ≥ 0.


Proposition (Jensen’s inequality). Let X be an integrable random variable
with values in I. If c : I → R is convex, then we have

E[c(X)] ≥ c(E[X]).

It is crucial that this only applies to a probability space. We need the total
mass of the measure space to be 1 for it to work. Just being finite is not enough.
Jensen’s inequality will be an easy consequence of the following lemma:
Lemma. If c : I → R is a convex function and m is in the interior of I, then
there exist real numbers a, b such that

c(x) ≥ ax + b

for all x ∈ I, with equality at x = m.

[Figure: graph of a convex function together with a supporting line ax + b.]

If the function is differentiable, then we can easily extract this from the
derivative. However, if it is not, then we need to be more careful.
Proof. If c is smooth, then we know c″ ≥ 0, and thus c′ is non-decreasing. We
are going to show an analogous statement that does not mention the word
“derivative”. Consider x < m < y with x, y, m ∈ I. We want to show that

$$\frac{c(m) - c(x)}{m - x} \le \frac{c(y) - c(m)}{y - m}.$$


To show this, we turn off our brains and do the only thing we can do. We can
write
m = tx + (1 − t)y
for some t. Then convexity tells us

c(m) ≤ tc(x) + (1 − t)c(y).

Writing c(m) = tc(m) + (1 − t)c(m), this tells us

t(c(m) − c(x)) ≤ (1 − t)(c(y) − c(m)).

To conclude, we simply have to compute the actual value of t and plug it in. We
have
$$t = \frac{y - m}{y - x}, \qquad 1 - t = \frac{m - x}{y - x}.$$
So we obtain
$$\frac{y - m}{y - x}(c(m) - c(x)) \le \frac{m - x}{y - x}(c(y) - c(m)).$$
Cancelling the y − x and dividing by (y − m)(m − x) > 0 gives the desired result.
Now since x and y are arbitrary, we know there is some a ∈ R such that
$$\frac{c(m) - c(x)}{m - x} \le a \le \frac{c(y) - c(m)}{y - m}$$
for all x < m < y. If we rearrange, then we obtain

c(t) ≥ a(t − m) + c(m)

for all t ∈ I. So we can take b = c(m) − am.
Proof of Jensen’s inequality. To apply the previous result, we need to pick the
right m. We take
m = E[X].
To apply this, we need to know that m is in the interior of I. So we assume that
X is not a.s. constant (that case is boring). By the lemma, we can find some
a, b ∈ R such that
c(X) ≥ aX + b,
with am + b = c(m).
We want to take the expectation of the LHS, but we have to make sure that
E[c(X)] is a sensible thing to talk about. To make sure it makes sense, we show
that E[c(X)⁻] = E[(−c(X)) ∨ 0] is finite.
We simply bound

[c(X)]⁻ = [−c(X)] ∨ 0 ≤ |a||X| + |b|.

So we have
E[c(X)⁻] ≤ |a|E|X| + |b| < ∞
since X is integrable. So E[c(X)] makes sense.
We then just take

E[c(X)] ≥ E[aX + b] = aE[X] + b = am + b = c(m) = c(E[X]).

So done.
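
The remark that Jensen’s inequality needs total mass 1 can be checked on a toy example. In the Python sketch below (our own illustration, with c(x) = x² and made-up measures), the inequality holds for a probability distribution but fails for a one-point measure of total mass 2.

```python
from fractions import Fraction

def integral(measure, g):
    """Integral of g against a measure on a finite subset of R."""
    return sum(g(x) * m for x, m in measure.items())

c = lambda x: x * x                      # a convex function
identity = lambda x: x

# A probability distribution: Jensen says E[c(X)] - c(E[X]) >= 0.
law = {-1: Fraction(1, 4), 0: Fraction(1, 4), 3: Fraction(1, 2)}
jensen_gap = integral(law, c) - c(integral(law, identity))

# A measure of total mass 2 concentrated at the point 1: the analogue fails.
heavy = {1: Fraction(2)}
heavy_gap = integral(heavy, c) - c(integral(heavy, identity))
```

For the probability distribution the gap is 19/4 − 25/16 = 51/16, which is just the variance of X; for the measure of mass 2 it is 2 − 4 = −2.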


We are now going to use Jensen’s inequality to prove Hölder’s inequality.
Before that, we take note of the following definition:
Definition (Conjugate). Let p, q ∈ [1, ∞]. We say that they are conjugate if
$$\frac{1}{p} + \frac{1}{q} = 1,$$
where we take 1/∞ = 0.
Proposition (Hölder’s inequality). Let p, q ∈ (1, ∞) be conjugate. Then for
f, g measurable, we have

µ(|fg|) = ‖fg‖1 ≤ ‖f‖p ‖g‖q.

When p = q = 2, then this is the Cauchy-Schwarz inequality.


We will provide two different proofs.
Proof. We assume that ‖f‖p > 0 and ‖f‖p < ∞. Otherwise, there is nothing to
prove. By scaling, we may assume that ‖f‖p = 1. We make up a probability
measure by
$$\mathbb{P}[A] = \int |f|^p 1_A \,\mathrm{d}\mu.$$

Since we know
$$\|f\|_p = \left(\int |f|^p \,\mathrm{d}\mu\right)^{1/p} = 1,$$

we know P[ · ] is a probability measure. Then we have

$$\begin{aligned}
\mu(|fg|) &= \mu(|fg| 1_{\{|f| > 0\}})\\
&= \mu\left(\frac{|g|}{|f|^{p-1}} 1_{\{|f| > 0\}} |f|^p\right)\\
&= \mathbb{E}\left[\frac{|g|}{|f|^{p-1}} 1_{\{|f| > 0\}}\right].
\end{aligned}$$

Now use the fact that (E|X|)^q ≤ E[|X|^q] since x ↦ x^q is convex for q > 1. Then
we obtain
$$\mu(|fg|) \le \mathbb{E}\left[\frac{|g|^q}{|f|^{(p-1)q}} 1_{\{|f| > 0\}}\right]^{1/q}.$$

The key realization now is that 1/q + 1/p = 1 means that q(p − 1) = p. So this
becomes
$$\mathbb{E}\left[\frac{|g|^q}{|f|^p} 1_{\{|f| > 0\}}\right]^{1/q} = \mu(|g|^q 1_{\{|f| > 0\}})^{1/q} \le \|g\|_q.$$
Using the fact that ‖f‖p = 1, we obtain the desired result.
Alternative proof. We wlog 0 < ‖f‖p, ‖g‖q < ∞, or else there is nothing to
prove. By scaling, we wlog ‖f‖p = ‖g‖q = 1. Then we have to show that
$$\int |f||g| \,\mathrm{d}\mu \le 1.$$

To do so, we notice if 1/p + 1/q = 1, then the concavity of log tells us for any a, b > 0,
we have
$$\frac{1}{p} \log a + \frac{1}{q} \log b \le \log\left(\frac{a}{p} + \frac{b}{q}\right).$$
Replacing a with a^p and b with b^q, and then taking exponentials, tells us
$$ab \le \frac{a^p}{p} + \frac{b^q}{q}.$$
While we assumed a, b > 0 when deriving, we observe that it is also valid when
some of them are zero. So we have
$$\int |f||g| \,\mathrm{d}\mu \le \int \left(\frac{|f|^p}{p} + \frac{|g|^q}{q}\right) \mathrm{d}\mu = \frac{1}{p} + \frac{1}{q} = 1.$$

Just like Jensen’s inequality, this is very useful when bounding integrals, and
it is also theoretically very important, because we are going to use it to prove
the Minkowski inequality. This tells us that the Lp norm is actually a norm.
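As a sanity check on Hölder's inequality, here is a minimal numerical sketch on a finite measure space, where the measure is a list of positive point masses. The helper `norm`, the random data and the choice $p = 3$ are assumptions made purely for illustration; this checks an instance, it proves nothing.

```python
import random

def norm(f, weights, p):
    """L^p norm of f on a finite measure space with point masses `weights`."""
    return sum(w * abs(x) ** p for x, w in zip(f, weights)) ** (1.0 / p)

random.seed(0)
n = 50
weights = [random.uniform(0.1, 2.0) for _ in range(n)]   # mu({i}) > 0
f = [random.gauss(0, 1) for _ in range(n)]
g = [random.gauss(0, 1) for _ in range(n)]

p = 3.0
q = p / (p - 1.0)                                         # conjugate exponent: 1/p + 1/q = 1

lhs = sum(w * abs(a * b) for a, b, w in zip(f, g, weights))   # mu(|fg|)
rhs = norm(f, weights, p) * norm(g, weights, q)               # ||f||_p ||g||_q
holder_ok = lhs <= rhs + 1e-12

# Pointwise inequality ab <= a^p/p + b^q/q used in the alternative proof.
young_ok = all(
    abs(a) * abs(b) <= abs(a) ** p / p + abs(b) ** q / q + 1e-12
    for a, b in zip(f, g)
)
print(holder_ok, young_ok)
```

Changing `p` to any other value in $(1, \infty)$ should leave both checks true.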
Before we prove the Minkowski inequality, we prove the following tiny lemma
that we will use repeatedly:
Lemma. Let $a, b \geq 0$ and $p \geq 1$. Then
\[ (a + b)^p \leq 2^p (a^p + b^p). \]
This is a terrible bound, but is useful when we want to prove that things are finite.
Proof. We wlog $a \leq b$. Then
\[ (a + b)^p \leq (2b)^p = 2^p b^p \leq 2^p (a^p + b^p). \]

Theorem (Minkowski inequality). Let $p \in [1, \infty]$ and $f, g$ measurable. Then
\[ \|f + g\|_p \leq \|f\|_p + \|g\|_p. \]
Again the proof is magic.

Proof. We do the boring cases first. If $p = 1$, then
\[ \|f + g\|_1 = \int |f + g| \leq \int (|f| + |g|) = \int |f| + \int |g| = \|f\|_1 + \|g\|_1. \]
The proof of the case of $p = \infty$ is similar.

Now note that if $\|f + g\|_p = 0$, then the result is trivial. On the other hand, if $\|f + g\|_p = \infty$, then since we have
\[ |f + g|^p \leq (|f| + |g|)^p \leq 2^p (|f|^p + |g|^p), \]
we know the right hand side $\|f\|_p + \|g\|_p$ is infinite as well. So this case is also done.


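Let’s now do the interesting case, where $1 < p < \infty$ and $0 < \|f + g\|_p < \infty$. Let $q$ be conjugate to $p$. We compute
\[
\begin{aligned}
\mu(|f + g|^p) &= \mu(|f + g||f + g|^{p-1}) \\
&\leq \mu(|f||f + g|^{p-1}) + \mu(|g||f + g|^{p-1}) \\
&\leq \|f\|_p \||f + g|^{p-1}\|_q + \|g\|_p \||f + g|^{p-1}\|_q \\
&= (\|f\|_p + \|g\|_p) \||f + g|^{p-1}\|_q \\
&= (\|f\|_p + \|g\|_p) \mu(|f + g|^{(p-1)q})^{1 - 1/p} \\
&= (\|f\|_p + \|g\|_p) \mu(|f + g|^p)^{1 - 1/p},
\end{aligned}
\]
where the third line is Hölder’s inequality. So we know
\[ \mu(|f + g|^p) \leq (\|f\|_p + \|g\|_p) \mu(|f + g|^p)^{1 - 1/p}. \]
Then dividing both sides by $\mu(|f + g|^p)^{1 - 1/p}$ tells us
\[ \mu(|f + g|^p)^{1/p} = \|f + g\|_p \leq \|f\|_p + \|g\|_p. \]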
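The following sketch checks the Minkowski inequality numerically for a few exponents on a finite measure space, and also shows why the restriction $p \geq 1$ matters: for $p = 1/2$ the "triangle inequality" already fails for two disjoint unit bumps. The helper `norm` and the test data are illustrative assumptions.

```python
import random

def norm(f, weights, p):
    """(sum_i w_i |f_i|^p)^(1/p); a genuine norm only when p >= 1."""
    return sum(w * abs(x) ** p for x, w in zip(f, weights)) ** (1.0 / p)

random.seed(1)
n = 40
weights = [random.uniform(0.1, 2.0) for _ in range(n)]
f = [random.gauss(0, 1) for _ in range(n)]
g = [random.gauss(0, 1) for _ in range(n)]
h = [a + b for a, b in zip(f, g)]

# (||f + g||_p, ||f||_p + ||g||_p) for several p >= 1
results = {}
for p in (1.0, 1.5, 2.0, 4.0):
    results[p] = (norm(h, weights, p), norm(f, weights, p) + norm(g, weights, p))
minkowski_ok = all(lhs <= rhs + 1e-12 for lhs, rhs in results.values())

# Counting measure on two points, p = 1/2: ||(1,1)|| = 4 but ||(1,0)|| + ||(0,1)|| = 2.
p = 0.5
fails_below_one = (
    norm([1, 1], [1, 1], p) > norm([1, 0], [1, 1], p) + norm([0, 1], [1, 1], p)
)
print(minkowski_ok, fails_below_one)
```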

Given these inequalities, we can go and prove some properties of Lp spaces.

4.2 Lp spaces
Recall the following definition:

Definition (Norm of vector space). Let $V$ be a vector space. A norm on $V$ is a function $\|\cdot\| : V \to \mathbb{R}_{\geq 0}$ such that
(i) $\|u + v\| \leq \|u\| + \|v\|$ for all $u, v \in V$.
(ii) $\|\alpha v\| = |\alpha| \|v\|$ for all $v \in V$ and $\alpha \in \mathbb{R}$.
(iii) $\|v\| = 0$ implies $v = 0$.


Definition ($\mathcal{L}^p$ spaces). Let $(E, \mathcal{E}, \mu)$ be a measure space. For $1 \leq p < \infty$, we define $\mathcal{L}^p = \mathcal{L}^p(E, \mathcal{E}, \mu)$ to be the set of all measurable functions $f$ such that
\[ \|f\|_p = \left( \int |f|^p \, d\mu \right)^{1/p} < \infty. \]
For $p = \infty$, we let $\mathcal{L}^\infty = \mathcal{L}^\infty(E, \mathcal{E}, \mu)$ be the space of measurable functions with
\[ \|f\|_\infty = \inf\{\lambda \geq 0 : |f| \leq \lambda \text{ a.e.}\} < \infty. \]
By Minkowski’s inequality, we know $\mathcal{L}^p$ is a vector space, and also (i) holds. By definition, (ii) holds obviously. However, (iii) does not hold for $\|\cdot\|_p$, because $\|f\|_p = 0$ does not imply that $f = 0$. It merely implies that $f = 0$ a.e.
To fix this, we define an equivalence relation as follows: for $f, g \in \mathcal{L}^p$, we say that $f \sim g$ iff $f - g = 0$ a.e. For any $f \in \mathcal{L}^p$, we let $[f]$ denote its equivalence class under this relation. In other words,
\[ [f] = \{g \in \mathcal{L}^p : f - g = 0 \text{ a.e.}\}. \]


Definition ($L^p$ space). We define
\[ L^p = \{[f] : f \in \mathcal{L}^p\}, \]
where
\[ [f] = \{g \in \mathcal{L}^p : f - g = 0 \text{ a.e.}\}. \]
This is a normed vector space under the $\|\cdot\|_p$ norm.
One important property of Lp is that it is complete, i.e. every Cauchy
sequence converges.
Definition (Complete vector space/Banach spaces). A normed vector space $(V, \|\cdot\|)$ is complete if every Cauchy sequence converges. In other words, if $(v_n)$ is a sequence in $V$ such that $\|v_n - v_m\| \to 0$ as $n, m \to \infty$, then there is some $v \in V$ such that $\|v_n - v\| \to 0$ as $n \to \infty$. A complete normed vector space is known as a Banach space.
Theorem. Let $1 \leq p \leq \infty$. Then $L^p$ is a Banach space. In other words, if $(f_n)$ is a sequence in $L^p$, with the property that $\|f_n - f_m\|_p \to 0$ as $n, m \to \infty$, then there is some $f \in L^p$ such that $\|f_n - f\|_p \to 0$ as $n \to \infty$.
Proof. We will only give the proof for $p < \infty$. The $p = \infty$ case is left as an exercise for the reader.
Suppose that $(f_n)$ is a sequence in $L^p$ with $\|f_n - f_m\|_p \to 0$ as $n, m \to \infty$. Take a subsequence $(f_{n_k})$ of $(f_n)$ with
\[ \|f_{n_{k+1}} - f_{n_k}\|_p \leq 2^{-k} \]
for all $k \in \mathbb{N}$. We then find that
\[ \left\| \sum_{k=1}^M |f_{n_{k+1}} - f_{n_k}| \right\|_p \leq \sum_{k=1}^M \|f_{n_{k+1}} - f_{n_k}\|_p \leq 1. \]
We know that
\[ \sum_{k=1}^M |f_{n_{k+1}} - f_{n_k}| \nearrow \sum_{k=1}^\infty |f_{n_{k+1}} - f_{n_k}| \quad \text{as } M \to \infty. \]
So applying the monotone convergence theorem, we know that
\[ \left\| \sum_{k=1}^\infty |f_{n_{k+1}} - f_{n_k}| \right\|_p \leq \sum_{k=1}^\infty \|f_{n_{k+1}} - f_{n_k}\|_p \leq 1. \]
In particular,
\[ \sum_{k=1}^\infty |f_{n_{k+1}} - f_{n_k}| < \infty \quad \text{a.e.} \]
So $f_{n_k}(x)$ converges a.e., since the real line is complete. So we set
\[ f(x) = \begin{cases} \lim_{k \to \infty} f_{n_k}(x) & \text{if the limit exists} \\ 0 & \text{otherwise.} \end{cases} \]


By an exercise on the first example sheet, this function is indeed measurable. Then we have
\[
\begin{aligned}
\|f_n - f\|_p^p &= \mu(|f_n - f|^p) \\
&= \mu\left( \liminf_{k \to \infty} |f_n - f_{n_k}|^p \right) \\
&\leq \liminf_{k \to \infty} \mu(|f_n - f_{n_k}|^p)
\end{aligned}
\]
by Fatou’s lemma, and this tends to 0 as $n \to \infty$ since the sequence is Cauchy. So $f$ is indeed the limit.
Finally, we have to check that $f \in L^p$. We have
\[
\begin{aligned}
\mu(|f|^p) &= \mu(|f - f_n + f_n|^p) \\
&\leq \mu((|f - f_n| + |f_n|)^p) \\
&\leq \mu(2^p (|f - f_n|^p + |f_n|^p)) \\
&= 2^p (\mu(|f - f_n|^p) + \mu(|f_n|^p)).
\end{aligned}
\]
We know the first term tends to 0, and in particular is finite for $n$ large enough, and the second term is also finite. So done.

4.3 Orthogonal projection in L2


In the particular case $p = 2$, we have an extra structure on $L^2$, namely an inner product structure, given by
\[ \langle f, g \rangle = \int fg \, d\mu. \]
This inner product induces the $L^2$ norm by
\[ \|f\|_2^2 = \langle f, f \rangle. \]
Recall the following definition:

Definition (Hilbert space). A Hilbert space is a vector space with a complete inner product.
So $L^2$ is not only a Banach space, but a Hilbert space as well.
Somehow Hilbert spaces are much nicer than Banach spaces, because you have an inner product structure as well. One particular thing we can do is orthogonal complements.
Definition (Orthogonal functions). Two functions $f, g \in L^2$ are orthogonal if
\[ \langle f, g \rangle = 0. \]
Definition (Orthogonal complement). Let $V \subseteq L^2$. We then set
\[ V^\perp = \{f \in L^2 : \langle f, v \rangle = 0 \text{ for all } v \in V\}. \]


Note that we can always make these definitions for any inner product space. However, the completeness of the space guarantees nice properties of the orthogonal complement.
Before we proceed further, we need to make a definition of what it means for a subspace of $\mathcal{L}^2$ to be closed. This isn’t the usual definition, since $\mathcal{L}^2$ isn’t really a normed vector space, so we need to accommodate for that fact.
Definition (Closed subspace). Let $V \subseteq \mathcal{L}^2$. Then $V$ is closed if whenever $(f_n)$ is a sequence in $V$ with $f_n \to f$ in $\|\cdot\|_2$, there exists $v \in V$ with $v \sim f$.
The main thing that makes $L^2$ nice is that we can use closed subspaces to decompose functions orthogonally.
Theorem. Let $V$ be a closed subspace of $\mathcal{L}^2$. Then each $f \in \mathcal{L}^2$ has an orthogonal decomposition
\[ f = u + v, \]
where $v \in V$ and $u \in V^\perp$. Moreover,
\[ \|f - v\|_2 \leq \|f - g\|_2 \]
for all $g \in V$, with equality iff $g \sim v$.

To prove this result, we need two simple identities, which can be easily proven by writing out the expression.
Lemma (Pythagoras identity).
\[ \|f + g\|^2 = \|f\|^2 + \|g\|^2 + 2\langle f, g \rangle. \]
Lemma (Parallelogram law).
\[ \|f + g\|^2 + \|f - g\|^2 = 2(\|f\|^2 + \|g\|^2). \]
To prove the existence of orthogonal decomposition, we need to use a slight trick involving the parallelogram law.
Proof of orthogonal decomposition. Given $f \in \mathcal{L}^2$, we take a sequence $(g_n)$ in $V$ such that
\[ \|f - g_n\|_2 \to d(f, V) = \inf_{g \in V} \|f - g\|_2. \]

We now want to show that the infimum is attained. To do so, we show that $(g_n)$ is a Cauchy sequence, and by the completeness of $L^2$, it will have a limit.
If we apply the parallelogram law with $u = f - g_n$ and $v = f - g_m$, then we know
\[ \|u + v\|_2^2 + \|u - v\|_2^2 = 2(\|u\|_2^2 + \|v\|_2^2). \]
Using our particular choice of $u$ and $v$, we obtain
\[ \left\| 2\left( f - \frac{g_n + g_m}{2} \right) \right\|_2^2 + \|g_n - g_m\|_2^2 = 2(\|f - g_n\|_2^2 + \|f - g_m\|_2^2). \]
So we have
\[ \|g_n - g_m\|_2^2 = 2(\|f - g_n\|_2^2 + \|f - g_m\|_2^2) - 4 \left\| f - \frac{g_n + g_m}{2} \right\|_2^2. \]


The first two terms on the right hand side each tend to $d(f, V)^2$, so together with the factor of 2 they contribute $4 d(f, V)^2$ in the limit. Since $\frac{g_n + g_m}{2} \in V$, the last term is bounded below in magnitude by $4 d(f, V)^2$. So as $n, m \to \infty$, we must have $\|g_n - g_m\|_2 \to 0$. By completeness of $L^2$, there exists a $g \in \mathcal{L}^2$ such that $g_n \to g$.
Now since $V$ is assumed to be closed, we can find a $v \in V$ such that $g = v$ a.e. Then we know
\[ \|f - v\|_2 = \lim_{n \to \infty} \|f - g_n\|_2 = d(f, V). \]
So $v$ attains the infimum. To show that this gives us an orthogonal decomposition, we want to show that
\[ u = f - v \in V^\perp. \]
Suppose $h \in V$. We need to show that $\langle u, h \rangle = 0$. We need to do another funny trick. Suppose $t \in \mathbb{R}$. Then we have
\[
\begin{aligned}
d(f, V)^2 &\leq \|f - (v + th)\|_2^2 \\
&= \|f - v\|_2^2 + t^2 \|h\|_2^2 - 2t \langle f - v, h \rangle.
\end{aligned}
\]
We think of this as a quadratic in $t$, which is minimized when
\[ t = \frac{\langle f - v, h \rangle}{\|h\|_2^2}. \]
But we know this quadratic is minimized when $t = 0$, since $\|f - v\|_2 = d(f, V)$. So $\langle f - v, h \rangle = 0$.

We are now going to look at the relationship between conditional expectation and orthogonal projection.
Definition (Conditional expectation). Suppose we have a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and $(G_n)$ is a collection of pairwise disjoint events with $\bigcup_n G_n = \Omega$. We let
\[ \mathcal{G} = \sigma(G_n : n \in \mathbb{N}). \]
The conditional expectation of $X$ given $\mathcal{G}$ is the random variable
\[ Y = \sum_{n=1}^\infty \mathbb{E}[X \mid G_n] 1_{G_n}, \]
where
\[ \mathbb{E}[X \mid G_n] = \frac{\mathbb{E}[X 1_{G_n}]}{\mathbb{P}[G_n]} \quad \text{for } \mathbb{P}[G_n] > 0. \]
In other words, given any $x \in \Omega$, say $x \in G_n$, then $Y(x) = \mathbb{E}[X \mid G_n]$.
If $X \in L^2(\mathbb{P})$, then $Y \in L^2(\mathbb{P})$, and it is clear that $Y$ is $\mathcal{G}$-measurable. We claim that this is in fact a projection:
Proposition. The conditional expectation of $X$ given $\mathcal{G}$ is the projection of $X$ onto the subspace $L^2(\mathcal{G}, \mathbb{P})$ of $\mathcal{G}$-measurable $L^2$ random variables in the ambient space $L^2(\mathbb{P})$.
In some sense, this tells us $Y$ is our best prediction of $X$ given only the information encoded in $\mathcal{G}$.


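Proof. Let $Y$ be the conditional expectation. It suffices to show that $\mathbb{E}[(X - W)^2]$ is minimized for $W = Y$ among $\mathcal{G}$-measurable random variables. Suppose that $W$ is a $\mathcal{G}$-measurable random variable. Since
\[ \mathcal{G} = \sigma(G_n : n \in \mathbb{N}), \]
it follows that
\[ W = \sum_{n=1}^\infty a_n 1_{G_n}, \]
where $a_n \in \mathbb{R}$. Then
\[
\begin{aligned}
\mathbb{E}[(X - W)^2] &= \mathbb{E}\left[ \left( \sum_{n=1}^\infty (X - a_n) 1_{G_n} \right)^2 \right] \\
&= \mathbb{E}\left[ \sum_n (X^2 + a_n^2 - 2a_n X) 1_{G_n} \right] \\
&= \mathbb{E}\left[ \sum_n (X^2 + a_n^2 - 2a_n \mathbb{E}[X \mid G_n]) 1_{G_n} \right].
\end{aligned}
\]
We now optimize the quadratic
\[ X^2 + a_n^2 - 2a_n \mathbb{E}[X \mid G_n] \]
over $a_n$. We see that this is minimized for
\[ a_n = \mathbb{E}[X \mid G_n]. \]
Note that this does not depend on what $X$ is in the quadratic, since it is in the constant term.
Therefore we know that $\mathbb{E}[(X - W)^2]$ is minimized for $W = Y$.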
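To see the projection property concretely, here is a small sketch on a six-point probability space with exact rational arithmetic. The sample space, the values of $X$ and the partition are made up for illustration. It computes $Y$, checks that $X - Y$ is orthogonal to each $1_{G_n}$, and compares the mean squared error of $Y$ against randomly chosen $\mathcal{G}$-measurable competitors $W$.

```python
from fractions import Fraction
import random

prob = [Fraction(1, 6)] * 6                 # uniform P on Omega = {0, ..., 5}
X = [3, -1, 4, 1, -5, 9]                    # an arbitrary random variable
partition = [[0, 1], [2, 3, 4], [5]]        # the events G_n generating the sigma-algebra

def expect(Z):
    return sum(p * z for p, z in zip(prob, Z))

# Y = sum_n E[X | G_n] 1_{G_n}, with E[X | G_n] = E[X 1_{G_n}] / P[G_n]
Y = [None] * 6
for G in partition:
    pG = sum(prob[w] for w in G)
    cond = sum(prob[w] * X[w] for w in G) / pG
    for w in G:
        Y[w] = cond

def mse(W):
    """E[(X - W)^2]."""
    return expect([(x - w) ** 2 for x, w in zip(X, W)])

# Orthogonality: E[(X - Y) 1_{G_n}] = 0 for every n.
orthogonal = all(sum(prob[w] * (X[w] - Y[w]) for w in G) == 0 for G in partition)

# Y beats every G-measurable W = sum_n a_n 1_{G_n} we try.
random.seed(0)
best = True
for _ in range(1000):
    a = [Fraction(random.randint(-10, 10)) for _ in partition]
    W = [None] * 6
    for a_n, G in zip(a, partition):
        for w in G:
            W[w] = a_n
    best = best and mse(Y) <= mse(W)

print(Y, mse(Y), orthogonal, best)
```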
We can also rephrase variance and covariance in terms of the $L^2$ spaces. Suppose $X, Y \in L^2(\mathbb{P})$ with
\[ m_X = \mathbb{E}[X], \quad m_Y = \mathbb{E}[Y]. \]
Then variance and covariance just correspond to $L^2$ inner product and norm. In fact, we have
\[
\begin{aligned}
\operatorname{var}(X) &= \mathbb{E}[(X - m_X)^2] = \|X - m_X\|_2^2, \\
\operatorname{cov}(X, Y) &= \mathbb{E}[(X - m_X)(Y - m_Y)] = \langle X - m_X, Y - m_Y \rangle.
\end{aligned}
\]
More generally, the covariance matrix of a random vector $X = (X_1, \cdots, X_n)$ is given by
\[ \operatorname{var}(X) = (\operatorname{cov}(X_i, X_j))_{ij}. \]
On the example sheet, we will see that the covariance matrix is a non-negative definite (i.e. positive semi-definite) matrix.
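A covariance matrix need not be strictly positive definite. The sketch below, under the made-up assumption that $X$ is uniform on $\{-1, 0, 1\}$ and the random vector is $(X, 2X)$, computes the covariance matrix exactly and exhibits a non-zero vector $a$ with $a^T \operatorname{var}(X) a = 0$.

```python
from fractions import Fraction

# Uniform law on three outcomes; the random vector is (X, 2X).
outcomes = [-1, 0, 1]
prob = Fraction(1, 3)
vectors = [(x, 2 * x) for x in outcomes]

mean = [sum(prob * v[i] for v in vectors) for i in range(2)]
cov = [
    [sum(prob * (v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) for j in range(2)]
    for i in range(2)
]

def quad(a):
    """a^T cov a, which equals var(a_1 X_1 + a_2 X_2) and so is never negative."""
    return sum(a[i] * cov[i][j] * a[j] for i in range(2) for j in range(2))

print(cov, quad((2, -1)), quad((1, 0)))
```

Here $2 X_1 - X_2 = 0$ identically, which is exactly why the quadratic form vanishes at $a = (2, -1)$.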


4.4 Convergence in L1 (P) and uniform integrability


What we are looking at here is the following question: suppose $(X_n)$, $X$ are random variables and $X_n \to X$ in probability. Under what extra assumptions is it true that $X_n$ also converges to $X$ in $L^1$, i.e. $\mathbb{E}[|X_n - X|] \to 0$ as $n \to \infty$?
This is not always true.
Example. Take $(\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}((0, 1)), \text{Lebesgue})$, and
\[ X_n = n 1_{(0, 1/n)}. \]
Then $X_n \to 0$ in probability, and in fact $X_n \to 0$ almost surely. However,
\[ \mathbb{E}[|X_n - 0|] = \mathbb{E}[X_n] = n \cdot \frac{1}{n} = 1, \]
which does not converge to 0.
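The two quantities in this example can be tabulated exactly. This is a small sketch; the function names and the choice $\varepsilon = 1/2$ are illustrative.

```python
from fractions import Fraction

def prob_exceeds(n, eps):
    """P[X_n > eps] for X_n = n * 1_{(0, 1/n)} under Lebesgue measure on (0, 1)."""
    return Fraction(1, n) if eps < n else Fraction(0)

def l1_distance_to_zero(n):
    """E|X_n - 0| = n * Leb((0, 1/n))."""
    return n * Fraction(1, n)

eps = Fraction(1, 2)
table = [(n, prob_exceeds(n, eps), l1_distance_to_zero(n)) for n in (1, 10, 100, 1000)]
for n, p, e in table:
    print(n, p, e)
```

The middle column shrinks like $1/n$ (convergence in probability), while the last column stays at 1 (no convergence in $L^1$).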
We see that the problem with this sequence is that there is a lot of “stuff” concentrated near 0, and indeed the functions can get unbounded near 0. We can easily curb this problem by requiring our functions to be bounded:
Theorem (Bounded convergence theorem). Suppose $X$, $(X_n)$ are random variables. Assume that there exists a (non-random) constant $C > 0$ such that $|X_n| \leq C$. If $X_n \to X$ in probability, then $X_n \to X$ in $L^1$.
The proof is a rather standard manipulation.
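Proof. We first show that $|X| \leq C$ a.s. Let $\varepsilon > 0$. We then have
\[
\begin{aligned}
\mathbb{P}[|X| > C + \varepsilon] &\leq \mathbb{P}[|X - X_n| + |X_n| > C + \varepsilon] \\
&\leq \mathbb{P}[|X - X_n| > \varepsilon] + \mathbb{P}[|X_n| > C].
\end{aligned}
\]
We know the second term vanishes, while the first term tends to 0 as $n \to \infty$. So we know
\[ \mathbb{P}[|X| > C + \varepsilon] = 0 \]
for all $\varepsilon$. Since $\varepsilon$ was arbitrary, we know $|X| \leq C$ a.s.
Now fix an $\varepsilon > 0$. Then
\[
\begin{aligned}
\mathbb{E}[|X_n - X|] &= \mathbb{E}\left[ |X_n - X| (1_{\{|X_n - X| \leq \varepsilon\}} + 1_{\{|X_n - X| > \varepsilon\}}) \right] \\
&\leq \varepsilon + 2C \, \mathbb{P}[|X_n - X| > \varepsilon].
\end{aligned}
\]
Since $X_n \to X$ in probability, for $n$ sufficiently large, the second term is $\leq \varepsilon$. So $\mathbb{E}[|X_n - X|] \leq 2\varepsilon$, and we have convergence in $L^1$.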
But we can do better than that. We don’t need the functions to be actually
bounded. We just need that the functions aren’t concentrated in arbitrarily
small subsets of Ω. Thus, we make the following definition:
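Definition (Uniformly integrable). Let $\mathcal{X}$ be a family of random variables. Define
\[ I_{\mathcal{X}}(\delta) = \sup\{\mathbb{E}[|X| 1_A] : X \in \mathcal{X}, A \in \mathcal{F} \text{ with } \mathbb{P}[A] < \delta\}. \]
Then we say $\mathcal{X}$ is uniformly integrable if $\mathcal{X}$ is $L^1$-bounded (see below), and $I_{\mathcal{X}}(\delta) \to 0$ as $\delta \to 0$.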


Definition ($L^p$-bounded). Let $\mathcal{X}$ be a family of random variables. Then we say $\mathcal{X}$ is $L^p$-bounded if
\[ \sup\{\|X\|_p : X \in \mathcal{X}\} < \infty. \]
In some sense, uniform integrability is “uniform continuity for integration”. It is immediate that
Proposition. Finite unions of uniformly integrable families are uniformly integrable.
How can we find uniformly integrable families? The following proposition gives us a large class of such families.
Proposition. Let $\mathcal{X}$ be an $L^p$-bounded family for some $p > 1$. Then $\mathcal{X}$ is uniformly integrable.
Proof. We let
\[ C = \sup\{\|X\|_p : X \in \mathcal{X}\} < \infty. \]
Suppose that $X \in \mathcal{X}$ and $A \in \mathcal{F}$. We then have
\[ \mathbb{E}[|X| 1_A] \leq \mathbb{E}[|X|^p]^{1/p} \, \mathbb{P}[A]^{1/q} \leq C \, \mathbb{P}[A]^{1/q} \]
by Hölder’s inequality, where $p, q$ are conjugates. This is now a uniform bound depending only on $\mathbb{P}[A]$. Taking $A = \Omega$ also shows that $\mathcal{X}$ is $L^1$-bounded. So done.
This is the best we can get: $L^1$-boundedness is not enough. Indeed, our earlier example
\[ X_n = n 1_{(0, 1/n)} \]
is $L^1$-bounded but not uniformly integrable.
For many practical purposes, it is convenient to rephrase the definition of
uniform integrability as follows:
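Lemma. Let $\mathcal{X}$ be a family of random variables. Then $\mathcal{X}$ is uniformly integrable if and only if
\[ \sup\{\mathbb{E}[|X| 1_{\{|X| > k\}}] : X \in \mathcal{X}\} \to 0 \]
as $k \to \infty$.
Proof.
(⇒) Suppose that $\mathcal{X}$ is uniformly integrable. For any $k$ and $X \in \mathcal{X}$, by Chebyshev’s inequality, we have
\[ \mathbb{P}[|X| \geq k] \leq \frac{\mathbb{E}[|X|]}{k}. \]
Given $\varepsilon > 0$, we pick $\delta$ such that $\mathbb{E}[|X| 1_A] < \varepsilon$ for all $X \in \mathcal{X}$ and all $A$ with $\mathbb{P}[A] < \delta$. Then pick $k$ sufficiently large such that $k\delta > \sup\{\mathbb{E}[|X|] : X \in \mathcal{X}\}$, which is possible since $\mathcal{X}$ is $L^1$-bounded. Then $\mathbb{P}[|X| \geq k] < \delta$, and hence $\mathbb{E}[|X| 1_{\{|X| > k\}}] < \varepsilon$ for all $X \in \mathcal{X}$.
(⇐) Suppose that the condition in the lemma holds. We first show that $\mathcal{X}$ is $L^1$-bounded. We have
\[ \mathbb{E}[|X|] = \mathbb{E}[|X| (1_{\{|X| \leq k\}} + 1_{\{|X| > k\}})] \leq k + \mathbb{E}[|X| 1_{\{|X| > k\}}] < \infty \]
by picking a large enough $k$.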


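Next note that for any measurable $A$ and $X \in \mathcal{X}$, we have
\[ \mathbb{E}[|X| 1_A] = \mathbb{E}[|X| 1_A (1_{\{|X| > k\}} + 1_{\{|X| \leq k\}})] \leq \mathbb{E}[|X| 1_{\{|X| > k\}}] + k \mathbb{P}[A]. \]
Thus, for any $\varepsilon > 0$, we can pick $k$ sufficiently large such that the first term is $< \frac{\varepsilon}{2}$ for all $X \in \mathcal{X}$ by assumption. Then when $\mathbb{P}[A] < \frac{\varepsilon}{2k}$, we have $\mathbb{E}[|X| 1_A] \leq \varepsilon$.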
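The tail criterion makes it easy to compare two families on $((0, 1), \text{Lebesgue})$. Below is a sketch under illustrative assumptions: family A is $X_n = n 1_{(0,1/n)}$ (only $L^1$-bounded) and family B is $X_n = \sqrt{n} 1_{(0,1/n)}$ (which has $\|X_n\|_2 = 1$, so is $L^2$-bounded). The supremum over all $n$ is replaced by a maximum over $n \leq N$, which is a finite stand-in and not the true supremum in general.

```python
import math

def tail_A(n, k):
    """E[X_n 1_{X_n > k}] for X_n = n 1_{(0,1/n)}: the whole mass 1 sits above k once n > k."""
    return 1.0 if n > k else 0.0

def tail_B(n, k):
    """E[X_n 1_{X_n > k}] for X_n = sqrt(n) 1_{(0,1/n)}: mass n^(-1/2), counted when sqrt(n) > k."""
    return 1.0 / math.sqrt(n) if math.sqrt(n) > k else 0.0

N = 10 ** 5          # finite stand-in for "all n" (assumption of this sketch)
ks = (1, 10, 100)

sup_A = [max(tail_A(n, k) for n in range(1, N + 1)) for k in ks]
sup_B = [max(tail_B(n, k) for n in range(1, N + 1)) for k in ks]

for k, a, b in zip(ks, sup_A, sup_B):
    print(k, a, b)
```

For family A the tail supremum stays at 1 for every $k$, whereas for family B it decays roughly like $1/k$, in line with the proposition on $L^p$-bounded families.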
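As a corollary, we find that

Corollary. Let $\mathcal{X} = \{X\}$, where $X \in L^1(\mathbb{P})$. Then $\mathcal{X}$ is uniformly integrable. Hence, a finite collection of $L^1$ functions is uniformly integrable.
Proof. Note that
\[ \mathbb{E}[|X|] = \sum_{k=0}^\infty \mathbb{E}[|X| 1_{\{|X| \in [k, k+1)\}}]. \]
Since the sum is finite, we must have
\[ \mathbb{E}[|X| 1_{\{|X| \geq K\}}] = \sum_{k=K}^\infty \mathbb{E}[|X| 1_{\{|X| \in [k, k+1)\}}] \to 0 \quad \text{as } K \to \infty. \]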

With all that preparation, we now come to the main theorem on uniform
integrability.

Theorem. Let $X$, $(X_n)$ be random variables. Then the following are equivalent:
(i) $X_n, X \in L^1$ for all $n$ and $X_n \to X$ in $L^1$.
(ii) $\{X_n\}$ is uniformly integrable and $X_n \to X$ in probability.
The (i) ⇒ (ii) direction is just a standard manipulation. The idea of the (ii) ⇒ (i) direction is that we use uniform integrability to cut off $X_n$ and $X$ at some large value $K$, which gives us a small error, then apply bounded convergence.
Proof. We first assume that $X_n, X$ are in $L^1$ and $X_n \to X$ in $L^1$. We want to show that $\{X_n\}$ is uniformly integrable and $X_n \to X$ in probability.
We first show that $X_n \to X$ in probability. This is just going to come from the Chebyshev inequality. Let $\varepsilon > 0$. Then we have
\[ \mathbb{P}[|X - X_n| > \varepsilon] \leq \frac{\mathbb{E}[|X - X_n|]}{\varepsilon} \to 0 \]
as $n \to \infty$.
Next we show that $\{X_n\}$ is uniformly integrable. Fix $\varepsilon > 0$. Take $N$ such that $n \geq N$ implies $\mathbb{E}[|X - X_n|] \leq \frac{\varepsilon}{2}$. Since finite families of $L^1$ random variables are uniformly integrable, we can pick $\delta > 0$ such that $A \in \mathcal{F}$ and $\mathbb{P}[A] < \delta$ implies
\[ \mathbb{E}[|X| 1_A], \ \mathbb{E}[|X_n| 1_A] \leq \frac{\varepsilon}{2} \]
for $n = 1, \cdots, N$.


Now when $n > N$ and $A \in \mathcal{F}$ with $\mathbb{P}[A] < \delta$, then we have
\[
\begin{aligned}
\mathbb{E}[|X_n| 1_A] &\leq \mathbb{E}[|X - X_n| 1_A] + \mathbb{E}[|X| 1_A] \\
&\leq \mathbb{E}[|X - X_n|] + \frac{\varepsilon}{2} \\
&\leq \frac{\varepsilon}{2} + \frac{\varepsilon}{2} \\
&= \varepsilon.
\end{aligned}
\]
So $\{X_n\}$ is uniformly integrable.

Conversely, assume that $\{X_n\}$ is uniformly integrable and $X_n \to X$ in probability.
The first step is to show that $X \in L^1$. We want to use Fatou’s lemma, but to do so, we want almost sure convergence, not just convergence in probability. Recall that we have previously shown that there is a subsequence $(X_{n_k})$ of $(X_n)$ such that $X_{n_k} \to X$ a.s. Then we have
\[ \mathbb{E}[|X|] = \mathbb{E}\left[ \liminf_{k \to \infty} |X_{n_k}| \right] \leq \liminf_{k \to \infty} \mathbb{E}[|X_{n_k}|] < \infty \]
since uniformly integrable families are $L^1$-bounded. So $\mathbb{E}[|X|] < \infty$, hence $X \in L^1$.
Next we want to show that $X_n \to X$ in $L^1$. Take $\varepsilon > 0$. Since $\{X_n\} \cup \{X\}$ is uniformly integrable, there exists $K \in (0, \infty)$ such that
\[ \mathbb{E}\left[ |X| 1_{\{|X| > K\}} \right], \ \mathbb{E}\left[ |X_n| 1_{\{|X_n| > K\}} \right] \leq \frac{\varepsilon}{3} \]
for all $n$. To set things up so that we can use the bounded convergence theorem, we have to invent new random variables
\[ X_n^K = (X_n \vee (-K)) \wedge K, \quad X^K = (X \vee (-K)) \wedge K. \]
Since $X_n \to X$ in probability, it follows that $X_n^K \to X^K$ in probability.
Now bounded convergence tells us that there is some $N$ such that $n \geq N$ implies
\[ \mathbb{E}[|X_n^K - X^K|] \leq \frac{\varepsilon}{3}. \]
Combining, we have for $n \geq N$ that
\[ \mathbb{E}[|X_n - X|] \leq \mathbb{E}[|X_n^K - X^K|] + \mathbb{E}[|X| 1_{\{|X| > K\}}] + \mathbb{E}[|X_n| 1_{\{|X_n| > K\}}] \leq \varepsilon. \]
So we know that $X_n \to X$ in $L^1$.
The main application is when $\{X_n\}$ is a type of stochastic process known as a martingale. This will be done in III Advanced Probability and III Stochastic Calculus.

68
5 Fourier transform II Probability and Measure

5 Fourier transform
We now turn to the exciting topic of the Fourier transform. There are two main questions we want to ask: when does the Fourier transform exist, and when can we recover a function from its Fourier transform?
Of course, not only do we want to know if the Fourier transform exists. We
also want to know if it lies in some nice space, e.g. L2 .
It turns out that when we want to prove things about Fourier transforms,
it is often helpful to “smoothen” the function by doing what is known as a
Gaussian convolution. So after defining the Fourier transform and proving some
really basic properties, we are going to investigate convolutions and Gaussians
for a bit (convolutions are also useful on their own, since they correspond to
sums of independent random variables). After that, we can go and prove the
actual important properties of the Fourier transform.

5.1 The Fourier transform


When talking about Fourier transforms, we will mostly want to talk about functions $\mathbb{R}^d \to \mathbb{C}$. So from now on, we will write $L^p$ for complex valued Borel functions on $\mathbb{R}^d$ with
\[ \|f\|_p = \left( \int_{\mathbb{R}^d} |f|^p \right)^{1/p} < \infty. \]
The integrals of complex-valued functions are defined on the real and imaginary parts separately, and satisfy the properties we would expect them to. The details are on the first example sheet.
Definition (Fourier transform). The Fourier transform $\hat{f} : \mathbb{R}^d \to \mathbb{C}$ of $f \in L^1(\mathbb{R}^d)$ is given by
\[ \hat{f}(u) = \int_{\mathbb{R}^d} f(x) e^{i(u, x)} \, dx, \]
where $u \in \mathbb{R}^d$ and $(u, x)$ denotes the inner product, i.e.
\[ (u, x) = u_1 x_1 + \cdots + u_d x_d. \]
Why do we care about Fourier transforms? Many computations are easier
with fˆ in place of f , especially computations that involve differentiation and
convolutions (which are relevant to sums of independent random variables). In
particular, we will use it to prove the central limit theorem.
More generally, we can define the Fourier transform of a measure:
Definition (Fourier transform of measure). The Fourier transform of a finite measure $\mu$ on $\mathbb{R}^d$ is the function $\hat{\mu} : \mathbb{R}^d \to \mathbb{C}$ given by
\[ \hat{\mu}(u) = \int_{\mathbb{R}^d} e^{i(u, x)} \, \mu(dx). \]
In the context of probability, we give these things a different name:

Definition (Characteristic function). Let $X$ be a random variable. Then the characteristic function of $X$ is the Fourier transform of its law, i.e.
\[ \varphi_X(u) = \mathbb{E}[e^{i(u, X)}] = \hat{\mu}_X(u), \]
where $\mu_X$ is the law of $X$.


We now make the following (trivial) observations:


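Proposition.
\[ \|\hat{f}\|_\infty \leq \|f\|_1, \quad \|\hat{\mu}\|_\infty \leq \mu(\mathbb{R}^d). \]
Less trivially, we have the following result:
Proposition. The functions $\hat{f}$, $\hat{\mu}$ are continuous.
Proof. If $u_n \to u$, then
\[ f(x) e^{i(u_n, x)} \to f(x) e^{i(u, x)}. \]
Also, we know that
\[ |f(x) e^{i(u_n, x)}| = |f(x)|. \]
So we can apply dominated convergence theorem with $|f|$ as the bound. The same argument works for $\hat{\mu}$, with the constant function 1 as the bound, since $\mu$ is finite.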

5.2 Convolutions
To actually do something useful about the Fourier transforms, we need to talk
about convolutions.
Definition (Convolution of random variables). Let $\mu, \nu$ be probability measures. Their convolution $\mu * \nu$ is the law of $X + Y$, where $X$ has law $\mu$ and $Y$ has law $\nu$, and $X, Y$ are independent. Explicitly, we have
\[
\begin{aligned}
\mu * \nu(A) &= \mathbb{P}[X + Y \in A] \\
&= \iint 1_A(x + y) \, \mu(dx) \, \nu(dy).
\end{aligned}
\]
Let’s suppose that $\mu$ has a density function $f$ with respect to the Lebesgue measure. Then we have
\[
\begin{aligned}
\mu * \nu(A) &= \iint 1_A(x + y) f(x) \, dx \, \nu(dy) \\
&= \iint 1_A(x) f(x - y) \, dx \, \nu(dy) \\
&= \int 1_A(x) \left( \int f(x - y) \, \nu(dy) \right) dx.
\end{aligned}
\]
So we know that $\mu * \nu$ has density
\[ \int f(x - y) \, \nu(dy). \]
This thing has a name.

Definition (Convolution of function with measure). Let $f \in L^p$ and $\nu$ a probability measure. Then the convolution of $f$ with $\nu$ is
\[ f * \nu(x) = \int f(x - y) \, \nu(dy) \in L^p. \]


Note that we do have to treat the two cases of convolutions separately, since
a measure need not have a density, and a function need not specify a probability
measure (it may not integrate to 1).
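We check that it is indeed in $L^p$. Since $\nu$ is a probability measure, Jensen’s inequality says we have
\[
\begin{aligned}
\|f * \nu\|_p^p &= \int \left| \int f(x - y) \, \nu(dy) \right|^p dx \\
&\leq \int \left( \int |f(x - y)| \, \nu(dy) \right)^p dx \\
&\leq \iint |f(x - y)|^p \, \nu(dy) \, dx \\
&= \iint |f(x - y)|^p \, dx \, \nu(dy) \\
&= \|f\|_p^p \\
&< \infty.
\end{aligned}
\]
In fact, from this computation, we see that
Proposition. For any $f \in L^p$ and $\nu$ a probability measure, we have
\[ \|f * \nu\|_p \leq \|f\|_p. \]
The interesting thing happens when we try to take the Fourier transform of a convolution.
Proposition.
\[ \widehat{f * \nu}(u) = \hat{f}(u) \hat{\nu}(u). \]
Proof. We have
\[
\begin{aligned}
\widehat{f * \nu}(u) &= \int \left( \int f(x - y) \, \nu(dy) \right) e^{i(u, x)} \, dx \\
&= \iint f(x - y) e^{i(u, x)} \, dx \, \nu(dy) \\
&= \int \left( \int f(x - y) e^{i(u, x - y)} \, d(x - y) \right) e^{i(u, y)} \, \nu(dy) \\
&= \int \left( \int f(x) e^{i(u, x)} \, dx \right) e^{i(u, y)} \, \nu(dy) \\
&= \int \hat{f}(u) e^{i(u, y)} \, \nu(dy) \\
&= \hat{f}(u) \int e^{i(u, y)} \, \nu(dy) \\
&= \hat{f}(u) \hat{\nu}(u).
\end{aligned}
\]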
In the context of random variables, we have a similar result:
Proposition. Let $\mu, \nu$ be probability measures, and $X, Y$ be independent variables with laws $\mu, \nu$ respectively. Then
\[ \widehat{\mu * \nu}(u) = \hat{\mu}(u) \hat{\nu}(u). \]
Proof. We have
\[ \widehat{\mu * \nu}(u) = \mathbb{E}[e^{i(u, X + Y)}] = \mathbb{E}[e^{i(u, X)}] \mathbb{E}[e^{i(u, Y)}] = \hat{\mu}(u) \hat{\nu}(u). \]
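For discrete laws both sides of this identity are finite sums, so it can be checked directly. The sketch below uses a fair die and a fair coin as illustrative choices; `convolve` and `char_fn` are hypothetical helper names.

```python
import cmath
from fractions import Fraction

def convolve(mu, nu):
    """Law of X + Y for independent X ~ mu, Y ~ nu (dicts mapping value -> probability)."""
    out = {}
    for x, p in mu.items():
        for y, q in nu.items():
            out[x + y] = out.get(x + y, 0) + p * q
    return out

def char_fn(mu, u):
    """mu-hat(u) = sum_x e^{iux} mu({x}) for a discrete law on the integers."""
    return sum(float(p) * cmath.exp(1j * u * x) for x, p in mu.items())

die = {k: Fraction(1, 6) for k in range(1, 7)}
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
law_of_sum = convolve(die, coin)

# Largest discrepancy between (mu * nu)-hat and mu-hat times nu-hat over a few test points.
max_err = max(
    abs(char_fn(law_of_sum, u) - char_fn(die, u) * char_fn(coin, u))
    for u in (-2.0, -0.3, 0.0, 0.7, 1.5, 3.1)
)
print(law_of_sum, max_err)
```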


5.3 Fourier inversion formula


We now want to work towards proving the Fourier inversion formula:

Theorem (Fourier inversion formula). Let $f, \hat{f} \in L^1$. Then
\[ f(x) = \frac{1}{(2\pi)^d} \int \hat{f}(u) e^{-i(u, x)} \, du \quad \text{a.e.} \]
Our strategy is as follows:

(i) Show that the Fourier inversion formula holds for a Gaussian distribution by direct computations.
(ii) Show that the formula holds for Gaussian convolutions, i.e. the convolution of an arbitrary function with a Gaussian.
(iii) Show that any function can be approximated by a Gaussian convolution.
Note that the last part makes a lot of sense. If $X$ is a random variable, then convolving with a Gaussian is just adding an independent Gaussian to get $X + \sqrt{t} Z$, and if we take $t \to 0$, we recover the original function. What we have to do is to show that this behaves sufficiently well with the Fourier transform and the Fourier inversion formula that we will actually get the result we want.

Gaussian densities
Before we start, we had better start by defining the Gaussian distribution.
Definition (Gaussian density). The Gaussian density with variance $t$ is
\[ g_t(x) = \left( \frac{1}{2\pi t} \right)^{d/2} e^{-|x|^2 / 2t}. \]
This is equivalently the density of $\sqrt{t} Z$, where $Z = (Z_1, \cdots, Z_d)$ with $Z_i \sim N(0, 1)$ independent.
We now want to compute the Fourier transform directly and show that the Fourier inversion formula works for this.
We start off by working in the case $d = 1$ and $Z \sim N(0, 1)$. We want to compute the Fourier transform of the law of this guy, i.e. its characteristic function. We will use a nice trick.
Proposition. Let $Z \sim N(0, 1)$. Then
\[ \varphi_Z(u) = e^{-u^2/2}. \]
We see that this is in fact a Gaussian up to a factor of $\sqrt{2\pi}$.

Proof. We have
\[
\begin{aligned}
\varphi_Z(u) &= \mathbb{E}[e^{iuZ}] \\
&= \frac{1}{\sqrt{2\pi}} \int e^{iux} e^{-x^2/2} \, dx.
\end{aligned}
\]


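We now notice that the derivative of the integrand with respect to $u$ is bounded in absolute value by $|x| e^{-x^2/2}$, which is integrable, so we can differentiate under the integral sign, and obtain
\[
\begin{aligned}
\varphi_Z'(u) &= \mathbb{E}[iZ e^{iuZ}] \\
&= \frac{1}{\sqrt{2\pi}} \int ix e^{iux} e^{-x^2/2} \, dx \\
&= -u \varphi_Z(u),
\end{aligned}
\]
where the last equality is obtained by integrating by parts. So we know that $\varphi_Z(u)$ solves
\[ \varphi_Z'(u) = -u \varphi_Z(u). \]
This is easy to solve, since we can just integrate this. We find that
\[ \log \varphi_Z(u) = -\frac{1}{2} u^2 + C. \]
So we have
\[ \varphi_Z(u) = A e^{-u^2/2}. \]
We know that $A = 1$, since $\varphi_Z(0) = 1$. So we have
\[ \varphi_Z(u) = e^{-u^2/2}. \]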
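The formula $\varphi_Z(u) = e^{-u^2/2}$ can be cross-checked by evaluating the defining integral numerically. This is a rough sketch: the truncation to $[-12, 12]$ and the trapezoidal rule with 24000 steps are ad hoc choices, adequate here because the integrand is smooth and decays very fast.

```python
import cmath
import math

def phi_Z(u, L=12.0, n=24000):
    """Trapezoidal approximation of (2 pi)^(-1/2) * integral over [-L, L] of e^{iux} e^{-x^2/2} dx."""
    h = 2 * L / n
    total = 0j
    for k in range(n + 1):
        x = -L + k * h
        w = 0.5 if k in (0, n) else 1.0      # endpoint weights of the trapezoidal rule
        total += w * cmath.exp(1j * u * x) * math.exp(-x * x / 2)
    return total * h / math.sqrt(2 * math.pi)

# Compare against the closed form at a few values of u.
errors = [abs(phi_Z(u) - math.exp(-u * u / 2)) for u in (0.0, 0.5, 1.0, 2.0, 3.0)]
print(max(errors))
```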

We now do this problem in general.



Proposition. Let $Z = (Z_1, \cdots, Z_d)$ with $Z_j \sim N(0, 1)$ independent. Then $\sqrt{t} Z$ has density
\[ g_t(x) = \frac{1}{(2\pi t)^{d/2}} e^{-|x|^2/(2t)}, \]
with
\[ \hat{g}_t(u) = e^{-|u|^2 t/2}. \]
Proof. We have
\[
\begin{aligned}
\hat{g}_t(u) &= \mathbb{E}[e^{i(u, \sqrt{t} Z)}] \\
&= \prod_{j=1}^d \mathbb{E}[e^{i u_j \sqrt{t} Z_j}] \\
&= \prod_{j=1}^d \varphi_Z(\sqrt{t} u_j) \\
&= \prod_{j=1}^d e^{-t u_j^2/2} \\
&= e^{-|u|^2 t/2}.
\end{aligned}
\]
Again, $g_t$ and $\hat{g}_t$ are almost the same, apart from the factor of $(2\pi t)^{-d/2}$ and the position of $t$ shifted. We can thus write this as
\[ \hat{g}_t(u) = (2\pi)^{d/2} t^{-d/2} g_{1/t}(u). \]


So this tells us that
\[ \hat{\hat{g}}_t(u) = (2\pi)^d g_t(u). \]
This is not exactly the same as saying the Fourier inversion formula works, because in the Fourier inversion formula, we integrated against $e^{-i(u, x)}$, not $e^{i(u, x)}$. However, we know that by the symmetry of the Gaussian distribution, we have
\[ g_t(x) = g_t(-x) = (2\pi)^{-d} \hat{\hat{g}}_t(-x) = \left( \frac{1}{2\pi} \right)^d \int \hat{g}_t(u) e^{-i(u, x)} \, du. \]
So we conclude that

Lemma. The Fourier inversion formula holds for the Gaussian density function.
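In dimension $d = 1$ this lemma can be checked numerically by plugging $\hat{g}_t(u) = e^{-u^2 t/2}$ into the inversion integral. The sketch below fixes $t = 0.7$ arbitrarily and truncates the integral to $[-20, 20]$; both are illustrative assumptions.

```python
import cmath
import math

t = 0.7  # arbitrary variance for the check

def g(x):
    """Gaussian density g_t in dimension 1."""
    return math.exp(-x * x / (2 * t)) / math.sqrt(2 * math.pi * t)

def g_hat(u):
    """Its Fourier transform, e^{-u^2 t / 2}."""
    return math.exp(-u * u * t / 2)

def inverse_transform(x, L=20.0, n=20000):
    """(2 pi)^(-1) * integral over [-L, L] of g_hat(u) e^{-iux} du, by the trapezoidal rule."""
    h = 2 * L / n
    total = 0j
    for k in range(n + 1):
        u = -L + k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * g_hat(u) * cmath.exp(-1j * u * x)
    return total * h / (2 * math.pi)

errors = [abs(inverse_transform(x) - g(x)) for x in (-1.5, 0.0, 0.4, 2.0)]
print(max(errors))
```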

Gaussian convolutions
Definition (Gaussian convolution). Let $f \in L^1$. Then a Gaussian convolution of $f$ is a function of the form $f * g_t$.
We are now going to do a little computation that shows that functions of this type also satisfy the Fourier inversion formula.
Before we start, we make some observations about the Gaussian convolution. By general theory of convolutions, we know that we have
Proposition.
\[ \|f * g_t\|_1 \leq \|f\|_1. \]
We also have a pointwise bound
\[
\begin{aligned}
|f * g_t(x)| &= \left| \int f(x - y) e^{-|y|^2/(2t)} \left( \frac{1}{2\pi t} \right)^{d/2} dy \right| \\
&\leq (2\pi t)^{-d/2} \int |f(x - y)| \, dy \\
&\leq (2\pi t)^{-d/2} \|f\|_1.
\end{aligned}
\]
This tells us that in fact

Proposition.
\[ \|f * g_t\|_\infty \leq (2\pi t)^{-d/2} \|f\|_1. \]
So in fact the convolution is pointwise bounded. We see that the bound gets worse as $t \to 0$, and we will see that this is because as $t \to 0$, the convolution $f * g_t$ becomes a better and better approximation of $f$, and we did not assume that $f$ is bounded.
Similarly, we can compute that
Proposition.
\[ \|\widehat{f * g_t}\|_1 = \|\hat{f} \hat{g}_t\|_1 \leq \|\hat{f}\|_\infty \|\hat{g}_t\|_1 \leq (2\pi)^{d/2} t^{-d/2} \|f\|_1, \]
and
\[ \|\widehat{f * g_t}\|_\infty \leq \|\hat{f}\|_\infty. \]


Now given these bounds, it makes sense to write down the Fourier inversion
formula for a Gaussian convolution.
Lemma. The Fourier inversion formula holds for Gaussian convolutions.
We are going to reduce this to the fact that the Gaussian distribution itself
satisfies Fourier inversion.
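Proof. We have
\[
\begin{aligned}
f * g_t(x) &= \int f(x - y) g_t(y) \, dy \\
&= \int f(x - y) \left( \frac{1}{(2\pi)^d} \int \hat{g}_t(u) e^{-i(u, y)} \, du \right) dy \\
&= \left( \frac{1}{2\pi} \right)^d \iint f(x - y) \hat{g}_t(u) e^{-i(u, y)} \, du \, dy \\
&= \left( \frac{1}{2\pi} \right)^d \int \left( \int f(x - y) e^{i(u, x - y)} \, dy \right) \hat{g}_t(u) e^{-i(u, x)} \, du \\
&= \left( \frac{1}{2\pi} \right)^d \int \hat{f}(u) \hat{g}_t(u) e^{-i(u, x)} \, du \\
&= \left( \frac{1}{2\pi} \right)^d \int \widehat{f * g_t}(u) e^{-i(u, x)} \, du,
\end{aligned}
\]
where swapping the order of integration is justified by Fubini’s theorem, since $f$ and $\hat{g}_t$ are both integrable. So done.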

The proof
Finally, we are going to extend the Fourier inversion formula to the case where $f, \hat{f} \in L^1$.
Theorem (Fourier inversion formula). Let $f \in L^1$ and
\[ f_t(x) = (2\pi)^{-d} \int \hat{f}(u) e^{-|u|^2 t/2} e^{-i(u, x)} \, du = (2\pi)^{-d} \int \widehat{f * g_t}(u) e^{-i(u, x)} \, du. \]
Then $\|f_t - f\|_1 \to 0$ as $t \to 0$, and the Fourier inversion formula holds whenever $f, \hat{f} \in L^1$.


To prove this, we first need to show that the Gaussian convolution is indeed
a good approximation of f :
Lemma. Suppose that $f \in L^p$ with $p \in [1, \infty)$. Then $\|f * g_t - f\|_p \to 0$ as $t \to 0$.
Note that this cannot hold for $p = \infty$. Indeed, if $p = \infty$, then the $\infty$-norm is the uniform norm. But we know that $f * g_t$ is always continuous, and the uniform limit of continuous functions is continuous. So the formula cannot hold if $f$ is not already continuous.
Proof. We fix $\varepsilon > 0$. By a question on the example sheet, we can find $h$ which is continuous and with compact support such that $\|f - h\|_p \leq \frac{\varepsilon}{3}$. So we have
\[ \|f * g_t - h * g_t\|_p = \|(f - h) * g_t\|_p \leq \|f - h\|_p \leq \frac{\varepsilon}{3}. \]


So it suffices for us to work with a continuous function $h$ with compact support. We let
\[ e(y) = \int |h(x - y) - h(x)|^p \, dx. \]
We first show that $e$ is a bounded function:
\[
\begin{aligned}
e(y) &\leq \int 2^p (|h(x - y)|^p + |h(x)|^p) \, dx \\
&= 2^{p+1} \|h\|_p^p.
\end{aligned}
\]
Also, since $h$ is continuous, bounded and compactly supported, the dominated convergence theorem tells us that $e(y) \to 0$ as $y \to 0$.
Moreover, using the fact that $\int g_t(y) \, dy = 1$, we have
\[ \|h * g_t - h\|_p^p = \int \left| \int (h(x - y) - h(x)) g_t(y) \, dy \right|^p dx. \]
Since $g_t(y) \, dy$ is a probability measure, by Jensen’s inequality, we can bound this by
\[
\begin{aligned}
&\leq \iint |h(x - y) - h(x)|^p g_t(y) \, dy \, dx \\
&= \int \left( \int |h(x - y) - h(x)|^p \, dx \right) g_t(y) \, dy \\
&= \int e(y) g_t(y) \, dy \\
&= \int e(\sqrt{t} y) g_1(y) \, dy,
\end{aligned}
\]
where we used the definition of $g$ and substitution. We know that this tends to 0 as $t \to 0$ by the bounded convergence theorem, since we know that $e$ is bounded.
Finally, we have
\[
\begin{aligned}
\|f * g_t - f\|_p &\leq \|f * g_t - h * g_t\|_p + \|h * g_t - h\|_p + \|h - f\|_p \\
&\leq \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \|h * g_t - h\|_p \\
&= \frac{2\varepsilon}{3} + \|h * g_t - h\|_p.
\end{aligned}
\]
Since we know that $\|h * g_t - h\|_p \to 0$ as $t \to 0$, we know that for all sufficiently small $t$, the left hand side is bounded above by $\varepsilon$. So we are done.
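To get a feel for the rate of convergence, take $d = 1$, $p = 1$ and $f = 1_{[0,1]}$ (an illustrative choice). Then $f * g_t(x) = \Phi(x/\sqrt{t}) - \Phi((x-1)/\sqrt{t})$, where $\Phi$ is the standard normal distribution function, so the $L^1$ error can be approximated by a Riemann sum. The integration window and step count below are ad hoc.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def f(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def f_conv_g(x, t):
    """(f * g_t)(x) = P[x - sqrt(t) Z in [0, 1]] for Z ~ N(0, 1)."""
    s = math.sqrt(t)
    return Phi(x / s) - Phi((x - 1.0) / s)

def l1_error(t, lo=-10.0, hi=11.0, n=21000):
    """Midpoint Riemann sum for ||f * g_t - f||_1, truncated to [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(
        abs(f_conv_g(lo + (k + 0.5) * h, t) - f(lo + (k + 0.5) * h)) for k in range(n)
    )

ts = (1.0, 0.1, 0.01, 0.001)
errors = [l1_error(t) for t in ts]
for t, err in zip(ts, errors):
    print(t, err)
```

The errors shrink as $t$ decreases, consistent with the lemma; for this particular $f$ the decay appears to be of order $\sqrt{t}$.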
With this lemma, we can now prove the Fourier inversion theorem.
Proof of Fourier inversion theorem. The first part is just a special case of the previous lemma. Indeed, recall that
\[ \widehat{f * g_t}(u) = \hat{f}(u) e^{-|u|^2 t/2}. \]
Since Gaussian convolutions satisfy Fourier inversion formula, we know that
\[ f_t = f * g_t. \]


So the previous lemma says exactly that $\|f_t - f\|_1 \to 0$.
Suppose now that $\hat{f} \in L^1$ as well. Then looking at the integrand of
\[ f_t(x) = (2\pi)^{-d} \int \hat{f}(u) e^{-|u|^2 t/2} e^{-i(u, x)} \, du, \]
we know that
\[ \left| \hat{f}(u) e^{-|u|^2 t/2} e^{-i(u, x)} \right| \leq |\hat{f}(u)|. \]
Then by the dominated convergence theorem with dominating function $|\hat{f}|$, we know that this converges to
\[ f_t(x) \to (2\pi)^{-d} \int \hat{f}(u) e^{-i(u, x)} \, du \quad \text{as } t \to 0. \]
By the first part, we know that $\|f_t - f\|_1 \to 0$ as $t \to 0$. So we can find a sequence $(t_n)$ with $t_n > 0$, $t_n \to 0$ so that $f_{t_n} \to f$ a.e. Combining these, we know that
\[ f(x) = (2\pi)^{-d} \int \hat{f}(u) e^{-i(u, x)} \, du \quad \text{a.e.} \]
So done.

5.4 Fourier transform in L2


It turns out wonderful things happen when we take the Fourier transform of an
L2 function.
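Theorem (Plancherel identity). For any function $f \in L^1 \cap L^2$, the Plancherel identity holds:
\[ \|\hat{f}\|_2 = (2\pi)^{d/2} \|f\|_2. \]
As we are going to see in a moment, this is just going to follow from the Fourier inversion formula plus a clever trick.
Proof. We first work with the special case where $f, \hat{f} \in L^1$, since the Fourier inversion formula holds for $f$. We then have
\[
\begin{aligned}
\|f\|_2^2 &= \int f(x) \overline{f(x)} \, dx \\
&= \frac{1}{(2\pi)^d} \int \left( \int \hat{f}(u) e^{-i(u, x)} \, du \right) \overline{f(x)} \, dx \\
&= \frac{1}{(2\pi)^d} \int \hat{f}(u) \left( \int \overline{f(x)} e^{-i(u, x)} \, dx \right) du \\
&= \frac{1}{(2\pi)^d} \int \hat{f}(u) \overline{\left( \int f(x) e^{i(u, x)} \, dx \right)} \, du \\
&= \frac{1}{(2\pi)^d} \int \hat{f}(u) \overline{\hat{f}(u)} \, du \\
&= \frac{1}{(2\pi)^d} \|\hat{f}\|_2^2.
\end{aligned}
\]
So the Plancherel identity holds for $f$.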


To prove it for the general case, we use this result and an approximation argument. Suppose that $f \in L^1 \cap L^2$, and let $f_t = f * g_t$. Then by our earlier lemma, we know that
\[ \|f_t\|_2 \to \|f\|_2 \quad \text{as } t \to 0. \]
Now note that
\[ \hat{f}_t(u) = \hat{f}(u) \hat{g}_t(u) = \hat{f}(u) e^{-|u|^2 t/2}. \]
The important thing is that $e^{-|u|^2 t/2} \nearrow 1$ as $t \to 0$. Therefore, we know
\[ \|\hat{f}_t\|_2^2 = \int |\hat{f}(u)|^2 e^{-|u|^2 t} \, du \to \int |\hat{f}(u)|^2 \, du = \|\hat{f}\|_2^2 \]
as $t \to 0$, by monotone convergence.
Since $f_t, \hat{f}_t \in L^1$, we know that the Plancherel identity holds, i.e.
\[ \|\hat{f}_t\|_2 = (2\pi)^{d/2} \|f_t\|_2. \]
Taking the limit as $t \to 0$, the result follows.
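As a numerical illustration of the Plancherel identity in $d = 1$, take $f(x) = e^{-|x|}$ (an illustrative choice), whose Fourier transform is $\hat{f}(u) = 2/(1 + u^2)$. The sketch below first spot-checks that closed form at $u = 1$ and then compares $\|\hat{f}\|_2^2 / \|f\|_2^2$ with $2\pi$. The truncation windows and step counts are ad hoc.

```python
import math

def integrate(fn, lo, hi, n):
    """Composite midpoint rule on [lo, hi] with n cells."""
    h = (hi - lo) / n
    return h * sum(fn(lo + (k + 0.5) * h) for k in range(n))

def f(x):
    return math.exp(-abs(x))

def f_hat(u):
    """Closed form of the transform of e^{-|x|} with the convention e^{i u x}."""
    return 2.0 / (1.0 + u * u)

# Spot check of the closed form: f is even, so f_hat(u) is the integral of f(x) cos(ux).
check = integrate(lambda x: f(x) * math.cos(1.0 * x), -40.0, 40.0, 80000)

norm_f_sq = integrate(lambda x: f(x) ** 2, -30.0, 30.0, 60000)
norm_fhat_sq = integrate(lambda u: f_hat(u) ** 2, -200.0, 200.0, 400000)
ratio = norm_fhat_sq / norm_f_sq     # Plancherel predicts (2 pi)^d with d = 1

print(check, ratio, 2 * math.pi)
```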


What is this good for? It turns out that the Fourier transform gives us a
bijection from $L^2$ to itself. While it is not true that the Fourier inversion
formula holds for everything in $L^2$, it holds for enough functions that we can
approximate everything else by the ones that are nice. Then the above tells us
that in fact this bijection is a norm-preserving automorphism.
Theorem. There exists a unique Hilbert space automorphism $F : L^2 \to L^2$
such that
\[ F([f]) = [(2\pi)^{-d/2} \hat{f}] \]
whenever $f \in L^1 \cap L^2$.
Here [f ] denotes the equivalence class of f in L2 , and we say F : L2 → L2 is
a Hilbert space automorphism if it is a linear bijection that preserves the inner
product.
Note that in general, there is no guarantee that F sends a function to its
Fourier transform. We know that only if it is a well-behaved function (i.e. in
L1 ∩ L2 ). However, the formal property of it being a bijection from L2 to itself
will be convenient for many things.
Proof. We define $F_0 : L^1 \cap L^2 \to L^2$ by

\[ F_0([f]) = [(2\pi)^{-d/2} \hat{f}]. \]

By the Plancherel identity, we know $F_0$ preserves the $L^2$ norm, i.e.

\[ \|F_0([f])\|_2 = \|[f]\|_2. \]

Also, we know that $L^1 \cap L^2$ is dense in $L^2$, since even the continuous functions
with compact support are dense. So we know $F_0$ extends uniquely to an isometry
$F : L^2 \to L^2$.
Since it preserves distance, it is in particular injective. So it remains to show
that the map is surjective. By Fourier inversion, the subspace

\[ V = \{[f] \in L^2 : f, \hat{f} \in L^1\} \]

is sent to itself by the map $F$. Also if $[f] \in V$, then $F^4[f] = [f]$ (note that
applying it twice does not suffice, because we actually have $F^2[f](x) = [f](-x)$).
So $V$ is contained in the image of $F$, and also $V$ is dense in $L^2$, again because it
contains all Gaussian convolutions (we have $\hat{f}_t = \hat{f} \hat{g}_t$, and $\hat{f}$ is bounded and $\hat{g}_t$
is decaying exponentially). Since $F$ is an isometry on a complete space, its image is
closed, so we know that $F$ is surjective.

5.5 Properties of characteristic functions


We are now going to state a bunch of theorems about characteristic functions.
Since the proofs are not examinable (but the statements are!), we are only going
to provide a rough proof sketch.
Theorem. The characteristic function $\varphi_X$ of a distribution $\mu_X$ of a random
variable $X$ determines $\mu_X$. In other words, if $X$ and $\tilde{X}$ are random variables
and $\varphi_X = \varphi_{\tilde{X}}$, then $\mu_X = \mu_{\tilde{X}}$.
Proof sketch. Use the Fourier inversion to show that φX determines µX (g) =
E[g(X)] for any bounded, continuous g.
Theorem. If $\varphi_X$ is integrable, then $\mu_X$ has a bounded, continuous density
function
\[ f_X(x) = (2\pi)^{-d} \int \varphi_X(u) e^{-i(u,x)} \, du. \]

Proof sketch. Let $Z \sim N(0, 1)$ be independent of $X$. Then $X + \sqrt{t} Z$ has a
bounded continuous density function which, by Fourier inversion, is
\[ f_t(x) = (2\pi)^{-d} \int \varphi_X(u) e^{-|u|^2 t/2} e^{-i(u,x)} \, du. \]
Now send $t \to 0$ and use the dominated convergence theorem with dominating
function $|\varphi_X|$.
The next theorem relates to the notion of weak convergence.
Definition (Weak convergence of measures). Let µ, (µn ) be Borel probability
measures. We say that µn → µ weakly if and only if µn (g) → µ(g) for all
bounded continuous g.
Similarly, we can define weak convergence of random variables.
Definition (Weak convergence of random variables). Let X, (Xn ) be random
variables. We say Xn → X weakly iff µXn → µX weakly, iff E[g(Xn )] → E[g(X)]
for all bounded continuous g.
This is related to the notion of convergence in distribution, which we defined
a long time ago without talking about it much. It is an exercise on the example
sheet that weak convergence of random variables in R is equivalent to convergence
in distribution.
It turns out that weak convergence is very useful theoretically. One reason is
that it is related to convergence of characteristic functions.
Theorem. Let X, (Xn ) be random variables with values in Rd . If φXn (u) →
φX (u) for each u ∈ Rd , then µXn → µX weakly.


The main application of this theorem, which will appear later, is the proof of
the central limit theorem.

Proof sketch. By the example sheet, it suffices to show that $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$
for all compactly supported $g \in C^\infty$. We then use Fourier inversion and
convergence of characteristic functions to check that
\[ \mathbb{E}[g(X_n + \sqrt{t} Z)] \to \mathbb{E}[g(X + \sqrt{t} Z)] \]
for all $t > 0$, where $Z \sim N(0, 1)$ is independent of $X, (X_n)$. Then we check that
$\mathbb{E}[g(X_n + \sqrt{t} Z)]$ is close to $\mathbb{E}[g(X_n)]$ for $t > 0$ small, and similarly for $X$.

5.6 Gaussian random variables


Recall that in the proof of the Fourier inversion theorem, we used these things
called Gaussians, but didn’t really say much about them. These will be useful
later on when we want to prove the central limit theorem, because the central
limit theorem says that in the long run, things look like Gaussians. So here we
lay out some of the basic definitions and properties of Gaussians.
Definition (Gaussian random variable). Let $X$ be a random variable on $\mathbb{R}$.
This is said to be Gaussian if there exists $\mu \in \mathbb{R}$ and $\sigma \in (0, \infty)$ such that the
density of $X$ is
\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \]
A constant random variable $X = \mu$ corresponds to $\sigma = 0$. We say this has mean
$\mu$ and variance $\sigma^2$.
When this happens, we write $X \sim N(\mu, \sigma^2)$.
For completeness, we record some properties of Gaussian random variables.

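Proposition. Let $X \sim N(\mu, \sigma^2)$. Then

\[ \mathbb{E}[X] = \mu, \quad \operatorname{var}(X) = \sigma^2. \]

Also, for any $a, b \in \mathbb{R}$, we have

\[ aX + b \sim N(a\mu + b, a^2\sigma^2). \]

Lastly, we have
\[ \varphi_X(u) = e^{i\mu u - u^2\sigma^2/2}. \]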

Proof. All but the last of them follow from direct calculation, and can be found
in IA Probability.
For the last part, if $X \sim N(\mu, \sigma^2)$, then we can write

\[ X = \sigma Z + \mu, \]

where $Z \sim N(0, 1)$. Recall that we have previously found that the characteristic
function of a $N(0, 1)$ random variable is
\[ \varphi_Z(u) = e^{-|u|^2/2}. \]

So we have

\begin{align*}
\varphi_X(u) &= \mathbb{E}[e^{iu(\sigma Z + \mu)}] \\
&= e^{iu\mu} \mathbb{E}[e^{iu\sigma Z}] \\
&= e^{iu\mu} \varphi_Z(u\sigma) \\
&= e^{iu\mu - u^2\sigma^2/2}.
\end{align*}
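The formula for the characteristic function of a one-dimensional Gaussian can be checked by simulation. The sketch below writes $X = \sigma Z + \mu$ as in the proof and compares a Monte Carlo estimate of $\mathbb{E}[e^{iuX}]$ with the closed form $e^{iu\mu - u^2\sigma^2/2}$. The parameter values, sample size and seed are illustrative assumptions, and the agreement is only up to sampling error.

```python
import cmath
import random


def empirical_cf(samples, u):
    """Monte Carlo estimate of E[exp(i u X)] from a list of samples of X."""
    return sum(cmath.exp(1j * u * x) for x in samples) / len(samples)


def gaussian_cf(u, mu, sigma):
    """Closed form exp(i u mu - u^2 sigma^2 / 2) for X ~ N(mu, sigma^2)."""
    return cmath.exp(1j * u * mu - (u * sigma) ** 2 / 2)


rng = random.Random(0)   # fixed seed so that the run is reproducible
mu, sigma = 1.5, 0.7     # illustrative parameters (assumptions)

# X = sigma * Z + mu with Z ~ N(0, 1), exactly as in the proof.
samples = [sigma * rng.gauss(0.0, 1.0) + mu for _ in range(100_000)]

errors = [abs(empirical_cf(samples, u) - gaussian_cf(u, mu, sigma))
          for u in (0.5, 1.0, 2.0)]
print(errors)
```

With $10^5$ samples the sampling error is of order $10^{-3}$, so the printed discrepancies should be small but not zero.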

What we are next going to do is to talk about the corresponding facts for
the Gaussian in higher dimensions. Before that, we need to come up with the
definition of a higher-dimensional Gaussian distribution. This might be different
from the one you’ve seen previously, because we want to allow some degeneracy
in our random variable, e.g. some of the dimensions can be constant.

Definition (Gaussian random variable). Let X be a random variable. We say


that X is a Gaussian on Rn if (u, X) is Gaussian on R for all u ∈ Rn .
We are now going to prove a version of our previous theorem for higher
dimensional Gaussians.

Theorem. Let $X$ be Gaussian on $\mathbb{R}^n$, and let $A$ be an $m \times n$ matrix and $b \in \mathbb{R}^m$.
Then
(i) AX + b is Gaussian on Rm .
(ii) X ∈ L2 and its law µX is determined by µ = E[X] and V = var(X), the
covariance matrix.

(iii) We have
φX (u) = ei(u,µ)−(u,V u)/2 .

(iv) If $V$ is invertible, then $X$ has a density of
\[ f_X(x) = (2\pi)^{-n/2} (\det V)^{-1/2} \exp\left( -\frac{1}{2} (x - \mu, V^{-1}(x - \mu)) \right). \]

(v) If X = (X1 , X2 ) where Xi ∈ Rni , then cov(X1 , X2 ) = 0 iff X1 and X2 are


independent.
Proof.
(i) If u ∈ Rm , then we have

(AX + b, u) = (AX, u) + (b, u) = (X, AT u) + (b, u).

Since (X, AT u) is Gaussian and (b, u) is constant, it follows that (AX +b, u)
is Gaussian.

(ii) We know in particular that each component of $X$ is a Gaussian random
variable, and these are in $L^2$. So $X \in L^2$. We will prove the second part of (ii)
with (iii).

(iii) Let $\mu = \mathbb{E}[X]$ and $V = \operatorname{var}(X)$. Then for $u \in \mathbb{R}^n$, we have

\[ \mathbb{E}[(u, X)] = (u, \mu), \quad \operatorname{var}((u, X)) = (u, V u). \]

So we know
\[ (u, X) \sim N((u, \mu), (u, V u)). \]
So it follows that

\[ \varphi_X(u) = \mathbb{E}[e^{i(u,X)}] = e^{i(u,\mu) - (u,V u)/2}. \]

So $\mu$ and $V$ determine the characteristic function of $X$, which in turn
determines the law of $X$.

(iv) We start off with a boring Gaussian vector $Y = (Y_1, \cdots, Y_n)$, where the
$Y_i \sim N(0, 1)$ are independent. Then the density of $Y$ is
\[ f_Y(y) = (2\pi)^{-n/2} e^{-|y|^2/2}. \]

We are now going to construct $X$ from $Y$. We define

\[ \tilde{X} = V^{1/2} Y + \mu. \]

This makes sense because $V$ is always non-negative definite. Then $\tilde{X}$ is
Gaussian with $\mathbb{E}[\tilde{X}] = \mu$ and $\operatorname{var}(\tilde{X}) = V$. Therefore $X$ has the same
distribution as $\tilde{X}$. Since $V$ is assumed to be invertible, we can compute
the density of $\tilde{X}$ using the change of variables formula.
(v) It is clear that if X1 and X2 are independent, then cov(X1 , X2 ) = 0.
Conversely, let X = (X1 , X2 ), where cov(X1 , X2 ) = 0. Then we have
 
V11 0
V = var(X) = .
0 V22

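Then for $u = (u_1, u_2)$, we have

\[ (u, V u) = (u_1, V_{11} u_1) + (u_2, V_{22} u_2), \]

where $V_{11} = \operatorname{var}(X_1)$ and $V_{22} = \operatorname{var}(X_2)$. Writing $\mu = (\mu_1, \mu_2)$ with $\mu_i = \mathbb{E}[X_i]$, we then have

\begin{align*}
\varphi_X(u) &= e^{i(u,\mu) - (u,V u)/2} \\
&= e^{i(u_1,\mu_1) - (u_1,V_{11} u_1)/2} \, e^{i(u_2,\mu_2) - (u_2,V_{22} u_2)/2} \\
&= \varphi_{X_1}(u_1) \varphi_{X_2}(u_2).
\end{align*}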

So it follows that X1 and X2 are independent.
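The construction in part (iv) also tells us how to sample a Gaussian vector with prescribed mean and covariance. The sketch below does this in dimension 2. Note that it uses a Cholesky factor $L$ with $LL^T = V$ instead of the symmetric square root $V^{1/2}$ used in the proof; this is a substitution made for convenience, and it gives the same law because $LY + \mu$ is again Gaussian with mean $\mu$ and covariance $V$. The values of `mu` and `V`, the sample size and the seed are illustrative assumptions.

```python
import math
import random


def cholesky_2x2(v):
    """Lower-triangular L with L L^T = v, for a symmetric positive definite 2x2 matrix v."""
    a = math.sqrt(v[0][0])
    b = v[1][0] / a
    c = math.sqrt(v[1][1] - b * b)
    return [[a, 0.0], [b, c]]


def sample_gaussian(mu, v, n, rng):
    """Draw n samples of L Y + mu, where Y has independent N(0, 1) entries."""
    low = cholesky_2x2(v)
    out = []
    for _ in range(n):
        y1, y2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((mu[0] + low[0][0] * y1,
                    mu[1] + low[1][0] * y1 + low[1][1] * y2))
    return out


def sample_mean_and_covariance(xs):
    """Empirical mean vector and covariance matrix of a list of 2D points."""
    n = len(xs)
    m = [sum(x[i] for x in xs) / n for i in range(2)]
    cov = [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in xs) / n for j in range(2)]
           for i in range(2)]
    return m, cov


rng = random.Random(1)         # fixed seed for reproducibility
mu = (1.0, -2.0)               # illustrative mean (assumption)
V = [[2.0, 0.6], [0.6, 1.0]]   # illustrative covariance matrix (assumption)
mean_hat, cov_hat = sample_mean_and_covariance(sample_gaussian(mu, V, 50_000, rng))
print(mean_hat, cov_hat)
```

The empirical mean and covariance should be close to `mu` and `V`, up to sampling error.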


6 Ergodic theory
We are now going to study a new topic: ergodic theory. This is the study of
the “long run behaviour” of a system under the evolution of some map $\Theta$. Due to time
constraints, we will not do much with it. We are going to prove two ergodic
theorems that tell us what happens in the long run, and this will be useful when
we prove our strong law of large numbers at the end of the course.
The general setting is that we have a measure space $(E, \mathcal{E}, \mu)$ and a measurable
map $\Theta : E \to E$ that is measure preserving, i.e. $\mu(A) = \mu(\Theta^{-1}(A))$ for all
$A \in \mathcal{E}$.
Example. Take (E, E, µ) = ([0, 1), B([0, 1)), Lebesgue). For each a ∈ [0, 1), we
can define
Θa (x) = x + a mod 1.
By what we’ve done earlier in the course, we know this translation map preserves
the Lebesgue measure on [0, 1).
Our goal is to try to understand the “long run averages” of the system when
we apply Θ many times. One particular quantity we are going to look at is the
following:
Let $f$ be measurable. We define

\[ S_n(f) = f + f \circ \Theta + \cdots + f \circ \Theta^{n-1}. \]

We want to know the long run behaviour of $S_n(f)/n$ as $n \to \infty$.


The ergodic theorems are going to give us the answer in a certain special
case. Finally, we will apply this in a particular case to get the strong law of
large numbers.
Definition (Invariant subset). We say A ∈ E is invariant for Θ if A = Θ−1 (A).
Definition (Invariant function). A measurable function f is invariant if f =
f ◦ Θ.
Definition (EΘ ). We write

EΘ = {A ∈ E : A is invariant}.

It is easy to show that EΘ is a σ-algebra, and f : E → R is invariant iff it is


EΘ measurable.
Definition (Ergodic). We say $\Theta$ is ergodic if $A \in \mathcal{E}_\Theta$ implies $\mu(A) = 0$ or
$\mu(A^c) = 0$.
Example. For the translation map on $[0, 1)$, we have $\Theta_a$ is ergodic iff $a$ is
irrational. The proof is left for example sheet 4.
Proposition. If $f$ is integrable and $\Theta$ is measure-preserving, then $f \circ \Theta$ is
integrable and
\[ \int_E f \circ \Theta \, d\mu = \int_E f \, d\mu. \]

It turns out that if Θ is ergodic, then there aren’t that many invariant
functions.


Proposition. If Θ is ergodic and f is invariant, then there exists a constant c


such that f = c a.e.
The proofs of these are left as an exercise on example sheet 4.
We are now going to spend a little bit of time studying a particular example,
because this will be needed to prove the strong law of large numbers.
Example (Bernoulli shifts). Let m be a probability distribution on R. Then
there exists an iid sequence Y1 , Y2 , · · · with law m. Recall we constructed this
in a really funny way. Now we are going to build it in a more natural way.
We let $E = \mathbb{R}^{\mathbb{N}}$ be the set of all real sequences $(x_n)$. We define the σ-algebra
$\mathcal{E}$ to be the σ-algebra generated by the projections $X_n(x) = x_n$. In other
words, this is the smallest σ-algebra such that all these functions are measurable.
Alternatively, this is the σ-algebra generated by the π-system
\[ \mathcal{A} = \left\{ \prod_{n \in \mathbb{N}} A_n : A_n \in \mathcal{B} \text{ for all } n \text{ and } A_n = \mathbb{R} \text{ eventually} \right\}. \]

Finally, to define the measure µ, we let

Y = (Y1 , Y2 , · · · ) : Ω → E

where Yi are iid random variables defined earlier, and Ω is the sample space of
the Yi .
Then Y is a measurable map because each of the Yi ’s is a random variable.
We let µ = P ◦ Y −1 .
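By the independence of the $Y_i$'s, we have that
\[ \mu(A) = \prod_{n \in \mathbb{N}} m(A_n) \]
for any
\[ A = A_1 \times A_2 \times \cdots \times A_N \times \mathbb{R} \times \mathbb{R} \times \cdots. \]
Note that the factors are eventually $m(\mathbb{R}) = 1$, so it is really a finite product.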
This (E, E, µ) is known as the canonical space associated with the sequence
of iid random variables with law m.
Finally, we need to define Θ. We define Θ : E → E by

Θ(x) = Θ(x1 , x2 , x3 , · · · ) = (x2 , x3 , x4 , · · · ).

This is known as the shift map.


Why do we care about this? Later, we are going to look at the function

f (x) = f (x1 , x2 , · · · ) = x1 .

Then we have

Sn (f ) = f + f ◦ Θ + · · · + f ◦ Θn−1 = x1 + · · · + xn .

So $S_n(f)/n$ will be the average of the first $n$ things. So ergodic theory will tell us
about the long-run behaviour of the average.


Theorem. The shift map Θ is an ergodic, measure preserving transformation.


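Proof. It is an exercise to show that $\Theta$ is measurable and measure preserving.
To show that $\Theta$ is ergodic, recall the definition of the tail σ-algebra
\[ \mathcal{T}_n = \sigma(X_m : m \ge n + 1), \quad \mathcal{T} = \bigcap_n \mathcal{T}_n. \]
Suppose that $A = \prod_{k \in \mathbb{N}} A_k \in \mathcal{A}$. Then

\[ \Theta^{-n}(A) = \{X_{n+k} \in A_k \text{ for all } k\} \in \mathcal{T}_n. \]

Since $\mathcal{T}_n$ is a σ-algebra, $\Theta^{-n}(A) \in \mathcal{T}_n$ for all $A \in \mathcal{A}$, and $\sigma(\mathcal{A}) = \mathcal{E}$, we
know $\Theta^{-n}(A) \in \mathcal{T}_n$ for all $A \in \mathcal{E}$.
So if $A \in \mathcal{E}_\Theta$, i.e. $A = \Theta^{-1}(A)$, then $A = \Theta^{-n}(A) \in \mathcal{T}_n$ for all $n$. So $A \in \mathcal{T}$.
From the Kolmogorov 0-1 law, we know either $\mu(A) = 1$ or $\mu(A) = 0$. So
done.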

6.1 Ergodic theorems


The proofs in this section are non-examinable.
Instead of proving the ergodic theorems directly, we first start by proving
the following magical lemma:
Lemma (Maximal ergodic lemma). Let $f$ be integrable, and

\[ S^* = \sup_{n \ge 0} S_n(f) \ge 0, \]

where $S_0(f) = 0$ by convention. Then

\[ \int_{\{S^* > 0\}} f \, d\mu \ge 0. \]

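Proof. We let
\[ S_n^* = \max_{0 \le m \le n} S_m \]
and
\[ A_n = \{S_n^* > 0\}. \]
Now if $1 \le m \le n$, then we know

\[ S_m = f + S_{m-1} \circ \Theta \le f + S_n^* \circ \Theta. \]

Now on $A_n$, we have
\[ S_n^* = \max_{1 \le m \le n} S_m, \]
since $S_0 = 0$. So we have
\[ S_n^* \le f + S_n^* \circ \Theta. \]
On $A_n^c$, we have
\[ S_n^* = 0 \le S_n^* \circ \Theta. \]

So we know
\begin{align*}
\int_E S_n^* \, d\mu &= \int_{A_n} S_n^* \, d\mu + \int_{A_n^c} S_n^* \, d\mu \\
&\le \int_{A_n} f \, d\mu + \int_{A_n} S_n^* \circ \Theta \, d\mu + \int_{A_n^c} S_n^* \circ \Theta \, d\mu \\
&= \int_{A_n} f \, d\mu + \int_E S_n^* \circ \Theta \, d\mu \\
&= \int_{A_n} f \, d\mu + \int_E S_n^* \, d\mu.
\end{align*}

Since $S_n^*$ is integrable, we can subtract to obtain
\[ \int_{A_n} f \, d\mu \ge 0. \]

Since $A_n$ increases to $\{S^* > 0\}$, taking the limit as $n \to \infty$ gives the result by dominated
convergence with dominating function $|f|$.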
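The inequality $\int_{A_n} f \, d\mu \ge 0$ obtained in the proof can be checked exactly on a toy system. The sketch below takes $E = \{0, \ldots, N-1\}$ with counting measure and the cyclic shift $\Theta(x) = x + 1 \bmod N$, which is measure preserving. This finite system and the randomly chosen integer-valued $f$ are illustrative assumptions, not examples from the lectures.

```python
import random


def integrals_over_maximal_sets(f, n_max):
    """For Theta(x) = x + 1 mod N on {0, ..., N-1} with counting measure, return
    the sums of f over A_n = {S_n^* > 0} for n = 1, ..., n_max."""
    size = len(f)
    partial = [0] * size      # partial[x] = S_m(f)(x), starting from S_0 = 0
    running_max = [0] * size  # running_max[x] = S_m^*(x) = max_{0 <= k <= m} S_k(f)(x)
    results = []
    for m in range(1, n_max + 1):
        for x in range(size):
            partial[x] += f[(x + m - 1) % size]  # add f(Theta^{m-1} x)
            running_max[x] = max(running_max[x], partial[x])
        results.append(sum(f[x] for x in range(size) if running_max[x] > 0))
    return results


rng = random.Random(2)                       # fixed seed for reproducibility
f = [rng.randint(-5, 5) for _ in range(12)]  # illustrative integer-valued f (assumption)
integrals = integrals_over_maximal_sets(f, 40)
print(integrals)
```

Every entry of the printed list should be non-negative, whatever the values of $f$, even though $f$ itself takes negative values.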
We are now going to prove the two ergodic theorems, which tell us the
limiting behaviour of Sn (f ).
Theorem (Birkhoff’s ergodic theorem). Let $(E, \mathcal{E}, \mu)$ be σ-finite and $f$ be
integrable. There exists an invariant function $\bar{f}$ such that

\[ \mu(|\bar{f}|) \le \mu(|f|), \]

and
\[ \frac{S_n(f)}{n} \to \bar{f} \quad \text{a.e.} \]
If $\Theta$ is ergodic, then $\bar{f}$ is a constant.
Note that the theorem only gives µ(|f¯|) ≤ µ(|f |). However, in many cases,
we can use some integration theorems such as dominated convergence to argue
that they must in fact be equal. In particular, in the ergodic case, this will allow
us to find the value of f¯.
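We can watch the averages $S_n(f)/n$ settle down for the translation map $\Theta_a(x) = x + a \bmod 1$ from the earlier example. The sketch below uses $f(x) = x$, whose integral over $[0, 1)$ is $1/2$, and compares an irrational $a$ with the rational $a = 1/2$. The choice of $f$, the starting point, the values of $a$ and the number of steps are all illustrative assumptions, and floating-point arithmetic only approximates the true orbit.

```python
import math


def birkhoff_average(f, a, x0, n):
    """Return S_n(f)(x0) / n for the translation Theta_a(x) = x + a mod 1."""
    total, x = 0.0, x0
    for _ in range(n):
        total += f(x)
        x = (x + a) % 1.0
    return total / n


def f(x):
    # Illustrative integrable function on [0, 1) (assumption); its integral is 1/2.
    return x


irrational = birkhoff_average(f, math.sqrt(2) - 1, 0.1, 100_000)
rational = birkhoff_average(f, 0.5, 0.1, 100_000)
print(irrational, rational)
```

For the irrational translation the average should come out close to the constant $1/2 = \mu(f)$, as expected in the ergodic case. For $a = 1/2$ the orbit of $0.1$ only visits $0.1$ and $0.6$, so the average is about $0.35$. Here the limit $\bar{f}(x) = (f(x) + f(x + 1/2 \bmod 1))/2$ is invariant but not constant.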
Theorem (von Neumann’s ergodic theorem). Let $(E, \mathcal{E}, \mu)$ be a finite measure
space. Let $p \in [1, \infty)$ and assume that $f \in L^p$. Then there is some function
$\bar{f} \in L^p$ such that
\[ \frac{S_n(f)}{n} \to \bar{f} \quad \text{in } L^p. \]
Proof of Birkhoff’s ergodic theorem. We first note that
\[ \limsup_n \frac{S_n}{n}, \quad \liminf_n \frac{S_n}{n} \]
are invariant functions. Indeed, we know

\begin{align*}
S_n \circ \Theta &= f \circ \Theta + f \circ \Theta^2 + \cdots + f \circ \Theta^n \\
&= S_{n+1} - f.
\end{align*}

So we have
\[ \limsup_{n \to \infty} \frac{S_n \circ \Theta}{n} = \limsup_{n \to \infty} \left( \frac{S_{n+1}}{n} - \frac{f}{n} \right) = \limsup_{n \to \infty} \frac{S_n}{n}, \]
since $f/n \to 0$ and $(n+1)/n \to 1$.

Exactly the same reasoning tells us the lim inf is also invariant.
What we now need to show is that the set of points on which lim sup and
lim inf do not agree has measure zero. We fix $a < b$. We let
\[ D = D(a, b) = \left\{ x \in E : \liminf_{n \to \infty} \frac{S_n(x)}{n} < a < b < \limsup_{n \to \infty} \frac{S_n(x)}{n} \right\}. \]

Now if $\limsup S_n(x)/n \ne \liminf S_n(x)/n$, then there are some $a, b \in \mathbb{Q}$ with $a < b$ such that
$x \in D(a, b)$. So by countable subadditivity, it suffices to show that $\mu(D(a, b)) = 0$
for all $a, b$.
We now fix $a, b$, and just write $D$. Since $\limsup S_n/n$ and $\liminf S_n/n$ are both
invariant, we have that $D$ is invariant. By restricting to $D$, we can assume that
$D = E$.
Suppose that $B \in \mathcal{E}$ and $\mu(B) < \infty$. We let

\[ g = f - b 1_B. \]

Then $g$ is integrable because $f$ is integrable and $\mu(B) < \infty$. Moreover, we have

\[ S_n(g) = S_n(f - b 1_B) \ge S_n(f) - nb. \]

Since we know that $\limsup_n S_n(f)/n > b$ by definition, we can find an $n$ such that
$S_n(g) > 0$. So we know that

\[ S^*(g)(x) = \sup_n S_n(g)(x) > 0 \]

for all $x \in D$. By the maximal ergodic lemma, we know

\[ 0 \le \int_D g \, d\mu = \int_D (f - b 1_B) \, d\mu = \int_D f \, d\mu - b\mu(B). \]

If we rearrange this, we know

\[ b\mu(B) \le \int_D f \, d\mu \]

for all measurable sets $B \in \mathcal{E}$ with finite measure. Since our space is σ-finite, we
can find $B_n \nearrow D$ such that $\mu(B_n) < \infty$ for all $n$. So taking the limit above tells us
\[ b\mu(D) \le \int_D f \, d\mu. \tag{$\dagger$} \]

Now we can apply the same argument with $(-a)$ in place of $b$ and $(-f)$ in place
of $f$ to get
\[ (-a)\mu(D) \le -\int_D f \, d\mu. \tag{$\ddagger$} \]
Now note that since $b > a$, we know that at least one of $b > 0$ and $a < 0$ has to
be true. In the first case, ($\dagger$) tells us that $\mu(D)$ is finite, since $f$ is integrable.
Then combining with ($\ddagger$), we see that
\[ b\mu(D) \le \int_D f \, d\mu \le a\mu(D). \]

But $a < b$. So we must have $\mu(D) = 0$. The second case follows similarly (or
follows immediately by flipping the sign of $f$).
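We are almost done. We can now define
\[ \bar{f}(x) = \begin{cases} \lim S_n(f)(x)/n & \text{the limit exists} \\ 0 & \text{otherwise} \end{cases} \]

Then by the above calculations, we have

\[ \frac{S_n(f)}{n} \to \bar{f} \quad \text{a.e.} \]
Also, we know $\bar{f}$ is invariant, because $\lim S_n(f)/n$ is invariant, and so is the set
where the limit exists.
Finally, we need to show that

\[ \mu(|\bar{f}|) \le \mu(|f|). \]

This is since
\[ \mu(|f \circ \Theta^n|) = \mu(|f|) \]
as $\Theta^n$ preserves the measure. So we have that

\[ \mu(|S_n|) \le n\mu(|f|) < \infty. \]

So by Fatou’s lemma, we have

\begin{align*}
\mu(|\bar{f}|) &= \mu\left( \liminf_n \left| \frac{S_n}{n} \right| \right) \\
&\le \liminf_n \mu\left( \left| \frac{S_n}{n} \right| \right) \\
&\le \mu(|f|).
\end{align*}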

The proof of the von Neumann ergodic theorem follows easily from Birkhoff’s
ergodic theorem.
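Proof of von Neumann ergodic theorem. It is an exercise on the example sheet
to show that
\[ \|f \circ \Theta\|_p^p = \int |f \circ \Theta|^p \, d\mu = \int |f|^p \, d\mu = \|f\|_p^p. \]

So we have
\[ \left\| \frac{S_n}{n} \right\|_p = \frac{1}{n} \|f + f \circ \Theta + \cdots + f \circ \Theta^{n-1}\|_p \le \|f\|_p \]

by Minkowski’s inequality.
So let $\varepsilon > 0$, and take $M \in (0, \infty)$ so that if

\[ g = (f \vee (-M)) \wedge M, \]

then
\[ \|f - g\|_p < \frac{\varepsilon}{3}. \]
By Birkhoff’s theorem, we know

\[ \frac{S_n(g)}{n} \to \bar{g} \quad \text{a.e.} \]
Also, we know
\[ \left| \frac{S_n(g)}{n} \right| \le M \]
for all $n$. So by the bounded convergence theorem, we know

\[ \left\| \frac{S_n(g)}{n} - \bar{g} \right\|_p \to 0 \]

as $n \to \infty$. So we can find $N$ such that $n \ge N$ implies

\[ \left\| \frac{S_n(g)}{n} - \bar{g} \right\|_p < \frac{\varepsilon}{3}. \]

Then by Fatou’s lemma we have
\begin{align*}
\|\bar{f} - \bar{g}\|_p^p &= \int \liminf_n \left| \frac{S_n(f - g)}{n} \right|^p d\mu \\
&\le \liminf_n \int \left| \frac{S_n(f - g)}{n} \right|^p d\mu \\
&\le \|f - g\|_p^p.
\end{align*}

So if $n \ge N$, then we know

\[ \left\| \frac{S_n(f)}{n} - \bar{f} \right\|_p \le \left\| \frac{S_n(f - g)}{n} \right\|_p + \left\| \frac{S_n(g)}{n} - \bar{g} \right\|_p + \|\bar{g} - \bar{f}\|_p \le \varepsilon. \]

So done.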


7 Big theorems
We are now going to use all the tools we have previously developed to prove
two of the most important theorems about the sums of independent random
variables, namely the strong law of large numbers and the central limit theorem.

7.1 The strong law of large numbers


Before we start proving the strong law of large numbers, we first spend some
time discussing the difference between the strong law and the weak law. In both
cases, we have a sequence (Xn ) of iid random variables with E[Xi ] = µ. We let

Sn = X1 + · · · + Xn .

– The weak law of large numbers says $S_n/n \to \mu$ in probability as $n \to \infty$,
provided $\mathbb{E}[X_1^2] < \infty$.
– The strong law of large numbers says $S_n/n \to \mu$ a.s. provided $\mathbb{E}|X_1| < \infty$.
So we see that the strong law is indeed stronger, because convergence almost
everywhere implies convergence in measure.
We are actually going to do two versions of the strong law with different
hypotheses.
Theorem (Strong law of large numbers assuming finite fourth moments). Let
$(X_n)$ be a sequence of independent random variables such that there exists $\mu \in \mathbb{R}$
and $M > 0$ such that
\[ \mathbb{E}[X_n] = \mu, \quad \mathbb{E}[X_n^4] \le M \]
for all $n$. With $S_n = X_1 + \cdots + X_n$, we have that
\[ \frac{S_n}{n} \to \mu \quad \text{a.s. as } n \to \infty. \]
Note that in this version, we do not require that the Xn are iid. We simply
need that they are independent and have the same mean.
The proof is completely elementary.
Proof. We reduce to the case that $\mu = 0$ by setting

\[ Y_n = X_n - \mu. \]

We then have

\[ \mathbb{E}[Y_n] = 0, \quad \mathbb{E}[Y_n^4] \le 2^4 (\mathbb{E}[\mu^4 + X_n^4]) \le 2^4 (\mu^4 + M). \]

So it suffices to show that the theorem holds with $Y_n$ in place of $X_n$. So we can
assume that $\mu = 0$.
By independence, we know that for $i \ne j$, we have

\[ \mathbb{E}[X_i X_j^3] = \mathbb{E}[X_i] \mathbb{E}[X_j^3] = 0. \]

Similarly, for all $i, j, k, \ell$ distinct, we have

\[ \mathbb{E}[X_i X_j X_k^2] = \mathbb{E}[X_i X_j X_k X_\ell] = 0. \]

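Hence we know that

\[ \mathbb{E}[S_n^4] = \mathbb{E}\left[ \sum_{k=1}^n X_k^4 + 6 \sum_{1 \le i < j \le n} X_i^2 X_j^2 \right]. \]

We know the first term is bounded by $nM$, and we also know that for $i \ne j$, we
have
\[ \mathbb{E}[X_i^2 X_j^2] = \mathbb{E}[X_i^2] \mathbb{E}[X_j^2] \le \sqrt{\mathbb{E}[X_i^4] \mathbb{E}[X_j^4]} \le M \]

by Jensen’s inequality. So we know

\[ \mathbb{E}\left[ 6 \sum_{1 \le i < j \le n} X_i^2 X_j^2 \right] \le 3n(n - 1)M. \]

Putting everything together, we have

\[ \mathbb{E}[S_n^4] \le nM + 3n(n - 1)M \le 3n^2 M. \]

So we know
\[ \mathbb{E}\left[ (S_n/n)^4 \right] \le \frac{3M}{n^2}. \]

So we know
\[ \mathbb{E}\left[ \sum_{n=1}^\infty \left( \frac{S_n}{n} \right)^4 \right] \le \sum_{n=1}^\infty \frac{3M}{n^2} < \infty. \]
So we know that
\[ \sum_{n=1}^\infty \left( \frac{S_n}{n} \right)^4 < \infty \quad \text{a.s.} \]

So we know that $(S_n/n)^4 \to 0$ a.s., i.e. $S_n/n \to 0$ a.s.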
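The key estimate in this proof, $\mathbb{E}[S_n^4] \le nM + 3n(n-1)M \le 3n^2 M$, can be checked exactly in a small case. The sketch below takes the $X_i$ to be independent fair signs $\pm 1$, so that $\mathbb{E}[X_i] = 0$, $\mathbb{E}[X_i^2] = \mathbb{E}[X_i^4] = 1$ and we may take $M = 1$. It computes $\mathbb{E}[S_n^4]$ by enumerating all $2^n$ outcomes. The choice of distribution is an illustrative assumption; for it, the first inequality is in fact an equality.

```python
from fractions import Fraction
from itertools import product


def fourth_moment_of_sum(n):
    """Exact E[S_n^4] for S_n = X_1 + ... + X_n with independent fair signs X_i = +-1."""
    total = sum(sum(signs) ** 4 for signs in product((-1, 1), repeat=n))
    return Fraction(total, 2 ** n)


# With fair signs we may take M = 1 in the theorem.
moments = {n: fourth_moment_of_sum(n) for n in range(1, 11)}
for n, value in moments.items():
    print(n, value, n + 3 * n * (n - 1), 3 * n * n)
```

Each row prints $n$, the exact value of $\mathbb{E}[S_n^4]$, the quantity $nM + 3n(n-1)M$ and the cruder bound $3n^2M$.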


We are now going to get rid of the assumption that we have finite fourth
moments, but we’ll need to work with iid random variables.
Theorem (Strong law of large numbers). Let $(Y_n)$ be an iid sequence of integrable
random variables with mean $\nu$. With $S_n = Y_1 + \cdots + Y_n$, we have
\[ \frac{S_n}{n} \to \nu \quad \text{a.s.} \]
We will use the ergodic theorem to prove this. This is not the “usual” proof
of the strong law, but since we’ve done all that work on ergodic theory, we might
as well use it to get a clean proof. Most of the work left is setting up the right
setting for the proof.
Proof. Let $m$ be the law of $Y_1$ and let $Y = (Y_1, Y_2, Y_3, \cdots)$. We can view $Y$ as
a function
\[ Y : \Omega \to \mathbb{R}^{\mathbb{N}} = E. \]
We let $(E, \mathcal{E}, \mu)$ be the canonical space associated with the distribution $m$ so
that
\[ \mu = \mathbb{P} \circ Y^{-1}. \]

We let $f : E \to \mathbb{R}$ be given by

\[ f(x_1, x_2, \cdots) = X_1(x_1, x_2, \cdots) = x_1. \]

Then $X_1$ has law given by $m$, and in particular is integrable. Also, the shift map
$\Theta : E \to E$ given by
\[ \Theta(x_1, x_2, \cdots) = (x_2, x_3, \cdots) \]
is measure-preserving and ergodic. Thus, with

\[ S_n(f) = f + f \circ \Theta + \cdots + f \circ \Theta^{n-1} = X_1 + \cdots + X_n, \]

we have that
\[ \frac{S_n(f)}{n} \to \bar{f} \quad \text{a.e.} \]
by Birkhoff’s ergodic theorem. We also have convergence in $L^1$ by von Neumann’s
ergodic theorem.
Here $\bar{f}$ is $\mathcal{E}_\Theta$-measurable, and $\Theta$ is ergodic, so we know that $\bar{f} = c$ a.e. for
some constant $c$. Moreover, by the $L^1$ convergence, we have

\[ c = \mu(\bar{f}) = \lim_{n \to \infty} \mu(S_n(f)/n) = \nu. \]

Since $\mu = \mathbb{P} \circ Y^{-1}$, this says exactly that $(Y_1 + \cdots + Y_n)/n \to \nu$ a.s. So done.

7.2 Central limit theorem


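Theorem. Let $(X_n)$ be a sequence of iid random variables with $\mathbb{E}[X_i] = 0$ and
$\mathbb{E}[X_1^2] = 1$. Then if we set

\[ S_n = X_1 + \cdots + X_n, \]

then for all $x \in \mathbb{R}$, we have

\[ \mathbb{P}\left[ \frac{S_n}{\sqrt{n}} \le x \right] \to \int_{-\infty}^x \frac{e^{-y^2/2}}{\sqrt{2\pi}} \, dy = \mathbb{P}[N(0, 1) \le x] \]
as $n \to \infty$.
Proof. Let $\varphi(u) = \mathbb{E}[e^{iuX_1}]$. Since $\mathbb{E}[X_1^2] = 1 < \infty$, we can differentiate under
the expectation twice to obtain

\[ \varphi(u) = \mathbb{E}[e^{iuX_1}], \quad \varphi'(u) = \mathbb{E}[iX_1 e^{iuX_1}], \quad \varphi''(u) = \mathbb{E}[-X_1^2 e^{iuX_1}]. \]

Evaluating at 0, we have

\[ \varphi(0) = 1, \quad \varphi'(0) = 0, \quad \varphi''(0) = -1. \]

So if we Taylor expand $\varphi$ at 0, we have

\[ \varphi(u) = 1 - \frac{u^2}{2} + o(u^2). \]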

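We consider the characteristic function of $S_n/\sqrt{n}$:

\begin{align*}
\varphi_n(u) &= \mathbb{E}[e^{iuS_n/\sqrt{n}}] \\
&= \prod_{j=1}^n \mathbb{E}[e^{iuX_j/\sqrt{n}}] \\
&= \varphi(u/\sqrt{n})^n \\
&= \left( 1 - \frac{u^2}{2n} + o\left( \frac{u^2}{n} \right) \right)^n.
\end{align*}

We now take the logarithm to obtain

\begin{align*}
\log \varphi_n(u) &= n \log\left( 1 - \frac{u^2}{2n} + o\left( \frac{u^2}{n} \right) \right) \\
&= -\frac{u^2}{2} + o(1) \\
&\to -\frac{u^2}{2}.
\end{align*}
So we know that
\[ \varphi_n(u) \to e^{-u^2/2}, \]
which is the characteristic function of a $N(0, 1)$ random variable.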
So we have convergence in characteristic function, hence weak convergence,
hence convergence in distribution.
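The convergence $\varphi_n(u) \to e^{-u^2/2}$ can be seen numerically in a case where $\varphi$ is explicit. The sketch below again takes the $X_i$ to be fair signs $\pm 1$, which have mean 0, variance 1 and characteristic function $\varphi(u) = \cos u$, so that $\varphi_n(u) = \cos(u/\sqrt{n})^n$. The choice of distribution and of the test points $u$ is an illustrative assumption.

```python
import math


def cf_normalised_sum(u, n):
    """phi_n(u) = phi(u / sqrt(n))^n for fair signs X_i = +-1, where phi(u) = cos(u)."""
    return math.cos(u / math.sqrt(n)) ** n


def cf_standard_normal(u):
    """Characteristic function exp(-u^2 / 2) of N(0, 1)."""
    return math.exp(-u * u / 2)


# Largest gap between phi_n and the Gaussian limit over a few test points u.
gaps = {n: max(abs(cf_normalised_sum(u, n) - cf_standard_normal(u))
               for u in (0.5, 1.0, 2.0))
        for n in (10, 100, 1000, 10_000)}
print(gaps)
```

The gaps should shrink roughly like $1/n$, which matches the next term in the Taylor expansion of $\log \cos$.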

