Probability and Measure
Michaelmas 2016
These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.
Analysis II is essential
Contents
0 Introduction
1 Measures
1.1 Measures
1.2 Probability measures
3 Integration
3.1 Definition and basic properties
3.2 Integrals and limits
3.3 New measures from old
3.4 Integration and differentiation
3.5 Product measures and Fubini’s theorem
5 Fourier transform
5.1 The Fourier transform
5.2 Convolutions
5.3 Fourier inversion formula
5.4 Fourier transform in L2
5.5 Properties of characteristic functions
5.6 Gaussian random variables
6 Ergodic theory
6.1 Ergodic theorems
7 Big theorems
7.1 The strong law of large numbers
7.2 Central limit theorem
Index
0 Introduction
In measure theory, the main idea is that we want to assign “sizes” to different
sets. For example, we might think [0, 2] ⊆ R has size 2, while perhaps Q ⊆ R has
size 0. This is known as a measure. One of the main applications of a measure
is that we can use it to come up with a new definition of an integral. The idea
is very simple, but it is going to be very powerful mathematically.
Recall that if f : [0, 1] → R is continuous, then the Riemann integral of f is
defined as follows:
(i) Take a partition 0 = t0 < t1 < · · · < tn = 1 of [0, 1].
(ii) Approximate f by a function that is constant on each interval of the partition, and sum up the areas f(tk)(tk+1 − tk) of the resulting rectangles over the partition points 0, t1, t2, t3, . . . , tk, tk+1, . . . , 1.
We then define the integral as the limit of approximations of this type as the
mesh size of the partition → 0.
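We can sketch this limiting procedure numerically. The following is an illustration only (the choice f(x) = x², whose integral is 1/3, is arbitrary and not from the notes):

```python
# Riemann sums for a continuous f on [0, 1]: take the uniform partition
# 0 = t_0 < t_1 < ... < t_n = 1 and sum the areas f(t_k)(t_{k+1} - t_k).
def riemann_sum(f, n):
    t = [k / n for k in range(n + 1)]   # partition with mesh size 1/n
    return sum(f(t[k]) * (t[k + 1] - t[k]) for k in range(n))

# as the mesh size 1/n -> 0, the sums approach the integral of x^2, namely 1/3
approxs = [riemann_sum(lambda x: x * x, n) for n in (10, 100, 1000)]
```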
1 Measures
In the course, we will write fn ↗ f for “fn converges to f monotonically increasingly”, and fn ↘ f similarly. Unless otherwise specified, convergence is taken to be pointwise.
1.1 Measures
The starting point of all these is to come up with a function that determines
the “size” of a given set, known as a measure. It turns out we cannot sensibly
define a size for all subsets of [0, 1]. Thus, we need to restrict our attention to a
collection of “nice” subsets. Specifying which subsets are “nice” would involve
specifying a σ-algebra.
This section is mostly technical.
Definition (σ-algebra). Let E be a set. A σ-algebra E on E is a collection of
subsets of E such that
(i) ∅ ∈ E.
(ii) A ∈ E implies that A^C = E \ A ∈ E.
(iii) For any sequence (An) in E, we have that
    ⋃n An ∈ E.
Example. Let E be any countable set, and E = P (E) be the set of all subsets
of E. A mass function is any function m : E → [0, ∞]. We can then define a measure by setting
    µ(A) = Σ_{x∈A} m(x).
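As a quick sanity check of this definition, here is a toy computation, not from the notes, with E a six-element set and the uniform mass function:

```python
from fractions import Fraction

E = range(1, 7)                        # a toy countable (here finite) set
m = {x: Fraction(1, 6) for x in E}     # a mass function m : E -> [0, oo]

def mu(A):
    # the measure defined by the mass function: mu(A) = sum over x in A of m(x)
    return sum(m[x] for x in A)

evens, odds = {2, 4, 6}, {1, 3, 5}
# additivity on disjoint sets, and mu(E) = 1
assert mu(evens) + mu(odds) == mu(evens | odds) == 1
```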
Countable spaces are nice, because we can always take E = P (E), and the
measure can be defined on all possible subsets. However, for “bigger” spaces, we
have to be more careful. The set of all subsets is often “too large”. We will see
a concrete and also important example of this later.
In general, σ-algebras are often described on large spaces in terms of a smaller
set, known as the generating sets.
Definition (Generator of σ-algebra). Let E be a set, and let A ⊆ P (E) be a collection of subsets of E. We define
    σ(A) = ⋂{E′ : E′ is a σ-algebra on E with A ⊆ E′}.
In other words, σ(A) is the smallest σ-algebra that contains A. This is known as the σ-algebra generated by A.
Example. Take E = Z, and A = {{x} : x ∈ Z}. Then σ(A) is just P (E), since
every subset of E can be written as a countable union of singletons.
Example. Take E = Z, and let A = {{x, x + 1, x + 2, x + 3, · · · } : x ∈ E}. Then again σ(A) is the set of all subsets of E.
The following is the most important σ-algebra in the course:
Definition (Borel σ-algebra). Let E = R, and A = {U ⊆ R : U is open}. Then
σ(A) is known as the Borel σ-algebra, which is not the set of all subsets of R.
We can equivalently define this by à = {(a, b) : a < b, a, b ∈ Q}. Then σ(Ã)
is also the Borel σ-algebra.
Often, we would like to prove results that allow us to deduce properties
about the σ-algebra just by checking it on a generating set. However, usually,
we cannot just check it on an arbitrary generating set. Instead, the generating
set has to satisfy some nice closure properties. We are now going to introduce a bunch of different definitions that you need not aim to remember (except when exams are near).
Definition (π-system). Let A be a collection of subsets of E. Then A is called
a π-system if
(i) ∅ ∈ A
(ii) If A, B ∈ A, then A ∩ B ∈ A.
Definition (d-system). Let A be a collection of subsets of E. Then A is called
a d-system if
(i) E ∈ A
(ii) If A, B ∈ A and A ⊆ B, then B \ A ∈ A
(iii) For all increasing sequences (An) in A, we have that ⋃n An ∈ A.
The point of d-systems and π-systems is that they separate the axioms of a
σ-algebra into two parts. More precisely, we have
Proposition. A collection A is a σ-algebra if and only if it is both a π-system
and a d-system.
Therefore B ∈ D′.
Therefore D′ is a d-system contained in D, which also contains A. By our choice of D, we know D′ = D.
We now let
    D′′ = {B ∈ D : B ∩ A ∈ D for all A ∈ D}.
Since D′ = D, we again have A ⊆ D′′, and the same argument as above implies that D′′ is a d-system which is between A and D. But the only way that can happen is if D′′ = D, and this implies that D is a π-system.
After defining all sorts of things that are “weaker versions” of σ-algebras, we now define a bunch of measure-like objects that satisfy fewer properties. Again, no one really remembers these definitions:
Definition (Set function). Let A be a collection of subsets of E with ∅ ∈ A. A set function is a function µ : A → [0, ∞] such that µ(∅) = 0.
Definition (Increasing set function). A set function is increasing if it has the
property that for all A, B ∈ A with A ⊆ B, we have µ(A) ≤ µ(B).
Definition (Additive set function). A set function is additive if whenever A, B ∈ A with A ∩ B = ∅ and A ∪ B ∈ A, then µ(A ∪ B) = µ(A) + µ(B).
Definition (Countably additive set function). A set function is countably additive if whenever (An) is a sequence of disjoint sets in A with ⋃n An ∈ A, then
    µ(⋃n An) = Σn µ(An).
Recall that M denotes the collection of µ∗-measurable sets, i.e. those A ⊆ E such that
    µ∗(B) = µ∗(B ∩ A) + µ∗(B ∩ A^C)
for all B ⊆ E.
Note that it is not true in general that M = σ(A). However, we will always have M ⊇ σ(A).
We are going to break this up into five nice bite-size chunks.
Claim. µ∗ is countably subadditive.
Suppose B ⊆ ⋃n Bn. We need to show that µ∗(B) ≤ Σn µ∗(Bn). We can wlog assume that µ∗(Bn) is finite for all n, or else the inequality is trivial. Let ε > 0. Then by definition of the outer measure, for each n, we can find a sequence (B_{n,m})_{m=1}^∞ in A with the property that
    Bn ⊆ ⋃m B_{n,m}
and
    µ∗(Bn) + ε/2^n ≥ Σm µ(B_{n,m}).
Then we have
    B ⊆ ⋃n Bn ⊆ ⋃_{n,m} B_{n,m}.
µ(A) ≤ µ∗ (A).
Also, we see by definition that µ(A) ≥ µ∗ (A), since A covers A. So we get that
µ(A) = µ∗ (A) for all A ∈ A.
Claim. M contains A.
Suppose that A ∈ A and B ⊆ E. We need to show that
µ∗ (B) = µ∗ (B ∩ A) + µ∗ (B ∩ AC ).
Take any sequence (Bn) in A with B ⊆ ⋃n Bn. Then we have
    B ∩ A ⊆ ⋃n (Bn ∩ A),
    B ∩ A^C ⊆ ⋃n (Bn ∩ A^C).
µ∗ (B) = µ∗ (B ∩ E) + µ∗ (B ∩ E C )
for all B ⊆ E.
Next, note that if A ∈ M, then by definition we have, for all B,
µ∗ (B) = µ∗ (B ∩ A) + µ∗ (B ∩ AC ).
µ∗(B) = µ∗(B ∩ A1) + µ∗(B ∩ A1^C)
      = µ∗(B ∩ A1 ∩ A2) + µ∗(B ∩ A1 ∩ A2^C) + µ∗(B ∩ A1^C)
      = µ∗(B ∩ (A1 ∩ A2)) + µ∗(B ∩ (A1 ∩ A2)^C ∩ A1) + µ∗(B ∩ (A1 ∩ A2)^C ∩ A1^C)
      = µ∗(B ∩ (A1 ∩ A2)) + µ∗(B ∩ (A1 ∩ A2)^C).
So we have A1 ∩ A2 ∈ M. So M is an algebra.
Claim. M is a σ-algebra, and µ∗ is a measure on M.
To show that M is a σ-algebra, we need to show that it is closed under
countable unions. We let (An) be a disjoint collection of sets in M; we want to show that A = ⋃n An ∈ M and µ∗(A) = Σn µ∗(An).
Suppose that B ⊆ E. Then we have
    µ∗(B) = µ∗(B ∩ A1) + µ∗(B ∩ A1^C)
          = µ∗(B ∩ A1) + µ∗(B ∩ A2) + µ∗(B ∩ A1^C ∩ A2^C)
          = · · ·
          = Σ_{i=1}^n µ∗(B ∩ Ai) + µ∗(B ∩ A1^C ∩ · · · ∩ An^C)
          ≥ Σ_{i=1}^n µ∗(B ∩ Ai) + µ∗(B ∩ A^C).
Thus we obtain
µ∗ (B) ≥ µ∗ (B ∩ A) + µ∗ (B ∩ AC ).
By countable subadditivity, we also have inequality in the other direction. So
equality holds. So A ∈ M. So M is a σ-algebra.
To see that µ∗ is a measure on M, note that the above implies that
    µ∗(B) = Σ_{i=1}^∞ µ∗(B ∩ Ai) + µ∗(B ∩ A^C).
Note that when A itself is actually a σ-algebra, the outer measure can be simply written as
    µ∗(B) = inf{µ(A) : A ∈ A, B ⊆ A}.
Caratheodory gives us the existence of some measure extending the set function
on A. Could there be many? In general, there could. However, in the special
case where the measure is finite, we do get uniqueness.
Theorem. Suppose that µ1 , µ2 are measures on (E, E) with µ1 (E) = µ2 (E) <
∞. If A is a π-system with σ(A) = E, and µ1 agrees with µ2 on A, then µ1 = µ2 .
Proof. Let
D = {A ∈ E : µ1 (A) = µ2 (A)}
We know that D ⊇ A. By Dynkin’s lemma, it suffices to show that D is a
d-system. The things to check are:
(i) E ∈ D — this follows by assumption.
(ii) If A, B ∈ D with A ⊆ B, then B \ A ∈ D. Indeed, since the measures are finite, we have
    µ1(B \ A) = µ1(B) − µ1(A) = µ2(B) − µ2(A) = µ2(B \ A).
So B \ A ∈ D.
(iii) If (An) is an increasing sequence in D, then by countable additivity we have
    µ1(⋃n An) = lim_n µ1(An) = lim_n µ2(An) = µ2(⋃n An),
so ⋃n An ∈ D.
The assumption that µ1 (E) = µ2 (E) < ∞ is necessary. The theorem does
not necessarily hold without it. We can see this from a simple counterexample:
Example. Let E = Z, and let E = P (E). We let
A = {{x, x + 1, x + 2, · · · } : x ∈ E} ∪ {∅}.
This is the next best thing we can hope for after finiteness, and often proofs that involve finiteness carry over to σ-finite measures.
Proposition. The Lebesgue measure is translation invariant, i.e.
    µ(A + x) = µ(A)
for all A ∈ B and x ∈ R, where
    A + x = {y + x : y ∈ A}.
Proof sketch: the measure µx(A) = µ(A + x) agrees with µ on intervals, since
    µ([a, b]) = b − a,
and such intervals generate B.
By additivity and translation invariance, we can show that µ([p, q]) = q − p for all rational p < q. By considering µ([p, p + 1/n]) for all n and using the increasing property, we know µ({p}) = 0. So µ([p, q)) = µ((p, q]) = µ((p, q)) = q − p for all rational p, q.
Finally, by countable additivity, we can extend this to all real intervals. Then
the result follows from the uniqueness of the Lebesgue measure.
In the proof of the Caratheodory extension theorem, we constructed a measure
µ∗ on the σ-algebra M of µ∗ -measurable sets which contains A. This contains
B = σ(A), but could in fact be bigger than it. We call M the Lebesgue σ-algebra.
Indeed, it can be given by
    M = {A ∪ N : A ∈ B, N ⊆ B for some B ∈ B with µ(B) = 0}.
If A ∪ N ∈ M is of this form, then µ(A ∪ N) = µ(A). The proof is left for the example sheet.
It is also true that M is strictly larger than B, so there exists A ∈ M with A ∉ B. Construction of such a set was on last year’s exam (2016).
On the other hand, it is also true that not all sets are Lebesgue measurable.
This is a rather funny construction.
Sr = {s + r mod 1 : s ∈ S}.
While proving this directly would be rather tedious (but not too hard), it is
an immediate consequence of the following theorem:
Theorem. Suppose A1 and A2 are π-systems in F. If
    P[A ∩ B] = P[A]P[B]
for all A ∈ A1 and B ∈ A2, then σ(A1) and σ(A2) are independent.
Proof. Fix A1 ∈ A1, and define the measures
    µ(A) = P[A ∩ A1]
and
    ν(A) = P[A]P[A1]
for all A ∈ F. By assumption, we know µ and ν agree on A2, and we have that µ(Ω) = P[A1] = ν(Ω) ≤ 1 < ∞. So µ and ν agree on σ(A2). So we have
    P[A ∩ A1] = P[A]P[A1] for all A ∈ σ(A2).
To parse these definitions more easily, we can read ⋂ as “for all”, and ⋃ as “there exists”. For example, we can write
    {An i.o.} = {An infinitely often} = ⋂n ⋃_{m≥n} Am.
Similarly, we have
    {An eventually} = ⋃n ⋂_{m≥n} Am.
We are now going to prove two “obvious” results, known as the Borel–Cantelli
lemmas. These give us necessary conditions for an event to happen infinitely
often, and in the case where the events are independent, the condition is also
sufficient.
Lemma (Borel–Cantelli lemma). If
    Σn P[An] < ∞,
then
    P[An i.o.] = 0.
Proof. For each k, we have
    P[An i.o.] = P[⋂n ⋃_{m≥n} Am]
               ≤ P[⋃_{m≥k} Am]
               ≤ Σ_{m=k}^∞ P[Am]
               → 0
as k → ∞, since the tail of a convergent series vanishes.
Lemma (Borel–Cantelli lemma II). If the events (An) are independent and
    Σn P[An] = ∞,
then
    P[An i.o.] = 1.
So we are done.
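The dichotomy can be illustrated by simulation. This sketch is not from the notes: it takes independent events An with P[An] = 1/n² (summable) versus P[An] = 1/n (not summable) and counts how many occur along one sample path.

```python
import random

rng = random.Random(2)
N = 100_000

# P[A_n] = 1/n^2: the series converges, so a.s. only finitely many A_n occur
occ_summable = sum(1 for n in range(1, N + 1) if rng.random() < 1 / n**2)

# P[A_n] = 1/n: the series diverges and the events are independent,
# so a.s. infinitely many A_n occur (second Borel-Cantelli lemma)
occ_divergent = sum(1 for n in range(1, N + 1) if rng.random() < 1 / n)
```

On a typical run the first count stays near Σ 1/n² = π²/6 ≈ 1.64, while the second keeps growing like log N.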
A function f : (E, E) → (G, G) is measurable if for every A ∈ G, we have
    f^{−1}(A) = {x ∈ E : f(x) ∈ A} ∈ E.
Given measurable spaces (E, E) and (G, G), the product E × G comes with projections
    π1 : E × G → E,   π2 : E × G → G,
and we define the product σ-algebra
    E ⊗ G = σ({A × B : A ∈ E, B ∈ G}).
Proof. If the map (fi) is measurable, then by composition with the projections πi, we know that each fi is measurable.
Conversely, if all fi are measurable, then since the σ-algebra of ∏ Fi is generated by sets of the form {π_j^{−1}(A) : A ∈ Fj}, and the pullback of such sets along (fi) is exactly f_j^{−1}(A), we know the function (fi) is measurable.
Using this, we can prove that a whole lot more functions are measurable.
Proposition. Let (E, E) be a measurable space. Let (fn : n ∈ N) be a sequence
of non-negative measurable functions on E. Then the following are measurable:
    f1 + f2, f1 f2, max{f1, f2}, min{f1, f2}, inf_n fn, sup_n fn, lim inf_n fn, lim sup_n fn.
The same is true with “non-negative” replaced with “real”, provided the new functions are real (i.e. not infinite).
Proof. This is an (easy) exercise on the example sheet. For example, the sum
f1 + f2 can be written as the following composition.
    E → [0, ∞]² → [0, ∞],
where the first map is (f1, f2) and the second is +. We know the second map is continuous, hence measurable. The first function is also measurable since the fi are. So the composition is also measurable.
The product follows similarly, but for the infimum and supremum, we need to check explicitly that the corresponding map [0, ∞]^N → [0, ∞] is measurable.
Notation. We will write
We are now going to prove the monotone class theorem, which is a “Dynkin’s
lemma” for measurable functions. As in the case of Dynkin’s lemma, it will
sound rather awkward but will prove itself to be very useful.
Note that the conditions on V are rather like the conditions for a d-system, where taking a bounded, monotone limit is something like taking increasing unions.
Proof. We first show that 1A ∈ V for all A ∈ E. To do so, we let
    D = {A ∈ E : 1A ∈ V}.
We want to show that D = E. To do this, we have to show that D is a d-system.
(i) Since 1E ∈ V, we know E ∈ D.
(ii) If 1A ∈ V , then 1 − 1A = 1E\A ∈ V. So E \ A ∈ D.
(iii) If (An) is an increasing sequence in D, then 1_{An} → 1_{⋃n An} monotonically increasingly. So 1_{⋃n An} ∈ V, i.e. ⋃n An ∈ D.
So, by Dynkin’s lemma, we know D = E. So V contains indicators of all measur-
able sets. We will now try to obtain any measurable function by approximating.
Suppose that f is bounded and non-negative measurable. We want to show
that f ∈ V. To do this, we approximate it by letting
    fn = 2^{−n} ⌊2^n f⌋ = Σ_{k=0}^∞ k 2^{−n} 1_{{k 2^{−n} ≤ f < (k+1) 2^{−n}}}.
x ≤ g(y) ⇔ f (x) ≤ y.
Jx = {y ∈ R : x ≤ g(y)}.
Jx = [f (x), ∞).
x ≤ g(y) ⇔ f (x) ≤ y.
then f is given by
Then we have
    X(ω) ≤ x ⇐⇒ ω ≤ F(x).
So we have
FX (x) = P[X ≤ x] = P[(0, F (x)]] = F (x).
Therefore FX = F .
This construction is actually very useful in practice. If we are writing
a computer program and want to sample a random variable, we will use this
procedure. The computer usually comes with a uniform (pseudo)-random number
generator. Then using this procedure allows us to produce random variables of
any distribution from a uniform sample.
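A minimal sketch of the procedure (the Exponential(1) target here is an arbitrary choice, not from the notes): sample U uniform on (0, 1) and set X = inf{x : U ≤ F(x)}, which for F(x) = 1 − e^{−x} is −log(1 − U).

```python
import math
import random

def sample_from_F(rng):
    # X = F^{-1}(U) for F(x) = 1 - e^{-x}, i.e. the Exponential(1) law
    u = rng.random()              # uniform (pseudo)-random sample on [0, 1)
    return -math.log(1.0 - u)     # = inf{x : u <= F(x)}

rng = random.Random(0)
xs = [sample_from_F(rng) for _ in range(100_000)]
sample_mean = sum(xs) / len(xs)   # should be close to E[X] = 1
```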
The next thing we want to consider is the notion of independence of random
variables. Recall that for random variables X, Y , we used to say that they are
independent if for any measurable A, B, we have
    P[X ∈ A, Y ∈ B] = P[X ∈ A] P[Y ∈ B].
But this is exactly the statement that the σ-algebras generated by X and Y are
independent!
Definition (Independence of random variables). A family (Xn ) of random vari-
ables is said to be independent if the family of σ-algebras (σ(Xn )) is independent.
We write each ω ∈ (0, 1) in its binary expansion
    ω = Σ_{n=1}^∞ ωn 2^{−n},
where ωn ∈ {0, 1}. We make the binary expansion unique by disallowing infinite sequences of zeroes.
We define Rn (ω) = ωn . We will show that Rn is measurable. Indeed, we can
write
R1 (ω) = ω1 = 1(1/2,1] (ω),
where 1(1/2,1] is the indicator function. Since indicator functions of measurable
sets are measurable, we know R1 is measurable. Similarly, we have
    R2(ω) = ω2 = 1_{(1/4,1/2]}(ω) + 1_{(3/4,1]}(ω).
So this is also a measurable function. More generally, we can do this for any Rn(ω): we have
    Rn(ω) = Σ_{j=1}^{2^{n−1}} 1_{(2^{−n}(2j−1), 2^{−n}(2j)]}(ω).
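The formula can be checked mechanically against the binary digits. In this sketch (not from the notes), non-dyadic rationals are used, so the two descriptions agree exactly:

```python
from fractions import Fraction
from math import floor

def R_interval(n, omega):
    # R_n via the union of the intervals (2^{-n}(2j-1), 2^{-n}(2j)]
    for j in range(1, 2 ** (n - 1) + 1):
        if Fraction(2 * j - 1, 2 ** n) < omega <= Fraction(2 * j, 2 ** n):
            return 1
    return 0

def R_digit(n, omega):
    # the n-th binary digit of omega (valid at non-dyadic omega)
    return floor(2 ** n * omega) % 2

for omega in (Fraction(1, 3), Fraction(2, 7), Fraction(5, 11)):
    for n in range(1, 9):
        assert R_interval(n, omega) == R_digit(n, omega)
```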
Then we have
    P[Rn = 0] = 1 − P[Rn = 1] = 1/2
as well. So Rn ∼ Bernoulli(1/2).
We can straightforwardly check that (Rn) is an independent sequence, since for n ≠ m, we have
    P[Rn = 0 and Rm = 0] = 1/4 = P[Rn = 0] P[Rm = 0].
We will now use the (Rn ) to construct any independent sequence for any
distribution.
Proposition. Let m : N² → N be a bijection, and set
    Yk,n = R_{m(k,n)}.
Then the (Yk,n) are again independent Bernoulli(1/2) random variables.
The strong law of large numbers, which we will prove later, says that
    P[{ω : (1/n) Σ_{j=1}^n Rj(ω) → 1/2}] = 1.
So “almost every number” in (0, 1) has an equal proportion of 0’s and 1’s in its
binary expansion. This is known as the normal number theorem.
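A quick Monte Carlo illustration (not from the notes): generating the digits ω1, ω2, . . . as fair coin flips, the running proportion of 1’s settles near 1/2.

```python
import random

rng = random.Random(1)
n = 100_000
digits = [rng.getrandbits(1) for _ in range(n)]   # binary digits of a "random" omega
proportion_of_ones = sum(digits) / n              # (1/n) * sum_j R_j(omega)
```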
We have previously seen that lim sup |fn − f | is non-negative measurable. So the
set {x ∈ E : lim sup |fn (x) − f (x)| > 0} is measurable.
Another useful notion of convergence is convergence in measure.
Definition (Convergence in measure). Suppose that (E, E, µ) is a measure space, and that (fn), f are measurable functions. We say fn → f in measure if for each ε > 0, we have
    µ({x ∈ E : |fn(x) − f(x)| ≥ ε}) → 0 as n → ∞.
In the probability setting, this reads
    P(|Xn − X| ≥ ε) → 0 as n → ∞
for all ε, which is how we stated the weak law of large numbers in the past.
After we define integration, we can consider the norms of a function f given by
    ‖f‖p = (∫ |f(x)|^p dx)^{1/p}.
Proof.
(i) First suppose µ(E) < ∞, and fix ε > 0. Consider the sets {x : |fn(x) − f(x)| ≤ ε}. We use the result from the first example sheet that for any sequence of events (An), we have
    µ(lim inf An) ≤ lim inf µ(An).
Applying this, we get
    lim inf µ({x : |fn(x) − f(x)| ≤ ε}) ≥ µ({x : |fn(x) − f(x)| ≤ ε eventually})
                                        ≥ µ({x ∈ E : |fn(x) − f(x)| → 0})
                                        = µ(E).
Then we have
    Σ_{k=1}^∞ µ({x ∈ E : |f_{nk}(x) − f(x)| > 1/k}) ≤ Σ_{k=1}^∞ 2^{−k} = 1 < ∞.
So f_{nk} → f a.e.
It is important that we assume that µ(E) < ∞ for the first part.
Example. Consider (E, E, µ) = (R, B, Lebesgue). Take fn(x) = 1_{[n,∞)}(x). Then fn(x) → 0 for all x, and in particular almost everywhere. However, we have
    µ({x ∈ R : |fn(x)| > 1/2}) = µ([n, ∞)) = ∞
for all n. So fn does not converge to 0 in measure.
There is one last type of convergence we are interested in. We will only
first formulate it in the probability setting, but there is an analogous notion in
measure theory known as weak convergence, which we will discuss much later on
in the course.
Note that here we do not need that (Xn ) and X live on the same probability
space, since we only talk about the distribution functions.
But why do we have the condition with continuity points? The idea is that
if the resulting distribution has a “jump” at x, it doesn’t matter which side of
the jump FX (x) is at. Here is a simple example that tells us why this is very
important:
Example. Let Xn be uniform on [0, 1/n]. Intuitively, this should converge to the random variable X that is always zero. We can compute
    FXn(x) = 0 for x ≤ 0,  nx for 0 < x < 1/n,  1 for x ≥ 1/n.
For x < 0 we have FXn(x) = 0 = FX(x), and for x > 0 we have FXn(x) → 1 = FX(x). But FXn(0) = 0 for all n, while FX(0) = 1. So we get convergence everywhere except at 0, which is exactly the point of discontinuity of FX.
One might now think of cheating by cooking up some random variable such
that F is discontinuous at so many points that random, unrelated things converge
to F . However, this cannot be done, because F is a non-decreasing function,
and thus can only have countably many points of discontinuities.
The big theorem we are going to prove about convergence in distribution is
that actually it is very boring and doesn’t give us anything new.
We similarly have
We let
Recall from before that X̃n has the same distribution function as Xn for
all n, and X̃ has the same distribution as X. Moreover, we have
Ω0 = {ω ∈ (0, 1) : X̃ is continuous at ω}.
P[Ω0 ] = 1.
We are now going to show that X̃n (ω) → X̃(ω) for all ω ∈ Ω0 .
Note that FX is a non-decreasing function; letting S denote its set of continuity points, the complement R \ S is countable. So S is dense in R. Fix ω ∈ Ω0 and ε > 0.
We want to show that |X̃n (ω) − X̃(ω)| ≤ ε for all n large enough.
Since S is dense in R, we can find x−, x+ ∈ S with x+ − x− < ε such that
    x− < X̃(ω) ≤ X̃(ω+) < x+.
Then we have
FX (x− ) < ω < ω + ≤ FX (x+ ).
So for sufficiently large n, we have
    FXn(x−) < ω ≤ FXn(x+).
So we have
x− < X̃n (ω) ≤ x+ ,
and we are done.
since this is just the set of all points where the previous two things agree.
Theorem (Kolmogorov 0-1 law). Let (Xn ) be a sequence of independent (real-
valued) random variables. If A ∈ T , then P[A] = 0 or 1.
Moreover, if X is a T -measurable random variable, then there exists a
constant c such that
P[X = c] = 1.
Proof. The proof is very funny the first time we see it. We are going to prove the theorem by checking something that seems very strange. We are going to show that if A ∈ T, then A is independent of itself. It then follows that
    P[A] = P[A ∩ A] = P[A] P[A],
so P[A] ∈ {0, 1}. To do so, we consider events of the form
    A = {X1 ≤ x1, · · · , Xn ≤ xn}.
Since the Xn are independent, we know for any such A and B, we have
P[A ∩ B] = P[A]P[B].
P[X ≤ x] ∈ {0, 1}
3 Integration
3.1 Definition and basic properties
We are now going to work towards defining the integral of a measurable function
on a measure space (E, E, µ). Different sources use different notations for the
integral. The following notations are all commonly used:
    µ(f) = ∫_E f dµ = ∫_E f(x) dµ(x) = ∫_E f(x) µ(dx).
In the case where (E, E, µ) = (R, B, Lebesgue), people often just write this as
    µ(f) = ∫_R f(x) dx.
The integral of a simple function f = Σ_{k=1}^n ak 1_{Ak} (with ak ≥ 0 and Ak ∈ E) is given by
    µ(f) = Σ_{k=1}^n ak µ(Ak).
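On a finite toy space, this formula can be evaluated directly and compared with a pointwise sum. The set-up below (a uniform mass function on eight points, coefficients 2 and 5) is an arbitrary illustration, not from the notes:

```python
from fractions import Fraction

E = range(8)
m = {x: Fraction(1, 8) for x in E}               # uniform mass function

def mu(A):
    return sum(m[x] for x in A)

# the simple function f = 2 * 1_{A1} + 5 * 1_{A2} with disjoint A1, A2
A1, A2 = {0, 1, 2}, {5, 6}
coeffs = [(2, A1), (5, A2)]

integral = sum(a * mu(A) for a, A in coeffs)     # mu(f) = sum_k a_k mu(A_k)
pointwise = sum(sum(a for a, A in coeffs if x in A) * m[x] for x in E)
assert integral == pointwise                     # both equal 2
```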
So done.
We next consider the case where f = 1A for some A. Fix ε > 0, and set
An = {fn > 1 − ε} ∈ E.
    (1 − ε) 1_{An} ≤ fn ≤ f = 1A.
As An ↗ A, we have that
So we have
    µ(fn) = Σ_{k=1}^m µ(fn 1_{Ak}) = Σ_{k=1}^m ak µ(a_k^{−1} fn 1_{Ak}) → Σ_{k=1}^m ak µ(Ak) = µ(f).
However, we also know that µ(fn) ≤ µ(f) for all n, by definition of the integral. So we must have equality. So we have
    µ(f) = lim_{n→∞} µ(fn).
Then
    {x : f(x) ≠ 0} = ⋃n An.
Since the left hand set has positive measure, it follows that there is some An with positive measure. For that n, we define
    h = (1/n) 1_{An}.
Then µ(f) ≥ µ(h) > 0. So µ(f) ≠ 0.
Conversely, suppose f = 0 a.e. We let
    fn = 2^{−n} ⌊2^n f⌋ ∧ n
To finish the proof of (i), we have to show that µ(f + g) = µ(f ) + µ(g).
We know that this is true for non-negative functions, so we need to employ
a little trick to make this a statement about the non-negative version. If
we let h = f + g, then we can write this as
    h+ − h− = (f+ − f−) + (g+ − g−).
Rearranging, we obtain
    h+ + f− + g− = f+ + g+ + h−.
Then A± ∈ E, and
µ(f 1A+ ) = µ(f 1A− ) = 0.
So f 1A+ and f 1A− vanish a.e. So f vanishes a.e.
Proposition. Suppose that (gn ) is a sequence of non-negative measurable
functions. Then we have
    µ(Σ_{n=1}^∞ gn) = Σ_{n=1}^∞ µ(gn).
Proof. We know
    Σ_{n=1}^N gn ↗ Σ_{n=1}^∞ gn
as N → ∞, so the result follows from the monotone convergence theorem.
So we can just view our proposition as proving that we can swap the order of
two integrals. The general statement is known as Fubini’s theorem.
Proof. We start with the trivial observation that
    inf_{m≥n} fm ≤ fk
for all k ≥ n.
So we have
    µ(inf_{m≥n} fm) ≤ inf_{k≥n} µ(fk) ≤ lim inf_m µ(fm).
It remains to show that the left hand side converges to µ(lim inf fm). Indeed, we know that
    inf_{m≥n} fm ↗ lim inf_m fm,
so by monotone convergence the left hand side converges to µ(lim inf_m fm). So we have
    µ(lim inf_m fm) ≤ lim inf_m µ(fm).
No one ever remembers which direction Fatou’s lemma goes, and this leads to
many incorrect proofs and results, so it is helpful to keep the following example
in mind:
Example. We let (E, E, µ) = (R, B, Lebesgue), and
    fn = 1_{[n,n+1]}.
Then we have
    lim inf_n fn = 0,  while  µ(fn) = 1 for all n.
So we have
    µ(lim inf fn) = 0 < 1 = lim inf µ(fn),
and the inequality in Fatou’s lemma
    µ(lim inf_m fm) ≤ lim inf_m µ(fm)
can be strict.
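A discrete analogue of this example (an illustration, not from the notes) uses the counting measure on {0, . . . , N − 1} with fn the indicator of {n}; the mass again “escapes to infinity”:

```python
N = 50   # truncation level for the illustration

def f(n):
    # indicator function of the singleton {n}
    return [1 if k == n else 0 for k in range(N)]

def mu(g):
    # integral against the counting measure is just the plain sum
    return sum(g)

integrals = [mu(f(n)) for n in range(N)]            # mu(f_n) = 1 for every n
liminf_f = [min(f(n)[k] for n in range(k + 1, N))   # pointwise inf over the tail n > k
            for k in range(N - 1)]
# mu(liminf f) = 0 < 1 = liminf mu(f_n), consistent with Fatou's lemma
```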
So we know that
µ(|f |) ≤ µ(g) < ∞.
So we know that f , fn are integrable.
Now note also that
    0 ≤ g + fn,   0 ≤ g − fn
for all n. We are now going to apply Fatou’s lemma twice, once to each of these sequences. We have that
EA = {B ∈ E : B ⊆ A},
µA (B) = µ(B)
for all B ∈ EA .
It is easy to check the following:
Lemma. For (E, E, µ) a measure space and A ∈ E, the restriction (A, EA, µA) is a measure space.
Similarly, we have
Proposition. If f is integrable, then f |A is µA -integrable and µA (f |A ) =
µ(f 1A ).
Note that this means we have
    µ(f 1A) = ∫_E f 1A dµ = ∫_A f dµA.
ν = µ ◦ f −1
ν(g) = µ(g ◦ f ).
Proof. Exercise using the monotone class theorem (see example sheet).
Finally, we can specify a measure by specifying a density.
Definition (Density). Let (E, E, µ) be a measure space, and f be a non-negative
measurable function. We define
ν(A) = µ(f 1A ).
Proof.
(i) ν(∅) = µ(f 1∅) = µ(0) = 0.
(ii) If (An) is a disjoint sequence in E, then
    ν(⋃n An) = µ(f 1_{⋃n An}) = µ(Σn f 1_{An}) = Σn µ(f 1_{An}) = Σn ν(An).
In this case, for any non-negative measurable g, we have that
    E[g(X)] = ∫_R g(x) fX(x) dx.
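As a numerical sanity check (not from the notes), take X with density fX(x) = e^{−x} on [0, ∞) and g(x) = x², so E[g(X)] = 2; a simple trapezoidal approximation of the integral recovers this:

```python
import math

def trapezoid(h, a, b, n=50_000):
    # trapezoidal rule for the integral of h over [a, b]
    dx = (b - a) / n
    s = 0.5 * (h(a) + h(b)) + sum(h(a + i * dx) for i in range(1, n))
    return s * dx

g = lambda x: x * x
f_X = lambda x: math.exp(-x)          # Exponential(1) density
# E[g(X)] = integral of g(x) f_X(x) dx; [0, 50] truncates the negligible tail
expectation = trapezoid(lambda x: g(x) * f_X(x), 0.0, 50.0)
```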
Proof. We let V be the set of bounded Borel functions g for which the change of variables formula holds. We will want to use the monotone class theorem to show that this includes all bounded functions.
bounded functions.
We already know that
(i) V contains 1A for all A in the π-system of intervals of the form [u, v] ⊆ [a, b].
This is just the fundamental theorem of calculus.
(ii) By linearity of the integral, V is indeed a vector space.
(iii) Finally, let (gn) be a sequence in V with gn ≥ 0 and gn ↗ g for some bounded g. Then we know that
    ∫_{φ(a)}^{φ(b)} gn(y) dy = ∫_a^b gn(φ(x)) φ′(x) dx,
and taking the limit on both sides (by monotone convergence) shows g ∈ V.
Then by the monotone class theorem, V contains all bounded Borel functions.
The next problem is differentiation under the integral sign. We want to know when we can say
    d/dt ∫ f(x, t) dx = ∫ ∂f/∂t (x, t) dx.
Theorem (Differentiation under the integral sign). Let (E, E, µ) be a space,
and U ⊆ R be an open set, and f : U × E → R. We assume that
(i) For any t ∈ U fixed, the map x 7→ f (t, x) is integrable;
(ii) For any x ∈ E fixed, the map t 7→ f (t, x) is differentiable;
(iii) There exists an integrable function g such that
    |∂f/∂t (t, x)| ≤ g(x)
for all t ∈ U and x ∈ E.
Then the map
    F(t) = ∫_E f(t, x) dµ
is differentiable, and
    F′(t) = ∫_E ∂f/∂t (t, x) dµ.
Proof. Take a sequence hn → 0 with hn ≠ 0, and set
    gn(x) = (f(t + hn, x) − f(t, x))/hn − ∂f/∂t (t, x).
Since f is differentiable in t, we know that gn(x) → 0 as n → ∞. Moreover, by the mean value theorem, we know that |gn(x)| ≤ 2g(x). Now by definition of F, we have
    (F(t + hn) − F(t))/hn − ∫_E ∂f/∂t (t, x) dµ = ∫_E gn(x) dµ.
By dominated convergence, the right hand side tends to 0, so
    lim_{n→∞} (F(t + hn) − F(t))/hn = ∫_E ∂f/∂t (t, x) dµ.
Since the sequence (hn) was arbitrary, it follows that F′(t) exists and is equal to the integral.
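The conclusion is easy to test numerically. In this sketch (not from the notes), f(t, x) = e^{−t x²} on U × E = R × [0, 1] satisfies the hypotheses, and a finite difference of F matches the integral of ∂f/∂t:

```python
import math

def trapezoid(h, a, b, n=10_000):
    # trapezoidal rule for the integral of h over [a, b]
    dx = (b - a) / n
    s = 0.5 * (h(a) + h(b)) + sum(h(a + i * dx) for i in range(1, n))
    return s * dx

def F(t):
    # F(t) = integral over [0, 1] of f(t, x) = exp(-t x^2)
    return trapezoid(lambda x: math.exp(-t * x * x), 0.0, 1.0)

t, eps = 0.7, 1e-5
lhs = (F(t + eps) - F(t - eps)) / (2 * eps)                         # F'(t), numerically
rhs = trapezoid(lambda x: -x * x * math.exp(-t * x * x), 0.0, 1.0)  # integral of df/dt
```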
A = {A1 × A2 : A1 ∈ E1, A2 ∈ E2},
E = E1 ⊗ E2 = σ(A).
Doing this has the advantage that it would help us in a step of proving Fubini’s
theorem.
However, before we can make this definition, we need to do some preparation
to make sure the above statement actually makes sense:
Lemma. Let E = E1 ⊗ E2 be a product of σ-algebras. Suppose f : E1 × E2 → R is an E-measurable function. Then
(i) For each x2 ∈ E2, the map x1 ↦ f(x1, x2) is E1-measurable.
(ii) If f is bounded or non-negative measurable, then the map
    x2 ↦ ∫_{E1} f(x1, x2) µ1(dx1)
is E2-measurable.
Proof. The first part follows immediately from the fact that for a fixed x2 ,
the map ι1 : E1 → E given by ι1 (x1 ) = (x1 , x2 ) is measurable, and that the
composition of measurable functions is measurable.
For the second part, we use the monotone class theorem. We let V be the set of all bounded measurable functions f such that
    x2 ↦ ∫_{E1} f(x1, x2) µ1(dx1)
is E2-measurable.
(i) It is clear that 1E , 1A ∈ V for all A ∈ A (where A is as in the definition
of the product σ-algebra).
(ii) V is a vector space by linearity of the integral.
(iii) Suppose (fn) is a non-negative sequence in V and fn ↗ f. Then
    x2 ↦ ∫_{E1} fn(x1, x2) µ1(dx1)   ↗   x2 ↦ ∫_{E1} f(x1, x2) µ1(dx1)
by monotone convergence, and a monotone limit of measurable functions is measurable.
Here the previous lemma is very important. It tells us that these integrals
actually make sense!
We first check that this is a measure:
(i) µ(∅) = 0 is immediate since 1∅ = 0.
(ii) Suppose (An) is a disjoint sequence and A = ⋃n An. Then we have
    µ(A) = ∫_{E1} ∫_{E2} 1A(x1, x2) µ2(dx2) µ1(dx1)
         = ∫_{E1} ∫_{E2} (Σn 1_{An}(x1, x2)) µ2(dx2) µ1(dx1).
We now use the fact that integration commutes with the sum of non-negative measurable functions to get
    µ(A) = ∫_{E1} Σn (∫_{E2} 1_{An}(x1, x2) µ2(dx2)) µ1(dx1)
         = Σn ∫_{E1} ∫_{E2} 1_{An}(x1, x2) µ2(dx2) µ1(dx1)
         = Σn µ(An).
The same proof would go through, so we have another measure on the space.
However, by uniqueness, we know they must be the same! Fubini’s theorem
generalizes this to arbitrary functions.
Theorem (Fubini’s theorem).
(i) If f is non-negative measurable, then
    µ(f) = ∫_{E1} (∫_{E2} f(x1, x2) µ2(dx2)) µ1(dx1).   (∗)
In particular, we have
    ∫_{E1} (∫_{E2} f(x1, x2) µ2(dx2)) µ1(dx1) = ∫_{E2} (∫_{E1} f(x1, x2) µ1(dx1)) µ2(dx2).
then
µ1 (E1 \ A) = 0.
If we set
    f1(x1) = ∫_{E2} f(x1, x2) µ2(dx2)  for x1 ∈ A,   f1(x1) = 0  for x1 ∉ A,
then f1 is a µ1-integrable function and
    µ1(f1) = µ(f).
Proof.
(i) Let V be the set of all measurable functions such that (∗) holds. Then V
is a vector space since integration is linear.
Then we have
    f1 = (f1+ − f1−) 1_{A1},
so the result follows by (i).
Since R is σ-finite, we know that we can sensibly talk about the d-fold product
of the Lebesgue measure on R to obtain the Lebesgue measure on Rd .
What σ-algebra is the Lebesgue measure on Rd defined on? We know the
Lebesgue measure on R is defined on B. So the Lebesgue measure is defined on
B ⊗ · · · ⊗ B = σ({B1 × · · · × Bd : Bi ∈ B}).
By looking at the definition of the product topology, we see that this is just the
Borel σ-algebra on Rd !
Recall that when we constructed the Lebesgue measure, the Caratheodory
extension theorem yields a measure on the “Lebesgue σ-algebra” M, which
was strictly bigger than the Borel σ-algebra. It was shown in the first example
sheet that M is complete, i.e. if we have A ⊆ B ⊆ R with B ∈ M, µ(B) = 0,
then A ∈ M. We can also take the Lebesgue measure on Rd to be defined on
M ⊗ · · · ⊗ M. However, it happens that M ⊗ M together with the Lebesgue
measure on R2 is no longer complete (proof is left as an exercise for the reader).
We now turn to probability. Recall that random variables X1 , · · · , Xn are independent iff the σ-algebras σ(X1 ), · · · , σ(Xn ) are independent. We will show that random variables are independent iff their joint law is the product of their individual laws. Here each Xk takes values in a measurable space (Ek , Ek ), and we set
E = E1 × · · · × En ,  E = E1 ⊗ · · · ⊗ En .
Proof.
– (i) ⇒ (ii): Let ν = µX1 × · · · × µXn . We want to show that ν = µX . To do so, we just have to check that they agree on a π-system generating the entire σ-algebra. We let
A = {A1 × · · · × An : Ak ∈ Ek for each k}.
For such a set A = A1 × · · · × An , we have
µX (A) = P[X ∈ A] = P[X1 ∈ A1 , · · · , Xn ∈ An ].
By independence, we have
\[ \mathrm{P}[X_1 \in A_1, \cdots, X_n \in A_n] = \prod_{k=1}^n \mathrm{P}[X_k \in A_k] = \nu(A). \]
Conversely, if the laws factorize as a product, then for bounded measurable f1 , · · · , fn we get
\[ \mathrm{E}\left[ \prod_{k=1}^n f_k(X_k) \right] = \prod_{k=1}^n \mathrm{E}[f_k(X_k)]. \]
So X1 , · · · , Xn are independent.
However, it is not clear that this is a norm. First of all, kf kp = 0 does not
imply that f = 0. It only means that f = 0 a.e. But this is easy to solve. We
simply quotient out the vector space by functions that differ on a set of measure
zero. The more serious problem is that we don’t know how to prove the triangle
inequality.
To do so, we are going to prove some inequalities. Apart from enabling us to
show that k · kp is indeed a norm, they will also be very helpful in the future
when we want to bound integrals.
(Figure: a convex function lies below the chord joining (x, c(x)) and (y, c(y)); at (1 − t)x + ty the chord sits above the graph.)
Theorem (Jensen's inequality). Let X be an integrable random variable taking values in an interval I, and let c : I → R be convex. Then E[c(X)] is well-defined and
E[c(X)] ≥ c(E[X]).
It is crucial that this only applies to a probability space. We need the total
mass of the measure space to be 1 for it to work. Just being finite is not enough.
Jensen’s inequality will be an easy consequence of the following lemma:
Lemma. If c : I → R is a convex function and m is in the interior of I, then
there exists real numbers a, b such that
c(x) ≥ ax + b for all x ∈ I, with equality at x = m.
(Figure: the supporting line ax + b touches the graph of the convex function at m.)
If the function is differentiable, then we can easily extract this from the
derivative. However, if it is not, then we need to be more careful.
Proof. If c is smooth, then we know c00 ≥ 0, and thus c0 is non-decreasing. We
are going to show an analogous statement that does not mention the word
“derivative”. Consider x < m < y with x, y, m ∈ I. We want to show that
\[ \frac{c(m) - c(x)}{m - x} \le \frac{c(y) - c(m)}{y - m}. \]
To show this, we turn off our brains and do the only thing we can do. We can
write
m = tx + (1 − t)y
for some t ∈ [0, 1]. Then convexity tells us
c(m) ≤ t c(x) + (1 − t) c(y).
To conclude, we simply have to compute the actual value of t and plug it in. We
have
\[ t = \frac{y - m}{y - x}, \qquad 1 - t = \frac{m - x}{y - x}. \]
So we obtain
\[ \frac{y - m}{y - x}\,(c(m) - c(x)) \le \frac{m - x}{y - x}\,(c(y) - c(m)). \]
Cancelling the y − x and dividing by the factors gives the desired result.
Now since x and y are arbitrary, we know there is some a ∈ R such that
\[ \frac{c(m) - c(x)}{m - x} \le a \le \frac{c(y) - c(m)}{y - m} \]
for all x < m < y. If we rearrange, then we obtain
c(t) ≥ a(t − m) + c(m)
for all t ∈ I, so we can take b = c(m) − am.
Proof of Jensen’s inequality. To apply the previous result, we need to pick a
right m. We take
m = E[X].
To apply this, we need to know that m is in the interior of I. So we assume that
X is not a.s. constant (that case is boring). By the lemma, we can find some
a, b ∈ R such that
c(X) ≥ aX + b.
We want to take the expectation of the LHS, but we have to make sure the
E[c(X)] is a sensible thing to talk about. To make sure it makes sense, we show
that E[c(X)− ] = E[(−c(X)) ∨ 0] is finite.
We simply bound
c(X)− = (−c(X)) ∨ 0 ≤ |aX + b| ≤ |a||X| + |b|.
So we have
E[c(X)− ] ≤ |a|E|X| + |b| < ∞
since X is integrable. So E[c(X)] makes sense.
We then just take expectations of both sides to obtain
E[c(X)] ≥ aE[X] + b = am + b = c(m) = c(E[X]).
So done.
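A quick Monte Carlo sanity check (an illustration, not part of the notes): take X ∼ Exp(1), which is integrable and not a.s. constant, with the convex function c(x) = x².

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=200_000)  # an integrable, non-constant X

# Jensen with the convex function c(x) = x^2: E[X^2] >= (E[X])^2.
lhs = np.mean(X ** 2)   # estimates E[c(X)]  (exactly 2 for Exp(1))
rhs = np.mean(X) ** 2   # estimates c(E[X])  (exactly 1 for Exp(1))
print(lhs, rhs)
```

For c(x) = x², the gap E[c(X)] − c(E[X]) is just the variance, which makes the strictness of the inequality visible.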
Now use the fact that (E|X|)q ≤ E[|X|q ] since x 7→ xq is convex for q > 1. Then
we obtain
\[ \le \mathrm{E}\left[ \frac{|g|^q}{|f|^{(p-1)q}}\, 1_{\{|f| > 0\}} \right]^{1/q}. \]
To do so, we notice that if 1/p + 1/q = 1, then the concavity of log tells us that for any a, b > 0, we have
\[ \frac{1}{p} \log a + \frac{1}{q} \log b \le \log\left( \frac{a}{p} + \frac{b}{q} \right). \]
Replacing a with a^p and b with b^q, and then taking exponentials, tells us
\[ ab \le \frac{a^p}{p} + \frac{b^q}{q}. \]
While we assumed a, b > 0 when deriving, we observe that it is also valid when
some of them are zero. So we have
\[ \int |f||g|\, \mathrm{d}\mu \le \int \left( \frac{|f|^p}{p} + \frac{|g|^q}{q} \right) \mathrm{d}\mu = \frac{1}{p} + \frac{1}{q} = 1, \]
using the normalization kf kp = kgkq = 1.
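With respect to counting measure, Hölder's inequality reads Σ|fg| ≤ (Σ|f|^p)^{1/p} (Σ|g|^q)^{1/q}. A sketch (arbitrary random vectors; the conjugate pair p = 3, q = 3/2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=1000)
g = rng.normal(size=1000)
p, q = 3.0, 1.5                     # conjugate exponents: 1/p + 1/q = 1

# Hoelder for counting measure: sum |f g| <= ||f||_p ||g||_q.
lhs = np.sum(np.abs(f * g))
rhs = np.sum(np.abs(f) ** p) ** (1 / p) * np.sum(np.abs(g) ** q) ** (1 / q)
print(lhs <= rhs)
```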
Just like Jensen’s inequality, this is very useful when bounding integrals, and
it is also theoretically very important, because we are going to use it to prove
the Minkowski inequality. This tells us that the Lp norm is actually a norm.
Before we prove the Minkowski inequality, we prove the following tiny lemma
that we will use repeatedly:
Lemma. Let a, b ≥ 0 and p ≥ 1. Then
(a + b)p ≤ 2p (ap + bp ).
This is a terrible bound, but is useful when we want to prove that things are
finite.
Proof. We wlog a ≤ b. Then
(a + b)^p ≤ (2b)^p = 2^p b^p ≤ 2^p (a^p + b^p ).
Theorem (Minkowski inequality). For p ≥ 1 and measurable functions f, g, we have
kf + gkp ≤ kf kp + kgkp .
Proof. If the left hand side is infinite, then since |f + g|^p ≤ 2^p (|f|^p + |g|^p ) by the lemma, we know the right hand side is infinite as well. So this case is also done.
So we know
\[ \mu(|f + g|^p) \le (\|f\|_p + \|g\|_p)\, \mu(|f + g|^p)^{1 - 1/p}. \]
Then dividing both sides by µ(|f + g|^p )^{1−1/p} tells us
kf + gkp ≤ kf kp + kgkp .
4.2 Lp spaces
Recall the following definition:
Lp = {[f ] : f ∈ Lp },
where
[f ] = {g ∈ Lp : f − g = 0 a.e.}.
This is a normed vector space under the k · kp norm.
One important property of Lp is that it is complete, i.e. every Cauchy
sequence converges.
Definition (Complete vector space/Banach spaces). A normed vector space
(V, k · k) is complete if every Cauchy sequence converges. In other words, if (vn )
is a sequence in V such that kvn − vm k → 0 as n, m → ∞, then there is some
v ∈ V such that kvn − vk → 0 as n → ∞. A complete vector space is known as
a Banach space.
Theorem. Let 1 ≤ p ≤ ∞. Then Lp is a Banach space. In other words, if (fn )
is a sequence in Lp , with the property that kfn − fm kp → 0 as n, m → ∞, then
there is some f ∈ Lp such that kfn − f kp → 0 as n → ∞.
Proof. We will only give the proof for p < ∞. The p = ∞ case is left as an
exercise for the reader.
Suppose that (fn ) is a sequence in Lp with kfn − fm kp → 0 as n, m → ∞.
Take a subsequence (fnk ) of (fn ) with
kfnk+1 − fnk kp ≤ 2^{−k} for all k.
We know that
\[ \sum_{k=1}^{M} |f_{n_{k+1}} - f_{n_k}| \nearrow \sum_{k=1}^{\infty} |f_{n_{k+1}} - f_{n_k}| \quad \text{as } M \to \infty, \]
and by Minkowski and monotone convergence, the limit has finite Lp norm. In particular,
\[ \sum_{k=1}^{\infty} |f_{n_{k+1}} - f_{n_k}| < \infty \text{ a.e.} \]
So fnk (x) converges a.e., since the real line is complete. So we set
\[ f(x) = \begin{cases} \lim_{k \to \infty} f_{n_k}(x) & \text{if the limit exists} \\ 0 & \text{otherwise.} \end{cases} \]
\begin{align*}
\mu(|f|^p) = \mu(|f - f_n + f_n|^p) &\le \mu((|f - f_n| + |f_n|)^p) \\
&\le \mu(2^p(|f - f_n|^p + |f_n|^p)) = 2^p(\mu(|f - f_n|^p) + \mu(|f_n|^p)).
\end{align*}
We know the first term tends to 0, and in particular is finite for n large enough,
and the second term is also finite. So done.
The case p = 2 is special, because the L2 norm comes from an inner product hf, gi = ∫ f g dµ, so that
\[ \|f\|_2^2 = \langle f, f \rangle. \]
We say f and g are orthogonal if
hf, gi = 0,
and for V ⊆ L2 we write V ⊥ for the set of all f orthogonal to every element of V .
Note that we can always make these definitions for any inner product space.
However, the completeness of the space guarantees nice properties of the orthog-
onal complement.
Before we proceed further, we need to make a definition of what it means
for a subspace of L2 to be closed. This isn’t the usual definition, since L2 isn’t
really a normed vector space, so we need to accommodate for that fact.
Definition (Closed subspace). Let V ⊆ L2 . Then V is closed if whenever (fn )
is a sequence in V with fn → f , then there exists v ∈ V with v ∼ f .
The main thing that makes L2 nice is that we can use closed subspaces to
decompose functions orthogonally.
Theorem. Let V be a closed subspace of L2 . Then each f ∈ L2 has an
orthogonal decomposition
f = u + v,
where v ∈ V and u ∈ V ⊥ . Moreover, v is the closest point of V to f , i.e.
kf − vk2 ≤ kf − gk2 for all g ∈ V .
Proof. Let d(f, V ) = inf_{g∈V} kf − gk2 , and pick a sequence (gn ) in V with kf − gn k2 → d(f, V ). We now want to show that the infimum is attained. To do so, we show that (gn ) is a Cauchy sequence, and by the completeness of L2 , it will have a limit.
If we apply the parallelogram law with u = f − gn and v = f − gm , then we
know
ku + vk22 + ku − vk22 = 2(kuk22 + kvk22 ).
Using our particular choice of u and v, we obtain
\[ 4\left\| f - \frac{g_n + g_m}{2} \right\|_2^2 + \|g_n - g_m\|_2^2 = 2(\|f - g_n\|_2^2 + \|f - g_m\|_2^2). \]
So we have
\[ \|g_n - g_m\|_2^2 = 2(\|f - g_n\|_2^2 + \|f - g_m\|_2^2) - 4\left\| f - \frac{g_n + g_m}{2} \right\|_2^2. \]
The first bracket on the right hand side tends to 4d(f, V )², and the last term is at least 4d(f, V )², since (gn + gm )/2 ∈ V . So as n, m → ∞, we must have kgn − gm k2 → 0. By completeness of L2 , there exists a g ∈ L2 such that gn → g.
Now since V is assumed to be closed, we can find a v ∈ V such that g = v
a.e. Then kf − vk2 = d(f, V ). To see that u = f − v ∈ V ⊥ , fix h ∈ V and t ∈ R. Since v + th ∈ V , we know
\[ d(f, V)^2 \le \|f - (v + th)\|_2^2 = \|f - v\|_2^2 + t^2 \|h\|_2^2 - 2t \langle f - v, h \rangle. \]
As kf − vk²₂ = d(f, V )², the quadratic t²khk²₂ − 2t⟨f − v, h⟩ must be non-negative for all t; taking
\[ t = \frac{\langle f - v, h \rangle}{\|h\|_2^2} \]
forces ⟨f − v, h⟩ = 0. So u ∈ V ⊥ .
As an application, suppose G ⊆ F is the σ-algebra generated by disjoint events G1 , G2 , · · · partitioning Ω. For an integrable X, the conditional expectation of X given G is
\[ Y = \sum_n \mathrm{E}[X \mid G_n]\, 1_{G_n}, \]
where
\[ \mathrm{E}[X \mid G_n] = \frac{\mathrm{E}[X 1_{G_n}]}{\mathrm{P}[G_n]} \quad \text{for } \mathrm{P}[G_n] > 0 \]
(and, say, 0 when P[Gn ] = 0). In other words, given any x ∈ Ω, say x ∈ Gn , then Y (x) = E[X | Gn ].
If X ∈ L2 (P), then Y ∈ L2 (P), and it is clear that Y is G-measurable. We
claim that this is in fact the projection of X onto the subspace L2 (G, P) of
G-measurable L2 random variables in the ambient space L2 (P).
Proposition. The conditional expectation of X given G is the projection of X
onto the subspace L2 (G, P) of G-measurable L2 random variables in the ambient
space L2 (P).
In some sense, this tells us Y is our best prediction of X given only the
information encoded in G.
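The projection property can be seen empirically: on a partition of Ω = [0, 1) into quarters, the piecewise average Y makes X − Y orthogonal to every G-measurable W. A sketch (the partition and the variable X = sin(2πω) are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
omega = rng.random(100_000)            # sample points of Omega = [0, 1)
X = np.sin(2 * np.pi * omega)          # an L^2 random variable

# G is generated by the partition G_j = [j/4, (j+1)/4); a G-measurable
# variable is constant on each G_j.
j = np.floor(4 * omega).astype(int)
block_means = np.array([X[j == i].mean() for i in range(4)])
Y = block_means[j]                     # empirical E[X | G]

# X - Y should be orthogonal to any W = sum_j a_j 1_{G_j}.
W = np.array([1.0, -2.0, 0.5, 3.0])[j]
inner = np.mean((X - Y) * W)           # empirical <X - Y, W>
print(inner)
```

The inner product vanishes (up to floating point), which is exactly the statement that Y is the orthogonal projection of X onto L2(G, P).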
Proof. Let W be any G-measurable random variable in L2 (P). Since
G = σ(Gn : n ∈ N),
it follows that
\[ W = \sum_{n=1}^{\infty} a_n 1_{G_n}, \]
where an ∈ R. Then
\begin{align*}
\mathrm{E}[(X - W)^2] &= \mathrm{E}\left[ \left( \sum_{n=1}^{\infty} (X - a_n) 1_{G_n} \right)^2 \right] \\
&= \mathrm{E}\left[ \sum_n (X^2 + a_n^2 - 2a_n X) 1_{G_n} \right] \\
&= \mathrm{E}\left[ \sum_n (X^2 + a_n^2 - 2a_n \mathrm{E}[X \mid G_n]) 1_{G_n} \right].
\end{align*}
For each n, the quadratic in an is minimized when
an = E[X | Gn ].
Note that this does not depend on the X² term in the quadratic, since it is in the constant term. Therefore E[(X − W)²] is minimized for W = Y .
We can also rephrase variance and covariance in terms of the L2 spaces.
Suppose X, Y ∈ L2 (P) with
mX = E[X], mY = E[Y ].
Then variance and covariance just correspond to the L2 norm and inner product:
\[ \mathrm{var}(X) = \|X - m_X\|_2^2, \qquad \mathrm{cov}(X, Y) = \langle X - m_X, Y - m_Y \rangle. \]
A standard example to keep in mind, on ([0, 1), B([0, 1)), Lebesgue), is
Xn = n 1(0,1/n) .
Then Xn → 0 a.e., but kXn k1 = 1 for all n, so Xn does not converge in L1 .
Thus, for any ε > 0, we can pick k sufficiently large such that the first term is < ε/2 for all X ∈ X by assumption. Then when P[A] < ε/(2k), we have
E[|X| 1A ] ≤ ε.
As a corollary, we find that finite families of L1 random variables are uniformly integrable.
With all that preparation, we now come to the main theorem on uniform
integrability.
Theorem. Let X, (Xn ) be random variables. Then the following are equivalent:
(i) Xn , X ∈ L1 for all n and Xn → X in L1 .
(ii) {Xn } is uniformly integrable and Xn → X in probability.
The (i) ⇒ (ii) direction is just a standard manipulation. The idea of the (ii)
⇒ (i) direction is that we use uniformly integrability to cut off Xn and X at some
large value K, which gives us a small error, then apply bounded convergence.
Proof. We first assume that Xn , X are L1 and Xn → X in L1 . We want to show
that {Xn } is uniformly integrable and Xn → X in probability.
We first show that Xn → X in probability. This is just going to come from the Chebyshev inequality. Fix ε > 0. Then we have
\[ \mathrm{P}[|X - X_n| > \varepsilon] \le \frac{\mathrm{E}[|X - X_n|]}{\varepsilon} \to 0 \]
as n → ∞.
Next we show that {Xn } is uniformly integrable. Fix ε > 0. Take N such that n ≥ N implies E[|X − Xn |] ≤ ε/2. Since finite families of L1 random variables are uniformly integrable, we can pick δ > 0 such that A ∈ F and P[A] < δ implies
E[|X| 1A ], E[|Xn | 1A ] ≤ ε/2
for n = 1, · · · , N .
So we know that Xn → X in L1 .
The main application is to the case where {Xn } is a type of stochastic process known as a martingale. This will be done in III Advanced Probability and III Stochastic Calculus.
5 Fourier transform
We now turn to the exciting topic of the Fourier transform. There are two main questions we want to ask — when does the Fourier transform exist, and when can we recover a function from its Fourier transform?
Of course, not only do we want to know if the Fourier transform exists. We
also want to know if it lies in some nice space, e.g. L2 .
It turns out that when we want to prove things about Fourier transforms,
it is often helpful to “smoothen” the function by doing what is known as a
Gaussian convolution. So after defining the Fourier transform and proving some
really basic properties, we are going to investigate convolutions and Gaussians
for a bit (convolutions are also useful on their own, since they correspond to
sums of independent random variables). After that, we can go and prove the
actual important properties of the Fourier transform.
5.2 Convolutions
To actually do something useful about the Fourier transforms, we need to talk
about convolutions.
Definition (Convolution of random variables). Let µ, ν be probability measures.
Their convolution µ ∗ ν is the law of X + Y , where X has law µ and Y has law
ν, and X, Y are independent. Explicitly, we have
\[ \mu * \nu(A) = \mathrm{P}[X + Y \in A] = \iint 1_A(x + y)\, \mu(\mathrm{d}x)\, \nu(\mathrm{d}y). \]
Let’s suppose that µ has a density function f with respect to the Lebesgue
measure. Then we have
\begin{align*}
\mu * \nu(A) &= \iint 1_A(x + y) f(x)\, \mathrm{d}x\, \nu(\mathrm{d}y) \\
&= \iint 1_A(x) f(x - y)\, \mathrm{d}x\, \nu(\mathrm{d}y) \\
&= \int 1_A(x) \left( \int f(x - y)\, \nu(\mathrm{d}y) \right) \mathrm{d}x.
\end{align*}
So µ ∗ ν has a density with respect to Lebesgue measure, namely f ∗ ν(x) = ∫ f (x − y) ν(dy).
Note that we do have to treat the two cases of convolutions separately, since
a measure need not have a density, and a function need not specify a probability
measure (it may not integrate to 1).
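For independent X, Y ∼ U[0, 1], the law of X + Y is the convolution of the two uniform densities, i.e. the triangle z ↦ min(z, 2 − z) on [0, 2]. A simulation sketch (sample size and binning are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
s = rng.random(n) + rng.random(n)      # X + Y with X, Y ~ U[0, 1] independent

# Empirical density of the sum vs the convolution of the two uniform
# densities, which is the triangle h(z) = z on [0,1] and 2 - z on [1,2].
hist, edges = np.histogram(s, bins=40, range=(0.0, 2.0), density=True)
mids = (edges[:-1] + edges[1:]) / 2
triangle = np.where(mids <= 1, mids, 2 - mids)
err = np.max(np.abs(hist - triangle))
print(err)
```

The histogram matches the triangular density up to sampling noise, which is the "convolution of densities" description of the sum of independent random variables.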
We check that it is indeed in Lp . Since ν is a probability measure, Jensen’s
inequality says we have
\begin{align*}
\|f * \nu\|_p^p &= \int \left| \int f(x - y)\, \nu(\mathrm{d}y) \right|^p \mathrm{d}x \\
&\le \iint |f(x - y)|^p\, \nu(\mathrm{d}y)\, \mathrm{d}x \\
&= \iint |f(x - y)|^p\, \mathrm{d}x\, \nu(\mathrm{d}y) \\
&= \|f\|_p^p < \infty.
\end{align*}
In fact, from this computation, we see that
Proposition. For any f ∈ Lp and ν a probability measure, we have
kf ∗ νkp ≤ kf kp .
The interesting thing happens when we try to take the Fourier transform of
a convolution.
Proposition.
\[ \widehat{f * \nu}(u) = \hat{f}(u)\, \hat{\nu}(u). \]
Proof. We have
\begin{align*}
\widehat{f * \nu}(u) &= \int \left( \int f(x - y)\, \nu(\mathrm{d}y) \right) e^{i(u, x)}\, \mathrm{d}x \\
&= \iint f(x - y)\, e^{i(u, x)}\, \mathrm{d}x\, \nu(\mathrm{d}y) \\
&= \int \left( \int f(x - y)\, e^{i(u, x - y)}\, \mathrm{d}(x - y) \right) e^{i(u, y)}\, \nu(\mathrm{d}y) \\
&= \int \hat{f}(u)\, e^{i(u, y)}\, \nu(\mathrm{d}y) \\
&= \hat{f}(u) \int e^{i(u, y)}\, \nu(\mathrm{d}y) \\
&= \hat{f}(u)\, \hat{\nu}(u).
\end{align*}
In the context of random variables, we have a similar result:
Proposition. Let µ, ν be probability measures, and X, Y be independent random variables with laws µ, ν respectively. Then
\[ \widehat{\mu * \nu}(u) = \hat{\mu}(u)\, \hat{\nu}(u). \]
Proof. We have
\[ \widehat{\mu * \nu}(u) = \mathrm{E}[e^{i(u, X+Y)}] = \mathrm{E}[e^{i(u, X)}]\, \mathrm{E}[e^{i(u, Y)}] = \hat{\mu}(u)\, \hat{\nu}(u). \]
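This factorization is easy to test by Monte Carlo. In the sketch below (the distributions and the evaluation point u = 0.7 are arbitrary illustrative choices), X ∼ Exp(1) and Y ∼ N(0, 1) are independent:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
X = rng.exponential(size=n)
Y = rng.normal(size=n)
u = 0.7

phi_sum = np.mean(np.exp(1j * u * (X + Y)))                      # char. fn of X + Y
phi_prod = np.mean(np.exp(1j * u * X)) * np.mean(np.exp(1j * u * Y))
print(abs(phi_sum - phi_prod))
```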
Gaussian densities
Before we start, we had better start by defining the Gaussian distribution.
Definition (Gaussian density). The Gaussian density with variance t is
\[ g_t(x) = \left( \frac{1}{2\pi t} \right)^{d/2} e^{-|x|^2/2t}. \]
This is equivalently the density of √t Z, where Z = (Z1 , · · · , Zd ) with Zi ∼ N (0, 1) independent.
We now want to compute the Fourier transformation directly and show that
the Fourier inversion formula works for this.
We start off by working in the case d = 1 and Z ∼ N (0, 1). We want to
compute the Fourier transform of the law of this guy, i.e. its characteristic
function. We will use a nice trick.
Proposition. Let Z ∼ N (0, 1). Then
2
φZ (a) = e−u /2
.
φZ (u) = E[eiuZ ]
Z
1 2
= √ eiux e−x /2 dx.
2π
We now notice that the integrand and its derivative in u are dominated by integrable functions, so we can differentiate under the integral sign; integrating by parts, we obtain
φ′Z (u) = −u φZ (u).
Since φZ (0) = 1, solving this differential equation gives φZ (u) = e^{−u²/2}.
Computing coordinate-wise, it follows that ĝt (u) = e^{−|u|²t/2} in dimension d. Again, gt and ĝt are almost the same, apart from the factor of (2πt)^{−d/2} and the position of t being flipped. We can thus write this as
\[ \hat{g}_t(u) = (2\pi/t)^{d/2}\, g_{1/t}(u). \]
So we conclude that
Lemma. The Fourier inversion formula holds for the Gaussian density function.
Gaussian convolutions
Definition (Gaussian convolution). Let f ∈ L1 . Then a Gaussian convolution
of f is a function of the form f ∗ gt .
We are now going to do a little computation that shows that functions of
this type also satisfy the Fourier inversion formula.
Before we start, we make some observations about the Gaussian convolution.
By general theory of convolutions, we know that we have
Proposition. We have
kf ∗ gt k1 ≤ kf k1 ,
and since gt is bounded by (2πt)^{−d/2}, also
kf ∗ gt k∞ ≤ (2πt)^{−d/2} kf k1 .
Now given these bounds, it makes sense to write down the Fourier inversion
formula for a Gaussian convolution.
Lemma. The Fourier inversion formula holds for Gaussian convolutions.
We are going to reduce this to the fact that the Gaussian distribution itself
satisfies Fourier inversion.
Proof. We have
\begin{align*}
f * g_t(x) &= \int f(x - y)\, g_t(y)\, \mathrm{d}y \\
&= \int f(x - y) \left( \frac{1}{2\pi} \right)^d \int \hat{g}_t(u)\, e^{-i(u, y)}\, \mathrm{d}u\, \mathrm{d}y \\
&= \left( \frac{1}{2\pi} \right)^d \iint f(x - y)\, \hat{g}_t(u)\, e^{-i(u, y)}\, \mathrm{d}u\, \mathrm{d}y \\
&= \left( \frac{1}{2\pi} \right)^d \int \left( \int f(x - y)\, e^{i(u, x - y)}\, \mathrm{d}y \right) \hat{g}_t(u)\, e^{-i(u, x)}\, \mathrm{d}u \\
&= \left( \frac{1}{2\pi} \right)^d \int \hat{f}(u)\, \hat{g}_t(u)\, e^{-i(u, x)}\, \mathrm{d}u \\
&= \left( \frac{1}{2\pi} \right)^d \int \widehat{f * g_t}(u)\, e^{-i(u, x)}\, \mathrm{d}u.
\end{align*}
So done.
The proof
Finally, we are going to extend the Fourier inversion formula to the case where
f, fˆ ∈ L2 .
Theorem (Fourier inversion formula). Let f ∈ L1 and
\[ f_t(x) = (2\pi)^{-d} \int \hat{f}(u)\, e^{-|u|^2 t/2}\, e^{-i(u, x)}\, \mathrm{d}u = (2\pi)^{-d} \int \widehat{f * g_t}(u)\, e^{-i(u, x)}\, \mathrm{d}u. \]
= 2^{p+1} khk_p^p ,
where we used the definition of g and substitution. We know that this tends to 0 as t → 0 by the bounded convergence theorem, since the integrand is bounded.
Finally, we have
\begin{align*}
\|f * g_t - f\|_p &\le \|f * g_t - h * g_t\|_p + \|h * g_t - h\|_p + \|h - f\|_p \\
&\le \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \|h * g_t - h\|_p = \frac{2\varepsilon}{3} + \|h * g_t - h\|_p.
\end{align*}
Since we know that kh ∗ gt − hkp → 0 as t → 0, we know that for all sufficiently
small t, the function is bounded above by ε. So we are done.
With this lemma, we can now prove the Fourier inversion theorem.
Proof of Fourier inversion theorem. The first part is just a special case of the
previous lemma. Indeed, recall that
\[ \widehat{f * g_t}(u) = \hat{f}(u)\, e^{-|u|^2 t/2}, \]
so
ft = f ∗ gt .
we know that
\[ \left| \hat{f}(u)\, e^{-|u|^2 t/2}\, e^{-i(u, x)} \right| \le |\hat{f}(u)|. \]
So done.
As we are going to see in a moment, this is just going to follow from the
Fourier inversion formula plus a clever trick.
Proof. We first work with the special case where f, fˆ ∈ L1 , since the Fourier
inversion formula holds for f . We then have
\begin{align*}
\|f\|_2^2 &= \int f(x)\, \overline{f(x)}\, \mathrm{d}x \\
&= \frac{1}{(2\pi)^d} \int \left( \int \hat{f}(u)\, e^{-i(u, x)}\, \mathrm{d}u \right) \overline{f(x)}\, \mathrm{d}x \\
&= \frac{1}{(2\pi)^d} \int \hat{f}(u) \left( \int \overline{f(x)}\, e^{-i(u, x)}\, \mathrm{d}x \right) \mathrm{d}u \\
&= \frac{1}{(2\pi)^d} \int \hat{f}(u)\, \overline{\int f(x)\, e^{i(u, x)}\, \mathrm{d}x}\, \mathrm{d}u \\
&= \frac{1}{(2\pi)^d} \int \hat{f}(u)\, \overline{\hat{f}(u)}\, \mathrm{d}u \\
&= \frac{1}{(2\pi)^d} \|\hat{f}\|_2^2.
\end{align*}
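The discrete analogue of this identity is Parseval's theorem for the DFT, Σ|x_k|² = (1/N) Σ|x̂_k|², with the 1/N factor playing the role of (2π)^{−d}. A sketch using NumPy's FFT (the vector length is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=256) + 1j * rng.normal(size=256)
X = np.fft.fft(x)

# Discrete Plancherel: sum |x|^2 = (1/N) sum |X|^2.
lhs = np.sum(np.abs(x) ** 2)
rhs = np.sum(np.abs(X) ** 2) / len(x)
print(lhs, rhs)
```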
To prove it for the general case, we use this result and an approximation
argument. Suppose that f ∈ L1 ∩ L2 , and let ft = f ∗ gt . Then by our earlier
lemma, we know that
kft k2 → kf k2 as t → 0.
Now note that
\[ \hat{f}_t(u) = \hat{f}(u)\, \hat{g}_t(u) = \hat{f}(u)\, e^{-|u|^2 t/2}. \]
The important thing is that e^{−|u|²t/2} % 1 as t → 0. Therefore, we know
\[ \|\hat{f}_t\|_2^2 = \int |\hat{f}(u)|^2\, e^{-|u|^2 t}\, \mathrm{d}u \to \int |\hat{f}(u)|^2\, \mathrm{d}u = \|\hat{f}\|_2^2 \]
as t → 0, by monotone convergence.
Since ft , fˆt ∈ L1 , we know that the Plancherel identity holds for ft , and taking t → 0 extends it to all f ∈ L1 ∩ L2 . The extension of the Fourier transform to all of L2 then goes via the dense subspace
V = {[f ] ∈ L2 : f, fˆ ∈ L1 }.
The main application of this that will appear later is that this is the fact
that allows us to prove the central limit theorem.
Proof sketch. By the example sheet, it suffices to show that E[g(Xn )] → E[g(X)]
for all compactly supported g ∈ C ∞ . We then use Fourier inversion and
convergence of characteristic functions to check that
\[ \mathrm{E}[g(X_n + \sqrt{t}\, Z)] \to \mathrm{E}[g(X + \sqrt{t}\, Z)] \]
for all t > 0, where Z ∼ N (0, 1) is independent of X, (Xn ). Then we check that E[g(Xn + √t Z)] is close to E[g(Xn )] for t > 0 small, and similarly for X.
Proposition. Let X ∼ N (µ, σ²) and a, b ∈ R. Then
E[X] = µ,  var(X) = σ² ,
and
aX + b ∼ N (aµ + b, a²σ²).
Lastly, we have
\[ \varphi_X(u) = e^{i\mu u - u^2\sigma^2/2}. \]
Proof. All but the last of them follow from direct calculation, and can be found
in IA Probability.
For the last part, if X ∼ N (µ, σ 2 ), then we can write
X = σZ + µ,
where Z ∼ N (0, 1). Recall that we have previously found that the characteristic function of a N (0, 1) random variable is
\[ \varphi_Z(u) = e^{-u^2/2}. \]
So we have
\[ \varphi_X(u) = \mathrm{E}[e^{iu(\sigma Z + \mu)}] = e^{iu\mu}\, \mathrm{E}[e^{iu\sigma Z}] = e^{iu\mu}\, \varphi_Z(u\sigma) = e^{iu\mu - u^2\sigma^2/2}. \]
What we are next going to do is to talk about the corresponding facts for
the Gaussian in higher dimensions. Before that, we need to come up with the
definition of a higher-dimensional Gaussian distribution. This might be different
from the one you’ve seen previously, because we want to allow some degeneracy
in our random variable, e.g. some of the dimensions can be constant.
(iii) We have
\[ \varphi_X(u) = e^{i(u, \mu) - (u, Vu)/2}. \]
Since (X, A^T u) is Gaussian and (b, u) is constant, it follows that (AX + b, u) is Gaussian.
So we know
(u, X) ∼ N ((u, µ), (u, V u)).
So it follows, from the one-dimensional formula, that φX (u) = E[e^{i(u,X)}] = e^{i(u,µ)−(u,V u)/2}.
(iv) We start off with a boring Gaussian vector Y = (Y1 , · · · , Yn ), where the Yi ∼ N (0, 1) are independent. Then the density of Y is
\[ f_Y(y) = (2\pi)^{-n/2}\, e^{-|y|^2/2}. \]
We then set
X̃ = V^{1/2} Y + µ.
6 Ergodic theory
We are now going to study a new topic — ergodic theory. This is the study of the “long run behaviour” of a system under the evolution of some map Θ. Due to time
constraints, we will not do much with it. We are going to prove two ergodic
theorems that tell us what happens in the long run, and this will be useful when
we prove our strong law of large numbers at the end of the course.
The general settings is that we have a measure space (E, E, µ) and a measur-
able map Θ : E → E that is measure preserving, i.e. µ(A) = µ(Θ−1 (A)) for all
A ∈ E.
Example. Take (E, E, µ) = ([0, 1), B([0, 1)), Lebesgue). For each a ∈ [0, 1), we
can define
Θa (x) = x + a mod 1.
By what we’ve done earlier in the course, we know this translation map preserves
the Lebesgue measure on [0, 1).
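For irrational a this map is in fact ergodic, and the ergodic theorems below then say that time averages along an orbit converge to the space average. A numerical sketch (the shift a = √2 − 1, the starting point, and the observable f(x) = x are arbitrary illustrative choices):

```python
import numpy as np

a = np.sqrt(2.0) - 1.0                     # irrational shift
n = 200_000
orbit = (0.1 + a * np.arange(n)) % 1.0     # x, Theta_a(x), Theta_a^2(x), ...

# Birkhoff average of f(x) = x along the orbit vs the space average 1/2.
time_avg = orbit.mean()
print(time_avg)
```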
Our goal is to try to understand the “long run averages” of the system when
we apply Θ many times. One particular quantity we are going to look at is the
following:
Let f be measurable. We define
Sn (f ) = f + f ◦ Θ + · · · + f ◦ Θn−1 .
We say A ∈ E is invariant if Θ−1 (A) = A, and we let
EΘ = {A ∈ E : A is invariant},
which is a σ-algebra. A measurable function f is invariant if f = f ◦ Θ, and Θ is called ergodic if every invariant set A has µ(A) = 0 or µ(AC ) = 0.
It turns out that if Θ is ergodic, then there aren’t that many invariant
functions.
Y = (Y1 , Y2 , · · · ) : Ω → E
where Yi are iid random variables defined earlier, and Ω is the sample space of
the Yi .
Then Y is a measurable map because each of the Yi ’s is a random variable.
We let µ = P ◦ Y −1 .
By the independence of Yi ’s, we have that
Y
µ(A) = m(An )
n∈N
for any
A = A1 × A2 × · · · × An × R × · · · × R.
Note that the product is eventually 1, so it is really a finite product.
This (E, E, µ) is known as the canonical space associated with the sequence
of iid random variables with law m.
Finally, we need to define Θ. We define Θ : E → E to be the shift map
Θ(x1 , x2 , · · · ) = (x2 , x3 , · · · ),
and we take
f (x) = f (x1 , x2 , · · · ) = x1 .
Then we have
Sn (f ) = f + f ◦ Θ + · · · + f ◦ Θn−1 = x1 + · · · + xn .
So Sn (f )/n will be the average of the first n coordinates. So ergodic theory will tell us about the long-run behaviour of the average.
Lemma (Maximal ergodic lemma). Let f be integrable, and let
\[ S^* = \sup_{n \ge 0} S_n(f) \ge 0, \]
where S0 = 0. Then
\[ \int_{\{S^* > 0\}} f\, \mathrm{d}\mu \ge 0. \]
Proof. We let
Sn∗ = max Sm
0≤m≤n
and
An = {Sn∗ > 0}.
Now if 1 ≤ m ≤ n, then we know
Sm = f + Sm−1 ◦ Θ ≤ f + Sn∗ ◦ Θ.
Now on An , we have
Sn∗ = max Sm ,
1≤m≤n
since S0 = 0. So we have
Sn∗ ≤ f + Sn∗ ◦ Θ.
On AC
n , we have
Sn∗ = 0 ≤ Sn∗ ◦ Θ.
So we know
\begin{align*}
\int_E S_n^*\, \mathrm{d}\mu &= \int_{A_n} S_n^*\, \mathrm{d}\mu + \int_{A_n^C} S_n^*\, \mathrm{d}\mu \\
&\le \int_{A_n} f\, \mathrm{d}\mu + \int_{A_n} S_n^* \circ \Theta\, \mathrm{d}\mu + \int_{A_n^C} S_n^* \circ \Theta\, \mathrm{d}\mu \\
&= \int_{A_n} f\, \mathrm{d}\mu + \int_E S_n^* \circ \Theta\, \mathrm{d}\mu \\
&= \int_{A_n} f\, \mathrm{d}\mu + \int_E S_n^*\, \mathrm{d}\mu.
\end{align*}
So we know
\[ \int_{A_n} f\, \mathrm{d}\mu \ge 0. \]
Since An % {S ∗ > 0}, the result follows by letting n → ∞.
Theorem (Birkhoff’s ergodic theorem). Let (E, E, µ) be σ-finite and f integrable. Then there exists an invariant function f¯ with µ(|f¯|) ≤ µ(|f |) and
Sn (f )/n → f¯ a.e.
If Θ is ergodic, then f¯ is a constant.
Note that the theorem only gives µ(|f¯|) ≤ µ(|f |). However, in many cases,
we can use some integration theorems such as dominated convergence to argue
that they must in fact be equal. In particular, in the ergodic case, this will allow
us to find the value of f¯.
Theorem (von Neumann’s ergodic theorem). Let (E, E, µ) be a finite measure
space. Let p ∈ [1, ∞) and assume that f ∈ Lp . Then there is some function
f¯ ∈ Lp such that
\[ \frac{S_n(f)}{n} \to \bar{f} \text{ in } L^p. \]
Proof of Birkhoff ’s ergodic theorem. We first note that
\[ \limsup_n \frac{S_n}{n} \quad\text{and}\quad \liminf_n \frac{S_n}{n} \]
are invariant functions. Indeed, we know
\[ S_n \circ \Theta = f \circ \Theta + f \circ \Theta^2 + \cdots + f \circ \Theta^n = S_{n+1} - f. \]
So we have
\[ \frac{S_n \circ \Theta}{n} = \frac{S_{n+1}}{n} - \frac{f}{n}, \]
and since f /n → 0, it follows that
\[ \limsup_{n \to \infty} \frac{S_n \circ \Theta}{n} = \limsup_{n \to \infty} \frac{S_n}{n}. \]
Exactly the same reasoning tells us the lim inf is also invariant.
What we now need to show is that the set of points on which lim sup and
lim inf do not agree have measure zero. We set a < b. We let
\[ D = D(a, b) = \left\{ x \in E : \liminf_{n \to \infty} \frac{S_n(x)}{n} < a < b < \limsup_{n \to \infty} \frac{S_n(x)}{n} \right\}. \]
Now if lim sup Sn (x)/n ≠ lim inf Sn (x)/n, then there are some a, b ∈ Q such that x ∈ D(a, b). So by countable subadditivity, it suffices to show that µ(D(a, b)) = 0 for all a, b.
We now fix a, b, and just write D. Since lim sup Sn /n and lim inf Sn /n are both invariant, we have that D is invariant. By restricting to D, we can assume that D = E.
Suppose that B ∈ E and µ(B) < ∞. We let
g = f − b 1B .
Applying the maximal ergodic lemma to g yields
\[ b\,\mu(B) \le \int_D f\, \mathrm{d}\mu \]
for all measurable sets B ∈ E with finite measure. Since our space is σ-finite, we can find Bn % D such that µ(Bn ) < ∞ for all n. So taking the limit above tells us
\[ b\,\mu(D) \le \int_D f\, \mathrm{d}\mu. \tag{†} \]
Now we can apply the same argument with (−a) in place of b and (−f ) in place
of f to get
\[ (-a)\,\mu(D) \le -\int_D f\, \mathrm{d}\mu. \tag{‡} \]
Now note that since b > a, we know that at least one of b > 0 and a < 0 has to
be true. In the first case, (†) tells us that µ(D) is finite, since f is integrable.
Then combining with (‡), we see that
\[ b\,\mu(D) \le \int_D f\, \mathrm{d}\mu \le a\,\mu(D). \]
But a < b. So we must have µ(D) = 0. The second case follows similarly (or
follows immediately by flipping the sign of f ).
We are almost done. We can now define
\[ \bar{f}(x) = \begin{cases} \lim_n S_n(f)(x)/n & \text{if the limit exists} \\ 0 & \text{otherwise.} \end{cases} \]
Then by the above, we have
Sn (f )/n → f¯ a.e.
Also, we know f¯ is invariant, because lim Sn (f )/n is invariant, and so is the set
where the limit exists.
Finally, we need to show that µ(|f¯|) ≤ µ(|f |). This is since
µ(|f ◦ Θn |) = µ(|f |),
as Θn preserves the measure µ. So we have µ(|Sn |/n) ≤ µ(|f |), and the result follows from Fatou’s lemma.
The proof of the von Neumann ergodic theorem follows easily from Birkhoff’s
ergodic theorem.
Proof of von Neumann ergodic theorem. It is an exercise on the example sheet to show that
\[ \|f \circ \Theta\|_p^p = \int |f \circ \Theta|^p\, \mathrm{d}\mu = \int |f|^p\, \mathrm{d}\mu = \|f\|_p^p. \]
So we have
\[ \left\| \frac{S_n}{n} \right\|_p = \frac{1}{n} \|f + f \circ \Theta + \cdots + f \circ \Theta^{n-1}\|_p \le \|f\|_p \]
by Minkowski’s inequality.
So let ε > 0, and take M ∈ (0, ∞) so that if
g = (f ∨ (−M )) ∧ M,
then
\[ \|f - g\|_p < \frac{\varepsilon}{3}. \]
By Birkhoff’s theorem, we know
Sn (g)
→ ḡ
n
a.e.
Also, we know
\[ \left| \frac{S_n(g)}{n} \right| \le M \]
for all n. So by the bounded convergence theorem, we know
\[ \left\| \frac{S_n(g)}{n} - \bar{g} \right\|_p \to 0 \]
as n → ∞. So we can find N such that for all n ≥ N ,
\[ \left\| \frac{S_n(g)}{n} - \bar{g} \right\|_p < \frac{\varepsilon}{3}. \]
Then, using Fatou’s lemma, we have
\[ \|\bar{f} - \bar{g}\|_p^p = \int \liminf_n \left| \frac{S_n(f - g)}{n} \right|^p \mathrm{d}\mu \le \liminf_n \int \left| \frac{S_n(f - g)}{n} \right|^p \mathrm{d}\mu \le \|f - g\|_p^p. \]
So if n ≥ N , then we know
\[ \left\| \frac{S_n(f)}{n} - \bar{f} \right\|_p \le \left\| \frac{S_n(f - g)}{n} \right\|_p + \left\| \frac{S_n(g)}{n} - \bar{g} \right\|_p + \|\bar{g} - \bar{f}\|_p \le \varepsilon. \]
So done.
7 Big theorems
We are now going to use all the tools we have previously developed to prove
two of the most important theorems about the sums of independent random
variables, namely the strong law of large numbers and the central limit theorem.
Theorem (Strong law of large numbers, under a fourth moment bound). Let (Xn ) be independent random variables with E[Xn ] = µ and E[Xn⁴] ≤ M for all n. Set
Sn = X1 + · · · + Xn .
Then Sn /n → µ a.s. as n → ∞.
Proof. We can wlog assume µ = 0, by replacing Xn with
Yn = Xn − µ.
Expanding E[Sn⁴] and using independence together with E[Xi ] = 0, the only surviving terms are
\[ \mathrm{E}[S_n^4] = \sum_i \mathrm{E}[X_i^4] + 3 \sum_{i \ne j} \mathrm{E}[X_i^2 X_j^2]. \]
We then have
We know the first term is bounded by nM , and we also know that for i 6= j, we
have
\[ \mathrm{E}[X_i^2 X_j^2] = \mathrm{E}[X_i^2]\, \mathrm{E}[X_j^2] \le \sqrt{\mathrm{E}[X_i^4]\, \mathrm{E}[X_j^4]} \le M. \]
So we know
\[ \mathrm{E}\left[ (S_n/n)^4 \right] \le \frac{3M}{n^2}. \]
So we know
\[ \sum_{n=1}^{\infty} \mathrm{E}\left[ \left( \frac{S_n}{n} \right)^4 \right] \le \sum_{n=1}^{\infty} \frac{3M}{n^2} < \infty. \]
So we know that
\[ \sum_{n=1}^{\infty} \left( \frac{S_n}{n} \right)^4 < \infty \text{ a.s.} \]
In particular, Sn /n → 0 a.s.
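The conclusion is easy to watch numerically. A sketch (uniform increments shifted to have mean 0.25; all parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = 0.25
X = rng.random(100_000) - 0.5 + mu          # iid, mean mu, all moments finite
S_over_n = np.cumsum(X) / np.arange(1, len(X) + 1)
print(S_over_n[-1])                         # running average S_n / n
```

The running averages settle down to the common mean, which is the content of the strong law.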
We let f : E → R be given by
f (x) = f (x1 , x2 , · · · ) = x1 .
Then X1 has law given by m, and in particular is integrable. Also, the shift map
Θ : E → E given by
Θ(x1 , x2 , · · · ) = (x2 , x3 , · · · )
is measure-preserving and ergodic. Thus, with
Sn (f ) = f + f ◦ Θ + · · · + f ◦ Θn−1 = X1 + · · · + Xn ,
we have that
Sn (f )
→ f¯ a.e.
n
by Birkhoff’s ergodic theorem. We also have convergence in L1 by von Neumann
ergodic theorem.
Here f¯ is EΘ -measurable, and Θ is ergodic, so we know that f¯ = c a.e. for
some constant c. Moreover, by the L1 convergence, we have
c = lim_n E[Sn (f )/n] = E[X1 ].
So done.
Theorem (Central limit theorem). Let (Xn ) be iid random variables with E[X1 ] = 0 and var(X1 ) = 1. Set
Sn = X1 + · · · + Xn .
Then Sn /√n → N (0, 1) in distribution.
Proof. Let φ(u) = E[eiuX1 ]. Since X1 has two moments, φ is twice differentiable, with φ′(0) = iE[X1 ] = 0 and φ′′(0) = −E[X1²] = −1. Evaluating the Taylor expansion at 0, we have
\[ \varphi(u) = 1 - \frac{u^2}{2} + o(u^2). \]
We consider the characteristic function of Sn /√n:
\[ \varphi_n(u) = \mathrm{E}[e^{iuS_n/\sqrt{n}}] = \prod_{j=1}^n \mathrm{E}[e^{iuX_j/\sqrt{n}}] = \varphi(u/\sqrt{n})^n = \left( 1 - \frac{u^2}{2n} + o\!\left( \frac{u^2}{n} \right) \right)^n. \]
Therefore
\[ \log \varphi_n(u) = n \log\left( 1 - \frac{u^2}{2n} + o\!\left( \frac{u^2}{n} \right) \right) = -\frac{u^2}{2} + o(1) \to -\frac{u^2}{2}. \]
So we know that
\[ \varphi_n(u) \to e^{-u^2/2}, \]
which is the characteristic function of a N (0, 1) random variable.
So we have convergence in characteristic function, hence weak convergence,
hence convergence in distribution.
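The convergence can be observed directly. In the sketch below (centred uniforms scaled to unit variance; n, the trial count, and the test point are arbitrary choices), the empirical distribution of Sn/√n is compared with the N(0, 1) CDF at 1:

```python
import numpy as np

rng = np.random.default_rng(7)
trials, n = 20_000, 200
# X_i uniform on [-sqrt(3), sqrt(3)]: mean 0, variance 1.
X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(trials, n))
Z = X.sum(axis=1) / np.sqrt(n)             # samples of S_n / sqrt(n)

p = np.mean(Z <= 1.0)                      # compare with Phi(1) ~ 0.8413
print(p)
```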
Index
FX , 26
Lp space, 54, 59
Lp -bounded, 66
N (µ, σ 2 ), 80
Sn (f ), 83
V ⊥ , 61
E[X], 36
lim inf, 17
lim sup, 17
B, 12
B(E), 12
EΘ , 83
Lp space, 60
T -measurable, 34
µ(f ), 36
π-system, 6
σ-algebra, 5
  independent, 17
  product, 21, 48
  tail, 34
σ-algebra generated by functions, 21
σ-finite measure, 15
additive set function, 8
algebra, 7
almost everywhere, 29
almost sure convergence, 29
Banach space, 60
Bernoulli shift, 84
Birkhoff’s ergodic theorem, 86
Borel σ-algebra, 6, 12
Borel function, 20
Borel measure, 13
Borel–Cantelli lemma, 18
Borel–Cantelli lemma II, 18
bounded convergence theorem, 65
canonical space, 84
Caratheodory extension theorem, 8
change of variables formula, 46
characteristic function, 69
Chebyshev’s inequality, 54
closed subspace, 62
complete vector space, 60
conditional expectation, 63
conjugate, 57
convergence
  almost everywhere, 29
  almost sure, 29
  in distribution, 31
  in measure, 29
  in probability, 29
convex function, 55
convolution
  function with measure, 70
  random variable, 70
countable additivity, 5
countably additive set function, 8
countably subadditive set function, 8
counting measure, 5
covariance, 64
covariance matrix, 64
d-system, 6
density, 46
  random variable, 46
differentiation under the integral sign, 47
distribution, 25
distribution function, 26
dominated convergence theorem, 43
Dynkin’s π-system lemma, 7
ergodic, 83
event
  independent, 16
events, 16
expectation, 36
Fatou’s lemma, 42
finite intersection property, 14
Fourier transform, 69
  of measure, 69
Fubini’s theorem, 50
Gaussian convolution, 74
Gaussian density, 72
Gaussian random variable, 80, 81
  mean, 80
  variance, 80
generating set, 6
generator of σ-algebra, 6