
Fundamentals of measure theory

Bent Nielsen

University of Oxford

Further Mathematical Methods


Trinity 2024, weeks 6-8

Comments welcome — do let me know if you spot any mistakes

Slightly revised version of slides written by Anders Kock 2023

1/141
Introduction

The present slide set contains some fundamental definitions


and results from measure theory and measure-theoretic
probability.
Some textbooks that provide a (much) more
detailed/complete treatment are mentioned next.
However, the slides should be largely self-contained and they
are all I expect you to study.
They will also be sufficient in order to solve the problem sets.

2/141
Some textbooks
There is a plethora of textbooks on measure-theoretic probability.
Some examples:
Resnick (2019) and Shiryaev (1996): Many examples, focus
on probability, good for self-study.
Dudley (2018): Comprehensive. On real analysis, probability.
Lehmann and Casella (1998) and Lehmann and Romano
(2005) also contain crash-course style material with particular
emphasis on applications in statistics and econometrics.
Liese and Miescke (2008): Appendix with collection of results.
Axler (2021) has a lot of intuition and many examples of why
the theory is the way it is [why define measures only
on σ-algebras, why Lebesgue integration and not Riemann?].
It may be interesting for self-study. Available for free on
Axler’s website.
Kolmogorov and Fomin (1970): concise, clear overview of
metric spaces, topology, σ-algebras.
3/141
Why the measure-theoretic approach?

The measure-theoretic approach allows us to treat discrete
distributions, continuous distributions and distributions that are
neither of these in a unified way.
Sequences of measurable functions (and random variables) are
closed under pointwise limits.
This is in contrast to continuous functions underlying much of
Riemann-integration.
We can interchange pointwise limits and integration under
rather general conditions. This is again in contrast to
Riemann-integration.
You will find it used in many papers (also outside
econometrics), so it is useful not to be “scared” when
encountering σ-algebras, Lebesgue measures, the Lebesgue
integral and terminology like “almost everywhere” and
“almost surely”!
4/141
Plan
We will cover
1 σ-algebras.
2 Measures and their properties.
3 Measurable functions.
Stability properties of measurable functions — in particular
under pointwise limits.
4 The Lebesgue integral and useful results related to it.
Construction of the integral via simple functions.
Linearity, monotonicity, dealing with null sets.
The Monotone and Dominated Convergence theorems.
Markov, Hölder, Jensen, Minkowski inequalities.
Induced measures and the substitution rule.
Tonelli’s and Fubini’s Theorem.
Radon-Nikodym and densities.
5 Some applications to probability theory.
The primary examples will be probability spaces and random
variables. 5/141
σ-algebra

Consider a non-empty set X .


Definition 1 (σ-algebra)

A collection of subsets A of X is called a σ-algebra in X if the


following conditions are met:
(σ1) X ∈ A.
(σ2) If A ∈ A, then A^c ∈ A.
(σ3) For any sequence of sets (An )n∈N in A, ∪n∈N An ∈ A.

The sets in A are called the measurable sets.


The terminology “σ-algebra” comes from the fact that a
countable number of operations take place in (σ3).
The term σ-field is also frequently used instead of σ-algebra.

6/141
Examples of σ-algebras

The systems

{∅, X } , P(X ) = {A : A ⊆ X } and {∅, A, Ac , X }

are all σ-algebras.

The trivial σ-algebra {∅, X } is the smallest σ-algebra in X . Every


other σ-algebra must contain the elements of {∅, X }.

The power set P(X ) is the largest σ-algebra in X . The elements


of every other σ-algebra are contained in P(X ).

{∅, A, Ac , X } is the smallest σ-algebra containing A (why?).
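The axioms can be checked mechanically on a small finite X , where closure under countable unions reduces to closure under pairwise unions. A minimal Python sketch (the helper name is our own choice):

```python
def is_sigma_algebra(X, family):
    """Check (σ1)-(σ3) for a finite family of subsets of a finite set X.
    For finite families, closure under countable unions reduces to
    closure under pairwise unions."""
    X = frozenset(X)
    fam = {frozenset(s) for s in family}
    sigma1 = X in fam                                     # (σ1)
    sigma2 = all(X - s in fam for s in fam)               # (σ2): complements
    sigma3 = all(s | t in fam for s in fam for t in fam)  # (σ3), finite case
    return sigma1 and sigma2 and sigma3

X = {1, 2, 3, 4}
A = {1, 2}
print(is_sigma_algebra(X, [set(), A, X - A, X]))  # {∅, A, A^c, X}: True
print(is_sigma_algebra(X, [set(), A, X]))         # missing A^c: False
```

Dropping A^c from the family breaks (σ2), which is why the four-element system is the smallest σ-algebra containing A.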

7/141
Some further properties of σ-algebras

Lemma 2 (Some further stability properties of σ-algebras)

If A is a σ-algebra in X , then
1 ∅ ∈ A.
2 If A1 , . . . , AN is a finite collection of sets in A,
then ∪_{n=1}^N An ∈ A.
3 If A, B ∈ A, then A ∩ B ∈ A.
4 If A, B ∈ A, then A \ B ∈ A.
5 If (An )n∈N is a sequence of sets in A, then ∩n∈N An ∈ A.

Lemma 2 tells us that within a σ-algebra A one can work freely


with the usual set operations without “falling out” of A, as
long as only countably many set operations are involved.

8/141
Proof of Lemma 2

1 ∅ = X^c ∈ A by (σ1) and (σ2).
2 Use (σ3) on the sequence of sets (An )n∈N with An being the
given sets for n = 1, . . . , N and An = AN for n ≥ N + 1.
3 A ∩ B = ((A ∩ B)^c )^c = (A^c ∪ B^c )^c ∈ A by (σ2) and (2) of
this lemma.
4 A \ B = A ∩ B^c ∈ A by (σ2) and (3) of this lemma.
5 ∩n∈N An = ((∩n∈N An )^c )^c = (∪n∈N An^c )^c ∈ A by (σ2)
and (σ3).
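The set identities behind parts (3)-(5) can be spot-checked on concrete finite sets; a small Python illustration (the particular sets are our own choices):

```python
# Verify the De Morgan-style identities used in the proof of Lemma 2.
X = frozenset(range(10))
A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5})
comp = lambda S: X - S                    # complement relative to X

assert A & B == comp(comp(A) | comp(B))   # part (3): A ∩ B = (A^c ∪ B^c)^c
assert A - B == A & comp(B)               # part (4): A \ B = A ∩ B^c

sets = [frozenset(range(k, 10)) for k in range(5)]
inter = frozenset.intersection(*sets)     # part (5) via De Morgan
assert inter == comp(frozenset.union(*(comp(S) for S in sets)))
print(sorted(inter))                      # [4, 5, 6, 7, 8, 9]
```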

9/141
Intersection of σ-algebras

You will show the following in the exercises.


Theorem 3 (Intersection of σ-algebras)

Let (Ai )i∈I be an (arbitrary) family of σ-algebras in X . Then also
the system

∩i∈I Ai = {A ⊆ X : A ∈ Ai for all i ∈ I}

is a σ-algebra in X .

Note, “an arbitrary family” can have a finite, countable or
uncountable index set I .

Theorem 3 is important for defining and constructing


“generated σ-algebras”, which we shall discuss next.

10/141
What level of generality do we want?

The discussion of σ-algebras can happen at two levels.

Specific level. In probability, basic structures are random


variables. These are objects with distribution functions

F (x) = P(X ≤ x) = P(X ∈ (−∞, x]).

Thus, we work with real line R. Implicitly, we have the Euclidean


distance ρ2 in mind. We have experience from school in working
with open and closed sets such as (a, b) and [a, b]. In abstract
mathematics, the open and closed sets are derived from the
distance measure we have implicitly in mind.

11/141
General level. In mathematics, basic structures are metric spaces
and topologies (Kolmogorov and Fomin, 1970).

A metric space consists of a set of elements and a metric, for


instance R with the Euclidean distance ρ2 . On a metric space, we
can define continuity, convergence and a system of open sets.

A topology defines a system of open sets. This is sufficient to


define continuity. A metric space is a particular topological space.

In probability theory, metric spaces and topology are not really


that important for discussion of random variables and random
vectors. But these concepts become important when discussing
convergence of stochastic processes, such as random walks,
objective functions and empirical distribution functions (viewed
jointly as functions over all possible arguments).
12/141
Generated σ-algebras

Let a, b ∈ R such that a < b. (Implicitly we work with (R, ρ2 ).)

Because ∩_{n=1}^∞ (a − 1/n, b + 1/n) = [a, b], the family of open
intervals Dopen in R does not form a σ-algebra.
Because ∪_{n=1}^∞ [a + 1/n, b − 1/n] = (a, b), the family of closed
intervals Dclosed in R does not form a σ-algebra.

Given this observation, one can ask which subsets of R we must


add to a family of sets D in order to obtain a σ-algebra.
More generally, for any family of sets D in a set X , one can ask if
there exists a smallest σ-algebra containing it.
The following simple theorem asserts the existence of such a
smallest σ-algebra.

13/141
Theorem 4 (Generated σ-algebra)

Let D be a family of subsets in X . Then there exists a


smallest σ-algebra σ(D) in X containing D. That is, σ(D) satisfies
1 D ⊆ σ(D).
2 For any σ-algebra A in X containing D, then σ(D) ⊆ A.

Proof: Define the following set of σ-algebras

Σ(D) := {A ⊆ P(X ) : A is a σ-algebra in X and D ⊆ A}.

Clearly, P(X ) ∈ Σ(D) and so Σ(D) is non-empty. Let

σ(D) := ∩A∈Σ(D) A.

By Theorem 3, σ(D) is a σ-algebra. Clearly, D ⊆ σ(D) and by


construction σ(D) ⊆ A for any other A ∈ Σ(D).
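On a finite set, σ(D) can be computed literally by closing D under complements and pairwise unions until nothing new appears; the function below is our own illustration, not part of the slides:

```python
def generate_sigma(X, D):
    """Close D under complement and pairwise union until stable.
    On a finite X this yields the generated σ-algebra σ(D);
    closure under intersections then follows by De Morgan."""
    X = frozenset(X)
    fam = {frozenset(s) for s in D} | {X}
    while True:
        new = {X - s for s in fam} | {s | t for s in fam for t in fam}
        if new <= fam:
            return fam
        fam |= new

X = {1, 2, 3, 4}
gen = generate_sigma(X, [{1}])
# Smallest σ-algebra containing {1}: {∅, {1}, {2,3,4}, X} (cf. page 7)
print(sorted(sorted(s) for s in gen))  # [[], [1], [1, 2, 3, 4], [2, 3, 4]]
```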
14/141
For any family of subsets D in X , the smallest σ-algebra σ(D)
containing D is called the σ-algebra generated by D.
D is called a generating class for σ(D).
A σ-algebra A is said to be countably generated if there exists
a countable family of sets D such that A = σ(D).
If A is a σ-algebra in X then part 2 of Theorem 4 amounts to
the implication

D ⊆ A =⇒ σ(D) ⊆ A.

In particular, for two families of subsets D1 and D2 of X we


have the implications

D1 ⊆ D2 =⇒ D1 ⊆ σ(D2 ) =⇒ σ(D1 ) ⊆ σ(D2 ). (1)

Theorem 4 does not require a metric. We say the result is


topological.

15/141
Equation (1) on the previous slide implies that if the system D is a
generating class for a σ-algebra A in X , then the families
1 {D^c : D ∈ D}
2 {∪n∈N Dn : Dn ∈ D for all n ∈ N}
3 {∩n∈N Dn : Dn ∈ D for all n ∈ N}
also generate A. You will show this in the exercises.

16/141
Recall, that a topology in a set X is a system τ of subsets D ⊂ X ,
called open sets (relative to τ ) such that
1 ∅ and X belong to τ .
2 Arbitrary (finite or infinite) unions ∪i∈I Di and finite
intersections ∩_{i=1}^n Di of open sets belong to τ .
The pair (X , τ ) is a topological space.
The complements D^c = X \ D of open sets D are closed.
If (X , d) is a metric space then its open sets form a topology on
X . But, not every topological space arises in this way.1
By (1) above it follows that the family of open sets G and the
family of closed sets F of any topological space generate the
same σ-algebra.

1
Math topic 3, p. 24; Kolmogorov and Fomin (1970, p. 79) 17/141
Borel algebra - specific level

Borel algebras are defined sometimes at a ‘specific level’,


sometimes at a ‘general level’

In probability texts, it is common to define the Borel algebra as the


specific level in the context of Rd with the Euclidean distance ρ2 .

Definition 5 (Borel-algebra in Rd with Euclidean distance)

The Borel-algebra in Rd with the Euclidean distance ρ2 is
the σ-algebra generated by the products of closed half-lines. It is
denoted by B(Rd ), that is

B(Rd ) := σ({(−∞, b1 ] × · · · × (−∞, bd ] : b1 , . . . , bd ∈ R}).

The elements of B(Rd ) are called Borel sets.

Note, the notation B(Rd ) suppresses the choice of distance.


18/141
Borel sets on R

Let a ≤ b.
Examples of Borel sets of R (with ρ2 ):
Half-open intervals: (a, b] = (−∞, b]\(−∞, a].
Open half-lines: (−∞, b) = ∪n∈N (−∞, b − 1/n].
Closed intervals: [a, b] = (−∞, b]\(−∞, a).
Open intervals: (a, b) = (−∞, b)\(−∞, a].
Any of the above families could also be used as generating classes for the Borel-algebra.
Further examples of Borel sets:
Points: {a} = [a, a].
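The interval identities above can be spot-checked numerically by comparing membership predicates on a grid of sample points (an illustration, not a proof; the grid and the truncation of the countable union are our own choices):

```python
a, b = 1.0, 2.0
half_open = lambda x: a < x <= b                 # (a, b]
via_halflines = lambda x: x <= b and not x <= a  # (-∞, b] \ (-∞, a]

open_halfline = lambda x: x < b                  # (-∞, b)
# the countable union ∪_n (-∞, b - 1/n], truncated at n = 1000
via_union = lambda x: any(x <= b - 1 / n for n in range(1, 1001))

grid = [i / 7 for i in range(-30, 31)]
assert all(half_open(x) == via_halflines(x) for x in grid)
assert all(open_halfline(x) == via_union(x) for x in grid)
print("interval identities agree on the grid")
```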

One can construct (complicated) sets that are not Borel in the
sense of Definition 5. The construction requires the axiom of
choice.2

2
Kolmogorov and Fomin (1970, Section 3.7, Problem 26.7).
19/141
Borel algebra - general level

In mathematics and probability texts concerned with convergence


of stochastic processes, the Borel algebra is commonly defined at
the general level.
Definition 6 (Borel-algebra on a topological space)

The Borel-algebra in the set X with a system τ of open sets is


denoted by B(X ) and given by

B(X ) := σ({G : G ∈ τ }).

The elements of B(X ) are called Borel sets.

Note, the notation B(X ) suppresses the choice of topology.


Note that a metric space induces a topology.3

3
Math topic 3, p. 23-24; Kolmogorov and Fomin (1970, p. 50, 79)
20/141
Definition 5 is a special case of Definition 6. In one dimension, this
is because the Borel-algebra in Definition 5 is equivalent to the
σ-algebra generated by the open intervals (a, b) for all a, b ∈ R.
The open intervals generate a topology, where it suffices to
consider countable set operations. The latter is a consequence of
the next Theorem 7.
Since all norms on Rd are equivalent4 , the open sets on Rd with
the Euclidean distance could equivalently be defined with respect
to any norm-induced metric.
The trivial σ-algebra {∅, Rd } and the power set on Rd are both
Borel-algebras in the sense of Definition 6. Nonetheless, we will
only use the notation B(Rd ) when the open sets are generated by
the open hyper-cubes or equivalently by a norm induced metric.
The question whether all σ-algebras are Borel-algebras in the sense
of Definition 6 appears to be an open question. Kolmogorov and
Fomin (1970, p. 35) write that the term Borel-algebra is often used
to denote a σ-algebra.
4
Kolmogorov and Fomin (1970, p. 141) 21/141
Borel algebra on (Rd , ρ2 ) is countably generated

Let b2 (x, r ) := {y ∈ Rd : ρ2 (x, y ) < r }.




The following theorem shows that B(Rd ) (with Euclidean


distance) is countably generated by open balls.

Theorem 7
For every d ∈ N it holds on (Rd , ρ2 ) that

B(Rd ) = σ({b2 (x, r ) : x ∈ Rd , r ∈ (0, ∞)})
= σ({b2 (x, r ) : x ∈ Qd , r ∈ (0, ∞) ∩ Q}).

As it turns out, the Borel algebra B(Rd ) (with ρ2 ) has just the right
size in many contexts. For contrast:
The power set of Rd is not countably generated.
The trivial σ-algebra {∅, Rd } is finitely generated.
22/141
Recall: a metric space is separable if it contains a countable dense subset.5
To prove Theorem 7, we use the following separability result.
Lemma 8
Let Rd be equipped with a metric ρ so that Qd is dense in Rd
(e.g., ρ2 ). Let G ⊆ Rd be a non-empty open set. Write the
countable set G ∩ Qd as

G ∩ Qd = {xk : k ∈ N} .

Let bρ (x, r ) := {y ∈ Rd : ρ(x, y ) < r } for x ∈ Rd and r > 0.


Then a sequence of positive rational numbers (rk )k∈N exists such
that
G = ∪k∈N bρ (xk , rk ),

Thus, every open set in (Rd , ρ) can be written as a countable


union of open ρ-balls with rational centers and radii.
5
Math topic 3, p.18-19; Kolmogorov and Fomin (1970, p. 48). 23/141
Proof of Lemma 8

For every n ∈ N define



sn := sup {r ∈ (0, 1] : bρ (xn , r ) ⊆ G } .

Fix an rn ∈ [sn /2, sn ) ∩ Q and observe that bρ (xn , rn ) ⊆ G for


all n ∈ N. Hence,
∪n∈N bρ (xn , rn ) ⊆ G .

We now show that the reverse inclusion holds as well. To this end,
fix any x ∈ G and let r ∈ (0, 2] be such that bρ (x, r ) ⊆ G . Since
G is open such an r exists. By assumption, Qd is dense in Rd .
Thus, there exists an n ∈ N such that xn ∈ bρ (x, r /4).6

6
Math topic 3, p. 18-20. 24/141
For any y ∈ bρ (xn , r /2) it holds, by first using the triangle
inequality and then the specified radii of the balls, that

ρ(y , x) ≤ ρ(y , xn ) + ρ(xn , x) < r /2 + r /4 < r .

Hence, bρ (xn , r /2) ⊆ bρ (x, r ) ⊆ G . Since r /2 ∈ (0, 1], it follows


from the definition of sn and rn that

r /2 ≤ sn and so r /4 ≤ sn /2 ≤ rn .

Thus, since ρ(x, xn ) < r /4


x ∈ bρ (xn , r /4) ⊆ bρ (xn , rn ) ⊆ ∪k∈N bρ (xk , rk ).

Since x ∈ G was arbitrary this shows that G ⊆ ∪k∈N bρ (xk , rk ).

25/141
Proof of Theorem 7

By (1) on page 15 it follows that

σ({b2 (x, r ) : x ∈ Qd , r ∈ (0, ∞) ∩ Q})
⊆ σ({b2 (x, r ) : x ∈ Rd , r > 0})
⊆ σ({G ⊆ Rd : G is open}) = B(Rd ).

Next, by Lemma 8,

{G : G is open in Rd } ⊆ σ({b2 (x, r ) : x ∈ Qd , r ∈ (0, ∞) ∩ Q}),

and so, by another application of (1),

B(Rd ) = σ({G : G is open in Rd })
⊆ σ({b2 (x, r ) : x ∈ Qd , r ∈ (0, ∞) ∩ Q}),
which yields the desired conclusion.


26/141
Measures

The next topic is measures.


Measures give the “size” of sets.
Measures are defined on measurable spaces, which we define
next.

Definition 9 (Measurable space)


A measurable space is a pair (X , A), where X is a non-empty set
and A is a σ-algebra in X .

Example: (Rd , B(Rd )) is a measurable space.

27/141
Definition 10 (Measure and measure spaces)

Let (X , A) be a measurable space. A measure µ on (X , A) is a


mapping µ : A → [0, ∞] satisfying the following two conditions.
1 µ(∅) = 0.
2 µ is countably additive, that is, for every sequence (An )n∈N of
pairwise disjoint sets in A it holds that

µ(∪n∈N An ) = Σn∈N µ(An ). (2)

If µ is a measure on (X , A), the triple (X , A, µ) is called a


measure space.

Note that the left-hand side in (2) is well-defined
since ∪n∈N An ∈ A (by σ3) and the right-hand side is
well-defined as it is a sum of non-negative terms.

28/141
Some terminology

If µ(X ) = 1, then we say that µ is a probability measure.


Thus, probability measures are just special measures. In this
case, we often write P instead of µ.
If µ(X ) < ∞ we say that µ is finite.
If there exists a sequence (An )n∈N in A such that µ(An ) < ∞
for all n ∈ N and ∪n∈N An = X , then we say that µ is σ-finite.

29/141
Examples of measures

P = N(0, 1) is a probability measure on (R, B(R)).


Consider the measurable space (X , P(X )), where P(X ) is the
power set. For an element a ∈ X , we define the Dirac
measure δa as
δa (A) = 0 if a ∈ A^c , and δa (A) = 1 if a ∈ A.
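The Dirac measure is a direct transcription of its definition into code (the function name is our own choice):

```python
def dirac(a):
    """The Dirac measure δ_a: maps a set A to 1 if a ∈ A, else 0."""
    return lambda A: 1 if a in A else 0

delta = dirac(3)
assert delta(set()) == 0                       # δ_a(∅) = 0
assert delta({1, 2, 3}) == 1 and delta({1, 2}) == 0

# additivity on a pairwise disjoint (finite) decomposition
parts = [{0}, {1, 2}, {3, 4}]
assert delta(set().union(*parts)) == sum(delta(p) for p in parts)
print("δ_3 behaves as a measure on these examples")
```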

30/141
The Lebesgue measure λd on (Rd , B(Rd ))

The Lebesgue measure λd on (Rd , B(Rd )) is the measure that


assigns “volumes to cubes” in the sense of

λd ([a1 , b1 ] × . . . × [ad , bd ]) = (b1 − a1 ) · . . . · (bd − ad ) (3)

for all a1 , b1 , . . . , ad , bd ∈ R with ai < bi , i = 1, . . . , d.


(3) only specifies the Lebesgue measure on a small class of Borel
sets in Rd .
It is thus not clear
1 whether a measure on all of B(Rd ) fulfilling (3) exists, and
2 whether it is unique.
The answer to both of these is “yes”, but we don’t show this.7
The Lebesgue measure is σ-finite since ∪n∈N [−n, n]d = Rd
and λd ([−n, n]d ) = (2n)d < ∞.


7
See Shiryaev (1996, p. 151–159). 31/141
Counting measure
Counting measure: Consider the measurable space (X , P(X )),
where P(X ) is the power set. The counting measure
τ : P(X ) → [0, ∞] is then defined as

τ (A) = number of elements in A.

That is, τ (A) < ∞ if and only if A has finitely many elements.

Let us verify that τ is indeed a measure. Note that τ (∅) = 0.


To prove (2), consider a sequence of disjoint sets (An )n∈N in A.
If τ (∪n∈N An ) < ∞ then τ (An ) < ∞ for all n ∈ N and there are at
most finitely many An that are non-empty. Denote these
by An1 , . . . , Ank . Then,
τ (∪n∈N An ) = τ (∪_{j=1}^k Anj ) = Σ_{j=1}^k τ (Anj ) = Σn∈N τ (An ).

32/141
If, on the other hand, τ (∪n∈N An ) = ∞, then at least one of the
following holds:
1 There exists an n0 ∈ N such that τ (An0 ) = ∞.
2 τ (An ) > 0 for infinitely many n ∈ N.
In both cases Σn∈N τ (An ) = ∞ as desired. Thus, τ is a measure
on (X , P(X )).
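For finite sets, the verification above can be replayed in code; the sketch below (our own naming, finite sets only) also previews the subadditivity property from Theorem 11:

```python
def tau(A):
    """Counting measure, restricted to finite sets in this sketch."""
    return len(A)

A_n = [{1, 2}, {3}, set(), {4, 5, 6}]           # pairwise disjoint
union = set().union(*A_n)
assert tau(union) == sum(tau(A) for A in A_n)   # additivity, cf. (2)

B, C = {1, 2, 3}, {3, 4}                        # overlapping sets
assert tau(B | C) <= tau(B) + tau(C)            # subadditivity
print(tau(union), tau(B | C))                   # 6 4
```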

33/141
Theorem 11 (Fundamental properties of measures)

Let (X , A, µ) be a measure space. Then


1 µ is finitely additive, that is, if A1 , . . . , AN is a finite collection of
disjoint sets in A, then µ(∪_{n=1}^N An ) = Σ_{n=1}^N µ(An ).
2 If A, B ∈ A and A ⊆ B, then µ(A) ≤ µ(B).
3 If A, B ∈ A, A ⊆ B and µ(A) < ∞, then µ(B \ A) = µ(B) − µ(A).
4 For any sequence of sets (An )n∈N in A it holds that

µ(∪n∈N An ) ≤ Σn∈N µ(An ).

5 Let (An )n∈N be a sequence of sets in A satisfying A1 ⊆ A2 ⊆ . . ..


Then, µ(∪n∈N An ) = limn→∞ µ(An ).
6 Let (An )n∈N be a sequence of sets in A satisfying A1 ⊇ A2 ⊇ . . .
and µ(A1 ) < ∞. Then, µ(∩n∈N An ) = limn→∞ µ(An ).

34/141
Comments

The properties of measures are as one would hope/expect.


In Economics we shall often deal with probability measures.
Clearly, the finiteness requirements in parts (3) and (6) of
Theorem 11 are satisfied in this case.
However, these finiteness requirements cannot be dispensed
with!
To see this in (6), consider the counting measure τ
on (N, P(N)). With An = {n, n + 1, . . .}, n ∈ N, it holds that

τ (∩n∈N An ) = τ (∅) = 0 but lim_{n→∞} τ (An ) = lim_{n→∞} ∞ = ∞.

(4) of Theorem 11 is also referred to as the union bound or


Boole’s inequality.

35/141
Proof of Theorem 11

(1): Let A1 , . . . , AN be disjoint sets in A and set Bn = An


for n = 1, . . . , N and Bn = ∅ for n ≥ N + 1. Then, by (2) on page
28 and by µ(∅) = 0,
µ(∪_{n=1}^N An ) = µ(∪n∈N Bn ) = Σn∈N µ(Bn ) = Σ_{n=1}^N µ(An ).

(2) and (3): Since B = A ∪ (B \ A) with A ∩ (B \ A) = ∅ it follows


from part (1) that

µ(B) = µ(A) + µ(B \ A) ≥ µ(A),

because µ(B \ A) ≥ 0, thus establishing (2). Furthermore,


if µ(A) < ∞, then the previous display yields that

µ(B \ A) = µ(B) − µ(A).

36/141
(4): Let (An )n∈N be an arbitrary sequence of sets in A and define
 
B1 = A1 and Bn = An \ (∪_{j=1}^{n−1} Aj ) for n ≥ 2. (4)

It is clear that (Bn )n∈N is a disjoint sequence of sets in A,


cf. Lemma 2, page 8. Furthermore,

∪n∈N An = ∪n∈N Bn and ∪_{n=1}^N An = ∪_{n=1}^N Bn for all N ∈ N,

and so, by (2), page 28, and since Bn ⊆ An for all n ∈ N,


µ(∪n∈N An ) = µ(∪n∈N Bn ) = Σn∈N µ(Bn ) ≤ Σn∈N µ(An ), (5)

which establishes (4).

37/141
(5): assume in addition that A1 ⊆ A2 ⊆ A3 ⊆ . . . and construct
disjoint sets Bn as in (4), such that
AN = ∪_{n=1}^N An = ∪_{n=1}^N Bn .

It thus follows from the first two equalities of (5) that

µ(∪n∈N An ) = Σn∈N µ(Bn ) = lim_{N→∞} Σ_{n=1}^N µ(Bn ).

Use that the Bn are disjoint to apply part (1) of this theorem to get

· · · = lim_{N→∞} µ(∪_{n=1}^N Bn ) = lim_{N→∞} µ(AN ).

38/141
(6): Assume that A1 ⊇ A2 ⊇ A3 ⊇ . . . and, to begin with,
that µ(X ) < ∞. Hence, µ(A) < ∞ for all A ∈ A.
Clearly Ac1 ⊆ Ac2 ⊆ Ac3 ⊆ . . . and so by part (5) of this theorem
µ((∩n∈N An )^c ) = µ(∪n∈N An^c ) = lim_{N→∞} µ(AN^c ).

Thus, recalling that all values of µ are finite, we can use part (3)
to conclude
µ(AN ) = µ(X ) − µ(AN^c ) → µ(X ) − µ((∩n∈N An )^c ) = µ(∩n∈N An ).

39/141
If µ(X ) = ∞ but µ(A1 ) < ∞ let the measure µA1 : A → [0, ∞) be
defined as

µA1 (A) := µ(A ∩ A1 ).

We have that µA1 (X ) = µ(A1 ) < ∞ and that µA1 is a measure


(why?). Therefore, by the argument on the previous slide
lim_{N→∞} µA1 (AN ) = µA1 (∩n∈N An ).

The result follows upon observing that

µA1 (AN ) = µ(AN ) for all N ∈ N


because AN ⊆ A1 . As we also have ∩n∈N An ⊆ A1 , then

lim_{N→∞} µA1 (AN ) = µA1 (∩n∈N An ) = µ(∩n∈N An ).

40/141
Measurable functions

We shall now introduce measurable functions.


We shall see that if one equips two metric spaces (S, ρS )
and (T , ρT ) with their respective Borel σ-algebras, then every
continuous function is also measurable.
In this sense, the measurable functions are a more general
class than the continuous functions.
We will seek to appreciate the following definition.
Definition 12 (Measurable function)

Let (X , A) and (Y, B) be measurable spaces and consider the


mapping f : X → Y. We say that f is measurable (or more
precisely A-B-measurable) if

f −1 (B) ∈ A for all B ∈ B.

41/141
Recall the following properties of functions.8
A function associates a unique real number y = f (x) with
each element of a set X of real numbers.
X is the domain.
Y = {f (x) : x ∈ X } is the range.
The definition extends to arbitrary sets X , Y. In this more
general context, f may be called a mapping.
If x ∈ X then y = f (x) is the image of x.
Every element of X with a given element y ∈ Y as its image
is called a preimage of y .
If the preimage of y is unique we denote this by f −1 (y ).
If A is a subset of X then {f (x) : x ∈ A} is the image of A.
If B is a subset of Y then f −1 (B) = {x ∈ X : f (x) ∈ B} is
the preimage of B.
If no element in B has a preimage then f −1 (B) = ∅.
8
Math topic 1, p. 15-18; Kolmogorov and Fomin (1970, p. 4-5)
42/141
Examples of functions.
f (x) = x 2 for x ∈ R is a function.

The solutions to y = x 2 for y ≥ 0 are x = ±√y . This is not
unique, so y 7→ x does not define a function.
Examples of preimages.
Let f (x) = x 2 for x ∈ R. Then

f −1 ([1, 4]) = {x ∈ R : x 2 ∈ [1, 4]} = [−2, −1] ∪ [1, 2].




Some properties of preimages. 9

f −1 (A ∪ B) = f −1 (A) ∪ f −1 (B).
f −1 (A ∩ B) = f −1 (A) ∩ f −1 (B).
f (A ∪ B) = f (A) ∪ f (B).
These results extend to unions and intersections of an
arbitrary number of sets.

9
Kolmogorov and Fomin (1970, p. 5-6) 43/141
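The preimage rules can be spot-checked for f (x) = x 2 on a finite grid of points, including the preimage example above restricted to the grid (the grid and helper are our own construction):

```python
X = [i / 2 for i in range(-8, 9)]          # grid -4.0, -3.5, ..., 4.0
f = lambda x: x * x

def preimage(B):
    """Preimage of a set B, given as a membership predicate."""
    return {x for x in X if B(f(x))}

A = lambda y: 1 <= y <= 4                  # the set [1, 4]
B = lambda y: y <= 2                       # the set (-∞, 2]

assert preimage(lambda y: A(y) or B(y)) == preimage(A) | preimage(B)
assert preimage(lambda y: A(y) and B(y)) == preimage(A) & preimage(B)
print(sorted(preimage(A)))   # grid points of [-2,-1] ∪ [1,2]
```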
On metric spaces, continuity is defined as follows.10
Let f map (X , ρx ) into (Y, ρy ). Then, f is continuous at the
point x0 ∈ X if ∀ε > 0, ∃δ > 0 so that

ρy (f (x), f (x0 )) < ε whenever ρx (x, x0 ) < δ.

A topological space is a set X with a system τ of open sets (page


17). There is no notion of distance. Continuity is defined as
follows.11
Let f map (X , τx ) into (Y, τy ). Then f is continuous at the
point x0 ∈ X if ∀B ∈ τy , ∃A ∈ τx , so that x0 ∈ A, f (x0 ) ∈ B
and
f (A) ⊂ B.
On a topological space, the following result holds.11
The mapping f from (X , τx ) into (Y, τy ) is continuous
if and only if f −1 (B) ∈ τx for all B ∈ τy .
10
Math topic 2, p. 24; Kolmogorov and Fomin (1970, p. 44)
11
Kolmogorov and Fomin (1970, p. 87) 44/141
We repeat the definition of a measurable function.
Definition 12 (Measurable function)
Let (X , A) and (Y, B) be measurable spaces and consider the
mapping f : X → Y. We say that f is measurable (or more
precisely A-B-measurable) if

f −1 (B) ∈ A for all B ∈ B.

The larger A is, the easier it is to be measurable.


The larger B is, the harder it is to be measurable.

45/141
Example of measurable function
Let (X , A) be a measurable space and define for every A the
indicator function 1A : X → R as
1A (x) = 1 if x ∈ A, and 1A (x) = 0 if x ∈ A^c .

Let A ∈ A. Clearly, for any B ⊆ R

1A −1 (B) = X if 0, 1 ∈ B,
1A −1 (B) = A if 1 ∈ B and 0 ∈ B^c ,
1A −1 (B) = A^c if 0 ∈ B and 1 ∈ B^c ,
1A −1 (B) = ∅ if 0, 1 ∈ B^c .

Thus, 1A is A-F-measurable if A ∈ A, irrespective of
which σ-algebra F the set R is equipped with (e.g. F = B(R)).
Even the “discontinuous” indicator function is measurable!
If A ∉ A then 1A is not measurable. 46/141
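The four cases of the preimage display can be confirmed computationally on a small finite X (the sets below are our own choices):

```python
X = frozenset(range(6))
A = frozenset({0, 1})
indicator = lambda x: 1 if x in A else 0
preimage = lambda B: frozenset(x for x in X if indicator(x) in B)

assert preimage({0, 1}) == X           # 0, 1 ∈ B
assert preimage({1}) == A              # 1 ∈ B, 0 ∉ B
assert preimage({0}) == X - A          # 0 ∈ B, 1 ∉ B
assert preimage({7}) == frozenset()    # 0, 1 ∉ B
print("all four cases of the display confirmed")
```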
Probability spaces and random variables

Let us put what we have introduced so far into a probabilistic


context.
Definition 13 (Probability space)
A probability space is a triple (Ω, F, P), where Ω is a non-empty
set, F is a σ-algebra in Ω and P is a probability measure on F.
The sets in F are called events and for A ∈ F we
interpret P(A) ∈ [0, 1] as the probability of the event A occurring.

Definition 14 (Random variable)


Let (Ω, F, P) be a probability space and consider the measurable
space (R, B(R)). A random variable is a mapping X : Ω → R that
is F-B(R)-measurable.

Thus, a random variable X is just a measurable function defined


on a probability space mapping into R. Random
vectors X : Ω → Rd are defined analogously.
47/141
Checking measurability: a simplification

Let (X , A) and (Y, B) be measurable spaces and consider the


mapping f : X → Y.
If D is a family of subsets that generates B, i.e. B = σ(D),
then A-B-measurability of f is equivalent to

f −1 (D) ∈ A for all D ∈ D,

that is, we only need to check whether inverse images of sets in
the generating class D (a potentially much smaller and more
manageable family) belong to A.
The following lemma is useful for establishing this fact.

48/141
Lemma 15
Let X be a non-empty set and suppose that D is a family of
subsets which all possess a property P. Furthermore, assume that

E(P) := {A ⊆ X : A has property P}

is a σ-algebra in X . Then every element of σ(D) has the


property P.

Proof: Clearly D ⊆ E(P) and so σ(D) ⊆ E(P), cf. equation (1)


on page 15.

49/141
Lemma 16 (Useful rules for measurability)

Let (X , A) and (Y, B) be measurable spaces and consider the


mapping f : X → Y.
1 If there exists a family of subsets D such that B = σ(D),
then f is A-B-measurable if and only if

f −1 (D) ∈ A for all D ∈ D.

2 Let (Z, C) be another measurable space. If f


is A-B-measurable and g : Y → Z is B-C-measurable,
then g ◦ f : X → Z defined via (g ◦ f )(x) = g (f (x))
is A-C-measurable.

50/141
Proof of Lemma 16
(1) We want to show that f −1 (B) ∈ A for all B ∈ B. This will
follow from Lemma 15 upon showing that

P := {B ⊆ Y : f −1 (B) ∈ A}


forms a σ-algebra in Y since D ⊆ P.12


(σ1) Y ∈ P since f −1 (Y) = X ∈ A.
(σ2) Let B ∈ P. We have that13 X = f −1 (Y) = f −1 (B) ∪ f −1 (B^c )
and ∅ = f −1 (∅) = f −1 (B ∩ B^c ) = f −1 (B) ∩ f −1 (B^c ).
Then f −1 (B^c ) = (f −1 (B))^c . But (f −1 (B))^c ∈ A
because f −1 (B) ∈ A and A satisfies (σ2).
(σ3) Let (Bn )n∈N be a sequence of sets in P. Then13

f −1 (∪n∈N Bn ) = ∪n∈N f −1 (Bn ) ∈ A,

because f −1 (Bn ) ∈ A for all n ∈ N and A satisfies (σ3).


12
The property P of a set B ⊆ Y in Lemma 15 is that f −1 (B) ∈ A.
13
By the preimage rules (page 43) 51/141
(2) Let C ∈ C. Then

(g ◦ f )−1 (C ) = {x ∈ X : g (f (x)) ∈ C }
= {x ∈ X : f (x) ∈ g −1 (C )}
= f −1 (g −1 (C )) ∈ A,


since g −1 (C ) ∈ B [g is B-C-measurable] and f −1 (B) ∈ A for


all B ∈ B [f is A-B-measurable].
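The key identity (g ◦ f )−1 (C ) = f −1 (g −1 (C )) behind this proof can be checked on small finite spaces (the functions and sets are our own choices):

```python
X = range(-5, 6)
f = lambda x: x * x        # f maps X into Y
g = lambda y: y % 3        # g maps Y into Z

def preimage(func, domain, B):
    return {x for x in domain if func(x) in B}

Y = {f(x) for x in X}
C = {0}
lhs = preimage(lambda x: g(f(x)), X, C)    # (g ∘ f)^{-1}(C)
rhs = preimage(f, X, preimage(g, Y, C))    # f^{-1}(g^{-1}(C))
assert lhs == rhs
print(sorted(lhs))                         # [-3, 0, 3]
```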

52/141
Continuity implies measurability
Theorem 17 (Continuity implies measurability)
Let (S, ρS ) and (T , ρT ) be metric spaces equipped with the
corresponding Borel σ-algebras B(S) and B(T ), respectively. Then
every continuous mapping f : S → T is B(S)-B(T )-measurable.

Proof: By part (1) of Lemma 16 (page 50) it suffices to show that


f −1 (G ) ∈ B(S) for every open subset G of T . The open sets of a
metric space form a topology (page 17). Thus, since f is
continuous, f −1 (G ) is an open subset of S (page 44) and hence an
element of B(S).
In words: Every continuous function is measurable in the
above sense.
Observe that we have also seen that discontinuous functions
can be measurable. In this sense “there are more measurable
functions than continuous functions”.
We shall typically work with S = Rd and T = Rk for
some d, k ∈ N. Often k = 1.
53/141
Stability properties of measurable functions

On a measurable space (X , A), the most important class of


measurable functions is probably that of the A-B(R)-measurable
functions f : X → R. Specifically, define

M(A) := {f : X → R : f is A-B(R)-measurable}.

Observe that by Lemma 16 (page 50) and Theorem 7 (page


22) a function f : X → R belongs to M(A) if and only if

f −1 ((a, b)) ∈ A for all − ∞ < a < b < ∞,

or, alternatively, if and only if, see Definition 5 (page 18),

f −1 ((−∞, b]) ∈ A for all b ∈ R.

We shall see that measurability is a rather stable property in


the sense that it is generally preserved when combining
functions and taking limits.
The latter sets it apart from continuous functions.
54/141
Example: Every monotone function f : R → R is an element
of M(B(R)) [even if discontinuous!]. Assume that f is increasing
(i.e. f (x) ≥ f (y ) for x ≥ y ) and define for every b ∈ R the number

s(f , b) := sup {x ∈ R : f (x) ≤ b} ,

with s(f , b) = −∞ if f (x) > b for all x ∈ R. Then

f −1 ((−∞, b]) = ∅ if s(f , b) = −∞,
f −1 ((−∞, b]) = (−∞, s(f , b)] if s(f , b) ∈ R and f (s(f , b)) ≤ b,
f −1 ((−∞, b]) = (−∞, s(f , b)) if s(f , b) ∈ R and f (s(f , b)) > b,
f −1 ((−∞, b]) = R if s(f , b) = ∞.

In each case f −1 ((−∞, b]) ∈ B(R) and the measurability follows


from the observation on the previous slide.
Decreasing functions f can be handled in a similar fashion, or we
can use that −f is increasing in conjunction with part (2) of
Theorem 18 below.
55/141
An important stability result for M(A)

In the exercises you will show:


Theorem 18 (Stability properties of measurable functions)
1 If f1 , . . . , fd : X → R are elements of M(A) and ϕ : Rd → R
is B(Rd )-B(R)-measurable, then

ϕ(f1 , . . . , fd ) : X → R

defined via x 7→ ϕ(f1 (x), . . . , fd (x)) is an element of M(A).


2 If f , g ∈ M(A) and c ∈ R, then the functions

cf , f + g , f · g , f ∧ g , f ∨ g

(with ∧ and ∨ denoting pointwise minimum and maximum) are
elements of M(A). In particular, M(A) is a vector space.

Thus, measurability is preserved under transformations.


56/141
Example

Let f , g ∈ M(A). Then the sets



{x ∈ X : f (x) = g (x)},
{x ∈ X : f (x) ≥ g (x)} and
{x ∈ X : f (x) > g (x)}

are all elements of A. This is because we can write these sets as

(f − g )−1 ({0}), (f − g )−1 ([0, ∞)) and (f − g )−1 ((0, ∞)),

where (f − g ) ∈ M(A) according to Theorem 18 and {0}, [0, ∞)


and (0, ∞) all belong to B(R).
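The identification of, e.g., {x ∈ X : f (x) ≥ g (x)} with (f − g )−1 ([0, ∞)) is easy to confirm pointwise on a finite domain (f and g are our own choices):

```python
X = range(-3, 4)
f = lambda x: x * x
g = lambda x: x + 2

ge_set = {x for x in X if f(x) >= g(x)}
assert ge_set == {x for x in X if f(x) - g(x) >= 0}    # (f-g)^{-1}([0,∞))

eq_set = {x for x in X if f(x) == g(x)}
assert eq_set == {x for x in X if f(x) - g(x) == 0}    # (f-g)^{-1}({0})
print(sorted(ge_set), sorted(eq_set))                  # [-3, -2, -1, 2, 3] [-1, 2]
```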

57/141
Measurability and limits

One frequently deals with sequences of measurable functions,
such as sequences of random variables {Xn }n∈N .
We shall see that the pointwise limit of a sequence of
functions in M(A) is also an element of M(A), provided that
the pointwise limits are elements of R for all x ∈ X .
This is very useful as the Lebesgue integral is only defined for
measurable functions.
Recall that continuity is not generally maintained under
pointwise limits, cf. fn : [0, 1] → [0, 1] where fn (x) = x n . Then
pointwise fn (x) → 1{1} (x), which is discontinuous at 1.

58/141
Functions with values in R̄

Even for a sequence (fn )n∈N of real-valued functions, quantities
such as supn∈N fn (x) and limn→∞ fn (x) need not be real.
We thus allow functions to take values in R̄ := R ∪ {−∞, ∞}.
We equip R̄, using ρ2, with the σ-algebra

B(R̄) := {B ⊆ R̄ : B ∩ R ∈ B(R)}

[check that B(R̄) satisfies (σ1)–(σ3), i.e. it is a σ-algebra].
Furthermore, it can be shown that B(R̄) is generated by the
family of sets

{[−∞, b] : b ∈ R}.

59/141
Measurability is preserved under limits

Define

M(A) = {f : X → R̄ : f is A-B(R̄)-measurable}
M(A)+ = {f ∈ M(A) : f (x) ≥ 0 for all x ∈ X},

where R̄ = R ∪ {−∞, ∞}. Note that f : X → R̄ belongs to M(A) if and only if

f^{-1}([−∞, b]) ∈ A for all b ∈ R,

since {[−∞, b] : b ∈ R} is a generating class for B(R̄).

60/141
One key advantage of working with measurable functions is
that measurability is preserved under pointwise limits.
As mentioned, this is contrast to pointwise limits of
continuous functions.
Define the function inf n∈N fn : X → R as

(inf_{n∈N} fn)(x) := inf_{n∈N} fn(x) = inf{fn(x) : n ∈ N} for x ∈ X

and similarly for the other functions in the following theorem.


For the infimum, it is useful to work with R̄ = R ∪ {−∞, ∞}
since inf_{n∈N} fn(x) could be −∞ for some x ∈ X even
if fn(x) ∈ R for all n ∈ N.

61/141
Theorem 19 (Limits and measurability)
1 Let (X , A) be a measurable space and let (fn )n∈N be a
sequence of functions in M(A). Then the functions

inf_{n∈N} fn,  sup_{n∈N} fn,  lim inf_{n→∞} fn  and  lim sup_{n→∞} fn

are also elements of M(A). Finally, if fn is pointwise
convergent, that is if f(x) := lim_{n→∞} fn(x) exists in R̄ for
all x ∈ X, then also the limiting function f ∈ M(A).
2 If (fn)n∈N is a sequence of real-valued functions in M(A) and
f(x) := lim_{n→∞} fn(x) exists in R for all x ∈ X then
also f ∈ M(A).

62/141
Proof of Theorem 19

To show that sup_{n∈N} fn ∈ M(A), it is enough to verify
that {x ∈ X : sup_{n∈N} fn(x) ∈ [−∞, b]} ∈ A for every b ∈ R.
This follows since

{x ∈ X : sup_{n∈N} fn(x) ∈ [−∞, b]} = ∩_{n∈N} fn^{-1}([−∞, b]) ∈ A.

To show that inf n∈N fn ∈ M(A), we first show that −f ∈ M(A)


whenever f ∈ M(A):14
 
{x ∈ X : −f(x) ∈ [−∞, b]} = {x ∈ X : f(x) ∈ [−b, ∞]}
= X \ {x ∈ X : f(x) ∈ [−∞, −b)}
= X \ ∪_{n∈N} f^{-1}([−∞, −b − n^{-1}])
∈ A,

since f −1 ([−∞, −b − n−1 ]) ∈ A for all n ∈ N.


14
Observe that we can’t directly use part 2 of Theorem 18 to conclude
that −f ∈ M(A) since that theorem is for real-valued functions in M(A). 63/141
Therefore,15

inf_{n∈N} fn = − sup_{n∈N} (−fn) ∈ M(A).

From this it also follows that gn := inf_{k≥n} fk ∈ M(A) and so

lim inf_{n→∞} fn = sup_{n∈N} inf_{k≥n} fk and lim sup_{n→∞} fn = inf_{n∈N} sup_{k≥n} fk

are elements of M(A).


Finally, if fn is pointwise convergent, then

f = lim_{n→∞} fn = lim inf_{n→∞} fn ∈ M(A).

15
Recall that for any A ⊆ R it holds that − inf A = sup(−A) and use this
with A = {fn (x) : n ∈ N} for each x ∈ X . 64/141
A final note on stability of measurability in M(A)

Barring the subtraction of ∞ from ∞, a result analogous to part 2 of
Theorem 18 (page 56) also holds for M(A).
We mention the following result without proof.

Theorem 20
Let f , g be functions in M(A) and let c be a constant in R. Then
1 the functions cf , f ∧ g , f ∨ g and fg are elements of M(A).
2 if

{x ∈ X : f (x) = ∞ and g (x) = −∞} = ∅

and

{x ∈ X : f (x) = −∞ and g (x) = ∞} = ∅,

then the function f + g is well-defined and an element of M(A).


65/141
The Lebesgue integral

We shall now introduce the Lebesgue integral and study its main
properties. It has several advantages over the Riemann integral.
1 It is defined for a broader class of functions.
2 It is much more stable under pointwise limits of sequences of
functions. That is, pointwise limits and integration can often
be interchanged. For the Riemann integral we typically need
uniform convergence.
3 The Lebesgue integral is easily defined for functions on an
arbitrary measure space (X , A, µ). The Riemann integral is
defined for functions on R. This is important in probability
theory where the random variables are defined on a probability
space (Ω, F, P) and expected values are Lebesgue integrals
E X = ∫_Ω X(ω) P(dω).

66/141
Simple functions

We shall now introduce the so-called simple functions.


These are the basic building blocks for the Lebesgue integral,
which can be defined in a natural way for these.
Furthermore, many useful properties of the Lebesgue integral
can easily be established for simple functions...
...and we shall see that any element of M(A) can be
approximated by simple functions and so one can often extend
the useful properties of the Lebesgue integral
of simple functions to integrals of functions in M(A).

67/141
Given a measure space (X , A, µ), ai ∈ R and Ai ∈ A, i = 1, . . . , n
we call s : X → R defined via
s(x) = ∑_{i=1}^{n} a_i 1_{A_i}(x)    (6)

a simple function.
Observe that s is A-B(R)-measurable by part (2) of Theorem 18.
Denote by SM(A) the set of simple functions and by SM(A)+
the set of non-negative simple functions — clearly SM(A) is a
vector space — so that we can apply linear operations.
It is clear that we can always choose the Ai in (6) such that they
are disjoint and ∪_{i=1}^{n} Ai = X. For example, for f : R → R with

f = 3 · 1[1,3] + 2 · 1[2,4]
= 3 · 1[1,2) + 5 · 1[2,3] + 2 · 1(3,4] + 0 · 1R\[1,4]

We call such a representation a standard representation. The


standard representation is not unique. 68/141
Approximation via simple functions
We shall frequently use that for f ∈ M(A)+ there exists a
sequence of non-negative simple functions sn such that sn ↑ f .
Lemma 21
Let f ∈ M(A)+ . Then there exists a sequence (sn )n∈N
in SM(A)+ such that 0 ≤ s1 (x) ≤ s2 (x) ≤ . . . for all x ∈ X
and sn (x) → f (x), that is sn ↑ f .

Proof: Let f ∈ M(A)+ . For every n ∈ N we define the function


s_n(x) := ∑_{j=1}^{n2^n} ((j−1)/2^n) · 1_{f^{-1}([(j−1)/2^n, j/2^n))}(x) + n · 1_{f^{-1}([n,∞])}(x).

Clearly, sn ∈ SM(A)+ for n ∈ N. Observe that


|f(x) − sn(x)| ≤ 2^{−n} if f(x) ∈ [0, n) and sn(x) = n if f(x) ≥ n.
Thus sn (x) → f (x) for all x ∈ X . It is also clear
that sn (x) ≤ sn+1 (x) for all n ∈ N. 69/141
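The dyadic construction in this proof translates directly into code; the sketch below (helper names are mine) builds s_n for f(x) = x² and checks monotone pointwise convergence at sample points with f(x) below the cap.

```python
import math

def s_n(f, n, x):
    """Dyadic simple-function approximation of a non-negative f, as in
    the proof of Lemma 21: value (j-1)/2^n where f lands in
    [(j-1)/2^n, j/2^n), and the cap n where f >= n."""
    v = f(x)
    if v >= n:
        return n
    return math.floor(v * 2 ** n) / 2 ** n

f = lambda x: x ** 2
for x in [0.0, 0.3, 1.7, 2.9]:
    vals = [s_n(f, n, x) for n in range(1, 20)]
    # s_1(x) <= s_2(x) <= ...  and  s_n(x) -> f(x)
    assert all(a <= b for a, b in zip(vals, vals[1:]))
    assert abs(vals[-1] - f(x)) < 1e-3
```

Refining the dyadic grid can only raise the value, which is why the sequence is monotone.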
Integrals of non-negative simple functions
We begin by introducing the integral of non-negative simple
functions.
Definition 22
Let s be a non-negative simple function, i.e. s ∈ SM(A)+ , with
standard representation
s = ∑_{i=1}^{n} a_i 1_{A_i}.

Then we define the integral of s with respect to µ as

∫ s dµ := ∑_{i=1}^{n} a_i µ(A_i) ∈ [0, ∞].

The Ai ∈ A, hence µ(Ai ) is well-defined.


Since the ai ≥ 0 there are no “∞ − ∞” issues.
70/141
The same simple function can be written in many ways.
However, it is not difficult to show that the value of the
integral in Definition 22 does not depend on the chosen
representation.

71/141
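This representation-independence can be checked on the earlier example f = 3·1_[1,3] + 2·1_[2,4]; the sketch below (helper name is mine) computes ∑ a_i µ(A_i) under the Lebesgue measure of intervals for both representations.

```python
def lebesgue_integral_simple(terms):
    """Integral of a simple function sum(a_i * 1_[lo, hi)) with respect
    to Lebesgue measure: sum of a_i * lambda1([lo, hi))."""
    return sum(a * (hi - lo) for a, (lo, hi) in terms)

# f = 3*1_[1,3] + 2*1_[2,4]  (intervals treated as half-open; endpoints
# are Lebesgue null sets, so this does not change any integral)
rep1 = [(3, (1, 3)), (2, (2, 4))]
# disjoint "standard" representation of the same function
rep2 = [(3, (1, 2)), (5, (2, 3)), (2, (3, 4))]

assert lebesgue_integral_simple(rep1) == lebesgue_integral_simple(rep2) == 10
```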
We begin by establishing some fundamental properties of the
integral on SM(A)+ (page 68).

Theorem 23 (Properties of the integral on SM(A)+)

Let f, g ∈ SM(A)+ and a ∈ [0, ∞). Then
1 ∫ 1_A dµ = µ(A) for every A ∈ A.
2 ∫ af dµ = a ∫ f dµ.
3 ∫ (f + g) dµ = ∫ f dµ + ∫ g dµ.
4 ∫ f dµ ≤ ∫ g dµ if f ≤ g.

As can be seen the integral behaves as one would hope:


The integral of an indicator function is the measure of the set:
Property (1).
The integral is linear: Properties (2) & (3).
The integral is monotone: Property (4).

72/141
Proof of Theorem 23

Recall Definition 22 (page 70). If s is a non-negative simple
function with standard representation s = ∑_{i=1}^{n} a_i 1_{A_i}, then
∫ s dµ := ∑_{i=1}^{n} a_i µ(A_i) ∈ [0, ∞].
R P

(1) and (2) follow by manipulating Definition 22.


(3). Write

f = ∑_{i=1}^{n} a_i 1_{A_i} and g = ∑_{j=1}^{m} b_j 1_{B_j},

where (A_i)_{i=1}^{n} and (B_j)_{j=1}^{m} are partitions of X and a_1, . . . , a_n, b_1, . . . , b_m ≥ 0.
Thus,

1_{A_i} = ∑_{j=1}^{m} 1_{A_i ∩ B_j} and 1_{B_j} = ∑_{i=1}^{n} 1_{A_i ∩ B_j},

73/141
and so

f + g = ∑_{i=1}^{n} ∑_{j=1}^{m} a_i 1_{A_i ∩ B_j} + ∑_{j=1}^{m} ∑_{i=1}^{n} b_j 1_{A_i ∩ B_j}
      = ∑_{i=1}^{n} ∑_{j=1}^{m} (a_i + b_j) 1_{A_i ∩ B_j}

which, by the definition of the integral on SM(A)+, implies

∫ (f + g) dµ = ∑_{i=1}^{n} ∑_{j=1}^{m} (a_i + b_j) µ(A_i ∩ B_j)
             = ∑_{i=1}^{n} ∑_{j=1}^{m} a_i µ(A_i ∩ B_j) + ∑_{j=1}^{m} ∑_{i=1}^{n} b_j µ(A_i ∩ B_j)
             = ∑_{i=1}^{n} a_i µ(A_i) + ∑_{j=1}^{m} b_j µ(B_j) = ∫ f dµ + ∫ g dµ.

The third equality uses part (1) of Theorem 11 (page 34). 74/141
(4) Since f ≤ g are in SM(A)+ then g − f ≥ 0 is in SM(A)+ .
We then find by part (3) that
∫ g dµ = ∫ {f + (g − f)} dµ = ∫ f dµ + ∫ (g − f) dµ.

Since g − f is in SM(A)+ then ∫ (g − f) dµ ≥ 0, so that
∫ g dµ ≥ ∫ f dµ as desired.

75/141
The Lebesgue integral on M(A)+
Lemma 21 (page 69) shows that non-negative functions can be
approximated by simple functions.
Definition 24
Let (X, A, µ) be a measure space and f : X → [0, ∞] a function
in M(A)+. We then define the µ-integral ∫ f dµ of f via

∫ f dµ := sup{∫ s dµ : s ∈ SM(A)+, s ≤ f} ∈ [0, ∞].

Monotonicity: Let f, g ∈ M(A)+ satisfy f ≤ g. Since

{∫ s dµ : s ∈ SM(A)+, s ≤ f} ⊆ {∫ s dµ : s ∈ SM(A)+, s ≤ g},

then ∫ f dµ ≤ ∫ g dµ, i.e. the integral is monotone on M(A)+.
[Note that Definition 24 implies Definition 22 (page 70).] 76/141
Lebesgue’s Monotone Convergence Theorem

The following theorem is of fundamental importance in the


theory of Lebesgue integration.
It is an example of the ease with which one can interchange
pointwise limits and integration.

Theorem 25 (Lebesgue’s Monotone Convergence Theorem)

Let (fn )n∈N be a sequence of functions in M(A)+ such that

f1 ≤ f2 ≤ . . .

Then the function f := limn→∞ fn = supn∈N fn is again an element


of M(A)+ and
∫ f dµ = ∫ lim_{n→∞} fn dµ = lim_{n→∞} ∫ fn dµ = sup_{n∈N} ∫ fn dµ.

77/141
Proof of Theorem 25
That f = limn→∞ fn = supn∈N fn ≥ 0 is again an element of
M(A)+ follows from Theorem 19 (page 62).
Since fn ≤ supn∈N fn = f for all n ∈ N, it holds that
∫ fn dµ ≤ ∫ f dµ and hence sup_{n∈N} ∫ fn dµ ≤ ∫ f dµ,

by the monotonicity observation after Definition 24 (page 76).
It remains to be shown that also sup_{n∈N} ∫ fn dµ ≥ ∫ f dµ. By
Definition 24 (page 76) it suffices to show that

∫ s dµ ≤ sup_{n∈N} ∫ fn dµ for all s ∈ SM(A)+ with s ≤ f.

Thus, let such an s be given and note that it suffices to show that

α ∫ s dµ ≤ sup_{n∈N} ∫ fn dµ for all α ∈ (0, 1).
78/141
Thus, let α ∈ (0, 1) be given and set

B_n := {x ∈ X : αs(x) ≤ fn(x)}.

If f (x) > 0 then αs(x) < f (x). Hence, since fn (x) ↑ f (x) there
exists an nx such that αs(x) ≤ fn (x) for n ≥ nx . If f (x) = 0, then
also αs(x) = fn (x) = 0. Thus, Bn ↑ X , i.e. B1 ⊆ B2 ⊆ . . .
and ∪n∈N Bn = X . Thus,

αs(x)1Bn ≤ fn (x) for all n ∈ N.

Since αs1Bn ∈ SM(A)+ it follows from part (2) of Theorem 23


(page 72)
α ∫ s1_{B_n} dµ = ∫ αs1_{B_n} dµ ≤ ∫ fn dµ ≤ sup_{n∈N} ∫ fn dµ,

and hence

α lim sup_{n→∞} ∫ s1_{B_n} dµ ≤ sup_{n∈N} ∫ fn dµ.
79/141
We conclude by showing that the left-hand side of the previous
display equals α ∫ s dµ. Since s ∈ SM(A)+ there exist an N ∈ N
and a standard representation with a_1, . . . , a_N ≥ 0 such that
s = ∑_{i=1}^{N} a_i 1_{A_i} and so s1_{B_n} = ∑_{i=1}^{N} a_i 1_{A_i ∩ B_n}.

Therefore,

∫ s1_{B_n} dµ = ∑_{i=1}^{N} a_i µ(A_i ∩ B_n).

By part (5) of Theorem 11 (page 34) it holds for i = 1, . . . , N that

lim_{n→∞} µ(A_i ∩ B_n) = µ(∪_{n∈N} (A_i ∩ B_n)) = µ(A_i ∩ (∪_{n∈N} B_n)) = µ(A_i).

It follows that

lim sup_{n→∞} ∫ s1_{B_n} dµ = ∑_{i=1}^{N} a_i µ(A_i) = ∫ s dµ.
80/141
A useful consequence

Recall that by Lemma 21 (page 69) it holds that for


every f ∈ M(A)+ there always exists a sequence (sn )n∈N
in SM(A)+ such that sn ↑ f .
By the monotone convergence theorem it follows in particular
that for any f ∈ M(A)+ and any sequence (sn )n∈N
in SM(A)+ such that sn ↑ f it holds that
lim_{n→∞} ∫ sn dµ = ∫ f dµ.

This can be used to generalize properties of the integral


on SM(A)+ to M(A)+ , cf. (the proof of) Theorem 26
below.

81/141
Dirichlet’s function

The following is an example of how the Lebesgue integral handles


pointwise limits better than the Riemann integral:
Consider Dirichlet’s function D = 1Q∩[0,1] on (R, B(R)).
D is not Riemann-integrable as all lower Riemann sums are 0
and all upper Riemann sums are 1.
But ∫_{[0,1]} D dλ1 = λ1(Q ∩ [0, 1]) = 0.
It gets worse: Since Q ∩ [0, 1] is countable, we can write it
as Q ∩ [0, 1] = {x_k : k ∈ N}.
Clearly, fn := ∑_{k=1}^{n} 1_{{x_k}} ↑ D.
For every n, fn is Riemann integrable and ∫_0^1 fn(x) dx = 0.16

16
Assume without loss of generality that x1 < . . . < xn such that there is a
strictly positive distance r between the xi . Hence, in any partition of [0, 1]
consisting of intervals of length at most r /2 at most n elements contain an xi .
Taking the infimum over such partitions one sees that the upper Riemann
integral is 0 [just like the lower Riemann integral clearly is] 82/141
But since D is not Riemann-integrable we see that pointwise
limits do not preserve Riemann-integrability and it does not
make sense to write

lim_{n→∞} ∫_0^1 fn(x) dx = ∫_0^1 D(x) dx.

However, in accordance with the Monotone Convergence


Theorem, 17
lim_{n→∞} ∫_0^1 fn dλ1 = lim_{n→∞} ∑_{k=1}^{n} λ1({x_k}) = 0 = ∫_0^1 D dλ1.

17
Alternatively, we can use the Dominated Convergence Theorem to be
introduced in Theorem 33 below. 83/141
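A small numerical illustration of why D has no Riemann integral (my own sketch; rationality of the sample points is tracked via the type of each point): Riemann sums depend entirely on whether the sample points are rational, so they have no common limit, while the Lebesgue integral is 0.

```python
from fractions import Fraction

def riemann_sum(f, points, width):
    # left-point Riemann sum with equal-width subintervals
    return sum(f(p) * width for p in points)

def D(x):
    # Dirichlet's function 1_{Q ∩ [0,1]}; here we only evaluate it at
    # points we know to be rational (Fraction) or irrational (float shift)
    return 1.0 if isinstance(x, Fraction) else 0.0

n = 1000
rational_pts = [Fraction(k, n) for k in range(n)]                  # all rational
irrational_pts = [k / n + 2 ** 0.5 / 10 ** 6 for k in range(n)]    # irrational shifts

# sampling at rationals gives 1, at irrationals gives 0: no Riemann limit
assert abs(riemann_sum(D, rational_pts, 1 / n) - 1.0) < 1e-9
assert riemann_sum(D, irrational_pts, 1 / n) == 0.0
```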
We now use the Monotone Convergence Theorem to show how the
integral on M(A)+ inherits properties from SM(A)+ .

Theorem 26 (Properties of the integral on M(A)+)

Let f, g ∈ M(A)+ and a ∈ [0, ∞). Then
1 ∫ af dµ = a ∫ f dµ.
2 ∫ (f + g) dµ = ∫ f dµ + ∫ g dµ.
3 ∫ f dµ ≤ ∫ g dµ if f ≤ g.

Proof: Part (3) was established after Definition 24 (page 76).


To prove part (1) let f ∈ M(A)+ and a ∈ [0, ∞). By Lemma 21
(page 69) we can and do choose a sequence (sn )n∈N in SM(A)+
such that sn ↑ f . Observe that asn ∈ SM(A)+ for all n ∈ N
and asn ↑ af . Thus, by Theorem 25 (page 77) and part (2) of
Theorem 23 (page 72) it follows that
∫ af dµ = lim_{n→∞} ∫ asn dµ = a lim_{n→∞} ∫ sn dµ = a ∫ f dµ.
84/141
To prove part (2) choose (sn )n∈N and (tn )n∈N in SM(A)+ such
that sn ↑ f and tn ↑ g . Clearly, (sn + tn ) ∈ SM(A)+ for all n ∈ N
and (sn + tn ) ↑ (f + g ). Hence, by Lebesgue’s monotone
convergence theorem (page 77)
∫ (f + g) dµ = lim_{n→∞} ∫ (sn + tn) dµ = lim_{n→∞} (∫ sn dµ + ∫ tn dµ)
= ∫ f dµ + ∫ g dµ,

the second equality using linearity of the integral on SM(A)+ , see


part (3) of Theorem 23 (page 72).

85/141
µ-null set, µ-a.e. and “almost surely”

It turns that the µ-integral “does not care about null sets”.

Definition 27
A subset N of X is called a µ-null set if there exists an A ∈ A such
that

N ⊆ A and µ(A) = 0.

Remark: Consider the null-sets for the Lebesgue measure on


R. A σ-algebra can be generated from those and the
Borel-algebra. The Lebesgue measure can be extended to that
σ-algebra. Even so, there exist sets in R that are not in that
σ-algebra. Those sets are said to be non-measurable.

86/141
Examples of λ1 -null sets

Consider the measure space (R, B(R), λ1 ). As mentioned


before, the singleton {x} is a Borel set: For x ∈ R

{x} = ∩_{n=1}^{∞} (x − n^{-1}, x + n^{-1}) ∈ B(R),

such that by part (6) of Theorem 11

λ1({x}) = lim_{n→∞} λ1((x − n^{-1}, x + n^{-1})) = lim_{n→∞} 2/n = 0.

It follows by the definition of a measure


that λ1 (Q) = λ1 (∪x∈Q {x}) = 0. Thus, the rational numbers
are a λ1 -null set [even though they are dense in R!]

87/141
Consider (X , A, µ). We say that a property holds for µ-almost
all x ∈ X if the property holds for all x ∈ X \N where
N is a µ-null set. [common in mathematics]
N ∈ A and µ(N) = 0. [common in probability]
We also say the property holds µ-almost everywhere (a.e.).
In probability, we say almost surely (a.s.) or with probability
one. This terminology often turns up in probability and
statistics.
Examples:
If µ({x ∈ X : f(x) ≠ g(x)}) = 0 we say that f = g µ-almost
everywhere, or for µ-almost every x, or µ-a.e.
If µ({x ∈ X : lim_{n→∞} fn(x) does not exist}) = 0, we say that fn
converges µ-almost everywhere, or for µ-almost every x.

88/141
It is often useful that the µ-integral does not register µ-null sets,
in the sense of (4) in the following theorem.

Theorem 28 (The integral and µ-null sets)

Let f, g ∈ M(A)+. Then
1 ∫ f dµ = 0 ⇐⇒ f = 0 µ-a.e.
2 ∫ f 1_N dµ = 0 for every µ-null set N in A.
3 If ∫ f dµ < ∞ then it holds that f < ∞ µ-a.e.
4 If f = g µ-a.e. then it holds that ∫ f dµ = ∫ g dµ.

If f ∈ M(A)+ and ∫ f dµ = 0 then we can conclude
that µ({x ∈ X : f(x) ≠ 0}) = 0.
However, we can’t conclude that f(x) = 0 for all x ∈ X. To
see this let f(x) = 1_A(x) for some A ≠ ∅ with µ(A) = 0.
Example: Consider (R, B(R), λ1). Then ∫ 1_Q dλ1 = 0,
but 1_Q(x) = 1 for x ∈ Q.

89/141
Proof of Theorem 28

Part (2) follows from part (1) since there f 1N = 0 µ-a.e.


Part (1): Assume that f = 0 µ-a.e. and let (sn)n∈N be a sequence
in SM(A)+ such that sn ↑ f. Then sn = 0 on {f = 0} and for
every n ∈ N there exists an N ∈ N such that18

sn = 0 · 1_{{f=0}} + ∑_{i=1}^{N} a_i 1_{A_i},

where A_i ∈ A and A_i ⊆ {f > 0} as well as a_i > 0, i = 1, . . . , N.
Since µ(A_i) = 0 for i = 1, . . . , N, it holds that ∫ sn dµ = 0 for
all n ∈ N and by the monotone convergence theorem

∫ f dµ = lim_{n→∞} ∫ sn dµ = 0.

18
For example, one can use sn as constructed in the proof of Lemma 21
(page 69). 90/141
For the converse, suppose µ(f > 0) =: ε > 0. We bound

∫ f dµ ≥ ∫ f 1_{{f>1/n}} dµ ≥ (1/n) ∫ 1_{{f>1/n}} dµ = (1/n) µ(f > 1/n).

Since µ(f > 0) = ε > 0 then by part (5) of Theorem 11 (page 34)
there exists an n ∈ N such that µ(f > 1/n) > ε/2.
Thus, µ(f > 0) = ε > 0 implies that ∫ f dµ > ε/(2n) > 0.
Part (3): Suppose ∫ f dµ < ∞. Define the monotone sequence
gn = n · 1_{(f=∞)}. Then f ≥ g ≥ gn where gn → g = ∞ · 1_{(f=∞)}.
Part 3 of Theorem 26 (page 84) then shows

∞ > ∫ f dµ ≥ ∫ g dµ = ∫ lim_{n→∞} gn dµ.

The Monotone Convergence Theorem (page 77) then shows

∞ > ∫ lim_{n→∞} gn dµ = lim_{n→∞} ∫ gn dµ = lim_{n→∞} ∫ n · 1_{(f=∞)} dµ.

91/141
Part 1 of Theorem 26 (page 84) then shows

∞ > lim_{n→∞} ∫ n · 1_{(f=∞)} dµ = lim_{n→∞} n ∫ 1_{(f=∞)} dµ
  = µ(f = ∞) lim_{n→∞} n = ∞ · µ(f = ∞).

This implies µ(f = ∞) = 0.

Part (4): Assume that f = g µ-a.e. Then, by part (2) of this


theorem and linearity of the integral on M(A)+
∫ f dµ = ∫ (f 1_{{f=g}} + f 1_{{f≠g}}) dµ
= ∫ f 1_{{f=g}} dµ + ∫ f 1_{{f≠g}} dµ
= ∫ g 1_{{f=g}} dµ + ∫ g 1_{{f≠g}} dµ
= ∫ (g 1_{{f=g}} + g 1_{{f≠g}}) dµ = ∫ g dµ.

92/141
Integration of real measurable functions

So far we have focused on developing the properties of


the µ-integral over SM(A)+ and M(A)+ .
We shall now extend these properties to functions that are not
necessarily non-negative.
Denote by

f+ = f ∨ 0 and f− = −(f ∧ 0)

the positive and negative parts of f : X → R. Observe that

|f| = f+ + f− and f = f+ − f−,

(the latter being well-defined since at each x ∈ X at least one
of f+(x) and f−(x) equals 0).


By Theorem 20 (page 65) and the above f + , f − ∈ M(A)+ if and
only if f ∈ M(A). In this case also |f | ∈ M(A)+ .
93/141
Definition 29 (L(µ) and L1 (µ))
For a measure space (X , A, µ) we define
L(µ) = {f ∈ M(A) : ∫ f+ dµ ∧ ∫ f− dµ < ∞}
L1(µ) = {f ∈ M(A) : ∫ f+ dµ ∨ ∫ f− dµ < ∞}

Observe that L1 (µ) consists of functions taking real values only.

Definition 30 (The general integral)


For a function f ∈ L(µ) we define
∫ f dµ := ∫ f+ dµ − ∫ f− dµ.

Observe that since f ∈ L(µ), we never run into “∞ − ∞” issues.

94/141
Because

∫ |f| dµ = ∫ f+ dµ + ∫ f− dµ ≤ 2 (∫ f+ dµ ∨ ∫ f− dµ)

and

∫ f+ dµ ∨ ∫ f− dµ ≤ ∫ f+ dµ + ∫ f− dµ = ∫ |f| dµ

it follows that

L1(µ) = {f ∈ M(A) : ∫ |f| dµ < ∞}.

Hence, since by Theorem 26 (page 84) one has for f , g ∈ L1 (µ)


and a ∈ R
∫ |af| dµ = |a| ∫ |f| dµ and ∫ |f + g| dµ ≤ ∫ |f| dµ + ∫ |g| dµ

it follows that L1 (µ) is a vector space (in M(A)).


95/141
Some alternative notations

The following expressions are used interchangeably for ∫ f dµ:

∫ f dµ,  ∫_X f(x) µ(dx),  ∫ f(x) µ(dx),  ∫_X f(x) dµ(x).

96/141
Theorem 31 (Linearity and other properties of the integral)

Let (X , A, µ) be a measure space, let f , g ∈ L1 (µ) and a ∈ R.


Then,
1 ∫ af dµ = a ∫ f dµ.
2 ∫ (f + g) dµ = ∫ f dµ + ∫ g dµ.
3 If f ≤ g µ-a.e. then ∫ f dµ ≤ ∫ g dµ and

  ∫ f dµ = ∫ g dµ ⇐⇒ f = g µ-a.e.

4 |∫ f dµ| ≤ ∫ |f| dµ.

Observe that we do not run into ∞ − ∞ issues in calculating f + g


in (2) since by virtue of f , g ∈ L1 (µ), they take real values only.

97/141
Proof of Theorem 31
(2): As L1(µ) is a vector space, f + g ∈ L1(µ) and

(f + g)+ − (f + g)− = f + g = f+ − f− + g+ − g−,

and hence (recall that all function values are real)

(f + g)+ + f− + g− = (f + g)− + f+ + g+.

Thus, by part (2) of Theorem 26 (page 84),

∫ (f + g)+ dµ + ∫ f− dµ + ∫ g− dµ = ∫ (f + g)− dµ + ∫ f+ dµ + ∫ g+ dµ.

Since all integrals in the previous display are finite one gets

∫ (f + g) dµ = ∫ (f + g)+ dµ − ∫ (f + g)− dµ
= ∫ f+ dµ − ∫ f− dµ + ∫ g+ dµ − ∫ g− dµ
= ∫ f dµ + ∫ g dµ.

(1), (3), (4) proved in similar fashion. 98/141


Integrating over subsets A ∈ A

Let us also mention that for f ∈ M(A) and A ∈ A such


that f 1A ∈ L(µ) we define
∫_A f dµ := ∫ f 1_A dµ.

99/141
Interchanging limits and integration

One of the appeals of the µ-integral is the ease with which it


allows interchanging integration and limits.
This is another advantage over the Riemann integral, which
generally only allows interchanging integration and limits on
bounded intervals under uniform rather than pointwise
convergence of a sequence of functions fn to f .
The following is a slightly more general version of the
Monotone Convergence Theorem previously stated in
Theorem 25 (page 77).

100/141
Monotone Convergence Theorem

Theorem 32 (Monotone Convergence Theorem)

Let f, f1, f2, f3, . . . be functions in M(A)+
satisfying f1 ≤ f2 ≤ f3 ≤ . . . µ-a.e. Then, if f = lim_{n→∞} fn µ-a.e.
it holds that

lim_{n→∞} ∫ fn dµ = ∫ f dµ.

The Monotone Convergence Theorem allows us to


interchange integration and pointwise limits for an increasing
sequence of non-negative functions.
Note that we only need f1 (x) ≤ f2 (x) ≤ . . . for µ-almost all x.
The only difference to Theorem 25 (page 77) is that
monotonicity and convergence now only need to
hold µ-a.e. rather than for all x ∈ X.
101/141
Proof of Theorem 32
Define

A := ∩_{n∈N} {x ∈ X : fn(x) ≤ fn+1(x)},
B := {x ∈ X : lim_{n→∞} fn(x) = f(x)},

and C := A ∩ B. Then, µ(C c ) ≤ µ(Ac ) + µ(B c ) = 0 by


assumption. Further defining g , gn ∈ M(A)+ for n ∈ N as

gn (x) := fn (x)1C (x) and g (x) := f (x)1C (x)

it follows that gn (x) ↑ g (x) for all x ∈ X . Also, fn = gn µ-a.e.


and f = g µ-a.e. Hence, by part (4) of Theorem 28 (page 89) and
Theorem 25 (page 77)
∫ fn dµ = ∫ gn dµ → ∫ g dµ = ∫ f dµ.

102/141
Prior to illustrating the Monotone Convergence Theorem, let
us note that in case (Ω, F, P) is a probability space upon
which a random variable X with values in (R, B(R)) is
defined, then we write
E X := ∫_Ω X(ω) P(dω) = ∫_Ω X dP, X ∈ L(P),

for the expected value of X .


The distribution PX of X on B(R) is defined as

PX (A) := P(X ∈ A) = P(ω ∈ Ω : X (ω) ∈ A), A ∈ B(R).

[We shall see that the distribution of a random variable X is


nothing else than a so-called image measure.]

103/141
Example: AR(1)

Let (εt )t∈Z be a sequence of random variables on the


probability space (Ω, F, P) with values in (R, B(R)) such
that E |εt | ≤ C < ∞ for all t ∈ Z and some C > 0.
Let

Yt = αYt−1 + εt , t ∈ N, |α| < 1.

We know that the solution to this autoregressive equation is



∑_{i=0}^{∞} α^i ε_{t−i}

But is this (random) series well-defined?

104/141
Consider first ∑_{i=0}^{∞} |α|^i |ε_{t−i}|.
Clearly, lim_{N→∞} ∑_{i=0}^{N} |α|^i |ε_{t−i}| exists in [0, ∞].
Since ∑_{i=0}^{N} |α|^i |ε_{t−i}| is F-B(R)-measurable (by Theorem 18,
page 56) and using that limits preserve measurability (by
Theorem 19, page 62), we conclude
that lim_{N→∞} ∑_{i=0}^{N} |α|^i |ε_{t−i}| is F-B(R)-measurable.
Since ∑_{i=0}^{N} |α|^i |ε_{t−i}| ↑ ∑_{i=0}^{∞} |α|^i |ε_{t−i}|, the Monotone
Convergence Theorem (page 101) and linearity of the integral
yield that

E ∑_{i=0}^{∞} |α|^i |ε_{t−i}| = lim_{N→∞} ∑_{i=0}^{N} |α|^i E |ε_{t−i}| ≤ C/(1 − |α|) < ∞.

Thus, ∑_{i=0}^{∞} |α|^i |ε_{t−i}| < ∞ P-a.s. (see Theorem 28, part 3,
page 89).
Since ∑_{i=0}^{∞} α^i ε_{t−i} converges absolutely P-a.s., it also
converges P-a.s.
105/141
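A simulation sketch of this argument (the Gaussian choice for εt is mine; any εt with E|εt| ≤ C works): the partial sums ∑_{i=0}^{N} α^i ε_{t−i} settle down as N grows, reflecting the almost-sure absolute convergence.

```python
import random

random.seed(0)
alpha = 0.8
eps = [random.gauss(0, 1) for _ in range(4000)]  # stands in for ε_t, ε_{t-1}, ...

# partial sums S_N = sum_{i=0}^{N} alpha^i * eps[i]
partial, s = [], 0.0
for i, e in enumerate(eps):
    s += alpha ** i * e
    partial.append(s)

# absolute convergence: the tail sum_{i>N} |alpha|^i |eps_i| is
# negligible, so late partial sums agree to machine precision
assert abs(partial[-1] - partial[-500]) < 1e-12
```

The geometric decay of |α|^i is exactly what makes the bound C/(1 − |α|) finite.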
Lebesgue’s Dominated Convergence Theorem

Theorem 33 (Lebesgue’s Dominated Convergence Theorem)

Let f , f1 , f2 , f3 , . . . be functions in M(A) such


that f = limn→∞ fn µ-a.e. If there exists a function g ∈ M(A)+
such that
1 |fn| ≤ g µ-a.e. for all n ∈ N.
2 ∫ g dµ < ∞.
then

lim_{n→∞} ∫ fn dµ = ∫ f dµ and lim_{n→∞} ∫ |fn − f| dµ = 0.

A function g satisfying the conditions of the theorem is often


referred to as an integrable majorant of the sequence (fn )n∈N ,
i.e “g dominates the fn ” — hence the name of the theorem.

106/141
Example
Consider the measure space (R, B(R), λ1 ) and let f ∈ L1 (λ1 ). We
show that
lim_{n→∞} (1/(2n)) ∫_{[−n,n]} f(x) λ1(dx) = lim_{n→∞} ∫ (1/(2n)) f(x) 1_{[−n,n]}(x) λ1(dx) = 0,    (7)

via the Dominated Convergence Theorem.


Observe that
fn(x) := (1/(2n)) f(x) 1_{[−n,n]}(x) → 0 for all x ∈ R.

Furthermore, for all n ∈ N,

|fn(x)| ≤ (1/2)|f(x)| and ∫ (1/2)|f(x)| λ1(dx) < ∞,

since f ∈ L1 (λ1 ). The Dominated Convergence Theorem now


yields (7). 107/141
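A numeric version of this example (f(x) = e^{−|x|} is my choice of integrable f; midpoint-rule quadrature stands in for the λ1-integral):

```python
import math

def f(x):
    return math.exp(-abs(x))  # integrable: the λ1-integral of |f| is 2

def avg_over_pm_n(n, steps=100_000):
    # midpoint-rule approximation of (1/2n) * integral of f over [-n, n]
    h = 2 * n / steps
    return sum(f(-n + (k + 0.5) * h) for k in range(steps)) * h / (2 * n)

vals = [avg_over_pm_n(n) for n in (1, 10, 100)]
# the averages shrink toward 0, as (7) predicts
assert vals[0] > vals[1] > vals[2]
assert vals[2] < 0.02
```

In closed form the average is (1 − e^{−n})/n, which indeed tends to 0.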
Relationship to Riemann integral

It is often useful that integrals with respect to the Lebesgue


measure can be calculated via the Riemann integral over bounded
intervals.
Theorem 34
Let a and b be real numbers such that a < b and let f : [a, b] → R
be Riemann integrable and B([a, b])-B(R)-measurable. Then,
∫_{[a,b]} f dλ1 = ∫_a^b f(x) dx,

the right-hand integral being the Riemann-integral.



In the above B([a, b]) := {A ∩ [a, b] : A ∈ B(R)}.

108/141
A useful consequence

Consider the measure space (R, B(R), λ1 ) and let f ∈ L1 (λ1 ).


Then, by the Dominated Convergence Theorem,
∫ f dλ1 = lim_{n→∞} ∫ f 1_{[−n,n]} dλ1 = lim_{n→∞} ∫_{[−n,n]} f dλ1,    (8)

where we used that f 1[−n,n] → f and |f 1[−n,n] | ≤ |f | for all n ∈ N.


Thus, if f is also Riemann integrable, we can calculate the
Lebesgue integral ∫ f dλ1 as

∫ f dλ1 = lim_{n→∞} ∫_{−n}^{n} f(x) dx.

Of course, (8) remains valid in a situation where the first equality


can be ensured via the Monotone Convergence Theorem (such as
when f ≥ 0).
109/141
Example
Let us determine the value of the integral ∫_0^∞ x e^{−x} λ1(dx).
Observe that x e^{−x} 1_{[0,n)}(x) ↑ x e^{−x} 1_{[0,∞)}(x) for all x. Thus,
by the Monotone Convergence Theorem and Theorem 34

∫_0^∞ x e^{−x} λ1(dx) = ∫ x e^{−x} 1_{[0,∞)}(x) λ1(dx) = lim_{n→∞} ∫ x e^{−x} 1_{[0,n)}(x) λ1(dx)
= lim_{n→∞} ∫_{[0,n]} x e^{−x} λ1(dx) = lim_{n→∞} ∫_0^n x e^{−x} dx,

the last integral being Riemann. Using integration by parts,

∫_0^n x e^{−x} dx = [−x e^{−x}]_0^n + ∫_0^n e^{−x} dx = −n e^{−n} − e^{−n} + 1,

and so

∫_0^∞ x e^{−x} λ1(dx) = lim_{n→∞} ∫_0^n x e^{−x} dx = lim_{n→∞} (1 − (n + 1) e^{−n}) = 1.
110/141
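The truncation argument translates directly into a numerical check (midpoint quadrature on [0, n] is my stand-in for the Riemann integral):

```python
import math

def integrand(x):
    return x * math.exp(-x)

def integral_0_to_n(n, steps=100_000):
    # midpoint-rule approximation of the Riemann integral of x*e^{-x} on [0, n]
    h = n / steps
    return sum(integrand((k + 0.5) * h) for k in range(steps)) * h

# closed form from integration by parts: 1 - (n+1)e^{-n}
for n in (1, 5, 30):
    assert abs(integral_0_to_n(n) - (1 - (n + 1) * math.exp(-n))) < 1e-6

# the truncated integrals increase to the limiting value 1
assert abs(integral_0_to_n(30) - 1.0) < 1e-6
```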
Improper integrals

Some integrals exist as improper Riemann integrals but not as


Lebesgue integrals. An example is the Dirichlet integral. It can be
shown that Z ∞
sin x
dx = π
−∞ x
as an improper
R bRiemann integral. The argument involves showing
that limb→∞ 0 x −1 sin xdx exists. However, it can also be shown
that Z ∞
sin x
dx = ∞.
−∞ x
Thus, the function (sin x)/x is not Lebesgue integrable.
The generalized Riemann integral has Riemann, improper Riemann
and Lebesgue integrals as special cases (Bartle and Sherbert,
2011).

111/141
Regularity conditions: Example

Let us illustrate a situation where one can’t interchange integration


and limits. Consider the probability space

([0, 1], B([0, 1]), λ1 ) and fn (x) = n1[0,1/n] (x), n ∈ N.

Clearly, fn (x) → 0 for λ1 -a.e. x ∈ [0, 1]. [Observe that {0} is a


Lebesgue null set].
However, [use either the above relationship to the
Riemann-integral or that λ1 ([0, 1/n]) = 1/n]
∫ fn dλ1 = 1 ≠ 0 = ∫ 0 dλ1.

Thus, we should carefully check the regularity conditions of the


Dominated and Monotone Convergence Theorems prior to applying
them.
112/141
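Checking this counterexample numerically (the quadrature grid is my choice):

```python
def f_n(n, x):
    # f_n = n * 1_{[0, 1/n]}
    return float(n) if 0 <= x <= 1 / n else 0.0

def integral_on_01(g, steps=100_000):
    # midpoint rule on [0, 1]
    h = 1 / steps
    return sum(g((k + 0.5) * h) for k in range(steps)) * h

for n in (2, 10, 100):
    # every f_n integrates to 1 ...
    assert abs(integral_on_01(lambda x: f_n(n, x)) - 1.0) < 1e-2
# ... yet f_n(x) -> 0 for every fixed x > 0
assert f_n(10**6, 0.3) == 0.0
```

No integrable majorant exists here: the pointwise supremum of the fn is not integrable near 0, which is why dominated convergence does not apply.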
Some useful inequalities

For f, g ∈ M(A), the following inequalities hold,
writing ||f||_p = (∫ |f|^p dµ)^{1/p}:

Markov: µ(|f| ≥ t) ≤ (1/t^p) ∫ |f|^p dµ, p ∈ (0, ∞), t > 0.

Hölder: ∫ |fg| dµ ≤ ||f||_p ||g||_q, 1/p + 1/q = 1, p > 1.

Cauchy-Schwarz: ∫ |fg| dµ ≤ ||f||_2 ||g||_2.

Minkowski: ||f + g||_p ≤ ||f||_p + ||g||_p, p ≥ 1.

113/141
For illustration, let us show how one can easily establish a
generalized version of Markov’s inequality within the
measure-theoretic framework we have developed.
Let ψ : R → [0, ∞) be increasing (hence B(R)-B(R)-measurable).
Then, for all t ∈ (0, ∞) such that ψ(t) ∈ (0, ∞),

µ(|f| ≥ t) ≤ µ(ψ(|f|) ≥ ψ(t)) = ∫ 1_{{ψ(|f|)≥ψ(t)}} dµ
≤ ∫ (ψ(|f|)/ψ(t)) 1_{{ψ(|f|)≥ψ(t)}} dµ ≤ ∫ ψ(|f|) dµ / ψ(t).

Using ψ(t) = t p with p ∈ (0, ∞) yields Markov’s inequality.


And Markov’s inequality with f = X − E X for a random
variable X, µ a probability measure, and p = 2, implies Chebyshev’s
inequality.

114/141
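An empirical sanity check of Markov's inequality (the exponential sample distribution is my choice; note the inequality holds exactly for the empirical measure, being itself a probability measure):

```python
import random

random.seed(1)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # |X| ~ Exp(1)

def markov_holds(t, p):
    lhs = sum(x >= t for x in xs) / len(xs)            # empirical µ(|X| >= t)
    rhs = sum(x ** p for x in xs) / len(xs) / t ** p   # empirical E|X|^p / t^p
    return lhs <= rhs

assert all(markov_holds(t, p) for t in (0.5, 1, 2, 4) for p in (1, 2))
```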
Jensen’s inequality

Jensen’s inequality is for probability measures. Consider a


random vector X on a probability space (Ω, F, P).

Lemma 35 (Jensen’s inequality)


Let C be an open and convex subset of Rm and let L : C → R be
a convex function. If X is a random vector with P(X ∈ C ) = 1
and E ||X || < ∞,19 then
1 E X ∈ C.
2 L(E X ) ≤ E L(X ).

Since x ↦ x² is convex one has (E X)² ≤ E X².


If g : C → R is concave then −g is convex and so

−g (E X ) ≤ E[−g (X )] ⇐⇒ E g (X ) ≤ g (E X ).

19
Here, for any x ∈ Rm, ||x|| = (∑_{i=1}^{m} x_i²)^{1/2} denotes the Euclidean norm.
115/141
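A quick empirical illustration of the scalar case (the sample distribution is my choice):

```python
import random

random.seed(2)
xs = [random.gauss(1.0, 2.0) for _ in range(50_000)]

mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)

# Jensen with the convex map x -> x^2: (E X)^2 <= E X^2;
# the gap is exactly the variance of X (here sigma^2 = 4)
assert mean ** 2 <= mean_sq
assert abs((mean_sq - mean ** 2) - 4.0) < 0.2
```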
Induced measure and substitution rule

Let (X , A) and (Y, B) be measurable spaces and T : X → Y


be A-B-measurable.
If µ is a measure on A, then µ ◦ T −1 , defined by

(µ ◦ T −1 )(B) := µ(T −1 (B)) = µ(x ∈ X : T (x) ∈ B), B∈B

is a measure on B which is called the induced measure/image


measure/push-forward measure (of µ under the mapping T ).
One frequently writes µT for the image measure.
Let us verify that µT is a measure on B (Definition 10, page 28).
Clearly, µT (∅) = µ(T −1 (∅)) = µ(∅) = 0 and for (Bn )n∈N a disjoint
sequence of sets in B one has, since µ is a measure on A,

µT(∪_{n∈N} Bn) = µ(T^{-1}(∪_{n∈N} Bn)) = µ(∪_{n∈N} T^{-1}(Bn))
= ∑_{n∈N} µ(T^{-1}(Bn)) = ∑_{n∈N} µT(Bn).
116/141
We now present a “substitution rule” that is often practical and
that tells us how to integrate with respect to the image
measure µT = µ ◦ T −1 .
Lemma 36 (Substitution rule)

Let (X , A, µ) be a measure space, (Y, B) a measurable space


and T : X → Y be A-B-measurable. For µT the image measure
of µ under T one has

L(µT ) = {f ∈ M(B) : f ◦ T ∈ L(µ)} ,


L1 (µT ) = {f ∈ M(B) : f ◦ T ∈ L1 (µ)} .

In addition, it holds for any function f ∈ L(µT ) that


∫ f dµT = ∫ (f ◦ T) dµ.    (9)

117/141
Observe that (9) can equivalently be written as

∫ f d(µ ◦ T^{-1}) = ∫ (f ◦ T) dµ = ∫ f(T) dµ.

The proof of (9) uses a classic technique of first showing the


property for indicator functions, then simple functions, then
non-negative functions, then general functions.
This is called the standard extension method or the standard proof.

118/141
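A numerical sketch of (9) for a concrete case (all choices are mine): take µ the uniform distribution on [0, 1] and T(x) = x², so the image measure µT has Lebesgue density 1/(2√y) on (0, 1]; both sides of (9) are approximated by midpoint quadrature.

```python
import math

def midpoint_integral(g, a, b, steps=200_000):
    h = (b - a) / steps
    return sum(g(a + (k + 0.5) * h) for k in range(steps)) * h

f = math.cos                 # test integrand f on [0, 1]
T = lambda x: x * x          # measurable map T(x) = x^2

# right-hand side of (9): integral of f(T(x)) under µ = uniform on [0, 1]
rhs = midpoint_integral(lambda x: f(T(x)), 0.0, 1.0)
# left-hand side: integral of f under µT, using its density 1/(2*sqrt(y))
lhs = midpoint_integral(lambda y: f(y) / (2 * math.sqrt(y)), 0.0, 1.0)

# tolerance is loose because the density is singular (though integrable) at 0
assert abs(lhs - rhs) < 2e-3
```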
Proof of Lemma 36
First, we show (9) for indicator functions. Let B ∈ B. By the
definition of µT ,
∫ 1_B dµT = µT(B) = µ(T^{-1}(B)) = ∫ 1_{T^{-1}(B)} dµ.

Further, by definition, x ∈ T −1 (B) if and only if T (x) ∈ B, so that


∫ 1_B dµT = ∫ 1_B(T) dµ,

Second, we show (9) for s ∈ SM(B)+ , that is


s = ∑_{i=1}^{n} a_i 1_{B_i},  a_i ≥ 0, B_i ∈ B, i = 1, . . . , n, n ∈ N,

and by linearity of the integral and the penultimate display

∫ s dµT = ∑_{i=1}^{n} a_i ∫ 1_{B_i} dµT = ∑_{i=1}^{n} a_i ∫ 1_{B_i}(T) dµ = ∫ s(T) dµ.
119/141
Third, we show (9) for f ∈ M(B)+, using the Monotone
Convergence Theorem (page 77). Let (sn)n∈N be a sequence
in SM(B)+ such that sn ↑ f. Then, sn(T) ∈ SM(A)+
and sn(T) ↑ f(T) and so

∫ f dµT = lim_{n→∞} ∫ sn dµT = lim_{n→∞} ∫ sn(T) dµ = ∫ f(T) dµ.

Fourth, for a general f ∈ M(B)

f ∈ L(µT) ⇐⇒ ∫ f+ dµT ∧ ∫ f− dµT < ∞
⇐⇒ ∫ f+(T) dµ ∧ ∫ f−(T) dµ < ∞
⇐⇒ ∫ (f ◦ T)+ dµ ∧ ∫ (f ◦ T)− dµ < ∞
⇐⇒ f ◦ T ∈ L(µ).

120/141
Similarly, replacing “∧” by “∨” in the previous display, we
conclude that

f ∈ L1 (µT ) ⇐⇒ f ◦ T ∈ L1 (µ)

Finally, for f ∈ L(µT),

∫ f dµT = ∫ f+ dµT − ∫ f− dµT = ∫ f+(T) dµ − ∫ f−(T) dµ
= ∫ (f ◦ T)+ dµ − ∫ (f ◦ T)− dµ = ∫ (f ◦ T) dµ.

121/141
Product measures and Tonelli’s Theorem

Definition 37 (Product σ-algebra)


Let (X , A) and (Y, B) be measurable spaces. We
call A ⊗ B = σ(A × B : A ∈ A, B ∈ B) the product σ-algebra,
and (X × Y, A ⊗ B) the product space.

Definition 38 (Product measure)


Let (X , A, µ) and (Y, B, ν) be σ-finite measure spaces. The
product measure µ ⊗ ν on A ⊗ B is the unique measure
satisfying (µ ⊗ ν)(A × B) = µ(A) · ν(B) for all A ∈ A and B ∈ B.

Typical product spaces that one encounters are

(R^{d1} × R^{d2}, B(R^{d1}) ⊗ B(R^{d2}))

with the product measure λ_{d1} ⊗ λ_{d2}.


122/141
Integration with respect to product measures

Tonelli’s Theorem tells us how to integrate with respect to


product measures.
In particular, it allows us to change the order of integration.

Theorem 39 (Tonelli’s Theorem)

Let (X , A, µ) and (Y, B, ν) be σ-finite measure spaces and


consider the product space (X × Y, A ⊗ B, µ ⊗ ν). For every
function f : X × Y → [0, ∞] from M(A ⊗ B)+ it holds that
∫_X ( ∫_Y f(x, y) ν(dy) ) µ(dx) = ∫_{X×Y} f(x, y) (µ ⊗ ν)(dx, dy)
                                = ∫_Y ( ∫_X f(x, y) µ(dx) ) ν(dy).

123/141
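For counting measures on finite sets, Tonelli's Theorem reduces to the fact that a non-negative double sum may be evaluated in either order. A short Python check (the sets X, Y and the function f are illustrative):

```python
# Counting measures on X = {0,...,3} and Y = {0,...,4}; f is non-negative,
# so Tonelli's Theorem applies and all three sums below must agree.
X, Y = range(4), range(5)
f = lambda x, y: x * y + 1.0

inner_then_outer = sum(sum(f(x, y) for y in Y) for x in X)  # dy first, then dx
outer_then_inner = sum(sum(f(x, y) for x in X) for y in Y)  # dx first, then dy
joint = sum(f(x, y) for x in X for y in Y)  # integral w.r.t. the product measure

assert inner_then_outer == outer_then_inner == joint
```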
Tonelli’s theorem tells us how to integrate non-negative
functions with respect to product measures.
We now discuss how to integrate functions that are not
necessarily non-negative.

Theorem 40 (Fubini’s Theorem)

Let (X , A, µ) and (Y, B, ν) be σ-finite measure spaces and


consider the product space (X × Y, A ⊗ B, µ ⊗ ν).
Let f : X × Y → R be a function in L1 (µ ⊗ ν). Then,
∫_X ( ∫_Y f(x, y) ν(dy) ) µ(dx) = ∫_{X×Y} f(x, y) (µ ⊗ ν)(dx, dy)
                                = ∫_Y ( ∫_X f(x, y) µ(dx) ) ν(dy).

Observe that the requirement that f ∈ L1(µ ⊗ ν) can be
checked via Tonelli's theorem: simply check
whether ∫_{X×Y} |f| d(µ ⊗ ν) < ∞.
124/141
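The integrability condition in Fubini's Theorem is not cosmetic. The classic counterexample on N0 × N0 with counting measures is f(m, n) = 1 if m = n, −1 if m = n + 1, and 0 otherwise: its iterated sums disagree precisely because ∑|f| = ∞. Since every row and every column has at most two nonzero entries, the iterated sums can be computed exactly in Python (N is just a truncation level large enough to cover all relevant entries):

```python
def f(m, n):
    # +1 on the diagonal, -1 just below it, 0 elsewhere.
    if m == n:
        return 1.0
    if m == n + 1:
        return -1.0
    return 0.0

N = 200  # each row/column has at most two nonzero entries,
         # so these iterated sums are exact, not approximations

# Sum over n first, then over m: row 0 contributes 1, every other row 0.
rows_first = sum(sum(f(m, n) for n in range(N + 2)) for m in range(N))
# Sum over m first, then over n: every column contributes 1 - 1 = 0.
cols_first = sum(sum(f(m, n) for m in range(N + 2)) for n in range(N))

assert rows_first == 1.0 and cols_first == 0.0  # the two orders disagree
```

Fubini's Theorem does not apply here because ∑_{m,n} |f(m, n)| = ∞, so f is not in L1 of the product (counting) measure.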
Absolute continuity and domination

One often encounters distributions that have a density.


Let (X , A) be a measurable space and µ, ν be σ-finite
measures on A.
If ν(A) = 0 implies µ(A) = 0 for all A ∈ A, then we say that µ is
absolutely continuous with respect to ν.
We also say that ν dominates µ and write µ ≪ ν.
In words: every set to which ν assigns measure 0 must also be
assigned measure 0 by µ.

125/141
Radon-Nikodym

Theorem 41 (Radon-Nikodym)
If µ and ν are σ-finite measures on A and µ ≪ ν, then there exists
a ν-a.e. uniquely determined f ∈ M(A)+ called the density of µ
with respect to ν such that for every A ∈ A
µ(A) = ∫_A f dν   and   ∫ h dµ = ∫ h f dν

for every h ∈ L(µ) = {h ∈ M(A) : hf ∈ L(ν)}.

The function f is also denoted by dµ/dν and called the


Radon-Nikodym derivative of µ with respect to ν.

126/141
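When µ and ν are finite measures on a finite set and ν has positive mass wherever µ does, the Radon-Nikodym derivative is simply the ratio of point masses, and the defining property µ(A) = ∫_A f dν can be checked over every subset A. A small sketch with illustrative measures:

```python
from itertools import combinations

# Two finite measures on {0, 1, 2, 3}; nu dominates mu
# (nu assigns positive mass to every point, so no nu-null set carries mu-mass).
nu = {0: 1.0, 1: 2.0, 2: 0.5, 3: 1.5}
mu = {0: 0.5, 1: 4.0, 2: 0.0, 3: 3.0}

# The density dmu/dnu is the ratio of point masses.
f = {x: mu[x] / nu[x] for x in nu}

# Check mu(A) = integral of f over A w.r.t. nu, for every subset A.
points = list(nu)
for r in range(len(points) + 1):
    for A in combinations(points, r):
        mu_A = sum(mu[x] for x in A)
        int_A = sum(f[x] * nu[x] for x in A)
        assert abs(mu_A - int_A) < 1e-12
```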
Example: Normal distribution

For η ∈ R and σ^2 ∈ (0, ∞) we have that the Normal
distribution N(η, σ^2) is the measure on (R, B(R)) with density

f_{η,σ^2}(x) = (2πσ^2)^{-1/2} e^{-(x-η)^2/(2σ^2)},   x ∈ R,
with respect to the Lebesgue measure λ1 .
That is,
N(η, σ^2)(A) = ∫_A f_{η,σ^2}(x) λ^1(dx),   A ∈ B(R).

In case A is an interval, we thus have


N(η, σ^2)(A) = ∫_A f_{η,σ^2}(x) dx,

the last integral being a Riemann integral.


127/141
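The Riemann integral of the Normal density over an interval can be approximated by a plain midpoint sum and compared against the closed form via the error function, since Φ(z) = (1 + erf(z/√2))/2. A self-contained check (the parameters and the interval are illustrative):

```python
import math

eta, sigma2 = 1.0, 4.0  # N(1, 4)
density = lambda x: math.exp(-(x - eta) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# N(eta, sigma2)((a, b]) via a midpoint Riemann sum of the density ...
a, b, n = -1.0, 3.0, 50_000
h = (b - a) / n
riemann = sum(density(a + (k + 0.5) * h) for k in range(n)) * h

# ... compared against the closed form using the error function.
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
sigma = math.sqrt(sigma2)
exact = Phi((b - eta) / sigma) - Phi((a - eta) / sigma)

assert abs(riemann - exact) < 1e-6
```

The interval (−1, 3] is η ± σ here, so both computations return the familiar ≈ 0.6827.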
Example: Poisson distribution

For λ ∈ (0, ∞) we have that the Poisson distribution Poi(λ) is the


measure on (N0 , P(N0 )) with density

f_λ(x) = e^{-λ} λ^x / x!,   x ∈ N0,
with respect to the counting measure τ .
That is,
Poi(λ)(A) = ∫_A f_λ(x) τ(dx) = ∑_{x∈A} e^{-λ} λ^x / x!,   A ∈ P(N0),

the last equality following from integration with respect to


counting measure.

128/141
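Integration against the counting measure is just summation, so Poisson probabilities are sums of the density over the set in question. A quick sketch (λ and the set A are illustrative; the density is summed over a finite range since the tail mass beyond it is negligible):

```python
import math

lam = 3.5
pmf = lambda x: math.exp(-lam) * lam ** x / math.factorial(x)

# Poi(lam)(A) is the sum of the density over A (counting measure).
A = {0, 2, 5, 7}
prob_A = sum(pmf(x) for x in A)
assert 0.0 < prob_A < 1.0

# The density integrates to 1 over N0 (truncation error here is far below 1e-12).
total = sum(pmf(x) for x in range(150))
assert abs(total - 1.0) < 1e-12
```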
Applications to measure-theoretic probability

We shall finish by some applications to measure-theoretic


probability:
Let (Ω, F, P) be a probability space.
If (X, A) is a measurable space, we call X : Ω → X a random
function if it is F-A-measurable.
If (X, A) = (R^d, B(R^d)) we call X a random vector.
The distribution PX of X is defined as

P_X(A) := (P ◦ X^{-1})(A) = P(X^{-1}(A)) = P(X ∈ A),

which is just the image measure of P under X .


If PX is the normal or the Poisson distribution, we have just
seen how probabilities can be obtained via integration against
densities.

129/141
Independence

Definition 42
1. Let X_1, . . . , X_n be random functions defined on (Ω, F, P) with
values in the measurable spaces (X_1, A_1), . . . , (X_n, A_n),
respectively. We say that X_1, . . . , X_n are independent if for
all A_1, . . . , A_n in A_1, . . . , A_n, respectively, it holds that

P(X_1 ∈ A_1, . . . , X_n ∈ A_n) = ∏_{i=1}^n P(X_i ∈ A_i).

2. Let (X_i)_{i∈N} be an (infinite) sequence of random variables
defined on (Ω, F, P) such that X_j takes values in a measurable
space (X_j, A_j), j ∈ N. Then we say that X_1, X_2, X_3, . . .
are independent if X_1, . . . , X_n are independent according to
the definition in part (1) for each n ∈ N.

130/141
The distribution P_{X_1,...,X_n} of (X_1, . . . , X_n) on A_1 ⊗ · · · ⊗ A_n is
defined as

P_{X_1,...,X_n}(C) := (P ◦ (X_1, . . . , X_n)^{-1})(C),   C ∈ A_1 ⊗ · · · ⊗ A_n.

We now show that X1 , . . . , Xn are independent if and only if their


distribution is the product measure of the marginal distributions.
Theorem 43 (Independence and product measures)

Let X1 , . . . , Xn be random functions defined on (Ω, F, P) with


values in the measurable spaces (X1 , A1 ), . . . , (Xn , An ),
respectively. Consider the random
function X = (X1 , . . . , Xn ) : Ω → X1 × . . . × Xn , which
is F-A1 ⊗ . . . ⊗ An -measurable with distribution PX1 ,...,Xn .
Consider also the product measure PX1 ⊗ . . . ⊗ PXn .

X1 , . . . , Xn are independent ⇐⇒ PX1 ,...,Xn = PX1 ⊗ . . . ⊗ PXn .

131/141
Proof of Theorem 43

Let X1 , . . . , Xn be independent. Then for A1 , . . . , An


from A1 , . . . , An , respectively

P_{(X_1,...,X_n)}(A_1 × · · · × A_n) = P(X_1 ∈ A_1, . . . , X_n ∈ A_n)
                                    = ∏_{i=1}^n P(X_i ∈ A_i) = ∏_{i=1}^n P_{X_i}(A_i),

which is the characterizing property of a product measure.


If, on the other hand, PX1 ,...,Xn = PX1 ⊗ . . . ⊗ PXn , then

P(X_1 ∈ A_1, . . . , X_n ∈ A_n) = P_{(X_1,...,X_n)}(A_1 × · · · × A_n)
                                = ∏_{i=1}^n P_{X_i}(A_i) = ∏_{i=1}^n P(X_i ∈ A_i),

which shows that the X1 , . . . , Xn are independent.


132/141
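For discrete random variables, Theorem 43 says that the joint distribution of independent variables is the product of the point masses of the marginals. A minimal sketch with illustrative marginal pmfs:

```python
import itertools

# Marginal distributions of two independent discrete random variables.
pX = {0: 0.2, 1: 0.5, 2: 0.3}
pY = {0: 0.6, 1: 0.4}

# Under independence the joint distribution is the product measure:
# P_{(X,Y)}({(x, y)}) = P_X({x}) * P_Y({y}).
joint = {(x, y): pX[x] * pY[y] for x, y in itertools.product(pX, pY)}

# Sanity checks: it is a probability measure with the right marginals.
assert abs(sum(joint.values()) - 1.0) < 1e-12
for x in pX:
    assert abs(sum(joint[(x, y)] for y in pY) - pX[x]) < 1e-12
for y in pY:
    assert abs(sum(joint[(x, y)] for x in pX) - pY[y]) < 1e-12
```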
Lemma 44
Let X and Y be independent random variables on (Ω, F, P) and
assume that X , Y ∈ L1 (P). Then XY ∈ L1 (P) and

E[XY ] = E[X ] E[Y ]. (10)

Proof: Since X and Y are independent, P_{(X,Y)} = P_X ⊗ P_Y by
Theorem 43 (page 131). Hence, by Tonelli's Theorem (page 123),
E|XY| = ∫_Ω |XY| dP = ∫_{R^2} |xy| (P_X ⊗ P_Y)(dx, dy)
      = ∫_R [ ∫_R |x||y| P_X(dx) ] P_Y(dy)
      = ∫_R |x| P_X(dx) ∫_R |y| P_Y(dy) = E|X| E|Y| < ∞.

Thus, XY ∈ L1 (P) and we can repeat the above calculations


without absolute value signs by Fubini’s Theorem (page 124) to
get (10). 133/141
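For discrete independent random variables, E[XY] = E[X] E[Y] can be verified exactly by summing against the product measure. A small check (the two marginal pmfs are illustrative):

```python
import itertools

# Two independent discrete random variables, given by their marginal pmfs.
pX = {-1: 0.25, 0: 0.25, 2: 0.5}
pY = {1: 0.3, 4: 0.7}

EX = sum(x * p for x, p in pX.items())
EY = sum(y * p for y, p in pY.items())

# E[XY] computed against the product measure P_X (x) P_Y, as in Theorem 43.
EXY = sum(x * y * pX[x] * pY[y] for x, y in itertools.product(pX, pY))

assert abs(EXY - EX * EY) < 1e-12  # Lemma 44: E[XY] = E[X] E[Y]
```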
Also, if X_1, . . . , X_n are independent and X_i ∈ L1(P) for
each i, then ∏_{i=1}^n X_i ∈ L1(P) and

E[ ∏_{i=1}^n X_i ] = ∏_{i=1}^n E[X_i].

The covariance between two random variables
X, Y ∈ L2(P) := {X ∈ M(F) : E X^2 < ∞} is defined as²⁰

Cov(X, Y) := E([X − E X][Y − E Y]) = E[XY] − E[X] E[Y].

Thus, if X and Y are independent, it follows from Lemma 44
(page 133) that Cov(X, Y) = 0.
For X ∈ L2(P), we also define Var(X) := E(X − E[X])^2.
Observe that Var(X) = 0 ⇐⇒ X = E[X] P-a.s.

²⁰ Since |X| ≤ 1 + X^2, it follows that L2(P) ⊆ L1(P) and the covariance is
well-defined. 134/141
A law of large numbers
Let us finally see how we can easily prove a law of large numbers
within the measure-theoretic framework.
Theorem 45
Let (Xi )i∈N be an independent sequence of random variables on a
probability space (Ω, F, P). Assume that the Xi are identically
distributed, that is PXi = PX1 for all i ∈ N. If X1 ∈ L2 (P) then for
all ε > 0 and n ∈ N

P( | (1/n) ∑_{i=1}^n X_i − E[X_1] | > ε ) ≤ Var(X_1)/(nε^2).   (11)

Observe that the right-hand side of (11) tends to zero as n → ∞.


Hence, (1/n) ∑_{i=1}^n X_i converges in probability to E[X_1].
One actually only needs the Xi to have a first moment in order to
show almost sure convergence instead (Kolmogorov’s strong law of
large numbers). However, the proof is more involved.
135/141
Proof of Theorem 45
Fix ε > 0 and n ∈ N. By Markov's inequality,

P( | (1/n) ∑_{i=1}^n X_i − E[X_1] | > ε ) ≤ (1/ε^2) E( (1/n) ∑_{i=1}^n X_i − E[X_1] )^2.

Now, by linearity of the P-integral and Cov(X_i, X_j) = 0 for i ≠ j,

E( (1/n) ∑_{i=1}^n X_i − E[X_1] )^2 = E( (1/n) ∑_{i=1}^n (X_i − E[X_1]) )^2
  = (1/n^2) ∑_{i=1}^n ∑_{j=1}^n E[(X_i − E[X_1])(X_j − E[X_1])]
  = (1/n^2) ∑_{i=1}^n ∑_{j=1}^n Cov(X_i, X_j)
  = Var(X_1)/n.
136/141
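The bound (11) can be checked exactly for i.i.d. Bernoulli(p) variables: the sample mean is Binomial(n, p)/n, so the deviation probability is an exact finite sum over the binomial pmf. A sketch (p, n and ε are illustrative choices):

```python
import math

# Exact check of the Chebyshev-type bound (11) for i.i.d. Bernoulli(p):
# (1/n) * sum X_i = K/n with K ~ Binomial(n, p), so the left-hand side of
# (11) is an exact sum of binomial probabilities.
p, n, eps = 0.3, 50, 0.1

binom_pmf = lambda k: math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Compare |k - n*p| > n*eps in integers scaled by n, avoiding float division.
dev_prob = sum(binom_pmf(k) for k in range(n + 1) if abs(k - n * p) > n * eps)

chebyshev_bound = p * (1 - p) / (n * eps ** 2)  # Var(X_1) / (n * eps^2)
assert dev_prob <= chebyshev_bound
```

As the proof suggests, the bound is conservative: here the exact deviation probability is around 0.13 while the bound is 0.42, and both shrink as n grows.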
Conditional expectations

Definition 46 (Conditional expectation)


Let (Ω, F, P) be a probability space, X ∈ L1 (P) a random variable
on it and G a σ-algebra G ⊆ F. The random variables Y satisfying
1. Y is G-B(R)-measurable,
2. ∫_A X dP = ∫_A Y dP for all A ∈ G,
are called a conditional expectation of X given G. We
write E(X |G) for any random variable Y satisfying (1) and (2)
above.
Important fact: For any X ∈ L1 (P) and any σ-algebra G ⊆ F a
conditional expectation always exists and is P-a.s. unique.
Conditional expectations satisfy all the intuitive rules that we
would like them to (linearity, monotonicity,...), cf.,
e.g. Hoffmann-Jørgensen (1994) Section 6.8 or Dudley (2018)
page 338.
137/141
Let (T , T ) be a measurable space. For T : Ω → T we
let σ(T ) ⊆ F denote the σ-algebra generated by T , i.e.

σ(T) = {T^{-1}(B) : B ∈ T}.


σ(T ) is clearly the smallest σ-algebra in Ω that makes T


measurable when T is equipped with T .
We write E(X |T ) := E(X |σ(T )).
The so-called factorization lemma yields that, since E(X |T )
is σ(T )-B(R)-measurable there exists
a ϕ : T → R, T -B(R)-measurable, such that

E(X |T )(ω) = ϕ(T (ω)) for all ω ∈ Ω.

138/141
Let us characterize the function ϕ. By the definition of a
conditional expectation it satisfies²¹

∫_{T^{-1}(B)} ϕ(T(ω)) P(dω) = ∫_{T^{-1}(B)} X(ω) P(dω)   for all B ∈ T,

or, equivalently via the substitution rule,


∫_B ϕ(t) P_T(dt) = ∫_{T^{-1}(B)} X(ω) P(dω)   for all B ∈ T,

where PT = P ◦ T −1 .
A T -B(R)-measurable function ϕ satisfying the above two
displays is also called a conditional expectation of X given T = t.
One often uses the notation E(X |T = t) := ϕ(t) for any such
function ϕ.

²¹ Observe that a typical element of σ(T) is of the form T^{-1}(B) for B ∈ T.
139/141
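On a finite probability space, E(X |T = t) is simply the P-weighted average of X over the event {T = t}, and the defining property can be checked set by set. A minimal sketch (Ω, P, X and T below are illustrative choices):

```python
from collections import defaultdict

# Finite probability space: Omega = {0,...,5} with uniform P.
omega = range(6)
P = {w: 1.0 / 6.0 for w in omega}

X = lambda w: float(w)  # a random variable on Omega
T = lambda w: w % 2     # a coarser random function (parity)

# phi(t) = E(X | T = t): P-weighted average of X over {T = t}.
num, den = defaultdict(float), defaultdict(float)
for w in omega:
    num[T(w)] += X(w) * P[w]
    den[T(w)] += P[w]
phi = {t: num[t] / den[t] for t in num}

# Defining property: the integrals of X and of phi(T) agree over every
# event of the form {T in B}, i.e. over every set in sigma(T).
for B in [set(), {0}, {1}, {0, 1}]:
    lhs = sum(X(w) * P[w] for w in omega if T(w) in B)
    rhs = sum(phi[T(w)] * P[w] for w in omega if T(w) in B)
    assert abs(lhs - rhs) < 1e-12
```

Here phi(0) = 2 (the average of 0, 2, 4) and phi(1) = 3 (the average of 1, 3, 5), and E(X |T)(ω) = phi(T(ω)) as in the factorization lemma.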
Summarizing advantages of the measure theoretic approach

Measurability of functions is preserved under pointwise limits.


This is not the case for continuity (underlying much of
Riemann integration).
Limits and integration can easily be interchanged.
It provides a unified framework for dealing with random
variables — no matter whether they have continuous or
discrete distributions or neither.

140/141
References
Axler, S. (2021): Measure, integration and real analysis,
Springer.
Bartle, R. G. and D. R. Sherbert (2011): Introduction to
real analysis, Wiley, 4th ed.
Dudley, R. M. (2018): Real analysis and probability, CRC Press.
Hoffmann-Jørgensen, J. (1994): Probability with a view
toward Statistics, vol. 1, Chapman and Hall.
Kolmogorov, A. N. and S. V. Fomin (1970): Introductory
Real Analysis, Dover.
Lehmann, E. and G. Casella (1998): Theory of Point
Estimation, Springer.
Lehmann, E. and J. Romano (2005): Testing Statistical
Hypotheses, Springer.
Liese, F. and K.-J. Miescke (2008): Statistical Decision
Theory, Springer.
Resnick, S. (2019): A probability path, Springer.
Shiryaev, A. N. (1996): Probability, Springer, 2nd ed. 141/141
