
Notes on Measure, Probability and

Stochastic Processes

João Lopes Dias


Departamento de Matemática, ISEG, Universidade de
Lisboa, Rua do Quelhas 6, 1200-781 Lisboa, Portugal
Email address: [email protected]
October 2, 2020.
Contents

Chapter 1. Introduction
  1. Classical definitions of probability
  2. Mathematical expectation

Part 1. Measure theory

Chapter 2. Measure and probability
  1. Algebras
  2. Monotone classes
  3. Product algebras
  4. Measures
  5. Examples

Chapter 3. Measurable functions
  1. Definition
  2. Simple functions
  3. Extended real-valued functions
  4. Convergence of sequences of measurable functions
  5. Induced measure
  6. Generation of σ-algebras by measurable functions

Chapter 4. Lebesgue integral
  1. Definition
  2. Properties
  3. Examples
  4. Convergence theorems
  5. Fubini theorem
  6. Signed measures
  7. Radon-Nikodym theorem

Part 2. Probability

Chapter 5. Distributions
  1. Definition
  2. Simple examples
  3. Distribution functions
  4. Classification of distributions
  5. Convergence in distribution
  6. Characteristic functions

Chapter 6. Independence
  1. Independent events
  2. Independent random variables
  3. Independent σ-algebras

Chapter 7. Conditional expectation
  1. Conditional expectation
  2. Conditional probability

Part 3. Stochastic processes

Chapter 8. General stochastic processes

Chapter 9. Sums of iid processes: Limit theorems
  1. Sums of independent random variables
  2. Law of large numbers
  3. Central limit theorem

Chapter 10. Markov chains
  1. The Markov property
  2. Distributions
  3. Homogeneous Markov chains
  4. Recurrence time
  5. Classification of states
  6. Decomposition of chains
  7. Stationary distributions
  8. Limit distributions

Chapter 11. Martingales
  1. The martingale strategy
  2. General definition of a martingale
  3. Examples
  4. Stopping times
  5. Stochastic processes with stopping times

Appendix A. Things that you should know before starting
  1. Notions of mathematical logic
  2. Set theory notions
  3. Function theory notions
  4. Topological notions in R
  5. Notions of differentiable calculus on R
  6. Greek alphabet

Bibliography
CHAPTER 1

Introduction

These are the lecture notes for the course “Probability Theory and
Stochastic Processes” of the Master in Mathematical Finance (since
2016/2017) at ISEG–University of Lisbon. Good knowledge of calculus
and basic probability is required. I would like to thank several people
for their comments, corrections and suggestions, in particular my
colleague Telmo Peixe.

1. Classical definitions of probability

Science is about observing given phenomena, recording data, analysing
it, and explaining particular features and behaviours using theoretical
models. This may be a rough description of what it really means to do
science, but it highlights the fact that experimentation is a crucial
part of obtaining knowledge.
Most experiments are of a random nature. That is, their outcomes cannot
be predicted, often due to the huge number of variables underlying the
process under scrutiny. One needs to repeat the experiment and observe
its different outcomes. A collection of possible outcomes is called an
event. Our main goal is to quantify the likelihood of each event.
These general ideas can be illustrated by the experiment of throwing a
die. We can get six possible outcomes, depending on so many different
factors that it becomes impossible to predict the result. Consider
the event corresponding to an even number of dots, i.e. 2, 4 or 6 dots.
How can we measure the probability of this event occurring when we
throw the die once? If the die is fair (unbiased), intuition tells us
that it is equal to 1/2.
The way one usually thinks of probability is summarised in the
following relation:
Probability(“event”) = (number of favourable cases) / (number of possible cases),
assuming that all cases are equally possible. This is the classical
definition of probability, called the Laplace law.

Example 1.1. Tossing a perfect coin in order to get either heads
or tails. The number of possible cases is 2. So,
Prob(“heads”) = 1/2,
Prob(“heads at least once in two experiments”) = 3/4.
Example 1.2.
P(“winning the Euromillions with one bet”) = 1 / (C(50,5) C(12,2)) ≃ 7 × 10^{−9}.

The Laplace law has the following important consequences:
(1) For any event S, 0 ≤ P (S) ≤ 1.
(2) If P (S) = 1, then S is a certain event. If P (S) = 0, S is an
impossible event.
(3) P (not S) = 1 − P (S).
(4) If A and B are disjoint events, then P (A or B) = P (A)+P (B).
(5) If A and B are independent, then P (A and B) = P (A) P (B).
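These consequences can be checked mechanically on a finite set of equally possible cases. The sketch below is only an illustration (the helper `laplace` is not part of the notes):

```python
from fractions import Fraction

def laplace(event: set, cases: set) -> Fraction:
    """Laplace law: favourable cases over possible cases."""
    return Fraction(len(event & cases), len(cases))

die = set(range(1, 7))   # possible cases for one throw
even = {2, 4, 6}         # the event "even number of dots"

print(laplace(even, die))  # 1/2
# consequence (3): P(not S) = 1 - P(S)
assert laplace(die - even, die) == 1 - laplace(even, die)
# consequence (4): disjoint events add up
assert laplace({1} | {2}, die) == laplace({1}, die) + laplace({2}, die)
```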
The first mathematical formulation of the probability concept ap-
peared in 17th century France. A gambler called Antoine Gombauld
realized empirically that he was able to make money by betting on
getting at least one 6 in 4 dice throws. Later he thought that betting
on getting at least one pair of 6’s in 24 throws of two dice was also
advantageous. As that did not turn out to be the case, he wrote to
Pascal for help. Pascal and Fermat exchanged letters in 1654 discussing
this problem, and that is the first written account of probability theory,
later formalized and further expanded by Laplace.
According to the Laplace law, Gombauld’s first problem can be de-
scribed mathematically as follows. Since
P(not get 6 in one attempt) = 5/6
and each throw is independent of the others, then
P(not get 6 in 4 attempts) = (5/6)^4 ≃ 0.482.
Therefore, in the long run Gombauld was able to make a profit1:
P(get 6 in 4 attempts) = 1 − (5/6)^4 ≃ 0.518 > 1/2.
However, for the second game,
P(no pair of 6’s out of 2 dice) = 35/36
1Test your luck at http://random.org/dice/

and
P(no pair of 6’s out of 2 dice in 24 attempts) = (35/36)^24 ≃ 0.508.
This time he did not have an advantage as
P(pair of 6’s out of 2 dice in 24 attempts) ≃ 1 − 0.508 = 0.492 < 1/2.
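Both of Gombauld's games can be checked with exact rational arithmetic; a quick sketch (not part of the notes):

```python
from fractions import Fraction

# First game: at least one 6 in 4 throws of a single die.
p1 = 1 - Fraction(5, 6) ** 4
# Second game: at least one pair of 6's in 24 throws of two dice.
p2 = 1 - Fraction(35, 36) ** 24

print(float(p1))  # ≈ 0.518, a favourable bet
print(float(p2))  # ≈ 0.491, an unfavourable one
assert p1 > Fraction(1, 2) > p2
```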

The Laplace law is far from what one could consider a useful definition
of probability. For instance, we would also like to examine “biased”
experiments, that is, ones whose outcomes are not equally possible. A
way to deal with this question is to define probability as the frequency
with which some event occurs when the experiment is repeated many
times under the same conditions. So,
P(“event”) = lim_{n→+∞} (number of favourable cases in n experiments) / n.
Example 1.3. In 2015 there were 85500 births in Portugal, and
43685 were boys. So,
P(“it’s a boy!”) ≃ 0.51.
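The limit of relative frequencies can be explored by simulation. The sketch below (an illustration with an assumed helper `frequency_estimate`, not part of the notes) estimates the probability of an even number of dots on a fair die; the estimate approaches the Laplace value 1/2 as n grows:

```python
import random

def frequency_estimate(n: int, seed: int = 0) -> float:
    """Relative frequency of 'even number of dots' in n throws."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    favourable = sum(1 for _ in range(n) if rng.randint(1, 6) % 2 == 0)
    return favourable / n

for n in (100, 10_000, 1_000_000):
    print(n, frequency_estimate(n))
```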

A limitation of this second definition of probability occurs if one
considers infinitely many possible outcomes. There might be situations
where the probability of every event is zero!
Modern probability is based on measure theory, bringing fundamental
mathematical rigour and a broad concept (although a very abstract one,
as we will see). This course is an introduction to this subject.

Exercise 1.4. Gamblers A and B each throw a die. What is the
probability of A getting more dots than B? (5/12)

Exercise 1.5. A professor chooses an integer between 1 and N,
where N is the number of students in the lecture room. In alphabetical
order, each of the N students tries to guess the hidden number.
The first student to guess it wins 2 extra points in the final exam.
Is this fair for the students named Xavier and Zacarias? What is the
probability for each of the students (ordered alphabetically) to win?
(all 1/N, fair)

Exercise 1.6. A gambler bets money at roulette on either even or
odd. If he wins, he receives the same amount that he bet; otherwise
he loses the betting money. His strategy is to bet 1/N of his total
money at each moment in time, starting with €M. Is this a good
strategy?

2. Mathematical expectation

Knowing the probability of every event concerning some experiment
gives us a lot of information. In particular, it gives a way to compute
the best prediction, namely the weighted average. Let X be the value
of a measurement taken at the outcome of the experiment, a so-called
random variable. Suppose that X can only attain a finite number of
values, say a1, . . . , an, and we know that the probability of each event
X = ai is given by P (X = ai) for all i = 1, . . . , n. Then, the weighted
average of all possible values of X given their likelihoods of realization
is naturally given by
E(X) = a1 P (X = a1) + · · · + an P (X = an).
If all results are equally probable, P (X = ai) = 1/n, then E(X) is just
the arithmetic average.
The weighted average above is better known as the expected value
of X. Other names include expectation, mathematical expectation,
average, mean value, mean or first moment. It is the best option when
making decisions, and for that reason it is a fundamental concept in
probability theory.
Example 1.7. Throw die 1 and die 2 and count the numbers
of dots, denoting them by X1 and X2, respectively. Their sum S2 =
X1 + X2 can be any integer between 2 and 12. However, their
probabilities are not equal. For instance, S2 = 2 corresponds to a
unique configuration of one dot on each die, i.e. X1 = X2 = 1. On
the other hand, S2 = 3 can be achieved by two different configurations:
X1 = 1, X2 = 2 or X1 = 2, X2 = 1. Since the dice are independent,
P (X1 = a1, X2 = a2) = P (X1 = a1) P (X2 = a2) = 1/36,
one can easily compute that
P (S2 = n) = (n − 1)/36 for 2 ≤ n ≤ 7, and P (S2 = n) = (13 − n)/36 for 8 ≤ n ≤ 12,
and E(S2) = 7.
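The distribution of S2 can be verified by enumerating the 36 equally likely pairs; a small sketch (not part of the notes):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (X1, X2) of throwing two dice.
outcomes = list(product(range(1, 7), repeat=2))
p = {n: Fraction(sum(1 for x, y in outcomes if x + y == n), 36)
     for n in range(2, 13)}

# Agrees with the piecewise formula of Example 1.7.
assert all(p[n] == Fraction(n - 1, 36) for n in range(2, 8))
assert all(p[n] == Fraction(13 - n, 36) for n in range(8, 13))

# Expected value of the sum.
E = sum(n * q for n, q in p.items())
print(E)  # 7
```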
Example 1.8. Toss two fair coins. If we get two heads we win
€4, two tails €1; otherwise we lose €3. Moreover, P (two heads) =
P (two tails) = 1/4, and also P (one head one tail) = 1/2. Let X be the
gain for a given outcome, i.e. X(two heads) = 4, X(two tails) = 1 and
X(one head one tail) = −3. The profit expectation for this game is
therefore
E(X) = 4P (X = 4) + 1P (X = 1) − 3P (X = −3).

The probabilities above correspond to the probabilities of the corre-
sponding events, so that
E(X) = 4P (two heads) + 1P (two tails) − 3P (one head one tail) = −0.25.
On average one expects a loss in this game, so the decision should be
not to play it. This is an unfair game; one would need a zero
expectation for the game to be fair.

The definition of mathematical expectation above is of course lim-
ited to the case where X takes a finite number of values. As we will see
in the next chapters, the way to generalize this notion to infinite sets is
by interpreting it as the integral of X with respect to the probability
measure.
Exercise 1.9. Consider the throwing of three dice. A gambler
wins €3 if all dice are 6’s, €2 if two dice are 6’s, €1 if only one die is
a 6, and loses €1 otherwise. Is this a fair game?
Part 1

Measure theory
CHAPTER 2

Measure and probability

1. Algebras

Given an experiment, we consider Ω to be the set of all possible
outcomes. This is the probabilistic interpretation that we want to
associate to Ω, but from the point of view of the more general measure
theory, Ω is just any given set.
The collection of all the subsets of Ω is denoted by
P(Ω) = {A : A ⊂ Ω}.
It is also called the set of the parts of Ω. When there is no ambiguity,
we will simply write P. We say that Ac = Ω \ A is the complement of
A ∈ P in Ω.
As we will see later, a proper definition of the measure of a set re-
quires several properties. In some cases, that will restrict the elements
of P that are measurable. It turns out that the measurable ones just
need to verify the following conditions.
A collection A ⊂ P is an algebra of Ω iff
(1) ∅ ∈ A,
(2) If A ∈ A, then Ac ∈ A,
(3) If A1 , A2 ∈ A, then A1 ∪ A2 ∈ A.
An algebra F of Ω is called a σ-algebra of Ω iff given A1, A2, · · · ∈ F
we have
∪_{n=1}^{+∞} An ∈ F.

Remark 2.1. We can easily verify by induction that any finite
union of elements of an algebra is still in the algebra. What makes a
σ-algebra different is that countably infinite unions of elements are
still in the σ-algebra.
Example 2.2. Consider the set A of all finite unions of intervals
in R, including R and ∅. Notice that the complement of an interval
is a finite union of intervals. Therefore, A is an algebra. However, the
countable union of the sets An = ]n, n + 1[ ∈ A, n ∈ N, is not a finite
union of intervals. That is, A is not a σ-algebra.

Remark 2.3. Any finite algebra A (i.e. one containing only a finite
number of subsets of Ω) is immediately a σ-algebra. Indeed, any
countable union of its elements is in fact a finite union.

The elements of a σ-algebra F of Ω are called measurable sets. In
probability theory they are also known as events. The pair (Ω, F) is
called a measurable space.
Exercise 2.4. Decide if F is a σ-algebra of Ω where:
(1) F = {∅, Ω}.
(2) F = P(Ω).
(3) F = {∅, {1, 2}, {3, 4, 5, 6}, Ω}, Ω = {1, 2, 3, 4, 5, 6}.
(4) F = {∅, {0}, R^−, R^−_0, R^+, R^+_0, R \ {0}, R}, Ω = R.

Proposition 2.5. Let F ⊂ P be such that it contains the complement
of each of its elements. For A1, A2, · · · ∈ F,
∪_{n=1}^{+∞} An^c ∈ F iff ∩_{n=1}^{+∞} An ∈ F.

Proof. (⇒) Using De Morgan’s laws,
(∩_{n=1}^{+∞} An)^c = ∪_{n=1}^{+∞} An^c ∈ F
because the complements are always in F.
(⇐) Same idea. □

Therefore, the definitions of algebra and σ-algebra can be changed
to require intersections instead of unions.
Exercise 2.6. Let Ω be a finite set with #Ω = n. Compute
#P(Ω). Hint: Find a bijection between P and {v ∈ Rn : vi ∈ {0, 1}}.
Exercise 2.7. Let Ω be an infinite set, i.e. #Ω = +∞. Consider
the collection of all finite subsets of Ω:
C = {A ∈ P(Ω) : #A < +∞}.
Is C ∪ {Ω} an algebra? Is it a σ-algebra?
Exercise 2.8. Let Ω = [−1, 1] ⊂ R. Determine if the following
collection of sets is a σ-algebra:
F = {A ∈ B(Ω) : x ∈ A ⇒ −x ∈ A} .
Exercise 2.9. Let (Ω, F) be a measurable space. Consider two
disjoint sets A, B ⊂ Ω and assume that A ∈ F. Show that A ∪ B ∈ F
is equivalent to B ∈ F.

1.1. Generation of σ-algebras. In many situations one requires
some sets to be measurable due to their relevance to the problem we
are studying. If the collection of those sets is not already a σ-algebra,
we need to take a larger one that is. That will be called the σ-algebra
generated by the original collection, which we define below.
Take I to be any set (of indices).
Theorem 2.10. If Fα is a σ-algebra for every α ∈ I, then F = ∩_{α∈I} Fα is
also a σ-algebra.

Proof.
(1) As ∅ ∈ Fα for any α, then ∅ ∈ F.
(2) Let A ∈ F. So, A ∈ Fα for any α. Thus, A^c ∈ Fα and A^c ∈ F.
(3) If An ∈ F, we have An ∈ Fα for any α. So, ∪_n An ∈ Fα and
∪_n An ∈ F. □


Exercise 2.11. Is the union of σ-algebras also a σ-algebra?

Consider now the collection of all σ-algebras:
Σ = {all σ-algebras of Ω}.
So, e.g. P ∈ Σ and {∅, Ω} ∈ Σ. In addition, let I ⊂ P be a collection
of subsets of Ω, not necessarily a σ-algebra. Define the subset of Σ
given by the σ-algebras that contain I:
Σ_I = {F ∈ Σ : I ⊂ F}.

The σ-algebra generated by I is the intersection of all σ-algebras
containing I,
σ(I) = ∩_{F ∈ Σ_I} F.

Hence, σ(I) is the smallest σ-algebra containing I (i.e. it is a subset
of any σ-algebra containing I).
Example 2.12.
(1) Let A ⊂ Ω and I = {A}. Any σ-algebra containing I has to
include the sets ∅, Ω, A and Ac . Since these sets form already
a σ-algebra, we have
σ(I) = {∅, Ω, A, Ac }.
(2) Consider two disjoint sets A, B ⊂ Ω and I = {A, B}. The
generated σ-algebra is
σ(I) = {∅, Ω, A, B, Ac , B c , A ∪ B, (A ∪ B)c } .

(3) Consider now two different sets A, B ⊂ Ω such that A ∩ B ≠ ∅,
and I = {A, B}. Then,
σ(I) = {∅, Ω, A, B, Ac , B c ,
A ∪ B, A ∪ B c , Ac ∪ B, (A ∪ B)c , (A ∪ B c )c , (Ac ∪ B)c ,
B c ∪ (A ∪ B c )c , (B c ∪ (A ∪ B c )c )c ,
((A ∪ B c )c ) ∪ ((Ac ∪ B)c ), (((A ∪ B c )c ) ∪ ((Ac ∪ B)c ))c }
= {∅, Ω, A, B, Ac , B c ,
A ∪ B, A ∪ B c , Ac ∪ B, Ac ∩ B c , B \ A, A \ B,
(A ∩ B)c , A ∩ B,
(A ∪ B) \ (A ∩ B), (Ac ∩ B c ) ∪ (A ∩ B)} .
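On a finite Ω the generated σ-algebra can be computed by brute force: close the collection under complements and pairwise unions until nothing new appears. The sketch below (with a hypothetical helper `generate`, not part of the notes) reproduces cases (1) and (3) of Example 2.12:

```python
from itertools import combinations

def generate(omega: frozenset, seeds) -> set:
    """Close {∅, Ω} ∪ seeds under complement and pairwise union.
    For a finite omega this fixpoint is the generated sigma-algebra."""
    fam = {frozenset(), omega} | {frozenset(s) for s in seeds}
    changed = True
    while changed:
        changed = False
        for a in list(fam):
            if omega - a not in fam:       # close under complement
                fam.add(omega - a)
                changed = True
        for a, b in combinations(list(fam), 2):
            if a | b not in fam:           # close under union
                fam.add(a | b)
                changed = True
    return fam

omega = frozenset(range(1, 7))
A, B = frozenset({1, 2}), frozenset({2, 3})

# (1) a single set generates {∅, Ω, A, A^c}
assert generate(omega, [A]) == {frozenset(), omega, A, omega - A}
# (3) two overlapping sets generate the 16 elements listed above
print(len(generate(omega, [A, B])))  # 16
```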
Exercise 2.13. Show that
(1) If I1 ⊂ I2 ⊂ P, then σ(I1 ) ⊂ σ(I2 ).
(2) σ(σ(I)) = σ(I) for any I ⊂ P.
Exercise 2.14. Consider a finite set Ω = {ω1 , . . . , ωn }. Prove that
I = {{ω1 }, . . . , {ωn }} generates P(Ω).
Exercise 2.15. Determine σ(C), where
C = {{x} : x ∈ Ω}.
What is the smallest algebra that contains C?

1.2. Borel sets. A specially important collection of subsets of R
in applications is
I = {] − ∞, x] ⊂ R : x ∈ R}.
It is not an algebra since it does not even contain the empty set. An-
other collection can be obtained by considering complements and
intersections of pairs of sets in I. That is,
I′ = {]a, b] ⊂ R : −∞ ≤ a ≤ b ≤ +∞}.
Here we are using the conventions
]a, +∞] = ]a, +∞[ and ]a, a] = ∅,
so that ∅ and R are also in the collection. The complement of ]a, b] ∈ I′
is still not in I′, but it is the union of two sets there:
]a, b]^c = ]−∞, a] ∪ ]b, +∞].
So, the smallest algebra that contains I corresponds to the collection
of finite unions of sets in I′,
A(R) = {∪_{n=1}^{N} In ⊂ R : I1, . . . , IN ∈ I′, N ∈ N},
called the Borel algebra of R. Clearly, I ⊂ I′ ⊂ A(R).



We define the Borel σ-algebra as
B(R) = σ(I) = σ(I′) = σ(A(R)).
The elements of B(R) are called the Borel sets. We will often simplify
the notation by writing B.
When Ω is a subset of R we can also define the Borel algebra and
the σ-algebra on Ω. It is enough to take
A(Ω) = {A ∩ Ω : A ∈ A(R)} and B(Ω) = {A ∩ Ω : A ∈ B(R)}.
Exercise 2.16. Check that A(Ω) and B(Ω) are an algebra and a
σ-algebra of Ω, respectively.
Exercise 2.17. Show that:
(1) B(R) ≠ A(R).
(2) Any singleton {a} with a ∈ R is a Borel set.
(3) Any countable set is a Borel set.
(4) Any open set is a Borel set. Hint: Any open set can be written
as a countable union of pairwise disjoint open intervals.

2. Monotone classes

We write An ↑ A to represent a sequence of sets A1, A2, . . . that is
increasing, i.e.
A1 ⊂ A2 ⊂ . . . ,
and converges to the set
A = ∪_{n=1}^{+∞} An.
Similarly, An ↓ A corresponds to a sequence of sets A1, A2, . . . that is
decreasing, i.e.
· · · ⊂ A2 ⊂ A1,
and converging to
A = ∩_{n=1}^{+∞} An.
Notice that in both cases, if the sets An are measurable, then A is also
measurable.
A collection A ⊂ P is a monotone class iff
(1) if A1 , A2 , · · · ∈ A such that An ↑ A, then A ∈ A,
(2) if A1 , A2 , · · · ∈ A such that An ↓ A, then A ∈ A.
Theorem 2.18. Suppose that A is an algebra. Then, A is a σ-
algebra iff it is a monotone class.

Proof.
(⇒) If A1, A2, · · · ∈ A are such that An ↑ A or An ↓ A, then A ∈ A by
the properties of a σ-algebra.
(⇐) Let A1, A2, · · · ∈ A. Take
Bn = ∪_{i=1}^{n} Ai, n ∈ N.
Hence, Bn ∈ A for all n since A is an algebra. Moreover,
Bn ⊂ Bn+1 and Bn ↑ ∪_n An ∈ A because A is a monotone
class. □

Theorem 2.19. If A is an algebra, then the smallest monotone
class containing A is σ(A).
Exercise 2.20. Prove it.

3. Product algebras

Let (Ω1, F1) and (Ω2, F2) be two measurable spaces. We want to
find a natural algebra and σ-algebra of the product space
Ω = Ω1 × Ω2.

A particular type of subsets of Ω, called measurable rectangles, is
given by the product of a set A ∈ F1 by another B ∈ F2, i.e.
A × B = {(x1 , x2 ) ∈ Ω : x1 ∈ A, x2 ∈ B}
= {x1 ∈ A} ∩ {x2 ∈ B}
= (A × Ω2 ) ∩ (Ω1 × B),
where we have simplified notation in the obvious way. Consider the
following collection of finite unions of measurable rectangles:
A = {∪_{i=1}^{N} Ai × Bi ⊂ Ω : Ai ∈ F1, Bi ∈ F2, N ∈ N}.   (2.1)
We denote it by A = F1 × F2.
Proposition 2.21. A is an algebra (called the product algebra).

Proof. Notice that ∅ × ∅ is the empty set of Ω and is in A.


The complement of A × B in Ω is
(A × B)^c = {x1 ∉ A or x2 ∉ B} = {x1 ∉ A} ∪ {x2 ∉ B} = (A^c × Ω2) ∪ (Ω1 × B^c),

which is in A. Moreover, the intersection between two measurable
rectangles is given by
(A1 × B1 ) ∩ (A2 × B2 ) = {x1 ∈ A1 , x2 ∈ B1 , x1 ∈ A2 , x2 ∈ B2 }
= {x1 ∈ A1 ∩ A2 , x2 ∈ B1 ∩ B2 }
= (A1 ∩ A2 ) × (B1 ∩ B2 ),
again in A. So, the complement of a finite union of measurable rect-
angles is the intersection of the complements, which is thus in A. 
Exercise 2.22. Show that any element in A can be written as a
finite union of disjoint measurable rectangles.

The product σ-algebra is defined as
F = σ(A).
Example 2.23. A well-known example is the Borel σ-algebra B(Rd )
of Rd , corresponding to the product
B(Rd ) = σ(B(R) × · · · × B(R)).
In particular it includes all open sets of Rd .

4. Measures

Consider an algebra A of a set Ω and a function
µ : A → R̄
that attributes to each set in A a real number or ±∞, i.e. a value in
R̄ = R ∪ {−∞, +∞}.

We say that µ is additive if for any two disjoint sets A1, A2 ∈ A we
have
µ(A1 ∪ A2) = µ(A1) + µ(A2).
By induction the same property holds for a finite union of pairwise
disjoint sets, and we call it finite additivity.
Moreover, µ is σ-additive if for any sequence of pairwise disjoint
sets A1, A2, · · · ∈ A such that ∪_{n=1}^{+∞} An ∈ A¹ we have
µ(∪_{n=1}^{+∞} An) = Σ_{n=1}^{+∞} µ(An).
In case it is only possible to prove the inequality ≤ instead of the
equality, µ is said to be σ-subadditive.
Remark 2.24. If the algebra A is finite, then σ-additive means
additive.
1Notice that this condition is always satisfied if A is a σ-algebra.

Exercise 2.25. Let µ : P(Ω) → R be a function satisfying
µ(∅) = 0, µ(Ω) = 2, µ(A) = 1, A ∈ P(Ω) \ {∅, Ω}.
Determine if µ is σ-additive.

The function µ is called a measure on A iff
(1) µ(A) ≥ 0 or µ(A) = +∞ for any A ∈ A,
(2) µ(∅) = 0,
(3) µ is σ-additive2.
Remark 2.26. We use the arithmetic in R̄ by setting
(+∞) + (+∞) = +∞ and a + ∞ = +∞
for any a ∈ R. Moreover, we write a < +∞ to mean that a is a finite
number.

We say that P : A → R is a probability measure iff
(1) P is a measure,
(2) P (Ω) = 1.
Remark 2.27. A (non-trivial) finite measure, i.e. satisfying 0 <
µ(Ω) < +∞, can be made into a probability measure P by a normal-
ization:
P (A) = µ(A)/µ(Ω), A ∈ A.

Given a measure µ on an algebra A, a set A ∈ A is said to have
full measure if µ(A^c) = 0. In the case of a probability measure we also
say that this set (event) has full probability.
Exercise 2.28. (Counting measure) Show that the function that
counts the number of elements of a set A ∈ P(Ω),
µ(A) = #A if #A < +∞, and µ(A) = +∞ otherwise,
is a measure. Find the sets with full measure µ.
Exercise 2.29. Let µn be measures and an ≥ 0 for all n ∈ N.
Prove that
µ = Σ_{n=1}^{+∞} an µn
is also a measure. Furthermore, show that if µn is a probability measure
for all n and Σ_n an = 1, then µ is also a probability measure.
2µ is called an outer measure in case it is σ-subadditive.

4.1. Properties. If F is a σ-algebra of Ω, µ a measure on F and
P a probability measure on F, we say that (Ω, F, µ) is a measure space
and (Ω, F, P ) is a probability space.
Proposition 2.30. Consider a measure space (Ω, F, µ) and A, B ∈
F. Then,
(1) µ(A ∪ B) + µ(A ∩ B) = µ(A) + µ(B).
(2) If A ⊂ B, then µ(A) ≤ µ(B).
(3) If A ⊂ B and µ(A) < +∞, then µ(B \ A) = µ(B) − µ(A).

Proof. Notice that
A ∪ B = (A \ B) ∪ (A ∩ B) ∪ (B \ A)
is the union of disjoint sets. Moreover, A = (A \ B) ∪ (A ∩ B) and
B = (B \ A) ∪ (A ∩ B).
(1) We have then µ(A ∪ B) + µ(A ∩ B) = µ(A \ B) + µ(A ∩ B) +
µ(B \ A) + µ(A ∩ B) = µ(A) + µ(B).
(2) If A ⊂ B, then B = A∪(B \A) and µ(B) = µ(A)+µ(B \A) ≥
µ(A).
(3) From µ(B) = µ(A) + µ(B \ A) and µ(A) < +∞, we obtain
µ(B \ A) = µ(B) − µ(A). Observe that if µ(A) = +∞, then
µ(B) = +∞, and it would not be possible to determine µ(B \ A).
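For the counting measure on a finite set (µ(S) = #S, see Exercise 2.28), the three properties of Proposition 2.30 can be checked directly; a minimal illustration (not part of the notes):

```python
def mu(s: set) -> int:
    """Counting measure of a finite set."""
    return len(s)

A, B = {1, 2, 3, 4}, {3, 4, 5}

# (1) mu(A ∪ B) + mu(A ∩ B) = mu(A) + mu(B)
assert mu(A | B) + mu(A & B) == mu(A) + mu(B)

# (2) monotonicity and (3) subtraction, for C ⊂ A with mu(C) finite
C = {1, 2}
assert mu(C) <= mu(A)
assert mu(A - C) == mu(A) - mu(C)
print("Proposition 2.30 holds for this example")
```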

Exercise 2.31. Let (Ω, F, µ) be a measure space. Show that for
any sequence of measurable sets A1, A2, · · · ∈ F we have
µ(∪_{n≥1} An) ≤ Σ_{n≥1} µ(An).

A proposition is said to be valid µ-almost everywhere (µ-a.e.) if it
holds on a set of full measure µ.
Exercise 2.32. Consider two sets each one having full measure.
Show that their intersection also has full measure.
Theorem 2.33 (Continuity property). Let (Ω, F, µ) be a measure
space and A1, A2, · · · ∈ F.
(1) If An ↑ A, then µ(A) = lim_{n→+∞} µ(An).
(2) If An ↓ A and µ(A1) < +∞, then µ(A) = lim_{n→+∞} µ(An).

Proof.
(1) If there is i such that µ(Ai) = +∞, then µ(An) = +∞ for
n ≥ i. So, lim_n µ(An) = +∞. On the other hand, as Ai ⊂
∪_n An, we have µ(∪_n An) = +∞. It remains to consider the
case where µ(An) < +∞ for any n. Let A0 = ∅ and Bn =
An \ An−1, n ≥ 1, a sequence of pairwise disjoint sets. Then,
∪_n An = ∪_n Bn and µ(Bn) = µ(An) − µ(An−1). Finally,
µ(∪_n An) = µ(∪_n Bn) = lim_{n→+∞} Σ_{i=1}^{n} (µ(Ai) − µ(Ai−1)) = lim_{n→+∞} µ(An).   (2.2)
(2) Since µ(A1) < +∞, any subset of A1 also has finite measure.
Notice that
∩_n An = (∪_n An^c)^c = A1 \ ∪_n Cn,
where Cn = An^c ∩ A1. We also have Ck ⊂ Ck+1. Hence, by the
previous case,
µ(∩_n An) = µ(A1) − µ(∪_n Cn) = lim_{n→+∞} (µ(A1) − µ(Cn)) = lim_{n→+∞} µ(An).   (2.3)
□


Example 2.34. Consider the counting measure µ. Let
An = {n, n + 1, . . . }.
Therefore, A = ∩_{n=1}^{+∞} An = ∅ and An+1 ⊂ An. However, µ(An) = +∞
does not converge to µ(A) = 0. Notice that the previous theorem does
not apply because µ(A1) = +∞.
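Part (1) of the continuity property can also be illustrated numerically, assuming (as in Section 5.2 below) that the measure of an interval ]a, b] is its length b − a. For the increasing sequence An = ]0, 1 − 1/n] we have An ↑ ]0, 1[ and m(An) → 1; a quick sketch:

```python
def length(a: float, b: float) -> float:
    """Measure of the interval ]a, b] (its length)."""
    return b - a

# m(A_n) for A_n = ]0, 1 - 1/n], an increasing sequence of sets
measures = [length(0.0, 1.0 - 1.0 / n) for n in range(1, 10_001)]

assert measures == sorted(measures)     # the measures increase with n
assert abs(measures[-1] - 1.0) < 1e-3   # and converge to m(]0, 1[) = 1
print(measures[-1])
```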

The next theorem gives us a way to construct probability measures
on an algebra.
Theorem 2.35. Let A be an algebra of a set Ω. Then, P : A → R
is a probability measure on A iff
(1) P (Ω) = 1,
(2) P (A ∪ B) = P (A) + P (B) for every disjoint pair A, B ∈ A,
(3) limn→+∞ P (An ) = 0 for all A1 , A2 · · · ∈ A such that An ↓ ∅.
Exercise 2.36. *Prove it.

Exercise 2.37. Let (Ω, F, P ) be a probability space, A1, A2, · · · ∈ F,
and let B be the set of points in Ω that belong to an infinite number of
the An’s:
B = ∩_{n=1}^{+∞} ∪_{k=n}^{+∞} Ak.
Show that:
(1) (First Borel-Cantelli lemma) If
Σ_{n=1}^{+∞} P (An) < +∞,
then P (B) = 0.
(2) *(Second Borel-Cantelli lemma) If
Σ_{n=1}^{+∞} P (An) = +∞
and
P (∩_{i=1}^{n} Ai) = Π_{i=1}^{n} P (Ai)
for every n ∈ N (i.e. the events are mutually independent; see
section 2), then P (B) = 1.

4.2. *Carathéodory extension theorem. In the definition of
σ-additivity it is not very convenient to check whether we are choosing
only sets A1, A2, . . . in the algebra A such that their union is still in
A. That would be guaranteed by considering a σ-algebra instead of an
algebra.
Theorem 2.38 below assures the extension of the measure to a σ-
algebra containing A. So, we only need to construct a measure on
an algebra in order to have it well determined on a larger σ-algebra.
Before stating the theorem we need several definitions.
Let µ be a measure on an algebra A of Ω. We say that a sequence
of disjoint sets A1, A2, · · · ∈ A is a cover of A ∈ P if
A ⊂ ∪_j Aj.
Consider the function µ∗ : P → R̄ given by
µ∗(A) = inf_{A1,A2,... cover A} Σ_j µ(Aj), A ∈ P,

where the infimum is taken over all covers A1, A2, · · · ∈ A of A. Notice
that µ∗(A) ≥ 0 or µ∗(A) = +∞. Also, µ∗(∅) = 0 as the empty set
covers itself. To show that µ∗ is a measure it is enough to determine
its σ-additivity.

Consider now the collection of subsets of Ω defined as
M = {A ∈ P : µ∗(B) = µ∗(B ∩ A) + µ∗(B ∩ A^c) for all B ∈ P}.
Theorem 2.38 (Carathéodory extension).
(1) M is a σ-algebra and A ⊂ σ(A) ⊂ M.
(2) µ∗ is a measure on M and
µ∗ (A) = µ(A), A∈A
(i.e. µ∗ extends µ to M).
(3) If µ is finite, then µ∗ is its unique extension to σ(A) and it is
also finite.

The remaining part of this section is devoted to the proof of the
above theorem.
Lemma 2.39. µ∗ is σ-subadditive on P.

Proof. Take A1, A2, · · · ∈ P pairwise disjoint and ε > 0. For each
An, n ∈ N, consider a cover An,1, An,2, · · · ∈ A such that
Σ_j µ(An,j) < µ∗(An) + ε/2^n.
Then, since the sets An,j, n, j ∈ N, cover ∪_n An and µ is a measure,
µ∗(∪_n An) ≤ Σ_{n,j} µ(An,j) < Σ_n µ∗(An) + ε.
Since ε > 0 is arbitrary, µ∗ is σ-subadditive. □

From the σ-subadditivity we know that for any A, B ∈ P we have
µ∗(B) ≤ µ∗(B ∩ A) + µ∗(B ∩ A^c).
An element A of M has to verify the other inequality:
µ∗(B) ≥ µ∗(B ∩ A) + µ∗(B ∩ A^c)
for every B ∈ P.
Lemma 2.40. µ∗ is finitely additive on M.

Proof. Let A1 , A2 ∈ M disjoint. Then,


µ∗ (A1 ∪A2 ) = µ∗ ((A1 ∪A2 )∩A1 )+µ∗ ((A1 ∪A2 )∩Ac1 ) = µ∗ (A1 )+µ∗ (A2 ).
By induction we obtain the finite additivity on M. 

Notice that µ∗ is monotone on P, i.e. µ∗(C) ≤ µ∗(D) whenever
C ⊂ D. This is because a cover of D is also a cover of C.
Lemma 2.41. M is a σ-algebra.

Proof. Let B ∈ P. From µ∗(B ∩ ∅) + µ∗(B ∩ Ω) = µ∗(∅) + µ∗(B) =
µ∗(B) we obtain that ∅ ∈ M. If A ∈ M it is clear that A^c is also in
M.
Now, let A1, A2 ∈ M. Their union is also in M because
µ∗(B ∩ (A1 ∪ A2)) = µ∗((B ∩ A1) ∪ (B ∩ A2 ∩ A1^c)) ≤ µ∗(B ∩ A1) + µ∗(B ∩ A2 ∩ A1^c)
and
µ∗(B ∩ (A1 ∪ A2)^c) = µ∗(B ∩ A1^c ∩ A2^c),
whose sum gives
µ∗(B ∩ (A1 ∪ A2)) + µ∗(B ∩ (A1 ∪ A2)^c) ≤ µ∗(B ∩ A1) + µ∗(B ∩ A1^c) = µ∗(B),
where we have used the fact that A1 , A2 are in M.
By induction any finite union of sets in M is also in M. It remains
to deal with the countable union case. Suppose that A1, A2, · · · ∈ M
and write
Fn = An \ ∪_{k=1}^{n−1} Ak, n ∈ N,
which are pairwise disjoint and satisfy
∪_n Fn = ∪_n An.

Moreover, since each Fn is obtained from sets in M by finite unions
and complements, then Fn ∈ M. Finally,
µ∗(B) = µ∗(B ∩ ∪_{n=1}^{N} Fn) + µ∗(B ∩ (∪_{n=1}^{N} Fn)^c)
≥ Σ_{n=1}^{N} µ∗(B ∩ Fn) + µ∗(B ∩ (∪_{n=1}^{+∞} Fn)^c),
where we have used the finite additivity of µ∗ on M along with the
facts that
(∪_{n=1}^{+∞} Fn)^c ⊂ (∪_{n=1}^{N} Fn)^c

and µ∗ is monotone. Taking the limit N → +∞ we obtain
µ∗(B) ≥ Σ_{n=1}^{+∞} µ∗(B ∩ Fn) + µ∗(B ∩ (∪_{n=1}^{+∞} Fn)^c)
≥ µ∗(B ∩ ∪_{n=1}^{+∞} Fn) + µ∗(B ∩ (∪_{n=1}^{+∞} Fn)^c),
proving that M is a σ-algebra. □


Lemma 2.42. A ⊂ σ(A) ⊂ M.

Proof. Let A ∈ A. Given any ε > 0 and B ∈ P, consider
A1, A2, · · · ∈ A covering B such that
µ∗(B) ≤ Σ_n µ(An) < µ∗(B) + ε.
On the other hand, the sets A1 ∩ A, A2 ∩ A, · · · ∈ A cover B ∩ A and
A1 ∩ A^c, A2 ∩ A^c, · · · ∈ A cover B ∩ A^c. From
Σ_n µ(An) = Σ_n (µ(An ∩ A) + µ(An ∩ A^c)) ≥ µ∗(B ∩ A) + µ∗(B ∩ A^c)
we obtain
µ∗(B ∩ A) + µ∗(B ∩ A^c) < µ∗(B) + ε.
Since ε is arbitrary, and using the σ-subadditivity of µ∗, we have
µ∗(B) = µ∗(B ∩ A) + µ∗(B ∩ A^c).
So, A ∈ M and A ⊂ M.
Finally, as M is a σ-algebra that contains A, it also contains σ(A). □
Lemma 2.43. µ∗ is σ-additive on M.

Proof. Let A1, A2, · · · ∈ M be pairwise disjoint. By the finite addi-
tivity and the monotonicity of µ∗ on M we have
Σ_{n=1}^{N} µ∗(An) = µ∗(∪_{n=1}^{N} An) ≤ µ∗(∪_{n=1}^{+∞} An).
Taking the limit N → +∞ we obtain
Σ_{n=1}^{+∞} µ∗(An) ≤ µ∗(∪_{n=1}^{+∞} An).
As µ∗ is σ-subadditive on M, it is also σ-additive. □

It is simple to check that µ∗ agrees with µ for any A ∈ A. In fact, we
can choose the cover equal to A itself, so that
µ∗(A) = µ(A).

Suppose now that µ is finite. Since the cover of a set A consists of
disjoint sets A1, A2, · · · ∈ A, then Σ_j µ(Aj) = µ(∪_j Aj) ≤ µ(Ω). Thus
µ∗ is also finite.
The uniqueness of the extension comes from the following lemma.
Lemma 2.44. Let µ be finite. If µ∗1 and µ∗2 are extensions of µ to
M, then µ∗1 = µ∗2 on σ(A).

Proof. The collection where one has a unique extension is
F = {A ∈ M : µ∗1(A) = µ∗2(A)}.
Taking an increasing sequence A1 ⊂ A2 ⊂ . . . in F we have
µ∗1(∪_{n=1}^{+∞} An) = µ∗1(∪_n (An \ An−1)) = Σ_n µ∗1(An \ An−1) = Σ_n µ∗2(An \ An−1) = µ∗2(∪_n An),
where A0 = ∅. Thus, An ↑ ∪_n An ∈ F. Similarly, for a decreasing
sequence we also obtain An ↓ ∩_n An ∈ F. Thus, F is a monotone
class. According to Theorem 2.19, since F contains the algebra A it
also contains σ(A). □

5. Examples

5.1. Dirac measure. Let a ∈ Ω and P : P(Ω) → R be given by

P(A) = 1 if a ∈ A, and P(A) = 0 otherwise.

If A1, A2, . . . are pairwise disjoint subsets of Ω, then exactly one of the following alternatives holds:

(1) There exists a unique j such that a ∈ Aj. So, P(∪_n An) = 1 = P(Aj) = Σ_n P(An).
(2) For all n we have a ∉ An. Therefore, P(∪_n An) = 0 = Σ_n P(An).

This implies that P is σ-additive. Since P(Ω) = 1, P is a probability measure called the Dirac measure3 at a, and it is denoted by δa.

3 It is also often called atomic measure or degenerate measure.

Exercise 2.45. Is

P = Σ_{n=1}^{+∞} (1/2^n) δ_{1/n}

a probability measure on P(R)?

5.2. Lebesgue measure. Consider the Borel algebra A = A(R) on R and a function m : A → R̄. For a sequence of disjoint intervals ]an, bn] with −∞ ≤ an ≤ bn ≤ +∞, n ∈ N, such that their union is in A, we define

m(∪_{n=1}^{+∞} ]an, bn]) = Σ_{n=1}^{+∞} (bn − an),

corresponding to the sum of the lengths of the intervals. This is the σ-additivity. Moreover, m(∅) = m(]a, a]) = 0, and m(A) ≥ 0 or m(A) = +∞ for any A ∈ A. Therefore, m is a measure on the algebra A, which is called the Lebesgue measure. By the Carathéodory extension theorem (Theorem 2.38) there is an extension of m to the Borel σ-algebra B = σ(A), also denoted by m : B → R̄.
Remark 2.46. There is a larger σ-algebra to where we can extend
m. It is called the Lebesgue σ-algebra M and includes all sets Λ ⊂ R
such that there are Borelian sets A, B ∈ B satisfying A ⊂ Λ ⊂ B and
m(B \ A) = 0. Clearly, B ⊂ M. We set m(Λ) = m(A) = m(B).

Taking a bounded interval Ω = [a, b] ⊂ R, with −∞ < a < b < +∞,


one can define m as above in A(Ω). Its extension to B(Ω) is unique
by the Carathéodory extension theorem since m(Ω) = b − a is finite.
If b − a = 1 we have in fact that the Lebesgue measure in [a, b] is a
probability measure.
Example 2.47. Given a ∈ R, the set {a} is the complement of an open set, so it is a Borel set. Its Lebesgue measure is easily computed using Theorem 2.33. Indeed, An ↓ {a}, where

An = ]a − 1/n, a]

and m(]a − 1/n, a]) = 1/n. So, m({a}) = 0.
Example 2.48. Consider any countable set A = {a1, a2, . . . } ⊂ R. It is a Borel set since it is the countable union of the unit sets4 {an} ∈ B. These sets are disjoint, so

m(A) = Σ_{n=1}^{+∞} m({an}) = 0.

Therefore, any countable set has zero Lebesgue measure. This includes N, Z, and Q.

4 Sets with exactly one element.

Exercise 2.49. Compute the Lebesgue measure of the following


Borel sets: [a, b], [a, b[, ] − ∞, a[, ] − ∞, a], [b, +∞[ and ]b, +∞[.

We have seen that countable sets have zero Lebesgue measure. On


the other hand, intervals are uncountable and have positive measure.
However not all uncountable sets have positive measure, as the follow-
ing non-intuitive example shows.
Example 2.50 (Middle third Cantor set). Consider A0 = [0, 1].
Remove the middle third and obtain A1 = [0, 1/3] ∪ [2/3, 1]. Repeat
this procedure removing the middle third of each of the two disjoint
intervals of A1 in order to get
A2 = [0, 1/9] ∪ [2/9, 1/3] ∪ [2/3, 7/9] ∪ [8/9, 1].
and so on. At step n we get An as the disjoint union of 2n intervals,
each with measure 1/3n , so that m(An ) = (2/3)n → 0. It is also clear
that An+1 ⊂ An and An ↓ A where
A = ∩_{n=0}^{+∞} An.

So, A is a Borel set and m(A) = 0. Notice that A is not empty, as for
instance it includes 0. In fact, A is uncountable. To prove this we just
need to find a bijection between A and [0, 1]. For that we use base 3
representation of numbers5 in [0, 1]. Observe that
A1 = {x ∈ [0, 1] : x = (0.a1 a2 a3 . . . )3 , a1 6= 1}.
Moreover,
An = {x ∈ [0, 1] : x = (0.a1 a2 a3 . . . )3 , a1 6= 1, . . . , an 6= 1}
and
A = {x ∈ [0, 1] : x = (0.a1 a2 a3 . . . )3 , ai 6= 1, i ∈ N}.
So, elements of A are characterized as the points that have no 1’s in
their base 3 expansion6. Finally we choose the bijection h : A → [0, 1]
given by
h((0.a1 a2 a3 . . . )3 ) = (0.b1 b2 b3 . . . )2
where bi = ai /2 ∈ {0, 1}.

5 x ∈ [0, 1] in base b is written as x = (0.a1 a2 a3 . . . )b where ai ∈ {0, . . . , b − 1}. We can recover the decimal expansion by x = Σ_{i=1}^{+∞} ai b^{−i}.
6 Notice that (0.02222 . . . )3 = (0.10000 . . . )3 = 1/3. We use the first representation so that this point is in A, and we make the same choice for any point with a similar tail.
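The geometric content of this construction is easy to check numerically. Below is a small sketch (the helper names are ours, not from the text) that builds the intervals of An step by step and verifies that m(An) = (2/3)^n.

```python
# Numerical sketch of the middle-third Cantor construction.
# The helper names (cantor_step, total_length) are ours, for illustration only.

def cantor_step(intervals):
    """Remove the open middle third of every interval in the list."""
    out = []
    for a, b in intervals:
        third = (b - a) / 3.0
        out.append((a, a + third))        # left third
        out.append((b - third, b))        # right third
    return out

def total_length(intervals):
    """Lebesgue measure of a finite disjoint union of intervals."""
    return sum(b - a for a, b in intervals)

A = [(0.0, 1.0)]                          # A_0 = [0, 1]
lengths = []
for n in range(10):
    A = cantor_step(A)                    # A_{n+1} from A_n
    lengths.append(total_length(A))       # m(A_{n+1}) = (2/3)^{n+1}
```

The lengths decrease geometrically to 0, in line with m(A) = 0, while the number of intervals doubles at every step, in line with A being uncountable.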

5.3. Product measure. Let (Ω1, F1, µ1) and (Ω2, F2, µ2) be measure spaces. Consider the product space Ω = Ω1 × Ω2 with the product σ-algebra F = σ(A), where A is the product algebra introduced in (2.1).
We start by defining the product measure µ = µ1 × µ2 as the measure on Ω that satisfies

µ(A1 × A2) = µ1(A1) µ2(A2)

for any measurable rectangle A1 × A2 ∈ F1 × F2. For other sets in A, i.e. finite unions of measurable rectangles, we define µ so as to make it σ-additive.
Exercise 2.51.
(1) Prove that if A ∈ A can be written as a finite disjoint union of measurable rectangles in two different ways, i.e. we can find measurable rectangles Ai × Bi, i = 1, . . . , N, and also A′i × B′i, i = 1, . . . , N′, such that

A = ∪_i Ai × Bi = ∪_i A′i × B′i,

then

Σ_i µ(Ai × Bi) = Σ_i µ(A′i × B′i).

So, µ(A) is well-defined.
(2) Show that µ can be extended to every measurable set in the product σ-algebra F.

5.4. Non-unique extensions. The Carathéodory extension theorem guarantees an extension to σ(A) of a measure initially defined on an algebra A. If the measure is not finite (µ(Ω) = +∞), then there might be more than one extension, say µ∗1 and µ∗2. They both agree with µ for sets in A but differ on some sets in σ(A) \ A. Here are two examples.
Consider the algebra A = A(R) and the measure µ : A → R̄ given by

µ(A) = 0 if A = ∅, and µ(A) = +∞ otherwise.

Two extensions of µ to B = σ(A) are

µ1(A) = 0 if A = ∅, µ1(A) = +∞ otherwise,   and   µ2(A) = #A,

for any A ∈ B. Notice that µ1 is the one given by the construction in the proof of the Carathéodory extension theorem.

Another example is the following. Let Ω = [0, 1] × [0, 1] and consider the product algebra A = B([0, 1]) × P([0, 1]). Define the product measure µ = m × ν on A, where ν is the counting measure. Then two extensions of µ are

µ1(A) = Σ_y m(A^y)   and   µ2(A) = ∫_0^1 nA(x) dx,

for any A ∈ σ(A), where

A_x = {y ∈ [0, 1] : (x, y) ∈ A},   A^y = {x ∈ [0, 1] : (x, y) ∈ A}

and nA(x) = #A_x. In particular, D = {(x, x) ∈ Ω : x ∈ [0, 1]} is in σ(A) but not in A, and

µ1(D) = 0   and   µ2(D) = 1.

The extension given by the construction in the proof of the Carathéodory extension theorem is

µ3(A) = m × ν(A) if ∪_x A_x is countable and m(∪_y A^y) = 0, and µ3(A) = +∞ otherwise,

with µ3(D) = +∞.
CHAPTER 3

Measurable functions

1. Definition

Let (Ω1 , F1 ) and (Ω2 , F2 ) be measurable spaces and consider a func-


tion between those spaces f : Ω1 → Ω2 . We say that f is (F1 , F2 )-
measurable iff
f −1 (B) ∈ F1 , B ∈ F2 .
That is, the pre-image of a measurable set is also measurable (with
respect to the respective σ-algebras). This definition will be useful in
order to determine the measure of a set in F2 by looking at the measure
of its pre-image in F1 . Whenever there is no ambiguity, namely the
σ-algebras are known and fixed, we will simply say that the function
is measurable.
Remark 3.1. Notice that the pre-image can be seen as a func-
tion between the collection of subsets, f −1 : P(Ω2 ) → P(Ω1 ). So, f is
measurable iff the image under f −1 of F2 is contained in F1 , i.e.
f −1 (F2 ) ⊂ F1 .
Exercise 3.2. Show the following propositions:
(1) If f is (F1 , F2 )-measurable, it is also (F, F2 )-measurable for
any σ-algebra F ⊃ F1 .
(2) If f is (F1 , F2 )-measurable, it is also (F1 , F)-measurable for
any σ-algebra F ⊂ F2 .

We do not need to check the condition of measurability for every measurable set in F2. In fact, it is only required for a collection that generates F2.
Proposition 3.3. Let I ⊂ P(Ω2 ). Then, f is (F1 , σ(I))-measurable
iff f −1 (I) ⊂ F1 .

Proof.
(1) (⇒) Since any I ∈ I also belongs to σ(I), if f is measurable
then f −1 (I) is in F1 .
(2) (⇐) Let
F = {B ∈ σ(I) : f −1 (B) ∈ F1 }.

Notice that F is a σ-algebra because


• f −1 (∅) = ∅ ∈ F1 , so ∅ ∈ F.
• If B ∈ F, then f −1 (B c ) = f −1 (B)c ∈ F1 . Hence, B c is
also in F.
• Let B1, B2, · · · ∈ F. Then,

f −1(∪_{n=1}^{+∞} Bn) = ∪_{n=1}^{+∞} f −1(Bn) ∈ F1.

So, ∪_{n=1}^{+∞} Bn is also in F.
Since I ⊂ F we have σ(I) ⊂ F ⊂ σ(I). That is, F = σ(I).


We will be particularly interested in the case of scalar functions,


i.e. with values in R. Fix the Borel σ-algebra B on R. Recall that B
can be generated by the collection I = {] − ∞, x] : x ∈ R}. So, from
Proposition 3.3 we say that f : Ω → R is F-measurable iff
f −1 (] − ∞, x]) ∈ F, x ∈ R.
In probability theory, these functions are called random variables.
Remark 3.4. The following notation is widely used (especially in
probability theory) to represent the pre-image of a set in I:
{f ≤ x} = {ω ∈ Ω : f (ω) ≤ x} = f −1 (] − ∞, x]).
Example 3.5. Consider the constant function f(ω) = a, ω ∈ Ω, where a ∈ R. Then,

f −1(]−∞, x]) = Ω if x ≥ a, and f −1(]−∞, x]) = ∅ if x < a.

So, f −1(]−∞, x]) belongs to any σ-algebra on Ω. Therefore, a constant function is always measurable, regardless of the σ-algebra considered.
Example 3.6. Let A ⊂ Ω. The indicator function of A is defined by

XA(ω) = 1 if ω ∈ A, and XA(ω) = 0 if ω ∈ Ac.

Therefore,

XA−1(]−∞, x]) = Ω if x ≥ 1,   Ac if 0 ≤ x < 1,   ∅ if x < 0.

So, XA is F-measurable iff A ∈ F.
Exercise 3.7. Show that:
(1) Xf −1 (A) = XA ◦ f for any A ⊂ Ω2 and f : Ω1 → Ω2 .

(2) XA∩B = XA XB for A, B ⊂ Ω.


(3) XA∪B = XA + XB − XA XB for A, B ⊂ Ω.
Example 3.8. Recall that f : Rd → R is a continuous function iff
the pre-image of any open set is open. Since open sets are Borel sets,
it follows that any continuous function is B(Rd )-measurable.
Proposition 3.9. Let f, g : Ω → R be measurable functions. Then,
their sum f + g and product f g are also measurable.

Proof. Let F : R2 → R be a continuous function and h : Ω → R be given by

h(x) = F(f(x), g(x)).

Since F is continuous, the set F −1(]−∞, a[) is open. Thus, for any a ∈ R we can write F −1(]−∞, a[) = ∪_n In × Jn, where In and Jn are open intervals. So,

h−1(]−∞, a[) = ∪_n f −1(In) ∩ g −1(Jn) ∈ F.

That is, h is measurable. We complete the proof by applying this to F(u, v) = u + v and F(u, v) = uv. □
Proposition 3.10. Let f : Ω1 → Ω2 and g : Ω2 → Ω3 be measurable
functions. Then, g ◦ f is also measurable.

Proof. Prove it. 

2. Simple functions

Consider N pairwise disjoint sets A1, . . . , AN ⊂ Ω whose union is Ω. A function ϕ : Ω → R is a simple function on A1, . . . , AN if there are distinct numbers c1, . . . , cN ∈ R such that

ϕ = Σ_{j=1}^{N} cj XAj.

That is, a simple function is constant on each of a finite number of sets that cover Ω: ϕ(Aj) = cj. Hence ϕ has only a finite number of possible values.
Proposition 3.11. A function is simple on A1 , . . . , AN iff it is
σ({A1 , . . . , AN })-measurable.

Proof.
(⇒) For x ∈ R, take the set J(x) = {j : cj ≤ x} ⊂ {1, . . . , N}. Hence,

ϕ−1(]−∞, x]) = ∪_{j∈J(x)} ϕ−1(cj) = ∪_{j∈J(x)} Aj.

So, ϕ is σ({A1 , . . . , AN })-measurable.


(⇐) Suppose that f is not simple. So, for some j it is not constant
on Aj (and so Aj has more than one element). Then, there
are ω1 , ω2 ∈ Aj and x ∈ R such that
f (ω1 ) < x < f (ω2 ).
Hence, ω1 is in f −1 (] − ∞, x]) but ω2 is not. This means that
this set can not be any of A1 , . . . , AN , their complements or
their unions. Therefore, f is not σ({A1 , . . . , AN })-measurable.

Remark 3.12. From the above proposition it follows that a func-
tion is constant (i.e. a simple function on the unique set Ω) iff it is
measurable with respect to the trivial σ-algebra (thus to any other).
See Example 3.5.
Exercise 3.13. Consider a simple function ϕ. Write |ϕ| and de-
termine if it is also simple.
Remark 3.14. Consider two simple functions

ϕ = Σ_{j=1}^{N} cj XAj   and   ϕ′ = Σ_{j′=1}^{N′} c′j′ XA′j′.

Their sum is also a simple function, given by

ϕ + ϕ′ = Σ_{j=1}^{N} Σ_{j′=1}^{N′} (cj + c′j′) XAj∩A′j′,

and their product is the simple function

ϕϕ′ = Σ_{j=1}^{N} Σ_{j′=1}^{N′} cj c′j′ XAj∩A′j′.

3. Extended real-valued functions

In many applications it is convenient to consider functions with values in

R̄ = R ∪ {−∞, +∞}.

For such cases it is enough to consider the Borel σ-algebra of R̄, defined as

B(R̄) = σ({[−∞, a] : a ∈ R}).

Let (Ω, F) be a measurable space. We say that f : Ω → R̄ is F-measurable (a random variable) iff it is (F, B(R̄))-measurable. This is equivalent to checking that f −1([−∞, a]) ∈ F for every a ∈ R.

Given a sequence of measurable functions fn : Ω → R, we define the infimum function as

f(x) = inf_{n∈N} fn(x).

It is a well-defined function from Ω to [−∞, +∞[ ⊂ R̄. Similarly, the supremum function sup_n fn has values in ]−∞, +∞].
Recall the definitions of liminf and limsup for a sequence of functions:

lim inf_{n→+∞} fn = sup_{n∈N} inf_{k≥n} fk,   (3.1)
lim sup_{n→+∞} fn = inf_{n∈N} sup_{k≥n} fk.   (3.2)

These are also functions with values in R̄. Moreover, the sequence fn converges if they are finite and equal to the limit of fn (lim fn), as discussed in the next section.
Proposition 3.15. For any sequence of measurable functions fn ,
the functions inf n fn , supn fn , lim inf n→+∞ fn and lim supn→+∞ fn are
also measurable.
Exercise 3.16. Prove it.

4. Convergence of sequences of measurable functions

Consider countably many measurable functions fn : Ω → R ordered by n ∈ N. This defines a sequence of measurable functions f1, f2, . . . . We denote such a sequence by its general term fn. There are several notions of its convergence:
• fn converges pointwise1 to f (denoted fn → f) iff

lim_{n→+∞} fn(x) = f(x) for every x ∈ Ω.

We also say that f is the limit of fn.
• fn converges uniformly to f (denoted fn →u f) iff

lim_{n→+∞} sup_{x∈Ω} |fn(x) − f(x)| = 0.

• fn converges almost everywhere to f (denoted fn →a.e. f) iff there is A ∈ F such that µ(A) = 0 and

lim_{n→+∞} fn(x) = f(x) for every x ∈ Ac.

• fn converges in measure to f (denoted fn →µ f) iff for every ε > 0

lim_{n→+∞} µ({x ∈ Ω : |fn(x) − f(x)| ≥ ε}) = 0.

1 Or simply, fn converges to f.



Remark 3.17. In the case of a probability measure we refer to


convergence almost everywhere (a.e.) as almost surely (a.s.), and con-
vergence in measure as convergence in probability.
Exercise 3.18. Let ([0, 1], B([0, 1]), m) the Lebesgue measure space
and fn (x) = xn , x ∈ [0, 1], n ∈ N. Determine the convergence of fn .
Exercise 3.19. Determine the convergence of XAn when
(1) An ↑ A
(2) An ↓ A
(3) the sets A1 , A2 , . . . are pairwise disjoint.

A function f : Ω → R is called bounded if there is M > 0 such that


for every ω ∈ Ω we have |f (ω)| ≤ M .
We use the notation
fn % f
to mean that fn → f and fn ≤ f for every n ∈ N.
Proposition 3.20.
(1) For every measurable function f there is a sequence of simple
functions ϕn such that ϕn % f .
(2) For every bounded measurable function f there is a sequence
of simple functions ϕn such that ϕn % f and the convergence
is uniform.

Proof.
(1) Consider the simple functions

ϕn = Σ_{j=0}^{n·2^{n+1}−1} (−n + j/2^n) XAn,j + n Xf−1([n,+∞[) − n Xf−1(]−∞,−n[)

where

An,j = f −1([−n + j/2^n, −n + (j+1)/2^n[).

Notice that for any ω ∈ An,j we have

−n + j/2^n ≤ f(ω) < −n + (j+1)/2^n

and

ϕn(ω) = −n + j/2^n.

So,

f(ω) − 1/2^n < ϕn(ω) ≤ f(ω).

Therefore, ϕn → f for every ω ∈ Ω, since for n sufficiently large ω belongs to some An,j.

(2) Assume that |f(ω)| ≤ M for every ω ∈ Ω. Given n ∈ N, let

cj = −M + 2(j − 1)M/n,   j = 1, . . . , n.

Define the intervals Ij = [cj, cj + 2M/n[ for j = 1, . . . , n − 1 and In = [cn, M]. Clearly, these n intervals are pairwise disjoint and their union is [−M, M]. Take also Aj = f −1(Ij), which are pairwise disjoint measurable sets that cover Ω, and the sequence of simple functions

ϕn = Σ_{j=1}^{n} cj XAj.

On each Aj the function f takes values in Ij, so it is always within 2M/n of cj (the length of Ij). Then,

sup_{ω∈Ω} |ϕn(ω) − f(ω)| ≤ 2M/n.

As n → +∞ we obtain ϕn →u f. □
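The staircase construction in part (1) can be checked numerically for a concrete f. The sketch below (with f(x) = x and helper names of our own choosing) evaluates ϕn at a few points and confirms the bound f − 1/2^n < ϕn ≤ f on f −1([−n, n[).

```python
import math

# Sketch of the simple-function approximation from the proof, for f(x) = x.
# phi(n, y) returns ϕ_n at a point where f takes the value y; names are ours.

def phi(n, y):
    if y >= n:
        return n                     # the n·X_{f^{-1}([n,+∞[)} term
    if y < -n:
        return -n                    # the -n·X_{f^{-1}(]-∞,-n[)} term
    # y lies in [-n + j/2^n, -n + (j+1)/2^n[ for exactly one j
    j = math.floor((y + n) * 2**n)
    return -n + j / 2**n

n = 8
ys = [-2.7, -0.1, 0.0, 0.3333, 1.99]   # sample values of f inside [-n, n[
errors = [y - phi(n, y) for y in ys]   # each error lies in [0, 1/2^n[
```

Doubling the resolution (n → n + 1) halves the worst-case error, which is the quantitative content of ϕn % f.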
Exercise 3.21. Show that if the limit of a sequence of measurable
functions exists, it is also measurable.

5. Induced measure

Let (Ω1 , F1 , µ1 ) be a measure space, (Ω2 , F2 ) a measurable space


and f : Ω1 → Ω2 a measurable function. Notice that µ1 ◦ f −1 defines a
function F2 → R since f −1 : F2 → F1 and µ1 : F1 → R.
Proposition 3.22. The function
µ2 = µ1 ◦ f −1
is a measure on F2 called the induced measure. Moreover, if µ1 is a
probability measure, then µ2 is also a probability measure.
Exercise 3.23. Prove it.
Remark 3.24.
(1) The induced measure µ1 ◦f −1 is sometimes called push-forward
measure and denoted by f∗ µ1 .
(2) In probability theory the induced probability measure is also
known as distribution of f . If Ω2 = R and F2 = B(R) it is
known as well as probability distribution. We will always refer
to it as distribution.
Exercise 3.25. Consider the measure space (Ω, P, δa ) where δa is
the Dirac measure at a ∈ Ω. If f : Ω → R is measurable, what is its
induced measure (distribution)?

Exercise 3.26. Compute m ◦ f −1 where f (x) = 2x, x ∈ R, and m


is the Lebesgue measure on R.

6. Generation of σ-algebras by measurable functions

Consider a function f : Ω → R. The smallest σ-algebra of Ω for


which f is measurable is
σ(f ) = σ({f −1 (B) ∈ F : B ∈ B}).
It is called the σ-algebra generated by f . Notice that f will be also
measurable for any other σ-algebra containing σ(f ).
When we have a finite set of functions f1 , . . . , fn , the smallest σ-
algebra for which all these functions are measurable is
σ(f1 , . . . , fn ) = σ({fi−1 (B) ∈ F : B ∈ B, i = 1, . . . , n}).
We also refer to it as the σ-algebra generated by f1 , . . . , fn .
Example 3.27. Let A ⊂ Ω and take the indicator function XA .
Then,
σ(XA ) = σ({∅, Ω, Ac }) = {∅, Ω, A, Ac } = σ({A}).
Similarly, for A1 , . . . , An ⊂ Ω,
σ(XA1 , . . . , XAn ) = σ({A1 , . . . , An }).
Exercise 3.28. Decide if the following propositions are true:
(1) σ(f) = σ({f −1(]−∞, x]) : x ∈ R}).
(2) σ(f + g) = σ(f, g).
Exercise 3.29. Show that:
(1) For every 1 ≤ i ≤ n we have σ(fi ) ⊂ σ(f1 , . . . , fn ).
(2) For every 1 ≤ i1 , . . . , ik ≤ n we have σ(fi1 , . . . , fik ) ⊂ σ(f1 , . . . , fn ).
(3) σ(f ) ⊂ F whenever f is F-measurable.
CHAPTER 4

Lebesgue integral

In this chapter we define the Lebesgue integral of a measurable function f on a measurable set A with respect to a measure µ. This is a huge generalization of the Riemann integral on R introduced in first year Calculus. There, the functions have anti-derivatives, the sets are intervals and there is no mention of the measure, even though it is the Lebesgue measure that is being used (the length of the intervals).
Roughly speaking, the Lebesgue integral is a “sum” of the values of f at all points in A times a weight given by the measure µ. For probability measures it can be thought of as the weighted average of f on A.
In the following, in order to simplify the language, we will drop the name Lebesgue when referring to the integral. Moreover, given functions f : Ω → R and g : Ω → R we write f ≤ g to mean that f(x) ≤ g(x) for every x ∈ Ω. Similarly for f ≥ g, f < g and f > g.

1. Definition

We will define the integral first for non-negative simple functions,


then for non-negative measurable functions, and finally for measurable
functions.

1.1. Integral of non-negative simple functions. Let (Ω, F, µ) be a measure space and ϕ : Ω → R a non-negative simple function (ϕ ≥ 0) of the form

ϕ = Σ_{j=1}^{N} cj XAj,

where cj ≥ 0, Aj ∈ F and N ∈ N. The integral of the simple function ϕ with respect to the measure µ is

∫ ϕ dµ = Σ_{j=1}^{N} cj µ(Aj).

It is a number in [0, +∞].


Remark 4.1.
(1) If cj = 0 and µ(Aj) = +∞ we set cj µ(Aj) = 0. So, ∫ 0 dµ = 0 for any measure µ.
(2) Consider the simple function ϕ(x) = c1 XA + c2 XAc where µ(A) = µ(Ac) = +∞. If we had allowed c1 > 0 and c2 < 0, then there would be an indetermination c1 µ(A) + c2 µ(Ac) = +∞ − ∞. This is why in the definition of the above integral we restrict to non-negative simple functions.
(3) We frequently use the following notation so that the variable of integration is explicitly written:

∫ ϕ dµ = ∫ ϕ(x) dµ(x).

Proposition 4.2. Let ϕ1, ϕ2 ≥ 0 be simple functions and a1, a2 ≥ 0. Then,
(1) ∫ (a1 ϕ1 + a2 ϕ2) dµ = a1 ∫ ϕ1 dµ + a2 ∫ ϕ2 dµ.
(2) If ϕ1 ≤ ϕ2, then ∫ ϕ1 dµ ≤ ∫ ϕ2 dµ.

Proof.
(1) Let ϕ, ϕ̃ be simple functions of the form

ϕ = Σ_{j=1}^{N} cj XAj,   ϕ̃ = Σ_{j=1}^{Ñ} c̃j XÃj,

where Aj = ϕ−1(cj) and Ãj = ϕ̃−1(c̃j). Then,

∫ (ϕ + ϕ̃) dµ = Σ_{i,j} (ci + c̃j) µ(Ai ∩ Ãj)
             = Σ_i ci Σ_j µ(Ai ∩ Ãj) + Σ_j c̃j Σ_i µ(Ai ∩ Ãj).

Notice that Σ_j µ(Ai ∩ Ãj) = µ(Ai) because the sets Ãj are pairwise disjoint and their union is Ω. The same applies to Σ_i µ(Ai ∩ Ãj) = µ(Ãj). Hence,

∫ (ϕ + ϕ̃) dµ = ∫ ϕ dµ + ∫ ϕ̃ dµ.

(2) Prove it. □

1.2. Integral of non-negative measurable functions. Let (Ω, F, µ) be a measure space and f : Ω → R a non-negative measurable function (f ≥ 0). Consider the set of all possible values of the integral of non-negative simple functions that are not above f, i.e.

I(f) = { ∫ ϕ dµ : 0 ≤ ϕ ≤ f, ϕ is simple }.

Proposition 4.3. There is a ∈ [0, +∞] such that I(f) = [0, a[ or I(f) = [0, a].

Proof. Since ∫ ϕ dµ ≥ 0, we have I(f) ⊂ [0, +∞]. Moreover, 0 ∈ I(f) because ∫ 0 dµ = 0 for the simple function ϕ = 0 and 0 ≤ f.
Suppose now that x ∈ I(f) with x > 0. This means that there is a simple function 0 ≤ ϕ ≤ f such that ∫ ϕ dµ = x. Considering y ∈ [0, x], let ϕ̃ = (y/x) ϕ. This is also a simple function satisfying 0 ≤ ϕ̃ ≤ ϕ ≤ f. Furthermore,

∫ ϕ̃ dµ = (y/x) ∫ ϕ dµ = y ∈ I(f).

Therefore, [0, x] ⊂ I(f).
The only sets with the property that [0, x] ⊂ I(f) for every x ∈ I(f) are the intervals [0, a[ and [0, a] for some a ≥ 0 or a = +∞. □

The integral of f ≥ 0 with respect to the measure µ is defined to be

∫ f dµ = sup I(f).

So, the integral always exists and is either a finite number in [0, +∞[ or +∞.

1.3. Integral of measurable functions. Let (Ω, F, µ) be a measure space and f : Ω → R a measurable function. There is a simple decomposition of f into its positive and negative parts:

f +(x) = max{f(x), 0} ≥ 0,
f −(x) = max{−f(x), 0} ≥ 0.

Hence,

f(x) = f +(x) − f −(x)

and also

|f(x)| = max{f +(x), f −(x)} = f +(x) + f −(x).

A measurable function f is integrable with respect to µ iff ∫ |f| dµ < +∞. Its integral is defined as

∫ f dµ = ∫ f + dµ − ∫ f − dµ.

The integral of f on A ∈ F with respect to µ is

∫_A f dµ = ∫ f XA dµ.

Exercise 4.4. Consider a simple function ϕ (not necessarily non-negative). Show that:
(1) ϕ XA = Σ_j cj XAj∩A.
(2) ∫_A ϕ dµ = Σ_{j=1}^{N} cj µ(Aj ∩ A).

In probability theory the integral of an integrable random variable X on a probability space (Ω, F, P) is denoted by

E(X) = ∫ X dP

and called the expected value1 of X.

Remark 4.5. As for simple functions, we will also use the notation:

∫_A f dµ = ∫_A f(x) dµ(x).

2. Properties

Proposition 4.6. Let f and g be integrable functions and A, B ∈ F.
(1) If f ≤ g µ-a.e., then ∫ f dµ ≤ ∫ g dµ.
(2) If A ⊂ B, then ∫_A |f| dµ ≤ ∫_B |f| dµ.
(3) If µ(A) = 0, then ∫_A f dµ = 0.
(4) If µ(A ∩ B) = 0, then ∫_{A∪B} f dµ = ∫_A f dµ + ∫_B f dµ.
(5) If f = 0 µ-a.e., then ∫ f dµ = 0.
(6) If f ≥ 0 and λ > 0, then µ({x ∈ Ω : f(x) ≥ λ}) ≤ (1/λ) ∫ f dµ (Markov inequality).
(7) If f ≥ 0 and ∫ f dµ = 0, then f = 0 µ-a.e.
(8) (inf f) µ(Ω) ≤ ∫ f dµ ≤ (sup f) µ(Ω).

1 It is also known as expectation, mathematical expectation, mean value, mean, average or first moment. It is sometimes denoted by E[X], E(X) or ⟨X⟩.

Proof.
2. PROPERTIES 41

(1) Any simple function satisfying ϕ ≤ f + a.e. also satisfies ϕ ≤ g + a.e., since f + ≤ g + a.e. So, I(f +) ⊂ I(g +) and ∫ f + dµ ≤ ∫ g + dµ. Similarly, g − ≤ f − a.e. and ∫ g − dµ ≤ ∫ f − dµ. Finally, ∫ f + dµ − ∫ f − dµ ≤ ∫ g + dµ − ∫ g − dµ.
(2) Notice that ∫_A |f| dµ = ∫ |f| XA dµ, and similarly for the integral on B. Since |f| XA ≤ |f| XB, by the previous property, ∫ |f| XA dµ ≤ ∫ |f| XB dµ.
(3) For any simple function, ∫_A ϕ dµ = Σ_j cj µ(Aj ∩ A) = 0. Thus, ∫_A |f| dµ = 0.
(4) Suppose that f ≥ 0. For any C ∈ F we have

µ((A ∪ B) ∩ C) = µ((A ∩ C) ∪ (B ∩ C)) = µ(A ∩ C) + µ(B ∩ C),

as A ∩ B ∩ C ⊂ A ∩ B has zero measure. Given a simple function 0 ≤ ϕ ≤ f we have

∫_{A∪B} ϕ dµ = Σ_{j=1}^{N} cj µ(Aj ∩ (A ∪ B))
            = Σ_{j=1}^{N} cj (µ(Aj ∩ A) + µ(Aj ∩ B))
            = ∫_A ϕ dµ + ∫_B ϕ dµ.

Using the relation sup(g1 + g2) ≤ sup g1 + sup g2 for any functions g1, g2, we obtain

∫_{A∪B} f dµ ≤ ∫_A f dµ + ∫_B f dµ.

Now, consider simple functions 0 ≤ ϕ1, ϕ2 ≤ f. Since A and B are disjoint up to a null set and 0 ≤ ϕ1 XA + ϕ2 XB ≤ f, we get

∫_A ϕ1 dµ + ∫_B ϕ2 dµ = ∫_{A∪B} (ϕ1 XA + ϕ2 XB) dµ ≤ ∫_{A∪B} f dµ.

Taking the supremum over the simple functions, we get

∫_A f dµ + ∫_B f dµ ≤ ∫_{A∪B} f dµ.

We have thus proved that ∫_{A∪B} f ± dµ = ∫_A f ± dµ + ∫_B f ± dµ. This implies that

∫_{A∪B} f dµ = ∫_{A∪B} f + dµ − ∫_{A∪B} f − dµ
            = ∫_A f + dµ + ∫_B f + dµ − ∫_A f − dµ − ∫_B f − dµ
            = ∫_A f dµ + ∫_B f dµ.

(5) We have 0 ≤ f ≤ 0 a.e. Then, by the first property, ∫ 0 dµ ≤ ∫ f dµ ≤ ∫ 0 dµ.
(6) Let A = {x ∈ Ω : f(x) ≥ λ}. Then,

∫ f dµ ≥ ∫_A f dµ ≥ ∫_A λ dµ = λ µ(A).

(7) We want to show that µ({x ∈ Ω : f(x) > 0}) = µ ◦ f −1(]0, +∞[) = 0. The Markov inequality implies that for any n ∈ N,

µ ◦ f −1([1/n, +∞[) ≤ n ∫ f dµ = 0.

Since

f −1(]0, +∞[) = ∪_{n=1}^{+∞} f −1([1/n, +∞[),

we have

µ ◦ f −1(]0, +∞[) ≤ Σ_{n=1}^{+∞} µ ◦ f −1([1/n, +∞[) = 0.

(8) It is enough to notice that inf f ≤ f ≤ sup f. □

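Item (6), the Markov inequality, is easy to test on a finite measure space. The sketch below (a uniform measure on ten points; the measure, function and names are our own illustrative choices) compares µ({f ≥ λ}) with (1/λ) ∫ f dµ.

```python
# Sanity check of the Markov inequality on a finite measure space.
# The measure, the function and all names below are our own choices.

omega = range(10)
mu = {w: 0.1 for w in omega}           # uniform measure with total mass 1
f = {w: float(w * w) for w in omega}   # a non-negative function

def integral(g, measure):
    """∫ g dµ over a finite space is a weighted sum."""
    return sum(g[w] * measure[w] for w in measure)

lam = 25.0
lhs = sum(mu[w] for w in omega if f[w] >= lam)   # µ({f ≥ λ})
rhs = integral(f, mu) / lam                      # (1/λ) ∫ f dµ
```

Here lhs = 0.5 and rhs = 1.14, so the inequality holds with room to spare; shrinking λ tightens the bound.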
3. Examples

We present now two fundamental examples of integrals, constructed with the Dirac and the Lebesgue measures.

3.1. Integral for the Dirac measure. Consider the measure space (Ω, P, δa) where δa is the Dirac measure at a ∈ Ω. We start by determining the integral of a simple function ϕ ≥ 0 written in the usual form ϕ = Σ_{j=1}^{N} cj XAj. There is a unique 1 ≤ k ≤ N such that a ∈ Ak (since the sets Aj are pairwise disjoint and their union is Ω), and ck is the value of ϕ on Ak. In particular, ϕ(a) = ck. This implies that

∫ ϕ dδa = Σ_j cj δa(Aj) = ck = ϕ(a).

Any function f : Ω → R is measurable for the σ-algebra considered. Take f + ≥ 0. Its integral is computed from the fact that

I(f +) = {ϕ(a) : 0 ≤ ϕ ≤ f +, ϕ simple} = [0, f +(a)].

Therefore, for any function f = f + − f − we have

∫ f dδa = ∫ f + dδa − ∫ f − dδa = f +(a) − f −(a) = f(a).
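For a simple function, the computation ∫ ϕ dδa = ϕ(a) can be reproduced directly; a minimal sketch follows (the representation of ϕ as a list of pairs and all names are ours).

```python
# ∫ ϕ dδ_a for a simple function ϕ = Σ c_j X_{A_j}: only the set containing
# a contributes, so the integral equals ϕ(a). Names are ours.

def dirac(a):
    """δ_a as a set function: mass 1 on sets containing a, else 0."""
    return lambda A: 1.0 if a in A else 0.0

def integrate_simple(parts, measure):
    """Integral of a simple function given as [(c_j, A_j), ...]."""
    return sum(c * measure(A) for c, A in parts)

phi = [(5.0, {1, 2}), (7.0, {3, 4})]    # ϕ = 5·X_{1,2} + 7·X_{3,4}
val = integrate_simple(phi, dirac(2))   # 2 ∈ {1, 2}, so this is ϕ(2) = 5
```

Changing the point a simply picks out the value of ϕ on the set containing a, matching the formula above.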

3.2. Integral for the Lebesgue measure. Let (R, B, m) be the measure space associated to the Lebesgue measure m, and let f : I → R be a measurable function, where I ⊂ R is an interval. For a, b ∈ I we use the notation

∫_a^b f(t) dt = ∫_{[a,b]} f dm if a ≤ b,   and   ∫_a^b f(t) dt = −∫_{[b,a]} f dm if b < a.

Notice that we write dm(t) = dt when the measure is the Lebesgue one.
Consider some a ∈ I and the function F : I → R given by

F(x) = ∫_a^x f(t) dt.

Theorem 4.7. If f is continuous at x, which is in the interior of I, then F is differentiable at x and

F′(x) = f(x).

Proof. By definition, the derivative of F at x is, if it exists, given by

F′(x) = lim_{y→x} (F(y) − F(x))/(y − x) = lim_{y→x} (∫_x^y f(t) dt)/(y − x).

Now, if x < y,

inf_{[x,y]} f ≤ (∫_x^y f(t) dt)/(y − x) ≤ sup_{[x,y]} f.

As y → x+ we get inf_{[x,y]} f → f(x) and sup_{[x,y]} f → f(x) because f is continuous at x. Similarly for the case y < x. So, F′(x) = f(x). □
Remark 4.8. We call F an anti-derivative of f. It is not unique: there are other functions whose derivative is equal to f.
Exercise 4.9. Show that if F1 and F2 are anti-derivatives of f ,
then F1 − F2 is a constant function.
Theorem 4.10. If f is continuous on I and has an anti-derivative F, then for any a, b ∈ I we have

∫_a^b f(t) dt = F(b) − F(a).

Proof. Theorem 4.7 and the fact that the anti-derivative is determined up to a constant imply that ∫_a^b f(t) dt = F(b) + c, where c is a constant. To determine c it is enough to compute 0 = ∫_a^a f(t) dt = F(a) + c, thus c = −F(a). □

Example 4.11. Let f : R → R be given by f(x) = e^{−|x|}. It is a continuous function on R and

∫_0^x e^{−|t|} dt = 1 − e^{−x} if x ≥ 0,   and   ∫_0^x e^{−|t|} dt = −(1 − e^{x}) if x < 0.
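Example 4.11 can be confirmed numerically: a midpoint Riemann sum of e^{−|t|} over [0, x] should approach 1 − e^{−x} for x ≥ 0. The quadrature helper below is our own sketch, not part of the text.

```python
import math

# Midpoint-rule check of ∫_0^x e^{-|t|} dt = 1 - e^{-x} for x ≥ 0.
# riemann() is our own helper.

def riemann(f, a, b, n=100000):
    """Midpoint Riemann sum of f over [a, b] with n subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

x = 2.0
approx = riemann(lambda t: math.exp(-abs(t)), 0.0, x)
exact = 1.0 - math.exp(-x)
```

For this smooth integrand the midpoint rule converges quadratically in the step size, so the two values agree to many digits.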

4. Convergence theorems

The computation of the integral of a function f ≥ 0 is not direct for most choices of measures. It requires considering all simple functions below f and determining the supremum of the set of all their integrals. As a measurable function is the limit of a sequence of simple functions ϕn, it would be very convenient to obtain the integral of f as just the limit of the integrals of ϕn.
More generally, we would like to see if lim ∫ fn dµ is equal to ∫ lim fn dµ. This is indeed given by the convergence theorems (monotone and dominated). There are however conditions that have to be imposed, as the following example shows.

Example 4.12. Consider ϕn = X[n,n+1], a sequence of functions that converges to the zero function. So, ∫ ϕn dm = 1 for any n ∈ N, and

∫ lim ϕn dm = 0 < 1 = lim ∫ ϕn dm.
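A quick way to see what goes wrong in Example 4.12: the indicator ϕn = X[n,n+1] keeps total mass 1 while sliding off to +∞, so at any fixed x the values ϕn(x) are eventually 0. A tiny sketch (names ours):

```python
# The "escaping mass" of Example 4.12: ∫ ϕ_n dm = m([n, n+1]) = 1 for every
# n, yet ϕ_n(x) → 0 at every fixed x. Names are ours.

def phi(n, x):
    """Indicator of [n, n+1] evaluated at x."""
    return 1.0 if n <= x <= n + 1 else 0.0

x0 = 3.5
early = phi(3, x0)                          # the interval still covers x0
tail = [phi(n, x0) for n in range(10, 30)]  # all zero once n > x0
```

No single integrable function dominates all the ϕn at once, which is exactly the hypothesis the dominated convergence theorem will add.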

4.1. Monotone convergence. We start with a preliminary result that will be used later. Let (Ω, F, µ) be a measure space.

Lemma 4.13 (Fatou). Consider a sequence fn of measurable functions such that fn ≥ 0. Then,

∫ lim inf_{n→+∞} fn dµ ≤ lim inf_{n→+∞} ∫ fn dµ.

Proof. Consider any simple function 0 ≤ ϕ ≤ lim inf fn, any 0 < c < 1, and the increasing sequence of measurable functions

gn = inf_{k≥n} fk.

Let

An = {x ∈ Ω : gn(x) ≥ cϕ(x)}.

If ϕ(x) = 0 then x ∈ An for every n; if ϕ(x) > 0 then cϕ(x) < ϕ(x) ≤ lim inf fn(x) = sup_n gn(x), so x ∈ An for n large. Hence An ↑ Ω. In addition,

∫_{An} cϕ dµ ≤ ∫_{An} gn dµ ≤ ∫_{An} fk dµ ≤ ∫ fk dµ

for any k ≥ n. Thus,

∫_{An} cϕ dµ ≤ inf_{k≥n} ∫ fk dµ ≤ lim inf_{n→+∞} ∫ fn dµ.

Letting n → +∞ (so that An ↑ Ω) and then c → 1, we obtain

∫ ϕ dµ ≤ lim inf_{n→+∞} ∫ fn dµ.

It remains to observe that the definition of the integral requires that

∫ lim inf fn dµ = sup { ∫ ϕ dµ : 0 ≤ ϕ ≤ lim inf fn, ϕ simple }.

The claim follows immediately. □

The next result is the first one for limits and not just liminf.

Theorem 4.14 (Monotone convergence). Let fn ≥ 0 be a sequence of measurable functions. If fn % f a.e., then

∫ lim_{n→+∞} fn dµ = lim_{n→+∞} ∫ fn dµ.

Proof. Notice that ∫ fn dµ ≤ ∫ lim_{n→+∞} fn dµ for every n. Hence, using Fatou’s lemma,

lim sup_{n→+∞} ∫ fn dµ ≤ ∫ lim_{n→+∞} fn dµ = ∫ lim inf_{n→+∞} fn dµ ≤ lim inf_{n→+∞} ∫ fn dµ.

Since lim inf is always less than or equal to lim sup, the above inequality implies that they coincide and are equal to the limit. □

Remark 4.15. Applied to a sequence of random variables Xn ≥ 0 on a probability space, this result reads: if Xn % X a.s., then E(lim Xn) = lim E(Xn).
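A concrete instance of the theorem (our own choice, not from the text): the truncations fn = min(1/√x, n) on ]0, 1] increase to f(x) = 1/√x, and ∫ fn dm = 2 − 1/n increases to ∫ f dm = 2, as a midpoint Riemann sum confirms.

```python
import math

# Monotone convergence for f_n = min(1/sqrt(x), n) on ]0, 1]:
# ∫ f_n dm = 2 - 1/n increases to ∫ f dm = 2. All names are ours.

def integral_fn(n, steps=200000):
    """Midpoint Riemann sum of min(1/sqrt(x), n) over ]0, 1]."""
    h = 1.0 / steps
    return h * sum(min(1.0 / math.sqrt((i + 0.5) * h), float(n))
                   for i in range(steps))

ns = (1, 2, 4, 8)
vals = [integral_fn(n) for n in ns]        # increasing in n
exact = [2.0 - 1.0 / n for n in ns]        # the closed-form values
```

Note that f itself is unbounded, so the exchange of limit and integral here really uses monotonicity, not boundedness.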

4.2. More properties.

Proposition 4.16. Let f, g be integrable functions on (Ω, F, µ) and α, β ∈ R.
(1) ∫ (αf + βg) dµ = α ∫ f dµ + β ∫ g dµ. (linearity)
(2) If ∫_A f dµ ≤ ∫_A g dµ for all A ∈ F, then f ≤ g µ-a.e.
(3) If ∫_A f dµ = ∫_A g dµ for all A ∈ F, then f = g µ-a.e.
(4) |∫ f dµ| ≤ ∫ |f| dµ.

Proof.
Proof.

(1) Consider sequences of non-negative simple functions ϕn % f + and ψn % g +. By Proposition 4.2 and the monotone convergence theorem applied twice,

∫ (αf + + βg +) dµ = ∫ lim (αϕn + βψn) dµ
                  = lim ∫ (αϕn + βψn) dµ
                  = lim α ∫ ϕn dµ + lim β ∫ ψn dµ
                  = α lim ∫ ϕn dµ + β lim ∫ ψn dµ
                  = α ∫ f + dµ + β ∫ g + dµ.

The same is done for αf − + βg −, and the result follows immediately.
(2) Writing ∫_A (g − f) dµ ≥ 0 for all A ∈ F, we want to show that h = g − f ≥ 0 a.e., or equivalently that h− = 0 a.e. Let

A = {x ∈ Ω : h(x) < 0} = {x ∈ Ω : h−(x) > 0}.

Then, on A we have h = −h− and

0 ≤ ∫_A h dµ = ∫_A (−h−) dµ ≤ 0.

That is, ∫_A h− dµ = 0 and h− ≥ 0. So, by (7) of Proposition 4.6 we obtain h− = 0 a.e.
(3) Notice that ∫_A f dµ = ∫_A g dµ implies that

∫_A f dµ ≤ ∫_A g dµ ≤ ∫_A f dµ.

By (2) we get f ≤ g on a set of full measure and g ≤ f on another set of full measure. Since the intersection of both sets still has full measure, we have f = g a.e.
(4) From the definition of the integral,

|∫ f dµ| = |∫ f + dµ − ∫ f − dµ|
         ≤ ∫ f + dµ + ∫ f − dµ
         = ∫ |f| dµ. □
Proposition 4.17. Let (Ω1, F1, µ1) be a measure space, (Ω2, F2) a
measurable space and f : Ω1 → Ω2 measurable. If µ2 = µ1 ◦ f⁻¹ is the
induced measure, then
$$\int_{\Omega_2}g\,d\mu_2=\int_{\Omega_1}g\circ f\,d\mu_1$$
for any measurable g : Ω2 → R.

Proof. Consider a simple function φ of the form
$$\varphi=\sum_{j=1}^{+\infty}c_j\,\mathcal X_{A_j}.$$
Then,
$$\int_{\Omega_2}\varphi\,d\mu_2=\sum_{j=1}^{+\infty}c_j\int_{\Omega_2}\mathcal X_{A_j}\,d\mu_2=\sum_{j=1}^{+\infty}c_j\,\mu_1\circ f^{-1}(A_j)=\sum_{j=1}^{+\infty}c_j\int_{f^{-1}(A_j)}d\mu_1=\sum_{j=1}^{+\infty}c_j\int_{\Omega_1}\mathcal X_{f^{-1}(A_j)}\,d\mu_1=\int_{\Omega_1}\varphi\circ f\,d\mu_1.$$
So, the result is proved for simple functions.
Take a sequence of non-negative simple functions φn ↗ g⁺ (a similar
approach works for g⁻), noting that φn ◦ f ↗ g⁺ ◦ f. We can
therefore use the monotone convergence theorem for the sequences of
simple functions φn and φn ◦ f. Thus,
$$\int_{\Omega_2}g^+\,d\mu_2=\lim_{n\to+\infty}\int_{\Omega_2}\varphi_n\,d\mu_2=\lim_{n\to+\infty}\int_{\Omega_1}\varphi_n\circ f\,d\mu_1=\int_{\Omega_1}g^+\circ f\,d\mu_1.$$
□
Example 4.18. Consider a probability space (Ω, F, P) and X : Ω →
R a random variable. Setting the induced measure α = P ◦ X⁻¹,
for any measurable g : R → R we have
$$E(g(X))=\int g\circ X\,dP=\int g(x)\,d\alpha(x).$$
In particular, $E(X)=\int x\,d\alpha(x)$.
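The change-of-variables identity can be checked numerically. The sketch below (illustration only, not part of the notes) takes Ω₁ = [0, 1] with Lebesgue measure, X(ω) = ω² and g(x) = x + 1; it assumes the standard computation that the induced measure α = m ∘ X⁻¹ has density 1/(2√x) on ]0, 1]:

```python
# Sketch of Proposition 4.17 / Example 4.18 (illustration only).
# Assumption of the sketch: alpha = m o X^{-1} for X(w) = w^2 has density
# 1/(2*sqrt(x)) on ]0,1].  Then  int g d(alpha) = int g o X dm.

import math

def riemann(h, a, b, steps=200_000):
    step = (b - a) / steps
    # midpoint rule copes with the integrable singularity of the density at 0
    return sum(h(a + (i + 0.5) * step) for i in range(steps)) * step

g = lambda x: x + 1.0
X = lambda w: w * w

lhs = riemann(lambda x: g(x) / (2.0 * math.sqrt(x)), 0.0, 1.0)  # int g d(alpha)
rhs = riemann(lambda w: g(X(w)), 0.0, 1.0)                       # int g o X dm

assert abs(lhs - rhs) < 2e-3        # both approximate E(g(X)) = 4/3
assert abs(rhs - 4.0 / 3.0) < 1e-6
```

Both Riemann sums approximate the same expectation, one on Ω₂ against the induced measure and one on Ω₁ against µ₁.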
Proposition 4.19. Consider the measure
$$\mu=\sum_{n=1}^{+\infty}a_n\mu_n,$$
where each µn is a measure and an ≥ 0, n ∈ N. If f : Ω → R satisfies
$$\sum_{n=1}^{+\infty}a_n\int|f|\,d\mu_n<+\infty,$$
then f is also µ-integrable and
$$\int f\,d\mu=\sum_{n=1}^{+\infty}a_n\int f\,d\mu_n.$$

Proof. Recall Exercise 2.29 showing that µ is a measure. Suppose
first that f ≥ 0. Take a sequence of simple functions
$$\varphi_k=\sum_j c_j\,\mathcal X_{A_j}$$
such that φk ↗ f as k → +∞. Then,
$$\int\varphi_k\,d\mu=\sum_j c_j\,\mu(A_j)=\sum_{n=1}^{+\infty}a_n\sum_j c_j\,\mu_n(A_j)=\sum_{n=1}^{+\infty}a_n\int\varphi_k\,d\mu_n,$$
which is finite for every k because φk ≤ |f|. Now, using the monotone
convergence theorem, f is µ-integrable and
$$\int f\,d\mu=\lim_{k\to+\infty}\int\varphi_k\,d\mu=\lim_{k\to+\infty}\lim_{m\to+\infty}b_{k,m},$$
where
$$b_{k,m}=\sum_{n=1}^m a_n\int\varphi_k\,d\mu_n.$$
Notice that bk,m ≥ 0, it is increasing both in k and in m, and
bounded from above. So, $A=\sup_k\sup_m b_{k,m}=\lim_k\lim_m b_{k,m}$. Define
also $B=\sup_{k,m}b_{k,m}$. We want to show that A = B.
For all k and m we have B ≥ bk,m, so B ≥ A. Given any ε > 0 we
can find k0 and m0 such that B − ε ≤ bk0,m0 ≤ B. This implies that
$$A=\sup_k\sup_m b_{k,m}\ge\sup_k b_{k,m_0}\ge b_{k_0,m_0}\ge B-\varepsilon.$$
Taking ε → 0 we get A = B. The same arguments above can be used
to show that $A=B=\lim_m\lim_k b_{k,m}$.
We can thus exchange the order of the limits and, again by the monotone
convergence theorem,
$$\int f\,d\mu=\sum_{n=1}^{+\infty}a_n\lim_{k\to+\infty}\int\varphi_k\,d\mu_n=\sum_{n=1}^{+\infty}a_n\int f\,d\mu_n.$$
Consider now f not necessarily non-negative. Using the decomposition f =
f⁺ − f⁻ with f⁺, f⁻ ≥ 0, we have |f| = f⁺ + f⁻. Thus, f is µ-integrable
and
$$\int f\,d\mu=\int f^+\,d\mu-\int f^-\,d\mu=\sum_{n=1}^{+\infty}a_n\int(f^+-f^-)\,d\mu_n.$$
□

4.3. Dominated convergence.

Theorem 4.20 (Dominated convergence). Let fn be a sequence of
measurable functions and g an integrable function. If fn converges a.e.
and for any n ∈ N we have
$$|f_n|\le g\quad\text{a.e.},$$
then
$$\lim_{n\to+\infty}\int f_n\,d\mu=\int\lim_{n\to+\infty}f_n\,d\mu.$$

Proof. Suppose first that 0 ≤ fn ≤ g. By Fatou's lemma,
$$\int\lim_{n\to+\infty}f_n\,d\mu\le\liminf_{n\to+\infty}\int f_n\,d\mu.$$
It remains to show that $\limsup_{n\to+\infty}\int f_n\,d\mu\le\int\lim_{n\to+\infty}f_n\,d\mu$.
Again using Fatou's lemma,
$$\int g\,d\mu-\int\lim_{n\to+\infty}f_n\,d\mu=\int\lim_{n\to+\infty}(g-f_n)\,d\mu\le\liminf_{n\to+\infty}\int(g-f_n)\,d\mu=\int g\,d\mu-\limsup_{n\to+\infty}\int f_n\,d\mu.$$
This implies that
$$\limsup_{n\to+\infty}\int f_n\,d\mu\le\int\lim_{n\to+\infty}f_n\,d\mu.$$
For general |fn| ≤ g, we have max{fn⁺, fn⁻} ≤ g, so the above applies
to fn⁺ and fn⁻ separately and $\lim_{n\to+\infty}\int f_n^\pm\,d\mu=\int\lim_{n\to+\infty}f_n^\pm\,d\mu$. □
Example 4.21.
(1) Consider Ω = ]0, 1[ and
$$f_n(x)=\frac{n\sin x}{1+n^2\sqrt x}.$$
So,
$$|f_n(x)|\le\frac{n}{1+n^2\sqrt x}\le\frac{1}{\sqrt x}.$$
As g(x) = 1/√x is integrable,
$$\lim_{n\to+\infty}\int f_n\,dm=\int\lim_{n\to+\infty}f_n\,dm=0.$$
(2)
$$\lim_{n\to+\infty}\int_{\mathbb R^2}e^{-(x^2+y^2)^n}\,dx\,dy=\int_{\mathbb R^2}\lim_{n\to+\infty}e^{-(x^2+y^2)^n}\,dx\,dy=\int_D dm=\pi,$$
where we have used the fact that $e^{-(x^2+y^2)^n}\le\mathcal X_D+e^{-(x^2+y^2)}$, which
is integrable, and
$$\lim_{n\to+\infty}e^{-(x^2+y^2)^n}=\begin{cases}\frac1e,&(x,y)\in\partial D\\1,&(x,y)\in D\\0,&\text{o.c.}\end{cases}$$
with D = {(x, y) ∈ R² : x² + y² < 1}.
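The first limit in Example 4.21 can be checked numerically. The sketch below (illustration only) computes Riemann sums of ∫₀¹ fₙ dm for growing n and watches them shrink toward 0, as dominated convergence predicts:

```python
# Numerical companion to Example 4.21(1) (illustration only): the integrals
# of f_n(x) = n sin(x) / (1 + n^2 sqrt(x)) over ]0,1[ tend to 0.

import math

def riemann(h, a, b, steps=100_000):
    step = (b - a) / steps
    return sum(h(a + (i + 0.5) * step) for i in range(steps)) * step

def f_n(n):
    return lambda x: n * math.sin(x) / (1.0 + n * n * math.sqrt(x))

vals = [riemann(f_n(n), 0.0, 1.0) for n in (1, 10, 100, 1000)]

# The integrals decrease in magnitude and approach 0.
assert all(abs(v2) < abs(v1) for v1, v2 in zip(vals, vals[1:]))
assert abs(vals[-1]) < 1e-2
```

For large n the integrand behaves like (1/n)·sin(x)/√x, so the decay rate is roughly of order 1/n.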
Exercise 4.22. Determine the following limits:
(1) $\lim_{n\to+\infty}\int_0^{+\infty}\frac{r^n}{1+r^{n+2}}\,dr$
(2) $\lim_{n\to+\infty}\int_0^{\pi}\frac{\sqrt[n]{x}}{1+x^2}\,dx$
(3) $\lim_{n\to+\infty}\int_{-\infty}^{+\infty}e^{-|x|}\cos^n(x)\,dx$
(4) $\lim_{n\to+\infty}\int_{\mathbb R^2}\frac{1+\cos^n(x-y)}{(x^2+y^2+1)^2}\,dx\,dy$

5. Fubini theorem

Let (Ω1, F1, P1) and (Ω2, F2, P2) be probability spaces. Consider
the product probability space (Ω, F, P). Given x1 ∈ Ω1 and x2 ∈ Ω2,
take A ∈ F and its sections
$$A_{x_1}=\{x_2\in\Omega_2:(x_1,x_2)\in A\},\qquad A_{x_2}=\{x_1\in\Omega_1:(x_1,x_2)\in A\}.$$

Exercise 4.23. Show that
(1) for any A ⊂ Ω,
$$(A^c)_{x_1}=(A_{x_1})^c\quad\text{and}\quad(A^c)_{x_2}=(A_{x_2})^c.\tag{4.1}$$
(2) for any A1, A2, · · · ⊂ Ω,
$$\Big(\bigcup_{n\in\mathbb N}A_n\Big)_{x_1}=\bigcup_{n\in\mathbb N}(A_n)_{x_1}\quad\text{and}\quad\Big(\bigcup_{n\in\mathbb N}A_n\Big)_{x_2}=\bigcup_{n\in\mathbb N}(A_n)_{x_2}.\tag{4.2}$$

Proposition 4.24. For every x1 ∈ Ω1 and x2 ∈ Ω2, we have Ax2 ∈
F1 and Ax1 ∈ F2.

Proof. Consider the collection
$$\mathcal G=\{A\in\mathcal F: A_{x_2}\in\mathcal F_1,\ x_2\in\Omega_2\}.$$
We want to show that G = F.
Notice that any measurable rectangle B = B1 × B2 with B1 ∈ F1
and B2 ∈ F2 is in G. In fact, Bx2 = B1 if x2 ∈ B2, otherwise it is
empty.
If I is the collection of all measurable rectangles, then I ⊂ G ⊂ F.
This implies that F = σ(I) ⊂ σ(G) ⊂ F, so σ(G) = F. It is now
enough to show that G is a σ-algebra. This follows easily by using (4.1)
and (4.2). □
Proposition 4.25. Let A ∈ F.
(1) The function x1 ↦ P2(Ax1) on Ω1 is measurable.
(2) The function x2 ↦ P1(Ax2) on Ω2 is measurable.
(3)
$$P(A)=\int P_2(A_{x_1})\,dP_1(x_1)=\int P_1(A_{x_2})\,dP_2(x_2).$$

Proof. Given A ∈ F write fA(x1) = P2(Ax1) and gA(x2) =
P1(Ax2). Denote the collection of all measurable rectangles by I and
consider
$$\mathcal G=\Big\{A\in\mathcal F: f_A\text{ and }g_A\text{ are measurable},\ \int f_A\,dP_1=\int g_A\,dP_2\Big\}.$$
We want to show that G = F.
We start by looking at the measurable rectangles. For each B =
B1 × B2 ∈ I we have that fB = P2(B2) 𝒳B1 and gB = P1(B1) 𝒳B2
are simple functions, thus measurable for F1 and F2, respectively. In
addition,
$$P(B)=\int f_B\,dP_1=\int g_B\,dP_2=P_1(B_1)P_2(B_2).$$
So, I ⊂ G. The same can be checked for finite unions of measurable
rectangles, which form an algebra A, so that A ⊂ G.
We now show that G is a monotone class. Take an increasing sequence
An ↑ A in G. Hence, their sections are increasing, as are
fAn and gAn. Moreover, fA = lim fAn and gA = lim gAn are measurable.
Finally, since $\int f_{A_n}\,dP_1=\int g_{A_n}\,dP_2$ holds for every n, by the
monotone convergence theorem $\int f_A\,dP_1=\int g_A\,dP_2$. That means that
A ∈ G. The same argument can be carried over to decreasing sequences
An ↓ A. Therefore, G is a monotone class.
By Theorem 2.19 we know that σ(A) ⊂ G. Since F = σ(A) and
G ⊂ F, we obtain that G = F. Also, $P(A)=\int f_A\,dP_1$ for any A ∈ F
by extending this property from measurable rectangles. □

Remark 4.26. There exist examples of non-measurable sets (A ⊂
Ω but A ∉ F) with measurable sections and measurable functions
P2(Ax1) and P1(Ax2) whose integrals differ.

Consider now a measurable function f : Ω → R. Given x1 ∈ Ω1 we
define
$$f_{x_1}:\Omega_2\to\mathbb R,\qquad f_{x_1}(x_2)=f(x_1,x_2).$$
Similarly, for x2 ∈ Ω2 let
$$f_{x_2}:\Omega_1\to\mathbb R,\qquad f_{x_2}(x_1)=f(x_1,x_2).$$

Exercise 4.27. Show that if f is measurable, then fx1 and fx2 are
measurable for each x1 and x2, respectively.

Define
$$I_1:\Omega_1\to\mathbb R,\qquad I_1(x_1)=\int f_{x_1}\,dP_2$$
and
$$I_2:\Omega_2\to\mathbb R,\qquad I_2(x_2)=\int f_{x_2}\,dP_1.$$

Exercise 4.28. Show that if f is measurable, then I1 and I2 are
measurable.

Theorem 4.29 (Fubini). Let f : Ω → R.
(1) If f is an integrable function, then fx1 and fx2 are integrable
for a.e. x1 and x2, respectively. Moreover, I1 and I2 are integrable
functions and
$$\int f\,dP=\int I_1\,dP_1=\int I_2\,dP_2.$$
(2) If f ≥ 0 and I1 is an integrable function, then f is integrable.

Exercise 4.30. Prove it.
Example 4.31. Consider Ω = [0, 1] × [0, 1], F = B(Ω) and f : Ω →
R given by f(x, y) = xy.
(1) Take the product measure µ = m × m, where m is the Lebesgue
measure on R. Then,
$$\int f\,d\mu=\int_0^1\!\int_0^1 xy\,dx\,dy=\int_0^1\frac y2\,dy=\frac14.$$
(2) For another measure µ = m × δ1, where δ1 is the Dirac measure
at 1, we obtain
$$\int f\,d\mu=\int_0^1\!\int_{[0,1]}xy\,d\delta_1(y)\,dx=\int_0^1 x\,dx=\frac12.$$
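Both iterated integrals of Example 4.31 can be verified numerically. In the sketch below (illustration only), integration against δ₁ in y is just evaluation of the integrand at y = 1:

```python
# Numerical check of Example 4.31 (illustration only): iterated integrals of
# f(x, y) = x*y over [0,1]^2, first against m x m, then against m x delta_1.

def riemann(h, a, b, steps=1_000):
    step = (b - a) / steps
    return sum(h(a + (i + 0.5) * step) for i in range(steps)) * step

f = lambda x, y: x * y

# (1) mu = m x m: integrate in x first, then in y.
inner = lambda y: riemann(lambda x: f(x, y), 0.0, 1.0)
val1 = riemann(inner, 0.0, 1.0)

# (2) mu = m x delta_1: the inner Dirac integral evaluates f at y = 1.
val2 = riemann(lambda x: f(x, 1.0), 0.0, 1.0)

assert abs(val1 - 0.25) < 1e-9
assert abs(val2 - 0.5) < 1e-9
```

Since f is linear in each variable, the midpoint rule is exact here up to floating-point rounding.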
Exercise 4.32. Consider the Lebesgue probability space ([0, 1], B, m).
Write an example of a measurable function f such that I1 and I2 are
integrable but $\int I_1\,dP_1\neq\int I_2\,dP_2$.

6. Signed measures

Let (Ω, F) be a measurable space. We say that a function λ : F → R̄
is a signed measure iff it is σ-additive in the following sense: for pairwise
disjoint sets A1, A2, · · · ∈ F we have that
$$\lambda\Big(\bigcup_{n=1}^{+\infty}A_n\Big)=\sum_{n=1}^{+\infty}\lambda(A_n)\quad\text{and}\quad\sum_{n=1}^{+\infty}\lambda(A_n)^-<+\infty.$$
Here x⁻ = max{0, −x}. A signed measure is finite if λ(A) is finite for
every A ∈ F.
Obviously, any measure is also a signed measure.

Exercise 4.33. Show that if λ is a signed measure and there is
A ∈ F such that λ(A) is finite, then λ(∅) = 0.
Example 4.34. If µ1 is a measure and µ2 a finite measure (i.e.
µ2(Ω) < +∞), then λ = µ1 − µ2 is a signed measure. In fact, for a
sequence A1, A2, . . . of pairwise disjoint measurable sets,
$$\lambda\Big(\bigcup_{n=1}^{+\infty}A_n\Big)=\mu_1\Big(\bigcup_{n=1}^{+\infty}A_n\Big)-\mu_2\Big(\bigcup_{n=1}^{+\infty}A_n\Big)=\sum_{n=1}^{+\infty}\big(\mu_1(A_n)-\mu_2(A_n)\big)=\sum_{n=1}^{+\infty}\lambda(A_n)$$
by the σ-additivity of the measures.

Theorem 4.35. Let f be an integrable function. Then,
$$\nu(A)=\int_A f\,d\mu,\qquad A\in\mathcal F,$$
is a finite signed measure. Moreover, if f ≥ 0 a.e. then ν is a finite
measure and
$$\int_A g\,d\nu=\int_A gf\,d\mu\tag{4.3}$$
for any function g integrable with respect to ν and A ∈ F.

Remark 4.36. In the conditions of the above theorem we get
$\nu(A)=\int_A d\nu=\int_A f\,d\mu$. It is therefore natural to use the notation
dν = f dµ.

Proof. Let A1, A2, · · · ∈ F be pairwise disjoint and $B=\bigcup_{i=1}^{+\infty}A_i$.
Hence,
$$\nu(B)=\int_B f\,d\mu=\int f^+\mathcal X_B\,d\mu-\int f^-\mathcal X_B\,d\mu.$$
Define $g_n^\pm=f^\pm\mathcal X_{B_n}$ where $B_n=\bigcup_{i=1}^n A_i$. So, $g_n^\pm\le f^\pm\mathcal X_B=\lim g_n^\pm$. By
the monotone convergence theorem, $\lim\int g_n^\pm\,d\mu=\int\lim g_n^\pm\,d\mu$. That
is,
$$\int f\mathcal X_B\,d\mu=\lim_{n\to+\infty}\int_{B_n}f\,d\mu=\sum_{i=1}^{+\infty}\int_{A_i}f\,d\mu=\sum_{i=1}^{+\infty}\nu(A_i),$$
where we have used $\int_{B_n}f\,d\mu=\sum_{i=1}^n\int_{A_i}f\,d\mu$, obtained by induction from
the property in Proposition 4.6. Therefore, ν is σ-additive. It is finite
because f is integrable.
By Proposition 4.6, ν(∅) = 0 because µ(∅) = 0. As f ≥ 0 a.e., we
obtain $\nu(A)=\int_A f\,d\mu\ge\int_A 0\,d\mu=0$, again using Proposition 4.6.
Finally, choose a sequence of simple functions φn ↗ g, each written
in the usual form
$$\varphi_n=\sum_{j=1}^N c_j\,\mathcal X_{A_j}.$$
Applying the monotone convergence theorem twice we obtain
$$\int_A g\,d\nu=\lim_{n\to+\infty}\int_A\varphi_n\,d\nu=\lim_{n\to+\infty}\sum_j c_j\,\nu(A_j\cap A)=\lim_{n\to+\infty}\sum_j c_j\int_A f\mathcal X_{A_j}\,d\mu=\lim_{n\to+\infty}\int_A f\varphi_n\,d\mu=\int_A gf\,d\mu.$$
□
Example 4.37. Consider f : R → R given by
$$f(x)=\begin{cases}e^{-x},&x\ge0\\0,&x<0,\end{cases}$$
and the measure ν : B(R) → R,
$$\nu(A)=\int_A f\,dm.$$
So, m([n − 1, n]) = 1 for any n ∈ N, and
$$\nu([n-1,n])=\int_{n-1}^n e^{-t}\,dt=e^{-n}(e-1),$$
which goes to 0 as n → +∞. Moreover, if g(x) = 1, then g is not
integrable with respect to m since we would have $\int g\,dm=m(\mathbb R)=+\infty$.
However,
$$\int g\,d\nu=\int gf\,dm=\int_0^{+\infty}e^{-t}\,dt=1.$$
Also, ν(R) = 1 and ν is a probability measure.
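The computations in Example 4.37 can be reproduced numerically. The sketch below (illustration only) evaluates ν on the intervals [n − 1, n] and checks the total mass:

```python
# Numerical companion to Example 4.37 (illustration only): nu(A) = int_A f dm
# with density f(x) = e^{-x} on [0, +inf[, so nu([n-1, n]) = e^{-n}(e - 1).

import math

def riemann(h, a, b, steps=100_000):
    step = (b - a) / steps
    return sum(h(a + (i + 0.5) * step) for i in range(steps)) * step

f = lambda x: math.exp(-x)

for n in (1, 2, 5):
    assert abs(riemann(f, n - 1, n) - math.exp(-n) * (math.e - 1)) < 1e-6

# nu is a probability measure; truncating the half-line at 40 is enough
# numerically, since the neglected tail mass e^{-40} is below float noise.
total = riemann(f, 0.0, 40.0, steps=400_000)
assert abs(total - 1.0) < 1e-6
```

The truncation point 40 is an arbitrary choice for the sketch; any cutoff with a negligible tail works.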

7. Radon-Nikodym theorem

A signed measure λ is absolutely continuous with respect to a mea-


sure µ iff for any set A ∈ F such that µ(A) = 0 we also have λ(A) = 0.
That is, every µ-null set is also λ-null. We use the notation
λ ≪ µ.
Example 4.38. Consider the measurable space (I, B) where I is
an interval of R with positive length, the Dirac measure δa at some
a ∈ I and the Lebesgue measure m on I. For the set A = {a} we
have m(A) = 0 but δa (A) = 1. So, δa is not absolutely continuous with
respect to m. On the other hand, if A = I \ {a} we get δa (A) = 0 but
m(A) = m(I) > 0. Hence, m is not absolutely continuous with respect
to δa .
Example 4.39. If µ is a measure and f is integrable, then
$$\lambda(A)=\int_A f\,d\mu$$
is a signed measure by Theorem 4.35. Moreover, if µ(A) = 0 then the
integral over A is always equal to zero. So, λ ≪ µ.

The above example is fundamental because of the next result.


Theorem 4.40 (Radon-Nikodym). Let (Ω, F) be a measurable space,
λ a finite signed measure and µ a finite measure. If λ ≪ µ, then there
is an integrable function f such that
$$\lambda(A)=\int_A f\,d\mu,\qquad A\in\mathcal F.$$
Moreover, f is unique µ-a.e.

The proof is contained in section 7.1.


Remark 4.41.
(1) The function f in the above theorem is called the Radon-
Nikodym derivative of λ with respect to µ and denoted by
$$\frac{d\lambda}{d\mu}=f.$$
Hence,
$$\lambda(A)=\int_A\frac{d\lambda}{d\mu}\,d\mu,\qquad A\in\mathcal F.$$
(2) If λ is also a measure then $\frac{d\lambda}{d\mu}\ge0$ a.e.
Example 4.42.
(1) Consider a countable set Ω = {a1, a2, . . . } and F = P(Ω).
If µ is a finite measure such that all points in Ω have weight
(i.e. µ({an}) > 0 for any n ∈ N) and λ is any finite signed
measure, then the only possible subset A ⊂ Ω with µ(A) = 0
is the empty set A = ∅. So, λ(A) is also equal to zero and
λ ≪ µ. Since
$$\mu(A)=\sum_{a_n\in A}\mu(\{a_n\})$$
and
$$\lambda(\{a_n\})=\int_{\{a_n\}}\frac{d\lambda}{d\mu}(x)\,d\mu(x)=\frac{d\lambda}{d\mu}(a_n)\,\mu(\{a_n\}),$$
we obtain
$$\frac{d\lambda}{d\mu}(a_n)=\frac{\lambda(\{a_n\})}{\mu(\{a_n\})},\qquad n\in\mathbb N.$$
This defines the Radon-Nikodym derivative at µ-almost every
point.
(2) Suppose that Ω = [0, 1] ⊂ R and F = B([0, 1]). Take the
Dirac measure δ0 at 0, the Lebesgue measure m on [0, 1] and
µ = ½δ0 + ½m. If µ(A) = 0 then ½δ0(A) + ½m(A) = 0, which
implies that δ0(A) = 0 and m(A) = 0. Therefore, δ0 ≪ µ and
m ≪ µ. Notice that for any integrable function f we have
$$\int_A f\,d\mu=\frac12\int_A f\,d\delta_0+\frac12\int_A f\,dm.$$
Therefore, for every A ∈ F,
$$\delta_0(A)=\frac12\int_A\frac{d\delta_0}{d\mu}\,d\delta_0+\frac12\int_A\frac{d\delta_0}{d\mu}\,dm$$
and also
$$m(A)=\frac12\int_A\frac{dm}{d\mu}\,d\delta_0+\frac12\int_A\frac{dm}{d\mu}\,dm.$$
Aiming at finding the Radon-Nikodym derivatives, we first
choose A = {0}, so that
$$\frac{d\delta_0}{d\mu}(0)=2,\qquad\frac{dm}{d\mu}(0)=0.$$
Moreover, let A ∈ F be such that 0 ∉ A. Thus,
$$\int_A\frac{d\delta_0}{d\mu}\,dm=0,\qquad\int_A\frac{dm}{d\mu}\,dm=2\,m(A).$$
By considering the σ-algebra F′ induced by F on ]0, 1], the
Radon-Nikodym derivatives restricted to the measurable space (]0, 1], F′)
are measurable, and so the above equations imply that
$$\frac{d\delta_0}{d\mu}(x)=\begin{cases}2,&x=0\\0,&\text{o.c.}\end{cases}\qquad\frac{dm}{d\mu}(x)=\begin{cases}0,&x=0\\2,&\text{o.c.}\end{cases}$$
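Part (1) of Example 4.42 is easy to reproduce on a computer. The sketch below (illustration only, with arbitrarily chosen weights) builds the Radon-Nikodym derivative as a ratio of point masses and recovers λ on every subset:

```python
# Sketch of Example 4.42(1) (illustration only): on a countable Omega the
# Radon-Nikodym derivative is the ratio of point masses.  The weights below
# are arbitrary choices for the sketch.

from itertools import chain, combinations

omega = ["a1", "a2", "a3"]
mu = {"a1": 0.5, "a2": 0.3, "a3": 0.2}    # finite measure, every weight > 0
lam = {"a1": 0.1, "a2": 0.6, "a3": 0.3}    # another finite measure

deriv = {a: lam[a] / mu[a] for a in omega}  # dlam/dmu at each point

# Recover lam(A) = int_A (dlam/dmu) dmu for every subset A of Omega.
subsets = chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))
for A in subsets:
    lam_A = sum(lam[a] for a in A)
    integral = sum(deriv[a] * mu[a] for a in A)
    assert abs(lam_A - integral) < 1e-12
```

Since every µ-null set is empty here, absolute continuity is automatic, exactly as in the example.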
Proposition 4.43. Let ν, λ, µ be finite measures. If ν ≪ λ and
λ ≪ µ, then
(1) ν ≪ µ,
(2)
$$\frac{d\nu}{d\mu}=\frac{d\nu}{d\lambda}\,\frac{d\lambda}{d\mu}\quad\text{a.e.}$$

Proof.
(1) If A ∈ F is such that µ(A) = 0 then λ(A) = 0 because λ ≪ µ.
Furthermore, since ν ≪ λ we also have ν(A) = 0. This means
that ν ≪ µ.
(2) We know that
$$\lambda(A)=\int_A\frac{d\lambda}{d\mu}\,d\mu.$$
So,
$$\nu(A)=\int_A\frac{d\nu}{d\lambda}\,d\lambda=\int_A\frac{d\nu}{d\lambda}\,\frac{d\lambda}{d\mu}\,d\mu,$$
where we have used (4.3). □

7.1. Proof of the Radon-Nikodym theorem.


Part 2

Probability
CHAPTER 5

Distributions

From now on we focus on probability theory and use its notations


and nomenclatures. That is, we interpret Ω as the set of outcomes
of an experiment, F as the collection of events (sets of outcomes), F-
measurable functions as random variables (numerical result of an ob-
servation) and finally P is a probability measure. Moreover, whenever
there is a property valid for a set of full probability measure, we will
use the initials a.s. (almost surely) instead of a.e. (almost everywhere).
In this chapter we are going to explore a correspondence between
distributions and two types of functions: distribution functions and
characteristic functions. It is simpler to study functions than measures.
Determining a measure requires knowing its value for every measurable
set, a much harder task than understanding a function.

1. Definition

Let (Ω, F, P ) be a probability space and X : Ω → R a random


variable (an F-measurable function). The distribution of X (or the
law of X) is the induced probability measure α : B(R) → R,

α = P ◦ X −1 .

In general we say that any probability measure P on R is a distribu-


tion by considering the identity random variable X : R → R, X(x) = x,
so that α = P .
It is common in probability theory to use several notations that are
appropriate in the context. We list below some of them:

(1) P (X ∈ A) = P (X −1 (A)) = α(A)


(2) P (X ∈ A, X ∈ B) = α(A ∩ B)
(3) P (X ∈ A or X ∈ B) = α(A ∪ B)
(4) P (X 6∈ A) = α(Ac )
(5) P (X ∈ A, X 6∈ B) = α(A \ B)
(6) P (X = a) = α({a})
(7) P (X ≤ a) = α(] − ∞, a])
(8) P (a < X ≤ b) = α(]a, b])

Exercise 5.1. Suppose that f : R → R is a B(R)-measurable func-


tion and α is the distribution of a random variable X. Find the distri-
bution of f ◦ X.

When the random variable is multidimensional, i.e. X : Ω → Rd


and X = (X1 , . . . , Xd ), we call the induced measure α : B(Rd ) → R
given by α = P ◦ X −1 the joint distribution of X1 , . . . , Xd .
In most applications it is the distribution α that really matters.
For example, suppose that Ω is the set of all possible states of the
atmosphere. If X is the function that gives the temperature (◦ C) in
Lisbon for a given state of the atmosphere and I = [20, 21],
α(I) = P (X −1 (I)) = P (X ∈ I) = P (20 ≤ X ≤ 21)
is the probability of the temperature being between 20◦ C and 21◦ C.
That is, we first compute the set X −1 (I) of all states that correspond
to a temperature in Lisbon inside the interval I, and then find its
probability measure.
It is important to be aware that for the vast majority of systems
in the real world, we do not know Ω and P . So, one needs to guess
α. Finding the right distribution is usually a very difficult task, if not
impossible. Nevertheless, a frequently convenient way to acquire some
knowledge of α is by treating statistically the data from experimental
observations. In particular, it is possible to determine good approxi-
mations of each moment of order n of α (if it exists):
$$m_n=E(X^n)=\int x^n\,d\alpha(x),\qquad n\in\mathbb N,\qquad m_0=1.$$

Knowing the moments is a first step towards a choice of the distri-


bution, but in general it does not determine it uniquely. Notice that
$E(X^n)$ exists if $E(|X^n|)<+\infty$ (i.e. $X^n$ is integrable).
Remark 5.2.
(1) The moment of order 1 is the expectation of X,
m1 = E(X).
(2) The variance of X is defined as
$$\operatorname{Var}(X)=E\big((X-E(X))^2\big)=E(X^2)-E(X)^2=m_2-m_1^2.$$
(3) Given two integrable random variables X and Y, their covariance
is
$$\operatorname{Cov}(X,Y)=E\big((X-E(X))(Y-E(Y))\big)=E(XY)-E(X)E(Y).$$
If Cov(X, Y ) = 0, then we say that X and Y are uncorrelated.
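The algebraic identities for the variance and the covariance can be sanity-checked on a small discrete distribution. The sketch below (illustration only, with arbitrary atoms and weights) compares the centered definitions with the moment formulas:

```python
# Numeric sketch of Remark 5.2 (illustration only): moments, variance and a
# covariance identity for a discrete distribution with arbitrary weights.

xs = [-1.0, 0.0, 2.0]
ps = [0.2, 0.5, 0.3]          # probabilities summing to 1

def moment(n):
    return sum(p * x ** n for x, p in zip(xs, ps))

m1, m2, m3 = moment(1), moment(2), moment(3)

# Var(X) via the centered definition vs  m2 - m1^2.
var = sum(p * (x - m1) ** 2 for x, p in zip(xs, ps))
assert abs(var - (m2 - m1 ** 2)) < 1e-12

# Cov(X, X^2) via the centered definition vs  E(X^3) - E(X) E(X^2).
cov = sum(p * (x - m1) * (x ** 2 - m2) for x, p in zip(xs, ps))
assert abs(cov - (m3 - m1 * m2)) < 1e-12
```

Taking Y = X² keeps the example on a single probability space while still exercising the covariance formula.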
Exercise 5.3. Show that Var(X) = 0 iff P (X = E(X)) = 1.

Exercise 5.4. Show that for each distribution α there is n0 such


that mn exists for every n ≤ n0 and it does not exist otherwise.
Exercise 5.5. Consider X1, . . . , Xn integrable random variables.
Show that if Cov(Xi, Xj) = 0 for i ≠ j, then
$$\operatorname{Var}\Big(\sum_{i=1}^n X_i\Big)=\sum_{i=1}^n\operatorname{Var}(X_i).$$

Exercise 5.6. Let X be a random variable and λ > 0. Prove the
Tchebychev inequalities:
(1) For k ∈ N,
$$P(|X|\ge\lambda)\le\frac{1}{\lambda^k}E(|X|^k).$$
When k = 1 this is also called the Markov inequality.
(2)
$$P(|X-E(X)|\ge\lambda)\le\frac{\operatorname{Var}(X)}{\lambda^2}.$$
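A Monte Carlo sketch (illustration only) makes the two bounds concrete. Note that both inequalities hold exactly for the empirical distribution of a sample, with its own empirical mean and variance, which is what the assertions below exploit:

```python
# Monte Carlo sketch of the Tchebychev/Markov inequalities (illustration
# only): X uniform on [0,1], sampled with a fixed seed.

import random

random.seed(0)
samples = [random.random() for _ in range(100_000)]
n = len(samples)
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

# Tchebychev: P(|X - E(X)| >= lam) <= Var(X) / lam^2.
for lam in (0.25, 0.4):
    p_tail = sum(1 for x in samples if abs(x - mean) >= lam) / n
    assert p_tail <= var / lam ** 2 + 1e-9

# Markov (k = 1), valid since the samples are non-negative.
for lam in (0.5, 0.9):
    p = sum(1 for x in samples if x >= lam) / n
    assert p <= mean / lam + 1e-9
```

For λ = 0.4 the Tchebychev bound is about 0.52 while the empirical tail probability is about 0.2, showing how loose the inequality can be.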

2. Simple examples

Here are some examples of random variables for which one can find
explicitly their distributions.
Example 5.7. Consider X to be constant, i.e. X(x) = c for any
x ∈ Ω and some c ∈ R. Then, given B ∈ B we obtain
$$X^{-1}(B)=\begin{cases}\emptyset,&c\notin B\\\Omega,&c\in B.\end{cases}$$
Hence,
$$\alpha(B)=P(X^{-1}(B))=\begin{cases}P(\emptyset)=0,&c\notin B\\P(\Omega)=1,&c\in B.\end{cases}$$
That is, α = δc is the Dirac distribution at c. Finally, $E(g(X))=\int g(x)\,d\alpha(x)=g(c)$, so mn = cⁿ and in particular
$$E(X)=c\quad\text{and}\quad\operatorname{Var}(X)=0.$$
Example 5.8. Given A ∈ F and constants c1, c2 ∈ R, let X =
c1 𝒳A + c2 𝒳Ac. Then, for B ∈ B we get
$$X^{-1}(B)=\begin{cases}A,&c_1\in B,\ c_2\notin B\\A^c,&c_1\notin B,\ c_2\in B\\\Omega,&c_1,c_2\in B\\\emptyset,&\text{o.c.}\end{cases}$$
So, the distribution of X is
$$\alpha(B)=\begin{cases}p,&c_1\in B,\ c_2\notin B\\1-p,&c_1\notin B,\ c_2\in B\\1,&c_1,c_2\in B\\0,&\text{o.c.}\end{cases}$$
where p = P(A). That is,
$$\alpha=p\,\delta_{c_1}+(1-p)\,\delta_{c_2}$$
is the so-called Bernoulli distribution. Hence, $E(g(X))=\int g(x)\,d\alpha(x)=p\,g(c_1)+(1-p)\,g(c_2)$ and $m_n=p\,c_1^n+(1-p)\,c_2^n$. In particular,
$$E(X)=p\,c_1+(1-p)\,c_2,\qquad\operatorname{Var}(X)=p(1-p)(c_1-c_2)^2.$$
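The moment formulas of Example 5.8 can be verified with a line of arithmetic. The sketch below (illustration only, with arbitrary values of p, c₁, c₂) checks that m₂ − m₁² reduces to p(1 − p)(c₁ − c₂)²:

```python
# Check of Example 5.8 (illustration only): X = c1 on A, c2 on A^c with
# p = P(A).  Then E(X) = p c1 + (1-p) c2 and Var(X) = p(1-p)(c1 - c2)^2.

p, c1, c2 = 0.3, 2.0, -1.0

mean = p * c1 + (1 - p) * c2                 # m1
second = p * c1 ** 2 + (1 - p) * c2 ** 2     # m2
var = second - mean ** 2

assert abs(var - p * (1 - p) * (c1 - c2) ** 2) < 1e-12
```

Expanding m₂ − m₁² symbolically gives the same factorized form, so the check succeeds for any admissible choice of parameters.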
Exercise 5.9. Find the distribution of a simple function of the
form
$$X=\sum_{j=1}^N c_j\,\mathcal X_{A_j}$$
and compute its moments.

3. Distribution functions

A function F : R → R is a distribution function iff


(1) it is increasing, i.e. for any x1 < x2 we have F (x1 ) ≤ F (x2 ),
(2) it is continuous from the right at every point, i.e. F (x+ ) =
F (x),
(3) F (−∞) = 0, F (+∞) = 1.
The next theorem states that there is a one-to-one correspondence
between distributions and distribution functions.
Theorem 5.10 (Lebesgue).
(1) If α is a distribution, then
F (x) = α(] − ∞, x]), x ∈ R,
is a distribution function.
(2) If F is a distribution function, then there is a unique distribu-
tion α such that
α(] − ∞, x]) = F (x), x ∈ R.
Remark 5.11. The function F as above is called the distribution
function of α. Whenever α is the distribution of a random variable X,
we also say that F is the distribution function of X and
F (x) = P (X ≤ x).

Proof.
(1) For any x1 ≤ x2 we have ]−∞, x1] ⊂ ]−∞, x2]. Thus, F(x1) ≤
F(x2) and F is increasing. Now, given any sequence xn ↓ a,
$$\lim_{n\to+\infty}F(x_n)=\lim_{n\to+\infty}\alpha(]-\infty,x_n])=\alpha\Big(\bigcap_{n=1}^{+\infty}]-\infty,x_n]\Big)=\alpha(]-\infty,a])=F(a).$$
That is, F is continuous from the right at any a ∈ R. Finally,
using Theorem 2.33,
$$F(-\infty)=\lim_{n\to+\infty}F(-n)=\lim_{n\to+\infty}\alpha(]-\infty,-n])=\alpha\Big(\bigcap_{n=1}^{+\infty}]-\infty,-n]\Big)=\alpha(\emptyset)=0$$
and
$$F(+\infty)=\lim_{n\to+\infty}F(n)=\lim_{n\to+\infty}\alpha(]-\infty,n])=\alpha\Big(\bigcup_{n=1}^{+\infty}]-\infty,n]\Big)=\alpha(\mathbb R)=1.$$
(2) Consider the algebra A(R) that contains every finite union of
intervals of the form ]a, b] (see section 1.2). Take a sequence of
disjoint intervals ]an, bn], −∞ ≤ an ≤ bn ≤ +∞, whose union
is in A(R) and define
$$\alpha\Big(\bigcup_{n=1}^{+\infty}]a_n,b_n]\Big)=\sum_{n=1}^{+\infty}\big(F(b_n)-F(a_n)\big).$$
Thus, α is σ-additive, α(∅) = α(]a, a]) = 0, α(A) ≥ 0 for any
A ∈ A(R) because F is increasing, and α(R) = F(+∞) −
F(−∞) = 1. Thus, α is a probability measure on A(R). In
particular, α(]−∞, x]) = F(x), x ∈ R.
Finally, the Carathéodory extension theorem guarantees
that α can be uniquely extended to a distribution on σ(A(R)) =
B(R). □

Exercise 5.12. Show that for −∞ ≤ a ≤ b ≤ +∞ we have


(1) α({a}) = F (a) − F (a− )
(2) α(]a, b[) = F (b− ) − F (a)
(3) α([a, b[) = F (b− ) − F (a− )
(4) α([a, b]) = F (b) − F (a− )
(5) α(] − ∞, b[) = F (b− )
(6) α([a, +∞[) = 1 − F (a− )
(7) α(]a, +∞[) = 1 − F (a)
Exercise 5.13. Compute the distribution function of the following
distributions:
(1) The Dirac distribution δa at a ∈ R.
(2) The Bernoulli distribution pδa + (1 − p)δb with 0 ≤ p ≤ 1 and
a, b ∈ R.
(3) The uniform distribution on a bounded interval I ⊂ R,
$$m_I(A)=\frac{m(A\cap I)}{m(I)},\qquad A\in\mathcal B(\mathbb R),$$
where m is the Lebesgue measure.
(4) α = c1 δa + c2 mI on B(R) where c1 , c2 ≥ 0 and c1 + c2 = 1.
(5)
$$\alpha=\sum_{n=1}^{+\infty}\frac{1}{2^n}\,\delta_{-1/n}.$$
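For item (5) of the exercise above, the distribution function can be computed directly by summing the point masses of the atoms that fall in ]−∞, x]. The sketch below (illustration only) truncates the series far below floating-point resolution:

```python
# Numeric companion to Exercise 5.13(5) (illustration only): the distribution
# alpha = sum_n 2^{-n} delta_{-1/n}, with distribution function
# F(x) = sum of 2^{-n} over the atoms -1/n lying in ]-inf, x].

N = 60  # series truncation; 2^{-60} is below double-precision resolution

def F(x):
    return sum(2.0 ** -n for n in range(1, N + 1) if -1.0 / n <= x)

assert F(-2.0) == 0.0                  # no atoms at or below -2
assert abs(F(-1.0) - 0.5) < 1e-12      # only the atom -1 contributes
assert abs(F(-0.5) - 0.75) < 1e-12     # atoms -1 and -1/2: 1/2 + 1/4
assert abs(F(0.0) - 1.0) < 1e-12       # every atom is negative
```

The jumps of F sit exactly at the atoms −1/n, and F is constant equal to 1 on [0, +∞[.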

4. Classification of distributions

Consider a distribution α and its corresponding distribution func-
tion F : R → R. The set of points where F is discontinuous is denoted
by
$$D=\{x\in\mathbb R: F(x^-)<F(x)\}.$$


Proposition 5.14.
(1) α({a}) > 0 iff a ∈ D.
(2) D is countable.

Proof.
(1) Recall that α({a}) = F (a) − F (a− ). So, it is positive iff a is a
discontinuity point of F .
(2) For each x ∈ D we can choose a rational number g(x) such
that F (x− ) < g(x) < F (x). This defines a function g : D → Q.
Now, for x1 , x2 ∈ D satisfying x1 < x2 , we have
$$g(x_1)<F(x_1)\le F(x_2^-)<g(x_2).$$

Hence g is strictly increasing and injective. It yields therefore


a bijection D → g(D) ⊂ Q which implies that D is countable.


4.1. Discrete. A distribution function F : R → R is called discrete
if it is piecewise constant, i.e. for any n ∈ N we can find an ∈ R
and pn ≥ 0 such that $\sum_n p_n=1$ and
$$F(x)=\sum_{n=1}^{+\infty}p_n\,\mathcal X_{[a_n,+\infty[}(x)=\sum_{a_n\le x}p_n.$$
A distribution is called discrete iff its distribution function is discrete.
So, a discrete distribution α is given by
$$\alpha(A)=\sum_{n=1}^{+\infty}p_n\,\delta_{a_n}(A)=\sum_{a_n\in A}p_n,\qquad A\in\mathcal B(\mathbb R).$$

Exercise 5.15. Show that α(D) = 1 iff α is a discrete distribution.


Example 5.16. Consider a random variable X such that P(X ∈
N) = 1. So,
$$P(X\ge n)=\sum_{i=n}^{+\infty}P(X=i).$$
Its expected value is then
$$E(X)=\int X\,dP=\sum_{i=1}^{+\infty}i\,P(X=i)=\sum_{i=1}^{+\infty}\sum_{n=1}^{i}P(X=i)=\sum_{n=1}^{+\infty}\sum_{i=n}^{+\infty}P(X=i)=\sum_{n=1}^{+\infty}P(X\ge n).$$
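The tail-sum formula E(X) = Σₙ P(X ≥ n) is easy to check numerically. The sketch below (illustration only) uses a geometric random variable, P(X = i) = (1 − q)qⁱ⁻¹, truncated far into the tail:

```python
# Check of Example 5.16 (illustration only) for a geometric distribution:
# P(X = i) = (1-q) q^{i-1}, i in N, truncated at N (tail mass q^N negligible).

q = 0.5
N = 200

p = {i: (1 - q) * q ** (i - 1) for i in range(1, N + 1)}

expectation = sum(i * p[i] for i in p)                           # E(X) = 1/(1-q)
tail_sums = sum(sum(p[i] for i in p if i >= n) for n in range(1, N + 1))

assert abs(expectation - tail_sums) < 1e-9   # tail-sum formula
assert abs(expectation - 2.0) < 1e-9         # 1/(1-q) = 2 for q = 1/2
```

Over the truncated support the two finite double sums count each term i·P(X = i) the same number of times, which is exactly the exchange of summation order used in the example.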

4.2. Continuous. A distribution is called continuous iff its distribution
function is continuous.
4.2.1. Absolutely continuous. We say that a distribution function
is absolutely continuous if there is a function f ≥ 0, integrable with
respect to the Lebesgue measure m, such that
$$F(x)=\int_{-\infty}^x f\,dm,\qquad x\in\mathbb R.$$

In particular, F is continuous (D = ∅). Recall that by the fundamental


theorem of calculus, if F is differentiable at x, then F 0 (x) = f (x).
Moreover, if f is continuous, then F is differentiable.
A distribution is called absolutely continuous iff its distribution
function is absolutely continuous. An absolutely continuous distribution
α is given by
$$\alpha(A)=\int_A f\,dm,\qquad A\in\mathcal B(\mathbb R).$$
The function f is known as the density of α.

Example 5.17. Take Ω = R, F = B(R) and the Lebesgue measure
m on [0, 1]. For a fixed r > 0 consider the random variable
$$X(\omega)=\omega^r\,\mathcal X_{[0,+\infty[}(\omega).$$
Thus,
$$\{X\le x\}=\begin{cases}\emptyset,&x<0\\ ]-\infty,x^{1/r}],&x\ge0\end{cases}$$
and the distribution function of X is
$$F(x)=m(X\le x)=\begin{cases}0,&x<0\\x^{1/r},&0\le x<1\\1,&x\ge1.\end{cases}$$
The function F is absolutely continuous since
$$F(x)=\int_{-\infty}^x f(t)\,dt,$$
where f(t) = F′(t) is the density function given by
$$f(t)=\frac1r\,t^{1/r-1}\,\mathcal X_{[0,1]}(t).$$
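The relation between F and its density in Example 5.17 can be tested numerically. The sketch below (illustration only) fixes r = 2, integrates the density by the midpoint rule and compares with F(x) = x^{1/2}:

```python
# Numeric companion to Example 5.17 (illustration only): for r = 2 the
# distribution function is F(x) = sqrt(x) on [0,1[ and the density is
# f(t) = (1/2) t^{-1/2} on [0,1].  The singularity at 0 is integrable, so
# the midpoint rule converges (slowly) despite it.

r = 2.0

def riemann(h, a, b, steps=400_000):
    step = (b - a) / steps
    return sum(h(a + (i + 0.5) * step) for i in range(steps)) * step

density = lambda t: (1.0 / r) * t ** (1.0 / r - 1.0)

for x in (0.1, 0.5, 0.9):
    assert abs(riemann(density, 0.0, x) - x ** (1.0 / r)) < 1e-3
```

The dominant numerical error comes from the cells nearest the singularity at 0, which is why a fairly fine grid is used.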

4.2.2. Singular continuous. We say that a distribution function is


singular continuous if F is continuous (D = ∅) but not absolutely
continuous.
A distribution is called singular continuous iff its distribution func-
tion is singular continuous.

4.3. Mixed. A distribution is called mixed iff it is neither discrete
nor continuous.

5. Convergence in distribution

Consider a sequence of random variables Xn and the sequence of
their distributions αn = P ◦ Xn⁻¹. Moreover, we take the sequence of
the corresponding distribution functions Fn.
We say that Xn converges in distribution to a random variable X
iff
$$\lim_{n\to+\infty}F_n(x)=F(x),\qquad x\in D^c,$$
where F is the distribution function of X and D the set of its discontinuity
points. We use the notation
$$X_n\overset{d}{\to}X.$$
Moreover, we say that αn converges weakly to a distribution α iff
$$\lim_{n\to+\infty}\int f\,d\alpha_n=\int f\,d\alpha$$
for every continuous and bounded f : R → R. We use the notation
$$\alpha_n\overset{w}{\to}\alpha.$$
It turns out that it is enough to check the convergence of the above
integral for a one-parameter family of complex-valued functions gt(x) =
e^{itx} where t ∈ R. Recall that e^{itx} = cos(tx) + i sin(tx), so that
$$\int e^{itx}\,d\alpha(x)=\int\cos(tx)\,d\alpha(x)+i\int\sin(tx)\,d\alpha(x).$$
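A minimal concrete instance of both definitions (illustration only, not part of the notes) is αₙ = δ_{1/n}, which converges weakly to δ₀: integrals against Dirac measures are just evaluations, and Fₙ → F at every continuity point of F, i.e. at every x ≠ 0:

```python
# Sketch of the definitions above (illustration only): alpha_n = delta_{1/n}
# converges weakly to delta_0.

import math

f = lambda x: math.atan(x)           # a continuous and bounded test function

integrals = [f(1.0 / n) for n in (1, 10, 100, 1000)]   # int f d(alpha_n)
assert abs(integrals[-1] - f(0.0)) < 1e-2              # -> int f d(delta_0)

F_n = lambda n, x: 1.0 if x >= 1.0 / n else 0.0        # df of delta_{1/n}
F = lambda x: 1.0 if x >= 0.0 else 0.0                 # df of delta_0

for x in (-1.0, 0.5, 2.0):           # continuity points of F (x != 0)
    assert F_n(1000, x) == F(x)

# At the discontinuity x = 0 pointwise convergence fails: F_n(0) = 0, F(0) = 1.
assert F_n(1000, 0.0) == 0.0
```

This is precisely why the definition of convergence in distribution excludes the discontinuity set D.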

Theorem 5.18. Let αn be a sequence of distributions. If
$$\lim_{n\to+\infty}\int e^{itx}\,d\alpha_n$$
exists for every t ∈ R and is continuous at 0 as a function of t, then
there is a distribution α such that $\alpha_n\overset{w}{\to}\alpha$.

Proof. Let Fn be the distribution function of each αn . Let rj be


a sequence ordering the rational numbers. As Fn (r1 ) ∈ [0, 1] there is a
subsequence k1 (n) (that is, k1 : N → N is strictly increasing) for which
Fk1 (n) (r1 ) converges when n → +∞, say to br1 . Again, Fk1 (n) (r2 ) ∈
[0, 1] implies that there is a subsequence k2 (n) of k1 (n) (meaning that
k2 : N → k1 (N) is strictly increasing) giving Fk2 (n) (r2 ) → br2 . Induc-
tively, we can find subsequences kj (n) of kj−1 (n) such that Fkj (n) (rj ) →
brj . Notice that the sequence m(n) = kn (n) is a subsequence of kj (n)
when n ≥ j. Therefore, for any j we have that Fm(n) (rj ) → brj . Since
each distribution function Fn is increasing, for rationals r < r0 we have
for any sufficiently large n that Fm(n) (r) ≤ Fm(n) (r0 ).

Define Gn = Fm(n) . So, Gn (r) → br for any r ∈ Q. In addition,


for rationals r < r0 it holds br ≤ br0 . We now choose the function
G : R → [0, 1] by
G(x) = inf br .
r>x
We will now show that G is also a distribution function.
For x1 < x2 it is simple to check that
G(x1 ) = inf br ≤ inf br = G(x2 ),
r>x1 r>x2

so that G is increasing. Take a sequence xn → x+ . Hence G(x) =




The following theorem shows that convergence in distribution for


sequences of random variables is the same as weak convergence for their
distributions. Moreover, this is equivalent to showing convergence of
the integrals for a specific complex function x 7→ eitx for each t ∈ R.
This last fact will be explored in the next section, and this integral will
be called the characteristic function of the distribution.
Theorem 5.19 (Lévy-Cramér continuity). For each n ∈ N consider
a distribution αn with distribution function Fn. Let α be a distribution
with distribution function F. The following propositions are equivalent:
(1) Fn → F on Dc,
(2) $\alpha_n\overset{w}{\to}\alpha$,
(3) for each t ∈ R,
$$\lim_{n\to+\infty}\int e^{itx}\,d\alpha_n(x)=\int e^{itx}\,d\alpha(x).$$

Proof.
(1)⇒(2) Assume that Fn → F on the set Dc of continuity points of
F. Let ε > 0 and a, b ∈ Dc such that a < b, F(a) ≤ ε and
F(b) ≥ 1 − ε. Then, there is n0 ∈ N satisfying
$$F_n(a)\le2\varepsilon\quad\text{and}\quad F_n(b)\ge1-2\varepsilon$$
for all n ≥ n0.
Let δ > 0 and f continuous such that |f(x)| ≤ M for some
M > 0. Take the following partition
$$]a,b]=\bigcup_{j=1}^N I_j,\qquad I_j=]a_j,a_{j+1}],$$
where a = a1 < · · · < aN+1 = b with aj ∈ Dc such that
$$\max_{I_j}f-\min_{I_j}f<\delta.$$

Consider now the simple function
$$h(x)=\sum_{j=1}^N f(a_j)\,\mathcal X_{I_j}(x).$$
Hence,
$$|f(x)-h(x)|\le\delta,\qquad x\in\,]a,b].$$
In addition,
$$\left|\int(f-h)\,d\alpha_n\right|\le\left|\int_{]a,b]}(f-h)\,d\alpha_n\right|+\left|\int_{]a,b]^c}f\,d\alpha_n\right|\le\delta\,\alpha_n(]a,b])+(\max|f|)\big(F_n(a)+1-F_n(b)\big)\le\delta+4M\varepsilon.$$
Similarly,
$$\left|\int(f-h)\,d\alpha\right|\le\delta+2M\varepsilon.$$
In addition,
$$\alpha_n(I_j)-\alpha(I_j)=F_n(a_{j+1})-F(a_{j+1})-\big(F_n(a_j)-F(a_j)\big)$$
converges to zero as n → +∞, and the same holds for
$$\int h\,d\alpha_n-\int h\,d\alpha=\sum_{j=1}^N f(a_j)\big(\alpha_n(I_j)-\alpha(I_j)\big).$$
Therefore, using
$$\int f\,d\alpha_n-\int f\,d\alpha=\int(f-h)\,d\alpha_n-\int(f-h)\,d\alpha+\int h\,d\alpha_n-\int h\,d\alpha,$$
we obtain
$$\limsup_{n\to+\infty}\left|\int f\,d\alpha_n-\int f\,d\alpha\right|\le2\delta+6M\varepsilon.$$
Being ε and δ arbitrary, we get $\alpha_n\overset{w}{\to}\alpha$.
(2)⇒(1) Let y be a continuity point of F. So, α({y}) = 0. Consider
A = ]−∞, y[ and the sequence of functions
$$f_k(x)=\begin{cases}1,&x\le y-\frac1{2^k}\\-2^k(x-y),&y-\frac1{2^k}<x\le y\\0,&x>y,\end{cases}$$
where k ∈ N. Notice that fk ↗ 𝒳A. Thus, using the dominated
convergence theorem,
$$F(y)=\alpha(A)=\int\mathcal X_A\,d\alpha=\lim_k\int f_k\,d\alpha.$$
Since fk is continuous and bounded, and fk ≤ 𝒳A,
$$\lim_k\int f_k\,d\alpha=\lim_k\lim_n\int f_k\,d\alpha_n\le\lim_k\liminf_n\int\mathcal X_A\,d\alpha_n\le\liminf_n F_n(y),$$
where we have also used the fact that Fn(y⁻) ≤ Fn(y).


Now, take A =] − ∞, y] and

1, x ≤ y

fk (x) = −2k (x − y) + 1, y < x ≤ y + 21k
x > y + 21k .

0,

Similarly to above, as fk & XA ,


Z Z
F (y) = lim lim fk dαn ≥ lim lim sup XA dαn = lim sup Fn (y).
k n k n n

Combining the two inequalities,


lim sup Fn (y) ≤ F (y) ≤ lim inf Fn (y)
n n

we conclude that F (y) = limn Fn (y).


(2)⇒(3) Define gt(x) = e^{itx} = cos(tx) + i sin(tx) for each t ∈ R. Since
cos(tx) and sin(tx) are continuous and bounded as functions
of x, by (2) we have $\lim_n\int g_t(x)\,d\alpha_n(x)=\int g_t(x)\,d\alpha(x)$.
(3)⇒(2) This follows from Theorem 5.18 by noticing that $t\mapsto\int e^{itx}\,d\alpha(x)$
is continuous at 0. □
Exercise 5.20. Show that if Xn converges in distribution to a
constant, then it also converges in probability.

6. Characteristic functions

A function φ : R → C is a characteristic function iff
(1) φ is continuous at 0,
(2) φ is positive definite, i.e.
$$\sum_{i=1}^n\sum_{j=1}^n\phi(t_i-t_j)\,z_i\bar z_j\in\mathbb R_0^+$$
for all z1, . . . , zn ∈ C, t1, . . . , tn ∈ R and n ∈ N,
(3) φ(0) = 1.
The next theorem states that there is a one-to-one correspondence
between distributions and characteristic functions.
Theorem 5.21 (Bochner).
(1) If α is a distribution, then
$$\phi(t)=\int e^{itx}\,d\alpha(x),\qquad t\in\mathbb R,$$
is a characteristic function.
(2) If φ is a characteristic function, then there is a unique distribution
α such that
$$\int e^{itx}\,d\alpha(x)=\phi(t),\qquad t\in\mathbb R.$$

The above theorem is proved in section 6.3.


Remark 5.22. The function φ as above is called the characteristic
function¹ of α. Whenever α is the distribution of a random variable X,
we also say that φ is the characteristic function of X and
$$\phi(t)=E(e^{itX}).$$

Example 5.23. The characteristic function of the Dirac distribution
δa at a point a ∈ R is
$$\phi(t)=\int e^{itx}\,d\delta_a(x)=e^{ita}.$$
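Characteristic functions of discrete distributions reduce to finite sums of complex exponentials, so they are easy to compute and to test against the properties listed above. The sketch below (illustration only, with arbitrary parameters) does this for a Dirac and a Bernoulli distribution as in Examples 5.7 and 5.8:

```python
# Numeric companion to Example 5.23 (illustration only): characteristic
# functions phi(t) = E(e^{itX}) for a Dirac and a Bernoulli distribution.

import cmath

a, p, c1, c2 = 1.5, 0.3, -1.0, 2.0

phi_dirac = lambda t: cmath.exp(1j * t * a)
phi_bern = lambda t: p * cmath.exp(1j * t * c1) + (1 - p) * cmath.exp(1j * t * c2)

for phi in (phi_dirac, phi_bern):
    assert abs(phi(0.0) - 1.0) < 1e-12           # phi(0) = 1
    for t in (-2.0, 0.7, 3.1):
        assert abs(phi(t)) <= 1.0 + 1e-12        # |phi(t)| <= phi(0)
        # phi(-t) is the complex conjugate of phi(t)
        assert abs(phi(-t) - phi(t).conjugate()) < 1e-12
```

The two checked properties, |φ(t)| ≤ φ(0) and φ(−t) = conj φ(t), are exactly part (1) of Lemma 5.28 below.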

Exercise 5.24. Let X and Y = aX + b be random variables with


a, b ∈ R and a 6= 0. Show that if φX is the characteristic function of
the distribution of X, then
φY (t) = eitb φX (at), t ∈ R,
is the characteristic function of the distribution of Y .
Exercise 5.25. Let X be a random variable and φX its character-
istic function. Show that the characteristic function of −X is
φ−X (t) = φX (−t).
Exercise 5.26. Let φ be the characteristic function of the distri-
bution α. Prove that φ is real-valued (i.e. φ(t) ∈ R, t ∈ R) iff α is
symmetric around the origin (i.e. α(A) = α(−A), A ∈ B).

6.1. Regularity of the characteristic function. We start by


presenting some facts about positive definite functions.
Exercise 5.27. Show the following statements:
(1) If φ is positive definite, then for any a ∈ R the function ψ(t) = e^{ita} φ(t) is also positive definite.
(2) If φ_1, . . . , φ_n are positive definite functions and a_1, . . . , a_n > 0, then ∑_{i=1}^n a_i φ_i is also positive definite.
¹It is also known as the Fourier transform of α. Notice that if α is absolutely continuous, then φ is the Fourier transform of the density function.

Lemma 5.28. Suppose that φ : R → C is a positive definite function. Then,

(1) 0 ≤ |φ(t)| ≤ φ(0) and φ(−t) = φ̄(t) for every t ∈ R.
(2) for any s, t ∈ R,

    |φ(t) − φ(s)|² ≤ 4φ(0) |φ(0) − φ(t − s)|.

(3) φ is continuous at 0 iff it is uniformly continuous on R.

Proof.
(1) Take n = 2, t_1 = 0 and t_2 = t. Hence,

    φ(0) z_1 z̄_1 + φ(−t) z_1 z̄_2 + φ(t) z_2 z̄_1 + φ(0) z_2 z̄_2 ∈ R₀⁺

for any choice of z_1, z_2 ∈ C. In particular, using z_1 = 1 and z_2 = 0 we obtain φ(0) ∈ R₀⁺. On the other hand, z_1 = z_2 = 1 implies that the imaginary part of φ(−t) + φ(t) is zero. For z_1 = 1 and z_2 = i, we get that the real part of φ(−t) − φ(t) is zero. Finally, z_1 = z̄_2 = (−φ(t))^{1/2} yields that |φ(t)| ≤ φ(0).
(2) Fixing n ∈ N and t_1, . . . , t_n ∈ R we have that the matrix [φ(t_i − t_j)]_{i,j} is positive definite and Hermitian. In particular, by choosing n = 3, t_1 = t, t_2 = s and t_3 = 0, we obtain that

    [ φ(0)       φ(t − s)   φ(t) ]
    [ φ̄(t − s)   φ(0)       φ(s) ]
    [ φ̄(t)       φ̄(s)       φ(0) ]

has a non-negative determinant given by

    φ(0)³ + 2 Re(φ(t − s) φ(s) φ̄(t)) − φ(0)(|φ(t)|² + |φ(s)|² + |φ(t − s)|²) ≥ 0.

Hence, assuming that φ(0) > 0 (otherwise the result is immediate),

    |φ(t) − φ(s)|² = |φ(t)|² + |φ(s)|² − 2 Re(φ(s) φ̄(t))
                   ≤ φ(0)² + 2 Re((φ(t − s) − φ(0)) φ(s) φ̄(t)) / φ(0) − |φ(t − s)|²
                   ≤ (φ(0) + |φ(t − s)| + 2φ(0)) |φ(0) − φ(t − s)|
                   ≤ 4φ(0) |φ(0) − φ(t − s)|.
(3) This follows from the previous estimate.
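Estimate (2) can be sanity-checked numerically; below is a minimal sketch, assuming as a test case φ(t) = e^{−t²/2} (the characteristic function of the standard normal distribution, hence positive definite) and an arbitrary grid of points:

```python
import math

# phi(t) = exp(-t^2 / 2) is the characteristic function of the standard
# normal distribution, hence positive definite with phi(0) = 1.
def phi(t):
    return math.exp(-t * t / 2)

# Check |phi(t) - phi(s)|^2 <= 4 phi(0) |phi(0) - phi(t - s)| on a grid
# (the small tolerance guards against floating-point rounding).
grid = [k / 10 for k in range(-30, 31)]
violations = sum(
    1
    for t in grid
    for s in grid
    if (phi(t) - phi(s)) ** 2 > 4 * phi(0) * abs(phi(0) - phi(t - s)) + 1e-12
)
print(violations)
```

No grid point violates the bound, as the lemma predicts.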


The previous lemma implies that any characteristic function φ is


continuous everywhere and its absolute value is between 0 and 1. In
the following we find a condition for the differentiability of φ.

Proposition 5.29. If there is k ∈ N such that

    ∫ |x|^k dα(x) < +∞,

then φ is C^k and φ^{(k)}(0) = i^k m_k, where m_k = ∫ x^k dα(x) denotes the k-th moment of α.

Proof. Let k = 1. Then,

    φ′(t) = lim_{s→0} (φ(t + s) − φ(t))/s
          = lim_{s→0} ∫ e^{itx} (e^{isx} − 1)/s dα(x)
          = lim_{s→0} ∫ e^{itx} ∑_{n=1}^{+∞} ((ix)^n/n!) s^{n−1} dα(x)
          = ∫ e^{itx} ∑_{n=1}^{+∞} ((ix)^n/n!) lim_{s→0} s^{n−1} dα(x)
          = ∫ e^{itx} ix dα(x).

This integral exists (it is finite) because

    |∫ e^{itx} ix dα(x)| ≤ ∫ |x| dα(x) < +∞

by hypothesis. Therefore, φ′ exists and it is a continuous function of t. In addition, φ′(0) = i ∫ x dα(x). The claim is proved for k = 1.

We can now proceed by induction for the remaining cases k ≥ 2. This is left as an exercise for the reader. 
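For a distribution with finite support the identity φ′(0) = i m₁ can be verified by a finite difference; a minimal sketch, assuming an arbitrary illustrative three-point distribution:

```python
import cmath

# An illustrative discrete distribution (not from the text):
# alpha({0}) = 0.2, alpha({1}) = 0.5, alpha({2}) = 0.3.
support = [0, 1, 2]
probs = [0.2, 0.5, 0.3]

def phi(t):
    # phi(t) = sum_k e^{itk} alpha({k})
    return sum(p * cmath.exp(1j * t * x) for x, p in zip(support, probs))

m1 = sum(p * x for x, p in zip(support, probs))  # first moment

# Central finite difference approximating phi'(0); Proposition 5.29
# predicts the value i * m1.
h = 1e-6
deriv = (phi(h) - phi(-h)) / (2 * h)
error = abs(deriv - 1j * m1)
print(m1, error)
```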

6.2. Examples. In the following there is a list of widely used dis-


tributions. For some of the examples below there is a special notation
to indicate the distribution of a random variable X. For instance, if
it is the Uniform distribution on [a, b], we write X ∼ U([a, b]). Other
cases are included in the next examples.
Exercise 5.30. Find the characteristic functions of the following
discrete distributions α(A) = P (X ∈ A), A ∈ B(R):
(1) Dirac (or degenerate or atomic) distribution at a point a ∈ R:

    α(A) = 1 if a ∈ A, and α(A) = 0 otherwise.

(2) Binomial distribution (X ∼ Bin) with n ∈ N and 0 ≤ p ≤ 1:

    α({k}) = C^n_k p^k (1 − p)^{n−k}, 0 ≤ k ≤ n.

(3) Poisson distribution (X ∼ Poisson(λ)) with λ > 0:

    α({k}) = λ^k e^{−λ}/k!, k ∈ N ∪ {0}.

This describes the distribution of 'rare' events with rate λ.

(4) Geometric distribution (X ∼ Geom(p)) with 0 ≤ p ≤ 1:

    α({k}) = (1 − p)^k p, k ∈ N ∪ {0}.

This describes the distribution of the number of unsuccessful attempts preceding a success with probability p.

(5) Negative binomial distribution (X ∼ NBin):

    α({k}) = C^{n+k−1}_k (1 − p)^k p^n, k ∈ N ∪ {0}.

This describes the distribution of the number of accumulated failures before n successes. Hint: Recall the Taylor series 1/(1 − x) = ∑_{i=0}^{+∞} x^i for |x| < 1. Differentiate this n times and use the result.
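As a quick numerical sanity check that these weights define distributions, the Poisson and geometric masses can be summed; a sketch with arbitrary parameter values and truncation level:

```python
import math

lam, p, K = 2.5, 0.3, 60  # arbitrary rate, success probability, cutoff

# Poisson weights lam^k e^{-lam} / k!, accumulated recursively to avoid
# computing large powers and factorials directly.
term = math.exp(-lam)  # k = 0 term
poisson_total = term
for k in range(1, K):
    term *= lam / k
    poisson_total += term

# Geometric weights (1 - p)^k p.
geometric_total = sum((1 - p) ** k * p for k in range(K))

print(poisson_total, geometric_total)
```

Both truncated sums are equal to 1 up to a negligible tail.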
Exercise 5.31. Find the characteristic functions of the following absolutely continuous distributions α(A) = P(X ∈ A) = ∫_A f(x) dx, A ∈ B(R), where f is the density function:
(1) Uniform distribution on [a, b] (X ∼ U([a, b])):

    f(x) = 1/(b − a) if x ∈ [a, b], and f(x) = 0 otherwise.
(2) Exponential distribution (X ∼ Exp(λ)) with λ > 0:

    f(x) = λ e^{−λx}, x ≥ 0.

(3) The two-sided exponential distribution with λ > 0:

    f(x) = (λ/2) e^{−λ|x|}, x ∈ R.

(4) The Cauchy distribution (X ∼ Cauchy):

    f(x) = 1/(π(1 + x²)), x ∈ R.

Hint: Use the residue theorem of complex analysis.

(5) The normal (Gaussian) distribution (X ∼ N(µ, σ)) with mean µ and variance σ² > 0:

    f(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}, x ∈ R.
Exercise 5.32. Let Xn ∼ U([0, 1]) and Yn = −(1/λ) log(1 − Xn) with λ > 0. Show that Yn ∼ Exp(λ).
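This exercise (an instance of inverse transform sampling) can be explored empirically; a minimal simulation sketch, where the seed, rate and sample size are arbitrary choices:

```python
import bisect
import math
import random

random.seed(0)
lam, n = 2.0, 100_000

# Draw X ~ U([0,1]) and set Y = -(1/lam) log(1 - X), sorted for the ECDF.
ys = sorted(-math.log(1 - random.random()) / lam for _ in range(n))

def ecdf(y):
    # empirical distribution function of the sample
    return bisect.bisect_right(ys, y) / n

# Compare with the Exp(lam) distribution function 1 - e^{-lam * y}.
max_err = max(abs(ecdf(y) - (1 - math.exp(-lam * y)))
              for y in (0.1, 0.5, 1.0, 2.0))
print(max_err)
```

The empirical distribution function stays within sampling error of the Exp(λ) one.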

6.3. Proof of Bochner theorem.


Proposition 5.33. Consider a distribution α and the function

    φ(t) = ∫ e^{itx} dα(x), t ∈ R.

Then, φ is a characteristic function, i.e.

(1) φ(0) = 1,
(2) φ is uniformly continuous,
(3) φ is positive definite.

Proof.
(1) φ(0) = ∫ dα = 1.
(2) For any s, t ∈ R we have

    |φ(t) − φ(s)| = |∫ (e^{itx} − e^{isx}) dα(x)|
                  ≤ ∫ |e^{isx}| |e^{i(t−s)x} − 1| dα(x)
                  = ∫ |e^{i(t−s)x} − 1| dα(x).

Taking s → t we can use the dominated convergence theorem to show that

    lim_{s→t} ∫ |e^{i(t−s)x} − 1| dα(x) = ∫ lim_{s→t} |e^{i(t−s)x} − 1| dα(x) = 0,

being enough to notice that |e^{i(t−s)x} − 1| is bounded. So,

    lim_{s→t} |φ(t) − φ(s)| = 0,

and since this bound depends only on t − s, φ is uniformly continuous.
(3) For all z_1, . . . , z_n ∈ C, t_1, . . . , t_n ∈ R and n ∈ N,

    ∑_{i,j=1}^n φ(t_i − t_j) z_i z̄_j = ∑_{i,j=1}^n z_i z̄_j ∫ e^{i(t_i − t_j)x} dα(x)
                                     = ∫ (∑_{i=1}^n z_i e^{it_i x}) (∑_{j=1}^n z̄_j e^{−it_j x}) dα(x)
                                     = ∫ |∑_{i=1}^n z_i e^{it_i x}|² dα(x) ≥ 0.
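The computation in (3) can be checked numerically for a concrete characteristic function; a sketch assuming φ(t) = e^{ita}, the characteristic function of the Dirac distribution δ_a (the point a, the sample points t_i and the coefficients z_i are arbitrary):

```python
import cmath
import random

# For phi(t) = e^{ita}, the quadratic form in (3) equals
# |sum_i z_i e^{i t_i a}|^2, hence it is real and non-negative.
random.seed(1)
a = 1.3  # arbitrary point of the Dirac distribution

def phi(t):
    return cmath.exp(1j * t * a)

ts = [random.uniform(-5, 5) for _ in range(6)]
zs = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(6)]

q = sum(phi(ti - tj) * zi * zj.conjugate()
        for ti, zi in zip(ts, zs)
        for tj, zj in zip(ts, zs))
print(q.real, abs(q.imag))
```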


Proposition 5.34. If φ : R → C is a characteristic function, then there is a unique distribution α such that

    ∫ e^{itx} dα(x) = φ(t), t ∈ R.

Proof. Visit the library. 


CHAPTER 6

Independence

1. Independent events

Two events A1 and A2 are independent if they do not influence each


other in terms of probability. This notion is fundamental in probability
theory and it is stated in general in the following way. Let (Ω, F, P ) be
a probability space. We say that A1 , A2 ∈ F are independent events iff
P (A1 ∩ A2 ) = P (A1 ) P (A2 ).

A simple intuitive description of what are independent events can


be achieved by assuming that P (A2 ) > 0. In this case we can write
    P(A1 ∩ Ω)/P(Ω) = P(A1 ∩ A2)/P(A2).
This means that A1 and A2 are independent whenever we have the same
relative probability of A1 regardless of the restriction to the event A2 .
Exercise 6.1. Show that:

(1) If A1 and A2 are independent, then Ac1 and A2 are also inde-
pendent.
(2) Any full probability event is independent of any other event.
The same for any zero probability event.
(3) Two disjoint events are independent iff at least one of them
has zero probability.
(4) Consider two events A1 ⊂ A2 . They are independent iff A1
has zero probability or A2 has full probability.
Example 6.2. Consider the Lebesgue measure m on Ω = [0, 1] and the event I1 = [0, 1/2]. Any other interval I2 = [a, b] with 0 ≤ a < b ≤ 1 that is independent of I1 has to satisfy the relation P(I2 ∩ [0, 1/2]) = (b − a)/2. Notice that a ≤ 1/2 (otherwise I1 ∩ I2 = ∅) and b ≥ 1/2 (otherwise I2 ⊂ I1). So, b = 1 − a. That is, any interval [a, 1 − a] with 0 ≤ a ≤ 1/2 is independent of [0, 1/2].
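The conclusion of Example 6.2 can be verified with exact arithmetic; a short sketch using Python fractions, where the tested values of a are arbitrary:

```python
from fractions import Fraction as F

# Lebesgue measure of an interval [a, b] contained in [0, 1] is b - a.
def length(a, b):
    return max(F(0), b - a)

I1 = (F(0), F(1, 2))
for a in (F(0), F(1, 8), F(1, 4), F(2, 5)):
    I2 = (a, 1 - a)
    inter = (max(I1[0], I2[0]), min(I1[1], I2[1]))
    lhs = length(*inter)             # P(I1 ∩ I2)
    rhs = length(*I1) * length(*I2)  # P(I1) P(I2)
    assert lhs == rhs
print("P(I1 ∩ I2) = P(I1) P(I2) for all tested a")
```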
Exercise 6.3. Suppose that A and C are independent events as
well as B and C with A ∩ B = ∅. Show that A ∪ B and C are also
independent.

Exercise 6.4. Give examples of probability measures P1 and P2 ,


and of events A1 and A2 such that P1(A1 ∩ A2) = P1(A1) P1(A2) but P2(A1 ∩ A2) ≠ P2(A1) P2(A2). Recall that the definition of independence depends on the probability measure.

2. Independent random variables

Two random variables X, Y are independent random variables iff


P (X ∈ B1 , Y ∈ B2 ) = P (X ∈ B1 ) P (Y ∈ B2 ), B1 , B2 ∈ B.
Remark 6.5. The independence between X and Y is equivalent to
any of the following propositions. For any B1 , B2 ∈ B,
(1) X −1 (B1 ) and Y −1 (B2 ) are independent events.
(2) P ((X, Y ) ∈ B1 × B2 ) = P (X ∈ B1 ) P (Y ∈ B2 ).
(3) αZ (B1 × B2 ) = αX (B1 ) αY (B2 ), where αZ = P ◦ Z −1 is the
joint distribution of Z = (X, Y ), αX = P ◦ X −1 and αY =
P ◦ Y −1 are the distributions of X and Y , respectively. We
can therefore show that the joint distribution is the product
measure
αZ = αX × αY .
Exercise 6.6. Show that any random variable is independent of a
constant random variable.
Example 6.7. Consider simple functions

    X = ∑_{i=1}^N c_i X_{A_i},    Y = ∑_{j=1}^{N′} c′_j X_{A′_j}.

Then, for any B1, B2 ∈ B,

    X^{−1}(B1) = ⋃_{i : c_i ∈ B1} A_i,    Y^{−1}(B2) = ⋃_{j : c′_j ∈ B2} A′_j.

These are independent events iff A_i and A′_j are independent for every i, j.
Proposition 6.8. Let X and Y be independent random variables.
Then, there are sequences ϕn and ϕ0n of simple functions such that
(1) ϕn % X and ϕ0n % Y ,
(2) ϕn and ϕ0n are independent for every n ∈ N.

Proof. We follow the idea in the proof of Proposition 3.20. The construction there guarantees that we get ϕn % X and ϕ′n % Y by considering the simple functions

    ϕn = ∑_{j=0}^{n2^{n+1}−1} (−n + j/2^n) X_{A_{n,j}} + n X_{X^{−1}([n,+∞[)} − n X_{X^{−1}(]−∞,−n[)}

where

    A_{n,j} = X^{−1}([−n + j/2^n, −n + (j + 1)/2^n[),

and

    ϕ′n = ∑_{j=0}^{n2^{n+1}−1} (−n + j/2^n) X_{A′_{n,j}} + n X_{Y^{−1}([n,+∞[)} − n X_{Y^{−1}(]−∞,−n[)}

where

    A′_{n,j} = Y^{−1}([−n + j/2^n, −n + (j + 1)/2^n[).

It remains to check that ϕn and ϕ′n are independent for any given n. This follows from the fact that X and Y are independent, since the pre-images of Borel sets under X and Y are independent. 
Proposition 6.9. If X and Y are independent, then
E(XY ) = E(X) E(Y )
and
Var(X + Y ) = Var(X) + Var(Y ).

Proof. We start by considering two independent simple functions ϕ = ∑_j c_j X_{A_j} and ϕ′ = ∑_{j′} c′_{j′} X_{A′_{j′}}. The independence implies that

    P(A_j ∩ A′_{j′}) = P(A_j) P(A′_{j′}).

So,

    E(ϕϕ′) = ∑_{j,j′} c_j c′_{j′} P(A_j ∩ A′_{j′}) = E(ϕ) E(ϕ′).

The claim follows from the application of the monotone convergence


theorem to sequences of simple functions ϕn % X and ϕ0n % Y which
are independent.
Finally, it is simple to check that Var(X + Y ) = Var(X) + Var(Y ) +
2E(XY ) − 2E(X)E(Y ). So, by the previous relation we complete the
proof. 
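For discrete random variables this identity can be checked exactly on a product space; a sketch with fractions, where the two distributions are arbitrary illustrative choices:

```python
from fractions import Fraction as F
from itertools import product

# X takes values 1, 2 with probabilities 1/3, 2/3; Y takes values 0, 5
# with probabilities 1/2, 1/2. Independence is encoded by the product
# measure P({(x, y)}) = pX(x) pY(y).
pX = {1: F(1, 3), 2: F(2, 3)}
pY = {0: F(1, 2), 5: F(1, 2)}

EX = sum(x * p for x, p in pX.items())
EY = sum(y * p for y, p in pY.items())
EXY = sum(x * y * px * py
          for (x, px), (y, py) in product(pX.items(), pY.items()))

print(EX, EY, EXY)
```

The double sum factors exactly as in the proof above, so E(XY) = E(X) E(Y).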
Proposition 6.10. If X and Y are independent random variables
and f and g are B-measurable functions on R, then
(1) f (X) and g(Y ) are independent.
(2) E(f (X)g(Y )) = E(f (X)) E(g(Y )) if E(|f (X)|), E(|g(Y )|) <
+∞.
Exercise 6.11. Prove it.
Example 6.12. Let f (x) = x2 and g(y) = ey . If X and Y are
independent random variables, then X 2 and eY are also independent.

The random variables in a sequence X1 , X2 , . . . are independent iff


for any n ∈ N and B1 , . . . , Bn ∈ B we have
P (X1 ∈ B1 , . . . , Xn ∈ Bn ) = P (X1 ∈ B1 ) · · · P (Xn ∈ Bn ).
That is, the joint distribution of (X1 , . . . , Xn ) is equal to the product
of the individual distributions for any n ∈ N.
Exercise 6.13. Suppose that the random variables X and Y have
only values in {0, 1}. Show that if E(XY ) = E(X)E(Y ), then X, Y
are independent.

Recall the definition of variance of a random variable X,


Var(X) = E(X 2 ) − E(X)2 ,
and of covariance between X and Y ,
Cov(X, Y ) = E(XY ) − E(X)E(Y ).
Notice that if X, Y are independent, then they are uncorrelated since
Cov(X, Y ) = 0.
Exercise 6.14. Construct an example of two uncorrelated random
variables that are not independent.
Exercise 6.15. Show that if Var(X) ≠ Var(Y), then X + Y and X − Y are not independent.

3. Independent σ-algebras

Two σ-algebras F1 , F2 are independent iff every A1 ∈ F1 and A2 ∈


F2 are independent.
Exercise 6.16. Show that:
(1) If G ⊂ F1 and F1 , F2 are independent σ-algebras, then G and
F2 are also independent.
(2) Two random variables X, Y are independent iff σ(X) and σ(Y )
are independent.
CHAPTER 7

Conditional expectation

In this chapter we introduce the concept of conditional expectation.


It will be used in the construction of stochastic processes which are not
sequences of i.i.d. random variables.

1. Conditional expectation

Let (Ω, F, P ) be a probability space and X : Ω → R a random


variable (i.e. a F-measurable function). Define the signed measure λ : F → R given by

    λ(B) = ∫_B X dP, B ∈ F.

Recall that the Radon-Nikodym derivative is an F-measurable function and it is of course given by

    dλ/dP = X.
Consider now a σ-subalgebra G ⊂ F and the restriction of λ to G. That is, λ_G : G → R such that

    λ_G(A) = ∫_A X dP, A ∈ G.

If the random variable X is not G-measurable, it is not the Radon-Nikodym derivative of λ_G. We define the conditional expectation of X given G as

    E(X|G) = dλ_G/dP a.s.

which is a G-measurable function. Therefore,

    λ_G(A) = ∫_A E(X|G) dP = ∫_A X dP, A ∈ G. (7.1)
Remark 7.1. The conditional expectation E(X|G) is a random
variable on the probability space (Ω, G, P ).
Proposition 7.2. Let X be a random variable and G ⊂ F a σ-algebra.

(1) If X is G-measurable, then E(X|G) = X a.s.¹

¹In particular E(X|F) = X.

(2) E(E(X|G)) = E(X).


(3) If X ≥ 0, then E(X|G) ≥ 0 a.s.
(4) E(|E(X|G)|) ≤ E(|X|).
(5) E(1|G) = 1 a.s.
(6) For every c1 , c2 ∈ R and random variables X1 , X2 ,
E(c1 X1 + c2 X2 |G) = c1 E(X1 |G) + c2 E(X2 |G).
(7) If h : Ω → R is G-measurable and bounded, then
E(hX|G) = hE(X|G) a.s.
(8) If G1 ⊂ G2 ⊂ F are σ-algebras, then
E(E(X|G2 )|G1 ) = E(X|G1 ).
(9) If φ : R → R is convex, then
E(φ ◦ X|G) ≥ φ ◦ E(X|G) a.s.

Proof.
(1) If X is G-measurable, then it is the Radon-Nikodym derivative
of λG with respect to P .
(2) This follows from (7.1) with A = Ω.
(3) Consider the set A = {E(X|G) < 0}, which is in G since E(X|G) is G-measurable. If P(A) > 0, then by (7.1),

    0 ≤ ∫_A X dP = ∫_A E(X|G) dP < 0,

which is false. So, P(A) = 0.
(4) Consider the set A = {E(X|G) ≥ 0} ∈ G. Hence,

    E(|E(X|G)|) = ∫_A E(X|G) dP − ∫_{A^c} E(X|G) dP
                = ∫_A X dP − ∫_{A^c} X dP
                ≤ ∫_A |X| dP + ∫_{A^c} |X| dP = E(|X|).

(5) Since X = 1 is a constant it is G-measurable. Therefore,


E(X|G) = X a.s.
(6) Using the linearity of the integral and (7.1), for every A ∈ G,

    ∫_A E(c1 X1 + c2 X2 |G) dP = c1 ∫_A X1 dP + c2 ∫_A X2 dP
                               = ∫_A (c1 E(X1 |G) + c2 E(X2 |G)) dP.
Since both integrand functions are G-measurable, they agree
a.s.

(7) Assume that h ≥ 0 (the general case follows from the decomposition h = h⁺ − h⁻ with h⁺, h⁻ ≥ 0). Take a sequence of G-measurable non-negative simple functions ϕn % h of the form

    ϕn = ∑_j c_j X_{A_j},

where each A_j ∈ G. We will show first that the claim holds for simple functions and later use the monotone convergence theorem to deduce it for h. For any A ∈ G we have that A ∩ A_j ∈ G. Hence

    ∫_A E(ϕn X|G) dP = ∫_A ϕn X dP
                     = ∑_j c_j ∫_{A ∩ A_j} X dP
                     = ∑_j c_j ∫_{A ∩ A_j} E(X|G) dP
                     = ∫_A ϕn E(X|G) dP.

By the monotone convergence theorem applied twice,

    ∫_A E(hX|G) dP = lim ∫_A ϕn X dP
                   = lim ∫_A E(ϕn X|G) dP
                   = lim ∫_A ϕn E(X|G) dP
                   = ∫_A h E(X|G) dP.
(8) Let A ∈ G1. Then,

    ∫_A E(E(X|G2)|G1) dP = ∫_A E(X|G2) dP = ∫_A X dP = ∫_A E(X|G1) dP

since A is also in G2.
(9) Do it as an exercise.

Remark 7.3. Whenever the σ-algebra is generated by the random
variables Y1 , . . . , Yn , we use the notation
E(X|Y1 , . . . , Yn ) = E(X|σ(Y1 , . . . , Yn ))
which reads as the conditional expectation of X given Y1 , . . . , Yn .

Proposition 7.4. Let X, Y1 , . . . Yn be independent random vari-


ables. Then,
E(X|Y1 , . . . , Yn ) = E(X) a.s.

Proof. We first consider the case X = X_B for B ∈ F such that B is independent of σ(Y1, . . . , Yn). Then, for any A ∈ σ(Y1, . . . , Yn) we have that

    ∫_A E(X_B |Y1, . . . , Yn) dP = ∫_A X_B dP = P(A ∩ B) = P(A) P(B)

since A and B are independent. Therefore, as P(A) = ∫_A dP we have that

    ∫_A E(X_B |Y1, . . . , Yn) dP = ∫_A P(B) dP.

Since P(B) is a constant, hence σ(Y1, . . . , Yn)-measurable, the equality above implies that E(X_B |Y1, . . . , Yn) = P(B) = E(X_B).

Choose now a sequence of simple functions ϕn % X of the form ϕn = ∑_j c_j X_{A_j} such that for every n ∈ N we have that ϕn, Y1, . . . , Yn are independent. So, for any A ∈ σ(Y1, . . . , Yn), using the monotone convergence theorem,

    ∫_A E(X|Y1, . . . , Yn) dP = ∫_A X dP
                              = lim ∫_A ϕn dP
                              = lim ∑_j c_j P(A_j ∩ A)
                              = lim ∑_j c_j P(A_j) P(A)
                              = lim ∫_A E(ϕn) dP
                              = ∫_A E(X) dP.

Example 7.5. Fixing some event B ∈ F notice that the σ-algebra generated by the random variable X_B is σ(X_B) = {∅, Ω, B, B^c}. As E(X|X_B) is σ(X_B)-measurable it is constant in B and in B^c:

    E(X|X_B)(x) = a1 if x ∈ B, and a2 if x ∈ B^c.

By (7.1) we obtain the conditions

    a1 P(B) + a2 P(B^c) = E(X),
    a1 P(B) = ∫_B X dP,
    a2 P(B^c) = ∫_{B^c} X dP.

So, if 0 < P(B) < 1 we have

    E(X|X_B)(x) = (∫_B X dP)/P(B) if x ∈ B, and (∫_{B^c} X dP)/P(B^c) if x ∈ B^c,

that is,

    E(X|X_B) = ((∫_B X dP)/P(B)) X_B + ((∫_{B^c} X dP)/P(B^c)) X_{B^c}.

Finally, if P(B) = 0 or P(B) = 1 we have

    E(X|X_B)(x) = E(X) a.e.
Remark 7.6. In the case that P(B) > 0 we define the conditional expectation of X given the event B as the restriction of E(X|X_B) to B and use the notation

    E(X|B) = E(X|X_B)|_B = (∫_B X dP)/P(B).

In particular, for the event B = {Y = y} for some random variable Y and y ∈ R it is written as E(X|Y = y).
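On a finite sample space the formula of Remark 7.6 reduces E(X|X_B) to block averages; a small exact sketch, where the space, weights and X are arbitrary illustrative data:

```python
from fractions import Fraction as F

omega = [1, 2, 3, 4, 5, 6]
P = {w: F(1, 6) for w in omega}            # uniform probability
X = {1: 2, 2: 2, 3: 8, 4: 8, 5: 8, 6: 8}
B = {1, 2, 3}

def cond_exp(w):
    # E(X|X_B)(w) is the average of X over the block containing w.
    block = B if w in B else set(omega) - B
    return sum(X[v] * P[v] for v in block) / sum(P[v] for v in block)

EX = sum(X[w] * P[w] for w in omega)
E_tower = sum(cond_exp(w) * P[w] for w in omega)
print(cond_exp(1), cond_exp(4), EX, E_tower)
```

Note that E_tower equals EX, illustrating property (2) of Proposition 7.2.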
Exercise 7.7. Let X be a random variable.

(1) Show that if 0 < P(B) < 1 and α, β ∈ R, then E(X|αX_B + βX_{B^c}) = E(X|X_B).
(2) Let Y = α1 X_{B1} + α2 X_{B2} where B1 ∩ B2 = ∅ and α1 ≠ α2. Find E(X|Y).
Exercise 7.8. Let Ω = {1, 2, 3, 4, 5, 6}, F = P(Ω),

    P({x}) = 1/16 for x = 1, 2;  1/4 for x = 3, 4;  3/16 for x = 5, 6,

    X(x) = 2 for x = 1, 2;  8 for x = 3, 4, 5, 6,

and Y = 4 X_{{1,2,3}} + 6 X_{{4,5,6}}. Find E(X|Y).

Exercise 7.9. Let Ω = [0, 1[, F = B([0, 1[) and P = m where m is the Lebesgue measure on [0, 1[. Consider the random variables X(ω) = ω and

    Y(ω) = 2ω if 0 ≤ ω < 1/2, and 2ω − 1 if 1/2 ≤ ω < 1.

(1) Find σ(Y).
(2) By the knowledge that E(X|Y) is σ(Y)-measurable, show that E(X|Y)(ω) = E(X|Y)(ω + 1/2), 0 ≤ ω < 1/2.
(3) Reduce the problem of determining E(X|Y) on [0, 1[ to finding the solution of

    ∫_A E(X|Y) dm = (1/2) ∫_{A ∪ (A+1/2)} X dm, A ∈ B([0, 1/2[),

and compute E(X|Y).

2. Conditional probability

Let (Ω, F, P ) be a probability space and G ⊂ F a σ-subalgebra.


The conditional probability of an event B ∈ F given G is defined as
the G-measurable function
P (B|G) = E(XB |G).
Remark 7.10. From the definition of conditional expectation, we obtain for any A ∈ G that

    ∫_A P(B|G) dP = ∫_A E(X_B |G) dP = ∫_A X_B dP = P(A ∩ B).
Theorem 7.11.
(1) If P (B) = 0, then P (B|G) = 0 a.e.
(2) If P (B) = 1, then P (B|G) = 1 a.e.
(3) 0 ≤ P (B|G) ≤ 1 a.e. for any B ∈ F.
(4) If B1, B2, · · · ∈ F are pairwise disjoint, then

    P(⋃_{n=1}^{+∞} Bn | G) = ∑_{n=1}^{+∞} P(Bn |G) a.e.

(5) If B ∈ G, then P (B|G) = XB a.e.


Exercise 7.12. Prove it.
Remark 7.13.
(1) Similarly to the case of the conditional expectation, we de-
fine the conditional probability of B given random variables
Y1 , . . . , Yn by
P (B|Y1 , . . . , Yn ) = P (B|σ(Y1 , . . . , Yn )).

Moreover, given events A, B ∈ F with P(A) > 0, we define the conditional probability of B given A as

    P(B|A) = E(X_B |A) = P(A ∩ B)/P(A),

which is a constant.
(2) Generally, without making any assumption on P (A), the fol-
lowing formula is always true:
P (B|A)P (A) = P (A ∩ B).
(3) Another widely used notation concerns events determined by
random variables X and Y . When B = {X = x} and A =
{Y = y} for some x, y ∈ R, we write
P (X = x|Y = y)P (Y = y) = P (X = x, Y = y).
Exercise 7.14. Show that for A, B ∈ F:

(1) Assuming that P(A) > 0, A and B are independent events iff P(B|A) = P(B).
(2) P(A|B) P(B) = P(B|A) P(A).
(3) For any sequence C1, C2, · · · ∈ F such that P(⋃_n Cn) = 1, we have

    P(A|B) = ∑_n P(A ∩ Cn |B)    (7.2)

and

    P(A|B) = ∑_n P(A|Cn ∩ B) P(Cn |B).    (7.3)
Part 3

Stochastic processes
CHAPTER 8

General stochastic processes

First, a simple clarification. Stochastic just means random. Sto-


chastic processes are families of random variables with the purpose of
describing the time evolution of the state of a system. What rules the
system along time is not known, we are only aware of the probability
distribution of its states or of the transitions between states.
In order to formalize the above notions, consider a probability space
(Ω, F, P ) and the Borel measurable space (Rd , B). So, taking time to be
a parameter t belonging to some parameter space T , we define the state
of the system at time t to be a (multidimensional) random variable¹ Xt : Ω → Rᵈ.
The set of all possible states of the system is called the state space:

    S = ⋃_{t∈T} Xt(Ω).

A stochastic process is the family


{Xt : t ∈ T }.
It is usually denoted by its general term Xt .
A stochastic process Xt can be interpreted as the random path
of a point particle in S. More specifically, for each ω ∈ Ω the map
t 7→ Xt (ω) generates an orbit in S starting at X0 (ω). This is called a
realization of Xt , since it is determined by a specific outcome ω ∈ Ω.
For many real-world systems there is no way to determine the tra-
jectory of a state. Given an initial condition, where is the system after
some time? Due to many complex interactions, external influences and
poor observation tools, it has been far more productive to study such
systems by observing the probabilities of the relations between Xt at
different times. Our goal will be to find some answers even though there is a lack of information.
If the parameter space is countable we can assume that T ⊂ N and
we say that it is a discrete-time stochastic process, corresponding to
a sequence of random variables Xn . Otherwise it is a continuous-time
¹It can be seen as a vector of random variables.

stochastic process, where the parameter space T is usually an interval


of R.
Example 8.1. Take Xn to be the value of the maximum tem-
perature reached in Lisbon at day n. The parameter space is N0 =
{0, 1, 2, . . . }, counting the days, and the state space is the interval
S = [−273.15, +∞[
measured in degrees Celsius. Notice that it is not possible to have
temperatures below the absolute zero at −273.15 °C. The next day temperature does not usually change dramatically. This means that Xn+1 should not be independent of Xn. Moreover, the distribution of the temperatures is affected by the season, as winter is cooler than summer. Hence, the distribution of Xn depends on n.
Example 8.2. Let Xt be the score of a football match. The possible
scores are two-dimensional vectors in the state space S = N0 × N0 .
The duration of a match is 90 minutes, so t ∈ [0, 90] and Xt is a continuous-time stochastic process. A realization of Xt is a discontinuous
path by assuming that a goal is instantaneous. At those random times
Xt jumps by a vector either (0, 1) or (1, 0).

A special class of stochastic processes are the ones corresponding to


independent and identically distributed (iid) random variables. In this
case each Xt is independent of any other and they all share the same
distribution. Any realization of such process will correspond to the
repetition of the same experiment under exactly the same conditions.
A basic example is the tossing of a fair coin.
CHAPTER 9

Sums of iid processes: Limit theorems

In this chapter we consider a discrete iid stochastic process Xn, but we will be interested in the stochastic process that corresponds to the sum of the first n terms:

    Sn = ∑_{i=1}^n Xi.

Exercise 9.1. Show that the Sn ’s are no longer independent ran-


dom variables.

Furthermore, we will show in the following that Sn are not identi-


cally distributed. More spectacularly, by rescaling Sn appropriately, it
is possible to find a limit distribution.

1. Sums of independent random variables

Let (Ω, F, P ) be a probability space and X1 , X2 random variables


with distributions α1 , α2 , respectively. Consider the measurable func-
tion f : R2 → R given by f (x1 , x2 ) = x1 + x2 , which is measurable with
respect to the product Borel σ-algebra G.
The convolution of α1 and α2 is defined to be the induced product
measure on B
α1 ∗ α2 = (α1 × α2 ) ◦ f −1 .
In addition, (α1 ∗ α2)(R) = 1. So, the convolution is a distribution, which turns out to be that of the random variable X1 + X2.
Proposition 9.2.
(1) For every A ∈ B,

    (α1 ∗ α2)(A) = ∫ α1(A − x2) dα2(x2),

where A − x2 = {y − x2 ∈ R : y ∈ A}.
(2) The characteristic function of α1 ∗ α2 is

    φ_{α1∗α2} = φ_{α1} φ_{α2},

where φ_{αi} is the characteristic function of αi.
(3) α1 ∗ α2 = α2 ∗ α1.

Proof. Writing f_{x2}(x1) = f(x1, x2) and using Proposition 4.17 and the Fubini theorem we get:

(1)
    (α1 ∗ α2)(A) = ∫ X_A(y) d(α1 ∗ α2)(y)
                 = ∫ X_A ∘ f(x1, x2) d(α1 × α2)(x1, x2)
                 = ∫∫ X_{f_{x2}^{−1}(A)}(x1) dα1(x1) dα2(x2)
                 = ∫ α1(f_{x2}^{−1}(A)) dα2(x2)
                 = ∫ α1(A − x2) dα2(x2).

(2)
    φ_{α1∗α2}(t) = ∫ e^{ity} d(α1 ∗ α2)(y)
                 = ∫ e^{itf(x1,x2)} d(α1 × α2)(x1, x2)
                 = ∫ e^{itx1} dα1 ∫ e^{itx2} dα2.

(3) By the previous result, it follows from the fact that the characteristic functions are equal.


Proposition 9.3. Let X1, . . . , Xn be independent random variables with distributions α1, . . . , αn, respectively. Then,

(1) µn = α1 ∗ · · · ∗ αn is the distribution of Sn.
(2) φ_{µn} = φ_{α1} · · · φ_{αn} is the characteristic function of µn.
(3) E(Sn) = ∑_{i=1}^n E(Xi).
(4) Var(Sn) = ∑_{i=1}^n Var(Xi).
Exercise 9.4. Prove it.
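These statements can be illustrated on two discrete distributions; a sketch that convolves two arbitrary illustrative distributions exactly and checks φ_{α1∗α2} = φ_{α1} φ_{α2} at one test point:

```python
import cmath
from fractions import Fraction as F
from itertools import product

a1 = {0: F(1, 2), 1: F(1, 2)}  # fair Bernoulli on {0, 1}
a2 = {0: F(1, 3), 2: F(2, 3)}

# Convolution: the distribution of X1 + X2 under the product measure.
conv = {}
for (x, p), (y, q) in product(a1.items(), a2.items()):
    conv[x + y] = conv.get(x + y, F(0)) + p * q

def phi(alpha, t):
    # characteristic function of a discrete distribution
    return sum(float(p) * cmath.exp(1j * t * x) for x, p in alpha.items())

t = 0.7  # arbitrary test point
err = abs(phi(conv, t) - phi(a1, t) * phi(a2, t))
print(dict(sorted(conv.items())), err)
```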
Example 9.5. Let λ ∈ ]0, 1[. Consider a sequence of independent random variables Xn : Ω → {−λ^n, λ^n} with Bernoulli distributions

    αn = (δ_{−λ^n} + δ_{λ^n})/2.

The characteristic function of αn is

    φ_{Xn}(t) = cos(tλ^n).

The distribution of Sn is µn = α1 ∗ · · · ∗ αn and it is called a Bernoulli convolution. The characteristic function is

    φ_{Sn}(t) = ∏_{i=1}^n cos(tλ^i).

2. Law of large numbers

Let (Ω, F, P) be a probability space and X1, X2, . . . a sequence of random variables. We say that the sequence is i.i.d. if the random variables are independent and identically distributed. That is, all of the random variables are independent and share the same distribution α = P ◦ Xn^{−1}, n ∈ N.

Theorem 9.6 (Weak law of large numbers). Let X1, X2, . . . be an i.i.d. sequence of random variables. If E(|X1|) < +∞, then

    Sn/n →^P E(X1).
Proof. Let φ be the characteristic function of the distribution of X1 (it is the same for every Xn, n ∈ N). Since E(|X1|) < +∞, we have φ′(0) = iE(X1). So, the first order Taylor expansion of φ around 0 is

    φ(t) = 1 + iE(X1) t + o(t),

where |t| < r for some sufficiently small r > 0. For any fixed t ∈ R and n sufficiently large such that |t|/n < r we have

    φ(t/n) = 1 + iE(X1) t/n + o(t/n).

Thus, for those values of t and n, the random variable Mn = Sn/n has characteristic function

    φn(t) = φ(t/n)^n = (1 + iE(X1) t/n + o(t/n))^n.

Finally, using the fact that (1 + a/n + o(1/n))^n → e^a whenever n → +∞, we get

    lim_{n→+∞} φn(t) = e^{iE(X1) t}.

This is the characteristic function of the Dirac distribution at E(X1), corresponding to the constant random variable E(X1). Therefore, Mn converges in distribution to E(X1) and also in probability by Exercise 5.20. 
Remark 9.7. Notice that

    Sn/n = (1/n) ∑_{i=1}^n Xi

is the average of the random variables X1 , . . . , Xn . So, the weak law of


large numbers states that the average converges in probability to the
expected value.
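A quick simulation of this convergence, using uniform-[0, 1] variables (so E(X1) = 1/2); the seed and sample sizes are arbitrary choices:

```python
import random

random.seed(42)

def sample_mean(n):
    # S_n / n for n i.i.d. uniform-[0,1] draws
    return sum(random.random() for _ in range(n)) / n

errors = [abs(sample_mean(n) - 0.5) for n in (10, 1_000, 100_000)]
print(errors)
```

The deviation from the expected value shrinks as n grows, as the weak law predicts.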
Exercise 9.8. Show that the weak law of large numbers does not
hold for the Cauchy distribution.

The law of large numbers can be improved by getting a stronger


convergence. For the theorem below we will use the following estimate.
Exercise 9.9. Show that for any t, p > 0 we have

    P(|X| ≥ t) ≤ E(|X|^p)/t^p.    (9.1)
Theorem 9.10 (Strong law of large numbers). Let X1, X2, . . . be an i.i.d. sequence of random variables. If E(|X1|⁴) < +∞, then

    Sn/n → E(X1) a.s.

Proof. Suppose E(X1) = 0 for simplicity (the general result is left as an exercise). So, we can show by induction that E(Sn²) = nE(X1²) and

    E(Sn⁴) = nE(X1⁴) + 3n(n − 1) E(X1²)².

Therefore, E(Sn⁴) ≤ nE(|X1|⁴) + 3n²σ⁴, where σ² = E(X1²).

Now, for any δ > 0, using (9.1),

    P(|Sn/n| ≥ δ) = P(|Sn| ≥ nδ) ≤ (nE(|X1|⁴) + 3n²σ⁴)/(n⁴δ⁴).

This implies that

    ∑_{n≥1} P(|Sn/n| ≥ δ) < +∞

and the claim follows from the first Borel-Cantelli lemma (Exercise 2.37), i.e. Sn/n → 0 a.s. 

3. Central limit theorem

Let (Ω, F, P ) be a probability space.


Theorem 9.11 (Central limit theorem). Let X1, X2, . . . be an i.i.d. sequence of random variables. If E(X1) = 0 and σ² = Var(X1) < +∞, then for every x ∈ R,

    P(Sn/√n ≤ x) → (1/(√(2π) σ)) ∫_{−∞}^{x} e^{−t²/(2σ²)} dt.

Remark 9.12. The central limit theorem states that Sn/√n converges in distribution to a random variable with the normal distribution.

Proof. It is enough to show that the characteristic function φn of Sn/√n converges to the characteristic function of the normal distribution.

Let φ be the characteristic function of Xn for any n. Its Taylor expansion of second order at 0 is

    φ(t) = 1 − σ²t²/2 + o(t²),

with |t| < r for some r > 0. So, for a fixed t ∈ R and n satisfying |t|/√n < r (i.e. n > t²/r²),

    φ(t/√n) = 1 − σ²t²/(2n) + o(t²/n).

Then,

    φn(t) = φ(t/√n)^n = (1 − σ²t²/(2n) + o(t²/n))^n.

Taking the limit as n → +∞ we obtain

    φn(t) → e^{−σ²t²/2}.
Exercise 9.13. Write the statement of the central limit theorem
for sequences of i.i.d. random variables Xn with mean µ. Hint: Apply
the theorem to Xn − µ which has zero mean.
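A numerical illustration of the theorem, taking X_i uniform on [−1, 1] (so µ = 0 and σ² = 1/3); the sample sizes and seed are arbitrary choices:

```python
import bisect
import math
import random

random.seed(7)
n, trials = 200, 10_000
sigma = math.sqrt(1 / 3)  # variance of the uniform on [-1, 1] is 1/3

def limit_cdf(x):
    # N(0, sigma^2) distribution function
    return 0.5 * (1 + math.erf(x / (sigma * math.sqrt(2))))

# Empirical distribution of S_n / sqrt(n) over many trials.
samples = sorted(
    sum(random.uniform(-1, 1) for _ in range(n)) / math.sqrt(n)
    for _ in range(trials)
)

max_err = max(
    abs(bisect.bisect_right(samples, x) / trials - limit_cdf(x))
    for x in (-1.0, -0.5, 0.0, 0.5, 1.0)
)
print(max_err)
```

The empirical distribution function of the rescaled sums tracks the normal one within sampling error.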
CHAPTER 10

Markov chains

1. The Markov property

Let (Ω, F, P ) be a probability space and S ⊂ R a countable set


called the state space. For convenience we often choose S to be

S = {1, 2, . . . , N }

where N ∈ N ∪ {+∞} but it can also be S = Z. We are considering


both cases of S finite or infinite.
A discrete-time stochastic process X0 , X1 , . . . is a Markov chain on
S iff for all n ≥ 0,

(1) P (Xn ∈ S) = 1,
(2) it satisfies the Markov property: for every i0 , . . . , in ∈ S,

P (Xn+1 = in+1 |Xn = in , . . . , X0 = i0 ) = P (Xn+1 = in+1 |Xn = in ).

This means that the next future state (at time n + 1) only depends on
the present one (at time n). The system does not have “memory” of
the past.
We will see that the distributions of each X1 , X2 , . . . are determined
by the knowledge of the above conditional probabilities (that control
the evolution of the system) and the initial distribution of X0 .
Denote by

    π^n_{i,j} = P(Xn = j |Xn−1 = i)

the transition probability of moving from state i to state j at time n ≥ 1. This defines the transition probability matrix at time n given by

    Tn = [π^n_{i,j}]_{i,j∈S}.

Notice that Tn can be an infinite matrix if S is infinite.

Proposition 10.1. The sum of the coefficients in each row of Tn


equals 1.

Proof. The sum of the coefficients in the i-th row of Tn is

    ∑_{j∈S} π^n_{i,j} = ∑_{j∈S} P(Xn = j |Xn−1 = i)
                      = P(⋃_{j∈S} {Xn = j} |Xn−1 = i)
                      = P(Xn ∈ S |Xn−1 = i)
                      = 1

because P(Xn ∈ S) = 1. 

A matrix M with dimension r × r (r ∈ N or r = +∞) is called a


stochastic matrix iff all its coefficients are non-negative and
M (1, 1, . . . ) = (1, 1, . . . ),
i.e. the sum of each row coefficients equals 1. Thus, the product of two
stochastic matrices M1 and M2 is also a stochastic matrix since the co-
efficients are again non-negative and M1 M2 (1, . . . , 1) = M1 (1, . . . , 1) =
(1, . . . , 1).
The matrices Tn are stochastic by Proposition 10.1. In fact, any
sequence of stochastic matrices determines a Markov chain.
Proposition 10.2. Given a sequence Tn = [π^n_{i,j}]_{i,j∈S} of stochastic matrices, any stochastic process Xn satisfying for every n ≥ 1

    P(Xn = j |Xn−1 = i) = π^n_{i,j}

is a Markov chain.
Exercise 10.3. Prove it.
Example 10.4. Consider the state space S = {1, 2, 3} and the matrix

    Tn = [ 1/(2n)   (2n − 1)/(2n)   0
           1/(2n)   (n − 1)/(2n)    1/2
           0        0               1  ].

This is a stochastic matrix that corresponds to a Markov process with the transition probability of moving from state i to state j at time n given by the entry i, j of Tn.

2. Distributions

Let Xn be a Markov chain. The distribution of each Xn , n ≥ 0, given by P ∘ Xn^{−1}, can be represented by a vector with dimension equal to #S:

αn = (αn,1 , αn,2 , . . . )   where   αn,j = P(Xn = j),  j ∈ S.

We say that αn is the distribution of Xn . Notice that


αn · (1, 1, . . . ) = 1.
We can now determine the distribution of the chain at each time n
from the initial distribution and the transition probabilities.
Proposition 10.5. If α0 is the distribution of X0 , then
αn = α0 T1 . . . Tn
is the distribution of Xn , n ≥ 1.

Proof. If n = 1 the formula states that

α1,j = Σ_{k∈S} α0,k π^1_{k,j}
     = Σ_k P(X0 = k) P(X1 = j | X0 = k)
     = Σ_k P(X1 = j, X0 = k)
     = P(X1 = j),

which is the distribution of X1. We proceed by induction for n ≥ 2, assuming that αn−1 = α0 T1 . . . Tn−1 is the distribution of Xn−1. So,

αn,j = Σ_k αn−1,k π^n_{k,j}
     = Σ_k P(Xn−1 = k) P(Xn = j | Xn−1 = k)
     = Σ_k P(Xn = j, Xn−1 = k)
     = P(Xn = j),

which is the distribution of Xn. □
Example 10.6. Using the setting of Example 10.4 and the initial distribution α0 = (1, 0, 0) of X0 , we obtain that the distribution of X1 is given by

α1 = α0 T1 = (1/2, 1/2, 0).

That is, P(X1 = 1) = P(X1 = 2) = 1/2 and P(X1 = 3) = 0. Moreover, the distribution of X2 is

α2 = α1 T2 = (1/4, 1/2, 1/4).
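The recursion αn = α0 T1 · · · Tn is easy to check numerically. A minimal sketch in Python (using NumPy, which is not part of these notes) for the matrices of Example 10.4:

```python
import numpy as np

# Time-dependent transition matrices T_n of Example 10.4
def T(n):
    return np.array([
        [1 / (2 * n), (2 * n - 1) / (2 * n), 0.0],
        [1 / (2 * n), (n - 1) / (2 * n), 0.5],
        [0.0, 0.0, 1.0],
    ])

alpha0 = np.array([1.0, 0.0, 0.0])  # initial distribution of X0
alpha1 = alpha0 @ T(1)              # distribution of X1: (1/2, 1/2, 0)
alpha2 = alpha1 @ T(2)              # distribution of X2: (1/4, 1/2, 1/4)
```

Multiplying the row vector α0 on the left reproduces the values of Example 10.6.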

A vector (i0 , i1 , . . . , in ) ∈ S × · · · × S defines a trajectory of states


visited by the stochastic process up to time n. We are interested in
computing its probability.

Proposition 10.7. If α0 is the distribution of X0 , then for any i0 , . . . , in ∈ S and n ≥ 1,

P(X0 = i0 , . . . , Xn = in ) = α0,i0 π^1_{i0,i1} · · · π^n_{in−1,in}.

Proof. Starting at n = 1 we have

P(X0 = i0 , X1 = i1 ) = P(X0 = i0 ) P(X1 = i1 | X0 = i0 ) = α0,i0 π^1_{i0,i1}.

By induction, for n ≥ 2 and assuming that P(X0 = i0 , . . . , Xn−1 = in−1 ) = α0,i0 π^1_{i0,i1} · · · π^{n−1}_{in−2,in−1}, we get

P(X0 = i0 , . . . , Xn = in )
  = P(X0 = i0 , . . . , Xn−1 = in−1 ) P(Xn = in | X0 = i0 , . . . , Xn−1 = in−1 )
  = α0,i0 π^1_{i0,i1} · · · π^{n−1}_{in−2,in−1} P(Xn = in | Xn−1 = in−1 )
  = α0,i0 π^1_{i0,i1} · · · π^n_{in−1,in},

where we have used the Markov property. □
Example 10.8. Using the setting of Example 10.6, the probability of starting at state 1, then moving to state 2 and next back to 1 is

P(X0 = 1, X1 = 2, X2 = 1) = 1 · (1/2) · (1/4) = 1/8.
The probability of a trajectory given an initial state is now simple to obtain; the n-step transition probability also follows from it.
Proposition 10.9. If P(X0 = i) > 0, then

(1) P(X1 = i1 , . . . , Xn = in | X0 = i) = π^1_{i,i1} · · · π^n_{in−1,in}.
(2)
    P(Xn = j | X0 = i) = π^(n)_{i,j}                              (10.1)

where π^(n)_{i,j} is the (i, j)-coefficient of the product matrix T1 . . . Tn.

Proof.
(1) It is enough to observe that

    P(X1 = i1 , . . . , Xn = in | X0 = i) = P(X0 = i, X1 = i1 , . . . , Xn = in ) / P(X0 = i)

    and use Proposition 10.7.
(2) Using the previous result and (7.2),

    P(Xn = j | X0 = i) = Σ_{i1,...,in−1} P(X1 = i1 , . . . , Xn−1 = in−1 , Xn = j | X0 = i)
                       = Σ_{i1,...,in−1} π^1_{i,i1} π^2_{i1,i2} · · · π^{n−1}_{in−2,in−1} π^n_{in−1,j}.

Now, notice that

Σ_{i1} π^1_{i,i1} π^2_{i1,i2} = Σ_{i1} P(X2 = i2 | X1 = i1 , X0 = i) P(X1 = i1 | X0 = i)
                             = P(X2 = i2 | X0 = i),

where we have used the fact that it is a Markov chain and (7.3). Moreover, using the same arguments,

Σ_{i2} P(X2 = i2 | X0 = i) π^3_{i2,i3} = Σ_{i2} P(X3 = i3 | X2 = i2 , X0 = i) P(X2 = i2 | X0 = i)
                                       = P(X3 = i3 | X0 = i).

Therefore, repeating the same idea up to the sum in in−1, we finally prove the claim. □
Example 10.10. Following the previous examples we have

T1 T2 = [ 1/4  1/2  1/4
          1/8  3/8  1/2
          0    0    1   ].

So, for instance,

P(X2 = 2 | X0 = 2) = 3/8.
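Propositions 10.7 and 10.9 can be verified numerically on the same example; a small sketch (NumPy assumed):

```python
import numpy as np

# T_1 and T_2 of Example 10.4
T1 = np.array([[1/2, 1/2, 0], [1/2, 0, 1/2], [0, 0, 1]])
T2 = np.array([[1/4, 3/4, 0], [1/4, 1/4, 1/2], [0, 0, 1]])
alpha0 = np.array([1.0, 0.0, 0.0])

# Proposition 10.7: probability of the trajectory (1, 2, 1)
p_traj = alpha0[0] * T1[0, 1] * T2[1, 0]  # alpha_{0,1} pi^1_{1,2} pi^2_{2,1} = 1/8

# Proposition 10.9: the 2-step transition probabilities are the entries of T1 T2
P2 = T1 @ T2
p_22 = P2[1, 1]  # P(X2 = 2 | X0 = 2) = 3/8
```

The product T1 T2 reproduces the matrix of Example 10.10.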
Exercise 10.11. Given any increasing sequence of positive integers
un , show that the sequence of (stochastic) product matrices
T1 . . . Tu1 , Tu1 +1 . . . Tu2 , Tu2 +1 . . . Tu3 , . . .
corresponds to the transition matrices of the Markov chain
X0 , Xu1 , Xu2 , Xu3 . . . .

3. Homogeneous Markov chains

From now on we will restrict our attention to a special class of


Markov chains, when the transition probabilities do not depend on the
time n, i.e.
T1 = T2 = · · · = T = [πi,j ]i,j∈S .
These are called homogeneous Markov chains.
By Proposition 10.5 we have that the distribution of Xn is

αn = α0 T^n,

where α0 is the initial distribution (i.e. the one of X0 ) and T^n is the n-th power of T.

Example 10.12. Let S = {1, 2, 3} and

T = [ 1/2  1/4  1/4
      1/3  1/3  1/3
      1    0    0   ].
We can represent this stochastic process graphically as a transition diagram: the nodes are the states 1, 2, 3 and each arrow i → j is labelled with the transition probability πi,j.
Moreover, we can get the distribution α1 = α0 T of X1 as

P(X1 = 1) = Σ_{j=1}^{3} π_{j,1} P(X0 = j) = (1/2) P(X0 = 1) + (1/3) P(X0 = 2) + P(X0 = 3),
P(X1 = 2) = (1/4) P(X0 = 1) + (1/3) P(X0 = 2),
P(X1 = 3) = (1/4) P(X0 = 1) + (1/3) P(X0 = 2).
Similar relations can be obtained for the distribution αn of Xn for any n ≥ 1. In addition, given X0 = 1 the probability of the trajectory (1, 2, 3, 1) is

P(X1 = 2, X2 = 3, X3 = 1 | X0 = 1) = (1/4)(1/3)(1) = 1/12.
Whenever α0,i = P(X0 = i) > 0, by (10.1) we get

P(Xn = j | X0 = i) = π^(n)_{i,j}

where π^(n)_{i,j} is the (i, j)-coefficient of T^n. In fact, all the information about the evolution of the stochastic process is derived from the power matrix T^n, called the n-step transition matrix. It is also a stochastic matrix. In particular, the sequence of random variables

X0 , Xn , X2n , X3n , . . .

is also a Markov chain, with transition matrix T^n, called the n-step Markov chain.

Notice that we can include the case n = 0, since

π^(0)_{i,j} = P(X0 = j | X0 = i) = { 1,  i = j
                                     0,  i ≠ j.

This corresponds to the transition matrix T^0 = I (the identity matrix).


Example 10.13. Let

T = [ 1/2  1/2
      1    0   ].

Then,

T^2 = [ 3/4  1/4      and      T^3 = [ 5/8  3/8
        1/2  1/2 ]                     3/4  1/4 ].
The Markov chain corresponding to T can be represented graphically as a two-state diagram whose arrows are labelled by the entries of T. Likewise, the two-step Markov chain given by T^2 and the three-step Markov chain given by T^3 are represented by diagrams labelled with the entries of T^2 and T^3, respectively.

Exercise 10.14 (Bernoulli process). Let S = N, 0 < p < 1 and

P(Xn+1 = i + 1 | Xn = i) = p,
P(Xn+1 = i | Xn = i) = 1 − p,

for every n ≥ 0, i ∈ S. The random variable Xn could count the number of heads in n tosses of a coin if we set P(X0 = 0) = 1. This is a homogeneous Markov chain with (infinite) transition matrix

T = [ 1−p  p
           1−p  p
      0         ⋱  ⋱ ],

i.e. for i, j ∈ N

π_{i,j} = { 1 − p,  i = j
            p,      i + 1 = j
            0,      otherwise.

Show that

P(Xn = j | X0 = i) = C^n_{j−i} p^{j−i} (1 − p)^{n−j+i},   0 ≤ j − i ≤ n,

where C^n_{j−i} denotes the binomial coefficient "n choose j − i".
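One way to sanity-check the binomial formula of this exercise is to truncate the infinite matrix and compare a matrix power with the closed form; a sketch (NumPy assumed; the parameter values are arbitrary):

```python
import numpy as np
from math import comb

# Truncation of the infinite Bernoulli-process matrix to states 0..N;
# harmless here since the chain started at i cannot pass i + n in n steps.
p, N, n, i = 0.3, 30, 8, 2
T = np.zeros((N + 1, N + 1))
for k in range(N):
    T[k, k] = 1 - p
    T[k, k + 1] = p
T[N, N] = 1.0  # boundary row of the truncation

Tn = np.linalg.matrix_power(T, n)
err = max(
    abs(Tn[i, j] - comb(n, j - i) * p ** (j - i) * (1 - p) ** (n - j + i))
    for j in range(i, i + n + 1)
)
```

The entries of T^n agree with the binomial probabilities up to floating-point error.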

4. Recurrence time

Consider a homogeneous Markov chain Xn on S. The time of the


first visit to i ∈ S (regardless of the initial state) is the random variable
ti : Ω → N ∪ {+∞} given by

ti = { min{n ≥ 1 : Xn = i},  if there is n such that Xn = i,
       +∞,                   if Xn ≠ i for all n.

Exercise 10.15. Show that ti is a random variable.


Exercise 10.16. Show that

ti = Σ_{n≥1} X_{{ti ≥ n}}.

The distribution of ti is then given by


P(ti = n) = P(X1 ≠ i, . . . , Xn−1 ≠ i, Xn = i),   n ∈ N,

and

P(ti = +∞) = P(X1 ≠ i, X2 ≠ i, . . . ).

The mean recurrence time τi of the state i is the expected value of


ti given X0 = i,
τi = E(ti |X0 = i).
Using the convention

+∞ · a = { 0,   a = 0
           +∞,  a > 0,

we can write

τi = Σ_{n=1}^{+∞} n P(ti = n | X0 = i) + ∞ · P(ti = +∞ | X0 = i).

Thus, τi ∈ [1, +∞[ ∪{+∞}. Notice also that if P (X0 = i) = 0, then


τi = E(ti ).

5. Classification of states

A state i is called recurrent iff

P(ti = +∞ | X0 = i) = 0.

This is also equivalent to

P(X1 ≠ i, X2 ≠ i, . . . | X0 = i) = 0.
It means that the process returns to the initial state i with full proba-
bility. A state i which is not recurrent is said to be transient. So,
S = R ∪ T,
where R is the set of recurrent states and T its complementary in S.
Remark 10.17. Notice that

{Xn = i for some n ≥ 1} = ∪_{n=1}^{+∞} {Xn = i} = ( ∩_{n=1}^{+∞} {Xn ≠ i} )^c.

Hence, i ∈ R iff

P(Xn = i for some n ≥ 1 | X0 = i) = 1.
Proposition 10.18. Let i ∈ S. Then,

(1) i ∈ R iff Σ_{n=1}^{+∞} π^(n)_{i,i} = +∞.
(2) If i ∈ T , then for any j ∈ S, Σ_{n=1}^{+∞} π^(n)_{j,i} < +∞.
Remark 10.19. Recall that if Σ_n un < +∞, then un → 0. On the other hand, there are sequences un that converge to 0 but whose corresponding series does not converge. For example, Σ_n 1/n = +∞.

Proof.
(1)
(⇒) If i is recurrent, then there is m ≥ 1 such that

P(Xm = i | X0 = i) > 0.

From (7.3) and the Markov property, for any q > m we have

P(Xq = j | X0 = i) = Σ_{k∈S} P(Xq = j | Xm = k) P(Xm = k | X0 = i).

Next, we prove by induction that for q = sm with s ∈ N and j = i we have

P(Xsm = i | X0 = i) ≥ P(Xm = i | X0 = i)^s.

This is clear for s = 1. Assuming that the relation above is true for s,

P(X(s+1)m = i | X0 = i) ≥ P(X(s+1)m = i | Xm = i) P(Xm = i | X0 = i)
                        = P(Xsm = i | X0 = i) P(Xm = i | X0 = i)
                        ≥ P(Xm = i | X0 = i)^{s+1}.

Thus,

Σ_{n=1}^{+∞} π^(n)_{i,i} = Σ_{n=1}^{+∞} P(Xn = i | X0 = i) ≥ Σ_{s=1}^{+∞} P(Xsm = i | X0 = i) ≥ Σ_{s=1}^{+∞} P(Xm = i | X0 = i)^s = +∞.

(⇐) Suppose now that Σ_n π^(n)_{i,i} = +∞. Using (7.3), since

Σ_{k=1}^{+∞} P(ti = k) = 1,

we have

π^(n)_{j,i} = P(Xn = i | X0 = j) = Σ_{k=1}^{+∞} P(Xn = i | ti = k, X0 = j) P(ti = k | X0 = j).

The Markov property implies that

P(Xn = i | ti = k, X0 = j) = P(Xn = i | Xk = i) = π^(n−k)_{i,i}

for 0 ≤ k ≤ n, and it vanishes for other values of k. Recall that {ti = k} = {X1 ≠ i, . . . , Xk−1 ≠ i, Xk = i} and π^(0)_{i,i} = 1. Therefore,

π^(n)_{j,i} = Σ_{k=1}^{n} π^(n−k)_{i,i} P(ti = k | X0 = j).                (10.2)

For N ≥ 1 and j = i, the following holds by resummation:

Σ_{n=1}^{N} π^(n)_{i,i} = Σ_{n=1}^{N} Σ_{k=1}^{n} π^(n−k)_{i,i} P(ti = k | X0 = i)
                        = Σ_{k=1}^{N} Σ_{n=k}^{N} π^(n−k)_{i,i} P(ti = k | X0 = i)
                        = Σ_{k=1}^{N} P(ti = k | X0 = i) Σ_{n=0}^{N−k} π^(n)_{i,i}
                        ≤ Σ_{k=1}^{N} P(ti = k | X0 = i) ( 1 + Σ_{n=1}^{N} π^(n)_{i,i} ).

Finally,

1 ≥ Σ_{k=1}^{N} P(ti = k | X0 = i) ≥ ( Σ_{n=1}^{N} π^(n)_{i,i} ) / ( 1 + Σ_{n=1}^{N} π^(n)_{i,i} ) → 1

as N → +∞, which implies

P(ti < +∞ | X0 = i) = Σ_{k=1}^{+∞} P(ti = k | X0 = i) = 1.

That is, i is recurrent.


(2) Consider a transient state i, i.e. Σ_n π^(n)_{i,i} < +∞. Using (10.2) we have by resummation

Σ_{n=1}^{+∞} π^(n)_{j,i} = Σ_{n=1}^{+∞} Σ_{k=1}^{n} π^(n−k)_{i,i} P(ti = k | X0 = j)
                         = Σ_{k=1}^{+∞} P(ti = k | X0 = j) Σ_{n=0}^{+∞} π^(n)_{i,i}
                         ≤ Σ_{n=0}^{+∞} π^(n)_{i,i} < +∞. □


Example 10.20. Consider the Markov chain with two states and transition matrix

T = [ 0  1
      1  0 ].

Thus,

T^n = { T,  n odd
        I,  n even.

It is now easy to check for i = 1, 2 that

Σ_{n=1}^{+∞} π^(n)_{i,i} = +∞.

Both states are recurrent.
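The divergence of the series for this chain can be seen from partial sums; a short check (NumPy assumed):

```python
import numpy as np

T = np.array([[0.0, 1.0], [1.0, 0.0]])

# pi^(n)_{1,1} = 1 for even n and 0 for odd n, so the partial sums
# of the series in Proposition 10.18 grow without bound
N = 100
s = sum(np.linalg.matrix_power(T, n)[0, 0] for n in range(1, N + 1))  # = N/2
```

Doubling N doubles the partial sum, consistent with recurrence of both states.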
Remark 10.21. If i ∈ T , i.e. P(ti = +∞ | X0 = i) > 0, then τi = +∞.

So, only recurrent states can have finite mean recurrence time. We will classify them accordingly.
A recurrent state i is null iff τi = +∞ (i.e. τi^{−1} = 0). We write i ∈ R0. Otherwise it is called positive, i.e. 1 ≤ τi < +∞, and we write i ∈ R+. Hence,

R = R0 ∪ R+.
Proposition 10.22. Let i ∈ S. Then,

(1) τi = +∞ iff lim_{n→+∞} π^(n)_{i,i} = 0.
(2) If τi = +∞, then for any j ∈ S, lim_{n→+∞} π^(n)_{j,i} = 0.

Exercise 10.23. Prove it.

The period of a state i is given by

Per(i) = gcd{n ≥ 1 : π^(n)_{i,i} > 0}

if this set is non-empty; otherwise the period is not defined. Here gcd stands for the greatest common divisor. Furthermore, i is periodic iff Per(i) ≥ 2. It is called aperiodic iff Per(i) = 1.

Remark 10.24.
(1) If n is not a multiple of Per(i), then π^(n)_{i,i} = 0.
(2) If πi,i > 0, then i is aperiodic.

Finally, a state i is said to be ergodic iff it is positive recurrent and aperiodic. We denote the set of ergodic states by E, which is a subset of R+.
Exercise 10.25. Given a state i ∈ S, consider the function Ni : Ω → N ∪ {+∞} that counts the number of times the chain visits its starting point i:

Ni = Σ_{n≥1} X_{Xn^{−1}({i})}.

(1) Show that Ni is a random variable.



(2) Compute the distribution of Ni .


(3) What is P(Ni = +∞) if i is recurrent? And if it is transient?
Exercise 10.26. * Consider a homogeneous Markov chain on the state space S = N given by

P(X1 = i | X0 = i) = r,          i ≥ 2,
P(X1 = i − 1 | X0 = i) = 1 − r,  i ≥ 2,
P(X1 = j | X0 = 1) = 1/2^j,      j ≥ 1.

Classify the states of the chain and find their mean recurrence times by computing the probability of first return after n steps, P(ti = n | X0 = i), for i ∈ S.
Exercise 10.27. Show that

(1) i ∈ R+ iff Σ_n π^(n)_{i,i} = +∞ and lim_{n→+∞} π^(n)_{i,i} ≠ 0.
(2) i ∈ R0 iff Σ_n π^(n)_{i,i} = +∞ and lim_{n→+∞} π^(n)_{i,i} = 0.
(3) i ∈ T iff Σ_n π^(n)_{i,i} < +∞ (in particular lim_{n→+∞} π^(n)_{i,i} = 0).

Conclude that τi = +∞ iff lim_{n→+∞} π^(n)_{i,i} = 0.

6. Decomposition of chains

Let i, j ∈ S. We write

i → j

whenever there is n ≥ 0 such that π^(n)_{i,j} > 0. That is, the probability of eventually moving from i to j is positive. Moreover, we use the notation

i ←→ j

if i → j and j → i simultaneously.
Exercise 10.28. Consider i ≠ j. Show that i → j is equivalent to

Σ_{n=1}^{+∞} P(tj = n | X0 = i) > 0.

Proposition 10.29. ←→ is an equivalence relation on S.

Proof. Since π^(0)_{i,i} = 1, we always have i ←→ i. Moreover, having i ←→ j is clearly equivalent to j ←→ i. Finally, given any three states i, j, k such that i ←→ j and j ←→ k, the probability of eventually moving from i to k is positive because it is greater than or equal to the product of the probabilities of moving from i to j and from j to k. In the same way we obtain that k → i. So, i ←→ k. □

Denote the set of all states that are equivalent to a given i ∈ S by

[i] = {j ∈ S : i ←→ j},

which is called the equivalence class of i. Of course, [i] = [j] iff i ←→ j. The equivalence classes are also known as irreducible sets.
Theorem 10.30. If j ∈ [i], then
(1) P er(i) = P er(j).
(2) i is recurrent iff j is recurrent.
(3) i is null recurrent iff j is null recurrent.
(4) i is positive recurrent iff j is positive recurrent.
(5) i is ergodic iff j is ergodic.

Proof. We will just prove (2). The remaining cases are similar and left as an exercise.
Notice first that

π^(m+n+r)_{i,i} = Σ_k π^(m+n)_{i,k} π^(r)_{k,i}
               ≥ π^(m+n)_{i,j} π^(r)_{j,i}
               = Σ_k π^(m)_{i,k} π^(n)_{k,j} π^(r)_{j,i}
               ≥ π^(m)_{i,j} π^(n)_{j,j} π^(r)_{j,i}.

Since i ←→ j, there are m, r ≥ 0 such that π^(m)_{i,j} π^(r)_{j,i} > 0. So,

π^(n)_{j,j} ≤ π^(m+n+r)_{i,i} / ( π^(m)_{i,j} π^(r)_{j,i} ).

This implies that

Σ_{n=1}^{+∞} π^(n)_{j,j} ≤ ( 1 / (π^(m)_{i,j} π^(r)_{j,i}) ) Σ_{n=1}^{+∞} π^(m+n+r)_{i,i}
                        ≤ ( 1 / (π^(m)_{i,j} π^(r)_{j,i}) ) Σ_{n=1}^{+∞} π^(n)_{i,i}.

Therefore, if i is transient, i.e.

Σ_{n=1}^{+∞} π^(n)_{i,i} < +∞,

then j is also transient. □

Consider a subset of the states C ⊂ S. We say that C is closed iff for every i ∈ C and j ∉ C we have π^(1)_{i,j} = 0. This means that moving out of C is an event of probability zero. It does not exclude outside states from moving inside C, i.e. we can have π^(1)_{j,i} > 0.
A closed set C made of only one state is called an absorbing state.
Proposition 10.31. If i ∈ R, then [i] is closed.

Proof. Suppose that [i] is not closed. Then, there is some j ∉ [i] such that π^(1)_{i,j} > 0. That is, i → j but j ↛ i (otherwise j would be in [i]). So,

P( ∩_{n≥1} {Xn ≠ i} | X0 = i ) ≥ P( {X1 = j} ∩ ∩_{n≥2} {Xn ≠ i} | X0 = i )
                              = P(X1 = j | X0 = i) = π^(1)_{i,j} > 0.

Taking the complementary set,

P( ∪_{n≥1} {Xn = i} | X0 = i ) = 1 − P( ∩_{n≥1} {Xn ≠ i} | X0 = i ) < 1.

This means that i ∈ T. □

The previous proposition implies the following decomposition of the


state space.
Theorem 10.32 (Decomposition). Any state space S can be de-
composed into the union of the set of transient states T and closed
recurrent irreducible sets C1 , C2 , . . . :
S = T ∪ C1 ∪ C2 ∪ . . . .
Remark 10.33.
(1) If [i] is not closed, then i ∈ T .
(2) If X0 is in Ck , then Xn stays in Ck forever with probability 1.
(3) If X0 is in T , then Xn stays in T or moves eventually to one
of the Ck ’s. If the state space is finite, it can not stay in T
forever.
Exercise 10.34. If Xn is an irreducible Markov chain with period
d, is Yn = Xnd also an irreducible Markov chain? If yes, what is the
period of Yn ?

6.1. Finite closed sets.


Proposition 10.35. If C ⊂ S is closed and finite, then

C ∩ R = C ∩ R+ ≠ ∅.

Moreover, if C is an irreducible set, then C ⊂ R+.

Proof. Suppose that all states are transient. Then, for any i, j ∈ C we have π^(n)_{j,i} → 0 as n → +∞ by Proposition 10.18. Moreover, for any j ∈ C we have

Σ_{i∈C} π^(n)_{j,i} = 1.

So, for any ε > 0 there is N ∈ N such that for any n ≥ N we have π^(n)_{j,i} < ε. Therefore,

1 = Σ_{i∈C} π^(n)_{j,i} < ε #C,

which implies for any ε that #C > 1/ε. That is, C is infinite.
Assume now that there is i ∈ R0 ∩ C. So, by Proposition 10.22 we have for any j ∈ C that π^(n)_{j,i} → 0 as n → +∞. As before,

Σ_{i∈C} π^(n)_{j,i} = 1,

and the limit of the right hand side is zero unless C is infinite.
Finally, if C is irreducible all its states have the same recurrence property. Since at least one is in R+, then all are in R+. □
Remark 10.36. The previous result implies that if [i] is finite and
closed, then [i] ⊂ R+ . In particular, if S is finite and irreducible (notice
that it is always closed), then S = R+ .
Example 10.37. Consider the finite state space S = {1, 2, 3, 4, 5, 6} and the transition probabilities matrix

T = [ 1/2  1/2  0    0    0    0
      1/4  3/4  0    0    0    0
      1/4  1/4  1/4  1/4  0    0
      0    1/4  1/4  1/4  0    1/4
      0    0    0    0    1/2  1/2
      0    0    0    0    1/2  1/2 ].

It is simple to check that 1 ←→ 2, 3 ←→ 4 and 5 ←→ 6. We have that [1] = {1, 2} and [5] = {5, 6} are irreducible closed sets, while [3] = {3, 4} is not closed. So, the states in [1] and [5] are positive recurrent and those in [3] are transient.
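The class structure of this example can be found mechanically by graph search on the positive entries of T; a sketch in Python (NumPy assumed; states relabelled 0–5):

```python
import numpy as np

# Transition matrix of Example 10.37 (states relabelled 0..5)
T = np.array([
    [1/2, 1/2, 0,   0,   0,   0  ],
    [1/4, 3/4, 0,   0,   0,   0  ],
    [1/4, 1/4, 1/4, 1/4, 0,   0  ],
    [0,   1/4, 1/4, 1/4, 0,   1/4],
    [0,   0,   0,   0,   1/2, 1/2],
    [0,   0,   0,   0,   1/2, 1/2],
])

def reachable(i):
    # states j with i -> j, found by graph search on the positive entries
    seen, stack = {i}, [i]
    while stack:
        k = stack.pop()
        for j in range(len(T)):
            if T[k, j] > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def equivalence_class(i):
    # [i] = {j : i <-> j}
    return {j for j in reachable(i) if i in reachable(j)}

def is_closed(C):
    # C is closed iff there is no one-step exit from C with positive probability
    return all(T[i, j] == 0 for i in C for j in range(len(T)) if j not in C)
```

Running `equivalence_class` on each state recovers the classes {1, 2}, {3, 4}, {5, 6} of the example, with only the first and last being closed.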

7. Stationary distributions

Consider a homogeneous Markov chain Xn on a state space S.


Given an initial distribution α of X0 , we have seen that the distribution of Xn is given by αn = α T^n, n ∈ N. A special case is when the distribution stays the same for all times n, i.e. αn = α. So, a distribution α on S is called stationary iff

αT = α.

Example 10.38. Consider a Markov chain with S = N and for any i ∈ S

P(X1 = 1 | X0 = i) = 1/2,   P(X1 = i + 1 | X0 = i) = 1/2.

A stationary distribution has to satisfy

P(X0 = i) = P(X1 = i),   i ∈ S.

So,

P(X0 = i) = Σ_j P(X1 = i | X0 = j) P(X0 = j).

If i = 1, this implies that

P(X0 = 1) = (1/2) Σ_j P(X0 = j) = 1/2.

If i ≥ 2,

P(X0 = i) = P(X1 = i | X0 = i − 1) P(X0 = i − 1) = (1/2) P(X0 = i − 1).

So,

P(X0 = i) = 1/2^i.
In the case of a finite state space a stationary distribution α is a solution of the linear equation

(T^⊤ − I) α^⊤ = 0.

It can also be computed as an eigenvector of T^⊤ (the transpose matrix of T) corresponding to the eigenvalue 1. Notice that it must satisfy αi ≥ 0 and Σ_i αi = 1. Moreover, if T does not have an eigenvalue 1 (T and T^⊤ share the same eigenvalues), then there are no stationary distributions.
Example 10.39. Consider the Markov chain with transition matrix

T = [ 1/2  1/2
      1/4  3/4 ].

The eigenvalues of T are 1 and 1/4. Furthermore, an eigenvector of T^⊤ associated to the unit eigenvalue is (1, 2) ∈ R². Therefore, the eigenvector which corresponds to a distribution is α = (α1 , 2α1 ) with α1 ≥ 0 and 3α1 = 1. That is, α = (1/3, 2/3).
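The eigenvector computation of this example can be carried out numerically; a sketch (NumPy assumed):

```python
import numpy as np

T = np.array([[1/2, 1/2],
              [1/4, 3/4]])

# eigenvector of T^T for the eigenvalue 1, normalized to a distribution
w, v = np.linalg.eig(T.T)
k = int(np.argmin(abs(w - 1.0)))
alpha = np.real(v[:, k])
alpha = alpha / alpha.sum()   # alpha = (1/3, 2/3)
```

Dividing by the sum both normalizes the eigenvector and fixes its sign, so the result is a genuine probability vector satisfying αT = α.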
Exercise 10.40. Find a stationary distribution for the Markov
chain in Example 10.37.
Theorem 10.41. Consider an irreducible S. Then, S = R+ iff there is a unique stationary distribution, in which case it is given by αi = τi^{−1}.

The proof of the above theorem is contained in section 7.1.


Remark 10.42. Recall that if S is finite and irreducible then S =
R+ . So, in this case there is a unique stationary distribution.
Exercise 10.43. Find the unique stationary distribution for the Markov chain with transition matrix:

T = [ 1/2  1/2  0
      0    1/2  1/2
      1/2  1/2  0   ].

7.1. Proof of Theorem 10.41. A measure µ on S is stationary


iff µT = µ. Notice that it is not required that µ is a probability
measure as in the case of a stationary distribution (when µ(S) = 1).
In other words, a stationary measure is a generalization of a stationary
distribution.
Exercise 10.44. Show that any stationary measure ν = (ν1 , . . . )
on an irreducible S verifies 0 < νi < +∞ for every i ∈ S.

In the following we always assume S to be irreducible.


Proposition 10.45. If S = T then there are no stationary measures.

Proof. For any i, j ∈ S = T we have Σ_n π^(n)_{i,j} < +∞, thus π^(n)_{i,j} → 0 as n → +∞. Therefore µT^n → 0, implying that µT can not be equal to µ unless µ = 0, which is not a measure. □
Proposition 10.46. If S = R, then for each i ∈ S the measure µ^(i) given by

µ^(i)_j := µ^(i)({j}) = Σ_{n≥1} P(Xn = j, ti ≥ n | X0 = i),   j ∈ S,

is stationary. Moreover, µ^(i)(S) = τi.

Proof. Fix i ∈ S and let

Nj = Σ_{n≥1} X_{{Xn = j, ti ≥ n}}

be the random variable that counts the number of visits to state j until time ti. That is, the chain visits the state j for Nj times until it reaches i. Notice that Ni = 1.
The mean of Nj starting at X0 = i is

ρj = E(Nj | X0 = i).

Clearly, ρi = 1. Considering the simple functions

ϕm = Σ_{n=1}^{m} X_{{Xn = j, ti ≥ n}},

so that ϕm ↗ Nj as m → +∞, we can use the monotone convergence theorem to get

ρj = Σ_{n≥1} P(Xn = j, ti ≥ n | X0 = i).

Furthermore,

ρj = πi,j + Σ_{n≥2} Σ_{k≠i} P(Xn = j, Xn−1 = k, ti ≥ n | X0 = i)
   = πi,j + Σ_{n≥2} Σ_{k≠i} πk,j P(Xn−1 = k, ti ≥ n | X0 = i)
   = πi,j + Σ_{k≠i} πk,j Σ_{n≥1} P(Xn = k, ti ≥ n + 1 | X0 = i).

Notice that for k ≠ i we have

{Xn = k, ti ≥ n + 1} = {X1 ≠ i, . . . , Xn−1 ≠ i, Xn = k} = {Xn = k, ti ≥ n}.

So, since ρi = 1,

ρj = πi,j ρi + Σ_{k≠i} πk,j ρk = Σ_{k∈S} πk,j ρk.

That is,

ρ = ρT

where ρ = (ρ1 , ρ2 , . . . ). We therefore take µ^(i)({j}) = ρj.
The sum of all the Nj 's is equal to ti. Indeed,

Σ_{j∈S} Nj = Σ_{j∈S} Σ_{n≥1} X_{{Xn = j, ti ≥ n}} = Σ_{n≥1} X_{{ti ≥ n}} = ti.

Again by the monotone convergence theorem,

µ^(i)(S) = Σ_{j∈S} ρj = E(ti | X0 = i) = τi. □

Exercise 10.47. Show that µ^(i)_i = 1.

Exercise 10.48. Show that

µ^(i)_j = πi,j + Σ_{n≥1} Σ_{k1,...,kn ≠ i} πi,k1 πk1,k2 · · · πkn−1,kn πkn,j.

Proposition 10.49. If S = R and ν = (ν1 , . . . ) is a stationary measure, then for any i ∈ S we have ν = νi µ^(i).

Proof. Given j ∈ S there is n such that π^(n)_{j,i} > 0 by the irreducibility of S. Using also the stationarity property of the measures (νT^n = ν and µ^(i) T^n = µ^(i)),

Σ_{k∈S} νk π^(n)_{k,i} = νi   and   Σ_{k∈S} µ^(i)_k π^(n)_{k,i} = µ^(i)_i = 1.

So, from

0 = Σ_{k∈S} (νk − νi µ^(i)_k) π^(n)_{k,i} ≥ (νj − νi µ^(i)_j) π^(n)_{j,i}

we obtain νj ≤ νi µ^(i)_j.
Now, for j ∈ S, again by the stationarity of ν,

νj = νi πi,j + Σ_{k1≠i} νk1 πk1,j.

Using the same relation for νk1 we obtain

νj = νi πi,j + Σ_{k1≠i} νi πi,k1 πk1,j + Σ_{k1,k2≠i} νk2 πk2,k1 πk1,j.

Repeating this indefinitely, we get

νi^{−1} νj ≥ πi,j + Σ_{n≥1} Σ_{k1,...,kn≠i} πi,k1 πk1,k2 · · · πkn−1,kn πkn,j = µ^(i)_j. □


Exercise 10.50. Complete the proof of Theorem 10.41.

8. Limit distributions

Recall that the distribution of a Markov chain at time n is given by

P(Xn = j) = Σ_{i∈S} π^(n)_{i,j} P(X0 = i),   j ∈ S.

We are now interested in determining the convergence in distribution


of Xn .
Theorem 10.51. Let S be irreducible and aperiodic. Then,

lim_{n→+∞} π^(n)_{i,j} = 1/τj,   i, j ∈ S,

and

lim_{n→+∞} P(Xn = j) = 1/τj.

See the proof in section 8.1.


Theorem 10.52. Let S be irreducible and aperiodic.

(1) If S = T or S = R0, then Xn diverges in distribution.
(2) If S = R+ = E, then Xn converges in distribution to the unique stationary distribution.

Proof. Recall that if S is transient or null recurrent, then τj = +∞ for all j ∈ S, so π^(n)_{i,j} → 0 for all i, j ∈ S. According to Theorem 10.51, lim_{n→+∞} P(Xn = j) = 0 for all j. Hence, the limit can not be a distribution, and in these cases there are no limit distributions.
If S = R+, its unique stationary distribution is αj = 1/τj, j ∈ S; starting from it,

P(Xn = j) = 1/τj,   n ≥ 0,  j ∈ S.

Using again Theorem 10.51, the limit distribution is equal to the stationary distribution. □
Remark 10.53.
(1) When S is aperiodic and positive recurrent it is said to be ergodic. That is why the previous theorem is usually called the ergodic theorem.
(2) Since the limit distribution for irreducible ergodic Markov chains does not depend on the initial distribution, we say that these chains forget their origins.
Example 10.54. Consider the transition matrix

T = [ 1/2  1/2
      1/2  1/2 ].

The chain is irreducible and ergodic. Thus, there is a unique stationary distribution which is the limit distribution. From the fact that T^n = T we have π^(n)_{i,j} = 1/2 → 1/2, so we know that α1 = α2 = 1/2 and τ1 = τ2 = 2.
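Theorem 10.51 can also be illustrated numerically with the (irreducible and aperiodic) chain of Example 10.13: the rows of T^n converge to the stationary distribution, from which the mean recurrence times can be read off. A sketch (NumPy assumed):

```python
import numpy as np

T = np.array([[0.5, 0.5], [1.0, 0.0]])  # chain of Example 10.13
Tn = np.linalg.matrix_power(T, 60)
# each row of T^n approaches the stationary distribution (2/3, 1/3),
# so tau_1 = 3/2 and tau_2 = 3 by Theorem 10.51
```

The second eigenvalue of this T is −1/2, so the convergence is geometric and T^60 already agrees with the limit to machine precision.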
Example 10.55. Let now

T = [ 0    1/2  1/2
      1/2  0    1/2
      1/2  1/2  0   ].

The chain is irreducible and finite, hence S = R+. The period is 2 and α = (1/3, 1/3, 1/3) is the unique stationary distribution. On the other hand, π^(n)_{1,2} is zero iff n is even. This implies for example that lim_{n→+∞} π^(n)_{1,2} does not exist. The aperiodicity condition imposed in Theorem 10.51 is therefore essential.

8.1. Proof of Theorem 10.51. Fix j ∈ S and consider the successive visits to j given for each k ≥ 1 by

Rk = min{n > Rk−1 : Xn = j}   and   R0 = min{n ≥ 0 : Xn = j}.

Notice that R0 = 0 means that X0 = j. Otherwise, R0 = tj. Also, Xn = j is equivalent to the existence of some k ∈ N satisfying Rk = n. Moreover, there is only a finite number of visits to j iff Rk = ∞ for some k ∈ N0.
So,

π^(n)_{j,j} = P( ∪_k {Rk = n} | R0 = 0 ).

On the other hand, we have

π^(n)_{i,j} = Σ_{k=1}^{n} P(tj = k | X0 = i) π^(n−k)_{j,j}.
(n) (n)
Exercise 10.56. Show that lim_{n→+∞} π^(n)_{i,j} = lim_{n→+∞} π^(n)_{j,j}.
Exercise 10.57. Show that Rk is a homogeneous Markov chain
with transition probabilities
fn := P (R1 = m + n|R0 = m) = P (tj = n|X0 = j)
for any n ∈ N ∪ {∞}, which is the same for every m ∈ N0 .

Notice that

τj = E(tj | X0 = j) = Σ_{n≥1} n fn + ∞ · f∞.

Suppose that

f∞ = P(R1 = ∞ | R0 = m) = P(tj = ∞ | X0 = j) > 0,

so that j is transient for the chain Xn, with mean recurrence time τj = ∞. Also,

lim_{n→+∞} π^(n)_{j,j} = 0,

which means that it is equal to 1/τj.
It remains to be proved that for f∞ = 0 we have

lim_{n→+∞} π^(n)_{j,j} = 1/τj.
Take first the sup limit

a0 = lim sup π^(n)_{j,j}.

By considering a subsequence kn for which the limit is a0, we use the diagonalization argument to obtain a further subsequence for which there exists

ak = lim π^(kn−k)_{j,j} ≤ a0

for any k ∈ N. Here we make the assumption that π^(k)_{j,j} = 0 for any k ≤ −1.

Recall that

π^(n)_{j,j} = Σ_{m=1}^{n} fm π^(n−m)_{j,j}.

Taking the limit along the sequence kn, by the dominated convergence theorem, we get

a0 = Σ_{m=1}^{+∞} fm am.

Since Σ_k fk = 1 and ak ≤ a0, then ak = a0 =: a for every k ∈ D = {n ∈ N : fn > 0}. Similarly,

ak = lim Σ_{m=1}^{kn−k} fm π^(kn−k−m)_{j,j} = Σ_{m=1}^{+∞} fm ak+m

and ak = a for all k ∈ D ⊕ D = {n1 + n2 ∈ N : n1 , n2 ∈ D}. Proceeding by induction and using the fact that the gcd of D is 1, we get that ak = a for k sufficiently large. This implies that ak = a for all k.
Exercise 10.58. Show that

Σ_{k=1}^{n} ( Σ_{i=k}^{+∞} fi ) π^(n−k+1)_{j,j} = Σ_{k=1}^{n} fk.

Taking the limit along kn, the above equality becomes

lim sup π^(n)_{j,j} Σ_{k=1}^{+∞} Σ_{i=k}^{+∞} fi = 1.

Exercise 10.59. Show that

Σ_{k=1}^{+∞} Σ_{i=k}^{+∞} fi = τj.

So,

lim sup π^(n)_{i,j} = 1/τj.

The same idea can be used for the lim inf, proving that the limit exists. This completes the proof of Theorem 10.51.
CHAPTER 11

Martingales

1. The martingale strategy

In the 18th century there was a popular strategy to guarantee a


profit when gambling in a casino. We assume that the game is fair; for simplicity, it is the tossing of a fair coin. Starting with an initial
capital K0 a gambler bets a given amount b. Winning the game means
that the capital is now K1 = K0 + b and there is already a profit. A
loss implies that K1 = K0 − b. The martingale strategy consists in
repeating the game until we get a win, while doubling the previous bet
at each time. That is, if there is a first loss, at the second game we bet
2b. If we win it then K2 = K0 − b + 2b = K0 + b and there is a profit.
If it takes n games to obtain a win, then

Kn = Kn−1 + 2^{n−1} b = K0 − Σ_{i=1}^{n−1} 2^{i−1} b + 2^{n−1} b = K0 + b

and there is a profit.
and there is a profit.
In conclusion, if we wait long enough until getting a win (and it is quite unlikely that one would obtain only losses in a reasonably fair game), then we will obtain a profit of b. It seems a great strategy, without risk. Why is everybody not doing it? What would happen if all players were doing it? What is the catch?
The problem with the strategy is that the capital K0 is finite. If it takes too long to obtain a win (say n times), then

Kn−1 = K0 − (2^{n−1} − 1) b.
Bankruptcy occurs when Kn−1 ≤ 0, i.e. waiting n steps with
n ≥ log2 (K0 /b + 1) + 1.
For example, if we start with the Portuguese GDP in 2015¹:

K0 = €198 920 000 000,

and choosing b = €1, then we can afford to lose 38 consecutive times. On the other hand, if we assume that getting 10 straight losses in a row is definitely very rare and are willing to risk it, then we need to assemble an initial capital of K0 = €511b.
¹ cf. PORDATA, http://www.pordata.pt/

We can formulate the probabilistic model in the following way. Consider τ to be the first time we get a win. We call it a stopping time and denote it by

τ = min{n ≥ 1 : Yn = 1},

where the Yn 's are iid random variables with distribution

P(Yn = 1) = 1/2   and   P(Yn = −1) = 1/2

(the tossing of a coin). Notice that if Yn = −1 for every n ∈ N, then we set τ = +∞.

Exercise 11.1. Show that τ : Ω → N∪{+∞} is a random variable.

At time n the capital Kn is thus the random variable

Kn = K0 + b Σ_{i=1}^{τ∧n} 2^{i−1} Yi,

where τ ∧ n = min{τ, n}.

The probability of winning in finite time is

P(τ < +∞) = P( ∪_{n=1}^{+∞} {Yn = 1} )
          = 1 − P( ∩_{n=1}^{+∞} {Yn = −1} )
          = 1 − Π_{n=1}^{+∞} P(Yn = −1)
          = 1.

So, with full probability the gambler eventually wins. Since

P(τ = n) = P(Y1 = −1, . . . , Yn−1 = −1, Yn = 1) = 1/2^n,

we easily determine the mean time of getting a win:

E(τ) = Σ_{n∈N} n P(τ = n) = Σ_{n∈N} n/2^n = 2.

So, on average it does not take too long to get a win. However, what matters to avoid ruin is the mean capital just before a win. Whilst E(Kτ) = K0 + b, we have

E(Kτ−1) = K0 − E( b Σ_{i=1}^{τ−1} 2^{i−1} )
        = K0 − b E(2^{τ−1} − 1)
        = K0 − b Σ_{n=1}^{+∞} P(τ = n)(2^{n−1} − 1)
        = K0 − b Σ_{n=1}^{+∞} (1/2^n)(2^{n−1} − 1)
        = −∞.
That is, the mean value for the capital just before winning is −∞.
Notice also that E(K1) = K0. In general, for any n, since Kn+1 = Kn + 2^n b Yn+1 and Yn+1 is independent of Kn (Kn is a sum involving only Y1 , . . . , Yn and the sequence Yn is independent), we have

E(Kn+1 | Kn) = Kn + 2^n b E(Yn+1 | Kn) = Kn.

The martingale strategy is therefore an example of a fair game in


the sense that knowing your capital at time n, the best prediction of
Kn+1 is actually Kn . There is therefore no advantage but only risk.
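The bookkeeping of the strategy can be checked directly; a small illustrative sketch in Python (the function name and the numbers are arbitrary):

```python
# Doubling strategy: n - 1 losses at stakes b, 2b, 4b, ..., then one win.
def final_capital(K0, b, n):
    K, stake = K0, b
    for _ in range(n - 1):  # the first n - 1 games are losses
        K -= stake
        stake *= 2
    return K + stake        # the n-th game is won

# K_tau = K_0 + b regardless of how long the win takes
profits = {final_capital(100, 1, n) - 100 for n in range(1, 20)}

# Truncated mean waiting time: sum_{n >= 1} n / 2^n = 2
mean_tau = sum(n / 2 ** n for n in range(1, 60))
```

The profit is always exactly b, while the intermediate capital K − stake goes deeply negative for large n, which is the catch discussed above.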

2. General definition of a martingale

Let (Ω, F, P ) be a probability space. An increasing sequence of σ-subalgebras

F1 ⊂ F2 ⊂ · · · ⊂ F

is called a filtration.
A stochastic process Xn is a martingale with respect to a filtration
Fn if for every n ∈ N we have that
(1) Xn is Fn -measurable (we say that Xn is adapted to the filtra-
tion Fn )
(2) Xn is integrable (i.e. E(|Xn |) < +∞).
(3) E(Xn+1 |Fn ) = Xn , P -a.e.
Remark 11.2.
(1) Given a stochastic process Xn , the sequence of σ-algebras
Fn = σ (X1 , . . . , Xn )
is a filtration and it is called the natural filtration. Notice that
σ (X1 , . . . , Xn ) ⊂ σ (X1 , . . . , Xn+1 ) .

(2) It is simple to check that the expected value of the random


variables Xn in a martingale is constant:
E(Xn ) = E(E(Xn+1 |Fn )) = E(Xn+1 ).
(3) Since Xn is Fn -measurable we have that E(Xn |Fn ) = Xn . So,
E(Xn+1 |Fn ) = Xn is equivalent to E(Xn+1 − Xn |Fn ) = 0.

A sub-martingale is defined whenever


Xn ≤ E(Xn+1 |Fn ), P -a.e.
and a super-martingale requires that
Xn ≥ E(Xn+1 |Fn ), P -a.e.
So, E(Xn) increases for sub-martingales and it decreases for super-martingales.
In some contexts a martingale is known as a fair game, a sub-
martingale is a favourable game and a super-martingale is an unfair
game. The interpretation of a martingale as a fair game (there is risk,
there is no arbitrage) is very relevant in the application to finance.

3. Examples

Example 11.3. Consider a sequence of independent and integrable


random variables Yn , and the natural filtration Fn = σ(Y1 , . . . , Yn ).
(1) Let
Xn = Y1 + · · · + Yn .
Then Xn is Fn -measurable since any Yi is Fi -measurable and
Fi ⊂ Fn for i ≤ n. Furthermore,
E(|Xn |) ≤ E(|Y1 |) + · · · + E(|Yn |) < +∞,

i.e. Xn is integrable. Since all the Yn ’s are independent,


E(Xn+1 − Xn |Fn ) = E(Yn+1 |Y1 , . . . , Yn ) = E(Yn+1 ).
Therefore, Xn is a martingale iff E(Yn ) = 0 for every n ∈ N.
(2) Let
Xn = Y1 Y2 . . . Yn .
It is also simple to check that Xn is Fn -measurable for each
n ∈ N. In addition, because the Yn ’s are independent as well
as the |Yn |’s,
E(|Xn |) = E(|Y1 |) . . . E(|Yn |) < +∞.

Now,
E(Xn+1 − Xn |Fn ) = Xn E(Yn+1 − 1).
Thus, Xn is a martingale iff E(Yn ) = 1 for every n ∈ N.
(3) Consider now the stochastic process
Xn = (Y1 + · · · + Yn )²

assuming that Yn2 is also integrable. Clearly Xn is Fn -measurable


for each n ∈ N. It is also integrable since
E(|Xn |) ≤ n (E(|Y1 |²) + · · · + E(|Yn |²)) < +∞,

where we have used the Cauchy–Schwarz inequality. Finally,


E(Xn+1 − Xn |Fn ) = E(2(Y1 + · · · + Yn )Yn+1 |Fn ) + E((Yn+1 )²)
= 2(Y1 + · · · + Yn )E(Yn+1 ) + E((Yn+1 )²) ≥ 2(Y1 + · · · + Yn )E(Yn+1 ).
So, Xn is a sub-martingale if E(Yn ) = 0 for every n ∈ N.
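The martingale property in item (1) can be verified exactly on a finite sample space. The following sketch (an illustration added here, not part of the original argument) enumerates all histories of fair coin flips Yi = ±1 and checks that E(Xn+1 | Y1 , . . . , Yn ) = Xn on every atom of the natural filtration:

```python
from itertools import product
from fractions import Fraction

def is_martingale_step(n):
    """For iid fair Y_i = +-1 and X_n = Y_1 + ... + Y_n, check that
    E(X_{n+1} | Y_1, ..., Y_n) = X_n on every atom of F_n, by averaging
    over the two equally likely values of Y_{n+1}."""
    for ys in product([-1, 1], repeat=n):          # one atom per history
        xn = sum(ys)
        cond_exp = Fraction(1, 2) * (xn - 1) + Fraction(1, 2) * (xn + 1)
        if cond_exp != xn:
            return False
    return True

assert all(is_martingale_step(n) for n in range(1, 6))
```

Exact rational arithmetic avoids floating point tolerances; the check is a computation, of course, not a proof.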
Example 11.4 (Doob’s process). Consider an integrable random
variable Y and a filtration Fn . Let
Xn = E(Y |Fn ).
By definition of the conditional expectation Xn is Fn -measurable. It is
also integrable since
E(|Xn |) = E(|E(Y |Fn )|) ≤ E(E(|Y | |Fn )) = E(|Y |).
Finally,
E(Xn+1 −Xn |Fn ) = E(E(Y |Fn+1 )−E(Y |Fn )|Fn ) = E(Y −Y |Fn ) = 0.
That is, Xn is a martingale.

4. Stopping times

Let (Ω, F, P ) be a probability space and Fn a filtration. A function


τ : Ω → N ∪ {+∞} is a stopping time iff {τ = n} ∈ Fn for every n ∈ N.
Proposition 11.5. The following propositions are equivalent:
(1) {τ = n} ∈ Fn for every n ∈ N.
(2) {τ ≤ n} ∈ Fn for every n ∈ N.
(3) {τ > n} ∈ Fn for every n ∈ N.
Exercise 11.6. Prove it.
Proposition 11.7. τ is a random variable (F-measurable).
Exercise 11.8. Prove it.

Example 11.9. Let Xn be a stochastic process, B ∈ B(R) and Fn is


a filtration such that for each n ∈ N we have that Xn is Fn -measurable.
Consider the time of the first visit to B given by
τ = min{n ∈ N : Xn ∈ B}.
Notice that τ = +∞ if Xn ∉ B for every n ∈ N. Hence,
{τ = n} = {X1 ∉ B, . . . , Xn−1 ∉ B, Xn ∈ B} = {Xn ∈ B} ∩ ⋂_{i=1}^{n−1} {Xi ∈ B^c } ∈ Fn .

That is, τ is a stopping time.


Exercise 11.10. Show that
E(τ ) = ∑_{n=1}^{+∞} P (τ ≥ n). (11.1)
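Identity (11.1) can be checked exactly for a stopping time with finite range; a small sketch (not part of the text), taking τ uniformly distributed on {1, . . . , 6}, e.g. the value of a fair die roll:

```python
from fractions import Fraction

# tau uniform on {1, ..., 6}; both sides of (11.1) computed exactly
pmf = {n: Fraction(1, 6) for n in range(1, 7)}

expectation = sum(n * p for n, p in pmf.items())
tail_sum = sum(sum(p for m, p in pmf.items() if m >= n) for n in range(1, 7))

assert expectation == tail_sum == Fraction(7, 2)
```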

5. Stochastic processes with stopping times

Let Xn be a stochastic process and Fn is a filtration such that for


each n ∈ N we have that Xn is Fn -measurable. Given a stopping time
τ with respect to Fn , we define the sequence Xn stopped at τ by
Zn = Xτ ∧n ,
where τ ∧ n = min{τ, n}.
Exercise 11.11. Show that
Zn = ∑_{i=1}^{n−1} Xi X{τ =i} + Xn X{τ ≥n} .

Proposition 11.12.
E(Zn+1 − Zn |Fn ) = E(Xn+1 − Xn |Fn )X{τ ≥n+1} .
Exercise 11.13. Prove it.
Remark 11.14. From the above result we can conclude that:
(1) If Xn is a martingale, then Zn is also a martingale.
(2) If Xn is a submartingale, then Zn is also a sub-martingale.
(3) If Xn is a supermartingale, then Zn is also a super-martingale.

Consider now the term in the sequence Xn corresponding to the


stopping time τ ,
Xτ = ∑_{n=1}^{+∞} Xn X{τ =n} .
Clearly, it is a random variable.

Recall that a sequence Zn of random variables is dominated if there


is an integrable function g ≥ 0 such that |Zn | ≤ g for every n ∈ N.
Theorem 11.15 (Optional stopping). Let Xn be a martingale. If
(1) P (τ < +∞) = 1
(2) Xτ ∧n is dominated,
then E(Xτ ) = E(X1 ).

Proof. Since P (τ < +∞) = 1 we have that limn→+∞ Xτ ∧n =


Xτ P -a.e. Hence, by the dominated convergence theorem using the
domination,
 
E(Xτ ) = E(lim_{n→+∞} Xτ ∧n ) = lim_{n→+∞} E(Xτ ∧n ) = E(Xτ ∧1 ) = E(X1 ),

where we have used the fact that Xτ ∧n is also a martingale. 

The domination condition that is required in the optional stopping


theorem above is implied by other conditions that might be simpler to
check.
Proposition 11.16. If any of the following holds:
(1) there is k ∈ N such that P (τ ≤ k) = 1
(2) E(τ ) < +∞ and there is M > 0 such that for any n ∈ N
E(|Xn+1 − Xn ||Fn ) ≤ M,
then Xτ ∧n is dominated.
Exercise 11.17. Prove it.
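Condition (1) of Proposition 11.16 gives the simplest setting in which to test the optional stopping theorem numerically. A sketch (not from the text): for the simple symmetric random walk, stop at the first visit to {−2, 2}, capped at a fixed horizon so that the stopping time is bounded:

```python
from itertools import product
from fractions import Fraction

def stopped_value(path, barrier=2):
    """X_{tau ^ len(path)} along one +-1 path, where tau is the first
    time |X_n| reaches the barrier."""
    x = 0
    for step in path:
        x += step
        if abs(x) == barrier:
            break
    return x

cap = 6
paths = list(product([-1, 1], repeat=cap))   # each path has probability 2**-cap
# tau ^ cap is a bounded stopping time, so E(X_{tau ^ cap}) = E(X_1) = 0
mean = sum(Fraction(stopped_value(p)) for p in paths) / len(paths)
assert mean == 0
```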

A result related to the above optional stopping theorem (not
requiring a martingale) is the following.
Theorem 11.18 (Wald’s equation). Let Yn be a sequence of integrable
iid random variables, Xn = ∑_{i=1}^{n} Yi , Fn = σ(Y1 , . . . , Yn ) and τ
a stopping time with respect to Fn . If E(τ ) < +∞, then
E(Xτ ) = E(X1 ) E(τ ).

Proof. Recall that


Xτ = ∑_{n=1}^{+∞} Yn X{τ ≥n} .

So,
E(Xτ ) = ∑_{n=1}^{+∞} E(Yn X{τ ≥n} ).

Recall that X{τ ≥n} = 1 − X{τ ≤n−1} , which implies that


σ(X{τ ≥n} ) ⊂ Fn−1 .
Since Fn−1 and σ(Yn ) are independent, it follows that Yn and X{τ ≥n}
are independent random variables.
Finally, using (11.1),
E(Xτ ) = E(Y1 ) ∑_{n=1}^{+∞} P (τ ≥ n) = E(X1 ) E(τ ).
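Wald's equation can be tested exactly on a small example (an illustration, not in the original): take Yi uniform on {1, 2} and τ the first time the partial sum reaches 3, so that τ ≤ 3 and E(τ ) < +∞ automatically:

```python
from itertools import product
from fractions import Fraction

def stop(ys):
    """Return (tau, X_tau) for the path ys, where tau is the first n
    with Y_1 + ... + Y_n >= 3 (here tau <= 3 always)."""
    s = 0
    for n, y in enumerate(ys, start=1):
        s += y
        if s >= 3:
            return n, s

triples = list(product([1, 2], repeat=3))    # each with probability 1/8
p = Fraction(1, len(triples))
e_tau = sum(p * stop(ys)[0] for ys in triples)
e_xtau = sum(p * stop(ys)[1] for ys in triples)

# E(X_tau) = E(Y_1) E(tau), with E(Y_1) = 3/2
assert e_xtau == Fraction(3, 2) * e_tau == Fraction(27, 8)
```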

APPENDIX A

Things that you should know before starting

1. Notions of mathematical logic

1.1. Propositions. A proposition is a statement that can be qual-


ified either as true (T) or else as false (F) – there is no third way.
Example A.1.
(1) p =“Portugal is bordered by the Atlantic ocean” (T)
(2) q =“zero is an integer number” (T)
(3) r =“Sevilla is the capital city of Spain” (F)
Remark A.2. There are statements that cannot be qualified as
true or false. For instance, “This sentence is false”. If it is false, then
it is true (contradiction). On the other hand, if it is true, then it is
false (again a contradiction). Statements of this kind are not considered
to be propositions since they lead us to a contradiction (simultaneously
true and false). Therefore, they will not be the object of our study.

The goal of mathematical logic is to relate propositions through
their logical values: T or F. We are especially interested in those that
are T.

1.2. Operations between propositions. Let p and q be propo-


sitions. We define the following operations between propositions. The
result is still a proposition.
• ∼ p, not p (p is not satisfied).
• p ∧ q, p and q (both propositions are satisfied).
• p ∨ q, p or q (at least one of the propositions is satisfied).
• p ⇒ q, p implies q (if p is satisfied, then q is also satisfied).
• p ⇔ q, p is equivalent to q (p is satisfied iff q is satisfied).
Example A.3. Using the propositions p, q and r in Example A.1,
(1) ∼ p =“Portugal is not bordered by the Atlantic Ocean” (F)
(2) p ∧ q =“Portugal is bordered by the Atlantic Ocean and zero
is an integer number” (T)
(3) p ∨ r =“Portugal is bordered by the Atlantic Ocean or Sevilla
is the capital city of Spain” (T)

Example A.4.
(1) “If Portugal is bordered by the Atlantic Ocean, then Portugal
is bordered by the sea” (T)
(2) “x = 0 iff |x| = 0” (T)

The logical value of a proposition obtained by operations between
propositions is given by the following table:

p q ∼p p∧q p∨q p⇒q ∼q ⇒∼ p ∼ p ∨ q p ⇔ q (∼ p∧ ∼ q) ∨ (p ∧ q)


T T F T T T T T T T
T F F F T F F F F F
F T T F T T T T F F
F F T F F T T T T T
Exercise A.5. Show that the following propositions are true
(1) ∼ (∼ p) ⇔ p
(2) (p ⇒ q) ⇔ (∼ q ⇒∼ p)
(3) ∼ (p ∧ q) ⇔ (∼ p) ∨ (∼ q)
(4) ∼ (p ∨ q) ⇔ (∼ p) ∧ (∼ q)
(5) ((p ⇒ q) ∧ (q ⇒ p)) ⇔ (p ⇔ q)
(6) p ∧ (q ∨ r) ⇔ ((p ∧ q) ∨ (p ∧ r))
(7) p ∨ (q ∧ r) ⇔ ((p ∨ q) ∧ (p ∨ r))
(8) (p ⇔ q) ⇔ ((∼ p∧ ∼ q) ∨ (p ∧ q))
Example A.6. Consider the following propositions:
• p =“Men are mortal”
• q =“Dogs live less than men”
• r =“Dogs are not immortal”
So, the relation ((p ∧ q) ⇒ r) ⇔ (∼ r ⇒ (∼ p∨ ∼ q)) can be read as:
Saying that “if men are mortal and dogs live less than
men, then dogs are mortal ”, is the same as saying
that “if dogs are immortal, then men are immortal or
dogs live more than men”.

1.3. Symbols. In mathematical writing the following symbols are
frequently used:
• ∀ for all.
• ∃ there is.
• : such that.
• , usually means “and”.
Example A.7.

(1) ∀x≥0 ∃y≥1 : x + y ≥ 1. “For any non-negative x there is y greater
than or equal to 1 such that x + y is greater than or equal to 1”. (T)
(2) ∀y multiple of 4 ∃x≥0 : −1/2 < x + y < 1/2. “For any y multiple
of 4 there is x ≥ 0 such that x + y is strictly between −1/2 and
1/2”. (F)

We can apply the ∼ operator to quantified propositions:
∼ (∃x p(x)) ⇔ ∀x ∼ p(x),
where p is a proposition that depends on x.

1.4. Mathematical induction. Let p(n) be a proposition that


depends on a number n that can be 1, 2, 3, . . . . We want to show that
p(n) is T for any n. The mathematical induction principle is a method
that allows one to prove this for every such n in just two steps:
(1) Show that p(1) is T.
(2) Suppose that p(m) is T for a fixed m, then show that the next
proposition p(m + 1) is also T.
This method works because if p(1) is T and the truth of each proposition
implies that of the next one, then p(n) is T for n = 2, 3, . . . as well.
Example A.8. Consider the propositions p(n) given for each n by
1 + 2 + · · · + n = n(n + 1)/2.
For n = 1, p(1) reduces simply to 1 = 1, which is clearly T.
Suppose now that p(m) holds for a fixed m, i.e. assume that
1 + 2 + · · · + m = m(m + 1)/2. Thus,
1 + · · · + m + (m + 1) = m(m + 1)/2 + (m + 1) = (m + 1)(m + 2)/2.
That is, we have just shown that p(m + 1) is T. Therefore, ∀n p(n) is
T.
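The closed form just proved can also be confirmed by direct computation for many values of n (a trivial sketch, which of course does not replace the induction proof):

```python
def gauss_sum(n):
    """Closed form for 1 + 2 + ... + n proved above by induction."""
    return n * (n + 1) // 2

# brute-force check for the first thousand values of n
assert all(sum(range(1, n + 1)) == gauss_sum(n) for n in range(1, 1001))
```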

This is one of the most popular methods in all sub-areas of mathematics,
in the computational sciences, in economics, in finance and in most
sciences that use quantitative methods. A professional mathematician
has the obligation to master it.
Exercise A.9. Show the binomial theorem: for any a, b ∈ R and
n ∈ N we have
(a + b)^n = ∑_{k=0}^{n} C_k^n a^k b^{n−k} ,
where
C_k^n = n!/(k!(n − k)!).
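A brute-force check of the identity over a grid of integer values of a and b (a sketch, not part of the exercise; Python's math.comb computes the coefficient C_k^n):

```python
from math import comb

def binomial_expansion(a, b, n):
    """Right-hand side of the binomial theorem, exact integer arithmetic."""
    return sum(comb(n, k) * a ** k * b ** (n - k) for k in range(n + 1))

assert all(binomial_expansion(a, b, n) == (a + b) ** n
           for a in range(-3, 4) for b in range(-3, 4) for n in range(8))
```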

1.5. Mathematical writing and proofs. What really distinguishes
Mathematics from any other science are ‘proofs’. These are
logical constructions to show that a proposition is true beyond any
doubt. This reliability is unique to Mathematics.
The mathematical literature is in general based on presenting definitions
and then demonstrating propositions yielding several consequences
and properties that can be useful. A proposition is called
either a Lemma, a Proposition or a Theorem by increasing order of importance
(but in many cases it mostly depends on the subjective choice
of the writer). A Corollary is a simple consequence of a Theorem. A
Conjecture is just a guess that can turn into a proposition if proved to
be correct.

2. Set theory notions

2.1. Sets. A set is a collection of elements represented in the form:


Ω = {a, b, c, . . . }
where a, b, c, . . . are the elements of Ω. The dots are added to replace
all other elements that one does not bother or is not able to write,
similarly to the use of the abbreviation etc.
A subset A of Ω is a set whose elements are also in Ω, and we write
A ⊂ Ω. We also write Ω ⊃ A to mean the same thing.
Instead of naming all the elements of a set (an often impossible
task), sometimes it is necessary to define a set through a given property
that we want satisfied. So, we also use the following representation for
a set:
Ω = {x : p(x)}
where p(x) is a proposition that depends on x. This can be read as “Ω
is the set of all x such that p(x) holds”.
We write
a∈A
to mean that a is an element of A (a is in A). Sometimes it is convenient
to write instead A ∋ a. If a is not in A we write a ∉ A.
Example A.10.
(1) 1 ∈ {1}
(2) {1} ∉ {1}
(3) {1} ∈ {{1}, {1, 2}, {1, 2, 3}}.

A subset A of Ω corresponding to all the elements x of Ω that satisfy


a proposition q(x) is denoted by
A = {x ∈ Ω : q(x)}.

If a set has a finite number of elements it is called finite. Otherwise,


it is an infinite set. The set with zero elements is called the empty set
and it is denoted by {} or ∅.
Example A.11.
(1) A = {0, 1, 2, . . . , 9} is finite (it has 10 elements).
(2) The set of natural numbers
N = {1, 2, 3, . . . }
is infinite.
(3) The set of integers
Z = {. . . , −2, −1, 0, 1, 2, . . . }
is infinite.
(4) The set of rational numbers (ratios of integers)
Q = {p/q : p ∈ Z, q ∈ N}
is infinite.
(5) The set of real numbers R, consisting of numbers of the form
a0 .a1 a2 a3 . . .
where a0 ∈ Z and ai ∈ {0, 1, 2, . . . , 9} for any i ∈ N, is also
infinite.

2.2. Relation between sets. Let A and B be any two sets.


• A = B (A equals B) iff (x ∈ A ⇔ x ∈ B) ∨ (A = ∅ ∧ B = ∅).
• A ⊂ B (A is contained in B) iff (x ∈ A ⇒ x ∈ B) ∨ (A = ∅).
Example A.12. N ⊂ Z ⊂ Q ⊂ R.
Properties A.13.
(1) ∅⊂A
(2) (A = B ∧ B = C) ⇒ A = C
(3) A⊂A
(4) (A ⊂ B ∧ B ⊂ A) ⇒ A = B
(5) (A ⊂ B ∧ B ⊂ C) ⇒ A ⊂ C
Remark A.14. If
A = {x : p(x)} and B = {x : q(x)}, (A.1)
then
A = B ⇔ ∀x (p(x) ⇔ q(x)) and A ⊂ B ⇔ ∀x (p(x) ⇒ q(x)).

2.3. Operations between sets. Let A, B ⊂ Ω.


• A ∩ B = {x : x ∈ A ∧ x ∈ B} is the intersection between A
and B.
• A ∪ B = {x : x ∈ A ∨ x ∈ B} is the union between A and B.
Representing the sets as in (A.1), we have
A ∩ B = {x : p(x) ∧ q(x)} and A ∪ B = {x : p(x) ∨ q(x)}.
Example A.15. Let A = {x ∈ R : |x| ≤ 1} and B = {x ∈ R : x ≥
0}. Thus, A ∩ B = {x ∈ R : 0 ≤ x ≤ 1} and A ∪ B = {x ∈ R : x ≥ −1}.
Properties A.16. Let A, B, C ⊂ Ω. Then,
(1) A ∩ B = B ∩ A and A ∪ B = B ∪ A (commutativity)
(2) A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C
(associativity)
(3) A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪
B) ∩ (A ∪ C) (distributivity)
(4) A ∩ A = A and A ∪ A = A (idempotence)
(5) A ∩ (A ∪ B) = A and A ∪ (A ∩ B) = A (absorption)

Let A, B ⊂ Ω.
• A \ B = {x ∈ Ω : x ∈ A ∧ x ∉ B} is the difference between A
and B (A minus B).
• Ac = {x ∈ Ω : x ∉ A} is the complementary set of A in Ω.
As in (A.1) we can write:
A \ B = {x : p(x)∧ ∼ q(x)} and Ac = {x : ∼ p(x)}.
Properties A.17.
(1) A \ B = A ∩ B c
(2) A ∩ Ac = ∅
(3) A ∪ Ac = Ω.

It is also possible to define without difficulty the intersection and
union of infinitely many sets. Let I be a set, which we will call the index
set. Its elements index a family of sets Aα ⊂ Ω with α ∈ I. Hence,
⋂_{α∈I} Aα = {x : ∀α∈I x ∈ Aα } and ⋃_{α∈I} Aα = {x : ∃α∈I x ∈ Aα }.

Example A.18.
(1) Let An = [n, n + 1] ⊂ R, with n ∈ N (notice that I = N).
Then
⋂_{n∈N} An = ∅ and ⋃_{n∈N} An = [1, +∞[.

(2) Let Aα = [0, | sin α|], α ∈ I = R. Then


⋂_{α∈R} Aα = {0} and ⋃_{α∈R} Aα = [0, 1].

Proposition A.19 (De Morgan laws).
(1) (⋂_{α∈I} Aα )^c = ⋃_{α∈I} Aα ^c
(2) (⋃_{α∈I} Aα )^c = ⋂_{α∈I} Aα ^c

Exercise A.20. Prove it.
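For finite families the two laws can also be tested mechanically (a sketch, not part of the exercise):

```python
# De Morgan's laws on a finite family of subsets of omega
omega = set(range(10))
family = [{0, 1, 2}, {1, 2, 3, 4}, {2, 4, 6, 8}, {2, 3, 5, 7}]

def complement(s):
    return omega - s

intersection = set(omega)
for s in family:
    intersection &= s
union = set().union(*family)

assert complement(intersection) == set().union(*map(complement, family))
assert complement(union) == set(omega).intersection(*map(complement, family))
```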

If two sets do not intersect, i.e. A ∩ B = ∅, we say that they are


disjoint. A family of sets Aα , α ∈ I, is called pairwise disjoint if for
any α, β ∈ I such that α ≠ β we have Aα ∩ Aβ = ∅ (each pair of sets
in the family is disjoint).

3. Function theory notions

Given two sets A and B, a function f is a correspondence assigning to
each x ∈ A one and only one y = f (x) ∈ B. It is also called a map
or a mapping.
Representation:
f: A→B
x 7→ y = f (x).

Notation:
• A is the domain of f .
• f (C) = {f (x) ∈ B : x ∈ C} is the image of C ⊂ A.
• f −1 (D) = {x ∈ A : f (x) ∈ D} is the pre-image of D ⊂ B.
Example A.21.
(1) Let A = {a, b, c, d}, B = N and f a function f : A → B defined
by the following table:
x a b c d
f (x) 3 5 7 9
Then, f ({b, c}) = {5, 7}, f −1 ({1}) = ∅, f −1 ({3, 5}) = {a, b},
f −1 ({n ∈ N : n/2 ∈ N}) = ∅.

(2) A function whose domain is N is called a sequence. For ex-


ample, consider u : N → {−1, 1} given by un = u(n) = (−1)n .
Then, u(N) = {−1, 1}, u−1 ({1}) = {2n : n ∈ N}, u−1 ({−1}) =
{2n − 1 : n ∈ N}.
(3) Let f : R → R,
f (x) = |x| if x ≤ 1, and f (x) = 2 if x > 1.
Thus, f (R) = R₀⁺ , f ([1, +∞[) = {1, 2}, f −1 ([2, +∞[) =
]−∞, −2] ∪ ]1, +∞[, f −1 ({1, 2}) = {−1, −2} ∪ [1, +∞[.
(4) For any set Ω, the identity function is f : Ω → Ω with f (x) =
x. We use the notation f = Id.
(5) Let A ⊂ Ω. The indicator function is XA : Ω → R with
XA (x) = 1 if x ∈ A, and XA (x) = 0 if x ∉ A.

The pre-image behaves nicely with the union, intersection and
complement of sets. Let I be the set of indices of families Aα ⊂ A and
Bα ⊂ B with α ∈ I.
Proposition A.22.
(1) f (⋃_{α∈I} Aα ) = ⋃_{α∈I} f (Aα )
(2) f (⋂_{α∈I} Aα ) ⊂ ⋂_{α∈I} f (Aα )
(3) f −1 (⋃_{α∈I} Bα ) = ⋃_{α∈I} f −1 (Bα )
(4) f −1 (⋂_{α∈I} Bα ) = ⋂_{α∈I} f −1 (Bα )
(5) f −1 (Bα ^c ) = f −1 (Bα )^c
(6) f (f −1 (Bα )) ⊂ Bα
(7) f −1 (f (Aα )) ⊃ Aα
Exercise A.23. Prove it.
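Items (2)–(5) can be explored on a small finite example (a sketch, not part of the exercise), using a non-injective map so that the inclusion in (2) is strict:

```python
A = set(range(-5, 6))

def f(x):
    return x * x                          # a non-injective example map

def preimage(B):
    return {x for x in A if f(x) in B}

def image(S):
    return {f(x) for x in S}

B1, B2 = {0, 1, 4}, {4, 9, 16, 100}
assert preimage(B1 | B2) == preimage(B1) | preimage(B2)       # item (3)
assert preimage(B1 & B2) == preimage(B1) & preimage(B2)       # item (4)
codomain = image(A) | B1 | B2
assert preimage(codomain - B1) == A - preimage(B1)            # item (5)

S1, S2 = {-2, -1, 0}, {0, 1, 2}
assert image(S1 & S2) <= image(S1) & image(S2)                # item (2)
```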

3.1. Injectivity and surjectivity. According to the definition of a
function f : A → B, to each x in the domain there corresponds a unique
f (x) in the image. Notice that nothing is said about the possibility of
another point x′ in the domain having the same image f (x′ ) = f (x).
This does not happen for injective functions. On the other hand, it
might be that B is different from f (A). This does not happen for
surjective functions.
• f is injective (one-to-one) iff f (x1 ) = f (x2 ) ⇒ x1 = x2 .
• f is surjective (onto) iff f (A) = B.
• f is a bijection iff it is injective and surjective.

3.2. Composition of functions. After computing g(x) as the
image of x by a function g, in many situations we want to apply yet
another function (or the same one) to g(x), i.e. f (g(x)). It is said that
we are composing two functions. Let g : A → B and f : C → D. We
define the composition function in the following way:
f ◦ g : g −1 (C) → D, (f ◦ g)(x) = f (g(x))
(read as f composed with g or f after g).
Example A.24. Let g : R → R, g(x) = 1 − 2x and f : [1, +∞[ → R,
f (x) = √(x − 1). We have that g −1 ([1, +∞[) = R₀⁻ . So, f ◦ g : ]−∞, 0] →
R, f ◦ g(x) = f (1 − 2x) = √(−2x).

An injective function f : A → B is also called invertible because we


can find its inverse function f −1 : f (A) → A such that
∀x∈A f −1 (f (x)) = x and ∀y∈f (A) f (f −1 (y)) = y.
Example A.25.
(1) f : R → R, f (x) = x², is not invertible since, e.g., f (1) =
f (−1). However, if we restrict the domain to R₀⁺ , it becomes
invertible. I.e. g : R₀⁺ → R, g(x) = x² is invertible and
g(R₀⁺ ) = R₀⁺ . From y = x² ⇔ x = √y, we write
g −1 : R₀⁺ → R₀⁺ , g −1 (x) = √x.
(2) Let sin : R → R be the sine function. This function is invertible
if restricted to certain sets. For example, sin : [−π/2, π/2] → R
is invertible. Notice that sin([−π/2, π/2]) = [−1, 1]. Then, we
define the arc-sine function arcsin : [−1, 1] → [−π/2, π/2] that to
each x ∈ [−1, 1] assigns the angle in [−π/2, π/2] whose sine is x.
Finally, we have that arcsin(sin x) = sin(arcsin x) = x.
(3) When restricted to [0, π] the cosine can also be inverted. The
arc-cosine function arccos : [−1, 1] → [0, π] at x ∈ [−1, 1] is
the angle whose cosine is x. Consequently, arccos(cos x) =
cos(arccos x) = x.

(4) The tangent function in ]−π/2, π/2[ has the inverse given by the
arc-tangent function arctg : R → ]−π/2, π/2[ such that arctg(tg x) =
tg(arctg x) = x.
(5) The exponential function f : R → R, f (x) = e^x , is invertible
and f (R) = R+ . Its inverse is the logarithm function
f −1 : R+ → R, f −1 (x) = log x.
ible and f (R) = R+ . Its inverse is the logarithm function
f −1 : R+ → R, f −1 (x) = log x.
Proposition A.26. If f and g are invertible, then f ◦g is invertible
and
(f ◦ g)−1 = g −1 ◦ f −1 .

Proof.

• If f ◦ g(x1 ) = f ◦ g(x2 ), i.e. f (g(x1 )) = f (g(x2 )), then as f


is invertible, g(x1 ) = g(x2 ). In addition, as g is invertible, it
implies that x1 = x2 . So, f ◦ g is invertible.
• We can now show that g −1 ◦ f −1 is the inverse of f ◦ g:
– g −1 ◦ f −1 (f ◦ g(x)) = g −1 (f −1 (f (g(x)))) = g −1 (g(x)) = x,
– f ◦ g(g −1 ◦ f −1 (x)) = f (g(g −1 (f −1 (x)))) = f (f −1 (x)) = x,
where we have used the fact that f −1 and g −1 are the inverse
functions of f and g, respectively.
• It remains to show that the inverse is unique. Suppose that
there is another inverse function of f ◦g namely u different from
g −1 ◦ f −1 . Hence, f ◦ g(u(x)) = x. If we apply the function
g −1 ◦ f −1 , then g −1 ◦ f −1 (f ◦ g(u(x))) = g −1 ◦ f −1 (x) ⇔ u(x) =
g −1 ◦ f −1 (x).

3.3. Countable and uncountable sets. As seen before, we can


classify sets in terms of their number of elements, either finite or infinite.
There is a particular case of an infinite set: the set of natural numbers
N. This set can be counted in the sense that we can have an ordered
sequence of its elements: given any element we know what is the next
one.
A set A is countable if there is a one-to-one function f : A → N.
Countable sets can be either finite (f only takes values in a finite sub-
set of N) or infinite (like N). A set which is not countable is called
uncountable.
Exercise A.27. Let A ⊂ B. Show that

(1) If B is countable, then A is also countable.


(2) If A is uncountable, then B is also uncountable.
Example A.28. The following are countable sets:

(1) Q, by choosing a sequence that covers all rational numbers.


Find one that works.
(2) Z, because Z ⊂ Q.
Example A.29. The following are uncountable sets:
(1) [0, 1], by the following argument. Suppose that [0, 1] is count-
able. This implies the existence of a sequence that covers all
the points in [0, 1]. Write the sequence as xn = 0.an,1 an,2 . . .
where an,i ∈ {0, 1, 2, . . . , 9}. Take now x ∈ [0, 1] given by
x = 0.b1 b2 . . . where bi 6= ai,i for every i ∈ N. In order to avoid
the cases of the type 0.1999 · · · = 0.2, whenever ai,i = 9 we
choose bi 6= 9 and also bi 6= 0. Thus, x is different from every
point in the sequence. So, [0, 1] can not be countable.
(2) R, because [0, 1] ⊂ R.
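The diagonal construction in (1) is entirely mechanical; a sketch (not part of the text) that, given any finite list of digit expansions, produces digits differing from the i-th expansion in the i-th place while avoiding 0 and 9:

```python
def diagonal(expansions):
    """Given decimal expansions as tuples of digits a_{i,j}, return digits
    b_i with b_i != a_{i,i} and b_i in {1, 2}, so the resulting number
    differs from every listed one and no 0.1999... = 0.2 ambiguity occurs."""
    return tuple(2 if row[i] == 1 else 1 for i, row in enumerate(expansions))

xs = [tuple((i * j) % 10 for j in range(10)) for i in range(10)]
b = diagonal(xs)
assert all(b[i] != xs[i][i] for i in range(len(xs)))
assert all(digit in {1, 2} for digit in b)
```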
Proposition A.30. Let A and B be any two sets and h : A → B
a bijection between them. Then,
(1) A is finite iff B is finite.
(2) A is countable iff B is countable.
(3) A is uncountable iff B is uncountable.
Exercise A.31. Prove it.

Consider an index set I and a family of sets Aα with α ∈ I. If I is
finite, we say that
⋂_{α∈I} Aα
is a finite intersection. If I is infinite but countable, the above is a
countable intersection. Otherwise, whenever I is uncountable, it is
called an uncountable intersection. Similarly, we use the same type of
nomenclature for unions.

4. Topological notions in R

4.1. Distance. The usual distance between two points x, y ∈ R is


given by
d(x, y) = |x − y|. (A.2)
We can easily deduce the following properties.
Properties A.32. For all x, y, z ∈ R,
(1) d(x, y) ≥ 0
(2) d(x, y) = 0 ⇔ x = y
(3) d(x, y) = d(y, x) (symmetry)
(4) d(x, z) ≤ d(x, y) + d(y, z) (triangular inequality).

In fact, we could have defined distance¹ using only the above properties,
since they are the relevant ones. An example of another distance
d on R satisfying the same properties is:
d(x, y) = |x − y| / (1 + |x − y|).
Notice that with this distance we have d(0, 1) = d(1, 2) = 1/2 and that
d(0, 2) = 2/3. On the other hand, there are no two points whose distance
from each other is more than 1.
We will restrict our study to the usual distance in (A.2). However,
with some care we could have developed our study for a generic
distance.
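The four properties for this bounded distance can be spot-checked with exact rational arithmetic (a sketch, not part of the text; a finite grid check only illustrates the triangular inequality, it does not prove it):

```python
from fractions import Fraction
from itertools import product

def d(x, y):
    """The bounded distance |x - y| / (1 + |x - y|), computed exactly."""
    t = abs(Fraction(x) - Fraction(y))
    return t / (1 + t)

assert d(0, 1) == d(1, 2) == Fraction(1, 2)
assert d(0, 2) == Fraction(2, 3)

pts = [Fraction(n, 3) for n in range(-9, 10)]
assert all(d(x, z) <= d(x, y) + d(y, z)           # triangular inequality
           for x, y, z in product(pts, repeat=3))
assert all(d(x, y) < 1 for x, y in product(pts, repeat=2))
```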

4.2. Neighbourhood. One of the main consequences of the abil-


ity to measure distances is the notion of proximity. Let a ∈ R and
ε > 0. An ε-neighbourhood of a is the set of points which are at a
distance less than ε from a. That is,
Vε (a) = {x ∈ R : d(x, a) < ε}.
For the usual distance we obtain
Vε (a) =]a − ε, a + ε[.
Proposition A.33.
(1) If 0 < δ < ε, then Vδ (a) ⊂ Vε (a) and Vδ (a) ∩ Vε (a) = Vδ (a).
(2) ⋂_{ε>0} Vε (a) = {a}, which is not a neighbourhood of a.
(3) If a ≠ b, then Vδ (a) ∩ Vε (b) = ∅ ⇔ δ + ε ≤ |b − a|.

4.3. Interior, boundary and exterior. With the notion of neigh-


bourhood of a point, we can distinguish the points that are in the
“interior” of a set. Let A ⊂ R and a ∈ R.
• a is an interior point of A iff there is a neighbourhood of a
contained in A, i.e.
∃ε>0 Vε (a) ⊂ A.
• a is an exterior point of A iff it is an interior point of Ac .
• a is a boundary point of A iff it is neither interior nor exterior.
The set of interior points of A is denoted by int A, the exterior by
ext A and the boundary by front A. So,
R = int A ∪ front A ∪ ext A.
Example A.34.
(1) int[0, 1[=]0, 1[, front[0, 1[= {0, 1}, ext[0, 1[= R \ [0, 1].
¹ In some literature it is called a metric.

(2) int ∅ = front ∅ = ext R = front R = ∅, ext ∅ = int R = R.


(3) int Q = ext Q = int(R \ Q) = ext(R \ Q) = ∅, front Q =
front(R \ Q) = R.
Proposition A.35.
(1) int(Ac ) = ext A.
(2) ext(Ac ) = int A.
(3) front(Ac ) = front A.
(4) int A ⊂ A.
(5) ext A ⊂ Ac .

4.4. Open and closed sets. The closure of A ⊂ R is


A = int A ∪ front A.
Therefore, R = A ∪ ext A and
int A ⊂ A ⊂ A.
In cases one has equalities, we label the set A as:
• open iff int A = A.
• closed iff A = A.
Example A.36.
(1) ]0, 1[ is open, [0, 1] is closed, ]0, 1] is neither open nor closed,
] − ∞, 1] is closed.
(2) N and Z are closed, Q and R \ Q are neither open nor closed.
(3) ∅ and R are open and closed.
Proposition A.37.
(1) A ⊂ R is open iff Ac is closed.
(2) If A, B ⊂ R are open, then A ∩ B, A ∪ B are open.
(3) If A, B ⊂ R are closed, then A ∩ B, A ∪ B are closed.
(4) If A ≠ ∅ is bounded and closed (compact), then it has a maxi-
mum and a minimum.

Proof.
(1) A open ⇔ front A ⊂ Ac ⇔ front Ac ⊂ Ac ⇔ Ac closed.
(2) Let a ∈ A∩B. Then, as A and B are open, there are ε1 , ε2 > 0
such that
Vε1 (a) ⊂ A and Vε2 (a) ⊂ B.
Choosing ε = min{ε1 , ε2 }, we have that
Vε (a) ⊂ Vε1 (a) ∩ Vε2 (a) ⊂ A ∩ B.
Same idea for A ∪ B.

(3) A and B closed ⇔ Ac and B c open ⇒ Ac ∪B c open ⇔ (A∩B)c


open ⇔ A ∩ B closed. Same idea for A ∪ B.
(4) A bounded from above ⇒ sup A ∈ front A. As A is closed, i.e.
front A ⊂ A, we have that sup A ∈ A. Thus, max A = sup A.
Same idea for inf and min.

Remark A.38. The following example
⋂_{n=1}^{+∞} ]−1/n, 1/n[ = {0}
shows that an infinite intersection of open sets might not be an open set.
In the previous proposition it is only proved that the finite intersection
of open sets is an open set. An infinite union of closed sets might also
fail to be a closed set. For example,
⋃_{n=1}^{+∞} [−1 + 1/n, 1 − 1/n] = ]−1, 1[.

Proposition A.39. Any open set is a countable union of pairwise


disjoint open intervals.

Proof. Let U ⊂ R be an open set and x ∈ U . Then there is a


neighbourhood V of x contained in U . Write Ix =]a, b[ where
a = inf{α : ]α, x[⊂ U } and b = sup{β : ]x, β[⊂ U }.
So, Ix is the maximal interval in U containing x. Choose a rational
number r in Ix and set Ir = Ix . In
particular, for any other rational number r′ ∈ Ir we have Ir′ = Ir . In
addition, if for given rationals r, r′ we have Ir ≠ Ir′ , then Ir ∩ Ir′ = ∅.
Since any x ∈ U is inside some Ir with r ∈ Q ∩ U ,
U ⊂ ⋃_{r∈Q∩U} Ir .

Moreover, Ir ⊂ U for every r ∈ Q ∩ U , so


⋃_{r∈Q∩U} Ir ⊂ U.

Hence they are the same set. 

4.5. Accumulation points. Note first that for A ⊂ R and a ∈ R:


a ∈ Ā ⇔ ∀ε>0 Vε (a) ∩ A ≠ ∅.
I.e. a point a belongs to the closure of A (in its interior or at the
boundary) iff any neighbourhood of a intersects A.

We are now interested in the closure points of A that are guaranteed
to have other points of the set nearby. In other words, those that are
not isolated points:
• a ∈ A is an isolated point of A iff ∃ε>0 Vε (a) ∩ (A \ {a}) = ∅.
So, we define:
• a is an accumulation point of A iff ∀ε>0 Vε (a) ∩ (A \ {a}) ≠ ∅.
Accumulation points are thus elements of the closure of A minus the
isolated ones. The set of accumulation points is denoted by A′ .
Example A.40.
(1) ([0, 1[)′ = [0, 1].
(2) ({0, 1})′ = ∅.
(3) ({1/n : n ∈ N})′ = {0}.
(4) Q′ = R, (R \ Q)′ = R.

4.6. Numerical sequences and convergence. A numerical sequence
(or simply sequence, for short) is a real valued function defined
on N, i.e. u : N → R. Its expression is usually written as un instead of
u(n). Each n is said to be an order of the sequence and the respective
value un is the term of order n.
Whenever we have a strictly increasing sequence kn we write kn ↗ ∞.
This allows us to define a subsequence of un by ukn .
A sequence un is said to converge to b ∈ R (or b is the limit of un )
iff
∀ε>0 ∃p∈N ∀n≥p un ∈ Vε (b).
In this case we write un → b as n → +∞ or limn→+∞ un = b.
A sequence might not have a limit. In that case there is still the
possibility that subsequences have limits. The sublimits of un are the
limits of its subsequences. The infimum of this set is denoted by lim inf un ,
and its supremum is lim sup un . These can also be computed as
lim inf un = sup_{n≥1} inf_{k≥n} uk = lim_{n→+∞} inf_{k≥n} uk
and
lim sup un = inf_{n≥1} sup_{k≥n} uk = lim_{n→+∞} sup_{k≥n} uk .
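For instance, for un = (−1)^n (1 + 1/n) the sublimits are −1 and 1. The formulas above can be evaluated on a finite horizon, where the inner infima and suprema are attained exactly (a sketch, not part of the text):

```python
from fractions import Fraction

N = 400
u = [Fraction((-1) ** n) * (1 + Fraction(1, n)) for n in range(1, N + 1)]

# sup_n inf_{k>=n} u_k and inf_n sup_{k>=n} u_k over the finite horizon
liminf_N = max(min(u[n:]) for n in range(N - 1))
limsup_N = min(max(u[n:]) for n in range(N - 1))

assert liminf_N == Fraction(-400, 399)   # tends to -1 as N grows
assert limsup_N == Fraction(401, 400)    # tends to +1 as N grows
```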

Consider now the sum of the first n terms of a sequence un given
by a new sequence
Sn = ∑_{i=1}^{n} ui .

A numerical series is the limit of Sn as n → +∞ and it is denoted by
∑_{i=1}^{+∞} ui = lim_{n→+∞} ∑_{i=1}^{n} ui .

4.6.1. Diagonal argument. Consider a sequence un,m that depends
on two variables n, m ∈ N.
Theorem A.41 (Diagonal argument). If there is M > 0 such that
|un,m | < M for all n, m ∈ N, then there is kn ↗ ∞ with kn ∈ N such
that
am := lim_{n→+∞} u_{kn ,m}
exists for every m ∈ N.

Proof. Notice that un,1 is a bounded sequence, thus there exists
a subsequence u_{kn^{(1)} ,1} which converges to some a1 . Choose now kn^{(2)} to be
a subsequence of kn^{(1)} so that u_{kn^{(2)} ,2} converges to a2 . By induction
we obtain a subsequence kn^{(j)} of kn^{(j−1)} so that u_{kn^{(j)} ,j} converges to aj .
Therefore, u_{kn^{(j)} ,i} converges for every i ≤ j.
Finally, take kn = kn^{(n)} , which is called the diagonal subsequence.
Hence, u_{kn ,m} converges for every m.


5. Notions of differentiable calculus on R

Consider a function f : A → R where A ⊂ R. We say that the limit


of f at a ∈ A is b iff
∀ε>0 ∃δ>0 ∀x∈Vδ (a)∩A f (x) ∈ Vε (b).
In this case we write f (x) → b as x → a or limx→a f (x) = b. This is
also equivalent to say that for any sequence un with values in A that
converges to a we have f (un ) converging to b.
We say that f is continuous at a iff limx→a f (x) = f (a). Moreover,
f is continuous in A if it is continuous at every point of A.
Now, if a ∈ A and A is an open set, let
f ′ (a) = lim_{x→a} (f (x) − f (a))/(x − a).
If the above limit exists we say that f is differentiable at a and it is
called the derivative of f at a. The function is differentiable in A if
it is differentiable at every point of A. The function f ′ : A → R is
the derivative of f and it can itself be differentiable. In this case we
can compute the second derivative f ′′ and so on. If performing the

derivative k times we write f (k) as the k-th derivative of f . The 0-th


derivative corresponds to f itself.
The set of all continuous functions in A is denoted by C 0 (A). The
functions that are differentiable in A whose derivatives are continuous
form the set C 1 (A). More generally, C k (A) is the set of k-times differ-
entiable functions whose k-th derivative is continuous. Finally, C ∞ (A)
corresponds to the set of all functions which are differentiable infinitely
many times.
Any f ∈ C k+1 (A) can be approximated by its Taylor polynomial
around x0 ∈ A:
P (x) = ∑_{i=0}^{k} f ^{(i)} (x0 ) (x − x0 )^i / i! ,
with an error given by
f (x) − P (x) = f ^{(k+1)} (ξ) (x − x0 )^{k+1} / (k + 1)!
for some ξ between x and x0 . If k = 0 we can write
f (x) − f (x0 ) = f ′ (ξ) (x − x0 ),
which is known as the mean value theorem.
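For f = exp around x0 = 0 every derivative is again exp, so the error term above gives |f (x) − P (x)| ≤ e^{|x|} |x|^{k+1} / (k + 1)!. A numerical sketch (not part of the text):

```python
import math

def taylor_exp(x, k):
    """Degree-k Taylor polynomial of exp around x0 = 0."""
    return sum(x ** i / math.factorial(i) for i in range(k + 1))

for x in (-1.0, 0.5, 2.0):
    for k in range(1, 12):
        error = abs(math.exp(x) - taylor_exp(x, k))
        bound = math.exp(abs(x)) * abs(x) ** (k + 1) / math.factorial(k + 1)
        assert error <= bound + 1e-12    # Lagrange bound, with float slack
```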



6. Greek alphabet

Letter lower case upper case
Alpha α A
Beta β B
Gamma γ Γ
Delta δ ∆
Epsilon ε E
Zeta ζ Z
Eta η H
Theta θ, ϑ Θ
Iota ι I
Kappa κ K
Lambda λ Λ
Mu µ M
Nu ν N
Xi ξ Ξ
Omicron o O
Pi π, ϖ Π
Rho ρ, ϱ P
Sigma σ, ς Σ
Tau τ T
Upsilon υ Υ
Phi φ, ϕ Φ
Chi χ X
Psi ψ Ψ
Omega ω Ω
