
Applied Stochastic Processes

Imperial College London

Mathematics Department
a.y. 2013/2014

M. Ottobre

Contents

1 Basics of Probability  5
  1.1 Conditional probability and independence  8
  1.2 Convergence of Random Variables  10

2 Stochastic processes  14
  2.1 Stationary Processes  18

3 Markov Chains  19
  3.1 Defining Markovianity  19
  3.2 Time-homogeneous Markov Chains on countable state space  23

4 Markov Chain Monte Carlo Methods  36
  4.1 Simulated Annealing  39
  4.2 Accept-Reject method  41
  4.3 Metropolis-Hastings algorithm  42
  4.4 The Gibbs Sampler  46

5 The Ergodic Theorem  49
  5.1 Dynamical Systems  49
  5.2 Stationary Markov Chains and Canonical Dynamical Systems  51

6 Continuous time Markov processes  54

7 Brownian Motion  56
  7.1 Heuristic motivation for the definition of BM  57
  7.2 Rigorous motivation for the definition of BM  58
  7.3 Properties of Brownian Motion  59

8 Elements of stochastic calculus and SDEs theory  62
  8.1 Stochastic Integrals  64
    8.1.1 Stochastic integral in the Ito sense  67
    8.1.2 Stochastic integral in the Stratonovich sense  72
  8.2 SDEs  74
    8.2.1 Existence and uniqueness  74
    8.2.2 Examples and methods of solution  76
    8.2.3 Solutions of SDEs are Markov Processes  79
  8.3 Langevin and Generalized Langevin equation  81

9 Markov Semigroups and their Generator  86
  9.1 Ergodicity for continuous time Markov processes  91

10 Diffusion Processes  94
  10.1 Definition of diffusion process  95
  10.2 Backward Kolmogorov and Fokker-Planck equations  100
    10.2.1 Time-homogeneous case  103
  10.3 Reversible diffusions and spectral gap inequality  106
  10.4 The Langevin equation as an example of Hypocoercive diffusion  112

11 Anomalous diffusion (STILL DRAFT)  121

Appendices  126

A Miscellaneous facts  126
  A.1 Why can we think of white noise as the derivative of Brownian Motion  126
  A.2 Gronwall's Lemma  126
  A.3 Kolmogorov's Extension Theorem  127

B Elements of Functional Analysis  127
  B.1 Adjoint operator  128
  B.2 Strong, weak and weak-* convergence  130
  B.3 Groups of bounded operators and Stone's Theorem  131
  B.4 Functional Inequalities in their basic form  131

C Laplace method  132

12 A Few Exercises  141

13 Solutions  151

Warning. The material of Section 1 is very sketchy and is there mainly to recall some basic definitions and fix the notation for the rest of the course, as standard notions from probability theory are assumed to be a prerequisite. I don't recommend attending this course if you haven't taken a course in Probability before.

Standing assumptions for the rest of the course.

1. We assume that all the measures we deal with have a density, unless otherwise stated.

2. All the Markov processes are time-homogeneous (again, unless otherwise stated).

ACKNOWLEDGMENTS
I would like to thank the students of the course that I taught at Imperial College in the Autumn 2013 for comments and remarks that helped improve these notes; in particular, A. Corenflos, J. Kunz and M. R. Majorel. The changes suggested by A.C. have not yet been incorporated in this version... I'm working on it!

1 Basics of Probability

For background material on probability theory see for example [79]. We start by recalling some elementary definitions.

Let $E$ be a set and $\mathcal{E}$ a collection of subsets of $E$. $\mathcal{E}$ is a $\sigma$-algebra of subsets of $E$ if

- $E \in \mathcal{E}$;
- if $A \in \mathcal{E}$ then $A^c \in \mathcal{E}$;
- if $\{A_j\}_{j \in I} \subseteq \mathcal{E}$ then $\bigcup_{j \in I} A_j \in \mathcal{E}$, where $I$ is an at most countable set of indices.

The pair $(E, \mathcal{E})$ is a measurable space. If $\mu$ is a measure$^1$ on $\mathcal{E}$, the triple $(E, \mathcal{E}, \mu)$ is called a measure space.
If $\mathcal{G}$ is a class of subsets of $E$, we denote by $\sigma(\mathcal{G})$ the $\sigma$-algebra generated by $\mathcal{G}$, i.e. the smallest $\sigma$-algebra containing $\mathcal{G}$, which is the intersection of all the $\sigma$-algebras containing $\mathcal{G}$.
If $E$ is endowed with a topology then the $\sigma$-algebra generated by the open sets of $E$ is called the Borel $\sigma$-algebra and denoted $\mathcal{B}(E)$, or simply $\mathcal{B}$ when there is no risk of confusion. We will often use the $\sigma$-algebra $\mathcal{B}(\mathbb{R}^n)$. It is a good exercise to compare the definition of $\sigma$-algebra with the definition of topology: the two are completely different and come from two very different needs, but they are often mixed up.
If $\mathbb{P}$ is a measure on a set $\Omega$ such that $\mathbb{P}(\Omega) = 1$, we call $\mathbb{P}$ a probability measure. The triple $(\Omega, \mathcal{F}, \mathbb{P})$, with $\mathcal{F}$ a $\sigma$-algebra on $\Omega$, is then a probability space. $\Omega$ is the sample space, points $\omega \in \Omega$ are sample points and elements $A \in \mathcal{F}$ are events.
Let $(E, \mathcal{E})$, $(E', \mathcal{E}')$ be two measurable spaces. A map $X : E \to E'$ is $\mathcal{E}/\mathcal{E}'$-measurable if the preimage of any $A' \in \mathcal{E}'$ is a set $A \in \mathcal{E}$, i.e.
$$X^{-1}(A') \in \mathcal{E}, \quad \text{for all } A' \in \mathcal{E}'.$$
When the measurable space at hand is indeed a probability space, an $\mathcal{F}/\mathcal{E}'$-measurable map $X : \Omega \to E'$ is an $E'$-valued random variable. To fix ideas, from now on we will mainly work with $\mathbb{R}^n$-valued random variables, where $\mathbb{R}^n$ is assumed to be endowed with the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^n)$. Also, I will not repeat every time that $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space.

$^1$ A countably additive and non-negative set function with $\mu(\emptyset) = 0$.

Let $X : \Omega \to \mathbb{R}^n$ be a random variable. The collection of sets
$$\sigma(X) := \{X^{-1}(B),\ B \in \mathcal{B}\}$$
is a $\sigma$-algebra (check), called the $\sigma$-algebra generated by $X$. It is the smallest sub-$\sigma$-algebra of $\mathcal{F}$ with respect to which $X$ is measurable.
Let $X : (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}^n, \mathcal{B})$ be a r.v. The law of $X$ is the map $\mathcal{L}_X : \mathcal{B} \to [0,1]$ defined as follows:
$$\mathcal{L}_X(B) = \mathbb{P}(X^{-1}(B)), \quad \text{for all } B \in \mathcal{B}.$$
$\mathcal{L}_X$ is a probability measure on $\mathbb{R}^n$. Therefore a random variable induces, through its law, a probability measure on $\mathbb{R}^n$.
The function $F_X : \mathbb{R}^n \to [0,1]$ defined as
$$F_X(x) := \mathbb{P}(X \le x)$$
is the distribution function of $X$. If the range of $X$ is discrete then $X$ is a discrete r.v. and in this case its distribution function will be discontinuous. If instead the distribution function of $X$ is continuous then it is customary to say that $X$ is a continuous r.v. However, in order to avoid confusion with the terminology for stochastic processes, we will more often simply say that $X$ is a r.v. with continuous distribution.
If there exists a nonnegative integrable function $f : \mathbb{R}^n \to \mathbb{R}$ such that
$$F_X(x) = \int_{-\infty}^{x_1} \dots \int_{-\infty}^{x_n} f(z_1, \dots, z_n)\, dz_1 \dots dz_n,$$
then $f$ is the density of $X$. In the above we used the notation $\mathbb{R}^n \ni x = (x_1, \dots, x_n)$.
If $X$ admits a density function then
$$\mathbb{P}(X \in B) = \int_B f(x)\, dx.$$

If $X_1, \dots, X_m$ are $\mathbb{R}^n$-valued r.v., their joint distribution function is the function $F_{X_1, \dots, X_m} : (\mathbb{R}^n)^m \to [0,1]$,
$$F_{X_1, \dots, X_m}(x_1, \dots, x_m) := \mathbb{P}(X_1 \le x_1, \dots, X_m \le x_m),$$
where this time $x_j \in \mathbb{R}^n$.

We also recall that the expectation of $X$, $\mathbb{E}(X)$, is
$$\mathbb{E}(X) := \int_\Omega X\, d\mathbb{P}, \tag{1}$$
so that
$$\mathbb{P}(X \in B) = \mathbb{E}(\mathbf{1}_B),$$
where $\mathbf{1}_B$ is the characteristic function of the set $B$. $X$ is integrable if $\mathbb{E}(|X|) < \infty$. As you can imagine, formula (1) tends to be quite impractical to use for calculations. However, if $X$ has a density, then the following holds.

Lemma 1.1. With the notation introduced above, let $g : \mathbb{R}^n \to \mathbb{R}^m$ be a measurable function such that $Y = g(X)$ is integrable. Then
$$\mathbb{E}(Y) = \int_{\mathbb{R}^n} g(x) f(x)\, dx$$
and in particular
$$\mathbb{E}(X) = \int_{\mathbb{R}^n} x f(x)\, dx.$$

If $\mu = \mathbb{E}(X)$ then we define the variance of $X$ to be
$$\mathrm{Var}(X) := \mathbb{E}|X - \mu|^2 = \int |x - \mu|^2 f(x)\, dx,$$
where $|\cdot|$ denotes the Euclidean norm. The covariance of two r.v. $X$ and $Y$ is instead
$$\mathrm{Cov}(X, Y) := \mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y).$$
Recall also that if two r.v. have joint distribution $F_{X,Y}$ and joint density function $f_{X,Y}$ then
$$\mathbb{E}(g(X, Y)) = \iint g(x, y) f_{X,Y}(x, y)\, dx\, dy.$$
Moreover, the Convolution Theorem states that in such a case the density function of $Z = X + Y$ is given by
$$f_Z(z) = \int f_{X,Y}(x, z - x)\, dx.$$
Example 1.2. If $X : \Omega \to \mathbb{R}$ is a r.v. with density
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-m)^2}{2\sigma^2}},$$
then $X$ is a Gaussian (normal) r.v. with mean $m$ and variance $\sigma^2$. In this case we use the notation $X \sim N(m, \sigma^2)$.

Example 1.3. Analogously in higher dimension: if $X : \Omega \to \mathbb{R}^n$ is a r.v. with density
$$f(x) = \frac{1}{\sqrt{(2\pi)^n \det(C)}}\, e^{-\frac{1}{2}(x-m)\cdot C^{-1}(x-m)},$$
for some $m \in \mathbb{R}^n$ and some positive definite matrix $C$, then $X$ is a Gaussian (normal) r.v. with mean $m$ and covariance matrix $C$.

Finally, we recall that the space $L^1((\Omega, \mathcal{F}, \mathbb{P}); \mathbb{R}^n)$ (most often I will just write $L^1(\Omega, \mathcal{F}, \mathbb{P})$ or $L^1(\mathbb{P})$) is the space of $\mathcal{F}$-measurable integrable $\mathbb{R}^n$-valued functions. Analogously, for all $p \ge 1$, $L^p((\Omega, \mathcal{F}, \mathbb{P}); \mathbb{R}^n)$ is the space of $\mathcal{F}$-measurable functions $X : \Omega \to \mathbb{R}^n$ such that
$$\int_\Omega |X(\omega)|^p\, d\mathbb{P}(\omega) < \infty.$$
If the random variable $X$ has a density (i.e. if the distribution function of $X$ has a density) $f(x)$, then the above integral can just be rewritten as
$$M_p(f) := \int_{\mathbb{R}^n} |x|^p f(x)\, dx.$$
$M_p(f)$ is the $p$-th (non-centered) moment of the r.v. with density $f$.

1.1 Conditional probability and independence

If $A, B \in \mathcal{F}$ and $\mathbb{P}(B) > 0$ we define the conditional probability of $A$ given $B$ to be
$$\mathbb{P}(A|B) := \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.$$
Any two events $A, B \in \mathcal{F}$ are independent if
$$\mathbb{P}(A \cap B) = \mathbb{P}(A)\, \mathbb{P}(B).$$
A collection $\{X_i\}_i$ of $\mathbb{R}^n$-valued r.v. is independent if
$$\mathbb{P}(X_1 \in B_1, \dots, X_k \in B_k) = \mathbb{P}(X_1 \in B_1) \dots \mathbb{P}(X_k \in B_k)$$
for any $k \ge 2$ and any Borel sets $B_i$. If two real valued r.v. $X$ and $Y$ are independent then
$$\mathbb{E}(XY) = \mathbb{E}(X)\, \mathbb{E}(Y), \quad \text{hence } \mathrm{Cov}(X, Y) = 0,$$
and
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).$$
Let us now recall the basic facts about conditional expectation.

Definition 1.4. Let $X$ be an $\mathbb{R}^n$-valued random variable, $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$, and let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Then there exists a unique integrable and $\mathcal{G}$-measurable random variable $Y$ such that
$$\mathbb{E}(Y \mathbf{1}_G) = \mathbb{E}(X \mathbf{1}_G), \quad \text{for all } G \in \mathcal{G}. \tag{2}$$
The r.v. $Y$ is (a version of) the conditional expectation of $X$ given $\mathcal{G}$; we use the notation $Y =: \mathbb{E}(X|\mathcal{G})$. Also, if $Z$ is a r.v., the notation $\mathbb{E}(X|Z)$ is a short notation for $\mathbb{E}(X|\sigma(Z))$.

If $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$ then there is a clear interpretation of the conditional expectation in terms of projections; indeed in this case if $\mathcal{G}$ is a sub-$\sigma$-algebra of $\mathcal{F}$ then $L^2(\Omega, \mathcal{G}, \mathbb{P})$ is a proper closed subspace of $L^2(\Omega, \mathcal{F}, \mathbb{P})$. Therefore for any $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$ there exists a unique $Y \in L^2(\Omega, \mathcal{G}, \mathbb{P}) =: K$ such that
$$\|X - Y\|_{L^2} = \inf_{W \in K} \|X - W\|_{L^2} \quad \text{and} \quad X - Y \perp W \quad \text{for all } W \in K,$$
i.e. $Y$ is the unique orthogonal projection of $X$ on $K$. In other words, among all the $\mathcal{G}$-measurable functions, $Y$ is the best (in the $L^2$ sense) estimator of $X$. Rephrasing: if the information contained in the $\sigma$-algebra $\mathcal{G}$ is available, $Y$ is our best guess on $X$. We list the main properties of the conditional expectation:

i) If $Y = \mathbb{E}(X|\mathcal{G})$ then $\mathbb{E}(Y) = \mathbb{E}(X)$ (which follows from (2)).

ii) Take out what is known: if $Z$ is $\mathcal{G}$-measurable then $\mathbb{E}(ZX|\mathcal{G}) = Z\,\mathbb{E}(X|\mathcal{G})$, and in particular $\mathbb{E}(Z|\mathcal{G}) = Z$.

iii) Tower property: if $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$ then
$$\mathbb{E}[\mathbb{E}(X|\mathcal{G})|\mathcal{H}] = \mathbb{E}[\mathbb{E}(X|\mathcal{H})|\mathcal{G}] = \mathbb{E}(X|\mathcal{H}).$$

iv) If $X$ is independent of $Z$ then $\mathbb{E}(X|Z) = \mathbb{E}(X)$.

In the same way, for any event $A \in \mathcal{F}$ and any r.v. $X$, we can define
$$\mathbb{P}(A|X) := \mathbb{P}(A|\sigma(X)) := \mathbb{E}(\mathbf{1}_A|X).$$
It is a general fact that, for any two r.v., $\mathbb{E}(Z|X)$ can be written as a measurable function of $X$, i.e. there exists a measurable function $h$ such that
$$\mathbb{E}(Z|X) = h(X).$$
However, the function $h$ is defined only up to a set of measure zero. When we apply this reasoning to $\mathbb{P}(A|X)$, we obtain that
$$\mathbb{P}(A|X) = h(X), \tag{3}$$
for some measurable function $h$. We might suggestively write, and in fact we shall do so,
$$\mathbb{P}(A|X = x) = h(x);$$
however the above expression is intended to hold for almost every $x$.

1.2 Convergence of Random Variables

First, let us list the main modes of convergence that we will be using.

Definition 1.5. For simplicity let $X_n, X : \Omega \to \mathbb{R}$ be real valued random variables (but clearly all the definitions below apply to $\mathbb{R}^n$-valued r.v.) and let $\mu_n$ and $\mu$ be the laws of $X_n$ and $X$. We will assume that $\mu_n$ and $\mu$ have a density$^2$ and with abuse of notation we will write $d\mu_n(x) = \mu_n(x)\,dx$ and $d\mu(x) = \mu(x)\,dx$.

- $X_n \xrightarrow{a.s.} X$, almost surely, if
$$X_n(\omega) \to X(\omega) \quad \text{for a.e. } \omega \in \Omega, \quad \text{i.e.} \quad \mathbb{P}\Big(\lim_{n\to\infty} X_n = X\Big) = 1.$$
- $X_n \xrightarrow{L^p} X$, where $L^p := L^p(\Omega; \mathbb{R})$, $p \ge 1$, if
$$\mathbb{E}|X_n - X|^p \to 0.$$
- $X_n \xrightarrow{p} X$, in probability, if for any $\epsilon > 0$
$$\lim_{n\to\infty} \mathbb{P}(|X_n - X| > \epsilon) = 0.$$
- $X_n \xrightarrow{D} X$ (or also $X_n \Rightarrow X$ or $X_n \rightharpoonup X$), in distribution or weakly, if
$$\int_{\mathbb{R}} h(x)\mu_n(x)\,dx \to \int_{\mathbb{R}} h(x)\mu(x)\,dx \quad \text{for any } h \in C_b(\mathbb{R}).$$

$^2$ We say that a probability measure $\mu$ on $(\mathbb{R}^n, \mathcal{B})$ has a density if there exists a nonnegative function $f(x)$ such that $\mu(B) = \int_B f(x)\,dx$ for any $B \in \mathcal{B}$.

An important fact to notice about weak convergence is that this definition makes sense even if the random variables that come into play are not defined on the same probability space.
The above modes of convergence of r.v. are related as follows:
$$\text{a.s.} \implies \text{in probability} \implies \text{in distribution}$$
and
$$\text{in } L^p \implies \text{in probability}.$$
The latter implication is a consequence of the Markov Inequality:
$$\mathbb{P}(X \ge c) \le \frac{1}{g(c)}\, \mathbb{E}[g(X)], \tag{4}$$
for any r.v. $X$, for all $c \in \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}_+$ measurable and non-decreasing.
A particular case of the Markov Inequality is Chebyshev's inequality: for all $c \ge 0$ and for any square integrable r.v. $X$,
$$\mathbb{P}(|X - \mu| > c) \le \frac{\mathrm{Var}(X)}{c^2},$$
where $\mu := \mathbb{E}X$.

In general a.s. convergence and convergence in $L^p$ are not related at all. However, we recall the following classic results.

Monotone Convergence Theorem (MCT). If $0 \le X_n \uparrow X$ almost surely, then $\mathbb{E}X_n \uparrow \mathbb{E}(X)$.

Dominated Convergence Theorem (DCT). If $X_n \to X$ a.s. and there exists an integrable r.v. $Y$ s.t. $|X_n| \le Y$, then
$$\mathbb{E}(|X_n - X|) \to 0 \tag{5}$$
and therefore also
$$\mathbb{E}(X_n) \to \mathbb{E}(X). \tag{6}$$

Bounded Convergence Theorem (BCT). If $X_n \to X$ a.s. and there exists a constant $K > 0$ independent of $n$ such that $|X_n| \le K$, then
$$\mathbb{E}(|X_n - X|) \to 0.$$

Uniform Integrability criterion for $L^1$ convergence (UIC). Let $X_n, X$ be integrable r.v. Then $X_n \to X$ in $L^1$ if and only if the following two conditions hold:
i) $X_n \to X$ in probability;
ii) $X_n$ is uniformly integrable, i.e. for all $\epsilon > 0$ there exists $K > 0$ s.t.
$$\mathbb{E}\big(|X_n|\, \mathbf{1}_{\{|X_n| > K\}}\big) < \epsilon, \quad \text{for all } n \in \mathbb{N}.$$

Scheffé's Lemma. Let $X_n, X$ be integrable and suppose $X_n \to X$ a.s. Then
$$\mathbb{E}|X_n| \to \mathbb{E}|X| \iff \mathbb{E}(|X_n - X|) \to 0.$$
Remark 1.6 (On the above theorems). In the DCT and in the BCT the inequalities $|X_n| \le Y$ and $|X_n| \le K$ are meant to hold pointwise, i.e. for all $n$ and $\omega$. In the DCT, (6) follows from (5) simply because $|\mathbb{E}(X_n - X)| \le \mathbb{E}|X_n - X|$. The BCT is clearly a consequence of the DCT. Both the BCT and the DCT hold even if the assumption on the a.s. convergence of $X_n$ to $X$ is replaced by convergence in probability. In this respect, if $X_n, X$ are assumed to be integrable, it is clear that DCT and BCT are a consequence of the UIC. Last, because $||a| - |b|| \le |a - b|$, the implication $\Leftarrow$ in Scheffé's Lemma is trivial. The opposite implication is the nontrivial one.

Now some remarks about weak convergence. According to the famous Portmanteau Theorem (see for example [8]), convergence in distribution can be equivalently formulated as follows:

Theorem 1.7 (Portmanteau Theorem). Let $X_n$ be a sequence of real valued r.v. with distribution functions $F_n(x)$, and let $X$ have distribution function $F(x)$. Then
$$X_n \xrightarrow{d} X \iff \lim_{n\to\infty} F_n(x) = F(x) \quad \text{for every point of continuity } x \text{ of } F.$$

While we do not show the proof of the above theorem, we do hope that the following example is somewhat enlightening.

Example. If $X_n = 1/n$ then $\mu_n$, the law of $X_n$, is the unit mass at $1/n$. Clearly $X_n \rightharpoonup X$, where the law of $X$, $\mu$, is the unit mass at 0; indeed
$$\int_{\mathbb{R}} h(x)\, d\mu_n = h(1/n) \to h(0) = \int_{\mathbb{R}} h(x)\, d\mu \quad \text{for all } h \in C_b(\mathbb{R}).$$
However 0 is a discontinuity point of $F$, and
$$F_n(0) = \mathbb{P}(X_n \le 0) = 0 \ne 1 = F(0) = \mathbb{P}(X \le 0).$$

Other facts that you might want to bear in mind:

- If $X_n \to X$ and $X$ is a constant (i.e. it is deterministic) then convergence in probability and weak convergence are equivalent.

- Slutsky's Theorem. If $X_n \xrightarrow{p} c$, where $c$ is a constant, then $g(X_n) \xrightarrow{p} g(c)$ for any function $g$ that is continuous at $c$.

- Cramér's Theorem. Suppose $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} c$, where $c$ is deterministic. Then
  i) $X_n + Y_n \xrightarrow{d} X + c$;
  ii) $X_n Y_n \xrightarrow{d} cX$;
  iii) $X_n / Y_n \xrightarrow{d} X/c$ if $c \ne 0$.

- Continuous Mapping Theorem. Let $g$ be a continuous function and suppose $X_n \xrightarrow{d} X$. Then $g(X_n) \xrightarrow{d} g(X)$. The same thing holds for almost sure convergence and for convergence in probability as well.
Definition 1.8 (Weak convergence of probability measures). Let $(M, d)$ be a separable metric space. A sequence of probability measures $\mu_n$ on $M$ is said to converge weakly to $\mu$ (probability measure on $M$) if for every continuous and bounded function $h$ on $M$ one has
$$\int_M h(x)\, \mu_n(dx) \to \int_M h(x)\, \mu(dx).$$

Remark 1.9. Observe the two following facts.

- Taking $\mu_n$ to be the law of a random variable $X_n$, it is clear that $\mu_n$ converges weakly to $\mu$ if and only if $X_n$ converges in distribution to $X$, where the law of the r.v. $X$ is $\mu$.

- In view of Remark 9.10 and the functional analytic definition of weak-* convergence (see Appendix B.2), what probabilists call weak convergence of probability measures should be referred to (and analysts do in fact refer to it) as weak-* convergence.

As is customary, we say that $\{X_i\}_i$ are i.i.d. random variables if they are independent and identically distributed. We recall the following two fundamental results, which you will have already seen in some probability course. Later on we will build on these results and reread them in a more general context.

Theorem 1.10 (Strong Law of Large Numbers). Let $\{X_i\}_{i\in\mathbb{N}}$ be a sequence of i.i.d. integrable random variables with $\mathbb{E}(X_i) = \mu$ and consider
$$S_n := \frac{1}{n}\sum_{i=1}^n X_i.$$
Then
$$S_n \xrightarrow{a.s.} \mu, \quad \text{i.e.} \quad \mathbb{P}\Big(\lim_{n\to\infty} S_n = \mu\Big) = 1.$$

Example 1.11 (The simplest example of Monte Carlo Method). Let $f$ be a bounded measurable function $f : [0,1] \to \mathbb{R}$ and $X_n$ be a sequence of i.i.d. r.v., uniformly distributed on $[0,1]$. Then the sequence $\{f(X_n)\}_n$ is still a sequence of i.i.d. integrable r.v. Applying the LLN we get
$$\frac{1}{n}\sum_{k=1}^n f(X_k) \xrightarrow{a.s.} \mathbb{E}(f(X_1)) = \int_0^1 f(x)\, dx.$$
Therefore, once we can generate samples from the variables $X_j$, the quantity $\frac{1}{n}\sum_{k=1}^n f(X_k)$ is, for large $n$, a good approximation of $\int_0^1 f(x)\, dx$.
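To make the recipe concrete, here is a minimal numerical sketch of Example 1.11 (assuming Python with numpy; the integrand $f(x) = x^2$ is an arbitrary illustration, not taken from the notes):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def mc_integrate(f, n):
    # Sample mean of f at n i.i.d. Uniform(0,1) points; by the strong
    # LLN this converges a.s. to the integral of f over [0,1].
    x = rng.uniform(0.0, 1.0, size=n)
    return f(x).mean()

# f(x) = x^2 has integral 1/3 over [0,1].
for n in (10**2, 10**4, 10**6):
    print(n, mc_integrate(lambda x: x**2, n))
\end{verbatim}

The successive estimates should approach $1/3$ as $n$ grows.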
Theorem 1.12 (Central Limit Theorem). Let $\{X_i\}_{i\in\mathbb{N}}$ be a sequence of i.i.d. square integrable random variables with $\mathbb{E}(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. Then
$$\sqrt{n}\,(S_n - \mu) \xrightarrow{d} N(0, \sigma^2).$$

Comment. Notice that if we set $S_n = \sum_{j=1}^n X_j$, then $\mathbb{E}(S_n) = n\mu$ and $\mathrm{Var}(S_n) = n\sigma^2$. Therefore the above theorem is equivalently restated as
$$\frac{S_n - \mathbb{E}(S_n)}{[\mathrm{Var}(S_n)]^{1/2}} \xrightarrow{d} N(0, 1),$$
i.e.
$$\lim_{n\to\infty} \mathbb{P}\left(a < \frac{S_n - n\mu}{\sigma\sqrt{n}} < b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{x^2}{2}}\, dx, \quad \text{for all } a \le b \in \mathbb{R}.$$
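A quick numerical check of the theorem, in the same sketchy spirit as the example above (numpy assumed; Uniform(0,1) summands are an illustrative choice):

\begin{verbatim}
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

# Standardized sums of n i.i.d. Uniform(0,1) r.v. (mu = 1/2,
# sigma^2 = 1/12), compared with the N(0,1) limit on (a, b) = (-1, 1).
n, trials = 400, 10_000
mu, sigma = 0.5, sqrt(1.0 / 12.0)
sums = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)
z = (sums - n * mu) / (sigma * sqrt(n))

a, b = -1.0, 1.0
empirical = np.mean((a < z) & (z < b))
gaussian = 0.5 * (erf(b / sqrt(2)) - erf(a / sqrt(2)))
print(empirical, gaussian)  # both approximately 0.683
\end{verbatim}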

2 Stochastic processes

A family $\{X_t\}_{t\in I}$ of $S$-valued random variables $X_t : \Omega \to S$ is a stochastic process (s.p.). If $I = \mathbb{Z}$ or $I = \mathbb{N}$ then $X_t$ is a discrete time stochastic process; if $I = \mathbb{R}$ or $I = \mathbb{R}_+ := \{x \in \mathbb{R} : x \ge 0\}$, it is a continuous time stochastic process. During this course the state space $S$ will be either $\mathbb{R}^n$ or $\mathbb{Z}$, unless otherwise stated. For a reason that will possibly be more clear after reading Remark 2.1 below, we sometimes consider the process as a whole and use the notation $X := \{X_t\}_{t\in T}$.
Now notice that
$$X_t(\omega) = X(t, \omega) : I \times \Omega \to S.$$
If we fix $\omega$ and look at the (non-random) map
$$I \ni t \to X(t, \omega) \in S, \quad \text{for fixed } \omega \in \Omega, \tag{7}$$
then we are looking at the path $X_t(\omega) =: \omega(t)$, i.e. we are observing everything that happens to the sample point $\omega$ from time say 0 to time $t$. If instead we fix $t$ and we look at the map
$$\Omega \ni \omega \to X(t, \omega) \in \mathbb{R}^n, \quad \text{for fixed } t,$$
then this is a random variable, which gives us a snapshot of what is happening (although clearly not in a deterministic way) to all the sample points at the time $t$ when we took the picture. This is the old chestnut of the Eulerian vs Lagrangian point of view. It is sometimes convenient to think of $\Omega$ as a set of particles and of the state space $S$ as the physical state space (for example position-velocity), so that $\omega(t)$ is the path in state space followed by the particle $\omega$ and $X_t(\omega)$ represents, for fixed $t$, the state of the system at time $t$. This will be the point of view that we shall adopt when we talk about non-equilibrium statistical mechanics. In other cases it is more convenient to think of $\omega$ as an experiment, so that $X_t(\omega)$ is a realization of such an experiment at time $t$.

Remark 2.1. Formula (7) offers another perspective on stochastic processes: we can look at a stochastic process as a random map from the sample space to the path space $S^I := \{\text{maps from } I \text{ to } S\}$. More explicitly,
$$X : \Omega \to S^I, \quad \omega \to \omega(t),$$
i.e. to each sample point, or particle, we associate a function, its path.
Definition 2.2. Two stochastic processes $\{X_t\}_t$ and $\{Y_t\}_t$ taking values in the same state space are (stochastically) equivalent if $\mathbb{P}(X_t \ne Y_t) = 0$ for all $t \in T$. If $\{X_t\}_t$ and $\{Y_t\}_t$ are stochastically equivalent then $\{X_t\}_t$ is said to be a version of $\{Y_t\}_t$ (and the other way around as well). Given a stochastic process $\{X_t\}_t$, the family of distributions
$$\mathbb{P}(X_{t_1} \in B_1, \dots, X_{t_k} \in B_k),$$
for all $k \in \mathbb{N}$, $t_1, \dots, t_k \in T$ and $B_1, \dots, B_k \in \mathcal{S}$, are the finite dimensional distributions of the process $\{X_t\}$.

If two stochastic processes are equivalent then they have the same finite dimensional distributions; the converse is not true. Let us make some remarks about these two notions, starting with an example.

Example 2.3. Stochastically equivalent processes can have different realizations. Indeed, consider the processes $\{X_t\}_{t\in[0,1]}$ and $\{Y_t\}_{t\in[0,1]}$ defined as follows:
$$X_t \equiv 0 \ \ \forall t \quad \text{and} \quad Y_t = \begin{cases} 0 & \text{if } t \ne \tau \\ 1 & \text{if } t = \tau, \end{cases}$$
where $\tau$ is a random variable with continuous distribution, $\tau : \Omega \to [0,1]$, so that $\mathbb{P}(\tau = a) = 0$ for all $a \in [0,1]$. In this case $\mathbb{P}(X_t \ne Y_t) = \mathbb{P}(t = \tau) = 0$. Therefore these two processes are stochastically equivalent, but the trajectory of $X_t$ is continuous while the trajectory of $Y_t$ has a discontinuity at $t = \tau$.

One might wonder whether the finite dimensional distributions determine the process uniquely; in general the answer is no, indeed the finite dimensional distributions don't say anything about the paths. However the Kolmogorov Extension Theorem (see for example [56, Theorem 2.1.5]) gives a consistency condition in order for a family of distributions to be the finite dimensional distributions of some stochastic process.
Definition 2.4 (Continuous processes). Let $\{X_t\}$ be a continuous-time s.p. We will say that $\{X_t\}$ is continuous if it has continuous paths, i.e. if the maps $t \to X_t(\omega)$ are continuous for a.e. $\omega \in \Omega$.

Given a function, we know how to check whether it is continuous or not. Such an exercise might not be so obvious when it comes to stochastic processes. Luckily for us, Kolmogorov provided us with a very useful criterion, which we present for real valued processes but which holds in more generality.

Theorem 2.5 (Kolmogorov's continuity criterion). Let $\{X_t\}_{t\ge 0}$ be a s.p. with state space $\mathbb{R}$. If there exist constants $\alpha, \beta > 0$ and $C \ge 0$ s.t.
$$\mathbb{E}|X(t) - X(s)|^{\alpha} \le C\,|t - s|^{1+\beta}, \quad \text{for all } t, s \ge 0,$$
then the process is continuous. More precisely, for any $\gamma \in (0, \beta/\alpha)$ and $T > 0$ and for almost every $\omega$ there exists a constant $K = K(\omega, \gamma, T)$ such that
$$|X(t, \omega) - X(s, \omega)| \le K\,|t - s|^{\gamma}, \quad 0 \le s, t \le T.$$
Definition 2.6. A filtration on $(\Omega, \mathcal{F})$ is a family of $\sigma$-algebras $\{\mathcal{E}_t\}_{t\in I}$ such that $\mathcal{E}_t \subseteq \mathcal{F}$ for all $t \in I$ and
$$\mathcal{E}_s \subseteq \mathcal{E}_t \quad \text{if } s \le t.$$
The process $\{X_t\}_t$ is $\mathcal{E}_t$-adapted if the r.v. $X_t$ is $\mathcal{E}_t$-measurable for every $t$.

The most natural filtration to consider is the one generated by the process itself, i.e. the filtration $\mathcal{F}_t = \sigma(\{X_s\}_{0\le s\le t})$, as $X_t$ is clearly $\mathcal{F}_t$-adapted. The $\sigma$-algebra $\mathcal{F}_t$ contains all the information available to us about the process up to and including time $t$.
Example 2.7 (Standard Brownian Motion). We define a Wiener Process or Brownian Motion (BM) to be a real-valued stochastic process $\{B(t)\}_{t\ge 0}$ such that
i) $B(0) = 0$;
ii) $B(t) - B(s) \sim N(0, t - s)$ for all $0 \le s \le t$;
iii) increments over non-overlapping time intervals are independent: for all $n \in \mathbb{N}$ and $t_1, \dots, t_n$ such that $0 \le t_1 < t_2 < \dots < t_n$, the increments $B(t_1), B(t_2) - B(t_1), \dots, B(t_n) - B(t_{n-1})$ are independent.

A few remarks about this definition:

1. From the definition it follows that
$$\mathbb{E}(B(t)B(s)) = \min(t, s).$$
Indeed suppose $s \le t$; then
$$\mathbb{E}(B(t)B(s)) = \mathbb{E}([B(t) - B(s) + B(s)]B(s)) = \mathbb{E}B(s)^2 = s.$$
Therefore the process has stationary increments (it is not itself stationary though).

2. From ii) it follows that
$$\mathbb{P}(B(t) \in (a, b)) = \frac{1}{\sqrt{2\pi t}} \int_a^b e^{-\frac{x^2}{2t}}\, dx. \tag{8}$$

3. It is possible to prove that $B(t_i) - B(t_{i-1})$ is independent of $\mathcal{F}_{t_{i-1}} = \sigma\{B(s);\ s \le t_{i-1}\}$ as a consequence of 1., ii) and iii) above.

A natural extension of the definition of (standard, one dimensional) BM is the following: an $n$-dimensional standard BM is an $n$-vector $(B_1(t), \dots, B_n(t))$ of independent one dimensional BMs.
BM is possibly the most important example of this course. For the moment we just give this formal definition. Later on we will introduce BM in a more physically motivated way and justify the definition that we have just given.
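Properties i)-iii) translate directly into a simulation recipe: a discretized Brownian path is a cumulative sum of independent $N(0, \Delta t)$ increments. A minimal sketch (numpy assumed; the covariance check at the end is our own illustration):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

def brownian_path(T, n_steps):
    # B(0) = 0 plus independent N(0, dt) increments (properties i)-iii)).
    dt = T / n_steps
    dB = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    return np.concatenate(([0.0], np.cumsum(dB)))

# Sanity check of remark 1: E(B(t)B(s)) = min(t, s). With T = 1 and
# 1000 steps, B(0.5) and B(1.0) sit at indices 500 and 1000.
paths = np.array([brownian_path(1.0, 1000) for _ in range(5000)])
print(np.mean(paths[:, 500] * paths[:, 1000]))  # approximately 0.5
\end{verbatim}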

2.1 Stationary Processes

We start by working with continuous time stochastic processes $\{X_t\}_{t\in I}$, so for the time being $I \subseteq \mathbb{R}$.

Definition 2.8. A continuous time stochastic process $\{X_t\}_{t\in I}$ is strictly stationary, or simply stationary, if its finite dimensional distributions are invariant under time shifts:
$$\mathbb{P}(X_{t_1} \in B_1, \dots, X_{t_k} \in B_k) = \mathbb{P}(X_{t_1+h} \in B_1, \dots, X_{t_k+h} \in B_k)$$
for all $h \in \mathbb{R}$ (such that $h + t_j \in I$), for all $k \in \mathbb{N}$, $t_1, \dots, t_k \in I$ and $B_1, \dots, B_k \in \mathcal{S}$, where $\mathcal{S}$ is the $\sigma$-algebra on the state space $S$ of the process $X_t$.

The intuitive meaning of this definition is readily seen when we take $k = 1$: Definition 2.8 then implies that the law of $X_t$ does not depend on $t$. Stationary processes are therefore used to describe phenomena which happen under conditions that do not change in time.

Definition 2.9. A continuous time stochastic process $\{X_t\}_{t\in I}$ is wide sense stationary (WSS) if it has finite first and second moments and
1. $\mathbb{E}(X_t)$ is constant, i.e. it does not depend on $t$;
2. $\mathrm{Cov}(X_t, X_s)$ is a function of the difference $t - s$.

The function $\mathrm{Cov}(X_t, X_s)$ is also called the autocovariance function of the process $X$. To motivate Definition 2.9, observe that if a process is strictly stationary then it is also WSS; indeed if $X_t$ is stationary and, for simplicity, takes values in $\mathbb{R}$, then we know that $\mathbb{P}(X_t \le x)$ does not depend on $t$, hence
$$\mathbb{E}(X_t) = \int x\, d\mathbb{P}(X_t \le x) \quad \text{is constant in time}.$$
We can now assume, without loss of generality, that $\mathbb{E}(X_t) = 0$ for all $t$. Then, analogously,
$$\mathbb{E}(X_{t+h} X_{s+h}) = \iint xy\, d\mathbb{P}(X_{t+h} \le x, X_{s+h} \le y) = \iint xy\, d\mathbb{P}(X_t \le x, X_s \le y) = \mathbb{E}(X_t X_s),$$
from which we deduce that $\mathbb{E}(X_t X_s)$ depends only on the difference $t - s$. So strictly stationary $\implies$ WSS. The converse is in general not true; however, for Gaussian processes strict stationarity is equivalent to WSS.
Clearly, the definition of stationarity can be given also for discrete-time processes, and it is completely analogous to the one given for continuous time processes.

Definition 2.10. A sequence of random variables $\{X_n\}_{n\in\mathbb{N}}$ is strictly stationary, or simply stationary, if for every $m \in \mathbb{N}$ and every $k \in \mathbb{N}$, the vector $(X_0, X_1, \dots, X_m)$ has the same distribution as $(X_k, X_{1+k}, \dots, X_{m+k})$.

The notion of stationarity will pop up several times during this course, as stationary processes enjoy good ergodic properties.

3 Markov Chains

One of the most complete accounts on Markov chains is the book [53]. For
MC on discrete state space we suggest [15], which is the approach that we
will follow in this section.

3.1 Defining Markovianity

Definition 3.1 (Markov chain). A discrete-time stochastic process $\{X_n\}_{n\in\mathbb{N}}$, $X_n : (\Omega, \mathcal{F}, \mathbb{P}) \to (S, \mathcal{S})$, on a state space $S$ is a Markov chain if
$$\mathbb{P}(X_{n+1} \in B | X_0, X_1, \dots, X_n) = \mathbb{P}(X_{n+1} \in B | X_n), \quad \text{for all } B \in \mathcal{S},\ n \in \mathbb{N}. \tag{9}$$

If the state space is discrete (countable or finite) we assume that $\mathcal{S}$ is the $\sigma$-algebra of all the subsets of $S$. In the discrete case it is customary to write (9) as
$$\mathbb{P}(X_{n+1} = x_{n+1} | X_0 = x_0, \dots, X_n = x_n) = \mathbb{P}(X_{n+1} = x_{n+1} | X_n = x_n). \tag{10}$$
However, if the state space is countable the above still holds almost everywhere; if the state space is finite then it holds pointwise.
Denoting by $\mathcal{F}_n$ the $\sigma$-algebra generated by $X_0, \dots, X_n$, by definition of conditional expectation (9) can be rewritten as
$$\mathbb{P}(X_{n+1} \in B | \mathcal{F}_n) = \mathbb{P}(X_{n+1} \in B | X_n), \quad \text{for all } B \in \mathcal{S},\ n \in \mathbb{N}. \tag{11}$$

Comment. Whether we look at (9) or (10), the moral of the above definition is always the same: suppose that $X_n$ represents the position of a particle moving in state space and that at time $n$ our particle is at point $x$ in state space. In order to know where the particle will be at time $n+1$ (more precisely, the probability for the particle to be at point $y$ at time $n+1$) we don't need to know the history of the particle, i.e. the positions occupied by the particle before time $n$. All we need to know is where we are at time $n$. In other words, given the present, the future is independent of the past. It could be useful, in order to better understand the idea of Markovianity, to compare it with its deterministic counterpart; indeed the concept of Markovianity is the stochastic equivalent of Cauchy's determinism: consider the deterministic system
$$\dot{z}(t) = f(z), \quad z(t_0) = z_0. \tag{12}$$
We all know that under technical assumptions on the function $f$ there exists a unique solution $z(t)$, $t \ge t_0$, to the above equation; i.e. given an evolution law $f$ and an initial datum $z_0$ we can tell the future of $z(t)$ for all $t \ge t_0$, and there is no need to know what happened to $z(t)$ before time $t_0$. Markovianity is the same thing, just reread in a stochastic way: for the deterministic system (12) we know exactly where $z$ will be at time $t$; for a Markov chain, given an initial position (or an initial distribution) at time $n_0$, we will know the probability of finding the system in a certain state for every time $n \ge n_0$.

Notation. If the chain is started at $x$ then we use the notation
$$\mathbb{P}_x(X_n \in B) := \mathbb{P}(X_n \in B | X_0 = x);$$
if instead the initial position of the chain is not deterministic but drawn at random from a certain probability distribution $\nu$ ($\nu$ a probability on $S$), then we write $\mathbb{P}_\nu(X_n \in B)$. Clearly $\mathbb{P}_x = \mathbb{P}_{\delta_x}$.
If the MC has initial distribution $\nu$, the finite dimensional distributions of a Markov Chain can be expressed through the relation
$$\mathbb{P}_\nu(X_0 \in B_0, X_1 \in B_1, \dots, X_n \in B_n) = \int_{B_0} \nu(dy_0) \int_{B_1} \mathbb{P}(X_1 \in dy_1 | X_0 = y_0) \dots \int_{B_n} \mathbb{P}(X_n \in dy_n | X_{n-1} = y_{n-1}). \tag{13}$$
During this course we will be mainly concerned with a special class of MC, i.e. time-homogeneous MC. For these processes the probability of going from $x$ to $y$ in one time step depends only on $x$ and $y$ and not on when we are in $x$. In other words, the one-step transition probabilities do not depend on time.

Definition 3.2 (Time-homogeneous Markov Chain). A Markov chain (MC) $\{X_n\}$ is time-homogeneous if
$$\mathbb{P}(X_{n+1} \in B | X_n) = \mathbb{P}(X_1 \in B | X_0), \quad \text{for all } n \ge 0. \tag{14}$$

Remark 3.3. To be more precise, one can prove that the above formula (13) is equivalent to the Markov property (9):
$$(9) \iff (13).$$
In words: a time-homogeneous Markov process has finite dimensional distributions of the form (13). Viceversa, if a stochastic process has finite dimensional distributions of the form (13), then it is a time-homogeneous Markov process.

In the rest of Section 3 (actually, in the rest of this entire course), we will always assume that we are dealing with time-homogeneous MC, unless otherwise explicitly stated. The Markov property and the time-homogeneity imply that we can write
$$\mathbb{P}(X_{n+1} \in B | X_n = x) =: p(x, B) \tag{15}$$
for some function $p : S \times \mathcal{S} \to [0,1]$.


Definition 3.4. A map $p : S \times \mathcal{S} \to [0,1]$ enjoying the following properties:
1. for fixed $x \in S$, $p(x, \cdot) : \mathcal{S} \to [0,1]$ is a probability measure, meaning that $p(x, S) = 1$;
2. for fixed $B \in \mathcal{S}$, $p(\cdot, B)$ is a measurable map,
is called a Markov transition function or a family of transition probabilities. If, in addition to 1 and 2, the relation (15) holds, $p$ are the transition probabilities of the Markov chain $X_n$.
Using the transition probabilities we can rewrite the finite dimensional distributions (13) of the MC as
$$\mathbb{P}_\nu(X_0 \in B_0, X_1 \in B_1, \dots, X_n \in B_n) = \int_{B_0} \nu(dy_0) \int_{B_1} p(y_0, dy_1) \dots \int_{B_n} p(y_{n-1}, dy_n).$$
Therefore, to a Markov process we can associate a family of transition probabilities, i.e. a family of functions fulfilling the requirements of Definition 3.4. The above formula also says that once we have an initial distribution, the transition probabilities are all we need in order to know the evolution of the chain. This means that we could have introduced Markov processes working the other way around, i.e. starting from the transition probabilities, as the following theorem states.
Theorem 3.5. For any initial measure $\nu$ on $(S, \mathcal{S})$ and for any family of transition probabilities $\{p(x, A) : x \in S, A \in \mathcal{S}\}$, there exists a stochastic process $\{X_n\}_{n\in\mathbb{N}}$ such that
$$\mathbb{P}_\nu(X_0 \in B_0, X_1 \in B_1, \dots, X_n \in B_n) = \int_{B_0} \nu(dy_0) \int_{B_1} p(y_0, dy_1) \dots \int_{B_n} p(y_{n-1}, dy_n). \tag{16}$$

Thanks to Theorem 3.5, an alternative way (alternative to Definition 3.1) of defining a MC is as follows: we first assign a family of transition probabilities and then we say that $X_n$ is a Markov Chain if
$$\mathbb{P}(X_{n+1} \in B | \mathcal{F}_n) = p(X_n, B), \quad \text{for all } n \in \mathbb{N} \text{ and } B \in \mathcal{S}.$$
At this point, as before (see Remark 3.3), one can prove that the finite dimensional distributions of $X_n$ are given by (16) and also, viceversa, that a process $X_n$ with finite dimensional distributions given by (16) is a Markov chain with transition probabilities $p(x, A)$.
Remark 3.6. Before reading this Remark you should revise the content of the Kolmogorov Extension Theorem in Appendix A.
Once we assign an initial distribution and a family of transition probabilities, the finite dimensional distributions (16) of the chain $X_k$ (say each r.v. of the chain is real valued) are a consistent family of probability measures on $\mathbb{R}^n$. Therefore the Kolmogorov Extension Theorem applies and there exists a probability measure $\mathbb{P}_\nu$ on the sequence space $\mathbb{R}^{\mathbb{N}}$ (equipped with the $\sigma$-algebra $\mathcal{R}^{\mathbb{N}}$ generated by the cylinder sets) such that for all $m \in \mathbb{N}$ and all $A_i \in \mathcal{B}(\mathbb{R})$
$$\mathbb{P}_\nu(\omega \in \mathbb{R}^{\mathbb{N}} : \omega_i \in A_i,\ i = 1, \dots, m) = \mathbb{P}(X_1 \in A_1, \dots, X_m \in A_m).$$
In other words the coordinate maps of $(\mathbb{R}^{\mathbb{N}}, \mathcal{R}^{\mathbb{N}}, \mathbb{P}_\nu)$, i.e. the maps $Y_n : (\mathbb{R}^{\mathbb{N}}, \mathcal{R}^{\mathbb{N}}, \mathbb{P}_\nu) \to \mathbb{R}$ defined as $Y_n(\omega) = Y_n(\omega_0, \omega_1, \dots) = \omega_n$, have the same finite dimensional distributions as the chain $X_n$ on $(\Omega, \mathcal{F}, \mathbb{P})$. We have therefore found a representation of our chain in sequence space.

3.2 Time-homogeneous Markov Chains on countable state space

In the remainder of this chapter we will assume that the Markov Chain $X_n$ that we are dealing with is time-homogeneous and that it takes values on a countable state space $S$. Each $x \in S$ is called a state of the chain. We endow $S$ with the $\sigma$-algebra $\mathcal{S}$ of all the subsets of $S$. To fix ideas you can think of $S = \mathbb{Z}$. The proofs that we will skip can be found in [15]. For this class of chains we have:

- Transition probabilities: in this case it suffices to assign the transition matrix$^3$ $p = \{p(x, y),\ x, y \in S\}$, where the map $p : S \times S \to [0,1]$ satisfies
$$\sum_{y\in S} p(x, y) = 1 \quad \text{and} \quad p(x, y) \ge 0, \quad \text{for all } x, y \in S.^4$$
Clearly, $p(x, y) := \mathbb{P}(X_1 = y | X_0 = x) = \mathbb{P}_x(X_1 = y)$, and for all $n \ge 1$ we denote $p^n(x, y) := \mathbb{P}_x(X_n = y)$.

- Markov property: $\mathbb{P}(X_{n+1} = x_{n+1} | X_n = x_n, \dots, X_0 = x_0) = p(x_n, x_{n+1})$.$^5$

- Finite dimensional distributions:
$$\mathbb{P}_\nu(X_0 = x_0, \dots, X_n = x_n) = \nu(x_0)\, p(x_0, x_1) \dots p(x_{n-1}, x_n).$$

$^3$ $p$ will actually be a matrix if the state space is finite; it will be an infinite matrix if the state space is countable.
$^4$ The first condition says that $p$ is a stochastic matrix. Here we don't need to specify that $p(x, y)$ is measurable in the first argument because with the chosen $\sigma$-algebra this is automatically true.
$^5$ Observe that this equality contains both the loss of memory and the time-homogeneity.

The next theorem gives an extremely important property of the MC.

Theorem 3.7 (Chapman-Kolmogorov equation). Let $X_n$ be a time-homogeneous Markov chain with discrete state space. Then for any $m, n \ge 0$,
$$\mathbb{P}_x(X_{n+m} = y) = \sum_{z\in S} \mathbb{P}_x(X_n = z)\, \mathbb{P}_z(X_m = y).$$

Proof. We recall the conditional version of the law of total probability:
$$\mathbb{P}(A|B) = \sum_j \mathbb{P}(A | C_j \cap B)\, \mathbb{P}(C_j | B).$$
Using this fact,
$$\begin{aligned}
\mathbb{P}(X_{n+m} = y | X_0 = x) &= \sum_{z\in S} \mathbb{P}(X_{n+m} = y | X_m = z, X_0 = x)\, \mathbb{P}(X_m = z | X_0 = x) \\
&= \sum_{z\in S} \mathbb{P}(X_{n+m} = y | X_m = z)\, \mathbb{P}(X_m = z | X_0 = x),
\end{aligned}$$
having used the Markov property in the second equality. $\square$
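In matrix form the Chapman-Kolmogorov equation reads $p^{n+m} = p^n p^m$. A quick numerical check in Python (the 3-state stochastic matrix below is an arbitrary illustration, not one from the notes):

\begin{verbatim}
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])   # rows sum to 1: a stochastic matrix

n, m = 3, 4
lhs = np.linalg.matrix_power(P, n + m)                          # p^{n+m}
rhs = np.linalg.matrix_power(P, n) @ np.linalg.matrix_power(P, m)
print(np.allclose(lhs, rhs))                                    # True
\end{verbatim}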


Example 3.8 (Random Walk on the integers). Let $\xi_1, \xi_2, \dots$ be i.i.d. random variables taking values in $\mathbb{Z}$ and with $\mu(j) = \mathbb{P}(\xi_n = j)$. The random walk $\eta_n$ is defined as $\eta_n = \eta_{n-1} + \xi_n$, $n \ge 1$. Let us calculate the transition probabilities of the chain:
$$\mathbb{P}(\eta_1 = y | \eta_0 = x) = \mathbb{P}(\eta_0 + \xi_1 = y | \eta_0 = x) = \mathbb{P}(x + \xi_1 = y) = \mu(y - x).$$
Therefore $p(x, y) = \mu(y - x)$. Notice that the transition probability of going from $x$ to $y$ depends only on the increment $y - x$ and not on $x$ and $y$ separately, i.e. the random walk is translation invariant.
Example 3.9 (Renewal Chain). This time the state space is $\mathbb{N}$. We define the chain through its transition probabilities as follows: given a sequence of non-negative numbers $a_k \ge 0$ such that $\sum_{k\ge 0} a_k = 1$,
$$p(k, k-1) = 1 \quad \text{if } k \ge 1, \qquad p(0, k) = a_k \quad \text{for all } k \ge 0, \qquad p(j, k) = 0 \quad \text{otherwise}.$$

Example 3.10 (Ehrenfest chain). A box contains $N$ air molecules. The box is divided in two chambers, which communicate through a small hole. The state of the system is determined once we know the number $k$ of molecules that are contained, say, in the left chamber at each moment in time. Ehrenfest modelled the evolution of the system through a Markov chain defined as follows: suppose that at time $n$ there are $k$ molecules in the left chamber. Assuming that only one molecule per time step can go through the hole, at time $n+1$ either one molecule has gone from left to right (in which case we will end up with $k-1$ particles on the left) or one molecule has gone from right to left (leaving us with $k+1$ particles on the left). In other words, Ehrenfest modelled the behaviour of gas molecules by using one of the classical urn problems of probability theory. With this in mind, the transition probabilities of the chain on state space $S = \{0, 1, \dots, N\}$ are given by
$$p(k, k-1) = k/N \quad \text{and} \quad p(k, k+1) = (N-k)/N, \quad \text{for all } k,$$
while $p(j, k) = 0$ otherwise.
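A sketch of how one can experiment with this chain in Python (numpy assumed). It is a classical fact, not proved at this point of the notes, that the binomial measure $\pi(k) = \binom{N}{k} 2^{-N}$ is stationary for the Ehrenfest chain; the code below checks the stationarity relation $\pi P = \pi$ numerically (see Definition 3.21 below for the formal definition):

\begin{verbatim}
import numpy as np
from math import comb

N = 10
P = np.zeros((N + 1, N + 1))
for k in range(N + 1):
    if k >= 1:
        P[k, k - 1] = k / N        # one molecule moves left -> right
    if k <= N - 1:
        P[k, k + 1] = (N - k) / N  # one molecule moves right -> left

# Candidate stationary measure: pi(k) = C(N, k) / 2^N.
pi = np.array([comb(N, k) / 2**N for k in range(N + 1)])
print(np.allclose(pi @ P, pi))     # True: pi is stationary
\end{verbatim}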


Definition 3.11. For every set $A \subseteq S$ we define
$$\tau_A := \inf\{n \ge 0 : X_n \in A\} \quad \text{and} \quad T_A := \inf\{n \ge 1 : X_n \in A\},$$
the hitting time of $A$ and the time of first return to $A$, respectively. With obvious extension of notation, $\tau_x := \tau_{\{x\}}$ and $T_x := T_{\{x\}}$, for all $x \in S$. For all $k \ge 0$, the time of $k$-th return to $x$ can be defined recursively:
$$T_x^0 := 0 \quad \text{and} \quad T_x^k := \inf\{n > T_x^{k-1} : X_n = x\}, \quad k \ge 1.$$
In this way $T_x^1 = T_x$. Moreover we let
$$\rho_{xy} := \mathbb{P}_x(T_y < \infty)$$
and we say that $x$ communicates with $y$ (in symbols, $x \to y$) if $\rho_{xy} > 0$.$^6$ Finally we denote by $N(y)$ the number of visits to $y$, i.e.
$$N(y) := \sum_{n=1}^{\infty} \mathbf{1}_{(X_n = y)}.$$

$^6$ Most textbooks, especially the most control-theory oriented, will say that $y$ is accessible from $x$ if there exists $n \in \mathbb{N}$, $n \ge 0$, such that $p^n(x, y) > 0$, and will then say that $x$ and $y$ communicate if $x$ is accessible from $y$ and $y$ is accessible from $x$. Notice that our definition is slightly different, as $\rho_{xy}$ is defined through the time of first return, as opposed to being defined through the hitting time.

We notice without proof that
$$\mathbb{P}_x(T_y^k < \infty) = \rho_{xy}\, \rho_{yy}^{k-1}. \tag{17}$$

Now a few definitions regarding the classification of the states of the chain.

Definition 3.12. We say that the state $y \in S$ is
- recurrent if $\rho_{yy} = 1$;
- transient if $\rho_{yy} < 1$;
- positive recurrent if $\rho_{yy} = 1$ and $\mathbb{E}_y(T_y) < \infty$;
- null recurrent if $\rho_{yy} = 1$ and $\mathbb{E}_y(T_y) = \infty$.
A state $y \in S$ is absorbing if $\mathbb{P}_y(X_1 = y) = 1$.

Because for any given state $y$ we can only either have $\rho_{yy} = 1$ or $\rho_{yy} < 1$, a state of a chain can only be either recurrent or transient. In the first case we know from (17) that $\mathbb{P}_y(T_y^k < \infty) = 1$ for all $k \ge 1$. Therefore the chain will return to $y$ infinitely many times (more precisely, $\mathbb{P}(X_n = y \text{ i.o.}) = 1$). If instead $y$ is transient then, on average, the chain will only return to $y$ a finite number of times.
Theorem 3.13. A state $y$ is recurrent if and only if $\mathbb{E}_y(N(y)) = \infty$, and it is transient if and only if $\mathbb{E}_y(N(y)) < \infty$.

Proof. Recalling the definition of $N(y)$, Definition 3.11, we have
$$\mathbb{E}_y(N(y)) = \sum_{k=1}^{\infty} \mathbb{P}_y(N(y) \ge k) = \sum_{k=1}^{\infty} \mathbb{P}_y(T_y^k < \infty) \overset{(17)}{=} \sum_{k=1}^{\infty} \rho_{yy}^k = \begin{cases} \infty & \text{iff } \rho_{yy} = 1 \\ \dfrac{\rho_{yy}}{1 - \rho_{yy}} < \infty & \text{iff } \rho_{yy} < 1. \end{cases} \tag{18}$$
$\square$
Theorem 3.14. If $x$ is recurrent and communicates with $y$, i.e. $\rho_{xy} > 0$, then $y$ is recurrent and $\rho_{yx} = 1$.

Proof. Let us start with proving that if $x$ is recurrent and $\rho_{xy} > 0$ then $\rho_{yx} = 1$. Recall that $x$ recurrent means that $\mathbb{P}_x(T_x < \infty) = 1$, i.e. $\mathbb{P}_x(T_x = \infty) = 0$. We will show that if $\rho_{xy} > 0$ and $\rho_{yx} < 1$ then $x$ cannot be recurrent. If $\rho_{xy} > 0$ then there exists $h > 0$ such that $p^h(x, y) > 0$. Let $\bar{h}$ be the smallest $h$ for which $p^h(x, y) > 0$. Then there exist $z_1, \dots, z_{\bar{h}-1} \in S$ such that
$$p(x, z_1)\, p(z_1, z_2) \dots p(z_{\bar{h}-1}, y) > 0$$
and $z_j \ne x$ for all $j$ (otherwise $\bar{h}$ wouldn't be the smallest $h$ for which $p^h(x, y) > 0$). With this in mind, if $\rho_{yx} < 1$ we have
$$\mathbb{P}_x(T_x = \infty) \ge p(x, z_1) \dots p(z_{\bar{h}-1}, y)\, \mathbb{P}_y(T_x = \infty) = p(x, z_1) \dots p(z_{\bar{h}-1}, y)\,(1 - \rho_{yx}) > 0,$$
so it has to be $\rho_{yx} = 1$. Now let us prove that $y$ is recurrent. To do so, we will show that $\mathbb{E}_y(N(y)) = \infty$. Because $\rho_{yx} > 0$, there exists $\ell > 0$ s.t. $p^{\ell}(y, x) > 0$, and recall also that $p^{\bar{h}}(x, y) > 0$. From the Chapman-Kolmogorov equation we have that
$$p^{\ell + n + \bar{h}}(y, y) \ge p^{\ell}(y, x)\, p^n(x, x)\, p^{\bar{h}}(x, y),$$
so that summing over $n$ on both sides we get $\mathbb{E}_y(N(y)) = \infty$, as $\mathbb{E}_x(N(x)) = \sum_{n\ge 1} p^n(x, x) = \infty$ because $x$ is recurrent. $\square$
Now we need another couple of definitions.

Definition 3.15. A set $C \subseteq S$ is closed if
$$x \in C \text{ and } \rho_{xy} > 0 \implies y \in C.$$
A set $F \subseteq S$ is irreducible if
$$x, y \in F \implies \rho_{xy} > 0.$$
A chain is irreducible if the whole state space is irreducible.

These two definitions might look similar but they are actually quite different. In the case of a closed set, if we start in $C$ then we remain in $C$, i.e. if $x \in C$ then $\mathbb{P}_x(X_n \in C) = 1$ for all $n \ge 1$. Also, strictly speaking, if we have a closed set $C$ and we consider the set $C' = C \cup \{z\}$, where $p(c, z) = 0$ for all $c \in C$, then $C'$ is still closed. For an irreducible set this cannot happen, as every element has to communicate with every other. Moreover, in an irreducible set, if $\rho_{xy} > 0$ then also $\rho_{yx} > 0$, and this is not true for a closed set.

With the above definitions, the following corollary is a straightforward consequence of Theorem 3.14.

Corollary 3.16. If a chain is irreducible then either all the states are recurrent or all the states are transient.
What the following Theorem 3.17 and Corollary 3.18 prove is that in the
case of a chain with finite state space, there is a way to decide which of the
two cases occurs: if S is finite, closed and irreducible then all the states are
recurrent.
Theorem 3.17. If C is a finite closed set then C contains at least one
recurrent state.
Corollary 3.18. If C is a finite set which is closed and irreducible then
every state in C is recurrent.
Proof of Theorem 3.17. By contradiction, suppose all the states in $C$ are transient. If this is the case then, acting as in (18), we have
$$\mathbb{E}_x(N(y)) = \rho_{xy}/(1 - \rho_{yy}) < \infty. \tag{19}$$
So if we pick $x \in C$, we have
$$\infty > \sum_{y\in C} \mathbb{E}_x(N(y)) = \sum_{n=1}^{\infty} \sum_{y\in C} p^n(x, y) = \infty,$$
a contradiction: the inequality on the left follows from the finiteness of $C$, and the last equality follows from the fact that $\sum_{y\in C} p^n(x, y) = 1$ as $C$ is closed. $\square$
last equality follows from the fact that yC pn (x, y) = 1 as C is closed.
Sometimes (and when I say sometimes I mean when the state space is finite and of reasonable cardinality), it can be useful to draw a Markov chain, which can be done as follows.

Example 3.19. When the state space is finite the transition probabilities can be organized in a transition matrix, $P = (p(x, y))_{x,y\in S}$. Consider the transition matrix
$$P = \begin{pmatrix}
0.3 & 0 & 0 & 0 & 0.7 & 0 & 0 \\
0.1 & 0.2 & 0.3 & 0.4 & 0 & 0 & 0 \\
0 & 0 & 0.5 & 0.5 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0 & 0.5 & 0 \\
0.6 & 0 & 0 & 0 & 0.4 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.2 & 0.8 \\
0 & 0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix},$$
with rows and columns indexed by the states $1, \dots, 7$.

We draw a graph without indicating the self-loops. [The transition graph is not reproduced here; its arrows are exactly the nonzero off-diagonal entries of the matrix above.]

The sets $\{1, 5\}$ and $\{4, 6, 7\}$ are irreducible closed sets, so every state in these sets is recurrent.
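A small computational companion to Example 3.19 (numpy assumed; Python indices run from 0, so the states $\{1,5\}$ and $\{4,6,7\}$ become $\{0,4\}$ and $\{3,5,6\}$). Raising $P$ to a high power shows the mass draining out of the transient states 2 and 3:

\begin{verbatim}
import numpy as np

P = np.array([
    [0.3, 0.0, 0.0, 0.0, 0.7, 0.0, 0.0],
    [0.1, 0.2, 0.3, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.0],
    [0.6, 0.0, 0.0, 0.0, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.8],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
])

Pn = np.linalg.matrix_power(P, 200)
print(np.round(Pn, 3))
# The columns for the transient states 2 and 3 (indices 1, 2) are
# numerically zero: the chain ends up in one of the closed sets
# {1,5} or {4,6,7} with probability one.
\end{verbatim}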
You might think that the fact that all the recurrent states are contained
in one of the closed and irreducible sets of the chain is specific to the above
example. The following theorem shows that this is not the case.
Theorem 3.20 (Decomposition Theorem). Let $R := \{r \in S : r \text{ is recurrent}\}$ be the set of all recurrent states of the chain. Then $R$ can be written as the disjoint union of closed and irreducible sets, i.e.
$$R = \cup_i C_i, \quad C_i \text{ closed and irreducible for every } i.$$

Proof (almost all of it). Let $x$ be a recurrent state and define $C_x := \{y \in S : \rho_{xy} > 0\}$. From Theorem 3.14 we know that $C_x \subseteq R$ and that $\rho_{yx} > 0$ for every $y \in C_x$. Therefore, for any two recurrent states $x$ and $y$, either $C_x \cap C_y = \emptyset$ or $C_x = C_y$. $\square$
Now we want to introduce a very important notion in the theory of stochastic processes, the notion of stationary measure. This will lead us to the definition of ergodicity. Technicalities are of fundamental importance but sometimes, if not driven by the right intuition, they can obfuscate the idea underlying them. So let us first explain in words what we are trying to do, starting with a very simple example. Consider a pendulum, oscillating around the vertical position. The only (stable) stationary state for this system is the vertical position. Such a state is stationary in the sense that if we start the motion at that position, the system is going to remain in that state. In other words, such a state is an equilibrium for the system. If we start the motion out of equilibrium, in the long run we will end up at equilibrium again. However, the one that we have described is a deterministic system, whereas we deal with stochastic systems, so we don't look at the deterministic evolution but rather at the evolution of probability measures, as the state of the system at time $n$ is described by a probability measure (think about the Ehrenfest chain for example, where the state of the system is described once we exhibit the probability of having $k$ molecules in the left chamber at time $n$, for all $k$ and $n$). Therefore in our context we don't talk about stationary states; we rather consider stationary measures, and we say that a measure $\pi$ is stationary for the process $X_n$ if $X_0 \sim \pi \implies X_n \sim \pi$ for all $n \ge 0$. As in the case of the pendulum, stationary measures are potential candidates to be the equilibrium state of the chain and therefore they describe the long time behaviour of the process. Obviously there can be more than one stationary measure. This leads us to the concept of ergodicity, which is a property regarding the long time behaviour of the process: a process is ergodic if it admits a unique stationary measure. We will come back to this definition later and state it in more detail, but for the moment, roughly speaking, in order for a process to be ergodic, it has to (in the long run)

- explore the whole state space;
- explore it in a homogeneous way, i.e. we want the time spent in a given area of the state space to be proportional to how big that area is (indeed the physicists' definition of ergodicity is "space averages equal time averages");
- do all of the above independently of the initial condition.

Loosely speaking, if the process is ergodic, it will converge to the stationary measure. Let us now start with the proper maths, hoping to give more intuition along the way. We recall that in all that follows we are still referring to time-homogeneous Markov chains with discrete state space.
Definition 3.21 (Stationary measure). A measure $\pi$ on $S$ is stationary for the Markov chain $X_n$ with transition matrix $p$ if
$$\sum_x \pi(x)\, p(x, y) = \pi(y). \tag{20}$$
If $\pi$ is a probability measure then we call it a stationary distribution.

A short notation for (20) is $\pi p = \pi$. If $\pi$ is the initial distribution of the chain then the LHS of (20) is $\mathbb{P}_\pi(X_1 = y)$. So, if $X_0 \sim \pi$ then also $X_1 \sim \pi$ and $X_n \sim \pi$ for all $n \ge 1$ (check), and this is the reason why these measures are also called invariant. In particular, check that if (20) holds then
$$\sum_x \pi(x)\, p^n(x, y) = \pi(y) \quad \text{for all } n \ge 1.$$

Example 3.22. Consider the simple random walk from the exercise sheet. Then $\pi(k) \equiv 1$ is a stationary measure for the chain, as
$$\sum_k \pi(k)\, p(k, j) = \pi(j-1)\, p(j-1, j) + \pi(j+1)\, p(j+1, j) = 1 - p + p = 1 = \pi(j).$$

Definition 3.23 (Reversibility). If a measure $\pi$ satisfies
$$\pi(x)\, p(x, y) = \pi(y)\, p(y, x) \tag{21}$$
then $\pi$ is a reversible measure.

The equality (21) is also called the detailed balance condition (DB), and it is of fundamental importance in the context of Metropolis-Hastings algorithms.

Theorem 3.24. If a measure $\pi$ satisfies the DB condition (21) then it is stationary.

Proof. Just sum over $x$ on both sides of (21) to get
$$\sum_{x\in S} \pi(x)\, p(x, y) = \pi(y) \sum_{x\in S} p(y, x) = \pi(y),$$
where the last equality follows from the fact that $p$ is a stochastic matrix. $\square$
Remark 3.25. Suppose that the Markov chain $X_n$ with transition matrix $p$ admits a stationary measure $\pi$ and that $X_0 \sim \pi$, i.e. the chain is started in stationarity. For every fixed $n \in \mathbb{N}$ we can consider the process $\{Y_m^n\}_{0\le m\le n}$ defined as $Y_m^n := X_{n-m}$, i.e. $Y_m^n$ is the time-reversed $X_n$. Then for every $n \in \mathbb{N}$, $Y_m^n$ is a time-homogeneous Markov chain with $Y_0 \sim \pi$. To calculate the transition probabilities $q(x, y)$ of $Y_m^n$ we use Bayes' formula:
$$q(x, y) = \mathbb{P}(Y_1 = y | Y_0 = x) = \mathbb{P}(X_{n-1} = y | X_n = x) = \mathbb{P}(X_n = x | X_{n-1} = y)\, \frac{\pi(y)}{\pi(x)} = \frac{p(y, x)\, \pi(y)}{\pi(x)}.$$
If the DB condition holds, then $q(x, y) = p(x, y)$ for all $x, y$, and in this case $X_n$ is called time-reversible.
Now a very important definition, which holds for chains as well as for continuous time processes.

Definition 3.26. A Markov chain is said to be ergodic if it admits a unique stationary probability distribution. In this case, the invariant distribution is said to be the ergodic measure for the chain.

I would like to stress that, depending on the book you open, you might find slightly different definitions of ergodicity. However, they all aim at describing the same intuitive idea. The next theorem says that as soon as we have a recurrent state $x$, we can construct a stationary measure $\mu_x(y)$ by looking at the expected number of visits to $y$ before returning to $x$, i.e. the expected time spent by the chain in $y$.

Theorem 3.27. Recall that $T_x := \inf\{n \ge 1 : X_n = x\}$. If $x$ is a recurrent state then the measure
$$\mu_x(y) := \mathbb{E}_x\left[\sum_{n=0}^{T_x - 1} \mathbf{1}_{(X_n = y)}\right]$$
is stationary.
Proof of Theorem 3.27. We need to prove that
$$\sum_{y\in S} \mu_x(y)\, p(y, z) = \mu_x(z).$$
Let us study two cases separately, the case $x \ne z$ and the case $x = z$.

Case $x \ne z$: to this end let us rewrite $\mu_x(y) = \sum_{n=0}^{\infty} \mathbb{P}_x(X_n = y, T_x > n)$, so we have
$$\begin{aligned}
\sum_{y\in S} \mu_x(y)\, p(y, z) &= \sum_{n=0}^{\infty} \sum_{y\in S} \mathbb{P}_x(X_n = y, T_x > n)\, p(y, z) \\
&= \sum_{n=0}^{\infty} \mathbb{P}_x(X_{n+1} = z, T_x > n) = \sum_{n=0}^{\infty} \mathbb{P}_x(X_{n+1} = z, T_x > n + 1),
\end{aligned}$$
where the last equality holds because $z \ne x$: on the event $\{X_{n+1} = z\}$ the chain has not yet returned to $x$ at time $n+1$, so $\{T_x > n\}$ and $\{T_x > n+1\}$ coincide. On the other hand,
$$\mu_x(z) = \sum_{n=0}^{\infty} \mathbb{P}_x(X_n = z, T_x > n) = \sum_{n=1}^{\infty} \mathbb{P}_x(X_n = z, T_x > n) = \sum_{n=0}^{\infty} \mathbb{P}_x(X_{n+1} = z, T_x > n + 1)$$
(the $n = 0$ term vanishes since $\mathbb{P}_x(X_0 = z) = 0$ for $z \ne x$), proving the claim if $x \ne z$.

Case $x = z$: since $\mu_x(x) = 1$ by definition, we need to show that
$$\sum_{y\in S} \mu_x(y)\, p(y, x) = \sum_{n=0}^{\infty} \sum_{y\in S} \mathbb{P}_x(X_n = y, T_x > n)\, p(y, x) = \sum_{n=0}^{\infty} \mathbb{P}_x(T_x = n + 1) = 1,$$
where in the last equality we used the fact that $x$ is recurrent. $\square$

Theorem 3.28. If the chain is irreducible and recurrent then there exists a unique stationary measure, up to constant multiples.

Idea of Proof. Let $R$ be the set defined in Theorem 3.20. The moral is the following: if we pick $x \in C_i$ then the measure $\mu_x$ defined in Theorem 3.27 is a stationary measure. If we pick any other $\tilde{x}$ in the same set $C_i$, the corresponding measure $\mu_{\tilde{x}}$ is only a constant multiple of $\mu_x$. Under our assumptions, there is only one big recurrent irreducible set, the state space $S$. Therefore we can pick any $x \in S$ and construct an invariant measure $\mu_x$. At this point one proves that any other invariant measure, say $\nu$, will be a constant multiple of $\mu_x$, i.e. there exists a constant $K > 0$ such that
$$\mu_x(y) = K \nu(y) \quad \text{for all } y \in S.$$
So far we have a way to establish existence of the stationary measure, Theorem 3.27, and a result to establish uniqueness, Theorem 3.28. But what we are looking for is a distribution.

Theorem 3.29. Suppose the chain is irreducible. Then the following statements are equivalent:
1. the chain is positive recurrent (i.e. all the states are positive recurrent);
2. at least one state is positive recurrent;
3. the chain admits a unique stationary distribution.

In order to prove the above result we need the following two technical lemmata.

Lemma 3.30. Suppose $\pi$ is a stationary distribution for the chain. Then
$$\pi(x) > 0 \implies x \text{ is recurrent}.$$

Proof of Lemma 3.30. Exercise.

Lemma 3.31. Suppose the chain has a stationary distribution $\pi$. If the chain is also irreducible then we can express $\pi$ as
$$\pi(y) = 1/\mathbb{E}_y T_y.$$

Proof of Lemma 3.31. Irreducibility $\implies \pi(z) > 0$ for all $z \in S$ $\implies$ the chain is recurrent (by Lemma 3.30). Irreducible recurrent chains have a unique stationary measure up to constant multiples, and we know that objects of the form $\mu_x(y) = \sum_{n=0}^{\infty} \mathbb{P}_x(X_n = y, T_x > n)$ are stationary measures. The normalization factor for $\mu_x$ is
$$\sum_y \mu_x(y) = \sum_{n=0}^{\infty} \sum_y \mathbb{P}_x(X_n = y, T_x > n) = \sum_{n=0}^{\infty} \mathbb{P}_x(T_x > n) = \mathbb{E}_x T_x.$$
Now it is clear that our stationary distribution is
$$\pi(z) = \frac{\mu_z(z)}{\sum_y \mu_z(y)} = \frac{1}{\mathbb{E}_z T_z},$$
where we used $\mu_z(z) = 1$. $\square$
Proof of Theorem 3.29. $1 \implies 2$ is obvious.
$2 \implies 3$: suppose $x$ is positive recurrent, so $\mathbb{E}_x T_x < \infty$. Then from the proof of Lemma 3.31 and from Theorem 3.27 we know that
$$\pi(y) := \mu_x(y)/\mathbb{E}_x T_x$$
is a stationary distribution.
$3 \implies 1$: if the chain is irreducible then $\pi(x) > 0$ for all $x$, and the stationary distribution can be expressed as
$$\pi(x) = \frac{1}{\mathbb{E}_x T_x} > 0,$$
hence $\mathbb{E}_x T_x < \infty$ for all $x$, i.e. the chain is positive recurrent. $\square$

This completes the picture. Roughly speaking we have shown that

- recurrence $\implies$ existence of an invariant measure;
- irreducibility $\implies$ uniqueness of the invariant measure;
- positive recurrence $\implies$ finiteness of the stationary measure, i.e. we have a stationary distribution.

Now that we have a unique candidate for the equilibrium state of the chain, i.e. for the asymptotic behaviour of the process, we need to answer the last question: when does the chain converge to equilibrium? In other words, we want to understand the behaviour of $p^n(x, y)$ for large $n$. Before answering the question we need another definition.

Definition 3.32. Let $d_x$ be the greatest common divisor of the integers in the set $\{n \ge 1 : p^n(x, x) > 0\}$; $d_x$ is the period of $x$. If $d_x = 1$ then $x$ is aperiodic. A chain is aperiodic if all the states are aperiodic.

As you can imagine:

Lemma 3.33. If $\rho_{xy} > 0$ then $d_x = d_y$.

Now let us answer the question.
Theorem 3.34. Suppose the chain is irreducible and positive recurrent. Then we know there exists a unique stationary distribution $\pi$. If the chain is also aperiodic then $p^n(x, y) \to \pi(y)$, and the convergence is in total variation, i.e.
$$\sum_{y\in S} |p^n(x, y) - \pi(y)| \to 0. \tag{22}$$

Comment. It is clear now that $\pi(y)$ represents the probability for the chain to be in $y$ as $n \to \infty$. It is very important that convergence to $\pi$ happens irrespective of the initial datum that we pick, i.e. in the limit the initial condition is forgotten. We will not prove the above theorem as we think that the meaning of the statement is already transparent enough: if there exists a unique stationary distribution, which follows from irreducibility and positive recurrence, the only thing that can prevent the process from converging to its unique candidate limit is periodicity. Once we rule that possibility out, the asymptotic behaviour of the chain can only be described by $\pi$. Another crucial observation to bear in mind is that (22) implies $p^n(x, y) \xrightarrow{n\to\infty} \pi(y)$ for all $y$. Because $\pi(y) = 1/\mathbb{E}_y T_y$, this result is quite intuitive: the probability of being in $y$ is asymptotically inversely proportional to the expected value of the time of first return to $y$. And this brings us to the last result, which is very much along the lines of "time averages equal space averages".
Define $N_n(y)$ to be the number of visits to $y$ in the first $n$ steps, i.e.
$$N_n(y) = \sum_{j=1}^n \mathbf{1}_{(X_j = y)}.$$

Theorem 3.35. Suppose $y$ is a recurrent state of the chain. Then for every initial state of the chain $x \in S$, we have
$$\frac{N_n(y)}{n} \to \frac{1}{\mathbb{E}_y T_y}\, \mathbf{1}_{(T_y < \infty)}, \quad \mathbb{P}_x\text{-a.s.}$$

Remark 3.36 (On the nomenclature). In all of the above we have referred all our definitions to the chain $X_n$ rather than to its transition probabilities, i.e. for example we said that the chain is recurrent or irreducible etc. However, one can equivalently refer all these definitions to the transition probabilities $p$ and, in order to say e.g. that the chain is aperiodic, we can equivalently say that $p$ is aperiodic.
Example 3.37. This example shows why aperiodicity is a much needed assumption in Theorem 3.34. Consider the Markov Chain on state space $S = \{a, b\}$ with transition matrix
$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
This chain is clearly positive recurrent and irreducible, hence ergodic, with invariant measure $\pi(a) = \pi(b) = 1/2$. However, it is also periodic with period 2. It is easy to check that the result of Theorem 3.34 does not hold for this chain; indeed $p^n(a, b)$ is one if $n$ is odd and 0 if $n$ is even (and analogously for $p^n(b, a)$). Therefore $p^n$ doesn't converge at all.
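One can see both behaviours numerically. The "lazy" modification below is our own illustration, not from the notes: adding a self-loop of probability $1/2$ makes the chain aperiodic, and $p^n$ then converges to $\pi = (1/2, 1/2)$ as Theorem 3.34 predicts (numpy assumed):

\begin{verbatim}
import numpy as np

P_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])
P_lazy = 0.5 * np.eye(2) + 0.5 * P_periodic  # aperiodic modification

for n in (10, 11):
    print(np.linalg.matrix_power(P_periodic, n))  # oscillates with n
print(np.linalg.matrix_power(P_lazy, 50))  # ~ [[0.5, 0.5], [0.5, 0.5]]
\end{verbatim}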

4 Markov Chain Monte Carlo Methods

For the material of this section we refer to [74, 67, 2, 52, 30, 10].
In the previous section we studied the basic theory of Markov chains on finite or countable state space. Most of this theory can be translated to general state space, but this is not what we want to do now. In this section we want to look at one of the main applications of the theory that we have seen so far: Markov Chain Monte Carlo methods (MCMC). The main idea and purpose of MCMC is readily explained: suppose we want to sample from a given probability distribution $\pi(x)$ on a state space $S$. We know from Theorem 3.34 that if $X_n$ is an ergodic (and aperiodic) Markov chain with invariant distribution $\pi$, then for $n$ large enough $p^n(x, y) \approx \pi(y)$, i.e. the outcomes of the chain are distributed according to $\pi$. Therefore one way of sampling from $\pi$ is to construct a MC which has $\pi$ as limiting distribution. If we run the simulation of the chain long enough, the elements $X_n$ will be the desired samples from $\pi$.$^7$ For simulation purposes, a very important question is how big $n$ needs to be; and, on top of that, how long it takes the chain, once stationarity is reached, to explore the state space. We shall not address these points here, but bear in mind that this is a very important factor to optimize.
An easier question at this point is: why do we need to sample from probability distributions? Well, the main uses of Monte Carlo^8 and Markov Chain Monte Carlo are in integration and optimization problems:

i) To calculate multidimensional integrals: we shall see in Section 5 that, roughly speaking, if a chain is ergodic and stationary with invariant probability $\pi$ then
$$\lim_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} f(X_j) = E_\pi(f) = \int f(x)\pi(x)\,dx,^9 \qquad (23)$$
for every $\pi$-integrable function $f$ (i.e. for every $f$ such that $\int|f(x)|\pi(x)\,dx$ is finite). We will defer to Section 5 a more thorough discussion of the limit (23), usually known as Ergodic Theorem. For the time being the important things to notice are: (i) the limit (23) is a strong law of large numbers, see Theorem 1.10; (ii) if we take $f$ to be the indicator function of a measurable set then (23) says precisely that, in the limit, time averages equal space averages; most importantly to our purposes (iii) for $n$ large enough, the quantity on the LHS is a good approximation of the integral on the right, see Exercise 2.
^7 However MCMC should only be used when other direct analytical methods are not applicable to the situation at hand.
^8 On a historical note, Monte Carlo (or also Ordinary Monte Carlo (OMC)) came before MCMC. OMC was pretty much a matter of statistics: if you could simulate from a given distribution, then you could simulate a sequence of i.i.d. random variables with that distribution and then use the Law of Large Numbers to approximate integrals. When the method was introduced, the sequence of i.i.d. samples was not necessarily created as a stationary Markov chain.
^9 Notice that as usual we are assuming that $\pi$ has a density, which we keep denoting by $\pi$.

ii) To calculate the minima (or maxima) of functions, see Section 4.1.
The following example contains the main features of the sampling methods
that we will discuss.
Example 4.1 (Landscape painting attempt). Let $q(x, y)$ be a transition probability on a finite state space $S$. Suppose the transition matrix $Q = (q(x, y))$ is symmetric and irreducible. Given such a $Q$ and a probability distribution $\pi(x)$ on $S$ such that $\pi(x) > 0$ for all $x \in S$, let us now construct a new transition matrix $P = (p(x, y))$ as follows:
$$p(x,y) = \begin{cases} q(x,y) & \text{if } \pi(y) \geq \pi(x) \text{ and } x \neq y \\ q(x,y)\,\frac{\pi(y)}{\pi(x)} & \text{if } \pi(y) < \pi(x) \text{ and } x \neq y \\ 1 - \sum_{y \neq x} p(x,y) & \text{otherwise (i.e. } x = y\text{).} \end{cases} \qquad (24)$$
First of all $P = (p(x, y))$ is a transition matrix, indeed from the definition $\sum_y p(x, y) = p(x, x) + \sum_{y \neq x} p(x, y) = 1$. Moreover, $P$ is irreducible^10 by construction because $Q$ is; the state space being finite, this also implies that $P$ is recurrent and that there exists a unique stationary distribution. We can easily show that such an invariant distribution is exactly $\pi$, as $P$ is reversible with respect to $\pi$. To prove $\pi$-reversibility of $P$ we need to show that $\pi(x)p(x, y) = p(y, x)\pi(y)$. This is obviously true when $x = y$. So suppose $x \neq y$:

i) if $\pi(y) \geq \pi(x)$: $\pi(x)p(x, y) = \pi(x)q(x, y)$; but also $\pi(y)p(y, x) = \pi(y)\,q(y, x)\frac{\pi(x)}{\pi(y)}$, so that using the symmetry of $q$ we get $\pi(y)p(y, x) = q(x, y)\pi(x)$ and we are done.

ii) if $\pi(y) < \pi(x)$: clearly same as above with roles of $x$ and $y$ reversed.
We are in good shape but we haven't yet proved convergence to $\pi$. This is left as an exercise, see Exercise 17. It turns out that convergence happens unless $\pi$ is the uniform distribution on $S$.
Let us pause everything for a moment to think about what we have done: starting from a (pretty much) arbitrary transition kernel $q$, we have constructed a chain with transition kernel $p$ that has $\pi$ as limiting distribution. We will see that the (pretty much) can be (pretty much) removed. Now there is only one point left to address: how do we sample from the chain that has $P$ as transition matrix? By using the following algorithm.
^10 I.e. the whole state space is irreducible under $P$; this implies that the state space is also closed under $P$.

Algorithm 4.2. Given $X_n = x_n$,

1. generate $y_{n+1} \sim q(x_n, \cdot)$;
2. if $\pi(y_{n+1}) \geq \pi(x_n)$ then $X_{n+1} = y_{n+1}$;
if $\pi(y_{n+1}) < \pi(x_n)$ then $X_{n+1} = y_{n+1}$ with probability $\pi(y_{n+1})/\pi(x_n)$, and $X_{n+1} = x_n$ with probability $1 - \pi(y_{n+1})/\pi(x_n)$.

In words, given the state of the chain at time $n$, we pick the proposal $y_{n+1} \sim q(x_n, \cdot)$. Then the proposed move is accepted with probability $\alpha(x_n, y_{n+1}) := \min\{1, \pi(y_{n+1})/\pi(x_n)\}$. If it is rejected, the chain remains where it was. $\alpha(x, y)$ is called the acceptance probability.
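To make the mechanics concrete, here is a minimal Python sketch of Algorithm 4.2 on a finite state space; the particular target $\pi$ and symmetric proposal matrix $Q$ below are illustrative choices, not part of the example above.

```python
import numpy as np

def metropolis_chain(pi, Q, x0, n_steps, rng=None):
    """Run Algorithm 4.2: symmetric proposal matrix Q, target distribution pi."""
    rng = rng or np.random.default_rng()
    S = len(pi)
    x = x0
    chain = np.empty(n_steps, dtype=int)
    for n in range(n_steps):
        y = rng.choice(S, p=Q[x])                    # step 1: propose y ~ q(x, .)
        if rng.random() < min(1.0, pi[y] / pi[x]):   # step 2: accept with prob alpha
            x = y
        chain[n] = x                                 # if rejected, stay at x
    return chain

# Illustrative target on S = {0, 1, 2} and a symmetric, irreducible proposal.
pi = np.array([0.2, 0.5, 0.3])
Q = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
samples = metropolis_chain(pi, Q, x0=0, n_steps=50_000)
print(np.bincount(samples) / len(samples))  # empirical frequencies, close to pi
```

The empirical frequencies of the chain's visits approximate $\pi$, which is exactly the "time averages equal space averages" picture recalled below.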
Algorithm 4.2 is a first example of a Metropolis-Hastings algorithm. A naive explanation of the reason why we always accept moves towards points with higher probability comes from the Ergodic Theorem, limit (23), reread in the case in which the state space is finite (so that the integral on the RHS is just a sum). If we want to construct an ergodic chain with invariant probability $\pi$ then the time spent by the chain in each point $y$ of $S$ equals, in the long run, the probability assigned by $\pi$ to $y$, i.e. $\pi(y)$.
Remark 4.3. Notice that by using the acceptance probability, the kernel (24) can be rewritten in a slightly more compact form:
$$p(x, y) = q(x, y)\alpha(x, y) + \delta_x(y)\sum_{w \in S}(1 - \alpha(x, w))q(x, w),$$
where $\delta_x(y)$ is the Kronecker delta. In view of Section 4.3 it is also useful to note that Algorithm 4.2 can be equivalently expressed as follows: given $X_n = x_n$,

1. generate $y_{n+1} \sim q(x_n, \cdot)$;
2. set $X_{n+1} = \begin{cases} y_{n+1} & \text{with probability } \alpha(x_n, y_{n+1}) \\ x_n & \text{otherwise.} \end{cases}$

4.1 Simulated Annealing

We now address point ii) listed at the beginning of this section. Suppose again we are in the case in which the state space $S$ is finite and again in the setting of Example 4.1. Let us now apply the reasoning (24) to the case when the target distribution is
$$\pi_\epsilon(x) = \frac{e^{-H(x)/\epsilon}}{Z_\epsilon}, \qquad (25)$$

where $\epsilon > 0$ is a positive parameter, $H(x)$ is a function on $S$ and $Z_\epsilon$ is a normalization constant, i.e.
$$Z_\epsilon = \sum_{x \in S} e^{-H(x)/\epsilon}, \quad \text{so that} \quad \sum_{x \in S}\pi_\epsilon(x) = 1.$$
If $Q = (q(x, y))$ is a symmetric and irreducible transition matrix, the prescription (24) now becomes
$$p_\epsilon(x,y) = \begin{cases} q(x,y) & \text{if } H(y) \leq H(x) \text{ and } x \neq y \\ q(x,y)\,e^{-(H(y)-H(x))/\epsilon} & \text{if } H(y) > H(x) \text{ and } x \neq y \\ 1 - \sum_{y \neq x} p_\epsilon(x,y) & \text{otherwise.} \end{cases}$$
Notice that to simulate the Markov Chain with transition probability $p_\epsilon(x, y)$ we don't need to know a priori the value of the normalization constant $Z_\epsilon$. We now know that for $n$ large enough, $p^n_\epsilon \approx \pi_\epsilon$.
Now observe that the definition of $\pi_\epsilon$ remains unaltered if the function $H$ is modified through an additive constant $c$, i.e. if instead of $H$ we consider $\tilde H(x) = H(x) + c$ for some constant $c$, then
$$\tilde\pi_\epsilon(x) = \frac{e^{-\tilde H(x)/\epsilon}}{\tilde Z_\epsilon} = \frac{e^{-c/\epsilon}\,e^{-H(x)/\epsilon}}{e^{-c/\epsilon}\sum_{x \in S} e^{-H(x)/\epsilon}} = \pi_\epsilon(x).$$
Therefore we can assume without loss of generality that the minimum of $H$ is zero. We are after the $K$ (possibly $K > 1$) points of $S$, $x_1, \ldots, x_K$, such that $H(x_j) = 0$. In order to find such a set of points where the minimum is reached we observe that as $\epsilon$ goes to zero $\pi_\epsilon$ tends to the uniform distribution on $x_1, \ldots, x_K$; indeed
$$e^{-H(x)/\epsilon} \xrightarrow{\epsilon\to 0} \begin{cases} 1 & \text{if } x = x_j \text{ for some } j = 1, \ldots, K \\ 0 & \text{otherwise,} \end{cases}$$
hence $Z_\epsilon \xrightarrow{\epsilon\to 0} K$ and
$$\pi_\epsilon(x) \xrightarrow{\epsilon\to 0} \begin{cases} 1/K & \text{if } x = x_j \text{ for some } j = 1, \ldots, K \\ 0 & \text{otherwise.} \end{cases}$$
In this way, if $\epsilon$ is small enough, for large $n$ the chain will be in one of the minima with very high probability.
Instead of fixing a small $\epsilon$, we can decrease $\epsilon$ at each step. This way we obtain a non-homogeneous Markov Chain, which we haven't discussed. However, when $\epsilon$ is interpreted as the temperature of the system, such a procedure of decreasing $\epsilon$ is what gives the name simulated annealing to the technique that we have just presented. Such a name is borrowed from the field of metallurgy, as it comes from the heat treatment that some metals are subject to, when they are left to cool off in a controlled way.
In the context of equilibrium statistical mechanics a measure of the form (25) is commonly called Gibbs measure at inverse temperature $\beta$, where $\beta = 1/\epsilon$. Such measures are of great importance as they represent the equilibrium measure for a Hamiltonian system with Hamiltonian $H$. Therefore the simulated annealing optimization method can be used to locate the global minima of the energy of the system.
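As a concrete illustration, here is a minimal sketch of simulated annealing on a finite state space; the cooling schedule $\epsilon_n = 1/\log(n+2)$ and the example function $H$ are illustrative choices of ours, not prescriptions from the theory above.

```python
import numpy as np

def simulated_annealing(H, S, n_steps, rng=None):
    """Minimize H over the finite set S with a slowly decreasing temperature."""
    rng = rng or np.random.default_rng()
    x = rng.integers(len(S))
    for n in range(n_steps):
        eps = 1.0 / np.log(n + 2)        # cooling schedule (illustrative choice)
        y = rng.integers(len(S))         # symmetric (uniform) proposal
        dH = H(S[y]) - H(S[x])
        # downhill moves are always accepted; uphill with probability e^{-dH/eps}
        if dH <= 0 or rng.random() < np.exp(-dH / eps):
            x = y
    return S[x]

# Illustrative example: H has its unique global minimum at x = 3.
S = np.arange(10)
H = lambda x: (x - 3.0) ** 2
print(simulated_annealing(H, S, n_steps=20_000))  # typically prints 3
```

Note that, exactly as observed above, the normalization constant $Z_\epsilon$ never appears: only differences $H(y) - H(x)$ enter the acceptance rule.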

4.2 Accept-Reject method

Example 4.1 and Algorithm 4.2 give an idea of what we are aiming for. However let us take a step back and start again where we began. The aim of the game is sampling from a given probability distribution. So suppose we want to produce sample outcomes of a real valued random variable $X$ with density function $\pi(x)$.^11 The idea is to use an auxiliary density function $\gamma(x)$ which we know how to sample from.
The method is as follows: we start with producing two samples, independent of each other, $Y \sim \gamma$ and $U \sim \mathcal{U}_{[0,1]}$, where $\mathcal{U}_{[0,1]}$ denotes the uniform distribution on $[0, 1]$. If $U \leq \frac{\pi(Y)}{M\gamma(Y)}$ then we accept the sample and set $X = Y$, otherwise we start again. The algorithm is as follows:

Algorithm 4.4 (Accept-Reject algorithm).

1. Generate $Y \sim \gamma$ and, independently, $U \sim \mathcal{U}_{[0,1]}$;
2. if $U \leq \frac{\pi(Y)}{M\gamma(Y)}$ then set $X = Y$, otherwise go back to step one.

It is clear from the above that there are only two constraints on $\pi(x)$ and $\gamma(x)$ in order for this method to be applicable (see the sketch below):

- the auxiliary density is such that $\gamma(x) > 0$ if $\pi(x) > 0$;
- there exists a constant $M > 0$ such that $\pi(x)/\gamma(x) \leq M$ for all $x$.
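Here is a minimal Python sketch of Algorithm 4.4, where we assume for illustration a standard Gaussian target $\pi$ and a standard Cauchy auxiliary density $\gamma$; for this particular pair a valid bound is $M = \sqrt{2\pi/e} \approx 1.52$.

```python
import numpy as np

rng = np.random.default_rng()

def pi_density(x):      # target: standard Gaussian (illustrative choice)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def gamma_density(x):   # auxiliary: standard Cauchy, easy to sample from
    return 1.0 / (np.pi * (1 + x**2))

M = np.sqrt(2 * np.pi / np.e)   # bound: pi(x)/gamma(x) <= M for all x

def accept_reject():
    while True:
        y = rng.standard_cauchy()      # step 1: Y ~ gamma
        u = rng.random()               #         U ~ U[0,1], independent of Y
        if u <= pi_density(y) / (M * gamma_density(y)):
            return y                   # step 2: accept and set X = Y

samples = np.array([accept_reject() for _ in range(10_000)])
print(samples.mean(), samples.var())   # approximately 0 and 1
```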
Now we need to check that the samples generated this way are actually distributed according to $\pi$. Because
$$P(X \leq a) = P\Big(Y \leq a \,\Big|\, U \leq \frac{\pi(Y)}{M\gamma(Y)}\Big) \qquad \text{for all } a \in \mathbb{R},$$
what we need to show is that
$$P\Big(Y \leq a \,\Big|\, U \leq \frac{\pi(Y)}{M\gamma(Y)}\Big) = \frac{\int_{-\infty}^{a}\pi(y)\,dy}{\int_{\mathbb{R}}\pi(y)\,dy} \qquad \text{for all } a \in \mathbb{R}.$$
This is a simple calculation, once we recall that for any two real valued random variables, $W$ and $Z$, with joint probability density $f_{W,Z}(w, z)$, we have
$$P(W \in A | Z \in B) = \frac{\int_B\int_A f_{W,Z}(w, z)\,dw\,dz}{\int_B\int_{\mathbb{R}} f_{W,Z}(w, z)\,dw\,dz}.$$
In our case it is clearly $f_{Y,U} = \gamma(y)\cdot 1$, as $Y$ and $U$ are independent, so using the above equality we obtain
$$P\Big(Y \leq a \,\Big|\, U \leq \frac{\pi(Y)}{M\gamma(Y)}\Big) = \frac{\int_{-\infty}^{a} dy\,\gamma(y)\int_0^{\pi(y)/M\gamma(y)} du}{\int_{\mathbb{R}} dy\,\gamma(y)\int_0^{\pi(y)/M\gamma(y)} du} = \frac{\int_{-\infty}^{a}\pi(y)\,dy}{\int_{\mathbb{R}}\pi(y)\,dy} = P(X \leq a).$$
^11 This sampling method works in higher dimensions as well; here we present it in one dimension just for ease of notation.
Remark 4.5. Notice that to use the accept-reject method we don't need to know the normalization constant for the density function $\pi$, i.e. we don't need to know $Z = \int\pi(x)\,dx$. However we need to know the constant $M$. Indeed, even if $M$ cancels in the calculation above (so that it can be arbitrary), we still need to know its value in order to decide whether to accept or reject the sample $Y$ (see step 2 of Algorithm 4.4). Moreover, $M$ is a measure of the efficiency of the algorithm. Indeed, the probability of acceptance at each iteration is exactly $1/M$, see Exercise 19. Therefore in principle we would like to choose a distribution $\gamma$ that makes $M$ as small as possible.
One last, probably superfluous, observation: this method has nothing to do with Markov chains. The reason why we presented this algorithm should be clear in view of Example 4.1. The next section is actually about sampling methods that exploit Markov Chains.

4.3 Metropolis-Hastings algorithm

Before reading this section, read again Example 4.1. However, this time our state space is $\mathbb{R}^N$.^12
^12 This algorithm can be presented and studied on a general metric space, see [74].

A Metropolis-Hastings (M-H) algorithm is a method of constructing a time-homogeneous Markov chain or, equivalently, a transition kernel $p(x, y)$, that is reversible with respect to a given target distribution $\pi(x)$. To construct the $\pi$-invariant chain $X_n$ we make use of a proposal kernel $q(x, y)$ which we know how to sample from and of an accept/reject mechanism with acceptance probability
$$\alpha(x, y) = \min\left\{1, \frac{\pi(y)q(y, x)}{\pi(x)q(x, y)}\right\}. \qquad (26)$$
We require that $\pi(y)q(y, x) > 0$ and $\pi(x)q(x, y) > 0$. The M-H algorithm consists of two steps:

Algorithm 4.6 (Metropolis-Hastings algorithm). Given $X_n = x_n$,

1. generate $y_{n+1} \sim q(x_n, \cdot)$;
2. set $X_{n+1} = \begin{cases} y_{n+1} & \text{with probability } \alpha(x_n, y_{n+1}) \\ x_n & \text{otherwise.} \end{cases}$

Lemma 4.7. If $\alpha$ is the acceptance probability (26) (and assuming $\pi(y)q(y, x) > 0$ and $\pi(x)q(x, y) > 0$), the Metropolis-Hastings algorithm, Algorithm 4.6, produces a $\pi$-invariant time-homogeneous Markov chain.^13
Proof. This is left as an easy exercise, see Exercise 18.
Notice that the samples produced with this MCMC method are correlated, as opposed to those produced with the simple accept/reject method,
which are i.i.d. The M-H samples are correlated for two reasons: because the
proposed move yn+1 depends on xn and because the acceptance probability
depends on xn .
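A minimal sketch of Algorithm 4.6 for a target density on $\mathbb{R}$, with an illustrative non-symmetric proposal of our choosing. The target may be unnormalized, and the Gaussian proposal's normalizing constant does not depend on $x$, so it too may be dropped from the ratio (cf. Remark 4.8 below).

```python
import numpy as np

rng = np.random.default_rng()

def metropolis_hastings(log_pi, sample_q, q_density, x0, n_steps):
    """Algorithm 4.6 with acceptance probability (26), computed in log scale."""
    x = x0
    chain = np.empty(n_steps)
    for n in range(n_steps):
        y = sample_q(x)                          # step 1: y ~ q(x, .)
        # log of pi(y) q(y, x) / (pi(x) q(x, y))
        log_ratio = (log_pi(y) - log_pi(x)
                     + np.log(q_density(y, x)) - np.log(q_density(x, y)))
        if np.log(rng.random()) < min(0.0, log_ratio):   # step 2: accept/reject
            x = y
        chain[n] = x
    return chain

# Illustrative choices: unnormalized Gaussian target, asymmetric proposal.
log_pi = lambda x: -x**2 / 2
sample_q = lambda x: rng.normal(0.8 * x, 1.0)           # q(x, .) = N(0.8 x, 1)
q_density = lambda x, y: np.exp(-(y - 0.8 * x)**2 / 2)  # up to an x-free constant
chain = metropolis_hastings(log_pi, sample_q, q_density, 0.0, 50_000)
print(chain.mean(), chain.var())                        # approximately 0 and 1
```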
Remark 4.8. In order to implement Algorithm 4.6 we don't need to know the normalizing constant for $\pi$, as it gets canceled in the ratio (26). However we do need to know the normalizing constant for $q$: $q$ is a transition probability, so by definition for every fixed $x$ the function $y \mapsto q(x, y)$ is a probability density, i.e. it integrates to one. However the normalizing constant of $q(x, \cdot)$ can, and in general will, depend on $x$. In other words, $q(x, y)$ will in general be of the form $q(x, y) = Z_x^{-1}\tilde q(x, y)$, with $\int dy\,\tilde q(x, y) = Z_x$, so that the ratio in the acceptance probability (26) can be more explicitly written as $\frac{\pi(y)Z_x\tilde q(y,x)}{\pi(x)Z_y\tilde q(x,y)}$.
^13 On a technical note, Lemma 4.7 can be made a bit more general (see [74]): first recall that we are assuming that $q(x, \cdot)$ and $\pi(\cdot)$ are probability densities with respect to the Lebesgue measure; having said that, we can consider the set $R = \{x, y \in S : \pi(y)q(y, x) > 0 \text{ and } \pi(x)q(x, y) > 0\}$ and prove that the chain produced by Algorithm 4.6 satisfies the detailed balance condition if and only if $\alpha(x, y) = 0$ on $R^c$.
Clearly the choice of the proposal $q$ is crucial in order to improve the efficiency of the algorithm. We will make more remarks on this point later on. For the moment we want to explore some special cases of M-H, namely

1. when $q(x, y)$ depends on $y$ only, i.e. $q(x, y) = q(y)$, in which case the corresponding M-H algorithm is called independence sampler;
2. when $q$ is symmetric, i.e. $q(x, y) = q(y, x)$, and in particular when $q$ is the transition density of a random walk, so that $q(x, y) = q(|x - y|)$ (see Example 3.8); in this case the resulting M-H algorithm is better known as Random Walk Metropolis (RWM).

Let us start with the independence sampler.
1. Independence Sampler. If the proposal kernel is independent of the current state of the chain, i.e. $q(x, y) = q(y)$ with $q(y)$ a probability density, then the acceptance probability looks like
$$\alpha(x, y) = \min\left\{1, \frac{\pi(y)q(x)}{\pi(x)q(y)}\right\},$$
so in this case we need to know neither the normalization constant for the target $\pi$ nor the one for the proposal $q$ (see Remark 4.8). Algorithm 4.6 becomes

Algorithm 4.9 (Independence Sampler). Given $X_n = x_n$,

1. generate $y_{n+1} \sim q(\cdot)$;
2. set $X_{n+1} = \begin{cases} y_{n+1} & \text{with probability } \alpha(x_n, y_{n+1}) \\ x_n & \text{otherwise.} \end{cases}$
If in general it makes sense to compare the M-H algorithm with the accept/reject method, it makes even more sense to compare the independence sampler with Algorithm 4.4, as they look very much alike. Let us spend a couple of words to stress the differences between these two: first of all bear in mind that the independence sampler produces a chain $X_n$, the outcomes of which are $\pi$-distributed only for large $n$, while with accept/reject each sample is $\pi$-distributed. In both cases the proposal $y_{n+1}$ is generated independently of the value that had been produced at the previous iteration. However the independence sampler still produces correlated samples (the acceptance probability still depends on $x_n$) as opposed to accept-reject, which produces i.i.d. outcomes (see also the comment before Remark 4.8). On the other hand, if we use accept/reject the upper bound on $\pi(x)/q(x)$ needs to be available a priori, whereas in the independence sampler the constant $M$ does not need to be known.
2. The symmetric case and Random Walk Metropolis. If $q(x, y) = q(y, x)$ the acceptance probability reads
$$\alpha(x, y) = \min\left\{1, \frac{\pi(y)}{\pi(x)}\right\} \qquad (27)$$
and therefore it doesn't depend on $q$ at all. It is clear that in M-H, as much as in accept/reject, the choice of $q$, which can in principle be arbitrary, is of paramount importance in order to optimize the algorithm. The choice of the proposal $q$ affects the efficiency of M-H at least on two levels, as both the speed of convergence and the acceptance probability depend on $q$. In the symmetric case the acceptance probability (27) does not depend on the proposal kernel; however the average acceptance rate does,
$$\bar\alpha = \iint \alpha(x, y)\,\pi(x)\,q(x, y)\,dx\,dy,$$
so it is wrong to conclude from (27) that in the symmetric case the only factor to optimize in the choice of the proposal is the speed of convergence.
A very popular M-H method is the so called Random Walk Metropolis, where the proposal $y_{n+1}$ is of the form
$$y_{n+1} = x_n + \xi_{n+1},$$
where $\xi_{n+1}$ is whatever noise independent of $x_n$. In RWM, the outcomes $\xi_1, \xi_2, \ldots, \xi_n, \ldots$ are i.i.d. with common density $g(x)$, which is assumed to be symmetric with respect to the origin, i.e. $g(x) = g(|x|)$. In this way $q(x, y) = g(y - x) = g(|y - x|)$, so $q(x, y) = q(|y - x|)$. For example, if $\xi \sim \mathcal{N}(0, \sigma^2)$ then $q(x, \cdot) \sim \mathcal{N}(x, \sigma^2)$. The case in which the noise is Gaussian has been extensively studied in the literature, for target measures defined on $\mathbb{R}^N$. As you can imagine, the efficiency of the algorithm decreases as the dimension $N$ of the state space increases. This is a well known phenomenon commonly referred to as curse of dimensionality. Therefore, choosing the proposal variance becomes a more and more delicate matter as $N$ grows. In $\mathbb{R}^N$ it is customary to consider $\sigma^2 = cN^{-\gamma}$ where $c, \gamma > 0$ are two appropriate parameters, the most interesting of the two being $\gamma$. If $\gamma$ is too large then $\sigma^2$ is too small, so the proposed moves tend to stay close to the current value of the chain and the state space is explored very slowly. If instead $\gamma$ is too small, more precisely smaller than a critical value $\gamma_c$, it was shown in [3, 6, 7] that the average acceptance rate decreases very rapidly to zero as $N$ tends to infinity. Such a critical value is, for RWM, equal to one. If we choose $\gamma = 1$ then the acceptance probability does not depend on $N$.
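To see this dimensional scaling at work, here is a small sketch (all numerical parameters are illustrative choices of ours) estimating the average acceptance rate of RWM for a standard Gaussian target on $\mathbb{R}^N$ with $\sigma^2 = cN^{-\gamma}$:

```python
import numpy as np

rng = np.random.default_rng()

def rwm_acceptance(N, gamma, c=1.0, n_steps=5000):
    """Average acceptance rate of RWM for a standard Gaussian target on R^N,
    with proposal variance sigma^2 = c * N**(-gamma)."""
    sigma = np.sqrt(c * N ** (-gamma))
    x = np.zeros(N)
    accepted = 0
    for _ in range(n_steps):
        y = x + sigma * rng.standard_normal(N)   # y = x + xi, xi ~ N(0, sigma^2 I)
        log_ratio = 0.5 * (x @ x - y @ y)        # log(pi(y)/pi(x)) for pi = N(0, I)
        if np.log(rng.random()) < min(0.0, log_ratio):
            x = y
            accepted += 1
    return accepted / n_steps

for N in (10, 100, 1000):
    print(N, rwm_acceptance(N, gamma=1.0))   # roughly stable in N for gamma = 1
```

Rerunning with a smaller exponent (say gamma=0.2) shows the acceptance rate collapsing as $N$ grows, in line with the results of [3, 6, 7] quoted above.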
Remark 4.10. Let us repeat that M-H is a method to generate a $\pi$-reversible time-homogeneous Markov chain. As we have already noticed, the fact that the chain is $\pi$-reversible does not imply that $\pi$ is the only invariant distribution for the chain or, even less, that the chain converges to $\pi$. In these lecture notes we shall not be concerned with the matter of convergence of the chain constructed via M-H, which is probably better studied case by case (see for example Exercise 17). However, for the sake of completeness, let us mention the two following results:

- If the target density $\pi(x)$ goes to zero exponentially fast as $|x| \to \infty$ then the M-H algorithm is ergodic and converges to $\pi$.
- If $\sup_y \pi(y)/q(y)$ is bounded, the independence sampler is ergodic and converges to the invariant distribution.

The precise statement and proof of these results can be found in [53, Chapter 20] and references therein.

4.4 The Gibbs Sampler

The Gibbs sampler is a method of sampling from a distribution which is the marginal of a joint probability. To be more clear, suppose we have two random variables $X$ and $Y$ with joint distribution $f_{X,Y}(x, y)$ and suppose we know that our target distribution is precisely the marginal
$$\pi(x) = f_X(x) = \int f_{X,Y}(x, y)\,dy.$$
If we can sample from the conditional distributions $f_{X|Y}$ and $f_{Y|X}$, then the two-stage Gibbs sampler works as follows:

Algorithm 4.11 (Two-stage Gibbs sampler). Set $Y_0 = y_0$. Then, for $n = 0, 1, 2, \ldots$

1. $X_n \sim f(x|Y_n = y_n)$
2. $Y_{n+1} \sim f(y|X_n = x_n)$.
Yes, I know: at first sight it seems impossible that this algorithm does anything at all. Let us explain what it does and why it works. First of all, Algorithm 4.11 produces two Markov chains, $X_n$ and $Y_n$. We are interested in sampling from $f_X(x)$, so the chain of interest to us is the chain $X_n$. One can prove that the chain $X_n$ has $f_X$ as unique invariant distribution and that, under appropriate assumptions on the conditional distributions $f_{X|Y}$ and $f_{Y|X}$, the chain converges to $f_X$. We shall illustrate this fact on a relatively simple but still meaningful example.
Example 4.12. Suppose $X$ and $Y$ are marginally Bernoulli random variables, so that we work in finite state space $S = \{0, 1\}$. Suppose the joint distribution of $X$ and $Y$ is assigned as follows:
$$\begin{pmatrix} f_{X,Y}(0,0) & f_{X,Y}(1,0) \\ f_{X,Y}(0,1) & f_{X,Y}(1,1) \end{pmatrix} = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix}, \qquad a_1 + a_2 + a_3 + a_4 = 1.$$
The marginal distribution of $X$ is then
$$f_X = [f_X(0)\ \ f_X(1)] = [a_1 + a_3\ \ \ a_2 + a_4]. \qquad (28)$$
Consider the matrices containing the conditional distributions:
$$F_{Y|X} = \begin{pmatrix} \frac{a_1}{a_1+a_3} & \frac{a_3}{a_1+a_3} \\ \frac{a_2}{a_2+a_4} & \frac{a_4}{a_2+a_4} \end{pmatrix} \quad \text{and} \quad F_{X|Y} = \begin{pmatrix} \frac{a_1}{a_1+a_2} & \frac{a_2}{a_1+a_2} \\ \frac{a_3}{a_3+a_4} & \frac{a_4}{a_3+a_4} \end{pmatrix},$$
i.e. $(F_{Y|X})_{ij} = P(Y = j|X = i)$, $i, j \in \{0, 1\}$. Starting from the above matrices we can construct the matrix
$$F_{X|X} = F_{Y|X}\,F_{X|Y}.$$
Such a matrix is precisely the transition matrix for the Markov chain $X_n$. Indeed, by construction, the sequence $X_0 \to Y_1 \to X_1$ is needed to construct the first step of the chain $X_n$, i.e. the step $X_0 \to X_1$. This means that the transition probabilities of $X_n$ are precisely
$$P(X_1 = x_1|X_0 = x_0) = \sum_y P(X_1 = x_1|Y_1 = y)\,P(Y_1 = y|X_0 = x_0).$$
Therefore the entries of the matrix $(F_{X|X})^n$ are the transition probabilities $p_n(x_0, x) = P(X_n = x|X_0 = x_0)$. Let the distribution of $X_n$ be represented by the row vector
$$f^n = [f^n(0)\ \ f^n(1)] = [P(X_n = 0)\ \ P(X_n = 1)],$$
so that we have
$$f^n = f^0 (F_{X|X})^n = f^{n-1} F_{X|X}, \qquad n \geq 1. \qquad (29)$$
If all the entries of the matrix $F_{X|X}$ are strictly positive then the chain $X_n$ is irreducible and, because the state space is finite, there exists a unique invariant distribution, which for the moment we call $\rho = [\rho(0)\ \rho(1)]$. However, if all the entries of the transition matrix are strictly positive then the chain is also regular (see Exercise 16), hence $f^n$ does converge to the distribution $\rho$ as $n \to \infty$,^14 irrespective of the choice of the initial distribution for the chain $X_n$ (which, in the case of the algorithm at hand, means irrespective of the choice of $y_0$). Taking the limit as $n \to \infty$ on both sides of (29) then gives
$$\rho = \rho\, F_{X|X}.$$
Because there is only one invariant distribution, the solution of the above equation is unique. It is easy to check that the marginal $f_X$ defined in (28) satisfies the relation $f_X = f_X F_{X|X}$ (check it, by simply using the explicit expressions for $f_X$, $F_{Y|X}$ and $F_{X|Y}$) and hence $f_X$ must be the invariant distribution.
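A quick numerical check of this example (a sketch; the specific values of the $a_i$ are arbitrary): we run Algorithm 4.11 and compare the empirical distribution of the chain $X_n$ with the marginal $f_X$ of (28).

```python
import numpy as np

rng = np.random.default_rng()

# Arbitrary joint distribution f_{X,Y} on {0,1}^2
a1, a2, a3, a4 = 0.1, 0.3, 0.2, 0.4
fX = np.array([a1 + a3, a2 + a4])                    # marginal (28)

# Conditional distributions: row i of each matrix is the law given the value i
F_X_given_Y = np.array([[a1, a2], [a3, a4]], dtype=float)
F_X_given_Y /= F_X_given_Y.sum(axis=1, keepdims=True)
F_Y_given_X = np.array([[a1, a3], [a2, a4]], dtype=float)
F_Y_given_X /= F_Y_given_X.sum(axis=1, keepdims=True)

# Two-stage Gibbs sampler, Algorithm 4.11
y, xs = 0, []
for n in range(100_000):
    x = rng.choice(2, p=F_X_given_Y[y])   # step 1: X_n ~ f(x | Y_n = y_n)
    y = rng.choice(2, p=F_Y_given_X[x])   # step 2: Y_{n+1} ~ f(y | X_n = x_n)
    xs.append(x)

print(np.bincount(xs, minlength=2) / len(xs), fX)  # empirical law of X_n vs f_X
```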
On a very formal level, one could understand Gibbs sampling as a Metropolis-Hastings algorithm with acceptance probability constantly equal to 1.
A generalization of what we have done with two variables is the following. Suppose we have $m \geq 2$ variables $X_1, \ldots, X_m$ with joint distribution $f_{X_1,\ldots,X_m}(x_1, \ldots, x_m)$ and suppose we want to sample from the marginal
$$f_1(x_1) = \int f_{X_1,\ldots,X_m}(x_1, \ldots, x_m)\,dx_2\ldots dx_m.$$
With the notation
$$f_{1|2,\ldots,m}(x_1|x_2, \ldots, x_m) = f_{X_1|X_2,\ldots,X_m}(x_1|x_2, \ldots, x_m),$$
and similarly for $f_{j|1,\ldots,j-1,j+1,\ldots,m}$, the multi-stage Gibbs sampler is as follows.

Algorithm 4.13 (Multi-stage Gibbs sampler). Set $X_2^{(0)} = x_2^{(0)}, \ldots, X_m^{(0)} = x_m^{(0)}$. Then, for $n = 0, 1, 2, \ldots$, generate
^14 From (22) we know that $p_n(x, y) \to \rho(y)$ for all $x \in S$. Because the chain $X_n$ is started with $X_0 \sim f^0(x) = f_{X|Y}(x|Y = y_0)$, this implies that $\lim_n f^n(x) = \sum_{z \in S}(\lim_n p_n(z, x))\,f_{X|Y}(z|Y = y_0) = \rho(x)$.
1. $X_1^{(n)} \sim f_{1|2,\ldots,m}(x_1|x_2^{(n)}, \ldots, x_m^{(n)})$

2. $X_2^{(n+1)} \sim f_{2|1,3,\ldots,m}(x_2|x_1^{(n)}, x_3^{(n)}, \ldots, x_m^{(n)})$

3. $X_3^{(n+1)} \sim f_{3|1,2,4,\ldots,m}(x_3|x_1^{(n)}, x_2^{(n+1)}, x_4^{(n)}, \ldots, x_m^{(n)})$

...

m. $X_m^{(n+1)} \sim f_{m|1,\ldots,m-1}(x_m|x_1^{(n)}, x_2^{(n+1)}, \ldots, x_{m-1}^{(n+1)})$.

The principle behind Algorithm 4.13 is the same that we have illustrated
for Algorithm 4.11.

5 The Ergodic Theorem

More details about the material of this section can be found in [15, 26, 13].
In this section we want to give some more details about the limit (23). In
particular, we want to link the definition of ergodicity that we have given
for Markov processes with the definition that you might have already seen
in the context of dynamical system theory.

5.1 Dynamical Systems

Throughout this section $(\Omega, \mathcal{F})$ will denote a measurable space, as usual.

Definition 5.1 (Invariant measure and measure preserving map). Let $\theta : (\Omega, \mathcal{F}) \to (\Omega, \mathcal{F})$ be a measurable map and $\mu$ be a probability measure on $(\Omega, \mathcal{F})$. We define $\theta^\ast\mu$ to be the pushforward of the measure $\mu$ under $\theta$:
$$(\theta^\ast\mu)(A) := \mu(\theta^{-1}(A)), \qquad \forall A \in \mathcal{F}.$$
If $\theta^\ast\mu = \mu$, i.e. if
$$\mu(\theta^{-1}(A)) = \mu(A), \qquad \forall A \in \mathcal{F}, \qquad (30)$$
then we say that the measure $\mu$ is invariant under $\theta$ and that $\theta$ is measure preserving and preserves the measure $\mu$.
For a given map $\theta$ there will be in general more than one measure which is preserved by $\theta$.

Definition 5.2 (Invariant set). A set $A \in \mathcal{F}$ is invariant under $\theta$ if $\theta^{-1}(A) = A$. We denote by $J_\theta$ the set of invariant sets under $\theta$:
$$J_\theta := \{A \in \mathcal{F} : \theta^{-1}(A) = A\}.$$
When it is clear from the context which map we are referring to, we will drop the subscript and simply write $J$.
Definition 5.3 (Ergodic measure and ergodic map). A probability measure $\mu$ which is invariant under $\theta$ is said to be ergodic if the set $J_\theta$ is trivial, i.e. if $\mu(A)$ is equal to either 0 or 1 for all $A \in J_\theta$. If this is the case then the map $\theta$ on $(\Omega, \mathcal{F}, \mu)$ is said to be ergodic.
Let me stress again that for a given map $\theta$ one can in general find more than one measure that is invariant under $\theta$ and, among these, more than one that is ergodic.
Notice that in Definition 5.1 and Definition 5.3 the measure doesn't need to be a probability measure, in the sense that we can still start with a finite measure and then normalize it. The above definitions assume that this normalization process has already been performed on the measures at hand. A very important fact about ergodic measures is the following.
Lemma 5.4. Given a map $\theta$, let $M_\theta$ be the set of $\theta$-invariant measures. Then:

- Two ergodic measures in $M_\theta$ either coincide or they are mutually singular.^15
- $M_\theta$ is a convex set. A measure $\mu$ is ergodic if and only if it is an extreme point of such a set, i.e. if $\mu$ cannot be written as $t\mu_1 + (1-t)\mu_2$ for $t \in (0, 1)$, $\mu_1, \mu_2 \in M_\theta$.

As a consequence of the above, if $\theta$ admits only one invariant measure, that measure must be ergodic.
Now we state the main theorem of this section. To this end we recall the notation
$$L^1(P) := \Big\{f : (\Omega, \mathcal{F}, P) \to S \ \text{ s.t. } \int_\Omega |f|\,dP < \infty\Big\},$$
where $S$ in the above is any Polish space.
^15 Two finite measures $\mu$ and $\nu$ on the same measurable space $(E, \mathcal{E})$ are said to be mutually singular if there exist two disjoint sets $A, B \in \mathcal{E}$ such that $A \cup B = E$ and $\mu(A) = \nu(B) = 0$.

Theorem 5.5 (Ergodic Theorem). Let $X : (\Omega, \mathcal{F}, P) \to \mathbb{R}^n$ be integrable, i.e. $X \in L^1(P)$, and let $\theta$ be a measure preserving transformation that preserves $P$. Then
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1} X(\theta^k(\omega)) = E(X|J_\theta).$$
The above limit holds a.s. and in $L^1$.


The immediate consequence of such a Theorem is the following corollary.

Corollary 5.6. With the setting of Theorem 5.5, if the measure $P$ is ergodic for $\theta$, then
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1} X(\theta^k(\omega)) = E(X).$$
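A classical concrete instance of Corollary 5.6 (a numerical sketch, not taken from these notes): the rotation $\theta(\omega) = \omega + \alpha \mod 1$ on $[0, 1)$ preserves Lebesgue measure and is ergodic when $\alpha$ is irrational, so time averages of an integrable observable $X$ converge to its mean.

```python
import numpy as np

alpha = np.sqrt(2) % 1                       # irrational rotation angle
X = lambda w: np.cos(2 * np.pi * w) ** 2     # an integrable observable

omega, total = 0.37, 0.0                     # arbitrary starting point
n = 100_000
for k in range(n):
    total += X(omega)                        # accumulate X(theta^k(omega))
    omega = (omega + alpha) % 1              # apply the measure-preserving map

print(total / n)                             # approx E(X) = 1/2 under Lebesgue
```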

5.2 Stationary Markov Chains and Canonical Dynamical Systems

Suppose we have a probability space $(\Omega, \mathcal{F}, P)$, a map $\theta : \Omega \to \Omega$ that preserves the measure $P$ and a r.v. $X : \Omega \to \mathbb{R}$. Let $\theta^m$ denote the $m$-th iterate of $\theta$, i.e. $\theta^0(\omega) = \omega$ and $\theta^m(\omega) = \theta(\theta^{m-1}(\omega))$ for all $m \geq 1$, and define a sequence of r.v. $X_n(\omega)$ as follows:
$$X_0(\omega) = X(\omega),\ X_1(\omega) = X(\theta(\omega)),\ \ldots,\ X_n(\omega) = X(\theta^n(\omega)), \ldots \qquad (31)$$
It is easy to check that the sequence of r.v. $X_n$ constructed as above is a stationary sequence. Indeed let $A$ be the set $A := \{\omega : (X_0(\omega), \ldots, X_m(\omega)) \in B\}$, for some arbitrary Borel set $B$ in $\mathbb{R}^{m+1}$. Then for all $k \geq 0$
$$P\{(X_k(\omega), \ldots, X_{m+k}(\omega)) \in B\} = P(\omega : \theta^k(\omega) \in A) = P(\theta^{-k}(A)) = P(A) = P\{(X_0(\omega), \ldots, X_m(\omega)) \in B\}.$$
We have hence just proved the following lemma.
Lemma 5.7. Let $P$ be a measure on $(\Omega, \mathcal{F})$, $\theta$ a map that preserves $P$ and $X$ a $\mathcal{F}$-measurable, $\mathbb{R}$-valued random variable. Then the sequence (31) is a stationary sequence.
Comment. Lemma 5.7 is important because, at the cost of changing measure space, any stationary sequence can be represented in the form (31). Let us be more clear about this: we have already seen (see Remark 3.6) that any Markov Chain, and hence in particular any stationary Markov Chain, can be represented as the coordinate map in the sequence space $\mathbb{R}^{\mathbb{N}}$ endowed with the $\sigma$-algebra generated by cylinder sets and with the measure $P$ constructed by using the Kolmogorov Extension Theorem. So suppose that $X_n$ is a stationary MC, $X_n : \Omega \to \mathbb{R}$. The corresponding process on $\mathbb{R}^{\mathbb{N}}$ which has the same distribution as $X_n$ is the process $Y_n : \mathbb{R}^{\mathbb{N}} \to \mathbb{R}$ such that $Y_n(\omega) = \omega_n$, $\omega = (\omega_0, \omega_1, \ldots) \in \mathbb{R}^{\mathbb{N}}$. This means that $Y_n$ will be stationary on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), P)$. If we consider the r.v. $Y : \mathbb{R}^{\mathbb{N}} \to \mathbb{R}$ defined as $Y(\omega) = \omega_0$ and the shift map $\theta : \mathbb{R}^{\mathbb{N}} \to \mathbb{R}^{\mathbb{N}}$ defined as $\theta(\omega_0, \omega_1, \ldots) = (\omega_1, \omega_2, \ldots)$, then one can prove that $\theta$ preserves $P$ (you can check it by yourself, using the fact that $X_n$ is stationary, see Exercise ??) and indeed the sequence $Y_n(\omega)$ is nothing but $Y(\theta^n(\omega))$. Clearly everything we have said still holds if we substitute $\mathbb{R}$ with any Polish space, i.e. if the random variable $X$ takes values in a Polish space or in a subset of a Polish space.
Suppose from now on that the chain $X_n$ takes values in a discrete space $S$. If $X_n$ admits an invariant distribution (i.e. a stationary distribution in the sense of Definition 3.21), $\pi$, and we start the chain with $X_0 \sim \pi$, then the chain is stationary, i.e. $X_n \sim \pi$ for all $n \geq 0$. Denote by $P^\pi$ the corresponding measure on path space obtained via the Kolmogorov Extension Theorem. Then, by the comment above, $P^\pi$ is invariant (in the sense of Definition 5.1^16) for the shift map $\theta$. The pair $(\theta, P^\pi)$, together with the space $(S^{\mathbb{N}}, \mathcal{S}^{\mathbb{N}})$, is called the canonical dynamical system associated with the Markov chain $X_n$ which has $\pi$ as invariant distribution and that is started at $\pi$, i.e. $X_0 \sim \pi$.
For a given initial stationary distribution $\pi$, the measure $P^\pi$ is the only $\theta$-invariant measure such that
$$P^\pi(\omega \in S^{\mathbb{N}} : \omega_i \in A_i,\ i = 1, \ldots, m) = P^\pi(X_1 \in A_1, \ldots, X_m \in A_m).$$
Now suppose the invariant distribution $\pi$ is such that $\pi(x) > 0$ for all $x$ (which is in no way restrictive, since any state with $\pi(x) = 0$ is irrelevant to the process as it cannot be visited). Under this assumption it is possible to show that the shift map $\theta$ is ergodic if and only if the chain $X_n$ is irreducible (see for example [9, Section 3] for a proof in finite state space). Therefore, if the chain is irreducible, there exists a unique invariant distribution for $X_n$, i.e. $X_n$ is ergodic (in the sense of Definition 3.26).
Theorem 5.8 (Ergodic Theorem for stationary Markov Chains). Let $X_n$ be a stationary Markov chain. Then for any integrable function $f$, the limit
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1} f(X_k)$$
exists, a.s. and in $L^1$. If the chain is also ergodic then
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1} f(X_k) = E(f(X_0)).$$
^16 Another good reason why $\pi$ is called invariant.
Take f = 1A for some A F. Then the above limit says exactly that
asymptotically, space averages equal time averages.
We shall talk more about ergodicity and the link between Markov Processes and dynamical systems in the context of diffusion theory. For the
time being I would like to mention the following result, which will be useful
later on.
Lemma 5.9. A measure preserving map on a probability space, $\theta : (\Omega, \mathcal{F}, P) \to (\Omega, \mathcal{F}, P)$, is ergodic if and only if
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1} P((\theta^{-k}A) \cap B) = P(A)P(B), \qquad \text{for all } A, B \in \mathcal{F}.$$

That is, ergodicity is equivalent to asymptotic average independence.


This motivates the following definition, which introduces a property stronger
than ergodicity.
Definition 5.10. Let $\theta$ be a measure preserving map on a probability space, as in the above Lemma 5.9. $\theta$ is said to be (strongly) mixing if
$$\lim_{n\to\infty} P((\theta^{-n}A) \cap B) = P(A)P(B), \qquad \text{for all } A, B \in \mathcal{F}.$$

Start by noticing that mixing is stronger than ergodicity, i.e. mixing $\Rightarrow$ ergodic. To have a better intuition about the definition of mixing system and the difference with an ergodic one, suppose for a moment that the map $\theta$ is invertible; if this is the case, the definition (30) is equivalent to
$$P(\theta(A)) = P(A) \qquad \text{for all } A \in \mathcal{F}.$$
Therefore
$$P((\theta^{-n}A) \cap B) = P(\theta^n((\theta^{-n}A) \cap B)) = P(A \cap \theta^n(B)).^{17}$$
Which means that if $\theta$ is invertible then the limit of Definition 5.10 is equivalent to
$$\lim_{n\to\infty}\frac{P(A \cap (\theta^n(B)))}{P(A)} = P(B), \qquad \text{for all } A, B \in \mathcal{F}. \qquad (32)$$
^17 It is true in general that $f(A \cap f^{-1}(B)) = f(A) \cap B$, but it is not always true that $f^{-1}(A \cap f(B))$ is equal to $f^{-1}(A) \cap B$.

Now let us think of an example that will be recurrent in the following: milk and coffee. So suppose we have a mug of coffee and we add some milk. At the beginning the milk is all in the region $B$. To fix ideas suppose $P(B) = 1/3$ (which means that, after adding the milk, my mug will contain 1/3 of milk and 2/3 of coffee). The stain of milk will spread around and at time $n$ it will be identified by $\theta^n(B)$. The property (32) then means that asymptotically (i.e., in the long run) the proportion of milk that I will find in whatever subset $A$ of the mug is precisely 1/3. The limit of Lemma 5.9 says instead that such a property is only true asymptotically on average.

6 Continuous time Markov processes

The definition of Markovianity for a continuous time stochastic process is only formally different from the analogous definition for chains, Definition 3.1.
Definition 6.1. Let $\{X_t\}_{t\in\mathbb{R}^+}$ be a continuous time stochastic process taking values on a general state space $(S, \mathcal{S})$ and let $\mathcal{F}_t$ be the filtration generated by $X_t$, $\mathcal{F}_t := \sigma(X_s, 0 \leq s \leq t)$. If
$$P(X_t \in B|\mathcal{F}_s) = P(X_t \in B|X_s) \qquad \text{for all } 0 \leq s \leq t \text{ and } B \in \mathcal{S}, \qquad (33)$$
then $X_t$ is a continuous time Markov process.


It is self-evident that (33) is just (11) in continuous time. In this context, time-homogeneity can be defined as follows.

Definition 6.2. The Markov process $X_t$ is time-homogeneous if
$$P(X_t \in B|X_s) = P(X_{t-s} \in B|X_0), \qquad \forall B \in \mathcal{S} \text{ and } t \geq s \geq 0.$$

In discrete time, the transition functions needed to be assigned only for one time step, so the transition probabilities were a function of two arguments only; in the present case the transition probabilities will depend on time as well.

Definition 6.3. A map $p_t(x, B) := p(t, x, B) : \mathbb{R}^+ \times S \times \mathcal{S} \to [0, 1]$ is a transition function if

1. for fixed $t$ and $x$, $p(t, x, \cdot)$ is a probability measure, i.e. $p(t, x, S) = 1$;
2. for fixed $t$ and $B$, $p(t, \cdot, B)$ is a $\mathcal{S}$-measurable function.

Given a transition function $p$, if
$$P(X_t \in B|X_0 = x) = p_t(x, B) \qquad \text{for all } t \geq 0, \qquad (34)$$
for some time-homogeneous Markov process $X_t$, then $p$ is the transition function of the process $X_t$.
Notice that in order for the transition probabilities to satisfy (34) for $t = 0$ one needs to have
$$p_0(x, B) = \delta_x(B) = \begin{cases} 1 & \text{if } x \in B \\ 0 & \text{if } x \notin B. \end{cases}$$
In particular, for $t = 0$ the transition function can't have a density.
As in the case of discrete time, once an initial datum (or an initial distribution) for the process is chosen, the transition probabilities uniquely determine the Markov process. We now prove the Chapman-Kolmogorov relation.

Theorem 6.4 (Chapman-Kolmogorov). Let $X_t$ be a time-homogeneous Markov process. Then, for all $0 \leq s < u < t$ and $B \in \mathcal{S}$, we have
$$P(X_t \in B|X_s) = \int_S P(X_u \in dy|X_s)\,P(X_t \in B|X_u = y) \qquad (35)$$
$$= \int_S P(X_{u-s} \in dy|X_0)\,P(X_{t-u} \in B|X_0 = y). \qquad (36)$$
We specify that the first equality is true for any Markov process; the second is due to time-homogeneity.

Proof of Theorem 6.4. Since (35) holds for any Markov process, we only need to prove (35), as (36) follows from (35) when $X_t$ is time-homogeneous. Let $\mathcal{F}_t$ be the filtration generated by $X_t$. If $s < u < t$ then $\mathcal{F}_s \subseteq \mathcal{F}_u \subseteq \mathcal{F}_t$. We will use this fact, together with Markovianity and the properties of conditional expectation, in order to prove (35):
$$P(X_t \in B|X_s) = E[\mathbf{1}_{\{X_t \in B\}}|\mathcal{F}_s] = E[E[\mathbf{1}_{\{X_t \in B\}}|\mathcal{F}_u]|\mathcal{F}_s] = E[P(X_t \in B|X_u)|X_s].$$
Now because for every measurable function $\phi$ we can express the conditional expectation $E(\phi(X_t)|X_s) = \int \phi(y)\,P(X_t \in dy|X_s)$, we obtain the desired result.

If we want the Chapman-Kolmogorov equation (C-K) to hold in a stronger sense, i.e. if we want
$$P(X_t \in B|X_s = x) = \int_S P(X_u \in dy|X_s = x)\,P(X_t \in B|X_u = y) \qquad (37)$$
for all $x \in S$ and $0 \leq s \leq u \leq t$, then this is something that we need to require as an assumption on the transition probabilities themselves, as it doesn't follow from the Markov property alone. For this technical detail see for example [22, Chapter 3]. However it is always possible to modify the transition probabilities so that (37) holds, i.e. so that (35) holds pointwise (see [1, Chapter 2]). From now on we will always assume that this has been done and that the C-K equation holds in its stronger form (37). If the process is time-homogeneous, (37) can be rewritten as
$$P(X_t \in B|X_s = x) = \int_S P(X_{u-s} \in dy|X_0 = x)\,P(X_{t-u} \in B|X_0 = y). \qquad (38)$$
Using the transition probabilities, the above can also be expressed as
$$p_{t+s}(x, B) = \int_S p_s(x, dy)\,p_t(y, B).$$
If the transition functions have a density, i.e. if $p_s(x, dy) = p_s(x, y)\,dy$ for all $x \in S$, we can also write
$$p_{t+s}(x, z) = \int_S p_s(x, y)\,p_t(y, z)\,dy. \qquad (39)$$

Continuous time Markov processes arise mainly as solutions of SDEs. We will see in Section 8.2.3 that, under some conditions on the coefficients of the SDE, the solution of the equation enjoys the Markov property. Therefore, all the examples of Section 8.2.2 are examples of continuous time Markov processes. The next section is about the most important Markov process.

7 Brownian Motion

Brownian Motion was observed in 1827 by the botanist Robert Brown, who first noticed the peculiar properties of the motion of a pollen grain suspended in water. What Brown saw looked something like the path in Figure 1 below. In 1900 Louis Bachelier modeled such a motion using the theory of stochastic processes, obtaining results that were rediscovered, although in a different context, by A. Einstein in 1905. The understanding provided by Bachelier and Einstein was then put on firm mathematical basis by Norbert Wiener in the 1920s.

[Figure 1: Brownian motion]
We have given a formal definition of BM in Example 2.7, which we now would like to justify. First of all, what is BM? In a container there are many small particles moving around (in Brown's case, water particles). Now suppose we put a bigger particle (the pollen grain) in our container. At each instant in time the small particles, which move in all possible directions, will kick our pollen grain, and the result of this bombardment is an erratic motion called Brownian Motion. If $B(t)$ is the position of the pollen grain at time $t$, then it should be clear from our naive description that $B(t) - B(s)$ is independent from $B(s) - B(u)$ for any $u \leq s \leq t$, as the kicks that the grain receives are independent. If we put two pollen grains in the water then the two motions are independent. Also, the trajectory of the particle is so irregular that it is (a.s.) nowhere differentiable.

Exercise 7.1. Try and draw a continuous path that is nowhere differentiable. Managed it? Well, look better: I am pretty sure that what you have drawn is a.s. differentiable.

7.1 Heuristic motivation for the definition of BM.

Consider a particle that moves along the $x$ axis by jumping a distance $\delta$ every $\tau$ seconds. Every time, the particle can jump left or right with equal probability. Suppose that the particle is at the origin at time 0. If we denote by $p_{\delta,\tau}(x, t)$ the probability of finding the particle in $x = k\delta$ ($k \in \mathbb{Z}$) at time $t = n\tau$, we have
$$p_{\delta,\tau}(x, t + \tau) = \frac{1}{2}\left[p_{\delta,\tau}(x - \delta, t) + p_{\delta,\tau}(x + \delta, t)\right].^{18}$$
^18 Later we will prove that the Wiener process is a Markov process. The Markovianity of the process is already in this equation.

If $\delta$ and $\tau$ are small then by expanding both sides of the above equation we get
$$p_{\delta,\tau}(x, t) + \tau\,\frac{\partial p_{\delta,\tau}(x, t)}{\partial t} = p_{\delta,\tau}(x, t) + \frac{1}{2}\frac{\partial^2 p_{\delta,\tau}(x, t)}{\partial x^2}\,\delta^2.$$
Now let $\delta, \tau \to 0$ in such a way that
$$\frac{\delta^2}{\tau} \to D,$$
where $D > 0$ is a constant, which we will later call the diffusion constant. Then $p_{\delta,\tau}(x, t) \to p(x, t)$, where $p(x, t)$ satisfies the heat equation (or diffusion equation):
$$\partial_t p(x, t) = \frac{1}{2}D\,\partial_{xx}p(x, t). \qquad (40)$$
$p(x, t)$ is the probability density of finding the particle in $x$ at time $t$. The fundamental solution of (40) is precisely the probability density of a Gaussian with mean zero and variance $t$, see (8).
The calculation that we have just presented shows another very important fact: pour some milk into your coffee and watch the milk particles diffuse around in your mug. We will later rigorously define the term diffusion; however for the time being observe that we already know that at the microscopic (probabilistic) level, the motion of the milk molecules is described by BM. The macroscopic (analytic) description is instead given by the heat equation.
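A quick simulation of this limit (a sketch, with step sizes chosen by us so that $\delta^2/\tau = D = 1$): at time $t$ the rescaled random walk has mean 0 and variance close to $Dt$, matching the heat-kernel density.

```python
import numpy as np

rng = np.random.default_rng()

D, tau = 1.0, 1e-4
delta = np.sqrt(D * tau)       # jump size, so that delta**2 / tau = D
t = 2.0
n = int(t / tau)               # number of jumps up to time t

# Each walker makes S_n right-jumps out of n, S_n ~ Binomial(n, 1/2),
# and ends at position B(t) = (2 S_n - n) * delta.
S = rng.binomial(n, 0.5, size=100_000)
B_t = (2 * S - n) * delta

print(B_t.mean(), B_t.var())   # approximately 0 and D*t = 2.0
```

This is exactly the picture made rigorous, via the CLT, in the next subsection.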

7.2 Rigorous motivation for the definition of BM.

We look at the same picture as before, of a particle jumping a distance $\delta$ every $\tau$ seconds to the right or to the left with the same probability. This time, in order to rigorously derive BM from the random walk, we will use the CLT. To this end, let $B^{\delta,\tau}(t)$ denote the position of our particle at time $t = n\tau$ and let $\{X_i\}_i$ be i.i.d. r.v. with $P(X_i = 0) = 1/2 = P(X_i = 1)$. Consider $S_n = \sum_{i=1}^n X_i$, which clearly represents the number of moves to the right by time $t = n\tau$, and observe that
$$B^{\delta,\tau}(t) = S_n\delta + (n - S_n)(-\delta) = (2S_n - n)\delta.$$
Since $E(X_i) = 1/2$ and $Var(X_i) = 1/4$, $E(S_n) = n/2$ and $Var(S_n) = n/4$. Therefore, $E(B^{\delta,\tau}(t)) = 0$ and
$$Var(B^{\delta,\tau}(t)) = 4\delta^2\,\frac{n}{4} = \delta^2 n = \frac{\delta^2}{\tau}\,t.$$
Assuming $\delta^2/\tau = D$, we can rewrite
$$B^{\delta,\tau}(t) = \frac{S_n - (n/2)}{\sqrt{n/4}}\,\delta\sqrt{n} = \frac{S_n - (n/2)}{\sqrt{n/4}}\,\sqrt{tD}.$$
Now the CLT implies
$$\lim_{\substack{n\to\infty \\ t = n\tau}} P(a \leq B^{\delta,\tau}(t) < b) = \lim_{n\to\infty} P\left(\frac{a}{\sqrt{tD}} \leq \frac{S_n - (n/2)}{\sqrt{n/4}} \leq \frac{b}{\sqrt{tD}}\right) = \frac{1}{\sqrt{2\pi}}\int_{a/\sqrt{tD}}^{b/\sqrt{tD}} e^{-\frac{x^2}{2}}\,dx;$$
if $D = 1$ the RHS of the above is the same as the RHS of (8) after a change of variables.

7.3 Properties of Brownian Motion

From the description of BM as the motion of a pollen grain (and even more from its derivation as the limit of a random walk) it should be clear why the following result holds.

Theorem 7.2. The Wiener process is a Markov process with
$$P(B(t) \in (a, b)|B(s)) = \frac{1}{\sqrt{2\pi(t - s)}}\int_a^b e^{-\frac{(x - B(s))^2}{2(t - s)}}\,dx. \qquad (41)$$

Now an important definition.

Definition 7.3. A $\mathbb{R}^n$-valued s.p. $M_t$ is a martingale with respect to the filtration $\mathcal{E}_t$ if

1. $E|M_t| < \infty$ for all $t$;
2. $M_t$ is adapted to $\mathcal{E}_t$;
3. $E[M_t|\mathcal{E}_s] = M_s$ for all $s \leq t$.

Martingales enjoy many nice properties; among other things, they satisfy the following inequality.

Theorem 7.4 (Doob's martingale inequality). If $M_t$ is a continuous martingale then
$$P\left(\sup_{t\in[0,T]}|M_t| \geq \lambda\right) \leq \frac{E|M_T|^p}{\lambda^p},$$
for all $\lambda > 0$, $T \geq 0$ and $p \geq 1$.

In order to understand the martingale inequality, it is useful to compare it with the Markov inequality (4).
Theorem 7.5. Brownian Motion is a martingale with respect to the filtration $\mathcal{F}_t = \sigma\{B(s); 0 \leq s \leq t\}$.

Proof. Brownian motion is square integrable (as $E(B(t)^2) = t$ from i) and ii) of Example 2.7) and therefore integrable, and it is adapted w.r.t. $\mathcal{F}_t$ by definition. Also, for any $t \geq s$,
$$E[B(t)|\mathcal{F}_s] = E[B(t) - B(s) + B(s)|\mathcal{F}_s] = 0 + B(s),$$
as $B(t) - B(s)$ is independent of $\mathcal{F}_s$.
Let us now look at the path properties of the Wiener process: we will
see that while the trajectories of BM are continuous, they are a.s. nowhere
differentiable. The non-differentiability is intimately related with the fact
that BM has infinite total variation, which makes it impossible to define the
stochastic integral w.r. to Brownian motion in the same way in which we
define the Riemann integral.
Theorem 7.6. The sample paths of BM are a.s. continuous.

Proof. We want to use Kolmogorov's continuity criterion, Theorem 2.5. For every $k > 2$ and $0 \leq s \leq t$,
$$E|B(t) - B(s)|^k = \frac{1}{\sqrt{2\pi(t - s)}}\int_{\mathbb{R}} |x|^k e^{-\frac{x^2}{2(t - s)}}\,dx$$
$$(y = x/\sqrt{t - s}) \quad = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} |y|^k (t - s)^{k/2}\,e^{-\frac{y^2}{2}}\,dy = C\,|t - s|^{k/2}.$$
So we can apply the criterion with $\alpha = k$ and $\beta = k/2 - 1$, for all $k > 2$, and we obtain the desired continuity.

As a byproduct of the proof of Theorem 7.6, the Wiener process has Hölder continuous paths for every exponent $0 < \gamma < \beta/\alpha = 1/2 - 1/k$, for all $k > 2$, i.e. the paths are $\gamma$-Hölder continuous for every $0 < \gamma < 1/2$. If a function is differentiable then it is Hölder continuous with exponent $\gamma = 1$. It turns out that we can prove that BM is not $\gamma$-Hölder continuous for $\gamma \geq 1/2$ and therefore it is a.s. not differentiable. But there is more to it.
Definition 7.7. Let $\Delta = \{0 = t_0 \leq t_1 \leq \cdots \leq t_k = T\}$ be a partition of the interval $[0, T]$ and let $\|\Delta\| = \max(t_{i+1} - t_i)$ be the mesh of the partition. Let also $\Delta_n$ be a sequence of refining partitions of $[0, T]$, such that $\|\Delta_n\| \to 0$. Given a continuous function $g$ on $[0, T]$, we define the $p$-th variation of $g$ as
$$\lim_{n\to\infty}\sum_{t_i \in \Delta_n}|g(t_{i+1}) - g(t_i)|^p, \qquad p > 0.$$
When $p = 1$ the 1-variation is more often called the total variation of $g$.

Continuously differentiable functions have finite total variation, i.e. if $g \in C^1[0, T]$ then the total variation of $g$ is $\int_0^T |g'(s)|\,ds$.
Theorem 7.8. The Wiener process $B(t)$ has a.s. infinite total variation and finite quadratic variation: if $\Delta_n$ are refining sequences of partitions of the interval $[0, t]$ with mesh size tending to zero then

1. $\lim_n \sum_{t_i \in \Delta_n}|B(t_{i+1}) - B(t_i)| = \infty$ a.s.
2. $\lim_n \sum_{t_i \in \Delta_n}|B(t_{i+1}) - B(t_i)|^2 = t$ a.s.

It is important to realize that property 1. in the above theorem is a consequence of the a.s. nowhere differentiability of BM.
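A numerical illustration of Theorem 7.8 (a sketch): along dyadic partitions of $[0, t]$, the sum of squared increments of a simulated Brownian path stays close to $t$, while the sum of absolute increments blows up as the mesh shrinks.

```python
import numpy as np

rng = np.random.default_rng()
t = 1.0
n = 2**20                                   # finest dyadic partition of [0, t]
dW = rng.normal(0.0, np.sqrt(t / n), size=n)
W = np.concatenate([[0.0], dW.cumsum()])    # Brownian path on the fine grid

for level in (2**8, 2**12, 2**16, 2**20):
    incr = np.diff(W[::n // level])         # increments over a coarser partition
    print(level, np.abs(incr).sum(), (incr**2).sum())
# first column grows without bound (total variation),
# second column stays close to t = 1 (quadratic variation)
```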
Another property of BM that might be useful in the following is
$$E\sup_{t\in[0,T]}|B(t)|^N \leq C\,T^{N/2}, \qquad \text{for some } C > 0 \text{ and for all } N \geq 1.$$

Now one last fact about BM. As we have seen, BM is almost surely non differentiable, so talking about its derivative
$$\frac{dB(t)}{dt} = \xi(t)$$
is a bit of a nonsense. However the derivative of Brownian Motion is commonly referred to as white noise. Clearly, the process $\xi(t)$ doesn't exist, at least not in any classical sense. It is possible to make sense of $\xi$ as a distribution-valued process but this is not what we want to do here. For us $\xi(t)$ will be a stationary Gaussian process with autocovariance function
$$E(\xi(t)\xi(s)) = \delta(t - s),$$
where $\delta = \delta_0$ is the delta function with mass at 0. The reason why $\xi$ is called white noise is the following: if $c(t)$ is the covariance function of a stationary process then
$$f(\omega) = \int_{\mathbb{R}} e^{-i\omega t}\,c(t)\,dt, \qquad \omega \in \mathbb{R},$$
is the spectral density of the process. In the case of white noise,
$$f(\omega) = \int_{\mathbb{R}} e^{-i\omega t}\,\delta_0(t)\,dt = 1.$$
Inverting the Fourier transform, this means that all the frequencies contribute equally to the covariance function, in the same way in which all colours are equally present in the white colour.

8 Elements of stochastic calculus and SDEs theory

All of you know that the solution of an ODE, say
$$\frac{dX_t}{dt} = b(t, X_t), \qquad (42)$$
looks something like this:

[Figure 2: Trajectory of the solution of an ODE]

However, the paths of real life objects are more something like this:

[Figure 3: Trajectory of the solution of an SDE]


This is due to at least two main reasons: i) the motion of the object that we are observing is subject to many small disturbances, and ii) the graph of the path we are interested in is the result of our measurements, and measurements are subject to many small errors. This is why, when collecting data, it is rare to see a smooth curve. What we will more realistically see is the result of "smooth curve + erratic effects". These random effects are commonly called noise. For this reason it is of great importance in the applied sciences to consider equations of the type
$$\frac{dX_t}{dt} = b(t, X_t) + \text{noise}. \qquad (43)$$
The first term on the RHS of this equation gives the average (smooth) behaviour, the second is responsible for the fluctuations that turn Figure 2 into Figure 3. Bear in mind that the solution to (42) is a function; the solution to (43), whatever this equation might mean, is a stochastic process.
Depending on the particular situation that we want to model we can consider all sorts of noise. In this course we will be interested in the case of white noise $\xi_t$, which has been the first to be studied thanks to the good properties that it enjoys:

i) being mean zero, it doesn't alter the average behaviour, i.e. the drift is only due to $b(t, x)$;
ii) $\xi_t$ is independent of $\xi_s$ if $s \neq t$ (recall that $E(\xi_t\xi_s) = \delta_0(t - s)$);
iii) it is stationary, so that its law is constant in time, i.e. the noise acting on the object we are observing is qualitatively the same at each moment in time.

Now just a small problem: we said that we can't make sense of $\xi_t$ in a classic way, as $\xi_t$ is the derivative of Brownian motion, which is not differentiable.^19
^19 Another, possibly better, way of looking at this matter is as follows: for practical reasons, suggested for example by engineering problems, we want to consider a noise that satisfies the properties i), ii) and iii) listed above. However one can prove that there is no such process on the path space $\mathbb{R}^{[0,\infty)}$. In any event, the intuition about $\xi_t$ suggested by these three properties leads us to think of $\xi_t$ as the derivative of BM. More details on this point are contained in Appendix A.

So, how do we make sense of (43)? Well, if $\xi_t = dW_t/dt$ then, at least formally, (43) can be rewritten as
$$dX_t = b(t, X_t)\,dt + dW_t.$$
More in general, we might want to consider the equation
$$dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t. \qquad (44)$$
In this way we have that, so far still formally,
$$X_t = X_0 + \int_0^t b(s, X_s)\,ds + \int_0^t \sigma(s, X_s)\,dW_s. \qquad (45)$$
Assuming that $\int_0^t |b(s, X_s)|\,ds < \infty$ with probability one, we know how to make sense out of the first integral on the RHS. The real problem is how to make sense of the second integral on the RHS. However, I hope we all agree that if we can rigorously define the integral
$$\int_0^T f(t, \omega)\,dW_t \qquad (46)$$
for a stochastic process $f_t = f(t, \omega)$^20 belonging to some appropriate class, then we are done making rigorous sense of (43). Equation (45) is a general stochastic differential equation (SDE), driven by BM.

8.1 Stochastic Integrals.

Unless otherwise specified, throughout this section we will be referring to real-valued stochastic processes.
Let us start by very briefly (and roughly) recalling what happens in the case of the Riemann integral: we want to define $\int_0^T g(t)\,dt$ where $g$ is some deterministic continuous function. In order to do so, we start with defining the integral of step functions, i.e. functions of the form $g(t) = \sum_{j=0}^K c_j\,\mathbf{1}_{[t_j, t_{j+1})}$, where the $c_j$'s are constants and $\{0 = t_0, t_1, \ldots, t_K = T\}$ is some partition of the interval $[0, T]$. For step functions we define
$$\int_0^T g(t)\,dt = \sum_{j=0}^K c_j(t_{j+1} - t_j).$$
^20 In the integral (46) we wanted to stress the dependence on both $t$ and $\omega$.

At this point we consider a partition of $[0, T]$ of mesh size $2^{-n}$, i.e. given by $t_j = j/2^n$ (the notation for $t_j$ should be $t_j^{(n)}$ but we drop the dependence on $n$ to streamline the notation) for all $j$'s such that $t_j < T$, and $t_j = T$ if $j/2^n \geq T$, and observe that $g_n(t) = \sum_{j\geq 0} g(t_j)\,\mathbf{1}_{[t_j, t_{j+1})}$ is, for each $n \in \mathbb{N}$, a step function that approximates $g(t)$. If
$$\lim_{n\to\infty}\sum_{j\geq 0} g(t_j)(t_{j+1} - t_j) \quad \text{converges} \qquad (47)$$
then we define
$$\int_0^T g(t)\,dt := \lim_{n\to\infty}\int_0^T g_n(t)\,dt = \lim_{n\to\infty}\sum_{j\geq 0} g(t_j)(t_{j+1} - t_j). \qquad (48)$$
I want to stress that if the integrand is continuous, the limit on the right hand side of (48) does not change if we evaluate $g$ at any other point $t^* \in [t_j, t_{j+1}]$, i.e. for continuous functions we have:
$$\int_0^T g(t)\,dt = \lim_{n\to\infty}\sum_{j\geq 0} g(t_j)(t_{j+1} - t_j) = \lim_{n\to\infty}\sum_{j\geq 0} g(t^*)(t_{j+1} - t_j),$$
for every $t^* \in [t_j, t_{j+1}]$.


Inspired by this procedure we want to do something similar to define the stochastic integral. In doing so we need to bear in mind that we are attempting to integrate a stochastic process (as opposed to a function) with respect to another stochastic process, and that while the integral of a function is, for each fixed $T$, a number, the integral (46) is, for every fixed $T$, a random variable. The analogous of step functions are the elementary or simple processes on $[0, T]$, i.e. processes of the form
$$\phi(t, \omega) = \sum_{j=0}^K \phi_j(\omega)\,\mathbf{1}_{[t_j, t_{j+1})}, \qquad (49)$$
where $\{0 = t_0, t_1, \ldots, t_K = T\}$ is again some partition of the interval $[0, T]$ and the $\phi_j$'s are random variables, (a.s.) uniformly bounded in $j$ and $\omega$. For processes of this form, it seems reasonable to define
$$\int_0^T \phi(t, \omega)\,dW_t := \sum_{j=0}^K \phi_j(\omega)\,(W(t_{j+1}) - W(t_j)). \qquad (50)$$
Then what we want to do is to find a sequence $\phi_n$ of elementary processes that approximate the process $f(t, \omega)$ and finally define the stochastic integral (46) as the limit of the stochastic integrals of $\phi_n$. The plan sounds good, except there are a couple of problems and points to clarify.
First of all, assuming the procedure of taking the limit (in some sense) of the integral of elementary processes does work, we would want the limit to be independent of the point $t^* \in [t_j, t_{j+1}]$ that we choose to approximate the integrand. It turns out that this is not the case, not even if the integrand is continuous. We show this fact with an example.

Example 8.1. Suppose we want to calculate $\int_0^T W_t\,dW_t$. We consider the partition with mesh size $2^{-n}$ and approximate the integrand with the elementary process $\phi_n = \sum_{j\geq 0} W(t_j)\,\mathbf{1}_{[t_j, t_{j+1})}$. Then, for every $n \in \mathbb{N}$,
$$E\sum_{j\geq 0} W(t_j)\,[W(t_{j+1}) - W(t_j)] = 0,$$
by using the fact that $W(t_{j+1}) - W(t_j)$ is independent of $W(t_j)$. If instead we approximate the integrand with the process $\tilde\phi_n = \sum_{j\geq 0} W(t_{j+1})\,\mathbf{1}_{[t_j, t_{j+1})}$, then
$$E\sum_{j\geq 0} W(t_{j+1})\,[W(t_{j+1}) - W(t_j)] = E\sum_{j\geq 0}[W(t_{j+1}) - W(t_j)][W(t_{j+1}) - W(t_j)] = T.$$
Brownian motion is continuous so, what is going on? Both $\phi_n$ and $\tilde\phi_n$ seem perfectly reasonable approximating sequences, so why are we obtaining different results? The problem is that Brownian motion, being a.s. non differentiable, simply varies too much in the interval $t \in [t_j, t_{j+1}]$, and this leads to the phenomenon illustrated by Example 8.1. There is no way of solving this pickle; it is simply a fact of life that different choices of $t^* \in [t_j, t_{j+1}]$ lead to different definitions of the stochastic integral. The most popular choices are

- $t^* = t_j$, which gives the Itô integral, and
- $t^* = (t_j + t_{j+1})/2$, which gives the Stratonovich integral.

We will discuss the differences between these two stochastic integrals later on. For the moment let us stick to the choice $t^* = t_j$ and talk about the Itô interpretation of (46).

8.1.1 Stochastic integral in the Itô sense.

In these notes we will not go through all the details of the proofs of the construction of the Itô integral, but we simply provide the main steps of such a construction.
As we have already said, we want to find a sequence $\phi_n$ of elementary processes that approximate the process $f(t, \omega)$ and finally define the Itô integral as the limit of the stochastic integrals of $\phi_n$. We still need to specify in what sense we will take the limit (and in what sense the $\phi_n$'s approximate $f$). In the case of (48), $\{\int_0^T g_n(t)\,dt\}_n$ is simply a sequence of real numbers, but in our case we are dealing with a sequence of random variables, so we need to specify in what sense they converge. It is clear from Theorem 7.8 that trying with $L^1$-convergence is a recipe for disaster, as BM has infinite first variation. However, it has finite second variation, so we can try with $L^2$-convergence, which is what we are going to do. As you might expect, the procedure that we are going to sketch does not work for any integrand (in the same way in which not any function is Riemann integrable) but it is successful for stochastic processes $f(t, \omega) : \mathbb{R}^+ \times \Omega \to \mathbb{R}$ enjoying the following properties:

(a) $f(t, \omega)$ is $\mathcal{B} \times \mathcal{F}$-measurable;
(b) $f_t$ is $\mathcal{F}_t$-adapted, for all $t \in \mathbb{R}^+$, where $\mathcal{F}_t$ is the natural filtration associated with the BM $W_t$;
(c) $E\int_0^T f_t^2\,dt < \infty$.

Definition 8.2. We denote by $I(0, T)$, or simply $I$ when the extrema of integration are clear from the context, the class of stochastic processes $f(t, \omega) : \mathbb{R}^+ \times \Omega \to \mathbb{R}$ for which the above three properties hold.

The reason why we impose this set of conditions on the integrands will be more clear once you read the construction procedure. However, property (b) comes from the fact that we are choosing the left hand point to approximate our integral, and property (c) comes from the Itô isometry (see below). And now, finally, the main steps of the construction of the Itô integral are as follows:
1. Consider the set of elementary processes (49), where we require that the random variables $\phi_j$ are square integrable and $\mathcal{F}_{t_j}$-measurable^21 (with $\mathcal{F}_t$ the filtration generated by $W_t$); for these processes the stochastic integral is defined by (50). If $\phi(t, \omega)$ belongs to such a set, then one can prove the following fundamental equality:
$$E\left(\int_0^T \phi(t, \omega)\,dW_t\right)^2 = E\int_0^T \phi^2(t)\,dt. \qquad (51)$$
The above equality is the Itô isometry for simple processes; it indeed establishes an isometry between $L^2(\Omega)$ and $L^2([0, T] \times \Omega)$, at least for simple processes at the moment. It is clear that (51) follows from point 2. of Theorem 7.8.
^21 This is because we are choosing $t^* = t_j$. Notice that this condition does not hold for the $\tilde\phi_n$'s of Example 8.1, as $W_{t_{j+1}}$ is not $\mathcal{F}_{t_j}$-measurable.
2. For any process $f(t, \omega) \in I$, there exists a sequence of elementary processes $\phi_n$ such that
$$E\int_0^T |f(t) - \phi_n(t)|^2\,dt \to 0.$$
Therefore $\phi_n$ is a Cauchy sequence in $L^2([0, T] \times \Omega)$ (as it converges in this space), but also in $L^2(\Omega)$, thanks to (51). From the completeness of $L^2$, this means that $\int_0^T \phi_n(t)\,dW_t$ has a limit, which belongs to $L^2$. Such a limit is precisely the stochastic integral in the Itô sense of $f_t$.
Remark 8.3. The proof of the above goes as follows: once we define the stochastic integral for elementary processes, we can define it for bounded and continuous integrands $f_t \in I$. For such integrands the approximating sequence is precisely $\phi_n(t) = \sum_j f(t_j)\,\mathbb{1}_{[t_j,t_{j+1})}(t)$ (good to know for practical purposes). Notice that $f_{t_j}$ is $\mathcal{F}_{t_j}$-measurable if $f_t \in I$. The next steps are proving that there is a sequence of approximants (with the required properties) for those $f \in I$ which are bounded, and then for any $f \in I$.
Now that we know what a stochastic integral is, we know how to interpret (45), at least when $\int_0^t \sigma\, dW$ is intended in the Itô sense. We will often use the shorter notation (44), which is to be interpreted to mean (45). When we want to refer to the Stratonovich definition, we use, as customary, the notation
$$\int_0^t \sigma(s, W_s) \circ dW_s\,.$$
Let us now list all the properties that help in the calculation of the stochastic integral.
Theorem 8.4 (Properties of the Itô integral). For every $f, g \in I(0,T)$, $t, s \in [0,T]$ and for every $\alpha, \beta \in \mathbb{R}$:

(i) Additivity: $\int_0^T f\, dW = \int_0^t f\, dW + \int_t^T f\, dW$;

(ii) Linearity: $\int_0^T (\alpha f + \beta g)\, dW = \alpha\int_0^T f\, dW + \beta\int_0^T g\, dW$;

(iii) $\mathbb{E}\int_0^T f\, dW = 0$;

(iv) $I_T := \int_0^T f\, dW$ is $\mathcal{F}_T$-measurable;

(v) $I_t$ is an $\mathcal{F}_t$-martingale (hence it satisfies Doob's inequality);

(vi) $I_t$ is a continuous process (or better, it admits a continuous version);

(vii) Itô isometry:
$$\mathbb{E}\left(\int_0^t f\, dW\right)^2 = \mathbb{E}\int_0^t f^2\, dt,$$
or, more in general,
$$\mathbb{E}\left(\int_0^t f(u)\,dW_u \int_0^s g(u)\,dW_u\right) = \mathbb{E}\int_0^{t\wedge s} f(u)g(u)\, du\,;$$

(viii) If $f(t)$ is a deterministic function then $I_t$ is Gaussian with mean zero and variance $\int_0^t f^2(s)\,ds$.
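Property (vii) lends itself to a quick numerical check. The following is a minimal Monte Carlo sketch (grid size, number of paths and seed are my arbitrary choices, not part of the notes) taking $f = W$ and comparing $\mathbb{E}\left(\int_0^T W\,dW\right)^2$, $\mathbb{E}\int_0^T W^2\,dt$ and the exact common value $T^2/2$:

    import numpy as np

    rng = np.random.default_rng(0)
    T, n, n_paths = 1.0, 500, 5_000
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), (n_paths, n))   # Brownian increments per path
    W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)], axis=1)

    lhs = (np.sum(W[:, :-1] * dW, axis=1) ** 2).mean()  # E ( int_0^T W dW )^2
    rhs = (np.sum(W[:, :-1] ** 2, axis=1) * dt).mean()  # E   int_0^T W^2 dt
    print(lhs, rhs, T**2 / 2)                           # all three should be close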
You surely remember that when you were taught the Riemann integral you were first required to calculate integrals from the definition and then, once you were provided with integration rules, to calculate more complicated integrals by using such rules. This is exactly what we will do here. The only difference is that in the Riemann case the integration rules follow from the differentiation rules; in this case we only have a stochastic integral calculus (see (52) and (55)) without a corresponding stochastic differential calculus.
Example 8.5. Calculate $\int_0^t W_s\, dW_s$, using the definition. The approximating sequence that we will use is the sequence $\phi_n$ of Example 8.1. If you think that $\int_0^t W_s\, dW_s = W_t^2/2$ you are in for a surprise. So we want to calculate the following limit in $L^2$:
$$\lim_{n\to\infty}\int_0^t \phi_n(s)\, dW_s = \lim_{n\to\infty}\sum_j W_{t_j}(W_{t_{j+1}} - W_{t_j}) = \lim_{n\to\infty}\sum_j W_{t_j}(\Delta W_{t_j})\,.$$
In order to do so, observe that $W_{t_j}(\Delta W_{t_j}) = \frac{1}{2}\left[\Delta(W_{t_j}^2) - (\Delta W_{t_j})^2\right]$. Therefore
$$\lim_{n\to\infty}\sum_j W_{t_j}(\Delta W_{t_j}) = \frac{1}{2}\lim_{n\to\infty}\sum_j \Delta(W_{t_j}^2) - \frac{1}{2}\lim_{n\to\infty}\sum_j (\Delta W_{t_j})^2 = \frac{1}{2}W_t^2 - \frac{1}{2}t\,,$$
where we have used the fact that the sum $\sum_j \Delta(W_{t_j}^2)$ is a telescopic sum, and Theorem 7.8.
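Since the approximating sums above are exactly computable on a sampled Brownian path, Example 8.5 can be checked numerically. A minimal sketch (the partition size and the seed are arbitrary choices): evaluate the left-point sum on a fine grid and compare with $W_t^2/2 - t/2$.

    import numpy as np

    rng = np.random.default_rng(1)
    t, n = 1.0, 100_000
    dt = t / n
    dW = rng.normal(0.0, np.sqrt(dt), n)          # increments W_{t_{j+1}} - W_{t_j}
    W = np.concatenate(([0.0], np.cumsum(dW)))    # Brownian path on the grid

    left_sum = np.sum(W[:-1] * dW)                # sum_j W_{t_j} (W_{t_{j+1}} - W_{t_j})
    print(left_sum, 0.5 * W[-1]**2 - 0.5 * t)     # the two values should nearly agree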

At this point you should be convinced that the Itô integral doesn't follow the usual integration rules. And you might wonder: what about the chain rule and integration by parts? Well, I am afraid they don't stay the same either.

Itô chain rule: suppose $X_t$ satisfies the equation
$$dX = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t\,.$$
Then, if $g = g(t,x)$ is a $C^{1,2}([0,\infty)\times\mathbb{R})$ function,
$$dg(t, X_t) = \frac{\partial g}{\partial t}(t, X_t)\,dt + \frac{\partial g}{\partial x}(t, X_t)\,dX_t + \frac{1}{2}\frac{\partial^2 g}{\partial x^2}(t, X_t)\,\sigma^2(t, X_t)\,dt\,. \tag{52}$$

Product rule: if $X_i(t)$, $i = 1, 2$, satisfy the equations
$$dX_i(t) = b_i(t, X_i(t))\,dt + \sigma_i(t, X_i(t))\,dW_t\,, \tag{53}$$
respectively, then
$$d(X_1 X_2) = X_1\,dX_2 + X_2\,dX_1 + dX_1\,dX_2\,. \tag{54}$$

Integration by parts, which follows from the product rule: with the same notation as in the product rule,
$$\int_0^t X_1\,dX_2 = X_1(t)X_2(t) - X_1(0)X_2(0) - \int_0^t X_2\,dX_1 - \int_0^t dX_1\,dX_2\,. \tag{55}$$

How do we calculate the term $dX_1\,dX_2$? I am afraid I will adopt a practical approach, so the answer is... by using this multiplication table:
$$dt\,dt = dW\,dt = dt\,dW = 0,$$
while
$$dW\,dW = dt \quad\text{and}\quad dW\,dB = 0$$
if $W$ and $B$ are two independent Brownian motions. Just to be clear: in the case of $X_1$ and $X_2$ of equation (53), because the BM driving the equation for $X_1$ is the same as the one driving the equation for $X_2$, we have $dX_1\,dX_2 = \sigma_1\sigma_2\,dt$. In this way, if $X_1$ is a deterministic, say continuous, function, (55) coincides with the usual integration by parts for the Riemann integral. I would like to stress that this rule of thumb of the multiplication table can be rigorously justified (think of Theorem 7.8), but this is beyond the scope of this course.
Example 8.6. Let us calculate $\int_0^t W^3\,dW$. We can use the integration by parts formula with $X_1 = W^3$ and $X_2 = W$. To use (55) we need to calculate $dX_1$ first, which we can do by applying (52) with $g(t,x) = g(x) = x^3$. So we get
$$dX_1 = 3W^2\,dW + 3W\,dt,$$
hence $dX_1\,dX_2 = 3W^2\,dt$ and
$$\int_0^t W^3\,dW = W_t^4 - \int_0^t W\,(3W^2\,dW + 3W\,dt) - 3\int_0^t W^2\,dt\,;$$
rearranging,
$$\int_0^t W^3\,dW = \frac{W_t^4}{4} - \frac{3}{2}\int_0^t W^2\,dt\,.$$
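The same pathwise check used after Example 8.5 works here too. A sketch (partition and seed are arbitrary choices): on a fine grid, the left-point sums of $\int_0^t W^3\,dW$ should match $W_t^4/4 - \frac{3}{2}\int_0^t W^2\,ds$ up to discretization error.

    import numpy as np

    rng = np.random.default_rng(2)
    t, n = 1.0, 200_000
    dt = t / n
    dW = rng.normal(0.0, np.sqrt(dt), n)
    W = np.concatenate(([0.0], np.cumsum(dW)))

    lhs = np.sum(W[:-1] ** 3 * dW)                         # left-point sums for int W^3 dW
    rhs = W[-1] ** 4 / 4 - 1.5 * np.sum(W[:-1] ** 2) * dt  # W_t^4/4 - (3/2) int W^2 ds
    print(lhs, rhs)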

Before concluding this subsection, we would like to explicitly write down the multidimensional version of the main expressions that we have presented so far. Suppose $X_t: [0,\infty) \to \mathbb{R}^d$ satisfies the following SDE:
$$dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t,$$
where $b(t,x): [0,\infty)\times\mathbb{R}^d \to \mathbb{R}^d$ and $\sigma(t,x): [0,\infty)\times\mathbb{R}^d \to \mathbb{R}^{d\times m}$, i.e. $\sigma$ is a $d\times m$ matrix, and $W_t$ is an $m$-dimensional BM. In this case $\int_0^t \sigma(s, X_s)\,dW_s$ is simply a $d$-vector of Itô integrals, i.e. the $i$-th component of such a vector is
$$\left[\int_0^t \sigma\,dW_s\right]^i = \int_0^t \sum_{j=1}^m \sigma^{ij}\,dW^j\,.$$
If $g = g(t,x): [0,\infty)\times\mathbb{R}^d \to \mathbb{R}$ and $Y = g(t, X_t)$, the multidimensional Itô formula reads:
$$dY(t, X_t) = \frac{\partial g}{\partial t}\,dt + \sum_{i=1}^d \frac{\partial g}{\partial x^i}\,dX^i + \frac{1}{2}\sum_{i,j=1}^d \frac{\partial^2 g}{\partial x^i\partial x^j}\sum_{l=1}^m \sigma^{il}\sigma^{jl}\,dt\,.$$
The same formula holds for each component of $g$, if $g = g(t,x): [0,\infty)\times\mathbb{R}^d \to \mathbb{R}^p$: in this case, for each $1 \le k \le p$,
$$dY^k(t, X_t) = \frac{\partial g^k}{\partial t}\,dt + \sum_{i=1}^d \frac{\partial g^k}{\partial x^i}\,dX^i + \frac{1}{2}\sum_{i,j=1}^d \frac{\partial^2 g^k}{\partial x^i\partial x^j}\,dX^i\,dX^j\,.$$

8.1.2 Stochastic integral in the Stratonovich sense.

The construction of the integral in the Stratonovich sense is very similar to the one of the Itô integral, so we won't repeat it.
We stress, once again, that the result of the Itô integration and the result of the Stratonovich integration do not in general coincide: $\int_0^t \sigma\,dW$ is not the same as $\int_0^t \sigma\circ dW$. However, there is a useful conversion formula to write an Itô SDE in Stratonovich form and vice versa:
$$X_t = X_0 + \int_0^t b(s, X_s)\,ds + \int_0^t \sigma(s, X_s)\circ dW_s \tag{56}$$
$$\phantom{X_t} = X_0 + \int_0^t b(s, X_s)\,ds + \frac{1}{2}\int_0^t \sigma'(s, X_s)\,\sigma(s, X_s)\,ds + \int_0^t \sigma(s, X_s)\,dW_s\,,$$
where $'$ denotes the derivative with respect to $x$. The conversion formula (56) indicates that in going from the Stratonovich to the Itô formulation, the only part of the SDE that changes is the drift term; the diffusion coefficient remains unchanged. Moreover, the two equations coincide if the diffusion coefficient does not depend on $x$.
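To see the discrepancy concretely, here is a sketch (partition and seed are arbitrary choices) comparing the two discretizations on the same Brownian path for the integrand $W_s$, whose exact Itô value we know from Example 8.5: left-point sums approximate the Itô integral, midpoint-value sums the Stratonovich one.

    import numpy as np

    rng = np.random.default_rng(3)
    t, n = 1.0, 100_000
    dt = t / n
    dW = rng.normal(0.0, np.sqrt(dt), n)
    W = np.concatenate(([0.0], np.cumsum(dW)))

    ito = np.sum(W[:-1] * dW)                    # evaluate the integrand at t_j
    strat = np.sum(0.5 * (W[:-1] + W[1:]) * dW)  # evaluate at the midpoint value
    print(ito, 0.5 * W[-1]**2 - 0.5 * t)         # Ito limit:          W_t^2/2 - t/2
    print(strat, 0.5 * W[-1]**2)                 # Stratonovich limit: W_t^2/2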
One point in favour of the Stratonovich formulation is that it obeys the ordinary chain rule: if
$$dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\circ dW_t$$
then, for every smooth function (actually, differentiable with continuous first derivatives is enough) $g = g(t,x)$, the process $Y_t = g(t, X_t)$ solves the SDE
$$dY_t = \frac{\partial g}{\partial t}(t, X_t)\,dt + \frac{\partial g}{\partial x}(t, X_t)\circ dX_t
= \frac{\partial g}{\partial t}(t, X_t)\,dt + \frac{\partial g}{\partial x}(t, X_t)\,b(t, X_t)\,dt + \frac{\partial g}{\partial x}(t, X_t)\,\sigma(t, X_t)\circ dW_t\,.$$

This is an important advantage of the Stratonovich integral, which turns out to be useful for example in the context of stochastic calculus on manifolds. However, $\int_0^t \sigma\circ dW$ is not a martingale, while the Itô integral is a martingale, as we have seen. Moreover, the Itô integral has the property of "not looking into the future" that makes it so popular in applications to finance: remember that the integrand needs to be only $\mathcal{F}_t$-measurable, for every $t$, so at time $t$ the only information that we need is the one available up until that time. It is not the case that one of the two integrals is "better" than the other, but deciding whether we should look at
$$dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\circ dW_t \tag{57}$$
rather than
$$dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t\,, \tag{58}$$
is clearly relevant, as the solution to (57) is not the same as the solution to (58); and this is not just a technical matter, it is rather a problem of modelling, and the answer depends on the specific application that we are considering.

From the modelling point of view, the Stratonovich integral has the big advantage of being stable under perturbations of the noise. Let us explain what we mean.

1. Assuming that (57) and (58) admit only one solution, let $X_t$ be the solution of (58) and $\widetilde{X}_t$ the solution of (57). In general $X_t \neq \widetilde{X}_t$, as we have explained.

2. Now consider the equation
$$dX_t^k = b(t, X_t^k)\,dt + \sigma(t, X_t^k)\,\xi_t^k\,,$$
where $\xi^k$ is a sequence of smooth random processes that converge to white noise as $k \to \infty$. Notice that because the $\xi^k$'s are smooth, such an equation is, for each $\omega$, a simple ODE, so we don't need to decide what we mean by it because we know it already.

3. If $\xi^k \to \xi$, you would expect that $X^k \to X$. Well, this is not the case. Indeed, $X^k \to \widetilde{X}$.

This is only the tip of the iceberg of the more general landscape described by the Wong-Zakai approximation theory; see for example [76, 41] and references therein. The fact that Itô SDEs are not stable under perturbations of the noise can be interpreted to mean, from the modelling point of view, that unless we are absolutely sure that the noise to which our experiment is subject is actually white noise, the phenomenon at hand is probably better described by (57). On this topic, see for example [75]. After having studied Section 8.2.2 you will be able to solve Exercise 30, which gives a practical example of the procedure described above.

8.2 SDEs

From now on we focus on Itô SDEs, unless otherwise stated.

8.2.1 Existence and uniqueness.

We have so far handled equation (58) a bit clumsily, as in all fairness we haven't yet defined what we mean by a solution to (58). We will at first work with real-valued processes, but everything we say can be easily rewritten in higher dimension. To define what we mean by a solution to (58) we need some notation first. Let

- $W_t$ be a one-dimensional standard BM and $\mathcal{F}_t^W$ be the filtration generated by $W_t$;

- $\xi$ be a random variable, independent of $W_t$ for all $t$;

- $\mathcal{G}_t$ be the filtration generated by $\xi$ and $W_t$, i.e. for all $t$, $\mathcal{G}_t := \sigma(\xi, W_s;\ 0 \le s \le t)$;

- $\mathcal{N}$ be the family of null sets (i.e. sets of $\mathbb{P}$-measure zero) of the underlying probability space $\Omega$;

- $\mathcal{F}_t$ be the $\sigma$-algebra generated by $\mathcal{N}$ and $\mathcal{G}_t$.

With this notation, let us give the following definition:

Definition 8.7 (Strong Solution). Given a Brownian motion $W_t$ and an initial datum $\xi$, a strong solution to (58) is a continuous stochastic process $X_t$ such that

i) $X_t$ is $\mathcal{F}_t$-adapted, for all $t > 0$;

ii) $\mathbb{P}(X_0 = \xi) = 1$;

iii) for every $t \ge 0$, $\int_0^t \big(|b(s, X_s)| + |\sigma(s, X_s)|^2\big)\,ds < \infty$, $\mathbb{P}$-a.s.;

iv) (58), or better (45), holds almost surely.

Such a solution is unique if, whenever $\widetilde{X}_t$ is another strong solution, $\mathbb{P}(X_t \neq \widetilde{X}_t) = 0$, for all $t \ge 0$.
Clearly the above definition can be extended to higher dimension if we take $W_t$ to be an $m$-dimensional standard BM, and condition iii) becomes
$$\int_0^t \Big( \big|b^i(s, X_s)\big| + \big|\sigma^{ij}(s, X_s)\big|^2 \Big)\,ds < \infty, \quad \mathbb{P}\text{-a.s.},$$
for all $t \ge 0$ and $1 \le i \le d$, $1 \le j \le m$. Notice that in this definition the BM is assigned a priori, i.e. we know $b$, $\sigma$ and $W_t$ and we are looking for $X_t$. This might look like a silly remark, but it is not, as this is no longer true when we talk about weak solutions.

As for ODEs, we want to establish some general conditions under which the strong solution exists and is unique. We will state such a result directly for $d$-dimensional SDEs, so $d$ and $m$ are as at the end of Section 8.1.1. Also, for any $n\times\ell$ matrix $A$, we denote the Frobenius norm of $A$ as follows:
$$\|A\|_F^2 := \sum_{i=1}^n\sum_{j=1}^\ell |a^{ij}|^2\,.$$

Theorem 8.8. Suppose the coefficients $b$ and $\sigma$ are globally Lipschitz and grow at most linearly, i.e. for all $t \ge 0$ and for all $x, y \in \mathbb{R}^d$ there exists a constant $C > 0$ (independent of $x$ and $y$) such that
$$\|b(t,x) - b(t,y)\| + \|\sigma(t,x) - \sigma(t,y)\|_F \le C\|x - y\| \tag{59}$$
$$\|b(t,x)\|^2 + \|\sigma(t,x)\|_F^2 \le C^2(1 + \|x\|^2)\,. \tag{60}$$
Suppose also that there exists a r.v. $\xi$, independent of the $m$-dimensional BM and with finite second moment, i.e.
$$\mathbb{E}\|\xi\|^2 < \infty\,.$$
Then there exists a unique strong solution to equation (58), with initial condition $X(0) = \xi$. Such a solution is continuous and has bounded second moment. In particular, for every $T > 0$ there exists a constant $K > 0$, depending only on $C$ and $T$, such that
$$\mathbb{E}\|X_t\|^2 \le K(1 + \mathbb{E}\|X_0\|^2)\,e^{KT}, \quad \text{for all } 0 \le t \le T\,.$$

Comment. You might be thinking that all this fuss about Itô and Stratonovich is a bit useless, as we could also look at an SDE as being an ODE for each fixed $\omega$. This is the approach taken in [73]. However, bear in mind that because $W_t$ is continuous but not $C^1$, you would end up with an ODE with non-$C^1$ coefficients. In [73] it is shown that one can still make sense of such an ODE by using a variational approach, and that the solution found by means of the variational method coincides with the Stratonovich solution of the SDE. The conditions of the existence and uniqueness theorem, Theorem 8.8, can be somewhat weakened.
Theorem 8.9. Suppose all the assumptions of Theorem 8.8 hold, but replace condition (59) with the following: for each $N$ there exists a constant $C_N$ such that
$$\|b(t,x) - b(t,y)\| + \|\sigma(t,x) - \sigma(t,y)\|_F \le C_N\|x - y\| \quad \text{when } \|x\|, \|y\| \le N\,. \tag{61}$$
Then there exists a unique strong solution to equation (58).

In other words, under an assumption of local Lipschitzianity for the coefficients, the solution still exists and is unique. The linear growth condition instead is there to guarantee that the solution does not blow up. This is clear also from the theory of ordinary ODEs. Consider for example the ODE
$$dX_t = X_t^2\,dt, \quad X_0 = x\,.$$
The solution to this equation exists and is unique, but it blows up in finite time:
$$X_t = \begin{cases} 0 & \text{if } x = 0 \\ \dfrac{x}{1 - tx} & \text{if } x \neq 0\,. \end{cases}$$
8.2.2 Examples and methods of solution.

SDEs can be solved in the same way in which we solve ODEs, just taking into account the Itô chain rule.
Example 8.10 (Ornstein-Uhlenbeck process). Consider the one-dimensional equation
$$dX_t = \alpha X_t\,dt + \sigma\,dW_t \tag{62}$$
where $\alpha \in \mathbb{R}$ and $\sigma > 0$ are constants. The solution to this equation (which exists and is unique by Theorem 8.8) is the Ornstein-Uhlenbeck process (OU), which is a model of great importance in physics as well as in finance. $X_t$ is also called coloured noise and, among other things, is used to describe the velocity of a Brownian particle. From the point of view of non-equilibrium statistical mechanics, (62) is the simplest possible Langevin equation, but we will talk about this later. Now let us solve (62). As we would do for an ODE, let us write
$$dX_t - \alpha X_t\,dt = \sigma\,dW_t$$
and multiply both sides by $e^{-\alpha t}$:
$$e^{-\alpha t}(dX_t - \alpha X_t\,dt) = \sigma e^{-\alpha t}\,dW_t\,.$$
If we were dealing with an ODE we would now write
$$e^{-\alpha t}(dX_t - \alpha X_t\,dt) = d(e^{-\alpha t}X_t) \tag{63}$$
and integrate both sides. However, we are dealing with an SDE, so we need to use the chain rule of Itô calculus, equation (52). In this particular case, because the second derivative with respect to $x$ of $e^{-\alpha t}x$ is zero, applying (52) leads precisely to (63) (check); however, I want to stress that this will not happen in general, as we will see from the next example. So now we can simply integrate both sides and find
$$X(t) = e^{\alpha t}X_0 + \sigma\int_0^t e^{\alpha(t-s)}\,dW_s\,.$$
Taking expectations on both sides and using property (iii) of Theorem 8.4 we get
$$\mathbb{E}(X_t) = e^{\alpha t}\,\mathbb{E}(X_0)\,,$$
so that $\mathbb{E}(X_t) \to 0$ if $\alpha < 0$ and $\mathbb{E}(X_t) \to \pm\infty$ if $\alpha > 0$. If $X_0$ is Gaussian or deterministic then the OU process is Gaussian, as a consequence of Theorem 8.4, property (viii). Moreover, it is a Markov process. The proof of the latter fact is a corollary of the results of the next section.
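A minimal Euler-Maruyama sketch for (62) (the scheme, the parameters and the seed are my choices, not part of the notes): simulating many paths with deterministic $X_0$, the sample mean at time $T$ should be close to $e^{\alpha T}\mathbb{E}(X_0)$.

    import numpy as np

    rng = np.random.default_rng(4)
    alpha, sigma = -1.0, 0.5
    T, n_steps, n_paths = 2.0, 400, 20_000
    dt = T / n_steps

    X = np.full(n_paths, 1.0)                # X_0 = 1 for every path
    for _ in range(n_steps):
        X += alpha * X * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)

    print(X.mean(), np.exp(alpha * T) * 1.0) # sample mean vs e^{alpha T} E(X_0)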
Example 8.11 (Geometric Brownian Motion). The SDE
$$dX_t = rX_t\,dt + \sigma X_t\,dW_t$$
is a simple population growth model (but it is quite popular in finance as well), where $X_t$ represents the population size and $\bar r = r + \sigma\,\dot W$ is the relative rate of growth. To solve this SDE we proceed as follows:
$$\frac{dX_t}{X_t} = r\,dt + \sigma\,dW_t\,. \tag{64}$$
At the risk of being boring: we can't say that $d(\log X_t) = dX_t/X_t$, because we are dealing with the stochastic chain rule. So, applying the Itô chain rule, we find
$$d(\log X_t) = \frac{dX_t}{X_t} - \frac{1}{2}\sigma^2\,dt\,. \tag{65}$$
From (64) and (65) we then have
$$d(\log X_t) = \left(r - \frac{1}{2}\sigma^2\right)dt + \sigma\,dW_t\,,$$
which gives
$$X_t = X_0\,e^{(r - \frac{1}{2}\sigma^2)t + \sigma W_t}\,.$$
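Since here the solution is known in closed form, a natural sanity check (a sketch with arbitrary parameters) is to run Euler-Maruyama and the exact formula on the same Brownian path:

    import numpy as np

    rng = np.random.default_rng(5)
    r, sigma, X0 = 0.1, 0.3, 1.0
    T, n = 1.0, 10_000
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), n)

    X = X0
    for dw in dW:                            # Euler-Maruyama for dX = rX dt + sigma X dW
        X += r * X * dt + sigma * X * dw
    exact = X0 * np.exp((r - 0.5 * sigma**2) * T + sigma * dW.sum())
    print(X, exact)                          # should agree up to discretization error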

Example 8.12. Also for SDEs it makes sense to try and look for a solution in product form. For example, consider the SDE
$$dX_t = d(t)X_t\,dt + f(t)X_t\,dW_t, \quad X(0) = X_0\,. \tag{66}$$
If $d(t)$ and $f(t)$ are uniformly bounded on compacts then the assumptions of the existence and uniqueness theorem are satisfied and we can look for the solution. So assuming, for example, that $d(t)$ and $f(t)$ are continuous will do. We look for a solution in the form
$$X(t) = Y(t)\,Z(t),$$
where $Y(t)$ and $Z(t)$ solve, respectively,
$$\begin{cases} dY = fY\,dW \\ Y(0) = X_0 \end{cases} \qquad\text{and}\qquad \begin{cases} dZ = A(t)\,dt + B(t)\,dW \\ Z(0) = 1\,. \end{cases}$$
Applying the Itô product rule,
$$d(YZ) = (fX + BY)\,dW + Y(A + fB)\,dt\,.$$
We now compare the above expression with (66); in order for the two expressions to be equal, we need to have
$$fX + BY = fX \quad\text{and}\quad d(t)X_t = Y(A + fB)\,;$$
the first equation gives $B \equiv 0$, so that the second implies $A(t) = d(t)Z_t$. The equation for $Z$ becomes therefore deterministic:
$$dZ = d(t)Z\,dt, \quad Z(0) = 1 \;\Longrightarrow\; Z_t = \exp\left(\int_0^t d(s)\,ds\right)\,.$$
The equation for $Y$ can be solved similarly to what we have done for geometric Brownian motion:
$$dY = f(t)Y\,dW, \quad Y(0) = X_0 \;\Longrightarrow\; \frac{dY}{Y} = f(t)\,dW\,.$$
Because $d(\log Y) = dY/Y - f^2\,dt/2$, we get
$$d(\log Y) = f\,dW - \frac{1}{2}f^2\,dt \;\Longrightarrow\; \log\frac{Y_t}{Y_0} = \int_0^t f\,dW - \frac{1}{2}\int_0^t f^2\,dt \;\Longrightarrow\; Y(t) = X_0\exp\left(\int_0^t f\,dW - \frac{1}{2}\int_0^t f^2\,ds\right)\,.$$
Putting everything together, finally
$$X(t) = X_0\exp\left(\int_0^t f\,dW - \frac{1}{2}\int_0^t f^2\,ds + \int_0^t d(s)\,ds\right)\,. \tag{67}$$
8.2.3 Solutions of SDEs are Markov Processes

The solution of an Itô SDE is also called an Itô process. In this section we will always work under the assumptions of the existence and uniqueness theorem, and we will prove two very important facts about the solutions of Itô SDEs: first of all, the solution of an Itô SDE is a continuous-time Markov process; secondly, if the coefficients of the equation do not depend on time, the solution is a time-homogeneous Markov process.

Theorem 8.13. Suppose the assumptions of Theorem 8.8 hold. Then the solution of the SDE
$$dX_s = b(s, X_s)\,ds + \sigma(s, X_s)\,dW_s, \quad X_0 = x, \tag{68}$$
is a Markov process.

The proof of this theorem is not particularly instructive, as the technicalities obfuscate the intuition behind them. The reason why the above theorem holds true is morally simple: we know by the existence and uniqueness theorem that, once $X_s$ is known, the solution of (68) is uniquely determined for every $t \ge s$. Theorem 8.8 doesn't require any information about $X_u$ for $u < s$ in order to determine $X_u$ for $u > s$, once $X_s$ is known. This is why the above statement holds under the same assumptions as Theorem 8.8.

Theorem 8.14. Suppose the assumptions of Theorem 8.8 hold. If the coefficients of the Itô SDE (58) do not depend on time, then the Itô process is time-homogeneous. I.e., the solution of the equation
$$dX_s = b(X_s)\,ds + \sigma(X_s)\,dW_s, \quad X_0 = x,$$
is a time-homogeneous Markov process.

As we have already said, a time-homogeneous process is a process describing a phenomenon that happens in conditions that do not change in time. Because the coefficients $b$ and $\sigma$ describe precisely the conditions in which the phenomenon happens, if such coefficients are time-independent it is almost tautological that the solution of the equation should be time-homogeneous.
Proof of Theorem 8.14. The Markovianity is a consequence of Theorem 8.13, so we only need to prove time-homogeneity, which means that we need to prove that
$$\mathbb{P}(X_u \in B \mid X_0 = x) = \mathbb{P}(X_{u+t} \in B \mid X_t = x), \quad \forall B \in \mathcal{S},\ x \in S,\ t \ge 0\,.$$
Denote by $X^{x,t}(u+t)$ the solution of the equation
$$\eta(t+u) = x + \int_t^{t+u} b(\eta(s))\,ds + \int_t^{t+u} \sigma(\eta(s))\,dW_s \tag{69}$$
and by $X^{x,0}(u)$ the solution of the equation
$$\zeta(u) = x + \int_0^u b(\zeta(s))\,ds + \int_0^u \sigma(\zeta(s))\,dW_s\,. \tag{70}$$
What we want to prove is that $X^{x,t}(u+t)$ has the same distribution as $X^{x,0}(u)$. To this end, notice that if $W_v$ is a standard BM then $W_{t+v} - W_t$ is a standard BM as well (check), i.e. $\widetilde W_v = W_{t+v} - W_t$ has the same distribution as $W_v$. With this in mind, a simple change of variable concludes the proof. Indeed, the RHS of (69) can be rewritten as
$$\eta(t+u) = x + \int_0^u b(\eta(v+t))\,dv + \int_0^u \sigma(\eta(v+t))\,dW_{t+v}\,.$$
At this point, because $\widetilde W$ and $W$ have the same distribution, and the above is nothing but (70) when we replace $W$ with $\widetilde W$, it is clear by the uniqueness of the solution that $\eta(t+u)$ has the same distribution as $\zeta(u)$, which is, $X^{x,t}(u+t)$ has the same distribution as $X^{x,0}(u)$.

8.3 Langevin and Generalized Langevin equation

A very interesting example is provided by the so-called Langevin equation
$$\ddot q(t) = -\partial_q V(q) - \gamma\,\dot q(t) + f(t)\,. \tag{71}$$
In the above equation $\gamma > 0$ is a constant, $V(q)$ is a confining^{22} potential and $f(t)$ is white noise. Therefore equation (71) describes the position $q(t) \in \mathbb{R}$ of a particle subject to Newton's equation of motion plus damping and noise. The Langevin equation was introduced as a stochastic model for chemical reactions, in which a particle, held by intermolecular forces, undergoes the reaction when activated by random molecular collisions. In this framework, the term $-\gamma\dot q$ expresses the rate at which the reaction slows down due to such random interactions. Notice that equation (71) can be rewritten in a form more familiar to you, i.e. as a system of first-order SDEs, by just enlarging the state space with the introduction of the momentum variable $p(t)$:
$$\dot q(t) = p(t), \qquad \dot p(t) = -\partial_q V(q) - \gamma\,p(t) + f(t)\,. \tag{72}$$
If $V(q) = q^2/2$, by setting
$$B = \begin{pmatrix} 0 & 1 \\ -1 & -\gamma \end{pmatrix} \qquad\text{and}\qquad \sigma = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \tag{73}$$
you can observe that (72) is nothing but an O-U process, this time in two dimensions:
$$dX_t = BX_t\,dt + \sigma\,dW_t, \quad\text{where } X_t = \begin{pmatrix} q(t) \\ p(t) \end{pmatrix},$$
and $W_t = (W_t^1, W_t^2)$ is a two-dimensional standard BM, i.e. $W^1$ and $W^2$ are independent one-dimensional standard BMs. However, notice that the diffusion matrix^{23} is degenerate (in the sense that it has zero determinant); for this reason this is a degenerate process (in particular, one can prove that this is a hypoelliptic process). We will comment again on this later on, as the degeneracy of the diffusion matrix adds big difficulties to the analysis of such a process.

^{22} A function $g(x): \mathbb{R}^n \to \mathbb{R}$ is said to be confining if $g(x) \to \infty$ when $\|x\| \to \infty$.
^{23} We shall see in the next section why this object is called diffusion matrix.
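To make the degeneracy concrete, here is a small simulation sketch (taking $\gamma = 1$ and unit noise intensity, which are my arbitrary choices) integrating (72) with $V(q) = q^2/2$ in the form $dX_t = BX_t\,dt + \sigma\,dW_t$: only the $p$-component is forced directly by the noise, yet the noise spreads to $q$ through the coupling.

    import numpy as np

    rng = np.random.default_rng(6)
    gamma = 1.0
    B = np.array([[0.0, 1.0],
                  [-1.0, -gamma]])
    S = np.array([[0.0, 0.0],
                  [0.0, 1.0]])          # degenerate diffusion matrix: det S = 0

    dt, n_steps = 1e-3, 50_000
    X = np.array([1.0, 0.0])            # initial (q, p)
    traj = np.empty((n_steps, 2))
    for i in range(n_steps):
        X = X + B @ X * dt + S @ rng.normal(0.0, np.sqrt(dt), size=2)
        traj[i] = X

    print(traj.var(axis=0))             # both q and p fluctuate: noise propagates to q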

The model (72) assumes the collisions between molecules to occur instantaneously, hence it is not valid in physical situations in which such an approximation cannot be made. In these cases a better description is given by the Generalized Langevin Equation (GLE):
$$\ddot q(t) = -\partial_q V(q) - \int_0^t ds\,\gamma(t-s)\,\dot q(s) + F(t)\,. \tag{74}$$
Here $\gamma(t)$ is a smooth function of $t$ and $F(t)$ is a mean-zero stationary Gaussian noise. $F(t)$ and the memory kernel $\gamma(t)$ are related through the following fluctuation-dissipation relation:
$$\mathbb{E}(F(t)F(s)) = \beta^{-1}\gamma(t-s), \tag{75}$$
where $\beta > 0$ is a constant, representing the inverse temperature of the system. The GLE together with the fluctuation-dissipation principle (75) appears in various applications such as surface diffusion and polymer dynamics. It also serves as one of the standard models of nonequilibrium statistical mechanics, describing the dynamics of a small Hamiltonian system (a distinguished particle) coupled to a heat bath at inverse temperature $\beta$. Just to provide some context: given a system in equilibrium, we can drive it away from its stationary state either by coupling it to one or more large Hamiltonian systems or by using non-Hamiltonian forces. In the Hamiltonian approach, which is often referred to as the open systems theory [66], the system we are interested in is coupled to one or more heat reservoirs. Multiple reservoirs and the non-Hamiltonian methods are used to study nonequilibrium steady states (for example, if we consider two heat baths at different temperatures, you can imagine that there will be a constant heat flux from the warmer to the colder reservoir; such a state is called a nonequilibrium steady state), see [68, 28]; coupling with a single heat bath is used to study return to equilibrium.^{24} The GLE serves precisely this purpose.

Let us now say a couple of words about the derivation of the GLE. In order to derive such an equation (which, I would like to point out, is a stochastic integro-differential equation) we think of the system "particle + bath" as a mechanical system in which a distinguished particle interacts with $n$ heat bath molecules of mass $\{m_j\}_{1\le j\le n}$, through linear springs with random stiffness parameters $\{k_j\}_{1\le j\le n}$; the Hamiltonian of the system is then:
$$H(q_n, p_n, Q_1, \dots, Q_n, P_1, \dots, P_n) = V(q_n) + \frac{p_n^2}{2} + \frac{1}{2}\sum_{i=1}^n \frac{P_i^2}{m_i} + \frac{1}{2}\sum_{i=1}^n k_i(Q_i - q_n)^2, \tag{76}$$
where $(q_n, p_n)$ and $(Q_1, \dots, Q_n, P_1, \dots, P_n)$ are the positions and momenta of the tagged particle and of the heat bath molecules, respectively (the notation $(q_n, p_n)$ is to stress that the position and momentum of the particle depend on the number of molecules it is coupled to).

^{24} Quoting D. Ruelle: "The purpose of nonequilibrium statistical mechanics is to explain irreversibility on the basis of microscopic dynamics, and to give quantitative predictions for dissipative phenomena."
Excursus. The theory of equilibrium statistical mechanics is concerned with the study of many-particle systems in their equilibrium state; such a theory was put on firm ground by Boltzmann and Gibbs in the second half of the 19th century. Consider a Hamiltonian system with Hamiltonian $H(q,p)$ (for example, our gas particles in a box). The formalism of equilibrium statistical mechanics allows one to calculate macroscopic properties of the system starting from the laws that govern the motion of each particle (good reading material are the papers [46, 45, 68]). The Boltzmann-Gibbs prescription states the following: the equilibrium value of any quantity at inverse temperature $\beta$ is the average of that quantity in the canonical ensemble, i.e. it is the average with respect to the Gibbs measure $\mu(q,p) = Z^{-1}\exp\{-\beta H(q,p)\}\,dq\,dp$. The reason why this measure plays such an important role is a consequence of the fact that the Hamiltonian flow preserves functions of the Hamiltonian as well as the volume element $dq\,dp$.
Going back to where we were, we can write down the equations of motion of the system with Hamiltonian (76):
$$\begin{cases} \dot q_n = p_n \\ \dot p_n = -\partial_q V(q_n) + \sum_{i=1}^n k_i(Q_i - q_n) \\ \dot Q_i = P_i/m_i & 1 \le i \le n \\ \dot P_i = -k_i(Q_i - q_n) & 1 \le i \le n\,. \end{cases}$$
The initial conditions for the distinguished particle are assumed to be deterministic, namely $q_n(0) = q_0$ and $p_n(0) = p_0$; those for the heat bath are randomly drawn from a Gibbs distribution at temperature $\beta^{-1}$, namely
$$\mu_\beta := \frac{1}{Z}e^{-\beta H}, \quad Z \text{ normalizing constant},$$
where the Hamiltonian $H$ has been defined in (76). Integrating out the heat bath variables we obtain a closed equation for $q_n$, of the form (74). In the thermodynamic limit as $n \to \infty$ we recover the GLE. Under the assumption that at time $t = 0$ the heat bath is in equilibrium at inverse temperature $\beta$, we obtain the fluctuation-dissipation relation (75) as well. The form of the memory kernel $\gamma(t)$ depends on the choice of the distribution of the spring constants of the harmonic oscillators in the heat bath [23].
It is important to point out that, for a general memory kernel $\gamma(t)$, the GLE is non-Markovian, and this feature makes it not very amenable to analysis (even though it is worth noticing that significant progress has been made in the study of non-Markovian processes [27]). One way to go about this problem is trying to approximate the non-Markovian dynamics (74) with a Markovian one. As noticed in [43], the problem can be recast as follows: we want to approximate the non-Markovian process (74) with the Markovian dynamics given by the system of SDEs:
$$dq = p\,dt \tag{77a}$$
$$dp = -\partial_q V(q)\,dt + g\cdot u\,dt \tag{77b}$$
$$du = -pg\,dt - Au\,dt + C\,dW(t)\,, \tag{77c}$$
where $(q,p) \in \mathbb{R}^2$, $u$ and $g$ are column vectors of $\mathbb{R}^d$, $\cdot$ denotes the Euclidean scalar product, $W(t) = (W_1(t), \dots, W_d(t))$ is a $d$-dimensional Brownian motion, $V(q)$ is a potential, and $A$ and $C$ are constant-coefficient $d\times d$ matrices, related through the fluctuation-dissipation principle:
$$A + A^T = CC^T\,. \tag{78}$$
We also recall that the noise in (74) is Gaussian, stationary and mean zero. Because the memory kernel and the noise in (74) are related through the fluctuation-dissipation relation, the rough idea is that we might try to either approximate the noise and hence obtain the corresponding memory kernel or, the other way around, approximate the correlation function and read off the noise. The latter is the approach that we shall follow in this section. As a motivation, we would like to notice that for some specific choices of the kernel, equation (74) is equivalent to a finite-dimensional Markovian system in an extended state space. If, for example, we choose $\gamma(t) = \lambda^2 e^{-t}$, $t > 0$, then (74) becomes
$$\dot q = p, \qquad \dot p = -\partial_q V(q) - \lambda^2\int_0^t e^{-(t-s)}p(s)\,ds + F(t)\,; \tag{79}$$
the fluctuation-dissipation theorem (with $\beta = 1$) yields
$$\mathbb{E}(F(t+s)F(t)) = \lambda^2 e^{-|s|}\,. \tag{80}$$

Since we are requiring $F(t)$ to be stationary and Gaussian, (80) implies that $F(t)$ is the Ornstein-Uhlenbeck process.^{25} If we write $F(t) = \lambda v(t)$, with $v(t)$ satisfying the equation $\dot v = -v + \sqrt{2}\,\dot W$, and we define the new process
$$z(t) = -\lambda^2\int_0^t e^{-(t-s)}p(s)\,ds + \lambda v(t), \tag{81}$$
then (79) becomes
$$\dot q = p, \qquad \dot p = -\partial_q V + z, \qquad \dot z = -\lambda^2 p - z + \lambda\sqrt{2}\,\dot W\,,$$
which is precisely system (77) with $d = 1$, $A = 1$ and $g = \lambda$. The Markovianization of (74) was first done by Mori [54], by first approximating the Laplace transform of the memory kernel $\gamma(t)$, $\hat\gamma(\lambda)$, by a rational function (if and when this is possible) and then imposing the fluctuation relation, which gives the matrices $A$ and $C$ as well as the vector $g$. If $\gamma(t)$ itself is a sum of exponentials, $\gamma_d(t) = \sum_{i=1}^d \lambda_i^2 e^{-\alpha_i t}$, then $\hat\gamma_d = \sum_{i=1}^d \lambda_i^2/(\lambda + \alpha_i)$, so the procedure indicated by Mori is clearly successful, and it corresponds to the case in which $A = \operatorname{diag}\{\alpha_1, \dots, \alpha_d\}$ and $g = (\lambda_1, \dots, \lambda_d)^T$. Another typical situation is when the Laplace transform of $\gamma$ has a continued fraction representation
$$\hat\gamma(\lambda) = \cfrac{\gamma_1^2}{\lambda + \alpha_1 + \cfrac{\gamma_2^2}{\lambda + \alpha_2 + \cfrac{\gamma_3^2}{\lambda + \alpha_3 + \cdots}}}\,, \qquad \gamma_i, \alpha_i > 0\,.$$
In this case the approximation is done by truncating the fraction at step $d$ and then reading off the corresponding Markovian system of $(d+2)$ SDEs. The matrix $A$ is then tridiagonal,
$$A = \begin{pmatrix}
\alpha_1 & -\gamma_2 & & & \\
\gamma_2 & \alpha_2 & -\gamma_3 & & \\
 & \gamma_3 & \alpha_3 & \ddots & \\
 & & \ddots & \ddots & -\gamma_d \\
 & & & \gamma_d & \alpha_d
\end{pmatrix}$$
and $g = (\gamma_1, 0, \dots, 0)^T$.

^{25} It is possible to show that the only mean-zero stationary Gaussian process with autocorrelation function $e^{-t}$ is the stationary O-U process.

9 Markov Semigroups and their Generator

The properties of Markov processes can be studied by means of functional-analytic tools, which we come to introduce in this section. The (very little) background on functional analysis that you need to study this section is in the appendix. The framework we describe will refer to a continuous-time setting; therefore it will be applied to continuous-time Markov processes and, in the next section, to diffusion processes (which are a special subclass of the continuous-time Markov processes). However, some of the definitions and results presented in the following can also be formulated in discrete time. For the content of this section we refer to [25, 13, 62]. In all that follows, $(B, \|\cdot\|)$ will denote a real Banach space.

Definition 9.1 (Markov Semigroup). A one-parameter family of linear bounded operators over a Banach space, $\{P_t\}_{t\in\mathbb{R}^+}$, $P_t: B \to B$ for all $t \ge 0$, is a semigroup if

1. $P_0 = I$, where $I$ denotes the identity on $B$;

2. $P_{t+s} = P_t P_s = P_s P_t$, for all $t, s \in \mathbb{R}^+$.

If the map $\mathbb{R}^+ \ni t \mapsto P_t f \in B$ is continuous for all $f \in B$, then the semigroup is said to be strongly continuous. A strongly continuous semigroup of bounded linear operators is also called a $C_0$-semigroup. A semigroup is a Markov semigroup if

i) it preserves constants: $P_t 1 = 1$ for all $t \in \mathbb{R}^+$ (here $1$ is the constant function equal to one);

ii) it is positivity preserving: $f \ge 0 \Rightarrow P_t f \ge 0$ for all $t \in \mathbb{R}^+$.

Definition 9.2. With the notation of the previous definition, the semigroup $P_t$ is contractive if
$$\|P_t f\| \le \|f\|, \quad \text{for all } t \in \mathbb{R}^+ \text{ and } f \in B\,.$$

Comment. i) You are likely to have already seen the definition of semigroup when talking about the solution of differential equations, for example linear equations of the type
$$\partial_t f = Lf,$$
where, for every fixed $t$, $f(t,x)$ belongs to some Banach space $B$ and $L$ is a differential operator on $B$. Indeed, properties 1. and 2. simply say that the solution of the equation is unique. Strong continuity is used to make sense of the notion of initial datum for the equation.

ii) Let $p_t(x, dy)$ be the transition function of a continuous-time and time-homogeneous Markov process $X_t$. To fix ideas, suppose $X_t$ is real valued. Then
$$P_t f(x) = \mathbb{E}[f(X_t) \mid X_0 = x] = \int_{\mathbb{R}} f(y)\,p_t(x, dy) \tag{82}$$
defines a positivity-preserving semigroup with $P_t 1 = 1$ (check) on the space $B_m$ of bounded and measurable functions from $\mathbb{R}$ to $\mathbb{R}$. Whether the semigroup is strongly continuous or not depends on the process $X_t$ and on the function space that we restrict to. The Banach spaces that we will consider will be function spaces, and we assume that such spaces will contain at least the space $B_m$. Another popular space for us will be the space $C_b$ of bounded and continuous functions from $\mathbb{R}$ to $\mathbb{R}$ (more in general, one can consider functions defined on a Polish space).

The semigroup (82) is the Markov semigroup associated with the process $X_t$. Such a semigroup will be the main focus of the remainder of this set of lecture notes. The aim of the game will be deriving properties of the Markov process $X_t$ from properties of the semigroup $P_t$. The fascinating bit here is that by doing so, we are trying to solve a stochastic problem by purely functional-analytic means.
Example 9.3. Let $(\Omega, \mathcal{F}, \mu)$ be a probability space, $C_b(\Omega)$ be the space of continuous and bounded functions on the space $\Omega$ (endowed with the uniform norm) and $m > 0$ a constant. Then
$$P_t f(x) := e^{-mt}f(x) + (1 - e^{-mt})\,\mathbb{E}_\mu(f), \quad\text{where } \mathbb{E}_\mu(f) := \int_\Omega f(x)\,d\mu(x),$$
defines a strongly continuous Markov semigroup.
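As a quick check that Example 9.3 really satisfies the semigroup property (a one-line verification, using only that $\mathbb{E}_\mu(P_s f) = \mathbb{E}_\mu(f)$, which follows directly from the definition):
$$P_t(P_s f) = e^{-mt}\left[e^{-ms}f + (1 - e^{-ms})\,\mathbb{E}_\mu(f)\right] + (1 - e^{-mt})\,\mathbb{E}_\mu(f) = e^{-m(t+s)}f + \left(1 - e^{-m(t+s)}\right)\mathbb{E}_\mu(f) = P_{t+s}f\,.$$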

Example 9.4. The relation
$$P_t f(x) := \frac{1}{\sqrt{2\pi t}}\int_{\mathbb{R}} f(y)\,e^{-\frac{|x-y|^2}{2t}}\,dy \tag{83}$$
defines a strongly continuous semigroup on the space of bounded and uniformly continuous functions from $\mathbb{R}$ to $\mathbb{R}$. This is precisely the semigroup associated with Brownian motion through (82). In other words, because the transition probabilities of BM are given by (41), we can rewrite the expression (83) as
$$P_t f(x) = \mathbb{E}[f(B_t) \mid B_0 = x]\,.$$

Given a semigroup, we can associate to it an operator, the generator of the semigroup.

Definition 9.5. Given a $C_0$-semigroup $P_t$, we define the infinitesimal generator of the semigroup $P_t$ to be the operator
$$Lf := \lim_{t\to 0^+}\frac{P_t f - f}{t}, \tag{84}$$
for all $f \in D(L) := \{f \in B : \text{the limit on the RHS of (84) exists in } B\}$. If $P_t$ is a strongly continuous Markov semigroup, then the operator $L$ is a Markov generator. The generator of the semigroup defined in (82) is often referred to as the generator of the Markov process $X_t$.
From now on we will always work with strongly continuous semigroups, unless otherwise stated. The following properties hold.

Theorem 9.6. With the notation and nomenclature introduced so far, let $P_t$ be a strongly continuous semigroup. Then we have:

1. if $f \in D(L)$ then also $P_t f \in D(L)$, for all $t \ge 0$;

2. the semigroup and its generator commute: $LP_t f = P_t Lf$ for all $f \in D(L)$;

3. $\partial_t(P_t f) = LP_t f$.

Notice that point 3. of Theorem 9.6 means that if we define a function $g_t(x) = g(t,x) := P_t f(x)$, then such a function satisfies, for all $t \in \mathbb{R}^+$, the equation $\partial_t g = Lg$.

Proof of Theorem 9.6. 1. and 2. can be proved together: for any $t, s \ge 0$, we can write
$$\frac{P_s P_t f - P_t f}{s} = P_t\,\frac{P_s f - f}{s}\,.$$
Now we can take the limit as $s \to 0^+$ on both sides (the limit of the RHS makes sense and therefore also the one on the LHS does, which means that 1. is proved) and obtain
$$LP_t f = \lim_{s\to 0^+}\frac{P_s P_t f - P_t f}{s} = P_t\lim_{s\to 0^+}\frac{P_s f - f}{s} = P_t Lf\,.$$

Now we come to proving 3. In the remainder of this proof $h > 0$; let us look at the right limit first:
$$\lim_{h\to 0^+}\frac{P_{t+h}f - P_t f}{h} = P_t\lim_{h\to 0^+}\frac{P_h f - f}{h} \overset{\text{by 2.}}{=} P_t Lf = LP_t f\,.$$
Therefore the right derivative is equal to $LP_t f$. Now we need to show the same for the left derivative as well:
$$\lim_{h\to 0^+}\left\|\frac{P_t f - P_{t-h}f}{h} - LP_t f\right\| \le \lim_{h\to 0^+}\left\|P_{t-h}\Big(\frac{P_h f - f}{h} - Lf\Big)\right\| + \lim_{h\to 0^+}\big\|P_{t-h}Lf - P_t Lf\big\|\,.$$
The second limit is equal to zero by strong continuity. Using Exercise 31, also the first limit is equal to zero; indeed,
$$\lim_{h\to 0^+}\left\|P_{t-h}\Big(\frac{P_h f - f}{h} - Lf\Big)\right\| \le \lim_{h\to 0^+} c(t)\left\|\frac{P_h f - f}{h} - Lf\right\|,$$
where $c(t)$ is a constant depending on $t$, not on $h$. Now the limit on the RHS of the above is clearly equal to zero, by the definition of $L$.
The following very famous result gives a necessary and sufficient condition for an operator to be the generator of a Markov semigroup.

Theorem 9.7 (Hille-Yosida Theorem for Markov semigroups). A linear operator $L$ is the generator of a Markov semigroup $\{P_t\}_{t\in\mathbb{R}^+}$ on the Banach space $B$ if and only if

1. $1 \in D(L)$ and $L1 = 0$;

2. $D(L)$ is dense in $B$;

3. $L$ is closed;

4. for any $\lambda > 0$, the operator $\lambda I - L$ is invertible; its inverse is bounded,
$$\sup_{\|f\|\le 1}\|(\lambda I - L)^{-1}f\| \le \frac{1}{\lambda},$$
and positivity preserving.

Example 9.8 (Generator of Brownian motion). From the discussion of Section 7 we expect the generator of BM, i.e. the generator of the semigroup (83), to be $\Delta/2$, where $\Delta$ denotes the Laplacian. We work in one dimension, so we expect to obtain $\partial_{xx}^2/2$. Indeed, let $f$ be a bounded and continuous function, with bounded and continuous first and second derivatives. Then, if $P_t$ denotes the Brownian semigroup (83), we have
$$\frac{P_t f(x) - f(x)}{t} = \frac{1}{t}\frac{1}{\sqrt{2\pi t}}\int_{\mathbb{R}}(f(y) - f(x))\,e^{-\frac{|x-y|^2}{2t}}\,dy = \frac{1}{t}\frac{1}{\sqrt{2\pi t}}\int_{\mathbb{R}}(f(x+z) - f(x))\,e^{-z^2/2t}\,dz$$
$$= \frac{1}{t}\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\big(f(x + w\sqrt{t}) - f(x)\big)\,e^{-w^2/2}\,dw = \frac{1}{t}\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\Big[f'(x)\,w\sqrt{t} + \frac{1}{2}f''(x + \theta w\sqrt{t})\,t\,w^2\Big]e^{-w^2/2}\,dw,$$
where the last equality holds, by Taylor's Theorem, for some $\theta \in [0,1]$. From the continuity of $f''$ and observing that
$$\int_{\mathbb{R}} f'(x)\,w\sqrt{t}\,e^{-w^2/2}\,dw = f'(x)\sqrt{t}\int_{\mathbb{R}} w\,e^{-w^2/2}\,dw = 0,$$
we have
$$\lim_{t\to 0^+}\frac{P_t f(x) - f(x)}{t} = \frac{f''(x)}{2}\,.$$
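A numerical illustration of Example 9.8 (a sketch; the choices $f = \sin$, $x = 0.5$ and $t = 10^{-4}$ are arbitrary): computing $P_t f(x) = \mathbb{E}[f(x + \sqrt{t}Z)]$, $Z \sim N(0,1)$, by Gauss-Hermite quadrature, the difference quotient should approach $f''(x)/2$.

    import numpy as np

    f = np.sin
    x, t = 0.5, 1e-4
    # nodes/weights for the weight e^{-z^2/2}; E[g(Z)] = sum(w*g(z)) / sqrt(2*pi)
    nodes, weights = np.polynomial.hermite_e.hermegauss(64)
    Ptf = np.sum(weights * f(x + np.sqrt(t) * nodes)) / np.sqrt(2 * np.pi)

    print((Ptf - f(x)) / t, -0.5 * np.sin(x))  # both close to f''(x)/2 = -sin(x)/2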

Definition 9.9. Given a time-homogeneous Markov process $X_t: (\Omega, \mathcal{F}, \mathbb{P}) \to \mathbb{R}$, we have defined the associated Markov semigroup $P_t$ by (82). Denoting by $p_t(x, dy)$ the transition functions of the process, we can also define the dual semigroup associated to $X_t$, acting on the space of probability measures on $\mathbb{R}$ (denoted by $\mathcal{M}_1(\mathbb{R})$):
$$(P_t^*\nu)(B) = \int_{\mathbb{R}} p_t(x, B)\,\nu(dx), \quad t \ge 0,\ B \in \mathcal{B}(\mathbb{R}),\ \nu \in \mathcal{M}_1(\mathbb{R})\,. \tag{85}$$

Remark 9.10. We recall that the dual of $C([0,T], \mathbb{R})$ is the space of measures on $[0,T]$ with bounded total variation on $[0,T]$ (see [80], page 119); the total variation on $[0,T]$ is defined as
$$\|\nu\|_{TV} := \sup_{\substack{f \in C([0,T]) \\ \|f\|\le 1}}\int_{[0,T]} f(s)\,\nu(ds)\,.$$

Also, the dual of the space $L^\infty(S, \mathcal{S}, \mu)$, where $(S, \mathcal{S}, \mu)$ is a measure space with $\mu(S) < \infty$, is the space of finitely additive measures absolutely continuous with respect to $\mu$ and such that $\sup_B|\nu(B)| < \infty$ (see again the same reference).

Notice that the Markov semigroup $P_t$ acts on functions, while its dual acts on probability measures. Moreover, $P_t f$ is a function, while $P_t^*\nu$ is a probability measure on $\mathbb{R}$. (Needless to say, in all of the above $\mathbb{R}$ can be replaced by any Polish space $S$ and everything works anyway.) The reason why $P_t^*$ is called the dual semigroup is the following: with Remark 9.10 in mind^{26}, if we use the notation
$$\langle P_t f, \nu\rangle = \int_{\mathbb{R}}(P_t f)(x)\,\nu(dx),$$
then for any, say, bounded and measurable function $f$ and for any probability measure $\nu$ we have
$$\langle P_t f, \nu\rangle = \int_{\mathbb{R}}\left(\int_{\mathbb{R}} p_t(x, dy)\,f(y)\right)\nu(dx) = \int_{\mathbb{R}} f(y)\int_{\mathbb{R}} p_t(x, dy)\,\nu(dx) = \int_{\mathbb{R}} f(y)\,(P_t^*\nu)(dy) = \langle f, P_t^*\nu\rangle\,.$$
So, at least formally, $P_t^*$ is the dual (or adjoint) operator of $P_t$.

9.1

Ergodicity for continuous time Markov processes.

Definition 9.11. Let Pt be a Markov semigroup. A probability measure


is said to be invariant for Pt if for any bounded and measurable function ,
Z
Z
(Pt )(x)(dx) =
(x)(dx).
(86)
R

Given a time-homogeneous Markov process, we say that a measure is invariant for Xt if it is invariant for the associated Markov semigroup.
26

In view of Remark 9.10, this is just a duality relation, i.e. it expresses the action of
the dual of a space on the space itself, see the Appendix; in the case of a Hilbert space
this would be just a scalar product, which is why we denote duality relations and scalar
products in the same way.

91

With the notation introduced at the end of the previous section, (86) can be rewritten as
$$\langle P_t\phi, \pi\rangle = \langle\phi, \pi\rangle\,. \tag{87}$$
Therefore, from the point of view of the dual (or adjoint) semigroup, we can also say that the measure $\pi$ is invariant if
$$P_t^*\pi = \pi, \quad \forall t \ge 0\,.^{27} \tag{88}$$

Comment. I suppose that you would like to reconcile Definition 9.11 with Definition 3.21. If $P_t$ is the Markov semigroup associated with a given process $X_t$ and we take $\phi(x) = \mathbb{1}_B(x)$, for some measurable set $B$, then
$$\int P_t\phi(x)\,\pi(dx) = \int\left(\int \mathbb{1}_B(y)\,p_t(x, dy)\right)\pi(dx) = \int p_t(x, B)\,\pi(dx)\,.$$
Therefore equality (86) implies
$$\int p_t(x, B)\,\pi(dx) = \int \mathbb{1}_B(x)\,d\pi(x) = \pi(B)\,.$$
Observe, as we have done for the time-discrete case, that if $\pi$ is invariant for $X_t$ and $X_0 \sim \pi$, then $X_t$ is stationary; indeed, in this case
$$\mathbb{P}(X_t \in B) = \int_{\mathbb{R}} p_t(x, B)\,\pi(dx) = P_t^*\pi(B) = \pi(B)\,. \tag{89}$$

Comment. The stationarity of $\pi$ is used in (89) only for the last equality. All the other equalities are true simply by definition of $P_t^*$. This means that if we start the process with a certain distribution $\nu$, then the semigroup $P_t^*$ describes the evolution of the law of the process:
$$(P_t^*\nu)(B) = \mathbb{P}(X_t \in B),$$
for every measurable set $B$. On the other hand, from the definition (82) of the semigroup $P_t$, it is clear that $P_t$ describes the evolution of the so-called observables.

Again analogously to the time-discrete case, we give the following definition.

^{27} If (87) holds for all bounded and measurable $\phi$, then take $\phi$ to be the indicator function of a measurable set and you obtain (88). On the other hand, if (88) is true then $\langle P_t\phi, \pi\rangle = \langle\phi, P_t^*\pi\rangle = \langle\phi, \pi\rangle$, for all $\phi$ bounded and measurable.

Definition 9.12. A time-homogeneous (continuous-time) Markov process is ergodic if the associated semigroup (i.e. the semigroup defined through the relation (82)) admits a unique invariant measure.

Before making the comment below, we would like to point out a technical fact.

Lemma 9.13. Let $P_t$ be a Markov semigroup associated with some real-valued process $X_t$. If $\pi$ is an invariant measure for the semigroup, then $P_t$ can be extended to a strongly continuous semigroup on $L^p(\mathbb{R}, \pi)$, for every $p \ge 1$.

Notation: if $S$ is any Polish space, $L^p(S, \pi)$ is a weighted $L^p$ space (of $\mathbb{R}$- or $\mathbb{R}^n$-valued functions) over $S$, that is,
$$L^p(S, \pi) := \Big\{\text{functions } f: S \to \mathbb{R} \text{ such that } \int_S |f|^p(x)\,d\pi(x) < \infty\Big\}\,.$$
From now on, we shall refer to the non-weighted $L^p$ space as the flat $L^p$. Thanks to the above Lemma, we can work in a Hilbert space setting as soon as an invariant measure exists.
Comment. One can prove that the set of all invariant probability measures for $P_t$ is convex, and an invariant probability measure is ergodic if and only if it is an extremal point of such a set (compare with Lemma 5.4). Hence, if a semigroup has a unique invariant measure, that measure is ergodic. This justifies the definition of ergodic process that we have just given. It can be shown (see [13]) that, given a $C_0$-Markov semigroup, the following statements are equivalent:

(i) $\pi$ is ergodic.

(ii) If $\phi \in L^2(S, \pi)$ and $P_t\phi = \phi$ for all $t > 0$, $\pi$-a.s., then $\phi$ is a constant ($\pi$-a.s.).

(iii) for any $\phi \in L^2(S, \pi)$,
$$\lim_{t\to\infty}\frac{1}{t}\int_0^t (P_s\phi)(x)\,ds = \int_S \phi(x)\,\pi(dx), \quad\text{in } L^2(S, \pi)\,.$$

Because the Markov semigroup preserves constants, condition (ii), read at the level of the generator of the semigroup, says that $0$ is a simple eigenvalue of $L$: indeed, $P_t\phi = \phi$ for all $t > 0$, $\pi$-a.s., if and only if $L\phi = 0$; in other words, $L\phi = 0 \Leftrightarrow \phi$ is constant (for $\phi \in L^2(S,\pi) \cap D(L)$). As for (iii), this is precisely the ergodic theorem for continuous-time Markov processes. Notice that the time average on the LHS depends on $x$ (the initial value of the process) while the RHS does not: again, the limiting behaviour does not depend on the initial conditions.
Remark 9.14. When the semigroup has an invariant measure $\pi$, we shall always assume that the generator of such a semigroup is densely defined in $L^2(\mathbb{R}, \pi)$. This assumption is not always trivial to check in practical cases.

For the moment only formally, the generator of $P_t^*$ is just $L^*$, the flat $L^2$-adjoint of the generator of $P_t$, $L$. From (88), a measure $\pi$ is invariant if
$$L^*\pi = 0\,. \tag{90}$$
If $P_t$ is the semigroup associated with some time-homogeneous Markov process, then the process is ergodic if equation (90) admits a unique (normalized) solution. (Notice indeed that $L$ and $L^*$ are differential operators, so without the normalization requirement the solution to (90) could not be unique.)

10 Diffusion Processes

For the material contained in this section we refer the reader to [1, 25, 22, 41].
In this section we want to give a mathematical description of the physical phenomenon called diffusion. Simply put, we want to mathematically
describe the way milk diffuses into coffee.
It seems only right and proper to start this section by reporting the
way in which Maxwell described the process of diffusion in his Encyclopedia
Britannica article:
When two fluids are capable of being mixed, they cannot remain in
equilibrium with each other; if they are placed in contact with each other the
process of mixture begins of itself, and goes on till the state of equilibrium
is attained, which, in the case of fluids which mix in all proportions, is a
state of uniform mixture. This process of mixture is called diffusion. It
may be easily observed by taking a glass jar half full of water and pouring
a strong solution of a coloured salt, such as sulphate of copper, through a
long-stemmed funnel, so as to occupy the lower part of the jar. If the jar
is not disturbed we may trace the process of diffusion for weeks, months, or
years, by the gradual rise of the colour into the upper part of the jar, and the
weakening of the colour in the lower part. This, however, is not a method
capable of giving accurate measurements of the composition of the liquid at
different depths in the vessel. ...

If we observe the process of diffusion with our most powerful microscopes,


we cannot follow the motion of any individual portions of the fluids. We
cannot point out one place in which the lower fluid is ascending, and another
in which the upper fluid is descending. There are no currents visible to
us, and the motion of the material substances goes on as imperceptibly as
the conduction of heat or electricity. Hence the motion which constitutes
diffusion must be distinguished from those motions of fluids which we can
trace by means of floating motes. It may be described as a motion of the
fluids, not in mass but by molecules. ...
From a phenomenological point of view, the word diffusion denotes a transport phenomenon which happens without bulk motion and results in mixing or spreading.^{28} The two typical examples that you should bear in mind: the way milk diffuses into coffee, and the way gas molecules confined to one side of a box divided in two compartments spread to the whole container as soon as the division is lifted. In the first case, if you wait long enough, you will obtain a well-stirred coffee with milk; in the second case you will see that the gas molecules reach an equilibrium state described by the Gibbs distribution.^{29}

10.1 Definition of diffusion process

Although we will mostly deal with time-homogeneous examples, we give the definition below in the general context of not necessarily time-homogeneous Markov processes. We therefore use the notation introduced in Exercise 21.
Definition 10.1. A real-valued, continuous-time Markov process $\{X_t\}_{t\in[0,T]}$ is a diffusion process if its transition probabilities $p(s, x, t, A)$ satisfy the following conditions:

1. For any $\epsilon > 0$,
$$\lim_{h\to 0}\frac{1}{h}\int_{|x-y|>\epsilon} p(t, x, t+h, dy) = 0, \quad\text{for all } t \in [0,T] \text{ and } x \in \mathbb{R};$$

2. There exists a real-valued function $b(t,x)$ such that for all $\epsilon > 0$,
$$\lim_{h\to 0}\frac{1}{h}\int_{|x-y|\le\epsilon}(y-x)\,p(t, x, t+h, dy) = b(t,x), \quad\text{for all } t \in [0,T] \text{ and } x \in \mathbb{R};$$

3. There exists a real-valued function $D(t,x)$ such that for all $\epsilon > 0$,
$$\lim_{h\to 0}\frac{1}{h}\int_{|x-y|\le\epsilon}(y-x)^2\,p(t, x, t+h, dy) = D(t,x), \quad\text{for all } t \in [0,T] \text{ and } x \in \mathbb{R}.$$

The function $b(t,x)$ is often called the drift, while the function $D(t,x)$ is the diffusion coefficient.

^{28} Indeed, the word diffusion comes from the Latin verb diffundere, which means "to spread out".
^{29} At this point one should mention the apparent paradox created by the Poincaré recurrence Theorem, but we will refrain from getting into the details of this long diatribe.
Before explaining the meaning of Definition 10.1, let us remark that the same definition can be given for $\mathbb{R}^n$-valued processes.

Remark 10.2. For ease of notation, Definition 10.1 has been stated for one-dimensional processes. The exact same definition carries over to the case in which $X_t$ takes values in $\mathbb{R}^n$, $n \ge 1$. In this case $b(t,x)$ will be an $\mathbb{R}^n$-valued function and $D(t,x)$ will be an $n\times n$ matrix such that for any $\epsilon > 0$,
$$\lim_{h\to 0}\frac{1}{h}\int_{|x-y|\le\epsilon}(y-x)(y-x)^T\,p(t, x, t+h, dy) = D(t,x),$$
for all $t$ and $x$. In the above, $(y-x)$ is a column vector and $(y-x)^T$ denotes its transpose.
Now let us see what the three conditions of Definition 10.1 actually mean. Condition 1. simply says that diffusion processes do not jump. In conditions 2. and 3. we had to cut the space domain to $|x-y| \le \epsilon$ because, strictly speaking, at the moment we don't know yet whether the transition probabilities of the process have first and second moments or not. Assuming they do, then conditions 2. and 3. say, roughly speaking, that the displacement of the process from time $t$ to time $t+h$ is the sum of two terms: an average drift term, $b(t,x)h$, plus random effects, say $\xi$, plus small terms, $o(h)$; the random effects are such that $\mathbb{E}(\xi)^2 = D(t,x)h + o(h)$.^{30}

Lemma 10.3. Suppose a real-valued Markov process $X_t$ on $[0,T]$ with transition probabilities $p(s,x,t,A)$ satisfies the following three conditions:

1. there exists an $a > 0$ such that
$$\lim_{h\to 0}\frac{1}{h}\int_{\mathbb{R}}|y-x|^{2+a}\,p(t, x, t+h, dy) = 0, \quad\text{for all } t \in [0,T] \text{ and } x \in \mathbb{R};$$

^{30} Here and in the following, the notation $o(h)$ denotes a term, say $f(h)$, which is small in $h$ in the sense that $\lim_{h\to 0} f(h)/h = 0$.

2. there exists a function $b(t,x)$ such that
$$\lim_{h\to 0}\frac{1}{h}\int_{\mathbb{R}}(y-x)\,p(t, x, t+h, dy) = b(t,x), \quad\text{for all } t \in [0,T] \text{ and } x \in \mathbb{R};$$

3. there exists a function $D(t,x)$ such that
$$\lim_{h\to 0}\frac{1}{h}\int_{\mathbb{R}}(y-x)^2\,p(t, x, t+h, dy) = D(t,x), \quad\text{for all } t \in [0,T] \text{ and } x \in \mathbb{R}.$$

Then $X_t$ is a diffusion process.


Proof. We want to show that if the above three conditions hold, then the conditions of Definition 10.1 are satisfied. Let us start by checking that, under our assumptions, condition 1. of Definition 10.1 holds: for any $\epsilon > 0$ we have
$$0 \le \frac{1}{h}\int_{|y-x|>\epsilon} p(t,x,t+h,dy) \le \frac{1}{h}\int_{\mathbb{R}}\frac{|y-x|^{2+a}}{\epsilon^{2+a}}\,p(t,x,t+h,dy) \longrightarrow 0,$$
having used the fact that $|y-x| > \epsilon \Rightarrow (|y-x|/\epsilon)^{2+a} > 1$. To verify that condition 2. of Definition 10.1 holds, observe that for every $\epsilon > 0$,
$$\frac{1}{h}\int_{\mathbb{R}}(y-x)\,p(t,x,t+h,dy) = \frac{1}{h}\int_{|y-x|\le\epsilon}(y-x)\,p(t,x,t+h,dy) + \frac{1}{h}\int_{|y-x|>\epsilon}(y-x)\,p(t,x,t+h,dy)\,.$$
If we can prove that the second addend on the RHS of the above converges to $0$ as $h \to 0$, then we are done by assumption 2. Using again $|y-x| > \epsilon \Rightarrow (|y-x|/\epsilon)^{2+a} > 1$, we act as before and we get
$$\left|\frac{1}{h}\int_{|y-x|>\epsilon}(y-x)\,p(t,x,t+h,dy)\right| \le \frac{1}{h}\int_{\mathbb{R}}\frac{|y-x|^{2+a}}{\epsilon^{1+a}}\,p(t,x,t+h,dy) \longrightarrow 0\,.$$
You can act analogously to show that condition 3. of Definition 10.1 is also satisfied.
What we want to do now is show that, under appropriate conditions on the coefficients involved, solutions of SDEs are diffusion processes. Vice versa, a diffusion process solves, under certain conditions, an appropriate SDE. Unless otherwise stated, everything we say in the following refers to real-valued processes. The multidimensional case will be commented on in separate remarks.

Theorem 10.4. Consider the SDE
$$dX_u = b(u, X_u)\,du + \sigma(u, X_u)\,dW_u, \quad X_t = x\,. \tag{91}$$
Suppose the coefficients $b(u,x)$ and $\sigma(u,x)$ are continuous in both arguments and satisfy all the assumptions of the existence and uniqueness theorem, Theorem 8.9. Then the process $X_u$, solution of (91), is a diffusion process with diffusion coefficient $D(u,x) = \sigma^2(u,x)$.

Before proving the above statement, let us make the following remark.

Remark 10.5. Under the conditions of Theorem 8.9, one can prove that the solution of equation (91) is still a Markov process and the following estimate holds: there exists a constant $C > 0$ such that
$$\mathbb{E}|X_u - x|^{2m} \le C u^m e^{Cu}(|x|^{2m} + 1), \quad\text{for all } m \ge 1\,.$$
Proof of Theorem 10.4. We already know that $X_u$ is a Markov process. All we need to verify are the conditions listed in Definition 10.1. To this end, we shall show that the three conditions of Lemma 10.3 are satisfied.

1. The estimate of Remark 10.5, with $m = 2$, gives
$$\mathbb{E}|X(t+h) - x|^4 = \int_{\mathbb{R}}|x-y|^4\,p(t,x,t+h,dy) \le K h^2(1 + |x|^4)\,.$$
Dividing by $h$ and letting $h$ go to zero, condition 1. (with $a = 2$) is verified.

2. We want to prove that
$$\frac{\mathbb{E}[X_{t+h} - x]}{h} = \frac{1}{h}\int_{\mathbb{R}}(y-x)\,p(t,x,t+h,dy) \longrightarrow b(t,x)\,.$$
Using (91), we can write
$$X_{t+h} = x + \int_t^{t+h} b(s, X_s)\,ds + \int_t^{t+h}\sigma(s, X_s)\,dW_s\,.$$
Therefore
$$\frac{\mathbb{E}[X_{t+h} - x]}{h} = \frac{1}{h}\,\mathbb{E}\int_t^{t+h} b(s, X_s)\,ds\,.$$
The change of variables $s = t + uh$ then gives
$$\frac{\mathbb{E}[X_{t+h} - x]}{h} = \mathbb{E}\int_0^1 b(t + uh, X_{t+uh})\,du \longrightarrow b(t,x)\,.$$
In the above, we could exchange the limit and the integral because by assumption
$$|b(t+uh, X_{t+uh})|^2 \le C(1 + |X_{t+uh}|^2)$$
and we know by the existence and uniqueness theorem that
$$\mathbb{E}\int_0^1 |b(t+uh, X_{t+uh})|^2 \le C\int_0^1(1 + \mathbb{E}|X_{t+uh}|^2) < \infty\,.$$
3. We are left with showing that
$$\frac{\mathbb{E}(X_{t+h} - x)^2}{h} = \frac{1}{h}\int_{\mathbb{R}}(y-x)^2\,p(t,x,t+h,dy) \longrightarrow D(t,x)\,.$$
To this end, let us write
$$\mathbb{E}(X_{t+h} - x)^2 = \mathbb{E}(X_{t+h})^2 - x^2 - 2x\,[\mathbb{E}X_{t+h} - x]\,. \tag{92}$$
Using Itô's formula to calculate $d(X_u)^2$, we get
$$d(X_u)^2 = 2X_u\,dX_u + \sigma^2(u, X_u)\,du\,.$$
Integrating between $t$ and $t+h$ and taking expectations then gives
$$\mathbb{E}(X_{t+h})^2 - x^2 = \mathbb{E}\int_t^{t+h} 2X_s b(s, X_s)\,ds + \mathbb{E}\int_t^{t+h}\sigma^2(s, X_s)\,ds\,.$$
Combining with (92):
$$\mathbb{E}(X_{t+h} - x)^2 = \mathbb{E}\int_t^{t+h}\big[2X_s b(s, X_s) + \sigma^2(s, X_s)\big]\,ds - 2x\,(b(t,x)h + o(h))\,.$$
After dividing both sides by $h$, we can, as before, make the change of variables $s = t + uh$, exchange limit and integral (which is justified with an argument similar to the one explained above) and conclude by sending $h$ to zero.

Now the reverse.

Theorem 10.6. Let $X_t$ be a diffusion process on $[0,T]$. Suppose the coefficients of the diffusion, $b(t,x)$ and $D(t,x)$, satisfy the following conditions:

- $b(t,x)$ is continuous in both arguments and grows linearly, i.e. there exists a constant $C > 0$ such that $|b(t,x)| \le C(1 + |x|)$;

- $D(t,x)$ is continuous in both arguments and has bounded and continuous first derivatives (i.e. the derivatives $\partial_t D$ and $\partial_x D$ exist and are bounded and continuous);

- $(D(t,x))^{-1}$ is bounded;

- there exists a function $\Phi(x)$, independent of $t$ and $h$, such that

  a) $\Phi(x) > 1 + |x|$ and $\sup_{t\in[0,T]}\mathbb{E}(\Phi(X_t)) < \infty$;

  b) $\frac{1}{h}\int(y-x)\,p(t,x,t+h,dy)$ and $\frac{1}{h}\int(y-x)^2\,p(t,x,t+h,dy)$ are bounded above by $\Phi(x)$;

  c) $\frac{1}{h}\int(|y| + y^2)\,p(t,x,t+h,dy)$ is bounded above by $\Phi(x)$.

Then $X_t$ solves an SDE of the form (91), with diffusion coefficient $\sigma(t,x)$ such that $\sigma^2(t,x) = D(t,x)$.

I will not prove the above theorem. However, let me give you some heuristics to understand intuitively why, loosely speaking, a diffusion process satisfies an SDE. To this end, let $X_t$ be a diffusion process. Then conditions 2. and 3. of Definition 10.1 imply
$$\mathbb{E}(X_t - X_s \mid X_s = x) = b(s,x)(t-s) + o(t-s)$$
$$\mathbb{E}((X_t - X_s)^2 \mid X_s = x) = D(s,x)(t-s) + o(t-s)\,.$$
Therefore, if we take a function $\sigma(t,x)$ such that $\sigma^2(t,x) = D(t,x)$, recalling that $\mathbb{E}(W_t - W_s)^2 = (t-s)$, we can write
$$X_t - X_s = b(s,x)(t-s) + \sigma(s,x)(W_t - W_s) + o(t-s)\,,$$
which, forgetting about the $o(t-s)$ terms, can be rewritten in differential form to be precisely equation (91).

10.2 Backward Kolmogorov and Fokker-Planck equations

For the moment we will keep working in one dimension. At the end we shall comment on the multi-dimensional extension of the following results.

Lemma 10.7. Let $f(x)$ be twice differentiable and suppose there exist $m, C > 0$ such that
$$|f(x)| + \left|\frac{d}{dx}f(x)\right| + \left|\frac{d^2}{dx^2}f(x)\right| \le C(1 + |x|^m), \quad \forall x \in \mathbb{R}\,.$$

Assume also that $b(t,x)$ and $D(t,x)$ satisfy the assumptions of Theorem 10.4. Then
$$\lim_{h\to 0}\frac{\mathbb{E}f(X_{t+h}) - f(x)}{h} = b(t,x)\,\frac{d}{dx}f(x) + \frac{1}{2}\sigma^2(t,x)\,\frac{d^2}{dx^2}f(x)\,,$$
where $X_u$ is the solution of
$$dX_u = b(u, X_u)\,du + \sigma(u, X_u)\,dW_u, \quad X_t = x\,. \tag{93}$$

Proof. The proof of this lemma is left as an exercise, see Exercise 37.
Theorem 10.8. Let $X_u$ be the solution of the SDE
$$dX_u = b(u, X_u)\,du + \sigma(u, X_u)\,dW_u, \quad X_t = x\,, \tag{94}$$
where we assume that the coefficients of the SDE are continuous in both arguments with continuous partial derivatives $\partial_x b(u,x)$, $\partial_x\sigma(u,x)$, $\partial_{xx}b(u,x)$, $\partial_{xx}\sigma(u,x)$. Suppose $f(x)$ is a twice continuously differentiable function and that there exist $K, m > 0$ such that
$$|b(u,x)| + |\sigma(u,x)| \le K(1 + |x|)\,,$$
$$|\partial_x b(u,x)| + |\partial_{xx}b(u,x)| + |\partial_x\sigma(u,x)| + |\partial_{xx}\sigma(u,x)| \le K(1 + |x|^m)\,,$$
$$|f(x)| + \left|\frac{d}{dx}f(x)\right| + \left|\frac{d^2}{dx^2}f(x)\right| \le K(1 + |x|^m)\,.$$
Then the function $h(t,x) = \mathbb{E}[f(X_u) \mid X_t = x]$ satisfies the equation
$$\frac{\partial h(t,x)}{\partial t} + b(t,x)\,\frac{\partial h(t,x)}{\partial x} + \frac{1}{2}\sigma^2(t,x)\,\frac{\partial^2 h(t,x)}{\partial x^2} = 0, \quad t \in (0, u) \tag{95}$$
$$\lim_{t\to u} h(t,x) = f(x)\,. \tag{96}$$

Remark 10.9. Equation (95) is called the Backward Kolmogorov equation (BK). The name of the equation comes from the fact that it is an equation in the backward variables $t$ and $x$. Moreover, notice that $h(t,x)$ is the solution of a final value problem.

Comment. Let $\mathcal{L}_t$^{31} be the differential operator
$$\mathcal{L}_t := b(t,x)\,\frac{\partial}{\partial x} + \frac{1}{2}\sigma^2(t,x)\,\frac{\partial^2}{\partial x^2}\,.$$

^{31} The notation $\mathcal{L}_t$ rather than just $\mathcal{L}$ is to emphasize that the coefficients of this operator depend on time.

$\mathcal{L}_t$ is called the backward operator, or also the generator of the (non-time-homogeneous) diffusion (94). Theorem 10.8 says that the evolution equation
$$\frac{\partial h(t,x)}{\partial t} = -\mathcal{L}_t h(t,x), \qquad \lim_{t\to u} h(t,x) = f(x)$$
is a backward equation for the observables.


Proof of Theorem 10.8 (sketch). I will not prove the differentiability properties of the function $h(t,x)$ but rather show formula (95). More details about this proof can be found in [22, Section 11]. For the purposes of this proof we shall denote by $X^{x,t}(u)$ the solution at time $u$ of equation (94) started at $x$ at time $t$. With this notation we can simply write $h(t,x) = \mathbb{E}[f(X^{x,t}(u))]$. What we want to show is
$$\partial_t h(t,x) = \lim_{\delta\to 0}\frac{h(t,x) - h(t-\delta,x)}{\delta} = -b(t,x)\,\partial_x h(t,x) - \frac{1}{2}\sigma^2\,\partial_{xx}h(t,x)\,.$$
Let us start by observing that
$$X^{x,t-\delta}(u) = X^{X^{x,t-\delta}(t),\,t}(u)\,.$$
Therefore
$$h(t-\delta, x) = \mathbb{E}\left[\mathbb{E}\left(f(X^{x,t-\delta}(u)) \,\middle|\, X^{x,t-\delta}(t)\right)\right] = \mathbb{E}\,h(t, X^{x,t-\delta}(t))\,.$$
Using the above and Lemma 10.7 we get the result:
$$\lim_{\delta\to 0}\frac{h(t,x) - h(t-\delta,x)}{\delta} = -\lim_{\delta\to 0}\frac{\mathbb{E}h(t, X^{x,t-\delta}(t)) - h(t,x)}{\delta} = -b(t,x)\,\partial_x h(t,x) - \frac{1}{2}\sigma^2\,\partial_{xx}h(t,x)\,.$$

Theorem 10.10. Let $X_t$, $t \in [0,T]$, be a diffusion process with coefficients $b(t,x)$ and $D(t,x)$ such that the limits in Definition 10.1 hold uniformly for $t \in [0,T]$ and $x \in \mathbb{R}$. Suppose the transition probabilities of $X_t$ have a density^{32} $p(s,x,t,y)$ for every $t > s$. If the partial derivatives
$$\frac{\partial p}{\partial t}, \quad \frac{\partial}{\partial y}\big(b(t,y)p\big), \quad \frac{\partial^2}{\partial y^2}\big(D(t,y)p\big)$$
exist and are continuous, then the transition density $p(s,x,t,y)$ solves the equation
$$\frac{\partial p}{\partial t} + \frac{\partial}{\partial y}\big(b(t,y)p\big) - \frac{1}{2}\frac{\partial^2}{\partial y^2}\big(D(t,y)p\big) = 0, \tag{97}$$
$$\lim_{t \downarrow s} p(s,x,t,y) = \delta(x - y). \tag{98}$$
Equation (97) is a differential equation in the forward variables $t$ and $y$, and for this reason it is known as the Fokker-Planck or forward Kolmogorov equation.
10.2.1 Time-homogeneous case

Let us now specialize to the time-homogeneous case. In this case the coefficients of the SDE (94) do not depend on time. The process we are concerned with is then the solution of the SDE
$$dX_u = b(X_u)\,du + \sigma(X_u)\,dW_u, \quad X_t = x, \tag{99}$$
where $b(x)$ and $\sigma(x)$ satisfy all the assumptions stated so far. In other words, the function $h(t,x) = \mathbb{E}[f(X_u)\,|\,X_t = x]$ becomes
$$h(t,x) = \mathbb{E}[f(X_u)\,|\,X_t = x] = \mathbb{E}[f(X_{u-t})\,|\,X_0 = x].$$
Setting $\tau = u - t$ and noticing that $\partial_t = -\partial_\tau$, equation (95) becomes now
$$\frac{\partial h(\tau,x)}{\partial \tau} = Lh(\tau,x), \quad \lim_{\tau \downarrow 0} h(\tau,x) = f(x), \tag{100}$$
where the differential operator $L$ is now the generator of the diffusion in the sense of Definition (84),
$$L = b(x)\frac{\partial}{\partial x} + \frac{1}{2}\sigma^2(x)\frac{\partial^2}{\partial x^2}.\,^{33}$$

^{32} See Remark 10.16 on this point.
^{33} You could find this expression also by acting analogously to what we have done in Example 9.8.

Because of (100), it is customary to write
$$h(\tau,x) = (P_\tau f)(x) = \big(e^{\tau L}f\big)(x).$$
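Since $(P_\tau f)(x) = \mathbb{E}[f(X_\tau)\,|\,X_0 = x]$, the semigroup can be approximated by averaging $f$ over simulated paths. A minimal Monte Carlo sketch in Python (the drift, diffusion and observable below are illustrative choices, not from the text):

```python
import numpy as np

def semigroup_mc(b, sigma, f, x0, tau, n_steps=1000, n_paths=20000, seed=0):
    """Estimate (P_tau f)(x0) = E[f(X_tau) | X_0 = x0] via Euler-Maruyama."""
    rng = np.random.default_rng(seed)
    dt = tau / n_steps
    x = np.full(n_paths, float(x0))
    for _ in range(n_steps):
        x += b(x) * dt + sigma(x) * rng.normal(0.0, np.sqrt(dt), size=n_paths)
    return f(x).mean()

# illustrative: O-U with b(x) = -x, sigma = sqrt(2), f(x) = x^2;
# exact value is x0^2 exp(-2 tau) + (1 - exp(-2 tau)) = 1.0 here
est = semigroup_mc(lambda x: -x, lambda x: np.sqrt(2.0) * np.ones_like(x),
                   lambda x: x**2, 1.0, 2.0)
print(est)
```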
In this time-homogeneous setting, the Fokker-Planck equation becomes an equation for the transition probability density $p(t,x,y)$ (rather than $p(s,x,u,y)$) and it is precisely
$$\frac{\partial}{\partial t}p(t,x,y) = L^*p(t,x,y), \quad \lim_{t \downarrow 0} p(t,x,y) = \delta(x-y),$$
for every fixed $x \in \mathbb{R}$, where $L^*$ is precisely the flat $L^2$-adjoint of the operator $L$:
$$L^* = -\frac{\partial}{\partial y}\big(b(y)\,\cdot\,\big) + \frac{1}{2}\frac{\partial^2}{\partial y^2}\big(\sigma^2(y)\,\cdot\,\big).$$
Example 10.11. Consider the O-U process
$$dX_t = -\alpha X_t\,dt + \sigma\,dW_t, \quad \alpha, \sigma > 0.$$
This is a time-homogeneous process with generator
$$L = -\alpha x\frac{\partial}{\partial x} + \frac{\sigma^2}{2}\frac{\partial^2}{\partial x^2}$$
and Fokker-Planck operator
$$L^* = \frac{\partial}{\partial x}\big(\alpha x\,\cdot\,\big) + \frac{\sigma^2}{2}\frac{\partial^2}{\partial x^2}.$$
The equation $L^*\rho = 0$ has a unique normalized solution
$$\rho = \sqrt{\frac{1}{2\pi D}}\,e^{-x^2/2D}, \quad D := \frac{\sigma^2}{2\alpha},$$
hence the O-U process is ergodic.
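As a quick sanity check of ergodicity, one can simulate the O-U process from an arbitrary initial distribution and compare the long-time empirical moments with those of the Gaussian density $\rho$ above. A possible sketch (all parameter values are illustrative):

```python
import numpy as np

alpha, sigma = 1.0, 1.5
D = sigma**2 / (2 * alpha)        # variance of the invariant Gaussian rho
rng = np.random.default_rng(1)

dt, n_steps, n_paths = 1e-2, 2000, 10000
x = rng.normal(0.0, 3.0, size=n_paths)   # arbitrary initial distribution
for _ in range(n_steps):
    x += -alpha * x * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)

print(x.mean(), x.var(), D)       # mean -> 0 and variance -> D
```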


Example 10.12. Both the generator and the Fokker-Planck operator of Brownian motion are simply $\partial^2_{xx}/2$.
Comment. The Fokker-Planck equation is a continuity equation in the sense that it expresses the conservation of probability mass. Indeed, using the Fokker-Planck (FP) equation, it is straightforward to show
$$\frac{d}{dt}\int_{\mathbb{R}} p(t,x,y)\,dy = 0 \quad \text{for every fixed } x \in \mathbb{R}.$$
Therefore
$$\int_{\mathbb{R}} p(t,x,y)\,dy = \int_{\mathbb{R}} p(0,x,y)\,dy = \int_{\mathbb{R}} \delta(x-y)\,dy = 1,$$
for all $t > 0$.

Remark 10.13. What happens in higher dimensions? Recalling Remark 10.2, this time we have a process $X_t \in \mathbb{R}^n$ solving the multidimensional SDE
$$dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t,$$
where $b(x)$ is a vector in $\mathbb{R}^n$ and $\sigma(x)$ is a matrix. $X_t$ is a time-homogeneous diffusion with drift vector $b(x)$ and diffusion coefficient $D(x) = \sigma(x)\sigma^T(x)$. The generator of such a process is the operator
$$L = \sum_{j=1}^n b^j(x)\frac{\partial}{\partial x_j} + \frac{1}{2}\sum_{i,j=1}^n D^{ij}(x)\frac{\partial^2}{\partial x_i \partial x_j}.$$
Notice that the diffusion matrix is always symmetric and nonnegative definite.
Example 10.14. Consider the system
$$dY_t = -3Y_t\,dt - mZ_t\,dt + 2\sqrt{a}\,dW_t^1, \qquad a, m > 0,$$
$$dZ_t = -6Z_t\,dt + 2\,dW_t^2.$$
The generator of the process is
$$L = (-3y - mz)\frac{\partial}{\partial y} + 2a\frac{\partial^2}{\partial y^2} - 6z\frac{\partial}{\partial z} + 2\frac{\partial^2}{\partial z^2}.$$

Definition 10.15. A second order differential operator $L$ on $\mathbb{R}^n$ of the form
$$L = \sum_{i=1}^n a^i(x)\frac{\partial}{\partial x_i} + \sum_{i,j=1}^n M^{ij}(x)\frac{\partial^2}{\partial x_i \partial x_j},$$
where $a(x) = (a^i(x))_{1\leq i\leq n}$ is an $\mathbb{R}^n$-valued function on $\mathbb{R}^n$ and $M(x) = (M^{ij}(x))_{1\leq i,j\leq n}$ is a matrix valued function on $\mathbb{R}^n$, is said uniformly elliptic if there exists a positive constant $\lambda > 0$ such that
$$\sum_{i,j=1}^n M^{ij}(x)v^iv^j \geq \lambda\|v\|^2,$$
for all vectors $v \in \mathbb{R}^n$.


Remark 10.16. If the operator $L$ is uniformly elliptic then the process $X_t$ with generator $L$ has a density. This is a consequence of the good smoothing properties of elliptic operators. Indeed if $L$ is an elliptic operator then the Fokker-Planck equation $\partial_t p = L^*p$ is a parabolic equation (just the heat equation in the case of Brownian Motion, for example) and it is a standard result from the theory of PDEs that the solution to this kind of equation is smooth for every positive time, even if the initial datum is not. Also, the maximum principle for parabolic equations guarantees that if we start with a positive initial datum then the solution of the Fokker-Planck equation remains positive for every subsequent time, see [21].

10.3 Reversible diffusions and spectral gap inequality

Also in this section we refer to time-homogeneous processes. Let us start with an example.
Example 10.17. Consider the one dimensional process
$$dX_t = -V'(X_t)\,dt + \sqrt{2}\,dW_t, \tag{101}$$
where $V(x)$ is a smooth confining potential. The generator of the process is
$$L = -V'(x)\frac{\partial}{\partial x} + \frac{\partial^2}{\partial x^2}$$
while the FP operator is
$$L^* = \frac{\partial}{\partial x}\big(V'(x)\,\cdot\,\big) + \frac{\partial^2}{\partial x^2},$$
and the equation $L^*\rho = 0$ has a unique normalized solution $\rho = \rho(x)$:
$$L^*\rho = 0 \iff \frac{\partial}{\partial x}\left[V'(x)\rho(x) + \frac{\partial}{\partial x}\rho(x)\right] = 0 \iff V'(x)\rho(x) + \frac{\partial}{\partial x}\rho(x) = \text{const}.$$
Solving the above ODE (with const $= 0$, the only choice compatible with $\rho$ being integrable) gives $\rho(x) = e^{-V(x)}/Z$, where $Z$ is a normalization constant. Observe that when the potential is quadratic, this process is precisely the O-U process.
The semigroup generated by $L$ can be extended to a strongly continuous semigroup on the weighted $L^2_\rho$ (see Lemma 9.13). Therefore, by the Hille-Yosida Theorem, $L$ is a closed operator. Moreover, $L$ is a self-adjoint operator in $L^2_\rho$. Let us start with proving that $L$ is symmetric. To this end, let us first recall that the scalar product in $L^2_\rho$ is defined as
$$\langle f, g\rangle_\rho := \int_{\mathbb{R}} f(x)g(x)\,\rho(x)\,dx.$$
Let us now show that $\langle Lf, g\rangle_\rho = \langle f, Lg\rangle_\rho$, for every $f, g$ smooth and in $L^2_\rho$:
$$\begin{aligned}
\langle Lf, g\rangle_\rho &= \int_{\mathbb{R}}(Lf)\,g\,\rho\,dx \\
&= -\int_{\mathbb{R}} V'(x)\Big(\frac{d}{dx}f\Big)\,g\,\rho\,dx + \int_{\mathbb{R}}\Big(\frac{d^2}{dx^2}f\Big)\,g\,\rho\,dx \\
&= -\int_{\mathbb{R}}\frac{df}{dx}\frac{dg}{dx}\,\rho\,dx \\
&= -\int_{\mathbb{R}} f\,\frac{dg}{dx}\,V'\,\rho\,dx + \int_{\mathbb{R}} f\,\frac{d^2g}{dx^2}\,\rho\,dx \\
&= \int_{\mathbb{R}} f\,(Lg)\,\rho\,dx = \langle f, Lg\rangle_\rho.
\end{aligned}$$
As a byproduct of the above calculation, we also have
$$\langle Lf, g\rangle_\rho = -\int_{\mathbb{R}}\frac{df}{dx}\frac{dg}{dx}\,\rho\,dx, \tag{102}$$
which will be useful in the following - notice also that (102) makes the symmetry property obvious. Because $L$ is symmetric and closed^{34}, it is also self-adjoint in $L^2_\rho$.

^{34} and defined on a dense subset of $L^2_\rho$, namely the set of Schwartz functions, see [64].
Let us now come to explain why diffusions generated by a self-adjoint operator are so important.

Definition 10.18. A probability measure $\mu$ is reversible for a Markov semigroup $P_t$ if for any $f, g \in B_m$
$$\int (P_tf)\,g\,d\mu(x) = \int f\,(P_tg)\,d\mu(x). \tag{103}$$
In this case it is also customary to say that $\mu$ satisfies the detailed balance condition with respect to $P_t$.
Analogously to the time-discrete case, if $P_t$ is the Markov semigroup associated to some Markov process $X_t$ and (103) is satisfied, then the process $X_t$ is time-reversible (for a proof of this fact in the time-continuous setting see [70], which is an excellent reference). A given Markov semigroup might have more than one reversible measure. We denote by $R(P_t)$ the set of reversible measures for a given Markov semigroup $P_t$. Taking $g \equiv 1$, it is obvious that if $\mu$ is reversible for $P_t$ then it is also an invariant measure for the semigroup. So if $P_t$ admits a reversible measure $\mu$, then the semigroup can be extended to a strongly continuous semigroup on $L^2_\mu$. With the same reasoning as in Example 10.17, the generator is also closed. Moreover it can be shown that (103) is equivalent to
$$\langle Lf, g\rangle_\mu = \langle f, Lg\rangle_\mu, \quad f, g \text{ smooth enough and in } L^2_\mu. \tag{104}$$
Therefore $L$ is self-adjoint. This whole reasoning proves the following.

Proposition 10.19. Let $P_t$ be a Markov semigroup associated with a Markov process $X_t$ and $\mu$ a reversible measure for $P_t$. Then $X_t$ is time-reversible and the generator of $P_t$ is self-adjoint. Conversely, if $L$ is the generator of a strongly continuous Markov semigroup on $L^2_\mu$ (for some measure $\mu$) and $L$ satisfies (104), then $\mu$ is reversible for the semigroup and the associated process is reversible.
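For a concrete instance of (103), consider the O-U process of Example 10.11, whose transition density is Gaussian: detailed balance is then the symmetry $\rho(x)p_t(x,y) = \rho(y)p_t(y,x)$, which can be checked symbolically. A small sketch (assuming the standard Gaussian form of the O-U transition density, with $D = \sigma^2/2\alpha$):

```python
import sympy as sp

x, y, t = sp.symbols('x y t', real=True)
alpha, D = sp.symbols('alpha D', positive=True)

# O-U transition: mean x*exp(-alpha*t), variance D*(1 - exp(-2*alpha*t))
m = x * sp.exp(-alpha * t)
v = D * (1 - sp.exp(-2 * alpha * t))
p = sp.exp(-(y - m)**2 / (2 * v)) / sp.sqrt(2 * sp.pi * v)     # p_t(x, y)
rho = sp.exp(-x**2 / (2 * D)) / sp.sqrt(2 * sp.pi * D)          # invariant density

lhs = rho * p                                    # rho(x) p_t(x, y)
rhs = lhs.subs({x: y, y: x}, simultaneous=True)  # rho(y) p_t(y, x)
print(sp.simplify(lhs - rhs))                    # 0: detailed balance holds
```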
This is the reason why diffusion processes with self-adjoint generator are also called reversible diffusions. The study of exponentially fast convergence to equilibrium for reversible diffusions has been extensively tackled in the literature, see [4, 5] and references therein. In what follows we will often use the notation
$$f_t(x) := (P_tf)(x)$$
and work in one dimension, but everything we say can be rephrased in higher dimensions.
Definition 10.20. Given a Markov semigroup $P_t$ with generator $L$, we say that a measure $\mu \in R(P_t)$ satisfies a spectral gap inequality if there exists a constant $\lambda > 0$ such that
$$\lambda \int \left(f - \int f\,d\mu\right)^2 d\mu \leq \langle -Lf, f\rangle_\mu, \quad \text{for every } f \in L^2_\mu \cap D(L). \tag{105}$$
The largest positive number $\lambda$ such that (105) is satisfied is called the spectral gap of the self-adjoint operator $L$. The term on the RHS of (105) is called the Dirichlet form of the operator $-L$.
Remark 10.21. If $L$ is a self-adjoint operator then the form $\langle Lf, f\rangle_\mu$ is real valued (for every $f \in L^2_\mu \cap D(L)$). In particular the spectrum of $L$ is real. If $L$ is the generator of a strongly continuous Markov semigroup and the semigroup is ergodic, then we already know that 0 is a simple eigenvalue of $L$ (see comment after Lemma 9.13). In the case of Example 10.17, we also know from (102) that $\langle Lf, f\rangle_\rho \leq 0$ for every $f$, therefore the self-adjoint operator $L$ is negative and all the eigenvalues of $L$ will be nonpositive. Clearly $-L$ is positive, and the largest $\lambda$ such that (105) holds is the smallest nonzero eigenvalue of $-L$ (i.e. $-\lambda$ is the biggest nonzero eigenvalue of $L$). This is the reason why $\lambda$ is called the spectral gap. The next proposition clarifies why spectral gap inequalities are so important.
Proposition 10.22. A measure $\mu \in R(P_t)$ satisfies a spectral gap inequality (with constant $\lambda$) if and only if
$$\int\left(P_tf - \int f\,d\mu\right)^2 d\mu \leq e^{-2\lambda t}\int\left(f - \int f\,d\mu\right)^2 d\mu, \tag{106}$$
for all $t \geq 0$ and $f \in L^2_\mu$.
Proof. Observe first that if $\int f\,d\mu = 0$ then $\int f_t\,d\mu = 0$ as well, as $\mu$ is invariant. Also, $L$ is a differential operator, so constant functions are in the kernel of $L$. Therefore we can work with mean zero functions. We want to prove that (105) holds if and only if (106) does. We prove here the implication (105) $\Rightarrow$ (106); the other implication is left as an exercise, see Exercise 38. If we are working with mean zero functions, showing that (105) $\Rightarrow$ (106) amounts to proving that
$$\lambda\int f^2\,d\mu \leq \langle -Lf, f\rangle_\mu \tag{107}$$
implies
$$\int (P_tf)^2\,d\mu \leq e^{-2\lambda t}\int f^2\,d\mu.$$
In order to do so, apply (107) to $f_t$ and get
$$\lambda\int_{\mathbb{R}} f_t^2\,d\mu \leq -\int_{\mathbb{R}} f_t(Lf_t)\,d\mu = -\int_{\mathbb{R}} f_t\left(\frac{d}{dt}f_t\right)d\mu = -\frac{1}{2}\int_{\mathbb{R}}\frac{d}{dt}f_t^2\,d\mu.$$
Therefore
$$\frac{d}{dt}\int f_t^2\,d\mu \leq -2\lambda\int f_t^2\,d\mu.$$
Integrating the above inequality we get
$$\int f_t^2\,d\mu \leq e^{-2\lambda t}\int f^2\,d\mu.$$

Example 10.23 (i.e. Example 10.17 continued). Going back to the process $X_t$ solution of (101), we now have the tools to check whether $X_t$ converges exponentially fast to equilibrium. Working again with mean zero functions (notice that in this case the kernel of the generator is made only of constants), the spectral gap inequality reduces to
$$\lambda\int_{\mathbb{R}} f^2\,\rho\,dx \leq \int_{\mathbb{R}}\left|\frac{df}{dx}\right|^2\rho\,dx. \tag{108}$$
Those of you who have taken a basic course in PDEs will have noticed that this is a Poincaré inequality for the measure $\rho$. In Appendix B.4, I have recalled the basic Poincaré inequality that most of you will have already encountered. If the potential $V(x)$ is quadratic then the measure $\rho$ does satisfy (108) and in this case we therefore have exponentially fast convergence to equilibrium. For general potentials a classic result is the following.
Lemma 10.24. Let $V(x) : \mathbb{R}^n \to \mathbb{R}$ be a twice differentiable confining potential such that $\rho = e^{-V(x)}$ is a probability density. Denote by $H^1(\rho)$ the weighted $H^1$, with norm
$$\|f\|^2_{H^1_\rho} := \|f\|^2_{L^2_\rho} + \|\nabla f\|^2_{L^2_\rho}.$$
If $V(x)$ is such that
$$\frac{|\nabla V|^2}{2} - \Delta V(x) \longrightarrow \infty \quad \text{as } |x| \to \infty,$$
then the measure $\rho$ satisfies a Poincaré inequality: for all functions $h \in H^1(e^{-V(x)})$ and for some $K > 0$:
$$\int |\nabla h|^2\,e^{-V(x)}\,dx \geq K\left[\int h^2\,e^{-V(x)}\,dx - \left(\int h\,e^{-V(x)}\,dx\right)^2\right].$$
Before concluding this section we would like to make a remark which is useful in computational practice. Given a probability measure $\rho$, there are many processes admitting $\rho$ as invariant measure. We will illustrate this fact using the process (101). From the point of view of MCMC, this observation is particularly important, as one can then try and find the process that converges fastest to the measure that we want to sample from.

Example 10.25. Let $v(x)$ be a smooth function and consider the process
$$dX_t^v = \big(v(X_t^v) - V'(X_t^v)\big)\,dt + \sqrt{2}\,dW_t, \tag{109}$$
where the potential $V(x)$ satisfies the same assumptions as in (101). It is clear that we have perturbed the dynamics (101) through the use of the function $v(x)$^{35}. Recall that the unperturbed process has a unique invariant measure with density $\rho = e^{-V(x)}/Z$. Now the question is: is it possible to find a function $v(x)$ such that $X_t^v$, solution of (109), admits $\rho$ as invariant measure? In other words, is it possible to perturb the drift of the process (101) without altering the invariant measure (and therefore the long time behaviour of the dynamics)? The answer turns out to be simple. The generator of (109) is
$$L_v = \big(v(x) - V'(x)\big)\frac{\partial}{\partial x} + \frac{\partial^2}{\partial x^2}.$$
We know that $\rho$ is the invariant measure of (101), therefore
$$\frac{\partial}{\partial x}\big(V'\rho\big) + \frac{\partial^2}{\partial x^2}\rho = 0. \tag{110}$$
If we want $\rho$ to be also the invariant measure of (109) then we need to impose $L_v^*\rho = 0$. This results in
$$L_v^*\rho = 0 \iff -\frac{\partial}{\partial x}\big((v - V')\rho\big) + \frac{\partial^2}{\partial x^2}\rho = 0 \overset{(110)}{\iff} \frac{\partial}{\partial x}(v\rho) = 0.$$
In higher dimension, i.e. if we consider the process $X_t \in \mathbb{R}^d$ satisfying
$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dW_t,$$
where $W_t$ is now $d$-dimensional standard Brownian motion, you can show again that the measure with density $\rho(x) = e^{-V(x)}/Z$ is the only invariant measure for $X_t$. In this case the invariant measure of the process
$$dX_t^v = \big(v(X_t^v) - \nabla V(X_t^v)\big)\,dt + \sqrt{2}\,dW_t$$
is still $\rho$ if and only if
$$\nabla\cdot(v\rho) = 0,$$
where $\nabla\cdot$ denotes divergence (see Exercise 39).

^{35} Notice that the generator of the perturbed dynamics is no longer self-adjoint in $L^2_\rho$, where $\rho = e^{-V(x)}/Z$.
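A standard way to satisfy $\nabla\cdot(v\rho) = 0$ (this particular choice is our illustration, not stated in the text) is $v = J\nabla V$ with $J$ skew-symmetric: then $\nabla\cdot(v\rho) = \rho\,\nabla\cdot(J\nabla V) - \rho\,(J\nabla V)\cdot\nabla V = 0$. A short Python sketch comparing the reversible and perturbed dynamics on a 2d quadratic potential (all parameters illustrative):

```python
import numpy as np

J = np.array([[0.0, 1.0], [-1.0, 0.0]])      # skew-symmetric
A = np.array([[1.0, 0.0], [0.0, 3.0]])       # V(x) = x.A x / 2, grad V = A x

def step(x, dt, rng, gamma=0.0):
    # drift -grad V + gamma * J grad V leaves rho = exp(-V)/Z invariant
    grad = x @ A.T
    drift = -grad + gamma * (grad @ J.T)
    return x + drift * dt + np.sqrt(2 * dt) * rng.normal(size=x.shape)

rng = np.random.default_rng(3)
dt, n_steps = 1e-2, 4000
x_rev = rng.normal(size=(20000, 2))
x_irr = x_rev.copy()
for _ in range(n_steps):
    x_rev = step(x_rev, dt, rng, gamma=0.0)
    x_irr = step(x_irr, dt, rng, gamma=1.0)

# both empirical covariances should approach inv(A) = diag(1, 1/3)
print(np.cov(x_rev.T), np.cov(x_irr.T), sep="\n")
```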


10.4 The Langevin equation as an example of Hypocoercive diffusion

The take-home message from Section 10.3 is: diffusions generated by operators which are elliptic and self-adjoint are usually the easiest to analyze as, roughly speaking, ellipticity gives the existence of a $C^\infty$ density for the invariant measure, while the self-adjointness of the operator gives nice spectral properties which are key to proving exponential convergence to equilibrium. In this section we want to see what happens when the generator is non elliptic and non self-adjoint. We shall mainly focus on how to prove exponential convergence to equilibrium for non self-adjoint diffusions. Nihil recte sine exemplo docetur, so let us start with the main motivating example, the Langevin equation:
$$dq = p\,dt$$
$$dp = -\partial_qV(q)\,dt - p\,dt + \sqrt{2}\,dW_t, \tag{111}$$
where $V(q)$ is a confining potential and $W_t$ is one dimensional standard Brownian motion. If we assume that the potential is locally Lipschitz then the following general result, which can be found for example in [72, Chapter 10], guarantees the existence of a strong solution for the system (111).
Theorem 10.26. Let $b(t,x) : \mathbb{R}^+\times\mathbb{R}^n \to \mathbb{R}^n$ be locally Lipschitz, uniformly for $t \in [0,T]$ for each $T > 0$, and suppose
1. $\sup_{0\leq t\leq T}|b(t,0)| < \infty$, for all $T > 0$;
2. there exists $k > 0$ such that $\langle x, b(t,x)\rangle \leq 0$ if $x$ is outside of a ball of radius $k$.
Then the SDE
$$dX_t = b(t,X_t)\,dt + \sigma\,dW_t$$
admits a unique strong and nonexploding solution for any random initial datum $X_0$ and any constant $\sigma > 0$.
The generator of (111) is
$$L = p\,\partial_q - \partial_qV(q)\,\partial_p - p\,\partial_p + \partial_p^2 \tag{112}$$
and the corresponding Fokker-Planck operator is
$$L^* = -p\,\partial_q + \partial_qV(q)\,\partial_p + \partial_p(p\,\cdot\,) + \partial_p^2. \tag{113}$$

Showing that this process has a unique invariant measure and that such a measure has a density is not straightforward. However the process is indeed ergodic with invariant density
$$\rho(q,p) = \frac{e^{-(V(q)+p^2/2)}}{Z}. \tag{114}$$
What is easy to see is that the kernel of the generator is only made of constants. In showing this fact we will also gain a better understanding of the system (111). The dynamics described by (111) can be thought of as split into a Hamiltonian component,
$$\dot q = p, \qquad \dot p = -\partial_qV(q), \tag{115}$$
plus an O-U process (in the $p$ variable): the terms $-p\,dt + \sqrt{2}\,dW_t$ in the equation for $p$ are precisely the O-U part. Indeed the equations (115) are the equations of motion of a Hamiltonian system with Hamiltonian
$$H(q,p) = V(q) + \frac{p^2}{2}.$$

At the level of the generator this is all very clear:
$$L = L_H + L_{OU},$$
where
$$L_H := p\,\partial_q - \partial_qV(q)\,\partial_p \tag{116}$$
is the Liouville operator of classical Hamiltonian mechanics and
$$L_{OU} := -p\,\partial_p + \partial_p^2$$
is the generator of an O-U process in the $p$ variable. From the point of view of our formalism, the Hamiltonian dynamics (115) admits infinitely many invariant measures; indeed
$$L_H f(H(q,p)) = 0 \quad \text{for every } f,$$
i.e. any function of the Hamiltonian is in the kernel of $L_H$. So any integrable and normalized function of the Hamiltonian is an invariant probability measure for (115). Adding the O-U process amounts to selecting one equilibrium; indeed
$$\mathrm{Ker}(L_{OU}) = \{\text{functions that are constant in } p\},$$
so the kernel of $L$ is only made of constants.
To distinguish between the flat $L^2$ adjoint of an operator $T$ and the adjoint in the weighted $L^2_\rho$, we shall denote the first by $T^*$ and the latter by $T^\star$. Notice now that the generator $L_H$ of the Hamiltonian part of the Langevin equation is antisymmetric both in $L^2$ and in $L^2_\rho$. It is indeed straightforward to see that
$$L_H^* = -L_H.$$
Also, $\langle L_Hf, g\rangle_\rho = -\langle f, L_Hg\rangle_\rho$ for every $f, g \in L^2_\rho \cap D(L_H)$:
$$\begin{aligned}
\langle L_Hf, g\rangle_\rho &= \int_{\mathbb{R}}\int_{\mathbb{R}}\big(p\,\partial_qf - \partial_qV\,\partial_pf\big)\,g\,\rho\,dp\,dq\\
&= -\int_{\mathbb{R}}\int_{\mathbb{R}} f\,p\,\partial_q(g\rho)\,dp\,dq + \int_{\mathbb{R}}\int_{\mathbb{R}} f\,\partial_qV\,\partial_p(g\rho)\,dp\,dq\\
&= -\int_{\mathbb{R}}\int_{\mathbb{R}} f\,\big(p\,\partial_qg - \partial_qV\,\partial_pg\big)\,\rho\,dp\,dq = -\langle f, L_Hg\rangle_\rho,
\end{aligned}$$
where in the last step we used $\partial_q\rho = -\partial_qV\,\rho$ and $\partial_p\rho = -p\,\rho$. The generator of the O-U process is instead symmetric in $L^2_\rho$ and in particular
$$L_{OU} = -T^\star T, \quad \text{where} \quad T = \partial_p, \quad T^\star = -\partial_p + p.$$
In conclusion, the generator of the Langevin equation decomposes into a symmetric and an antisymmetric part. Moreover, the antisymmetric part comes from the Hamiltonian deterministic component of the dynamics, the symmetric part comes from the stochastic component. Using Stone's Theorem (see Appendix B.3) we also know that the semigroup generated by $L_H$ is norm-preserving, while it is easy to see that the semigroup generated by $L_{OU}$ is dissipative; indeed
$$\frac{d}{dt}\|e^{tL_{OU}}h\|^2_\rho = 2\langle L_{OU}e^{tL_{OU}}h, e^{tL_{OU}}h\rangle_\rho = -2\langle T^\star Th_t, h_t\rangle_\rho = -2\|Th_t\|^2_\rho \leq 0,$$
where we used the notation $h_t(x) = e^{tL_{OU}}h(x)$. In conclusion, so far we have the following picture:
$$L = \underbrace{L_H}_{\text{skew-symmetric}} - \underbrace{T^\star T}_{\text{symmetric}}$$
where $L_H$ is the deterministic, conservative part of the dynamics and $T^\star T$ is the stochastic, dissipative part of the dynamics.

However it would be misleading to think that, for the Langevin equation, decay to equilibrium happens only because of the effect of the dissipative part of the dynamics. This would be true if the operators $L_{OU}$ and $L_H$ commuted, but they don't, so more interesting phenomena take place in the Langevin dynamics, and exponential convergence to equilibrium happens because of the interaction between the symmetric and the antisymmetric part of the dynamics. But we are going a bit too fast: we are talking about exponential convergence to equilibrium and we haven't yet proved that such a thing happens for the Langevin dynamics. Clearly we cannot use the same technique that we have used for diffusions with symmetric generator. This is precisely the issue that we would like to address in the remainder of this section and that has been tackled in the book [77].
The hypocoercivity theory, subject of [77], is concerned with the problem of exponential convergence to equilibrium for evolution equations of the form
$$\partial_t h + (A^\star A - B)\,h = 0,^{36} \tag{117}$$
where $B$ is an antisymmetric operator^{37}. We shall briefly present some of the basic elements of the hypocoercivity theory and then see what are the outcomes of such a technique when we apply it to the Langevin equation (111).

^{36} Generalizations to the form $\partial_t h + \left(\sum_{i=1}^m A_i^\star A_i - B\right)h = 0$ as well as further generalizations are presented in [77]. We refer the reader to such a monograph for these cases.
^{37} Notice that, up to regularity issues, any second order differential operator can be written in this form.

We first introduce the necessary notation. Let $\mathcal{H}$ be a Hilbert space, real and separable, $\|\cdot\|$ and $(\cdot,\cdot)$ the norm and scalar product of $\mathcal{H}$, respectively. Let $A$ and $B$ be unbounded operators with domains $D(A)$ and $D(B)$ respectively, and assume that $B$ is antisymmetric, i.e. $B^\star = -B$, where $\star$ denotes adjoint in $\mathcal{H}$. We shall also assume that there exists a vector space $\mathcal{S} \subset \mathcal{H}$, dense in $\mathcal{H}$, where all the operations that we will perform involving $A$ and $B$ are well defined.

Writing the involved operator in the form $T = A^\star A - B$ has several advantages. Some of them are purely computational. For example, for operators of this form, checking the contractivity of the semigroup associated with the dynamics (117) becomes trivial. Indeed, the antisymmetry of $B$ implies that
$$(Bx,x) = -(x,Bx) \quad \text{for all } x \in D(B),$$
and therefore $(Bx,x) = 0$. This fact, together with $(A^\star Ax,x) = \|Ax\|^2 \geq 0$, immediately gives
$$\frac{1}{2}\,\partial_t\big\|e^{-tT}h\big\|^2\Big|_{t=0^+} = -\|Ah\|^2 \leq 0.$$
On the other hand, conceptually, the decomposition $A^\star A - B$ is physically meaningful, as the symmetric part of the operator, $A^\star A$, corresponds to the stochastic (dissipative) part of the dynamics, whereas the antisymmetric part corresponds to the deterministic (conservative) component.
Definition 10.27. We say that an unbounded linear operator $T$ on $\mathcal{H}$ is relatively bounded with respect to the linear operators $T_1, \dots, T_n$ if the domain of $T$, $D(T)$, is contained in the intersection $\cap_j D(T_j)$ and there exists a constant $\alpha > 0$ s.t.
$$\|Th\| \leq \alpha\big(\|T_1h\| + \dots + \|T_nh\|\big), \quad \forall h \in D(T).$$
Definition 10.28 (Coercivity). Let $T$ be an unbounded operator on a Hilbert space $\mathcal{H}$, denote its kernel by $\mathcal{K}$ and assume there exists another Hilbert space $\tilde{\mathcal{H}}$ continuously and densely embedded in $\mathcal{K}^\perp$. If $\|\cdot\|_{\tilde{\mathcal{H}}}$ and $(\cdot,\cdot)_{\tilde{\mathcal{H}}}$ are the norm and scalar product on $\tilde{\mathcal{H}}$, respectively, then the operator $T$ is said to be $\lambda$-coercive on $\tilde{\mathcal{H}}$ if
$$(Th, h)_{\tilde{\mathcal{H}}} \geq \lambda\|h\|^2_{\tilde{\mathcal{H}}}, \quad \forall h \in \mathcal{K}^\perp \cap D(T),$$
where $D(T)$ is the domain of $T$ in $\tilde{\mathcal{H}}$.

Notice the parallel with (105). Not surprisingly, the following Proposition gives an equivalent definition of coercivity.

Proposition 10.29. With the same notation as in Definition 10.28, $T$ is $\lambda$-coercive on $\tilde{\mathcal{H}}$ iff
$$\|e^{-Tt}h\|_{\tilde{\mathcal{H}}} \leq e^{-\lambda t}\|h\|_{\tilde{\mathcal{H}}}, \quad \forall h \in \tilde{\mathcal{H}} \text{ and } t \geq 0.$$
Definition 10.30 (Hypocoercivity). With the same notation of Definition 10.28, assume $T$ generates a continuous semigroup. Then $T$ is said to be $\lambda$-hypocoercive on $\tilde{\mathcal{H}}$ if there exists a constant $C > 0$ such that
$$\|e^{-Tt}h\|_{\tilde{\mathcal{H}}} \leq Ce^{-\lambda t}\|h\|_{\tilde{\mathcal{H}}}, \quad \forall h \in \tilde{\mathcal{H}} \text{ and } t \geq 0. \tag{118}$$
Remark 10.31. We remark that the only difference between Definition 10.28 and Definition 10.30 is in the constant $C$ on the right hand side of (118), when $C > 1$. Thanks to this constant, the notion of hypocoercivity is invariant under a change of equivalent norm, as opposed to the definition of coercivity, which relies on the choice of the Hilbert norm. Hence the basic idea employed in the proof of exponentially fast convergence to equilibrium for degenerate diffusions generated by operators in the form (117) is to appropriately construct a norm on $\tilde{\mathcal{H}}$, equivalent to the existing one, and such that in this norm the operator is coercive.

We will state in the following the basic theorem in the theory of hypocoercivity. Generalizations can be found in [77].
Theorem 10.32. With the notation introduced so far, let $T$ be an operator of the form $T = A^\star A - B$, with $B^\star = -B$. Let $\mathcal{K} = \mathrm{Ker}\,T$, define $C := [A,B]$ and consider on $\mathcal{K}^\perp$^{38} the norm
$$\|h\|^2_{\mathcal{H}^1} := \|h\|^2 + \|Ah\|^2 + \|Ch\|^2.$$
Suppose the following holds:
1. $A$ and $A^\star$ commute with $C$;
2. $[A, A^\star]$ is relatively bounded with respect to $I$ and $A$;
3. $[B, C]$ is relatively bounded with respect to $A$, $A^2$, $C$ and $AC$.
Then there exists a scalar product $((\cdot,\cdot))$ on $\mathcal{H}^1/\mathcal{K}$, defining a norm equivalent to the $\mathcal{H}^1$ norm, such that
$$((h, Th)) \geq k\big(\|Ah\|^2 + \|Ch\|^2\big), \quad \forall h \in \mathcal{H}^1/\mathcal{K}, \tag{119}$$
for some constant $k > 0$. If, in addition to the above assumptions, we have that
$$A^\star A + C^\star C \text{ is } \kappa\text{-coercive for some } \kappa > 0,$$
then $T$ is hypocoercive in $\mathcal{H}^1/\mathcal{K}$: there exist constants $c, \lambda > 0$ such that
$$\|e^{-tT}\|_{\mathcal{H}^1/\mathcal{K}\to\mathcal{H}^1/\mathcal{K}} \leq c\,e^{-\lambda t}.$$

^{38} One can prove that the space $\mathcal{K}^\perp$ is the same irrespective of whether we consider the scalar product $\langle\cdot,\cdot\rangle$ of $\mathcal{H}$ or the scalar product $\langle\cdot,\cdot\rangle_{\mathcal{H}^1}$ associated with the norm $\|\cdot\|_{\mathcal{H}^1}$.

Comment. I will not write a proof of this theorem but I will explain how it works. The idea is the same that we have explained in Remark 10.31. Consider the norm
$$((h,h)) := \|h\|^2 + a\|Ah\|^2 + c\|Ch\|^2 + 2b\,(Ah, Ch),$$
where $a$, $b$ and $c$ are three strictly positive constants to be chosen. The assumptions 1., 2. and 3. are needed to ensure that this norm is equivalent to the $\mathcal{H}^1$ norm, i.e. that there exist constants $c_1, c_2 > 0$ such that
$$c_1\|h\|^2_{\mathcal{H}^1} \leq ((h,h)) \leq c_2\|h\|^2_{\mathcal{H}^1}.$$
If we can prove that $T$ is coercive in this norm, then by Proposition 10.29 and Remark 10.31 we have also shown exponential convergence to equilibrium in the $\mathcal{H}^1$ norm, i.e. hypocoercivity. So the whole point is proving that
$$((Th, h)) \geq K((h,h)),$$
for some $K > 0$. If 1., 2. and 3. hold, then (with a few lengthy but surprisingly not at all complicated calculations) (119) follows. From now on $K > 0$ will denote a generic constant which might not be the same from line to line. The $\kappa$-coercivity of $A^\star A + C^\star C$ means that $\|Ah\|^2 + \|Ch\|^2 \geq \kappa\|h\|^2$, so we can write
$$\|Ah\|^2 + \|Ch\|^2 = \frac{1}{2}\big(\|Ah\|^2 + \|Ch\|^2\big) + \frac{1}{2}\big(\|Ah\|^2 + \|Ch\|^2\big) \geq \frac{1}{2}\big(\|Ah\|^2 + \|Ch\|^2\big) + \frac{\kappa}{2}\|h\|^2 \geq K\|h\|^2_{\mathcal{H}^1}.$$
Combining this with (119), we obtain
$$((h, Th)) \geq k\big(\|Ah\|^2 + \|Ch\|^2\big) \geq K\|h\|^2_{\mathcal{H}^1} \geq K((h,h)).$$
This concludes the proof. Another important observation is that, in practice, the coercivity of $A^\star A + C^\star C$ boils down to a Poincaré inequality. This will be clear when we apply this machinery to the Langevin equation, see the proof of Theorem 10.35.
Remark 10.33. Let $\mathcal{K}$ be the kernel of $T$ and notice that $\mathrm{Ker}(A^\star A) = \mathrm{Ker}(A)$ and $\mathcal{K} = \mathrm{Ker}(A)\cap\mathrm{Ker}(B)$. Suppose $\mathrm{Ker}\,A \subseteq \mathrm{Ker}\,B$; then $\mathrm{Ker}\,T = \mathrm{Ker}\,A$. In this case the coercivity of $T$ is equivalent to the coercivity of $A^\star A$. So the case we are interested in is the case in which $A^\star A$ is coercive and $T$ is not. In order for this to happen $A^\star A$ and $B$ cannot commute; if they did, then $e^{-tT} = e^{-tA^\star A}e^{tB}$. Therefore, since $e^{tB}$ is norm preserving, we would have $\|e^{-tT}h\| = \|e^{-tA^\star A}h\|$. This is the intuitive reason to look at commutators of the form $[A,B]$.

We can use Theorem 10.32 to prove exponentially fast convergence to equilibrium for the Langevin dynamics. We shall apply such a theorem to the operator $-L$, with $L$ defined in (112), on the space $\mathcal{H} = L^2_\rho$, where $\rho$ is the equilibrium distribution (114). (The space $\mathcal{S}$ can be taken to be the space of Schwartz functions.) The operators $A$ and $B$ are then
$$A = \partial_p \quad \text{and} \quad B = p\,\partial_q - \partial_qV\,\partial_p,$$
so that
$$C := [A,B] = AB - BA = \partial_q.$$
The kernel $\mathcal{K}$ of the operator $L$ is made of constants and in this case the norm $\mathcal{H}^1$ will be the Sobolev norm of the weighted $H^1(\rho)$:
$$\|f\|^2_{H^1(\rho)} := \|f\|^2_{L^2_\rho} + \|\partial_qf\|^2_{L^2_\rho} + \|\partial_pf\|^2_{L^2_\rho}.$$
Let us first calculate the commutators needed to check the assumptions of Theorem 10.32:
$$[A,C] = [A^\star,C] = 0, \qquad [A,A^\star] = \mathrm{Id} \tag{120}$$
and
$$[B,C] = \partial_q^2V(q)\,\partial_p. \tag{121}$$
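The commutators (120)-(121) are quick to verify symbolically by applying the operators to a test function; the following sympy sketch is our own check, not part of the text:

```python
import sympy as sp

q, p = sp.symbols('q p', real=True)
V = sp.Function('V')(q)
f = sp.Function('f')(q, p)

A  = lambda u: sp.diff(u, p)                      # A = d/dp
As = lambda u: -sp.diff(u, p) + p * u             # A* = -d/dp + p (adjoint in L^2_rho)
B  = lambda u: p * sp.diff(u, q) - sp.diff(V, q) * sp.diff(u, p)

comm = lambda S, T, u: S(T(u)) - T(S(u))

C = lambda u: comm(A, B, u)                       # C = [A, B]
print(sp.simplify(C(f) - sp.diff(f, q)))          # 0: [A, B] = d/dq
print(sp.simplify(comm(A, As, f) - f))            # 0: [A, A*] = Id
Cq = lambda u: sp.diff(u, q)
print(sp.simplify(comm(B, Cq, f) - sp.diff(V, q, 2) * sp.diff(f, p)))  # 0: [B, C]
```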
Lemma 10.34. Let $V \in C^\infty(\mathbb{R})$. Suppose $V(q)$ satisfies
$$|\partial_q^2V| \leq C\big(1 + |\partial_qV|\big), \tag{122}$$
for some constant $C \geq 0$. Then, for all $f \in H^1(\rho)$, there exists a constant $C > 0$ such that
$$\|(\partial_q^2V)\,\partial_pf\|^2_{L^2(\rho)} \leq C\Big(\|\partial_pf\|^2_{L^2(\rho)} + \|(\partial_qV)\,\partial_pf\|^2_{L^2(\rho)}\Big). \tag{123}$$
Theorem 10.35. Let $V \in C^\infty(\mathbb{R})$, satisfying (122) and the assumptions of Lemma 10.24. Then there exist constants $C, \lambda > 0$ such that for all $h_0 \in H^1(\rho)$,
$$\left\|e^{tL}h_0 - \int h_0\,d\rho\right\|_{H^1(\rho)} \leq C\,e^{-\lambda t}\,\|h_0\|_{H^1(\rho)}. \tag{124}$$
Proof. We will use Theorem 10.32. We need to check that conditions 1. to 3. of the theorem are satisfied, together with the coercivity condition. Conditions 1. and 2. are satisfied due to (120). Having calculated (121), condition 3. requires $\partial_q^2V\,\partial_p$ to be relatively bounded with respect to $\partial_p$, $\partial_p^2$, $\partial_q$ and $\partial_q\partial_p$. From Lemma 10.34 this is trivially true as soon as condition (122) holds. Now we turn to the coercivity condition. Let us first write the operator $\hat L = A^\star A + C^\star C$:
$$\hat L = -\partial_p^2 + p\,\partial_p - \partial_q^2 + \partial_qV\,\partial_q.$$
In order for this operator to be coercive, it is sufficient for the Gibbs measure $\rho(dp\,dq) = \frac{1}{Z}e^{-H(q,p)}\,dp\,dq$ to satisfy a Poincaré inequality. This probability measure is the product of a Gaussian measure (in $p$), which satisfies a Poincaré inequality, and of the probability measure $e^{-V(q)}\,dq$. It is sufficient, therefore, to prove that $e^{-V(q)}\,dq$ satisfies a Poincaré inequality. This follows from Lemma 10.24 and Lemma 10.34.
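Even though the generator is neither elliptic nor self-adjoint, Theorem 10.35 predicts exponentially fast relaxation, which can be observed numerically. A quick Euler-Maruyama sketch for (111) (the potential and all parameters are our illustrative choices):

```python
import numpy as np

def grad_V(q):                    # illustrative confining potential V(q) = q^4/4 + q^2/2
    return q**3 + q

rng = np.random.default_rng(4)
n_paths, dt, n_steps = 10000, 1e-3, 5000
q = rng.normal(3.0, 0.1, size=n_paths)   # start far from equilibrium
p = np.zeros(n_paths)

for k in range(n_steps):
    q = q + p * dt
    p = p - grad_V(q) * dt - p * dt + np.sqrt(2 * dt) * rng.normal(size=n_paths)
    if k % 1000 == 0:
        print(k * dt, q.mean())   # E[q] relaxes exponentially fast towards 0
```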


11 Anomalous diffusion - - - STILL DRAFT

[Figure 4: example of superdiffusive path.]
[Figure 5: example of subdiffusive path.]

Anomalous diffusion processes are characterized by a mean square displacement which, instead of growing linearly in time, grows like $t^{2\gamma}$, $\gamma > 0$, $\gamma \neq \frac12$. When $0 < \gamma < \frac12$ the process is subdiffusive, when $\gamma > \frac12$ it is superdiffusive.

Diffusion phenomena can be described at the microscopic level by BM and macroscopically by the heat equation, i.e. the parabolic problem associated with the Laplacian operator; the link between the two descriptions is, roughly speaking, the fact that the fundamental solution to the diffusion equation is the probability density associated with BM.

A similar picture can be obtained for anomalous diffusion. The main difference is that in nature a variety of anomalous diffusion phenomena can be observed, and the question is how to characterize them from both the analytical and the statistical point of view. It has been shown that the microscopical (probabilistic) approach can be understood in the context of continuous time random walks (CTRW) and, in this framework, a process is uniquely determined once the probability density to move at distance $r$ in time $t$ is known ([?]-[11], [?], [?] and references therein). The analytical approach is based on the theory of fractional differentiation operators, where the derivative can be fractional either in time or in space (see [?]-[?], [?] and references therein).
For $f(s)$ regular enough (e.g. $f \in C(0,t]$ with an integrable singularity at $s = 0$), let us introduce the Riemann-Liouville fractional derivative,
$$D_t^{2\gamma}(f) := \frac{1}{\Gamma(2\gamma)}\frac{d}{dt}\int_0^t ds\,\frac{f(s)}{(t-s)^{1-2\gamma}}, \quad 0 < \gamma < \frac12, \tag{125}$$
and the Riemann-Liouville fractional integral,
$$I_t^{2\gamma}(f) := \frac{1}{\Gamma(2\gamma-1)}\int_0^t ds\,\frac{f(s)}{(t-s)^{2-2\gamma}}, \quad \frac12 < \gamma < 1, \tag{126}$$
where $\Gamma$ is the Euler Gamma function ([?]). Appendix ?? contains a motivation for introducing such operators. For $\frac12 < \gamma < 1$ let us also introduce the fractional Laplacian $-(-\Delta)^{1/2\gamma}$, defined through its Fourier transform: if the Laplacian corresponds, in Fourier space, to a multiplication by $-k^2$, the fractional Laplacian corresponds to a multiplication by $-|k|^{1/\gamma}$. (125) and (126) can be defined in a more general way (see [?]), but for our purposes the above definition is sufficient. Furthermore, notice that the operators in (125) and (126) are fractional in time, whereas the fractional Laplacian is fractional in space.
Let us now consider the function $\theta(t,x)$, solution to
$$\partial_t\theta(t,x) = \frac{1}{\Gamma(2\gamma)}\frac{d}{dt}\int_0^t ds\,\frac{\partial_x^2\theta(s,x)}{(t-s)^{1-2\gamma}} \quad \text{when } 0 < \gamma < \frac12, \tag{127}$$
$$\partial_t\theta(t,x) = \frac{1}{\Gamma(2\gamma-1)}\int_0^t ds\,\frac{\partial_x^2\theta(s,x)}{(t-s)^{2-2\gamma}} \quad \text{when } \frac12 < \gamma < 1. \tag{128}$$
It has been shown (see [?, ?] and references therein) that such a kernel is the asymptotic of the probability density of a CTRW run by a particle either moving at constant velocity between stopping points or instantaneously jumping between halt points, where it waits a random time before jumping again. On the other hand, a classic result states that the Fourier transform of the solution $\theta(t,x)$ to
$$\partial_t\theta(t,x) = -\frac12(-\Delta)^{1/2\gamma}\theta(t,x), \quad \frac12 < \gamma < 1, \tag{129}$$
is, for $\gamma \geq \frac12$, the characteristic function of a (stable) process whose first moment is divergent when $\gamma \geq 1$ (see [?]); this justifies the choice $\frac12 < \gamma < 1$ in equation (129). Processes of this kind are particular CTRWs, the well known Lévy flights; in this case large jumps are allowed with non negligible probability, and this results in the process having divergent second moment.
We will use the notation $\theta(t,x) = \theta_t(x)$ to indicate the solution to either (127), (128) or (129), as in the proofs we use only the properties that these kernels have in common.

The above described framework is analogous to the one of Einstein diffusion: for subdiffusion and Riemann-type superdiffusion the statistical description is given by CTRWs, whose (asymptotic) density is the fundamental solution of the evolution equation associated with the operators of fractional differentiation and integration, i.e. (127) and (128), respectively (see Appendix B). For the Lévy-type superdiffusion, the statistical point of view is given by Lévy flights, whose probability density evolves in time according to the evolution equation associated with the fractional Laplacian, i.e. (129) (see [?]).

We want to show how the operators $D_t^{2\gamma}$ and $I_t^{2\gamma}$ naturally arise in the context of anomalous diffusion and explain in some more detail the link with CTRWs.
We want to determine an operator $A$ s.t.
$$\partial_t\theta_t(x) = A\,\theta_t(x), \quad \theta_{t=0} = \delta_0,$$
with $\theta(t,x)$ enjoying the following three properties:
$$\int dx\,\theta_t(x) = 1, \quad \int dx\,\theta_t(x)\,x = 0 \quad \text{and} \quad \int dx\,\theta_t(x)\,x^2 \sim t^{2\gamma}, \tag{130}$$
(notice that for $\gamma = \frac12$ we recover the diffusion equation with $A = \Delta$). We recall that $\hat f$, $f^\#$ and $\tilde f$ denote the Fourier, the Laplace and the Fourier-Laplace transform of the function $f$, respectively.
By (130), the following must hold:
$$\hat\theta_t(k) = 1 - \frac12\,c\,t^{2\gamma}k^2 + o(k^2)$$
and
$$\tilde\theta(\lambda,k) = \frac{1}{\lambda} - \frac12\,c\,k^2\,\frac{\Gamma(2\gamma+1)}{\lambda^{2\gamma+1}} = \frac{1}{\lambda}\left(1 - c_1\lambda^{-2\gamma}k^2\right),$$
where $c_1 = \frac12\,c\,\Gamma(2\gamma+1)$. In definitions (127) and (128) the constant $c_1$ should appear; we just set it equal to 1 both for simplicity and not being interested, in this context, in estimating the anomalous diffusion constant.

We can assume that the expression for $\tilde\theta(\lambda,k)$ is valid in the regime $\lambda^{-2\gamma}k^2 \ll 1$. Actually, condition (130)$_3$ is meant for an infinitely wide system and for long times. In other words, if $\Lambda$ is the region where the particle moves, we claim that
$$\lim_{t\to\infty}\lim_{\Lambda\to\mathbb{R}}\frac{\int_\Lambda dx\,\theta_t(x)\,x^2}{t^{2\gamma}} = \text{const}.$$
This means that we are interested in the case $k \ll \lambda^\gamma$. Of course one can in principle find an infinite number of functions s.t. $\tilde\theta(\lambda,k) = \lambda^{-1}(1 - c_1\epsilon)$ for $\epsilon = \lambda^{-2\gamma}k^2$. One possible choice is
$$\tilde\theta(\lambda,k) = \frac{1}{\lambda(1 + c_1\epsilon)} = \frac{1}{\lambda + c_1k^2\lambda^{1-2\gamma}}, \tag{131}$$
which leads to an integro-differential equation and, when $\gamma = \frac12$, it coincides with the Fourier-Laplace transform of a Gaussian density.
We now find the operator whose fundamental solution is $\tilde\theta(\lambda,k)$. We have
$$\mathcal{L}\big(\partial_t\hat\theta_t(k)\big)(\lambda) = -1 + \lambda\tilde\theta(\lambda,k) = -c_1k^2\lambda^{1-2\gamma}\tilde\theta(\lambda,k).$$
Let $p = 2\gamma - 1$ and $\phi_p(t) = t^{p-1}/\Gamma(p)$, so that $\mathcal{L}(\phi_p)(\lambda) = \lambda^{-p}$; then we need to distinguish two cases in order to study the right hand side of the above equation:
when $0 < \gamma < \frac12$ one can easily check that
$$\mathcal{L}\big[\partial_t\big(\phi_{p+1} * \hat\theta(k,\cdot)\big)\big] = \tilde\theta(\lambda,k)\,\lambda^{-p},$$
which implies that $\tilde\theta(\lambda,k)\,\lambda^{1-2\gamma}$ is the Laplace transform of
$$\frac{1}{\Gamma(2\gamma)}\frac{d}{dt}\int_0^t ds\,\frac{\hat\theta(s,k)}{(t-s)^{1-2\gamma}};$$
when $\frac12 < \gamma < 1$, instead, a straightforward calculation shows that
$$\mathcal{L}\big(\phi_p * \hat\theta(k,\cdot)\big) = \tilde\theta(\lambda,k)\,\lambda^{-p},$$
so that $\tilde\theta(\lambda,k)\,\lambda^{1-2\gamma}$ is the Laplace transform of
$$\frac{1}{\Gamma(2\gamma-1)}\int_0^t ds\,\frac{\hat\theta(s,k)}{(t-s)^{2-2\gamma}}.$$
Finally, taking the inverse Fourier transform, we get that $\theta(t,x)$ satisfies (127) when $0 < \gamma < \frac12$ and (128) when $\frac12 < \gamma < 1$. Moreover, the explicit expression for $\theta_t(x)$ holds true: by (131) we get that
$$\tilde\theta(\lambda,k) = \int_{\mathbb{R}} dx\,e^{-ikx}\,\frac{\lambda^{\gamma-1}}{2\sqrt{c_1}}\,e^{-\lambda^\gamma|x|/\sqrt{c_1}},$$
hence
$$\theta^\#(x,\lambda) = \frac{\lambda^{\gamma-1}}{2\sqrt{c_1}}\,e^{-\lambda^\gamma|x|/\sqrt{c_1}},$$
and now, by the inverse Laplace formula, we obtain (??). Obviously, the expression (??) has been deduced after having chosen (131) among all possible candidates for $\tilde\theta$, and this choice can now be justified in view of the link with CTRWs.

Appendices

A Miscellaneous facts

This appendix contains some miscellaneous material.

A.1 Why can we think of white noise as the derivative of Brownian Motion

We said that white noise is the derivative of BM, without justifying this statement. We will give here a formal explanation. If white noise is the derivative of BM then in some sense
$$\xi_t = \lim_{h\to 0}\frac{W_{t+h} - W_t}{h}.$$
By definition of BM, the process on the RHS of the above must be mean zero and Gaussian, so let us check its covariance function:
$$\mathbb{E}\left[\frac{W_{t+h}-W_t}{h}\cdot\frac{W_{s+h}-W_s}{h}\right] = \frac{1}{h^2}\Big[(t+h)\wedge(s+h) - (t+h)\wedge s - (s+h)\wedge t + s\wedge t\Big] = \begin{cases} 0 & \text{if } t \neq s\\ 1/h & \text{if } t = s \end{cases} \xrightarrow{h\to 0} \delta_0(t-s),$$
because if $s < t$ then $t\wedge(s+h) = s+h$ as $h$ gets smaller (analogous in the case $t < s$).

A.2 Gronwall's Lemma

Differential form. Let $u(t)$ and $a(t)$ be real valued differentiable functions. If
$$\frac{d}{dt}u(t) \leq a(t)u(t)$$
then
$$u(t) \leq u(c)\,e^{\int_c^t a(s)\,ds}, \quad \text{for } t \geq c.$$

Integral form. Let $u(t)$, $a(t)$ and $b(t)$ be continuous real valued functions on $\mathbb{R}^+$, with $b(t)$ nonnegative and $a(t)$ non decreasing. If
$$u(t) \leq a(t) + \int_c^t b(s)u(s)\,ds, \quad \text{for some } c \leq t,$$
then
$$u(t) \leq a(t)\,e^{\int_c^t b(s)\,ds}.$$

A.3 Kolmogorov's Extension Theorem

A sequence $\mu_N$ of probability measures on $(\mathbb{R}^N, \mathcal{B}(\mathbb{R}^N))$ is consistent if
$$\mu_{N+1}\big((a_1,b_1)\times\cdots\times(a_N,b_N)\times\mathbb{R}\big) = \mu_N\big((a_1,b_1)\times\cdots\times(a_N,b_N)\big).$$
We denote by $\mathbb{R}^{\mathbb{N}}$ the space of real sequences, endowed with the $\sigma$-algebra $\mathcal{R}^{\mathbb{N}}$ generated by the cylinder sets, i.e. by sets of the form
$$C_{A_0,\dots,A_m} := \{\omega = (\omega_0,\omega_1,\omega_2,\dots) \in \mathbb{R}^{\mathbb{N}} : \omega_i \in A_i,\ i = 0,\dots,m\},$$
$m \in \mathbb{N}$ and $A_i = (a_i,b_i) \subseteq \mathbb{R}$.

Theorem A.1 (Kolmogorov's Extension Theorem, countable version). Let $\mu_N$ be a consistent sequence of probability measures. Then there exists a unique probability measure $P$ on sequence space $(\mathbb{R}^{\mathbb{N}}, \mathcal{R}^{\mathbb{N}})$ such that
$$P\big(\omega = (\omega_1,\omega_2,\dots) : \omega_i \in (a_i,b_i),\ i = 1,\dots,N\big) = \mu_N\big((a_1,b_1)\times\cdots\times(a_N,b_N)\big).$$
The above Theorem still holds if instead of $\mathbb{R}$ we consider any Polish space $S$ endowed with a $\sigma$-algebra $\mathcal{S}$, and let $S^{\mathbb{N}}$ be the space of sequences with components in $S$, endowed with the $\sigma$-algebra $\mathcal{S}^{\mathbb{N}}$.

B Elements of Functional Analysis

Excellent references for the material of this appendix are the books [80, 64]. Let $(X, \|\cdot\|_X)$ and $(Y, \|\cdot\|_Y)$ be two Banach spaces.

Definition B.1. A linear operator $A : X \to Y$ is bounded if there exists a constant $M > 0$ such that
$$\|Ax\|_Y \leq M\|x\|_X, \quad \text{for all } x \in X. \tag{132}$$
Notice that the constant $M$ is independent of $x \in X$. The space of bounded linear operators $A : X \to Y$ is a vector space, denoted by $B(X,Y)$, and it can be endowed with the operator norm
$$\|A\|_{B(X,Y)} := \sup_{x\neq 0}\frac{\|Ax\|_Y}{\|x\|_X} = \sup_{\|x\|_X=1}\|Ax\|_Y,$$
where the second equality in the above is a consequence of linearity. If $X = Y$ then we simply write $B(X)$.

Recall the two following facts: 1. A linear operator $A : X \to Y$ is bounded if and only if it is continuous; 2. $\|A\|_{B(X,Y)}$ is the smallest constant $M$ such that (132) holds.

When dealing with bounded operators there is no need to specify their domain. An unbounded operator $L$, on the contrary, is specified by its action as well as by its domain. In other words, it is not correct to talk about the unbounded operator $L$; it is more precise to talk about the pair $(L, D(L))$, where $D(L)$ denotes the domain of $L$.

Definition B.2. An operator $A : D(A) \to Y$ is closed if
$$x_n \in D(A),\ x_n \to x,\ Ax_n \to y \implies x \in D(A) \text{ and } Ax = y.$$
A bounded linear operator $A : X \to Y$ ($X$ and $Y$ Banach spaces) is closed; in general a closed operator will not be bounded. However if $A$ is closed and defined on the whole of $X$ then it is also bounded.
Remember that given a normed space $V$, the (real) dual space of $V$, denoted $V^*$, is the space of bounded linear functionals on $V$, i.e. the space of bounded linear maps $T : V \to \mathbb{R}$. Such a space is a vector space and it becomes a normed space when endowed with the operator norm. We denote the duality relation by $\langle\cdot,\cdot\rangle_{V^*,V}$ or, when there is no risk of confusion, simply by $\langle\cdot,\cdot\rangle$; the duality relation is just the action of $V^*$ on $V$, i.e. the action of $T$ on the elements of $V$. In other words, given the linear functional $T \in V^*$, instead of writing $T(x)$ or $Tx$, for $x \in V$, we will write $\langle T, x\rangle$.

B.1 Adjoint operator

Definition B.3 (Adjoint of bounded operator). Let $X$ and $Y$ be two Banach spaces and $A : X \to Y$ a linear bounded operator. We say that $A^* : Y^* \to X^*$ is the adjoint or dual of $A$ if
$$\langle Ax, y\rangle_{Y,Y^*} = \langle x, A^*y\rangle_{X,X^*}, \quad \forall x \in X,\ y \in Y^*.$$

Remark B.4. In a Hilbert space setting the duality relation is just the scalar product. Because the dual of a Hilbert space is the Hilbert space itself, the above definition coincides, in the Hilbert case, with the one that you will have already seen, which is: if $H$ is a Hilbert space and $A : H \to H$ is a bounded linear operator, $A^* : H \to H$ is the adjoint operator of $A$ if $\langle Ax, y\rangle = \langle x, A^*y\rangle$, for all $x, y \in H$ (where this time $\langle\cdot,\cdot\rangle$ is just the scalar product in $H$).

If the operator is unbounded, the definition of adjoint is slightly more involved.

Definition B.5 (Adjoint of unbounded operator). Let $X$ and $Y$ be Banach spaces and $A$ be an unbounded operator $A : D(A) \subseteq X \to Y$. If^{39} $D(A)$ is dense in $X$, then for every $y \in Y^*$ (in the domain specified below) there exists a unique $\tilde x \in X^*$ such that
$$\langle Ax, y\rangle_{Y,Y^*} = \langle x, \tilde x\rangle_{X,X^*}, \quad \forall x \in D(A). \tag{133}$$
Therefore we can define an operator $A^* : D(A^*) \subseteq Y^* \to X^*$ with $A^*y = \tilde x$. Such an operator will satisfy
$$\langle Ax, y\rangle_{Y,Y^*} = \langle x, A^*y\rangle_{X,X^*}, \quad \forall x \in D(A).$$
The domain of $A^*$ is $D(A^*) := \{y \in Y^* : \exists\,\tilde x \in X^* \text{ satisfying (133)}\}$.

^{39} This is actually an if and only if.
Definition B.6 (Symmetric and self adjoint operators on Hilbert spaces). Let $H$ be a Hilbert space with scalar product $\langle\cdot,\cdot\rangle$.
- Let $A : H \to H$ be a bounded linear operator. $A$ is self-adjoint if $A = A^*$, i.e. if $\langle Ax, y\rangle = \langle x, Ay\rangle$, $\forall x, y \in H$.
- Let $A : H \to H$ be an unbounded operator on the Hilbert space $H$, and suppose that the domain of $A$ is dense in $H$. Then $A$ is symmetric if $D(A) \subseteq D(A^*)$ and $Ax = A^*x$ for all $x \in D(A)$. Equivalently, it is symmetric if $\langle Ax, y\rangle = \langle x, Ay\rangle$ for all $x, y \in D(A)$.
- Let $A : H \to H$ be an unbounded operator on the Hilbert space $H$, and suppose that the domain of $A$ is dense in $H$. $A$ is self-adjoint if $A = A^*$, i.e. if it is symmetric and $D(A) = D(A^*)$.
Definition B.7. A bounded linear operator U on a Hilbert space H is unitary if Range(U ) = H and U preserves the scalar product:
hU x, U yi = hx, yi,
39

This is actually an if and only if

129

for all x, y H.

(134)

Notice that for a bounded linear operator on a Hilbert space condition


(134) is equivalent (see [80]) to
kU xk = kxk,

for all x, y H.

One can also prove that a bounded linear operator on a Hilbert space is
unitary if anf only if U = U 1 .

B.2 Strong, weak and weak-* convergence

Let $(V, \|\cdot\|)$ be a normed space and $V^*$ its dual (i.e. the space of bounded linear operators $T : V \to \mathbb{R}$), endowed with the operator norm.
i) A sequence $x_n$ of elements of $V$ is said to converge (strongly) or in norm to $x \in V$ if
$$\lim_{n\to\infty}\|x - x_n\| = 0.$$
ii) A sequence $x_n$ of elements of $V$ is said to converge weakly to $x \in V$ if
$$\lim_{n\to\infty} Tx_n = Tx \quad \text{for all } T \in V^*.$$
As we have observed, the space $V^*$ is a normed space itself, when endowed with the operator norm. In this section we use the notation $(V^*, \|\cdot\|_*)$. Therefore in $V^*$ it makes sense to consider the two types of convergence described above. However, when working on the dual space, we can also define another type of convergence:
i) A sequence $T_n$ of elements of $V^*$ is said to converge (strongly) or in norm to $T \in V^*$ if
$$\lim_{n\to\infty}\|T - T_n\|_* = 0.$$
ii) A sequence $T_n$ of elements of $V^*$ is said to converge weakly to $T \in V^*$ if
$$\lim_{n\to\infty} \tilde T\,T_n = \tilde T\,T \quad \text{for all } \tilde T \in V^{**}.$$
iii) A sequence $T_n$ of elements of $V^*$ is said to converge weak-* to $T \in V^*$ if
$$\lim_{n\to\infty} T_n x = Tx, \quad \text{for all } x \in V.$$

B.3 Groups of bounded operators and Stone's Theorem

Definition B.8. A one parameter family of linear bounded operators over a Banach space $B$, $\{P_t\}_{t\in\mathbb{R}}$, $P_t : B \to B$ for all $t \in \mathbb{R}$, is a group of bounded linear operators if
1. $P_0 = I$, where $I$ denotes the identity on $B$;
2. $P_{t+s} = P_tP_s = P_sP_t$, for all $t, s \in \mathbb{R}$.
If the map $\mathbb{R} \ni t \to P_tf \in B$ is continuous for all $f \in B$, then the group is said to be strongly continuous. A strongly continuous group of bounded linear operators is also called a $C_0$-group. The infinitesimal generator of the group $P_t$ is the operator
$$Lf := \lim_{t\to 0}\frac{P_tf - f}{t}, \tag{135}$$
for all $f \in D(L) := \{f \in B : \text{the limit on the RHS of (135) exists in } B\}$.
It is clear that if $P_t$, $t \in \mathbb{R}$, is a $C_0$-group, then $P_t$, $t \in \mathbb{R}^+$, is a $C_0$-semigroup generated by $L$ and $P_{-t}$, $t \in \mathbb{R}^+$, is a $C_0$-semigroup generated by $-L$. If $P_t$ is invertible (as an operator) then the inverse is precisely $P_{-t}$.

Theorem B.9 (Stone's Theorem). $L$ is the generator of a $C_0$-group of unitary operators on a Hilbert space if and only if $iL$ is self-adjoint.

B.4 Functional Inequalities in their basic form

These are the functional inequalities that you will have seen in a first course on PDEs.

Poincaré Inequality (in its most classic form). Let $U$ be a bounded, connected, open subset of $\mathbb{R}^n$ with a $C^1$ boundary $\partial U$. Then for every $1 \leq p \leq \infty$ and $f \in W^{1,p}(U)$,
$$\|f - \langle f\rangle_U\|_{L^p(U)} \leq C\|\nabla f\|_{L^p(U)},$$
where $\langle f\rangle_U$ denotes the average of $f$ over $U$ and $C$ is a constant depending on $n$, $p$ and $U$.

C Laplace method

The Laplace method is a technique to determine the behaviour of integrals of the form
$$I(\epsilon) = \int_{\mathbb{R}} f(x)\,e^{-h(x)/\epsilon}\,dx, \quad \text{as } \epsilon \to 0.$$
We assume that the functions $f(x), h(x) : \mathbb{R} \to \mathbb{R}$ are $C^\infty$ and that $h(x)$ admits a unique global minimum, which is attained at $x = x_0$. In other words, there exists a unique $x_0 \in \mathbb{R}$ such that
1. $h'(x_0) = 0$ and $h''(x_0) > 0$;
2. $h(x_0) < h(x)$ for all $x \neq x_0$.
Rewriting
$$I(\epsilon) = e^{-h(x_0)/\epsilon}\int_{\mathbb{R}} f(x)\,e^{[h(x_0)-h(x)]/\epsilon}\,dx$$
and observing that
$$e^{[h(x_0)-h(x)]/\epsilon} \xrightarrow{\epsilon\to 0} \begin{cases} 0 & x \neq x_0\\ 1 & x = x_0,\end{cases}$$
it is clear that the main contribution to the value of the integral $I(\epsilon)$ will come from a neighbourhood of the point $x_0$. So for some small $\delta = \delta(\epsilon) > 0$ we can write
$$\begin{aligned}
I(\epsilon) &= e^{-h(x_0)/\epsilon}\int_{\mathbb{R}} f(x)\,e^{[h(x_0)-h(x)]/\epsilon}\,dx
\simeq e^{-h(x_0)/\epsilon}\int_{x_0-\delta}^{x_0+\delta} f(x)\,e^{[h(x_0)-h(x)]/\epsilon}\,dx\\
&\simeq e^{-h(x_0)/\epsilon}f(x_0)\int_{x_0-\delta}^{x_0+\delta} e^{[h(x_0)-h(x)]/\epsilon}\,dx
\simeq e^{-h(x_0)/\epsilon}f(x_0)\int_{\mathbb{R}} e^{[h(x_0)-h(x)]/\epsilon}\,dx\\
&= f(x_0)\int_{\mathbb{R}} e^{-h(x)/\epsilon}\,dx.
\end{aligned}$$
In the above, when I write $X \simeq Y$ I mean that $X$ is equal to $Y$ plus terms that go to zero as $\epsilon \to 0$. The above formal calculation shows that
$$\frac{I(\epsilon)}{\int_{\mathbb{R}} e^{-h(x)/\epsilon}\,dx} \xrightarrow{\epsilon\to 0} f(x_0).$$
Clearly all the above calculations are only formal, and in doing a rigorous proof one should quantify all the $\simeq$. The Laplace method consists in using an idea similar to the one described above, in order to find the behaviour of $I(\epsilon)$ to leading order (in $\epsilon$):
$$\begin{aligned}
I(\epsilon) &= e^{-h(x_0)/\epsilon}\int_{\mathbb{R}} f(x)\,e^{[h(x_0)-h(x)]/\epsilon}\,dx\\
&\simeq e^{-h(x_0)/\epsilon}f(x_0)\int_{x_0-\delta}^{x_0+\delta} e^{-h'(x_0)(x-x_0)/\epsilon}\,e^{-[h''(x_0)(x-x_0)^2]/2\epsilon}\,dx\\
&= e^{-h(x_0)/\epsilon}f(x_0)\int_{x_0-\delta}^{x_0+\delta} e^{-[h''(x_0)(x-x_0)^2]/2\epsilon}\,dx\\
&\simeq e^{-h(x_0)/\epsilon}f(x_0)\int_{\mathbb{R}} e^{-[h''(x_0)(x-x_0)^2]/2\epsilon}\,dx\\
&= e^{-h(x_0)/\epsilon}f(x_0)\int_{\mathbb{R}} e^{-v^2/2}\,dv\,\sqrt{\frac{\epsilon}{h''(x_0)}}
= e^{-h(x_0)/\epsilon}f(x_0)\,\sqrt{\frac{2\pi\epsilon}{h''(x_0)}},
\end{aligned}$$
where we used $h'(x_0) = 0$ and the substitution $v = (x-x_0)\sqrt{h''(x_0)/\epsilon}$. So as $\epsilon \to 0$,
$$I(\epsilon) \simeq e^{-h(x_0)/\epsilon}f(x_0)\,\sqrt{\frac{2\pi\epsilon}{h''(x_0)}}$$
to leading order.
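A quick numerical check of the leading-order formula (the function and minimum below are illustrative choices of ours):

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: np.cos(x)
h = lambda x: (x - 1.0)**2        # unique global minimum at x0 = 1, h''(x0) = 2
x0, h2 = 1.0, 2.0

for eps in [1e-1, 1e-2, 1e-3]:
    I, _ = quad(lambda x: f(x) * np.exp(-h(x) / eps), 0.0, 2.0)
    approx = np.exp(-h(x0) / eps) * f(x0) * np.sqrt(2 * np.pi * eps / h2)
    print(eps, I / approx)        # ratio -> 1 as eps -> 0
```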

References
[1] L. Arnold. Stochastic Differential Equations: Theory and Applications,
Wiley, 1974
[2] P. Baldi. Calcolo delle Probabilità e Statistica. Second Edition. McGraw-Hill, Milano 1998.
[3] A. Beskos, G. Roberts and A. M. Stuart. Optimal Scalings for local
Metropolis-Hastings chains on nonproduct targets in high dimensions.
Ann. Appl. Prob. 19 863-898
[4] D. Bakry and M. Emery. Diffusions hypercontractives, pp. 177-206 in Sém. de Probab. XIX, Lecture Notes in Math., Vol. 1123, Springer-Verlag, Berlin, 1985.
[5] D. Bakry, I. Gentil and M. Ledoux. Analysis and Geometry of Markov
Diffusion Operators. Springer, 2013
[6] A. Beskos, G. Roberts, A. M. Stuart and J. Voss. MCMC methods for
diffusion bridges. Stoch. Dyn. 8 319-350.
[7] A. Beskos and A. M. Stuart. MCMC methods for sampling function space. In ICIAM Invited Lecture 2007. European Mathematical Society, Zürich.
[8] P. Billingsley. Convergence of probability measures. Second edition.
Wiley Series in Probability and Statistics: Probability and Statistics.
New York, 1999.
[9] P. Billingsley. Ergodic Theory and information. Robert E. Krieger Publishing Company. Huntington, New York, 1978.
[10] G. Casella and E.I. George. Explaining the Gibbs sampler. Amer. Statist. 46 (1992), no. 3, 167-174.
[11] A.V. Chechkin, J. Klafter and I.M. Sokolov. Distributed-Order Fractional Kinetics, 16th Marian Smoluchowski Symposium on Statistical
Physics: Fundamental and Applications, September 2003
[12] G. Da Prato and J. Zabczyk, Stochastic equations in infinite dimensions, Cambridge University Press, Cambridge, 1992.


[13] G. Da Prato and J. Zabczyk. Ergodicity for infinite dimensional


systems, London Mathematical Society Lecture Note Series, 229. Cambridge University Press, Cambridge, 1996.
[14] J. Davidson. Asymptotic Methods and Functional Central Limit Theorems. Chapter 5
[15] R. Durrett. Probability: theory and examples. Fourth edition. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge
University Press, Cambridge, 2010.
[16] J.-P. Eckmann and M. Hairer. Non-equilibrium statistical mechanics of strongly anharmonic chains of oscillators. Comm. Math. Phys., 212(1): 105-164, 2000.
[17] J.-P. Eckmann and M. Hairer. Spectral properties of hypoelliptic operators. Comm. Math. Phys., 235, 233-253 (2003).
[18] J.-P. Eckmann, C.-A. Pillet, and L. Rey-Bellet. Entropy production in nonlinear, thermally driven Hamiltonian systems. J. Statist. Phys., 95(1-2): 305-331, 1999.
[19] J.-P. Eckmann, C.-A. Pillet and L. Rey-Bellet. Non-equilibrium statistical mechanics of anharmonic chains coupled to two heat baths at different temperatures. Comm. Math. Phys., 201(3): 657-697, 1999.
[20] S.N. Ethier and T.G. Kurtz. Markov processes. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical
Statistics (1986).
[21] L. C. Evans. Partial Differential Equations. American Mathematical
Society, 2010.
[22] I. I. Gihman and A. V. Skorohod. Stochastic differential equations.
Translated from the Russian by Kenneth Wickwire. Ergebnisse der
Mathematik und ihrer Grenzgebiete, Band 72. Springer-Verlag, New
York-Heidelberg, 1972.
[23] D. Givon, R. Kupferman and A.M. Stuart. Extracting macroscopic
dynamics: model problems and algorithms, Nonlinearity, 17(6) 2004.
[24] G. R. Grimmett and D. R. Stirzaker. Probability and random processes.
Oxford University Press, New York, 2001.


[25] A. Guionnet and B. Zegarlinski. Lectures on Logarithmic Sobolev inequalities. Lecture Notes.
[26] M. Hairer. Ergodic Theory for Stochastic PDEs. Lecture notes.
[27] M. Hairer. Ergodic properties of a class of non-Markovian processes. In Trends in stochastic analysis, volume 353 of London Math. Soc. Lecture Note Ser., pages 65-98. Cambridge Univ. Press, Cambridge, 2009.
[28] M. Hairer. How hot can a heat bath get? Comm. Math. Phys., 292
(2009), 1, 131-177.
[29] M. Hairer and G. A. Pavliotis. From ballistic to diffusive behavior in periodic potentials. J. Stat. Phys., 131(1): 175-202, 2008.
[30] W.K. Hastings. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika, 57, No. 1 (Apr., 1970), 97-109.
[31] P. Hänggi, P. Talkner, and M. Borkovec. Reaction-rate theory: fifty years after Kramers. Rev. Modern Phys., 62(2): 251-341, 1990.
[32] B. Helffer and F. Nier. Hypoelliptic estimates and spectral theory for Fokker-Planck operators and Witten Laplacians, volume 1862 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 2005.
[33] F. Hérau. Short and long time behavior of the Fokker-Planck equation in a confining potential and applications. J. Funct. Anal., 244(1): 95-118, 2007.
[34] F. Hérau and F. Nier. Isotropic hypoellipticity and trend to equilibrium for the Fokker-Planck equation with a high-degree potential. Arch. Ration. Mech. Anal. 171(2), 151-218 (2004).
[35] M. Hitrik and K. Pravda-Starov. Spectra and Semigroup smoothing for Non-Elliptic Quadratic Operators. Math. Ann. 344, No 4 (2009), pp. 801-846.
[36] L. Hörmander. Hypoelliptic second order differential equations. Acta Math., 119: 147-171, 1967.
[37] L. Hörmander. The analysis of linear partial differential operators, vol. I, II, III, IV. Springer-Verlag, New York (1985).
[38] V. Jaksic and C.-A. Pillet. Ergodic properties of the non-Markovian Langevin equation. Lett. Math. Phys., 41(1): 49-57, 1997.
[39] V. Jaksic and C.-A. Pillet. Spectral theory of thermal relaxation. J. Math. Phys., 38(4): 1757-1780, 1997. Quantum problems in condensed matter physics.
[40] V. Jaksic and C.-A. Pillet. Ergodic properties of classical dissipative systems. I. Acta Math., 181(2): 245-282, 1998.
[41] I. Karatzas and S.E. Shreve. Brownian motion and stochastic calculus.
Second edition. Graduate Texts in Mathematics, 113. Springer-Verlag,
New York, 1991.
[42] H.A. Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7 (1940), pp. 284-304.
[43] R. Kupferman. Fractional kinetics in Kac-Zwanzig heat bath models. J. Statist. Phys., 114(1-2): 291-326, 2004.
[44] C. Landim. Central limit theorem for Markov processes. In From classical to modern probability, volume 54 of Progr. Probab., pages 145-205. Birkhäuser, Basel, 2003.
[45] Lanford, III, O. E. Time evolution of large classical systems. In Dynamical systems, theory and applications (Rencontres, Battelle Res. Inst., Seattle, Wash., 1974). Springer, Berlin, 1975, pp. 1-111. Lecture Notes in Phys., Vol. 38.
[46] J. Lebowitz. Microscopic reversibility and macroscopic behavior: physical explanations and mathematical derivations. In Twenty-five years of non-equilibrium statistical mechanics, proceedings of the XIII Sitges conference (1994), J. Brey, J. Marro, J. Rubi and M. S. Miguel, Eds., Lect. Notes in Physics, Springer, pp. 1-20.
[47] L. Lorenzi and M. Bertoldi. Analytical Methods for Markov Semigroups,
CRC Press, New York (2006)
[48] A. Lunardi. On the Ornstein-Uhlenbeck Operator in L2 Spaces with respect to invariant measures. Transactions of the AMS 349, No 1 (1997), pp. 155-169.
[49] P. A. Markowich and C. Villani. On the trend to equilibrium for the Fokker-Planck equation: an interplay between physics and functional analysis. Mat. Contemp., 19: 1-29, 2000.

[50] J. C. Mattingly, A. M. Stuart and D. J. Higham. Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stochastic Process. Appl., 101(2): 185-232, 2002. Geometric ergodicity of some hypoelliptic diffusions for particle motions. Markov Processes and Related Fields, 8(2): 199-214, 2002.
[51] G. Metafune, D. Pallara and E. Priola. Spectrum of Ornstein-Uhlenbeck operators in Lp spaces with respect to invariant measures. J. Funct. Anal. 196(1), 40-60 (2002).
[52] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth and A.H. Teller.
Equation of State Calculations by Fast Computing Machines. Journal
of chemical physics, 21, 6, June 1953.
[53] S.P. Meyn and R.L. Tweedie. Markov chains and stochastic stability.
Communications and Control Engineering Series. Springer-Verlag London Ltd., London, 1993.
[54] H. Mori. A continued-fraction representation of the time-correlation functions. Progress of Theoretical Physics, 34(3): 399-416, 1965.
[55] E. Nelson. Dynamical theories of Brownian motion. Princeton University Press, Princeton, N.J., 1967.
[56] B. Øksendal. Stochastic differential equations. An introduction with applications. Sixth edition. Universitext. Springer-Verlag, Berlin, 2003.
[57] M. Ottobre. Random Walk in a Random Environment with non Diffusive Feedback Force: Long Time Behavior, accepted by Stochastic proc.
Appl.
[58] M. Ottobre and G.A. Pavliotis, Asymptotic analysis for the Generalized
Langevin Equation, Nonlinearity 24 (2011) 1629-1653
[59] M. Ottobre, G.A. Pavliotis and K. Pravda-Starov. Exponential return
to equilibrium for hypoelliptic quadratic systems. Journal of Functional
Analysis, 262 (2012) 4000-4039.
[60] G.C. Papanicolaou and S. R. S. Varadhan. Ornstein-Uhlenbeck process in a random potential. Comm. Pure Appl. Math., 38(6): 819-834, 1985.
[61] G.A. Pavliotis and A.M. Stuart. Multiscale methods, volume 53 of Texts
in Applied Mathematics. Springer, New York, 2008. Averaging and
homogenization.

[62] A. Pazy. Semigroups of linear operators and applications to partial differential equations. Applied Mathematical Sciences, 44. Springer-Verlag, New York, 1983.
[63] C. Prévôt and M. Röckner. A concise course on stochastic partial differential equations. Lecture Notes in Mathematics, 1905. Springer, Berlin, 2007.
[64] M. Reed and B. Simon. Methods of Modern Mathematical Physics I:
Functional Analysis. Academic Press, New York, 1980.
[65] L. Rey-Bellet and L. E. Thomas. Exponential convergence to non-equilibrium stationary states in classical statistical mechanics. Comm. Math. Phys., 225(2): 305-329, 2002.
[66] L. Rey-Bellet. Ergodic Properties of Markov Processes, in Quantum Open Systems II. The Markovian approach. Lecture Notes in Mathematics 1881. Berlin: Springer, (2006), pp. 1-39.
[67] C.P. Robert and G. Casella. Introducing Monte Carlo methods with R.
Use R!. Springer, New York, 2010.
[68] D. Ruelle. Smooth Dynamics and new theoretical ideas in Nonequilibrium Statistical Mechanics, J. Stat. Phys., Vol 95, Nos 1/2, 1999.
[69] J. Sjöstrand. Parametrices for pseudodifferential operators with multiple characteristics. Ark. Mat. (12), 85-130 (1974).
[70] D. W. Stroock. Probability theory. An analytic view. Second edition. Cambridge University Press, Cambridge, 2011.
[71] D.W. Stroock and S.R.S. Varadhan. On the support of Diffusion processes with application to the strong maximum principle. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. III: Probability theory, pp. 333-359. Univ. California Press, Berkeley, Calif., 1972.
[72] D.W. Stroock and S.R.S. Varadhan.Multidimensional diffusion processes. Springer, Berlin, 1979.
[73] H.J. Sussmann. An interpretation of stochastic differential equations as
ordinary differential equations which depend on the sample point. Bull.
Amer. Math. Soc. 83 (1977), no. 2, 296-298.


[74] L. Tierney. A note on Metropolis-Hastings kernels for general state spaces. Ann. Appl. Probab. 8 (1998), no. 1, 1-9.
[75] M. Turelli. Random environments and stochastic calculus. Theoret.
Population Biology 12 (1977), no. 2, 140-178.
[76] K. Twardowska. Wong-Zakai approximations for stochastic differential
equations. Acta Appl. Math. 43 (1996), no. 3, 317-359.
[77] C. Villani. Hypocoercivity. Mem. Amer. Math. Soc., 202(950), 2009.
[78] D.Williams. To begin at the beginning:. . . . Proc. Sympos., Univ.
Durham, Durham, 1980. pp. 1-55, Lecture Notes in Math., 851,
Springer, Berlin-New York, 1981.
[79] D. Williams. Probability with martingales. Cambridge Mathematical
Textbooks. Cambridge University Press, Cambridge, 1991.
[80] K. Yosida. Functional analysis. Classics in Mathematics. SpringerVerlag, Berlin, 1995.
[81] J. Zabczyk. Mathematical control theory. An introduction. Modern
Birkhauser Classics, Birkhauser Boston MA, 1995.
[82] R. Zwanzig. Nonlinear generalized Langevin equations. J. Stat. Phys.,
9(3):215220, 1973.
[83] R. Zwanzig . Problems in nonlinear transport theory in Systems far
from equilibrium, L. Garrido, Springer, New York (1980), pp 198225.

140

12 A Few Exercises

Warning: all the Markov chains mentioned in this exercise sheet are time-homogeneous Markov chains, unless otherwise stated.
1. Show that continuous functions are measurable.
2. Let $\{X_n\}_n$ be a sequence of i.i.d. random variables taking values in the set $\{1, \dots, m\}$ with $\mathbb{P}(X_j = k) = p_k$, where $p_1, \dots, p_m$ are positive numbers adding up to one. Let us fix one value $k \in \{1, \dots, m\}$ and define
$$Y_j = \begin{cases} 1 & \text{if } X_j = k\\ 0 & \text{otherwise.} \end{cases}$$
What can we say about $S_n = \frac{1}{n}\sum_{j=1}^n Y_j$ as $n \to \infty$?
3. Let $X_n$ be a sequence of i.i.d. random variables with values in $\{0,1\}$ and with $\mathbb{P}(X_n = 1) = p$ for some given $p \in [0,1]$. Let $S_n = \sum_{j=1}^n X_j$. Find an upper bound on
$$\mathbb{P}\left( \left| \frac{S_n}{n} - p \right| > \epsilon \right).$$
(I should have specified: not the trivial upper bound $\mathbb{P}\left( \left| \frac{S_n}{n} - p \right| > \epsilon \right) \le 1$, but an upper bound that depends on $\epsilon$.)
4. Let $b(t)$ be a standard Brownian motion. Set $\eta(t) = e^{-t} b(e^{2t})$. Prove that $\eta(t)$ is a wide-sense stationary process and find its autocorrelation function.
5. Prove that if $B(t)$ is a standard BM then $W(t) = \frac{1}{c} B(c^2 t)$ is a standard BM as well.
6. Consider the process $X_t = A \cos(\nu t + \phi)$, where $A$ and $\nu$ are two random variables with arbitrary joint distribution, while $\phi$ is uniformly distributed on $[0, 2\pi)$ and independent of both $A$ and $\nu$. Prove that if $A$ has finite first and second moment then $X_t$ is wide-sense stationary. All the above random variables are real valued. (If you want to make it a bit harder, prove that it is strictly stationary.)
7. Bernoulli-Laplace model of diffusion. There are two urns, say right and left, each of them containing $m$ balls. Out of the $2m$ total balls, $b$ are black and $2m - b$ are white. We will assume $b \le m$. At each time we take one ball from each urn and we exchange them. The state of the system can be described, for example, by keeping track of the number $k$ of black balls in the left urn. Find the transition probability of the Markov chain that describes the evolution of the number $k$.
8. Simple Random Walk. Let $\xi_i$ be i.i.d. random variables taking values in $\{-1, 1\}$. In particular $\mathbb{P}(\xi_i = 1) = p$ and $\mathbb{P}(\xi_i = -1) = 1 - p$, for all $i$ and for some $0 < p < 1$. Let $\zeta_n = \sum_{i=1}^n \xi_i$. Find the transition probabilities of the Markov chain $\zeta_n$.
9. Give an example of a set (and a chain) which is closed but not irreducible and, vice versa, of a set that is irreducible but not closed.
10. Show that in the Ehrenfest chain all states are recurrent.
11. Prove Lemma 3.30.
12. Let $X_n$ be a time-homogeneous MC on a finite state space $S$. Prove that if the state space $S$ is irreducible then there exists a unique stationary distribution $\pi$. Write an expression for $\pi$.
13. Queue. As we have all experienced at least once, a queue is a line where customers wait for services. Suppose that at most $n$ people are allowed in the queue, i.e. at each moment in time there can be at most $n$ people in the queue. Rules of the queue:
- If the queue has strictly less than $n$ customers, then with probability $\alpha$ a new customer joins the queue.
- If the queue is not empty, then with probability $\beta$ the head of the line is served and leaves the queue.
From which we deduce that the queue remains unchanged
- with probability $1 - \alpha$ if it is empty,
- with probability $1 - \beta$ if it is full (i.e. $n$ people waiting),
- with probability $1 - \alpha - \beta$ otherwise.
All other possibilities occur with zero probability. It should now be clear that this situation can be modelled using a time-homogeneous MC with finite state space $S = \{0, 1, \dots, n\}$.
(a) Write down the transition probabilities of the chain.
(b) Prove that the chain has a unique stationary distribution.
(c) Find the stationary distribution by solving the system of equations
$$\sum_{k \in S} \pi(k)\, p(k,j) = \pi(j)$$
and then imposing the normalization condition.


14. Prove that for a transient (time-homogeneous) MC the limiting behaviour of the transition probabilities is $p^n(x,y) \to 0$ as $n \to \infty$.
15. Prove, as a consequence of Theorem 3.35, that if $y$ is a recurrent state then
$$\frac{1}{n} \sum_{j=1}^{n} p^j(x,y) \longrightarrow \frac{\rho_{xy}}{\mathbb{E}_y T_y}.$$
Hint: use the bounded convergence theorem.


16. Consider a time-homogeneous Markov chain on a finite state space $S$ and let $P = (p(x,y))_{x,y \in S}$ be the transition matrix of the chain. The transition matrix, and hence the chain, is said to be regular if there exists a positive integer $k > 0$ such that $p^k(x,y) > 0$ for all $x, y \in S$. Clearly a regular Markov chain is irreducible. Consider a MC on a finite state space $S$ and prove the following: if for any $x$ and $y$ in $S$ there exists an integer $n > 0$ such that $p^n(x,y) > 0$, and there exists $z \in S$ such that $p(z,z) > 0$, then the chain is regular. (Notice that $k$ is independent of $x$ and $y$, whereas $n = n(x,y)$, i.e. it depends on the choice of $x$ and $y$.)
17. Regular chains on finite state spaces are very important. Indeed, if $X_n$ is a regular chain on a finite state space then the chain has exactly one stationary distribution, $\pi$, and
$$\lim_{n \to \infty} p^n(x,y) = \pi(y), \qquad \text{for all } x \text{ and } y \in S.$$
Consider the setting and notation of Example 4.1. Use Exercise 16 and the above statement to prove that if $\pi$ is not the uniform distribution on $S$ then the chain with transition matrix $P$ converges to $\pi$.
Hint: show that there exist two states $a, b \in S$ such that $q(a,b) > 0$ and $\pi(b) < \pi(a)$. Then look at $p(a,a)$.
18. Show that the chain produced by Algorithm 4.6 satisfies the detailed balance condition, i.e. it is $\pi$-reversible.

19. Consider the accept-reject method, Algorithm 4.4. Calculate the probability of acceptance of the sample $Y$ at each iteration. Let $N$ be the number of iterations needed before successfully producing a sample from the distribution $\pi$, i.e. the number of rejections before an acceptance. Calculate $\mathbb{P}(N = n)$ for all $n \ge 1$ and the expected value of $N$ as a function of $M$.
Hint: $N$ is a random variable with geometric distribution.
20. Write down the Metropolis-Hastings algorithm to simulate the normal distribution $N(0,1)$ using a proposal which is uniformly distributed on $[x - \delta, x + \delta]$, for an arbitrary $\delta > 0$. (As you can imagine, in computational practice it is important to choose a good $\delta$.)
21. Let $X_t$ be a continuous time Markov process. If the process is not time-homogeneous then the transition probabilities are functions of four arguments:
$$p(s, w, t, A) : \mathbb{R}^+ \times S \times \mathbb{R}^+ \times \mathcal{S} \to [0,1], \qquad 0 \le s \le t,$$
with
$$p(s, w, t, A) = \mathbb{P}(X_t \in A \,|\, X_s = w).$$
Any Markov process can be turned into a time-homogeneous one if we consider an extended state space: suppose $X_t$ with state space $S$ is not time-homogeneous and show that the process $Y_t = (t, X_t)$ with state space $\mathbb{R}^+ \times S$ is instead time-homogeneous. (Hint: just calculate the transition probabilities of $Y_t$.)
22. Let $\eta(t)$ be the process defined in Exercise 4. $\eta(t)$ is called the stationary Ornstein-Uhlenbeck process (O-U process). $\eta(t)$ is a Markov process and the transition probabilities of such a process have a density. Show that
$$\mathbb{P}(\eta_t = y \,|\, \eta_s = x) = \frac{1}{\sqrt{2\pi(1 - e^{-2(t-s)})}} \exp\left\{ -\frac{\left(y - x e^{-(t-s)}\right)^2}{2(1 - e^{-2(t-s)})} \right\}.$$
(Here $\mathbb{P}(\eta_t = y \,|\, \eta_s = x)$ denotes, with a slight abuse of notation, the density of the transition probability.)
23. Prove that if the coefficients $b$ and $\sigma$ of an Itô SDE (satisfying the conditions of the existence and uniqueness theorem) do not depend on time, then the solution of the SDE is a time-homogeneous Markov process.

24. Show, using the definition, that
$$\int_0^t s\, dW_s = t W_t - \int_0^t W_s\, ds.$$

25. Use the Itô formula to find the SDE (in the form (45)) satisfied by the following processes:
$$Y_t = 2 + t + e^{W_t} \qquad \text{and} \qquad Z_t = W_t^2 + B_t^2,$$
where $W_t$ and $B_t$ are two independent one-dimensional Brownian motions.
26. Calculate
$$\int_0^t W_s^2\, dW_s.$$

27. Consider the OU process
$$dX_t = -bX_t\,dt + \sigma\,dW_t, \qquad b, \sigma > 0.$$
Suppose $X_0 \sim N\left(0, \frac{\sigma^2}{2b}\right)$. Calculate the autocovariance function of $X_t$. Write the equation satisfied by $|X_t|^2$ and hence the equation for $\mathbb{E}|X_t|^2$. Using Gronwall's Lemma, estimate $\mathbb{E}|X_t|^2$. Now suppose $X_0$ is deterministic and calculate the covariance function of $X_t$.
28. Consider the geometric BM of Example 8.11. Assuming that the initial
datum X0 is independent of Wt , calculate the expected value of Xt .
29. Verify that $(X_1(t), X_2(t)) = (t, e^t B_t)$ solves
$$dX_1 = dt, \qquad dX_2 = X_2\,dt + e^{X_1}\,dB_t,$$
and that $(X_1(t), X_2(t)) = (\cosh B_t, \sinh B_t)$ solves
$$dX_1 = \tfrac{1}{2} X_1\,dt + X_2\,dB_t, \qquad dX_2 = \tfrac{1}{2} X_2\,dt + X_1\,dB_t,$$
where in all the above $B_t$ is a one-dimensional standard BM.
30. In this exercise I will not be very precise in defining the types of convergence involved, so I do not expect you to be rigorous in the solution, as far as the type of convergence is concerned. Consider a sequence of stochastic processes $\xi^k(t)$ such that
(a) for each $k \in \mathbb{N}$, $\xi^k_t$ is a Gaussian process with $\mathbb{E}\,\xi^k(t) = 0$ and $\mathbb{E}(\xi^k_t \xi^k_s) = d^k(t-s)$, where $d^k$ is a sequence of smooth functions that converge to the Dirac delta, $d^k \to \delta_0$ as $k \to \infty$;
(b) for each $k \in \mathbb{N}$, the paths $t \to \xi^k_t(\omega)$ are smooth for all $\omega$. The sequence of processes $\xi^k_t$ constitutes a smooth approximation to white noise $\xi_t$, i.e. in some sense $\xi^k_t \to \xi_t$ as $k \to \infty$.
- Now look at the Itô equation (66), namely
$$dX_t = d(t)X_t\,dt + f(t)X_t\,dW_t, \qquad X(0) = X_0, \tag{136}$$
the solution of which we have found to be given by equation (67).
- Consider the Stratonovich equation
$$d\tilde{X}_t = d(t)\tilde{X}_t\,dt + f(t)\tilde{X}_t \circ dW_t, \qquad \tilde{X}(0) = X_0, \tag{137}$$
and solve it (in order to do so you can first transform it into Itô form and then use one of the presented methods of solution).
- Now look at the equation
$$\dot{X}^k_t = d(t)X^k_t + f(t)X^k_t\,\xi^k_t, \qquad X^k(0) = X_0,\ k \in \mathbb{N}. \tag{138}$$
This is equation (136) when we replace white noise with the smooth approximants. Therefore, for each $k \in \mathbb{N}$ and for each $\omega$, this is a simple ODE. Solve it.
- The solution of (138) contains the integral $I^k(t) = \int_0^t f(s)\,\xi^k(s)\,ds$. Observe that $I^k(t)$ is Gaussian with mean zero. Calculate $\mathbb{E}(I^k_t I^k_s)$. With a formal calculation (assume that you can exchange limit and integral) show that $\mathbb{E}(I^k_t I^k_s) \to \mathbb{E}(I_t I_s)$, where $I_t$ is the Itô integral $I_t = \int_0^t f(s)\,dW_s$.
- Deduce that the solution of (138) converges to the solution of (137) (better, to a process which has the same distributions as the solution of (137)).
31. Let $P_t$ be a $C_0$-semigroup on a Banach space $\mathcal{B}$. Show that there exist constants $\omega \ge 0$ and $M \ge 1$ such that
$$\|P_t\|_{B(\mathcal{B})} \le M e^{\omega t}.$$
Hint: 1. Assume the following statement: if $P_t$ is a $C_0$-semigroup then there exists $t_0 > 0$ such that $\|P_t\|_{B(\mathcal{B})} \le M$ for all $0 \le t \le t_0$. 2. Observe that $M \ge 1$ (why?). 3. Set $\omega := (\log M)/t_0$.

32. Let $P_t$ be the semigroup defined in Example 9.3. Find the generator of $P_t$ in the space $C_b$.
33. Find the generator of the semigroup
$$T_t f(x) = \begin{cases} f(x+t) & \text{if } x + t \le 1\\ 0 & \text{if } x + t > 1 \end{cases}$$
defined on the space $C^1[0,1]$.
34. Show that the semigroup $P_t$ associated to a time-homogeneous Markov process is a contraction semigroup in $L^\infty$, i.e. $\|P_t f\|_\infty \le \|f\|_\infty$.
35. Let $X_t$ be a real valued wide-sense stationary process and denote by $C(t,s) = C(t-s) = \mathbb{E}(X_t X_s) - \mu^2$, $\mu := \mathbb{E}(X_t)$, its covariance function. Assume that $C(t) \in L^1(\mathbb{R}^+)$, i.e. $\int_{\mathbb{R}^+} |C(s)|\,ds < \infty$. Show that the process is ergodic in the following sense:
$$\lim_{T \to \infty} \mathbb{E}\left| \frac{1}{T} \int_0^T X_s\,ds - \mu \right|^2 = 0.$$
36. Transform the following Stratonovich SDEs into Itô SDEs, or vice versa:
- $X_t = X_0 + \int_0^t X_s^2\,ds + \int_0^t \cos(X_s) \circ dW_s$
- $dX_t = \sinh(X_t)\,dt + t^2\,dW_t$
- $dX_t = -X_t\,dt + 3X_t \circ dW_t$
37. Prove Lemma 10.7. Hint: under the assumptions of Lemma 10.7,
$$\lim_{\epsilon \to 0} \frac{1}{\epsilon} \int_t^{t+\epsilon} \mathbb{E} f(s, X_s)\,ds = f(t,x),$$
where $X_u$ is the solution of
$$dX_u = b(u, X_u)\,du + \sigma(u, X_u)\,dW_u, \qquad X_t = x.$$
Use this hint to prove the lemma. If you want to prove also the statement of the hint then you can use the same kind of arguments that we used in the proof of Theorem 10.4.
38. Prove that (106) $\Rightarrow$ (105) (work with mean zero functions).

39. Consider the process $X_t \in \mathbb{R}^d$ satisfying
$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dW_t,$$
where $W_t$ is a $d$-dimensional standard Brownian motion. Show that the measure with density $\rho(x) = e^{-V(x)}/Z$ (here $Z$ is a normalizing constant) is the only invariant measure for $X_t$. Let $v : \mathbb{R}^d \to \mathbb{R}^d$ be a smooth vector valued function, $v(x) = (v^i(x))_{i=1,\dots,d}$. Show that the invariant measure of the process
$$dX_t^v = \left(v(X_t^v) - \nabla V(X_t^v)\right)dt + \sqrt{2}\,dW_t$$
is still $\rho$ if and only if
$$\nabla \cdot (v\,\rho) = 0,$$
where $\nabla\cdot$ denotes divergence.
40. Let $P_t$ be a semigroup of bounded operators on a Banach space $\mathcal{B}$. Show that the condition
$$\lim_{t \to 0} P_t u = u \quad \text{for all } u \in \mathcal{B} \tag{139}$$
is equivalent to the map $\mathbb{R}^+ \ni t \to P_t u$ being continuous for every fixed $u \in \mathcal{B}$.
41. Prove the following:
(a) Let $A$ be a densely defined operator on a Hilbert space. Then $iA$ is self-adjoint iff $A$ is skew-adjoint.
(b) If $A$ is the generator of a $C_0$ group of unitary operators on a Hilbert space then $iA$ is self-adjoint.
42. Use the antisymmetry (in $L^2$) of the operator $\mathcal{L}_H$ defined in (116) to prove that the semigroup generated by $\mathcal{L}_H$ is norm preserving.
43. Consider the one dimensional process
$$dX_t = -X_t^2\,dt + \sigma\,dW_t, \qquad \sigma > 0.$$
(a) Recognize that this equation is of the form (101) for an appropriate function $V(x)$. Write $V(x)$ (you may assume $V(0) = 0$).
(b) Write the generator $\mathcal{L}$ of the process and its invariant density $\rho$ (i.e. the density of the invariant measure).
(c) Check that $\mathcal{L}$ is symmetric in $L^2_\rho$.


44. Consider the following one dimensional SDEs:
$$dX_t = -\alpha X_t\,dt + \sigma\,dW_t, \qquad dY_t = -rY_t\,dt + 2\,dB_t,$$
where $\alpha$, $\sigma$ and $r$ are strictly positive real numbers and $W_t$ and $B_t$ are one dimensional independent standard Brownian motions. Write the equation satisfied by
(a) $Z_t := X_t Y_t$;
(b) $g(X_t, Y_t)$, where $g$ is a twice differentiable function $g(x,y) : \mathbb{R}^2 \to \mathbb{R}$;
(c) $V_t := X_t^2 Y_t$.
45. Let $P_t$ be a Markov semigroup on a Banach space of real valued functions, $\mathcal{L}$ the generator of the semigroup and $\mu$ a reversible measure (on $\mathbb{R}$, and admitting density with respect to the Lebesgue measure) satisfying the spectral gap inequality with constant $\lambda$. Let $U(x)$ be a bounded (above and below) and measurable function on $\mathbb{R}$ and define a new measure $\nu$ with density
$$\nu(x) = \frac{1}{Z}\, e^{-U(x)} \mu(x),$$
where $Z$ is the normalizing constant. Notice that, due to the boundedness of $U$, if $f \in L^2_\mu$ then $f \in L^2_\nu$ as well (same for integrability), and also the other way around.
(a) Prove that, for any constant $a \in \mathbb{R}$ and any $f \in L^2_\nu$,
$$\int \left( f - \int f\,d\nu \right)^2 d\nu \le \int (f - a)^2\,d\nu.$$
(b) Prove that $\nu$ satisfies a spectral gap inequality with constant $\lambda\, e^{-2\,\mathrm{Osc}(U)}$, where $\mathrm{Osc}(U) = \sup U - \inf U$. Hint: use the fact that for any Borel set $A \subseteq \mathbb{R}$,
$$e^{-\mathrm{Osc}(U)}\, \mu(A) \le \nu(A) \le e^{\mathrm{Osc}(U)}\, \mu(A).$$
46. Solve the following SDEs:
- $dY_t = r\,dt + \sigma Y_t\,dB_t$ with initial condition $Y(0) = Y_0$;
- $dX_t = -X_t\,dt + e^{-t}\,dB_t$ with initial condition $X(0) = X_0$;
- $dX_t = 2X_t\,dt + 4X_t\,dB_t$ with $X_0 = 3$. For this last equation, calculate also $\mathbb{E}(X_t)$ (you can use the method explained in the solution of Exercise 28).

13 Solutions

Some of the following solutions are a bit sketchy.


1. By definition of continuity, the preimage of an open set is open. Since the open sets generate the Borel $\sigma$-algebra, preimages of Borel sets are Borel, i.e. continuous functions are measurable.
2. The r.v. $Y_j$ are i.i.d. (because they are measurable functions of the $X_j$'s) and integrable. Therefore, by the strong law of large numbers,
$$S_n \xrightarrow{a.s.} \mathbb{E}(Y_1) = \mathbb{E}\left(\mathbf{1}_{(X_1 = k)}\right) = \mathbb{P}(X_1 = k) = p_k.$$
3. $\mathbb{E}X_n = p$ and $\mathrm{Var}(X_n) = p(1-p) \le 1/4$. Therefore $\mathbb{E}(S_n/n) = p$ and $\mathrm{Var}(S_n/n) = p(1-p)/n \le 1/(4n)$. Using Chebyshev's inequality, the probability that we wanted to estimate is bounded above by $1/(4n\epsilon^2)$.
4. Clearly, $\mathbb{E}(\eta(t)) = 0$. As for the autocovariance function, suppose $t > s$. Then
$$\mathbb{E}(\eta(t)\eta(s)) = \mathbb{E}\left(e^{-t} b(e^{2t})\, e^{-s} b(e^{2s})\right) = e^{-(t+s)}\, e^{2s} = e^{-(t-s)},$$
having used the fact that $\mathbb{E}(b(t)b(s)) = t \wedge s$. Hence for any $t, s > 0$ we have $\mathbb{E}(\eta(t)\eta(s)) = e^{-|t-s|}$.
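A quick Monte Carlo sanity check of $\mathbb{E}(\eta(t)\eta(s)) = e^{-|t-s|}$, sampling $b$ at the two required times directly (a sketch; the time $t$, the lag $h$ and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
t, h, N = 1.0, 0.5, 200_000                # arbitrary time, lag and sample size
s = t + h

b_t = rng.normal(0.0, np.sqrt(np.exp(2 * t)), N)   # b(e^{2t}) ~ N(0, e^{2t})
b_s = b_t + rng.normal(0.0, np.sqrt(np.exp(2 * s) - np.exp(2 * t)), N)

eta_t = np.exp(-t) * b_t                   # eta(t) = e^{-t} b(e^{2t})
eta_s = np.exp(-s) * b_s

print(np.mean(eta_t * eta_s), "vs exact", np.exp(-h))
```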
5. We need to check that the definition of Example 2.7 is satisfied by $W(t)$. (i) is trivial. $W(t) - W(s) = (B(c^2 t) - B(c^2 s))/c$ is Gaussian, as it is the difference of jointly Gaussian random variables. It is mean zero and, assuming $s \le t$, its variance is
$$\mathbb{E}(W(t) - W(s))^2 = \mathbb{E}\left[(B(c^2 t) - B(c^2 s))/c\right]^2 = \frac{1}{c^2}\left[\mathbb{E}B(c^2 t)^2 + \mathbb{E}B(c^2 s)^2 - 2\,\mathbb{E}B(c^2 t)B(c^2 s)\right] = \frac{1}{c^2}\left(c^2 t + c^2 s - 2c^2 s\right) = t - s.$$
As for (iii), it follows after observing that if $(a,b)$ doesn't overlap $(d,e)$ then the same holds for $(c^2 a, c^2 b)$ and $(c^2 d, c^2 e)$.
6. First,
$$\mathbb{E}(A \cos(\nu t + \phi)) = \mathbb{E}(A \cos(\nu t)\cos\phi) - \mathbb{E}(A \sin(\nu t)\sin\phi) = \mathbb{E}(A \cos(\nu t))\,\mathbb{E}(\cos\phi) - \mathbb{E}(A \sin(\nu t))\,\mathbb{E}(\sin\phi).$$
Since $\mathbb{E}(\cos\phi) = \frac{1}{2\pi}\int_0^{2\pi} \cos z\,dz = 0$ and analogously $\mathbb{E}(\sin\phi) = 0$, we have that $\mathbb{E}X_t = 0$ if $\mathbb{E}(A\cos(\nu t)) < \infty$ and $\mathbb{E}(A\sin(\nu t)) < \infty$. To check that this is the case, recalling that for the joint probability density of $A$ and $\nu$ we have $f_{A,\nu}(x,y) = f_{\nu|A}(y|x)\, f_A(x)$, we can write
$$\mathbb{E}|A \cos(\nu t)| = \iint |x \cos(yt)|\, f_{A,\nu}(x,y)\,dx\,dy \le \iint |x|\, f_{\nu|A}(y|x)\, f_A(x)\,dx\,dy = \int |x|\, f_A(x)\,dx,$$
where in the last step we used $\int f_{\nu|A}(y|x)\,dy = 1$ for all $x$. Therefore $\mathbb{E}(A\cos(\nu t)) < \infty$ if $A$ has finite first moment, and the same thing can be checked for $\mathbb{E}(A \sin(\nu t))$. In the same way,
$$\mathbb{E}(X_t X_s) = \mathbb{E}\left(A^2 \cos(\nu t + \phi)\cos(\nu s + \phi)\right) = \frac{1}{2}\,\mathbb{E}\left(A^2 \cos(\nu(t+s) + 2\phi) + A^2 \cos(\nu(t-s))\right).$$
With steps analogous to what we have done before one can check that $\mathbb{E}(A^2 \cos(\nu(t+s) + 2\phi)) = 0$ if $A$ has finite second moment. Therefore $\mathbb{E}(X_t X_s) = \frac{1}{2}\,\mathbb{E}(A^2 \cos(\nu(t-s)))$ and hence it is a function of $t - s$ only.
7. If $w_l$ is the number of white balls in the left urn, and similarly for $w_r$, $b_l = k$ and $b_r$, the transition probabilities are as follows:
$$p(k, k+1) = \frac{w_l}{m}\,\frac{b_r}{m} = \frac{m-k}{m}\,\frac{b-k}{m},$$
$$p(k, k-1) = \frac{b_l}{m}\,\frac{w_r}{m} = \frac{k}{m}\,\frac{m-b+k}{m},$$
$$p(k, k) = \frac{b_l}{m}\,\frac{b_r}{m} + \frac{w_l}{m}\,\frac{w_r}{m} = \frac{k}{m}\,\frac{b-k}{m} + \frac{m-k}{m}\,\frac{m-b+k}{m}.$$
8. $p(j,k) = p$ if $k = j+1$ and $p(j,k) = 1-p$ if $k = j-1$. Otherwise $p(j,k) = 0$.
9. Suppose we have a two-state chain, i.e. $S = \{a, b\}$, with
$$p = \begin{pmatrix} 0 & 1\\ 0 & 1 \end{pmatrix}.$$
Then $S$ is closed but not irreducible, because there is no way of going from $b$ to $a$. Notice that the state $b$ is recurrent; in particular it is absorbing.
Now let $S = \{1, 2, 3\}$ and define
$$p = \begin{pmatrix} 0 & 1 & 0\\ 0.5 & 0 & 0.5\\ 0 & 0 & 1 \end{pmatrix}.$$
Then the set $A = \{1, 2\}$ is irreducible but not closed, as $2 \in A$ and $p(2,3) > 0$ but $3 \notin A$.
10. Because we have only $N + 1$ possible states, this is easily done by drawing the chain
$$0 \leftrightarrow 1 \leftrightarrow \dots \leftrightarrow N.$$
The state space $S$ is finite, closed and irreducible, therefore every state is recurrent. If you didn't want to sketch the graph of the chain, you could have observed that, from the definition of the transition probabilities, every state communicates with each of its two nearest neighbours (except $0$ and $N$, which communicate only with $1$ and $N-1$, respectively). Therefore $S$ is (finite) closed and irreducible. If you really want to complicate your life you could do it by induction on $N$.
11. If $\pi$ is a stationary distribution then
$$\sum_y \pi(y)\, p^n(y,x) = \pi(x).$$
Taking sums over $n$ on both sides (if $\pi(x) > 0$ the right hand side sums to $\infty$, while $\sum_n p^n(y,x) = \rho_{yx}/(1 - \rho_{xx})$) we get
$$\sum_y \pi(y)\, \frac{\rho_{yx}}{1 - \rho_{xx}} = \infty.$$
However $\rho_{yx} \le 1$, so
$$\infty = \sum_y \pi(y)\, \frac{\rho_{yx}}{1 - \rho_{xx}} \le \frac{1}{1 - \rho_{xx}} \sum_y \pi(y) = \frac{1}{1 - \rho_{xx}},$$
so it has to be $\rho_{xx} = 1$.
12. $S$ is finite and irreducible. If the whole state space $S$ is irreducible then it is also closed (this is not true for subsets of $S$, as we have seen in Exercise 9). Therefore all the states are recurrent, and stationary measures are constant multiples of each other. The chain being recurrent, we also know that
$$\mu_x(y) = \mathbb{E}_x \sum_{n=0}^{T_x - 1} \mathbf{1}_{(X_n = y)}$$
is a stationary measure. However, due to the finiteness of $S$, $\sum_y \mu_x(y) < \infty$. Therefore $\pi(y) = \mu_x(y) / \sum_y \mu_x(y)$ (which does not depend on the choice of $x$, why?).
13. (a) Transition probabilities:
$$p(k, k+1) = \alpha \ \text{ if } k < n, \qquad p(k, k-1) = \beta \ \text{ if } k > 0,$$
and also
$$p(k,k) = \begin{cases} 1 - \alpha & \text{if } k = 0\\ 1 - \beta & \text{if } k = n\\ 1 - \alpha - \beta & \text{if } 1 \le k \le n - 1. \end{cases}$$
Otherwise $p(j,k) = 0$.
(b) Sketching a graph of the chain shows immediately that all the states communicate with each other, so the chain is closed and irreducible, which on a finite state space implies that the chain has only one stationary measure.
(c) Expanding the system of equations gives
$$\pi(0) = (1-\alpha)\pi(0) + \beta\pi(1),$$
$$\pi(k) = \alpha\pi(k-1) + (1-\alpha-\beta)\pi(k) + \beta\pi(k+1), \qquad 1 \le k \le n-1,$$
$$\pi(n) = \alpha\pi(n-1) + (1-\beta)\pi(n).$$
This system has clearly infinitely many solutions (all multiples of each other), so we use $\pi(0)$ as a parameter. A solution is $\pi(k) = \pi(0)(\alpha/\beta)^k$. Now impose the normalization condition and get
$$\pi(k) = \frac{(\alpha/\beta)^k}{\sum_{i=0}^n (\alpha/\beta)^i}.$$
14. If $y$ is transient then $\sum_n p^n(x,y) < \infty \Rightarrow p^n(x,y) \to 0$. And $\sum_n p^n(x,y) < \infty$ because
$$\sum_{n=1}^{\infty} p^n(x,y) = \mathbb{E}_x N(y) = \frac{\rho_{xy}}{1 - \rho_{yy}} < \infty.$$
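As a cross-check of the closed form in Solution 13(c), here is a minimal numerical sketch: build the transition matrix, solve $\pi P = \pi$ as a left-eigenvector problem, and compare with $(\alpha/\beta)^k$ normalized. The values $n = 5$, $\alpha = 0.3$, $\beta = 0.4$ are arbitrary choices:

```python
import numpy as np

n, alpha, beta = 5, 0.3, 0.4               # assumed parameters, alpha + beta <= 1

P = np.zeros((n + 1, n + 1))
for k in range(n + 1):
    if k < n:
        P[k, k + 1] = alpha
    if k > 0:
        P[k, k - 1] = beta
    P[k, k] = 1 - P[k].sum()               # 1-alpha, 1-beta or 1-alpha-beta

# stationary distribution: normalized left eigenvector for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

r = (alpha / beta) ** np.arange(n + 1)     # closed form (alpha/beta)^k
print(np.allclose(pi, r / r.sum()))        # expected: True
```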

15. $0 \le N_n(y)/n \le 1$, so you can use the bounded convergence theorem, which gives
$$\mathbb{E}_x\left|\frac{N_n(y)}{n} - \frac{1}{\mathbb{E}_y T_y}\,\mathbf{1}_{(T_y < \infty)}\right| \to 0,$$
and hence also
$$\mathbb{E}_x\,\frac{N_n(y)}{n} \to \frac{1}{\mathbb{E}_y T_y}\,\mathbb{E}_x\,\mathbf{1}_{(T_y < \infty)} = \frac{\rho_{xy}}{\mathbb{E}_y T_y}.$$
16. Let $M = \max_{x,w \in S} n(x,w)$; then for any $x, y \in S$ we have
$$p^{2M}(x,y) \ge p^{n(x,z)}(x,z)\cdot\underbrace{p(z,z) \cdots p(z,z)}_{2M - n(x,z) - n(z,y)\ \text{times}}\cdot\, p^{n(z,y)}(z,y) > 0.$$
So $k = 2M$ is the integer we were after. In the above we need to consider $k = 2M$ instead of just $M$ to make sure that $k - n(x,z) - n(z,y) > 0$.
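In practice one can find such a $k$ by inspecting powers of the transition matrix; a small sketch (the example matrix is an arbitrary choice satisfying the hypotheses: it is irreducible and $p(0,0) > 0$):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],     # p(0,0) > 0, so the criterion applies
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

Q = np.eye(len(P))
for k in range(1, 50):
    Q = Q @ P
    if np.all(Q > 0):
        print("regular, k =", k)   # first power with all entries positive
        break
```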
17. We want to prove that the chain with transition matrix $P$ is regular. To this end we will show that the sufficient conditions of Exercise 16 are satisfied (unless $x \to \pi(x)$ is constant, i.e. unless $\pi$ is the uniform distribution). Recall that $Q$ is irreducible, hence $P$ is irreducible as well; therefore it is true that for all $x, y$ there exists $n = n(x,y) > 0$ such that $p^{n(x,y)}(x,y) > 0$. Therefore we only need to find a state $a \in S$ such that $p(a,a) > 0$. Let $M$ be the set $M = \{x \in S : \pi(x) = \max_{y \in S} \pi(y)\}$. Because $Q$ is irreducible there exist $a \in M$ and $b \in M^c$ such that $q(a,b) > 0$, and clearly by construction $\pi(a) > \pi(b)$. Notice also that from the definition of $P$, $p(x,y) \le q(x,y)$ for all $x \ne y$. Then
$$p(a,a) = 1 - \sum_{x \ne a} p(a,x) = 1 - \sum_{x \ne a,b} p(a,x) - p(a,b) \ge 1 - \sum_{x \ne a,b} q(a,x) - q(a,b)\,\pi(b)/\pi(a)$$
$$= 1 - \sum_{x \ne a} q(a,x) + q(a,b)\left[1 - \pi(b)/\pi(a)\right] = q(a,a) + q(a,b)\left[1 - \pi(b)/\pi(a)\right] \ge q(a,b)\left[1 - \pi(b)/\pi(a)\right] > 0.$$

18. The transition kernel of the chain produced with the M-H algorithm can be written as
$$p(x,y) = q(x,y)\,\alpha(x,y) + \delta_x(y) \int_{\mathbb{R}^N} (1 - \alpha(x,w))\, q(x,w)\,dw.$$
The detailed balance condition for $p$ and $\pi$ reads $\pi(x)p(x,y) = \pi(y)p(y,x)$, and is hence satisfied if and only if $\pi(x)q(x,y)\alpha(x,y) = \pi(y)q(y,x)\alpha(y,x)$. Setting $\gamma(x,y) = \pi(x)q(x,y)$, what we want to show is that $\gamma(x,y)\alpha(x,y) = \gamma(y,x)\alpha(y,x)$. The latter equality is easily verified:
$$\gamma(x,y)\alpha(x,y) = \gamma(x,y)\min\{1, \gamma(y,x)/\gamma(x,y)\} = \min\{\gamma(x,y), \gamma(y,x)\} = \min\{1, \gamma(x,y)/\gamma(y,x)\}\,\gamma(y,x) = \gamma(y,x)\alpha(y,x).$$
19. The probability of acceptance is $a = \mathbb{P}(U \le \pi(Y)/(M q(Y)))$, where $q$ denotes the proposal density; this can be calculated by using the law of total probability:
$$\mathbb{P}\left(U \le \frac{\pi(Y)}{M q(Y)}\right) = \int \mathbb{P}\left(U \le \frac{\pi(Y)}{M q(Y)} \,\Big|\, Y = y\right) q(y)\,dy = \int \frac{\pi(y)}{M q(y)}\, q(y)\,dy = \frac{1}{M},$$
having used the fact that $\pi$ is a probability density function. The random variable $N$ is geometrically distributed, with probability of success at each trial equal to $a$. Then $\mathbb{P}(N = n) = (1-a)^{n-1} a$ and $\mathbb{E}(N) = 1/a = M$.
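A minimal accept-reject sketch illustrating $\mathbb{E}(N) = M$. The target Beta(2,2), the uniform proposal on $[0,1]$ and the resulting bound $M = 1.5$ are arbitrary choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(2)

def pi(x):                      # target density Beta(2,2); sup pi = 1.5
    return 6.0 * x * (1.0 - x)

M = 1.5                         # bound: pi(x) <= M * q(x) with q = U(0,1)

def sample_one():
    n = 0
    while True:
        n += 1
        y = rng.uniform()       # proposal Y ~ q
        u = rng.uniform()       # U ~ U(0,1), independent of Y
        if u <= pi(y) / M:      # accept with probability pi(y) / (M q(y))
            return y, n

trials = np.array([sample_one()[1] for _ in range(50_000)])
print(trials.mean(), "should be close to M =", M)
```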
20. This is a M-H algorithm with acceptance probability $\alpha(x,y) = \min\{\exp[(x^2 - y^2)/2],\, 1\}$ and proposal kernel $q(x, \cdot) \sim \mathcal{U}(x - \delta, x + \delta)$.
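A sketch of the corresponding random-walk Metropolis sampler ($\delta = 1$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
delta, n_steps = 1.0, 100_000

x = 0.0
samples = np.empty(n_steps)
for i in range(n_steps):
    y = x + rng.uniform(-delta, delta)          # proposal ~ U(x - delta, x + delta)
    if rng.uniform() <= min(1.0, np.exp((x**2 - y**2) / 2)):
        x = y                                   # accept; otherwise stay at x
    samples[i] = x

print(samples.mean(), samples.var())            # should approach 0 and 1
```

A common heuristic is to tune $\delta$ so that the empirical acceptance rate is moderate (neither close to 0 nor to 1).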
21. Recall that the measure $\tilde{P}$ on sequence space obtained via the extension theorem is
$$\tilde{P}(\omega : \omega_i \in A_i,\ i = 0, \dots, N) = \mathbb{P}(X_0 \in A_0, \dots, X_N \in A_N),$$
where $\mathbb{P}$ is the probability measure on the space where the sequence $X_n$ is defined and the $A_i$'s are Borel sets of $\mathbb{R}$. Let us show the claim for the simplest cylinder set: let $C = \{\omega \in \mathbb{R}^{\mathbb{N}} : \omega_0 \in A\}$. Then $\theta^{-1}(C) = \{\omega \in \mathbb{R}^{\mathbb{N}} : \omega_1 \in A\}$. Therefore
$$\tilde{P}(C) = \mathbb{P}(X_0 \in A) \overset{\text{stationarity}}{=} \mathbb{P}(X_1 \in A) = \tilde{P}(\theta^{-1}(C)).$$
Now you can detail the proof for every set in $\mathbb{R}^{\mathbb{N}}$.


22. Let $x = (s, y) \in \mathbb{R}^+ \times S$ and $B = C \times A \in \mathcal{B}(\mathbb{R}^+) \otimes \mathcal{S}$. Then the transition function of the process $(t, X_t)$ is simply
$$q_t(x, B) = p(s, y, t+s, A)\, \mathbf{1}_C(t+s).$$
I suppose you will need to give this a thought.

23. Sketch: using the properties of Brownian motion,
$$\mathbb{P}(\eta_t \le y \,|\, \eta_s = x) = \mathbb{P}\left(e^{-t} W(e^{2t}) \le y \,|\, e^{-s} W(e^{2s}) = x\right) = \mathbb{P}\left(W(e^{2t}) \le e^t y \,|\, W(e^{2s}) = e^s x\right)$$
$$= \int_{-\infty}^{e^t y} \frac{1}{\sqrt{2\pi(e^{2t} - e^{2s})}} \exp\left\{-\frac{(z - x e^s)^2}{2(e^{2t} - e^{2s})}\right\} dz.$$
Now use a change of variables and conclude by observing that
$$\mathbb{P}(\eta_t = y \,|\, \eta_s = x) = \frac{\partial}{\partial y}\, \mathbb{P}(\eta_t \le y \,|\, \eta_s = x).$$

24. The Markovianity is a consequence of Theorem 8.13, so we only need to prove time-homogeneity, which means that we need to prove that
$$\mathbb{P}(X_u \in B \,|\, X_0 = x) = \mathbb{P}(X_{u+t} \in B \,|\, X_t = x), \qquad \forall B \in \mathcal{S},\ x \in S,\ t \ge 0.$$
Denote by $X^{x,t}(t+u)$ the solution of the equation
$$\phi(t+u) = x + \int_t^{t+u} b(\phi(s))\,ds + \int_t^{t+u} \sigma(\phi(s))\,dW_s \tag{140}$$
and by $X^{x,0}(u)$ the solution of the equation
$$\psi(u) = x + \int_0^u b(\psi(s))\,ds + \int_0^u \sigma(\psi(s))\,dW_s. \tag{141}$$
What we want to prove is that $X^{x,t}(u+t)$ has the same distribution as $X^{x,0}(u)$. To this end, notice that if $W_v$ is a standard BM then $\hat{W}_v = W_{t+v} - W_t$ is a standard BM as well (check), i.e. $\hat{W}$ has the same distribution as $W$. With this in mind, a simple change of variable concludes the proof. Indeed, the RHS of (140) can be rewritten as
$$\phi(t+u) = x + \int_0^u b(\phi(v+t))\,dv + \int_0^u \sigma(\phi(v+t))\,dW_{t+v}.$$
At this point, because $\hat{W}$ and $W$ have the same distribution, and the above is nothing but (141) when we replace $W$ with $\hat{W}$, it is clear by the uniqueness of the solution that $\phi(t+u)$ has the same distribution as $\psi(u)$, which is, $X^{x,t}(u+t)$ has the same distribution as $X^{x,0}(u)$.
25. Use the summation by parts formula $s_j\,\Delta W_{s_j} = \Delta(s_j W_{s_j}) - W_{s_{j+1}}\,\Delta s_j$ and sum over the points of the partition.

26. Apply the Itô formula with $g(t,x) = 2 + t + e^x$ and get
$$dY_t = \left(1 + \frac{1}{2} e^{W_t}\right)dt + e^{W_t}\,dW_t.$$
For $Z_t$ instead you get
$$dZ_t = 2\,dt + 2B_t\,dB_t + 2W_t\,dW_t.$$
27. For example by integration by parts: set $X_1 = W^2$, $X_2 = W$. Applying Itô's rule we have $d(W^2) = 2W\,dW + dt$ and, using the multiplication table, $d(W^2)\,dW = 2W\,dt$. Therefore
$$\int_0^t W_s^2\,dW_s = W_t^3 - \int_0^t 2W_s^2\,dW_s - \int_0^t W_s\,ds - \int_0^t 2W_s\,ds,$$
so that, readjusting,
$$\int_0^t W_s^2\,dW_s = \frac{W_t^3}{3} - \int_0^t W_s\,ds.$$

28. Before giving the solution I would like to remark that we will show that $N(0, \sigma^2/2b)$ is the stationary measure of the O-U process. Now the solution: from Example 8.10 we know that if $\mathbb{E}X_0 = 0$ then also $\mathbb{E}(X_t) = 0$ for all $t > 0$. Therefore $\mathrm{Cov}(X_t, X_s) = \mathbb{E}(X_t X_s)$. To calculate $\mathbb{E}(X_t X_s)$ we use Theorem 8.4: for $s \le t$,
$$\mathbb{E}(X_t X_s) = e^{-b(t+s)}\,\frac{\sigma^2}{2b} + \sigma^2 e^{-b(t+s)} \int_0^s e^{2bu}\,du = \frac{\sigma^2}{2b}\, e^{-b(t-s)}.$$
Using the Itô formula, we have
$$d(|X_t|^2) = 2X_t\,dX_t + \sigma^2\,dt = -2bX_t^2\,dt + 2\sigma X_t\,dW_t + \sigma^2\,dt,$$
so that
$$X_t^2 = X_0^2 - \int_0^t 2bX_s^2\,ds + 2\sigma \int_0^t X_s\,dW_s + \sigma^2 t.$$
Taking expectation on both sides we get
$$\mathbb{E}X_t^2 = \mathbb{E}X_0^2 - \int_0^t 2b\,(\mathbb{E}X_s^2)\,ds + \sigma^2 t.$$
You could solve the above equation, which in differential form is
$$\dot{u} = -2bu + \sigma^2, \qquad \text{having set } u(t) = \mathbb{E}X_t^2,$$
so that $u(t) = e^{-2bt} u(0) + \int_0^t e^{-2b(t-s)} \sigma^2\,ds = e^{-2bt} u(0) + \frac{\sigma^2}{2b}\left(1 - e^{-2bt}\right)$. In particular, if you only need an estimate,
$$\mathbb{E}X_t^2 \le e^{-2bt}\,\mathbb{E}X_0^2 + \frac{\sigma^2}{2b}.$$
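A sketch checking the formula for $u(t) = \mathbb{E}X_t^2$ by Euler-Maruyama. The parameters $b = 1$, $\sigma = 0.5$, the stationary initial condition $X_0 \sim N(0, \sigma^2/2b)$ (under which $\mathbb{E}X_t^2$ should stay constant), and the step and sample sizes are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
b, sigma = 1.0, 0.5                      # assumed parameters
T, n, paths = 2.0, 400, 20_000
dt = T / n

X = rng.normal(0.0, np.sqrt(sigma**2 / (2 * b)), paths)   # X_0 ~ N(0, s^2/2b)
for _ in range(n):                       # Euler-Maruyama: dX = -bX dt + s dW
    X = X - b * X * dt + sigma * np.sqrt(dt) * rng.normal(size=paths)

print(np.mean(X**2), "vs stationary value", sigma**2 / (2 * b))
```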

29. Clearly, $\mathbb{E}X_t = \mathbb{E}(X_0)\, e^{(r - \sigma^2/2)t}\, \mathbb{E}(e^{\sigma W_t})$. So we need to calculate $\mathbb{E}(e^{\sigma W_t})$. Set $Y_t = e^{\sigma W_t}$. Using the Itô formula,
$$dY_t = \sigma Y_t\,dW_t + \frac{1}{2}\sigma^2 Y_t\,dt,$$
so that
$$Y_t = Y_0 + \sigma \int_0^t Y_s\,dW_s + \frac{1}{2}\sigma^2 \int_0^t Y_s\,ds.$$
Taking expectation,
$$\mathbb{E}Y_t = \mathbb{E}Y_0 + \mathbb{E}\left(\sigma \int_0^t Y_s\,dW_s\right) + \frac{1}{2}\sigma^2 \int_0^t \mathbb{E}Y_s\,ds.$$
You can check that Theorem 8.4 can be applied to the second addend on the RHS of the above equation, so that $\mathbb{E}\int_0^t Y_s\,dW_s = 0$. This means that $\mathbb{E}Y_t$ satisfies the ODE
$$\dot{u}(t) = \frac{1}{2}\sigma^2 u(t), \qquad u(0) = 1,$$
hence $\mathbb{E}Y_t = \exp(\sigma^2 t/2)$. To conclude, $\mathbb{E}X_t = \mathbb{E}(X_0)\,e^{rt}$.
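A Monte Carlo sketch of $\mathbb{E}X_t = \mathbb{E}(X_0)\,e^{rt}$, using the explicit solution of the geometric BM; the values of $r$, $\sigma$, $X_0$ and $t$ are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
r, sigma, X0, t = 0.1, 0.3, 1.0, 1.0     # assumed parameters
N = 500_000

W_t = rng.normal(0.0, np.sqrt(t), N)
X_t = X0 * np.exp((r - sigma**2 / 2) * t + sigma * W_t)   # explicit GBM solution

print(X_t.mean(), "vs", X0 * np.exp(r * t))
```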


30. Just a straightforward application of the Itô chain rule for two-dimensional SDEs, with $g : \mathbb{R}^+ \times \mathbb{R}^2 \to \mathbb{R}^2$.
31. Step by step.
- Using the conversion formula, (137) is equivalent to the Itô equation
$$d\tilde{X}_t = \left(d(t) + \frac{1}{2}f^2(t)\right)\tilde{X}_t\,dt + f(t)\tilde{X}_t\,dW_t.$$
The solution to such an equation can be found by using Example 8.12 and it is
$$\tilde{X}_t = X_0\, e^{\int_0^t d(s)\,ds + \int_0^t f(s)\,dW_s}$$
(which is not the same as (67)).
- (138) is a simple ODE, for each fixed $\omega$, so
$$X^k(t) = X_0\, e^{\int_0^t d(s)\,ds + \int_0^t f(s)\,\xi^k(s)\,ds}.$$
- $\xi^k_s$ is mean zero and Gaussian and $f(s)$ is deterministic, so $f(s)\xi^k_s$ is Gaussian and $I^k_t$ is too; therefore $I^k_t$ can only converge to a Gaussian process. Assuming we can exchange limit and integral, because $I^k_t$ is mean zero, also the limiting process has mean zero. As for the covariance function:
$$\lim_{k} \mathbb{E}(I^k_t I^k_s) = \lim_{k} \mathbb{E}\int_0^t\!\!\int_0^s f(u)f(v)\,\xi^k_u \xi^k_v\,du\,dv = \lim_{k} \int_0^t\!\!\int_0^s f(u)f(v)\,\mathbb{E}(\xi^k_u \xi^k_v)\,du\,dv = \int_0^t\!\!\int_0^s f(u)f(v)\lim_k d^k(u-v)\,du\,dv = \int_0^{t \wedge s} f^2(u)\,du.$$
- Therefore $I^k_t$ converges to a Gaussian process which has the same mean and covariance as $I_t$. This means that $X^k_t$ converges to a process which has the same distributions as $\tilde{X}_t$.
32. $M \ge 1$ because $\|P_0\|_{B(\mathcal{B})} = 1$. Now any $t \in \mathbb{R}^+$ can be written as $t = n t_0 + r$ where $0 \le r \le t_0$. Therefore, using the semigroup property, we have
$$\|P_t\|_{B(\mathcal{B})} = \|P_{t_0}^n P_r\|_{B(\mathcal{B})} \le M^n M \le M\, M^{t/t_0} = M e^{\omega t}.$$
33. By the definition of generator,
$$\lim_{t \to 0^+} \frac{P_t f(x) - f(x)}{t} = \lim_{t \to 0^+} \left[\frac{e^{-mt} - 1}{t}\, f(x) + \frac{1 - e^{-mt}}{t} \int f\,d\mu\right] = -m f(x) + m \int f\,d\mu.$$

34. Using the definition of the generator we find $\mathcal{L}f = f'(x)$.


35. Simply, if $f \in B_m$ then
$$\sup_x \left|\int f(y)\, p_t(x, dy)\right| \le \|f\|_\infty\, \sup_x \int p_t(x, dy) = \|f\|_\infty.$$
36. Notice that the covariance function is a symmetric function, i.e. $C(t,s) = C(s,t)$. Therefore
$$\mathbb{E}\left|\frac{1}{T}\int_0^T X_s\,ds - \mu\right|^2 = \frac{1}{T^2}\,\mathbb{E}\left(\int_0^T X_t\,dt - \mu T\right)\left(\int_0^T X_s\,ds - \mu T\right) = \frac{1}{T^2}\int_0^T\!\!dt\int_0^T\!\!ds\,C(t,s)$$
$$\overset{\text{symmetry}}{=} \frac{2}{T^2}\int_0^T\!\!dt\int_0^t\!\!ds\,C(t,s) \le \frac{\text{const}}{T} \longrightarrow 0 \quad \text{as } T \to \infty,$$
where the last inequality holds because $\int_0^t |C(t-s)|\,ds \le \|C\|_{L^1(\mathbb{R}^+)}$.
37. The first becomes
$$X_t = X_0 + \int_0^t \left(X_s^2 - \frac{1}{2}\sin X_s \cos X_s\right)ds + \int_0^t \cos X_s\,dW_s.$$
The second remains unaltered, as the diffusion coefficient does not depend on $x$. The Itô form of the last one is $dX_t = \frac{7}{2} X_t\,dt + 3X_t\,dW_t$.
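A pathwise sanity check for the last conversion (a sketch, with arbitrary parameters): Euler-Maruyama on the Itô form $dX = \frac{7}{2}X\,dt + 3X\,dW$ should agree, up to discretization error, with $X_0\, e^{-t + 3W_t}$, which is the closed-form solution of the Stratonovich equation obtained by the ordinary chain rule:

```python
import numpy as np

rng = np.random.default_rng(7)
T, n, X0 = 1.0, 100_000, 1.0
dt = T / n

dW = rng.normal(0.0, np.sqrt(dt), n)
W_T = dW.sum()

X = X0
for dw in dW:                        # Euler-Maruyama on the Ito form
    X = X + 3.5 * X * dt + 3 * X * dw

exact = X0 * np.exp(-T + 3 * W_T)    # Stratonovich solution on the same path
print(X, "vs", exact)
```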
38. Let $X_u$, $u \ge t$, be the solution of (93). Using the Itô formula we have
$$df(X_u) = \frac{\partial f}{\partial x}(X_u)\, b(u, X_u)\,du + \frac{\partial f}{\partial x}(X_u)\, \sigma(u, X_u)\,dW_u + \frac{1}{2}\frac{\partial^2 f}{\partial x^2}(X_u)\, \sigma^2(u, X_u)\,du.$$
Integrating both sides and taking expectation:
$$\frac{1}{\epsilon}\left[\mathbb{E}(f(X_{t+\epsilon})) - f(x)\right] = \frac{1}{\epsilon}\,\mathbb{E}\int_t^{t+\epsilon} \frac{\partial f}{\partial x}(X_u)\, b(u, X_u)\,du + \frac{1}{2\epsilon}\,\mathbb{E}\int_t^{t+\epsilon} \frac{\partial^2 f}{\partial x^2}(X_u)\, \sigma^2(u, X_u)\,du.$$
Now just let $\epsilon \to 0$ and use the hint.
39. If
$$\int_{\mathbb{R}} f_t^2\,d\mu \le e^{-2\lambda t} \int_{\mathbb{R}} f^2\,d\mu,$$
then the function $t \to e^{2\lambda t} \int_{\mathbb{R}} f_t^2\,d\mu$ is decreasing. This means that
$$\frac{d}{dt}\left(e^{2\lambda t} \int_{\mathbb{R}} f_t^2\,d\mu\right) \le 0.$$
The above inequality, calculated at $t = 0$, gives the spectral gap inequality.

40. After you write the generator of the process $X_t^v$,
$$\mathcal{L}^v = \sum_{i=1}^d \left(v^i(x) - \frac{\partial V}{\partial x_i}\right)\frac{\partial}{\partial x_i} + \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2},$$
all the calculations are completely analogous to what you have seen in one dimension.
41. We use Exercise 31. Assuming (139), if $t, h \ge 0$ then by the semigroup property
$$\|P_{t+h} u - P_t u\| \le \|P_t\|\,\|P_h u - u\| \le C e^{\omega t}\,\|P_h u - u\| \overset{\text{by assumption}}{\longrightarrow} 0,$$
as $h \to 0$. Now do the same thing for $\|P_{t-h} u - P_t u\|$, $t \ge h \ge 0$. The converse implication is obvious.
42. We will use the Hille-Yosida Theorem (which guarantees in particular that the generator is densely defined and closed).
(a) If $iA$ is self-adjoint then ($D(A) = D(A^*)$ and) $(iA)^* = iA$, which implies $-iA^* = iA \iff A = -A^*$. The reverse implication is analogous.
(b) If $A$ is the generator of a group of unitary operators then in particular it is densely defined and closed, so, using the previous step, showing that $A$ is skew-adjoint will do. To this end, for all $x \in D(A)$,
$$A^* x = \lim_{t \to 0} \frac{U(t)^* x - x}{t} = \lim_{t \to 0} \frac{U(-t)x - x}{t} = -Ax.$$

43. $\mathcal{L}_H$ is antisymmetric in $L^2$, so
$$\langle h, \mathcal{L}_H h\rangle = -\langle \mathcal{L}_H h, h\rangle = -\langle h, \mathcal{L}_H h\rangle,$$
hence $\langle h, \mathcal{L}_H h\rangle = 0$. Using this fact (with $h_t := e^{t\mathcal{L}_H} h$),
$$\frac{d}{dt}\|e^{t\mathcal{L}_H} h\|^2 = 2\langle h_t, \mathcal{L}_H h_t\rangle = 0.$$
44. The potential is $V(x) = x^3/3$, the generator is
$$\mathcal{L} = -x^2 \frac{\partial}{\partial x} + \frac{\sigma^2}{2}\frac{\partial^2}{\partial x^2}.$$
The density of the invariant measure is $\rho(x) = e^{-2x^3/(3\sigma^2)}/Z$, where $Z$ is a normalization constant. To show the symmetry, see Example 10.17 at the beginning of Section 10.3.

45. Using the product rule in (a) and the multidimensional Itô formula in (b) and (c), we have:
(a) $d(X_t Y_t) = X_t\,dY_t + Y_t\,dX_t + dX_t\,dY_t$. However $dX_t\,dY_t = 0$ in this case, so
$$d(X_t Y_t) = -r X_t Y_t\,dt - \alpha Y_t X_t\,dt + 2X_t\,dB_t + \sigma Y_t\,dW_t.$$
(b) Using $dX_t\,dY_t = 0$, $dX_t\,dX_t = \sigma^2\,dt$ and $dY_t\,dY_t = 4\,dt$, we have
$$dg(X_t, Y_t) = \frac{\partial g}{\partial x}(X_t, Y_t)\,dX_t + \frac{\partial g}{\partial y}(X_t, Y_t)\,dY_t + \frac{\sigma^2}{2}\frac{\partial^2 g}{\partial x^2}(X_t, Y_t)\,dt + 2\frac{\partial^2 g}{\partial y^2}(X_t, Y_t)\,dt.$$
(c) Use the result of (b) and apply it to the function $g(x,y) = x^2 y$ to obtain
$$d(X_t^2 Y_t) = 2X_t Y_t\,dX_t + X_t^2\,dY_t + \sigma^2 Y_t\,dt.$$
46. All the inequalities I will write are assumed to hold for $f \in L^2 \cap D(\mathcal{L})$ (even though for (a) one can just consider functions in $L^2$).
(a) Just consider the function
$$H(a) = \int_{\mathbb{R}} (f - a)^2\,d\nu.$$
Calculate the derivative of $H(a)$ with respect to $a$:
$$\frac{d}{da} H(a) = -2\int_{\mathbb{R}} (f - a)\,d\nu = 0 \iff a = \int f\,d\nu.$$
Moreover $\frac{d^2}{da^2} H(a) = 2 > 0$, so this is a minimum.
(b) Using the hint and the result of point (a), one gets
$$\int \left(f - \int f\,d\nu\right)^2 d\nu \le \int \left(f - \int f\,d\mu\right)^2 d\nu \le e^{\mathrm{Osc}(U)} \int \left(f - \int f\,d\mu\right)^2 d\mu \le \frac{e^{\mathrm{Osc}(U)}}{\lambda}\left(-\langle \mathcal{L}f, f\rangle_\mu\right) \le \frac{e^{2\,\mathrm{Osc}(U)}}{\lambda}\left(-\langle \mathcal{L}f, f\rangle_\nu\right),$$
which concludes the proof.
47. Results:
- $Y_t = Y_0 \exp\left(\sigma B_t - \frac{1}{2}\sigma^2 t\right) + r \int_0^t \exp\left[\sigma(B_t - B_s) - \frac{1}{2}\sigma^2 (t-s)\right] ds$;
- $X_t = e^{-t} X_0 + e^{-t} B_t$, assuming $B_0 = 0$;
- $X_t = 3\exp[-6t + 4B_t]$ and $\mathbb{E}(X_t) = 3e^{2t}$.
