Notes Mainimp
Mathematics Department
a.y. 2013/2014
M. Ottobre
Contents

1 Basics of Probability
  1.1 Conditional probability and independence
  1.2 Convergence of Random Variables
2 Stochastic processes
  2.1 Stationary Processes
3 Markov Chains
  3.1 Defining Markovianity
  3.2 Time-homogeneous Markov Chains on countable state space
4 Markov Chain Monte Carlo Methods
  4.1 Simulated Annealing
  4.2 Accept-Reject method
  4.3 Metropolis-Hastings algorithm
  4.4 The Gibbs Sampler
7 Brownian Motion
  7.1 Heuristic motivation for the definition of BM
  7.2 Rigorous motivation for the definition of BM
  7.3 Properties of Brownian Motion
8 Elements of stochastic calculus and SDEs theory
  8.1 Stochastic Integrals
    8.1.1 Stochastic integral in the Itô sense
    8.1.2 Stochastic integral in the Stratonovich sense
  8.2 SDEs
    8.2.1 Existence and uniqueness
    8.2.2 Examples and methods of solution
    8.2.3 Solutions of SDEs are Markov Processes
  8.3 Langevin and Generalized Langevin equation
Appendices
A Miscellaneous facts
  A.1 Why can we think of white noise as the derivative of Brownian Motion
  A.2 Gronwall's Lemma
  A.3 Kolmogorov's Extension Theorem
C Laplace method
12 A Few Exercises
13 Solutions
1 Basics of Probability
X⁻¹(B) ∈ F for all B ∈ B. The expectation of X is defined as

E(X) = ∫_Ω X(ω) dP(ω),    (1)

so that

P(X ∈ B) = E(1_B),

where 1_B is the characteristic function of the set B. X is integrable if E(|X|) < ∞. As you can imagine, formula (1) tends to be quite impractical to use for calculations. However, if X has a density, then the following holds.
Lemma 1.1. With the notation introduced above, let g : Rⁿ → Rᵐ be a measurable function such that Y = g(X) is integrable. Then

E(Y) = ∫_{Rⁿ} g(x) f(x) dx

and in particular

E(X) = ∫_{Rⁿ} x f(x) dx.
f(x) = (1/√(2πσ²)) e^{−(x−m)²/(2σ²)} .
1.1 Conditional probability and independence
hence Cov(X, Y) = 0 and

Var(X + Y) = Var(X) + Var(Y).
Let us now recall the basic facts about conditional expectation.
E(X|𝒢) is the (a.s. unique) 𝒢-measurable random variable such that

∫_A E(X|𝒢) dP = ∫_A X dP,    for all A ∈ 𝒢.    (2)
1.2 Convergence of Random Variables
First, let us list the main modes of convergence that we will be using.

Definition 1.5. For simplicity let X_n, X : Ω → R be real valued random variables (but clearly all the definitions below apply to Rⁿ-valued r.v.) and let μ_n and μ be the laws of X_n and X, respectively. We will assume that μ_n and μ have a density² and with abuse of notation we will write dμ_n(x) = μ_n(x)dx and dμ(x) = μ(x)dx.

• X_n → X almost surely (a.s.) if X_n(ω) → X(ω) for a.e. ω ∈ Ω, i.e. if

P(lim_{n→∞} X_n = X) = 1.

• X_n → X in Lᵖ, where Lᵖ := Lᵖ(Ω; R), p ≥ 1, if

E|X_n − X|ᵖ → 0.

• X_n → X in probability if, for every ε > 0,

P(|X_n − X| > ε) → 0 as n → ∞.

• X_n → X in distribution (weakly), written X_n →ᴰ X, if E f(X_n) → E f(X) for every bounded continuous function f.

² We say that a probability measure μ on (Rⁿ, B) has a density if there exists a nonnegative function f(x) such that μ(B) = ∫_B f(x) dx for any B ∈ B.
An important fact to notice about weak convergence is that this definition makes sense even if the random variables that come into play are not
defined on the same probability space.
The above modes of convergence of r.v. are related as follows:

a.s. ⟹ in probability ⟹ in distribution

and

in Lᵖ ⟹ in probability.
The latter implication is a consequence of the Markov Inequality:

P(X ≥ c) ≤ E[g(X)]/g(c),    (4)

valid for any nonnegative nondecreasing function g, and of its consequence, the Chebyshev inequality:

P(|X − μ| ≥ c) ≤ Var(X)/c²,    where μ := EX.
Let us also recall the main convergence theorems.

Dominated Convergence Theorem (DCT). Let X_n → X a.s. and suppose |X_n| ≤ Y for all n, for some integrable random variable Y. Then

E(|X_n − X|) → 0    (5)

and in particular

E(X_n) → E(X).    (6)

Bounded Convergence Theorem (BCT). The same conclusions hold if |X_n| ≤ K for some constant K > 0.

Uniform Integrability Criterion (UIC). Let X_n be integrable with X_n → X a.s. Then (5) and (6) hold (with X integrable) if and only if X_n is uniformly integrable, i.e. for all ε > 0 there exists K > 0 s.t.

E(|X_n| 1_{(|X_n|>K)}) < ε,    for all n ∈ N.
Scheffé's Lemma. Let X_n, X be integrable and suppose X_n → X a.s. Then

E|X_n| → E|X|  ⟺  E(|X_n − X|) → 0.
Remark 1.6 (On the above theorems). In the DCT and in the BCT the inequalities |X_n| ≤ Y and |X_n| ≤ K are meant to hold pointwise, i.e. for all n and ω. In the DCT, (6) follows from (5) simply because |E(X_n − X)| ≤ E|X_n − X|. The BCT is clearly a consequence of the DCT. Both the BCT and the DCT hold even if the assumption on the a.s. convergence of X_n to X is replaced by convergence in probability. In this respect, if X_n, X are assumed to be integrable, it is clear that DCT and BCT are a consequence of the UIC. Last, because ||a| − |b|| ≤ |a − b|, the implication "⟸" in Scheffé's Lemma is trivial. The opposite implication is the nontrivial one.
Now some remarks about weak convergence. According to the famous Portmanteau Theorem (see for example [8]), convergence in distribution can be equivalently formulated as follows:

Theorem 1.7 (Portmanteau Theorem). Let X_n, X be real valued r.v. with distribution functions F_n(x) and F(x), respectively. Then

X_n →ᵈ X  ⟺  F_n(x) → F(x) at every point x at which F is continuous.

If X_n →ᵈ X and X is a constant (i.e. it is deterministic) then convergence in probability and weak convergence are equivalent.
Cramér's Theorem. Suppose X_n →ᵈ X and Y_n →ᵖ c, where c is deterministic. Then

i) X_n + Y_n →ᵈ X + c;

ii) X_n Y_n →ᵈ cX.
Theorem 1.10 (Strong Law of Large Numbers). Let {X_i}_{i∈N} be a sequence of i.i.d. integrable random variables with E(X_i) = μ and consider

S_n := (1/n) Σ_{i=1}^n X_i.

Then S_n → μ a.s., i.e.

P(lim_{n→∞} S_n = μ) = 1.
(1/n) Σ_{k=1}^n f(X_k) → E(f(X_1)) = ∫_0^1 f(x) dx,    a.s.

Therefore, once we can generate samples from the variables X_j, the quantity (1/n) Σ_{k=1}^n f(X_k) is, for large n, a good approximation of ∫_0^1 f(x) dx.
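This recipe is easy to try numerically. Below is a minimal Python sketch of the estimator above; the integrand f(x) = x² is an arbitrary illustrative choice, not one from the notes.

import numpy as np

rng = np.random.default_rng(0)

def mc_integral(f, n):
    # Average f over n i.i.d. samples X_1, ..., X_n ~ U[0, 1]; by the strong
    # law of large numbers this converges a.s. to the integral of f over [0, 1].
    x = rng.uniform(0.0, 1.0, size=n)
    return f(x).mean()

print(mc_integral(lambda x: x**2, 10**6))  # ~ 1/3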
Theorem 1.12 (Central Limit Theorem). Let {X_i}_{i∈N} be a sequence of i.i.d. square integrable random variables with E(X_i) = μ and Var(X_i) = σ². Then

√n (S_n − μ) →ᵈ N(0, σ²).

Comment. Notice that if we set S_n = Σ_{j=1}^n X_j, then E(S_n) = nμ and Var(S_n) = nσ². Therefore the above theorem is equivalently restated as

(S_n − E(S_n)) / [Var(S_n)]^{1/2} →ᵈ N(0, 1),

i.e.

lim_{n→∞} P( a < (S_n − nμ)/(σ√n) < b ) = (1/√(2π)) ∫_a^b e^{−x²/2} dx,    for all a ≤ b ∈ R.
2 Stochastic processes
or Z, unless otherwise stated. For a reason that will possibly be clearer after reading Remark 2.1 below, we sometimes consider the process as a whole and use the notation X := {X_t}_{t∈T}.

Now notice that

X_t(ω) = X(t, ω) : I × Ω → S.
If we fix ω ∈ Ω and look at the (non-random) map

I ∋ t ↦ X(t, ω) ∈ S,    for fixed ω,    (7)

then we are looking at the path X_t(ω) =: ω(t), i.e. we are observing everything that happens to the sample point ω from time say 0 to time t. If instead we fix t and we look at the map

Ω ∋ ω ↦ X(t, ω) ∈ Rⁿ,    for fixed t,

then this is a random variable, which gives us a snapshot of what is happening (although clearly not in a deterministic way) to all the sample points at the time t when we took the picture. This is the old chestnut of the Eulerian vs Lagrangian point of view. It is sometimes convenient to think of Ω as a set of particles and of the state space S as the physical state space (for example position-velocity), so that ω(t) is the path in state space followed by the particle ω and X_t(ω) represents, for fixed t, the state of the system at time t. This will be the point of view that we shall adopt when we talk about non-equilibrium statistical mechanics. In other cases it is more convenient to think of ω as an experiment, so that X_t(ω) is a realization of such an experiment at time t.
Remark 2.1. Formula (7) offers another perspective on stochastic processes: we can look at a stochastic process as a random map from the sample space Ω to the path space (S)^I := {maps from I to S}. More explicitly,

X : Ω → (S)^I
    ω ↦ ω(·),

i.e. to each sample point, or particle, we associate a function, its path.
Definition 2.2. Two stochastic processes {X_t}_t and {Y_t}_t taking values in the same state space are (stochastically) equivalent if P(X_t ≠ Y_t) = 0 for all t ∈ T. If {X_t}_t and {Y_t}_t are stochastically equivalent then {X_t}_t is said to be a version of {Y_t}_t (and the other way around as well). Given a stochastic process {X_t}_t, the family of distributions

P(X_{t₁} ∈ B₁, . . . , X_{t_k} ∈ B_k),

for all k ∈ N, t₁, . . . , t_k ∈ T and B₁, . . . , B_k ∈ S, are the finite dimensional distributions of the process {X_t}.
If two stochastic processes are equivalent then they have the same finite dimensional distributions; the converse is not true. Let us make some remarks about these two notions, starting with an example.

Example 2.3. Stochastically equivalent processes can have different realizations. Indeed, consider the processes {X_t}_{t∈[0,1]} and {Y_t}_{t∈[0,1]} defined as follows:

X_t ≡ 0 for all t,    and    Y_t = 0 if t ≠ τ, Y_t = 1 if t = τ,

where τ is a random variable with continuous distribution, τ : Ω → [0, 1], so that P(τ = a) = 0 for all a ∈ [0, 1]. In this case P(X_t ≠ Y_t) = P(t = τ) = 0. Therefore these two processes are stochastically equivalent, but the trajectory of X_t is continuous while the trajectory of Y_t has a discontinuity at t = τ.
One might wonder whether the finite dimensional distributions determine the process uniquely; in general the answer is no: indeed, the finite dimensional distributions don't say anything about the paths. However, the Kolmogorov extension Theorem (see for example [56, Theorem 2.1.5]) gives a consistency condition in order for a family of distributions to be the finite dimensional distributions of some stochastic process.
Definition 2.4 (Continuous processes). Let {X_t} be a continuous-time s.p. We will say that {X_t} is continuous if it has continuous paths, i.e. if the maps t ↦ X_t(ω) are continuous for a.e. ω ∈ Ω.
Given a function, we know how to check whether it is continuous or not. Such an exercise might not be so obvious when it comes to stochastic processes. Luckily for us, Kolmogorov provided us with a very useful criterion, which we present for real valued processes but which holds in more generality.

Theorem 2.5 (Kolmogorov's continuity criterion). Let {X_t}_{t≥0} be a s.p. with state space R. If there exist constants α, β > 0 and C ≥ 0 s.t.

E|X(t) − X(s)|^α ≤ C|t − s|^{1+β},    for all t, s ≥ 0,

then the process is continuous. More precisely, for any γ ∈ (0, β/α) and T > 0 and for almost every ω there exists a constant K = K(ω, γ, T) such that

|X(t, ω) − X(s, ω)| ≤ K|t − s|^γ,    for all 0 ≤ s, t ≤ T.
Definition 2.6. A filtration on (Ω, F) is a family of σ-algebras {E_t}_{t∈I} such that E_t ⊂ F for all t ∈ I and

E_s ⊂ E_t    if s ≤ t.
(1/√(2πt)) ∫_a^b e^{−x²/(2t)} dx .    (8)
2.1 Stationary Processes
We can now assume, without loss of generality, that E(X_t) = 0 for all t. Then, analogously,

E(X_{t+h} X_{s+h}) = ∬ xy dP(X_{t+h} ≤ x, X_{s+h} ≤ y) = ∬ xy dP(X_t ≤ x, X_s ≤ y) = E(X_t X_s),

from which we deduce that E(X_t X_s) depends only on the difference t − s. So strictly stationary ⟹ WSS. The converse is in general not true. However, for Gaussian processes strict stationarity is equivalent to WSS.
Clearly, the definition of stationarity can be given also for discrete-time
processes and it is completely analogous to the one given for continuous time
processes.
Definition 2.10. A sequence of random variables {X_n}_{n∈N} is strictly stationary, or simply stationary, if for every m ∈ N and every k ∈ N the vector (X_0, X_1, . . . , X_m) has the same distribution as (X_k, X_{1+k}, . . . , X_{m+k}).
The notion of stationarity will pop up several times during this course,
as stationary processes enjoy good ergodic properties.
3 Markov Chains
One of the most complete accounts on Markov chains is the book [53]. For
MC on discrete state space we suggest [15], which is the approach that we
will follow in this section.
3.1 Defining Markovianity
P(X_{n+1} ∈ B | X_0, . . . , X_n) = P(X_{n+1} ∈ B | X_n),    for all B ∈ S, n ∈ N.    (9)

However, if the state space is countable the above still holds almost everywhere; if the state space is finite then it holds pointwise.
Denoting by F_n the σ-algebra generated by X_0, . . . , X_n, by definition of conditional expectation (9) can be rewritten as

P(X_{n+1} ∈ B | F_n) = P(X_{n+1} ∈ B | X_n),    for all B ∈ S, n ∈ N.    (11)
Comment. Whether we look at (9) or (10), the moral of the above definition is always the same: suppose that X_n represents the position of a particle moving in state space and that at time n our particle is at point x in state space. In order to know where the particle will be at time n + 1 (more precisely, the probability for the particle to be at point y at time n + 1) we don't need to know the history of the particle, i.e. the positions occupied by the particle before time n. All we need to know is where we are at time n. In other words, given the present, the future is independent of the past. It could be useful, in order to better understand the idea of Markovianity, to compare it with its deterministic counterpart; indeed the concept of Markovianity is the stochastic equivalent of Cauchy's determinism: consider the deterministic system

ż(t) = f(z(t)),    z(t₀) = z₀.    (12)

We all know that under technical assumptions on the function f there exists a unique solution z(t), t ≥ t₀, to the above equation; i.e. given an evolution law f and an initial datum z₀ we can tell the future of z(t) for all t ≥ t₀. And there is no need to know what happened to z(t) before time t₀. Markovianity is the same thing, just reread in a stochastic way: for the deterministic system (12) we know exactly where z will be at time t; for a Markov chain, given an initial position (or an initial distribution) at time n₀, we will know the probability of finding the system in a certain state for every time n ≥ n₀.
Notation. If the chain is started at x then we use the notation

P_x(X_n ∈ B) := P(X_n ∈ B | X_0 = x);

if instead the initial position of the chain is not deterministic but drawn at random from a certain probability distribution μ (μ is a probability on S) then we write P_μ(X_n ∈ B). Clearly P_{δ_x} = P_x.
If the MC has initial distribution μ, the finite dimensional distributions of the chain take the form

P_μ(X_0 ∈ B_0, X_1 ∈ B_1, . . . , X_n ∈ B_n) = ∫_{B_0} μ(dy_0) ∫_{B_1} P(X_1 ∈ dy_1 | X_0 = y_0) ⋯ ∫_{B_n} P(X_1 ∈ dy_n | X_0 = y_{n−1}).    (13)
During this course we will be mainly concerned with a special class of MC, i.e. time-homogeneous MC. For these processes the probability of going from x to y in one time step depends only on x and y and not on when we are in x. In other words, the one-step transition probabilities do not depend on time.
Definition 3.2 (Time-homogeneous Markov Chain). A Markov chain (MC) {X_n} is time-homogeneous if

P(X_{n+1} ∈ B | X_n) = P(X_1 ∈ B | X_0),    for all n ≥ 0.    (14)
Remark 3.3. To be more precise, one can prove that the above formula (13) is equivalent to the Markov property (9) together with time-homogeneity (14):

(9) + (14)  ⟺  (13).

In words: a time-homogeneous Markov process has finite dimensional distributions of the form (13). Vice versa, if a stochastic process has finite dimensional distributions of the form (13), then it is a time-homogeneous Markov process.
In the rest of Section 3 (actually, in the rest of this entire course), we will always assume that we are dealing with time-homogeneous MC, unless otherwise explicitly stated. The Markov property and the time-homogeneity imply that we can write

P(X_{n+1} ∈ B | X_n = x) =: p(x, B).    (15)

Definition 3.4. A family of functions {p(x, B) : x ∈ S, B ∈ S} such that

1. p(x, ·) is a probability measure on (S, S) for every x ∈ S, and

2. p(·, B) is measurable for every B ∈ S,

is called a Markov transition function or a family of transition probabilities. If, in addition to 1 and 2, the relation (15) holds, p are the transition probabilities of the Markov chain X_n.
Using the transition probabilities we can rewrite the finite dimensional distributions (13) of the MC as

P_μ(X_0 ∈ B_0, X_1 ∈ B_1, . . . , X_n ∈ B_n) = ∫_{B_0} μ(dy_0) ∫_{B_1} p(y_0, dy_1) ⋯ ∫_{B_n} p(y_{n−1}, dy_n).
Therefore, to a Markov process we can associate a family of transition probabilities, i.e. a family of functions fulfilling the requirements of Definition 3.4. The above formula also says that once we have an initial distribution, the transition probabilities are all we need in order to know the evolution of the chain. This means that we could have introduced Markov processes working the other way around, i.e. starting from the transition probabilities, as the following theorem states.
Theorem 3.5. For any initial measure μ on (S, S) and for any family of transition probabilities {p(x, A) : x ∈ S, A ∈ S}, there exists a stochastic process {X_n}_{n∈N} such that

P_μ(X_0 ∈ B_0, X_1 ∈ B_1, . . . , X_n ∈ B_n) = ∫_{B_0} μ(dy_0) ∫_{B_1} p(y_0, dy_1) ⋯ ∫_{B_n} p(y_{n−1}, dy_n).    (16)
Thanks to Theorem 3.5, an alternative way (alternative to Definition 3.1) of defining a MC is as follows: we first assign a family of transition probabilities and then we say that X_n is a Markov Chain if

P(X_{n+1} ∈ B | F_n) = p(X_n, B),    for all n ∈ N and B ∈ S.
At this point, as before (see Remark 3.3), one can prove that the finite dimensional distributions of X_n are given by (16) and also, vice versa, that a process X_n with finite dimensional distributions given by (16) is a Markov chain with transition probabilities p(x, A).
Remark 3.6. Before reading this Remark you should revise the content of the Kolmogorov extension Theorem in Appendix A. Once we assign an initial distribution and a family of transition probabilities, the finite dimensional distributions (16) of the chain X_k (say each r.v. of the chain is real valued) are a consistent family of probability measures on Rⁿ. Therefore the Kolmogorov extension Theorem applies and guarantees the existence of a stochastic process with finite dimensional distributions (16).
3.2 Time-homogeneous Markov Chains on countable state space
In the remainder of this chapter we will assume that the Markov Chain X_n that we are dealing with is time-homogeneous and that it takes values on a countable state space S. Each x ∈ S is called a state of the chain. We endow S with the σ-algebra S of all the subsets of S. To fix ideas you can think of S = Z. The proofs that we will skip can be found in [15]. For this class of chains we have:

Transition probabilities: in this case it suffices to assign the transition matrix³ p = {p(x, y), x, y ∈ S}, where the map p : S × S → [0, 1] satisfies

Σ_{y∈S} p(x, y) = 1    and    p(x, y) ≥ 0,    for all x, y ∈ S.⁴

³ p will actually be a matrix if the state space is finite; it will be an infinite matrix if the state space is countable.

⁴ The first condition says that p is a stochastic matrix. Here we don't need to specify that p(x, y) is measurable in the first argument because with the chosen σ-algebra this is automatically true.

⁵ Observe that this equality contains both the loss of memory and the time-homogeneity.
p^{n+m}(x, y) = Σ_{z∈S} pⁿ(x, z) pᵐ(z, y).

p(k, k − 1) = 1    if k ≥ 1,
p(0, k) = a_k    for all k ≥ 0,
p(j, k) = 0    otherwise.
p(k, k − 1) = k/N    and    p(k, k + 1) = (N − k)/N,    for all 0 ≤ k ≤ N,
and

T_A := inf{n ≥ 1 : X_n ∈ A},

and T_y^k denotes the time of the k-th return to y, k ≥ 1.

Definition 3.11. The number of visits to y is

N(y) := Σ_{n=1}^∞ 1_{(X_n = y)}.
⁶ Most textbooks, especially the most control-theory oriented, will say that y is accessible from x if there exists n ∈ N, n ≥ 0, such that pⁿ(x, y) > 0, and then will say that x and y communicate if x is accessible from y and y is accessible from x. Notice that our definition is slightly different, as ρ_{xy} is defined through the time of first return, as opposed to being defined through the hitting time.
P_y(T_y^k < ∞) = ρ_{yy}^k.    (17)
Now a few definitions regarding the classification of the states of the chain.

Definition 3.12. We say that the state y ∈ S is

• Recurrent if ρ_{yy} = 1,

• Transient if ρ_{yy} < 1,

• Positive recurrent if ρ_{yy} = 1 and E_y(T_y) < ∞,

• Null recurrent if ρ_{yy} = 1 and E_y(T_y) = ∞.

A state y ∈ S is absorbent if P_y(X_1 = y) = 1.
Because for any given state y we can only either have ρ_{yy} = 1 or ρ_{yy} < 1, a state of a chain can only be either recurrent or transient. In the first case we know from (17) that P_y(T_y^k < ∞) = 1 for all k ≥ 1. Therefore the chain will return to y infinitely many times (more precisely, P(X_n = y i.o.) = 1). If instead y is transient then, on average, the chain will only return to y a finite number of times.

Theorem 3.13. A state y is recurrent if and only if E_y(N(y)) = ∞ and it is transient if and only if E_y(N(y)) < ∞.
Proof. Recalling the definition of N(y), Definition 3.11, we have

E_y(N(y)) = Σ_{k=0}^∞ P_y(N(y) ≥ k) = Σ_{k=0}^∞ P_y(T_y^k < ∞) =^{(17)} Σ_{k=0}^∞ ρ_{yy}^k,    (18)

which equals ∞ iff ρ_{yy} = 1, and equals 1/(1 − ρ_{yy}) < ∞ iff ρ_{yy} < 1.
recurrent. If ρ_{xy} > 0 then there exists h > 0 such that pʰ(x, y) > 0. Let h̄ be the smallest h for which pʰ(x, y) > 0. Then there exist z_1, . . . , z_{h̄−1} ∈ S such that

p(x, z_1) p(z_1, z_2) ⋯ p(z_{h̄−1}, y) > 0

and z_j ≠ x for all j (otherwise h̄ wouldn't be the smallest h for which pʰ(x, y) > 0). With this in mind, if ρ_{yx} < 1 we have

P_x(T_x = ∞) ≥ p(x, z_1) ⋯ p(z_{h̄−1}, y) P_y(T_x = ∞) = p(x, z_1) ⋯ p(z_{h̄−1}, y)(1 − ρ_{yx}) > 0,

which contradicts the recurrence of x; so it has to be ρ_{yx} = 1. Now let us prove that y is recurrent. To do so, we will show that E_y(N(y)) = ∞. Because ρ_{yx} > 0, there exists ℓ > 0 s.t. pˡ(y, x) > 0, and also recall that p^{h̄}(x, y) > 0. From the Chapman-Kolmogorov equation we have that
Corollary 3.16. If a chain is irreducible then either all the states are recurrent or all the states are transient.

What the following Theorem 3.17 and Corollary 3.18 prove is that in the case of a chain with finite state space there is a way to decide which of the two cases occurs: if S is finite, closed and irreducible then all the states are recurrent.

Theorem 3.17. If C is a finite closed set then C contains at least one recurrent state.

Corollary 3.18. If C is a finite set which is closed and irreducible then every state in C is recurrent.

Proof of Theorem 3.17. By contradiction, suppose all the states in C are transient. If this is the case then, acting as in (18), we have

E_x(N(y)) = ρ_{xy}/(1 − ρ_{yy}) < ∞.    (19)
So if we pick x ∈ C, we have

∞ > Σ_{y∈C} E_x(N(y)) = Σ_{n=1}^∞ Σ_{y∈C} pⁿ(x, y) = ∞,

a contradiction: since C is closed, Σ_{y∈C} pⁿ(x, y) = 1 for every n, so the double sum diverges.
Consider for example a chain on the state space S = {1, 2, 3, 4, 5, 6, 7} with transition matrix

        1    2    3    4    5    6    7
  1   0.3   0    0    0   0.7   0    0
  2   0.1  0.2  0.3  0.4   0    0    0
  3    0    0   0.5  0.5   0    0    0
  4    0    0    0   0.5   0   0.5   0
  5   0.6   0    0    0   0.4   0    0
  6    0    0    0    0    0   0.2  0.8
  7    0    0    0    1    0    0    0
The sets {1, 5} and {4, 6, 7} are irreducible closed sets so every state in these
sets is recurrent.
You might think that the fact that all the recurrent states are contained in one of the closed and irreducible sets of the chain is specific to the above example. The following theorem shows that this is not the case.

Theorem 3.20 (Decomposition Theorem). Let R := {r ∈ S : r is recurrent} be the set of all recurrent states of the chain. Then R can be written as the disjoint union of closed and irreducible sets, i.e.

R = ∪_i C_i.
Example 3.22. Consider the simple random walk from the exercise sheet. Then π(k) ≡ 1 is a stationary measure for the chain, as

Σ_k π(k) p(k, j) = π(j − 1) p(j − 1, j) + π(j + 1) p(j + 1, j) = (1 − p) + p = 1 = π(j).
Σ_{x∈S} π(x) p(x, y) = π(y),    (21)

where the last equality follows from the fact that p is a stochastic matrix.
Remark 3.25. Suppose that the Markov chain X_n with transition matrix p admits a stationary measure π and that X_0 ∼ π, i.e. the chain is started in stationarity. For every fixed n ∈ N we can consider the process {Y_m^n}_{0≤m≤n} defined as Y_m^n := X_{n−m}, i.e. Y_m^n is the time-reversed X_n. Then for every n ∈ N, Y_m^n is a time-homogeneous Markov chain with Y_0 ∼ π. To calculate the transition probabilities q(x, y) of Y_m^n we use Bayes' formula:

q(x, y) = P(Y_1 = y | Y_0 = x) = P(X_{n−1} = y | X_n = x) = P(X_n = x | X_{n−1} = y) π(y)/π(x) = p(y, x) π(y)/π(x).

If the DB condition holds, then q(x, y) = p(x, y) for all x, y, and in this case X_n is called time-reversible.
Now a very important definition, which holds for chains as well as for continuous time processes.

Definition 3.26. A Markov chain is said to be ergodic if it admits a unique stationary probability distribution. In this case, the invariant distribution is said to be the ergodic measure for the chain.
I would like to stress that, depending on the book you open, you might find slightly different definitions of ergodicity. However they all aim at describing the same intuitive idea. The next theorem says that as soon as we have a recurrent state x, we can construct a stationary measure μ_x(y) looking at the expected number of visits to y, i.e. the expected time spent by the chain in y.

Theorem 3.27. Recall that T_x := inf{n ≥ 1 : X_n = x}. If x is a recurrent state then the measure

μ_x(y) := E_x [ Σ_{n=0}^{T_x−1} 1_{(X_n = y)} ]

is stationary.
Proof of Theorem 3.27. We need to prove that

Σ_{y∈S} μ_x(y) p(y, z) = μ_x(z).

Let us study two cases separately, the case x ≠ z and the case x = z.

Case x ≠ z: To this end let us rewrite μ_x(y) = Σ_{n=0}^∞ P_x(X_n = y, T_x > n), so we have:

Σ_{y∈S} μ_x(y) p(y, z) = Σ_{n=0}^∞ Σ_{y∈S} P_x(X_n = y, T_x > n) p(y, z) = Σ_{n=0}^∞ P_x(X_{n+1} = z, T_x > n)
= Σ_{n=0}^∞ P_x(X_{n+1} = z, T_x > n + 1) = Σ_{n=1}^∞ P_x(X_n = z, T_x > n) = Σ_{n=0}^∞ P_x(X_n = z, T_x > n) = μ_x(z),

where we used that, for z ≠ x, the events {X_{n+1} = z, T_x > n} and {X_{n+1} = z, T_x > n + 1} coincide, and that the n = 0 term P_x(X_0 = z, T_x > 0) vanishes.

Case x = z: since μ_x(x) = 1,

Σ_{y∈S} μ_x(y) p(y, x) = Σ_{n=0}^∞ P_x(T_x = n + 1) = 1 = μ_x(x).
Theorem 3.28. If the chain is irreducible and recurrent then there exists a unique stationary measure, up to constant multiples.

Idea of Proof. Let R be the set defined in Theorem 3.20. The moral is the following: if we pick x ∈ C_i then the measure μ_x defined in Theorem 3.27 is a stationary measure. If we pick any other x̃ in the same set C_i, the corresponding measure μ_{x̃} is only a constant multiple of μ_x. Under our assumptions there is only one big recurrent irreducible set, the state space S. Therefore we can pick any x ∈ S and construct an invariant measure μ_x. At this point one proves that any other invariant measure, say ν, will be a constant multiple of μ_x, i.e. there exists a constant K > 0 such that

μ_x(y) = K ν(y)    for all y ∈ S.
Notice also that the total mass of μ_x is

Σ_{y∈S} μ_x(y) = Σ_{n=0}^∞ Σ_{y∈S} P_x(X_n = y, T_x > n) = Σ_{n=0}^∞ P_x(T_x > n) = E_x T_x.

Therefore, if x is positive recurrent, the normalized measure π(y) := μ_x(y)/E_x T_x is a stationary probability distribution, and in particular

π(x) = 1/(E_x T_x) > 0.
Comment. It is clear now that π(y) represents the probability for the chain to be in y as n → ∞. It is very important that convergence to π happens irrespective of the initial datum that we pick, i.e. in the limit the initial condition is forgotten. We will not prove the above theorem as we think that the meaning of the statement is already transparent enough: if there exists a unique stationary distribution, which follows from irreducibility and positive recurrence, the only thing that can prevent the process from converging to its unique candidate limit is periodicity. Once we rule that possibility out, the asymptotic behaviour of the chain can only be described by π. Another crucial observation to bear in mind is that (22) implies pⁿ(x, y) → π(y) as n → ∞, for all y. Because π(y) = 1/E_y T_y, this result is quite intuitive: the probability of being in y is asymptotically inversely proportional to the expected value
of the time of first return to y. And this brings us to the last result, which
is very much along the lines of time averages equal space averages.
Define N_n(y) to be the number of visits to y in the first n steps, i.e.

N_n(y) = Σ_{j=1}^n 1_{(X_j = y)}.

Theorem 3.35. Suppose y is a recurrent state of the chain. Then for every initial state x ∈ S of the chain, we have

N_n(y)/n → (1/E_y T_y) 1_{(T_y < ∞)},    P_x-a.s.
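Theorem 3.35 is easy to observe in simulation. The sketch below uses an arbitrary two-state chain (an illustrative choice, not from the notes) whose invariant distribution is π = (2/3, 1/3), so the fraction of time spent in state 0 should approach 1/E_0 T_0 = π(0).

import numpy as np

rng = np.random.default_rng(1)
p = np.array([[0.9, 0.1],
              [0.2, 0.8]])      # illustrative chain with invariant distribution (2/3, 1/3)
x, visits, n = 0, 0, 500_000
for _ in range(n):
    x = rng.choice(2, p=p[x])   # one step of the chain
    visits += (x == 0)
print(visits / n)               # ~ 2/3 = 1/E_0 T_0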
Remark 3.36 (On the nomenclature). In all of the above we have referred all our definitions to the chain X_n rather than to its transition probabilities, i.e. for example we said that the chain is recurrent or irreducible etc. However, one can equivalently refer all these definitions to the transition probabilities p and, in order to say e.g. that the chain is aperiodic, we can equivalently say that p is aperiodic.
Example 3.37. This example shows why aperiodicity is a much needed assumption in Theorem 3.34. Consider the Markov Chain on state space S = {a, b} with transition matrix

( 0  1 )
( 1  0 ).

This chain is clearly positive recurrent and irreducible, hence ergodic with invariant measure π(a) = π(b) = 1/2. However it is also periodic with period 2. It is easy to check that the result of Theorem 3.34 does not hold for this chain: indeed pⁿ(a, b) is one if n is odd and 0 if n is even (and analogously for pⁿ(b, a)). Therefore pⁿ doesn't converge at all.
4 Markov Chain Monte Carlo Methods

For the material of this section we refer to [74, 67, 2, 52, 30, 10].

In the previous section we have studied the basic theory of Markov chains on finite or countable state space. Most of this theory can be translated to general state space, but this is not what we want to do now. In this section we want to look at one of the main applications of the theory that we have seen so far: Markov Chain Monte Carlo methods (MCMC). The main idea is to construct an ergodic Markov chain X_n whose unique invariant distribution is a given target distribution π, so that

(1/n) Σ_{k=1}^n f(X_k) → ∫_S f(x) π(x) dx,    (23)

for every π-integrable function f (i.e. for every f such that ∫ |f(x)| π(x) dx is finite). We will defer to Section 5 a more thorough discussion of the limit (23), usually known as Ergodic Theorem. For the time being the important things to notice are: (i) the limit (23) is a strong law of large numbers (see Theorem 1.10); (ii) if we take f to be the indicator function of a measurable set then (23) says precisely that, in the limit, time averages equal space averages; most importantly for our purposes, (iii) for n large enough, the quantity on the LHS is a good approximation of the integral on the right (see Exercise 2).
⁷ However, MCMC should only be used when other direct analytical methods are not applicable to the situation at hand.

⁸ On a historical note, Monte Carlo (or also Ordinary Monte Carlo (OMC)) came before MCMC. OMC was pretty much a matter of statistics: if you could simulate from a given distribution, then you could simulate a sequence of i.i.d. random variables with that distribution and then use the Law of Large Numbers to approximate integrals. When the method was introduced, the sequence of i.i.d. random variables was not necessarily created as a stationary Markov chain.

⁹ Notice that as usual we are assuming that π has a density, which we keep denoting by π.
ii) To calculate the minima (or maxima) of functions, see Section 4.1.

The following example contains the main features of the sampling methods that we will discuss.

Example 4.1 (Landscape painting attempt). Let q(x, y) be a transition probability on a finite state space S. Suppose the transition matrix Q = (q(x, y)) is symmetric and irreducible. Given such a Q and a probability distribution π(x) on S such that π(x) > 0 for all x ∈ S, let us now construct a new transition matrix P = (p(x, y)) as follows:

p(x, y) = q(x, y) min{1, π(y)/π(x)}    if x ≠ y,
p(x, x) = 1 − Σ_{y≠x} p(x, y)    otherwise.    (24)
First of all, P = (p(x, y)) is a transition matrix: indeed, from the definition, Σ_y p(x, y) = p(x, x) + Σ_{y≠x} p(x, y) = 1. Moreover, P is irreducible¹⁰ by construction because Q is; being the state space finite, this also implies that P is recurrent and that there exists a unique stationary distribution.
We can easily show that such an invariant distribution is exactly π, as P is reversible with respect to π. To prove π-reversibility of P we need to show that π(x)p(x, y) = p(y, x)π(y). This is obviously true when x = y. So suppose x ≠ y:

i) if π(y) ≥ π(x): π(x)p(x, y) = π(x)q(x, y); but also π(y)p(y, x) = q(y, x) (π(x)/π(y)) π(y), so that using the symmetry of q we get π(y)p(y, x) = q(x, y)π(x) and we are done;

ii) if π(y) < π(x): clearly the same as above, with the roles of x and y reversed.

We are in good shape but we haven't yet proved convergence to π. This is left as an exercise, see Exercise 17. It turns out that convergence happens unless π is the uniform distribution on S.
Let us pause everything for a moment to think about what we have done: starting from a (pretty much) arbitrary transition kernel q, we have constructed a chain with transition kernel p that has π as limiting distribution. We will see that the "(pretty much)" can be (pretty much) removed. Now there is only one point left to address: how do we sample from the chain that has P as transition matrix? By using the following algorithm.

¹⁰ I.e. the whole state space is irreducible under P; this implies that the state space is also closed under P.
where δ_x(y) is the Kronecker delta. In view of Section 4.3 it is also useful to note that Algorithm 4.2 can be equivalently expressed as follows: given X_n = x_n,

1. generate y_{n+1} ∼ q(x_n, ·);

2. set X_{n+1} = y_{n+1} with probability α(x_n, y_{n+1}), and X_{n+1} = x_n otherwise.
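Here is a minimal Python sketch of Algorithm 4.2 on a finite state space; the three-state target π and the symmetric proposal Q below are arbitrary illustrative choices, not taken from the notes.

import numpy as np

rng = np.random.default_rng(2)

def metropolis_step(x, Q, pi):
    # One step of the chain of Example 4.1: propose y ~ q(x, .), then accept
    # with probability alpha(x, y) = min(1, pi(y)/pi(x)), otherwise stay at x.
    y = rng.choice(len(pi), p=Q[x])
    alpha = min(1.0, pi[y] / pi[x])
    return y if rng.uniform() < alpha else x

pi = np.array([0.2, 0.5, 0.3])
Q = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
x, counts = 0, np.zeros(3)
for _ in range(200_000):
    x = metropolis_step(x, Q, pi)
    counts[x] += 1
print(counts / counts.sum())   # empirical frequencies ~ pi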
4.1 Simulated Annealing

We now address the point ii) listed at the beginning of this section. Suppose again we are in the case in which the state space S is finite, and again in the setting of Example 4.1. Let us now apply the reasoning (24) to the case when the target distribution is

π_ε(x) = e^{−H(x)/ε} / Z_ε,    Z_ε = Σ_{x∈S} e^{−H(x)/ε},    (25)

where H : S → R is a given function and ε > 0.
If Q = (q(x, y)) is a symmetric and irreducible transition matrix, the prescription (24) now becomes

p_ε(x, y) = q(x, y) min{1, π_ε(y)/π_ε(x)}    if x ≠ y,
p_ε(x, x) = 1 − Σ_{y≠x} p_ε(x, y)    otherwise.

Notice that to simulate the Markov Chain with transition probability p_ε(x, y) we don't need to know a priori the value of the normalization constant Z_ε, as only the ratio π_ε(y)/π_ε(x) = e^{−(H(y)−H(x))/ε} enters the algorithm. We now know that for n large enough, p_ε^n ≈ π_ε.
Now observe that the definition of π_ε remains unaltered if the function H is modified through an additive constant c: if instead of H we consider H̃(x) = H(x) + c for some constant c, then

π̃_ε(x) = e^{−c/ε} e^{−H(x)/ε} / (e^{−c/ε} Σ_{x∈S} e^{−H(x)/ε}) = π_ε(x).

We can therefore assume, without loss of generality, that min H = 0. Denoting by M := {x ∈ S : H(x) = 0} the set of minima of H, as ε → 0 we have e^{−H(x)/ε} → 1_{(x∈M)}, hence Z_ε → K := |M| and

π_ε(x) → 1_{(x∈M)}/K,

i.e. π_ε(x) → 0 for every x ∉ M. In this way, if ε is small enough, for large n the chain will be in one of the minima with very high probability.
Instead of fixing a small ε, we can decrease ε at each step. This way we obtain a non-homogeneous Markov Chain, which we haven't discussed. However, when ε is interpreted as the temperature of the system, decreasing ε along the run corresponds to slowly cooling the system, which is where the name simulated annealing comes from.
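A minimal sketch of the annealing loop just described; the toy energy H and the logarithmic cooling schedule ε_n = 1/log(n + 2) are illustrative assumptions, not prescriptions from the notes. Note that only energy differences H(y) − H(x) enter, so Z_ε is never computed.

import numpy as np

rng = np.random.default_rng(3)

def annealing_run(H, Q, n_steps):
    # H[x] is the energy of state x; Q is a symmetric irreducible proposal matrix.
    x = 0
    for n in range(n_steps):
        eps = 1.0 / np.log(n + 2)       # slowly decreasing "temperature"
        y = rng.choice(len(H), p=Q[x])
        # acceptance probability min(1, pi_eps(y)/pi_eps(x)) = min(1, exp(-(H[y]-H[x])/eps))
        if rng.uniform() < min(1.0, np.exp(-(H[y] - H[x]) / eps)):
            x = y
    return x

H = np.array([1.0, 0.0, 2.0, 0.5])      # toy landscape; unique minimum at state 1
Q = np.full((4, 4), 1/3)
np.fill_diagonal(Q, 0.0)
print(annealing_run(H, Q, 50_000))      # typically prints 1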
4.2 Accept-Reject method
Example 4.1 and Algorithm 4.2 give an idea of what we are aiming for. However, let us take a step back and start again where we began. The aim of the game is sampling from a given probability distribution. So suppose we want to produce sample outcomes of a real valued random variable X with density function π(x).¹¹ The idea is to use an auxiliary density function γ(x) which we know how to sample from.

The method is as follows: we start by producing two samples, independent of each other, Y ∼ γ and U ∼ U_{[0,1]}, where U_{[0,1]} denotes the uniform distribution on [0, 1]. If U ≤ π(Y)/(M γ(Y)) then we accept the sample and set X = Y; otherwise we start again. The algorithm is as follows.

Algorithm 4.4 (Accept-Reject).

1. generate Y ∼ γ and U ∼ U_{[0,1]}, independent of each other;

2. if U ≤ π(Y)/(M γ(Y)), set X = Y; otherwise return to step 1.
It is clear from the above that there are only two constraints on π(x) and γ(x) in order for this method to be applicable:

• the auxiliary density is such that γ(x) > 0 if π(x) > 0;

• there exists a constant M > 0 such that π(x)/γ(x) ≤ M for all x.

Now we need to check that the samples generated this way are actually distributed according to π. Because

P(X ≤ a) = P(Y ≤ a | U ≤ π(Y)/(M γ(Y)))    for all a ∈ R,

¹¹ This sampling method works in higher dimensions as well; here we present it in one dimension just for ease of notation.
it is enough to show that

P(Y ≤ a | U ≤ π(Y)/(M γ(Y))) = ∫_{−∞}^a π(y) dy / ∫_R π(y) dy    for all a ∈ R.
This is a simple calculation, once we recall that for any two real valued random variables W and Z, with joint probability density f_{W,Z}(w, z), we have

P(W ∈ A | Z ∈ B) = ∫_B ∫_A f_{W,Z}(w, z) dw dz / ∫_B ∫_R f_{W,Z}(w, z) dw dz.
In our case it is clearly f_{Y,U} = γ(y) · 1, as Y and U are independent, so using the above equality we obtain

P(Y ≤ a | U ≤ π(Y)/(M γ(Y))) = ∫_{−∞}^a dy ∫_0^{π(y)/Mγ(y)} γ(y) du / ∫_R dy ∫_0^{π(y)/Mγ(y)} γ(y) du
= ∫_{−∞}^a dy π(y) / ∫_R dy π(y) = P(X ≤ a).
Remark 4.5. Notice that to use the accept-reject method we don't need to know the normalization constant for the density function π, i.e. we don't need to know Z_π = ∫ π(x) dx. However, we do need to know the constant M. Indeed, even if M cancels in the calculation above, so that it can be arbitrary, we still need to know its value in order to decide whether to accept or reject the sample Y (see step 2 of Algorithm 4.4). Moreover, M is a measure of the efficiency of the algorithm. Indeed, the probability of acceptance at each iteration is exactly 1/M, see Exercise 19. Therefore in principle we would like to choose a distribution γ that makes M as small as possible.

One last, probably superfluous, observation: this method has nothing to do with Markov chains. The reason why we presented this algorithm should be clear in view of Example 4.1. The next section is actually about sampling methods that exploit Markov Chains.
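A minimal sketch of Algorithm 4.4; the unnormalized Gaussian target, the Cauchy auxiliary density and the bound M = 4 are illustrative assumptions (here sup π/γ ≈ 3.81, so M = 4 is admissible).

import numpy as np

rng = np.random.default_rng(4)

def accept_reject(target, aux_density, aux_sampler, M, n):
    # Draw n samples with density proportional to `target` (which may be
    # unnormalized), using step 2 of Algorithm 4.4 as the acceptance test.
    out = []
    while len(out) < n:
        y = aux_sampler()
        if rng.uniform() <= target(y) / (M * aux_density(y)):
            out.append(y)
    return np.array(out)

target = lambda x: np.exp(-x**2 / 2)            # Z_pi = sqrt(2*pi) is never used
cauchy = lambda x: 1.0 / (np.pi * (1 + x**2))   # auxiliary density gamma
samples = accept_reject(target, cauchy, rng.standard_cauchy, 4.0, 10_000)
print(samples.mean(), samples.std())            # ~ 0 and ~ 1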
4.3 Metropolis-Hastings algorithm

Before reading this section, read again Example 4.1. However, this time our state space is R^N.¹²

¹² This algorithm can be presented and studied on a general metric space, see [74].

On a technical note, Lemma 4.7 can be made a bit more general (see [74]): first recall that we are assuming that q(x, ·) and π(·) are probability densities with respect to the Lebesgue measure; having said that, we can consider the set R = {x, y ∈ S : π(y)q(y, x) > 0 and π(x)q(x, y) > 0} and prove that the chain produced by Algorithm 4.6 satisfies the detailed balance condition if and only if α(x, y) = 0 on Rᶜ.
will in general be of the form q(x, y) = Z_x⁻¹ q̃(x, y), with ∫ dy q̃(x, y) = Z_x, so that the ratio in the acceptance probability (26) can be more explicitly written as

π(y) Z_x q̃(y, x) / (π(x) Z_y q̃(x, y)).
Clearly the choice of the proposal q is crucial in order to improve the efficiency of the algorithm. We will make more remarks on this point later on. For the moment we want to explore some special cases of M-H, namely:

1. when q(x, y) depends on y only, i.e. q(x, y) = q(y), in which case the corresponding M-H algorithm is called independence sampler;

2. when q is symmetric, i.e. q(x, y) = q(y, x), and in particular when q is the transition density of a random walk, so that q(x, y) = q(|x − y|) (see Example 3.8); in this case the resulting M-H algorithm is better known as Random Walk Metropolis (RWM).

Let us start with the independence sampler.
1. Independence Sampler. If the proposal kernel is independent of the current state of the chain, i.e. q(x, y) = q(y) with q(y) a probability density, then the acceptance probability looks like

α(x, y) = min{ 1, π(y) q(x) / (π(x) q(y)) },

so in this case we need to know neither the normalization constant for the target π nor the one for the proposal q (see Remark 4.8). Algorithm 4.6 becomes
Algorithm 4.9 (Independence Sampler). Given X_n = x_n,

1. generate y_{n+1} ∼ q(·);

2. set X_{n+1} = y_{n+1} with probability α(x_n, y_{n+1}), and X_{n+1} = x_n otherwise.
If in general it makes sense to compare the M-H algorithm with the accept/reject method, it makes even more sense to compare the independence sampler with Algorithm 4.4, as they look very much alike. Let us spend a couple of words to stress the differences between these two: first of all, bear in mind that the independence sampler produces a chain X_n, the outcomes of which are π-distributed only for large n, while with accept/reject each sample is π-distributed. In both cases the proposal y_{n+1} is generated independently of the value that had been produced at the previous iteration.
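As a concrete instance of case 2 above, here is a minimal Random Walk Metropolis sketch on R; since q is symmetric it cancels in the acceptance ratio. The target π(x) ∝ e^{−x⁴} and the step size are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(5)

def rwm(log_target, x0, step, n):
    # Random Walk Metropolis: propose y = x + step * N(0,1); the symmetric
    # proposal cancels, so alpha = min(1, pi(y)/pi(x)). Working with log pi
    # is numerically safer, and the normalization constant drops out.
    x, chain = x0, np.empty(n)
    for i in range(n):
        y = x + step * rng.standard_normal()
        if np.log(rng.uniform()) < log_target(y) - log_target(x):
            x = y
        chain[i] = x
    return chain

chain = rwm(lambda x: -x**4, x0=0.0, step=1.0, n=100_000)
print(chain.mean(), chain.var())   # mean ~ 0 by symmetry of the target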
4.4 The Gibbs Sampler

Suppose we have a pair of random variables (X, Y) with joint density f_{X,Y}, that we want to sample from the marginal f_X(x), and that we know how to sample from the conditional distributions f_{X|Y} and f_{Y|X}. The two-stage Gibbs sampler proceeds as follows.

Algorithm 4.11 (Two-stage Gibbs sampler). Start from an initial value y_0; for n = 0, 1, 2, . . . generate

1. X_n ∼ f_{X|Y}(x | Y_n = y_n);

2. Y_{n+1} ∼ f_{Y|X}(y | X_n = x_n).
Yes, I know: at first sight it seems impossible that this algorithm does anything at all. Let us explain what it does and why it works. First of all, Algorithm 4.11 produces two Markov chains, X_n and Y_n. We are interested in sampling from f_X(x), so the chain of interest to us is the chain X_n. One can prove that the chain X_n has f_X as unique invariant distribution and that, under appropriate assumptions on the conditional distributions f_{X|Y} and f_{Y|X}, the chain converges to f_X. We shall illustrate this fact on a relatively simple but still meaningful example.
Example 4.12. Suppose X and Y are marginally Bernoulli random variables, so that we work in the finite state space S = {0, 1}. Suppose the joint distribution of X and Y is assigned as follows:

( f_{X,Y}(0, 0)  f_{X,Y}(1, 0) )   ( a₁  a₂ )
( f_{X,Y}(0, 1)  f_{X,Y}(1, 1) ) = ( a₃  a₄ ),    a₁ + a₂ + a₃ + a₄ = 1.

The marginal distribution of X is then

f_X = [f_X(0)  f_X(1)] = [a₁ + a₃    a₂ + a₄].    (28)
The conditional distributions are encoded in the matrices

F_{Y|X} = ( a₁/(a₁ + a₃)   a₃/(a₁ + a₃) )
          ( a₂/(a₂ + a₄)   a₄/(a₂ + a₄) ),

F_{X|Y} = ( a₁/(a₁ + a₂)   a₂/(a₁ + a₂) )
          ( a₃/(a₃ + a₄)   a₄/(a₃ + a₄) ),

i.e. (F_{Y|X})_{ij} = P(Y = j | X = i) and (F_{X|Y})_{ij} = P(X = j | Y = i), i, j ∈ {0, 1}. Starting from the above matrices we can construct the matrix

F_{X|X} = F_{Y|X} F_{X|Y}.
Such a matrix is precisely the transition matrix for the Markov chain X_n. Indeed, by construction, the sequence X_0 → Y_1 → X_1 is needed to construct the first step of the chain X_n, i.e. the step X_0 → X_1. This means that the transition probabilities of X_n are precisely

P(X_1 = x_1 | X_0 = x_0) = Σ_y P(X_1 = x_1 | Y_1 = y) P(Y_1 = y | X_0 = x_0).

Therefore the entries of the matrix (F_{X|X})ⁿ are the transition probabilities pⁿ(x_0, x) = P(X_n = x | X_0 = x_0). Let the distribution of X_n be represented by the row vector

fⁿ = [fⁿ(0)  fⁿ(1)] = [P(X_n = 0)  P(X_n = 1)],
so that we have

fⁿ = f⁰ (F_{X|X})ⁿ = f^{n−1} F_{X|X},    n ≥ 1.    (29)
If all the entries of the matrix F_{X|X} are strictly positive then the chain X_n is irreducible and, because the state space is finite, there exists a unique invariant distribution, which for the moment we call ρ = [ρ(0)  ρ(1)]. However, if all the entries of the transition matrix are strictly positive then the chain is also regular (see Exercise 16), hence fⁿ does converge to the distribution ρ as n → ∞,¹⁴ irrespective of the choice of the initial distribution for the chain X_n (which, in the case of the algorithm at hand, means irrespective of the choice of y_0). Taking the limit as n → ∞ on both sides of (29) then gives

ρ = ρ F_{X|X}.

Because there is only one invariant distribution, the solution of the above equation is unique. It is easy to check that the marginal f_X defined in (28) satisfies the relation f_X = f_X F_{X|X} (check it, by simply using the explicit expressions for f_X, F_{Y|X} and F_{X|Y}), and hence f_X must be the invariant distribution.
On a very formal level, one could understand Gibbs sampling as a
Metropolis-Hastings algorithm with acceptance probability constantly equal
to 1.
A generalization of what we have done with two variables is the following. Suppose we have m ≥ 2 variables X_1, . . . , X_m with joint distribution f_{X_1,...,X_m}(x_1, . . . , x_m) and suppose we want to sample from the marginal

f_1(x_1) = ∫ f_{X_1,...,X_m}(x_1, . . . , x_m) dx_2 . . . dx_m.

With the notation

f_{1|2,...,m}(x_1 | x_2, . . . , x_m) = f_{X_1|X_2,...,X_m}(x_1 | x_2, . . . , x_m),

and similarly for f_{j|1,...,j−1,j+1,...,m}, the multi-step Gibbs sampler is as follows.
¹⁴ From (22) we know that pⁿ(x, y) → ρ(y) for all x ∈ S. Because the chain X_n is started with X_0 ∼ f⁰(x) = f_{X|Y}(x | Y = y_0), this implies that lim_n fⁿ(x) = Σ_{z∈S} (lim_n pⁿ(z, x)) f_{X|Y}(z | Y = y_0) = ρ(x).

Algorithm 4.13 (Multi-step Gibbs sampler). Given x⁽ⁿ⁾ = (x_1⁽ⁿ⁾, . . . , x_m⁽ⁿ⁾), generate

1. X_1⁽ⁿ⁺¹⁾ ∼ f_{1|2,...,m}(x_1 | x_2⁽ⁿ⁾, . . . , x_m⁽ⁿ⁾);

2. X_2⁽ⁿ⁺¹⁾ ∼ f_{2|1,3,...,m}(x_2 | x_1⁽ⁿ⁺¹⁾, x_3⁽ⁿ⁾, x_4⁽ⁿ⁾, . . . , x_m⁽ⁿ⁾);

3. X_3⁽ⁿ⁺¹⁾ ∼ f_{3|1,2,4,...,m}(x_3 | x_1⁽ⁿ⁺¹⁾, x_2⁽ⁿ⁺¹⁾, x_4⁽ⁿ⁾, . . . , x_m⁽ⁿ⁾);

...

m. X_m⁽ⁿ⁺¹⁾ ∼ f_{m|1,...,m−1}(x_m | x_1⁽ⁿ⁺¹⁾, . . . , x_{m−1}⁽ⁿ⁺¹⁾).
The principle behind Algorithm 4.13 is the same that we have illustrated
for Algorithm 4.11.
More details about the material of this section can be found in [15, 26, 13].
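To see Algorithm 4.11 and Example 4.12 in action, here is a minimal two-stage Gibbs sketch for a bivariate Bernoulli pair; the joint weights a₁, . . . , a₄ below are arbitrary illustrative values.

import numpy as np

rng = np.random.default_rng(6)

# Joint f_{X,Y} on {0,1}^2 as in Example 4.12: a[y][x] = f_{X,Y}(x, y).
a = np.array([[0.1, 0.2],    # a1 = f(0,0), a2 = f(1,0)
              [0.3, 0.4]])   # a3 = f(0,1), a4 = f(1,1)

def sample_x_given_y(y):
    return rng.choice(2, p=a[y] / a[y].sum())        # f_{X|Y}(. | Y = y)

def sample_y_given_x(x):
    return rng.choice(2, p=a[:, x] / a[:, x].sum())  # f_{Y|X}(. | X = x)

y, counts = 0, np.zeros(2)
for _ in range(100_000):
    x = sample_x_given_y(y)    # step 1 of Algorithm 4.11
    y = sample_y_given_x(x)    # step 2 of Algorithm 4.11
    counts[x] += 1
print(counts / counts.sum())   # ~ f_X = [a1 + a3, a2 + a4] = [0.4, 0.6]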
In this section we want to give some more details about the limit (23). In
particular, we want to link the definition of ergodicity that we have given
for Markov processes with the definition that you might have already seen
in the context of dynamical system theory.
5.1 Dynamical Systems
Definition 5.1. Let (Ω, F, μ) be a probability space and θ : Ω → Ω a measurable map, i.e. θ⁻¹(A) ∈ F for all A ∈ F. If θ_*(μ) = μ, i.e. if

μ(θ⁻¹(A)) = μ(A),    for all A ∈ F,    (30)

then we say that the measure μ is invariant under θ and that θ is measure preserving, i.e. that θ preserves the measure μ.

For a given map θ there will be in general more than one measure which is preserved by θ.
Two finite measures μ and ν on the same measurable space (E, E) are said to be mutually singular if there exist two disjoint sets A, B ∈ E such that A ∪ B = E and μ(A) = ν(B) = 0.
5.2

X_n(ω) := X(θⁿ(ω)),    n ∈ N.    (31)

It is easy to check that the sequence of r.v. X_n constructed as above is a stationary sequence. Indeed, let A be the set A := {ω : (X_0(ω), . . . , X_m(ω)) ∈ B}, for some arbitrary Borel set B in R^{m+1}. Then for all k ≥ 0

P{(X_k(ω), . . . , X_{m+k}(ω)) ∈ B} = P(ω : θᵏ(ω) ∈ A) = P(θ⁻ᵏ(A)) = P(A) = P{(X_0(ω), . . . , X_m(ω)) ∈ B}.
We have hence just proved the following lemma.

Lemma 5.7. Let P be a measure on (Ω, F), θ a map that preserves P and X a F-measurable, R-valued random variable. Then the sequence (31) is a stationary sequence.
Comment. Lemma 5.7 is important because, at the cost of changing measure space, any stationary sequence can be represented in the form (31). Let us be more clear about this: we have already seen (see Remark 3.6) that any Markov Chain, and hence in particular any stationary Markov Chain, can be represented as the coordinate map in the sequence space R^N, endowed with the σ-algebra generated by cylinder sets and with the measure P constructed by using the Kolmogorov extension Theorem. So suppose that X_n is a stationary MC, X_n : Ω → R. The corresponding process on R^N which has the same distribution as X_n is the process Y_n : R^N → R such that Y_n(ω) = ω_n, ω = (ω_0, ω_1, . . . ) ∈ R^N. This means that Y_n will be stationary on (R^N, R^N, P). If we consider the r.v. Y : R^N → R defined as Y(ω) = ω_0 and the shift map θ : R^N → R^N defined as θ(ω_0, ω_1, . . . ) = (ω_1, ω_2, . . . ), then one can prove that θ preserves P (you can check it by yourself, using the fact that X_n is stationary, see Exercise ??) and indeed the sequence Y_n(ω) is nothing but Y(θⁿ(ω)). Clearly everything we have said still holds if we substitute R with any Polish space, i.e. if the random variable X takes values in a Polish space or in a subset of a Polish space.
Suppose from now on that the chain X_n takes values in a discrete space S. If X_n admits an invariant distribution π (i.e. a stationary distribution in the sense of Definition 3.21) and we start the chain with X_0 ∼ π, then the chain is stationary, i.e. X_n ∼ π for all n ≥ 0. Denote by P_π the corresponding measure on path space obtained via the Kolmogorov Extension Theorem. Then, by the comment above, P_π is invariant (in the sense of Definition 5.1) for the shift map θ. The pair (θ, P_π), together with the space (S^N, S^N), is called the canonical dynamical system associated with the Markov chain X_n which has π as invariant distribution and that is started at π, i.e. X_0 ∼ π.

For a given initial stationary distribution π, the measure P_π is the only θ-invariant measure such that

P_π(ω ∈ S^N : ω_i ∈ A_i, i = 1, . . . , m) = P_π(X_1 ∈ A_1, . . . , X_m ∈ A_m).

Now suppose the invariant distribution π is such that π(x) > 0 for all x (which is in no way restrictive, since any state with π(x) = 0 is irrelevant to the process as it cannot be visited). Under this assumption it is possible to show that the shift map θ is ergodic if and only if the chain X_n is irreducible (see for example [9, Section 3] for a proof in finite state space). Therefore, if the chain is irreducible there exists a unique invariant distribution for X_n, i.e. X_n is ergodic (in the sense of Definition 3.26).
Theorem 5.8 (Ergodic Theorem for stationary Markov Chains). Let X_n be a stationary Markov chain. Then for any integrable function f the limit

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} f(X_k)

exists a.s. If moreover the chain is ergodic, then

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} f(X_k) = E(f(X_0)).
Take f = 1_A for some A ∈ F. Then the above limit says exactly that, asymptotically, space averages equal time averages.
We shall talk more about ergodicity and the link between Markov Processes and dynamical systems in the context of diffusion theory. For the
time being I would like to mention the following result, which will be useful
later on.
Lemma 5.9. A measure preserving map on a probability space, θ : (Ω, F, P) → (Ω, F, P), is ergodic if and only if

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} P(θ⁻ᵏ(A) ∩ B) = P(A) P(B),    for all A, B ∈ F.

Definition 5.10. A measure preserving map θ is mixing if

lim_{n→∞} P(θ⁻ⁿ(A) ∩ B) = P(A) P(B),    for all A, B ∈ F.
Start by noticing that mixing is stronger than ergodicity, i.e. mixing ⟹ ergodic. To have a better intuition about the definition of a mixing system, and the difference with an ergodic one, suppose for a moment that the map θ is invertible; if this is the case, the definition (30) is equivalent to

P(θ(A)) = P(A)    for all A ∈ F.

Therefore

P(θ⁻ⁿ(A) ∩ B) = P(θⁿ((θ⁻ⁿA) ∩ B)) = P(A ∩ θⁿ(B)),

which means that if θ is invertible then the limit of Definition 5.10 is equivalent to

lim_{n→∞} P(A ∩ θⁿ(B)) / P(A) = P(B),    for all A, B ∈ F.    (32)
Now let us think of an example that will be recurrent in the following: milk and coffee. Suppose we have a mug of coffee and we add some milk. At the beginning the milk is all in the region B. To fix ideas suppose P(B) = 1/3 (which means that, after adding the milk, my mug will contain 1/3 of milk and 2/3 of coffee). The stain of milk will spread around and at time n it will be identified by θⁿ(B). The property (32) then means that asymptotically (i.e., in the long run) the proportion of milk that I will find in whatever subset A of the mug is precisely 1/3. The limit of Lemma 5.9 says instead that such a property is only true asymptotically on average.
p_t(x, B) := P(X_{t+s} ∈ B | X_s = x),    B ∈ S and t, s ≥ 0.    (33)

If a family of probability kernels p_t(x, B) satisfies

p_t(x, B) = P(X_t ∈ B | X_0 = x)    for all t ≥ 0,    (34)

for some time-homogeneous Markov process X_t, then p is the transition function of the process X_t.

Notice that in order for the transition probabilities to satisfy (34) for t = 0 one needs to have

p_0(x, B) = δ_x(B) = 1 if x ∈ B, 0 if x ∉ B.

In particular, for t = 0 the transition function can't have a density.
As in the case of discrete time, once an initial datum (or an initial distribution) for the process is chosen, the transition probabilities uniquely determine the Markov process. We now prove the Chapman-Kolmogorov relation.

Theorem 6.4 (Chapman-Kolmogorov). Let X_t be a time-homogeneous Markov process. Then, for all 0 ≤ s < u < t and B ∈ S, we have

P(X_t ∈ B | X_s) = ∫_S P(X_u ∈ dy | X_s) P(X_t ∈ B | X_u = y)    (35)

= ∫_S P(X_{u−s} ∈ dy | X_0) P(X_{t−u} ∈ B | X_0 = y).    (36)

We specify that the first equality is true for any Markov process; the second is due to time-homogeneity.

Proof of Theorem 6.4. Since (35) holds for any Markov process, we only need to prove (35), as (36) follows from (35) when X_t is time-homogeneous. Let F_t be the filtration generated by X_t. If s < u < t then F_s ⊂ F_u ⊂ F_t. We will use this fact, together with Markovianity and the properties of conditional expectation, in order to prove (35):

P(X_t ∈ B | X_s) = E[1_{(X_t ∈ B)} | F_s] = E[E[1_{(X_t ∈ B)} | F_u] | F_s] = E[P(X_t ∈ B | X_u) | X_s].

Now, because for every measurable function φ we can express the conditional expectation as E(φ(X_t) | X_s) = ∫ φ(y) P(X_t ∈ dy | X_s), we obtain the desired result.
If the transition functions have a density, i.e. if p_s(x, dy) = p_s(x, y) dy for all x ∈ S, we can also write

p_{t+s}(x, z) = ∫_S p_s(x, y) p_t(y, z) dy.    (39)
7 Brownian Motion
Brownian Motion was observed in 1827 by the botanist Robert Brown, who first noticed the peculiar properties of the motion of a pollen grain suspended in water. What Brown saw looked something like the path in Figure 1 below. In 1900 Louis Bachelier modeled such a motion using the theory of stochastic processes, obtaining results that were rediscovered, although in a
7.1 Heuristic motivation for the definition of BM.
t = nτ, we have

p_{δ,τ}(x, t + τ) = ½ [p_{δ,τ}(x − δ, t) + p_{δ,τ}(x + δ, t)].¹⁸

If δ and τ are small then by expanding both sides of the above equation we get

p_{δ,τ}(x, t) + τ ∂p_{δ,τ}(x, t)/∂t = p_{δ,τ}(x, t) + ½ (∂²p_{δ,τ}(x, t)/∂x²) δ².

Now let δ, τ → 0 in such a way that

δ²/τ → D,

where D > 0 is a constant, which we will later call the diffusion constant. Then p_{δ,τ}(x, t) → p(x, t), where p(x, t) satisfies the heat equation (or diffusion equation):

∂_t p(x, t) = (D/2) ∂_{xx} p(x, t).    (40)

p(x, t) is the probability density of finding the particle in x at time t. The fundamental solution of (40) is precisely the probability density of a Gaussian with mean zero and variance t, see (8).
The calculation that we have just presented shows another very important fact: pour some milk into your coffee and watch the milk particles
diffuse around in your mug. We will later rigorously define the term diffusion; however for the time being observe that we already know that at
the microscopic (probabilistic) level, the motion of the milk molecules is described by BM. The macroscopic (analytic) description is instead given by
the heat equation.
7.2 Rigorous motivation for the definition of BM.

¹⁸ Later we will prove that the Wiener process is a Markov process. The Markovianity of the process is already in this equation.
Since E(X_i) = 1/2 and Var(X_i) = 1/4, E(S_n) = n/2 and Var(S_n) = n/4. Therefore E(B^{δ,τ}(t)) = 0 and

Var(B^{δ,τ}(t)) = 4δ² (n/4) = δ² n = (δ²/τ) t → D t.

Moreover,

B^{δ,τ}(t) = 2δ (S_n − n/2) = δ√n · (S_n − n/2)/√(n/4),    with δ√n = √((δ²/τ) t) → √(tD)    (using t = nτ),

so by the Central Limit Theorem

lim P(a ≤ B^{δ,τ}(t) ≤ b) = (1/√(2π)) ∫_{a/√(tD)}^{b/√(tD)} e^{−x²/2} dx;

if D = 1 the RHS of the above is the same as the RHS of (8) after a change of variables.
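This rescaled random walk is easy to simulate; the sketch below (the grid sizes are arbitrary choices) checks that the terminal value B^{δ,τ}(T) has variance approximately DT.

import numpy as np

rng = np.random.default_rng(7)
D, tau, T = 1.0, 1e-4, 1.0
delta = np.sqrt(D * tau)          # so that delta**2 / tau = D
n = int(T / tau)
# 2000 independent walkers taking +-delta steps with probability 1/2 each.
paths = np.cumsum(rng.choice([-delta, delta], size=(2000, n)), axis=1)
print(paths[:, -1].mean(), paths[:, -1].var())   # ~ 0 and ~ D*T = 1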
7.3 Properties of Brownian Motion
From the description of BM as the motion of a pollen grain - and even more
from its derivation as the limit of a random walk - it should be clear why
the following result holds.
Theorem 7.2. The Wiener process is a Markov process with

P(B(t) ∈ A | B(s)) = ∫_A (1/√(2π(t − s))) e^{−(x − B(s))²/(2(t−s))} dx.    (41)
Martingales enjoy many nice properties; among other things, they satisfy the following inequality.

Theorem 7.4 (Doob's martingale inequality). If M_t is a continuous martingale then

P( sup_{t∈[0,T]} |M_t| ≥ λ ) ≤ E|M_T|ᵖ / λᵖ,

for all λ > 0, T ≥ 0 and p ≥ 1.

In order to understand the martingale inequality, it is useful to compare it with the Markov inequality (4).
Theorem 7.5. Brownian Motion is a martingale with respect to the filtration F_t = σ{B(s); 0 ≤ s ≤ t}.

Proof. Brownian motion is square integrable (as E(B(t)²) = t from i) and ii) of Example 2.7), therefore integrable, and it is adapted w.r.t. F_t by definition. Also, for any t ≥ s,

E[B(t) | F_s] = E[B(t) − B(s) + B(s) | F_s] = 0 + B(s) = B(s),

as B(t) − B(s) is independent of F_s.
Let us now look at the path properties of the Wiener process: we will see that while the trajectories of BM are continuous, they are a.s. nowhere differentiable. The non-differentiability is intimately related to the fact that BM has infinite total variation, which makes it impossible to define the stochastic integral w.r.t. Brownian motion in the same way in which we define the Riemann integral.
Theorem 7.6. The sample paths of BM are a.s. continuous.

Proof. We want to use Kolmogorov's continuity criterion, Theorem 2.5. For every k > 2 and 0 ≤ s ≤ t,

E|B(t) − B(s)|ᵏ = (1/√(2π(t − s))) ∫_R |x|ᵏ e^{−x²/(2(t−s))} dx

(substituting y = x/√(t − s))

= (1/√(2π)) ∫_R |y|ᵏ (t − s)^{k/2} e^{−y²/2} dy = C |t − s|^{k/2}.

So we can apply the criterion with α = k and β = k/2 − 1, for all k > 2, and we obtain the desired continuity.
Theorem 7.8. Let {t_iⁿ}_i be a sequence of partitions of [0, T] with mesh going to zero. Then

Σ_i |B(t_{i+1}ⁿ) − B(t_iⁿ)|² → T    in L²,

while Σ_i |B(t_{i+1}ⁿ) − B(t_iⁿ)| → ∞ a.s.: on [0, T], Brownian motion has finite second (quadratic) variation but infinite first variation.
Now one last fact about BM. As we have seen, BM is almost surely non-differentiable, so talking about its derivative

dB(t)/dt = ξ(t)

makes no sense in the classical way. Nonetheless ξ(t), white noise, has a flat spectral density:

f(ω) = const,    for all ω ∈ R.

Inverting the Fourier transform, this means that all the frequencies contribute equally to the covariance function, in the same way in which all colours are equally present in the white colour.
8 Elements of stochastic calculus and SDEs theory

Ẋ_t = b(t, X_t),    (42)

Ẋ_t = b(t, X_t) + ξ_t.    (43)

The first term on the RHS of this equation gives the average (smooth) behaviour, the second is responsible for the fluctuations that turn Figure 2 into Figure 3. Bear in mind that the solution to (42) is a function; the solution to (43), whatever this equation might mean, is a stochastic process.
Depending on the particular situation that we want to model we can consider all sorts of noise. In this course we will be interested in the case of white noise ξ_t, which has been the first to be studied thanks to the good properties that it enjoys:

i) being mean zero, it doesn't alter the average behaviour, i.e. the drift is only due to b(t, x);

ii) ξ_t is independent of ξ_s if s ≠ t (recall that E(ξ_t ξ_s) = δ_0(t − s));

iii) it is stationary, so that its law is constant in time, i.e. the noise acting on the object we are observing is qualitatively the same at each moment in time.

Now just a small problem: we said that we can't make sense of ξ_t in a classic way, as ξ_t is the derivative of Brownian motion, which is not differentiable.¹⁹

¹⁹ Another, possibly better, way of looking at this matter is as follows: for practical reasons, suggested for example by engineering problems, we want to consider a noise that satisfies the properties i), ii) and iii) listed above. However, one can prove that there is no such process on the path space R^{[0,∞)}. In any event, the intuition about ξ_t suggested by these three properties leads to thinking of ξ_t as the derivative of BM. More details on this point are contained in Appendix A.
So, how do we make sense of (43)? Well, if ξ_t = dW_t/dt then, at least formally, (43) can be rewritten as

dX_t = b(t, X_t) dt + dW_t.    (44)

More in general, we might want to consider the equation

dX_t = b(t, X_t) dt + σ(t, X_t) dW_t.    (45)

In this way we have that, so far still formally,

X_t = X_0 + ∫_0^t b(s, X_s) ds + ∫_0^t σ(s, X_s) dW_s.

Assuming that ∫_0^t |b(s, X_s)| ds < ∞ with probability one, we know how to make sense of the first integral on the RHS. The real problem is how to make sense of the second integral on the RHS. However, I hope we all agree that if we can rigorously define the integral

∫_0^T f(t, ω) dW_t,    (46)

then we will also be able to make sense of the integral form of (45).²⁰

²⁰ In the integral (46) we wanted to stress the dependence on both t and ω.
8.1 Stochastic Integrals.
Let us first recall how the integral of a continuous function g : [0, T] → R is constructed. Given a sequence of partitions {t_j} of [0, T] with mesh going to zero, approximate g by the piecewise constant functions

g_n(t) = Σ_{j≥0} g(t_j) 1_{[t_j, t_{j+1})}(t);    (47)
then we define

∫_0^T g(t) dt := lim_{n→∞} ∫_0^T g_n(t) dt = lim_{n→∞} Σ_{j≥0} g(t_j)(t_{j+1} − t_j).    (48)

I want to stress that if the integrand is continuous, the limit on the right hand side of (48) does not change if we evaluate g at any other point t* ∈ [t_j, t_{j+1}], i.e. for continuous functions we have:

∫_0^T g(t) dt = lim_{n→∞} Σ_{j≥0} g(t_j)(t_{j+1} − t_j) = lim_{n→∞} Σ_{j≥0} g(t*)(t_{j+1} − t_j),
For a stochastic process, the natural analogue of a piecewise constant function is an elementary process

φ(t, ω) = Σ_{j=0}^K φ_j(ω) 1_{[t_j, t_{j+1})}(t),    (49)

for which we define

∫_0^T φ(t, ω) dW_t := Σ_{j=0}^K φ_j(ω) (W_{t_{j+1}} − W_{t_j}).    (50)

The idea is then to approximate the integrand f(t, ω) in (46) with elementary processes φ_n and define (46) as the limit of the stochastic integrals of the φ_n. The plan sounds good, except there are a couple of problems and points to clarify.
First of all, assuming the procedure of taking the limit (in some sense) of the integral of elementary processes does work, we would want the limit to be independent of the point t* ∈ [t_j, t_{j+1}] that we choose to approximate the integrand. It turns out that this is not the case, not even if the integrand is continuous. We show this fact with an example.

Example 8.1. Suppose we want to calculate ∫_0^T W_t dW_t. We consider partitions with mesh size going to zero and approximate the integrand with the elementary process φ_n = Σ_{j≥0} W(t_j) 1_{[t_j, t_{j+1})}. Then, for every n ∈ N,

E[ Σ_{j≥0} W(t_j)(W(t_{j+1}) − W(t_j)) ] = 0,

by independence of the increments, while for the elementary process φ̃_n = Σ_{j≥0} W(t_{j+1}) 1_{[t_j, t_{j+1})}, obtained by evaluating the integrand at the right endpoint of each interval, we get

E[ Σ_{j≥0} W(t_{j+1})(W(t_{j+1}) − W(t_j)) ] = Σ_{j≥0} (t_{j+1} − t_j) = T,

so the two approximations cannot converge to the same limit.
8.1.1 Stochastic integral in the Itô sense.

In these notes we will not go through all the details of the proofs of the construction of the Itô integral, but we simply provide the main steps of such a construction.
As we have already said, we want to find a sequence φ_n of elementary processes that approximate the process f(t, ω) and finally define the Itô integral as the limit of the stochastic integrals of the φ_n. We still need to specify in what sense we will take the limit (and in what sense the φ_n's approximate f). In the case of (48), {∫_0^T g_n(t) dt}_n is simply a sequence of real numbers, but in our case we are dealing with a sequence of random variables, so we need to specify in what sense they converge. It is clear from Theorem 7.8 that trying with L¹-convergence is a recipe for disaster, as BM has infinite first variation. However, it has finite second variation, so we can try with L²-convergence, which is what we are going to do. As you might expect, the procedure that we are going to sketch does not work for any integrand (in the same way in which not any function is Riemann integrable) but it is successful for stochastic processes f(t, ω) : R₊ × Ω → R enjoying the following properties:

(a) f(t, ω) is B × F-measurable;

(b) f_t is F_t-adapted, for all t ∈ R₊, where F_t is the natural filtration associated with the BM W_t;

(c) E ∫_0^T f_t² dt < ∞.

Definition 8.2. We denote by I(0, T), or simply I when the extrema of integration are clear from the context, the class of stochastic processes f(t, ω) : R₊ × Ω → R for which the above three properties hold.
The reason why we impose this set of conditions on the integrands will be
more clear once you read the construction procedure. However, property (b)
comes from the fact that we are choosing the left hand point to approximate
our integral, property (c) comes from the Ito isometry (see below). And now,
finally, the main steps of the construction of the Ito integral are as follows:
1. Consider the set of elementary processes (49), where we require that
the random variables j are square integrable and Ftj -measurable
21 (with F the filtration generated by W ); for these processes the
t
t
21
This is because we are choosing t = tj . Notice that this condition does not hold for
the
n s of Example 8.1, as Wtj+1 is not Ftj -measurable.
67
2
(t, ) dWt
E
0
Z
=E
2 (t) dt.
(51)
Let us now list all the properties that help in the calculation of the
stochastic integral.
Theorem 8.4 (Properties of the Ito integral). For every f, g I(0, T ),
t, s [0, T ] and for every , R,
68
RT
f dW + t f dW ;
RT
RT
RT
(ii) Linearity: 0 (f + g)dW = 0 f dW + 0 g dW ;
RT
(iii) E 0 f dW = 0
RT
(iv) IT := 0 f dW is FT -measurable;
(i) Additivity:
RT
0
f dW =
Rt
0
Z
f dW
=E
f 2 dt,
g(u)dWu
Z
=E
ts
f (u)g(u) du .
0
69
1
1
= Wt2 t2 ,
2
2
where we have used the fact that the sum
and Theorem 7.8.
At this point you should be convinced that the Ito integral doesnt follow
the usual integration rules. And you might wonder, what about the chain
rule and integration by parts? Well I am afraid they dont stay the same
either.
It
o chain rule: suppose Xt satisfies the equation
dX = b(t, Xt )dt + (t, Xt )dWt .
Then, if g = g(t, x) is a C 1,2 ([0, ) R) function,
dg(t, Xt ) =
g
g
1 2g
(t, Xt ) dt +
(t, Xt ) dXt +
(t, Xt ) 2 (t, Xt )dt .
t
x
2 x2
(52)
(53)
respectively, then
d(X1 X2 ) = X1 dX2 + X2 dX1 + dX1 dX2 .
(54)
Integration by parts, which follows from the product rule: with the
same notation as in the product rule,
Z t
Z t
Z t
X1 dX2 = X1 (t)X2 (t) X1 (0)X2 (0)
X2 dX1
dX1 dX2 .
0
(55)
70
How do we calculate the term dX1 dX2 ? I am afraid I will adopt a practical
approach, so the answer is...by using this multiplication table:
dt dt = dW dt = dt dW = 0
while
dW dW = dt and dW dB = 0
if W and B are two independent Brownian motions. Just to be clear, in the
case of X1 and X2 of equation (53), because the BM driving the equation for
X1 is the same as the one driving the equation for X2 , we have dX1 dX2 =
1 2 dt. In this way if X1 is a deterministic, say continuous function, (55)
coincides with the usual integration by parts for the Riemann integral.
I would like to stress that this rule of thumb of the multiplication table
can be rigorously justified (think of Theorem 7.8), but this is beyond the
scope of this course.
Rt
Example 8.6. Let us calculate 0 W 3 dW . We can use the integration by
parts formula with X1 = W 3 and X2 = W . To use (55) we need to calculate
dX1 first, which we can do by applying (52) with g(t, x) = g(x) = x3 . So
we get
dX1 = 3W 2 dW + 3W dt,
hence dX1 dX2 = 3W 2 dt and
Z t
Z t
Z t
W 3 dW = W 4
W (3W 2 dW + 3W dt) 3
W 2 dt ;
0
rearranging,
Z
W 3 dW =
W4 3
4
2
W 2 dt .
0 j=1
71
d
m
d
X
1 X 2 g X il jl
g
g
dX i +
dt +
dt.
t
xi
2
xi xj
i,j=1
i=1
l=1
i,j=1
i=1
8.1.2
(57)
(58)
rather than
is clearly relevant, as the solution to (57) is not the same as the solution
to (58); and this is not just a technical matter, it is rather a problem of
modelling, and the answer depends on the specific application that we are
considering.
By the modelling point of view, the Stratonovich integral has the big
advantage of being stable under perturbations of the noise. Let us explain
what we mean.
1. Assuming that (57) and (58) admit only one solution, let Xt be the
t the solution of (57). In general Xt 6= X
t , as
solution of (58) and X
we have explained.
2. Now consider the equation
dXtk = b(t, Xtk )dt + (t, Xtk )tk ,
where k is a sequence of smooth random variables that converge to
white noise as k . Notice that because the k s are smooth,
such an equation is, for each , a simple ODE, so we dont need to
decide what we mean by it because we know it already.
3. If k , you would expect that X k X. Well this is not the case.
Indeed, X k X.
This is only the tip of the iceberg of the more general landscape described
by the Wong-Zakai approximation theory, see for example [76, 41] and references therein. The fact that Ito SDEs are not stable under perturbations
of the noise can be interpreted to mean, by the modelling point of view,
73
that unless we are absolutely sure that the noise to which our experiment is
subject is actually white noise, the phenomenon at hand is probably better
described by (57). On this topic, see for example [75]. After having studied
Section 8.2.2 you will be able to solve Exercise 30, which gives a practical
example of the procedure described above.
8.2
SDEs
for all t, Gt :=
N be the family of null sets (i.e. sets of P-measure zero) of the underlying probability space ;
Ft be the -algebra generated by N and Gt .
With this notation, let us give the following definition:
Definition 8.7 (Strong Solution). Given a Brownian motion Wt and an
initial datum , a strong solution to (58) is a continuous stochastic process
Xt such that
i) Xt is Ft -adapted, for all t > 0;
ii) P(X0 = ) = 1;
iii) for every t 0,
Rt
0 (|b(s, Xs )|
Xt ) = 0, for all t 0.
Clearly the above definition can be extended to higher dimension if we
take Wt to be a m-dimensional standard BM and condition iii) becomes
Z t
2
(bi (s, Xs ) + ij (s, Xs ) )ds < , P a.s.,
0
n X
`
X
ij 2
a .
i=1 j=1
Theorem 8.8. Suppose the coefficients b and are globally Lipshitz and
grow at most linearly, i.e. for all t 0 and for all x, y Rd there exists a
constant C > 0 (independent of x and y) such that
kb(t, x) b(t, y)k + k(t, x) (t, y)kF Ckx yk
(59)
(60)
for all 0 t T .
Comment. You might be thinking that all this fuss of Ito and Stratonovich
75
is a bit useless as we could also look at an SDE as being an ODE, for each
fixed . This is the approach taken in [73]. However bear in mind that
because Wt is continuous but not C 1 , you would end up with an ODE with
non C 1 coefficients. In [73] is it shown that one can still makes sense of such
an ODE by using a variational approach and that the solution found by
means of the variational method coincides with the Stratonovich solution of
the SDE. The conditions of the existence and uniqueness theorem, Theorem
8.8, can be somewhat weakened.
Theorem 8.9. Suppose all the assumptions of Theorem 8.8 hold, but replace
condition (59) with the following: for each N there exists a constant CN such
that
kb(t, x) b(t, y)k + k(t, x) (t, y)kF CN kx yk
X0 = x.
Then the solution to this equation exists and is unique but it blows up in
finite time:
0
if x = 0
Xt =
x
if
x 6= 0 .
1tx
8.2.2
SDEs can be solved in the same way in which we solve ODEs, just taking
into account the It
o chain rule.
Example 8.10 (Ornstein-Uhlenbeck process). Consider the one dimensional equation
dXt = Xt dt + dWt
(62)
where R and > 0 are constants. The solution to this equation (which
exists and is unique by Theorem 8.8) is the Ornstein-Uhlenbeck process
(OU), which is a model of great importance in physics as well as in finance.
Xt is also called coloured noise and, among other things, is used to describe
the velocity of a Brownian particle. By the point of view of non-equilibrium
76
(63)
and integrate both sides. However we are dealing with an SDE so we need
to use the chain rule of It
o calculus, equation (54). In this particular case,
because the second derivative with respect to x of et x is zero, applying
(54) leads precisely to (63) (check) however I want to stress that this will
not happen in general, as we will see from the next example. So now we can
simply integrate both sides and find
Z t
t
X(t) = e X0 +
e(ts) dWs .
0
Taking expectation on both sides and using property (iii) of Theorem 8.4
we get
E(Xt ) = et E(X0 ) ,
so that E(Xt ) 0 if < 0 and E(Xt ) + if > 0. If X0 is Gaussian or
deterministic then the OU process is Gaussian, as a consequence of Theorem
8.4, property (vii). Moreover, it is a Markov process. The proof of the latter
fact is a corollary of the results of the next section.
Example 8.11 (Geometric Brownian Motion). The SDE
dXt = rXt dt + Xt dWt ,
is a simple population growth model (but it is quite popular in finance as
well), where Xt represents the population size and r = r + dW is the
relative rate of growth. To solve this SDE we proceed as follows:
dXt
= r dt + dWt .
Xt
77
(64)
At the risk of being boring, we cant say that d(log Xt ) = dXt /Xt , because
we are deling with the stochastic chain rule. So, applying the Ito chain rule,
we find
dXt 1 2
d(log Xt ) =
dt.
(65)
Xt
2
From (64) and (65) we then have
1 2
d(log Xt ) = r dt + dWt ,
2
which gives
1 2
Xt = X0 e(r 2 )t+Wt .
Example 8.12. Also for SDEs it makes sense to try and look for a solution
in product form. For example, consider the SDE
dXt = d(t)Xt dt + f (t)Xt dWt ,
X(0) = X0 .
(66)
If d(t) and f (t) are uniformly bounded on compacts then the assumptions
of the existence and uniqueness theorem are satisfied and we can look for
the solution. So assuming for example that d(t) and f (t) are continuous will
do. We look for a solution in the form
X(t) = Y (t) Z(t),
where Y (t) and Z(t) solve, respectively,
dY = f Y dW
dZ = A(t) dt + B(t)dW
and
Y (0) = X0
Z(0) = 1 .
Applying the It
o product rule,
d(Y Z) = (f X + BY )dW + Y (A + f B)dt .
We now compare the above expression with (66); in order for the two expressions to be equal, we need to have
f X + BY = f X
the first equation gives B 0 so that the second implies A(t) = d(t)Zt .
The equation for Z becomes therefore deterministic:
Z t
dZ = d(t)Zdt , Z(0) = 1 Zt = exp
d(s)ds .
0
78
The equation for Y can be solved similarly to what we have done for geometric Brownian motion:
dY = f (t)Y dW ,
dY
= f (t)dW .
Y
Y (0) = X0
(67)
The solution of an It
o SDE is also called an It
o process. In this section
we will always work under the assumptions of the existence and uniqueness
theorem and we will prove two very important facts about the solutions of
It
o SDEs: first of all, the solution of an Ito SDE is a continuous time Markov
process. Secondly, if the coefficients of the equation do not depend on time,
the solution is a time-homogeneous Markov process.
Theorem 8.13. Suppose the assumptions of Theorem 8.8 hold. Then the
solution of the SDE
dXs = b(s, Xs ) ds + (s, Xs ) dWs ,
X0 = x,
(68)
is a Markov process.
The proof of this theorem is not particularly instructive, as the technicalities obfuscate the intuition behind them. The reason why the above theorem holds true is morally simple: we know by the existence and uniqueness
theorem that, once Xs is known, the solution of (68) is uniquely determined
for every t s. Theorem 8.8 doesnt require any information about Xu for
u < s in order to determine Xu for u > s, once Xs is known. Which is why
the above statement holds under the same assumptions as Theorem 8.8.
79
Theorem 8.14. Suppose the assumptions of Theorem 8.8 hold. If the coefficients of the It
o SDE (58) do not depend on time then the It
o process is
time-homogeneous. I.e., the solution of the equation
dXs = b(Xs ) ds + (Xs ) dWs ,
X0 = x,
B S, x S, t 0 .
(69)
(70)
8.3
(71)
and
0 0
0 1
(72)
,
(73)
you can observe that (72) is nothing but an O-U process, this time in two
dimensions:
q(t)
dXt = BXt + dWt , where Xt =
,
p(t)
and Wt = (Wt1 , Wt2 ) is a two-dimensional standard BM, i.e. W 1 and W 2
are independent one dimensional standard BMs. However notice that the
diffusion matrix23 is degenerate (in the sense that it has zero determinant);
for this reason this is a degenerate process (in particular one can prove that
this is an hypoelliptic process). We will comment again on this later on, as
the degeneracy of the diffusion matrix adds big difficulties to the analysis of
such a process.
22
23
81
The model (72) assumes the collisions between molecules to occur instantaneously, hence it is not valid in physical situations in which such approximation cannot be made. In these cases a better description is given by
the Generalized Langevin Equation GLE :
Z t
q(t) = q V (q)
ds (t s)q(s)
+ F (t).
(74)
0
(75)
where > 0 is a constant, representing the inverse temperature of the system. The GLE together with the fluctuationdissipation theorem principle
(75) appears in various applications such as surface diffusion and polymer
dynamics. It also serves as one of the standard models of nonequilibirum
statistical mechanics, describing the dynamics of a small Hamiltonian system (a distinguished particle) coupled to a heat bath at inverse temperature
. Just to provide some context: given a system in equilibrium, we can drive
it away from its stationary state by either coupling it to one or more large
Hamiltonian systems or by using non-Hamiltonian forces. In the Hamiltonian approach, which is often referred to as the open systems theory [66],
the system we are interested in is coupled to one or more heat reservoirs.
Multiple reservoirs and the non-Hamiltonian methods are used to study non
equilibrium steady states (for example if we consider two heat baths a different temperatures, you can imagine that there will be a constant heat
flux from the warmer to the colder reservoir; such a state is called a nonequilibrium steady state), see [68, 28]; coupling with a single heat bath is
used to study return to equilibrium.24 The GLE serves precisely this purpose. Let us now say a couple of words about the derivation of the GLE.
In order to derive such an equation which, I would like to point out, is
a stochastic integro-differential equation we think of the system particle
+ bath as a mechanical system in which a distinguished particle interacts
with n heat bath molecules of mass {mj }1jn , through linear springs with
random stiffness parameter {kj }1jn ; the Hamiltonian of the system is
24
82
then :
n
p2n 1 X Pi2 1 X
+
+
ki (Qi qn )2 ,
2 2
mi 2
i=1
i=1
(76)
where (qn , pn ) and (Q1 , ..., Qn , P1 , ..., Pn ) are the positions and momenta of
the tagged particle and of the heat bath molecules, respectively (the notation
(qn , pn ) is to stress that the position and momentum of the particle depend
on the number of molecules it is coupled to).
Excursus. The theory of Equilibrium statistical mecanics is involved
with the study of many particle systems in their equilibirum state; such a
theory has been put on firm ground by Boltzman and Gibbs in the second
half of the 19th century. Consider an Hamiltonian system with Hamiltonian H(q, p) (for example our gas particles in a box). The formalism of
equilibrium statistical mechanics allows to calculate macroscopic properties
of the system starting from the laws that govern the motion of each particle (good reading material are the papers [46, 45, 68]). The BoltzmannGibbs prescription states the following: the equilibrium value of any quantity at inverse temperature is the average of that quantity in the canonical ensamble, i.e. it is the average with respect to the Gibbs measure
(q, p) = Z 1 exp{H(q, p)}dqdp. The reason why this measure plays
such an important role is a consequence of the fact that the Hamiltonian
flow preserves functions of the Hamiltonian as well as the volume element
dqdp.
Going back to where we were, we can write down the equations of motion
of the system with Hamiltonian (76):
qn = pn
p = V (q ) + Pn k (Q q )
n
q
n
i
n
i=1 i
i = Pi /mi
Q
1
Pi = ki (Qi qn )
1 i n.
The initial conditions for the distinguished particle are assumed to be deterministic, namely qn (0) = q0 and pn (0) = p0 ; those for the heat bath are
randomly drawn from a Gibbs distribution at temperature 1 , namely
:=
1 H
e
,
Z
Z normalizing constant,
where the Hamiltonian H has been defined in (76). Integrating out the heat
bath variables we obtain a closed equation for qn , of the form (74). In the
83
(77a)
dp = q V (q) dt + g u dt
(77b)
du = (p g Au) dt + C dW (t) ,
(77c)
(78)
We also recall that the noise in (74) is Gaussian, stationary and mean zero.
Because the memory kernel and the noise in (74) are related through the
fluctuation-dissipation relation, the rough idea is that we might try to either
approximate the noise and hence obtain the corresponding memory kernel
or, the other way around, we could approximate the correlation function
and read off the noise. The latter is the approach that we shall follow in
this section. As a motivation, we would like to notice that for some specific
choices of the kernel, equation (74) is equivalent to a finite dimensional
Markovian system in an extended state space. If, for example, we choose
(t) = 2 et , t > 0, then (74) becomes
q = p
Rt
(79)
p = q V (q) 2 0 e(ts) p(s)ds + F (t);
the fluctuation dissipation theorem (with = 1) yields
E(F (t + s)F (t)) = 2 e|s| .
84
(80)
Since we are requiring F (t) to be stationary and Gaussian, (80) implies that
25 If we write F (t) = v(t), with
F (t) is the Ornstein-Uhlenbeck process.
q = p
p = q V + z
,
z = p z + 2W
which is precisely system (77) with d = 1, A = 1 and g = . The Markovianization of (74) was first done by Mori [54] by first approximating the
Laplace transform of the memory kernel (t), (), by a rational function
(if and when this is possible) and then imposing the fluctuation relation,
which gives the matrices A andPC as well as the vector P
g. If (t) itself is
a sum of exponentials, d (t) = di=1 2i ei t , then d = di=1 2i /(i + i ),
so the procedure indicated by Mori is clearly successful and it corresponds
to the case in which A = diag{1 , . . . , d } and g = (1 , . . . , d )T . Another
typical situation is when the Laplace transform of has a continued fraction
representation
21
() =
22
+ 1 +
+2 +
i > 0.
2
3
+3 +
..
It is possible to show that the only mean zero stationary Gaussian process with autocorrelation function et is the stationary O-U process.
85
Comment. i) You are likely to have already seen the definition of semigroup when talking about the solution of differential equations, for example
linear equations of the type
t f = Lf,
where, for every fixed t, f (t, x) belongs to some Banach space B and L is a
differential operator on B. Indeed properties 1. and 2. simply say that the
solution of the equation is unique. Strong continuity is used to make sense
86
f (x)d(x),
f (y)e
|xy|2
2t
dy,
(83)
defines a strongly continuous semigroup on the space of bounded and uniformly continuous functions from R to R. This is precisely the semigroup
associated with Brownian motion through (82). In other words, because
the transition probabilities of BM are given by (41), we can rewrite the
expression (83) as
Pt f (x) = E[f (Bt )|B0 = x] .
87
t0+
Pt f f
,
t
(84)
for all f D(L) := {f B : the limit on the RHs of (84) exists in B}. If
Pt is a strongly continuous Markov semigroup, then the operator L is a
Markov generator. The generator of the semigroup defined in (82), is often
referred to as the generator of the Markov process Xt .
From now on we will always work with strongly continuous semigroups,
unless otherwise stated. The following properties hold.
Theorem 9.6. With the notation and nomenclature introduced so far, let
Pt be a strongly continuous semigroup. Then we have:
1. if f D(L) then also Pt f D(L), for all t 0;
2. the semigroup and its generator commute: LPt f = Pt Lf for all f
D(L);
3. t (Pt f ) = LPt f .
Notice that point 3. of Theorem 9.6 means that if we define a function
gt (x) = g(t, x) := Pt f (x) then such a function satisfies, for all t R+ , the
equation t g = Lg.
Proof of Theorem 9.6. 1. and 2. can be proved together: for any t, s 0,
we can write
Ps Pt f Pt f
Ps f f
= Pt
.
s
s
No we can take the limit as s 0+ on both sides (the limit of the RHS
makes sense and therefore also the one on the LHS does, which means that
1. is proved) and obtain
LPt f = lim
s0+
Ps Pt f Pt f
Ps f f
= Pt lim
= Pt Lf .
s
s
s0+
88
Now we come to proving 3. In the remainder of this proof h > 0; let us look
at the right limit first:
lim
h0+
Pt+h f Pt f
Ph f f
by2.
= Pt lim
= Pt Lf = LPt f .
+
h
h
h0
Therefore the right derivative is equal to LPt f . Now we need to show the
same for the left derivative as well:
Pt f Pth f
Ph f f
LPt f = lim Pth
Pt Lf
lim
h0
h0
h
h
Ph f f
Lf
lim Pth
h0
h
+ lim [Pth Lf Pt Lf ] .
h0
The second limit is equal to zero by strong continuity. Using Exercise 31,
also the first limit is equal to zero, indeed
Ph f f
Ph f f
lim kPth
Lf k lim c(t)k
Lf k,
h0
h0
h
h
where c(t) is a constant depending on t, not on h. Now the limit on the
RHS of the above is clearly equal to zero, by the definition of L.
The following very famous result gives a necessary and sufficient condition in order for an operator to be the generator of a Markov semigroup.
Theorem 9.7 (Hille-Yosida Theorem for Markov semigroups). A linear
operator L is the generator of a Markov semigroup {Pt }tR+ on the Banach
space B if and only if
1. 1 D(L) and L1 = 0;
2. D(L) is dense in B;
3. L is closed;
4. for any > 0, the operator I L is invertible; its inverse is bounded,
sup k(I L)1 f k 1 ,
kf k1
89
(f (y) f (x))e
=
dy
t
t
2t R
Z
1
1
z 2 /2t
(f (x + z) f (x))e
dz
=
t
2t R
Z
1
1
w2 /2
=
dw
(f (x + w t) f (x))e
t
2 R
Z
1 1
1 00
2
0
2
=
f (x)w t + f (x + w t)tw ew /2 dw,
t 2 R
2
where the last inequality holds, by Taylors Theorem, for some [0, 1].
From the continuity of f 00 and observing that
Z
Z
2
0
w2 /2
0
f (x)w e
dw = f (x) w ew /2 dw = 0,
R
we have
lim
t0+
Pt f (x) f (x)
f 00 (x)
=
.
t
2
Remark 9.10. We recall that the dual of C([0, T ], R) is the space of measures on [0, T ] with bounded total variation on [0, T ] (see [80], page 119);
the total variation on [0, T ] is defined as
Z
kkT V := sup
f (s)(ds).
f C([0,T ]) [0,T ]
kf k1
90
Also, the dual of the space L (S, S, ), where (S, S, ) is a measure space
with (S) < , is the space of finitely additive measures absolutely
continuous with respect to and such that supB |(B)| < (see again
same reference).
Notice that the Markov semigroup Pt acts on functions, its dual acts on
probability measures. Moreover, Pt f is a function, while Pt is a probability
measure on R. (Useless to say that in all of the above R can be replaced by
any Polish space S and everything works anyway.) The reason why Pt is
called the dual semigroup is the following: with Remark 9.10 in mind 26 , if
we use the notation
Z
hPt f, i = (Pt f )(x)(dx),
R
then for any say bounded and measurable function f and for any probability
measure we have
Z Z
hPt f, i =
pt (x, dy)f (y) (dx)
R
R
Z
Z
=
f (y) pt (x, dy)(dx)
R
ZR
=
f (y)(Pt )(dy) = hf, Pt i.
R
9.1
Given a time-homogeneous Markov process, we say that a measure is invariant for Xt if it is invariant for the associated Markov semigroup.
26
In view of Remark 9.10, this is just a duality relation, i.e. it expresses the action of
the dual of a space on the space itself, see the Appendix; in the case of a Hilbert space
this would be just a scalar product, which is why we denote duality relations and scalar
products in the same way.
91
With the notation introduced at the end of the previous section, (86)
can be rewritten as
hPt , i = h, i.
(87)
Therefore, by the point of view of the dual (or adjoint) semigroup, we can
also say that the measure is invariant if
Pt = ,
t 0.27
(88)
Comment. The stationarity of is used in (89) only for the last equality.
All the other equalities are true simply by definition of Pt . This means that
if we start the process with a certain distribution , then the semigroup Pt
describes the evolution of the law of the process:
(Pt )(B) = P (Xt B),
for every measurable set B. On the other hand, from the definition (82) of
the semigroup Pt , it is clear that Pt describes the evolution of the so called
observables.
Again analogously to the time-discrete case, we give the following definition.
27
If (87) holds for all buonded and measurable then take to be the indicator function
of a measurable set and you obtain (88). On the other hand, if (88) is true then hPt , i =
h, Pt i = h, i, for all bounded and measurable.
92
From now on, we shall refer to the non-weighted Lp space as to the flat Lp .
Thanks to the above Lemma, we can work in a Hilbert space setting as soon
as an invariant measure exists.
Comment. One can prove that the set of all invariant probability measures for Pt is convex and an invariant probability measure is ergodic if
and only if it is an extremal point of such set (compare with Lemma 5.4).
Hence, if a semigroup has a unique invariant measure that measure is ergodic. This justifies the definition of ergodic process that we have just given.
It can be shown (see [13]) that, given a C0 -Markov semigroup, the following
statements are equivalent:
(i) is ergodic.
(ii) L2 (S, ) and Pt = t > 0, a.s. is a constant ( a.s.).
(iii) for any L2 (S, ) ,
Z
Z
1 t
lim
(Ps )(x) ds =
(x)(dx),
t t 0
S
in L2 (S, ).
this is precisely the ergodic theorem for continuous time Markov processes.
Notice that the time average on the LHS depends on x (the initial value of
the process) while the RHS does not: again, the limiting behaviour does not
depend on the initial conditions.
Remark 9.14. When the semigroup has an invariant measure , we shall
always assume that the generator of such a semigroup is densely defined in
L2 (R, ). This assumption is not always trivial to check in practical cases.
For the moment only formally, the generator of Pt is just L , the flat
L2 -adjoint of the generator of Pt , L. From (88), a measure is invariant if
L = 0.
(90)
10
Diffusion Processes
For the material contained in this section we refer the reader to [1, 25, 22, 41].
In this section we want to give a mathematical description of the physical phenomenon called diffusion. Simply put, we want to mathematically
describe the way milk diffuses into coffee.
It seems only right and proper to start this section by reporting the
way in which Maxwell described the process of diffusion in his Encyclopedia
Britannica article:
When two fluids are capable of being mixed, they cannot remain in
equilibrium with each other; if they are placed in contact with each other the
process of mixture begins of itself, and goes on till the state of equilibrium
is attained, which, in the case of fluids which mix in all proportions, is a
state of uniform mixture. This process of mixture is called diffusion. It
may be easily observed by taking a glass jar half full of water and pouring
a strong solution of a coloured salt, such as sulphate of copper, through a
long-stemmed funnel, so as to occupy the lower part of the jar. If the jar
is not disturbed we may trace the process of diffusion for weeks, months, or
years, by the gradual rise of the colour into the upper part of the jar, and the
weakening of the colour in the lower part. This, however, is not a method
capable of giving accurate measurements of the composition of the liquid at
different depths in the vessel. ...
94
10.1
2. There exists a real valued function b(t, x) such that for all > 0
Z
1
lim
(yx)p(t, x, t+, dy) = b(t, x), for all t [0, T ] and x R;
0 |xy|
28
Indeed, the word diffusion comes from the latin verb diffundere, which means to
spread out.
29
At this point one should mention the apparent paradox created by the Poincare recurrence Theorem, but we will refrain from getting into the details of this long diatribe.
95
3. There exists a real valued function D(t, x) such that for all > 0
Z
1
(yx)2 p(t, x, t+, dy) = D(t, x), for all t [0, T ] and x R.
lim
0 |xy|
The function b(t, x) is often called the drift while the function D(t, x) is the
diffusion coefficient.
Before explaining the meaning of Definition 10.1, let us Remark that the
same definition can be given for Rn valued processes.
Remark 10.2. For ease of notation, Definition 10.1 has been stated for
one-dimensional processes. The exact same definition carries to the case in
which Xt takes values in Rn , n 1. In this case b(t, x) will be a Rn -valued
function and D(t, x) will be a n n matrix such that for any > 0,
Z
lim
(y x)(y x)T p(t, x, t + , dy) = D(t, x),
0 |xy|
for all t and x. In the above (y x) is a column vector and (y x)T denotes
its transpose.
Now let us see what the three conditions of Definition 10.1 actually mean.
Condition 1. simply says that diffusion processes do not jump. In conditions
2. and 3. we had to cut the space domain to |x y| < because strictly
speaking at the moment we dont know yet whether the transition probabilities of the process have first and second moment or not. Assuming they
do, then conditions 2. and 3. say, roughly speaking, that the displacement
of the process from time t to time t + is the sum of two terms: an average
drift term, b(t, x), plus random effects, say , plus small terms, o(); the
random effects are such that E()2 = D(t, x) + o().30
Lemma 10.3. Suppose a real valued Markov process Xt on [0, T ] with transition probabilities p(s, x, t, A) satisfies the following three conditions:
1. there exists a > 0 such that
Z
1
lim
|y x|2+a p(t, x, t + , dy) = 0,
0 R
30
Here and in the following the notation o() denotes a term, say f (), which is small
in in the sense that
f ()
lim
= 0.
0
96
Xt = x.
(91)
Suppose the coefficients b(u, x) and (u, x) are continuous in both arguments
and satisfy all the assumptions of the existence and uniqueness theorem,
Theorem 8.9. Then the process Xu , solution of (91), is a diffusion process
with diffusion coefficient D(u, x) = 2 (u, x).
Before proving the above statement, let us make the following remark.
Remark 10.5. Under the conditions of Theorem 8.9 one can prove that the
solution of equation (91) is still a Markov process and the following estimate
holds: there exists a constant C > 0 such that
E |Xu x|2m Cum eCu (|x|2m + 1),
for all m 1 .
t+
(s, Xs )dWs .
Therefore
1
E[Xt+ x]
= E
t+
b(s, Xs )ds .
t
0
98
In the above, we could exchange the limit and the integral because by
assumption
|b(t + u, Xt+u )|2 C(1 + |Xt+u |2 )
and we know by the existence and uniqueness theorem that
Z 1
Z 1
2
(1 + E |Xt+u |2 ) < .
|b(t + u, Xt+u )| C
E
0
R
To this end, let us write
E(Xt+ x)2 = E(Xt+ )2 x2 2x [EXt+ x] .
(92)
Using It
o formula to calculate d(Xu )2 , we get
d(Xu )2 = 2Xu dXu + 2 (u, Xu )du .
Integrating between t and t + and taking expectation then gives
Z t+
Z t+
E(Xt+ )2 x2 = E
2Xs b(s, Xs )ds + E
2 (s, Xs )ds .
t
E(Xt+ x) = E
t+
2Xs b(s, Xs ) + 2 (s, Xs ) ds
2x(b(t, x) + o()) .
After dividing both sides by , we can, as before, make the change of
variables s = t + u, exchange limit and integral (which is justified
with an argument similar to the one explained above) and conclude
by sending to zero.
10.2
For the moment we will keep working in one dimension. At the end we shall
comment on the multi-dimensional extension of the following results.
Lemma 10.7. Let f (x) be twice differentiable and suppose there exist m, C >
0 such that
2
d
d
|f (x)| + f (x) + 2 f (x) C(1 + |x|m ), x R.
dx
dx
100
Assume also that b(t, x) and D(t, x) satisfy the assumptions of Theorem
10.4. Then
Ef (Xt ) f (x)
d
1
d2
= b(t, x) f (x) + 2 (t, x) 2 f (x),
0
dx
2
dx
lim
Xt = x .
(93)
Proof. The proof of this lemma is left as an exercise, see Exercise 37.
Theorem 10.8. Let Xu be the solution of the SDE
dXu = b(u, Xu )du + (u, Xu )dWu ,
Xt = x ,
(94)
where we assume that the coefficients of the SDE are continuous in both arguments with continuous partial derivatives x b(u, x), x (u, x), xx b(u, x),
xx (u, x). Suppose f (x) is a twice continuously differentiable function and
that there exist K, m > 0 such that
|b(u, x)| + |(u, x)| K(1 + |x|) ,
|x b(u, x)| + |xx b(u, x)| + |x (u, x)| + |xx (u, x)| K(1 + |x|m ) ,
2
d
d
|f (x)| + f (x) + 2 f (x) K(1 + |x|m ) .
dx
dx
Then the function h(t, x) = E [f (Xu )|Xt = x] satisfies the equation
h(t, x)
h(t, x) 1 2
2 h(t, x)
+ b(t, x)
+ (t, x)
= 0,
t
x
2
x2
lim h(t, x) = f (x) .
t (0, u)
tu
(95)
(96)
1
2
+ 2 (t, x) 2 .
x 2
x
31
The notation Lt rather than just L is to emphasize that the coefficients of this operator
do depend on time.
101
Lt is called the backward operator or also the generator of the (non timehomogeneous) diffusion (94). Theorem 10.8 says that the evolution equation
h(t, x)
= Lt h(t, x)
t
lim u(t, x) = f (x)
tu
1
= b(t, x)x h(t, x) 2 xx h(t, x) .
2
t h(t, x) = lim
x,t ,t
(u).
Therefore
h(t , x) = E[E(f (X x,t (u))|X x,t (t))]
= Eh(t, X x,t (t)).
Using the above and Lemma 10.7 we get the result:
h(t, x) h(t , x)
Eh(t, X x,t (t)) h(t, x)
= lim
0
0
1
= b(t, x)x h(t, x) 2 xx h(t, x).
2
lim
102
1 2
(D(t, y)p) = 0
+
(b(t, y)p)
t
y
2 y 2
lim p(s, x, t, y) = (x y) .
ts
(97)
(98)
Time-homogeneous case
Let us now specialize to the time-homogeneous case. In this case the coefficients of the SDE (94) do not depend on time. The process we are concerned
with is then solution of the SDE
dXu = b(Xu )du + (Xu )dWu ,
Xt = x ,
(99)
where b(x) and (x) satisfy all the assumptions stated so far. In other words,
the function h(t, x) = E[f (Xu )|Xt = x] becomes
h(t, x) = E[f (Xu )|Xt = x] = E[f (Xut )|X0 = x].
Setting = u t and noticing that t = , equation (95) becomes now
h(, x)
= Lh(, x)
(100)
where the differential operator L is now the generator of the diffusion in the
sense of Definition (84),
L = b(x)
1
2
+ 2 (x) 2 .33
x 2
x
32
103
t0
for every fixed x R, where L is precisely the flat L2 -adjoint of the operator
L:
1 2 2
L = (b(y)) +
( (y)) .
y
2 y 2
Example 10.11. Consider the O-U process
dXt = Xt dt + dWt ,
, > 0.
2 2
+
x
2 x2
2 2
(x) +
x
2 x2
The equation L = 0 has a unique normalized solution
r
x2 /2D
=
e
,
2D
L =
104
n
X
j=1
1 X i,j
2
b (x) j +
D (x) j j .
x
2
x x
j
i,j=1
Notice that the diffusion matrix is always symmetric and positive definite.
Example 10.14. Consider the system
a, m > 0
mz
+ 2a 2 6z
+ 2.
y
z
y
z z
n
X
ai (x)
i=1
n
X
2
ij
+
M
(x)
,
xi
xi xj
i,j=1
M ij (x)v i v j kvk2 ,
i,j=1
10.3
(101)
2
+ 2
x x
2
V 0 (x) + 2 ,
x
x
V (x)(x) +
(x) = 0
L =0
x
x
V 0 (x)(x) +
(x) = const.
x
Solving the above SDE gives (x) = eV (x) /Z, where Z is a normalization
constant. Observe that when the potential is quadratic, this process is
precisely the O-U process.
The semigroup generated by L can be extended to a strongly continuous semigroup on the weighted L2 (see Lemma 9.13). Therefore, by the
Hille-Yosida Theorem, L is a closed operator. Moreover, L is a self-adjoint
operator in L2 . Let us start with proving that L is symmetric. To this end,
let us first recall that the scalar product in L2 is defined as
Z
hf, gi :=
f (x)g(x) (x)dx .
R
106
Let us now show that hLf, gi = hf, Lgi , for every f, g ***smooth and in
L2 :
Z
hLf, gi = (Lf )g dx
R
Z
Z 2
d
d
0
=
V (x)
f g dx +
f g dx
2
dx
R
R dx
Z
df dg
=
dx
R dx dx
Z
Z
d2 g
dg 0
f 2 dx
f
=
V dx
dx
dx
R
R
Z
=
f (Lg) dx = hf, Lgi .
R
(102)
which will be useful in the following - notice also that (102) makes the
symmetry property obvious. Because L is symmetric and closed 34 , it is
also self-adjoint in L2 .
Let us now come to explain why diffusions generated by a self-adjoint
operator are so important.
Definition 10.18. A probability measure is reversible for a Markov semigroup Pt if for any f, g Bm
Z
Z
(Pt f )g d(x) = f (Pt g) d(x).
(103)
In this case it is also customary to say that satisfies the detailed balance
condition with respect to Pt .
Analogously to the time discrete case if Pt is the Markov semigroup associated to some Markov process Xt and (103) is satisfied, then the process Xt
is time-reversible (for a proof of this fact in the time-continuous setting see
[70], which is an excellent reference). A given Markov semigroup might have
more than one reversible measure. We denote R(Pt ) the set of reversible
measures for a given Markov semigroup Pt . Taking g 1, it is obvious that
34
and defined on a dense subset of L2 , namely the set of Schwartz functions, see [64].
107
(104)
f
f d d hLf, f i , for every f L2 D(L). (105)
R
The largest positive number such that (105) is satisfied is called the spectral gap of the self-adjoint operator L.
The therm on the RHS of (105) is called the Dirichlet form of the operator L.
Remark 10.21. If L is a self adjoint operator then the form hLf, f i is
real valued (for every f L2 D(L)). In particular the spectrum of L is
real. If L is the generator of a strongly continuous Markov semigroup and
108
Z
Pt f
2
2
Z
Z
2t
f d d e
f
f d d ,
(106)
implies
Z
(Pt f ) d e
2t
f 2 d
d
dt
ft2
Z
d 2
109
ft2 d.
Example 10.23 (I.e. Example 10.17 continued). Going back to the process
Xt solution of (101), we now have the tools to check whether Xt converges
exponentially fast to equilibrium. Working again with mean zero functions
(notice that in this case the kernel of the generator is made only of constants), the spectral gap inequality reduces to
Z
Z 2
df
2
dx.
f dx
(108)
R
R dx
Those of you who have taken a basic course in PDEs will have noticed
that this is a Poincare Inequality for the measure . In Appendix B.4,
I have recalled the basic Poincare inequality that most of you will have
already encountered. If the potential V (x) is quadratic then the measure
does satisfy (108) and in this case we therefore have exponentially fast
convergence to equilibrium. For general potentials a classic result is the
following.
Lemma 10.24. Let V (x) : Rn R be a twice differentiable confining
potential such that = eV (x) is a probability density. Denote by H 1 () the
weighted H 1 , with norm
kf k2H1 := kf k2L2 + kf k2L2 .
If V (x) is such that
|V |2
|x|
V (x) ,
2
then the measure satisfies a Poincare inequality: for all functions h
H 1 (eV (x) ) and for some K > 0:
"Z
Z
2 #
Z
2 V (x)
|h| e
dx K
h2 eV (x) dx
heV (x) dx
.
R
Example 10.25. Let v(x) be a smooth function and consider the process
2
+ 2.
x x
2
(V 0 ) + 2 = 0.
x
x
(110)
(v) = 0 .
x
Lv = 0
111
10.4
The take-home message from Section 10.3 is: diffusions generated by operators which are elliptic and self-adjoint are usually the easiest to analize
as, roughly speaking, ellipticity gives the existence of a C density for the
invariant measure while the self-adjointedness of the operator gives nice
spectral property which are key to proving exponential convergence to equilibrum. In this section we want to see what happens when the generator
is non elliptic and non self-adjoint. We shall mainly focus on how to prove
exponential convergence to equilibrium for non self-adjoint diffusions. Nihil
recte sine exemplo docetur, so let us start with the main motivating example,
the Langevin equation:
dq = pdt
dp = q V (q)dt pdt +
2dWt ,
(111)
where V (q) is a confinig potential and Wt is one dimensional standard Brownian motion. If we assume that the potential is locally Lipshitz then the
following general result, which can be found for example in [72, Chapter 10],
guarantees the existence of a strong solution for the system (111).
Theorem 10.26. Let b(t, x) : R+ Rn Rn be Locally Lipshitz, uniformly
for t [0, T ] for each T > 0 and suppose
1. sup0tT |b(t, 0)| < , for all T > 0;
2. there exists k > 0 such that hx, b(t, x)i 0 if x is outside of a ball of
radius k.
Then the SDE
dXt = b(t, Xt )dt + dWt
admits a unique strong and nonexploding solution for any random initial
datum X0 and any constant > 0.
The generator of (111) is
L = pq q V (q)p pp + p2
(112)
(113)
Showing that this process has a unique invariant measure and that such a
measure has a density is not straightforward. However the process is indeed
ergodic with invariant density
(q, p) =
e(V (q)+p
Z
2 /2)
(114)
What is easy to see is that the kernel of the generator is only made of
constants. In showing this fact we will also gain a better understanding of
the system (111). The dynamics described by (111) can be thought of as
split into a Hamiltonian component,
q = p
p = q V (q)
(115)
p2
.
2
(116)
for every f,
so that
T ? = p + p.
114
LH
|{z}
?
T
T
|{z}
skew symmetric
symmetric
deterministic
stochastic
conservative
dissipative
(117)
37 .
115
h K D(T ),
and t 0.
h H
and t 0.
h H
(118)
h H1 /K,
(119)
for some constant k > 0. If, in addition to the above assumptions, we have
A? A + C ? C is coercive for some > 0,
then T is hypocoercive in H1 /K: there exist constants c, > 0 such that
ketL kH1 /KH1 /K cet .
One can prove that space K is the same irrespective of whether we consider the
scalar product h, i of H or the scalar product h, iH1 associated with the norm k kH1 .
38
117
Comment. I will not write a proof of this theorem but I will explain
how it works. The idea is the same that we have explained in Remark 10.31.
Consider the norm
((h, h)) := khk2 + akAhk2 + ckChk2 + 2b(Ah, Ch),
where a, b and c are three strictly positive constants to be chosen. The
assumptions 1. , 2. and 3. are needed to ensure that this norm is equivalent
to the H1 norm, i.e. that there exist constant c1 , c2 > 0 such that
c1 khkH1 ((h, h)) c2 khkH1 .
If we can prove that T is coercive in this norm, then by Proposition 10.29 and
Remark 10.31 we have also shown exponential convergence to equilibrium
in the H1 norm i.e. hypocoercivity. So the whole point is proving that
((T h, h)) K((h, h)),
for some K > 0. If 1. ,2. and 3. hold then (with a few lengthy but surprisingly
not at all complicated calculations) (119) follows. From now on K > 0 will
denote a generic constant which might not be the same from line to line.
The coercivity of A A + C C means that we can write
1
kAhk2 + kChk2 = (kAhk2 + kChk2 ) +
2
1
(kAhk2 + kChk2 ) +
2
KkhkH1 .
1
(kAhk2 + kChk2 )
2
khk2
2
and B = pq q V p ,
so that
C := [A, B] = AB BA = q .
The kernel K of the operator L is made of constants and in this case the
norm H1 will be the Sobolev norm of the weighted H 1 ():
kf k2H1 := kf k2L2 + kq f k2L2 + kp f k2L2 .
Let us first calculate the commutators needed to check the assumptions of
Theorem 10.32.
[A, C] = [A? , C] = 0,
[A, A? ] = Id
(120)
and
p
[B, C] = 1 q2 V (q)p .
Lemma 10.34. Let V C (R). Suppose V (q) satisfies
|q2 V | C 1 + |q V | ,
(121)
(122)
for some constant C 0. Then, for all f H 1 (), there exists a constant
C > 0 such that
k(q2 V ) p f k2L2 () C kf k2L2 () + k(q V )f k2L2 () .
(123)
Theorem 10.35. Let V C (R), satisfying (122) and the assumptions
of Lemma 10.24. Then, there exist constants C, > 0 such that for all
h0 H 1 (),
Z
t L
e
h0 h0 d
Cet kh0 kH 1 () .
(124)
H 1 ()
Proof. We will use Theorem 10.32. We need to check that conditions (i)
to (iv) of the theorem are satisfied. Conditions (i) and (ii) are satisfied,
due to (120). Having calculated (121), condition (iii) requires q2 V p to
119
2 . From Lemma
be relatively bounded with respect to p , p2 , q and qp
10.34 this is trivially true as soon as condition (122) holds. Now we turn to
condition (iv). Let us first write the operator Lb = A? A + C ? C:
Lb = pp p2 + q V q q2 .
In order for this operator to be coercive, it is sufficient for the Gibbs measure
(dpdq) = Z1 eH(q,p) dpdq to satisfy a Poincare inequality. This probability measure is the product of a Gaussian measure (in p) which satisfies a
Poincare inequality, and of the probability measure eV (q) dq. It is sufficient,
therefore, to prove that eV (q) dq satisfies a Poincare inequality. This follows
from Lemma 10.24 and Lemma 10.34.
120
11
in time t is known ([?]- [11], [?], [?] and references therein). The analytical
approach is based on the theory of fractional differentiation operators, where
the derivative can be fractional either in time or in space (see [?]-[?], [?] and
references therein).
For f (s) regular enough (e.g. f C(0, t] with an integrable singularity
at s = 0), let us introduce the Riemann-Liouville fractional derivative,
Z
1 d t
f (s)
1
Dt (f ) :=
ds
,
0<< ,
(125)
12
(2) dt 0
(t s)
2
and the Riemann-Liouville fractional integral,
Z t
f (s)
1
ds
It (f ) :=
,
(2 1) 0
(t s)22
1
< < 1,
2
(126)
where is the Euler Gamma function ([?]). Appendix ?? contains a motivation for introducing such operators. For 12 < < 1 let us also introduce
the fractional Laplacian () , defined through its Fourier transform: if the
Laplacian corresponds, in Fourier space, to a multiplication by k 2 , the
1
fractional Laplacian corresponds to a multiplication by | k | . (125) and
(126) can be defined in a more general way (see [?]), but to our purposes
the above definition is sufficient. Furthermore, notice that the operators in
(125) and (126) are fractional in time, whereas the fractional Laplacian is
fractional in space.
Let us now consider the function (t, x), solution to
Z
1 d t
2 (s, x)
1
t (t, x) =
ds x
when 0 < < ,
(127)
(2) dt 0
(t s)12
2
t (t, x) =
1
(2 1)
ds
0
x2 (s, x)
(t s)22
when
1
< < 1.
2
(128)
It has been shown (see [?, ?] and references therein) that such a kernel
is the asymptotic of the probability density of a CTRW run by a particle either moving at constant velocity between stopping points or instantaneously jumping between halt points, where it waits a random time before
jumping again. On the other hand, a classic result states that the Fourier
transform of the solution (t, x) to
1
t (t, x) = () (t, x),
2
1
< < 1,
2
(129)
in equation (129). Processes of this kind are particular CTRWs, the well
known Levy flights; in this case large jumps are allowed with non negligible
probability and this results in the process having divergent second moment.
We will use the notation (t, x) = t (x) to indicate the solution to either
(127), (128) or (129), as in the proofs we use only the properties that these
kernels have in common.
The above described framework is analogous to the one of Einstein diffusion:
for subdiffusion and Riemann-type superdiffusion the statistical description
is given by CTRWs, whose (asymptotical) density is the fundamental solution of the evolution equation associated with the operators of fractional
differentiation and integration, i.e. (127) and (128), respectively (see Appendix B). For the Levy-type superdiffusion, the statistical point of view is
given by Levy flights, whose probability density evolves in time according to
the evolution equation associated with the fractional Laplacian, i.e. (129)
(see [?]).
we want to show how the operators Dt and It naturally arise in the
context of anomalous diffusion and explain in some more detail the link
with CTRWs.
We want to determine an operator A s.t.
t t (x) = A t (x)
t (0) = 0 ,
with (t, x) enjoying the following three properties:
Z
Z
Z
dxt (x) x2 t2
dxt (x) x = 0 e
dxt (x) = 1 ,
(130)
and
1
ck 2
1
2+1 (2 + 1) = (1 c1 2 k 2 ),
2
We can assume that the expression for (, k) is valid in the regime 2 k 2 <<
1. Actually, condition (130)3 is meant for an infinitely wide system and for
long times. In other words, if is the region where the particle moves, we
claim that
R
dxt (x) x2
lim lim
= const.
t R
t2
This means that we are interested in the case k << . Of course one can in
principle find an infinite number of functions s.t. (, k) = 1 (1 c1 ) for
= 2 k 2 . One possible choice is
(, k) =
1
1
= 1 2
=
,
2
(1 + c1 )
+ (c1 k)
+ c1 k 2 12
(131)
1
2
1
(2 1)
ds
0
(s, k)
;
(t s)22
so that
(, k)12 is the Laplace transform of
1 d
(2) dt
ds
0
(s, k)
.
(t s)12
Finally, taking the inverse Fourier transform, we get that (t, x) satisfies
(127) when 0 < < 12 and (128) when 12 < < 1. Moreover, the explicit
expression for t (x) holds true: by (131) we get that
Z
1 |x|
(, k) =
dx eikx e c1
2 c1
R
124
hence
1 |x|
# (x, ) = e c1
2 c1
and now, by the inverse Laplace formula, we obtain (??). Obviously, the
expression (??) has been deduced after having chosen (131) among all possible candidates for and this choice can now be justified in view of the
link with CTRWs.
125
Appendices
A
Miscellaneous facts
A.1
We said that white noise is the derivative of BM, without justifying this
statement. We will give here a formal explanation. If white noise is the
derivative of BM then in some sense
Wt+h Wt
.
h0
h
t = lim
By definition of BM, the process on the RHS of the above must be mean
zero and Gaussian, so let us check its covariance function:
Wt+h Wt Ws+h Ws
E
h
h
1
= 2 [(t + h) (s + h) (t + h) s (s + h) t s t]
h
[[(s t) + h] ([t s] ([s t] + h))]/h2 if s 6= t
=
1/h
if t = s
0
if t 6= s
= 0 (t s) ,
+ if t = s
because if s < t then t (s + h) = s + h as h gets smaller (analogous in the
case t < s).
A.2
Gronwalls Lemma
126
Integral form: Let u(t), a(t) and b(t) be continuous real valued
functions on R+ , with b(t) nonnegative and a(t) non decreasing. If
Z
u(t) a(t) +
b(s)u(s) ds ,
for some c t,
then
u(t) a(t)e
A.3
Rt
c
b(s) ds
Excellent references for the material of this appendix are the books [80, 64]
Let (X, k kX ) and (Y, k kY ) be two Banach spaces.
Definition B.1. A linear operator A : X Y is bounded if there exists a
constant M > 0 such that
kAxkY M kxkX ,
127
for all x X .
(132)
kAxkY
kAxkY
= sup
,
kxkX
kxkX =1 kxkX
B.1
Adjoint operator
x X, y Y .
Remark B.4. In a Hilbert space setting the duality relation is just the
scalar product. Because the dual of a Hilbert space is the Hilbert space
itself, the above definition coincides, in the Hilbert case, with the one that
you will have already seen, which is: if H is a Hilbert space and A : H H
is a bounded linear operator, A : H H is the adjoint operator of A if
hAx, yi = hx, A yi, for all x, y H (where this time h, i is just the scalar
product in H).
If the operator is unbounded, the definition of adjoint is slightly more
involved.
Definition B.5 (Adjoint of unbounded operator). Let X and Y be Banach
spaces and A be an unbounded operator A : D(A) X Y . If 39 D(A) is
dense in X then for every y Y there exists a unique x
X such that
hAx, yiY,Y = hx, x
iX,X ,
x D(A).
(133)
x D(A).
129
for all x, y H.
(134)
for all x, y H.
One can also prove that a bounded linear operator on a Hilbert space is
unitary if anf only if U = U 1 .
B.2
Let (V, k k) be a normed space and V its dual (i.e. the space of bounded
linear operators T : V R), endowed with the operator norm.
i) A sequence xn of elements of V is said to converge (strongly) or in norm
to x V if
lim kx xn k = 0 .
n
for all T V .
130
for all x V.
B.3
Lf := lim
(135)
for all f D(L) := {f B : the limit on the RHs of (135) exists in B}.
It is clear that if Pt , t R, is a C0 -group then Pt , t R+ , is a C0 semigroup generated by L and Pt , t R+ is a C0 -semigroup generated by
L. If Pt is invertible (as an operator) then the inverse is precisely Pt .
Theorem B.9 (Stones Theorem). L is the generator of a C0 group of
unitary operators on a Hilbert space if and only if iA is self-adjoint.
B.4
These are the functional inequalities that you will have seen in a first course
on PDEs.
Poincar
e Inequality (in its most classic form) Let U be a bounded,
connected, open subset of Rn with a C 1 boundary U . Then for every
1 p and f W 1,p (U ),
kf hf iU kLp (U ) Ckf kLp (U ) ,
where hf iU denotes the average of f over U and C is a constant depending on n, p and U .
131
Laplace method
We assume that the functions f (x), h(x) : R R are C and that h(x)
admits a unique global minimum, which is attained at x = x0 . In other
words there exists a unique x0 R such that
1. h0 (x0 ) = 0 and h00 (x0 ) > 0;
2. h(x0 ) < h(x) for all x 6= x0 .
Rewriting
I() = e
h(x0 )/
f (x)e[h(x0 )h(x)]/ dx
0 x 6= x0
1 x = x0 ,
it is clear that the main contribution to the value of the integral I() will
come from a neighbourhood of the point x0 . So for some small = () > 0
we can write
Z
h(x0 )/
I() = e
f (x)e[h(x0 )h(x)]/
R
Z x0 +
eh(x0 )/
f (x)e[h(x0 )h(x)]/
x0
x0 +
e[h(x0 )h(x)]/
Zx0
eh(x0 )/ f (x0 ) e[h(x0 )h(x)]/
R
Z
f (x0 ) eh(x)/ .
R
0
eh(x)/ dx
132
f (x0 ).
Clearly all the above clculations are only formal, and in doing a rigorous
proof one should quantify all the . The Laplace method consists in using
an idea similar to the one described above, in order to find the behaviour of
I() to leading order (in ):
Z
f (x)e[h(x0 )h(x)]/ dx
I() = eh(x0 )/
R
Z x0 +
0
00
2
h(x0 )/
e[h (x0 )(xx0 )/ e[h (x0 )(xx0 ) ]/2
e
f (x0 )
x0
x0 +
00 (x
e[h
2
0 )(xx0 ) ]/2
dx
x0
h(x0 )/
00 (x
e[h
f (x0 )
ZR
ev
2 /2
2
0 )(xx0 ) ]/2
s
h(x0 )/
=e
f (x0 )
2h00 (x
0)
dx
dv
2
.
h00 (x0 )
So as 0,
s
I() e
h(x0 )/
f (x0 )
2
h00 (x
133
0)
to leading order.
References
[1] L. Arnold. Stochastic Differential Equations: Theory and Applications,
Wiley, 1974
[2] P. Baldi, Calcolo delle Probabilit`a e Statistica. Second Edition. Mc
Graw-Hill, Milano 1998.
[3] A. Beskos, G. Roberts and A. M. Stuart. Optimal Scalings for local
Metropolis-Hastings chains on nonproduct targets in high dimensions.
Ann. Appl. Prob. 19 863-898
[4] D. Bakry and M. Emery, Diffusions hypercontractives, pp. 177-206 in
Sem. de Probab. XIX, Lecture Notes in Math., Vol. 1123, SpringerVerlag, Berlin, 1985.
[5] D. Bakry, I. Gentil and M. Ledoux. Analysis and Geometry of Markov
Diffusion Operators. Springer, 2013
[6] A. Beskos, G. Roberts, A. M. Stuart and J. Voss. MCMC methods for
diffusion bridges. Stoch. Dyn. 8 319-350.
[7] A. Beskos and A. M. Stuart. MCMC methods for sampling function
space. In ICIAM Invited Lecture 2007. European Mathematical Society,
Z
urich.
[8] P. Billingsley. Convergence of probability measures. Second edition.
Wiley Series in Probability and Statistics: Probability and Statistics.
New York, 1999.
[9] P. Billingsley. Ergodic Theory and information. Robert E. Krieger Publishing Company. Huntington, New York, 1978.
[10] G. Casella and E.I. George. Explaining the Gibbs sampler.(English summary) Amer. Statist. 46 (1992), no. 3, 167174.
[11] A.V. Chechkin, J. Klafter and I.M. Sokolov. Distributed-Order Fractional Kinetics, 16th Marian Smoluchowski Symposium on Statistical
Physics: Fundamental and Applications, September 2003
[12] G. Da Prato and J. Zabczyk, Stochastic equations in infinite dimensions, Cambridge University Press, Cambridge, 1992.
134
135
[25] A. Guionnet and B. Zegarlinski. Lectures on Logarithmic Sobolev inequalities. Lecture Notes.
[26] M. Hairer. Ergodic Theory for Stochastic PDEs. Lecture notes.
[27] M. Hairer. Ergodic properties of a class of non-Markovian processes. In
Trends in stochastic analysis, volume 353 of London Math. Soc. Lecture
Note Ser., pages 6598. Cambridge Univ. Press, Cambridge, 2009.
[28] M. Hairer. How hot can a heat bath get? Comm. Math. Phys., 292
(2009), 1, 131-177.
[29] M. Hairer and G. A. Pavliotis. From ballistic to diffusive behavior in
periodic potentials. J. Stat. Phys., 131(1): 175202, 2008.
[30] W.K. Hastings. Monte Carlo Sampling Methods Using Markov Chains
and Their Applications. Biometrika, 57, No. 1. (Apr., 1970), 97109.
[31] P. Hanggi, P. Talkner, and M. Borkovec. Reaction-rate theory: fifty
years after Kramers. Rev. Modern Phys., 62(2):251341, 1990.
[32] B. Helffer and F. Nier. Hypoelliptic estimates and spectral theory for
Fokker-Planck operators and Witten Laplacians, volume 1862 of Lecture
Notes in Mathematics. Springer-Verlag, Berlin, 2005.
[33] F. Herau. Short and long time behavior of the Fokker-Planck equation
in a confining potential and applications, J. Funct. Anal., 244(1): 95
118, 2007.
[34] F. Herau and F. Nier. Isotropic hypoellipticity and trend to equilibrium
for the Fokker-Planck equation with a high-degree potential, Arch.
Ration. Mech. Anal. 171 (2), 151218 (2004)
[35] M. Hitrik and K. Pravda-Starov. Spectra and Semigroup smoothing for
Non-Elliptic Quadratic Operators, Math. Ann. 344, No 4 (2009), pp
801846.
[36] L. H
ormander. Hypoelliptic second order differential equations. Acta
Math., 119:147171, 1967.
[37] L. H
ormander, The analysis of linear partial differential operators, vol.
I,II,III,IV, Springer-Verlag, New York (1985)
[38] V. Jaksic and C.-A. Pillet. Ergodic properties of the non-Markovian
Langevin equation, Lett. Math. Phys., 41(1): 4957, 1997.
136
137
139
140
12
A Few Exercises
Warning: all the Markov chains mentioned in this exercise sheet are timehomogeneous Markov chain, unless otherwise stated.
1. Show that continuous functions are measurable.
2. Let {Xn }n be a sequence of i.i.d taking values in the set {1, . . . , m}
and P(Xj = k) = pk , where p1 , . . . , pk are positive numbers adding up
to one. Let us fix one value k {1, . . . , m} and define
1 if Xj = k
Yj =
0 otherwise.
What can we say about Sn =
1
n
Pn
j=1 Yj
time we take one ball from each urn and we exchange them. The state
of the system can be described, for example, by keeping track of the
number k of black balls in the left urn. Find the transition probability
of the Markov chain that describes the evolution of the number k.
8. Simple Random Walk. Let i be i.i.d. random variables taking
values in {1, 1}. In particular P(i = 1) =
Pp and P(i = 1) = 1 p,
for all i and for some 0 < p < 1. Let n = ni=1 i . Find the transition
probabilities of the Markov chain n .
9. Give an example of a set (and a chain) which is closed but not irreducible and, viceversa, of a set that is irreducible but not closed.
10. Show that in the Ehrenfest chain all states are recurrent.
11. Prove Lemma 3.30.
12. Let $X_n$ be a time-homogeneous MC on a finite state space $S$. Prove that if the state space $S$ is irreducible then there exists a unique stationary distribution $\pi$. Write an expression for $\pi$.
13. Queue. As we have all experienced at least once, a queue is a line where customers wait for service. Suppose that at most $n$ people are allowed in the queue, i.e. at each moment in time there can be at most $n$ people in the queue. Rules of the queue:
- If the queue has strictly fewer than $n$ customers, then with probability $\alpha$ a new customer joins the queue.
- If the queue is not empty, then with probability $\beta$ the head of the line is served and leaves the queue.
From this we deduce that the queue remains unchanged
- with probability $1 - \alpha$ if it is empty;
- with probability $1 - \beta$ if it is full (i.e. $n$ people waiting);
- with probability $1 - \alpha - \beta$ otherwise.
All other possibilities occur with zero probability. It should now be clear that this situation can be modelled using a time-homogeneous MC with finite state space $S = \{0, 1, \dots, n\}$.
(a) Write down the transition probabilities of the chain.
(b) Prove that the chain has a unique stationary distribution.
Consider the setting and notation of Example 4.1. Use Exercise 16 and the above statement to prove that if $\pi$ is not the uniform distribution on $S$ then the chain with transition matrix $P$ converges to $\pi$.
Hint: show that there exist two states $a, b \in S$ such that $q(a, b) > 0$ and $\pi(b) < \pi(a)$. Then look at $p(a, a)$.
18. Show that the chain produced by Algorithm 4.6 satisfies the detailed balance condition, i.e. it is $\pi$-reversible.
19. Consider the accept-reject method, Algorithm 4.4. Calculate the probability of acceptance of the sample $Y$ at each iteration. Let $N$ be the number of iterations needed before successfully producing a sample from the distribution $\pi$, i.e. the number of rejections before an acceptance. Calculate $P(N = n)$ for all $n \ge 1$ and the expected value of $N$ as a function of $M$.
Hint: $N$ is a random variable with geometric distribution.
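A numerical illustration (a sketch; the Beta(2,2) target, the uniform proposal and the bound M = 3/2 are my own choices, not part of the exercise): the acceptance probability at each trial should be 1/M and the mean number of trials should be M.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 1.5  # sup_x pi(x)/q(x) for pi = Beta(2,2) and q = Uniform(0,1)

def target(x):
    """Density of Beta(2,2)."""
    return 6.0 * x * (1.0 - x)

def sample_one():
    """Accept-reject: return (accepted sample, number of trials used)."""
    trials = 0
    while True:
        trials += 1
        y = rng.uniform()                   # proposal Y ~ q = Uniform(0,1)
        if rng.uniform() * M <= target(y):  # accept with prob pi(Y)/(M q(Y))
            return y, trials

trials = np.array([sample_one()[1] for _ in range(50_000)])
# mean number of trials (accepted one included) ~ M; rejections alone ~ M - 1
print(trials.mean())  # ~ 1.5
```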
20. Write down the Metropolis-Hastings algorithm to simulate the normal distribution $N(0, 1)$ using a proposal which is uniformly distributed on $[-\delta, \delta]$, for an arbitrary $\delta > 0$. (As you can imagine, in computational practice it is important to choose a good $\delta$.)
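A minimal sketch of the resulting algorithm (in Python; δ = 1 is an arbitrary choice). Since the proposal is symmetric, the Hastings ratio reduces to the Metropolis ratio π(y)/π(x) = exp((x² − y²)/2):

```python
import numpy as np

rng = np.random.default_rng(2)
delta, n_steps = 1.0, 100_000  # delta = 1 is an arbitrary choice
x = 0.0
chain = np.empty(n_steps)

for i in range(n_steps):
    y = x + rng.uniform(-delta, delta)  # symmetric uniform proposal
    # accept with probability min(1, pi(y)/pi(x)), pi = N(0,1) density
    if rng.uniform() < np.exp(0.5 * (x * x - y * y)):
        x = y
    chain[i] = x

print(chain.mean(), chain.var())  # ~ 0 and ~ 1
```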
21. Let $X_t$ be a continuous time Markov process. If the process is not time-homogeneous then the transition probabilities are functions of four arguments:
$$p(s, w, t, A) : \mathbb{R}^+ \times S \times \mathbb{R}^+ \times \mathcal{B}(S) \to [0, 1], \qquad 0 \le s \le t,$$
with
$$p(s, w, t, A) = P(X_t \in A \,|\, X_s = w).$$
Any Markov process can be turned into a time-homogeneous one, if we consider an extended state space: suppose $X_t$ with state space $S$ is not time-homogeneous and show that the process $Y_t = (t, X_t)$ with state space $\mathbb{R}^+ \times S$ is instead time-homogeneous. (Hint: just calculate the transition probabilities of $Y_t$.)
22. Let $\xi(t)$ be the process defined in Exercise 4. $\xi(t)$ is called the stationary Ornstein-Uhlenbeck process (O-U process). $\xi(t)$ is a Markov process and the transition probabilities of such a process have a density. Show that
$$P(\xi_t = y \,|\, \xi_s = x) = \frac{1}{\sqrt{2\pi(1 - e^{-2(t-s)})}}\,\exp\left\{-\frac{\big(y - x e^{-(t-s)}\big)^2}{2(1 - e^{-2(t-s)})}\right\}.$$
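One can check this formula by simulation (a sketch; consistently with the mean $xe^{-(t-s)}$ and variance $1 - e^{-2(t-s)}$ above, I assume the O-U process solves $d\xi_t = -\xi_t\,dt + \sqrt{2}\,dW_t$):

```python
import numpy as np

rng = np.random.default_rng(3)
x, t = 1.5, 1.0                 # condition on xi_0 = x, look at xi_t
n_paths, n_steps = 50_000, 500
dt = t / n_steps

# Euler-Maruyama for d(xi) = -xi dt + sqrt(2) dW, started at x
xi = np.full(n_paths, x)
for _ in range(n_steps):
    xi += -xi * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n_paths)

# compare with the Gaussian transition density above (taking s = 0)
print(xi.mean(), x * np.exp(-t))         # empirical vs. theoretical mean
print(xi.var(), 1.0 - np.exp(-2.0 * t))  # empirical vs. theoretical variance
```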
23. Prove that if the coefficients $b$ and $\sigma$ of an Ito SDE (satisfying the conditions of the existence and uniqueness theorem) do not depend on time, then the solution of the SDE is a time-homogeneous Markov process.
32. Let $P_t$ be the semigroup defined in Example 9.3. Find the generator of $P_t$ in the space $C_b$.
33. Find the generator of the semigroup
$$T_t f(x) = \begin{cases} f(x + t) & \text{if } x + t \le 1 \\ 0 & \text{if } x + t > 1 \end{cases}$$
defined on the space $C^1[0, 1]$.
34. Show that the semigroup $P_t$ associated to a time-homogeneous Markov process is a contraction semigroup in $L^\infty$, i.e. $\|P_t f\|_\infty \le \|f\|_\infty$.
35. Let $X_t$ be a real valued wide sense stationary process with $R := E(X_t)$, and denote by $C(t, s) = C(t - s) = E(X_t X_s) - R^2$ its covariance function. Assume that $C(t) \in L^1(\mathbb{R}^+)$, i.e. $\int_{\mathbb{R}^+} C(s)\,ds < \infty$. Show that the process is ergodic in the following sense:
$$\lim_{T \to \infty} E\left|\frac{1}{T}\int_0^T X_s\,ds - R\right|^2 = 0.$$
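A simulation illustrating this statement (a sketch; the stationary O-U process of Exercise 22, whose covariance $C(u) = e^{-u}$ is integrable, is my choice of example, so the averaged square should be of size roughly $\frac{2}{T}\int_0^\infty C(u)\,du = 2/T$):

```python
import numpy as np

rng = np.random.default_rng(4)
n_paths, T, dt = 2_000, 50.0, 0.01
n_steps = int(T / dt)

# stationary O-U: X_0 ~ N(0,1), dX = -X dt + sqrt(2) dW, so C(u) = e^{-u}
X = rng.standard_normal(n_paths)
avg = np.zeros(n_paths)
for _ in range(n_steps):
    avg += X * dt
    X += -X * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n_paths)

# E |(1/T) int_0^T X_s ds|^2 should be of order 2/T = 0.04
print(np.mean((avg / T) ** 2))
```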
36. Transform the following Stratonovich SDEs into Ito SDEs, or vice versa (the relevant conversion rule is recalled below):
$$X_t = X_0 + \int_0^t X_s^2\,ds + \int_0^t \cos(X_s) \circ dW_s$$
$$dX_t = \sinh(X_t)\,dt + t^2\,dW_t$$
$$dX_t = -X_t\,dt + 3X_t \circ dW_t$$
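For reference, the conversion rule needed here is the standard one-dimensional drift correction: the Stratonovich equation $dX_t = b(X_t)\,dt + \sigma(X_t) \circ dW_t$ has the same solutions as the Itô equation
$$dX_t = \Big(b(X_t) + \frac{1}{2}\,\sigma(X_t)\,\sigma'(X_t)\Big)\,dt + \sigma(X_t)\,dW_t.$$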
37. Prove Lemma 10.7. Hint: under the assumptions of Lemma 10.7,
$$\lim_{\epsilon \to 0} \frac{1}{\epsilon}\int_t^{t+\epsilon} E\big[f(s, X_s) \,\big|\, X_t = x\big]\,ds = f(t, x).$$
Use this hint to prove the lemma. If you want to prove also the statement of the hint then you can use the same kind of arguments that we used in the proof of Theorem 10.4.
38. Prove that (106) implies (105) (work with mean zero functions).
$$dX_t = -X_t^2\,dt + \sigma\sqrt{2}\,dW_t, \qquad \sigma > 0. \tag{139}$$
(a) Recognize that this equation is of the form (101), for an appropriate function $V(x)$. Write $V(x)$ (you may assume $V(0) = 0$).
(b) Write the generator $\mathcal{L}$ of the process and its invariant density (i.e. the density of the invariant measure).
where $\sigma$ and $r$ are strictly positive real numbers and $W_t$ and $B_t$ are one dimensional independent standard Brownian motions. Write the equation satisfied by
(a) $Z_t := X_t Y_t$;
(b) $g(X_t, Y_t)$, where $g$ is a twice differentiable function $g(x, y) : \mathbb{R}^2 \to \mathbb{R}$;
(c) $V_t := X_t^2 Y_t$.
45. Let $P_t$ be a Markov semigroup on a Banach space of real valued functions, $\mathcal{L}$ the generator of the semigroup and $\mu$ a reversible measure (on $\mathbb{R}$, and admitting density with respect to the Lebesgue measure) satisfying the spectral gap inequality with constant $\lambda$. Let $U(x)$ be a bounded (above and below) and measurable function on $\mathbb{R}$ and define a new measure $\nu$ with density
$$\nu(x) = \frac{1}{Z}\,e^{-U(x)}\,\mu(x),$$
where $Z$ is the normalizing constant. Notice that, due to the boundedness of $U$, if $f \in L^2_\mu$ then $f \in L^2_\nu$ as well (same for integrability) and also the other way around.
(a) Prove that for any constant $a \in \mathbb{R}$,
$$\int_{\mathbb{R}} \left(f - \int f\,d\nu\right)^2 d\nu \le \int_{\mathbb{R}} (f - a)^2\,d\nu.$$
13 Solutions
To check that this is the case, recalling that for the joint probability density of $A$ and $\eta$ we have $f_{A,\eta}(x, y) = f_{\eta|A}(y|x)\,f_A(x)$, we can write
$$E(A\cos(\eta t)) = \iint x\cos(yt)\,f_{A,\eta}(x, y)\,dx\,dy.$$
absorbent.
Now let $S = \{1, 2, 3\}$ and define
$$p = \begin{pmatrix} 0 & 1 & 0 \\ 0.5 & 0 & 0.5 \\ 0 & 0 & 1 \end{pmatrix}.$$
Then the set $A = \{1, 2\}$ is irreducible but not closed, as $2 \in A$ and $p(2, 3) = 0.5 > 0$ but $3 \notin A$.
10. Because we have only $N + 1$ possible states, this is easily done by drawing the chain (diagram: states $0, 1, \dots, N$ with the allowed transitions).
11. $$\sum_y \nu(y) = \frac{1}{1 - p_{xx}} \sum_{y \ne x} \nu(y),$$
so it has to be $p_{xx} = 1$.
12. $S$ is finite and irreducible. If the whole state space $S$ is irreducible, then it is also closed (this is not true for subsets of $S$, as we have seen in Exercise 9). Therefore all the states are recurrent, and the unique stationary distribution is given by $\pi(x) = 1/E_x T_x$.
13. (a) For $0 \le k \le n - 1$, $p(k, k + 1) = \alpha$; for $1 \le k \le n$, $p(k, k - 1) = \beta$; moreover
$$p(k, k) = \begin{cases} 1 - \alpha & \text{if } k = 0 \\ 1 - \alpha - \beta & \text{if } 1 \le k \le n - 1 \\ 1 - \beta & \text{if } k = n. \end{cases}$$
Otherwise $p(j, k) = 0$.
(b) Sketching a graph of the chain shows immediately that all the
states communicate with each other, so the chain is closed and irreducible, which on a finite state space implies that the chain has only
one stationary measure.
(c) Expanding the system of equations gives
$$\pi(0) = (1 - \alpha)\,\pi(0) + \beta\,\pi(1)$$
$$\pi(k) = \alpha\,\pi(k - 1) + (1 - \alpha - \beta)\,\pi(k) + \beta\,\pi(k + 1), \qquad 1 \le k \le n - 1$$
$$\pi(n) = \alpha\,\pi(n - 1) + (1 - \beta)\,\pi(n).$$
This system has clearly infinitely many solutions (all multiples of each other), so we use $\pi(0)$ as a parameter. A solution is $\pi(k) = \pi(0)\,(\alpha/\beta)^k$. Now impose the normalization condition and get
$$\pi(k) = \frac{(\alpha/\beta)^k}{\sum_{i=0}^n (\alpha/\beta)^i}.$$
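A quick numerical sanity check of this formula (a sketch; the values n = 5, α = 0.3, β = 0.5 are arbitrary choices of mine): build the transition matrix and verify that πP = π.

```python
import numpy as np

n, alpha, beta = 5, 0.3, 0.5  # arbitrary illustrative values

# transition matrix of the queue on S = {0, 1, ..., n}
P = np.zeros((n + 1, n + 1))
for k in range(n + 1):
    if k < n:
        P[k, k + 1] = alpha     # a new customer joins
    if k > 0:
        P[k, k - 1] = beta      # the head of the line is served
    P[k, k] = 1.0 - P[k].sum()  # the queue remains unchanged

# candidate stationary distribution pi(k) proportional to (alpha/beta)^k
pi = (alpha / beta) ** np.arange(n + 1)
pi /= pi.sum()

print(np.allclose(pi @ P, pi))  # True
```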
14. If $y$ is transient then $\sum_n p^n(x, y) < \infty$, hence $p^n(x, y) \to 0$. Indeed, $\sum_n p^n(x, y) < \infty$ because
$$\sum_{n=1}^\infty p^n(x, y) = E_x N(y) = \frac{\rho_{xy}}{1 - \rho_{yy}} < \infty.$$
$$\frac{N_n(y)}{n} \longrightarrow \frac{1}{E_y T_y}\,\mathbf{1}_{(T_y < \infty)},$$
and hence also
$$E_x\left[\frac{N_n(y)}{n}\right] \longrightarrow \frac{1}{E_y T_y}\,E_x \mathbf{1}_{(T_y < \infty)}.$$
16. Let $M = \max_{x,w \in S} n(x, w)$; then for any $x, y \in S$ we have
$$p^{2M}(x, y) \ge p^{n(x,z)}(x, z) \cdot \underbrace{p(z, z) \cdots p(z, z)}_{2M - n(x,z) - n(z,y)\ \text{times}} \cdot\, p^{n(z,y)}(z, y) > 0.$$
26. Apply Itô formula with $g(t, x) = 2 + t + e^x$ and get
$$dY_t = \Big(1 + \frac{1}{2}e^{W_t}\Big)dt + e^{W_t}\,dW_t.$$
For $Z_t$ instead you get
$$dZ_t = 2\,dt + 2B_t\,dB_t + 2W_t\,dW_t.$$
27. For example by integration by parts: set $X_1 = W^2$, $X_2 = W$. Applying Itô's rule we have $d(W^2) = 2W\,dW + dt$ and using the multiplication table $d(W^2)\,dW = 2W\,dt$. Therefore
$$\int_0^t W_s^2\,dW_s = W_t^3 - \int_0^t 2W_s^2\,dW_s - \int_0^t W_s\,ds - \int_0^t 2W_s\,ds,$$
so that, readjusting,
$$\int_0^t W_s^2\,dW_s = \frac{W_t^3}{3} - \int_0^t W_s\,ds.$$
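A Monte Carlo sanity check of this identity (a sketch): approximate both sides on a grid, computing the stochastic integral in the Itô way, i.e. evaluating the integrand at the left endpoints.

```python
import numpy as np

rng = np.random.default_rng(5)
T, n_steps, n_paths = 1.0, 1_000, 5_000
dt = T / n_steps

dW = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
W = np.cumsum(dW, axis=1)
W_left = np.hstack([np.zeros((n_paths, 1)), W[:, :-1]])  # left endpoints

lhs = np.sum(W_left**2 * dW, axis=1)                   # Ito integral of W^2
rhs = W[:, -1]**3 / 3.0 - np.sum(W_left, axis=1) * dt  # W_T^3/3 - int_0^T W ds
print(np.mean(np.abs(lhs - rhs)))  # small, and -> 0 as the grid is refined
```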
28. Before giving the solution I would like to remark that we will show that $N(0, \sigma^2/2b)$ is the stationary measure of the O-U process. Now the solution: from Example 8.10 we know that if $EX_0 = 0$ then also $E(X_t) = 0$ for all $t > 0$. Therefore $Cov(X_t, X_s) = E(X_t X_s)$. To calculate $E(X_t X_s)$ we use Theorem 8.4: for $s \le t$,
$$E(X_t X_s) = \sigma^2 e^{-b(t+s)} \int_0^s e^{2bu}\,du = \frac{\sigma^2}{2b}\,e^{-b(t-s)} - \frac{\sigma^2}{2b}\,e^{-b(t+s)}.$$
Using the Itô formula, we have
$$d(|X_t|^2) = 2X_t\,dX_t + \sigma^2\,dt = -2bX_t^2\,dt + 2\sigma X_t\,dW_t + \sigma^2\,dt,$$
so that
$$X_t^2 = X_0^2 - 2b\int_0^t X_s^2\,ds + 2\sigma\int_0^t X_s\,dW_s + \sigma^2 t.$$
$$Y_t = Y_0 + \sigma\int_0^t Y_s\,dW_s + \frac{1}{2}\sigma^2\int_0^t Y_s\,ds.$$
Taking expectation,
$$EY_t = EY_0 + \sigma\,E\int_0^t Y_s\,dW_s + \frac{1}{2}\sigma^2\int_0^t EY_s\,ds,$$
and since the stochastic integral has zero mean, $u(t) := EY_t$ solves
$$u'(t) = \frac{\sigma^2}{2}\,u(t), \qquad u(0) = 1.$$
$$d\tilde{X}_t = \Big(d(t) + \frac{f^2(t)}{2}\Big)\tilde{X}_t\,dt + f(t)\,\tilde{X}_t\,dW_t.$$
The solution to such an equation can be found by using Example 8.12 and it is
$$\tilde{X}_t = X_0\,e^{\int_0^t d(s)\,ds + \int_0^t f(s)\,dW_s}$$
(which is not the same as (67)).
$$X^k(t) = X_0\,e^{\int_0^t d(s)\,ds + \int_0^t f(s)\,\xi^k(s)\,ds}.$$
36. Notice that the covariance function is a symmetric function, i.e. $C(t, s) = C(s, t)$. Therefore (working, without loss of generality, with $R = 0$)
$$E\left|\frac{1}{T}\int_0^T X_s\,ds\right|^2 = \frac{1}{T^2}\,E\left(\int_0^T X_t\,dt\right)\left(\int_0^T X_s\,ds\right) = \frac{1}{T^2}\int_0^T dt \int_0^T ds\,C(t, s)$$
$$\overset{\text{symmetry}}{=} \frac{2}{T^2}\int_0^T dt \int_0^t ds\,C(t - s) \le \frac{2}{T^2}\int_0^T \|C\|_{L^1(\mathbb{R}^+)}\,dt = \frac{\text{const}}{T} \longrightarrow 0 \quad \text{as } T \to \infty.$$
37. The first becomes $X_t = X_0 + \int_0^t \big(X_s^2 - \frac{1}{2}\sin X_s \cos X_s\big)\,ds + \int_0^t \cos X_s\,dW_s$. The second remains unaltered, as the diffusion coefficient does not depend on $x$. The Itô form of the last one is $dX_t = \frac{7}{2}X_t\,dt + 3X_t\,dW_t$.
38. Let $X_u$, $u \ge t$, be the solution of (93). Using Itô formula we have
$$df(X_u) = \frac{\partial f}{\partial x}(X_u)\,b(u, X_u)\,du + \frac{\partial f}{\partial x}(X_u)\,\sigma(u, X_u)\,dW_u + \frac{1}{2}\frac{\partial^2 f}{\partial x^2}(X_u)\,\sigma^2(u, X_u)\,du.$$
Integrating between $t$ and $t + \epsilon$ and taking expectations,
$$E f(X_{t+\epsilon}) - f(x) = E\int_t^{t+\epsilon} \frac{\partial f}{\partial x}(X_u)\,b(u, X_u)\,du + \frac{1}{2}\,E\int_t^{t+\epsilon} \frac{\partial^2 f}{\partial x^2}(X_u)\,\sigma^2(u, X_u)\,du.$$
Now just let $\epsilon \to 0$ and use the hint.
39. If $\int f_t^2\,d\mu \le e^{-2\lambda t}\int f^2\,d\mu$, i.e. $e^{2\lambda t}\int f_t^2\,d\mu \le \int f^2\,d\mu$, then the function $t \mapsto e^{2\lambda t}\int f_t^2\,d\mu$ has a maximum at $t = 0$; differentiating at $t = 0$ gives
$$2\lambda\int f^2\,d\mu + 2\langle \mathcal{L}f, f\rangle \le 0,$$
which is (105).
40. With
$$\mathcal{L}v = -\sum_{i=1}^d \frac{\partial V}{\partial x_i}\frac{\partial v}{\partial x_i} + \sum_{i=1}^d \frac{\partial^2 v}{\partial x_i^2},$$
all the calculations are completely analogous to all you have seen in one dimension.
41. We use Exercise 31. Assuming (139), if $t, h \ge 0$ then by the semigroup property
$$\|P_{t+h}u - P_t u\| \le \|P_t\|\,\|P_h u - u\| \le Ce^{\lambda t}\,\|P_h u - u\| \overset{\text{by assumption}}{\longrightarrow} 0.$$
42. $$\lim_{t \downarrow 0} \frac{U(t)x - x}{t} = Ax.$$
43. $\mathcal{L}_H$ is antisymmetric in $L^2$, so
$$\langle h, \mathcal{L}_H h\rangle = -\langle \mathcal{L}_H h, h\rangle = -\langle h, \mathcal{L}_H h\rangle,$$
hence $\langle h, \mathcal{L}_H h\rangle = 0$. Using this fact,
$$\frac{d}{dt}\|e^{t\mathcal{L}_H}h\|^2 = 2\langle h_t, \mathcal{L}_H h_t\rangle = 0, \qquad h_t := e^{t\mathcal{L}_H}h.$$
44. The potential is $V(x) = x^3/3$, the generator is
$$\mathcal{L} = -x^2\,\partial_x + \sigma^2\,\partial_x^2,$$
and the invariant density is proportional to $e^{-x^3/(3\sigma^2)}$.
45. Using the product rule in (a) and the multidimensional Ito formula in (b) and (c), we have
(a) $d(X_t Y_t) = X_t\,dY_t + Y_t\,dX_t + dX_t\,dY_t$. However $dX_t\,dY_t = 0$ in this case, so
$$d(X_t Y_t) = rX_t Y_t\,dt - Y_t X_t\,dt + 2X_t\,dB_t + \sigma Y_t\,dW_t.$$
(b) Using $dX_t\,dY_t = 0$, $dX_t\,dX_t = \sigma^2\,dt$ and $dY_t\,dY_t = 4\,dt$, we have
$$dg(X_t, Y_t) = \frac{\partial g}{\partial x}(X_t, Y_t)\,dX_t + \frac{\partial g}{\partial y}(X_t, Y_t)\,dY_t + \frac{\sigma^2}{2}\frac{\partial^2 g}{\partial x^2}(X_t, Y_t)\,dt + 2\frac{\partial^2 g}{\partial y^2}(X_t, Y_t)\,dt.$$
(c) Use the result of (b) and apply it to the function $g(x, y) = x^2 y$ to obtain
$$d(X_t^2 Y_t) = 2X_t Y_t\,dX_t + X_t^2\,dY_t + \sigma^2 Y_t\,dt.$$
46. All the inequalities I will write are assumed to hold for $f \in L^2 \cap D(\mathcal{L})$ (even though for (a) one can just consider functions in $L^2$).
(a) Just consider the function
$$H(a) = \int_{\mathbb{R}} (f - a)^2\,d\nu.$$
$H'(a) = -2\int_{\mathbb{R}} (f - a)\,d\nu$ vanishes iff $a = \int f\,d\nu$; moreover $\frac{1}{2}\frac{d^2}{da^2}H(a) = 1$, so this is a minimum.
(b) Using the hint and the result of point (a), one gets
$$\int \Big(f - \int f\,d\nu\Big)^2 d\nu \le \int \Big(f - \int f\,d\mu\Big)^2 d\nu \le e^{\mathrm{Osc}(U)} \int \Big(f - \int f\,d\mu\Big)^2 d\mu$$
$$\le \frac{e^{\mathrm{Osc}(U)}}{\lambda}\,\langle -\mathcal{L}f, f\rangle_\mu \le \frac{e^{2\,\mathrm{Osc}(U)}}{\lambda}\,\langle -\mathcal{L}f, f\rangle_\nu,$$
which is the spectral gap inequality for $\nu$.
47. Results:
$$Y_t = Y_0\exp\big(\sigma B_t - \tfrac{1}{2}\sigma^2 t\big) + r\int_0^t \exp\big[\sigma(B_t - B_s) - \tfrac{1}{2}\sigma^2(t - s)\big]\,ds.$$
$$X_t = e^t X_0 + e^t B_t, \quad \text{assuming } B_0 = 0.$$
$$X_t = 3\exp[-6t + 4B_t] \quad \text{and} \quad E(X_t) = 3e^{2t}.$$
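The last expectation is easy to confirm by Monte Carlo (a sketch), sampling $X_t = 3\exp[-6t + 4B_t]$ directly with $B_t \sim N(0, t)$:

```python
import numpy as np

rng = np.random.default_rng(6)
t, n_samples = 0.25, 1_000_000  # t = 0.25 is an arbitrary choice

B_t = np.sqrt(t) * rng.standard_normal(n_samples)
X_t = 3.0 * np.exp(-6.0 * t + 4.0 * B_t)

print(X_t.mean(), 3.0 * np.exp(2.0 * t))  # both ~ 3 e^{2t} ~ 4.95
```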