Convergence of Markov Processes
Martin Hairer
Mathematics Department, University of Warwick
Email: [email protected]
Abstract
The aim of this minicourse is to provide a number of tools that allow one to de-
termine at which speed (if at all) the law of a diffusion process, or indeed a rather
general Markov process, approaches its stationary distribution. Of particular in-
terest will be cases where this speed is subexponential. After an introduction
to the general ergodic theory of Markov processes, the first part of the course
is devoted to Lyapunov function techniques. The second part is then devoted to
an elementary introduction to Malliavin calculus and to a proof of Hörmander’s
famous "sums of squares" regularity theorem.
Hence we are using P both to denote the action on functions and its dual action on
measures. Note that P extends trivially to measurable functions ϕ : X → [0, +∞].
The main concept arising in the study of the long-time behaviour of a Markov chain
is that of an invariant measure:
General (ergodic) theory of Markov processes
τi = inf{k ≥ 1 : xk = i} .
We also denote by Pi the law of the process started at i. With these notations, we
say that the state i is:
We assume that P is irreducible in the sense that for every pair (i, j) there exists
n > 0 such that P^n_{ij} > 0. Given a state i, we also denote by τ_i the first return time
to i, that is τ_i = inf{n > 0 : x_n = i}. We have
Proposition 1.2 If P is irreducible, then all of the states are of the same type.
Note first that since there is a non-zero probability that the process reaches 4
between any two successive return times to j, this quantity is finite for every j.
Furthermore, it follows from the definition of x̃ that we have the identity
π_j = ∑_{k=0}^{∞} ∑_{i≠4} (P̃^k)_{ji} P_{i4}   if j ≠ 4 ,      and      π_j = 1   if j = 4 .
Therefore, for ℓ ≠ 4 we have

(Pπ)_ℓ = P_{ℓ4} + ∑_{j≠4} ∑_{k≥0} ∑_{i≠4} P_{ℓj} (P̃^k)_{ji} P_{i4} = P_{ℓ4} + ∑_{j} ∑_{k≥0} ∑_{i≠4} P̃_{ℓj} (P̃^k)_{ji} P_{i4} = π_ℓ .

For ℓ = 4, one has (Pπ)_4 = π_4 = 1, since this is precisely the probability that the process x eventually returns to 4.
Proof. In all three cases, we will show that the existence of a function V with the
specified properties is sufficient to obtain the corresponding transience / recurrence
properties. We then show how to construct such a function in an abstract way.
Let us first show the criterion for transience. Multiplying V by a positive constant, we can assume without loss of generality that inf_{y∈A} V(y) = 1. Consider
now x ∉ A such that V(x) < 1. Since LV(z) ≤ 0 for z ∉ A, we have
V(x) ≥ ∫ V(y) P(x, dy) ≥ P(x, A) + ∫_{A^c} V(y) P(x, dy)
     ≥ P(x, A) + ∫_{A^c} P(y, A) P(x, dy) + ∫_{A^c} ∫_{A^c} V(z) P(y, dz) P(x, dy) ≥ …      (1.1)
Taking limits, we see that V (x) ≥ Px (τA < ∞). Since V (x) < 1, this immediately
implies that the process is transient. Conversely, if the process is transient, the
function V (x) = Px (τA < ∞) satisfies the required conditions.
We now turn to the condition for recurrence. For N ≥ 1, set V_N(x) = V(x)/N
and set D_N = {x : V_N(x) ≥ 1}. It follows from the assumption that the sets
D_N have finite complements. Denote furthermore A = {x : LV(x) > 0}. A
calculation virtually identical to (1.1) then shows that V_N(x) ≥ P_x(τ_{D_N} < τ_A) ≥
P_x(τ_A = ∞). Since V_N(x) = V(x)/N → 0 as N → ∞, one has P_x(τ_A = ∞) = 0 for every x, so that the
process is recurrent. Conversely, assume that the process is recurrent and consider
an arbitrary finite set A, as well as a sequence of decreasing sets D_N with finite
complements such that ∩_{N>0} D_N = ∅. We then set W_N(x) = P_x(τ_{D_N} < τ_A),
with the convention that W_N(x) = 0 for x ∈ A and W_N(x) = 1 for x ∈ D_N. It is
straightforward to check that one does have LW_N ≤ 0 for x ∉ A. Furthermore, it
follows from the recurrence of the process that lim_{N→∞} W_N(x) = 0 for every x.
We can therefore find a sequence N_k → ∞ such that V(x) = ∑_{k≥0} W_{N_k}(x) < ∞
for every x. Since V grows at infinity by construction, it does indeed satisfy the
required conditions.
We finally consider the criterion for positive recurrence. We define the set
A = {x : LV(x) > −1}, which is finite by assumption. For x ∉ A we now have
V(x) ≥ 1 + ∫ V(y) P(x, dy) ≥ 1 + ∫_{A^c} V(y) P(x, dy)
     ≥ 1 + P(x, A^c) + ∫_{A^c} ∫_{A^c} V(z) P(y, dz) P(x, dy) ≥ …
Taking limits again, we obtain that V (x) ≥ Ex τA , so that the first hitting time of
A has finite expectation. Since it follows from our assumption that LV is bounded
from above (in particular LV (x) < ∞ on A), this shows that the return time to any
fixed state has finite expectation. Conversely, if the process is positive recurrent,
we set V (x) = Ex τA for an arbitrary finite set A and we check as before that it
does have the required properties.
Proof. It suffices to show that π has the desired property. Denoting V_N(x) =
V(x) ∧ N, we see that G_N = LV_N is bounded, negative outside a finite set, and
satisfies lim_{N→∞} G_N(x) = G(x) = LV(x). Furthermore, ∫ G_N(x) π(dx) = 0 by
the invariance of π. The claim now follows from Fatou's lemma.
Our aim is to use the above criteria to show that the simple random walk on Z^d is recurrent if and only if d ≤ 2.
It is clear that it is irreducible and aperiodic. It cannot be positive recurrent
since the density p of any invariant probability measure would have to satisfy Lp =
0, so that there cannot be a point x with p(x) maximal. The question of interest is
whether the random walk is transient or recurrent. We use the fact that if f varies
slowly, then Lf (x) ≈ ∆f (x).
Note now that if we set f(x) = ‖x‖^{2α}, we have ∆f(x) = 2α(2α + d − 2) ‖x‖^{2α−2} for x ≠ 0.
This simple calculation already shows that 2 is the critical dimension for the tran-
sience / recurrence transition since for d < 2 one can find α > 0 so that f satisfies
the conditions for the recurrence criterion, whereas for d > 2, one can find α < 0
such that f satisfies the conditions for the transience criterion. What happens at the
critical dimension? We consider f(x) = (log(x₁² + x₂²))^α, so that

∆f(x) = (4α(α − 1)/(x₁² + x₂²)) (log(x₁² + x₂²))^{α−2} .

Since α(α − 1) < 0 for α ∈ (0, 1), choosing α ∈ (0, 1) allows one to construct a function such that the criterion for recurrence is satisfied.
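The dichotomy derived above can be probed by a quick Monte Carlo experiment (a sketch of our own; the horizon, sample size, and seed are arbitrary choices, not from the text): estimate how often a simple random walk returns to the origin within a fixed horizon, in d = 2 versus d = 3.

```python
import numpy as np

def return_frequency(d, trials=500, horizon=10000, seed=0):
    """Fraction of simple-random-walk paths in Z^d that revisit the
    origin within `horizon` steps (a crude proxy for recurrence)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        coords = rng.integers(d, size=horizon)       # which coordinate moves
        signs = rng.choice((-1, 1), size=horizon)    # direction of each move
        steps = np.zeros((horizon, d), dtype=np.int64)
        steps[np.arange(horizon), coords] = signs
        path = steps.cumsum(axis=0)
        hits += bool((~path.any(axis=1)).any())      # some step returned to 0
    return hits / trials

p2 = return_frequency(2)   # d = 2: recurrent, frequency grows to 1 with horizon
p3 = return_frequency(3)   # d = 3: transient, stays bounded away from 1
print(p2, p3)
```

The estimate for d = 2 creeps towards 1 only logarithmically in the horizon, which is consistent with the borderline nature of the critical dimension seen above.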
Theorem 1.7 Given a Markov kernel P, denote by I the set of all invariant prob-
ability measures for P and by E ⊂ I the set of all those that are ergodic. Then, I
is convex and E is precisely the set of its extremal points. Furthermore, for every
invariant measure µ ∈ I, there exists a probability measure Qµ on E such that
µ(A) = ∫_E ν(A) Q_µ(dν) .
Remark 1.8 As a consequence, if a Markov process admits more than one invari-
ant measure, it does admit at least two ergodic (and therefore mutually singular)
ones. This leads to the intuition that, in order to guarantee the uniqueness of its
invariant measure, it suffices to show that a Markov process explores its state space
‘sufficiently thoroughly’.
Since our assumption immediately implies that {µN }N ≥1 is tight, there exists at
least one accumulation point µ∗ and a sequence Nk with Nk → ∞ such that
µNk → µ∗ weakly. Take now an arbitrary test function ϕ ∈ Cb (X ) and denote by
k · k the supremum norm of ϕ. One has
|(Pµ∗)(ϕ) − µ∗(ϕ)| = |µ∗(Pϕ) − µ∗(ϕ)| = lim_{k→∞} |µ_{N_k}(Pϕ) − µ_{N_k}(ϕ)|
    = lim_{k→∞} (1/N_k) |(P^{N_k}ϕ)(x) − ϕ(x)| ≤ lim_{k→∞} 2‖ϕ‖/N_k = 0 .
Here, the second equality relies on the fact that Pϕ is continuous since P was
assumed to be Feller. Since ϕ was arbitrary, this shows that Pµ∗ = µ∗ as requested.
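On a finite state space the Krylov–Bogolioubov construction can be carried out exactly. The sketch below (with an arbitrary illustrative 3-state kernel of our own) forms the Cesàro averages µ_N = (1/N) ∑_{n<N} δ_x P^n and checks that the invariance defect µ_N P − µ_N is of order 1/N, exactly as in the computation above.

```python
import numpy as np

# An arbitrary irreducible 3-state transition matrix (rows sum to one).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

def cesaro_average(P, x, N):
    """mu_N = (1/N) * sum_{n=0}^{N-1} delta_x P^n, as a row vector."""
    mu = np.zeros(P.shape[0]); mu[x] = 1.0
    acc = np.zeros_like(mu)
    for _ in range(N):
        acc += mu
        mu = mu @ P               # apply the kernel once more
    return acc / N

mu_star = cesaro_average(P, x=0, N=20000)
# The defect mu_N P - mu_N telescopes to (delta_x P^N - delta_x)/N,
# so it is at most 2/N in total variation.
print(np.abs(mu_star @ P - mu_star).sum())
```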
Example 1.11 Take X = [0, 1] and consider the transition probabilities defined
by
P(x, ·) = δ_{x/2}   if x > 0 ,        P(x, ·) = δ_1   if x = 0 .
It is clear that this Markov operator cannot have any invariant probability mea-
sure. Indeed, assume that µ is invariant. Clearly, one must have µ({0}) = 0 since
P(x, {0}) = 0 for every x. Since, for x ≠ 0, one has P(x, (1/2, 1]) = 0,
one must also have µ((1/2, 1]) = 0. Proceeding by induction, we see that
µ((2^{−n}, 1]) = 0 for every n and therefore µ((0, 1]) = 0. Therefore, µ(X ) = 0
which is a contradiction.
Endowing X with the usual topology, it is clear that the ‘Feller’ assumption of
the Krylov-Bogolioubov criteria is not satisfied around 0. The tightness criterion
however is satisfied since X is a compact space. On the other hand, we could add
the set {0} to the topology of X , therefore really interpreting it as X = {0} ⊔ (0, 1].
Since {0} already belongs to the Borel σ-algebra of X , this change of topology
does not affect the Borel sets. Furthermore, the space X is still a Polish space
and it is easy to check that the Markov operator P now has the Feller property!
However, the space X is no longer compact and a sequence {xn } accumulating at
0 is no longer a precompact set, so that it is now the tightness assumption that is
no longer satisfied.
2 Some simple uniqueness criteria

Definition 2.1 Let P be a Markov operator over a Polish space X and let x ∈ X .
We say that x is accessible for P if, for every y ∈ X and every open neighborhood
U of x, there exists k > 0 such that P^k(y, U) > 0.
It is straightforward to show that if a given point is accessible, then it must belong
to the topological support of every invariant measure of the semigroup:
Lemma 2.2 Let P be a Markov operator over a Polish space X and let x ∈ X be
accessible. Then, x ∈ supp µ for every invariant probability measure µ.
Proof. Let µ be invariant for the Markov operator P and define the resolvent kernel
R by

R(y, A) = ∑_{n>0} 2^{−n} P^n(y, A) .
Clearly, the accessibility assumption implies that R(y, U) > 0 for every y ∈ X
and every open neighbourhood U ⊂ X of x. Then, since the invariance of µ under P implies µR = µ, we have

µ(U) = ∫_X R(y, U) µ(dy) > 0 ,
as required.
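On a finite state space the resolvent kernel can be computed in closed form, since ∑_{n>0} 2^{−n} P^n = (I − P/2)^{−1}(P/2) by summing the geometric series. The sketch below (an arbitrary 3-state chain of our own) checks that R is again a Markov kernel and that µR = µ for the invariant µ, which is the identity used in the last display.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

# Geometric series: R = sum_{n>0} 2^{-n} P^n = (I - P/2)^{-1} (P/2).
R = np.linalg.solve(np.eye(3) - P / 2, P / 2)

# Invariant measure of P: normalised left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu = mu / mu.sum()

print(R.sum(axis=1))   # rows sum to one: R is again a Markov kernel
print(mu @ R - mu)     # mu P = mu implies mu R = mu
```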
Example 2.3 (Ising model) The Ising model is one of the most popular toy models
of statistical mechanics. It is one of the simplest models describing the evolution
of a ferromagnet. The physical space is modelled by a lattice Z^d and the magnetisation at each lattice site is modelled by a 'spin', an element of {±1}. Taking d = 2 for definiteness, the
state space of the system is therefore given by X = {±1}^{Z²}, which we endow with
the product topology. This topology can be metrized for example by the distance
function

d(x, y) = ∑_{k∈Z²} |x_k − y_k| / 2^{|k|} ,

and the space X endowed with this distance function is easily seen to be separable.
The (Glauber) dynamic for the Ising model depends on a parameter β and can
be described in the following way. At each lattice site, we consider independent
clocks that ring at Poisson distributed times. Whenever the clock at a given site
(say the site k) rings, we consider the quantity δE_k(x) = ∑_{j∼k} x_j x_k, where the
sum runs over all sites j that are nearest neighbors of k. We then flip the spin at
site k with probability min{1, exp(−β δE_k(x))}.
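On a finite box (in place of the infinite lattice, and with free boundary conditions; both are simplifications of our own, as are the lattice size and temperature below), a single clock ring of this dynamic can be sketched as follows:

```python
import numpy as np

def glauber_step(x, beta, rng):
    """One clock ring: pick a site uniformly at random and flip its spin
    with probability min(1, exp(-beta * dE)), where dE = sum_{j~k} x_j x_k."""
    n = x.shape[0]
    i, j = rng.integers(n, size=2)
    dE = 0.0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # nearest neighbours
        a, b = i + di, j + dj
        if 0 <= a < n and 0 <= b < n:                   # free boundary
            dE += x[a, b] * x[i, j]
    if rng.random() < min(1.0, np.exp(-beta * dE)):
        x[i, j] = -x[i, j]

rng = np.random.default_rng(1)
x = rng.choice((-1, 1), size=(16, 16))     # random initial configuration
for _ in range(5000):
    glauber_step(x, beta=0.6, rng=rng)
```

Replacing the Poisson clocks by uniformly chosen sites does not change the invariant measure of the discrete-time chain; it merely reparametrises time.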
Let us first show that every point is accessible for this dynamic. Fix an arbitrary
configuration x ∈ X and a neighbourhood U containing x. By the definition of the
product topology, U contains an ‘elementary’ neighbourhood UN (x) of the type
UN (x) = {y ∈ X | yk = xk ∀ |k| ≤ N }. Given now an arbitrary initial condition
y ∈ X , we can find a sequence of m spin flips at distinct locations k₁, . . . , k_m, all
of them located inside the ball {|k| ≤ N}, that takes y into U_N(x).
Fix now t > 0. There is a very small but nevertheless strictly positive probability
that within that time interval, the Poisson clocks located at k1 , . . . , km ring exactly
once and exactly in that order, whereas all the other clocks located in the ball
{|k| ≤ N + 2} do not ring. Furthermore, there is a strictly positive probability
that all the corresponding spin flips do actually happen. As a consequence, the
Ising model is topologically irreducible in the sense that for any state x ∈ X , any
open set U ⊂ X and any t > 0, one has Pt (x, U ) > 0.
It is also relatively straightforward to show that the dynamic has the Feller
property, but this is outside the scope of these notes. However, despite the fact that
the dynamic is both Feller and topologically irreducible, one has the following:
Theorem 2.4 For d ≥ 2 there exists βc > 0 such that the Ising model has at least
two distinct invariant measures for β > βc .
The proof of this theorem is not simple and we will not give it here. It was a
celebrated tour de force by Onsager to be able to compute the critical value β_c =
ln(1 + √2)/2 explicitly in [Ons44] for the case d = 2. We refer to the monograph
[Geo88] for a more detailed discussion of this and related models.
This example shows that if we wish to base a uniqueness argument on the
accessibility of a point or on the topological irreducibility of a system, we need
to combine this with a stronger regularity property than the Feller property. One
possible regularity property that yields the required properties is the strong Feller
property:
Definition 2.5 A Markov operator P over a Polish space X has the strong Feller
property if, for every function ϕ ∈ Bb (X ), one has Pϕ ∈ Cb (X ).
With this definition, one has:
Proposition 2.6 If a Markov operator P over a Polish space X has the strong
Feller property, then the topological supports of any two mutually singular invari-
ant measures are disjoint.
Proof. Let µ and ν be two mutually singular invariant measures for P, so that there
exists a set A ⊂ X such that µ(A) = 1 and ν(A) = 0. The invariance of µ and ν
then implies that P(x, A) = 1 for µ-almost every x and P(x, A) = 0 for ν-almost
every x.
Set ϕ = P1A , where 1A is the characteristic function of A. It follows from
the previous remarks that ϕ(x) = 1 µ-almost everywhere and ϕ(x) = 0 ν-almost
everywhere. Since ϕ is continuous by the strong Feller property, the claim now
follows from the fact that if a continuous function is constant µ-almost everywhere,
it must be constant on the topological support of µ.
Corollary 2.7 Let P be a strong Feller Markov operator over a Polish space X . If
there exists an accessible point x ∈ X for P, then it can have at most one invariant
measure.
Proof. Combine Proposition 2.6 with Lemma 2.2 and the fact that if I contains
more than one element, then by Theorem 1.7 there must be at least two distinct
ergodic invariant measures for P.
P_{t₁}(x, dx₁) P_{t₂−t₁}(x₁, dx₂) ⋯ P_{t_n−t_{n−1}}(x_{n−1}, dx_n) .
Proposition 2.8 Let Pt be a Markov semigroup over X and let P = PT for some
fixed T > 0. Then, if µ is invariant for P, the measure µ̂ defined by

µ̂(A) = (1/T) ∫₀^T (P_t µ)(A) dt

is invariant for the semigroup {P_t}_{t≥0}.
Remark 2.9 The converse is not true at this level of generality. This can be seen
for example by taking Pt (x, ·) = δx+t with X = S 1 .
In the case of continuous-time Markov processes, it is however often conve-
nient to formulate Lyapunov-Foster type conditions in terms of the generator L
of the process. Formally, one has L = ∂t Pt |t=0 , but it turns out that the natural
domain of the generator with this definition may be too restrictive for our usage.
We therefore take a rather pragmatic view of the definition of the generator L of a
Markov process, in the sense that writing
LF = G ,
is considered to be merely a shorthand notation for the statement that the process
F(x_t, t) − ∫₀ᵗ G(x_s, s) ds is a martingale for every initial condition x₀. Similarly,
LF ≤ G ,
is a shorthand notation for the statement that F(x_t, t) − ∫₀ᵗ G(x_s, s) ds is a supermartingale for every x₀.
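In discrete time, where L = P − I, the corresponding statement is that F(x_n) − ∑_{k<n}(LF)(x_k) is a martingale. The sketch below (an arbitrary 3-state chain and observable of our own) checks the defining property E M_n = M_0 by Monte Carlo:

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0],
              [0.5, 0.0, 0.5],
              [0.3, 0.3, 0.4]])
F = np.array([1.0, 4.0, -2.0])   # an arbitrary observable on {0, 1, 2}
LF = P @ F - F                   # discrete-time generator L = P - I

rng = np.random.default_rng(0)
n_paths, n_steps = 100000, 10
cum = P.cumsum(axis=1)
x = np.zeros(n_paths, dtype=int)      # all paths start at state 0
comp = np.zeros(n_paths)              # compensator sum_{k<n} (LF)(x_k)
for _ in range(n_steps):
    comp += LF[x]
    u = rng.random(n_paths)
    x = (u[:, None] > cum[x]).sum(axis=1)   # sample the next state
M_end = F[x] - comp                   # the martingale at time n_steps

# For a martingale, E M_n = M_0 = F(0): the Monte Carlo mean should agree.
print(M_end.mean(), F[0])
```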
3 Harris’s theorem
The purpose of this section is to show that under a geometric drift condition and
provided that P admits sufficiently large ‘small sets’, its transition probabilities
converge towards a unique invariant measure at an exponential rate. This theorem
dates back to [Har56] and extends the ideas of Doeblin to the unbounded state space
setting. This is usually established by finding a Lyapunov function with “small”
level sets [Has80, MT93]. If the Lyapunov function is strong enough, one then has
a spectral gap in a weighted supremum norm [MT92, MT93].
Traditional proofs of this result rely on the decomposition of the Markov chain
into excursions away from the small set and a careful analysis of the exponential tail
of the length of these excursions [Num84, Cha89, MT92, MT93]. There have been
other variations which have made use of Poisson equations or worked at getting
explicit constants [KM05, DMR04, DMLM03, Bax05]. The proof given in the
present notes is a slight simplification of the proof given in [HM10]. It is very
direct, and relies instead on introducing a family of equivalent weighted norms
indexed by a parameter β and on making an appropriate choice of this parameter
that allows one to combine in a very elementary way the two ingredients (existence
of a Lyapunov function and irreducibility) that are crucial in obtaining a spectral
gap. The original motivation was to provide a proof which could be extended to
more general settings in which no “minorisation condition” holds. This has been
applied successfully to the study of the two-dimensional stochastic Navier-Stokes
equations in [HM08].
We will assume throughout this section that P satisfies the following geometric
drift condition:

Assumption 3.1 There exist a function V : X → [0, ∞) and constants K ≥ 0 and γ ∈ (0, 1) such that

PV(x) ≤ γV(x) + K ,      (3.1)

for all x ∈ X .
Remark 3.2 One could allow V to also take the value +∞. However, since we do
not assume any particular structure on X , this case can immediately be reduced to
the present case by simply replacing X by {x : V (x) < ∞}.
Exercise 3.3 Show that in the case of continuous time, a sufficient condition for
Assumption 3.1 to hold is that there exist a measurable function V : X → [0, ∞)
and positive constants c, K such that LV ≤ K − cV .
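For instance (an illustration of our own, not from the text): for the Ornstein–Uhlenbeck process dx_t = −x_t dt + √2 dW_t on R, whose generator is L = −x ∂_x + ∂_x², the choice V(x) = 1 + x² gives

```latex
\mathcal{L} V(x) = -x\,\partial_x V(x) + \partial_x^2 V(x)
                 = -2x^2 + 2
                 = 4 - 2V(x) ,
```

so that LV ≤ K − cV holds with K = 4 and c = 2, and Assumption 3.1 for the discrete-time kernel follows by Exercise 3.3.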
Assumption 3.1 ensures that the dynamic enters the “centre” of the state space
regularly with tight control on the length of the excursions from the centre. We now
assume that a sufficiently large level set of V is sufficiently “nice” in the sense that
we have a uniform “minorisation” condition reminiscent of Doeblin’s condition,
but localised to the sublevel sets of V .
Assumption 3.4 For every R > 0, there exists a constant α > 0 so that

‖P(x, ·) − P(y, ·)‖_TV ≤ 2(1 − α) ,      (3.2)

for every pair x, y such that V(x) + V(y) ≤ R.
Remark 3.5 An alternative way of formulating (3.2) is as a minorisation condition: there exist α > 0 and a probability measure ν such that P(x, ·) ≥ αν(·) for every x in the sublevel set {V ≤ R}.

Given V, we define the weighted supremum norm

‖ϕ‖ = sup_x |ϕ(x)| / (1 + V(x)) .      (3.3)
Theorem 3.6 If Assumptions 3.1 and 3.4 hold, then P admits a unique invariant
measure µ⋆. Furthermore, there exist constants C > 0 and ρ ∈ (0, 1) such that the
bound

‖P^n ϕ − µ⋆(ϕ)‖ ≤ C ρ^n ‖ϕ − µ⋆(ϕ)‖

holds for every measurable function ϕ : X → R such that ‖ϕ‖ < ∞.
While this result is well-known, the proofs found in the literature are often quite
involved and rely on careful estimates of the return times to small sets, combined
with a clever application of Kendall’s lemma. See for example [MT93, Section 15].
The aim of this note is to provide a very short and elementary proof of Theo-
rem 3.6 based on a simple trick. Instead of working directly with (3.3), we define a
whole family of weighted supremum norms depending on a scale parameter β > 0
that are all equivalent to the original norm (3.3):
‖ϕ‖_β = sup_x |ϕ(x)| / (1 + βV(x)) .
Theorem 3.7 If Assumptions 3.1 and 3.4 hold, then there exist constants β > 0
and ρ̄ ∈ (0, 1) such that the bound |||Pϕ|||_β ≤ ρ̄ |||ϕ|||_β holds, where |||·|||_β denotes the Lipschitz seminorm induced by the metric d_β introduced below.
To this end, define the distance function d_β(x, y) = 2 + βV(x) + βV(y) for x ≠ y, and d_β(x, x) = 0. Though slightly odd looking, the reader can readily verify that since V ≥ 0 by
assumption, d_β indeed satisfies the axioms of a metric. This metric in turn induces
a Lipschitz seminorm on measurable functions by
|||ϕ|||_β = sup_{x≠y} |ϕ(x) − ϕ(y)| / d_β(x, y) .
It turns out that this norm is almost identical to the one from the previous section.
More precisely, one has:

Lemma 3.8 For every measurable ϕ : X → R, one has the identity |||ϕ|||_β = inf_{c∈R} ‖ϕ + c‖_β.
Proof. It is obvious that |||ϕ|||_β ≤ ‖ϕ‖_β and therefore |||ϕ|||_β ≤ inf_{c∈R} ‖ϕ + c‖_β, so
it remains to show the reverse inequality.
Given any ϕ with |||ϕ|||_β ≤ 1, we set c = inf_x (1 + βV(x) − ϕ(x)). Observe that,
for any x and y, ϕ(x) ≤ |ϕ(y)| + |ϕ(x) − ϕ(y)| ≤ |ϕ(y)| + 2 + βV(x) + βV(y).
Hence 1 + βV(x) − ϕ(x) ≥ −1 − βV(y) − |ϕ(y)|. Since there exists at least one
point with V(y) < ∞, we see that c is bounded from below and hence |c| < ∞.
Observe now that

ϕ(x) + c ≤ 1 + βV(x) ,

and

ϕ(x) + c = inf_y (ϕ(x) − ϕ(y) + 1 + βV(y)) ≥ inf_y (1 + βV(y) − d_β(x, y)) = −(1 + βV(x)) ,

so that ‖ϕ + c‖_β ≤ 1, which shows the reverse inequality for ϕ with |||ϕ|||_β ≤ 1; the general case follows by homogeneity.
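The identity of Lemma 3.8 can be sanity-checked numerically on a finite set (random data and a grid search over c; all the concrete choices below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 40, 0.7
V = rng.uniform(0.0, 5.0, size=n)      # a Lyapunov function on a finite set
phi = rng.uniform(-3.0, 3.0, size=n)   # an arbitrary test function

w = 1.0 + beta * V                             # weights 1 + beta V(x)
d = 2.0 + beta * (V[:, None] + V[None, :])     # d_beta(x, y) for x != y

diff = np.abs(phi[:, None] - phi[None, :])
off = ~np.eye(n, dtype=bool)
lip = (diff[off] / d[off]).max()               # |||phi|||_beta

cs = np.linspace(-10.0, 10.0, 20001)           # fine grid search over c
norms = (np.abs(phi[None, :] + cs[:, None]) / w[None, :]).max(axis=1)
best = norms.min()                             # approximates inf_c ||phi + c||_beta

print(lip, best)   # the two quantities agree, up to the grid resolution
```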
Proof. Fix a test function ϕ with |||ϕ|||_β ≤ 1. By Lemma 3.8, we can assume
without loss of generality that one also has ‖ϕ‖_β ≤ 1. The claim then follows if
we can exhibit ᾱ < 1 so that

|Pϕ(x) − Pϕ(y)| ≤ ᾱ d_β(x, y) ,

for every pair x ≠ y. Consider first the case V(x) + V(y) ≥ R. Choosing any γ̄ ∈ (γ, 1), we then have from (3.1) and (3.3) the bound

|Pϕ(x) − Pϕ(y)| ≤ 2 + βPV(x) + βPV(y) ≤ 2 + γβV(x) + γβV(y) + 2βK
             ≤ 2 + γ̄βV(x) + γ̄βV(y) + β(2K − (γ̄ − γ)(V(x) + V(y))) .

It follows that if we ensure that R is sufficiently large so that (γ̄ − γ)R > 2K, then
there exists some α₁ < 1 (depending on β) such that we do indeed have

|Pϕ(x) − Pϕ(y)| ≤ α₁ d_β(x, y)

for such pairs.
We emphasise that up to now β could be any positive number; only the precise
value of α1 depends on it (and gets “worse” for small values of β). The second
part of the proof will determine a choice of β > 0.
Now consider the case of x and y such that V(x) + V(y) ≤ R. To treat this
case, we split the function ϕ as ϕ = ϕ₁ + ϕ₂, where

|ϕ₁(x)| ≤ 1 ,    |ϕ₂(x)| ≤ βV(x) ,    ∀x ∈ X .
With this decomposition, we then obtain the bound

|Pϕ(x) − Pϕ(y)| ≤ |Pϕ₁(x) − Pϕ₁(y)| + |Pϕ₂(x)| + |Pϕ₂(y)|
             ≤ 2(1 − α) + γβV(x) + γβV(y) + 2βK ≤ 2 − 2α + β(γR + 2K) .
Remark 3.10 Actually, setting γ₀ = γ + 2K/R < 1, for any α₀ ∈ (0, α) one
can choose β = α₀/K and ᾱ = (1 − α + α₀) ∨ ((2 + Rβγ₀)/(2 + Rβ)).
4 Subgeometric rates of convergence
Theorem 4.1 Assume that there exists a continuous function V : X → [1, ∞) with
precompact sublevel sets such that
LV ≤ K − ϕ(V ) , (4.1)
for some constant K and for some strictly concave function ϕ : R+ → R+ with
ϕ(0) = 0 and increasing to infinity. Assume furthermore that sublevel sets of V are
‘small’ in the sense that for every C > 0 there exists α > 0 and T > 0 such that
‖P_T(x, ·) − P_T(y, ·)‖_TV ≤ 2(1 − α) ,
for every (x, y) such that V (x) + V (y) ≤ C. Then the following hold:
Remark 4.2 Since V is bounded from below by assumption, (4.1) follows imme-
diately if one can check that the process
t ↦ V(x_t) − Kt + ∫₀ᵗ (ϕ ∘ V)(x_s) ds
is a local supermartingale.
Of course, this bound will never converge to 0 if the two processes are independent,
so the aim of the game is to introduce correlations in a suitable way. This will be
done in Section 4.2, after some preliminary calculations that provide the main tools
used in this section.
To the best of my knowledge, Theorem 4.1 was stated in the form given in these
notes for the first time in the recent work [BCG08, DFG09]. However, it relies
on a wealth of previous work, for example [DFMS04] for the same condition in
the discrete-time case, early results by Nummelin, Tuominen and Tweedie [NT83,
TT94], etc. See also [Ver99, RT99] for examples of early results on subgeometric
convergence. The proof given in these notes is a simplification of the arguments
given in [BCG08, DFG09] and is more self-contained as it avoids reducing oneself
to the discrete-time case.
Proof. The strict log-concavity of H implies that for every ε > 0 there exists
K > 0 such that

H(u + v) ≤ (ε/C) H(u) H(v) ,      (4.2)

for every u, v such that u ≥ K and v ≥ K. Denoting X^{(k)} = ∑_{n=1}^k X_n, we then have

H(X^{(k)}) = ∑_{A⊂{1,…,k}} H(X^{(k)}) ∏_{n∈A} 1_{X_n≤K} ∏_{m∉A} 1_{X_m>K}
         ≤ ∑_{A⊂{1,…,k}} H(|A| K + ∑_{m∉A} X_m) ∏_{m∉A} 1_{X_m>K}
         ≤ ∑_{A⊂{1,…,k}} H(|A| K) ε^{|A^c|} ∏_{m∉A} H(X_m)/C ,
Note now that it follows from Stirling's formula for the binomial coefficients that
there exists a constant C such that

∑_{A⊂{1,…,k}} ε^{k−|A|} ≤ C √k ∑_{n=0}^{k} (k^k / (n^n (k − n)^{k−n})) ε^n ,
where we made use of the fact that {N = k} is independent of Fk to get the first
identity. We can first make K large enough so that (1 + δ) < λ^{1/3}. Then, we note
that the strict subexponential growth of H implies that there exists a constant C
such that H(kK) ≤ Cλ^{k/3}, and the claim follows.
+ ∫₀ᵗ Φ″(y_s, s) d⟨M⟩^c_s + ∑_{s∈[0,t]} (Φ(y_{s⁺}, s) − Φ(y_{s⁻}, s) − Φ′(y_{s⁻}, s) Δy_s) ,
where we denote by ⟨M⟩^c_t the quadratic variation of the continuous part of M and
by Δy_s the size of the jump of y at time s. The claim then follows from
the facts that ⟨M⟩^c_t is an increasing process and that Φ″ is negative.
Proof. Setting yt = F (xt , t), it follows from our assumptions that one can write
dyt = G(xt , t) dt + dNt + dMt , where M is a càdlàg martingale and N is a non-
increasing process. It follows from Proposition 4.4 that
so that the claim now follows from the fact that dN_t is a negative measure and Φ′
is positive.
∂t Φ = −ϕ ◦ Φ , Φ(u, 0) = u .
Combining this with the fact that H_ϕ′ = 1/ϕ, it immediately follows that one has
the identity

∂_x Φ^{−1}(x, t) = ∂_t Φ^{−1}(x, t) / ϕ(x) = ϕ(Φ^{−1}(x, t)) / ϕ(x) .
Since ϕ is concave and increasing by assumption, this shows that Φ−1 is increasing
and concave in its first argument for every fixed value of t ≥ 0. It then follows from
Corollary 4.5 that, provided that V satisfies (4.1), one has the bound
The tightness of the measures µT then follows immediately from the compactness
of the sublevel sets of V. The required integrability also follows at once from Fatou's lemma.
Take two independent copies of the process x and define
We then have
Since we assumed that ϕ is strictly concave, it follows that for every C > 0 there
exists VC > 0 such that ϕ(2x) ≤ 2ϕ(x) − C, provided that x > VC . It follows that
there exists some VC such that
We can now construct a coupling between two copies of the process in the follow-
ing way:
1. If the two copies are equal at some time, they remain so for future times.
2. If the two copies are different and outside of K, then they evolve indepen-
dently.
3. If the two copies are different and inside of K, then they try to couple over
the next unit time interval.
We denote by τC the coupling time for the two processes. By assumption, the
maximal coupling has probability at least δ to succeed at every occurrence of step 3.
As a consequence, we can construct the coupling in such a way that this probability
is exactly equal to δ.
Note now that one has the coupling inequality

‖P_t(x, ·) − P_t(y, ·)‖_TV ≤ 2 P(x_t ≠ y_t) ,

which is the main tool to obtain bounds on the convergence speed. We denote by
τn the n-th return time to K and by zt the pair of processes (xt , yt ). We also denote
by Fn the σ-algebra generated by τn . Note that it then follows from (4.1) that
E(F (τn+1 − τn ) | Fn ) ≤ C ,
for some constant C. In particular, since the map t 7→ kPt (x, ·) − Pt (y, ·)kTV is
non-increasing, we have the pointwise bound
is integrable with respect to µ, then we would immediately obtain from (4.8) the
bound
Z
kPt (x, ·) − µkTV ≤ kPt (x, ·) − Pt (y, ·)kTV µ(dy)
Z
C
≤ −1 V (x) + V (y) µ(dy) .
Hϕ (t)
Unfortunately, all that we can deduce from (4.1) is that ϕ(V ) is integrable with
respect to µ, with

∫ ϕ(V(x)) µ(dx) ≤ K .
This information can be used in the following way, by decomposing µ into the part
where V ≤ R and its complement:
‖P_t(x, ·) − µ‖_TV ≤ (C / H_ϕ^{−1}(t)) ∫_{V≤R} (V(x) + V(y)) µ(dy) + µ(V > R) .
This bound holds for every value of R, so we can try to optimise over it. Since
we assumed that ϕ is concave, the function x 7→ ϕ(x)/x is decreasing so that on
the set {V ≤ R} one has V(y) ≤ ϕ(V(y)) R/ϕ(R). Furthermore, Chebyshev's
inequality yields

µ(V > R) = µ(ϕ(V) > ϕ(R)) ≤ K/ϕ(R) .
Combining this with the previous bound, we obtain

‖P_t(x, ·) − µ‖_TV ≤ (C / H_ϕ^{−1}(t)) (V(x) + K R/ϕ(R)) + K/ϕ(R) .
Setting R = H_ϕ^{−1}(t), we finally obtain for some constant C the bound

‖P_t(x, ·) − µ‖_TV ≤ C V(x)/H_ϕ^{−1}(t) + C/(ϕ ∘ H_ϕ^{−1})(t) .
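As a concrete instance of this bound (a worked example of our own): for ϕ(v) = v^α with α ∈ (0, 1) and the normalisation H_ϕ(u) = ∫_1^u ds/ϕ(s), one obtains

```latex
H_\varphi(u) = \frac{u^{1-\alpha}-1}{1-\alpha},
\qquad
H_\varphi^{-1}(t) = \bigl(1+(1-\alpha)\,t\bigr)^{\frac{1}{1-\alpha}},
```

so that the bound above yields ‖P_t(x, ·) − µ‖_TV ≲ V(x) t^{−1/(1−α)} + t^{−α/(1−α)}, an algebraic rate dominated by the second term for large t, since α/(1 − α) < 1/(1 − α).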
5 Lower bounds on convergence rates

The usual total variation distance corresponds to the choice G = 1. Note also that
(4.9) is independent of the choice of reference measure.
With this notation, one has the following result:
Theorem 5.1 Let x_t be a Markov process on a Polish space X with invariant measure µ⋆ and let G : X → [1, ∞) be such that:
• There exists a function f : [1, ∞) → [0, 1] such that the function Id · f : y ↦ y f(y) is increasing to infinity and such that µ⋆(G ≥ y) ≥ f(y) for every y ≥ 1.
• There exists a function g : X × R₊ → [1, ∞) increasing in its second argument and such that E(G(x_t) | x₀) ≤ g(x₀, t).
Then, the bound

‖µ_t − µ⋆‖_TV ≥ ½ f((Id · f)^{−1}(2 g(x₀, t))) ,      (5.1)

holds for every t > 0, where µ_t is the law of x_t with initial condition x₀ ∈ X .
Proof. It follows from the definition of the total variation distance and from Chebyshev's inequality that, for every t ≥ 0 and every y ≥ 1, one has the lower bound

‖µ_t − µ⋆‖_TV ≥ µ⋆(G ≥ y) − µ_t(G ≥ y) ≥ f(y) − g(x₀, t)/y .

Choosing y to be the unique solution to the equation y f(y) = 2 g(x₀, t), the result
follows.
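As an illustration (with assumed tail and moment bounds of our own choosing): if µ⋆(G ≥ y) ≥ c y^{−q} for some q ∈ (0, 1), so that f(y) = c y^{−q} and (Id · f)(y) = c y^{1−q} is increasing to infinity, and if g(x₀, t) = C(1 + t), then (5.1) yields

```latex
\|\mu_t - \mu_\star\|_{TV}
\;\ge\; \tfrac12\, f\!\Bigl(\bigl(2C(1+t)/c\bigr)^{\frac{1}{1-q}}\Bigr)
\;=\; \tfrac{c}{2}\,\Bigl(\frac{2C(1+t)}{c}\Bigr)^{-\frac{q}{1-q}} ,
```

so that the convergence cannot be faster than the algebraic rate t^{−q/(1−q)}.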
The problem is that unless µ⋆ is known exactly, one does not in general have
sufficiently good information on the tail behaviour of µ⋆ to be able to apply Theorem 5.1 as it stands. However, it follows immediately from the proof that the
bound (5.1) still holds for a subsequence of times t_n converging to ∞, provided
that the bound µ⋆(G ≥ y_n) ≥ f(y_n) holds for a sequence y_n converging to infinity. This observation allows one to obtain the following corollary, which is more useful in
situations where the law of the invariant measure is not known explicitly:
6 Malliavin calculus and Hörmander's theorem

One case where such a regularity property is relatively easy to show is that of a process with transition probabilities that
have a continuous density with respect to some reference measure. We consider the Stratonovich SDE

dx_t = V₀(x_t) dt + ∑_{i=1}^m V_i(x_t) ∘ dW_i ,      (6.1)

where the V_i's are smooth vector fields on R^n. We assume that these vector fields
satisfy the coercivity assumptions necessary so that the solution to (6.1) is C^∞ with
respect to its initial condition. An important tool for our analysis will be the linearisation of (6.1) with respect to its initial condition. This is given by the stochastic
process J_{0,t} defined as the solution to the SDE
dJ_{0,t} = DV₀(x) J_{0,t} dt + ∑_{i=1}^m DV_i(x) J_{0,t} ∘ dW_i .      (6.2)
Higher order derivatives J^{(k)}_{0,t} with respect to the initial condition can be defined
similarly. Throughout this section, we will make the following standing assumption:
Assumption 6.1 The vector fields Vi are C ∞ and all of their derivatives grow at
most polynomially at infinity. Furthermore, they are such that
E sup_{t≤T} |x_t|^p < ∞ ,    E sup_{t≤T} |J^{(k)}_{0,t}|^p < ∞ ,    E sup_{t≤T} |J^{−1}_{0,t}|^p < ∞ ,
for every initial condition x0 ∈ Rn , every terminal time T > 0, and every p > 0.
Note here that the inverse J^{−1}_{0,t} of the Jacobian can be found by solving the SDE

dJ^{−1}_{0,t} = −J^{−1}_{0,t} DV₀(x) dt − ∑_{i=1}^m J^{−1}_{0,t} DV_i(x) ∘ dW_i .      (6.3)
The aim of this section is to show that under a certain non-degeneracy assump-
tion on the vector fields Vi , the law of the solution to (6.1) has a smooth density
with respect to Wiener measure. To describe this non-degeneracy condition, recall
that the Lie bracket [U, V ] between two vector fields U and V is the vector field
defined by
[U, V ](x) = DV (x) U (x) − DU (x) V (x) .
This notation is consistent with the usual notation for the commutator between
two linear operators since, if we denote by A_U the first-order differential operator
acting on smooth functions f by A_U f(x) = ⟨U(x), ∇f(x)⟩, then we have the
identity A_{[U,V]} = [A_U, A_V].
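The bracket formula can be checked on a concrete pair of vector fields; the sketch below compares it against centred finite-difference Jacobians for U = ∂_{x₁} and V = x₁ ∂_{x₂} (an example of our choosing), for which [U, V] = ∂_{x₂}.

```python
import numpy as np

U = lambda x: np.array([1.0, 0.0])    # the vector field  d/dx1
V = lambda x: np.array([0.0, x[0]])   # the vector field  x1 d/dx2

def jacobian(F, x, h=1e-6):
    """Centred finite-difference Jacobian DF(x)."""
    n = len(x)
    J = np.zeros((n, n))
    for k in range(n):
        e = np.zeros(n); e[k] = h
        J[:, k] = (F(x + e) - F(x - e)) / (2 * h)
    return J

def bracket(U, V, x):
    """[U, V](x) = DV(x) U(x) - DU(x) V(x)."""
    return jacobian(V, x) @ U(x) - jacobian(U, x) @ V(x)

x = np.array([0.3, -1.2])
print(bracket(U, V, x))   # the constant vector field (0, 1), i.e. d/dx2
```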
With this notation at hand, we give the following definition:
Theorem 6.3 Consider (6.1) and assume that Assumption 6.1 holds. If the corre-
sponding vector fields satisfy the parabolic Hörmander condition, then its solutions
admit a smooth density with respect to Lebesgue measure.
A complete rigorous proof of Theorem 6.3 is far beyond the scope of these
notes. However, we hope to be able to give a convincing argument showing why
this result is true and what are the main steps involved in its probabilistic proof. The
interested reader can find the technical details required to make the proof rigorous
in [Mal78, KS84, KS85, KS87, Nor86, Nua95]. Hörmander’s original, completely
different, proof using fractional integrations can be found in [Hör67]. Yet another, completely different, functional-analytic proof using the theory of pseudo-differential
operators was developed by Kohn in [Koh78] and can also be found in [Hör85].
Theorem 6.5 Let X be as above, assume that M(w) is invertible for every w and
that, for every p > 1 and every m ≥ 0, we have
E|∂_{k₁} ⋯ ∂_{k_m} X(w)|^p < ∞ ,      E‖M(w)^{−1}‖^p < ∞ .      (6.4)
Then the law of X(w) has a smooth density with respect to Lebesgue measure.
Before we turn to the proof of Theorem 6.5, note that if F_k and G are square integrable functions with square integrable derivatives, then we have the integration by parts formula

E ∑_k ∂_k G(w) F_k(w) δt_k = E G(w) ∑_k F_k(w) δw_k − E G(w) ∑_k ∂_k F_k(w) δt_k
                           =: E G(w) ∫ F dw .   (6.5)

Here we defined the Skorokhod integral ∫ F dw by the expression on the first line.
This Skorokhod integral satisfies the following isometry:
Proposition 6.6 Let F_k be square integrable functions with square integrable derivatives. Then the bound

E(∫ F dw)² = ∑_k E F_k²(w) δt_k + ∑_{k,ℓ} E ∂_k F_ℓ(w) ∂_ℓ F_k(w) δt_k δt_ℓ
           ≤ ∑_k E F_k²(w) δt_k + ∑_{k,ℓ} E(∂_k F_ℓ(w))² δt_k δt_ℓ

holds.
Proof. It follows from the definition that one has the identity

E(∫ F dw)² = ∑_{k,ℓ} E(F_k F_ℓ δw_k δw_ℓ + ∂_k F_k ∂_ℓ F_ℓ δt_k δt_ℓ − 2 F_k ∂_ℓ F_ℓ δw_k δt_ℓ) .

Applying the identity E G δw_ℓ = E ∂_ℓ G δt_ℓ to the first term in the above formula (with G = F_k F_ℓ δw_k), we thus obtain

... = ∑_{k,ℓ} E(F_k F_ℓ δ_{k,ℓ} δt_ℓ + ∂_k F_k ∂_ℓ F_ℓ δt_k δt_ℓ + (F_ℓ ∂_ℓ F_k − F_k ∂_ℓ F_ℓ) δw_k δt_ℓ) .

Applying the same identity to the last term then finally leads to

... = ∑_{k,ℓ} E(F_k F_ℓ δ_{k,ℓ} δt_ℓ + ∂_k F_ℓ ∂_ℓ F_k δt_k δt_ℓ) ,

which is the claimed identity. The inequality then follows since 2 ∂_k F_ℓ ∂_ℓ F_k ≤ (∂_k F_ℓ)² + (∂_ℓ F_k)² and the double sum is symmetric under the exchange of k and ℓ.
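The elementary identity E G δw_k = E ∂_k G δt_k used in the proof is just Gaussian integration by parts. A quick Monte Carlo check (our own illustration, with an arbitrary polynomial G and a single increment) is given below.

```python
import numpy as np

# Monte Carlo check of the discrete integration-by-parts identity
# E[G(w) dw] = E[G'(w)] dt for a single Gaussian increment dw ~ N(0, dt),
# with the illustrative choice G(w) = w^2 + w.

rng = np.random.default_rng(0)
dt = 0.5
dw = rng.normal(0.0, np.sqrt(dt), size=1_000_000)

G = lambda w: w ** 2 + w
dG = lambda w: 2 * w + 1

lhs = np.mean(G(dw) * dw)   # E[G(dw) dw]  = E[dw^3 + dw^2] = dt
rhs = np.mean(dG(dw)) * dt  # E[G'(dw)] dt = E[2 dw + 1] dt = dt
print(lhs, rhs)  # both close to dt = 0.5
```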
Proof of Theorem 6.5. We want to show that Lemma 6.4 can be applied. From the definition of M, we have the identity

(D_j G)(X(w)) = ∑_{k,m} ∂_k(G(X(w))) ∂_k X_m(w) δt_k M^{-1}_{mj}(w) .   (6.6)
Combining this with Proposition 6.6 and (6.4) immediately shows that the re-
quested result holds for k = 1. Higher values of k can be treated similarly by
repeatedly applying (6.6).
Using this identity, it then follows from differentiating (6.1) on both sides that one has, for r ≤ t, the identity

D^j_r X_t = ∫_r^t DV_0(X_s) D^j_r X_s ds + ∑_{i=1}^m ∫_r^t DV_i(X_s) D^j_r X_s ∘ dW_i(s) + V_j(X_r) .
We see that this equation is identical (save for the initial condition at time t = r given by V_j(X_r)) to the equation giving the derivative of X with respect to its initial condition! Denoting by J_{s,t} the Jacobian of the stochastic flow between times s and t, we therefore have for s < t the identity

D^j_s X_t = J_{s,t} V_j(X_s) .

(Since X is adapted, we have D^j_s X_t = 0 for s ≥ t.) Note that the Jacobian has the composition property J_{0,t} = J_{s,t} J_{0,s}, so that J_{s,t} can be rewritten in terms of the 'usual' Jacobian as J_{s,t} = J_{0,t} J_{0,s}^{-1}.
We now denote by A_{0,t} the operator A_{0,t} v = ∫_0^t J_{s,t} V(X_s) v(s) ds, where v is a square integrable, not necessarily adapted, R^m-valued stochastic process and V is the n × m matrix-valued function obtained by concatenating the vector fields V_j for j = 1, . . . , m. This allows us to define the Malliavin covariance matrix M_{0,t} of X_t in the same way as in the previous section by

M_{0,t} = A_{0,t} A*_{0,t} = ∫_0^t J_{s,t} V(X_s) V*(X_s) J*_{s,t} ds .
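For orientation: when the drift is linear, V_0(x) = Ax, and the diffusion vector fields are constant, J_{s,t} = e^{A(t−s)} and M_{0,t} reduces to the deterministic controllability Gramian ∫_0^t e^{As} B B* e^{A*s} ds from control theory. The sketch below (our own illustration, not from the notes) evaluates this Gramian for the pair A = [[0,1],[0,0]], B = (0,1)*, where the noise only acts on the second coordinate, and checks that it is nonetheless invertible.

```python
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
t = 1.0

def expA(s):
    # A is nilpotent (A^2 = 0), so exp(As) = I + A s exactly
    return np.eye(2) + A * s

# trapezoid approximation of M = \int_0^t exp(As) B B^T exp(A^T s) ds
N = 10_000
s = np.linspace(0.0, t, N + 1)
vals = np.array([(expA(si) @ B) @ (expA(si) @ B).T for si in s])
M = (vals[:-1] + vals[1:]).sum(axis=0) * (s[1] - s[0]) / 2

# Exact Gramian: [[t^3/3, t^2/2], [t^2/2, t]], det = t^4/12 > 0,
# so M is invertible even though B B^T has rank one.
print(M, np.linalg.det(M))
```

The drift transports the noise direction into the missing coordinate, which is the linear shadow of the bracket condition discussed above.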
Proposition 6.7 Consider a diffusion of the type (6.1) satisfying Assumption 6.1. If M_{0,t} is almost surely invertible and E‖M_{0,t}^{-1}‖^p < ∞ for every p > 0, then the transition probabilities of (6.1) have a C^∞ density with respect to Lebesgue measure for every t > 0.
Proof. This is essentially a version of Theorem 6.5. The technical details required
to make it rigorous are well beyond the scope of these notes and can be found for
example in [Nua95].
It is convenient to rewrite M_{0,t} as M_{0,t} = J_{0,t} C_{0,t} J*_{0,t}, where

C_{0,t} = ∫_0^t J_{0,s}^{-1} V(X_s) V*(X_s) (J_{0,s}^{-1})* ds

is the reduced Malliavin matrix of our diffusion process. Then, since we assumed that J_{0,t} has inverse moments of all orders, the invertibility of M_{0,t} is equivalent to that of C_{0,t}. Note first that since C_{0,t} is a positive definite symmetric matrix, the norm of its inverse is given by

‖C_{0,t}^{-1}‖ = ( inf_{|η|=1} ⟨η, C_{0,t} η⟩ )^{-1} .
Lemma 6.8 Let M be a symmetric positive semidefinite n × n matrix-valued random variable such that E‖M‖^p < ∞ for every p ≥ 1 and such that, for every p ≥ 1, there exists a constant C_p with

sup_{|η|=1} P(⟨η, M η⟩ ≤ ε) ≤ C_p ε^p   (6.8)

for every ε ≤ 1. Then E‖M^{-1}‖^p < ∞ for every p ≥ 1.

Proof. The non-trivial part of the result is that the supremum over η is taken outside of the probability in (6.8). For ε > 0, let {η_k}_{k≤N} be a sequence of vectors with |η_k| = 1 such that for every η with |η| = 1, there exists k such that |η_k − η| ≤ ε². It is clear that one can find such a set with N ≤ Cε^{2−2n} for some C > 0 independent of ε. We then have the bound

⟨η, M η⟩ = ⟨η_k, M η_k⟩ + ⟨η − η_k, M η⟩ + ⟨η − η_k, M η_k⟩ ≥ ⟨η_k, M η_k⟩ − 2‖M‖ε² ,

so that

P( inf_{|η|=1} ⟨η, M η⟩ ≤ ε ) ≤ P( inf_{k≤N} ⟨η_k, M η_k⟩ ≤ 4ε ) + P( ‖M‖ ≥ 1/ε )
                              ≤ C ε^{2−2n} sup_{|η|=1} P( ⟨η, M η⟩ ≤ 4ε ) + P( ‖M‖ ≥ 1/ε ) .

It now suffices to use (6.8) for p large enough to bound the first term, and Chebychev's inequality combined with the moment bound on ‖M‖ to bound the second term.
Theorem 6.9 Consider (6.1) and assume that Assumption 6.1 holds. If the corresponding vector fields satisfy the parabolic Hörmander condition then, for every initial condition x ∈ R^n and every p ≥ 1, there exists a constant C_p such that the bound

sup_{|η|=1} P(⟨η, C_{0,1} η⟩ < ε) ≤ C_p ε^p

holds for every ε ≤ 1.
Remark 6.10 The choice t = 1 as the final time is of course completely arbitrary.
Here and in the sequel, we will always consider functions on the time interval
[0, 1].
Before we turn to the proof of this result, we introduce a useful notation. Given
a family A = {Aε }ε∈(0,1] of events depending on some parameter ε > 0, we say
that A is ‘almost true’ if, for every p > 0 there exists a constant Cp such that
P(Aε ) ≥ 1 − Cp εp for all ε ∈ (0, 1]. Similarly for ‘almost false’. Given two
such families of events A and B, we say that ‘A almost implies B’ and we write
A ⇒ε B if A \ B is almost false. It is straightforward to check that these notions
behave as expected (almost implication is transitive, finite unions of almost false
events are almost false, etc). Note also that these notions are unchanged under any
reparametrisation of the form ε 7→ εα for α > 0. Given two families X and Y of
real-valued random variables, we will similarly write X ≤ε Y as a shorthand for
the fact that {Xε ≤ Yε } is ‘almost true’.
Before we proceed, we state the following useful result, where ‖ · ‖_∞ denotes the L^∞ norm and ‖ · ‖_α denotes the best possible α-Hölder constant.
Lemma 6.11 Let f : [0, 1] → R be continuously differentiable and let α ∈ (0, 1]. Then, the bound

‖∂_t f‖_∞ ≤ 4 ‖f‖_∞ max{ 1 , ‖f‖_∞^{−1/(1+α)} ‖∂_t f‖_α^{1/(1+α)} }

holds.
Proof. Denote by x_0 a point such that |∂_t f(x_0)| = ‖∂_t f‖_∞. It follows from the definition of the α-Hölder constant ‖∂_t f‖_α that |∂_t f(x)| ≥ ½ ‖∂_t f‖_∞ for every x such that |x − x_0| ≤ (‖∂_t f‖_∞ / 2‖∂_t f‖_α)^{1/α}. The claim then follows from the fact that if f is continuously differentiable and |∂_t f(x)| ≥ A over an interval I, then there exists a point x_1 in the interval such that |f(x_1)| ≥ A|I|/2.
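The inequality of Lemma 6.11 is easy to test numerically. The sketch below (our own illustration, with the arbitrary test function f(t) = sin(ωt) and α = 1, for which the α-Hölder constant of f′ is just ‖f″‖_∞) evaluates both sides on a grid.

```python
import numpy as np

omega = 12.0
t = np.linspace(0.0, 1.0, 100_001)
f = np.sin(omega * t)

sup_f = np.max(np.abs(f))   # ||f||_inf  (grid approximation, ~1 here)
sup_df = omega              # ||f'||_inf   (exact for sin)
holder_df = omega ** 2      # ||f'||_1 = ||f''||_inf (exact for sin)

alpha = 1.0
rhs = 4 * sup_f * max(1.0,
                      sup_f ** (-1 / (1 + alpha)) * holder_df ** (1 / (1 + alpha)))
print(sup_df, rhs)  # the bound  sup_df <= rhs  holds
```

Note the mechanism: a function with a large derivative but a controlled Hölder constant of that derivative must itself be large somewhere, exactly as in the proof above.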
With these notations at hand, we have the following statement, which is es-
sentially a quantitative version of the Doob-Meyer decomposition theorem. Orig-
inally, it appeared in [Nor86], although some form of it was already present in
earlier works. The proof given here is a further simplification of the arguments in
[Nor86].
Lemma 6.12 (Norris) Let W be an m-dimensional Wiener process and let A and B be R- and R^m-valued adapted processes such that, for α = 1/3, one has E(‖A‖_α + ‖B‖_α)^p < ∞ for every p. Let Z be the process defined by

Z_t = Z_0 + ∫_0^t A_s ds + ∫_0^t B_s dW(s) .   (6.9)

Then, there exists a universal constant r ∈ (0, 1) such that one has

{‖Z‖_∞ < ε} ⇒_ε {‖A‖_∞ < ε^r} & {‖B‖_∞ < ε^r} .
Proof. Recall the exponential martingale inequality [RY99, p. 153], stating that if M is any continuous martingale with quadratic variation process ⟨M⟩(t), then

P( sup_{t≤T} M(t) ≥ x & ⟨M⟩(T) ≤ y ) ≤ exp(−x²/2y) ,

for every positive T, x, y. With our notations, this immediately implies that for any q < 1/2, one has the almost implication

{⟨M⟩(1) ≤ ε} ⇒_ε {‖M‖_∞ ≤ ε^q} .   (6.10)

On the other hand, applying Itô's formula to Z² yields the identity

Z_t² = Z_0² + 2 ∫_0^t Z_s A_s ds + 2 ∫_0^t Z_s B_s dW(s) + ∫_0^t B_s² ds .   (6.11)

Since ‖A‖_∞ ≤_ε ε^{−1/2} (or any other negative exponent for that matter) by assumption, and similarly for B, it follows from this and (6.10) that

{‖Z‖_∞ < ε} ⇒_ε { |∫_0^1 Z_s A_s ds| ≤ ε^{1/2} } & { |∫_0^1 Z_s B_s dW(s)| ≤ ε^{1/2} } .
Inserting these bounds back into (6.11) and applying Jensen's inequality then yields

{‖Z‖_∞ < ε} ⇒_ε { ∫_0^1 B_s² ds ≤ ε^{1/2} } ⇒ { ∫_0^1 |B_s| ds ≤ ε^{1/4} } .
We now use the fact that ‖B‖_α ≤_ε ε^{−q} for every q > 0 and we apply Lemma 6.11 with ∂_t f(t) = |B_t| (we actually do it component by component), so that

{‖Z‖_∞ < ε} ⇒_ε {‖B‖_∞ ≤ ε^{1/18}} ,

say. In order to get the bound on A, note that we can again apply the exponential martingale inequality to conclude that this 'almost implies' that the martingale part in (6.9) is 'almost bounded' in the supremum norm by ε^{1/18}, so that

{‖Z‖_∞ < ε} ⇒_ε { ‖ ∫_0^· A_s ds ‖_∞ ≤ ε^{1/18} } .

Applying Lemma 6.11 once more, this time with ∂_t f(t) = A_t, then yields the stated bound on ‖A‖_∞.
Remark 6.13 By making α arbitrarily close to 1/2, keeping track of the different norms appearing in the above argument, and then bootstrapping the argument, it is possible to show that

{‖Z‖_∞ < ε} ⇒_ε {‖A‖_∞ ≤ ε^p} & {‖B‖_∞ ≤ ε^q} ,

for p arbitrarily close to 1/5 and q arbitrarily close to 3/10. This is a slight improvement over the exponent 1/8 that was originally obtained in [Nor86]. (Note however that the two results are not necessarily comparable, since [Nor86] used L² norms instead of L^∞ norms.)
We now have all the necessary tools to prove Theorem 6.9:
Proof of Theorem 6.9. We fix some initial condition x ∈ R^n and some unit vector η ∈ R^n. With the notation introduced earlier, our aim is then to show that the family of events {⟨η, C_{0,1} η⟩ < ε} is 'almost false', with constants that are uniform in η. For a smooth vector field F, set Z_F(t) = ⟨η, J_{0,t}^{-1} F(X_t)⟩, so that

⟨η, C_{0,1} η⟩ = ∑_{k=1}^m ∫_0^1 |Z_{V_k}(t)|² dt ≥ ∑_{k=1}^m ( ∫_0^1 |Z_{V_k}(t)| dt )² .   (6.12)
The processes Z_F have the nice property that they solve the stochastic differential equation

dZ_F(t) = Z_{[F,V_0]}(t) dt + ∑_{k=1}^m Z_{[F,V_k]}(t) ∘ dW_k(t) .

Applying Lemma 6.12, we conclude that there exists some r ∈ (0, 1) such that

{‖Z_F‖_∞ < ε} ⇒_ε {‖Z_{[F,V_k]}‖_∞ < ε^r} ,   (6.14)

for k = 0, . . . , m.
At this stage, we are basically done. Indeed, combining (6.12) with Lemma 6.11 as above, we see that

{⟨η, C_{0,1} η⟩ < ε} ⇒_ε {‖Z_{V_k}‖_∞ < ε^{q_1}} ,

for k = 1, . . . , m and some exponent q_1 > 0.
Applying (6.14) iteratively, we see that for every k > 0 there exists some q_k > 0 such that

{⟨η, C_{0,1} η⟩ < ε} ⇒_ε ⋂_{V ∈ V_k} {‖Z_V‖_∞ < ε^{q_k}} .

Since Z_V(0) = ⟨η, V(x_0)⟩ and since there exists some k > 0 such that V_k(x_0) = R^n, the right-hand side of this expression is empty for some sufficiently large value of k, which is precisely the desired result.
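To make the bracket-spanning mechanism concrete, the following sketch (our own illustration, not from the notes) checks the parabolic Hörmander condition numerically for the hypoelliptic system dx_1 = x_2 dt, dx_2 = dW, i.e. V_0(x) = (x_2, 0) and V_1(x) = (0, 1): the noise direction V_1 alone is degenerate, but one bracket with the drift, [V_1, V_0] = (1, 0), restores the full span.

```python
import numpy as np

def V0(x):
    return np.array([x[1], 0.0])

def V1(x):
    return np.array([0.0, 1.0])

def jacobian(F, x, h=1e-6):
    # central finite-difference Jacobian of a vector field
    n = len(x)
    J = np.zeros((n, n))
    for k in range(n):
        e = np.zeros(n); e[k] = h
        J[:, k] = (F(x + e) - F(x - e)) / (2 * h)
    return J

def bracket(U, V, x):
    # [U, V](x) = DV(x) U(x) - DU(x) V(x)
    return jacobian(V, x) @ U(x) - jacobian(U, x) @ V(x)

x0 = np.array([0.3, -0.5])
v1 = V1(x0)                # direction forced directly by the noise
v2 = bracket(V1, V0, x0)   # direction generated by one bracket with the drift

rank = np.linalg.matrix_rank(np.column_stack([v1, v2]))
print(v1, v2, rank)  # rank 2: the Hoermander condition holds at x0
```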
7 Examples
In this section, we go through a number of examples where it is possible, using the
results presented in these notes, to give very sharp upper and lower bounds on the
rate of convergence to equilibrium.
Note that the difference between this and (7.2) is that the right-hand side is positive rather than negative. It follows from Jensen's inequality that the quantity g(t) = EG(x_t) satisfies the differential inequality

dg(t)/dt ≤ C g(t) / (log g(t))^{1/k − 2} ,
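This differential inequality can be integrated in closed form: writing L(t) = log g(t), it becomes L′ ≤ C L^{2−1/k}, so that L grows at most like t^{k/(1−k)}. As a sanity check (our own illustration, with arbitrary constants), the sketch below integrates the saturated equation with an Euler step and compares log g against the exact solution of L′ = C L^{2−1/k}.

```python
import numpy as np

# Integrate dg/dt = C g / (log g)^(1/k - 2) and check log g(t) against
# the exact solution  L(t) = (1 + C (1/k - 1) t)^(k/(1-k))  of L' = C L^(2-1/k).
C, k = 1.0, 0.25   # illustrative constants; here the exponent 1/k - 2 equals 2
h, T = 1e-4, 5.0
g = np.e           # start from log g(0) = 1
for _ in np.arange(0.0, T, h):
    g += h * C * g / np.log(g) ** (1.0 / k - 2.0)

L_exact = (1 + C * (1.0 / k - 1.0) * T) ** (k / (1.0 - k))
print(np.log(g), L_exact)  # both close: log g grows like t^(k/(1-k))
```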
On the other hand, since the invariant measure for our system is known to be given
by
µ? (dx) ∝ exp(−V (x)) dx ,
one can check that the bound
is valid for some C > 0 and some γ > 0 for sufficiently large values of R. The
value of γ can be made arbitrarily small by increasing the value of κ. This enables us to apply Theorem 5.1 with f(R) = C R^{−γ} and g(x, t) = G(x) e^{c t^{k/(1−k)}}, so that we obtain a lower bound of the form

‖P_t(x, ·) − µ_⋆‖_TV ≥ C exp( (γκ/(1−γ)) V(x) − c t^{k/(1−k)} ) ,
which essentially matches the upper bound previously obtained in (7.3).
References
[Bax05] P. H. BAXENDALE. Renewal theory and computable convergence rates for geometrically ergodic Markov chains. Ann. Appl. Probab. 15, no. 1B, (2005), 700–738.
[BCG08] D. BAKRY, P. CATTIAUX, and A. GUILLIN. Rate of convergence for ergodic continuous Markov processes: Lyapunov versus Poincaré. J. Funct. Anal. 254, no. 3, (2008), 727–759.
[Cha89] K. S. CHAN. A note on the geometric ergodicity of a Markov chain. Adv. in Appl. Probab. 21, no. 3, (1989), 702–704.
[DFG09] R. DOUC, G. FORT, and A. GUILLIN. Subgeometric rates of convergence of f-ergodic strong Markov processes. Stochastic Process. Appl. 119, no. 3, (2009), 897–923.
[DFMS04] R. DOUC, G. FORT, E. MOULINES, and P. SOULIER. Practical drift conditions for subgeometric rates of convergence. Ann. Appl. Probab. 14, no. 3, (2004), 1353–1377.
[DMLM03] P. DEL MORAL, M. LEDOUX, and L. MICLO. On contraction properties of Markov kernels. Probab. Theory Related Fields 126, no. 3, (2003), 395–420.
[DMR04] R. DOUC, E. MOULINES, and J. S. ROSENTHAL. Quantitative bounds on