An Introduction To The Theory of Markov Processes: Mostly For Physics Students
Christian Maes1
1 Instituut voor Theoretische Fysica, KU Leuven, Belgium∗
(Dated: 21 June 2013)
For some 200 years it has been generally realized how fluctuations and chance play a
prominent role in fundamental studies of science. The main examples have come from
astronomy (Laplace, Poincaré), biology (Darwin), statistical mechanics (Maxwell,
Boltzmann, Gibbs, Einstein) and the social sciences (Quetelet). The mere power
of numbers for large systems and the unavoidable presence of fluctuations for small
systems make the theory of chance very much part of basic physics. But today other
domains such as economics, business, technology and medicine also increasingly demand
complex stochastic models. Stochastic techniques have led to a richer variety of
modeling accompanied by powerful computational methods. An important subclass
of stochastic processes is that of Markov processes, where memory effects are strongly
limited and to which the present notes are devoted.
I. INTRODUCTION
What follows is a fast and brief introduction to Markov processes. These are a class of
stochastic processes with minimal memory: the update of the system's state is a function only
of the present state, and not of its history. We know this type of evolution well, as it
appears in the first order differential equations that we traditionally use in mechanics for
the autonomous evolution of a state (positions and momenta). Also in quantum mechanics,
with the Schrödinger equation for the wave function, the update is described by a first order
dynamics. Markov processes add noise to these descriptions, so that the update is
not fully deterministic. The result is a class of probability distributions on the possible
trajectories.
∗ Electronic address: [email protected]; itf.fys.kuleuven.be/~christ/
The theory of chances, more often called probability theory, has a long history. It has
mostly come to us not via the royal road on which we find geometry, algebra or analysis,
but rather via side roads. We need to cross market squares, and quite a lot can be learnt
from entering gambling houses.
Many departments of mathematics still today have no regular seminar or working group on
probability theory, even though the axiomatization of probability is now already some 80
years old (Kolmogorov 1933). Very often probability theory gets mixed up with statistics
or with measure theory, and more recently it gets connected with information theory. On
the other hand, all in all, physicists have always appreciated a bit of probability. For
example, we need a theory of errors and we need to understand how reliable our
observations and experiments can be. In other important cases chances enter physics for a variety
of (overlapping) reasons, such as
(1) The world is big and nature is complex and uncertain to our crude senses and our
little minds. For lack of certain information, we talk about plausibility, hopefully estimated
in an objective, scientific manner. Chances are the result, providing quantitative degrees
of plausibility.
(2) The dynamics we are studying in physics are often very sensitive to initial and
boundary conditions. The technical word is chaos. Chaotic evolutions appear very
erratic and random. Practical unpredictability is the natural consequence; probabilistic
considerations then become useful or even unavoidable.
(3) The details at a very fine degree of accuracy, are not always relevant to our
description. Some degrees of freedom can better be replaced by noise or chance. In so doing
we get a simple and practical way of treating a much larger complex of freedoms.
(4) Quantum mechanical processes often possess chance-like aspects. For example, there
are fundamental uncertainties about the time of decay of an unstable particle, or we can
only give the probabilities for the positions of particles. Therefore again, we use probability
models. In fact two major areas of physics, statistical mechanics and quantum mechanics,
depend on notions of probability for their theoretical foundation and deal explicitly with
laws of chance.
One of the first to have the idea to apply probability theory in physics was Daniël
Bernoulli. He did this in an article (1732) on the inclination of planetary orbits relative to
the ecliptic. The chance that all the inclination angles are less than 9 degrees is so small
that Daniël inferred that this phenomenon must have a definite cause. This conceptual
breakthrough and this way of reasoning had a major influence on Pierre Simon de Laplace,
who took it up from his first works onward. Daniël Bernoulli also applied these first
statistical considerations to a gas model, the beginning of the kinetic theory of gases. That
idea was also independently rediscovered by John Waterston and by August Krönig, while
Maxwell and Boltzmann took over towards the end of the 19th century. Boltzmann gave
a most important contribution to the statistical interpretation of thermodynamic entropy.
For the first time a statistical law entered physics. Maxwell called it a moral certainty, and
the second law of thermodynamics got a statistical status. In that way Maxwell repeated
(and cited) another Bernoulli, Jacob Bernoulli, author of the Ars Conjectandi of 1713:
we ought to learn how to guess, because the true logic of the world is the calculus of
Probabilities, which takes account of the magnitude of the probability which is, or ought
to be, in a reasonable man's mind. (J.C. Maxwell)
The goals of this course are typical for a physics education, modeling and analysis: (1)
learning how to translate to and to model in the mathematics of stochastics, (2) learning to
calculate with probabilistic models, here Markov processes.
A. The chance-trinity
Speaking of probabilities implies that we assign some number between zero and one to
events. Let us start from the events. They can mostly be viewed as sets of more elementary
outcomes. As an example, to throw an even number with a die is an event A consisting of
the more elementary outcomes 2, 4 and 6, or A = {2, 4, 6}. The basic thing is thus to know
the set (or space) of all possible outcomes; we call it the universe Ω. Each outcome is an
element ω ∈ Ω. An event A is a set of such outcomes, hence a subset of Ω. A probability
distribution gives numbers between 0 and 1 to these events. The universe, the possible
events and their probabilities: well, that is the chance-trinity.
The set of events F deserves some structure. First of all we want that Ω and the empty
set ∅ belong to it. But also, that if A, B ∈ F, then also A ∩ B, A ∪ B, A^c ∈ F, where the
latter A^c = Ω \ A denotes the complement of the set A. Such an F is called a field. We say
that it is a σ−field when F is also closed under countable unions (or intersections).
On that set F we put a (countably additive) probability law, which is a non-negative function
P defined on the A ∈ F for which P(A) = 1 − P(A^c) ≥ 0, P(Ω) = 1 − P(∅) = 1, and secondly,
for every countable collection of mutually disjoint events A_n ∈ F with union A = ∪_n A_n,

P(A) = Σ_n P(A_n)
Let us have a look at some examples. If Ω is countable, it is less of a problem. Why not
take the simplest case: for a coin, we have Ω = {up, down}, and F contains 4 elements.
A probability law on it is completely determined by the probability of up, P(up). It
becomes more difficult for uncountable universes. Let us take the real numbers, Ω = R.
The natural σ-field here is the so called Borel σ−field. That is defined as the smallest
σ−field that contains all intervals, and in particular contains all open sets.
When we have a countably additive probability law P on a measurable space
(Ω, F) with a universe Ω and a σ−field F, then the triple (Ω, F, P) is called a probability
space. That is the start. For example, a stochastic variable is a map f : Ω → R so that for
all Borel-sets B, f −1 B ∈ F. The meaning is simply that f takes values that depend on the
outcome of a chance-experiment.
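To make the chance-trinity concrete, here is a minimal sketch in Python, assuming a fair six-sided die as the chance-experiment; it builds the universe, a probability law on events and a random variable with its expectation:

```python
# A minimal finite probability space: a fair six-sided die.
# Omega is the universe, events are subsets of Omega, and the law P
# assigns to each event a number in [0, 1].
from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}

def P(A):
    """Uniform probability law, evaluated on an event A (a subset of Omega)."""
    return Fraction(len(A & Omega), len(Omega))

A = {2, 4, 6}                                # the event "even outcome"
print(P(A), P(Omega - A))                    # 1/2 and 1 - P(A) = 1/2

# A random variable f: Omega -> R (here the indicator of A) and its
# expectation  sum over omega of f(omega) P({omega}).
f = lambda omega: 1 if omega in A else 0
print(sum(f(w) * P({w}) for w in Omega))     # 1/2
```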
For the purpose of these lectures we consider mostly two classes of probability spaces.
There is first the state space K, say a finite set, on which we can define probability distributions ρ with

ρ(x) ≥ 0,  Σ_{x∈K} ρ(x) = 1
We also write

Prob_ρ[B] = Σ_{x∈B} ρ(x)

for the probability of an event B ⊂ K, and ⟨f⟩_ρ = Σ_{x∈K} ρ(x) f(x) for the expectation of
a function f on K (an observable). Observables are random variables
and can be added and multiplied because we take them real-valued. The subscripts ρ under
Prob or under ⟨·⟩_ρ will only be used when necessary.
Secondly there will be the space of possible trajectories (time-sequences of states) for which
we will reserve the letter Ω, and that one will become uncountable in continuous time.
B. State space
For our purposes it is sufficient to work with a finite number of states, together making
the set K. Obviously that corresponds to quite a reduced description — one forgets about
the microscopic details, taking them into account effectively via the laws of probability. For
example, these states can be possible energy levels of an atom or molecule, or some discrete
set of chemo-mechanical configurations of an enzyme, or a collection of (magnetic) spins on a
finite lattice, or the occupation variables on a graph etc. When we would allow a continuum
of states we can include velocity and position distributions as from classical mechanics, or
even probabilities on wave functions as for quantum mechanics.
The elements of K are called states. The σ−field (the events) consists of all possible subsets of
K. A probability distribution µ on K specifies numbers µ(x) ≥ 0, x ∈ K, that sum to one,
Σ_x µ(x) = 1. Here are some simple examples:
(1) spin configurations: the state space is K = {+1, −1}N where N is the number of spins.
Each spin has two values (up or down) that we take as ±1. The elements of K are then
N−tuples x = (x(1), x(2), . . . , x(N)), with each x(i) = ±1. Here is a nice probability law:

ρ(x) = (1/Z) exp[ (J/N) Σ_{i,j=1}^{N} x(i) x(j) ]
for a certain coupling constant J and Z is the normalization. This law defines the Curie-
Weiss model for magnetism. The normalization (partition function) Z is
Z = Σ_{x∈K} exp[ (J/N) Σ_{i,j=1}^{N} x(i) x(j) ] = Σ_m e^{N[Jm² + s_N(m)]}

where the last sum is over m = −1, −1 + 2/N, . . . , 1 − 2/N, 1 (the N + 1 possible values of
(1/N) Σ_{i=1}^{N} x(i)), and s_N(m) is the entropy per particle

s_N(m) = (1/N) log [ N! / ( (N(1+m)/2)! (N(1−m)/2)! ) ]
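The rewriting of Z as a sum over the N + 1 magnetization values is easy to check numerically. A minimal Python sketch, comparing it with the defining sum over all 2^N spin configurations (illustrative values of N and J assumed):

```python
# Partition function of the Curie-Weiss model: instead of summing over all
# 2^N spin configurations, sum over the N+1 magnetization values m, with
# binomial degeneracy exp[N s_N(m)] = (N choose k), where k = N(1+m)/2.
from itertools import product
from math import comb, exp

def Z_magnetization(N, J):
    return sum(comb(N, k) * exp(J * N * ((2*k - N) / N) ** 2)
               for k in range(N + 1))

def Z_bruteforce(N, J):
    # the defining sum: exp[(J/N) (sum_i x(i))^2] over all configurations
    return sum(exp((J / N) * sum(x) ** 2) for x in product((-1, 1), repeat=N))

print(Z_magnetization(10, 0.5))
print(Z_bruteforce(10, 0.5))    # the same, up to rounding
```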
C. Conditional probabilities/expectations
Additional information can be used to deform our probabilities. For example, if we know
that the event B happens (is true), then

ρ(x|B) = ρ(x)/Prob[B] if x ∈ B,  ρ(x|B) = 0 otherwise

is a conditional probability (given the event B, for which we assume a priori that Prob[B] > 0).
Now the probability of event A is

Prob[A|B] = Prob[A ∩ B] / Prob[B]
Of course that defines a bona fide probability distribution ρ(·|B) on K. We can also take
expectations of a function f as
⟨f|B⟩_ρ = Σ_{x∈K} ρ(x|B) f(x)
which is a conditional expectation. If we have a partition {Bi } of the state space K, then
ρ(x) = Σ_i ρ(x|B_i) Prob[B_i]
We say that two random variables f and g are independent when their joint distribution
factorizes. In other words,

Prob[f = a, g = b] = Prob[f = a] Prob[g = b]

for all a, b possible values of respectively f and g. As a consequence, their covariance
⟨fg⟩ − ⟨f⟩⟨g⟩ = 0.
D. Distributions
Some probability distributions got names, for good reasons. The mother is Bernoulli, for
which K = {0, 1} and the probability ρ(0) = 1 − p, ρ(1) = p is completely determined by
the parameter p ∈ [0, 1] (probability of success). Next is the binomial distribution, which
asks for the number of successes out of n independently repeated Bernoulli experiments. We
now have two parameters, n and p, and the probability to get k successes out of n trials is

ρ_{n,p}(k) = [ n! / (k!(n−k)!) ] p^k (1 − p)^{n−k},  k = 0, 1, . . . , n

In the limit n ↑ ∞, p ↓ 0 with np = λ kept fixed,

ρ_{n,p}(k) → e^{−λ} λ^k / k!,  k = 0, 1, . . .
which is the Poisson distribution with parameter λ > 0. If n is counted as time, then λ
would be running as (proportional to) a continuous time.
If, on the other hand, we take k = mn ± σ√n, then the binomial distribution can be seen
to converge to the normal (Gaussian) density

ρ(x) = (1/(σ√(2π))) e^{−(x−m)²/(2σ²)},  x ∈ R
Finally we look at the exponential distribution. We say that a random variable τ with values
in [0, ∞) is exponentially distributed with rate ξ > 0 when its probability density is
p(t) = ξ e^{−ξt}, t ≥ 0.
The exponential distribution can be seen as the limit of the geometric distribution, in the
same sense as the Poisson distribution emerges from the binomial distribution.
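Both limits of the binomial distribution are easy to inspect numerically. A sketch, with illustrative parameter values assumed:

```python
# Binomial -> Poisson (n large, p = lam/n, k fixed) and binomial -> Gaussian
# (k near the mean np); illustrative parameter values assumed.
from math import comb, exp, factorial, pi, sqrt

def binom(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

lam, n = 2.0, 5000
for k in range(5):
    print(k, binom(n, lam / n, k), exp(-lam) * lam**k / factorial(k))

n, p = 1000, 0.5
m, sigma = n * p, sqrt(n * p * (1 - p))
k = int(m + sigma)                    # one standard deviation above the mean
gauss = exp(-(k - m)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))
print(binom(n, p, k), gauss)          # close for large n
```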
IV. EXERCISES
1. Monty Hall problem.
Suppose you are on a game show and you are given the choice of three doors: behind one
door is a car, behind the two others are goats. You pick a door. The host, Monty Hall,
who knows what is behind the doors, now has to open one of the two remaining doors, and the door
he opens must have a goat behind it. If both remaining doors have goats behind them,
he chooses one [uniformly] at random. After Monty Hall opens a door with a goat, he
will ask you to decide whether you want to stay with your first choice or to switch to the
last remaining door. Imagine that you chose Door 1 and the host opens Door 3, which
has a goat. He then asks you “Do you want to switch to Door Number 2?” Is it to your
advantage to change your choice?
(Krauss, Stefan and Wang, X. T. (2003). The Psychology of the Monty Hall Problem:
Discovering Psychological Mechanisms for Solving a Tenacious Brain Teaser, Journal of
Experimental Psychology: General 132(1).)
3. Birthday paradox.
Compute the approximate probability that in a room of n people, at least two have the
same birthday. For simplicity, disregard variations in the distribution, such as leap years,
twins, seasonal or weekday variations, and assume that the 365 possible birthdays are
equally likely. For what n is that probability approximately 50 percent?
A taxi caused a deadly hit-and-run. Our only witness claims the taxi was green. An
investigation shows however that the witness always reports green when shown green, but
in half the cases also says green when shown blue. Estimate the plausibility that the
accident was caused by a blue taxi.
6. Poisson approximation
Suppose that N electrons pass a point in a given time t with N following a Poisson
distribution. The mean current is I = ehN i/t. Show that we can estimate the charge e
from the current fluctuations.
8. Statistical independence
Suppose that f and g are two random variables with expectations ⟨fg⟩ = ⟨f⟩ = 0. Can we
conclude that f and g are statistically independent?
9. Exponential distribution
The exponential distribution is memoryless. Show that its probabilities satisfy

Prob[τ > t + s | τ > s] = Prob[τ > t]

for all t, s ≥ 0. The only memoryless continuous probability distributions are the exponential
distributions, so memorylessness completely characterizes the exponential distributions
among all continuous ones.
10. For the lattice gas near the end of Section III B, find the probability that a given
specified site is occupied.
What is the normalization Z in the given example for ρ?
Poisson distribution.
Consider a random variable X with

P[X = k] = e^{−z} z^k / k!,  k = 0, 1, 2, . . .

where z > 0 is a parameter. Show that its mean equals its variance equals z. The X most
often refers to a number (of arrivals, mutations, changes, actions, jumps, ...) in a certain
space-time window, like the number of particles in a region of space in a gas where z would
be the fugacity.
V. MARKOV CHAINS
The easiest way to imagine a trajectory over state space K is to think of uniform time steps
(say of size one) at which the state can possibly change. The path space is then Ω = K^ℕ
with elements ω = (x_0, x_1, x_2, . . .) where each x_n ∈ K (the state at time n). We can now
build certain probability laws on Ω. They are parameterized by two types of objects: (1)
the initial distribution (the law µ from which we draw our initial state x_0), and (2) the
updating rule (giving the conditional probability of getting some value for x_n given the
state x_{n−1} at time n − 1). As the updating rule only uses the present state, and not the
past history, we say that the process is Markovian.
In 1907, Andrei Andreyevich Markov started studying such exciting new types of chance
processes; see A.A. Markov, An Example of Statistical Analysis of the Text of Eugene Onegin
Illustrating the Association of Trials into a Chain, Bulletin de l'Académie Impériale des
Sciences de St. Petersburg, ser. 6, vol. 7 (1913), pp. 153–162. In these processes, the
outcome of a given experiment affects or can affect the outcome of the next experiment.
(From http://www.saurabh.com/Site/Writings.html, last accessed on 27 Oct. 2011)
Suppose you are given a body of text and asked to guess whether the letter at a randomly
selected position is a vowel or a consonant. Since consonants occur more frequently than
vowels, your best bet is to always guess consonant. Suppose we decide to be a little more
helpful and tell you whether the letter preceding the one you chose is a vowel or consonant. Is
there now a better strategy you can follow? Markov was trying to answer the above problem
and analysed twenty thousand letters from Pushkin's poem Eugene Onegin. He found that
43 percent of the letters were vowels and 57 percent were consonants. So in the first problem, one
should always guess "consonant" and can hope to be correct 57 percent of the time. However,
a vowel was followed by a consonant 87 percent of the time. A consonant was followed by a
vowel 66 percent of the time. Hence, guessing the opposite of the preceding letter would be
a better strategy in the second case. Clearly, knowledge of the preceding letter is helpful.
The real insight came when Markov took the analysis a step further. Markov investigated
whether knowledge about the preceding two letters confers any additional advantage. He
found that there was no significant advantage to knowing the additional preceding letter.
This leads to the central idea of a Markov chain — while the successive outcomes are not
independent, only the most recent outcome is of use in making a prediction about the next
outcome.
Let us start with a famous example, Ehrenfest’s model, also called the dog-and-flea
model, for reasons that will become clear. The state space is K = {0, 1, 2, . . . , N } with
each state representing the number x of particles in a vessel. For the updating we imagine
there is another vessel containing N − x particles. We randomly pick a particle from the
two vessels, and we move it to the other vessel with probability p. Alternatively, we leave
it where we found it with probability 1 − p. In that way x will change, by a stochastic rule.
Let us add the formulæ. We have x → x + 1 at the next time with probability p(N − x)/N ,
x → x − 1 with probability px/N and x remains x with probability 1 − p. We can write
it in a matrix p(x, y) where x is the present state and y is the new possible state. That
matrix of transition probabilities will be abstracted in the next section. Here we simply
write p(x, x + 1) = p(N − x)/N, p(x, x − 1) = px/N and p(x, x) = 1 − p. The rest of the
matrix elements are zero.
[Figure: two vessels containing x and N − x particles; at each step a randomly picked particle is moved to the other vessel with probability p and left in place with probability 1 − p.]
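A short simulation sketch of the Ehrenfest chain; it records the empirical occupation of the states and compares it with the binomial distribution that will appear below as the equilibrium distribution (illustrative values of N and p assumed):

```python
# Simulation of the Ehrenfest chain: x -> x+1 with probability p(N-x)/N,
# x -> x-1 with probability p x/N, and x -> x otherwise.
import random
from math import comb

N, p, steps = 20, 0.8, 200_000
x, counts = 0, [0] * (N + 1)
for _ in range(steps):
    u = random.random()
    if u < p * (N - x) / N:
        x += 1
    elif u < p:                 # p(N-x)/N + p x/N = p in total
        x -= 1
    counts[x] += 1

for k in (0, N // 2, N):
    print(k, counts[k] / steps, comb(N, k) / 2**N)   # empirical vs binomial
```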
We can try to see the probability of certain (pieces of) trajectories. Say x_0 = 1, x_1 = 2,
x_2 = 1, x_3 = 1 (which is in fact a cycle) has probability µ(1) · p(N−1)/N · 2p/N · (1 − p),
where µ(1) is the initial probability (to start with one particle in our vessel). We can also
look at the time-reversed trajectory, taking x_0 = 1, x_1 = 1, x_2 = 2, x_3 = 1, having the
same probability µ(1) · (1 − p) · p(N−1)/N · 2p/N. Of course that need not always be the
case. For example, take the binomial distribution
ρ(x) = (1/2^N) · N! / (x!(N − x)!),  x = 0, 1, 2, . . . , N

(the fraction of subsets with x elements from a set with N elements), which gives equal probability
to all particle configurations. In fact, this distribution is time-invariant (stationary) and
exactly satisfies (V.1) above, and can be checked more generally to generate time-reversal
invariance. In particular, it is easily checked that for all x = 1, . . . , N − 1,

ρ(x) p(x, x + 1) = ρ(x + 1) p(x + 1, x)
This ρ is called the equilibrium distribution. The Ehrenfest model of diffusion was proposed
by Paul Ehrenfest to illustrate the relaxation to equilibrium and to understand the second
law of thermodynamics.
Assume that the state space has |K| = m elements. We say that an m × m-matrix P is
a stochastic matrix when all its elements p(x, y) ≥ 0 are non-negative and each row sums
to one, Σ_y p(x, y) = 1. That allows a probabilistic interpretation. Transition matrices specifying
the updating rule for Markov chains are stochastic matrices, and their elements are the
transition probabilities: p(x, y) is the transition probability to change to state y given that
we now have state x. That gives the building block for the probability law on discrete time
trajectories: at every time n = 0, 1, . . .

Prob[x_{n+1} = b | x_n = a, x_{n−1} = a_{n−1}, . . . , x_0 = a_0] = p(a, b)   (V.2)
no matter what earlier history (a0 , a1 , . . . , an−1 ) up to time n − 1. In other words, a Markov
process started with distribution µ and with transition matrix P gives probability

Prob_µ[x_0 = a_0, x_1 = a_1, . . . , x_n = a_n] = µ(a_0) p(a_0, a_1) ⋯ p(a_{n−1}, a_n)   (V.3)

to the trajectory (a_0, a_1, . . . , a_n) ∈ K^{n+1}. Note the subscript µ in the left-hand side
indicating the initial distribution. Formulæ (V.2)–(V.3) are the defining properties of
Markov chains.
Many properties follow. For example, we can ask what is the probability to find state x
at time one if we started with distribution µ at time zero. That is
Prob_µ[x_1 = x] = Σ_{a∈K} Prob_µ[x_0 = a, x_1 = x] = Σ_{a∈K} µ(a) p(a, x)

More generally, writing µ_n(x) = Prob_µ[x_n = x], the one-step change is

µ_n(x) − µ_{n−1}(x) = Σ_{a∈K} [ µ_{n−1}(a) p(a, x) − µ_{n−1}(x) p(x, a) ]

which is called the Master equation for Markov chains. Observe that the change in
probability (in the left-hand side) is given (in the right-hand side) by a source (the first)
and a sink (the second term). What is written above for the evolution of probability
distributions has an obvious dual for the evolution of observables. After all, the expected
value of a function f at time n is
⟨f(x_n)⟩_µ = µ_n(f) = Σ_x f(x) µ_n(x) = Σ_x f(x) Σ_{a∈K} µ_{n−1}(a) p(a, x)

which then takes the form of a scalar product, multiplying the row-vector µ_n with the column-vector
f. Do not mind that notation too much, however. The essential thing is that the transition
probability P really determines everything. Its products P^n (under matrix multiplication)
also have a definite meaning, as the matrix elements are

P^n(x, y) = Prob[x_n = y | x_0 = x]
1. Calculation
2. Example
FIG. 4: Three states and their connections.
Consider the transition matrix
P =
1/2 1/2 0
0 1/2 1/2
1 0 0
for state space K = {1, 2, 3}, see Fig. 4. (Check that P is a stochastic matrix.) We want
to find the probability that at general time n the state is 1 given that we started at time
zero in 1. That is P^n(1, 1). The best is to compute the eigenvalues of P. The characteristic
equation is

det(λ − P) = (1/4)(λ − 1)(4λ² + 1) = 0

(We knew λ = 1 would be an eigenvalue, because P c = c for a constant vector c.) The
eigenvalues are 1, i/2 and −i/2. From linear algebra we thus have that

P^n(1, 1) = a + b (i/2)^n + c (−i/2)^n
(what would that become when some eigenvalue is repeated?) Since the answer must be
real (a probability!), we can write

P^n(1, 1) = α + (1/2)^n { β cos(nπ/2) + γ sin(nπ/2) }

where the constants α, β, γ do not depend on n. We can compute them by mere inspection,
from the values of P^n(1, 1) at n = 0, 1 together with the limit n ↑ +∞.
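A numerical check is a good idea here. The sketch below assumes the matrix P as written above (its entries are consistent with the weights of Fig. 4 and with the characteristic polynomial); with that P, inspection gives α = 2/5, β = 3/5, γ = 1/5:

```python
# Check of P^n(1,1) for the three-state example, assuming the matrix P
# written above; alpha, beta, gamma are fixed by n = 0, 1 and n -> infinity.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0]])
print(np.linalg.eigvals(P))            # 1, i/2, -i/2 (up to rounding)

alpha, beta, gamma = 2/5, 3/5, 1/5
for n in range(6):
    closed = alpha + 0.5**n * (beta*np.cos(n*np.pi/2) + gamma*np.sin(n*np.pi/2))
    exact = np.linalg.matrix_power(P, n)[0, 0]   # state 1 is index 0
    print(n, exact, closed)                      # the two columns agree
```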
3. Time-correlations
So far we have been mostly interested in finding the distribution and expected values at a
fixed time n. We are obviously also interested in time-correlations. That means to estimate
the correlation and dependence between observations at various different times.
Let us look at pair-correlations, say for observables f, g: for times 0 ≤ u ≤ n,

⟨g(x_n) f(x_u) | x_0 = x⟩ = Σ_{y,z} P^u(x, y) f(y) P^{n−u}(y, z) g(z)
The last equality uses time-homogeneity, the property that the updating rule itself does not
depend on time. The obtained formula can also be written as

⟨g(x_n) f(x_u)⟩_µ = µ_u f P^{n−u} g   (V.5)
when starting at time zero from µ. In the same way we can treat more general correlation
functions.
C. Stationarity
We say that a probability distribution ρ on K is stationary when

ρP = ρ

or ρ is a left-eigenvector with eigenvalue 1 for P. That means that ρ solves the stationary
Master equation:

ρ(x) = Σ_{a∈K} ρ(a) p(a, x),  or equivalently,  Σ_{a∈K} [ρ(a) p(a, x) − ρ(x) p(x, a)] = 0   (V.6)
If we start from ρ at time zero, we get ρ at time one, and we get ρ at all times, ρn = ρ.
That also makes the time-correlations stationary, i.e., by inspecting (V.5),

⟨g(x_n) f(x_u)⟩_ρ = ρ f P^{n−u} g   (V.7)

depends only on the time-difference n − u.
A special case of stationarity is equilibrium. We get it when each term separately in the
second formula of (V.6) gives zero:

ρ(x) p(x, a) = ρ(a) p(a, x)  for all x, a ∈ K

That is very special, and is a strong requirement. In fact, it implies that you can reverse
the order of time in (V.7), as then

⟨g(x_n) f(x_u)⟩_ρ = ⟨f(x_n) g(x_u)⟩_ρ

Of course, to check whether that holds does not truly require knowing the stationary
distribution ρ; here is how that goes. We say that the Markov chain satisfies the condition
of detailed balance when there is a function V so that

p(x, y) e^{−V(x)} = p(y, x) e^{−V(y)}  for all x, y ∈ K   (V.9)

in which case the stationary distribution is

ρ(x) = (1/Z) e^{−V(x)},  x ∈ K
D. Time-reversal
Suppose that ρ is a stationary distribution for the given transition probability matrix
P . Clearly then, it does not matter when we start the evolution, be it at time zero or at
any other time n = −T . In fact, we can now speak about the stationary Markov chain
defined on (doubly-)infinite trajectories ω = (x_n, n ∈ Z). Any piece of such a trajectory has
probability

Prob_ρ[x_{n_1} = a_1, x_{n_2} = a_2, . . . , x_{n_k} = a_k] = ρ(a_1) P^{n_2−n_1}(a_1, a_2) ⋯ P^{n_k−n_{k−1}}(a_{k−1}, a_k)   (V.10)

for times n_1 < n_2 < . . . < n_k. In particular,

Prob[x_n = a] = ρ(a),  a ∈ K, n ∈ Z
Let us then look at its time-reversal, the stochastic process (yn , n ∈ Z) defined from
yn = x−n
It simply reverses the original trajectories. We could write down its probabilities
in general, just like in (V.10), but let us concentrate on two consecutive times. Say we ask
how plausible it is to see in the time-reversed process the transition a → b once we are in
state a:

Prob_ρ[y_n = b | y_{n−1} = a] = Prob_ρ[x_{−n} = b | x_{−n+1} = a] = ρ(b) p(b, a) / ρ(a)

In particular, under detailed balance (V.9) that equals p(a, b): the stationary process then
looks the same whether run forward or backward in time.
E. Relaxation
At large times n we could hope that the distribution µ_n settles down to something constant. In a
way, for very large n, there should be little difference between P^n and P^{n+k} for any given
k. These things can be made precise. (We take here K finite.)
There is always a stationary distribution; since the column-vector of ones is an eigenvector
with eigenvalue 1 for P , then P must have a row-eigenvector with eigenvalue 1.
We say that the Markov chain is probabilistically ergodic when there is a probability
distribution ρ such that for all initial distributions µ = µ_0 the limit

lim_{n↑+∞} µ_n = ρ

gives that ρ. Of course such ρ is unique and is stationary. We could also write, equivalently,
that

P^n f −→ ρ(f) = ⟨f⟩_ρ
as time n moves on, for all observables f . From here we see most clearly that this must
be a property of the matrix P , and that linear algebra must be able to tell. That is right,
and the theorem in algebra is that of Perron-Frobenius. We just describe the result here:
the Markov chain is always probabilistically ergodic when the matrix P is irreducible
and aperiodic. Irreducible means that all states are eventually connected; you can reach
all states from wherever in a finite time with positive probability. Aperiodicity on the
other hand relates to the probability to return to the same state. For example, if we
take p = 1 in the Ehrenfest-model we must wait at least two steps before we get back
to the same state, and we cannot get back to the same state after an odd number of
moves. It means that the Ehrenfest model with p = 1 is not aperiodic. But in general
we do not worry too much about aperiodicity because when P is an irreducible stochastic
matrix, then for any 0 < p < 1, the matrix Q = p P + (1 − p) I is stochastic, irreducible
and aperiodic, and has the same stationary distribution as P . The matrix Q is a lazy
version of P in the sense that now for sure there is the possibility to remain in the same state.
The relaxation to stationarity is exponentially fast for irreducible and aperiodic Markov
chains. In other words, there is a typical time, the relaxation time, after which the initial
data are essentially forgotten and a stationary regime is established. The fact that this
relaxation is exponential is not so strange, because forgetting is multiplicative: at each step
some information is lost and that accumulates in a multiplicative way; you even forget what
you forgot.
For example, if we look back at time-correlations (V.7), we can say that for large enough
times n
⟨g(x_n) f(x_0)⟩_ρ ≃ ρ(g) ρ(f)
Another property of relaxation is that there is a function, called relative entropy,

s(µ_n | ρ) = Σ_x µ_n(x) log [ µ_n(x) / ρ(x) ] ≥ 0

which turns out to decay monotonically to zero as time n runs. Of course, if you do not know ρ
you do not know that relative entropy. Therefore, this monotonicity is most explicitly useful
when there is detailed balance (in which case physicists call it a sort of H-theorem).
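Both the exponential relaxation and the monotone decay of the relative entropy are easy to see numerically. A sketch for an arbitrary (illustrative) irreducible, aperiodic 3-state chain:

```python
# Relaxation of an irreducible, aperiodic 3-state chain: mu_n -> rho
# exponentially fast, and the relative entropy s(mu_n | rho) is monotone.
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# stationary rho: left eigenvector of P with eigenvalue 1, normalized
w, V = np.linalg.eig(P.T)
rho = np.real(V[:, np.argmin(np.abs(w - 1))])
rho /= rho.sum()

mu = np.array([1.0, 0.0, 0.0])      # start concentrated in one state
for n in range(12):
    # relative entropy, with the convention 0 log 0 = 0
    s = sum(m * np.log(m / r) for m, r in zip(mu, rho) if m > 0)
    print(n, s)                     # decreases monotonically to zero
    mu = mu @ P
```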
F. Random walks
Very interesting examples of Markov chains are random walks. There the state space
contains the possible positions on a lattice or graph, with the edges between them indicating
the possible moves. The transition matrix specifies the weights associated to each move.
As a standard example we can consider the (standard) random walk on the integers (one-dimensional
lattice). The state space is K = Z (and there can be no stationary probability
distribution). At time zero we start at some site x_0 = x and the position at time n ≥ 1 is

x_n = x_0 + v_1 + v_2 + . . . + v_n
where the vi are a collection of independent and identically distributed random variables,
say vi = 0 with probability 1 − p and vi = ±1 with probability p/2 (for some p ∈ (0, 1)).
From here we can calculate the mean position at time n (no net drift here) and its variance
(proportional to time n).
FIG. 5: A random walk in 2 dimensions, from close-by and from further away.
That gives a simple model of diffusion. Let us see if we can find the diffusion equation.
For this we look at the Master equation

µ_n(x) − µ_{n−1}(x) = (p/2) µ_{n−1}(x − 1) + (p/2) µ_{n−1}(x + 1) − p µ_{n−1}(x)

which can be rewritten, suggestively, via a discrete Laplacian (in its right-hand side)

µ_n(x) − µ_{n−1}(x) = (p/2) [ µ_{n−1}(x − 1) + µ_{n−1}(x + 1) − 2µ_{n−1}(x) ]
That indeed resembles a (discrete) diffusion equation with diffusion coefficient p/2. We
imagine that µ(x) then corresponds to a density of independent walkers.
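A quick simulation sketch of the diffusive behavior (illustrative values assumed): the mean displacement vanishes and the variance grows like 2(p/2)n = pn.

```python
# The walk x_n = v_1 + ... + v_n with v_i = +1 or -1 with probability p/2
# each and v_i = 0 otherwise: no drift, variance p*n (diffusion coeff. p/2).
import random

p, n, samples = 0.6, 500, 5000
mean, meansq = 0.0, 0.0
for _ in range(samples):
    x = 0
    for _ in range(n):
        u = random.random()
        if u < p / 2:
            x += 1
        elif u < p:
            x -= 1
    mean += x / samples
    meansq += x * x / samples
print(mean)             # close to 0
print(meansq, p * n)    # both close to 300
```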
The same set-up can be used in any dimension, on Zd for the d−dimensional regular
lattice. The behavior can however be drastically different depending on the dimension. In
one and two dimensions the probability of (ever) returning to the origin is one. That return
probability decreases as the number of dimensions increases: for d = 3 the probability
decreases to roughly 34 percent; in d = 8 that return probability is about 7 percent. As you
can perhaps imagine, that very different behavior of the standard random walk depending
on the dimension (recurrence versus transience) is at the origin of many physical effects. Or
better, it summarizes and stochastically interprets the behavior of the Green’s function of
the Laplacian, which is of course relevant in many physical contexts (electromagnetism
and gravity, Bose-Einstein condensation, ...).
It is often important to estimate the probability to ever land in a certain state. Such states
then have a special importance; for example, being in that state could stop the evolution.
We call such states absorbing. Consider thus a random walk on the semi-infinite lattice
K = {0, 1, 2, . . .} with transition probabilities p(x, x + 1) = px , p(x, x − 1) = qx , px + qx = 1
for x ≥ 1 and p(0, 0) = 1. The state zero is absorbing; the chain gets extinct when hitting
it. We want to calculate the hitting probability h_x, i.e., the extinction probability starting
from state x:

h_x := Prob[x_n = 0 for some n > 0 | x_0 = x]

These h_x satisfy h_0 = 1 and, for x ≥ 1, the recurrence h_x = p_x h_{x+1} + q_x h_{x−1}.
There could be more than one solution, but one can prove that the sought-for h_x is the
minimal non-negative solution. Consider now u_x := h_{x−1} − h_x, for which the recurrence
becomes p_x u_{x+1} = q_x u_x. Hence,
u_{x+1} = (q_x/p_x) u_x = [ (q_x q_{x−1} ⋯ q_1) / (p_x p_{x−1} ⋯ p_1) ] u_1

and, summing the u's,

h_x = 1 − u_1 (γ_0 + . . . + γ_{x−1})

for γ_x := (q_x q_{x−1} ⋯ q_1)/(p_x p_{x−1} ⋯ p_1), x ≥ 1, and γ_0 = 1. So we only need u_1.
Suppose now first that the infinite sum Σ_x γ_x = ∞. Since h_x ∈ [0, 1] we must then have
u_1 = 0, h_x = 1 for all x. On the other hand, if Σ_x γ_x < ∞, then we can take u_1 > 0 so
long as

1 − u_1 (γ_0 + . . . + γ_{x−1}) ≥ 0

The minimal non-negative solution occurs for u_1 = (Σ_x γ_x)^{−1} and then

h_x = Σ_{y=x}^{∞} γ_y / Σ_{y=0}^{∞} γ_y   (V.11)
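Formula (V.11) is straightforward to evaluate numerically by truncating the sums. A sketch for a birth-and-death chain with q_x/p_x = (x/(x+1))², so that γ_x = 1/(x+1)² (the chain of Exercise 13 below):

```python
# Hitting probabilities from formula (V.11), truncating the infinite sums;
# here gamma_x = 1/(x+1)^2, for which survival from x = 1 equals 6/pi^2.
from math import pi

M = 100_000                                       # truncation of the sums
gamma = [1.0 / (x + 1)**2 for x in range(M)]
total = sum(gamma)                                # approx. pi^2 / 6
h = [sum(gamma[x:]) / total for x in range(5)]    # h_0 = 1, then decreasing
print(h)
print(1 - h[1], 6 / pi**2)                        # survival probability from 1
```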
Let us check detailed balance when we would only play game B (at all times):
Consider the cycle 3 → 1 → 2 → 3. Its stationary probability (always for game B alone)
is Prob[3 → 1 → 2 → 3] = ρ(3) × 1/10 × 3/4 × 3/4 = 9ρ(3)/160. For the reversed cycle,
the probability Prob[3 → 2 → 1 → 3] = ρ(3) × 9/10 × 1/4 × 1/4 = 9ρ(3)/160 is the same.
The equilibrium distribution for game B is then found to be ρ(1) = 2/13, ρ(2) = 6/13 and
ρ(3) = 5/13. Obviously then, there is no current when playing game B, and clearly the
same is trivially verified for game A when tossing with the fair coin. Yet, and here is the
paradox, when playing game B periodically after game A, a current arises... (which you
would like to check).
in which case

b = (1/2) log [ cosh(J − a) / cosh(J + a) ],  Z = 2 cosh b / [ cosh(J + a) + cosh(J − a) ]

Clearly then, the probability of a trajectory (x_0, x_1, . . . , x_n) equals

[ µ(x_0)/Z^n ] exp{ J x_0 x_1 + J x_1 x_2 + . . . + J x_{n−1} x_n + (a + b)(x_1 + x_2 + . . . + x_n) + b (x_0 − x_n) }
which is up to boundary conditions the probability of a spin configuration in the one-
dimensional Ising model (in a magnetic field a + b) with lattice sites replacing discrete time.
Indeed, that Ising model is traditionally solved using the transfer matrix formalism, which
is equivalent to the formalism of Markov chains.
That there is no thermal phase transition in one dimension for systems with short range
interactions is the same statement as saying that Markov chains are probabilistically ergodic.
The mathematical ground is the Perron-Frobenius theorem.
J. Exercises
2. Consider a discrete time Markov chain on state space K = {+1, 0, −1}. The transition
matrix has p(x, x) = 0 and
3. Show that the Ehrenfest model satisfies detailed balance, and find the potential.
Show that all Markov chains with two states, |K| = 2, satisfy detailed balance, at least
when the p(x, y) > 0.
4. Consider a container with green and red balls, N in total. At discrete times, two balls
are drawn blindly and we look at their colors. They are then put back in the
container after we have changed their colors (green becomes red and red becomes green).
Model this with a Markov chain, and write down the transition matrix. What could be the
stationary distribution?
5. Show that for detailed balance (V.9) to hold, we must have that for any three states
x, y, z

p(x, y) p(y, z) p(z, x) = p(x, z) p(z, y) p(y, x)
or, the probability of any cycle/loop should not depend on the order/orientation in which
it is being traversed.
7. Consider the most general two-state Markov chain (discrete time) and compute the
n−th power P n of its transition matrix.
Discuss when and how it converges, as n ↑ +∞ via integers, to the stationary distribution.
and consider the initial distribution µ0 = (1/10, 2/5, 1/2). Find the probability law µ1 at
the next time. Find also the stationary distribution.
9. Consider the probability distribution ρ = (1/6, 1/2, 1/3) on K = {−1, 0, 1}. Find a
Markov chain on K which makes ρ stationary.
10. Show that the product of two stochastic matrices is again stochastic.
12. Find the n-step transition probabilities P^n(x, y) for the chain having transition matrix
P =
0 1/2 1/2
1/3 1/4 5/12
2/3 1/4 1/12
13. Consider the Markov chain on {0, 1, . . . } with transition probabilities given by

p(0, 1) = 1,  p(x, x + 1) + p(x, x − 1) = 1,  p(x, x + 1) = ((x + 1)/x)² p(x, x − 1),  x ≥ 1
Show that if initially x0 = 0 then the probability that xn ≥ 1 for all n ≥ 1 is 6/π 2 . (Hint:
use formula (V.11).)
Find the stationary distribution. (Note the first five Fibonacci numbers.)
15. Lady Ann possesses r umbrellas which she employs in going from home to office
and back. If she is at home (resp. office) at the beginning (resp. end) of a day and it is
raining, then she will take an umbrella with her to the office (resp. home), at least if there
is one to be taken. If it is not raining, then she will not take an umbrella. Assuming that,
independent of the past, it rains at the beginning (end) of a day with probability p, what
fraction of the time does Lady Ann arrive soaked at the office?
16. Markov found the following empirical rule for the transition matrix in the vowel-consonant
space in Pushkin's novel:
0.128 0.872
0.663 0.337
Show that that is consistent with the
vowel versus consonant frequency of (0.432, 0.568).
VI. MARKOV PROCESSES IN CONTINUOUS TIME
We can embed a discrete time process, such as the Markov chains above, in continuous
time. What then seems essential for discrete time processes is that the time step is always
the same and independent of the state the system is in. In other words, one would say
that the time between jumps is deterministic and fixed, no matter what the state. One should
add here, however, that the system can remain in the same state, i.e., the jump can be to
itself, in which case there is no real jump to another state. That implies that the state can remain
identical over several time-steps. How long the system remains in that same state depends
on the total probability to jump away from the state. Going to continuous time processes
means to randomize these true waiting times between true jumps. The jump alarm rings at
exponential times, after which the state changes.
A. Examples
Example 1: The Poisson Process. Events that occur independently with some average
rate are modeled with a Poisson process. This is a continuous-time Markov process with
state space K = {0, 1, 2, . . .}. The states count the number of arrivals, successes, occurrences
etc. If at any time we are in state x we can only jump to state x + 1; there is only one
exponential clock running with fixed rate ξ (the intensity). We start at time zero from
x0 = 0 and inspect the state at time t ≥ 0. The probability that xt = k is the probability
that the clock has rung exactly k times before time t. Not surprisingly, that is given by the
Poisson distribution

Prob[x_t = k] = e^{−ξt} (ξt)^k / k!
because the probability of a ring in dt is ξdt and we try that t/dt times, cf. the convergence
of the binomial to the Poisson distribution. The (waiting) times between the events/arrivals
have an exponential distribution, like the limit of a geometric distribution. In that way the
Poisson process captures complete randomness of arrivals, as in radioactive decay.
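A simulation sketch of the Poisson process: generate exponential waiting times with rate ξ and compare the empirical distribution of the number of rings before time t with the Poisson law (illustrative values of ξ and t assumed):

```python
# Simulating the Poisson process: independent exponential waiting times
# with rate xi; count the rings before time t.
import random
from math import exp, factorial

xi, t, samples = 2.0, 3.0, 50_000
counts = {}
for _ in range(samples):
    s, k = 0.0, 0
    while True:
        s += random.expovariate(xi)      # waiting time until the next ring
        if s > t:
            break
        k += 1
    counts[k] = counts.get(k, 0) + 1

for k in range(8):
    print(k, counts.get(k, 0) / samples, exp(-xi*t) * (xi*t)**k / factorial(k))
```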
Example 2: Two-state jumps. We have two possible states, K = {0, 1}. When the
state is zero, we wait an exponentially distributed time with rate ξ(0) > 0 after which the
jump 0 −→ 1 occurs. When the state is one, we wait an exponentially distributed time with
rate ξ(1) > 0 after which the jump 1 −→ 0 occurs. Trajectories are piecewise constant and
switch between 0 and 1. In the (long time) stationary regime, the relative frequency of one
over zero is ξ(0)/ξ(1).
We can obtain this continuous time jump process from a discrete time approximation. Con-
sider the Markov chain with transition matrix
P_δ =
1 − δξ(0)   δξ(0)
δξ(1)   1 − δξ(1)
Then (P_δ)^n → e^{tL} when nδ = t with n ↑ +∞, δ ↓ 0, for the generator
L =
−ξ(0)   ξ(0)
ξ(1)   −ξ(1)
Hence, the matrix elements e^{tL}(x, y) must give the transition probability to go from x to
y in time t. The rates of change are read from the off-diagonal elements of L.
B. Path-space distribution
A continuous time Markov process on state space K has piecewise constant trajectories
ω = (xt , t ≥ 0) which are specified by giving the sequence (x0 , x1 , x2 , . . .) of states together
with the jump times (t1 , t2 , . . .) at which the jumps, respectively, x0 → x1 , x1 → x2 , . . .
occur. We take the convention that x_t is right-continuous, so that x_{t_i+ε} = x_{t_i} for sufficiently
small ε > 0, while x_{t_{i+1}−ε} = x_{t_i} for all 0 < ε < t_{i+1} − t_i, no matter how small.
To give the probability distribution over these possible paths means to give the distribution
of waiting times ti+1 − ti when in state x, and to give the jump probabilities x → y when a
jump actually occurs. For the first (the waiting times) we take an exponential distribution
with rate ξ(x), i.e., when in state x at the jump time t_i, the time to the next jump is
exponentially distributed as

Prob[t_{i+1} − t_i > t] = e^{−ξ(x) t},  t ≥ 0

Secondly, at the jump time t_{i+1} the jump goes x → y with probability p(x, y), for which we
assume that p(x, x) = 0.
[Figure: a piecewise constant trajectory jumping at times t_1 < t_2 < . . ., with here x(0) = 0, x(t_1) = −1, x(t_2) = 0, x(t_3) = 1, x(t_5) = 1.]
The products

k(x, y) = ξ(x) p(x, y),  with then ξ(x) = Σ_y k(x, y)

are called the transition rates for the jumps x −→ y. We can thus say that a path
ω = (x_0, x_1, . . . , x_n) over the time interval [0, T] has probability density

k(x_0, x_1) k(x_1, x_2) ⋯ k(x_{n−1}, x_n) exp{ −∫_0^T ξ(x_s) ds }   (VI.1)
for jump times t1 < t2 < . . . < tn < T . These ξ’s are called escape rates, quite
appropriately.
The fact that the waiting times are exponentially distributed is equivalent with the
Markov property, as we can see from the following argument.
Call τ the waiting time between two jumps, say while the state is x. For s ≥ 0 the event
τ > s is equivalent to the event {x_u = x for 0 ≤ u ≤ s}. Similarly, for s, t ≥ 0 the event
τ > s + t is equivalent to the event {x_u = x for 0 ≤ u ≤ s + t}. Therefore, by the Markov
property,

Prob[τ > s + t | τ > s] = Prob[τ > t]

and the exponential distributions are the only continuous distributions with that memoryless
property (Exercise 9 of Section IV).
A continuous time Markov process is specified by giving the transition rates k(x, y) ≥ 0
between x, y ∈ K. They define the escape rates ξ(x) = Σ_y k(x, y). From these we make the
so called (backward) generator L, which is a matrix with elements

L(x, y) = k(x, y) for x ≠ y,  L(x, x) = −ξ(x),  i.e.,  Lf(x) = Σ_y k(x, y) [f(y) − f(x)]   (VI.2)

Look at the structure: Lf is like the change in the observable f, multiplied with the rate of
that change.
From there we make the semigroup S(t) = exp(tL), t ≥ 0. That semigroup takes over the
role of the transition matrix, in the sense that we have

⟨f(x_t) | x_0 = x⟩ = (S(t) f)(x)
Of course we can check that it corresponds to the limit of discrete time Markov chains
when we construct the lazy process Pδ = (1 − δ)I + δP and take (Pδ )n → S(t) for nδ = t
with n ↑ +∞, δ ↓ 0, so that the path space distribution of the previous section is entirely
compatible. But now we get many more analytic and algebraic tools, and there is no need
to refer to discrete time at all. The main reason is that the generator L has the structure
(VI.2) which makes S(t) a stochastic matrix for all times t. The distribution µ_t at time t
correspondingly evolves by the Master equation

(d/dt) µ_t(x) = Σ_y [ µ_t(y) k(y, x) − µ_t(x) k(x, y) ]   (VI.3)

Again, this has the form of a balance equation: the probability at x grows by jumps y −→ x
and decreases by the jumps x −→ y.
Obviously the Master equation (VI.3) is nothing more than writing out µ_t = µ S(t) in
differential form,

(d/dt) µ_t = µ_t L = µ S(t) L = µ L S(t)

We say that a probability law ρ is stationary when ρL = 0, or

Σ_y j_ρ(x, y) = Σ_y [ ρ(x) k(x, y) − ρ(y) k(y, x) ] = 0
We say the dynamics satisfies the condition of detailed balance when there is a potential
V such that
k(x, y) e−V (x) = k(y, x) e−V (y)
36
Then, for all triples x, y, z ∈ K, k(x, y)k(y, z)k(z, x) = k(x, z)k(z, y)k(y, x), and also, under
detailed balance,
ρ(x) = (1/Z) e^{−V(x)},  Z = Σ_x e^{−V(x)}
When asked to find the explicit time-evolution of a Markov jump process, we must solve
the Master equation (VI.3). That is a collection of linear coupled first order differential
equations for the µt (x). When there are few states, like |K| = 2 or |K| = 3, or when
there are special symmetries, we can solve it almost directly by also using the normalization
Σ_x µ_t(x) = 1. In the more general case we need to diagonalize the generator L, which is of
course again a matter of linear algebra. Here is the simplest example.
Let K = {−1, +1}, and transition rates k(−1, +1) = α, k(+1, −1) = β. Given the initial
state x0 = +1, find the probability for xt = +1, written P [xt = +1].
The Master equation gives

(d/dt) P[x_t = +1] = α (1 − P[x_t = +1]) − β P[x_t = +1] = α − (α + β) P[x_t = +1]

Solving this differential equation and using the initial condition P[x_0 = +1] = 1 we find

P[x_t = +1] = α/(α + β) + ( 1 − α/(α + β) ) e^{−(α+β)t}
Check that the initial condition has been implemented and that
lim_{t→∞} P[x_t = +1] = α/(α + β)
which gives the equilibrium distribution (solution of the stationary Master equation, with
detailed balance).
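The same answer can be produced by simulating trajectories with exponential waiting times. A sketch, with illustrative values of α, β and t assumed:

```python
# Simulation of the two-state jump process with rates alpha: -1 -> +1 and
# beta: +1 -> -1, compared with the exact P[x_t = +1 | x_0 = +1] found above.
import random
from math import exp

alpha, beta, t, samples = 1.0, 2.0, 0.7, 50_000
hits = 0
for _ in range(samples):
    x, s = +1, 0.0
    while True:
        s += random.expovariate(beta if x == +1 else alpha)  # escape rate
        if s > t:
            break
        x = -x                                               # the jump
    hits += (x == +1)

q = alpha / (alpha + beta)
print(hits / samples, q + (1 - q) * exp(-(alpha + beta) * t))
```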
F. Exercises
3. Consider the following continuous time Markov process. The state space
is K = {0, +2, −2} and the transition rates are k(0, +2) = exp[−b], k(0, −2) =
exp[−a], k(−2, +2) = k(+2, −2) = 0, k(+2, 0) = exp[b − h], k(−2, 0) = exp[a + h]
Determine the stationary distribution. That asks for the time-invariant state occupation.
Is there detailed balance (or, for what values of the parameters a, b, h)?
4. We consider the following continuous time Markov process (xt ). The state space is
K = {−1, 1}, say up and down. The transition rates are specified via parameters v, H > 0:
We choose the initial condition x0 = +1 after which the random trajectory (xt ) develops.
Compute the expectation value of e^{x_t},

⟨e^{x_t} | x_0 = 1⟩
5. Consider a continuous time Markov process with state space K = {1, 2, . . . , N } and
with transition rates
All other transition rates are zero. Determine the stationary distribution.
7. Imagine that L is the generator of a continuous time Markov process with a finite
state space.
a) Describe an explicit example for the case of three states where the stationary distribution
is not uniform and where there is no detailed balance. Give all details including the explicit
form of the transition rates and the stationary distribution.
b) Write ρ for the stationary distribution and suppose that ρ(x) 6= 0 for all states x. Show
that if the matrix H with elements
H_{xy} = √(ρ(x)) L_{xy} / √(ρ(y))
is symmetric, that then detailed balance holds. (The choice of the letter H points to a
symmetric Hamiltonian for quantum evolutions.)
8. Consider a network with four states (x, v) where x ∈ {0, 1}, v ∈ {−1, +1}. (Imagine x
to be a position and v like a velocity.) We define a Markov process in continuous time via
transition rates that depend on parameter b > 0,
k((1, +1), (1, −1)) = k((1, −1), (1, +1)) = k((0, +1), (0, −1)) = k((0, −1), (0, +1)) = 1
All other transitions are forbidden. Make a drawing. Determine the stationary distribution
on the four states as function of b. Is there detailed balance?
9. Show that a continuous time Markov process on a finite set K satisfies the condition
of detailed balance if and only if

k(x, y) = ψ(x, y) e^{[V(x) − V(y)]/2}

for some symmetric function ψ(x, y) = ψ(y, x) and some potential V.
10. Show that

L(f²) ≥ 2 f Lf

for the generator L of a Markov process. That property is sometimes called dissipative, in
contrast to a true derivation where the equality holds.
Show also that the differential operator Lf = a f′ + (b/2) f″ of a diffusion process (see
Section XI) has that same property whenever b ≥ 0.
11. Let (N(t), t ≥ 0) be a Poisson process with intensity λ, i.e.,

Pr[N(t) = j] = e^{−λt} (λt)^j / j!,  j = 0, 1, 2, . . .

and let T_n denote the time of the n-th arrival. Show that the inter-arrival times
x_n := T_n − T_{n−1} are independent random variables with exponential distribution with
parameter λ.
12. Calculate pt (x, y) (transition probability over time t) for the continuous time Markov
process on N + 1 states with the following diagram:
[Diagram: states 0, 1, 2, . . . , N − 1, N in a row, with rate λ for each transition x → x + 1.]
13. Consider the three-state process with transition rates defined in the following diagram:
14. Let λ, µ > 0 and consider the Markov process on {1, 2} with generator
L =
−µ   µ
λ   −λ
a) Compute the transition probabilities p_t(x, y) = e^{tL}(x, y).
b) Solve the equation ρL = 0, to find the stationary distribution. Verify that pt (x, y) → ρ(y)
as t ↑ +∞.
15. Consider a three-level atom, in which each level has (a different) energy Ex with
x = 1, 2, 3. Now suppose that this atom is in contact with a thermal reservoir at inverse
temperature β such that the system’s energy jumps between two neighbouring levels (1 ↔ 2
or 2 ↔ 3) at a rate k(x, y) = (Ex − Ey )4 exp[−β (Ey − Ex )/2], with |x − y| = 1.
a) Write down the generator, as it acts on the function/observable that expresses the pop-
ulation of the middle level.
b) Is there detailed balance? Find the stationary distribution.
We have seen in the previous sections how jump processes can stand quite naturally
as models for various phenomena. For example, descriptions in terms of random walks
are ubiquitous and appear as good and relevant models for a great variety of processes in
the human, natural and mathematical sciences. In other cases, we readily feel that this or that
model can be a useful description within chemical kinetics, or in population dynamics etc.
Physicists often want more, however, than effective modeling or well-appreciated descriptions.
In physics we also want to understand the physical origin and limitations of a model. We
want to evaluate the model also with respect to more fundamental laws or equations. It
is therefore interesting to spend some time on the question where jump processes could
physically originate. We discuss below two possible origins: 1) via Fermi's golden rule, the
so-called Van Hove weak coupling limit, and 2) within Kramers' theory of reaction rates.
There is a long and well-documented history of Brownian motion; see for example
http://www.physik.uni-augsburg.de/theo1/hanggi/Duplantier.pdf. While observed
by various people before (such as in 1785 by Jan Ingenhousz), the name refers to the
systematic studies of Robert Brown (1773-1858) who observed the irregular, quite unpre-
dictable, unhalted and universal motion of small particles suspended in fluids at rest. The
motion becomes more erratic as the particle gets smaller, as the temperature gets larger or
for lower viscosity. There is of course no vitalist source; the motion is finally caused by the
collisions of the particle with the fluid particles. What we observe is a reduced dynamics,
having no direct access to the mechanics of the fluid particles. In that sense the theory of
Brownian motion provides a microscope onto molecular motion. It was thus a convincing and
important ingredient in the understanding of the corpuscular nature of matter, say the
atomic hypothesis.
Let us consider a random walker on a line occupying sites 0, ±δ, ±2δ, . . .. After each time
τ the walker is displaced one step to the left or to the right with equal probability. We keep
of course in mind to model the motion of a Brownian particle that is being pushed around
at random. Then the probability to find the particle at position nδ at time kτ equals

P(nδ, kτ) = (1/2^k) · k! / [ ((k+n)/2)! ((k−n)/2)! ]

Taking n = x/δ, k = t/τ ↑ ∞ while fixing D = δ²/(2τ), we have the continuum limit

lim_{δ↓0} (1/δ) P(nδ, kτ) = ρ(x, t) = (1/√(4πDt)) e^{−x²/(4Dt)}

for the (well-normalized) probability density ρ(x, t), which indeed satisfies the diffusion
equation

∂ρ/∂t = D ∂²ρ/∂x²   (VIII.1)
There is of course another way to find back that solution. The point is that the random
walk itself satisfies the Master equation

P(nδ, (k + 1)τ) = (1/2) P((n − 1)δ, kτ) + (1/2) P((n + 1)δ, kτ)

or, again with x = nδ, t = kτ, D = δ²/(2τ),

[ P(x, t + τ) − P(x, t) ] / τ = (δ²/2τ) · [ P(x + δ, t) + P(x − δ, t) − 2P(x, t) ] / δ²

which tends to the diffusion equation in the limit δ ↓ 0 with D being kept fixed.
Finally we can look at the process in a more probabilistic way. After all, the position of
the walker is the random variable (starting from X_0 = 0)

X_{kτ} = δ (v_1 + v_2 + . . . + v_k)

where the v_i are a collection of independent random variables, each v_i = ±1 with equal
probability, and δ = √(2τD) = √(2Dt/k). We let k go to infinity and consider the limit of
the sum

√(2Dt/k) (v_1 + v_2 + . . . + v_k)
The central limit theorem is saying that it converges in distribution to a Gaussian random
variable with mean zero and variance 2Dt. In fact, more is true. Not only do we get
convergence to a Gaussian for fixed time t; also the process itself has a limit and becomes a
Gaussian process. That limiting process, rescaled limit of the standard random walk, is the
so called Wiener process, also simply called after the basic phenomenon it models, Brownian
motion. See more in Section XI A.
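A numerical sketch of that convergence (illustrative values of D, t and k assumed): sampling the rescaled sum reproduces the Gaussian variance 2Dt and the Gaussian probability of staying within one standard deviation.

```python
# The rescaled sum sqrt(2 D t / k) (v_1 + ... + v_k) for large k: sample
# variance close to 2 D t, and Gaussian fraction within one std. deviation.
import random
from math import sqrt

D, t, k, samples = 0.5, 1.0, 400, 10_000
scale = sqrt(2 * D * t / k)
meansq, inside = 0.0, 0
for _ in range(samples):
    x = scale * sum(random.choice((-1, 1)) for _ in range(k))
    meansq += x * x / samples
    inside += abs(x) <= sqrt(2 * D * t)
print(meansq, 2 * D * t)        # sample variance vs. 2 D t
print(inside / samples)         # approx. 0.68, as for a Gaussian
```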
B. Einstein relation
We can consider the particle density as n(x, t) = N ρ(x, t) for a total of N particles. The
number of particles is conserved and indeed the diffusion equation is a continuity equation
∂n(x, t)/∂t + ∂j_D(x, t)/∂x = 0,  j_D(x, t) := −D ∂n(x, t)/∂x
That diffusive current jD is generated by density gradients. Suppose now however that there
is also a constant external field, like gravity. We put the x−axis vertically upward. Using
Newton’s equation we would write ẋ = v and
m (d/dt) v(t) = −mg − mγ v(t)
with mγ the friction (or drag) coefficient (and γ is called the damping coefficient). The
particle current caused by the external field is jg (x, t) = n(x, t)v(t) so that under stationary
conditions, where dv/dt = 0, the contribution to the particle current from gravity equals
jg (x) = −gn(x)/γ.
Yet, under equilibrium, there is no net particle current and the diffusive current jD caused
by concentration gradients cancels the “gravity” current. For these currents jg and jD we
can use the equilibrium profile n(x). It is given by the Laplace barometric formula
n(x) = n(x_0) e^{−V(x−x_0)/(k_B T)}
where V is the external potential: V(x) = mgx for gravity. Whence, j_D(x) =
D mg n(x)/(k_B T), and the total current j = j_g + j_D vanishes in equilibrium when

j(x) = −g n(x)/γ + D mg n(x)/(k_B T) = 0

that is, when

D = k_B T / (mγ)

which is called the Einstein relation (Über die von der molekularkinetischen Theorie der
Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen, 1905).
The Stokes relation further specifies mγ = 6πηR in terms of the particle radius R (at least
for spherical particles) and the fluid viscosity η.
Imagine again our random walker but now with a small bias, i.e., an external field α,
that also depends on the position. The walker moves to the right with probability p(nδ) =
1/2 + α(nδ) δ and to the left with probability q(nδ) = 1/2 − α(nδ) δ. Then,
P (nδ, (k + 1)τ ) = p((n − 1)δ) P ((n − 1)δ, kτ ) + q((n + 1)δ) P ((n + 1)δ, kτ )
X. LANGEVIN DYNAMICS
If one considers a particle in a fluid as subject to friction mγ and random forces ξ(t), we
can consider the Newtonian-type equation
m (d/dt) v(t) = −mγ v(t) + ξ(t)   (X.1)
This notation should not be taken too seriously from the mathematical point of view.
Its importance is foremost to suggest the physics of the dynamics. In a way, we want to
describe a Markov process whose generator L = L1 + L2 is the sum of two contributions,
drift with generator L_1 f(v) = −γv f′(v) and diffusion with generator L_2 f(v) = C f″(v).
One question is how, more precisely, to think physically and mathematically about the
process ξ(t). Further technical issues will be how to calculate with it.
The general context is of course that of reduced dynamics. One has integrated out the
many fluid particles and one has assumed that their initial conditions were statistically
described. That can sometimes be done explicitly, but here we opt for more qualitative
considerations; see Exercise 1 below.
The first thing that in general looks reasonable, certainly for a fluid at rest, is that the
average hξ(t)i = 0. The average can first be understood as an arithmetic average over many
samples, particle collisions etc., but mathematically it refers to the law of the
stochastic process ξ(t). From this hypothesis, we already deduce from
(X.1) that
⟨v(t)⟩ = v_0 e^{−γt},  ⟨x(t)⟩_{v_0} = v_0 (1 − e^{−γt}) / γ
with of course the notation v(t = 0) = v0 and assuming hx(0)i = 0.
Now to the covariance ⟨ξ(t_1)ξ(t_2)⟩, for which we can first assume that it reflects stationary
correlations, only depending on the time-difference |t_2 − t_1|, as we assume the fluid is in a
stationary condition and unaltered by the presence of the few bigger particles that we study
here. In fact, a further simplification and physical hypothesis is that we model the force ξ(t)
as white noise. It means to suppose that

⟨ξ(t_1) ξ(t_2)⟩ = C δ(t_1 − t_2)

for some constant C. In other words, the ξ(t) in (X.1) varies much faster (on more mi-
croscopic time-scales) than the typical relaxation time γ −1 of the velocity of the suspended
particle. That extra hypothesis helps because we can now compute the velocity correlations
from

v(t) = v_0 e^{−γt} + (1/m) ∫_0^t ds e^{−γ(t−s)} ξ(s)
In particular, ⟨v²(t)⟩ → C/(2γm²) for large times, and equating that with the equipartition
value k_B T/m fixes C = 2γ m k_B T. We can think of it as setting a diffusion constant for
the velocity process. To conclude, and slightly renormalizing the notation from (X.1), we
have the Langevin equation

(d/dt) v(t) = −γ v(t) + ξ(t),  ⟨ξ(t)ξ(s)⟩ = (2γ k_B T / m) δ(t − s)
for the velocity thermalization of a particle in a heat bath (for all possible initial
v(t = 0) = v0 ).
There is of course also the position process, which is however completely enslaved by the
velocity process. For example
⟨x²(t)⟩ = ∫_0^t dt_1 ∫_0^t dt_2 ⟨v(t_1) v(t_2)⟩

which crosses over from ballistic behavior for small times to diffusive (linear in t) growth at
large times; the crossover is controlled by the dimensionless γt.
A final remark concerns the equilibrium dynamics, i.e., when v_0 is distributed with a
Maxwellian distribution, ⟨v_0²⟩ = k_B T/m. That makes for one extra average in the above
formulæ. For example, in equilibrium ⟨v(0) v(t)⟩ = (k_B T/m) e^{−γt}.

XI. DIFFUSION PROCESSES

Consider now a Markov process on the real line whose transition probability densities
p_t(x_0, x) have, as t ↓ 0, first and second moments

∫ dx p_t(x_0, x) (x − x_0) = a(x_0) t + o(t),  ∫ dx p_t(x_0, x) (x − x_0)² = b(x_0) t + o(t)

while the higher moments are o(t), as t ↓ 0. The notation o(t) means to indicate any term
A(t) for which A(t)/t → 0 as t → 0.
We will show that then necessarily p_t(x_0, x) satisfies the Fokker-Planck equation (where
Adriaan Fokker was a Dutch physicist, cousin of but not to be confused with the aeronautical
engineer Anthony Fokker)

∂p_t(x_0, x)/∂t = −(∂/∂x) [ a(x) p_t(x_0, x) ] + (1/2) (∂²/∂x²) [ b(x) p_t(x_0, x) ]   (XI.1)
In what follows, we will then simply write pt (x0 , x) = ρt (x) for the probability density at
time t, ignoring the specific initial condition.
The proof of (XI.1) goes as follows. As the process is Markovian, we must have at time t + δ the Chapman-Kolmogorov identity

p_{t+δ}(x0, x) = ∫_ℝ dy pt(x0, y) pδ(y, x)

By integrating the previous identity with a smooth function φ(x) and rearranging the integration (variables) we obtain

∫_ℝ dy p_{t+δ}(x0, y) φ(y) = ∫_ℝ dy ∫_ℝ dx pt(x0, y) pδ(y, x) φ(x)

Expanding φ(x) = φ(y) + (x − y) φ′(y) + (x − y)² φ″(y)/2 + . . . under the x-integral, using the moment conditions above for the small time δ, and finally integrating by parts in y, we arrive at (XI.1).
The Fokker-Planck equation (XI.1) is of course also a Master equation with current j(x, t),
∂ρt(x)/∂t + ∂j(x, t)/∂x = 0,   j(x, t) = a(x) ρt(x) − (1/2)(∂/∂x)[b(x) ρt(x)]    (XI.2)
We can also write it in terms of the forward generator L+ ,
∂ρt/∂t = L+ ρt
The backward generator (adjoint of L+ with respect to dx) is then
Lf(x) = a(x) f′(x) + (b(x)/2) f″(x)
working on functions (observables) f . There will be conditions on a(x), b(x) that assure that
pt (x0 , x) → ρ(x) (stationary density) as t ↑ +∞.
We have detailed balance and an equilibrium distribution ρ when the current in the stationary regime vanishes; that is, when

a(x) ρ(x) = (1/2)(d/dx)[b(x) ρ(x)]

Apparently, on ℝ that is always possible, as long as ρ is normalizable. For example, we can take F(x) = (2a(x) − b′(x))/b(x) and find the potential V such that −V′(x) = F(x). Then ρ(x) = (1/Z) exp[−V(x)] if normalizable: Z = ∫_ℝ exp[−V(x)] dx < ∞. That need however not be true in higher dimensions; detailed balance can easily be broken whenever non-conservative forces are present. More on that later.
A. Wiener process
The Wiener process is the Brownian motion process, called after and in honor of Norbert Wiener. It corresponds to the special case where a = 0, b = 2D = 2kB T/(mγ). It is a non-stationary, time-homogeneous Gaussian process with mean zero and covariance ⟨x(s) x(t)⟩ = 2D min{s, t}. Indeed, writing x(t) = ∫_0^t ds ξ(s) with white noise ξ(s) having covariance ⟨ξ(s)ξ(t)⟩ = Γ δ(t − s), then

⟨x(t1) x(t2)⟩ = Γ ∫_0^{t1} ds1 ∫_0^{t2} ds2 δ(s1 − s2) = Γ min{t1, t2}
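A quick Monte Carlo sketch of that covariance (all parameter values are arbitrary choices):

import numpy as np

Gamma, dt, nsteps, nsamples = 2.0, 1e-3, 2000, 20000
rng = np.random.default_rng(1)

# x(t) = integral of white noise: increments have variance Gamma*dt
x = np.cumsum(np.sqrt(Gamma * dt) * rng.standard_normal((nsamples, nsteps)), axis=1)
i1, i2 = 500, 1500                        # times t1 = 0.5, t2 = 1.5
cov = np.mean(x[:, i1 - 1] * x[:, i2 - 1])
print(cov, "expected:", Gamma * min(i1 * dt, i2 * dt))   # Gamma * min{t1, t2} = 1.0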
B. Ornstein-Uhlenbeck process
The Ornstein-Uhlenbeck process (named after Dutch theoretical physicists Leonard Ornstein and George Uhlenbeck) is the unique Markov process which is both stationary and Gaussian. Physically it is then better to have in mind the thermalization of velocities. We thus change (with respect to the previous sections) x → v to emphasize that the process deals with velocity instead of position. We choose a(v) = −γv and b(v) = 2kB T γ/m = 2γ²D.
The Fokker-Planck equation (XI.1) is then
∂ρt(v)/∂t = γ (∂/∂v)(v ρt(v)) + γ²D ∂²ρt(v)/∂v²
and the Markov process that corresponds to it has transition probability density
pt(v, w) = √( m / (2π kB T (1 − e^{−2γt})) ) exp[ −m (w − v e^{−γt})² / (2 kB T (1 − e^{−2γt})) ]
with stationary distribution ρ(v) = √(m/(2π kB T)) exp[−mv²/(2kB T)]. Therefore in the stationary regime ⟨v(t)⟩ = 0 and another calculation shows, for t ≥ t0,

⟨v(t0) v(t)⟩ = (kB T/m) e^{−γ(t−t0)}
We can also calculate ⟨v(t)⟩ and the time-correlations for fixed initial velocity v0 at an earlier time, to find that these coincide exactly with those of the Langevin dynamics. In other words, the Langevin dynamics (X.1)
is an Ornstein-Uhlenbeck process. After all, in the Langevin dynamics, the speed v(t) must
be a Gaussian process, and hence is completely fixed by its mean and covariances in time.
More generally, consider the diffusion process

dx(t)/dt = F(x(t)) + ξ(t),   ⟨ξ(t)ξ(s)⟩ = Γ δ(t − s)
That is clearly a time-homogeneous Markov diffusion process because the increments are
independent and given by the white noise. Its probability density will therefore satisfy a
Fokker-Planck equation (XI.1) with a(x) = F (x) and b(x) = Γ.
There are various natural ways to obtain such a diffusion process, and one important ap-
proximation is to start from the dynamics
dx(t)/dt = v(t),   dv(t)/dt = (1/m) F(x(t)) − γ v(t) + ξ(t)
and to suppose that we can neglect the acceleration with respect to the other terms. Such
overdamped approximation then gives an evolution equation for the position:
dx(t)/dt = (1/(mγ)) F(x(t)) + ξ(t),   ⟨ξ(s)ξ(t)⟩ = (Γ/γ²) δ(t − s)
which has a(x) = F (x)/(mγ) and b(x) = Γ/γ 2 = D, or with Fokker-Planck equation
∂ρt(x)/∂t = −(1/(mγ)) (∂/∂x)(F(x) ρt(x)) + (D/2) ∂²ρt(x)/∂x²
which is the Smoluchowski equation (IX.1). When F(x) = −V′(x), then the equilibrium distribution is
ρ(x) = (1/Z) exp[−βV(x)],   β = 2γ/(mΓ)
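As a sketch of a numerical check (assuming for illustration the double-well potential V(x) = x⁴/4 − x²/2; all parameter values are arbitrary choices), one integrates the overdamped equation and compares the empirical histogram with exp[−βV]; note that β = 2γ/(mΓ) = 2/(mγD) since Γ = γ²D:

import numpy as np

m, gamma, D = 1.0, 1.0, 0.5
beta = 2.0 / (m * gamma * D)
Vprime = lambda x: x**3 - x               # V(x) = x^4/4 - x^2/2
dt, nsteps = 1e-3, 500_000
rng = np.random.default_rng(2)
noise = rng.standard_normal(nsteps)
sd = np.sqrt(D * dt)                      # noise amplitude: <xi xi> = D delta(t-s)

x, samples = 0.0, []
for step in range(nsteps):
    x += -Vprime(x) / (m * gamma) * dt + sd * noise[step]
    if step % 50 == 0:
        samples.append(x)
samples = np.asarray(samples)

grid = np.linspace(-3, 3, 1201)
w = np.exp(-beta * (grid**4 / 4 - grid**2 / 2)); w /= w.sum()
print("empirical P(x > 0):", np.mean(samples > 0.0), "(should be near 1/2)")
print("empirical P(|x| > 0.5):", np.mean(np.abs(samples) > 0.5))
print("Boltzmann P(|x| > 0.5):", w[np.abs(grid) > 0.5].sum())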
The main logic and derivations do not change when considering multiple or higher-dimensional variables. Yet, the physics becomes richer. We now put x(t) = (x1(t), . . . , xn(t)) and suppose that, as t ↓ 0, the transition probability density satisfies, componentwise,

∫_{ℝⁿ} dx (x − x0)_i pt(x0, x) = a_i(x0) t + o(t)

∫_{ℝⁿ} dx (x − x0)_i (x − x0)_j pt(x0, x) = b_{ij}(x0) t + o(t)
for a ∈ Rn and b a real symmetric n × n matrix, and all higher moments vanishing more
rapidly than t as t ↓ 0.
The corresponding Fokker-Planck equation is then
∂ρt(x)/∂t = −Σ_{i=1}^n (∂/∂x_i)[a_i(x) ρt(x)] + (1/2) Σ_{i,j=1}^n (∂²/∂x_i ∂x_j)[b_{ij}(x) ρt(x)]
A basic example is Kramers’ equation where, in the simplest version, n = 2 and with
variables position and velocity:
dx(t)/dt = v(t),   dv(t)/dt = −γ v(t) + (1/m) F(x(t)) + √(2 kB T γ/m) ξ(t)
Correspondingly, a(x, v) = (v, −γv + F(x)/m) and b_{xx} = b_{xv} = b_{vx} = 0, b_{vv} = 2kB T γ/m.
That special case of the multi-dimensional Fokker-Planck equation gives

∂ρt(x, v)/∂t + v ∂ρt(x, v)/∂x + (F(x)/m) ∂ρt(x, v)/∂v = γ (∂/∂v)[ v ρt(x, v) + (kB T/m) ∂ρt(x, v)/∂v ]
which is called Kramers’ equation (after yet another Dutch theoretical physicist Hans
Kramers). It can be regarded as a special case of a kinetic equation (of which the Boltz-
mann equation is the most famous representative), but with the great simplification that its
right-hand side is linear in the density. That simplification is the result of considering only
single (or independent) particles interacting with the fluid at rest.
XIV. DIFFUSION ON THE CIRCLE

Consider the overdamped diffusion

dXt/dt = E − U′(Xt) + √(2/β) ξt

with standard white noise ξt. The dynamical variable Xt ∈ S¹, t ≥ 0, takes values on the circle of length 1.
The constant E represents the (nonequilibrium) driving. Note that on the circle, a constant
cannot be seen as the gradient of a potential function. The potential U is periodic and
assumed sufficiently smooth to assure the existence of a unique and stationary and smooth
probability density ρ with respect to dx. The β refers to the inverse temperature of the
medium in which the particle moves.
The Master equation is the usual Fokker-Planck equation (XI.2) for the time-dependent
probability density ρt :
dρt(x)/dt + j′t(x) = 0,   jt = (E − U′)ρt − (1/β) ρ′t    (XIV.2)
as started from some initial density ρ0. Stationarity implies that jt′ = 0, or the stationary
current j is a constant (in x). The current j can also be identified with the actual flow of
the particle around the circle.
The (backward) generator L of the process on smooth functions f is
(d/dt) ⟨f(Xt)⟩_{X0=x} = Lf(x) = (1/β) f″(x) + (E − U′(x)) f′(x)
Here are some standard possibilities:
• Uniform field. The case U = 0 is the simplest nonequilibrium case. The stationary
distribution is given by the uniform density ρ(x) = 1, but there is no detailed balance.
The stationary current equals j = E.
• General potential. For U ≠ 0 the stationary density solving j′ = 0 with periodic boundary conditions can be written as

ρ(x) = (1/Z) ∫_0^1 dy e^{βW(y,x)}    (XIV.3)

where

W(y, x) = ∫_y^x F(z) dz   for y ≤ x,
W(y, x) = ∫_y^1 F(z) dz + ∫_0^x F(z) dz   for y > x,

with F := E − U′, is the work performed by the applied forces along the positively oriented path y → x.
In this model the stationary current can be computed by dividing the stationary Master
equation by ρ and by integration over the circle:
j = W / ∮ ρ^{−1}(x) dx    (XIV.4)

where W = ∫_0^1 F dx is the work carried over a completed cycle. The non-zero value of
this stationary current indicates that time-reversibility is broken.
XV. PATH PROBABILITIES FOR DIFFUSIONS

As for jump processes, we can ask for the plausibility of paths or trajectories ω = (xs, 0 ≤ s ≤ t) in a time-interval [0, t]. The trajectories of diffusion processes are continuous but not differentiable, so that it becomes a more technical issue how to select the space of trajectories and how to write down the action. Here we will be rather formal, and start with the most singular object of all, which is the white noise ξ appearing in the Langevin and diffusion processes discussed before.
The useful input here is to remember the distribution of a multivariate normal variable; that
is a higher dimensional Gaussian random variable. Say for k components, η = (η1 , η2 , . . . , ηk )
is multivariate Gaussian (with mean zero) when its density on (Rk , dη1 . . . dηk ) equals
f(η1, η2, . . . , ηk) = (1/√((2π)^k det A)) exp[−(1/2)(η, A^{−1}η)]    (XV.1)
where det A is the determinant of the positive definite matrix A, and the scalar product in the exponential is (η, A^{−1}η) := Σ_{i,j} (A^{−1})_{ij} ηi ηj. It is not hard to show that the covariance then equals

⟨ηi ηj⟩ = A_{ij}
For a Gaussian stochastic process, the time-correlations are similarly specified using a kernel A(s, t). Indeed, a Gaussian process is but a stochastic process such that every finite collection of fixed-time realizations has a multivariate normal distribution. Thus, a Gaussian process is the infinite-dimensional realization of a multivariate Gaussian random variable. White noise is a degenerate version of this where
the covariance is just the unit operator. Formally then, the path probability density of white noise ξ must be proportional to

exp[−(1/2) ∫_0^t ds ξs²]    (XV.2)
where the sums in the scalar product in the exponential of (XV.1) have now been replaced
by an integral over time, and corresponding A(s, t) = δ(t − s).
Consider then the diffusion process

dx(t)/dt = F(x(t)) + √(2T) ξ(t)    (XV.3)

with standard white noise ξ. The idea is to substitute ξs = (ẋs − F(xs))/√(2T) and to insert the density for the white noise (XV.2), obtaining the path density

P(ω) ∝ exp[−(1/4T) ∫_0^t ds (ẋs − F(xs))²]
     = exp[−(1/4T) ∫_0^t ds ẋs²] · exp[(1/2T) ∫ F(xs) dxs − (1/4T) ∫_0^t ds F²(xs)]    (XV.4)
All kinds of things must be further specified here, mostly having to do with the stochastic
calculus of Section XVI, but the main usage of such formulæ is for obtaining the density of
one process in terms of another process. In a way, the first exponential factor in (XV.4) refers again to white noise and is associated to the Brownian process Bt for which Ḃt = √(2T) ξt.
We can thus compare the path density of the process xt with that of Bt and write their ratio; the resulting expressions are (XV.5) and (XV.6).

XVI. STOCHASTIC CALCULUS

The formulæ (XV.5) or (XV.6) contain stochastic integrals such as ∫_0^t F(xs) dxs. Their meaning and manipulation is the subject of stochastic calculus. Stochastic differential and integration calculus differs from the ordinary Newtonian calculus because of the diffusive nature of the motion in which the path xt does not allow a velocity; diffusive motion implies
(dxt )2 ∝ dt.
Let us first look back at what is, for physics, a natural point of departure: the stochastic differential equation (XV.3). We think of determining the next position x_{t+dt} from knowledge of xt, F(xt) and the random force ξt. Such an interpretation of (XV.3) is called reading the equation in the Itô sense. The resulting stochastic integral is then thought of as a limit where we discretize the time-interval [0, t] via many intermediate times 0 = t0 < t1 < . . . < tn = t,

∫_0^t F(xs) dxs = lim_{n↑∞} Σ_i F(x_{t_i}) (x_{t_{i+1}} − x_{t_i})    (XVI.1)
and where all the time-intervals t_{i+1} − t_i ≃ 1/n tend to zero. The limit has to be understood
in the sense of square integrable functions.
The Itô-integral is thus characterized by the fact that the function F in (XVI.1) is each time evaluated in the left point of the time-interval. There is another, more time-symmetric way of doing the integral, referred to as the Stratonovich sense and denoted by

∫_0^t F(xs) ∘ dxs = lim_{n↑∞} Σ_i F((x_{t_i} + x_{t_{i+1}})/2) (x_{t_{i+1}} − x_{t_i})
which evaluates the function in the middle of the time-interval. The Itô and Stratonovich integrals can be related via the expansion

Σ_i F((x_{t_i} + x_{t_{i+1}})/2) (x_{t_{i+1}} − x_{t_i}) = Σ_i {F(x_{t_i}) + F′(x_{t_i}) (x_{t_{i+1}} − x_{t_i})/2 + . . .} (x_{t_{i+1}} − x_{t_i})
which leads to

∫_0^t F(xs) ∘ dxs = ∫_0^t F(xs) dxs + T ∫_0^t F′(xs) ds    (XVI.2)
In other words,

df(xt) = f′(xt) dxt + T f″(xt) dt

which is called the Itô-lemma. In fact, if we insert here dxt = F(xt) dt + √(2T) dBt (which is (XV.3) with Bt Brownian motion), we get

df(xt) = {f′(xt) F(xt) + T f″(xt)} dt + √(2T) f′(xt) dBt = Lf(xt) dt + √(2T) f′(xt) dBt
where we recognize the backward generator L acting on f in the term with dt.
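The difference between the two readings is easily seen numerically. A minimal sketch for F(x) = x along a standard Brownian path (so that (dxt)² = dt, i.e. 2T = 1 here): the Itô sum tends to (Bt² − t)/2, the Stratonovich sum to Bt²/2, and their difference t/2 agrees with (XVI.2).

import numpy as np

dt, nsteps = 1e-4, 100_000                 # final time t = 10
rng = np.random.default_rng(3)
dB = np.sqrt(dt) * rng.standard_normal(nsteps)
B = np.concatenate(([0.0], np.cumsum(dB)))
t = nsteps * dt

ito = np.sum(B[:-1] * dB)                          # F evaluated at the left point
strat = np.sum(0.5 * (B[:-1] + B[1:]) * dB)        # F evaluated at the midpoint
print("Ito:", ito, " expected:", 0.5 * (B[-1]**2 - t))
print("Stratonovich:", strat, " expected:", 0.5 * B[-1]**2)
print("difference:", strat - ito, " expected t/2 =", t / 2)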
The action in the exponent of (XV.4) is thereby nicely split up in a time-antisymmetric term and a time-symmetric term for the time-reversal transformation. Physicists recognize the antisymmetric term

σ := (1/T) ∫ F(xs) ∘ dxs
as the time-integrated entropy flux, or the dissipated heat divided by the temperature, from the forcing F in (XV.4). If the force F is gradient, F = −U′ for system energy U, then σ = [U(x0) − U(xt)]/T equals the energy change of the environment over its temperature. If however F contains non-gradient parts, like for the dynamics on the circle in Section XIV, σ is the time-extensive Joule heating over the bath temperature.
XVII. INTERACTING DIFFUSIONS

So far we have mostly been looking at single (or independent) particles diffusing in some
environment. We can however also build interacting Brownian motions or even interacting
stochastic fields. In this way we consider models that are spatially extensive, and the basic
variables are fields (to be completed...)
XVIII. EXERCISES
1. Try to follow and to add the details to the following derivation of a Langevin-type dy-
namics starting from a Newton dynamics. We want the effective or reduced dynamics
for a bigger particle in a sea (heat bath) of smaller particles.
The bigger particle (e.g. a colloid) is described by a coordinate q and its conjugate
momentum p. The heat bath consists of harmonic oscillators described by a set of co-
ordinates {qj } and their conjugate momenta {pj }. For simplicity, all oscillator masses
are set equal to 1. The system Hamiltonian is
Hs = p²/(2m) + U(q)
and the heat bath Hamiltonian HB includes harmonic oscillator Hamiltonians for each
oscillator and a very special coupling to the system,
HB = Σ_j [ p_j²/2 + (ω_j²/2) (q_j − (γ_j/ω_j²) q)² ]
in which ωj is the frequency of the jth oscillator and γj measures the strength of
coupling of the system to the jth oscillator. HB consists of three parts: the first is
just the ordinary harmonic oscillator Hamiltonian, specified by its frequencies; the second contains a bilinear coupling to the system, Σ_j γ_j q_j q, specified by the coupling constants; and the third contains only q and could be regarded as part of the arbitrary U(q). The bilinear coupling is what here makes the derivation manageable. The
equations of motion for the combined Hamiltonian Hs + HB are

dq/dt = p/m,   dp/dt = −U′(q) + Σ_j γ_j (q_j − (γ_j/ω_j²) q)

dq_j/dt = p_j,   dp_j/dt = −ω_j² q_j + γ_j q
Suppose that the time dependence of the system coordinate q(t) is known. Then it is
easy to solve for the motion of the heat bath oscillators, in terms of their initial values
and the influence of q(t),
q_j(t) = q_j(0) cos ω_j t + p_j(0) (sin ω_j t)/ω_j + γ_j ∫_0^t ds (sin ω_j(t − s))/ω_j q(s)
When this is put back into the equation for dp/dt, we obtain (after an integration by parts)

dp/dt = −U′(q) − ∫_0^t ds K(s) p(t − s)/m + ξ(t)    (XVIII.1)

with memory function K(t) = Σ_j (γ_j²/ω_j²) cos ω_j t and noise ξ(t) = Σ_j γ_j [(q_j(0) − (γ_j/ω_j²) q(0)) cos ω_j t + (p_j(0)/ω_j) sin ω_j t].
By carefully choosing the spectrum {ωj } and coupling constants {γj }, the memory
function can be given any assigned form. For example, if the spectrum is continuous,
and the sum over j is replaced by an integral ∫ dω g(ω), where g(ω) is a density of
states, and if γ is a function of ω, then the memory function K(t) becomes a Fourier
integral,
K(t) = ∫ dω g(ω) (γ(ω)²/ω²) cos ωt
Further, if g(ω) is proportional to ω² and γ is a constant, then K(t) is proportional to δ(t) and the integral disappears from (XVIII.1).
The noise ξ(t) is defined in terms of the initial positions and momenta of the bath
oscillators and is therefore in principle a known function of time. However, if the bath
has a large number of independent degrees of freedom, then the noise is a sum contain-
ing a large number of independent terms, and because of the central limit theorem,
we can expect that its statistical properties are simple. Suppose, for example, that a
large number of computer simulations of this system are done. In each simulation, the
bath initial conditions are taken from a distribution,
⟨q_j(0) − (γ_j/ω_j²) q(0)⟩ = 0,   ⟨p_j(0)⟩ = 0
Since the noise is a linear combination of these quantities, its average value is zero.
The second moments are

⟨(q_j(0) − (γ_j/ω_j²) q(0))²⟩ = kB T/ω_j²,   ⟨p_j²(0)⟩ = kB T
There are no correlations between the initial values for different j’s. Then by direct
calculation, using trigonometric identities, one sees immediately that there is (what is
called) a fluctuation-dissipation relation,
⟨ξ(t)ξ(t′)⟩ = kB T K(t − t′)
Because the noise is a linear combination of quantities that have a Gaussian distri-
bution, the noise is itself a Gaussian random variable. If the heat bath has been
constructed so that the memory function is a delta function, then the noise is white.
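Here is a numerical sketch of that statement (the bath spectrum and coupling constants below are arbitrary choices): sample the bath initial data from the prescribed Gaussians, with q(0) = 0, and check that ⟨ξ(t)ξ(0)⟩ = kB T K(t) for K(t) = Σ_j (γ_j²/ω_j²) cos ω_j t.

import numpy as np

rng = np.random.default_rng(4)
kBT, nosc, nsamples = 1.0, 200, 20000
omega = rng.uniform(0.5, 5.0, nosc)       # hypothetical bath frequencies
gam = rng.uniform(0.1, 0.3, nosc)         # hypothetical coupling constants

# initial data (with q(0) = 0): (q_j(0) - gamma_j q(0)/omega_j^2) ~ N(0, kBT/omega_j^2), p_j(0) ~ N(0, kBT)
dq = np.sqrt(kBT) / omega * rng.standard_normal((nsamples, nosc))
p = np.sqrt(kBT) * rng.standard_normal((nsamples, nosc))

def xi(t):
    # xi(t) = sum_j gamma_j [ dq_j cos(omega_j t) + (p_j/omega_j) sin(omega_j t) ]
    return (gam * (dq * np.cos(omega * t) + p / omega * np.sin(omega * t))).sum(axis=1)

K = lambda t: np.sum(gam**2 / omega**2 * np.cos(omega * t))
xi0 = xi(0.0)
for t in (0.0, 0.5, 1.0):
    print(t, np.mean(xi(t) * xi0), "vs kBT*K(t) =", kBT * K(t))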
2. Fix parameters α > 0, B ∈ R, D > 0 and look at the linear Langevin dynamics for a
global order parameter M ∈ R,
dM(t)/dt = −α M(t) + h_t B + √(2D) ξ(t)
for standard white noise ξ(t). The ht , t > 0, is a small time-dependent field. We can
think of a Gaussian approximation to a relaxational dynamics of the scalar magneti-
zation M (no conservation laws and no spatial structure).
Show that the equilibrium (reversible stationary density on ℝ for perturbation B = 0) is

ρ(M) = (1/Z) exp[−α M²/(2D)]

with zero mean and variance ⟨M²⟩ = D/α.
3. Show also that for ht ≡ 0, the correlation function for 0 < s < t is

⟨M(s) M(t)⟩ = M0² e^{−α(t+s)} + (D/α) (e^{−α(t−s)} − e^{−α(t+s)})

and hence

(d/ds) ⟨M(s) M(t)⟩ = −α M0² e^{−α(t+s)} + D (e^{−α(t−s)} + e^{−α(t+s)})
4. Show next that when we average that last expression over the equilibrium density we find

(d/ds) ⟨M(s) M(t)⟩_ρ = D e^{−α(t−s)}
and compare with the response (XVIII.2).
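A small Monte Carlo sketch (arbitrary parameter values) checking the correlation formula of item 3:

import numpy as np

alpha, D, M0 = 1.0, 0.5, 2.0
dt, nsamples = 1e-3, 100_000
s_step, t_step = 500, 1500                 # s = 0.5, t = 1.5
rng = np.random.default_rng(5)

M = np.full(nsamples, M0)
Ms = None
for step in range(1, t_step + 1):
    # dM = -alpha M dt + sqrt(2D) dB
    M += -alpha * M * dt + np.sqrt(2 * D * dt) * rng.standard_normal(nsamples)
    if step == s_step:
        Ms = M.copy()

s, t = s_step * dt, t_step * dt
exact = (M0**2 * np.exp(-alpha * (t + s))
         + D / alpha * (np.exp(-alpha * (t - s)) - np.exp(-alpha * (t + s))))
print(np.mean(Ms * M), "expected:", exact)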
5. Solve the Langevin equation also in higher dimensions. Take for example the evolution equation for a charged particle in a magnetic field,

m dv(t)/dt = e v(t) × B − mγ v(t) + √(2mγ kB T) ξ(t)

where B is the magnetic field, e is the charge, γ is the friction, m is the mass and ξ(t) are three independent standard white noises. There is a stationary equilibrium. Compute the velocity correlation matrix ⟨v_i(t) v_j(0)⟩ for t ≥ 0 in equilibrium.
6. Take again the standard Langevin dynamics for a particle in a large box with friction coefficient γ at inverse temperature β. The stationary distribution for the velocities is Maxwellian, while the position diffuses uniformly over the box. Imagining a large time-interval t ↑ +∞ we define the diffusion constant as

D = lim_{t↑+∞} (1/t) ⟨(x_t − x_0)²⟩
where we average over the equilibrium trajectories. Let us now add a perturbation,
taking a constant but small external field E, switched on at time zero:
v̇_t = −γ v_t + E + √(2γ/β) ξ_t,   t > 0    (XVIII.3)
and we enquire about the mobility

χ ≡ (d/dE) ⟨v⟩ |_{E=0}

The average is over the new stationary distribution (with E) but we look at the linear order. Verify the relation

χ = β D/(2d)

with d the spatial dimension.
7. Consider then a time-dependent field E_t = ε e^{−iωt} in the Langevin dynamics and write ⟨v(t)⟩ = R(ω) ε e^{−iωt} for the stationary linear response. Show that

R(ω) = (1/m) · 1/(γ − iω)
Show that the imaginary part satisfies

ℑR(ω) = (βω/(2γ)) G(ω)
where G is the Fourier transform of the velocity autocorrelation function,

G(ω) = ∫_{−∞}^{+∞} dt e^{iωt} ⟨v(t) v(0)⟩_eq
8. Verify the claim (XIV.3), and the expression for the current (XIV.4). See also how it
reduces to the equilibrium form in case of detailed balance (E = 0).
Large Deviations
XIX. INTRODUCTION
“Many important phenomena, in physics and beyond, while they cannot be shown to hold
without exception, can be shown to hold with very rare exception, suitably understood.”
(S. Goldstein in Typicality and Notions of Probability in Physics, in: Probability in Physics,
edited by Y. Ben-Menahem and M. Hemmo, 59-71 (Springer, 2012)).
The theory of large deviations is a theory about fluctuations around some typical (lim-
iting) behavior. It depends on the physical context what that typical (limiting) behavior
refers to but invariably, it mathematically corresponds to some (generalized) law of large
numbers. Here are a number of possible examples:
1. Macroscopic averages: We start from identifying the degrees of freedom in the system,
like what are the basic dynamical variables in the given description. We are then
interested in macroscopic properties. These are given in terms of arithmetic means
of the individual contributions. These are typically constant as we vary over most of
the a priori possibilities. That is sometimes called macroscopic reproducibility. For
example, for a dilute gas we can ask for the fraction of particles having their position
and velocity around a given position and velocity (with some tolerance). The evolution
of that coarse-grained description is provided by the Boltzmann equation. When we
randomly throw the particles in a volume for a given fixed total energy, we expect to
find the homogeneous Maxwell distribution for a temperature that corresponds to that
energy. That is typical. Fluctuations refer to deviations from this law. Here is how
Boltzmann first wrote in 1896 about that case: One should not forget that the Maxwell
distribution is not a state in which each molecule has a definite position and velocity,
and which is thereby attained when the position and velocity of each molecule approach
these definite values asymptotically.... It is in no way a special singular distribution
which is to be contrasted to infinitely many more non-Maxwellian distributions; rather
it is characterized by the fact that by far the largest number of possible velocity distri-
butions have the characteristic properties of the Maxwell distribution, and compared
to these there are only a relatively small number of possible distributions that deviate
significantly from Maxwell's. (Ludwig Boltzmann, 1896)
2. Time-averages: We can also ask for time-averages, such as what is the fraction of time
that the system has spent in some part of its phase space. When the dynamics is
ergodic, that is given by a phase-space average in the limit for large times. Deviations
(for intermediate times) are possible and are referred to as dynamical fluctuations.
3. Deterministic limits: In some cases we start from a noisy or stochastic dynamics and
we are interested to take the zero noise limit as reference. Typical behavior would
then correspond to the zero noise dynamics and deviations from that trajectory are
possible but increasingly unlikely as the noise gets smaller. Noise can be thermal in
which case we are interested in ground state properties and low temperature behavior.
4. Classical limits: The previous case can be restated in the case of quantum noise.
When discussing quantum dynamics we have in mind that there are limiting cases of
observations or situations where the classical (Newtonian) dynamics should emerge.
We have in mind the case where the de Broglie wavelength gets very small with respect
to spatial variations. Informally speaking, for say large objects moving at high energies,
the typical behavior is classical. Around that situation we get quantum fluctuations
and the possibly multiple and looped Feynman paths start to contribute.
The above examples already indicate various good reasons for the study of large deviations. They describe the probability of risks, of rare events, but they also connect various levels of description. It is perhaps also not surprising that through them, we will obtain variational
principles characterizing the limiting typical behavior. Even when we never see the rare
events, their plausibility is governed by fluctuation functionals that do have a physical and
operational meaning. Their derivatives will for example matter in response theory around
that typical behavior. On a more analytic side the theory of large deviations connects with
asymptotic evaluation of integrals as in the Laplace formula and approximation.
We start explaining all that next with the simplest case where we consider independent random variables.

XX. INDEPENDENT RANDOM VARIABLES

Let X1, X2, . . . be independent and identically distributed random variables, with arithmetic mean mN := (X1 + . . . + XN)/N. Unless otherwise stated we will indicate the a priori distribution of the Xi by ρ and expectation values (averages) with ⟨·⟩. Limits typically refer to N ↑ +∞.
The strong law of large numbers asserts that mN → ⟨Xi⟩ almost surely if ⟨|Xi|⟩ < ∞. The "almost surely" makes it strong: the convergence (of the random mean) to the constant (expectation) is with probability one. It is however much easier to prove the (weaker) convergence in mean square, using the assumption of finite second moment ⟨Xi²⟩ < ∞: as N ↑ +∞,

⟨(mN − ⟨Xi⟩)²⟩ → 0
The first correction to the strong law of large numbers is given by the central limit
theorem. That governs the small fluctuations, as we will see below. The central limit
theorem speaks about convergence in distribution,
(1/√N) Σ_i (Xi − ⟨Xi⟩) → N(0, 1)

when Var Xi = 1. The notation N(0, 1) stands for a Gaussian random variable with mean zero and variance one (standard normal distribution). Note the rescaling, exactly right to keep the variance at one. We have blown up something that goes to zero (mN − ⟨Xi⟩) by multiplying with something that tends to infinity (√N).
Let us rewrite that statement in terms of deviations: for a ≤ b, as N ↑ +∞,

Prob[ a/√N < mN − ⟨Xi⟩ < b/√N ] → (1/√(2π)) ∫_a^b dx e^{−x²/2}

The left and right boundaries in the probability go to zero with N, and we are therefore looking at rather small fluctuations around zero (= law of large numbers).
C. Coin tossing
We start with an example of large deviations. Suppose that Xi = 0, 1 with equal probability (fair coin, ρ(0) = ρ(1) = 1/2), and that we inquire about the probability pN(a, b) := Prob[a < mN < b] for 0 ≤ a < b ≤ 1. We will show that

lim_{N↑+∞} (1/N) log pN(a, b) = −inf_{a<m<b} [log 2 − s(m)],   s(m) := −m log m − (1 − m) log(1 − m),   m ∈ [0, 1]

Writing pN(a, b) = 2^{−N} Σ_{aN<j<bN} N!/((N − j)! j!), we have for qN(a, b) := max_{aN<j<bN} N!/((N − j)! j!) the sandwich

2^{−N} qN(a, b) ≤ pN(a, b) ≤ (N + 1) 2^{−N} qN(a, b)

Therefore, lim (1/N) log pN(a, b) = −log 2 + lim (1/N) log qN(a, b) = −log 2 + sup_{a<m<b} s(m), where the last equality
employs Stirling’s approximation. In fact, the relation between entropy functions and bino-
mial coefficients comes from the simple fact that
n!/((n − k)! k!) ≃ √(n/(2πk(n − k))) exp{ n [−(k/n) log(k/n) − ((n − k)/n) log((n − k)/n)] }
In summary, Prob[mN ≃ m] ≃ e^{−N I(m)} with I(m) = log 2 − s(m), where we call I(m) the fluctuation functional. Note immediately that I(m) ≥ 0 with equality only in m = 1/2, and maxima I(m = 0, 1) = log 2. In that way we have estimated the corrections to the law of large numbers.
A next natural question is to ask about an unfair coin, that is where Xi = 0 with
probability 1 − p and Xi = 1 with probability p.
Then, the probability for k successes is obtained exactly in the same way as before, but with a different a priori weight: instead of the 2^{−N} for the fair coin, we now get p^{Nm} (1 − p)^{N(1−m)} and hence

Prob[mN ≃ m] ≃ e^{−N Ip(m)},   Ip(m) = −m log p − (1 − m) log(1 − p) − s(m)
                                      = m log(m/p) + (1 − m) log((1 − m)/(1 − p))    (XX.1)
Obviously, Ip (m) ≥ 0 with equality only if m = p.
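The asymptotics (XX.1) can be tested exactly with log-binomial weights; a sketch (the values of p and m are arbitrary choices):

import math

def log_prob(N, k, p):
    # log of Prob[k successes] = log C(N,k) + k log p + (N-k) log(1-p)
    return (math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
            + k * math.log(p) + (N - k) * math.log(1 - p))

def I(m, p):
    return m * math.log(m / p) + (1 - m) * math.log((1 - m) / (1 - p))

p, m = 0.3, 0.6
for N in (100, 1000, 10000):
    k = int(m * N)
    print(N, log_prob(N, k, p) / N, "vs -I_p(m) =", -I(m, p))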
D. Multinomial distribution
An immediate generalization is to the multinomial case. Here we take a finite state space
K so that the random variables take n = |K| possible values (instead of two). We take N
copies of the random variable (with distribution ρ) and we ask for the probability that a
number mx have the value x (i.e., m1 have a first value, m2 have a second value,... till mn
have the nth value, and of course N = m1 + m2 + . . . + mn .) We then again apply Stirling’s
formula in the form
lim_{N↑+∞} (1/N) log [ N!/(m1! . . . mn!) ] = −Σ_{i=1}^n p_i log p_i,   under m_i/N → p_i    (XX.2)
We can formulate the outcome of the N -sampling using the empirical distribution mN ; that
is the random probability distribution
m^N(x) := (1/N) Σ_{j=1}^N δ_{Xj,x},   x ∈ K
just the arithmetic mean. The probability to observe the proportions µ(x) when the a priori probabilities are ρ(x) is

Probρ[m^N ≃ µ] ≃ e^{−N Iρ(µ)},   Iρ(µ) = S(µ|ρ) := Σ_x µ(x) log (µ(x)/ρ(x))

the relative entropy between µ and ρ. That result is called Sanov's theorem. It is a computation that goes back to Boltzmann and follows as before for coin tossing, but now from the multinomial formula (XX.2) combined with the a priori probability ρ(x)^{Nµ(x)} to throw always x in Nµ(x) trials:

Iρ(µ) = −Σ_x µ(x) log ρ(x) + Σ_x µ(x) log µ(x)
That extends to the case of the empirical distribution of N identically distributed random variables on ℝ with common distribution ρ, summarizing

Probρ[ (1/N) Σ_j δ_{Xj,·} ≃ µ ] ≃ e^{−N S(µ|ρ)}
XXI. GENERATING FUNCTIONS

The computation of the large deviation rate function for quite general distributions will be based on the observation that then, for all functions F : K → ℝ,

(1/N) log ⟨e^{N F(mN)}⟩ ≃ (1/N) log Σ_m e^{N F(m)} e^{−N I(m)} ≃ sup_m {F(m) − I(m)}
where only the last identity has used the fact that we deal with independent random variables.
We have used the notation, for discrete random variables having exponential moments (so that ψ(θ) exists and is finite for all θ),

ψ(θ) := log ⟨e^{θ Xi}⟩ = log Σ_x e^{θx} ρ(x),   I(m) = sup_θ [θm − ψ(θ)]
The supremum is reached at θ* = θ*(m) such that ψ′(θ*) = m. Then, the tilted probability distribution

ρ*(x) := e^{θ* x − ψ(θ*)} ρ(x)

has mean Σ_x x ρ*(x) = ψ′(θ*) = m: we reach the magnetization m by adjusting the magnetic field θ. The function ψ(θ) is then the change in free energy under the application of a magnetic field. In the more mathematical language, ψ(θ) is called the log-generating function; its derivatives are given by connected correlation functions.
The statement above, for quite general random variables, is known as the Gärtner-Ellis
theorem: if the ψ(θ) exists and is finite and differentiable for all θ, then the fluctuation
functional I(m) can be found as its Legendre transform. In the case that the distribution ρ
does not allow exponential moments, we need to refer to other methods.
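As a sketch, the Legendre transform can be carried out numerically for a Bernoulli(p) variable, where ψ(θ) = log(1 − p + p e^θ); the result should reproduce (XX.1):

import math

p = 0.3
psi = lambda th: math.log(1 - p + p * math.exp(th))

def I_legendre(m, lo=-40.0, hi=40.0):
    # maximize the concave function theta*m - psi(theta) by ternary search
    for _ in range(200):
        t1, t2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if t1 * m - psi(t1) < t2 * m - psi(t2):
            lo = t1
        else:
            hi = t2
    th = 0.5 * (lo + hi)
    return th * m - psi(th)

I_direct = lambda m: m * math.log(m / p) + (1 - m) * math.log((1 - m) / (1 - p))
for m in (0.1, 0.3, 0.6, 0.9):
    print(m, I_legendre(m), I_direct(m))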
Let us combine again with the result that the relative entropy is a large deviation function.
Note that a log a ≥ a − 1 for a > 0, so that

Σ_x µ(x) log µ(x) ≥ Σ_x µ(x) log ρ(x)
(by choosing a = µ(x)/ρ(x)). That means that the relative entropy S(µ|ρ) is never negative.
Let us now choose ρ(x) = e^{−U(x)}/Z to be an equilibrium distribution for potential U: we note that

S(µ|ρ) = Σ_x µ(x) log (µ(x)/ρ(x)) = Σ_x U(x) µ(x) − s(µ) + log Z ≥ 0

with Shannon entropy s(µ) = −Σ_x µ(x) log µ(x). In other words, the relative entropy is then a difference of the free energy functional F(µ) := Σ_x U(x) µ(x) − s(µ), evaluated in µ and ρ respectively:

Probρ[ (1/N) Σ_j δ_{Xj,·} ≃ µ ] ≃ e^{−N [F(µ) − F(ρ)]}
XXII. SMALL FLUCTUATIONS

Suppose we in fact have a well-defined large deviation property for random variables X1, X2, . . .,

Prob[mN ≃ m] ≃ e^{−N I(m)}

Let us assume for simplicity that their mean is ⟨Xi⟩ = 0 so that I(0) = 0 and I′(0) = 0, with smooth large deviation functional I. We are now asked to evaluate the small fluctuations

Prob[mN ≃ x/√N] ≃ e^{−N I(x/√N)}

for which we expand around the mean zero, I(x/√N) = x² I″(0)/(2N) + o(1/N), or

Prob[√N mN ≃ x] ≃ e^{−I″(0) x²/2}

which is the central limit theorem if the variance of √N mN is indeed 1/I″(0). That is the subject of an exercise. In light of the previous section, I″(0) is also the second derivative of a free energy, which is in general related to an equilibrium response function or susceptibility.
We can ask for large deviations on several levels. One can ask for the fluctuations in arithmetic means like mN or also for

(1/N) (X1² + X2² + . . . + XN²)

etc., but we can also ask for large deviations for the empirical distribution m^N. One can however move between the different levels. We compare for example the results of Sections XX C and XX D: one can indeed verify the contraction

I(m) = inf { S(µ|ρ) : µ with Σ_x x µ(x) = m }
XXIV. ENTROPIES
The expression
S(µ) = −Σ_x µ(x) log µ(x)
is called the Shannon entropy of the probability distribution µ on K. The work of Claude Shannon is situated in the emerging information and communication science of the 1940s-1950s.
Imagine we have lost our keys. If there are four possible rooms where the keys can be found
and with equal chances, then we only need to ask two yes/no questions to find back our
keys (or at least to know the room where the keys are waiting for us). We first ask whether
the keys are (yes or no) in rooms 1 or 2. When the answer is “yes,” a second question “Are
they in room 1?” suffices, and when the answer is “no” the second question is “Are they
in room 3?" That number two can be written as 2 = log2 4 = −log2(1/4), and the information is expressed in terms of a binary "bit." Generalizing, it appears to make sense to express the
information content of m equally plausible possibilities as log2 m. That number is not an
integer in general, but we can still think of it as a measure of how many questions we need
to ask to locate our keys when lost in one of m possible rooms. Yet, sometimes we know
more about the possibilities and these are not equally plausible. Indeed, it can happen that
we estimate the chance to be 1/2 to find the keys in room 1 and to be each time 1/6 to find
the keys in room 2, 3 or 4. The minimal number of questions can then be expected to be the Shannon entropy in bits, −Σ_x µ(x) log2 µ(x) (to be completed...)
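For the example above, a one-line check (sketch):

import math

probs = [1/2, 1/6, 1/6, 1/6]
H = -sum(q * math.log2(q) for q in probs)
print("Shannon entropy:", H, "bits")   # about 1.79, versus log2(4) = 2 for equal chances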
For possible reference and further study we present here the mathematical definition of "satisfying a large deviation principle."
A sequence of probability distributions {PN } on a Polish space (complete separable metric
space) X satisfies a “large deviation principle” with rate N and rate function I(m) if
1. I(m) is lower semi-continuous, nonnegative and has compact level sets {m : I(m) ≤ `};
2. for all closed C ⊂ X,

lim sup_{N→∞} (1/N) log PN(C) ≤ −inf_{m∈C} I(m)

3. for all open G ⊂ X,

lim inf_{N→∞} (1/N) log PN(G) ≥ −inf_{m∈G} I(m)
Note that there is a general strategy for determining the rate function, as a generalization
of the “magnetic field” in Section XXI. The point is that we should see how to change the
probabilities so that the event A whose probability is sought becomes typical. If we find a
deformation that makes A typical, then the “price to pay” is the large deviation rate function
for A. That is also why large deviation functionals are in fact always relative entropies.
Consider time-averages

pT(x) = (1/T) ∫_0^T δ[Xt = x] dt

(the fraction of time spent in state x) for an irreducible Markov jump process on a finite set K. We are again interested in a large deviation, but now in time, with T growing to infinity. The law of large times says that the time-average pT(x) → ρ(x) converges to the unique stationary distribution. The large deviations are then written as

Prob[pT ≃ µ] ≃ e^{−T D(µ)}
where D(µ) is called the Donsker-Varadhan functional (1975) and will of course depend on
the transition intensities k(x, y) of the Markov process.
We can actually compute the functional D(µ). The point is that we can make any probability distribution µ > 0 stationary by modifying the transition rates into

kV(x, y) := k(x, y) e^{[V(x) − V(y)]/2}

for an appropriately chosen potential V = Vµ (an inverse problem).
Accepting that, the computation proceeds via the Girsanov formula for the path probabilities
(VI.1). Calling the original process P and the modified process PV we have
(dP/dPV)(ω) = [k(x0, x1)/kV(x0, x1)] · · · [k(x_{n−1}, xn)/kV(x_{n−1}, xn)] · exp{ ∫_0^T [ξV(xs) − ξ(xs)] ds }
with modified escape rate ξV(x) := Σ_y kV(x, y). For the ratios of the transition rates we simply get
simply get
k(x0 , x1 ) k(xn−1 , xn )
... = exp[V (xn ) − V (x0 )]
kV (x0 , x1 ) kV (xn−1 , xn )
which will not contribute extensively in the T −limit for large deviations. On the other hand,
∫_0^T [ξV(xs) − ξ(xs)] ds = T Σ_x pT(x) [ξV(x) − ξ(x)]
so that

D(µ) = Σ_x µ(x) [ξ(x) − ξV(x)] = Σ_{x,y} µ(x) k(x, y) [1 − e^{[Vµ(x) − Vµ(y)]/2}]
which is determined by the modification of expected escape rates. There is another, more mathematical way to write the same thing:

D(µ) = −inf_{g>0} Σ_x µ(x) (Lg)(x)/g(x)

where the infimum replaces solving the above-mentioned inverse problem (take g(x) = e^{−V(x)/2}). We do not explain that here further.
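For a two-state jump process, with rates k(1, 2) = a and k(2, 1) = b, that infimum can be evaluated numerically over the ratio g(2)/g(1) (which is all that matters); a sketch with arbitrary values:

import numpy as np

a, b = 2.0, 1.0                          # hypothetical jump rates k(1,2) = a, k(2,1) = b

def objective(r, mu):
    g = np.array([1.0, r])
    Lg = np.array([a * (g[1] - g[0]), b * (g[0] - g[1])])   # (Lg)(x) = sum_y k(x,y)[g(y) - g(x)]
    return np.dot(mu, Lg / g)

rs = np.exp(np.linspace(-5, 5, 20001))
mu = np.array([0.7, 0.3])
print("D(mu) =", -min(objective(r, mu) for r in rs))        # equals (sqrt(mu1*a) - sqrt(mu2*b))^2
rho = np.array([b, a]) / (a + b)         # stationary distribution
print("D(rho) =", -min(objective(r, rho) for r in rs))      # should vanish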
The situation gets more complicated (and more interesting) under nonequilibrium. There
the D(µ) becomes related to the entropy production rate, but we ignore further details.
XXIX. FREIDLIN-WENTZEL THEORY

Freidlin-Wentzel theory (1970) develops the theory of random perturbations around deterministic dynamics. Here is how some of it fits in the theory of large deviations. Consider the stochastic differential equation

xε(t) = x + ∫_0^t b(xε(s)) ds + √ε Bt

As ε ↓ 0 the solution concentrates on the deterministic flow ẋ(t) = b(x(t)), and we can ask for the probability that xε instead stays close to some other given trajectory.
The answer is again similar to the free energy method. Replace b(x) with a new c(t, x); the corresponding stochastic differential equation is

dx(t) = c(t, x(t)) dt + √ε dBt
We then proceed as in the previous section by computing the density on path space,
applying a Girsanov formula but now for diffusions as in Section XV.
XXX. QUESTIONS
1. Discuss large deviations for sums of independent and identically distributed random variables that (sometimes) fail to have exponential moments, such as for sums of exponentially distributed random variables.
2. Verify that I″(0) is related to the variance of the random variable as suggested in Section XXII.
4. Complete the calculation of Section XXIX for finding the large deviation rate function.
The material presented so far is so elementary that most textbooks go quite beyond it.
Books especially dedicated to the relation between stochastics and physics are also rare.
Here are some general texts but much material and lecture notes can also be found on the
web.
Kurt Jacobs, Stochastic Processes for Physicists: Understanding Noisy Systems, Cam-
bridge University Press (2010).
Nico G. Van Kampen, Stochastic Processes in Physics and Chemistry, North Holland;
3rd edition (2007).
Andrei N. Kolmogorov, Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer, Berlin (1933).
For the theory of large deviations, there are various references such as:
F. den Hollander, Large Deviations, Field Institute Monographs, Vol. 14 (Amer. Math.
Soc., 2000).