Script
Lecture Notes
TU Dresden, Faculty of Computer Science
Center for Systems Biology Dresden
Winter 2021/22
Contents

Foreword

1 Introduction
1.1 Elementary Probabilities
1.1.1 Events and Axioms
1.1.2 Combinatorics
1.1.3 Probability Spaces (Ω, P)
1.2 Conditional Probabilities
1.2.1 Definition and properties
1.2.2 Bayes’ Theorem
1.2.3 Law of Total Probabilities
1.2.4 Probability Expansion
1.3 Random Variables
1.3.1 Indicator/binary random variables
1.3.2 Discrete random variables
1.3.3 Continuous random variables
1.4 Probability Distributions
1.4.1 Discrete Distributions
1.4.2 Continuous Distributions
1.4.3 Joint and Marginal Distributions
1.4.4 Moments of Probability Distributions
1.5 Common Examples of Distributions
1.5.1 Common discrete distributions
1.5.2 Common continuous distributions
1.5.3 Scale-free distributions
5 Variance Reduction
5.1 Antithetic Variates
5.2 Rao-Blackwellization
7 Stochastic Optimization
7.1 Stochastic Exploration
7.2 Stochastic Descent
7.3 Random Pursuit
7.4 Simulated Annealing
7.5 Evolutionary Algorithms
7.5.1 ES with fixed mutation rates
7.5.2 ES with adaptive mutation rates
8 Random Walks
8.1 Characterization and Properties
8.1.1 Kolmogorov-forward Equation
8.1.2 State Equation
8.1.3 Mean and Variance
8.1.4 Restricted Random Walks
8.1.5 Relation to the Wiener process (continuous limit)
8.1.6 Random Walks in higher dimensions
9 Stochastic Calculus
9.1 Stochastic differential equations
9.1.1 Ito integrals
9.1.2 Transformation of Wiener processes
9.1.3 Mean and Variance of SDE’s
Foreword

These lecture notes were created for the course “Stochastic Modeling and Simulation”, taught as part of the mandatory electives module “CMS-COR-SAP” in the Masters Program “Computational Modeling and Simulation” at TU Dresden, Germany. The notes are based on handwritten notes by Prof. Sbalzarini and Dr. Zechner, which have been typeset in LaTeX by Fahad Fareed as part of a paid student teaching assistantship during his first term of studies in the Masters Program “Computational Modeling and Simulation” at TU Dresden, Germany, and subsequently extended and corrected by Prof. Sbalzarini and Dr. Zechner.
Chapter 1
Introduction
The goal of stochastic modeling is to predict the (time evolution of the) probability over system states:
P(\vec{X}, t \mid \vec{X}_0, t_0).   (1.1)
This is the probability that a system which was in state \vec{X}_0 at time t_0 is found in state \vec{X} at time t. Here, both \vec{X} and \vec{X}_0 are random variables. In a deterministic
model, it is possible to predict exactly which state a system is going to be found
in. In contrast, a stochastic model does not predict states, but only probabilities
of states. One hence never knows for sure which state the system is going to be
in, but one can compute the probability of it being in a certain state. For example,
in stochastic weather modeling, the model is not going to predict whether it is
going to rain or not, but rather the probability of rain (e.g., 60% chance of rain
tomorrow).
A stochastic simulation then generates specific realizations of state trajectories, which are simulated from the model in Eq. 1.1 using a suitable stochastic simulation algorithm, such that for each time point t:
\vec{x}_t \sim P(\vec{X}, t \mid \vec{X}_0, t_0),   (1.3)
i.e., the states are distributed according to the desired probability function.
While this sounds pretty straightforward conceptually, there are a number of
technical difficulties that frequently occur, among them:
• P (·) may not be known in closed form, but only through a governing
(partial) differential equation that is often not analytically solvable →
Master equations.
1. P (A) ≥ 0 for any event A ⊆ Ω,
2. P (Ω) = 1,
3. P (A ∪ B) = P (A) + P (B) if A ∩ B = ∅.
From these three axioms, the rest of probability theory can be derived. The first axiom simply states that probabilities are always non-negative. The second axiom states that the probability of the entire sample space is 1, i.e., the probability that any of the possible events happens is 1. The third axiom states
that the probability of either A or B happening, but not both, (i.e., an exclusive
OR) is the sum of the probability that A happens and the probability that B
happens. All three axioms perfectly align with our intuition of probability as
“chance”, and we have no difficulty accepting them.
A first thing one can derive from these axioms is the probability of logical
operations between events. For example, we find:
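For the three elementary operations NOT, OR, and AND, the standard identities read (stated here for concreteness):
P(\bar{A}) = 1 - P(A)   (NOT)
P(A \cup B) = P(A) + P(B) - P(A \cap B)   (OR)
P(A \cap B) = P(A \mid B)\, P(B)   (AND)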
From these three basic logical operations, all of Boolean logic can be constructed
for stochastic events.
Our intuitive notion of probability as “chance of an event happening” is quan-
tified by counting. Intuitively, we attribute a higher probability to an event
if it has happened more often in the past. We can thus state the frequentist
interpretation of probability:
P(A) = \frac{\#\,\text{ways/times } A \text{ happens}}{\text{total } \#\Omega}   (1.4)
as the fraction of the sample space that is covered by event A. Therefore, P (A)
can conceptually simply be determined by enumerating all events in the entire
sample space and counting what fraction of them belongs to A. For example,
the sample space of rolling a fair dice is Ω = {1, 2, 3, 4, 5, 6}. If we now define A as the event that an even number of eyes shows, then we have A = {2, 4, 6} and hence #A = 3 whereas #Ω = 6 and therefore, according to the above definition
of probability: P (A) = 3/6 = 1/2. Often, however, it is infeasible to explicitly
enumerate all possible events and count them because their number may be very
large. The field of combinatorics then provides some useful formulas to compute
the total number of events without having to explicitly list all of them.
1.1.2 Combinatorics
Combinatorics can be used to compute outcome numbers even when explicit
counting is not feasible. The basis of combinatorics is the multiplication prin-
ciple: if experiment 1 has m possible events and experiment 2 has n possible
events, then the total number of different events over both experiments is mn.
4 CHAPTER 1. INTRODUCTION
Example 1.1. Let experiment 1 be the rolling of a dice observing the number of
eyes shown. The m = 6 possible events of experiment 1 are: {1, 2, 3, 4, 5, 6}. Let
experiment 2 be the tossing of a coin. The n = 2 possible events of experiment
2 are: {head, tail}. Then, the combination of both experiments has mn = 12
possible outcomes: {1-head, 1-tail, 2-head, 2-tail, ..., 6-head, 6-tail}.
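This enumeration is easy to reproduce programmatically; a minimal Python sketch of the multiplication principle (the variable names are arbitrary):

from itertools import product

dice = [1, 2, 3, 4, 5, 6]          # m = 6 events of experiment 1
coin = ["head", "tail"]            # n = 2 events of experiment 2
combined = list(product(dice, coin))
print(len(combined))               # 12 = m*n: (1, 'head'), (1, 'tail'), ..., (6, 'tail')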
P_{n,r} = \frac{n!}{(n - r)!}   (1.5)
Example 1.2. The probability space for the experiment of rolling a fair dice and observing the number of eyes shown is: ({1, 2, 3, 4, 5, 6}, {1/6, 1/6, 1/6, 1/6, 1/6, 1/6}).
For independent events A and B, P (A ∩ B) = P (A)P (B), which is called the joint probability of A and B, i.e., the probability that both A and B happen.
Theorem 1.1 (Bayes). For any two random events A and B, there holds:
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.   (1.9)
This follows directly from the logical OR between events, Eq. 1.8, and the fact
that the Bi are mutually disjoint.
X:Ω→J (1.14)
In this case, it is not possible any more to directly map to events, because there
are infinitely many events/points in J, each with infinitesimal probability.
The function PX (x) is called the probability mass function (PMF) or the prob-
ability distribution function of the RV X. For discrete RV, we can also define
the Cumulative Distribution Function (CDF) FX (x) of the RV X, as:
F_X(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i).   (1.16)
Example 1.7. Consider the experiment of rolling a fair dice once, and define
the RV X: number of eyes shown. Then, the probability mass function is:
P_X(x) = \begin{cases} \frac{1}{6} & \text{if } x \in \{1, 2, 3, 4, 5, 6\} \\ 0 & \text{else} \end{cases}
F_X(x) = \frac{x}{6}, \qquad x \in \{1, 2, 3, 4, 5, 6\}.
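As a small illustration, the PMF and CDF of Example 1.7 can be tabulated directly (a minimal Python sketch):

pmf = {x: 1 / 6 for x in range(1, 7)}                                 # P_X(x) = 1/6 on {1,...,6}
cdf = {x: sum(pmf[i] for i in range(1, x + 1)) for x in range(1, 7)}  # F_X(x) = x/6
print(pmf[3], cdf[3])                                                 # 0.1666..., 0.5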
f(x) = \frac{dF(x)}{dx},   (1.17)
using the analogy between summation and integration in continuous spaces.
Note that this is not a probability. Rather, for any a, b ∈ J, we have:
P(a \le X \le b) = \int_a^b f(x)\, dx.   (1.18)
so the total probability is correct. We can also compute the CDF from the PDF
by inverting Eq. 1.17:
F_X(x) = \int_{-\infty}^{x} f(\tilde{x})\, d\tilde{x}.   (1.20)
Finally, since the CDF is monotonic, its slope is always non-negative, thus: f(x) ≥ 0.
f_{X,Y}(x, y) = \frac{d^2}{dx\, dy} F_{X,Y}(x, y) \quad \text{for continuous } X, Y.   (1.24)
• Marginal: The marginal distribution over one of the two RVs is obtained
from their joint distribution by summing or integrating over the other RV.
For the CDF, this again looks the same for discrete and continuous RVs:
Continuous PDF: f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy.

Continuous PDF: f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.
• M0 [X] = 1 because the total probability over the entire sample space is
always 1.
Example 1.8. Rolling a fair dice. X is the number of eyes shown. From Eq. 1.27, we find:
E[X] = \sum_{i=1}^{6} \frac{i}{6} = 3.5.
This is the expected value when rolling the dice many times. It is related
to the statistical mean of the observed values of X.
and therefore:
Var(X) = 15.167 − 3.5^2 = 2.9167.
where Cov(X, Y ) is the covariance of the two random variables (related to their
correlation).
E[X] = np (1.34)
Var(X) = np(1 − p). (1.35)
The binomial distribution with parameters (n, p), evaluated at k, gives the
probability of observing exactly k successes from n independent events,
each with probability p to succeed; P (X = k) = P (exactly k successes).
Example 1.10. Imagine an exam with 20 yes/no questions. What is the
probability of getting all answers right by random guessing?
X \sim \text{Bin}(20, \tfrac{1}{2}) \;\Rightarrow\; P(X = 20) = \binom{20}{20} \cdot 0.5^{20} \cdot 0.5^{0} = 1 \cdot \frac{1}{2^{20}} \cdot 1 = 9.537 \cdot 10^{-7}.
You would have to retake the exam 1,048,576 times until you could expect
a pass. Clearly not a viable strategy.
2. Poisson
P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!} \qquad k \in \{0, 1, 2, 3, \ldots\} = \mathbb{N}.   (1.36)
For X ∼ Poiss(λ):
E[X] = λ (1.37)
Var(X) = λ. (1.38)
In the Poisson distribution, the variance and the mean are identical. It
has only one parameter. The Poisson distribution gives the probability
that a random event is counted k times in a certain time period, if the
event’s rate of happening (i.e., the expected number of happenings in the
observation time period) is λ. It is therefore also called the “counting
distribution”.
Example 1.11. Ceramic tiles crack with rate λ = 2.4 during firing. What
is the probability a tile has no cracks?
X \sim \text{Poiss}(2.4) \;\Rightarrow\; P(X = 0) = \frac{e^{-2.4} \cdot 1}{1} = 0.0907.
So only about 9% of tiles would survive and a better production process
should be found that has a lower crack rate.
2. Exponential
For X ∼ Exp(λ):
3. Normal/Gaussian
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad x \in \mathbb{R}   (1.47)
F_X(x) = \Phi\!\left( \frac{x-\mu}{\sigma} \right) = \frac{1}{2}\left[ 1 + \operatorname{erf}\!\left( \frac{x-\mu}{\sigma\sqrt{2}} \right) \right].   (1.48)
The CDF of the Gaussian distribution, Φ, has no analytical form, but
can be computed numerically due to its relation with the error function
erf(·), which is a so-called “special function” for which good numerical
approximations exist. For X ∼ N (µ, σ 2 ):
E[X] = \mu   (1.49)
Var[X] = \sigma^2.   (1.50)
Example 1.14. IQs of people are distributed around µ = 100 with a stan-
dard deviation of σ = 15. What is the probability of having an IQ > 110?
All (and more than) you ever wanted to know about the topics covered in this
chapter can be found in the book: “Non-Uniform Random Variate Generation”
by Luc Devroye (Springer, 859 pages).
The most basic task in any stochastic simulation is the ability to simulate re-
alizations of a random variable from a given/desired probability distribution.
This is called random variate generation or simulation of random variables. On
electronic computers, which are deterministic machines, the simulated random
variates are not truly random, though. They just appear random (e.g., by sta-
tistical test) and follow the correct distribution, which is why we say that the
random variable is “simulated” and it is not the real random variable itself.
A notable exception is special hardware, such as crypto cards, that generates
true random numbers. This often involves a radioactive source, where the time
between decay events is truly random and exponentially distributed.
Y = g(X)
for a given transformation function g with dom(g) ⊇ supp(X), i.e., the domain of the function g has to include all values that the RV X can take. For a valid
transform, the inverse
g −1 (A) := {x : g(x) ∈ A} (2.1)
for any set A exists, but is not necessarily unique. The function g maps from
the space in which X takes values to the space in which Y takes values.
Where we have used the chain rule of differentiation in the last step.
For monotonically decreasing g, we analogously find:
This is easily understood by drawing the graphs of g for the two cases and
observing that the half-space Y ≤ y gets mapped to X ≤ g −1 (y) in one case,
and to X > g −1 (y) in the other.
Example 2.3. Let X ∼ U(0, 1). The PDF of this continuous random variable
is:
f_X(x) = \begin{cases} 1 & \text{if } x \in [0, 1] \\ 0 & \text{else.} \end{cases}
which is the same as the PDF of X. We hence find the important result that if X ∼ U(0, 1), then also Y = 1 − X ∼ U(0, 1), i.e., the probability of a uniformly distributed event not happening is also uniformly distributed.
The distribution with F_Y(y) = y is the uniform distribution over the interval [0, 1]. Therefore, random variables from a given cumulative distribution F_X can be simulated from uniform ones by X = F_X^{-1}(U(0, 1)) ∼ F_X(x). This endows uniform random numbers with special importance for stochastic simulations.
where F̂n (x) is the empirical CDF over n samples from the RNG and F (x) = x
is the uniform CDF we want to simulate. This requirement means that if we
generate infinitely many random numbers, then their CDF is identical to the
CDF we want to simulate.
which means that it computes the remainder of the division a z_{i-1} / m. By definition of the remainder, 0 ≤ z_i < m. Thus:
u_i = \frac{z_i}{m} \in [0, 1).   (2.11)
The start value (called “seed”) z0 and the two integers m > 0 and 0 < a < m
must be chosen by the user. There are also versions in the literature that
include an additional constant shift, providing one more degree of freedom, but
the general principle remains the same. It can be shown that for the linear congruential generator, the generated sequence is always periodic, with a cycle length of at most m that depends on the choice of a and m.
They are all simple recursion formulas operating on integers or bit strings, which
makes them very fast and easy to implement.
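As an illustration, a minimal linear congruential generator can be written in a few lines of Python; the parameters a = 16807 and m = 2^31 − 1 are the classical “minimal standard” choice (any admissible values would do):

def lcg(seed, a=16807, m=2**31 - 1):
    """Yield pseudo-random numbers u_i = z_i / m in [0, 1)."""
    z = seed
    while True:
        z = (a * z) % m        # z_i = a * z_{i-1} mod m
        yield z / m            # u_i in [0, 1)

gen = lcg(seed=42)
print([next(gen) for _ in range(3)])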
A particular problem occurs when using pseudo-RNGs in parallel computing.
There, obviously, one wants that each processor or core simulates a different
random number sequence. If they all compute the identical thing, the paral-
lelism is wasted. The simplest way to have different random number sequences
is to use a different seed on each processor or core. However, using different seeds does not change the cycle, nor its length, but simply starts the parallel sequences at different locations in the cycle. So when using P processors,
the effectively usable cycle length on each processor is T /P on average. Beyond
this, processors recompute results that another processor has already computed
before. Special Parallel Random Number Generators (PRNGs) therefore ex-
ist, which one should use in this case. They provide statistically independent
streams of random numbers that have full cycle length on each processor and
guarantee reproducible simulation results when re-running the simulation on
different numbers of processors. We do not go into detail on PRNGs here, but
refer to the literature and the corresponding software libraries available.
F_X(x) = 1 - e^{-\lambda x} = y   (2.16)
e^{-\lambda x} = 1 - y   (2.17)
-\lambda x = \log(1 - y)   (2.18)
x = -\frac{1}{\lambda} \log(1 - y) = F_X^{-1}(y).   (2.19)
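A minimal Python sketch of this inversion (λ = 2 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.random(100_000)            # u ~ U(0, 1)
x = -np.log(1.0 - u) / lam         # x = F_X^{-1}(u), Eq. (2.19)
print(x.mean())                    # close to 1/lam = 0.5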
sampling (x, u) in the bounding box of fX and only accept points for which
0 < u < fX (x) (circles in Fig. 2.1). The x-component of the accepted point is
then used as a pseudo-random number. Points above the graph of fX (crosses
in Fig. 2.1) are rejected and not used.
This is very easy to implement and always works. However, it may be inefficient
if the area under the curve of fX only fills a small fraction of the bounding
box, i.e., V ol({x, u}) ≫ 1, which for example is the case if fX is very peaked
or has long tails. In particular, this method becomes difficult for PDFs with
infinite support, such as the Gaussian, where one needs to truncate somewhere,
incurring an approximation error in addition to inefficiency.
Fortunately, we are not limited to sampling points in the bounding box of fX ,
but we can use any other probability density function g(x) for which fX (x) ≤
µg(x) for some µ > 0, i.e., the graph of µg(x) is always above the graph of fX
for some constant µ. Then, simulating X ∼ fX (x) is equivalent to simulating
pairs (y, u) such that:
y ∼ g(x), u ∼ U(0, 1) (2.24)
and accepting the sample x = y if u ≤ fX (x)/µg(x). Obviously, this requires one
to be able to efficiently generate random numbers from the proposal distribution
g(x), so typically one wants to use a g(x) for which an explicit inversion formula
exists, like an exponential or Gaussian distribution.
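A minimal Accept-Reject sketch, assuming as target the Beta(2,2) density f_X(x) = 6x(1 − x) on [0, 1] with uniform proposal g = U(0, 1) and µ = 1.5 (since max f = 1.5):

import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return 6.0 * x * (1.0 - x)     # example target PDF, bounded by 1.5

mu = 1.5                           # f(x) <= mu * g(x) with g(x) = 1 on [0, 1]

def accept_reject(n):
    samples = []
    while len(samples) < n:
        y = rng.random()                  # proposal y ~ g = U(0, 1)
        u = rng.random()                  # u ~ U(0, 1)
        if u <= f(y) / (mu * 1.0):        # accept with probability f(y) / (mu g(y))
            samples.append(y)
    return np.array(samples)

x = accept_reject(50_000)
print(x.mean())                    # Beta(2,2) has mean 0.5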
This is called dyadic binning because the bin limits grow as powers of two.
If one then performs Accept-Reject sampling independently in these bins (i.e.,
first choose a bin proportional to its total probability, then do accept-reject inside
that bin), the efficiency is better than when performing Accept-Reject sampling
directly on fX . This is obvious because the “empty” blocks that do not overlap
with the area under the curve of fX are never considered.
Chapter 3
Discrete-Time Markov
Chains
Markov Chains are one of the most central topics in stochastic modeling and
simulation. Both discrete-time and continuous-time variants exist. Since the
discrete-time case is easier to treat, that is what we are going to start with.
Markov Chains are a special case of the more general concept of stochastic
processes.
These are all discrete-time processes, some with discrete state space (e.g., dice)
and some with continuous state space (e.g., amount of rain).
The challenge in stochastic modeling is to find a stochastic process model {Xn :
n ≥ 0} that is complex enough to capture the phenomenon of interest, yet
simple enough to be efficiently computable and mathematically tractable.
Example 3.1. Consider the example of repeatedly tossing a fair coin and let
Xn be the outcome (head or tail) of nth toss. Then:
X_n \sim \text{Bin}\!\left(\tfrac{1}{2}\right) \;\Rightarrow\; P(X_n = \text{head}) = P(X_n = \text{tail}) = \tfrac{1}{2} \quad \forall n.
This means that the probability distribution of the next state only depends on
the current state, but not on the history of how the process arrived at the cur-
rent state. Markov Chains are the next-more-complex stochastic process after
i.i.d. processes. They implement a one-step dependence between subsequent
distributions. Despite their simplicity, Markov Chains are very powerful and
expressive, while still remaining mathematically tractable. This explains the
widespread use and central importance of Markov Chains in stochastic model-
ing and simulation.
Equation 3.1 is called the Markov Property. For discrete-state processes (i.e.,
S is discrete), the number 0 ≤ Pij ≤ 1 is the probability to move to state xj
whenever the current state is xi . It is called the one-step transition probability
of the Markov Chain. For discrete S with finite |S|, the matrix of all one-step
transition probabilities P = (Pij ), ∀(i, j) : xi , xj ∈ S, is the transition matrix of
the Markov Chain. It is a square matrix with non-negative entries.
We have \sum_{j \in S} P_{ij} = 1, since upon leaving state x_i, the chain must move to one
of the states xj (possibly the same state, xj = xi , which is the diagonal element
of the matrix). Therefore, each row i of P is a probability distribution over
states reachable from state xi .
Due to the Markov property, the future process Xn+1 , Xn+2 , . . . is independent
of the past process X0 , X1 , . . . , Xn−1 , given the present state Xn .
P^{(n)} = P^n = \underbrace{P \times P \times \ldots \times P}_{n \text{ times}}.   (3.4)
The fact that the n-step transition matrix is simply the nth power of the one-step
transition matrix follows from the Chapman-Kolmogorov equation:
For any n ≥ 0, m ≥ 0, xi , xj , xk ∈ S, we have:
P_{ij}^{n+m} = \sum_{k \in S} P_{ik}^n P_{kj}^m,   (3.5)
where:
P_{ik}^n = P(X_n = x_k \mid X_0 = x_i)
P_{kj}^m = P(X_{m+n} = x_j \mid X_n = x_k).
Because of the Markov property (the future is independent of the past) and
Eq. 1.10, it is easy to see that
P_{ik}^n P_{kj}^m = P(X_{m+n} = x_j \mid X_n = x_k)\, P(X_n = x_k \mid X_0 = x_i) = P(X_n = x_k, X_{m+n} = x_j \mid X_0 = x_i).
Summing over all k, i.e., all possible states xk the path from xi to xj could pass
through, i.e., marginalizing this joint probability over k, yields the formula in
Eq. 3.5. Since this is the formula for matrix multiplication, Eq. 3.4 is shown.
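For a finite state space, Eq. 3.4 can be checked numerically; a minimal sketch with an arbitrary two-state transition matrix:

import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])             # one-step transition matrix (rows sum to 1)
P3 = np.linalg.matrix_power(P, 3)      # three-step transition probabilities, Eq. (3.4)
print(P3)
print(P3.sum(axis=1))                  # each row is still a probability distribution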
If a closed set contains only one state, that state is called absorbing. After the
Markov Chain reaches an absorbing state, it will never leave it again (hence the
name).
Example 3.2. Consider the Markov Chain defined by the recursion f (x, u) =
xu, where the Un are uniformly distributed random numbers in the interval [0, 1]
and the continuous state space S = [0, 1]. That is, the next value (between 0 and
1) of the chain is given by the current value multiplied with a uniform random
number between (and including) 0 and 1. The value 0 is an absorbing state.
Once the chain reaches 0, it is never going to show any value other than 0 any
more. Even more, every lower sub-interval C = [0, ν] ⊆ S for all ν ∈ [0, 1] is a
closed set of states, because the state of the chain is never increasing.
Therefore, in an irreducible chain, every state can be reached from every other
state, eventually. There is no closed set from which the chain could not escape
any more. Every state is reachable, one just has to wait long enough (potentially
infinitely long).
Example 3.3. Clearly, the Markov Chain from Example 3.2 is not irreducible,
because it has infinitely many closed sets. However, if we consider i.i.d. random
variables Un ∼ U[ϵ, 1/x] for some arbitrarily small ϵ > 0, then the chain has
no absorbing state and no closed set any more, and it becomes irreducible. The
state space is then S = (0, 1].
A periodic Markov Chain revisits one or several states in regular time intervals
t. Therefore, if we find the chain in a periodic state xj at time t, we know that
it is going to be in the same state again at times 2t, 3t, 4t, . . .. A Markov chain
that is not periodic is called aperiodic.
An ergodic chain revisits any state with finite mean recurrence time. It is not
possible to predict when exactly the chain is going to revisit a given state, like
in the periodic case, but we know that it will in finite time. While an irreducible
Markov Chain is eventually going to revisit any state, the recurrence time may
be infinite, so irreducibility alone is not sufficient for ergodicity.
One of the interesting properties of Markov Chains for practical applications
is that they can have stationary distributions. This means that when running
While it might superficially seem that Xn depended on the entire history of the process, it is in fact a Markov Chain, since x_n = x_{n−1} + D_n depends only on the previous state x_{n−1} and the independent increment D_n, not on the earlier states. The recursion formula of this Markov Chain is f (x, d) = x + d,
which, according to Theorem 3.1, proves the Markov property.
If the Xn are scalar and the increments are either +1 or −1, we obtain a one-
dimensional discrete-state random walk with
Such a random walk is called simple. It models the random motion of an object
on a regular 1D lattice. For the special case that p = 1/2, the random walk becomes
symmetric, i.e., it has equal probability to go left or right in each time step.
In a simple random walk, the only states that can be reached from xi are xi+1
and xi−1 . Therefore, we have the one-step transition probabilities Pi,i+1 = p
and Pi,i−1 = 1 − p. Consequently, Pii = 0. This would form one row of the
transition matrix. However, we can only write down the matrix once we restrict
the chain to a bounded domain and hence a finite state space.
Chapter 4
Monte Carlo Methods
Example 4.1. How to compute π using random numbers? This historic exam-
ple goes back to Buffon’s “needle problem”, first posed in the 18th century by
Georges-Louis Leclerc, Comte de Buffon. It shows a very simple and intuitive
way to approximate π using MC simulation. We begin by first drawing a unit
square on a blackboard or a piece of paper. Moreover, we draw a quarter of
a unit circle into the square such that the top-left and bottom-right corners of
the square are connected by the circle boundary. We then generate N random points uniformly distributed across the unit square, i.e., \vec{x}^{(i)} = (u_1^{(i)}, u_2^{(i)}) with u_1^{(i)} ∼ U(0, 1) and u_2^{(i)} ∼ U(0, 1) for i = 1, . . . , N. You can achieve this, for
instance, by throwing pieces of chalk at the board, assuming that your shots are
uniformly distributed across the square. Subsequently, you count the points that fall inside the quarter circle.
i.e., the average of the RVs “converges” to the expectation µ of the individual
variables. This implies that if we have a large number of independent realizations
of the same probabilistic experiment, we can estimate the expected outcome of
this experiment by calculating the empirical average over these realizations.
This is the main working principle behind MC methods.
with ε > 0 an arbitrary positive constant. The weak law implies that the average
X̄N converges in probability / in distribution to the true mean µ, which means
that for any ε we can find a sufficiently high N such that the probability that
X̄N differs from µ by more than ε will be arbitrarily small.
The second version of the law of large numbers is called the strong law of large
numbers, which states that
P\left( \lim_{N \to \infty} \bar{X}_N = \mu \right) = 1,   (4.5)
meaning that X̄N converges to µ almost surely, that is, with probability one. The difference between the two laws is that in the weak law, we leave
open the possibility that X̄N deviates from µ by more than ε infinitely often
(although unlikely) along the path of N → ∞. So there is no sufficiently large
finite N such that we can guarantee that X̄N stays within the ε margin. The
strong law, by contrast, states that for a sufficiently large N , X̄N will remain
within the margin with probability one. Therefore, the strong law implies the
weak law, but not vice-versa. One is the probability of a limit, whereas the
other is a limit over probabilities. There are certain cases where the weak law holds, but the strong law does not. Moreover, there are cases where neither of them applies: if the samples are Cauchy-distributed, for instance, the mean µ does not exist and therefore we cannot use sample averages to determine the expected outcome.
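A quick numerical illustration of the law of large numbers (the exponential distribution with mean 0.5 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=0.5, size=1_000_000)
for n in (10, 1_000, 1_000_000):
    print(n, x[:n].mean())             # the sample average approaches mu = 0.5 as N grows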
with i the imaginary unit and t the real argument of this function. Making use
of the definition of the expectation,
\phi_X(t) = \int_{-\infty}^{\infty} e^{ixt} p(x)\, dx,   (4.7)
\phi_X(t) = \phi_X(0) + t \left. \frac{\partial}{\partial t} \phi_X(t) \right|_{t=0} + o(t),   (4.12)
where o(t) summarizes all higher-order terms (little-o notation). Applying the definition of the characteristic function, we obtain
\phi_X(0) = E[e^{i \cdot 0 \cdot X}] = 1   (4.13)
and
\left. \frac{\partial}{\partial t} \phi_X(t) \right|_{t=0} = \left. \frac{\partial}{\partial t} E[e^{itX}] \right|_{t=0} = E\left[ \left. \frac{\partial}{\partial t} e^{itX} \right|_{t=0} \right] = E\left[ iX e^{i0X} \right] = i E[X] = i\mu.   (4.14)
The term o(t/N ) tends to zero faster than the other two terms for large N since
it contains higher orders of t/N . We can therefore neglect it asymptotically.
Letting N go to infinity for the remaining expression, we obtain
\lim_{N \to \infty} \phi_{\bar{X}_N}(t) = \lim_{N \to \infty} \left( 1 + i \frac{t}{N} \mu \right)^N,   (4.17)
which is the definition of the exponential function eitµ . This, in turn, is the
characteristic function of a constant (deterministic) variable µ, which means
that the sample average converges in density to µ. Intuitively, this says that
the PDF of the sample average will be squeezed together as N → ∞ such that
all its probability mass concentrates at the value µ, resulting in a Dirac delta
distribution δ(x̄ − µ).
Remark: Strictly speaking, our proof has only shown convergence in density
(in characteristic functions), but not in probability as stated by the weak law.
However, it can be shown further that since µ is a constant, convergence in
density implies convergence in probability, which completes the proof. This,
however, is beyond the scope of this lecture.
with a and b the integration limits. In order to reformulate this integral in terms
of an expectation, we first multiply the integrand by 1 = (b − a)/(b − a), which
leaves the value of the integral unaffected, i.e.,
\int_a^b f(x)\, dx = \int_a^b \frac{b-a}{b-a} f(x)\, dx   (4.19)
= \int_a^b f(x)(b-a)\, p(x)\, dx,   (4.20)
where we recognize p(x) = \frac{1}{b-a}, x ∈ (a, b), as the PDF of a uniform continuous RV X ∼ U(a, b). We can therefore express the integral as an expectation
\int_a^b f(x)\, dx = (b - a)\, E[f(X)],   (4.21)
In the first step, we have used the linearity of the variance operator, and in the
second step we have used the fact that the RVs are i.i.d. This shows that the
MC variance scales with 1/N if independent samples are used. If the variance of
the transformed RV f (X) is hard to compute analytically, an empirical estimate
can be determined from the N samples, i.e.,
Var[f(X)] \approx \frac{1}{N-1} \sum_{i=1}^{N} \left( f(x_i) - \langle f(X) \rangle \right)^2,   (4.27)
with
\langle f(X) \rangle = \frac{1}{N} \sum_{i=1}^{N} f(x_i).   (4.28)
Remember that the prefactor in the empirical sample variance is 1/(N − 1) in order for the estimator to be unbiased (a single sample has no variance) — see statistics course.
F \approx V \frac{1}{N} \sum_{i=1}^{N} f(x_i) =: \theta_N,   (4.30)
Note that in the one-dimensional case, this volume reduces to the length of the
domain (b − a), as derived above. The corresponding MC estimator variance,
by analogy, is given by
Var[\theta_N] = \frac{V^2}{N} Var[f(X)].   (4.32)
Ω, i.e., S ⊇ Ω. We obtain:
\int_\Omega f(x)\, d^m x = \int_\Omega \frac{q(x)}{q(x)} f(x)\, d^m x   (4.34)
= \int_S \frac{f(x)}{q(x)}\, 1_{X \in \Omega}\, q(x)\, d^m x   (4.35)
= E\left[ \frac{f(X)}{q(X)}\, 1_{X \in \Omega} \right],   (4.36)
with X a continuous RV with PDF q(x), and 1X∈Ω the indicator function that
restricts the samples to the original domain of integration. Assuming that we
are given N i.i.d. realizations x1 , . . . , xN of X, we can compute an MC estimate
as
\int_\Omega f(x)\, d^m x \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)}\, 1_{x_i \in \Omega} =: \theta_N.   (4.37)
for some constant K that ensures that q integrates to one. Remember that f (x)
is not a PDF, but just a function to be integrated. For the MC variance, we
obtain from Eq. 4.38:
Var[\theta_N] = \frac{1}{N} Var\left[ \frac{f(x)}{K f(x)} \right] = \frac{1}{N} Var\left[ \frac{1}{K} \right] = 0,   (4.40)
because 1/K is a deterministic value. The indicator function was dropped since
f and q have the same support. Paradoxically, this result implies that a single
sample from this proposal would suffice to fully determine the true value of
the integral, for any integrand f (x), since the MC estimator variance is zero.
However, at a second look we realize that in order to evaluate the ratio f (x)/q(x)
in the MC estimator, we require knowledge of the constant K, since f (x)/q(x) =
1/K. This constant, however, is equal to the solution of the integral we are after,
since it must hold that
\frac{1}{K} = \int_\Omega f(x)\, d^m x   (4.41)
for q(x) to integrate to one. Because the constant already contains the result, one
“sample” of that constant would indeed suffice. Clearly, this choice of proposal
distribution is not realizable, but it still reveals important information about what suitable proposals should look like: a good q(x) should follow f (x) as closely as
possible, up to a constant scaling factor.
where Ω is the quarter circle and ⃗x = (u1 , u2 ) a point in the 2D plane. Note
that in this case, the function f is one everywhere. This can be rewritten as
\pi = 4 \int_0^1 \int_0^1 \frac{1}{q(u_1, u_2)}\, 1_{(u_1, u_2) \in \Omega}\, q(u_1, u_2)\, du_1\, du_2   (4.43)
= 4\, E\left[ \frac{1}{q(U_1, U_2)}\, 1_{(U_1, U_2) \in \Omega} \right].
\pi = 4\, E\left[ 1_{(U_1, U_2) \in \Omega} \right] \approx \frac{4}{N} \sum_{i=1}^{N} 1_{(u_1^{(i)}, u_2^{(i)}) \in \Omega} = 4\, \frac{\#\,\text{samples inside the circle}}{N},   (4.44)
which is the formula introduced at the beginning of this chapter.
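A minimal Python sketch of this estimator:

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
u1, u2 = rng.random(N), rng.random(N)      # uniform points in the unit square
inside = (u1**2 + u2**2) <= 1.0            # indicator: point lies in the quarter circle
print(4.0 * inside.mean())                 # Eq. (4.44), approximately 3.14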
Example 4.4. Let us consider a function f(x) = x^2 for x ∈ [0, 1], which we want to integrate using MC integration: \int_0^1 x^2\, dx. We consider the standard MC estimator
\theta_N^{(1)} = \frac{1}{N} \sum_{i=1}^{N} f(x_i) \qquad \text{with } x_i ∼ U(0, 1)   (4.45)
and the importance sampling estimator
\theta_N^{(2)} = \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)} \qquad \text{with } x_i ∼ q(x) = 2x.   (4.46)
Which of the two estimators is better? To answer this question, we calculate the variances of both estimators:
Var[\theta_N^{(1)}] = \frac{1}{N} Var[f(x)]   (4.47)
Var[f(x)] = Var[X^2] = \int_0^1 \left( x^4 - E[X^2]^2 \right) p(x)\, dx = E[X^4] - E[X^2]^2   (4.48)
E[X^2] = \int_0^1 x^2 p(x)\, dx = \int_0^1 x^2\, dx = \left[ \frac{x^3}{3} \right]_0^1 = \frac{1}{3}   (4.49)
E[X^4] = \int_0^1 x^4 p(x)\, dx = \int_0^1 x^4\, dx = \left[ \frac{x^5}{5} \right]_0^1 = \frac{1}{5}.   (4.50)
Since
\frac{1}{N}\, 0.0139 < \frac{1}{N}\, 0.0889   (4.57)
for any N > 0, \theta_N^{(2)} is a better estimator. This means that for the same sample size N, estimator 2 achieves a higher (about 6-fold) accuracy, in expectation over many realizations of the MC procedure.
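The two estimators can also be compared empirically; a minimal sketch (samples from q(x) = 2x are generated by inverse transform, x = √u):

import numpy as np

rng = np.random.default_rng(0)
N, reps = 1000, 2000
est1 = np.empty(reps)
est2 = np.empty(reps)
for r in range(reps):
    x1 = rng.random(N)                 # x ~ U(0, 1)
    est1[r] = np.mean(x1**2)           # theta_N^(1)
    x2 = np.sqrt(rng.random(N))        # x ~ q(x) = 2x via inverse CDF
    est2[r] = np.mean(x2 / 2.0)        # theta_N^(2): f(x)/q(x) = x^2/(2x) = x/2
print(N * est1.var(), N * est2.var())  # approximately 0.0889 vs. 0.0139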
Chapter 5
Variance Reduction
with Xi as i.i.d. RVs and f (x) as some function. We now consider the case
where the RVs Xi are not independent, in which case the covariance between
any Xi and Xj will be non-zero. In this case, it is straightforward to show that
the variance of an MC estimator is given by
Var\left\{ \frac{1}{N} \sum_{i=1}^{N} f(X_i) \right\} = \frac{1}{N^2} \left[ \sum_{i=1}^{N} Var\{f(X_i)\} + \sum_{i \neq j} Cov\{f(X_i), f(X_j)\} \right].   (5.2)
We see that if the covariance terms are positive, the MC variance will be larger
than in the i.i.d. case. However, if the covariances become negative, we can
achieve variance reduction. This is the key idea underlying a popular variance
reduction method called antithetic variates. While this approach is fairly gen-
eral, we will illustrate it here in the context of a simple example.
Our goal is to use Monte Carlo estimation to calculate the expectation E{f (X)}
with f (X) as some function and X as a uniform RV. To do so, we generate N
uniformly distributed random numbers
Xi ∼ U(0, 1) ∀i = 1, . . . , N. (5.3)
with X̄i = 1 − Xi . On the one hand, this means that we have doubled the
number of samples of our estimator. On the other hand, we have artificially
introduced negative correlations between pairs of samples since Xi and X̄i will
be anticorrelated. It can be shown that if the function f is monotonically
increasing or decreasing, this implies that also the correlations between f (Xi )
and f (X̄i ) will be negative. In particular, the variance of the antithetic MC
estimator becomes
Var\{\theta_N^A\} = Var\left\{ \frac{1}{2N} \sum_{i=1}^{2N} f(Z_i) \right\}   (5.6)
= \frac{1}{4N^2} \left[ \sum_{i=1}^{2N} \underbrace{Var\{f(Z_i)\}}_{\sigma_f^2} + 2 \sum_{i=1}^{N} \underbrace{Cov\{f(X_i), f(\bar{X}_i)\}}_{\gamma_f} \right]   (5.7)
= \frac{1}{2N} \left( \sigma_f^2 + \gamma_f \right).   (5.8)
Now, since γf is strictly negative for monotonic functions f we obtain variance
reduction when compared to an MC estimator that uses 2N independent sam-
ples. Another important advantage of antithetic variates is that we can double
the sample size N “for free”: we can use 2N samples, even though we had to
draw only N random numbers.
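A minimal sketch of antithetic variates for E{f(X)} with X ∼ U(0, 1), using the (arbitrary) monotonic example function f(x) = e^x:

import numpy as np

rng = np.random.default_rng(0)
f = np.exp                             # monotonic example function
N, reps = 1000, 2000
plain = np.empty(reps)
antit = np.empty(reps)
for r in range(reps):
    x = rng.random(2 * N)                                  # 2N independent samples
    plain[r] = f(x).mean()
    y = rng.random(N)                                      # N samples ...
    antit[r] = 0.5 * (f(y).mean() + f(1.0 - y).mean())     # ... plus their antithetic partners
print(plain.var(), antit.var())        # the antithetic estimator has the smaller variance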
5.2 Rao-Blackwellization
In this section we will discuss the concept of Rao-Blackwellization to reduce
the variance of MC estimators. This method is suited for Monte Carlo prob-
lems that depend on multiple random variables. While they apply to arbitrary
and
\mu_f = E\{f(X, Y)\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, p(x, y)\, dx\, dy.   (5.12)
Theorem 5.1. For a given sample size N , the Rao-Blackwellized estimator θ̂N
is guaranteed to achieve lower or equal variance than the standard Monte Carlo
estimator θN , i.e.,
V ar{θ̂N } ≤ V ar{θN } (5.17)
for any N .
Proof: A more formal proof of this theorem can be obtained by employing the
Rao-Blackwell theorem. However, a simple derivation of this result is possi-
ble using the properties of conditional expectations. We begin by rewriting the
variance \sigma_f^2 that appears in the MC variance of the standard estimator \theta_N as
Using the same idea, we can now replace E\{E\{f(X, Y) \mid Y\}^2\} by Var\{E\{f(X, Y) \mid Y\}\} + E\{E\{f(X, Y) \mid Y\}\}^2, which yields
\sigma_f^2 = E\{Var\{f(X, Y) \mid Y\}\} + Var\{E\{f(X, Y) \mid Y\}\} + \underbrace{E\{E\{f(X, Y) \mid Y\}\}^2}_{\mu_f^2} - \mu_f^2   (5.23)
= \underbrace{E\{Var\{f(X, Y) \mid Y\}\}}_{\ge 0} + \underbrace{Var\{E\{f(X, Y) \mid Y\}\}}_{\hat{\sigma}_f^2}.   (5.24)
Therefore, since the first term on the r.h.s. is non-negative, we conclude that \hat{\sigma}_f^2 \le \sigma_f^2 and correspondingly Var\{\hat{\theta}_N\} \le Var\{\theta_N\}.
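A minimal numerical illustration of the theorem, assuming the toy model X, Y independent standard normal and f(X, Y) = exp(X + Y), for which E{f(X, Y) | Y} = e^{1/2} e^Y is known in closed form:

import numpy as np

rng = np.random.default_rng(0)
N, reps = 1000, 2000
theta = np.empty(reps)        # standard MC estimator
theta_rb = np.empty(reps)     # Rao-Blackwellized estimator
for r in range(reps):
    x = rng.standard_normal(N)
    y = rng.standard_normal(N)
    theta[r] = np.mean(np.exp(x + y))
    theta_rb[r] = np.mean(np.exp(0.5) * np.exp(y))    # average E{f(X,Y) | Y} over Y only
print(theta.var(), theta_rb.var())                    # RB variance is smaller; the true mean is e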
Chapter 6
Markov Chain Monte Carlo
In the previous chapters on Monte Carlo methods we have so far considered cases
where the random numbers used for constructing the Monte Carlo estimators
are easy to generate. In particular, we have focused on one- or two-dimensional
problems that used “simple” RVs such as uniform random numbers. In many
practical scenarios, however, this may not be the case. In Chapter 2 we have
discussed several methods to generate more complex RVs but these methods
largely apply to low-dimensional problems. So how can we generate random
samples from higher-dimensional and possibly complex distributions? Markov
chain Monte Carlo (MCMC) methods provide a powerful framework to address
this problem. In this chapter we will discuss the core idea of MCMC and intro-
duce two of the most popular MCMC sampling algorithms, commonly known
as the Gibbs- and Metropolis-Hastings samplers.
where the sum goes over all states in S. The kernel P is then said to be
invariant with respect to the distribution Π. Intuitively, this means that if we
start at the stationary distribution, then applying the invariant kernel P to it
will leave it unaffected. The goal of MCMC is to find an invariant transition
kernel P which satisfies (6.2) for a given target distribution Π. If we then sim-
ulate the Markov chain and wait until it reaches stationarity, then the resulting
samples are distributed according to Π.
A condition that is related to (6.2) is called detailed balance, which states
for n = 1, 2, . . . nmax .
This shows that the Gibbs sampler has the correct target distribution Π(y, z).
with xk,n as the kth variable at iteration n. We remark that the order in which
the variables are resampled can be chosen. In the algorithm above, we con-
sider a fixed-sequence scan, which means that we resample the variables in a
round-robin fashion (1 → 2 → . . . → K → 1 → 2 . . .). An alternative strategy
is called random-sequence scan, where the update sequence is chosen randomly
(e.g., 5 → 2 → 9 . . .). Moreover, one can group several variables together and
update them jointly within a single step conditionally on all other variables
(e.g., Π(x1 , x2 | x3 ) → Π(x3 | x1 , x2 ) → . . .), which can improve the conver-
gence of the sampler. While all variants of the Gibbs sampler have the right
stationary distribution Π, they may differ in certain properties. For instance,
certain random-sequence scan algorithms exhibit the detailed balance property,
while this is generally not true for fixed-scan algorithms.
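A minimal Gibbs sampler sketch for an assumed two-dimensional target, a standard bivariate Gaussian with correlation ρ, for which both full conditionals are known Gaussians:

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                              # correlation of the assumed target Pi(y, z)
n_iter = 10_000
samples = np.empty((n_iter, 2))
y, z = 0.0, 0.0                        # arbitrary initial state
for n in range(n_iter):
    # fixed-sequence scan: resample each variable from its full conditional
    y = rng.normal(rho * z, np.sqrt(1.0 - rho**2))    # Pi(y | z)
    z = rng.normal(rho * y, np.sqrt(1.0 - rho**2))    # Pi(z | y)
    samples[n] = (y, z)
print(np.corrcoef(samples[1000:].T)[0, 1])            # close to rho after burn-in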
so why not directly use q(⃗y |⃗x) as a transition kernel for the chain? The reason
is that this would not satisfy detailed balance, which is a sufficient
condition for π being an equilibrium distribution. Detailed balance requires
that the probability of being in state ⃗x and going to ⃗y from there is the same
as the probability of doing the reverse transition if the current state is ⃗y . This
is true if and only if
π(⃗x)q(⃗y |⃗x) = π(⃗y )q(⃗x|⃗y ), (6.19)
which is not the case for arbitrary q(⃗y |⃗x). If the two sides are not the same in
general, then one must be bigger than the other. Consider for example the case
where
π(⃗x)q(⃗y |⃗x) > π(⃗y )q(⃗x|⃗y ),
i.e., moves ⃗x → ⃗y are more frequent than the reverse. The other case is anal-
ogous. The idea is then to adjust the transition probability q(⃗y |⃗x) with an
additional probability of move 0 < ρ ≤ 1 in order to reduce it to what it should
be according to detailed balance:
Algorithm 1 Metropolis-Hastings
1: procedure MetropolisHastings(⃗x0 ) ▷ start point ⃗x0
2: for k = 0, 1, 2, . . . do
3: Sample ⃗yk ∼ q(⃗y | ⃗xk )
4: Compute ρ = min{1, [f (⃗yk ) q(⃗xk | ⃗yk )] / [f (⃗xk ) q(⃗yk | ⃗xk )]}
5: With probability ρ, set ⃗xk+1 = ⃗yk ; else set ⃗xk+1 = ⃗xk .
6: end for
7: end procedure
In the opposite case, the reverse moves ⃗y → ⃗x are too frequent and hence the correction factor, when placed on ⃗x, would be larger than 1. In this case, the move is always accepted (and the reverse move reduced to satisfy detailed balance).
The Metropolis-Hastings algorithm has some similarity with the accept-reject
sampling method (see Chapter 2.4). Both depend only on ratios of probabilities.
Therefore, Algorithm 1 can also be used for random variate generation from π.
The important difference is that in Metropolis-Hastings sampling, it may be that
⃗xk+1 = ⃗xk , which has probability zero in a continuous Accept-Reject method.
This means that Metropolis-Hastings samples are correlated (as two subsequent
samples are identical with probability 1 − ρ and therefore perfectly correlated
in this case) and not i.i.d., as in accept-reject sampling.
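A minimal Python sketch of Algorithm 1 for an unnormalized one-dimensional target f (an arbitrary bimodal example) with a Gaussian random-walk proposal, which is symmetric so that the q-ratio cancels:

import numpy as np

rng = np.random.default_rng(0)

def f(x):                                      # unnormalized target density
    return np.exp(-0.5 * (x - 2.0)**2) + np.exp(-0.5 * (x + 2.0)**2)

n_iter, step = 50_000, 1.0
x = 0.0                                        # start point
chain = np.empty(n_iter)
for k in range(n_iter):
    y = x + step * rng.standard_normal()       # proposal y ~ q(y|x)
    rho = min(1.0, f(y) / f(x))                # acceptance probability (q-ratio = 1)
    if rng.random() < rho:
        x = y                                  # accept the move
    chain[k] = x                               # otherwise the old state is repeated
print(chain[5000:].mean())                     # approximately 0 for this symmetric target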
This means that expectations can be replaced by empirical means, which pro-
vides a powerful property in practical applications where expectations of un-
normalized distributions are to be estimated. The reason this works is because
the chain is ergodic. Indeed, for ergodic stochastic processes, ensemble averages
and time averages are interchangeable. However, it is important to only use the
samples of the Metropolis-Hastings chain after the algorithm has converged.
The first couple of samples generated must be ignored because they do not yet
come from the correct target distribution π. This immediately raises the question of how to detect whether the chain has converged, i.e., after how many iterations
one can start using the samples.
{xk }m , m = 1, . . . , M (6.27)
and only use the N samples k = l, . . . , N + l of each chain, i.e., discard the first
l − 1 samples.
Then, calculate:
• the mean of each chain \hat{\mu}_m = \frac{1}{N} \sum_{i=1}^{N} x_i^m,
• the empirical variance of each chain \hat{\sigma}_m^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i^m - \hat{\mu}_m)^2,
• the mean across all chains \hat{\mu} = \frac{1}{M} \sum_{m=1}^{M} \hat{\mu}_m,
• the variation of the means across chains B = \frac{N}{M-1} \sum_{m=1}^{M} (\hat{\mu}_m - \hat{\mu})^2, and
• the average chain variance W = \frac{1}{M} \sum_{m=1}^{M} \hat{\sigma}_m^2.
Then, compute the Gelman-Rubin test statistic, defined as:
V = \left( 1 - \frac{1}{N} \right) W + \frac{M+1}{MN} B.   (6.28)
The chain has converged if \hat{R} := \sqrt{V/W} \approx 1 (in practice |\hat{R} − 1| < 0.001). Choose the smallest possible l (i.e., length of the burn-in period) such that this is the case.
The reason this test works is that both W and V are unbiased estimators of the
variance of π (not proven here). Therefore, for converged chains, they should be
the same. For increasing l, R̂ usually approaches unity from above (for initially
over-dispersed chains).
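A direct transcription of this recipe (a sketch; chains is assumed to be an M × N array holding the post-burn-in samples of M parallel chains):

import numpy as np

def gelman_rubin(chains):
    M, N = chains.shape
    mu_m = chains.mean(axis=1)                         # per-chain means
    sigma2_m = chains.var(axis=1, ddof=1)              # per-chain empirical variances
    mu = mu_m.mean()                                   # mean across all chains
    B = N / (M - 1) * np.sum((mu_m - mu)**2)           # variation of the means across chains
    W = sigma2_m.mean()                                # average chain variance
    V = (1.0 - 1.0 / N) * W + (M + 1) / (M * N) * B    # Eq. (6.28)
    return np.sqrt(V / W)                              # R-hat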
Chapter 7
Stochastic Optimization
Optimization problems are amongst the most widespread in science and en-
gineering. Many applications, from machine learning over computer vision to
parameter fitting, can be formulated as optimization problems. An optimization
problem consists of finding the optimal \vec{\vartheta}^* such that
\vec{\vartheta}^* = \arg\min_{\vec{\vartheta}} h(\vec{\vartheta}) \qquad \text{for } h: \mathbb{R}^n \to \mathbb{R},\; \vec{\vartheta} \mapsto h(\vec{\vartheta})   (7.1)
for some given function h. This function is often called the cost function, loss
function, fitness function, or criterion function. Following the usual convention,
we define an optimization problem as a minimization problem; maximization is equivalently obtained by replacing h with −h.
Problems of this type, where h maps to R are called scalar real-valued opti-
mization problems or real-valued single-objective optimization. Of course, one
can also consider complex-valued or integer-valued problems, as well as vector-
valued (i.e., multi-objective) optimization. Many of the concepts considered
here generalize to those cases, but we do not describe such generalizations here.
For large n (i.e., high-dimensional domains) or non-convex h(·), there are no
efficient deterministic algorithms to solve the above optimization problem. In
fact, a non-convex function in n dimensions can have exponentially (in n) many
(local) minima. Since the goal according to Eq. 7.1 is to find the best local
minimum, i.e., the global minimum, deterministic approaches have a worst-case
runtime that is exponential in n. A typical way out is the use of randomized
stochastic algorithms. While randomized algorithms can be efficient, they pro-
vide no performance guarantees, i.e., they may not converge, may not find any
minimum, or may get stuck in a sub-optimal local minimum. The only tests
that can be used to compare and select randomized optimization algorithms are
heuristic benchmarks, typically obtained by running large ensembles of Markov
chains. There are two widely used standard suites of test problems: the IEEE
CEC2005-2020 standard and the ACM GECCO BBOB. They define test prob-
lems with known exact solutions, as well as detailed evaluation protocols, on
which algorithms are to be compared and tested. About 200 different stochas-
tic optimization algorithms have been benchmarked on these tests with the test
results publicly available.
Many stochastic optimization methods have the big advantage that they do not
require h to be known in closed form. Instead, it is often sufficient that h can be
evaluated point-wise. Therefore, h does not have to be a mathematical function,
but can also be a numerical simulation, taking a laboratory measurement, or
user input. Algorithms of this sort are called black-box optimization algorithms,
and optimization problems with an unknown, but evaluatable h are called black-
box optimization problems.
Designing good stochastic optimization algorithms is a vast field of research,
which is in itself split into sub-fields such as evolutionary computing, random-
ized search, and biased sampling. Many exciting concepts, from evolutionary
biology over information theory to Sobolev calculus, are being exploited on this
problem. Here, we discuss example classes of algorithms for
Monte-Carlo optimization from each of these sub-fields: Stochastic descent and
random pursuit from the class of randomized search heuristics, simulated an-
nealing from the class of biased sampling methods, and evolution strategies from
the class of evolutionary algorithms.
This clearly converges to the correct global minimum for m → ∞ and has a
linear computational complexity in O(m). For general h, the number of samples
required to reach a given probability of finding the global minimum is m ∝ |Θ| =
C n , where the constant C > 0 is the linear dimension of the search space Θ.
Stochastic exploration therefore converges exponentially slowly and is particu-
larly impractical in cases where h(·) is costly to evaluate, e.g., where it is given by
running a simulation or performing a measurement. This is because stochastic
exploration “blindly” samples the search space without exploiting any structure
or properties of h that may be known.
\vec{\vartheta}_{j+1} = \vec{\vartheta}_j - \frac{\alpha_j}{2\beta_j}\, \Delta h(\vec{\vartheta}_j, \beta_j \vec{u}_j)\, \vec{u}_j, \qquad j = 0, 1, 2, \ldots   (7.5)
with ⃗u_j i.i.d. uniform random variates on the unit sphere (i.e., |⃗u_j| = 1) and
\Delta h(\vec{x}, \vec{y}) = h(\vec{x} + \vec{y}) - h(\vec{x} - \vec{y}) \approx 2|\vec{y}|\, \nabla h(\vec{x}) \cdot \frac{\vec{y}}{|\vec{y}|}.   (7.6)
The latter is because the finite difference
\frac{h(\vec{x} + \vec{y}) - h(\vec{x} - \vec{y})}{2|\vec{y}|} \approx \nabla h(\vec{x}) \cdot \frac{\vec{y}}{|\vec{y}|}
is an approximation to the directional derivative of h in direction y. This itera-
tion does not proceed along the steepest slope and therefore has some potential
to overcome local minima. Stochastic descent has two algorithm parameters:
• αj : step size,
• βj : sampling radius.
One can show that stochastic descent converges to a local optimum if α_j ↓ 0 for j → ∞ and \lim_{j \to \infty} \alpha_j / \beta_j = \text{const} \neq 0. There are no guarantees of global con-
vergence. The problem in practice usually is the correct choice and adaptation
of αj and βj . The biggest advantage over stochastic exploration is a greatly
increased convergence speed, and that the method also works in unbounded
search spaces.
A side note on uniform random numbers on the unit sphere: Simply sampling
uniformly in the spherical angles leads to a bias toward the poles. One needs to
correct the samples with the arccos of the polar angle in order for them to have
uniform area density on the unit sphere.
Example 7.1. For example, in 3D, uniform random points ⃗u ∼ U(S 2 ) on the
unit sphere S 2 can be sampled by:
sampling φ ∼ U(0, 2π) (7.7)
sampling w ∼ U(0, 1) (7.8)
computing θ = arccos(2w − 1) (7.9)
and then using (r = 1, φ, θ) as the spherical coordinates of the sample point on
the unit sphere, where the polar angle θ runs from 0 (north pole) to π (south
pole) and the azimuthal angle φ from 0 to 2π.
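A minimal Python sketch of this recipe:

import numpy as np

rng = np.random.default_rng(0)
N = 10_000
phi = rng.uniform(0.0, 2.0 * np.pi, N)       # azimuthal angle, Eq. (7.7)
w = rng.uniform(0.0, 1.0, N)                 # Eq. (7.8)
theta = np.arccos(2.0 * w - 1.0)             # polar angle, Eq. (7.9)
# Cartesian coordinates of the points on the unit sphere (r = 1)
x = np.sin(theta) * np.cos(phi)
y = np.sin(theta) * np.sin(phi)
z = np.cos(theta)
print(x.mean(), y.mean(), z.mean())          # all close to 0 for a uniform distribution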
where E is the energy of the state and T is the temperature in the system.
Upon cooling (T ↓), the system settles into low-energy states, i.e., it finds min-
ima in E(·). In Algorithm 3, this analogy is exploited to perform stochastic
optimization over general functions h.
size B, these are the main difficulties in using Simulated Annealing in practice.
Clearly, the choice of f and B has to depend on what the function h looks like,
which may not always be known in a practical application.
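A minimal simulated-annealing sketch; the cost function h, the Gaussian move set, and the exponential cooling schedule T ← 0.99 T are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)

def h(x):                                    # example cost function (1D, multimodal)
    return x**2 + 10.0 * np.sin(x)

x, T = 5.0, 10.0                             # initial state and initial temperature
while T > 1e-3:
    y = x + rng.normal(scale=0.5)            # propose a move
    dE = h(y) - h(x)
    if dE < 0 or rng.random() < np.exp(-dE / T):    # Boltzmann acceptance criterion
        x = y
    T *= 0.99                                # cooling schedule
print(x, h(x))                               # near the global minimum around x = -1.31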
This type of evolution strategy is called a (1+1)-ES, where “ES” is short for
“Evolution Strategy”. The (1+1)-ES converges linearly on unimodal h(·), i.e.,
• (1,λ)-ES: sample λ new points ⃗u_{k,1}, . . . , ⃗u_{k,λ} i.i.d. from the same Gaussian mutation distribution in each iteration and set \vec{\vartheta}_{k+1} = \arg\min_{\vec{u}_{k,i}} \left( h(\vec{u}_{k,i}) \right)_{i=1}^{\lambda}, i.e., keep the best offspring to become the parent of the next generation.
• (1+λ)-ES: same as above, but include the parent in the selection, \vec{\vartheta}_{k+1} = \arg\min\{ h(\vec{u}_{k,1}), . . . , h(\vec{u}_{k,λ}), h(\vec{\vartheta}_k) \}, i.e., stay at the old point if none of the new samples are better.
• (µ,λ)-ES and (µ+λ)-ES: retain the best µ samples for the next iteration,
which then has µ “parents”. Use a linear combination (i.e., “genetic re-
combination”, e.g., their mean or pairwise averages) of parents as the
center for the mutation distribution of the new generation. In the comma
version, do not include the parents in the selection; in the plus version, do
include them, as above.
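A minimal (1+λ)-ES sketch with a fixed isotropic mutation rate; the sphere cost function, σ = 0.3, and λ = 10 are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

def h(theta):                                # example cost function: sphere function
    return np.sum(theta**2)

theta = rng.uniform(-5.0, 5.0, size=10)      # parent, n = 10 dimensions
sigma, lam = 0.3, 10                         # fixed mutation rate and number of offspring
for k in range(500):
    offspring = theta + sigma * rng.standard_normal((lam, theta.size))   # Gaussian mutations
    candidates = np.vstack([offspring, theta])          # "+" strategy: the parent competes too
    theta = candidates[np.argmin([h(c) for c in candidates])]
print(h(theta))                              # close to 0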
Evolution strategies are further classified into those where the covariance
matrix Σ of the Gaussian mutation distribution, i.e. the mutation rates, is con-
stant, and those where it is dynamically adapted according to previously seen
samples.
7.5.2.2 Self-adaptation
Self-adaptation does away with the need of choosing decrease and expansion
factors for the mutation rate. Instead, it lets the Darwinian selection process
itself take care of adjusting the mutation rates. For this, each sample (i.e.,
“individual”) has its own mutation rate σk,i , i = 1, . . . λ in iteration k, and we
2
again use isotropic mutations Σk,i = σk,i 1, but now with a different covariance
matrix for each offspring sample
⃗ k , σ 2 1),
⃗uk,i ∼ N (ϑ i = 1, . . . , λ. (7.11)
k,i
The individual mutation rates for each offspring are themselves sampled from a
Gaussian distribution, as:
\sigma_{k,i} \sim N(\sigma_k, s^2),   (7.12)
where σk is the mutation rate of the parent (or the h-weighted mean of the
mutation rates of the parents for a µ-strategy), and s is a step size. This self-
adaptation mechanism will automatically take care that samples with “good”
mutation rates have a higher probability of becoming parents of the next gener-
ation and hence the mutation rate is inherited by the offspring as one of the “ge-
netic traits”. Choosing the step size s is unproblematic. Performance is robust
over a wide range of choices. However, the main drawback of self-adaptation
is its reduced efficiency. The same point potentially needs to be tried multiple
times for different mutation rates, therefore increasing the number of samples
required to converge.
and using that to adapt a fully anisotropic covariance matrix Σk that can also
have off-diagonal elements. This allows the mutation rates of different elements
of the vector ϑ⃗ to be different, in order to account for different scaling of the
parameters in different coordinate directions of the domain. The off-diagonal
elements are used to exploit correlations between search dimensions.
The classic algorithm to achieve this is CMA-ES, the evolution strategy with
Covariance-Matrix Adaptation (CMA). The algorithm adapts the covariance
matrix of the mutation distribution by rank-µ updates of Σ based on correlations
between the previous best samples, which can be interpreted nicely in terms
of information geometry. The algorithm uses Cholesky decompositions and
eigenvalue calculations, and we are not going to give it in full detail here. We
refer to online resources (e.g., wikipedia) for details.
CMA-ES roughly proceeds by:
1. sampling λ offspring from N(\vec{\vartheta}_k, Σ_k)
2. choosing the best µ < λ: ⃗u_1, . . . , ⃗u_µ
3. recombining them as \vec{\vartheta}_{k+1} = mean(⃗u_1, . . . , ⃗u_µ)
4. rank-µ update of Σ_k using ⃗u_1, . . . , ⃗u_µ =⇒ Σ_{k+1}
A few remarks about CMA-ES:
Chapter 8
Random Walks
with ∆X_i as the random increments of the RW and X_0 as some known starting point of the RW. These increments are i.i.d. binary random variables, i.e.,
\Delta X_i \sim \begin{cases} +1 & \text{with probability } p \\ -1 & \text{with probability } q. \end{cases}   (8.5)
where the second-to-last equality follows from the fact that (a) X0 has zero
variance (i.e., it is deterministic) and (b) the variance of the sum of independent
RVs is just the sum of the variances of each individual RV.
Definition 8.2. The mean increment µ is called the drift of a random walk.
Definition 8.3. A random walk with µ = 0 (p=q=1/2) is called symmetric.
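A short simulation sketch of the simple random walk, which can be used to check the drift and variance empirically (p = 0.6 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
p, n_steps, n_walks = 0.6, 1000, 5000
# increments are +1 with probability p and -1 with probability q = 1 - p
steps = np.where(rng.random((n_walks, n_steps)) < p, 1, -1)
X = steps.sum(axis=1)                            # displacement after n_steps steps
print(X.mean(), n_steps * (2 * p - 1))           # empirical mean vs. drift mu * n = (p - q) n
print(X.var(), n_steps * 4 * p * (1 - p))        # empirical variance vs. 4pq * n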
We see that the random process Y (t) is given by the initial condition X0 and
a sum of i.i.d. RVs. The number of summands inside this sum will increase
linearly with t. We know from the CLT that the sum of many i.i.d. RVs will
converge to a normally distributed RV. More precisely, we have that
Y(t) = X_0 + \sum_{i=1}^{t/r} \Delta X_i \;\xrightarrow{\; t \gg r \;}\; N\!\left( X_0 + \frac{t}{r}\mu,\; \frac{t}{r}\sigma^2 \right),   (8.16)
with W (t) as a standard Wiener process. This is because the standard Wiener
process is normally distributed for all t with mean E{W (t)} = 0 and variance
V ar{W (t)} = t. We will have a more detailed discussion about Wiener processes
in the next chapter.
Chapter 9
Stochastic Calculus
the white noise process u(t) would correspond to the time derivative of W (t).
It is known, however, that a Wiener process is not continuously differentiable, making (9.2) problematic. Therefore, continuous-time stochastic processes are
generally given in the form of (9.3).
where the first integral corresponds to a classical Riemann integral. The
second integral is called a stochastic integral, where in this case, the function
σ(X(t), t) is integrated with respect to a standard Wiener process W (t). Math-
ematically, this integral can be defined as
\int_0^t H(s)\, dW(s) = \lim_{n \to \infty} \sum_{i=0}^{n} H(t_i)\left( W(t_{i+1}) - W(t_i) \right), \qquad t_i = \frac{i}{n} t.   (9.6)
Eq. (9.6) is generally known as the Ito integral. This integral converges (in
probability) if:
• H(t) depends only on {W (t − h) | h ≥ 0}. H(t) is then said to be non-
anticipating.
• It holds that E\left\{ \int_0^t H(s)^2\, ds \right\} < \infty.
Theorem 9.1. The transformed process Y (t) = f (X(t)) satisfies the SDE
dY(t) = \left[ \frac{\partial}{\partial x} f(X(t))\, \mu(X(t), t) + \frac{1}{2} \frac{\partial^2}{\partial x^2} f(X(t))\, \sigma^2(X(t), t) \right] dt + \frac{\partial}{\partial x} f(X(t))\, \sigma(X(t), t)\, dW(t).   (9.10)
This is known as Ito’s lemma.
Remark: Note that (9.10) is valid only if f does not explicitly depend on t. While Ito’s lemma can also be extended to time-dependent f, in this lecture we restrict ourselves to the case where f depends only on X(t).
To calculate the variance, we first use Ito’s lemma to derive an SDE for Y(t) = f(X(t)) = X(t)². We first calculate the first- and second-order derivatives of f, i.e.,
\[
\frac{\partial}{\partial x} f(x) = 2x, \tag{9.18}
\]
\[
\frac{\partial^2}{\partial x^2} f(x) = 2. \tag{9.19}
\]
Using these derivatives within Ito’s lemma gives us
\[
\begin{aligned}
dY(t) = d[X(t)^2] &= \bigl[ 2X(t)\,\theta(\mu - X(t)) + \sigma^2 \bigr]\, dt + 2\sigma X(t)\, dW(t) \\
&= \bigl[ 2\theta(\mu X(t) - Y(t)) + \sigma^2 \bigr]\, dt + 2\sigma X(t)\, dW(t).
\end{aligned} \tag{9.20}
\]
The second-order non-central moment E{Y(t)} = E{X(t)²} therefore satisfies the differential equation
\[
\frac{d}{dt} E\{X(t)^2\} = 2\theta\bigl(\mu\, E\{X(t)\} - E\{X(t)^2\}\bigr) + \sigma^2. \tag{9.22}
\]
For the variance, we obtain correspondingly
\[
\begin{aligned}
\frac{d}{dt} Var\{X(t)\} &= \frac{d}{dt} E\{X(t)^2\} - 2\, E\{X(t)\}\, \frac{d}{dt} E\{X(t)\} \\
&= 2\theta\bigl(\mu\, E\{X(t)\} - E\{X(t)^2\}\bigr) + \sigma^2 - 2\, E\{X(t)\}\,\theta\bigl(\mu - E\{X(t)\}\bigr) \\
&= -2\theta\,\bigl( E\{X(t)^2\} - E\{X(t)\}^2 \bigr) + \sigma^2 \\
&= -2\theta\, Var\{X(t)\} + \sigma^2.
\end{aligned} \tag{9.23}
\]
Setting the time derivative to zero yields the stationary variance
\[
\lim_{t \to \infty} Var\{X(t)\} = \frac{\sigma^2}{2\theta}. \tag{9.24}
\]
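This limit is easy to check by simulation. The sketch below integrates the underlying mean-reverting SDE dX = θ(µ − X) dt + σ dW with a simple explicit time-stepping scheme (the Euler-Maruyama method of the next chapter) and compares the empirical variance after a long time with σ²/(2θ); all parameter values are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, mu, sigma = 1.5, 0.0, 0.8            # assumed example parameters
dt, n_steps, n_paths = 1e-3, 20_000, 2_000

X = np.zeros(n_paths)                       # X(0) = 0 for all realizations
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    X += theta * (mu - X) * dt + sigma * dW

print("empirical Var{X}   :", X.var())
print("sigma^2 / (2 theta):", sigma ** 2 / (2 * theta))
```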
We finally remark that the same approach can in principle be used to calculate
mean and variance of any SDE driven by a Wiener process. However, one
should keep in mind that if the SDE is non-linear, one may encounter a so-
called moment-closure problem. That means that the equation for the mean of X(t) may depend on the second-order moment, which in turn depends on the third-order moment, and so forth. In this case, certain approximate techniques
can be considered. Those techniques, however, are beyond the scope of this
lecture.
Chapter 10
Numerical Methods for Stochastic Differential Equations
\[
\frac{dX}{dt} = v_0(X(t), t) + v_1(X(t), t)\,\frac{dW_1}{dt} + \ldots + v_n(X(t), t)\,\frac{dW_n}{dt} \tag{10.1}
\]
with given functions v0, v1, . . . , vn and Wiener processes W1, . . . , Wn. The first term on the right-hand side governs the deterministic part of the dynamics through the function v0. The remaining terms govern the stochastic influences on the dynamics, of which there can be more than one, each with its own transformation function v1, . . . , vn. The Wiener processes Wi(t) are continuous functions of time that are almost surely nowhere differentiable. Therefore, the dWi/dt are pure white noise and the equation cannot be interpreted mathematically as it stands. However, if we multiply the entire equation by dt, we get:
dX(t) = v0 (X(t), t)dt + v1 (X(t), t)dW1 (t) + . . . + vn (X(t), t)dWn (t), (10.2)
The deterministic part µ is called drift and the stochastic part σ is called dif-
fusion, because Wiener increments dW are normally distributed (see previous
chapter).
where a > 0 and b > 0 are constants. This equation describes the dynamics of
the velocity X(t) of a particle (point mass) under deterministic friction (friction
coefficient a) and stochastic Brownian motion (diffusion constant b). It is a
central equation in statistical physics, chemistry, finance, and many other fields.
(1) is referred to as the strong solution of the SDE, and (2) as the weak solution
of the SDE.
where the first integral is a deterministic Riemann integral, and the second one
is a stochastic Itô (or Stratonovich) integral.
In order to numerically approximate the solution, we discretize the time interval [0, T] of interest into N finite-sized time steps of duration δt = T/N, such that tn = nδt and Xn = X(t = tn), Wn = W(t = tn).
Due to the 4th property of the Wiener process from the previous chapter, which states that the differences between any two time points of a Wiener process are normally distributed, we can also discretize:
with ∆Wn i.i.d. ∼ N(0, δt) and W0 = 0. The starting value W0 = 0 is chosen arbitrarily, since the absolute value of W is inconsequential for the SDE; only the increments dW matter, so we can start from an arbitrary point.
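In code, this discretization amounts to cumulatively summing i.i.d. normal increments (a minimal sketch; T and N are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 1.0, 1000                             # example values
dt = T / N

dW = rng.normal(0.0, np.sqrt(dt), size=N)    # Delta W_n ~ N(0, dt), i.i.d.
W = np.concatenate(([0.0], np.cumsum(dW)))   # W_0 = 0, W_n = sum of increments
t = np.linspace(0.0, T, N + 1)               # grid points t_n = n * dt
```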
The integrals in Eq. 10.4 can be interpreted as the continuous limits of sums.
The deterministic term can hence be discretized by a standard quadrature (nu-
merical integration). The stochastic term is discretized using the above dis-
cretization of the Wiener process, hence, for any time T ,
\[
\int_0^T \sigma(X(\tilde t), \tilde t)\, dW(\tilde t) \approx \sum_{n=0}^{N} \sigma(X(t_n), t_n)\, \Delta W_n.
\]
Using the rectangular rule (i.e., approximating the integral by the sum of the
areas of rectangular bars) for the deterministic integral, and the above sum over
one time step for the stochastic integral, we find:
\[
\int_{t_n}^{t_{n+1}} \mu(X(\tilde t), \tilde t)\, d\tilde t \approx \mu(X_n, t_n)\, \delta t,
\]
\[
\int_{t_n}^{t_{n+1}} \sigma(X(\tilde t), \tilde t)\, dW(\tilde t) \approx \sigma(X_n, t_n)\, \Delta W_n,
\]
with
∆Wn = Wn+1 − Wn ∼ N (0, δt) i.i.d.,
W0 = 0,
X0 = x0 .
Iterating Eq. 10.6 forward in time n = 0, 1, . . . , N yields a numerical approxi-
mation of one trajectory/realization of the stochastic process X(t) governed by
the SDE from Eq. 10.3.
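A minimal sketch of this forward iteration (the Euler-Maruyama scheme), which combines the two one-step approximations above into the update Xn+1 = Xn + µ(Xn, tn)δt + σ(Xn, tn)∆Wn; the drift and diffusion functions and all parameter values in the example call are illustrative assumptions only.

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, T, N, seed=0):
    """Iterate X_{n+1} = X_n + mu(X_n, t_n) dt + sigma(X_n, t_n) dW_n."""
    rng = np.random.default_rng(seed)
    dt = T / N
    t = np.linspace(0.0, T, N + 1)
    X = np.empty(N + 1)
    X[0] = x0
    for n in range(N):
        dW = rng.normal(0.0, np.sqrt(dt))        # Delta W_n ~ N(0, dt)
        X[n + 1] = X[n] + mu(X[n], t[n]) * dt + sigma(X[n], t[n]) * dW
    return t, X

# example: linear friction with additive noise, dX = -a X dt + b dW
# (a, b, x0, T, N are arbitrary example values)
a, b = 2.0, 0.5
t, X = euler_maruyama(lambda x, t: -a * x, lambda x, t: b, x0=1.0, T=5.0, N=5000)
```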
10.4 Convergence
The error of the numerical solution is defined with respect to the exact stochastic
process X(t) for decreasing δt. To make this comparison possible, we introduce
Xδt (t), the continuous-time stochastic process obtained by connecting the points
(tn , Xn ) by straight lines, i.e.:
\[
X_{\delta t}(t) = X_n + \frac{t - t_n}{t_{n+1} - t_n}\,(X_{n+1} - X_n) \qquad \text{for } t \in [t_n, t_{n+1}). \tag{10.7}
\]
Comparing this continuous-time process to the exact process, we can define convergence. Note that in the deterministic case this construction is not necessary, as there we can simply evaluate the analytical solution at the simulation time steps and compare the values. In a stochastic simulation, however, this is not possible, as each realization of the process takes different values. The only things we can compare are moments, which can only be computed over continuous-time processes.
So with the above trick we can define:
for every time T , where the constant C(T ) > 0 depends on T and on the SDE
considered.
Definition 10.3 (weak convergence order). A numerical method has weak con-
vergence order γ ≥ 0 if and only if
\[
\Bigl| E\bigl[ g(X(T)) \bigr] - E\bigl[ g(X_{\delta t}(T)) \bigr] \Bigr| \le C(T, g)\, \delta t^{\gamma} \tag{10.11}
\]
for every time T , where the constant C(T, g) > 0 depends on T , g, and the SDE
considered.
\[
X_{n+1} = X_n + \mu(X_n)\,\delta t + \sigma(X_n)\,\Delta W_n + \frac{1}{2}\,\sigma'(X_n)\,\sigma(X_n)\bigl((\Delta W_n)^2 - \delta t\bigr) \tag{10.12}
\]
with:
Theorem 10.3. The Milstein method has both strong and weak orders of con-
vergence of 1.
The Milstein method therefore is more accurate than the Euler-Maruyama method in the strong sense, but has the same weak order of convergence. Using a 100-fold smaller time step reduces the numerical error 100-fold with Milstein, whereas with Euler-Maruyama, whose strong order of convergence is only 1/2, the strong error would shrink only 10-fold. This allows larger time steps compared to Euler-Maruyama and relaxes the numerical rounding issues.
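To make this concrete, the sketch below integrates the geometric Brownian motion dX = µX dt + σX dW (chosen because its exact solution X(T) = X₀ exp((µ − σ²/2)T + σW(T)) is known) once with Euler-Maruyama and once with the Milstein update of Eq. (10.12), both on the same Brownian path, and prints the pathwise error at T for a single realization; all parameter values are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, x0, T, N = 0.5, 1.0, 1.0, 1.0, 2_000
dt = T / N
dW = rng.normal(0.0, np.sqrt(dt), size=N)

x_em, x_mil = x0, x0
for n in range(N):
    # Euler-Maruyama step
    x_em += mu * x_em * dt + sigma * x_em * dW[n]
    # Milstein step: extra 0.5 * sigma'(x) * sigma(x) * (dW^2 - dt) term;
    # here sigma(x) = sigma * x, so sigma'(x) * sigma(x) = sigma^2 * x
    x_mil += (mu * x_mil * dt + sigma * x_mil * dW[n]
              + 0.5 * sigma ** 2 * x_mil * (dW[n] ** 2 - dt))

# exact solution evaluated on the same Brownian path, W(T) = sum of increments
x_exact = x0 * np.exp((mu - 0.5 * sigma ** 2) * T + sigma * dW.sum())
print("pathwise error Euler-Maruyama:", abs(x_em - x_exact))
print("pathwise error Milstein      :", abs(x_mil - x_exact))
```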
Of course, Euler-Maruyama and Milstein are not the only known stochastic nu-
merical integration methods. Other methods also exist (e.g. Castell-Gaines,
stochastic Lie integrators, etc.), some with lower pre-factors C in the error
bounds of Eqs. 10.10 and 10.11. However, no method is known with strong
convergence order > 1, unless we can analytically solve the corresponding
Stratonovich integrals.
The numerical stability properties (with respect to δt) of stochastic numerical integration methods are largely unclear. Only a few results are known on almost-sure stability, e.g., for linear scalar SDEs. No A-stable stochastic numerical integrator is known.
Chapter 11
Stochastic Reaction Networks
where:
Si : species i,
k: reaction rate.
The total stoichiometry is νi = νi⁺ − νi⁻, and it gives the net change in copy numbers when the reaction happens. Reactions are classified by the total number of reactant molecules Σi νi⁻, which is called the order of the reaction. Reactions having only a single reactant are of order one, or unimolecular. Reactions with two reactants are of order two, or bimolecular; and so on. Reactions of order ≤ 2 are called elementary reactions.
Example 11.1. For the reaction A + B → C, the above quantities are:
• Si = {A, B, C}
• N =3
• ⃗ν − = [1, 1, 0]T
• ⃗ν + = [0, 0, 1]T
• ⃗ν = [−1, −1, 1]T
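For illustration, these quantities can be written down directly as arrays (a small sketch in numpy, with species ordered as A, B, C):

```python
import numpy as np

species = ["A", "B", "C"]              # N = 3 species
nu_minus = np.array([1, 1, 0])         # reactant stoichiometry of A + B -> C
nu_plus  = np.array([0, 0, 1])         # product stoichiometry
nu = nu_plus - nu_minus                # total (net) stoichiometry: [-1, -1, 1]

order = nu_minus.sum()                 # total number of reactant molecules = 2
print(nu, "reaction order:", order)    # -> [-1 -1  1] reaction order: 2
```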
where:
µ: index of reaction Rµ ,
M : total number of different reactions.
Now the stoichiometry is a matrix with one column per reaction: ν = ν + − ν − .
All the stoichiometry matrices are of size N × M . All elements of ν + and ν −
are non-negative whereas those of ν can be negative, zero or positive.
Example 11.2. Consider the following cyclic chain reaction network with N
species and M = N reactions:
Si −→ Si+1,  i = 1, . . . , N − 1,
SN −→ S1.   (11.3)
and
\[
\vec\nu = \vec\nu^{+} - \vec\nu^{-} = \begin{pmatrix} -1 & 0 & 1 \\ 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}. \tag{11.7}
\]
\[
\vec\nu^{+} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}, \tag{11.11}
\]
and
\[
\vec\nu = \vec\nu^{+} - \vec\nu^{-} = \begin{pmatrix} -2 & -1 & -1 & 0 \\ 1 & -1 & 0 & -2 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}. \tag{11.12}
\]
(e.g., Wiener processes). Reaction networks do not fall in either of these classes
since they evolve continuously in time, but have a discrete state space (i.e.,
particle numbers are integer-valued). In order to describe such systems, we em-
ploy another class of Markov processes termed continuous-time Markov chains
(CTMCs).
We consider the state X⃗(t) = (X1(t), . . . , XN(t)) of a reaction network, collecting the copy numbers (population) of each species at time t. At a certain time instance, we can now assess the time evolution of X⃗(t) over a small amount of time ∆t using basic probability theory. In particular, we define the following two probabilities
\[
P\bigl(\vec X(t + \Delta t) = \vec x + \nu_i \,\big|\, \vec X(t) = \vec x\bigr) = a_i(\vec x)\,\Delta t + o(\Delta t), \tag{11.13}
\]
\[
P\bigl(\vec X(t + \Delta t) = \vec x \,\big|\, \vec X(t) = \vec x\bigr) = 1 - \sum_i a_i(\vec x)\,\Delta t + o(\Delta t), \tag{11.14}
\]
Now, in order to calculate the distribution P(⃗x, t + ∆t), we make use of (11.13) and (11.14). In particular, the probability of being in state ⃗x at time t + ∆t is the probability that we were brought to this state via any of the M reactions, plus the probability that we have already been in state ⃗x at time t (plus some additional terms accounting for the possibility that multiple events happened). In particular, we obtain
\[
P(\vec x, t + \Delta t) = \sum_i \Bigl( a_i(\vec x - \nu_i)\,\Delta t + o(\Delta t) \Bigr) P(\vec x - \nu_i, t) + \Bigl( 1 - \sum_i a_i(\vec x)\,\Delta t + o(\Delta t) \Bigr) P(\vec x, t). \tag{11.16}
\]
Subtracting P(⃗x, t) on both sides, dividing by ∆t, and taking the limit ∆t → 0 yields
\[
\frac{d}{dt} P(\vec x, t) = \lim_{\Delta t \to 0} \Biggl[ \frac{\sum_i a_i(\vec x - \nu_i)\,\Delta t\, P(\vec x - \nu_i, t)}{\Delta t} + \underbrace{\frac{o(\Delta t)\, P(\vec x - \nu_i, t)}{\Delta t}}_{\to 0} + \underbrace{\frac{P(\vec x, t) - P(\vec x, t)}{\Delta t}}_{\to 0} - \frac{\sum_i a_i(\vec x)\,\Delta t\, P(\vec x, t)}{\Delta t} + \underbrace{\frac{o(\Delta t)\, P(\vec x, t)}{\Delta t}}_{\to 0} \Biggr]
\]
and thus
\[
\frac{d}{dt} P(\vec x, t) = \sum_i a_i(\vec x - \nu_i)\, P(\vec x - \nu_i, t) - \sum_i a_i(\vec x)\, P(\vec x, t), \tag{11.17}
\]
known as the CME. Similarly to the discrete Markov chain scenario, the CME
describes how the state distribution evolves over time and is thus a continuous-
time analog of the Kolmogorov-forward equation that we have discussed in the
previous chapters. Technically speaking, the CME is a difference-differential
equation: it has a time-derivative on the left hand side and discrete shifts in
the state on the right-hand side. Note that in general, the CME is infinite-
dimensional, since for every possible ⃗x, we would get an additional dimension
in the CME. Unfortunately, analytical solutions of the CME exist only in the simplest cases (e.g., a linear chain of three reactions), so it generally needs to be solved numerically. Traditional methods from numerical analysis, such as finite
differences or finite elements, also fail due to the high dimensionality of the
domain of the probability distribution P (⃗x, t), which leads to an exponential
increase in computational and memory cost with network size. However, Monte
Carlo approaches can be applied to simulate stochastic reaction networks, as
will be discussed in the next section.
This probability density p is derived as follows: Consider that the time interval [t, t + τ + dτ) is divided into k equal intervals of length τ/k plus a last interval of length dτ, as illustrated in Fig. 11.1.
The definition of p(τ, µ | ⃗x(t)) in Eq. 11.18 dictates that no reactions occur in all
of the first k intervals, and that reaction µ fires exactly once in the last interval.
We recall that the Master equation has been derived from the following basic
quantities:
The product aµ (⃗x) = hµ (⃗x)cµ is called the reaction propensity. The probability
aµ dτ is the probability that reaction µ happens at least once in the time interval
dτ . It is the product of the probability of reaction and the number of possi-
ble reactant combinations by which this can happen, as these are statistically
independent events.
Now we can write:
This only considers the last sub-interval of length dτ . The first line is simply the
analytical solution of the Master equation. If reaction µ has total stoichiometry
⃗νµ , then the new state of the network after reaction µ happened exactly once
is ⃗x + ⃗νµ . In the second line, the first factor, hµ cµ dτ , is the probability that
reaction µ happens from at least one of the hµ possible reactant combinations.
However, the reaction could still happen more than once. Therefore, the second factor, (1 − cµ dτ)^{hµ(⃗x)−1}, is the probability that none of the other hµ − 1 reactant combinations leads to a reaction. In the third line, we only multiplied out the
first factor. All others are of O(dτ 2 ) or higher. Overall, the expression thus
is the probability that reaction µ happens once, and exactly once, in the last
sub-interval of length dτ .
Further, we have for the probability that no reaction happens in one of the first
k sub-intervals:
Both of the above expressions assume that the individual reaction events are
statistically independent. This is an important assumption of the Master equa-
tion.
From these two expressions, we can now write an expression for Eq. 11.18 by con-
sidering all k + 1 sub-intervals, again assuming that the individual sub-intervals
are mutually statistically independent:
\[
p(\tau, \mu \,|\, \vec x(t))\, d\tau = \Bigl[ 1 - a(\vec x)\,\frac{\tau}{k} \Bigr]^{k} \Bigl[ a_\mu(\vec x)\, d\tau + O\bigl(d\tau^2\bigr) \Bigr].
\]
The term in the first square bracket is the probability that no reaction happens
in any one of the k first sub-intervals. This to the power of k thus is the
probability that no reaction happens in all of the k first sub-intervals. The term
in the second square bracket then is the probability that reaction µ happens
exactly once in the last sub-interval.
Dividing both sides of the equation by dτ and taking the limit dτ → 0, we obtain
\[
p(\tau, \mu \,|\, \vec x(t)) = \Bigl[ 1 - a(\vec x)\,\frac{\tau}{k} \Bigr]^{k} a_\mu(\vec x).
\]
Taking the limit k → ∞, we further get
\[
p(\tau, \mu \,|\, \vec x(t)) = a_\mu(\vec x)\, e^{-a(\vec x)\tau}. \tag{11.19}
\]
Summing over all reactions µ yields the marginal probability density of the waiting time τ,
\[
p(\tau \,|\, \vec x(t)) = \sum_\mu a_\mu(\vec x)\, e^{-a(\vec x)\tau} = a(\vec x)\, e^{-a(\vec x)\tau}. \tag{11.20}
\]
Similarly, integrating Eq. 11.19 over τ we get the marginal probability distribu-
tion function of µ as
\[
p(\mu \,|\, \vec x(t)) = \int_0^\infty a_\mu(\vec x)\, e^{-a(\vec x)\tau}\, d\tau = \frac{a_\mu(\vec x)}{a(\vec x)}. \tag{11.21}
\]
Since the joint density factorizes into the product of these two marginals, µ and τ are statistically independent random variables. Sampling from the marginals in any order is therefore equivalent to sampling from the joint density. Sampling from Eq. 11.20 is easily done using the inversion
method (see Section 2.3), as τ is exponentially distributed with parameter a.
Eq. 11.21 describes a discrete probability distribution from which we can also
sample using the inversion method. Note that Eqs. 11.20 and 11.21 also relate
to a basic fact in statistical mechanics: if an event has rate aµ of happening,
then the time one needs to wait until it happens again is ∼ Exp(aµ ) (see Section
1.5.2).
By sampling one reaction event at a time and propagating the simulation in
time according to Eq. 11.20, we obtain exact, time-resolved trajectories of the population ⃗x as governed by the Master equation. The SSA, however, is a Monte Carlo scheme, and hence several independent runs need to be performed in
order to obtain a good estimate of the probability function P (⃗x, t), or any of its
moments.
All exact formulations of SSA aim to simulate the network by sampling the
random variables τ (time to the next reaction) and µ (index of the next reaction)
according to Eqs. 11.20 and 11.21 and propagating the state ⃗x of the system one
reaction event at a time. The fundamental steps in every exact SSA formulation
are thus:
1. Sample τ and µ from Eqs. 11.20 and 11.21,
2. Update state ⃗x = ⃗x + ⃗νµ and time t = t + τ ,
3. Recompute the reaction propensities aµ from the changed state.
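A compact sketch of these three steps in the direct-method flavour (τ drawn from Exp(a) by inversion, µ drawn from the discrete distribution aµ/a); the propensity function and the small example network in the final lines are illustrative assumptions, not taken from the text.

```python
import numpy as np

def ssa_direct(x0, nu, propensities, t_end, seed=0):
    """Exact SSA (direct-method sampling): one reaction event per iteration.

    x0           : initial copy numbers, shape (N,)
    nu           : total stoichiometry matrix, shape (N, M)
    propensities : function x -> array of the M reaction propensities a_mu(x)
    """
    rng = np.random.default_rng(seed)
    x, t = np.array(x0, dtype=int), 0.0
    times, states = [t], [x.copy()]
    while t < t_end:
        a = propensities(x)
        a0 = a.sum()
        if a0 == 0:                       # no reaction can fire any more
            break
        # 1. sample tau ~ Exp(a0) by inversion, and mu with P(mu) = a_mu / a0
        tau = -np.log(rng.random()) / a0
        mu = np.searchsorted(np.cumsum(a), rng.random() * a0)
        # 2. update state and time
        x += nu[:, mu]
        t += tau
        times.append(t)
        states.append(x.copy())
        # 3. propensities are recomputed at the top of the next iteration
    return np.array(times), np.array(states)

# illustrative example (assumed): S1 -> S2 -> S3 with mass-action rates c1, c2
nu = np.array([[-1, 0], [1, -1], [0, 1]])
c = np.array([1.0, 0.5])
times, states = ssa_direct([100, 0, 0], nu, lambda x: c * x[:2], t_end=10.0)
```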
We here only look at the two classical exact SSA formulations due to Gille-
spie. We note, however, that many more exact SSA formulations exist in the
literature, including the Next Reaction Method (NRM, introducing dependency
graphs for faster propensity updates), the Optimized Direct Method (ODM), the
Sorting Direct Method (SDM), the SSA with Composition-Rejection sampling
(SSA-CR, using composition-rejection sampling as outlined in Section 2.4.1 to
sample µ), and partial-propensity methods (factorizing the propensity and in-
dependently operating on the factors), which we do not discuss here. All are
different formulations of the same algorithm, exact SSA, and sample the exact
same trajectories. However, the computational cost of different SSA formula-
tions may differ for certain types or classes of networks.
A defining feature of exact SSAs is that they explicitly simulate each and ev-
ery reaction event. Once can in fact show that algorithms that skip, miss,
or lump reaction events cannot sample from the exact solution of the Master
equation any more. However, they may still provide good and convergent weak
approximations, at least for low-order moments of p(⃗x, t). Examples of such
approximate SSAs are τ -leaping, R-leaping, and Langevin algorithms. We do
not discuss them here. They are very much related to numerical discretizations
of stochastic differential equations, as discussed in Chapter 10.
using the inversion method i.i.d. for each µ. Subsequently, the next reaction
µ is chosen to be the one with the minimum τµ , and the time τ to the next
reaction is set to the minimum τµ . The algorithm is given in Algorithm 5.