
Ivo F. Sbalzarini & Christoph Zechner

Stochastic Modeling and Simulation

Lecture Notes
TU Dresden, Faculty of Computer Science
Center for Systems Biology Dresden

Prof. Dr. Ivo F. Sbalzarini & Dr. Christoph Zechner

Winter 2021/22
Contents

Foreword

1 Introduction
  1.1 Elementary Probabilities
    1.1.1 Events and Axioms
    1.1.2 Combinatorics
    1.1.3 Probability Spaces (Ω, P)
  1.2 Conditional Probabilities
    1.2.1 Definition and properties
    1.2.2 Bayes' Theorem
    1.2.3 Law of Total Probabilities
    1.2.4 Probability Expansion
  1.3 Random Variables
    1.3.1 Indicator/binary random variables
    1.3.2 Discrete random variables
    1.3.3 Continuous random variables
  1.4 Probability Distributions
    1.4.1 Discrete Distributions
    1.4.2 Continuous Distributions
    1.4.3 Joint and Marginal Distributions
    1.4.4 Moments of Probability Distributions
  1.5 Common Examples of Distributions
    1.5.1 Common discrete distributions
    1.5.2 Common continuous distributions
    1.5.3 Scale-free distributions

2 Random Variate Generation
  2.1 Transformation of Random Variables
    2.1.1 Transformed discrete distributions
    2.1.2 Transformed continuous distributions
    2.1.3 The Inversion Transform
  2.2 Uniform Random Variate Generation
    2.2.1 Uniform Pseudo-Random Number Generation
    2.2.2 Uniform Quasi-Random Number Generation
  2.3 Inversion Transform for Non-uniform Distributions
  2.4 Accept-Reject Methods for Non-uniform Distributions
    2.4.1 Composition-Rejection Method

3 Discrete-Time Markov Chains
  3.1 Discrete-Time Stochastic Processes
  3.2 Markov Chains
  3.3 Markov Chains as Recursions
  3.4 Properties of Markov Chains

4 Monte Carlo Methods
  4.1 The Law of Large Numbers
    4.1.1 Proof of the Weak Law
  4.2 Monte Carlo Integration
    4.2.1 Multidimensional Extension
  4.3 Importance sampling

5 Variance Reduction
  5.1 Antithetic Variates
  5.2 Rao-Blackwellization

6 Markov Chain Monte-Carlo
  6.1 Gibbs Sampling
    6.1.1 Multivariate case
  6.2 Metropolis-Hastings Sampling
    6.2.1 Metropolis-Hastings Algorithm
    6.2.2 Convergence properties
    6.2.3 One-step transition kernel
    6.2.4 Special Proposal Choices
    6.2.5 Stopping Criteria
  6.3 Thinning and Convergence Diagnostics
    6.3.1 Gelman-Rubin Test
    6.3.2 Autocorrelation Test

7 Stochastic Optimization
  7.1 Stochastic Exploration
  7.2 Stochastic Descent
  7.3 Random Pursuit
  7.4 Simulated Annealing
  7.5 Evolutionary Algorithms
    7.5.1 ES with fixed mutation rates
    7.5.2 ES with adaptive mutation rates

8 Random Walks
  8.1 Characterization and Properties
    8.1.1 Kolmogorov-forward Equation
    8.1.2 State Equation
    8.1.3 Mean and Variance
    8.1.4 Restricted Random Walks
    8.1.5 Relation to the Wiener process (continuous limit)
    8.1.6 Random Walks in higher dimensions

9 Stochastic Calculus
  9.1 Stochastic differential equations
    9.1.1 Ito integrals
    9.1.2 Transformation of Wiener processes
    9.1.3 Mean and Variance of SDE's

10 Numerical Methods for Stochastic Differential Equations
  10.1 Refresher on SDEs
  10.2 Solving an SDE
    10.2.1 Solution methods
  10.3 Stochastic Numerical Integration: Euler-Maruyama
  10.4 Convergence
  10.5 Milstein Method
  10.6 Weak Simulation

11 Stochastic Reaction Networks
  11.1 Formal Representations
    11.1.1 Representation of a reaction
    11.1.2 Representation of a reaction network
  11.2 The Chemical Master Equation
  11.3 Exact Simulation of Stochastic Reaction Networks
    11.3.1 The first-reaction method (FRM)
    11.3.2 The direct method (DM)
Foreword

These lecture notes were created for the course “Stochastic Modeling and Simu-
lation”, taught as part of the mandatory electives module “CMS-COR-SAP” in
the Masters Program “Computational Modeling and Simulation” at TU Dres-
den, Germany. The notes are based on handwritten notes by Prof. Sbalzarini
and Dr. Zechner, which have been typeset in LaTeX by Fahad Fareed as a paid
student teaching assistantship during his first term of studies in the Masters
Program “Computational Modeling and Simulation” at TU Dresden, Germany,
and subsequently extended and corrected by Prof. Sbalzarini and Dr. Zechner.

Chapter 1

Introduction

The goal of stochastic modeling is to predict the (time evolution of the) probability over system states:

P(X⃗, t | X⃗0, t0).    (1.1)

This is the probability that a system which was in state X⃗0 at time t0 is found in
state X⃗ at time t. Here, both X⃗ and X⃗0 are random variables. In a deterministic
model, it is possible to predict exactly which state a system is going to be found
in. In contrast, a stochastic model does not predict states, but only probabilities
of states. One hence never knows for sure which state the system is going to be
in, but one can compute the probability of it to be a certain state. For example,
in stochastic weather modeling, the model is not going to predict whether it is
going to rain or not, but rather the probability of rain (e.g., 60% chance of rain
tomorrow).
A stochastic simulation then generates specific realizations of state trajectories:

x⃗0 , x⃗1 , x⃗2 , . . . , x⃗t , (1.2)

which are simulated from the model in Eq. 1.1 using a suitable stochastic sim-
ulation algorithm, such that for each time point t:

x⃗t ∼ P(X⃗, t | X⃗0, t0),    (1.3)

i.e., the states are distributed according to the desired probability function.
While this sounds pretty straightforward conceptually, there are a number of
technical difficulties that frequently occur, among them:

• P (·) may not be known in closed form, but only through a governing
(partial) differential equation that is often not analytically solvable →
Master equations.

• Generating random numbers from the correct distribution ∼ P(·) is often not directly possible → sampling algorithms.


• Time may be a continuous variable in the model, so there exists no natural discrete sequence of states X⃗1, X⃗2, . . ., but rather a continuous X⃗(t) → stochastic calculus.
• The model probability P (·) may itself depend on system history → non-
Markovian processes.
In the rest of this course, we will go through all of these points and show some
commonly used solutions. We first consider the easier case when time is discrete,
so X⃗1, X⃗2, . . . , X⃗t with t ∈ N. Then, we introduce classic algorithms (some
of them from the very dawn of computing) before generalizing to continuous
time. Before starting, however, we are going to refresh some basic concepts
from probability theory, which will be useful for understanding the algorithms
introduced later. We do not provide a comprehensive treatment of probability
theory, but simply recall some basic principles and revisit them by means of
example. We generally consider this material a prerequisite to the course, but
still provide it here for the sake of completeness and of being explicit about
the prerequisites. Students who do not feel savvy in the basics reviewed in this
chapter are recommended to revise basic probability theory before continuing
with the course.

1.1 Elementary Probabilities


We start by defining some basic vocabulary:
Definition 1.1 (Elementary Probability). We define the following terms:
• Population: A collection of objects.
• Sample: A subset of a Population (possibly random).
• Experiment: The process of measuring certain characteristics of a sample.
• Outcome/Event: The concretely measured values from an experiment.
• Sample Space: The space of all possible outcomes (usually denoted Ω).
• Probability: The likelihood of occurrence of an outcome in the sample
space (usually denoted P (A) for outcome A).

1.1.1 Events and Axioms


All of probability theory is derived from three axioms. Axioms are postulates
that cannot be proved within the framework of the theory, but from which
all the rest of the theory can be proven if one accepts them as true. There
are slightly different versions in the literature of stating the basic axioms of
probability theory, but they are all equivalent to the following:
1. P (A) ≥ 0 ∀ A,

2. P (Ω) = 1,

3. P (A ∪ B) = P (A) + P (B) if A ∩ B = ∅.

From these three axioms, the rest of probability theory has been proven. The
first axiom simply states that probabilities are always non-negative. The sec-
ond axiom states that the probability of the entire sample space is 1, i.e., the
probability that any of the possible events happens is 1. The third axiom states
that, for two mutually exclusive events A and B (i.e., events that cannot both
happen), the probability of either A or B happening is the sum of the probability that A happens and the probability that B
happens. All three axioms perfectly align with our intuition of probability as
“chance”, and we have no difficulty accepting them.
A first thing one can derive from these axioms is the probability of logical
operations between events. For example, we find:

AND : P(A ∩ B) = P(A)P(B) if A and B are independent,

OR : P(A ∪ B) = P(A) + P(B) − P(A ∩ B),

NOT : P(¬A) = 1 − P(A).

From these three basic logical operations, all of Boolean logic can be constructed
for stochastic events.
Our intuitive notion of probability as “chance of an event happening” is quan-
tified by counting. Intuitively, we attribute a higher probability to an event
if it has happened more often in the past. We can thus state the frequentist
interpretation of probability:

P(A) = (#ways/times A happens) / (total #Ω)    (1.4)

as the fraction of the sample space that is covered by event A. Therefore, P (A)
can conceptually simply be determined by enumerating all events in the entire
sample space and counting what fraction of them belongs to A. For example,
the sample space of rolling a fair dice is Ω = {1, 2, 3, 4, 5, 6}. If we now define
A the event that an even number of eyes shows, then we have A = {2, 4, 6} and
hence #A = 3 whereas #Ω = 6 and, therefore according to the above definition
of probability: P (A) = 3/6 = 1/2. Often, however, it is infeasible to explicitly
enumerate all possible events and count them because their number may be very
large. The field of combinatorics then provides some useful formulas to compute
the total number of events without having to explicitly list all of them.

1.1.2 Combinatorics
Combinatorics can be used to compute outcome numbers even when explicit
counting is not feasible. The basis of combinatorics is the multiplication prin-
ciple: if experiment 1 has m possible events and experiment 2 has n possible
events, then the total number of different events over both experiments is mn.

Example 1.1. Let experiment 1 be the rolling of a dice observing the number of
eyes shown. The m = 6 possible events of experiment 1 are: {1, 2, 3, 4, 5, 6}. Let
experiment 2 be the tossing of a coin. The n = 2 possible events of experiment
2 are: {head, tail}. Then, the combination of both experiments has mn = 12
possible outcomes: {1-head, 1-tail, 2-head, 2-tail, ..., 6-head, 6-tail}.

There are two basic cases to compute numbers of combinations:

• Permutations: Number of distinct arrangements of r out of n distinct


objects, where order matters

n!
Pn,r = (1.5)
(n − r)!

• Combinations: Number of distinct groups (ordering within the group


does not matter) of r objects that can be chosen from a total of n objects
 
n! n
Cn,r = = . (1.6)
r!(n − r)! r
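Both formulas are easy to evaluate; the following is a minimal Python sketch using only the standard library (the function names are ours, chosen for this illustration):

from math import factorial

def n_permutations(n, r):
    """P_{n,r}: number of ordered arrangements of r out of n distinct objects (Eq. 1.5)."""
    return factorial(n) // factorial(n - r)

def n_combinations(n, r):
    """C_{n,r}: number of unordered groups of r objects chosen from n (Eq. 1.6)."""
    return factorial(n) // (factorial(r) * factorial(n - r))

print(n_permutations(5, 3))  # 60 ordered arrangements of 3 out of 5 objects
print(n_combinations(5, 3))  # 10 distinct groups of 3 out of 5 objects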

1.1.3 Probability Spaces (Ω, P)


Combining a finite sample space Ω with a function P , called probability measure,
that assigns to each event in Ω its probability of happening, yields the notion
of a probability space (Ω, P ). Therefore, a probability space is the space of all
possible events including their corresponding probabilities.

Example 1.2. The probability space for the experiment of rolling a fair dice
and observing the number of eyes shown is: ({1, 2, 3, 4, 5, 6}, {1/6, 1/6, 1/6, 1/6, 1/6, 1/6}).

A notion that is frequently used in conjunction with probability spaces is that


of a σ-Algebra. While the concept is not essential for this course, we state it
briefly for completeness.

Definition 1.2 (σ-Algebra). A σ-Algebra is a collection of subsets of Ω that is


closed under union, intersection, and complement.

Example 1.3. Let Ω = {a, b, c, d}. Then the following is a σ-Algebra: Σ =


{∅, {a, b}, {c, d}, {a, b, c, d}}. Any intersection, union, or complement of any of
these sets again yields a member of Σ. Please go ahead and convince yourself
of this. Of course, this particular Σ is just one of several σ-Algebras that can
be defined on this sample space.

σ-Algebras allow us to restrict a probability space to (Σ, P) while still being able


to construct all elementary events and have the same information represented
via the Boolean formulas above.

1.2 Conditional Probabilities


A conditional probability expresses the probability of an event given/knowing
another event has happened. For example, we might want to know the proba-
bility of rain tomorrow. After waking up in the morning, we see that the sky is
overcast, but there is (still?) no rain. So now we want to know the probability
that there will be rain given the sky is overcast. This conditional probability is
likely different (higher) than the probability we estimated the day before when
we did not yet know what the sky looks like the next morning. Conditional prob-
abilities hence include the effect of certain events on the chance of an uncertain
event.

1.2.1 Definition and properties


Definition 1.3 (Conditional Probability). The conditional probability of an
event A given an event B is:
P(A|B) = P(A ∩ B) / P(B).    (1.7)
The symbol | means “given”, so A|B is the event “A given B”.
Example 1.4. As an example, consider the standard deck of 52 playing cards.
Now we might ask:
1. What is the probability that the other player has an Ace of Hearts given
s/he told us that the card is red? We have:
P(Ace of Hearts | card is red) = (1/52) / (1/2) = 1/26.
So, knowing the card is red doubles the probability of it being an Ace of
Hearts, because Hearts are red.
2. What is the probability that the card is a King given it is red? We have:
P(King | card is red) = (2/52) / (1/2) = 1/13.
Note that in this example, knowing that the card is red does not change
the probability of it being a King, since we also have:
P(King) = 4/52 = 1/13,
which is the same since there are 4 Kings in the deck in total, two of them
red and two black.
In the second part of the above example, the probability and the conditional
probability are the same. We then say that the two events are independent.
That is, the probability of a card being a King is independent of the probability
of it being red, because there are just as many red Kings as black Kings.

Definition 1.4 (Independence). Two events A and B are called independent


if and only if:
P (A|B) = P (A).

This definition implies that for independent events A and B, we have

P (A ∩ B) = P (A)P (B),

which directly follows from combining Definitions 1.3 and 1.4.


Using this result, we can now also write the logical AND for general, non-
independent events as:

P (A ∩ B) = P (A|B)P (B) = P (A, B), (1.8)

which is called the joint probability of A and B, i.e., the probability that both
A and B happen.

1.2.2 Bayes’ Theorem


An important theorem for probabilities is Bayes’ Rule. It relates conditional
probabilities of two events in both orderings, i.e., it relates the probability of
“A given B” to the probability of “B given A”. This is useful, because one is
often interested in the one, but can only measure or observe the other. Bayes’
Theorem holds in general and not only for independent events. It is stated as:

Theorem 1.1 (Bayes). For any two random events A and B, there holds:

P(A|B) = P(B|A)P(A) / P(B).    (1.9)

In this theorem, the probability P (A|B) is called “Posterior”, P (B|A) is called


“Likelihood”, and P (A) is called “Prior”.
Bayes’ Theorem follows directly from the following identity:

P (A, B) = P (A|B)P (B) = P (B|A)P (A)

when solving for P (A|B). Independence is not assumed.

1.2.3 Law of Total Probabilities


Another useful identity for conditional probabilities is the law of total probability. Let B1, B2, . . . , Bn be a set of mutually disjoint events, i.e., Bi ∩ Bj = ∅ ∀ i ≠ j, and B1 ∪ B2 ∪ . . . ∪ Bn = Ω. Then, for any event A, we have:

P (A) = P (A|B1 )P (B1 ) + P (A|B2 )P (B2 ) + . . . + P (A|Bn )P (Bn ). (1.10)

This follows directly from the joint probability in Eq. 1.8, the additivity of probabilities (axiom 3), and the fact
that the Bi are mutually disjoint.

1.2.4 Probability Expansion


For any set of events A1 , A2 , . . . An we can write the probability that all of them
happen as:
P (A1 ∩ A2 ∩ . . . ∩ An ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) . . .
P (An |A1 ∩ A2 ∩ ... ∩ An−1 ). (1.11)
This formula is known as conditional probability expansion and will be very
important when we consider random processes later on. For now, consider the
following classic example:
Example 1.5. The “Birthday Problem”: Assume there are n, 2 ≤ n ≤ 365
people in a room, each equally likely to be born on any day of the year (ignoring
leap years). Now define the events:
• Ai: “the birthday of person i + 1 is different from those of all persons 1, . . . , i”. Using simple counting (i.e., how many days of the year are still available to choose from for a different birthday), we find:

P(A1) = (365 − 1)/365
P(A2|A1) = (365 − 2)/365
...
P(Ai|A1 ∩ A2 ∩ . . . ∩ Ai−1) = (365 − i)/365.
Note that in this case it is easy to write down the conditional probabilities knowing that the birthdays of all people before are different. Directly writing down P(Ai) would not be so easy.
• Now consider the event B: “at least two people in the room share a birth-
day”. Clearly, it is:
P (B) = 1 − P (A1 ∩ A2 ∩ . . . ∩ An−1 ).
Using the probability expansion from Eq. 1.11, this is:
P (B) = 1 − P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) . . . P (An−1 |A1 ∩ A2 ∩ . . . ∩ An−2 )
P(B) = 1 − ∏_{i=1}^{n−1} (365 − i)/365,

providing us with an easily computable formula. The results of this are somewhat counter-intuitive, as the probability of shared birthdays is higher
than we intuitively tend to expect given the year has 365 days to choose
from. Indeed, already for n = 23 people, we have P (B) = 0.5 and for
n = 50 people P (B) = 0.97, hence a 97% chance that at least two share a
birthday.
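The product in this formula is straightforward to evaluate numerically; the following minimal Python sketch (not part of the original notes) reproduces the quoted values:

def p_shared_birthday(n):
    """Probability that at least two of n people share a birthday (365-day year)."""
    p_all_different = 1.0
    for i in range(1, n):                  # factors (365 - i)/365 for i = 1, ..., n-1
        p_all_different *= (365 - i) / 365
    return 1.0 - p_all_different

print(p_shared_birthday(23))  # ~0.507
print(p_shared_birthday(50))  # ~0.970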

1.3 Random Variables


The notion of events is very general and, as we have seen in the above example,
events can be described in words in arbitrary ways. In order to enable easier
mathematical treatment and numerical computation, however, we often restrict
ourselves to events that can be expressed by numbers. We then define:
Definition 1.5 (Random Variable). A random variable (RV) is a number as-
signed to the outcome of an experiment.
Example 1.6. When rolling a fair dice, the following random variables can,
e.g., be defined:
• the number of eyes shown
• the sum of eyes shown over the past 10 rolls
• the number of times a 5 showed
• ...
In the following, we denote RVs by uppercase latin characters and the values
they assume by the corresponding lowercase character. Depending on the set of
numbers from which the values of a random variable come, we distinguish three
types of random variables: indicator/binary RV, discrete RV, and continuous
RV.

1.3.1 Indicator/binary random variables


An indicator RV I only takes values from the set B = {0, 1}. It therefore is a
binary variable. It is defined as:

I : Ω → B,  A ↦ I(A) = 1 if A happens, and 0 else.    (1.12)
Indicator RVs directly generalize the concept of events to random variables with
an easy mapping:
P (I(A) = 1) = P (A)
I(A ∩ B) = I(A)I(B)
I(A ∪ B) = I(A) + I(B) − I(A)I(B)

1.3.2 Discrete random variables


A discrete RV X takes values from a finite or countably infinite set of outcomes
S = {x1 , x2 , . . .}. It is defined as:
X:Ω→S A 7→ X(A) = xA ∈ S. (1.13)
For discrete random variables, we can write P(X = xi) for the probability that X assumes one of the possible values xi.

1.3.3 Continuous random variables


A continuous RV X takes values from a continuum, i.e., any subset J ⊂ R of
the set of real numbers R. It is defined as:

X:Ω→J (1.14)

In this case, it is not possible any more to directly map to events, because there
are infinitely many events/points in J, each with infinitesimal probability.

1.4 Probability Distributions


Probability distributions are mathematical functions that assign probability to
events, i.e., they govern how the total probability mass of 1 is distributed across
the events in the sample space, or what the probabilities are of the possible
values of a RV. Again, we distinguish the discrete and the continuous case.

1.4.1 Discrete Distributions


For discrete RVs, each possible outcome xi ∈ S can directly be assigned a
probability, as:
P (X = xi ) = PX (xi ). (1.15)

The function PX(x) is called the probability mass function (PMF) or the probability distribution function of the RV X. For discrete RVs, we can also define the Cumulative Distribution Function (CDF) FX(x) of the RV X, as:

FX(x) = P(X ≤ x) = Σ_{xi ≤ x} P(X = xi).    (1.16)

Note that by definition 0 ≤ FX (x) ≤ 1 and FX (x) is monotonically increasing,


as it simply is a cumulative sum over probabilities, which are all non-negative
by definition.

Example 1.7. Consider the experiment of rolling a fair dice once, and define
the RV X: number of eyes shown. Then, the probability mass function is:

PX(x) = 1/6 if x ∈ {1, 2, 3, 4, 5, 6}, and 0 else,

and the cumulative distribution function of X is:

FX(x) = x/6 for x ∈ {1, 2, 3, 4, 5, 6}.

1.4.2 Continuous Distributions


For continuous RVs, probability mass functions do not exist. However, since
there are infinitely many infinitesimally small probabilities, we can define the
probability density function (PDF), as:

f(x) = dF(x)/dx,    (1.17)
using the analogy between summation and integration in continuous spaces.
Note that this is not a probability. Rather, for any a, b ∈ J, we have:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx.    (1.18)

Therefore, the probability of the RV taking values in a certain interval is given


by the area under the curve of the PDF over this interval. Obviously, from the
definition of the PDF:
P(X ∈ J) = ∫_{−∞}^{∞} f(x) dx = 1,    (1.19)

so the total probability is correct. We can also compute the CDF from the PDF
by inverting Eq. 1.17:

FX(x) = ∫_{−∞}^{x} f(x̃) dx̃.    (1.20)

Finally, since the CDF is monotonic, its slope is always non-negative, thus:

f (x) ≥ 0 ∀x. (1.21)

1.4.3 Joint and Marginal Distributions


Considering more than one RV, one can define joint, marginal, and conditional
distributions analogously to joint and conditional probabilities. Obviously, all
RVs must be of the same type (continuous/discrete). In the following, we give
the definitions for two RVs.

• Joint: The joint CDF of two RVs X and Y is:

FX,Y (x, y) = P (X ≤ x, Y ≤ y). (1.22)

For discrete RVs, the joint PMF is:

PX,Y(x, y) = P(X = x, Y = y) = P(X = x ∩ Y = y) for discrete X, Y    (1.23)

and for continuous RVs, the joint PDF is:

fX,Y(x, y) = d²FX,Y(x, y)/(dx dy) for continuous X, Y.    (1.24)

• Marginal: The marginal distribution over one of the two RVs is obtained
from their joint distribution by summing or integrating over the other RV.
For the CDF, this again looks the same for discrete and continuous RVs:

FX (x) = P (X ≤ x) = P (X ≤ x, Y ≤ ∞) = FX,Y (x, ∞). (1.25)

And for the other distributions:


Discrete PMF: PX(x) = Σ_y PX,Y(x, y)

Continuous PDF: fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy.

• Conditional: The conditional probability distribution is the distribution


of one of the two RVs given the other. Again, the conditional CDF is the
same for continuous and discrete RVs:

FX|Y (x|y) = P (X ≤ x|Y ≤ y). (1.26)

And for the other distributions:


Discrete PMF: PX|Y(x|y) = PX,Y(x, y) / PY(y)

Continuous PDF: fX|Y(x|y) = fX,Y(x, y) / fY(y).

Note that the conditional distribution is the joint distribution divided by


the marginal distribution of the given RV, i.e., of the condition.

1.4.4 Moments of Probability Distributions


It is powerful to consider the moments of different order of a probability distribu-
tion. On the one hand, this is because these moments correspond to descriptive
summary statistics of the RV governed by the distribution. On the other hand,
it is often easier to write models for the moments of a distribution than for the
distribution itself. It may even be the case that a distribution is not known or
cannot be modeled explicitly, but one can still compute or model (some of) its
moments. A fundamental theorem in mathematics guarantees that knowing all
moments of a function is equivalent to knowing the function itself. This is be-
cause moments relate to Taylor expansions of the function. Often in stochastic
modeling, considering just the first few moments of a probability distribution
provides a good-enough approximation for a simulation.
The moment of order µ of a discrete RV X is defined as:
Mµ[X] = Σ_x x^µ PX(x).    (1.27)

And for a continuous RV X:


Mµ[X] = ∫_{−∞}^{∞} x^µ fX(x) dx.    (1.28)

Considering the first three moments, we find:

• M0 [X] = 1 because the total probability over the entire sample space is
always 1.

• M1 [X] = E[X], which is the expectation value of RV X.

Example 1.8. Rolling a fair dice. X is the number of eyes shown. And
from Eq. 1.27, we find:
E[X] = Σ_{i=1}^{6} i/6 = 3.5.

This is the expected value when rolling the dice many times. It is related
to the statistical mean of the observed values of X.

• M2[X] = Var(X) + E[X]² = E[X²], where Var is the variance of the RV X.

Example 1.9. For the same dice example, we find:

E[X²] = Σ_{i=1}^{6} i²/6 = 15.167

and therefore:

Var(X) = 15.167 − (3.5)² = 2.9167.
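The moment sums of Eq. 1.27 are easy to evaluate directly from the PMF; the following minimal Python sketch (illustrative only) does so for the fair-die example and cross-checks the result with a simulation:

import random

values = range(1, 7)
pmf = {x: 1 / 6 for x in values}   # PMF of a fair die

mean = sum(x * pmf[x] for x in values)              # M_1 = E[X] = 3.5
second_moment = sum(x**2 * pmf[x] for x in values)  # M_2 = E[X^2] ~ 15.167
variance = second_moment - mean**2                  # ~ 2.9167

rolls = [random.randint(1, 6) for _ in range(100_000)]  # empirical cross-check
print(mean, variance, sum(rolls) / len(rolls))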

Higher moments relate to higher cumulants of the distribution, like skewness,
kurtosis, etc. One can also define central moments that are shifted by the ex-

Kurtosis etc. One can also define central moments that are shifted by the ex-
pectation value, as this is sometimes more convenient. We refer to the literature
for these additional concepts.
The most important moments of a RV are the expectation value and the vari-
ance. They behave as follows when computed over sums or affine transforms of
RVs:

E[aX + b] = aE[X] + b    (1.29)
E[X + Y] = E[X] + E[Y]    (1.30)
Var(aX + b) = a² Var(X)    (1.31)
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y),    (1.32)

where Cov(X, Y ) is the covariance of the two random variables (related to their
correlation).

1.5 Common Examples of Distributions


We provide some examples of commonly used and classic probability distribu-
tions with their key properties and results. Keep in mind, though, that this
is only a very small selection of the many more distributions that are known
and characterized, like Gamma, Beta, Chi-squared (χ2 ), Geometric, Weibull,
Gumbel, etc.

1.5.1 Common discrete distributions


1. Binomial (Bernoulli)
 
P(X = k) = (n choose k) p^k (1 − p)^{n−k},  k ∈ {0, 1, 2, 3, ..., n}.    (1.33)

For X ∼ Bin(n, p):

E[X] = np (1.34)
Var(X) = np(1 − p). (1.35)

The binomial distribution with parameters (n, p), evaluated at k, gives the
probability of observing exactly k successes from n independent events,
each with probability p to succeed; P (X = k) = P (exactly k successes).
Example 1.10. Imagine an exam with 20 yes/no questions. What is the probability of getting all answers right by random guessing? (See also the code sketch after this list.)

X ∼ Bin(20, 1/2) ⇒ P(X = 20) = (20 choose 20) · 0.5^20 · 0.5^0 = 1 · (1/2^20) · 1 = 9.537 · 10^−7.

You would have to retake the exam 1,048,576 times until you could expect a pass. Clearly not a viable strategy.

2. Poisson
P(X = k) = e^{−λ} λ^k / k!,  k ∈ {0, 1, 2, 3, ...} = N.    (1.36)
For X ∼ Poiss(λ):

E[X] = λ (1.37)
Var(X) = λ. (1.38)

In the Poisson distribution, the variance and the mean are identical. It
has only one parameter. The Poisson distribution gives the probability
that a random event is counted k times in a certain time period, if the
event’s rate of happening (i.e., the expected number of happenings in the
observation time period) is λ. It is therefore also called the “counting
distribution”.

Example 1.11. Ceramic tiles crack with rate λ = 2.4 during firing. What
is the probability a tile has no cracks?
X ∼ Poiss(2.4) ⇒ P(X = 0) = e^{−2.4} · 2.4^0 / 0! = e^{−2.4} = 0.0907.
So only about 9% of tiles would survive and a better production process
should be found that has a lower crack rate.
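The two worked examples above can be reproduced with a few lines of Python (standard library only); this is a minimal sketch and not part of the original notes:

from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p), Eq. 1.33."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poiss(lam), Eq. 1.36."""
    return exp(-lam) * lam**k / factorial(k)

print(binomial_pmf(20, 20, 0.5))  # Example 1.10: ~9.54e-07
print(poisson_pmf(0, 2.4))        # Example 1.11: ~0.0907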

1.5.2 Common continuous distributions


1. Uniform
fX(x) = 1/(b − a),  a ≤ x ≤ b, x ∈ R    (1.39)
FX(x) = (x − a)/(b − a),  a ≤ x ≤ b, x ∈ R.    (1.40)

For X ∼ U(a, b):

E[X] = (a + b)/2    (1.41)
Var[X] = (b − a)²/12.    (1.42)
The uniform distribution formalizes the case where all values within an
interval [a, b] over the real numbers are equally likely.
Example 1.12. Random number generation → See next chapter.

2. Exponential

fX(x) = λ e^{−λx},  x ≥ 0    (1.43)
FX(x) = 1 − e^{−λx},  x ≥ 0.    (1.44)

For X ∼ Exp(λ):

E[X] = λ^{−1}    (1.45)
Var[X] = λ^{−2}.    (1.46)

The exponential distribution governs waiting times between memoryless


random events occurring at rate λ.
Example 1.13. An engine fails on average once per 100,000 km. How likely is it to last 200,000 km? (See also the code sketch after this list.)

X ∼ Exp(1/100,000) ⇒ P(X > 200,000) = 1 − P(X ≤ 200,000) = 1 − FX(200,000) = 1 − (1 − e^{−10^{−5} · 2·10^5}) = e^{−2} = 0.1353.

So about 13% of all engines will reach double their mean lifetime.

3. Normal/Gaussian

fX(x) = (1/√(2πσ²)) e^{−(1/2)((x−µ)/σ)²},  x ∈ R    (1.47)
FX(x) = Φ((x − µ)/σ) = (1/2)[1 + erf((x − µ)/(σ√2))].    (1.48)
The CDF of the Gaussian distribution, Φ, has no analytical form, but
can be computed numerically due to its relation with the error function
erf(·), which is a so-called “special function” for which good numerical
approximations exist. For X ∼ N (µ, σ 2 ):

E[X] = µ    (1.49)
Var[X] = σ².    (1.50)

The normal distribution governs measurements with a true value of µ and


measurement uncertainty/error of σ (called standard deviation).

Example 1.14. IQs of people are distributed around µ = 100 with a stan-
dard deviation of σ = 15. What is the probability of having an IQ > 110?

X ∼ N(100, 15²) ⇒ P(X > 110) = 1 − P(X ≤ 110) = 1 − Φ((110 − 100)/15) = 1 − Φ(2/3) = 1 − 0.7454 = 0.2546.


So about a quarter of all people have an IQ higher than 110. The value of
Φ was taken from a table.
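The exponential and Gaussian examples above can also be checked with a few lines of Python using the error function from the standard library (a minimal sketch, not part of the original notes):

from math import exp, erf, sqrt

def exponential_sf(x, lam):
    """P(X > x) for X ~ Exp(lam), i.e., the survival function 1 - F_X(x)."""
    return exp(-lam * x)

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via Eq. 1.48."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

print(exponential_sf(200_000, 1 / 100_000))  # Example 1.13: e^-2 ~ 0.1353
print(1 - normal_cdf(110, 100, 15))          # Example 1.14: ~0.25 (cf. the table-based value 0.2546)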

1.5.3 Scale-free distributions


A special class of distributions are the so-called “scale-free” distributions. Intu-
itively, we often ask ourselves: “Why does one so often observe rare events?”.
This is because our intuition likes normal distributions where occurrences are
relatively tightly distributed around a mean (e.g., body sizes) and exponential
distributions (e.g., time to wait for the bus). According to this intuition, earth-
quakes of magnitude 8 or 9 should basically never occur given that the mean
earthquake magnitude is about 2.3. However, they do occur much more frequently
than we expect. This is, of course, because earthquake magnitudes are not
normally or exponentially distributed.
Distributions we find “intuitive” have a characteristic scale (µ for the normal and
λ for the exponential distribution), which somewhat sets the order of magnitude
of the values that occur. However, many real-world processes are scale-free
with values spanning several orders of magnitude without any preferred “typical
value”. Examples include city sizes (from a few 100 to tens of millions without
any preferential scale), wealth (from a few cents to billions of dollars), strength
of earthquakes, stock market values, etc. In a scale-free distribution, very large
and very small values are not rare.

Example 1.15. A classic example of a scale-free distribution is the power law:

P(x) ∝ x^{−α},  α > 0.    (1.51)

Another example is the Gutenberg-Richter law:

log N = a − bM, (1.52)

for the frequency N (M ) of earthquakes of magnitude M . Also the log-normal


distribution is an example of a scale-free distribution.
Scale-free distributions are typically the result of multiplicative processes where
effects multiply instead of just adding up (e.g., stock market). Their moments
are often infinite, such that a variance cannot really be computed (or it would
be infinity), and sometimes even the mean cannot be computed. Nevertheless,
scale-free distributions are very important, e.g., when studying criticality in
physics, scaling in biology, or processes on networks in engineering.
Chapter 2

Random Variate Generation

All (and more than) you ever wanted to know about the topics covered in this
chapter can be found in the book: “Non-Uniform Random Variate Generation”
by Luc Devroye (Springer, 859 pages).

The most basic task in any stochastic simulation is the ability to simulate re-
alizations of a random variable from a given/desired probability distribution.
This is called random variate generation or simulation of random variables. On
electronic computers, which are deterministic machines, the simulated random
variates are not truly random, though. They just appear random (e.g., by sta-
tistical test) and follow the correct distribution, which is why we say that the
random variable is “simulated” and it is not the real random variable itself.
A notable exception is special hardware, such as crypto cards, that generates
true random numbers. This often involves a radioactive source, where the time
between decay events is truly random and exponentially distributed.

2.1 Transformation of Random Variables


Many simulation methods for random variables are based on transformation of
random variables. We therefore start by providing the basic background on
random variable transformation. Let a random variable X be given and define
a second random variable from it, as:

Y = g(X)

for a given transformation function g that is defined on all values the RV X can take (i.e., the domain of g has to include the entire range of X). For a valid transform, the inverse (preimage)

g^{−1}(A) := {x : g(x) ∈ A}    (2.1)

of any set A exists, but is not necessarily a single point. The function g maps from
the space in which X takes values to the space in which Y takes values.


Example 2.1. Consider the following transformation functions:

• g(x) = x³
⇒ g^{−1}({1, 8}) = {1, 2}.    (2.2)
In this case, the inverse is unique, as g is a bijection.

• g(x) = sin x
⇒ g^{−1}(0) = {kπ : k ∈ Z}.    (2.3)
Here, the inverse is not unique and maps to a countably infinite set.
The map A ↦ P(g(X) ∈ A) = P(X ∈ g^{−1}(A)) satisfies the axioms of probability and therefore allows us to define probability distributions over Y =
g(X). How these distributions are defined, and what they look like, depends on
whether (X, Y ) are continuous or discrete random variables.

2.1.1 Transformed discrete distributions


For X and Y discrete with probability mass functions PX(x), PY(y), we have for Y = g(X):

PY(y) = Σ_{x ∈ g^{−1}(y)} PX(x).    (2.4)

Example 2.2. Let X be uniformly distributed over the integers from 1 to n. Then, the probability mass function of X is:

PX(x) = 1/n if x ∈ {1, 2, . . . , n}, and 0 else.

Now consider the transformation Y = X + a, which adds a given constant a to each realization of X. From Eq. 2.4, we find:

PY(y) = 1/n if y ∈ {a + 1, . . . , a + n}, and 0 else.

2.1.2 Transformed continuous distributions


Let X and Y be continuous random variables with the CDF of X being FX (x).
Further assume that g is uniquely invertible, i.e., it is a bijection. While this
assumption is not necessary, the case where g is not a bijection (e.g., the sin
example above) is more difficult to treat and omitted here. Since g is a bijection,
it is a monotonic function. There are two cases:
For monotonically increasing g, we have for the CDF of Y = g(X):

FY(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = FX(g^{−1}(y))    (2.5)

and for the PDF:

fY(y) = FY′(y) = fX(g^{−1}(y)) · (d/dy) g^{−1}(y),    (2.6)

where we have used the chain rule of differentiation in the last step.
For monotonically decreasing g, we analogously find:

FY(y) = P(g(X) ≤ y) = P(X > g^{−1}(y)) = 1 − FX(g^{−1}(y))    (2.7)

fY(y) = −fX(g^{−1}(y)) · (d/dy) g^{−1}(y).    (2.8)

This is easily understood by drawing the graphs of g for the two cases and
observing that the half-space Y ≤ y gets mapped to X ≤ g −1 (y) in one case,
and to X > g −1 (y) in the other.

Example 2.3. Let X ∼ U(0, 1). The PDF of this continuous random variable is:

fX(x) = 1 if x ∈ [0, 1], and 0 else.

Now consider the transformed random variable Y = g(X) = 1 − X. This function is uniquely invertible as:

y = g(x) = 1 − x ⇒ g^{−1}(y) = 1 − y for x, y ∈ [0, 1].

The function g is monotonically decreasing. From Eq. 2.8, we thus find:

fY(y) = −fX(1 − y) · (−1) = 1 if y ∈ [0, 1], and 0 else,

which is the same as the PDF of X. We hence find the important result that if
X ∼ U(0, 1), then also Y = 1 − X ∼ U(0, 1), i.e., the probability of a uniformly
distributed event to not happen is also uniformly distributed.

2.1.3 The Inversion Transform


The inversion transform is an important identity when generating random variates from a given continuous distribution. Let X be a continuous random variable. Its CDF FX(x) is monotonically increasing by definition. Now set the special transform Y = g(X) = FX(X). Then:

FY(y) = P(Y ≤ y) = P(FX(X) ≤ y) = P(X ≤ FX^{−1}(y)) = FX(FX^{−1}(y)) = y.

The distribution with FY(y) = y is the uniform distribution over the interval [0, 1]. Therefore, random variables from a given cumulative distribution FX can be simulated from uniform ones by X = FX^{−1}(U(0, 1)) ∼ FX(x). This endows uniform random numbers with special importance for stochastic simulations.

2.2 Uniform Random Variate Generation


Due to the inversion transform, random variables of any given distribution can
be simulated if we can simulate uniformly distributed random variables over
the interval [0, 1]. We therefore first consider the problem of simulating U(0, 1)
on a deterministic computer. There are two approaches to this: algorithmic or
data-driven. Data-driven approaches include, e.g., measuring the time between
two keystrokes of the user, taking every y-th byte of network traffic flowing
through the network interface of the computer, or reading every x-th byte of
data from a storage medium. Of course, neither the data on a storage medium,
nor network traffic, nor user keystrokes are truly random, but they may appear
random in the absence of a predictive model. Data-driven approaches are often
used by cryptography applications because the sequence of numbers is truly
unpredictable (even if it is not truly random). For simulations, however, the
algorithmic approach is more practical, because it allows us to generate many
random numbers quickly, not limited by the speed of a data stream, and it
ensures reproducible simulation results. We hence focus on the algorithmic
approach here.
Algorithms for random variable simulation are called Random Number Gener-
ators (RNG). Two types of RNGs exist:
• pseudo-RNGs use deterministic algorithms to produce sequences of num-
bers that appear random by statistical tests, but are predictable.
• quasi-RNGs use deterministic algorithms to produce low-discrepancy
sequences, sampling more “evenly” while still appearing random but being
predictable.
In both cases, we require:

lim_{n→∞} |F̂n(x) − x| = 0  ∀x ∈ [0, 1],    (2.9)

where F̂n (x) is the empirical CDF over n samples from the RNG and F (x) = x
is the uniform CDF we want to simulate. This requirement means that if we
generate infinitely many random numbers, then their CDF is identical to the
CDF we want to simulate.

2.2.1 Uniform Pseudo-Random Number Generation


Uniform pseudo-random number generators over the interval [0, 1] are avail-
able in all programming languages as basic intrinsic functions, and they are
the core building block of any stochastic simulation. There exists a host of
well-known pseudo-RNG algorithms. The classic example, which illustrates the
working principle of pseudo-RNGs, is the Linear Congruential Generator
(a.k.a. multiplicative congruential generator, Lehmer generator), which com-
putes the sequence of numbers:

zi = azi−1 mod m, i = 1, 2, 3, . . . (2.10)



which means that it computes the remainder of the division a·z_{i−1}/m. By definition of the remainder, 0 ≤ zi < m. Thus:

ui = zi/m ∈ [0, 1).    (2.11)
The start value (called “seed”) z0 and the two integers m > 0 and 0 < a < m
must be chosen by the user. There are also versions in the literature that
include an additional constant shift, providing one more degree of freedom, but
the general principle remains the same. It can be shown that for the linear
congruential generator,

|F̂n (u) − u| ≤ ε(m) ∀u (2.12)

with ε(m) ↓ 0 as m → ∞. Therefore, the linear congruential generator is a valid


pseudo-RNG in the sense of Eq. 2.9. The question remains, however, how to
choose a and m. Usually (a, m) are chosen very large and mutually prime. As
the upper error bound ε(m) becomes smaller for larger m, it is good to choose
m as large as possible. The following is a good choice on a 32-bit numeric type
(C++11 standard):

a = 48271, m = 2^31 − 1.
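A minimal Python sketch of such a generator with the parameters above may help illustrate the principle (for illustration only; this is not a production-quality RNG):

def lcg(seed, a=48271, m=2**31 - 1):
    """Linear congruential generator, Eqs. 2.10/2.11: yields u_i = z_i / m in [0, 1)."""
    z = seed  # the seed z_0 must be an integer with 0 < z_0 < m
    while True:
        z = (a * z) % m   # z_i = a * z_{i-1} mod m
        yield z / m       # u_i = z_i / m

gen = lcg(seed=12345)
print([next(gen) for _ in range(5)])  # first five pseudo-random numbers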
The linear congruential generator is very simple, easy to implement, and illus-
trates well how pseudo-RNGs work. However, much better pseudo-RNGs are
known nowadays, and the linear congruential generator is not used any more
in practice. Its main problem is the relatively short cycle length. Since the
set of machine numbers (i.e., finite-precision floating-point numbers) is finite,
the values ui necessarily start to repeat after some point T. Therefore, uT
is the first number produced that is identical to an already seen number, say
u0. Since the algorithm of a pseudo-RNG is deterministic, the entire sequence
uT+1, uT+2, . . . = u1, u2, . . . repeats itself until it hits uT again, and then it
repeats identically once more. Therefore, the number T is called the cycle length
of the RNG. Every deterministic pseudo-RNG has a finite cycle length. Ob-
viously, any stochastic simulation that requires more pseudo-random numbers
than T is not going to advance any more beyond T , as it is simply going to
recompute results it has already computed. In practice, stochastic simulations
routinely require millions of random numbers. The linear congruential genera-
tor has a relatively short cycle length. The actual value of T depends on m and
a. If m is a power of 2, then the cycle length cannot be more than m/4, with
the low bits having an even shorter period than the high bits. Better algorithms
with longer cycles are known and mostly used nowadays, e.g.:

• Mersenne Twister (1998) – Most commonly used

• Park-Miller (1988) – C/C++ std

• XOR-Shift (2003) – Good for GPU

• Xoroshiro128+ (2018) – Fastest on 64-bit CPU



They are all simple recursion formulas operating on integers or bit strings, which
makes them very fast and easy to implement.
A particular problem occurs when using pseudo-RNGs in parallel computing.
There, obviously, one wants that each processor or core simulates a different
random number sequence. If they all compute the identical thing, the paral-
lelism is wasted. The simplest way to have different random number sequences
is to use a different seed on each processor or core. However, using different
seeds may not change the cycle, nor its length, but could simply start the par-
allel sequences at different locations in the cycle. So when using P processors,
the effectively usable cycle length on each processor is T /P on average. Beyond
this, processors recompute results that another processor has already computed
before. Special Parallel Random Number Generators (PRNGs) therefore ex-
ist, which one should use in this case. They provide statistically independent
streams of random numbers that have full cycle length on each processor and
guarantee reproducible simulation results when re-running the simulation on
different numbers of processors. We do not go into detail on PRNGs here, but
refer to the literature and the corresponding software libraries available.

2.2.2 Uniform Quasi-Random Number Generation


It is a general property of uniformly distributed pseudo-random numbers that
they form “clusters” and “holes” in high-dimensional spaces. If one for example
generates uniform pseudo-random numbers (x, y) and plots them as points in
the x−y plane, they do not look homogeneously distributed. Instead, they form
visible clusters, despite the fact that they are statistically uniformly distributed.
Of course, for n → ∞, the entire plane will be covered and the distribution is
indistinguishable from the exact uniform distribution, as required. But for finite
numbers of samples, the clustering is clearly visible.
Quasi-RNGs solve this problem by generating more evenly scattered samples
with the same statistical properties. These sequences of numbers are also called
low-discrepancy sequences because they aim to reduce a quality measure called
“discrepancy”:
Definition 2.1 (Discrepancy). The discrepancy Dn of a finite sequence of n numbers xi, i = 1, 2, . . . , n is:

Dn = max_{0≤u≤1} | (1/n) Σ_{i=1}^{n} I_{[0,u]}(xi) − u |,    (2.13)

where the indicator function over the interval [0, u] is I_{[0,u]}(x) = 1 if x ∈ [0, u], and 0 else.
The sum over the indicator, divided by n, is the fraction of samples in the interval [0, u].
The expected fraction under the uniform distribution is u. The
discrepancy therefore measures the largest deviation of the empirical CDF from
the true CDF for any finite number n of samples. For small sample counts, the
CDF of a low-discrepancy sequence converges faster to the true CDF than that
of a pseudo-random number sequence. However, low-discrepancy sequences may suffer from aliasing artifacts for large sample counts.
A simple classic quasi-RNG is the Additive Recurrence Sequence:

u_{i+1} = (u_i + α) mod 1    (2.14)

with an irrational number α. This sequence has Dn ∈ O_ε(n^{−1+ε}), which means that the discrepancy scales inversely proportional to n. Good choices of α are: (√5 − 1)/2 or √2 − 1.
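A minimal Python sketch of this recurrence, using the golden-ratio choice α = (√5 − 1)/2 mentioned above (this snippet is illustrative and not part of the original notes):

from math import sqrt

def additive_recurrence(n, u0=0.0, alpha=(sqrt(5) - 1) / 2):
    """First n terms of the low-discrepancy sequence u_{i+1} = (u_i + alpha) mod 1 (Eq. 2.14)."""
    seq = []
    u = u0
    for _ in range(n):
        u = (u + alpha) % 1.0
        seq.append(u)
    return seq

print(additive_recurrence(5))  # evenly scattered points in [0, 1)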
The Additive Recurrence Sequence is easy to generate and understand, but is
not the best low-discrepancy sequence known. Better ones include:
• Van-der-Corput sequence
• Halton sequence
• Faure sequence
• Niederreiter sequence
• Sobol sequence
• ...
We refer to the literature for more information about them.

2.3 Inversion Transform for Non-uniform Distributions
Using the result from Section 2.1.3, sequences of uniformly distributed pseudo- or quasi-random numbers can be transformed to another distribution with known and invertible CDF FX, as:

X = FX^{−1}(Y) ∼ FX with Y ∼ U(0, 1).    (2.15)
In the best case, the inverse CDF FX^{−1} can be computed analytically and can
directly be used as a transform. If this is not possible, numerical methods of
inversion can be used, such as line search or Regula Falsi.
Example 2.4. We want to generate exponentially distributed pseudo-random
numbers X ∼ Exp(λ). From the CDF of the exponential distribution, we ana-
lytically find the inverse:

FX(x) = 1 − e^{−λx} = y    (2.16)
e^{−λx} = 1 − y    (2.17)
−λx = log(1 − y)    (2.18)
x = −(1/λ) log(1 − y) = FX^{−1}(y).    (2.19)

As y ∼ U(0, 1), also r = 1 − y ∼ U(0, 1), according to Example 2.3.

⇒ X = −(1/λ) log(R) for R ∼ U(0, 1).    (2.20)
This simple formula then generates exponentially distributed pseudo-random
numbers with parameter λ.
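In code, Eq. 2.20 becomes a one-liner; the following is a minimal sketch using NumPy (assumed available) for the uniform variates:

import numpy as np

def exponential_inversion(lam, n, seed=0):
    """Generate n Exp(lam) variates via the inversion transform, Eq. 2.20."""
    rng = np.random.default_rng(seed)
    r = rng.random(n)               # R ~ U(0, 1)
    return -np.log(1.0 - r) / lam   # 1 - R is also U(0, 1) and avoids log(0), since r < 1

samples = exponential_inversion(lam=2.0, n=100_000)
print(samples.mean())  # should be close to E[X] = 1/lambda = 0.5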
While the above example illustrates the concept of inversion transforms, prac-
tical cases are not always this simple. A prominent example is the genera-
tion of normally/Gaussian distributed pseudo-random numbers. A simple idea
would be to use the Central Limit Theorem, stating that the average of a large
number of random variables is normally distributed, regardless of how the in-
dividual random variables are distributed. So one could compute averages over
large, independent collections of uniform pseudo-random numbers, and get a
standard-normal variable:
X = (1/√(n/12)) (Σ_{i=1}^{n} Ri − n/2)    (2.21)

from n uniformly distributed Ri ∼ U(0, 1), i = 1, 2, . . . , n. Here, n/2 is the mean of the sum of the n uniformly distributed values, and n/12 its variance (see Sec. 1.5.2).
However simple, this method is not very good. Due to the slow convergence of
the Central Limit Theorem, the number n needs to be very large in order to
get a good approximation to the Gaussian distribution. Therefore, this method
requires many (typically thousands) uniform random numbers to generate a
single Gaussian random number, with obvious problems for the cycle length of
the generator.
A better way of generating Gaussian random numbers is the Box-Muller trans-
form. It is a classic inversion transform method. However, the CDF of the
Gaussian distribution is not analytically invertible (in fact, the CDF cannot be
expressed in closed form with elementary functions, see Section 1.5.2). But a little trick helps: considering two
independent Gaussian random numbers (x1 , x2 ) with mean 0 and variance 1
(i.e., standard-normal distribution) jointly as the coordinates of a point in the
2D plane, and converting to polar coordinates (r, θ):
r² = x1² + x2² ∼ χ²(2) = Exp(1/2)
θ ∼ U(0, 2π).
The sum of squares of k independent standard Gaussian random variables is
Chi-square distributed with parameter k, χ2 (k), which is χ2 (2) for 2 standard
Gaussian random variables. The Chi-square distribution with parameter 2 is
identical to the exponential distribution with parameter 1/2. And, since the 2D
standard Gaussian is rotationally symmetric, the angle is uniformly distributed.
From above, we know how to invert the exponential distribution, so we can
analytically invert the coordinate transform and find:
x1 = cos(2πr2) √(−2 log(r1))
x2 = sin(2πr2) √(−2 log(r1)).    (2.22)

This is the Box-Muller transform. It generates two independent standard-normal random numbers (x1, x2) from two independent uniformly distributed random numbers (r1, r2) ∼ U(0, 1), which makes it optimally efficient. Because it is based on an analytical inversion, it is also accurate to machine precision.
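A minimal NumPy sketch of the Box-Muller transform of Eq. 2.22 (illustrative only; NumPy is assumed to be available):

import numpy as np

def box_muller(n_pairs, seed=0):
    """Generate 2*n_pairs independent standard-normal variates from uniform ones (Eq. 2.22)."""
    rng = np.random.default_rng(seed)
    r1 = rng.random(n_pairs)
    r2 = rng.random(n_pairs)
    radius = np.sqrt(-2.0 * np.log(1.0 - r1))  # 1 - r1 ~ U(0, 1); avoids log(0)
    x1 = radius * np.cos(2.0 * np.pi * r2)
    x2 = radius * np.sin(2.0 * np.pi * r2)
    return np.concatenate([x1, x2])

samples = box_muller(50_000)
print(samples.mean(), samples.var())  # close to 0 and 1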
Note that the Box-Muller transform is not the only way of generating normally
distributed random variates. Another famous example is the Marsaglia polar
method, which avoids the evaluation of trigonometric functions. This may be
preferred in practical implementations, since trigonometric functions are expen-
sive to compute on CPUs, where they are typically approximated by finite series
expansions. We derived the Box-Muller transform here because it is easiest to
understand and illustrates well the inversion principle. The other standard nor-
mal transforms can be understood from it.
In descriptive statistics, there is also a class of transforms called power trans-
forms, which are designed to make empirical data “look more normally dis-
tributed”. Examples include the Box-Cox transform or the Yeo-Johnson trans-
form. They are, however, not directly suited to simulate Gaussian random
variables, as they are only approximate, i.e., they produce “Gaussian-like” dis-
tributions as suited for statistical tests of normality.
For the following distributions, pseudo-random number generators based on the
inversion transform are also known: Poisson, Chi-square, Beta, Gamma, and
Student t (see literature).

2.4 Accept-Reject Methods for Non-uniform Distributions
Accept-Reject methods are an algorithmic alternative to inversion-transform
methods for simulating continuous random variables from non-uniform distri-
butions. As they do not require the (analytical or numerical) computation of
an inverse function, accept-reject methods are very general and easy to imple-
ment. They may, however, be rather inefficient, requiring multiple uniformly
distributed random numbers to generate one random number from the target
distribution. Accept-Reject methods are based on the following theorem:
Theorem 2.1. Simulating X ∼ fX (x) is equivalent to simulating:

(X, U ) ∼ U{(x, u) : 0 < u < fX (x)}. (2.23)

This is because
$$f_X(x) = \int_0^{f_X(x)} \mathrm{d}u,$$
but we omit the formal proof here. Theorem 2.1 provides a recipe for simu-
lating random variables X with given PDF fX (x) using pairs of independent
uniform random numbers (x, u), as illustrated in Fig. 2.1. The theorem tells us
that generating random numbers from the PDF fX can be done by uniformly
sampling (x, u) in the bounding box of fX and only accepting points for which
0 < u < fX (x) (circles in Fig. 2.1). The x-component of the accepted point is
then used as a pseudo-random number. Points above the graph of fX (crosses
in Fig. 2.1) are rejected and not used.

Figure 2.1: Illustration of the Accept-Reject method for simulating a random
variable X with PDF fX . Crosses are rejected, circles are accepted.
This is very easy to implement and always works. However, it may be inefficient
if the area under the curve of fX only fills a small fraction of the bounding
box, i.e., V ol({x, u}) ≫ 1, which for example is the case if fX is very peaked
or has long tails. In particular, this method becomes difficult for PDFs with
infinite support, such as the Gaussian, where one needs to truncate somewhere,
incurring an approximation error in addition to inefficiency.
Fortunately, we are not limited to sampling points in the bounding box of fX ,
but we can use any other probability density function g(x) for which fX (x) ≤
µg(x) for some µ > 0, i.e., the graph of µg(x) is always above the graph of fX
for some constant µ. Then, simulating X ∼ fX (x) is equivalent to simulating
pairs (y, u) such that:
y ∼ g(x), u ∼ U(0, 1) (2.24)
and accepting the sample x = y if u ≤ fX (x)/µg(x). Obviously, this requires one
to be able to efficiently generate random numbers from the proposal distribution
g(x), so typically one wants to use a g(x) for which an explicit inversion formula
exists, like an exponential or Gaussian distribution.
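A small Python sketch of this general Accept-Reject scheme (our own code; the function names and the toy target f(x) = 2x on [0, 1] are illustrative only, with a uniform proposal so that the scheme reduces to the bounding-box case):

import numpy as np

def accept_reject(f, g_pdf, g_sample, mu, n, rng=None):
    # General Accept-Reject: proposals y ~ g, accepted if u <= f(y) / (mu * g(y)),
    # where f(x) <= mu * g(x) for all x (cf. Eq. 2.24 and the acceptance rule).
    rng = np.random.default_rng() if rng is None else rng
    out = []
    while len(out) < n:
        y = g_sample(rng)                  # sample from the proposal g
        u = rng.uniform()
        if u <= f(y) / (mu * g_pdf(y)):    # accept with probability f(y)/(mu*g(y))
            out.append(y)
    return np.array(out)

# Example: target f(x) = 2x on [0, 1], uniform proposal g = U(0, 1), mu = 2
samples = accept_reject(lambda x: 2.0 * x,
                        lambda x: 1.0,
                        lambda rng: rng.uniform(),
                        2.0, 10_000)
print(samples.mean())   # should be close to 2/3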

2.4.1 Composition-Rejection Method


Another possibility of improving the efficiency (i.e., increasing the fraction of
accepted samples) of the Accept-Reject method is composition-rejection sam-
pling. The idea here is to bin the target distribution fX into dyadic intervals,
as illustrated in Fig. 2.2. The first bin contains all blocks of (x, u) for which
0 ≤ fX (x) ≤ a for some constant a. The second bin contains all blocks with
a < fX (x) ≤ 2a, the third 2a < fX (x) ≤ 4a, and so on.

Figure 2.2: Illustration of composition-rejection sampling.

This is called dyadic binning because the bin limits grow as powers of two.
If one then performs Accept-Reject sampling independently in these bins (i.e.,
first choose a bin proportional to its total probability, then do accept-reject inside
that bin), the efficiency is better than when performing Accept-Reject sampling
directly on fX . This is obvious because the “empty” blocks that do not overlap
with the area under the curve of fX are never considered.
Chapter 3

Discrete-Time Markov Chains

Markov Chains are one of the most central topics in stochastic modeling and
simulation. Both discrete-time and continuous-time variants exist. Since the
discrete-time case is easier to treat, that is what we are going to start with.
Markov Chains are a special case of the more general concept of stochastic
processes.

3.1 Discrete-Time Stochastic Processes


In a discrete-time model, time is a discrete variable and can therefore be rep-
resented by integer numbers. The intuition is not to talk about time as a con-
tinuum, but about time steps or time points, such as for example years (2018,
2019, 2020, . . .) or hours of the day (0, 1, 2, . . ., 23).
Definition 3.1 (Discrete-Time stochastic process). A stochastic process in dis-
crete time n ∈ N = {0, 1, 2, . . .} is a sequence of random variables X0 , X1 , X2 , . . .
denoted by {Xn : n ≥ 0}. The value xn of Xn is called the state of the process
at time n; The value x0 of X0 is the initial state.
So, in the most general case, a stochastic process is any sequence of random
variables. In the discrete-time case, time is given by an integer index n. We
further distinguish between:
1. Discrete-state processes where the random variables Xn take values in a
discrete space (e.g., Xn ∈ Z = {. . . , −2, −1, 0, 1, 2, . . .}).
2. Continuous-state processes where the random variables Xn take values in
a continuous space (e.g., Xn ⊂ Rd , d ≥ 1).
Definition 3.2 (State space). The state space S of a stochastic process {Xn :
n ≥ 0} is the space of all possible values of the random variables Xn .


Therefore, a discrete-state stochastic process has a discrete state space, whereas


a continuous-state stochastic process has a continuous state space.
Stochastic processes are widely used to model the time evolution of a random
phenomenon, such as:

• Population size in year n.

• Amount of water in nth rain of the year.

• Time the nth patient spends in hospital.

• Outcome of nth rolling of a dice.

These are all discrete-time processes, some with discrete state space (e.g., dice)
and some with continuous state space (e.g., amount of rain).
The challenge in stochastic modeling is to find a stochastic process model {Xn :
n ≥ 0} that is complex enough to capture the phenomenon of interest, yet
simple enough to be efficiently computable and mathematically tractable.

Example 3.1. Consider the example of repeatedly tossing a fair coin and let
Xn be the outcome (head or tail) of nth toss. Then:
 
$$X_n \sim \mathrm{Bin}\!\left(\tfrac{1}{2}\right) \;\Rightarrow\; P(X_n = \text{head}) = P(X_n = \text{tail}) = \tfrac{1}{2} \quad \forall n.$$

In this example, all Xn in {Xn : n ≥ 0} are independent of each other and


identically distributed (i.i.d.).

i.i.d. random processes are the easiest to deal with. They:

1. are defined by a single distribution,

2. obey the central limit theorem,

3. allow simply multiplying the probabilities of elementary events.

3.2 Markov Chains


For many phenomena, i.i.d. processes are not powerful enough to be a good
model. For example, one would expect the population size of a country in year
n + 1 to be large if it was already large in year n. Often, one therefore wants a
model where the distribution of Xn+1 depends on the value xn of Xn in order
to allow for correlations between random variables in the stochastic process.
However, allowing for all possible correlations between all n random variables
would be computationally infeasible and in many cases overkill, as dependence
on the current value of the stochastic process is often sufficient. This then
defines:

Definition 3.3 (Markov chain). A discrete-time stochastic process {Xn : n ≥


0} is called a Markov Chain if and only if for all times n ≥ 0 and all states
xi , xj ∈ S:

P (Xn+1 = xj |Xn = xi , Xn−1 = xi−1 , . . . , X0 = x0 )


= P (Xn+1 = xj |Xn = xi ) = Pij . (3.1)

This means that the probability distribution of the next state only depends on
the current state, but not on the history of how the process arrived at the cur-
rent state. Markov Chains are the next-more-complex stochastic process after
i.i.d. processes. They implement a one-step dependence between subsequent
distributions. Despite their simplicity, Markov Chains are very powerful and
expressive, while still remaining mathematically tractable. This explains the
widespread use and central importance of Markov Chains in stochastic model-
ing and simulation.
Equation 3.1 is called the Markov Property. For discrete-state processes (i.e.,
S is discrete), the number 0 ≤ Pij ≤ 1 is the probability to move to state xj
whenever the current state is xi . It is called the one-step transition probability
of the Markov Chain. For discrete S with finite |S|, the matrix of all one-step
transition probabilities P = (Pij ), ∀(i, j) : xi , xj ∈ S, is the transition matrix of
the Markov Chain. It is a square matrix with non-negative entries.
We have $\sum_{j\in S} P_{ij} = 1$, since upon leaving state xi , the chain must move to one
of the states xj (possibly the same state, xj = xi , which is the diagonal element
of the matrix). Therefore, each row i of P is a probability distribution over
states reachable from state xi , i.e., P is a row-stochastic matrix.
Due to the Markov property, the future process Xn+1 , Xn+2 , . . . is independent
of the past process X0 , X1 , . . . , Xn−1 , given the present state Xn .

3.3 Markov Chains as Recursions


There is a close relationship between Markov Chains and recursive functions.
Indeed:
Theorem 3.1. Let f (x, u) be a real-valued function of two variables. Let {Un :
n ≥ 0} be an i.i.d. discrete-time random process. Then:

xn+1 = f (xn , Un ), n ≥ 0 (3.2)

is a Markov Chain if x0 is independent of {Un : n ≥ 0}.


The transition probabilities of this Markov Chain are given by Pij = P (f (xi , un ) =
xj ).
It is clear that Xn+1 only depends on xn and Un . Since the random variables Un
are independent of the past (they are i.i.d. by definition), the Markov property
holds. Note that the transition probability is still allowed to change over time,
which can be made explicit by writing f (xn , Un , n).
Remarkably, however, the converse is also true:

Theorem 3.2. Every Markov Chain {Xn = xn : n ≥ 0} can be represented as


a recursion
xn+1 = f (xn , Un ), n ≥ 0 (3.3)
for some suitable f and i.i.d. {Un : n ≥ 0}.
In particular, the {Un : n ≥ 0} can always be chosen i.i.d. from U(0, 1) with
f adjusted accordingly. Therefore, every Markov Chain can be simulated if
one can simulate uniformly distributed random variables (see Section 2.2). The
proof is more involved, but is based on the inversion transform.
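To illustrate this (our own sketch, not part of the notes), a discrete-state Markov Chain with a given transition matrix can be simulated with one uniform random number per step by inverting the cumulative distribution of the current row:

import numpy as np

def simulate_chain(P, x0, n_steps, rng=None):
    # Simulate a discrete-state Markov chain with transition matrix P using one
    # U(0,1) random number per step: the next state is obtained by inverting the
    # cumulative distribution of the current row of P (cf. Theorem 3.2).
    rng = np.random.default_rng() if rng is None else rng
    path = [x0]
    for _ in range(n_steps):
        u = rng.uniform()
        cdf = np.cumsum(P[path[-1]])          # CDF over successor states
        path.append(int(np.searchsorted(cdf, u)))
    return path

P = np.array([[0.9, 0.1],        # a small two-state example chain
              [0.5, 0.5]])
print(simulate_chain(P, x0=0, n_steps=10))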
If the transition probabilities are independent of time, i.e. Pij = const for each
given pair (i, j) (time-invariant Markov Chain), we can compute the n-step
transition matrix as:

P(n) = Pn = P × P × . . . × P . (3.4)
| {z }
n times

The fact that the n-step transition matrix is simply the nth power of the one-step
transition matrix follows from the Chapman-Kolmogorov equation:
For any n ≥ 0, m ≥ 0, xi , xj , xk ∈ S, we have:
$$P_{ij}^{n+m} = \sum_{k\in S} P_{ik}^{n} P_{kj}^{m}, \tag{3.5}$$

where:
$$P_{ik}^{n} = P(X_n = x_k \,|\, X_0 = x_i), \qquad P_{kj}^{m} = P(X_{m+n} = x_j \,|\, X_n = x_k).$$

Because of the Markov property (the future is independent of the past) and
Eq. 1.10, it is easy to see that
$$P_{ik}^{n} P_{kj}^{m} = P(X_{m+n} = x_j \,|\, X_n = x_k)\, P(X_n = x_k \,|\, X_0 = x_i) = P(X_n = x_k, X_{m+n} = x_j \,|\, X_0 = x_i).$$

Summing over all k, i.e., all possible states xk the path from xi to xj could pass
through, i.e., marginalizing this joint probability over k, yields the formula in
Eq. 3.5. Since this is the formula for matrix multiplication, Eq. 3.4 is shown.
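Numerically, Eq. 3.4 and the Chapman-Kolmogorov relation are easy to verify for a small example chain; the following Python sketch (the transition matrix is our own example) uses NumPy's matrix power:

import numpy as np

P = np.array([[0.9, 0.1],        # one-step transition matrix of an example chain
              [0.5, 0.5]])

P5 = np.linalg.matrix_power(P, 5)             # n-step matrix P^(5) = P^5 (Eq. 3.4)
print(P5)

# Chapman-Kolmogorov check (Eq. 3.5): P^(3+2) = P^(3) P^(2)
print(np.allclose(P5, np.linalg.matrix_power(P, 3) @ np.linalg.matrix_power(P, 2)))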

3.4 Properties of Markov Chains


Markov Chains have some useful and elegant mathematical properties. It is
mainly thanks to these properties that Markov Chains are so popular and widely
used. We start by giving a few definitions of properties a Markov Chain may
have and then discuss their importance.
Definition 3.4 (Closed set of states). A set C = {xi } of states is closed if and
only if no state outside C can be reached from any x0 ∈ C.

If a closed set contains only one state, that state is called absorbing. After the
Markov Chain reaches an absorbing state, it will never leave it again (hence the
name).

Example 3.2. Consider the Markov Chain defined by the recursion f (x, u) =
xu, where the Un are uniformly distributed random numbers in the interval [0, 1]
and the continuous state space S = [0, 1]. That is, the next value (between 0 and
1) of the chain is given by the current value multiplied with a uniform random
number between (and including) 0 and 1. The value 0 is an absorbing state.
Once the chain reaches 0, it is never going to show any value other than 0 any
more. Even more, every lower sub-interval C = [0, ν] ⊆ S for all ν ∈ [0, 1] is a
closed set of states, because the state of the chain is never increasing.

Definition 3.5 (Irreducibility). A Markov Chain is irreducible if and only if


there exists no closed set other than S itself.

Therefore, in an irreducible chain, every state can be reached from every other
state, eventually. There is no closed set from which the chain could not escape
any more. Every state is reachable, one just has to wait long enough (potentially
infinitely long).

Example 3.3. Clearly, the Markov Chain from Example 3.2 is not irreducible,
because it has infinitely many closed sets. However, if we consider i.i.d. random
variables Un ∼ U[ϵ, 1/x] for some arbitrarily small ϵ > 0, then the chain has
no absorbing state and no closed set any more, and it becomes irreducible. The
state space is then S = (0, 1].

Definition 3.6 (Periodicity). A Markov Chain is periodic if and only if for at
least one state xj , $P_{jj}^{(n)} = 0$ for all $n \neq kt$ with $k, t \in \mathbb{N}$, but $P_{jj}^{(kt)} = 1$. $t > 1$
is called the period. If no such t > 1 exists for any of the states, the Markov
Chain is aperiodic.

A periodic Markov Chain revisits one or several states in regular time intervals
t. Therefore, if we find the chain in a periodic state xj at time t, we know that
it is going to be in the same state again at times 2t, 3t, 4t, . . .. A Markov chain
that is not periodic is called aperiodic.

Definition 3.7 (Ergodicity). A Markov Chain is ergodic if and only if it is


irreducible and aperiodic.

An ergodic chain revisits any state with finite mean recurrence time. It is not
possible to predict when exactly the chain is going to revisit a given state, like
in the periodic case, but we know that it will in finite time. While an irreducible
Markov Chain is eventually going to revisit any state, the recurrence time may
be infinite, so irreducibility alone is not sufficient for ergodicity.
One of the interesting properties of Markov Chains for practical applications
is that they can have stationary distributions. This means that when running

the Markov Chain for n → ∞, states are visited according to an invariant


probability distribution, i.e., the probability
$$P_k = \lim_{n\to\infty} P_{jk}^{(n)} \quad \forall k, \tag{3.6}$$
is independent of xj . Since $P_k \geq 0$ and $\sum_k P_k = 1$, this defines a probability dis-
tribution over states, where state xk is visited with probability Pk , independent
of what the previous and current states are.
Example 3.4. Any i.i.d. random process is also a Markov Chain (but not vice
versa!), with recursion function f (x, u) = u. Because it is i.i.d., it will visit
each state with given probability. Therefore, i.i.d. processes are Markov Chains
with trivial stationary distribution.
While it may seem that in such cases an i.i.d. model would be simpler to use,
it may not always be computable. The stationary Markov Chain then provides
a simple algorithm for (approximately) simulating the process. Markov Chains
with known stationary distribution can be used to simulate random numbers
from that distribution. One just has to recurse the chain long enough and then
record its states, which will be distributed according to the desired stationary
distribution. A problem in practice is to know what “long enough” means.
Equation 3.6 talks about time going to infinity. In practice, stationary distribu-
tions may be (approximately) reached after a finite time. This time is called the
mixing time of the Markov Chain. After the initial mixing time, the chain sam-
ples values from its stationary distribution forever onward. For a given chain,
mixing times can be hard to know and are usually determined empirically, i.e.,
by simulating the chain and using statistical tests to decide when the values
are statistically indistinguishable from samples of the stationary distribution.
Part of the problem is that while the stationary distribution is independent of
the initial state by definition, the mixing time may depend on the initial state.
Only for a few special cases, mixing times are known analytically.
Another important question is when a Markov Chain has a stationary distribu-
tion at all. Fortunately, this can be clearly answered:
Theorem 3.3. A Markov Chain possesses a unique stationary distribution if
and only if it is ergodic.
In other words, a Markov Chain must never be trapped in any closed set of states
(irreducible) and it must not have predictable periods. While it is intuitive that
these are necessary conditions for the existence of a stationary distribution, they
are also sufficient, and the distribution is unique (proof not given here).
Example 3.5. The Random Walk is a classic example of a Markov Chain
with great importance as a model in physics, finance, biology, and other fields.
In physics, for example, random walks model the process of diffusion caused
by Brownian motion of microscopic particles, such as molecules. In biology,
random walks model population dynamics or cell motion, and in economics they
are used to model stock market fluctuations or gambling.

Mathematically, a random walk is defined by an i.i.d. sequence of “increments”


{Dn = dn : n ≥ 0} and the stochastic process:
$$x_n := \sum_{i=1}^{n} d_i, \qquad x_0 = 0. \tag{3.7}$$

While it might superficially seem that Xn depended on the entire history of the
process, it is in fact a Markov Chain, since xn = xn−1 +Dn is independent of the
previous states. The recursion formula of this Markov Chain is f (x, d) = x + d,
which, according to Theorem 3.1, proves the Markov property.
If the Xn are scalar and the increments are either +1 or −1, we obtain a one-
dimensional discrete-state random walk with

P (D = +1) = p, P (D = −1) = 1 − p. (3.8)

Such a random walk is called simple. It models the random motion of an object
on a regular 1D lattice. For the special case that p = 1/2, the random walk becomes
symmetric, i.e., it has equal probability to go left or right in each time step.
In a simple random walk, the only states that can be reached from xi are xi+1
and xi−1 . Therefore, we have the one-step transition probabilities Pi,i+1 = p
and Pi,i−1 = 1 − p. Consequently, Pii = 0. This would form one row of the
transition matrix. However, we can only write down the matrix once we restrict
the chain to a bounded domain and hence a finite state space.
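A simple random walk is straightforward to simulate; the following sketch (function name and parameters are our own) generates one trajectory of Eq. 3.7 with increments ±1:

import numpy as np

def simple_random_walk(n_steps, p=0.5, rng=None):
    # Simple 1D random walk (Eq. 3.7): increments +1 w.p. p and -1 w.p. 1-p.
    rng = np.random.default_rng() if rng is None else rng
    increments = np.where(rng.uniform(size=n_steps) < p, 1, -1)
    return np.concatenate([[0], np.cumsum(increments)])   # x_0 = 0

path = simple_random_walk(1000)
print(path[-1])   # position after 1000 steps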
Chapter 4

Monte Carlo Methods

There exist a variety of computational problems, which are very expensive or


even impossible to solve using conventional numerical and analytical techniques.
Monte Carlo (MC) methods are an important class of computational algorithms,
which can be used to efficiently approximate solutions of such problems by ex-
ploiting (quasi-/pseudo-)random numbers generated on a computer. MC meth-
ods find application in many disciplines of engineering and science, including
operations research, finance, and machine learning. They are most commonly
used to (a) numerically calculate integrals, (b) sample from complex distributions,
and (c) solve optimization problems.
The name “Monte Carlo Method” goes back to the 1940s when John von Neu-
mann and Stanislaw Ulam developed the computational aspects of this class of
algorithms in the US nuclear weapons projects at Los Alamos National Labora-
tory. Since the project was top-secret, they required a code name for their new
algorithm. They chose the code name “Monte Carlo Method”, as suggested by
Nicholas Metropolis in reference to the famous gambling casino in the city of
Monte Carlo (Principality of Monaco) due to its use of random numbers.
To motivate the core idea of MC methods, we consider a classical example.

Example 4.1. How to compute π using random numbers? This historic exam-
ple goes back to Buffon’s “needle problem”, first posed in the 18th century by
Georges-Louis Leclerc, Comte de Buffon. It shows a very simple and intuitive
way to approximate π using MC simulation. We begin by first drawing a unit
square on a blackboard or a piece of paper. Moreover, we draw a quarter of
a unit circle into the square such that the top-left and bottom-right corners of
the square are connected by the circle boundary. We then generate N random
points uniformly distributed across the unit square, i.e., $\vec{x}^{(i)} = (u_1^{(i)}, u_2^{(i)})$ with
$u_1^{(i)} \sim \mathcal{U}(0, 1)$ and $u_2^{(i)} \sim \mathcal{U}(0, 1)$ for i = 1, . . . , N . You can achieve this, for
instance, by throwing pieces of chalk at the board, assuming that your shots are
uniformly distributed across the square. Subsequently, you count the points that


made it into the circle and calculate


$$4\,\frac{\#\text{ of samples inside the circle}}{N} \;\approx\; 4\,\frac{\text{area of circle}}{\text{area of square}} \;=\; 4\,\frac{r^2\pi/4}{r^2} \;=\; \pi, \tag{4.1}$$
which will give you an approximation of π. The more samples N you use,
the more accurate your estimate will be. We shall see later why this algorithm
works, but an intuitive explanation can be provided right away. In particular,
the fraction of points that fall inside the circle approximates how much of the
area of the unit square is occupied by the quarter circle. We know that the area
of a unit square is As = r2 = 1 and that the area of the quarter of a unit circle
is Ac = r2 π/4 = π/4. The ratio between the two is therefore Ac /As = π/4. In
Eq. 4.1 we first calculate an estimate of this ratio and then multiply it by four,
which will therefore result in an approximation of π. Using MC simulations, the
ratio Ac /As is not determined exactly, but approximated using standard uniform
random numbers, which we can easily generate on a computer.
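A direct implementation of this experiment, as a short Python sketch using NumPy (our own code, not part of the original notes), might read:

import numpy as np

def estimate_pi(n, rng=None):
    # MC estimate of pi (Eq. 4.1): fraction of uniform points in the unit square
    # that fall inside the quarter circle, multiplied by 4.
    rng = np.random.default_rng() if rng is None else rng
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    inside = (u1**2 + u2**2) <= 1.0
    return 4.0 * inside.mean()

for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n))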
Remarks:
• Solutions obtained by MC simulation are random. Repeated MC simula-
tions will not give you the exact same answer (unless the same sequence
of quasi- or pseudo-random numbers is used).
• It is crucial to know how accurate an MC approximation is. How much can
you trust your results?
• MC methods converge to the exact answer for N → ∞. This is a funda-
mental result known as the law of large numbers.

4.1 The Law of Large Numbers


The law of large numbers gives a rigorous argument why MC methods converge
to the exact solution if N becomes sufficiently large. To show this, let us consider
a sequence of i.i.d. RVs X1 , . . . , XN with finite expectation value E{Xi } =
E{Xj } = µ. Moreover, we define the average of these variables as
$$\bar{X}_N = \frac{1}{N}\sum_{i=1}^{N} X_i. \tag{4.2}$$

The law of large numbers states that


$$\bar{X}_N \xrightarrow{\;N\to\infty\;} \mu, \tag{4.3}$$

i.e., the average of the RVs “converges” to the expectation µ of the individual
variables. This implies that if we have a large number of independent realizations
of the same probabilistic experiment, we can estimate the expected outcome of
this experiment by calculating the empirical average over these realizations.
This is the main working principle behind MC methods.

Depending on how we specify convergence, we can distinguish two forms of the


law of large numbers. The first one is called the weak law of large numbers and
it states that
$$\lim_{N\to\infty} P(|\bar{X}_N - \mu| > \varepsilon) = 0, \tag{4.4}$$

with ε > 0 an arbitrary positive constant. The weak law implies that the average
X̄N converges in probability / in distribution to the true mean µ, which means
that for any ε we can find a sufficiently high N such that the probability that
X̄N differs from µ by more than ε will be arbitrarily small.
The second version of the law of large numbers is called the strong law of large
numbers, which states that
$$P\left(\lim_{N\to\infty} \bar{X}_N = \mu\right) = 1, \tag{4.5}$$

meaning that X̄N converges to µ almost surely, that is, with probability one in
value. The difference between the two laws is that in the weak law, we leave
open the possibility that X̄N deviates from µ by more than ε infinitely often
(although unlikely) along the path of N → ∞. So there is no sufficiently large
finite N such that we can guarantee that X̄N stays within the ε margin. The
strong law, by contrast, states that for a sufficiently large N , X̄N will remain
within the margin with probability one. Therefore, the strong law implies the
weak law, but not vice-versa. One is the probability of a limit, whereas the
other is a limit over probabilities. There are certain cases, where the weak law
holds, but the strong law does not. Moreover, there are cases where neither of
them apply: if the samples are Cauchy-distributed, for instance, the mean µ
does not exist and therefore we cannot use sample averages to determine the
expected outcome.

4.1.1 Proof of the Weak Law


While the proof of the strong law is rather technical, the weak law can be
derived in a relatively straightforward manner using the concept of characteristic
functions.

Definition 4.1 (Characteristic function). The characteristic function ϕX (t) of


a continuous RV X is defined as

$$\phi_X(t) := E[e^{iXt}], \tag{4.6}$$

with i the imaginary unit and t the real argument of this function. Making use
of the definition of the expectation,
$$\phi_X(t) = \int_{-\infty}^{\infty} e^{ixt} p(x)\, \mathrm{d}x, \tag{4.7}$$

we recognize ϕX (t) as the Fourier transform of the PDF p(x) of X.



Characteristic functions provide an alternative characterization of RVs that can


be beneficial in certain cases. This is due to the mathematical properties of
the characteristic functions. The following two properties are required for our
proof:

• ϕX+Y (t) = ϕX (t)ϕY (t) if X and Y are independent,

• ϕαX (t) = ϕX (αt) with α any real constant.

We begin the proof by setting up the characteristic function of the sample
average $\bar{X}_N = \frac{1}{N}\sum_{i=1}^{N} X_i$:
$$\phi_{\bar{X}_N}(t) = E\left[e^{i\frac{t}{N}\sum_{i=1}^{N} X_i}\right]. \tag{4.8}$$

By rearranging the exponent we obtain

ϕX̄N (t) = E[eit(X1 /N +X2 /N +···+XN /N ) ], (4.9)

which is the characteristic function of a sum of N independent and rescaled RVs.


Therefore, by exploiting the two properties of characteristic functions above, we
obtain
ϕX̄N (t) = ϕX1 (t/N )ϕX2 (t/N ) · · · ϕXN (t/N ). (4.10)
Now, since all Xi are identically distributed, i.e., ϕXi (t) = ϕXj (t) = ϕX (t), we
further obtain
ϕX̄N (t) = ϕX (t/N )N . (4.11)
The specific form of ϕX (t) depends on the PDF of the RV X, but we can get
a general form of it by Taylor expanding it around t = 0 up to order one. In
particular, we obtain


$$\phi_X(t) = \phi_X(0) + t \left.\frac{\partial}{\partial t}\phi_X(t)\right|_{t=0} + o(t), \tag{4.12}$$

whereas o(t) summarizes all higher-order terms (little-o notation). Applying the
definition of the characteristic function, we obtain

ϕX (0) = E[ei0X ] = E[1] = 1 (4.13)

and
 
$$\left.\frac{\partial}{\partial t}\phi_X(t)\right|_{t=0} = \left.\frac{\partial}{\partial t} E[e^{itX}]\right|_{t=0} = \left. E\left[\frac{\partial}{\partial t} e^{itX}\right]\right|_{t=0} = E\left[iX e^{i0X}\right] = i\,E[X] = i\mu. \tag{4.14}$$
 

In summary, we thus obtain

ϕX (t) = 1 + itµ + o(t), (4.15)



and inserting this into Eq. (4.11) yields
$$\phi_{\bar{X}_N}(t) = \left(1 + i\frac{t}{N}\mu + o(t/N)\right)^N. \tag{4.16}$$

The term o(t/N ) tends to zero faster than the other two terms for large N since
it contains higher orders of t/N . We can therefore neglect it asymptotically.
Letting N go to infinity for the remaining expression, we obtain
$$\lim_{N\to\infty} \phi_{\bar{X}_N}(t) = \lim_{N\to\infty} \left(1 + i\frac{t}{N}\mu\right)^N, \tag{4.17}$$

which is the definition of the exponential function eitµ . This, in turn, is the
characteristic function of a constant (deterministic) variable µ, which means
that the sample average converges in density to µ. Intuitively, this says that
the PDF of the sample average will be squeezed together as N → ∞ such that
all its probability mass concentrates at the value µ, resulting in a Dirac delta
distribution δ(x̄ − µ).
Remark: Strictly speaking, our proof has only shown convergence in density
(in characteristic functions), but not in probability as stated by the weak law.
However, it can be shown further that since µ is a constant, convergence in
density implies convergence in probability, which completes the proof. This,
however, is beyond the scope of this lecture.

4.2 Monte Carlo Integration


One main application of MC methods is the numerical computation of compli-
cated and/or high-dimensional integrals, for which analytically solutions do not
exist and numerical quadrature may be computationally inefficient. The key
idea is to reformulate an integral in terms of an expectation, which can then be
approximated using a conventional sample average.

Let us first consider a one-dimensional integral of the form


$$F = \int_a^b f(x)\, \mathrm{d}x \tag{4.18}$$

with a and b the integration limits. In order to reformulate this integral in terms
of an expectation, we first multiply the integrand by 1 = (b − a)/(b − a), which
leaves the value of the integral unaffected, i.e.,
$$\int_a^b f(x)\, \mathrm{d}x = \int_a^b f(x)\,\frac{b-a}{b-a}\, \mathrm{d}x \tag{4.19}$$
$$\qquad\qquad\;\; = \int_a^b f(x)\,(b-a)\,p(x)\, \mathrm{d}x, \tag{4.20}$$

where we recognize $p(x) = \frac{1}{b-a}$, x ∈ (a, b), as the PDF of a uniform continuous
RV X ∼ U(a, b). We can therefore express the integral as an expectation
$$\int_a^b f(x)\, \mathrm{d}x = (b - a)\, E[f(X)], \tag{4.21}$$

taken with respect to a uniform RV X ∼ U(a, b). We can therefore approximate


this integral by the sample average
$$\int_a^b f(x)\, \mathrm{d}x \approx \frac{b-a}{N}\sum_{i=1}^{N} f(x_i) =: \theta_N, \tag{4.22}$$

with i.i.d. samples xi ∼ U(a, b) for i = 1, . . . , N . In the following, we refer to


the sample average θN as the Monte Carlo estimator. Since a uniform RV has
a finite mean, we know by the law of large numbers that θN will converge to
the correct solution of the integral as N → ∞.

As indicated earlier in this chapter, it is crucial to know how much uncertainty


we have to expect in our MC estimator for a certain finite sample size N . To
this end, we can calculate the variance of the MC estimator, i.e.,
$$\mathrm{Var}[\theta_N] = \mathrm{Var}\left[\frac{b-a}{N}\sum_{i=1}^{N} f(X_i)\right] \tag{4.23}$$
$$\qquad\;\; = \frac{(b-a)^2}{N^2}\sum_{i=1}^{N} \mathrm{Var}[f(X_i)] \tag{4.24}$$
$$\qquad\;\; = \frac{(b-a)^2}{N^2}\, N\, \mathrm{Var}[f(X)] \tag{4.25}$$
$$\qquad\;\; = \mathcal{O}(1/N). \tag{4.26}$$

In the first step, we have used the scaling property of the variance together with
the independence of the samples (the variance of a sum of independent RVs is the
sum of their variances), and in the second step we have used the fact that the
RVs are identically distributed. This shows that the
MC variance scales with 1/N if independent samples are used. If the variance of
the transformed RV f (X) is hard to compute analytically, an empirical estimate
can be determined from the N samples, i.e.,
$$\mathrm{Var}[f(X)] \approx \frac{1}{N-1}\sum_{i=1}^{N} \left(f(x_i) - \langle f(X)\rangle\right)^2, \tag{4.27}$$

with
$$\langle f(X)\rangle = \frac{1}{N}\sum_{i=1}^{N} f(x_i). \tag{4.28}$$
Remember that the prefactor in the empirical sample variance is 1/(N − 1) in
order for the estimator to be unbiased (a single sample has no variance); see
statistics course.
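A compact sketch of one-dimensional MC integration with an empirical error estimate (our own code, using NumPy; the test integral is the integral of sin(x) over [0, π], which equals 2):

import numpy as np

def mc_integrate(f, a, b, n, rng=None):
    # MC estimate of the integral of f over [a, b] (Eq. 4.22), together with an
    # empirical standard error based on Eqs. 4.26 and 4.27.
    rng = np.random.default_rng() if rng is None else rng
    fx = f(rng.uniform(a, b, size=n))
    estimate = (b - a) * fx.mean()
    std_error = (b - a) * fx.std(ddof=1) / np.sqrt(n)   # uses the 1/(N-1) prefactor
    return estimate, std_error

est, err = mc_integrate(np.sin, 0.0, np.pi, 100_000)
print(est, "+/-", err)   # exact value is 2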

4.2.1 Multidimensional Extension


MC integration can be extended to multidimensional integrals of the form
$$F = \int_\Omega f(\mathbf{x})\, \mathrm{d}^m x, \qquad \mathbf{x} \in \mathbb{R}^m, \tag{4.29}$$

with f : Rm → R and Ω the domain of this integral. Just as in the one-


dimensional case, we can approximate this integral by an MC estimate

$$F \approx V\, \frac{1}{N}\sum_{i=1}^{N} f(\mathbf{x}_i) =: \theta_N, \tag{4.30}$$

with xi as N i.i.d. RVs distributed uniformly in Ω. The scaling factor V is the


volume of the domain Ω, i.e.,
$$V = \int_\Omega \mathrm{d}^m x. \tag{4.31}$$

Note that in the one-dimensional case, this volume reduces to the length of the
domain (b − a), as derived above. The corresponding MC estimator variance,
by analogy, is given by

$$\mathrm{Var}[\theta_N] = \frac{V^2}{N}\, \mathrm{Var}[f(X)]. \tag{4.32}$$

4.3 Importance sampling


A major drawback of standard MC integration is that it relies on uniform RVs.
This means that each part of Ω is sampled equally likely. However, depending
on the integrand f (x), there may be certain regions of Ω where f (x) contributes
more to the value of the integral than in other regions. One would therefore
expect more accurate results if samples were drawn more frequently in these
“important” regions than in regions that only contribute marginally to the inte-
gral. This leads to the idea of importance sampling. For the sake of illustration,
we introduce the concept in the context of MC integration. However, keep in
mind that it can also be applied to other MC problems.
As before, we consider a general m-dimensional integral of the form
$$F = \int_\Omega f(\mathbf{x})\, \mathrm{d}^m x, \tag{4.33}$$

with f : Rm → R and Ω as the domain of this integral. In order to derive the


importance sampler, we first multiply the integrand by 1 = q(x)/q(x), where
q(x) is an m-dimensional proposal PDF, whose support S contains the domain

Ω, i.e., S ⊇ Ω. We obtain:
$$\int_\Omega f(\mathbf{x})\, \mathrm{d}^m x = \int_\Omega f(\mathbf{x})\,\frac{q(\mathbf{x})}{q(\mathbf{x})}\, \mathrm{d}^m x \tag{4.34}$$
$$\qquad\qquad\;\; = \int_S \frac{f(\mathbf{x})}{q(\mathbf{x})}\, 1_{\mathbf{x}\in\Omega}\, q(\mathbf{x})\, \mathrm{d}^m x \tag{4.35}$$
$$\qquad\qquad\;\; = E\left[\frac{f(\mathbf{X})}{q(\mathbf{X})}\, 1_{\mathbf{X}\in\Omega}\right], \tag{4.36}$$
with X a continuous RV with PDF q(x), and 1X∈Ω the indicator function that
restricts the samples to the original domain of integration. Assuming that we
are given N i.i.d. realizations x1 , . . . , xN of X, we can compute an MC estimate
as
$$\int_\Omega f(\mathbf{x})\, \mathrm{d}^m x \approx \underbrace{\frac{1}{N}\sum_{i=1}^{N} \frac{f(\mathbf{x}_i)}{q(\mathbf{x}_i)}\, 1_{\mathbf{x}_i\in\Omega}}_{=: \theta_N}. \tag{4.37}$$

The corresponding MC estimator variance is given by:
$$\mathrm{Var}[\theta_N] = \frac{1}{N}\, \mathrm{Var}\left[\frac{f(\mathbf{X})}{q(\mathbf{X})}\, 1_{\mathbf{X}\in\Omega}\right]. \tag{4.38}$$
This shows that the MC variance depends on the proposal distribution q(x). In
order to study which proposals achieve the lowest MC variance, and hence the
most accurate approximations, we consider the following toy examples, before
we proceed to a general discussion of this topic in the next chapter.
Example 4.2 (Optimal proposal distribution). We first consider a proposal
distribution that is proportional to the function f (x), i.e.,

q(x) = Kf (x) (4.39)

for some constant K that ensures that q integrates to one. Remember that f (x)
is not a PDF, but just a function to be integrated. For the MC variance, we
obtain from Eq. 4.38:
$$\mathrm{Var}[\theta_N] = \frac{1}{N}\, \mathrm{Var}\left[\frac{f(\mathbf{x})}{K f(\mathbf{x})}\right] = \frac{1}{N}\, \mathrm{Var}\left[\frac{1}{K}\right] = 0, \tag{4.40}$$
because 1/K is a deterministic value. The indicator function was dropped since
f and q have the same support. Paradoxically, this result implies that a single
sample from this proposal would suffice to fully determine the true value of
the integral, for any integrand f (x), since the MC estimator variance is zero.
However, at a second look we realize that in order to evaluate the ratio f (x)/q(x)
in the MC estimator, we require knowledge of the constant K, since f (x)/q(x) =
1/K. This constant, however, is equal to the solution of the integral we are after,
since it must hold that
$$\frac{1}{K} = \int_\Omega f(\mathbf{x})\, \mathrm{d}^m x \tag{4.41}$$

for q(x) to integrate to one. Because the constant already contains the result, one
“sample” of that constant would indeed suffice. Clearly, this choice of proposal
distribution is not realizable, but it still reveals important information about what
suitable proposals should look like: a good q(x) should follow f (x) as closely as
possible, up to a constant scaling factor.

Example 4.3 (Calculating π revisited). We have shown at the beginning of


this chapter that π can be approximated using random numbers. It now becomes
clear that the scheme we thus used can be understood as an importance sampler.
Indeed, the number π can be written as the integral
$$\pi = 4 A_\Omega = 4 \iint_\Omega \mathrm{d}u_1\, \mathrm{d}u_2, \tag{4.42}$$

where Ω is the quarter circle and ⃗x = (u1 , u2 ) a point in the 2D plane. Note
that in this case, the function f is one everywhere. This can be rewritten as
$$\pi = 4 \int_0^1\!\!\int_0^1 \frac{1}{q(u_1, u_2)}\, 1_{(u_1, u_2)\in\Omega}\, q(u_1, u_2)\, \mathrm{d}u_1\, \mathrm{d}u_2 = 4\, E\left[\frac{1}{q(U_1, U_2)}\, 1_{(U_1, U_2)\in\Omega}\right]. \tag{4.43}$$

In the introductory example, we have chosen q(u1 , u2 ) to be a uniform proposal


distribution on the unit square. Since this proposal is constant and q = 1 inside
the entire unit square, we further obtain

N
  4 X # samples inside the circle
π = 4E 1(U1 ,U2 )∈Ω ≈ 1(u1 ,u2 )∈Ω = 4 ,
N i=1 N
(4.44)
which is the formula introduced at the beginning of this chapter.

Example 4.4. Let us consider a function f (x) = x2 for x ∈ [0, 1], which we
want to integrate using MC integration, $\int_0^1 x^2\, \mathrm{d}x$. We consider the standard MC
estimator
$$\theta_N^{(1)} = \frac{1}{N}\sum_{i=1}^{N} f(x_i) \quad \text{with} \quad x_i \sim \mathcal{U}(0, 1) \tag{4.45}$$

as well as an estimator that uses importance sampling with a proposal distri-
bution q(x) = 2x, x ∈ [0, 1]:

$$\theta_N^{(2)} = \frac{1}{N}\sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)} \quad \text{with} \quad x_i \sim q(x) = 2x. \tag{4.46}$$

Which of the two estimators is better? To answer this question, we calculate the

MC variances of both estimators. For estimator 1 we find:

(1) 1
Var[θN ] = Var[f (x)] (4.47)
N
Z 1
Var[f (x)] = Var[X 2 ] = (X 4 − E[X 2 ]2 )p(x) dx = E[X 4 ] − E[X 2 ]2 (4.48)
0
1
1 1
X3
Z Z
2 2 2 1
E[X ] = x p(x) dx = x dx = = (4.49)
0 0 3 3
0
1
1 1
X5
Z Z
1
E[X 4 ] = x4 p(x) dx = x4 dx = = . (4.50)
0 0 5 5
0

Therefore, the MC variance of estimator 1 is:
$$\mathrm{Var}[\theta_N^{(1)}] = \frac{1}{N}\left(\frac{1}{5} - \frac{1}{9}\right) = \frac{1}{N}\,\frac{4}{45} = \frac{0.0889}{N}. \tag{4.51}$$

For estimator 2 we obtain:
$$\mathrm{Var}[\theta_N^{(2)}] = \frac{1}{N}\, \mathrm{Var}\left[\frac{f(X)}{q(X)}\right] = \frac{1}{N}\, \mathrm{Var}\left[\frac{X^2}{2X}\right] \tag{4.52}$$
$$\qquad\quad\;\; = \frac{1}{4N}\, \mathrm{Var}[X]. \tag{4.53}$$
Again, the indicator function can be dropped since f and q have the same sup-
port. We know that Var[X] = E[X 2 ] − E[X]2 and since
$$E[X] = \int_0^1 2x^2\, \mathrm{d}x = \frac{2}{3} \tag{4.54}$$
$$E[X^2] = \int_0^1 2x^3\, \mathrm{d}x = \frac{1}{2}, \tag{4.55}$$

we get for the MC variance of estimator 2
$$\mathrm{Var}[\theta_N^{(2)}] = \frac{1}{4N}\left(\frac{1}{2} - \frac{4}{9}\right) = \frac{1}{N}\,\frac{1}{72} = \frac{0.0139}{N}. \tag{4.56}$$

Since
$$\frac{0.0139}{N} < \frac{0.0889}{N} \tag{4.57}$$
for any N > 0, $\theta_N^{(2)}$ is a better estimator. This means that for the same sample
size N , estimator 2 achieves a higher (about 6-fold) accuracy, in expectation
over many realizations of the MC procedure.
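The comparison of Example 4.4 can be reproduced numerically. In the following sketch (our own code), samples from q(x) = 2x are generated by inversion of its CDF Q(x) = x^2, i.e., x = sqrt(u) with u ~ U(0,1):

import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Estimator 1: standard MC with uniform samples (Eq. 4.45)
x_uniform = rng.uniform(size=N)
theta1 = np.mean(x_uniform**2)

# Estimator 2: importance sampling with q(x) = 2x (Eq. 4.46); samples from q are
# obtained by inversion of its CDF Q(x) = x^2, i.e., x = sqrt(u)
x_q = np.sqrt(rng.uniform(size=N))
theta2 = np.mean(x_q**2 / (2.0 * x_q))

print(theta1, theta2)   # both should be close to the exact value 1/3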
Chapter 5

Variance Reduction

As discussed in the previous sections, a key property of any MC estimator is the


variance associated with it. Ultimately, the MC variance determines how many
samples N one has to generate in order to achieve a certain statistical precision.
We have also shown that different MC estimators of the same quantity can
have very different MC variances in the context of importance sampling. In
this chapter we will discuss techniques that allow one to improve the accuracy of
MC estimators, commonly known as variance reduction techniques. While there
exists a broad range of variance reduction methods, we will focus on two of them
in this lecture: antithetic variates and Rao-Blackwellization.

5.1 Antithetic Variates


So far, a key assumption of the discussed MC estimators was that the individual
samples used for constructing the estimate are i.i.d. We recall that in this case,
the MC variance is given by
$$\mathrm{Var}\{\theta_N\} = \mathrm{Var}\left\{\frac{1}{N}\sum_{i=1}^{N} f(X_i)\right\} = \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}\{f(X_i)\}, \tag{5.1}$$

with Xi as i.i.d. RVs and f (x) as some function. We now consider the case
where the RVs Xi are not independent, in which case the covariances between
the Xi and Xj are in general non-zero. In this case, it is straightforward to show that
the variance of an MC estimator is given by
$$\mathrm{Var}\left\{\frac{1}{N}\sum_{i=1}^{N} f(X_i)\right\} = \frac{1}{N^2}\left[\sum_{i=1}^{N} \mathrm{Var}\{f(X_i)\} + \sum_{i\neq j} \mathrm{Cov}\{f(X_i), f(X_j)\}\right]. \tag{5.2}$$

We see that if the covariance terms are positive, the MC variance will be larger
than in the i.i.d. case. However, if the covariances become negative, we can
achieve variance reduction. This is the key idea underlying a popular variance


reduction method called antithetic variates. While this approach is fairly gen-
eral, we will illustrate it here in the context of a simple example.

Our goal is to use Monte Carlo estimation to calculate the expectation E{f (X)}
with f (X) as some function and X as a uniform RV. To do so, we generate N
uniformly distributed random numbers

Xi ∼ U(0, 1) ∀i = 1, . . . , N. (5.3)

Our canonical MC estimator would take the form


$$\frac{1}{N}\sum_{i=1}^{N} f(X_i). \tag{5.4}$$

The goal of antithetic variates is to inversely correlate the generated samples.


To this end, we define the augmented sample

Z = {X1 , . . . , XN , X̄1 , . . . , X̄N } = {Z1 , . . . , Z2N }, (5.5)

with X̄i = 1 − Xi . On the one hand, this means that we have doubled the
number of samples of our estimator. On the other hand, we have artificially
introduced negative correlations between pairs of samples since Xi and X̄i will
be anticorrelated. It can be shown that if the function f is monotonically
increasing or decreasing, this implies that also the correlations between f (Xi )
and f (X̄i ) will be negative. In particular, the variance of the antithetic MC
estimator becomes
$$\mathrm{Var}\{\theta_N^A\} = \mathrm{Var}\left\{\frac{1}{2N}\sum_{i=1}^{2N} f(Z_i)\right\} \tag{5.6}$$
$$\qquad\;\; = \frac{1}{4N^2}\left[\sum_{i=1}^{2N} \mathrm{Var}\{f(Z_i)\} + 2\sum_{i=1}^{N} \mathrm{Cov}\{f(X_i), f(\bar{X}_i)\}\right] \tag{5.7}$$
$$\qquad\;\; = \frac{1}{2N}\left(\sigma_f^2 + \gamma_f\right), \tag{5.8}$$
with $\sigma_f^2 = \mathrm{Var}\{f(X)\}$ the per-sample variance and $\gamma_f = \mathrm{Cov}\{f(X), f(\bar{X})\}$ the covariance within an antithetic pair.
Now, since γf is strictly negative for monotonic functions f we obtain variance
reduction when compared to an MC estimator that uses 2N independent sam-
ples. Another important advantage of antithetic variates is that we can double
the sample size N ”for free”: we can use 2N samples, even though we had to
draw only N random numbers.
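As a small numerical illustration (our own example): for the monotonic function f(x) = exp(x) and X ~ U(0,1), the antithetic estimator can be compared against a plain MC estimator that uses 2N independent samples:

import numpy as np

rng = np.random.default_rng(1)
f = np.exp        # a monotonic function, so antithetic pairs are anticorrelated
N = 5_000

x = rng.uniform(size=N)
plain = f(rng.uniform(size=2 * N)).mean()        # 2N independent samples
antithetic = 0.5 * (f(x) + f(1.0 - x)).mean()    # N draws, 2N samples via Eq. 5.5

print(plain, antithetic)   # exact value is e - 1 = 1.71828...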

5.2 Rao-Blackwellization
In this section we will discuss the concept of Rao-Blackwellization to reduce
the variance of MC estimators. This method is suited for Monte Carlo prob-
lems that depend on multiple random variables. While they apply to arbitrary

multidimensional problems, we limit ourselves to the two-dimensional case for


simplicity.
Assume we want to use MC estimation to calculate the expectation E{f (X, Y )}
with X and Y as dependent RVs and f as some function that maps a tuple
(x, y) to a real or integer number. In the following, we assume that X, Y ∈ R
and f : R2 → R, but keep in mind that the same concept applies also to integer-
valued random variables and functions f . Analogously to the one-dimensional
case, we can construct an MC estimator as
$$\theta_N = \frac{1}{N}\sum_{i=1}^{N} f(x_i, y_i), \tag{5.9}$$

with (xi , yi ) as i.i.d. random samples from a joint probability distribution p(x, y).
The corresponding MC variance is given by
$$\mathrm{Var}\{\theta_N\} = \frac{1}{N}\, \mathrm{Var}\{f(X, Y)\} = \frac{1}{N}\, \sigma_f^2, \tag{5.10}$$
with
$$\sigma_f^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \left(f(x, y) - \mu_f\right)^2 p(x, y)\, \mathrm{d}x\, \mathrm{d}y \tag{5.11}$$
and
$$\mu_f = E\{f(X, Y)\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\, p(x, y)\, \mathrm{d}x\, \mathrm{d}y. \tag{5.12}$$

Let’s now consider another estimator, given by


$$\hat{\theta}_N = \frac{1}{N}\sum_{i=1}^{N} E\{f(X, y_i) \,|\, y_i\}, \tag{5.13}$$
with yi as i.i.d. samples from the marginal distribution $p(y) = \int_{-\infty}^{\infty} p(x, y)\, \mathrm{d}x$
and E{f (X, yi ) | yi } as a conditional expectation defined by
$$E\{f(X, y_i) \,|\, y_i\} = \int_{-\infty}^{\infty} f(x, y_i)\, p(x \,|\, y_i)\, \mathrm{d}x. \tag{5.14}$$

This estimator is called the Rao-Blackwellized (RB) estimator. One important dif-
ference between the standard MC and RB estimators is that the latter one uses
samples only from one of the two variables (in this case Y ). The second vari-
able has been ”integrated out” analytically, which means that we effectively deal
with a lower-dimensional sampling space. Let us now look into the properties of
the RB estimator. We first realize that in expectation, the RB estimator will
deliver the correct result just as the original MC estimator, since
$$E\{\hat{\theta}_N\} = \frac{1}{N}\sum_{i=1}^{N} E\{E\{f(X, Y) \,|\, Y\}\} = \frac{1}{N}\sum_{i=1}^{N} E\{f(X, Y)\} = E\{f(X, Y)\}. \tag{5.15}$$

Most importantly, the variance of the RB estimator,


$$\mathrm{Var}\{\hat{\theta}_N\} = \frac{1}{N}\, \mathrm{Var}\{E\{f(X, Y) \,|\, Y\}\} = \frac{1}{N}\, \hat{\sigma}_f^2, \tag{5.16}$$
is lower than that of the standard MC estimator as summarized in the following
theorem.

Theorem 5.1. For a given sample size N , the Rao-Blackwellized estimator θ̂N
is guaranteed to achieve lower or equal variance than the standard Monte Carlo
estimator θN , i.e.,
V ar{θ̂N } ≤ V ar{θN } (5.17)
for any N .
Proof: A more formal proof of this theorem can be obtained by employing the
Rao-Blackwell theorem. However, a simple derivation of this result is possi-
ble using the properties of conditional expectations. We begin by rewriting the
variance σf2 that appears in the MC variance of the standard estimator θN as

$$\sigma_f^2 = E\{f(X, Y)^2\} - \mu_f^2 \tag{5.18}$$
$$\quad\;\, = E\{E\{f(X, Y)^2 \,|\, Y\}\} - \mu_f^2. \tag{5.19}$$

We then realize that the inner expectation E{f (X, Y )2 | Y } is a conditional


expectation taken over the squared function f . We can therefore make use of
the relation

$$\mathrm{Var}\{f(X, Y) \,|\, Y\} = E\{f(X, Y)^2 \,|\, Y\} - E\{f(X, Y) \,|\, Y\}^2, \tag{5.20}$$

to replace $E\{f(X, Y)^2 \,|\, Y\}$ in (5.19) by $\mathrm{Var}\{f(X, Y) \,|\, Y\} + E\{f(X, Y) \,|\, Y\}^2$
such that we obtain
$$\sigma_f^2 = E\{\mathrm{Var}\{f(X, Y) \,|\, Y\}\} + E\{E\{f(X, Y) \,|\, Y\}^2\} - \mu_f^2. \tag{5.21}$$
Using the same idea, we can now replace $E\{E\{f(X, Y) \,|\, Y\}^2\}$ by $\mathrm{Var}\{E\{f(X, Y) \,|\, Y\}\} + E\{E\{f(X, Y) \,|\, Y\}\}^2$, which yields
$$\sigma_f^2 = E\{\mathrm{Var}\{f(X, Y) \,|\, Y\}\} + \mathrm{Var}\{E\{f(X, Y) \,|\, Y\}\} + \underbrace{E\{E\{f(X, Y) \,|\, Y\}\}^2}_{\mu_f^2} - \mu_f^2 \tag{5.23}$$
$$\quad\;\, = \underbrace{E\{\mathrm{Var}\{f(X, Y) \,|\, Y\}\}}_{\geq 0} + \underbrace{\mathrm{Var}\{E\{f(X, Y) \,|\, Y\}\}}_{\hat{\sigma}_f^2}. \tag{5.24}$$

Therefore, since the first term on the r.h.s. is non-negative, we conclude that
$\hat{\sigma}_f^2 \leq \sigma_f^2$ and correspondingly $\mathrm{Var}\{\hat{\theta}_N\} \leq \mathrm{Var}\{\theta_N\}$.
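As a small illustration (our own toy model, not from the notes): let Y ~ N(0,1), X | Y ~ N(Y,1), and f(x, y) = x^2. Then E{f(X,Y) | Y} = Y^2 + 1 is available in closed form, which is exactly the situation the RB estimator exploits:

import numpy as np

rng = np.random.default_rng(2)
N = 10_000

y = rng.normal(size=N)            # Y ~ N(0, 1)
x = rng.normal(loc=y)             # X | Y ~ N(Y, 1)

theta_mc = np.mean(x**2)          # standard MC estimator (Eq. 5.9)
theta_rb = np.mean(y**2 + 1.0)    # RB estimator (Eq. 5.13): E{X^2 | Y} = Y^2 + 1

print(theta_mc, theta_rb)         # exact value is E{X^2} = 2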
Chapter 6

Markov Chain Monte-Carlo

In the previous chapters on Monte Carlo methods we have so far considered cases
where the random numbers used for constructing the Monte Carlo estimators
are easy to generate. In particular, we have focused on one- or two-dimensional
problems that used ”simple” RVs such as uniform random numbers. In many
practical scenarios, however, this may not be the case. In Chapter 2 we have
discussed several methods to generate more complex RVs but these methods
largely apply to low-dimensional problems. So how can we generate random
samples from higher-dimensional and possibly complex distributions? Markov
chain Monte Carlo (MCMC) methods provide a powerful framework to address
this problem. In this chapter we will discuss the core idea of MCMC and intro-
duce two of the most popular MCMC sampling algorithms, commonly known
as the Gibbs- and Metropolis-Hastings samplers.

Let us assume we want to sample from an arbitrary multidimensional target


distribution Π(x) with x ∈ S. For the sake of illustration, we consider the case
where x is a discrete-valued k-dimensional RV such that S = Nk . However,
the following concepts apply in a similar fashion to other scenarios (e.g., when
x is real-valued). The key idea of MCMC is to construct a Markov chain Xn ,
which has exactly Π as its stationary distribution. If we simulate such a
Markov chain until it reaches stationarity, we can use the states that it occupies
as random samples from the correct target distribution Π.

We recall from Chapter 3 that the time-evolution of a discrete-time Markov


chain Xn is governed by a transition kernel

P (Xn+1 = xj |Xn = xi ) = Pij , (6.1)

which characterizes the probability of moving from state i to state j. We fur-


thermore know that if Π is a stationary distribution of Xn , then it must hold
that X
Pij Π(xi ) = Π(xj ), (6.2)
i


where the sum goes over all states in S. The kernel P is then said to be
invariant with respect to the distribution Π. Intuitively, this means that if we
start at the stationary distribution, then applying the invariant kernel P to it
will leave it unaffected. The goal of MCMC is to find an invariant transition
kernel P which satisfies (6.2) for a given target distribution Π. If we then sim-
ulate the Markov chain and wait until it reaches stationarity, then the resulting
samples are distributed according to Π.
A condition that is related to (6.2) is called detailed balance, which states

Pij Π(xi ) = Pji Π(xj ). (6.3)

Importantly, also detailed balance guarantees that Π is a stationary distribution


of the considered Markov chain, which can be seen immediately by summing
over i (or j) on both sides of (6.3), which leads directly to (6.2). The advantage
of detailed balance is that it is a local condition that is easier to check. Corre-
spondingly, many MCMC methods construct Markov chains that obey detailed
balance such as the Metropolis-Hastings algorithm. Note that while (6.2) and
(6.3) guarantee that Π is a stationary distribution, they do not necessarily en-
sure that Π is the only stationary distribution. Correspondingly, Π may not
always be reached from any initial condition. This can be achieved by addi-
tionally requiring the Markov chain to be ergodic, which implies that it has a
unique stationary distribution, which must be Π when also (6.2) and/or (6.3)
are satisfied.

6.1 Gibbs Sampling


The first MCMC sampler that we discuss is known as the Gibbs sampler. Before
we consider the general case, we will motivate the idea using a two-dimensional
problem. Let us assume we want to sample from a two-dimensional distribu-
tion Π(x) = Π(y, z). A key assumption of the Gibbs sampler is that while it
is hard to sample directly from the joint distribution Π, it is easy to sample
from the conditional distributions Π(y | z) and Π(z | y). While at first sight,
this looks like a pretty strong assumption, this is indeed the case for many sta-
tistical problems encountered in practice. The two-dimensional Gibbs sampler
iteratively samples from the conditional distributions starting from an arbitrary
initial condition, i.e.,

yn+1 ∼ Π(y | zn ) (6.4)


zn+1 ∼ Π(z | yn+1 ) (6.5)

for n = 1, 2, . . . nmax .

Theorem 6.1. Xn = (Yn , Zn ) is a Markov Chain with stationary distribution


Π(y, z).

Proof: We have to show that


$$\sum_y \sum_z P(\hat{y}, \hat{z} \,|\, y, z)\, \Pi(y, z) = \Pi(\hat{y}, \hat{z}). \tag{6.6}$$

The transition kernel P of the Gibbs sampler is given by

P (ŷ, ẑ | y, z) = Π(ẑ | ŷ)Π(ŷ | z), (6.7)

where we assume that in each iteration n, we first resample y and subsequently
z. Note that also the reverse order would be possible. Inserting the transition
kernel into Eq. 6.6 yields
$$\sum_y \sum_z \Pi(\hat{z} \,|\, \hat{y})\, \Pi(\hat{y} \,|\, z)\, \Pi(y, z) = \Pi(\hat{y}, \hat{z}). \tag{6.8}$$

By manipulating this expression, we can show that


$$\sum_y \sum_z \Pi(\hat{z} \,|\, \hat{y})\, \Pi(\hat{y} \,|\, z)\, \Pi(y, z) = \sum_z \Pi(\hat{z} \,|\, \hat{y})\, \Pi(\hat{y} \,|\, z) \sum_y \Pi(y, z) \tag{6.9}$$
$$\qquad\qquad\qquad\qquad\quad = \sum_z \Pi(\hat{z} \,|\, \hat{y})\, \Pi(\hat{y} \,|\, z)\, \Pi(z) \tag{6.10}$$
$$\qquad\qquad\qquad\qquad\quad = \sum_z \Pi(\hat{z} \,|\, \hat{y})\, \Pi(\hat{y}, z) \tag{6.11}$$
$$\qquad\qquad\qquad\qquad\quad = \Pi(\hat{z} \,|\, \hat{y}) \sum_z \Pi(\hat{y}, z) \tag{6.12}$$
$$\qquad\qquad\qquad\qquad\quad = \Pi(\hat{z} \,|\, \hat{y})\, \Pi(\hat{y}) \tag{6.13}$$
$$\qquad\qquad\qquad\qquad\quad = \Pi(\hat{z}, \hat{y}). \tag{6.14}$$

This shows that the Gibbs sampler has the correct target distribution Π(y, z).

6.1.1 Multivariate case


The Gibbs sampler can be extended to distributions with more than two vari-
ables in a straightforward manner. For instance, if we want to sample from a
K-dimensional probability distribution Π(x1 , . . . , xK ), the Gibbs sampler would
iteratively sample from the so-called full conditional distributions where one of
the K variables is resampled conditionally on all other variables, i.e.,

1. x1,n+1 ∼ Π(x1 | x2,n , x3,n , . . . , xK,n )

2. x2,n+1 ∼ Π(x2 | x1,n+1 , x3,n , . . . , xK,n )


..
.

3. xK,n+1 ∼ Π(xK | x1,n+1 , . . . , xK−1,n+1 ),



with xk,n as the kth variable at iteration n. We remark that the order in which
the variables are resampled can be chosen. In the algorithm above, we con-
sider a fixed-sequence scan, which means that we resample the variables in a
round-robin fashion (1 → 2 → . . . → K → 1 → 2 . . .). An alternative strategy
is called random-sequence scan, where the update sequence is chosen randomly
(e.g., 5 → 2 → 9 . . .). Moreover, one can group several variables together and
update them jointly within a single step conditionally on all other variables
(e.g., Π(x1 , x2 | x3 ) → Π(x3 | x1 , x2 ) → . . .), which can improve the conver-
gence of the sampler. While all variants of the Gibbs sampler have the right
stationary distribution Π, they may differ in certain properties. For instance,
certain random-sequence scan algorithms exhibit the detailed balance property,
while this is generally not true for fixed-scan algorithms.

Example 6.1. Imagine we want to study the statistical relationship between


price λ and service cost γ of cars. We are told that the price of cars can be well
described by a Gamma distribution p(λ) = Γ(α, β), with α and β as the shape
and inverse scale parameter of this distribution. Furthermore, we find out that
for a given price λ, the service cost γ follows an exponential distribution with
mean $\frac{1}{c\lambda}$, with c a known constant, such that p(γ | λ) = Exp(cλ). Our goal is
to analyze the joint distribution
p(λ, γ) = p(γ | λ)p(λ), (6.15)
which captures the statistical relation between price and service cost. To do so,
we try to generate random numbers from this distribution using Gibbs sampling,
which we could then use to estimate several statistical properties of the model
such as the correlation between price and service cost. To perform Gibbs sam-
pling, we need the conditional distributions p(γ | λ) and p(λ | γ). The former
one, we already know from our statistical assumptions. To calculate the latter
one, we can use Bayes’ rule:
$$p(\lambda \,|\, \gamma) = \frac{p(\gamma \,|\, \lambda)\, p(\lambda)}{p(\gamma)} \propto p(\gamma \,|\, \lambda)\, p(\lambda). \tag{6.16}$$
If we plug in the definitions of the exponential and Gamma distributions, we
obtain
$$p(\lambda \,|\, \gamma) \propto c\lambda\, e^{-c\lambda\gamma}\, \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda} \tag{6.17}$$
$$\qquad\quad\;\, \propto \lambda^{\alpha}\, e^{-\lambda(c\gamma + \beta)}. \tag{6.18}$$
The last expression is equivalent to a Gamma distribution Γ(α + 1, cγ + β) up
to a scaling constant that is independent of λ. We can therefore conclude that
p(λ | γ) = Γ(α + 1, cγ + β). Therefore, we can sample from the joint distribution
p(λ, γ) by iteratively sampling from
1. γn+1 ∼ Exp(cλn )
2. λn+1 ∼ Γ(α + 1, cγn+1 + β).
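A sketch of this Gibbs sampler in Python (our own code; the parameter values for α, β, and c are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(3)
alpha, beta, c = 2.0, 1.0, 0.5   # assumed parameter values for illustration
n_iter = 20_000

lam = 1.0                        # arbitrary initial price
samples = []
for _ in range(n_iter):
    gam = rng.exponential(scale=1.0 / (c * lam))                      # gamma | lambda ~ Exp(c*lambda)
    lam = rng.gamma(shape=alpha + 1.0, scale=1.0 / (c * gam + beta))  # lambda | gamma ~ Gamma(alpha+1, c*gamma+beta)
    samples.append((lam, gam))

samples = np.array(samples[1000:])    # discard burn-in (mixing time)
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])   # price/cost correlation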

6.2 Metropolis-Hastings Sampling


Consider a Markov Chain with one-step transition matrix P. A key question
in Markov Chains is to ask when a chain has a stationary distribution Pn = π
for n → ∞, and what that distribution is. This was discussed in Chapter
3. Markov-Chain Monte Carlo (MCMC) methods as introduced at the beginning
of this chapter consider the inverse problem: given a known target distribution π, how can
one generate a Markov Chain (or its transition matrix) that has exactly this
distribution as its stationary distribution.
The standard Markov-Chain Monte Carlo method was developed by Ulam and
von Neumann in the 1940s and published in 1949. It was Nicholas Metropolis,
however, who is said to have suggested the name “Monte Carlo” for this class of
algorithms. The Metropolis algorithm, a classical sampler to computationally
solve MCMC problems, originated from work toward the hydrogen bomb at Los
Alamos National Laboratory (New Mexico, USA) and can be seen as the first
Markov-Chain Monte Carlo (MCMC) algorithm in history. It was published
by Metropolis in 1953. The original Metropolis algorithm was only used in
physics, but it was generalized to applications in statistics by Hastings in 1970.
Since then, the resulting form of the algorithm is commonly referred to as the
Metropolis-Hastings Algorithm. It became famous because it performs much
better in high-dimensional spaces than the original Monte Carlo methods, such
as accept-reject sampling or umbrella sampling (i.e., it is less prone to fall victim
to the curse of dimensionality). Today, the Metropolis-Hastings sampler is the
basis of modern Bayesian computation.
The Metropolis-Hastings algorithm is related to Gibbs sampling, but instead
of updating components of X ⃗ t = (X1 , . . . , Xn )t one at a time, it uses a local
proposal and a rejection mechanism to compute X ⃗ n from X⃗ n−1 . Also in this
⃗ ⃗
case, the sequence X0 , X1 , . . . is a Markov Chain. The main advantage over
Gibbs sampling is that the Metropolis-Hastings sampler does not require that
it is possible/easy to sample from the full conditional probability distributions,
and it generally has smaller mixing times than the Markov Chain produces
by Gibbs sampling, in particular in cases when components of X ⃗ are highly
correlated.

6.2.1 Metropolis-Hastings Algorithm


The Metropolis-Hastings algorithm is an MCMC sampler that generates a Markov
Chain with a given stationary (target) distribution π. Like Gibbs sampling, the
algorithm generates samples from a proposal distribution q(⃗y |⃗x) given the cur-
rent state of the chain ⃗x. Unlike Gibbs sampling, however, π only needs to be
known up to a constant. This is particularly important for Bayesian compu-
tation, where the posterior is often available only in unnormalized form. The
Metropolis-Hastings algorithm generates a Markov Chain with stationary dis-
tribution π from any f = Cπ with C any constant.

The proposal distribution is required to be properly normalized, i.e.,


$$\int q(\vec{y} \,|\, \vec{x})\, \mathrm{d}\vec{y} = 1,$$

so why not directly use q(⃗y |⃗x) as a transition kernel for the chain? The reason
is that this would not satisfy detailed balance (see Eq. 6.3), which is a sufficient
condition for π being an equilibrium distribution. Detailed balance requires
that the probability of being in state ⃗x and going to ⃗y from there is the same
as the probability of doing the reverse transition if the current state is ⃗y . This
is true if and only if
π(⃗x)q(⃗y |⃗x) = π(⃗y )q(⃗x|⃗y ), (6.19)
which is not the case for arbitrary q(⃗y |⃗x). If the two sides are not the same in
general, then one must be bigger than the other. Consider for example the case
where
π(⃗x)q(⃗y |⃗x) > π(⃗y )q(⃗x|⃗y ),
i.e., moves ⃗x → ⃗y are more frequent than the reverse. The other case is anal-
ogous. The idea is then to adjust the transition probability q(⃗y |⃗x) with an
additional probability of move 0 < ρ ≤ 1 in order to reduce it to what it should
be according to detailed balance:

π(⃗x)q(⃗y |⃗x)ρ = π(⃗y )q(⃗x|⃗y ). (6.20)

Thus, the condition of detailed balance can be used to determine ρ by solving


Eq. 6.20 for ρ as:
$$\rho = \frac{\pi(\vec{y})}{\pi(\vec{x})}\, \frac{q(\vec{x}|\vec{y})}{q(\vec{y}|\vec{x})} = \frac{C\pi(\vec{y})}{C\pi(\vec{x})}\, \frac{q(\vec{x}|\vec{y})}{q(\vec{y}|\vec{x})} = \frac{f(\vec{y})}{f(\vec{x})}\, \frac{q(\vec{x}|\vec{y})}{q(\vec{y}|\vec{x})}. \tag{6.21}$$
In the second step, we realized that replacing the target distribution π with an
unnormalized version f = Cπ for any constant C leaves ρ unchanged because
the unknown normalization factor C cancels.
The resulting Metropolis-Hastings algorithm produces a Markov Chain {x⃗k },
as given in Algorithm 1.

Algorithm 1 Metropolis-Hastings
1: procedure MetropolisHastings(⃗x0 ) ▷ start point ⃗x0
2: for k = 0, 1, 2, . . . do
3:     Sample ⃗yk ∼ q(⃗y | ⃗xk )
4:     Compute ρ = min{1, [f (⃗yk ) q(⃗xk | ⃗yk )] / [f (⃗xk ) q(⃗yk | ⃗xk )]}
5:     With probability ρ, set ⃗xk+1 = ⃗yk ; else set ⃗xk+1 = ⃗xk .
6: end for
7: end procedure

The factor ρ is called the Metropolis-Hastings acceptance probability. The minimum
operator in line 4 of Algorithm 1 accounts for the case when moves ⃗y → ⃗x
are too frequent and hence the correction factor, if applied to the move from ⃗x, would be
larger than 1. In this case, the move is always accepted (and the reverse move is
reduced to satisfy detailed balance).
The Metropolis-Hastings algorithm has some similarity with the accept-reject
sampling method (see Chapter 2.4). Both depend only on ratios of probabilities.
Therefore, Algorithm 1 can also be used for random variate generation from π.
The important difference is that in Metropolis-Hastings sampling, it may be that
⃗xk+1 = ⃗xk , which has probability zero in a continuous Accept-Reject method.
This means that Metropolis-Hastings samples are correlated (as two subsequent
samples are identical with probability 1 − ρ and therefore perfectly correlated
in this case) and not i.i.d., as in accept-reject sampling.
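To make the procedure concrete, the following Python sketch implements Algorithm 1. The function names, the Gaussian random-walk proposal, and the example target are illustrative assumptions and not part of the lecture material.

# Sketch of Algorithm 1 in Python; names and the example target are assumptions.
import numpy as np

def metropolis_hastings(f, q_sample, q_density, x0, n_samples, rng=None):
    """Markov chain with stationary distribution proportional to f (f = C*pi)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    chain = [x.copy()]
    for _ in range(n_samples):
        y = q_sample(x)                                   # propose y_k ~ q(.|x_k)
        # acceptance probability from Eq. 6.21 (line 4 of Algorithm 1)
        rho = min(1.0, (f(y) * q_density(x, y)) / (f(x) * q_density(y, x)))
        if rng.random() < rho:
            x = y                                         # accept the move
        chain.append(x.copy())                            # else: stay at x_k
    return np.array(chain)

# Example: unnormalized standard normal target, Gaussian random-walk proposal.
rng = np.random.default_rng(0)
f = lambda x: np.exp(-0.5 * np.sum(x**2))                 # f = C*pi, C unknown
q_sample = lambda x: x + 0.5 * rng.normal(size=x.shape)
q_density = lambda y, x: np.exp(-np.sum((y - x)**2) / (2 * 0.5**2))  # symmetric
chain = metropolis_hastings(f, q_sample, q_density, np.zeros(1), 10_000, rng)

Note that repeated values appear in the chain whenever a proposal is rejected, which is exactly the source of correlation between subsequent samples discussed above.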

6.2.2 Convergence properties


Any Markov Chain generated by the Metropolis-Hastings algorithm has the
following properties:
1. Irreducibility ⇐⇒ q(⃗y |⃗x) > 0 ∀⃗x, ⃗y ,
2. Aperiodicity ⇐⇒ ρ < 1 (i.e., non-zero probability of ⃗xk+1 = ⃗xk ),
3. Ergodicity.
In addition, it satisfies detailed balance by design. The first two are easy to
see, while proving ergodicity is technically involved. Irreducibility requires that
no absorbing states exist, which is clearly the case if the outgoing transition
probability from any state ⃗x is always non-zero. Aperiodicity requires that
there is no sequence of states that exactly repeats with a certain period. This is
clearly the case if every now and then, randomly, the chain stays in its present
state, disrupting any regular cycle.

6.2.3 One-step transition kernel


The one-step transition probability of the Markov Chain generated by Algorithm
1 is:
\[
P_{\vec{x},\vec{y}} = \rho(\vec{y}, \vec{x})\, q(\vec{y}|\vec{x}) + \left(1 - a(\vec{x})\right)\, \delta_{\vec{x}}(\vec{y}), \tag{6.22}
\]
where δ_⃗x is the Dirac delta distribution centered at ⃗x and
\[
a(\vec{x}) = \int \rho(\vec{y}, \vec{x})\, q(\vec{y}|\vec{x})\, \mathrm{d}\vec{y}. \tag{6.23}
\]

Theorem 6.2. If the Markov Chain generated by Metropolis-Hastings sampling
is irreducible, then for any integrable real-valued function h
\[
\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} h(\vec{x}_t) = \mathrm{E}_{\pi}[h(\vec{X})] \tag{6.24}
\]
for every starting point ⃗x0 .


This means that expectations can be replaced by empirical means, which pro-
vides a powerful property in practical applications where expectations of un-
normalized distributions are to be estimated. The reason this works is because
the chain is ergodic. Indeed, for ergodic stochastic processes, ensemble averages
and time averages are interchangeable. However, it is important to only use the
samples of the Metropolis-Hastings chain after the algorithm has converged.
The first couple of samples generated must be ignored because they do not yet
come from the correct target distribution π. This immediately raises the question
of how to detect whether the chain has converged, i.e., after how many iterations
one can start using the samples.

6.2.4 Special Proposal Choices


The main degree of freedom one has to tune the Metropolis-Hastings algorithm
to specific applications is the choice of the proposal distribution q(⃗y |⃗x). While
a myriad of possible choices exist, we here review two particularly important
special cases.

6.2.4.1 Symmetric proposals


A symmetric proposal is a proposal that only depends on the distance between
the two states and not on the actual source state, thus:
q(⃗y |⃗x) = g(⃗y − ⃗x)
with symmetric (i.e., even) distribution g(−⃗z) = g(⃗z). This greatly simplifies
the algorithm, as the proposal does not depend on the source state any more,
but only on the “step length”, which is typically the case when simulating
conservative systems (i.e., potential fields). In this case, the proposed new state
⃗yk can be written as:
⃗yk = ⃗xk + ⃗ε
with the random variable ⃗ε ∼ g. Then, the Metropolis-Hastings acceptance
probability simplifies to:
   
\[
\rho = \min\left\{1, \frac{f(\vec{y}_k)\, g(\vec{y}_k - \vec{x}_k)}{f(\vec{x}_k)\, g(\vec{x}_k - \vec{y}_k)}\right\}
= \min\left\{1, \frac{f(\vec{y}_k)}{f(\vec{x}_k)}\right\} \tag{6.25}
\]
due to the symmetry of g. With this choice of proposal, the algorithm is also
called Random Walk Metropolis-Hastings algorithm. It accepts every move to a
more probable state with probability 1 and moves to less probable states with
probability f (⃗y )/f (⃗x) = π(⃗y )/π(⃗x) < 1. A popular (because easy to simulate
from) choice for g is the multivariate Gaussian distribution, in which case the
Markov chain generated is a Brownian walk (see Chapter ??).

6.2.4.2 Independent proposals


A further simplification is given by assuming an independent proposal for which
the probability only depends on the target state, but neither on the source state
nor on the distance between source and target. Therefore, an independent
proposal can be written as:
q(⃗y |⃗x) = g(⃗y )
for all ⃗x and simply gives the probability of going to state ⃗y regardless of the
current state. The Metropolis-Hastings acceptance probability then becomes:
\[
\rho = \min\left\{1, \frac{f(\vec{y}_k)\, g(\vec{x}_k)}{f(\vec{x}_k)\, g(\vec{y}_k)}\right\}. \tag{6.26}
\]

With this choice of proposal, the algorithm is called Independent Metropolis-Hastings
algorithm. It generates ⃗yk ∼ g(⃗y ) i.i.d. Still, the ⃗xk are not i.i.d., since
not every move is accepted. The independent Metropolis-Hastings algorithm
converges if g(⃗y ) > 0 for all ⃗y in the support of f . In particular, it converges if
there exists a constant M > 0 such that f (⃗x) ≤ M g(⃗x) for all ⃗x. Every such pair
of functions (f, g) can also be used in Accept-Reject sampling, but independent
Metropolis-Hastings has a higher expected acceptance probability (namely at least
1/M at stationarity) and is therefore typically more efficient.

6.2.5 Stopping Criteria


The question when to stop a Metropolis-Hastings sampler is mostly applica-
tion dependent. The short answer is: when enough samples have been col-
lected. However, it is important to keep in mind that the samples generated
by Metropolis-Hastings are not independent. Since the same sample may ap-
pear repeatedly, the information gathered about the target distribution π is not
directly given by the number of samples generated. It is instructive to com-
pute the autocorrelation c(τ ) of a given chain and thin it by using only every
τ -th sample, where τ is chosen such that c(τ ) ≈ 0. This is particularly im-
portant when estimating variances (or higher-order moments) of π, where only
independent samples guarantee convergence to the correct result.

6.3 Thinning and Convergence Diagnostics


It takes time until the Markov Chain generated by the Metropolis-Hastings al-
gorithm has converged (in distribution!), i.e., until {xk } ∼ π(x). The time until
convergence is called burn-in period. The samples generated during the burn-in
period are not to be used as they do not come from the desired distribution (the
Markov Chain has not yet reached stationarity). Convergence Diagnostics are
designed to determine when the burn-in period is over or to decide on thinning of
the chain. “Thinning” refers to the process of only using every k-th sample from
the chain (e.g., every 10-th) and is used to reduce correlation between samples
(remember that there is a non-zero probability that the Metropolis-Hastings
algorithm outputs the same sample repeatedly). While thinning is irrelevant
when using the chain to estimate means, variances can only be estimated from
uncorrelated samples.
The design of convergence diagnostics for Markov Chains is an open research
topic and most available methods are heuristics. This means that they usually
work in practice, but are not based on theory and provide no guaranteed per-
formance. What may immediately come to mind is to run many chains and
use the Law of Large Numbers (see Section 4.1) to check for normality of the
means computed across chains. While this would work, it is very inefficient as
it requires a very large (potentially millions) number of chains to be generated.
Much more efficient convergence tests are available. We present the two most
frequently used diagnostics for Markov Chains: one based on a statistical test,
and one based on empirical correlation. Both are general to arbitrary stationary
Markov Chains, and not limited to the Metropolis-Hastings algorithm.

6.3.1 Gelman-Rubin Test


The Gelman-Rubin test is a statistical test that provides a general convergence
diagnostic for stationary Markov Chains. It is based on comparing intra-chain
variance with inter-chain variance and requires generating M independent real-
izations of the chain from different starting points x_0^m , m = 1, . . . , M . Consider
scalar chains and denote these M chains by

{xk }m , m = 1, . . . , M (6.27)

and only use the N samples k = l, . . . , N + l of each chain, i.e., discard the first
l − 1 samples.
Then, calculate:
• the mean of each chain $\hat{\mu}_m = \frac{1}{N}\sum_{i=1}^{N} x_i^m$,
• the empirical variance of each chain $\hat{\sigma}_m^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i^m - \hat{\mu}_m)^2$,
• the mean across all chains $\hat{\mu} = \frac{1}{M}\sum_{m=1}^{M} \hat{\mu}_m$,
• the variation of the means across chains $B = \frac{N}{M-1}\sum_{m=1}^{M} (\hat{\mu}_m - \hat{\mu})^2$, and
• the average chain variance $W = \frac{1}{M}\sum_{m=1}^{M} \hat{\sigma}_m^2$.
Then, compute the Gelman-Rubin test statistic, defined as:
\[
V = \left(1 - \frac{1}{N}\right) W + \frac{M+1}{MN}\, B. \tag{6.28}
\]
The chain has converged if $\hat{R} := \sqrt{V/W} \approx 1$ (in practice |R̂ − 1| < 0.001). Choose
the smallest possible l (i.e., length of the burn-in period) such that this is the
case.
The reason this test works is that both W and V are unbiased estimators of the
variance of π (not proven here). Therefore, for converged chains, they should be
the same. For increasing l, R̂ usually approaches unity from above (for initially
over-dispersed chains).
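A direct transcription of these quantities into Python might look as follows; this is a sketch, and the function name and convergence threshold are assumptions.

# Gelman-Rubin diagnostic for M scalar chains of equal length N (sketch).
import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (M, N) with M independent chain realizations."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    mu_m = chains.mean(axis=1)                     # per-chain means
    s2_m = chains.var(axis=1, ddof=1)              # per-chain empirical variances
    mu = mu_m.mean()                               # mean across all chains
    B = N / (M - 1) * np.sum((mu_m - mu) ** 2)     # between-chain variation
    W = s2_m.mean()                                # average within-chain variance
    V = (1 - 1 / N) * W + (M + 1) / (M * N) * B    # Eq. 6.28
    return np.sqrt(V / W)                          # R-hat, ~1 for converged chains

# Usage: increase the discarded burn-in l until abs(gelman_rubin(chains) - 1) is small.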
6.3.2 Autocorrelation Test


The Gelman-Rubin test works well in practice and is based on solid reasoning,
but it requires M independent replica of the chain. If only a single realization
of the chain is available, it cannot be used. The (weaker, in terms of statistical
power) autocorrelation test can then be used instead. It is based on computing
the autocorrelation function of the Markov chain, defined as:
\[
c(\tau) = \frac{\sum_{i=1}^{N-\tau} (x_i - \hat{\mu})(x_{i+\tau} - \hat{\mu})}{\sum_{i=1}^{N} (x_i - \hat{\mu})^2}, \tag{6.29}
\]
where $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$ is the mean of the chain samples.
function c(τ ) is a function of the time lag τ and is, in practice, computed iter-
atively for increasing τ = 1, 2, . . .. For each τ , c(τ ) tells us what the average
correlation is between samples that are τ time points apart. A sufficient con-
dition for convergence is that the samples are uncorrelated. Therefore, choose
the smallest τ for which c(τ ) ≈ 0 as the length of the burn-in period. Beyond
this time point, samples are effectively independent.
While uncorrelatedness is a sufficient condition for convergence, it is not nec-
essary. Recall that the Metropolis-Hastings algorithm may produce correlated
samples even at stationarity. In order to obtain uncorrelated samples, the chain
then needs to be thinned. The autocorrelation test is also useful in deciding
the thinning factor. Indeed, using only every τ -th sample from the chain
provides roughly uncorrelated samples and is a good choice for thinning.
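In code, the autocorrelation of Eq. 6.29 and the resulting thinning could be done as in the following sketch; the cutoff below which c(τ) is considered ≈ 0 is an assumption.

# Empirical autocorrelation (Eq. 6.29) and thinning of a scalar chain (sketch).
import numpy as np

def autocorrelation(x, tau):
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    num = np.sum((x[: len(x) - tau] - mu) * (x[tau:] - mu))
    return num / np.sum((x - mu) ** 2)

def thinning_lag(x, cutoff=0.05, max_lag=1000):
    """Smallest tau with |c(tau)| below the (assumed) cutoff."""
    for tau in range(1, max_lag + 1):
        if abs(autocorrelation(x, tau)) < cutoff:
            return tau
    return max_lag

# Usage: tau = thinning_lag(chain); thinned_chain = chain[::tau]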
Chapter 7

Stochastic Optimization
Optimization problems are amongst the most widespread in science and en-
gineering. Many applications, from machine learning over computer vision to
parameter fitting, can be formulated as optimization problems. An optimization
problem consists of finding the optimal ϑ⃗∗ such that
\[
\vec{\vartheta}^{\,*} = \arg\min_{\vec{\vartheta}} h(\vec{\vartheta}), \qquad h : \mathbb{R}^n \to \mathbb{R},\ \vec{\vartheta} \mapsto h(\vec{\vartheta}), \tag{7.1}
\]
for some given function h. This function is often called the cost function, loss
function, fitness function, or criterion function. Following the usual convention,
we define an optimization problem as a minimization problem, where maximiza-
tion is equivalently contained when replacing h with −h.
Problems of this type, where h maps to R, are called scalar real-valued opti-
mization problems or real-valued single-objective optimization. Of course, one
can also consider complex-valued or integer-valued problems, as well as vector-
valued (i.e., multi-objective) optimization. Many of the concepts considered
here generalize to those cases, but we do not describe such generalizations here.
For large n (i.e., high-dimensional domains) or non-convex h(·), there are no
efficient deterministic algorithms to solve the above optimization problem. In
fact, a non-convex function in n dimensions can have exponentially (in n) many
(local) minima. Since the goal according to Eq. 7.1 is to find the best local
minimum, i.e., the global minimum, deterministic approaches have a worst-case
runtime that is exponential in n. A typical way out is the use of randomized
stochastic algorithms. While randomized algorithms can be efficient, they pro-
vide no performance guarantees, i.e., they may not converge, may not find any
minimum, or may get stuck in a sub-optimal local minimum. The only tests
that can be used to compare and select randomized optimization algorithms are
heuristic benchmarks, typically obtained by running large ensembles of Markov
chains. There are two widely used standard suites of test problems: the IEEE
CEC2005-2020 standard and the ACM GECCO BBOB. They define test prob-
lems with known exact solutions, as well as detailed evaluation protocols, on


which algorithms are to be compared and tested. About 200 different stochas-
tic optimization algorithms have been benchmarked on these tests with the test
results publicly available.
Many stochastic optimization methods have the big advantage that they do not
require h to be known in closed form. Instead, it is often sufficient that h can be
evaluated point-wise. Therefore, h does not have to be a mathematical function,
but can also be a numerical simulation, a laboratory measurement, or
user input. Algorithms of this sort are called black-box optimization algorithms,
and optimization problems with an unknown, but evaluatable h are called black-
box optimization problems.
Designing good stochastic optimization algorithms is a vast field of research,
which is in itself split into sub-fields such as evolutionary computing, random-
ized search, and biased sampling. Many exciting concepts, from evolutionary
biology over information theory to Sobolev calculus, are being exploited on this
problem. Here, we discuss representative classes of algorithms for
Monte-Carlo optimization from each of these sub-fields: stochastic descent and
random pursuit from the class of randomized search heuristics, simulated annealing
from the class of biased sampling methods, and evolution strategies from
the class of evolutionary algorithms.

7.1 Stochastic Exploration


Stochastic exploration is a classic approach for the case when the search space,
i.e., the domain Θ ⊂ R^n of ϑ⃗, is bounded. In this case, a straightforward
approach is to sample:
\[
\vec{\mu}_1, \ldots, \vec{\mu}_m \sim \mathcal{U}(\Theta) \tag{7.2}
\]
uniformly over Θ and then use:
\[
\vec{\vartheta}^{\,*} \approx \arg\min\left(h(\vec{\mu}_1), \ldots, h(\vec{\mu}_m)\right). \tag{7.3}
\]

This clearly converges to the correct global minimum for m → ∞ and has a
linear computational complexity in O(m). For general h, the number of samples
required to reach a given probability of finding the global minimum is m ∝ |Θ| =
C^n , where the constant C > 0 is the linear dimension of the search space Θ.
Stochastic exploration therefore converges exponentially slowly and is particu-
larly impractical in cases where h(·) is costly to evaluate, e.g., where it is given by
running a simulation or performing a measurement. This is because stochastic
exploration “blindly” samples the search space without exploiting any structure
or properties of h that may be known.
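As a minimal sketch, stochastic exploration over a box-shaped Θ can be written as follows; the box shape and the example cost function are assumptions.

# Stochastic exploration over Theta = [lo, hi]^n (sketch).
import numpy as np

def stochastic_exploration(h, lo, hi, n_dim, m, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mu = rng.uniform(lo, hi, size=(m, n_dim))      # mu_1, ..., mu_m ~ U(Theta), Eq. 7.2
    values = np.array([h(x) for x in mu])
    best = np.argmin(values)                       # Eq. 7.3
    return mu[best], values[best]

# Example: minimize h(x) = sum(x^2) on [-5, 5]^2 with m = 10^4 samples.
theta_star, h_star = stochastic_exploration(lambda x: np.sum(x**2), -5.0, 5.0, 2, 10_000)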

7.2 Stochastic Descent


A first attempt to reduce the number of samples required by exploiting ad-
ditional information about h is to use the gradient of h. This is inspired by
deterministic gradient descent as a popular algorithm to find local minima:
\[
\vec{\vartheta}_{j+1} = \vec{\vartheta}_j - \alpha_j \nabla h(\vec{\vartheta}_j), \qquad j = 0, 1, 2, \ldots \tag{7.4}
\]
with step size αj > 0. If the step size is well chosen (or dynamically adjusted),
this converges to a (not necessarily the nearest) local minimum of h around
the given starting point ϑ ⃗ 0 . In practice, however, the gradient ∇h may not
be known analytically (in black-box problems, even h itself may not be known
analytically), and numerically approximating gradients may be expensive for
large n.
Stochastic descent therefore replaces the iteration over the gradient with:
\[
\vec{\vartheta}_{j+1} = \vec{\vartheta}_j - \frac{\alpha_j}{2\beta_j}\, \Delta h(\vec{\vartheta}_j, \beta_j \vec{u}_j)\, \vec{u}_j, \qquad j = 0, 1, 2, \ldots \tag{7.5}
\]
with ⃗uj i.i.d. uniform random variates on the unit sphere (i.e., |⃗uj | = 1) and
\[
\Delta h(\vec{x}, \vec{y}) = h(\vec{x} + \vec{y}) - h(\vec{x} - \vec{y}) \approx 2|\vec{y}|\, \nabla h(\vec{x}) \cdot \frac{\vec{y}}{|\vec{y}|}. \tag{7.6}
\]
The latter is because the finite difference
\[
\frac{h(\vec{x} + \vec{y}) - h(\vec{x} - \vec{y})}{2|\vec{y}|} \approx \nabla h(\vec{x}) \cdot \frac{\vec{y}}{|\vec{y}|}
\]
is an approximation to the directional derivative of h in direction y. This itera-
tion does not proceed along the steepest slope and therefore has some potential
to overcome local minima. Stochastic descent has two algorithm parameters:
• αj : step size,
• βj : sampling radius.
One can show that stochastic descent converges to a local optimum if $\alpha_j \downarrow 0$
for $j \to \infty$ and $\lim_{j\to\infty} \alpha_j/\beta_j = \mathrm{const} \neq 0$. There are no guarantees of global con-
vergence. The problem in practice usually is the correct choice and adaptation
of αj and βj . The biggest advantage over stochastic exploration is a greatly
increased convergence speed, and that the method also works in unbounded
search spaces.
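A possible implementation of the iteration in Eq. 7.5 is sketched below; the schedules α_j = α_0/j and β_j = β_0/j are assumptions chosen so that α_j ↓ 0 while α_j/β_j stays constant. Random unit directions are generated here by normalizing a Gaussian vector, an alternative to the angle- and rotation-based constructions discussed below.

# Stochastic descent (Eq. 7.5), sketch with assumed step-size/radius schedules.
import numpy as np

def stochastic_descent(h, theta0, n_iter, alpha0=0.1, beta0=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    for j in range(1, n_iter + 1):
        alpha, beta = alpha0 / j, beta0 / j        # alpha_j -> 0, alpha_j/beta_j = const
        u = rng.normal(size=theta.shape)
        u /= np.linalg.norm(u)                     # uniform direction, |u| = 1
        dh = h(theta + beta * u) - h(theta - beta * u)   # Delta h(theta_j, beta_j*u_j)
        theta = theta - alpha / (2.0 * beta) * dh * u    # Eq. 7.5
    return theta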
A side note on uniform random numbers on the unit sphere: Simply sampling
uniformly in the spherical angles leads to a bias toward the poles. One needs to
correct the samples with the arccos of the polar angle in order for them to have
uniform area density on the unit sphere.
Example 7.1. For example, in 3D, uniform random points ⃗u ∼ U(S 2 ) on the
unit sphere S 2 can be sampled by:
sampling φ ∼ U(0, 2π) (7.7)
sampling w ∼ U(0, 1) (7.8)
computing θ = arccos(2w − 1) (7.9)
and then using (r = 1, φ, θ) as the spherical coordinates of the sample point on
the unit sphere, where the polar angle θ runs from 0 (north pole) to π (south
pole) and the azimuthal angle φ from 0 to 2π.
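In Python, the recipe of Example 7.1 reads as follows (the function name is an assumption):

# Uniform random points on the unit sphere S^2, following Example 7.1 (sketch).
import numpy as np

def sample_unit_sphere(n, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n)    # azimuthal angle, Eq. 7.7
    w = rng.uniform(0.0, 1.0, size=n)              # Eq. 7.8
    theta = np.arccos(2.0 * w - 1.0)               # corrected polar angle, Eq. 7.9
    return np.column_stack((np.sin(theta) * np.cos(phi),
                            np.sin(theta) * np.sin(phi),
                            np.cos(theta)))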
In higher dimensions n > 3, uniform random points on the unit hypersphere


S n−1 can be sampled by successive planar rotations, which can be efficiently
computed using Givens rotations. This is, e.g., described in: C. L. Müller,
B. Baumgartner, and I. F. Sbalzarini. Particle swarm CMA evolution strategy
for the optimization of multi-funnel landscapes. In Proc. IEEE Congress on
Evolutionary Computation (CEC), pages 2685-2692, Trondheim, Norway, May
2009.

7.3 Random Pursuit


The next example of a classic algorithm is Random Pursuit. It is related to
stochastic descent, but does not require choosing or adapting any step-size pa-
rameters (like αj and βj in stochastic descent). It tries to keep the good conver-
gence properties of stochastic descent, while relaxing its most salient drawback.
The procedure is given in Algorithm 2.

Algorithm 2 Random Pursuit
1: procedure RandomPursuit(ϑ⃗0 ) ▷ start point ϑ⃗0
2: for k = 0, 1, . . . , N − 1 do
3:     ⃗uk ∼ U(S n−1 ) ▷ uniform on the unit sphere
4:     ϑ⃗k+1 = ϑ⃗k + LineSearch(ϑ⃗k , ⃗uk )⃗uk
5: end for
6: return ϑ⃗N
7: end procedure

The subroutine LineSearch(⃗x, ⃗y ) finds the distance to the minimum of h along


a line in direction ⃗y starting from ⃗x. Since this is a 1D problem, it can efficiently
be solved using stochastic exploration, usually followed by local descent. Note
that bisection search only works for convex h.
Random Pursuit does not require setting any step sizes, thanks to the line
search subroutine. It is guaranteed to converge to a (local) minimum of h but
only works for bounded search spaces Θ. There are no guarantees of global
convergence.
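A sketch of Algorithm 2 is given below. The line search is implemented here by 1D stochastic exploration over an assumed bounded step range [−L, L]; this choice, the parameter values, and the function names are illustrative assumptions.

# Random Pursuit (Algorithm 2), sketch with a sampling-based 1D line search.
import numpy as np

def line_search(h, x, u, L=5.0, m=100, rng=None):
    """Approximate distance to the minimum of h along direction u from x."""
    rng = np.random.default_rng() if rng is None else rng
    steps = rng.uniform(-L, L, size=m)
    values = np.array([h(x + s * u) for s in steps])
    return steps[np.argmin(values)]

def random_pursuit(h, theta0, n_iter, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        u = rng.normal(size=theta.shape)
        u /= np.linalg.norm(u)                     # u_k ~ U(S^{n-1})
        theta = theta + line_search(h, theta, u, rng=rng) * u
    return theta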

7.4 Simulated Annealing


A classic stochastic optimization method is Simulated Annealing, as devel-
oped in 1953 by Metropolis. Strictly speaking, this is not a pure Monte-Carlo
method, but a Markov-Chain Monte-Carlo method, because the samples are
not i.i.d. The basic idea is to perform stochastic exploration with a proposal
that focuses on promising regions. This should speed up stochastic exploration
and reduce its computational cost (i.e., number of samples required) to some-
thing tractable. Like stochastic exploration, simulated annealing only works for
bounded search spaces Θ.
The simulated annealing algorithm is inspired from physics, particularly from


statistical mechanics. There, the probability of finding a collection of particles
(e.g., atoms) in a certain state is proportional to the Boltzmann factor:

P (state) ∝ e−E/T , (7.10)

where E is the energy of the state and T is the temperature in the system.
Upon cooling (T ↓), the system settles into low-energy states, i.e., it finds min-
ima in E(·). In Algorithm 3, this analogy is exploited to perform stochastic
optimization over general functions h.

Algorithm 3 Simulated Annealing
1: procedure SimulatedAnnealing(ϑ⃗0 ) ▷ start point ϑ⃗0
2: for k = 0, 1, . . . , N − 1 do
3:     ⃗uk ∼ U(B(ϑ⃗k )) ▷ uniform in B
4:     ∆h = h(⃗uk ) − h(ϑ⃗k )
5:     accept ϑ⃗k+1 = ⃗uk with probability ρ = min{e−∆h/T , 1},
6:     else set ϑ⃗k+1 = ϑ⃗k
7:     Tk+1 = f (Tk ) ▷ adjust temperature
8: end for
9: return ϑ⃗N
10: end procedure

The Simulated Annealing algorithm actually amounts to a Metropolis-Hastings


sampler, as discussed in Section 6.2, where the acceptance probability is given by
the Boltzmann factor of the change in h the move would cause. This means that
the Markov chain thus generated has a higher probability of moving in directions
of decreasing h, but it can also “climb uphill” with some lower probability in
order to escape local minima.
The proposal focuses on promising regions by only sampling over a neighborhood
B around the current state. This neighborhood is typically chosen to be a box
or a sphere, because sampling uniformly over these shapes is easy.
The key mechanism to push for lower-energy solutions is the gradual adapta-
tion of the “temperature” T . Due to the physical analogy, the parameter T is
called “temperature”, although it is not actually a physical temperature, but
an exploration range parameter for the sampler. In each iteration, T is reduced
according to a fixed “cooling schedule” f (·), which is a function that computes
the new temperature from the current one (e.g., as Tk+1 = 0.95 Tk ).
Simulated Annealing converges to a (local) minimum of h for bounded search
spaces Θ, provided the cooling T ↓ 0 does not happen too fast. If the cooling happens too fast, then the
sampler finds no acceptable points any more at all and gets stuck, i.e., does
not converge to any (not even a local) minimum. Clearly, choosing the right
cooling schedule f (·) is of paramount importance. If the cooling is too fast,
the algorithm does not converge to a minimum. If the cooling is too slow,
convergence is slow. Together with the need to choose the right neighborhood
size B, these are the main difficulties in using Simulated Annealing in practice.
Clearly, the choice of f and B has to depend on what the function h looks like,
which may not always be known in a practical application.
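The following sketch implements Algorithm 3 with a box neighborhood B of half-width r and a geometric cooling schedule T_{k+1} = 0.95·T_k; both choices, as well as the parameter values, are assumptions.

# Simulated annealing (Algorithm 3), sketch with assumed neighborhood and cooling.
import numpy as np

def simulated_annealing(h, theta0, n_iter, T0=1.0, r=0.5, cooling=0.95, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    T = T0
    for _ in range(n_iter):
        u = theta + rng.uniform(-r, r, size=theta.shape)   # uniform proposal in B
        dh = h(u) - h(theta)
        if dh <= 0 or rng.random() < np.exp(-dh / T):      # Boltzmann acceptance
            theta = u
        T *= cooling                                       # cooling schedule f(T)
    return theta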

7.5 Evolutionary Algorithms


Another classic family of nature-inspired stochastic optimization algorithms are
Evolutionary Algorithms. This time, the motivation does not come from statis-
tical mechanics, but from Darwinian evolution of genomes. The basic idea is to
apply random “mutations” to the state ϑ ⃗ and then to evaluate in a “selection”
whether or not to keep the new state. Like in simulated annealing, the nomen-
clature of the inspiring domain (here: evolutionary biology) is used. The key
differences to simulated annealing are that:
• mutation is not done by uniformly sampling over a bounded neighborhood,
but usually from a multivariate Gaussian with mean 0 and covariance
matrix Σ;
• selection can be seen as a cooling schedule f that depends on ∆h rather
than being pre-determined.
This addresses the biggest problem of simulated annealing: choosing the cooling
schedule.
Evolutionary Algorithms were introduced by Ingo Rechenberg in 1971 and Hans-
Paul Schwefel in 1974. Today, they comprise a large class of bio-inspired op-
timization algorithms, including genetic algorithms, ant-colony optimization,
particle-swarm optimization, and evolution strategies. The nomenclature is
somewhat fuzzy, but methods that operate over discrete domains Θ are gener-
ally referred to as genetic algorithms, whereas methods for continuous domains,
such as the one in Eq. 7.1 where the domain of the cost function is a continuous
set, are generally called evolution strategies.
In its simplest form, an evolution strategy proceeds as outlined in Algorithm 4.

Algorithm 4 (1+1)-ES Evolution Strategy
1: procedure 1+1-ES(ϑ⃗0 ) ▷ start point ϑ⃗0
2: for k = 0, 1, . . . , N − 1 do
3:     ⃗uk = ϑ⃗k + N (0, Σk ) ▷ mutation
4:     ϑ⃗k+1 = ⃗uk if h(⃗uk ) ≤ h(ϑ⃗k ), else ϑ⃗k+1 = ϑ⃗k ▷ selection
5: end for
6: return ϑ⃗N
7: end procedure

This type of evolution strategy is called a (1+1)-ES, where “ES” is short for
“Evolution Strategy”. The (1+1)-ES converges linearly on unimodal h(·), i.e.,
its convergence rate is asymptotically as good as that of deterministic gradient


descent, but without requiring the gradient of h or any approximation of it.
ES are not limited to generating only one sample per iteration. This gives rise
to an entire family of methods, classified as:

• (1+1)-ES: one “parent sample” ϑ⃗k and one “offspring sample” ϑ⃗k+1 in
each iteration (“generation”); see above.

• (1,λ)-ES: sample λ new points ⃗uk,1 , . . . , ⃗uk,λ i.i.d. from the same Gaussian
mutation distribution in each iteration and set ϑ⃗k+1 = arg min_{⃗uk,i} {h(⃗uk,i )}_{i=1}^{λ} ,
i.e., keep the best offspring to become the parent of the next generation.

• (1+λ)-ES: same as above, but include the parent in the selection, ϑ⃗k+1 =
arg min{h(⃗uk,1 ), . . . , h(⃗uk,λ ), h(ϑ⃗k )}, i.e., stay at the old point if none of
the new samples are better.

• (µ,λ)-ES and (µ+λ)-ES: retain the best µ samples for the next iteration,
which then has µ “parents”. Use a linear combination (i.e., “genetic re-
combination”, e.g., their mean or pairwise averages) of the parents as the
center for the mutation distribution of the new generation. In the comma
version, do not include the parents in the selection; in the plus version, do
include them, as above.

Evolution strategies are further classified into those where the covariance
matrix Σ of the Gaussian mutation distribution, i.e., the mutation rates, is constant,
and those where it is dynamically adapted according to previously seen
samples.

7.5.1 ES with fixed mutation rates


In an ES with fixed mutation rates, Σ is a constant and does not depend on k,
thus: Σk = Σ = const for all k. Typically, it is even chosen as Σ = σ² 1 (with 1 the
identity matrix) for some constant scalar mutation rate σ. This means that the same
mutation rate is applied to all elements of the vector ϑ⃗ and mutations only act along
coordinate axes.

7.5.2 ES with adaptive mutation rates


In all of the above variants of evolution strategies, the mutation rates can also
be dynamically adapted depending on previous progress. The question of course
is how, i.e., how to determine Σk+1 from Σk and the current fitness values of
the sample population.
There are three classic approaches to this, which we present below. Many more
exist and can be found in the literature. This is an active field of research.
Typically, these adaptive evolution strategies are empirically found to work well
on test problems and in practical applications, but their convergence has so far
not been proven and convergence rates are unknown.
7.5.2.1 Rechenberg’s 1/5 rule


We keep the covariance matrix of the mutation distribution proportional to the
identity matrix, i.e., isotropic: Σk = σ_k² 1, but we adapt the scalar mutation
rate σk over iterations/generations k as follows:
If less than 1/5 (i.e., 20%) of the new samples are better than the (best) old
sample, then decrease σk+1 = dσk with some decrease factor d < 1. Else,
increase σk+1 = eσk with some expansion factor e > 1. The fraction of new
samples that are better than the (best) old sample is called the success rate.
For λ-sample strategies, the success rate can be computed per iteration if λ ≥ 5.
Otherwise, the success rate is computed as a running average over more than 5 past
iterations, possibly with exponential forgetting.
The idea behind this empirical rule is that if less than 1/5 of the samples are
successful, the algorithm is hopping around a minimum and the step size should
be decreased in order to force convergence into that minimum. If more than
1/5 are successful, the algorithm is on a run toward a minimum (i.e., down a
slope) and larger steps can be tried to increase convergence speed. Of course,
the art becomes choosing the factors d and e.
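A sketch of a (1+1)-ES with the 1/5 success rule follows; the decrease/expansion factors d and e, the adaptation window, and the parameter defaults are assumptions.

# (1+1)-ES with Rechenberg's 1/5 success rule (sketch; d, e are assumptions).
import numpy as np

def one_plus_one_es(h, theta0, n_iter, sigma0=1.0, d=0.82, e=1.22, window=20, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    h_theta = h(theta)
    sigma, successes = sigma0, 0
    for k in range(1, n_iter + 1):
        u = theta + sigma * rng.normal(size=theta.shape)   # isotropic mutation
        h_u = h(u)
        if h_u <= h_theta:                                  # selection
            theta, h_theta = u, h_u
            successes += 1
        if k % window == 0:                                 # adapt every `window` steps
            sigma *= e if successes / window > 0.2 else d
            successes = 0
    return theta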

7.5.2.2 Self-adaptation
Self-adaptation does away with the need of choosing decrease and expansion
factors for the mutation rate. Instead, it lets the Darwinian selection process
itself take care of adjusting the mutation rates. For this, each sample (i.e.,
“individual”) has its own mutation rate σk,i , i = 1, . . . , λ, in iteration k, and we
again use isotropic mutations Σk,i = σ²k,i 1, but now with a different covariance
matrix for each offspring sample
\[
\vec{u}_{k,i} \sim \mathcal{N}(\vec{\vartheta}_k, \sigma_{k,i}^2 \mathbb{1}), \qquad i = 1, \ldots, \lambda. \tag{7.11}
\]

The individual mutation rates for each offspring are themselves sampled from a
Gaussian distribution, as:
\[
\sigma_{k,i} \sim \mathcal{N}(\sigma_k, s^2), \tag{7.12}
\]
where σk is the mutation rate of the parent (or the h-weighted mean of the
mutation rates of the parents for a µ-strategy), and s is a step size. This self-
adaptation mechanism will automatically take care that samples with “good”
mutation rates have a higher probability of becoming parents of the next gener-
ation and hence the mutation rate is inherited by the offspring as one of the “ge-
netic traits”. Choosing the step size s is unproblematic. Performance is robust
over a wide range of choices. However, the main drawback of self-adaptation
is its reduced efficiency. The same point potentially needs to be tried multiple
times for different mutation rates, therefore increasing the number of samples
required to converge.

7.5.2.3 Covariance matrix adaptation (Hansen & Ostermeier, 1993)


A more efficient (if not the most efficient known) way of adapting the mutation
rates is by considering the path the sampler has taken over the past iterations
and using that to adapt a fully anisotropic covariance matrix Σk that can also
have off-diagonal elements. This allows the mutation rates of different elements
of the vector ϑ⃗ to be different, in order to account for different scaling of the
parameters in different coordinate directions of the domain. The off-diagonal
elements are used to exploit correlations between search dimensions.
The classic algorithm to achieve this is CMA-ES, the evolution strategy with
Covariance-Matrix Adaptation (CMA). The algorithm adapts the covariance
matrix of the mutation distribution by rank-µ updates of Σ based on correlations
between the previous best samples, which can be interpreted nicely in terms
of information geometry. The algorithm uses Cholesky decompositions and
eigenvalue calculations, and we are not going to give it in full detail here. We
refer to online resources (e.g., wikipedia) for details.
CMA-ES roughly proceeds by:
1. sampling λ offspring from N (ϑ⃗k , Σk )
2. choosing the best µ < λ: ⃗u1 , . . . , ⃗uµ
3. recombining them as ϑ⃗k+1 = mean(⃗u1 , . . . , ⃗uµ )
4. rank-µ update of Σk using ⃗u1 , . . . , ⃗uµ =⇒ Σk+1
A few remarks about CMA-ES:

• The rank-µ update requires an eigendecomposition of Σ and therefore has


a computational cost of O(n³).
• No formal convergence proof exists for CMA-ES.
• CMA-ES can be interpreted as natural gradient descent in the space of
sample distributions.
• CMA-ES usually is among the top performers in benchmarks (IEEE CEC
2005, ACM BBOB 2015).
• The Markov Chain generated by CMA-ES has nice stationarity and invari-
ance properties. Note that CMA-ES still is a Markov-Chain Monte-Carlo
method, since the rank-µ update of the covariance matrix only depends
on the current samples.
• On convex deterministic problems, CMA-ES is about 10-fold slower than
the deterministic BFGS optimizer, but CMA-ES is black-box and also
works on non-convex and stochastic problems.
Chapter 8

Random Walks

In Chapter 3 we briefly discussed a special class of discrete-time Markov chains


called random walks. Intuitively, a random walk (RW) is a stochastic process
defined on a discrete lattice, which in each step, moves to any of its neighbouring
states with a certain probability. RWs play a central role in many different fields
including finance, physics, and biology. In this chapter we will first discuss some
properties of discrete RWs and show how those relate to continuous-time RWs.
We consider a discrete-time, integer-valued Markov chain Xn ∈ Z. At each time
step, the state of the chain either increases or decreases by one. The corresponding
transition kernel of this Markov chain is given by

P (Xn+1 = x + 1|Xn = x) = p (8.1)


P (Xn+1 = x − 1|Xn = x) = q = 1 − p, (8.2)

with p ∈ [0, 1].


Definition 8.1. Xn is called a one-dimensional Random Walk.

8.1 Characterization and Properties


8.1.1 Kolmogorov-forward Equation
As any other Markov chain, a random walk can be characterized in terms of a
Kolmogorov-forward equation

P (Xn = x) = pP (Xn−1 = x − 1) + qP (Xn−1 = x + 1). (8.3)

This recursive equation can be intuitively interpreted. If the RW is in state


x at time n, this implies that it was either in state x + 1 at time n − 1 and
decreased by one, or it was at state x − 1 and increased by one. This provides
a recursive way to calculate the probability of finding the RW at any state x
at time n. The probability P (Xn = x) will therefore consist of two additive
contributions corresponding to the two outcomes. The first contribution is the

probability that the walker was in x − 1 at n − 1 (i.e., P (Xn−1 = x − 1)) times


the probability of moving upwards (i.e., p). The second contribution is the
probability of finding the walker in x + 1 at n − 1 (i.e., P (Xn−1 = x + 1)) times
the probability of moving downwards (i.e., q). Note that eq. (8.3) is generally
infinite-dimensional, since we need to consider an equation for every possible x.

8.1.2 State Equation


Instead of characterizing the probability distribution of the RW, we can also
formulate a stochastic equation that describes the stochastic time evolution of
a single path of the RW. In particular, such an equation reads
\[
X_n = X_{n-1} + \Delta X_n = X_0 + \sum_{i=1}^{n} \Delta X_i, \tag{8.4}
\]

with ∆Xi as the random increments of the RW and X0 as some known starting
point of the RW. These increments are i.i.d. binary random variables, i.e.,
\[
\Delta X_i = \begin{cases} +1 & \text{with probability } p \\ -1 & \text{with probability } q. \end{cases} \tag{8.5}
\]

8.1.3 Mean and Variance


We next study how the mean and variance of a RW evolve with time. To this
end, we first calculate the mean and variance of the random increments ∆Xi .
For the mean we obtain
\[
\mu = \mathrm{E}\{\Delta X_n\} = (+1)\,p + (-1)\,q = p - q = p - (1-p) = 2p - 1. \tag{8.6--8.9}
\]
For the variance we obtain correspondingly,
\[
\sigma^2 = \mathrm{E}\{\Delta X_n^2\} - \mu^2 = p + q - (2p-1)^2 = 1 - 4p^2 + 4p - 1 = 4p(1-p). \tag{8.10--8.12}
\]
Using these results, we can now calculate the mean and variance of Xn for any
n. For the mean, we obtain
\[
\mathrm{E}\{X_n\} = X_0 + \mathrm{E}\!\left[\sum_{i=1}^{n} \Delta X_i\right] = X_0 + \sum_{i=1}^{n} \mathrm{E}\{\Delta X_i\} = X_0 + n\mu. \tag{8.13}
\]

The variance becomes
\[
\mathrm{Var}\{X_n\} = \mathrm{Var}\!\left[X_0 + \sum_{i=1}^{n} \Delta X_i\right] = \sum_{i=1}^{n} \mathrm{Var}\{\Delta X_i\} = n\sigma^2, \tag{8.14}
\]
where the second-to-last equality follows from the fact that (a) X0 has zero
variance (i.e., it is deterministic) and (b) the variance of a sum of independent
RVs is the sum of the variances of the individual RVs.
Definition 8.2. The mean increment µ is called the drift of a random walk.
Definition 8.3. A random walk with µ = 0 (p=q=1/2) is called symmetric.
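The following short sketch simulates many realizations of Eq. 8.4 and compares the empirical mean and variance at step n with X_0 + nµ and nσ²; all parameter values are assumptions chosen for illustration.

# Simulate 1D random walks (Eq. 8.4) and check E{X_n} and Var{X_n} (sketch).
import numpy as np

rng = np.random.default_rng(0)
p, n_steps, n_walks, x0 = 0.7, 1000, 5000, 0
increments = np.where(rng.random((n_walks, n_steps)) < p, 1, -1)   # Delta X_i = +/-1
X = x0 + np.cumsum(increments, axis=1)                             # all trajectories

mu, sigma2 = 2 * p - 1, 4 * p * (1 - p)
print("empirical mean:", X[:, -1].mean(), " theory:", x0 + n_steps * mu)
print("empirical var: ", X[:, -1].var(),  " theory:", n_steps * sigma2)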

8.1.4 Restricted Random Walks


In many practical scenarios, it is useful to define RWs on a bounded state space.
For instance, this is the case if one models the random motion of a molecule in a
volume of fixed size or a gambler with finite capital. Mathematically, this can be
accounted for by introducing appropriate boundary conditions in the transition
kernel of the underlying Markov chain. For instance, if a process cannot go
negative, we should set P (Xn+1 = −1 | Xn = 0) = 0. More generally, we can
adjust the transition probabilities to prevent a RW from leaving a certain region.
Such RWs are called restricted RWs. An important distinction of restricted
RWs can be made based on how they behave when they reach the boundary
of their state space. For instance, a gambler that reaches zero capital will no
longer be able to participate in the game, which would mean that the RW stays
at zero for ever. In this case, we would have P (Xn+1 = 1 | Xn = 0) = 0
and P (Xn+1 = 0 | Xn = 0) = 1. We refer to such a process as a restricted
RW with absorbing boundary. In other cases, the RW may ”bounce back” as
soon as it hits its boundary, which would for instance be the case if we have
P (Xn+1 = 1 | Xn = 0) = 1 and P (Xn+1 = 0 | Xn = 0) = 0. Such RWs are
called restricted RWs with reflecting boundary.

8.1.5 Relation to the Wiener process (continuous limit)


Let us consider an unrestricted 1D random walk. If we observe this process on
very long time scales, it will look almost like a continuous-time random walk.
In particular, since V ar{Xn } scales linearly with n, we will see increasingly
large deviations from the starting point as time increases, such that the relative
change between Xn and Xn+1 will appear almost continuous. Indeed, it can
be shown mathematically, that a discrete-time RW converges to a continuous-
time RW as n → ∞. An intuitive justification of this result can be given by the
central limit theorem (CLT). Let us now consider a discrete-time RW Xn within
some fixed time window. We assume that each time step of the RW corresponds
to an elapsed time r such that the total time that elapses after n iterations is
t = nr. This allows us to reparameterize the RW in terms of t such that
\[
Y(t) = X_{t/r} = X_0 + \sum_{i=1}^{t/r} \Delta X_i. \tag{8.15}
\]

We see that the random process Y (t) is given by the initial condition X0 and
a sum of i.i.d. RVs. The number of summands inside this sum will increase
linearly with t. We know from the CLT that the sum of many i.i.d. RVs will
converge to a normally distributed RV. More precisely, we have that
\[
Y(t) = X_0 + \sum_{i=1}^{t/r} \Delta X_i \;\xrightarrow{\,t \gg r\,}\; \mathcal{N}\!\left(X_0 + \frac{t}{r}\mu,\; \frac{t}{r}\sigma^2\right), \tag{8.16}
\]

where t ≫ r is equivalent to letting n be very large. If we now interpret t as
a continuous variable, we can approximate the RW as
\[
Y(t) \approx \frac{\mu}{r}\, t + \frac{\sigma}{\sqrt{r}}\, W(t), \tag{8.17}
\]

with W (t) as a standard Wiener process. This is because the standard Wiener
process is normally distributed for all t with mean E{W (t)} = 0 and variance
V ar{W (t)} = t. We will have a more detailed discussion about Wiener processes
in Chapter ??.

8.1.6 Random Walks in higher dimensions


Random walks represent a very general class of stochastic processes and can
be extended to many different scenarios. For instance, it is straightforward to
extend RWs to a two-dimensional lattice, in which case a random walker can
move from its current state to its four neighbouring states. Correspondingly, the
transition kernel consists of four transition probabilities. Similarly, we would
obtain six transition probabilities for the three-dimensional case and so forth.
However, RWs can exhibit very different properties depending on their dimen-
sion. For instance, 1D and 2D RWs have the property that no matter where the
walker starts, it will at some point in the future return to its starting point
(almost surely, i.e., with probability one). Surprisingly, this is not the case for
RWs of dimension three or higher.
Chapter 9

Stochastic Calculus

In the previous chapters, we have focused predominantly on discrete-time stochas-


tic processes such as Markov chains or random walks. In this chapter we focus on
continuous-time stochastic processes, which play an important role in modeling
the stochastic dynamics of physical and natural phenomena.
As a motivation, let us for the moment consider a deterministic dynamical
system described by an ordinary differential equation
\[
\frac{\mathrm{d}x(t)}{\mathrm{d}t} = f(x(t), t), \qquad x(0) = x_0, \tag{9.1}
\]
with x(t) as the state of the system at time t and f as some vector field. In this
system, the time evolution of x(t) is uniquely determined by the initial value x0 ,
such that x(t) moves along a fixed trajectory as time increases. Many real-world
systems, however, evolve stochastically: if we observe the same system several
times using the same initial condition, then the trajectory x(t) may vary from
experiment to experiment. For instance, if we analyze a chemical reaction in a
test tube, then the exact number of reaction products at some fixed time t will
vary over repeated experiments due to thermal fluctuations.
To account for stochasticity in the time-evolution of dynamical systems we can
introduce a stochastic driving term in the differential equation, i.e.,
\[
\frac{\mathrm{d}x(t)}{\mathrm{d}t} = f(x(t), t) + u(t), \tag{9.2}
\]
whereas u(t) is a ”white noise” signal. This means that u(t) is statistically
independent of u(s) for any s ̸= t. Now, the time evolution of x(t) will be
stochastic, which means that if we repeatedly observe this system, the trajectory
x(t) will be different each time. Using the substitution u(t)dt = dW (t), we can
rewrite (9.2) as
dx(t) = f (x(t), t)dt + dW (t). (9.3)
with dW (t) as the differential version of a standard Wiener process as briefly
introduced in Chapter 8. We remark here that while equation (9.3) is mathemat-
ically sound, the original version (9.2) comes with certain problems. In particular,

the white noise process u(t) would correspond to the time derivative of W (t).
It is known, however, that a Wiener process is not continuously differentiable
making (9.2) problematic. Therefore, continuous-time stochastic processes are
generally given in the form of (9.3).

9.1 Stochastic differential equations


Definition 9.1. A stochastic differential equation (SDE) is defined as

dX(t) = µ(X(t), t)dt + σ(X(t), t)dW (t), (9.4)

with dW (t) as the infinitesimal increment of a standard Wiener process. The


terms µ and σ are commonly referred to as drift and diffusion terms, respec-
tively.
Definition 9.2. The standard Wiener process is characterized by the following
properties:
1. The Wiener process has independent increments: W (t + τ ) − W (t) is
independent of W (s) ∀s ≤ t.
2. The Wiener process is stationary: the increments W (t + τ ) − W (t) do not
depend on t.
3. The Wiener process has Gaussian increments: Wt+τ − Wt ∼ N (0, τ ).
From the last definition, it follows that E{W (t+τ )−W (t)} = 0 and V ar{W (t+
τ ) − W (t)} = τ .

9.1.1 Ito integrals


SDEs can also be written in integral form
\[
X(t) = X_0 + \int_0^t \mu(X(s), s)\, \mathrm{d}s + \int_0^t \sigma(X(s), s)\, \mathrm{d}W(s), \tag{9.5}
\]

where the first integral corresponds to a classical Riemann integral. The
second integral is called a stochastic integral, where in this case, the function
σ(X(t), t) is integrated with respect to a standard Wiener process W (t). Mathematically,
this integral can be defined as
\[
\int_0^t H(s)\, \mathrm{d}W(s) = \lim_{n \to \infty} \sum_{i=0}^{n} H(t_i)\left(W(t_{i+1}) - W(t_i)\right), \qquad t_i = \frac{i}{n}\, t. \tag{9.6}
\]

Eq. (9.6) is generally known as the Ito integral. This integral converges (in
probability) if:
• H(t) depends only on {W (t − h) | h ≥ 0}, i.e., only on the past of the Wiener
process. H(t) is then said to be non-anticipating.
• It holds that $\mathrm{E}\left\{\int_0^t H(s)^2\, \mathrm{d}s\right\} < \infty$.

The mean and variance of the Ito integral are given by
\[
\mathrm{E}\!\left[\int_0^t H(s)\, \mathrm{d}W(s)\right] = 0 \iff \mathrm{E}\{H(t)\, \mathrm{d}W(t)\} = 0 \tag{9.7}
\]
\[
\mathrm{Var}\!\left[\int_0^t H(s)\, \mathrm{d}W(s)\right] = \int_0^t \mathrm{E}\{H(s)^2\}\, \mathrm{d}s. \tag{9.8}
\]

9.1.2 Transformation of Wiener processes


A fundamental result in stochastic calculus is Ito’s lemma, which can be under-
stood as a generalization of the chain rule for stochastic processes. Assume that
we are given an SDE of the form

dX(t) = µ(X(t), t)dt + σ(X(t), t)dW (t), (9.9)

with W (t) as a standard Wiener process. Furthermore, consider a non-linear


transformation f : R → R.

Theorem 9.1. The transformed process Y (t) = f (X(t)) satisfies the SDE
 
\[
\mathrm{d}Y(t) = \left[\frac{\partial f}{\partial x}(X(t))\, \mu(X(t), t) + \frac{1}{2} \frac{\partial^2 f}{\partial x^2}(X(t))\, \sigma^2(X(t), t)\right] \mathrm{d}t + \frac{\partial f}{\partial x}(X(t))\, \sigma(X(t), t)\, \mathrm{d}W(t). \tag{9.10}
\]
This is known as Ito’s lemma.

Remark: Note that (9.10) is valid only if f does not explicitly depend on
t. While Ito’s lemma can be extended also to time-dependent f , we restrict
ourselves to the case where f depends only on X(t) in this lecture.

9.1.3 Mean and Variance of SDE’s


While solutions of SDEs are inherently stochastic, their temporal dynamics can
be characterized in terms of statistical moments. For instance, we can calculate
the mean and variance of X(t) in order to study the ”average” solution of an
SDE and how much it varies between different realizations. In order to calculate
the expectation, we can apply the expectation operator on both sides of the SDE,
i.e.,
E{dX(t)} = E{µ(X(t), t)}dt + E{σ(X(t), t)dW (t)}. (9.11)
Due to the properties of the standard Wiener process we know that the second
term will be zero such that we obtain
\[
\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{E}\{X(t)\} = \mathrm{E}\{\mu(X(t), t)\}. \tag{9.12}
\]
Calculating the variance is slightly more complicated since we need a dynamic
equation that describes how Var{X(t)} evolves with time. In order to get
such an equation, we can make use of Ito's lemma to first derive an SDE for
Y (t) = f (X(t)) = X(t)² . Taking the expectation of the SDE then gives an
equation for the (non-central) second order moment E{X(t)²}, which we can
then use to derive an equation for the variance according to
\[
\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{Var}\{X(t)\} = \frac{\mathrm{d}}{\mathrm{d}t}\left[\mathrm{E}\{X(t)^2\} - \mathrm{E}\{X(t)\}^2\right] = \frac{\mathrm{d}}{\mathrm{d}t} \mathrm{E}\{X(t)^2\} - 2\,\mathrm{E}\{X(t)\}\, \frac{\mathrm{d}}{\mathrm{d}t} \mathrm{E}\{X(t)\}. \tag{9.13}
\]
We will illustrate this approach using the following simple example.
Example 9.1 (Mean and Variance of the Ornstein-Uhlenbeck (OU) process).
The Ornstein-Uhlenbeck process is defined as
dX(t) = θ(µ − X(t))dt + σdW (t), (9.14)
with θ > 0, σ > 0 and µ as real parameters.
To calculate an equation for the mean of X(t), we take the expectation of (9.14)
\[
\mathrm{d}\mathrm{E}\{X(t)\} = \theta(\mu - \mathrm{E}\{X(t)\})\,\mathrm{d}t + \underbrace{\mathrm{E}\{\sigma \mathrm{d}W(t)\}}_{=\,0}, \tag{9.15}
\]

which after rearranging yields


\[
\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{E}\{X(t)\} = \theta(\mu - \mathrm{E}\{X(t)\}). \tag{9.16}
\]
To calculate the average of the OU process at stationarity (i.e., when t → ∞),
we can set the left hand side of this equation to zero and solve for E{X(t)}. We
obtain
lim E{X(t)} = µ. (9.17)
t→∞

To calculate the variance, we first use Ito's lemma to derive an SDE for Y (t) =
f (X(t)) = X(t)² . We first calculate the first and second order derivatives of f ,
i.e.,
\[
\frac{\partial}{\partial x} f(x) = 2x \tag{9.18}
\]
\[
\frac{\partial^2}{\partial x^2} f(x) = 2. \tag{9.19}
\]
Using these derivatives within Ito's lemma gives us
\[
\mathrm{d}Y(t) = \mathrm{d}[X(t)^2] = \left[2X(t)\theta(\mu - X(t)) + \sigma^2\right] \mathrm{d}t + 2\sigma X(t)\, \mathrm{d}W(t)
= \left[2\theta(\mu X(t) - Y(t)) + \sigma^2\right] \mathrm{d}t + 2\sigma X(t)\, \mathrm{d}W(t). \tag{9.20}
\]


Taking the expectation on both sides yields
\[
\mathrm{d}\mathrm{E}\{Y(t)\} = \left[2\theta(\mu \mathrm{E}\{X(t)\} - \mathrm{E}\{Y(t)\}) + \sigma^2\right] \mathrm{d}t + 2\sigma\, \mathrm{E}\{X(t)\mathrm{d}W(t)\}. \tag{9.21}
\]
The second order non-central moment E{Y (t)} = E{X(t)2 } therefore satisfies
the differential equation
\[
\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{E}\{X(t)^2\} = 2\theta(\mu \mathrm{E}\{X(t)\} - \mathrm{E}\{X(t)^2\}) + \sigma^2. \tag{9.22}
\]
For the variance, we obtain correspondingly,
\[
\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{Var}\{X(t)\} &= \frac{\mathrm{d}}{\mathrm{d}t} \mathrm{E}\{X(t)^2\} - 2\,\mathrm{E}\{X(t)\}\,\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{E}\{X(t)\} \\
&= 2\theta(\mu \mathrm{E}\{X(t)\} - \mathrm{E}\{X(t)^2\}) + \sigma^2 - 2\,\mathrm{E}\{X(t)\}\,\theta(\mu - \mathrm{E}\{X(t)\}) \\
&= -2\theta\left(\mathrm{E}\{X(t)^2\} - \mathrm{E}\{X(t)\}^2\right) + \sigma^2 \\
&= -2\theta\, \mathrm{Var}\{X(t)\} + \sigma^2.
\end{aligned} \tag{9.23}
\]

The long-term variance therefore becomes
\[
\lim_{t \to \infty} \mathrm{Var}\{X(t)\} = \frac{\sigma^2}{2\theta}. \tag{9.24}
\]
We finally remark that the same approach can in principle be used to calculate
mean and variance of any SDE driven by a Wiener process. However, one
should keep in mind that if the SDE is non-linear, one may encounter a so-
called moment-closure problem. That means that the equation for the mean of
X(t) may depend on the second order moment, which in turn depends on the
third-order moment and so forth. In this case, certain approximate techniques
can be considered. Those techniques, however, are beyond the scope of this
lecture.
Chapter 10

Numerical Methods for Stochastic Differential Equations

Stochastic differential equations (SDE), as introduced in the previous chapter,


form the basis of modeling continuous-time stochastic processes. While in some
cases, the solution (or its moments) of a SDE can be computed analytically (see,
e.g., the Ornstein-Uhlenbeck process in the previous chapter), most SDEs need
to be simulated or solved numerically. In this chapter, we present the classic
numerical methods for solving or simulating SDEs.

10.1 Refresher on SDEs


Before we start, we briefly refresh the main concepts of SDEs as they pertain
to numerical methods, and introduce the notation. A scalar SDE governs the
dynamics of a continuous random variable X(t) ∈ R by:

\[
\frac{\mathrm{d}X}{\mathrm{d}t} = v_0(X(t), t) + v_1(X(t), t)\frac{\mathrm{d}W_1}{\mathrm{d}t} + \ldots + v_n(X(t), t)\frac{\mathrm{d}W_n}{\mathrm{d}t} \tag{10.1}
\]
with given functions v0 , v1 , . . . , vn and Wiener processes W1 , . . . , Wn . The first
term on the right-hand side governs the deterministic part of the dynamics
through the function v0 . The remaining terms govern the stochastic influences
on the dynamics, of which there could be more than one, each with its own Itô
transformation v1 , . . . , vn . The Wiener processes Wi (t) are continuous functions
of time that are almost surely nowhere differentiable. Therefore, the dWi /dt
are pure white noise and the equation cannot be interpreted mathematically.
However, if we multiply the entire equation by dt, we get:

dX(t) = v0 (X(t), t)dt + v1 (X(t), t)dW1 (t) + . . . + vn (X(t), t)dWn (t), (10.2)

where everything is still time-continuous and the solution therefore is a contin-


uous function in time. This is the usual way of writing SDEs. In the following,
for simplicity, we only consider SDEs where the stochastic terms can all be
collected into one and write:

dX(t) = µ(X(t), t)dt + σ(X(t), t)dW (t). (10.3)

The deterministic part µ is called drift and the stochastic part σ is called dif-
fusion, because Wiener increments dW are normally distributed (see previous
chapter).

Example 10.1. Consider as an example the classic Langevin equation:



dX(t) = −aX(t)dt + b dW (t),

where a > 0 and b > 0 are constants. This equation describes the dynamics of
the velocity X(t) of a particle (point mass) under deterministic friction (friction
coefficient a) and stochastic Brownian motion (diffusion constant b). It is a
central equation in statistical physics, chemistry, finance, and many other fields.

10.2 Solving an SDE


It is not immediately clear what “solving an SDE” means. Since X(t) is a
continuous-time stochastic process, every realization of it is different, so we
cannot find the solution x(t). Consider t ∈ [0, T ]. Each realization of the
Wiener process produces a different xT = X(t = T ) when always starting from
the same initial X(t = 0) = x0 . As a “solution”, we may want to know:

1. The probability density function (PDF) of xT = X(t = T ),

2. E[X(t = T )] or any E[g(X(t = T ))].

(1) is referred to as the strong solution of the SDE, and (2) as the weak solution
of the SDE.

10.2.1 Solution methods


10.2.1.1 Weak solution: Feynman-Kac formulae
Weak solutions can be computed analytically or numerically by reducing the
stochastic problem to a deterministic one. Using Feynman-Kac formulae, every
E[g(X(t))] of every SDE has an associated deterministic parabolic PDE with
solution:
u(t, x) = E[g(X(t)) | X(0) = x].
This deterministic parabolic PDE can then be solved analytically or using meth-
ods from numerical analysis (e.g., finite differences). Many quantities of the
original SDE are thus accessible, e.g.:
• g a monomial → corresponding moment of X(t),


• g = δ (i.e., the Dirac delta) → transition PDF of X(t),
• g = H (i.e., the Heaviside step function) → transition CDF of X(t),
• g = exp(·) → the Laplace transform of the solution.

10.2.1.2 Strong solution: Analytical solution


If the strong solution is required, the SDE can sometimes be solved analytically.
A famous example is the Black-Scholes equation that models stock option prices:

dX = Xµdt + XσdW (t)

with constants µ and σ.

10.2.1.3 Strong and weak solution: Numerical integration


The above two methods are exact. If a numerical approximation to the solutions
is sufficient, stochastic numerical integration of the SDE can be used to approx-
imate both weak and strong solutions. This is analogous to MCMC simulation
in the discrete-time case, except that now there are infinitely many infinitesimal
stochastic transitions, so extra care is required.

10.3 Stochastic Numerical Integration: Euler-Maruyama
The solution of the SDE in Eq. 10.3 can be written in integral form as:
\[
X(t) = x_0 + \int_0^t \mu(X(\tilde{t}), \tilde{t})\, \mathrm{d}\tilde{t} + \int_0^t \sigma(X(\tilde{t}), \tilde{t})\, \mathrm{d}W(\tilde{t}), \tag{10.4}
\]

where the first integral is a deterministic Riemann integral, and the second one
is a stochastic Itô (or Stratonovich) integral.
In order to numerically approximate the solution, we discretize the time interval
[0, T ] in which we are interested in the solution into N finite-sized time steps of
duration δt = T /N such that tn = nδt and Xn = X(t = tn ), Wn = W (t = tn ).
Due to the Gaussian-increments property of the Wiener process from the previous chapter, which
states that the differences between any two time points of a Wiener process are
normally distributed, we can also discretize:

Wn+1 = Wn + ∆Wn (10.5)

with ∆Wn i.i.d. ∼ N (0, δt) and W0 = 0. The starting value for the Wiener
process, W0 = 0 is chosen arbitrarily, since the absolute value of W is inconse-
quential for the SDE; only the increments dW matter, so we can start from an
arbitrary point.
The integrals in Eq. 10.4 can be interpreted as the continuous limits of sums.
The deterministic term can hence be discretized by a standard quadrature (nu-
merical integration). The stochastic term is discretized using the above dis-
cretization of the Wiener process, hence, for any time T ,
\[
\int_0^T \sigma(X(\tilde{t}), \tilde{t})\, \mathrm{d}W(\tilde{t}) \approx \sum_{n=0}^{N} \sigma(X(t_n), t_n)\, \Delta W_n.
\]

Using the rectangular rule (i.e., approximating the integral by the sum of the
areas of rectangular bars) for the deterministic integral, and the above sum over
one time step for the stochastic integral, we find:
\[
\int_{t_n}^{t_{n+1}} \mu(X(\tilde{t}), \tilde{t})\, \mathrm{d}\tilde{t} \approx \mu(X_n, t_n)\, \delta t, \qquad
\int_{t_n}^{t_{n+1}} \sigma(X(\tilde{t}), \tilde{t})\, \mathrm{d}W(\tilde{t}) \approx \sigma(X_n, t_n)\, \Delta W_n.
\]

This yields the classic Euler-Maruyama method:

Xn+1 = Xn + µ(Xn , tn )δt + σ(Xn , tn )∆Wn (10.6)

with
∆Wn = Wn+1 − Wn ∼ N (0, δt) i.i.d.,
W0 = 0,
X0 = x0 .
Iterating Eq. 10.6 forward in time for n = 0, 1, . . . , N − 1 yields a numerical approximation
of one trajectory/realization of the stochastic process X(t) governed by
the SDE from Eq. 10.3.
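As an illustration, the following sketch applies the Euler-Maruyama scheme of Eq. 10.6 to the Ornstein-Uhlenbeck SDE from Example 9.1, whose stationary mean µ and variance σ²/(2θ) can be used to sanity-check the simulation; the parameter values are assumptions.

# Euler-Maruyama (Eq. 10.6) applied to the Ornstein-Uhlenbeck SDE (sketch).
import numpy as np

def euler_maruyama(mu, sigma, x0, T, N, rng=None):
    """One trajectory of dX = mu(X,t) dt + sigma(X,t) dW on [0, T] with N steps."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / N
    t = np.linspace(0.0, T, N + 1)
    X = np.empty(N + 1)
    X[0] = x0
    for n in range(N):
        dW = rng.normal(0.0, np.sqrt(dt))          # Delta W_n ~ N(0, dt), i.i.d.
        X[n + 1] = X[n] + mu(X[n], t[n]) * dt + sigma(X[n], t[n]) * dW
    return t, X

# OU process dX = theta*(m - X) dt + s dW: stationary mean m, variance s^2/(2*theta).
theta, m, s = 1.0, 2.0, 0.5
t, X = euler_maruyama(lambda x, t: theta * (m - x), lambda x, t: s, 0.0, 50.0, 50_000)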

10.4 Convergence
The error of the numerical solution is defined with respect to the exact stochastic
process X(t) for decreasing δt. To make this comparison possible, we introduce
Xδt (t), the continuous-time stochastic process obtained by connecting the points
(tn , Xn ) by straight lines, i.e.:
\[
X_{\delta t}(t) = X_n + \frac{t - t_n}{t_{n+1} - t_n}\, (X_{n+1} - X_n) \qquad \text{for } t \in [t_n, t_{n+1}). \tag{10.7}
\]
Comparing this continuous-time process to the exact process, we can define
convergence. Note that in the deterministic case this construction is not necessary, as there
we can simply evaluate the analytical solution at the simulation time steps and
compare the values. In a stochastic simulation, however, this is not possible,
as each realization of the process has different values. The only things we can
compare are moments, which can only be computed over continuous processes.
So with the above trick we can define:
Definition 10.1 (strong and weak convergence). A numerical method is strongly
convergent if and only if
\[
\lim_{\delta t \to 0} \mathrm{E}\!\left[\,|X(T) - X_{\delta t}(T)|\,\right] = 0 \tag{10.8}
\]
and weakly convergent if
\[
\lim_{\delta t \to 0} \left|\, \mathrm{E}[g(X(T))] - \mathrm{E}[g(X_{\delta t}(T))] \,\right| = 0 \tag{10.9}
\]
for every polynomial g and every time point T .


Strong convergence is also sometimes called convergence in value, and it implies
convergence to the strong solution of the SDE. Weak convergence is also called
convergence in distribution and it implies convergence to the weak solution of
the SDE.
For the Euler-Maruyama algorithm, the following result is known:
Theorem 10.1. The Euler-Maruyama method is both strongly and weakly con-
vergent if µ(·, ·) and σ(·, ·) are four times continuously differentiable and have
bounded first derivatives. This condition is sufficient, but not necessary.
Strong convergence implies weak convergence, but not the other way around.
Strong convergence intuitively means that for a given simulation with δt → 0,
the trajectory exactly matches one of the trajectories of the analytical process.
This is because the above has to hold for all T . So any simulation converges to
something that is a valid trajectory of the analytical process. Weak convergence
does not require this: it only requires the moments to match. For example, if
the true trajectory goes to the value 0.5, a strong simulation would also have to go
to 0.5. A weak simulation, however, can go to 1.0 half of the time and to 0.0
half of the time. The average is still 0.5, even though all of the simulated
trajectories are far away from any true trajectory.
The next question then is how fast the algorithm converges. For this, we define
the order of convergence in both the weak and strong sense, as:
Definition 10.2 (strong convergence order). A numerical method has strong
convergence order γ ≥ 0 if and only if
    E[ |X(T ) − Xδt (T )| ] ≤ C(T ) δt^γ                                     (10.10)

for every time T , where the constant C(T ) > 0 depends on T and on the SDE
considered.
Definition 10.3 (weak convergence order). A numerical method has weak con-
vergence order γ ≥ 0 if and only if
    | E[g(X(T ))] − E[g(Xδt (T ))] | ≤ C(T, g) δt^γ                          (10.11)

for every time T , where the constant C(T, g) > 0 depends on T , g, and the SDE
considered.

For the Euler-Maruyama algorithm, the following result is known:


Theorem 10.2. The Euler-Maruyama method has weak convergence order 1
and strong convergence order 1/2.
The proof of this theorem involves stochastic Taylor expansions (into Stratonovich
integrals) and is omitted here.
In principle, the Euler-Maruyama method allows us to compute both the strong
and weak solution of any SDE to any precision, if we just choose the time step
δt small enough. The strong convergence order of 1/2 is rather slow, though. It
means that in order to get a solution that is 10 times more accurate, we need
100 times smaller time steps. In practice, however, δt cannot be chosen too
small because of finite-precision arithmetic rounding errors and computer time.
Therefore, there is an ongoing search for stochastic integration methods with
higher orders of convergence.
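
As an illustration of how such convergence orders can be checked in practice, the following sketch (Python with NumPy assumed; all parameters and sample sizes are arbitrary choices for illustration) estimates the strong error of Euler-Maruyama at time T for geometric Brownian motion, whose strong solution X(T) = x0 exp((µ − σ²/2)T + σW(T)) driven by the same Wiener path is known in closed form. The fitted log-log slope should come out near 1/2:

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, x0, T, M = 0.5, 0.4, 1.0, 1.0, 2000     # illustrative parameters, M sample paths

def strong_error(N):
    """Estimate E|X(T) - X_dt(T)| for Euler-Maruyama with N steps, using the
    same Wiener increments for the numerical and the exact strong solution."""
    dt = T / N
    err = 0.0
    for _ in range(M):
        dW = rng.normal(0.0, np.sqrt(dt), size=N)
        X = x0
        for n in range(N):
            X += mu * X * dt + sigma * X * dW[n]
        X_exact = x0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * dW.sum())
        err += abs(X_exact - X)
    return err / M

dts = [T / N for N in (16, 32, 64, 128, 256)]
errs = [strong_error(int(T / dt)) for dt in dts]
slope = np.polyfit(np.log(dts), np.log(errs), 1)[0]   # empirical strong order, approx. 0.5
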

10.5 Milstein Method


The standard way to increase the convergence order of a numerical method is to
take into account additional higher-order terms in the Taylor expansion of the
solution. Keeping terms up to and including order 2 in the stochastic Taylor
expansion of X(t + δt), or recursively applying Itô’s Lemma to the coefficient
functions of the autonomous SDE

dX(t) = µ(X(t))dt + σ(X(t))dW(t)

yields the Milstein method:

    Xn+1 = Xn + µ(Xn )δt + σ(Xn )∆Wn + (1/2) σ ′ (Xn )σ(Xn )((∆Wn )² − δt)    (10.12)
with:

    ∆Wn = Wn+1 − Wn ∼ N (0, δt) i.i.d.,
    W0 = 0,
    X0 = x0 ,
    σ ′ (x) = dσ(x)/dx .
This formulation of the Milstein method only works for autonomous SDEs, i.e.,
SDEs where the coefficient functions µ(·) and σ(·) depend on X(t), but not ex-
plicitly on t. For non-autonomous SDEs, where the coefficients are also explicit
functions of time, the formulation of the Milstein method involves Stratonovich
integrals that need to be numerically approximated by Lévy area calculation
if they cannot be analytically solved. Here, we only consider the autonomous
case, where these complications do not appear.
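
A minimal sketch of the scheme for an autonomous scalar SDE could look as follows (Python with NumPy assumed; the example coefficients are again hypothetical, chosen so that σ′(x) is easy to write down):

import numpy as np

def milstein(mu, sigma, dsigma, x0, T, N, seed=0):
    """One trajectory of the autonomous SDE dX = mu(X) dt + sigma(X) dW
    using the Milstein scheme (Eq. 10.12); dsigma(x) is the derivative sigma'(x)."""
    rng = np.random.default_rng(seed)
    dt = T / N
    X = np.empty(N + 1)
    X[0] = x0
    for n in range(N):
        dW = rng.normal(0.0, np.sqrt(dt))
        X[n + 1] = (X[n] + mu(X[n]) * dt + sigma(X[n]) * dW
                    + 0.5 * dsigma(X[n]) * sigma(X[n]) * (dW**2 - dt))
    return X

# Hypothetical example: dX = 0.5 X dt + 0.2 X dW, hence sigma'(x) = 0.2
X = milstein(lambda x: 0.5 * x, lambda x: 0.2 * x, lambda x: 0.2, x0=1.0, T=1.0, N=1000)
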
Regarding the convergence order of the Milstein method, we have:

Theorem 10.3. The Milstein method has both strong and weak orders of con-
vergence of 1.
The Milstein method therefore is more accurate than the Euler-Maruyama
method in the strong sense, but has the same weak order of convergence. Us-
ing a 100-fold smaller time step reduces the numerical error 100 times when
using Milstein. This allows larger time steps compared to Euler-Maruyama and
relaxes the numerical rounding issues.
Of course, Euler-Maruyama and Milstein are not the only known stochastic nu-
merical integration methods. Other methods also exist (e.g. Castell-Gaines,
stochastic Lie integrators, etc.), some with lower pre-factors C in the error
bounds of Eqs. 10.10 and 10.11. However, no method is known with strong
convergence order > 1, unless we can analytically solve the corresponding
Stratonovich integrals.
The numerical stability properties (as a function of δt) of stochastic numerical
integration methods are largely unclear. Only a few results are known on almost-sure
stability, e.g., for linear scalar SDEs. No A-stable stochastic numerical integrator is known.

10.6 Weak Simulation


While no method with strong order of convergence larger than 1 is known, the
weak order of convergence can be increased at will if strong convergence is not
required. Therefore, if one is only interested in the weak solution of an SDE,
simpler methods exist to approximate E[g(X(T ))] without needing to simulate
the entire path of the process.
It turns out that one can, e.g., simply choose binomial increments ∆W̃ = ±√δt
with P(∆W̃ = +√δt) = P(∆W̃ = −√δt) = 1/2. This does not require simulating
Gaussian random variates. Instead, at each time point one simply chooses i.i.d.
increments of +√δt or −√δt, each with probability 1/2. Then, one uses ∆W̃ in the
stochastic integration method instead of ∆W . Depending on the numerical integration
(quadrature) scheme used to approximate Eq. 10.4, any weak order of conver-
gence can be achieved. E.g.:
• Rectangular rule −→ Euler-Maruyama −→ weak 1st order
• Trapezoidal rule −→ weak 2nd order
• ...

If the quadrature approximates the integrand by a piecewise polynomial of de-
gree p, the weak order of convergence of a simulation using binomial increments
∆W̃ is p + 1. There is no strong convergence in any of these cases (i.e., the strong
order of convergence is 0).
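
As a sketch of such a weak simulation (Python with NumPy assumed; the SDE, the choice g(x) = x, and all parameter values are illustrative), the following estimates E[X(T)] for geometric Brownian motion using the Euler scheme driven by binomial increments ±√δt, which by the above is weakly first-order accurate:

import numpy as np

def weak_euler_binomial(mu, sigma, x0, T, N, M, seed=0):
    """Monte Carlo estimate of E[g(X(T))] with g(x) = x, using the Euler scheme
    driven by binomial increments +/- sqrt(dt), each taken with probability 1/2."""
    rng = np.random.default_rng(seed)
    dt = T / N
    total = 0.0
    for _ in range(M):
        X = x0
        for _ in range(N):
            dW = np.sqrt(dt) * rng.choice((-1.0, 1.0))   # binomial increment, no Gaussians needed
            X += mu(X) * dt + sigma(X) * dW
        total += X
    return total / M

# Hypothetical example: dX = 0.5 X dt + 0.2 X dW has E[X(1)] = exp(0.5) ~ 1.6487
est = weak_euler_binomial(lambda x: 0.5 * x, lambda x: 0.2 * x, x0=1.0, T=1.0, N=200, M=5000)
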
Chapter 11

Stochastic Reaction Networks


Reaction networks, as introduced in the previous chapter, are an important class
of stochastic models that can be used to describe populations of individuals that
randomly interact with one another over time. Applications range from chemical
kinetics over computer networks to logistics and traffic. In this chapter, we
will introduce the formalism of reaction networks and discuss how they can be
described mathematically and simulated using numerical methods.

11.1 Formal Representations


11.1.1 Representation of a reaction
A reaction reactants → products is formally represented by:
    ∑_{i=1}^{N} νi− Si  −k→  ∑_{i=1}^{N} νi+ Si ,                            (11.1)

where:

Si : species i,

N : total number of different species in the reaction,

νi− : reactant stoichiometry,

νi+ : product stoichiometry,

k: reaction rate.

The total stoichiometry is νi = νi+ −νi− , and it gives the net change in copy num-
bers when the reaction happens. Reactions are classified by the total number of


reactant molecules ∑_i νi− , which is called the order of the reaction. Reactions
having only a single reactant are of order one, or unimolecular. Reactions with
two reactants are of order two, or bimolecular; and so on. Reactions of order
≤ 2 are called elementary reactions.
Example 11.1. For the reaction A + B → C, the above quantities are:
• Si = {A, B, C}
• N =3
• ⃗ν − = [1, 1, 0]T
• ⃗ν + = [0, 0, 1]T
• ⃗ν = [−1, −1, 1]T

11.1.2 Representation of a reaction network


A reaction network comprising M reactions between N distinct species is then
represented by indexing Eq. 11.1 with the reaction index µ and writing:

    ∑_{i=1}^{N} νi,µ− Si  −kµ→  ∑_{i=1}^{N} νi,µ+ Si ,   µ = 1, . . . , M,   (11.2)

where:
µ: index of reaction Rµ ,
M : total number of different reactions.
Now the stoichiometry is a matrix with one column per reaction: ν = ν + − ν − .
All the stoichiometry matrices are of size N × M . All elements of ν + and ν −
are non-negative whereas those of ν can be negative, zero or positive.
Example 11.2. Consider the following cyclic chain reaction network with N
species and M = N reactions:

    Si −→ Si+1 ,   i = 1, . . . , N − 1
    SN −→ S1 .                                                               (11.3)

For N = 3 the reaction network is


    Reaction 1 :  S1 −→ S2
    Reaction 2 :  S2 −→ S3                                                   (11.4)
    Reaction 3 :  S3 −→ S1 .

The stoichiometry matrices for this reaction network are:


 
           [ 1  0  0 ]
    ⃗ν − = [ 0  1  0 ] ,                                                     (11.5)
           [ 0  0  1 ]

           [ 0  0  1 ]
    ⃗ν + = [ 1  0  0 ] ,                                                     (11.6)
           [ 0  1  0 ]

and

                      [ −1   0   1 ]
    ⃗ν = ⃗ν + − ⃗ν − = [  1  −1   0 ] .                                       (11.7)
                      [  0   1  −1 ]

Example 11.3. Consider the following colloidal aggregation reaction network
with N species and M = ⌊N²/4⌋ reactions:

    Si + Sj −→ Si+j ,   i = 1, . . . , ⌊N/2⌋ ;   j = i, . . . , N − i.       (11.8)

Species Si can be considered a multimer consisting of i monomers.


For N = 4 the reaction network is
    Reaction 1 :  S1 + S1 −→ S2
    Reaction 2 :  S1 + S2 −→ S3
    Reaction 3 :  S1 + S3 −→ S4                                              (11.9)
    Reaction 4 :  S2 + S2 −→ S4 .

The stoichiometry matrices for this reaction network are:


 
           [  2   1   1   0 ]
    ⃗ν − = [  0   1   0   2 ] ,                                              (11.10)
           [  0   0   1   0 ]
           [  0   0   0   0 ]

           [  0   0   0   0 ]
    ⃗ν + = [  1   0   0   0 ] ,                                              (11.11)
           [  0   1   0   0 ]
           [  0   0   1   1 ]

and

                      [ −2  −1  −1   0 ]
    ⃗ν = ⃗ν + − ⃗ν − = [  1  −1   0  −2 ] .                                   (11.12)
                      [  0   1  −1   0 ]
                      [  0   0   1   1 ]
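
In a computer, such stoichiometry matrices are conveniently stored as integer arrays. The following sketch (Python with NumPy assumed) encodes the matrices of Example 11.3, with rows indexing species and columns indexing reactions:

import numpy as np

# Stoichiometry of the N = 4 aggregation network (Eqs. 11.10-11.12):
# rows = species S1..S4, columns = reactions R1..R4.
nu_minus = np.array([[2, 1, 1, 0],
                     [0, 1, 0, 2],
                     [0, 0, 1, 0],
                     [0, 0, 0, 0]])
nu_plus  = np.array([[0, 0, 0, 0],
                     [1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 1, 1]])
nu = nu_plus - nu_minus          # total stoichiometry (Eq. 11.12)
orders = nu_minus.sum(axis=0)    # reaction orders: [2, 2, 2, 2], all bimolecular
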

11.2 The Chemical Master Equation


We are now concerned with the temporal dynamics of stochastic reaction net-
works. In particular, we seek a stochastic process that captures the random
firings of reactions over time. In the previous chapters we have discussed dis-
crete Markov chains (e.g., random walks) and continuous-time Markov processes

(e.g., Wiener processes). Reaction networks do not fall in either of these classes
since they evolve continuously in time, but have a discrete state space (i.e.,
particle numbers are integer-valued). In order to describe such systems, we em-
ploy another class of Markov processes termed continuous-time Markov chains
(CTMCs).

We consider the state ⃗X(t) = (X1 (t), . . . , XN (t)) of a reaction network, collecting
the copy numbers (population) of each species at time t. At a certain time
instance, we can now assess the time evolution of ⃗X(t) over a small time increment
∆t using basic probability theory. In particular, we define the following
two probabilities

    P (⃗X(t + ∆t) = ⃗x + νi | ⃗X(t) = ⃗x) = ai (⃗x)∆t + o(∆t)                   (11.13)
    P (⃗X(t + ∆t) = ⃗x | ⃗X(t) = ⃗x) = 1 − ∑_i ai (⃗x)∆t + o(∆t),               (11.14)

where ai (⃗x) is commonly referred to as a rate function, hazard function, or
propensity function. Intuitively, the propensity function ai (⃗x) tells us how likely
it is that a particular reaction i happens within an infinitesimally small amount of time and
therefore sets the time scale of this reaction. Eq. (11.13) defines the probability
that the state changes by νi within ∆t. On the one hand, this can happen
if exactly one reaction of type i happens. This term is proportional to the
propensity function ai (⃗x). However, it can also happen that a sequence of
consecutive reactions yields a net change of νi , which is captured by the second
term o(∆t). This term, however, is of order higher than ∆t, such that it tends
to zero much faster than ai (⃗x)∆t as ∆t goes to zero. We will make use of that
fact later in this section. Eq. (11.14) is the probability that the net change
is zero, which is just one minus the probability that any of the M reactions
happens plus an additional term o(∆t) that accounts for the possibility that
two or more reactions accumulated to a net change of zero. Note that both
(11.13) and (11.14) are “instantaneous”, i.e., they depend only on the current
state ⃗X(t) = ⃗x and not on any past state. This reflects the Markovianity of the
reaction networks defined in such a way. We remark that while the Markov
assumption can be rigorously justified in certain physical scenarios, it can be
violated in others. While non-Markovian extensions of reaction networks exist,
they are beyond the scope of this lecture.
A key quantity that captures the stochastic dynamics of reaction networks is the
state probability distribution P (⃗x, t) := P (⃗X(t) = ⃗x). This distribution tells us
how likely we will find the system in a particular molecular configuration ⃗x at
any time t. While P (⃗x, t) is generally not known explicitly, it can be described
by a famous differential equation, commonly known as the Chemical Master
Equation (CME). Importantly, this equation is straightforward to derive using
our definitions from (11.13) and (11.14). In particular, we first write down an
expression for the temporal change in P (⃗x, t), i.e.,

    d/dt P (⃗x, t) = lim_{∆t→0} [ P (⃗x, t + ∆t) − P (⃗x, t) ] / ∆t             (11.15)

Now, in order to calculate the distribution P (⃗x, t + ∆t), we make use of (11.13)
and (11.14). In particular, the probability of being in state ⃗x at time t + ∆t is
the probability that we were brought to this state via any of the M reactions
plus the probability that we have already been in state ⃗x at time t (plus some
additional terms accounting for the possibility that multiple events happened).
In particular, we obtain

    P (⃗x, t + ∆t) = ∑_i ( ai (⃗x − νi )∆t + o(∆t) ) P (⃗x − νi , t)
                    + ( 1 − ∑_i ai (⃗x)∆t + o(∆t) ) P (⃗x, t).                 (11.16)

Plugging this expression into (11.15) yields

"P
d iai (⃗x − νi )∆tP (⃗x − νi , t) o(∆t)P (⃗x − νi , t)
P (⃗x, t) = lim +
dt ∆t→0 ∆t | ∆t
{z }
→0
P #
P (⃗x, t) − P (⃗x, t) ai (⃗x)∆tP (⃗x, t) o(∆t)P (⃗x, t)
+ − i + .
| ∆t
{z } ∆t | ∆t
{z }
→0 →0

We therefore obtain the following differential equation for P (⃗x, t):

    d/dt P (⃗x, t) = ∑_i ai (⃗x − νi ) P (⃗x − νi , t) − ∑_i ai (⃗x) P (⃗x, t),   (11.17)

known as the CME. Similarly to the discrete Markov chain scenario, the CME
describes how the state distribution evolves over time and is thus a continuous-
time analog of the Kolmogorov-forward equation that we have discussed in the
previous chapters. Technically speaking, the CME is a difference-differential
equation: it has a time-derivative on the left hand side and discrete shifts in
the state on the right-hand side. Note that in general, the CME is infinite-
dimensional, since for every possible ⃗x, we would get an additional dimension
in the CME. Unfortunately, analytical solutions of the CME exist only in the
simplest cases (e.g., a linear chain of three reactions), so in general it needs to be
solved numerically. Traditional methods from numerical analysis, such as finite
differences or finite elements, also fail due to the high dimensionality of the
domain of the probability distribution P (⃗x, t), which leads to an exponential
increase in computational and memory cost with network size. However, Monte
Carlo approaches can be applied to simulate stochastic reaction networks, as
will be discussed in the next section.
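
To make the CME tangible, the following sketch integrates it directly for a deliberately tiny, hypothetical example (not one from this chapter): a single degradation reaction S → ∅ with propensity a(x) = kx. Starting from x0 molecules, the reachable state space {0, . . . , x0} is finite, so Eq. 11.17 reduces to x0 + 1 coupled linear ODEs. Python with NumPy and SciPy is assumed; for realistic networks this direct approach becomes infeasible, which is exactly the dimensionality problem discussed above:

import numpy as np
from scipy.integrate import solve_ivp

k, x0 = 1.0, 20                      # hypothetical rate constant and initial copy number

def cme_rhs(t, P):
    """Right-hand side of the CME (Eq. 11.17) for S -> 0 with a(x) = k*x:
    dP(x)/dt = k*(x+1)*P(x+1) - k*x*P(x) on the finite state space {0,...,x0}."""
    dP = np.zeros_like(P)
    for x in range(x0 + 1):
        inflow = k * (x + 1) * P[x + 1] if x < x0 else 0.0
        dP[x] = inflow - k * x * P[x]
    return dP

P0 = np.zeros(x0 + 1)
P0[x0] = 1.0                         # all probability mass on x = x0 at time 0
sol = solve_ivp(cme_rhs, (0.0, 2.0), P0, t_eval=[2.0])
P_T = sol.y[:, -1]                   # state distribution P(x, t = 2)
mean_copy_number = np.arange(x0 + 1) @ P_T   # close to x0 * exp(-k * 2)
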

11.3 Exact Simulation of Stochastic Reaction Networks

While direct integration of the CME is challenging, a class of algorithms exists
that samples solutions (trajectories of the Markov chain) from the exact solu-
tion P (⃗x, t) of the Master equation without ever explicitly solving the Master
equation. These are known as exact stochastic simulation algorithms (for short:
exact SSA), and they play an important role in many practical applications.
From the given stoichiometry matrices and reaction rates introduced at the
beginning of this chapter, an exact stochastic simulation algorithm samples a
trajectory of the system, ⃗x(t) ∼ P (⃗x, t) from the exact solution of the Master
equation. Exact Stochastic Simulation Algorithms (SSA) are a special case of
the larger class of kinetic Monte Carlo methods that were introduced in the
1940s by Doob. Daniel Gillespie then formulated the modern SSA family and
proved that they are exact in the sense that they sample trajectories from the
(unknown) exact solution of the Master equation. Due to the importance of
this proof, SSAs are also sometimes referred to as Gillespie algorithms. As
a side remark, it is rare that simulations are exact. Usually, they are only
approximations of the true solution. The fact that SSAs are exact gives them
special importance, and it is one of the few cases where simulations can be used
to validate experiments, and not the other way around.
In SSA, the probability P (⃗x, t) of finding the network in state ⃗x (vector of copy
numbers) at time t, whose time evolution is given by the Master equation, is
replaced by the joint probability density of a single reaction event p(τ, µ | ⃗x(t)),
defined as

    p(τ, µ | ⃗x(t)) dτ = Probability that the next reaction is µ and that it fires in
                        [t + τ, t + τ + dτ ), given ⃗x at time t.              (11.18)

This probability density p is derived as follows: Consider that the time interval
[t, t + τ + dτ ) is divided into k equal intervals of length τ /k, plus a last interval of
length dτ , as illustrated in Fig. 11.1.

Figure 11.1: Division of the time interval [t, t + τ + dτ ) into k + 1 intervals.


Here, t represents the current time. The only reaction firing is reaction µ in
the (k + 1)th infinitesimally small time interval [t + τ, t + τ + dτ ).

The definition of p(τ, µ | ⃗x(t)) in Eq. 11.18 dictates that no reactions occur in all
of the first k intervals, and that reaction µ fires exactly once in the last interval.

We recall that the Master equation has been derived from the following basic
quantities:

    cµ dτ = Probability of reaction µ happening in the next infinitesimal time
            interval [t, t + dτ ) with any randomly selected νi,µ− reactants;
    hµ (⃗x) = Number of distinct combinations in which the reactants of
            reaction µ can react to form products.

The product aµ (⃗x) = hµ (⃗x)cµ is called the reaction propensity. The probability
aµ dτ is the probability that reaction µ happens at least once in the time interval
dτ . It is the product of the probability of reaction and the number of possi-
ble reactant combinations by which this can happen, as these are statistically
independent events.
Now we can write:

    Prob{µ fires once in [t + τ, t + τ + dτ ) given ⃗x at t + τ } = P (⃗x + ⃗νµ , t + τ + dτ | ⃗x, t + τ )
        = hµ (⃗x) cµ dτ (1 − cµ dτ )^(hµ (⃗x)−1)
        = cµ hµ (⃗x) dτ + O(dτ ²)
        = aµ (⃗x) dτ + O(dτ ²).

This only considers the last sub-interval of length dτ . The first line is simply the
analytical solution of the Master equation. If reaction µ has total stoichiometry
⃗νµ , then the new state of the network after reaction µ happened exactly once
is ⃗x + ⃗νµ . In the second line, the first factor, hµ cµ dτ , is the probability that
reaction µ happens from at least one of the hµ possible reactant combinations.
However, the reaction could still happen more than once. Therefore, the second
factor, (1−cµ dτ )hµ (⃗x)−1 is the probability that none of the other hµ − 1 reactant
combinations leads to a reaction. In the third line, we only multiplied out the
first factor. All others are of O(dτ 2 ) or higher. Overall, the expression thus
is the probability that reaction µ happens once, and exactly once, in the last
sub-interval of length dτ .
Further, we have for the probability that no reaction happens in one of the first
k sub-intervals:

    Prob{No reaction in [t, t + τ /k) given ⃗x at t} = P (⃗x, t + τ /k | ⃗x, t)
        = 1 − ∑_{µ=1}^{M} aµ (⃗x) τ /k
        = 1 − a(⃗x) τ /k ,
where the total propensity is a(⃗x) = ∑_{µ=1}^{M} aµ (⃗x). In the second line, we sum over
all reactions. Each reaction has probability aµ (⃗x) τ /k of happening at least once,
so the sum over all µ is the probability that any reaction happens at least once.
One minus this then is the probability that no reaction happens in this sub-interval.

Both of the above expressions assume that the individual reaction events are
statistically independent. This is an important assumption of the Master equa-
tion.
From these two expressions, we can now write an expression for Eq. 11.18 by con-
sidering all k + 1 sub-intervals, again assuming that the individual sub-intervals
are mutually statistically independent:
    p(τ, µ | ⃗x(t)) dτ = [ 1 − a(⃗x) τ /k ]^k [ aµ (⃗x) dτ + O(dτ ²) ] .
The term in the first square bracket is the probability that no reaction happens
in any one of the k first sub-intervals. This to the power of k thus is the
probability that no reaction happens in all of the k first sub-intervals. The term
in the second square bracket then is the probability that reaction µ happens
exactly once in the last sub-interval.
Dividing both sides of the equation by dτ and taking the limit dτ → 0, we obtain

    p(τ, µ | ⃗x(t)) = [ 1 − a(⃗x) τ /k ]^k aµ (⃗x).
Taking the limit k → ∞, we further get

p(τ, µ | ⃗x(t)) = e−a(⃗x)τ aµ (⃗x), (11.19)


because lim_{k→∞} (1 + x/k)^k = e^x . We have thus taken the continuum limit for
infinitesimally small dτ and infinitely many infinitesimally small previous sub-
intervals. Therefore, this is a probability density function. Sampling reactions µ
and reaction waiting times τ from this density is equivalent to sampling trajecto-
ries from the exact solution of the Master equation, because they both describe
the same continuous-time stochastic process. Directly doing so, however, is dif-
ficult, because τ is a continuous variable, whereas µ is a discrete variable. The
density p(τ, µ|⃗x) therefore lives in a hybrid continuous-discrete space.
It is therefore easier to sample from the two marginals. Summing Eq. 11.19
over all reactions (i.e., summing over µ) we get the marginal probability density
function of τ as
    p(τ | ⃗x(t)) = ∑_{µ=1}^{M} aµ (⃗x) e^{−a(⃗x)τ}
                 = a(⃗x) e^{−a(⃗x)τ} .                                         (11.20)

Similarly, integrating Eq. 11.19 over τ we get the marginal probability distribu-
tion function of µ as
    p(µ | ⃗x(t)) = ∫_0^∞ aµ (⃗x) e^{−a(⃗x)τ} dτ
                 = aµ (⃗x) / a(⃗x) .                                           (11.21)

From Eqs. 11.19, 11.20, and 11.21 we observe that

p(τ, µ | ⃗x(t)) = p(τ | ⃗x(t)) p(µ | ⃗x(t)), (11.22)

which shows that µ and τ are statistically independent random variables. Sam-
pling from the marginals in any order is therefore equivalent to sampling from
the joint density. Sampling from Eq. 11.20 is easily done using the inversion
method (see Section 2.3), as τ is exponentially distributed with parameter a.
Eq. 11.21 describes a discrete probability distribution from which we can also
sample using the inversion method. Note that Eqs. 11.20 and 11.21 also relate
to a basic fact in statistical mechanics: if an event has rate aµ of happening,
then the time one needs to wait until it happens again is ∼ Exp(aµ ) (see Section
1.5.2).
By sampling one reaction event at a time and propagating the simulation in
time according to Eq. 11.20, we obtain exact, time resolved trajectories of the
population ⃗x as governed by the Master equation. The SSA, however, is a
Monte Carlo scheme, and hence several independent runs need to be performed in
order to obtain a good estimate of the probability distribution P (⃗x, t), or any of its
moments.
All exact formulations of SSA aim to simulate the network by sampling the
random variables τ (time to the next reaction) and µ (index of the next reaction)
according to Eqs. 11.20 and 11.21 and propagating the state ⃗x of the system one
reaction event at a time. The fundamental steps in every exact SSA formulation
are thus:
1. Sample τ and µ from Eqs. 11.20 and 11.21,
2. Update state ⃗x = ⃗x + ⃗νµ and time t = t + τ ,
3. Recompute the reaction propensities aµ from the changed state.
We here only look at the two classical exact SSA formulations due to Gille-
spie. We note, however, that many more exact SSA formulations exist in the
literature, including the Next Reaction Method (NRM, introducing dependency
graphs for faster propensity updates), the Optimized Direct Method (ODM), the
Sorting Direct Method (SDM), the SSA with Composition-Rejection sampling
(SSA-CR, using composition-rejection sampling as outlined in Section 2.4.1 to
sample µ), and partial-propensity methods (factorizing the propensity and in-
dependently operating on the factors), which we do not discuss here. All are
different formulations of the same algorithm, exact SSA, and sample the exact
same trajectories. However, the computational cost of different SSA formula-
tions may differ for certain types or classes of networks.
A defining feature of exact SSAs is that they explicitly simulate each and ev-
ery reaction event. One can in fact show that algorithms that skip, miss,
or lump reaction events cannot sample from the exact solution of the Master
equation any more. However, they may still provide good and convergent weak
approximations, at least for low-order moments of p(⃗x, t). Examples of such
approximate SSAs are τ -leaping, R-leaping, and Langevin algorithms. We do

not discuss them here. They are very much related to numerical discretizations
of stochastic differential equations, as discussed in Chapter 10.

11.3.1 The first-reaction method (FRM)


The first-reaction method is one of the earliest exact SSA formulations, derived
by Gillespie in 1976. In this formulation, the time τµ when reaction µ fires next
is sampled according to the probability density function

p(τµ | ⃗x(t)) = aµ e−aµ τµ (11.23)

using the inversion method i.i.d. for each µ. Subsequently, the next reaction
µ is chosen to be the one with the minimum τµ , and the time τ to the next
reaction is set to the minimum τµ . The algorithm is given in Algorithm 5.

Algorithm 5 First Reaction Method
 1: procedure FRM-SSA(⃗x0 )                                 ▷ Initial state ⃗x0
 2:    Set t ← 0; initialize ⃗x, aµ ∀µ, and a
 3:    for k = 0, 1, 2, . . . do
 4:       Sample τµ according to Eq. 11.23 for each reaction µ: for each reaction, generate an i.i.d. uniform random number rµ ∼ U(0, 1) and compute τµ ← − log(rµ )/aµ
 5:       τ ← min{τ1 , . . . , τM }; µ ← index of the minimum of {τ1 , . . . , τM }
 6:       Update: ⃗x ← ⃗x + ⃗νµ , where ⃗νµ is the total stoichiometry of reaction µ; recompute all aµ and a
 7:       t ← t + τ
 8:    end for
 9: end procedure

The computational cost of FRM is O(M ) where M is the number of reactions


in the network. This is due to steps 4 and 6 in Algorithm 5, both of which have
a runtime of O(M ): step 4 involves generating M random numbers and step 6
involves recomputing all M reaction propensities.
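
A compact Python sketch of the FRM (NumPy assumed) is given below; the propensity function and the cyclic-chain example at the bottom are hypothetical illustrations with unit rate constants, so that aµ(⃗x) equals the copy number of the reactant of reaction µ:

import numpy as np

def frm_ssa(x0, nu, propensities, T, seed=0):
    """First-reaction method (Algorithm 5). nu is the N x M total stoichiometry
    matrix and propensities(x) returns the M reaction propensities a_mu(x)."""
    rng = np.random.default_rng(seed)
    x, t = np.array(x0, dtype=int), 0.0
    times, states = [t], [x.copy()]
    while t < T:
        a = propensities(x)
        if a.sum() == 0.0:                      # no reaction can fire any more
            break
        with np.errstate(divide="ignore"):      # a_mu = 0 gives tau_mu = inf
            tau = -np.log(rng.uniform(size=a.size)) / a
        mu = int(np.argmin(tau))                # first reaction to fire
        t += tau[mu]
        x = x + nu[:, mu]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# Hypothetical example: cyclic chain S1 -> S2 -> S3 -> S1 with unit rates (Eq. 11.7)
nu = np.array([[-1, 0, 1], [1, -1, 0], [0, 1, -1]])
times, states = frm_ssa([100, 0, 0], nu, lambda x: x.astype(float), T=5.0)
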

11.3.2 The direct method (DM)


The direct method (Gillespie, 1977) first samples the next reaction index µ ac-
cording to Eq. 11.21 using linear search over the reaction propensities. The time
τ to the next reaction is then sampled according to Eq. 11.20. The algorithm
is given in Algorithm 6.
The computational cost of DM is also O(M ). This is due to steps 4 and 6 in
Algorithm 6, both of which have a worst-case runtime of O(M ). In terms of
absolute runtimes, however, DM is more efficient than FRM since it does not
involve the expensive step of generating M exponential random numbers for
each reaction event. It only requires two random numbers in each iteration.

Algorithm 6 Direct Method
 1: procedure DM-SSA(⃗x0 )                                  ▷ Initial state ⃗x0
 2:    Set t ← 0; initialize ⃗x, aµ ∀µ, and a
 3:    for k = 0, 1, 2, . . . do
 4:       Sample µ using linear search according to Eq. 11.21: generate a uniform random number r1 ∼ U(0, 1) and determine µ as the smallest integer satisfying r1 a < ∑_{µ′=1}^{µ} aµ′
 5:       Sample τ according to Eq. 11.20: generate a uniform random number r2 ∼ U(0, 1) and compute τ ← − log(r2 )/a
 6:       Update: ⃗x ← ⃗x + ⃗νµ , where ⃗νµ is the stoichiometry of reaction µ; recompute all aµ and a
 7:       t ← t + τ
 8:    end for
 9: end procedure
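
An analogous sketch of the direct method (NumPy assumed, same hypothetical cyclic-chain example as for the FRM above; the cumulative-sum search below gives the same result as the linear search of step 4):

import numpy as np

def dm_ssa(x0, nu, propensities, T, seed=0):
    """Gillespie's direct method (Algorithm 6): per event, one uniform random
    number selects the reaction index mu (Eq. 11.21) and a second one the
    waiting time tau (Eq. 11.20)."""
    rng = np.random.default_rng(seed)
    x, t = np.array(x0, dtype=int), 0.0
    times, states = [t], [x.copy()]
    while t < T:
        a = propensities(x)
        a_tot = a.sum()
        if a_tot == 0.0:                        # no reaction can fire any more
            break
        r1, r2 = rng.uniform(), rng.uniform()
        mu = int(np.searchsorted(np.cumsum(a), r1 * a_tot, side="right"))
        tau = -np.log(r2) / a_tot               # tau ~ Exp(a_tot)
        t += tau
        x = x + nu[:, mu]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# Same hypothetical cyclic-chain example as above:
nu = np.array([[-1, 0, 1], [1, -1, 0], [0, 1, -1]])
times, states = dm_ssa([100, 0, 0], nu, lambda x: x.astype(float), T=5.0)
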