
MATH368: Stochastic Theory and Methods in Data Science

Lecture Notes
Dr. Andreas Alpers

Academic year 2024-2025


Contents

1 Review of Essential Background From Probability/Statistics
  1.1 Experiments, Events, Probability
  1.2 Random Variables
      Important Random Variables
  1.3 Independence and Conditioning
  1.4 Law of Large Numbers and Central Limit Theorem
  1.5 Elements of Stochastic Processes in Time and Space

2 Simulation: Theory and Practice
  2.1 Direct Monte Carlo Methods
      Direct Monte Carlo: Integration
      Direct Monte Carlo: Buffon Needle Problem (Toy Example)
  2.2 Variance Reduction and Importance Sampling
      Basic Idea Behind Importance Sampling

3 Pseudo Random Number Generators
  3.1 Sampling From a Uniform Univariate Distribution
      Middle-Square and Other Middle-Digit Techniques
      Linear Congruential Random Number Generators (LCRNGs)
      Problems With Number Generators
  3.2 Analyzing Random Number Generators
  3.3 General Methods for Sampling From Non-Uniform Distributions
      3.3.1 Cdf Inversion
      3.3.2 Rejection Sampling
  3.4 Sampling From a Normal Distribution
      3.4.1 Central Limit Algorithm

4 Markov Chain Monte Carlo (MCMC) Methods
  4.1 Markov Chains
  4.2 Metropolis-Hastings
  4.3 Other Samplers
      Barker's Sampler
      Original Metropolis Sampler
      Independence Sampler
      Gibbs Sampler
  4.4 Simulated Annealing
  4.5 An Example of a Gibbs Sampler

5 Learning Theory and Methods
  5.1 The Process of Learning
  5.2 Supervised Learning
      5.2.1 Vapnik-Chervonenkis Dimension
      5.2.2 Support Vector Machines
      5.2.3 Naïve Bayes Classifier
      5.2.4 Decision Trees
      5.2.5 Neural Networks
  5.3 Unsupervised Learning
      5.3.1 Clustering
      5.3.2 Learning on Huge Feature Spaces
      Variance Maximization
      Summary and Examples
  5.4 Reinforcement Learning and Markov Decision Processes
  5.5 Neural Networks
      5.5.1 Terminology
      5.5.2 Activation Functions
      5.5.3 Feedforward Networks
      Learning Rule: The Backpropagation Algorithm
      5.5.4 Hopfield Networks
      5.5.5 Boltzmann Machines
1 Review of Essential Background From Probability/Statistics
We begin by reviewing some essentials from probability/statistics that we will need throughout this
course. We will keep this rather brief as familiarity with these topics is assumed. Good textbooks on
this subject are, for instance, [3, 6, 9]. A summary of the general notation that will be used throughout
this text can be found in Table 1.1.

Symbol   Meaning

N        natural numbers: 1, 2, 3, . . .
N_0      non-negative integers: N ∪ {0}
Z        integers
R        set of real numbers
R^d      Euclidean space of dimension d, with norm denoted by || · ||

Table 1.1: General notation.

1.1 Experiments, Events, Probability

Random phenomena are observed by means of experiments (performed either by man or nature).
Each experiment results in an outcome (also called sample point). The collection of all possible
outcomes ω is called the sample space Ω. Any subset A of the sample space Ω can be regarded as a
representation of some event. An outcome ω realizes an event A if ω ∈ A. In probability theory, one
assigns to each event A ∈ F (where F is the power set of Ω or, more generally, a collection of subsets
of Ω satisfying certain axioms making it a so-called σ-field) a number, the so-called probability of the
event. Formally, a probability on (Ω, F) is a mapping Pr : F → R satisfying the axioms

(i) 0 ≤ Pr(A) ≤ 1, for any A ∈ F,

(ii) Pr(Ω) = 1,

(iii) Pr(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ Pr(A_i), for any countable sequence of disjoint A_1, A_2, . . . ∈ F.

The triple (Ω, F, Pr) is called a probability space. It should be noted that the simple axioms (i)-(iii)
determine completely how probabilities operate. For instance, the probability that both events A and B
occur is Pr(A ∩ B) = Pr(A) + Pr(B) − Pr(A ∪ B). However, how should we interpret probabilities in the
real world? Does probability measure the physical tendency of something to occur, or is it a measure
of how strongly one believes it will occur, or does it draw on both these elements? This brings us to
philosophical questions. There are mainly two broad categories of probability interpretations, which
can be called 'frequency' and 'Bayesian' probabilities. In the former category, probability is interpreted
as the long-run frequency of the occurrence of each potential outcome when the identical experiment is
repeated indefinitely. In the latter category, probability is viewed as a numerical measure of subjective
uncertainty about the experimental result^1.
Table 1.2 summarizes some basic probability theory notation that we use.

1.2 Random Variables

Informally, one should think of random variables as variables whose values depend on outcomes of a
random phenomenon.

^1 If you are interested in finding out more about these philosophical issues, I recommend
https://plato.stanford.edu/entries/probability-interpret/.
Name          Symbol          Discrete                               Continuous

cdf           F_X             Pr(X ≤ x)                              Pr(X ≤ x)
pdf           f_X             Pr(X = x)                              (d/dx) F_X(x)
Probability   Pr(a < X ≤ b)   ∑_{a<x≤b} f_X(x) = F_X(b) − F_X(a)     ∫_a^b f_X(x) dx = F_X(b) − F_X(a)
Expectation   E[r(X)]         ∑_{x∈Ω} r(x) f_X(x)                    ∫_{−∞}^{∞} r(x) f_X(x) dx
Mean          µ = E[X]        ∑_{x∈Ω} x f_X(x)                       ∫_{−∞}^{∞} x f_X(x) dx
Variance      var(X)          E[(X − µ)^2]                           E[(X − µ)^2]

Table 1.2: Basic probability theory notation.

Definition 1.1
Given a probability space (Ω, F, Pr), a random variable is a real-valued function X on Ω such
that, for all real numbers t, the set {ω ∈ Ω : X(ω) ≤ t} belongs to F.

Example 1.1
Consider the experiment of tossing a die once.
The possible outcomes are ω = 1, 2, 3, 4, 5, 6, and the sample space is the set Ω = {1, 2, 3, 4, 5, 6}.
Take for X the identity function X(ω) = ω.
In that sense, X is a random number obtained by the experiment of tossing a die.

Example 1.2
Consider the experiment of tossing a coin an infinite number of times.
As sample space Ω one can take the collection of all sequences ω = {a_i}_{i≥1} with a_i = 0 or a_i = 1
depending on whether the ith toss results in heads or tails. Define X_i to be the random number
obtained at the ith toss:
X_i(ω) = a_i.

In the following, we will often use the notation Pr(X ≤ a) and Pr(X ∈ A), where a ∈ R and A is
a measurable set. This notation is an abbreviation for Pr(X ≤ a) = Pr({X ≤ a}) = Pr({ω ∈ Ω :
X(ω) ≤ a}) and Pr(X ∈ A) = Pr({X ∈ A}) = Pr({ω ∈ Ω : X(ω) ∈ A}), respectively.
Random variables can be discrete or continuous (or mixed).

Definition 1.2
A discrete random variable is one that can assume only finitely many or countably infinitely
many values (but only one at a time, of course).

Definition 1.3
The function f_X that associates with each possible value of the (discrete) random variable X
the probability of this value is called the probability distribution function (pdf) of X.

In some literature, pdfs are called probability mass functions. Note that f_X satisfies: (i) f_X(x_k) ≥ 0
for all x_k, and (ii) ∑_{k=1}^∞ f_X(x_k) = 1.

Definition 1.4
The function F_X that associates with each real number x the probability Pr(X ≤ x) that the
random variable X takes on a value smaller than or equal to this number, i.e., F_X(x) = ∑_{x_k ≤ x} f_X(x_k),
is called the cumulative distribution function (cdf) of X.

It can be shown that F_X is non-decreasing and right-continuous.

Definition 1.5
A continuous random variable is a random variable that may take an uncountably infinite
number of values.

Definition 1.6
The probability density function (pdf) of a continuous random variable X is a function f_X
defined for all x ∈ R and having the following properties:
(a) f_X(x) ≥ 0 for any real number x;
(b) if A is any subset of R, then

    Pr(X ∈ A) = ∫_A f_X(x) dx.

Note that this pdf is different from the pdf of a discrete random variable. Indeed, f_X(x) does not give
the probability that the random variable X takes on the value x. What can be said is that, by

    f_X(x) ε ≈ Pr(x − ε/2 ≤ X ≤ x + ε/2),

for small ε > 0, f_X(x) ε is approximately equal to the probability that X takes on a value
in an interval of length ε about x.

Definition 1.7
The cumulative distribution function (cdf) F_X of a continuous random variable X is
defined by

    F_X(x) = Pr(X ≤ x) = ∫_{−∞}^x f_X(u) du.

Note that, by definition, we have

    Pr(X = x) = Pr(X ≤ x) − Pr(X < x) = ∫_{−∞}^x f_X(u) du − ∫_{−∞}^{x⁻} f_X(u) du = 0

for any real number x, where x⁻ means that the range of the integral is the open interval (−∞, x).
It can be shown that the cdf of a continuous random variable is continuous.
Before we recall several important random variables, let us spend a few words on histograms. These
are special kinds of bar charts often used to depict a series of values, say, u_1, . . . , u_n of outcomes in an
experiment. A convenient subdivision of the x-axis is created containing the values, for example by
means of points c_1 < · · · < c_K and a parameter w, called the bin width. They establish intervals, the
so-called bins, [c_1 − w/2, c_1 + w/2), [c_2 − w/2, c_2 + w/2), . . . , [c_K − w/2, c_K + w/2). A count is made of
how many of the values lie in each bin, and a bar of that height divided by nw is depicted standing on
this bin. Technically, this is called a density histogram for equal-size bin widths (to distinguish it from
other types of histograms), but since we will be using only this type of histogram we will usually omit
the word 'density.' Note that the areas of the bars in such a histogram sum up to 1.

Example 1.3
Consider the n = 20 numbers u1 , . . . , un = 1, 1, 3, 5, 7, 8, 8, 2, 3, 5, 6, 7, 7, 5, 6, 7, 4, 9, 1, 9. As bin
width w set w := 1, and as bin centers set ci := i, i = 1, . . . , 9. The corresponding histogram is
shown in Fig. 1.1, the corresponding matlab code is

f = hist(u,c); % f is a vector holding the bin counts, c holds the bin centers
bar(c, f/(n*w)); % draws bars with centers in c and height in f scaled by 1/nw

It is sometimes more convenient to specify only the number of bins and let matlab do the
rest. Replacing f=hist(u,c) by [f,c] = hist(u,9); w=c(2)-c(1); matlab will partition
the data automatically into 9 equal bins, the corresponding bin centers are returned in c, the
corresponding bin width w is then determined as c(2)-c(1). Another alternative is to use the
histogram or histcounts command.
Figure 1.1: Histogram.

Important Random Variables


Example 1.4
Consider a uniformly distributed discrete random variable X ∼ U({1, 2, . . . , n}), i.e., the
state space of X consists of the n numbers 1, . . . , n. They all have equal probability, i.e.,

    f_X(x) = { 1/n : x ∈ {1, . . . , n},
               0   : otherwise.

Correspondingly, by summing up the individual probabilities, we obtain the cdf

    F_X(x) = { 0   : x < 1,
               k/n : k ≤ x < k + 1, k ∈ {1, . . . , n − 1},
               1   : n ≤ x.

See top row of Fig. 1.2.

Example 1.5
Consider a uniformly distributed continuous random variable X ∼ U(a, b), i.e., on the
interval (a, b) the probability density function should be constant.
As ∫_R f_X(x) dx = 1 is required, we see that

    f_X(x) = { 1/(b − a) : a < x < b,
               0         : otherwise.

As cdf we have

    F_X(x) = ∫_{−∞}^x f_X(u) du = { 0               : x ≤ a,
                                    (x − a)/(b − a) : a < x < b,          (1.1)
                                    1               : b ≤ x.

It is well known that E[X] = (a + b)/2 and var(X) = σ^2 = (b − a)^2/12.

Let us remark, because we will use this in the following rather often, that for X ∼ U(0, 1) we have

    F_X(x) = { 0 : x ≤ 0,                    f_X(x) = { 1 : 0 < x < 1,
               x : 0 < x < 1,     and                  0 : otherwise.
               1 : x ≥ 1,

Figure 1.2: Uniform distribution. Top row: Discrete random variable X ∼ U ({1, 2, 3, 4}). (left) pdf,
(right) cdf. Bottom row: Continuous random variable X ∼ U (1, 4). (left) pdf, (right) cdf.

The single most important random variable type, next to uniform random variables, is the normal
(also known as Gaussian) random variable, parametrized by a mean µ and variance σ^2 (see
Fig. 1.3). If X is a normal random variable, we write X ∼ N(µ, σ^2). The pdf of a normal X ∼ N(µ, σ^2)
is:

    f_X(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)^2}.

Figure 1.3: Probability density function for the normal X ∼ N (µ, σ 2 ), with µ = 0 and σ 2 = 1. With
these parameters X is typically referred to as standard normal.

Theorem 1.1
Let X ∼ N(µ, σ^2) and a, b ∈ R with a > 0. Then, Y = aX + b ∼ N(aµ + b, (aσ)^2).

Proof. Let A be measurable and B = (A − b)/a. Then,

    Pr(Y ∈ A) = Pr(aX + b ∈ A)
              = Pr(X ∈ B)
              = ∫_{x∈B} (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)^2} dx
        (∗1)  = ∫_{y∈A} (1/(σ√(2π))) e^{−(1/2)(((y−b)/a − µ)/σ)^2} (1/a) dy
              = ∫_{y∈A} (1/(aσ√(2π))) e^{−(1/2)((y−(aµ+b))/(aσ))^2} dy,

where the integrand in the last line is f_Y(y), and (∗1) follows from the transformation theorem with
y = ax + b (therefore dx = (1/a) dy).

In particular, if X ∼ N(0, 1) then Y = σX + µ ∼ N(µ, σ^2). This will later help us to generate normally
distributed random numbers, since it tells us that to generate random numbers from N(µ, σ^2) it
suffices to generate numbers from N(0, 1) (which then only need to be transformed via Y = σX + µ).
The N(0, 1) distribution is called the standard normal distribution.
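The following minimal MATLAB sketch (not part of the notes; it assumes the built-in generator randn for standard normal samples, discussed later) illustrates the transformation: standard normal samples are scaled and shifted, and their density histogram is compared with the N(µ, σ^2) pdf.

mu = 2; sigma = 3; n = 10^5;
x = randn(1, n);                      % samples from N(0,1)
y = sigma*x + mu;                     % by Theorem 1.1: samples from N(mu, sigma^2)
[f, c] = hist(y, 50); w = c(2) - c(1);
bar(c, f/(n*w)); hold on;             % density histogram of the transformed samples
t = linspace(mu - 4*sigma, mu + 4*sigma, 200);
plot(t, exp(-0.5*((t - mu)/sigma).^2)/(sigma*sqrt(2*pi)), 'r', 'LineWidth', 2);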

1.3 Independence and Conditioning

The conditional probability of A given B, with Pr(B) ≠ 0, is defined as

    Pr(A|B) := Pr(A ∩ B)/Pr(B).                                      (1.2)

The interpretation of this concept is that if we are told that event B has already occurred, the consider-
ation of whether A occurs should be confined to the 'smaller universe' characterized by the occurrence
of B. Therefore, we say that two events A and B are independent if, and only if,

    Pr(A ∩ B) = Pr(A)Pr(B).

Note that the probabilities Pr(A|B) and Pr(B|A) are conceptually different quantities. While some-
times one of these quantities is known, one might want to compute the other. From (1.2), using the
fact that Pr(A ∩ B) is symmetric in A and B, we have

    Pr(A|B)Pr(B) = Pr(A ∩ B) = Pr(B|A)Pr(A).

Rearranging, we obtain (one form of) Bayes' formula^2:

    Pr(A|B) = Pr(B|A)Pr(A)/Pr(B).

^2 Named after the English mathematician and reverend Thomas Bayes (c. 1701 - April 7, 1761).

Example 1.6
Suppose a patient exhibits symptoms that make her physician concerned that she may have a
particular disease. The disease is relatively rare in this population, with a prevalence of 0.1%
(meaning it affects 1 out of every 1,000 persons). The physician recommends a screening test
that is rather expensive. Before agreeing to the screening test, the patient wants to know what
will be learned from the test; specifically, she wants to know the probability of disease given a
positive test result.
The physician reports that the screening test is widely used and has a reported accuracy (here,
sensitivity and specificity) of 85%, which means that the test's true positive rate and true
negative rate are both 85%.
Let A denote the event that a person has the disease, and let B denote the event that the test
is positive. What needs to be computed is the probability Pr(A|B), i.e., the probability that
the person has the disease given that the test is positive.
From the data we know Pr(A) = 0.001. Also, Pr(B|A) = 0.85, and

    Pr(B) = (0.85 · 0.001 · 1000 + (1 − 0.85) · (1 − 0.001) · 1000)/1000 = (0.85 + 149.85)/1000 = 0.1507.

Plugging this into Bayes' formula yields

    Pr(A|B) = Pr(B|A)Pr(A)/Pr(B) = (0.85 · 0.001)/0.1507 ≈ 0.0056.

Hence, if the patient undergoes the test and it comes back positive, there is a 0.56% chance that
she has the disease. Note, however, that even without the test there is a 0.1% chance that she
has the disease. In view of this, do you think the patient should have the screening test?
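A quick numerical check of this computation in MATLAB (a minimal sketch, not part of the original example; it simply re-evaluates the law of total probability and Bayes' formula with the numbers above):

prevalence = 0.001; sens = 0.85; spec = 0.85;        % Pr(A), Pr(B|A), Pr(no test positive | no disease)
prB = sens*prevalence + (1 - spec)*(1 - prevalence)  % law of total probability: Pr(B) = 0.1507
prAgivenB = sens*prevalence/prB                      % Bayes' formula: approx. 0.0056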

Two random variables X and Y are called independent if for any two subsets of real numbers
A, B ⊆ R, the events {X ∈ A} and {Y ∈ B} are independent events, hence if

    Pr({X ∈ A} ∩ {Y ∈ B}) = Pr({X ∈ A})Pr({Y ∈ B}).

Example 1.7
Suppose a coin is thrown twice. Let X be defined to be the number of heads that are observed.
Then, another coin is thrown three times. This time the number of heads is recorded by Y.
What is Pr({X ≤ 2} ∩ {Y ≥ 1})?
Since X and Y are results of different independent coin tosses, the random variables X and Y
are independent. Then,

    Pr({X ≤ 2} ∩ {Y ≥ 1}) = Pr({X ≤ 2})Pr({Y ≥ 1}) = 1 · 7/8 = 7/8.

Let us also recall what a joint distribution of two random variables X and Y is. First, the discrete
case. If X and Y are discrete, their joint pdf is

    f_{X,Y}(x, y) = Pr({X = x} ∩ {Y = y}) (=: Pr(X = x and Y = y)).

This is a function defined on discrete points in the x, y-plane; these points make up Ω. For an event A
we hence have

    Pr((X, Y) ∈ A) = ∑_{(x,y)∈A} f_{X,Y}(x, y).

For continuous random variables X and Y, the joint cdf is

    F_{X,Y}(x, y) = Pr(X < x, Y < y) (=: Pr({X < x} ∩ {Y < y})).

The pdf is the derivative of the cdf. For a subset A of the x, y-plane,

    Pr(A) = ∫∫_{(x,y)∈A} f_{X,Y}(x, y) dx dy.

Example 1.8
A joint probability density can be created from any two univariate densities, simply by multi-
plying them together. Let X have density fX and Y density fY ; then fX,Y (x, y) = fX (x)fY (y)
is a joint density. A concrete example of this is the joint pdf of rolling two dice, where

Pr(Dice1 = i and Dice2 = j) = Pr(Dice1 = i) · Pr(Dice2 = j).

1.4 Law of Large Numbers and Central Limit Theorem

We briefly review here two fundamental results on the convergence of random variables, which are widely
used in practice and in the context of Monte Carlo simulations. Proofs of the results can be found in
many textbooks on basic probability theory; see, e.g., [5].
Before we state the theorems, let us recall two notions of how a sequence of random variables becomes
more and more 'stable' (i.e., converges to another random variable). Let X_1, X_2, . . . be a sequence of
random variables and let F_1, F_2, . . . be the corresponding sequence of cdfs. If the cdfs become more
and more similar to the cdf F of a common random variable X as n → ∞, then we say that they
converge to X in distribution. Mathematically, this means

    lim_{n→∞} F_n(x) = F(x),   at all values of x where F is continuous.

Note that although we say a sequence of random variables converges in distribution, it is, by definition,
really the cdfs that converge, not the random variables. In this way this convergence is quite different
from convergence almost surely, which we recall in the following.
We say that X_n converges to X almost surely (abbreviated as a.s.) if

    Pr({ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1.

This type of convergence is similar to pointwise convergence of a sequence of functions, except that
the convergence need not occur on a set with probability 0 (hence the 'almost sure').
It can be shown that almost sure convergence implies convergence in distribution.

Example 1.9
Take the probability space on which all random variables are defined as that corresponding
to the U(0, 1) distribution. Thus Ω = (0, 1), and the probability of any interval in Ω is its
length. Define

    X_n(ω) = { 0 : 0 < ω < 1/2 − 1/n,
               n : 1/2 − 1/n ≤ ω < 1/2 + 1/n,
               0 : 1/2 + 1/n ≤ ω < 1.

For any ω < 1/2 select N_1 so that 1/N_1 < 1/2 − ω. If n ≥ N_1, then ω < 1/2 − 1/n,
hence X_n(ω) = 0. For any ω > 1/2 select N_2 so that 1/N_2 < ω − 1/2. If n ≥ N_2, then
ω > 1/2 + 1/n, hence X_n(ω) = 0. In summary,

    {ω : lim_{n→∞} X_n(ω) = 0} ⊇ {ω ∈ (0, 1) : ω ≠ 1/2},

which (since the single point ω = 1/2 has probability 0) implies Pr({ω : lim_{n→∞} X_n(ω) = 0}) = 1
and therefore X_n → 0 almost surely.

Let us now discuss the law of large numbers and the central limit theorem. Both theorems come in
different versions (depending on the assumptions imposed on the random variables). The law of large
numbers that we state here is, in fact, the strong law of large numbers.
Basically, both the law of large numbers and the central limit theorem give us a picture of what happens
when we take many independent samples from the same distribution.

Theorem 1.2: (Law of large numbers and central limit theorem)

Let X_1, X_2, . . . , X_n be independent and identically distributed random variables with finite
mean µ and finite variance σ^2. Then,

    (Law of large numbers:)    lim_{n→∞} (1/n) ∑_{i=1}^n X_i = µ   a.s., and

    (Central limit theorem:)   √n ((1/n) ∑_{i=1}^n X_i − µ) → N(0, σ^2)   in distribution.

The law of large numbers tells us two things: First, the average of many independent samples is (with
high probability) close to the mean of the underlying distribution. And, second, the histogram of
many independent samples is (with high probability) close to the graph of the pdf of the underlying
distribution.
The central limit theorem, on the other hand, says that the (scaled) average of many independent
copies of a random variable is approximately a normal random variable. It gives a sense of how fast
(1/n) ∑_{i=1}^n X_i approaches µ as n increases.

We remark that as √n ((1/n) ∑_{i=1}^n X_i − µ) → N(0, σ^2) in distribution by the central limit theorem,
we obtain, by dividing by √n and using Theorem 1.1, that (1/n) ∑_{i=1}^n X_i − µ is approximately an
N(0, σ^2/n) random variable for large n.
An illustration of the law of large numbers and the central limit theorem is given in Fig. 1.4 and 1.5,
respectively.
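A running-mean plot like the one in Fig. 1.4 can be produced with a few lines of MATLAB (a minimal sketch, assuming the built-in generator rand for U(0,1) samples):

n = 1000;
x = rand(1, n);                       % X_1,...,X_n ~ U(0,1)
runningMean = cumsum(x)./(1:n);       % (X_1+...+X_k)/k for k = 1,...,n
plot(1:n, runningMean, 'b'); hold on;
plot(1:n, 0.5*ones(1, n), 'r');       % theoretical mean mu = 0.5
xlabel('n'); ylabel('(X_1+...+X_n)/n');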

1.5 Elements of Stochastic Processes in Time and Space

By a stochastic process, we shall mean a family of random variables {X_t}, where t is a point in a
space T called the parameter space, and where, for each t ∈ T, X_t is a point in a space S called the
state space.
The parameter t is often interpreted as 'time.' For example, X_t can be the price of a financial asset
at time t. If T = N then we have nothing other than a sequence of random numbers. Sequences of
random numbers are thus a special case of random processes. It may also happen that t should be
interpreted as 'space.' For instance, X_t might be the water temperature at a location t = (u, v, w) ∈ R^3

Figure 1.4: Illustration of the law of large numbers. X_1, . . . , X_n ∼ U(0, 1). The theoretical mean
µ = 0.5 is depicted in red. For two different realizations, (1/n) ∑_{i=1}^n X_i is plotted (in blue and green).

Figure 1.5: Illustration of the central limit theorem: Histograms of √n · ((1/n) ∑_{i=1}^n X_i) for
n = 1, 2, 3, 20, where the X_i ∼ U(−1, 1). The number of samples is 10^4. For n = 2, 3, 20 the pdf of the
normal X ∼ N(0, σ^2) with σ^2 = (1/12)(1 + 1)^2 = 1/3 is shown in red. (Mean errors: 0.03823 for n = 2,
0.02136 for n = 3, 0.01125 for n = 20.)

in the ocean. Or t = (u, v) ∈ R^2 might represent a pixel location in a computer image (which is often
useful in image processing applications). Also, mixed interpretations as 'space-time' are possible. For
example, X_t with t = (u, v, w, s) ∈ R^4 might be the ocean temperature measured at (u, v, w) ∈ R^3 at
time s ∈ R.
The family {X_t} may be thought of as the path of a particle moving 'randomly' in space S, its
position at time t being X_t. A record of one of these paths is called a realization or sample path of the
process.

Example 1.10
Let us consider the problem of modeling the score during a football match as a stochastic
process.
The state space S needs to represent all possible values the score can take. Hence, a suitable
choice is S = {(x, y) : x, y = 0, 1, 2, . . . }. Measuring times in minutes, we can take as parameter
space T the interval [0, 90] (not considering overtime). The process starts in state (0, 0), and a
transition takes place between the states of S whenever a goal is scored. A goal increases x or
y by one, so the score (x, y) will then go to (x + 1, y) or (x, y + 1).

Example 1.11
Let X_1, X_2, . . . be a sequence of independent and identically distributed (i.i.d.) random vari-
ables. The process {X_i} is sometimes called i.i.d. noise. The parameter space is T = N and
the state space is S = R. A realization of this process is shown in Fig. 1.6(a). A histogram of
20,000 realizations of X_{1,000} is shown in Fig. 1.6(c).

Example 1.12
Let Y_1, Y_2, . . . be a sequence of independent and identically distributed random variables. De-
fine

    X_t := X_{t−1} + Y_t,   t ∈ N,   X_0 = 0.

The process {X_t} is called a random walk. The parameter space is T = N and the state space is
S = R. A sample path of this process is shown in Fig. 1.6(b). A histogram of 20,000 realizations
of X_{1,000} is shown in Fig. 1.6(d).
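A sample path as in Fig. 1.6(b) can be generated in a few lines of MATLAB (a minimal sketch; the step distribution Y_t ∼ U(−1, 1) is an assumption made here for illustration, any i.i.d. sequence works):

n = 1000;
y = 2*rand(1, n) - 1;       % Y_1,...,Y_n ~ U(-1,1)
x = cumsum(y);              % X_t = X_{t-1} + Y_t with X_0 = 0
plot(0:n, [0 x]); xlabel('t'); ylabel('X_t');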

Figure 1.6: I.i.d. noise and random walks: (a) sample path of i.i.d. noise, (b) sample path of a random
walk, (c) histogram of 20,000 realizations of the i.i.d. noise variable X_{1,000}, (d) histogram of 20,000
realizations of the random walk variable X_{1,000}.

2 Simulation: Theory and Practice
Many times we wish to simulate the results of an experiment by using a computer and random variables,
using the (pseudo-)random number generators available on computers. This is known as Monte Carlo
simulation^3.
There can be several reasons why one wants to do that. For instance:

• The experiment could be quite complicated to set up (or to control) in practice.

• Monte Carlo methods are generally very simple to implement.

• They can be used to solve a very wide range of problems, even problems that have no inherent
  probabilistic structure, such as the computation of multivariate integrals and the solution of linear
  systems of equations. They can also be used to solve optimization problems.

In general, Monte Carlo methods can be divided into two types: direct (or simple) Monte Carlo and
Markov Chain Monte Carlo (MCMC).

• Direct Monte Carlo: In this type of Monte Carlo the samples X_i that we generate are an
  i.i.d. sequence. So the strong law of large numbers tells us that the average of the X_i, i.e., the
  sample mean (1/n) ∑_{i=1}^n X_i, will converge to the mean of X as n → ∞. Furthermore, the central
  limit theorem tells us a lot about the error in our computation.

• Markov Chain Monte Carlo: These methods construct a Markov Chain whose stationary
  distribution is the probability measure we want to simulate. We then generate samples of the
  distribution by running the Markov Chain. As the chain runs we compute the value X_n of
  our random variable at each time step n. The samples X_n are not independent, but there are
  theorems that tell us the sample mean still converges to the mean of our random variable.

2.1 Direct Monte Carlo Methods

As an initial example, let us consider the problem of computing an integral.

Direct Monte Carlo: Integration   Consider the problem of computing a (possibly multi-dimensional)
definite integral

    µ := ∫_D g(x) dx,                                                (2.1)

for some g : R^d → R and interval D = [0, 1]^d ⊆ R^d (considering such a D makes our presentation
easier; by transformations this is typically not a real restriction). Let X be a random variable that
is uniformly distributed on D. Then, the expected value of g(X) is, by definition,

    E[g(X)] = ∫_D g(x) dx = µ.

Now, suppose we can draw independent and identically distributed random samples X_1, . . . , X_n uni-
formly from D (by a computer), and we set µ̂_n := (1/n) ∑_{i=1}^n g(X_i). Then the law of large numbers
tells us (for 'nicely behaving' g and D) that almost surely

    lim_{n→∞} µ̂_n = E[g(X)] = µ.

We can therefore approximate µ by simulation (drawing X_1, . . . , X_n and computing the mean).


^3 The concept was first popularized right after World War II. Mathematician Stanislaw Ulam, working on nuclear
weapons projects at the Los Alamos National Laboratory, coined the term in reference to an uncle who loved playing
the odds at the Monte Carlo casino. A nice read about the beginnings of Monte Carlo is the article
https://la-science.lanl.gov/cgi-bin/getfile?00326866.pdf by N. Metropolis.

Let's analyze the accuracy of this approach. Let's assume that the assumptions of the central limit
theorem are met. Then the theorem tells us that √n (µ̂_n − µ) converges in distribution, as n → ∞, to a
random variable X ∼ N(0, σ^2). And, by Thm. 1.1, we therefore have for large n that E[µ̂_n] = µ
and E[(µ̂_n − µ)^2] = var(µ̂_n) = σ^2/n, where σ^2 is the variance of g(X). The quantity

    √(E[(µ̂_n − µ)^2]) = σ/√n                                        (2.2)

is called the root mean square error (RMSE) of µ̂_n.

As the RMSE is σ/√n, one can say that the RMSE is of the order O(1/√n) as n → ∞ (this is Landau
O-notation^4). To obtain one more decimal digit of accuracy (i.e., an RMSE multiplied by a factor
of 1/10) one therefore needs a 100-fold increase in computation. Remarkably, this error does not
depend on d (unlike, for instance, for deterministic approaches).
A common approach to evaluating an integral is Riemann approximation, where one deterministically
selects n points t_1, . . . , t_n regularly spaced in D and subsequently computes

    µ̃_n := (1/n) ∑_{i=1}^n g(t_i).

There are more sophisticated methods than this, but Riemann approximation is perhaps the most
commonly taught in calculus classes. What can be shown (e.g., by using the Mean Value theorem) is
that the error for Riemann approximation is O(1/n^{1/d}) if g is reasonably smooth. This error rate
depends on d: For example, in a 10-dimensional space with D = [0, 1]^{10}, we will have to evaluate O(n^5)
grid points to guarantee that the error is O(1/√n). In Monte Carlo we only need O(n) points to get the
same error bound O(1/√n). Comparing n to n^5 may not seem to make such a big difference. But
when you wait for an algorithm to return a solution, you surely would prefer to wait 10 minutes instead
of 10^5 minutes (how many weeks is this?).
We remark that the central limit theorem can also be used to compute confidence intervals for the
results obtained by direct Monte Carlo. The interested reader can find more information, for instance,
in [10, Chapter 1].
One should also be aware of the following. Do not confuse the σ in (2.2) with the σ of the random
samples X_1, . . . , X_n that are drawn. The σ in (2.2) is the σ of g(X), i.e., σ^2 = var(g(X)), as we have
seen in connection with (2.2). Often var(g(X)) is not known and needs to be estimated from the data.
In most of our examples, however, we can compute it explicitly since

    var(g(X)) = E[(g(X) − µ)^2] =(∗) E[g(X)^2] − µ^2,

and g(X) (hence E[g(X)^2]) and µ are known^5. We will come back to this point later when we discuss
importance sampling.
It is time to wrap things up.
^4 One says f is in O(g) if, and only if, there is a positive real number C and a real number x_0 such that |f(x)| ≤ Cg(x)
for all x ≥ x_0.
^5 (∗) follows by standard manipulations: E[(g(X) − µ)^2] = E[g(X)^2 − 2g(X)µ + µ^2] = E[g(X)^2] − 2E[g(X)]µ + E[µ^2] =
E[g(X)^2] − 2µ^2 + µ^2 = E[g(X)^2] − µ^2, using only linearity of expectation.

Direct Monte Carlo for computing integrals of the form (2.1):
(1) Choose a large n.
(2) Draw independent and identically distributed random samples X_1, . . . , X_n uniformly from D.
(3) Return µ̂ = (1/n) ∑_{i=1}^n g(X_i) as an estimate of the integral.
Example 2.1
Suppose we want to compute

    µ = ∫_0^1 x^3 dx.

Of course, we know that we should obtain µ = [x^4/4]_0^1 = 1/4 = 0.25. Let's implement the direct
Monte Carlo method. As

    σ^2 = var(g(X)) = E[g(X)^2] − µ^2 = ∫_0^1 x^6 dx − 1/16 = 1/7 − 1/16 = 9/112,

we can also plot the RMSE σ/√n.
Here is the matlab code. Fig. 2.1 shows a sample path of the µ̂_n, the realized RMSE, and the
theoretical RMSE σ/√n.

1 nTrials =6*10^2;
2
3 mu =0.25;
4 sigma2 =1/7 - mu ^2;
5
6 u = rand (1 , nTrials ) ;
7 for n =1: nTrials
8 muhat ( n ) = mean ( u (1: n ) .^3) ; % here the function g ( x ) = x ^3 comes in
9 end ;
10 figure , plot (1: nTrials , muhat , 'b ' , ' LineWidth ' ,2) ; xlabel ( 'n ') ; ylabel ( '\ mu_n ') ;
11 hold on ; plot (1: nTrials , mu * ones (1 , nTrials ) ,'r ' , ' LineWidth ' ,2) ;
12
13 figure , plot (1: nTrials , abs ( muhat - mu ) , 'b ' , ' LineWidth ' ,2) ; xlabel ( 'n ') ; ...
ylabel ( ' RMSE of \ mu_n ') ; hold on ;
14 plot (1: nTrials , sqrt ( sigma2 ) ./ sqrt (1: nTrials ) , 'k ', ' LineWidth ' ,2) ;

MCintegralexample1.m

Let's consider another example.

Direct Monte Carlo: Buffon Needle Problem (Toy Example)   The French nobleman, Comte
de Buffon^6, posed the following problem in 1777:

^6 Comte de Buffon (Sept. 7, 1707 - April 16, 1788).
Figure 2.1: Direct Monte Carlo for Example 2.1. (a) Sample path of µ̂_n as a function of n (blue) and
µ = 0.25 (red), (b) corresponding (realized) RMSE as a function of n (blue) and plot of σ/√n (black).

Suppose that you drop a short needle on ruled paper; what is then the probability that the
needle comes to lie in a position where it crosses one of the lines?

It is intuitively clear that the probability should be an increasing function of the length of the needle
and a decreasing function of the spacing between the parallel lines. But that the probability, as we
will see next, will depend on π is perhaps quite unexpected.
We denote, in the following, the distance between the parallel lines of the ruled paper by d and the
length of the needle by L, respectively. A short needle in our context satisfies L ≤ d. Note that such
a needle cannot cross two lines at the same time. Now assume we drop such a short needle. How can
we check whether the needle crosses one of the parallel lines?
Well, let's introduce two parameters X and θ, where X denotes the distance of the midpoint of the
needle to the nearest of the parallel lines and θ its angle away from the horizontal with 0 ≤ θ ≤ π/2.
(We ignore negative angles as that case is symmetric to the considered case, producing the same
probability.) It is easy to see that the needle crosses one of the lines if and only if X ≤ (L/2) sin(θ); see
Fig. 2.2.
Now we can model Buffon's experiment. Based on our notation, we can consider X to be a uniform
random variable over 0 ≤ X ≤ d/2. Its probability density function is

    f_X(x) = { 2/d : 0 ≤ x ≤ d/2,
               0   : otherwise.

Also θ can be considered to be a uniform random variable over 0 ≤ θ ≤ π/2, hence with probability
density function

    f_θ(θ) = { 2/π : 0 ≤ θ ≤ π/2,
               0   : otherwise.
Figure 2.2: Sketch of the Buffon needle problem. The needle crosses a line if and only if
X/sin(θ) ≤ L/2, or, in other words, X ≤ (L/2) sin(θ).

The two random variables, X and θ, are independent. Therefore the joint probability density function
of (X, θ) is

    f_{X,θ}(x, θ) = { 4/(πd) : 0 ≤ x ≤ d/2 and 0 ≤ θ ≤ π/2,
                      0      : otherwise.

The probability that the short needle crosses one of the parallel lines is therefore

    Pr(short needle crosses a line) = ∫_0^{π/2} ∫_0^{(L/2) sin(θ)} f_{X,θ}(x, θ) dx dθ
                                    = ∫_0^{π/2} ∫_0^{(L/2) sin(θ)} 4/(πd) dx dθ          (2.3)
                                    = (4/(πd)) · (L/2) · [−cos(θ)]_0^{π/2}
                                    = 2L/(πd).

So, this solves Buffon's needle problem. But there is more to the story. We can actually use the
previous results to devise an experiment to approximate π in a random fashion^7.
Suppose we set up an experiment (or a computer simulation) of throwing such a needle. The values L
and d can therefore be assumed to be known. Let Y_1, . . . , Y_n be an i.i.d. sequence of random variables
(functions of θ and X) with Y_i describing whether in the ith throw the short needle hits the line
(Y_i = 1) or whether it doesn't (Y_i = 0). Let A = {y : Y_i(y) = 1} denote the event that the needle hits
the line. Then,

    µ := E[Y_i] = ∫_{R^2} Y_i(y) f_{Y_i}(y) dy = ∫_A f_{Y_i}(y) dy = Pr(A).

Now, by the law of large numbers, we know that the sample mean µ̂_n = (1/n) ∑_{i=1}^n Y_i will converge
towards µ = Pr(A) = (2L/d) · (1/π). Since we can record µ̂_n in the experiment, we can therefore use,
for large n, the formula

    µ̂_n ≈ (2L/d) · (1/π)

^7 The observant reader will notice that the computer simulation makes explicit use of π in generating the random
numbers (more precisely, to generate θ), so in a sense this is a little bit of cheating. This, however, can be fixed. See,
e.g., https://www.scirp.org/journal/paperinformation.aspx?paperid=74541.

for estimating π. The value of π is therefore approximately 2L/(d µ̂_n).
The following matlab code gives an implementation of this direct Monte Carlo method.

L = 1; d = 2; nTrials = 10000;
x = 0.5*d*rand(1, nTrials);           % draws x
theta = 0.5*pi*rand(1, nTrials);      % draws theta
hits = x <= 0.5*L*sin(theta);         % 1 if there is a hit, 0 otherwise
rel_freq = sum(hits)/nTrials          % relative frequency
pi_est = nTrials*2*L/(sum(hits)*d)    % estimate of pi

buffon1.m

An illustration is given in Fig. 2.3.


Figure 2.3: Simulating Buffon's experiment (L = 1, d = 2): A sample path of 2L/(d µ̂_n) (shown in
black) approximating the value π (depicted in red).

We remark that there are faster ways of computing π. The formula

    π/4 = 4 arctan(1/5) − arctan(1/239),                             (2.4)

where arctan(x) = x − x^3/3 + x^5/5 − · · · , for |x| < 1, called Machin's^8 formula, remained the primary
tool of π-hunters for centuries. Table 2.1 gives an indication of how quickly the approximations
converge.
Nowadays there exist even faster methods for computing π. See, e.g.,
https://www.davidhbailey.com/dhbpapers/pi-quest.pdf.

n estimate of π
1 3.183263598326360
2 3.140597029326060
3 3.141621029325035
4 3.141591772182177
5 3.141592682404399
6 3.141592652615309
7 3.141592653623555

Table 2.1: Estimating π ≈ 3.141592653589793 by (2.4) using n terms in the evaluation of arctan by
arctan(x) = x − x3 /3 + x5 /5 − · · · . Correct digits are shown in blue.
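The entries of Table 2.1 can be reproduced with a short MATLAB sketch (truncating the arctan series after n terms; the anonymous helper function below is our own illustrative construction):

n = 7; k = 0:n-1;
arctanApprox = @(x) sum(((-1).^k).*x.^(2*k+1)./(2*k+1));   % arctan series with n terms
pi_est = 4*(4*arctanApprox(1/5) - arctanApprox(1/239))     % approx. 3.141592653623555 for n = 7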

^8 John Machin (bapt. c. 1686 - June 9, 1751) was a professor of astronomy at Gresham College, London.

2.2 Variance Reduction and Importance Sampling

We have seen that the RMSE in direct Monte Carlo for integration problems is of the order O(σ/√n).
To increase the accuracy we can therefore increase n. However, sometimes there can be another way
of achieving this. We may construct a new Monte Carlo problem with the same answer as the original
one but with a smaller σ. Methods to do this are known as variance reduction techniques.
Out of the many variance reduction techniques from the literature (known under names such as stratified
sampling, control variates method, antithetic variates method, and Rao-Blackwellization), we select here
the concept of importance sampling.

Basic Idea Behind Importance Sampling   The general idea behind importance sampling is to
sample not necessarily from a uniform distribution, but rather from a distribution that somewhat
resembles g (directing attention to important regions of the hypercube D, so to speak).
Suppose X is a random variable with pdf f_X such that f_X(x) > 0 on the set {x : g(x) ≠ 0} and
f_X(x) = 0 for x ∉ D. Let Y be the random variable Y := g(X)/f_X(X). Then,

    µ := ∫_D g(x) dx = ∫_D (g(x)/f_X(x)) f_X(x) dx = E_{f_X}[Y],

where, to emphasize that the pdf is f_X, we added the subscript f_X to E.


Now, similarly as for direct Monte Carlo with uniformly distributed X_1, . . . , X_n, we can obtain an
estimate of µ = E_{f_X}[Y] by computing

    µ̂_n = (1/n) ∑_{i=1}^n Y_i = (1/n) ∑_{i=1}^n g(X_i)/f_X(X_i);

this time the X_1, . . . , X_n are generated from the distribution with pdf f_X. (One refers to f_X in this
context also as the importance function.) Of course, for this to work we need a method to sample
from f_X. But assuming this can be achieved, we claim that for some importance functions it can
happen that var_{f_X}(Y) is smaller than the σ^2 that we have from the direct Monte Carlo method
that uses no importance sampling.
Comparing

    var_{f_X}(Y) = E_{f_X}[Y^2] − E_{f_X}[Y]^2 = E_{f_X}[Y^2] − µ^2
                 = ∫_D (g(x)/f_X(x))^2 f_X(x) dx − µ^2
                 = ∫_D g(x)^2/f_X(x) dx − µ^2,

with

    var(g(X)) = ∫_D g(x)^2 dx − µ^2

from the direct Monte Carlo method using uniform samples, we see that f_X can indeed have an
influence on the variance. (We can even see that var_{f_X}(Y) would be zero if we could choose
f_X = g/µ. To verify that this pdf f_X integrates to 1, we would have needed to verify
1 = ∫_D f_X(x) dx = (1/µ) ∫_D g(x) dx. So, we would have needed to solve the integral we are interested
in estimating in the first place. Thus, this is not a very realistic situation.)
Importance sampling for computing integrals of the form (2.1):
(1) Choose a large n.
(2) Draw samples X_1, . . . , X_n from a trial distribution with pdf f_X.
(3) Return

    µ̂_n = (1/n) ∑_{i=1}^n g(X_i)/f_X(X_i)

as an estimate of the integral.
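As a small illustration (a minimal MATLAB sketch; the integrand is the one from Example 2.1, while the trial pdf f_X(x) = 2x and its inversion X = √U are our own illustrative choices, not from the notes):

% Importance sampling for mu = int_0^1 x^3 dx = 0.25 with trial pdf f_X(x) = 2x on (0,1).
% Sampling from f_X by cdf inversion: F_X(x) = x^2, hence X = sqrt(U) with U ~ U(0,1).
n = 10^5;
u = rand(1, n);
x = sqrt(u);                  % samples from the trial pdf f_X(x) = 2x
y = (x.^3)./(2*x);            % Y = g(X)/f_X(X)
muhat = mean(y)               % estimate of mu, close to 0.25

Here var_{f_X}(Y) = 1/12 − 1/16 = 1/48 ≈ 0.021, compared with var(g(X)) = 9/112 ≈ 0.080 for the direct Monte Carlo estimator of Example 2.1.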

Example 2.2
Suppose we want to compute the following integral

    µ = ∫_0^1 √(1 − x^2) dx     (with g(x) := √(1 − x^2)).

From basic calculus we find µ = [(1/2)(x√(1 − x^2) + arcsin(x))]_0^1 = (1/2) arcsin(1) = π/4 ≈ 0.78540.
Let X be a random variable X ∼ U(0, 1). Then,

    var(g(X)) = E[g(X)^2] − µ^2 = 2/3 − µ^2 = 2/3 − π^2/16 ≈ 0.05.

Hence, the RMSE in direct Monte Carlo is

    √(var(g(X)))/√n ≈ 0.2236/√n.

Can we reduce this factor of 0.2236 via importance sampling?

In practice it can be quite tricky to find an importance function that does the job. For the sake
of exposition, suppose we want to consider a function

    f_α(x) := (1 − αx^2)/(1 − α/3),   0 < x < 1,   0 < α < 1,

as importance function (we fix the value of α later). Indeed, any such f_α is a suitable pdf since
f_α(x) > 0 for x ∈ (0, 1) and

    ∫_0^1 f_α(x) dx = (1 − α/3)/(1 − α/3) = 1.

We can therefore consider f_α as the pdf f_X of a random variable X. Now, say we choose α = 0.74
(you can experiment with other values of α, too).
Suppose we can sample X_1, . . . , X_n from this distribution. Let Y := g(X)/f_X(X) and
Y_n := g(X_n)/f_X(X_n). Then, again, E_{f_X}[Y] = µ and

    var(Y) = E_{f_X}[Y^2] − µ^2.                                     (2.5)

For α = 0.74 we obtain

    E_{f_X}[Y^2] − µ^2 = (1 − α/3) · ∫_0^1 (1 − x^2)/(1 − αx^2) dx − µ^2 ≈ 0.0029,

and hence √(var(Y)) = √0.0029 ≈ 0.0539. In other words, importance sampling decreased the RMSE
in this case from 0.2236/√n to 0.0539/√n.

Figure 2.4: Illustration of Example 2.2: (a) Plots of g (blue) and f (red); (b) Histogram of µ̂_n (for
n = 10^4, based on nTrials = 10^3 simulations) via direct Monte Carlo, sample mean µ shown in red;
(c) same as in (b) but now importance sampling with density f was used (samples were drawn by a
method called rejection sampling, see later chapters).
We have seen that in direct Monte Carlo we need random numbers. Until now, we have just used
matlab's built-in functions. But how do they actually work? And how can we draw samples from more
complicated distributions? The basics of generating pseudo random numbers are discussed next.

3 Pseudo Random Number Generators
Random numbers can be generated from truly random physical processes (radioactive decay, thermal
noise, roulette wheels). The RAND Corporation, for instance, published in 1955 a book with a million
random numbers, obtained using an electric 'roulette wheel.' This book is now publicly available at
http://www.rand.org/publications/classics/randomdigits/.
However, physical random numbers are generally not very useful for Monte Carlo, because the sequence
is not repeatable, the generators are often slow, and it can be complicated to feed these random numbers
into the computer.
We therefore need a way of generating 'random' numbers by a computer. These numbers are often
called pseudo random numbers to stress the fact that they are not really random as they are
generated deterministically. In the following we will usually omit the word 'pseudo' since we discuss
only these random numbers anyway.

3.1 Sampling From a Uniform Univariate Distribution

Middle-Square and Other Middle-Digit Techniques One of the earliest recorded methods is
the middle-square method by John von Neumann (1946). The idea is the following.

Let n be an even number. To generate a sequence of n-digit random numbers, an n-digit starting
value is created and squared, producing a 2n-digit number. If the result has fewer than 2n digits,
leading zeroes are added to compensate. The middle n digits of the result would be the next
number in the sequence, and returned as the result. This process is then repeated to generate
more numbers.

Example 3.1
Suppose we want to generate 4-digit (integer) numbers, i.e., n = 4. Let's take as initial seed
v_0 = 1234. Then, we obtain:

    v_0 = 1234  --squaring-->  01522756  --extract-->  5227,
    v_1 = 5227  --squaring-->  27321529  --extract-->  3215,
    v_2 = 3215  --squaring-->  10336225  --extract-->  3362,
    · · ·

The following gives a MATLAB implementation of this approach.

function [z] = vonNeumannMiddleSquare(nnumbers, ndigits, seed)
% ::: seed must be an ndigits-digit natural number
% ::: returned numbers z(1), z(2), ..., z(nnumbers) are ndigits-digit integer numbers
fstring = sprintf('%%0%d.f', 2*ndigits); n2 = ndigits/2;

x(1) = seed;
for i = 1:nnumbers
    x(i+1) = x(i)^2;                    % square the number
    s = num2str(x(i+1), fstring);       % add leading zeros if necessary
    x(i+1) = str2num(s(n2+1:end-n2));   % extract the middle ndigits digits
end;
z = x(2:end);
end

vonNeumannMiddleSquare.m

Fig. 3.1 shows some results for n = 4. As initial seed, v_0 = 5810 is used since it gives a rather long chain
of non-repeating numbers. After the 108th generated number, the numbers repeat (actually, with a
fairly short cycle of 4100, 8100, 6100, 2100, 4100, . . . ). By the way, there are five numbers, namely 0,
100, 2500, 3792, and 7600, which, if taken as seed, generate no new numbers (they reproduce themselves).
Figure 3.1: Middle-square method for n = 4. (a) Plot of the number of generated numbers until the
first repetition as a function of the seed (average: 43.7, maximum: 111), (b) sample path for seed
v_0 = 5810 (blue), randomly drawn integers by matlab's randi for comparison (red).

Let us prove a fact about the middle-square method, which is rather undesirable.

Theorem 3.1: [8]
Consider the middle-square method for generating n-digit numbers, n even. If, for some k, the
most significant n/2 digits of v_k are zero, then the succeeding v_{k+1}, v_{k+2}, . . . will get smaller
and smaller until zero occurs repeatedly.

Proof. Since v_k < 10^{n/2} we have v_k^2 < 10^n. Therefore, v_{k+1} = ⌊v_k^2/10^{n/2}⌋ ≤ v_k^2/10^{n/2}
(note that v_k^2 has 2n digits including leading zeros). If v_k > 0, then
v_{k+1} ≤ v_k^2/10^{n/2} < v_k · 10^{n/2}/10^{n/2} = v_k.

As an illustration, consider the middle-square method for generating n = 4 digit numbers (note: one
always needs to specify n). Starting with v_0 = 0099 we obtain the strictly decreasing sequence
0098, 0096, 0092, 0084, 0070, 0049, 0024, 0005, 0000.
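This behaviour can be reproduced with the MATLAB function above (taking the seed 99, i.e., v_0 = 0099):

z = vonNeumannMiddleSquare(9, 4, 99)
% z = 98  96  92  84  70  49  24  5  0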
The middle-square method was invented at a time when computers were just starting out. At that time
this method was extensively used since it was simpler and faster than any other method. However, by
current standards, its quality is quite poor, at least when applied to numbers in the decimal number
system. Donald Knuth ([8]) remarks: '[...] experience showed that the middle-square method can give
usable results, but it is rather dangerous to put much faith in it until after elaborate computations have
been performed.'
We remark that instead of squaring numbers (which is computationally usually a fast thing to perform)
one could modify the middle-square method to utilize other functions, e.g., sin or log. Sometimes one
needs to pay special attention to make correct use of the domains and ranges of these functions. We
do not pursue this here any further. Generally, all these methods tend to suffer from fundamental
problems, one of which is a short period and rapid cycling for most initial seeds.

Linear Congruential Random Number Generators (LCRNGs)   Most random-number gener-
ators in use today are linear congruential random number generators (LCRNGs).

They produce a sequence of integers between 0 and m − 1 according to

    v_{k+1} = (a v_k + c) mod m,   k = 0, 1, 2, . . .                (3.1)

where a is called the multiplier, c the increment, and m the modulus. (n mod m is the difference
of n and the largest integer multiple of m less than or equal to n.) The initial number v_0, which
usually needs to be provided by the user, is typically referred to as the seed.

Note that the generated numbers are typically elements of {0, 1, . . . , m − 1}. By dividing by m, i.e.,
by setting

    u_k = v_k/m,

the numbers can be transformed to real numbers in [0, 1).
Sequences of numbers v_0, v_1, . . . generated by (3.1) are called Lehmer^9 sequences.
A frequently recommended generator of this form, due to M. Marsaglia (1972), is:

    v_{k+1} = (69069 v_k + 1) mod 2^32.                              (3.2)

This generator has in fact a maximal period, i.e., the numbers repeat only after m numbers have been
generated.
A large class of LCRNGs are the so-called Mersenne^10 generators. They use Mersenne prime moduli,
i.e., m is a prime of the form m = 2^p − 1, where p is also a prime. These generators can have
periods of length m − 1, and it can also be shown that they never generate a zero. Coincidentally,
m = 2^31 − 1 = 2,147,483,647 is a Mersenne prime, the largest prime of this form that can be
stored in a full-word (32-bit) integer on a computer.
The following Lewis-Goodman-Miller generator, introduced in 1969, is a widely studied Mersenne
generator:

    v_{k+1} = 7^5 v_k mod (2^31 − 1).
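A generic LCRNG takes only a few lines of MATLAB (a minimal sketch; the function name and interface are our own, and the parameters in the usage line are those of the Lewis-Goodman-Miller generator):

function u = lcrng(n, a, c, m, seed)
% Generates n pseudo random numbers in [0,1) via v_{k+1} = (a*v_k + c) mod m, u_k = v_k/m.
% (Exact in double precision as long as a*m < 2^53.)
v = zeros(1, n); v(1) = mod(a*seed + c, m);
for k = 2:n
    v(k) = mod(a*v(k-1) + c, m);
end
u = v/m;
end

% Example usage: u = lcrng(1000, 7^5, 0, 2^31 - 1, 1);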

Problems With Number Generators   Let us have a look at RANDU, which is an LCRNG developed
by IBM in the 1950's, and which for many years was the most widely used random number generator
in the world. Leaving some minor technical details aside, the numbers are generated as

    v_{k+1} = (2^16 + 3) v_k mod 2^31.                               (3.3)

The generator in (3.3) can be written to express the relationship among three successive members of
the output sequence:
^9 Derrick Lehmer (Feb. 23, 1905 - May 22, 1991).
^10 Marin Mersenne (Sept. 8, 1588 - Sept. 1, 1648).

    v_k ≡ (2^16 + 3) v_{k−1}                      mod 2^31
        ≡ (2^16 + 3)^2 v_{k−2}                    mod 2^31
        ≡ (2^32 + 6·2^16 + 9) v_{k−2}             mod 2^31
        ≡ (0 + 6·2^16 + 9) v_{k−2}                mod 2^31
        ≡ (6·(2^16 + 3) − 6·3 + 9) v_{k−2}        mod 2^31
        ≡ (6·(2^16 + 3) v_{k−2} − 9 v_{k−2})      mod 2^31
        ≡ (6 v_{k−1} − 9 v_{k−2})                 mod 2^31,

i.e.,

    v_k − 6 v_{k−1} + 9 v_{k−2} = C · 2^31,

where C is an integer. Note that 0 ≤ v_k < 2^31 by construction. Hence, the maximum integer that we
can obtain by v_k − 6 v_{k−1} + 9 v_{k−2} is smaller than 2^31 − 0 + 9·2^31 = 10·2^31. Similarly, the smallest
number that we can obtain is larger than 0 − 6·2^31 + 0 = −6·2^31. This leaves only the 15 possibilities
C ∈ {−5, −4, . . . , 8, 9}. Consequently, all triples must lie on no more than 15 hyperplanes in R^3.
Fig. 3.2 shows a plot of subsequent triples of generated numbers viewed from different perspectives.
The 15 hyperplanes are clearly visible in Fig. 3.2(b).
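A plot like Fig. 3.2 can be reproduced along the following lines (a minimal MATLAB sketch; the seed v_0 = 1 and the number of samples are illustrative choices):

n = 10^4; v = zeros(1, n); v(1) = 1;            % seed v_0 = 1 (odd seeds are the usual choice)
for k = 2:n
    v(k) = mod((2^16 + 3)*v(k-1), 2^31);        % RANDU recursion (3.3)
end
plot3(v(1:end-2), v(2:end-1), v(3:end), '.'); grid on;  % triples (v_k, v_{k+1}, v_{k+2})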

Figure 3.2: Subsequent triples of IBM's RANDU viewed from two perspectives.

3.2 Analyzing Random Number Generators

Of course, we would like our random numbers to be as random as possible. This, however, cannot
be checked. We can only detect non-randomness. Over the years, many different statistical tests for
detecting non-randomness have been developed. Typical random numbers need to pass all those
tests. For lack of space, we are not going into details here. The interested reader can find information
about the so-called chi-square and the Kolmogorov-Smirnov tests in [12].
A quick indication of whether the generated numbers behave somewhat randomly can be obtained by
the following procedures (a minimal MATLAB sketch follows the list).
(a) Plot a sample path, i.e., if r_i is the ith random number generated, plot r_i versus i.
(b) Plot a histogram.
(c) Plot a correlation plot, i.e., plot (r_i, r_{i+1}) (or (r_i, r_{i+1}, r_{i+2}), or even larger tuples).
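In MATLAB these three quick checks might look as follows (a sketch; here r is produced by the built-in rand, to be replaced by the output of the generator under study):

n = 10^4; r = rand(1, n);                           % numbers to be examined
subplot(1,3,1); plot(1:n, r, '.');                  % (a) sample path: r_i versus i
subplot(1,3,2); [f, c] = hist(r, 20); w = c(2) - c(1);
bar(c, f/(n*w));                                    % (b) density histogram
subplot(1,3,3); plot(r(1:end-1), r(2:end), '.');    % (c) correlation plot (r_i, r_{i+1})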

3.3 General Methods for Sampling From Non-Uniform Distributions

3.3.1 Cdf Inversion


This method goes back to the beginnings of Monte Carlo. It was proposed by John von Neumann^11
in a letter to Stanislaw Ulam^12 discussing their 'random numbers work'^13.
Let us first discuss the discrete case.

Theorem 3.2
Let U ∼ U (0, 1), and let Y be a discrete random variable with cdf F. Let

F − (u) = min{t ∈ R : F (t) ≥ u}

be the so-called generalized inverse of F. Then, the discrete random variable X = F − (U ) has
cdf F.

Proof. Let us rst convince ourselves that the minimum in min{t ∈ R : F (t) ≥ u} is in fact attained.
Consider for a given u ∈ (0, 1) the set Iu = {t ∈ R : F (t) ≥ u}. Note that Iu is non-empty, since
u < 1 and F (y) → 1 as y → ∞. Iu has a finite left endpoint, say ηu , because u > 0 and F (y) → 0 as
y → −∞. Finally, ηu ∈ Iu , since F is a cdf and therefore right-continuous (consider yn = ηu + 1/n,
n = 1, 2, . . . . Then, u ≤ F (yn ) for all n, hence u ≤ limn F (yn ) = F (ηu ) implying ηu ∈ Iu ). In summary,
the minimum in min{t ∈ R : F (t) ≥ u} is indeed attained.
We claim
{(t, u) ∈ R × (0, 1) : F − (u) ≤ t} = {(t, u) ∈ R × (0, 1) : u ≤ F (t)}. (3.4)
Taking an element from the left set, i.e., (t, u) ∈ R × (0, 1) satisfying F^−(u) ≤ t, we have (using that F is non-decreasing, the definition of F^−, and the fact that the minimum is attained in {t ∈ R : F(t) ≥ u})

    F(t) ≥ F(F^−(u)) = F(min{t ∈ R : F(t) ≥ u}) ≥ u,

and (t, u) is contained in the right set. Conversely, for (t, u) ∈ R × (0, 1) satisfying u ≤ F(t) we have (since F^− is non-decreasing, by the definition of F^−, and since t belongs to the set {r ∈ R : F(r) ≥ F(t)})

    F^−(u) ≤ F^−(F(t)) = min{r ∈ R : F(r) ≥ F(t)} ≤ t,

proving (3.4).
Now we can complete the proof: using the definition of X, (3.4), the definition of the cdf of U, and the fact that F_U(y) = y for U ∼ U(0, 1),

    Pr(X ≤ t) = Pr(F^−(U) ≤ t) = Pr(U ≤ F(t)) = F_U(F(t)) = F(t).

Now let us take a closer look at Theorem 3.2. In fact, the theorem gives us a method of sampling
from a distribution with cdf F. All we need to do is generate U ∼ U (0, 1) and compute the generalized
inverse F − (U ). But how do we compute F − (U )?
For notation, suppose Y (and also the to-be generated X ) takes values x1 , . . . , xN with probabilities
p_1, . . . , p_N. Then,

    F(t) = Σ_{k : x_k ≤ t} p_k,

with the convention that the empty sum yields value 0. For u ∈ (0, 1), we therefore have F − (u) = xk
if and only if F (xk−1 ) < u ≤ F (xk ).
11 John von Neumann (Dec. 28, 1903 – Feb. 8, 1957).
12 Stanislav Ulam (April 13, 1909 – May 13, 1984).
13 See R. Eckhardt. Stan Ulam, John von Neumann and the Monte Carlo method. Los Alamos Science, pages 131–143, 1987. Special Issue. http://www-star.st-and.ac.uk/~kw25/teaching/mcrt/MC_history_3.pdf.
Hence cdf inversion for discrete random variables reduces to the following:
(a) Generate U ∼ U (0, 1).
(b) Set X = xk if F (xk−1 ) < U ≤ F (xk ).

Example 3.2: (From [12])


Suppose, we want to simulate the number of eggs laid by a shorebird during breeding season.
These birds have X = 2, 3, 4, or rarely 5 eggs per clutch. Suppose, the observed frequencies
of these sizes are p2 = 0.15, p3 = 0.20, p4 = 0.60, and p5 = 0.05, respectively. The cdf of X is
shown in Fig. 3.3(a).
Figure 3.3: (a) Cdf for Example 3.2, (b) Histogram for 10, 000 samples.

Let U ∼ U (0, 1) be a uniform sample. Starting from the point (0, U ) (on the y-axis), proceed to the
right until you encounter a jump in the cdf, a vertical dashed line segment. Now, proceed down
to the x-axis, and return this value as the selection. As indicated along the y -axis, `2' is selected
with probability 0.15, since its interval occupies this fraction of the unit interval. Similarly, `3'
is selected with 0.2 probability, since its interval occupies this fraction of the unit interval, and
so on.
Example Matlab code for generating samples from this distribution is given below. A histogram
obtained by executing this code is shown in Fig. 3.3(b).

nTrials = 10^4;
b(2) = 0.15; b(3) = b(2) + 0.2; b(4) = b(3) + 0.6;
U = rand(1, nTrials);
w2 = (U <= b(2)) * 2;
w3 = (U > b(2) & U <= b(3)) * 3;
w4 = (U > b(3) & U <= b(4)) * 4;
w5 = (U > b(4)) * 5;

w = w2 + w3 + w4 + w5;   % vector with the right frequencies

[f, x] = hist(w, 1:5); dx = diff(x(1:2)); bar(x, f / sum(f * dx));   % plots the histogram

cdfinvdiscrexample1.m

Example 3.3
Suppose X is geometric with parameter p such that Pr(X = k) = (1 − p)^{k−1} p, for k ∈ {1, 2, . . . }.
Then,

    F_X(t) = Σ_{k=1}^{⌊t⌋} (1 − p)^{k−1} p
           = p (1 + (1 − p) + (1 − p)^2 + · · · + (1 − p)^{⌊t⌋−1})
           = p · (1 − (1 − p)^{⌊t⌋}) / (1 − (1 − p))
           = 1 − (1 − p)^{⌊t⌋}.

Therefore, by cdf inversion we can generate X as follows:


(a) Generate U ∼ U (0, 1).
(b) Set X = k if 1 − (1 − p)^{k−1} < U ≤ 1 − (1 − p)^k.
It turns out that, for this distribution, we can compute step (b) more efficiently. In fact, step (b)
can be replaced by
(b') Set X = ⌈ln(1 − U )/ ln(1 − p)⌉ .
To verify this note that step (b) can be reformulated as saying that we are looking for:

    k = min{k ∈ N : U ≤ 1 − (1 − p)^k}
      = min{k ∈ N : (1 − p)^k ≤ 1 − U}
      = min{k ∈ N : (1 − p) ≤ (1 − U)^{1/k}}                 (∗1)
      = min{k ∈ N : ln(1 − p) ≤ (1/k) ln(1 − U)}             (∗2)
      = min{k ∈ N : k ≥ ln(1 − U)/ln(1 − p)}                 (∗3)
      = min{k ∈ N : k ≥ ⌈ln(1 − U)/ln(1 − p)⌉}               (∗4)
      = ⌈ln(1 − U)/ln(1 − p)⌉,

where (∗1) and (∗2) follow from the fact that (·)^{1/k} and, respectively, ln are non-decreasing,
(∗3 ) follows from the fact that ln(1 − p) is negative, and (∗4 ) follows from the fact that k is an
integer.
A histogram of 10, 000 samples from the geometric distribution with p = 0.2 is shown in
Fig. 3.4.
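With step (b'), the sampler becomes essentially a one-liner in Matlab; a minimal sketch (with p and the number of trials chosen as above):

p = 0.2; nTrials = 10^4;
U = rand(1, nTrials);
X = ceil(log(1 - U) ./ log(1 - p));                % step (b'): X = ceil(ln(1-U)/ln(1-p))
[f, x] = hist(X, 1:max(X)); bar(x, f / nTrials);   % relative frequencies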

Let us now turn to the continuous case.

Figure 3.4: Histogram (obtained by cdf inversion) of the geometric distribution, p = 0.2. The graph of f(k) = (1 − p)^{k−1} p is plotted in red.

Theorem 3.3
Let U ∼ U (0, 1), and let Y be a continuous random variable with invertible cdf F. Then, the
continuous random variable X = F −1 (U ) has cdf F.

Proof. Let u ∈ (0, 1) and Iu = {t ∈ R : F (t) ≥ u}. As we assume that F is invertible, we have
F (F −1 (u)) = u ≥ u, i.e., F −1 (u) ∈ Iu . And there is no smaller element than F −1 (u) contained in Iu
since an invertible cdf, such as F, is strictly increasing. Hence,

F − (u) = F −1 (u), for all u ∈ (0, 1).

With this the rest of the proof from Theorem 3.2 carries over, word-by-word.

Hence cdf inversion for continuous random variables is as follows:


(a) Generate U ∼ U (0, 1).
(b) Return F −1 (U ).

Note that for cdf inversion an explicit formula for the inverse cdf F −1 is needed. Such a formula is
often not available (e.g., in the case of a normal distribution). However, there are cases where cdf
inversion can be applied successfully. Let us look at an example.

Example 3.4: Exponential distribution


Consider the exponential distribution f(t) = λe^{−λt}, t ∈ [0, ∞), with parameter λ > 0.
Its cdf is F(t) = 1 − e^{−λt}, t ∈ [0, ∞). This is a continuous, strictly increasing function, which is therefore invertible. Solving u = 1 − e^{−λt} for t we see that the inverse cdf is F^{−1}(u) = − ln(1 − u)/λ. Hence, for U ∼ U(0, 1) the variable

    T = −(1/λ) ln(1 − U)
is the desired variable. Fig. 3.5 shows a histogram obtained by sampling 10, 000 times with this
method.
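A minimal Matlab sketch of this inversion (the value of λ and the histogram normalization are illustrative):

lambda = 1; nTrials = 10^4;
U = rand(1, nTrials);
T = -log(1 - U) / lambda;      % cdf inversion for the exponential distribution
[f, x] = hist(T, 50); dx = diff(x(1:2));
bar(x, f / (nTrials * dx));    % normalized histogram; compare with lambda*exp(-lambda*t)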

Let us summarize. Cdf inversion for discrete random variables is a completely general method. It can
be applied to any discrete probability distribution. The continuous version, however, only applies if
an explicit formula for the inverse of the cdf is available.

Figure 3.5: Sampling from the exponential distribution via cdf inversion. Histogram of 10, 000 samples. The pdf F′(t) = λe^{−λt} of the exponential distribution is plotted in red. (a) λ = 1, (b) λ = 2.

3.3.2 Rejection Sampling


Rejection sampling14 is a general method for sampling from both discrete and continuous distribu-
tions.
The general idea is to draw samples from a so-called trial distribution and then accept or reject them according to a criterion designed so that, overall, the returned outputs follow the correct distribution (sometimes called the target distribution).
Let f denote the pdf of the target distribution from which we want to sample, and let g denote the
pdf of the trial distribution from which we can sample. Suppose, the following condition is met:

There is an M > 0 such that f (x) ≤ M g(x), ∀x.

Then,

Rejection sampling proceeds as follows:


(a) Draw Y from the distribution having g as pdf (i.e., draw Y ∼ g ).
(b) Draw U ∼ U (0, 1), independent of Y.
(c) IF U ≤ f (Y )/M g(Y ) THEN return X = Y (`accept Y ');
OTHERWISE (`reject') go back to (a).
To see that rejection sampling really works (i.e., that X is really a random variable with pdf f ), we
will need to look at the continuous and discrete distribution function case separately.
Common to both cases is that each time we loop through (a)-(c) a pair (Y, U ) is generated. To
be accepted the pair needs to satisfy U ≤ f (Y )/M g(Y ). Hence, any X returned as output has the
same distribution as (Y |U ≤ f (Y )/M g(Y )); i.e., the conditional distribution of Y given that Y is
accepted.
Proof.
14 Again going back to work of John von Neumann. See R. Eckhardt. Stan Ulam, John von Neumann and the Monte Carlo method. Los Alamos Science, pages 131–143, 1987. Special Issue. http://www-star.st-and.ac.uk/~kw25/teaching/mcrt/MC_history_3.pdf

Proof for the continuous case

    Pr(X ≤ x) = Pr(Y ≤ x | U ≤ f(Y)/(M g(Y))) = Pr(Y ≤ x, U ≤ f(Y)/(M g(Y))) / Pr(U ≤ f(Y)/(M g(Y))).

Now, for the numerator of the above expression we have

    Pr(Y ≤ x, U ≤ f(Y)/(M g(Y))) = ∫_{−∞}^{∞} Pr(Y ≤ x, U ≤ f(Y)/(M g(Y)) | Y = y) g(y) dy    (by footnote 15)
                                 = ∫_{−∞}^{x} Pr(U ≤ f(y)/(M g(y))) g(y) dy
                                 = ∫_{−∞}^{x} (f(y)/(M g(y))) g(y) dy
                                 = (1/M) ∫_{−∞}^{x} f(y) dy
                                 = (1/M) F(x),

where F is the cdf corresponding to f. For the denominator we obtain (similar to the first line of the above equations)

    Pr(U ≤ f(Y)/(M g(Y))) = ∫_{−∞}^{∞} (f(y)/(M g(y))) g(y) dy = 1/M.

Hence Pr(X ≤ x) = F(x), as required.

Proof for the discrete case

    Pr(X = x) = Pr(Y = x | U ≤ f(Y)/(M g(Y))) = Pr(Y = x, U ≤ f(Y)/(M g(Y))) / Pr(U ≤ f(Y)/(M g(Y))).

Before we consider the numerator and denominator of the above expression, note that

    Pr(accept | Y = x) = f(x)/(M g(x))    (basic uniform cdf)    (3.5)

Now,

    Pr(Y = x, U ≤ f(Y)/(M g(Y))) = Pr(accept | Y = x) Pr(Y = x)    (by Bayes' rule)
                                 = Pr(accept | Y = x) g(x)
                                 = f(x) g(x)/(M g(x))    (using (3.5))
                                 = (1/M) f(x).

Further,

    Pr(U ≤ f(Y)/(M g(Y))) = Pr(accept)
                          = Σ_x Pr(accept | Y = x) Pr(Y = x)    (by footnote 16)
                          = Σ_x (f(x)/(M g(x))) g(x)    (using (3.5))
                          = (1/M) Σ_x f(x)
                          = 1/M    (since f is a pdf).

Hence Pr(X = x) = f(x), as required.

15 The continuous law of total probability states: If A ⊆ R^2 is any event then

    Pr((U, Y) ∈ A) = ∫_{−∞}^{∞} Pr((U, Y) ∈ A | Y = y) f_Y(y) dy.

16 The discrete law of total probability states: If A is any event and B_1, B_2, . . . form a partition of Ω, then

    Pr(A) = Σ_i Pr(A | B_i) Pr(B_i).

Comments
(a) Rejection sampling requires that we have a method available for sampling from g.
(b) Above we have calculated the probability that the pair (Y, U ) is accepted:

 
f (Y )
Pr U ≤ = 1/M.
M g(Y )

The value 1/M is also referred to as acceptance rate of the rejection sampling algorithm. We
will have to generate, on average17 , M n draws from both the trial and uniform distribution to
obtain n draws from the target distribution. Thus, generally, M should not be chosen too large; optimally, it should be as close to 1 as possible. Of course, we can only choose a small M if g closely follows f.
(c) We only need to know f and g up to a constant of proportionality. In many applications we
will not know the normalizing constant for these densities, but we do not need them. That is, if
f ∗ (y) = cf f (y) and g ∗ (y) = cg g(y), for all y, we can proceed with the algorithm using f ∗ and g ∗
even if we do not know the values of cf and cg .

Example 3.5: Continuous distribution


Let us consider the case where the target pdf is

    f(x) = 12x^2(1 − x) for x ∈ (0, 1),   and   f(x) = 0 otherwise.

(This is the so-called Beta distribution with parameters α = 3 and β = 2.) Since the maximum
of f occurs at x0 = 2/3, where f (x0 ) = 16/9, this means we can take M = 16/9, for x ∈ [0, 1],
corresponding to a uniform density g over (0, 1).
In Fig. 3.6(a) we show f, M, and 200 points corresponding to trials (Y, U · M g(Y )). When the
second coordinate U · M g(Y ) is smaller or equal to f (Y ), the point is accepted; otherwise it
is rejected. For this particular sample, 111 points were accepted and 89 were rejected for a
proportion 111/200 = 0.5550 of acceptance, not too far from the theoretical one of 1/M =
9/16 = 0.5625.

17 Note that the probability of an `accept' after k Bernoulli trials follows a geometric distribution p(1 − p)^{k−1}, where p = 1/M is the probability of `accept.' The mean of this distribution is M.
Figure 3.6: Sampling from f(x) = 12x^2(1 − x) via rejection sampling. (a) plot of (Y, U · M g(Y)), (b) histogram.
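A minimal Matlab sketch of rejection sampling for this target (uniform trial density g ≡ 1 on (0, 1) and M = 16/9, as above; sample size illustrative):

f = @(x) 12 * x.^2 .* (1 - x);            % target pdf on (0,1)
M = 16/9;                                 % f(x) <= M*g(x) with g = 1 on (0,1)
nSamples = 10^4; X = zeros(1, nSamples);
for i = 1:nSamples
    accepted = false;
    while ~accepted
        Y = rand; U = rand;               % (a) trial draw, (b) uniform for the accept test
        if U <= f(Y) / M                  % (c) accept with probability f(Y)/(M*g(Y))
            X(i) = Y; accepted = true;
        end
    end
end
[fr, x] = hist(X, 30); dx = diff(x(1:2)); bar(x, fr / (nSamples * dx));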

Example 3.6: Discrete distribution


Let us assume we toss two coins (with H denoting `heads' and T denoting `tails'). The possible
events (numbered for convenience with numbers from 1 to 4) are (1) HH, (2) HT, (3) TH, and
(4) TT. We now want to select the event HH with probability 2/5, and the other events each
with probability 1/5. How can we do this with rejection sampling?
As trial distribution we can take the uniform distribution on the pair of coin tosses (which, of
course has 4 outcomes), hence P r(outcome is `i') = g(i) = 1/4, for any i = 1, . . . , 4. Note our
target distribution is f (1) = 2/5, f (2) = f (3) = f (4) = 1/5. Since we require f (i) ≤ M g(i), we
need M ≥ 4 · 2/5 = 8/5. Let's just choose the smallest value M = 8/5. The rejection method
becomes:
(a) Generate Y ∼ U ({1, 2, 3, 4}).
(b) Generate U ∼ U (0, 1), independent of Y.
(c) If

    U ≤ f(Y)/(M g(Y)) = (2/5)/(2/5) = 1     for Y = 1,
    U ≤ f(Y)/(M g(Y)) = (1/5)/(2/5) = 1/2   for Y = 2, 3, 4,
then return X = Y (`accept Y '); otherwise (`rejection') go back to (a).
In other words, in step (c) we accept always Y = 1; in the other three cases we accept Y only
with probability 1/2. Figure 3.7 shows a histogram obtained by this method.

Figure 3.7: Rejection sampling from the discrete distribution of Example 3.6.

3.4 Sampling From a Normal Distribution

Suppose now we want to sample from the normal distribution. We mentioned already that cdf inversion cannot (at least directly) be applied to sample from the normal distribution, because there is no closed-form formula for the cdf of the normal distribution. So, what else can we do? Well, what about rejection sampling? The answer is yes, this can work. For instance, one can choose as trial distribution
the exponential distribution (and use the fact that the normal distribution is symmetric) and then apply
rejection sampling (for sampling from the exponential distribution, see Example 3.4). One might also
want to try to use as trial distribution the uniform distribution. There, however, we run into trouble,
because that distribution does not exist (the interval from which we want to sample is unbounded). A
way out of this is to approximate the normal distribution by a truncated normal distribution (hence,
bounding the interval) and then use rejection sampling. Fig. 3.8 shows an example. The downside of
this is, of course, that these random numbers are just approximations of the normal distribution.

Figure 3.8: Sampling 200 samples from a truncated normal distribution (red, dashed) via rejection
sampling using as trial distribution a uniform distribution (black, dashed). Normal distribution shown
as red solid curve. (a) plot of (Y, U · M g(Y )), (b) histogram.

Now, let us look at more ecient methods.


We give two methods for generating random variables X ∼ N (0, 1). Random variables Y ∼ N (µ, σ 2 )
for the more general case can be obtained subsequently by setting

Y = σX + µ;

see Thm. 1.1.

The first method is an `approximative method' in the sense that the samples follow an approximate
normal distribution as a limit process is involved.

3.4.1 Central Limit Algorithm


The method is as follows:

Sample U_1, . . . , U_n ∼ U(0, 1) and return

    X = (Σ_{i=1}^n U_i − n/2) / sqrt(n/12).

Recall that the mean and the variance of a U (0, 1) random variable are 1/2 and 1/12. Hence, by the
central limit theorem, we see that X is approximately normal with parameters µ = 0 and σ = 1.
In practice, a frequent choice for n is n = 12 since this avoids the computation of a square root and
divisions.
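A minimal Matlab sketch of the central limit algorithm with this choice n = 12 (so that the denominator equals 1; the sample size is illustrative):

nSamples = 10^4; n = 12;
U = rand(n, nSamples);             % n independent uniforms per sample
X = sum(U, 1) - n/2;               % for n = 12 we have sqrt(n/12) = 1, so no division is needed
[f, x] = hist(X, 40); dx = diff(x(1:2)); bar(x, f / (nSamples * dx));   % compare with the N(0,1) pdf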

4 Markov Chain Monte Carlo (MCMC) Methods
Markov Chain Monte Carlo (MCMC) methods are particularly useful when it is practically not feasible
to directly sample from the desired distribution. This is often the case, for instance, for combinatorially defined sets where the sample space is so large that the so-called partition function (defined below) cannot be computed. Also, MCMC methods find applications in cases where it is rather straightforward to compare probabilities of pairs of events.
The general idea behind MCMC is to construct a Markov chain whose stationary distribution is the
probability distribution we want to simulate. We then generate samples of the distribution by running
the Markov chain. In Markov chains18 one allows only probabilistic dependence on the past through
the previous state. This produces already a great diversity of behaviors.
Historically, the idea for MCMC sampling goes back to 1953, when N. Metropolis19 et al. published
the paper `Equations of state calculations by fast computing machines'20 . The authors were trying to
solve problems in physics that arise due to the random kinetic motion of atoms and molecules. We will
discuss this later in more detail. The algorithm introduced in this paper, nowadays called Metropolis
algorithm, has been cited as among the top ten algorithms21 having the greatest influence on the
development of science and engineering. MCMC, whether via Metropolis or modern variations, is now
also very important in statistics and machine learning.
Let's start by introducing the background from Markov chain theory that we need for our pur-
poses.

4.1 Markov Chains

We deal exclusively in this chapter with discrete-time homogeneous Markov chains on a nite state
space Ω = {ω_1, . . . , ω_M}. We will therefore define Markov chains only for this case. Many of the definitions extend to countable state spaces with only minor complications. A more comprehensive treatment (and proofs) can be found, e.g., in [2, 11].

Denition 4.1
A sequence {Xt }t∈N0 of random variables is a Markov chain (MC), with state space Ω, if

Pr(Xt+1 = y|Xt = xt , Xt−1 = xt−1 , . . . , X0 = x0 ) = Pr(X1 = y|X0 = xt ) (4.1)

for all t ∈ N0 and all y, xt , xt−1 , . . . , x0 ∈ Ω.

Equation (4.1) encapsulates the Markovian property whereby the history of the MC prior to time t
is forgotten.
We may write
P (x, y) := Pr(Xt+1 = y|Xt = x),
where P is the transition matrix of the MC. The transition matrix P describes single-step transition
probabilities; the t-step transition probabilities P^t are given inductively by

    P^t(x, y) = I(x, y) for t = 0,   and   P^t(x, y) = Σ_{y′∈Ω} P^{t−1}(x, y′) P(y′, y) for t > 0,

where I denotes the identity matrix, I(x, y) := δ_{xy}. Thus,

    P^t(x, y) = Pr(X_t = y | X_0 = x).


18 They are named after the Russian mathematician Andrey Andreyevich Markov (June 14, 1856 – July 20, 1922), who began to study such processes in 1907.
19 Nicholas Constantine Metropolis (June 11, 1915 – October 17, 1999) was a Greek-American physicist.
20 J. of Chem. Phys., 21:1087–1092, 1953. Available at https://bayes.wustl.edu/Manual/EquationOfState.pdf.
21 See https://archive.siam.org/pdf/news/637.pdf.
Also note that the row sums of P satisfy Σ_{y∈Ω} P(x, y) = Σ_{y∈Ω} Pr(X_1 = y | X_0 = x) = 1, because given that we are in state x, the next state must be one of the possible states from Ω.
MCs can be represented by directed graphs. In this representation, the states are represented by the
vertices of the graph. A directed edge extends from one vertex, x say, to another y, if a transition
from x to y is possible in one iteration. In this case the weight P (x, y) is associated to that edge.
The set N (x) of all states that can be reached from x in one iteration is the so-called neighborhood
of x.

Example 4.1
The following is a graph representation of a four-state Markov chain with transition matrix P.

The neighborhood of, for instance, ω1 is


N (ω1 ) = {ω1 , ω2 , ω4 }.

An MC is irreducible if, for all x, y ∈ Ω there exists a t ∈ N0 such that P t (x, y) > 0. In other words,
irreducibility is the property that regardless of the present state we can reach any other state in finite
time. (Equivalently, in the graph representation any vertex can be reached by a directed path from
any other vertex.)

Figure 4.1: Examples of three Markov chains. Only (c) is irreducible, (a) and (b) are not.

Each state x in a Markov chain has a period gcd{t : P^t(x, x) > 0}. An MC is aperiodic if all states have period 1. Otherwise, it is called periodic. To show aperiodicity, we remark that in the case of an irreducible MC, it suffices to verify that the period of just one state x ∈ Ω is 1.

Figure 4.2: Examples of two Markov chains. (a) is periodic with period 2; (b) is aperiodic.

A stationary distribution of an MC with transition matrix P is a pdf π : Ω → [0, 1] satisfying

    π(y) = Σ_{x∈Ω} π(x)P(x, y),   or, equivalently,   π = πP,    (4.2)

where in the last equation we write π = (πω1 , . . . , πωM ) as row vector. Equation (4.2) is known as
global balance equation.
Clearly, if X0 is distributed as the stationary distribution π then so is X1 (and hence so is Xt for
all t ∈ N0 ).
It can be shown that a finite MC always has at least one stationary distribution. It is often possible to determine the stationary distributions by solving the global balance equation, keeping in mind that π = (π_{ω_1}, . . . , π_{ω_M}) needs to be a non-negative vector additionally satisfying the normalizing condition

    Σ_{i=1}^M π_{ω_i} = 1.    (4.3)

The stationary distribution, however, might not be unique (see Example 4.4). Uniqueness is guaranteed
if the Markov chain is irreducible and aperiodic.

Theorem 4.1
An irreducible aperiodic MC has a unique stationary distribution π; moreover the MC tends
to π in the sense that P t (x, y) → π(y), as t → ∞, for all x ∈ Ω.

Informally speaking, an irreducible aperiodic MC eventually 'forgets' its starting state.

Example 4.2
Consider the task of nding a stationary distribution of the Markov chain with transition ma-
trix  
0 1 0
P =  0 2/3 1/3  ,
p 1−p 0
for some given p ∈ (0, 1).
Let π = (π_1, π_2, π_3). The equation πP = π is equivalent to

    (pπ_3, π_1 + (2/3)π_2 + (1 − p)π_3, (1/3)π_2) = (π_1, π_2, π_3).

Solving this we find π_1* = pπ_3* and π_2* = 3π_3*, where π_3* can be chosen arbitrarily. However, since π needs to be a pdf, we need to have π_1 + π_2 + π_3 = 1 (and π_1, π_2, π_3 ≥ 0). This gives the (unique) solution

    (π_1*, π_2*, π_3*) = (1/(p + 4)) (p, 3, 1).
(It is easily checked that we made no computational mistakes; just verify (π1∗ , π2∗ , π3∗ )P =
(π1∗ , π2∗ , π3∗ ).)
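The computation can also be double-checked numerically; a small Matlab sketch (with the arbitrary choice p = 1/2) verifies the global balance equation and illustrates the convergence asserted in Theorem 4.1:

p = 1/2;
P = [0 1 0; 0 2/3 1/3; p 1-p 0];
piStat = [p 3 1] / (p + 4);
disp(norm(piStat * P - piStat));   % (numerically) zero, so piStat*P = piStat
disp(P^100);                       % every row is approximately piStat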

Example 4.3
Let us consider the question whether Theorem 4.1 also holds if the MC is periodic.
Consider the Markov chain in Fig. 4.2(a) with transition probabilities p1,2 = p3,2 = 1 and
p2,1 = p2,3 = 1/2. (Notation as in Example 4.1.)
We have

    P^{2t+1} = [ 0 1 0 ; 1/2 0 1/2 ; 0 1 0 ],    P^{2t+2} = [ 1/2 0 1/2 ; 0 1 0 ; 1/2 0 1/2 ]

(rows separated by semicolons) for all t ≥ 0. Hence, we do not have convergence P^t(x, y) → π(y) as t → ∞ for all x ∈ Ω in the
sense of Theorem 4.1. The stationary distribution is, by the way, π = (1/4, 1/2, 1/4).

Example 4.4
Consider the following MC.

It is easily veried that both π = (1/2, 1/2, 0, 0) and π = (0, 0, 1/2, 1/2) are stationary distribu-
tions of this Markov chain. Does this contradict Theorem 4.1?
The answer is No. The MC is clearly not irreducible (there is for instance no path from ω1 to
ω3 in the graph representation of the MC).

In many situations one has an idea about what the stationary distribution might be. Instead of verifying
the global balance condition it can be easier to verify so-called detailed balance, which is (4.4) in
the following theorem. (An MC for which detailed balance holds is said to be time reversible. The
Markov chains constructed by the sampling methods discussed later are all time reversible.)

Theorem 4.2
Suppose P is the transition matrix of an MC. If the function π ′ : Ω → [0, 1] satises

π ′ (x)P (x, y) = π ′ (y)P (y, x), for all x, y ∈ Ω, (4.4)

and

    Σ_{x∈Ω} π′(x) = 1,

then π ′ is a stationary distribution of the MC. If the MC is irreducible and aperiodic then π ′ is
the unique stationary distribution.

Proof. We verify that π′ is a stationary distribution (see (4.2)). For every y ∈ Ω,

    Σ_{x∈Ω} π′(x)P(x, y) = Σ_{x∈Ω} π′(y)P(y, x) = π′(y) Σ_{x∈Ω} P(y, x) = π′(y).

The last equality follows from the previously mentioned fact that the row sums of P sum up to 1.
Let us consider two sampling problems for types of objects that are rather central in discrete math-
ematics. The rst is about sampling independent sets in a graph, the second is about sampling of
matchings in a graph. Independent sets and matchings appear in many contexts. For instance, inde-
pendent sets can model conflicts between objects. The vertices might correspond to workers and an
edge might indicate that the corresponding pair of workers cannot work in the same shift (because
they don't like each other or they need to share the same equipment, etc.). Matchings are often used
to pair objects, for instance, assigning students to courses, jobs to machines, etc.
We remark that there are many computational questions arising in this context. Some of these questions are settled, some remain subjects of ongoing research. (It is known, for instance, that both the problem of counting the number of independent sets in a graph and that of counting the number of matchings in a graph are intractable22. Still it is possible to sample matchings in graphs efficiently, that means in polynomial time in n, while such efficient sampling of independent sets is intractable. Note
22 Unless the famous conjecture P ̸= NP would surprisingly fail to hold; see also http://www.claymath.org/millennium-problems.
that the sampling procedures that we will discuss have, a priori, an exponential running time. More
on this fascinating topic can be found in [7].)

Figure 4.3: A graph with vertices v1 , . . . , v4 and edges {v1 , v2 }, {v1 , v3 }, {v2 , v3 }, {v3 , v4 }. (a) Inde-
pendent set I = {v1 , v4 } (gray nodes), (b) Matching M = {{v1 , v2 }, {v3 , v4 }} (red edges). I is also a
maximum independent set; M is also a maximum matching in G.

Example 4.5
Let G = (V, E) be a graph with vertices in V = {v1 , . . . , vn } and edges in E = {e1 , . . . , em }.
A set I ⊆ V is an independent set if no two vertices of I are joined by an edge in E (see
Fig. 4.3(a)).
The following MC randomly selects independent sets in IG := {all independent sets in G} :

Let I ∈ IG be given (e.g., I = ∅).

(a) Select v ∈ V uniformly at random;


(b) Flip a fair coin23 ;
(c) IF `tails' and no neighbor of v is in I THEN add v to I (i.e., set I ← I ∪ {v}),
OTHERWISE, remove v from I (i.e., set I ← I \ {v}).
(d) GOTO (a).

It can be shown that the Markov chain is irreducible, aperiodic and has the uniform distribution
on IG as stationary distribution. Hence, we can use this Markov chain to (uniformly) generate
random independent sets in G.
23 To flip a fair coin means to draw a random number taking on only two values, say 0 (for heads) and 1 (for tails), both of which should occur with probability 1/2.
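A minimal Matlab sketch of this chain, written in the same style as the matching sampler of Example 4.6 below (the adjacency-matrix input and the number of steps are illustrative choices):

function I = randomindepsetsample(A)
% A is the n-by-n adjacency matrix of the graph G
n = size(A, 1);
nTrials = 10^3;                 % number of steps of the Markov chain
I = zeros(1, n);                % indicator vector of the current independent set
for t = 1:nTrials
    v = randi(n, 1);            % (a) select a vertex uniformly at random
    coin = randi(2, 1);         % (b) flip a fair coin
    if (coin == 1) && (A(v, :) * I' == 0)   % (c) 'tails' and no neighbor of v is in I
        I(v) = 1;               %     add v to I
    else
        I(v) = 0;               %     otherwise remove v from I
    end
end
end

For the graph of Fig. 4.3 one would call, e.g., I = randomindepsetsample([0 1 1 0; 1 0 1 0; 1 1 0 1; 0 0 1 0]);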

Example 4.6
Let G = (V, E) be a graph with vertices in V = {v1 , . . . , vn } and edges in E = {e1 , . . . , em }. A
set M ⊆ E is a matching if the edges of M are pairwise vertex disjoint (see Fig. 4.3(b)).
The following Markov chain randomly selects matchings in MG := {all matchings in G} :

Let M ∈ MG be given (e.g., M = ∅).

(a) Select e ∈ E uniformly at random;


(b) Flip a fair coin;
(c) IF `heads' THEN do not change M (i.e., set M ← M ),
OTHERWISE do the following: Set

    M′ = M ∪ {e} if e ∉ M,   and   M′ = M \ {e} otherwise,

and set M ← M′ if M′ is a matching; otherwise set M ← M.


(d) GOTO (a).

We claim that the Markov chain is irreducible, aperiodic and has the uniform distribution on
MG as stationary distribution. Hence, we can use this Markov chain to (uniformly) generate
random matchings in G.
The state space of the Markov chain is Ω = M_G. This is a finite set containing at most 2^m elements. Clearly the chain is homogeneous (there is only dependence on the previous state).

1. Aperiodicity: The Markov chain might remain in the same state (see step (a) and step
(c)), hence all states have period 1, and the chain is therefore aperiodic.
2. Irreducibility: The Markov chain is irreducible, since it is possible to reach the empty
matching from any state by removing edges (and reach any state from the empty matching
by adding edges).
3. Uniform distribution: We verify the detailed balance condition for π = (1/|M_G|) (1, . . . , 1), where |M_G| denotes the size/cardinality of the set M_G. Note that as |M_G| can be rather large it is often not feasible to compute this number explicitly.
For M_1, M_2 ∈ M_G with M_2 = M_1 ∪ {e} for some e ∈ E \ M_1 we have

    π(M_1) Pr(X_1 = M_2 | X_0 = M_1) = (1/|M_G|) · (1/m) · (1/2) = π(M_2) Pr(X_1 = M_1 | X_0 = M_2).
|MG | m 2

Therefore, by Theorem 4.2, the uniform distribution π is a stationary distribution.

A possible matlab implementation is given below.

function [M] = randommatchingsample(n, edges)
% n is the number of vertices of the graph
% ith row of edges contains the numbers of the two vertices of that edge

nTrials = 10^3;           % number of iterations for running the Markov chain
m = size(edges, 1);       % number of edges
M = zeros(1, m);          % ith component of M is 1 if and only if the ith edge is currently selected
saturated = zeros(1, n);  % ith component counts how many currently selected edges touch v_i

for sim = 1:nTrials       % implementation of the algorithm from Example 4.6 in the lecture notes
    newe = randi(m, 1);
    coin = randi(2, 1);
    if (coin == 1)
        if (M(newe) == 0)
            if (sum(saturated(edges(newe, :))) == 0)  % both endpoints of the edge candidate not endpoints of an already selected edge?
                M(newe) = 1; saturated(edges(newe, :)) = [1, 1];
            end;
        else
            M(newe) = 0; saturated(edges(newe, :)) = [0, 0];
        end;
    end;
end

Example 4.7
Let us continue to sample matchings in a graph as in the previous example. In fact, we now
want to focus on the problem of counting the matchings. We consider the following graph.

We can check all combinations of possible matchings systematically (a method, called enumer-
ation ). Possible matlab code for this is as follows.
edges = [1,7;1,8;2,8;2,3;3,8;3,10;4,10;5,9;5,10;5,12;6,11;
         6,12;7,8;3,9;4,5;2,9;5,11;11,12;1,2;12,13;6,13];
m = size(edges, 1); n = max(max(edges));
matchingsofsize = zeros(1, m+1);
nTrials = 10^4;

% Check all 2^m possibilities for M, and count which of them are matchings.
% This determines the correct total number of matchings in G.
% This can only work for small m, since it takes exponential time (in m).
for i = 0:2^m-1
    curE = dec2bin(i, m) - '0';    % converts i into a binary vector with m digits
    saturated = zeros(1, n);
    for j = 1:m
        if (curE(j) == 1)
            saturated(edges(j,1)) = saturated(edges(j,1)) + 1;
            saturated(edges(j,2)) = saturated(edges(j,2)) + 1;
        end;
    end;
    if (max(saturated) < 2)
        matchingsofsize(sum(curE)+1) = matchingsofsize(sum(curE)+1) + 1;
    end;
end;

matchingsofsize

We obtain the following numbers si of matchings of size i = 0, . . . , 21.

i 0 1 2 3 4 5 6 7-21
si 1 21 158 521 749 403 61 0

Enumeration, however, quickly becomes intractable for increasing n. Let's try a different approach.
Suppose we sample matchings in G (using the method from the previous example). If we do this
for sufficiently many trials we should get a good picture of the distribution of the matchings.
Note that this is still an exponential method! But depending on the situation (whether or not
the Markov chain is so-called rapidly mixing) it may provide good approximations in reasonable
time.
Fig. 4.4 shows the outcome of a random sampling of nTrials = 104 matchings (drawn after
evolving the Markov chain for 103 steps); depicted are the relative frequencies ni /nTrials of
the events that M is a matching of size i, with i being depicted on the x-axis.
Now, observe that we can obtain estimates ŝi of the number of matchings of size i from this
data. We know that s0 = 1 (there is one matching of size 0, the empty set). Therefore ŝ0 = 1.
As n1 /n0 ≈ s1 /s0 we can set ŝ1 := ŝ0 n1 /n0 and, recursively, ŝi := ŝi−1 ni /ni−1 . The estimates
that we obtain for the present data are depicted in Table 4.1.

Figure 4.4: Histogram of the relative frequencies of matchings of dierent sizes based on nTrials = 104
samples drawn after evolving the Markov chain from Example 4.7 for 103 steps. Values si /(s1 + s2 +
· · · + s21 ) depicted as red points.

i 0 1 2 3 4 5 6 7-21
si 1 21 158 521 749 403 61 0
ni 5 102 832 2747 3991 2029 294 0
ŝi = ŝi−1 ni /ni−1 1.0 20.4 166.4 549.4 798.2 405.8 58.8 0

Table 4.1: Estimating si , the number of matchings of size i (Example 4.7).

4.2 Metropolis-Hastings

We turn now to a slightly different question. How can we modify the transition probabilities in a given
Markov chain to ensure that the stationary distribution is equal to a prescribed distribution π ?
The method that we introduce here for solving this question is the so-called Metropolis-Hastings
algorithm24 , named after N. C. Metropolis and W. K. Hastings25 . It proceeds as follows.
• Start with some point x0, whether deterministic or randomly sampled. For t ≥ 0, given that
Xt = x, generate a random proposal Y from a distribution Pr(Y = y|X = x) = P (x, y) (note
that this P is the transition matrix of our original Markov chain). For technical reasons (ensuring
that the generated Markov Chain is irreducible), we will from now on to the end of this chapter
require the property P (x, y) > 0 ⇒ P (y, x) > 0, for all x, y.
• New to what we did previously, accept or reject the proposal y. With probability A(x, y) accept
it and set Xt+1 = Y. With probability 1 − A(x, y) reject the proposal, i.e., set Xt+1 = Xt .
This may now be considered to be a new Markov chain. We denote the new transition probabilities
by P ′ (x, y), x, y ∈ Ω. Note that

P ′ (x, y) = Pr(Y = y|X = x) = P (x, y)A(x, y). (4.5)

Given a desired stationary distribution π and a proposal mechanism via the matrix P, we want to design
the acceptance probabilities A(x, y) such that π is a stationary distribution of the new Markov chain.
In view of Theorem 4.2 it suces to construct the acceptance probabilities A(x, y) in such a way that
detailed balance is satised.

The Metropolis-Hastings acceptance probability is given by

    A(x, y) := min{1, [π(y)P(y, x)] / [π(x)P(x, y)]}  if π(x)P(x, y) > 0,  and  A(x, y) := 1 otherwise,   x, y ∈ Ω.    (4.6)

We remark that the Metropolis-Hastings (MH) algorithm simulates samples from a probability distri-
bution by making use of the full joint density function and (independent) proposal distributions for
each of the variables of interest. In summary, the MH algorithm is as follows.

Metropolis-Hastings algorithm
Input: Probability distribution function π (may be unnormalized), transition matrix P.
Output: Random sample x from the Markov chain.
(a) Initialize x (deterministic or randomly).
(b) Draw proposal y from Pr(Y |X = x) = P (x, ·).
(c) Compute acceptance probability A(x, y) according to (4.6).
(d) Draw u ∼ U (0, 1).
(e) IF u ≤ A(x, y) THEN `accept proposal' (i.e., set x ← y)
OTHERWISE `reject' (i.e., set x ← x).
(f) GOTO (b).

Note that the condition u ≤ A(x, y) realizes the desired probability A(x, y) of accepting the proposal y.
(You can see this by considering cdf inversion for a discrete random variable taking on only two values,
'accept' and 'not accept,' where 'accept' should occur with probability A(x, y).) Further notice that
the ratio inside the minimum in (4.6) has a factor π(y)/π(x). Other things being equal, this implies
24 A very readable account of the history of the Metropolis-Hastings algorithm can be found at https://www.jstor.org/stable/30037292.
25 Wilfred Keith Hastings (July 21, 1930 – May 13, 2016).
that moves to higher probability states y are favored. The second factor, P (y, x)/P (x, y), implies, if
other things are equal, that we hesitate to move to y if it would be hard to get back to x.
What should be also noted is that the factor Z for an unnormalized distribution πu with π = πu /Z
cancels out in (4.6). Hence the same acceptance rule can be applied if the partition function Z is
unknown.
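A minimal Matlab sketch of the MH algorithm on a finite state space, with the proposal taken to be uniform over Ω and the (unnormalized) target taken, for illustration, to be the distribution from Example 3.6 (all names and sizes are illustrative):

piu = [2 1 1 1];                   % unnormalized target weights; the normalizing constant is not needed
M = numel(piu);
P = ones(M, M) / M;                % proposal transition matrix: uniform over all states
nSteps = 10^4; x = 1; samples = zeros(1, nSteps);
for t = 1:nSteps
    y = randi(M, 1);               % (b) draw proposal y from P(x,.) (uniform here)
    A = min(1, (piu(y) * P(y, x)) / (piu(x) * P(x, y)));   % (c) acceptance probability (4.6)
    if rand <= A                   % (d)-(e) accept or reject
        x = y;
    end
    samples(t) = x;
end
hist(samples, 1:M);                % relative frequencies approach piu/sum(piu) = (0.4, 0.2, 0.2, 0.2)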

Theorem 4.3
The Markov chain generated by the Metropolis-Hastings algorithm has π as stationary distri-
bution.

Proof. We want to prove detailed balance. We distinguish the following three cases for x, y ∈ Ω.
(i) Suppose, π(x)P (x, y) = π(y)P (y, x). Then, A(x, y) = A(y, x) = 1, implying

π(x)P (x, y)A(x, y) = π(y)P (y, x)A(y, x),

hence π(x)P ′ (x, y) = π(y)P ′ (y, x), showing detailed balance for this case.
(ii) Suppose, π(x)P(x, y) > π(y)P(y, x). In this case

    A(x, y) = π(y)P(y, x) / [π(x)P(x, y)]   and   A(y, x) = 1.

Hence,

    π(x)P′(x, y) = π(x)P(x, y)A(x, y) = π(x)P(x, y) · [π(y)P(y, x) / (π(x)P(x, y))] = π(y)P(y, x)A(y, x) = π(y)P′(y, x).

(iii) Suppose, π(x)P(x, y) < π(y)P(y, x). In this case

    A(x, y) = 1   and   A(y, x) = π(x)P(x, y) / [π(y)P(y, x)].

Hence,

    π(y)P′(y, x) = π(y)P(y, x)A(y, x) = π(y)P(y, x) · [π(x)P(x, y) / (π(y)P(y, x))] = π(x)P(x, y)A(x, y) = π(x)P′(x, y).

Since the algorithm allows for rejection, the Markov chain of the MH algorithm is clearly aperiodic.
What is left to check to ensure convergence to the required target distribution is, in each specific case, only irreducibility.

Example 4.8
Let's consider Example 4.5 again.
One can show that the MC in that example has the uniform distribution as stationary distribu-
tion. Suppose now that we want a stationary distribution where each independent set I has a
probability proportional to λ|I| (When λ = 1 this is the uniform distribution; when λ > 1 larger
independent sets have a larger probability than smaller independent sets; and when λ < 1 then
smaller independent sets have a larger probability than larger independent sets.)

Let's change the algorithm from Example 4.5 such that we obtain the Metropolis-Hastings variant that samples from (1/Z) λ^{|I|}. Here is the modified algorithm.

Let I ∈ IG be given (e.g., I = ∅).

(a) Select v ∈ V uniformly at random;


(b) Flip a fair coin;
(c) IF `tails' and no neighbor of v is in I THEN propose I ′ ← I ∪ {v},
OTHERWISE, propose I ′ ← I \ {v}.
(d) Draw u from U(0, 1).
(e) IF u ≤ A(I, I′) = min{1, λ^{|I′|}/λ^{|I|}} THEN accept the proposal (i.e., set I ← I′),
    OTHERWISE reject the proposal (i.e., keep I).


(f) GOTO (a).

Fig. 4.5 shows some output for the choices of λ = 4 and λ = 0.1.
Let's consider the output for λ = 4. We actually have in this particular case ñ0 = 202 sampled
sets of size 0, ñ1 = 3384 sets of size 1, and ñ2 = 6414 sets of size 2 (and no sets of larger
size).
Can we estimate from this the actual number ni of independent sets of size i, i = 1, . . . , 4
in G?
Well, we can first of all observe that we know how many independent sets there are of size 0 and size 1. There is one independent set of size 0 and there are n = 4 independent sets of size 1. In theory, we should have

    ñ_1/ñ_0 = [(λ^1 n_1/Z) · nTrials] / [(λ^0 n_0/Z) · nTrials] = (4 · 4)/(1 · 1) = 16.

And, in fact, for these samples we have ñ1 /ñ0 = 3384/202 ≈ 16.7525. We can use this argument
to estimate any of the n_i. We should have

    ñ_2/ñ_1 = [(λ^2 n_2/Z) · nTrials] / [(λ^1 n_1/Z) · nTrials] = 4n_2/n_1.

Solving this for n_2 we obtain

    n_2 = (ñ_2/ñ_1) · (n_1/4) = (6414/3384) · (4/4) ≈ 1.8954,

which is actually not far from the true value of n_2 = 2.

4.3 Other Samplers

Many other samplers have been developed in the past. We just list a few of them, which can be
frequently found in the literature. (We omit the cases π(x)P (x, y) = 0 in the statement of the
acceptance probabilities; there are conditions that can ensure that π(x)P (x, y) > 0.)

Figure 4.5: Sampling independent sets (Example 4.8). Histogram for (a) λ = 4 and (b) λ = 0.1 (the red markers show the correct value of (1/Z) λ^{|I|}) based on nTrials = 10^4 samples drawn after evolving the Markov chain for 20 steps.

Barker's Sampler The following is an acceptance probability due to Barker26 (1965):

    A(x, y) := π(y)P(y, x) / [π(x)P(x, y) + π(y)P(y, x)]  if π(x)P(x, y) + π(y)P(y, x) ̸= 0,  and  A(x, y) := 1 otherwise.    (4.7)

It is easily verified that A(x, y) satisfies detailed balance and that A(x, y) ≤ A_MH(x, y) for any x, y ∈ Ω, with A_MH denoting the Metropolis-Hastings acceptance probability from (4.6).

Original Metropolis Sampler The original Metropolis algorithm is the special case of Metropolis-Hastings where P(x, y) = P(y, x). In this case,

    A(x, y) := min{1, π(y)/π(x)}.    (4.8)

Independence Sampler Perhaps the simplest proposal mechanism is to take independent and iden-
tically distributed proposals from some distribution P that does not even depend on the present x.
Then P (x, y) is simply P (y) for a pdf P. The Metropolis-Hastings proposal for this so-called indepen-
dence sampler simplifies to

    A(x, y) := min{1, [π(y)P(x)] / [π(x)P(y)]}.    (4.9)

Gibbs Sampler The Gibbs sampler27 originates from the eld of image processing. It is a special
case of Metropolis-Hastings sampling where the proposal is always accepted (i.e., A(x, y) = 1 for all
x, y ∈ Ω.) It is designed to work for distributions where each state x is in fact a vector x = (ξ1 , . . . , ξd ).
In image processing, for instance, ξi might represent the color of the ith pixel.
The main idea of Gibbs sampling is that one only uses univariate conditional distributions (the distri-
bution where all of the random variables but one are assigned fixed values). It is often far easier to
sample from such conditional distributions than from the joint distribution.
The Gibbs sampler (more precisely, the random scan Gibbs sampler) proceeds as follows. Given a
(current) state x then a (uniformly) random coordinate k is selected, and the new state is
y = (ξ1 , . . . , ξk−1 , ξ, ξk+1 , . . . , ξd ),
26 A. A. Barker, a mathematical physicist working at that time at the University of Adelaide.
27 Named after the physicist Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), in reference to an analogy between the sampling algorithm and statistical physics.
with the k th coordinate of y being ξ determined from the pdf for ξ, given the other coordinates.

4.4 Simulated Annealing

Simulated annealing28 is an algorithmic approach for finding the maximum of a function that has
typically many local maxima. The idea is that when we initially start sampling the space Ω, we will
accept a reasonable probability of a down-hill move in order to explore the entire space. As the process
proceeds, we decrease the probability of such down-hill moves. The analogy (and hence the term) is
the annealing of a crystal as temperature decreases (initially there is a lot of movement, which then
over time gradually decreases).
Simulated annealing is very closely related to Metropolis sampling, differing only in that the probability A(x, y) of a move is given by

    A(x, y) = min{1, (π(y)/π(x))^{T(t)}},

where the function T (t) is called the annealing schedule (for T = 1 we have the Metropolis algo-
rithm). The particular value 1/T (t) for any given t ∈ N0 is typically referred to as temperature.
A typical choice for T over T_max time steps is

    T(t) := T_0 (T_f/T_0)^{t/T_max},

where 1/T_0 is a given `starting temperature' cooling down to a `final temperature' 1/T_f (i.e., we typically choose T_0 ≤ T_f).

4.5 An Example of a Gibbs Sampler

Example 4.9
Consider a 2D image consisting of an n × n grid of black and white pixels. Let Xj , j = 1, . . . , n2 ,
denote the indicator of the j th pixel being white (i.e., Xj = 1 if the j th pixel is white, and
Xj = 0 otherwise). Viewing the pixels as vertices in a graph, the set N (i) of neighbors of
the ith pixel are the pixels immediately above, below, to the left, and to the right (except for
boundary cases).
A commonly used model29 for the pdf π of X = (X_1, . . . , X_{n^2}) is

    π(x) = (1/Z) exp( (β/2) Σ_{i=1}^{n^2} Σ_{j∈N(i)} δ_{ξ_i,ξ_j} ),

where x = (ξ_1, . . . , ξ_{n^2}), β > 0, and

    δ_{ξ_i,ξ_j} = 1 if ξ_i = ξ_j,   and   δ_{ξ_i,ξ_j} = 0 otherwise.

With this π, neighboring pixels prefer to have the same color. The normalizing constant Z of π is a sum over all 2^{n^2} possible configurations, so it may be very difficult to compute. This motivates the use of MCMC to simulate from the model.
Let us now develop a Gibbs sampler for sampling from π.

28 Simulated annealing was developed in 1983 by Kirkpatrick et al.; see S. Kirkpatrick, C. Gelatt, M. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983) 671–680, https://www.science.org/doi/10.1126/science.220.4598.671.
Denote the vector of random variables by X = (X1 , . . . , Xn2 ). A sample is denoted by x =
(ξ1 , . . . , ξn2 ). Let X−k = (X1 , . . . , Xk−1 , Xk+1 , . . . , Xn2 ) and x−k = (ξ1 , . . . , ξk−1 , ξk+1 , . . . , ξn2 ).
Note that π can be viewed as the joint pdf of the variables X_k and X_{−k}, hence

π(ξ1 , . . . , ξk−1 , ξk , ξk+1 , . . . , ξn2 ) = πXk ,X−k (ξk , x−k ) = πXk |X−k =x−k (ξk )πX−k (x−k ),

for ξ_k ∈ {0, 1}. Assuming π_{X_{−k}}(x_{−k}) ̸= 0 we therefore have

    π_{X_k | X_{−k} = x_{−k}}(ξ_k) = α · π(ξ_1, . . . , ξ_{k−1}, ξ_k, ξ_{k+1}, . . . , ξ_{n^2}) = (α/Z) exp( (β/2) Σ_{i=1}^{n^2} Σ_{j∈N(i)} δ_{ξ_i,ξ_j} )

with α := 1/π_{X_{−k}}(x_{−k}) > 0. Now notice that only a few terms in the sum Σ_{i=1}^{n^2} Σ_{j∈N(i)} δ_{ξ_i,ξ_j} depend on ξ_k. In fact, the only terms depending on ξ_k are Σ_{j∈N(k)} δ_{ξ_k,ξ_j} + Σ_{i∈N(k)} δ_{ξ_i,ξ_k}, hence we can write

    π_{X_k | X_{−k} = x_{−k}}(ξ_k) = (α/Z) e^{βR/2} · e^{β Σ_{j∈N(k)} δ_{ξ_k,ξ_j}},

where R ≥ 0 is a constant.
We do not need to worry too much about the factor (α/Z) e^{βR/2}. In fact, we will see that we can compute it!
But before we do this, note that

    e^{β Σ_{j∈N(k)} δ_{ξ_k,ξ_j}} = e^{β n_{ξ_k}},

with n_{ξ_k} denoting the number of neighbors of the k-th pixel having color ξ_k.
Now let us consider the constant C := (α/Z) e^{βR/2}. We first note that π_{X_k | X_{−k} = x_{−k}}(ξ_k) is a pdf and that ξ_k can only have the two possible values ξ_k = 0 and ξ_k = 1. Hence

    π_{X_k | X_{−k} = x_{−k}}(0) + π_{X_k | X_{−k} = x_{−k}}(1) = C e^{β n_0} + C e^{β n_1} = 1,

consequently

    C = 1 / (e^{β n_0} + e^{β n_1}).

Putting things together we have

    π_{X_k | X_{−k} = x_{−k}}(ξ_k) = C e^{β n_{ξ_k}} = e^{β n_{ξ_k}} / (e^{β n_0} + e^{β n_1}),    ξ_k ∈ {0, 1}.

In summary, we obtain the following Gibbs sampler. Typical results are shown in Fig. 4.6. (It
is interesting to note that the Gibbs sampler is here, in comparison to the Metropolis-Hastings
sampler, the method with the faster running time; see Fig. 4.7).

(i) Initialize x = (ξ1 , . . . , ξn2 ) (deterministically or randomly).


(ii) Draw k and u from U (1, . . . , n2 ) and U (0, 1), respectively.
(iii) For j ∈ {0, 1} compute nj , the number of neighbors of the k th pixel having color j.
(iv) IF u ≤ e^{β n_1} / (e^{β n_0} + e^{β n_1}) THEN set ξ_k = 1, OTHERWISE set ξ_k = 0.
(v) GOTO (ii).
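A minimal Matlab sketch of this Gibbs sampler on the n × n torus (so every pixel has exactly four neighbors); the grid size, β and the number of iterations are illustrative and smaller than in Fig. 4.6:

n = 100; beta = 1; nIter = 10^6;
X = randi(2, n, n) - 1;                        % (i) random initialization with colors in {0,1}
for t = 1:nIter
    i = randi(n, 1); j = randi(n, 1);          % (ii) pick a pixel uniformly at random
    nbrs = [X(mod(i-2,n)+1, j), X(mod(i,n)+1, j), ...
            X(i, mod(j-2,n)+1), X(i, mod(j,n)+1)];   % the four torus neighbors
    n1 = sum(nbrs); n0 = 4 - n1;               % (iii) number of neighbors of color 1 and 0
    if rand <= exp(beta*n1) / (exp(beta*n0) + exp(beta*n1))   % (iv)
        X(i, j) = 1;
    else
        X(i, j) = 0;
    end
end
imagesc(X); colormap(gray);                    % display the sampled configuration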

29 This model is called the Ising model, named after the physicist Ernst Ising (May 10, 1900 – May 11, 1998). This model, easy to define but with amazingly rich behavior, serves as a mathematical model of ferromagnetism in statistical mechanics. See also, e.g., http://personal.rhul.ac.uk/uhap/027/ph4211/PH4211_files/brush67.pdf.
Figure 4.6: Typical outputs for the Gibbs sampler on the 200 × 200 torus, randomly initialized. (a)
β = 1 output after 107 iterations, (b) output after the next 107 iterations. (c) and (d) the same as for
(a) and (b) but with β = 2. (Computation times for generating any of these images were around 25s.)

Figure 4.7: Typical outputs for the Metropolis sampler on the 200×200 torus, randomly initialized. (a)
β = 1 output after 107 iterations, (b) output after the next 107 iterations. (c) and (d) the same as for
(a) and (b) but with β = 2. (Computation times for generating any of these images were around 35s.)

5 Learning Theory and Methods
Learning theory can be considered as the field devoted to studying the design and analysis of machine learning algorithms. The term machine learning seems to have been first coined by Arthur Lee Samuel30, a pioneer in the artificial intelligence field, in 1959. The following, a paraphrase of his quote `Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort,' nicely captures the essence and is therefore cited in many machine learning texts.
`Machine learning is the eld of study that gives computers the ability to learn without
being explicitly programmed.'
(Arthur L. Samuel, 1959)

5.1 The Process of Learning

Typically, one distinguishes between the following three main types of learning algorithms.
• Supervised learning: A supervised learning algorithm learns from labeled training data and, based on this, tries to predict outcomes for unforeseen data. An example for this would be predicting mortalities of COVID-19 based on training data from the past.
• Unsupervised learning: In these methods there is no label attached to the data, and the task is to identify patterns and/or model the data. An example for this would be the compression of information.
• Reinforcement learning: This method falls between the above two methods as there is some form of feedback available (known as a reward signal) for each predictive step, but there is no label. An example for this would be training an agent to play video games. The reward signal can be the player's score.
Examples of machine learning problems include:
• Classification: Classify data into one or more categories (classes). For example, identifying in computerized tomography (CT) images whether a patient has a tumor or not.
• Clustering: Group a set of data points into clusters, such that points within a cluster have some properties in common. An example is in image segmentation, where the goal is to break up the image into meaningful or perceptually similar regions.
• Prediction: Based on historical data, build models and use them to forecast future values. For example, predicting temperature rises due to global warming based on data from the past.
Examples of important applications where machine learning algorithms are deployed include:
• Optical character recognition (OCR)31,
• Text or document classification, spam detection32,
• Speech recognition33,
• Face recognition34,
• Fraud detection35,
• Language translation36,
30 (December 5, 1901 – July 29, 1990).
31 http://human.ait.kyushu-u.ac.jp/publications/PRL2008-Malon.pdf.
32 https://arxiv.org/pdf/1606.01042.pdf.
33 https://www.youtube.com/watch?v=RBgfLvAOrss.
34 https://www.youtube.com/watch?v=RBgfLvAOrss.
35 https://www.fico.com/blogs/5-keys-using-ai-and-machine-learning-fraud-detection.
36 https://www.youtube.com/watch?v=AIpXjFwVdIE.
• Games like chess and Go37,
• Autonomous driving38,
• Medical diagnosis (decisions about Caesarean sectioning39 or tumor removal40),
• Recommendation systems, search engines41,
• Representations of polycrystals in materials science42.
Before we go into further details let us consider two initial examples.

Example 5.1
The last application mentioned above is the representation of polycrystals in materials science. Fig. 5.1 is from the paper that can be found at https://livrepository.liverpool.ac.uk/3085596/. Shown in black in these two images are so-called grain boundaries. These are boundaries of small crystals that make up the whole material, here an aluminum sample. The red and blue boundaries, respectively, are the boundaries that one obtains by a specific clustering method (which we call generalized balanced power diagrams). One observes a relatively good fit, which on the one hand indicates that nature seems to try to find a similar clustering. On the other hand, for storing the clusters one needs to store only a few parameters per grain and not the whole image. Thus, one has a much sparser representation of the data.

Figure 5.1: Representations of polycrystals in materials science using clusterings.

Example 5.2
Consider the current COVID-19 outbreak. As with other pandemics, it is expected to show exponential growth. That means that the number x(t) of infected persons at time t follows a function

x(t) = x0 bt , (5.1)

where x0 is the number of cases at the beginning, and b is the number of people infected by each
infected person.
The first two UK cases appear on January 31st, 2020, which for us is t = 0. Let us, for the sake of exposition, consider the available data (source: Public Health Englanda) up until March

37 https://ai.googleblog.com/2016/01/alphago-mastering-ancient-game-of-go.html.
38 https://iopscience.iop.org/article/10.1088/1757-899X/662/4/042006/pdf.
39 https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8519731.
40 https://www.nature.com/articles/s41598-019-48738-5.
41 https://www.seroundtable.com/google-explains-machine-learning-search-28697.html.
42 https://www-m9.ma.tum.de/foswiki/pub/Allgemeines/AndreasAlpersPublications/H1.pdf.
18th, 2020 (i.e., up to t = 48):

t 0 9 10 13 24 28 29 30 31 32 33 34 35 36
x(t) 2 3 4 8 9 13 19 23 35 40 51 85 114 160

t 37 38 39 40 41 42 43 44 45 46 47 48
x(t) 206 271 321 373 456 590 797 1061 1391 1543 1950 2626

Taking logarithms in (5.1) we obtain

ln(x(t)) = ln(x0 ) + ln(b)t,

which is a linear function in t. Hence, we can apply linear regression.


Using the matlab command
[r,m,b] = regression(t,log(xt),'one')
we obtain
ln(x(t)) = −0.6848 + 0.1618t
and hence (taking exponential values)

x(t) = 0.5042 + 1.1756t .

A plot is shown in Fig. 5.2.


One might ask the question when 1 million persons are or have been infected, i.e., for which t do we have 10^6 = 0.5042 · 1.1756^t? Solving this equation we see that this would have happened for t = 86, which is April 25th, 2020.b
As an exercise you might want to repeat these calculations for the data available for China,
Germany, or any other country of your choice.
a https://www.gov.uk/government/publications/covid-19-track-coronavirus-cases.
b As we know, it came somewhat differently, as different testing policies and national lockdown measures were imposed. The number of 1 million infected UK persons was surpassed on October 31st, 2020.

5.2 Supervised Learning

In supervised learning the training data comes in pairs (x_i, y_i), where x_i is the input instance and y_i ∈ C its label. The input instances live in the feature space, which in our case we typically take to be R^d. The space C is the label space. Several examples are shown in Table 5.1.

Task Label Space Example


Binary classification C = {0, 1} or C = {+, −} A CT image contains a tumor or not
Multi-class classification C = {1, . . . , k} with k > 2 Speech and character recognition
Regression C=R Prediction of temperature rise

Table 5.1: Examples of label spaces.

We focus in the following on classification problems. In such problems the data points (x_i, y_i) are drawn from some (unknown) distribution. The goal is to learn a function h : R^d → C such that for a new pair (x, y) we have h(x) = y (or h(x) ≈ y) with high probability.
Examples of supervised learning techniques are support vector machines, naïve Bayes classifiers, decision trees, and neural networks. We will discuss the basics in the following sections. Our first section

Figure 5.2: Number x(t) of COVID-19 infected people for t = 0 (January 31, 2020) to t = 48
(March 18th, 2020) shown in blue. In red the predicted numbers obtained by linear regression.

is, however, devoted to a theoretical aspect, namely that of measuring the capacity (complexity, expressive power, richness, or flexibility) of a space of functions that can be learned by a classification algorithm.

5.2.1 Vapnik-Chervonenkis Dimension


The Vapnik-Chervonenkis (VC) dimension was originally introduced by V. Vapnik43 and A. Chervo-
nenkis44 in 1971. Roughly speaking, the VC-dimension of a function (i.e. hypothesis) class is the
maximum number of data points for which, no matter how we label them (by ±1), there is always a
hypothesis in the class which perfectly explains the labeling. Among others, the VC-dimension of a
hypothesis space therefore describes its ability to correctly represent the training data. (Another use
of the VC-dimension is, for instance, that it can be used to upper bound the estimation error of a given
classifier).
Let us introduce the VC-dimension formally45 .
Each hypothesis h : Rd → {±1} can naturally be viewed as a subset of Rd , where h = {ξ : h(ξ) = +1}.
Therefore, we can refer to h as a binary function and as a subset interchangeably. For any finite subset
S ⊆ Rd , let ΠH (S) = {h ∩ S : h ∈ H}. This is called the projection of H onto S .

Denition 5.1
The set S is shattered by the function class H if ΠH (S) contains all the subsets of S, i.e.,
|ΠH (S)| = 2^|S| .

In other words, the set S is shattered by H if, no matter how we assign ±1-labels to points in S there
is always a hypothesis in H that 'explains' the labeling perfectly.

43
Vladimir N. Vapnik (born Dec. 6, 1936).
44
Alexey Chervonenkis (Sept. 7, 1938  Sept. 22, 2014).
45
Note that the VC-dimension is dened for spaces of binary functions (functions to ±1). Generalizations for spaces
of non-binary functions have later been suggested in the literature.

55
Denition 5.2
The VC-dimension VCD(H) of a function class (or hypothesis class) H is the maximum size
of a subset of Rd shattered by H.

In other words, if VCD(H) = d then H can shatter some (i.e., at least one) set of d points but it cannot
shatter any set of d + 1 points. Let us consider two examples and the important class H consisting of
halfspaces in R2 .

Example 5.3
Consider learning (positive) rays on the real line. The hypothesis class H consists of all rays of
the form {x ∈ R : x > a}, for some a ∈ R. A ray (hypothesis) {x ∈ R : x > a} ∈ H is interpreted
as a classifier that identifies a point p ∈ R as being in class S if p > a and identifies p as not
being in class S if p ≤ a.
(a) Can a set containing a single point in R be shattered by H?

The answer is obviously yes (since we can have a ray that contains that point and we can
have a ray that does not contain this point).
(b) We claim that no sets of two points in R can be shattered by H.

Suppose, we have a set S of two points p1 < p2 ∈ R. Any ray containing p1 also contains p2 .
We can therefore never obtain the set {p1 } as projection of some ray onto S.
(c) Can we now say something about VCD(H)?

Based on (a) we have VCD(H) ≥ 1 and based on (b) we have VCD(H) ≤ 1. Hence,
VCD(H) = 1.

Example 5.4
Consider learning closed balls in R2 . The hypothesis class H consists of all balls R(p, r) =
{(x1 , x2 ) ∈ R2 : (x1 − p1 )2 + (x2 − p2 )2 ≤ r2 } with p = (p1 , p2 ) and r > 0. A ball (hypothesis)
R(p, r) ∈ H is interpreted as a classier that identies a point x as being in class S if x ∈ R(p, r)
and identies x as not being in class S if x ̸∈ R(p, r).
(a) Let us show that a set of three points (not all lying on a line) can be shattered by H.


First, we observe that there are the following eight combinations for subsets S of the set
of three points (the label '+' indicates that the respective point should be included in S,
while '-' indicates that it should not be included in S ). In each case a ball is sketched that
contains the respective S. Hence the set of three points can be shattered by H.


(b) Note that the result from (a) implies, by the definition of VC-dimension, that VCD(H) ≥ 3.
(c) For showing VCD(H) = 3 we would need to show that no four points can be shattered
by H.
We do not show this here. We just remark that no set of three points on a line can be
shattered by H (because balls are convex, hence if a ball contains the points x1 and x3 it also
needs to contain any point x2 lying on the line segment x1 x3 , so the labeling +, −, + along the line cannot be realized).

Theorem 5.1
For the hypothesis class H consisting of halfplanes in R2 it holds that VCD(H) = 3.

Proof. Let us first prove that VCD(H) ≥ 3 by providing a specific set S of three points for which we
show that we can obtain every subset of S as a projection of some halfplane onto S.
Consider three points (not all lying on a line). There are eight possible labelings, and for each we can
find a halfplane containing only the positively labeled points; see Fig. 5.3. Hence, VCD(H) ≥ 3.


Figure 5.3: The eight possible labelings and a corresponding halfplane (gray shaded area
bounded by the blue line) that contains only the positively labeled points.


Figure 5.4: Examples of four points and their labelings (boundary of convex hull indicated
by dotted lines). No halfplanes exist that contain precisely the positively labeled points.

To see that no set of four points can be shattered, we consider three cases.
ˆ Only two of the four points define the convex hull of the four points (see Fig. 5.4(a)): Label the
interior points negative and the hull points positive. No halfplane exists that contains precisely
the positively labeled points. (If you want to make this argument mathematically rigorous, you
can argue like this: Halfplanes are convex, hence together with the two hull points they need to contain also
the negatively labeled points. Hence halfplanes cannot contain only the positively labeled points.)
ˆ Three of the four points define the convex hull of the four points (see Fig. 5.4(b)): Label the
interior point negative and the hull points positive. Again no halfplane exists that contains
precisely the positively labeled points. (Rigorous argument as above.)
ˆ All four points lie on the convex hull defined by the four points (see Fig. 5.4(c)): Label one
'diagonal' pair positive and the other 'diagonal' pair negative. Again, no halfplane exists that
contains precisely the positively labeled points. (Making this rigorous requires a bit more work.
It follows from a fundamental result from the theory of convex sets called Radon's Theorem. It
states that any set of d + 2 points in Rd can be partitioned into two subsets whose convex hulls
intersect each other. No halfplane can separate these two subsets.)

More generally, it can be shown that VCD(H) = d + 1 for halfspaces in Rd , but proving this is outside
the scope of the present course.

5.2.2 Support Vector Machines


Support vector machines (SVMs), originally proposed by Vapnik (and published by Vapnik and
Lerner in 1963 for linear classification problems), are supervised learning algorithms mainly used for
classification or regression problems. They try to find an `optimal' hyperplane that separates the data
into different classes.
We restrict our exposition to the case that X = Rd and C = {±1}.

Denition 5.3
An (affine) hyperplane in Rd is a set of the form

Hw,β = {x ∈ Rd : wT x = β},

where w ∈ Rd \ {0} and β ∈ R.

Denition 5.4
A hyperplane Hw,β is a separating hyperplane for the data (xi , yi ) ∈ Rd × {±1}, i = 1, . . . , n, if
for all i = 1, . . . , n it holds that

wT xi ≥ β, if yi = +1,
T
w xi ≤ β, if yi = −1.

It is allowed in this definition that some (or even all) of the data points are lying on the separating
hyperplane. If none of the data points lie on the separating hyperplane, i.e., when its distance to all
data points is positive, we say that the separating hyperplane has a positive margin.
There does not always exist a separating hyperplane. And even if one does exist, there might be many; see
Fig. 5.5(b-c). Support vector machines try to find the separating hyperplane with maximum mar-
gin.

Binary SVM problem:


Given training data (xi , yi ) ∈ Rd × {±1}, i = 1, . . . , n. Find a separating hyperplane that has
maximum margin, i.e., the maximum distance between data points of both classes.

58
Why could it be a good idea to find a separating hyperplane that maximizes the margin? The intuition
behind this is that points near the decision boundary are often misclassified (there is an almost 50%
chance that the classifier decides either way). So, insisting on a large margin can potentially minimize
misclassification. This can be a good idea if nothing else is known about the data. (For a more model
dependent choice of decision boundaries, see the later section on naïve Bayes classifiers.)

(a) (b) (c)

Figure 5.5: Binary SVM problem. (a) training data with two labels (red=+ and blue=−), (b) a
separating hyperplane (solid black, dashed lines indicating the margin), (c) the optimal separating
hyperplane (solid black) maximizing the margin (indicated by the dashed lines).

Consider a separating hyperplane Hw,β , see Fig. 5.6. The points xi ∈ Rd with wT xi − β ≥ 0 lie on one
side of the hyperplane, the points xi ∈ Rd with wT xi − β ≤ 0 lie on the other side of the hyperplane
(points satisfying both inequalities are lying on both sides).


Figure 5.6: Maximum margin separating hyperplanes.

Consider now the parallel hyperplanes Hw,β+1 and Hw,β−1 . What we want to have is that Hw,β+1 and
Hw,β−1 are separating and their distance (which can be shown to be 2/||w||; see Fig. 5.6) should be
maximal46 . This can be formulated as the optimization problem

    max_{w,β}   2/||w||
    subject to  wT xi ≥ β + 1,  for all i with yi = +1,
                wT xi ≤ β − 1,  for all i with yi = −1,
which has the same optimal w and β as the optimization problem

46
There is nothing special in considering β + 1 and β − 1 in the denition of Hw,β+1 and Hw,β−1 . Instead one could
have also considered Hw,β+ε and Hw,β−ε for any of your favorite choice of ε ̸= 0, because Hw,β+1 = Hw′ ,β ′ +ε and
Hw,β−1 = Hw′ ,β ′ −ε with w′ = εw and β ′ = εβ.

    min_{w,β}  (1/2)||w||²   subject to  yi (wT xi − β) ≥ 1,  for all 1 ≤ i ≤ n.

In the latter form, the binary SVM problem is a well-studied optimization problem. It has a quadratic
objective function, which is strictly convex, and its constraints are all (affinely) linear (such optimiza-
tion problems are often referred to as quadratic programs or convex quadratic programs ). Due to the
strict convexity it can be shown that the minimum is unique, and this minimum can be found by a
variety of well-established algorithms (in matlab, you can use the command quadprog).
Fig. 5.5(c) shows the maximum margin separating hyperplane for the data shown in Fig. 5.5(a). Having
found the maximum margin separating hyperplane, any new data x ∈ Rd will subsequently be classified
according to this hyperplane, i.e., by

    y = +1   if wT x ≥ β,
    y = −1   otherwise.
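The quadratic program above can be handed directly to matlab's quadprog. The following is a minimal sketch for the hard-margin case; it assumes linearly separable data, with X a d × n matrix holding the data points in its columns and y a 1 × n vector of ±1 labels (these names are illustrative, not part of the notes):

% Hard-margin SVM via quadprog: minimize (1/2)*||w||^2 s.t. y_i*(w'*x_i - beta) >= 1.
[d, n] = size(X);
H = blkdiag(eye(d), 0);                          % quadratic term in z = [w; beta]
f = zeros(d+1, 1);
A = -(y' * ones(1, d+1)) .* [X', -ones(n, 1)];   % row i encodes -y_i*(x_i'*w - beta) <= -1
b = -ones(n, 1);
z = quadprog(H, f, A, b);
w = z(1:d);  beta = z(d+1);
yhat = sign(w'*xnew - beta);                     % classify a new point xnew (0 only on the hyperplane itself)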

5.2.3 Naïve Bayes Classifier
Sometimes we have some prior knowledge about the data and we want or need to include this. In
Bayesian learning (and we discuss here the special case of naïve Bayes classifiers) prior knowledge
is provided by asserting (a) a prior probability for each class labeling and (b) a probability distribution
over observed data for each possible labeling.
Naïve Bayes is a statistical classification technique based on Bayes Theorem (see Section 1.3). A
naïve Bayes classifier assumes that any of the features ξ1 , . . . , ξd of a feature vector x = (ξ1 , . . . , ξd )T
are independent given the label y. This is usually an oversimplification (hence the term naïve ), but it
simplifies the computations considerably.
Bayes theorem and the independence give

    Pr(y = y ∗ | ξ1 = ξ1∗ , . . . , ξd = ξd∗ ) = Pr(ξ1 = ξ1∗ , . . . , ξd = ξd∗ | y = y ∗ ) Pr(y = y ∗ ) / Pr(ξ1 = ξ1∗ , . . . , ξd = ξd∗ )
                                              = Pr(ξ1 = ξ1∗ | y = y ∗ ) · . . . · Pr(ξd = ξd∗ | y = y ∗ ) Pr(y = y ∗ ) / Pr(ξ1 = ξ1∗ , . . . , ξd = ξd∗ ).        (5.2)

The terms in this expression are typically referred to as:


ˆ Pr(y = y ∗ |ξ1 = ξ1∗ , . . . , ξd = ξd∗ ) is the so-called posterior probability of the label y∗ given the
feature vector x∗ = (ξ1∗ , . . . , ξd∗ )T .
ˆ Pr(y = y ∗ ) is the prior probability of the label y∗ .
ˆ Pr(ξi = ξi∗ |y = y ∗ ) is the likelihood, which is the probability of the feature ξi∗ given the label y∗ .
ˆ Pr(ξ1 = ξ1∗ , . . . , ξd = ξd∗ ) is the prior probability of the feature vector x∗ .
Now the idea is to find the label that maximizes the posterior probability, i.e., the probability of the
label given the observed features.

Naïve Bayes:
Given a vector x∗ = (ξ1∗ , . . . , ξd∗ )T of features, the label ŷ that will be assigned to x∗ will be the
y ∗ ∈ C that maximizes Pr(y = y ∗ |ξ1 = ξ1∗ , . . . , ξd = ξd∗ ) in (5.2).

What we need for naïve Bayes is therefore the likelihood of each feature and the prior probability of
each label. The prior probability of the feature vector is not needed, since for all labels it will be the
same constant factor.
If nothing else is known, one usually assumes that the prior probability of each label is uniformly
distributed. For the likelihood one often assumes that the ξi feature associated with each label y ∗ ∈ C is
distributed according to a normal distribution (of course, this assumes that the ξi feature is continuous).
In other words,

    Pr(ξi = ξi∗ | y = y ∗ ) = 1/(σy∗ √(2π)) · exp( −(1/2) ((ξi∗ − µy∗ )/σy∗ )² ),
where µy∗ and σ²y∗ denote the mean and, respectively, variance of the ξi feature in the training data
that is associated to label y ∗ . Such a Bayes classifier is often referred to as Gaussian naïve Bayes
classifier.
Different from the SVM case discussed previously, decision boundaries for Bayes classifiers need
not be linear (they are generally quadratic).

Example 5.5
Suppose, we have the training data as considered in Section 5.2.2; Table 5.2(a) gives the numer-
ical values.

     ξ1      ξ2     label
    −1.5      2      −1
    −0.8     0.7     −1
     0.5     −1      −1
    −2      −0.7     −1
     2       0.8     −1
     0.3     2.3     −1
     3        3      +1
     4        6      +1
     5        3      +1
     4       4.5     +1
     5.5      4      +1
     2.2     5.5     +1
              (a)

    feature   label    mean    standard deviation
      ξ1       −1     −0.25          1.47
      ξ1       +1      3.95          1.22
      ξ2       −1      0.68          1.35
      ξ2       +1      4.33          1.25
              (b)

Using Gaussian naïve Bayes we want to classify x∗ = (ξ1∗ , ξ2∗ )T = (1, 4)T .
First we compute the mean and standard deviation of the training data separately for each label.
The results are shown in Table 5.2(b).
For y ∗ = −1 we obtain

    Pr(y = −1) = 6/12 = 0.5,
    Pr(ξ1 = ξ1∗ | y = −1) = 1/(1.47 √(2π)) · e^( −(1/2)((1+0.25)/1.47)² ) ≈ 0.1891,
    Pr(ξ2 = ξ2∗ | y = −1) = 1/(1.35 √(2π)) · e^( −(1/2)((4−0.68)/1.35)² ) ≈ 0.0144.

Hence
    Pr(y = −1 | ξ1 = ξ1∗ , ξ2 = ξ2∗ ) = 0.5 · 0.1891 · 0.0144 / Pr(ξ1 = ξ1∗ , ξ2 = ξ2∗ ) ≈ 0.0014/Pr(ξ1 = ξ1∗ , ξ2 = ξ2∗ ).

For y ∗ = +1 we obtain

    Pr(y = +1) = 6/12 = 0.5,
    Pr(ξ1 = ξ1∗ | y = +1) = 1/(1.22 √(2π)) · e^( −(1/2)((1−3.95)/1.22)² ) ≈ 0.0176,
    Pr(ξ2 = ξ2∗ | y = +1) = 1/(1.25 √(2π)) · e^( −(1/2)((4−4.33)/1.25)² ) ≈ 0.3082.

Hence
    Pr(y = +1 | ξ1 = ξ1∗ , ξ2 = ξ2∗ ) = 0.5 · 0.0176 · 0.3082 / Pr(ξ1 = ξ1∗ , ξ2 = ξ2∗ ) ≈ 0.0027/Pr(ξ1 = ξ1∗ , ξ2 = ξ2∗ ).

As Pr(y = +1|ξ1 = ξ1∗ , ξ2 = ξ2∗ ) > Pr(y = −1|ξ1 = ξ1∗ , ξ2 = ξ2∗ ) we therefore assign x∗ to class
ŷ = +1.

Fig. 5.7 shows the classification and the respective (non-linear) decision boundary.

Figure 5.7: Gaussian naïve Bayes. Classifying the point (1, 4) (shown as black circle). The
decision boundary is shown in black.
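The computations of Example 5.5 can be reproduced with a short matlab script. The following is a minimal sketch; it assumes normpdf from the Statistics and Machine Learning Toolbox, and the variable names are illustrative:

% Gaussian naive Bayes for the training data of Table 5.2(a), classifying (1,4).
X = [-1.5 2; -0.8 0.7; 0.5 -1; -2 -0.7; 2 0.8; 0.3 2.3; ...
      3 3; 4 6; 5 3; 4 4.5; 5.5 4; 2.2 5.5];
y = [-ones(6,1); ones(6,1)];
xstar = [1 4];
labels = [-1 1];  post = zeros(1,2);
for l = 1:2
    Xl = X(y == labels(l), :);
    prior = mean(y == labels(l));                        % here 0.5 for both labels
    mu = mean(Xl);  sigma = std(Xl);                     % per-feature mean and standard deviation
    post(l) = prior * prod(normpdf(xstar, mu, sigma));   % unnormalized posterior
end
[~, idx] = max(post);
yhat = labels(idx)                                       % gives +1, as in the example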

5.2.4 Decision Trees


A rather simple classification technique is the so-called decision tree classifier. Basically, the idea
is to ask a series of carefully crafted questions. The series of questions and their possible answers can
be organized into a hierarchical structure called a decision tree.

Example 5.6
Figure 5.8 shows an example of the decision tree for a mammal classication problem, where
the task is to decide whether a newly discovered species is a mammal or a non-mammal.

Figure 5.8: A decision tree for the mammal classication problem (adapted from [13]).

The first question is whether the species is cold- or warm-blooded. If it is cold-blooded, then it
is definitely not a mammal. Otherwise, we ask whether it gives birth (as opposed to laying eggs).
If the answer is yes it is a mammal, otherwise it is a non-mammal.
So, suppose we want to classify the flamingo species based on the data shown in Fig. 5.9.
Following the path through the tree (see dashed lines) we would classify flamingos as non-
mammals.

Figure 5.9: Classifying an unlabeled vertebrate (adapted from [13]).

Decision trees have three types of nodes: (a) a root node, which has no incoming arcs, (b) internal
nodes, each of which has exactly one incoming arc and two or more outgoing arcs, and (c) leaf nodes,
which have no outgoing arcs. The root and internal nodes represent the questions, the leaf nodes
represent the corresponding labels (note that the same label can be present at different leaves).
So, how does one construct (learn) such a tree? Of course, we want to construct the tree from training
data. The goal should be to find a decision tree that classifies all the training data correctly. It is not
difficult to see that there are typically many different decision trees that achieve this goal. Thus, we
might ask for an optimal decision tree, optimal in the sense that it minimizes the expected number
of tests required to identify the unlabeled object. Finding such an optimal decision tree is, however,
an NP-hard problem47 . Therefore one usually tries to construct near-optimal decision trees using some
heuristics.

Learning of a decision tree:


The general idea is to choose a feature on which to split the data. This will be the root node.
Then in each of the child nodes one repeats the process (splits the data on a chosen feature).

Example 5.7
Let us consider learning the decision tree for classifying species into mammals and non-mammals
based on the data shown in Table 5.3.

47
See the short 1976 paper of Hyafil and Rivest, freely available at https://people.csail.mit.edu/rivest/pubs/
HR76.pdf.

Vertebrate Body Gives Aerial Has Hiber- Class
Name Temperature Birth Creature Legs nates Label

python cold-blooded no no no yes non-mammal


salmon cold-blooded no no no no non-mammal
eel cold-blooded no no no no non-mammal
frog cold-blooded no no yes yes non-mammal
komodo dragon cold-blooded no no yes no non-mammal
leopard shark cold-blooded yes no no no non-mammal
turtle cold-blooded no no yes no non-mammal
human warm-blooded yes no yes no mammal
whale warm-blooded yes no no no mammal
bat warm-blooded yes yes yes yes mammal
pigeon warm-blooded no yes yes no non-mammal
cat warm-blooded yes no yes no mammal
penguin warm-blooded no no yes no non-mammal
porcupine warm-blooded yes no yes yes mammal

Table 5.3: Training data (adapted from [13]) for a vertebrate classication problem.

Which of the features `body temperature,' `gives birth,' `aerial creature,' `has legs,' and `hiber-
nates,' should we choose as root node? The answer is that there are dierent rules around, each
constituting a dierent algorithm. However, we do not go into the details here any further.
Let us assume that we choose `body temperature' as root node. Then we have the two classes

C1 = {python, salmon, eel, frog, komodo dragon, leopard shark, turtle},


C2 = {human, whale, bat, pigeon, cat, penguin, porcupine}.

All elements of C1 are non-mammals, hence we do not split on this node any further. It
becomes a leaf.
We need to split C2 further. If we split on feature `Gives birth' we get the two classes

C3 = {human, whale, bat, cat, porcupine},


C4 = {pigeon, penguin},

each of which contain only members of one of the class labels (either mammals or non-mammals).
Hence we do not need to split any further. The two nodes become leaf nodes, and we have
completed constructing the decision tree. The result is shown in Fig. 5.8.
Could we have chosen a different feature to split, say, class C2 ? In principle, yes. But the
resulting decision tree would have contained more levels. Splitting on `aerial creature' would
have resulted in two classes both of which contain both mammals and non-mammals; splitting
on `has legs' would have resulted in a class containing both mammals and non-mammals; splitting
on `hibernates' would have resulted in a class containing both mammals and non-mammals; in
all cases one would have needed to split further.
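If one wants to experiment, matlab's Statistics and Machine Learning Toolbox provides fitctree, which learns a decision tree with its own built-in splitting rule (so the tree it finds need not coincide with the one constructed above). A minimal sketch for the data of Table 5.3, restricted to the two features used in the example:

% Decision tree for (part of) the vertebrate data of Table 5.3 via fitctree (sketch).
BodyTemperature = categorical({'cold';'cold';'cold';'cold';'cold';'cold';'cold'; ...
                               'warm';'warm';'warm';'warm';'warm';'warm';'warm'});
GivesBirth = categorical({'no';'no';'no';'no';'no';'yes';'no'; ...
                          'yes';'yes';'yes';'no';'yes';'no';'yes'});
Class = categorical({'non-mammal';'non-mammal';'non-mammal';'non-mammal';'non-mammal'; ...
                     'non-mammal';'non-mammal';'mammal';'mammal';'mammal'; ...
                     'non-mammal';'mammal';'non-mammal';'mammal'});
T = table(BodyTemperature, GivesBirth, Class);
tree = fitctree(T, 'Class');
view(tree, 'Mode', 'text')        % print the learned tree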

5.2.5 Neural Networks


A major class of supervised learning techniques uses neural networks (also called artificial neural
networks). We will discuss them briefly in the later Section 5.5.1. Learning via neural networks that
contain large numbers of hidden layers constitutes the field of so-called deep learning.

5.3 Unsupervised Learning

In this section we discuss unsupervised learning (sometimes paraphrased as `learning without a


teacher'). The data consists of a set X = {x1 , . . . , xm } of m observations, each being a random d-
dimensional vector having a joint pdf P (X). The goal is to directly infer the properties of this pdf
without any help of a `supervisor' that could provide correct answers for each observation.

5.3.1 Clustering
In clustering the goal is to divide the observations into groups (clusters ) so that the pairwise dis-
similarities between those assigned to the same cluster tend to be smaller than those in dierent
clusters.
There are a number of dierent clustering algorithms in the literature (and the development is still
ongoing). One of the most popular clustering methods is the so-called k-means algorithm48 . For a
prescribed k, the algorithm tries to nd k clusters in a given data set (hence the name k -means).
Let us start by stating the algorithm. Then we consider examples followed by a brief discussion of
theoretical aspects.

Given k ∈ N, a set of data points X = {x1 , . . . , xm } ⊆ Rd , and sites S = {s1 , . . . , sk } ⊆ Rd the


k-means algorithm proceeds as follows.

(a) Partition X into clusters C1 , . . . , Ck by assigning xj ∈ X to the cluster Ci of a closest site


si ∈ S.
(b) Update each site si as the centroid of cluster Ci
(if Ci = ∅ then choose si = xl for a random l ≤ m with xl ̸= sj for all j ≤ k ).
(c) If a stopping criterion is met then stop and return the current assignment and sites,
otherwise goto step (a).

Clearly, we need to specify several points in the above algorithm:


ˆ What do we mean by `closest' site? Here, unless stated otherwise, we measure distances in the
Euclidean norm49 , i.e., we say that xj is closest to cluster Ci if

||xj − si || ≤ ||xj − sl ||, for all l ̸= i.

ˆ What do we mean by `centroid of cluster Ci ?' Suppose the r data points xi1 , . . . , xir are assigned
to cluster Ci . Then the centroid of Ci is defined as

    (1/r) Σ_{l=1}^{r} x_{i_l} .

ˆ What is a possible stopping criterion ? We use the following: From one to the next iteration the
(average) sum of squared error (SSE)
    (1/m) Σ_{i=1}^{k} Σ_{xj ∈ Ci} ||xj − si ||²                                   (5.3)

does not decrease.


(We note that, in practice, one also often stops when a maximum number of iterations MaxIt is
reached or the centroids are not changing anymore.)
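A minimal matlab sketch of steps (a)-(c) might look as follows (matlab also ships a built-in kmeans command). Here X is a d × m matrix with the data points in its columns and S a d × k matrix with the initial sites; these names are illustrative:

function [S, idx] = simple_kmeans(X, S)
% Minimal k-means sketch: assign points to closest sites, update centroids,
% stop when the average SSE (5.3) no longer decreases.
m = size(X, 2);  k = size(S, 2);
sse_old = inf;
while true
    D = zeros(k, m);
    for i = 1:k
        D(i, :) = sum((X - S(:, i)).^2, 1);      % squared distances to site i
    end
    [dmin, idx] = min(D, [], 1);                 % step (a): closest site for each point
    for i = 1:k                                  % step (b): centroids as new sites
        if any(idx == i)
            S(:, i) = mean(X(:, idx == i), 2);
        else
            S(:, i) = X(:, randi(m));            % re-seed an empty cluster
        end
    end
    sse = mean(dmin);                            % average SSE as in (5.3)
    if sse >= sse_old, break; end                % step (c): stopping criterion
    sse_old = sse;
end
end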

48
The idea behind this clustering method seems to trace back to H. Steinhaus (Jan. 14, 1887  Feb. 25, 1972). The
standard algorithm was rst proposed by Stuart Lloyd of Bell Labs in 1957, though it was not published as a journal
article until 1982.
49
Recall that the Euclidean norm of a vector v = (v1 , . . . , vd )T ∈ Rd is ||v|| = √(v1² + · · · + vd²).

Example 5.8
Consider the set

X = {(−10, 1), (−8, 3), (−6, 2), (3, −1), (5, −3), (6, −2), (9, 3), (10, 7), (11, 5), (−5, −3)}

of 10 data points as shown in Fig. 5.10(a); the sites (7, 0), (−5, 0), (−2, 6), (0, 0) are depicted
as `x.' Fig. 5.10(b)-(d) show the results of k -means for k = 4 after Iteration 1, 2, and 3. The algorithm
has converged after 3 iterations, resulting in the sites (−8, 2), (−5, −3), (10, 5), (14/3, −2).

Figure 5.10: Example of k -means for k = 4 : (a) Initial data, (b) Iteration 1, (c) Iteration 2, (d) Itera-
tion 3.

We now want to prove that k -means is converging in a nite number of steps. To this end, we rst
need to prove the following lemma.

Lemma 5.1
Let x1 , . . . , xm ∈ Rd . The sum of squared distances of the xi to a point p ∈ Rd is minimized
when p is the centroid, i.e., if p = (1/m) Σ_{i=1}^{m} xi .

Proof. Let c = (1/m) Σ_{i=1}^{m} xi . Then,

    Σ_{i=1}^{m} ||xi − p||² = Σ_{i=1}^{m} ||xi − c + c − p||² = Σ_{i=1}^{m} (xi − c + c − p)T (xi − c + c − p)
                            = Σ_{i=1}^{m} ||xi − c||² + 2(c − p)T Σ_{i=1}^{m} (xi − c) + m||c − p||²,

where (xi − c + c − p)T denotes the transposition of the column vector xi − c + c − p.
Since

    Σ_{i=1}^{m} (xi − c) = Σ_{i=1}^{m} xi − mc = Σ_{i=1}^{m} xi − Σ_{i=1}^{m} xi = 0

we therefore have

    Σ_{i=1}^{m} ||xi − p||² = Σ_{i=1}^{m} ||xi − c||² + m||c − p||² ≥ Σ_{i=1}^{m} ||xi − c||²,       (∗)

with equality in (∗) if, and only if, c = p.

Theorem 5.2
For any data set X, set of sites S, and any k ∈ N, the k -means algorithm decreases (from one
iteration to the next) the SSE.

Proof. Let X = {x1 , . . . , xm }, S = {s1 , . . . , sk }, and let C1 , . . . , Ck denote the current clusters. Now,
let C1′ , . . . , Ck′ denote the clusters that we compute from these data in step (a), and let c′1 , . . . , c′k denote
the corresponding cluster centroids. Then,
    (1/m) Σ_{i=1}^{k} Σ_{xj ∈ Ci} ||xj − si ||²  ≥  (1/m) Σ_{i=1}^{k} Σ_{xj ∈ Ci′} ||xj − si ||²     (since any xj is assigned to the closest si )
                                                 ≥  (1/m) Σ_{i=1}^{k} Σ_{xj ∈ Ci′} ||xj − ci′ ||²     (by Lemma 5.1).

Theorem 5.3
For any data set X, set of sites S, and any k ∈ N the k -means algorithm converges in a nite
number of steps.

Proof. There are at most k^m ways to partition m points into k clusters (for each of the m points we
have k possibilities to assign it to a cluster). This is a finite (though large) number. For each such
partition there are the centroids determined by step (b). Hence, we have at most k^m different values
for the SSEs.
By Theorem 5.2, the SSE decreases in each iteration (if the SSE remains the same then the stopping
criterion is fulfilled, hence we have convergence). The SSE, however, cannot continue to strictly decrease
indefinitely since, as already mentioned above, there are at most k^m values for it. Therefore, the
stopping criterion is fulfilled after a finite number of steps.
Although we have proved convergence, there are several unfortunate issues. First, the convergence is
only to a local minimum (in practice, one therefore needs to re-run the algorithm several times and
record the found minimum). Second, the speed of convergence can depend much on the dimension d,
the number of points m, and the set of sites S. It is beneficial to provide as input good approximations
of the cluster centroids, but in general one encounters computationally difficult problems.

It is, for instance, known50 that the k -means problem (not the algorithm!) is NP-hard when the
dimension d is part of the input and k = 2. It is also NP-hard already for d = 2 if the number k is part
of the input51 . For fixed k and d, the k -means problem can be solved in polynomial time (polynomial
in m, the number of data points)52 .

Example 5.9
The k -means algorithm can be used for image compression.

Figure 5.11: Compression of a 640 × 380 color image. (a) original image (image_india.png)
containing 195, 324 > 217 colors, (b) compressed to 32 = 25 colors, (c) compressed to 8 = 23
colors.

Let us look at the compression rate. Often, color images are stored by providing for each pixel an
RGB vector (this is a 3-dimensional vector containing the red/green/blue value in the respective
components). Typically, each entry in the RGB vector is an integer between 0 and 255, hence
each entry needs 1 Byte of storage. For color images not requiring the full color range, it is often
beneficial to store for each pixel the color label together with a lookup table, which, for each
color label, contains the corresponding RGB vector. The following table compares the sizes of
the images shown in Fig. 5.11.
Just to get an idea about the computation times, we remark that with matlab's built-in k -means
procedure, the computation time for producing (b) and (c) in Fig. 5.11 was 10.0 and 4.2 seconds,
respectively.

Image   Size (Bytes)                             Reduction Factor
(a)     640 · 380 · 3 = 729,600                  1.0
(b)     640 · 380 · 5/8 + 32 · 3 = 152,096       4.8
(c)     640 · 380 · 3/8 + 8 · 3 = 91,224         8.0

Table 5.4: Compression rates for the images shown in Fig. 5.11.

Let us state several remarks.


ˆ The k -means algorithm is certainly one of the most popular clustering methods. There are several
variants of this method, e.g., k -means++ (chooses the initial seeds in a special way), fuzzy k -
means (a form of clustering in which each data point can belong to more than one cluster), and
k -medians clustering (minimizing the SSE (5.3) where instead of the Euclidean norm the 1-norm
is taken).
ˆ Sometimes the number k of clusters is given by the application (e.g. we might know the number
of crystals in a polycrystalline material). Then, it can be straightforward to apply k -means. In
other cases, however, it might not be clear what the most appropriate k is. It needs to be learned
from the data as well. A possible approach for doing this is, for instance, G-means53 , which is
based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution.
50
See https://www.cc.gatech.edu/~vempala/papers/dfkvv.pdf.
51
See https://www.sciencedirect.com/science/article/pii/S0304397510003269.
52
See https://dl.acm.org/doi/10.1145/177424.178042.
53
See https://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf.

The algorithm runs k -means with increasing k in a hierarchical fashion until the test accepts the
hypothesis that the data assigned to each k -means center are Gaussian.

Figure 5.12: k -means clustering of the RGB vectors. (a) the 729,600 data points (RGB vectors) of
image_india.png, (b) k-means clustering with k = 8 giving the image in Fig. 5.11(c).
ˆ In step (a) of the algorithm we assign all points that are closest to si to the ith cluster.

Doing this for all points in Rd one would obtain

Pi = {x ∈ Rd : ||x − si ||2 ≤ ||x − sj ||2 for all j ̸= i}.

This set Pi is called Voronoi cell, and the collection P1 , . . . , Pk is a so-called Voronoi
diagrama .
a Georgy Feodosevich Voronoy (April 28, 1868  Nov. 20, 1908).

Voronoi diagrams appear in many dierent contexts as, for instance, in robotics, imaging, physics,
biology, ecology, etc. The philosopher René Descartes54 thought that the solar system can be
viewed as a Voronoi diagram built of vortices centered at xed stars (his Voronoi diagram's sites).
Remarkably, the `decision boundaries' in k -means are linear in the following sense.

Theorem 5.4
The shared boundary of any two neighboring Voronoi cells P1 and P2 is linear.

Proof. Note that for the shared boundary we have


    ||x − s1 ||² = ||x − s2 ||²
    ⇔ (x − s1 )T (x − s1 ) = (x − s2 )T (x − s2 )
    ⇔ ||x||² − 2 s1T x + ||s1 ||² = ||x||² − 2 s2T x + ||s2 ||²
    ⇔ 2(s2 − s1 )T x = ||s2 ||² − ||s1 ||² .

Hence, the boundary is a subset of

    B = {x ∈ Rd : ||x − s1 ||² = ||x − s2 ||²} = {x ∈ Rd : (s2 − s1 )T x = (||s2 ||² − ||s1 ||²)/2},

the latter of which is clearly an ane hyperplane in Rd (note that s1 and s2 are xed).
54
René Descartes (March 31, 1596  February 11, 1650) was a French philosopher, mathematician, and scientist.

If one chooses other norms, then one would obtain other types of tessellations55 , for instance, power
diagrams, Laguerre tessellations, or generalized balanced power diagrams. An illustration of the tessel-
lation for Fig. 5.10(d) is shown in Fig. 5.13.

Figure 5.13: Tessellation for Fig. 5.10.

5.3.2 Learning on Huge Feature Spaces


A large dimension d of the feature space can be problematic. On the one hand, the computational cost
of the learning algorithms can be enormous. But even data visualization becomes challenging for d > 2
(and surely for d > 3). In addition, many unintuitive phenomena occur in higher dimensions. For
instance, a random point in a unit square has about a 0.4% chance to be located less than 0.001 from
the boundary. But for a 10,000-dimensional hypercube there is a 99.99% probability for a point to
lie near the boundary. (To prove this just compare the volumes of the hypercube and the boundary
strips.) The average distance between two random points in a unit square is around 0.52, whereas
for a 1,000,000-dimensional hypercube it is about 408 (nicely explained proofs of these facts can be
found at several websites56 ).
For large dimensions one therefore might want to resort to so-called dimensionality reduction
techniques, which reduce the dimensionality of the data. Of course, such techniques are not always
applicable as all the data can be important. An important class of dimensionality reduction techniques
projects the data into lower dimensional subspaces. We will discuss the technique called principal
component analysis (PCA)57 . It is one of the oldest and most widely used methods for dimension
reduction. The projection employed in this method is a linear projection (there exist also methods
that utilize non-linear projections).
The PCA technique applies generally if one can assume that there is some redundancy in the data.
Redundancy means that some of the components/coordinates of the xi are correlated with one another,
possibly because they are measuring the same construct (for instance, height above sea level and
temperature, as the temperature decreases at larger heights, or acceleration and force on a particle,
which are related by Newton's second law of motion). Then the goal is to reduce the dimension d
of each data point xi = (ξ1^(i) , . . . , ξd^(i) )T , i = 1, . . . , m, by replacing xi ∈ Rd by a lower-dimensional
(artificial) data point yi ∈ Rk (k < d) that will account for most of the variance in the observed data.
We will aim for projecting the d-dimensional data points onto a subspace spanned by k directions
called principal components.
First, we briefly recall several concepts.
First, we briey recall several concepts.
55
These are dissections of the space Rd .
56
See, e.g., https://math.stackexchange.com/questions/2985662/average-distance-between-points-on-a-hypercube.
57
Introduced in 1901 by Karl Pearson, an English mathematician and biostatistician (March 27, 1857  April 27, 1936).

For ease of exposition, we will assume throughout this section that the data is centered, which
means that the data has zero mean in each dimension (i.e., set ξi′ = ξi −µi , where µi is the mean of
the sample's ith coordinates). This can be achieved in matlab via X=normalize(X','center')',
where X ∈ Rd×m is the matrix that has the data point xi in its ith column. If your data is not
centered, center it before applying the below formalism!

We recall that a matrix M ∈ Rd×d is said to have an eigenvalue of λ if there is a d-dimensional vector
u ̸= 0 for which M u = λu. This u is then the eigenvector corresponding to λ.
The sample covariance matrix C for the centered data points xi , i = 1, . . . , m, holds in its ith row
and jth column the covariance between the two features i and j. Formally, C = (ci,j ) ∈ Rd×d with

    ci,j = 1/(m−1) Σ_{ℓ=1}^{m} ξi^(ℓ) ξj^(ℓ) .

This, by the way, is a symmetric matrix. Observe that if we arrange the data points xi into a
matrix X ∈ Rd×m whose ith column contains the data point xi , then C can be expressed as

    C = 1/(m−1) X · XT = 1/(m−1) Σ_{ℓ=1}^{m} xℓ xℓT .

Covariance is, in fact, a measure of how much two sample features vary together. (It is similar to
variance, but where variance indicates how a single feature varies, covariance indicates how two fea-
tures vary together.) A positive covariance between two features indicates that the features increase
or decrease together, whereas a negative covariance indicates that the features vary in opposite direc-
tions.
Now, we are ready to derive the principal components. There are several equivalent ways of deriving
them. One way is by nding the projections that maximize the variance. The rst principal component
is the direction along which projections have largest variance. The second principal component is the
direction which maximizes variance among all directions orthogonal to the rst, and so on.

Variance Maximization We assume that our centered data is collected in a d × m matrix X, which
contains the data point xi in the ith column. We wish to find a direction u that captures as much
as possible of the variance of our data points. Formally, this amounts to solving the optimization
problem

    Find u ∈ Rd with ||u|| = 1 so as to maximize var(uT X).                          (5.4)

Note that uT X denotes the projection of X (i.e., of all column vectors x1 , . . . , xm ) to the subspace
spanned by the single vector u. We require ||u|| = 1 since if we allowed scaling we could obtain
arbitrarily large values of var(uT X), and it would therefore make no sense to speak of a maximal
variance.

Theorem 5.5
The solution to the optimization problem (5.4) is to set u to equal the first principal component
(that is the eigenvector corresponding to the largest eigenvalue) of C.

Proof. As we assume that the data x1 , . . . , xm is centered, they have sample mean µ = 0 and sample
variance

    var(uT X) = 1/(m−1) Σ_{i=1}^{m} (uT xi − µ)²       (def. of sample variance)
              = 1/(m−1) Σ_{i=1}^{m} (uT xi )²
              = 1/(m−1) Σ_{i=1}^{m} (uT xi )(xiT u)
              = uT ( 1/(m−1) Σ_{i=1}^{m} xi xiT ) u
              = uT Cu.

Now, our optimization problem reduces to finding u ∈ Rd with ||u|| = 1 that maximizes uT Cu. To
solve this, we use the following notation. With vi we denote the eigenvector of C corresponding to the
ith largest eigenvalue λi of C. Then,

Claim:
    max_{||u||=1} uT Cu = max_{u≠0} (uT Cu)/(uT u) = λ1 ,

and this maximum is attained for u = v1 .

Proof of the claim: Notice first that, by the eigenvalue decomposition (for real symmetric
matrices), we can write C = QDQT , where Q ∈ Rd×d holds in its ith column the ith
eigenvector vi , and D ∈ Rd×d is a diagonal matrix with di,i = λi on the diagonal. (Note, Q
is an orthogonal matrix, i.e., QQT is the identity matrix I.) Then,

    max_{u≠0} (uT Cu)/(uT u) = max_{u≠0} (uT QDQT u)/(uT QQT u)        (since QQT = I )
                             = max_{y≠0} (yT Dy)/(yT y)                (by setting y = QT u)
                             = max_{y≠0} (λ1 η1² + · · · + λd ηd²)/(η1² + · · · + ηd²)
                             ≤ max_{y≠0} (λ1 η1² + · · · + λ1 ηd²)/(η1² + · · · + ηd²)
                             = λ1 ,

where equality is attained in the inequality when y = (η1 , . . . , ηd )T = (1, 0, . . . , 0)T , that is
u = Qy = v1 , proving the claim.

Summary and Examples In summary, we obtain the following algorithm.

Principal component analysis (PCA):


Given k ∈ N and a set of centered data points X = {x1 , . . . , xm } ⊆ Rd .

(a) Construct the sample covariance matrix C.


(b) Decompose the covariance matrix into its eigenvectors and eigenvalues.
(c) Sort the eigenvalues in decreasing order.
(d) Select the first k eigenvectors corresponding to the k largest eigenvalues.
(e) Construct projection matrix W from these k eigenvectors.
(f) Transform the d-dimensional input dataset X into Y using the projection matrix W that
maps into the new k -dimensional feature subspace.

A matlab implementation might be as follows.

% Computing the PCA (there is also the pca command available)
% Input : dxm matrix X holding the d-dimensional centered data points in columns
%         k : the dimension into which we want to project
% Output: kxm matrix Y holding the projected data points in the columns

C = cov(X');             % C is the sample covariance matrix (dimensions dxd)
                         % also possible: C = X*X'/(size(X,2)-1);

[W, D] = eigs(C, k);     % compute k top eigenvalues and eigenvectors
                         % W is a dxk matrix with top eigenvectors (normalized) in columns
                         % D contains the top k eigenvalues on its diagonal (kxk matrix)

Y = W'*X;                % project the data onto the k principal components

Note that sometimes, as in Fig. 5.14(b), one might view the projected points Y in the original space Rd
(and not in Rk ). What one simply needs to do is to compute W · Y; the columns of this matrix are the
desired points in Rd .
An example of a PCA for 50 data points in R2 projected onto the first principal component is shown
in Fig. 5.14.
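As a small usage example for the snippet above (synthetic, centered data; all values are illustrative), one might generate correlated two-dimensional data and project it onto the first principal component:

% Usage sketch for the PCA code above.
rng(1);
X = randn(2, 50);
X(2, :) = 0.5*X(1, :) + 0.2*randn(1, 50);        % correlated 2-dimensional data
X = normalize(X', 'center')';                    % center the data
k = 1;
C = cov(X');  [W, D] = eigs(C, k);  Y = W'*X;    % PCA as above
Xproj = W*Y;                                     % projected points viewed in R^d (cf. Fig. 5.14(b))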

Figure 5.14: PCA in R2 for k = 1. (a) The 50 data points, (b) First principal component in red, second
principal component in dashed red, projected points in the original space are shown in black on the
red line.

5.4 Reinforcement Learning and Markov Decision Processes

We have already mentioned that reinforcement learning as a technique falls between the two categories
of supervised and unsupervised learning. Reinforcement learning allows an agent to learn how to
behave in an environment, where the only feedback is the reward signal. The agent's goal is to perform
actions that maximize future reward.
To make this precise we introduce the notion of Markov decision processes, which can be considered
to provide a formal framework for reinforcement learning.

Denition 5.5
A Markov decision process (MDP) is a tuple (Ω, A, T, R) in which
ˆ Ω is the set of states,
ˆ A is a nite set of actions,

74
ˆ T is a transition function T : Ω × A × Ω → [0, 1], which, for T (ω, a, ω ′ ), gives the
probability that action a performed in state ω will lead to state ω ′ ,
ˆ and R is a reward function dened as R : Ω × A → R.
A policy ρ is a function ρ : Ω → A, which tells the agent which action to perform in any given state.
Application of a policy to an MDP proceeds as follows. First, a start state ω0 is generated. Then,
the policy ρ proposes the action a0 = ρ(ω0 ), and this action is performed. Based on the transition
function T and reward function R, a transition is made to state ω1 with probability T (ω0 , a0 , ω1 ), and
a reward R(ω0 , a0 ) is received. This process continues, producing a sequence ω0 a0 ω1 a1 ω2 a2 · · · (If the
process is about to end in a final state, as for instance in finite games, then one can consider restarting
the process in a new initial state; this would always result in an infinite sequence.) Hence, given a
fixed policy, MDPs are in fact Markov chains. The sequence r0 , r1 , . . . of rewards received by the above
sequence is r0 = R(ω0 , a0 ), r1 = R(ω1 , a1 ), . . .
The main goal of learning in an MDP is to `learn' (i.e., find ) a policy that gathers rewards. If the
agent was only concerned about the immediate reward at a fixed time t, a simple optimality criterion
would be to optimize E[rt ], where rt = R(ωt , at ). There are, however, several ways of taking the future
into account. For instance, one could aim at optimizing

    E[ Σ_{t=0}^{h} rt ]     or     E[ Σ_{t=0}^{∞} γ^t rt ],

where h is a fixed number of time steps and γ < 1 a given parameter. The former function with
the finite sum is used in so-called finite horizon approaches, the latter is used in so-called discounted,
infinite horizon approaches.
A popular and mathematically rigorous method for finding an optimal policy for an MDP (where really
all the data, i.e., (Ω, A, T, R), is given) is so-called dynamic programming. This is a method which is
also used in other contexts, such as for finding shortest paths in a graph. The interested reader may
find [14] to be a good starting point to find out more about MDPs.

Example 5.10
Consider the problem of nding an `optimal strategy' in a game of tic-tac-toea . We can model
this as an MDP.
A position of the board (i.e., the configuration of X s and Os) can be considered as a state (states
can be enumerated, hence Ω can be considered to be a set of natural numbers Ω = {1, . . . , 3^9 }).
The actions correspond to the moves made by the agent (again these could be enumerated).
What would the transition function tell us? T (ω, a, ω ′ ) could, for instance, tell us with what
probability the opponent would bring up the position ω ′ if we were in position ω performing
move a. Such a model would be adequate if the agent should learn to play against a specific type
of player (one which does not necessarily play perfectly). In practice such a transition matrix
needs to be learned as well, and we will comment on this at the end of this example.
For the reward function, it would be most natural to assign to all position-action pairs resulting
in a win a positive value, to all losing position-action pairs a negative value, and all others a
zero value.
The goal of the agent should be to receive a positive reward, which means winning the
game.
As mentioned, if all the data were known one could apply dynamic programming and obtain
an optimal policy/strategy. But suppose, T needs to be learned as well. Then it would be natural
to play many times against the opponent and record the transitions as approximations of the
probabilities. (Which, of course, assumes that the opponent does not change its style of playing

after a period of time.) This can then also be combined with strategies that learn how `valuable'
a given position in the game actually isb .
a In this game, players alternate placing pieces (typically X s for the rst player and O s for the second) on a
3 × 3 board. The rst player to get three pieces in a row (vertically, horizontally, or diagonally) is the winner.
b A good webpage for further studies on this theme, including sample code, can be found at https://www.
codeproject.com/Articles/1400011/Reinforcement-Learning-A-Tic-Tac-Toe-Example.

5.5 Neural Networks

In this section we give a brief introduction to neural networks. In particular, we will discuss feedforward
networks, Hopfield networks, and Boltzmann machines. We will keep the exposition as short as possible.
The interested reader can find more information in [4, 1].

5.5.1 Terminology
(Artificial) neural networks are usually used for supervised learning (there do exist examples for un-
supervised and reinforcement learning, too). They try to model the apparently highly nonlinear in-
frastructure of brain networks. The historically first artificial neural network (called perceptron) was
invented in 1958 by psychologist Frank Rosenblatt58 .
Neural networks are composed of layers of computational units called nodes (or neurons) with con-
nections between different layers. Neurons are (typically nonlinear) parameterized functions of their input
variables. In a very common model, called McCulloch-Pitts59 neuron model, the output sj of a
node j is given as

    sj = ψ( Σ_i ωi,j si + θj ),

where ωi,j is the weight of the connection between node i and j, θj is a given constant (called threshold
or bias) and ψ is the so-called activation function. We will discuss several popular activation
functions further below. Fig. 5.15 shows a single McCulloch-Pitts neuron with three inputs.

Figure 5.15: A single McCulloch-Pitts neuron: the inputs s1 , s2 , s3 are combined with weights
ω1,4 , ω2,4 , ω3,4 and threshold θ4 into the output s4 = ψ(ω1,4 s1 + ω2,4 s2 + ω3,4 s3 + θ4 ).

Some of the nodes may be so-called input nodes, which means that their input is externally provided
(involving no computations). Some of the nodes may be output nodes, which means that the user
can read them out. Other types of nodes are hidden nodes, they are used internally solely for
computations.
To specify a neural network one needs to specify the network architecture (i.e., the nodes and
connections), the updating rules (i.e., the functions evaluated by each node), and the learning
rules (i.e., procedures to compute the weights from training data).
58
American psychologist (July 11, 1928  July 11, 1971). In our language below, a perceptron is a McCulloch-Pitts
neuron with binary inputs and Heaviside activation function.
59
Named after Warren Sturgis McCulloch (Nov. 16, 1898  Sept. 24, 1969) and Walter Harry Pitts, Jr. (April 23,
1923  May 14, 1969).

Networks without any cycle, also called feedforward networks, can be thought of as evaluating a
function f : Rd → Rn (the input is given by the input nodes, the output by the output nodes).
Networks containing cycles are called recurrent networks.

Figure 5.16: Examples of neural network architectures. (a) Feedforward network, (b) Recurrent net-
work.

5.5.2 Activation Functions


It can be shown that neural networks are very limited if one uses only (affinely) linear activation
functions.
Classification problems based on linear activation functions will result in classification problems involv-
ing only linear classifiers (see, e.g., support vector machines). Even very simple logical computations,
such as XOR, cannot be performed by such a network (see the table below: plotting the four
combinations of (s1 , s2 ) for XOR in the plane shows that they cannot be linearly separated).

s1   s2   XOR(s1 , s2 )
0    0    0
0    1    1
1    0    1
1    1    0

Therefore, most neural networks make use of non-linear activation functions.


Some of the most popular activation functions are:
ˆ Rectified Linear Activation (ReLU),
ˆ Sigmoid (also known as logistic),
ˆ Hyperbolic Tangent (Tanh),
ˆ Signum Function, and
ˆ Heaviside60 Function.
60
Named after the English mathematician and physicist Oliver Heaviside (May 18, 1850  Feb. 3, 1925).

ReLU:       ψ(z) = z for z ≥ 0,  ψ(z) = 0 for z < 0
Sigmoid:    ψ(z) = 1/(1 + exp(−z))
Tanh:       ψ(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z))
Signum:     ψ(z) = 1 for z ≥ 0,  ψ(z) = −1 for z < 0
Heaviside:  ψ(z) = 1 for z ≥ 0,  ψ(z) = 0 for z < 0

Note the different ranges of the functions, which make them suitable in different situations. Generally
speaking, Sigmoid and Tanh activation functions were very popular in the past, but their popularity
seems to be declining in the realm of deep learning. In contrast to ReLU, the Sigmoid and Tanh activation
functions are observed to be more problematic to use in deep networks due to the so-called vanishing
gradient problem61 . ReLU does not seem very problematic in this respect, and it is also a function that
is easy to compute. Signum and Heaviside are often used in classification problems (at least applied
in the last layer).

Name        Function                                  Derivative

ReLU        ψ(z) = z for z ≥ 0, 0 for z < 0           ψ′(z) = 1 for z > 0, 0 for z < 0, undefined for z = 0
Sigmoid     ψ(z) = 1/(1 + e^(−z))                     ψ′(z) = e^(−z)/(1 + e^(−z))² = ψ(z)(1 − ψ(z))
Tanh        ψ(z) = (e^z − e^(−z))/(e^z + e^(−z))      ψ′(z) = 1 − tanh²(z)
Signum      ψ(z) = 1 for z ≥ 0, −1 for z < 0          ψ′(z) = 0 for z ≠ 0, undefined for z = 0
Heaviside   ψ(z) = 1 for z ≥ 0, 0 for z < 0           ψ′(z) = 0 for z ≠ 0, undefined for z = 0

Table 5.5: Derivatives of common activation functions.

61
We are not going into these details here, but roughly speaking the small derivatives of the Sigmoid and Tanh functions
for larger inputs x can cause problems in training deep networks. The backpropagation algorithm, often used for training,
requires computations of gradients over multiple layers. If these values are small one obtains an exponential decrease in
the gradient  the gradient is vanishing and is therefore not useful for training.

5.5.3 Feedforward Networks
Example 5.11
Consider the following feedforward network
(input nodes s1 , s2 ; hidden nodes s3 , s4 with thresholds θ3 , θ4 ; output nodes s5 , s6 with thresholds
θ5 , θ6 ; weights ω1,3 , ω2,3 , ω1,4 , ω2,4 into the hidden layer and ω3,5 , ω4,5 , ω3,6 , ω4,6 into the output layer)

with Sigmoid activation functions

    ψi (z) = 1/(1 + exp(−z)),
i = 3, 4, 5, 6. The weights ωi,j and thresholds θi of the network are given as:

ω1,3 = 3 ω3,5 = 1, θ3 = −3,


ω2,3 = 1 ω4,5 = 1, θ4 = −2,
ω1,4 = −1 ω3,6 = −1, θ5 = −1,
ω2,4 = 2 ω4,6 = −2, θ6 = 2.

Given the input s1 = 1 and s2 = 0 we would like to determine the output (s5 and s6 ).
We need to calculate the weighted sums of the hidden nodes

z3 = ω1,3 s1 + ω2,3 s2 + θ3 ,
z4 = ω1,4 s1 + ω2,4 s2 + θ4 ,

apply the activation functions to obtain

s3 = ψ3 (z3 ),
s4 = ψ4 (z4 ),

and repeating this for the output nodes

z5 = ω3,5 s3 + ω4,5 s4 + θ5 ,
z6 = ω3,6 s3 + ω4,6 s4 + θ6 ,

yielding

s5 = ψ5 (z5 ),
s6 = ψ6 (z6 ).

This gives: z3 = 0, z4 = −3, s3 = 0.5, s4 = 0.0474, z5 = −0.4526, z6 = 1.4052, and finally

s5 = 0.3887, s6 = 0.8030.
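The forward pass of this example can be reproduced with the following minimal matlab sketch (the variable names are illustrative):

% Forward pass for the network of Example 5.11.
sigmoid = @(z) 1./(1 + exp(-z));
s1 = 1;  s2 = 0;                                  % input
w13 = 3;  w23 = 1;  w14 = -1;  w24 = 2;           % weights into the hidden layer
w35 = 1;  w45 = 1;  w36 = -1;  w46 = -2;          % weights into the output layer
th3 = -3; th4 = -2; th5 = -1; th6 = 2;            % thresholds

s3 = sigmoid(w13*s1 + w23*s2 + th3);              % hidden layer
s4 = sigmoid(w14*s1 + w24*s2 + th4);
s5 = sigmoid(w35*s3 + w45*s4 + th5)               % approx. 0.3887
s6 = sigmoid(w36*s3 + w46*s4 + th6)               % approx. 0.8030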

The computations performed in the previous example, i.e., computing the output for a given input, are
called a forward pass (or forward propagation).
But now suppose we would like to train the network. I.e., given the input, say s1 = 1, s2 = 0, we
would like to adapt the weights and thresholds in such a way that the neural network gives a prescribed
output y, say y = (s5 , s6 ) = (1, 1). The prescribed output y is called target output.
This is where the so-called backpropagation algorithm comes into play. We describe only the basics, con-
sidering only feedforward networks with Sigmoid activation functions (other cases are similar but
outside the scope of this course). The backpropagation algorithm is, in fact, an efficient application of
Leibniz's 62 chain rule for differentiation.

Learning Rule: The Backpropagation Algorithm Consider a feedforward network with Sigmoid
activation functions. The weights and thresholds of the network are denoted by w = (ω1,1 , . . . , ωN,N )
and θ = (θ1 , . . . , θN ). Let x ∈ Rd and h(w, θ, x) ∈ Rn denote the input and output of the network,
respectively. Further, let y ∈ Rn denote the target output. With h(w, θ, x)k we denote the k th
component of the vector h(w, θ, x).
We can motivate the backpropagation learning algorithm as gradient descent on the squared error/loss
function

    E = Ex,y (w, θ) = ||y − h(w, θ, x)||² = Σ_{k=1}^{n} (yk − h(w, θ, x)k )² .

We write Ex,y (w, θ) instead of E(w, θ, x, y) to emphasize that x and y are fixed and we try to adjust
the weights w and thresholds θ.
Gradient descent is an optimization method for finding a local minimum of a differentiable function
(here the error/loss function Ex,y (w, θ)). With α > 0 denoting a chosen step length (often referred to
as learning rate parameter) and ∇Ex,y (w, θ) denoting the gradient of Ex,y (w, θ), gradient descent
generates a sequence w(t+1) , θ(t+1) , t = 0, 1, 2, . . . , as follows.

Gradient descent:

    (w(t+1) , θ(t+1) ) ← (w(t) , θ(t) ) − α ∇Ex,y (w(t) , θ(t) ),    t = 0, 1, 2, . . .
We now want to compute ∇Ex,y (w(t) , θ(t) ). First, consider an output node j. Note that

    h(w, θ, x)j = ψ( Σ_i ωi,j si + θj ) = ψ(zj ) = sj .

Leibniz's chain rule states

    ∂E/∂ωi,j = (∂E/∂zj ) · (∂zj /∂ωi,j ).

Hence, we obtain

    ∂E/∂zj = ∂/∂zj Σ_k (yk − ψ(zk ))²
           = ∂/∂zj (yj − ψ(zj ))²
           = −2(yj − ψ(zj ))ψ′(zj )                       (Leibniz)
           = −2(yj − ψ(zj ))ψ(zj )(1 − ψ(zj ))            (∗1 )
           = −2(yj − sj )sj (1 − sj )
62
Gottfried Wilhelm (von) Leibniz (July 1, 1646  Nov. 14, 1716). German mathematician, philosopher, scientist and
diplomat.

and

    ∂zj /∂ωi,j = ∂/∂ωi,j ( Σ_i ωi,j si + θj ) = si .                              (5.5)

Note that (∗1 ) is a property of the Sigmoid function; see Table 5.5. Thus,

    ∂E/∂ωi,j = (∂E/∂zj ) · (∂zj /∂ωi,j ) = −2(yj − sj )sj (1 − sj ) si = −∆j si ,   where ∆j := 2(yj − sj )sj (1 − sj ).

Similarly we have

    ∂E/∂θj = (∂E/∂zj ) · (∂zj /∂θj ) = −∆j · 1.
Hence, the updating rule for the parameters for any output node j is:

    ωi,j^(t+1) ← ωi,j^(t) + α ∆j si ,   for any i feeding into j,
    θj^(t+1)  ← θj^(t) + α ∆j .                                                   (5.6)

So far, we have considered the nodes j in the output layer. Let us consider a node j in the previous
layer, a node i feeding into j and j feeding into an output node k (so si feeds via the weight ωi,j into
node j, which has threshold θj , and sj in turn feeds into the output node sk ).
We have

    h(w, θ, x)k = sk = ψ(zk ) = ψ( ωj,k sj + Σ_{l≠j} ωl,k sl + θk ),

and hence in ψ(zk ) only the term ωj,k sj depends on ωi,j . Using this we obtain

    ∂E/∂ωi,j = Σ_k ∂/∂ωi,j (yk − h(w, θ, x)k )²
             = − Σ_k 2(yk − sk ) ∂/∂ωi,j ψ(zk )                           (Leibniz)
             = − Σ_k 2(yk − sk ) ψ′(zk ) ωj,k ∂sj /∂ωi,j                  (Leibniz)
             = − Σ_k 2(yk − sk ) ψ′(zk ) ωj,k ∂/∂ωi,j ψ(zj )
             = − Σ_k 2(yk − sk ) ψ′(zk ) ωj,k ψ′(zj ) ∂zj /∂ωi,j          (Leibniz)
             = − Σ_k 2(yk − sk ) ψ′(zk ) ωj,k ψ′(zj ) si                  (by (5.5))
             = − Σ_k 2(yk − sk ) sk (1 − sk ) ωj,k sj (1 − sj ) si        (by (∗1 ))
             = − ( sj (1 − sj ) Σ_k ∆k ωj,k ) si =: −∆j si .

Similarly, we have

    ∂E/∂θj = Σ_k ∂/∂θj (yk − h(w, θ, x)k )² = −∆j · 1.

We therefore obtain the same updating rule as in (5.6); the only difference is that ∆j for such j is
defined as

    ∆j = sj (1 − sj ) Σ_k ∆k ωj,k .                                               (5.7)

We can summarize this now as follows.

Single-Pass Backpropagation
(for feedforward networks with Sigmoid activation functions):
1. For each output node j compute ∆j := 2(yj − sj )sj (1 − sj ) and update the ωi,j^(t+1) and θj^(t+1)
according to (5.6).
2. For layer l = ℓ − 1, ℓ − 2, . . . , 1 :
For each node j in the lth layer (i.e., feeding into nodes sk in layer l + 1) compute
∆j according to (5.7) and update according to (5.6).
3. Set t ← t + 1.

Note that the version that we described here bases all computations on the non-updated ωi,j^(t) and θj^(t) .
Only after finishing the complete single-pass backpropagation should one set the parameters to their
updated values (i.e., set t ← t + 1). In the literature there exist also many other variants. It should
also be noted that we described just a single pass (all weights and thresholds are updated only once).
Typically, a single pass is iterated for several thousands of iterations to achieve a substantial decrease
in the error/loss function. Also note that the update computations can be performed rather efficiently
since the computations of the ∆j in layer l involve only the values of the previously computed ∆k
values from layer l + 1.
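Continuing the forward-pass sketch from Example 5.11, a single backpropagation pass with target output y = (1, 1) might look as follows in matlab (the learning rate α = 0.5 is an arbitrary, illustrative choice):

% One single-pass backpropagation step for the network of Example 5.11.
alpha = 0.5;  y5 = 1;  y6 = 1;

d5 = 2*(y5 - s5)*s5*(1 - s5);                     % deltas of the output nodes (step 1)
d6 = 2*(y6 - s6)*s6*(1 - s6);
d3 = s3*(1 - s3)*(d5*w35 + d6*w36);               % deltas of the hidden nodes via (5.7) (step 2)
d4 = s4*(1 - s4)*(d5*w45 + d6*w46);

% updates according to (5.6), all based on the old parameter values
w35 = w35 + alpha*d5*s3;  w45 = w45 + alpha*d5*s4;  th5 = th5 + alpha*d5;
w36 = w36 + alpha*d6*s3;  w46 = w46 + alpha*d6*s4;  th6 = th6 + alpha*d6;
w13 = w13 + alpha*d3*s1;  w23 = w23 + alpha*d3*s2;  th3 = th3 + alpha*d3;
w14 = w14 + alpha*d4*s1;  w24 = w24 + alpha*d4*s2;  th4 = th4 + alpha*d4;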

5.5.4 Hopfield Networks


Hopfield networks played an important historical part in the development of neural networks. They
were invented in 198263 by the physicist John Hopfield64 , who demonstrated with these networks
that a so-called associative memory can be formed via activity-dependent changes in the strength of
connections between coactive neurons during training. An associative memory is able to retrieve a set
of previously memorized patterns from their noisy versions. This can be considered as a form of noise
reduction.
The architecture of the Hopfield network is a network of N nodes, which are fully interconnected with
symmetric weights (ωi,j = ωj,i ), without self-connections (ωi,i = 0, for all i ∈ {1, . . . , N }), and all of
whose nodes are both input and output nodes. For an example of an N = 2 · 2 = 4 Hopfield network,
see Fig. 5.16(b).
Since Hopfield networks contain loops (it is a so-called recurrent network) one should not view the
inputs to the nodes as being just externally provided, instead one should view them as being outputs
from other nodes that change with time. Hopfield networks are dynamical systems whose state
changes with time. The state of the neural network is the set of the outputs of all nodes at
a particular moment in time. The output sj of node j can therefore also be referred to as the state of
node j (at a particular moment in time).
⁶³ The paper is freely available at https://www.pnas.org/content/pnas/79/8/2554.full.pdf.
⁶⁴ John Hopfield (born July 15, 1933), Nobel Prize in Physics in 2024.

One of the most important contributions in the 1982 paper was the introduction of the idea of an
energy function into neural network theory. For the Hopfield network the energy function H is
\[
H = H(s_1,\ldots,s_N) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\omega_{i,j}\, s_i s_j \;-\; \sum_{i=1}^{N} s_i\,\theta_i, \tag{5.8}
\]

which gives the energy of the network as a function of the current states s1, . . . , sN (assuming that
the weights and the thresholds θ1, . . . , θN are given). We have seen this type of energy function before
when we discussed the Ising model in Example 4.9 (see the exponent of e in the definition of π).
Updating a node in the Hopfield network⁶⁵ is performed via the signum activation function, i.e.,
by
\[
s_j = \psi\Bigl(\sum_i \omega_{i,j}\, s_i + \theta_j\Bigr) =
\begin{cases}
+1 & \text{if } \sum_i \omega_{i,j}\, s_i + \theta_j \ge 0,\\
-1 & \text{otherwise.}
\end{cases}
\]

Such updates may be performed either asynchronously (nodes are updated consecutively in a predefined
order) or synchronously (all nodes are updated at the same time).
Remarkably, H never increases (i.e., it decreases strictly or stays constant) as the system evolves
according to its updating rule.
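
As a small illustration (a sketch under assumed conventions, not part of the original notes), the energy (5.8) and one asynchronous sweep of updates can be computed as follows for a given ±1 state vector s, symmetric weight matrix W with zero diagonal, and threshold vector theta.

import numpy as np

def energy(s, W, theta):
    # Energy (5.8): H = -1/2 * sum_{i,j} w_{i,j} s_i s_j - sum_i s_i theta_i.
    return -0.5 * s @ W @ s - s @ theta

def asynchronous_sweep(s, W, theta):
    # One sweep of asynchronous updates with the signum activation function:
    # s_j <- +1 if sum_i w_{i,j} s_i + theta_j >= 0, and -1 otherwise.
    s = s.copy()
    for j in range(len(s)):
        s[j] = 1 if W[:, j] @ s + theta[j] >= 0 else -1
    return s

Iterating asynchronous_sweep until the state no longer changes drives the network into a local minimum of H; the value of energy(s, W, theta) never increases from one update to the next (Theorem 5.6 below).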

Theorem 5.6
The energy function H of a Hopfield network does not increase (i.e., it decreases or stays constant) after any single updating step.

Proof. Suppose node j changes its state from sj = α to sj = β (we assume α, β ∈ {±1}; the case
α, β ∈ {0, 1} follows analogously). Since ωi,j = ωj,i and ωj,j = 0, the terms of H involving sj contribute
−sj(Σ_i ωi,j si) − sj θj, and hence
\[
\begin{aligned}
\Delta H &= H(s_1,\ldots,s_{j-1}, s_j = \beta, s_{j+1},\ldots,s_N) - H(s_1,\ldots,s_{j-1}, s_j = \alpha, s_{j+1},\ldots,s_N)\\
&= \Bigl(-\sum_i \omega_{i,j}\,\beta\, s_i - \beta\,\theta_j\Bigr) - \Bigl(-\sum_i \omega_{i,j}\,\alpha\, s_i - \alpha\,\theta_j\Bigr)\\
&= (\alpha - \beta)\Bigl(\sum_i \omega_{i,j}\, s_i + \theta_j\Bigr).
\end{aligned}
\]
Now, if α = −1 and β = +1, then, since we were updating, Σ_i ωi,j si + θj ≥ 0, and thus
∆H = (α − β)(Σ_i ωi,j si + θj) ≤ 0. If α = +1 and β = −1, then similarly Σ_i ωi,j si + θj < 0,
and therefore ∆H = (α − β)(Σ_i ωi,j si + θj) < 0. In both cases the energy does not increase
(in the second case it even decreases strictly).
The memorized patterns are the local minima of the energy function. Thus it is in principle possible
to use Hopfield networks to solve optimization problems.
So, how does a Hopfield network learn? The learning rule is rather simple. Suppose we have m patterns
ξ^{(ℓ)} = (ξ_1^{(ℓ)}, . . . , ξ_N^{(ℓ)})^T with ξ_i^{(ℓ)} being equal to the state s_i in the ℓ-th pattern, ℓ = 1, . . . , m. Then the
learning rule for weight ω_{i,j} with i ≠ j is
\[
\omega_{i,j} = \frac{1}{m}\sum_{\ell=1}^{m} \xi_i^{(\ell)}\, \xi_j^{(\ell)}. \tag{5.9}
\]

This is, up to a factor, the covariance between the states of nodes i and j (see the definition of c_{i,j}
in Section 5.3.2). We can see that we obtain a large weight if, most of the time, the states of nodes i
and j coincide in the training set. This is, in fact, a manifestation of a very intuitive learning rule
from neuroscience, the Hebbian learning rule⁶⁶, which can be paraphrased as stating 'Neurons that
fire together, wire together. Neurons that fire out of sync, fail to link.'
Hopfield, in his original paper, demonstrated via simulations that about 0.15N relevant patterns can
be stored in a Hopfield network consisting of N neurons. This is, by today's standards, not very
impressive, but it attracted considerable attention in the 1980s.

⁶⁵ What we describe here is, strictly speaking, a discrete Hopfield network. There exist also continuous versions; they use a different activation function.
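
For concreteness, here is a minimal sketch (illustrative only) of the learning rule (5.9): given the m memorized patterns as the rows of a ±1 matrix, the weights are the averaged outer products with the diagonal set to zero.

import numpy as np

def hebbian_weights(patterns):
    # Learning rule (5.9): w_{i,j} = (1/m) * sum_l xi_i^(l) xi_j^(l), with w_{i,i} = 0.
    # 'patterns' is an (m x N) array with entries +1/-1, one memorized pattern per row.
    m, N = patterns.shape
    W = (patterns.T @ patterns) / m
    np.fill_diagonal(W, 0.0)  # no self-connections
    return W

Combining hebbian_weights with the asynchronous update sketch above gives a complete toy associative memory.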

Example 5.12
Let us consider the Hopeld network for 3 · 3 = 9 nodes, each can be in a +1 or a −1 state.
The −1 state will be depicted in black, the +1 state in white. The nodes are numbered starting
from the top left going to the right, row by row (see Fig. 5.17(a)). Further we assume that we
set the thresholds θi to zero.

Figure 5.17: Hopfield network with 3 · 3 = 9 nodes. (a,b) learned patterns, (c) first input pattern,
(d) changed state, (e) second input pattern, (f) third input pattern.

Now, assume that we learn the two patterns shown in Fig. 5.17(a) and (b). The corresponding
weights ω_{i,j} (calculated according to (5.9)) are given by the following matrix (the entry in the
i-th row and j-th column gives the weight ω_{i,j}).

\[
W = \begin{pmatrix}
 0 & -1 &  0 &  0 & -1 & -1 & +1 &  0 &  0\\
-1 &  0 &  0 &  0 & +1 & +1 & -1 &  0 &  0\\
 0 &  0 &  0 & -1 &  0 &  0 &  0 & -1 & +1\\
 0 &  0 & -1 &  0 &  0 &  0 &  0 & +1 & -1\\
-1 & +1 &  0 &  0 &  0 & +1 & -1 &  0 &  0\\
-1 & +1 &  0 &  0 & +1 &  0 & -1 &  0 &  0\\
+1 & -1 &  0 &  0 & -1 & -1 &  0 &  0 &  0\\
 0 &  0 & -1 & +1 &  0 &  0 &  0 &  0 & -1\\
 0 &  0 & +1 & -1 &  0 &  0 &  0 & -1 &  0
\end{pmatrix}
\]

Now, suppose the input is the pattern shown in Fig. 5.17(c). When we update all states,
we find that only one single node changes its state, namely the blue-boxed node in Fig. 5.17(d).
The corresponding input weights sum up to
\[
\sum_{j=1}^{N} \omega_{i,j}\, s_j = -3.
\]

⁶⁶ Named after the Canadian psychologist Donald Hebb (July 22, 1904 – Aug. 20, 1985), who introduced this rule in his MA thesis.

Hence, the node will change into the state −1 (black). Then, no further updates will occur
(the final energy, which will not decrease any further, is here H = −16). The Hopfield network does indeed
recover the pattern in Fig. 5.17(a), which makes sense from the point of view of noise removal.
When we take as input the pattern shown in Fig. 5.17(e), then we will indeed recover the pattern
in Fig. 5.17(b).
Starting, however, with the input pattern shown in Fig. 5.17(f), we will not recover either of the
two learned patterns. The network gets stuck in a local minimum, which has energy
H = −8.
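
To experiment with this example, one could run asynchronous updates with the weight matrix W given above (thresholds zero); the following self-contained sketch does this. Since the patterns of Fig. 5.17 are not reproduced here, the starting state below is only a placeholder that should be replaced by the ±1 pattern of interest (e.g., the pattern of Fig. 5.17(c), (e), or (f)).

import numpy as np

# Weight matrix W from the example above (9 nodes), thresholds set to zero.
W = np.array([
    [ 0, -1,  0,  0, -1, -1,  1,  0,  0],
    [-1,  0,  0,  0,  1,  1, -1,  0,  0],
    [ 0,  0,  0, -1,  0,  0,  0, -1,  1],
    [ 0,  0, -1,  0,  0,  0,  0,  1, -1],
    [-1,  1,  0,  0,  0,  1, -1,  0,  0],
    [-1,  1,  0,  0,  1,  0, -1,  0,  0],
    [ 1, -1,  0,  0, -1, -1,  0,  0,  0],
    [ 0,  0, -1,  1,  0,  0,  0,  0, -1],
    [ 0,  0,  1, -1,  0,  0,  0, -1,  0],
], dtype=float)
theta = np.zeros(9)

s = np.ones(9)  # placeholder input; replace by the +/-1 pattern to be denoised

changed = True
while changed:                      # repeat asynchronous sweeps until a fixed point is reached
    changed = False
    for j in range(9):              # signum update of node j
        new_state = 1.0 if W[:, j] @ s + theta[j] >= 0 else -1.0
        if new_state != s[j]:
            s[j] = new_state
            changed = True

H = -0.5 * s @ W @ s - s @ theta    # energy (5.8) of the final state
print(s.reshape(3, 3))
print("final energy:", H)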

As in the previous example, Hopfield networks can get stuck in local minima. This happens because
there are usually several of them and the energy in each updating step decreases (or stays the same).
This is where Boltzmann machines come into play.

5.5.5 Boltzmann Machines


Boltzmann machines, named after Ludwig Boltzmann⁶⁷, are neural networks with the same architecture
as Hopfield networks (additionally, they can contain hidden nodes, but this is not important for the present
discussion). The difference is that the updating rule is stochastic, which allows the network to escape
local minima of the energy. Boltzmann machines were first presented in 1985⁶⁸; one of the
co-inventors, Terry Sejnowski⁶⁹, is a former PhD student of John Hopfield.
The Boltzmann machine's energy is defined in the same way as in (5.8) for Hopfield networks. Updating
the binary state of node j, however, proceeds as follows. First, one computes the energy
variation generated by a change of its state from −1 to +1. According to (5.8) this is
\[
\Delta H_j = H(s_1,\ldots,s_{j-1}, s_j = +1, s_{j+1},\ldots,s_N) - H(s_1,\ldots,s_{j-1}, s_j = -1, s_{j+1},\ldots,s_N)
= -2\theta_j - 2\sum_{i=1}^{N}\omega_{i,j}\, s_i.
\]

Then, node j turns into (or remains in) state +1 with the Metropolis-Hastings acceptance probability
\[
\min\Bigl(1, \exp\Bigl(-\frac{\Delta H_j}{T}\Bigr)\Bigr),
\]
where the (given) scalar T is referred to as the temperature of the system (see also simulated annealing).
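
As a sketch only, following the updating rule as just described (the function and variable names are invented for illustration), a single stochastic update could look as follows.

import numpy as np

def boltzmann_update(s, W, theta, j, T, rng):
    # Energy difference (via (5.8)) between s_j = +1 and s_j = -1, using the
    # symmetric weights with w_{j,j} = 0, so s_j's own entry contributes nothing.
    dH = -2.0 * theta[j] - 2.0 * (W[:, j] @ s)
    # Node j turns into (or remains in) state +1 with probability min(1, exp(-dH / T)).
    p_plus = 1.0 if dH <= 0 else np.exp(-dH / T)
    s[j] = 1.0 if rng.random() < p_plus else -1.0
    return s

Repeatedly picking a node (e.g., uniformly at random, with rng = np.random.default_rng()) and applying boltzmann_update generates the Markov chain discussed next.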
The updating procedure for Boltzmann machines can therefore be understood as generating a Markov
chain (it is, in fact, a form of Gibbs sampling). It is easily seen that this Markov chain is aperiodic
and irreducible, and therefore, by Theorem 4.3, the stationary distribution is
\[
\pi(s_1,\ldots,s_N) = \frac{1}{Z}\, e^{-\frac{1}{T} H(s_1,\ldots,s_N)}.
\]
Distribution functions of this form are often referred to as Boltzmann distributions. (This is, in
fact, why these machines are called 'Boltzmann machines.')
Boltzmann machines (in fact, restricted versions thereof) are nowadays considered to be powerful tools
for recommender systems. For instance, all three approaches winning the Netflix Prize⁷⁰ (which sought
to substantially improve the accuracy of predictions about how much someone is going to enjoy a movie
based on their movie preferences) involved (restricted) Boltzmann machines⁷¹.

⁶⁷ Ludwig Boltzmann (Feb. 20, 1844 – Sept. 5, 1906), Austrian physicist and philosopher.
⁶⁸ The paper is freely available at https://onlinelibrary.wiley.com/doi/pdf/10.1207/s15516709cog0901_7.
⁶⁹ Terry Sejnowski (born Aug. 13, 1947).
⁷⁰ https://en.wikipedia.org/wiki/Netflix_Prize.
With the interpretation of Boltzmann machines as Markov chains we have come, in some sense, full
circle.

⁷¹ Interestingly, it seems that Netflix never implemented and employed any of these algorithms. See https://www.techdirt.com/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml.

