Lecture Notes
Dr. Andreas Alpers
Symbol   Meaning
N        natural numbers: 1, 2, 3, . . .
N_0      non-negative integers: N ∪ {0}
Z        integers
R        set of real numbers
R^d      Euclidean space of dimension d with norm denoted by ‖ · ‖
Random phenomena are observed by means of experiments (performed either by man or nature).
Each experiment results in an outcome (also called sample point). The collection of all possible
outcomes ω is called the sample space Ω. Any subset A of the sample space Ω can be regarded as a
representation of some event. An outcome ω realizes an event if ω ∈ A. In probability theory, one
assigns to each event A ∈ F (where F is the collection of all subsets of Ω or, more generally, a collection of subsets of Ω satisfying certain axioms
making it a so-called σ-field) a number, the so-called probability of the event. Formally, a probability
on (Ω, F) is a mapping Pr : F → R satisfying the axioms
(i) 0 ≤ Pr(A) ≤ 1, for any A ∈ F,
(ii) Pr(Ω) = 1,
(iii) Pr(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} Pr(A_i), for any countable sequence of disjoint A_1, A_2, . . . ∈ F.
The triple (Ω, F, Pr) is called a probability space. It should be noted that the simple axioms (i)-(iii)
determine completely how probabilities operate. For instance, the probability that both events A and B
occur is Pr(A ∩ B) = Pr(A) + Pr(B) − Pr(A ∪ B). However, how should we interpret probabilities in the
real world? Does probability measure the physical tendency of something to occur, or is it a measure
of how strongly one believes it will occur, or does it draw on both these elements? This brings us to
philosophical questions. There are mainly two broad categories of probability interpretations, which
can be called `frequency' and `Bayesian' probabilities. In the former category, probability is interpreted
as the long-run frequency of the occurrence of each potential outcome when the identical experiment is
repeated indefinitely. In the latter category, probability is viewed as a numerical measure of subjective
uncertainty about the experimental result¹.
Table 1.2 summarizes some basic probability theory notation that we use.
Informally, one should think of random variables as variables whose values depend on outcomes of a
random phenomenon.
¹ If you are interested in finding out more about these philosophical issues, I recommend https://plato.stanford.edu/entries/probability-interpret/.
Name          Symbol           Discrete                                Continuous
pdf           f_X              Pr(X = x)                               (d/dx) F_X(x)
Probability   Pr(a < X ≤ b)    Σ_{a<x≤b} f_X(x) = F_X(b) − F_X(a)      ∫_a^b f_X(x) dx = F_X(b) − F_X(a)
Expectation   E[r(X)]          Σ_{x∈Ω} r(x) f_X(x)                     ∫_{−∞}^{∞} r(x) f_X(x) dx
Mean          µ = E[X]         Σ_{x∈Ω} x f_X(x)                        ∫_{−∞}^{∞} x f_X(x) dx
Definition 1.1
Given a probability space (Ω, F, Pr), a random variable is a real-valued function X on Ω such
that, for all real numbers t, the set {ω ∈ Ω : X(ω) ≤ t} belongs to F.
Example 1.1
Consider the experiment of tossing a die once.
The possible outcomes are ω = 1, 2, 3, 4, 5, 6, and the sample space is the set Ω = {1, 2, 3, 4, 5, 6}.
Take for X the identity function X(ω) = ω.
In that sense, X is a random number obtained by the experiment of tossing a die.
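As a quick illustration (a minimal matlab sketch, not part of the original notes), such a random variable can be simulated and its distribution estimated by relative frequencies:

% Minimal sketch (assumed, not from the notes): simulate the die-toss random
% variable X of Example 1.1 and estimate Pr(X = k) by relative frequencies.
n = 10^4;
X = randi(6, 1, n);          % n independent die tosses
freq = hist(X, 1:6)/n;       % relative frequencies, each approximately 1/6
disp(freq);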
Example 1.2
Consider the experiment of tossing a coin an infinite number of times.
As sample space Ω one can take the collection of all sequences ω = {a_i}_{i≥1} with a_i = 0 or a_i = 1
depending on whether the ith toss results in heads or tails. Define X_i to be the random number
obtained at the ith toss:
Xi (ω) = ai .
In the following, we will often use the notation Pr(X ≤ a) and Pr(X ∈ A), where a ∈ R and A is
a measurable set. This notation is an abbreviation for Pr(X ≤ a) = Pr({X ≤ a}) = Pr({ω ∈ Ω :
X(ω) ≤ a}) and Pr(X ∈ A) = Pr({X ∈ A}) = Pr({ω ∈ Ω : X(ω) ∈ A}), respectively.
Random variables can be discrete or continuous (or mixed).
Definition 1.2
A discrete random variable is one that can assume only finitely many or countably infinitely
many values (but only one at a time, of course).
Definition 1.3
The function fX that associates with each possible value of the (discrete) random variable X
the probability of this value is called the probability distribution function (pdf ) of X.
In some literature, pdfs are called probability mass functions. Note that f_X satisfies: (i) f_X(x_k) ≥ 0
for all x_k, and (ii) Σ_{k=1}^{∞} f_X(x_k) = 1.
Definition 1.4
The function F_X that associates with each real number x the probability Pr(X ≤ x) that the
random variable X takes on a value smaller than or equal to this number, i.e., F_X(x) = Σ_{x_k ≤ x} f_X(x_k),
is called the cumulative distribution function (cdf) of X.
Definition 1.5
A continuous random variable is a random variable that may take an uncountably infinite
number of values.
Definition 1.6
The probability density function (pdf) of a continuous random variable X is a function f_X
defined for all x ∈ R and having the following properties:
(a) f_X(x) ≥ 0 for any real number x;
(b) if A is any subset of R, then
    Pr(X ∈ A) = ∫_A f_X(x) dx.
Note that this pdf is different from the pdf of a discrete random variable. Indeed, f_X(x) does not give
the probability that the random variable X takes on the value x. What can be said is that,
for small ε > 0, f_X(x)ε is approximately equal to the probability that X takes on a value
in an interval of length ε about x.
Definition 1.7
The cumulative distribution function (cdf) F_X of a continuous random variable X is
defined by
    F_X(x) = Pr(X ≤ x) = ∫_{−∞}^{x} f_X(u) du
for any real number x (writing x⁻ for the upper limit would mean that the range of the integral is the open interval (−∞, x)).
It can be shown that the cdf of a continuous random variable is continuous.
Before we recall several important random variables, let us spend a few words on histograms. These
are special kinds of bar charts often used to depict a series of values, say, u_1, . . . , u_n of outcomes in an
experiment. A convenient subdivision of the x-axis is created containing the values, for example by
means of the points c_1 < · · · < c_K and a parameter w, called bin width. They establish intervals, so-called
bins, [c_1 − w/2, c_1 + w/2), [c_2 − w/2, c_2 + w/2), . . . , [c_K − w/2, c_K + w/2). A count is made of
how many of the values lie in each bin, and a bar whose height is this count divided by nw is depicted standing on
this bin. Technically, this is called a density histogram for equal size bin widths (to distinguish it from
other types of histograms), but since we will be using only this type of histogram we will usually omit
the word `density.' Note that the areas of the bars in that histogram sum up to 1.
Example 1.3
Consider the n = 20 numbers u1 , . . . , un = 1, 1, 3, 5, 7, 8, 8, 2, 3, 5, 6, 7, 7, 5, 6, 7, 4, 9, 1, 9. As bin
width w set w := 1, and as bin centers set ci := i, i = 1, . . . , 9. The corresponding histogram is
shown in Fig. 1.1, the corresponding matlab code is
f = hist(u,c); % f is a vector holding the bin counts, c holds the bin centers
bar(c, f/(n*w)); % draws bars with centers in c and height in f scaled by 1/nw
It is sometimes more convenient to specify only the number of bins and let matlab do the
rest. Replacing f=hist(u,c) by [f,c] = hist(u,9); w=c(2)-c(1); matlab will partition
the data automatically into 9 equal bins, the corresponding bin centers are returned in c, the
corresponding bin width w is then determined as c(2)-c(1). Another alternative is to use the
histogram or histcounts command.
[Figure 1.1: density histogram of the 20 values from Example 1.3 (bin width w = 1, bin centers 1, . . . , 9).]
Example 1.5
Consider a uniformly distributed continuous random variable X ∼ U (a, b), i.e., on the
interval (a, b) the probability density function should be constant.
As ∫_R f_X(x) dx = 1 is required, we see that

    f_X(x) = { 1/(b − a) : a < x < b,
               0         : otherwise.

As cdf we have

    F_X(x) = ∫_{−∞}^{x} f_X(u) du = { 0                : x ≤ a,
                                      (x − a)/(b − a)  : a < x < b,        (1.1)
                                      1                : b ≤ x.
Let us remark, because we will use this in the following rather often, that for X ∼ U(0, 1) we have

    F_X(x) = { 0 : x ≤ 0,                 and    f_X(x) = { 1 : 0 < x < 1,
               x : 0 < x < 1,                               0 : otherwise.
               1 : x ≥ 1,
Figure 1.2: Uniform distribution. Top row: Discrete random variable X ∼ U ({1, 2, 3, 4}). (left) pdf,
(right) cdf. Bottom row: Continuous random variable X ∼ U (1, 4). (left) pdf, (right) cdf.
The single most important random variable type, next to uniform random variables, is the normal
(also known as Gaussian) random variable, parametrized by a mean µ and variance σ² (see
Fig. 1.3). If X is a normal random variable, we write X ∼ N(µ, σ²). The pdf of a normal X ∼ N(µ, σ²)
is:
    f_X(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²}.
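For illustration, the following small matlab sketch (not from the notes) evaluates and plots this pdf for µ = 0 and σ² = 1, as in Fig. 1.3:

% Minimal sketch (assumed): evaluate and plot the normal pdf, here for the
% standard normal (mu = 0, sigma^2 = 1), cf. Fig. 1.3.
mu = 0; sigma = 1;
x = -3:0.01:3;
fX = exp(-0.5*((x - mu)/sigma).^2)/(sigma*sqrt(2*pi));
plot(x, fX, 'b', 'LineWidth', 2); xlabel('x'); ylabel('f_X(x)');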
Figure 1.3: Probability density function for the normal X ∼ N(µ, σ²), with µ = 0 and σ² = 1. With
these parameters X is typically referred to as standard normal.
Theorem 1.1
Let X ∼ N (µ, σ 2 ) and a, b ∈ R with a > 0. Then, Y = aX + b ∼ N (aµ + b, (aσ)2 ).
The interpretation of conditional probability is that if we are told that event B has already occurred, the consideration of whether A occurs should be confined to the smaller universe characterized by the occurrence
of B. Therefore, we say that two events A and B are independent if, and only if,
    Pr(A ∩ B) = Pr(A)Pr(B).
Note that the probabilities Pr(A|B) and Pr(B|A) are conceptually different quantities. While sometimes one of these quantities is known, one might want to compute the other. From (1.2), using the
fact that Pr(A ∩ B) is symmetric in A and B, we have
    Pr(A|B)Pr(B) = Pr(A ∩ B) = Pr(B|A)Pr(A).
Rearranging we obtain (one form of) Bayes' formula²:

    Pr(A|B) = Pr(B|A)Pr(A) / Pr(B).
Example 1.6
Suppose a patient exhibits symptoms that make her physician concerned that she may have a
particular disease. The disease is relatively rare in this population, with a prevalence of 0.1%
(meaning it affects 1 out of every 1,000 persons). The physician recommends a screening test
that is rather expensive. Before agreeing to the screening test, the patient wants to know what
will be learned from the test; specifically, she wants to know the probability of disease given a
positive test result.
The physician reports that the screening test is widely used and has a reported accuracy (here,
sensitivity and specificity) of 85%, which means that the test's true positive rate and true
negative rate are both 85%.
Let A denote the event that a person has the disease, and let B denote the event that the test
is positive. What needs to be computed is the probability Pr(A|B), i.e., the probability that
the person has the disease given that the test is positive.
From the data we know Pr(A) = 0.001. Also, Pr(B|A) = 0.85 and Pr(B|A^c) = 1 − 0.85 = 0.15, so by the law of total probability
    Pr(B) = Pr(B|A)Pr(A) + Pr(B|A^c)Pr(A^c) = 0.85 · 0.001 + 0.15 · 0.999 ≈ 0.1507,
and hence, by Bayes' formula, Pr(A|B) ≈ 0.85 · 0.001/0.1507 ≈ 0.0056.
Hence, if the patient undergoes the test and it comes back positive, there is a 0.56% chance that
she has the disease. Note, however, that without the test, there is a 0.1% chance that she
has the disease. In view of this, do you think the patient will have the screening test?
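The computation of Example 1.6 can be reproduced with a few lines of matlab (a minimal sketch, not part of the notes; the variable names are chosen here for illustration):

% Minimal sketch (assumed): Bayes' formula for the screening test of Example 1.6.
prevalence  = 0.001;     % Pr(A)
sensitivity = 0.85;      % Pr(B|A), true positive rate
specificity = 0.85;      % true negative rate
PrB = sensitivity*prevalence + (1 - specificity)*(1 - prevalence);  % law of total probability
PrAgivenB = sensitivity*prevalence/PrB;                             % Bayes' formula
fprintf('Pr(A|B) = %.4f\n', PrAgivenB);                             % approximately 0.0056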
Two random variables X and Y are called independent if for any two subsets of real numbers
A, B ⊆ R, the events {X ∈ A} and {Y ∈ B} are independent events, hence if
    Pr({X ∈ A} ∩ {Y ∈ B}) = Pr({X ∈ A})Pr({Y ∈ B}).
Example 1.7
Suppose a coin is thrown twice. Let X be defined to be the number of heads that are observed.
Then, another coin is thrown three times. This time the number of heads is recorded by Y.
What is Pr({X ≤ 2} ∩ {Y ≥ 1})?
Since X and Y are results of different independent coin tosses, the random variables X and Y
are independent. Then,
    Pr({X ≤ 2} ∩ {Y ≥ 1}) = Pr({X ≤ 2})Pr({Y ≥ 1}) = 1 · 7/8 = 7/8.
Let us also recall what a joint distribution of two random variables X and Y is. First, the discrete
case. If X and Y are discrete, their joint pdf is
fX,Y (x, y) = Pr({X = x} ∩ {Y = y}) (=: Pr(X = x and Y = y)).
² Named after the English mathematician and reverend Thomas Bayes (c. 1701 – April 7, 1761).
This is a function defined on discrete points in the x,y-plane; these points make up Ω. For an event A
we hence have
    Pr(X, Y ∈ A) = Σ_{(x,y)∈A} f_{X,Y}(x, y).
If X and Y are continuous, their joint cdf is
    F_{X,Y}(x, y) = Pr(X < x, Y < y) (=: Pr({X < x} ∩ {Y < y})).
The pdf is the derivative of the cdf. For a subset A of the x,y-plane,
    Pr(A) = ∫∫_{(x,y)∈A} f_{X,Y}(x, y) dx dy.
Example 1.8
A joint probability density can be created from any two univariate densities, simply by multiplying them together. Let X have density f_X and Y density f_Y; then f_{X,Y}(x, y) = f_X(x)f_Y(y)
is a joint density. A concrete example of this is the joint pdf of rolling two dice, where
f_{X,Y}(x, y) = (1/6) · (1/6) = 1/36 for x, y ∈ {1, . . . , 6}.
We briefly review here two fundamental results for convergence of random variables, which are widely
used in practice and in the context of Monte Carlo simulations. Proofs of the results can be found in
many textbooks on basic probability theory; see, e.g., [5].
Before we state the theorems, let us recall two notions of how a sequence of random variables becomes
more and more `stable' (i.e., converges to another random variable). Let X_1, X_2, . . . be a sequence of
random variables and let F_1, F_2, . . . be the corresponding sequence of cdfs. If the cdfs become more
and more similar to the cdf F of a common random variable X as n → ∞, then we say that they
converge to X in distribution. Mathematically, this means
    F_n(x) → F(x), as n → ∞, for every x ∈ R at which F is continuous.
Note that although we say a sequence of random variables converges in distribution, it is, by definition,
really the cdfs that converge, not the random variables. In this way this convergence is quite different
from convergence almost surely, which we recall in the following.
We say that X_n converge to X almost surely (abbreviated as a.s.) if
    Pr({ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1.
This type of convergence is similar to pointwise convergence of a sequence of functions, except that
the convergence may fail to hold only on a set of probability 0 (hence the `almost sure').
It can be shown that almost sure convergence implies convergence in distribution.
Example 1.9
Take the probability space on which all random variables are defined as that corresponding
to the U(0, 1) distribution. Thus Ω = (0, 1), and the probability of any interval in Ω is its
length.
Define
    X_n(ω) = { 0 : 0 < ω < 1/2 − 1/n,
               n : 1/2 − 1/n ≤ ω < 1/2 + 1/n,
               0 : 1/2 + 1/n ≤ ω < 1.
For any ω < 1/2 select N_1 so that 1/N_1 < 1/2 − ω. If n ≥ N_1, then ω < 1/2 − 1/n,
hence X_n(ω) = 0. For any ω > 1/2 select N_2 so that 1/N_2 < ω − 1/2. If n ≥ N_2, then
ω > 1/2 + 1/n, hence X_n(ω) = 0. In summary,
    lim_{n→∞} X_n(ω) = 0 for every ω ≠ 1/2,
which implies Pr({ω : lim_{n→∞} X_n(ω) = 0}) = 1 and therefore X_n → 0 almost surely.
Let us now discuss the law of large numbers and the central limit theorem. Both theorems come in
different versions (depending on the assumptions imposed on the random variables). The law of large
numbers that we state here is, in fact, the strong law of large numbers.
Basically, both the law of large numbers and the central limit theorem give us a picture of what happens
when we take many independent samples from the same distribution.
Let X_1, X_2, . . . be independent and identically distributed random variables with mean µ and finite variance σ². Then:
    (Law of large numbers:)     (1/n) Σ_{i=1}^{n} X_i → µ almost surely;
    (Central limit theorem:)    √n ( (1/n) Σ_{i=1}^{n} X_i − µ ) → N(0, σ²) in distribution.
The law of large numbers tells us two things: First, the average of many independent samples is (with
high probability) close to the mean of the underlying distribution. And, second, the histogram of
many independent samples is (with high probability) close to the graph of the pdf of the underlying
distribution.
The central limit theorem, on the other hand, says that the average (or scaled average) of many
independent copies of a random variable is approximately a normal random variable. It gives a sense
of how fast (1/n) Σ_{i=1}^{n} X_i approaches µ as n increases.
We remark that as √n((1/n) Σ_{i=1}^{n} X_i − µ) → N(0, σ²) in distribution by the central limit theorem, we
obtain by dividing by √n and Theorem 1.1 the result that (1/n) Σ_{i=1}^{n} X_i − µ is approximately an N(0, σ²/n)
random variable for large n.
An illustration of the law of large numbers and the central limit theorem is given in Fig. 1.4 and 1.5,
respectively.
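A small matlab sketch (assumed, not the code used to produce Fig. 1.5) that reproduces the experiment behind Fig. 1.5 for a single value of n could look as follows:

% Minimal sketch (assumed): histogram of sqrt(n)*(1/n)*sum of n U(-1,1) samples,
% compared with the N(0, 1/3) density, cf. Fig. 1.5.
nSamples = 10^4; n = 20;
U = 2*rand(n, nSamples) - 1;               % U(-1,1) samples, one column per repetition
Z = sqrt(n)*mean(U, 1);                    % scaled averages
[f, c] = hist(Z, 50); w = c(2) - c(1);
bar(c, f/(nSamples*w)); hold on;           % density histogram
sigma2 = 1/3;                              % variance of U(-1,1)
t = -2.5:0.01:2.5;
plot(t, exp(-t.^2/(2*sigma2))/sqrt(2*pi*sigma2), 'r', 'LineWidth', 2);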
By a stochastic process, we shall mean a family of random variables {Xt }, where t is a point in a
space T called the parameter space, and where, for each t ∈ T, Xt is a point in a space S called the
state space.
The parameter t is often interpreted as `time.' For example, X_t can be the price of a financial asset
at time t. If T = N then we have nothing other than a sequence of random numbers. Sequences of
random numbers are thus a special case of random processes. It may also happen that t should be
interpreted as `space.' For instance, X_t might be the water temperature at a location t = (u, v, w) ∈ R³
[Figure 1.4: illustration of the law of large numbers; running sample mean plotted as a function of n.]
Figure 1.5: Illustration of the central limit theorem: Histograms of √n · (1/n) Σ_{i=1}^{n} X_i for n = 1, 2, 3, 20,
where the X_i ∼ U(−1, 1). Number of samples is 10^4. For n = 2, 3, 20 the pdf of the normal X ∼
N(0, σ²) with σ² = (1/12)(1 + 1)² = 1/3 is shown in red.
in the ocean. Or t = (u, v) ∈ R² might represent a pixel location in a computer image (which is often
useful in image processing applications). Also, mixed interpretations as `space-time' are possible. For
example, X_t with t = (u, v, w, s) ∈ R⁴ might be the ocean temperature measured at (u, v, w) ∈ R³ at
time s ∈ R.
The family {Xt } may be thought of as the path of a particle moving `randomly' in space S, its
position at time t being Xt . A record of one of these paths is called a realization or sample path of the
process.
Example 1.10
Let us consider the problem of modeling the score during a football match as a stochastic
process.
The state space S needs to represent all possible values the score can take. Hence, a suitable
choice is S = {(x, y) : x, y = 0, 1, 2, . . . }. Measuring time in minutes, we can take as parameter
space T the interval [0, 90] (not considering overtime). The process starts in state (0, 0), and
transitions take place between the states of S whenever a goal is scored. A goal increases x or
y by one, so the score (x, y) will then go to (x + 1, y) or (x, y + 1).
Example 1.11
Let X1 , X2 , . . . be a sequence of independent and identically distributed (i.i.d.) random vari-
ables. The process {Xi } is sometimes called i.i.d. noise. The parameter space is T = N and
the state space is S = R. A realization of this process is shown in Fig. 1.6(a). A histogram of
20, 000 realizations of X1,000 is shown in Fig. 1.6(c).
Example 1.12
Let Y_1, Y_2, . . . be a sequence of independent and identically distributed random variables. Define
    X_t := X_{t−1} + Y_t,  t ∈ N,  X_0 = 0.
The process {X_t} is called a random walk. The parameter space is T = N and the state space is
S = R. A sample path of this process is shown in Fig. 1.6(b). A histogram of 20,000 realizations
of X_{1,000} is shown in Fig. 1.6(d).
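A sample path as in Fig. 1.6(b) can be generated with a few lines of matlab (a minimal sketch, not the notes' code; the increments are taken, for illustration, to be U(−1, 1) distributed):

% Minimal sketch (assumed): sample path of a random walk with U(-1,1) increments.
T = 1000;
Y = 2*rand(1, T) - 1;          % i.i.d. increments Y_1, ..., Y_T
X = cumsum(Y);                 % random walk X_t = X_{t-1} + Y_t, X_0 = 0
plot(0:T, [0 X]); xlabel('t'); ylabel('X_t');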
Figure 1.6: I.i.d. noise and random walks: (a) sample path of i.i.d. noise, (b) sample path of a random
walk, (c) histogram of 20,000 realizations of the i.i.d. noise variable X_{1,000}, (d) histogram of 20,000
realizations of the random walk variable X_{1,000}.
2 Simulation: Theory and Practice
Many times we wish to simulate the results of an experiment by using a computer and random variables,
using the (pseudo-)random number generators available on computers. This is known as Monte Carlo
simulation³.
There can be several reasons why one wants to do that. For instance:
The experiment could be quite complicated to set up (or to control) in practice;
Monte Carlo methods are generally very simple to implement.
They can be used to solve a very wide range of problems, even problems that have no inherent
probabilistic structure, as, e.g., computation of multivariate integrals and solving linear systems
of equations. They can also be used to solve optimization problems.
In general, Monte Carlo methods can be divided into two types: direct (or simple) Monte Carlo and
Markov Chain Monte Carlo (MCMC).
Direct Monte Carlo: In this type of Monte Carlo the samples X_i that we generate are an
i.i.d. sequence. So the strong law of large numbers tells us that the average of the X_i, i.e., the
sample mean (1/n) Σ_{i=1}^{n} X_i, will converge to the mean of X as n → ∞. Furthermore, the central
limit theorem quantifies how fast this convergence takes place.
Direct Monte Carlo: Integration Consider the problem of computing a (possibly multi-dimensional)
definite integral
    µ := ∫_D g(x) dx,    (2.1)
for some g : R^d → R and interval D = [0, 1]^d ⊆ R^d (considering such a D makes our presentation
easier; by transformations this is typically not a real restriction). Let X be a random variable that
is uniformly distributed on D. Then, the expected value of g(X) is, by definition,
    E[g(X)] = ∫_D g(x) dx = µ.
Now, suppose we can draw independent and identically distributed random samples X_1, . . . , X_n uniformly from D (by a computer), and we set µ̂_n := (1/n) Σ_{i=1}^{n} g(X_i). Then the law of large numbers tells
us (for `nicely behaving' g and D) that almost surely µ̂_n → µ as n → ∞.
Let's analyze the accuracy of this approach. Let's assume that the assumptions of the central limit
theorem are met. Then the theorem tells us that √n(µ̂_n − µ) for n → ∞ converges in distribution to a
random variable X ∼ N(0, σ²). And, by Thm. 1.1, we therefore have for large n that E[µ̂_n] = E[µ] = µ
and E[(µ̂_n − µ)²] = var(µ̂_n) = σ²/n, where σ² is the variance of g(X). The quantity
    √(E[(µ̂_n − µ)²]) = σ/√n    (2.2)
is called the root mean square error (RMSE) of the estimator µ̂_n.
A classical deterministic alternative is Riemann approximation: evaluate g at n grid points t_1, . . . , t_n in D and set
    µ̃_n := (1/n) Σ_{i=1}^{n} g(t_i).
There are more sophisticated methods than this, but Riemann approximation is perhaps the one most
commonly taught in calculus classes. What can be shown (e.g., by using the Mean Value theorem) is
that the error for Riemann approximation is O(1/n^{1/d}) if g is reasonably smooth. This error rate depends
on d: For example, in a 10-dimensional space with D = [0, 1]^10, we will have to evaluate O(n^5) grid
points to guarantee that the error is O(1/√n). In Monte Carlo we only need O(n) points to get the
same error bound O(1/√n). Comparing n to n^5 may not seem to make such a big difference. But
when you wait for an algorithm to return a solution, you surely would prefer to wait 10 minutes instead
of 10^5 minutes (how many weeks is this?).
We remark that the central limit theorem can also be used to compute condence intervals for the
results obtained by direct Monte Carlo. The interested reader can nd more information for instance
in [10, Chapter 1].
One should also be aware of the following. Do not confuse the σ in (2.2) with the σ of the random
samples X_1, . . . , X_n that are drawn. The σ in (2.2) is the σ of g(X), i.e., σ² = var(g(X)) as we have
seen in connection with (2.2). Often var(g(X)) is not known and needs to be estimated from the data.
In most of our examples, however, we can compute it explicitly since
    var(g(X)) = E[(g(X) − µ)²] (∗)= E[g(X)²] − µ²,
and g(X) (hence E[g(X)²]) and µ are known⁵. We will come back to this point later when we discuss
importance sampling.
It is time to wrap things up.
⁴ One says f is in O(g) if, and only if, there is a positive real number C and a real number x_0 such that |f(x)| ≤ C g(x)
for all x ≥ x_0.
⁵ (∗) follows by standard manipulations: E[(g(X) − µ)²] = E[g(X)² − 2g(X)µ + µ²] = E[g(X)²] − 2E[g(X)]µ + E[µ²] =
E[g(X)²] − 2µ² + µ² = E[g(X)²] − µ², using only linearity of expectation.
Direct Monte Carlo for computing integrals of the form (2.1):
(1) Choose a large n.
(2) Draw independent and identically distributed random samples X1 , . . . , Xn uniformly
from D.
(3) Return µ̂_n = (1/n) Σ_{i=1}^{n} g(X_i) as an estimate of the integral.
Example 2.1
Suppose we want to compute
    µ = ∫_0^1 x³ dx.
Of course, we know that we should obtain µ = [x⁴/4]_0^1 = 1/4 = 0.25. Let's implement the direct
Monte Carlo method. As
    σ² = var(g(X)) = E[g(X)²] − µ² = ∫_0^1 x⁶ dx − 1/16 = 1/7 − 1/16 = 9/112,
we can also plot the RMSE σ/√n.
Here is the matlab code. Fig. 2.1 shows a sample path of the µ̂_n, the realized RMSE, and the
theoretical RMSE σ/√n.
nTrials = 6*10^2;

mu = 0.25;
sigma2 = 1/7 - mu^2;

u = rand(1, nTrials);
for n = 1:nTrials
    muhat(n) = mean(u(1:n).^3);   % here the function g(x) = x^3 comes in
end;

figure, plot(1:nTrials, muhat, 'b', 'LineWidth', 2); xlabel('n'); ylabel('\mu_n');
hold on; plot(1:nTrials, mu*ones(1, nTrials), 'r', 'LineWidth', 2);

figure, plot(1:nTrials, abs(muhat - mu), 'b', 'LineWidth', 2); xlabel('n'); ...
    ylabel('RMSE of \mu_n'); hold on;
plot(1:nTrials, sqrt(sigma2)./sqrt(1:nTrials), 'k', 'LineWidth', 2);

MCintegralexample1.m
Direct Monte Carlo: Buffon Needle Problem (Toy Example) The French nobleman, Comte
de Buffon⁶, posed the following problem in 1777: Suppose a needle is dropped at random onto a sheet of
paper ruled with equally spaced parallel lines. What is the probability that the needle crosses one of the lines?
⁶ Comte de Buffon (Sept. 7, 1707 – April 16, 1788).
Figure 2.1: Direct Monte Carlo for Example 2.1. (a) Sample path of µ̂_n as a function of n (blue) and
µ = 0.25 (red), (b) corresponding (realized) RMSE as a function of n (blue) and plot of σ/√n (black).
It is intuitively clear that the probability should be an increasing function of the length of the needle
and a decreasing function of the spacing between the parallel lines. But that the probability, as we
will see next, will depend on π is perhaps quite unexpected.
We denote, in the following, the distance between the parallel lines of the ruled paper by d and the
length of the needle by L, respectively. A short needle in our context satisfies L ≤ d. Note that such
a needle cannot cross two lines at the same time. Now assume we drop such a short needle. How can
we check whether the needle crosses one of the parallel lines?
Well, let's introduce two parameters X and θ, where X denotes the distance of the midpoint of the
needle to the nearest of the parallel lines and θ its angle away from the horizontal with 0 ≤ θ ≤ π/2.
(We ignore negative angles as that case is symmetric to the considered case, producing the same
probability.) It is easy to see that the needle crosses one of the lines if and only if X ≤ (L/2) sin(θ); see
Fig. 2.2.
Now we can model Buffon's experiment. Based on our notation, we can consider X to be a uniform
random variable over 0 ≤ X ≤ d/2. Its probability density function is
    f_X(x) = { 2/d : 0 ≤ x ≤ d/2,
               0   : otherwise.
Also θ can be considered to be a uniform random variable over 0 ≤ θ ≤ π/2, hence with probability
density function
    f_θ(θ) = { 2/π : 0 ≤ θ ≤ π/2,
               0   : otherwise.
Figure 2.2: Sketch of the Buffon needle problem. The needle crosses a line if and only if X/sin(θ) ≤ L/2,
or, in other words, X ≤ (L/2) sin(θ).
The two random variables, X and θ, are independent. Therefore the joint probability density function
of (X, θ) is
    f_{X,θ}(x, θ) = { 4/(πd) : 0 ≤ x ≤ d/2 and 0 ≤ θ ≤ π/2,
                      0      : otherwise.
The probability that the short needle crosses one of the parallel lines is therefore
    Pr(short needle crosses a line) = ∫_0^{π/2} ∫_0^{(L/2) sin(θ)} f_{X,θ}(x, θ) dx dθ
                                    = ∫_0^{π/2} ∫_0^{(L/2) sin(θ)} 4/(πd) dx dθ          (2.3)
                                    = (4/(πd)) · (L/2) · [−cos(θ)]_0^{π/2}
                                    = 2L/(πd).
So, this solves Buon's needle problem. But there is more to the story. We can actually use the
previous results to devise an experiment to approximate π in a random fashion⁷.
Suppose we set up an experiment (or a computer simulation) of throwing such a needle. The values L
and d can therefore be assumed to be known. Let Y_1, . . . , Y_n be an i.i.d. sequence of random variables
(functions of θ and X) with Y_i describing whether in the ith throw the short needle hits the line
(Y_i = 1) or whether it doesn't (Y_i = 0). Let A = {y : Y_i(y) = 1} denote the event that the needle hits
the line. Then,
    µ := E[Y_i] = ∫_{R²} Y_i(y) f_{Y_i}(y) dy = ∫_A f_{Y_i}(y) dy = Pr(A).
Now, by the law of large numbers, we know that the sample mean µ̂_n = (1/n) Σ_{i=1}^{n} Y_i will converge towards
µ = Pr(A) = (2L/d) · (1/π). Since we can record µ̂_n in the experiment we can therefore use, for large n, the
formula
    µ̂_n ≈ (2L/d) · (1/π)
⁷ The observant reader will notice that the computer simulation makes explicit use of π in generating the random
numbers (more precisely, to generate θ), so in a sense this is a little bit of cheating. This, however, can be fixed. See,
e.g., https://www.scirp.org/journal/paperinformation.aspx?paperid=74541.
for estimating π. The value of π is therefore approximately 2L/(d µ̂_n).
The following matlab code gives an implementation of this direct Monte Carlo method.
buffon1.m
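The listing buffon1.m is not reproduced in this excerpt; a minimal sketch of such a simulation (with illustrative values for L and d) could look as follows:

% Minimal sketch (assumed, not the original buffon1.m): direct Monte Carlo for
% Buffon's needle; L and d are illustrative choices with L <= d.
nTrials = 10^5;
L = 1; d = 2;                              % needle length and line spacing
X     = (d/2)*rand(1, nTrials);            % distance of the midpoint to the nearest line
theta = (pi/2)*rand(1, nTrials);           % angle of the needle
Y = (X <= (L/2)*sin(theta));               % Y_i = 1 if the ith needle crosses a line
muhat  = cumsum(Y)./(1:nTrials);           % running estimate of Pr(crossing) = 2L/(pi*d)
pi_est = (2*L/d)./muhat;                   % running estimate of pi
plot(1:nTrials, pi_est); xlabel('no of trials'); ylabel('pi\_est');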
[Figure: running estimate of π (pi_est) as a function of the number of trials (up to 10^5).]
n estimate of π
1 3.183263598326360
2 3.140597029326060
3 3.141621029325035
4 3.141591772182177
5 3.141592682404399
6 3.141592652615309
7 3.141592653623555
Table 2.1: Estimating π ≈ 3.141592653589793 by (2.4) using n terms in the evaluation of arctan by
arctan(x) = x − x³/3 + x⁵/5 − · · · . Correct digits are shown in blue.
⁸ John Machin (bapt. c. 1686 – June 9, 1751) was a professor of astronomy at Gresham College, London.
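Equation (2.4) itself is not reproduced in this excerpt; the values in Table 2.1 are consistent with Machin's formula π = 4(4 arctan(1/5) − arctan(1/239)) evaluated with the truncated arctan series, so a minimal matlab sketch under that assumption is:

% Minimal sketch (assumption: (2.4) is Machin's formula
% pi = 4*(4*arctan(1/5) - arctan(1/239)); with the n-term series
% arctan(x) ~ x - x^3/3 + x^5/5 - ... this reproduces the values in Table 2.1).
for n = 1:7
    k = 0:n-1;
    arctanN = @(x) sum((-1).^k .* x.^(2*k+1)./(2*k+1));   % n-term arctan series
    piEst = 4*(4*arctanN(1/5) - arctanN(1/239));
    fprintf('%d  %.15f\n', n, piEst);
end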
2.2 Variance Reduction and Importance Sampling
We have seen that the RMSE in direct Monte Carlo for integration problems is of the order O(σ/√n).
To increase the accuracy we can therefore increase n. However, sometimes there can be another way
of achieving this. We may construct a new Monte Carlo problem with the same answer as the original
one but with a smaller σ. Methods to do this are known as variance reduction techniques.
Out of the many variance reduction techniques from the literature (known under names such as stratified
sampling, control variates method, antithetic variates method, and Rao-Blackwellization), we select here
the concept of importance sampling.
Basic Idea Behind Importance Sampling The general idea behind importance sampling is to
sample from a not necessarily uniform distribution, but rather from a distribution that somewhat
resembles g (directing attention to important regions of the hypercube D, so to speak).
Suppose X is a random variable with pdf f_X such that f_X(x) > 0 on the set {x : g(x) ≠ 0} and
f_X(x) = 0 for x ∉ D. Let Y be the random variable Y := g(X)/f_X(X). Then,
    µ := ∫_D g(x) dx = ∫_D (g(x)/f_X(x)) f_X(x) dx = E_{f_X}[Y],
and µ can be estimated by
    µ̂_n = (1/n) Σ_{i=1}^{n} Y_i = (1/n) Σ_{i=1}^{n} g(X_i)/f_X(X_i);
this time the X1 , . . . , Xn are generated from the distribution with pdf fX . (One refers to fX in this
context also as importance function.) Of course, for this to work we need a method to sample
from fX . But assuming this can be achieved, we claim that for some importance functions it can
happen that varfX (Y ) can be smaller than the σ 2 that we have from the direct Monte Carlo method
that uses no importance sampling.
Comparing
    var_{f_X}(Y) = ∫_D (g(x)/f_X(x))² f_X(x) dx − µ²
                 = ∫_D g(x)²/f_X(x) dx − µ²,
with
    var(g(X)) = ∫_D g(x)² dx − µ²
from the direct Monte Carlo method using uniform samples, we see that f_X can indeed have an
influence on the variance. (We can even see that var_{f_X}(Y) can be zero if we can choose f_X = g/µ. To
verify that this pdf f_X integrates to 1, we would have needed to verify 1 = ∫_D f_X(x) dx = (1/µ) ∫_D g(x) dx.
So, we would have needed to solve the integral we are interested in estimating in the first place. Thus,
this is not a very realistic situation.)
Importance sampling for computing integrals of the form (2.1):
(1) Choose a large n.
(2) Draw samples X1 , . . . , Xn from a trial distribution with pdf fX .
(3) Return
    µ̂_n = (1/n) Σ_{i=1}^{n} g(X_i)/f_X(X_i)
as an estimate of the integral.
Example 2.2
Suppose we want to compute the following integral
    µ = ∫_0^1 √(1 − x²) dx, where we set g(x) := √(1 − x²).
From basic calculus we find µ = [½(x√(1 − x²) + arcsin(x))]_0^1 = ½ arcsin(1) = π/4 ≈ 0.78540. Let
X be a random variable X ∼ U(0, 1). Then,
    √(var(g(X)))/√n ≈ 0.2236/√n.
Suppose now we use
    f_α(x) := (1 − αx²)/(1 − α/3),  0 < x < 1,  0 < α < 1,
as importance function (we fix the value of α later). Indeed, any such f_α is a suitable pdf since
f_α(x) > 0 for x ∈ (0, 1) and
    ∫_0^1 f_α(x) dx = (1 − α/3)/(1 − α/3) = 1.
We can therefore consider fα as the pdf fX of a random variable X. Now, say we choose α = 0.74
(you can experiment with other values of α, too).
Suppose we can sample X_1, . . . , X_n from this distribution. Let Y := g(X)/f_X(X) and
Y_i := g(X_i)/f_X(X_i). Then, again, E_{f_X}[Y] = µ and
Figure 2.4: Illustration of Example 2.2: (a) Plots of g (blue), f (red); (b) Histogram of µ̂n (for n = 104
based on nTrials = 103 simulations) via direct Monte Carlo, sample mean µ shown in red; (c) same
as in (b) but now importance sampling with density f was used (samples were drawn by a method
called rejection sampling, see later chapters).
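A minimal matlab sketch (assumed, not the code behind Fig. 2.4) that compares direct Monte Carlo and importance sampling for Example 2.2, drawing the samples from f_α by rejection sampling against a uniform trial density, could look as follows:

% Minimal sketch (assumed) for Example 2.2: estimate mu = int_0^1 sqrt(1-x^2) dx.
n = 10^4; alpha = 0.74;
g  = @(x) sqrt(1 - x.^2);                   % integrand
fa = @(x) (1 - alpha*x.^2)/(1 - alpha/3);   % importance function f_alpha
M  = fa(0);                                 % f_alpha attains its maximum at x = 0
X = zeros(1, n); k = 0;
while k < n                                 % rejection sampling from f_alpha
    Y = rand; U = rand;
    if U <= fa(Y)/M
        k = k + 1; X(k) = Y;
    end
end
muhatIS = mean(g(X)./fa(X));                % importance sampling estimate
muhatMC = mean(g(rand(1, n)));              % direct Monte Carlo for comparison
fprintf('IS: %.5f   MC: %.5f   exact: %.5f\n', muhatIS, muhatMC, pi/4);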
We have seen that in direct Monte Carlo we need random numbers. Until now, we have just used
matlab's built-in functions. But how do they actually work? And how can we draw samples from more
complicated distributions? The basics of generating pseudo random numbers are discussed next.
3 Pseudo Random Number Generators
Random numbers can be generated from truly random physical processes (radioactive decay, thermal
noise, roulette wheel). The RAND Corporation, for instance, published in 1955 a book with a million
random numbers, obtained using an electric `roulette wheel.' This book is now publicly available at
http://www.rand.org/publications/classics/randomdigits/.
However, physical random numbers are generally not very useful for Monte Carlo, because the sequence
is not repeatable, the generators are often slow, and it can be complicated to feed these random numbers
into the computer.
We therefore need a way of generating `random' numbers by a computer. These numbers are often
called pseudo random numbers to stress the fact that they are not really random as they are
generated deterministically. In the following we will usually omit the word `pseudo' since we discuss
only these random numbers anyway.
Middle-Square and Other Middle-Digit Techniques One of the earliest recorded methods is
the middle-square method by John von Neumann (1946). The idea is the following.
Let n be an even number. To generate a sequence of n-digit random numbers, an n-digit starting
value is created and squared, producing a 2n-digit number. If the result has fewer than 2n digits,
leading zeroes are added to compensate. The middle n digits of the result are the next
number in the sequence and are returned as the result. This process is then repeated to generate
more numbers.
Example 3.1
Suppose we want to generate 4-digit (integer) numbers, i.e., n = 4. Let's take as initial seed
v_0 = 1234. Then, we obtain:
    v_0 = 1234  --squaring-->  01522756  --extract-->  5227,
    v_1 = 5227  --squaring-->  27321529  --extract-->  3215,
    v_2 = 3215  --squaring-->  10336225  --extract-->  3362,
    · · ·
function [z] = vonNeumannMiddleSquare(nnumbers, ndigits, seed)
% ::: seed must be an ndigits-digit natural number
% ::: returned numbers z(1), z(2), ..., z(nnumbers) are ndigits-digit integer numbers
fstring = sprintf('%%0%d.f', 2*ndigits); n2 = ndigits/2;

x(1) = seed;
for i = 1:nnumbers
    x(i+1) = x(i)^2;                    % square the number
    s = num2str(x(i+1), fstring);       % add leading zeros if necessary
    x(i+1) = str2num(s(n2+1:end-n2));   % extract middle ndigits digits
end;
z = x(2:end);
end

vonNeumannMiddleSquare.m
Fig. 3.1 shows some results for n = 4. As initial seed v_0 = 5810 is used since it gives a rather long chain
of non-repeating numbers. After the 108th generated number, the numbers repeat (actually, with a
fairly short cycle of 4100, 8100, 6100, 2100, 4100, . . . ). By the way, there are five numbers, namely 0,
100, 2500, 3792, and 7600, which, if taken as seed, generate no further numbers.
Figure 3.1: Middle-square method for n = 4. (a) Plot of the number of generated numbers until the first
repetition as a function of the seed (average: 43.7, maximum: 111), (b) sample path for seed v_0 = 5810
(blue), randomly drawn integers by matlab's randi for comparison (red).
Let us prove a fact about the middle-square method, which is rather undesirable: if at some point v_k < 10^{n/2}, then the subsequent numbers decrease strictly until the sequence reaches 0 (and stays there).
Proof. Since v_k < 10^{n/2} we have v_k² < 10^n. Therefore, v_{k+1} = ⌊v_k²/10^{n/2}⌋ ≤ v_k²/10^{n/2} (note that v_k²
has 2n digits including leading zeros). If v_k > 0, then v_{k+1} ≤ v_k²/10^{n/2} < v_k · 10^{n/2}/10^{n/2} = v_k.
As an illustration consider the middle-square method for generating n = 4 digit numbers (note: you
always need to specify n). Starting with v_0 = 0099 we obtain the following strictly decreasing sequence
0098, 0096, 0092, 0084, 0070, 0049, 0024, 0005, 0000.
The middle-square method was invented at a time when computers were just starting. At that time
this method was extensively used since it was simpler and faster than any other method. However, by
current standards, its quality is quite poor, at least when applied to numbers in the decimal number
system. Donald Knuth ([8]) remarks: `[...] experience showed that the middle-square method can give
usable results, but it is rather dangerous to put much faith in it until after elaborate computations have
been performed.'
We remark that instead of squaring a number (which is computationally usually a fast thing to perform)
one could modify the middle-square method to utilize other functions, e.g., sin or log. Sometimes one
needs to pay special attention to make correct use of the domains and ranges of these functions. We
do not pursue this here any further. Generally, all these methods tend to suffer from fundamental
problems, one of which is a short period and rapid cycling for most initial seeds.
Linear congruential random number generators (LCRNGs) produce a sequence v_0, v_1, v_2, . . . by the recurrence
    v_{k+1} = (a v_k + c) mod m,    (3.1)
where a is called the multiplier, c the increment, and m the modulus. (n mod m is the difference
of n and the largest integer multiple of m less than or equal to n.) The initial number v_0, which
usually needs to be provided by the user, is typically referred to as the seed.
Note that the generated numbers are typically elements in {0, 1, . . . , m − 1}. By dividing by m, i.e.,
by setting
    u_k = v_k/m,
the numbers can be transformed to real numbers in [0, 1).
Sequences of numbers v_0, v_1, . . . generated by (3.1) are called Lehmer⁹ sequences.
A frequently recommended generator of this form is due to M. Marsaglia (1972).
This generator has in fact a maximal period, i.e., the numbers repeat after m numbers have been
generated.
A large class of LCRNGs are the so-called Mersenne¹⁰ generators. They use Mersenne prime moduli,
i.e., m is a prime of the form m = 2^p − 1, where p is also a prime. These generators can have
periods of length m − 1, and it can also be shown that they never generate a zero. Coincidentally,
m = 2^31 − 1 = 2,147,483,647 is a Mersenne prime, the largest prime number of this form that can be
stored in a full-word (32-bit) integer on a computer.
The following Lewis-Goodman-Miller generator, introduced in 1969, is a widely studied Mersenne
generator:
    v_{k+1} = 7^5 v_k mod (2^31 − 1).
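A minimal matlab sketch (assumed, not from the notes) of this generator; note that the products 7^5 · v_k stay below 2^53, so ordinary double-precision arithmetic computes the modulus exactly here:

% Minimal sketch (assumed): Lewis-Goodman-Miller generator v_{k+1} = 7^5 v_k mod (2^31 - 1).
a = 7^5; m = 2^31 - 1;
n = 10^4; v = zeros(1, n);
v(1) = 12345;                   % seed (illustrative value)
for k = 1:n-1
    v(k+1) = mod(a*v(k), m);
end
u = v/m;                        % pseudo random numbers in [0, 1)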
Problems With Number Generators Let us have a look at RANDU, which is an LCRNG developed
by IBM in the 1960s, and which for many years was the most widely used random number generator
in the world. Leaving some minor technical details aside, the numbers are generated as
    v_k = (2^16 + 3) v_{k−1} mod 2^31 = 65539 v_{k−1} mod 2^31.    (3.3)
The generator in (3.3) can be written to express the relationship among three successive members of
the output sequence:
⁹ Derrick Lehmer (Feb. 23, 1905 – May 22, 1991).
¹⁰ Marin Mersenne (Sept. 8, 1588 – Sept. 1, 1648).
    v_k ≡ (2^16 + 3) v_{k−1} mod 2^31
        ≡ (2^16 + 3)² v_{k−2} mod 2^31
        ≡ (2^32 + 6 · 2^16 + 9) v_{k−2} mod 2^31
        ≡ (0 + 6 · 2^16 + 9) v_{k−2} mod 2^31
        ≡ (6 · (2^16 + 3) − 6 · 3 + 9) v_{k−2} mod 2^31
        ≡ (6 · (2^16 + 3) v_{k−2} − 9 v_{k−2}) mod 2^31
        ≡ (6 v_{k−1} − 9 v_{k−2}) mod 2^31,
i.e.,
    v_k − 6 v_{k−1} + 9 v_{k−2} = C · 2^31,
where C is an integer. Note that 0 ≤ v_k < 2^31 by construction. Hence, the maximum integer that we
can obtain by v_k − 6 v_{k−1} + 9 v_{k−2} is smaller than 2^31 − 0 + 9 · 2^31 = 10 · 2^31. Similarly, the smallest
number that we can obtain is larger than 0 − 6 · 2^31 + 0 = −6 · 2^31. This leaves only the 15 possibilities
C ∈ {−5, −4, . . . , 8, 9}. Consequently, all triples must lie on no more than 15 hyperplanes in R³.
Fig. 3.2 shows a plot of subsequent triples of generated numbers viewed from two different perspectives.
The 15 hyperplanes are clearly visible in Fig. 3.2(b).
Figure 3.2: Subsequent triples of IBM's RANDU viewed from two perspectives.
Of course, we would like our random numbers to be as random as possible. This, however, cannot
be checked. We can only detect non-randomness. Over the years, many different statistical tests for
detecting non-randomness have been developed. Typically, random numbers need to pass all of those
tests. For lack of space, we are not going into details here. The interested reader can find information
about the so-called chi-square test and the Kolmogorov-Smirnov test in [12].
A quick indication of whether the generated numbers behave somewhat randomly can be obtained by the
following procedures (a small code sketch follows the list).
(a) Plot a sample path, i.e., if ri is the ith random number generated, plot ri versus i.
(b) Plot a histogram.
(c) Plot a correlation plot, i.e., plot (ri , ri+1 ) (or, (ri , ri+1 , ri+2 ) or even larger tuples).
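A minimal matlab sketch (assumed, not from the notes) of these three diagnostics for a vector r of generated numbers in [0, 1):

% Minimal sketch (assumed): quick visual randomness diagnostics (a)-(c).
r = rand(1, 10^4);                          % replace by the generator under test
figure, plot(r(1:200), '.');                % (a) sample path r_i versus i
[f, c] = hist(r, 20); w = c(2) - c(1);
figure, bar(c, f/(numel(r)*w));             % (b) density histogram
figure, plot(r(1:end-1), r(2:end), '.');    % (c) correlation plot (r_i, r_{i+1})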
3.3 General Methods for Sampling From Non-Uniform Distributions
Theorem 3.2
Let U ∼ U(0, 1), and let Y be a discrete random variable with cdf F. Let
    F⁻(u) := min{t ∈ R : F(t) ≥ u},  u ∈ (0, 1),
be the so-called generalized inverse of F. Then, the discrete random variable X = F⁻(U) has
cdf F.
Proof. Let us first convince ourselves that the minimum in min{t ∈ R : F(t) ≥ u} is in fact attained.
Consider for a given u ∈ (0, 1) the set I_u = {t ∈ R : F(t) ≥ u}. Note that I_u is non-empty, since
u < 1 and F(y) → 1 as y → ∞. I_u has a finite left endpoint, say η_u, because u > 0 and F(y) → 0 as
y → −∞. Finally, η_u ∈ I_u, since F is a cdf and therefore right-continuous (consider y_n = η_u + 1/n,
n = 1, 2, . . . ; then u ≤ F(y_n) for all n, hence u ≤ lim_n F(y_n) = F(η_u), implying η_u ∈ I_u). In summary,
the minimum in min{t ∈ R : F(t) ≥ u} is indeed attained.
We claim
{(t, u) ∈ R × (0, 1) : F − (u) ≤ t} = {(t, u) ∈ R × (0, 1) : u ≤ F (t)}. (3.4)
Taking an element from the left set, i.e., (t, u) ∈ R × (0, 1) satisfying F⁻(u) ≤ t, we have
    F(t) ≥ F(F⁻(u)) = F(min{t ∈ R : F(t) ≥ u}) ≥ u
(the first inequality since F is non-decreasing, the equality and the last inequality by definition of F⁻),
and (t, u) is contained in the right set. Conversely, for (t, u) ∈ R × (0, 1) satisfying u ≤ F(t) we have
    F⁻(u) ≤ F⁻(F(t)) = min{r ∈ R : F(r) ≥ F(t)} ≤ t
(the first inequality since F⁻ is non-decreasing, the equality by definition, the last since t lies in the set),
proving (3.4).
Now we can complete the proof: for any t ∈ R,
    Pr(X ≤ t) = Pr(F⁻(U) ≤ t) = Pr(U ≤ F(t)) = F(t),
where the second equality uses (3.4) and the last uses U ∼ U(0, 1). Hence X has cdf F.
Now let us take a closer look at Theorem 3.2. In fact, the theorem gives us a method of sampling
from a distribution with cdf F. All we need to do is generate U ∼ U(0, 1) and compute the generalized
inverse F⁻(U). But how do we compute F⁻(U)?
For notation, suppose Y (and also the to-be-generated X) takes values x_1, . . . , x_N with probabilities
p_1, . . . , p_N. Then,
    F(t) = Σ_{k : x_k ≤ t} p_k,
with the convention that the empty sum has value 0. For u ∈ (0, 1), we therefore have F⁻(u) = x_k
if and only if F(x_{k−1}) < u ≤ F(x_k).
¹¹ John von Neumann (Dec. 28, 1903 – Feb. 8, 1957).
¹² Stanislav Ulam (April 13, 1909 – May 13, 1984).
¹³ See R. Eckhardt. Stan Ulam, John von Neumann and the Monte Carlo method. Los Alamos Science, pages 131–143,
1987. Special Issue. http://www-star.st-and.ac.uk/~kw25/teaching/mcrt/MC_history_3.pdf.
Hence cdf inversion for discrete random variables reduces to the following:
(a) Generate U ∼ U (0, 1).
(b) Set X = xk if F (xk−1 ) < U ≤ F (xk ).
Figure 3.3: (a) Cdf for Example 3.2, (b) Histogram for 10, 000 samples.
Let U ∼ U(0, 1) be a uniform sample. Starting from the point (0, U) (on the y-axis), proceed to the
right until you encounter a jump in the cdf, a vertical dashed line segment. Now, proceed down
to the x-axis, and return this value as the selection. As indicated along the y-axis, `2' is selected
with probability 0.15, since its interval occupies this fraction of the unit interval. Similarly, `3'
is selected with probability 0.2, since its interval occupies this fraction of the unit interval, and
so on.
Example Matlab code for generating samples from this distribution is given below. A histogram
obtained by executing this code is shown in Fig. 3.3(b).
nTrials = 10^4;
b(2) = 0.15; b(3) = b(2) + 0.2; b(4) = b(3) + 0.6;
U = rand(1, nTrials);
w2 = (U <= b(2))*2;
w3 = (U > b(2) & U <= b(3))*3;
w4 = (U > b(3) & U <= b(4))*4;
w5 = (U > b(4))*5;

w = w2 + w3 + w4 + w5;   % vector with the right frequencies

[f, x] = hist(w, 1:5); dx = diff(x(1:2)); bar(x, f/sum(f*dx));   % plots the histogram

cdfinvdiscrexample1.m
Example 3.3
Suppose X is geometric with parameter p such that Pr(X = k) = (1 − p)^{k−1} p, for k ∈ {1, 2, . . . }.
Then,
    F_X(t) = Σ_{k=1}^{⌊t⌋} (1 − p)^{k−1} p
           = p (1 + (1 − p) + (1 − p)² + · · · + (1 − p)^{⌊t⌋−1})
           = p · (1 − (1 − p)^{⌊t⌋})/(1 − (1 − p))
           = 1 − (1 − p)^{⌊t⌋}.
Cdf inversion hence gives, for U ∈ (0, 1),
    F⁻(U) = min{k ∈ N : U ≤ 1 − (1 − p)^k}
          = min{k ∈ N : (1 − p)^k ≤ 1 − U}
    (∗1)  = min{k ∈ N : (1 − p) ≤ (1 − U)^{1/k}}
    (∗2)  = min{k ∈ N : ln(1 − p) ≤ (1/k) ln(1 − U)}
    (∗3)  = min{k ∈ N : k ≥ ln(1 − U)/ln(1 − p)}
    (∗4)  = ⌈ln(1 − U)/ln(1 − p)⌉,
where (∗1) and (∗2) follow from the fact that (·)^{1/k} and, respectively, ln are non-decreasing,
(∗3) follows from the fact that ln(1 − p) is negative, and (∗4) follows from the fact that k is an
integer.
A histogram of 10, 000 samples from the geometric distribution with p = 0.2 is shown in
Fig. 3.4.
Figure 3.4: Histogram (obtained by cdf inversion) of the geometric distribution, p = 0.2. Graph of
f(k) = (1 − p)^{k−1} p is plotted in red.
Theorem 3.3
Let U ∼ U(0, 1), and let Y be a continuous random variable with invertible cdf F. Then, the
continuous random variable X = F^{−1}(U) has cdf F.
Proof. Let u ∈ (0, 1) and I_u = {t ∈ R : F(t) ≥ u}. As we assume that F is invertible, we have
F(F^{−1}(u)) = u ≥ u, i.e., F^{−1}(u) ∈ I_u. And there is no smaller element than F^{−1}(u) contained in I_u,
since an invertible cdf, such as F, is strictly increasing. Hence,
    F⁻(u) = min{t ∈ R : F(t) ≥ u} = F^{−1}(u).
With this the rest of the proof of Theorem 3.2 carries over, word-by-word.
Note that for cdf inversion an explicit formula for the inverse cdf F −1 is needed. Such a formula is
often not available (e.g., in the case of a normal distribution). However, there are cases where cdf
inversion can be applied successfully. Let us look at an example.
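The example referred to here is not contained in this excerpt (it is cited later as Example 3.4 and illustrated in Fig. 3.5: sampling from the exponential distribution). A minimal sketch, assuming the exponential cdf F(x) = 1 − e^{−λx} with inverse F^{−1}(u) = −ln(1 − u)/λ:

% Minimal sketch (assumed, not the original Example 3.4): cdf inversion for the
% exponential distribution with rate lambda.
lambda = 2; n = 10^4;
U = rand(1, n);
X = -log(1 - U)/lambda;                     % X = F^{-1}(U) has cdf 1 - exp(-lambda*x)
[f, c] = hist(X, 50); w = c(2) - c(1);
bar(c, f/(n*w)); hold on;                   % density histogram, cf. Fig. 3.5
t = linspace(0, max(X), 200);
plot(t, lambda*exp(-lambda*t), 'r', 'LineWidth', 2);   % pdf for comparison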
Let us summarize. Cdf inversion for discrete random variables is a completely general method. It can
be applied to any discrete probability distribution. The continuous version, however, only applies if
an explicit formula for the inverse of the cdf is available.
Figure 3.5: Sampling from the exponential distribution via cdf inversion. Histogram of 10,000 samples.
The pdf F′(t) = λe^{−λt} of the exponential distribution is plotted in red. (a) λ = 1, (b) λ = 2.
Then,
Proof for the continuous case
Now,
    Pr(Y = x, U ≤ f(Y)/(M g(Y))) = Pr(accept | Y = x) Pr(Y = x)    (by Bayes' rule)
                                 = Pr(accept | Y = x) g(x)
                                 = f(x) g(x)/(M g(x))              (using (3.5))
                                 = (1/M) f(x).
Further,
    Pr(U ≤ f(Y)/(M g(Y))) = Pr(accept)
                          = Σ_x Pr(accept | Y = x) Pr(Y = x)    (by the law of total probability¹⁶)
                          = Σ_x (f(x)/(M g(x))) g(x)            (using (3.5))
                          = (1/M) Σ_x f(x)
                          = 1/M                                  (since f is a pdf).
Hence Pr(X = x) = Pr(Y = x | accept) = ((1/M) f(x))/(1/M) = f(x), as required.
¹⁵ The continuous law of total probability states: If A ⊆ R² is any event, then
    Pr((U, Y) ∈ A) = ∫_{−∞}^{∞} Pr((U, Y) ∈ A | Y = y) f_Y(y) dy.
¹⁶ The discrete law of total probability states: If A is any event and B_1, B_2, . . . form a partition of Ω, then
    Pr(A) = Σ_i Pr(A|B_i) Pr(B_i).
Comments
(a) Rejection sampling requires that we have a method available for sampling from g.
(b) Above we have calculated the probability that the pair (Y, U) is accepted:
    Pr(U ≤ f(Y)/(M g(Y))) = 1/M.
The value 1/M is also referred to as the acceptance rate of the rejection sampling algorithm. We
will have to generate, on average¹⁷, Mn draws from both the trial and the uniform distribution to
obtain n draws from the target distribution. Thus, generally, M should not be chosen too large;
optimally, it should be as close to 1 as possible. Of course, we can only choose a small M if g closely
follows f.
(c) We only need to know f and g up to a constant of proportionality. In many applications we
will not know the normalizing constants for these densities, but we do not need them. That is, if
f*(y) = c_f f(y) and g*(y) = c_g g(y), for all y, we can proceed with the algorithm using f* and g*
even if we do not know the values of c_f and c_g.
Suppose we want to sample from the density f(x) = 12x²(1 − x), 0 < x < 1 (cf. Fig. 3.6).
(This is the so-called Beta distribution with parameters α = 3 and β = 2.) Since the maximum
of f occurs at x_0 = 2/3, where f(x_0) = 16/9, this means we can take M = 16/9, for x ∈ [0, 1],
corresponding to a uniform density g over (0, 1).
In Fig. 3.6(a) we show f, M, and 200 points corresponding to trials (Y, U · M g(Y )). When the
second coordinate U · M g(Y ) is smaller or equal to f (Y ), the point is accepted; otherwise it
is rejected. For this particular sample, 111 points were accepted and 89 were rejected for a
proportion 111/200 = 0.5550 of acceptance, not too far from the theoretical one of 1/M =
9/16 = 0.5625.
¹⁷ Note that the number of Bernoulli trials until the first `accept' follows a geometric distribution with pdf p(1 − p)^{k−1}, where
p = 1/M is the probability of an `accept.' The mean of this distribution is M.
Figure 3.6: Sampling from f(x) = 12x²(1 − x) via rejection sampling. (a) plot of (Y, U · M g(Y)),
(b) histogram.
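A minimal matlab sketch (assumed, not the code behind Fig. 3.6) of rejection sampling from f(x) = 12x²(1 − x) with uniform trial density g ≡ 1 on (0, 1) and M = 16/9:

% Minimal sketch (assumed): rejection sampling from f(x) = 12*x^2*(1-x).
n = 10^4; M = 16/9;
f = @(x) 12*x.^2.*(1 - x);
X = zeros(1, n); k = 0;
while k < n
    Y = rand; U = rand;
    if U <= f(Y)/M                 % accept with probability f(Y)/(M*g(Y)), g = 1
        k = k + 1; X(k) = Y;
    end
end
[fr, c] = hist(X, 30); w = c(2) - c(1);
bar(c, fr/(n*w)); hold on;                          % density histogram, cf. Fig. 3.6(b)
t = 0:0.01:1; plot(t, f(t), 'r', 'LineWidth', 2);   % target density for comparison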
Figure 3.7: Rejection sampling from the discrete distribution of Example 3.6.
Suppose now we want to sample from the normal distribution. We mentioned already that cdf inversion
cannot (at least directly) be applied to sample from the normal distribution, because there is no
closed-form formula for the cdf of the normal distribution. So, what else can we do? Well, what about
rejection sampling? The answer is yes, this can work. For instance, one can choose as trial distribution
the exponential distribution (and use the fact that the normal distribution is symmetric) and then apply
rejection sampling (for sampling from the exponential distribution, see Example 3.4). One might also
want to try to use as trial distribution the uniform distribution. There, however, we run into trouble,
because that distribution does not exist (the interval from which we want to sample is unbounded). A
way out of this is to approximate the normal distribution by a truncated normal distribution (hence
bounding the interval) and then use rejection sampling. Fig. 3.8 shows an example. The downside of
this is, of course, that these random numbers are just approximations of the normal distribution.
Figure 3.8: Sampling 200 samples from a truncated normal distribution (red, dashed) via rejection
sampling using as trial distribution a uniform distribution (black, dashed). Normal distribution shown
as red solid curve. (a) plot of (Y, U · M g(Y )), (b) histogram.
By Theorem 1.1 it suffices to generate standard normal random numbers X, since then Y = σX + µ ∼ N(µ, σ²).
The first method is an `approximate method' in the sense that the samples follow an approximate
normal distribution as a limit process is involved.
Recall that the mean and the variance of a U(0, 1) random variable are 1/2 and 1/12. Hence, setting
    X := (Σ_{i=1}^{n} U_i − n/2)/√(n/12)
for independent U_1, . . . , U_n ∼ U(0, 1), we see by the central limit theorem that X is approximately normal with parameters µ = 0 and σ = 1.
In practice, a frequent choice for n is n = 12 since this avoids the computation of a square root and
divisions.
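A minimal matlab sketch (assumed, not from the notes) of this approximate generator with n = 12, together with the transformation Y = σX + µ from above:

% Minimal sketch (assumed): approximate N(0,1) samples as the sum of 12 U(0,1)
% samples minus 6 (mean 12/2 = 6, variance 12/12 = 1).
nSamples = 10^4;
X = sum(rand(12, nSamples), 1) - 6;        % approximately N(0,1)
Y = 2*X + 3;                               % approximately N(3, 4), via Y = sigma*X + mu
[f, c] = hist(X, 50); w = c(2) - c(1);
bar(c, f/(nSamples*w));                    % compare with the standard normal pdf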
4 Markov Chain Monte Carlo (MCMC) Methods
Markov Chain Monte Carlo (MCMC) methods are particularly useful when it is practically not feasible
to directly sample from the desired distribution. This is often the case, for instance, for combinatorially defined sets where the sample space is so large that the so-called partition function
(defined below) cannot be computed. Also, MCMC methods find applications in cases where it is rather
straightforward to compare probabilities of pairs of events.
The general idea behind MCMC is to construct a Markov chain whose stationary distribution is the
probability distribution we want to simulate. We then generate samples of the distribution by running
the Markov chain. In Markov chains¹⁸ one allows only probabilistic dependence on the past through
the previous state. This already produces a great diversity of behaviors.
Historically, the idea for MCMC sampling goes back to 1953, when N. Metropolis¹⁹ et al. published
the paper `Equations of state calculations by fast computing machines'²⁰. The authors were trying to
solve problems in physics that arise due to the random kinetic motion of atoms and molecules. We will
discuss this later in more detail. The algorithm introduced in this paper, nowadays called the Metropolis
algorithm, has been cited as among the top ten algorithms²¹ having the greatest influence on the
development of science and engineering. MCMC, whether via Metropolis or modern variations, is now
also very important in statistics and machine learning.
Let's start by introducing the background from Markov chain theory that we need for our pur-
poses.
We deal exclusively in this chapter with discrete-time homogeneous Markov chains on a finite state
space Ω = {ω_1, . . . , ω_M}. We will therefore define Markov chains only for this case. Many of the
definitions extend to countable state spaces with only minor complication. A more comprehensive
treatment (and proofs) can be found, e.g., in [2, 11].
Definition 4.1
A sequence {X_t}_{t∈N_0} of random variables is a Markov chain (MC), with state space Ω, if
    Pr(X_{t+1} = y | X_t = x_t, X_{t−1} = x_{t−1}, . . . , X_0 = x_0) = Pr(X_{t+1} = y | X_t = x_t)    (4.1)
for all t ∈ N_0 and all x_0, . . . , x_t, y ∈ Ω (whenever the conditional probabilities are defined).
Equation (4.1) encapsulates the Markovian property whereby the history of the MC prior to time t
is forgotten.
We may write
    P(x, y) := Pr(X_{t+1} = y | X_t = x),
where P is the transition matrix of the MC. The transition matrix P describes single-step transition
probabilities; the t-step transition probabilities P^t are given inductively by
    P^t(x, y) = { I(x, y)                              : t = 0,
                  Σ_{y′∈Ω} P^{t−1}(x, y′) P(y′, y)     : t > 0,
(here I(x, y) = 1 if x = y and 0 otherwise).
Also note that the row sums of P satisfy Σ_{y∈Ω} P(x, y) = Σ_{y∈Ω} Pr(X_1 = y | X_0 = x) = 1, because
given that we are in state x, the next state must be one of the possible states from Ω.
MCs can be represented by directed graphs. In this representation, the states are represented by the
vertices of the graph. A directed edge extends from one vertex, x say, to another y, if a transition
from x to y is possible in one iteration. In this case the weight P (x, y) is associated to that edge.
The set N (x) of all states that can be reached from x in one iteration is the so-called neighborhood
of x.
Example 4.1
The following is a graph representation of a four-state Markov chain with transition matrix P.
An MC is irreducible if, for all x, y ∈ Ω there exists a t ∈ N_0 such that P^t(x, y) > 0. In other words,
irreducibility is the property that regardless of the present state we can reach any other state in finite
time. (Equivalently, in the graph representation any vertex can be reached by a directed path from
any other vertex.)
Figure 4.1: Examples of three Markov chains. Only (c) is irreducible, (a) and (b) are not.
Each state x in a Markov chain has a period gcd{t : P^t(x, x) > 0}. An MC is aperiodic if all states
have period 1. Otherwise, it is called periodic. To show aperiodicity, we remark that in the case of
an irreducible MC, it suffices to verify that the period of just one state x ∈ Ω is 1.
Figure 4.2: Examples of two Markov chains. (a) is periodic with period 2; (b) is aperiodic.
A probability distribution π on Ω is called a stationary distribution of the MC if
    Σ_{x∈Ω} π(x) P(x, y) = π(y) for all y ∈ Ω, i.e., πP = π,    (4.2)
where in the last equation we write π = (π_{ω_1}, . . . , π_{ω_M}) as a row vector. Equation (4.2) is known as the
global balance equation.
Clearly, if X_0 is distributed as the stationary distribution π then so is X_1 (and hence so is X_t for
all t ∈ N_0).
It can be shown that a nite MC always has at least one stationary distribution. It is often possible
to determine the stationary distributions by solving the global balance equation keeping in mind
that π = (πω1 , . . . , πωM ) needs to be a non-negative vector satisfying additionally the normalizing
condition
Σ_{i=1}^{M} πωi = 1. (4.3)
The stationary distribution, however, might not be unique (see Example 4.4). Uniqueness is guaranteed
if the Markov chain is irreducible and aperiodic.
Theorem 4.1
An irreducible aperiodic MC has a unique stationary distribution π; moreover the MC tends
to π in the sense that P t (x, y) → π(y), as t → ∞, for all x ∈ Ω.
Example 4.2
Consider the task of finding a stationary distribution of the Markov chain with transition matrix
P = ( 0    1     0  )
    ( 0   2/3   1/3 )
    ( p   1−p    0  ),
for some given p ∈ (0, 1).
Let π = (π1 , π2 , π3 ). Equation πP = π is equivalent to
(pπ3 , π1 + (2/3)π2 + (1 − p)π3 , (1/3)π2 ) = (π1 , π2 , π3 ).
Solving this we find π1∗ = pπ3∗ and π2∗ = 3π3∗ , where π3∗ can be chosen arbitrarily. However,
since π needs to be a pdf, we need to have π1 + π2 + π3 = 1 (and π1 , π2 , π3 ≥ 0). This gives the
(unique) solution
(π1∗ , π2∗ , π3∗ ) = (1/(p + 4)) (p, 3, 1).
(It is easily checked that we made no computational mistakes; just verify (π1∗ , π2∗ , π3∗ )P =
(π1∗ , π2∗ , π3∗ ).)
Example 4.3
Let us consider the question whether Theorem 4.1 also holds if the MC is periodic.
Consider the Markov chain in Fig. 4.2(a) with transition probabilities p1,2 = p3,2 = 1 and
p2,1 = p2,3 = 1/2. (Notation as in Example 4.1.)
We have
P^{2t+1} = ( 0    1    0  )
           ( 1/2  0   1/2 )
           ( 0    1    0  ),
P^{2t+2} = ( 1/2  0   1/2 )
           ( 0    1    0  )
           ( 1/2  0   1/2 )
for all t ≥ 0. Hence, we do not have convergence P t (x, y) → π(y) as t → ∞ for all x ∈ Ω in the
sense of Theorem 4.1. The stationary distribution is, by the way, π = (1/4, 1/2, 1/4).
Example 4.4
Consider the following MC.
It is easily verified that both π = (1/2, 1/2, 0, 0) and π = (0, 0, 1/2, 1/2) are stationary distribu-
tions of this Markov chain. Does this contradict Theorem 4.1?
The answer is No. The MC is clearly not irreducible (there is for instance no path from ω1 to
ω3 in the graph representation of the MC).
In many situations one has an idea about what the stationary distribution might be. Instead of verifying
the global balance condition it can be easier to verify so-called detailed balance, which is (4.4) in
the following theorem. (An MC for which detailed balance holds is said to be time reversible. The
Markov chains constructed by the sampling methods discussed later are all time reversible.)
Theorem 4.2
Suppose P is the transition matrix of an MC. If the function π ′ : Ω → [0, 1] satisfies
π ′ (x) P(x, y) = π ′ (y) P(y, x) for all x, y ∈ Ω, (4.4)
and
Σ_{x∈Ω} π ′ (x) = 1,
then π ′ is a stationary distribution of the MC. If the MC is irreducible and aperiodic then π ′ is
the unique stationary distribution.
Proof. We verify that π ′ is a stationary distribution (see (4.2)). For every y ∈ Ω we have, using (4.4),
Σ_{x∈Ω} π ′ (x) P(x, y) = Σ_{x∈Ω} π ′ (y) P(y, x) = π ′ (y) Σ_{x∈Ω} P(y, x) = π ′ (y).
The last equality follows from the previously mentioned fact that the row sums of P sum up to 1.
Let us consider two sampling problems for types of objects that are rather central in discrete math-
ematics. The first is about sampling independent sets in a graph, the second is about sampling of
matchings in a graph. Independent sets and matchings appear in many contexts. For instance, inde-
pendent sets can model conflicts between objects. The vertices might correspond to workers and an
edge might indicate that the corresponding pair of workers cannot work in the same shift (because
they don't like each other or they need to share the same equipment, etc.). Matchings are often used
to pair objects, for instance, assigning students to courses, jobs to machines, etc.
We remark that there are many computational questions arising in this context. Some of these questions
are settled, some remain subjects of ongoing research. (It is known, for instance, that both the problem
of counting the number of independent sets in a graph and that of counting the number of matchings in a
graph are intractable problems22 . Still, it is possible to sample matchings in graphs efficiently, that means
in time polynomial in n, while such efficient sampling of independent sets is intractable. Note
22 Unless the famous conjecture P ̸= NP would surprisingly fail to hold; see also http://www.claymath.org/millennium-problems.
that the sampling procedures that we will discuss have a priori an exponential running time. More
on this fascinating topic can be found in [7].)
Figure 4.3: A graph with vertices v1 , . . . , v4 and edges {v1 , v2 }, {v1 , v3 }, {v2 , v3 }, {v3 , v4 }. (a) Inde-
pendent set I = {v1 , v4 } (gray nodes), (b) Matching M = {{v1 , v2 }, {v3 , v4 }} (red edges). I is also a
maximum independent set; M is also a maximum matching in G.
Example 4.5
Let G = (V, E) be a graph with vertices in V = {v1 , . . . , vn } and edges in E = {e1 , . . . , em }.
A set I ⊆ V is an independent set if no two vertices of I are joined by an edge in E (see
Fig. 4.3(a)).
The following MC randomly selects independent sets in IG := {all independent sets in G} :
It can be shown that the Markov chain is irreducible, aperiodic and has the uniform distribution
on IG as stationary distribution. Hence, we can use this Markov chain to (uniformly) generate
random independent sets in G.
To flip a fair coin means to draw a random number taking on only two values, say 0 (for heads) and 1 (for
tails), both of which should occur with probability 1/2.
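The algorithm box itself is not reproduced here. As an illustration, the following is a minimal matlab sketch of a chain of this kind, modeled on the matching sampler of Example 4.6 below (the precise proposal rule, picking a uniformly random vertex and flipping a fair coin, is an assumption made for this sketch):

function [I] = randomindepsetsample(n, edges)
% n is the number of vertices; ith row of edges contains the two vertices of the ith edge
nTrials = 10^3;                  % number of iterations for running the Markov chain
A = zeros(n,n);                  % adjacency matrix of G
for j = 1:size(edges,1)
    A(edges(j,1), edges(j,2)) = 1; A(edges(j,2), edges(j,1)) = 1;
end
I = zeros(1,n);                  % ith component is 1 if and only if vertex i is currently in I
for sim = 1:nTrials
    v = randi(n,1);              % pick a vertex uniformly at random
    if (randi(2,1)==1)           % flip a fair coin
        if (I(v)==0)
            if (A(v,:)*I' == 0)  % no neighbor of v is currently in I?
                I(v) = 1;
            end
        else
            I(v) = 0;
        end
    end
end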
Example 4.6
Let G = (V, E) be a graph with vertices in V = {v1 , . . . , vn } and edges in E = {e1 , . . . , em }. A
set M ⊆ E is a matching if the edges of M are pairwise vertex disjoint (see Fig. 4.3(b)).
The following Markov chain randomly selects matchings in MG := {all matchings in G} :
The chain is started from some given M ∈ MG (e.g., M = ∅).
We claim that the Markov chain is irreducible, aperiodic and has the uniform distribution on
MG as stationary distribution. Hence, we can use this Markov chain to (uniformly) generate
random matchings in G.
The state space of the Markov chain is Ω = MG . This is a finite set containing at most 2^m ele-
ments. Clearly the chain is homogeneous (there is only dependence on the previous state).
1. Aperiodicity: The Markov chain might remain in the same state (see step (a) and step
(c)), hence all states have period 1, and the chain is therefore aperiodic.
2. Irreducibility: The Markov chain is irreducible, since it is possible to reach the empty
matching from any state by removing edges (and reach any state from the empty matching
by adding edges).
3. Uniform distribution: We verify the detailed balance condition for π = (1/|MG |) (1, . . . , 1),
where |MG | denotes the size/cardinality of the set MG . Note that as |MG | can be rather
large it is often not feasible to compute this number explicitly.
For M1 , M2 ∈ MG with M2 = M1 ∪ {e} for some e ∈ E \ M1 we have
π(M1 ) Pr(X1 = M2 | X0 = M1 ) = (1/|MG |) · (1/m) · (1/2) = π(M2 ) Pr(X1 = M1 | X0 = M2 ).
function [M] = randommatchingsample(n, edges)
% n is the number of vertices of the graph
% ith row of edges contains the numbers of the two vertices of that edge

nTrials = 10^3;            % number of iterations for running the Markov chain
m = size(edges,1);         % number of edges
M = zeros(1,m);            % ith component of M is 1 if and only if ith edge is currently selected
saturated = zeros(1,n);    % ith component indicates whether a currently selected edge touches v_i

for sim = 1:nTrials        % Implementation of the algorithm from Example 4.6 in the lecture notes
    newe = randi(m,1);     % pick an edge uniformly at random
    coin = randi(2,1);     % flip a fair coin
    if (coin==1)
        if (M(newe)==0)
            if (sum(saturated(edges(newe,:)))==0)  % both endpoints of the edge candidate not endpoints of an already selected edge?
                M(newe) = 1; saturated(edges(newe,:)) = [1,1];
            end;
        else
            M(newe) = 0; saturated(edges(newe,:)) = [0,0];
        end;
    end;
end
Example 4.7
Let us continue to sample matchings in a graph as in the previous example. In fact, we now
want to focus on the problem of counting the matchings. We consider the following graph.
We can check all combinations of possible matchings systematically (a method called enumeration).
Possible matlab code for this is as follows.
edges = [1,7;1,8;2,8;2,3;3,8;3,10;4,10;5,9;5,10;5,12;6,11;
         6,12;7,8;3,9;4,5;2,9;5,11;11,12;1,2;12,13;6,13];
m = size(edges,1); n = max(max(edges));
matchingsofsize = zeros(1,m+1);
nTrials = 10^4;

% Check all 2^m possibilities for M, and count which of them are matchings.
% This determines the correct total number of matchings in G.
% This can only work for small m, since it takes exponential time (in m).
for i = 0:2^m-1
    curE = dec2bin(i,m)-'0';   % converts i into a binary number with m digits
    saturated = zeros(1,n);
    for j = 1:m
        if (curE(j)==1)
            saturated(edges(j,1)) = saturated(edges(j,1))+1;
            saturated(edges(j,2)) = saturated(edges(j,2))+1;
        end;
    end;
    if (max(saturated)<2)
        matchingsofsize(sum(curE)+1) = matchingsofsize(sum(curE)+1)+1;
    end;
end;

matchingsofsize
i 0 1 2 3 4 5 6 7-21
si 1 21 158 521 749 403 61 0
Enumeration, however, becomes quickly intractable for increasing n. Let's try a different ap-
proach.
Suppose we sample matchings in G (using the method from the previous example). If we do this
for sufficiently many trials we should get a good picture about the distribution of the matchings.
Note that this is still an exponential method! But depending on the situation (whether or not
the Markov chain is so-called rapidly mixing) it may provide good approximations in reasonable
time.
Fig. 4.4 shows the outcome of a random sampling of nTrials = 10^4 matchings (drawn after
evolving the Markov chain for 10^3 steps); depicted are the relative frequencies ni /nTrials of
the events that M is a matching of size i, with i being depicted on the x-axis.
Now, observe that we can obtain estimates ŝi of the number of matchings of size i from this
data. We know that s0 = 1 (there is one matching of size 0, the empty set). Therefore ŝ0 = 1.
As n1 /n0 ≈ s1 /s0 we can set ŝ1 := ŝ0 n1 /n0 and, recursively, ŝi := ŝi−1 ni /ni−1 . The estimates
that we obtain for the present data are depicted in Table 4.1.
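The recursion for the estimates takes only a couple of lines of matlab; here is a small sketch that uses the sampled counts ni reported in Table 4.1 (the variable names are illustrative):

nSampled = [5, 102, 832, 2747, 3991, 2029, 294];        % sampled counts n_0, ..., n_6
shat = zeros(1, numel(nSampled));
shat(1) = 1;                                             % s_0 = 1: the empty matching
for i = 2:numel(nSampled)
    shat(i) = shat(i-1) * nSampled(i) / nSampled(i-1);   % shat_i = shat_{i-1} * n_i / n_{i-1}
end
shat                                                     % compare with the last row of Table 4.1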
Figure 4.4: Histogram of the relative frequencies of matchings of different sizes based on nTrials = 10^4
samples drawn after evolving the Markov chain from Example 4.7 for 10^3 steps. Values si /(s1 + s2 +
· · · + s21 ) depicted as red points.
i                     0     1      2       3       4       5      6    7-21
si                    1    21    158     521     749     403     61      0
ni                    5   102    832    2747    3991    2029    294      0
ŝi = ŝi−1 ni /ni−1  1.0  20.4  166.4   549.4   798.2   405.8   58.8      0
Table 4.1: Numbers si of matchings of size i (from enumeration), sampled counts ni , and estimates ŝi .
4.2 Metropolis-Hastings
We turn now to a slightly different question. How can we modify the transition probabilities in a given
Markov chain to ensure that the stationary distribution is equal to a prescribed distribution π ?
The method that we introduce here for solving this question is the so-called Metropolis-Hastings
algorithm24 , named after N. C. Metropolis and W. K. Hastings25 . It proceeds as follows.
Start with some point x0 , whether deterministic or randomly sampled. For t ≥ 0 given that
Xt = x, generate a random proposal Y from a distribution Pr(Y = y|X = x) = P (x, y) (note
that this P is the transition matrix of our original Markov chain). For technical reasons (ensuring
that the generated Markov Chain is irreducible), we will from now on to the end of this chapter
require the property P (x, y) > 0 ⇒ P (y, x) > 0, for all x, y.
New to what we did previously, accept or reject the proposal y. With probability A(x, y) accept
it and set Xt+1 = Y. With probability 1 − A(x, y) reject the proposal, i.e., set Xt+1 = Xt .
This may now be considered to be a new Markov chain. We denote the new transition probabilities
by P ′ (x, y), x, y ∈ Ω. Note that P ′ (x, y) = P(x, y) A(x, y) for y ̸= x (rejected proposals contribute to P ′ (x, x)).
Given a desired stationary distribution π and a proposal mechanism via the matrix P, we want to design
the acceptance probabilities A(x, y) such that π is a stationary distribution of the new Markov chain.
In view of Theorem 4.2 it suffices to construct the acceptance probabilities A(x, y) in such a way that
detailed balance is satisfied. This leads to the choice
A(x, y) := min{ 1, (π(y) P(y, x)) / (π(x) P(x, y)) }. (4.6)
We remark that the Metropolis-Hastings (MH) algorithm simulates samples from a probability distri-
bution by making use of the full joint density function and (independent) proposal distributions for
each of the variables of interest. In summary, the MH algorithm is as follows.
Metropolis-Hastings algorithm
Input: Probability distribution function π (may be unnormalized), transition matrix P.
Output: Random sample x from the Markov chain.
(a) Initialize x (deterministic or randomly).
(b) Draw proposal y from Pr(Y |X = x) = P (x, ·).
(c) Compute acceptance probability A(x, y) according to (4.6).
(d) Draw u ∼ U (0, 1).
(e) IF u ≤ A(x, y) THEN `accept proposal' (i.e., set x ← y)
OTHERWISE `reject' (i.e., set x ← x).
(f) GOTO (b).
Note that the condition u ≤ A(x, y) realizes the desired probability A(x, y) of accepting the proposal y.
(You can see this by considering cdf inversion for a discrete random variable taking on only two values,
'accept' and 'not accept,' where 'accept' should occur with probability A(x, y).) Further notice that
the ratio inside the minimum in (4.6) has a factor π(y)/π(x). Other things being equal, this implies
24 A very readable account on the history of the Metropolis-Hastings algorithm can be found at https://www.jstor.org/stable/30037292.
25 Wilfred Keith Hastings (July 21, 1930 – May 13, 2016).
that moves to higher probability states y are favored. The second factor, P (y, x)/P (x, y), implies, if
other things are equal, that we hesitate to move to y if it would be hard to get back to x.
What should be also noted is that the factor Z for an unnormalized distribution πu with π = πu /Z
cancels out in (4.6). Hence the same acceptance rule can be applied if the partition function Z is
unknown.
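To make the loop concrete, here is a minimal matlab sketch of the Metropolis-Hastings algorithm for a finite state space; the inputs piu (target weights, possibly unnormalized) and P (proposal matrix) are illustrative names, not notation from the notes.

function [x] = mhsample_discrete(piu, P, nTrials)
% piu     : 1xM vector of (possibly unnormalized) target weights
% P       : MxM row-stochastic proposal matrix with P(x,y)>0 <=> P(y,x)>0
% nTrials : number of Metropolis-Hastings steps
M = numel(piu);
x = randi(M,1);                                     % (a) initialize the state randomly
for t = 1:nTrials
    y = find(cumsum(P(x,:)) >= rand, 1);            % (b) draw proposal y from P(x,.)
    A = min(1, (piu(y)*P(y,x)) / (piu(x)*P(x,y)));  % (c) acceptance probability (4.6)
    if rand <= A                                    % (d)-(e) accept with probability A, else stay
        x = y;
    end
end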
Theorem 4.3
The Markov chain generated by the Metropolis-Hastings algorithm has π as stationary distri-
bution.
Proof. We want to prove detailed balance. We distinguish the following three cases for x, y ∈ Ω.
(i) Suppose π(x)P (x, y) = π(y)P (y, x). Then A(x, y) = A(y, x) = 1, implying P ′ (x, y) = P (x, y) and P ′ (y, x) = P (y, x);
hence π(x)P ′ (x, y) = π(y)P ′ (y, x), showing detailed balance for this case.
(ii) Suppose π(x)P (x, y) > π(y)P (y, x). In this case
A(x, y) = (π(y)P (y, x)) / (π(x)P (x, y)) and A(y, x) = 1.
Hence,
π(x)P ′ (x, y) = π(x)P (x, y)A(x, y) = π(x)P (x, y) · (π(y)P (y, x)) / (π(x)P (x, y))
= π(y)P (y, x)A(y, x) = π(y)P ′ (y, x).
(iii) Suppose π(x)P (x, y) < π(y)P (y, x). In this case
A(x, y) = 1 and A(y, x) = (π(x)P (x, y)) / (π(y)P (y, x)).
Hence,
π(y)P ′ (y, x) = π(y)P (y, x)A(y, x) = π(y)P (y, x) · (π(x)P (x, y)) / (π(y)P (y, x))
= π(x)P (x, y)A(x, y) = π(x)P ′ (x, y).
Since the algorithm allows for rejection, the Markov chain of the MH algorithm is clearly aperiodic.
What is left to check to ensure convergence to the required target distribution is in each specic case
only irreducibility.
Example 4.8
Let's consider Example 4.5 again.
One can show that the MC in that example has the uniform distribution as stationary distribu-
tion. Suppose now that we want a stationary distribution where each independent set I has a
probability proportional to λ^|I|. (When λ = 1 this is the uniform distribution; when λ > 1 larger
independent sets have a larger probability than smaller independent sets; and when λ < 1 then
smaller independent sets have a larger probability than larger independent sets.)
Let's change the algorithm from Example 4.5 such that we obtain the Metropolis-Hastings
variant that samples from (1/Z) λ^|I|. Here is the modified algorithm.
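Since the algorithm box is not reproduced here, the following worked acceptance step may serve as a sketch. It assumes (as in Example 4.5) that the proposal picks a vertex uniformly at random and proposes to add it to, or remove it from, the current independent set I, so that P(I, I′) = P(I′, I) and the ratio in (4.6) reduces to a ratio of the weights λ^|I|:
A(I, I′) = min{ 1, λ^{|I′|} / λ^{|I|} } = min{ 1, λ^{|I′| − |I|} },
i.e., under these assumptions additions to I are accepted with probability min{1, λ} and removals with probability min{1, 1/λ}.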
Fig. 4.5 shows some output for the choices of λ = 4 and λ = 0.1.
Let's consider the output for λ = 4. We actually have in this particular case ñ0 = 202 sampled
sets of size 0, ñ1 = 3384 sets of size 1, and ñ2 = 6414 sets of size 2 (and no sets of larger
size).
Can we estimate from this the actual number ni of independent sets of size i, i = 1, . . . , 4
in G?
Well, we can first of all observe that we know how many independent sets there are of size 0 and
size 1. There is one independent set of size 0 and there are n = 4 independent sets of size 1. In
theory, we should have
ñ1 /ñ0 = ( λ^1 n1 nTrials / Z ) / ( λ^0 n0 nTrials / Z ) = (4 · 4)/(1 · 1) = 16.
And, in fact, for these samples we have ñ1 /ñ0 = 3384/202 ≈ 16.7525. We can use this argument
to estimate any of the ni . We should have
ñ2 /ñ1 = ( λ^2 n2 nTrials / Z ) / ( λ^1 n1 nTrials / Z ) = 4 n2 / n1 .
Solving this for n2 we obtain,
n2 = (ñ2 /ñ1 ) · (n1 /4) = (6414/3384) · (4/4) ≈ 1.8954,
which is actually not far from the true value of n2 = 2.
Many other samplers have been developed in the past. We just list a few of them, which can be
frequently found in the literature. (We omit the cases π(x)P (x, y) = 0 in the statement of the
acceptance probabilities; there are conditions that can ensure that π(x)P (x, y) > 0.)
Figure 4.5: Sampling independent sets (Example 4.8). Histogram for (a) λ = 4 and (b) λ = 0.1 (the
red markers show the correct value of (1/Z) λ^|I|) based on nTrials = 10^4 samples drawn after evolving
the Markov chain for 20 steps.
Barker Sampler The Barker sampler26 uses the acceptance probability
A(x, y) := (π(y)P (y, x)) / (π(x)P (x, y) + π(y)P (y, x)). (4.7)
It is easily verified that this A(x, y) satisfies detailed balance and that A(x, y) ≤ AMH (x, y) for any x, y ∈ Ω,
with AMH denoting the Metropolis-Hastings acceptance probability from (4.6).
Original Metropolis Sampler The original Metropolis algorithm is the special case of Metropolis-
Hastings where P (x, y) = P (y, x). In this case,
A(x, y) := min{ 1, π(y)/π(x) }. (4.8)
Independence Sampler Perhaps the simplest proposal mechanism is to take independent and iden-
tically distributed proposals from some distribution P that does not even depend on the present x.
Then P (x, y) is simply P (y) for a pdf P. The Metropolis-Hastings acceptance probability for this so-called
independence sampler simplifies to
A(x, y) := min{ 1, (π(y)P (x)) / (π(x)P (y)) }. (4.9)
Gibbs Sampler The Gibbs sampler27 originates from the field of image processing. It is a special
case of Metropolis-Hastings sampling where the proposal is always accepted (i.e., A(x, y) = 1 for all
x, y ∈ Ω.) It is designed to work for distributions where each state x is in fact a vector x = (ξ1 , . . . , ξd ).
In image processing, for instance, ξi might represent the color of the ith pixel.
The main idea of Gibbs sampling is that one only uses univariate conditional distributions (the distri-
bution where all of the random variables but one are assigned xed values). It is often far easier to
sample from such conditional distributions than from the joint distribution.
The Gibbs sampler (more precisely, the random scan Gibbs sampler) proceeds as follows. Given a
(current) state x then a (uniformly) random coordinate k is selected, and the new state is
y = (ξ1 , . . . , ξk−1 , ξ, ξk+1 , . . . , ξd ),
26 A. A. Barker, a mathematical physicist working at that time at the University of Adelaide.
27 Named after the physicist Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), in reference to an analogy between the sampling algorithm and statistical physics.
with the k th coordinate of y being ξ determined from the pdf for ξ, given the other coordinates.
Simulated annealing28 is an algorithmic approach for finding the maximum of a function that has
typically many local maxima. The idea is that when we initially start sampling the space Ω, we will
accept a reasonable probability of a down-hill move in order to explore the entire space. As the process
proceeds, we decrease the probability of such down-hill moves. The analogy (and hence the term) is
the annealing of a crystal as temperature decreases (initially there is a lot of movement, which then
over time gradually decreases).
Simulated annealing is very closely related to Metropolis sampling, differing only in that the probability
A(x, y) of a move is given by
A(x, y) = min{ 1, (π(y)/π(x))^{T(t)} },
where the function T (t) is called the annealing schedule (for T = 1 we have the Metropolis algo-
rithm). The particular value 1/T (t) for any given t ∈ N0 is typically referred to as temperature.
A typical choice for T over Tmax time steps is
T(t) := T0 (Tf /T0)^{t/Tmax},
where 1/T0 is a given `starting temperature' cooling down to a `final temperature' 1/Tf (i.e., we typically
choose T0 ≤ Tf ).
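For concreteness, a small matlab sketch of this annealing schedule (the numerical values below are merely illustrative choices):

T0 = 0.1; Tf = 10; Tmax = 1000;        % start `hot' (temperature 1/T0 = 10), end `cold' (1/Tf = 0.1)
t = 0:Tmax;
T = T0 * (Tf/T0).^(t/Tmax);            % annealing schedule T(t)
% at time t, a proposal y is accepted with probability min(1, (pi_y/pi_x)^T(t))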
Example 4.9
Consider a 2D image consisting of an n × n grid of black and white pixels. Let Xj , j = 1, . . . , n²,
denote the indicator of the j th pixel being white (i.e., Xj = 1 if the j th pixel is white, and
Xj = 0 otherwise). Viewing the pixels as vertices in a graph, the set N (i) of neighbors of
the ith pixel are the pixels immediately above, below, to the left, and to the right (except for
boundary cases).
A commonly used model29 for the pdf π of X = (X1 , . . . , Xn² ) is
π(x) = (1/Z) exp( (β/2) Σ_{i=1}^{n²} Σ_{j∈N(i)} δ_{ξi,ξj} ),
where x = (ξ1 , . . . , ξn² ), β > 0, and
δ_{ξi,ξj} = 1 if ξi = ξj , and δ_{ξi,ξj} = 0 otherwise.
With this π neighboring pixels prefer to have the same color. The normalizing constant Z of
π is a sum over all 2^{n²} possible configurations, so it may be very difficult to compute. This
motivates the use of MCMC to simulate from the model.
Let us now develop a Gibbs sampler for sampling from π.
28 Simulated annealing was developed in 1983 by Kirkpatrick et al.; see S. Kirkpatrick, C. Gelatt, M. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983) 671–680, https://www.science.org/doi/10.1126/science.220.4598.671.
Denote the vector of random variables by X = (X1 , . . . , Xn² ). A sample is denoted by x =
(ξ1 , . . . , ξn² ). Let X−k = (X1 , . . . , Xk−1 , Xk+1 , . . . , Xn² ) and x−k = (ξ1 , . . . , ξk−1 , ξk+1 , . . . , ξn² ).
Note that π can be viewed as the joint pdf of the variables Xk and X−k , hence
π(ξ1 , . . . , ξk−1 , ξk , ξk+1 , . . . , ξn² ) = πXk ,X−k (ξk , x−k ) = πXk |X−k =x−k (ξk ) πX−k (x−k ).
Collecting all factors of π(x) that do not depend on ξk into a constant C, this gives
πXk |X−k =x−k (ξk ) = C exp(β nξk ),
with nξk denoting the number of neighbors of the k th pixel having color ξk .
Now let us determine the constant C. We first note that πXk |X−k =x−k (ξk ) is a pdf and
that ξk can only take the two possible values ξk = 0 and ξk = 1. Hence
πXk |X−k =x−k (0) + πXk |X−k =x−k (1) = C exp(β n0 ) + C exp(β n1 ) = 1,
consequently
C = 1 / ( exp(β n0 ) + exp(β n1 ) ).
Putting things together we have
πXk |X−k =x−k (ξk ) = C exp(β nξk ) = exp(β nξk ) / ( exp(β n0 ) + exp(β n1 ) ),   ξk ∈ {0, 1}.
In summary, we obtain the following Gibbs sampler. Typical results are shown in Fig. 4.6. (It
is interesting to note that the Gibbs sampler is here, in comparison to the Metropolis-Hastings
sampler, the method with the faster running time; see Fig. 4.7).
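The Gibbs sampler box itself is not reproduced here; the following is a minimal matlab sketch of a random scan Gibbs sampler for this model on an n × n torus (periodic boundary, as in Fig. 4.6); the function name and indexing details are illustrative choices, not prescribed by the notes.

function [X] = isinggibbs(n, beta, nIter)
% Random scan Gibbs sampler for the model of Example 4.9 on an n x n torus.
% X(i,j) in {0,1} is the color of pixel (i,j).
X = randi(2,n,n)-1;                       % random initialization
for it = 1:nIter
    i = randi(n,1); j = randi(n,1);       % pick a uniformly random pixel
    up    = X(mod(i-2,n)+1, j);           % the four neighbors (periodic boundary)
    down  = X(mod(i,n)+1,   j);
    left  = X(i, mod(j-2,n)+1);
    right = X(i, mod(j,n)+1);
    nbrs  = [up, down, left, right];
    n1 = sum(nbrs==1);                    % neighbors currently white
    n0 = sum(nbrs==0);                    % neighbors currently black
    p1 = exp(beta*n1)/(exp(beta*n0)+exp(beta*n1));  % conditional probability of the pixel being white
    X(i,j) = (rand <= p1);                % resample the pixel from its conditional pdf
end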
29 This model is called the Ising model, named after the physicist Ernst Ising (May 10, 1900 – May 11, 1998). This model, easy to define but with amazingly rich behavior, serves as a mathematical model of ferromagnetism in statistical mechanics. See also, e.g., http://personal.rhul.ac.uk/uhap/027/ph4211/PH4211_files/brush67.pdf.
Figure 4.6: Typical outputs for the Gibbs sampler on the 200 × 200 torus, randomly initialized. (a)
β = 1 output after 10^7 iterations, (b) output after the next 10^7 iterations. (c) and (d) the same as for
(a) and (b) but with β = 2. (Computation times for generating any of these images were around 25s.)
Figure 4.7: Typical outputs for the Metropolis sampler on the 200×200 torus, randomly initialized. (a)
β = 1 output after 10^7 iterations, (b) output after the next 10^7 iterations. (c) and (d) the same as for
(a) and (b) but with β = 2. (Computation times for generating any of these images were around 35s.)
5 Learning Theory and Methods
Learning theory can be considered as the field devoted to studying the design and analysis of ma-
chine learning algorithms. The term machine learning seems to have been first coined by Arthur Lee
Samuel30 , a pioneer in the artificial intelligence field, in 1959. The following, a paraphrase of his quote
`Programming computers to learn from experience should eventually eliminate the need for much of
this detailed programming effort,' nicely captures the essence and is therefore cited in many machine
learning texts.
`Machine learning is the eld of study that gives computers the ability to learn without
being explicitly programmed.'
(Arthur L. Samuel, 1959)
Typically, one distinguishes between the following three main types of learning algorithms.
Supervised learning: A supervised learning algorithm learns from labeled training data and,
based on this, tries to predict outcomes for unforeseen data. An example for this would be
predicting mortalities of COVID-19 based on training data from the past.
Unsupervised learning: In these methods there is no label attached to the data, and the task
is to identify patterns and/or model the data. An example for this would be the compression of
information.
Reinforcement learning: This method falls between the above two methods as there is some
form of feedback available (known as reward signal) for each predictive step, but there is no label.
An example for this would be training an agent to play video games. The reward signal can be
the player's score.
Examples of machine learning problems include:
Classification: Classify data into one or more categories (classes). For example, identifying in
computerized tomography (CT) images whether a patient has a tumor or not.
Clustering: Group a set of data points into clusters, such that points within a cluster have
some properties in common. An example is in image segmentation, where the goal is to break
up the image into meaningful or perceptually similar regions.
Prediction: Based on historical data, build models and use them to forecast future values. For
example, predicting temperature rises due to global warming based on data from the past.
Examples of important applications where machine learning algorithms are deployed include:
Optical character recognition (OCR)31 ,
Text or document classication, spam detection32 ,
Speech recognition33 ,
Face recognition34
Fraud detection35 ,
Language translation36 ,
30 (December 5, 1901 – July 29, 1990).
31 http://human.ait.kyushu-u.ac.jp/publications/PRL2008-Malon.pdf.
32 https://arxiv.org/pdf/1606.01042.pdf.
33 https://www.youtube.com/watch?v=RBgfLvAOrss.
34 https://www.youtube.com/watch?v=RBgfLvAOrss.
35 https://www.fico.com/blogs/5-keys-using-ai-and-machine-learning-fraud-detection.
36 https://www.youtube.com/watch?v=AIpXjFwVdIE.
Games like chess and Go37 ,
Autonomous driving38 ,
Medical diagnosis (decisions about Caesarian sectioning39 or tumor removal40 ),
Recommendation systems, search engines41 ,
Representations of polycrystals in materials science42 .
Before we go into further details let us consider two initial examples.
Example 5.1
The last application mentioned above is the representation of polycrystals in materials science.
Fig. 5.1 is from the paper that can be found at https://livrepository.liverpool.ac.uk/3085596/.
Shown in black in these two images are so-called grain boundaries. These are
boundaries of small crystals that make up the whole material, here an aluminum sample. The red
and blue boundaries, respectively, are the boundaries that one obtains by some specific clustering
method (which we call generalized balanced power diagrams). One observes a relatively good
fit, which on the one hand indicates that nature seems to try to find a similar clustering. On
the other hand, for storing the clusters one needs only to store a few parameters per grain and
not the whole image. Thus, one has a much sparser representation of the data.
Example 5.2
Consider the current COVID-19 outbreak. As with other pandemics, it is expected to show exponential
growth. That means that the number x(t) of infected persons at time t follows a function
x(t) = x0 b^t, (5.1)
where x0 is the number of cases at the beginning, and b is the number of people infected by each
infected person.
The first two UK cases appeared on January 31st, 2020, which for us is t = 0. Let us, for the
sake of exposition, consider the available data (source: Public Health Englanda ) up until March
37 https://ai.googleblog.com/2016/01/alphago-mastering-ancient-game-of-go.html.
38 https://iopscience.iop.org/article/10.1088/1757-899X/662/4/042006/pdf.
39 https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8519731.
40 https://www.nature.com/articles/s41598-019-48738-5.
41 https://www.seroundtable.com/google-explains-machine-learning-search-28697.html.
42 https://www-m9.ma.tum.de/foswiki/pub/Allgemeines/AndreasAlpersPublications/H1.pdf.
18th, 2020 (i.e., up to t = 48):
t 0 9 10 13 24 28 29 30 31 32 33 34 35 36
x(t) 2 3 4 8 9 13 19 23 35 40 51 85 114 160
t 37 38 39 40 41 42 43 44 45 46 47 48
x(t) 206 271 321 373 456 590 797 1061 1391 1543 1950 2626
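The fitting step behind the prediction in Fig. 5.2 is not reproduced here; a minimal matlab sketch, assuming (as suggested by (5.1) and the caption of Fig. 5.2) that one fits log x(t) = log x0 + t log b to the data by least squares:

% Least-squares fit of the exponential model (5.1) on the log scale
t = [0 9 10 13 24 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48];
x = [2 3 4 8 9 13 19 23 35 40 51 85 114 160 206 271 321 373 456 590 797 1061 1391 1543 1950 2626];
c = polyfit(t, log(x), 1);        % c(1) = log(b), c(2) = log(x0)
b  = exp(c(1));                   % estimated growth factor per day
x0 = exp(c(2));                   % estimated initial number of cases
xhat = x0 * b.^t;                 % predicted values, cf. the red predictions in Fig. 5.2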
In supervised learning the training data comes in pairs of inputs (xi , yi ), where xi ∈ Rd is the input
instance and yi ∈ C its label. The space from which the instances are drawn is the feature space; in our
case we typically consider it to be Rd . The space C is the label space. Several examples are shown in Table 5.1.
We focus in the following on classification problems. In such problems the data points (xi , yi ) are
drawn from some (unknown) distribution. The goal is to learn a function h : Rd → C such that for a
new pair (x, y) we have h(x) = y (or h(x) ≈ y ) with high probability.
Examples of supervised learning techniques are support vector machines, naïve Bayes classifiers, deci-
sion trees, and neural networks. We will discuss the basics in the following sections. Our first section
Figure 5.2: Number x(t) of COVID-19 infected people for t = 0 (January 31, 2020) to t = 48
(March 18th, 2020) shown in blue. In red the predicted numbers obtained by linear regression.
is, however, devoted to a theoretical aspect, namely that of measuring the capacity (complexity, ex-
pressive power, richness, or flexibility) of a space of functions that can be learned by a classification
algorithm.
Definition 5.1
The set S is shattered by the function class H if ΠH (S) contains all the subsets of S, i.e.,
|ΠH (S)| = 2^|S|. (Here ΠH (S) denotes the set of subsets of S that arise as projections of hypotheses h ∈ H
onto S, i.e., as the sets of points of S labeled +1 by some h ∈ H.)
In other words, the set S is shattered by H if, no matter how we assign ±1-labels to the points in S, there
is always a hypothesis in H that `explains' the labeling perfectly.
43 Vladimir N. Vapnik (born Dec. 6, 1936).
44 Alexey Chervonenkis (Sept. 7, 1938 – Sept. 22, 2014).
45 Note that the VC-dimension is defined for spaces of binary functions (functions to ±1). Generalizations for spaces of non-binary functions have later been suggested in the literature.
Definition 5.2
The VC-dimension VCD(H) of a function class (or hypothesis class) H is the maximum size
of a subset of Rd shattered by H.
In other words, if VCD(H) = d then H can shatter some (i.e., at least one) set of d points but it cannot
shatter any set of d + 1 points. Let us consider two examples and the important class H consisting of
halfspaces in R2 .
Example 5.3
Consider learning (positive) rays on the real line. The hypothesis class H consists of all rays of
the form {x ∈ R : x > a}, for some a ∈ R. A ray (hypothesis) {x ∈ R : x > a} ∈ H is interpreted
as a classifier that identifies a point p ∈ R as being in class S if p > a and identifies p as not
being in class S if p ≤ a.
(a) Can a set containing a single point in R be shattered by H?
The answer is obviously yes (since we can have a ray that contains that point and we can
have a ray that does not contain this point).
(b) We claim that no sets of two points in R can be shattered by H.
Suppose, we have a set S of two points p1 < p2 ∈ R. Any ray containing p1 also contains p2 .
We can therefore never obtain the set {p1 } as projection of some ray onto S.
(c) Can we now say something about VCD(H)?
Based on (a) we have VCD(H) ≥ 1 and based on (b) we have VCD(H) ≤ 1. Hence,
VCD(H) = 1.
Example 5.4
Consider learning closed balls in R2 . The hypothesis class H consists of all balls R(p, r) =
{(x1 , x2 ) ∈ R2 : (x1 − p1 )² + (x2 − p2 )² ≤ r²} with p = (p1 , p2 ) and r > 0. A ball (hypothesis)
R(p, r) ∈ H is interpreted as a classifier that identifies a point x as being in class S if x ∈ R(p, r)
and identifies x as not being in class S if x ̸∈ R(p, r).
(a) Let us show that this set of three points can be shattered by H.
[Figure: labelings (1)-(4) of the three points, each realized by a ball containing exactly the positively labeled points; the remaining labelings follow analogously.]
(b) Note that the result from (a) implies, by the definition of VC-dimension, that VCD(H) ≥ 3.
(c) For showing VCD(H) = 3 we would need to show that no four points can be shattered
by H.
We do not show this here. We just remark that no set of three points on a line can be
shattered by H (because balls are convex; hence if a ball contains the points x1 and x3 , it also
needs to contain any point x2 lying on the line segment x1 x3 ).
Theorem 5.1
For the hypothesis class H consisting of halfplanes in R2 it holds that VCD(H) = 3.
Proof. Let us first prove that VCD(H) ≥ 3 by providing a specific set S of three points for which we
show that we can obtain every subset of S as a projection of some halfplane onto S.
Consider three points (not all lying on a line). There are eight possible labelings, and for each we can
find a halfplane containing only the positively labeled points; see Fig. 5.3. Hence, VCD(H) ≥ 3.
Figure 5.3: The eight possible labelings and a corresponding halfplane (gray shaded area
bounded by the blue line) that contains only the positively labeled points.
Figure 5.4: Examples of four points and their labelings (boundary of convex hull indicated
by dotted lines). No halfplanes exist that contain precisely the positively labeled points.
To see that no set of four points can be shattered, we consider three cases.
Only two of the four points define the convex hull of the four points (see Fig. 5.4(a)): Label the
interior points negative and the hull points positive. No halfplane exists that contains precisely
the positively labeled points. (If you want to make this argument mathematically rigorous, you
can argue like this: halfplanes are convex, hence a halfplane containing the two positively labeled
hull points also contains the line segment between them, and therefore the negatively labeled points.
Hence no halfplane can contain only the positively labeled points.)
Three of the four points define the convex hull of the four points (see Fig. 5.4(b)): Label the
interior point negative and the hull points positive. Again no halfplane exists that contains
precisely the positively labeled points. (Rigorous argument as above.)
All four points lie on the convex hull dened by the four points (see Fig. 5.4(c)): Label one
'diagonal' pair positive and the other 'diagonal' pair negative. Again, no halfplane exists that
contains precisely the positively labeled points. (Making this rigorous requires a bit more work.
It follows from a fundamental result from the theory of convex sets called Radon's Theorem. It
states that any set of d + 2 points in Rd can be partitioned into two subsets whose convex hulls
intersect each other. No halfplane can separate these two subsets.)
More generally, it can be shown that VCD(H) = d + 1 for halfspaces in Rd , but proving this is outside
the scope of the present course.
Definition 5.3
An (affine) hyperplane in Rd is a set of the form
Hw,β = {x ∈ Rd : wT x = β},
where w ∈ Rd \ {0} and β ∈ R.
Definition 5.4
A hyperplane Hw,β is a separating hyperplane for the data (xi , yi ) ∈ Rd × {±1}, i = 1, . . . , n, if
for all i = 1, . . . , n it holds that
wT xi ≥ β, if yi = +1,
wT xi ≤ β, if yi = −1.
It is allowed in this definition that some (or even all) of the data points are lying on the separating
hyperplane. If none of the data points lie on the separating hyperplane, i.e., when its distance to all
data points is positive, we say that the separating hyperplane has a positive margin.
There does not always exist a separating hyperplane. And even if one does exist, there might be many; see
Fig. 5.5(b-c). Support vector machines try to nd the separating hyperplane with maximum mar-
gin.
Why could it be a good idea to find a separating hyperplane that maximizes the margin? The intuition
behind this is that points near the decision boundary are often misclassified (there is an almost 50%
chance that the classifier decides either way). So, insisting on a large margin can potentially minimize
misclassification. This can be a good idea if nothing else is known about the data. (For a more model
dependent choice of decision boundaries, see the later section on naïve Bayes classifiers.)
Figure 5.5: Binary SVM problem. (a) training data with two labels (red=+ and blue=−), (b) a
separating hyperplane (solid black, dashed lines indicating the margin), (c) the optimal separating
hyperplane (solid black) maximizing the margin (indicated by the dashed lines).
Consider a separating hyperplane Hw,β , see Fig. 5.6. The points xi ∈ Rd with wT xi − β ≥ 0 lie on one
side of the hyperplane, the points xi ∈ Rd with wT xi − β ≤ 0 lie on the other side of the hyperplane
(points satisfying both inequalities are lying on both sides).
[Figure 5.6: a separating hyperplane Hw,β together with the parallel hyperplanes Hw,β−1 and Hw,β+1, each at distance 1/||w|| from Hw,β.]
Consider now the parallel hyperplanes Hw,β+1 and Hw,β−1 . What we want to have is that Hw,β+1 and
Hw,β−1 are separating and their distance (which can be shown to be 2/||w||; see Fig. 5.6) should be
maximal46 . This can be formulated as the optimization problem
max_{w,β} 2/||w||
subject to wT xi ≥ β + 1, for all i with yi = +1,
           wT xi ≤ β − 1, for all i with yi = −1,
which has the same optimal w and β as the optimization problem
46 There is nothing special in considering β + 1 and β − 1 in the definition of Hw,β+1 and Hw,β−1. Instead one could have also considered Hw,β+ε and Hw,β−ε for any of your favorite choices of ε ̸= 0, because Hw,β+1 = Hw′,β′+ε and Hw,β−1 = Hw′,β′−ε with w′ = εw and β′ = εβ.
min_{w,β} (1/2)||w||²   subject to   yi (wT xi − β) ≥ 1, for all 1 ≤ i ≤ n.
In the latter form, the binary SVM problem is a well-studied optimization problem. It has a quadratic
objective function, which is strictly convex, and its constraints are all (affinely) linear (such optimiza-
tion problems are often referred to as quadratic programs or convex quadratic programs). Due to the
strict convexity it can be shown that the minimum is unique, and this minimum can be found by a
variety of well-established algorithms (in matlab, you can use the command quadprog).
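For illustration, a minimal matlab sketch of this quadratic program using quadprog; the variable ordering z = [w; beta] and the variable names are choices made for this sketch, not prescribed by the notes.

% Solve min 0.5*||w||^2 s.t. y_i*(w'*x_i - beta) >= 1 via quadprog.
% X is d x n (data points in columns), y is 1 x n with entries +-1.
[d, n] = size(X);
H = blkdiag(eye(d), 0);              % quadratic term: 0.5*z'*H*z with z = [w; beta]
f = zeros(d+1, 1);                   % no linear term
A = [-(repmat(y, d, 1) .* X)', y'];  % row i equals [-y_i*x_i', y_i]
b = -ones(n, 1);                     % so that A*z <= b encodes y_i*(w'*x_i - beta) >= 1
z = quadprog(H, f, A, b);
w = z(1:d); beta = z(d+1);
% a new point x is then classified as +1 if w'*x >= beta, and as -1 otherwise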
Fig. 5.5(c) shows the maximum margin separating hyperplane for the data shown in Fig. 5.5(a). Having
found the maximum margin separating hyperplane, any new data x ∈ Rd will subsequently be classified
according to this hyperplane, i.e., by
y = +1 if wT x ≥ β, and y = −1 otherwise.
5.2.3 Naïve Bayes Classifier
Sometimes we have some prior knowledge about the data and we want or need to include this. In
Bayesian learning (we discuss here the special case of naïve Bayes classifiers) prior knowledge
is provided by asserting (a) a prior probability for each class labeling and (b) a probability distribution
over observed data for each possible labeling.
Naïve Bayes is a statistical classification technique based on Bayes' Theorem (see Section 1.3). A
naïve Bayes classifier assumes that the features ξ1 , . . . , ξd of a feature vector x = (ξ1 , . . . , ξd )T
are independent given the label y. This is usually an oversimplification (hence the term naïve), but it
simplifies the computations considerably.
Bayes' theorem and the independence give
Pr(y = y ∗ | ξ1 = ξ1∗ , . . . , ξd = ξd∗ ) = Pr(y = y ∗ ) ∏_{i=1}^{d} Pr(ξi = ξi∗ | y = y ∗ ) / Pr(ξ1 = ξ1∗ , . . . , ξd = ξd∗ ). (5.2)
Naïve Bayes:
Given a vector x∗ = (ξ1∗ , . . . , ξd∗ )T of features, the label ŷ that will be assigned to x∗ will be the
y ∗ ∈ L that maximizes Pr(y = y ∗ |ξ1 = ξ1∗ , . . . , ξd = ξd∗ ) in (5.2).
What we need for naïve Bayes is therefore the likelihood of each feature and the prior probability of
each label. The prior probability of the feature vector is not needed, since for all labels it will be the
same constant factor.
If nothing else is known, one usually assumes that the prior probability of each label is uniformly
distributed. For the likelihood one often assumes that the ξi feature associated with each label y ∗ ∈ L is
distributed according to a normal distribution (of course, this assumes that the ξi feature is continuous).
In other words,
Pr(ξi = ξi∗ | y = y ∗ ) = (1/(σy∗ √(2π))) exp( −(1/2) ((ξi∗ − µy∗ )/σy∗ )² ),
where µy∗ and σ²y∗ denote the mean and, respectively, the variance of the ξi feature in the training data
that is associated to label y ∗ . Such a Bayes classifier is often referred to as a Gaussian naïve Bayes
classifier.
Different from the SVM case discussed previously, decision boundaries for Bayes classifiers need
not be linear (they are generally quadratic).
Example 5.5
Suppose, we have the training data as considered in Section 5.2.2; Table 5.2(a) gives the numer-
ical values.
ξ1     ξ2     label
−1.5    2      −1
−0.8    0.7    −1
 0.5   −1      −1
−2     −0.7    −1
 2      0.8    −1
 0.3    2.3    −1
 3      3      +1
 4      6      +1
 5      3      +1
 4      4.5    +1
 5.5    4      +1
 2.2    5.5    +1
(a)

feature   label   mean    standard deviation
ξ1        −1      −0.25   1.47
ξ1        +1       3.95   1.22
ξ2        −1       0.68   1.35
ξ2        +1       4.33   1.25
(b)
Table 5.2: (a) Training data (as in Fig. 5.5), (b) mean and standard deviation of the
training data.
Using Gaussian naïve Bayes we want to classify x∗ = (ξ1∗ , ξ2∗ )T = (1, 4)T .
First we compute the mean and variance of the training data separately for each label. The
results are shown in Table 5.2(b).
For y ∗ = −1 we obtain the likelihoods Pr(ξ1 = 1 | y = −1) ≈ 0.189 and Pr(ξ2 = 4 | y = −1) ≈ 0.014.
Hence, with prior Pr(y = −1) = 1/2,
Pr(y = −1 | ξ1 = ξ1∗ , ξ2 = ξ2∗ ) ≈ 0.0014/Pr(ξ1 = ξ1∗ , ξ2 = ξ2∗ ).
For y ∗ = +1 we obtain Pr(ξ1 = 1 | y = +1) ≈ 0.018 and Pr(ξ2 = 4 | y = +1) ≈ 0.308.
Hence, with prior Pr(y = +1) = 1/2,
Pr(y = +1 | ξ1 = ξ1∗ , ξ2 = ξ2∗ ) ≈ 0.0027/Pr(ξ1 = ξ1∗ , ξ2 = ξ2∗ ).
As Pr(y = +1|ξ1 = ξ1∗ , ξ2 = ξ2∗ ) > Pr(y = −1|ξ1 = ξ1∗ , ξ2 = ξ2∗ ) we therefore assign x∗ to class
ŷ = +1.
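The computation can be reproduced with a few lines of matlab; a minimal sketch (variable names are illustrative, and the Gaussian pdf is written out explicitly):

Xtr = [-1.5 2; -0.8 0.7; 0.5 -1; -2 -0.7; 2 0.8; 0.3 2.3; ...
        3 3; 4 6; 5 3; 4 4.5; 5.5 4; 2.2 5.5];        % training data of Table 5.2(a)
y   = [-1 -1 -1 -1 -1 -1  1 1 1 1 1 1]';
xstar = [1 4];                                         % point to classify

labels = [-1, 1]; score = zeros(1,2);
for c = 1:2
    Xc = Xtr(y==labels(c),:);
    mu = mean(Xc); sigma = std(Xc);                    % per-feature mean and standard deviation
    lik = prod( exp(-0.5*((xstar-mu)./sigma).^2) ./ (sigma*sqrt(2*pi)) );  % naive Bayes likelihood
    score(c) = 0.5 * lik;                              % uniform prior 1/2 for each label
end
[~, idx] = max(score);
yhat = labels(idx)                                     % gives +1 for xstar = (1, 4)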
Fig. 5.7 shows the classification and the respective (non-linear) decision boundary.
Figure 5.7: Gaussian naïve Bayes. Classifying the point (1, 4) (shown as black circle). The
decision boundary is shown in black.
Example 5.6
Figure 5.8 shows an example of a decision tree for a mammal classification problem, where
the task is to decide whether a newly discovered species is a mammal or a non-mammal.
Figure 5.8: A decision tree for the mammal classification problem (adapted from [13]).
The first question is whether the species is cold- or warm-blooded. If it is cold-blooded, then it
is definitely not a mammal. Otherwise, we ask whether it gives birth (as opposed to laying eggs).
If the answer is yes it is a mammal, otherwise it is a non-mammal.
So, suppose we want to classify the flamingo species based on the data shown in Fig. 5.9.
Following the path through the tree (see dashed lines) we would classify flamingos as non-
mammals.
Figure 5.9: Classifying an unlabeled vertebrate (adapted from [13]).
Decision trees have three types of nodes: (a) a root node, which has no incoming arcs, (b) internal
nodes, each of which has exactly one incoming arc and two or more outgoing arcs, and (c) leaf nodes,
which have no outgoing arcs. The root and internal nodes represent the questions, the leaf nodes
represent the corresponding labels (note that the same label can be present at different leaves).
So, how does one construct (learn) such a tree? Of course, we want to construct the tree from training
data. The goal should be to find a decision tree that classifies all the training data correctly. It is not
difficult to see that there are typically many different decision trees that achieve this goal. Thus, we
might ask for an optimal decision tree, optimal in the sense that it minimizes the expected number
of tests required to identify the unlabeled object. Finding such an optimal decision tree is, however,
an NP-hard problem47 . Therefore one usually tries to construct near-optimal decision trees using some
heuristics.
Example 5.7
Let us consider learning the decision tree for classifying species into mammals and non-mammals
based on the data shown in Table 5.3.
47 See the short 1976 paper of Hyafil and Rivest, freely available at https://people.csail.mit.edu/rivest/pubs/HR76.pdf.
Columns of Table 5.3: Vertebrate Name, Body Temperature, Gives Birth, Aerial Creature, Has Legs, Hibernates, Class Label.
Table 5.3: Training data (adapted from [13]) for a vertebrate classication problem.
Which of the features `body temperature,' `gives birth,' `aerial creature,' `has legs,' and `hiber-
nates,' should we choose as root node? The answer is that there are different rules around, each
constituting a different algorithm. However, we do not go into the details here any further.
Let us assume that we choose `body temperature' as root node. Then we have the two classes
C1 (the cold-blooded vertebrates) and C2 (the warm-blooded vertebrates).
All elements of C1 are non-mammals, hence we do not split on this node any further. It
becomes a leaf.
We need to split C2 further. If we split on the feature `Gives birth' we get two classes,
each of which contains only members of one of the class labels (either mammals or non-mammals).
Hence we do not need to split any further. The two nodes become leaf nodes, and we have
completed constructing the decision tree. The result is shown in Fig. 5.8.
Could we have chosen a different feature to split, say, class C2? In principle, yes. But the
resulting decision tree would have contained more levels. Splitting on `aerial creature' would
have resulted in two classes both of which contain both mammals and non-mammals; splitting
on `has legs' would have resulted in a class containing both mammals and non-mammals; splitting
on `hibernates' would have resulted in a class containing both mammals and non-mammals; in
all cases one would have needed to split further.
without any help of a `supervisor' that could provide correct answers for each observation.
5.3.1 Clustering
In clustering the goal is to divide the observations into groups (clusters) so that the pairwise dis-
similarities between those assigned to the same cluster tend to be smaller than those in different
clusters.
There are a number of different clustering algorithms in the literature (and the development is still
ongoing). One of the most popular clustering methods is the so-called k-means algorithm48 . For a
prescribed k, the algorithm tries to find k clusters in a given data set (hence the name k-means).
Let us start by stating the algorithm. Then we consider examples followed by a brief discussion of
theoretical aspects.
What do we mean by `centroid of cluster Ci'? Suppose the r data points xi1 , . . . , xir are assigned
to cluster Ci . Then the centroid of Ci is defined as
(1/r) Σ_{l=1}^{r} xil .
What is a possible stopping criterion? We use the following: from one iteration to the next, the
(average) sum of squared errors (SSE)
(1/m) Σ_{i=1}^{k} Σ_{xj ∈ Ci} ||xj − si||²   (5.3)
does not decrease any further.
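The algorithm box itself is not reproduced here; the following is a minimal matlab sketch of the standard iteration (assignment step (a), centroid update step (b)) with illustrative variable names:

% k-means: X is d x m (data points in columns), S is d x k (initial sites in columns)
[d, m] = size(X); k = size(S,2);
sse_old = inf;
while true
    dist2 = zeros(k, m);                          % step (a): assign every point to the closest site
    for i = 1:k
        dist2(i,:) = sum((X - repmat(S(:,i),1,m)).^2, 1);
    end
    [minval, assign] = min(dist2, [], 1);
    for i = 1:k                                   % step (b): move every site to its cluster centroid
        if any(assign==i)
            S(:,i) = mean(X(:, assign==i), 2);
        end
    end
    sse = sum(minval)/m;                          % (average) SSE as in (5.3)
    if sse >= sse_old, break; end                 % stop when the SSE no longer decreases
    sse_old = sse;
end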
48 The idea behind this clustering method seems to trace back to H. Steinhaus (Jan. 14, 1887 – Feb. 25, 1972). The standard algorithm was first proposed by Stuart Lloyd of Bell Labs in 1957, though it was not published as a journal article until 1982.
49 Recall that the Euclidean norm (length) of a vector v = (v1 , . . . , vd )T ∈ Rd is ||v|| = √(v1² + · · · + vd²).
Example 5.8
Consider the set
X = {(−10, 1), (−8, 3), (−6, 2), (3, −1), (5, −3), (6, −2), (9, 3), (10, 7), (11, 5), (−5, −3)}
of 10 data points as shown in Fig. 5.10(a); the sites (7, 0), (−5, 0), (−2, 6), (0, 0) are depicted
as `x.' Fig. (b)-(d) show the results of k -means for k = 4 after Iteration 1, 2, and 3. The algorithm
has converged after 3 iterations, resulting in the sites (−8, 2), (−5, −3), (10, 5), (14/3, −2).
Figure 5.10: Example of k -means for k = 4 : (a) Initial data, (b) Iteration 1, (c) Iteration 2, (d) Itera-
tion 3.
We now want to prove that k-means converges in a finite number of steps. To this end, we first
need to prove the following lemma.
Lemma 5.1
Let x1 , . . . , xm ∈ Rd . The sum of squared distances of the xi to a point p ∈ Rd is minimized
when p is the centroid, i.e., if p = (1/m) Σ_{i=1}^{m} xi .
Proof. Let c = (1/m) Σ_{i=1}^{m} xi . Then,
Σ_{i=1}^{m} ||xi − p||² = Σ_{i=1}^{m} ||xi − c + c − p||² = Σ_{i=1}^{m} (xi − c + c − p)T (xi − c + c − p)
= Σ_{i=1}^{m} ||xi − c||² + 2(c − p)T Σ_{i=1}^{m} (xi − c) + m||c − p||².
Since Σ_{i=1}^{m} (xi − c) = 0 (by the definition of c), we therefore have
Σ_{i=1}^{m} ||xi − p||² = Σ_{i=1}^{m} ||xi − c||² + m||c − p||² ≥ Σ_{i=1}^{m} ||xi − c||²,   (∗)
where the inequality (∗) holds since m||c − p||² ≥ 0; this proves the lemma.
Theorem 5.2
For any data set X, set of sites S, and any k ∈ N, the k -means algorithm decreases (from one
iteration to the next) the SSE.
Proof. Let X = {x1 , . . . , xm }, S = {s1 , . . . , sk }, and let C1 , . . . , Ck denote the current clusters. Now,
let C1′ , . . . , Ck′ denote the clusters that we compute from these data in step (a), and let c′1 , . . . , c′k denote
the corresponding cluster centroids. Then,
(1/m) Σ_{i=1}^{k} Σ_{xj ∈ Ci} ||xj − si||² ≥ (1/m) Σ_{i=1}^{k} Σ_{xj ∈ Ci′} ||xj − si||²   (since any xj is assigned to the closest si)
≥ (1/m) Σ_{i=1}^{k} Σ_{xj ∈ Ci′} ||xj − ci′||²   (by Lemma 5.1).
Theorem 5.3
For any data set X, set of sites S, and any k ∈ N the k-means algorithm converges in a finite
number of steps.
Proof. There are at most k^m ways to partition m points into k clusters (for each of the m points we
have k possibilities to assign it to a cluster). This is a finite (though large) number. For each such
partition there are the centroids determined by step (b). Hence, we have at most k^m different values
for the SSEs.
By Theorem 5.2 the SSE decreases in each iteration (if the SSE remains the same then the stopping
criterion is fulfilled, hence we have convergence). The SSE, however, cannot continue to strictly decrease
indefinitely since, as already mentioned, there are at most k^m values for it. Therefore, the
stopping criterion is fulfilled after a finite number of steps.
Although we have proved convergence, there are several unfortunate issues. First, the convergence is
only to a local minimum (in practice, one therefore needs to re-run the algorithm several times and
record the minimum found). Second, the speed of convergence can depend much on the dimension d,
the number of points n, and the set of sites S. It is beneficial to provide as input good approximations
of the cluster centroids, but in general one encounters computationally difficult problems.
It is, for instance, known50 that the k-means problem (not the algorithm!) is NP-hard when the
dimension d is part of the input and k = 2. It is also NP-hard already for d = 2 if the number k is part
of the input51 . For fixed k and d, the k-means problem can be solved in polynomial time (polynomial
in m, the number of data points)52 .
Example 5.9
The k -means algorithm can be used for image compression.
Figure 5.11: Compression of a 640 × 380 color image. (a) original image (image_india.png)
containing 195,324 > 2^17 colors, (b) compressed to 32 = 2^5 colors, (c) compressed to 8 = 2^3
colors.
Let us look at the compression rate. Often, color images are stored by providing for each pixel an
RGB vector (this is a 3-dimensional vector containing the red/green/blue value in the respective
components). Typically, each entry in the RGB vector is an integer between 0 and 255, hence
each entry needs 1 Byte of storage. For color images not requiring the full color range, it is often
beneficial to store for each pixel the color label together with a lookup table, which, for each
color label, contains the corresponding RGB vector. The following table compares the sizes of
the images shown in Fig. 5.11.
Just to get an idea about the computation times, we remark that with matlab's built-in k -means
procedure, the computation time for producing (b) and (c) in Fig. 5.11 was 10.0 and 4.2 seconds,
respectively.
Table 5.4: Compression rates for the images shown in Fig. 5.11.
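Since the table body is not reproduced here, a rough back-of-the-envelope calculation (ignoring file-format overhead and any additional entropy coding) illustrates the idea: 640 · 380 = 243,200 pixels stored as full RGB vectors need 3 bytes per pixel, i.e., 729,600 bytes. With 8 = 2^3 colors, 3 bits per pixel suffice, i.e., 243,200 · 3/8 = 91,200 bytes plus a lookup table of 8 · 3 = 24 bytes; with 32 = 2^5 colors, 5 bits per pixel give 152,000 bytes plus a 96-byte lookup table. This is roughly a factor 8 (respectively 4.8) reduction.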
Figure 5.12: k -means clustering of the RGB vectors. (a) the 729, 600 data points (RGB vectors) of
image_india.png, (b) k-means clustering with k = 8 giving the image in Fig. 5.11(c).
The algorithm runs k -means with increasing k in a hierarchical fashion until the test accepts the
hypothesis that the data assigned to each k -means center are Gaussian.
In step (a) of the algorithm we assign all points that are closest to si to the ith cluster.
This set Pi is called Voronoi cell, and the collection P1 , . . . , Pk is a so-called Voronoi
diagrama .
a Georgy Feodosevich Voronoy (April 28, 1868 – Nov. 20, 1908).
Voronoi diagrams appear in many different contexts as, for instance, in robotics, imaging, physics,
biology, ecology, etc. The philosopher René Descartes54 thought that the solar system can be
viewed as a Voronoi diagram built of vortices centered at fixed stars (his Voronoi diagram's sites).
Remarkably, the `decision boundaries' in k -means are linear in the following sense.
Theorem 5.4
The shared boundary of any two neighboring Voronoi cells P1 and P2 is linear.
Proof. The shared boundary is contained in the set
B = {x ∈ Rd : ||x − s1||² = ||x − s2||²} = {x ∈ Rd : (s2 − s1)T x = (||s2||² − ||s1||²)/2},
the latter of which is clearly an affine hyperplane in Rd (note that s1 and s2 are fixed).
54 René Descartes (March 31, 1596 – February 11, 1650) was a French philosopher, mathematician, and scientist.
If one chooses other norms, then one would obtain other types of tessellations55 , for instance, power
diagrams, Laguerre tessellations, or generalized balanced power diagrams. An illustration of the tessel-
lation for Fig. 5.10(d) is shown in Fig. 5.13.
For ease of exposition, we will assume throughout this section that the data is centered, which
means that the data has zero mean in each dimension (i.e., set ξi′ = ξi −µi , where µi is the mean of
the sample's ith coordinates). This can be achieved in matlab via X=normalize(X','center')',
where X ∈ Rd×m is the matrix that has the data point xi in its ith column. If your data is not
centered, center it before applying the below formalism!
We recall that a matrix M ∈ Rd×d is said to have an eigenvalue of λ if there is a d-dimensional vector
u ̸= 0 for which M u = λu. This u is then the eigenvector corresponding to λ.
The sample covariance matrix for the centered data points xi , i = 1, . . . , m, holds in its ith row
and j th column the covariance between the two features i and j. Formally,
C = ( c_{1,1}  · · ·  c_{1,d} )
    (   ⋮       ⋱       ⋮    )
    ( c_{d,1}  · · ·  c_{d,d} ),    with c_{i,j} = (1/(m−1)) Σ_{ℓ=1}^{m} ξ_i^{(ℓ)} ξ_j^{(ℓ)}.
This, by the way, is a symmetric matrix. Observe that if we arrange the data points xj into a
matrix X ∈ Rd×m whose ith column contains the data point xi , then C can be expressed as
C = (1/(m−1)) X · XT = (1/(m−1)) Σ_{ℓ=1}^{m} xℓ xℓT .
Covariance is, in fact, a measure of how much two sample features vary together. (It is similar to
variance, but where variance indicates how a single feature varies, covariance indicates how two fea-
tures vary together.) A positive covariance between two features indicates that the features increase
or decrease together, whereas a negative covariance indicates that the features vary in opposite direc-
tions.
Now, we are ready to derive the principal components. There are several equivalent ways of deriving
them. One way is by finding the projections that maximize the variance. The first principal component
is the direction along which projections have largest variance. The second principal component is the
direction which maximizes variance among all directions orthogonal to the first, and so on.
Variance Maximization We assume that our centered data is collected in a d × m matrix X, which
contains the data point xi in the ith column. We wish to find a direction u that captures as much
as possible of the variance of our data points. Formally, this amounts to solving the optimization
problem
Find u ∈ Rd with ||u|| = 1 so as to maximize var(uT X). (5.4)
Note that uT X denotes the projection of X (i.e., of all column vectors x1 , . . . , xm ) to the subspace
spanned by the single vector u. We require ||u|| = 1 since if we would allow scaling then we would
obtain arbitrarily large values of var(uT X), and it would therefore make no sense to speak of a maximal
variance.
Theorem 5.5
The solution to the optimization problem (5.4) is to set u to equal the first principal component
(that is, the eigenvector corresponding to the largest eigenvalue) of C.
Proof. As we assume that the data x1 , . . . , xm is centered, they have sample mean µ = 0 and sample
variance
var(uT X) = (1/(m−1)) Σ_{i=1}^{m} (uT xi − µ)²   (definition of sample variance)
= (1/(m−1)) Σ_{i=1}^{m} (uT xi)²
= (1/(m−1)) Σ_{i=1}^{m} (uT xi)(xiT u)
= uT ( (1/(m−1)) Σ_{i=1}^{m} xi xiT ) u
= uT C u.
Now, our optimization problem reduces to finding u ∈ Rd with ||u|| = 1 that maximizes uT Cu. To
solve this, we use the following notation. With vi we denote the eigenvector of C corresponding to the
ith largest eigenvalue λi of C. Then,
Claim:
max_{||u||=1} uT Cu = max_{u̸=0} (uT Cu)/(uT u) = λ1 ,
and this maximum is attained for u = v1 .
Proof of the claim: Notice first that, by the eigenvalue decomposition (for real symmetric
matrices), we can write C = QDQT , where Q ∈ Rd×d holds in its ith column the ith
eigenvector vi , and D ∈ Rd×d is a diagonal matrix with di,i = λi on the diagonal. (Note, Q
is an orthogonal matrix, i.e., QQT is the identity matrix I.) Then,
max_{u̸=0} (uT Cu)/(uT u) = max_{u̸=0} (uT QDQT u)/(uT QQT u)   (since QQT = I)
= max_{y̸=0} (yT Dy)/(yT y)   (by setting y = QT u)
= max_{y̸=0} (λ1 η1² + · · · + λd ηd²)/(η1² + · · · + ηd²)
≤ max_{y̸=0} (λ1 η1² + · · · + λ1 ηd²)/(η1² + · · · + ηd²)
= λ1 ,
where equality is attained in the inequality when y = (η1 , . . . , ηd )T = (1, 0, . . . , 0)T , that is
u = Qy = v1 , proving the claim.
% Computing the PCA (there is also the pca command available)
% Input:  d x m matrix X holding the d-dimensional centered data points in its columns
%         k: the dimension onto which we want to project
% Output: k x m matrix Y holding the projected data points in its columns

C = cov(X');              % C is the sample covariance matrix (dimensions d x d)
% also possible: C = X*X'/(size(X,2)-1);

[W, D] = eigs(C, k);      % compute the k top eigenvalues and eigenvectors
% W is a d x k matrix with the top eigenvectors (normalized) in its columns
% D is a k x k diagonal matrix containing the top k eigenvalues

Y = W'*X;                 % project the data onto the top k principal components
Note that sometimes, as in Fig. 5.14(b), one might view the projected points Y in the original space Rd
(and not in Rk ). What one simply needs to do is to compute W · Y; the columns of this matrix are the
desired points in Rd .
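Continuing the code listing above (same variable names W and Y), this amounts to a single line:

Xproj = W*Y;              % d x m matrix; its columns are the projected points viewed in R^d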
An example for a PCA for 50 data points in R2 projected onto the first principal component is shown
in Fig. 5.14.
Figure 5.14: PCA in R2 for k = 1. (a) The 50 data points, (b) First principal component in red, second
principal component in dashed red, projected points in the original space are shown in black on the
red line.
We have already mentioned that reinforcement learning as a technique falls between the two categories
of supervised and unsupervised learning. Reinforcement learning allows an agent to learn how to
behave in an environment, where the only feedback is the reward signal. The agent's goal is to perform
actions that maximize future reward.
To make this precise we introduce the notion of Markov decision processes, which can be considered
to provide a formal framework for reinforcement learning.
Definition 5.5
A Markov decision process (MDP) is a tuple (Ω, A, T, R) in which
Ω is the set of states,
A is a finite set of actions,
T is a transition function T : Ω × A × Ω → [0, 1], which, for T (ω, a, ω ′ ), gives the
probability that action a performed in state ω will lead to state ω ′ ,
and R is a reward function defined as R : Ω × A → R.
A policy ρ is a function ρ : Ω → A, which tells the agent which action to perform in any given state.
Application of a policy to an MDP proceeds as follows. First, a start state ω0 is generated. Then,
the policy ρ proposes the action a0 = ρ(ω0 ), and this action is performed. Based on the transition
function T and reward function R, a transition is made to state ω1 with probability T (ω0 , a0 , ω1 ), and
a reward R(ω0 , a0 ) is received. This process continues, producing a sequence ω0 a0 ω1 a1 ω2 a2 · · · (If the
process is about to end in a final state, as, for instance, in finite games, then one can restart
the process in a new initial state; this always results in an infinite sequence.) Hence, given a
fixed policy, MDPs are in fact Markov chains. The sequence r0 , r1 , . . . of rewards received along the above
sequence is r0 = R(ω0 , a0 ), r1 = R(ω1 , a1 ), . . .
The main goal of learning in an MDP is to `learn' (i.e., find) a policy that gathers rewards. If the
agent were only concerned about the immediate reward at a fixed time t, a simple optimality criterion
would be to optimize E[rt ], where rt = R(ωt , at ). There are, however, several ways of taking the future
into account. For instance, one could aim at optimizing
\[
E\Big[\sum_{t=0}^{h} r_t\Big] \qquad \text{or} \qquad E\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big],
\]
where h is a fixed number of time steps and γ < 1 a given parameter. The former objective with
the finite sum is used in so-called finite horizon approaches, the latter is used in so-called discounted,
infinite horizon approaches.
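To make the discounted objective concrete, here is a minimal MATLAB sketch (the toy MDP data and all variable names are our own) that simulates a fixed policy and accumulates the discounted reward sum from above.

% Minimal sketch (toy MDP data of our own): simulating a fixed policy
nS = 3; nA = 2;                              % number of states and actions
T = rand(nS, nA, nS); T = T ./ sum(T, 3);    % toy transition function, stochastic in the last index
R = randn(nS, nA);                           % toy reward function R(w,a)
rho = [1; 2; 1];                             % a fixed policy: one action per state
gamma = 0.9; steps = 100;                    % discount factor and simulated horizon
w = 1; G = 0;                                % start state and accumulated discounted reward
for t = 0:steps-1
    a = rho(w);
    G = G + gamma^t * R(w, a);               % add gamma^t * r_t
    p = squeeze(T(w, a, :));                 % distribution of the next state
    w = find(rand <= cumsum(p), 1);          % sample the next state
end
disp(G);                                     % one sample of the (truncated) discounted reward sum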
A popular and mathematically rigorous method for finding an optimal policy for an MDP (where really
all the data, i.e., (Ω, A, T, R), is given) is so-called dynamic programming. This method is also used
in other contexts, such as finding shortest paths in a graph. The interested reader may
find [14] to be a good starting point to find out more about MDPs.
Example 5.10
Consider the problem of finding an `optimal strategy' in a game of tic-tac-toea . We can model
this as an MDP.
A position of the board (i.e., the configuration of Xs and Os) can be considered as a state (states
can be enumerated, hence Ω can be taken to be a set of natural numbers, Ω = {1, . . . , 3^9}).
The actions correspond to the moves made by the agent (again these could be enumerated).
What would the transition function tell us? T (ω, a, ω ′ ) could, for instance, tell us with what
probability the opponent would bring up the position ω ′ if we were in position ω performing
move a. Such a model would be adequate if the agent should learn to play against a specific type
of player (one that does not necessarily play perfectly). In practice such a transition matrix
needs to be learned as well, and we will comment on this at the end of this example.
For the reward function, it would be most natural to assign to all position-action pairs resulting
in a win a positive value, to all losing position-action pairs a negative value, and all others a
zero value.
The goal of the agent should be to receive a positive reward, which means winning the
game.
As mentioned, if all the data were known, one could apply dynamic programming and obtain
an optimal policy/strategy. But suppose T needs to be learned as well. Then it would be natural
to play many times against the opponent and record the transitions as approximations of the
probabilities; see also the sketch after this example. (This, of course, assumes that the opponent
does not change its style of playing
after a period of time.) This can then also be combined with strategies that learn how `valuable'
a given position in the game actually isb .
a In this game, players alternate placing pieces (typically Xs for the first player and Os for the second) on a
3 × 3 board. The first player to get three pieces in a row (vertically, horizontally, or diagonally) is the winner.
b A good webpage for further studies on this theme, including sample code, can be found at https://www.
codeproject.com/Articles/1400011/Reinforcement-Learning-A-Tic-Tac-Toe-Example.
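A minimal MATLAB sketch (all names are hypothetical, our own) of how such a transition function could be estimated from recorded play: count the observed transitions and normalize.

% Minimal sketch (hypothetical names): estimate T from recorded play
% Assume each row of the matrix playlog holds one observed triple (w, a, w2),
% and nS, nA denote the number of states and actions.
counts = zeros(nS, nA, nS);
for r = 1:size(playlog, 1)
    w = playlog(r, 1); a = playlog(r, 2); w2 = playlog(r, 3);
    counts(w, a, w2) = counts(w, a, w2) + 1;   % count the observed transitions
end
T = counts ./ max(sum(counts, 3), 1);          % relative frequencies approximate T(w,a,w')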
In this section we give a brief introduction to neural networks. In particular, we will discuss feedforward
networks, Hopfield networks, and Boltzmann machines. We will keep the exposition as short as possible.
The interested reader can nd more information in [4, 1].
5.5.1 Terminology
(Artificial) neural networks are usually used for supervised learning (there do exist examples for
unsupervised and reinforcement learning, too). They try to model the apparently highly nonlinear
infrastructure of brain networks. The historically first artificial neural network (called perceptron) was
invented in 1958 by psychologist Frank Rosenblatt58 .
Neural networks are composed of layers of computational units called nodes (or neurons) with
connections in different layers. Neurons are (typically nonlinear) parameterized functions of their input
variables. In a very common model, called the McCulloch-Pitts59 neuron model, the output sj of a
node j is given as
\[
s_j = \psi\Big(\sum_i \omega_{i,j}\, s_i + \theta_j\Big),
\]
where ωi,j is the weight of the connection between node i and j, θj is a given constant (called threshold
or bias) and ψ is the so-called activation function. We will discuss several popular activation
functions further below. Fig. 5.15 shows a single McCulloch-Pitts neuron with three inputs.
Figure 5.15: A single McCulloch-Pitts neuron with three inputs: s1 , s2 , s3 feed into node 4 with weights
ω1,4 , ω2,4 , ω3,4 and threshold θ4 ; the output is s4 = ψ(ω1,4 s1 + ω2,4 s2 + ω3,4 s3 + θ4 ).
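As a minimal MATLAB sketch (with example values of our own), the output of the neuron in Fig. 5.15 for given inputs, weights, threshold, and a Sigmoid activation:

% Minimal sketch (our own example values): one McCulloch-Pitts neuron
s     = [0.5; -1.0; 2.0];          % inputs s1, s2, s3
omega = [0.2; -0.4; 0.1];          % weights omega_{1,4}, omega_{2,4}, omega_{3,4}
theta = 0.3;                       % threshold/bias theta_4
psi   = @(z) 1./(1 + exp(-z));     % Sigmoid activation function
s4    = psi(omega'*s + theta);     % output of node 4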
Some of the nodes may be so-called input nodes, which means that their input is externally provided
(involving no computations). Some of the nodes may be output nodes, which means that the user
can read them out. Other types of nodes are hidden nodes; they are used internally, solely for
computations.
To specify a neural network one needs to specify the network architecture (i.e., the nodes and
connections), the updating rules (i.e., the functions evaluated by each node), and the learning
rules (i.e., procedures to compute the weights from training data).
58
American psychologist (July 11, 1928 – July 11, 1971). In our language below, a perceptron is a McCulloch-Pitts
neuron with binary inputs and Heaviside activation function.
59
Named after Warren Sturgis McCulloch (Nov. 16, 1898 – Sept. 24, 1969) and Walter Harry Pitts, Jr. (April 23,
1923 – May 14, 1969).
Networks without any cycle, also called feedforward networks, can be thought of as evaluating a
function f : Rd → Rn (the input is given by the input nodes, the output by the output nodes).
Networks containing cycles are called recurrent networks.
Figure 5.16: Examples of neural network architectures. (a) Feedforward network, (b) Recurrent net-
work.
s1   s2   XOR(s1 , s2 )
0    0    0
0    1    1
1    0    1
1    1    0
[Plot: the four points (s1 , s2 ) ∈ {0, 1}^2 in the plane, labeled by the value of XOR(s1 , s2 ).]
Table 5.5: Some popular activation functions ψ and their derivatives ψ′.

Name        ψ(z)                                   ψ′(z)
ReLU        z for z ≥ 0, 0 for z < 0               1 for z > 0, 0 for z < 0
Sigmoid     1/(1 + exp(−z))                        ψ(z)(1 − ψ(z))
Tanh        (e^z − e^{−z})/(e^z + e^{−z})          1 − tanh^2(z)
Signum      +1 for z ≥ 0, −1 for z < 0             0 for z ≠ 0, undefined for z = 0
Heaviside   1 for z ≥ 0, 0 for z < 0               0 for z ≠ 0, undefined for z = 0

Note the different ranges of the functions, which make them suitable in different situations. Generally
speaking, Sigmoid and Tanh activation functions were very popular in the past, but their popularity
seems to be declining in the realm of deep learning. In contrast to ReLU, the Sigmoid and Tanh activation
functions are observed to be more problematic to use in deep networks due to the so-called vanishing
gradient problem61 . ReLU does not seem very problematic in this respect, and it is also a function that
is easy to compute. Signum and Heaviside are often used in classification problems (at least applied
in the last layer).
61
We are not going into these details here, but roughly speaking the small derivatives of the Sigmoid and Tanh functions
for larger inputs x can cause problems in training deep networks. The backpropagation algorithm, often used for training,
requires computations of gradients over multiple layers. If these values are small, one obtains an exponential decrease in
the gradient (the gradient is `vanishing') and it is therefore not useful for training.
5.5.3 Feedforward Networks
Example 5.11
Consider the following feedforward network
[Figure: a feedforward network with input nodes s1 , s2 , hidden nodes s3 , s4 (thresholds θ3 , θ4 ; incoming
weights ω1,3 , ω2,3 and ω1,4 , ω2,4 ), and output nodes s5 , s6 (thresholds θ5 , θ6 ; incoming weights ω3,5 , ω4,5
and ω3,6 , ω4,6 ).]
Each node i computes its output via an activation function ψi , i = 3, 4, 5, 6. The weights ωi,j and
thresholds θi of the network are assumed to be given.
Given the input s1 = 1 and s2 = 0 we would like to determine the output (s5 and s6 ).
We first calculate the weighted sums of the hidden nodes,
z3 = ω1,3 s1 + ω2,3 s2 + θ3 ,
z4 = ω1,4 s1 + ω2,4 s2 + θ4 ,
yielding
s3 = ψ3 (z3 ),   s4 = ψ4 (z4 ),
and then the weighted sums of the output nodes,
z5 = ω3,5 s3 + ω4,5 s4 + θ5 ,
z6 = ω3,6 s3 + ω4,6 s4 + θ6 ,
yielding
s5 = ψ5 (z5 ),   s6 = ψ6 (z6 ).
The computation performed in the previous example, i.e., computing the output for a given input, is
called a forward pass (or forward propagation).
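As a sketch of such a forward pass for the network of Example 5.11 (the numerical weights and thresholds below are hypothetical placeholders, since no concrete values are fixed in the text; Sigmoid activations are assumed for all nodes):

% Forward pass for the 2-2-2 network of Example 5.11 (hypothetical parameter values)
psi = @(z) 1./(1 + exp(-z));                  % Sigmoid activation, assumed for all nodes
s1 = 1; s2 = 0;                               % given input
w13 = 0.5; w23 = -0.3; w14 = 0.2; w24 = 0.8;  % weights into the hidden nodes (placeholders)
w35 = 1.0; w45 = -0.5; w36 = 0.7; w46 = 0.1;  % weights into the output nodes (placeholders)
th3 = 0.1; th4 = -0.2; th5 = 0.0; th6 = 0.3;  % thresholds (placeholders)

z3 = w13*s1 + w23*s2 + th3;  s3 = psi(z3);    % hidden node 3
z4 = w14*s1 + w24*s2 + th4;  s4 = psi(z4);    % hidden node 4
z5 = w35*s3 + w45*s4 + th5;  s5 = psi(z5);    % output node 5
z6 = w36*s3 + w46*s4 + th6;  s6 = psi(z6);    % output node 6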
But now suppose we would like to train the network. That is, given the input, say s1 = 1, s2 = 0, we
would like to adapt the weights and thresholds in such a way that the neural network produces a prescribed
output y, say y = (s5 , s6 ) = (1, 1). The prescribed output y is called the target output.
This is where the so-called backpropagation algorithm comes into play. We describe only the basics,
considering only feedforward networks with Sigmoid activation functions (other cases are similar but
outside the scope of this course). The backpropagation algorithm is, in fact, an efficient application of
Leibniz's62 chain rule for differentiation.
Learning Rule: The Backpropagation Algorithm Consider a feedforward network with Sigmoid
activation functions. The weights and thresholds of the network are denoted by w = (ω1,1 , . . . , ωN,N )
and θ = (θ1 , . . . , θN ). Let x ∈ Rd and h(w, θ, x) ∈ Rn denote the input and output of the network,
respectively. Further, let y ∈ Rn denote the target output. With h(w, θ, x)k we denote the k th
component of the vector h(w, θ, x).
We can motivate the backpropagation learning algorithm as gradient descent on the squared error/loss
function
\[
E = E_{x,y}(w, \theta) = ||y - h(w, \theta, x)||^2 = \sum_{k=1}^{n} \big(y_k - h(w, \theta, x)_k\big)^2 .
\]
We write Ex,y (w, θ) instead of E(w, θ, x, y) to emphasize that x and y are fixed and we try to adjust
the weights w and thresholds θ.
Gradient descent is an optimization method for finding a local minimum of a differentiable function
(here the error/loss function Ex,y (w, θ)). With α > 0 denoting a chosen step length (often referred to
as learning rate parameter) and ∇Ex,y (w, θ) denoting the gradient of Ex,y (w, θ), gradient descent
generates a sequence w(t+1) , θ(t+1) , t = 0, 1, 2, . . . , as follows.

Gradient descent:
\[
\begin{pmatrix} w^{(t+1)} \\ \theta^{(t+1)} \end{pmatrix}
\leftarrow
\begin{pmatrix} w^{(t)} \\ \theta^{(t)} \end{pmatrix}
- \alpha\, \nabla E_{x,y}\big(w^{(t)}, \theta^{(t)}\big),
\qquad t = 0, 1, 2, \ldots
\]
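As a small illustration of the update rule (on a toy function of our own, not the network loss), gradient descent in MATLAB looks as follows:

% Minimal gradient descent sketch on the toy function f(v) = (v1-1)^2 + 3*(v2+2)^2
gradf = @(v) [2*(v(1)-1); 6*(v(2)+2)];   % gradient of f
alpha = 0.1;                             % step length (learning rate)
v = [0; 0];                              % starting point
for t = 1:100
    v = v - alpha*gradf(v);              % gradient descent update
end
disp(v);                                 % approaches the minimizer (1, -2)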
We now want to compute ∇Ex,y (w(t) , θ(t) ). First, consider an output node j. Note that
\[
h(w, \theta, x)_j = \psi\Big(\sum_i \omega_{i,j}\, s_i + \theta_j\Big) = \psi(z_j) = s_j .
\]
Hence, we obtain
\[
\begin{aligned}
\frac{\partial E}{\partial z_j}
&= \frac{\partial}{\partial z_j} \sum_k (y_k - \psi(z_k))^2
 = \frac{\partial}{\partial z_j} (y_j - \psi(z_j))^2 \\
&\overset{\text{Leibniz}}{=} -2\,(y_j - \psi(z_j))\,\psi'(z_j)
 \overset{(*_1)}{=} -2\,(y_j - \psi(z_j))\,\psi(z_j)\,(1 - \psi(z_j))
 = -2\,(y_j - s_j)\, s_j (1 - s_j)
\end{aligned}
\]
62
Gottfried Wilhelm (von) Leibniz (July 1, 1646 – Nov. 14, 1716). German mathematician, philosopher, scientist and
diplomat.
and
\[
\frac{\partial z_j}{\partial \omega_{i,j}} = \frac{\partial}{\partial \omega_{i,j}} \Big(\sum_i \omega_{i,j}\, s_i + \theta_j\Big) = s_i . \qquad (5.5)
\]
Note that (∗1 ) is a property of the Sigmoid function; see Table 5.5. Thus,
\[
\frac{\partial E}{\partial \omega_{i,j}} = \frac{\partial E}{\partial z_j} \cdot \frac{\partial z_j}{\partial \omega_{i,j}}
= -\underbrace{2(y_j - s_j)\, s_j (1 - s_j)}_{=:\Delta_j}\; s_i = -\Delta_j\, s_i .
\]
Similarly we have
\[
\frac{\partial E}{\partial \theta_j} = \frac{\partial E}{\partial z_j} \cdot \frac{\partial z_j}{\partial \theta_j} = -\Delta_j \cdot 1 .
\]
Hence, the updating rule for the parameters for any output node j is:
\[
\omega_{i,j}^{(t+1)} \leftarrow \omega_{i,j}^{(t)} + \alpha\,\Delta_j\, s_i \quad \text{for any } i \text{ feeding into } j,
\qquad
\theta_j^{(t+1)} \leftarrow \theta_j^{(t)} + \alpha\,\Delta_j . \qquad (5.6)
\]
So far, we have considered the nodes j in the output layer. Let us consider a node j in the previous
layer, a node i feeding into j, and j feeding into an output node k; see the figure below.
[Figure: node i with output si feeds via the weight ωi,j into node j (threshold θj , output sj ), which in
turn feeds into an output node k with output sk .]
We have
\[
h(w, \theta, x)_k = s_k = \psi(z_k) = \psi\Big(\omega_{j,k}\, s_j + \sum_{l \neq j} \omega_{l,k}\, s_l + \theta_k\Big),
\]
and hence in ψ(zk ) only the term ωj,k sj depends on ωi,j . Using this we obtain
\[
\begin{aligned}
\frac{\partial E}{\partial \omega_{i,j}}
&= \sum_k \frac{\partial}{\partial \omega_{i,j}} (y_k - h(w, \theta, x)_k)^2
\overset{\text{Leibniz}}{=} -\sum_k 2(y_k - s_k)\, \frac{\partial}{\partial \omega_{i,j}} \psi(z_k) \\
&\overset{\text{Leibniz}}{=} -\sum_k 2(y_k - s_k)\, \psi'(z_k)\, \omega_{j,k}\, \frac{\partial s_j}{\partial \omega_{i,j}}
= -\sum_k 2(y_k - s_k)\, \psi'(z_k)\, \omega_{j,k}\, \frac{\partial}{\partial \omega_{i,j}} \psi(z_j) \\
&\overset{\text{Leibniz}}{=} -\sum_k 2(y_k - s_k)\, \psi'(z_k)\, \omega_{j,k}\, \psi'(z_j)\, \frac{\partial z_j}{\partial \omega_{i,j}}
\overset{(5.5)}{=} -\sum_k 2(y_k - s_k)\, \psi'(z_k)\, \omega_{j,k}\, \psi'(z_j)\, s_i \\
&\overset{(*_1)}{=} -\sum_k 2(y_k - s_k)\, s_k (1 - s_k)\, \omega_{j,k}\, s_j (1 - s_j)\, s_i
= -\underbrace{s_j (1 - s_j) \Big(\sum_k \Delta_k\, \omega_{j,k}\Big)}_{=:\Delta_j}\; s_i .
\end{aligned}
\]
Similarly, we have
\[
\frac{\partial E}{\partial \theta_j} = \sum_k \frac{\partial}{\partial \theta_j} (y_k - h(w, \theta, x)_k)^2 = -\Delta_j \cdot 1 .
\]
We therefore obtain the same updating rule as in (5.6); the only difference is that ∆j for such j is
defined as
\[
\Delta_j = s_j (1 - s_j) \Big(\sum_k \Delta_k\, \omega_{j,k}\Big). \qquad (5.7)
\]
Single-Pass Backpropagation
(for feedforward networks with Sigmoid activation functions):
1. For each output node j compute ∆j := 2(yj − sj )sj (1 − sj ) and update ωi,j^(t+1) and θj^(t+1)
   according to (5.6).
2. For layer l = ℓ − 1, ℓ − 2, . . . , 1:
   For each node j in the lth layer (i.e., feeding into nodes sk in layer l + 1) compute
   ∆j according to (5.7) and update according to (5.6).
3. Set t ← t + 1.
Note that the version we described here bases all computations on the non-updated ωi,j^(t) and θi^(t).
Only after finishing the complete single-pass backpropagation should one set the parameters to their
updated values (i.e., set t ← t + 1). In the literature there also exist many other variants. It should
also be noted that we described just a single pass (all weights and thresholds are updated only once).
Typically, a single pass is iterated for several thousand iterations to achieve a substantial decrease
in the error/loss function. Also note that the update computations can be performed rather efficiently,
since the computation of the ∆j in layer l involves only the previously computed ∆k
values from layer l + 1.
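A minimal MATLAB sketch (our own, for the 2-2-2 architecture of Example 5.11 with Sigmoid activations) of one single pass of backpropagation; W1, b1 collect the weights and thresholds of the hidden layer, W2, b2 those of the output layer, and x, y are the input and target output.

% One single pass of backpropagation for a 2-2-2 network with Sigmoid activations
% (a sketch with our own variable names; W1,b1: hidden layer, W2,b2: output layer)
psi = @(z) 1./(1 + exp(-z));
x = [1; 0]; y = [1; 1];                  % input and target output
W1 = randn(2,2); b1 = randn(2,1);        % initial weights/thresholds (hidden layer)
W2 = randn(2,2); b2 = randn(2,1);        % initial weights/thresholds (output layer)
alpha = 0.5;                             % learning rate

% forward pass
s_hid = psi(W1*x + b1);                  % hidden layer outputs (s3, s4)
s_out = psi(W2*s_hid + b2);              % output layer outputs (s5, s6)

% backward pass: the Delta's as in the text
Delta_out = 2*(y - s_out).*s_out.*(1 - s_out);     % Delta_j for the output nodes
Delta_hid = s_hid.*(1 - s_hid).*(W2'*Delta_out);   % Delta_j for the hidden nodes, Eq. (5.7)

% update according to (5.6), using the non-updated parameters throughout
W2 = W2 + alpha*Delta_out*s_hid';  b2 = b2 + alpha*Delta_out;
W1 = W1 + alpha*Delta_hid*x';      b1 = b1 + alpha*Delta_hid;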
One of the most important contributions in the 1982 paper was the introduction of the idea of an
energy function into neural network theory. For the Hopfield network the energy function H is
\[
H = H(s_1, \ldots, s_N) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \omega_{i,j}\, s_i s_j - \sum_{i=1}^{N} s_i \theta_i , \qquad (5.8)
\]
which gives the energy of the network as a function of the current states s1 , . . . , sN (assuming that
the weights ωi,j and the thresholds θ1 , . . . , θN are given). We have seen this type of energy function before,
when we discussed the Ising model in Example 4.9 (see the exponent of e in the definition of π).
Updating a node in the Hopfield network65 is performed via the signum activation function, i.e.,
by
\[
s_j = \psi\Big(\sum_i \omega_{i,j}\, s_i + \theta_j\Big) =
\begin{cases}
+1 & \text{if } \sum_i \omega_{i,j}\, s_i + \theta_j \ge 0, \\
-1 & \text{otherwise.}
\end{cases}
\]
Such updates might be performed either asynchronously (nodes are consecutively updated in a prede-
fined order) or synchronously (all nodes are updated at the same time).
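A minimal MATLAB sketch (our own variable names; a symmetric N × N weight matrix W, N × 1 thresholds theta, and a state vector s with entries in {−1, +1} are assumed to be given) of the energy function (5.8) and one asynchronous update sweep:

% Minimal sketch: energy (5.8) and one asynchronous sweep of signum updates
% Assumed given: symmetric N x N weight matrix W, N x 1 thresholds theta, states s in {-1,+1}
H = @(s) -0.5*(s'*W*s) - theta'*s;     % energy function (5.8)
for j = 1:length(s)                    % asynchronous: nodes are updated one after another
    if W(:, j)'*s + theta(j) >= 0      % weighted input sum of node j
        s(j) = +1;
    else
        s(j) = -1;
    end
end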
Remarkably, H never increases (i.e., it decreases strictly or stays constant) as the system evolves according to
its updating rule.
Theorem 5.6
The energy function H of a Hopfield network decreases after any single updating step.
Proof. Suppose node j changes its state from sj = α to sj = β (we assume α, β ∈ {±1}; the case
α, β ∈ {0, 1} follows analogously). Writing ∆H for the resulting change in energy, we have
\[
\Delta H = (\alpha - \beta) \Big(\sum_i \omega_{i,j}\, s_i + \theta_j\Big).
\]
Now, if α = −1 and β = +1, then we have, since we were updating, Σi ωi,j si + θj ≥ 0, and thus
∆H = (α − β)(Σi ωi,j si + θj ) ≤ 0. If α = +1 and β = −1, then similarly we have Σi ωi,j si + θj < 0,
and therefore ∆H = (α − β)(Σi ωi,j si + θj ) < 0. In both cases the energy function decreases (not
necessarily in a strict sense).
The memorized patterns are the local minima of the energy function. Thus it is in principle possible
to use Hopfield networks to solve optimization problems.
So, how does a Hopfield network learn? The learning rule is rather simple. Suppose we have m patterns
ξ (ℓ) = (ξ1(ℓ) , . . . , ξN(ℓ) )T , with ξi(ℓ) being equal to the state si in the ℓth pattern, ℓ = 1, . . . , m. Then, the
learning rule for the weight ωi,j with i ̸= j is
\[
\omega_{i,j} = \frac{1}{m} \sum_{\ell=1}^{m} \xi_i^{(\ell)} \xi_j^{(\ell)} . \qquad (5.9)
\]
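The learning rule (5.9) in a minimal MATLAB sketch (our own variable names); Xi holds the m patterns in its columns:

% Minimal sketch of the learning rule (5.9); Xi is an N x m matrix of patterns in {-1,+1}
m = size(Xi, 2);
W = (Xi*Xi')/m;                 % W(i,j) = (1/m) * sum over l of Xi(i,l)*Xi(j,l)
W(1:size(W,1)+1:end) = 0;       % the rule is stated for i ~= j; set the diagonal to zero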
This is, up to a factor, the covariance between the states of nodes i and j (see the definition of ci,j
in Section 5.3.2). We can see that we obtain a large weight if, most of the time, the states of nodes i
and j in the training set coincide. This is, in fact, a manifestation of a very intuitive learning rule
from neuroscience, called the Hebbian learning rule66 , which can be paraphrased as stating `Neurons that
fire together, wire together. Neurons that fire out of sync, fail to link.'
65
What we describe here is, strictly speaking, a discrete Hopfield network. There exist also continuous versions; they
use a different activation function.
Hopfield, in his original paper, demonstrated via simulations that about 0.15N relevant patterns can
be stored in a Hopfield network consisting of N neurons. This is, by today's standards, not very
impressive, but it caught quite some attention in the 1980s.
Example 5.12
Let us consider the Hopfield network with 3 · 3 = 9 nodes, each of which can be in a +1 or a −1 state.
The −1 state will be depicted in black, the +1 state in white. The nodes are numbered starting
from the top left going to the right, row by row (see Fig. 5.17(a)). Further we assume that we
set the thresholds θi to zero.
Figure 5.17: Hopfield network with 3·3 = 9 nodes. (a,b) learned patterns, (c) first input pattern,
(d) changed state, (e) second input pattern, (f) third input pattern.
Now, assume that we learn the two patterns shown in Fig. 5.17(a) and (b). The corresponding
weights ωi,j (calculated according to (5.9)) are given by the following matrix (the entry in the
ith row and j th column gives the weight ωi,j ).
\[
W = \begin{pmatrix}
0 & -1 & 0 & 0 & -1 & -1 & +1 & 0 & 0 \\
-1 & 0 & 0 & 0 & +1 & +1 & -1 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & -1 & +1 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & +1 & -1 \\
-1 & +1 & 0 & 0 & 0 & +1 & -1 & 0 & 0 \\
-1 & +1 & 0 & 0 & +1 & 0 & -1 & 0 & 0 \\
+1 & -1 & 0 & 0 & -1 & -1 & 0 & 0 & 0 \\
0 & 0 & -1 & +1 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & +1 & -1 & 0 & 0 & 0 & -1 & 0
\end{pmatrix}
\]
Now, suppose the input is the pattern shown in Fig. 5.17(c). When we update all states,
we find that only a single node will change its state, namely the blue-boxed node in Fig. 5.17(d).
For this node i, the corresponding weighted inputs sum up to
\[
\sum_{j=1}^{N} \omega_{i,j}\, s_j = -3 .
\]
66
Named after the Canadian psychologist Donald Hebb (July 22, 1904 – Aug. 20, 1985), who introduced this rule in
his MA thesis.
Hence, the node will change into the state −1 (black). Then, no further updates will occur
(the final energy, which will not decrease anymore, is here H = −16). The Hopfield network does indeed
recover the pattern in Fig. 5.17(a), which makes sense from the point of view of noise removal.
When we take as input the pattern shown in Fig. 5.17(e), then we will indeed recover the pattern
in Fig. 5.17(b).
Starting, however, with the input pattern shown in Fig. 5.17(f), we will not recover either of the
two learned patterns. The network gets stuck in a local minimum, which actually has energy
H = −8.
As in the previous example, Hopfield networks can get stuck in local minima. This happens because
there are usually several of them and the energy decreases (or stays the same) in each updating step.
This is where Boltzmann machines come into play.
Then, node j turns (or remains) in state +1 with the Metropolis-Hastings acceptance probability
\[
\min\Big\{1,\; \exp\Big(-\frac{\Delta H_j}{T}\Big)\Big\},
\]
where the (given) scalar T is referred to as the temperature of the system (see also simulated annealing).
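A minimal sketch (our own) of this stochastic update for a single node j, reusing the energy function H from the Hopfield sketch above; as an assumption on details not restated here, ∆Hj is taken to be the energy of the configuration with sj = +1 minus the energy with sj = −1:

% Minimal sketch (assumption: dHj = H with s_j = +1 minus H with s_j = -1)
splus = s;  splus(j)  = +1;
sminus = s; sminus(j) = -1;
dHj = H(splus) - H(sminus);            % energy difference of choosing s_j = +1
if rand <= min(1, exp(-dHj/T))
    s(j) = +1;                         % node j turns (or remains) in state +1
else
    s(j) = -1;                         % otherwise it is set to -1 (assumption)
end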
The updating procedure for Boltzmann machines can therefore be understood as generating a Markov
chain (it is, in fact, a form of Gibbs sampling). It is easily seen that this Markov chain is aperiodic
and irreducible, and therefore, by Theorem 4.3, the stationary distribution is
\[
\pi(s_1, \ldots, s_N) = \frac{1}{Z}\, e^{-\frac{1}{T} H(s_1, \ldots, s_N)} .
\]
Distribution functions of this form are often referred to as Boltzmann distributions. (This is, in
fact, why these machines are called `Boltzmann machines.')
Boltzmann machines (in fact, restricted versions thereof) are nowadays considered to be powerful tools
for recommender systems. For instance, all three approaches winning the Netflix Prize70 (which sought
67
Ludwig Boltzmann (Feb. 20, 1844 – Sept. 5, 1906), Austrian physicist and philosopher.
68
The paper is freely available at https://onlinelibrary.wiley.com/doi/pdf/10.1207/s15516709cog0901_7.
69
Terry Sejnowski (born Aug. 13, 1947).
70
https://en.wikipedia.org/wiki/Netflix_Prize.
to substantially improve the accuracy of predictions about how much someone is going to enjoy a movie
based on their movie preferences) involved (restricted) Boltzmann machines71 .
With the interpretation of Boltzmann machines as Markov chains we have come, in some sense, full
circle.
71
Interestingly, it seems that Netflix never implemented and employed any of these algorithms. See
https://www.techdirt.com/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml.
References
[1] C. C. Aggarwal. Neural Networks and Deep Learning. Springer, New York, 2018.
[2] P. Brémaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer,
Berlin, 1999.
[3] P. Brémaud. Discrete Probability Models and Methods. Springer, Berlin, 2017.
[4] G. Dreyfus. Neural Networks. Springer, Berlin, 2nd edition, 2004.
[5] A. Gut. Probability: A Graduate Course. Springer, New York, 2013.
[6] J. Haigh. Probability Models. Springer, Berlin, 2nd edition, 2013.
[7] M. Jerrum. Counting, sampling and integrating: algorithms and complexity. Birkhäuser,
Basel, 2003. Freely available at https://www.math.cmu.edu/~af1p/Teaching/MCC17/Papers/
JerrumBook.
[8] D. Knuth. Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley
Professional, Boston, 3rd edition, 1998.
[9] M. Lefebvre. Basic Probability Theory with Applications. Springer, Berlin, 2009.
[10] C. Lemieux. Monte Carlo and Quasi-Monte Carlo Sampling. Springer, Berlin, 2009.
[11] N. Privault. Understanding Markov Chains. Springer, Singapore, 2nd edition, 2018.
[12] R. W. Shonkwiler and F. Mendivil. Explorations in Monte Carlo Methods. Springer, New York,
2009.
[13] P.N. Tan, M. Steinbach, V. Kumar, and A. Karpatne. Introduction to Data Mining. Pearson,
Harlow, 2nd edition, 2019.
[14] M. Wiering and M. von Otterlo, editors. Reinforcement Learning. Springer, Heidelberg, 2012.