
Bayesian Uncertainty Quantification

High Performance Computing for Computational Science and Engineering II

Prof. Dr. Petros Koumoutsakos

Spring 2018

Contents

1 Introduction

2 Bayesian Framework
  2.1 Bayes' theorem
  2.2 Example: The coin flipping problem
  2.3 Example: Linear model

3 The Laplace Approximation
  3.1 Example: Back to the coin flipping problem
  3.2 Example: Gaussian mean estimator

4 Monte Carlo Methods
  4.1 Monte Carlo Integration
  4.2 Random Number Generators
  4.3 Importance Sampling

5 Sampling methods
  5.1 Function Inversion
  5.2 Rejection Sampling
  5.3 Markov Chain Monte Carlo

1 Introduction
In science, we attempt to describe, understand and predict systems via models which depend on parameters. These models are an approximation of reality and contain several sources of uncertainty, including modeling and numerical errors. Furthermore, we often do not know the parameters of the model, or how sensitive the output of the model is with respect to the parameters. We wish to describe the uncertainty of these parameters given observations of the real system. We will here present the steps to complete this process.
In Section 2, we will present Bayes' theorem and its applications in the field of uncertainty quantification. In Section 3, we will describe how to derive analytical estimates that quantify the uncertainty in parameters. In Section 4, we will introduce the concept of Monte Carlo methods, which are the basis of most numerical methods used in uncertainty quantification. Finally, in Section 5, we will present numerical methods that can sample from arbitrary distributions.

2 Bayesian Framework
2.1 Bayes’ theorem
Let X and Y be two random variables (r.v.) with densities $p_X$ and $p_Y$. Bayes' theorem states that
\[
p_{X|Y}(x \mid y) = \frac{p_{Y|X}(y \mid x)\, p_X(x)}{p_Y(y)}. \tag{1}
\]

The density $p_{X|Y}$ is called the posterior probability. The term $p_{Y|X}$ is viewed as a function of x, since on the left hand side of Eq. (1) we condition on the fixed value for the random variable Y = y. As a function of x, this term is called the likelihood function and is a measure of how likely it is to observe the value y for the r.v. Y conditioned on the value x for the r.v. X. Notice that $p_{Y|X}$ is not a probability density as a function of x. The term $p_X$ is called the prior distribution and represents our belief about the values of X prior to observing any values of the random variable Y. Finally, the denominator is defined as
\[
p_Y(y) = \int p_{Y|X}(y \mid x)\, p_X(x)\, dx, \tag{2}
\]
and is the normalizing constant that makes the right hand side of Eq. (1) a probability density function.
In order to simplify the notation, we drop the dependence of the density on the random variable; which density is meant will be evident from the arguments. For example, when we write p(x | y), then $p = p_{X|Y}$; when we write p(x), then $p = p_X$.

Bayes' theorem in action The way we will use Bayes' theorem in the next sections is the following. First, we make some assumptions:

• We assume that we have a computational model that depends on some parameters. These parameters are considered to be random variables and will be denoted by X. A prior distribution can be imposed on them; e.g., if we know that X takes only positive values, $p_X$ can be the gamma distribution.

• We have observed a set of data, y. We assume that the data are also r.v. that follow a probability distribution.

• The likelihood function of the data, $p_{Y|X}$, is either known explicitly or can be modeled based on other assumptions.

Based on these assumptions and using Eq. (1) we are able to find the distribution
of the parameters conditioned on the data. Stated differently, we can answer the
question “what values for the parameters will make the computational model
fit the data better?”.
In order to fix the notation, we will denote the random variable that repre-
sents the parameters and the data with X and D respectively and a realization
from these variables with x and d.

Robust prediction The uncertainty in the parameters can be propagated to the output of the model in order to quantify the uncertainty in the predictions. If the prior uncertainty is used, the prediction is called the prior robust prediction,
\[
p(y) = \int p(y \mid x)\, p(x)\, dx. \tag{3}
\]
If the posterior distribution is used, the prediction is called the posterior robust prediction,
\[
p(y \mid d) = \int p(y \mid x)\, p(x \mid d)\, dx. \tag{4}
\]

Model selection TBW

2.2 Example: The coin flipping problem


A coin comes up heads 4 times in 16 flips.
Is this a fair coin?

Define H as the bias-weighting of the coin. For example:

• if H = 0: a tail comes up at every flip,

• if H = 1: a head comes up at every flip,

• if H = 1/2: the coin is fair.

Here, H plays the role of the model parameter x. Suppose we observe “R heads in N tosses”. We want to estimate the posterior distribution of H given the observed data d = (R, N),
\[
p(H \mid d). \tag{5}
\]
Using Bayes' theorem, we write
\[
p(H \mid d) \propto p(d \mid H)\, p(H). \tag{6}
\]

Here, we omit the normalization factor for simplicity. We choose a uniform prior,
\[
p(H) = \begin{cases} 1, & \text{if } 0 \le H \le 1, \\ 0, & \text{otherwise}. \end{cases} \tag{7}
\]
Such a prior is used when we do not have any prior knowledge about the fairness of the coin: it is equally probable to have a fair coin or a coin completely biased towards heads. We also need to define the likelihood function in Eq. (6). In words, the likelihood function measures the chance of observing certain data if the value of the bias-weighting is given. Assuming independent events, it is easy to see that the likelihood of obtaining “R heads in N tosses” follows a binomial distribution,
\[
p(d \mid H) \propto H^R (1 - H)^{N-R}. \tag{8}
\]

This is intuitively derived by considering that

• $H^R$ is the probability of having R “heads”,

• $(1 - H)^{N-R}$ is the probability of having N − R “tails”.
Note that we again omitted the constant factor in Eq. (8), as it does not depend on H. Posterior distributions of the bias-weighting of the coin H are shown in Fig. 1, starting from three different priors. Comparing Fig. 1 with our original problem, we can see that a fair coin still lies in the confidence region. However, it is more likely that the coin is not fair, given the data. We need more data to increase our confidence about H.
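The posterior above is simple enough to evaluate numerically on a grid. The following is a minimal sketch (not part of the original notes) for the data of our problem, R = 4 heads in N = 16 flips, with the uniform prior of Eq. (7):

import numpy as np

R, N = 4, 16                             # observed data d = (R, N)
H = np.linspace(0.0, 1.0, 1001)          # grid over the bias-weighting H

posterior = H**R * (1.0 - H)**(N - R)    # unnormalized posterior, uniform prior
posterior /= posterior.sum() * (H[1] - H[0])   # normalize numerically

H_best = H[np.argmax(posterior)]
print(f"best estimate H = {H_best:.3f}")       # the mode lies at R/N = 0.25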

2.3 Example: Linear model


Consider the following linear model,
\[
y = x + \varepsilon, \tag{9}
\]
where x and ε are independent. We assume the following prior knowledge,
\[
x \sim \mathcal{N}(\mu, \sigma^2), \tag{10}
\]
\[
\varepsilon \sim \mathcal{N}(0, 1), \tag{11}
\]

Figure 1: Evolution of the posterior density of the bias-weighting of a coin as the number of data increases. The different lines show posterior densities for different priors. The first figure, for 0 data points, represents the three priors. It can be seen that after many observations, the three posterior densities converge to the same distribution. However, the effect of the prior is evident for smaller observation data sets, as the biased Gaussian prior converges later to the actual posterior density. Taken from [1].

where µ and σ are given. After a single observation y = d ∈ ℝ, we can write from Bayes' theorem
\[
p(x \mid d) = \frac{p(d \mid x)\, p(x)}{p(d)}. \tag{12}
\]
From Eq. (9) and Eq. (11), the likelihood p(d | x) is a Gaussian centered at x with variance 1,
\[
p(d \mid x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}(d - x)^2 \right).
\]
Substituting the prior and the likelihood into Eq. (12) gives the posterior distribution,
\[
p(x \mid d) \propto \exp\left( -\frac{1}{2}\left[ \frac{(d - x)^2}{1} + \frac{(x - \mu)^2}{\sigma^2} \right] \right),
\]
which can be written as a normal distribution,
\[
p(x \mid d) = \mathcal{N}\!\left( \frac{\mu + d\sigma^2}{1 + \sigma^2},\; \frac{\sigma^2}{1 + \sigma^2} \right).
\]

Robust prediction The robust prediction is the probability density of the output of the model. This density takes into account the uncertainty of the parameters of the model. From the model Eq. (9), we observe that the output is a sum of two random variables. In the case of the prior robust prediction, the parameter x is normally distributed (see Eq. (10)). The error ε is also normally distributed (see Eq. (11)). Note that the sum of two r.v. $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ is $Z = X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
We can then easily write the prior robust prediction as
\[
p(y) = \mathcal{N}(\mu, \sigma^2 + 1).
\]
Similarly, the posterior robust prediction can be written as
\[
p(y \mid d) = \mathcal{N}\!\left( \frac{\mu + d\sigma^2}{1 + \sigma^2},\; \frac{\sigma^2}{1 + \sigma^2} + 1 \right).
\]
Note that in this case the posterior robust prediction gives a smaller confidence interval, so adding data increases the robustness of the prediction. These quantities are represented in Fig. 2.
The same result can be obtained from Eq. (3) or Eq. (4). In this particular case, p(y | x) = N(x, 1).
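The closed-form expressions above are easy to verify numerically. A small sketch (not from the original notes; the values of µ, σ², and d are chosen purely for illustration):

import numpy as np

mu, sigma2 = 0.0, 2.0   # prior x ~ N(mu, sigma2); illustrative values
d = 1.5                 # a single observation y = d

# Posterior p(x|d) = N((mu + d*sigma2)/(1 + sigma2), sigma2/(1 + sigma2))
post_mean = (mu + d * sigma2) / (1.0 + sigma2)
post_var = sigma2 / (1.0 + sigma2)

# Prior robust prediction p(y) = N(mu, sigma2 + 1);
# posterior robust prediction p(y|d) = N(post_mean, post_var + 1)
print(f"posterior:            N({post_mean:.3f}, {post_var:.3f})")
print(f"prior prediction:     N({mu:.3f}, {sigma2 + 1.0:.3f})")
print(f"posterior prediction: N({post_mean:.3f}, {post_var + 1.0:.3f})")

Note how the posterior predictive variance, post_var + 1, is smaller than the prior predictive variance sigma2 + 1, as stated above.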

3 The Laplace Approximation


For nonlinear models, the posterior distribution cannot be derived analytically in general. Therefore, we summarize such distributions with two quantities:
Figure 2: Left: prior and posterior distributions, p(x) and p(x | d), of the parameter x. Adding data increases our confidence in x. Right: prior and posterior robust predictions of y. Again, adding data in this particular case increases the confidence in the prediction.

• the best estimate, which is the model parameter for which the posterior density function is maximized,

• a measure of the reliability of the best estimate.

The posterior distribution around the best estimate can be locally approximated with a Gaussian distribution by employing the Laplace approximation method. The Laplace approximation uses the Taylor expansion of a function around its global maximum in order to construct a Gaussian approximation of the density. Since the posterior distribution is approximated by a Gaussian, the logarithm of the posterior plays the role of the function to which the Taylor expansion is applied. The main idea of the Laplace approximation method is discussed below.

Let x ∈ ℝ be a parameter with probability distribution function p(x). In the case of continuous variables, the following two conditions hold true for the global maximum of the distribution, x̂:
\[
\left.\frac{\partial p}{\partial x}\right|_{\hat{x}} = 0, \tag{13}
\]
\[
\left.\frac{\partial^2 p}{\partial x^2}\right|_{\hat{x}} < 0. \tag{14}
\]

The logarithm of the probability density (which, in the Bayesian framework, corresponds to the log-likelihood function) is
\[
L(x) = \log p(x).
\]

By performing a Taylor expansion of the logarithm of p(x) around the maximum x̂, which corresponds to the maximum of p(x), we have
\[
L(x) = L(\hat{x}) + \frac{1}{2} \left.\frac{\partial^2 L}{\partial x^2}\right|_{\hat{x}} (x - \hat{x})^2 + O\big((x - \hat{x})^3\big), \tag{15}
\]

where we used Eq. (13). Keeping only terms up to second order, we can write the probability distribution as
\[
p(x) \approx A \exp\left( \frac{1}{2} \left.\frac{\partial^2 L}{\partial x^2}\right|_{\hat{x}} (x - \hat{x})^2 \right),
\]
where A is the constant $A = \exp(L(\hat{x}))$. We obtained a Gaussian approximation of the probability density function with variance
\[
\sigma^2 = -\left( \left.\frac{\partial^2 L}{\partial x^2}\right|_{\hat{x}} \right)^{-1}.
\]
This is positive, as the second derivative is negative according to the condition Eq. (14). We can finally write the Gaussian approximation, omitting the normalization constant, as
\[
p(x) \approx \sqrt{2\pi\sigma^2}\, p(\hat{x})\, \mathcal{N}(x \mid \hat{x}, \sigma^2)
\propto \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2}(x - \hat{x})^2 \right).
\]

The concept of Laplace approximation is graphically explained in Fig. 3.
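The 1D recipe can be carried out numerically when derivatives are not available in closed form. A minimal sketch (not from the original notes): locate x̂ on a grid and estimate σ² = −1/L''(x̂) with a central finite difference, for the illustrative unnormalized log-density log p(x) = 3 log x − x (whose exact mode and standard deviation are x̂ = 3 and σ = √3):

import numpy as np

def log_p(x):
    # Example unnormalized log-density: log p(x) = 3 log(x) - x
    return 3.0 * np.log(x) - x

x = np.linspace(0.1, 15.0, 100001)
x_hat = x[np.argmax(log_p(x))]          # best estimate (mode)

h = 1e-4                                 # finite-difference step
d2L = (log_p(x_hat + h) - 2 * log_p(x_hat) + log_p(x_hat - h)) / h**2
sigma = np.sqrt(-1.0 / d2L)              # Laplace variance: -1 / L''(x_hat)

print(f"x_hat = {x_hat:.3f}, sigma = {sigma:.3f}")  # exact: 3.000 and 1.732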

2-dimensional approximation In 2D, if the parameters are denoted as x = (x₁, x₂), the first partial derivatives are zero, due to the existence of a maximum at the best estimate x̂ = (x̂₁, x̂₂),
\[
\nabla L(\hat{x}) = 0.
\]
The log-likelihood around the best estimate (x̂₁, x̂₂) is then approximated by a Taylor series expansion,
\[
L(x) \approx L(\hat{x}) + \frac{1}{2}\left[ \left.\frac{\partial^2 L}{\partial x_1^2}\right|_{\hat{x}} (x_1 - \hat{x}_1)^2 + \left.\frac{\partial^2 L}{\partial x_2^2}\right|_{\hat{x}} (x_2 - \hat{x}_2)^2 + 2 \left.\frac{\partial^2 L}{\partial x_1 \partial x_2}\right|_{\hat{x}} (x_1 - \hat{x}_1)(x_2 - \hat{x}_2) \right].
\]
Defining
\[
A = \left.\frac{\partial^2 L}{\partial x_1^2}\right|_{\hat{x}}, \quad
B = \left.\frac{\partial^2 L}{\partial x_2^2}\right|_{\hat{x}}, \quad
C = \left.\frac{\partial^2 L}{\partial x_1 \partial x_2}\right|_{\hat{x}},
\]

Figure 3: Laplace approximation. In the limit of many observations, the probability distribution function p(x) is locally approximated as a Gaussian around the best estimate, i.e. the value that maximizes the density function.

and introducing the Hessian matrix H of the function L,
\[
H = \begin{pmatrix} A & C \\ C & B \end{pmatrix},
\]
the Taylor series expansion takes the form
\[
L(x) \approx L(\hat{x}) + \frac{1}{2} Q(x),
\]
where Q(x) is
\[
Q(x) = (x - \hat{x})^T H(\hat{x}) (x - \hat{x}).
\]
The covariance matrix of the Gaussian approximation is the negative inverse of the Hessian (which is negative definite at the maximum, consistent with the 1D result above),
\[
\Sigma = -H^{-1}(\hat{x}).
\]
We compute the marginal probability of the parameter x₁,
\[
p(x_1) = \int_{-\infty}^{\infty} p(x_1, x_2)\, dx_2
\approx c \exp\left( \frac{1}{2}\, \frac{AB - C^2}{B}\, (x_1 - \hat{x}_1)^2 \right),
\]
where c is the normalization factor. The exponent is negative, since AB − C² > 0 and B < 0 at the maximum.

D-dimensional approximation In higher dimensions, x ∈ ℝᴰ, the Taylor expansion of L(x) about the best estimate x̂ extends as follows,
\[
L(x) \approx L(\hat{x}) + \frac{1}{2} (x - \hat{x})^T \nabla\nabla^T L(\hat{x}) (x - \hat{x}).
\]
The Hessian at the best estimate is defined as
\[
H(\hat{x}) = \nabla\nabla^T L(\hat{x}),
\]
and the covariance matrix is again
\[
\Sigma = -H^{-1}(\hat{x}).
\]
The posterior distribution is approximated as
\[
p(x) \approx \sqrt{(2\pi)^D |\Sigma|}\; p(\hat{x})\, \mathcal{N}(x \mid \hat{x}, \Sigma)
= c\, \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left( -\frac{1}{2} (x - \hat{x})^T \Sigma^{-1} (x - \hat{x}) \right),
\]
where c is the normalization constant.

3.1 Example: Back to the coin flipping problem


In Section 2.2 we described the coin flipping problem in the Bayesian framework. We will here apply the Laplace approximation to the same problem: we approximate the posterior distribution of the parameter with a Gaussian distribution around the best estimate, i.e. the value that maximizes the posterior. We already showed that the posterior probability density function of the parameter x = H has the form
\[
p(H \mid d) \propto H^R (1 - H)^{N-R}.
\]


Taking the logarithm, we get
\[
L(H) = \text{const} + R \log(H) + (N - R) \log(1 - H),
\]
where the constant does not play any role, as we only want to find the best estimate. The first two derivatives read
\[
\frac{\partial L}{\partial H} = \frac{R}{H} - \frac{N - R}{1 - H},
\]
\[
\frac{\partial^2 L}{\partial H^2} = -\frac{R}{H^2} - \frac{N - R}{(1 - H)^2}.
\]

The condition Eq. (13) gives the best estimate Ĥ = R/N. The standard deviation is therefore given by
\[
\sigma = \left( -\left.\frac{\partial^2 L}{\partial H^2}\right|_{\hat{H}} \right)^{-1/2}
= \sqrt{\frac{\hat{H}(1 - \hat{H})}{N}}.
\]
Note that the certainty increases as we add more data. We can also notice that it is easier to detect a biased coin than a fair coin: the uncertainty is maximized for Ĥ = 1/2.
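A quick numerical check of Ĥ = R/N and σ = √(Ĥ(1 − Ĥ)/N) against the exact grid posterior, for our data R = 4, N = 16 (a sketch, not part of the original notes):

import numpy as np

R, N = 4, 16
H_hat = R / N
sigma = np.sqrt(H_hat * (1 - H_hat) / N)
print(f"Laplace: H_hat = {H_hat:.3f}, sigma = {sigma:.4f}")

# Exact posterior on a grid, for comparison of mode and spread
H = np.linspace(1e-6, 1 - 1e-6, 100001)
post = H**R * (1 - H)**(N - R)
post /= post.sum() * (H[1] - H[0])
mean = np.sum(H * post) * (H[1] - H[0])
std = np.sqrt(np.sum((H - mean)**2 * post) * (H[1] - H[0]))
print(f"exact:   mode  = {H[np.argmax(post)]:.3f}, std   = {std:.4f}")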

3.2 Example: Gaussian mean estimator


Consider N independent and identically distributed (i.i.d.) observations d = (d₁, d₂, …, d_N). We assume that the data are randomly generated from a Gaussian distribution with known variance σ² and unknown mean x = µ,
\[
p(d_k \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2}(d_k - \mu)^2 \right).
\]
What is the best estimate for µ and what is our confidence in this estimate? From Bayes' theorem, the posterior probability of the mean µ is given by
\[
p(\mu \mid d) \propto p(d \mid \mu)\, p(\mu).
\]
Since the data are i.i.d., the likelihood function takes the form
\[
p(d \mid \mu) = \prod_{k=1}^{N} p(d_k \mid \mu).
\]

Here, we assume an uninformative uniform prior for the mean of the Gaussian,
\[
p(\mu) = \begin{cases} c = \frac{1}{\mu_{\max} - \mu_{\min}}, & \mu_{\min} \le \mu \le \mu_{\max}, \\ 0, & \text{otherwise}. \end{cases}
\]

The posterior distribution is then given by
\[
p(\mu \mid d) \propto c \prod_{k=1}^{N} p(d_k \mid \mu)
= c \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(d_k - \mu)^2}{2\sigma^2} \right)
= \frac{c}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{k=1}^{N} (d_k - \mu)^2 \right).
\]

We compute the log-likelihood,
\[
L(\mu) = \log p(\mu \mid d) = \text{const} - \sum_{k=1}^{N} \frac{(d_k - \mu)^2}{2\sigma^2}.
\]
The best estimate µ̂ must satisfy
\[
\left.\frac{dL(\mu)}{d\mu}\right|_{\hat{\mu}} = \sum_{k=1}^{N} \frac{d_k - \hat{\mu}}{\sigma^2} = 0
\;\Rightarrow\; \sum_{k=1}^{N} d_k = N \hat{\mu}
\;\Rightarrow\; \hat{\mu} = \frac{1}{N} \sum_{k=1}^{N} d_k.
\]
We compute the second derivative of the log-likelihood,
\[
\frac{d^2 L}{d\mu^2} = -\sum_{k=1}^{N} \frac{1}{\sigma^2} = -\frac{N}{\sigma^2},
\]
which is negative, meaning that L(µ̂) is indeed a maximum. Finally, the standard deviation of the posterior is equal to $\sigma/\sqrt{N}$.
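A short sketch with synthetic data (the values of σ, the true mean, and N are assumed for illustration, not from the notes) showing the sample-mean estimator and its posterior standard deviation σ/√N:

import numpy as np

rng = np.random.default_rng(0)
sigma, mu_true, N = 2.0, 1.0, 400
d = rng.normal(mu_true, sigma, size=N)    # i.i.d. observations

mu_hat = d.mean()                          # best estimate: sample mean
posterior_std = sigma / np.sqrt(N)         # std of the posterior of mu
print(f"mu_hat = {mu_hat:.3f} +- {posterior_std:.3f}")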

4 Monte Carlo Methods


In the previous section we saw that the Laplace approximation method can be used to approximate the posterior distribution of the parameters. This approach is characterized as deterministic. Alternatively, we can make use of stochastic methods in order to numerically represent the posterior probabilities with randomly generated samples from the underlying distribution.

4.1 Monte Carlo Integration


The main concept of Monte Carlo methods is presented here for the numerical computation of an integral of the form
\[
\mathbb{E}[f(x)] = \int f(x)\, p(x)\, dx, \tag{16}
\]
where x is a random vector with density p and f is a given function we want to integrate. Common examples are:

1. the model evidence,
\[
p(d) = \int p(d \mid x)\, p(x)\, dx,
\]

2. the robust posterior prediction,
\[
p(y \mid d) = \int p(y \mid x)\, p(x \mid d)\, dx.
\]

Assume that $\{x^{(k)}\}_{k=1}^N$ are i.i.d. samples drawn from the density p. Using the Law of Large Numbers, the expected value of f(x) is given by the estimate
\[
\hat{\mu}_{f,N} = \frac{1}{N} \sum_{k=1}^{N} f(x^{(k)}).
\]
In the limit N → ∞, the sample average converges to the expected value. Defining
\[
\mu_f = \mathbb{E}[f(x)], \qquad
\sigma_f^2 = \mathrm{Var}[f(x)] = \mathbb{E}[f(x)^2] - \mu_f^2,
\]
the Law of Large Numbers and the Central Limit Theorem give
\[
\lim_{N \to \infty} \hat{\mu}_{f,N} = \mu_f, \qquad
\hat{\mu}_{f,N} \sim \mathcal{N}\big(\mu_f,\, \sigma_f^2 / N\big).
\]
Conclusively:

• The error of the sample estimate decreases as $1/\sqrt{N}$.

• The sample estimate is an unbiased estimate of the true value.

• Convergence of the estimate is independent of the dimensionality of the problem.
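As illustrated by the sketch below (not from the original notes; the choice f(x) = x² with x ∼ N(0, 1), whose exact expectation is 1, is an assumed example), the error indeed shrinks roughly as 1/√N:

import numpy as np

rng = np.random.default_rng(42)
f = lambda x: x**2

for N in (10**2, 10**4, 10**6):
    x = rng.standard_normal(N)       # i.i.d. samples from p = N(0, 1)
    mu_hat = f(x).mean()             # sample-average estimator
    print(f"N = {N:>7}: estimate = {mu_hat:.4f}, error = {abs(mu_hat - 1.0):.1e}")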

4.2 Random Number Generators


The whole concept of Monte-Carlo methods relies on the generation of random
samples. It is essential to generate such numbers with the desired properties.

Pseudo-random number generators: Algorithms that generate a sequence of integers $Z_i$ that approximately follow a uniform distribution on an interval of the real axis. The general algorithm for the generation of pseudo-random numbers is
\[
Z_i = g(Z_{i-1}, \ldots, Z_{i-m}) \bmod M,
\]
generating integers in the interval [0, M − 1]. In this case, $Z_i$ is the remainder of the division of $g(Z_{i-1}, \ldots, Z_{i-m})$ by M. In a simpler form, $Z_i$ can be generated by
\[
Z_i = \alpha Z_{i-1} \bmod M,
\]
for $Z_0 = 1$ and i ≥ 1. Here, M is a large prime number and α an integer [2].
The resulting sequence of random numbers turns out to be ergodic with period M − 1. A sequence is accepted when it satisfies certain criteria; for example, for a choice of M = 10⁹, not all α result in a good-quality sequence of random numbers. Recommended options are α = 7⁵ and M = 2³¹ − 1 [3].
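A sketch of this simple multiplicative generator with the recommended constants α = 7⁵ and M = 2³¹ − 1; dividing by M maps the integers to (0, 1):

def lehmer(z0=1, alpha=7**5, M=2**31 - 1):
    """Yield an endless stream of pseudo-random numbers in (0, 1)."""
    z = z0
    while True:
        z = (alpha * z) % M          # remainder of the division by M
        yield z / M

gen = lehmer()
samples = [next(gen) for _ in range(5)]
print(samples)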

Note for non-uniform distributions Integrals of the form
\[
I = \int_\Omega f(x)\, dx
\]
can be written as
\[
I = |\Omega| \int f(x)\, p(x)\, dx = |\Omega|\, \mathbb{E}_p[f].
\]
Here p is the uniform distribution over Ω from which samples are drawn,
\[
p(x) = \begin{cases} \frac{1}{|\Omega|}, & x \in \Omega, \\ 0, & \text{otherwise}. \end{cases}
\]
For these cases, pseudo-random number generators are sufficient. In the general case, we want to approximate the integral Eq. (16) for a non-uniform density p. Usually, the distribution p is known up to a constant factor,
\[
p(x) = \frac{\phi(x)}{Z},
\]
where $Z = \int_{-\infty}^{\infty} \phi(x)\, dx$. In the next sections, we will discuss how to generate random numbers from such distributions.

4.3 Importance Sampling


We want to evaluate the integral
\[
I = \int f(x)\, dx, \tag{17}
\]
using Monte Carlo integration. Eq. (17) can be written equivalently as
\[
I = \int \frac{f(x)}{p(x)}\, p(x)\, dx = \mathbb{E}_p\!\left[ \frac{f(x)}{p(x)} \right],
\]
where p(x) > 0 is a probability density function with $\int p(x)\, dx = 1$. Therefore, we can approximate I using the Monte Carlo integration technique,
\[
\hat{I} = \frac{1}{N} \sum_{k=1}^{N} \frac{f(x^{(k)})}{p(x^{(k)})},
\]

where the samples $\{x^{(k)}\}_{k=1}^N$ are i.i.d. and follow the density p. There are infinitely many choices for p, from which we would like to select one that:

1. is easy to sample from,

2. minimizes the error of the estimate for a finite number of samples.

A measure of the error of the estimate is given by
\[
\mathbb{E}_p\!\left[ \left( \frac{f(x)}{p(x)} - I \right)^2 \right]
= \int \frac{f(x)^2}{p(x)^2}\, p(x)\, dx - I^2. \tag{18}
\]
It is easy to show that this is minimized for
\[
p(x) = \frac{f(x)}{I},
\]
but this expression implies that we already know I. In practice, we choose p “similar” to f.
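An importance-sampling sketch (not from the original notes): estimate I = ∫ exp(−x²/2) dx, whose exact value is √(2π) ≈ 2.5066, using a standard Cauchy proposal, an assumed choice whose heavier tails keep the weights f/p bounded:

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.exp(-0.5 * x**2)
p = lambda x: 1.0 / (np.pi * (1.0 + x**2))   # standard Cauchy density

N = 10**5
x = rng.standard_cauchy(N)                    # i.i.d. samples from p
I_hat = np.mean(f(x) / p(x))                  # importance-sampling estimator
print(f"I_hat = {I_hat:.4f}, exact = {np.sqrt(2 * np.pi):.4f}")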

5 Sampling methods
5.1 Function Inversion
Let X be a real random variable with probability density function $p_X$ and corresponding cumulative distribution function
\[
F_X(x) = \int_{-\infty}^{x} p_X(r)\, dr.
\]
The idea behind the “inverse transform sampling” method is that samples from the density $p_X(x)$ can be generated by a transformation
\[
x = g(u),
\]
where u ∼ U(0, 1). We will identify the function g such that X follows the desired density $p_X$. The densities of X and U should satisfy
\[
p_X(x)\, dx = p_U(u)\, du, \tag{19}
\]
which leads to
\[
p_X(x) = p_U(u)\, \frac{du}{dx} = p_U(u) \left( \frac{dg(u)}{du} \right)^{-1}.
\]
For the r.v. U drawn from a uniform probability distribution, U ∼ p_U(u) = U(u | 0, 1), the probability of generating a random number between u and u + du is
\[
p_U(u)\, du = \begin{cases} du, & 0 \le u < 1, \\ 0, & \text{otherwise}. \end{cases}
\]

Therefore, $p_U(u) = 1$ for u ∈ [0, 1], and integrating Eq. (19) yields
\[
\int_{-\infty}^{x} p_X(r)\, dr = \int_{0}^{u} p_U(r)\, dr = u.
\]
This means, from the definition of $F_X$, that
\[
F_X(x) = u.
\]
Assuming that $F_X$ has an inverse $F_X^{-1}$, we obtain
\[
x = g(u) = F_X^{-1}(u).
\]
For simple density functions $p_X$ for which $F_X^{-1}$ is known, it is therefore easy to generate samples of X. However, the inverse is not available in general, which led to the development of sampling algorithms, as discussed later in this section.

Example: Exponential Distribution Given that the random variable x is distributed according to the probability density function
\[
p_X(x) = \lambda e^{-\lambda x},
\]
with λ > 0 and x ≥ 0, the CDF of x is given by
\[
F_X(x) = \int_{0}^{x} \lambda e^{-\lambda\tau}\, d\tau = 1 - e^{-\lambda x}.
\]
Setting the random variable u := F_X(x), x can be sampled from the inverse transformation
\[
x = g(u) = F_X^{-1}(u),
\]
or equivalently
\[
F_X(x) = u \;\Rightarrow\; 1 - e^{-\lambda x} = u,
\]
which results in
\[
x = -\frac{1}{\lambda} \ln(1 - u).
\]
If samples $\{u^{(k)}\}_{k=1}^N$ are drawn from U(0, 1), then $x^{(k)} = -\frac{1}{\lambda} \ln(1 - u^{(k)})$ follow $p_X$.
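A direct transcription of this transformation (a sketch; the value of λ is chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(2)
lam = 1.5

u = rng.uniform(0.0, 1.0, size=10**5)   # u ~ U(0, 1)
x = -np.log(1.0 - u) / lam               # x = F_X^{-1}(u)

print(f"sample mean = {x.mean():.3f}, exact mean = {1 / lam:.3f}")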

Example: Gaussian Distribution We want to draw samples from the standard normal distribution,
\[
x \sim \mathcal{N}(0, 1).
\]
The inverse transform method can be time consuming in this case, as $F_X^{-1}$ is not known in closed form. An alternative algorithm, called the Box-Muller transformation, uses the inverse transform method to convert two independent uniform random variables into two independent Gaussian random variables. Suppose that {r, φ} is a set of two independent random variables distributed as follows:

• φ is drawn from a uniform distribution,
\[
p_\Phi(\phi) = \begin{cases} \frac{1}{2\pi}, & 0 \le \phi \le 2\pi, \\ 0, & \text{otherwise}. \end{cases}
\]

• r is sampled according to the exponential distribution
\[
p_R(r) = \frac{1}{2} e^{-r/2},
\]
for r > 0. The sampling from $p_R$ is performed via an inverse transformation from a uniformly sampled variable, as seen in the previous example.

Since r and φ are independent, the joint probability is
\[
p_{R,\Phi}(r, \phi) = p_R(r)\, p_\Phi(\phi).
\]
We now define the transformation $x = \sqrt{r} \cos\phi$ and $y = \sqrt{r} \sin\phi$, so that $r = x^2 + y^2$ and $\phi = \arctan(y/x)$. Accounting for the Jacobian of this transformation, the joint distribution of x and y is
\[
p_{X,Y}(x, y) = \frac{1}{2\pi}\, e^{-\frac{x^2 + y^2}{2}}
= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}.
\]
The joint distribution thus describes two independent, normally distributed variables.
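A sketch of the Box-Muller transformation exactly as described above: r drawn from the exponential p_R via the inverse transform, φ uniform on [0, 2π], then x = √r cos φ and y = √r sin φ:

import numpy as np

rng = np.random.default_rng(3)
N = 10**5

r = -2.0 * np.log(1.0 - rng.uniform(size=N))   # r ~ p_R(r) = (1/2) e^{-r/2}
phi = 2.0 * np.pi * rng.uniform(size=N)        # phi ~ U(0, 2*pi)

x = np.sqrt(r) * np.cos(phi)
y = np.sqrt(r) * np.sin(phi)
print(f"x: mean = {x.mean():.3f}, var = {x.var():.3f}")   # approx 0 and 1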

5.2 Rejection Sampling


Until now, we have seen how to generate samples from a uniform distribution, through pseudo-random number generators, and from other distributions, using the inverse transform method. However, there are many distributions for which it may be impossible to directly define an inverse transform. In such cases, we turn to methods that only require knowledge of the functional form of the probability density function p up to a constant. The key concept here is the following: in order to generate independent samples from a desired density p, one draws from another density q that is easier to sample from; then, instead of applying a transformation to q, some sampled points are rejected according to certain criteria.
Given a density p, we can write
\[
p(x) = \int_{0}^{p(x)} 1\, du = \int_{-\infty}^{\infty} \chi_{[0,\, p(x)]}(u)\, du. \tag{20}
\]
The function
\[
\chi_{[0,\, p(x)]}(u) = \begin{cases} 1, & 0 \le u \le p(x), \\ 0, & \text{otherwise}, \end{cases}
\]

can be seen as the joint distribution $p_{X,U}$ of the random variables X and U following the distributions p(x) and p(u | x) = U(u | 0, p(x)). Marginalizing the joint distribution over U, we recover the distribution p,
\[
p(x) = \int p_{X,U}(x, u)\, du. \tag{21}
\]
This property of p is presented in Fig. 4. The dots in the figure correspond to samples drawn from the joint $p_{X,U}$; they are uniformly distributed under the graph of p.

Figure 4: Assuming we can sample from the distribution p, we draw a sample x ∼ p and then a uniform number u ∼ U(0, p(x)). If we marginalize the samples (x, u), we recover the distribution p, as shown in Eq. (20).

Acceptance–Rejection technique What if we cannot directly sample from the density p? The answer is simple:

1. find a density q from which samples are easily drawn,

2. scale q by a constant M such that the graph of Mq is always above the graph of p,

3. sample from the joint density $p_{X,U}$, where x ∼ q and u ∼ U(0, Mq(x)),

4. keep only the points that are below the graph of p.

Figure 5: Demonstration of the accept-reject algorithm. 1. A sample x is drawn from the distribution q. 2. A random number u is drawn uniformly in [0, Mq(x)]. 3. The sample x is accepted if u < p(x), i.e., if the point (x, u) is below the graph of p, and rejected otherwise.

This intuitive procedure is called the acceptance–rejection algorithm and is presented graphically in Fig. 5. The detailed algorithm is presented in Algorithm 1. A basic requirement of the algorithm is that the graph of p should always be below the graph of Mq. Equivalently, the constant M must satisfy
\[
M > \max_x \frac{p(x)}{q(x)}.
\]

Theorem 1. The samples generated by Algorithm 1 are distributed according to p.

Proof. According to the algorithm, we first sample x ∼ q, then u ∼ U(0, Mq(x)), and we accept if u ≤ p(x). Thus, the posterior density, using Bayes' theorem, is given by
\[
p(x \mid u \le p(x)) = \frac{p(u \le p(x) \mid x)\, q(x)}{p(u \le p(x))}. \tag{22}
\]
The likelihood function corresponds to the probability of a uniformly distributed value in [0, Mq(x)] being less than or equal to p(x). It is easy to check that it is equal to
\[
p(u \le p(x) \mid x) = \frac{p(x)}{M q(x)}. \tag{23}
\]

Algorithm 1 Rejection sampling algorithm.
Input: densities p, q and a constant M > 0 such that M > max_x p(x)/q(x)
Output: a sample distributed according to p
function Rejection sampling(p, q, M)
    Generate x ∼ q                          ▷ Propose a new sample
    Generate u ∼ U(0, Mq(x))
    if u < p(x) then
        return x                            ▷ Accept the proposed sample
    else
        return Rejection sampling(p, q, M)  ▷ Reject and try again
    end if
end function

In order to evaluate the denominator of Eq. (22), we integrate the numerator of Eq. (22) and use Eq. (23),
\[
p(u \le p(x)) = \int p(u \le p(x) \mid x)\, q(x)\, dx
= \int \frac{p(x)}{M q(x)}\, q(x)\, dx
= \frac{1}{M} \int p(x)\, dx
= \frac{1}{M}. \tag{24}
\]
Inserting Eq. (23) and Eq. (24) into Eq. (22), we obtain
\[
p(x \mid u \le p(x)) = \frac{\frac{p(x)}{M q(x)}\, q(x)}{\frac{1}{M}} = p(x).
\]

Note: The efficiency of the algorithm depends on how often u ≤ p(x) occurs. For independent trials, the probability of success is 1/M (see Eq. (24)). Thus, the expected number of trials before accepting a sample is M.

Example: von Neumann The original rejection algorithm (Algorithm 2) was used by von Neumann to draw samples from a density p(x) on [a, b], using a uniform proposal
\[
q(x) = \frac{1}{b - a}, \quad \text{for } a \le x \le b.
\]
The constant M is given by
\[
M \ge \max_{x \in [a,b]} (b - a)\, p(x).
\]
Since M should be as small as possible, we select the lower bound of the above inequality,
\[
M = (b - a) \max_{x \in [a,b]} p(x).
\]
The procedure to generate one sample is summarized in Algorithm 2.

Algorithm 2 Von Neumann rejection sampling algorithm.
Input: density p, interval (a, b) and constant M = (b − a) max_{x∈[a,b]} p(x)
Output: a sample following the density p
function Rejection sampling(p, a, b, M)
    Generate x ∼ U(a, b)
    Generate u ∼ U(0, M/(b − a))
    if u < p(x) then
        return x                               ▷ Accept the proposed sample
    else
        return Rejection sampling(p, a, b, M)  ▷ Reject and try again
    end if
end function
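A sketch of Algorithm 2 (not part of the original notes; the target p(x) ∝ sin²(x) on [0, π] is an illustrative, assumed choice), using a loop rather than recursion:

import numpy as np

rng = np.random.default_rng(4)

def rejection_sample(p, a, b, p_max, rng):
    """Draw one sample from density p on [a, b]; p_max >= max p(x)."""
    while True:                                # retry until acceptance
        x = rng.uniform(a, b)                  # propose x ~ U(a, b)
        u = rng.uniform(0.0, p_max)            # u ~ U(0, M/(b - a)), M = (b-a)*p_max
        if u < p(x):
            return x                           # accept; otherwise try again

p = lambda x: np.sin(x)**2                     # known up to a constant factor
samples = np.array([rejection_sample(p, 0.0, np.pi, 1.0, rng) for _ in range(5000)])
print(f"sample mean = {samples.mean():.3f}")   # symmetric about pi/2 ≈ 1.571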

5.3 Markov Chain Monte Carlo

A Markov chain is a sequence of random numbers x₁, x₂, … ∈ ℝᵈ with conditional distributions that obey the rule
\[
P(x_n \mid x_{n-1}, x_{n-2}, \ldots, x_1) = P(x_n \mid x_{n-1}). \tag{25}
\]
The Metropolis-Hastings algorithm [4], introduced by Nicholas Metropolis together with Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller (M(RT)²), makes use of the Markov chain properties to generate samples from a probability density function. For a stochastic process W(x | y) following a Markov chain, the probability density of the states converges to an equilibrium probability density function $p_{eq}$ if the detailed balance equation is satisfied,
\[
W(x \mid y)\, p_{eq}(y) = W(y \mid x)\, p_{eq}(x). \tag{26}
\]
In statistical physics, we usually know the stochastic process and need to find the equilibrium distribution. Here we want the opposite: we know the density $p_{eq}$ and want to design a suitable process W which generates states distributed according to $p_{eq}$. The idea of M(RT)² is to write this process as a combination of proposal and acceptance terms T and A,
\[
W(x \mid y) = A(x \mid y)\, T(x \mid y). \tag{27}
\]
The proposal distribution T(x | y) proposes the transition from y to x. It must normalize to 1:
\[
\int T(x \mid y)\, dx = 1.
\]

The proposed state x is then accepted with acceptance probability A(x | y). The acceptance must be chosen to satisfy the detailed balance condition Eq. (26),
\[
A(x \mid y)\, T(x \mid y)\, p_{eq}(y) = A(y \mid x)\, T(y \mid x)\, p_{eq}(x).
\]
We now define
\[
q(x \mid y) = \frac{T(y \mid x)\, p_{eq}(x)}{T(x \mid y)\, p_{eq}(y)}. \tag{28}
\]
Note that q(x | y) ≥ 0. It is easy to check that the detailed balance condition is satisfied for
\[
A(x \mid y) = \min\big(1,\, q(x \mid y)\big).
\]
The algorithm used to generate one sample from a given state is summarized in Algorithm 3. Note that it is sufficient to know $p_{eq}$ only up to a constant factor; indeed, in the M(RT)² algorithm, $p_{eq}$ appears only in a ratio (see Eq. (28)).

Algorithm 3 One step of the Metropolis-Hastings sampling algorithm.
Input: current state y, proposal density T, target density p_eq
Output: next state
function Metropolis Hastings step(y, T, p_eq)
    generate x ∼ T(· | y)                        ▷ Propose a new state
    set q ← T(y | x) p_eq(x) / (T(x | y) p_eq(y))
    if q > 1 then
        return x
    else
        generate U ∼ U(0, 1)
        if U < q then                            ▷ Accept with probability A(x | y) = q
            return x                             ▷ Accept the new state
        else
            return y                             ▷ Reject the new state
        end if
    end if
end function
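A sketch of this step (not from the original notes) with a symmetric Gaussian random-walk proposal T(x | y) = N(x | y, s²), for which the ratio in Eq. (28) reduces to q = p_eq(x)/p_eq(y). The target p_eq(x) ∝ exp(−x⁴) is an assumed example, known only up to a constant, as the algorithm requires:

import numpy as np

rng = np.random.default_rng(5)
p_eq = lambda x: np.exp(-x**4)       # unnormalized target density

def mh_step(y, s, rng):
    x = y + s * rng.standard_normal()        # propose x ~ T(.|y)
    q = p_eq(x) / p_eq(y)                    # acceptance ratio (symmetric proposal)
    return x if rng.uniform() < q else y     # accept with probability min(1, q)

chain = np.empty(50000)
chain[0] = 0.0
for n in range(1, len(chain)):
    chain[n] = mh_step(chain[n - 1], s=1.0, rng=rng)

print(f"mean = {chain.mean():.3f}")          # symmetric target, so mean ≈ 0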

We will now demonstrate that the M(RT)² algorithm indeed converges to the desired distribution $p_{eq}$. We define the probability densities $\phi_i$ of each random variable $x_i$ in the sequence. Given $\phi_n$, we can write $\phi_{n+1}(x)$ as a sum of two contributions:

• the probability of accepting a new state,
\[
P(\text{“previous state was not } x\text{”}) = \int A(x \mid y)\, T(x \mid y)\, \phi_n(y)\, dy,
\]

• the probability of not moving away from x, i.e., rejecting the proposed state,
\[
P(\text{“previous state was } x\text{”}) = \phi_n(x) \int \big(1 - A(y \mid x)\big)\, T(y \mid x)\, dy.
\]

Combining the above contributions gives
\[
\phi_{n+1}(x) = \int A(x \mid y)\, T(x \mid y)\, \phi_n(y)\, dy + \phi_n(x) \int \big(1 - A(y \mid x)\big)\, T(y \mid x)\, dy. \tag{29}
\]
It can be shown that this recursive relation gives an ergodic system (the system will return to states already visited with probability one, and every state is aperiodic). Therefore, according to Theorem 2 (see also [5]), there exists a unique equilibrium distribution to which the recursion Eq. (29) converges.

Theorem 2 (Feller). If a random variable defines an ergodic system, then there exists a unique probability distribution $p_{eq}$ that is a fixed point of the above recursion.

Proof. We will now show that the fixed point is indeed $p_{eq}$. We substitute $\phi_n = p_{eq}$ in Eq. (29) and obtain
\[
\phi_{n+1}(x) = \int A(x \mid y)\, T(x \mid y)\, p_{eq}(y)\, dy + p_{eq}(x) \int \big(1 - A(y \mid x)\big)\, T(y \mid x)\, dy
= \int \big[ A(x \mid y)\, T(x \mid y)\, p_{eq}(y) - A(y \mid x)\, T(y \mid x)\, p_{eq}(x) \big]\, dy + p_{eq}(x) \int T(y \mid x)\, dy
= p_{eq}(x) \int T(y \mid x)\, dy
= p_{eq}(x),
\]
where we used the detailed balance condition Eq. (26). Therefore, $p_{eq}$ is the asymptotic density distribution of the random walk.

References

[1] Devinderjit Sivia and John Skilling. Data Analysis: A Bayesian Tutorial. Oxford University Press, 2006.

[2] Derrick H. Lehmer. Euclid's algorithm for large numbers. American Mathematical Monthly, pages 227–233, 1938.

[3] Donald E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. 1981.

[4] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[5] Francesco Petruccione and Peter Biechele. Stochastic Methods for Physics Using Java: An Introduction. 2000.

