
PART A MATHEMATICS AND STATISTICS, SIMULATION AND
STATISTICAL PROGRAMMING: SIMULATION LECTURES

GEOFF NICHOLLS

Contents
1. Organisation
Aims and Objectives
Synopsis
Course Structure
1.1. R
1.2. Classes
Texts
2. Introduction
3. Inversion
4. Transformation methods
5. Multivariate Normal
6. Rejection
7. Importance sampling
7.1. Importance Sampling (I)
7.2. Importance Sampling (II)
8. Markov Chain Monte Carlo
8.1. Markov chains
8.2. Metropolis Hastings Markov chain Monte Carlo
8.3. MCMC for state spaces which are not finite
8.4. MCMC and conditional distributions

1. Organisation
Aims and Objectives. Building on Part A probability and Mods statistics, this
course introduces Monte Carlo methods, collectively one of the most important
toolkits for modern statistical inference. In parallel, students are taught program-
ming in R, a programming language widely used in statistics. Lectures alternate
between Monte Carlo methods and Statistical Programming so that students learn
to programme by writing simulation algorithms.
Synopsis. (1) Simulation: Transformation methods. Rejection sampling including proof for a scalar random variable, Importance Sampling. Unbiased and consistent IS estimators. MCMC including the Metropolis-Hastings algorithm. (2) Statistical Programming: Numbers, strings, vectors, matrices, data frames and lists, and Boolean variables in R. Calling functions. Input and Output. Writing functions and flow control. Scope. Recursion. Runtime as a function of input size. Solving systems of linear equations. Numerical stability. Regression. Monte Carlo and optimisation examples for elementary Bayesian inference.

Date: 14 lectures, Hilary 2014.

Course Structure. The workload of this course is equivalent to a 16-lecture course. There are 14 lectures and 6 practicals. Lectures 4-5pm Mondays in the Lecture Theatre, Statistics Department, 1 South Parks Road will focus on Simulation. Lectures 2-3pm Fridays in weeks 2-6 and 8 will take place in the Evenlode Room of the OUCS building, 13 Banbury Road. These lectures focus on Statistical Programming. See
http://www.oucs.ox.ac.uk/about/
if you need help finding this. The OUCS lectures are followed by practical teaching sessions 3-4pm Fridays in weeks 2-6 and 8 in the same room.

1.1. R. R is a high quality open source software package for statistical computing.
We will use R for the Statistical Programming segment. You may find it useful for
checking your understanding of the applied probability we use in simulation. You
will find it convenient to install R on your own computer. The software, as well as manuals and introductory tutorials, is available from
http://www.r-project.org/
Problem sheets in classes will include a separate section with some optional examples of simulation using R.

1.2. Classes. There are four classes this term: 4-5pm on Tuesdays in weeks 3 and 7, and 10-11am on Fridays in weeks 5 and 8.

Texts.

Reading. The following texts have a large overlap with the course.

W.J. Braun and D.J. Murdoch, "A First Course in Statistical Programming with R", CUP, 2007
C.P. Robert and G. Casella, "Introducing Monte Carlo Methods with R"

Reference. The last two listed are rather advanced.

W. Venables and B.D. Ripley, "Modern Applied Statistics with S"
J.R. Norris, "Markov Chains", CUP, 1997
S.M. Ross, "Simulation", Elsevier, 4th edition, 2006
C.P. Robert and G. Casella, "Monte Carlo Statistical Methods", Springer, 2004
B.D. Ripley, "Stochastic Simulation", Wiley, 1987

Geoff Nicholls
[email protected]

2. Introduction

Computational statistics is changing the way we do statistics. We can carry out the statistical inference the math tells us we should be doing, without essential compromise. In this way computation has made the subject more 'principled'. Monte Carlo methods are a large part of this. Monte Carlo equips us to estimate the values of very complex integrals, which is to say we can estimate statistical expectations. Computing expectations is one of the fundamental operations of statistics, something we need to do to summarize data, fit parameters, test hypotheses and choose and average models. It is one of the last steps in many chains of inference.

The problem of finding efficient and provably correct Monte Carlo algorithms is a problem in applied probability. You know some probability from last term. The problem of implementing these Monte Carlo algorithms, running them on the numbers, and realizing the inference on a computer is a problem of Statistical Programming. Take programming seriously. Too many scientists (including 'data scientists') sit down at a computer and try to implement a complex algorithm without planning and structuring the code they produce. This is a good way to waste your time. In order to avoid this we build complex programmes in self-contained pieces and test the pieces (modularity). Write code that other people can read, make sense of, and make use of (be literate). Don't reproduce the same pieces of code in many different places.

In the following we always assume we have standard uniform random numbers, X ∼ U(0, 1), at our disposal, and build other random numbers by taking various functions of these numbers. A discussion of the problem of generating uniform random numbers on a computer is part philosophy, and part number theory. See the Ripley text for algorithms and a discussion of the issues involved. You may feel that random numbers generated by a deterministic device like a computer cannot be truly random, and you would be correct. However they behave, for some purposes, like random numbers, and that is often enough. The problem of defining exactly what randomness is turns out to be a deep problem. Robert and Casella mention some of the literature.

3. Inversion

This is the basic method. It is used for simulating scalar random variables with only very simple pmf or pdf, since we need the cdf of the random variable.

Let X be a scalar random variable with cumulative distribution function (cdf) F(x) = Pr(X ≤ x) at X = x in some space of states Ω. Let F^{-1}(u) be the smallest value of x such that F(x) is greater than or equal to u. If F is continuous and strictly increasing then this is just the inverse of F. Simulation by the method of inversion exploits the fact that if U ∼ U(0, 1) and X = F^{-1}(U) then X ∼ F (meaning, the rv X is distributed with cdf F). This follows (for the simple case) by Pr(X ≤ x) = Pr(F^{-1}(U) ≤ x) = Pr(U ≤ F(x)) = F(x).

Example 3.1. If we want X ∼ Exp(r) (ie X ∼ f_X with f_X(x) = r exp(−rx)) then F(x) = 1 − exp(−rx) and

F^{-1}(u) = −(1/r) log(1 − u).
The algorithm is

Algorithm 3.1.
U ∼ U(0, 1)
X ← −log(U)/r

since U and 1 − U have the same distribution, X ← −log(U)/r (rather than −log(1 − U)/r) will do.

#Example Inversion Continuous Exponential


n<-100000
r<-0.5
U<-runif(n) #n U[0,1]’s
X<--log(U)/r #n Exp(r)’s

#check
mean(X) #should equal 1/r with a sd.dev by CLT
sd(X)/sqrt(n) #or in this case about sqrt(1/n)/r

Taking the smallest value, ie defining F^{-1}(u) = min{z : F(z) ≥ u}, allows us to handle an arbitrary cdf, without the continuity and strict increasing conditions. See Ripley page 59 or Robert and Casella page 39 for details beyond the scope of the course.

If X ∼ F with X a discrete r.v. with probability mass function (pmf) p(x) then F(x) = \sum_{i=0}^{x} p(i) and F^{-1}(u) is the x such that

\sum_{i=0}^{x-1} p(i) < u ≤ \sum_{i=0}^{x} p(i)

with the LHS ≡ 0 if x = 0.
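This general discrete inversion is easy to program: tabulate the cdf with cumsum() and pick out the smallest state whose cdf value reaches u. Here is a minimal sketch; the pmf values are arbitrary illustrative numbers, not taken from an example in these notes.

#general discrete inversion: tabulate the cdf, return smallest x with F(x)>=u
p_vals<-c(0.2,0.5,0.3) #illustrative pmf on the states 1,2,3
cdf<-cumsum(p_vals) #cdf[x]=F(x)
n<-100000
U<-runif(n) #n U[0,1]'s
X<-findInterval(U,cdf,left.open=TRUE)+1 #smallest x with cdf[x]>=u

#check
table(X)/n #relative frequencies should be close to p_vals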


Example 3.2. If 0 < p < 1 and q = 1 − p, and we want to simulate X ∼ Geometric(p), then p(x) = pq^{x−1} and F(x) = 1 − q^x for x ∈ N. The smallest natural number x giving 1 − q^x ≥ u is the smallest x ≥ 1 satisfying x ≥ log(1 − u)/log(q) (since log(q) < 0), and this is given by x = F^{-1}(u) where

F^{-1}(u) = ⌈log(1 − u)/log(q)⌉,

where ⌈x⌉ rounds up, and we could replace 1 − U with U.

#Example Inversion Discrete Geometric


n<-100000
p<-0.3
U<-runif(n) #n U[0,1]’s
X<-ceiling(log(U)/log(1-p)) #n Geometric(p)’s

#check
mean(X==1) #should equal p with a sd.dev by CLT
sd(X==1)/sqrt(n) #or in this case about sqrt(p*(1-p)/n)

4. Transformation methods

Suppose we have a rv Y ∼ Q, Y ∈ Ω_Q, which we can simulate (eg, by inversion), and some other rv X ∼ P, X ∈ Ω_P, which we wish to simulate. It may be that we can find a function f : Ω_Q → Ω_P with the property that if we simulate Y ∼ Q and then set X = f(Y) then we get X ∼ P. Inversion is a special case of this idea, since X = F^{-1}(U), so Q is the uniform distribution on 0 ≤ U ≤ 1 and P is the distribution with cdf F used in inversion. We may generalize this idea to take functions of collections of rv with different distributions.
Example 4.1. If, for i = 1, 2, 3, ..., a, the Y_i are iid rv distributed Y_i ∼ Exp(1) (we can simulate these as above) and

X = β^{-1} \sum_{i=1}^{a} Y_i,

then X ∼ Γ(a, β), so now we can simulate Gamma-distributed variates (so long as the shape parameter a is an integer). How does this work? The Gamma(a, β) density is

f_X(x) ∝ x^{a−1} exp(−xβ).

The MGF of a random variable X is M_X(t) = E(e^{tX}) and the MGF of a sum of independent rv is the product of their MGFs. Since the MGF for Y_i/β is 1/(1 − t/β), the MGF of X must be (1 − t/β)^{−a}, and that is the MGF of a Gamma(a, β) rv.
#Example Transformation Gamma
#in this example we will simulate X~Gamma(7,0.5)
n<-10000
a<-7
b<-0.5
U<-matrix(rexp(a*n),a,n) #rexp() is the built in R Exp(1), U is a x n matrix
X<-apply(U,2,’sum’)/b #of Exp(1)’s, so sum columns for n Gamma(a,b)’s

#check
mean(X) #should equal a/b=14 with a sd.dev by CLT
sd(X)/sqrt(n) #or in this case about sqrt(a/n)/b

The transformation method is used in the Box Muller algorithm for simulation of
normal random variables. Since, for any particular application, very large numbers
of rv may need to be simulated, there is a great deal of emphasis on computational
speed and efficiency in the design of simulation algorithms. Variations on the
Box-Muller algorithm give the fastest simulation algorithms for normal random
variables, in many applications of practical interest. The algorithm is based on the
following observation.
Example 4.2. If U_1 and U_2 are independent U(0, 1) rv, and

X_1 = \sqrt{−2 log(U_1)} cos(2πU_2)
X_2 = \sqrt{−2 log(U_1)} sin(2πU_2),

then X_1 and X_2 are independent (!) standard normal random variables.

Proof: think of (X_1, X_2) as a random point in the plane. In polar coordinates this point has radius R = \sqrt{−2 log(U_1)} (so R^2 ∼ Exp(0.5)) and angle Θ = 2πU_2 (with Θ ∼ U(0, 2π)). In order to get the joint density f_{X_1,X_2}(x, y) of X_1 and X_2 we make a change of variables from (R^2, Θ) (notice, not (R, Θ), since it is R^2 for which we have the distribution) to (X_1, X_2):

f_{X_1,X_2}(x, y) = f_{R^2,Θ}(r^2, θ) |∂(r^2, θ)/∂(x, y)|.

Now x = r cos(θ), y = r sin(θ), so ∂θ/∂x = −y/r^2 etc, and

f_{X_1,X_2}(x, y) = (1/2) exp(−r^2/2) × (1/2π) × | det( 2x  2y ; −y/r^2  x/r^2 ) |.

The determinant equals 2(x^2 + y^2)/r^2 = 2, so that f_{X_1,X_2}(x, y) = exp(−x^2/2 − y^2/2)/(2π).

#example transformation Box-Muller


n<-100000
U<-matrix(runif(2*n),2,n)
X<-sqrt(-2*log(U[1,]))*cos(2*pi*U[2,])
X<-c(X,sqrt(-2*log(U[1,]))*sin(2*pi*U[2,])) #concatenate batches of N(0,1)’s

#check
mean(X) #should equal 0 with a sd.dev by CLT
sd(X)/sqrt(2*n) #or in this case about 1/sqrt(2*n)

Note: see Ross page 80 sec 5.3 for a strategy for avoiding the sin and cos evaluations
(Problem 2.8 page 63 of Robert and Casella) and the Box Muller algorithm itself
at the bottom of Page 81.

5. Multivariate Normal

This is a multivariate application of the transform method to a distribution of particular importance.

Let µ = (µ_1, µ_2, ..., µ_m)^T be a (column) vector of m mean values, and X = (X_1, X_2, ..., X_m)^T a vector of normal random variables with E(X_i) = µ_i and cov(X_i, X_j) = Σ_{i,j}. The multivariate normal distribution for X is written X ∼ N(µ, Σ) (sometimes X ∼ MVN(µ, Σ)). In this case the joint density of the collection of rv X at X = x in R^m is

f_X(x) = (2π)^{−m/2} det(Σ)^{−1/2} exp( −(x − µ)^T Σ^{−1} (x − µ)/2 ).

Theorem 5.1. Let Z = (Z_1, Z_2, ..., Z_m) be a collection of m independent standard normal random variables. Let L be a real m × m matrix satisfying LL^T = Σ, and X = LZ + µ. Then X ∼ N(µ, Σ).

Proof: The joint density of the new variables (the components of X) is

f_X(x) = f_Z(z(x)) |∂z/∂x|.

Now

f_Z(z) = exp(−z^T z/2)/(2π)^{m/2}

and |∂z/∂x| = det(L^{−1}). Also, det(L) = det(L^T), so det(L)^2 = det(Σ), and det(L^{−1}) = 1/det(L), so det(L^{−1}) = det(Σ)^{−1/2}. Also, z^T z = (x − µ)^T [L^{−1}]^T L^{−1} (x − µ), so z^T z = (x − µ)^T Σ^{−1} (x − µ), and so X is indeed N(µ, Σ).
There are many ways to define L so that LL^T = Σ. For example, if Σ = VDV^T is the diagonalization, or spectral representation, of Σ, then L = VD^{1/2} would do. The eigenvalues of Σ = E((X − µ)(X − µ)^T) are non-negative, so this representation is real. We can assume the eigenvalues are actually strictly greater than zero (Σ positive definite) without loss of generality, since zero eigenvalues arise when some of the X_i are linear combinations of others, so the corresponding rows may be removed. In this case the LU factorization Σ = LU into lower and upper triangular matrices has a special form with L = U^T, and a special name, the Cholesky factorization. The Cholesky factorization L = chol(Σ) is particularly convenient for computational use. It (ie, L) is the unique lower triangular matrix satisfying LL^T = Σ. See the Part B course in numerical analysis for details of fast chol(Σ) computation.
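As a check on the spectral route just described, here is a small sketch building L = VD^{1/2} with eigen(); the µ and Σ are the same illustrative values used in the code below.

#alternative square root of Sigma: L=V D^(1/2) from Sigma=V D V^T
mu<-c(-1,1)
s<-matrix(c(5,-3,-3,4),2,2)
e<-eigen(s) #eigenvectors in e$vectors, eigenvalues in e$values
L<-e$vectors%*%diag(sqrt(e$values))
L%*%t(L) #recovers s, so X<-L%*%rnorm(2)+mu would also do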

#MVN, X~N(mu,s), use the cholesky factorization of the covariance matrix


#In this example we will take m=dim(mu)=2, mu=(-1,1), covariance matrix s
#s=| 5 -3|
# |-3 4|
mu<-c(-1,1)
s<-matrix(c(5,-3,-3,4),2,2)
u<-chol(s) #s=t(u)%*%u so my L is t(u) here
(X<-t(u)%*%rnorm(2)+mu) #X~N(mu,s)

#check - generate n sets of MVN vectors X


n<-10000
X<-t(u)%*%matrix(rnorm(2*n),2,n)+mu
plot(t(X),asp=1)
#estimate mean mu-hat and covariance s-hat
apply(X,1,’mean’) #average along the rows
cov(t(X)) #cov() is covariances of row vectors

6. Rejection

Inversion is restricted to the univariate case. Also, we need to have the cdf of the target distribution in a form that makes it at least numerically easy to invert. We found a transformation to convert independent standard normal variates into (correlated) multivariate normal random variables. That worked because the normal distribution is 'special'. We can't rely on finding a suitable transformation to simulate any given distribution.

We will start with simulation of a discrete rv X. The following sentence may sound familiar. Suppose that for x ∈ Ω we have a probability mass function p(x) (the target distribution) which we want to sample, and another pmf q(x) defined on the same space which we can sample.
Theorem 6.1. Suppose we can find a constant M satisfying M ≥ p(x)/q(x) for all x ∈ Ω. The following 'Rejection algorithm' returns X ∼ p.

Algorithm 6.1.
1. Let Y ∼ q and U ∼ U(0, 1). Simulate Y = y and U = u.
2. If u ≤ p(y)/(M q(y)) then stop and return X = y; otherwise, start again at 1.

Since this is a rather important idea, we will look at a couple of ways of proving this result. The second proof is short but hides some subtleties. The first is more explicit.

Proof (1): Let Pr(X = i) give the probability for the value of X returned by the algorithm to equal i. I will partition on the number of times through the loop. In order to end up with X = i we could draw Y = i at the first step and accept it, or we could reject whatever was drawn at the first pass, and then at the second pass draw Y = i and accept it, and so on. Events at each pass through the loop are independent of events in other passes, so

Pr(X = i) = \sum_{n=1}^{∞} Pr(reject n − 1 times, then draw Y = i and accept it)
(6.1)     = \sum_{n=1}^{∞} Pr(reject Y)^{n−1} Pr(draw Y = i and accept it).

At a particular pass through the loop the probability we draw Y = i and accept is

Pr(draw Y = i and accept it) = Pr(accept Y | Y = i) Pr(draw Y = i)
                             = Pr( U ≤ p(i)/(M q(i)) ) q(i)
(6.2)                        = p(i)/M.

The probability we have a rejection at any pass through the loop is

Pr(reject Y) = \sum_{j∈Ω} Pr(draw Y = j and reject it)
             = \sum_{j∈Ω} Pr(reject Y | Y = j) Pr(draw Y = j)
             = \sum_{j∈Ω} Pr( U > p(j)/(M q(j)) ) q(j)
             = \sum_{j∈Ω} ( 1 − p(j)/(M q(j)) ) q(j)
             = 1 − 1/M,

using \sum_j q(j) = \sum_j p(j) = 1 to get the last line. We succeed or fail at each pass through the loop independently of events in previous loops. Picking up where we left Eqn 6.1,


Pr(X = i) = \sum_{n=1}^{∞} (1 − 1/M)^{n−1} p(i)/M
          = (p(i)/M) × 1/(1 − (1 − M^{−1}))
          = p(i),

as required. Notice that the number of accept/reject trials (the number of times N we have to repeat steps 1 and 2 before we accept, and get one sample returned) has a geometric distribution with success probability 1/M, so the mean number of trials is M. If we are conservative, in order to ensure the bound M ≥ p/q holds, and choose a large M, we will have an inefficient algorithm.
Proof (2): here is a proof that the rejection algorithm returns realisations of X distributed with probability density p(x), for X a continuous scalar rv. The proof is easily adapted to X discrete, by changing integrals to sums. Think of the Y and U values simulated at each cycle through the loop as a pair. We generate lots of these pairs (U, Y) and take as our realisations of X those Y-values belonging to pairs that satisfy U ≤ p(Y)/(M q(Y)). This is the same as saying that the cdf of X is the cdf of Y conditioned on U ≤ p(Y)/(M q(Y)).

Let X be the value returned by a call to Alg. 6.1. If p_{U,Y}(u, y) is the joint density of U and Y then p_{U,Y}(u, y) = q(y), because U and Y are independent (before we condition) and U has the uniform density on 0 < U < 1, which is equal to one.

Pr(X < x) = Pr( Y < x | U < p(Y)/(M q(Y)) )
          = Pr( Y < x, U < p(Y)/(M q(Y)) ) / Pr( U < p(Y)/(M q(Y)) )
          = [ ∫_{−∞}^{x} ∫_{0}^{p(y)/(M q(y))} p_{U,Y}(u, y) du dy ] / [ ∫_{−∞}^{∞} ∫_{0}^{p(y)/(M q(y))} p_{U,Y}(u, y) du dy ]
          = [ ∫_{−∞}^{x} ∫_{0}^{p(y)/(M q(y))} q(y) du dy ] / [ ∫_{−∞}^{∞} ∫_{0}^{p(y)/(M q(y))} q(y) du dy ]
          = [ ∫_{−∞}^{x} p(y)/M dy ] / [ ∫_{−∞}^{∞} p(y)/M dy ]
          = ∫_{−∞}^{x} p(y) dy,

since ∫_{−∞}^{∞} p(y) dy = 1, because p(y) is the probability density of a real scalar.
Example 6.1. Here is an algorithm simulating a rv X ∼ Beta(α, β), which works for α, β ≥ 1.

The beta density p(x) is

p(x) = x^{α−1}(1 − x)^{β−1} / Z_p   for 0 < x < 1 and α, β > 0,

where

Z_p = Γ(α)Γ(β)/Γ(α + β)

is a normalising constant. In order to apply the rejection algorithm we need to find an envelope probability density q(x) and a constant M ≥ p(x)/q(x) for all 0 < x < 1.

When α and β are both greater than or equal to one, p(x) is bounded (and otherwise, not). Since p(x) has support on [0, 1] and is bounded, we can simply use the uniform density q(x) = 1, ie the constant function, as our envelope, and M = max_{0<x<1} p(x) to ensure M ≥ p(x)/q(x) for all 0 < x < 1.

Now, p′(x*) = 0 is satisfied by x* = (α − 1)/(α + β − 2), so M = p(x*), that is

M = (1/Z_p) ((α − 1)/(α + β − 2))^{α−1} ((β − 1)/(α + β − 2))^{β−1},

will do for the bound, and our acceptance probability at step 2 of Alg. 6.1 is

p(y)/(M q(y)) = y^{α−1}(1 − y)^{β−1} / (M Z_p)
              = y^{α−1}(1 − y)^{β−1} / M′,

where

M′ = ((α − 1)/(α + β − 2))^{α−1} ((β − 1)/(α + β − 2))^{β−1}.

The point is that the factor Z_p cancels with a corresponding factor of Z_p in the expression for M, so we needn't work it out. This kind of simplification will always occur, as we shall see.

A larger M would have given a correct algorithm - can you think why we want the minimum upper bound?
#Rejection, beta(a,b) for both a,b>=1

my_big_ab_beta<-function(a=1,b=1) {
#simulate X~Beta(a,b) variate, defaults to U(0,1)
if (a<1 || b<1) stop(’a<1 or b<1’);
M<-(a-1)^(a-1)*(b-1)^(b-1)*(a+b-2)^(2-a-b)
finished<-FALSE
while (!finished) {
Y<-runif(1)
U<-runif(1)
accept_prob<-Y^(a-1)*(1-Y)^(b-1)/M
finished<-(U<accept_prob)
}
X<-Y
X}

my_big_ab_beta(a=1.5,b=2.5)

#check, example values a=1.5,b=2.5



a<-1.5; b<-2.5;
n<-100000
X<-rep(NA,n) #clear X before the loop - its entries are NA
for (i in 1:n) {
X[i]<-my_big_ab_beta(a,b)
}
mean(X);a/(a+b) #should equal a/(a+b) with a sd.dev by CLT
sd(X)/sqrt(n) #or in this case about sqrt(a*b/(a+b+1)/n)/(a+b)
This gets inefficient when a or b are big (the beta density is then sharply peaked, so
the ratio p/M q is typically very small and it takes many trials to get an acceptance).
Exercise: when I first wrote the function my_big_ab_beta() above, I forgot to
divide by M in the line accept_prob<-Y^(a-1)*(1-Y)^(b-1)/M but everything
still seemed to be correct (the means and variances were right). Can you explain
why the algorithm was still correct, and why I am better off with M in place, as
above?

A note on normalising constants: notice the way the normalising constant in p canceled the corresponding constant in M. If p̃ and q̃ are unnormalised densities with p = p̃/Z_p and q = q̃/Z_q, then M ≥ p/q iff M′ ≥ p̃/q̃, with M′ = Z_p M/Z_q. This means we can throw the normalising constants out at the start: if we can find M′ to bound p̃/q̃ then it is correct to accept with probability p̃/(M′q̃) in the rejection algorithm. In this case the mean number N of accept/reject trials will equal Z_q M′/Z_p (that is, M again).

The art of making good rejection algorithms for a given target pdf p is to find some probability density q which (a) you can sample by elementary methods, and (b) is 'close' to p in shape, so that the ratio p(x)/q(x) is close to one for any x value you are likely to draw from p.
Example 6.2. Here is an algorithm simulating a rv X ∼ Gamma(α, β) (example from R&C) which works for any α ≥ 1 (not just integers).

The Gamma density p(x) is

p(x) = x^{α−1} exp(−βx) / Z_p   for x > 0 and α, β > 0,

with Z_p = Γ(α)/β^α, so

p̃(x) = x^{α−1} exp(−βx)

will do as an unnormalized form for p. When α = a is a positive integer we can simulate X ∼ Gamma(a, β) by adding a independent Exp(β) rv's, Y_i ∼ Exp(β), X = \sum_{i=1}^{a} Y_i. How could we simulate X ∼ Gamma(α, β) for non-integer α?

We can sample densities 'close' in shape to Gamma(α, β), since we can sample Gamma(⌊α⌋, β). Perhaps this, or something like it, would make an envelope density?

Let a = ⌊α⌋ and use Gamma(a, b) as the envelope, so Y ∼ Gamma(a, b) for integer a ≥ 1 and some b > 0. In order to use rejection, we need to calculate the acceptance probability p/(Mq). The density of Y is

q(y) = y^{a−1} exp(−by) / Z_q,

so

q̃(y) = y^{a−1} exp(−by)

will do as our unnormalised envelope function. The problem then is to bound the ratio

p̃(x)/q̃(x) = x^{α−a} exp(−(β − b)x).

Is this bounded? Consider (a) x → 0 and (b) x → ∞. For (a) we need a ≤ α, so a = ⌊α⌋ is fine. For (b) we need b < β (not b = β, since we need the exponential to kill off the growth of x^{α−a}).

Given that we have chosen a and b so the ratio is bounded, we should now compute the bound. Now d(p̃/q̃)/dx = 0 at x = (α − a)/(β − b) (and this must be a maximum on x ≥ 0 under our conditions on a and b), so p̃/q̃ ≤ M for all x ≥ 0 if

M = ((α − a)/(β − b))^{α−a} exp(−(α − a)).

We will accept Y at step 2 of Alg. 6.1 if U ≤ Y^{α−a} exp(−(β − b)Y)/M.

Exercise: how to choose b? Any 0 < b < β will do, but is there a best choice? One idea would be to choose b to minimize the expected number of Y-simulations per sample X output. Since the number of trials, N say, is Geometric with success probability Z_p/(M Z_q), the expected number of trials is E(N) = Z_q M/Z_p. Now Z_p = Γ(α)β^{−α}, where Γ is the Gamma function related to the factorial. Show that the optimal b solves d(b^{−a}(β − b)^{−α+a})/db = 0, so use b = β(a/α) for best results.
#Rejection, Gamma(a,b)

my_gamma<-function(a=1,b=1) {
#simulate X~Gamma(a,b) variate, defaults to Exp(1)
if (a<1 || b<=0) stop(’a<1 or b<=0’);
aq<-floor(a)
bq<-b*(aq/a) #best choice but any 0<bq<b OK
del_a<-a-aq
del_b<-b-bq
finished<-FALSE
while (!finished) {
Y<-sum(rexp(aq))/bq #Y~Gamma(aq,bq)
U<-runif(1)
accept_prob<-(Y*del_b/del_a)^del_a*exp(-del_b*Y+del_a)
finished<-(U<accept_prob)
}
X<-Y
X}

my_gamma(a=7.5,b=0.5)

#check, in this example, a=7.5,b=0.5


a<-7.5; b<-0.5;
n<-100000
X<-rep(NA,n) #clear X before the loop - its entries are NA
for (i in 1:n) {

X[i]<-my_gamma(a,b)
}
mean(X) #should equal a/b with a sd.dev by CLT
sd(X)/sqrt(n) #or in this case about sqrt(a/n)/b

Exercise: modify the function my_gamma() to return the number of trials needed in
order to generate X, and check that the mean number of trials is M Zq /Zp .
Notice that it is important to find q which is heavy-tailed compared to p, so that p(x)/q(x) goes to zero as x tends to either end of the support of p (and of course the ratio also has to be bounded throughout the support, but that is usually the easy bit).

7. Importance sampling

Importance sampling is, among other things, a strategy for recycling samples. It is also useful when we need to make an accurate estimate of the probability that a random variable exceeds some very high threshold. In this context the naive estimator (the proportion of samples over the threshold) has a high variance (relative to the mean), and importance sampling is referred to as a variance reduction technique. There is a slight variation on the basic set up: we can generate samples distributed according to q but we want to estimate an expectation that depends on p (before, it was "but we want samples distributed according to p"), so we want to estimate E_p(f(X)) for some function f. In importance sampling we avoid sampling the target distribution p. Instead, we take samples distributed according to q and reweight them.

7.1. Importance Sampling (I). Here is the key idea. Let Y_i ∼ q, i = 1, 2, ..., n, be iid continuous random variables distributed on Ω with density q. We will require p(x) > 0 ⇒ q(x) > 0 for all x ∈ Ω (which is weaker than p/q bounded, as in rejection). Then

f̄ = (1/n) \sum_{i=1}^{n} p(Y_i) f(Y_i) / q(Y_i)

is an unbiased estimator for E_p(f(X)), since

E_q(f̄) = (1/n) \sum_{i=1}^{n} E_q( (p(Y_i)/q(Y_i)) f(Y_i) )
        = E_q( (p(Y_1)/q(Y_1)) f(Y_1) )
        = ∫_Ω f(y) (p(y)/q(y)) q(y) dy
        = ∫_Ω p(y) f(y) dy
        = E_p(f(X)).
All this works in a very similar way in the case where X and the Y_i are discrete rv.

We call w(x) = p(x)/q(x) the weight function and w_i = w(y_i) the weights for importance sampling estimates of expectations in p using samples Y_i = y_i from q, and write

(7.1)   f̄ = (1/n) \sum_{i=1}^{n} w_i f(y_i).

Here f̄ is called an IS (importance sampling) estimator for E_p(f).

Example 7.1. Say we have simulated Y_i ∼ Gamma(a, b) and we want to estimate E_p(f(X)) where X ∼ Gamma(α, β).

Recall that the Gamma(α, β) density is

p(x) = x^{α−1} exp(−βx) β^α / Γ(α).

Now p/q is

p(x)/q(x) = (Γ(a) β^α / (Γ(α) b^a)) x^{α−a} e^{−(β−b)x},

so the weights are

w_i = (Γ(a) β^α / (Γ(α) b^a)) Y_i^{α−a} e^{−(β−b)Y_i},

and we estimate E_p(f(X)) via f̄ as in Eqn 7.1.

#in this example f(x)=x, X~Gamma(7.8,2), Y~Gamma(7,1) and we use Importance sampling
#to estimate E(f(X))
alpha<-7.8
beta<-2

n<-100000
a<-floor(alpha)
b<-1 #Exercise: what is the optimal choice of b here?
U<-matrix(rexp(a*n),a,n)
Y<-apply(U,2,’sum’)/b

exp_Y1<-mean(Y*Y^(alpha-a)*exp(-(beta-b)*Y)*(gamma(a)/gamma(alpha))*(beta^alpha/b^a))

#compare estimate (exp_Y1) and exact result (E(X)=alpha/beta) - TODO check variance
exp_Y1
alpha/beta
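The variance check flagged above is easy to add: f̄ is the sample mean of the iid terms w(Y_i)f(Y_i), so the sd of those terms divided by sqrt(n) estimates the standard error. A sketch, run after the code above (it reuses Y, n, a, b, alpha and beta):

#estimate the standard error of the IS estimator from the iid terms w(Y_i)f(Y_i)
terms<-Y*Y^(alpha-a)*exp(-(beta-b)*Y)*(gamma(a)/gamma(alpha))*(beta^alpha/b^a)
sd(terms)/sqrt(n) #estimated sd of exp_Y1; compare with |exp_Y1-alpha/beta|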

We now consider the variance of f̄. We assume E_p(f(X)) and var_p(f(X)) are finite. For a function f : Ω → R and Y ∼ q, let

var_q(f(Y)) = E_q(f(Y)^2) − E_q(f(Y))^2

be the variance of f(Y). The variance of the importance-sampling estimator for E_p(f) is

var(f̄) = (1/n) var_q( (p(Y_1)/q(Y_1)) f(Y_1) )
        = (1/n) ( E_q( (p^2/q^2) f^2 ) − E_q( (p/q) f )^2 )
(7.2)   = (1/n) ( E_p( (p/q) f^2 ) − E_p(f)^2 )

(and var(f̄) = (E_p(f^2) − E_p(f)^2)/n when q = p, as expected). Each time we do IS we should check that this variance is finite, otherwise our estimates are somewhat untrustworthy! We check that E_p(p f^2/q) is finite.
Example 7.2. Let us check that the variance of f̄ in Example 7.1 is actually finite. The variance of f̄ depends on two expectations, E_p(p f^2/q) and E_p(f). Referring to Eqn 7.2, it is enough to show that E_p(p f^2/q) is finite. The functions of α, β, a and b appearing are finite, so we can take those factors through, and begin with

(p(X)/q(X)) f(X)^2 ∝ f(X)^2 X^{α−a} e^{−(β−b)X}.

The expectation of interest is

E_p(p f^2/q) ∝ E_p( f(X)^2 X^{α−a} exp(−(β − b)X) )
             = ∫_0^∞ p(x) f(x)^2 x^{α−a} exp(−(β − b)x) dx
             ≤ M ∫_0^∞ p(x) f(x)^2 dx
             = M E_p(f^2),

where

M = max_{x>0} x^{α−a} exp(−(β − b)x)

is finite if a < α and b < β (see Example 6.2). Since f has finite mean and variance in X, we know E_p(f^2) < ∞, and so the integral is finite. If these conditions (on the parameters) are not satisfied, our IS-estimator is useless (unless f saves the day). These same conditions applied to our rejection sampling for Gamma(α, β), though note that here it is enough for M to exist; we don't have to work out its value.

Notice that the setup allows us to choose q: while p is given, q needs to cover p (p > 0 ⇒ q > 0) and be simple to sample. The requirement that the variance of the estimator be finite further constrains our choice. Referring to Eqn 7.2, we need E_p(f^2 p/q) = E_q(f^2 p^2/q^2) to be finite. If var_p(f) is known to be finite then (as we saw in Example 7.2) it may be easy to get a sufficient condition for the variance to be finite. Further analysis will depend on the details of f.

What is the choice of q that actually minimizes the variance of the importance sampling (I.S.) estimator? Look at Eqn 7.2. Suppose for the moment f > 0. The variance of f̄ cannot be negative, but we can make it zero by the choice q(x) = p(x)f(x)/E_p(f). Zero variance estimators! In practice we are not able to implement this choice, since the weight function is w(x) = p/q = E_p(f)/f(x), but E_p(f) is the thing we are trying to estimate anyway, so we can't compute the weights in Eqn 7.1.

One important class of application of IS is to problems in which we estimate the probability of a rare event. In this case we may be able to sample p directly, but it doesn't help us. If, for example, X ∼ p with Pr(X > x) = δ say, with δ very small, we may not get any samples X_i > x, and our estimate δ̂ = \sum_i I_{X_i>x}/n is simply zero. By distorting the proposal distribution we can actually reduce the variance of our (IS) estimator.
Example 7.3. Let X ∼ N(µ, σ^2) be a scalar normal rv. If we would like to estimate δ = Pr(X > x) for some x much larger than 3σ, we could exponentially tilt the density of X towards larger values so that we get some samples in the target region, and then allow for our tilting via an IS estimator. If p(x) is the density of X then q(x) = p(x) exp(tx)/M_p(t) is called a tilted density of p (see eg Ross section 8.6 page 185). Here the normalizing constant M_p(t) = E_p(exp(tX)) happens to be the moment generating function of X.

For p the normal density,

exp(−(x − µ)^2/2σ^2) exp(tx) = exp(−(x − µ − tσ^2)^2/2σ^2) exp(µt + t^2σ^2/2),

so the normalized tilted density is

q(x) = (2πσ^2)^{−1/2} exp(−(x − µ − tσ^2)^2/2σ^2),

normal with mean µ + tσ^2 and variance σ^2. For the normal, M_p(t) = exp(µt + t^2σ^2/2). The IS weight function is p/q = exp(−tx)M_p(t), so

w(x) = exp(−t(x − µ − tσ^2/2)).

We take samples Y_i ∼ N(µ + tσ^2, σ^2), compute w_i = w(Y_i) and form our IS estimator for δ = Pr(X > x) according to Eqn 7.1,

δ̂ = (1/n) \sum_{i=1}^{n} w_i I_{Y_i>x},

since f(Y_i) here is I_{Y_i>x}.

We haven't said how to choose t. The point here is that we want samples in the region of interest. I will choose the mean of the tilted distribution so that it equals x; then I am sure to have samples in the region of interest. I choose t so that µ + tσ^2 = x, or t = (x − µ)/σ^2. The densities p and q and the threshold used in the example below are depicted in Figure 1.

Figure 1. (solid) N(0, 1) density p. (dashed) N(x, 1) tilted density q.
####################################################
#IS - normal example, method I, Pr(Z>4) Z~N(0,1)
#note I chose x=4 so we could check using the naive method
#for x=6 say the advantage is even more dramatic

sigma<-1
mu<-0
x<-4

n<-1000000

Z<-rnorm(n)
#naive estimator
(plain_est<-mean(Z>x))

#Y are samples from the tilted distribution (weighted towards region of interest)
t<-(x-mu)/sigma^2
Y<-mu+t*sigma^2+sigma*rnorm(n)
#IS-estimator corrects for weighting
(IS_est<-mean( (Y>x)*exp(-(Y-mu)^2/(2*sigma^2)+(Y-mu-t*sigma^2)^2/(2*sigma^2))))

#The precision of our estimate improves by a factor of about 100


(plain_sd<-sd(Z>x)/sqrt(n))
#ran above, got 0.000033 +/- 0.000006 about 20% error

(IS_std<-sd( (Y>x)*exp(-(Y-mu)^2/(2*sigma^2)+(Y-mu-t*sigma^2)^2/(2*sigma^2)))/sqrt(n))
#ran above got 0.00003170 +/- 0.00000007 about 0.2% error

Exercise: how to choose t? Show that the t-value that minimizes the variance minimizes E_p(f^2 p/q) = M_p(t) ∫_x^∞ p(z) exp(−tz) dz over t, and hence obtain an equation for the optimal tilt t.

7.2. Importance Sampling (II). When we introduced rejection sampling, we emphasized that normalizing constants are often unknown. In order to do IS-estimation as in Section 7.1 we need p and q normalized. There is an IS-estimator for un-normalized densities too, and it is in this form that IS-estimation is usually conducted.

Let p̃(x) ≥ 0, q̃(x) ≥ 0 and f(x) be scalar functions of x ∈ R^n. Suppose the integrals Z_p = ∫_{R^n} p̃(x) dx and Z_q = ∫_{R^n} q̃(x) dx exist, so that p = p̃/Z_p and q = q̃/Z_q are probability densities.
We want to estimate E_p(f) = E_q(pf/q). Now the normalizing constants Z_p and Z_q are no longer available, so separate them out, writing

E_q(pf/q) = E_q(p̃f/q̃) Z_q/Z_p.

We have an expectation giving Z_p/Z_q, namely E_q(p̃/q̃) = Z_p/Z_q. It follows that

(7.3)   E_p(f) = E_q(p̃f/q̃) / E_q(p̃/q̃),

and we can estimate the numerator and denominator separately using samples distributed according to q.
Our recipe for estimation of the mean of f is then

Algorithm 7.1.
(1) Simulate Y_i ∼ q iid for i = 1, 2, ..., n and let w̃(Y_i) = p̃(Y_i)/q̃(Y_i).
(2) Evaluate an estimator a,

a = (1/n) \sum_{i=1}^{n} w̃(Y_i) f(Y_i),

for the numerator of E_p(f) in Eqn 7.3, and another estimator b,

b = (1/n) \sum_{i=1}^{n} w̃(Y_i),

for the denominator.
(3) The IS-estimator for E_p(f) is f̄ = a/b.

Notice that E_q(a) = E_q(p̃f/q̃) and E_q(b) = E_q(p̃/q̃), so these two estimators are unbiased, even though f̄ itself may be biased.

An estimator (such as f̄) is consistent for E_p(f) if f̄ tends in probability to E_p(f) as n → ∞. That means that for each ε > 0,

lim_{n→∞} Pr(|f̄ − E_p(f)| > ε) = 0.

Exercise: show that a and b are consistent estimators (of the numerator and denominator of Eqn 7.3). (Via Markov's Inequality or using the CLT - this is the same as Q6 of the supplementary problem sheet of Part A probability, HT08.)

Theorem 7.1. f̄ is a consistent estimator, that is, f̄ tends in probability to E_p(f) as n → ∞.

We omit the proof, which is outside the scope of this course. If you are interested in following this up, apply the delta-method, Taylor expanding f̄ as a function of a and b about a = E_q(w̃f) and b = E_q(w̃).
a and b about a = Eq (w̃f ) and b̂N = Eq (w̃).
Example 7.4. Revisit Example 7.1. We wish to estimate E_p(f(X)) where the target distribution is X ∼ Gamma(α, β), with Z_p = β^{−α}Γ(α), and the proposal distribution is Y ∼ Gamma(a, b). Suppose we didn't know Z_p and Z_q (or we were not confident we could calculate them correctly, or we couldn't be bothered calculating them!). Now p̃(x)/q̃(x) = x^{α−a} exp(−(β − b)x), so our algorithm is

Algorithm 7.2.
1. Simulate Y_i ∼ Gamma(a, b), for i = 1, 2, ..., n.
2. Calculate the weights

w_i = Y_i^{α−a} exp(−(β − b)Y_i).

3. Form the IS-estimator

f̄ = \sum_i w_i f(Y_i) / \sum_j w_j.

#in this example f(x)=x, X~Gamma(7.8,2), Y~Gamma(7,1) and we use Importance sampling
#to estimate E(f(X)) using the ratio estimator
alpha<-7.8
beta<-2

n<-100000
a<-floor(alpha)
b<-1 #Exercise: what is the optimal choice of b here?
U<-matrix(rexp(a*n),a,n)
Y<-apply(U,2,’sum’)/b

#version II do not know normalizations


num_Y<-mean(Y*Y^(alpha-a)*exp(-(beta-b)*Y))
den_Y<-mean(Y^(alpha-a)*exp(-(beta-b)*Y))
exp_Y2<-num_Y/den_Y

#compare - TODO should look at variances


exp_Y2
alpha/beta

#check that our estimate of Z_p/Z_q is good


den_Y
(gamma(alpha)/gamma(a))*(b^a/beta^alpha)

Surprisingly, simulation studies show that importance sampling using method II gives more stable estimates of E_p(f) than does method I - some authors recommend using II even when you know and can easily calculate all the normalizing constants. This is surprising as we have an extra quantity to estimate in method II. In my experience it is easy to make programming errors, even for simple normalizing functions, when using method I, so I tend to use method II.

8. Markov Chain Monte Carlo

Our purpose is to estimate E_p(f(X)) for X ∼ p, for p(x) some pmf (or pdf) defined for x ∈ Ω. Up to this point we have based our estimates on iid draws from either p itself, or some proposal distribution with pmf q. In MCMC we simulate a correlated sequence X_0, X_1, X_2, ... which satisfies X_t ∼ p (or at least X_t converges to p in distribution) and rely on the usual estimate f̂ = n^{−1} \sum_i f(X_i). We need to review and extend some of the Markov chain material from Part A Probability.

In the following discussion we will suppose the space of states of X is finite (and
therefore discrete). We do not promise a satisfactory proof that all this works in the
case X a continuous rv: we simply observe that all our work is to be on a computer,
with finite precision, and that all the probability densities we will consider are in
fact represented by some closely approximating pmf on the computer.
You can find other presentations of parts of the material in this section in Section
5.5 of Norris and Chapter 10 of Ross.

8.1. Markov chains. Let {X_t}_{t=0}^{∞} be a homogeneous Markov chain of random variables on Ω with starting distribution X_0 ∼ p^{(0)} and transition probability

P_{i,j} = Pr(X_{t+1} = j | X_t = i).

Denote by P^{(n)}_{i,j} the n-step transition probabilities

P^{(n)}_{i,j} = Pr(X_{t+n} = j | X_t = i),

and by p^{(n)}(i) = Pr(X_n = i). Recall that P is irreducible if and only if, for each pair of states i, j ∈ Ω, there is n such that P^{(n)}_{i,j} > 0. The Markov chain is aperiodic if P^{(n)}_{i,j} is non-zero for all sufficiently large n.

Example 8.1. Here is an example of a periodic chain: Ω = {1, 2, 3, 4}, p^{(0)} = (1, 0, 0, 0), and transition matrix

P = (  0   1/2   0   1/2 )
    ( 1/2   0   1/2   0  )
    (  0   1/2   0   1/2 )
    ( 1/2   0   1/2   0  ),

since P^{(n)}_{1,1} = 0 for n odd.
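We can see the period numerically by taking matrix powers of P; a quick sketch:

#the chain of Example 8.1 has period 2: (P^n)[1,1]=0 for n odd
P<-matrix(c(0,1/2,0,1/2, 1/2,0,1/2,0, 0,1/2,0,1/2, 1/2,0,1/2,0),4,4,byrow=TRUE)
P2<-P%*%P
P3<-P2%*%P
P2[1,1] #positive (0.5): return to state 1 in two steps is possible
P3[1,1] #zero: no return to state 1 in an odd number of steps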

Exercise: show that if P is irreducible and P_{i,i} > 0 for some i ∈ Ω then P is aperiodic.

Recall that the pmf π(i), i ∈ Ω, with \sum_{i∈Ω} π(i) = 1, is a stationary distribution of P if πP = π. If p^{(0)} = π then

p^{(1)}(j) = \sum_{i∈Ω} p^{(0)}(i) P_{i,j},

so p^{(1)}(j) = π(j) also. Iterating, p^{(t)} = π for each t = 1, 2, ... in the chain, so the distribution of X_t ∼ p^{(t)} doesn't change with t; it is stationary.

We now introduce reversible Markov chains (see also Norris 1.9). In a reversible chain we cannot distinguish the direction of simulation from inspection of a realization of the chain (so, you simulate a piece of the chain, toss a coin and reverse the order of states if the coin comes up heads; now you present me the sequence of states; I cannot tell whether or not you have reversed the sequence, though I know the transition matrix of the chain).

Denote by P′_{i,j} = Pr(X_{t−1} = j | X_t = i) the transition matrix for the time-reversed chain. It seems clear that a Markov chain will be reversible if and only if P = P′, so that any particular transition occurs with equal probability in forward and reverse directions.
Theorem 8.1. (i) If there is a probability mass function π(i), i ∈ Ω, satisfying π(i) ≥ 0, \sum_{i∈Ω} π(i) = 1 and

(8.1)   π(i)P_{i,j} = π(j)P_{j,i}   for all pairs i, j ∈ Ω,

then π = πP, so π is stationary for P. (ii) If in addition p^{(0)} = π then P′ = P and the chain is reversible with respect to π.

The relations in Eqn. 8.1 are called the detailed balance relations (and in this sense, π = πP would be global balance).

Proof of (i): sum both sides of Eqn. 8.1 over i ∈ Ω. Now \sum_i P_{j,i} = 1, so \sum_i π(i)P_{i,j} = π(j) and we are done.

Proof of (ii): we have π a stationary distribution of P, so Pr(X_t = i) = π(i) for all t = 1, 2, ... along the chain. Then

P_{i,j} = Pr(X_{t+1} = j | X_t = i)
        = Pr(X_t = i | X_{t+1} = j) Pr(X_{t+1} = j) / Pr(X_t = i)
        = P′_{j,i} π(j)/π(i),

but P_{i,j} = P_{j,i} π(j)/π(i) by detailed balance, so P = P′.

Why bother with reversibility? It is (i) of Theorem 8.1 that will be useful to us. If we can find a transition matrix P satisfying p(i)P_{i,j} = p(j)P_{j,i} then pP = p, so 'our' p is a stationary distribution. Given P it is far easier to verify detailed balance than to check p = pP directly.
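As a concrete illustration, here is a sketch checking detailed balance, and hence global balance, numerically for a small hand-picked chain; the transition matrix and its stationary pmf are illustrative choices, not from these notes.

#check detailed balance pi(i)P[i,j]=pi(j)P[j,i] and global balance pi P = pi
P<-matrix(c(0.5,0.5,0, 0.25,0.5,0.25, 0,0.5,0.5),3,3,byrow=TRUE)
piv<-c(0.25,0.5,0.25) #candidate stationary pmf
D<-piv*P #R recycles piv down columns, so D[i,j]=piv[i]*P[i,j]
all.equal(D,t(D)) #TRUE: detailed balance holds
piv%*%P #equals piv: global balance follows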
We will be interested in using simulation of {X_t}_{t=0}^{n−1} in order to estimate E_p(f(X)). The idea will be to arrange things so that the stationary distribution of the chain is π = p: if X_0 ∼ p (ie we start the chain in its stationary distribution) then X_t ∼ p for all t and we get some useful samples.

The 'obvious' estimator is

f̂ = (1/n) \sum_{t=0}^{n−1} f(X_t),

but we may be concerned that we are averaging correlated quantities. Also, if we always start the Markov chain in some fixed state, X_0 = x^{(0)} say, then we certainly don't have X_0 ∼ p. In your study of Markov chains last term you saw just the two theorems we need. The convergence theorem (Theorem 1.8.3 page 41 of Norris) gives us conditions for p^{(t)} → π, from any start, so if π = p we will get X_t → p in distribution.

However, this mode of convergence is not enough for us. The problem is that our estimator f̂ is getting 'polluted' by samples drawn in the early phase of the chain, when the distribution of X_t is still converging. The Ergodic theorem (Norris Section 1.10) covers this aspect.

Theorem 8.2. If {X_t}_{t=0}^{∞} is an irreducible and aperiodic Markov chain on a finite space of states Ω, with stationary distribution p, then, as n → ∞, for any bounded function f : Ω → R,

(1/n) \sum_{t=0}^{n−1} f(X_t) → E_p(f(X)).

The convergence is almost sure, which is stronger than, and implies, convergence in probability. In Part A probability the Ergodic theorem asks for a positive recurrent chain X_0, X_1, X_2, ...; the stated conditions are simpler here because we are assuming a finite state space for the Markov chain.

We would really like to have a CLT for f̂ formed from the Markov chain output, so we have confidence intervals ±\sqrt{var(f̂)} as well as the central point estimate f̂ itself. It is in fact the case that under mild conditions, which hold for almost all the MCMC simulations I have done, there is a CLT for f̂. However, these results are a little beyond us at this point.

So, we can form estimates of E_p(f) by averaging along states of a Markov chain which has p as its equilibrium distribution. There remains the problem of how large n must be for the guaranteed convergence to give a usefully accurate estimate. This problem, of assessing when the Markov chain simulation length is sufficiently large, does not have a simple honest answer, though there are some obvious necessary conditions we can check (eg repeat the entire simulation and check that independent estimates f̂ have an acceptably small scatter). We should check also that 'most' of the samples are not biased in any obvious way by the choice of X_0, the initial state.

8.2. Metropolis Hastings Markov chain Monte Carlo. The Metropolis Hastings Markov Chain Monte Carlo algorithm (or MCMC, for short) is an algorithm for simulating a Markov Chain with any given equilibrium distribution.

If we are given a pdf or pmf p then we may be able to simulate an iid sequence X_1, X_2, ..., X_n of r.v. satisfying n^{−1} \sum_i f(X_i) → E_p(f(X)) as n → ∞, using the Rejection algorithm.

In a similar way, if we are given a pdf or pmf p then we may be able to simulate a correlated sequence X_1, X_2, ..., X_n of r.v. (ie, a Markov chain) satisfying n^{−1} \sum_i f(X_i) → E_p(f(X)) as n → ∞, using the MCMC algorithm.

In each case convergence in probability is 'easily' established, whilst the more useful CLT 'usually' applies, but is harder to verify, at least in the MCMC case.

We will start with simulation of a rv X on a finite state space. Let p(x) = p̃(x)/Z_p be the pmf on the finite state space Ω = {1, 2, ..., m}. We will call p the (pmf of the) target distribution. This is the one we want to sample. Fix a 'proposal' transition matrix q(y|x). We will use the notation Y ∼ q(·|x) to mean Pr(Y = y|X = x) = q(y|x).

Theorem 8.3. The transition matrix P of the Markov chain generated by the following Metropolis-Hastings MCMC algorithm satisfies p = pP.

Algorithm 8.1. If X_t = x, then X_{t+1} is determined in the following way.
1. Let Y ∼ q(·|x) and U ∼ U(0, 1). Simulate Y = y and U = u.
2. If

u ≤ min( 1, p̃(y)q(x|y) / (p̃(x)q(y|x)) ),

set X_{t+1} = y; otherwise set X_{t+1} = x.

Proof: Look at the conditions in Theorem 8.1. Since p is a pmf, it is enough to check that the detailed balance relations, Eqn 8.1, are satisfied for all x, y ∈ Ω. The case x = y is trivial. Let

α(y|x) = min( 1, p̃(y)q(x|y) / (p̃(x)q(y|x)) ).

If we enter an MCMC update with X_t = x, then the probability to come out with X_{t+1} = y, for y ≠ x, is the probability to propose y at step 1 times the probability to accept it at step 2. The transition matrix P_{x,y} = Pr(X_{t+1} = y|X_t = x) for the Markov chain simulated by the algorithm is therefore

P_{x,y} = q(y|x) α(y|x).

Assume without loss of generality that p̃(x)q(y|x) ≥ p̃(y)q(x|y) (x and y can change labels if this is false). Now,

p(x)P_{x,y} = p(x)q(y|x) α(y|x)
            = p(x)q(y|x) × p̃(y)q(x|y) / (p̃(x)q(y|x))
            = p(y)q(x|y),

and

p(y)P_{y,x} = p(y)q(x|y) α(x|y)
            = p(y)q(x|y) × 1.

I used p(x)p̃(y)/p̃(x) = p(y) between the 2nd and 3rd lines, and p̃(x)q(y|x) ≥ p̃(y)q(x|y) to get the 2nd and final lines. Eqns 8.1 (and the conditions of Theorem 8.1) are thus satisfied, and so p is a stationary distribution of the Markov Chain simulated by Algorithm 8.1.
Corollary 8.4. If the Markov chain X_0, X_1, X_2, ... simulated by Algorithm 8.1 is irreducible and aperiodic, then, for any bounded function f : Ω → R,

(1/n) \sum_{t=0}^{n−1} f(X_t) → E_p(f(X)),

with convergence as for the Ergodic Theorem.

Proof: we have seen that p is stationary for {X_t}_{t=0}^{∞}, so the conditions of Theorem 8.2 are satisfied if in addition the {X_t}_{t=0}^{∞} are irreducible and aperiodic.

In order to run this Markov chain simulation we need to specify a start state
X0 = x0 and a proposal mechanism q(y|x). We then repeat steps 1 and 2 of
algorithm 8.1 to generate a sequence X0 , X1 , X2 , ..., Xn , and these are our correlated
samples distributed according to p (at least for large n when p(n) has converged to
p). Notice that when we come to compute the acceptance probability α(y|x), we
do not need the normalized expression for the pmf’s: α depends only on the ratio
p(y)/p(x) = p̃(y)/p̃(x).

We are left, by Corollary 8.4, to verify irreducibility and aperiodicity. The latter
is usually straightforward, since the MCMC algorithm may reject the candidate
state y, so that the transition matrix Px,y for the Markov chain satisfies Px,x > 0
for at least some states x ∈ Ω. In order to check irreducibility we need to check that
the proposal mechanism q can take us anywhere in Ω (so q itself is an irreducible
transition matrix), and then that the acceptance step doesn’t trap the chain (as
might happen if α(y|x) is zero too often).
Example 8.2. We will use MCMC to simulate X ∼ p with p ∝ (1, 2, ..., m). The normalising constant is Z = \sum_i i = m(m + 1)/2, the unnormalised mass function is p̃(x) = x, and p(x) = p̃(x)/Z. One simple proposal distribution is Y ∼ q, with q = (1, 1, ..., 1)/m the uniform distribution on 1 to m, which we write Y ∼ U{1, 2, ..., m}. This proposal scheme is clearly irreducible (we can get from A to B in a single hop). Here is an MCMC algorithm simulating X_t ∼ p. Start with some arbitrary starting point, for example, X_0 = 1.

Algorithm 8.2. If X_t = x then X_{t+1} is determined in the following way.
• Simulate Y ∼ U{1, 2, ..., m}. Simulate U ∼ U(0, 1).
• If U ≤ α(y|x) set X_{t+1} = y and otherwise set X_{t+1} = x.

Compute α:

α(y|x) = min( 1, p(y)q(x|y) / (p(x)q(y|x)) )
       = min( 1, (y × 1/m) / (x × 1/m) )
       = min( 1, y/x ).

Does this work on the computer?
Does this work on the computer?
#MCMC simulate X_t according to p=[1:m]/sum(1:m).
m<-30
pt<-1:m

n<-10000
X<-rep(NA,n)
X[1]<-1
for (t in 1:(n-1)) {
x<-X[t]
y<-ceiling(m*runif(1))
a<-min(1,pt[y]/pt[x])
U<-runif(1)
if (U<=a) {
X[t+1]<-y
} else {
X[t+1]<-x
}
}
plot(X[1:200],type="l",ann=F)
hist(X,-1:m+0.5,freq=F,ann=F); lines(1:m,pt/sum(pt),ann=F)
Figure 2. The figure at right shows the first 200 states visited by the MCMC algorithm. The figure at left is a histogram of the 10000 sampled states.

The code example gives MCMC simulation of a Markov chain with equilibrium distribution p = (1, 2, ..., 30)/465. The output is plotted in Figure 2.
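The scatter check suggested at the end of Section 8.1 is easy to carry out here: wrap the sampler above in a function and repeat the whole simulation. For p ∝ (1, ..., m) the exact answer is E_p(X) = (2m + 1)/3, about 20.33 for m = 30. A sketch:

#repeat the entire simulation; independent estimates of E_p(X) should agree
run_chain<-function(n=10000,m=30) {
  X<-rep(NA,n); X[1]<-1
  for (t in 1:(n-1)) {
    y<-ceiling(m*runif(1))
    X[t+1]<-ifelse(runif(1)<=min(1,y/X[t]),y,X[t])
  }
  mean(X) #estimate of E_p(X)=(2m+1)/3
}
est<-replicate(10,run_chain())
est; sd(est) #scatter should be small relative to the mean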

Example 8.3. We will use MCMC to simulate Poisson variates. Let p(x) = exp(−λ)λ^x/x! and X ∼ p. We need a proposal mechanism which will take us around the space Ω = {0, 1, 2, ...} of X. Anything irreducible will do. I will use the simplest thing I can think of,

q(y|x) = 1/2 if y = x ± 1, and 0 otherwise.

Given x, we generate y by tossing a coin and adding or subtracting one. Here is an MCMC algorithm simulating X_t ∼ Poisson(λ). Start with some arbitrary starting point, for example, X_0 = 0.

Algorithm 8.3. If X_t = x then X_{t+1} is determined in the following way.
• Simulate V ∼ U(0, 1), and set y = x + 1 if V > 1/2 and otherwise y = x − 1. Simulate U ∼ U(0, 1).
• If U ≤ α(y|x) set X_{t+1} = y and otherwise set X_{t+1} = x.
Compute α. There are two main cases. If y = x + 1 then

α(x + 1|x) = min( 1, p̃(y)q(x|y) / (p̃(x)q(y|x)) )
           = min( 1, (exp(−λ)λ^{x+1}/(x + 1)!) / (exp(−λ)λ^x/x!) × (1/2)/(1/2) )
           = min( 1, λ/(x + 1) ).

If y = x − 1 then

α(x − 1|x) = min( 1, x/λ ).

We must always check the behavior at the boundary of Ω. What happens if we step outside the set of allowed states? This would happen if x = 0 and we propose y = −1. The simplest way to deal with this is to define p(−1) = 0 (or in general p(y) = 0 for y ∉ Ω). Then

α(−1|0) = min( 1, p̃(−1)q(0|−1) / (p̃(0)q(−1|0)) )
        = min(1, 0)
        = 0,

so if we propose to leave the space, the acceptance probability is zero, so we reject the candidate and stay where we are. Does this work on the computer?
#MCMC simulate X_t according to a Poisson dbn of mean lambda=3.
lambda<-3
n<-10000
X<-rep(NA,n)
X[1]<-0
for (t in 1:(n-1)) {
x<-X[t]
y<-x+2*(runif(1)<=0.5)-1
if (y>x) {
a<-lambda/(x+1)
}
if (y<x & y>=0) {
a<-x/lambda
}
if (y<0) {
a<-0
}
U<-runif(1)
if (U<=a) {
X[t+1]<-y
}
else {
X[t+1]<-x
}
}
> X[1:10] #first 10 states visited
[1] 0 1 2 3 2 1 1 2 1 2
> mean(X) #expect = 3 (but no easy way to get a sd() as X[t] correlated)
[1] 3.0789
> var(X) #expect = 3 (same comment)
[1] 3.134788

[Figure: the MCMC state X[t] against the step counter t for the first 100 steps (left), and a histogram of the sampled states (right).]

8.3. MCMC for state spaces which are not finite. What of using MCMC for a target distribution on a space which is discrete but not finite, or using MCMC for a random variable which is continuous, and therefore has a pdf rather than a pmf? The theory we have covered doesn't treat these cases. Our Poisson example had an infinite state space. Of course, if I had truncated the distribution at the largest integer I can represent exactly on the computer, there would be no detectable change to the samples: since the MCMC never visited that boundary, it doesn't know it was there. I get the same samples I would have had anyway. This kind of approximation is implicit in almost all numerical work.

The question remains, how to treat continuous random variables? It is easy to see that the condition for irreducibility must take some other form for Markov chains on continuous spaces (the probability of hitting any particular state will be zero). A rigorous treatment of these issues is beyond us here. However, in some respects such an analysis must be irrelevant. In numerical work on the computer, continuous and unbounded random variables are approximated by discrete analogues, on a finite space. The real number line is broken up into cells, with x in the cell δx say. Because the precision is fixed at around 15 decimal places, the length |δx| of a cell depends on the x value, |δx|/x ≃ 10^{−15} for |x| ≫ |δx|. If p(x) and q(y|x) are densities then they are approximated on the computer by distributions Pr(X ∈ δx) ≃ p(x)|δx| and Pr(Y ∈ δy|X = x) ≃ q(y|x)|δy|. This is still approximate, as the return values of the functions p() and q() are rounded, and we are approximating integrals over the sets δx and δy, assuming the densities are constant to machine precision over these small sets. The Hastings ratio is then evaluated as

[ Pr(Y ∈ δy) Pr(X ∈ δx|Y = y) ] / [ Pr(X ∈ δx) Pr(Y ∈ δy|X = x) ] ≃ [ p(y)|δy| q(x|y)|δx| ] / [ p(x)|δx| q(y|x)|δy| ]
                                                                   = p̃(y)q(x|y) / (p̃(x)q(y|x)).

If we apply the Metropolis-Hastings ratio to densities, for continuous random variables, we expect to simulate the approximate distribution, in which the density is lumped into the rounding-sets δx. For this reason our discussion of Markov chains on finite spaces remains relevant.
Example 8.4. MCMC for a standard Normal distribution. Suppose we want to simulate the standard normal distribution X ∼ N(0, 1). The target density is

p̃(x) ∝ exp(−x^2/2).

I left off the factor 1/\sqrt{2π}, since it will cancel in the Hastings ratio. We will use a proposal density which allows us to tour R. We have a lot of freedom here, so this is simply something that 'will do'. Fix a constant a > 0 and choose a new point uniformly at random in a window of length 2a centred at x. The proposal density is

q(y|x) = (1/2a) I_{x−a<y<x+a}.

Now q(y|x) = q(x|y): the probability density to propose y given x is equal to the probability density to propose x given y.

Here is a Metropolis Hastings Markov Chain Monte Carlo algorithm simulating X_t ∼ N(0, 1).

Algorithm 8.4. Start with an arbitrary point, X_0 = 0 say. If X_t = x then X_{t+1} is determined in the following way.
• Simulate V ∼ U(0, 1) and set Y = x + (2V − 1)a. Simulate U ∼ U(0, 1).
• If U ≤ α(y|x) set X_{t+1} = y and otherwise set X_{t+1} = x. Here

α(y|x) = min( 1, p̃(y)q(x|y) / (p̃(x)q(y|x)) )
       = min( 1, exp(−y^2/2 + x^2/2) ).

We don't have to worry about the process leaving the state space, since the state space is the whole of R.
#MCMC simulate X~N(0,1)
n<-10000
X<-matrix(NA,1,n)
X[1]<-0
a<-3

for (t in 1:(n-1)) {
x<-X[t]
y<-x+(2*runif(1)-1)*a
U<-runif(1)
MHR<-exp( (x^2-y^2)/2 )
alpha<-min(1,MHR)
if (U<=alpha) {
X[t+1]<-y
}
else {
X[t+1]<-x
}
}
> mean(X)

> var(X)

Note that I chose a = 3 to generate graphs (a) and (b) in Figure 3. Any positive value would be technically correct (irreducible in the discrete sense), but values much less than, or much greater than, the standard deviation of X (which is one) would be inefficient. In both cases the chain moves very slowly through the space. The ergodic theorem still applies, but convergence to equilibrium is slow. In the first case (illustrated with a = 0.3 in Figure 3(c)) the chain nearly always accepts proposals, but moves just a short distance. In the latter case (a = 30 in Figure 3(d)), the chain hardly ever accepts, so it stays in the same state for a very long time.
Example 8.5. Here is an example of MCMC for a distribution with more than one variable: MCMC for a bivariate Normal distribution. Suppose we want to simulate a bivariate normal distribution X ∼ N(µ, Σ), with µ = (1, 1)^T and Σ_{i,i} = var(X_i) = 3, Σ_{1,2} = cov(X_1, X_2) = −2. The target density is

p̃(x_1, x_2) ∝ exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) ),   x = (x_1, x_2)^T,   Σ = ( 3  −2 ; −2  3 ),

where I have omitted the factor 1/(2π\sqrt{det(Σ)}), since it will cancel in the Hastings ratio. This time we need to tour R^2. Inspired by the last example, fix a constant a > 0 and make random jumps of size up to a along the two axes:

q(y|x) = (1/4a^2) I_{x_1−a≤y_1≤x_1+a, x_2−a≤y_2≤x_2+a},

ie the uniform density in a box of side 2a centred at x with sides aligned to the axes. Now q(y|x) = q(x|y): the probability density to propose y given x is equal to the probability density to propose x given y (if y is in reach of x then x is in reach of y, so both densities equal 1/4a^2).

Here is a Metropolis Hastings Markov Chain Monte Carlo algorithm simulating X_t ∼ N(µ, Σ).

Algorithm 8.5. Start with an arbitrary point, X_0 = (0, 0)^T say, in R^2. If X_t = x then X_{t+1} is determined in the following way.
• Simulate V_1, V_2 ∼ U(0, 1) iid and set Y_1 = x_1 + (2V_1 − 1)a and Y_2 = x_2 + (2V_2 − 1)a. Simulate U ∼ U(0, 1).
Figure 3. MCMC for the standard normal distribution of Example 8.4. (a) The first 1000 steps, a = 3. (b) A histogram of the samples. (c) As (a) but a = 0.3. (d) As (a) but a = 30.

• If U ≤ α(y|x) set X_{t+1} = y and otherwise set X_{t+1} = x. Here

α(y|x) = min( 1, p̃(y)q(x|y) / (p̃(x)q(y|x)) )
       = min( 1, exp( −(y − µ)^T Σ^{−1} (y − µ)/2 + (x − µ)^T Σ^{−1} (x − µ)/2 ) ).

Again, the process can't leave the state space, as the state space is all of R^2.

#MCMC simulate bivariate Normal


mu<-c(1,1)
(S<-matrix(c(3,-2,-2,3),2,2))
(Si<-solve(S))

n<-10000
X<-matrix(c(NA,NA),2,n)
X[,1]<-c(0,0)
a<-3

for (t in 1:(n-1)) {
x<-X[,t]
y<-x+(2*runif(2)-1)*a
U<-runif(1)
MHR<-exp( -t(y-mu)%*%Si%*%(y-mu)/2+t(x-mu)%*%Si%*%(x-mu)/2 )
alpha<-min(1,MHR)
if (U<=alpha) {
X[,t+1]<-y
}
else {
X[,t+1]<-x
}
}
> apply(X,1,mean)
[1] 1.037171 0.963198
> cov(t(X))
[,1] [,2]
[1,] 3.064267 -2.006450
[2,] -2.006450 2.971347
[Figure: the first 100 steps (left) and the first 2000 points visited by the MCMC (right), plotted in the (X_1(t), X_2(t)) plane.]

Note that I chose a = 3 for similar reasons to the choice in Example 8.4.

8.4. MCMC and conditional distributions. How about simulating conditional distributions? This is a real strength of MCMC. The conditional density p(x|X ∈ B), for some set B a subset of the state space of X, is p(x|X ∈ B) = p(x)/Pr(X ∈ B) for x ∈ B and 0 otherwise. It follows that p̃(x|X ∈ B) = p̃(x) for x ∈ B and p̃(x|X ∈ B) = 0 for x ∉ B. In other words, the chain simulating the conditional distribution has just the same MH ratio as the unconditioned chain, for x, y both in B, and if y falls outside B we simply reject it. We should check that rejecting candidates in this way does not cost us irreducibility within B, but otherwise, things are straightforward.
Example 8.6. Suppose we want to simulate the bivariate normal distribution X ∼ N(µ, Σ) of Example 8.5, but conditioned on |X − (3, 3)| < 1. The MCMC algorithm is

Algorithm 8.6. Start with X_0 = (3, 3)^T, so the start state satisfies |X − (3, 3)| < 1. If X_t = x then X_{t+1} is determined in the following way.
• Simulate V_1, V_2 ∼ U(0, 1) iid and set Y_1 = x_1 + (2V_1 − 1)a and Y_2 = x_2 + (2V_2 − 1)a. Simulate U ∼ U(0, 1).
• If |Y − (3, 3)| < 1 and U ≤ α(y|x), set X_{t+1} = y; otherwise set X_{t+1} = x. Here

α(y|x) = min( 1, exp(−(y − µ)^T Σ^{−1} (y − µ)/2) / exp(−(x − µ)^T Σ^{−1} (x − µ)/2) )

is unchanged from Example 8.5. We could have put p̃(y)I_{|y−(3,3)|<1} in the numerator of the acceptance probability itself; however, what we have done amounts to the same thing.

The code is the same as before but with a different initialization (X0 inside the
circle) and a very slightly altered test condition:
...
X[,1]<-c(3,3) #X_0 at center of allowed region
a<-1 #reduce proposal size - less falls outside circle
...
for (t in 1:(n-1)) {
x<-X[,t]
y<-x+(2*runif(2)-1)*a
U<-runif(1)
MHR<-exp( -t(y-mu)%*%Si%*%(y-mu)/2+t(x-mu)%*%Si%*%(x-mu)/2 )
alpha<-min(1,MHR)
if (U<=alpha & sum((y-c(3,3))^2)<1 ) {
X[,t+1]<-y
}
else {
X[,t+1]<-x
}
}
The samples are constrained to lie within 1 unit of (3, 3), but are otherwise distributed according to the bivariate normal. The results are illustrated in Figure 4.

Figure 4. Open circles: X ∼ N(µ, Σ). Dots: X | (X − (3, 3))^2 < 1.

Exercise: How would you use samples distributed X ∼ N(µ, Σ) in order to simulate X | (X − (3, 3))^2 < 1 via rejection? What weakness does this approach have?

Statistics Department
E-mail address: [email protected]
