Part A Mathematics and Statistics, Simulation and Statistical Programming: Simulation Lectures
GEOFF NICHOLLS
Contents

1. Organisation
Aims and Objectives
Synopsis
Course Structure
1.1. R
1.2. Classes
Texts
2. Introduction
3. Inversion
4. Transformation methods
5. Multivariate Normal
6. Rejection
7. Importance sampling
7.1. Importance Sampling (I)
7.2. Importance Sampling (II)
8. Markov Chain Monte Carlo
8.1. Markov chains
8.2. Metropolis Hastings Markov chain Monte Carlo
8.3. MCMC for state spaces which are not finite
8.4. MCMC and conditional distributions
1. Organisation
Aims and Objectives. Building on Part A probability and Mods statistics, this course introduces Monte Carlo methods, collectively one of the most important toolkits for modern statistical inference. In parallel, students are taught programming in R, a programming language widely used in statistics. Lectures alternate between Monte Carlo methods and Statistical Programming, so that students learn to program by writing simulation algorithms.

Synopsis. (1) Simulation: Transformation methods. Rejection sampling, including a proof for a scalar random variable. Importance Sampling; unbiased and consistent IS estimators. MCMC, including the Metropolis-Hastings algorithm. (2) Statistical Programming: Numbers, strings, vectors, matrices, data frames and lists, and ...
1.1. R. R is a high quality open source software package for statistical computing. We will use R for the Statistical Programming segment. You may find it useful for checking your understanding of the applied probability we use in simulation. You will find it convenient to install R on your own computer. The software, together with manuals and introductory tutorials, is available from

http://www.r-project.org/

Problem sheets in classes will include a separate section with some optional examples of simulation using R.
1.2. Classes. There are four classes this term: 4-5pm on Tuesdays in weeks 3 and 7, and 10-11am on Fridays in weeks 5 and 8.

Texts. The following texts have a large overlap with the course.

W.J. Braun and D.J. Murdoch, "A First Course in Statistical Programming with R". CUP 2007.
C.P. Robert and G. Casella, "Introducing Monte Carlo Methods with R". Springer 2010.
Geoff Nicholls
[email protected]
2. Introduction
3. Inversion
This is the most basic method. It is used to simulate scalar random variables with fairly simple pmfs or pdfs, since we need the cdf of the random variable in an invertible form.

Let X be a scalar random variable with cumulative distribution function (cdf) F(x) = Pr(X ≤ x) at X = x in some space of states Ω. Let $F^{-1}(u)$ be the smallest value of x such that F(x) is greater than or equal to u. If F is continuous and strictly increasing then this is just the inverse of F. Simulation by the method of inversion exploits the fact that if U ∼ U(0, 1) and $X = F^{-1}(U)$ then X ∼ F (meaning, the rv X is distributed with cdf F). This follows (for the simple case) from Pr(X ≤ x) = Pr(F^{-1}(U) ≤ x) = Pr(U ≤ F(x)) = F(x).
Example 3.1. If we want X ∼ Exp(r) (ie X ∼ f_X with f_X(x) = r exp(−rx)) then F(x) = 1 − exp(−rx) and

$$F^{-1}(u) = -(1/r)\log(1-u).$$

Since 1 − U ∼ U(0, 1) when U ∼ U(0, 1), we may equivalently set X = −log(U)/r.
The algorithm is
Algorithm 3.1.
U ∼ U (0, 1)
X ← − log(U )/r
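In R we can vectorize over n independent draws (a minimal sketch; the values of n and r are assumptions for the check below):

n<-100000; r<-2
U<-runif(n)
X<- -log(U)/r #X~Exp(r) by inversion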
#check
mean(X) #should equal 1/r with a sd.dev by CLT
sd(X)/sqrt(n) #or in this case about sqrt(1/n)/r
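The next check refers to a discrete random variable X with Pr(X = 1) = p; a minimal sketch, assuming the intended example was inversion for X ∼ Bernoulli(p):

p<-0.3; n<-100000
U<-runif(n)
X<-as.numeric(U>1-p) #inversion: F^{-1}(u)=0 for u<=1-p, and 1 otherwise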
#check
mean(X==1) #should equal p with a sd.dev by CLT
sd(X==1)/sqrt(n) #or in this case about sqrt(p*(1-p)/n)
4. Transformation methods
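The check below refers to a Gamma variate with mean a/b = 14; a minimal sketch, assuming the example simulated X ∼ Gamma(a, b) as a sum of a iid Exp(b) variates (the construction used again in Section 6), with the assumed values a = 7 and b = 0.5 so that a/b = 14:

a<-7; b<-0.5 #assumed values chosen so a/b=14 matches the check
n<-100000
U<-matrix(rexp(a*n),a,n) #a x n matrix of Exp(1) variates
X<-apply(U,2,'sum')/b    #column sums, scaled by 1/b: X~Gamma(a,b)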
#check
mean(X) #should equal a/b=14 with a sd.dev by CLT
sd(X)/sqrt(n) #or in this case about sqrt(a/n)/b
The transformation method is used in the Box-Muller algorithm for simulation of normal random variables. Since, for any particular application, very large numbers of rvs may need to be simulated, there is a great deal of emphasis on computational speed and efficiency in the design of simulation algorithms. Variations on the Box-Muller algorithm give the fastest simulation algorithms for normal random variables in many applications of practical interest. The algorithm is based on the following observation.
Example 4.2. If U₁ and U₂ are independent U(0, 1) rvs, and

$$X_1 = \sqrt{-2\log(U_1)}\,\cos(2\pi U_2), \qquad X_2 = \sqrt{-2\log(U_1)}\,\sin(2\pi U_2),$$

then X₁ and X₂ are independent (!) standard normal random variables.
Proof: think of (X₁, X₂) as a random point in the plane. In polar coordinates this point has radius $R = \sqrt{-2\log(U_1)}$ (so R² ∼ Exp(0.5)) and angle Θ = 2πU₂
(with Θ ∼ U(0, 2π)). In order to get the joint density f_{X₁,X₂}(x, y) of X₁ and X₂ we make a change of variables from Θ, R² (notice, not Θ, R, since it is R² for which we have the distribution) to X₁, X₂:

$$f_{X_1,X_2}(x,y) = f_{R^2,\Theta}(r^2,\theta)\left|\frac{\partial(r^2,\theta)}{\partial(x,y)}\right|.$$

Now x = r cos(θ), y = r sin(θ), so ∂θ/∂x = −y/r² etc. and

$$f_{X_1,X_2}(x,y) = \frac{1}{2}\exp(-r^2/2) \times \frac{1}{2\pi} \times \left|\det\begin{pmatrix} 2x & -y/r^2 \\ 2y & x/r^2 \end{pmatrix}\right|.$$

The determinant is 2(x² + y²)/r² = 2, so that f_{X₁,X₂}(x, y) = exp(−x²/2 − y²/2)/(2π).
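In R (a minimal sketch; note that each pass produces 2n normal variates, which is why the check below divides by sqrt(2*n)):

n<-100000
U1<-runif(n); U2<-runif(n)
X<-c( sqrt(-2*log(U1))*cos(2*pi*U2), sqrt(-2*log(U1))*sin(2*pi*U2) ) #2n N(0,1) draws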
#check
mean(X) #should equal 0 with a sd.dev by CLT
sd(X)/sqrt(2*n) #or in this case about 1/sqrt(2*n)
Note: see Ross, page 80, sec 5.3, for a strategy for avoiding the sin and cos evaluations (Problem 2.8, page 63, of Robert and Casella), and the Box-Muller algorithm itself at the bottom of page 81.
5. Multivariate Normal
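Section 6 below refers to a transformation converting independent standard normal variates into correlated multivariate normal random variables. A minimal sketch of the standard construction via the Cholesky factorization (assumed here, with µ and Σ borrowed from Example 8.5 below): if Σ = LLᵀ with L lower triangular, and Z is a vector of iid N(0, 1) variates, then X = µ + LZ ∼ N(µ, Σ), since cov(LZ) = L cov(Z) Lᵀ = Σ.

#multivariate normal via the Cholesky factorization (sketch)
mu<-c(1,1)
Sigma<-matrix(c(3,-2,-2,3),2,2)
L<-t(chol(Sigma)) #chol() returns upper-triangular R with t(R)%*%R=Sigma
Z<-rnorm(2)       #independent standard normals
X<-mu+L%*%Z       #X~N(mu,Sigma)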
6. Rejection
Inversion is restricted to the univariate case. Also, we need to have the cdf of the target distribution in a form that makes it at least numerically easy to invert. We found a transformation to convert independent standard normal variates into (correlated) multivariate normal random variables. That worked because the normal distribution is 'special'. We can't rely on finding a suitable transformation for any given distribution.
We will start with simulation of X a discrete rv. The following sentence may
sound familiar. Suppose that for x ∈ Ω we have a probability mass function p(x)
(the target distribution) which we want to sample, and another pmf q(x) defined
on the same space which we can sample.
Theorem 6.1. Suppose we can find a constant M satisfying M ≥ p(x)/q(x) for
all x ∈ Ω. The following ‘Rejection algorithm’ returns X ∼ p.
Algorithm 6.1.
1 Let Y ∼ q and U ∼ U (0, 1). Simulate Y = y and U = u.
2 If u ≤ p(y)/(M q(y)) then stop and return X = y, and otherwise, start
again at 1.
Since this is a rather important idea, we will look at a couple of ways of proving this result. The second proof is short but hides some subtleties. The first is more explicit.
Proof (1): Let Pr(X = i) give the probability for the value of X returned by the algorithm to equal i. I will partition on the number of times through the loop. In order to end up with X = i we could draw Y = i at the first step and accept it, or we could reject whatever was drawn at the first pass, and then at the second pass draw Y = i and accept it, and so on. Events at each pass through the loop are independent of events in other passes, so
$$\Pr(X = i) = \sum_{n=1}^{\infty} \Pr(\text{reject } n-1 \text{ times, then draw } Y = i \text{ and accept it})$$
$$(6.1)\qquad\qquad = \sum_{n=1}^{\infty} \Pr(\text{reject } Y)^{n-1}\, \Pr(\text{draw } Y = i \text{ and accept it}).$$
At a particular pass through the loop the probability we draw Y = i and accept it is q(i) × p(i)/(Mq(i)) = p(i)/M, so the probability we accept some value is $\sum_j q(j)p(j)/(Mq(j)) = 1/M$, and the probability we reject is 1 − 1/M. Substituting into Eqn 6.1 and summing the geometric series,

$$\Pr(X = i) = \sum_{n=1}^{\infty}\left(1-\frac{1}{M}\right)^{n-1}\frac{p(i)}{M} = p(i),$$

as required.

Proof (2): the returned value X is distributed as Y conditioned on acceptance, so Pr(X = i) = Pr(Y = i, accept)/Pr(accept) = [q(i)p(i)/(Mq(i))]/(1/M) = p(i).

Example. Consider simulating X ∼ Beta(α, β), with density p(x) = x^{α−1}(1−x)^{β−1}/Z_p on 0 < x < 1, where

$$Z_p = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$
is a normalising constant. In order to apply the rejection algorithm we need to find an envelope probability density q(x) and a constant M ≥ p(x)/q(x) for all 0 < x < 1.

When α and β are both greater than or equal to one, p(x) is bounded (and otherwise not). Since p(x) has support on [0, 1] and is bounded, we can simply use the uniform density q(x) = 1, ie the constant function, as our envelope, and M = max_{0<x<1} p(x) to ensure M ≥ p(x)/q(x) for all 0 < x < 1.
Now, p′(x*) = 0 is satisfied by x* = (α − 1)/(α + β − 2), so M = p(x*), ie

$$M = \frac{1}{Z_p}\left(\frac{\alpha-1}{\alpha+\beta-2}\right)^{\alpha-1}\left(\frac{\beta-1}{\alpha+\beta-2}\right)^{\beta-1},$$

will do for the bound, and our acceptance probability at step 2 of Alg. 6.1 is

$$p(y)/(Mq(y)) = \frac{y^{\alpha-1}(1-y)^{\beta-1}}{MZ_p} = \frac{y^{\alpha-1}(1-y)^{\beta-1}}{M'}$$

where

$$M' = \left(\frac{\alpha-1}{\alpha+\beta-2}\right)^{\alpha-1}\left(\frac{\beta-1}{\alpha+\beta-2}\right)^{\beta-1}.$$
The point is that the factor Zp cancels with a corresponding factor of Zp in the
expression for M so we needn’t work it out. This kind of simplification will always
occur, as we shall see.
Larger M would have given a correct algorithm - can you think why we want the
minimum upper bound?
#Rejection, Beta(a,b) for both a,b>=1
my_big_ab_beta<-function(a=1,b=1) {
  #simulate an X~Beta(a,b) variate; defaults to U(0,1)
  if (a<1 || b<1) stop('a<1 or b<1')
  M<-(a-1)^(a-1)*(b-1)^(b-1)*(a+b-2)^(2-a-b) #M' = max of unnormalised density
  finished<-FALSE
  while (!finished) {
    Y<-runif(1) #propose Y~U(0,1)
    U<-runif(1)
    accept_prob<-Y^(a-1)*(1-Y)^(b-1)/M
    finished<-(U<accept_prob)
  }
  X<-Y
  X
}
my_big_ab_beta(a=1.5,b=2.5)

a<-1.5; b<-2.5
n<-100000
X<-rep(NA,n) #clear X before the loop - its entries are NA
for (i in 1:n) {
  X[i]<-my_big_ab_beta(a,b)
}
mean(X);a/(a+b) #should equal a/(a+b) with a sd.dev by CLT
sd(X)/sqrt(n) #or in this case about sqrt(a*b/(a+b+1)/n)/(a+b)
This gets inefficient when a or b is large (the beta density is then sharply peaked, so the ratio p/(Mq) is typically very small and it takes many trials to get an acceptance).
Exercise: when I first wrote the function my_big_ab_beta() above, I forgot to
divide by M in the line accept_prob<-Y^(a-1)*(1-Y)^(b-1)/M but everything
still seemed to be correct (the means and variances were right). Can you explain
why the algorithm was still correct, and why I am better off with M in place, as
above?
Example. Consider simulating X ∼ Gamma(α, β), with unnormalised density p̃(x) = x^{α−1} exp(−βx) on x > 0. For an envelope we can take a Gamma(a, b) density with a ≥ 1 an integer, which we can simulate as a sum of a iid Exp(b) variates; so

$$\tilde q(x) = x^{a-1}\exp(-bx)$$

will do as our unnormalised envelope function. The problem then is to bound the ratio

$$\tilde p/\tilde q = x^{\alpha-a}\exp(-(\beta-b)x).$$

Is this bounded? Consider (a) x → 0 and (b) x → ∞. For (a) we need a ≤ α, so a = ⌊α⌋ is fine. For (b) we need b < β (not b = β, since we need the exponential to kill off the growth of x^{α−a}).
Given that we have chosen a and b so the ratio is bounded, we should now compute the bound. Now d(p̃/q̃)/dx = 0 at x = (α − a)/(β − b) (and this must be a maximum at x ≥ 0 under our conditions on a and b), so p̃/q̃ ≤ M for all x ≥ 0 if

$$M = \left(\frac{\alpha-a}{\beta-b}\right)^{\alpha-a}\exp(-(\alpha-a)).$$

We will accept Y at step 2 of Alg. 6.1 if U ≤ Y^{α−a} exp(−(β − b)Y)/M.
Exercise: how to choose b? Any 0 < b < β will do, but is there a best choice? One idea would be to choose b to minimize the expected number of Y-simulations per sample X output. Since the number of trials, N say, is Geometric, with success probability Z_p/(MZ_q), the expected number of trials is E(N) = Z_qM/Z_p. Now Z_p = Γ(α)β^{−α}, where Γ is the Gamma function related to the factorial. Show that the optimal b solves d(b^{−a}(β − b)^{−α+a})/db = 0, so use b = β(a/α) for best results.
#Rejection, Gamma(a,b)
my_gamma<-function(a=1,b=1) {
  #simulate an X~Gamma(a,b) variate; defaults to Exp(1)
  if (a<1 || b<=0) stop('a<1 or b<=0')
  aq<-floor(a)
  bq<-b*(aq/a) #best choice, but any 0<bq<b is OK
  del_a<-a-aq
  del_b<-b-bq
  finished<-FALSE
  while (!finished) {
    Y<-sum(rexp(aq))/bq #Y~Gamma(aq,bq)
    U<-runif(1)
    accept_prob<-(Y*del_b/del_a)^del_a*exp(-del_b*Y+del_a)
    finished<-(U<accept_prob)
  }
  X<-Y
  X
}
my_gamma(a=7.5,b=0.5)

a<-7.5; b<-0.5
n<-100000
X<-rep(NA,n)
for (i in 1:n) {
  X[i]<-my_gamma(a,b)
}
mean(X) #should equal a/b with a sd.dev by CLT
sd(X)/sqrt(n) #or in this case about sqrt(a/n)/b
Exercise: modify the function my_gamma() to return the number of trials needed in
order to generate X, and check that the mean number of trials is M Zq /Zp .
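One possible solution sketch (the function name my_gamma_trials and the list return value are my own choices):

my_gamma_trials<-function(a=1,b=1) {
  #as my_gamma(), but also return the number of trials
  aq<-floor(a); bq<-b*(aq/a)
  del_a<-a-aq; del_b<-b-bq
  ntrials<-0; finished<-FALSE
  while (!finished) {
    ntrials<-ntrials+1
    Y<-sum(rexp(aq))/bq
    finished<-(runif(1)<(Y*del_b/del_a)^del_a*exp(-del_b*Y+del_a))
  }
  list(X=Y,ntrials=ntrials)
}
a<-7.5; b<-0.5; aq<-floor(a); bq<-b*aq/a
M<-((a-aq)/(b-bq))^(a-aq)*exp(-(a-aq))
Zq<-gamma(aq)*bq^(-aq); Zp<-gamma(a)*b^(-a)
ntr<-replicate(10000,my_gamma_trials(a,b)$ntrials)
mean(ntr); M*Zq/Zp #the two should agree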
Notice that it is important to find q which is heavy-tailed compared to p, so that p(x)/q(x) goes to zero as x tends to either end of the support of p (and of course the ratio also has to be bounded throughout the support, but that is usually the easy bit).
7. Importance sampling
Importance sampling is, among other things, a strategy for recycling samples. It is also useful when we need an accurate estimate of the probability that a random variable exceeds some very high threshold: the naive estimator (the proportion of samples over the threshold) has a high variance relative to its mean, and in this context importance sampling is referred to as a variance reduction technique.
There is a slight variation on the basic set up: we can generate samples distributed
according to q but we want to estimate an expectation that depends on p (before
it was “but we want samples distributed according to p”), so we want to estimate
Ep (f (X)) for some function f . In importance sampling we avoid sampling the
target distribution p. Instead, we take samples distributed according to q and
reweight them.
7.1. Importance Sampling (I). Here is the key idea. Let Y_i ∼ q, i = 1, 2, ..., n, be iid continuous random variables on Ω with density q. We will require p(x) > 0 ⇒ q(x) > 0 for all x ∈ Ω (which is weaker than p/q bounded, as in rejection). Then

$$\bar f = \frac{1}{n}\sum_{i=1}^{n} \frac{p(Y_i)}{q(Y_i)}\,f(Y_i)$$

is an unbiased estimator for E_p(f(X)), since

$$E_q(\bar f) = \frac{1}{n}\sum_{i=1}^{n} E_q\!\left(\frac{p(Y_i)}{q(Y_i)}\,f(Y_i)\right) = E_q\!\left(\frac{p(Y_1)}{q(Y_1)}\,f(Y_1)\right) = \int_\Omega f(y)\,\frac{p(y)}{q(y)}\,q(y)\,dy = \int_\Omega p(y)f(y)\,dy = E_p(f(X)).$$
All this works in a very similar way in the case where X and the Y_i are discrete rvs.

We call w(x) = p(x)/q(x) the weight function and w_i = w(y_i) the weights for importance sampling estimates of expectations in p using samples Y_i = y_i from q, and write

$$(7.1)\qquad \bar f = \frac{1}{n}\sum_{i=1}^{n} w_i f(y_i).$$
Example 7.1. As in the rejection example above, take target X ∼ Gamma(α, β) and proposal Y ∼ Gamma(a, b), with a = ⌊α⌋ so that Y is easy to simulate, and suppose we want to estimate E_p(f(X)). Now p/q is

$$\frac{p(x)}{q(x)} = \frac{\Gamma(a)\beta^{\alpha}}{\Gamma(\alpha)b^{a}}\,x^{\alpha-a}e^{-(\beta-b)x},$$

so the weights are

$$w_i = \frac{\Gamma(a)\beta^{\alpha}}{\Gamma(\alpha)b^{a}}\,Y_i^{\alpha-a}e^{-(\beta-b)Y_i},$$

and we estimate E_p(f(X)) via $\bar f$ as in Eqn 7.1.
#in this example f(x)=x, X~Gamma(7.8,2), Y~Gamma(7,1) and we use Importance sampling
#to estimate E(f(X))
alpha<-7.8
beta<-2
n<-100000
a<-floor(alpha)
b<-1 #Exercise: what is the optimal choice of b here?
U<-matrix(rexp(a*n),a,n)
Y<-apply(U,2,’sum’)/b
exp_Y1<-mean(Y*Y^(alpha-a)*exp(-(beta-b)*Y)*(gamma(a)/gamma(alpha))*(beta^alpha/b^a))
#compare estimate (exp_Y1) and exact result (E(X)=alpha/beta) - TODO check variance
exp_Y1
alpha/beta
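A minimal addition addressing the TODO above: by the CLT, the standard error of the IS estimate can be estimated by

w<-Y^(alpha-a)*exp(-(beta-b)*Y)*(gamma(a)/gamma(alpha))*(beta^alpha/b^a) #IS weights
sd(w*Y)/sqrt(n) #estimated std. error of exp_Y1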
We now consider the variance of $\bar f$. We assume E_p(f(X)) and var_p(f(X)) are finite. For a function h : Ω → R and Y ∼ q, let var_q(h(Y)) = E_q(h(Y)²) − E_q(h(Y))². Since the Y_i are iid,

$$(7.2)\qquad \mathrm{var}_q(\bar f) = \frac{1}{n}\,\mathrm{var}_q\!\left(\frac{p(Y_1)}{q(Y_1)}f(Y_1)\right) = \frac{1}{n}\left[E_q\!\left(\frac{f(Y_1)^2\,p(Y_1)^2}{q(Y_1)^2}\right) - E_p(f(X))^2\right].$$
Notice that the setup allows us to choose q: while p is given, q needs to cover p (p > 0 ⇒ q > 0) and be simple to sample. The requirement that the variance of the estimator be finite further constrains our choice. Referring to Eqn 7.2, we need $E_p(f^2 p/q) = E_q(f^2 p^2/q^2)$ to be finite. If var_p(f) is known finite then (as we saw in Example 7.1) it may be easy to get a sufficient condition for var_q($\bar f$) to be finite. Further analysis will depend on the details of f.
What is the choice of q that actually minimizes the variance of the importance sampling (I.S.) estimator? Look at Eqn 7.2. Suppose for the moment f > 0. The variance of $\bar f$ cannot be negative, but we can make it zero by the choice q(x) = p(x)f(x)/E_p(f). Zero variance estimators! In practice we are not able to implement this choice, since the weight function is w(x) = p/q = E_p(f)/f(x), but E_p(f) is the thing we are trying to estimate anyway, so we can't compute the weights in Eqn 7.1.
One important class of application of IS is to problems in which we estimate the probability of a rare event. In this case we may be able to sample p directly, but it doesn't help us. If, for example, X ∼ p with Pr(X > x) = δ, say, with δ very small, we may not get any samples X_i > x, and our estimate $\hat\delta = \frac{1}{n}\sum_i I_{X_i>x}$ is simply zero. By distorting the proposal distribution we can actually reduce the variance of our (IS) estimator.
Example 7.3. Let X ∼ N(µ, σ²) be a scalar normal rv. If we would like to estimate δ = Pr(X > x) for some x much larger than 3σ, we could exponentially tilt the density of X towards larger values so that we get some samples in the target region, and then allow for our tilting via an IS estimator. If p(x) is the density of X then q(x) = p(x) exp(tx)/M_p(t) is called a tilted density of p (see eg Ross, section 8.6, page 185). Here the normalizing constant M_p(t) = E_p(exp(tX)) happens to be the moment generating function of X.

For p the normal density,

$$\exp(-(x-\mu)^2/2\sigma^2)\exp(tx) = \exp(-(x-\mu-t\sigma^2)^2/2\sigma^2)\exp(\mu t + t^2\sigma^2/2),$$

so the normalized tilted density is

$$q(x) = (2\pi\sigma^2)^{-1/2}\exp(-(x-\mu-t\sigma^2)^2/2\sigma^2),$$

normal with mean µ + tσ² and variance σ². For the normal, M_p(t) = exp(µt + t²σ²/2). The IS weight function is p/q = exp(−tx)M_p(t), so

$$w(x) = \exp(-t(x-\mu-t\sigma^2/2)).$$
We take samples Y_i ∼ N(µ + tσ², σ²), compute w_i = w(Y_i), and form our IS estimator for δ = Pr(X > x) according to Eqn 7.1,

$$\hat\delta = \frac{1}{n}\sum_{i=1}^{n} w_i\, I_{Y_i>x},$$

since f(Y_i) here is $I_{Y_i>x}$.
We haven't said how to choose t. The point here is that we want samples in the region of interest. I will choose the mean of the tilted distribution so that it equals x; then I am sure to have samples in the region of interest. So I choose t so that µ + tσ² = x, that is, t = (x − µ)/σ². The densities p and q and the threshold used in the example below are depicted in the figure following the code.
####################################################
#IS - normal example, method I, Pr(Z>4) Z~N(0,1)
#note I chose x=4 so we could check using the naive method
#for x=6 say the advantage is even more dramatic
sigma<-1
mu<-0
x<-4
n<-1000000
[Figure: the normal density p, the tilted density q, and the threshold x = 4.]
Z<-rnorm(n)
#naive estimator
(plain_est<-mean(Z>x))
#Y are samples from the tilted distribution (weighted towards region of interest)
t<-(x-mu)/sigma^2
Y<-mu+t*sigma^2+sigma*rnorm(n)
#IS-estimator corrects for weighting
(IS_est<-mean( (Y>x)*exp(-(Y-mu)^2/(2*sigma^2)+(Y-mu-t*sigma^2)^2/(2*sigma^2))))
(IS_std<-sd( (Y>x)*exp(-(Y-mu)^2/(2*sigma^2)+(Y-mu-t*sigma^2)^2/(2*sigma^2)))/sqrt(n))
#ran above got 0.00003170 +/- 0.00000007 about 0.2% error
Exercise: how to choose t? Show that the t-value that minimizes the variance minimizes $E_p(f^2 p/q) = M_p(t)\int_x^{\infty} p(z)\exp(-tz)\,dz$ over t, and hence obtain an equation for the optimal tilt t.
7.2. Importance Sampling (II). Suppose now that we know the target and proposal densities only up to their normalising constants: p = p̃/Z_p and q = q̃/Z_q, with p̃ and q̃ known but Z_p and Z_q unknown. Let w̃(x) = p̃(x)/q̃(x) denote the unnormalised weight function. Since E_q(w̃(Y)) = Z_p/Z_q, we have

$$(7.3)\qquad E_p(f(X)) = \frac{E_q(\tilde w(Y)f(Y))}{E_q(\tilde w(Y))},$$

which suggests estimating the numerator and denominator separately: set $a = \frac{1}{n}\sum_i \tilde w(Y_i)f(Y_i)$ and $b = \frac{1}{n}\sum_i \tilde w(Y_i)$, and form the ratio estimator $\bar f = a/b$. Notice that E_q(a) = E_q(w̃f) and E_q(b) = E_q(w̃), so these two estimators are unbiased, even though $\bar f$ itself may be biased.
An estimator (such as $\bar f$) is consistent for E_p(f) if it tends in probability to E_p(f) as n → ∞. That means that for each ε > 0,

$$\lim_{n\to\infty}\Pr(|\bar f - E_p(f)| > \varepsilon) = 0.$$
Exercise: show that a and b are consistent estimators (of the numerator and denominator of Eqn 7.3), via Markov's Inequality or using the CLT (this is the same as Q6 of the supplementary problem sheet of Part A probability, HT08).
Theorem 7.1. $\bar f$ is a consistent estimator; that is, $\bar f$ tends in probability to E_p(f) as n → ∞.

We omit the proof, which is outside the scope of this course. If you are interested in following this up, apply the delta method, Taylor expanding $\bar f$ as a function of a and b about a = E_q(w̃f) and b = E_q(w̃).
Example 7.4. Revisit Example 7.1. We wish to estimate E_p(f(X)), where the target distribution is X ∼ Gamma(α, β), with Z_p = β^{−α}Γ(α), and the proposal distribution is Y ∼ Gamma(a, b). Suppose we didn't know Z_p and Z_q (or we were not confident we could calculate them correctly, or we couldn't be bothered calculating them!). Now p̃(x)/q̃(x) = x^{α−a} exp(−(β − b)x), so our algorithm is
Algorithm 7.2.
1 Simulate Y_i ∼ Gamma(a, b), for i = 1, 2, ..., n.
2 Calculate the weights
$$w_i = Y_i^{\alpha-a}\exp(-(\beta-b)Y_i).$$
3 Form the IS-estimator
$$\bar f = \frac{\sum_i w_i f(Y_i)}{\sum_j w_j}.$$
#in this example f(x)=x, X~Gamma(7.8,2), Y~Gamma(7,1) and we use Importance sampling
#to estimate E(f(X)) using the ratio estimator
alpha<-7.8
beta<-2
n<-100000
a<-floor(alpha)
b<-1 #Exercise: what is the optimal choice of b here?
U<-matrix(rexp(a*n),a,n)
Y<-apply(U,2,’sum’)/b
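A minimal sketch of the remaining steps of Algorithm 7.2, forming the unnormalised weights and then the ratio estimator (the variable name ratio_est is my own):

w<-Y^(alpha-a)*exp(-(beta-b)*Y) #unnormalised weights ptilde/qtilde
(ratio_est<-sum(w*Y)/sum(w))    #ratio estimator of E(X)
alpha/beta                      #exact value, for comparison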
8. Markov Chain Monte Carlo

Our purpose is to estimate E_p(f(X)) for X ∼ p, for p(x) some pmf (or pdf) defined for x ∈ Ω. Up to this point we have based our estimates on iid draws from either p itself or some proposal distribution with pmf q. In MCMC we simulate a correlated sequence X₀, X₁, X₂, ... which satisfies X_t ∼ p (or at least X_t converges to p in distribution) and rely on the usual estimate $\hat f = n^{-1}\sum_i f(X_i)$. We need to review and extend some of the Markov chain material from Part A Probability.
In the following discussion we will suppose the space of states of X is finite (and
therefore discrete). We do not promise a satisfactory proof that all this works in the
case X a continuous rv: we simply observe that all our work is to be on a computer,
with finite precision, and that all the probability densities we will consider are in
fact represented by some closely approximating pmf on the computer.
You can find other presentations of parts of the material in this section in Section
5.5 of Norris and Chapter 10 of Ross.
8.1. Markov chains. Let $\{X_t\}_{t=0}^{\infty}$ be a homogeneous Markov chain of random variables on Ω, with starting distribution $X_0 \sim p^{(0)}$ and transition probability

$$P_{i,j} = \Pr(X_{t+1} = j \mid X_t = i).$$

Denote by $P^{(n)}_{i,j}$ the n-step transition probabilities,

$$P^{(n)}_{i,j} = \Pr(X_{t+n} = j \mid X_t = i),$$

and by $p^{(n)}(i) = \Pr(X_n = i)$. Recall that P is irreducible if and only if, for each pair of states i, j ∈ Ω, there is an n such that $P^{(n)}_{i,j} > 0$. The Markov chain is aperiodic if $P^{(n)}_{i,j}$ is non-zero for all sufficiently large n.
Example 8.1. Here is an example of a periodic chain: Ω = {1, 2, 3, 4}, $p^{(0)} = (1, 0, 0, 0)$, and transition matrix

$$P = \begin{pmatrix} 0 & 1/2 & 0 & 1/2 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix},$$

since $P^{(n)}_{1,1} = 0$ for n odd.
Exercise: show that if P is irreducible and Pi,i > 0 for some i ∈ Ω then P is
aperiodic.
Recall that the pmf π(i), i ∈ Ω, with $\sum_{i\in\Omega}\pi(i) = 1$, is a stationary distribution of P if πP = π. If $p^{(0)} = \pi$ then

$$p^{(1)}(j) = \sum_{i\in\Omega} p^{(0)}(i)\,P_{i,j},$$

so $p^{(1)}(j) = \pi(j)$ also. Iterating, $p^{(t)} = \pi$ for each t = 1, 2, ... in the chain, so the distribution of $X_t \sim p^{(t)}$ doesn't change with t: it is stationary.
We now introduce reversible Markov chains (see also Norris 1.9). In a reversible
chain we cannot distinguish the direction of simulation from inspection of a real-
ization of the chain (so, you simulate a piece of the chain, toss a coin and reverse
the order of states if the coin comes up heads; now you present me the sequence of
states; I cannot tell whether or not you have reversed the sequence, though I know
the transition matrix of the chain).
Denote by $P'_{i,j} = \Pr(X_{t-1} = j \mid X_t = i)$ the transition matrix for the time-reversed chain. It seems clear that a Markov chain will be reversible if and only if P = P′, so that any particular transition occurs with equal probability in forward and reverse directions.
Theorem 8.1. Suppose the pmf π on Ω and the transition matrix P satisfy

$$(8.1)\qquad \pi(i)P_{i,j} = \pi(j)P_{j,i}\quad\text{for all } i, j \in \Omega.$$

Then (i) π is a stationary distribution of P, and (ii) the chain with transition matrix P started in $X_0 \sim \pi$ is reversible.

The relations in Eqn. 8.1 are called the detailed balance relations (and in this sense, π = πP would be global balance).

Proof of (i): sum both sides of Eqn. 8.1 over i ∈ Ω. Now $\sum_i P_{j,i} = 1$, so $\sum_i \pi(i)P_{i,j} = \pi(j)$ and we are done.

Proof of (ii): we have π a stationary distribution of P, so Pr(X_t = i) = π(i) for all t = 1, 2, ... along the chain. Then

$$P'_{i,j} = \Pr(X_{t-1} = j \mid X_t = i) = \frac{\Pr(X_t = i \mid X_{t-1} = j)\Pr(X_{t-1} = j)}{\Pr(X_t = i)} = \frac{\pi(j)P_{j,i}}{\pi(i)} = P_{i,j},$$

using Eqn. 8.1 in the final step, so P′ = P and the chain is reversible.
Theorem 8.2. If $\{X_t\}_{t=0}^{\infty}$ is an irreducible and aperiodic Markov chain on a finite space of states Ω, with stationary distribution p, then, as n → ∞, for any bounded function f : Ω → R,

$$\frac{1}{n}\sum_{t=0}^{n-1} f(X_t) \to E_p(f(X)).$$
The convergence is almost sure, which is stronger than, and implies, convergence in probability. In Part A probability the Ergodic theorem asks for positive recurrent X₀, X₁, X₂, ...; the stated conditions are simpler here because we are assuming a finite state space for the Markov chain.
We would really like to have a CLT for $\hat f$ formed from the Markov chain output, so that we have confidence intervals $\hat f \pm \sqrt{\widehat{\mathrm{var}}(\hat f)}$ as well as the central point estimate $\hat f$ itself. It is in fact the case that, under mild conditions which hold for almost all the MCMC simulations I have done, there is a CLT for $\hat f$. However, these results are a little beyond us at this point.
So, we can form estimates of Ep (f ) by averaging along states of a Markov chain
which has p as its equilibrium distribution. There remains the problem of how large
n must be for the guaranteed convergence to give a usefully accurate estimate. This
problem, of assessing when the Markov chain simulation length is sufficiently large,
does not have a simple honest answer, though there are some obvious necessary
conditions we can check (eg repeat the entire simulation and check that independent
estimates $\hat f$ have an acceptably small scatter). We should also check that 'most' of the samples are not biased in any obvious way by the choice of X₀, the initial state.
8.2. Metropolis Hastings Markov chain Monte Carlo. The Metropolis-Hastings Markov chain Monte Carlo algorithm (or MCMC, for short) is an algorithm for simulating a Markov chain with any given equilibrium distribution.

If we are given a pdf or pmf p then we may be able to simulate an iid sequence X₁, X₂, ..., X_n of rvs satisfying $n^{-1}\sum_i f(X_i) \to E_p(f(X))$ as n → ∞, using the Rejection algorithm.

In a similar way, if we are given a pdf or pmf p then we may be able to simulate a correlated sequence X₁, X₂, ..., X_n of rvs (ie, a Markov chain) satisfying $n^{-1}\sum_i f(X_i) \to E_p(f(X))$ as n → ∞, using the MCMC algorithm.

In each case convergence in probability is 'easily' established, whilst the more useful CLT 'usually' applies, but is harder to verify, at least in the MCMC case.
We will start with simulation of rv X on a finite state space. Let p(x) = p̃(x)/Zp
be the pmf on finite state space Ω = {1, 2, ..., m}. We will call p the (pmf of the)
target distribution. This is the one we want to sample. Fix a ‘proposal’ transition
matrix q(y|x). We will use the notation Y ∼ q(·|x) to mean Pr(Y = y|X = x) =
q(y|x).
Theorem 8.3. The transition matrix P of the Markov chain generated by the following Metropolis-Hastings MCMC algorithm satisfies p = pP.

Algorithm 8.1. If X_t = x then X_{t+1} is determined in the following way.
1. Simulate Y ∼ q(·|x) and U ∼ U(0, 1); suppose Y = y and U = u.
2. If

$$u \le \min\left\{1,\ \frac{\tilde p(y)\,q(x|y)}{\tilde p(x)\,q(y|x)}\right\}$$

set X_{t+1} = y, otherwise set X_{t+1} = x.
Corollary 8.4. If, in addition, the chain $\{X_t\}_{t=0}^{\infty}$ generated by Algorithm 8.1 is irreducible and aperiodic, then $\frac{1}{n}\sum_{t=0}^{n-1} f(X_t) \to E_p(f(X))$ as n → ∞ for bounded f.

Proof: we have seen that p is stationary for $\{X_t\}_{t=0}^{\infty}$, so the conditions of Theorem 8.2 are satisfied if in addition the $\{X_t\}_{t=0}^{\infty}$ are irreducible and aperiodic.
In order to run this Markov chain simulation we need to specify a start state X₀ = x₀ and a proposal mechanism q(y|x). We then repeat steps 1 and 2 of Algorithm 8.1 to generate a sequence X₀, X₁, X₂, ..., X_n, and these are our correlated samples distributed according to p (at least for large n, when $p^{(n)}$ has converged to p). Notice that when we come to compute the acceptance probability α(y|x), we do not need the normalized expressions for the pmfs: α depends only on the ratio p(y)/p(x) = p̃(y)/p̃(x).
We are left, by Corollary 8.4, to verify irreducibility and aperiodicity. The latter
is usually straightforward, since the MCMC algorithm may reject the candidate
state y, so that the transition matrix Px,y for the Markov chain satisfies Px,x > 0
for at least some states x ∈ Ω. In order to check irreducibility we need to check that
the proposal mechanism q can take us anywhere in Ω (so q itself is an irreducible
transition matrix), and then that the acceptance step doesn’t trap the chain (as
might happen if α(y|x) is zero too often).
Example 8.2. We will use MCMC to simulate X ∼ p with p ∝ (1, 2, ..., m). The normalising constant is $Z = \sum_i i = m(m+1)/2$, the unnormalised mass function is p̃(x) = x, and p(x) = p̃(x)/Z. One simple proposal distribution is Y ∼ q with q = (1, 1, ..., 1)/m, the uniform distribution on 1 to m, which we write Y ∼ U{1, 2, ..., m}. This proposal scheme is clearly irreducible (we can get from A to B in a single hop). Here is a MCMC algorithm simulating X_t ∼ p. Start with some arbitrary starting point, for example X₀ = 1.
Algorithm 8.2. If X_t = x then X_{t+1} is determined in the following way.
• Simulate Y ∼ U{1, 2, ..., m}. Simulate U ∼ U(0, 1).
• If U ≤ α(y|x) set X_{t+1} = y and otherwise set X_{t+1} = x.

Here α is computed as

$$\alpha(y|x) = \min\left\{1,\ \frac{p(y)q(x|y)}{p(x)q(y|x)}\right\} = \min\left\{1,\ \frac{y\times 1/m}{x\times 1/m}\right\} = \min\left\{1,\ \frac{y}{x}\right\}.$$
Does this work on the computer?
#MCMC simulate X_t according to p=[1:m]/sum(1:m)
m<-30
pt<-1:m
n<-10000
X<-rep(NA,n)
X[1]<-1
for (t in 1:(n-1)) {
  x<-X[t]
  y<-ceiling(m*runif(1)) #propose Y~U{1,...,m}
  a<-min(1,pt[y]/pt[x])
  U<-runif(1)
  if (U<=a) {
    X[t+1]<-y
  } else {
    X[t+1]<-x
  }
}
plot(X[1:200],type="l",ann=F)
hist(X,-1:m+0.5,freq=F,ann=F); lines(1:m,pt/sum(pt),ann=F)
Figure 2. The figure at right shows the first 200 states visited
by the MCMC algorithm. The figure at left is a histogram of the
10000 sampled states.
The code example gives MCMC simulation of a Markov chain with equilibrium distribution p = (1, 2, ..., 30)/465. The output is plotted in Figure 2.
Example 8.3. We will use MCMC to simulate Poisson variates. Let p(x) = exp(−λ)λˣ/x! and X ∼ p. We need a proposal mechanism which will take us around the space Ω = {0, 1, 2, ...} of X. Anything irreducible will do. I will use the simplest thing I can think of:

$$q(y|x) = \begin{cases} 1/2 & \text{if } y = x \pm 1, \\ 0 & \text{otherwise.} \end{cases}$$
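A minimal sketch of an implementation (the value λ = 3 is an assumption, chosen to match the output below):

#MCMC simulate X~Poisson(lam) with a +/-1 random walk proposal
lam<-3
n<-10000
X<-rep(NA,n)
X[1]<-0
for (t in 1:(n-1)) {
  x<-X[t]
  y<-x+sample(c(-1,1),1) #propose y=x-1 or x+1 with probability 1/2 each
  if (y<0) {
    X[t+1]<-x #ptilde(y)=0 for y<0, so reject
  } else {
    MHR<-lam^(y-x)*factorial(x)/factorial(y) #p(y)/p(x); q is symmetric
    X[t+1]<-ifelse(runif(1)<=min(1,MHR),y,x)
  }
}
mean(X); var(X) #both should be close to lam=3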
[Figure: trace plot of the MCMC state X[t] against the step counter t (first 100 steps), and a frequency histogram of the sampled states X.]
[1] 0 1 2 3 2 1 1 2 1 2
> mean(X) #expect = 3 (but no easy way to get a sd() as X[t] correlated)
[1] 3.0789
> var(X) #expect = 3 (same comment)
[1] 3.134788
8.3. MCMC for state spaces which are not finite. What of using MCMC for
a target distribution on a space which is discrete but not finite, or using MCMC for
a random variable which is continuous, and therefore has a pdf rather than a pmf?
The theory we have covered doesn’t treat these cases. Our Poisson example had
an infinite state space. Of course, if I had truncated the distribution at the largest
integer I can represent exactly on the computer, there would be no detectable
change to the samples: since the MCMC never visited that boundary, it doesn’t
know it was there. I get the same samples I would have had anyway. This kind of
approximation is implicit in almost all numerical work.
The question remains, how to treat continuous random variables? It is easy to see
that the condition for irreducibility must take some other form for Markov chains
on continuous spaces (the probability of hitting any particular state will be zero). A
rigorous treatment of these issues is beyond us here. However, in some respects such
an analysis must be irrelevant. In numerical work on the computer, continuous and
unbounded random variables are approximated by discrete analogues, on a finite
space. The real number line is broken up into cells, with x in the cell δx say. Because
the precision is fixed at around 15 decimal places, the length |δx| of a cell depends on the x value, $|\delta x|/|x| \simeq 10^{-15}$ for |x| ≫ |δx|. If π(x) and q(y|x) are densities then
they are approximated on the computer by distributions Pr(X ∈ δx) ≃ π(x)|δx|
and Pr(Y ∈ δy|X = x) ≃ q(y|x)|δy|. This is still approximate, as the return values
of the functions π() and q() are rounded, and we are approximating integrals over
the sets δx and δy, assuming the densities are constant to machine precision over
these small sets. The Hastings ratio is then evaluated as
Pr(Y ∈ δy) Pr(X ∈ δx|Y = y) p(y)|δy|q(x|y)|δx|
≃
Pr(X ∈ δx) Pr(Y ∈ δy|X = x) p(x)|δx|q(y|x)|δy|
p̃(y)q(x|y)
= .
p̃(x)q(y|x)
If we apply the Metropolis Hastings ratio to densities, for continuous random vari-
ables, we expect to simulate the approximate distribution, in which the density is
lumped into the rounding-sets δx. For this reason our discussion of Markov chains
on finite spaces remains relevant.
Example 8.4. MCMC for a standard normal distribution. Suppose we want to simulate the standard normal distribution X ∼ N(0, 1). The target density is

$$\tilde p(x) \propto \exp(-x^2/2).$$

I left off the factor $1/\sqrt{2\pi}$, since it will cancel in the Hastings ratio. We will use a proposal density which allows us to tour R. We have a lot of freedom here, so this is simply something that 'will do'. Fix a constant a > 0 and choose a new point uniformly at random in a window of length 2a centred at x. The proposal density is

$$q(y|x) = \frac{1}{2a}\,I_{x-a<y<x+a}.$$

Now q(y|x) = q(x|y): the probability density to propose y given x is equal to the probability density to propose x given y.
Here is a Metropolis Hastings Markov Chain Monte Carlo algorithm simulating
Xt ∼ N (0, 1).
Algorithm 8.4. Start with an arbitrary point, X₀ = 0 say. If X_t = x then X_{t+1} is determined in the following way.
• Simulate V ∼ U(0, 1) and set Y = x + (2V − 1)a. Simulate U ∼ U(0, 1).
• If U ≤ α(y|x) set X_{t+1} = y and otherwise set X_{t+1} = x. Here

$$\alpha(y|x) = \min\left\{1,\ \frac{\tilde p(y)q(x|y)}{\tilde p(x)q(y|x)}\right\} = \min\left\{1,\ \exp(-y^2/2 + x^2/2)\right\}.$$
We don't have to worry about the process leaving the state space, since the state space is the whole of R.
#MCMC simulate X~N(0,1)
n<-10000
X<-rep(NA,n)
X[1]<-0
a<-3
for (t in 1:(n-1)) {
  x<-X[t]
  y<-x+(2*runif(1)-1)*a #uniform proposal in a window of width 2a about x
  U<-runif(1)
  MHR<-exp( (x^2-y^2)/2 )
  alpha<-min(1,MHR)
  if (U<=alpha) {
    X[t+1]<-y
  } else {
    X[t+1]<-x
  }
}
> mean(X)
> var(X)
Note that I chose a = 3 to generate graphs (a) and (b) in Figure 8.5. Any positive
value would be technically correct (irreducible in the discrete sense) but values much
less than, or much greater than, the standard deviation of X (which is one) would
be inefficient. In both cases the chain moves very slowly through the space. The
ergodic theorem still applies, but convergence to equilibrium is slow. In the first case
(illustrated with a = 0.3 in Figure 8.5(c)) the chain nearly always accepts proposals,
but moves just a short distance. In the latter case (a = 30 in Figure 8.5(d)), the
chain hardly ever accepts, so it stays in the same state for a very long time.
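A quick way to see the effect of a is to estimate the acceptance rate directly; a minimal sketch (the helper acc_rate is my own):

acc_rate<-function(a,n=10000) {
  #fraction of accepted proposals for the N(0,1) sampler with window width a
  X<-0; acc<-0
  for (t in 1:n) {
    y<-X+(2*runif(1)-1)*a
    if (runif(1)<=min(1,exp((X^2-y^2)/2))) { X<-y; acc<-acc+1 }
  }
  acc/n
}
acc_rate(0.3); acc_rate(3); acc_rate(30) #near 1, moderate, near 0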
Example 8.5. Here is an example of MCMC for a distribution with more than one variable: MCMC for a bivariate normal distribution. Suppose we want to simulate a bivariate normal distribution X ∼ N(µ, Σ), with µ = (1, 1)ᵀ, Σ_{ii} = var(X_i) = 3 and Σ_{1,2} = cov(X₁, X₂) = −2. The target density is

$$\tilde p(x_1,x_2) \propto \exp\left(-\frac{1}{2}\,\big[(x_1,x_2)-(1,1)\big]\begin{pmatrix} 3 & -2 \\ -2 & 3 \end{pmatrix}^{-1}\left[\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}-\begin{pmatrix} 1 \\ 1 \end{pmatrix}\right]\right)$$
where I have omitted the factor $1/(2\pi\sqrt{\det(\Sigma)})$, since it will cancel in the Hastings ratio. This time we need to tour R². Inspired by the last example, fix a constant a > 0 and make random jumps of size up to a along the two axes:

$$q(y|x) = \frac{1}{4a^2}\,I_{x_1-a\le y_1\le x_1+a,\ x_2-a\le y_2\le x_2+a},$$

ie the uniform density in a box of side 2a centred at x with sides aligned to the axes. Now q(y|x) = q(x|y): the probability density to propose y given x is equal to the probability density to propose x given y (if y is in reach of x then x is in reach of y, so both densities equal 1/4a²).
Here is a Metropolis Hastings Markov Chain Monte Carlo algorithm simulating
Xt ∼ N (µ, Σ).
Algorithm 8.5. Start with an arbitrary point, X₀ = (0, 0)ᵀ say, in R². If X_t = x then X_{t+1} is determined in the following way.
• Simulate V₁, V₂ ∼ U(0, 1) iid and set Y_i = x_i + (2V_i − 1)a for i = 1, 2. Simulate U ∼ U(0, 1).
• If U ≤ α(y|x) set X_{t+1} = y and otherwise set X_{t+1} = x, where, since q is symmetric, α(y|x) = min{1, p̃(y)/p̃(x)}.
[Figure 8.5: MCMC output for the N(0, 1) target of Example 8.4: (a) trace plot X[1:1000] and (b) histogram of X with the normal density, for a = 3; (c) trace plot X[1:1000] for a = 0.3; (d) trace plot X[1:1000] for a = 30.]
Again, the process can't leave the state space, as the state space is all of R².
n<-10000
mu<-c(1,1) #target mean (assumed definition: mu is used below but not defined in the original listing)
Si<-solve(matrix(c(3,-2,-2,3),2,2)) #Si = Sigma^{-1} (assumed definition, as for mu)
X<-matrix(NA,2,n)
X[,1]<-c(0,0)
a<-3
for (t in 1:(n-1)) {
  x<-X[,t]
  y<-x+(2*runif(2)-1)*a
  U<-runif(1)
  MHR<-exp( -t(y-mu)%*%Si%*%(y-mu)/2+t(x-mu)%*%Si%*%(x-mu)/2 )
  alpha<-min(1,MHR)
  if (U<=alpha) {
    X[,t+1]<-y
  } else {
    X[,t+1]<-x
  }
}
> apply(X,1,mean)
[1] 1.037171 0.963198
> cov(t(X))
[,1] [,2]
[1,] 3.064267 -2.006450
[2,] -2.006450 2.971347
[Figure: the first 100 steps (left) and the first 2000 points visited by the MCMC (right), plotted in the (X_1(t), X_2(t)) plane.]
Note that I chose a = 3 for similar reasons to the choice in Example 8.4.
8.4. MCMC and conditional distributions. The conditional distribution of X ∼ p given X ∈ B, for some set B a subset of the state space of X, is p(x|X ∈ B) = p(x)/Pr(X ∈ B) for x ∈ B and 0 otherwise. It follows that p̃(x|X ∈ B) = p̃(x) for x ∈ B and p̃(x|X ∈ B) = 0 for x ∉ B. In other words, the chain simulating the conditional distribution has just the same MH ratio as the unconditioned chain, for x, y both in B, and if y falls outside B we simply reject it. We should check that rejecting candidates in this way does not cost us irreducibility within B, but otherwise things are straightforward.
Example 8.6. Suppose we want to simulate the bivariate normal distribution X ∼ N(µ, Σ) of Example 8.5, but conditioned on |X − (3, 3)| < 1. The MCMC algorithm is
Algorithm 8.6. Start with X₀ = (3, 3)ᵀ, so the start state satisfies |X₀ − (3, 3)| < 1. If X_t = x then X_{t+1} is determined in the following way.
• Simulate V₁, V₂ ∼ U(0, 1) iid and set Y₁ = x₁ + (2V₁ − 1)a and Y₂ = x₂ + (2V₂ − 1)a. Simulate U ∼ U(0, 1).
• If |Y − (3, 3)| < 1 and U ≤ α(y|x), set X_{t+1} = y, and otherwise set X_{t+1} = x. Here

$$\alpha(y|x) = \min\left\{1,\ \frac{\exp(-(y-\mu)^T\Sigma^{-1}(y-\mu)/2)}{\exp(-(x-\mu)^T\Sigma^{-1}(x-\mu)/2)}\right\},$$

unchanged from Example 8.5. We could have put p̃(y)I_{|y−(3,3)|<1} in the numerator of the acceptance probability itself; however, what we have done amounts to the same thing.
The code is the same as before but with a different initialization (X₀ inside the circle) and a very slightly altered test condition:

...
X[,1]<-c(3,3) #X_0 at center of allowed region
a<-1 #reduce proposal size - fewer proposals fall outside circle
...
for (t in 1:(n-1)) {
  x<-X[,t]
  y<-x+(2*runif(2)-1)*a
  U<-runif(1)
  MHR<-exp( -t(y-mu)%*%Si%*%(y-mu)/2+t(x-mu)%*%Si%*%(x-mu)/2 )
  alpha<-min(1,MHR)
  if (U<=alpha & sum((y-c(3,3))^2)<1 ) { #also reject y outside the disc
    X[,t+1]<-y
  } else {
    X[,t+1]<-x
  }
}
The samples are constrained to lie within 1 unit of (3, 3), but are otherwise distributed according to the bivariate normal. The results are illustrated in Fig 8.6.
[Fig 8.6: the sampled points in the (X_1(t), X_2(t)) plane, confined to the disc of radius 1 about (3, 3).]

Exercise: How would you use samples distributed as X ∼ N(µ, Σ) in order to simulate X | |X − (3, 3)| < 1 via rejection? What weakness does this approach have?
Statistics Department
E-mail address: [email protected]