
arXiv:1909.12313v1 [stat.OT] 26 Sep 2019

A Conceptual Introduction to Markov Chain Monte Carlo Methods

Joshua S. Speagle^{1,a}

^a Center for Astrophysics | Harvard & Smithsonian, 60 Garden St., Cambridge, MA 02138, USA

Submitted to the Journal of Statistics Education

^1 [email protected]
Abstract

Markov Chain Monte Carlo (MCMC) methods have become a cornerstone of many mod-
ern scientific analyses by providing a straightforward approach to numerically estimate
uncertainties in the parameters of a model using a sequence of random samples. This
article provides a basic introduction to MCMC methods by establishing a strong concep-
tual understanding of what problems MCMC methods are trying to solve, why we want
to use them, and how they work in theory and in practice. To develop these concepts,
I outline the foundations of Bayesian inference, discuss how posterior distributions are
used in practice, explore basic approaches to estimate posterior-based quantities, and de-
rive their link to Monte Carlo sampling and MCMC. Using a simple toy problem, I then
demonstrate how these concepts can be used to understand the benefits and drawbacks
of various MCMC approaches. Exercises designed to highlight various concepts are also
included throughout the article.

1 Introduction
Scientific analyses generally rest on making inferences about underlying physical models
from various sources of observational data. Over the last few decades, the quality and
quantity of these data have increased substantially as they become faster and cheaper to
collect and store. At the same time, the same technology that has made it possible to
collect vast amounts of data has also led to a substantial increase in the computational
power and resources available to analyze them.
Together, these changes have made it possible to explore increasingly complex models
using methods that can exploit these computational resources. This has led to a dramatic
rise in the number of published works that rely on Monte Carlo methods, which use a
combination of numerical simulation and random number generation to explore these
models.
One particularly popular subset of Monte Carlo methods is known as Markov Chain
Monte Carlo (MCMC). MCMC methods are appealing because they provide a straight-
forward, intuitive way to both simulate values from an unknown distribution and use
those simulated values to perform subsequent analyses. This allows them to be applica-
ble in a wide variety of domains.
Owing to its widespread use, various overviews of MCMC methods are common both
in peer-reviewed and non-peer-reviewed sources. In general, these tend to fall into two
groups: articles focused on various statistical underpinnings of MCMC methods and ar-
ticles focused on implementation and practical usage. Readers interested in more
details on either topic are encouraged to see Brooks et al. (2011) and Hogg & Foreman-
Mackey (2018) along with associated references therein.
This article instead provides an overview of MCMC methods focused on build-
ing up a strong conceptual understanding of the what, why, and how of MCMC based on
statistical intuition. In particular, it tries to systematically answer the following questions:

1. What problems are MCMC methods trying to solve?

2. Why are we interested in using them?

3. How do they work in theory and in practice?

When answering these questions, this article generally assumes that the reader is
somewhat familiar with the basics of Bayesian inference in theory (e.g., the role of pri-
ors) and in practice (e.g., deriving posteriors), basic statistics (e.g., expectation values),
and basic numerical methods (e.g., Riemann sums). No advanced statistical knowledge
is required. For more details on these topics, please see Gelman et al. (2013) and Blitzstein
& Hwang (2014) along with associated references therein.
The outline of the article is as follows. In §2, I provide a brief review of Bayesian infer-
ence and posterior distributions. In §3, I discuss what posteriors are used for in practice,
focusing on integration and marginalization. In §4, I outline a basic scheme to approx-
imate these posterior integrals using discrete grids. In §5, I illustrate how Monte Carlo
methods emerge as a natural extension of grid-based approaches. In §6, I discuss how
MCMC methods fit within the broader scope of possible approaches and their benefits


and drawbacks. In §7, I explore the general challenges MCMC methods face. In §8, I ex-
amine how these concepts come together in practice using a simple example. I conclude
in §9.

2 Bayesian Inference
In many scientific applications, we have access to some data D that we want to use to
make inferences about the world around us. Most often, we want to interpret these data
in light of an underlying model M that can make predictions about the data we expect to
see as a function of some parameters ΘM of that particular model.
We can combine these pieces together to estimate the probability P (D|ΘM , M ) that we
would actually see the data D we have collected, conditioned on (i.e. assuming) a specific
choice of parameters ΘM from our model M . In other words, assuming our model M is
right and the parameters ΘM describe the data, what is the likelihood P (D|ΘM , M ) of
the parameters ΘM based on the observed data D? Assuming different values of ΘM will
give different likelihoods, telling us which parameter choices appear to best describe the
data we observe.
In Bayesian inference, we are interested in inferring the flipped quantity, P (ΘM |D, M ).
This describes the probability that the underlying parameters are actually ΘM given our
data D and assuming a particular model M . By using factoring of probability, we can
relate this new probability P (ΘM |D, M ) to the likelihood P (D|ΘM , M ) described above
as
P (ΘM |D, M )P (D|M ) = P (ΘM , D|M ) = P (D|ΘM , M )P (ΘM |M ) (1)
where P (ΘM , D|M ) represents the joint probability of having an underlying set of pa-
rameters ΘM that describe the data and observing the particular set of data D we have
already collected.
Rearranging this equality into a more convenient form gives us Bayes’ Theorem:

$$P(\Theta_M \mid D, M) = \frac{P(D \mid \Theta_M, M)\, P(\Theta_M \mid M)}{P(D \mid M)} \qquad (2)$$

This equation now describes exactly how our two probabilities relate to each other.
P (ΘM |M ) is often referred to as the prior. This describes the probability of having a
particular set of values ΘM for our given model M before conditioning on our data. Because
this is independent of the data, this term is often interpreted as representing our “prior
beliefs” about what ΘM should be based on previous measurements, physical concerns,
and other known factors. In practice, this has the effect of essentially “augmenting” the
data with other information.
The denominator
$$P(D \mid M) = \int P(D \mid \Theta_M, M)\, P(\Theta_M \mid M)\, d\Theta_M \qquad (3)$$

is known as the evidence or marginal likelihood for our model M marginalized (i.e. in-
tegrated) over all possible parameter values ΘM . This broadly tries to quantify how well


Figure 1: An illustration of Bayes' Theorem. The posterior probability P(Θ) (black) of our
model parameters Θ is based on a combination of our prior beliefs π(Θ) (blue) and the
likelihood L(Θ) (red), normalized by the overall evidence Z = ∫ π(Θ)L(Θ) dΘ (purple)
for our particular model. See §2 for additional details.

our model M explains the data D after averaging over all possible values ΘM of the true
underlying parameters. In other words, if the observations predicted by our model look
similar to the data D, then M is a good model. Models where this is true more often also
tend to be favored over models that give excellent agreement occasionally but disagree
most of the time. Since in most instances we take D as a given, this often ends up being a
constant.
Finally, P (ΘM |D, M ) represents our posterior. This quantifies our belief in ΘM af-
ter combining our prior intuition P (ΘM |M ) with current observations P (D|ΘM , M ) and
normalizing by the overall evidence P (D|M ). The posterior will be some compromise be-
tween the prior and the likelihood, with the exact combination depending on the strength
and properties of the prior and the quality of the data used to derive the likelihood. A
schematic illustration is shown in Figure 1.
Throughout the rest of the paper I will write these four terms (likelihood, prior, evi-
dence, posterior) using shorthand notation such that
$$\mathcal{P}(\Theta) \equiv \frac{\mathcal{L}(\Theta)\,\pi(\Theta)}{\int \mathcal{L}(\Theta)\,\pi(\Theta)\, d\Theta} \equiv \frac{\mathcal{L}(\Theta)\,\pi(\Theta)}{\mathcal{Z}} \qquad (4)$$
where P(Θ) ≡ P (ΘM |D, M ) is the posterior, L(Θ) ≡ P (D|ΘM , M ) is the likelihood,


π(Θ) ≡ P (ΘM |M ) is the prior, and the constant Z ≡ P (D|M ) is the evidence. I have
suppressed the model M and data D notation for convenience here since in most cases
the data and model are considered fixed, but will re-introduce them as necessary.
Before moving on, I would like to close by emphasizing that the interpretation of any
result is only as good as the models and priors that underlie them. Trying to explore the impli-
cations of any particular model using, for instance, some of the methods described in this
article is fundamentally a secondary concern behind constructing a reasonable model with
well-motivated priors in the first place. I strongly encourage readers to keep this idea in
mind throughout the remainder of this work.

Exercise: Noisy Mean


Setup
Consider the case where we have temperature monitoring stations located across a city.
Each station i takes a noisy measurement T̂i of the temperature on any given day with
some measurement noise σi . We will assume our measurements T̂i follow a Normal (i.e.
Gaussian) distribution with mean T and standard deviation σi such that

T̂i ∼ N [T, σi ]

This translates into a probability of


" #
1 1 (T̂i − T )2
P (T̂i |T, σi ) ≡ N [T, σi ] = p exp −
2πσi2 2 σi2

for each observation and


$$P(\{\hat{T}_i\}_{i=1}^{n} \mid T, \{\sigma_i\}_{i=1}^{n}) = \prod_{i=1}^{n} P(\hat{T}_i \mid T, \sigma_i)$$

for a collection of n observations.


Let’s assume we have five independent noisy measurements of the temperature (in
Celsius) from several monitoring stations

T̂1 = 26.3, T̂2 = 30.2, T̂3 = 29.4, T̂4 = 30.1, T̂5 = 29.8

with corresponding uncertainties

σ1 = 1.7, σ2 = 1.8, σ3 = 1.2, σ4 = 0.5, σ5 = 1.3

Looking at historical data, we find that the typical underlying temperature T dur-
ing similar days is roughly Normally-distributed with a mean Tprior = 25 and variation
σprior = 1.5:
T ∼ N [Tprior = 25, σprior = 1.5]


Problem
Using these assumptions, compute:

1. the prior π(T ),

2. the likelihood L(T ), and

3. the posterior P(T )

given our observed data {T̂i } and errors {σi } over a range of temperatures T . How do the
three terms differ? Does the prior look like a good assumption? Why or why not?
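One possible numerical starting point for this exercise is sketched below (using NumPy and SciPy, with arbitrary choices for the grid bounds and resolution; the variable names are just for illustration):

```python
import numpy as np
from scipy.stats import norm

# Measurements and uncertainties from the setup above.
T_hat = np.array([26.3, 30.2, 29.4, 30.1, 29.8])
sigma = np.array([1.7, 1.8, 1.2, 0.5, 1.3])

# Evaluate everything on a grid of candidate temperatures T
# (the bounds and resolution here are arbitrary choices).
T_grid = np.linspace(20.0, 35.0, 2001)
dT = T_grid[1] - T_grid[0]

# Prior: T ~ N(T_prior = 25, sigma_prior = 1.5).
prior = norm.pdf(T_grid, loc=25.0, scale=1.5)

# Likelihood: product of independent Gaussians, one per measurement.
likelihood = np.prod(
    norm.pdf(T_hat[None, :], loc=T_grid[:, None], scale=sigma[None, :]), axis=1
)

# Unnormalized posterior, evidence (Riemann sum), and normalized posterior.
post_unnorm = likelihood * prior
Z = np.sum(post_unnorm) * dT
posterior = post_unnorm / Z

print("evidence Z      =", Z)
print("posterior mean  =", np.sum(T_grid * posterior) * dT)
```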

3 What are Posteriors Good For?


Above, I described how Bayes’ Theorem is able to combine our prior beliefs and the ob-
served data into a new posterior estimate P(Θ) ∝ L(Θ)π(Θ). This, however, is only half
of the problem. Once we have the posterior, we need to then use it to make inferences
about the world around us. In general, the ways in which we want to use posteriors fall
into a few broad categories:

1. Making educated guesses: make a reasonable guess at what the underlying model
parameters are.

2. Quantifying uncertainty: provide constraints on the range of possible model pa-


rameter values.

3. Generating predictions: marginalize over uncertainties in the underlying model


parameters to predict observables or other variables that depend on the model pa-
rameters.

4. Comparing models: use the evidences from different models to determine which
models are more favorable.

In order to accomplish these goals, we are often more interested in trying to use the
posterior to estimate various constraints on the parameters Θ themselves or other quan-
tities f (Θ) that might be based on them. This often depends on marginalizing over the
uncertainties characterized by our posterior (via the likelihood and prior). The evidence
Z, for instance, is again just the integral of the likelihood and the prior over all possible
parameters:
$$\mathcal{Z} = \int \mathcal{L}(\Theta)\,\pi(\Theta)\, d\Theta \equiv \int \tilde{\mathcal{P}}(\Theta)\, d\Theta \qquad (5)$$

where P̃(Θ) ≡ L(Θ)π(Θ) is the unnormalized posterior.


Likewise, if we are investigating the behavior of a subset of “interesting” parameters
Θint from Θ = {Θint , Θnuis }, we want to marginalize over the behavior of the remain-
ing “nuisance” parameters Θnuis to see how they can impact Θint . This process is pretty


straightforward if the entire posterior over Θ is known:


$$\mathcal{P}(\Theta_{\rm int}) = \int \mathcal{P}(\Theta_{\rm int}, \Theta_{\rm nuis})\, d\Theta_{\rm nuis} = \int \mathcal{P}(\Theta)\, d\Theta_{\rm nuis} \qquad (6)$$

Other quantities can generally be derived from the expectation value of various parameter-
dependent functions f (Θ) with respect to the posterior:
$$\mathbb{E}_{\mathcal{P}}[f(\Theta)] \equiv \frac{\int f(\Theta)\,\mathcal{P}(\Theta)\, d\Theta}{\int \mathcal{P}(\Theta)\, d\Theta} = \frac{\int f(\Theta)\,\tilde{\mathcal{P}}(\Theta)\, d\Theta}{\int \tilde{\mathcal{P}}(\Theta)\, d\Theta} = \int f(\Theta)\,\mathcal{P}(\Theta)\, d\Theta \qquad (7)$$
since ∫ P(Θ) dΘ = 1 by definition and P̃(Θ) ∝ P(Θ). This represents a weighted average
of f (Θ), where at each value Θ we weight the resulting f (Θ) according to how likely we
believe that value is to be correct.
Taken together, we see that in almost all cases we are more interested in computing inte-
grals over the posterior rather than knowing the posterior itself. To put this another way, the
posterior is rarely ever useful on its own; it mainly becomes useful by integrating over it.
This distinction between estimating expectations and other integrals over the posterior
versus estimating the posterior in-and-of-itself is a key element of Bayesian inference.
This distinction is hugely important when it comes to actually performing inference in
practice, since it is often the case that we can get an excellent estimate of EP [f (Θ)] even
if we have an extremely poor estimate of P(Θ) or P̃(Θ).
More details are provided below to further illustrate how the particular categories
described above translate into particular integrals over the (unnormalized) posterior. An
example is shown in Figure 2.

3.1 Making Educated Guesses


One of the core tenets of Bayesian inference is that we don’t know the true model M∗ or
its true underlying parameters Θ∗ that characterize the data we observe: the model M we
have is almost always a simplification of what is actually going on. If we assume that our
current model M is correct, however, we can try to use our posterior P(Θ) to propose a
point estimate Θ̂ that we think is a pretty good guess for the true value Θ∗ .
What exactly counts as “good”? This depends on exactly what we care about. In
general, we can quantify “goodness” by asking the opposite question: how badly are we
penalized if our estimate Θ̂ ≠ Θ∗ is wrong? This is often encapsulated through the use
of a loss function L(Θ̂|Θ∗ ) that penalizes us when our point estimate Θ̂ differs from Θ∗ .
An example of a common loss function is L(Θ̂|Θ∗ ) = |Θ̂ − Θ∗ |2 (i.e. squared loss), where
an incorrect guess is penalized based on the square of the magnitude of the separation
between the guess Θ̂ and the true value Θ∗ .
Unfortunately, we don’t know what the actual value of Θ∗ is to evaluate the true loss.
We can, however, do the next best thing and compute the expected loss averaged over all
possible values of Θ∗ based on our posterior:
$$L_{\mathcal{P}}(\hat{\Theta}) \equiv \mathbb{E}_{\mathcal{P}}\left[L(\hat{\Theta} \mid \Theta)\right] = \int L(\hat{\Theta} \mid \Theta)\,\mathcal{P}(\Theta)\, d\Theta \qquad (8)$$


Figure 2: A “corner plot” showing an example of how posteriors are used in practice.
Each of the top panels shows the 1-D marginalized posterior distribution for each param-
eter (grey), along with associated median point estimates (red) and 68% credible intervals
(blue). Each central panel shows the 10%, 40%, 65%, and 85% credible regions for each
2-D marginalized posterior distribution. See §3 for additional details.


A reasonable choice for Θ̂ is then the value that minimizes this expected loss in place of
the actual (unknown) loss:
$$\hat{\Theta} \equiv \underset{\Theta'}{\mathrm{argmin}}\ \left[L_{\mathcal{P}}(\Theta')\right] \qquad (9)$$

where argmin indicates the value (argument) of Θ′ that minimizes the expected loss LP (Θ′).
While this strategy can work for any arbitrary loss function, solving for Θ̂ often re-
quires using numerical methods and repeated integration over P(Θ). However, analytic
solutions do exist for particular loss functions. For example, it is straightforward to show
(and an insightful exercise for the interested reader) that the optimal point estimate Θ̂
under squared loss is simply the mean.
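For a single 1-D parameter, a brief sketch of why: expanding the expected squared loss and using the fact that ∫ P(Θ) dΘ = 1 gives

$$L_{\mathcal{P}}(\hat{\Theta}) = \int (\hat{\Theta} - \Theta)^2\,\mathcal{P}(\Theta)\, d\Theta = \hat{\Theta}^2 - 2\,\hat{\Theta}\,\mathbb{E}_{\mathcal{P}}[\Theta] + \mathbb{E}_{\mathcal{P}}[\Theta^2],$$

and setting the derivative with respect to Θ̂ to zero,

$$\frac{dL_{\mathcal{P}}}{d\hat{\Theta}} = 2\hat{\Theta} - 2\,\mathbb{E}_{\mathcal{P}}[\Theta] = 0 \quad\Longrightarrow\quad \hat{\Theta} = \mathbb{E}_{\mathcal{P}}[\Theta],$$

which is exactly the posterior mean.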

3.2 Quantifying Uncertainty


In many cases we are not just interested in computing a prediction Θ̂ for Θ∗ , but also con-
straining a region C(Θ) of possible values within which Θ∗ might lie with some amount
of certainty. In other words, can we construct a region CX such that we believe there is an
X% chance that it contains Θ∗ ?
There are many possible definitions for this credible region. One common definition
is the region above some posterior threshold PX where X% of the posterior is contained,
i.e. where
$$\int_{\Theta \in C_X} \mathcal{P}(\Theta)\, d\Theta = \frac{X}{100} \qquad (10)$$
given
$$C_X \equiv \{\Theta : \mathcal{P}(\Theta) \geq P_X\} \qquad (11)$$
In other words, we want to integrate our posterior over all Θ where the posterior density
P(Θ) is greater than some threshold PX , where PX is set so that this integral encompasses
X% of the full posterior. Common choices for X include 68% and 95% (i.e. “1-sigma” and
“2-sigma” credible intervals).
In the special case where our (marginalized) posterior is 1-D, credible intervals are
often defined using percentiles rather than thresholds, where the location xp of the pth
percentile is defined as
$$\int_{-\infty}^{x_p} \mathcal{P}(x)\, dx = \frac{p}{100} \qquad (12)$$
We can use these to define a credible region [xlow , xhigh ] containing Y % of the data by
taking xlow = x(1−Y )/2 and xhigh = x(1+Y )/2 . While this leads to asymmetric thresholds and
does not generalize to higher dimensions, it has the benefit of always encompassing the
median value x50 and having equal tail probabilities (i.e. (1 − Y )/2% of the posterior on
each side).
In general, when referring to “credible intervals” throughout the text the percentile
definition should be assumed unless explicitly stated otherwise.
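As a concrete sketch of the percentile definition, the hypothetical helper below (written with NumPy; the name and grid-based setup are illustrative) computes an equal-tail interval from a 1-D posterior evaluated on a grid:

```python
import numpy as np

def percentile_interval(x_grid, posterior, Y=0.68):
    """Equal-tail credible interval [x_low, x_high] containing a fraction Y of a
    1-D posterior evaluated on a grid (percentile definition from Section 3.2)."""
    dx = np.gradient(x_grid)              # grid spacing (handles uneven grids too)
    cdf = np.cumsum(posterior * dx)       # running integral of the posterior
    cdf /= cdf[-1]                        # normalize so the CDF ends at 1
    x_low = np.interp((1.0 - Y) / 2.0, cdf, x_grid)
    x_high = np.interp((1.0 + Y) / 2.0, cdf, x_grid)
    return x_low, x_high

# Sanity check on a standard normal: should return roughly (-1, +1) for Y = 0.68.
x = np.linspace(-6, 6, 4001)
print(percentile_interval(x, np.exp(-0.5 * x**2), Y=0.68))
```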


3.3 Making Predictions


In addition to trying to estimate the underlying parameters of our model, we often also
want to make predictions of other observables or variables that depend on our model
parameters. If we think we know the underlying true model parameters Θ∗ , then this
process is straightforward. Given that we only have access to the posterior distribution
P(Θ) over possible values Θ∗ could take, however, to predict what will happen we will
need to marginalize over this uncertainty.
We can quantify this intuition using the posterior predictive P (D̃|D), which repre-
sents the probability of seeing some new data D̃ based on our existing data D:
$$P(\tilde{D} \mid D) \equiv \int P(\tilde{D} \mid \Theta)\, P(\Theta \mid D)\, d\Theta \equiv \int \tilde{\mathcal{L}}(\Theta)\,\mathcal{P}(\Theta)\, d\Theta = \mathbb{E}_{\mathcal{P}}\left[\tilde{\mathcal{L}}(\Theta)\right] \qquad (13)$$

In other words, for hypothetical data D̃, we want to compute the expected value of the
likelihood L̃(Θ) over all possible values of Θ based on the current posterior P(Θ).
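A minimal grid-based sketch of equation (13), assuming NumPy/SciPy; the posterior and the Gaussian measurement model here are illustrative placeholders rather than results from any particular dataset:

```python
import numpy as np
from scipy.stats import norm

# Posterior P(theta | D) on a grid (an illustrative Gaussian placeholder here).
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]
posterior = norm.pdf(theta, loc=1.0, scale=0.5)

# Posterior predictive for a new datum d_new with Gaussian noise sigma_new:
# P(d_new | D) = integral of L_new(theta) * P(theta | D) dtheta, equation (13).
sigma_new = 0.8
d_new = np.linspace(-4.0, 6.0, 501)
like_new = norm.pdf(d_new[:, None], loc=theta[None, :], scale=sigma_new)
predictive = np.sum(like_new * posterior[None, :], axis=1) * dtheta

# The result is itself a density over possible new observations (integrates to ~1).
print(np.sum(predictive) * (d_new[1] - d_new[0]))
```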

3.4 Comparing Models


One final point of interest in many Bayesian analyses is trying to investigate whether
the data particularly favors any of the model(s) we are assuming in our analysis. Our
choice of priors or the particular way we parameterize the data can lead to substantial
differences in the way we might want to interpret our results.
We can compare two models by computing the Bayes factor:

$$R_{12} \equiv \frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1)\, P(M_1)}{P(D \mid M_2)\, P(M_2)} \equiv \frac{\mathcal{Z}_1 \pi_1}{\mathcal{Z}_2 \pi_2} \qquad (14)$$

where ZM is again the evidence for model M and πM is our prior belief that M is correct
relative to the competing model. Taken together, the Bayes factor R tells us how much
a particular model is favored over another given the observed data, marginalizing over
all possible values of the underlying model parameters ΘM , and our previous relative
confidence in the model.
Again, note that computing ZM requires computing the integral ∫ P̃(Θ) dΘ of the un-
normalized posterior P̃(Θ) over Θ. Combined with the other examples outlined in this
section, it is clear that many common use cases in Bayesian analysis rely on computing
integrals over the (possibly unnormalized) posterior.

Exercise: Noisy Mean Revisited


Setup
Let’s return to our temperature posterior P(T ) from §2. We want to use this result to
derive interesting estimates and constraints on the possible underlying temperature T .


Point Estimates
The mean can be defined as the point estimate Θ̂ that minimizes the expected loss LP (Θ̂)
under squared loss:
Lmean (Θ̂|Θ∗ ) = |Θ̂ − Θ∗ |2
The median can be defined as the point estimate that minimizes LP (Θ̂) under absolute
loss:
Lmed (Θ̂|Θ∗ ) = |Θ̂ − Θ∗ |
And the mode can be defined as the point estimate that minimizes LP (Θ̂) under “catas-
trophic” loss:
Lmode (Θ̂|Θ∗ ) = −δ(|Θ̂ − Θ∗ |)
where δ(·) is the Dirac delta function defined such that
$$\int f(x)\,\delta(x - a)\, dx = f(a)$$

Given these expressions for the mean, median, and mode, estimate the correspond-
ing temperature point estimate Tmean , Tmed , and Tmode from our corresponding posterior.
Feel free to experiment with various analytic and numerical methods to perform these
calculations.
We might expect that the historical data we used for our priors might not hold as
well today if there have been some long-term changes in the average temperature. For
instance, we expect that the average temperature has increased over time, and so we
might not want to penalize hotter temperatures T ≥ Tprior as much as cooler ones T <
Tprior . We can encode this information in an asymmetric loss function such as
$$L(\hat{T} \mid T_*) = \begin{cases} |\hat{T} - T_*|^3 & T < T_{\rm prior} \\ |\hat{T} - T_*| & T \geq T_{\rm prior} \end{cases}$$

What is the optimal point estimate Tasym that minimizes the expected loss in this case?

Credible Intervals
Next, let’s try to quantify the uncertainty. Given the posterior P(T ), compute the 50%,
80%, and 95% credible intervals using posterior thresholds PX . Next, compute these cred-
ible intervals using percentiles. Are there any differences between the credible intervals
computed from the two methods? Why or why not?

Posterior Predictive
To propagate our uncertainties into the next observation, compute the posterior pre-
dictive P (T̂6 |{T̂1 , . . . , T̂5 }) over a range of possible temperature measurements T̂6 for the
next observation given the previous five {T̂1 , . . . , T̂5 }, assuming an uncertainty of σ6 = 0,
σ6 = 0.5, and σ6 = 2.


Model Comparison
Finally, we want to investigate whether our prior appears to be a good assumption. Using
numerical methods, compute the evidence Z for our default prior with mean Tprior = 25
and standard deviation σprior = 1.5. Then compare this to the evidence estimated based on
an alternative prior where we assume the temperature has risen by roughly five degrees
with mean Tprior = 30 but with a corresponding larger uncertainty σprior = 3. Is one model
particularly favored over the other?

4 Approximating Posterior Integrals with Grids


I now want to investigate methods for estimating posterior integrals. While in some cases
(e.g., conjugate priors) these can be computed analytically, this is not true in general.
Properly estimating quantities such as those outlined in §3 therefore requires the use of
numerical methods (as highlighted in the previous exercises).
To start, I will first focus on the case where our integral over Θ is 1-D. In that case, we
can approximate it using standard numerical techniques such as a Riemann sum over a
discrete grid of points:
$$\mathbb{E}_{\mathcal{P}}[f(\Theta)] = \int f(\Theta)\,\mathcal{P}(\Theta)\, d\Theta \approx \sum_{i=1}^{n} f(\Theta_i)\,\mathcal{P}(\Theta_i)\,\Delta\Theta_i \qquad (15)$$

where
∆Θi = Θj+1 − Θj (16)
is simply the spacing between the set of j = 1, . . . , n + 1 points on the underlying grid
and
$$\Theta_i = \frac{\Theta_{j+1} + \Theta_j}{2} \qquad (17)$$
is just defined to be the mid-point between Θj and Θj+1.^1 As shown in Figure 3, this
approach is akin to trying to approximate the integral using a discrete set of n rectangles
with heights of f (Θi )P(Θi ) and widths of ∆Θi .
This idea can be generalized to higher dimensions. In that case, instead of breaking
up the integral into n 1-D segments, we instead can decompose it into a set of n N-D
cuboids. The contribution of each of these pieces is then proportional to the product of
the “height” f (Θi )P(Θi ) and the volume
$$\Delta\Theta_i = \prod_{j=1}^{d} \Delta\Theta_{i,j} \qquad (18)$$

where ∆Θi,j is the width of the ith cuboid in the jth dimension. See Figure 3 for a visual
representation of this procedure.
^1 Choosing Θi to be one of the end-points gives consistent behavior (see §4.3) as the number of grid points n → ∞ but generally leads to larger biases for finite n.


Figure 3: An illustration of how to approximate posterior integrals using a discrete grid of


points. We break up the posterior into contiguous regions defined by a position Θi (e.g.,
an endpoint or midpoint) with corresponding posterior density P(Θi ) and volume ∆Θi
over a grid with i = 1, . . . , n elements. Our integral can then be approximated by adding
up each of these regions proportional to the posterior mass P(Θi )×∆Θi contained within
it. In 1-D (top), these volume elements ∆Θi correspond to line segments while in 2-D
(middle), these correspond to rectangles. This can be generalized to higher dimensions
(bottom), where we instead use N-D cuboids. See §4 for additional details.


Substituting P(Θ) = P̃(Θ)/Z into the expectation value and replacing any integrals
with their grid-based approximations then gives:
$$\mathbb{E}_{\mathcal{P}}[f(\Theta)] = \frac{\int f(\Theta)\,\mathcal{P}(\Theta)\, d\Theta}{\int \mathcal{P}(\Theta)\, d\Theta} = \frac{\int f(\Theta)\,\tilde{\mathcal{P}}(\Theta)\, d\Theta}{\int \tilde{\mathcal{P}}(\Theta)\, d\Theta} \approx \frac{\sum_{i=1}^{n} f(\Theta_i)\,\tilde{\mathcal{P}}(\Theta_i)\,\Delta\Theta_i}{\sum_{i=1}^{n} \tilde{\mathcal{P}}(\Theta_i)\,\Delta\Theta_i} \qquad (19)$$

Note the denominator is now an estimate for the evidence:

$$\mathcal{Z} = \int \tilde{\mathcal{P}}(\Theta)\, d\Theta \approx \sum_{i=1}^{n} \tilde{\mathcal{P}}(\Theta_i)\,\Delta\Theta_i \qquad (20)$$

This substitution of the unnormalized posterior P̃(Θ) for the posterior P(Θ) is a cru-
cial part of computing expectation values in practice since we can compute P̃(Θ) =
L(Θ)π(Θ) directly without knowing Z.
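A minimal 1-D sketch of equations (19) and (20), with NumPy and an arbitrary unnormalized Gaussian posterior, taking f(Θ) = Θ so the expectation is just the posterior mean:

```python
import numpy as np

# Illustrative unnormalized posterior (a Gaussian with mean 1 and width 0.5)
# and the function of interest f(theta) = theta (so E_P[f] is the posterior mean).
def p_tilde(theta):
    return np.exp(-0.5 * (theta - 1.0)**2 / 0.5**2)

def f(theta):
    return theta

# Midpoint grid over a finite range we must choose ourselves (see Section 4.3).
edges = np.linspace(-3.0, 5.0, 201)
theta_mid = 0.5 * (edges[1:] + edges[:-1])   # equation (17): midpoints
d_theta = np.diff(edges)                     # equation (16): spacings

weights = p_tilde(theta_mid) * d_theta       # w_i = P_tilde(theta_i) * dTheta_i
Z_hat = np.sum(weights)                      # equation (20): evidence estimate
E_hat = np.sum(f(theta_mid) * weights) / Z_hat   # equation (19): expectation

print(Z_hat)   # should be close to sqrt(2*pi) * 0.5 ~ 1.25
print(E_hat)   # should be close to 1.0
```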

4.1 The Curse of Dimensionality


While this approach is straightforward, it has one immediate and severe drawback: the
total number of grid points increases exponentially as the number of dimensions increases.
For example, assuming we have roughly k ≥ 2 grid points in each dimension, the total
number of points in our grid scales as
$$n \sim \prod_{j=1}^{d} k = k^d \qquad (21)$$

This means that even in the absolute best case where k = 2, we have 2^d scaling.
This awful scaling is often referred to as the curse of dimensionality. This exponen-
tial dependence turns out to be a generic feature of high-dimensional distributions (i.e.
posteriors of models with larger numbers of parameters) that I will return to later in §7.

4.2 Effective Sample Size


Apart from this exponential scaling with dimensionality, there is a more subtle drawback to
using grids. Since we do not know the shape of the distribution ahead of time, the contri-
bution of each portion of the grid (i.e. each N-D cuboid) can be highly uneven depending
on the structure of the grid. In other words, the effectiveness of this approach not only
depends on the number of grid points n but also where they are allocated. If we do not
specify our grid points well, we can end up with many points located in regions where
P̃(Θ) and/or f (Θ)P̃(Θ) is relatively small. This then implies that their respective sums
will be dominated by a small number of points with much larger relative “weights”. Ide-
ally, we would want to increase the resolution of the grid in regions where the posterior
is large and decrease it elsewhere to mitigate this effect.
Note that our use of the term “weights” in the preceding paragraph is quite deliberate.
Looking back at our original approximation, the form of equation (19) is quite similar to
one which might be used to compute a weighted sample mean of f (Θ). In that case,


where we have n observations {f1 , . . . , fn } with corresponding weights {w1 , . . . , wn }, the weighted mean is simply:
$$\hat{f}_{\rm mean} \equiv \frac{\sum_{i=1}^{n} w_i f_i}{\sum_{i=1}^{n} w_i} \qquad (22)$$

Indeed, if we define
fi ≡ f (Θi ), wi ≡ P̃(Θi )∆Θi (23)
then the connection between the weighted sample mean in equation (22) and the expec-
tation value from our grid in equation (19) becomes explicit:
$$\mathbb{E}_{\mathcal{P}}[f(\Theta)] \approx \frac{\sum_{i=1}^{n} f(\Theta_i)\,\tilde{\mathcal{P}}(\Theta_i)\,\Delta\Theta_i}{\sum_{i=1}^{n} \tilde{\mathcal{P}}(\Theta_i)\,\Delta\Theta_i} \equiv \frac{\sum_{i=1}^{n} w_i f_i}{\sum_{i=1}^{n} w_i} \qquad (24)$$

Thinking about our grid as a set of n samples also allows us to consider an associated
effective sample size (ESS) neff ≤ n. The ESS encapsulates the idea that not all of our
samples contribute the same amount of information: if we have n samples that are very
similar to each other, we expect to have a substantially worse estimate than if we have n
samples that are quite different. This is because the information in correlated samples is
at least partially redundant with one another, with the amount of redundancy increasing
with the strength of the correlation: while two independent samples provide completely
unique information about the distribution and no information about each other, two cor-
related samples instead provide some information about each other at the expense of the
underlying distribution.
Returning to grids, this correspondence means that we can in theory come up with an
estimate of the expectation value EP [f (Θ)] that is at least as good as the one we might
currently have using a smaller number neff ≤ n of grid points if we were able to allocate
them more efficiently. This distinction matters because errors on our estimate of the ex-
pectation value generally scale as a function of neff rather than n. For instance, the error
on the mean typically goes as ∝ neff^{−1/2} rather than ∝ n^{−1/2}.
We can quantify the ideas behind the ESS as discussed above by introducing a formal
definition following Kish (1965):
$$n_{\rm eff} \equiv \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2} \qquad (25)$$

In line with our intuition, the best case under this definition is one where all the weights
are equal (wi = w):
$$n_{\rm eff}^{\rm best} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2} = \frac{(nw)^2}{\sum_{i=1}^{n} w^2} = \frac{n^2 w^2}{n w^2} = n \qquad (26)$$

Likewise, the worst case is one where all the weight is concentrated around a single sam-
ple (wi = w for i = j and wi = 0 otherwise):
$$n_{\rm eff}^{\rm worst} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2} = \frac{w^2}{w^2} = 1 \qquad (27)$$


Figure 4: An example of how changing the spacing (volume elements) of the grid can
dramatically affect its associated estimate of posterior integrals. On a toy 2-D posterior
P(Θ), simply changing the spacing of the associated 2-D 30 × 30 grid dramatically affects
the effective sample size (ESS) (see §4.2). Differences between poor spacing (left), uniform
spacing (middle), and optimal spacing (right) lead to an order of magnitude difference
in the ESS, as highlighted by the distribution of weights (bottom) associated with the
volume elements of each grid. See §4.2 for additional details.


Figure 5: An illustration of how grid-based estimates can be convergent (i.e. converge to


a single value as the number of grid points increases) but not consistent (i.e. the value it
converges to is not the correct answer). Our toy 2-D unnormalized posterior P̃(Θ) has
two modes that are well-separated with a total evidence of Z = 200. If we are not aware
of the second mode, we might define a grid region that only encompasses a subset of the
entire parameter space (left). While increasing the resolution of the grid within this region
allows the estimated Z to converge to a single answer (left to right), this is not equal to
the correct answer of Z = 200 because we have neglected the contribution of the other
component (right). See §4.3 for additional details.

This former situation (with neff^best) would be the case where each element of our grid has roughly the same contribution to the integral, while the latter (with neff^worst) would be where the entire integral is essentially contained in just one of our n N-D cuboid regions. An illustration of this behavior is shown in Figure 4.
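In practice the Kish estimator in equation (25) is a one-liner; a small sketch (the function name is just for illustration) applied to the best- and worst-case weight patterns above:

```python
import numpy as np

def effective_sample_size(weights):
    """Kish (1965) effective sample size, equation (25)."""
    w = np.asarray(weights, dtype=float)
    return w.sum()**2 / np.sum(w**2)

print(effective_sample_size(np.ones(100)))        # equal weights: n_eff = 100 = n
print(effective_sample_size([1.0] + [0.0] * 99))  # one dominant weight: n_eff = 1
```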

4.3 Convergence and Consistency


Now that I have outlined the relationship between the structure of our grid and the ESS,
I want to examine two final issues: convergence and consistency. Convergence is the
idea that, while our estimate using n samples (grid points) might be noisy, it approaches
some fiducial value as n → ∞:
$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} f(\Theta_i)\,\tilde{\mathcal{P}}(\Theta_i)\,\Delta\Theta_i}{\sum_{i=1}^{n} \tilde{\mathcal{P}}(\Theta_i)\,\Delta\Theta_i} = C \qquad (28)$$

Consistency is subsequently the idea that the value we converge to is the true value we
are interested in estimating:
$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} f(\Theta_i)\,\tilde{\mathcal{P}}(\Theta_i)\,\Delta\Theta_i}{\sum_{i=1}^{n} \tilde{\mathcal{P}}(\Theta_i)\,\Delta\Theta_i} = \mathbb{E}_{\mathcal{P}}[f(\Theta)] \qquad (29)$$

It is straightforward to show that if the expectation value is well-defined (i.e. it exists)


and the grid covers the entire domain of Θ (i.e. spans the smallest and largest possible


values in every dimension) then using a grid is a consistent way to estimate the expecta-
tion value. This should make intuitive sense: provided our grid is expansive enough in Θ
so that we’re not “missing” any region of parameter space, we should be able to estimate
EP [f (Θ)] to arbitrary precision by simply increasing the resolution in ∆Θ.
Unfortunately, we do not know beforehand what range of values of Θ our grid should
span. While parameters can range over (−∞, +∞), grids rely on finite-volume elements
and so we have to choose some finite sub-space to grid up. So while grids may give es-
timates that converge to some value over the range spanned by the grid points, there is
always a possibility that a significant portion of the posterior lies outside that range. In
these cases, grids are not guaranteed to be consistent estimators of EP [f (Θ)]. An illustra-
tion of this issue is shown in Figure 5. This fundamental problem is not shared by Monte
Carlo methods, which I will cover in §5.

Exercise: Grids over a 2-D Gaussian


Setup
Consider an unnormalized posterior well-approximated by a 2-D Gaussian (Normal) dis-
tribution centered on (µx , µy ) with standard deviations (σx , σy ):
$$\tilde{\mathcal{P}}(x, y) = \exp\left[-\frac{1}{2}\left(\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right)\right]$$
Assume that we expect to find our posterior has a mean of 0 and a standard deviation of
1. In reality, however, our posterior actually has means (µx , µy ) = (−0.3, 0.8) and standard
deviations (σx2 , σy2 ) = (2, 0.5), mimicking the common case where our prior expectations
and posterior inferences somewhat disagree.

Grid-based Estimation
We want to use a 2-D grid to estimate various forms of posterior integrals. Starting with
an evenly-spaced 5 × 5 grid from [−2, 2], compute:
1. the evidence Z,
2. the means EP [x] and EP [y],
3. the 68% credible intervals (or closest approximation) [xlow , xhigh ] and [ylow , yhigh ],
4. and the effective sample size neff .
How accurate is each of these quantities compared to the values we might expect? What does
neff /n tell us about how efficiently we have allocated our grid points?
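One possible starting point for the grid-based part of this exercise (credible intervals are left to the reader; note the assumption, flagged in the comments, that the quoted values are variances):

```python
import numpy as np

# True posterior parameters from the setup above; the quoted (sigma_x^2, sigma_y^2)
# are read here as variances, which is an assumption about the intended notation.
mu_x, mu_y = -0.3, 0.8
var_x, var_y = 2.0, 0.5

def p_tilde(x, y):
    return np.exp(-0.5 * ((x - mu_x)**2 / var_x + (y - mu_y)**2 / var_y))

# Evenly-spaced k x k grid of midpoints over [-2, 2] in each dimension.
k = 5
edges = np.linspace(-2.0, 2.0, k + 1)
mid = 0.5 * (edges[1:] + edges[:-1])
d = np.diff(edges)[0]
X, Y = np.meshgrid(mid, mid, indexing="ij")

w = p_tilde(X, Y) * d * d               # weights w_i = P_tilde(Theta_i) * dTheta_i
Z_hat = w.sum()                         # evidence estimate (equation 20)
E_x = np.sum(X * w) / w.sum()           # E_P[x] (equation 19 with f = x)
E_y = np.sum(Y * w) / w.sum()           # E_P[y]
n_eff = w.sum()**2 / np.sum(w**2)       # Kish effective sample size (equation 25)

print(Z_hat, E_x, E_y, n_eff, n_eff / k**2)
```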

Convergence
Repeat the above exercise using an evenly-spaced grid of 20 × 20 points and 100 × 100
points. Comment on any differences. How much has the overall accuracy improved? Do
the estimates appear convergent?


Consistency
Next, expand the bounds of the grid to be from [−5, 5] and perform the same exercise
as above. Do the answers change substantially? If so, what does this tell us about the
consistency of our previous estimates? Adjust the density and bounds of the grid until
the answers appear both convergent and consistent. Remember that we do not know the
exact shape of the posterior ahead of time. What does this imply about general concerns
when applying grids in practice?

Effective Sample Size


Finally, explore whether there is a straightforward scheme to adjust the locations of the x
and y grid points to maximize the effective sample size based on the definition outlined
in §4.2. If so, can you explain why it works? If not, why not? Compared to equivalent
evenly-spaced grids, how much can adaptively adjusting the grid spacing improve neff
and the overall accuracy of our estimates?

5 From Grids to Monte Carlo Methods


5.1 Connecting Grid Points and Samples
Earlier, I outlined how we can relate estimating EP [f (Θ)] using a grid of n points to an
equivalent estimate using a set of n samples {f1 , . . . , fn } and a series of associated weights
{w1 , . . . , wn }. The main result is that there is an intimate connection between the structure
of the posterior and the grid to the relative amplitude of the weights wi ≡ P̃(Θi )∆Θi for
each point fi ≡ f (Θi ). Adjusting the resolution of the grid then affects these weights,
with a more uniform distribution of weights leading to a larger ESS which can improve
our estimate.
The fact that decreasing the spacing (making the grid denser) also decreases the weights
makes sense: we have more points located in that region, so each point should in general
get less relative weight when computing EP [f (Θ)]. Likewise, if we have the same spacing
but change the relative shape of the posterior, the weight of that point when estimating
EP [f (Θ)] should also change accordingly.
I now want to extend this basic relationship further. In theory, adaptively increasing
the resolution of our grid allows us more control over the volume elements ∆Θi used to
derive our weights. If we knew the shape of our posterior sufficiently well, for large n we
should in theory be able to adjust ∆Θi such that the weights wi = P̃(Θi )∆Θi are uniform
to some amount of desired precision. By inspection, this should happen when

$$\Delta\Theta_i \propto \frac{1}{\tilde{\mathcal{P}}(\Theta_i)} \qquad (30)$$

for all i.
Taking this reasoning to its conceptual limit, as n → ∞ we can imagine estimating the
posterior using a larger and larger number of grid points whose spacing ∆Θ changes as


a function of Θ. Using this, we can now define the density of points Q(Θ) based on the
varying resolution ∆Θ(Θ) of our infinitely-fine grid as a function of Θ:

$$Q(\Theta) \propto \frac{1}{\Delta\Theta(\Theta)} \qquad (31)$$

This result suggests that, in the continuum limit where n → ∞, the structure of our
infinite-resolution grid is equivalent to a new continuous distribution Q(Θ). An illus-
tration of this concept is shown in Figure 6. Using Q(Θ), we can then rewrite our original
expectation value as
$$\mathbb{E}_{\mathcal{P}}[f(\Theta)] \equiv \frac{\int f(\Theta)\,\tilde{\mathcal{P}}(\Theta)\, d\Theta}{\int \tilde{\mathcal{P}}(\Theta)\, d\Theta} = \frac{\int f(\Theta)\,\frac{\tilde{\mathcal{P}}(\Theta)}{Q(\Theta)}\, Q(\Theta)\, d\Theta}{\int \frac{\tilde{\mathcal{P}}(\Theta)}{Q(\Theta)}\, Q(\Theta)\, d\Theta} = \frac{\mathbb{E}_{Q}\left[f(\Theta)\,\tilde{\mathcal{P}}(\Theta)/Q(\Theta)\right]}{\mathbb{E}_{Q}\left[\tilde{\mathcal{P}}(\Theta)/Q(\Theta)\right]} \qquad (32)$$

For reasons that will soon become clear, I will refer to Q(Θ) as the proposal distribution.
At this point, this may mostly seem like a mathematical trick: all I have done is rewrite
our original single expectation value with respect to the (unnormalized) posterior P̃(Θ)
in terms of two expectation values with respect to the proposal distribution Q(Θ). This
substitution, however, actually allows us to fully realize the connection between grid
points and samples.
Earlier, I showed that the estimate for the expectation value from grid points is ex-
actly analogous to the estimate we would derive assuming the grid points were random
samples {f1 , . . . , fn } with associated weights {w1 , . . . , wn }. Once we have defined our ex-
pectation with respect to Q(Θ), however, this statement can become exact assuming we
can explicitly generate samples from Q(Θ).
Let’s quickly review what this means. Initially, we looked at trying to estimate EP [f (Θ)]
over a grid with n points. In the limit of infinite resolution, however, our grid becomes
equivalent to some distribution Q(Θ). Using Q(Θ), we can then rewrite our original ex-
pression in terms of two expectations, EQ [f (Θ)P̃(Θ)/Q(Θ)] and EQ [P̃(Θ)/Q(Θ)], over
Q(Θ) instead of P(Θ). This helps us because we can in theory estimate these final ex-
pressions explicitly using a series of n randomly generated samples from Q(Θ). Due to
the randomness inherent in this approach, this is commonly referred to as a Monte Carlo
approach for estimating EP [f (Θ)] due to a historical connection with randomness and
gambling (cite).
On the face of it, this is a surprising claim. When we compute an integral of a function
f (Θ) on a bounded grid, we know that there is some error in our approximation having to
do with the discretization of the grid. This error is entirely deterministic: given a number
of grid points n and a particular discretization density Q(Θ) ∝ 1/∆Θ(Θ), we will get
the same result (and error) for EP [f (Θ)] every time.
In comparison to this, drawing n samples {Θ1 , . . . , Θn } from Q(Θ) is an inherently
random (i.e. stochastic) process that looks nothing like a grid of points. And because
these points are inherently random, the actual deviation between our estimate and the
true value of EP [f (Θ)] will also be random. The “error” from random samples then tells
us something about how much we expect our estimate can differ over many possible


Figure 6: An illustration of the connection between grids and continuous density distri-
butions. As we increase the number of grid points, our estimate of the posterior P(Θ)
improves (top). Since the spacing between the grid points varies to maximize the effec-
tive sample size (see Figure 4 and §4.2), the differential volume elements ∆Θi change
depending on our location (middle). As we continue to increase the number of volume
elements, the density of grid points at any particular location ρ(Θi ) = [∆Θi ]^{−1} behaves
like a continuous function Q(Θ) whose distribution is similar to P(Θ) (bottom). This im-
plies we should be able to use Q(Θ) in some way to estimate P(Θ). See §5 for additional
details.


realizations of our random process given a particular number of samples n generated


from Q(Θ). The fact that we can derive roughly equivalent estimates from these
very different approaches as we adjust n and Q(Θ) lies at the heart of the connection
between grid points and samples.
There are three primary benefits from moving from an adaptively-spaced grid to a
continuous distribution Q(Θ). First, a grid will always have some minimum resolution
∆Θi that makes it difficult to get our weights to be roughly uniform, limiting our maxi-
mum ESS in practice. By contrast, we can in theory get Q(Θ) to more closely match the
posterior P(Θ), giving a larger ESS at fixed n.
Second, because we are now working with distributions rather than a finite number
of grid points, we are no longer limited to some finite volume when estimating expecta-
tions. Since distributions can range over (−∞, +∞), we can guarantee Q(Θ) will provide
sufficient coverage over all possible Θ values that our posterior P(Θ) could be defined
over. This means that some of the theoretical issues raised in §4.3 associated with apply-
ing grids to posteriors that range over (−∞, +∞) no longer apply. Monte Carlo methods
therefore can serve as a consistent estimator for a wider range of possible posterior expec-
tations than grid-based methods, making them substantially more flexible.
Finally, the minimum number of grid points always scales exponentially with dimen-
sionality (see §4.1), regardless of how many parameters we are interested in marginaliz-
ing over. Since Monte Carlo methods do not rely on grids, they can take full advantage of
marginalizing over parameters when estimating expectations EP [f (Θ)]. They are there-
fore less susceptible to this effect (although see §7.2).

5.2 Importance Sampling


As I have tried to emphasize previously, the core tenet of this article is that we do not
know what P(Θ) looks like beforehand. This means we do not know what grid structure will
provide an optimal estimate (i.e. maximum ESS) for EP [f (Θ)], let alone how this should
behave as Q(Θ) in the continuum limit. This gives us ample motivation to choose Q(Θ) in
such a way to make generating samples from it easy and straightforward.
Assuming we have chosen such a Q(Θ), we can subsequently generate a series of
n samples from it. Assuming these samples have weights qi associated with them and
defining
f (Θi ) ≡ fi , P̃(Θi )/Q(Θi ) ≡ w̃(Θi ) ≡ w̃i (33)
our original expression reduces to
$$\mathbb{E}_{\mathcal{P}}[f(\Theta)] = \frac{\mathbb{E}_{Q}[f(\Theta)\,\tilde{w}(\Theta)]}{\mathbb{E}_{Q}[\tilde{w}(\Theta)]} \approx \frac{\sum_{i=1}^{n} f_i \tilde{w}_i q_i}{\sum_{i=1}^{n} \tilde{w}_i q_i} \qquad (34)$$

If we further assume that we have chosen Q(Θ) so that we can simulate samples that are
independent and identically distributed (iid) (i.e. each sample has the same proba-
bility distribution as the others and all the samples are mutually independent), then the
corresponding sample weights immediately reduce to qi = 1/n and our result becomes
$$\mathbb{E}_{\mathcal{P}}[f(\Theta)] \approx \frac{n^{-1}\sum_{i=1}^{n} f_i \tilde{w}_i}{n^{-1}\sum_{i=1}^{n} \tilde{w}_i} \qquad (35)$$


Figure 7: A schematic illustration of Importance Sampling. First, we take a given proposal


distribution Q(Θ) (left) and generate a set of n iid samples from it (middle left). We then
weight each sample based on the corresponding “importance” P̃(Θ)/Q(Θ) it has at that
location (middle right). We then can use these weighted samples to approximate posterior
expectations (right). See §5.2 for additional details.

As with the previous case using grids (§4), the denominator of this expression is again a
direct approximation for the evidence
$$\mathcal{Z} = \int \tilde{\mathcal{P}}(\Theta)\, d\Theta \approx n^{-1}\sum_{i=1}^{n} \tilde{w}_i \qquad (36)$$

This gives a straightforward recipe for estimating our original expectation value:
1. Draw n iid samples {Θ1 , . . . , Θn } from Q(Θ).

2. Compute their corresponding weights w̃i = P̃(Θi )/Q(Θi ).

3. Estimate EP [f (Θ)] by computing EQ [w̃(Θ)] and EQ [f (Θ)w̃(Θ)] using the weighted sample means.
Since this process just involves “reweighting” the samples based on w̃i , these weights are
often referred to as importance weights and the method as Importance Sampling. A
schematic illustration of Importance Sampling is highlighted in Figure 7.
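A minimal sketch of this three-step recipe for an illustrative 1-D problem, where the target and proposal are arbitrary Gaussians chosen so the true answers are known:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative 1-D target: unnormalized Gaussian ~ N(1, 0.5); true Z ~ 1.25.
def p_tilde(theta):
    return np.exp(-0.5 * (theta - 1.0)**2 / 0.5**2)

# Proposal Q = N(0, 1), chosen because it is easy to sample from and evaluate.
def q_pdf(theta):
    return np.exp(-0.5 * theta**2) / np.sqrt(2.0 * np.pi)

# Step 1: draw n iid samples from the proposal.
n = 10_000
theta = rng.normal(loc=0.0, scale=1.0, size=n)

# Step 2: compute the importance weights w_i = P_tilde(theta_i) / Q(theta_i).
w = p_tilde(theta) / q_pdf(theta)

# Step 3: weighted sample means give the evidence and posterior expectations.
Z_hat = np.mean(w)                        # equation (36)
E_theta = np.sum(theta * w) / np.sum(w)   # equation (35) with f(theta) = theta
n_eff = np.sum(w)**2 / np.sum(w**2)       # Kish ESS of the weighted samples

print(Z_hat, E_theta, n_eff / n)
```

Swapping in a different proposal only changes how the samples are drawn and how Q is evaluated; the final weighted-average lines stay the same.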
We can interpret the importance weights as ways to correct for how “far off” our
original guess Q(Θ) is from the truth P(Θ). If the posterior density is higher at position
Θi relative to the proposal density, then we were less likely to generate a sample at that
position compared to what we would have seen if we had drawn samples directly from
the posterior. As a result, we should increase its corresponding weight to account for this
expected deficit of samples at a given position. If the posterior density is lower relative to


the proposal density, then the alternative is true and we want to lower the weight of the
corresponding sample to account for the expected excess of samples at a given position.

5.3 Examples of Sampling Strategies


Importance Sampling serves as a useful first step for understanding how the weights
{w̃1 , . . . , w̃n } for the corresponding set of n samples are related to different Monte Carlo
sampling strategies.
As an example, one common approach is to generate samples uniformly within some
cuboid with volume V . The proposal distribution for this will then be
$$Q^{\rm unif}(\Theta) = \begin{cases} 1/V & \Theta \text{ in cuboid} \\ 0 & \text{otherwise} \end{cases} \qquad (37)$$

The corresponding importance weights subsequently will just be proportional to the pos-
terior at a given position:

$$\tilde{w}_i^{\rm unif} = \frac{\tilde{\mathcal{P}}(\Theta_i)}{Q^{\rm unif}(\Theta_i)} = V\,\tilde{\mathcal{P}}(\Theta_i) \propto \mathcal{P}(\Theta_i) \qquad (38)$$
Another possible approach would be if we instead take our proposal to be our prior:
$$Q^{\rm prior}(\Theta) = \pi(\Theta) \qquad (39)$$
This seems like a well-motivated choice: the prior characterizes our knowledge before
looking at the data, so it should serve as a useful first guess and encompass the range
of all possibilities. Under this assumption, we now find our weights will be equal to the
likelihood L(Θ) at each position:

$$\tilde{w}_i^{\rm prior} = \frac{\tilde{\mathcal{P}}(\Theta_i)}{Q^{\rm prior}(\Theta_i)} = \frac{\mathcal{L}(\Theta_i)\,\pi(\Theta_i)}{\pi(\Theta_i)} = \mathcal{L}(\Theta_i) \qquad (40)$$
Finally, notice that the optimal sampling strategy is to assume that we can take our
proposal to be identical to our posterior:
$$Q^{\rm post}(\Theta) = \mathcal{P}(\Theta) \qquad (41)$$
The corresponding weights will then just be constant and equal to the evidence Z:
$$\tilde{w}_i^{\rm post} = \frac{\tilde{\mathcal{P}}(\Theta_i)}{Q^{\rm post}(\Theta_i)} = \frac{\mathcal{Z}\,\mathcal{P}(\Theta_i)}{\mathcal{P}(\Theta_i)} = \mathcal{Z} \qquad (42)$$
As expected, this final result guarantees the maximum possible ESS of neff = n. Get-
ting Q(Θ) to be as “close” as possible to P(Θ) therefore becomes a crucial part of anal-
yses when trying to use Importance Sampling to estimate expectation values. It is this
result in particular that motivates the use of Markov Chain Monte Carlo (MCMC) meth-
ods discussed from §6 onward: if we can somehow generate samples directly from P(Θ)
or something close to it, then we can achieve an optimal estimate of our corresponding
expectation values.


Exercise: Importance Sampling over a 2-D Gaussian


Setup
Let’s return to our exercise from §4, in which our unnormalized posterior is well-approximated
by a 2-D Gaussian (Normal) distribution:

$$\tilde{\mathcal{P}}(x, y) = \exp\left[-\frac{1}{2}\left(\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right)\right]$$

where (µx , µy ) = (−0.3, 0.8) and (σx2 , σy2 ) = (2, 0.5).

Importance Sampling
We want to use Importance Sampling to approximate various posterior integrals from
this distribution. We will start by choosing our proposal distribution Q(x, y) to be a 2-D
Gaussian with a mean of 0 and standard deviation of 1:

Q(x, y) = N [(µx , µy ) = (0, 0), (σx , σy ) = (1, 1)]

Using n = 25 iid random samples drawn from the proposal distribution, compute an
estimate for:

1. the evidence Z,

2. the means EP [x] and EP [y],

3. the 68% credible intervals (or closest approximation) [xlow , xhigh ] and [ylow , yhigh ],

4. and the effective sample size neff .

How accurate is each of these quantities compared to the values we might expect? What does
neff /n tell us about how well our proposal Q(x, y) traces the underlying posterior P(x, y)?

Uncertainty
Repeat the above exercise m = 100 times to get an estimate for how much our estimates
of each quantity can vary. Is the variation in line with what might be expected given the
typical effective sample size? Why or why not?

Convergence
Now repeat the above exercise using n = 100, n = 1000, and n = 10000 points rather
than n = 25 points and comment on any differences. How much has the overall accuracy
improved? Do the estimates appear convergent and consistent as neff increases? How
much do the errors on quantities shrink as a function of n and/or neff ? Is this behavior
expected? Why or why not?


Consistency
Next, let’s expand our proposal distribution to instead have (σx , σy ) = (2, 2) to get more
coverage in the “tails” of the posterior. Perform the same exercise as above with n =
{100, 1000, 10000} iid random samples. Do the answers change substantially? Why or
why not?
While in theory we can choose Q(x, y) ≈ P(x, y) so that neff ≈ n, we do not know the
exact shape of the posterior ahead of time. Given that P̃(x, y) may differ from our initial
expectations, what does this exercise imply about general concerns applying Importance
Sampling in practice?

6 Markov Chain Monte Carlo


Now that we see how the weights relate to various Monte Carlo sampling strategies (e.g.,
generating samples from the prior), I will outline the idea behind Markov Chain
Monte Carlo (MCMC). In brief, MCMC methods try to generate samples in such a way
that the importance weights {w̃1 , . . . , w̃n } associated with each sample are constant. Based
on the results from §5.3, this means MCMC seeks to generate samples proportional to the
posterior P(Θ) in order to arrive at an optimal estimate for our expectation value.
MCMC accomplishes this by creating a chain of (correlated) parameter values {Θ1 →
· · · → Θn } over n iterations such that the number of iterations m(Θi ) spent in any par-
ticular region δΘi centered on Θi is proportional to the posterior density P(Θi ) contained
within that region. In other words, the “density” of samples generated from MCMC
$$\rho(\Theta) \equiv \frac{m(\Theta)}{n} \qquad (43)$$
at position Θ integrated over δΘ is approximately
$$\int_{\Theta \in \delta\Theta} \mathcal{P}(\Theta)\, d\Theta \approx \int_{\Theta \in \delta\Theta} \rho(\Theta)\, d\Theta \approx n^{-1}\sum_{j=1}^{n} \mathbf{1}\left[\Theta_j \in \delta\Theta\right] \qquad (44)$$

where 1 [·] is the indicator function which evaluates to 1 if the inside condition is true and
0 otherwise. We can therefore approximate the density by simply adding up the number
of samples within δΘ and normalizing by the total number of samples n. A schematic
illustration of this concept is shown in Figure 8.
While this will just be approximately true for any finite n, as the number of samples
n → ∞ this procedure generally guarantees that ρ(Θ) → P(Θ) everywhere.^2 In theory
then, once we have a reasonable enough approximation for ρ(Θ), we can also use the
samples {Θ1 → · · · → Θn } generated from ρ(Θ) to get an estimate for the evidence using
the same substitution trick introduced in §5:
$$\mathcal{Z} = \int \frac{\tilde{\mathcal{P}}(\Theta)}{\rho(\Theta)}\, \rho(\Theta)\, d\Theta \equiv \mathbb{E}_{\rho}\left[\tilde{\mathcal{P}}(\Theta)/\rho(\Theta)\right] \approx n^{-1}\sum_{i=1}^{n} \frac{\tilde{\mathcal{P}}(\Theta_i)}{\rho(\Theta_i)} \qquad (45)$$
^2 Discussing the details of exactly when/where this condition holds in theory and in practice is beyond the scope of this paper but can be found in other references such as Asmussen & Glynn (2011) and Brooks et al. (2011).


Figure 8: A schematic illustration of Markov Chain Monte Carlo (MCMC). MCMC tries
to create a chain of n (correlated) samples {Θ1 → · · · → Θn } (top) such that the number of
samples m in some particular volume δ gives a relative density m/n (middle) comparable
to the posterior P(Θ) integrated over the same volume (bottom). See §6 for additional
details.


This is just the average of the ratio between P̃(Θi ) and ρ(Θi ) over all n samples.
Finally, since our MCMC procedure gives us a series of n samples from the posterior,
our expectation value simply reduces to
$$\mathbb{E}_{\mathcal{P}}[f(\Theta)] \approx \frac{n^{-1}\sum_{i=1}^{n} f_i \tilde{w}_i}{n^{-1}\sum_{i=1}^{n} \tilde{w}_i} = \frac{n^{-1}\sum_{i=1}^{n} f_i}{n^{-1}\sum_{i=1}^{n} 1} = n^{-1}\sum_{i=1}^{n} f_i \qquad (46)$$

This is just the sample mean of the corresponding {f1 , . . . , fn } values over our set of n
samples.
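A small sketch of how equations (45) and (46) might look in practice for a 1-D problem; the "chain" here is a stand-in of iid draws rather than output from a real sampler, and the histogram-based ρ(Θ) is only a crude illustration of a density estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "chain": iid draws from N(1, 0.5) instead of real MCMC output, so that
# the true answers are known. The unnormalized posterior matches this target.
chain = rng.normal(loc=1.0, scale=0.5, size=50_000)
def p_tilde(theta):
    return np.exp(-0.5 * (theta - 1.0)**2 / 0.5**2)

# Equation (46): posterior expectations are simple sample means over the chain.
E_theta = np.mean(chain)
E_theta_sq = np.mean(chain**2)

# Equation (45): a crude evidence estimate using a histogram as rho(theta).
counts, edges = np.histogram(chain, bins=100, density=True)
idx = np.clip(np.digitize(chain, edges) - 1, 0, len(counts) - 1)
rho = counts[idx]                         # rho(theta_i) for each sample
Z_hat = np.mean(p_tilde(chain) / rho)     # average of P_tilde / rho

print(E_theta, Z_hat)   # roughly 1.0 and sqrt(2*pi)*0.5 ~ 1.25
```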
I wish to take a moment here to highlight two features of the above results related
to common misconceptions surrounding MCMC methods. First, there is a widespread
belief that because MCMC methods generate a chain of samples whose behavior follows
the posterior, we do not have any ability to use them to estimate normalizing constants
such as the evidence Z. As shown above, this is not true at all: not only can we do
this using ρ(Θ), but the estimate we derive is actually a consistent one (although it will
converge slowly; see §7.1).
The second misconception is that the primary goal of MCMC is to “approximate” or
“explore” the posterior. In other words, to estimate ρ(Θ). However, as shown above,
the ability of MCMC methods to estimate ρ(Θ) is really only useful for estimating the
evidence Z. In fact, by tracing its heritage from Importance Sampling-based methods,
we see its primary purpose is actually to estimate expectation values (i.e. integrals over the
posterior). I have explicitly tried to avoid introducing any mention of “approximating the
posterior” up to this point in order to avoid this misconception, but will spend some time
discussing this point in more detail in §7.1.
To summarize, the idea behind MCMC is to simulate a series of values {Θ1 → · · · →
Θn } in a way that their density ρ(Θ) after a given amount of time follows the underly-
ing posterior P(Θ). We can then estimate the posterior within any particular region δΘ
by simply counting up how many samples we simulate there and normalizing by the to-
tal number of samples n we generated. Because we are also simulating values directly
from the posterior, any expectation values also reduce to simple sample averages. This
procedure is incredibly intuitive and part of the reason MCMC methods have become so
widely adopted.

6.1 Generating Samples with the Metropolis-Hastings Algorithm


There is a vast literature on various approaches to generating samples (see, e.g., cites).
Since this article focuses on building up a conceptual understanding of MCMC methods,
exploring how the majority of these methods behave both in theory and in practice is
beyond the scope of this paper.
Instead of an overview, I aim to clarify the basics of how these methods operate. The
central idea is that we want a way to generate new samples Θi → Θi+1 such that the
distribution of the final samples ρ(Θ) as n → ∞ (1) is stationary (i.e. it converges to
something) and (2) is equal to the posterior P(Θ). These are essentially analogs to the convergence
and consistency constraints discussed in §4.3.


We can satisfy the first condition by invoking detailed balance. This is the idea that
probability is conserved when moving from one position to another (i.e. the process is
reversible). More formally, this just reduces to factoring of probability:

P (Θi+1 |Θi )P (Θi ) = P (Θi+1 , Θi ) = P (Θi |Θi+1 )P (Θi+1 ) (47)

where P (Θi+1 |Θi ) is the probability of moving from Θi → Θi+1 and P (Θi |Θi+1 ) is the
probability of the reverse move from Θi+1 → Θi . Rearranging then gives the following
constraint:
\frac{P(\Theta_{i+1}|\Theta_i)}{P(\Theta_i|\Theta_{i+1})} = \frac{P(\Theta_{i+1})}{P(\Theta_i)} = \frac{\mathcal{P}(\Theta_{i+1})}{\mathcal{P}(\Theta_i)} \quad (48)
where the final equality comes from the fact that the distribution we are trying to generate
samples from is the posterior P(Θ).
We now need to implement a procedure that enables us to actually move to new po-
sitions by computing this probability. We can do this by breaking each move into two
steps. First, we want to propose a new position Θi → Θ0i+1 based on a proposal distri-
bution Q(Θ′_{i+1}|Θ_i) similar in nature to the Q(Θ) used in Importance Sampling (§5.2).
Then we will either decide to accept the new position (Θi+1 = Θ0i+1 ) or reject the new po-
sition (Θi+1 = Θi ) with some transition probability T (Θ0i+1 |Θi ). Combining these terms
together then gives us the probability of moving to a new position:

P (Θi+1 |Θi ) ≡ Q(Θi+1 |Θi )T (Θi+1 |Θi) (49)

As with Importance Sampling, we can choose Q(Θ0i+1 |Θi ) so that it is straightforward


to propose new samples Θ0i+1 by numerical simulation. We then need to determine the
transition probability T (Θ0i+1 |Θi ) of whether we should accept or reject Θ0i+1 . Substituting
into our expression for detailed balance, we find that our form for the transition proba-
bility must satisfy the following constraint:

\frac{T(\Theta_{i+1}|\Theta_i)}{T(\Theta_i|\Theta_{i+1})} = \frac{\mathcal{P}(\Theta_{i+1})}{\mathcal{P}(\Theta_i)} \frac{Q(\Theta_i|\Theta_{i+1})}{Q(\Theta_{i+1}|\Theta_i)} \quad (50)

It is straightforward to show that the Metropolis criterion (Metropolis et al., 1953)

T(\Theta_{i+1}|\Theta_i) \equiv \min\left[1, \frac{\mathcal{P}(\Theta_{i+1})}{\mathcal{P}(\Theta_i)} \frac{Q(\Theta_i|\Theta_{i+1})}{Q(\Theta_{i+1}|\Theta_i)}\right] \quad (51)

satisfies this constraint.


Generating samples following this approach can be done using the Metropolis-Hastings
(MH) Algorithm (Metropolis et al., 1953; Hastings, 1970):

1. Propose a new position Θi → Θ0i+1 by generating a sample from the proposal distri-
bution Q(Θ0i+1 |Θi ).
2. Compute the transition probability T(\Theta'_{i+1}|\Theta_i) = \min\left[1, \frac{\mathcal{P}(\Theta'_{i+1})}{\mathcal{P}(\Theta_i)} \frac{Q(\Theta_i|\Theta'_{i+1})}{Q(\Theta'_{i+1}|\Theta_i)}\right].

3. Generate a random number ui+1 from [0, 1].


Figure 9: A schematic illustration of the Metropolis-Hastings algorithm. At a given it-


eration i, we have generated a chain of samples {Θ1 → · · · → Θi } (white) up to the
current position Θi (red) whose behavior follows the underlying posterior P(Θ) (viridis
color map). We then propose a new position Θ0i+1 (yellow) from the proposal distribution
(orange shaded region). We then compute the transition probability T (Θ0i+1 |Θi ) (white)
based on the posterior P(Θ) and proposal Q(Θ′|Θ) densities. We then generate a random
number ui+1 uniformly from 0 to 1. If ui+1 ≤ T (Θ0i+1 |Θi ), we accept the move and make
our next position in the chain Θi+1 = Θ0i+1 . If we reject the move, then Θi+1 = Θi . See
§6.1 for additional details.

4. If ui+1 ≤ T (Θ0i+1 |Θi ), accept the move and set Θi+1 = Θ0i+1 . If ui+1 > T (Θ0i+1 |Θi ),
reject the move and set Θi+1 = Θi .

5. Increment i = i + 1 and repeat this process.

See Figure 9 for a schematic illustration of this process.
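The following is a minimal sketch of this procedure in Python. It assumes a user-supplied function `log_post_unnorm` returning ln P̃(Θ) and uses a symmetric Gaussian proposal, so the Q-ratio in the transition probability cancels; the step scale `sigma_prop` is a tuning parameter left to the reader.

import numpy as np

def metropolis_hastings(log_post_unnorm, theta0, n_steps=10000,
                        sigma_prop=1.0, rng=None):
    rng = rng or np.random.default_rng()
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    logp = log_post_unnorm(theta)
    chain = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        # 1. Propose a new position from Q(theta'|theta).
        theta_prop = theta + sigma_prop * rng.standard_normal(theta.size)
        logp_prop = log_post_unnorm(theta_prop)
        # 2-4. Accept with probability min[1, P(theta')/P(theta)] (log space).
        if np.log(rng.uniform()) <= logp_prop - logp:
            theta, logp = theta_prop, logp_prop
        chain[i] = theta  # a rejection repeats the current position
    return chain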


Because algorithms like the MH algorithm generate a chain of states where the next
proposed position only depends on the current position rather than any of its past posi-
tions (i.e. it “forgets” the past), they are known as Markov processes. Combining these
two terms with the Monte Carlo nature of simulating new positions is what gives Markov
Chain Monte Carlo (MCMC) its namesake.
An issue with generating a chain of samples in practice is the fact that our chain only
has finite length and a starting position Θ0 . If our chain were infinitely long, we would
expect it to visit every possible position in parameter space, rendering the exact starting
position unimportant. However, since in practice we terminate sampling after only n
iterations, starting from a location Θ0 that has an extremely low probability means an
inordinate fraction of our n samples will occupy this low-probability region, possibly


biasing our final results. Since we have limited knowledge beforehand about where Θ0 is
relative to our posterior, in practice we generally want to remove the initial chain of states
once we are confident our chain has begun sampling from higher-probability regions.
Discussing various approaches for identifying and removing samples from this burn-in
period is beyond the scope of this article; for additional information, please see Gelman &
Rubin (1992), Gelman et al. (2013), and Vehtari et al. (2019) along with references therein.
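As one crude illustration (not a substitute for the diagnostics cited above), the sketch below simply discards the initial portion of a chain whose log-posterior values are still well below those reached later on; the reference quantile and the use of the second half of the chain are arbitrary assumptions.

import numpy as np

def remove_burn_in(chain, log_post_unnorm, quantile=0.10):
    # Log-posterior along the chain.
    logp = np.array([log_post_unnorm(theta) for theta in chain])
    # Typical late-time value, taken from the second half of the chain.
    reference = np.quantile(logp[len(logp) // 2:], quantile)
    # First iteration that reaches this level; everything before it is burn-in.
    first_good = np.argmax(logp >= reference)
    return chain[first_good:]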

6.2 Effective Sample Size and Auto-Correlation Time


At this point, MCMC seems like it should be the optimal method for any situation: by
simulating samples directly from the (unknown) posterior, we can achieve an optimal
estimate for any expectation values we wish to evaluate. In practice, however, this does
not hold true. MCMC values rely on specific algorithmic procedures such as the MH
algorithm to generate samples, whose limiting behavior reduces to a chain of samples
{Θ1 → · · · → Θn } whose distribution follows the posterior. Any given sample Θi ,
however, is more likely than not to be correlated with both the previous sample in the
sequence Θi−1 and the subsequent sample in the sequence Θi+1 .
This occurs for two reasons. First, new positions Θi drawn from Q(Θi |Θi−1 ) by con-
struction tend to depend on the current position Θi−1 . This means that the position we
propose at iteration i + 1 will be correlated with the position at iteration i, which
itself will be correlated with the position at iteration i − 1, etc.
Second, even if we set Q(Θ0 |Θ) = Q(Θ0 ) so that all of our proposed positions are
uncorrelated, our transition probability T (Θ0 |Θ) still ensures that we will eventually re-
ject the new position so that Θi+1 = Θi . Since samples at exactly the same position are
maximally correlated, this ensures that samples from our chain will “on average” have
non-zero correlations. Note that having low acceptance fractions (i.e. the fraction of pro-
posals that are accepted rather than rejected) will lead to a larger fraction of the chain
containing these perfectly correlated samples, increasing the overall correlation.
As mentioned in §4.2, correlated samples provide less information about the under-
lying distribution they are sampled from since their behavior doesn’t just depend on the
underlying distribution but also the neighboring samples in the sequence. Samples that
are more highly correlated then should lead to a reduced ESS.
This intuition can be quantified by introducing the auto-covariance C(t) for some lag
t. Assuming that we have an infinitely long chain {Θ1 → . . . }, the auto-covariance C(t)
is:

C(t) \equiv \mathbb{E}_i\left[(\Theta_i - \bar{\Theta}) \cdot (\Theta_{i+t} - \bar{\Theta})\right] = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} (\Theta_i - \bar{\Theta}) \cdot (\Theta_{i+t} - \bar{\Theta}) \quad (52)

where · is the dot product and Θ̄ = EP [Θ] is the mean of the posterior P(Θ). In other
words, we want to know the covariance between Θi at some iteration i and Θi+t at some
other iteration i + t, averaged over all possible pairs of samples (Θ_i, Θ_{i+t}) in our in-
finitely long chain. Note that the amplitude |C(t)| will be maximized at |C(t = 0)|, where
the two samples being compared are identical, and minimized with |C(t)| = 0 when Θi
and Θi+t are completely independent from each other.


Using the auto-covariance, we can define the corresponding auto-correlation A(t) as


A(t) \equiv \frac{C(t)}{C(0)} \quad (53)
This now measures the average degree of correlation between samples separated by a lag
t. In the case where t = 0, both samples are identical and A(t = 0) = 1. In the case where
the samples are uncorrelated over lag t, A(t) = 0.
The overall auto-correlation time for our chain is just the auto-correlation A(t) inte-
grated over all non-zero lags (t ≠ 0):

\tau \equiv \sum_{t=-\infty}^{\infty} A(t) - 1 = 2\sum_{t=1}^{\infty} A(t) \quad (54)

where the −1 comes from the fact that the auto-correlation with no lag is just A(t = 0) = 1
(i.e. each sample perfectly correlates with itself) and the substitution arises from the fact
that A(t) = A(−t) by symmetry. If τ = 0, then it takes no time at all for samples to become
uncorrelated and the samples can be assumed to be iid. If τ > 0, then it takes on average
τ additional iterations for samples to become uncorrelated. An illustration of this process
is shown in Figure 10.
Incorporating the auto-correlation time leads directly to a modified definition for the
ESS:
n'_{\rm eff} \equiv \frac{n_{\rm eff}}{1+\tau} \quad (55)
In practice, we cannot precisely compute τ since we do not have an infinite number of
samples and do not know P(Θ). Therefore we often need to generate an estimate τ̂ of
the auto-correlation time using the existing set of n samples we have. While discussing
various approaches taken to derive τ̂ is beyond the scope of this work, please see Brooks
et al. (2011) for additional details.
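As a rough illustration of the quantities defined above, the sketch below estimates A(t), τ, and the ESS directly from a single 1-D chain using the sums in Equations 52–55; truncating the sum at the first negative A(t) is a simple heuristic rather than one of the windowing schemes discussed in Brooks et al. (2011).

import numpy as np

def autocorr_time(x):
    # Direct O(n^2) estimate of the auto-correlation time of a 1-D chain.
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    cov = np.array([np.sum(xc[:n - t] * xc[t:]) / n for t in range(n)])  # C(t)
    acf = cov / cov[0]                                                   # A(t)
    first_neg = np.argmax(acf < 0)       # truncate at the first negative value
    return max(2.0 * np.sum(acf[1:first_neg]), 0.0)

def effective_sample_size(x):
    # n_eff' = n / (1 + tau), per Equation 55 with unit importance weights.
    return len(x) / (1.0 + autocorr_time(x))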
The fact that MCMC methods are subject to non-negative auto-correlation times (τ ≥
0) but have optimal importance weights w̃_i = 1 gives an ESS of

n'_{\rm eff,MCMC} = \frac{n_{\rm eff,MCMC}}{1+\tau} = \frac{n}{1+\tau} \le n \quad (56)
This means that there is no guarantee that MCMC is always the optimal choice to achieve the
largest ESS. In particular, Importance Sampling methods, which can generate fully iid
samples with no auto-correlation time (τ = 0) but non-optimal importance weights w̃i ,
instead have an ESS of

n'_{\rm eff,IS} = \frac{n_{\rm eff,IS}}{1+\tau} = n_{\rm eff,IS} = \frac{\left(\sum_{i=1}^{n} \tilde{w}_i\right)^2}{\sum_{i=1}^{n} \tilde{w}_i^2} \le n \quad (57)

which can be greater than n′_eff,MCMC at fixed n.


Given the results above, it should now be clear that the central motivating concern of
MCMC methods is whether they can generate a chain of samples with an auto-correlation time
small enough to outperform Importance Sampling. Whether or not this is true will depend on
the posterior, the approach used to generate the chain of samples (see §6.1 and §8) and the
proposal distribution Q(Θ) used for Importance Sampling (see §5.3).


Figure 10: A schematic illustration of the auto-correlation associated with MCMC. MCMC
methods generate a chain of samples {Θ1 → · · · → Θn } (top), but these tend to be strongly
correlated on small length scales (top middle). We can quantify the degree of correlation
by computing the corresponding auto-correlation A(t) over our set of samples and all
possible time lags t (bottom middle). This quantity is 1 when t = 0 and drops to 0 as
t → ±∞. The overall auto-correlation time τ associated with our chain of samples is then
just the integrated auto-correlation over t ≠ 0. See §6.2 for additional details.


Exercise: MCMC over a 2-D Gaussian


Setup
Let’s again return to our examples from §4 and §5, in which our unnormalized posterior
is well-approximated by a 2-D Gaussian (Normal) distribution:

\tilde{\mathcal{P}}(x, y) = \exp\left[-\frac{1}{2}\left(\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right)\right]

where (µx , µy ) = (−0.3, 0.8) and (σx2 , σy2 ) = (2, 0.5).


We want to use MCMC to approximate various posterior integrals from this distribu-
tion. We will start by choosing our proposal distribution Q(x′, y′|x, y) to be a 2-D Gaussian centered on the current position with a standard deviation of 1 in each dimension:

Q(x′, y′|x, y) = \mathcal{N}\left[(\mu_x, \mu_y) = (x, y), (\sigma_x, \sigma_y) = (1, 1)\right]
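A minimal setup sketch for this exercise is shown below; it only defines the unnormalized log-posterior and a single proposal draw, leaving the actual sampling, estimates, and diagnostics to the reader.

import numpy as np

MU = np.array([-0.3, 0.8])      # (mu_x, mu_y)
SIG2 = np.array([2.0, 0.5])     # (sigma_x^2, sigma_y^2)

def log_post_unnorm(theta):
    # ln P~(x, y) for the 2-D Gaussian above.
    return -0.5 * np.sum((theta - MU) ** 2 / SIG2)

def propose(theta, rng, sigma_prop=1.0):
    # Draw (x', y') from Q(x', y'|x, y): a Gaussian centered on theta.
    return theta + sigma_prop * rng.standard_normal(2)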

Parameter Estimation
Using the above proposal, generate n = 1000 samples following the MH algorithm start-
ing from the position (x0 , y0 ) = (0, 0). Using these samples, compute an estimate of the
means EP [x] and EP [y] as well as the corresponding 68% credible intervals (or closest ap-
proximation) [x_low, x_high] and [y_low, y_high]. How accurate is each of these quantities compared with the values we might expect?

Evidence Estimation
Next, use a set of 10 × 10 bins from x = [−5, 5] and y = [−5, 5] to construct an estimate
ρ(x, y) from the resulting set of samples. Using this estimate for the density, compute an
estimate of the evidence Z. How accurate is our approximation? Does it substantially
change if we adjust the number and/or size of the bins?

Auto-Correlation Time and Effective Sample Size


Use numerical methods to compute an estimate of the auto-correlation time τ and the cor-
responding effective sample size neff . How efficient is our sampling (neff /n) compared to
the default Importance Sampling approach from the exercise in §5? Does this mirror what
we’d expect given the acceptance fraction of our proposals? What do these quantities
tell us about how well our proposal Q(x, y) matches the structure of the underlying pos-
terior P(x, y)?

Uncertainties
Repeat the above exercises m = 30 times to get an estimate for how much our estimates
of each quantity can vary. Is the variation in line with what might be expected given the
typical effective sample size?


Consistency and Convergence


Now repeat the above exercise using n = 2500 and n = 10000 sample points and com-
ment on any differences. How much has the overall accuracy improved? Do the estimates
appear convergent and consistent as neff increases? How much do the errors on quanti-
ties shrink as a function of n and/or neff ? Is this similar or different from the observed
dependence from the Importance Sampling exercise in §5?

Sampling Efficiency
Next, adjust the (σx , σy ) of the proposal distribution to try and improve neff at fixed
n. How close is the final ratio σx /σy of our proposal to that of the underlying poste-
rior? Are there any additional scaling differences between the rough size of our proposal
Q(x0 , y 0 |x, y) relative to the underlying posterior P(x, y)? Given that P̃(x, y) may differ
from the structure assumed when picking Q(x0 , y 0 |x, y), can you think of any possible
scheme to try and adjust our proposal using an existing set of samples?

Burn-In
Finally, adjust the starting position to be at (x0 , y0 ) = (10, 10) instead of (0, 0) and generate
a new chain of samples. Plot the x and y positions of the chain over time. Are there any
obvious signs of the burn-in period? How many samples roughly should be assigned to
burn-in and subsequently removed from our chain? Are there any possible heuristics that
might help to identify the initial burn-in period?

7 Sampling the Posterior with MCMC


The approach by which MCMC methods are able to generate a chain of samples imme-
diately gives a mental image of our chain “exploring” the posterior. While it is true that
the density of samples from the chain ρ(Θ) → P(Θ) as n → ∞, the primary purpose of
MCMC is estimating expectation values EP [f (Θ)]. Although this might seem like a subtle
difference, this distinction is actually crucial for understanding how MCMC algorithms
(should) behave in practice.

7.1 Approximating the Posterior


Although algorithms such as MH (§6.1) are constructed to ensure the density of the chain
of samples ρ(Θ) generated by MCMC converges to the posterior P(Θ) as n → ∞, this does
not necessarily translate into an efficient method to approximate the posterior in practice.
In other words, n might need to be extremely large for this constraint to hold. So how
many samples do we need to ensure ρ(Θ) is a good approximation to P(Θ)?
To start, we first need to define some metric for what a “good” approximation is. A
reasonable one might be that we would like to know the posterior within some region δΘ


to within some precision ε so that

\left| \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[\Theta_i \in \delta_\Theta\right] - \int_{\delta_\Theta} \mathcal{P}(\Theta)\, d\Theta \right| \equiv |\hat{p}(\delta_\Theta) - p(\delta_\Theta)| < \epsilon \quad (58)

where p(δΘ ) is the total probability contained within δΘ and p̂(δΘ ) is the fraction of the
MCMC chain of samples contained within the same region. While it might seem strange
to only estimate this for one region, I will shortly generalize this to encompass the entire posterior.³
In the ideal case where our samples are iid and drawn from P(Θ), our samples each
have a probability p(δΘ ) of being within δΘ . The probability that p̂(δΘ ) = m/n then fol-
lows the binomial distribution:

P\left(\hat{p}(\delta_\Theta) = \frac{m}{n}\right) = \binom{n}{m}\left[p(\delta_\Theta)\right]^m \left[1 - p(\delta_\Theta)\right]^{n-m} \quad (59)

In other words, our samples end up inside δΘ a total of m times with probability p(δΘ )
and outside δΘ a total of n − m times with probability 1 − p(δΘ ). The additional binomial
coefficient \binom{n}{m} for “n choose m” accounts for all possible unique cases where m samples
can end up within δΘ out of our total sample size of n.
This distribution has a mean of p(δΘ ), so for any finite n we expect p̂(δΘ ) to be an
unbiased estimator of p(δΘ ):

E [p̂(δΘ ) − p(δΘ )] = p(δΘ ) − p(δΘ ) = 0 (60)

The variance, however, depends on the sample size:

\mathbb{E}\left[|\hat{p}(\delta_\Theta) - p(\delta_\Theta)|^2\right] = \frac{p(\delta_\Theta)\left[1 - p(\delta_\Theta)\right]}{n} \quad (61)
In practice, we can expect there to be some non-zero auto-correlation time τ > 0. This
will increase the number of MCMC samples we will need to generate to be confident
that our estimate p̂(δΘ ) is well-behaved. Inserting a factor of 1 + τ and substituting our
expectation value from above into our accuracy constraint then gives a rough constraint
for the number of samples n we would require as a function of ε:

n \gtrsim \frac{p(\delta_\Theta)\left[1 - p(\delta_\Theta)\right]}{\epsilon^2/(1+\tau)} \sim \frac{\hat{p}(\delta_\Theta)\left[1 - \hat{p}(\delta_\Theta)\right]}{\epsilon^2} \times (1 + \hat{\tau}) \quad (62)

The final substitution of p(δΘ ) and τ with their noisy estimates p̂(δΘ ) and τ̂ arises from
the fact that in practice we don’t know p(δΘ ) or τ (both of which require full knowledge
of the posterior). We are therefore forced to rely on estimators derived from our set of n
samples.
³ Technically the procedure outlined in this section only works for finite volumes. The basic intuition, however, holds even when parameters are unbounded, although proving those results is beyond the scope of this work.


Let’s now examine this result more closely. As expected, the total number of samples
is proportional to 1 + τ̂ : if it takes longer to generate independent samples, then we need
more samples to be confident we have characterized the posterior well in a given region.
We also see that n ∝ ε⁻², so that if we want to reduce the error by a factor of x we need to
increase our sample size by a factor of x2 .
The behavior in the numerator is more interesting. Note that p̂(δΘ ) [1 − p̂(δΘ )] is max-
imized for p̂(δΘ ) = 0.5, and so the largest sample size needed is when we have split our
posterior directly in half. In all other cases the sample size needed will be smaller because
there will be more samples outside or inside the region of interest whose information we
can leverage. The exact value of p̂(δΘ ) of course depends on both the posterior P(Θ) and
the target region δΘ: the sample size needed to approximate the posterior to some precision ε near
the peak of the distribution (the small region where P(Θ) is large) will likely be different
than the sample size needed to accurately estimate the tails of the distribution (the large
region where P(Θ) is small).
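As a hedged back-of-the-envelope sketch of Equation 62, the helper below turns rough guesses for p̂, ε, and τ̂ into an approximate sample-size requirement; the example numbers are purely illustrative.

def required_samples(p_hat, eps, tau_hat=0.0):
    # n >~ p_hat (1 - p_hat) / eps^2 * (1 + tau_hat), per Equation 62.
    return p_hat * (1.0 - p_hat) / eps**2 * (1.0 + tau_hat)

# Example: pinning down a region containing ~50% of the posterior to within
# +/- 1% with an auto-correlation time of ~10 requires roughly 27,500 samples.
print(required_samples(0.5, 0.01, tau_hat=10.0))  # 27500.0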
While the above argument holds if we are looking to estimate the posterior in just one
region at a time, “converging to the posterior” implies that we want ρ(Θ) to become a
good approximation to P(Θ) everywhere. We can enforce this new requirement by split-
ting our posterior into m different sub-regions {δΘ1 , . . . , δΘm } and requiring that each
sub-region is well constrained
|\hat{p}(\delta_{\Theta_1}) - p(\delta_{\Theta_1})| < \epsilon_1 \quad \dots \quad |\hat{p}(\delta_{\Theta_m}) - p(\delta_{\Theta_m})| < \epsilon_m \quad (63)

Substituting in the expected errors on each of these constraints then gives us an approximate limit on the number of samples n_j that we need to estimate the posterior in each region δ_{Θ_j}:

n_j \gtrsim \frac{\hat{p}(\delta_{\Theta_j})\left[1 - \hat{p}(\delta_{\Theta_j})\right]}{\epsilon_j^2} \times (1 + \hat{\tau}) \quad (64)

The total number of samples we need is then simply:

n \gtrsim \sum_{j=1}^{m} n_j \quad (65)

This approach of dividing up our posterior into sub-regions is conceptually similar to


the grid-based approaches described in §4. As such, it is also subject to the same draw-
backs: we expect the number of regions m to increase exponentially with the number of
dimensions d. For instance, if we just wanted to divide our posterior up into m orthants
we would end up with m = 2d regions: 2 in 1-D (left-right), 4 in 2-D (upper-left, lower-left,
upper-right, lower-right), 8 in 3-D, etc.
This effect implies that we should in general expect the number of samples required
to ensure ρ(Θ) is a good approximation to P(Θ) for some specified accuracy ε to scale as

n \gtrsim k^d \quad (66)
where k is a constant that depends on the accuracy requirements. This puts approximat-
ing the full posterior firmly in the “curse of dimensionality” regime (see §4.1).⁴
⁴ A direct corollary of this result is that, while the evidence estimates from MCMC are consistent, the rate of convergence to the underlying value will proceed exponentially more slowly as d increases.


While many practitioners talk about MCMC being an efficient method to “approxi-
mate the posterior”, in practice it is rarely used to approximate P(Θ) directly. As dis-
cussed in §3 and shown in Figure 2, almost all quantities that are reported in the literature
do not rely on approximations to the full d-dimensional posterior, but rather approxima-
tions to marginalized distributions that are almost always restricted to no more than k ≲ 3
parameters at a time. The act of marginalizing over the remaining d − k parameters helps
to counteract the curse of dimensionality illustrated here. While it is technically fair to
say that MCMC can “explore” the marginalized k-D posteriors for certain limited sets of
parameters, this type of language can often lead to more misconceptions than insights.

7.2 Posterior Volume


The basic consequences outlined in §7.1 are more general than the specific case where we
imagine dividing up the posterior into orthants or other regions. Fundamentally, com-
puting any expectation over the posterior EP [f (Θ)] requires integrating over the entire
domain of our parameters Θ. We therefore want to understand how the volume of this do-
main behaves (i.e. how many parameter combinations there are). Once we have a grasp
on how this behaves, we can then start trying to quantify how this will impact our
estimates.
To start, let’s consider the d-cube with side length ℓ in all d dimensions. Its volume scales as

V(\ell) = \prod_{i=1}^{d} \ell = \ell^d \quad (67)

The differential volume element between ℓ and ℓ + dℓ is

dV(\ell) = (d\,\ell^{d-1}) \times (d\ell) \propto \ell^{d-1} \quad (68)

This exponential scaling with dimensionality means that volume becomes increas-
ingly concentrated in thin shells located in regions located progressively further away
from the center of the d-cube. As an example, consider the length-scale

\ell_{50} = 2^{-1/d}\,\ell \quad (69)

that divides the d-cube into two equal-sized regions with 50% of the volume contained
interior to ℓ₅₀ and 50% of the volume exterior to ℓ₅₀. In 1-D, this gives ℓ₅₀/ℓ = 0.5 as we’d expect. In 2-D, this gives ℓ₅₀/ℓ ≈ 0.7. In 3-D, ℓ₅₀/ℓ ≈ 0.8. In 7-D, ℓ₅₀/ℓ ≈ 0.9. By the time we get to 15-D, we have ℓ₅₀/ℓ ≈ 0.95, which means that 50% of the volume is located in
the last 5% of the length-scale near the boundary of the d-cube. While the constants may
change when considering other shapes (e.g., spheres), in general this exponential scaling
as a function of d is a generic feature of higher-dimensional volumes. In other words,
increasing the number of parameters leads to an exponential increase in the number of
available parameter combinations that we have to explore.
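The scaling in Equation 69 is easy to check numerically; the following one-liner reproduces the ratios quoted above.

# ell_50 / ell = 2**(-1/d) for the d-cube (Equation 69).
for d in (1, 2, 3, 7, 15, 50):
    print(d, round(2 ** (-1.0 / d), 3))
# -> 0.5, 0.707, 0.794, 0.906, 0.955, 0.986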
In addition to affecting the long-term behavior of MCMC, this exponential increase
in volume also directly impacts how MCMC methods operate. To see why this is the


Figure 11: A schematic illustration of how the curse of dimensionality affects MCMC
acceptance fractions via posterior volume. At a given position Θ, the volume
increases ∝ rd as a function of distance r away from that position (left). As the dimen-
sionality increases, this implies volume becomes concentrated progressively further out,
leading to larger distances between proposed positions Θ0 and the current position Θ.
Most of these positions have significantly lower posterior probabilities P(Θ0 ) compared
to the current value P(Θ), leading to an exponential decline in the typical acceptance
fraction (and a corresponding increase in the auto-correlation time) as the dimensionality
increases (right). Adjusting the size and/or shape of the proposal Q(Θ0 |Θ) can help to
counteract this behavior. See §7.2 for additional details.


case, we need look no further than the transition probability used in the MH algorithm
discussed in §6.1:

T(\Theta_{i+1}|\Theta_i) \equiv \min\left[1, \frac{\mathcal{P}(\Theta_{i+1})}{\mathcal{P}(\Theta_i)} \frac{Q(\Theta_i|\Theta_{i+1})}{Q(\Theta_{i+1}|\Theta_i)}\right]
The non-trivial portion of this expression cleanly splits into two terms. The first is depen-
dent on the volume and is related to how we propose our next position from Q(Θ′|Θ).
The second is dependent on the density and is related to how the posterior density changes
between the two positions.
In practice, our transition probability can be interpreted as a basic corrective approach:
after proposing a new position from some nearby volume, we then try to “correct” for
differences between our proposal and the underlying posterior by only accepting these
moves sometimes based on changes in the underlying density. In high dimensions, this
basic “tug of war” between the volume (proposal) and the density (posterior) can break
down as the vast majority of an object’s volume becomes concentrated near the outer
edges.5 For instance, in the case where our proposal Q(Θ0 |Θ) is a cube with side-length `
centered on Θ, this leads to a median length-scale of `50 = 2−1/d `, which increases rapidly
from 0.5` to ≈ ` as the dimensionality increases. The same logic also applies to other pro-
posal distributions (see §8). This focus on positions either far away or with very similar
separation length-scales as `50 → ` means that many choices of Q(Θ0 |Θ) have a tendency
to “overshoot”, proposing new positions with much smaller posterior densities compared
to the current position. These new positions are then almost always rejected, leading to
extremely low acceptance fractions and correspondingly long auto-correlation times. An
example of this effect is illustrated in Figure 11.
One of the main ways to counteract this behavior is to adjust the size/shape of the
proposal Q(Θ0 |Θ) so that the fraction of proposed positions that are accepted remains suf-
ficiently high. This helps to ensure the posterior density P(Θ) does not change too dras-
tically when proposing new positions, leading to lower overall auto-correlation
times. Details of how to implement these schemes in practice are beyond the scope of this
article; please see citation for additional details.

7.3 Posterior Mass and Typical Sets


Above, I described how the behavior of volume in high dimensions can impact the perfor-
mance of our MCMC MH sampling algorithm, possibly leading to inefficient proposals
and low acceptance fractions. Let’s assume that we have resolved this problem and have
an efficient way of generating our chain of samples. We now have a secondary question:
where are these samples located?
From our discussion in §7.1, we know that the highest density of samples ρ(Θ) will
be located where the posterior density P(Θ) is also correspondingly high. However, this
region δΘ might only correspond to a small portion of the posterior. Indeed, given there
is exponentially more volume as the dimensionality increases, it is almost guaranteed
⁵ Alternative methods such as Hamiltonian Monte Carlo (Neal, 2012) can get around this problem by smoothly incorporating changes in the density and volume.


that models with many parameters Θ will have the vast majority of the posterior located
outside the region of highest density.
A consequence of this is that the majority of samples in our chain will be located away
from the peak density. As a result, our chain spends the majority of its time generating samples
in these regions. This has a huge impact in the way our chain is expected to behave: while
the highest concentration of samples will be located in the regions of highest posterior
density, the largest amount of samples will actually be located in the regions of highest
posterior mass (i.e. density times volume). Since this implies that a “typical” sample
(picked at random) will most likely be located in this region of high posterior mass, this
region is also commonly referred to as the typical set.
To make this argument a little easier to conceptualize, let’s imagine that we have a
3-parameter model Θ = (x, y, z) and P(x, y, z) is spherically symmetric. While we could
imagine trying to integrate over P(x, y, z) directly in terms of dxdydz, it is almost always
easier to instead integrate over such a distribution in “shells” with differential volume dV(r) = 4πr²dr as a function of radius r = √(x² + y² + z²). This allows us to rewrite the 3-D integral over (x, y, z) as a 1-D integral over r:

\iiint \mathcal{P}(x, y, z)\, dx\, dy\, dz = \int \mathcal{P}(r)\, 4\pi r^2\, dr \equiv \int \mathcal{P}'(r)\, dr \quad (70)

where P′(r) ≡ 4πr²P(r) is now the 1-D density as a function of r. This “boosts” the contribution as a function of r by the differential volume element of the shell associated with P(r), and implies that the posterior should have some sort of shell-like structure (i.e. P′(r) is maximized for r > 0).
Although not all posterior densities can be expected to be spherically-symmetric in
this way, in general we can rewrite the d-D integral over Θ as a 1-D volume integral over
V defined by some unknown iso-posterior contours⁶

\int \mathcal{P}(\Theta)\, d\Theta = \int \mathcal{P}(V)\, dV \quad (71)

As outlined in §7.2, we generically expect the size of each volume element to go as dV ∼ r^{d−1}dr, where r is the distance from the peak of the posterior. So the basic intuition we get from the simple spherically-symmetric case still applies and we expect

\int \mathcal{P}(V)\, dV \sim \int \mathcal{P}(r)\, r^{d-1}\, dr = \int \mathcal{P}'(r)\, dr \quad (72)

As before, the differential volume element of the shell associated with P(r) “boosts”
its overall contribution as a function of r. This boost also becomes exponentially stronger
as d increases. For even moderately-sized d, we therefore expect the posterior mass to be mostly
contained in a thin shell located at a radius r0 with some width ∆r0 . See Figure 12 for an
illustration of this effect based on the toy problem presented in §8.1.
⁶ Indeed, alternative Monte Carlo methods such as Nested Sampling (Skilling, 2004, 2006) or Bridge/Path Sampling (Gelman & Meng, 1998) actually are designed to evaluate this type of volume integral explicitly.


Figure 12: A schematic illustration of how the posterior mass behaves as a function of
dimensionality using a d-dimensional Gaussian. The top panel shows the posterior density P(r) ∝ e^{−r²/2} (red) plotted as a function of distance r from the maximum posterior density at r = 0 as the number of dimensions d increases (left to right). As expected, this distribution remains constant. The middle panel shows the differential volume element dV(r) ∝ r^{d−1}dr (blue) of the corresponding shell at radius r. This illustrates the exponentially increasing volume contributed by shells further away from the maximum. The bottom panel shows the corresponding “posterior mass” as a function of radius, P′(r) ∝ r^{d−1}P(r) ∝ r^{d−1}e^{−r²/2} (purple). Due to the increasing amount of volume located further away from the maximum posterior density, we see that the majority of the posterior mass (and therefore of any samples we generate with MCMC) is actually located in a shell far away from r = 0. See §7.3 for additional details.


This result has two immediate implications. First, the majority of our samples are not
located where the posterior density is maximized. This is the result of an exponentially in-
creasing number of parameter combinations, which allow a small handful of excellent
fits to the data to be easily overwhelmed by a substantially larger number of mediocre
fits. MCMC methods are therefore generally inefficient at locating and/or characterizing
the region of peak posterior density.
Second, as d increases we generally would expect the radius of the shell containing the
bulk of the posterior mass to increase, moving further and further away from the peak
density due to the exponentially increasing available volume. Since the majority of our
samples are located in this region, our chain will spend the vast majority of time generating
samples from this shell.
This allows us to now outline exactly why it is challenging to propose samples effi-
ciently in high dimensions:

1. To make sure our acceptance fractions remain reasonable, we need to ensure our
proposed positions mostly lie within this shell of posterior mass.

2. However, obtaining an independent sample requires being able to (in theory) pro-
pose any position within this shell.

3. This means that our auto-correlation time will principally be set by how long it takes
to “wander around” the shell, which will be a function of its overall size r0 , its width
∆r0 , and the number of dimensions d.

8 Application to a Simple Toy Problem


I now consider a concrete, detailed example to illustrate how all the concepts discussed
in §6 and §7 come together in practice. Throughout this section, I will outline a number of
analytic results and utilize several different MCMC sampling strategies to generate chains
of samples. I strongly encourage interested readers to implement their own versions of
the methods outlined here, which can be used to reproduce the numerical results from
this section in their entirety.

8.1 Toy Problem


In this toy problem, we will take our (unnormalized) posterior to be a d-dimensional
Gaussian (Normal) distribution with a mean of µ = 0 and a standard deviation of σ in all
dimensions:
\tilde{\mathcal{P}}(\Theta) = \exp\left(-\frac{1}{2}\frac{|\Theta|^2}{\sigma^2}\right) \quad (73)

where |\Theta|^2 = \sum_{i=1}^{d} \Theta_i^2 is the squared magnitude of the position vector.
Based on the results from §7.3, we can better understand the properties of this distribution by rewriting the posterior density in terms of the “radius” r \equiv |\Theta| = \sqrt{\sum_{i=1}^{d} \Theta_i^2}


away from the center:


\tilde{\mathcal{P}}(r) = \exp\left(-\frac{r^2}{2\sigma^2}\right) \quad (74)

The corresponding volume contained within a given radius r is then

V(r) \propto r^d \quad (75)

The corresponding posterior mass \tilde{\mathcal{P}}'(r) is then defined via

\tilde{\mathcal{P}}(V)\, dV(r) \propto e^{-r^2/2\sigma^2}\, r^{d-1}\, dr \equiv \tilde{\mathcal{P}}'(r)\, dr

Note that this is closely related to the chi-square distribution.


The typical radius rpeak where the posterior mass peaks (i.e. is maximized) and a
sample is most likely to be located can be derived by setting d\tilde{\mathcal{P}}'(r)/dr = 0. Solving this gives

r_{\rm peak} = \sqrt{d-1}\,\sigma \quad (76)
In other words, while in 1-D a typical sample is most likely to be located at the peak of
the distribution with rpeak = 0, in higher dimensions this changes quite drastically. While
rpeak = 1σ in 2-D, it is 2σ in 5-D, 3σ in 10-D, and 5σ in 26-D. This is a direct consequence
of the huge amount of volume at larger radii in high dimensions: although a sample at
r = 5σ has a posterior density P(r) orders of magnitude worse than a sample at r = 0,
the enormous number of parameter combinations (volume) available at r = 5σ more than
makes up for it.
In general, we expect the posterior mass to comprise a “Gaussian shell” centered at some radius

r_{\rm mean} \equiv \mathbb{E}_{\mathcal{P}'}[r] = \int_0^\infty r\,\mathcal{P}'(r)\, dr = \sqrt{2}\,\frac{\Gamma\left(\frac{d+1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)}\,\sigma \approx \sqrt{d}\,\sigma \quad (77)

with a standard deviation of

\Delta r_{\rm mean} \equiv \sqrt{\mathbb{E}_{\mathcal{P}'}\left[(r - r_{\rm mean})^2\right]} = \sigma\sqrt{d - 2\left(\frac{\Gamma\left(\frac{d+1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)}\right)^{2}} \approx \frac{\sigma}{\sqrt{2}} \quad (78)

where Γ(d) is the Gamma function and the approximations are taken for large d. See
Figure 12 for an illustration of this behavior.
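These expressions are straightforward to verify with a quick simulation; the sketch below draws samples from a d-dimensional unit Gaussian and confirms that their radii concentrate in a shell near √d with a roughly constant width, rather than near the peak density at r = 0.

import numpy as np

rng = np.random.default_rng(42)
for d in (1, 5, 25, 100):
    r = np.linalg.norm(rng.standard_normal((100_000, d)), axis=1)
    print(f"d={d:4d}  mean(r)={r.mean():6.2f}  sqrt(d)={np.sqrt(d):6.2f}  "
          f"std(r)={r.std():.2f}")
# For larger d, mean(r) tracks sqrt(d) while std(r) approaches 1/sqrt(2) ~ 0.71.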

8.2 MCMC with Gaussian Proposals


Let us now consider a chain of samples {Θ1 → · · · → Θn }. The distance between two
samples Θ_m and Θ_{m+t} separated by some lag t will be

|\Theta_m - \Theta_{m+t}| = \sqrt{\sum_{i=1}^{d} (\Theta_{m,i} - \Theta_{m+t,i})^2} \quad (79)


Assuming that the lag t ≫ τ is substantially larger than the auto-correlation time τ, we
can assume each sample is approximately iid distributed following our Gaussian poste-
rior. This then gives an expected separation of
\Delta r_{\rm sep} \equiv \sqrt{\mathbb{E}_{\mathcal{P}}\left[|\Theta_m - \Theta_{m+t}|^2\right]} = \sqrt{\sum_{i=1}^{d} \mathbb{E}_{\mathcal{P}}\left[(\Theta_{m,i} - \Theta_{m+t,i})^2\right]} = \sqrt{2d}\,\sigma \approx \sqrt{2}\,r_{\rm mean} \quad (80)

We can in theory propose samples in such a way so that the separation |Θi+1 − Θi |
between a proposed position Θ_{i+1} and the current position Θ_i follows the ideal separation of √2 r_mean derived above by using a simple Gaussian proposal distribution:

Q(\Theta_{i+1}|\Theta_i) \propto \exp\left(-\frac{1}{2}\frac{|\Theta_{i+1} - \Theta_i|^2}{2\sigma^2}\right) \quad (81)

While this proposal has the same shape as the posterior, it is centered on Θi rather than
0. Using our intuition for how volume behaves based on §7.2, we can conclude that
the majority of samples proposed from this choice of Q(Θ0 |Θ) will probably have little
overlap with the posterior.
Indeed, numerical simulation suggests the typical fraction of positions that will be
accepted given the above proposal roughly scales as

\langle f_{\rm acc}(d) \rangle \equiv \exp\left[\mathbb{E}_{\mathcal{P},Q}[\ln T(\Theta_{i+1}|\Theta_i)]\right] \sim \exp\left(-\frac{d}{4} - \frac{1}{2}\right) \quad (82)

which decreases exponentially as the dimensionality increases, similar to Figure 11. Like-
wise, we find the auto-correlation time roughly scales as

\langle \tau(d) \rangle \equiv \exp\left[\mathbb{E}_{\mathcal{P},Q}[\ln \tau]\right] \sim \exp\left(\frac{d}{4} + \frac{7}{4}\right) \quad (83)

This exponential dependence arises because the overlap between the typical Gaussian
proposal Q(Θ′|Θ) and the underlying posterior P(Θ) essentially reduces to the small volume where two thin shells overlap. Since the radii of the shells go as ∝ √d while the
widths remain roughly constant, the “fractional size” of the shell (and the corresponding
overlap) ends up decreasing exponentially.
To counteract this effect, we need to adjust the σ of our proposal distribution by some
factor γ:
Q_\gamma(\Theta_{i+1}|\Theta_i) \propto \exp\left(-\frac{1}{2}\frac{|\Theta_{i+1} - \Theta_i|^2}{(\gamma\sigma)^2}\right) \quad (84)

where our previous proposal assumes γ = √2. If we want to ensure our typical acceptance fraction will remain roughly constant as a function of dimension d, γ needs to scale as

\langle f_{\rm acc}(\gamma(d)) \rangle \approx C \quad \Rightarrow \quad \gamma(d) \propto \frac{1}{\sqrt{d}} \quad (85)


Figure 13: Numerical results showcasing the performance of a simple MH MCMC sam-
pler with Gaussian proposals on our toy problem, a d-dimensional Gaussian with mean
µ = 0 and standard deviation σ = 1 in every dimension. The top series of panels show
snapshots of a random parameter from the chain as a function of dimensionality (increasing from left to right) assuming an unchanging proposal with constant scale factor γ = √2 (blue) and a shrinking proposal with γ = 2.5/√d designed to target a constant acceptance
fraction of ∼ 25% (red). The bottom panels show the corresponding acceptance frac-
tions (left), auto-correlation times (middle), and effective sample sizes (right) from our
chains (colored points) as a function of dimensionality. The approximations from §8.2 are
shown as light colored lines. Shrinking the size of the proposal helps to keep samples
within the bulk of the posterior mass, substantially reducing the auto-correlation time
and increasing the effective sample size. Failing to do so leads to an exponentially de-
creasing fraction of good proposals and a corresponding exponential increase/decrease
in the auto-correlation time/effective sample size. See §8.2 for additional discussion.


which inversely tracks the expected radius rmean of the typical set. We find that taking
\gamma = \frac{\delta}{\sqrt{d}} \quad (86)

leads to a typical acceptance fraction of

\langle f_{\rm acc}(\delta/\sqrt{d}) \rangle \approx \exp\left[-\frac{\delta^2}{4} - \frac{\delta^2}{2}\right] \quad (87)

as d becomes large with a typical auto-correlation time of

\langle \tau(\delta/\sqrt{d}) \rangle \approx 3d \quad (88)

for reasonable choices of δ. This linear dependence is a substantial improvement over our
earlier exponential scaling.

Numerical Tests
To confirm these results, I sample from this d-dimensional Gaussian posterior (assuming
σ = 1 for simplicity) using two MH MCMC algorithms for n = 20,000 iterations based on these proposal distributions. The first proposes new points assuming γ = √2. The second assumes γ = 2.5/√d in order to maintain a roughly constant acceptance fraction of 25%.
As shown in Figure 13, the chains behave as expected given our theoretical predictions
as a function of dimensionality, with the constant proposal quickly becoming stuck while
the adaptive proposal continues sampling normally. While the auto-correlation time τ
increases in both cases, the increase in the latter case (where it is driven by decreasing
size/scale of the proposal distribution) is much more manageable than the former (where
it is driven by the exponentially decreasing acceptance fraction).
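A minimal sketch of this comparison is given below; it reuses the MH scheme from §6.1 on a d-dimensional unit Gaussian and reports the acceptance fraction for a fixed proposal scale versus one shrinking as 1/√d. The exact numbers will differ from Figure 13 since this is only a rough re-implementation.

import numpy as np

def mh_acceptance_fraction(d, gamma, n_steps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    log_post = lambda th: -0.5 * np.sum(th ** 2)   # unit d-D Gaussian posterior
    theta, logp, n_acc = np.zeros(d), 0.0, 0
    for _ in range(n_steps):
        prop = theta + gamma * rng.standard_normal(d)
        logp_prop = log_post(prop)
        if np.log(rng.uniform()) <= logp_prop - logp:
            theta, logp, n_acc = prop, logp_prop, n_acc + 1
    return n_acc / n_steps

for d in (2, 8, 32):
    print(d, mh_acceptance_fraction(d, np.sqrt(2.0)),        # fixed scale
          mh_acceptance_fraction(d, 2.5 / np.sqrt(d)))       # shrinking scale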

8.3 MCMC with Ensemble Proposals


One drawback to the Gaussian proposals explored above is that we have to specify the
structure of the distribution ahead of time. In this specific case, we assumed that:
1. the width of the posterior in each dimension (parameter) was constant such that
σ1 = σ2 = · · · = σn = σ and

2. the parameters were entirely uncorrelated with each other such that the correlation
coefficient ρij = 0 between any two dimensions i and j.
In general, there is no good reason to assume that either of these are true. This means
we have to also estimate the entire set of d(d + 1)/2 free parameters that determine the
overall covariance structure of our unknown posterior distribution. Trying to adjust the
covariance structure in order to improve our sampling efficiency and decrease the auto-
correlation time (see §5.3 and §6.2) becomes one of the most difficult parts of running
MCMC algorithms in practice.


While there are schemes to perform these adjustments during an extended burn-in
period (see, e.g., cite), there is significant appeal in methods that can “auto-tune” with-
out much additional input from the user. One class of such approaches are known as
ensemble or particle methods. These methods attempt to use many m chains running
simultaneously (i.e. in parallel) to improve the performance of any individual chain.
We explore three variations of ensemble methods here that attempt to exploit m ≳ d(d + 1)/2 chains running simultaneously:

1. using the ensemble of particles to condition a Gaussian proposal distribution,

2. using trajectories from multiple particles along with Gaussian “jitter”, and

3. using affine-invariant transformations of trajectories from multiple particles.

A schematic illustration of these methods is shown in Figure 14.


As we might expect, an immediate drawback of these methods is they rely on having
enough particles to characterize the overall structure of the space (i.e. the curse of dimen-
sionality). While this limits their utility when sampling from high-dimensional spaces,
they can be attractive options in moderate-dimensional spaces (d ≲ 25) where a few hun-
dred particles are often sufficient to ensure reasonable performance.

8.3.1 Gaussian Proposal


The first approach is simply a modified Gaussian proposal: at any iteration i for any chain
j, we propose a new position Θji+1 based on the current position Θji using a Gaussian
proposal

Q_\gamma(\Theta^j_{i+1}|\Theta^j_i) \propto \exp\left[-\frac{1}{2}(\Theta^j_{i+1} - \Theta^j_i)^{\rm T}(\gamma^2 \mathbf{C}^j_i)^{-1}(\Theta^j_{i+1} - \Theta^j_i)\right] \quad (89)

where T is the transpose operator and

\mathbf{C}^j_i = {\rm Cov}\left[\{\Theta^1_i, \dots, \Theta^{j-1}_i, \Theta^{j+1}_i, \dots, \Theta^m_i\}\right] \quad (90)

is the empirical covariance matrix estimated from the current positions of the m chains
excluding chain j. We repeat this process for each of the m chains in turn.
In other words, at each iteration i we want to update all m chains. We do so by updat-
ing each chain j in turn based on what the other chains are currently doing. Assuming
the current position of each chain is distributed following the underlying posterior P(Θ),
it is straightforward to show that Cji is a reasonable approximation to the unknown co-
variance structure of our posterior. In addition, because we exclude j when computing
Cji , this proposal is symmetric going from Θji → Θji+1 and from Θji+1 → Θji . This means
that we satisfy detailed balance and do not have to incorporate any proposal-dependent
factors when computing the transition probability.
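A single update of this kind might look like the sketch below, which estimates the covariance from the current positions of the other chains (Equation 90) and accepts or rejects the symmetric Gaussian proposal using only the posterior ratio; the in-place update of `positions` is just a convenience of this sketch.

import numpy as np

def ensemble_gaussian_step(positions, log_post_unnorm, j, gamma, rng):
    # positions: (m, d) array of current chain positions.
    others = np.delete(positions, j, axis=0)       # exclude chain j
    cov_j = np.cov(others, rowvar=False)           # empirical covariance C_i^j
    theta = positions[j]
    prop = rng.multivariate_normal(theta, gamma ** 2 * cov_j)
    # Symmetric proposal, so only the posterior ratio enters the acceptance.
    if np.log(rng.uniform()) <= log_post_unnorm(prop) - log_post_unnorm(theta):
        positions[j] = prop
    return positions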


Figure 14: A schematic illustration of the three ensemble MCMC methods described in
§8.3. The current state of the chain we are interested in updating (red) and the other
chains in the ensemble (gray) are shown on the left. In the top panels (ensemble Gaussian;
§8.3.1), we compute the covariance of the other k ≠ j chains (middle) and use a scaled version to subsequently propose a new position. In the middle panels (3-chain shift + jitter; §8.3.2), we use two additional chains k ≠ l ≠ j to compute a trajectory. We then propose a new position based on this scaled trajectory plus a small amount of “jitter”. In the bottom panels (2-chain stretch; §8.3.3), we use only one additional chain k ≠ j to
propose a new trajectory. We then propose a random position along a scaled version of
this trajectory with the proposal probability varying as a function of scale. See §8.3 for
additional details.


8.3.2 Ensemble Trajectories with a Gaussian Proposal


The approach taken in §8.3.1 solves the problem of trying to tune the covariance of our
initial Gaussian proposal. However, it still assumes that a Gaussian proposal is the opti-
mal solution. A more general approach is one that does not rely on assuming a proposal
explicitly, but rather only relies on the distribution of the remaining particles.
One such approach used in the literature is Differential Evolution MCMC (DE-MCMC;
Storn & Price, 1997; Ter Braak, 2006). The main idea behind DE-MCMC is to rely on the
relative positions of the chains at a given iteration i when making new proposals. We first
randomly select two other particles k and l where Θ^j_i ≠ Θ^k_i ≠ Θ^l_i. We then propose a new position based on the vector distance between the other two particles Θ^k_i − Θ^l_i with some scaling γ along with some additional “jitter” ε:

\Theta^j_{i+1} = \Theta^j_i + \gamma \times (\Theta^k_i - \Theta^l_i + \epsilon) \quad (91)

In the case where the behavior of chains k and l are approximately independent of each
other and assuming the underlying posterior distribution P(Θ) is Gaussian with some
unknown mean µ and covariance C (and “standard deviation” C1/2 ), it is straightforward
to show that the distribution of Θ^k_i − Θ^l_i will then follow

\Theta^k_i - \Theta^l_i \sim \mathcal{N}\left[0, (2\mathbf{C})^{1/2}\right] \quad (92)

Typically, the jitter ε is chosen to also be Gaussian distributed with covariance \mathbf{C}_\epsilon such that

\epsilon \sim \mathcal{N}\left[0, \mathbf{C}_\epsilon^{1/2}\right] \quad (93)

In general, \mathbf{C}_\epsilon is mostly used to try and avoid issues caused by finite particle sampling: since the number of unique trajectories (ignoring symmetry) is

n_{\rm traj} = \binom{m-1}{2} = \frac{(m-1)!}{2!\,(m-3)!} = \frac{(m-1)(m-2)}{2}

if m is sufficiently small the DE-MCMC procedure can only explore a small number of possible trajectories at any given time, leading to extremely inefficient sampling.
Combined, this implies that the proposed position has a distribution of

\Theta^j_{i+1} \sim \mathcal{N}\left[\Theta^j_i, \gamma \times (2\mathbf{C} + \mathbf{C}_\epsilon)^{1/2}\right] \quad (94)

This shows that the 3-particle DE-MCMC procedure can generate new positions in a man-
ner analogous to the ensemble Gaussian proposal we first discussed.
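A single DE-MCMC update following Equation 91 might look like the sketch below; the jitter scale is an arbitrary assumption here rather than the covariance-based choice described above.

import numpy as np

def de_mcmc_step(positions, log_post_unnorm, j, gamma, rng, jitter=1e-2):
    m, d = positions.shape
    # Pick two other chains k != l != j.
    k, l = rng.choice([i for i in range(m) if i != j], size=2, replace=False)
    eps = jitter * rng.standard_normal(d)
    prop = positions[j] + gamma * (positions[k] - positions[l] + eps)
    if np.log(rng.uniform()) <= log_post_unnorm(prop) - log_post_unnorm(positions[j]):
        positions[j] = prop
    return positions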

8.3.3 Affine-Invariant Transformations of Ensemble Trajectories


Another approach used in the literature (e.g., emcee; Foreman-Mackey et al., 2013) is
the Affine-Invariant “stretch move” from Goodman & Weare (2010). This uses only one
additional particle Θki rather than two:

\Theta^j_{i+1} = \Theta^k_i + \gamma \times (\Theta^j_i - \Theta^k_i) \quad (95)


In place of the jitter term ε from DE-MCMC, the stretch move instead injects some amount of randomness by allowing γ to vary. By sampling γ from some probability distribution
g(γ), we allow the proposals to explore various “stretches” of the direction vector. As
shown in Goodman & Weare (2010), if this function is chosen such that

g(γ −1 ) = γ × g(γ) (96)

then this proposal is symmetric. Typically, g(γ) is chosen to be

g(\gamma|a) = \begin{cases} \gamma^{-1/2} & a^{-1} \le \gamma \le a \\ 0 & {\rm otherwise} \end{cases} \quad (97)

where a = 2 is often taken as a typical value. Note that when γ = 1, this move leaves
Θji+1 = Θji unchanged.
Compared to DE-MCMC, the stretch move appears to have one clear advantage: it doesn’t have any reliance on some “jitter” term ε that reintroduces scale-dependence into
the proposal. That makes the proposal invariant to affine transformations and only sensi-
tive to a single parameter a, which governs the range of scales the stretch factor γ is allowed
to explore.
This lack of jitter, however, is not substantially advantageous in practice. As noted in
§8.3.2, ε is really designed to avoid possible degeneracies due to the limited number of
available trajectories. In that case we had (m − 1)(m − 2)/2 ∼ m2 /2 possible trajectories;
here, however, we only have m (since Θji is always included). This is a much smaller num-
ber of possible trajectories at a given m, making this particular proposal more susceptible
to that particular effect.
In addition, because this proposal involves adjusting γ and therefore the length of
the trajectory itself, we need to consider how changing γ affects the total volume of the
sphere centered on Θji with radius Θki − Θji . As discussed in §7.2, the differential volume
increases as rd−1 . Therefore, increasing or decreasing γ substantially adjusts the differ-
ential volume in our proposal. This involves introducing a steep boost/penalty into our
transition probability, which now becomes:

T(\Theta^j_{i+1}|\Theta^j_i, \gamma) = \min\left[1, \gamma^{d-1}\,\frac{\mathcal{P}(\Theta^j_{i+1})}{\mathcal{P}(\Theta^j_i)}\right] \quad (98)

This heavily favors proposals with γ > 1 (outwards) and heavily disfavors proposals with
γ < 1 as d increases to account for the exponentially increasing volume at larger radii.
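Putting Equations 95–98 together, a single stretch-move update might look like the sketch below; γ is drawn from g(γ|a) via the usual inverse-transform trick, and the γ^{d−1} volume factor enters the acceptance in log space.

import numpy as np

def stretch_step(positions, log_post_unnorm, j, rng, a=2.0):
    m, d = positions.shape
    k = rng.choice([i for i in range(m) if i != j])   # one other chain
    # Draw gamma from g(gamma|a) ~ gamma^(-1/2) on [1/a, a].
    gamma = ((a - 1.0) * rng.uniform() + 1.0) ** 2 / a
    prop = positions[k] + gamma * (positions[j] - positions[k])
    log_t = ((d - 1) * np.log(gamma)
             + log_post_unnorm(prop) - log_post_unnorm(positions[j]))
    if np.log(rng.uniform()) <= log_t:
        positions[j] = prop
    return positions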
Finally, while this stretch move actually generates proposals in the right overall di-
rection, it is not efficient at generating samples within the bulk of the posterior mass as
the dimensionality increases. As discussed in §8.2, given the typical√position of Θji , the
typical length-scale of the proposed positions needs to shrink by ∝ 1/ d in order to guar-
antee our new sample remains within the bulk of the posterior mass. However, the form
for g(γ|a) specified above instead ensures that γ will always be between 1/a and a. Even
if we attempt to account for this effect by letting a(d) → 1 as d → ∞ in order to target
a constant acceptance fraction and ensure more overlap, the asymmetry of our proposal


Figure 15: Numerical results showcasing the performance of several ensemble MH


MCMC samplers on our toy problem, a d-dimensional Gaussian with mean µ = 0 and
standard deviation σ = 1 in every dimension. The top series of panels show snapshots
of a random parameter from the collection of chains (with a few chains highlighted) as a
function of dimensionality (increasing from left to right) assuming ensemble Gaussian proposals with γ = 2.5/√d (blue), 3-chain “shift and jitter” proposals with γ = 1.7/√d (red),
and 2-chain “stretch” proposals with γ drawn from the distribution g(γ|a) with a = 2 as
described in §8.3.3 (orange). The bottom panels show the corresponding acceptance frac-
tions (left), auto-correlation times (middle), and effective sample sizes (right) from our
chains (colored points) as a function of dimensionality. Approximations based on §8.2
are shown as light solid colored lines, with dashed lines showing rough fits. The first
two methods, which allow the size of the proposal to shrink, are able to propose samples
within the bulk of the posterior mass. The last method, which is unable to do so, instead
proposes exponentially fewer good positions as the dimensionality increases. See §8.3 for
additional details.


and the γ d−1 term in the transition probability systematically biases our proposed and
accepted positions compared with the ideal distribution. This subsequently leads to larger
auto-correlation times, mostly counteracting any expected gains.

Numerical Tests
To confirm these results, I sample from this d-dimensional Gaussian posterior (assuming
σ = 1 for simplicity) using each of these ensemble MH MCMC algorithms for n = 1500 iterations with m = 100 chains. In the first case, I propose a new position for chain j at iteration i using a Gaussian distribution with a covariance γ²C^j_i computed over the remaining ensemble of k ≠ j chains, where the scale factor γ = 2.5/√d is chosen to target a constant acceptance fraction of roughly 25%. In the second case, I propose new positions using the DE-MCMC algorithm with a scale factor of γ = 1.7/√d and additional Gaussian jitter with covariance C_ε = C^j_i/5 derived from the remaining chains in the ensemble, again targeting an acceptance fraction of roughly 25%. In the third case, I propose new positions using the affine-invariant stretch move assuming the typical form for g(γ|a) with a = 2.⁷
As shown in Figure 15, the chains behave as expected given our theoretical predic-
tions as a function of dimensionality. Similar to the adaptive Gaussian case, the first two
approaches continue sampling efficiently even as d increases. The affine-invariant stretch
move, however, experiences exponentially-decreasing efficiency and struggles to sample
the posterior effectively.

8.4 Additional Comments


Before concluding, I wish to emphasize that the toy problem explored in this section
should only be interpreted as a tool to build intuition surrounding how certain meth-
ods are expected to behave in a controlled environment. While the behavior as a function
of dimensionality helps to illustrate common issues, in practice the performance of any
method will depend on the specific problem, tuning parameters, the time spent on tuning,
and many other possible factors. Since it is always possible to find problems for which
any particular method will perform well or poorly, I encourage users to try out a variety
of approaches to find the ones that work best for their problems.

9 Conclusion
Bayesian statistical methods have become increasingly prevalent in modern scientific
analysis as models have become more complex. Exploring the inferences we can draw
from these models often requires the use of numerical techniques, the most popular of
which is known as Markov Chain Monte Carlo (MCMC).
In this article, I provide a conceptual introduction to MCMC that seeks to highlight
the what, why, and how of the overall approach. I first give an overview of Bayesian
⁷ Allowing a(d) to vary as a function of dimensionality to target a roughly constant acceptance fraction gives similar results.


inference and discuss what types of problems Bayesian inference generally is trying to
solve, showing that most quantities we are interested in computing require integrating
over the posterior density. I then outline approaches to computing these integrals us-
ing grid-based approaches, and illustrate how adaptively changing the resolution of the
grid naturally transitions into the use of Monte Carlo methods. I illustrate how different
sampling strategies affect the overall efficiency in order to motivate why we use MCMC
methods. I then discuss various details related to how MCMC methods work and exam-
ine their expected overall behavior based on simple arguments derived from how volume
and posterior density behave as the number of parameters increases. Finally, I highlight
the impact this conceptual understanding has in practice by comparing the performance
of various MCMC methods on a simple toy problem.
I hope that the material in this article, along with the exercises and applications, serve
as a useful resource that helps build up intuition for how MCMC and other Monte Carlo
methods work. This intuition should be helpful when making decisions over when to ap-
ply MCMC methods to your own problems over possible alternatives, developing novel
proposals and sampling strategies, and characterizing what issues you might expect to
encounter when doing so.

Acknowledgements
JSS is grateful to Rebecca Bleich for continuing to tolerate his (over-)enthusiasm for sam-
pling during their time together. He would also like to thank a number of people for
helping to provide much-needed feedback during earlier stages of this work, including
Catherine Zucker, Dom Pesce, Greg Green, Kaisey Mandel, Joel Leja, David Hogg, Theron
Carmichael, and Jane Huang. He would also like to thank Ana Bonaca, Charlie Conroy,
Ben Cook, Daniel Eisenstein, Doug Finkbeiner, Boryana Hadzhiyska, Will Handley, Locke
Patton, and Ioana Zelko for helpful conversations surrounding the material.
JSS also wishes to thank Kaisey Mandel and the Institute of Astronomy at the Uni-
versity of Cambridge, Hans-Walter Rix and the Galaxies and Cosmology Department at
the Max Planck Institute for Astronomy, and Renée Hložek and Bryan Gaensler and the
Dunlap Institute for Astronomy and Astrophysics at the University of Toronto for their
kindness and hospitality while hosting him over the period where a portion of this work
was being completed.
JSS acknowledges financial support from the National Science Foundation Graduate
Research Fellowship Program (Grant No. 1650114) and the Harvard Data Science Initia-
tive.

References
Asmussen, S., & Glynn, P. W. 2011, Statistics & Probability Letters, 81, 1482, doi: 10.1016/j.spl.2011.05.004

Blitzstein, J., & Hwang, J. 2014, Introduction to Probability, Chapman & Hall/CRC Texts in Statistical Science (CRC Press/Taylor & Francis Group). https://books.google.com/books?id=ZwSlMAEACAAJ

Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. 2011, Handbook of Markov Chain Monte
Carlo (CRC press)

Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. 2013, Pub. of the Astron. Soc. of the Pac., 125, 306, doi: 10.1086/670067

Gelman, A., Carlin, J., Stern, H., et al. 2013, Bayesian Data Analysis, Third Edition, Chapman & Hall/CRC Texts in Statistical Science (Taylor & Francis). https://books.google.com/books?id=ZXL6AQAAQBAJ

Gelman, A., & Meng, X.-L. 1998, Statist. Sci., 13, 163, doi: 10.1214/ss/1028905934

Gelman, A., & Rubin, D. B. 1992, Statistical Science, 7, 457, doi: 10.1214/ss/1177011136

Goodman, J., & Weare, J. 2010, Communications in Applied Mathematics and Computer
Science, 5, 65, doi: 10.2140/camcos.2010.5.65

Hastings, W. 1970, Biometrika, 57, 97, doi: 10.1093/biomet/57.1.97

Hogg, D. W., & Foreman-Mackey, D. 2018, The Astrophys. Journal Supp., 236, 11, doi: 10.3847/1538-4365/aab76e

Kish, L. 1965, Survey sampling, Wiley classics library (J. Wiley). https://books.google.com/books?id=xiZmAAAAIAAJ

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. 1953,
Journal of Chem. Phys., 21, 1087, doi: 10.1063/1.1699114

Neal, R. M. 2012, arXiv e-prints, arXiv:1206.1901. https://arxiv.org/abs/1206.1901

Skilling, J. 2004, in American Institute of Physics Conference Series, Vol. 735, American
Institute of Physics Conference Series, ed. R. Fischer, R. Preuss, & U. V. Toussaint, 395–
405

Skilling, J. 2006, Bayesian Anal., 1, 833, doi: 10.1214/06-BA127

Storn, R., & Price, K. 1997, Journal of global optimization, 11, 341

Ter Braak, C. J. 2006, Statistics and Computing, 16, 239

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. 2019, arXiv e-prints, arXiv:1903.08008. https://arxiv.org/abs/1903.08008
