Bayesian Inference and Computation A Beginner's Guide - Brewer
1.1 Introduction
Most scientific observations are not sufficient to give us definite answers to
all our questions. It is rare that we get a dataset which totally answers every
question with certainty. Even if that did happen, we would quickly move on
to other questions. What a dataset usually can do is make hypotheses more
or less plausible, even if we don’t achieve total certainty. Bayesian inference
is a model of this reasoning process, and also a tool we can use to make
quantitative statements about how much uncertainty we should have about
our conclusions. This takes the mystery out of data analysis, because we no
longer have to come up with a new method every time we face a new problem.
Instead, we simply specify exactly what information we are going to use, and
then compute the results. In the last two decades, Bayesian inference has
become immensely popular in many fields of science, and astrophysics is no
exception. Therefore it is becoming increasingly important for researchers
to have at least a basic understanding of these methods.
Accessible textbooks for those with a physics background are those by
Gregory (2005) and Sivia and Skilling (2006), and parts of the textbook
by Mackay (2003)†. The online tutorial by Jake Vanderplas is also useful‡.
For those with a strong statistics background, I recommend the books by
O’Hagan and Forster (2004) and Gelman et al (2013). I also maintain a
set of lecture notes for an undergraduate Bayesian statistics course§. The
aim of this chapter is to present a fairly minimal yet widely applicable set of tools.
† Some people claim that MCMC stands for Monte Carlo Markov Chains, but they are wrong.
1.2 Python
Due to its popularity and relatively shallow learning curve, I have imple-
mented the algorithms in this chapter in the Python language. The code
is written so that it works in either Python 2 or 3. The programs make
use of the common numerical library numpy, and also the plotting package
matplotlib. Any Python code snippets in this chapter will assume that the
following packages have been imported:
import numpy as np
import numpy.random as rng
import matplotlib.pyplot as plt
import copy
import scipy.special
(Winter School teaching material: http://www.iac.es/winterschool/2014/pages/teaching-material.php)
Fig. 1.1. An example prior distribution for a single parameter (blue) gets updated to
the posterior distribution (red) by the data. The data enters through the likelihood
function (cyan dotted line). Many traditional “best fit” methods are based on finding
the maximum likelihood estimate, which is the peak of the likelihood function, here
denoted by a star.
We can suppress the background information I (remove it from the right hand side of all equations) and just write:
p(θ|D) = p(θ) p(D|θ) / p(D).   (1.2)
The denominator p(D) is the marginal likelihood, given (with the background information I made explicit) by

p(D|I) = ∫ p(θ|I) p(D|θ, I) dθ.   (1.3)
Fig. 1.2. The “transit” dataset. The red curve shows the model prediction based on
the true values of the parameters, and the blue points are the noisy measurements.
Thus, the problem has been reduced from not knowing the true curve
µ(t) to not knowing the values of four quantities (parameters) A, b, tc, and
w. Applying Bayesian inference to this problem involves calculating the
posterior distribution for A, b, tc, and w, given the data D. For this specific
setup, Bayes' rule states:

p(A, b, tc, w|D) = p(A, b, tc, w) p(D|A, b, tc, w) / p(D).   (1.4)
So, in order for the posterior distribution to be well defined, we need to
choose a prior distribution p(A, b, tc, w) for the parameters, and a sampling
distribution p(D|A, b, tc, w) for the data. Since the denominator p(D) is not
a function of the parameters, it plays the role of a normalising constant
that ensures the posterior distribution integrates to 1, as any probability
distribution must.
p(D|A, b, tc, w) = ∏_{i=1}^{N} (1/(σ_i √(2π))) exp(−(D_i − µ(t_i))² / (2σ_i²))   (1.5)
This is a product of N terms, one for each data point, and is really a prob-
ability distribution over the N -dimensional space of possible datasets. We
have assumed that each data point is independent (given the parameters).
That is, if we knew the parameters and a subset of the data points, we would
use only the parameters (not the data points) to predict the remaining data
points.
When the dataset is known, Equation 1.5 becomes a function of the pa-
rameters only, known as the likelihood function. The curve predicted by
the model, here written as µ(ti ) (where I have suppressed the implicit de-
pendence on the parameters), provides the mean of the normal distribution.
Remember that the independence assumption is not an assumption about
the actual dataset, but an assumption about our prior information about
the dataset. It does not make sense to say a particular dataset is or is not
independent. Independence is a property of probability distributions.
Equation 1.5 is the sampling distribution (and the likelihood function) for
our problem, but it is fairly cumbersome to write down. Statisticians have
developed a shorthand notation for writing down probability distributions.
This is extremely useful for communicating your assumptions without having
to write down the entire probability density equation. To communicate
Equation 1.5, we can simply write:
D_i ∼ N(µ(t_i), σ_i²)   (1.6)
i.e. each data point has a normal distribution (denoted by N) with mean
µ(t_i) (which depends on the parameters) and standard deviation σ_i. For the
normal distribution, it is traditional to write the variance (standard devia-
tion squared) as the second argument, but since the standard deviation is
a more intuitive quantity (being in the same units as the mean), we often
literally write the standard deviation, squared (e.g. 3²). For other probabil-
ity distributions the arguments in the parentheses are whatever parameters
make sense for that family of distributions.
1.4.2 Priors
Now we need a prior for the unknown parameters A, b, tc , and w. This
is a probability distribution over a four dimensional parameter space. To
simplify things, we can assign independent priors for each parameter, and
multiply these together to produce the joint prior:
p(A, b, tc, w) = p(A) p(b) p(tc) p(w).   (1.7)
This prior distribution models our uncertainty about the parameters before
taking into account the data. The independence assumption implies that
if we were to learn the value of one of the parameters, this wouldn’t tell
us anything about the others. This may or may not be realistic in a given
application, but it is a useful starting point.
Another useful starting point for priors is the uniform distribution, which
has a constant probability density between some lower and upper limit. Let’s
use four uniform distributions for our priors:
A ∼ U(−100, 100)   (1.8)
b ∼ U(0, 10)   (1.9)
tc ∼ U(t_min, t_max)   (1.10)
w ∼ U(0, t_max − t_min)   (1.11)
The full expression for the joint prior probability density is:

p(A, b, tc, w) = 1/(2000(t_max − t_min)²) if (A, b, tc, w) ∈ S, and 0 otherwise,   (1.12)
where S is the set of allowed values. Even more simply, we can ignore the
normalising constant and the prior boundaries and just write:
p(A, b, tc , w) ∝ 1, (1.13)
although if we use this shortcut, we must remember that the boundaries are
implicit.
Now that we have specified our assumed prior information in the form of
a sampling distribution and a prior, we are ready to go. By Bayes’ rule, we
have an expression for the posterior distribution immediately:
p(A, b, tc, w|D) ∝ p(A, b, tc, w) p(D|A, b, tc, w)   (1.14)
which is proportional to the prior times the likelihood. In the likelihood
expression we would substitute the actual observed dataset into the equa-
tion, so that it is a function of the parameters only. The main problem with
using Bayes’ rule this way is that a mathematical expression for a probabil-
ity distribution in a four-dimensional space is not very easy to understand
intuitively. For this reason, we usually calculate summaries of the posterior
distribution. The main computational tool for doing this is Markov Chain
Monte Carlo.
In one dimension, this may not seem very useful. Evaluating a one di-
mensional integral analytically is often possible, and doing it numerically
using the trapezoidal rule (or a similar approximation) is quite straightfor-
ward. However, Monte Carlo really becomes useful in higher dimensional
problems. For example, consider a problem with five unknown quantities
with probability distribution p(a, b, c, d, e), and suppose we want to know
the probability that a is greater than b + c. We could do the integral

P(a > b + c) = ∫ p(a, b, c, d, e) 1(a > b + c) da db dc dd de,   (1.17)

or estimate it from Monte Carlo samples {(a_i, b_i, c_i, d_i, e_i)} drawn from the distribution:

P(a > b + c) ≈ (1/n) Σ_{i=1}^{n} 1(a_i > b_i + c_i),   (1.18)

which is just the fraction of the samples that satisfy the condition.
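In code, this estimate is a single line (a sketch, assuming a, b, and c are
numpy arrays holding the samples of each variable):

# Fraction of samples satisfying the condition
prob = np.mean(a > b + c)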
Another important use of Monte Carlo is marginalisation. Suppose again
we had a probability distribution for five variables, but we only cared about
one of them. For example, the marginal distribution of a is given by
p(a) = ∫ p(a, b, c, d, e) db dc dd de   (1.19)
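With samples, marginalisation is equally simple (a sketch, assuming the
joint samples are stored as rows of a hypothetical array joint_samples,
with a in column 0):

# Samples from the marginal distribution of a: ignore the other columns
a_samples = joint_samples[:, 0]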
Fig. 1.3. An example posterior distribution for two parameters a and b, taken from
my STATS 331 undergraduate lecture notes. The full joint distribution is shown
in the top left, and the marginal distribution for a (bottom left) is calculated by
integrating over all possible b values, a potentially non-trivial calculation. The
top right panel has points drawn from the joint posterior. Points drawn from the
marginal posterior (bottom right) are obtained by ignoring the b values of the points,
a trivial operation.
When a proposed move is rejected (and the particle remains in the same
place), it is important to count the particle’s position again in the output.
This is how the algorithm ends up spending more time in regions of high
probability: moves into those regions tend to be accepted, whereas moves
out of those regions are often rejected. The Metropolis algorithm is quite
straightforward to implement and I encourage you to attempt this yourself
if you haven’t done so before.
Python code implementing the Metropolis algorithm is given below (this
code has been stripped of “book-keeping” features for keeping track of the
output, and shows just the algorithm itself). There are several features of
note. Firstly, the functions used to measure the prior density and likelihood
of any point, and the function to generate a proposal in the first place, are
problem-specific and assumed to have been implemented elsewhere. Sec-
ondly, for numerical reasons we deal with the (natural) log of the prior
density, the likelihood, and the acceptance probability. Thirdly, note how
the log_prior and log_likelihood functions only need to be called once
per iteration, not twice as one might naively think. Finally, no q ratio is
required in the acceptance probability: we will assume that we are working
with a symmetric proposal distribution.
# Number of iterations (value assumed for illustration)
steps = 10000

# Generate a starting point (if you have a good guess, use it)
# In the full version of the code, the initial point is drawn
# from the prior.
params = np.array([1., 1., 1., 1.])
logp, logl = log_prior(params), log_likelihood(params)

# Main loop
for i in range(0, steps):
    # Generate proposal and measure its prior density and likelihood
    new = proposal(params)
    logp_new, logl_new = log_prior(new), log_likelihood(new)

    # Log of the acceptance probability
    log_alpha = (logl_new - logl) + (logp_new - logp)
    if log_alpha > 0.:
        log_alpha = 0.

    # Accept?
    if rng.rand() <= np.exp(log_alpha):
        params = new
        logp = logp_new
        logl = logl_new
A simple proposal adds a random perturbation of some width L to the current position theta:

# Generate a proposal
proposal = theta + L*rng.randn()

If L is too large, almost all moves will be rejected, so you'll end up stuck
in one place; if L is too small, the accepted moves will barely go anywhere. Some authors
recommend using preliminary runs to find an optimal width. Instead, I
recommend that you use a mixture of widths. Basically, every time we
make a proposal, the width is drawn from some range, rather than being
constant. The biggest possible width we would ever want should be roughly
the order of magnitude of the width of the prior (since the posterior is
usually narrower than the prior). It’s rare that we would need proposals
many orders of magnitude smaller than that. My default suggestion is to
randomise the logarithm of the step size, as in this code snippet:
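A sketch consistent with the widths quoted below, where theta is the current
value of a single parameter:

# Draw the step size from a heavy-tailed distribution spanning
# six orders of magnitude, then perturb the parameter
step_size = 10.0**(1.5 - 6.0*rng.rand())
proposal = theta + step_size*rng.randn()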
With this proposal, the minimum width is 10^−4.5 ≈ 3.16 × 10^−5, and the
maximum width is 10^1.5 ≈ 31.6. The effective proposal distribution is now
very heavy-tailed. As long as a good width is somewhere within our range,
things should be okay. Remember, this is not an optimal suggestion, but
a fail-safe conservative one. It is possible to spend time (doing preliminary
runs) to eventually save time (by having a more efficient final run), but my
personal preference is usually not to do this.
When there are multiple parameters (almost always, since MCMC isn’t
necessary on single parameter problems!), we need to decide how to construct
the proposal. There are two main ways to do this. The first is that the
proposal is to change all of the parameters simultaneously. This tends to
be inefficient in high dimensions, because the only proposals that are likely
to be accepted are those that change the parameter values only slightly: in
a high dimensional space, there are many bad directions to travel, and not
very many good ones. Usually, it’s better to propose to change a subset of
the parameters, or even just a single parameter. A Python function that
takes a numpy array of parameters as input and returns a proposed value
for the parameters is specified below.
def proposal(params):
    """
    Generate new values for the parameters.
    The proposal for the Metropolis algorithm.
    """
    # Copy the parameters
    new = copy.deepcopy(params)

    # Choose one parameter to change, at random
    which = rng.randint(num_params)

    # Perturb it, with a step size drawn from a heavy-tailed distribution
    new[which] += jump_sizes[which]*10.0**(1.5 - 6.0*rng.rand())*rng.randn()
    return new
This function relies on num_params being the number of parameters, and an
array jump_sizes, of length num_params, which specifies the prior width for
each parameter.
For numerical reasons, instead of writing a function to evaluate the like-
lihood (and the prior density) at a particular point in parameter space, we
usually deal with the logarithms of these quantities. For the transit example,
the log_prior function can be implemented like so:
def log_prior(params):
    """
    Evaluate the (log of the) prior distribution
    """
    A, b, tc, width = params[0], params[1], params[2], params[3]
    # Return -infinity (zero density) outside the bounds of the uniform
    # priors; t_min and t_max are assumed to be known from the data
    if A < -100. or A > 100. or b < 0. or b > 10.\
            or tc < t_min or tc > t_max\
            or width < 0. or width > (t_max - t_min):
        return -np.inf
    return 0.
Since we chose uniform priors, the prior density is some constant if the
parameters (here passed to the function as a numpy array of four floating
point values) are within the bounds of the uniform priors. Otherwise, the
prior density is zero. We do not need to know the normalising constant of
the prior density: we returned zero for the log-density, but the progress of
the Metropolis algorithm would be the same if we had returned any other
finite value, since the Metropolis algorithm only ever uses ratios of densities
(i.e. differences in log-density).
The log_likelihood function is given below. This implements the logarithm
of Equation 1.5. As with the log_prior function, we could ignore the
normalisation constants – in this case any term that is not a function of the
parameters. However, I have included them for completeness.
def log_likelihood(params):
    """
    Evaluate the (log of the) likelihood function
    """
    # Rename the parameters
    A, b, tc, width = params[0], params[1], params[2], params[3]
    # The model curve mu(t): a boxcar transit shape is assumed here
    # (level A outside the transit, shifted by b inside it), with the
    # times, measurements and error bars in columns 0-2 of data
    mu = A*np.ones(N)
    mu[np.abs(data[:,0] - tc) < 0.5*width] += b
    # Normal/gaussian distribution
    return -0.5*N*np.log(2.*np.pi) - np.sum(np.log(data[:,2])) \
           -0.5*np.sum((data[:,1] - mu)**2/data[:,2]**2)
The result is shown in Figure 1.4. When everything is working well, a trace
plot should look like white noise when zoomed out. If this is the case, then
things are probably working well (although you can never be 100% certain
of this – like optimisation methods, MCMC methods can get stuck in local
maxima). In addition, if your MCMC run was initialised at a point in pa-
rameter space that is an “outlier” with respect to the posterior distribution,
the first part of the run will be a transient feature where the MCMC chain
(hopefully) moves towards the important regions of the space. This tran-
sient period is called the burn-in, and should usually be excluded from the
sample, otherwise your Monte Carlo summaries will give too much impor-
tance to the part of the space the MCMC happened to go through during
the burn-in period.
Fig. 1.4. A “trace plot”. At the beginning, because of the initial conditions of the
algorithm, the results start in a region of parameter space that has very low proba-
bility. This initial phase is often called the “burn-in”, and should be excluded from
any subsequent calculations.
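As a sketch (assuming the run was saved row-wise in a hypothetical array
all_samples), the burn-in can be discarded before computing any summaries:

# Drop the first 25% of the run as burn-in (the fraction is a
# judgement call, usually made by inspecting the trace plot)
keep = all_samples[all_samples.shape[0]//4:, :]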
Fig. 1.5. The marginal posterior distribution for the parameter A constructed from
MCMC output.
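A histogram like Figure 1.5 can be produced directly from the kept samples
(a sketch; A is assumed to be in column 0 of the keep array above):

plt.hist(keep[:,0], 20)
plt.xlabel('A')
plt.ylabel('Number of samples')
plt.show()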
Fig. 1.6. The joint posterior distribution for A and b constructed from the MCMC
output. This is simply a scatterplot. Some authors prefer to apply some smoothing
and approximate density contours.
Typical summaries include point estimates (single numbers, used in statements
such as “we estimate θ = 0.43”) and intervals (e.g. the probability that
θ ∈ [0.3, 0.5] is 68%).
Thankfully, most of these summaries are trivial to calculate from Monte
Carlo samples, such as those obtained from MCMC. The most popular point
estimate is the posterior mean, which is approximated by the arithmetic
mean of the samples. The posterior median can also be easily approximated
by the arithmetic median of the samples.
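# keep holds the posterior samples (burn-in removed), one row per
# iteration; column 0 is assumed to be the parameter of interest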
post_mean = np.mean(keep[:,0])
post_median = np.median(keep[:,0])
There are some theoretical arguments, based on decision theory, that pro-
vide guidance about which point estimate is better under which circum-
stances. To compute a credible interval (the Bayesian version of a confidence
interval), you find quantiles of the distribution. For example, if we want to
find an interval that contains 68% of the posterior probability, the lower
end of the interval is the parameter value for which 16% of the samples are
lower. Similarly, the upper end of the interval is the value for which 16% of
the samples are higher (i.e. 84% are lower). The simplest way to implement
this calculation is by sorting the samples, as shown below.
sorted_samples = np.sort(samples)
# Left and right end of interval
left = sorted_samples[int(0.16*len(sorted_samples))]
right = sorted_samples[int(0.84*len(sorted_samples))]
However, this has an unfortunate feature: it implies that the prior proba-
bility of M being greater than 10^14 is 0.9, and the probability of M being
greater than 10^12 is 0.999, which seems overly confident, when we were trying
to describe ignorance! In astronomy, we are often in the situation of having
to “put our error bars in the exponent”. A prior that has this property is
the “log-uniform” distribution (named by analogy with the lognormal dis-
tribution, and sometimes incorrectly called a Jeffreys prior), which assigns
a uniform distribution to the logarithm of the parameter. If we replace the
uniform prior by

log10(M) ∼ U(5, 15)

then the prior probabilities are more moderate: P(M > 10^14) = 0.1, P(M >
10^12) = 0.3, and so on. The log-uniform prior is appropriate for positive
parameters whose uncertainty spans multiple orders of magnitude. The
probability density, in terms of M, is proportional to 1/M.
There are two main ways to implement the log-uniform prior in the
Metropolis algorithm. One is to keep M as a parameter, and take the non-
uniform prior into account in the acceptance probability (i.e. implement the
1/M prior in your log_prior function). The other is to treat ℓ = log(M)
as the parameter, in which case the prior is still uniform. You'll just need
to compute M from ℓ before you can use it in the likelihood function. The
second approach (parameterising by ℓ instead) is generally a better idea.
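A sketch of both options (the bounds 10^5 and 10^15 match the example
above; the function names are hypothetical stand-ins for the problem's
log_prior):

# Option 1: keep M as the parameter, with prior density proportional to 1/M
def log_prior_M(M):
    if M < 1e5 or M > 1e15:
        return -np.inf
    return -np.log(M)

# Option 2: use ell = log(M) as the parameter; its prior is uniform
def log_prior_ell(ell):
    if ell < np.log(1e5) or ell > np.log(1e15):
        return -np.inf
    return 0.
# Inside the likelihood, recover M = np.exp(ell)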
Let’s now discuss some consequences of the sampling distribution, for
which we used a normal distribution with known standard deviation for each
data point, and asserted that the measurements were independent. In many
applications (such as discrete-valued photon count data) other distributions
such as the Poisson may be more appropriate. However, even for real-valued
“model plus noise” situations the normal distribution has some consequences
which may be undesirable. For example, there is a high probability that the
noise vector (i.e. all of the actual differences between the true curve and
the data points) looks macroscopically like white noise, with little correlation
between the data points. To see this, try generating simulated datasets from
the sampling distribution for a particular setting of the parameters, and you
will see that almost all datasets that you generate have this property. If this
is not realistic, correlated noise models are possible (e.g. using gaussian
processes), but we will not discuss these here.
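Such a simulation is a one-liner (a sketch, assuming t and sigma hold the
measurement times and error bars, and mu_true is the model curve at the
chosen parameter values):

# Simulate a dataset from the sampling distribution and plot it
simulated = mu_true + sigma*rng.randn(N)
plt.errorbar(t, simulated, yerr=sigma, fmt='.')
plt.show()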
Fig. 1.7. (Figure showing the probability density of the Student-t distribution as a
function of x, for ν = 1, ν = 3 and ν = 50.)
Fig. 1.8. The prior for K, the amount by which we should scale the error bars given
with the dataset. There is a 50% probability that K = 1, and a 50% probability that
K > 1. Given that K > 1 the prior distribution is exponential (with unit scale
length), so we do not expect K to be greater than 1 by an order of magnitude.
def log_likelihood(params):
    """
    Evaluate the (log of the) likelihood function
    """
    # Rename the parameters
    A, b, tc, width, log_nu, u_K = params[0], params[1], params[2],\
                                   params[3], params[4], params[5]
    # The model curve mu(t), as in the earlier version, and the shape
    # parameter of the Student-t distribution
    mu = A*np.ones(N)
    mu[np.abs(data[:,0] - tc) < 0.5*width] += b
    nu = np.exp(log_nu)
    # The error bar inflation factor K: equal to 1 with probability 1/2,
    # otherwise 1 plus an Exponential(1) amount (u_K has a uniform prior;
    # this transformation gives the prior for K shown in Figure 1.8)
    if u_K < 0.:
        K = 1.
    else:
        K = 1. - np.log(1. - u_K)
    sig = K*data[:,2]
    # Student t distribution
    return N*scipy.special.gammaln(0.5*(nu + 1.))\
        - N*scipy.special.gammaln(0.5*nu)\
        - np.sum(np.log(sig*np.sqrt(np.pi*nu)))\
        - 0.5*(nu + 1.)*np.sum(\
          np.log(1. + (data[:,1] - mu)**2/nu/sig**2))

† If a variable has probability density f(x) then the cumulative distribution function is F(x) = ∫_{−∞}^{x} f(t) dt.
The result is 92%, i.e. given this data and these assumptions, we should be
quite confident that the error bars are of the appropriate size. This should
come as no surprise, since I in fact generated the data using a normal distribution,
and the error bars in the data file were the same as the standard deviation
of the normal distribution I used.
Fig. 1.9. Marginal posterior distributions for log(ν), which describes the shape of
the Student-t distribution for the noise, and uK , which determines the error bar
inflation factor K.
Marginal likelihoods are integrals over the parameter space, for example
Z1 = ∫ p(θ1|M1) p(D|θ1, M1) dθ1. They are expected values, but with respect
to the prior distribution rather than the posterior. Therefore, we cannot use
standard MCMC methods (at least in any simple way) to calculate the marginal
likelihood†.
A Monte Carlo approach that samples from the prior distribution will also
fail in most cases. The likelihood function p(D|θ1 , M1 ) in Equation 1.27 is
usually sharply peaked in a very small region of parameter space. The
integral will be dominated by the high values of the likelihood in the tiny
region, yet a Monte Carlo approach based on sampling the prior will almost
certainly not give any samples in the important region.
† There is a method, called the “harmonic mean estimator”, for estimating the marginal likeli-
hood from posterior samples. Bayesian statistician Radford Neal has described it as the “worst
Monte Carlo method ever”.
Fig. 1.10. A simple one dimensional parameter estimation problem where the prior
is uniform between 0 and 1, and the decreasing curve shown is the likelihood function
(which has the same shape as the posterior). The marginal likelihood is the integral
of the likelihood function, and we could calculate this numerically if we had a few
points along the curve. Nested Sampling takes a high dimensional space and uses
it to compute a curve like this whose integral is the marginal likelihood Z.
We know this point's likelihood (we must have measured it in order to
identify the worst point!), and can put it on a graph like the one in Figure 1.10.
To obtain more points, we generate a new point to replace the one just
identified as the worst. This point is drawn from the prior, but with the
restriction that its likelihood must exceed Lworst . In terms of X, the new
point’s location has a uniform distribution between 0 and exp(−1/N ): just
the same as the other N − 1 points. If we find the worst point now, we’ll
be doing the same thing but instead of looking at X values between 0 and
1 we are looking at X values between 0 and exp(−1/N ). Therefore our
estimate of the X value of the worst particle in the second iteration is
exp(−1/N ) × exp(−1/N ) = exp(−2/N ). Since we know the likelihood, we
can add another point to our graph and continue. The NS algorithm is
summarised below.
(i) Generate N particles {θ1 , θ2 , ..., θN } from the prior, and calculate
their likelihoods. Initialise a loop counter i at 1.
(ii) Find the particle with the lowest likelihood (call this likelihood Lworst ).
Estimate its X value as exp(−i/N ). Save its properties (Xi , Li ), and
its corresponding parameter values as well.
(iii) Generate a new particle to replace the one found in step (ii). The
new particle should be drawn from the prior distribution, but its
likelihood must be greater than Lworst .
(iv) Repeat steps (ii) and (iii) until enough iterations have been per-
formed.
In step (ii), we assume that there is a unique “worst” particle with the
lowest likelihood. If you are working on a problem where likelihood ties
are possible, you need to add an extra parameter to your model whose sole
purpose is breaking ties in step (ii). See Murray (2007) for more details.
Step (iii) also hides a lot of complexity, and is the key point distinguishing
different implementations of Nested Sampling from each other. We have to
be able to generate a point from a restricted version of the prior, proportional
to π(θ)1 (L(θ) > Lworst ). Naive “rejection sampling”, i.e. generating from
the prior until you get a point that satisfies the likelihood constraint, will
not work because the region satisfying the constraint may be very small in
volume. This volume is, in fact, exactly what X measures, and X decreases
exponentially during NS. A simple and popular approach, which we will
use, is to copy one of the surviving particles (which has a likelihood above
Lworst by construction) and evolve it using MCMC as though we were trying
to sample the prior, but rejecting any proposed move that would take the
likelihood below Lworst . The main disadvantage of this approach is that the
36 Brendon J. Brewer
initial diversity of the N particles can become depleted quite quickly, leading
to problems if there are multiple likelihood peaks.
In most problems, the region of the parameter space with high likelihood is
quite small. Therefore, in practice, most of the marginal likelihood integral
will be dominated by a very small range to the left of the plot in Figure 1.10.
For this reason, logarithmic axes are more useful (Figure 1.11). In terms of
the plots in Figure 1.11, NS steps towards the left, obtaining (X, L) pairs
each time the worst point is found. In terms of log(X) the steps are of equal
size (1/N ). Therefore, to cover a certain distance in terms of log(X), the
number of iterations you need is proportional to N . As you might expect,
higher N leads to more accurate results but takes more CPU time.
Once the algorithm has been running for a while, the marginal likelihood
can be obtained by numerically approximating the integral:
Z = ∫_0^1 L(X) dX.   (1.31)
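A sketch of this numerical approximation, assuming the saved log-likelihoods
L_i are held (in order) in an array logl_saved:

# Estimated X values and the prior mass between successive points
X = np.exp(-np.arange(1, steps + 1)/float(N))
widths = np.hstack([1. - X[0], X[:-1] - X[1:]])
# log(Z), computed stably
top = logl_saved.max()
logZ = top + np.log(np.sum(widths*np.exp(logl_saved - top)))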
Fig. 1.11. The same as Figure 1.10, but with logarithmic axes, since the important
parts of parameter space tend to occupy a very small fraction of the prior volume.
The black curve is the likelihood function, and the uniform prior for X corresponds
to the dotted red exponential prior for log(X). The posterior (proportional to prior
times likelihood) usually has a bell-shaped peak, but can be more complex.
# Number of particles
N = 5

# Number of NS iterations
steps = 5*30

# (particles, logp and logl are assumed to have been initialised by
# drawing N particles from the prior, as in step (i))
# Main NS loop
for i in range(0, steps):
    # Find worst particle
    worst = np.argmin(logl)

    # Save its estimated X value, likelihood and parameters
    # (book-keeping code omitted)

    # Likelihood threshold that the replacement must exceed
    threshold = logl[worst]

    # Copy survivor
    if N > 1:
        which = rng.randint(N)
        while which == worst:
            which = rng.randint(N)
        particles[worst] = copy.deepcopy(particles[which])
        logp[worst], logl[worst] = logp[which], logl[which]

    # Evolve the copy with Metropolis targeting the constrained prior
    # (mcmc_steps, the number of moves per NS iteration, assumed defined)
    for j in range(0, mcmc_steps):
        new = proposal(particles[worst])
        logp_new, logl_new = log_prior(new), log_likelihood(new)
        loga = logp_new - logp[worst]
        if loga > 0.:
            loga = 0.

        # Accept
        if logl_new >= threshold and rng.rand() <= np.exp(loga):
            particles[worst] = new
            logp[worst] = logp_new
            logl[worst] = logl_new
Fig. 1.12. Numerical versions of the plots in Figure 1.11, produced using Nested
Sampling on the transit problem with the original priors. For an NS run to be
satisfactorily completed, the posterior weights in the lower panel must peak and
decline, as seen here. There is little point continuing the run further unless you
suspect another peak may appear. This can happen in “phase change” problems
(Skilling, 2006).
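The posterior weights in the lower panel come from the same saved quantities
(a sketch, continuing from the log(Z) code above):

# Posterior weight of each saved point: prior mass times likelihood,
# normalised to sum to one
logw = np.log(widths) + logl_saved
post_weights = np.exp(logw - logw.max())
post_weights /= post_weights.sum()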
With 30 particles, I got log(Z1) = −163.0 ± 0.7 for the original model and
log(Z2) = −165.8 ± 0.8 for the modified model, so this conclusion seems
robust.
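From these values, the Bayes Factor in favour of the original model is
straightforward to compute (a sketch using the point estimates only,
ignoring the quoted uncertainties):

logZ1, logZ2 = -163.0, -165.8
print(np.exp(logZ1 - logZ2))    # about 16, favouring the original model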
If the Bayes Factor (ratio of marginal likelihoods) favours Model 1 by a
factor of 10, that doesn’t necessarily imply it is 10 times more plausible. We
must always remember the prior odds factor in Equation 1.26. It may be
that we completely agree with the assumptions of Model 2, in which case
the marginal likelihoods are irrelevant. Or we may consider the union of the
two models to be sensible, but 50/50 prior probabilities to be inappropri-
ate. Marginal likelihoods are not everything, but they are still important
and worth calculating. As always, if a result is surprising, or depends on
assumptions in a way you didn’t expect, treat it as a learning opportunity.
P(X = x) = (10 choose x) (1/2)^x (1/2)^(10−x)
         = (10 choose x) (1/2)^10   (1.34)
The equation gives us all the probabilities we want, i.e. P (X = 0), P (X =
1), and so on. The lower case x is a dummy variable: like the index in a
sum, it can be replaced by any other symbol and the equation still holds.
For example, replacing x with a gives:
P(X = a) = (10 choose a) (1/2)^10   (1.35)
which has exactly the same meaning as Equation 1.34.
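As a quick numerical check of this distribution (a sketch; scipy.special.binom
supplies the binomial coefficient):

x = np.arange(0, 11)
p = scipy.special.binom(10, x)*0.5**10
print(p.sum())    # 1.0, as the probabilities must sum to one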
The compact notation is an alternative to the left hand sides of Equa-
tions 1.34 and 1.35.
p(x) = (10 choose x) (1/2)^10   (1.36)
Since X, the actual number of heads, isn’t written anywhere, we forget it
exists and just use the lower case x for that purpose, even though it was
originally a dummy variable! The key to understanding this notation is to
read p(x) as “the probability distribution for x”, and to understand what the
expression following it really means. If we have an expression involving p(x)
and p(y), the two ps may not be the same function! If the set of possible
x values is continuous (so p(x) gives the probability density instead of a
probability itself) then the concise notation doesn’t change. Whether you’re
MultiNest (Feroz, Hobson, & Bridges, 2009) is one of the most popu-
lar implementations of Nested Sampling. Unlike the version presented in
this chapter, it does not use MCMC. It is designed to handle potentially mul-
timodal posterior distributions in low (≲ 30) dimensions. It is particularly
useful in situations where the likelihood function is expensive to evaluate.
JAGS (Plummer, 2003): The main advantage of JAGS is that it uses the
BUGS language, a neat way of specifying your modelling assumptions using
a language similar to the “∼” notation. This makes JAGS very suitable
for quickly implementing analyses of small to medium complexity without
having to worry too much about MCMC algorithms themselves. JAGS can
also be used easily from R (a programming language for statistics) through
the rjags package, and as such is very popular among statisticians. For
a very gentle introduction to JAGS you can consult my lecture notes at
www.github.com/eggplantbren/STATS331.
1.15 Acknowledgements
I would like to thank the organisers of the Winter School for their invitation,
generosity and hospitality, and the students and other lecturers for many
interesting discussions. I would also like to apologise to Earth’s atmosphere
and future inhabitants for my antipodal journey to Tenerife.
The ‘Astrostatistics’ Facebook group was very helpful with a MultiNest
query, as was Thomas Lumley (Auckland) who helped me understand the
finer mathematical points of the Nested Sampling “mapping” trick. I am
grateful to all the friends I have met through this subject, from whom I
(hope I) have learned so much. Finally I would like to thank my wife Lianne
for her support and understanding.
Bibliography
Brewer B. J., Pártay L. B., Csányi G., 2011, Statistics and Computing, 21,
4, 649-656. arXiv:0912.2380
Feroz F., Hobson M. P., Bridges M., 2009, MNRAS, 398, 1601
Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. 2012, emcee:
The MCMC Hammer, arXiv:1202.3665
Hoffman, M. D., Gelman, A., 2014, The No-U-Turn Sampler: Adaptively
Setting Path Lengths in Hamiltonian Monte Carlo, Journal of Machine
Learning Research, 15, 1593-1623.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin.
Bayesian data analysis, third edition. Chapman & Hall/CRC, 2013.
Goodman, J., Weare, J., 2010, Ensemble Samplers with Affine Invariance,
Comm. App. Math. Comp. Sci., 5, 6.
Gregory, Phil. Bayesian Logical Data Analysis for the Physical Sciences: A
Comparative Approach with Mathematica Support. Cambridge Univer-
sity Press, 2005.
Hansmann, U. H. E., 1997, Parallel Tempering Algorithm for Conforma-
tional Studies of Biological Molecules, Chemical Physics Letters, 281,
140-150.
Hogg, D. W., Bovy, J., Lang, D. 2010. Data analysis recipes: Fitting a model
to data. ArXiv e-prints arXiv:1008.4686.
Mackay, D. J. C. 2003, Information theory, Inference, and Learning Algo-
rithms, Cambridge University Press
Murray, Iain, 2007, “Advances in Markov chain Monte Carlo methods.”,
PhD thesis.
Neal, Radford M., 2011. “MCMC using Hamiltonian dynamics.” Handbook
of Markov Chain Monte Carlo 2.
Plummer, M., 2003, JAGS: A Program for Analysis of Bayesian Graphi-
cal Models Using Gibbs Sampling, Proceedings of the 3rd International
Workshop on Distributed Statistical Computing (DSC 2003), March 20-
22, Vienna, Austria. ISSN 1609-395X.
O’Hagan, A., Forster, J., 2004, Bayesian inference. London: Arnold.
Sivia, D. S., Skilling, J., 2006, Data Analysis: A Bayesian Tutorial, 2nd
Edition, Oxford University Press
Skilling, J., 2006, Nested Sampling for General Bayesian Computation,
Bayesian Analysis, 1, 833-859.
Stan Development Team, 2014, Stan: A C++ Library for Probability and
Sampling, Version 2.5.0 http://mc-stan.org/
Vousden, W., Farr, W. M., Mandel, I. 2015. Dynamic temperature selection
for parallel-tempering in Markov chain Monte Carlo simulations. ArXiv
e-prints arXiv:1501.05823.